Entrez (http://www.ncbi.nlm.nih.gov/Entrez) is a data retrieval system that provides users access to NCBI’s databases such as PubMed, GenBank, GEO, and many others. You can access Entrez from a web browser to manually enter queries, or you can use Biopython’s Bio.Entrez module for programmatic access to Entrez. The latter allows you, for example, to search PubMed or download GenBank records from within a Python script.
The Bio.Entrez module makes use of the Entrez Programming Utilities, consisting of eight tools that are described in detail on NCBI’s page at http://www.ncbi.nlm.nih.gov/entrez/utils/. Each of these tools corresponds to one Python function in the Bio.Entrez module, as described in the sections below. This module makes sure that the correct URL is used for the queries, and that not more than one request is made every three seconds, as required by NCBI.
The output returned by the Entrez Programming Utilities is typically in XML format. To parse such output, you have several options:

1. Use Bio.Entrez’s parser to parse the XML output into a Python object;
2. Use the DOM (Document Object Model) parser in Python’s standard library;
3. Use the SAX (Simple API for XML) parser in Python’s standard library;
4. Read the XML output as raw text, and parse it by string searching or regular expressions.

For the DOM and SAX parsers, see the Python documentation. The parser in Bio.Entrez is discussed below.
For sequence databases, the Entrez Programming Utilities can also generate output in other formats (such as the Fasta and GenBank file formats). This can then be parsed into a SeqRecord using Bio.SeqIO (see Chapter 4, and the example below).
Before using Biopython to access the NCBI’s online resources (via Bio.Entrez or some of the other modules), please read the NCBI’s Entrez User Requirements. If the NCBI finds you are abusing their systems, they can and will ban your access!
To paraphrase: make no more than one request every three seconds (Bio.Entrez enforces this for you), and supply your email address with your queries so that the NCBI can contact you if there is a problem. For large queries, the NCBI also recommend using their session history feature (the WebEnv session cookie string); this is only slightly more complicated, and is covered in Section 7.10.3.
In conclusion, be sensible with your usage levels. If you plan to download lots of data, consider other options. For example, if you want easy access to all the human genes, consider fetching each chromosome by FTP as a GenBank file, and importing these into your own BioSQL database (see Section 9.5).
EInfo provides field index term counts, last update, and available links for each of NCBI’s databases. In addition, you can use EInfo to obtain a list of all database names accessible through the Entrez utilities:
>>> from Bio import Entrez
>>> handle = Entrez.einfo(email="A.N.Other@example.com")
>>> result = handle.read()
The variable result now contains a list of databases in XML format:
>>> print result
<?xml version="1.0"?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD eInfoResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eInfo_020511.dtd">
<eInfoResult>
<DbList>
        <DbName>pubmed</DbName>
        <DbName>protein</DbName>
        <DbName>nucleotide</DbName>
        <DbName>nuccore</DbName>
        <DbName>nucgss</DbName>
        <DbName>nucest</DbName>
        <DbName>structure</DbName>
        <DbName>genome</DbName>
        <DbName>books</DbName>
        <DbName>cancerchromosomes</DbName>
        <DbName>cdd</DbName>
        <DbName>gap</DbName>
        <DbName>domains</DbName>
        <DbName>gene</DbName>
        <DbName>genomeprj</DbName>
        <DbName>gensat</DbName>
        <DbName>geo</DbName>
        <DbName>gds</DbName>
        <DbName>homologene</DbName>
        <DbName>journals</DbName>
        <DbName>mesh</DbName>
        <DbName>ncbisearch</DbName>
        <DbName>nlmcatalog</DbName>
        <DbName>omia</DbName>
        <DbName>omim</DbName>
        <DbName>pmc</DbName>
        <DbName>popset</DbName>
        <DbName>probe</DbName>
        <DbName>proteinclusters</DbName>
        <DbName>pcassay</DbName>
        <DbName>pccompound</DbName>
        <DbName>pcsubstance</DbName>
        <DbName>snp</DbName>
        <DbName>taxonomy</DbName>
        <DbName>toolkit</DbName>
        <DbName>unigene</DbName>
        <DbName>unists</DbName>
</DbList>
</eInfoResult>
Since this is a fairly simple XML file, we could extract the information it contains simply by string searching. Using Bio.Entrez’s parser instead, we can directly parse this XML file into a Python object:
>>> from Bio import Entrez
>>> handle = Entrez.einfo(email="A.N.Other@example.com")
>>> record = Entrez.read(handle)
Now record is a dictionary with exactly one key:
>>> record.keys()
[u'DbList']
The value stored under this key is the list of database names shown in the XML above:
>>> record["DbList"]
['pubmed', 'protein', 'nucleotide', 'nuccore', 'nucgss', 'nucest',
 'structure', 'genome', 'books', 'cancerchromosomes', 'cdd', 'gap',
 'domains', 'gene', 'genomeprj', 'gensat', 'geo', 'gds', 'homologene',
 'journals', 'mesh', 'ncbisearch', 'nlmcatalog', 'omia', 'omim', 'pmc',
 'popset', 'probe', 'proteinclusters', 'pcassay', 'pccompound',
 'pcsubstance', 'snp', 'taxonomy', 'toolkit', 'unigene', 'unists']
For each of these databases, we can use EInfo again to obtain more information:
>>> handle = Entrez.einfo(db="pubmed", email="A.N.Other@example.com")
>>> record = Entrez.read(handle)
>>> record["DbInfo"]["Description"]
'PubMed bibliographic record'
>>> record["DbInfo"]["Count"]
'17989604'
>>> record["DbInfo"]["LastUpdate"]
'2008/05/24 06:45'
Try record["DbInfo"].keys() for other information stored in this record.
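One of those entries, "FieldList", describes the search fields you can use with this database. The following sketch prints a few of them; the example field names and descriptions mentioned in the comments are illustrative only, since what EInfo returns changes as the NCBI update their databases.

from Bio import Entrez

# Run EInfo for PubMed and list a few of its searchable fields.
handle = Entrez.einfo(db="pubmed", email="A.N.Other@example.com")
record = Entrez.read(handle)
handle.close()
for field in record["DbInfo"]["FieldList"][:3]:
    # Each field is a dictionary with keys such as "Name", "FullName" and
    # "Description" (for example "ALL", "All Fields", and a short explanation).
    print "%(Name)s, %(FullName)s, %(Description)s" % field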
To search any of these databases, we use Bio.Entrez.esearch(). For example, let’s search in PubMed for publications related to Biopython:
>>> from Bio import Entrez
>>> handle = Entrez.esearch(db="pubmed", term="biopython", email="A.N.Other@example.com")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['16403221', '16377612', '14871861', '14630660', '12230038']
In this output, you see five PubMed IDs (16403221, 16377612, 14871861, 14630660, 12230038), which can be retrieved by EFetch (see section 7.6).
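As a quick preview of EFetch, here is a sketch that downloads the first of these articles in the plain-text Medline format and parses it with Bio.Medline; the rettype and retmode values and the "TI" (title) and "AU" (authors) keys follow the usual Medline conventions, and the output is not shown since it depends on the record.

from Bio import Entrez, Medline

# Fetch one PubMed record as Medline-formatted text and parse it.
handle = Entrez.efetch(db="pubmed", id="16403221", rettype="medline",
                       retmode="text", email="A.N.Other@example.com")
for record in Medline.parse(handle):
    print record["TI"]   # article title
    print record["AU"]   # list of author names
handle.close()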
You can also use ESearch to search GenBank. Here we’ll do a quick search for the rpl16 gene in Opuntia:
>>> handle = Entrez.esearch(db="nucleotide", term="Opuntia and rpl16", email="A.N.Other@example.com")
>>> record = Entrez.read(handle)
>>> record["Count"]
'9'
>>> record["IdList"]
['57240072', '57240071', '6273287', '6273291', '6273290', '6273289', '6273286', '6273285', '6273284']
Each of the IDs (57240072, 57240071, 6273287...) is a GenBank identifier. See section 7.6 for information on how to actually download these GenBank records.
As a final example, let’s get a list of computational journal titles:
>>> handle = Entrez.esearch(db="journals", term="computational", email="A.N.Other@example.com")
>>> record = Entrez.read(handle)
>>> record["Count"]
'16'
>>> record["IdList"]
['30367', '33843', '33823', '32989', '33190', '33009', '31986', '34502',
 '8799', '22857', '32675', '20258', '33859', '32534', '32357', '32249']
Again, we could use EFetch to obtain more information for each of these journal IDs.
ESearch has many useful options — see the ESearch help page for more information.
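For example, you can restrict a search term to a particular field and cap the number of IDs returned with the retmax argument. The sketch below repeats the Opuntia search using the [ORGN] (organism) and [GENE] field tags; treat the tags as an assumption to be checked against the field list reported by EInfo, and expect the IDs returned to change as GenBank grows.

from Bio import Entrez

# Field-restricted search, limited to at most five IDs.
handle = Entrez.esearch(db="nucleotide", term="Opuntia[ORGN] AND rpl16[GENE]",
                        retmax=5, email="A.N.Other@example.com")
record = Entrez.read(handle)
handle.close()
print record["Count"]    # total number of matching records
print record["IdList"]   # at most five of their identifiers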
EPost posts a list of UIs for use in subsequent search strategies; see the EPost help page for more information. It is available from Biopython through Bio.Entrez.epost().
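As a sketch of how this works, you can post a list of IDs (here, the PubMed IDs from the Biopython search above) and get back a WebEnv/QueryKey pair referring to them, which later ESummary or EFetch calls can use instead of the raw IDs; see Section 7.10.3 for more on the history feature.

from Bio import Entrez

# Post three PubMed IDs and keep the history reference they are stored under.
id_list = ["16403221", "16377612", "14871861"]
handle = Entrez.epost(db="pubmed", id=",".join(id_list),
                      email="A.N.Other@example.com")
record = Entrez.read(handle)
handle.close()
webenv = record["WebEnv"]       # session cookie string
query_key = record["QueryKey"]  # refers to this posted list of IDs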
ESummary retrieves document summaries from a list of primary IDs (see the ESummary help page for more information). In Biopython, ESummary is available as Bio.Entrez.esummary(). Using the search result above, we can for example find out more about the journal with ID 30367:
>>> from Bio import Entrez
>>> handle = Entrez.esummary(db="journals", id="30367", email="A.N.Other@example.com")
>>> record = Entrez.read(handle)
>>> record[0]["Id"]
'30367'
>>> record[0]["Title"]
'Computational biology and chemistry'
>>> record[0]["Publisher"]
'Pergamon,'
EFetch is what you use when you want to retrieve a full record from Entrez.
For the Opuntia example above, we can download GenBank record 57240072 using Bio.Entrez.efetch:
>>> handle = Entrez.efetch(db="nucleotide", id="57240072", rettype="genbank", email="A.N.Other@example.com")
>>> print handle.read()
LOCUS       AY851612                 892 bp    DNA     linear   PLN 10-APR-2007
DEFINITION  Opuntia subulata rpl16 gene, intron; chloroplast.
ACCESSION   AY851612
VERSION     AY851612.1  GI:57240072
KEYWORDS    .
SOURCE      chloroplast Austrocylindropuntia subulata
  ORGANISM  Austrocylindropuntia subulata
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
            Caryophyllales; Cactaceae; Opuntioideae; Austrocylindropuntia.
REFERENCE   1  (bases 1 to 892)
  AUTHORS   Butterworth,C.A. and Wallace,R.S.
  TITLE     Molecular Phylogenetics of the Leafy Cactus Genus Pereskia
            (Cactaceae)
  JOURNAL   Syst. Bot. 30 (4), 800-808 (2005)
REFERENCE   2  (bases 1 to 892)
  AUTHORS   Butterworth,C.A. and Wallace,R.S.
  TITLE     Direct Submission
  JOURNAL   Submitted (10-DEC-2004) Desert Botanical Garden, 1201 North Galvin
            Parkway, Phoenix, AZ 85008, USA
FEATURES             Location/Qualifiers
     source          1..892
                     /organism="Austrocylindropuntia subulata"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:106982"
     gene            <1..>892
                     /gene="rpl16"
     intron          <1..>892
                     /gene="rpl16"
ORIGIN
        1 cattaaagaa gggggatgcg gataaatgga aaggcgaaag aaagaaaaaa atgaatctaa
       61 atgatatacg attccactat gtaaggtctt tgaatcatat cataaaagac aatgtaataa
      121 agcatgaata cagattcaca cataattatc tgatatgaat ctattcatag aaaaaagaaa
      181 aaagtaagag cctccggcca ataaagacta agagggttgg ctcaagaaca aagttcatta
      241 agagctccat tgtagaattc agacctaatc attaatcaag aagcgatggg aacgatgtaa
      301 tccatgaata cagaagattc aattgaaaaa gatcctaatg atcattggga aggatggcgg
      361 aacgaaccag agaccaattc atctattctg aaaagtgata aactaatcct ataaaactaa
      421 aatagatatt gaaagagtaa atattcgccc gcgaaaattc cttttttatt aaattgctca
      481 tattttattt tagcaatgca atctaataaa atatatctat acaaaaaaat atagacaaac
      541 tatatatata taatatattt caaatttcct tatataccca aatataaaaa tatctaataa
      601 attagatgaa tatcaaagaa tctattgatt tagtgtatta ttaaatgtat atcttaattc
      661 aatattatta ttctattcat ttttattcat tttcaaattt ataatatatt aatctatata
      721 ttaatttata attctattct aattcgaatt caatttttaa atattcatat tcaattaaaa
      781 ttgaaatttt ttcattcgcg aggagccgga tgagaagaaa ctctcatgtc cggttctgta
      841 gtagagatgg aattaagaaa aaaccatcaa ctataacccc aagagaacca ga
//
The argument rettype="genbank" lets us download this record in the GenBank format. Alternatively, you could for example use rettype="fasta" to get the record in the Fasta format; see the EFetch Help page for other options. The available formats depend on which database you are downloading from.
If you fetch the record in one of the formats accepted by Bio.SeqIO (see Chapter 4), you can directly parse it into a SeqRecord:
>>> from Bio import Entrez, SeqIO
>>> handle = Entrez.efetch(db="nucleotide", id="57240072", rettype="genbank", email="A.N.Other@example.com")
>>> record = SeqIO.read(handle, "genbank")
>>> print record
ID: AY851612.1
Name: AY851612
Description: Opuntia subulata rpl16 gene, intron; chloroplast.
/sequence_version=1
/source=chloroplast Austrocylindropuntia subulata
...
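A more typical pattern, sketched below with an arbitrary local filename, is to save the downloaded data to a file first, so that you do not have to hit NCBI’s servers again every time you re-run your script, and then parse that file with Bio.SeqIO:

from Bio import Entrez, SeqIO

# Download the GenBank record once and save it locally (filename is arbitrary).
net_handle = Entrez.efetch(db="nucleotide", id="57240072", rettype="genbank",
                           email="A.N.Other@example.com")
out_handle = open("AY851612.gbk", "w")
out_handle.write(net_handle.read())
out_handle.close()
net_handle.close()

# Later runs can parse the local copy instead of re-downloading it.
record = SeqIO.read(open("AY851612.gbk"), "genbank")
print record.id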
By default you get the output in XML format, which you can parse using the Bio.Entrez.read() function:
>>> from Bio import Entrez
>>> handle = Entrez.efetch(db="nucleotide", id="57240072", email="A.N.Other@example.com")
>>> record = Entrez.read(handle)
>>> record[0]["GBSeq_definition"]
'Opuntia subulata rpl16 gene, intron; chloroplast'
>>> record[0]["GBSeq_source"]
'chloroplast Austrocylindropuntia subulata'
...
For help on ELink, see the ELink help page. ELink is available from Biopython through Bio.Entrez.elink().
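ELink is useful for finding records related to a given record. As a sketch (reusing one of the Biopython PubMed IDs found earlier; the key names follow the ELink XML output, and the link names and counts returned will depend on the current state of the databases), here is how you might list what a PubMed record links out to:

from Bio import Entrez

# Ask ELink what is linked to one PubMed record.
handle = Entrez.elink(dbfrom="pubmed", id="16403221",
                      email="A.N.Other@example.com")
record = Entrez.read(handle)
handle.close()

# Each LinkSetDb groups the links into one target database.
for linksetdb in record[0]["LinkSetDb"]:
    print linksetdb["DbTo"], linksetdb["LinkName"], len(linksetdb["Link"])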
EGQuery provides counts for a search term in each of the Entrez databases. This is particularly useful to find out how many items a search will return before actually performing the search with ESearch (see the example in 7.10.1 below).
In this example, we use Bio.Entrez.egquery() to obtain the counts for “Biopython”:
>>> handle = Entrez.egquery(term="biopython", email="A.N.Other@example.com")
>>> record = Entrez.read(handle)
>>> record["eGQueryResult"][0]["DbName"]
'pubmed'
>>> record["eGQueryResult"][0]["Count"]
'5'
See the EGQuery help page for more information.
ESpell retrieves spelling suggestions. In this example, we use Bio.Entrez.espell() to obtain the correct spelling of Biopython:
>>> from Bio import Entrez
>>> handle = Entrez.espell(term="biopythooon", email="A.N.Other@example.com")
>>> record = Entrez.read(handle)
>>> record["Query"]
'biopythooon'
>>> record["CorrectedQuery"]
'biopython'
See the ESpell help page for more information.
Here we’ll show a simple example of performing a remote Entrez query. In section 2.3 of the parsing examples, we talked about using NCBI’s Entrez website to search the NCBI nucleotide databases for info on Cypripedioideae, our friends the lady slipper orchids. Now, we’ll look at how to automate that process using a Python script. In this example, we’ll just show how to connect, get the results, and parse them, with the Entrez module doing all of the work.
First, we use EGQuery to find out the number of results we will get before actually downloading them:
>>> from Bio import Entrez
>>> handle = Entrez.egquery(term='Cypripedioideae', email="A.N.Other@example.com")
>>> record = Entrez.read(handle)
>>> for row in record['eGQueryResult']:
...     if row['DbName']=='nuccore':
...         print row['Count']
814
So, we expect to find 814 Entrez Nucleotide records. (If you find some ridiculously high number of hits, you may want to reconsider whether you really want to download all of them.) Downloading the list of record IDs is our next step:
>>> from Bio import Entrez
>>> handle = Entrez.esearch(db='nucleotide', term='Cypripedioideae', retmax=814, email="A.N.Other@example.com")
>>> record = Entrez.read(handle)
Here, record is a Python dictionary containing the search results and some auxiliary information. Just for information, let’s look at what is stored in this dictionary:
>>> print record.keys()
[u'Count', u'RetMax', u'IdList', u'TranslationSet', u'RetStart', u'QueryTranslation']
First, let’s check how many results were found:
>>> print record['Count']
814
which is the number we expected. The 814 results are stored in record['IdList']:
>>> print len(record['IdList'])
814
Let’s look at the first five results:
>>> print record['IdList'][:5]
['187237168', '187372713', '187372690', '187372688', '187372686']
We can download these records using efetch. While you could download them one by one, to reduce the load on NCBI’s servers it is better to fetch a batch of records at the same time, as shown below. (However, in this situation you should ideally be using the history feature described later in Section 7.10.3.)
>>> idlist = ",".join(record['IdList'][:5])
>>> print idlist
187237168,187372713,187372690,187372688,187372686
>>> handle = Entrez.efetch(db='nucleotide', id=idlist, retmode='xml', email="A.N.Other@example.com")
>>> records = Entrez.read(handle)
>>> print len(records)
5
Each of these records corresponds to one GenBank record.
>>> print records[0].keys()
[u'GBSeq_moltype', u'GBSeq_source', u'GBSeq_sequence',
 u'GBSeq_primary-accession', u'GBSeq_definition', u'GBSeq_accession-version',
 u'GBSeq_topology', u'GBSeq_length', u'GBSeq_feature-table',
 u'GBSeq_create-date', u'GBSeq_other-seqids', u'GBSeq_division',
 u'GBSeq_taxonomy', u'GBSeq_references', u'GBSeq_update-date',
 u'GBSeq_organism', u'GBSeq_locus', u'GBSeq_strandedness']
>>> print records[0]['GBSeq_primary-accession']
DQ110336
>>> print records[0]['GBSeq_other-seqids']
['gb|DQ110336.1|', 'gi|187237168']
>>> print records[0]['GBSeq_definition']
Cypripedium calceolus voucher Davis 03-03 A maturase (matR) gene, partial cds; mitochondrial
>>> print records[0]['GBSeq_organism']
Cypripedium calceolus
You could use this to quickly set up searches – but for heavy usage, see Section 7.10.3.
Staying with the same organism, let’s now find its lineage. First, we search the Taxonomy database for Cypripedioideae. We find exactly one accession number:
>>> handle = Entrez.esearch(db="Taxonomy", term="Cypripedioideae", email="A.N.Other@example.com")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['158330']
>>> record["IdList"][0]
'158330'
Now, we use efetch to download this entry in the Taxonomy database and to parse it:
>>> handle = Entrez.efetch(db="Taxonomy", id="158330", retmode='xml', email="A.N.Other@example.com")
>>> records = Entrez.read(handle)
Again, this record stores lots of information:
>>> records[0].keys()
[u'Lineage', u'Division', u'ParentTaxId', u'PubDate', u'LineageEx',
 u'CreateDate', u'TaxId', u'Rank', u'GeneticCode', u'ScientificName',
 u'MitoGeneticCode', u'UpdateDate']
We can get the lineage directly from this record:
>>> records[0]['Lineage']
'cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta; Liliopsida; Asparagales; Orchidaceae'
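The 'LineageEx' entry holds the same lineage in a more structured form. As a sketch (the key names follow the Taxonomy XML output, and the entries printed will reflect the current NCBI taxonomy), you can walk it to see each ancestor’s taxonomy ID and rank:

from Bio import Entrez

# Fetch the Taxonomy record for Cypripedioideae and walk its lineage.
handle = Entrez.efetch(db="Taxonomy", id="158330", retmode="xml",
                       email="A.N.Other@example.com")
records = Entrez.read(handle)
handle.close()

# Each entry in 'LineageEx' is a dictionary describing one ancestor taxon.
for taxon in records[0]["LineageEx"]:
    print taxon["TaxId"], taxon["Rank"], taxon["ScientificName"]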
Often you will want to make a series of linked queries: most typically running a search, perhaps refining the search, and then retrieving detailed search results. You can do this by making a series of separate calls to Entrez. However, the NCBI prefer you to take advantage of their history support.
For example, suppose we want to search and download all the Orchid rpl16 nucleotide sequences, and store them in a FASTA file. We could naively combine the example code for Bio.Entrez.esearch() (Section 7.3) to get a list of GI numbers, and then repeatedly call Bio.Entrez.efetch() (Section 7.6) to download them all. You could reduce the number of queries by asking for the records in batches (see Section 7.10.1). That would probably be better, but is still not what the NCBI encourage.
The approved approach is to run the search with the history feature. Then, we can fetch the results by reference to the search results, which the NCBI can anticipate and cache.
from Bio import Entrez
search_handle = Entrez.esearch(db="nucleotide", term="Opuntia and rpl16",
                               usehistory="y", email="history.user@example.com")
search_results = Entrez.read(search_handle)
search_handle.close()
gi_list = search_results["IdList"]
count = int(search_results["Count"])
assert count == len(gi_list)
session_cookie = search_results["WebEnv"]
query_key = search_results["QueryKey"]
Because we asked to use the history feature, the XML search results include WebEnv and QueryKey values (in addition to the GI numbers of the sequences found) which refer to this search on the NCBI’s servers. Having stored these values in the variables session_cookie and query_key, we can use them as parameters to Bio.Entrez.efetch() instead of giving the GI numbers as identifiers.
While for small searches you might be OK downloading everything at once, it is better to download in batches. You use the retstart and retmax parameters to specify which range of search results you want returned (the starting entry, using zero-based counting, and the maximum number of results to return). For example:
batch_size = 3
out_handle = open("orchid_rpl16.fasta", "w")
for start in range(0, count, batch_size):
    end = min(count, start+batch_size)
    print "Going to download record %i to %i" % (start+1, end)
    fetch_handle = Entrez.efetch(db="nucleotide", rettype="fasta",
                                 retstart=start, retmax=batch_size,
                                 webenv=session_cookie, query_key=query_key,
                                 email="history.user@example.com")
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
out_handle.close()
And finally, don’t forget to include your own email address in the Entrez calls.