Accessing NCBI Entrez Databases with E-Utilities

This example shows how to programmatically search and retrieve data from NCBI's Entrez databases using NCBI's Entrez Utilities (E-Utilities).

Using NCBI E-Utilities to Retrieve Biological Data

E-Utilities (eUtils) are server-side programs (e.g. ESearch, ESummary, EFetch) developed and maintained by NCBI for searching and retrieving data from most Entrez databases. You access these tools through URLs that follow a strict syntax: a specific base URL, a call to the eUtil's script, and the script's associated parameters. For more details on eUtils, see E-Utilities Help.
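
The three-part URL syntax described above can be sketched as follows. This is only an illustration: the variable names are hypothetical, and the parameter values are placeholders chosen for the sketch, not part of the analysis in this example.

```matlab
% Sketch of the eUtils URL anatomy: base URL + script + parameters.
% All variable names and parameter values here are illustrative assumptions.
eutilsBase = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
scriptName = 'esearch.fcgi';

% Parameters as name/value pairs, one pair per row
paramPairs = {'db',   'pubmed'; ...
              'term', 'influenza'};

% Join each pair as name=value, then join the pairs with '&'
pairStrings = strcat(paramPairs(:,1), '=', paramPairs(:,2));
sketchURL = [eutilsBase, scriptName, '?', strjoin(pairStrings.', '&')];
disp(sketchURL)
```

Keeping the parameters in a cell array makes it easy to add or drop a parameter without editing a long string by hand.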

Searching Nucleotide Database with ESearch

In this example, we consider the genes sequenced from the H5N1 virus isolated in 1997 from a chicken in Hong Kong as the starting point for our analysis. This particular virus jumped from chickens to humans, killing six people before the spread of the disease was brought under control by destroying all poultry in Hong Kong [1]. You can use ESearch to find the sequence data needed for the analysis. ESearch requires a database (db) and a search term (term) as input. Optionally, you can ask ESearch to store your search results on the NCBI history server through the usehistory parameter.

baseURL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
eutil = 'esearch.fcgi?';
dbParam = 'db=nuccore';
termParam = '&term=A/chicken/Hong+Kong/915/97+OR+A/chicken/Hong+Kong/915/1997';
usehistoryParam = '&usehistory=y';
esearchURL = [baseURL, eutil, dbParam, termParam, usehistoryParam]

esearchURL =

    'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&term=A/chicken/Hong+Kong/915/97+OR+A/chicken/Hong+Kong/915/1997&usehistory=y'

The term parameter can be any valid Entrez query. Note that there cannot be any spaces in the URL, so parameters are separated by '&' and any spaces in a query term need to be replaced with '+' (e.g. 'Hong+Kong').
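
As a minimal sketch of the space-to-plus substitution, you can start from a human-readable query and let strrep do the replacement. The variable names rawTerm and encodedTermParam are hypothetical and are not used elsewhere in this example.

```matlab
% Human-readable Entrez query containing spaces
rawTerm = 'A/chicken/Hong Kong/915/97 OR A/chicken/Hong Kong/915/1997';

% Replace every space with '+' so the query can be embedded in the URL
encodedTermParam = ['&term=', strrep(rawTerm, ' ', '+')];
disp(encodedTermParam)
```

This reproduces the termParam value used above. The sketch handles spaces only; queries containing other reserved URL characters might need additional encoding.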

You can use webread to send the URL and return the results from ESearch as a character array.

wbo = weboptions('Timeout', 15); % allow 15 seconds before timeout
searchReport = webread(esearchURL,wbo)

searchReport =

    '<?xml version="1.0" encoding="UTF-8" ?>
     <!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
     <eSearchResult><Count>8</Count><RetMax>8</RetMax><RetStart>0</RetStart><QueryKey>1</QueryKey><WebEnv>MCID_677df81f3ff10ce8bc0edc67</WebEnv><IdList>
     <Id>6048875</Id>
     <Id>6048849</Id>
     <Id>6048770</Id>
     <Id>6048802</Id>
     <Id>6048927</Id>
     <Id>6048903</Id>
     <Id>6048829</Id>
     <Id>3421265</Id>
     </IdList><TranslationSet/><TranslationStack>   <TermSet>    <Term>A/chicken/Hong[All Fields]</Term>    <Field>All Fields</Field>    <Count>1099</Count>    <Explode>N</Explode>   </TermSet>   <TermSet>    <Term>Kong/915/97[All Fields]</Term>    <Field>All Fields</Field>    <Count>7</Count>    <Explode>N</Explode>   </TermSet>   <OP>AND</OP>   <OP>GROUP</OP>   <TermSet>    <Term>A/chicken/Hong[All Fields]</Term>    <Field>All Fields</Field>    <Count>1099</Count>    <Explode>N</Explode>   </TermSet>   <TermSet>    <Term>Kong[All Fields]</Term>    <Field>All Fields</Field>    <Count>7076573</Count>    <Explode>N</Explode>   </TermSet>   <OP>AND</OP>   <TermSet>    <Term>915[All Fields]</Term>    <Field>All Fields</Field>    <Count>529194</Count>    <Explode>N</Explode>   </TermSet>   <OP>AND</OP>   <TermSet>    <Term>1997[All Fields]</Term>    <Field>All Fields</Field>    <Count>1693531</Count>    <Explode>N</Explode>   </TermSet>   <OP>AND</OP>   <OP>GROUP</OP>   <OP>OR</OP>  </TranslationStack><QueryTranslation>(A/chicken/Hong[All Fields] AND Kong/915/97[All Fields]) OR (A/chicken/Hong[All Fields] AND Kong[All Fields] AND 915[All Fields] AND 1997[All Fields])</QueryTranslation></eSearchResult>
     '

ESearch returns the search results in XML. The report contains information about the query performed, the database searched, and the UIDs (unique IDs) of the records that match the query. If you use the history server, the report contains two additional IDs, WebEnv and query_key (the QueryKey element in the XML), for accessing the results. WebEnv is the location of the results on the server, and query_key is a number indexing the queries performed. Because WebEnv and query_key are query dependent, they change every time the search is executed. Either the UIDs or the WebEnv and query_key pair can be parsed out of the XML report and then passed to other eUtils. You can use regexp to do the parsing and store the tokens in a structure with the field names WebEnv and QueryKey.

ncbi = regexp(searchReport,...
    '<QueryKey>(?<QueryKey>\w+)</QueryKey>\s*<WebEnv>(?<WebEnv>\S+)</WebEnv>',...
    'names')

ncbi = 

  struct with fields:

    QueryKey: '1'
      WebEnv: 'MCID_677df81f3ff10ce8bc0edc67'
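
If you prefer to work with the UIDs instead of the history server, the same regexp approach applies. The following is a minimal sketch that runs on an abbreviated stand-in for the report (sampleReport is a hypothetical variable holding two of the UIDs shown above); with a live search, you would pass searchReport instead.

```matlab
% Abbreviated stand-in for an ESearch report (assumption: same element
% layout as the report shown above, cut down to two UIDs)
sampleReport = ['<eSearchResult><Count>2</Count><RetMax>2</RetMax><IdList>', newline, ...
                '<Id>6048875</Id>', newline, ...
                '<Id>6048849</Id>', newline, ...
                '</IdList></eSearchResult>'];

% Each match yields a one-element cell; concatenating flattens them
uidTokens = regexp(sampleReport, '<Id>(\d+)</Id>', 'tokens');
uidList = [uidTokens{:}];

% The first <Count> element holds the total number of hits
countToken = regexp(sampleReport, '<Count>(\d+)</Count>', 'tokens', 'once');
numHits = str2double(countToken{1});

fprintf('%d hits, %d UIDs returned\n', numHits, numel(uidList));
```

The 'once' option matters on a full report, because the TranslationStack section contains further Count elements after the first one. Count can also exceed the number of Id elements when more records match than RetMax allows.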

Finding Links Between Databases with ELink

It might be useful to have the PubMed articles related to these gene records. ELink provides this functionality: it finds associations between records within a database or between databases. You can give ELink the query_key and WebEnv IDs from above and tell it to find records in the PubMed database (db parameter) associated with your records from the Nucleotide (nuccore) database (dbfrom parameter). ELink returns an XML report with the UIDs of the records in PubMed. These UIDs can be parsed out of the report and passed to other eUtils (e.g. ESummary). Use the stylesheet created for viewing ESummary reports to view the results of ELink.

elinkReport = webread([baseURL...
    'elink.fcgi?dbfrom=nuccore&db=pubmed&WebEnv=', ncbi.WebEnv,...
    '&query_key=',ncbi.QueryKey],wbo);

Extract the PubMed UIDs from the ELink report.

pubmedIDs = regexp(elinkReport,'<Link>\s+<Id>(\w*)</Id>\s+</Link>','tokens');
NumberOfArticles = numel(pubmedIDs)

% Put PubMed UIDs into a string that can be read by EPost URL.

pubmed_str = [];
for ii = 1:NumberOfArticles
    pubmed_str = sprintf([pubmed_str '%s,'],char(pubmedIDs{ii}));
end

NumberOfArticles =

     2
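
As a possible alternative to the loop, strjoin can build the comma-separated list in one step, and the list can then be passed directly to another eUtil through the id parameter. The sketch below uses made-up placeholder UIDs (not real search results) and hypothetical variable names, and it only builds an ESummary URL without sending it.

```matlab
% Placeholder UIDs standing in for the 'tokens' output of regexp, which is
% a cell array of one-element cells. These values are made up.
placeholderIDs = {{'11111111'}, {'22222222'}};

% Flatten to a plain cell array of character vectors, then join with commas
flatIDs = [placeholderIDs{:}];
idList = strjoin(flatIDs, ',');

% Build (but do not send) an ESummary request for these PubMed UIDs
esummaryURL = ['https://eutils.ncbi.nlm.nih.gov/entrez/eutils/', ...
               'esummary.fcgi?db=pubmed&id=', idList];
disp(esummaryURL)
```

Unlike the loop above, strjoin leaves no trailing comma, so the result should be used as is, without trimming the last character.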

Posting UIDs to NCBI History Server with EPost

You can use EPost to post UIDs to the history server. It returns an XML report with the query_key and WebEnv IDs that point to the location of the posted UIDs on the history server. Again, these can be parsed out of the report and used in other eUtils calls.

epostReport = webread([baseURL 'epost.fcgi?db=pubmed&id=',pubmed_str(1:end-1)]);
epostKeys = regexp(epostReport,...
    '<QueryKey>(?<QueryKey>\w+)</QueryKey>\s*<WebEnv>(?<WebEnv>\S+)</WebEnv>','names')

epostKeys = 

  struct with fields:

    QueryKey: '1'
      WebEnv: 'MCID_677df8307a237112a60bbf0e'
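
Once the UIDs are on the history server, other eUtils can refer to them by query_key and WebEnv instead of an explicit id list. The following sketch builds an EFetch URL that would request the abstracts as plain text. The historyKeys structure is a hypothetical copy of the values printed above, and the rettype and retmode choices are assumptions made for this illustration; a real WebEnv value changes with every call and eventually expires.

```matlab
% Stand-in for the epostKeys structure returned above (values copied from
% the printed output; a live session would return different ones)
historyKeys = struct('QueryKey', '1', ...
                     'WebEnv',   'MCID_677df8307a237112a60bbf0e');

% Build (but do not send) an EFetch request that reads the posted UIDs
% from the history server and asks for abstracts as plain text
efetchURL = ['https://eutils.ncbi.nlm.nih.gov/entrez/eutils/', ...
             'efetch.fcgi?db=pubmed', ...
             '&query_key=', historyKeys.QueryKey, ...
             '&WebEnv=', historyKeys.WebEnv, ...
             '&rettype=abstract&retmode=text'];
disp(efetchURL)
```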

Using ELink to Find Associated Files Within the Same Database

ELink can do "within" database searches. For example, you can query for a nucleotide sequence within Nucleotide (nuccore) database to find similar sequences, essentially performing a BLAST search. For "within" database searches, ELink returns an XML report containing the related records, along with a score ranking its relationship to the query record. From the above PubMed search, you might be interested in finding all articles related to those articles in PubMed. This is easy to do with ELink. To do a "within" database search, set db and dbfrom to PubMed. You can use the query_key and WebEnv from the EPost call.

pm2pmReport = webread([baseURL...
    'elink.fcgi?dbfrom=pubmed&db=pubmed&query_key=',epostKeys.QueryKey,...
    '&WebEnv=',epostKeys.WebEnv]);
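% Keep each UID once (in order of first appearance) so that the count and
% the list built below refer to the same set of articles.
pubmedIDs = unique(regexp(pm2pmReport,'(?<=<Id>)\w*(?=</Id>)','match'),'stable');
NumberOfArticles = numel(pubmedIDs)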

pubmed_str = [];
for ii = 1:NumberOfArticles
    pubmed_str = sprintf([pubmed_str '%s,'],char(pubmedIDs{ii}));
end

NumberOfArticles =

   403
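
To make use of the relationship scores mentioned above, you could parse each Id together with its Score and rank the related records. The sketch below rests on two assumptions: that the report was requested in a form that includes scores (for example with cmd=neighbor_score), and that each Link element lists an Id followed by a Score. The sample XML fragment and all variable names are made up for illustration.

```matlab
% Made-up fragment imitating scored ELink output (assumed layout)
sampleLinks = ['<LinkSetDb>', ...
               '<Link><Id>222</Id><Score>450</Score></Link>', ...
               '<Link><Id>111</Id><Score>900</Score></Link>', ...
               '</LinkSetDb>'];

% Named tokens return a struct array with one element per Link
links = regexp(sampleLinks, ...
    '<Link>\s*<Id>(?<Id>\d+)</Id>\s*<Score>(?<Score>\d+)</Score>\s*</Link>', ...
    'names');

% Convert scores to numbers and sort from most to least related
scores = str2double({links.Score});
[sortedScores, order] = sort(scores, 'descend');
rankedIDs = {links(order).Id};

for k = 1:numel(rankedIDs)
    fprintf('%s\t%d\n', rankedIDs{k}, sortedScores(k));
end
```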

References

[1] Cristianini, N. and Hahn, M.W. "Introduction to Computational Genomics: A Case Studies Approach", Cambridge University Press, 2007.