A data retrieval workflow using NCBI E-Utils + Python. John Pinney. Tech talk, Tue 12th Nov.


1 A data retrieval workflow using NCBI E-Utils + Python
John Pinney
Tech talk, Tue 12th Nov

2 Task
- Produce a data set given particular constraints.
- Allow easy revision/updates as needed.
- Output some kind of report for a biologist.

3 (One possible) solution
A number of DBs/tools now accept queries via RESTful* interfaces, in principle allowing
- up-to-date data set retrieval
- fully online analysis workflows
*REST = Representational State Transfer: a client/server architecture that ensures stateless communication, usually implemented via HTTP requests.

4 Bioinformatics REST services
- NCBI E-utils: PubMed, other DBs, BLAST
- EBI web services: various
- UniProt: protein sequences
- KEGG: metabolic network data
- OMIM: human genetic disorders
+ many others (see e.g. biocatalogue.org for a registry)

5 E-Utils services ESummary EFetch ESearch ELink all available through

6 Basic URL API
e.g. retrieve IDs of all human genes:
    + esearch.fcgi?retmode=xml&db=gene&term=9606[TAXID]
where
    esearch              which EUtil
    retmode=xml          output format
    db=gene              which DB
    term=9606[TAXID]     query term
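As a minimal sketch of how the pieces above fit together, the following builds the same kind of query URL from its components. It is written for Python 3 (the talk itself uses Python 2), the helper name build_eutils_url is hypothetical, and the base URL is the one that appears in the esummary example later in the talk; NCBI may now expect https.

```python
from urllib.parse import urlencode

# Base URL as written in the esummary example later in this talk (assumption:
# it is the same for every E-Util; NCBI may now prefer https).
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_eutils_url(tool, **params):
    """Hypothetical helper: join base URL, E-Util name and query parameters."""
    # urlencode escapes special characters such as the square brackets in [TAXID].
    return EUTILS_BASE + tool + ".fcgi?" + urlencode(params)


# Same query as on the slide: IDs of all human genes (taxonomy ID 9606).
url = build_eutils_url("esearch", retmode="xml", db="gene", term="9606[TAXID]")
print(url)
```

Keeping the E-Util name and the parameters as separate arguments means the same helper can serve esearch, esummary, efetch and elink without string surgery.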

7 My tasks
1. Produce a list of human genes that are associated with at least one resolved structure in PDB AND at least one genetic disorder in OMIM
2. Make an online table to display them
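The AND in task 1 can be read as a set intersection: genes that have a linked structure, intersected with genes that have a linked disorder. The sketch below only illustrates that reading. The function name is hypothetical and the IDs are made-up placeholders, not real query results, and the talk does not state that the final workflow was implemented this way.

```python
def genes_meeting_both(genes_with_structure, genes_with_disorder):
    """Return the gene IDs present in both collections, sorted for stable output."""
    return sorted(set(genes_with_structure) & set(genes_with_disorder))


# Made-up placeholder IDs, standing in for the results of two separate queries.
with_structure = ["101", "102", "103"]
with_disorder = ["103", "104", "101"]

shared = genes_meeting_both(with_structure, with_disorder)
print(shared)
```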


9 Easy: Python requests using PyCogent
PyCogent is a Python bioinformatics module that includes convenience methods for interaction with a number of online resources.
    from cogent.db.ncbi import *
    ef = EFetch(id=' ', rettype='fasta')
    protein = ef.read()

10 Bit more typing but still easy: Python requests using urllib2
For services that are not available through PyCogent, you can construct your own URLs using urllib2.
    import urllib2
    url = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?retmode=xml&db=gene&id=7157"
    result = urllib2.urlopen(url).read()
(TIP: use urllib.quote_plus to escape spaces and other special characters when preparing your URL query.)
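For readers on Python 3, where urllib2 no longer exists, a rough equivalent of the slide above might look like this. urllib.request.urlopen and urllib.parse.quote_plus are the standard-library replacements; the fetch helper and the example search term are illustrative assumptions, and fetch is defined but not called here because it needs network access.

```python
from urllib.parse import quote_plus
from urllib.request import urlopen


def fetch(url):
    """Hypothetical helper: return the raw response body as bytes (network call)."""
    with urlopen(url) as response:
        return response.read()


# Illustrative query term containing spaces and square brackets.
term = "homo sapiens[ORGN] AND kinase"

# quote_plus turns spaces into '+' and percent-encodes the brackets.
escaped = quote_plus(term)
print(escaped)
```

The escaped string can then be appended after term= in a query URL and passed to fetch.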

11 Making your life much easier: XML handling using BeautifulSoup Using retmode=xml ensures consistency in output format, but it can be very difficult to extract the data without a proper XML parser. The simplest and most powerful XML handling in Python I have found is via the BeautifulSoup object model.


13 Making your life much easier: XML handling using BeautifulSoup
Example: extract all structure IDs linked to gene
    e = ELink(db='structure', dbfrom='gene', id=7153)
    result = e.read()
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(result, 'xml')
    linkset = soup.eLinkResult.LinkSet
    s = [x.Id.text for x in linkset.LinkSetDb.findAll('Link')]
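The same extraction can be tried offline. The sketch below swaps BeautifulSoup for the standard library's xml.etree.ElementTree so that it runs without extra packages; it is not the approach the talk recommends. The XML is a hand-written, trimmed imitation of an ELink response, and the linked IDs in it are made-up placeholders.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for an ELink response (assumption: real responses
# carry more fields; the Link IDs here are placeholders, not real data).
SAMPLE_ELINK = """<eLinkResult>
  <LinkSet>
    <DbFrom>gene</DbFrom>
    <IdList><Id>7153</Id></IdList>
    <LinkSetDb>
      <DbTo>structure</DbTo>
      <Link><Id>11111</Id></Link>
      <Link><Id>22222</Id></Link>
    </LinkSetDb>
  </LinkSet>
</eLinkResult>"""


def linked_ids(xml_text):
    """Return the text of every Link/Id element under LinkSet/LinkSetDb."""
    root = ET.fromstring(xml_text)  # root is the eLinkResult element
    return [link.findtext("Id") for link in root.findall("./LinkSet/LinkSetDb/Link")]


print(linked_ids(SAMPLE_ELINK))
```

Restricting the path to LinkSetDb matters: the queried gene ID also sits in an Id element (under IdList), so collecting every Id in the document would mix the input ID into the results.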

14 Using WebEnv to chain requests If you specify usehistory='y', NCBI can remember your output result (e.g. a list of gene IDs) and use it as a batch input for another EUtil request. This is extremely useful for minimising the number of queries for workflows involving large sets of IDs. You keep track of this “environment” using the WebEnv and query_key fields.

15 Using WebEnv to chain requests
    def webenv_search(**kwargs):
        e = ESearch(usehistory='y', **kwargs)
        result = e.read()
        soup = BeautifulSoup(result, 'xml')
        return {'WebEnv': soup.WebEnv.text, 'query_key': soup.QueryKey.text}
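To show what happens with the returned fields, here is a sketch of the second half of the chain: reading WebEnv and QueryKey out of an ESearch response and passing them to a follow-up request in place of an explicit ID list. It uses only the standard library (ElementTree instead of BeautifulSoup), the XML is a hand-written stand-in with a placeholder token, and both helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hand-written stand-in for an ESearch response obtained with usehistory='y'
# (assumption: trimmed to the relevant fields; the token is a placeholder).
SAMPLE_ESEARCH = """<eSearchResult>
  <Count>2</Count>
  <QueryKey>1</QueryKey>
  <WebEnv>EXAMPLE_WEBENV_TOKEN</WebEnv>
</eSearchResult>"""


def history_fields(xml_text):
    """Extract the two fields that identify a stored result set."""
    root = ET.fromstring(xml_text)
    return {"WebEnv": root.findtext("WebEnv"), "query_key": root.findtext("QueryKey")}


def followup_url(tool, db, history):
    """Build a URL that reuses the stored result set instead of listing IDs."""
    base = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
    params = {"db": db, "retmode": "xml", **history}
    return base + tool + ".fcgi?" + urlencode(params)


history = history_fields(SAMPLE_ESEARCH)
print(followup_url("esummary", "gene", history))
```

Because the dictionary keys match the E-Utils parameter names, the result of one request can be unpacked straight into the next, which is what keeps the number of queries low for large ID sets.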

16 Workflow for gene list

17 My tasks
1. Produce a list of human genes that are associated with at least one resolved structure in PDB AND at least one genetic disorder in OMIM ✓
2. Make an online table to display them (next time!)

18 Summary
- Using NCBI EUtils to produce a data set under given constraints was relatively straightforward.
- Resulting code is highly re-usable for future workflows (especially if written as generic functions).

19 Python modules used
- PyCogent: simple request handling for the main EUtils. pycogent.org
- urllib2: general HTTP request handler. docs.python.org/2/library/urllib2.html
- BeautifulSoup: amazingly easy to use object model for XML/HTML.

