
1 A Distributed Framework for Computation on the Results of Large Scale NLP
Christophe Roeder, William A. Baumgartner Jr., Kevin Livingston, Lawrence Hunter (UC Denver)
Chris.Roeder@ucdenver.edu
http://compbio.ucdenver.edu

2 Questions that could be answered using large corpora
– A second source of data for validation/corroboration: ligand binding site validation (Verspoor et al.)
– Rough ideas/leads to protein-protein interactions (PPI) from co-occurrence
– Protein co-occurrence fraction for use in Hanalyzer networks
– Mine more, and more recent, knowledge than is available from curated ontologies

3 Available Tools and Data
Data
– Large corpora: PMC OA, publisher-arranged collections
– Curated ontologies: PRO, GO, etc.
Tools
– UIMA for NLP processing
– Batch schedulers (SGE, Torque) to scale UIMA
– Hadoop to collate data
– RDF to represent knowledge
– Triple store (Franz AllegroGraph) to store and access large amounts of RDF data

4 Bio Trends: a Sample Integration Project
Function:
– Count occurrences of proteins in articles
– Collate by date and display in a web app
Design:
– UIMA over SGE for protein ID; store results in RDF files
– Read RDF files and collate with Hadoop; call out to AllegroGraph for ID and attribute lookup
– Format the resulting data as JSON for the web app
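The last design step above, collating protein counts by date and emitting JSON for the web app, can be sketched as follows. This is a minimal illustration, not the project's actual code; the record layout (protein, year, count) is an assumption.

```python
import json
from collections import defaultdict

# Illustrative collated records: (protein, year, count). In the real pipeline
# these would come out of the Hadoop reduce step, not a hard-coded list.
records = [
    ("BRCA1", "2009", 12),
    ("BRCA1", "2010", 19),
    ("TP53", "2009", 30),
]

# Group counts as protein -> {year: total}, summing duplicates per year.
by_protein = defaultdict(dict)
for protein, year, count in records:
    by_protein[protein][year] = by_protein[protein].get(year, 0) + count

# Serialize for a web front end.
payload = json.dumps(by_protein, sort_keys=True)
print(payload)
```

A structure like this lets the web app plot one trend line per protein with years on the x-axis.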

5 Prepare Available Data
Start with raw text: PMC Open Access
– 250k full-text journal articles
Identify (annotate) interesting spans (genes)
– UIMA pipeline; NERs: ABNER, BANNER, etc.; ConceptMapper over a PRO dictionary to normalize
– Output to RDF for various uses
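The "output to RDF" step can be sketched as serializing each annotated span as a handful of N-Triples. The namespace, predicate names, and the PRO identifier below are all illustrative assumptions, not the project's actual schema.

```python
# Hypothetical namespace for annotation triples (not the project's real one).
EX = "http://example.org/annotation"

def annotation_to_ntriples(doc_id, start, end, pro_id):
    """Serialize one annotated gene span as N-Triples lines.

    doc_id: source article identifier; start/end: character offsets of the
    span; pro_id: the normalized Protein Ontology term (illustrative value).
    """
    subj = f"<{EX}/{doc_id}/{start}-{end}>"
    return [
        f'{subj} <{EX}#inDocument> "{doc_id}" .',
        f'{subj} <{EX}#begin> "{start}" .',
        f'{subj} <{EX}#end> "{end}" .',
        f"{subj} <{EX}#denotes> <http://purl.obolibrary.org/obo/{pro_id}> .",
    ]

triples = annotation_to_ntriples("PMC1234567", 10, 15, "PR_000000001")
print("\n".join(triples))
```

Flat triple files like this can either be bulk-loaded into a triple store or read directly by Hadoop, which is what makes the analysis options on the next slide possible.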

6 Options to Analyze Data
Load into a triple store and query
– Necessary for exploratory queries with complex results over the entire graph
Load individual files into an in-memory store and query in small groups
– Suitable for exploring simple queries over many small, article-scoped regions of the graph
– Easier to federate
Hybrid
– Some data is not available from the RDF files, only from the triple store

7 Map-Reduce
Inspired by the Lisp functions "map" and "reduce"
– Map applies a function to each element of a list: (a1, a2, … an), f(x) → (f(a1), f(a2), … f(an))
– Reduce combines a list by applying a function successively: (a1, a2, … an), f(x, y) → f(f(f(a1, a2), a3), a4)
– Example: (1, 2, 3, 4), + → (((1 + 2) + 3) + 4) = 10
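The two Lisp-inspired operations above have direct Python equivalents, shown here as a quick sketch:

```python
from functools import reduce

nums = [1, 2, 3, 4]

# map: apply f to each element of the list
squared = list(map(lambda x: x * x, nums))

# reduce: fold the list with f, i.e. (((1 + 2) + 3) + 4)
total = reduce(lambda x, y: x + y, nums)

print(squared, total)  # [1, 4, 9, 16] 10
```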

8 Map-Reduce on HashMaps
Map can transform one kind of (key, value) pair into a different kind
– (filename, text) → (gene, count)
Reduce must have the same kind of key and value for output as for input; a call to reduce receives all values for a particular key
– (gene, count) → (gene, count)
– (BRCA1, 1), (BRCA1, 3), (BRCA1, 1) → (BRCA1, 5)
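The keyed map/reduce above can be sketched in a few lines of Python. The documents and the toy gene dictionary are made-up stand-ins for the PMC corpus and the PRO lookup; the counts are chosen to reproduce the (BRCA1, 1), (BRCA1, 3), (BRCA1, 1) → (BRCA1, 5) example.

```python
from collections import defaultdict

# Toy corpus: filename -> text (stand-in for PMC OA articles).
docs = {
    "a.txt": "BRCA1 binds TP53",
    "b.txt": "BRCA1 BRCA1 and TP53 interact with BRCA1",
    "c.txt": "BRCA1 pathway",
}

GENES = {"BRCA1", "TP53"}  # toy dictionary standing in for the PRO lookup

# Map: (filename, text) -> list of (gene, count) pairs for that document.
def map_doc(filename, text):
    counts = defaultdict(int)
    for token in text.split():
        if token in GENES:
            counts[token] += 1
    return list(counts.items())

# Shuffle: group intermediate pairs by key, as the framework would.
grouped = defaultdict(list)
for fn, text in docs.items():
    for gene, n in map_doc(fn, text):
        grouped[gene].append(n)

# Reduce: (gene, [counts]) -> (gene, total); same key/value types in and out.
totals = {gene: sum(ns) for gene, ns in grouped.items()}
print(totals)
```

Here `grouped["BRCA1"]` is `[1, 3, 1]`, and the reduce step sums it to 5, matching the slide's example.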

9 Hadoop: a Distributed Map-Reduce on Maps or Hash Tables
– Divides work into parallel-friendly tasks by key
– Distributes files over the network
– Reduces network traffic by performing computation where the data is
– Map moves from one key-value type to another: from (filename → contents) to (protein-protein pair → co-occurrence count)
– Reduce collates the results
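A Hadoop-Streaming-style sketch of the co-occurrence job described above, in plain Python: the mapper emits one (protein-pair, 1) line per pair co-occurring in a document, and the reducer sums counts per key after the sort, as Hadoop's shuffle would deliver them. Record formats and names here are illustrative, not the project's actual job.

```python
import itertools

def mapper(lines):
    # Each input line simulates one "filename \t contents" record.
    for line in lines:
        _, contents = line.split("\t", 1)
        proteins = sorted(set(contents.split()))
        # Emit one count of 1 per unordered protein pair in this document.
        for a, b in itertools.combinations(proteins, 2):
            yield f"{a}|{b}\t1"

def reducer(sorted_lines):
    # Hadoop delivers mapper output sorted by key; sum each key's values.
    for key, group in itertools.groupby(
            (line.split("\t") for line in sorted_lines),
            key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(v) for _, v in group)}"

records = ["doc1.txt\tBRCA1 TP53", "doc2.txt\tBRCA1 TP53 MDM2"]
shuffled = sorted(mapper(records))   # stands in for Hadoop's shuffle/sort
results = list(reducer(shuffled))
print("\n".join(results))
```

Because the shuffle groups all values for a key onto one reducer, each pair's count can be summed locally, which is what lets Hadoop keep computation close to the data.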

10 Results
– PMC OA
– Medline Abstracts

11 Screen Shot

12 Thank You / Questions
http://www.compbio.ucdenver/bio-trends
Co-authors
– William Baumgartner for data generation
– Kevin Livingston for RDF and Clojure help
Grants and PIs
– Larry Hunter, UC Denver SOM: NIH 2R01LM009254-04, NIH 2R01LM008111-04A1, NIH 5R01GM083649-02
– Karin Verspoor, UC Denver SOM: NIH R01 LM010120-01
– Gully Burns, ISI: NSF 0849977

