Towards an intelligent framework to quickly find data from distributed heterogeneous biomedical resources. Despoina Antonakaki, Dasha Zhernakova, Erik.

1 Towards an intelligent framework to quickly find data from distributed heterogeneous biomedical resources. Despoina Antonakaki, Dasha Zhernakova, Erik Roos, K Joeri van der Velde, Mark Kiestra, Tomasz Adamusiak, Niran Abeygunawardena, Helen Parkinson, Rolf Sijmons, Morris A. Swertz

2 Biologists' challenges: a web of data ① Find data – Many different resources: local; structured (ArrayExpress); free text (PubMed) – Typing into many search boxes: Google, NCBI/Entrez, EBI/EB-eye, KEGG/DBGET ② Merge and pool data – One big Excel file (trying to make the headers fit) ③ Size of data – Working for weeks (map and match) Major problem: “Using Microsoft Word as sequence annotation tool”

3 Informatics challenges: too many silos… ① Differences in terminology – Need to reach “hidden” structured data: encapsulated in databases, legacy systems – Different conceptualization of information ② Differences in formats and structure – Too many formats for specifying and describing biomedical entities; no standard representation model ③ Automatic matching and merging – Difficult to merge into a single query; working for weeks (map & match) ④ Query across silos [Diagram: DB1, DB2, DB3, each with its own format (Format 1, Format 2, Format 3, …)]

4 Local? National? EU? Global? Wanted: a ‘meta’ search infrastructure to: find me cases; find me cohorts/partners. Connecting different ‘biobanks’? [Diagram: a query sent to LifeLines, GenerationR, TweelingReg, PSI, Celiac Disease]

5 Outline Challenges for biologists and the corresponding challenges for informatics: 1. Merge and pool data - Differences in formats and structure 2. Find data - Differences in terminology 3. Size of data - Automatic matching and merging 4. Across data sets - All of the above + distribution Approaches: 1. Integrate data into one ‘pheno’ model (MOLGENIS) 2. Use ontologies (OntoCAT) 3. Indexing (Lucene) 4. Query expansion (Lucene + OntoCAT) Discussion: 1. Federated data queries (MOLGENIS & RDF)

6 ①Data warehouse, put it all in one place? Loading … Pheno-OM

7 ① Pheno-OM data model. Flexible: any feature, value, and target combination. [Diagram: an Observed value (Observed Relation or Inferred Value) links an Observable feature to an Observation target (Individual or Panel/cohort/Biobanks) at a given time, optionally through a Protocol application. Example: Height, 179cm, Ind1]
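The feature, target, and value combination on this slide can be sketched in a few lines. This is a minimal illustration only, not the Pheno-OM implementation: the class and field names below are assumptions chosen to mirror the labels in the diagram, and protocol applications are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ObservableFeature:
    """Something that can be measured, e.g. "Height" (hypothetical class)."""
    name: str
    unit: Optional[str] = None


@dataclass(frozen=True)
class ObservationTarget:
    """An individual or a panel/cohort that is observed (hypothetical class)."""
    name: str


@dataclass(frozen=True)
class ObservedValue:
    """One value of one feature for one target, optionally time-stamped."""
    target: ObservationTarget
    feature: ObservableFeature
    value: str
    time: Optional[str] = None


def values_for(observations, target_name):
    """Collect feature name -> value for a single target."""
    return {
        obs.feature.name: obs.value
        for obs in observations
        if obs.target.name == target_name
    }


# The example from the slide: Height = 179 cm for individual Ind1.
height = ObservableFeature("Height", unit="cm")
ind1 = ObservationTarget("Ind1")
observations = [ObservedValue(ind1, height, "179")]
```

Because every observation is just a (target, feature, value) record, a new kind of measurement needs a new ObservableFeature row rather than a new column or table.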

8 An example of Excel data, or BBMRI-NL

9 ② Use ontologies. To overcome different terminologies, two approaches: 1. Use ontologies to annotate the source (this of course depends on other parties) 2. Use ontologies for query expansion (synonyms, part-of relations, subclasses) [Diagram: the query “Deformed ears?” is matched to “Abnormally shaped ears” in the Pheno-DB through an index and ontologies with mappings. HPO and MP each list: Abnormally shaped ears, Auricular malformation, Deformed auricles, Deformed ears, Malformed auricles, Malformed ears, Malformed external ears]

11 Complexity in ontologies: sometimes they change unpredictably, and sometimes they suddenly become unavailable. Searching across different ontologies requires expert knowledge.

12 Some facts… NCBO BioPortal: – 204 ontologies, 29 REST signatures … – BUT: REST signatures change or break without notice. OLS: – 79 OBO ontologies, 16 web service signatures; stable, open, local – BUT: not as rich, rudimentary documentation. Individual users also create their own ontologies. Integration is hard … [Diagram: Ontology Browser, EFO, BioPortal Import, OntoAPI, OWL API]

13 OntoCAT hides the complexity (ontocat.org). Resources: BioPortal, EBI OLS, OWL & OBO. Methods: searchOntology(), getChildren(), getParents(), getSynonyms(), getDefinitions(), ...

14 ② Generic Ontology Service interface – Implemented in Java 6 – Open source (LGPL v3) – Simple and easy-to-use API for BioPortal, OLS web services, and the OWL API (BioportalOntologyService, OlsOntologyService and FileOntologyService) [Diagram: BBMRI ontology, OWL API, HPO, NCBO BioPortal, OLS (EMBL-EBI), OBO files]

15 ② Use case diagram of OntoCAT – Use case of a simplified user interaction with existing ontology resources through OntoCAT – Web applications can connect using REST or SOAP services – R connects through the ontoCAT Bioconductor package

16 ② Common workflow to integrate ontology resources

17 ② Functions in OntoCAT – The following features are not available from the underlying ontology resources

18 ② OntoCAT example: find the term “membrane” in multiple ontologies

19 ②More examples available

20 ② OntoCAT & Zooma use cases 1. Updating ontology properties: – EFO involves constructing mappings to multiple domain-specific ontologies (Disease, Cell Type) – Multithreading the OntoCAT requests makes it possible to process and import extra information from over 20,000 external ontology terms in less than 10 minutes 2. Annotating users' experimental values with ontology terms: – ArrayExpress Archive & Gene Expression Atlas: > 1 million unique experiment annotations, annotated with EBI's pre-release version of the application ontology EFO – Terms not available in EFO have to be checked against publicly available ontologies – Previously a manual process, now done with Zooma (local EFO, OWL, local DBs)
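The multithreaded import mentioned in use case 1 can be sketched with a thread pool. This is a generic illustration, not the OntoCAT or EFO code: fetch_term_info is a hypothetical stand-in for one remote ontology lookup, and the term identifiers are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_term_info(term_id):
    """Hypothetical stand-in for one (slow) remote ontology request."""
    return {"id": term_id, "label": "label for " + term_id}


def fetch_all(term_ids, max_workers=8):
    """Run the lookups concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_term_info, term_ids))


results = fetch_all(["TERM:1", "TERM:2", "TERM:3"])
```

Threads help here because each request spends most of its time waiting on the network, so many lookups can overlap.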

21 ② OntoCAT & Zooma use cases 3. Local ontology management – eXtensible Genotype And Phenotype data platform (XGAP, MOLGENIS): a search widget for interactive annotation of data with ontology terms. It allows searching publicly available ontologies and downloading terms for unambiguous annotation of QTL or GWAS data. 4. Data analysis & annotation – New Bioconductor package to read and query OWL/OBO in R, with built-in offline support for EFO & BioPortal ontology queries

22 ② OntoCAT characteristics & tools – OntoCAT provides synonym & definition lookup across two major implemented ontology services – Supports interoperability using RDF – A class combining multiple ontology resources, including different repositories, behind a single entry point (CompositeOntologyService) – Cache – Ranking – Prioritization – Fallback mechanism if an ontology resource is unavailable
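The idea of a single entry point with caching, prioritization, and fallback can be sketched as follows. This is not OntoCAT's CompositeOntologyService (which is Java); the classes, the get_synonyms method, and the in-memory dictionaries are all hypothetical and only illustrate the pattern.

```python
class OntologyServiceError(Exception):
    """Raised when a (hypothetical) ontology resource cannot be reached."""


class DictOntologyService:
    """Hypothetical in-memory stand-in for one ontology resource."""

    def __init__(self, name, synonyms, available=True):
        self.name = name
        self.synonyms = {k.lower(): v for k, v in synonyms.items()}
        self.available = available

    def get_synonyms(self, term):
        if not self.available:
            raise OntologyServiceError(self.name + " is unavailable")
        return list(self.synonyms.get(term.lower(), []))


class CompositeService:
    """Queries several services in priority order behind one entry point."""

    def __init__(self, services):
        self.services = services  # earlier entries have higher priority
        self.cache = {}

    def get_synonyms(self, term):
        key = term.lower()
        if key in self.cache:
            return self.cache[key]
        merged = []
        for service in self.services:
            try:
                found = service.get_synonyms(term)
            except OntologyServiceError:
                continue  # fallback: skip the resource that is down
            for synonym in found:
                if synonym not in merged:
                    merged.append(synonym)
        # Note: a result computed while a resource was down stays cached.
        self.cache[key] = merged
        return merged


hpo = DictOntologyService(
    "HPO", {"Deformed ears": ["Abnormally shaped ears", "Auricular malformation"]}
)
mp = DictOntologyService(
    "MP", {"Deformed ears": ["Malformed ears", "Abnormally shaped ears"]}
)
offline = DictOntologyService("offline resource", {}, available=False)
composite = CompositeService([offline, hpo, mp])
```

The caller sees one get_synonyms call regardless of how many resources answer, which is the property the slide attributes to the composite class.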

23 ②Demo on Google App Engine framework

24 ②Ontocat browser retrieving OLS __target=main&select= OntocatBrowser

25 ② OntoCAT’s applications – OntoCAT ontology mapping application: – OntoCAT Bioconductor/R package: views/2.7/bioc/html/ontoCAT.html

26 ② OntoCAT for biomedical ontologies. Code examples: how to list all ontologies available through – OLS – NCBO BioPortal – an OWL ontology

28 ③ Indexing: general features. An index is a data structure that overcomes barriers in large DBs – Created by using DB tables as the basis for search – Efficient access to ordered records & rapid random lookup – Less disk space for storage (key fields). Lucene: open source Java library (known from internet search engines) – Full-text indexing & searching capability – Format independent (documents & fields). Query expansion: – Additional related terms (synonyms & children) are appended with the OR operator and assigned a lower weight – Changes document ranking → order of retrieved docs – Even if query expansion doesn’t improve the search, the query is more precise
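The query expansion bullets can be illustrated by building an expanded query string. The sketch below assumes the classic Lucene query parser syntax, where OR combines clauses and ^ sets a boost; the function name and the boost value of 0.5 are illustrative choices, not taken from the slides.

```python
def expand_query(term, related, boost=0.5):
    """Append related terms with OR, each with a lower weight (boost < 1)."""

    def quote(text):
        # Phrase-quote each term and escape characters that end a phrase.
        return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

    parts = [quote(term)]
    seen = {term.lower()}
    for candidate in related:
        if candidate.lower() in seen:
            continue  # do not repeat the original term or a duplicate
        seen.add(candidate.lower())
        parts.append(quote(candidate) + "^" + str(boost))
    return " OR ".join(parts)


query = expand_query("deformed ears", ["Malformed ears", "Deformed ears"])
# query == '"deformed ears" OR "Malformed ears"^0.5'
```

Giving expansion terms a boost below 1 keeps documents that match the user's own wording ranked above those that match only a synonym.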

29 ③ Indexing: the approach. Overcome the barriers of searching in large data sizes – Optimize the in-memory representation, e.g. as a tree – Steps: 1. Create a new index and add documents (fields from the DB, ontology terms from OntoCAT) 2. Analyzer: extracts tokens from the text to be indexed and eliminates the rest 3. Parser: select fields (term/value): tokenized? indexed? case sensitive? 4. Collect results
Example OBO terms to be indexed:
def: "Paired, cup-shaped cartilage that are dorsal to the septomaxillae and anterior to the oblique cartilage. The anterior, convex face of each alary cartilage is synchondrotically fused to the superior prenasal cartilage and the ventral edge is fused to the superior margin of the crista intermedia." [AAO:LAP]
related_synonym: "alinasal cartilage" []
related_synonym: "cartilago alaris" []
related_synonym: "cartilago alaris nasi" []
related_synonym: "cartilago cupullaris" []
[Term]
id: AAO:
name: Meckel's_cartilage
def: "Paired, rod-shaped elements that extend the length of the mandible and lie between the dentaries and the angulosplenials." [AAO:LAP]
relationship: part_of AAO: ! lower_jaw_skeleton
[Term]
id: CHEBI:24431
name: molecular structure
def: "A description of the molecular entity or part thereof based on its composition and/or the connectivity between its constituent atoms." []
[Diagram: the user enters a search term; 1. Analyze query 2. Parse index 3. Collect results; the results are output. Candidate tokens: “Oblique cartilage.” (tokenized?), “cartilago cupullaris” (tokenized?), “Septomaxillae”, “angulosplenials”]
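The four steps can be sketched with a tiny in-memory inverted index. This is a simplified illustration of the general technique, not Lucene's implementation: the TinyIndex class, the stop word list, and the document ids doc1 and doc2 are assumptions made for the example.

```python
import re
from collections import defaultdict

# Small illustrative stop word list (an assumption, not Lucene's default).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "that", "are", "is"}


def analyze(text):
    """Step 2: lower-case, split into tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


class TinyIndex:
    """Maps (field, token) to the set of document ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        """Step 1: add a document made of named text fields."""
        for field, text in fields.items():
            for token in analyze(text):
                self.postings[(field, token)].add(doc_id)

    def search(self, field, query):
        """Steps 3 and 4: look up every query token in one field (AND)."""
        tokens = analyze(query)
        if not tokens:
            return set()
        result = None
        for token in tokens:
            ids = self.postings.get((field, token), set())
            result = set(ids) if result is None else result & ids
        return result


index = TinyIndex()
index.add("doc1", {
    "name": "Meckel's_cartilage",
    "def": "Paired, rod-shaped elements that extend the length of the mandible",
})
index.add("doc2", {
    "name": "molecular structure",
    "def": "A description of the molecular entity or part thereof",
})
```

Running the same analyzer over documents and queries is what makes a query for "Mandible" match the lower-cased token stored at indexing time.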

30 ③Indexing DB: implementation

32 ④ Query expansion [Diagram: the query “Deformed ears?” goes through query expansion before it reaches the Pheno Warehouse. Expansion terms: HPO: Abnormally shaped ears, Auricular malformation, Deformed auricles; MP: Malformed auricles, Malformed ears, Malformed external ears, etc. Sources: local ontologies (OWL or OBO), CWA, BioPortal, OLS. OntoCAT – Ontology common API tasks and] Example: Deformed ears → Abnormally shaped ears

33 ④Query expansion details & ontology selection Ontologies used

34 ④The expanded query & the results

35 query: lung disease searching WITHOUT query expansion:

36 ④ Indexing: implementation (OntoCAT). Lucene scoring uses a combination of the Vector Space Model (VSM) of Information Retrieval and the Boolean model to determine how relevant a given Document is to a User's query.
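The combination of a Boolean filter with vector space ranking can be sketched with plain TF-IDF and cosine similarity. This is a textbook simplification, not Lucene's actual scoring formula: the weighting scheme, the function names, and the sample documents are assumptions for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build a TF-IDF vector per document; docs maps id -> list of tokens."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    idf = {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}
    vectors = {
        doc_id: {t: count * idf[t] for t, count in Counter(tokens).items()}
        for doc_id, tokens in docs.items()
    }
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query_tokens, docs):
    """Boolean step: keep docs sharing a query term. VSM step: sort by cosine."""
    vectors, idf = tfidf_vectors(docs)
    query = {t: c * idf.get(t, 0.0) for t, c in Counter(query_tokens).items()}
    scored = [(cosine(query, vec), doc_id) for doc_id, vec in vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]


docs = {
    "d1": ["lung", "disease"],
    "d2": ["lung", "cancer", "lung"],
    "d3": ["heart", "disease"],
    "d4": ["ear"],
}
ranking = rank(["lung", "disease"], docs)
```

Documents with no query term are dropped entirely, and the remaining ones are ordered by how closely their term weights align with the query.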

37 query: lung disease searching WITH query expansion:

38 ④ Query expansion [Diagram: the query “Deformed ears?” goes through query expansion before it reaches the Pheno Warehouse. Expansion terms: HPO: Abnormally shaped ears, Auricular malformation, Deformed auricles; MP: Malformed auricles, Malformed ears, Malformed external ears, etc. Sources: local ontologies (OWL or OBO), CWA, BioPortal, OLS. OntoCAT – Ontology common API tasks and]

40 Distributed querying in BBMRI [Diagram: the query “Deformed ears?” is sent to the Twin Registry, Generation R, LifeLines, and BBMRI-SE. OntoCAT – Ontology common API tasks and RDF + OWL?]

41 Standard formats. Mapping and automated software. Semantic web: formal specifications, syntax and representation, design principles, collaborative working groups & technologies. RDF: Resource Description Framework – A fact is expressed as a triple [Subject – Predicate – Object] – It's like a little English sentence – Provides a formal description of concepts, terms, and relationships for a specific domain [Diagram: DB1, DB2, DB3 with Format 1, Format 2, Format 3, … are mapped to a common RDF/JSON/XML representation; the semantic annotation is expressed as an RDF graph of Subject – Predicate – Object]
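The triple idea can be sketched without any RDF library by storing facts as 3-tuples and matching patterns against them. The ex: names and the facts themselves are made up for illustration; a real setup would use full IRIs and an RDF store.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return all triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Each fact reads like a little sentence: subject, predicate, object.
# All names and values here are hypothetical.
triples = [
    ("ex:Ind1", "ex:hasHeight", "179cm"),
    ("ex:Ind1", "ex:memberOf", "ex:CohortA"),
    ("ex:Ind2", "ex:memberOf", "ex:CohortA"),
]

members = [s for (s, p, o) in match(triples, predicate="ex:memberOf", obj="ex:CohortA")]
```

A pattern with wildcards, such as "who is a member of CohortA?", is the basic building block that a query language for RDF generalizes.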

42 Standard formats in MOLGENIS. RDF is not XML: – Represents information in a distributed world – Concerned with meaning: describes logical inferences between facts – Links distributed documents through common vocabularies. SPARQL: query language for RDF. MOLGENIS: from a simple model, automatically generates flexible web platforms for all possible genomic, molecular and phenotypic experiments. We propose an RDF representation of the data in MOLGENIS, and SPARQL to query multiple and diverse data sources. [Diagram: Molgenis DB]
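One way to picture an RDF representation of database content is to turn each row into one triple per column. The sketch below is only an illustration of that idea, not the MOLGENIS mapping: the function name, the http://example.org/ base IRI, and the naming scheme for subjects and predicates are assumptions, and it presumes table, column, and key values are safe to use inside an IRI.

```python
def row_to_ntriples(table, key, row, base="http://example.org/"):
    """Emit one N-Triples line per non-key, non-empty column of a row."""
    subject = "<" + base + table + "/" + str(row[key]) + ">"
    lines = []
    for column, value in row.items():
        if column == key or value is None:
            continue
        predicate = "<" + base + table + "#" + column + ">"
        # Escape the characters that N-Triples string literals require.
        literal = (
            str(value)
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        lines.append(subject + " " + predicate + ' "' + literal + '" .')
    return lines


lines = row_to_ntriples("individual", "id", {"id": "Ind1", "height": "179cm"})
```

Once rows are exposed this way, the same triple patterns can be asked of several databases, which is the motivation given on the slide for pairing the RDF representation with SPARQL.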

43 Federated data queries (MOLGENIS & RDF). How can MOLGENIS data be distributed via RDF/SPARQL? [Diagram: the query “Deformed ears?” is matched to “Abnormally shaped ears” in a DB through ontologies with mappings; can a second DB be reached through RDF and SPARQL? HPO and MP each list: Abnormally shaped ears, Auricular malformation, Deformed auricles, Deformed ears, Malformed auricles, Malformed ears, Malformed external ears]

44 Discussion & next steps: distributed querying? How to map a database to RDF such that it helps querying? – Diversity: put all data into MOLGENIS’ pheno model (+ quick, works offline; - has to be updated all the time) – Map to all distributed sources “on the fly” (RDF & SPARQL) – Agree on distributed query mechanisms (+ always up to date; - slow, breaks if sources go offline). Investigate other projects like Open Data – Can MOLGENIS be part of open data?

45 NL

46 Thank you for your attention. Questions?

47 OntoCAT – Guide/examples – Available from: __target=main&select=OntocatBrowser – OntoCAT demo on Google App Engine framework: http://ontocat-web.appspot.com MOLGENIS Lucene index & query expansion app: – ype/handwritten/java/plugins/LuceneIndex/ Pheno-OM data model: http://wwwdev.ebi.ac.uk/microarray-srv/pheno/doc/objectmodel.html XGAP:

48 How this all can work together [Diagram: a user query goes to Semantic Search, which returns recommendations. Resources: OLS (EMBL-EBI), NCBO BioPortal, SPARQL endpoint, local data sets, structured data (ArrayExpress), …]

