Presentation is loading. Please wait.

Presentation is loading. Please wait.

Ontologies and data integration in biomedicine Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Kno.e.sis.

Similar presentations


Presentation on theme: "Ontologies and data integration in biomedicine Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Kno.e.sis."— Presentation transcript:

1 Ontologies and data integration in biomedicine Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Kno.e.sis Wright State University, Dayton, Ohio May 27, 2009

2 2 Outline u Why integrate data? u Ontologies and data integration u Examples u Challenging issues

3 Why integrate data?

4 4 u Sources of information l Created by n Independent researchers n Separate workflows l Heterogeneous l Scattered l “Silos” u To identify patterns in integrated datasets l Hypothesis generation l Knowledge discovery

5 5 Motivation Translational research u “Bench to Bedside” u Integration of clinical and research activities and results u Supported by research programs l NIH Roadmap l Clinical and Translational Science Awards (CTSA) u Requires the effective integration and exchange and of information between l Basic research l Clinical research

6 6 Genotype and phenotype [Goh, PNAS 2007] OMIM [HPO] OMIM [HPO]

7 Genes and environmental factors [Liu, BMC Bioinf. 2008] MEDLINE (MeSH index terms) Genetic Association Database MEDLINE (MeSH index terms) Genetic Association Database

8 8 Integrating drugs and targets [Yildirim, Nature Biot. 2007] DrugBank ATC Gene Ontology DrugBank ATC Gene Ontology

9 Why ontologies?

10 10 Uses of biomedical ontologies u Knowledge management l Annotating data and resources l Accessing biomedical information l Mapping across biomedical ontologies u Data integration, exchange and semantic interoperability u Decision support l Data selection and aggregation l Decision support l NLP applications l Knowledge discovery [Bodenreider, YBMI 2008]

11 11 Terminology and translational research Cancer Basic Research Cancer Basic Research EHR Cancer Patients EHR Cancer Patients NCI Thesaurus SNOMED CT

12 12 Approaches to data integration (1) u Warehousing l Sources to be integrated are transformed into a common format and converted to a common vocabulary u Mediation l Local schema (of the sources) l Global schema (in reference to which the queries are made) [Stein, Nature Rev. Gen. 2003] [Hernandez, SIGMOD Rec. 2004] [Goble J. Biomedical Informatics 2008]

13 13 Approaches to data integration (2) u Linked data l Links among data elements l Enable navigation by humans [Stein, Nature Rev. Gen. 2003] [Hernandez, SIGMOD Rec. 2004] [Goble J. Biomedical Informatics 2008]

14 14 Ontologies and warehousing u Role l Provide a conceptualization of the domain n Help define the schema n Information model vs. ontology l Provide value sets for data elements l Enable standardization and sharing of data u Examples l Annotations to the Gene Ontology l BioWarehouse l Clinical information systems http://biowarehouse.ai.sri.com/

15 15 Ontologies and mediation u Role l Reference for defining the global schema l Map between local and global schemas n Query reformulation n Local-as-view vs. Global-as-view u Examples l TAMBIS l BioMediator l OntoFusion [Stevens, Bioinformatics 2000] [Louie, AMIA 2005] [Perez-Rey, Comput Biol Med 2006]

16 16 Ontologies and linked data u Role l Explicit conceptualization of the domain l Semantic normalization of data elements u Examples l Entrez l Semantic Web mashups l Bio2RDF [http://www.ncbi.nlm.nih.gov/] [J. Biomedical informatics 41(5) 2008] [http://bio2rdf.org/]

17 17 Ontologies and data integration u Source of identifiers for biomedical entities l Semantic normalization l Warehouse approaches u Source of reference relations for the global schema l Mapping between local and global schemas l Mediator-based approaches u Source of identifiers for biomedical entities l Semantic normalization l Explicit conceptualization of the domain l Linked data approaches

18 18 Ontologies and data aggregation u Source of hierarchical relations l Aggregate data into coarser categories l Abstract away from low-frequency, fine grained data points l Increase power l Improve visualization

19 Examples Gene Ontology http://www.geneontology.org/

20 20 Annotating data u Gene Ontology l Functional annotation of gene products in several dozen model organisms u Various communities use the same controlled vocabularies u Enabling comparisons across model organisms u Annotations l Assigned manually by curators l Inferred automatically (e.g., from sequence similarity)

21 21 GO Annotations for Aldh2 (mouse) http:// www.informatics.jax.org/

22 22 GO ALD4 in Yeast http://db.yeastgenome.org/

23 23 GO Annotations for ALDH2 (Human) http://www.ebi.ac.uk/GOA/

24 24 Integration applications u Based on shared annotations l Enrichment analysis (within/across species) l Clustering (co-clustering with gene expression data) u Based on the structure of GO l Closely related annotations l Semantic similarity u Based on associations between gene products and annotations u Leveraging reasoning [Bodenreider, PSB 2005] [Sahoo, Medinfo 2007] [Lord, PSB 2003]

25 25 Gene Ontology Integration Entrez Gene + GO gene GO PubMed Gene name OMIM Sequence Interactions Glycosyltransferase Congenital muscular dystrophy Entrez Gene [Sahoo, Medinfo 2007]

26 26 From glycosyltransferase to congenital muscular dystrophy MIM:608840 Muscular dystrophy, congenital, type 1D GO:0008375 has_associated_phenotype has_molecular_function EG:9215 LARGE acetylglucosaminyl- transferase GO:0016757 glycosyltransferase GO:0008194 isa GO:0008375 acetylglucosaminyl- transferase GO:0016758

27 Examples caBIG http://cabig.nci.nih.gov/

28 28 Cancer Biomedical Informatics Grid u US National Cancer Institute u Common infrastructure used to share data and applications across institutions to support cancer research efforts in a grid environment u Service-oriented architecture u Data and application services available on the grid u Supported by ontological resources

29 29 caBIG services u caArray l Microarray data repository u caTissue l Biospecimen repository u caFE (Cancer Function Express) l Annotations on microarray data u…u…u…u… u caTRIP l Cancer Translational Research Informatics Platform l Integrates data services

30 30 Ontological resources u NCI Thesaurus l Reference terminology for the cancer domain l ~ 60,000 concepts l OWL Lite u Cancer Data Standards Repository (caDSR) l Metadata repository l Used to bridge across UML models through Common Data Elements l Links to concepts in ontologies

31 Examples Semantic Web for Health Care and Life Sciences http://www.w3.org/2001/sw/hcls/

32 32 Semantic Web layer cake

33 Linked data linkeddata.org

34 34 Linked data

35 35 Linked biomedical data [Tim Berners-Lee TED 2009 conference] http://www.w3.org/2009/Talks/0204-ted-tbl/#(1)

36 36 W3C Health Care and Life Sciences IG

37 37 Biomedical Semantic Web u Integration l Data/Information l E.g., translational research u Hypothesis generation u Knowledge discovery [Ruttenberg, BMC Bioinf. 2007]

38 38 HCLS mashup of biomedical sources NeuronDB BAMS NC Annotations Homologene SWAN Entrez Gene Gene Ontology Mammalian Phenotype PDSPki BrainPharm AlzGene Antibodies PubChem MeSH Reactome Allen Brain Atlas Publications http://esw.w3.org/topic/HCLS/HCLSIG_DemoHomePage_HCLSIG_Demo

39 39 Shared identifiers Example GO

40 40 HCLS mashup NeuronDB Protein (channels/receptors) Neurotransmitters Neuroanatomy Cell Compartments Currents BAMS Protein Neuroanatomy Cells Metabolites (channels) PubMedID NC Annotations Genes/Proteins Processes Cells (maybe) PubMed ID Allen Brain Atlas Genes Brain images Gross anatomy -> neuroanatomy Homologene Genes Species Orthologies Proofs SWAN PubMedID Hypothesis Questions Evidence Genes Entrez Gene Genes Protein GO PubMedID Interaction (g/p) Chromosome C. location GO Molecular function Cell components Biological process Annotation gene PubMedID Mammalian Phenotype Genes Phenotypes Disease PubMedID Proteins Chemicals Neurotransmitters PDSPki BrainPharm Drug Drug effect Pathological agent Phenotype Receptors Channels Cell types PubMedID Disease AlzGene Gene Polymorphism Population Alz Diagnosis Antibodies Genes Antibodies PubChem Name Structure Properties MeSH term MeSH Drugs Anatomy Phenotypes Compounds Chemicals PubMedID PubChem Reactome Genes/proteins Interactions Cellular location Processes (GO)

41 41 HCLS mashup NeuronDB Protein (channels/receptors) Neurotransmitters Neuroanatomy Cell Compartments Currents BAMS Protein Neuroanatomy Cells Metabolites (channels) PubMedID NC Annotations Genes/Proteins Processes Cells (maybe) PubMed ID Allen Brain Atlas Genes Brain images Gross anatomy -> neuroanatomy Homologene Genes Species Orthologies Proofs SWAN PubMedID Hypothesis Questions Evidence Genes Entrez Gene Genes Protein GO PubMedID Interaction (g/p) Chromosome C. location GO Molecular function Cell components Biological process Annotation gene PubMedID Mammalian Phenotype Genes Phenotypes Disease PubMedID Proteins Chemicals Neurotransmitters PDSPki BrainPharm Drug Drug effect Pathological agent Phenotype Receptors Channels Cell types PubMedID Disease AlzGene Gene Polymorphism Population Alz Diagnosis Antibodies Genes Antibodies PubChem Name Structure Properties MeSH term MeSH Drugs Anatomy Phenotypes Compounds Chemicals PubMedID PubChem Reactome Genes/proteins Interactions Cellular location Processes (GO)

42 42 HCLS mashups u Based on RDF/OWL u Based on shared identifiers l “Recombinant data” (E. Neumann) u Ontologies used in some cases u Support applications (SWAN, SenseLab, etc.)  Journal of Biomedical Informatics special issue on Semantic bio-mashups [J. Biomedical Informatics 41(5) 2008]

43 43 Semantic bio-mashups u Bio2RDF: Towards a mashup to build bioinformatics knowledge systems u Identifying disease-causal genes using Semantic Web-based representation of integrated genomic and phenomic knowledge u Schema driven assignment and implementation of life science identifiers (LSIDs) u The SWAN biomedical discourse ontology u An ontology-driven semantic mashup of gene and biological pathway information: Application to the domain of nicotine dependence u Towards an ontology for sharing medical images and regions of interest in neuroimaging u yOWL: An ontology-driven knowledge base for yeast biologists u Dynamic sub-ontology evolution for traditional Chinese medicine web ontology u Ontology-centric integration and navigation of the dengue literature u Infrastructure for dynamic knowledge integration—Automated biomedical ontology extension using textual resources u An ontological knowledge framework for adaptive medical workflow u Semi-automatic web service composition for the life sciences using the BioMoby semantic web framework u Combining Semantic Web technologies with Multi-Agent Systems for integrated access to biological resources [J. Biomedical Informatics 41(5) 2008]

44 Challenging issues

45 45 Challenging issues u Bridges across ontologies u Permanent identifiers for biomedical entities u Other issues

46 Challenging issues Bridges across ontologies

47 47 Trans-namespace integration Addison Disease (D000224) Addison's disease (363732003) Biomedical literature Biomedical literature MeSH Clinical repositories Clinical repositories SNOMED CT Primary adrenocortical insufficiency (E27.1) ICD 10

48 48 (Integrated) concept repositories  Unified Medical Language System http://umlsks.nlm.nih.gov http://umlsks.nlm.nih.gov  NCBO’s BioPortal http://www.bioontology.org/tools/portal/bioportal.html http://www.bioontology.org/tools/portal/bioportal.html  caDSR http://ncicb.nci.nih.gov/NCICB/infrastructure/cacore_overview/cadsr http://ncicb.nci.nih.gov/NCICB/infrastructure/cacore_overview/cadsr  Open Biomedical Ontologies (OBO) http://obofoundry.org/ http://obofoundry.org/

49 49 Integrating subdomains Biomedical literature Biomedical literature MeSH Genome annotations Genome annotations GO Model organisms Model organisms NCBI Taxonomy Genetic knowledge bases OMIM Clinical repositories Clinical repositories SNOMED CT Other subdomains Other subdomains … Anatomy FMA UMLS

50 50 Integrating subdomains Biomedical literature Biomedical literature Genome annotations Genome annotations Model organisms Model organisms Genetic knowledge bases Clinical repositories Clinical repositories Other subdomains Other subdomains Anatomy

51 51 Trans-namespace integration Genome annotations Genome annotations GO Model organisms Model organisms NCBI Taxonomy Genetic knowledge bases OMIM Other subdomains Other subdomains … Anatomy FMA UMLS Addison Disease (D000224) Addison's disease (363732003) Biomedical literature Biomedical literature MeSH Clinical repositories Clinical repositories SNOMED CT UMLS C0001403

52 52 Mappings u Created manually (e.g., UMLS) l Purpose l Directionality u Created automatically (e.g., BioPortal) l Lexically: ambiguity, normalization l Semantically: lack of / incomplete formal definitions u Key to enabling semantic interoperability u Enabling resource for the Semantic Web

53 Challenging issues Permanent identifiers for biomedical entities

54 54 Identifying biomedical entities u Multiple identifiers for the same entity in different ontologies u Barrier to data integration in general l Data annotated to different ontologies cannot “recombine” l Need for mappings across ontologies u Barrier to data integration in the Semantic Web l Multiple possible identifiers for the same entity n Depending on the underlying representational scheme (URI vs. LSID) n Depending on who creates the URI

55 55 Possible solutions  PURL http://purl.org http://purl.org l One level of indirection between developers and users l Independence from local constraints at the developer’s end u The institution creating a resource is also responsible for minting URIs l E.g., URI for genes in Entrez Gene u Guidelines: “URI note” l W3C Health Care and Life Sciences Interest Group u Shared names initiative l Identify resources vs. entities [http://sharedname.org/]

56 Challenging issues Other issues

57 57 Availability u Many ontologies are freely available u The UMLS is freely available for research purposes l Cost-free license required u Licensing issues can be tricky l SNOMED CT is freely available in member countries of the IHTSDO u Being freely available l Is a requirement for the Open Biomedical Ontologies (OBO) l Is a de facto prerequisite for Semantic Web applications

58 58 Discoverability u Ontology repositories l UMLS: 152 source vocabularies (biased towards healthcare applications) l NCBO BioPortal: ~141ontologies (biased towards biological applications) l Limited overlap between the two repositories u Need for discovery services l Metadata for ontologies

59 59 Formalism u Several major formalism l Web Ontology Language (OWL) – NCI Thesaurus l OBO format – most OBO ontologies l UMLS Rich Release Format (RRF) – UMLS, RxNorm u Conversion mechanisms l OBO to OWL l LexGrid (import/export to LexGrid internal format)

60 60 Ontology integration u Post hoc integration, form the bottom up l UMLS approach l Integrates ontologies “as is”, including legacy ontologies l Facilitates the integration of the corresponding datasets u Coordinated development of ontologies l OBO Foundry approach l Ensures consistency ab initio l Excludes legacy ontologies

61 61 Quality u Quality assurance in ontologies is still imperfectly defined l Difficult to define outside a use case or application u Several approaches to evaluating quality l Collaboratively, by users (Web 2.0 approach) n Marginal notes enabled by BioPortal l Centrally, by experts n OBO Foundry approach u Important factors besides quality l Governance l Installed base / Community of practice

62 62 Conclusions u Ontologies are enabling resources for data integration u Standardization works l Grass roots effort (GO) l Regulatory context (ICD 9-CM) u Bridging across resources is crucial l Ontology integration resources / strategies (UMLS, BioPortal / OBO Foundry) u Massive amounts of imperfect data integrated with rough methods might still be useful

63 Medical Ontology Research Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Contact:Web:olivier@nlm.nih.govmor.nlm.nih.gov


Download ppt "Ontologies and data integration in biomedicine Olivier Bodenreider Lister Hill National Center for Biomedical Communications Bethesda, Maryland - USA Kno.e.sis."

Similar presentations


Ads by Google