Presentation is loading. Please wait.

Presentation is loading. Please wait.

Linked Data as a Starting Point for Knowledge Discovery 1 Michel Dumontier, Ph.D. Associate Professor of Medicine (Biomedical Informatics) Stanford University.

Similar presentations


Presentation on theme: "Linked Data as a Starting Point for Knowledge Discovery 1 Michel Dumontier, Ph.D. Associate Professor of Medicine (Biomedical Informatics) Stanford University."— Presentation transcript:

1 Linked Data as a Starting Point for Knowledge Discovery 1 Michel Dumontier, Ph.D. Associate Professor of Medicine (Biomedical Informatics) Stanford University @micheldumontier::CSWR:Jan-2015

2 Scientific knowledge is growing at an incredible rate (but hard to use to directly answer questions) 2 @micheldumontier::CSWR:Jan-2015

3 Thousands of biological databases curate the literature into consumable facts (problems: access, format, identifiers & linking) 3 @micheldumontier::CSWR:Jan-2015

4 Specialized software is also needed to analyze, predict and evaluate (problems: OS, versioning, input/output formats) 4 @micheldumontier::CSWR:Jan-2015

5 We develop fairly sophisticated programs/workflows to uncover knowledge and test hypotheses 5 @micheldumontier::CSWR:Jan-2015 This currently requires substantial knowledge of the domain, and of tools and services available.

6 Wouldn’t it be great if we could just find the evidence required to support or dispute a scientific hypothesis using the most up-to-date and relevant data, tools and scientific knowledge? 6 @micheldumontier::CSWR:Jan-2015

7 So what do we need to achieve this? 1.Standards to construct and interrogate a massive, decentralized network of interconnected data and software 2.Methods and Tools – To prepare, interlink, and query data – To mine and discover associations – To identify novel, promising, or well-supported associations 3.Uptake – Publications, journals, funding agencies, institutions, societies, conferences, and workshops @micheldumontier::CSWR:Jan-2015 7

8 The Semantic Web is the new global web of knowledge 8 @micheldumontier::CSWR:Jan-2015 standards for publishing, sharing and querying facts, expert knowledge and services scalable approach for the discovery of independently formulated and distributed knowledge

9 We are building a massive network of linked data 9 @micheldumontier::CSWR:Jan-2015 Linking Open Data cloud diagram 2014, by Max Schmachtenberg, Christian Bizer, Anja Jentzsch and Richard Cyganiak. http://lod-cloud.net/"

10 @micheldumontier::CSWR:Jan-2015 Linked Data for the Life Sciences 10 Bio2RDF is an open source project to unify the representation and interlinking of biological data using RDF. chemicals/drugs/formulations, genomes/genes/proteins, domains Interactions, complexes & pathways animal models and phenotypes Disease, genetic markers, treatments Terminologies & publications Release 3 (June 2014): 11B+ interlinked statements from 35 biomedical datasets dataset description, provenance & statistics Partnerships with EBI, NCBI, DBCLS, NCBO, OpenPHACTS, and commercial tool providers

11 Resource Description Framework A knowledge representation language that is good for – Describing knowledge in terms of types, attributes, relations – Integrating data from different sources – Answering questions using a standard query language (SPARQL) – Publishing and linking to other data on the web Key is to reuse what is available, develop what you need, and contribute your data to the network. @micheldumontier::CSWR:Jan-2015 11

12 @micheldumontier::CSWR:Jan-2015 12 The URI is a name for a concept, relation or individual Bio2RDF Basics – the assertion It contains a base URI, a namespace, delimiter, and a resource identifier The following is the URI for Diclofenac, a drug in the DrugBank dataset For convenience, we will use the CURIE as a shorthand notation for the base URI + namespace drugbank:DB00586 PREFIX drugbank: The namespace is drawn from a registry of 2100 datasets. We also create two additional supporting namespaces: vocabulary and resource The vocabulary namespace is used for types and relations The resource namespace is used for generated data identifiers

13 @micheldumontier::CSWR:Jan-2015 dct:title "Diclofenac" 13 Bio2RDF Basics – the assertion We can annotate the URI with a human readable label using the “title” annotation property from Dublin Core Metadata Initiative’s terminology (prefix - dct) drugbank:DB00586 We have our first statement!

14 @micheldumontier::CSWR:Jan-2015 14 Our first assertion can be serialized into a variety of formats: RDF Basics – serialization “Diclofenac”@en. N-Triples Turtle RDF/XML PREFIX drugbank:. PREFIX dct:. drugbank:DB00586 dct:title "Diclofenac" @en. diclofenac

15 @micheldumontier::CSWR:Jan-2015 drugbank:DB00586 drugbank_vocabulary:Drug rdf:type dct:title "diclofenac" 15 Bio2RDF Basics – typing and labeling dct:title "Drug" All data items should be typed to a more general class of objects, and labeled for human consumption. PREFIX rdf:. PREFIX rdfs:. PREFIX dct:. PREFIX drugbank:. PREFIX drugbank_vocabulary:. PREFIX drugbank_resource:.

16 @micheldumontier::CSWR:Jan-2015 drugbank:290 drugbank_vocabulary:targets "Prostaglandin G/H synthase 2" dct:title 16 Bio2RDF Basics – use object properties to relate data items drugbank:DB00586 dct:title "diclofenac" PREFIX rdf:. PREFIX rdfs:. PREFIX dct:. PREFIX drugbank:. PREFIX drugbank_vocabulary:. PREFIX drugbank_resource:.

17 @micheldumontier::CSWR:Jan-2015 drugbank_resource:DB00586_DB00091 17 Bio2RDF Basics – convert n-ary relations into objects drugbank:DB00586 Diclofenac is involved in a drug-drug interaction (“monitor for nephrotoxicity”) with Cyclosporine, but there is no identifier for the interaction. drugbank_vocabulary:ddi-interactor-in rdf:type PREFIX rdf:. PREFIX rdfs:. PREFIX dct:. PREFIX drugbank:. PREFIX drugbank_vocabulary:. PREFIX drugbank_resource:. drugbank:DB00091 Drugbank_vocabulary:Drug-Drug-Interaction

18 @micheldumontier::CSWR:Jan-2015 drugbank_resource:14523e2498086b9f99333997452e7119 "Patheon Inc." dct:title 18 Bio2RDF Basics – transform names into labeled resources drugbank:DB00586 Patheon Inc. is a packager for diclofenac drugbank_vocabulary:packager rdf:type drugbank_vocabulary:Packager "Packager" dct:title PREFIX rdf:. PREFIX rdfs:. PREFIX dct:. PREFIX drugbank:. PREFIX drugbank_vocabulary:. PREFIX drugbank_resource:.

19 @micheldumontier::CSWR:Jan-2015 rdf:type drugbank_vocabulary:Target 19 Bio2RDF Basics – Building the knowledge graph drugbank:DB00586 drugbank_vocabulary:Drug rdf:type dct:title "diclofenac" dct:title "Drug" drugbank:290 drugbank_vocabulary :targets "Prostaglandin G/H synthase 2" dct:title "Target" drugbank_resource: 14523e2498086b9f99333997452e7119 "Patheon Inc." dct:title rdf:type drugbank_vocabulary:Packager drugbank_vocabulary :packager dct:title “Packager"

20 Linking Data case 1: using source provided links @micheldumontier::CSWR:Jan-2015 drugbank:DB00586 pharmgkb_vocabulary:Drug rdf:type dct:title diclofenac pharmgkb:PA449293 drugbank_vocabulary:Drug pharmgkb_vocabulary:x-drugbank diclofenac dct:title DrugBank PharmGKB 20

21 Linking Data case 2: lexical matching @micheldumontier::CSWR:Jan-2015 21 LIMES (Soren Auer) LInk discovery framework for MEtric Spaces http://aksw.org/Projects/limes SILK (Chris Bizer) Link Discovery Framework for the Web of Data http://www4.wiwiss.fu-berlin.de/bizer/silk/

22 We are building a highly connected network of data @micheldumontier::CSWR:Jan-2015 22

23 @micheldumontier::CSWR:Jan-2015 drugbank_vocabulary:Target 23 Data-driven schema generation drugbank_vocabulary:Drug drugbank_vocabulary :drug drugbank_vocabulary :packager drugbank_vocabulary:Drug-Drug-Interaction drugbank_vocabulary :targets drugbank_vocabulary:Packager

24 @micheldumontier::CSWR:Jan-2015 24

25 @micheldumontier::CSWR:Jan-2015 25

26 @micheldumontier::CSWR:Jan-2015 26

27 Graph summarization for query formulation @micheldumontier::CSWR:Jan-2015 PREFIX drugbank_vocabulary: PREFIX rdfs: SELECT ?ddi ?d1name ?d2name WHERE { ?ddi a drugbank_vocabulary:Drug-Drug-Interaction. ?d1 drugbank_vocabulary:ddi-interactor-in ?ddi. ?d1 rdfs:label ?d1name. ?d2 drugbank_vocabulary:ddi-interactor-in ?ddi. ?d2 rdfs:label ?d2name. FILTER (?d1 != ?d2) } 27

28 You can use query assistants @micheldumontier::CSWR:Jan-2015 http://sindicetech.com/sindice-suite/sparqled/ graph: http://sindicetech.com/analytics 28

29 Federated Queries over Independent SPARQL EndPoints Get all protein catabolic processes (and more specific) in biomodels query against SELECT ?go ?label count(distinct ?x) WHERE { ?go rdfs:label ?label. ?go rdfs:subClassOf ?tgo ?tgo rdfs:label ?tlabel. FILTER regex(?tlabel, "^protein catabolic process") service { ?x ?go. ?x a. } @micheldumontier::CSWR:Jan-2015 29

30 Despite all the data, it’s still hard to find answers to questions Because there are many ways to represent the same data and each dataset represents it differently @micheldumontier::CSWR:Jan-2015 30

31 Massive Proliferation of Ontologies / Vocabularies @micheldumontier::CSWR:Jan-2015 31

32 Multi-Stakeholder Efforts to Standardize Representations are Reasonable, Long Term Strategies for Data Integration @micheldumontier::CSWR:Jan-2015 32

33 @micheldumontier::CSWR:Jan-2015 33 http://tiny.cc/hcls-datadesc-ed

34 Three Component Metadata Model: description – version - distribution @micheldumontier::CSWR:Jan-2015 34

35 61 metadata elements Core Identifiers Title Description Attribution Homepage License Language Keywords Concepts and vocabularies used Standards Publication Provenance and Change Version number Source Provenance: retrieved from, derived from, created with Frequency of change Availability Format Download URL Landing page SPARQL endpoint 13 Content Statistics With SPARQL queries @micheldumontier::CSWR:Jan-2015 35

36 multiple formalizations of the same kind of data do emerge, each with their own merit @micheldumontier::CSWR:Jan-2015 36

37 uniprot:P05067 uniprot:Protein is a sio:gene is a Semantic data integration, consistency checking and query answering over Bio2RDF with the Semanticscience Integrated Ontology (SIO) dataset ontology Knowledge Base @micheldumontier::CSWR:Jan-2015 pharmgkb:PA30917 refseq:Protein is a omim:189931 omim:Genepharmgkb:Gene Querying Bio2RDF Linked Open Data with a Global Schema. Alison Callahan, José Cruz-Toledo and Michel Dumontier. Bio-ontologies 2012. 37

38 @micheldumontier::CSWR:Jan-2015 38 SRIQ(D) 10700+ axioms 1300+ classes 201 object properties (inc. inverses) 1 datatype property

39 Bio2RDF and SIO powered SPARQL 1.1 federated query: Find chemicals (from CTD) and proteins (from SGD) that participate in the same process (from GOA) SELECT ?chem, ?prot, ?proc FROM WHERE { SERVICE { ?chemical a sio:chemical-entity. ?chemical rdfs:label ?chem. ?chemical sio:is-participant-in ?process. ?process rdfs:label ?proc. FILTER regex (?process, "http://bio2rdf.org/go:") } SERVICE { ?protein a sio:protein. ?protein sio:is-participant-in ?process. ?protein rdfs:label ?prot. } @micheldumontier::CSWR:Jan-2015 39

40 tactical formalization @micheldumontier::CSWR:Jan-2015 40 Take what you need and represent it in a way that directly serves your objective STANDARDS USER DRIVEN REPRESENTATION identifying aberrant and pharmacological pathways predicting drug targets using organism phenotypes Biopax-pathway exploration FALDO-powered genome navigation

41 aberrant and pharmacological pathways Q1. Can we identify pathways that are associated with a particular disease or class of diseases? Q2. Can we identify pathways are associated with a particular drug or class of drugs? drug pathway disease @micheldumontier::CSWR:Jan-2015 41 gene

42 Identification of drug and disease enriched pathways Approach – Integrate 3 datasets DrugBank, PharmGKB and CTD – Integrate 7 terminologies MeSH, ATC, ChEBI, UMLS, SNOMED, ICD, DO – Formalize data of interest using a pre-defined pattern – Identify significant associations using enrichment analysis over the fully inferred knowledge base Identifying aberrant pathways through integrated analysis of knowledge in pharmacogenomics. Bioinformatics. 2012. @micheldumontier::CSWR:Jan-2015 42

43 Formal knowledge representation as a strategy for data integration @micheldumontier::CSWR:Jan-2015 43

44 Have you heard of OWL? @micheldumontier::CSWR:Jan-2015 44

45 mercaptopurine [pharmgkb:PA450379] mercaptopurine [drugbank:DB01033] purine-6-thiol [CHEBI:2208] Class Equivalence mercaptopurine [ATC:L01BB02] Top Level Classes (disjointness) drugdiseasegenepathway property chains Class subsumption mercaptopurine [mesh:D015122] Reciprocal Existentials drug disease pathway gene Formalized as an OWL-EL ontology 650,000+ classes, 3.2M subClassOf axioms, 75,000 equivalentClass axioms @micheldumontier::CSWR:Jan-2015 45

46 Benefits: Enhanced Query Capability – Use any mapped terminology to query a target resource. – Use knowledge in target ontologies to formulate more precise questions ask for drugs that are associated with diseases of the joint: ‘Chikungunya’ (do:0050012) is defined as a viral infectious disease located in the ‘joint’ (fma:7490) and caused by a ‘Chikungunya virus’ (taxon:37124). – Learn relationships that are inferred by automated reasoning. alcohol (ChEBI:30879) is associated with alcoholism (PA443309) since alcoholism is directly associated with ethanol (CHEBI:16236) ‘parasitic infectious disease’ (do:0001398) associated with 129 drugs, 15 more than are directly linked. @micheldumontier::CSWR:Jan-2015 46

47 Knowledge Discovery through Data Integration and Enrichment Analysis OntoFunc: Tool to discover significant associations between sets of objects and ontology categories. Enrichment of attribute among a selected set of input items as compared to a reference set. hypergeometric or the binomial distribution, Fisher's exact test, or a chi-square test. We found 22,653 disease-pathway associations, where for each pathway we find genes that are linked to disease. – Mood disorder (do:3324) associated with Zidovudine Pathway (pharmgkb:PA165859361). Zidovudine is for treating HIV/AIDS. Side effects include fatigue, headache, myalgia, malaise and anorexia We found 13,826 pathway-chemical associations – Clopidogrel (chebi:37941) associated with Endothelin signaling pathway (pharmgkb:PA164728163). Endothelins are proteins that constrict blood vessels and raise blood pressure. Clopidogrel inhibits platelet aggregation and prolongs bleeding time. @micheldumontier::CSWR:Jan-2015 47

48 PhenomeDrug A computational approach to predict drug targets, drug effects, and drug indications using phenotypes @micheldumontier::CSWR:Jan-2015 48 Mouse model phenotypes provide information about human drug targets. Hoehndorf R, Hiebert T, Hardy NW, Schofield PN, Gkoutos GV, Dumontier M. Bioinformatics. 2013.

49 animal models provide insight for on target effects In the majority of 100 best selling drugs ($148B in US alone), there is a direct correlation between knockout phenotype and drug effect Immunological Indications – Anti-histamines (Claritin, Allegra, Zyrtec) – KO of histamine H 1 receptor leads to decreased responsiveness of immune system – Predicts on target effects : drowsiness, reduced anxiety @micheldumontier::CSWR:Jan-2015 49 Zambrowicz and Sands. Nat Rev Drug Disc. 2003.

50 Identifying drug targets from mouse knock-out phenotypes @micheldumontier::CSWR:Jan-2015 50 drug gene phenotypes effects human gene non-functional gene model ortholog similar inhibits Main idea: if a drug’s phenotypes matches the phenotypes of a null model, this suggests that the drug is an inhibitor of the gene

51 @micheldumontier::CSWR:Jan-2015 51 B6.Cg-Alms1 foz/fox /J increased weight, adipose tissue volume, glucose homeostasis altered ALSM1(NM_015120.4) [c.10775delC] + [-] GENOTYPE PHENOTYPE obesity, diabetes mellitus, insulin resistance increased food intake, hyperglycemia, insulin resistance kcnj11 c14/c14 ; insr t143/+ (AB) ?

52 Terminological Interoperability (we must compare apples with apples) Mouse Phenotypes Drug effects (mappings from UMLS to DO, NBO, MP) Mammalian Phenotype Ontology PhenomeNet PhenomeDrug @micheldumontier::CSWR:Jan-2015

53 Semantic Similarity @micheldumontier::CSWR:Jan-2015 53 Given a drug effect profile D and a mouse model M, we compute the semantic similarity as an information weighted Jaccard metric. The similarity measure used is non-symmetrical and determines the amount of information about a drug effect profile D that is covered by a set of mouse model phenotypes M.

54 Loss of function models predict targets of inhibitor drugs 14,682 drugs; 7,255 mouse genotypes Validation against known and predicted inhibitor-target pairs – 0.76 ROC AUC for human targets (DrugBank) – 0.81 ROC AUC for mouse targets (STITCH) diclofenac (STITCH:000003032) – NSAID used to treat pain, osteoarthritis and rheumatoid arthritis – Drug effects include liver inflammation (hepatitis), swelling of liver (hepatomegaly), redness of skin (erythema) – 49% explained by PPARg knockout peroxisome proliferator activated receptor gamma (PPARg) regulates metabolism, proliferation, inflammation and differentiation, Diclofenac is a known inhibitor – 46% explained by COX-2 knockout Diclofenac is a known inhibitor @micheldumontier::CSWR:Jan-2015

55 Using the Semantic Web to Gather Evidence for Scientific Hypotheses What evidence supports or disputes that TKIs are cardiotoxic? @micheldumontier::CSWR:Jan-2015 55

56 Tyrosine Kinase Inhibitors (TKI) – Imatinib, Sorafenib, Sunitinib, Dasatinib, Nilotinib, Lapatinib – Used to treat cancer – Linked to cardiotoxicity. FDA launched drug safety program to detect toxicity – Need to integrate data and ontologies (Abernethy, CPT 2011) – Abernethy (2013) suggest using public data in genetics, pharmacology, toxicology, systems biology, to predict/validate adverse events What evidence could we gather to give credence that TKI’s causes non-QT cardiotoxicity? @micheldumontier::CSWR:Jan-2015 56 FDA Use Case: TKI non-QT Cardiotoxicity

57 @micheldumontier::CSWR:Jan-2015 57 Jane P.F. Bai and Darrell R. Abernethy. Systems Pharmacology to Predict Drug Toxicity: Integration Across Levels of Biological Organization. Annu. Rev. Pharmacol. Toxicol. 2013.53:451-473

58 The goal of HyQue is retrieve and evaluate evidence that supports/disputes a hypothesis – hypotheses are described as a set of events e.g. binding, inhibition, phenotypic effect – events are associated with types of evidence a query is written to retrieve data a weight is assigned to provide significance Hypotheses are written by people who seek answers data retrieval rules are written by people who know the data and how it should be interpreted @micheldumontier::CSWR:Jan-2015 58 HyQue 1. HyQue: Evaluating hypotheses using Semantic Web technologies. J Biomed Semantics. 2011 May 17;2 Suppl 2:S3. 2. Evaluating scientific hypotheses using the SPARQL Inferencing Notation. Extended Semantic Web Conference (ESWC 2012). Heraklion, Crete. May 27-31, 2012.

59 HyQue: A Semantic Web Application @micheldumontier::CSWR:Jan- 2015 59 Software OntologiesData HypothesisEvaluation

60 What evidence might we gather? clinical: Are there cardiotoxic effects associated with the drug? – Literature (studies) [curated db] – Product labels (studies) [r3:sider] – Clinical trials (studies) [r3:clinicaltrials] – Adverse event reports [r2:pharmgkb/onesides] – Electronic health records (observations) pre-clinical associations: – genotype-phenotype (null/disease models) [r2:mgi, r2:sgd; r3:wormbase] – in vitro assays (IC50) [r3:chembl] – drug targets [r2:drugbank; r2:ctd; r3:stitch] – drug-gene expression [r3:gxa] – pathways [r2:kegg; r3:reactome] – Drug-pathway, disease-pathway enrichments [aberrant pathways] – Chemical properties [r2:pubchem; r2.drugbank] – Toxicology [r1.toxkb/cebs] @micheldumontier::CSWR:Jan-2015 60

61 Data retrieval is done with SPARQL @micheldumontier::CSWR:Jan-2015 61

62 Data Evaluation is done with SPIN rules @micheldumontier::CSWR:Jan-2015 62

63 @micheldumontier::CSWR:Jan-2015 63

64 @micheldumontier::CSWR:Jan-2015 64 http://bio2rdf.org/drugbank:DB01268

65 @micheldumontier::CSWR:Jan-2015 65

66 @micheldumontier::CSWR:Jan-2015 66

67 @micheldumontier::CSWR:Jan-2015 67

68 In Summary This talk was about making sense out of the the structured data we already have RDF-based Linked Open Data acts as a substrate for query answering and task-based formalization in OWL Discovery through the generation of testable hypotheses in the target domain. Using Linked Data to evaluate scientific hypotheses @micheldumontier::CSWR:Jan-2015 68

69 Looking to the Future Community guidelines for RDF-based data and dataset descriptions (e.g. CEDAR) Alignment and consolidation of OWL ontologies (e.g. UMLS) Identifying and filling gaps in our knowledge (e.g. Adam the Robot scientist) Improving our coverage of available evidence (e.g. HyQue) More sophisticated data mining (e.g. you!) @micheldumontier::CSWR:Jan-2015 69

70 Acknowledgements Bio2RDF: Allison Callahan, Jose Cruz-Toledo, Peter Ansell W3C HCLS Dataset Descriptions: Alasdair Gray, M. Scott Marshall, Joachim Baran, and many others! Aberrant Pathways: Robert Hoehndorf, Georgios Gkoutos PhenomeDrug: Tanya Hiebert, Robert Hoehndorf, Georgios Gkoutos, Paul Schofield TKI Cardiotoxicity: Alison Callahan, Tania Hiebert, Beatriz Lujan, Sira Sarntivijai (FDA) @micheldumontier::CSWR:Jan-2015 70

71 dumontierlab.com michel.dumontier@stanford.edu Website: http://dumontierlab.comhttp://dumontierlab.com Presentations: http://slideshare.com/micheldumontierhttp://slideshare.com/micheldumontier 71 @micheldumontier::CSWR:Jan-2015


Download ppt "Linked Data as a Starting Point for Knowledge Discovery 1 Michel Dumontier, Ph.D. Associate Professor of Medicine (Biomedical Informatics) Stanford University."

Similar presentations


Ads by Google