Discovery from Linking Open Data (LOD) Annotated Datasets Louiqa Raschid University of Maryland PAnG/PSL/ANAPSID/Manjal.


1 Discovery from Linking Open Data (LOD) Annotated Datasets Louiqa Raschid University of Maryland PAnG/PSL/ANAPSID/Manjal

2 Agenda Motivation Challenges Solution approaches

3 Emergence of biological datasets in the cloud of Linked Data. Biological objects (e.g., genes or proteins) or clinical trials are annotated with controlled vocabulary terms from ontologies such as GO, MeSH, SNOMED, NCI Thesaurus. Links form a graph that captures meaningful knowledge. Sense making of annotation graphs can explain phenomena, identify anomalies and potentially lead to discovery.

4 Agenda Motivation – Drug re-purposing – Cross ontology patterns and literature imprint – Cross genome analysis Challenges Solution approaches

5

6 Signature: Set of mRNAs whose levels increase or decrease in patients, where the change is significant w.r.t. the general population. Compute a similarity score in [-1, +1].
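The slide does not say how the similarity score is computed. As a minimal sketch only, the code below swaps in Pearson correlation over the genes shared by two signatures, which also yields a value in [-1, +1]. The function name `signature_similarity`, the gene labels, and the expression values are all hypothetical and are not taken from Sirota et al.

```python
from math import sqrt

def signature_similarity(disease, drug):
    """Pearson correlation over genes present in both signatures.

    Stand-in for the slide's unspecified score. Each signature maps a
    gene name to a signed expression change (up > 0, down < 0).
    """
    shared = sorted(set(disease) & set(drug))
    if len(shared) < 2:
        raise ValueError("need at least two shared genes")
    x = [disease[g] for g in shared]
    y = [drug[g] for g in shared]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant signature has no defined correlation")
    # Clamp to guard against floating-point drift just outside [-1, 1].
    return max(-1.0, min(1.0, cov / sqrt(var_x * var_y)))

# Hypothetical toy signatures: the drug reverses every disease change.
disease_sig = {"G1": 2.0, "G2": -1.5, "G3": 0.5}
drug_sig = {"G1": -2.0, "G2": 1.5, "G3": -0.5}
score = signature_similarity(disease_sig, drug_sig)
```

Under this stand-in, a score near -1 marks a drug whose signature opposes the disease signature, which is the kind of opposite relationship the following slide counts as a candidate therapeutic pairing.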

7 Of 16,000 pairings, 2664 were significant (q < 0.05); half with an opposite relationship. 53 diseases had significant candidate therapeutic drug-disease relationships.

8 Sirota et al. Findings Efficacy (literature) for 2 drugs: topiramate and prednisolone. Evaluated efficacy of cimetidine (over gefitinib) for lung adenocarcinoma. Methodology does not provide avenues for explanation, validation or discovery.

9 Sirota et al: Identified anomaly in this cluster

10

11

12 Limitations and Extensions Sirota et al. – Anomaly in drug cluster but their methodology does not allow further investigation. Sims et al. – Methodology is limited to co-occurrence analysis. Cannot exploit heterogeneous evidence from LOD sources. Cannot exploit knowledge in ontologies. Finding patterns in graph datasets and visualization and explanation.

13 Agenda Challenges – Exploiting LOD to create datasets. – Knowledge captured in ontologies. – Similarity metrics/distances tuned for ontologies. – Discovering and validating patterns in graphs. – Literature imprint. – Heterogeneous evidence. – Reasoning with uncertainty.

14 Solution Approaches: PAnG, PSL, Manjal, ANAPSID. Thanks to our collaborators / domain experts: Olivier Bodenreider, NLM, NIH; Sherri de Coronado, NCI, NIH; Andreas Thor, University of Leipzig; Louiqa Raschid ++ at UMD; Lise Getoor ++ at UMD; Padmini Srinivasan ++, University of Iowa; Maria Esther Vidal ++, Universidad Simon Bolivar.

15 Solution approaches Integrated access for heterogeneous data sources: adaptive query processing for SPARQL endpoints (The Arabidopsis Information Resource, Gene Ontology, Clinical Trials). PAnG (Patterns in ANnotation Graphs): pattern identification using dense subgraphs and graph summaries. PSL: annotation computation by knowledge propagation. Manjal – Text mining for MEDLINE. Annotation Visualizer – Visualize and explore annotations and patterns.

16

17 Motivation: Gene Annotation Graphs Genes are annotated with Gene Ontology (GO) and Plant Ontology (PO) terms. Prediction of new annotations as hypotheses for experiments – Link prediction is predicting new functional annotations for a gene [Figure label: Annotations]

18 Link Prediction Framework Dense Subgraph (optional) – Focus on highly connected subgraphs Graph summarization: – Identify basic pattern (structure) of the graph Link Prediction – Predicted links reinforce underlying graph pattern [Figure: pipeline from a Tripartite Annotation Graph (TAG) through Dense Subgraph (distance restriction, dense subgraph filter), Graph Summarization (cost model, graph summary) and Link Prediction (scoring function) to a ranked list of predicted links]
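One possible reading of "predicted links reinforce underlying graph pattern" is sketched below: when two groups of nodes are almost completely connected, the few missing edges between them are proposed as predictions. The slide names a scoring function but does not define it, so the score here (the fraction of edges already present between the two groups) is a hypothetical stand-in, as are the function name `predict_links` and the toy edges.

```python
from itertools import product

def predict_links(graph_edges, groups, group_links):
    """Rank absent edges that would complete a near-complete group pair.

    graph_edges: iterable of (u, v) pairs in the observed graph.
    groups:      dict mapping a group name to a set of nodes.
    group_links: iterable of (group, group) pairs treated as the pattern.
    Returns a list of (score, (u, v)) sorted by descending score.
    """
    observed = {frozenset(e) for e in graph_edges}
    ranked = []
    for s, t in group_links:
        pairs = [frozenset((u, v))
                 for u, v in product(groups[s], groups[t]) if u != v]
        if not pairs:
            continue
        present = sum(1 for p in pairs if p in observed)
        score = present / len(pairs)  # assumed stand-in scoring function
        for p in pairs:
            if p not in observed:
                ranked.append((score, tuple(sorted(p))))
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return ranked

# Toy data: node names reuse labels from the deck, edges are made up.
groups = {"genes": {"CRY1", "CRY2", "HY5"}, "terms": {"PO_20030", "PO_9006"}}
edges = [("CRY1", "PO_20030"), ("CRY1", "PO_9006"), ("CRY2", "PO_20030"),
         ("HY5", "PO_20030"), ("HY5", "PO_9006")]
ranked = predict_links(edges, groups, [("genes", "terms")])
```

In this toy case the only candidate is the CRY2 to PO_9006 annotation, because it is the single edge missing from an otherwise complete pattern.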

19 Dense Subgraph Motivation: graph area that is rich or dense with annotation is an “interesting region” Density of a subgraph = number of induced edges / number of vertices Tripartite graph with node set (A, B, C) is converted into bipartite graph with (A, C) – Weighted edges = number of shared b’s – Apply technique of [1] Distance restriction for DSG possible – Hierarchically arranged ontology terms – All node pairs of A and C are within a given distance [1] Saha et al. Dense subgraphs with restrictions and applications to gene annotation graphs. RECOMB, 2010
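The two definitions on this slide can be sketched in a few lines: projecting a tripartite graph (A, B, C) onto a bipartite graph (A, C) whose edge weights count shared b's, and computing density as induced edges over vertices. This is not the algorithm of [1], which searches for the densest subgraph; it only evaluates a given one. The function names, the `use_weights` option, and the GO term labels `GO_x` and `GO_y` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

def project_tripartite(ab_edges, bc_edges):
    """Collapse A-B and B-C edges into weighted A-C edges.

    The weight of (a, c) is the number of b nodes linked to both.
    """
    a_by_b = defaultdict(set)
    c_by_b = defaultdict(set)
    for a, b in ab_edges:
        a_by_b[b].add(a)
    for b, c in bc_edges:
        c_by_b[b].add(c)
    weights = defaultdict(int)
    for b, a_set in a_by_b.items():
        for a, c in product(a_set, c_by_b.get(b, ())):
            weights[(a, c)] += 1
    return dict(weights)

def density(weights, a_nodes, c_nodes, use_weights=False):
    """Number of induced edges divided by number of vertices.

    With use_weights=True the edge weights are summed instead of
    counted (an assumed variant, not stated on the slide).
    """
    a_nodes, c_nodes = set(a_nodes), set(c_nodes)
    if not a_nodes and not c_nodes:
        raise ValueError("subgraph has no vertices")
    induced = [w for (a, c), w in weights.items()
               if a in a_nodes and c in c_nodes]
    total = sum(induced) if use_weights else len(induced)
    return total / (len(a_nodes) + len(c_nodes))

# Toy data: A = PO terms, B = genes, C = hypothetical GO terms.
po_gene = [("PO_20030", "HY5"), ("PO_20030", "CRY1"), ("PO_9006", "CRY1")]
gene_go = [("HY5", "GO_x"), ("CRY1", "GO_x"), ("CRY1", "GO_y")]
weights = project_tripartite(po_gene, gene_go)
full = density(weights, {"PO_20030", "PO_9006"}, {"GO_x", "GO_y"})
```

Here PO_20030 and GO_x share two genes (HY5 and CRY1), so that edge has weight 2, while the other three projected edges have weight 1.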

20 Graph Summarization Minimum description length approach [2] – Loss-free; employs cost model Graph summary = Signature + Corrections Signature: graph pattern / structure – Super nodes = complete partitioning of nodes – Super edges = edges between super nodes = all edges between nodes of super nodes Corrections: edges e between individual nodes – Additions: e ∈ G but e ∉ signature – Deletions: e ∉ G but e ∈ signature [2] Navlakha et al. Graph summarization with bounded error. SIGMOD, 2008 [Figure: example graph over PO terms PO_20030, PO_9006, PO_37, PO_20038 and genes HY5, PHOT1, CIB5, CRY2, COP1, CRY1, shown alongside its graph summary]
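To make "Graph summary = Signature + Corrections" concrete, the sketch below expands a signature into the edges it implies, derives the additions and deletions for a given graph, and checks that the graph is recovered exactly (the loss-free property). It does not search for a good partitioning, which is the hard part addressed in [2]. The function names are assumptions, the cost is assumed to be the number of super edges plus the number of corrections, and the toy edges are made up rather than read off the slide's figure.

```python
from itertools import product

def implied_edges(supernodes, superedges):
    """All node-level edges implied by the super edges of a signature."""
    edges = set()
    for s, t in superedges:
        for u, v in product(supernodes[s], supernodes[t]):
            if u != v:
                edges.add(frozenset((u, v)))
    return edges

def corrections(graph_edges, supernodes, superedges):
    """Additions (in G, not in signature) and deletions (the reverse)."""
    g = {frozenset(e) for e in graph_edges}
    implied = implied_edges(supernodes, superedges)
    return g - implied, implied - g

def expand_summary(supernodes, superedges, additions, deletions):
    """Rebuild the original edge set from signature plus corrections."""
    return (implied_edges(supernodes, superedges) - set(deletions)) | set(additions)

# Toy graph: node names reuse labels from the deck, edges are made up.
supernodes = {
    "S1": {"HY5", "CRY1", "CRY2"},
    "S2": {"PO_20030", "PO_9006"},
    "S3": {"PHOT1"},
    "S4": {"PO_37"},
}
superedges = [("S1", "S2")]
graph = [("HY5", "PO_20030"), ("HY5", "PO_9006"), ("CRY1", "PO_20030"),
         ("CRY1", "PO_9006"), ("CRY2", "PO_20030"), ("PHOT1", "PO_37")]

additions, deletions = corrections(graph, supernodes, superedges)
rebuilt = expand_summary(supernodes, superedges, additions, deletions)
summary_cost = len(superedges) + len(additions) + len(deletions)
```

In this toy case the summary needs one super edge, one addition (PHOT1 to PO_37) and one deletion (CRY2 to PO_9006), for an assumed cost of 3 against 6 edges in the original graph.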

21 [Figure: the example graph over PO terms PO_20030, PO_9006, PO_37, PO_20038 and genes HY5, PHOT1, CIB5, CRY2, COP1, CRY1 with its graph summary; panels labeled DSG+GS and PSL]

22

23 Distance metrics

24

25 Different retrieved sets of lung cancer related clinical trials Identify 100 clinical trials using the search keyword “lung cancer” in CONDITION. Retrieve CT, CONDITION and INTERVENTION. Created a dense subgraph (almost clique; highly connected subgraph). Created a graph summary to visualize the output. Retrieve 100 trials using “lung carcinoma” in the CONDITION field. Retrieve 100 trials using “lung carcinoma” in any field.

26 Retrieve 100 clinical trials using search keyword “lung cancer”. Created a dense subgraph (almost clique; highly connected subgraph). Created a graph summary to visualize the output.

27 100 clinical trials using search keyword “lung carcinoma” for CONDITION.

28 100 clinical trials using search keyword “lung carcinoma” for ALL FIELDS.

29 Questions? PAnG/PSL/ANAPSID/Manjal

