Discovery from Linking Open Data (LOD) Annotated Datasets Louiqa Raschid University of Maryland PAnG/PSL/ANAPSID/Manjal.



Agenda
– Motivation
– Challenges
– Solution approaches

Emergence of biological datasets in the cloud of Linked Data. Biological objects (e.g., genes or proteins) or clinical trials are annotated with controlled vocabulary terms from ontologies such as GO, MeSH, SNOMED, NCI Thesaurus. Links form a graph that captures meaningful knowledge. Sense making of annotation graphs can explain phenomena, identify anomalies and potentially lead to discovery.

Agenda
– Motivation
  – Drug re-purposing
  – Cross ontology patterns and literature imprint
  – Cross genome analysis
– Challenges
– Solution approaches

Signature: the set of mRNAs whose levels increase or decrease in patients and whose change is significant with respect to the general population. Compute a similarity score in [-1, +1].
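A minimal sketch of how a score in [-1, +1] can be computed between two signatures. It is an illustration only: the slide does not say how the score is defined, so the signed-agreement measure below, the function name `signature_similarity`, and the gene labels are all assumptions, not the scoring used by Sirota et al.

```python
def signature_similarity(disease_sig, drug_sig):
    """Signed agreement between two expression signatures, in [-1, +1].

    Each signature maps a gene symbol to +1 (increased) or -1 (decreased).
    Assumption: only genes present in both signatures are compared.
    """
    shared = set(disease_sig) & set(drug_sig)
    if not shared:
        return 0.0
    # A shared gene contributes +1 when the directions agree, -1 when they oppose.
    return sum(disease_sig[g] * drug_sig[g] for g in shared) / len(shared)


# Hypothetical signatures; G1..G5 are placeholder gene names.
disease = {"G1": +1, "G2": +1, "G3": -1, "G4": -1}
drug = {"G1": -1, "G2": -1, "G3": +1, "G4": -1, "G5": +1}
score = signature_similarity(disease, drug)
```

Under this assumed measure, a negative score means the drug signature mostly points in the opposite direction to the disease signature.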

Of 16,000 pairings, 2,664 were significant (q < 0.05), half of them with an opposite relationship. 53 diseases had significant candidate therapeutic drug-disease relationships.

Sirota et al.: Findings. Efficacy (literature) for 2 drugs: topiramate and prednisolone. Evaluated efficacy of cimetidine (over gefitinib) for lung adenocarcinoma. Methodology does not provide avenues for explanation, validation or discovery.

Sirota et al.: Identified an anomaly in this cluster.

Limitations and Extensions
– Sirota et al.: anomaly in drug cluster, but their methodology does not allow further investigation.
– Sims et al.: methodology is limited to co-occurrence analysis.
– Cannot exploit heterogeneous evidence from LOD sources.
– Cannot exploit knowledge in ontologies.
– Finding patterns in graph datasets, with visualization and explanation.

Agenda
– Challenges
  – Exploiting LOD to create datasets.
  – Knowledge captured in ontologies.
  – Similarity metrics/distances tuned for ontologies.
  – Discovering and validating patterns in graphs.
  – Literature imprint.
  – Heterogeneous evidence.
  – Reasoning with uncertainty.

Solution Approaches: PAnG, PSL, Manjal, ANAPSID
Thanks to our collaborators / domain experts:
– Olivier Bodenreider, NLM, NIH
– Sherri de Coronado, NCI, NIH
– Andreas Thor, University of Leipzig
– Louiqa Raschid ++ at UMD
– Lise Getoor ++ at UMD
– Padmini Srinivasan ++ at University of Iowa
– Maria Esther Vidal ++ at Universidad Simón Bolívar

Solution approaches
– Integrated access for heterogeneous data sources: adaptive query processing for SPARQL endpoints.
– Data sources shown: The Arabidopsis Information Resource, Gene Ontology, Clinical Trials.
– PAnG (Patterns in ANnotation Graphs): pattern identification using dense subgraphs and graph summaries.
– PSL: annotation computation by knowledge propagation.
– Manjal: text mining for MEDLINE.
– Annotation Visualizer: visualize and explore annotations and patterns.

Motivation: Gene Annotation Graphs
– Genes are annotated with Gene Ontology (GO) and Plant Ontology (PO) terms.
– Prediction of new annotations as hypotheses for experiments.
  – Link prediction here means predicting new functional annotations for a gene.

Link Prediction Framework
– Dense subgraph (optional): focus on highly connected subgraphs.
– Graph summarization: identify the basic pattern (structure) of the graph.
– Link prediction: predicted links reinforce the underlying graph pattern.
[Figure: pipeline from a Tripartite Annotation Graph (TAG) through a dense subgraph step (distance restriction, dense subgraph filter) and graph summarization (cost model, graph summary) to link prediction (scoring function), producing a ranked list of predicted links.]
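One possible reading of "predicted links reinforce the underlying graph pattern" is sketched below: node pairs that fall inside a dense group-to-group block of the graph, but are not yet linked, become candidate links. The scoring function (the fraction of pairs in the block that are already present), the function name `predict_links`, and all node names are assumptions made for illustration; the slide does not specify the actual scoring function.

```python
from itertools import product


def predict_links(edges, super_nodes, super_edges):
    """Rank absent node pairs that lie inside a block between two node groups.

    edges:       iterable of (u, v) pairs in the annotation graph (undirected)
    super_nodes: dict mapping a group name to its set of member nodes
    super_edges: iterable of (group, group) pairs treated as graph patterns
    Assumed score: share of the block's node pairs that are already linked.
    """
    present = {frozenset(e) for e in edges}
    ranked = []
    for s, t in super_edges:
        pairs = [(u, v)
                 for u, v in product(sorted(super_nodes[s]), sorted(super_nodes[t]))
                 if u != v]
        if not pairs:
            continue
        linked = sum(frozenset(p) in present for p in pairs)
        score = linked / len(pairs)
        # Every missing pair in the block inherits the block's score.
        ranked.extend((score, u, v) for u, v in pairs
                      if frozenset((u, v)) not in present)
    ranked.sort(key=lambda r: (-r[0], r[1], r[2]))
    return ranked


# Hypothetical genes (g*) and ontology terms (t*).
edges = [("g1", "t1"), ("g1", "t2"), ("g2", "t1"), ("g2", "t2"), ("g3", "t1"),
         ("g4", "t3"), ("g5", "t4")]
super_nodes = {"genes_a": {"g1", "g2", "g3"}, "terms_a": {"t1", "t2"},
               "genes_b": {"g4", "g5"}, "terms_b": {"t3", "t4"}}
super_edges = [("genes_a", "terms_a"), ("genes_b", "terms_b")]
ranked = predict_links(edges, super_nodes, super_edges)
```

In this toy input the pair (g3, t2) ranks first because it is the only gap in an otherwise complete block, so adding it would strengthen an existing pattern.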

Dense Subgraph
– Motivation: a graph area that is rich or dense with annotation is an “interesting region”.
– Density of a subgraph = number of induced edges / number of vertices.
– A tripartite graph with node set (A, B, C) is converted into a bipartite graph with (A, C).
  – Weighted edges = number of shared b’s.
  – Apply the technique of [1].
– Distance restriction for DSG is possible.
  – Hierarchically arranged ontology terms.
  – All node pairs of A and C are within a given distance.
[1] Saha et al. Dense subgraphs with restrictions and applications to gene annotation graphs. RECOMB, 2010.
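The conversion and the density measure can be sketched as follows. This only illustrates the two definitions on the slide; it is not the dense subgraph algorithm of [1]. The function names and node labels are hypothetical, and counting an edge of weight w as w induced edges is an assumption.

```python
from collections import defaultdict
from itertools import product


def to_weighted_bipartite(ab_edges, bc_edges):
    """Collapse a tripartite graph (A-B and B-C edges) into weighted A-C edges.

    The weight of (a, c) is the number of B nodes adjacent to both a and c.
    """
    a_by_b = defaultdict(set)
    c_by_b = defaultdict(set)
    for a, b in ab_edges:
        a_by_b[b].add(a)
    for b, c in bc_edges:
        c_by_b[b].add(c)
    weights = defaultdict(int)
    for b in a_by_b.keys() & c_by_b.keys():
        for a, c in product(a_by_b[b], c_by_b[b]):
            weights[(a, c)] += 1
    return dict(weights)


def density(weights, nodes):
    """Induced edge weight divided by the number of vertices in `nodes`."""
    nodes = set(nodes)
    if not nodes:
        return 0.0
    induced = sum(w for (a, c), w in weights.items() if a in nodes and c in nodes)
    return induced / len(nodes)


# Hypothetical data: A = plant terms, B = genes, C = function terms.
ab_edges = [("po_a", "g1"), ("po_a", "g2"), ("po_b", "g2")]
bc_edges = [("g1", "go_x"), ("g2", "go_x"), ("g2", "go_y")]
weights = to_weighted_bipartite(ab_edges, bc_edges)
whole = density(weights, ["po_a", "po_b", "go_x", "go_y"])
pair = density(weights, ["po_a", "go_x"])
```

Here `po_a` and `go_x` share two genes, so their edge has weight 2, and the four-node graph has density (2 + 1 + 1 + 1) / 4.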

Graph Summarization
– Minimum description length approach [2].
  – Loss-free; employs a cost model.
– Graph summary = Signature + Corrections.
– Signature: graph pattern / structure.
  – Super nodes = complete partitioning of nodes.
  – Super edges = edges between super nodes; a super edge stands for all edges between the nodes of its two super nodes.
– Corrections: edges e between individual nodes.
  – Additions: e ∈ G but e ∉ signature.
  – Deletions: e ∉ G but e ∈ signature.
[2] Navlakha et al. Graph summarization with bounded error. SIGMOD, 2008.
[Figure: example graph over the terms PO_20030, PO_9006, PO_37, PO_20038 and the genes HY5, PHOT1, CIB5, CRY2, COP1, CRY1.]
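The loss-free property can be illustrated with a small sketch that rebuilds the original edge set from a signature plus corrections. The data layout, the function names, and the unit cost per super edge and per correction are assumptions made for the example, and the node names are hypothetical.

```python
from itertools import product


def expand_summary(super_nodes, super_edges, additions, deletions):
    """Rebuild the original edge set from signature + corrections.

    super_nodes: dict mapping a super node name to its set of member nodes
    super_edges: list of (super_node, super_node) pairs
    additions:   edges that are in G but not implied by the signature
    deletions:   edges implied by the signature that are not in G
    Edges are stored as frozensets, so the graph is treated as undirected.
    """
    implied = set()
    for s, t in super_edges:
        for u, v in product(super_nodes[s], super_nodes[t]):
            if u != v:
                implied.add(frozenset((u, v)))
    adds = {frozenset(e) for e in additions}
    dels = {frozenset(e) for e in deletions}
    return (implied | adds) - dels


def summary_cost(super_edges, additions, deletions):
    """Assumed cost model: one unit per super edge and per correction."""
    return len(super_edges) + len(additions) + len(deletions)


# Hypothetical graph G with four edges between genes (g*) and terms (t*).
original = {frozenset(e) for e in
            [("g1", "t1"), ("g1", "t2"), ("g2", "t1"), ("g3", "t1")]}
super_nodes = {"S1": {"g1", "g2"}, "S2": {"t1", "t2"}, "S3": {"g3"}}
super_edges = [("S1", "S2")]
additions = [("g3", "t1")]   # in G, not covered by the signature
deletions = [("g2", "t2")]   # implied by S1-S2, absent from G
rebuilt = expand_summary(super_nodes, super_edges, additions, deletions)
cost = summary_cost(super_edges, additions, deletions)
```

The rebuilt edge set equals the original, and under the assumed cost model the summary costs 3 units against 4 for listing every edge.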

[Figure: the example graph over the terms PO_20030, PO_9006, PO_37, PO_20038 and the genes HY5, PHOT1, CIB5, CRY2, COP1, CRY1, with panels labeled DSG+GS and PSL.]

Distance metrics

Different retrieved sets of lung cancer related clinical trials
– Identify 100 clinical trials using the search keyword “lung cancer” in CONDITION.
– Retrieve CT, CONDITION and INTERVENTION.
– Create a dense subgraph (almost clique; highly connected subgraph).
– Create a graph summary to visualize the output.
– Retrieve 100 trials using “lung carcinoma” in the CONDITION field.
– Retrieve 100 trials using “lung carcinoma” in any field.
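The three retrievals differ only in the keyword and in which fields are searched. The sketch below shows that difference on a few made-up records and builds the CT-CONDITION and CT-INTERVENTION edges that the later graph steps would start from. The record layout, field names, function names, and trial contents are all hypothetical; a real retrieval would query a clinical trials source.

```python
def matches(trial, keyword, field=None):
    """True when `keyword` occurs in `field`, or in any field when `field` is None."""
    fields = [field] if field else [k for k in trial if k != "id"]
    needle = keyword.lower()
    return any(needle in value.lower() for f in fields for value in trial[f])


def retrieve(trials, keyword, field=None, limit=100):
    """Return at most `limit` trials that match the keyword."""
    return [t for t in trials if matches(t, keyword, field)][:limit]


def annotation_edges(trials):
    """Edges linking each trial (CT) to its CONDITION and INTERVENTION terms."""
    edges = set()
    for t in trials:
        edges |= {(t["id"], c) for c in t["condition"]}
        edges |= {(t["id"], i) for i in t["intervention"]}
    return edges


# Made-up records for illustration only.
trials = [
    {"id": "CT1", "title": ["Study A"], "condition": ["Lung Cancer"],
     "intervention": ["Drug X"]},
    {"id": "CT2", "title": ["Study B"], "condition": ["Lung Carcinoma"],
     "intervention": ["Drug Y"]},
    {"id": "CT3", "title": ["Drug X in lung carcinoma patients"],
     "condition": ["Mesothelioma"], "intervention": ["Drug X"]},
]
by_condition = retrieve(trials, "lung carcinoma", field="condition")
by_any_field = retrieve(trials, "lung carcinoma")
edges = annotation_edges(by_condition)
```

Searching all fields returns a superset of the CONDITION-only search here, which is why the resulting graphs, and any patterns found in them, can differ between retrieved sets.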

Retrieve 100 clinical trials using search keyword “lung cancer”. Created a dense subgraph (almost clique; highly connected subgraph). Created a graph summary to visualize the output.

100 clinical trials using search keyword “lung carcinoma” for CONDITION.

100 clinical trials using search keyword “lung carcinoma” for ALL FIELDS.

Questions? PAnG/PSL/ANAPSID/Manjal