Using Semantic Similarity Measures in the Biomedical Domain for Computing Similarity between Genes Based on Gene Ontology. By: Elham Khabiri. Adviser: Dr. Hisham Al-Mubaid

Presentation transcript:

Slide 1: Using Semantic Similarity Measures in the Biomedical Domain for Computing Similarity between Genes Based on Gene Ontology
By: Elham Khabiri
Adviser: Dr. Hisham Al-Mubaid

University of Houston - Clear Lake
Slide 2: Motivation
Goal:
– Measure the functional similarity between genes and proteins
Reasons:
– It is useful to measure the functional difference between genes in different organisms
– Find genes with unknown functions
[Figure labels: Human, Yeast, Drug Target]

Slide 3: Motivation
To compute the similarity between two genes $g_1$ and $g_2$, we can use one of the following information sources:
– gene sequence information
– gene functional annotations (GO terms)
– biomedical literature and texts
– gene expression profiles
In this work, we use gene functional annotations and the Gene Ontology (GO) to measure the similarity between genes.

Slide 4: Motivation
Given two genes $G_p$ and $G_q$, where gene $G_p$ is annotated with a set of n different GO terms, we call this set $GO_p$:
$GO_p = \{t^p_1, t^p_2, \ldots, t^p_n\}$
Similarly, the annotation set for gene $G_q$ is:
$GO_q = \{t^q_1, t^q_2, \ldots, t^q_m\}$
that is, gene $G_q$ is annotated with m different GO terms.
The terms $t^p_i$ and $t^q_j$ are nodes in the GO.
If both genes $G_p$ and $G_q$ are annotated with only one term (n = m = 1) and it is the same GO term ($t^p_1 = t^q_1$), then the similarity between them is maximum.

Slide 5: Motivation
In general, if both genes $G_p$ and $G_q$ are annotated with the same set of GO terms (n = m ≥ 1, that is, every $t^p_i$ equals some $t^q_j$ and vice versa), then the similarity between them is maximum.
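
The boundary case on slides 4 and 5 can be sketched as a simple set comparison. This is only an illustration: the function name and the term labels "t1" and "t2" are hypothetical placeholders, not identifiers from the presentation.

```python
def has_maximum_similarity(go_p, go_q):
    """Return True when two genes carry exactly the same non-empty set of GO terms."""
    go_p, go_q = set(go_p), set(go_q)
    return len(go_p) > 0 and go_p == go_q


# Hypothetical annotation sets (placeholder term labels, not real GO identifiers).
go_p = {"t1", "t2"}
go_q = {"t2", "t1"}
go_r = {"t1"}

print(has_maximum_similarity(go_p, go_q))
print(has_maximum_similarity(go_p, go_r))
```

Any pair that fails this test needs one of the graded measures described in the later slides.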

Slide 6: Motivation
Many data resources in bioinformatics hold data not only as sequences but also as annotations:
– Written in scientific natural language
– Suitable for human readers but not easy to process by machine

Slide 7: Related Work: Semantic Measures in NLP
Based on the Information Content (IC) of the Least Common Ancestor (LCA):
– Resnik, 1995
– Lin, 1998
– Jiang and Conrath, 1997
Based on ontology structure:
– Wu & Palmer, 1994
– Leacock and Chodorow, 1998

Slide 8: Related Work
WordNet [Miller 1995]
Information-content-based measures:
– Resnik, 1995
[Equation not captured in the transcript]
freq(c): frequency of concept c in the database.
N: the number of all concepts in the database.
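
A minimal sketch of Resnik's measure, assuming its usual formulation: IC(c) = -log(freq(c)/N), and sim(c1, c2) is the largest IC among the common ancestors of c1 and c2. The toy taxonomy, the frequency counts, and all function names are hypothetical; here N is taken to be the count at the root, so the root has IC 0.

```python
import math

# Hypothetical toy taxonomy: child -> list of parents (not real WordNet or GO data).
PARENTS = {"root": [], "a": ["root"], "b": ["root"], "a1": ["a"], "a2": ["a"]}
# Hypothetical frequencies; each count already includes the counts of its descendants.
FREQ = {"root": 10, "a": 6, "b": 4, "a1": 3, "a2": 2}
N = FREQ["root"]


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """IC(c) = -log(freq(c) / N)."""
    return -math.log(FREQ[term] / N)


def resnik(t1, t2):
    """IC of the most informative common ancestor of t1 and t2."""
    common = ancestors(t1) & ancestors(t2)
    return max(information_content(t) for t in common)


print(resnik("a1", "a2"))  # IC of "a"
print(resnik("a1", "b"))   # only the root is shared
```

Rare concepts have high IC, so two terms that meet at a specific ancestor score higher than two terms that meet only at the root.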

Slide 9: Related Work
– Jiang and Conrath, 1997
– Lin, 1998
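
The two measures named on this slide are commonly written in terms of IC values: the Jiang and Conrath distance is IC(c1) + IC(c2) - 2·IC(LCA), and the Lin similarity is 2·IC(LCA) / (IC(c1) + IC(c2)). The sketch below assumes those standard forms; the function names are illustrative, and returning 0.0 when both IC values are zero is a convention chosen only for this example.

```python
def jiang_conrath_distance(ic1, ic2, ic_lca):
    """Distance: IC(c1) + IC(c2) - 2 * IC(LCA). Zero means identical concepts."""
    return ic1 + ic2 - 2.0 * ic_lca


def lin_similarity(ic1, ic2, ic_lca):
    """Similarity in [0, 1]: 2 * IC(LCA) / (IC(c1) + IC(c2))."""
    denominator = ic1 + ic2
    if denominator == 0:
        # Both concepts carry no information (assumption for this sketch only).
        return 0.0
    return 2.0 * ic_lca / denominator


# Hypothetical IC values for two concepts and their least common ancestor.
print(jiang_conrath_distance(2.0, 3.0, 1.0))
print(lin_similarity(2.0, 3.0, 1.0))
```

Note that Jiang and Conrath define a distance (smaller is more similar), while Lin defines a similarity (larger is more similar).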

Slide 10: Related Work
Ontology-structure-based measures:
– Wu & Palmer, 1994: based on the depths of the two concepts in the taxonomy and the depth of their least common subsumer (LCS, the same node as the LCA)
– Leacock and Chodorow, 1998: based on the length $PL(t_1, t_2)$ of the shortest path between the two concepts, scaled by the overall depth D of the taxonomy
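
A minimal sketch of the two structure-based measures, assuming their usual forms: Wu & Palmer as 2·depth(LCS) / (depth(c1) + depth(c2)) and Leacock & Chodorow as -log(PL / (2·D)). The function names are illustrative. Depths are assumed to start at 1 for the root and the path length is assumed to be at least 1, so neither formula divides by zero or takes the log of zero.

```python
import math


def wu_palmer(depth1, depth2, depth_lcs):
    """2 * depth(LCS) / (depth(c1) + depth(c2)); depths count the root as 1."""
    return 2.0 * depth_lcs / (depth1 + depth2)


def leacock_chodorow(path_length, max_depth):
    """-log(PL / (2 * D)), where D is the overall depth of the taxonomy."""
    if path_length < 1 or max_depth < 1:
        raise ValueError("path_length and max_depth must be at least 1")
    return -math.log(path_length / (2.0 * max_depth))


# Hypothetical values: two concepts at depth 3 whose LCS sits at depth 2.
print(wu_palmer(3, 3, 2))
# Hypothetical values: a shortest path of length 2 in a taxonomy of depth 4.
print(leacock_chodorow(2, 4))
```

Both measures need only the graph structure, which is why they can be applied to an ontology without a corpus of term frequencies.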

Slide 11: Related Work: Measures in the Biomedical Domain
First semantic similarity measure in the biomedical domain:
– Rada et al., 1989: path length between biomedical terms in the MeSH ontology
Measure of semantic similarity in the Gene Ontology (GO):
– Lord et al., 2003: applied Resnik's measure to GO
– Validated the correlation between sequence similarity and semantic similarity

Slide 12: Related Work: Recent Work in the Biomedical Domain
Al-Mubaid and Nguyen, 2007
– Investigated the effectiveness of using the Medline corpus as the information source for measuring semantic similarity in the biomedical domain
Al-Mubaid and Nguyen, 2007
– A technique for computing the semantic similarity between biomedical terms across multiple ontologies within a unified framework such as UMLS
Wang et al., 2007
– Functional similarity measure for GO terms based on the contributions of a term's ancestors in GO
– Evaluation: compared with Resnik's measure and found to be closer to human perception

Slide 13: Sequence Similarity
– BLAST [Altschul 1990]: finds regions of local similarity between gene sequences
– WU-BLAST2
Output:
– E-value
– Bit score

Slide 14: Drawbacks of Sequence Similarity
Sequence similarity holds for most genes with the same function.
Devos 2000: 30% of the functional similarity inferred from sequence similarity might be erroneous
– Reason: genes that did not evolve from a common ancestor do not have considerable sequence similarity
Another drawback is that sequence notation is not readable or understandable by humans.

Slide 15: New Approach
Ontology-structure-based:
– Path length (PL) between the two terms
– Number of minimum paths between the terms
– Depth of the LCA of the two terms
Ontology used: Gene Ontology
– A comprehensive resource for gene functional information
Validation:
– Correlation with sequence similarity
– Correlation with two other semantic measures

Slide 16: Gene Ontology
One of the greatest projects in bioinformatics.
Created in 2000 by the GO Consortium [Ashburner et al.].
Consists of a set of controlled vocabularies for:
– Biological Process
– Molecular Function
– Cellular Component
Shows the functional and biological terms related to genes in a hierarchical, structured way.

Slide 17: Gene Ontology

Slide 18: Gene Ontology
Directed Acyclic Graph (DAG):
– Each term may have more than one parent
– There may be more than one path between two nodes (terms)
– Every pair of nodes has at least one LCA (Least Common Ancestor)
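
The DAG properties listed above can be illustrated with a small sketch. The graph below is a hypothetical toy DAG, not real GO data, and the helper names are illustrative. Depth is taken here as the shortest number of edges up to the root, which is one of several possible conventions.

```python
from collections import deque

# Hypothetical toy DAG: term -> list of parents. "C", "E", and "F" have two parents.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],
    "D": ["A"],
    "E": ["C", "D"],
    "F": ["A", "B"],
}


def ancestors(term):
    """Return the term itself plus every term reachable by following parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def depth(term):
    """Shortest number of edges from the term up to the root (breadth-first search)."""
    queue = deque([(term, 0)])
    seen = {term}
    while queue:
        node, dist = queue.popleft()
        if not PARENTS[node]:
            return dist
        for parent in PARENTS[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, dist + 1))
    raise ValueError("term is not connected to a root")


def lowest_common_ancestors(t1, t2):
    """Return all deepest common ancestors; a DAG can yield more than one."""
    common = ancestors(t1) & ancestors(t2)
    best = max(depth(t) for t in common)
    return {t for t in common if depth(t) == best}


print(lowest_common_ancestors("C", "D"))
print(lowest_common_ancestors("C", "F"))
```

Because every term reaches the root, the set of common ancestors is never empty, which is why at least one LCA always exists. The pair ("C", "F") shows the case with two LCAs.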

Slide 19: 3 Proposed Measures
1. Plain Path Length (PL)
– Number of edges between the two terms
2. Path Length with Variation ($PL_m$)
– Number of common terms
– Number of minimum paths
3. Path Length with Depth ($Sim_{PLD}$)
– Path length between the two terms
– Depth of the LCA of the two terms

Slide 20: Plain Path Length
Considers the first-level ancestors of each node in the list.
[Figure: search expanding through the parents of nodes 11, 12, 4, 8, and 5]

Slide 21: PL between Two Genes
Gene p is annotated with terms $\{t_1, \ldots, t_n\}$; gene q is annotated with terms $\{t_1, \ldots, t_m\}$.
[Equation not captured in the transcript]
$d_{ij}$: shortest PL between term $t_i$ of gene p and term $t_j$ of gene q.
Example: Facl6 is annotated with 3 MF terms.
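
A minimal sketch of a path-length computation between terms and between genes. Two points are assumptions, because the slide's equation is not in the transcript: the term-level path is taken to run through a common ancestor (up from each term through its parents), and the gene-level value is taken to be the average of all pairwise distances $d_{ij}$. The toy DAG and all names are hypothetical.

```python
from collections import deque
from itertools import product

# Hypothetical toy DAG: term -> list of parents (not real GO data).
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],
    "D": ["A"],
    "E": ["C", "D"],
}


def up_distances(term):
    """Shortest number of edges from the term up to each of its ancestors."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        node = queue.popleft()
        for parent in PARENTS[node]:
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist


def path_length(t1, t2):
    """Minimum number of edges between two terms via any common ancestor."""
    d1, d2 = up_distances(t1), up_distances(t2)
    return min(d1[a] + d2[a] for a in d1.keys() & d2.keys())


def gene_path_length(terms_p, terms_q):
    """Assumed aggregation: mean of d_ij over all term pairs of the two genes."""
    pairs = list(product(terms_p, terms_q))
    return sum(path_length(a, b) for a, b in pairs) / len(pairs)


print(path_length("C", "D"))
print(gene_path_length(["C"], ["D", "B"]))
```

Other aggregations (for example the minimum or the best-match average over $d_{ij}$) would fit the same skeleton by changing only `gene_path_length`.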

Slide 22: PL Evaluation Based on Correlation with Sequence Similarity
Genomes used:
– SGD (Saccharomyces cerevisiae): 3 datasets
– FlyBase (Drosophila melanogaster): 1 dataset
Datasets divided based on E-value:
– High Sequence Similarity (HSS): E-value ≤ …
– Low Sequence Similarity (LSS): … < E-value < 1
– No Sequence Similarity (NSS): E-value = 1
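
The three-way split by E-value can be sketched as a small classifier. The HSS cutoff is not legible in this transcript, so it is a required parameter here rather than a built-in constant; the value 1e-5 used in the example is an arbitrary placeholder and not the threshold used in the study. Treating every E-value of 1 or more as NSS is also an assumption of this sketch.

```python
def sequence_similarity_class(e_value, hss_cutoff):
    """Classify a gene pair as HSS, LSS, or NSS from its BLAST E-value.

    hss_cutoff is the upper E-value bound for high sequence similarity;
    the caller must supply it because the slide's value is not available.
    """
    if e_value <= hss_cutoff:
        return "HSS"
    if e_value < 1.0:
        return "LSS"
    return "NSS"


PLACEHOLDER_CUTOFF = 1e-5  # hypothetical value, for illustration only

for e_value in (1e-20, 0.01, 1.0):
    print(e_value, sequence_similarity_class(e_value, PLACEHOLDER_CUTOFF))
```

Once each pair has a class, the PL values can be summarized per class, which is the comparison reported on the following slides.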

Slide 23: Evaluation: Compare PL with Sequence Similarity

Slide 24: Evaluation: Compare PL with Sequence Similarity
First set of results:
– 70% of HSS have PL ≤ 2
– 7% of HSS have PL > 7
– 7% of NSS have PL ≤ 2
Second set of results:
– 80% of HSS have PL ≤ 2
– 4% of HSS have PL > 7
– 17% of NSS have PL ≤ 2

Slide 25: 3 Proposed Measures
1. Plain Path Length (PL)
– Number of edges between the two terms
2. Path Length with Variation ($PL_m$)
– Number of common terms
– Number of minimum paths
3. Path Length with Depth ($Sim_{PLD}$)
– Path length between the two terms
– Depth of the LCA of the two terms

Slide 26: Path Length with Variation
– More than one LCA
– Two minimum paths (the two path listings were not captured in the transcript)
– Terms joined by more than one minimum path have more functional similarity than terms with only one minimum path between them

Slide 27: PL with Variation
$$PL_m(go_x, go_y) = \begin{cases} PL(go_x, go_y) & \text{if } nmp = 1 \\ PL(go_x, go_y) / (w_1 \cdot nmp) & \text{otherwise} \end{cases}$$
$PL(go_x, go_y)$ = the minimum path length in the GO graph between the two GO terms $go_x$ and $go_y$; nmp = the number of minimum paths between them.
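
A minimal sketch of the piecewise definition above: it counts the minimum paths between two terms and then applies $PL_m$. The toy DAG, the function names, and the choice to count paths that run through a common ancestor are assumptions of this example. The value of $w_1$ is not given on the slide, so it is a required parameter; 1.0 below is a placeholder.

```python
from collections import deque

# Hypothetical toy DAG: term -> list of parents. "C" and "F" share both parents.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A", "B"],
    "D": ["A"],
    "F": ["A", "B"],
}


def up_paths(term):
    """Map each ancestor to (shortest upward distance, number of such shortest paths)."""
    info = {term: (0, 1)}
    queue = deque([term])
    while queue:
        node = queue.popleft()
        dist, count = info[node]
        for parent in PARENTS[node]:
            if parent not in info:
                info[parent] = (dist + 1, count)
                queue.append(parent)
            elif info[parent][0] == dist + 1:
                # Another equally short route reaches this ancestor.
                info[parent] = (dist + 1, info[parent][1] + count)
    return info


def min_paths(t1, t2):
    """Return (PL, nmp): the minimum path length and the number of minimum paths."""
    i1, i2 = up_paths(t1), up_paths(t2)
    common = i1.keys() & i2.keys()
    pl = min(i1[a][0] + i2[a][0] for a in common)
    nmp = sum(i1[a][1] * i2[a][1] for a in common if i1[a][0] + i2[a][0] == pl)
    return pl, nmp


def pl_m(pl, nmp, w1):
    """PL if there is a single minimum path, otherwise PL / (w1 * nmp)."""
    if nmp < 1:
        raise ValueError("nmp must be at least 1")
    if nmp == 1:
        return float(pl)
    return pl / (w1 * nmp)


pl, nmp = min_paths("C", "F")
print(pl, nmp, pl_m(pl, nmp, w1=1.0))
pl, nmp = min_paths("C", "D")
print(pl, nmp, pl_m(pl, nmp, w1=1.0))
```

In the toy graph, "C" and "F" are joined by two minimum paths (through "A" and through "B"), so their $PL_m$ is smaller than that of "C" and "D", which share a single minimum path of the same length.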

Slide 28: Path Length with Variation
Gene p is annotated with terms $\{t_1, \ldots, t_n\}$; gene q is annotated with terms $\{t_1, \ldots, t_m\}$.
[Equation not captured in the transcript]
– Max go_pl = 15
– nct = number of common GO terms between $G_p$ and $G_q$

Slide 29: Validate $PL_m$
We measured the similarity of gene pairs in SGD pathways. A pathway is a series of chemical reactions occurring within a cell.
– Pathway #5 (allantoin degradation): 4 genes
– Pathway #6 (arginine biosynthesis): 7 genes
– Pathway #141 (tryptophan degradation): 12 genes
Compared with:
– Resnik's measure
– Wang et al.'s measure

Slide 30: Validate $PL_m$: Compare with Resnik
Pathway 5: allantoin degradation
– 4 genes, 6 pairs

gene1   gene2     Res   Ours
DAL1    DAL…      …     …
DAL1    DAL…      …     …
DAL1    DUR1,…    …     …
DAL2    DAL…      …     …
DAL2    DUR1,…    …     …
DAL3    DUR1,…    …     …

The two measures correlate well with each other.
[Callouts on the table mark the minimum and maximum values]

Slide 31: Validate $PL_m$: Compare with Resnik
Pathway 6: 7 genes, 21 pairs

gene1   gene2     Res   Ours
ARG1    ARG…      …     …
ARG1    ARG…      …     …
ARG2    ARG…      …     …
ARG3    ARG5,…    …     …
ARG4    ARG…      …     …
ARG1    ARG…      …     …

PL(ARG2, ARG3) > PL(ARG3, ARG5,6)
PL(ARG4, ARG8) > PL(ARG1, ARG8)

Slide 32: Evaluation: Clusters of Genes
Wang et al. vs. our method

Slide 33: 3 Proposed Measures
1. Plain Path Length (PL)
– Number of edges between the two terms
2. Path Length with Variation ($PL_m$)
– Number of common terms
– Number of minimum paths
3. Path Length with Depth ($Sim_{PLD}$)
– Path length between the two terms
– Depth of the LCA of the two terms

Slide 34: Similarity between GO Terms
$PL(go_x, go_y)$ = minimum path length between the two GO terms $go_x$ and $go_y$

Slide 35: $Sim_{PLD}$ between Two Genes
$g_p$ is annotated with terms $\{go_1, \ldots, go_n\}$; $g_q$ is annotated with terms $\{go_1, \ldots, go_m\}$.

Slide 36: Evaluation: $Sim_{PLD}$
Correlation between $Sim_{PLD}$ and sequence similarity
Datasets:
– SGD
– FlyBase
– Human-Yeast
Ontology used:
– Molecular function (MF)

Slide 37: Compare $Sim_{PLD}$ with Sequence Similarity
Based on BLAST E-value:
– High Sequence Similarity
– Low Sequence Similarity
– No Sequence Similarity

Slide 38: Conclusion
The Gene Ontology is a reliable source for measuring functional similarity.
Our semantic measures:
– Can be used as an automated tool to identify genes with similar functions
– Agree fairly well with BLAST sequence similarity and with the results of other well-known semantic measures

Slide 39: Resulting Publications
– Khabiri E., Al-Mubaid H. (2007) "A path length method for gene functional similarity using GO annotations." 16th International Conference on Software Engineering and Data Engineering (SEDE), Las Vegas, Nevada, USA, 2007.
– Khabiri E. (2007) "A Preliminary study of Correlation between depth and Path Length of GO nodes with Gene Sequence Similarity." IEEE 7th International Conference on BioInformatics and BioEngineering (BIBE07), Boston, Massachusetts, USA, 2007.
– Al-Mubaid H., Khabiri E. "A New Path Length Based Measure for Functional Similarity of Genes with Evaluation Using SGD Pathways." Computational Structural Bioinformatics Workshop (CSBW), San Jose, CA (accepted, not finalized).

Slide 40: Future Work
– Apply path-length-based measures to more datasets from different model organisms
– More accurate evaluation using biomedical literature and microarray data analysis
– Consider the number of distinct paths
– Predict the functions of genes whose function is unknown
