Semantic Similarity Measures Across The Gene Ontology. Relating Sequence to Annotation.
P.W. Lord, R.D. Stevens, A. Brass, and C. Goble
Department of Computer Science, The University of Manchester, M13 9PL, UK. firstname.lastname@example.org

Abstract
● Bioinformatics resources are rich in knowledge, but that knowledge is often held as free text.
● Ontologies provide a way of representing knowledge in a computationally accessible form.
● The Gene Ontology (GO) represents knowledge about:
  ● The molecular function of a gene product
  ● The biological process it is involved in
  ● The cellular compartment of which it is a part
● Can we ask a database for proteins with “semantically similar” annotation to a query protein?
● We present and validate a measure of semantic similarity, and show several uses for it.

Information Content Measures
● Originally developed by Resnik (1995) for WordNet (Fellbaum, 1998), but adaptable to GO.
● Less frequently occurring terms are “more informative”.
● To calculate:
  ● For each term, count the number of occurrences of that term or any of its children
  ● Divide by the total number of term occurrences to give a probability
[Figure: the information content for each node]
● The similarity is then given by sim(c1, c2) = -ln p_ms(c1, c2)
● where p_ms is the lowest probability (that is, the highest information content) among the shared parents of c1 and c2.

Validation
● Two proteins with similar sequences should probably also have semantically similar annotation.
● We tested this by BLAST searching all SWISS-PROT proteins, taking the top bit scores, and comparing them to semantic similarity.
● Semantic similarity over the molecular function aspect is most strongly correlated with sequence similarity.
● A similar experiment shows that “Traceable Author Statement” associations are most tightly correlated with sequence similarity.
● These results fit well with biological expectations, and therefore serve to validate the semantic similarity measure.
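
The calculation described under Information Content Measures can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the toy ontology, the term names, the annotation counts and the function names are all hypothetical, and the similarity is the Resnik (1995) form, the negative log of the lowest probability among shared parents.

```python
import math

# Hypothetical toy ontology: each term maps to its parents (a DAG, as in GO).
PARENTS = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "dna_binding": ["binding"],
    "rna_binding": ["binding"],
    "kinase": ["catalysis"],
}

# Hypothetical counts of how often each term is used directly in annotations.
DIRECT_COUNTS = {
    "root": 0,
    "binding": 2,
    "catalysis": 3,
    "dna_binding": 5,
    "rna_binding": 4,
    "kinase": 6,
}


def ancestors(term):
    """Return the term itself plus every term reachable through parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def term_probabilities():
    """p(term) = (uses of the term or any of its children) / (total uses)."""
    totals = dict.fromkeys(PARENTS, 0)
    for term, count in DIRECT_COUNTS.items():
        # A direct use of a term also counts once for each of its ancestors.
        for ancestor in ancestors(term):
            totals[ancestor] += count
    grand_total = sum(DIRECT_COUNTS.values())
    return {term: total / grand_total for term, total in totals.items()}


def resnik_similarity(term1, term2, probs):
    """Similarity = -ln of the lowest probability among shared parents."""
    shared = ancestors(term1) & ancestors(term2)
    p_ms = min(probs[term] for term in shared)
    return -math.log(p_ms)


probs = term_probabilities()
print(resnik_similarity("dna_binding", "rna_binding", probs))
print(resnik_similarity("dna_binding", "kinase", probs))
```

In this toy example the two binding terms share the parent "binding" and so score above zero, while "dna_binding" and "kinase" share only the root, whose probability is 1, and so score zero.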

Applications
● We have developed two prototype applications:
  ● A simple search tool, which uses the similarity score for ranking
  ● An annotation checker, which looks for high semantic similarity and low sequence similarity
● The annotation checker has identified several “misannotations”, and errors in GO.
● The search tool, while primitive, produces results that intuitively appear “correct”.

Future Work
● We are currently investigating several other information-content-based measures, and their behaviour over the GO dataset.
● We plan to offer a web-based portal, to enable us to seek user feedback on our search tool.

Acknowledgements
● The GO curators and SWISS-PROT annotators, for helpful comments
● The GO database and API, and bioperl, were used during this work
● This work was funded under the EPSRC/BBSRC Bioinformatics Programme (Grant number BIF/10507)

References
C. Fellbaum (1998) WordNet: an electronic lexical database. MIT Press.
P. Resnik (1995) Using information content to evaluate semantic similarity in a taxonomy. Proc. 14th Intl Joint Conf. on Artificial Intelligence, pp. 448-453. Morgan Kaufmann.