Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics, July 2003. P.W. Lord,


1 Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation
Bioinformatics, July 2003
P.W. Lord, R.D. Stevens, A. Brass and C.A. Goble, University of Manchester
Presented by 임 동혁, July 22, 2005

2 Contents
- Introduction
- Semantic Similarity Measures
- Validating Semantic Similarity
- Investigating Semantic and Sequence Similarity
- Semantic Searching of GO Annotated Resources
- Discussion

3 Introduction
- Bioinformatics resources
  - Held in the form of sequences, which are then annotated
  - Annotations are written in scientific natural language, as text
  - Human-readable and understandable
  - Not easy to interpret computationally
- Ontologies
  - Provide a mechanism for capturing a view of a domain in a shareable form
  - Both accessible to humans and computationally amenable
  - Provide a set of vocabulary terms that label concepts in the domain
  - “is-a” relationship between parent and child
  - “part-of” relationship between part and whole

4 Gene Ontology (1/2)
- GO comprises three orthogonal taxonomies, or aspects
  - Molecular function
  - Biological process
  - Cellular component
- GO is a rapidly growing collection of phrases representing terms or concepts
- Structured as a Directed Acyclic Graph (DAG)

5 Gene Ontology (2/2)
- Allows improved querying of databases
  - Different resources can be queried with the same term
  - A shared understanding improves retrieval consistency across resources, as well as recall and precision
- One obvious alternative way of querying
  - Ask for proteins semantically similar to a query protein
- Semantic similarity
  - Taxonomy of biomedical terms
  - Ex) Medical Subject Headings (MeSH): similar content (by words)

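6 The Gene Ontology
[Figure: fragment of the GO molecular function hierarchy, with nodes joined by "is-a" links and labelled with a GO accession and a probability p. Nodes: molecular function (p=1), signal transducer (p=0.208), receptor (p=0.124), transmembrane receptor (p=0.0997), ligand (p=0.0460), receptor signaling protein (p=0.0281), chaperone (p=0.0102), photoreceptor, receptor-associated protein. The accessions and the p values of the last two nodes did not survive in the transcript.]
- Two proteins both annotated as “transmembrane receptor” (GO: ) have a similar semantic description
- If one is annotated as just “receptor” (GO: ), the pair is semantically less similar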

7 Semantic Similarity Measure (1/3)
- Early techniques (Rada et al., 1989)
  - Path distances between terms
  - Assume that all semantic links are of equal weight
  - This is a poor assumption
- Ex) “photoreceptor” and “transmembrane receptor” are semantically more closely related than “chaperone” and “signal transducer”
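
A minimal sketch of edge-counting path distance, to show why equal edge weights are a poor assumption. The "is-a" edges below are assumed for illustration only (the slide's figure did not survive extraction), and `path_distance` is a hypothetical helper, not code from the paper.

```python
from collections import deque

# Hypothetical is-a edges (child -> parents), assumed for illustration only.
IS_A = {
    "signal transducer": ["molecular function"],
    "chaperone": ["molecular function"],
    "receptor": ["signal transducer"],
    "transmembrane receptor": ["receptor"],
    "photoreceptor": ["receptor"],
}

def path_distance(a, b, is_a=IS_A):
    """Number of edges on the shortest path between two terms, ignoring edge direction."""
    neighbours = {}
    for child, parents in is_a.items():
        for parent in parents:
            neighbours.setdefault(child, set()).add(parent)
            neighbours.setdefault(parent, set()).add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:  # breadth-first search, so the first hit is the shortest path
        term, dist = queue.popleft()
        if term == b:
            return dist
        for nxt in neighbours.get(term, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two terms are not connected

print(path_distance("photoreceptor", "transmembrane receptor"))  # 2
print(path_distance("chaperone", "signal transducer"))           # 2
```

Under these assumed edges both pairs are two edges apart, although the first pair is intuitively much closer in meaning.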

8 Semantic Similarity Measure (2/3)
- Edges could be weighted
  - The greater the distance from the root of the graph, the more specific the term
- However, GO varies widely in the distance of nodes from the root
  - Ex) (GO: ) is 14 terms deep, (GO: ) is only 3 terms deep
  - The shallower term is not significantly less semantically precise

9 Semantic Similarity Measure (3/3)
- Usage of terms within the corpus (Resnik, 1999)
  - Uses the notion of “information content”
  - Familiar from most internet search engines
- Ex) “chaperone” is a more informative term than “signal transducer”
  - The former is used several times, the latter thousands of times
- When GO: occurs, GO: and GO: are counted as having occurred too (“is-a” links are considered)
- Terms that occur less often are more informative

10 Probabilities in the Gene Ontology
- Each node is annotated with its GO accession and the probability of the term occurring in the SWISS-PROT-Human database
  1. Count the number of times each concept occurs
  2. A concept occurs if its term, or any of its children, occurs
  3. The probability, p(c), for each node is this count divided by the total number of occurrences (so the probability of the root node is 1)
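
The three steps can be sketched as follows. This is an illustration under stated assumptions: the "is-a" edges and the annotation counts are invented, and `ancestors` and `term_probabilities` are hypothetical helper names rather than anything from the paper.

```python
from collections import Counter

# Hypothetical is-a edges (child -> parents) and direct annotation counts,
# invented for illustration; they are not SWISS-PROT figures.
PARENTS = {
    "signal transducer": ["molecular function"],
    "chaperone": ["molecular function"],
    "receptor": ["signal transducer"],
    "transmembrane receptor": ["receptor"],
}
DIRECT_COUNTS = {
    "chaperone": 1,
    "signal transducer": 2,
    "receptor": 3,
    "transmembrane receptor": 4,
}

def ancestors(term, parents):
    """All terms reachable from `term` by following is-a links upward."""
    found, stack = set(), list(parents.get(term, []))
    while stack:
        t = stack.pop()
        if t not in found:
            found.add(t)
            stack.extend(parents.get(t, []))
    return found

def term_probabilities(direct_counts, parents, root="molecular function"):
    """p(c) = (occurrences of c or any of its descendants) / (occurrences at the root)."""
    totals = Counter()
    for term, n in direct_counts.items():
        totals[term] += n
        # Using a set means an ancestor reached by several paths is counted once.
        for anc in ancestors(term, parents):
            totals[anc] += n
    return {t: totals[t] / totals[root] for t in totals}

probs = term_probabilities(DIRECT_COUNTS, PARENTS)
for term, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print(f"{term}: p={p:.2f}")
```

Because every occurrence propagates up to the root, the root's count equals the total and its probability is 1.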

11 Semantic Similarity between terms
- Uses the simplest measure (Resnik, 1999)
  - Based on the information content of the shared parents of the two terms
  - $S(c_1, c_2)$ is the set of parental concepts shared by both $c_1$ and $c_2$
- GO allows multiple paths, so the minimum $p(c)$ is taken
  - $p_{ms}(c_1, c_2) = \min_{c \in S(c_1, c_2)} p(c)$, the probability of the minimum subsumer
- Similarity score between two terms
  - $\mathrm{sim}(c_1, c_2) = -\ln p_{ms}(c_1, c_2)$
  - As probability increases, informativeness decreases
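
A small sketch of this measure, assuming the Resnik form sim(c1, c2) = -ln p_ms(c1, c2), where p_ms is the minimum p(c) over the shared subsumers. The p values are the ones shown on the earlier GO figure slide; the "is-a" edges are assumed, and treating a term as a subsumer of itself is also an assumption of this sketch.

```python
import math

# p(c) values as shown on the GO figure slide.
P = {
    "molecular function": 1.0,
    "signal transducer": 0.208,
    "receptor": 0.124,
    "transmembrane receptor": 0.0997,
    "chaperone": 0.0102,
}
# Hypothetical is-a edges (child -> parents), assumed for illustration.
PARENTS = {
    "signal transducer": ["molecular function"],
    "chaperone": ["molecular function"],
    "receptor": ["signal transducer"],
    "transmembrane receptor": ["receptor"],
}

def subsumers(term):
    """The term itself plus every ancestor reachable through is-a links."""
    found, stack = {term}, list(PARENTS.get(term, []))
    while stack:
        t = stack.pop()
        if t not in found:
            found.add(t)
            stack.extend(PARENTS.get(t, []))
    return found

def resnik_similarity(c1, c2):
    shared = subsumers(c1) & subsumers(c2)
    p_ms = min(P[c] for c in shared)  # probability of the minimum subsumer
    return math.log(1.0 / p_ms)       # equal to -ln(p_ms)

print(resnik_similarity("transmembrane receptor", "transmembrane receptor"))  # about 2.31
print(resnik_similarity("transmembrane receptor", "receptor"))                # about 2.09
print(resnik_similarity("chaperone", "signal transducer"))                    # 0.0
```

With these values, two "transmembrane receptor" annotations score higher than "transmembrane receptor" against "receptor", which matches the example on the figure slide; terms that share only the root score 0.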

12 Validating Semantic Similarity
- How do we validate such a measure?
  - A protein’s sequence relates to its function
  - Highly similar sequences should be highly semantically similar
- Comparing protein sequences in pairs and plotting sequence similarity against semantic similarity should reveal a relationship
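
One way to quantify such a relationship is a correlation coefficient over the paired scores. The slides do not say which statistic was used, so Pearson's r is an assumption here, and the paired values are made up purely to exercise the function.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Made-up paired scores for four protein pairs (not data from the paper).
sequence_similarity = [50.0, 120.0, 300.0, 800.0]
semantic_similarity = [0.5, 1.2, 2.0, 3.1]

r = pearson(sequence_similarity, semantic_similarity)
print(f"r = {r:.3f}")
```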

13 Adapting the Similarity Measures to GO and SWISS-PROT
- “part-of” relationship
  - Orphan terms: linked directly to the root
  - Ex) GO:
  - “is-a” links alone
- Proteins may be annotated with more than a single term
  - WordNet: maximum similarity
  - GO: average similarity
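
A sketch of combining term-level scores into a protein-level score when each protein carries several terms. The function name, the `combine` parameter and the toy lookup table are hypothetical; only the average-versus-maximum choice comes from the slide.

```python
from itertools import product

def protein_similarity(terms_a, terms_b, term_sim, combine="average"):
    """Combine all pairwise term similarities between two annotated proteins."""
    scores = [term_sim(a, b) for a, b in product(terms_a, terms_b)]
    if not scores:
        raise ValueError("both proteins need at least one term")
    if combine == "max":
        return max(scores)            # the WordNet-style choice named on the slide
    return sum(scores) / len(scores)  # the averaging choice named for GO

# Toy term-similarity table with invented values, keyed by unordered term pair.
TOY_SCORES = {
    frozenset(["t1"]): 2.0,
    frozenset(["t1", "t2"]): 1.0,
    frozenset(["t2"]): 3.0,
}

def toy_term_sim(a, b):
    return TOY_SCORES[frozenset([a, b])]

average_score = protein_similarity(["t1", "t2"], ["t1"], toy_term_sim)
maximum_score = protein_similarity(["t1", "t2"], ["t1"], toy_term_sim, combine="max")
print(average_score, maximum_score)  # 1.5 2.0
```

In practice `term_sim` would be a term-level measure such as the information-content score from the previous slide.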

14 Comparing Semantic Similarity Across GO Aspects
- There is a good correlation between sequence similarity and semantic similarity
- The correlation is greatest when measured against the “molecular function” aspect

15 The Relationship Between Semantic Similarity and Evidence Codes
- TAS (Traceable Author Statement): regarded as the highest standard of evidence
- When only TAS GO annotations are considered, the correlation is much greater

16 Effect of Using Semantic Links in Semantic Similarity
- Consider only links of a single type: “is-a” or “part-of”
- Little difference between all links and “is-a” alone: almost all links are of the “is-a” type (6167 / 6202)
- With no links, the curve drops in the middle part: proteins that share similar terms are only credited when links are included in the semantic similarity measure

17 Analysis (1/2)
- Very high semantic similarity but little sequence similarity
- “Polymorphic” groups
  - Two or more classes of protein involved in the same process
  - Heterodimerizing proteins or sub-families
- Hypervariable protein families
  - Arbitrary
- Mis-annotations
  - SWISS-PROT says “x-like” but GO says “x”
  - Spelling mistakes

18 Analysis(2/2) - Example

19 Semantic Searching of GO Annotated Resources
- Developed a search tool
  - Compares a given query protein against all the others in SWISS-PROT-Human
  - Generates a ranked list of semantically similar proteins
  - Ex) “OPSR_HUMAN”
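
A sketch of the ranking step only. The protein IDs and term sets are invented, and Jaccard overlap is a stand-in for the protein-level semantic similarity so that the example stays self-contained; it is not the measure described in the slides.

```python
def rank_by_similarity(query_id, annotations, protein_sim):
    """Return (protein_id, score) pairs for every other protein, best match first."""
    query_terms = annotations[query_id]
    scored = [
        (pid, protein_sim(query_terms, terms))
        for pid, terms in annotations.items()
        if pid != query_id
    ]
    # Sort by descending score, breaking ties by ID for a stable listing.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored

def jaccard(terms_a, terms_b):
    """Stand-in similarity: shared terms divided by all terms (an assumption)."""
    return len(terms_a & terms_b) / len(terms_a | terms_b)

# Hypothetical annotation table: protein ID -> set of GO terms.
ANNOTATIONS = {
    "QUERY": {"t1", "t2"},
    "PROT_A": {"t1", "t2"},
    "PROT_B": {"t1"},
    "PROT_C": {"t3"},
}

ranking = rank_by_similarity("QUERY", ANNOTATIONS, jaccard)
for pid, score in ranking:
    print(pid, score)
```

Swapping `jaccard` for an information-content based protein similarity would give the kind of semantic search the slide describes.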

20 Discussion
- Investigated semantic similarity measures
  - In all cases semantic similarity is correlated with sequence similarity
  - The correlation is greatest for the GO aspect “molecular function”
  - The correlation is greatest for the evidence code “Traceable Author Statement”
- Future work
  - Effect of the different semantic links in ontologies
  - Co-expression as revealed by microarray experiments
  - Expect that the biological process aspect would be of great use

