The Gene Ontology Categorizer C.A. Joslyn 1, S.M. Mniszewski 1, A. Fulmer 2 and G. Heaton 3 1 Computer and Computational Sciences, Los Alamos National.


1 The Gene Ontology Categorizer C.A. Joslyn¹, S.M. Mniszewski¹, A. Fulmer² and G. Heaton³. ¹Computer and Computational Sciences, Los Alamos National Laboratory, ²Corporate Biotechnology, Miami Valley Labs and ³Corporate Functions-IT, Procter & Gamble, USA (Bioinformatics, Vol. 20, Suppl. 1, 2004, pp. i169-i177)

2 2/25 Abstract (1/2) Given a list of genes of interest, what are the best nodes of the GO to summarize or categorize that list? In a drug discovery process, we wish to understand the overall effect of some cell treatment or condition by identifying ‘where’ in the GO the differentially expressed genes fall.

3 3/25 Abstract (2/2) View bio-ontologies more as combinatorially structured databases than facilities for logical inference, and draw on the discrete mathematics of finite partially ordered sets (posets) to develop data representation and algorithms appropriate for the GO. Issues: categorization task, distances in ontologies and ontology merger and exchange.

4 4/25 1. Introduction (1/3) When a gene expression experiment involves high-throughput microarrays, a biomedical researcher needs to extract useful information on the types of biological processes affected in the experiment. The categorization task arises when the researcher wants to take the names of some genes and understand their overall function by examining their distribution through the GO: are they localized, grouped in distinct areas or spread uniformly?

5 5/25 1. Introduction (2/3) The Gene Ontology Categorizer (GOC) applies novel research in the discrete mathematics of posets for semantic hierarchies to GO analysis. Represent the GO as a poset ontology, then use pseudo-distances between comparable nodes to develop scoring functions. Finally, cluster the resulting rank-ordered list to produce a ranked list of appropriate summarizing nodes within the GO, which act as functional hypotheses about the characteristics of the genes expressed.

6 6/25 1. Introduction (3/3) Weaknesses of existing GO analysis: Many researchers consider the GO simply as a list of categories, ignoring any structural relationships among the categories. Even those researchers whose treatment is closest in spirit to the authors' consider the GO primarily as a tree, or cast it as a graph for determining distances between nodes.

7 7/25 2. Methodology (1/2) A finite partially ordered set (poset) is a mathematical structure 𝒫 = ⟨P, ≤⟩, where P is a finite set and ≤ ⊆ P^2 is a reflexive, anti-symmetric, transitive binary relation on P. Every poset is a digraph with no cycles, and posets are more general than trees or lattices in that nodes can have multiple parents. The GO is a pair of directed acyclic graphs (DAGs), one each for the is-a and has-part links.

8 8/25

9 9/25 2. Methodology (2/2) P_GO is the set of nodes such as ‘DNA unwinding’ and ‘DNA replication’. The ordering ≤ in ‘DNA repair ≤ DNA metabolism’ represents that DNA repair is a kind of DNA metabolism. The GO, cast as a pair of posets 𝒫_is = ⟨P_GO, ≤_is⟩ and 𝒫_has = ⟨P_GO, ≤_has⟩ for the two kinds of relations, is a large, taxonomically organized semantic hierarchy. This paper treats the two kinds of links as equivalent: 𝒫_GO = ⟨P_GO, ≤_GO⟩, where ≤_GO = ≤_is ∪ ≤_has.
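
As a rough illustration of treating the two link types as a single ordering, the sketch below takes the union of is-a and has-part links and closes it reflexively and transitively. The specific links are illustrative assumptions rather than data from a GO release, and `order_relation` and `comparable` are hypothetical helper names.

```python
from itertools import product

# Illustrative (child, parent) links only; not taken from a GO release.
IS_A = {("DNA repair", "DNA metabolism"), ("DNA replication", "DNA metabolism")}
HAS_PART = {("DNA unwinding", "DNA replication")}

def order_relation(*link_sets):
    """Return (nodes, leq): the reflexive-transitive closure of the union of link sets."""
    links = set().union(*link_sets)
    nodes = {n for pair in links for n in pair}
    leq = {(n, n) for n in nodes} | links      # reflexive pairs plus direct links
    changed = True
    while changed:                             # naive transitive closure
        changed = False
        for (a, b), (c, d) in product(list(leq), repeat=2):
            if b == c and (a, d) not in leq:
                leq.add((a, d))
                changed = True
    return nodes, leq

NODES, LEQ_GO = order_relation(IS_A, HAS_PART)

def comparable(p1, p2, leq=LEQ_GO):
    """True if p1 <= p2 or p2 <= p1 in the merged ordering."""
    return (p1, p2) in leq or (p2, p1) in leq
```

With these made-up links, 'DNA unwinding' ends up below 'DNA metabolism' only because the has-part link and the is-a link are merged into one relation.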

10 10/25 2.1 Poset theory (1/3) Two nodes p_1, p_2 ∈ P are comparable, denoted p_1 ~ p_2, if either p_1 ≤ p_2 or p_2 ≤ p_1. A chain C ⊆ P is a collection of pairwise comparable nodes. Height H(𝒫) is the size of the largest chain. Two nodes p_1, p_2 ∈ P are non-comparable if p_1 ≁ p_2. An antichain is a collection of pairwise non-comparable nodes. Width W(𝒫) is the size of the largest antichain.
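
The definitions of comparability, height and width can be sketched on a toy poset as follows. The five-node poset and the function names are assumptions made for illustration, and the brute-force width search is only practical for very small posets; it is not how width would be computed on the GO.

```python
from itertools import combinations

# Hypothetical cover relation (child -> parents) of a small poset.
PARENTS = {"a": {"c", "d"}, "b": {"d"}, "c": {"top"}, "d": {"top"}, "top": set()}

def ancestors(node):
    """All nodes strictly above `node`."""
    found, stack = set(), list(PARENTS[node])
    while stack:
        n = stack.pop()
        if n not in found:
            found.add(n)
            stack.extend(PARENTS[n])
    return found

def comparable(p1, p2):
    """True if p1 <= p2 or p2 <= p1."""
    return p1 == p2 or p2 in ancestors(p1) or p1 in ancestors(p2)

def height():
    """Size (number of nodes) of the largest chain."""
    def longest_from(n):
        return 1 + max((longest_from(p) for p in PARENTS[n]), default=0)
    return max(longest_from(n) for n in PARENTS)

def width():
    """Size of the largest antichain, found by brute force."""
    nodes = sorted(PARENTS)
    for k in range(len(nodes), 0, -1):
        for subset in combinations(nodes, k):
            if all(not comparable(x, y) for x, y in combinations(subset, 2)):
                return k
    return 0
```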

11 11/25 2.1 Poset theory (2/3) Given two comparable nodes p_1 ≤ p_2, the set of all nodes ‘between’ them is the interval [p_1, p_2] = {p : p_1 ≤ p ≤ p_2}, which is equivalent to the set of all chains between p_1 and p_2, denoted 𝒞(p_1, p_2). The vector of chain lengths h(p_1, p_2) = ⟨|C| : C ∈ 𝒞(p_1, p_2)⟩ is the collection of the lengths of all these chains. The minimum and maximum chain lengths between p_1 and p_2 are h_*(p_1, p_2) = min_{C ∈ 𝒞(p_1, p_2)} |C| and h^*(p_1, p_2) = max_{C ∈ 𝒞(p_1, p_2)} |C|, respectively.

12 12/25 2.1 Poset theory (3/3) Example: P = {1, A, B, ..., K}. B and J are non-comparable, while A ≤ B are comparable. The interval [A, B] = {A, F, G, H, I, B} consists of the three chains 𝒞(A, B) = {A ≤ F ≤ B, A ≤ G ≤ B, A ≤ H ≤ I ≤ B}, so h(A, B) = ⟨2, 2, 3⟩ with h_*(A, B) = 2 and h^*(A, B) = 3. H(𝒫) = 5 (a maximal chain is D ≤ E ≤ I ≤ C ≤ 1) and W(𝒫) = 5 (the largest antichain is {F, G, H, E, J}).
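
The interval example can be reproduced with a short sketch. The cover links below are read off the three chains listed on the slide and cover only the interval [A, B], not the whole example poset; chain length is assumed to count links rather than nodes, which matches the values 2 and 3 quoted above. The names `COVERS`, `chains` and `chain_lengths` are hypothetical.

```python
# Cover links (lower node -> immediately higher nodes) inside the interval [A, B].
COVERS = {"A": ["F", "G", "H"], "F": ["B"], "G": ["B"], "H": ["I"], "I": ["B"], "B": []}

def chains(low, high, covers=COVERS):
    """All chains from `low` up to `high` along cover links, as lists of nodes."""
    if low == high:
        return [[high]]
    return [[low] + rest
            for up in covers.get(low, [])
            for rest in chains(up, high, covers)]

def chain_lengths(low, high, covers=COVERS):
    """Vector h(low, high): number of links in each chain, sorted ascending."""
    return sorted(len(c) - 1 for c in chains(low, high, covers))

h = chain_lengths("A", "B")
h_min, h_max = min(h), max(h)                          # h_* and h^*
interval = {n for c in chains("A", "B") for n in c}    # the interval [A, B]
```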

13 13/25 Poset statistics of the GO

14 14/25 2.2 Methods (1/4) Define a POSet Ontology (POSO) as O = ⟨𝒫, X, F⟩, where 𝒫 = ⟨P, ≤⟩ is a poset, X is a finite, non-empty set of labels, and F: X → 2^P is an annotation function mapping each label x ∈ X to a collection of nodes F(x) ⊆ P. E.g. X = {a, b, …, j}, F(b) = {A, E, F}. In GOC, O_GO = ⟨𝒫_GO, X_GO, F_GO⟩, where the gene products X_GO and annotations F_GO are provided by the GO file.
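
One simple way to hold an annotation function in code is a mapping from labels to node sets. In the sketch below only F(b) = {A, E, F} comes from the slide; the other entries, and the helper `requested_nodes`, are made up for illustration. The helper shows how a query (a list of labels) might be turned into counts over poset nodes before any scoring.

```python
from collections import Counter

# Annotation function F: label -> set of poset nodes.
# Only the entry for "b" comes from the slide; "a" and "c" are invented.
F = {"b": {"A", "E", "F"}, "a": {"A"}, "c": {"E", "K"}}

def requested_nodes(query, annotation=F):
    """Count how many labels in the query are annotated to each node."""
    counts = Counter()
    for label in query:
        counts.update(annotation.get(label, ()))
    return counts

counts = requested_nodes(["a", "b", "c"])
```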

15 15/25 2.2 Methods (2/4) A pseudo-distance function δ: P^2 → ℝ can be defined between comparable nodes in four ways: the minimum path length δ_m = h_*; the maximum path length δ_x = h^*; the average of the extreme path lengths, (h_* + h^*)/2; and the average of all path lengths, i.e. the mean of the entries of h. In every case h_*(p_1, p_2) ≤ δ(p_1, p_2) ≤ h^*(p_1, p_2). A normalized distance is δ̄ = δ / H(𝒫).
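
A minimal sketch of the four pseudo-distances, assuming the vector of chain lengths h between two comparable nodes has already been computed. The input [2, 2, 3] and the height 5 are the values from the worked poset example; the function and key names are hypothetical.

```python
from statistics import mean

def pseudo_distances(h):
    """Four pseudo-distances derived from a vector of chain lengths h."""
    lo, hi = min(h), max(h)
    return {
        "min": lo,                       # minimum path length
        "max": hi,                       # maximum path length
        "avg_extremes": (lo + hi) / 2,   # average of the extreme path lengths
        "avg_all": mean(h),              # average of all path lengths
    }

def normalize(distance, height):
    """Normalized distance: the distance divided by the poset height."""
    return distance / height

d = pseudo_distances([2, 2, 3])
d_norm = {name: normalize(value, 5) for name, value in d.items()}
```

Every value in `d` lies between the minimum and maximum chain length, as the bound on the slide requires.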

16 16/25 2.2 Methods (3/4) A scoring function S_Y(p) returns the weighted rank of a node p ∈ P based on the requested nodes Y. There are two kinds of scores: an unnormalized score S_Y: P → ℝ^+, which returns an ‘absolute’ number, and a normalized score, which returns a ‘relative’ number.

17 17/25 2.2 Methods (4/4) A parameter s ∈ {…, -1, 0, 1, 2, 3, …} sets the trade-off: low s emphasizes coverage, and high s emphasizes specificity. Let r = 2^s; then we have four scoring functions, one for each combination: unnormalized distance and unnormalized score; unnormalized distance and normalized score; normalized distance and unnormalized score; normalized distance and normalized score.

18 18/25 Cluster heads are marked with +, and secondaries with -.

19 19/25 3. Expert validation (1/2) An experienced molecular immunologist constructed two non-overlapping lists of genes: KT1, a list of 242 genes involved in immune processes; and KT4, a list of 147 genes involved in cell–cell/cell–matrix interactions. KT1, KT4 and KT1 ∪ KT4 provided three queries for GOC into the biological process (BP) branch of the GO using δ_m, s = 7 and a single scoring function.

20 20/25 3. Expert validation (2/2) Two values were assessed. Utility (1 = low to 5 = high): did the cluster terms provide a useful description of a specific biological process? Expectation (1 = high to 5 = low): was the identified biological process expected for the genes in the query?

21 21/25

22 22/25 4. Formal validation (1/3) An independent source of annotations of collections of GO nodes is the InterPro project, which catalogs assignments of protein families, domains and functional sites to GO IDs. E.g. ‘phosphofructokinase’ is InterPro ID IPR000023, and is annotated to GO:0006096 = ‘glycolysis’, GO:0003872 = ‘6-phosphofructokinase activity’ and GO:0005945 = ‘6-phosphofructokinase complex’. It also maps to 175 proteins. Thus the validation task is to make these 175 proteins a GOC query, and see how well the cluster heads match against the set of GO IDs {GO:0006096, GO:0003872, GO:0005945}.

23 23/25 4. Formal validation (2/3) In the run, there were 4,866 InterPro IDs with GO annotations, with 11,370 mappings to GO nodes and 787,760 mappings to proteins in total. Of these proteins, the authors were able to locate 778,494, or >99%, with GO annotations.

24 24/25 4. Formal validation (3/3) Immediate family: child/parent/sibling. Extended family: grandparent/grandchild/cousin/aunt/uncle/niece/nephew
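
One way to check whether a cluster head falls in the immediate or extended family of a target GO node is sketched below. The toy DAG is hypothetical, and the way each kinship term is mapped onto shared parents and grandparents in a DAG is an assumption; the slide only lists the family names.

```python
# Hypothetical child -> parents map for a small DAG.
PARENTS = {
    "root": set(), "z": set(),
    "u": {"root"}, "v": {"root"},
    "p": {"u"}, "q": {"u"}, "w": {"v"},
}

def grandparents(n):
    """Parents of the parents of n."""
    return {g for par in PARENTS[n] for g in PARENTS[par]}

def relation(x, y):
    """Classify y relative to x: 'identical', 'immediate', 'extended' or 'unrelated'."""
    if x == y:
        return "identical"
    px, py = PARENTS[x], PARENTS[y]
    if y in px or x in py or px & py:       # parent, child, sibling
        return "immediate"
    gx, gy = grandparents(x), grandparents(y)
    if (y in gx or x in gy                  # grandparent, grandchild
            or py & gx or px & gy           # aunt/uncle, niece/nephew
            or gx & gy):                    # cousin
        return "extended"
    return "unrelated"
```

A cluster head could then be counted as a hit for an InterPro-derived target node when `relation` returns anything other than 'unrelated', with the category indicating how close the match is.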

25 25/25 5. Conclusions The GOC methodology provides a valid and useful approach to categorization in the GO. Future work: methodological development in combinatorial approaches to data analysis, including distances between non-comparable nodes, interval-valued measures of ‘level’ in posets, algorithms for poset width calculation and poset matching; expansion to other ontologies; and continuation of work in textual approaches, mapping back and forth from semantic relations among GO nodes to those among its lexical components.

