The Gene Ontology Categorizer C.A. Joslyn 1, S.M. Mniszewski 1, A. Fulmer 2 and G. Heaton 3 1 Computer and Computational Sciences, Los Alamos National.


The Gene Ontology Categorizer C.A. Joslyn 1, S.M. Mniszewski 1, A. Fulmer 2 and G. Heaton 3 1 Computer and Computational Sciences, Los Alamos National Laboratory, 2 Corporate Biotechnology, Miami Valley Labs and 3 Corporate Functions-IT, Procter & Gamble, USA (Bioinformatics, Vol. 20, Suppl. 1, 2004, p. i169-i177)

2/25 Abstract (1/2) Given a list of genes of interest, what are the best nodes of the GO to summarize or categorize that list? In a drug discovery process, we wish to understand the overall effect of some cell treatment or condition by identifying ‘where’ in the GO the differentially expressed genes fall.

3/25 Abstract (2/2) View bio-ontologies more as combinatorially structured databases than facilities for logical inference, and draw on the discrete mathematics of finite partially ordered sets (posets) to develop data representation and algorithms appropriate for the GO. Issues: categorization task, distances in ontologies and ontology merger and exchange.

4/25 1. Introduction (1/3) When a gene expression experiment uses high-throughput microarrays, a biomedical researcher needs to extract useful information on the types of biological processes affected in the experiment. The categorization task arises from the researcher wanting to take the names of some genes and gain an understanding of their overall function by examining their distribution through the GO: are they localized, grouped in distinct areas or spread uniformly?

5/25 1. Introduction (2/3) The Gene Ontology Categorizer (GOC) applies novel research in the discrete mathematics of posets for semantic hierarchies to GO analysis. Represent the GO as a poset ontology, then use pseudo-distances between comparable nodes to develop scoring functions. Finally, cluster the resulting rank-ordered list to produce a ranked list of appropriate summarizing nodes within the GO, which act as functional hypotheses about the characteristics of the genes expressed.

6/25 1. Introduction (3/3) GO analysis weaknesses: Many researchers consider the GO simply as a list of categories, ignoring any structural relationships among the categories. Even those researchers whose treatment is closest in spirit to the authors' consider the GO primarily as a tree, or cast it as a graph for determining distances between nodes.

7/25 2. Methodology (1/2) A finite partially ordered set (poset) is a mathematical structure $\mathcal{P} = \langle P, \leq \rangle$, where $P$ is a finite set and $\leq\ \subseteq P^2$ is a reflexive, anti-symmetric, transitive binary relation on $P$. Every poset is a digraph with no cycles, and posets are more general than trees or lattices in that nodes can have multiple parents. The GO is a pair of directed acyclic graphs (DAGs), one for the is-a links and one for the has-part links.
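The three axioms above can be checked mechanically on a small relation. The following is a minimal sketch, not code from the paper: the helper name `is_partial_order` and the node ‘metabolism’ are assumptions made for illustration, and the relation is stored as an explicit set of (a, b) pairs meaning a ≤ b.

```python
from itertools import product

def is_partial_order(elements, leq):
    """Return True if `leq`, a set of (a, b) pairs meaning a <= b,
    is reflexive, anti-symmetric and transitive on `elements`."""
    reflexive = all((a, a) in leq for a in elements)
    antisymmetric = all(
        a == b or not ((a, b) in leq and (b, a) in leq)
        for a, b in product(elements, repeat=2)
    )
    transitive = all(
        (a, c) in leq
        for (a, b) in leq
        for (b2, c) in leq
        if b == b2
    )
    return reflexive and antisymmetric and transitive

# Hypothetical three-node fragment; 'metabolism' is assumed for illustration.
nodes = {"DNA repair", "DNA metabolism", "metabolism"}
order = {(n, n) for n in nodes} | {
    ("DNA repair", "DNA metabolism"),
    ("DNA metabolism", "metabolism"),
    ("DNA repair", "metabolism"),
}
with_cycle = order | {("metabolism", "DNA repair")}

print(is_partial_order(nodes, order))
print(is_partial_order(nodes, with_cycle))
```

The second relation adds a back-link that closes a cycle, so it fails the anti-symmetry test and is not a poset.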

8/25

9/25 2. Methodology (2/2) $P_{GO}$ is the set of nodes such as ‘DNA unwinding’ and ‘DNA replication’. The ordering $\leq$ in ‘DNA repair $\leq$ DNA metabolism’ represents that DNA repair is a kind of DNA metabolism. The GO, cast as a pair of posets $\mathcal{P}_{is} = \langle P_{GO}, \leq_{is} \rangle$ and $\mathcal{P}_{has} = \langle P_{GO}, \leq_{has} \rangle$ for the two kinds of relations, is a large, taxonomically organized semantic hierarchy. This paper treats the two kinds of links as equivalent: $\mathcal{P}_{GO} = \langle P_{GO}, \leq_{GO} \rangle$, where $\leq_{GO} = \leq_{is} \cup \leq_{has}$.
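Merging the two link types can be sketched as a union of edge sets followed by a reflexive-transitive closure, so that chains mixing is-a and has-part links are also ordered. This is an illustrative sketch only: the function name and all three links below are assumptions, and only the node names come from the slides.

```python
def reflexive_transitive_closure(nodes, edges):
    """Return every (a, b) with a <= b implied by the child -> parent `edges`."""
    leq = {(n, n) for n in nodes} | set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(leq):
            for c, d in list(leq):
                if b == c and (a, d) not in leq:
                    leq.add((a, d))
                    changed = True
    return leq

# Hypothetical links, assumed for illustration only.
is_a = {
    ("DNA repair", "DNA metabolism"),
    ("DNA replication", "DNA metabolism"),
}
has_part = {("DNA unwinding", "DNA replication")}

nodes = {n for edge in is_a | has_part for n in edge}
leq_is = reflexive_transitive_closure(nodes, is_a)
leq_go = reflexive_transitive_closure(nodes, is_a | has_part)

# The mixed chain is only ordered once the two link types are merged.
print(("DNA unwinding", "DNA metabolism") in leq_is)
print(("DNA unwinding", "DNA metabolism") in leq_go)
```

The closure step matters because the plain union of two partial orders is not guaranteed to be transitive.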

10/25 Poset theory (1/3) Two nodes $p_1, p_2 \in P$ are comparable, denoted $p_1 \sim p_2$, if either $p_1 \leq p_2$ or $p_2 \leq p_1$. A chain $C \subseteq P$ is a collection of pairwise comparable nodes. Height $H(\mathcal{P})$ is the size of the largest chain. Two nodes $p_1, p_2 \in P$ are non-comparable if $p_1 \not\sim p_2$. An antichain is a collection of pairwise non-comparable nodes. Width $W(\mathcal{P})$ is the size of the largest antichain.
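For a very small poset, comparability, height and width can be computed directly from these definitions. The sketch below uses brute force over subsets, so it is exponential and only suitable for toy examples. The four-node ‘diamond’ poset and all function names are assumptions for illustration.

```python
from itertools import combinations

def comparable(leq, p, q):
    """p ~ q: the pair is ordered one way or the other."""
    return (p, q) in leq or (q, p) in leq

def largest_subset(nodes, pair_ok):
    """Size of the largest subset whose pairs all satisfy `pair_ok` (brute force)."""
    for k in range(len(nodes), 0, -1):
        for subset in combinations(sorted(nodes), k):
            if all(pair_ok(p, q) for p, q in combinations(subset, 2)):
                return k
    return 0

def height(nodes, leq):
    """Number of nodes in the largest chain."""
    return largest_subset(nodes, lambda p, q: comparable(leq, p, q))

def width(nodes, leq):
    """Number of nodes in the largest antichain."""
    return largest_subset(nodes, lambda p, q: not comparable(leq, p, q))

# Hypothetical diamond: a below b and c, which are both below d.
nodes = {"a", "b", "c", "d"}
leq = {(n, n) for n in nodes} | {
    ("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("a", "d"),
}

print(comparable(leq, "b", "c"))
print(height(nodes, leq), width(nodes, leq))
```

In the diamond, b and c are non-comparable, the largest chain has three nodes (for example a, b, d), and the largest antichain is {b, c}.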

11/25 Poset theory (2/3) Given two comparable nodes $p_1 \leq p_2$, the set of all nodes ‘between’ them is the interval $[p_1, p_2] = \{p : p_1 \leq p \leq p_2\}$, which is equivalent to the set of all chains between $p_1$ and $p_2$, denoted $C(p_1, p_2)$. The vector of chain lengths $h(p_1, p_2) = |C(p_1, p_2)|$ is the collection of the lengths of all these chains. The minimum and maximum chain lengths between $p_1$ and $p_2$ are $h_*(p_1, p_2) = \min_{C \in C(p_1, p_2)} |C|$ and $h^*(p_1, p_2) = \max_{C \in C(p_1, p_2)} |C|$, respectively.

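12/25 Poset theory (3/3) Example: $P = \{1, A, B, \ldots, K\}$. $B$ and $J$ are non-comparable, while $A$ and $B$ are comparable ($A \leq B$). $[A, B] = \{A, F, G, H, I, B\}$ consists of the three chains $C(A, B) = \{A \leq F \leq B,\ A \leq G \leq B,\ A \leq H \leq I \leq B\}$. $h(A, B) = \langle 2, 2, 3 \rangle$ with $h_*(A, B) = 2$ and $h^*(A, B) = 3$. $H(\mathcal{P}) = 5$ (a maximal chain is $D \leq E \leq I \leq C \leq 1$) and $W(\mathcal{P}) = 5$ (the largest antichain is $\{F, G, H, E, J\}$).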
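The worked example can be reproduced with a short chain enumeration. This sketch assumes a child-to-parent link map containing only the links read off the three chains quoted on the slide (the rest of the figure is not reproduced), and it assumes chain length is counted in links, which matches the slide's minimum of 2 and maximum of 3. The function name `chains_between` is hypothetical.

```python
def chains_between(parents, lo, hi):
    """All chains from `lo` up to `hi`, following child -> parent links."""
    if lo == hi:
        return [[lo]]
    found = []
    for parent in parents.get(lo, ()):
        for tail in chains_between(parents, parent, hi):
            found.append([lo] + tail)
    return found

# Links taken from the chains A<=F<=B, A<=G<=B and A<=H<=I<=B.
parents = {
    "A": ["F", "G", "H"],
    "F": ["B"],
    "G": ["B"],
    "H": ["I"],
    "I": ["B"],
}

chains = chains_between(parents, "A", "B")
h = sorted(len(chain) - 1 for chain in chains)  # lengths counted in links
interval = set().union(*chains)                 # nodes of the interval [A, B]

print(chains)
print(h, min(h), max(h))
print(sorted(interval))
```

The union of the nodes on all chains recovers the interval [A, B], which is the equivalence the previous slide states.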

13/25 Poset statistics of the GO

14/25 Methods (1/4) Define a POSet Ontology (POSO) as $O = \langle \mathcal{P}, X, F \rangle$, where $X$ is a finite, non-empty set of labels, and $F : X \to 2^P$ is an annotation function mapping each label $x \in X$ to a collection of nodes $F(x) \subseteq P$. E.g. $X = \{a, b, \ldots, j\}$, $F(b) = \{A, E, F\}$. In GOC, $O_{GO} = \langle \mathcal{P}_{GO}, X_{GO}, F_{GO} \rangle$, where the gene products $X_{GO}$ and annotations $F_{GO}$ are provided by the GO file.
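One way to hold a POSO in memory is a link map for the poset plus a dictionary for the annotation function. The sketch below is an assumption about representation, not the GOC implementation: the class and method names are hypothetical, the links are read off the chains quoted in the poset example, F(b) is the slide's example, and F(a) is invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class PosetOntology:
    parents: dict      # node -> parent nodes (links of the poset)
    annotations: dict  # annotation function F: label x -> set of nodes F(x)

    def up_set(self, node):
        """All nodes q with node <= q: the node itself plus its ancestors."""
        seen, stack = set(), [node]
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(self.parents.get(current, ()))
        return seen

    def labels_under(self, node):
        """Labels x having at least one annotation q in F(x) with q <= node."""
        return {
            label
            for label, annotated in self.annotations.items()
            if any(node in self.up_set(q) for q in annotated)
        }

ontology = PosetOntology(
    parents={
        "A": {"F", "G", "H"}, "F": {"B"}, "G": {"B"}, "H": {"I"},
        "D": {"E"}, "E": {"I"}, "I": {"B", "C"}, "C": {"1"},
    },
    annotations={
        "b": {"A", "E", "F"},  # example from the slide
        "a": {"D"},            # hypothetical, for illustration
    },
)

print(sorted(ontology.labels_under("F")))
print(sorted(ontology.labels_under("I")))
```

Node I sits above annotations of both labels, while F only covers label b, which is the kind of coverage information a categorizer needs when ranking candidate nodes.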

15/25 Methods (2/4) A pseudo-distance function $\delta : P^2 \to \mathbb{R}$ can be chosen in four ways: the minimum path length $\delta_m = h_*$; the maximum path length $\delta_x = h^*$; the average of extreme path lengths; and the average of all path lengths. Each satisfies $h_*(p_1, p_2) \leq \delta(p_1, p_2) \leq h^*(p_1, p_2)$. A normalized distance is $\bar{\delta} = \delta / H(\mathcal{P})$.
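Given a vector of chain lengths, the four pseudo-distances reduce to simple summary statistics. The sketch below assumes both averages are plain arithmetic means, since the slide does not spell out their formulas. The function and key names are hypothetical. The input values are the chain lengths 2, 2, 3 and the height 5 from the poset example.

```python
from statistics import mean

def pseudo_distances(chain_lengths, height=None):
    """Four pseudo-distances from a vector of chain lengths.

    If `height` is given, each value is divided by it (normalized distance).
    """
    shortest, longest = min(chain_lengths), max(chain_lengths)
    distances = {
        "min": shortest,
        "max": longest,
        "avg_extremes": (shortest + longest) / 2,  # assumed arithmetic mean
        "avg_all": mean(chain_lengths),            # assumed arithmetic mean
    }
    if height is not None:
        distances = {name: value / height for name, value in distances.items()}
    return distances

raw = pseudo_distances([2, 2, 3])
normalized = pseudo_distances([2, 2, 3], height=5)

print(raw)
print(normalized)
```

Both averages fall between the minimum and maximum by construction, so every choice respects the stated bounds.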

16/25 Methods (3/4) A scoring function $S_Y(p)$ returns the weighted rank of a node $p \in P$ based on the requested nodes $Y$. Two kinds of scores: an unnormalized score $S_Y : P \to \mathbb{R}^+$, which returns an ‘absolute’ number, and a normalized score, which returns a ‘relative’ number.

17/25 Methods (4/4) $s \in \{\ldots, -1, 0, 1, 2, 3, \ldots\}$, where low $s$ emphasizes coverage and high $s$ emphasizes specificity. Let $r = 2^s$; then there are four scoring functions, one for each combination of distance and score: unnormalized distance and unnormalized score; unnormalized distance and normalized score; normalized distance and unnormalized score; normalized distance and normalized score.
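The transcript does not carry the four scoring formulas, so the sketch below is explicitly not the paper's scoring function. It is a hypothetical inverse-distance score, assumed here only to illustrate how an exponent r = 2^s can trade coverage against specificity. The poset, the gene labels and the distance table are all invented.

```python
def illustrative_score(node, query, annotations, dist, s):
    """Hypothetical score, NOT the paper's formula: each annotation q <= node
    of a queried label contributes 1 / (1 + distance(q, node)) ** r, r = 2 ** s."""
    r = 2 ** s
    total = 0.0
    for label in query:
        for q in annotations.get(label, ()):
            d = dist.get((q, node))  # None when q is not below node
            if d is not None:
                total += 1.0 / (1.0 + d) ** r
    return total

# Invented poset: leaf1 <= mid <= root and leaf2 <= root.
dist = {
    ("leaf1", "leaf1"): 0, ("leaf1", "mid"): 1, ("leaf1", "root"): 2,
    ("leaf2", "leaf2"): 0, ("leaf2", "root"): 1,
}
annotations = {"g1": {"leaf1"}, "g2": {"leaf2"}}
query = ["g1", "g2"]

for s in (-3, 3):
    scores = {
        node: illustrative_score(node, query, annotations, dist, s)
        for node in ("leaf1", "mid", "root")
    }
    print(s, max(scores, key=scores.get))
```

With a low s the distance penalty is nearly flat, so the root wins by covering both genes. With a high s distant annotations contribute almost nothing, so the specific node annotated directly wins.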

18/25 Cluster heads are marked with +, and secondaries with -.

19/25 3. Expert validation (1/2) An experienced molecular immunologist constructed two non-overlapping lists of genes: KT1, a list of 242 genes involved in immune processes; and KT4, a list of 147 genes involved in cell-cell/cell-matrix interactions. KT1, KT4 and KT1 ∪ KT4 provided three queries for GOC into the BP branch of the GO using $\delta_m$, $s = 7$ and scoring function.

20/25 3. Expert validation (2/2) Two assessed values: Utility (1 = low to 5 = high): Did the cluster terms provide a useful description of a specific biological process? Expectation (1 = high to 5 = low): Was the identified biological process expected for the genes in the query?

21/25

22/25 4. Formal validation (1/3) An independent source of annotations of collections of GO nodes: the InterPro project, which catalogs assignments of protein families, domains and functional sites to GO IDs. E.g. ‘phosphofructokinase’ is InterPro ID IPR000023, and is annotated to GO: = ‘glycolysis’, GO: = ‘6-phosphofructokinase activity’, and GO: = ‘6-phosphofructokinase complex’. It also maps to 175 proteins. Thus the validation task is to make these 175 proteins a GOC query, and see how well cluster heads match against the set of GO IDs {GO: , GO: , GO: }.

23/25 4. Formal validation (2/3) In the run, there were 4,866 InterPro IDs with GO annotations, with 11,370 mappings to GO nodes and 787,760 mappings to proteins in total. Of these proteins, they were able to locate , or >99% with GO annotations.

24/25 4. Formal validation (3/3) Immediate family: child/parent/sibling. Extended family: grandparent/grandchild/cousin/aunt/uncle/niece/nephew
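The family vocabulary can be made concrete with a classifier over child-to-parent links. The sketch below is one plausible reading of those terms rather than the authors' procedure: the function name, the precedence order used when several relations apply (possible in a DAG with multiple parents), and the example hierarchy are all assumptions.

```python
def family_relation(parents, a, b):
    """Describe how node `b` relates to node `a`, given child -> parents links."""
    children = {}
    for child, node_parents in parents.items():
        for parent in node_parents:
            children.setdefault(parent, set()).add(child)

    def up(nodes):
        return set().union(*(set(parents.get(n, ())) for n in nodes))

    def down(nodes):
        return set().union(*(children.get(n, set()) for n in nodes))

    if a == b:
        return "identical"
    par, chi = up({a}), down({a})
    sib = down(par) - {a}
    aunts = down(up(par)) - par
    checks = [                      # first match wins (assumed precedence)
        ("parent", par), ("child", chi), ("sibling", sib),
        ("grandparent", up(par)), ("grandchild", down(chi)),
        ("aunt/uncle", aunts), ("niece/nephew", down(sib)),
        ("cousin", down(aunts)),
    ]
    for name, members in checks:
        if b in members:
            return name
    return "unrelated"

IMMEDIATE = {"parent", "child", "sibling"}

# Hypothetical hierarchy for illustration.
parents = {
    "p1": {"root"}, "p2": {"root"},
    "x": {"p1"}, "y": {"p1"}, "z": {"p2"}, "w": {"x"},
}

for other in ("p1", "y", "root", "p2", "z"):
    relation = family_relation(parents, "x", other)
    group = "immediate" if relation in IMMEDIATE else "extended"
    print(other, relation, group)
```

A cluster head could then be counted as a hit at the immediate or extended level depending on which relation links it to an expected GO node.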

25/25 5. Conclusions The GOC methodology provides a valid and useful approach to categorization in the GO. Future work: methodological development in combinatorial approaches to data analysis, including distances between non-comparable nodes, interval-valued measures of ‘level’ in posets, algorithms for poset width calculation and poset matching; expansion to other ontologies; and continuation of work in textual approaches, mapping back and forth from semantic relations among GO nodes to those among its lexical components.