
1 Gene Ontology Javier Cabrera

2 Outline
Goal: How to identify biological processes or biochemical pathways that are changed by treatment.
Common procedure: select 'changed' genes and look for members of known function.

3 Annotations
Goal: How to identify biological processes or biochemical pathways that are changed by treatment.
Common procedure: select 'changed' genes and look for members of known function.

4 GO
Problem: moderate changes in many genes simultaneously will escape detection.
New approach:
- Start with a vocabulary of known GO categories or pathways (GO terms; Gene Ontology Consortium, 2000).
- Find GO categories that are differentially expressed as a group.
Other possible variations: look for chromosome locations, or protein domains, that are common among many genes that are changed.

5 GoMiner: Leverages the Gene Ontology (Zeeberg et al., Genome Biology 4:R28, 2003)

6 GO
How likely is it that the set of 'significant' genes includes as many genes from the category as you see?
Two-way table:

              Category   Others
  On list
  Not on list

Fisher exact test: handles small categories better.
How to deal with multiple categories?
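The Fisher exact test here reduces to a hypergeometric tail probability: given the totals, how likely is it that at least this many category genes land on the significant list? A minimal Python sketch follows; the counts in the example call (10,000 genes, a list of 500, a category of 40 with 6 hits) are made-up illustrations, not numbers from the slides.

```python
from math import comb

def hypergeom_upper_tail(k, n_list, n_cat, n_total):
    """P(X >= k): probability that a random list of n_list genes drawn from
    n_total genes contains at least k genes from a category of size n_cat."""
    total = comb(n_total, n_list)
    return sum(comb(n_cat, i) * comb(n_total - n_cat, n_list - i)
               for i in range(k, min(n_list, n_cat) + 1)) / total

# Hypothetical example: 40 of 10,000 genes are in the category; the
# 'significant' list has 500 genes, 6 of which fall in the category.
p = hypergeom_upper_tail(6, 500, 40, 10000)
```

Summing from k = 0 recovers the whole distribution, which is a quick sanity check that the tail is computed over the right support.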

7 P-values
- About 3,000 GO biological process categories
- Most overlap with some others
- p-values for categories are not independent
- Permutation test of all categories simultaneously in parallel

8 Data
Gene-level results: gene g has a statistic T_g and a p-value p_g, for g = 1, ..., G.
The genes are grouped into k categories: category i contains genes G_i1, ..., G_i,n_i with statistics T_i1, ..., T_i,n_i and p-values p_i1, ..., p_i,n_i.
Each category is then summarized by a single category-level p-value: P*_1, ..., P*_k.

9 Gene Set Expression Analysis
- Ignore for the moment the 'meaning' of the p-value: consider it just as a ranking of S/N (between-group difference relative to within-group variation).
- If we select a set of genes 'at random', then the ranking of its S/N ratios should be random, i.e. a sample from a uniform distribution.
- Adapt the standard Kolmogorov-Smirnov (K-S) test of distribution.
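The K-S adaptation can be sketched as follows: compute the one-sample K-S statistic of a gene set's p-values (or rank fractions) against Uniform(0,1). This is an illustrative Python sketch, not the authors' code, and the p-values in the examples are made up.

```python
def ks_uniform(pvals):
    """One-sample Kolmogorov-Smirnov statistic against Uniform(0,1):
    the largest gap between the empirical CDF of the values and the
    diagonal y = x."""
    x = sorted(pvals)
    n = len(x)
    d = 0.0
    for i, v in enumerate(x):
        d = max(d, (i + 1) / n - v, v - i / n)
    return d

# A gene set whose p-values pile up near 0 gives a large D;
# a set with roughly uniform p-values gives a small D.
d_enriched = ks_uniform([0.001, 0.003, 0.01, 0.02, 0.05])
d_flat = ks_uniform([0.1, 0.3, 0.5, 0.7, 0.9])
```

In practice the statistic for each gene set would still be calibrated by permutation, since set sizes differ and the gene-level p-values are correlated.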

10 Other Stats
(i) The "MEAN-LOG-P" statistic, calculated as mean(-log(p-value)).
(ii) The thresholded mean statistic ("LoMean"), calculated by setting all p-values above α equal to α (with, e.g., α = 0.25) and taking the arithmetic mean of the resulting values.
(iii) LoMLP, a hybrid of LoMean and MEAN-LOG-P, obtained by first setting all p-values above α equal to α (as in (ii) above) and then calculating mean(-log(p-value)).
(iv) HM, the harmonic mean of the p-values, obtained as the reciprocal of mean(1/p-value).
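For concreteness, the four category-level statistics can be written out directly (a Python sketch; the p-values in `pv` are made up):

```python
from math import log

def mean_log_p(p):                      # (i) MEAN-LOG-P
    return sum(-log(v) for v in p) / len(p)

def lo_mean(p, alpha=0.25):             # (ii) LoMean: threshold at alpha, then average
    return sum(min(v, alpha) for v in p) / len(p)

def lo_mlp(p, alpha=0.25):              # (iii) LoMLP: threshold, then MEAN-LOG-P
    return mean_log_p([min(v, alpha) for v in p])

def harmonic_mean_p(p):                 # (iv) HM: reciprocal of mean(1/p)
    return len(p) / sum(1.0 / v for v in p)

pv = [0.01, 0.20, 0.60, 0.90]           # hypothetical p-values for one category
```

Note that HM is dominated by the smallest p-value, while LoMean and LoMLP deliberately cap the influence of the large (uninformative) p-values.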

11 Continuous Tests
- Model: all genes in the group contribute roughly equally to the effect.
- Test: for each group G, compare the group statistic z to its permutation distribution.
- More sensitive under the model assumptions.
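A generic version of such a permutation test can be sketched as below. The slide's formula for z was in a graphic that did not survive, so taking z to be the mean gene-level score in the group is an assumption here (it is one common choice consistent with the "genes contribute roughly equally" model); the scores are made up.

```python
import random

def permutation_pvalue(scores, group_idx, n_perm=2000, seed=0):
    """Permutation test for a gene group: is the mean score of the group's
    genes larger than that of a random group of the same size?"""
    rng = random.Random(seed)
    m = len(group_idx)
    observed = sum(scores[i] for i in group_idx) / m
    hits = 0
    for _ in range(n_perm):
        z = sum(rng.sample(scores, m)) / m   # mean of a random group of size m
        if z >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)         # add-one correction

# Hypothetical gene-level scores: the first 5 genes form a high-scoring group
scores = [5.0] * 5 + [0.0] * 95
p_group = permutation_pvalue(scores, range(5))
p_null = permutation_pvalue(scores, range(95, 100))
```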

12 GO Ontology: Conditioning on N
[Figure: Abs(T) plotted against Log(n)]

13 Cluster Analysis
Cluster analysis: group the observations into k distinct natural groups. We have a dataset with n observations and we want to group them into k distinct natural groups of similar observations. We distinguish three stages of cluster analysis: input stage, algorithm stage, output stage.
Input Stage
1. Scaling:
a) Divide variables by the standard deviation.
b) Spherize the data, for invariance under affine transformations: Z = A Y, with A = Chol(S)^-1 or the symmetric square root S^-1/2.
c) Spherize the data with the within variance: T = W + B; to obtain W, use iteration.
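Scaling step (a) can be sketched in a few lines (illustrative Python on made-up data; the sphering steps (b)-(c) would additionally multiply the centered data by Chol(S)^-1 or S^-1/2):

```python
from math import sqrt

def standardize(rows):
    """Input-stage scaling step (a): center each variable (column) and
    divide it by its sample standard deviation.  rows: list of observations."""
    n = len(rows)
    cols = list(zip(*rows))
    scaled = []
    for c in cols:
        mu = sum(c) / n
        sd = sqrt(sum((v - mu) ** 2 for v in c) / (n - 1))
        scaled.append([(v - mu) / sd for v in c])
    return [list(r) for r in zip(*scaled)]     # back to row-per-observation

data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]  # made-up
z = standardize(data)
```

After this step every variable has unit standard deviation, so no single variable dominates the inter-point distances.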

14 2. Similarity and dissimilarity measures.
Clustering methods require the definition of a similarity or dissimilarity measure; an inter-point distance d(x1, x2) and an inter-cluster distance d*(C1, C2) are examples of dissimilarities. The inter-point distance is often taken to be the Euclidean distance or the Mahalanobis distance; sometimes we may use the Manhattan distance. When the data are not metric we may define any distance or similarity measure from characteristics of the problem. For example, for binary data, given any two vector observations we construct the 2x2 table of agreement counts (a = both 1, b = first only, c = second only, d = both 0, with p = a+b+c+d). Then we may define distance as the square root of the χ² statistic, or Dist = 1 - (a+d)/p, or Dist = 1 - a/(a+b+c).
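The two binary-data dissimilarities can be computed directly from the a, b, c, d counts (illustrative Python; the vectors are made up):

```python
def binary_distances(x, y):
    """Dissimilarities between two binary vectors from the 2x2 match table:
    a = both 1, b = x only, c = y only, d = both 0, p = a+b+c+d."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    p = a + b + c + d
    simple_matching = 1 - (a + d) / p          # Dist = 1 - (a+d)/p
    jaccard = 1 - a / (a + b + c)              # Dist = 1 - a/(a+b+c)
    return simple_matching, jaccard

sm, jac = binary_distances([1, 1, 0, 0], [1, 0, 1, 0])
```

The second measure ignores the joint absences d, which is often preferable when a 0 merely means "gene not annotated / feature absent".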

15 Hierarchical clustering: build a hierarchical tree
- Inter-point distance is normally the Euclidean distance (sometimes the Manhattan distance).
- Inter-cluster distance:
  Single linkage: distance between the closest two points.
  Complete linkage: distance between the furthest two points.
  Average linkage: average distance between every pair of points.
  Ward: change in R².
- Build the hierarchical tree:
  1. Start with a cluster at each sample point.
  2. At each stage of building the tree, the two closest clusters join to form a new cluster.
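The tree-building loop can be sketched naively for single linkage (illustrative Python on made-up points; a real analysis would use `hclust` in R or `scipy.cluster.hierarchy`):

```python
from math import dist

def single_linkage(points, k):
    """Naive agglomerative clustering: start with one cluster per point and
    repeatedly join the two closest clusters until k clusters remain.
    Single linkage: inter-cluster distance = closest pair across clusters."""
    clusters = [[i] for i in range(len(points))]

    def cdist(c1, c2):
        return min(dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > k:
        a, b = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cdist(clusters[ij[0]], clusters[ij[1]]))
        clusters[a].extend(clusters.pop(b))    # join the two closest clusters
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
groups = single_linkage(pts, 2)
```

Swapping `min` for `max` in `cdist` gives complete linkage, and averaging gives average linkage; the merge loop is identical.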

16 At any stage we construct the dissimilarity matrix (or distance matrix) of inter-cluster distances between every pair of clusters. We build the hierarchical tree starting with a cluster at each sample point; at each stage of the tree we rebuild the dissimilarity matrix and the two closest clusters join to form a new cluster.
Once we finish building the tree the question becomes: how many clusters do we choose? One way of making this determination is by inspecting the hierarchical tree and finding a reasonable point to break the clusters. We can also plot the criterion function for different numbers of clusters and visually look for unusually large jumps. In the example below, with Ward's clustering method, we stop at the first place where the R² change (percent-wise) is large.
[Table: Ward cluster-join history (CL45, CL25, CL23, CL8, CL17, CL9, ...) with the R² change at each join]

17 Hierarchical Cluster Example
[Figure: dendrogram of SAMPLE 1 through SAMPLE 7]

18 Non-hierarchical clustering: centroid methods
k-means algorithm. We start with a choice of k clusters and a choice of distance.
a. Determine the initial set of k clusters: k seed points are chosen and the data are distributed among the k clusters.
b. Calculate the centroids of the k clusters and move each point to the cluster whose centroid is closest.
c. Repeat step b until no change is observed.
This is the same as optimizing the R² criterion: at each stage of the algorithm one point is moved to the cluster that improves the criterion function, and this is iterated until convergence occurs. The final configuration has some dependence on the initial configuration, so it is important to choose a good start. One possibility is to run Ward's method and use the outcome as the initial configuration for k-means.

19 Centroid methods: k-means algorithm
1. k seed points are chosen and the data are distributed among k clusters.
2. At each step we switch a point from one cluster to another if this increases R².
3. The clusters are slowly optimized by switching points until no improvement of R² is possible.
[Figure: cluster configurations at step 1, step 2, ..., step n]
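The k-means steps above can be sketched as follows (illustrative Python on made-up, well-separated points; the slides' own analyses use R/SAS):

```python
import random
from math import dist

def kmeans(points, k, n_iter=100, seed=1):
    """Basic k-means: pick k seed points, assign every point to the nearest
    centroid, recompute the centroids, and repeat until nothing moves."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # k seed points
    clusters = []
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                       # assign to nearest centroid
            nearest = min(range(k), key=lambda j: dist(p, centroids[j]))
            clusters[nearest].append(p)
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[j]
               for j, cl in enumerate(clusters)]
        if new == centroids:                   # converged: assignments stable
            break
        centroids = new
    return centroids, clusters

# Two well-separated groups of three points each
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, groups = kmeans(pts, 2)
```

As the slides note, the result depends on the seeds; a Ward solution (or several random restarts) makes a more reliable starting configuration.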

20 Non-hierarchical clustering: PAM
PAM is a robust version of k-means: it uses medoids as the centers and the L1 (Manhattan) distance, and is otherwise the same as k-means. The R package cluster contains the pam function.
Model-based hierarchical clustering
Another approach to hierarchical clustering is model-based clustering, which is based on the assumption that the data are generated by a mixture of underlying probability distributions. The mclust function fits model-based clustering models; it also fits models based on heuristic criteria similar to those used by pam. The R package mclust and the function of the same name are available from CRAN. The mclust function is separate from the cluster library and has somewhat different semantics from the methods discussed previously.

21 Detecting the number of clusters: silhouette graphs
library(cluster)
data(ruspini)
plot(silhouette(pam(ruspini, k = 4)), main = paste("k =", 4), do.n.k = FALSE)

22 ABC Clustering
1. A bootstrap approach called ABC: refers to the Bagging of genes and samples from microarray data. Genes are bagged using weights proportional to their variances.
2. By creating new datasets out of subsets of samples (columns) and genes we are able to estimate the class response several hundred times.
3. These estimates are then used to obtain a dissimilarity (distance) measure between the samples of the original data.
4. This dissimilarity matrix is then used to cluster the data.

23 [Figure: ABC flow chart — select n samples and g genes from the Data, compute similarity across runs, final clusters {S1, S2, S3, S4} and {S5, S6}]
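The ABC idea can be sketched end-to-end in a few lines (illustrative Python on made-up data). Two simplifications are assumptions of the sketch, not the slides' method: genes are sampled uniformly rather than weighted by variance, and the inner clustering step is a crude two-group split rather than a real clusterer.

```python
import random
from math import dist

def abc_dissimilarity(samples, n_boot=200, seed=0):
    """Sketch of the ABC idea: repeatedly bag (subsample) the genes, cluster
    the samples on each bagged dataset, and define the dissimilarity between
    two samples as the fraction of runs in which they did NOT cluster together.
    samples: one expression vector per sample (entries = genes)."""
    rng = random.Random(seed)
    n, g = len(samples), len(samples[0])
    together = [[0] * n for _ in range(n)]
    for _ in range(n_boot):
        genes = rng.sample(range(g), max(1, g // 2))    # bag half the genes
        sub = [[s[j] for j in genes] for s in samples]
        # Crude two-group split standing in for a real clusterer:
        # seed with the two most distant samples, assign the rest to the nearer.
        a, b = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                   key=lambda ij: dist(sub[ij[0]], sub[ij[1]]))
        labels = [0 if dist(s, sub[a]) <= dist(s, sub[b]) else 1 for s in sub]
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    together[i][j] += 1
    return [[1 - together[i][j] / n_boot for j in range(n)] for i in range(n)]

# Made-up data: samples 0-1 resemble each other, as do samples 2-3
samples = [[0.0] * 6, [0.1] * 6, [5.0] * 6, [5.1] * 6]
D = abc_dissimilarity(samples)
```

The resulting matrix D is then fed to any clustering method (hierarchical, PAM, ...) in place of the raw Euclidean distances.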

24 Examples
[Table: comparison of BagWeight, BagEquiWeight, BagWholeData, NoBagWeight, NoBagEquiWeight, Ward and Kmeans on the Armstrong, Colon, Tao, Golub and Iris data sets; the numerical entries are not recoverable]
For each data set: # genes selected = [?]G, # simulations = 500. Genes bagged by variance.

25 Histogram of P-values