Cluster Analysis of Gene Expression Profiles

Identifying groups of genes that exhibit similar expression "behavior" across a number of experimental conditions.
Assuming that such "co-expression" will tell us something about how these genes are regulated, or possibly even something about their function (Functional Genomics).
Using information from multiple genes at a time, as opposed to the single-gene-at-a-time analysis we have done so far.
We can also cluster biological samples based on the expression of some or all of the genes.
Example: identifying groups of molecularly similar tumors ("molecular phenotyping").
This is "unsupervised learning" in computer science lingo.
1-12-2006

Cluster analysis
source("http://eh3. uc
Often a large portion of genes are "not interesting"; what "not interesting" means depends on the context.
Possibly we are interested only in genes whose expression is not constant across all experimental conditions.
To remove "non-interesting" genes one can apply a "variation filter".
The various sorts of "filtering" of "non-interesting" genes generally amount to performing some kind of informal statistical testing with a very low confidence.
For now, we will just play with our data; some more exciting examples will follow.
We have six measurements for each gene and will try to cluster genes and experimental conditions using these data.
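
A "variation filter" can be sketched in a few lines of R. This is only an illustration under stated assumptions: the matrix `expr`, the gene names and the cutoff `sd_cutoff` are all hypothetical, and the standard deviation is just one possible measure of variation.

```r
# Hypothetical expression matrix: 5 genes (rows) x 6 samples (columns)
expr <- rbind(
  g1 = c(5.0, 5.1, 4.9, 5.0, 5.1, 5.0),
  g2 = c(2.0, 6.0, 2.5, 6.5, 1.8, 6.2),
  g3 = c(7.0, 7.0, 7.1, 6.9, 7.0, 7.0),
  g4 = c(3.0, 3.2, 9.0, 8.8, 3.1, 9.1),
  g5 = c(4.0, 4.1, 4.0, 3.9, 4.0, 4.1)
)

# Per-gene standard deviation across all samples
gene_sd <- apply(expr, 1, sd, na.rm = TRUE)

# Keep only genes whose variation exceeds an arbitrary, assumed cutoff
sd_cutoff <- 0.5
keep <- gene_sd > sd_cutoff
filtered <- expr[keep, , drop = FALSE]

print(rownames(filtered))
```

The cutoff is a judgment call, which is why such a filter behaves like an informal test rather than a formal one.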

Cluster analysis
> load(url("http://eh3.uc.edu/teaching/cfg/2006/data/SimpleData.RData"))
> # Columns belonging to the Nic and Ctl treatments
> Nic<-grep("Nic",dimnames(SimpleData)[[2]])
> Ctl<-grep("Ctl",dimnames(SimpleData)[[2]])
> # Per-gene means, variances and numbers of non-missing observations
> MNic<-apply(SimpleData[,Nic],1,mean,na.rm=TRUE)
> VNic<-apply(SimpleData[,Nic],1,var,na.rm=TRUE)
> MCtl<-apply(SimpleData[,Ctl],1,mean,na.rm=TRUE)
> VCtl<-apply(SimpleData[,Ctl],1,var,na.rm=TRUE)
> NNic<-apply(!is.na(SimpleData[,Nic]),1,sum,na.rm=TRUE)
> NCtl<-apply(!is.na(SimpleData[,Ctl]),1,sum,na.rm=TRUE)
> # Pooled variance, degrees of freedom and two-sided t-test p-value
> VNicCtl<-(((NNic-1)*VNic)+((NCtl-1)*VCtl))/(NCtl+NNic-2)
> DF<-NNic+NCtl-2
> TStat<-abs(MNic-MCtl)/((VNicCtl*((1/NNic)+(1/NCtl)))^0.5)
> TPvalue<-2*pt(TStat,DF,lower.tail=FALSE)
> # Genes significant at the 0.001 level
> SigGenes<-(TPvalue<0.001)
> sum(SigGenes)
[1] 7
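
The per-gene calculation above is the pooled-variance two-sample t-test. As a sanity check, the sketch below repeats it for a single gene and compares the result with R's built-in `t.test(..., var.equal = TRUE)`. The values in `nic` and `ctl` are made up for illustration and are not taken from SimpleData.

```r
# Hypothetical measurements for one gene: 3 Nic and 3 Ctl samples
nic <- c(8.1, 8.4, 8.0)
ctl <- c(7.2, 7.0, 7.5)

n_nic <- length(nic)
n_ctl <- length(ctl)

# Pooled variance and t statistic, mirroring the formulas in the transcript
pooled_var <- ((n_nic - 1) * var(nic) + (n_ctl - 1) * var(ctl)) / (n_nic + n_ctl - 2)
t_stat <- abs(mean(nic) - mean(ctl)) / sqrt(pooled_var * (1 / n_nic + 1 / n_ctl))
p_manual <- 2 * pt(t_stat, df = n_nic + n_ctl - 2, lower.tail = FALSE)

# Same test using the built-in function with equal variances assumed
p_builtin <- t.test(nic, ctl, var.equal = TRUE)$p.value

print(c(manual = p_manual, builtin = p_builtin))
```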

Cluster analysis
> library(marray)
> library(mclust)
> pal<-maPalette(low="green", high="red", mid="black")
> MinExp<-min(SimpleData[SigGenes,2:7])
> MaxExp<-max(SimpleData[SigGenes,2:7])
> heatmap(data.matrix(SimpleData[SigGenes,2:7]),Colv=NA,Rowv=NA,col=pal,labRow=as.character(SimpleData[SigGenes,1]),scale="none")
> maColorBar(seq(MinExp,MaxExp,(MaxExp-MinExp)/5), col=pal, horizontal=FALSE, k=5)

Cluster analysis
> heatmap(data.matrix(SimpleData[SigGenes,2:7]),col=pal,labRow=as.character(SimpleData[SigGenes,1]),scale="none")
Genes were selected based on their differences between the Nic and Ctl treatments, but these differences are not obvious in the heatmap except for one gene.

Cluster analysis - centered data
> # Subtract each gene's mean across the six measurements
> CenteredData<-SimpleData[,2:7]-apply(SimpleData[,2:7],1,mean,na.rm=TRUE)
> heatmap(data.matrix(CenteredData[SigGenes,]),col=pal,labRow=as.character(SimpleData[SigGenes,1]),scale="none")
> # For comparison: uncentered data with heatmap's default row scaling
> heatmap(data.matrix(SimpleData[SigGenes,2:7]),col=pal,labRow=as.character(SimpleData[SigGenes,1]))
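
The centering step relies on R recycling a vector of row means down the columns of the data. The sketch below, on a small made-up matrix `expr`, shows that this gives the same result as the more explicit `sweep()` call and that each centered gene has mean zero.

```r
# Hypothetical data: 2 genes (rows) x 4 samples, with one missing value
expr <- rbind(
  geneA = c(10, 12, 11, 13),
  geneB = c(2, 4, NA, 6)
)

# Per-gene means, ignoring missing values
gene_means <- rowMeans(expr, na.rm = TRUE)

# Explicit row-wise centering
centered <- sweep(expr, 1, gene_means)

# Centering by recycling, as in the transcript
centered_by_recycling <- expr - gene_means

print(centered)
print(rowMeans(centered, na.rm = TRUE))
```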

Hierarchical Clustering
Calculate the "distance" or "similarity" between each pair of expression profiles.
Merge the two "closest" profiles, forming a "node" in the clustering tree, and re-calculate the "distance" between this "sub-cluster" and the rest of the profiles or sub-clusters using one of the "linkage" principles.
Again merge the two closest sub-clusters, and repeat until all profiles are joined in a single tree.
Complete linkage: define the distance/similarity between two clusters as the maximum distance (minimum similarity) between pairs of profiles in which one profile is from the first sub-cluster and the other profile is from the second sub-cluster.
Average linkage: define the distance/similarity between two clusters as the average distance/similarity between pairs of profiles in which one profile is from the first sub-cluster and the other profile is from the second sub-cluster.
Single linkage: define the distance/similarity between two clusters as the minimum distance (maximum similarity) between pairs of profiles in which one profile is from the first sub-cluster and the other profile is from the second sub-cluster.
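
The three linkage rules differ only in how they summarize the between-cluster pairwise distances. As a small sketch, take the sub-clusters {596, 2797} and {4466, 4512}, with the four pairwise distances copied from the Euclidean distance matrix shown later in this transcript. The variable names are illustrative.

```r
# Distances between members of sub-cluster A (rows) and sub-cluster B (columns)
between <- matrix(
  c(2.53, 2.48,
    2.71, 2.62),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("596", "2797"), c("4466", "4512"))
)

complete_linkage <- max(between)   # largest pairwise distance
single_linkage   <- min(between)   # smallest pairwise distance
average_linkage  <- mean(between)  # mean of all pairwise distances

print(c(complete = complete_linkage,
        average  = average_linkage,
        single   = single_linkage))
```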

Euclidean Distance
R actually operates on distances, so similarities have to be transformed into distances, which is usually straightforward.
Euclidean distance: d(x, y) = sqrt((x_1 - y_1)^2 + ... + (x_n - y_n)^2). In 2 and 3 dimensions, this is our usual, everyday distance.
> EDistances<-dist(CenteredData[SigGenes,],method = "euclidean", diag = TRUE, upper = TRUE)
> print(EDistances,digits=2)
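
As a quick check of what `dist()` computes, the sketch below calculates the Euclidean distance between two made-up six-point profiles by hand and compares it with the `dist()` result. The vectors `x` and `y` are hypothetical.

```r
# Two hypothetical expression profiles with six measurements each
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 2, 5, 4, 8, 6)

# Square root of the sum of squared coordinate differences
by_hand <- sqrt(sum((x - y)^2))

# dist() treats each row of its input as one profile
by_dist <- as.numeric(dist(rbind(x, y), method = "euclidean"))

print(c(by_hand = by_hand, by_dist = by_dist))
```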

Distance Matrix
Distance Matrix - whole:
       34  440  596 2797 4466 4512 7651
34   0.00 8.55 5.64 5.46 8.15 8.03 9.14
440  8.55 0.00 3.01 3.19 0.82 0.82 1.13
596  5.64 3.01 0.00 0.33 2.53 2.48 3.59
2797 5.46 3.19 0.33 0.00 2.71 2.62 3.72
4466 8.15 0.82 2.53 2.71 0.00 0.47 1.18
4512 8.03 0.82 2.48 2.62 0.47 0.00 1.14
7651 9.14 1.13 3.59 3.72 1.18 1.14 0.00
Distance Matrix - lower triangular:
> EDistances<-dist(CenteredData[SigGenes,],method = "euclidean")
> print(EDistances,digits=2)
       34  440  596 2797 4466 4512
440  8.55
596  5.64 3.01
2797 5.46 3.19 0.33
4466 8.15 0.82 2.53 2.71
4512 8.03 0.82 2.48 2.62 0.47
7651 9.14 1.13 3.59 3.72 1.18 1.14

Dendrograms - Complete Linkage
> Clustering<-hclust(EDistances,method="complete")
> plot(Clustering)
The dendrogram is built from the lower-triangular distance matrix shown above.
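
To see how the dendrogram follows from the distances, the sketch below re-enters the rounded values of the lower-triangular distance matrix by hand and runs complete-linkage clustering on them. Because the values are rounded to two digits, the merge heights only approximate those obtained from the original data. The object names are illustrative.

```r
genes <- c("34", "440", "596", "2797", "4466", "4512", "7651")

# Lower triangle, column by column, copied from the printed distance matrix
lower <- c(8.55, 5.64, 5.46, 8.15, 8.03, 9.14,
           3.01, 3.19, 0.82, 0.82, 1.13,
           0.33, 2.53, 2.48, 3.59,
           2.71, 2.62, 3.72,
           0.47, 1.18,
           1.14)

# Rebuild the symmetric matrix and convert it to a dist object
m <- matrix(0, nrow = 7, ncol = 7, dimnames = list(genes, genes))
m[lower.tri(m)] <- lower
m <- m + t(m)
d <- as.dist(m)

tree <- hclust(d, method = "complete")

# Heights at which successive merges happen
print(tree$height)
plot(tree)
```

The first two merges join the closest pairs (596 with 2797 at 0.33, then 4466 with 4512 at 0.47), and gene 34 joins last, at the largest distance in the matrix.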

Clustering genes and samples
> EDistancesS<-dist(t(CenteredData[SigGenes,]),method = "euclidean")
> ClusteringS<-hclust(EDistancesS,method="complete")
> heatmap(data.matrix(CenteredData[SigGenes,]),Colv=as.dendrogram(ClusteringS),Rowv=as.dendrogram(Clustering), col=pal,scale="none")
> TwoClusters<-cutree(ClusteringS,k = 2, h = NULL)
> TwoClusters
  Ctl   Nic Nic.1 Nic.2 Ctl.1 Ctl.2
    1     2     2     2     1     1
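
Clustering samples instead of genes only requires transposing the data before calling `dist()`. The sketch below uses a small made-up matrix in which the column names mimic those in the transcript. The values are hypothetical and chosen so that the Ctl and Nic samples separate cleanly.

```r
# Hypothetical centered data: 3 genes (rows) x 6 samples (columns)
expr <- cbind(
  Ctl   = c( 1.0, -0.9,  1.1),
  Nic   = c(-1.0,  1.0, -1.2),
  Nic.1 = c(-1.1,  0.9, -1.0),
  Nic.2 = c(-0.9,  1.1, -1.1),
  Ctl.1 = c( 1.1, -1.0,  0.9),
  Ctl.2 = c( 0.9, -1.1,  1.0)
)

# Transpose so that each sample becomes one row, then cluster
sample_dist <- dist(t(expr), method = "euclidean")
sample_tree <- hclust(sample_dist, method = "complete")

# Cut the tree into two groups of samples
groups <- cutree(sample_tree, k = 2)
print(groups)
```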

Clustering by partitioning: K-means algorithm
For a pre-specified number of clusters, iterate between calculating the cluster "centroids" (i.e. cluster means) and re-assigning each profile to the cluster with the closest "centroid".
The (t+1)st iteration repeats both steps; iterate until c^(t+1) = c^t.
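
The two alternating steps can be written out directly. The function below is a minimal sketch for illustration, not the algorithm used by R's `kmeans()`. The function name `simple_kmeans`, the toy data and the choice of starting centroids are all assumptions.

```r
# Minimal k-means sketch: rows of x are profiles, rows of centers are centroids
simple_kmeans <- function(x, centers, iter.max = 10) {
  x <- as.matrix(x)
  assignment <- rep(0L, nrow(x))
  for (iter in seq_len(iter.max)) {
    # Squared Euclidean distance from every profile to every centroid
    d2 <- sapply(seq_len(nrow(centers)), function(k) {
      rowSums(sweep(x, 2, centers[k, ])^2)
    })
    # Assign each profile to its closest centroid
    new_assignment <- max.col(-d2, ties.method = "first")
    # Stop when the assignments no longer change
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment
    # Recompute each centroid as the mean of its current members
    for (k in seq_len(nrow(centers))) {
      members <- x[assignment == k, , drop = FALSE]
      if (nrow(members) > 0) centers[k, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centers)
}

# Hypothetical toy data: two well-separated groups of three profiles
x <- rbind(c(1, 1), c(1.2, 0.8), c(0.9, 1.1),
           c(5, 5), c(5.1, 4.9), c(4.8, 5.2))

# Start from one profile in each group (an arbitrary choice)
fit <- simple_kmeans(x, centers = x[c(1, 4), ])
print(fit$cluster)
print(fit$centers)
```

The result depends on the starting centroids, which is one reason k-means is considered a heuristic.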

Clustering k-means
> TwoCKmeans<-kmeans(t(CenteredData[SigGenes,]), 2, iter.max = 10)
> TwoCKmeans
K-means clustering with 2 clusters of sizes 3, 3
Cluster means:
         34        440        596       2797      4466       4512      7651
1  2.510742 -0.9565299  0.2554164  0.3246475 -0.770937 -0.7398173 -1.181848
2 -2.510742  0.9565299 -0.2554164 -0.3246475  0.770937  0.7398173  1.181848
Clustering vector:
  Ctl   Nic Nic.1 Nic.2 Ctl.1 Ctl.2
    1     2     2     2     1     1
Within cluster sum of squares by cluster:
[1] 1.0805679 0.9474704
Available components:
[1] "cluster"  "centers"  "withinss" "size"

Questions
How many clusters are there in the data?
What is the statistical significance of a clustering?
What is the confidence in assigning any particular expression profile to any particular cluster?
These are difficult questions, and particularly difficult to answer when using heuristic methods like hierarchical clustering and k-means.
We need statistical models.

Statistical Significance
Two genes at a time
Are these two genes co-expressed? By looking at their expression patterns alone, combined with the “null distribution” of the similarity measure in non-co-expressed genes, we could conclude that this is the case.
YDR113c and YDR183c: Pearson correlation = 0.83, statistical significance P-value = 0.00006
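
One common way to attach a p-value to a Pearson correlation is R's `cor.test()`, which uses a t-distribution as the null. This is only a sketch. The two profiles below are made up (they are not the YDR113c and YDR183c data), and the null distribution used on the slide may have been derived differently.

```r
# Hypothetical expression profiles of two genes over eight conditions
gene1 <- c(0.2, 0.5, 1.1, 1.8, 1.2, 0.4, -0.3, -0.9)
gene2 <- c(0.1, 0.7, 0.9, 1.5, 1.4, 0.2, -0.1, -1.2)

result <- cor.test(gene1, gene2, method = "pearson")
r <- unname(result$estimate)
p <- result$p.value

# The same p-value from the t statistic t = r * sqrt(n - 2) / sqrt(1 - r^2)
n <- length(gene1)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_manual <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

print(c(correlation = r, p_value = p, p_manual = p_manual))
```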

Another look
What if we knew that there are two and only two distinct patterns in the data, and we knew what they look like (thick dashed lines)?
Given this additional information, we are likely to conclude that our two genes actually have different patterns of expression.

Many genes at a time
Simultaneous detection of “patterns” of expression defined by groups of expression profiles, and assignment of individual expression profiles to the appropriate patterns.
By looking at “all” genes at the same time, we come to a completely different conclusion than when looking at only two of them.
Questions:
How many clusters?
How confident are we in the number of clusters in the data?
How confident are we that our two genes belong to two different clusters?
Does such a confidence statement take into account the “uncertainty” about the true number of clusters?

Gene-specific normalization of the data

Clustering using non-normalized data (figure panels: K-means; Euclidean distance; Pearson's correlation)

Clustering using normalized data (figure panels: K-means; Euclidean distance; Pearson's correlation)
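
One reason normalization matters for the comparison above is that Euclidean distance and Pearson's correlation become equivalent once each gene is standardized. The sketch below assumes that "gene-specific normalization" means centering each gene to mean zero and scaling it to unit standard deviation, which the surviving slide text does not confirm. Under that assumption, for profiles of length n, the squared Euclidean distance equals 2(n - 1)(1 - r). The three profiles are made up.

```r
# Hypothetical profiles: 3 genes (rows) x 6 conditions (columns)
profiles <- rbind(
  geneA = c(1, 3, 2, 5, 4, 6),
  geneB = c(10, 14, 11, 20, 18, 25),
  geneC = c(6, 5, 4, 3, 2, 1)
)
n <- ncol(profiles)

# Standardize each gene: scale() works on columns, hence the transposes
z <- t(scale(t(profiles)))

# Squared Euclidean distances between standardized profiles
d2 <- as.matrix(dist(z))^2

# Pearson correlations between the original profiles
r <- cor(t(profiles))

# After standardization the two measures carry the same information
print(round(d2, 4))
print(round(2 * (n - 1) * (1 - r), 4))
```

Without normalization the Euclidean distance is dominated by differences in overall expression level, so geneA and geneB look far apart even though they are highly correlated.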

Why do we cluster?
Co-expression; Co-regulation; Functional relationship; “Guilt by association”
Co-expression; Dissecting regulatory mechanisms; Co-regulation; Functional relationship; Assigning function to genes

Why do we cluster - Functional Annotation?

Dissecting the gene expression regulatory mechanisms
S. Tavazoie, J.D. Hughes, M.J. Campbell, R.J. Cho, G.M. Church. Systematic determination of genetic network architecture. Nat. Genet. 22 (1999) 281-285.