Cluster Analysis of Gene Expression Profiles Identifying groups of genes that exhibit a similar expression "behavior" across a number of experimental conditions Assuming that such "co-expression" will tell us something about these genes are regulated or even possibly something about their function (Functional Genomics) Using information from multiple genes at a time - as opposed to the single gene at a time analysis we did so far We can also cluster biological samples based on the expression of some or all of the genes Example: Identifying groups of molecularly similar tumor "Molecular phenotyping" "Unsupervised learning" in the computer science lingo 1-12-2006
Cluster analysis source("http://eh3. uc Often a large portion of genes are "not interesting" The meaning of the "not interesting" depends on the context Possibly we are interested in genes that whose expression is not constant across all experimental conditions. To remove "non-interesting" genes one can apply a "variation filter". Various sorts of "filtering" of "non-interesting" genes generally amounts to performing some kind of informal statistical testing with a very low confidence. For now, we will just play with our data with some more exciting examples to follow We have six measurements for each gene and will try to cluster genes and experimental conditions using this data 1-12-2006
Cluster analysis > load(url("http://eh3.uc.edu/teaching/cfg/2006/data/SimpleData.RData")) > Nic<-grep("Nic",dimnames(SimpleData)[[2]]) > Ctl<-grep("Ctl",dimnames(SimpleData)[[2]]) > MNic<-apply(SimpleData[,Nic],1,mean,na.rm=TRUE) > VNic<-apply(SimpleData[,Nic],1,var,na.rm=TRUE) > MCtl<-apply(SimpleData[,Ctl],1,mean,na.rm=TRUE) > VCtl<-apply(SimpleData[,Ctl],1,var,na.rm=TRUE) > NNic<-apply(!is.na(SimpleData[,Nic]),1,sum,na.rm=TRUE) > NCtl<-apply(!is.na(SimpleData[,Ctl]),1,sum,na.rm=TRUE) > VNicCtl<-(((NNic-1)*VNic)+((NCtl-1)*VCtl))/(NCtl+NNic-2) > DF<-NNic+NCtl-2 > TStat<-abs(MNic-MCtl)/((VNicCtl*((1/NNic)+(1/NCtl)))^0.5) > TPvalue<-2*pt(TStat,DF,lower.tail=FALSE) > SigGenes<-(TPvalue<0.001) > sum(SigGenes) [1] 7 1-12-2006
Cluster analysis 1-12-2006 > library(marray) > library(mclust) > pal<-maPalette(low="green", high="red", mid="black") > MinExp<-min(SimpleData[SigGenes,2:7]) > MaxExp<-max(SimpleData[SigGenes,2:7]) > heatmap(data.matrix(SimpleData[SigGenes,2:7]),Colv=NA,Rowv=NA,col=pal,labRow=as.character(SimpleData[SigGenes,1]),scale="none") > maColorBar(seq(MinExp,MaxExp,(MaxExp-MinExp)/5), col=pal, horizontal=FALSE, k=5) 1-12-2006
Cluster analysis > heatmap(data.matrix(SimpleData[SigGenes,2:7]),col=pal,labRow=as.character(SimpleData[SigGenes,1]),scale="none") Genes were selected based on their differences between Nic and Ctl treatments - not obvsious except for one gene 1-12-2006
Cluster analysis - centered data > CenteredData<-SimpleData[,2:7]-apply(SimpleData[,2:7],1,mean,na.rm=T) > heatmap(data.matrix(CenteredData[SigGenes,]),col=pal,labRow=as.character(SimpleData[SigGenes,1]),scale="none") > heatmap(data.matrix(SimpleData[SigGenes,2:7]),col=pal,labRow=as.character(SimpleData[SigGenes,1])) 1-12-2006
Hierarchical Clustering Calculating the "distance" or "similarity between each pair of expression profiles Merging two "closest" profiles, forming a "node" in the clustering tree and re-calculating the "distance between such a "sub-cluster" and rest of the profiles or sub-clusters using on of the "linkage" principles. Again merge two closest sub-clusters Complete linkage - define the distance/similarity between the two clusters as the maximum/minimum distance/similarity between pairs of profiles in which one profile is from the first sub-cluster and the other profile is from the second sub-cluster Average linkage - define the distance/similarity between the two clusters as the average distance/similarity between pairs of profiles in which one profile is from the first sub-cluster and the other profile is from the second sub-cluster Single linkage - define the distance/similarity between the two clusters as the minimum/maximum distance/similarity between pairs of profiles in which one profile is from the first sub-cluster and the other profile is from the second sub-cluster 1-12-2006
Euclidian Distance R actually operates on distances, so similarities have to be transformed into distances - usually straightforward Euclidian distance: In 2 and 3 dimensions, this is our usual, every day's distance > EDistances<-dist(CenteredData[SigGenes,],method = "euclidean", diag = T, upper = T) > print(EDistances,digits=2) 1-12-2006
Distance Matrix Distance Matrix - whole: 34 440 596 2797 4466 4512 7651 34 0.00 8.55 5.64 5.46 8.15 8.03 9.14 440 8.55 0.00 3.01 3.19 0.82 0.82 1.13 596 5.64 3.01 0.00 0.33 2.53 2.48 3.59 2797 5.46 3.19 0.33 0.00 2.71 2.62 3.72 4466 8.15 0.82 2.53 2.71 0.00 0.47 1.18 4512 8.03 0.82 2.48 2.62 0.47 0.00 1.14 7651 9.14 1.13 3.59 3.72 1.18 1.14 0.00 Distance Matrix - lower triangular: > EDistances<-dist(CenteredData[SigGenes,],method = "euclidean") > print(EDistances,digits=2) 34 440 596 2797 4466 4512 440 8.55 596 5.64 3.01 2797 5.46 3.19 0.33 4466 8.15 0.82 2.53 2.71 4512 8.03 0.82 2.48 2.62 0.47 7651 9.14 1.13 3.59 3.72 1.18 1.14 1-12-2006
Dendrograms - Complete Linkage > Clustering<-hclust(EDistances,method="complete") > plot(Clustering) Distance Matrix - lower triangular: 34 440 596 2797 4466 4512 440 8.55 596 5.64 3.01 2797 5.46 3.19 0.33 4466 8.15 0.82 2.53 2.71 4512 8.03 0.82 2.48 2.62 0.47 7651 9.14 1.13 3.59 3.72 1.18 1.14 1-12-2006
Clustering genes and samples > EDistancesS<-dist(t(CenteredData[SigGenes,]),method = "euclidean") > ClusteringS<-hclust(EDistancesS,method="complete") > heatmap(data.matrix(CenteredData[SigGenes,]),Colv=as.dendrogram(ClusteringS),Rowv=as.dendrogram(Clustering), col=pal,scale="none") > TwoClusters<-cutree(ClusteringS,k = 2, h = NULL) > TwoClusters Ctl Nic Nic.1 Nic.2 Ctl.1 Ctl.2 1 2 2 2 1 1 1-12-2006
Clustering by partitioning: K-means algorithm For a pre-specified number of clusters iterate between calculating cluster "centroides" (i.e. cluster means) and re-assigning each profile to the cluster with the closest "centroid" t+1st iteration: iterate until ct+1=ct 1-12-2006
Clustering k-means > TwoCKmeans<-kmeans(t(CenteredData[SigGenes,]), 2, iter.max = 10) > TwoCKmeans K-means clustering with 2 clusters of sizes 3, 3 Cluster means: 34 440 596 2797 4466 4512 7651 1 2.510742 -0.9565299 0.2554164 0.3246475 -0.770937 -0.7398173 -1.181848 2 -2.510742 0.9565299 -0.2554164 -0.3246475 0.770937 0.7398173 1.181848 Clustering vector: Ctl Nic Nic.1 Nic.2 Ctl.1 Ctl.2 1 2 2 2 1 1 Within cluster sum of squares by cluster: [1] 1.0805679 0.9474704 Available components: [1] "cluster" "centers" "withinss" "size" 1-12-2006
Questions How many clusters there are in the data? What is the statistical significance of a clustering? What is a confidence in assigning any particular expression profile to any particular cluster? Difficult questions, particularly difficult to answer when using heuristic methods like hierarchical clustering and k-means Need statistical models 1-12-2006
Statistical Significance Two genes at a time Are these two genes co-expressed? By looking at their expression patterns alone, combined with the “null distribution” of the similarity measure in non-co-expressed genes, we could conclude that this is the case. YDR113c and YDR183c Pearson Correlation = .83 Statistical Significance P-value = 0.00006 1-12-2006
Another look What if we knew that there are two and only two distinct patterns in the data and we know how they look (thick dashed lines)? Given this additional information we are likely to conclude that our two genes actually have different patterns of expression. 1-12-2006
Many genes at a time Simultaneous detection of “patterns” of expression defined by groups of expression profiles and assignment of individual expression profiles to appropriate patterns. By looking at “all” genes at the same time, we came up with a completely different conclusion than when looking at only two of them. Questions: How many clusters? How confident are we in the number of clusters in the data? How confident are we that our two genes belong to two different clusters? Is such a confidence statement taking into account the “uncertainty” about the true number of clusters? 1-12-2006
Gene-specific normalization of the data 1-12-2006
Clustering using non-normalized data K-means Euclidian Distance Pearson's correlation 1-12-2006
Clustering using normalized data K-means Euclidian Distance Pearson's correlation 1-12-2006
Why do we cluster? Co-expression Co-regulation Functional relationship “Guilt by association” Co-expression Dissecting regulatory mechanisms Co-regulation Functional relationship Assigning function to genes 1-12-2006
Why do we cluster - Functional Annotation? 1-12-2006
Dissecting the gene expression regulatory mechanisms S.Tavazoie, J.D.Hughes, M.J.Campbell, R.J.Cho, G.M.Church. Systematic determination of genetic network architecture, Nat.Genet., 22, (1999) 281-285. 1-12-2006