Cluster Analysis of Gene Expression Profiles

Identifying groups of genes that exhibit similar expression "behavior" across a number of experimental conditions.
Assuming that such "co-expression" will tell us something about how these genes are regulated, or possibly even something about their function (functional genomics).
Using information from multiple genes at a time, as opposed to the single-gene-at-a-time analysis we have done so far.
We can also cluster biological samples based on the expression of some or all of the genes.
Example: identifying groups of molecularly similar tumors ("molecular phenotyping").
This is "unsupervised learning" in computer science lingo.

Cluster analysis source("http://eh3. uc
Often a large portion of genes are "not interesting". What "not interesting" means depends on the context.
For example, we may be interested only in genes whose expression is not constant across all experimental conditions.
To remove "non-interesting" genes one can apply a "variation filter".
The various sorts of "filtering" of "non-interesting" genes generally amount to performing some kind of informal statistical testing with a very low confidence.
For now, we will just play with our data, with some more exciting examples to follow.
We have six measurements for each gene and will try to cluster genes and experimental conditions using these data.
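As a minimal sketch of a "variation filter", the example below keeps only genes whose standard deviation across samples exceeds a cutoff. The simulated matrix `expr`, the gene and sample names, and the cutoff of 1 are all hypothetical choices made for illustration; they are not the data or the filter used in these slides.

```r
# Hypothetical data: 100 genes x 6 samples on a log-like scale
set.seed(1)
expr <- matrix(rnorm(600, mean = 8, sd = 0.2), nrow = 100,
               dimnames = list(paste0("gene", 1:100), paste0("s", 1:6)))

# Make the first 10 genes vary strongly: down in samples 1-3, up in samples 4-6
expr[1:10, ] <- expr[1:10, ] + matrix(rep(c(-2, 2), each = 30), nrow = 10)

# Variation filter: per-gene standard deviation across the six samples
row_sd <- apply(expr, 1, sd, na.rm = TRUE)
keep <- row_sd > 1                      # hypothetical cutoff
filtered <- expr[keep, , drop = FALSE]  # only the "interesting" genes remain
dim(filtered)
```

The cutoff plays the role of an informal significance threshold, so in practice it would be chosen with the noise level of the experiment in mind.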

Cluster analysis
> load(url("
> # Column indices of the Nic and Ctl samples
> Nic<-grep("Nic",dimnames(SimpleData)[[2]])
> Ctl<-grep("Ctl",dimnames(SimpleData)[[2]])
> # Per-gene means, variances and numbers of non-missing values in each group
> MNic<-apply(SimpleData[,Nic],1,mean,na.rm=TRUE)
> VNic<-apply(SimpleData[,Nic],1,var,na.rm=TRUE)
> MCtl<-apply(SimpleData[,Ctl],1,mean,na.rm=TRUE)
> VCtl<-apply(SimpleData[,Ctl],1,var,na.rm=TRUE)
> NNic<-apply(!is.na(SimpleData[,Nic]),1,sum,na.rm=TRUE)
> NCtl<-apply(!is.na(SimpleData[,Ctl]),1,sum,na.rm=TRUE)
> # Pooled variance and two-sample t-statistic (equal variances assumed)
> VNicCtl<-(((NNic-1)*VNic)+((NCtl-1)*VCtl))/(NCtl+NNic-2)
> DF<-NNic+NCtl-2
> TStat<-abs(MNic-MCtl)/((VNicCtl*((1/NNic)+(1/NCtl)))^0.5)
> # Two-sided p-value; genes with p < 0.001 are called significant
> TPvalue<-2*pt(TStat,DF,lower.tail=FALSE)
> SigGenes<-(TPvalue<0.001)
> sum(SigGenes)
[1] 7
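The per-gene calculation above is a pooled-variance two-sample t-test. As a sanity check, the sketch below repeats the same arithmetic for a single hypothetical gene and compares it with R's built-in `t.test(..., var.equal = TRUE)`. The vectors `nic` and `ctl` are simulated stand-ins, not values from SimpleData.

```r
# Hypothetical replicate measurements for one gene
set.seed(2)
nic <- rnorm(3, mean = 10)
ctl <- rnorm(3, mean = 8)

n1 <- length(nic)
n2 <- length(ctl)

# Same formulas as in the slide code, for a single gene
pooled_var <- ((n1 - 1) * var(nic) + (n2 - 1) * var(ctl)) / (n1 + n2 - 2)
t_manual <- abs(mean(nic) - mean(ctl)) / sqrt(pooled_var * (1 / n1 + 1 / n2))
p_manual <- 2 * pt(t_manual, df = n1 + n2 - 2, lower.tail = FALSE)

# Built-in equal-variance t-test for comparison
tt <- t.test(nic, ctl, var.equal = TRUE)
c(manual = p_manual, builtin = tt$p.value)
```

The vectorized version in the slides is preferable for thousands of genes; the built-in call is mainly useful for checking the formulas on a few rows.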

Cluster analysis 1-12-2006
> library(marray)
> library(mclust)
> pal<-maPalette(low="green", high="red", mid="black")
> MinExp<-min(SimpleData[SigGenes,2:7])
> MaxExp<-max(SimpleData[SigGenes,2:7])
> heatmap(data.matrix(SimpleData[SigGenes,2:7]),Colv=NA,Rowv=NA,col=pal,labRow=as.character(SimpleData[SigGenes,1]),scale="none")
> maColorBar(seq(MinExp,MaxExp,(MaxExp-MinExp)/5), col=pal, horizontal=FALSE, k=5)

Cluster analysis
> heatmap(data.matrix(SimpleData[SigGenes,2:7]),col=pal,labRow=as.character(SimpleData[SigGenes,1]),scale="none")
Genes were selected based on their differences between the Nic and Ctl treatments, but in this heatmap the difference is not obvious except for one gene.

Cluster analysis - centered data
> # Subtract each gene's mean across the six samples
> CenteredData<-SimpleData[,2:7]-apply(SimpleData[,2:7],1,mean,na.rm=TRUE)
> heatmap(data.matrix(CenteredData[SigGenes,]),col=pal,labRow=as.character(SimpleData[SigGenes,1]),scale="none")
> # With the default scale="row", heatmap() centers and scales each row itself
> heatmap(data.matrix(SimpleData[SigGenes,2:7]),col=pal,labRow=as.character(SimpleData[SigGenes,1]))
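Row centering can also be written with `sweep()`, which makes the direction of the subtraction explicit. The sketch below uses a small simulated matrix `x` (hypothetical, not SimpleData) and checks that both forms give the same result and that every centered row has mean zero.

```r
# Hypothetical data: 4 genes x 6 samples
set.seed(3)
x <- matrix(rnorm(24, mean = 8), nrow = 4,
            dimnames = list(paste0("gene", 1:4), paste0("s", 1:6)))

# Explicit row-wise centering: subtract each row's mean from that row
centered <- sweep(x, 1, rowMeans(x, na.rm = TRUE))

# The shorter form relies on recycling down the columns and gives the same matrix
centered2 <- x - rowMeans(x, na.rm = TRUE)

round(rowMeans(centered), 10)
```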

Hierarchical Clustering
Calculate the "distance" or "similarity" between each pair of expression profiles.
Merge the two "closest" profiles, forming a "node" in the clustering tree, and re-calculate the "distance" between this "sub-cluster" and the rest of the profiles or sub-clusters using one of the "linkage" principles.
Again merge the two closest sub-clusters, and repeat until all profiles are joined in a single tree.
Complete linkage: define the distance/similarity between two clusters as the maximum/minimum distance/similarity between pairs of profiles in which one profile is from the first sub-cluster and the other profile is from the second sub-cluster.
Average linkage: define the distance/similarity between two clusters as the average distance/similarity between pairs of profiles in which one profile is from the first sub-cluster and the other profile is from the second sub-cluster.
Single linkage: define the distance/similarity between two clusters as the minimum/maximum distance/similarity between pairs of profiles in which one profile is from the first sub-cluster and the other profile is from the second sub-cluster.
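The three linkage rules can be checked by hand against `hclust()`. The sketch below uses five simulated profiles (hypothetical data). For each method it takes the two clusters joined at the top of the tree and computes the maximum, mean, or minimum of the between-cluster pairwise distances, which should match the height of the final merge reported by `hclust()`.

```r
# Hypothetical data: 5 profiles with 6 measurements each
set.seed(4)
profiles <- matrix(rnorm(30), nrow = 5,
                   dimnames = list(paste0("g", 1:5), NULL))
d <- as.matrix(dist(profiles))   # full matrix of pairwise Euclidean distances

# Linkage rule expressed as a summary of between-cluster pairwise distances
link_fun <- list(complete = max, average = mean, single = min)

check <- sapply(names(link_fun), function(m) {
  hc <- hclust(dist(profiles), method = m)
  grp <- cutree(hc, k = 2)                        # the two clusters merged last
  between <- d[grp == 1, grp == 2, drop = FALSE]  # distances across the split
  c(by_hand = link_fun[[m]](between), hclust_height = max(hc$height))
})
check
```

Each column of `check` holds the hand-computed linkage distance and the corresponding tree height for one method.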

Euclidean Distance
R's clustering functions operate on distances, so similarities have to be transformed into distances; this is usually straightforward.
Euclidean distance: in 2 and 3 dimensions, this is our usual, everyday distance.
> EDistances<-dist(CenteredData[SigGenes,],method = "euclidean", diag = TRUE, upper = TRUE)
> print(EDistances,digits=2)
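To make the definition concrete, the sketch below computes the Euclidean distance between two short vectors by hand and with `dist()`, and then shows one common way of turning a similarity (Pearson correlation) into a distance, namely 1 minus the correlation. The vectors and the small matrix `m` are made-up illustration values.

```r
# Euclidean distance by hand: square root of the summed squared differences
x <- c(1, 2, 3)
y <- c(4, 6, 3)
d_manual <- sqrt(sum((x - y)^2))   # sqrt(9 + 16 + 0)
d_dist <- as.numeric(dist(rbind(x, y), method = "euclidean"))

# Turning a similarity into a distance: 1 - Pearson correlation between rows
m <- rbind(a = c(1, 2, 3, 4),
           b = c(2, 4, 6, 9),
           c = c(4, 3, 2, 1))
cor_dist <- as.dist(1 - cor(t(m)))  # 0 = perfectly correlated, 2 = perfectly anti-correlated
cor_dist
```

The resulting `cor_dist` object can be passed to `hclust()` in the same way as the output of `dist()`.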

Distance Matrix
Distance matrix - whole:
Distance matrix - lower triangular:
> EDistances<-dist(CenteredData[SigGenes,],method = "euclidean")
> print(EDistances,digits=2)

Dendrograms - Complete Linkage
> Clustering<-hclust(EDistances,method="complete")
> plot(Clustering)

Clustering genes and samples
> EDistancesS<-dist(t(CenteredData[SigGenes,]),method = "euclidean")
> ClusteringS<-hclust(EDistancesS,method="complete")
> heatmap(data.matrix(CenteredData[SigGenes,]),Colv=as.dendrogram(ClusteringS),Rowv=as.dendrogram(Clustering), col=pal,scale="none")
> TwoClusters<-cutree(ClusteringS,k = 2, h = NULL)
> TwoClusters
Ctl Nic Nic.1 Nic.2 Ctl.1 Ctl.2

Clustering by partitioning: K-means algorithm
For a pre-specified number of clusters, iterate between calculating the cluster "centroids" (i.e. cluster means) and re-assigning each profile to the cluster with the closest "centroid".
In the (t+1)st iteration, recompute the centroids from the current assignments and then re-assign the profiles; iterate until c(t+1) = c(t).
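The iteration described above can be written out in a few lines of R. This is a minimal sketch under stated assumptions: the data `x` are simulated as two well-separated groups, the starting centroids are simply rows 1 and 11, and empty clusters are not handled. In practice the built-in `kmeans()` should be used.

```r
# Hypothetical data: 20 profiles in 2 dimensions, two well-separated groups
set.seed(5)
x <- rbind(matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(20, mean = 3, sd = 0.3), ncol = 2))
k <- 2
centers <- x[c(1, 11), , drop = FALSE]   # hypothetical starting centroids

for (iter in 1:100) {
  # Squared Euclidean distance from every profile to every centroid
  d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  # Re-assign each profile to the closest centroid
  cl <- apply(d, 1, which.min)
  # Recompute each centroid as the mean of its assigned profiles
  new_centers <- t(sapply(seq_len(k),
                          function(j) colMeans(x[cl == j, , drop = FALSE])))
  # Stop when the centroids no longer change
  if (isTRUE(all.equal(new_centers, centers))) break
  centers <- new_centers
}
table(cl)
```

The result depends on the starting centroids, which is one reason `kmeans()` offers the `nstart` argument for multiple random starts.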

Clustering k-means
> TwoCKmeans<-kmeans(t(CenteredData[SigGenes,]), 2, iter.max = 10)
> TwoCKmeans
K-means clustering with 2 clusters of sizes 3, 3
Cluster means:
Clustering vector:
Ctl Nic Nic.1 Nic.2 Ctl.1 Ctl.2
Within cluster sum of squares by cluster:
[1]
Available components:
[1] "cluster" "centers" "withinss" "size"

Questions
How many clusters are there in the data?
What is the statistical significance of a clustering?
What is the confidence in assigning any particular expression profile to any particular cluster?
These are difficult questions, and particularly difficult to answer when using heuristic methods like hierarchical clustering and k-means.
We need statistical models.

Statistical Significance
Two genes at a time
Are these two genes co-expressed?
By looking at their expression patterns alone, combined with the “null distribution” of the similarity measure in non-co-expressed genes, we could conclude that this is the case.
YDR113c and YDR183c: Pearson correlation = .83
Statistical significance: P-value =
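One way to approximate a "null distribution" for the correlation of two profiles is by permutation: shuffle one profile many times and see how often a correlation at least as extreme as the observed one arises by chance. The sketch below does this for two simulated profiles `g1` and `g2`. These are hypothetical, not the YDR113c and YDR183c data, and permutation is only one possible choice of null, not necessarily the one used in the slides.

```r
# Hypothetical pair of expression profiles over 15 conditions
set.seed(6)
n <- 15
g1 <- rnorm(n)
g2 <- 0.8 * g1 + rnorm(n, sd = 0.5)

r_obs <- cor(g1, g2)   # observed Pearson correlation

# Null distribution: correlations after randomly permuting one profile
r_null <- replicate(2000, cor(g1, sample(g2)))

# Two-sided permutation p-value (the +1 avoids reporting exactly zero)
p_perm <- (sum(abs(r_null) >= abs(r_obs)) + 1) / (length(r_null) + 1)

# Parametric test from base R, for comparison
p_param <- cor.test(g1, g2)$p.value
c(r = r_obs, p_perm = p_perm, p_param = p_param)
```

Either p-value addresses only this one pair of genes in isolation.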

Another look
What if we knew that there are two and only two distinct patterns in the data, and we knew what they look like (thick dashed lines)?
Given this additional information, we are likely to conclude that our two genes actually have different patterns of expression.

Many genes at a time
Simultaneous detection of “patterns” of expression defined by groups of expression profiles, and assignment of individual expression profiles to the appropriate patterns.
By looking at “all” genes at the same time, we came to a completely different conclusion than when looking at only two of them.
Questions:
How many clusters?
How confident are we in the number of clusters in the data?
How confident are we that our two genes belong to two different clusters?
Does such a confidence statement take into account the “uncertainty” about the true number of clusters?

Gene-specific normalization of the data
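The slide does not say which normalization is meant. A common choice, assumed here, is to standardize each gene's profile to mean 0 and standard deviation 1. Under that assumption, the squared Euclidean distance between two standardized profiles of length n equals 2(n-1)(1-r), where r is their Pearson correlation, so the two measures rank pairs of genes identically. The sketch below checks this on simulated data (the matrix `x` is hypothetical).

```r
# Hypothetical data: 5 genes x 6 samples
set.seed(7)
x <- matrix(rnorm(30, mean = 8, sd = 2), nrow = 5,
            dimnames = list(paste0("g", 1:5), NULL))

# Gene-specific standardization (assumed form of normalization):
# scale() works on columns, so transpose, scale, and transpose back
z <- t(scale(t(x)))

n <- ncol(x)
d2 <- as.matrix(dist(z))^2   # squared Euclidean distances between standardized genes
r <- cor(t(x))               # Pearson correlations between genes

# Relationship between the two measures on standardized profiles
max(abs(d2 - 2 * (n - 1) * (1 - r)))
```

On unstandardized profiles the two measures can disagree, since Euclidean distance also reflects differences in overall level and amplitude.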

Clustering using non-normalized data
Figure panels: K-means; Euclidean distance; Pearson's correlation

Clustering using normalized data
Figure panels: K-means; Euclidean distance; Pearson's correlation

Why do we cluster?
Co-expression; co-regulation; functional relationship
“Guilt by association”
Co-expression; dissecting regulatory mechanisms; co-regulation; functional relationship; assigning function to genes

Why do we cluster - Functional Annotation?

Dissecting the gene expression regulatory mechanisms
S. Tavazoie, J. D. Hughes, M. J. Campbell, R. J. Cho, G. M. Church. Systematic determination of genetic network architecture. Nat. Genet. 22 (1999).
