
1 Essential Statistics in Biology: Getting the Numbers Right Raphael Gottardo Clinical Research Institute of Montreal (IRCM) raphael.gottardo@ircm.qc.ca http://www.rglab.org

2 Outline (Day 1): Exploratory Data Analysis; 1-2 sample t-tests and multiple testing; Clustering; SVD/PCA; Frequentists vs. Bayesians

3 Clustering (Multivariate analysis)

4 Outline: Basics of clustering; Hierarchical clustering; K-means; Model-based clustering

5 What is it? Clustering is the grouping of similar objects: a data set is partitioned into subsets (clusters) so that the data in each subset are "close" to one another, typically according to some defined distance measure. Examples: clustering web pages, gene clustering.
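
A quick illustration of "proximity according to a distance measure" in R (toy data, not from the slides): three points in the plane and their pairwise Euclidean distances.

# Three labelled points in the plane.
x <- rbind(a = c(0, 0), b = c(3, 4), c = c(0, 1))
dist(x, method = "euclidean")
# b-a = 5, c-a = 1, c-b = sqrt(18) ~ 4.24:
# a and c are "close"; b is far from both.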

6 Hierarchical clustering: Given N items and a distance metric: 1. Assign each item to its own cluster; initialize the distance matrix between clusters as the distances between items. 2. Find the closest pair of clusters and merge them into a single cluster. 3. Compute new distances between clusters. 4. Repeat 2-3 until all items are merged into a single cluster.
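
To make steps 1-4 concrete, here is a naive R sketch of the agglomerative loop with single linkage (hclust() in the example slide below does the same job far more efficiently; the function name here is ours):

naive_single_linkage <- function(x) {
  d <- as.matrix(dist(x))                 # step 1: item-level distances
  clusters <- as.list(seq_len(nrow(x)))   # step 1: one cluster per item
  merges <- list()
  while (length(clusters) > 1) {
    # step 2: find the closest pair of clusters
    best <- c(1, 2); best_d <- Inf
    for (i in 1:(length(clusters) - 1)) {
      for (j in (i + 1):length(clusters)) {
        dij <- min(d[clusters[[i]], clusters[[j]]])  # single linkage
        if (dij < best_d) { best_d <- dij; best <- c(i, j) }
      }
    }
    # step 3: merge the pair; new distances are recomputed from d above
    merges[[length(merges) + 1]] <- list(pair = best, height = best_d)
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters[[best[2]]] <- NULL
  }
  merges  # step 4: the sequence of merges down to a single cluster
}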

7 Single linkage: The distance between clusters is defined as the shortest distance from any member of one cluster to any member of the other cluster. [Figure: two clusters joined by the shortest pairwise distance d]

8 Complete linkage: The distance between clusters is defined as the greatest distance from any member of one cluster to any member of the other cluster. [Figure: two clusters joined by the greatest pairwise distance d]

9 Average linkage: The distance between clusters is defined as the average distance from any member of one cluster to any member of the other cluster. [Figure: two clusters; d = average of all pairwise distances]
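
All three linkages are available through hclust()'s method argument, and they can produce different trees on the same data. A small comparison sketch (simulated data, variable names ours):

# Two well-separated groups; compare the three linkages side by side.
set.seed(1)
x <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 4), ncol = 2))
d <- dist(x)
par(mfrow = c(1, 3))
for (m in c("single", "complete", "average")) {
  plot(hclust(d, method = m), main = m)  # one dendrogram per linkage
}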

10 Example: Cell cycle dataset (Cho et al. 1998). Expression levels of ~6000 genes during the cell cycle; 17 time points (2 cell cycles).

11 Example

# Read the first 50 genes of the Cho et al. data (columns 3 to 19 hold
# the 17 time points), then compute Euclidean distances between genes.
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip = 1)[1:50, 3:19])
D.cho <- dist(cho.data, method = "euclidean")

# Single linkage: dendrogram, 4-cluster cut, per-cluster profiles.
hc.single <- hclust(D.cho, method = "single", members = NULL)
plot(hc.single)
rect.hclust(hc.single, k = 4)
class.single <- cutree(hc.single, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.single == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
# as.matrix rather than t(): this cluster presumably contains a single
# gene, so the subset drops to a vector.
matplot(as.matrix(cho.data[class.single == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

# Complete linkage, same steps.
hc.complete <- hclust(D.cho, method = "complete", members = NULL)
plot(hc.complete)
rect.hclust(hc.complete, k = 4)
class.complete <- cutree(hc.complete, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.complete == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.complete == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.complete == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.complete == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

# Average linkage, same steps.
hc.average <- hclust(D.cho, method = "average", members = NULL)
plot(hc.average)
rect.hclust(hc.average, k = 4)
class.average <- cutree(hc.average, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.average == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.average == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(as.matrix(cho.data[class.average == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.average == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

# K-means with 4 clusters for comparison.
set.seed(100)
km.cho <- kmeans(cho.data, 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[km.cho$cluster == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

12 Example: single linkage [Figure: dendrogram]

13 Example: single linkage, k=2

14 Example: single linkage, k=3

15 Example: single linkage, k=4

16 Example: single linkage, k=5

17 Example: single linkage, k=25

18 Example: single linkage, k=4 [Figure: expression profiles of clusters 1-4]

19 Example: complete linkage, k=4

20 Example: complete linkage, k=4 [Figure: expression profiles of clusters 1-4]

21 K-means: Given N items, assume K clusters. The goal is to minimize the within-cluster sum of squares

  \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2

over the possible assignments C_1, ..., C_K and centroids \mu_1, ..., \mu_K; \mu_k represents the location of cluster k.
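
The objective above is just the total within-cluster sum of squares. A short check in R (the helper function is ours), compared against what kmeans() reports:

wss <- function(x, cluster) {
  sum(sapply(unique(cluster), function(k) {
    xk <- x[cluster == k, , drop = FALSE]
    mu <- colMeans(xk)          # centroid of cluster k
    sum(sweep(xk, 2, mu)^2)     # squared distances to the centroid
  }))
}
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
fit <- kmeans(x, centers = 3)
c(manual = wss(x, fit$cluster), kmeans = fit$tot.withinss)  # should match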

22 K-means - algorithm: 1. Divide the data into K clusters; initialize the centroids with the means of those clusters. 2. Assign each item to the cluster with the closest centroid. 3. When all items have been assigned, recalculate the centroids (means). 4. Repeat 2-3 until the centroids no longer move.
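
A minimal R sketch of steps 1-4 (Lloyd's algorithm); kmeans() is the production version. The function name is ours, initialization picks K random items as starting centroids (a common variant of step 1), and empty clusters are not handled:

naive_kmeans <- function(x, K, max_iter = 100) {
  centers <- x[sample(nrow(x), K), , drop = FALSE]  # step 1: initial centroids
  for (it in 1:max_iter) {
    # step 2: assign each item to the closest centroid
    d2 <- sapply(1:K, function(k) rowSums(sweep(x, 2, centers[k, ])^2))
    cluster <- max.col(-d2)
    # step 3: recompute the centroids as cluster means
    new_centers <- t(sapply(1:K, function(k)
      colMeans(x[cluster == k, , drop = FALSE])))
    # step 4: stop when the centroids no longer move
    if (all(abs(new_centers - centers) < 1e-8)) break
    centers <- new_centers
  }
  list(cluster = cluster, centers = centers)
}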

23 K-means - algorithm

# Watch k-means converge on simulated data: two Gaussian clouds,
# 5 centers, one extra iteration allowed per frame.
set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
colnames(x) <- c("x", "y")
for (i in 1:4) {
  set.seed(100)  # reset so each frame starts from the same random centers
  cl <- kmeans(x, matrix(runif(10, -0.5, 0.5), 5, 2), iter.max = i)
  plot(x, col = cl$cluster)
  points(cl$centers, col = 1:5, pch = 8, cex = 2)
  Sys.sleep(2)
}

24 Example [Figure: k-means clusters 1-4] Why?

25 [Figure-only slide]

26 Example [Figure: k-means clusters 1-4]

27 Summary: K-means and hierarchical clustering are useful techniques, fast and easy to implement. Beware of the memory requirements of hierarchical clustering (the full distance matrix). Both are a bit "ad hoc": How many clusters? Which distance metric? What counts as a good clustering?

28 Model-based clustering: based on probability models (e.g. normal mixture models). With a probability model we can speak of a good clustering, compare several models, and estimate the number of clusters!

29 Model-based clustering: multivariate observations y_1, ..., y_N and K clusters. Assume observation i belongs to cluster k; then

  y_i ~ N(\mu_k, \Sigma_k),

that is, each cluster can be represented by a multivariate normal distribution with mean \mu_k and covariance \Sigma_k. Yeung et al. (2001)
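
To see what this model means, here is a small simulation from a two-component version of it (equal mixing, identity covariances; variable names ours):

# Each observation gets a cluster label, then is drawn from that
# cluster's multivariate normal.
library(MASS)  # for mvrnorm
set.seed(1)
z  <- sample(1:2, 200, replace = TRUE)   # cluster memberships
mu <- list(c(0, 0), c(3, 3))             # cluster means
y  <- t(sapply(z, function(k) mvrnorm(1, mu[[k]], diag(2))))
plot(y, col = z, pch = 19)               # colour by true cluster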

30 Model-based clustering: Banfield and Raftery (1993) parametrize the cluster covariances through the eigenvalue decomposition

  \Sigma_k = \lambda_k D_k A_k D_k^T,

where \lambda_k controls the volume, D_k the orientation, and A_k the shape of cluster k.

31 Model-based clustering: constraining the decomposition gives the standard covariance structures:
\Sigma_k = \lambda I: equal volume, spherical (EII)
\Sigma_k = \lambda_k I: unequal volume, spherical (VII)
\Sigma_k = \Sigma: equal volume, shape, orientation (EEE)
\Sigma_k unconstrained (VVV)

32 Estimation: given the number of clusters and the covariance structure, the EM algorithm can be used to maximize the mixture likelihood

  L(\mu_1, ..., \mu_K, \Sigma_1, ..., \Sigma_K, \pi) = \prod_{i=1}^{N} \sum_{k=1}^{K} \pi_k \phi(y_i | \mu_k, \Sigma_k),

where \pi_k are the mixing proportions and \phi is the multivariate normal density. The mclust R package is available from CRAN.
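
The slides use the EMclust() interface from early versions of mclust; in current versions of the package the equivalent entry point is Mclust(), which runs EM for a range of models and cluster numbers. A sketch on simulated data:

library(mclust)
set.seed(1)
y <- rbind(matrix(rnorm(100), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))
# Fit EII/VII/EEE/VVV models with 1 to 5 clusters.
fit <- Mclust(y, G = 1:5, modelNames = c("EII", "VII", "EEE", "VVV"))
summary(fit)             # best model and number of clusters by BIC
plot(fit, what = "BIC")  # BIC for every model/G combination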

33 Model selection: Which model is appropriate? Which covariance structure? How many clusters? Compare the different models using BIC.

34 Model selection: we wish to compare two models M_1 and M_2 with parameters \theta_1 and \theta_2 respectively. Given the observed data D, define the integrated likelihood

  p(D | M_k) = \int p(D | \theta_k, M_k) p(\theta_k | M_k) d\theta_k,

the probability of observing the data given model M_k. NB: \theta_1 and \theta_2 might have different dimensions.

35 Model selection: to compare two models, use the integrated likelihoods. The integral is difficult to compute! The Bayesian information criterion approximates twice the log integrated likelihood:

  BIC_k = 2 log p(D | \hat\theta_k, M_k) - \nu_k log N,

where \hat\theta_k is the maximum likelihood estimate and \nu_k is the number of parameters in model M_k.

36 Model selection: the BIC balances a measure of fit (the maximized log-likelihood) against a penalty term (the number of parameters). A large BIC score indicates strong evidence for the corresponding model. BIC can be used to choose the number of clusters and the covariance parametrization (mclust).
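
mclust reports BIC on the 2*log-likelihood scale defined above, so larger is better. A hand computation checked against the package on simulated one-dimensional data (variable names ours):

library(mclust)
set.seed(1)
y <- c(rnorm(100, 0, 1), rnorm(100, 4, 1))
fit <- Mclust(y, G = 2, modelNames = "V")       # 2 clusters, unequal variances
manual <- 2 * fit$loglik - fit$df * log(fit$n)  # fit minus penalty
c(manual = manual, mclust = fit$bic)            # the two should agree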

37 Example revisited [Figure: BIC vs. number of clusters; curve 1 = EII, curve 2 = EEI]

library(mclust)
# cho.data.std is presumably the standardized version of cho.data.
cho.mclust.bic <- EMclust(cho.data.std, modelNames = c("EII", "EEI"))
plot(cho.mclust.bic)
cho.mclust <- EMclust(cho.data.std, 4, "EII")  # EII model, 4 clusters
sum.cho <- summary(cho.mclust, cho.data.std)

38 Example revisited

par(mfrow = c(2, 2))
matplot(t(cho.data[sum.cho$classification == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

39 Example revisited [Figure: clusters 1-4, EII model, 4 clusters]

40 Example revisited

# Refit with the EEI model and 3 clusters.
cho.mclust <- EMclust(cho.data.std, 3, "EEI")
sum.cho <- summary(cho.mclust, cho.data.std)
par(mfrow = c(2, 2))
matplot(t(cho.data[sum.cho$classification == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[sum.cho$classification == 3, ]), type = "l", xlab = "time", ylab = "log expression value")

41 Example revisited [Figure: clusters 1-3, EEI model, 3 clusters]

42 Summary: model-based clustering is a nice alternative to heuristic clustering algorithms; BIC can be used to choose the covariance structure and the number of clusters.

43 Conclusion: we have seen a few clustering algorithms; there are many others (two-way clustering, the plaid model, ...). Clustering is a useful tool and... a dangerous weapon, to be consumed with moderation!

