
1 Canadian Bioinformatics Workshops www.bioinformatics.ca

2 Module #: Title of Module

3 Module 5: Clustering. Exploratory Data Analysis of Biological Data using R. Boris Steipe, Department of Biochemistry, Department of Molecular Genetics, Toronto, May 23 and 24, 2013. This workshop includes material originally developed by Raphael Gottardo, FHCRC, and by Sohrab Shah, UBC. † Herakles and Iolaos battle the Hydra, Classical (450-400 BCE).

4 Module 5: Clustering bioinformatics.ca Outline: Principles; Hierarchical clustering; Partitioning methods; Centroid-based clustering (K-means etc.); Model-based clustering.

5 Module 5: Clustering bioinformatics.ca Examples: complexes in interaction data; domains in protein structure; proteins of similar function (based on measured similar properties, e.g. coregulation); ...

6 Module 5: Clustering bioinformatics.ca Introduction to clustering. Clustering is an example of unsupervised learning; it is useful for the analysis of patterns in data and can lead to class discovery. Clustering is the partitioning of a data set into groups of elements that are more similar to each other than to elements in other groups. Clustering is a completely general method that can be applied to genes, samples, or both.

7 Module 5: Clustering bioinformatics.ca Hierarchical clustering Given N items and a distance metric... 1. Assign each item to its own "cluster". Initialize the distance matrix between clusters as the distance between items. 2. Find the closest pair of clusters and merge them into a single cluster. 3. Compute new distances between clusters. 4. Repeat 2-3 until all clusters have been merged into a single cluster.
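To make steps 1-4 concrete, here is a minimal R sketch of the loop on toy data with single linkage; this is illustration only, not workshop code, and real analyses use hclust(), shown on the following slides:

set.seed(1)
x <- matrix(rnorm(10), ncol = 2)          # five items in two dimensions
D <- as.matrix(dist(x))                   # step 1: item-item distance matrix
clusters <- as.list(seq_len(nrow(x)))     # step 1: each item is its own cluster
while (length(clusters) > 1) {
  best <- c(1, 2); bestD <- Inf
  for (i in 1:(length(clusters) - 1)) {   # step 2: find the closest pair of
    for (j in (i + 1):length(clusters)) { # clusters; single linkage takes the
      d <- min(D[clusters[[i]], clusters[[j]]])   # minimum member-to-member distance
      if (d < bestD) { bestD <- d; best <- c(i, j) }
    }
  }
  cat(sprintf("merge {%s} and {%s} at height %.3f\n",
              toString(clusters[[best[1]]]),
              toString(clusters[[best[2]]]), bestD))
  clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
  clusters[[best[2]]] <- NULL             # steps 3-4: merge and repeat; new cluster
}                                         # distances follow from D via the min above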

8 Module 5: Clustering bioinformatics.ca Hierarchical clustering. "Given N items and a distance metric..." What is a metric? A metric d has to fulfill three conditions: "identity" (d(x, y) = 0 if and only if x = y), "symmetry" (d(x, y) = d(y, x)), and the "triangle inequality" (d(x, z) <= d(x, y) + d(y, z)).

9 Module 5: Clustering bioinformatics.ca Distance metrics. Common metrics include: Manhattan distance, d(x, y) = sum_i |x_i - y_i|; Euclidean distance, d(x, y) = sqrt(sum_i (x_i - y_i)^2); and 1-correlation, d(x, y) = 1 - cor(x, y) (equivalent, up to scale, to the squared Euclidean distance between standardized profiles, and therefore invariant to the range of measurement from one sample to the next).
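In R these can be computed directly; a small illustration on toy vectors (not workshop data):

x <- c(1, 3, 2, 5)
y <- c(2, 5, 3, 9)
dist(rbind(x, y), method = "manhattan")   # sum of absolute coordinate differences
dist(rbind(x, y), method = "euclidean")   # square root of the summed squared differences
1 - cor(x, y)                             # near 0 when the two profiles co-vary
# for a matrix m of gene profiles (genes in rows), a 1-correlation
# distance object suitable for hclust() can be built as: as.dist(1 - cor(t(m)))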

10 Module 5: Clustering bioinformatics.ca Distance metrics compared. [Figure: the same data clustered under the Euclidean, Manhattan, and 1-correlation metrics.] Distance matters!

11 Module 5: Clustering bioinformatics.ca Other distance metrics. Hamming distance for ordinal, binary or categorical data: the number of positions at which two vectors differ.
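This is a one-liner in R; toy categorical vectors for illustration:

x <- c("A", "T", "G", "C", "A")
y <- c("A", "T", "C", "C", "T")
sum(x != y)    # Hamming distance: 2 positions differ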

12 Module 5: Clustering bioinformatics.ca Agglomerative hierarchical clustering

13 Module 5: Clustering bioinformatics.ca Hierarchical clustering. Anatomy of hierarchical clustering. Input: a distance matrix and a linkage method. Output: a dendrogram, a tree that defines the relationships between objects and the distance between clusters; a nested sequence of clusters.

14 Module 5: Clustering bioinformatics.ca Linkage methods: single (distance between the closest pair of members), complete (distance between the farthest pair of members), average (mean of all pairwise member distances), and distance between centroids.

15 Module 5: Clustering bioinformatics.ca Example: cell cycle data
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip = 1)[1:50, 3:19])  # 50 genes, 17 time points
D.cho <- dist(cho.data, method = "euclidean")                   # gene-gene distance matrix
hc.single <- hclust(D.cho, method = "single", members = NULL)   # single-linkage tree

16 Module 5: Clustering bioinformatics.ca Example: cell cycle data. Single linkage: plot(hc.single)

17 Module 5: Clustering bioinformatics.ca Example: cell cycle data. Careful with the interpretation of dendrograms: adjacency of leaves does not imply similarity, because the left-to-right leaf order is largely arbitrary; only the height at which two elements join reflects their distance. Compare elements #1 and #47.
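One way to check such a case, using the objects defined on the previous slides (indices 1 and 47 are the pair flagged above):

D.mat <- as.matrix(D.cho)
D.mat[1, 47]                   # the actual distance between genes 1 and 47...
which(hc.single$order == 1)    # ...may be large even though the two leaves
which(hc.single$order == 47)   # happen to be drawn near each other in the plot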

18 Module 5: Clustering bioinformatics.ca Single linkage, k=2 rect.hclust(hc.single,k=2) Example: cell cycle data

19 Module 5: Clustering bioinformatics.ca Single linkage, k=3 rect.hclust(hc.single,k=3) Example: cell cycle data

20 Module 5: Clustering bioinformatics.ca Single linkage, k=4 rect.hclust(hc.single,k=4) Example: cell cycle data

21 Module 5: Clustering bioinformatics.ca Single linkage, k=5 rect.hclust(hc.single,k=5) Example: cell cycle data

22 Module 5: Clustering bioinformatics.ca Single linkage, k=25 rect.hclust(hc.single,k=25) Example: cell cycle data

23 Module 5: Clustering bioinformatics.ca Example: cell cycle data. Properties of cluster members, single linkage, k=4 (panels 1-4):
class.single <- cutree(hc.single, k = 4)   # cluster labels 1..4
par(mfrow = c(2, 2))                       # 2 x 2 panel layout
matplot(t(cho.data[class.single == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(as.matrix(cho.data[class.single == 3, ]), type = "l", xlab = "time", ylab = "log expression value")  # as.matrix(): presumably this cluster has a single member, so indexing drops to a vector
matplot(t(cho.data[class.single == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

24 Module 5: Clustering bioinformatics.ca Example: cell cycle data. [Figure: single linkage, k=4; cluster profiles, panels 1-4, with the corresponding dendrogram branches.]

25 Module 5: Clustering bioinformatics.ca Example: cell cycle data
hc.complete <- hclust(D.cho, method = "complete", members = NULL)
plot(hc.complete)
rect.hclust(hc.complete, k = 4)
class.complete <- cutree(hc.complete, k = 4)
par(mfrow = c(2, 2))
matplot...

26 Module 5: Clustering bioinformatics.ca Complete linkage, k=4 rect.hclust(hc.complete,k=4) Example: cell cycle data

27 Module 5: Clustering bioinformatics.ca Example: cell cycle data. [Figure: properties of cluster members, complete linkage, k=4, panels 1-4.]

28 Module 5: Clustering bioinformatics.ca Example: cell cycle data. [Figure: complete linkage, k=4 vs. single linkage, k=4, panels 1-4 each.]

29 Module 5: Clustering bioinformatics.ca Example: cell cycle data. [Figure: complete linkage, k=4 vs. single linkage, k=4.] One could use this comparison to revise the analysis: is a profile bimodal? Analyse signal, not noise!

30 Module 5: Clustering bioinformatics.ca Example: cell cycle data
hc.average <- hclust(D.cho, method = "average", members = NULL)
plot(hc.average)
rect.hclust(hc.average, k = 4)
class.average <- cutree(hc.average, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.average == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.average == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(as.matrix(cho.data[class.average == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.average == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

31 Module 5: Clustering bioinformatics.ca Hierarchical clustering analyzed.
Advantages: small clusters nested inside large ones can be detected; no need to specify the number of groups ahead of time; flexible linkage methods.
Disadvantages: clusters might not be naturally represented by a hierarchical structure; it is necessary to 'cut' the dendrogram in order to produce clusters; bottom-up clustering can result in poor structure at the top of the tree, since early joins cannot be 'undone'.

32 Module 5: Clustering bioinformatics.ca Partitioning methods. Anatomy of a partitioning-based method. Input: a data matrix, a distance function, and the number of groups. Output: a group assignment of every object.

33 Module 5: Clustering bioinformatics.ca Partitioning-based methods: choose K groups; initialise the group centers (aka centroids or medoids); assign each object to the nearest centroid according to the distance metric; reassign (or recompute) the centroids; repeat the last two steps until the assignment stabilizes.

34 Module 5: Clustering bioinformatics.ca K-means vs. K-medoids.
K-means (R: kmeans): centroids are the 'mean' of the clusters; centroids need to be recomputed every iteration; initialisation is difficult, as the notion of a centroid may be unclear before beginning.
K-medoids (R: pam): centroids are actual objects that minimize the total within-cluster distance; a centroid can be determined by a quick look-up in the distance matrix; initialisation is simply K randomly selected objects.
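A minimal K-medoids sketch, assuming the 'cluster' package (which provides pam()) is installed; the toy data mirror the kmeans example a few slides below:

library(cluster)
set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
pm <- pam(x, k = 2)            # the 2 medoids are actual rows of x
pm$medoids                     # looked up from the data, not computed as means
plot(x, col = pm$clustering)
points(pm$medoids, pch = 8, cex = 2)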

35 Module 5: Clustering bioinformatics.ca Partitioning-based methods.
Advantages: the number of groups is well defined; a clear, deterministic assignment of every object to a group; simple algorithms for inference.
Disadvantages: you have to choose the number of groups; sometimes objects do not fit well into any cluster; the algorithms can converge on locally optimal solutions and often require multiple restarts with random initializations.

36 Module 5: Clustering bioinformatics.ca K-means. N items, assume K clusters. The goal is to minimize the within-cluster sum of squares, sum_{k=1..K} sum_{i in C_k} ||x_i - mu_k||^2, over the possible assignments and centroids; mu_k represents the location of cluster k.

37 Module 5: Clustering bioinformatics.ca K-means. 1. Divide the data into K clusters; initialize the centroids with the means of the clusters. 2. Assign each item to the cluster with the closest centroid. 3. When all objects have been assigned, recalculate the centroids (means). 4. Repeat steps 2-3 until the centroids no longer move.

38 Module 5: Clustering bioinformatics.ca K-means
set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),            # two 2D Gaussian clouds,
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))  # centred at 0 and at 1
colnames(x) <- c("x", "y")
# rerun kmeans from the same random start with 1, 2, 3 iterations
# to watch the centroids converge:
set.seed(100)
cl <- kmeans(x, matrix(runif(10, -.5, .5), 5, 2), iter.max = 1)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)
set.seed(100)
cl <- kmeans(x, matrix(runif(10, -.5, .5), 5, 2), iter.max = 2)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)
set.seed(100)
cl <- kmeans(x, matrix(runif(10, -.5, .5), 5, 2), iter.max = 3)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)


40 Module 5: Clustering bioinformatics.ca K-means, k=4 (panels 1-4)
set.seed(100)
km.cho <- kmeans(cho.data, 4)    # K-means on the cell cycle data, K = 4
par(mfrow = c(2, 2))
matplot(t(cho.data[km.cho$cluster == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

41 Module 5: Clustering bioinformatics.ca K-means. [Figure: K-means, k=4 vs. single linkage, k=4, panels 1-4 each.]

42 Module 5: Clustering bioinformatics.ca Summary. K-means and hierarchical clustering methods are simple, fast and useful techniques. Beware of the memory requirements of hierarchical clustering. Both are a bit "ad hoc": how many clusters? Which distance metric? What makes a good clustering?

43 Module 5: Clustering bioinformatics.ca Model-based approaches. Assume the data are 'generated' from a mixture of K distributions. What cluster assignment and parameters of the K distributions best explain the data? 'Fit' a model to the data and try to get the best fit. Classical example: a mixture of Gaussians (mixture of normals). This takes advantage of probability theory and well-defined distributions in statistics.
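A Gaussian-mixture sketch in R, assuming the 'mclust' package (the slides do not prescribe a package; this is one common choice). Mclust() fits mixtures over a range of K and selects among them by BIC:

library(mclust)
fit <- Mclust(faithful)             # bivariate Old Faithful data, used again below
summary(fit)                        # chosen model, number of components, BIC
plot(fit, what = "classification")  # cluster assignments and component ellipses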

44 Module 5: Clustering bioinformatics.ca Model based clustering: array CGH

45 Module 5: Clustering bioinformatics.ca Model based clustering of aCGH. Problem: patient cohorts often exhibit molecular heterogeneity, making rarer shared CNAs hard to detect. Approach: cluster the data by extending the profiling to the multi-group setting with a mixture of HMMs, HMM-Mix: raw data, CNA calls, the distribution of calls in a group, and sparse profiles. Shah et al (Bioinformatics, 2009). [Figure: graphical model over patient p, group g, states k and c, and profiles.]

46 Module 5: Clustering bioinformatics.ca Advantages of model based approaches. In addition to clustering patients into groups, we output a 'model' that best represents the patients in a group. We can then associate each model with clinical variables and simply output a classifier to be used on new patients. Choosing the number of groups becomes a model selection problem (cf. the Bayesian Information Criterion); see Yeung et al, Bioinformatics (2001).
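A model-selection sketch along these lines, again assuming the 'mclust' package: score candidate K by BIC and keep the maximum.

library(mclust)
BIC <- mclustBIC(faithful$eruptions, G = 1:6)  # mixtures with 1..6 components
plot(BIC)                                      # BIC against K for each model type
summary(BIC)                                   # the best model(s) by BIC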

47 Module 5: Clustering bioinformatics.ca Advanced topics in clustering: top-down clustering; bi-clustering or 'two-way' clustering; principal components analysis; choosing the number of groups (model selection: AIC, BIC; the silhouette coefficient, see the sketch below; the gap curve); joint clustering and feature selection.
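As an example of one of these diagnostics, a silhouette sketch (assuming the 'cluster' package), applied to the hierarchical clusters computed earlier; an average silhouette width close to 1 indicates tight, well-separated clusters:

library(cluster)
sil <- silhouette(class.single, D.cho)  # labels from cutree(), distances from dist()
summary(sil)$avg.width                  # one number summarizing cluster quality
plot(sil)                               # per-object silhouette widths by cluster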

48 Module 5: Clustering bioinformatics.ca Best method? That depends on what you want to achieve. And sometimes clustering is not the best approach to begin with.

49 Module 5: Clustering bioinformatics.ca Density estimation. Clustering is a partition method. However, consider the following data:
set.seed(103)
x1 <- array(c(runif(70, 0, 10)), c(35, 2))     # 35 uniform background points
x2 <- array(c(rnorm(30, 7, 0.7)), c(15, 2))    # 15 points concentrated around (7, 7)
xrange <- range(x1[,1], x2[,1])
yrange <- range(x1[,2], x2[,2])
plot(x1, xlim = xrange, ylim = yrange, col = "black", xlab = "x", ylab = "y")
par(new = TRUE)                                 # overlay the second point set
plot(x2, xlim = xrange, ylim = yrange, col = "red", axes = FALSE, xlab = "", ylab = "")

50 Module 5: Clustering bioinformatics.ca Density estimation
set.seed(103)
x1 <- array(c(runif(70, 0, 10)), c(35, 2))
x2 <- array(c(rnorm(30, 5, 0.7)), c(15, 2))    # now centred at (5, 5), inside the background
xrange <- range(x1[,1], x2[,1])
yrange <- range(x1[,2], x2[,2])
plot(x1, xlim = xrange, ylim = yrange, col = "black", xlab = "x", ylab = "y")
par(new = TRUE)
plot(x2, xlim = xrange, ylim = yrange, col = "red", axes = FALSE, xlab = "", ylab = "")

51 Module 5: Clustering bioinformatics.ca (univariate) Density estimation (from the density() documentation...)
> length(faithful$eruptions)
[1] 272
> head(faithful$eruptions, 8)
[1] 3.600 1.800 3.333 2.283 4.533 2.883 4.700 3.600
> hist(faithful$eruptions, col = rgb(0.9, 0.9, 0.9), main = "")


53 Module 5: Clustering bioinformatics.ca (univariate) Density estimation (from the density() documentation...)
> length(faithful$eruptions)
[1] 272
> head(faithful$eruptions, 8)
[1] 3.600 1.800 3.333 2.283 4.533 2.883 4.700 3.600
> hist(faithful$eruptions, col = rgb(0.9, 0.9, 0.9), main = "")
> par(new = TRUE)
> plot(density(faithful$eruptions, bw = "sj"), main = "", xlab = "", ylab = "", axes = FALSE, col = "red", lwd = 3)


55 Module 5: Clustering bioinformatics.ca (univariate) Density estimation
set.seed(103)
x1 <- array(c(runif(70, 0, 10)), c(35, 2))
x2 <- array(c(rnorm(30, 5, 0.7)), c(15, 2))
xrange <- range(x1[,1], x2[,1])
yrange <- range(x1[,2], x2[,2])
plot(x1, xlim = xrange, ylim = yrange, col = "black", xlab = "x", ylab = "y")
par(new = TRUE)
plot(x2, xlim = xrange, ylim = yrange, col = "red", axes = FALSE, xlab = "", ylab = "")

56 Module 5: Clustering bioinformatics.ca (univariate) Density estimation
set.seed(103)
x1 <- array(c(runif(70, 0, 10)), c(35, 2))
x2 <- array(c(rnorm(30, 7, 0.7)), c(15, 2))
xrange <- range(x1[,1], x2[,1])
yrange <- range(x1[,2], x2[,2])
plot(x1, xlim = xrange, ylim = yrange, col = "black", xlab = "x", ylab = "y")
par(new = TRUE)
plot(x2, xlim = xrange, ylim = yrange, col = "red", axes = FALSE, xlab = "", ylab = "")
x3 <- rbind(x1, x2)                      # pool both point sets
par(new = TRUE)
plot(density(x3[,2], bw = "sj"), main = "", xlab = "", ylab = "", axes = FALSE, col = "blue", lwd = 3)  # density of the y coordinates, overlaid


58 Module 5: Clustering bioinformatics.ca boris.steipe@utoronto.ca

