
1 Canadian Bioinformatics Workshops www.bioinformatics.ca

2 Module #: Title of Module

3 Module 5: Clustering. Exploratory Data Analysis of Biological Data using R. Boris Steipe, Department of Biochemistry, Department of Molecular Genetics, Toronto, May 23 and 24, 2013. This workshop includes material originally developed by Raphael Gottardo, FHCRC, and by Sohrab Shah, UBC. † Herakles and Iolaos battle the Hydra, Classical (450-400 BCE).

4 Module 5: Clustering bioinformatics.ca Outline: Principles; Hierarchical clustering; Partitioning methods; Centroid-based clustering (K-means etc.); Model-based clustering.

5 Module 5: Clustering bioinformatics.ca Examples: complexes in interaction data; domains in protein structure; proteins of similar function (based on measured similar properties, e.g. coregulation); ...

6 Module 5: Clustering bioinformatics.ca Introduction to clustering. Clustering is an example of unsupervised learning; it is useful for the analysis of patterns in data and can lead to class discovery. Clustering is the partitioning of a data set into groups of elements that are more similar to each other than to elements in other groups. Clustering is a completely general method that can be applied to genes, samples, or both.

7 Module 5: Clustering bioinformatics.ca Hierarchical clustering Given N items and a distance metric... 1. Assign each item to its own "cluster". Initialize the distance matrix between clusters as the distance between items. 2. Find the closest pair of clusters and merge them into a single cluster. 3. Compute new distances between clusters. 4. Repeat 2-3 until all clusters have been merged into a single cluster.
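To make steps 1-4 concrete, here is a minimal R sketch of the loop on toy data with single linkage; this is illustration only, not workshop code, and real analyses use hclust(), shown on the following slides:

set.seed(1)
x <- matrix(rnorm(10), ncol = 2)          # five items in two dimensions
D <- as.matrix(dist(x))                   # step 1: item-item distance matrix
clusters <- as.list(seq_len(nrow(x)))     # step 1: each item is its own cluster
while (length(clusters) > 1) {
  best <- c(1, 2); bestD <- Inf
  for (i in 1:(length(clusters) - 1)) {   # step 2: find the closest pair of
    for (j in (i + 1):length(clusters)) { # clusters; single linkage takes the
      d <- min(D[clusters[[i]], clusters[[j]]])   # minimum member-to-member distance
      if (d < bestD) { bestD <- d; best <- c(i, j) }
    }
  }
  cat(sprintf("merge {%s} and {%s} at height %.3f\n",
              toString(clusters[[best[1]]]),
              toString(clusters[[best[2]]]), bestD))
  clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
  clusters[[best[2]]] <- NULL             # steps 3-4: merge and repeat; new cluster
}                                         # distances follow from D via the min above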

8 Module 5: Clustering bioinformatics.ca Hierarchical clustering. "Given N items and a distance metric..." What is a metric? A metric d has to fulfill three conditions: "identity" (d(x, y) = 0 if and only if x = y), "symmetry" (d(x, y) = d(y, x)), and the "triangle inequality" (d(x, z) <= d(x, y) + d(y, z)).

9 Module 5: Clustering bioinformatics.ca Distance metrics. Common metrics include: Manhattan distance, d(x, y) = sum_i |x_i - y_i|; Euclidean distance, d(x, y) = sqrt(sum_i (x_i - y_i)^2); and 1-correlation, d(x, y) = 1 - cor(x, y) (equivalent, up to scale, to the squared Euclidean distance between standardized profiles, and therefore invariant to the range of measurement from one sample to the next).
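In R these can be computed directly; a small illustration on toy vectors (not workshop data):

x <- c(1, 3, 2, 5)
y <- c(2, 5, 3, 9)
dist(rbind(x, y), method = "manhattan")   # sum of absolute coordinate differences
dist(rbind(x, y), method = "euclidean")   # square root of the summed squared differences
1 - cor(x, y)                             # near 0 when the two profiles co-vary
# for a matrix m of gene profiles (genes in rows), a 1-correlation
# distance object suitable for hclust() can be built as: as.dist(1 - cor(t(m)))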

10 Module 5: Clustering bioinformatics.ca Distance metrics compared. [Figure: the same data clustered under the Euclidean, Manhattan, and 1-correlation metrics.] Distance matters!

11 Module 5: Clustering bioinformatics.ca Other distance metrics. Hamming distance for ordinal, binary or categorical data: the number of positions at which two vectors differ.
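This is a one-liner in R; toy categorical vectors for illustration:

x <- c("A", "T", "G", "C", "A")
y <- c("A", "T", "C", "C", "T")
sum(x != y)    # Hamming distance: 2 positions differ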

12 Module 5: Clustering bioinformatics.ca Agglomerative hierarchical clustering

13 Module 5: Clustering bioinformatics.ca Hierarchical clustering. Anatomy of hierarchical clustering. Input: a distance matrix and a linkage method. Output: a dendrogram, a tree that defines the relationships between objects and the distance between clusters; a nested sequence of clusters.

14 Module 5: Clustering bioinformatics.ca Linkage methods: single (distance between the closest pair of members), complete (distance between the farthest pair of members), average (mean of all pairwise member distances), and distance between centroids.

15 Module 5: Clustering bioinformatics.ca Example: cell cycle data
cho.data <- as.matrix(read.table("logcho_237_4class.txt", skip = 1)[1:50, 3:19])  # 50 genes, 17 time points
D.cho <- dist(cho.data, method = "euclidean")                   # gene-gene distance matrix
hc.single <- hclust(D.cho, method = "single", members = NULL)   # single-linkage tree

16 Module 5: Clustering bioinformatics.ca Example: cell cycle data. Single linkage: plot(hc.single)

17 Module 5: Clustering bioinformatics.ca Example: cell cycle data. Careful with the interpretation of dendrograms: adjacency of leaves does not imply similarity, because the left-to-right leaf order is largely arbitrary; only the height at which two elements join reflects their distance. Compare elements #1 and #47.
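One way to check such a case, using the objects defined on the previous slides (indices 1 and 47 are the pair flagged above):

D.mat <- as.matrix(D.cho)
D.mat[1, 47]                   # the actual distance between genes 1 and 47...
which(hc.single$order == 1)    # ...may be large even though the two leaves
which(hc.single$order == 47)   # happen to be drawn near each other in the plot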

18 Module 5: Clustering bioinformatics.ca Single linkage, k=2 rect.hclust(hc.single,k=2) Example: cell cycle data

19 Module 5: Clustering bioinformatics.ca Single linkage, k=3 rect.hclust(hc.single,k=3) Example: cell cycle data

20 Module 5: Clustering bioinformatics.ca Single linkage, k=4 rect.hclust(hc.single,k=4) Example: cell cycle data

21 Module 5: Clustering bioinformatics.ca Single linkage, k=5 rect.hclust(hc.single,k=5) Example: cell cycle data

22 Module 5: Clustering bioinformatics.ca Single linkage, k=25 rect.hclust(hc.single,k=25) Example: cell cycle data

23 Module 5: Clustering bioinformatics.ca Example: cell cycle data. Properties of cluster members, single linkage, k=4 (panels 1-4):
class.single <- cutree(hc.single, k = 4)   # cluster labels 1..4
par(mfrow = c(2, 2))                       # 2 x 2 panel layout
matplot(t(cho.data[class.single == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.single == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(as.matrix(cho.data[class.single == 3, ]), type = "l", xlab = "time", ylab = "log expression value")  # as.matrix(): presumably this cluster has a single member, so indexing drops to a vector
matplot(t(cho.data[class.single == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

24 Module 5: Clustering bioinformatics.ca Example: cell cycle data. [Figure: single linkage, k=4; cluster profiles, panels 1-4, with the corresponding dendrogram branches.]

25 Module 5: Clustering bioinformatics.ca Example: cell cycle data
hc.complete <- hclust(D.cho, method = "complete", members = NULL)
plot(hc.complete)
rect.hclust(hc.complete, k = 4)
class.complete <- cutree(hc.complete, k = 4)
par(mfrow = c(2, 2))
matplot...

26 Module 5: Clustering bioinformatics.ca Complete linkage, k=4 rect.hclust(hc.complete,k=4) Example: cell cycle data

27 Module 5: Clustering bioinformatics.ca Example: cell cycle data. [Figure: properties of cluster members, complete linkage, k=4, panels 1-4.]

28 Module 5: Clustering bioinformatics.ca Example: cell cycle data. [Figure: complete linkage, k=4 vs. single linkage, k=4, panels 1-4 each.]

29 Module 5: Clustering bioinformatics.ca Example: cell cycle data. [Figure: complete linkage, k=4 vs. single linkage, k=4.] One could use this comparison to revise the analysis: is a profile bimodal? Analyse signal, not noise!

30 Module 5: Clustering bioinformatics.ca Example: cell cycle data
hc.average <- hclust(D.cho, method = "average", members = NULL)
plot(hc.average)
rect.hclust(hc.average, k = 4)
class.average <- cutree(hc.average, k = 4)
par(mfrow = c(2, 2))
matplot(t(cho.data[class.average == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.average == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(as.matrix(cho.data[class.average == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[class.average == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

31 Module 5: Clustering bioinformatics.ca Hierarchical clustering analyzed.
Advantages: small clusters nested inside large ones can be detected; no need to specify the number of groups ahead of time; flexible linkage methods.
Disadvantages: clusters might not be naturally represented by a hierarchical structure; it is necessary to 'cut' the dendrogram in order to produce clusters; bottom-up clustering can result in poor structure at the top of the tree, since early joins cannot be 'undone'.

32 Module 5: Clustering bioinformatics.ca Partitioning methods. Anatomy of a partitioning-based method. Input: a data matrix, a distance function, and the number of groups. Output: a group assignment of every object.

33 Module 5: Clustering bioinformatics.ca Partitioning-based methods: choose K groups; initialise the group centers (aka centroids or medoids); assign each object to the nearest centroid according to the distance metric; reassign (or recompute) the centroids; repeat the last two steps until the assignment stabilizes.

34 Module 5: Clustering bioinformatics.ca K-means vs. K-medoids.
K-means (R: kmeans): centroids are the 'mean' of the clusters; centroids need to be recomputed every iteration; initialisation is difficult, as the notion of a centroid may be unclear before beginning.
K-medoids (R: pam): centroids are actual objects that minimize the total within-cluster distance; a centroid can be determined by a quick look-up in the distance matrix; initialisation is simply K randomly selected objects.
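A minimal K-medoids sketch, assuming the 'cluster' package (which provides pam()) is installed; the toy data mirror the kmeans example a few slides below:

library(cluster)
set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
pm <- pam(x, k = 2)            # the 2 medoids are actual rows of x
pm$medoids                     # looked up from the data, not computed as means
plot(x, col = pm$clustering)
points(pm$medoids, pch = 8, cex = 2)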

35 Module 5: Clustering bioinformatics.ca Partitioning-based methods.
Advantages: the number of groups is well defined; a clear, deterministic assignment of every object to a group; simple algorithms for inference.
Disadvantages: you have to choose the number of groups; sometimes objects do not fit well into any cluster; the algorithms can converge on locally optimal solutions and often require multiple restarts with random initializations.

36 Module 5: Clustering bioinformatics.ca K-means. N items, assume K clusters. The goal is to minimize the within-cluster sum of squares, sum_{k=1..K} sum_{i in C_k} ||x_i - mu_k||^2, over the possible assignments and centroids; mu_k represents the location of cluster k.

37 Module 5: Clustering bioinformatics.ca K-means. 1. Divide the data into K clusters; initialize the centroids with the means of the clusters. 2. Assign each item to the cluster with the closest centroid. 3. When all objects have been assigned, recalculate the centroids (means). 4. Repeat steps 2-3 until the centroids no longer move.

38 Module 5: Clustering bioinformatics.ca K-means
set.seed(100)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),            # two 2D Gaussian clouds,
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))  # centred at 0 and at 1
colnames(x) <- c("x", "y")
# rerun kmeans from the same random start with 1, 2, 3 iterations
# to watch the centroids converge:
set.seed(100)
cl <- kmeans(x, matrix(runif(10, -.5, .5), 5, 2), iter.max = 1)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)
set.seed(100)
cl <- kmeans(x, matrix(runif(10, -.5, .5), 5, 2), iter.max = 2)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)
set.seed(100)
cl <- kmeans(x, matrix(runif(10, -.5, .5), 5, 2), iter.max = 3)
plot(x, col = cl$cluster)
points(cl$centers, col = 1:5, pch = 8, cex = 2)


40 Module 5: Clustering bioinformatics.ca K-means, k=4 (panels 1-4)
set.seed(100)
km.cho <- kmeans(cho.data, 4)    # K-means on the cell cycle data, K = 4
par(mfrow = c(2, 2))
matplot(t(cho.data[km.cho$cluster == 1, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 2, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 3, ]), type = "l", xlab = "time", ylab = "log expression value")
matplot(t(cho.data[km.cho$cluster == 4, ]), type = "l", xlab = "time", ylab = "log expression value")

41 Module 5: Clustering bioinformatics.ca K-means. [Figure: K-means, k=4 vs. single linkage, k=4, panels 1-4 each.]

42 Module 5: Clustering bioinformatics.ca Summary. K-means and hierarchical clustering methods are simple, fast and useful techniques. Beware of the memory requirements of hierarchical clustering. Both are a bit "ad hoc": how many clusters? Which distance metric? What makes a good clustering?

43 Module 5: Clustering bioinformatics.ca Model-based approaches. Assume the data are 'generated' from a mixture of K distributions. What cluster assignment and parameters of the K distributions best explain the data? 'Fit' a model to the data and try to get the best fit. Classical example: a mixture of Gaussians (mixture of normals). This takes advantage of probability theory and well-defined distributions in statistics.
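A Gaussian-mixture sketch in R, assuming the 'mclust' package (the slides do not prescribe a package; this is one common choice). Mclust() fits mixtures over a range of K and selects among them by BIC:

library(mclust)
fit <- Mclust(faithful)             # bivariate Old Faithful data, used again below
summary(fit)                        # chosen model, number of components, BIC
plot(fit, what = "classification")  # cluster assignments and component ellipses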

44 Module 5: Clustering bioinformatics.ca Model based clustering: array CGH

45 Module 5: Clustering bioinformatics.ca Model based clustering of aCGH. Problem: patient cohorts often exhibit molecular heterogeneity, making rarer shared CNAs hard to detect. Approach: cluster the data by extending the profiling to the multi-group setting with a mixture of HMMs, HMM-Mix: raw data, CNA calls, the distribution of calls in a group, and sparse profiles. Shah et al (Bioinformatics, 2009). [Figure: graphical model over patient p, group g, states k and c, and profiles.]

46 Module 5: Clustering bioinformatics.ca Advantages of model based approaches. In addition to clustering patients into groups, we output a 'model' that best represents the patients in a group. We can then associate each model with clinical variables and simply output a classifier to be used on new patients. Choosing the number of groups becomes a model selection problem (cf. the Bayesian Information Criterion); see Yeung et al, Bioinformatics (2001).
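A model-selection sketch along these lines, again assuming the 'mclust' package: score candidate K by BIC and keep the maximum.

library(mclust)
BIC <- mclustBIC(faithful$eruptions, G = 1:6)  # mixtures with 1..6 components
plot(BIC)                                      # BIC against K for each model type
summary(BIC)                                   # the best model(s) by BIC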

47 Module 5: Clustering bioinformatics.ca Advanced topics in clustering: top-down clustering; bi-clustering or 'two-way' clustering; principal components analysis; choosing the number of groups (model selection: AIC, BIC; the silhouette coefficient, see the sketch below; the gap curve); joint clustering and feature selection.
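As an example of one of these diagnostics, a silhouette sketch (assuming the 'cluster' package), applied to the hierarchical clusters computed earlier; an average silhouette width close to 1 indicates tight, well-separated clusters:

library(cluster)
sil <- silhouette(class.single, D.cho)  # labels from cutree(), distances from dist()
summary(sil)$avg.width                  # one number summarizing cluster quality
plot(sil)                               # per-object silhouette widths by cluster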

48 Module 5: Clustering bioinformatics.ca Best method? That depends on what you want to achieve. And sometimes clustering is not the best approach to begin with.

49 Module 5: Clustering bioinformatics.ca Density estimation. Clustering is a partition method. However, consider the following data:
set.seed(103)
x1 <- array(c(runif(70, 0, 10)), c(35, 2))     # 35 uniform background points
x2 <- array(c(rnorm(30, 7, 0.7)), c(15, 2))    # 15 points concentrated around (7, 7)
xrange <- range(x1[,1], x2[,1])
yrange <- range(x1[,2], x2[,2])
plot(x1, xlim = xrange, ylim = yrange, col = "black", xlab = "x", ylab = "y")
par(new = TRUE)                                 # overlay the second point set
plot(x2, xlim = xrange, ylim = yrange, col = "red", axes = FALSE, xlab = "", ylab = "")

50 Module 5: Clustering bioinformatics.ca Density estimation
set.seed(103)
x1 <- array(c(runif(70, 0, 10)), c(35, 2))
x2 <- array(c(rnorm(30, 5, 0.7)), c(15, 2))    # now centred at (5, 5), inside the background
xrange <- range(x1[,1], x2[,1])
yrange <- range(x1[,2], x2[,2])
plot(x1, xlim = xrange, ylim = yrange, col = "black", xlab = "x", ylab = "y")
par(new = TRUE)
plot(x2, xlim = xrange, ylim = yrange, col = "red", axes = FALSE, xlab = "", ylab = "")

51 Module 5: Clustering bioinformatics.ca (univariate) Density estimation (from the density() documentation...)
> length(faithful$eruptions)
[1] 272
> head(faithful$eruptions, 8)
[1] 3.600 1.800 3.333 2.283 4.533 2.883 4.700 3.600
> hist(faithful$eruptions, col = rgb(0.9, 0.9, 0.9), main = "")


53 Module 5: Clustering bioinformatics.ca (univariate) Density estimation (from the density() documentation...)
> length(faithful$eruptions)
[1] 272
> head(faithful$eruptions, 8)
[1] 3.600 1.800 3.333 2.283 4.533 2.883 4.700 3.600
> hist(faithful$eruptions, col = rgb(0.9, 0.9, 0.9), main = "")
> par(new = TRUE)
> plot(density(faithful$eruptions, bw = "sj"), main = "", xlab = "", ylab = "", axes = FALSE, col = "red", lwd = 3)


55 Module 5: Clustering bioinformatics.ca (univariate) Density estimation
set.seed(103)
x1 <- array(c(runif(70, 0, 10)), c(35, 2))
x2 <- array(c(rnorm(30, 5, 0.7)), c(15, 2))
xrange <- range(x1[,1], x2[,1])
yrange <- range(x1[,2], x2[,2])
plot(x1, xlim = xrange, ylim = yrange, col = "black", xlab = "x", ylab = "y")
par(new = TRUE)
plot(x2, xlim = xrange, ylim = yrange, col = "red", axes = FALSE, xlab = "", ylab = "")

56 Module 5: Clustering bioinformatics.ca (univariate) Density estimation
set.seed(103)
x1 <- array(c(runif(70, 0, 10)), c(35, 2))
x2 <- array(c(rnorm(30, 7, 0.7)), c(15, 2))
xrange <- range(x1[,1], x2[,1])
yrange <- range(x1[,2], x2[,2])
plot(x1, xlim = xrange, ylim = yrange, col = "black", xlab = "x", ylab = "y")
par(new = TRUE)
plot(x2, xlim = xrange, ylim = yrange, col = "red", axes = FALSE, xlab = "", ylab = "")
x3 <- rbind(x1, x2)                      # pool both point sets
par(new = TRUE)
plot(density(x3[,2], bw = "sj"), main = "", xlab = "", ylab = "", axes = FALSE, col = "blue", lwd = 3)  # density of the y coordinates, overlaid


58 Module 5: Clustering bioinformatics.ca boris.steipe@utoronto.ca

