
Clustering
The Broad Institute of MIT and Harvard

Presentation transcript:

1 Clustering

2 Clustering Preliminaries
– Log2 transformation
– Row centering and normalization
– Filtering

3 Log2 Transformation
Advantages of the log2 transformation: it makes the noise independent of the mean, and equal fold changes have the same meaning across the dynamic range of the values.
– We would like dist(100, 200) = dist(1000, 2000).
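A minimal numeric sketch of this property (assuming NumPy):

```python
import numpy as np

# After log2 transformation, equal fold changes become equal
# additive differences, so dist(100, 200) == dist(1000, 2000).
x = np.array([100.0, 200.0, 1000.0, 2000.0])
y = np.log2(x)

print(y[1] - y[0])  # log2(200) - log2(100) = 1.0
print(y[3] - y[2])  # log2(2000) - log2(1000) = 1.0
```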

4 Row Centering & Normalization
For each row x:
– centering: y = x - mean(x)
– normalization: z = y / stdev(y)
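As a sketch, the same two steps applied to each gene (row) of an expression matrix:

```python
import numpy as np

# Row centering and normalization (rows = genes, columns = samples).
data = np.log2(np.array([[100., 200., 400.],
                         [ 10.,  20.,  40.]]))

y = data - data.mean(axis=1, keepdims=True)   # row centering
z = y / y.std(axis=1, keepdims=True)          # row normalization
print(z)  # both genes now have identical standardized profiles
```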

5 Filtering Genes
Filtering is very important for unsupervised analysis, since many noisy genes may totally mask the structure in the data.
After finding a hypothesis, one can identify marker genes in a larger dataset via supervised analysis.
[Diagram: All genes / Clustering / Supervised Analysis / Marker Selection]
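A minimal sketch of variation filtering (the fold-change and delta thresholds here are illustrative, not any module's defaults):

```python
import numpy as np

def filter_genes(data, min_fold=3.0, min_delta=100.0):
    """Keep genes whose variation across samples exceeds the cutoffs,
    so flat, noisy genes do not mask the cluster structure."""
    gene_max = data.max(axis=1)
    gene_min = data.min(axis=1)
    keep = (gene_max / np.maximum(gene_min, 1e-9) >= min_fold) \
         & (gene_max - gene_min >= min_delta)
    return data[keep]
```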

6 Clustering / Class Discovery
Aim: Partition the data (e.g. genes or samples) into sub-groups (clusters) such that points in the same cluster are "more similar" to each other than to points in other clusters.
Challenge: The task is not well defined; there is no single objective function or evaluation criterion.
Example: How many clusters are there? 2 + noise, 3 + noise, 20, or a hierarchy (2 splitting into 3 + noise)?
One has to choose:
– a similarity/distance measure
– a clustering method
– a way to evaluate the resulting clusters

7 Clustering in GenePattern
Representative-based: find representatives/centroids
– K-means: KMeansClustering
– Self-Organizing Maps (SOM): SOMClustering
Bottom-up (agglomerative): HierarchicalClustering, which hierarchically unites clusters using
– single linkage
– complete linkage
– average linkage
Clustering-like:
– NMFConsensus
– PCA (Principal Components Analysis)
There is no BEST method! For easy problems, most of them work. Each algorithm has its own assumptions, strengths, and weaknesses.

8 K-means Clustering
Aim: Partition the data points into K subsets and associate each subset with a centroid, such that the sum of squared distances between the data points and their associated centroids is minimal.
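In symbols, K-means minimizes the within-cluster sum of squares:

$$\min_{C_1,\dots,C_K,\ \mu_1,\dots,\mu_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

where $C_k$ is the set of points assigned to centroid $\mu_k$.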

9 K-means: Algorithm
Initialize centroids at random positions, then iterate:
– Assign each data point to its closest centroid
– Move each centroid to the center of its assigned points
Stop when converged. The algorithm is guaranteed to reach a local minimum.
[Figure: K = 3 example at iterations 0, 1, and 2]
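A minimal sketch of the loop (plain Lloyd iterations, not the GenePattern KMeansClustering module itself):

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids at k random data points
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        new = np.array([points[labels == j].mean(axis=0)
                        if np.any(labels == j) else centroids[j]
                        for j in range(k)])
        if np.allclose(new, centroids):
            break  # converged: assignments no longer change
        centroids = new
    return labels, centroids
```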

10 K-means: Summary
– The result depends on the initial centroid positions
– Fast algorithm: only needs to compute distances from data points to centroids
– The number of clusters K must be preset
– Fails for non-spherical distributions

11 Hierarchical Clustering
[Figure: five data points and the resulting dendrogram; the height of each join is the distance between the joined clusters]

12 Hierarchical Clustering
The dendrogram induces a linear ordering of the data points (up to a left/right flip at each split).
After each merge, the distance between the new cluster and the other clusters must be defined:
– Single linkage: distance between the closest pair
– Complete linkage: distance between the farthest pair
– Average linkage: average distance between all pairs, or the distance between cluster centers
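A minimal sketch of the three linkage rules using SciPy (illustrative; the slides use the GenePattern HierarchicalClustering module):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

points = np.random.default_rng(0).normal(size=(20, 2))
dists = pdist(points, metric="euclidean")

Z_single   = linkage(dists, method="single")    # closest pair
Z_complete = linkage(dists, method="complete")  # farthest pair
Z_average  = linkage(dists, method="average")   # mean pairwise distance

# The dendrogram's leaf order is the induced linear ordering of the points.
order = dendrogram(Z_average, no_plot=True)["leaves"]
```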

13 Average Linkage
[Figure: average-linkage clustering of leukemia samples and genes]

14 Single and Complete Linkage
[Figure: single-linkage and complete-linkage clustering of the same leukemia samples and genes]

15 Similarity/Distance Measures
The distance measure determines which samples/genes are considered similar and should be clustered together:
– Euclidean: the "ordinary" distance between two points that one would measure with a ruler, given by the Pythagorean formula
– Pearson correlation: a parametric measure of the strength of linear dependence between two variables
– Absolute Pearson correlation: the absolute value of the Pearson correlation
– Spearman rank correlation: a non-parametric measure of dependence between two variables
– Uncentered correlation: same as Pearson, but assumes the mean is 0
– Absolute uncentered correlation: the absolute value of the uncentered correlation
– Kendall's tau: a non-parametric measure of the degree of correspondence between two rankings
– City-block/Manhattan: the distance traveled between two points when a grid-like path is followed
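As a sketch, several of these measures computed with SciPy (its "correlation" metric is 1 minus the Pearson correlation, and "cosine" is 1 minus the uncentered correlation; Spearman and Kendall are available in scipy.stats):

```python
import numpy as np
from scipy.spatial.distance import pdist

genes = np.random.default_rng(0).normal(size=(5, 10))  # 5 genes, 10 samples

d_euclidean = pdist(genes, metric="euclidean")
d_pearson   = pdist(genes, metric="correlation")  # 1 - Pearson r
d_uncenter  = pdist(genes, metric="cosine")       # 1 - uncentered correlation
d_cityblock = pdist(genes, metric="cityblock")    # Manhattan
```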

16 A Reasonable Distance Measure
Euclidean distance on row-centered and normalized data works for both samples and genes:
– Genes: being close means being correlated
– Samples: genes with similar profiles (e.g. Gene 1 and Gene 2) make similar contributions to the distance between two samples (e.g. Sample 1 and Sample 5)
[Figure: profiles of Genes 1-4 across samples]
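This works because, for z-scored profiles of length n, squared Euclidean distance and Pearson correlation r are directly related by ||zx - zy||^2 = 2n(1 - r). A quick numeric check:

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=10), rng.normal(size=10)

# Row centering and normalization (z-scoring) of both profiles
zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

r = np.corrcoef(x, y)[0, 1]
d2 = np.sum((zx - zy) ** 2)
print(np.isclose(d2, 2 * len(x) * (1 - r)))  # True
```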

17 Pitfalls in Clustering
– Elongated clusters
– Filaments
– Clusters of different sizes

18 Compact, Separated Clusters
All methods work.
(Adapted from E. Domany)

19 Elongated Clusters
– Single linkage succeeds in partitioning the clusters
– Average linkage fails

20 Filament
Single linkage is not robust: a thin filament of points can chain otherwise distinct clusters together.
(Adapted from E. Domany)

21 Filament with a Point Removed
Removing a single point from the filament changes the single-linkage result, again showing that single linkage is not robust.
(Adapted from E. Domany)

22 Two-way Clustering
Two independent cluster analyses, one on the genes and one on the samples, are used to reorder the rows and columns of the data matrix (two-way clustering).
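A minimal sketch with SciPy: cluster rows and columns independently, then reorder the matrix by both leaf orders:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

data = np.random.default_rng(0).normal(size=(30, 12))  # genes x samples

row_order = leaves_list(linkage(data, method="average"))    # cluster genes
col_order = leaves_list(linkage(data.T, method="average"))  # cluster samples
reordered = data[np.ix_(row_order, col_order)]              # two-way reordered heatmap
```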

23 Hierarchical Clustering: Summary
– Results depend on the distance update method: single linkage favors elongated clusters, complete linkage favors sphere-like clusters
– Greedy iterative process
– NOT robust against noise
– No inherent measure for choosing the clusters (we return to this point in cluster validation)

24 Clustering Protocol

25 Validating the Number of Clusters
How do we know how many real clusters exist in the dataset?

26 Consensus Clustering
– Generate "perturbed" datasets D1, D2, ..., Dn from the original dataset
– Apply the clustering algorithm to each Di, yielding clusterings 1, 2, ..., n
– Compute the consensus matrix: for each pair of samples, the proportion of runs in which the two samples are clustered together (1 = always together, 0 = never together)
– Build a dendrogram based on the consensus matrix

27 Consensus Clustering
Ordering the consensus matrix according to the dendrogram makes consistent clusters (e.g. C1, C2, C3) visible as blocks along the diagonal.
[Figure: consensus matrix ordered according to the dendrogram]
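A minimal sketch of the procedure by subsampling, with K-means as an illustrative base clusterer (the slides use the GenePattern ConsensusClustering module):

```python
import numpy as np
from sklearn.cluster import KMeans

def consensus_matrix(data, k, n_iter=100, frac=0.8, seed=0):
    rng = np.random.default_rng(seed)
    n = len(data)
    together = np.zeros((n, n))  # times each pair clusters together
    sampled = np.zeros((n, n))   # times each pair is in the same subsample
    for _ in range(n_iter):
        idx = rng.choice(n, size=int(frac * n), replace=False)
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(data[idx])
        same = labels[:, None] == labels[None, :]
        sampled[np.ix_(idx, idx)] += 1
        together[np.ix_(idx, idx)] += same
    return together / np.maximum(sampled, 1)  # proportion clustered together
```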

28 Validation: Consistency / Robustness Analysis
Aim: Measure the agreement between clustering results on "perturbed" versions of the data.
Method:
– Iterate N times: generate a "perturbed" version of the original dataset (by subsampling, resampling with repeats, or adding noise) and cluster it
– Calculate the fraction of iterations in which each pair of samples belongs to the same cluster
– Optimize the number of clusters K by choosing the value of K that yields the most consistent results
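One hypothetical way to score consistency, building on the consensus_matrix sketch above: a clustering is consistent when the off-diagonal consensus entries are close to 0 or 1 rather than 0.5.

```python
import numpy as np

def consistency(M):
    # 1.0 = every pair always or never clusters together (perfectly consistent)
    off_diag = M[np.triu_indices_from(M, k=1)]
    return 2 * np.mean(np.abs(off_diag - 0.5))

# Pick the K that gives the most consistent consensus matrix
# (data as in the consensus_matrix sketch above):
# scores = {k: consistency(consensus_matrix(data, k)) for k in range(2, 7)}
# best_k = max(scores, key=scores.get)
```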

29 Consensus Clustering in GenePattern

30 Clustering Cookbook
Reduce the number of genes by variation filtering
– Use stricter parameters than for comparative marker selection
Choose a method for cluster discovery (e.g. hierarchical clustering)
Select a number of clusters
– Check the sensitivity of the clusters to the filtering and clustering parameters
– Validate on independent datasets
– Internally test the robustness of the clusters with consensus clustering
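A minimal sketch tying the cookbook steps together (the input file name is hypothetical; filter_genes and consensus_matrix are the sketches from the earlier slides):

```python
import numpy as np

raw = np.loadtxt("expression.txt")    # hypothetical genes x samples matrix
kept = filter_genes(raw)              # variation filtering on the raw scale (slide 5 sketch)
logged = np.log2(kept + 1)            # log2 transform
z = (logged - logged.mean(axis=1, keepdims=True)) \
    / logged.std(axis=1, keepdims=True)   # row centering and normalization

M = consensus_matrix(z.T, k=3)        # cluster samples; test robustness (slide 27 sketch)
```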

31 References
1. Brunet, J-P., Tamayo, P., Golub, T.R., and Mesirov, J.P. 2004. Metagenes and molecular pattern discovery using matrix factorization. Proc. Natl. Acad. Sci. USA 101(12):4164-4169.
2. Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95:14863-14868.
3. Kim, P.M. and Tidor, B. 2003. Subsystem identification through dimensionality reduction of large-scale gene expression data. Genome Research 13:1706-1718.
4. MacQueen, J.B. 1967. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. University of California Press, California. pp. 281-297.
5. Monti, S., Tamayo, P., Mesirov, J.P., and Golub, T. 2003. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning 52(1-2):91-118.
6. Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Dmitrovsky, E., Lander, E.S., and Golub, T.R. 1999. Interpreting gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA.

