Presentation on theme: "Introduction to Hierarchical Clustering Analysis Pengyu Hong 09/16/2005."— Presentation transcript:

1 Introduction to Hierarchical Clustering Analysis Pengyu Hong 09/16/2005

2 Background [Diagram: Cell/Tissue 1, Cell/Tissue 2, …, Cell/Tissue N are each profiled to give Data 1, Data 2, …, Data N.] Put similar samples/entries together.

3 Background  Clustering is one of the most important unsupervised learning processes; it organizes objects into groups whose members are similar in some way.  Clustering finds structure in a collection of unlabeled data.  A cluster is a collection of objects that are similar to one another and dissimilar to objects belonging to other clusters.

4 Motivation I Microarray data quality checking –Do replicates cluster together? –Do similar conditions, time points, and tissue types cluster together?

5 Data: Rat Schizophrenia Data (Allen Fienberg and Mayetri Gupta)  Two time points: 35 days (PD35) and 60 days (PD60) after birth.  Two brain regions: prefrontal cortex (PFC) and nucleus accumbens (NA).  Two replicates (samples are from the same set of tissue split into different tubes, so replicates should be in close agreement).  dChip was used to normalize the data and obtain model-based expression values, using the full PM/MM model. How to read this clustering result? [Figure: heat map annotated with link lengths, sample IDs, gene IDs, and clustering results.] Problem?

6 Motivation II Cluster genes  Prediction of functions of unknown genes by known ones

7 Functionally significant gene clusters [Figure: two-way clustering, showing gene clusters and sample clusters.]

8 Motivation II Cluster genes  Prediction of functions of unknown genes by known ones Cluster samples  Discover clinical characteristics (e.g. survival, marker status) shared by samples.

9 Bhattacharjee et al. (2001). Human lung carcinomas mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA, Vol. 98, 13790–13795.

10 Motivation II Cluster genes  Prediction of functions of unknown genes by known ones Cluster samples  Discover clinical characteristics (e.g. survival, marker status) shared by samples. Promoter analysis of commonly regulated genes.

11 David J. Lockhart & Elizabeth A. Winzeler, Nature, Vol. 405, 15 June 2000, p. 827. Promoter analysis of commonly regulated genes.

12 Clustering Algorithms Start with a collection of n objects, each represented by a p-dimensional feature vector $x_i$, $i = 1, \dots, n$. The goal is to divide these n objects into k clusters so that objects within a cluster are more “similar” than objects between clusters. k is usually unknown. Popular methods: hierarchical, k-means, SOM, mixture models, etc.

13 Hierarchical Clustering [Figure: dendrogram and Venn diagram of clustered data.] From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt

14 Hierarchical Clustering (Cont.) Multilevel clustering: level 1 has n clusters  level n has one cluster. Agglomerative HC: starts with singletons and merges clusters. Divisive HC: starts with a single cluster containing all samples and splits clusters.
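
The slides contain no code; as a minimal sketch of agglomerative HC, the following Python snippet builds the merge tree with SciPy and draws the dendrogram. The 20×5 matrix X is a made-up stand-in for real expression data, and average linkage is an assumed choice.

```python
# Minimal agglomerative hierarchical clustering sketch (illustrative, not
# from the slides). Requires numpy, scipy, and matplotlib.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))      # 20 objects, 5-dimensional feature vectors

Z = linkage(X, method="average")  # bottom-up: start with 20 singletons, merge up
dendrogram(Z)                     # level 1: 20 clusters ... level 20: 1 cluster
plt.show()
```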

15 Nearest Neighbor Algorithm Nearest Neighbor Algorithm is an agglomerative approach (bottom-up). Starts with n nodes (n is the size of our sample), merges the 2 most similar nodes at each step, and stops when the desired number of clusters is reached. From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt
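
A hedged sketch of this procedure, using SciPy's single-linkage method (which corresponds to the nearest-neighbor merging rule) and stopping at a desired cluster count; here k = 7, matching the Level 2 snapshot below. The 9 sample points are made up.

```python
# Nearest-neighbor agglomerative clustering that stops once the desired
# number of clusters k is reached; data are made-up examples.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.normal(size=(9, 2))                      # n = 9 sample points

Z = linkage(X, method="single")                  # merge the 2 most similar nodes per step
labels = fcluster(Z, t=7, criterion="maxclust")  # stop at k = 7 clusters
print(labels)                                    # cluster label for each of the 9 points
```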

16 Nearest Neighbor, Level 2, k = 7 clusters. From http://www.stat.unc.edu/postscript/papers/marron/Stat321FDA/RimaIzempresentation.ppt

17 Nearest Neighbor, Level 3, k = 6 clusters.

18 Nearest Neighbor, Level 4, k = 5 clusters.

19 Nearest Neighbor, Level 5, k = 4 clusters.

20 Nearest Neighbor, Level 6, k = 3 clusters.

21 Nearest Neighbor, Level 7, k = 2 clusters.

22 Nearest Neighbor, Level 8, k = 1 cluster.

23 Hierarchical Clustering Keys (1) Calculate the similarity between all possible combinations of two profiles. (2) Group the two most similar clusters together to form a new cluster. (3) Calculate the similarity between the new cluster and all remaining clusters, and repeat.
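
To make the three steps concrete, here is a naive from-scratch sketch; it is illustrative only (O(n^3), too slow for real data) and assumes average linkage with Euclidean distance, which the slides do not prescribe.

```python
# Naive agglomerative loop implementing the three keys above.
import numpy as np

def agglomerate(X, k):
    clusters = [[i] for i in range(len(X))]  # start: every profile is its own cluster
    while len(clusters) > k:
        # Key 1: (dis)similarity between all pairs of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = np.mean([np.linalg.norm(X[i] - X[j])
                             for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        # Key 2: merge the two most similar clusters into a new cluster.
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
        # Key 3: the next pass recomputes similarities against the new cluster.
    return clusters

X = np.random.default_rng(2).normal(size=(10, 4))
print(agglomerate(X, k=3))                   # three lists of profile indices
```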

24 Similarity Measurements Pearson Correlation For two profiles (vectors) $x = (x_1, \dots, x_p)$ and $y = (y_1, \dots, y_p)$: $r(x, y) = \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2} \, \sqrt{\sum_{i=1}^{p} (y_i - \bar{y})^2}}$, with $-1 \le r(x, y) \le +1$.
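
A quick numeric check of this measure; the profiles x and y are made-up examples.

```python
# Pearson correlation between two made-up expression profiles.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.0])

r = np.corrcoef(x, y)[0, 1]   # implements the formula above
print(r)                      # near +1: the two profiles share a trend
```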

25 Similarity Measurements Pearson Correlation: Trend Similarity

26 Similarity Measurements Euclidean Distance
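
The formula image does not survive in this transcript; the standard definition is $d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$. Computed on the same made-up profiles as above:

```python
# Euclidean distance between the same two made-up profiles.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.0])

d = np.linalg.norm(x - y)     # sqrt of the summed squared differences
print(d)                      # sizeable, even though the trends match
```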

27 Similarity Measurements Euclidean Distance: Absolute difference

28 Similarity Measurements Cosine Correlation $\cos(x, y) = \frac{\sum_{i=1}^{p} x_i y_i}{\sqrt{\sum_{i=1}^{p} x_i^2} \, \sqrt{\sum_{i=1}^{p} y_i^2}}$, with $-1 \le \cos(x, y) \le +1$.
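
Computed on the same made-up profiles, showing that cosine correlation stays high when two profiles point in a similar direction from the origin:

```python
# Cosine correlation between the same two made-up profiles.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.0])

cos = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
print(cos)                    # near +1: similar direction from the origin
```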

29 Similarity Measurements Cosine Correlation: Trend + Mean Distance

30 Similarity Measurements

31 Similar?

32 Clustering Given clusters C1, C2, and C3: merge which pair of clusters?

33 Clustering: Single Linkage Dissimilarity between two clusters C1 and C2 = minimum dissimilarity between the members of the two clusters. Tends to generate “long chains”.

34 Clustering: Complete Linkage Dissimilarity between two clusters C1 and C2 = maximum dissimilarity between the members of the two clusters. Tends to generate “clumps”.

35 Clustering: Average Linkage Dissimilarity between two clusters C1 and C2 = average of the distances over all pairs of objects (one from each cluster).

36 Clustering: Average Group Linkage Dissimilarity between two clusters C1 and C2 = distance between the two cluster means.
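
To compare these four rules in practice, a small sketch using SciPy on made-up data; note that SciPy's name for the cluster-mean rule above is "centroid".

```python
# Running the four linkage rules above on the same made-up data.
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.random.default_rng(3).normal(size=(15, 4))
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)
    print(method, "height of final merge:", Z[-1, 2])
```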

37 Considerations What genes are used to cluster samples? –Expression variation –Inherent variation –Prior knowledge (irrelevant genes) –Etc.

38 Take Home Questions Which clustering method is better? How to cut the clustering tree to get relatively tight clusters of genes or samples?
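
For the second question, one common approach (assumed here, not prescribed by the slides) is to cut the tree at a distance threshold so that only relatively tight merges survive; the threshold t below is a made-up value to be tuned per dataset.

```python
# Cutting the clustering tree at a distance threshold.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(4).normal(size=(30, 5))
Z = linkage(X, method="average")

labels = fcluster(Z, t=2.5, criterion="distance")  # keep merges below height t
print("number of clusters:", labels.max())
```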

