Basic Gene Expression Data Analysis -- Clustering

Agenda
1. Introduction to clustering
   1.1 Dissimilarity measure
   1.2 Preprocessing
2. Clustering methods
   2.1 Hierarchical clustering
   2.2 K-means and K-medoids
   2.3 Self-organizing maps (SOM)
   2.4 Model-based clustering
3. Estimate # of clusters
4. Two new methods allowing scattered genes
   4.1 Tight clustering
   4.2 Penalized and weighted K-means
5. Cluster validation and evaluation
6. Comparison and discussion

Statistical Issues in Microarray Analysis
Experimental design; image analysis; preprocessing (normalization, filtering, missing-value imputation); data visualization; identifying differentially expressed genes; clustering; classification; regulatory networks; gene enrichment analysis; integrative analysis & meta-analysis.

1. Introduction
"Clustering" is an exploratory tool for looking at associations within gene expression data. These methods allow us to hypothesize about relationships between genes and classes.
- Cluster genes: a similar expression pattern implies co-regulation.
- Cluster samples: identify potential sub-classes of disease.

1. Introduction
1. Assigning objects to the groups
2. Estimating the number of clusters
3. Assessing strength/confidence of cluster assignments for individual objects

1.1. Dissimilarity measure (correlation)
A metric is a measure of the similarity or dissimilarity between two data objects; it is used to group data points into clusters.
Two main classes of distance:
- Correlation coefficients (scale-invariant)
- Distance metrics (scale-dependent)

1.1. Dissimilarity measure (correlation)

Centered vs Uncentered correlation (centered is preferred in microarray analysis)
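The transcript does not reproduce the slide's formulas; for reference, the standard definitions of the two coefficients, for gene profiles x = (x_1, ..., x_n) and y = (y_1, ..., y_n), are:

Centered (Pearson) correlation:
r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2 \, \sum_{i=1}^{n}(y_i-\bar{y})^2}}

Uncentered correlation (cosine similarity):
r_u = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2 \, \sum_{i=1}^{n} y_i^2}}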

1.1. Dissimilarity measure (correlation)
Positive and negative correlation (figure; one of the illustrated patterns is labeled "biologically not interesting").

1.1. Dissimilarity measure (distance)

1.1. Dissimilarity measure
Pearson correlation; Euclidean distance.
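As a concrete illustration (not taken from the slides), both dissimilarities can be computed directly with NumPy; the toy gene profiles below are made up:

```python
import numpy as np

def pearson_distance(x, y):
    """Correlation-based dissimilarity: 1 - Pearson correlation."""
    r = np.corrcoef(x, y)[0, 1]
    return 1.0 - r

def euclidean_distance(x, y):
    """Plain Euclidean distance between two expression profiles."""
    return float(np.sqrt(np.sum((np.asarray(x) - np.asarray(y)) ** 2)))

# toy profiles for two genes measured on five arrays (made-up numbers)
g1 = np.array([0.2, 1.5, 0.7, 2.1, 0.9])
g2 = np.array([0.1, 1.4, 0.9, 2.0, 1.1])
print(pearson_distance(g1, g2), euclidean_distance(g1, g2))
```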

1.2. Preprocessing
A procedure normally used when clustering genes in microarray data:
1. Filter genes with a small coefficient of variation (CV = stdev / mean).
2. Standardize gene rows to mean 0 and stdev 1.
3. Use Euclidean distance.
Advantages:
- Step 1 takes into account the fact that high-abundance genes normally have larger variation; it filters out "flat" genes.
- Step 2 makes Euclidean distance and correlation equivalent. Many useful methods require the data to be in Euclidean space.
A minimal sketch of this procedure is shown below.
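A minimal NumPy sketch of steps 1-2; the CV cutoff of 0.3 is a hypothetical choice, not taken from the slides:

```python
import numpy as np

def preprocess(X, cv_cutoff=0.3):
    """X: genes x samples matrix of (positive) intensities.
    The cv_cutoff value is a hypothetical choice, not from the slides."""
    means = X.mean(axis=1)
    sds = X.std(axis=1, ddof=1)
    keep = (sds / means) > cv_cutoff                   # step 1: drop "flat" genes
    Xf = X[keep]
    Xs = (Xf - Xf.mean(axis=1, keepdims=True)) / Xf.std(axis=1, ddof=1, keepdims=True)
    return Xs, keep                                    # step 2: rows have mean 0, sd 1

# After standardization, the squared Euclidean distance between two rows equals
# 2(n-1)(1 - Pearson correlation), so step 3 can safely use Euclidean distance.
```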

2. Clustering methods

2. Clustering methods
- Hierarchical clustering
- K-means; K-medoids / Partitioning Around Medoids (PAM)
- Self-organizing maps (SOM)
- Model-based clustering
- CLICK
- FUZZY
- Bayesian model-based clustering
- Tight clustering
- Penalized and weighted K-means

2.1. Hierarchical clustering

The most popular hierarchical clustering methods

2.1. Hierarchical clustering

2.1. Hierarchical clustering: Choice of linkage
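As an illustration (not from the slides), the common linkage choices can be compared with SciPy; the data below are random placeholders for an expression matrix:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

X = np.random.default_rng(0).normal(size=(50, 6))    # 50 genes x 6 arrays, toy data
d = pdist(X, metric="correlation")                   # pairwise 1 - Pearson correlation

for method in ("single", "complete", "average"):
    Z = linkage(d, method=method)                    # agglomerative tree for this linkage
    labels = fcluster(Z, t=4, criterion="maxclust")  # "cut" the dendrogram into 4 clusters
    print(method, np.bincount(labels)[1:])           # cluster sizes differ by linkage

# dendrogram(Z) draws the tree for the last linkage computed
```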

2.1. Hierarchical clustering: Tree presentation (dendrogram)

2.1. Hierarchical clustering: Discussion
- The most overused statistical method in gene expression analysis.
- Gives us a pretty red-green picture with patterns, but the pretty picture tends to be unstable.
- There are many different ways to perform hierarchical clustering, and the results tend to be sensitive to small changes in the data.
- Clusters of every size are provided: where to "cut" the dendrogram is user-determined.

2.2-2.4. Partitioning methods

2.2. K-means

Algorithm: minimize the within-cluster dispersion to the cluster centers.
Notes:
1. Points should be in Euclidean space.
2. Optimization is performed by iterative relocation algorithms; falling into a local minimum is inevitable.
3. k has to be correctly estimated.
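In symbols, the within-cluster dispersion that K-means minimizes (standard form, not reproduced in the transcript) is

W(C, \mu) = \sum_{k=1}^{K} \sum_{i:\, C(i)=k} \lVert x_i - \mu_k \rVert^2,

minimized jointly over the cluster assignments C and the cluster centers \mu_1, ..., \mu_K.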

2.2. K-means
1. Choose K centroids at random.
2. Make an initial partition of objects into k clusters by assigning objects to the closest centroid.
3. M-step: calculate the centroid (mean) of each of the k clusters.
4. E-step: reassign objects to the closest centroids.
5. Repeat 3 and 4 until no reallocations occur.
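A minimal NumPy sketch of these five steps (plain iterative relocation; empty clusters and multiple restarts are not handled here):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain iterative-relocation K-means following the five steps above."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]            # step 1
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)                                    # steps 2 and 4
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])  # step 3
        if np.allclose(new_centroids, centroids):                        # step 5
            break
        centroids = new_centroids
    return labels, centroids
```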

2.2. K-means
EM-algorithm view: the object assignment (into clusters) is viewed as missing data.

2.2. K-means
Local minimum problem: different initial values for K-means can give different solutions (in the figure, the starting point "x" falls into a local minimum).

Advantage:  Fast: o(KN) instead of o(N 2 ) in hierarchical clustering.  Relationship to Gaussian mixture model. Disadvantage:  Local minimum problem very common  Does not allow scattered genes  Estimation of # of clusters has to be accurate 2.2. K-means

2.2. K-medoids / PAM: very useful when the objects are not in Euclidean space (only a pairwise dissimilarity matrix is required)!

2.3. SOM
Figure: nodes on a 2D grid (R^2) are mapped into the gene-expression space (R^p).

1. Six nodes (N_1, N_2, ..., N_6) on a 3 x 2 grid in 2D are first selected. Denote by d(N_i, N_j) the distance between two nodes in the 2D space.
2. Set a random initial map f_0: (N_1, N_2, ..., N_6) -> R^p (the gene expression space).
3. At each iteration i, a data point (gene) P is randomly selected. N_P is the node that maps nearest to P. The mapping of the nodes is then updated by moving their images towards P:
   f_{i+1}(N) = f_i(N) + \tau(d(N, N_P), i) \, (P - f_i(N))
   The learning rate \tau decreases with the distance of node N from N_P and with the iteration number i.
4. Data points P are selected in a random order of genes and recycled. The procedure continues for 20,000-50,000 iterations.
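A toy NumPy sketch of this update loop. The exact form of the learning rate tau is an assumption; only its qualitative behaviour (decaying with grid distance and with iteration number) matches the slide:

```python
import numpy as np

def som(X, grid=(3, 2), n_iter=20000, seed=0):
    """Toy SOM: nodes on a 2D grid, images f(N) living in R^p."""
    rng = np.random.default_rng(seed)
    nodes = np.array([(i, j) for i in range(grid[0]) for j in range(grid[1])], float)
    f = X[rng.choice(len(X), size=len(nodes), replace=False)].astype(float)  # random initial map f_0
    for it in range(n_iter):
        P = X[rng.integers(len(X))]                       # randomly selected gene
        winner = ((f - P) ** 2).sum(axis=1).argmin()      # N_P: node mapped nearest to P
        d2d = np.abs(nodes - nodes[winner]).sum(axis=1)   # node distance on the 2D grid
        tau = 0.1 * np.exp(-d2d) * (1.0 - it / n_iter)    # assumed learning-rate schedule
        f += tau[:, None] * (P - f)                       # move node images towards P
    return nodes, f
```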

2.3. SOM
Example: yeast cell cycle data; 828 genes; a 6 x 5 SOM. Average patterns and error bars are shown for each cluster.

2.3. SOM
Advantages:
- Better visualization and interpretation: nodes that are closer in the 2D space have corresponding clusters that are also closer.
- Easier to re-group similar clusters.
Disadvantages:
- SOM can be viewed as a restricted form of K-means (K-means restricted to a 2D geometry), so the result is suboptimal in some sense.
- A heuristic algorithm.

2.4. Model-based clustering

2.4. Model-based clustering
Gaussian assumption, with two ways of writing the likelihood:
- Classification likelihood
- Mixture likelihood
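The transcript drops the formulas; the standard forms are:

Classification likelihood:
L_C(\theta_1, \ldots, \theta_K; \gamma_1, \ldots, \gamma_n) = \prod_{i=1}^{n} f_{\gamma_i}(x_i; \theta_{\gamma_i})

Mixture likelihood:
L_M(\theta, \pi) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k \, f_k(x_i; \theta_k)

with the Gaussian assumption f_k(x; \theta_k) = \phi(x; \mu_k, \Sigma_k).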

2.4. Model-based clustering Model-based approach: Fraley and Raftery (1998) applied a Gaussian mixture model.

2.4. Model-based clustering
1. EM algorithm to maximize the mixture likelihood.
2. Model selection: the Bayesian Information Criterion (BIC) for determining k and the complexity of the covariance matrix.
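A hedged illustration of the same idea using scikit-learn rather than the authors' software (note that scikit-learn defines BIC so that smaller is better, the opposite sign convention from mclust); the data here are random placeholders:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(200, 5))     # toy stand-in for expression profiles

best = None
for k in range(1, 8):
    for cov in ("spherical", "diag", "full"):          # increasingly complex covariance structures
        gm = GaussianMixture(n_components=k, covariance_type=cov, random_state=0).fit(X)
        bic = gm.bic(X)                                # scikit-learn's BIC: smaller is better
        if best is None or bic < best[0]:
            best = (bic, k, cov)
print("chosen model:", best)
```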

Mixture model with noise or outliers:
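One common formulation (as in Fraley and Raftery, where the noise is modeled as a homogeneous Poisson process over the data region of volume V) adds a uniform component to the mixture:

L(\theta, \pi) = \prod_{i=1}^{n} \Big( \frac{\pi_0}{V} + \sum_{k=1}^{K} \pi_k \, \phi(x_i; \mu_k, \Sigma_k) \Big)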

2.4. Model-based clustering
Advantages:
- Flexibility in the cluster covariance structure.
- Rigorous statistical inference with a full model.
Disadvantages:
- Model selection is usually difficult.
- The data may not fit the Gaussian model.
- Too many parameters to estimate with a complex covariance structure.
- Local minimum problem.

3. Estimate # of clusters
Milligan & Cooper (1985) compared over 30 published rules. None is particularly better than all the others; the best method is data dependent.
1. Calinski & Harabasz (1974): choose the k that maximizes
   CH(k) = [B(k)/(k-1)] / [W(k)/(n-k)],
   where B(k) and W(k) are the between- and within-cluster sums of squares with k clusters.
2. Hartigan (1975):
   H(k) = [W(k)/W(k+1) - 1] (n-k-1).
   Stop when H(k) < 10.
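For illustration (not from the slides), the CH index is available directly in scikit-learn; the data below are random placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

X = np.random.default_rng(0).normal(size=(300, 10))     # toy data

for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, calinski_harabasz_score(X, labels))         # choose the k with the largest CH(k)
```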

3. Estimate # of clusters
3. Gap statistic: Tibshirani, Walther & Hastie (2000). The within-cluster sum of squares W(k) is a decreasing function of k. Normally one looks for an elbow-shaped turning point to identify the number of clusters k.

3. Estimate # of clusters
Instead of the above arbitrary criterion, the paper proposes to maximize the Gap statistic
   Gap_n(k) = E*_n[log W(k)] - log W(k),
where the background expectation E*_n is calculated by random sampling from a reference (uniform) distribution.

The background expectation is calculated by random sampling from a uniform distribution.
Figure: the reference samples are drawn uniformly either over a bounding box aligned with the feature axes, or over a bounding box aligned with the principal axes of the observations (Monte Carlo simulations).

Computation of the Gap statistic:
for b = 1 to B:
    compute a Monte Carlo sample X_{1b}, X_{2b}, ..., X_{nb} (n is # obs.)
for k = 1 to K:
    cluster the observations into k groups and compute log W_k
    for b = 1 to B:
        cluster the b-th Monte Carlo sample into k groups and compute log W_{kb}
    compute Gap(k) = (1/B) * sum_b log W_{kb} - log W_k
    compute sd(k), the s.d. of {log W_{kb}}_{b=1,...,B}
    set the total s.e. s_k = sd(k) * sqrt(1 + 1/B)
find the smallest k such that Gap(k) >= Gap(k+1) - s_{k+1}
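A minimal Python sketch of this procedure, using k-means and the feature-aligned uniform reference (the function names and the choice of K and B are illustrative, not from the slides):

```python
import numpy as np
from sklearn.cluster import KMeans

def log_Wk(X, k):
    """log of the pooled within-cluster sum of squares for k-means with k clusters."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return np.log(km.inertia_)

def gap_statistic(X, K=8, B=20, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)                 # feature-aligned bounding box
    gap, s = [], []
    for k in range(1, K + 1):
        obs = log_Wk(X, k)
        ref = np.array([log_Wk(rng.uniform(lo, hi, size=X.shape), k) for _ in range(B)])
        gap.append(ref.mean() - obs)                      # Gap(k)
        s.append(ref.std(ddof=1) * np.sqrt(1 + 1.0 / B))  # total s.e. s_k
    for k in range(1, K):                                 # smallest k with Gap(k) >= Gap(k+1) - s_{k+1}
        if gap[k - 1] >= gap[k] - s[k]:
            return k, gap, s
    return K, gap, s
```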

3. Estimate # of clusters
4. Resampling methods: Tibshirani et al. (2001); Dudoit and Fridlyand (2002). Split the data, treat the clustering of the test data as the underlying truth, and compare it with the cluster memberships predicted from the training data.

3. Estimate # of clusters
Prediction strength: let A_{k1}, A_{k2}, ..., A_{kk} be the K-means clusters from the test data X_te, and let n_{kj} = #(A_{kj}).
Notation: X_tr = training data; X_te = test data; C(X_tr, k) = K-means clustering centroids from the training data; D[C(X_tr, k), X_te] = co-membership matrix, whose (i, i') entry is 1 if test observations i and i' are assigned to the same cluster by the training centroids, and 0 otherwise.
Prediction strength:
   ps(k) = min_{1<=j<=k} (1 / (n_{kj}(n_{kj}-1))) * sum_{i != i' in A_{kj}} D[C(X_tr, k), X_te]_{i i'}
Find the k that maximizes ps(k).
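A hedged Python sketch of ps(k) using a random half-split and k-means (the split scheme and thresholding rule are simplifications of the published method):

```python
import numpy as np
from sklearn.cluster import KMeans

def prediction_strength(X, k, seed=0):
    """Cluster a training and a test half separately, then check how well the
    training centroids reproduce co-membership within each test cluster."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    tr, te = X[idx[: len(X) // 2]], X[idx[len(X) // 2:]]
    km_tr = KMeans(n_clusters=k, n_init=10, random_state=0).fit(tr)
    te_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(te)  # A_k1..A_kk ("truth")
    pred = km_tr.predict(te)                   # test points assigned to the training centroids
    strengths = []
    for j in range(k):
        members = np.where(te_labels == j)[0]
        n_kj = len(members)
        if n_kj < 2:
            continue
        same = sum(pred[a] == pred[b] for ai, a in enumerate(members) for b in members[ai + 1:])
        strengths.append(same / (n_kj * (n_kj - 1) / 2))   # proportion of co-assigned pairs
    return min(strengths)

# choose the largest k for which prediction_strength(X, k) stays above a chosen threshold
```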

3. Estimate # of clusters
Conclusions:
1. There is no dominating "good" method for estimating the number of clusters. Some are good only in specific simulations or examples.
2. In a high-dimensional, complex data set there might not be a clear "true" number of clusters.
3. The problem is also about "resolution": at a coarser resolution a few loose clusters may be identified, while at a finer resolution many small tight clusters will stand out.