Clustering microarray data 09/26/07

Subclasses of lung cancer have signature genes (Bhattacharjee et al. 2001)

Promoter analysis of commonly regulated genes (David J. Lockhart & Elizabeth A. Winzeler, Nature 405, 15 June 2000, p. 827)

Discovery of new cancer subtypes. These classes are unknown at the time of the study.

Overview Clustering is an unsupervised learning method used to build groups of genes with related expression patterns. The classes are not known in advance; the aim is to discover new patterns from microarray data. In contrast, supervised learning refers to the learning process where the classes are known, and the aim is to define classification rules that separate them. Supervised learning will be discussed in the next lecture.

Dissimilarity function To identify clusters, we first need to define what "close" means. There are many choices of distance: Euclidean distance, d(x, y) = sqrt(Σ_i (x_i − y_i)²); 1 − Pearson correlation, d(x, y) = 1 − r(x, y); Manhattan distance, d(x, y) = Σ_i |x_i − y_i|; …
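
As a concrete illustration (not from the lecture), a minimal NumPy sketch of these three dissimilarities; the function names and toy profiles are made up:

```python
import numpy as np

def euclidean(x, y):
    # d(x, y) = sqrt(sum_i (x_i - y_i)^2)
    return np.sqrt(np.sum((x - y) ** 2))

def correlation_distance(x, y):
    # d(x, y) = 1 - Pearson correlation between x and y
    return 1.0 - np.corrcoef(x, y)[0, 1]

def manhattan(x, y):
    # d(x, y) = sum_i |x_i - y_i|
    return np.sum(np.abs(x - y))

# Two toy expression profiles over 5 conditions (made-up values).
g1 = np.array([1.0, 2.0, 3.0, 2.5, 1.5])
g2 = np.array([0.8, 2.2, 2.9, 2.4, 1.7])
print(euclidean(g1, g2), correlation_distance(g1, g2), manhattan(g1, g2))
```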

Where is the "truth"? "In the context of unsupervised learning, there is no such direct measure of success. It is difficult to ascertain the validity of inference drawn from the output of most unsupervised learning algorithms. One must often resort to heuristic arguments not only for motivating the algorithm, but also for judgments as to the quality of results. This uncomfortable situation has led to heavy proliferation of proposed methods, since effectiveness is a matter of opinion and cannot be verified directly." (Hastie et al. 2001, The Elements of Statistical Learning)

Clustering Methods Partitioning methods – seek to optimally divide objects into a fixed number of clusters. Hierarchical methods – produce a nested sequence of clusters. (Speed, Chapter 4)

Methods k-means Hierarchical clustering Self-organizing maps (SOM)

k-means Divide objects into k clusters. The goal is to minimize the total intra-cluster variance, W = Σ_{j=1..k} Σ_{x∈C_j} ||x − μ_j||², where μ_j is the centroid of cluster C_j. The global minimum is difficult to obtain.

Algorithm for k-means clustering Step 1 (initialization): randomly select k centroids. Step 2: for each object, find its closest centroid and assign the object to the corresponding cluster. Step 3: for each cluster, update its centroid to the mean position of all objects in that cluster. Repeat Steps 2 and 3 until convergence.
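
A minimal Python sketch of these iterations (Lloyd's algorithm); the toy data at the end are illustrative, not from the lecture:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialization -- randomly pick k objects as centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign every object to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned objects
        # (keep the old centroid if a cluster ends up empty).
        new_centroids = np.array(
            [X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
             for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids

# Toy data: two well-separated Gaussian blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, k=2)
```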

(Figure sequence illustrating the k-means iterations: 1. the initial randomized centers and a number of points; 2. points are assigned to the nearest center and the centers are moved to the respective centroids; 3. the new assignment after the centroids have moved; 4. the centers are again moved to the centroids of their associated points.)

Properties of k-means Achieves a local minimum of the objective W (the within-cluster sum of squares above). Very fast.

Practical issues with k-means k must be known in advance. Results depend on the initial assignment of centroids.

How to choose k? Milligan & Cooper (1985) compared 30 published rules. Two examples, where W(k) = total sum of squares within clusters and B(k) = sum of squares between cluster means: 1. Calinski & Harabasz (1974): choose the k maximizing CH(k) = [B(k)/(k−1)] / [W(k)/(n−k)]. 2. Hartigan (1975): H(k) = [W(k)/W(k+1) − 1](n − k − 1); stop when H(k) < 10.
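
A hedged sketch of both rules, reusing the kmeans() function from the sketch above; within_ss() computes W(k):

```python
import numpy as np

def within_ss(X, labels, centroids):
    # W(k): total within-cluster sum of squares
    return sum(np.sum((X[labels == j] - centroids[j]) ** 2)
               for j in range(len(centroids)))

def choose_k(X, k_max=10):
    n = len(X)
    W = {}
    for k in range(1, k_max + 2):           # need W(k_max + 1) for Hartigan
        labels, centroids = kmeans(X, k)
        W[k] = within_ss(X, labels, centroids)
    total_ss = np.sum((X - X.mean(axis=0)) ** 2)
    for k in range(2, k_max + 1):
        B = total_ss - W[k]                 # between-cluster sum of squares
        ch = (B / (k - 1)) / (W[k] / (n - k))     # maximize CH(k)
        h = (W[k] / W[k + 1] - 1) * (n - k - 1)   # stop when H(k) < 10
        print(f"k={k}: CH={ch:.1f}, H={h:.1f}")
```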

How to choose k (continued)? The gap statistic (Tibshirani et al. 2001): estimate log W(k) for reference data generated uniformly over a rectangle containing the observed data; Gap(k) = E[log W(k), random] − log W(k), observed; choose the k for which the gap is largest. (Figure: log W(k) curves for random and observed data, and the resulting gap curve.)
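
A simplified gap-statistic sketch (it omits the standard-error correction of the full Tibshirani rule), again reusing kmeans() and within_ss() from the earlier sketches:

```python
import numpy as np

def gap_statistic(X, k_max=10, n_ref=20, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)   # bounding rectangle of the data
    gaps = {}
    for k in range(1, k_max + 1):
        labels, centroids = kmeans(X, k)
        log_w_obs = np.log(within_ss(X, labels, centroids))
        log_w_ref = []
        for _ in range(n_ref):
            # Reference data: uniform over the bounding rectangle.
            X_ref = rng.uniform(lo, hi, size=X.shape)
            ref_labels, ref_centroids = kmeans(X_ref, k)
            log_w_ref.append(np.log(within_ss(X_ref, ref_labels, ref_centroids)))
        gaps[k] = np.mean(log_w_ref) - log_w_obs
    return max(gaps, key=gaps.get)          # k with the largest gap
```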

How to select initial centroids Repeat the procedure many times with randomly chosen initial centroids and keep the best solution. Alternatively, initialize the centroids "smartly", e.g., by hierarchical clustering.
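
One possible "smart" initialization, sketched with SciPy: cut an average-linkage tree into k groups and use the group means in place of the random Step-1 centroids in the k-means sketch above (the linkage method is an arbitrary choice):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def hierarchical_init(X, k):
    Z = linkage(X, method="average")                 # average-linkage tree
    labels = fcluster(Z, t=k, criterion="maxclust")  # cut into k groups
    # Group means become the Step-1 centroids for k-means.
    return np.array([X[labels == j].mean(axis=0) for j in range(1, k + 1)])
```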

(Figure: within-cluster sum of squares for different initializations, marked X and O.) K-means requires good initial values. Hierarchical clustering could be used to supply them but sometimes performs poorly.

Hierarchical clustering Hierarchical clustering builds a hierarchy of clusters, represented by a tree (called a dendrogram). Close clusters are joined together, and the height of a branch represents the dissimilarity between the two clusters it joins.

How to construct a dendrogram Bottom-up approach – initialization: each cluster contains a single object; iteration: merge the "closest" clusters; stop when all objects are included in a single cluster. Top-down approach – starting from a single cluster containing all objects, iteratively partition into smaller clusters. To extract clusters, truncate the dendrogram at a similarity threshold (e.g., correlation > 0.6) or require each cluster to contain at least a minimum number of objects.
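
A bottom-up sketch using SciPy, with the 1 − correlation distance and a truncation threshold of 0.4 (i.e., correlation > 0.6); the data are toy values:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram
from scipy.spatial.distance import pdist

X = np.random.randn(30, 8)            # toy data: 30 genes x 8 conditions
d = pdist(X, metric="correlation")    # 1 - Pearson correlation
Z = linkage(d, method="average")      # iteratively merge the closest clusters
clusters = fcluster(Z, t=0.4, criterion="distance")  # truncate: corr > 0.6
# dendrogram(Z) draws the tree (requires matplotlib).
```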

Hierarchical Clustering Dendrogram

Dendrogram can be reordered

Ordered dendrograms There are 2^(n−1) linear orderings of the n leaves (n = number of genes or conditions). Maximizing adjacent similarity over all orderings is impractical, so order instead by average expression level, time of maximal induction, or chromosomal position (Eisen et al. 1998).

Properties of Hierarchical Clustering Top-down approach is more favorable when only a few clusters are desired. Single linkage tends to produce long chains of clusters. Complete linkage tends to produce compact clusters.

Partitioning clustering vs hierarchical clustering (Figure sequence: cutting the same dendrogram at successively higher levels yields k = 4, k = 3, and k = 2 partitions.)

Self-organizing map (SOM) Imposes partial structure on the clusters (in contrast to the rigid structure of hierarchical clustering, the strong prior hypotheses used in Bayesian clustering, and the non-structure of k-means clustering), allowing easy visualization and interpretation.

SOM Algorithm Initialize prototypes m_j on a lattice of p × q nodes; each prototype is a weight vector with the same dimension as the input data. Iteration: for each observation x_i, find the closest prototype m_j, and move m_j and all its lattice neighbors m_k toward x_i: m_k ← m_k + α (x_i − m_k). During the iterations, gradually reduce the learning rate α and the neighborhood size r. Many iterations may be needed before convergence.
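
A minimal SOM sketch following this update rule; the lattice size, learning-rate schedule, and neighborhood schedule are illustrative assumptions, not the lecture's settings:

```python
import numpy as np

def som(X, p=5, q=5, n_iter=2000, alpha0=0.5, r0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    grid = np.array([(i, j) for i in range(p) for j in range(q)])  # lattice coords
    M = rng.standard_normal((p * q, X.shape[1]))   # prototypes m_j
    for t in range(n_iter):
        alpha = alpha0 * (1 - t / n_iter)          # shrinking learning rate
        r = max(r0 * (1 - t / n_iter), 0.5)        # shrinking neighborhood size
        x = X[rng.integers(len(X))]                # one observation x_i
        j = np.argmin(np.sum((M - x) ** 2, axis=1))           # closest prototype
        near = np.sum((grid - grid[j]) ** 2, axis=1) <= r**2  # neighbors m_k
        M[near] += alpha * (x - M[near])           # m_k <- m_k + alpha (x_i - m_k)
    return M, grid
```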

(Hastie 2001)

SOM clustering of periodic genes

Applications to microarray data

With only a few nodes, one tends not to see distinct patterns, and there is large within-cluster scatter. As nodes are added, distinctive and tight clusters emerge. SOM is an "incremental learning" algorithm involving case-by-case presentation rather than batch presentation. As with all exploratory data analysis tools, the use of SOMs involves inspection of the data to extract insights.

Other Clustering Methods Gene Shaving MDS Affinity Propagation Spectral Clustering Two-way clustering …

"Algorithms for unsupervised classification or cluster analysis abound. Unfortunately however, algorithm development seems to be a preferred activity to algorithm evaluation among methodologists. … No consensus or clear guidelines exist to guide these decisions. Cluster analysis always produces clustering, but whether a pattern observed in the sample data characterizes a pattern present in the population remains an open question. Resampling-based methods can address this last point, but results indicate that most clusterings in microarray data sets are unlikely to reflect reproducible patterns or patterns in the overall population." (Allison et al. 2006)

Stability of a cluster Motivation: real clusters should be reproducible under perturbation (adding noise, omission of data, etc.). Procedure: perturb the observed data by adding noise; apply the clustering procedure to the perturbed data; repeat to generate a sample of clusterings. Tests: a global test, and cluster-specific tests (the R-index and D-index). (McShane et al. 2002)
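
A sketch of the perturbation step (the Gaussian noise level is a tuning choice, not a prescribed value); each repetition reclusters the perturbed data and records the labeling for the stability indices below:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def perturbed_clusterings(X, k, n_rep=100, noise_sd=None, seed=0):
    rng = np.random.default_rng(seed)
    if noise_sd is None:
        noise_sd = 0.1 * X.std()      # perturbation size: a tuning choice
    samples = []
    for _ in range(n_rep):
        X_pert = X + rng.normal(0.0, noise_sd, size=X.shape)
        Z = linkage(pdist(X_pert), method="average")
        samples.append(fcluster(Z, t=k, criterion="maxclust"))
    return samples                    # one perturbed labeling per repetition
```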

Global test Null hypothesis: the data come from a single multivariate Gaussian distribution. Procedure: consider the subspace spanned by the top principal components; estimate the distribution of "nearest neighbor" distances there; compare the observed distances with those of simulated Gaussian data.
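
A heavily hedged sketch of such a test; the details (number of components, use of the mean nearest-neighbor distance, the one-sided p-value) are my simplifications, not necessarily McShane et al.'s exact procedure:

```python
import numpy as np

def nn_distances(Y):
    # Distance from each point to its nearest neighbor.
    D = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    return D.min(axis=1)

def global_test(X, n_pc=3, n_sim=1000, seed=0):
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Y = Xc @ Vt[:n_pc].T                       # top principal components
    obs = nn_distances(Y).mean()
    sims = np.empty(n_sim)
    for s in range(n_sim):
        # Simulated Gaussian data with the same covariance as Y.
        G = rng.multivariate_normal(Y.mean(axis=0), np.cov(Y.T), size=len(Y))
        sims[s] = nn_distances(G).mean()
    # Clustering structure pulls points together, shrinking nearest-neighbor
    # distances, so small observed distances argue against the Gaussian null.
    return np.mean(sims <= obs)                # one-sided p-value
```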

R-index If cluster i contains n_i objects, then it contains m_i = n_i(n_i − 1)/2 pairs. Let c_i be the number of those pairs that fall in the same cluster when the perturbed data are re-clustered. Then r_i = c_i/m_i measures the robustness of cluster i, and the R-index = Σ_i c_i / Σ_i m_i measures the overall stability of the clustering.
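
A direct sketch of these definitions, given the original labels and one perturbed relabeling (e.g., from perturbed_clusterings() above):

```python
import numpy as np
from itertools import combinations

def r_index(orig_labels, pert_labels):
    orig = np.asarray(orig_labels)
    pert = np.asarray(pert_labels)
    total_c = total_m = 0
    for g in np.unique(orig):
        members = np.where(orig == g)[0]
        pairs = list(combinations(members, 2))
        m_i = len(pairs)                                 # n_i (n_i - 1) / 2
        c_i = sum(pert[a] == pert[b] for a, b in pairs)  # pairs still together
        total_c += c_i                                   # r_i would be c_i / m_i
        total_m += m_i
    return total_c / total_m                             # R = sum c_i / sum m_i
```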

D-index For each cluster in the original data, determine the closest cluster in the perturbed data, and calculate the discrepancy between the two (elements omitted vs. elements added). The D-index is the sum of the cluster-specific discrepancies.
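
A hedged sketch in the spirit of this definition; matching each original cluster to the perturbed cluster with maximal overlap and counting omitted plus added members is my simplification, not necessarily McShane et al.'s exact formula:

```python
import numpy as np

def d_index(orig_labels, pert_labels):
    orig = np.asarray(orig_labels)
    pert = np.asarray(pert_labels)
    total = 0
    for g in np.unique(orig):
        members = set(np.where(orig == g)[0])
        # "Closest" perturbed cluster: the one sharing the most members.
        best = max((set(np.where(pert == h)[0]) for h in np.unique(pert)),
                   key=lambda s: len(s & members))
        omitted = len(members - best)   # lost from the original cluster
        added = len(best - members)     # gained by the matched cluster
        total += omitted + added        # cluster-specific discrepancy
    return total
```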

Applications 16 prostate cancer samples and 9 benign tumors; 6,500 genes. Hierarchical clustering was used to obtain 2, 3, and 4 clusters. Question: are these clusters reliable?

Issues with calculating the R- and D-indices How large should the perturbation be? How should the significance level be quantified? What about nested consistency?

Acknowledgment Slide sources from Cheng Li.