SocalBSI 2008: Clustering Microarray Datasets
Sagar Damle, Ph.D. Candidate, Caltech


SocalBSI 2008: Clustering Microarray Datasets
Sagar Damle, Ph.D. Candidate, Caltech
- Distance Metrics: measuring similarity using the Euclidean and correlation distance metrics
- Principal Component Analysis: reducing the dimensionality of microarray data
- Clustering Algorithms:
  - K-means
  - Self-Organizing Maps (SOM)
  - Hierarchical clustering

The expression dataset is an m x n matrix: rows are genes, columns are conditions (timepoints or tissues).
- The first gene vector (first row) = (x_11, x_12, x_13, ..., x_1n)
- The leftmost condition vector (first column) = (x_11, x_21, x_31, ..., x_m1)
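To make the layout concrete, here is a minimal sketch (not from the original slides) using a small, hypothetical numpy array with rows as genes and columns as conditions:

```python
import numpy as np

# Toy expression matrix: m genes (rows) x n conditions (columns).
# Values are hypothetical log-ratios, just to illustrate the layout.
X = np.array([
    [ 0.2, 1.5,  2.1, 0.3],   # gene 1
    [-0.1, 1.4,  2.0, 0.1],   # gene 2
    [ 1.8, 0.2, -0.5, 1.9],   # gene 3
])

gene_vector = X[0, :]        # first gene across all conditions: (x_11, x_12, ..., x_1n)
condition_vector = X[:, 0]   # leftmost condition across all genes: (x_11, x_21, ..., x_m1)
print(gene_vector, condition_vector)
```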

Similarity measures
- Clustering identifies groups of genes with "similar" expression profiles.
- How is similarity measured?
  - Euclidean distance
  - Correlation coefficient
  - Others: Manhattan, Chebyshev, squared Euclidean

In an experiment with 10 conditions, the gene expression profiles for two genes X and Y have the form:
X = (x_1, x_2, x_3, ..., x_10)
Y = (y_1, y_2, y_3, ..., y_10)

Similarity measure – Euclidean distance
For two genes Ga = (x_1, x_2) and Gb = (y_1, y_2) measured in two conditions:
d(Ga, Gb) = sqrt( (x_1 - y_1)^2 + (x_2 - y_2)^2 )
In general, if there are m experiments:
X = (x_1, x_2, x_3, ..., x_m), Y = (y_1, y_2, y_3, ..., y_m)
d(X, Y) = sqrt( (x_1 - y_1)^2 + (x_2 - y_2)^2 + ... + (x_m - y_m)^2 )
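A minimal sketch of the Euclidean distance between two expression profiles, assuming numpy and two hypothetical 10-condition vectors:

```python
import numpy as np

def euclidean_distance(x, y):
    """Euclidean distance between two expression profiles of length m."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# Two hypothetical 10-condition profiles
x = np.array([0.1, 0.5, 1.2, 2.0, 1.8, 1.1, 0.4, 0.2, 0.1, 0.0])
y = np.array([0.2, 0.6, 1.0, 1.9, 1.7, 1.0, 0.5, 0.1, 0.2, 0.1])
print(euclidean_distance(x, y))
```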

Similarity measure – Pearson correlation coefficient
X = (x_1, x_2, x_3, ..., x_m), Y = (y_1, y_2, y_3, ..., y_m)
Correlation distance: D = 1 - r, where r is the Pearson correlation coefficient
r = (1/m) Z(X) . Z(Y)  (the dot product of the z-scores of vectors X and Y, scaled by the number of conditions)
Dot product review: A . B = |A| |B| cos(theta), so r is the cosine of the angle between the centered, scaled expression profiles.
- When two profiles are completely correlated, r = 1 and D = 0.
- When two profiles are uncorrelated, r = 0 and D = 1.
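A corresponding sketch of the correlation distance D = 1 - r, computed from z-scores as described above; the profiles and values are hypothetical:

```python
import numpy as np

def correlation_distance(x, y):
    """D = 1 - r, where r is the Pearson correlation of the two profiles."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    zx = (x - x.mean()) / x.std()      # z-scores of x
    zy = (y - y.mean()) / y.std()      # z-scores of y
    r = np.mean(zx * zy)               # Pearson correlation coefficient
    return 1.0 - r

# A profile and a scaled-up copy: different magnitude, same trend -> D close to 0
x = np.array([0.1, 0.5, 1.2, 2.0, 1.8, 1.1, 0.4, 0.2])
print(correlation_distance(x, 3 * x))      # ~0.0
print(correlation_distance(x, x[::-1]))    # reversed trend -> larger D
```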

Euclidean vs. Pearson correlation
- Euclidean distance takes into account the magnitude of expression.
- Correlation distance is insensitive to the amplitude of expression; it captures the trend of the change.
- Common trends are considered biologically relevant; the magnitude is considered less important.

[Figure: what correlation distance sees vs. what Euclidean distance sees]

Principal Component Analysis (PCA)
- A method for projecting microarray data onto a reduced (2- or 3-dimensional), easily visualized space.
- Definition: principal components are a set of variables that define a projection capturing the maximum amount of variation in a dataset, with each component orthogonal (and therefore uncorrelated) to the previous principal components of the same dataset.
- Example dataset: thousands of genes probed in 10 conditions.
  - The expression profile of each gene is represented by the vector of its expression levels: X = (x_1, x_2, x_3, ..., x_10).
  - Imagine each gene X as a point in a 10-dimensional space.
  - Each direction/axis corresponds to a specific condition.
  - Genes with similar profiles are close to each other in this space.
- PCA: project this dataset to 2 dimensions, preserving as much information as possible.
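A sketch of the projection step, assuming scikit-learn's PCA and a randomly generated matrix standing in for real expression data:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical dataset: 1000 genes x 10 conditions (random here, just for shape)
rng = np.random.default_rng(0)
expression = rng.normal(size=(1000, 10))

# Project each gene (a point in 10-dimensional condition space) onto the
# first two principal components for visualization.
pca = PCA(n_components=2)
coords = pca.fit_transform(expression)        # shape (1000, 2)
print(pca.explained_variance_ratio_)          # fraction of variance each PC captures
```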

[Figure: PCA transformation of a microarray dataset; visual estimation of the number of clusters in the data]

1-page tutorial on singular value decomposition (PCA)

Cluster analysis
- Function: places genes with similar expression patterns in groups.
  - Sometimes genes of unknown function will be grouped with genes of known function.
  - The functions that are known allow the investigator to hypothesize about the functions of genes not yet characterized.
- Examples:
  - Identify genes important in cell cycle regulation
  - Identify genes that participate in a biosynthetic pathway
  - Identify genes involved in a drug response
  - Identify genes involved in a disease response

[Figure: clustering of a yeast cell-cycle dataset vs. gene-tree ordering]

How to choose the number of clusters needed to informatively partition the data
Trial and error: try clustering with different numbers of clusters and compare the results.
- Criteria for comparison: homogeneity vs. separation
- Use PCA (Principal Component Analysis) to visually determine how well the algorithm grouped the genes.
- Calculate the mean distance between all genes within a cluster (it should be small) and compare it to the distance between clusters (which should be large).

Mathematical evaluation of a clustering solution
Merits of a 'good' clustering solution:
- Homogeneity:
  - Genes inside a cluster are highly similar to each other.
  - Measured as the average similarity between a gene and the center (average profile) of its cluster.
- Separation:
  - Genes from different clusters have low similarity to each other.
  - Measured as the weighted average similarity between cluster centers.
- These are conflicting criteria: increasing the number of clusters tends to improve within-cluster homogeneity at the expense of between-cluster separation.
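A minimal sketch of how homogeneity and separation could be computed; here Euclidean distances (rather than similarities) are used, so smaller homogeneity and larger separation are better. The function and variable names are illustrative, not from the original slides:

```python
import numpy as np

def homogeneity_and_separation(X, labels):
    """Homogeneity: mean distance of each gene to its own cluster centre (smaller is better).
    Separation: size-weighted mean distance between cluster centres (larger is better)."""
    clusters = np.unique(labels)
    idx = {c: k for k, c in enumerate(clusters)}
    centres = np.array([X[labels == c].mean(axis=0) for c in clusters])
    sizes = np.array([(labels == c).sum() for c in clusters])

    homogeneity = np.mean([np.linalg.norm(x - centres[idx[c]]) for x, c in zip(X, labels)])

    pair_dists, pair_weights = [], []
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            pair_dists.append(np.linalg.norm(centres[i] - centres[j]))
            pair_weights.append(sizes[i] * sizes[j])
    separation = np.average(pair_dists, weights=pair_weights)
    return homogeneity, separation

# Usage with a random stand-in dataset and random labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
labels = rng.integers(0, 3, size=200)
print(homogeneity_and_separation(X, labels))
```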

Performance on yeast cell cycle data
[Table: homogeneity and separation scores for the "true" partition and for the CAST*, GeneCluster, K-means, and CLICK solutions; numeric values not recovered]
Dataset: genes measured across 72 conditions (Spellman et al. 1998). Each algorithm was run by its authors in a "blind" test. *CAST: Ben-Dor, Shamir, Yakhini.

Clustering algorithms
- K-means
- SOMs
- Hierarchical clustering

K-means
1. The user sets the number of clusters, k.
2. Initialization: each gene is randomly assigned to one of the k clusters.
3. The average expression vector is calculated for each cluster (the cluster's profile).
4. Iterate over the genes: for each gene, compute its similarity to the cluster profiles and move the gene to the cluster it is most similar to; recalculate the cluster profiles.
5. Score the current partition: the sum of distances between genes and the profile of the cluster they are assigned to (the homogeneity of the solution).
6. Stopping criterion: further shuffling of genes results in only a minor improvement in the clustering score.
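A short sketch of K-means in practice, using scikit-learn's standard batch implementation rather than the gene-by-gene shuffling described above; the expression matrix is random, just to show the calls:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
expression = rng.normal(size=(500, 10))      # hypothetical 500 genes x 10 conditions

k = 4                                        # number of clusters chosen by the user
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(expression)

labels = km.labels_                          # cluster assignment for each gene
profiles = km.cluster_centers_               # average expression vector of each cluster
score = km.inertia_                          # sum of squared gene-to-profile distances
print(labels[:10], score)
```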

[Figure: K-means example with 4 clusters (too many?); each panel shows the cluster's mean profile and the standard deviation in each condition]

[Figure: evaluating K-means; clusters 1-4 shown, with misclassified genes highlighted]

[Figure: K-means example with 3 clusters (looks right)]

[Figure: K-means clustering with K=2 (too few)]

SOMs (Self-Organizing Maps): less clustering, more data organizing
- The user sets the number of clusters in the form of a rectangular grid (e.g., 3x2) of 'map nodes'.
- Imagine genes as points in (M-dimensional) space.
- Initialization: map nodes are randomly placed in the data space.

[Figure legend: genes are data points; clusters are map nodes]

SOM scheme
- Randomly choose a data point (gene) and find its closest map node.
- Move this map node towards the data point.
- Move the neighboring map nodes towards this point as well, but to a lesser extent (thinner arrows in the figure show a weaker shift).
- Iterate over the data points.

- Each successive gene profile (black dot in the figure) has less influence on the displacement of the nodes.
- Iterate through all profiles several times (10-100 passes).
- When the positions of the map nodes have stabilized, assign each gene to its closest map node (cluster).
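A toy, from-scratch sketch of the SOM update rule described above (the winning node is pulled strongly, grid neighbors are pulled less); all parameters and names are illustrative:

```python
import numpy as np

def train_som(data, grid_shape=(3, 2), n_epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Toy SOM: nodes on a small grid are pulled toward randomly chosen profiles."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    n_nodes, dim = rows * cols, data.shape[1]
    nodes = rng.normal(size=(n_nodes, dim))                   # random initial node positions
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)])  # grid coordinates

    for epoch in range(n_epochs):
        lr = lr0 * (1 - epoch / n_epochs)                     # learning rate decays over time
        sigma = sigma0 * (1 - epoch / n_epochs) + 0.1         # neighbourhood shrinks over time
        for x in data[rng.permutation(len(data))]:
            winner = np.argmin(np.linalg.norm(nodes - x, axis=1))     # closest map node
            grid_dist = np.linalg.norm(grid - grid[winner], axis=1)   # distance on the grid
            influence = np.exp(-grid_dist**2 / (2 * sigma**2))        # neighbours move less
            nodes += lr * influence[:, None] * (x - nodes)

    # When node positions have stabilized, assign each gene to its closest node (cluster)
    labels = np.argmin(np.linalg.norm(data[:, None, :] - nodes[None, :, :], axis=2), axis=1)
    return nodes, labels

rng = np.random.default_rng(1)
profiles = rng.normal(size=(300, 10))        # hypothetical 300 genes x 10 conditions
nodes, labels = train_som(profiles)
print(np.bincount(labels))                   # how many genes fell into each map node
```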

Hierarchical clustering
- Goal #1: organize the genes into a hierarchical tree.
  1) Initial step: each gene is regarded as a cluster with one item.
  2) Find the 2 most similar clusters and merge them into a common node (red dot in the figure).
  3) Merge successive nodes until all genes are contained in a single cluster.
- Goal #2: collapse branches to group genes into distinct clusters.
[Figure: dendrogram over genes g1-g5, merging {1,2}, {4,5}, {1,2,3}, and finally {1,2,3,4,5}]
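A minimal sketch of hierarchical clustering with SciPy, using correlation distance and average linkage as one possible choice; the data are random placeholders:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
expression = rng.normal(size=(50, 10))       # hypothetical 50 genes x 10 conditions

# Goal 1: build the tree. 'correlation' distance = 1 - Pearson r; 'average' linkage
# merges the two most similar clusters at every step.
dists = pdist(expression, metric='correlation')
tree = linkage(dists, method='average')

# Goal 2: collapse branches into a fixed number of clusters.
labels = fcluster(tree, t=4, criterion='maxclust')
print(labels)
```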

Which genes to cluster?
- Apply filtering prior to clustering to focus the analysis on the 'responding genes'.
- Applying controlled statistical tests to identify 'responding genes' usually yields too few genes to allow a global characterization of the response.
- Variance filter: remove genes that do not vary greatly among the conditions of the experiment.
  - Non-varying genes skew clustering results, especially when using a correlation coefficient.
- Fold-change filter: choose genes that change by at least M-fold in at least L conditions.
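A minimal sketch of the variance and fold-change filters, assuming the matrix holds log2 ratios; the thresholds are arbitrary examples, not values from the original slides:

```python
import numpy as np

def filter_genes(X, min_std=0.5, min_fold=2.0, min_conditions=2):
    """Keep genes that vary across conditions (X is assumed to hold log2 ratios)."""
    varying = X.std(axis=1) >= min_std                        # variance filter
    changed = (np.abs(X) >= np.log2(min_fold)).sum(axis=1)    # conditions with >= M-fold change
    responding = varying & (changed >= min_conditions)        # fold-change filter
    return X[responding], responding

rng = np.random.default_rng(0)
X = rng.normal(scale=0.8, size=(1000, 10))                    # hypothetical log2-ratio matrix
filtered, mask = filter_genes(X)
print(filtered.shape)
```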

Clustering tools
- Cluster (Eisen) – hierarchical clustering
- GeneCluster (Tamayo) – SOM
- TIGR MeV – K-means, SOM, hierarchical, QTC, CAST
- Expander – CLICK, SOM, K-means, hierarchical
- Many others (e.g., GeneSpring)

Analysis strategy
1) Transform the dataset using PCA.
2) Cluster. Parameters to test: distance metric, number of clusters, separation & homogeneity.
3) Assign biological meaning to the clusters.

Original presentation created by Rani Elkon and posted at: /DNA_microarray_winter_2003.html