Computational Biology, Part 12 Expression array cluster analysis Robert F. Murphy, Shann-Ching Chen Copyright  2004-2009. All rights reserved.

Slides:



Advertisements
Similar presentations
Clustering Clustering of data is a method by which large sets of data is grouped into clusters of smaller sets of similar data. The example below demonstrates.
Advertisements

K-means Clustering Given a data point v and a set of points X,
Clustering.
PARTITIONAL CLUSTERING
Introduction to Bioinformatics
Clustering short time series gene expression data Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph BIOINFORMATICS, vol
SocalBSI 2008: Clustering Microarray Datasets Sagar Damle, Ph.D. Candidate, Caltech  Distance Metrics: Measuring similarity using the Euclidean and Correlation.
Introduction to Bioinformatics Algorithms Clustering.
University of CreteCS4831 The use of Minimum Spanning Trees in microarray expression data Gkirtzou Ekaterini.
L16: Micro-array analysis Dimension reduction Unsupervised clustering.
Multivariate Distance and Similarity Robert F. Murphy Cytometry Development Workshop 2000.
Clustering Petter Mostad. Clustering vs. class prediction Class prediction: Class prediction: A learning set of objects with known classes A learning.
Introduction to Bioinformatics Algorithms Clustering.
CSE182-L17 Clustering Population Genetics: Basics.
Introduction to Bioinformatics - Tutorial no. 12
What is Cluster Analysis?
Gene Expression 1. Methods –Unsupervised Clustering Hierarchical clustering K-means clustering Expression data –GEO –UCSC EPCLUST 2.
Cluster Analysis for Gene Expression Data Ka Yee Yeung Center for Expression Arrays Department of Microbiology.
Fuzzy K means.
1 Cluster Analysis EPP 245 Statistical Analysis of Laboratory Data.
Tutorial 8 Clustering 1. General Methods –Unsupervised Clustering Hierarchical clustering K-means clustering Expression data –GEO –UCSC –ArrayExpress.
Ulf Schmitz, Pattern recognition - Clustering1 Bioinformatics Pattern recognition - Clustering Ulf Schmitz
Comp602 Bioinformatics Algorithms -m werner 2011
Introduction to Bioinformatics Algorithms Clustering and Microarray Analysis.
B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical.
Microarray Gene Expression Data Analysis A.Venkatesh CBBL Functional Genomics Chapter: 07.
Clustering Unsupervised learning Generating “classes”
BIONFORMATIC ALGORITHMS Ryan Tinsley Brandon Lile May 9th, 2014.
Gene expression & Clustering (Chapter 10)
Summarized by Soo-Jin Kim
An Overview of Clustering Methods With Applications to Bioinformatics Sorin Istrail Informatics Research Celera Genomics.
ArrayCluster: an analytic tool for clustering, data visualization and module finder on gene expression profiles 組員:李祥豪 謝紹陽 江建霖.
Lecture 11. Microarray and RNA-seq II
Potential Data Mining Techniques for Flow Cyt Data Analysis Li Xiong.
Microarrays.
Clustering What is clustering? Also called “unsupervised learning”Also called “unsupervised learning”
Intel Confidential – Internal Only Co-clustering of biological networks and gene expression data Hanisch et al. This paper appears in: bioinformatics 2002.
Dimension reduction : PCA and Clustering Slides by Agnieszka Juncker and Chris Workman modified by Hanne Jarmer.
Microarray Data Analysis (Lecture for CS498-CXZ Algorithms in Bioinformatics) Oct 13, 2005 ChengXiang Zhai Department of Computer Science University of.
Quantitative analysis of 2D gels Generalities. Applications Mutant / wild type Physiological conditions Tissue specific expression Disease / normal state.
CSE 589 Part VI. Reading Skiena, Sections 5.5 and 6.8 CLR, chapter 37.
Clustering Gene Expression Data BMI/CS 576 Colin Dewey Fall 2010.
Clustering.
Course Work Project Project title “Data Analysis Methods for Microarray Based Gene Expression Analysis” Sushil Kumar Singh (batch ) IBAB, Bangalore.
By Timofey Shulepov Clustering Algorithms. Clustering - main features  Clustering – a data mining technique  Def.: Classification of objects into sets.
Lecture 3 1.Different centrality measures of nodes 2.Hierarchical Clustering 3.Line graphs.
CZ5225: Modeling and Simulation in Biology Lecture 3: Clustering Analysis for Microarray Data I Prof. Chen Yu Zong Tel:
Gene expression & Clustering. Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species –Dynamic.
CS 8751 ML & KDDData Clustering1 Clustering Unsupervised learning Generating “classes” Distance/similarity measures Agglomerative methods Divisive methods.
Radial Basis Function ANN, an alternative to back propagation, uses clustering of examples in the training set.
Computational Biology Clustering Parts taken from Introduction to Data Mining by Tan, Steinbach, Kumar Lecture Slides Week 9.
Analyzing Expression Data: Clustering and Stats Chapter 16.
Université d’Ottawa / University of Ottawa 2001 Bio 8100s Applied Multivariate Biostatistics L10.1 Lecture 10: Cluster analysis l Uses of cluster analysis.
CSE182 L14 Mass Spec Quantitation MS applications Microarray analysis.
1 Microarray Clustering. 2 Outline Microarrays Hierarchical Clustering K-Means Clustering Corrupted Cliques Problem CAST Clustering Algorithm.
Statistical Analysis for Expression Experiments Heather Adams BeeSpace Doctoral Forum Thursday May 21, 2009.
CZ5211 Topics in Computational Biology Lecture 4: Clustering Analysis for Microarray Data II Prof. Chen Yu Zong Tel:
Example Apply hierarchical clustering with d min to below data where c=3. Nearest neighbor clustering d min d max will form elongated clusters!
Clustering Approaches Ka-Lok Ng Department of Bioinformatics Asia University.
Clustering [Idea only, Chapter 10.1, 10.2, 10.4].
Computational Biology
Unsupervised Learning
PREDICT 422: Practical Machine Learning
Machine Learning Clustering: K-means Supervised Learning
Clustering BE203: Functional Genomics Spring 2011 Vineet Bafna and Trey Ideker Trey Ideker Acknowledgements: Jones and Pevzner, An Introduction to Bioinformatics.
Clustering.
LECTURE 21: CLUSTERING Objectives: Mixture Densities Maximum Likelihood Estimates Application to Gaussian Mixture Models k-Means Clustering Fuzzy k-Means.
Text Categorization Berlin Chen 2003 Reference:
Clustering.
Unsupervised Learning
Presentation transcript:

Computational Biology, Part 12 Expression array cluster analysis Robert F. Murphy, Shann-Ching Chen Copyright  All rights reserved.

Gene Expression Microarray A popular method to since reported by Pat Brown Laboratory in 1995 and by Affymetrix in 1996 A popular method to detect mRNA expression level since reported by Pat Brown Laboratory in 1995 and by Affymetrix in 1996 Different technologies for producing the microarray chips and different approaches for analyzing microarray data Different technologies for producing the microarray chips and different approaches for analyzing microarray data Should carefully process and analyze the data Should carefully process and analyze the data

What is gene expression and why is it important?

Gene Expression in different organs A highly specific process in which a gene is switched on, and therefore begins production of its protein. A highly specific process in which a gene is switched on, and therefore begins production of its protein. Sources: image from the Cancer Genome Anatomy Project (CGAP), Conceptual Tour, July 21,

Gene Expression in single organ Gene expression also varies within a certain type of cell at different points in time. Gene expression also varies within a certain type of cell at different points in time. Sources: image from the Cancer Genome Anatomy Project (CGAP), Conceptual Tour, July 21,

Microarray raw data Label mRNA from one sample with a red fluorescence probe (Cy5) and mRNA from another sample with a green fluorescence probe (Cy3) Label mRNA from one sample with a red fluorescence probe (Cy5) and mRNA from another sample with a green fluorescence probe (Cy3) Hybridize to a chip with specific DNAs fixed to each well Hybridize to a chip with specific DNAs fixed to each well Measure amounts of green and red fluorescence Measure amounts of green and red fluorescence Flash animations: PCR Microarray

Example microarray image Microarray is a technology to globally (simultaneously detecting thousands of genes) detect mRNA expression level.

mRNA expression microarray data for 9800 genes (gene number shown vertically) for 0 to 24 h (time shown horizontally) after addition of serum to a human cell line that had been deprived of serum (from

Data extraction Adjust fluorescent intensities using standards (as necessary) Adjust fluorescent intensities using standards (as necessary) Calculate ratio of red to green fluorescence Calculate ratio of red to green fluorescence Convert to log 2 and round to integer Convert to log 2 and round to integer Display saturated green=-3 to black = 0 to saturated red = +3 Display saturated green=-3 to black = 0 to saturated red = +3

Many different types: Hierarchical clustering k – means clustering Self-organising maps Hill Climbing Simulated Annealing All have the same three basic tasks of: 1.Pattern representation – patterns or features in the data. 2.Pattern proximity – a measure of the distance or similarity defined on pairs of patterns 3.Pattern grouping – methods and rules used in grouping the patterns Unsupervised clustering algorithms

Distances High dimensionality High dimensionality Based on vector geometry – how close are two data points? Based on vector geometry – how close are two data points? Array2 Array 1 Array 2 Gene 114 …

Distances High dimensionality High dimensionality Based on vector geometry – how close are two data points? Based on vector geometry – how close are two data points? Array2 Array 1 Array 2 Gene 114 Gene 213 … Gene 1 Gene 2 Distance(Gene 1, Gene 2) = 1

Distances High dimensionality High dimensionality Based on vector geometry – how close are two data points? Based on vector geometry – how close are two data points? Based on distances to determine clusters Based on distances to determine clusters Array2 Array 1 Array 2 Gene 114 Gene 213 … Gene 1 Gene 2 Distance(Gene 1, Gene 2) = 1

a1 a2b2 b1 Distance Sample 2 Sample 1 Gene a Gene b sample 1 sample 2 a1a2 b1b2 Bivariate Euclidean Distance

General Multivariate Dataset We are given values of p variables for n independent observations We are given values of p variables for n independent observations Construct an n x p matrix M consisting of vectors X 1 through X n each of length p Construct an n x p matrix M consisting of vectors X 1 through X n each of length p

Multivariate Sample Mean Define mean vector I of length p Define mean vector I of length p or matrix notation vector notation

Multivariate Variance Define variance vector   of length p Define variance vector   of length p matrix notation

Multivariate Variance or or vector notation

Covariance Matrix Define a p x p matrix cov (called the covariance matrix) analogous to  2 Define a p x p matrix cov (called the covariance matrix) analogous to  2

Covariance Matrix Note that the covariance of a variable with itself is simply the variance of that variable Note that the covariance of a variable with itself is simply the variance of that variable

Univariate Distance The simple distance between the values of a single variable j for two observations i and l is The simple distance between the values of a single variable j for two observations i and l is

Univariate z-score Distance To measure distance in units of standard deviation between the values of a single variable j for two observations i and l we define the z-score distance To measure distance in units of standard deviation between the values of a single variable j for two observations i and l we define the z-score distance

Bivariate Euclidean Distance The most commonly used measure of distance between two observations i and l on two variables j and k is the Euclidean distance The most commonly used measure of distance between two observations i and l on two variables j and k is the Euclidean distance M(i,j) k variable j variable i observation l observation M(l,j) M(i,k)M(l,k)

Multivariate Euclidean Distance This can be extended to more than two variables This can be extended to more than two variables

Effects of variance and covariance on Euclidean distance Points A and B have similar Euclidean distances from the mean, but point B is clearly “more different” from the population than point A. B A The ellipse shows the 50% contour of a hypothetical population.

Mahalanobis Distance To account for differences in variance between the variables, and to account for correlations between variables, we use the Mahalanobis distance To account for differences in variance between the variables, and to account for correlations between variables, we use the Mahalanobis distance

Other distance functions We can use other distance functions, including ones in which the weights on each variable are learned We can use other distance functions, including ones in which the weights on each variable are learned Cluster analysis tools for microarray data most commonly use Pearson correlation coefficient Cluster analysis tools for microarray data most commonly use Pearson correlation coefficient

Pearson correlation coefficient

Software for performing microarray cluster analysis Michael Eisen’s Cluster (Windows only) Michael Eisen’s Cluster (Windows only) M. de Hoon’s Cluster 3.0 (all OS) M. de Hoon’s Cluster 3.0 (all OS) Tree viewing (links on same site) Tree viewing (links on same site)  Java Treeview  Mapletree

Input data for clustering Genes in rows, conditions in columns Genes in rows, conditions in columns

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Clustering of Microarray Data (cont’d) Clusters

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Homogeneity and Separation Principles Homogeneity: Elements within a cluster are close to each other Separation: Elements in different clusters are further apart from each other …clustering is not an easy task! Given these points a clustering algorithm might make two distinct clusters as follows

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Bad Clustering This clustering violates both Homogeneity and Separation principles Close distances from points in separate clusters Far distances from points in the same cluster

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Good Clustering This clustering satisfies both Homogeneity and Separation principles

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Clustering Techniques Agglomerative: Start with every element in its own cluster, and iteratively join clusters together Divisive: Start with one cluster and iteratively divide it into smaller clusters Hierarchical: Organize elements into a tree, leaves represent genes and the length of the pathes between leaves represents the distances between genes. Similar genes lie within the same subtrees

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Hierarchical Clustering

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Hierarchical Clustering: Example

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Hierarchical Clustering: Example

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Hierarchical Clustering: Example

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Hierarchical Clustering: Example

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Hierarchical Clustering: Example

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Hierarchical Clustering (cont’d) Hierarchical Clustering is often used to reveal evolutionary history

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Hierarchical Clustering Algorithm 1.Hierarchical Clustering (d, n) 2. Form n clusters each with one element 3. Construct a graph T by assigning one vertex to each cluster 4. while there is more than one cluster 5. Find the two closest clusters C 1 and C 2 6. Merge C 1 and C 2 into new cluster C with |C 1 | +|C 2 | elements 7. Compute distance from C to all other clusters 8. Add a new vertex C to T and connect to vertices C 1 and C 2 9. Remove rows and columns of d corresponding to C 1 and C Add a row and column to d corrsponding to the new cluster C 11. return T The algorithm takes a n x n distance matrix d of pairwise distances between points as an input.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Hierarchical Clustering Algorithm 1.Hierarchical Clustering (d, n) 2. Form n clusters each with one element 3. Construct a graph T by assigning one vertex to each cluster 4. while there is more than one cluster 5. Find the two closest clusters C 1 and C 2 6. Merge C 1 and C 2 into new cluster C with |C 1 | +|C 2 | elements 7. Compute distance from C to all other clusters 8. Add a new vertex C to T and connect to vertices C 1 and C 2 9. Remove rows and columns of d corresponding to C 1 and C Add a row and column to d corrsponding to the new cluster C 11. return T Different ways to define distances between clusters may lead to different clusterings

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Hierarchical Clustering: Recomputing Distances d min (C, C * ) = min d(x,y) for all elements x in C and y in C * Distance between two clusters is the smallest distance between any pair of their elements d avg (C, C * ) = (1 / |C * ||C|) ∑ d(x,y) for all elements x in C and y in C * Distance between two clusters is the average distance between all pairs of their elements

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Squared Error Distortion Given a data point v and a set of points X, define the distance from v to X d(v, X) as the (Eucledian) distance from v to the closest point from X. Given a set of n data points V={v 1 …v n } and a set of k points X, define the Squared Error Distortion d(V,X) = ∑d(v i, X) 2 / n 1 < i < n

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info K-Means Clustering Problem: Formulation Input: A set, V, consisting of n points and a parameter k Output: A set X consisting of k points (cluster centers) that minimizes the squared error distortion d(V,X) over all possible choices of X

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info 1-Means Clustering Problem: an Easy Case Input: A set, V, consisting of n points Output: A single points x (cluster center) that minimizes the squared error distortion d(V,x) over all possible choices of x

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info 1-Means Clustering Problem: an Easy Case Input: A set, V, consisting of n points Output: A single points x (cluster center) that minimizes the squared error distortion d(V,x) over all possible choices of x 1-Means Clustering problem is easy. However, it becomes very difficult (NP-complete) for more than one center. An efficient heuristic method for K-Means clustering is the Lloyd algorithm

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info K-Means Clustering: Lloyd Algorithm 1.Lloyd Algorithm 2. Arbitrarily assign the k cluster centers 3. while the cluster centers keep changing 4. Assign each data point to the cluster C i corresponding to the closest cluster representative (center) (1 ≤ i ≤ k) 5. After the assignment of all data points, compute new cluster representatives according to the center of gravity of each cluster, that is, the new cluster representative is ∑v \ |C| for all v in C for every cluster C *This may lead to merely a locally optimal clustering.

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info x1x1 x2x2 x3x3

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info x1x1 x2x2 x3x3

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info x1x1 x2x2 x3x3

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info x1x1 x2x2 x3x3

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Clique Graphs A clique is a graph with every vertex connected to every other vertex A clique graph is a graph where each connected component is a clique

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Transforming an Arbitrary Graph into a Clique Graphs A graph can be transformed into a clique graph by adding or removing edges

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Clique Graphs (cont’d) – REVISION –show yet another way of transformation and compare the costs. A graph can be transformed into a clique graph by adding or removing edges Example: removing two edges to make a clique graph

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Corrupted Cliques Problem Input: A graph G Output: The smallest number of additions and removals of edges that will transform G into a clique graph

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Distance Graphs Turn the distance matrix into a distance graph Genes are represented as vertices in the graph Choose a distance threshold θ If the distance between two vertices is below θ, draw an edge between them The resulting graph may contain cliques These cliques represent clusters of closely located data points!

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Transforming Distance Graph into Clique Graph The distance graph (threshold θ=7) is transformed into a clique graph after removing the two highlighted edges After transforming the distance graph into the clique graph, the dataset is partitioned into three clusters

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info Heuristics for Corrupted Clique Problem Corrupted Cliques problem is NP-Hard, some heuristics exist to approximately solve it: CAST (Cluster Affinity Search Technique): a practical and fast algorithm: CAST is based on the notion of genes close to cluster C or distant from cluster C Distance between gene i and cluster C: d(i,C) = average distance between gene i and all genes in C Gene i is close to cluster C if d(i,C)< θ and distant otherwise

An Introduction to Bioinformatics Algorithmswww.bioalgorithms.info CAST Algorithm 1.CAST(S, G, θ) 2. P  Ø 3. while S ≠ Ø 4. V  vertex of maximal degree in the distance graph G 5. C  {v} 6. while a close gene i not in C or distant gene i in C exists 7. Find the nearest close gene i not in C and add it to C 8. Remove the farthest distant gene i in C 9. Add cluster C to partition P 10. S  S \ C 11. Remove vertices of cluster C from the distance graph G 12. return P S – set of elements, G – distance graph, θ - distance threshold

Choosing the number of Centers A difficult problem A difficult problem Most common approach is to try to find the solution that minimizes the Bayesian Information Criterion Most common approach is to try to find the solution that minimizes the Bayesian Information Criterion L = the likelihood function for the estimated model K = # of parameters n = # of samples

Clustering genes and conditions Rows and columns can be clustered independently - hierarchical is preferred for visualizing this Rows and columns can be clustered independently - hierarchical is preferred for visualizing this