Model-based clustering of gene expression data
Ka Yee Yeung¹, Chris Fraley², Alejandro Murua³, Adrian E. Raftery², and Walter L. Ruzzo¹
¹ Department of Computer Science & Engineering, University of Washington
² Department of Statistics, University of Washington
³ Insightful Corporation
CAIMS 2001 UW CSE Computational Biology Group
Model-based clustering of gene expression data Walter L. Ruzzo University of Washington Computer Science and Engineering CAIMS 2001 UW CSE Computational Biology Group
3 Overview Overview of gene expression data Model-based clustering Data sets Results Summary and Conclusions
4 A gene expression data set
Each chip is one experiment:
–time course
–drug response
–tissue samples (normal/cancer)
–…
A snapshot of the activities of all genes in the cell
[Figure: an n genes × p experiments matrix with entries X_ij]
5 Analysis of expression data? Hard problem –High dimensionality –Noise & artifacts –Goals loosely defined, etc. Common exploratory analysis technique: Clustering
6 Why cluster?
–Find biologically related genes
–Tissue classification
How to cluster?
–Many algorithms
–Most heuristic-based
–No clear winner
7 Overview Overview of gene expression data Model-based clustering Data sets Results Summary and Conclusions
8 Model-based clustering
Mixture model:
–Each mixture component is generated by a specified underlying probability distribution
Gaussian mixture model:
–Each component k is modeled by the multivariate normal distribution with parameters:
Mean vector: μ_k
Covariance matrix: Σ_k
EM algorithm: used to estimate the parameters
9 Covariance models (Banfield & Raftery 1993)
Σ_k = λ_k D_k A_k D_k^T  (λ_k: volume, D_k: orientation, A_k: shape)
Equal volume spherical model (EI): Σ_k = λI  (~ k-means)
Unequal volume spherical (VI): Σ_k = λ_k I
Diagonal model: Σ_k = λ_k B_k, where B_k is diagonal with |B_k| = 1
EEE elliptical model: Σ_k = λDAD^T
Unconstrained model (VVV): Σ_k = λ_k D_k A_k D_k^T
More flexible and more descriptive, but more parameters
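Several of these covariance constraints can be tried directly in scikit-learn. This is a minimal sketch, not the authors' MBC/MCLUST software; the mapping from sklearn's `covariance_type` to the Banfield–Raftery models is only approximate (sklearn has no equal-volume spherical EI model, for instance), and the toy data here stands in for a real expression matrix.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two Gaussian clusters in 3 "experiments" (dimensions)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 3)),
    rng.normal(loc=4.0, scale=1.0, size=(100, 3)),
])

# Rough correspondence between sklearn covariance_type and the models above:
#   'spherical' ~ VI   (Sigma_k = lambda_k * I)
#   'diag'      ~ diagonal model
#   'tied'      ~ EEE  (one shared full covariance)
#   'full'      ~ VVV  (unconstrained Sigma_k)
for cov in ["spherical", "diag", "tied", "full"]:
    gm = GaussianMixture(n_components=2, covariance_type=cov,
                         random_state=0).fit(X)
    print(cov, round(gm.bic(X), 1))
```

Note that sklearn's `bic()` uses the "smaller is better" sign convention, the opposite of the "large BIC = strong evidence" convention used in these slides.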
10 Advantages of model-based clustering Higher quality clusters Flexible models Model selection - choose right model and right # of clusters –Bayesian Information Criterion (BIC): Approximate Bayes factor: posterior odds for one model against another model –A large BIC score indicates strong evidence for the corresponding model.
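The model-selection idea can be sketched as a loop over candidate numbers of clusters, keeping the BIC-optimal one. This is an illustration with scikit-learn on synthetic data, not the talk's actual MBC implementation; recall that sklearn's `bic()` is negated relative to the slides' convention, so here smaller is better.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Three well-separated spherical clusters in 2 dimensions
X = np.vstack([rng.normal(m, 0.5, size=(80, 2)) for m in (0.0, 5.0, 10.0)])

# sklearn's bic() is -2 log L + nu log n, so SMALLER is better here
scores = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
          for k in range(1, 7)}
best_k = min(scores, key=scores.get)
print(best_k)  # should recover the 3 planted clusters
```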
11 Overview Overview of gene expression data Model-based clustering Data sets –Gene expression data sets –Synthetic data sets Results Summary and Conclusions
12 Our Goal To show the model-based approach has superior performance with respect to: –Quality of clusters –Number of clusters and model chosen (BIC)
13 Our Approach
Validate the results from model-based clustering using data sets with external criteria (computing BIC scores does not require the external criteria)
To compare clusters with an external criterion:
–Adjusted Rand index (Hubert and Arabie 1985)
–Adjusted Rand index = 1 indicates perfect agreement
–Two random partitions have an expected index of 0
Compare the quality of clusters with a leading heuristic-based algorithm: CAST (Ben-Dor & Yakhini 1999)
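The two properties of the adjusted Rand index quoted above (1 for perfect agreement, expected 0 for random partitions, invariance to label permutation) can be checked directly with scikit-learn's implementation of the Hubert–Arabie index; the tiny partitions here are made up for illustration.

```python
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 0, 1, 1, 1, 2, 2]

# The same partition under a different labelling: the index is still 1
assert adjusted_rand_score(truth, [2, 2, 2, 0, 0, 0, 1, 1]) == 1.0

# An unrelated partition scores near 0 (the index can be slightly negative)
print(adjusted_rand_score(truth, [0, 1, 2, 0, 1, 2, 0, 1]))
```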
14 Gene expression data sets
Ovarian cancer data set (Michel Schummer, Institute of Systems Biology)
–Subset of data: 235 clones × 24 experiments (cancer/normal tissue samples)
–The 235 clones correspond to 4 genes
Yeast cell cycle data (Cho et al. 1998)
–17 time points
–Subset of 384 genes associated with 5 phases of the cell cycle
15 Synthetic data sets Both based on ovary data Gaussian mixture –Generate multivariate normal distributions with the sample covariance matrix and mean vector of each class in the ovary data Randomly resampled ovary data –For each class, randomly sample the expression levels in each experiment, independently –Near diagonal covariance matrix
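The two synthetic generators described above can be sketched as follows. This is a schematic reconstruction under stated assumptions, not the authors' code: the function names are hypothetical, and a random matrix stands in for the class-labelled ovary data.

```python
import numpy as np

rng = np.random.default_rng(2)

def gaussian_mixture_synthetic(X, labels):
    """Per class, draw from N(sample mean, sample covariance) of that class."""
    out = np.empty_like(X, dtype=float)
    for c in np.unique(labels):
        rows = labels == c
        mu = X[rows].mean(axis=0)
        cov = np.cov(X[rows], rowvar=False)
        out[rows] = rng.multivariate_normal(mu, cov, size=rows.sum())
    return out

def resampled_synthetic(X, labels):
    """Per class and per experiment (column), permute the expression levels
    independently -- yielding a near-diagonal within-class covariance."""
    out = X.astype(float).copy()
    for c in np.unique(labels):
        rows = np.flatnonzero(labels == c)
        for j in range(X.shape[1]):
            out[rows, j] = rng.permutation(out[rows, j])
    return out

# Toy stand-in for the ovary matrix (genes x experiments) with class labels
X = rng.normal(size=(60, 5))
labels = np.repeat([0, 1, 2], 20)
S1 = gaussian_mixture_synthetic(X, labels)
S2 = resampled_synthetic(X, labels)
```

Resampling each column independently within a class preserves the per-experiment marginals but destroys correlations between experiments, which is why the diagonal model is expected to fit this data best.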
16 Overview Overview of gene expression data Model-based clustering Data sets Results –Synthetic data sets –Expression data sets Summary and Conclusions
17 Results: Gaussian mixture based on ovary data (2350 genes)
At 4 clusters, VVV, the diagonal model, and CAST achieve high adjusted Rand indices.
BIC selects VVV at 4 clusters.
18 Results: randomly resampled ovary data Diagonal model achieves the max adjusted Rand and BIC score (higher than CAST) BIC max at 4 clusters Confirms expected result
19 Results: square root ovary data Max adjusted Rand at EEE 4 clusters (> CAST) BIC selects VI at 8 clusters BIC curve shows a bend at 4 clusters for EEE.
20 Results: log yeast cell cycle data CAST achieves much higher adjusted Rand indices than most model-based approaches (except EEE). BIC scores of EEE much higher than the other models.
21 Results: standardized yeast cell cycle data EI slightly higher adjusted Rand indices than CAST at 5 clusters. BIC selects EEE at 5 clusters.
22 Overview Overview of gene expression data Model-based clustering Data sets Results Summary and Conclusions
23 Summary and Conclusions
Synthetic data sets:
–With the correct model, model-based clustering performs better than a leading heuristic clustering algorithm
–BIC selects the right model & the right number of clusters
Real expression data sets:
–Adjusted Rand indices comparable to CAST
–BIC gives a hint about the number of clusters
Time course data:
–Standardization makes the classes spherical
24 Future Work
Custom refinements to the model-based implementation:
–Design models that incorporate specific information about the experiments, e.g. a block diagonal covariance matrix
–Missing data
–Outliers & noise
25 Acknowledgements
Ka Yee Yeung¹, Chris Fraley², Alejandro Murua³, Adrian E. Raftery²
¹ Department of Computer Science & Engineering, University of Washington
² Department of Statistics, University of Washington
³ Insightful Corporation
Michèl Schummer (Institute of Systems Biology)
–the ovary data set
Jeremy Tantrum
–help with MBC software (diagonal model)
26 More Info http://www.cs.washington.edu/homes/ruzzo CAIMS 2001 UW CSE Computational Biology Group
27 Model-based clustering
Mixture model:
–Assume each component is generated by an underlying probability distribution
–Likelihood for the mixture model: L(τ, θ | data) = ∏_i Σ_k τ_k f_k(x_i | θ_k)
Gaussian mixture model:
–Each component k is modeled by the multivariate normal distribution with parameters:
Mean vector: μ_k
Covariance matrix: Σ_k
28 Eigenvalue decomposition of the covariance matrix
The covariance matrix Σ_k determines the geometric features of component k.
Banfield and Raftery (1993): Σ_k = λ_k D_k A_k D_k^T
–D_k: orientation of component k (orthogonal matrix)
–A_k: shape of component k (diagonal matrix)
–λ_k: volume of component k (scalar)
Allowing some but not all of the parameters to vary gives many models within this framework
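The decomposition Σ_k = λ_k D_k A_k D_k^T can be computed for any positive-definite covariance with a standard eigendecomposition; this sketch uses a made-up 3×3 covariance and one common normalization choice, λ = |Σ|^(1/d), so that det(A) = 1.

```python
import numpy as np

rng = np.random.default_rng(5)
M = rng.normal(size=(3, 3))
Sigma = M @ M.T + 3 * np.eye(3)     # a positive-definite covariance matrix

# Banfield-Raftery decomposition: Sigma = lambda * D @ A @ D.T
eigvals, D = np.linalg.eigh(Sigma)  # D: orientation (orthogonal matrix)
lam = eigvals.prod() ** (1 / 3)     # volume: |Sigma|^(1/d) with d = 3
A = np.diag(eigvals / lam)          # shape: diagonal matrix with det(A) = 1

assert np.allclose(Sigma, lam * D @ A @ D.T)
assert np.isclose(np.linalg.det(A), 1.0)
```

The constrained models then simply fix some of (λ_k, D_k, A_k) across components, e.g. EEE shares all three, while VVV lets all three vary.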
29 EM algorithm
General approach to maximum likelihood estimation
Iterate between E and M steps:
–E-step: compute the probability of each observation belonging to each cluster using the current parameter estimates
–M-step: estimate the model parameters using the current group membership probabilities
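The E/M iteration above can be written out concretely for the simplest case. This is a didactic sketch, not the MBC implementation: a two-component, one-dimensional mixture with a shared variance (the EI-style constraint), on made-up data.

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(6, 1, 200)])

# Two-component 1-D Gaussian mixture with a shared variance (EI-style EM)
mu = np.array([x.min(), x.max()])       # crude initialisation
sigma2, w = x.var(), np.array([0.5, 0.5])
for _ in range(50):
    # E-step: responsibility of each component for each observation
    dens = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * sigma2)) \
           / np.sqrt(2 * np.pi * sigma2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, and the shared variance
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma2 = (resp * (x[:, None] - mu) ** 2).sum() / len(x)

print(np.sort(mu))  # should end up close to the true means 0 and 6
```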
30 Definition of the BIC score
The integrated likelihood p(D | M_k) is hard to evaluate, where D is the data and M_k is the model.
BIC is an approximation to 2 log p(D | M_k):
BIC = 2 log p(D | θ̂_k, M_k) − ν_k log n
ν_k: number of parameters to be estimated in model M_k; n: number of observations
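The formula can be evaluated by hand and checked against a library implementation. This sketch assumes an unconstrained (VVV-style) two-component model on toy data; the parameter count ν is derived from the model, and the sign flip relative to scikit-learn's convention is made explicit.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

gm = GaussianMixture(n_components=2, covariance_type="full",
                     random_state=0).fit(X)
n, d, K = X.shape[0], X.shape[1], 2

# Parameter count for K full-covariance components in d dimensions:
# (K-1) weights + K*d means + K*d*(d+1)/2 covariance entries
nu = (K - 1) + K * d + K * d * (d + 1) // 2

loglik = gm.score(X) * n             # score() is the mean log-likelihood
bic = 2 * loglik - nu * np.log(n)    # convention here: larger is better

# sklearn's gm.bic(X) is the same quantity with the opposite sign
assert np.isclose(gm.bic(X), -bic)
```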
31 Randomly resampled synthetic data set
[Figure: genes × experiments matrices with class labels — ovary data vs. synthetic data]
33 Results: mixture of normal distributions based on ovary data (235 genes)
VVV and CAST: comparable adjusted Rand indices at 4 clusters
With only 235 genes, no BIC scores are available for VVV.
With the bigger data set (2350 genes), BIC selects VVV at 4 clusters.
34 Results: log ovary data Max adjusted Rand: VVV 4 clusters, EEE 4-6 clusters (higher than CAST) BIC selects VI at 9 clusters BIC curve shows a bend at 4 clusters for the diagonal model.
35 Results: standardized ovary data Comparable adjusted Rand indices for CAST, EI and EEE at 4 clusters. BIC selects EEE at 7 clusters.
36 Key advantage of the model-based approach: choose the right model and the number of clusters
Bayesian Information Criterion (BIC):
–Approximate Bayes factor: posterior odds for one model against another model
–A large BIC score indicates strong evidence for the corresponding model.