Principal Component Analysis (PCA) for Clustering Gene Expression Data
K. Y. Yeung and W. L. Ruzzo

Organization
- Association of PCA and this paper
- Approach of this paper
- Data sets
- Clustering algorithms and similarity metrics
- Results and discussion

The Functions of PCA
- PCA can reduce the dimensionality of the data set.
- A few PCs may capture most of the variation in the original data set.
- PCs are uncorrelated and ordered.
- We expect that the first few PCs may 'extract' the cluster structure in the original data set.
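
As a minimal sketch of these properties (a toy matrix and scikit-learn, not anything from the paper), the snippet below projects data onto its PCs and shows that the explained variance is ordered and the PC scores are uncorrelated:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))     # toy data: 100 genes x 10 conditions

pca = PCA()
scores = pca.fit_transform(X)      # PC scores, one column per component

# Ordered: the explained variance ratios are non-increasing
print(pca.explained_variance_ratio_)
# Uncorrelated: the covariance of the scores is (numerically) diagonal
print(np.round(np.cov(scores.T), 2))
```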

This Paper's Point of View
- A theoretical result shows that the first few PCs may not contain cluster information (Chang, 1983).
- Chang's example.
- A motivating example (coming next).

A Motivating Example
- Data: a subset of the sporulation data (477 genes) classified into seven temporal patterns (Chu et al., 1998).
- The first 2 PCs contain 85.9% of the variation in the data (Figure 1a).
- The first 3 PCs contain 93.2% of the variation in the data (Figure 1b).

Sporulation Data
- The patterns overlap around the origin in (1a).
- The patterns are much more separated in (1b).

The Goal
EMPIRICALLY investigate the effectiveness of clustering gene expression data using PCs instead of the original variables.

Outline of Methods
- Genes are to be clustered, and the experimental conditions are the variables.
- The effectiveness of clustering with the original data and with different sets of PCs is measured by comparing the clustering results to an objective external criterion.
- Assume the number of clusters is known.

Agreement Between Two Partitions
The Rand index (Rand, 1971): given a set of n objects S, let U and V be two partitions of S. Let:
a = # of pairs placed in the same cluster in both U and V
d = # of pairs placed in different clusters in both U and V
Rand index = (a + d) / C(n, 2)
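
A direct transcription of this definition (a sketch; partitions are given as label sequences):

```python
from itertools import combinations

def rand_index(u, v):
    """Rand index between two partitions given as label sequences."""
    pairs = list(combinations(range(len(u)), 2))
    # A pair agrees if it is co-clustered in both partitions (a) or in neither (d)
    agree = sum((u[i] == u[j]) == (v[i] == v[j]) for i, j in pairs)
    return agree / len(pairs)  # (a + d) / C(n, 2)

print(rand_index([0, 0, 1, 1], [0, 0, 1, 2]))  # 5 of 6 pairs agree -> 0.833...
```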

Agreement (Cont'd)
The adjusted Rand index (ARI; Hubert and Arabie, 1985) corrects the Rand index for chance. With contingency counts n_ij (objects in cluster i of U and cluster j of V), row sums a_i, column sums b_j, and E = [sum_i C(a_i, 2)][sum_j C(b_j, 2)] / C(n, 2):
ARI = [sum_ij C(n_ij, 2) - E] / [(1/2)(sum_i C(a_i, 2) + sum_j C(b_j, 2)) - E]
Note: higher ARI means higher correspondence between two partitions.
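
In practice, scikit-learn provides this measure directly; a quick check (assuming sklearn is available):

```python
from sklearn.metrics import adjusted_rand_score

# ARI is 1.0 for identical partitions and has expected value ~0 for random ones
u = [0, 0, 1, 1]
v = [0, 0, 1, 2]
print(adjusted_rand_score(u, v))
```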

Subset of PCs
Motivated by Chang's example, it may be possible to find other subsets of PCs that preserve the cluster structure better than the first few PCs.
How?
--- The greedy approach.
--- The modified greedy approach.

The Greedy Approach
Let m0 be the minimum number of PCs to be clustered, and p the number of variables in the data.
1) Search over all subsets of m0 PCs for the one with maximum ARI, denoted s_m0.
2) For each m = m0+1, ..., p, try adding each remaining PC to s_(m-1) and calculate the ARI; the PC giving the maximum ARI is added to form s_m.
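
A sketch of this search (illustrative only; `cluster_and_score` is a hypothetical helper standing in for running a clustering algorithm on the chosen columns and computing the ARI against the external criterion):

```python
from itertools import combinations

def greedy_pc_subsets(scores, labels, m0, cluster_and_score):
    """Greedily grow a subset of PC indices, maximizing ARI at each size.

    scores:            (n_genes, p) matrix of PC scores
    labels:            external class labels for computing the ARI
    cluster_and_score: hypothetical helper, (data, labels) -> ARI
    """
    p = scores.shape[1]
    score = lambda s: cluster_and_score(scores[:, sorted(s)], labels)

    # Step 1: exhaustive search over all subsets of size m0
    subsets = {m0: max(map(set, combinations(range(p), m0)), key=score)}

    # Step 2: at each size, add the single PC that maximizes the ARI
    for m in range(m0 + 1, p + 1):
        prev = subsets[m - 1]
        subsets[m] = max((prev | {j} for j in range(p) if j not in prev), key=score)
    return subsets
```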

The Modified Greedy Approach
- In each step of the greedy approach (# of PCs = m), retain the k best subsets of PCs for the next step (# of PCs = m+1), i.e., a beam search of width k.
- If k = 1, this is just the greedy approach.
- k = 3 in this paper.
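
The same sketch with a beam of width k (again using the hypothetical `cluster_and_score` helper):

```python
from itertools import combinations

def beam_pc_subsets(scores, labels, m0, cluster_and_score, k=3):
    """Modified greedy: carry the k best PC subsets from each size to the next."""
    p = scores.shape[1]
    score = lambda s: cluster_and_score(scores[:, sorted(s)], labels)

    beam = sorted(map(frozenset, combinations(range(p), m0)), key=score)[-k:]
    best = {m0: beam[-1]}
    for m in range(m0 + 1, p + 1):
        # expand every subset in the beam by one unused PC, keep the k best
        cands = {s | {j} for s in beam for j in range(p) if j not in s}
        beam = sorted(cands, key=score)[-k:]
        best[m] = beam[-1]
    return best
```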

The Scheme of the Study
Given a gene expression data set with n genes (subjects) and p experimental conditions (variables), apply a clustering algorithm to:
1) the given data set (ARI with the external criterion);
2) the first m PCs, for m = m0, ..., p;
3) the subset of PCs found by the (modified) greedy approach;
4) 30 sets of random PCs;
5) 30 sets of random orthogonal projections.
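
One standard construction of a random orthogonal projection, used here as an assumption since the transcript does not spell one out, is to orthonormalize a Gaussian random matrix:

```python
import numpy as np

def random_orthogonal_projection(X, m, rng):
    """Project X (n x p) onto m random orthonormal directions."""
    p = X.shape[1]
    Q, _ = np.linalg.qr(rng.normal(size=(p, m)))  # Q: p x m, orthonormal columns
    return X @ Q

rng = np.random.default_rng(1)
X = rng.normal(size=(235, 24))                  # shaped like the ovary data
Y = random_orthogonal_projection(X, 5, rng)     # 235 x 5
```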

Data Sets
- 'Class' refers to a group in the external criterion; 'cluster' refers to a cluster obtained by a clustering algorithm.
- There are two real data sets and three synthetic data sets in this study.

The Ovary Data
- The data contain 235 clones and 24 tissue samples.
- Of the 24 tissue samples, 7 are from normal tissues, 4 from blood samples, and 13 from ovarian cancers.
- The 235 clones were found to correspond to four different genes (classes), having 58, 88, 57 and 32 clones respectively.
- The data for each clone were normalized across the 24 experiments to have mean 0 and variance 1.
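
This per-clone normalization is a row-wise z-score; a minimal sketch, assuming a clones x experiments matrix:

```python
import numpy as np

def standardize_rows(X):
    """Normalize each row (clone) to mean 0 and variance 1 across its columns."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)
```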

The Yeast Cell Cycle Data
- The data set shows the fluctuation of expression levels over two cell cycles.
- 380 genes were classified into five phases (classes).
- The data for each gene were normalized to have mean 0 and variance 1 across each cell cycle.

Mixture of Normals on Ovary
- For each gene (class), the sample covariance matrix and the mean vector are computed.
- Sample (58, 88, 57, 32) clones from the MVN fitted to each class; 10 replicates.
- This preserves the mean and covariance of the original data, but relies on the MVN assumption.
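
A sketch of generating one such replicate (illustrative, not the paper's code; `labels` is assumed to be an array of class labels):

```python
import numpy as np

def mvn_replicate(X, labels, rng):
    """One synthetic replicate: per-class MVN with that class's sample moments."""
    parts = []
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu = Xc.mean(axis=0)              # class mean vector
        cov = np.cov(Xc, rowvar=False)    # class sample covariance matrix
        parts.append(rng.multivariate_normal(mu, cov, size=len(Xc)))
    return np.vstack(parts)
```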

Marginal Normality

Randomly Resample Ovary Data
- For each class c (c = 1, ..., 4) under experimental condition j (j = 1, ..., 24), resample the expression levels with replacement, retaining the size of each class; 10 replicates.
- No MVN assumption. Sampling independently across experimental conditions appears reasonable on inspection of the data.
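
A sketch of this class-wise, condition-wise bootstrap (illustrative only):

```python
import numpy as np

def resample_replicate(X, labels, rng):
    """Bootstrap expression levels within each class, independently per condition."""
    Xnew = np.empty_like(X)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)          # rows in class c
        for j in range(X.shape[1]):                # each condition separately
            Xnew[idx, j] = rng.choice(X[idx, j], size=idx.size, replace=True)
    return Xnew
```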

Cyclic Data
- This data set models cyclic behavior of genes over different time points.
- The behavior of genes is modeled by the sine function.
- A drawback of this model is the arbitrary choice of several parameters.
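
The transcript does not give the exact parameterization, so the generator below is a hypothetical one in the same spirit: classes are sine waves differing by phase, with arbitrary per-gene amplitude and noise:

```python
import numpy as np

def cyclic_data(n_classes, genes_per_class, n_conditions, rng):
    """Hypothetical generator: each class is a sine wave with its own phase."""
    t = np.arange(n_conditions)
    rows = []
    for k in range(n_classes):
        phase = 2 * np.pi * k / n_classes          # class-specific phase offset
        for _ in range(genes_per_class):
            amp = rng.uniform(0.5, 1.5)            # arbitrary per-gene amplitude
            noise = rng.normal(0, 0.2, n_conditions)
            rows.append(amp * np.sin(2 * np.pi * t / n_conditions + phase) + noise)
    return np.asarray(rows)
```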

Clustering Algorithms and Similarity Metrics
Clustering algorithms:
- Cluster affinity search technique (CAST)
- Hierarchical average-link algorithm
- k-means algorithm
Similarity metrics:
- Euclidean distance (m0 = 2)
- Correlation coefficient (m0 = 3)
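
CAST has no scikit-learn implementation, but the other two algorithms and both metrics do; a sketch of one ARI comparison (assuming an expression matrix `X` and external `labels`):

```python
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

def compare(X, labels, k):
    """ARI of k-means (Euclidean) and average-link (correlation) vs. the criterion."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    D = squareform(pdist(X, metric="correlation"))   # 1 - Pearson correlation
    avg = AgglomerativeClustering(n_clusters=k, metric="precomputed",
                                  linkage="average").fit_predict(D)
    return adjusted_rand_score(labels, km), adjusted_rand_score(labels, avg)
```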

Table 1

Table 2
- One-sided Wilcoxon signed rank test.
- CAST always favors 'no PCA'.
- The two significant results for PCA are not clear successes.
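
For reference, such a paired one-sided test can be run with scipy (the ARI values below are illustrative placeholders, not numbers from the paper):

```python
from scipy.stats import wilcoxon

# Matched ARI values across replicates (placeholders for illustration)
ari_no_pca = [0.61, 0.58, 0.66, 0.59, 0.63]
ari_pca    = [0.55, 0.57, 0.60, 0.54, 0.58]

# One-sided alternative: clustering without PCA yields higher ARI
stat, p = wilcoxon(ari_no_pca, ari_pca, alternative="greater")
print(stat, p)
```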

Conclusion
1) The quality of clustering results on the data after PCA is not necessarily higher than that on the original data, and is sometimes lower.
2) The first m PCs do not give the highest adjusted Rand index, i.e., another set of PCs gives a higher ARI.

Conclusion (Cont'd)
3) There is no clear trend regarding the choice of the optimal number of PCs across the data sets, clustering algorithms, and similarity metrics. There is no obvious relationship between cluster quality and the number or set of PCs used.

Conclusion (Cont'd)
4) On average, the quality of clusters obtained by clustering random sets of PCs tends to be slightly lower than that obtained by clustering random sets of orthogonal projections, especially when the number of components is small.

Grand Conclusion
In general, we recommend AGAINST using PCA to reduce the dimensionality of the data before applying clustering algorithms, unless external information is available.