Correlate: a method for the integrative analysis of two genomic data sets. Sam Gross, Balasubramanian Narasimhan, Robert Tibshirani, and Daniela Witten. February 19, 2010.

Introduction –Sparse Canonical Correlation Analysis –Correlate: an Excel add-in that implements sparse CCA

A world of data

Statistical analyses There are great statistical methods for the analysis of gene expression, DNA copy number, and SNP data sets.

An integrative approach But what if we have access to multiple types of data (for instance, gene expression and DNA copy number data) on a single set of samples? The data types can be apples and oranges: for instance, imaging data and gene expression data.

Introduction In this talk, we’ll consider the case of DNA copy number and gene expression measurements on a single set of samples. Sparse CCA gives us a tool that can be used to answer the question: Can we identify a small set of gene expression measurements that is correlated with a region of DNA copy number gain/loss? Correlate provides an easy way to apply that method using Microsoft Excel.

Canonical Correlation Analysis (CCA) CCA is a classical statistical method Suppose we have n samples and p+q features for each sample –Let the sample be a group of n kids –Let the first p features be their scores on a set of p tests: reading comprehension, Latin, math… –Let the next q features be the amount of time they spend on certain activities per week: team sports, watching TV, reading…

CCA The question: How are the q activities associated with scores on the p exams? Maybe –More Reading ⇔ Better Reading Comprehension Scores –More Reading And Less TV ⇔ Even Better Reading Comprehension Scores –More Reading, More team sports, More Homework, and Less TV ⇔ Good Scores on all tests

CCA Canonical correlation analysis allows us to discover relationships like this between the two sets of variables. For instance, perhaps 0.6*ReadingComp + 0.8*Math + 0.743*Latin is highly correlated with 2*TeamSports − 11*TV + 8*Reading + 234*Homework

CCA CCA looks for linear combinations of variables in the two groups that are highly correlated with each other. Let X be a matrix with n columns (one for each student) and p = 3 rows (one for each test: Reading Comprehension, Math, Latin). And let Y be a matrix with n columns and q = 4 rows (one for each activity: Team Sports, TV, Reading, Homework). Statistically, we seek vectors u and v such that Cor(X’u, Y’v) is big. We can think of the components of u and v as weights for each variable.

CCA Thus, the output tells us that 0.6*ReadingComp + 0.8*Math + 0.743*Latin is highly correlated with 2*TeamSports − 11*TV + 8*Reading + 234*Homework Here, –u = (0.6, 0.8, 0.743)’ –v = (2, −11, 8, 234)’
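To make this concrete, here is a minimal R sketch of classical CCA on simulated student data, using base R’s cancor(). All numbers are made up for illustration, and note that cancor() expects samples on the rows, i.e. the transpose of the X and Y above.

```r
# Classical CCA on simulated "students" data with base R's cancor().
set.seed(1)
n <- 50
activities <- matrix(rnorm(n * 4), n, 4)
colnames(activities) <- c("TeamSports", "TV", "Reading", "Homework")
scores <- matrix(rnorm(n * 3), n, 3)
colnames(scores) <- c("ReadingComp", "Math", "Latin")
# Build in the relationship from the slide: more reading and less TV
# go with better reading comprehension scores.
scores[, "ReadingComp"] <- scores[, "ReadingComp"] +
  0.8 * activities[, "Reading"] - 0.5 * activities[, "TV"]

cc <- cancor(scores, activities)
cc$cor[1]      # the largest canonical correlation
cc$xcoef[, 1]  # u: weights for the test scores
cc$ycoef[, 1]  # v: weights for the activities
```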

Why is it useful? How does this apply to genomics and bioinformatics? If we have copy number and gene expression measurements on the same set of samples, we can ask: Which genes have expression that is associated with which regions of DNA gain or loss?

Sparse CCA This is almost the question that CCA answers for us... –But, CCA will give us a linear combination of genes that is associated with a linear combination of DNA copy number measurements –These linear combinations will involve every gene expression measurement and every copy number measurement What we really want is this: –A short list of genes that are associated with a particular region of DNA gain/loss

Sparse CCA From now on: –X is a matrix of gene expression data, with samples on the columns and genes on the rows –Y is a matrix of copy number data, with samples on the columns and copy number measurements on the rows

Sparse CCA CCA seeks weights u, v such that Cor(X’u, Y’v) is big. Sparse CCA seeks weights u, v such that Cor(X’u, Y’v) is big, and most of the weights are zero. u contains weights for the gene expression data, and v contains weights for the copy number data. Since the rows of Y are copy number measurements ordered along the chromosome, we want the weights in v to be smooth (not jumpy).

Sparse CCA By imposing the right penalty on u and v, we can ensure that –The elements of u are sparse –The elements of v are sparse and smooth –(Remember: u contains weights for the gene expression data, and v contains weights for the copy number data) We can also constrain u and v so that their weights are all positive or all negative.

Sparse CCA, mathematically We choose weights u and v to maximize Cor(X’u, Y’v) subject to $\sum_i |u_i| \le c_1$ and $\sum_j (|v_j| + |v_{j+1} - v_j|) \le c_2$. This is a lasso constraint on u and a fused lasso constraint on v. For small values of $c_1$ and $c_2$, some elements of u and v are exactly zero, and v is smooth.

For the statisticians: the criterion $\max_{u,v}\; u'XY'v$ subject to $u'u \le 1$, $v'v \le 1$, $P_1(u) \le c_1$, $P_2(v) \le c_2$. Assume that the features are standardized to have mean 0 and standard deviation 1. Here, $P_1$ and $P_2$ are convex penalties on the elements of u and v.

For the statisticians: biconvexity $\max_{u,v}\; u'XY'v$ subject to $u'u \le 1$, $v'v \le 1$, $P_1(u) \le c_1$, $P_2(v) \le c_2$. With u fixed, the criterion is convex in v, and with v fixed, it’s convex in u. This suggests a simple iterative optimization strategy: 1. Hold u fixed and optimize with respect to v. 2. Hold v fixed and optimize with respect to u.

For the statisticians: the algorithm $\max_{u,v}\; u'XY'v$ subject to $u'u \le 1$, $v'v \le 1$, $P_1(u) \le c_1$, $P_2(v) \le c_2$. Initialize v. Iterate until convergence: 1. Hold v fixed, and optimize: $\max_u\; u'XY'v$ subject to $u'u \le 1$, $P_1(u) \le c_1$. 2. Hold u fixed, and optimize: $\max_v\; u'XY'v$ subject to $v'v \le 1$, $P_2(v) \le c_2$.

For the statisticians: the penalties $\max_{u,v}\; u'XY'v$ subject to $u'u \le 1$, $v'v \le 1$, $P_1(u) \le c_1$, $P_2(v) \le c_2$. If $P_1$ is a lasso ($L_1$) penalty, $P_1(u) = \|u\|_1$, then the update for u is $u = S(XY'v, d)/\|S(XY'v, d)\|_2$, where $d \ge 0$ is chosen such that $\|u\|_1 = c_1$. Here, S is the soft-thresholding operator: $S(a, c) = \mathrm{sign}(a)(|a| - c)_+$.
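To make the updates concrete, here is a self-contained R sketch of the alternating algorithm, using the lasso penalty for both u and v (the fused-lasso update for v, on the next slide, needs dedicated software). The names soft, l1_update, and sparse_cca are ours, for illustration only; this is a sketch of the idea, not the authors’ implementation.

```r
# Alternating sparse CCA sketch, with an L1 penalty on BOTH u and v.
# X is p x n, Y is q x n, samples on the columns, features standardized.
soft <- function(a, c) sign(a) * pmax(abs(a) - c, 0)  # S(a, c) = sign(a)(|a| - c)+

# Solve: maximize a'w subject to w'w <= 1 and sum(|w|) <= c,
# via w = S(a, d) / ||S(a, d)||_2, with d >= 0 found by binary search.
# Feasible c lies between 1 and sqrt(length(a)).
l1_update <- function(a, c) {
  w <- a / sqrt(sum(a^2))
  if (sum(abs(w)) <= c) return(w)     # constraint already satisfied: d = 0
  lo <- 0; hi <- max(abs(a))
  for (i in 1:50) {                   # binary search for the threshold d
    d <- (lo + hi) / 2
    w <- soft(a, d)
    w <- w / sqrt(sum(w^2))
    if (sum(abs(w)) > c) lo <- d else hi <- d
  }
  w
}

sparse_cca <- function(X, Y, c1, c2, niter = 20) {
  XYt <- X %*% t(Y)                   # p x q cross-product matrix
  v <- rnorm(nrow(Y)); v <- v / sqrt(sum(v^2))  # random start for v
  for (i in 1:niter) {
    u <- l1_update(XYt %*% v, c1)     # step 1: v fixed, update u
    v <- l1_update(t(XYt) %*% u, c2)  # step 2: u fixed, update v
  }
  list(u = drop(u), v = drop(v),
       cor = cor(drop(t(X) %*% u), drop(t(Y) %*% v)))
}
```

The PMA package mentioned at the end of the talk implements these updates, including the fused-lasso case for v.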

For the statisticians: the penalties $\max_{u,v}\; u'XY'v$ subject to $u'u \le 1$, $v'v \le 1$, $P_1(u) \le c_1$, $P_2(v) \le c_2$. If $P_2$ is a fused lasso penalty, $P_2(v) = \sum_j (|v_j| + |v_{j+1} - v_j|)$, then the update is a little harder and requires software for fused lasso regression.

Sparse CCA results So what do we end up with? –A set of genes that is associated with a region (or regions) of DNA gain/loss –Weights for the gene expression measurements (can be constrained to all have the same sign) –Weights for the DNA copy number measurements, which will be smooth –We can get multiple (gene set, DNA gain/loss) pairs We use a permutation approach to get a p-value for the significance of the results.

Permutation approach [Diagram: Dataset 1 is the p × n matrix X and Dataset 2 is the q × n matrix Y, with the same samples 1, 2, …, n on the columns.] 1. Run sparse CCA on (X, Y) and record Cor(X’u, Y’v). 2. Permute the sample order of dataset 2 to obtain Y*, run sparse CCA on (X, Y*), and record Cor(X’u*, Y*’v*). 3. Repeat the permutation 100 times. 4. Compare Cor(X’u, Y’v) to the permutation distribution {Cor(X’u*, Y*’v*)}.
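In R, the permutation test can be sketched as follows, reusing the illustrative sparse_cca() from the algorithm sketch above (again our own stand-in, not the authors’ code):

```r
# Permutation p-value for the sparse CCA correlation. Permuting the
# columns (samples) of Y breaks the sample-level link between the two
# data sets while preserving the internal structure of each.
perm_pvalue <- function(X, Y, c1, c2, B = 100) {
  obs <- sparse_cca(X, Y, c1, c2)$cor
  null_cors <- replicate(B, {
    Ystar <- Y[, sample(ncol(Y))]     # Y*: same data, permuted sample order
    sparse_cca(X, Ystar, c1, c2)$cor  # Cor(X'u*, Y*'v*)
  })
  mean(null_cors >= obs)              # fraction of permuted cors >= observed
}
```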

Extensions These ideas have been extended to the following cases: –More than two data sets –A supervising outcome (e.g. survival time or tumor subtype) for each sample

Data Applied to breast cancer data from Chin, DeVries, Fridlyand, et al. (2006) Cancer Cell 10: –n = 89 tissue samples –p = gene expression measurements –q = 2149 DNA copy number measurements Look for a region of copy number change on chromosome 20 that’s correlated with the expression of some set of genes.

Example Copy number data on chromosome 20 Gene expression data from all chromosomes Can we find a region of copy number change on chromosome 20 that’s correlated with the expression of a set of genes?

Correlate

Correlate - chromosome 20

Non-zero gene expression weights by chromosome

Correlate - chromosome 1

All 44 non-zero gene expression weights are on chromosome 1. Top 10:
–splicing factor 3b, subunit 4, 49kD
–HSPC003 protein
–rab3 GTPase-activating protein, non-catalytic subunit (150kD)
–hypothetical protein My014
–UDP-Gal:betaGlcNAc beta 1,4-galactosyltransferase, polypeptide 3
–glyceronephosphate O-acyltransferase
–NADH dehydrogenase (ubiquinone) Fe-S protein 2 (49kD) (NADH-coenzyme Q reductase)
–hypothetical protein FLJ12671
–mitochondrial ribosomal protein L24
–CGI-78 protein

Correlate – Conclusions Can be applied to any pair of data sets: SNP, methylation, microRNA expression data, and more…. Think broadly… a collaborator is using it to correlate image data and gene expression data in cancer. A linear combination of image features is highly predictive of survival! A principled way to discover associations and perform an integrative analysis of two data sets.

Try it out! For R users: package PMA on CRAN, or google “Tibshirani”
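A hedged sketch of the CRAN route, on simulated stand-in data. The argument names follow the PMA interface as we recall it (typex/typez take "standard" for a lasso penalty or "ordered" for the fused lasso on chromosomally ordered features, with samples on the rows); check ?CCA in the package before relying on the details.

```r
# Sketch only: data are simulated stand-ins, and the CCA() arguments
# should be verified against the PMA documentation.
# install.packages("PMA")
library(PMA)

set.seed(1)
x <- matrix(rnorm(89 * 200), 89, 200)  # toy stand-in: 89 samples x 200 genes
z <- matrix(rnorm(89 * 100), 89, 100)  # toy stand-in: 89 samples x 100 ordered CGH spots

out <- CCA(x, z, typex = "standard", typez = "ordered",
           penaltyx = 0.3, penaltyz = 0.3, K = 1)
out$u  # sparse weights for the gene expression features
out$v  # sparse, smooth weights for the copy number features
```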

Acknowledgments Sam Gross (Harvard), Balasubramanian Narasimhan (Stanford), and Robert Tibshirani (Stanford)

References
–Witten DM, Tibshirani R, and Hastie T (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10(3).
–Witten DM and Tibshirani R (2009). Extensions of sparse canonical correlation analysis, with applications to genomic data. Statistical Applications in Genetics and Molecular Biology 8(1): Article 28.