Sparse Principal Component Analysis


Sparse Principal Component Analysis
Hui Zou, Trevor Hastie, and Robert Tibshirani (2005)
Presented by Berlene Shipes

Abbreviations
- PCA: Principal Component Analysis
- SVD: Singular Value Decomposition
- SPCA: Sparse Principal Component Analysis
- PC: Principal Component

Model Specifications
- n = number of observations; p = number of predictors
- Response vector Y; predictors indexed j = 1, …, p

Principal Component Analysis
- Uses: data processing and dimension reduction
- Computed using the singular value decomposition (SVD) of the data matrix
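To make the SVD computation concrete, here is a minimal NumPy sketch (not part of the original slides; the function and variable names are mine) of extracting the first k loadings and principal components from a data matrix:

```python
import numpy as np

def pca_via_svd(X, k):
    """Return the first k PCA loadings and principal components of X.

    Minimal sketch: center the columns, take the thin SVD X_c = U D V^T;
    the columns of V are the loadings and Z = U D are the PCs."""
    Xc = X - X.mean(axis=0)                        # center each column
    U, d, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:k].T                            # p x k loading matrix
    scores = U[:, :k] * d[:k]                      # n x k principal components (Z = UD)
    return loadings, scores
```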

PCA
Optimal properties:
- Principal components sequentially capture the maximum variability among the columns of X, which guarantees minimal information loss
- Principal components are mutually uncorrelated
Suboptimal properties:
- Each PC is a linear combination of all p variables
- Loadings are typically nonzero

Previous Solutions
Interpretation of PCs:
- Jolliffe (1995) suggested rotation techniques
- Vines (2000) considered simple principal components, with loadings taking values from a small set of integers
Dimensionality reduction:
- Cadima and Jolliffe (1995) artificially set loadings with absolute values smaller than some threshold to zero
- McCabe (1984) found a subset of principal variables
- Jolliffe, Trendafilov, and Uddin (2003) introduced SCoTLASS to get modified PCs with possible zero loadings

Lasso
- Tibshirani (1996) introduced the lasso as a variable selection technique, focused on accurate and sparse models
- It is a penalized least squares method with a constraint on the L1 norm of the regression coefficients
- λ is non-negative
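The formula omitted from the slide is the standard lasso criterion (a reconstruction, with Y the response and X the predictor matrix):

$$\hat\beta_{\text{lasso}} = \arg\min_{\beta}\ \lVert Y - X\beta \rVert^{2} + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert, \qquad \lambda \ge 0.$$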

Lasso Continued
- Continuously shrinks the coefficients toward zero, gaining prediction accuracy via the bias-variance trade-off
- Estimated using the LARS algorithm
- Limitation: the number of variables selected by the lasso is bounded by the number of observations; it can select at most n predictors

Elastic Net
- Zou and Hastie (2005) proposed the elastic net as a generalization of the lasso
- Its penalty is a convex combination of the ridge and lasso penalties
- λ1 and λ2 are non-negative
- Estimated using the LARS-EN algorithm
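The omitted criterion is the (naive) elastic net, which combines the ridge and lasso penalties (the published elastic net additionally rescales this solution):

$$\hat\beta = \arg\min_{\beta}\ \lVert Y - X\beta \rVert^{2} + \lambda_2 \lVert \beta \rVert^{2} + \lambda_1 \lVert \beta \rVert_{1}, \qquad \lambda_1, \lambda_2 \ge 0.$$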

Elastic Net Continued
- When p > n, choose λ2 > 0
- This removes the limitation on the number of variables that can be included in the fitted model

SCoTLASS
- Obtains sparse loadings by directly imposing an L1 constraint on PCA
- A sufficiently small t yields some exact zero loadings
- The optimization problem is as below:
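The omitted problem, reconstructed from Jolliffe, Trendafilov, and Uddin (2003): for the kth component,

$$\max_{a_k}\ a_k^{\top} (X^{\top} X)\, a_k \quad \text{s.t.} \quad a_k^{\top} a_k = 1,\ \ a_h^{\top} a_k = 0 \ (h < k),\ \ \sum_{j=1}^{p} \lvert a_{kj} \rvert \le t.$$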

SCoTLASS Continued
Limitations:
- No guidance on how to choose t
- High computational cost
- Solutions are not sparse enough when a high percentage of explained variance is required

Simple Regression Approach
Theorem 1: For each i, denote by Z_i = U_i D_ii the ith principal component. Consider a positive λ and the ridge estimates given below.
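The ridge estimate referred to on the slide, and the loading it recovers (a reconstruction of the statement of Theorem 1):

$$\hat\beta_{\text{ridge}} = \arg\min_{\beta}\ \lVert Z_i - X\beta \rVert^{2} + \lambda \lVert \beta \rVert^{2}, \qquad \hat v = \frac{\hat\beta_{\text{ridge}}}{\lVert \hat\beta_{\text{ridge}} \rVert} = V_i.$$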

Theorem 1 Implications
- Theorem 1 connects PCA to a regression-type method
- Because λ is positive, the ridge problem has a unique solution in all situations, so PCA can always be recovered this way
- Extending this to the naive elastic net lets us flexibly choose a sparse approximation to the ith principal component

SPCA
Connecting PCA and regression, and using the lasso penalty to produce sparse loadings, gives the following criterion to be optimized:
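The criterion, reconstructed from the paper (B = [β1, …, βk] holds the sparse loading directions and A = [α1, …, αk] is constrained to be orthonormal):

$$(\hat A, \hat B) = \arg\min_{A,B}\ \sum_{i=1}^{n} \lVert x_i - A B^{\top} x_i \rVert^{2} + \lambda \sum_{j=1}^{k} \lVert \beta_j \rVert^{2} + \sum_{j=1}^{k} \lambda_{1,j} \lVert \beta_j \rVert_{1} \quad \text{s.t. } A^{\top} A = I_{k},$$

with the sparse loadings given by $\hat V_j = \hat\beta_j / \lVert \hat\beta_j \rVert$.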

General SPCA Algorithm
1. Let A start at V[, 1:k], the loadings of the first k ordinary principal components.
2. For fixed A = [α1, …, αk], solve the following elastic net problem for j = 1, 2, …, k: regress Xα_j on X under the elastic net penalty to obtain β_j.
3. For fixed B = [β1, …, βk], compute the SVD of X^T X B = UDV^T, then update A = UV^T.
4. Repeat steps 2-3 until convergence.
5. Normalization: V̂_j = β_j / ‖β_j‖ for j = 1, …, k.
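A minimal NumPy/scikit-learn sketch of these alternating steps follows. This is an illustration, not the authors' implementation: the function name and parameter values are mine, and scikit-learn's (alpha, l1_ratio) penalty parameterization maps only loosely onto the paper's (λ, λ1,j).

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def general_spca(X, k, alpha=1e-3, l1_ratio=0.9, n_iter=200, tol=1e-6):
    """Sketch of the general SPCA alternating algorithm.

    X is assumed to be a centered n x p data matrix; k is the number of
    sparse components requested. Penalty strengths are placeholders."""
    # Step 1: initialize A with the loadings of the first k ordinary PCs
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    A = Vt[:k].T                                   # p x k
    B = np.zeros_like(A)
    for _ in range(n_iter):
        B_old = B.copy()
        # Step 2: for fixed A, solve k elastic net regressions of X a_j on X
        for j in range(k):
            enet = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                              fit_intercept=False, max_iter=10_000)
            enet.fit(X, X @ A[:, j])
            B[:, j] = enet.coef_
        # Step 3: for fixed B, update A from the SVD of X^T X B (A = U V^T)
        U, _, Vt2 = np.linalg.svd(X.T @ (X @ B), full_matrices=False)
        A = U @ Vt2
        # Step 4: repeat steps 2-3 until B stops changing
        if np.max(np.abs(B - B_old)) < tol:
            break
    # Step 5: normalization -- each column of B becomes a unit-norm sparse loading
    norms = np.linalg.norm(B, axis=0)
    norms[norms == 0] = 1.0
    return B / norms
```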

Remarks about General SPCA Algorithm
- The output does not change much over a wide range of λ
- If n > p, λ can default to zero
- A small positive λ overcomes collinearity problems in X
- The algorithm converges quickly
- One can try multiple combinations of {λ1,j} and choose values that give an acceptable compromise between variance and sparsity, usually prioritizing variance

Adjusted Total Variance
Take the correlations among the modified PCs into account using the formula below:
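The formula the slide refers to, reconstructed from the paper: if Ẑ = XV̂ are the modified (correlated) PCs, take the QR decomposition Ẑ = QR; then

$$\widehat{\text{Var}}_{\text{adj}}(\hat Z_j) = R_{jj}^{2}, \qquad \text{adjusted total variance} = \sum_{j=1}^{k} R_{jj}^{2},$$

so that variance shared with earlier components is not counted twice.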

Computation Complexity
- When n > p and p ≥ k, the total computation cost is at most np² + mO(p³), where m is the number of iterations before convergence and O(p³) bounds the cost of each elastic net solution; SPCA is therefore efficient for huge n and small p (say p < 100)
- When p ≫ n, the total computation cost is of order mkO(pJn + J³) for a positive finite λ, where J is the number of nonzero coefficients; this is expensive for large J and p, with the elastic net step being the most costly, so a special algorithm is used for this type of data

SPCA for p ≫ n
Theorem 5 covers this setting, in which the SPCA criterion is used with λ = ∞.

SPCA for p ≫ n
Using Theorem 5, replace Step 2 of the general SPCA algorithm with a soft-thresholding step.
Step 2: for j = 1, 2, …, k,
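Reconstructed from the paper, the soft-thresholding update applied elementwise to the vector X^T X α_j is

$$\hat\beta_j = \left( \lvert X^{\top} X \alpha_j \rvert - \tfrac{\lambda_{1,j}}{2} \right)_{+} \operatorname{sign}\!\left( X^{\top} X \alpha_j \right), \qquad j = 1, \ldots, k,$$

which avoids solving a full elastic net when p ≫ n.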

Pitprops Data
- 180 observations with 13 measured variables
- A classic example showing the difficulty of interpreting PCs
- Set λ = 0 and λ1 = (0.06, 0.16, 0.1, 0.5, 0.5, 0.5), chosen so that each sparse approximation explains almost the same amount of variance as the corresponding ordinary PC

Pitprops Data
- The PCs from SPCA account for 75.8% of the variance; SCoTLASS accounts for 69.3%
- SPCA is also sparser, and its computation finished in seconds
- In terms of explained variance, SCoTLASS, simple thresholding, and SPCA are increasingly better, in that order

Synthetic Data
- Three hidden factors with 10 observable variables
- The exact covariance matrix was used to perform PCA, SPCA, and simple thresholding
- By construction of the data, there should be a "correct" sparse representation
- SPCA and SCoTLASS (both using the lasso penalty) produce the ideal sparse PCs
- Simple thresholding incorrectly identified some variables as most important, and the variance it explains is lower than SPCA's

Ramaswamy Data
- p = 16,063 genes and n = 144 samples
- The goal is to find the set of genes that are biologically relevant to the outcome; PCA has been a popular tool for this analysis
- If a sparse principal component can explain a large part of the total variance of gene expression levels, then the subset of genes representing that principal component is considered important
- Apply SPCA with λ = ∞

Ramaswamy Data
- SCoTLASS cannot be used for finding sparse PCs here
- Simple thresholding always explains slightly more variance than SPCA does for the same number of genes
- However, the two methods select gene sets that differ by about 2%, and this difference is consistent

Discussion
A good method for achieving sparseness should possess these properties:
1. Without any sparsity constraint, the method reduces to PCA
2. It is computationally efficient for both small-p and big-p data
3. It avoids misidentifying the important variables
The simple thresholding approach is not criterion based, but it satisfies properties 1 and 2, so it is the benchmark for any potentially better method.

Discussion Continued
SCoTLASS:
- Derives sparse loadings, but is not computationally efficient
- Lacks an adequate rule for choosing its tuning parameter
- Cannot be applied to gene expression arrays
SPCA:
- Computationally efficient
- High explained variance
- Identifies the important variables

Questions?