Sparse Principal Component Analysis


1 Sparse Principal Component Analysis
Hui Zou, Trevor Hastie, and Robert Tibshirani (2006)
Presented by Berlene Shipes

2 Abbreviations
PCA: Principal Component Analysis
SVD: Singular Value Decomposition
SPCA: Sparse Principal Component Analysis
PC: Principal Component

3 Model Specifications
n = number of observations
p = number of predictors
X is the n × p data matrix; its jth column (predictor) is $x_j$, j = 1, …, p
Y denotes a response vector in the regression formulations

4 Principal Component Analysis
Uses: data processing and dimension reduction
Computed using the singular value decomposition (SVD) of the data matrix, as sketched below
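A minimal numpy sketch of this computation (not from the original slides; the random data and variable names are illustrative):

```python
# PCA via the SVD of a centered data matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # n = 100 observations, p = 5 variables
X = X - X.mean(axis=0)                 # center columns before the SVD

# X = U D V^T: columns of V are the loadings, Z = U D are the PCs.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
loadings = Vt.T                        # p x p matrix of loading vectors
pcs = U * d                            # n x p matrix of principal components

# Variance explained by each component is d_i^2 / (n - 1).
explained = d**2 / (X.shape[0] - 1)
print(explained / explained.sum())     # proportion of variance per PC
```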

5 PCA
Optimal properties:
Principal components sequentially capture the maximum variability among the columns of X, which guarantees minimal information loss
Principal components are uncorrelated, so each one can be interpreted separately from the others
Suboptimal properties:
Each PC is a linear combination of all p variables
Loadings are typically all nonzero, which makes the PCs hard to interpret

6 Previous Solutions
Interpretation of PCs:
Jolliffe (1995) suggested rotation techniques
Vines (2000) considered simple principal components, whose loadings take values from a small set of integers
Dimensionality reduction:
Cadima and Jolliffe (1995) artificially set the loadings with absolute values smaller than some threshold to zero
McCabe (1984) found a subset of principal variables
Jolliffe, Trendafilov, and Uddin (2003) introduced SCoTLASS to obtain modified PCs with possible zero loadings

7 Lasso
Tibshirani (1996) introduced the lasso as a variable selection technique
Focuses on accurate and sparse models
Penalized least squares method with a constraint on the L1 norm of the regression coefficients:
$\hat{\beta}_{lasso} = \arg\min_{\beta} \|Y - X\beta\|^2 + \lambda \sum_{j=1}^{p} |\beta_j|$
λ is non-negative

8 Lasso Continued
Continuously shrinks the coefficients towards zero
Improves prediction accuracy via the bias-variance trade-off
Estimated using the LARS algorithm
Limitations: the number of variables selected by the lasso is limited by the number of observations; when p > n, it can select at most n predictors (see the sketch below)
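A small illustrative demo of that limitation, using scikit-learn's LassoLars (the data and penalty value are made up for the example):

```python
# With p > n, the lasso solved by LARS keeps at most n predictors.
import numpy as np
from sklearn.linear_model import LassoLars

rng = np.random.default_rng(1)
n, p = 20, 100                         # more predictors than observations
X = rng.normal(size=(n, p))
y = X[:, :5] @ np.ones(5) + 0.1 * rng.normal(size=n)

fit = LassoLars(alpha=0.01).fit(X, y)
print(np.count_nonzero(fit.coef_))     # never exceeds n = 20
```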

9 Elastic Net
Zou and Hastie (2005) proposed the elastic net as a generalization of the lasso
Convex combination of the ridge and lasso penalties:
$\hat{\beta} = \arg\min_{\beta} \|Y - X\beta\|^2 + \lambda_2 \|\beta\|^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j|$
λ1 and λ2 are non-negative
Estimated using the LARS-EN algorithm

10 Elastic Net Continued
When p > n, choose λ2 > 0
This removes the limitation on the number of variables that can be included in the fitted model (see the companion sketch below)
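A companion sketch with scikit-learn's ElasticNet; note its alpha/l1_ratio parametrization differs from the λ1, λ2 notation above, so the values here are illustrative knobs:

```python
# With a ridge component in the penalty, more than n variables can stay in.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(2)
n, p = 20, 100
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

fit = ElasticNet(alpha=0.05, l1_ratio=0.2, max_iter=50_000).fit(X, y)
print(np.count_nonzero(fit.coef_))     # can exceed n = 20
```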

11 SCoTLASS
Obtains sparse loadings by directly imposing an L1 constraint on PCA
A sufficiently small t yields some exact zero loadings
The kth loading vector $a_k$ solves:
$\max_{a_k} \; a_k^T (X^T X) a_k \quad \text{subject to } a_k^T a_k = 1,\; a_h^T a_k = 0 \; (h < k),\; \sum_{j=1}^{p} |a_{kj}| \le t$

12 SCoTLASS Continued
Limitations:
No guidance on choosing t
High computational cost
The loadings are not sparse enough when a high percentage of explained variance is required

13 Simple Regression Approach
Theorem 1: Write the SVD of X as $X = UDV^T$ and, for each i, let $Z_i = U_i D_{ii}$ denote the ith principal component. For any positive λ, consider the ridge estimates
$\hat{\beta} = \arg\min_{\beta} \|Z_i - X\beta\|^2 + \lambda \|\beta\|^2.$
Then the normalized coefficients recover the ith loading vector: $\hat{v} = \hat{\beta}/\|\hat{\beta}\| = V_i$ (checked numerically below).
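A quick numerical check of Theorem 1 in numpy (random data; the penalty value λ = 10 is arbitrary):

```python
# Normalized ridge coefficients of Z_i on X recover the i-th loading V_i.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
i, lam = 0, 10.0                       # first PC, positive ridge penalty
Z_i = U[:, i] * d[i]                   # i-th principal component

beta = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Z_i)
v_hat = beta / np.linalg.norm(beta)
print(np.allclose(np.abs(v_hat), np.abs(Vt[i])))  # True (up to sign)
```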

14 Theorem 1 Implications
Theorem 1 connects PCA with a regression method
Because λ > 0, the ridge regression gives a unique solution in all situations, including p > n
Extending this connection to the naïve elastic net lets us flexibly choose a sparse approximation to the ith principal component

15 SPCA
Connecting PCA and regression, and using the lasso (elastic net) penalty to produce sparse loadings, gives the following criterion to optimize, where $x_i$ is the ith row of X:
$(\hat{A}, \hat{B}) = \arg\min_{A,B} \sum_{i=1}^{n} \|x_i - A B^T x_i\|^2 + \lambda \sum_{j=1}^{k} \|\beta_j\|^2 + \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_1 \quad \text{subject to } A^T A = I_k$
The normalized columns of $\hat{B}$ are the sparse loading vectors

16 General SPCA Algorithm
1. Let A start at V[, 1:k], the loadings of the first k ordinary principal components.
2. Given a fixed $A = [\alpha_1, \ldots, \alpha_k]$, solve the following elastic net problem for j = 1, 2, …, k:
$\beta_j = \arg\min_{\beta} (\alpha_j - \beta)^T X^T X (\alpha_j - \beta) + \lambda \|\beta\|^2 + \lambda_{1,j} \|\beta\|_1$
3. For a fixed $B = [\beta_1, \ldots, \beta_k]$, compute the SVD of $X^T X B = UDV^T$, then update $A = UV^T$.
4. Repeat steps 2-3 until convergence.
5. Normalization: $\hat{V}_j = \beta_j / \|\beta_j\|$, j = 1, …, k. (A compact sketch of this loop follows.)
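A compact Python sketch of the alternating loop above, reusing scikit-learn's ElasticNet for step 2; the knobs lam_ridge and lam1 are illustrative and do not match the paper's (λ, λ1,j) parametrization exactly:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def spca(X, k, lam_ridge=1e-3, lam1=0.1, n_iter=100, tol=1e-6):
    X = X - X.mean(axis=0)                       # center the data
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    A = Vt[:k].T                                 # step 1: ordinary loadings
    B = A.copy()
    for _ in range(n_iter):
        B_old = B.copy()
        # Step 2: one elastic net regression of X a_j on X per component.
        for j in range(k):
            enet = ElasticNet(alpha=lam1 + lam_ridge,
                              l1_ratio=lam1 / (lam1 + lam_ridge),
                              fit_intercept=False, max_iter=50_000)
            B[:, j] = enet.fit(X, X @ A[:, j]).coef_
        # Step 3: Procrustes update of A from the SVD of X^T X B.
        U, _, Wt = np.linalg.svd(X.T @ (X @ B), full_matrices=False)
        A = U @ Wt
        if np.max(np.abs(B - B_old)) < tol:      # step 4: check convergence
            break
    norms = np.linalg.norm(B, axis=0)            # step 5: normalize columns
    norms[norms == 0] = 1.0
    return B / norms

# Usage: sparse loadings for the first 3 components of a random matrix.
X = np.random.default_rng(4).normal(size=(50, 10))
print(np.round(spca(X, k=3), 2))
```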

17 Remarks about the General SPCA Algorithm
Empirically, the output does not change much as λ varies
If n > p, λ can simply default to zero; a small positive λ overcomes collinearity problems in X
The algorithm converges quickly
Try multiple combinations of {λ1,j} and choose values that give an acceptable compromise between variance and sparsity, prioritizing variance

18 Adjusted Total Variance
Because the modified PCs $\hat{Z} = X\hat{V}$ can be correlated, take the correlations into account via a QR decomposition: write $\hat{Z} = QR$ with Q orthonormal and R upper triangular; the adjusted total variance is $\sum_{j=1}^{k} R_{jj}^2$ (sketched below)
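A short numpy sketch of this adjustment (the function name is illustrative):

```python
# Adjusted total variance of possibly correlated modified PCs Z = X V:
# project onto an orthogonal basis via QR so shared variance is not
# double counted, then sum the squared diagonal of R.
import numpy as np

def adjusted_total_variance(X, V):
    Z = (X - X.mean(axis=0)) @ V        # modified (possibly correlated) PCs
    _, R = np.linalg.qr(Z)
    # Same sum-of-squares scale as the slides; divide by n - 1 for
    # the sample-variance scale.
    return np.sum(np.diag(R) ** 2)
```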

19 Computational Complexity
When n > p and p ≥ k, the total cost is at most $np^2 + m \cdot O(p^3)$, where m is the number of iterations before convergence and $O(p^3)$ bounds the cost of each elastic net solution
SPCA is therefore efficient for huge n and small p (say, p < 100)
When p ≫ n, the total cost is of order $mk \cdot O(pJn + J^3)$ for a positive finite λ, where J is the number of nonzero coefficients; this is expensive for large J and p
The elastic net solutions dominate the cost, so a special algorithm is needed for this type of data (next slides)

20 SPCA for p ≫ n
Theorem 5: As $\lambda \to \infty$, the normalized elastic net estimate $\hat{\beta}/\|\hat{\beta}\|$ converges to the soft-thresholding solution with components
$\bar{v}_j \propto \left( |\alpha^T X^T x_j| - \tfrac{\lambda_1}{2} \right)_+ \operatorname{sign}(\alpha^T X^T x_j), \quad j = 1, \ldots, p,$
where $x_j$ is the jth column of X

21 SPCA for p ≫ n Continued
Using Theorem 5, replace step 2 in the general SPCA algorithm with soft thresholding (sketched below). Step 2: for j = 1, 2, …, k,
$\beta_j = \left( |X^T X \alpha_j| - \tfrac{\lambda_{1,j}}{2} \right)_+ \operatorname{sign}(X^T X \alpha_j)$, applied componentwise
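A numpy sketch of this soft-thresholding update (the function name is illustrative); no elastic net solver is needed, only $X^T X A$ and the λ1 values:

```python
import numpy as np

def soft_threshold_step(X, A, lam1):
    # For each component j: beta_j = (|X^T X a_j| - lam1_j/2)_+ * sign(.)
    # lam1 may be a scalar or a length-k array (one value per component).
    G = X.T @ (X @ A)                   # p x k matrix of inner products
    return np.sign(G) * np.maximum(np.abs(G) - np.asarray(lam1) / 2, 0.0)
```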

22 Pitprops Data
180 observations with 13 measured variables
A classic example showing the difficulty of interpreting PCs
Set λ = 0 and λ1 = (0.06, 0.16, 0.1, 0.5, 0.5, 0.5), one value per component
Values were chosen so each sparse approximation explains almost the same amount of variance as the corresponding ordinary PC

23 Pitprops Data Continued
The PCs from SPCA account for 75.8% of the variance, versus 69.3% for SCoTLASS
SPCA is also sparser, and its computation completed in seconds
In terms of explained variance, SCoTLASS, simple thresholding, and SPCA are increasingly better

24 Synthetic Data
Three hidden factors drive 10 observable variables
The exact covariance matrix was used to perform PCA, SPCA, and simple thresholding
Because of the way the data were constructed, there is a known "correct" sparse representation
SPCA and SCoTLASS both produce the ideal sparse PCs; both use the lasso penalty
Simple thresholding incorrectly identifies which variables are most important, and the variance it explains is lower than SPCA's

25 Ramaswamy Data
p = 16,063 genes and n = 144 samples
The goal is to find the set of genes that are biologically relevant to the outcome; PCA has been a popular tool for this analysis
If a sparse principal component can explain a large part of the total variance of gene expression levels, then the subset of genes representing that component is considered important
Apply SPCA with λ = ∞ (the soft-thresholding variant)

26 Ramaswamy Data Continued
SCoTLASS cannot be used to find sparse PCs here because of its computational cost
Simple thresholding always explains slightly higher variance than SPCA does for the same number of genes, but the two methods select about 2% different genes
This difference is consistent across sparsity levels

27 Discussion
A good method for achieving sparseness should possess three properties:
1. Without any sparsity constraint, the method reduces to PCA
2. It is computationally efficient for both small-p and big-p data
3. It avoids misidentifying the important variables
The simple thresholding approach is not criterion based, but it has properties 1 and 2, making it a benchmark for any potentially better method

28 Discussion Continued
SCoTLASS: derives sparse loadings, but is not computationally efficient, lacks an adequate rule for choosing its tuning parameter, and cannot be applied to gene expression arrays
SPCA: computationally efficient, explains a high proportion of variance, and identifies the important variables


30 Questions?

