Lecture 7: Principal component analysis (PCA)

1 Lecture 7: Principal component analysis (PCA)
Rationale and use of PCA
The underlying model (what is a principal component anyway?)
Eigenvectors and eigenvalues of the sample covariance matrix revisited!
PCA scores and loadings
The use of and rationale for rotations
Orthogonal and oblique rotations
Component retention, significance, and reliability
Bio 8100s Applied Multivariate Biostatistics 2001

2 What is PCA? From a set of p variables X1, X2, …, Xp, we try to find (“extract”) a set of ordered indices Z1, Z2, …, Zp that are uncorrelated and ordered in terms of their variability: Var(Z1) ≥ Var(Z2) ≥ … ≥ Var(Zp). Because the Zi (principal components) are uncorrelated, they measure different “dimensions” in the data. The hope (sometimes faint) is that most of the variability in the original set of p variables will be accounted for by c < p components.

3 Why use PCA? PCA is generally used to reduce the number of variables considered in subsequent analyses, i.e. to reduce the “dimensionality” of the data. Examples include: reducing the number of dependent variables in MANOVA, multivariate regression, correlation analysis, etc.; reducing the number of independent variables (predictors) in regression analysis.

4 Estimating principal components
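The first principal component is obtained by “fitting” (i.e. estimating the coefficients of) the linear function $Z_1 = a_{11}X_1 + a_{12}X_2 + \dots + a_{1p}X_p$ which maximizes Var(Z1), subject to: $a_{11}^2 + a_{12}^2 + \dots + a_{1p}^2 = 1$.
The second principal component is obtained by “fitting” (i.e. estimating the coefficients of) the linear function $Z_2 = a_{21}X_1 + a_{22}X_2 + \dots + a_{2p}X_p$ which maximizes Var(Z2), subject to: $a_{21}^2 + a_{22}^2 + \dots + a_{2p}^2 = 1$ and $\mathrm{Cov}(Z_1, Z_2) = 0$.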

5 Estimating principal components (cont’d)
The third principal component is obtained by “fitting” (i.e. estimating the coefficients of) the linear function $Z_3 = a_{31}X_1 + a_{32}X_2 + \dots + a_{3p}X_p$ which maximizes Var(Z3), subject to: $a_{31}^2 + a_{32}^2 + \dots + a_{3p}^2 = 1$, as well as the additional constraints $\mathrm{Cov}(Z_1, Z_3) = 0$ and $\mathrm{Cov}(Z_2, Z_3) = 0$.
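The constrained maximization described on these slides can be sketched for two variables. The covariance entries below are hypothetical, and the brute-force angle search is only an illustration of the idea (in practice the coefficients come from an eigendecomposition). Writing the unit-length coefficient vector as (cos θ, sin θ) enforces the sum-of-squares constraint automatically.

```python
import math

def projected_variance(c11, c12, c22, theta):
    """Var(Z) for Z = a1*X1 + a2*X2 with (a1, a2) = (cos theta, sin theta)."""
    a1, a2 = math.cos(theta), math.sin(theta)
    return a1 * a1 * c11 + 2 * a1 * a2 * c12 + a2 * a2 * c22

def first_component_by_search(c11, c12, c22, steps=3600):
    """Scan unit-length coefficient vectors and keep the one maximizing Var(Z)."""
    best_theta = max(
        (math.pi * k / steps for k in range(steps)),
        key=lambda t: projected_variance(c11, c12, c22, t),
    )
    coefficients = (math.cos(best_theta), math.sin(best_theta))
    return coefficients, projected_variance(c11, c12, c22, best_theta)

def largest_eigenvalue_2x2(c11, c12, c22):
    """Closed-form largest eigenvalue of a symmetric 2 x 2 matrix."""
    mean = (c11 + c22) / 2
    half_gap = math.hypot((c11 - c22) / 2, c12)
    return mean + half_gap

# Hypothetical covariance matrix entries: Var(X1), Cov(X1, X2), Var(X2)
C11, C12, C22 = 4.0, 1.5, 1.0

a1_vec, var_z1 = first_component_by_search(C11, C12, C22)
lam1 = largest_eigenvalue_2x2(C11, C12, C22)
print("coefficients:", a1_vec)
print("Var(Z1) from search:", var_z1, "largest eigenvalue:", lam1)
```

The maximized variance agrees with the largest eigenvalue of the covariance matrix up to the resolution of the search grid.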

6 Estimating principal components
Estimation of the coefficients for each principal component can be accomplished through several different methods (e.g. least-squares estimation, maximum likelihood estimation, iterated principal axis, etc.). The extracted principal components may differ depending on the method of estimation.

7 The geometry of principal components
Principal components (Zi) are linear functions of the original variables, and as such, define hyperplanes in the p-dimensional space of Z and the original variables. Because the Zi are uncorrelated, these planes meet at right angles. [Figure: data plotted on axes X1 and X2, with the principal component axes overlaid.]

8 Multivariate variance: a geometric interpretation
Univariate variance is a measure of the “volume” occupied by sample points in one dimension. Multivariate variance involving p variables is the volume occupied by sample points in a p-dimensional space. [Figure: panels contrasting larger and smaller variance, in one dimension (X) and in two dimensions (X1, X2), with the occupied volume marked.]

9 Multivariate variance: effects of correlations among variables
Correlations between pairs of variables reduce the volume occupied by sample points, and hence reduce the multivariate variance. [Figure: three panels on axes X1 and X2 (no correlation, positive correlation, negative correlation) showing the occupied volume.]

10 C and the generalized multivariate variance
The determinant of the sample covariance matrix C is a generalized multivariate variance, because the squared area (area²) of a parallelogram with sides given by the individual standard deviations and angle determined by the correlation between variables equals the determinant of C.
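For two variables the parallelogram statement can be written out as follows. The symbols s1, s2 (sample standard deviations), r (sample correlation), and θ (the angle between the sides) are introduced here for illustration, with the assumption that the angle is defined by cos θ = r.

```latex
\det C =
\begin{vmatrix}
s_1^2 & r\,s_1 s_2 \\
r\,s_1 s_2 & s_2^2
\end{vmatrix}
= s_1^2 s_2^2 \,(1 - r^2)
= \left( s_1 s_2 \sin\theta \right)^2,
\qquad \cos\theta = r .
```

With r = 0 the parallelogram is a rectangle of area s1 s2, and as |r| approaches 1 the area (and the determinant) shrinks to zero.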

11 Eigenvalues and eigenvectors of C
Eigenvectors of the covariance matrix C are orthogonal directed line segments that “span” the variation in the data, and the corresponding (unsigned) eigenvalues are the lengths of these segments, so the product of the eigenvalues is the “volume” occupied by the data, i.e. the determinant of the covariance matrix. [Figure: three panels on axes X1 and X2: no correlation, positive correlation, negative correlation.]

12 The geometry of principal components (cont’d)
The coefficients (aij) of the principal components (Zi) define vectors in the space of coefficients. These vectors are the eigenvectors (ai) of the sample covariance matrix C, and the corresponding (unsigned) eigenvalues (λi) are the variances of each component, i.e. Var(Zi), and the product of the eigenvalues is the “volume” occupied by the data, i.e. the determinant of the covariance matrix. [Figure: eigenvectors a1 and a2, with lengths λ1 and λ2, drawn on axes X1 and X2 running from -1 to 1.]

13 Another important relationship!
The sum of the eigenvalues of the covariance matrix C equals the sum of the diagonal elements of C, i.e. the trace of C. So, the sum of the variances of the principal components equals the sum of the variances of the original variables.
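Both eigenvalue relationships (product equals determinant, sum equals trace) can be checked numerically. The sketch below uses two hypothetical variables and the closed-form eigenvalues of a symmetric 2 x 2 matrix; the data and function names are made up for illustration.

```python
import math

def sample_covariance_2d(x1, x2):
    """Return (c11, c12, c22) of the sample covariance matrix (n - 1 denominator)."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    c11 = sum((a - m1) ** 2 for a in x1) / (n - 1)
    c22 = sum((b - m2) ** 2 for b in x2) / (n - 1)
    c12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (n - 1)
    return c11, c12, c22

def eigenvalues_2x2(c11, c12, c22):
    """Eigenvalues of the symmetric matrix [[c11, c12], [c12, c22]], largest first."""
    mean = (c11 + c22) / 2
    half_gap = math.hypot((c11 - c22) / 2, c12)
    return mean + half_gap, mean - half_gap

# Hypothetical observations on two variables
x1 = [2.0, 4.0, 6.0, 8.0, 10.0]
x2 = [1.0, 3.0, 2.0, 5.0, 4.0]

c11, c12, c22 = sample_covariance_2d(x1, x2)
lam1, lam2 = eigenvalues_2x2(c11, c12, c22)
determinant = c11 * c22 - c12 ** 2
trace = c11 + c22

print("eigenvalues:", lam1, lam2)
print("product vs determinant:", lam1 * lam2, determinant)
print("sum vs trace:", lam1 + lam2, trace)
```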

14 Scale and the correlation matrix
Since variables may be measured on different scales, and we want to eliminate scale effects, we usually work with standardized values so that each variable is scaled to have zero mean and unit variance. The sample covariance matrix of standardized variables is the sample correlation matrix R.
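A minimal sketch of the standardization step, using hypothetical height and weight measurements: after scaling, each variable has unit variance and the covariance between the standardized variables equals the correlation between the original ones.

```python
import math

def covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def standardize(values):
    """Scale values to zero mean and unit sample variance."""
    mean = sum(values) / len(values)
    sd = math.sqrt(covariance(values, values))
    return [(v - mean) / sd for v in values]

# Hypothetical measurements on very different scales
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
weight_kg = [52.0, 61.0, 64.0, 79.0, 84.0]

r = covariance(height_cm, weight_kg) / math.sqrt(
    covariance(height_cm, height_cm) * covariance(weight_kg, weight_kg)
)

z_height = standardize(height_cm)
z_weight = standardize(weight_kg)
cov_z = covariance(z_height, z_weight)

print("correlation of raw variables:", r)
print("covariance of standardized variables:", cov_z)
```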

15 Principal component scores
Because principal components are functions, we can “plug in” the values for each variable for each observation, and calculate a PC score for each observation and each principal component.
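As a sketch of the “plug in” step, consider two standardized variables. For a 2 x 2 correlation matrix with positive correlation r, the eigenvectors are (1, 1)/√2 and (1, -1)/√2 with eigenvalues 1 + r and 1 - r, so the scores can be computed directly. The data are hypothetical and assumed to be positively correlated.

```python
import math

def covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def standardize(values):
    """Scale values to zero mean and unit sample variance."""
    mean = sum(values) / len(values)
    sd = math.sqrt(covariance(values, values))
    return [(v - mean) / sd for v in values]

# Hypothetical, positively correlated measurements
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
weight_kg = [52.0, 61.0, 64.0, 79.0, 84.0]

z1 = standardize(height_cm)
z2 = standardize(weight_kg)
r = covariance(z1, z2)  # correlation, because both variables are standardized

# Component coefficients (eigenvectors of the 2 x 2 correlation matrix)
s = 1 / math.sqrt(2)
scores_pc1 = [s * a + s * b for a, b in zip(z1, z2)]  # Z1 = a11*z1 + a12*z2
scores_pc2 = [s * a - s * b for a, b in zip(z1, z2)]  # Z2 = a21*z1 + a22*z2

print("Var(Z1):", covariance(scores_pc1, scores_pc1), "expected 1 + r:", 1 + r)
print("Var(Z2):", covariance(scores_pc2, scores_pc2), "expected 1 - r:", 1 - r)
print("Cov(Z1, Z2):", covariance(scores_pc1, scores_pc2))
```

Each observation gets one score per component; the scores on different components are uncorrelated, and their variances are the eigenvalues.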

16 Principal component loadings
Component loadings (Lij) are the covariances (correlations for standardized values) of the original variables used in the PCA with the components, and are proportional to the component coefficients (aij). For each component, the (loading)² for each variable summed over all variables equals the variance of the component.
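The loading relationships can be illustrated for two standardized variables, again with hypothetical, positively correlated data. The sketch computes loadings as correlations between each variable and each component's scores, then compares them with coefficient × √eigenvalue, which is the proportionality assumed here for a correlation-matrix PCA.

```python
import math

def covariance(x, y):
    """Sample covariance with the n - 1 denominator."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def correlation(x, y):
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))

def standardize(values):
    mean = sum(values) / len(values)
    sd = math.sqrt(covariance(values, values))
    return [(v - mean) / sd for v in values]

# Hypothetical, positively correlated measurements
z1 = standardize([150.0, 160.0, 170.0, 180.0, 190.0])
z2 = standardize([52.0, 61.0, 64.0, 79.0, 84.0])
r = correlation(z1, z2)

# Coefficients and eigenvalues of the 2 x 2 correlation matrix
s = 1 / math.sqrt(2)
coefficients = [(s, s), (s, -s)]
eigenvalues = [1 + r, 1 - r]

loadings = []
for a1, a2 in coefficients:
    scores = [a1 * u + a2 * v for u, v in zip(z1, z2)]
    loadings.append((correlation(z1, scores), correlation(z2, scores)))

for i, (row, lam) in enumerate(zip(loadings, eigenvalues), start=1):
    print(f"component {i}: loadings={row}, sum of squares={sum(l * l for l in row)}, eigenvalue={lam}")
```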

17 More on loadings Sometimes components have variables with similar loadings, which form a “natural” group. To assist in interpretation, we may want to choose another component frame which emphasizes these differences among groups. [Figure: factor plot, with FACTOR(2) as an axis label.]

18 Orthogonal rotations: varimax
Orthogonal (angle preserving): new (rotated) components are still uncorrelated. Varimax: rotation done so that each component loads high on a small number of variables and low on other variables (simplifies factors). [Figure: factor plots, unrotated and after varimax rotation, with FACTOR(2) as an axis label.]

19 Orthogonal rotations: quartimax
Orthogonal (angle preserving): new (rotated) components are still uncorrelated. Quartimax: rotation done so that each variable loads mainly on one factor (simplified variables). [Figure: factor plots, unrotated and after rotation, with FACTOR(2) as an axis label.]

20 Orthogonal rotations: Equamax
Orthogonal (angle preserving): new (rotated) components are still uncorrelated. Equamax: combines varimax and quartimax. The number of variables that load highly on a factor and the number of factors needed to explain a variable are both optimized. [Figure: factor plots, unrotated and after equamax rotation, with FACTOR(2) as an axis label.]

21 Oblique rotations, e.g. Oblimin
Oblique (non-angle preserving): new (rotated) components are now correlated. Most reasonable when significant intercorrelations among factors exist. [Figure: factor plots, unrotated and after oblimin rotation, with FACTOR(2) as an axis label.]

22 The consequences of rotation
Unrotated components are (1) uncorrelated; (2) ordered in terms of decreasing variance (i.e., Var(Z1) ≥ Var(Z2) ≥ …). Orthogonally rotated components are (1) still uncorrelated, but (2) need not be ordered in terms of decreasing variance (e.g. for Varimax rotation). Obliquely rotated components are (1) correlated; (2) unordered (in general).
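One consequence of an orthogonal rotation can be sketched with a hypothetical two-component loading matrix: the variance explained per variable (row sums of squared loadings) and the total are unchanged, but the variance is redistributed among the components, so the original ordering need not survive. The 45-degree angle is chosen by hand for illustration; it is not the output of a varimax or other criterion.

```python
import math

def rotate_loadings(loadings, angle_degrees):
    """Apply an orthogonal (rigid) rotation to a p x 2 loading matrix."""
    theta = math.radians(angle_degrees)
    c, s = math.cos(theta), math.sin(theta)
    return [(l1 * c + l2 * s, -l1 * s + l2 * c) for l1, l2 in loadings]

def column_sums_of_squares(loadings):
    """Variance attributed to each component."""
    return (
        sum(l1 ** 2 for l1, _ in loadings),
        sum(l2 ** 2 for _, l2 in loadings),
    )

def row_sums_of_squares(loadings):
    """Variance of each variable accounted for by the two components."""
    return [l1 ** 2 + l2 ** 2 for l1, l2 in loadings]

# Hypothetical unrotated loadings: four variables in two "natural" groups
unrotated = [(0.8, 0.4), (0.8, 0.4), (0.8, -0.4), (0.8, -0.4)]
rotated = rotate_loadings(unrotated, 45.0)

print("unrotated component variances:", column_sums_of_squares(unrotated))
print("rotated component variances:  ", column_sums_of_squares(rotated))
print("rotated loadings:", rotated)
```

After rotation each group loads mainly on one component (the sign of a component is arbitrary), and the two components carry equal variance instead of the unrotated 2.56 versus 0.64.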

23 The rotated pattern matrix for obliquely rotated factors
The elements of the matrix are analogous to standardized partial regression coefficients from a multiple regression analysis. So each element quantifies the importance of the variable in question to the component, once the effects of other variables are controlled. [Table: Rotated Pattern Matrix (OBLIMIN, Gamma = ); rows HEIGHT, ARM_SPAN, FOREARM, LOWERLEG, WEIGHT, BITRO, CHESTGIR, CHESTWID.]

24 The rotated structure matrix for obliquely rotated factors
The elements of the rotated structure matrix are the simple correlations of the variable in question with the factor, i.e. the component loadings. For orthogonal factors, the factor pattern and factor structure matrices are identical. [Table: rotated structure matrix; rows HEIGHT, ARM_SPAN, FOREARM, LOWERLEG, WEIGHT, BITRO, CHESTGIR, CHESTWID.]

25 Which rotation is the best?
Objective: find the rotation which achieves the simplest structure among component loadings, thereby making interpretation comparatively easy. Thurstone’s criteria, for p variables and m < p components: (1) each component should have at least m near-zero loadings; (2) few components should have non-zero loadings on the same variable.

26 A final word on rotations
“You cannot say that any rotation is better than any other rotation from a statistical point of view: all rotations are equally good statistically. Therefore, the choice among different rotations must be based on non-statistical grounds…” SAS/STAT User’s Guide, Vol. 1, p. 776.

27 How many components to retain in subsequent analysis?
Kaiser rule: retain only components with eigenvalues > 1. Scree test: plot eigenvalues against their ordinal numbers, and retain all components in the “steep descent” part of the curve. Variance rule: retain as many components as required to account for a specified amount of the total variance (e.g. 85%). [Figure: scree plot of eigenvalue against component number, with the Kaiser threshold marked.]
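The Kaiser rule and the cumulative-variance rule are easy to apply programmatically; the scree test is a visual judgment and is not sketched here. The eigenvalues below are hypothetical values for an eight-variable correlation matrix (they sum to 8), and the function names are made up for illustration.

```python
def kaiser_rule(eigenvalues):
    """Number of components with eigenvalue greater than 1."""
    return sum(1 for ev in eigenvalues if ev > 1.0)

def components_for_variance(eigenvalues, target_fraction):
    """Smallest number of components whose cumulative share of the
    total variance reaches target_fraction (e.g. 0.85)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, ev in enumerate(ordered, start=1):
        running += ev
        if running / total >= target_fraction:
            return count
    return len(ordered)

# Hypothetical eigenvalues of an 8 x 8 correlation matrix
eigenvalues = [4.67, 1.77, 0.48, 0.42, 0.23, 0.19, 0.14, 0.10]

print("Kaiser rule retains:", kaiser_rule(eigenvalues))
print("85% variance rule retains:", components_for_variance(eigenvalues, 0.85))
```

The two rules need not agree: here the first two components account for about 80% of the variance, so the 85% rule asks for one more component than the Kaiser rule.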

28 More on interpretation: the significance of loadings
Since loadings are correlation coefficients (r), we can test the null that each correlation equals zero. But analytic estimates of standard errors are often too small, especially for rotated loadings. So, as a rule of thumb, use double the critical value to test significance. E.g., for N = 100, r(α = 0.01) = 0.256, so “significant” factors have loadings greater than 2(0.256) = 0.512.
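The critical value of r can be obtained from the t distribution. The expression below assumes the usual two-tailed test of a correlation coefficient with N - 2 degrees of freedom, which the slide does not state explicitly; the symbol r_crit is introduced here for illustration.

```latex
r_{\mathrm{crit}} = \frac{t_{\alpha/2,\,N-2}}{\sqrt{t_{\alpha/2,\,N-2}^{2} + N - 2}},
\qquad
\text{loading } L_{ij} \text{ treated as significant if } |L_{ij}| > 2\, r_{\mathrm{crit}} .
```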

29 Component reliability: rules of thumb
The absolute magnitude and number of loadings are crucial for determining reliability. Components with at least 4 loadings of absolute value > 0.60, or with at least 3 loadings of absolute value > 0.80, are reliable. For N > 150, components with at least 10 loadings of absolute value > 0.40 are reliable.
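These rules of thumb translate directly into a small check. The function name and the example loadings are hypothetical, and a False result only means that none of the listed rules is met, not that the component is known to be unreliable.

```python
def meets_reliability_rule(loadings, sample_size):
    """Apply the three rules of thumb to one component's loadings."""
    magnitudes = [abs(value) for value in loadings]

    def count_above(threshold):
        return sum(1 for m in magnitudes if m > threshold)

    if count_above(0.60) >= 4:
        return True
    if count_above(0.80) >= 3:
        return True
    # The third rule applies only to larger samples (N > 150)
    return sample_size > 150 and count_above(0.40) >= 10

# Hypothetical loadings for three components
print(meets_reliability_rule([0.85, -0.82, 0.81, 0.20], sample_size=50))
print(meets_reliability_rule([0.65, 0.61, 0.30, 0.20], sample_size=500))
print(meets_reliability_rule([0.45] * 10, sample_size=200))
```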

30 PCA: the procedure
1. Calculate the sample covariance matrix or correlation matrix. If all variables are on the same scale, use the sample covariance matrix; otherwise use the correlation matrix.
2. Run PCA to extract unrotated components (“initial extraction”).
3. Decide which components to use in subsequent analysis based on the Kaiser rule, scree plots, etc.
4. Based on (3), rerun the analysis using different orthogonal and oblique rotations and compare using factor plots (“follow-up extraction”).

31 PCA: the procedure (cont’d)
5. For obliquely rotated components, calculate correlations among components. Small correlations suggest that orthogonal rotations are reasonable.
6. Evaluate statistical significance of component loadings obtained from the “best” rotation.
7. Check component reliability by redoing steps (1) - (6) with another (independent) data set, and compare the component loadings obtained from the two data sets. Are they close?

