Multivariate Statistical Methods
Principal Components Analysis (PCA)
By Jen-pei Liu, PhD
Division of Biometry, Department of Agronomy, National Taiwan University
and Division of Biostatistics and Bioinformatics, National Health Research Institutes
2019/2/24
Copyright by Jen-pei Liu, PhD
Principal Components Analysis
  - Introduction
  - Procedures
  - Properties
  - Examples
  - Summary
Introduction
  - Described by K. Pearson (1901); computing methods by Hotelling (1933)
  - Objective: to transform the original variables X1, …, Xp into index variables Z1, …, Zp
  - Z1, …, Zp are linear combinations of X1, …, Xp
  - Z1, …, Zp are uncorrelated and are ordered by importance
  - The aim is to describe the variation in the data
Introduction
  - Lack of correlation: the index variables measure different dimensions (domains) of the data
  - Lack of correlation: only the variances of the index variables need to be considered; covariances do not have to be taken into account
  - Ordering: Var(Z1) ≥ Var(Z2) ≥ … ≥ Var(Zp)
  - The Z index variables are called the principal components
Introduction
  - Most of the variation in the full data set can be adequately described by the first few Z index variables
  - The dimension is reduced from a two-digit number of variables to just 2 to 4 principal components
  - This requires high correlations among the original variables
Introduction
Correlations of Female Sparrows
                                      X1      X2      X3      X4      X5
  Total length (X1)                 1.000
  Alar length (X2)                  0.735   1.000
  Length of beak and head (X3)      0.662   0.674   1.000
  Length of humerus (X4)            0.645   0.769   0.763   1.000
  Length of keel of sternum (X5)    0.605   0.529   0.626   0.607   1.000
Introduction
Coefficients for Components
  Component   Variance      X1       X2       X3       X4       X5
      1         3.616     0.452    0.462    0.451    0.471    0.398
      2         0.532    -0.051    0.300    0.325    0.185   -0.877
      3         0.386     0.691    0.341   -0.455   -0.411   -0.179
      4         0.302    -0.420    0.548   -0.606    0.388    0.069
      5         0.165     0.374   -0.530   -0.343    0.652   -0.192
Introduction
  - Z1 = 0.452X1 + 0.462X2 + 0.451X3 + 0.471X4 + 0.398X5
  - The variance of Z1 is 3.62, which accounts for 72.3% (3.616/5.00) of the total variation
  - All coefficients of Z1 are smaller than 1, and the sum of their squares is equal to 1
  - Because the coefficients are all positive and of similar size, Z1 is in effect an average (or sum) of X1, X2, X3, X4, and X5
  - Z1 can be interpreted as an index of the size of the sparrow
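The two numerical claims about Z1 can be checked directly from the tabulated values. The snippet below is a minimal sketch using NumPy; the variable names (`a1`, `var_z1`, `sum_sq`, `share`) are illustrative and not part of the original slides.

```python
import numpy as np

# Coefficients of the first component and its variance, as quoted on the slide.
a1 = np.array([0.452, 0.462, 0.451, 0.471, 0.398])
var_z1 = 3.616
p = 5  # five standardized variables, so the total variance is p

# Sum of squared coefficients: close to 1, up to rounding of the coefficients.
sum_sq = float(np.sum(a1 ** 2))

# Share of the total variation carried by Z1.
share = var_z1 / p

print(f"sum of squared coefficients = {sum_sq:.3f}")
print(f"share of total variation    = {share:.1%}")
```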
Procedures
Data Structure
  Case    X1     X2     …    Xp
  1       x11    x12    …    x1p
  2       x21    x22    …    x2p
  ⋮
  n       xn1    xn2    …    xnp
Procedures
The First Component
  - The first component is a linear combination of X1, X2, …, Xp:
    $Z_1 = a_{11}X_1 + a_{12}X_2 + \dots + a_{1p}X_p$
  - Var(Z1) is as large as possible subject to the condition that
    $a_{11}^2 + a_{12}^2 + \dots + a_{1p}^2 = 1$
Procedures
The Second Component
  - The second component is also a linear combination of X1, X2, …, Xp:
    $Z_2 = a_{21}X_1 + a_{22}X_2 + \dots + a_{2p}X_p$
  - Var(Z2) is as large as possible subject to the conditions that
    $a_{21}^2 + a_{22}^2 + \dots + a_{2p}^2 = 1$
    and that Z1 and Z2 are not correlated
  - Var(Z2) is therefore the second largest variance
Procedures
The Third Component
  - The third component is also a linear combination of X1, X2, …, Xp:
    $Z_3 = a_{31}X_1 + a_{32}X_2 + \dots + a_{3p}X_p$
  - Var(Z3) is as large as possible subject to the conditions that
    $a_{31}^2 + a_{32}^2 + \dots + a_{3p}^2 = 1$
    and that Z1, Z2, and Z3 are not correlated
  - Var(Z3) is therefore the third largest variance
Procedures
  - Continue until all p principal components are computed
  - The components are computed from the covariance matrix of the p variables
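A short derivation may help show why the covariance matrix enters. It is a sketch under assumed notation: $C$ stands for the covariance matrix of X1, …, Xp, $a_1 = (a_{11}, \dots, a_{1p})'$ for the coefficient vector of the first component, and $\lambda$ for a Lagrange multiplier; none of these symbols is fixed by the slides at this point.

```latex
\begin{aligned}
\text{maximize } \operatorname{Var}(Z_1) &= a_1' C a_1
  \quad \text{subject to } a_1' a_1 = 1, \\
L(a_1, \lambda) &= a_1' C a_1 - \lambda\,(a_1' a_1 - 1), \\
\frac{\partial L}{\partial a_1} &= 2 C a_1 - 2 \lambda a_1 = 0
  \;\Longrightarrow\; C a_1 = \lambda a_1, \\
\operatorname{Var}(Z_1) &= a_1' C a_1 = \lambda\, a_1' a_1 = \lambda .
\end{aligned}
```

So $a_1$ must be an eigenvector of $C$, and the variance of the component equals the corresponding eigenvalue; choosing the largest eigenvalue gives the first component. The later components follow in the same way with the added requirement of being uncorrelated with the earlier ones.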
Procedures
  - Different variables might have different units and magnitudes
  - PCA might be influenced by these magnitudes and units
  - Remedy: standardize each variable to have zero mean and unit variance
  - The covariance matrix of the standardized variables is the correlation matrix
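The last point can be verified numerically. The following is a minimal sketch with NumPy on made-up data; the array `X` and the scale factors are hypothetical and chosen only to give the columns very different magnitudes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 cases, 3 variables on very different scales.
scales = np.array([1.0, 10.0, 100.0])
shifts = np.array([5.0, 50.0, 500.0])
X = rng.normal(size=(50, 3)) * scales + shifts

# Standardize each column to zero mean and unit sample variance (ddof=1).
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Covariance of the standardized data versus correlation of the raw data.
cov_of_standardized = np.cov(X_std, rowvar=False)
corr_of_original = np.corrcoef(X, rowvar=False)

print(np.allclose(cov_of_standardized, corr_of_original))
```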
Procedures
Steps of PCA
  - Standardize the variables X1, X2, …, Xp to have zero means and unit variances, unless the importance of the variables is reflected in their variances
  - Calculate the covariance matrix (the correlation matrix if the variables were standardized)
Procedures
Steps of PCA
  - Find the eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p$ and their corresponding eigenvectors $a_1, a_2, \dots, a_p$
  - The coefficients of the ith principal component $Z_i$ are the elements of $a_i$, and $\lambda_i$ is the variance of $Z_i$
  - Discard any components that account for only a small proportion of the variation in the data
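The steps can be sketched in a few lines of NumPy. This is one possible illustration, not the procedure used to produce the results in these slides; the function name `pca_from_data` and the demonstration data `X_demo` are hypothetical.

```python
import numpy as np

def pca_from_data(X, standardize=True):
    """Return eigenvalues, coefficient matrix A (row i = component i), and scores."""
    X = np.asarray(X, dtype=float)
    # Step 1: center, and optionally scale to unit sample variance.
    Xc = X - X.mean(axis=0)
    if standardize:
        Xc = Xc / X.std(axis=0, ddof=1)
    # Step 2: covariance matrix (equals the correlation matrix if standardized).
    S = np.cov(Xc, rowvar=False)
    # Step 3: eigen-decomposition of the symmetric matrix, sorted in decreasing order.
    eigvals, eigvecs = np.linalg.eigh(S)
    order = np.argsort(eigvals)[::-1]
    eigvals = eigvals[order]
    A = eigvecs[:, order].T
    # Component scores: each column is one Z_i evaluated for every case.
    scores = Xc @ A.T
    return eigvals, A, scores

# Made-up correlated data, for demonstration only.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
eigvals, A, scores = pca_from_data(X_demo)

print(np.round(eigvals, 3))             # variances of Z1..Z4, largest first
print(np.round(eigvals.sum(), 3))       # equals p when standardized
```

Deciding which components to discard is left to the user; the function returns all p of them.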
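Properties
Let Z = AX, where row i of A is the eigenvector $a_i'$, $\mu$ is the mean vector of X, and $C = (c_{jk})$ is the covariance matrix of X. Then:
  - $E(Z) = A\mu$
  - $V(Z) = A C A' = \mathrm{diag}\{\lambda_i,\ i = 1, \dots, p\}$
  - $\mathrm{Cov}(Z_i, X_j) = a_{ij}\lambda_i$
  - $\mathrm{Corr}(Z_i, X_j) = a_{ij}\sqrt{\lambda_i} / \sqrt{c_{jj}}$
  - $\mathrm{Corr}(Z_i, X_j) = a_{ij}\sqrt{\lambda_i}$ if the correlation matrix is used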
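These properties can be checked on simulated data. The sketch below uses NumPy with a made-up data matrix; `C`, `lam`, and `A` mirror the covariance matrix, eigenvalues, and coefficient matrix, and the mixing matrix used to create correlated columns is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(2)

# Made-up correlated data: 300 cases, 3 variables.
mix = np.array([[2.0, 0.5, 0.0],
                [0.0, 1.0, 0.3],
                [0.0, 0.0, 0.5]])
X = rng.normal(size=(300, 3)) @ mix
Xc = X - X.mean(axis=0)

# Covariance matrix, eigenvalues (largest first), and coefficients (row i = a_i).
C = np.cov(Xc, rowvar=False)
lam, vecs = np.linalg.eigh(C)
order = np.argsort(lam)[::-1]
lam, A = lam[order], vecs[:, order].T
Zs = Xc @ A.T  # component scores

# V(Z) should be diagonal with the eigenvalues on the diagonal.
cov_Z = np.cov(Zs, rowvar=False)

# Corr(Z_i, X_j): empirical value versus a_ij * sqrt(lambda_i) / sqrt(c_jj).
p = X.shape[1]
empirical = np.corrcoef(np.hstack([Zs, Xc]), rowvar=False)[:p, p:]
formula = A * np.sqrt(lam)[:, None] / np.sqrt(np.diag(C))[None, :]

print(np.allclose(cov_Z, np.diag(lam)))
print(np.allclose(empirical, formula))
```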
Examples
Determination of the number of principal components
  - Depends upon the needs of practitioners
  - The proportion of the total variation explained by the selected principal components should be high, e.g., at least 80%
  - If the correlation matrix is used, select the principal components with variance greater than 1, because each of them accounts for more variation than any single original variable (variance = 1)
  - Use a scree plot
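The first two numerical rules are easy to express in code. The sketch below applies them to the component variances from the sparrow table; the function names are hypothetical, and the 80% threshold is only the example value mentioned above.

```python
import numpy as np

def n_components_by_share(variances, threshold=0.80):
    """Smallest k whose first k components reach the given share of total variance.

    Assumes the variances are already sorted from largest to smallest.
    """
    v = np.asarray(variances, dtype=float)
    cum_share = np.cumsum(v) / v.sum()
    return int(np.argmax(cum_share >= threshold) + 1)

def n_components_by_unit_variance(variances):
    """Number of components with variance greater than 1 (correlation-matrix PCA)."""
    return int(np.sum(np.asarray(variances, dtype=float) > 1.0))

# Component variances from the sparrow example.
sparrow_variances = [3.616, 0.532, 0.386, 0.302, 0.165]

n_share = n_components_by_share(sparrow_variances)
n_unit = n_components_by_unit_variance(sparrow_variances)
print(n_share, n_unit)
```

The two rules need not agree, which is one reason the choice ultimately depends on the practitioner's needs.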
Examples
Evaluation of Statistics Course
  - 16 students rated 11 items (variables)
  - Evaluation scales: 1 (poor or not at all) to 5 (excellent, strongly, or difficult)
  - The first two principal components explain 76.0% of the total variation, and the last four principal components explain only 2.2%
Examples
Test scores of 10 students in 4 subjects
                              Student
  Subject          1    2    3    4    5    6    7    8    9   10
  Chinese (X1)    85   90   60   70   68   77   50   80   85   55
  English (X2)    76   95   45   65   56   80   30   70   75   60
  Math (X3)       60   80   38   60   70   65   40   60   65   40
  Social (X4)     85   72   80   76   70   68   80   66   84   50
Source: Shen (1998)
Examples
Correlation Matrix
          X1        X2        X3        X4
  X1    1        0.8846    0.8375    0.2784
  X2             1         0.8059   -0.1101
  X3                       1         0.1118
  X4                                 1
Examples
Eigenvalues and Eigenvectors
                                          Eigenvector
  Eigenvalue    Prop.    Cum. Prop.      X1        X2        X3        X4
    2.70159    0.6754     0.6754       0.5897    0.5682    0.5657    0.0969
    1.06380    0.2660     0.9414       0.1254   -0.2651   -0.0281    0.9556
    0.19870    0.0497     0.9910       0.3592    0.4378   -0.8227    0.0501
    0.03591    0.0090     1.0000      -0.7124    0.6444    0.0485    0.2737
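The eigenvalues and proportions can be recomputed from the correlation matrix of the four subjects. The sketch below uses NumPy's `numpy.linalg.eigh`; it starts from the rounded correlations shown in the slides, so agreement with the table is expected only to about three decimal places, and the sign of each eigenvector is arbitrary.

```python
import numpy as np

# Correlation matrix of Chinese, English, Math, Social (symmetric fill-in
# of the upper-triangular values shown in the slides).
R = np.array([
    [1.0,     0.8846,  0.8375,  0.2784],
    [0.8846,  1.0,     0.8059, -0.1101],
    [0.8375,  0.8059,  1.0,     0.1118],
    [0.2784, -0.1101,  0.1118,  1.0],
])

# eigh returns eigenvalues in ascending order; reverse to get largest first.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]      # column i holds the coefficients of Z_{i+1}

prop = eigvals / eigvals.sum()   # proportion of total variation
cum_prop = np.cumsum(prop)       # cumulative proportion

for lam, pr, cp in zip(eigvals, prop, cum_prop):
    print(f"{lam:8.5f}  {pr:6.4f}  {cp:6.4f}")
```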
Examples
  - Because the first two principal components account for 94.14% of the total variation, these two components are sufficient
  - The first principal component can be interpreted as an index of the sum of Chinese, English, and math
  - The second principal component can be interpreted as an index of social science
Examples
  - The above results can also be obtained by inspecting the correlation matrix
  - Correlations among Chinese, English, and math exceed 0.8
  - Correlations of Chinese, English, and math with social science are below 0.3
Examples
Correlations between the first principal component and the original variables
  $\mathrm{Corr}(Z_1, X_1) = a_{11}\sqrt{\lambda_1} = 0.5897\sqrt{2.70159} = 0.9692$
  $\mathrm{Corr}(Z_1, X_2) = a_{12}\sqrt{\lambda_1} = 0.5682\sqrt{2.70159} = 0.9339$
  $\mathrm{Corr}(Z_1, X_3) = a_{13}\sqrt{\lambda_1} = 0.5657\sqrt{2.70159} = 0.9298$
  $\mathrm{Corr}(Z_1, X_4) = a_{14}\sqrt{\lambda_1} = 0.0969\sqrt{2.70159} = 0.1592$
Examples
Correlations between the second principal component and the original variables
  $\mathrm{Corr}(Z_2, X_1) = a_{21}\sqrt{\lambda_2} = 0.1254\sqrt{1.0638} = 0.1294$
  $\mathrm{Corr}(Z_2, X_2) = a_{22}\sqrt{\lambda_2} = -0.2651\sqrt{1.0638} = -0.2734$
  $\mathrm{Corr}(Z_2, X_3) = a_{23}\sqrt{\lambda_2} = -0.0281\sqrt{1.0638} = -0.0290$
  $\mathrm{Corr}(Z_2, X_4) = a_{24}\sqrt{\lambda_2} = 0.9556\sqrt{1.0638} = 0.9856$
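All eight correlations follow from one elementwise product of the coefficients with the square roots of the eigenvalues. The snippet is a minimal NumPy sketch; the array names are illustrative, and small differences in the last decimal place can arise because the coefficients are rounded.

```python
import numpy as np

# Eigenvalues of the first two components (correlation-matrix PCA).
eigvals = np.array([2.70159, 1.06380])

# Coefficients a_ij: row 0 for Z1, row 1 for Z2; columns are X1..X4.
A = np.array([
    [0.5897,  0.5682,  0.5657, 0.0969],
    [0.1254, -0.2651, -0.0281, 0.9556],
])

# Corr(Z_i, X_j) = a_ij * sqrt(lambda_i) when the correlation matrix is used.
corr_zx = A * np.sqrt(eigvals)[:, None]

print(np.round(corr_zx, 4))
```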
Examples
Principal component scores
  Student   1st Component   2nd Component
     1         0.91883         1.12685
     2         2.58868        -0.41488
     3        -1.85920         0.84509
     4         0.03527         0.23932
     5         0.01741        -0.21745
     6         0.92643        -0.65337
     7        -2.67248         0.96553
     8         0.52758        -0.65459
     9         1.32646         0.92471
    10        -1.80897        -0.16121
Examples
Correlations of Female Sparrows
                                      X1      X2      X3      X4      X5
  Total length (X1)                 1.000
  Alar length (X2)                  0.735   1.000
  Length of beak and head (X3)      0.662   0.674   1.000
  Length of humerus (X4)            0.645   0.769   0.763   1.000
  Length of keel of sternum (X5)    0.605   0.529   0.626   0.607   1.000
Examples
Coefficients for Components
  Component   Variance      X1       X2       X3       X4       X5
      1         3.616     0.452    0.462    0.451    0.471    0.398
      2         0.532    -0.051    0.300    0.325    0.185   -0.877
      3         0.386     0.691    0.341   -0.455   -0.411   -0.179
      4         0.302    -0.420    0.548   -0.606    0.388    0.069
      5         0.165     0.374   -0.530   -0.343    0.652   -0.192
Examples
  - The first principal component: an index of bird size
    Z1 = 0.452X1 + 0.462X2 + 0.451X3 + 0.471X4 + 0.398X5
  - The second principal component: an index of bird shape
    Z2 = -0.051X1 + 0.300X2 + 0.325X3 + 0.185X4 - 0.877X5
Examples
  - The value of the first principal component for the first bird (using its standardized measurements):
    Z1 = 0.452(-0.542) + 0.462(0.725) + 0.451(0.177) + 0.471(0.055) + 0.398(-0.33) = 0.064
  - The value of the second principal component for the first bird:
    Z2 = -0.051(-0.542) + 0.300(0.725) + 0.325(0.177) + 0.185(0.055) + (-0.877)(-0.33) = 0.602
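Each score is a dot product of a coefficient vector with the bird's measurements, so both values come from one matrix-vector product. The snippet is a minimal NumPy sketch using the numbers quoted above; the names `A`, `x_bird1`, and `z` are illustrative.

```python
import numpy as np

# Coefficients of the first two components (rows) for X1..X5 (columns).
A = np.array([
    [ 0.452, 0.462, 0.451, 0.471,  0.398],
    [-0.051, 0.300, 0.325, 0.185, -0.877],
])

# Measurements of the first bird, as used in the calculation above.
x_bird1 = np.array([-0.542, 0.725, 0.177, 0.055, -0.33])

# Scores on Z1 and Z2 for this bird.
z = A @ x_bird1

print(np.round(z, 3))
```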
Examples
  Principal          Mean                  Standard Deviation
  component   Survivor   Nonsurvivor    Survivor   Nonsurvivor
      1        -0.100       0.075         1.506       2.176
      2         0.004      -0.003         0.684       0.776
      3        -0.140       0.105         0.522       0.677
      4         0.073      -0.055         0.563       0.543
      5         0.023      -0.017         0.411       0.408
Examples
Employment in European Countries
          AGR      MIN      MAN      PS       CON      SER      FIN      SPC      TC
  AGR    1.000
  MIN    0.316    1.000
  MAN   -0.254   -0.672    1.000
  PS    -0.382   -0.387    0.388    1.000
  CON   -0.349   -0.129   -0.034    0.165    1.000
  SER   -0.605   -0.407   -0.033    0.155    0.473    1.000
  FIN   -0.176   -0.248   -0.274    0.094   -0.018    0.379    1.000
  SPC   -0.811   -0.316    0.050    0.238    0.072    0.388    0.166    1.000
  TC    -0.487    0.045    0.243    0.105   -0.055   -0.085   -0.391    0.475    1.000
Examples
  - The 9 eigenvalues: 3.112 (34.6%), 1.809 (20.1%), 1.496 (16.6%), 1.063 (11.8%), 0.710 (7.9%), 0.311 (3.5%), 0.293 (3.3%), 0.204 (2.4%), and 0 (0.0%)
  - The employment percentages in each country sum to 100%
  - The columns of the correlation matrix are therefore linearly dependent
  - As a result, the last eigenvalue is 0
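The zero eigenvalue is a general consequence of the sum constraint, not a peculiarity of this data set. The sketch below illustrates it with simulated percentages; the data are made up (drawn from a Dirichlet distribution purely for convenience) and are not the European employment figures.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: 30 "countries", 9 sector percentages summing to 100 per row.
shares = rng.dirichlet(alpha=np.ones(9), size=30) * 100.0

# Correlation matrix of the 9 columns and its eigenvalues, largest first.
R = np.corrcoef(shares, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]

print(np.round(eigvals, 4))
print(np.isclose(eigvals[-1], 0.0, atol=1e-8))  # smallest eigenvalue is numerically zero
```

Because one column is determined by the other eight, at most eight components carry any variation.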
Examples
  - Selecting the principal components with eigenvalues greater than 1 gives the first 4 principal components, which explain about 83% of the total variation in the data (34.6% + 20.1% + 16.6% + 11.8%)
  - The first two principal components alone account for only 55% of the total variation
Examples
  - The first principal component:
    Z1 = 0.51(AGR) + 0.37(MIN) - 0.25(MAN) - 0.31(PS) - 0.22(CON) - 0.38(SER) - 0.13(FIN) - 0.42(SPC) - 0.21(TC)
  - A contrast between AGR (agriculture, forestry, and fishing) and MIN (mining and quarrying) versus the other sectors
Examples
  - The second principal component:
    Z2 = -0.02(AGR) + 0.00(MIN) + 0.43(MAN) + 0.11(PS) - 0.24(CON) - 0.41(SER) - 0.55(FIN) + 0.05(SPC) + 0.52(TC)
  - A contrast between MAN (manufacturing) and TC (transport and communication) versus CON (construction), SER (service industry), and FIN (finance)
Summary
  - Each principal component is a linear combination of the original variables
  - PCA tries to reduce a large number of variables to a few index variables
  - The index variables are uncorrelated and ordered by the magnitude of their variation
  - Illustration with real examples