
1 Principal Components Analysis

2 Principal Components Analysis (PCA) A multivariate technique whose central aim is to reduce the dimensionality of a multivariate data set while accounting for as much of the original variation present in the data as possible. The basic goal of PCA is to describe the variation in a set of correlated variables, $X^T = (X_1, \ldots, X_q)$, in terms of a new set of uncorrelated variables, $Y^T = (Y_1, \ldots, Y_q)$, each of which is a linear combination of the X variables. The principal components $Y_1, \ldots, Y_q$ are ordered so that they account for decreasing amounts of the variation in the original data.

3 Principal Components Analysis (PCA) Principal components analysis is most commonly used for constructing an informative graphical representation of the data. Principal components may be useful when:
- there are too many explanatory variables relative to the number of observations;
- the explanatory variables are highly correlated.

4 Principal Components Analysis (PCA) The first principal component is the linear combination of the variables $X_1, X_2, \ldots, X_q$,

$Y_1 = a_{11}X_1 + a_{12}X_2 + \cdots + a_{1q}X_q,$

that accounts for as much as possible of the variation in the original data among all linear combinations of $X_1, \ldots, X_q$, subject to the constraint $a_{11}^2 + a_{12}^2 + \cdots + a_{1q}^2 = 1$.

5 Principal Components Analysis (PCA) The second principal component, $Y_2 = a_{21}X_1 + a_{22}X_2 + \cdots + a_{2q}X_q$, accounts for as much as possible of the remaining variation, subject to the constraint $a_{21}^2 + \cdots + a_{2q}^2 = 1$ and the requirement that $Y_1$ and $Y_2$ are uncorrelated.

6 Principal Components Analysis (PCA) The third principal component, $Y_3 = a_{31}X_1 + \cdots + a_{3q}X_q$, accounts for as much as possible of the variation not captured by the first two components and is uncorrelated with both $Y_1$ and $Y_2$. If there are q variables, there are q principal components.

7 Principal Components Analysis (PCA) Data: height and first-leaf length of Dactylorhiza orchids.

Height  First leaf
  108      12
  111      11
  147      23
  218      21
  240      37
  223      30
  242      28
  480      77
  290      40
  263      55

Each observation is considered a coordinate in N-dimensional data space, where N is the number of variables and each axis of data space is one variable.
Step 1: A new set of axes is created, whose origin (0,0) is located at the mean of the dataset.
Step 2: The new axes are rotated around their origin until the first axis gives a least-squares best fit to the data (residuals are fitted orthogonally).
[Figure: scatterplot of the data with the new axes centred on the mean height and mean leaf length.]
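A minimal R sketch of these two steps, using the ten orchid measurements from the table above (the object and column names are illustrative; prcomp() performs the centring and the least-squares rotation in one call):

> orchids <- data.frame(
+   height = c(108, 111, 147, 218, 240, 223, 242, 480, 290, 263),
+   leaf   = c(12, 11, 23, 21, 37, 30, 28, 77, 40, 55))
> centred <- scale(orchids, center = TRUE, scale = FALSE)  # Step 1: move the origin to the mean
> pca <- prcomp(orchids)                                   # Step 2: rotate to a least-squares best fit
> pca$rotation   # the rotation (loadings) that defines the new axes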

8 Principal Components Analysis (PCA) PCA gives three useful sets of information about the dataset:
- the projection onto new coordinate axes (i.e. a new set of variables encapsulating the overall information content);
- the rotations needed to generate each new axis (i.e. the relative importance of each old variable to each new axis);
- the actual information content of each new axis.

9 Mechanics of PCA Normalising the data Most multivariate datasets consist of extremely different variables (e.g. plant percentage cover ranges from 0% to 100%, animal population counts may exceed 10,000, chemical concentrations may take any positive value). How can such disparate types of data be compared? Approach: calculate the mean (µ) and standard deviation (s) of each variable $X_i$ separately, then convert each observation into a corresponding Z score:

$Z = \frac{X_i - \mu}{s}$

The Z score is dimensionless: each column of the data is converted into a new variable that preserves the shape of the original data but has µ = 0 and s = 1. The process of converting to Z scores is known as normalisation.
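A minimal R sketch of the Z-score conversion (the data frame 'dat' is an illustrative stand-in for disparate variables): scale() subtracts each column mean and divides by each column standard deviation, so every column ends up with µ = 0 and s = 1:

> dat <- data.frame(cover = c(5, 40, 95), count = c(120, 8400, 15000))
> Z <- scale(dat)        # convert each column to Z scores
> colMeans(Z)            # ~0 for every column
> apply(Z, 2, sd)        # 1 for every column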

10 Mechanics of PCA Normalising the data (x, y, and z axes; µ = mean, s = standard deviation):

   Before normalisation        After normalisation
     x       y       z           x      y      z
   1.716  -0.567   0.991      -1.09  -1.35  -1.09
   1.760  -0.480   1.016      -1.02  -1.26  -1.02
   1.933  -0.134   1.116      -0.73  -0.90  -0.73
   2.366   0.732   1.366      -0.01  -0.01  -0.01
   2.582   1.165   1.491       0.35   0.44   0.35
   3.015   2.031   1.741       1.08   1.33   1.08
   3.232   2.464   1.866       1.44   1.78   1.44
   1.616   1.232   0.933      -1.26   0.51  -1.26
   1.991   0.982   1.150      -0.63   0.25  -0.63
   2.741   0.482   1.582       0.62  -0.27   0.62
   3.116   0.232   1.799       1.24  -0.52   1.24
µ: 2.370   0.740   1.368       0      0      0
s: 0.600   0.970   0.346       1      1      1

11 Mechanics of PCA The extraction of principal components The cloud of N-dimensional data points needs to be rotated to generate a set of N principal axes. The ordination is achieved by finding a set of numbers (loadings) that rotates the data to give the best fit. How do we find the best possible values for the loadings? Answer: by finding the eigenvectors and eigenvalues of the Pearson correlation matrix (the matrix of all possible Pearson correlation coefficients between the variables under examination). The covariance matrix can be used instead of the correlation matrix when all the original variables have the same scale or when the data have been normalised.

      X      Y      Z
X  1.000  0.593  0.999
Y  0.593  1.000  0.594
Z  0.999  0.594  1.000
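As an illustrative sketch, the correlation matrix above can be reproduced in R with cor(), using the x, y, z data from slide 10 (Pearson correlation is unaffected by normalisation, so the raw and the normalised data give the same matrix):

> xyz <- data.frame(
+   x = c(1.716, 1.760, 1.933, 2.366, 2.582, 3.015, 3.232, 1.616, 1.991, 2.741, 3.116),
+   y = c(-0.567, -0.480, -0.134, 0.732, 1.165, 2.031, 2.464, 1.232, 0.982, 0.482, 0.232),
+   z = c(0.991, 1.016, 1.116, 1.366, 1.491, 1.741, 1.866, 0.933, 1.150, 1.582, 1.799))
> round(cor(xyz), 3)     # the Pearson correlation matrix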

12 Mechanics of PCA Eigenvalues and eigenvectors When a square (N × N) matrix is multiplied by an (N × 1) column vector, the result is a new (N × 1) vector. This operation can be repeated on the new vector, generating yet another (N × 1) vector. After a number of repeats (iterations) the pattern of numbers generated settles down to a constant shape, although the actual values change each time by a constant factor. The rate of growth (or shrinkage) per multiplication is known as the dominant eigenvalue, and the pattern the numbers form is the dominant (or principal) eigenvector. In matrix form: $A v = \lambda v$, where A is the (N × N) matrix, v is the eigenvector, and λ is the eigenvalue.

13 Mechanics of PCA Eigenvalues and eigenvectors

First iteration:
| 1.000 0.593 0.999 |   | 1 |   | 2.592 |
| 0.593 1.000 0.594 | x | 1 | = | 2.187 |
| 0.999 0.594 1.000 |   | 1 |   | 2.593 |

Second iteration:
| 1.000 0.593 0.999 |   | 2.592 |   | 6.48 |
| 0.593 1.000 0.594 | x | 2.187 | = | 5.26 |
| 0.999 0.594 1.000 |   | 2.593 |   | 6.48 |

Iteration number:       5        10        20
Resulting vector:     98.6      9181    7.96e7
                      79.3      7384    6.40e7
                      98.6      9181    7.96e7

First eigenvector: (0.967, 0.777, 0.967). Second eigenvector: (-0.253, 0.629, -0.253).
Dominant eigenvalue: 2.48. Once equilibrium is reached, each generation of numbers increases by a factor of 2.48.
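A minimal R sketch of this power iteration (a renormalisation step is added here so the numbers stay bounded instead of growing to 7.96e7; the resulting eigenvector has the same pattern as the slide's (0.967, 0.777, 0.967), up to scaling):

> A <- matrix(c(1.000, 0.593, 0.999,
+               0.593, 1.000, 0.594,
+               0.999, 0.594, 1.000), nrow = 3, byrow = TRUE)
> v <- c(1, 1, 1) / sqrt(3)          # starting vector, scaled to unit length
> for (i in 1:20) {
+   w <- as.vector(A %*% v)          # one matrix multiplication per iteration
+   lambda <- sum(w * v)             # Rayleigh quotient: growth factor per multiplication
+   v <- w / sqrt(sum(w^2))          # renormalise to unit length
+ }
> lambda                             # ~2.48, the dominant eigenvalue
> round(v, 3)                        # the dominant eigenvector
> eigen(A)$values[1]                 # check against R's built-in eigensolver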

14 Mechanics of PCA PCA takes a set of R observations on N variables as a set of R points in an N-dimensional space. A new set of N principal axes is derived, each one defined by rotating the dataset by a certain angle with respect to the old axes. The first axis in the new space (the first principal axis of the data) encapsulates the maximum possible information content, the second axis contains the second greatest information content, and so on.
Eigenvectors - relative patterns of numbers that are preserved under matrix multiplication.
Eigenvalues - give a precise indication of the relative importance of each ordination axis, with the largest eigenvalue being associated with the first principal axis, the second largest with the second principal axis, etc.

15 Mechanics of PCA For example, a matrix with 20 species would generate 20 eigenvectors, but only the first three or four would be of any importance for interpreting the data. The relationship between eigenvalues and variance in PCA:

$\mathrm{var}_m = \frac{100\,\lambda_m}{N}$

where $\mathrm{var}_m$ is the percent variance explained by the mth ordination axis, $\lambda_m$ is the mth eigenvalue, and N is the number of variables (for a correlation matrix the eigenvalues sum to N). There is no formal test of significance available to decide whether any given ordination axis is meaningful, nor is there any test to decide whether individual variables contribute significantly to an ordination axis.
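A minimal R sketch of this relationship, using the slide 11 correlation matrix (N = 3):

> A <- matrix(c(1.000, 0.593, 0.999,
+               0.593, 1.000, 0.594,
+               0.999, 0.594, 1.000), nrow = 3, byrow = TRUE)
> lambda <- eigen(A)$values             # eigenvalues, largest first
> round(100 * lambda / sum(lambda), 1)  # percent variance per axis; sum(lambda) = N here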

16 Mechanics of PCA Axis scores The Nth axis of the ordination diagram is derived by multiplying the matrix of normalised data by the Nth eigenvector.

Normalised data (from slide 10):
    X      Y      Z
 -1.09  -1.35  -1.09
 -1.02  -1.26  -1.02
 -0.73  -0.90  -0.73
 -0.01  -0.01  -0.01
  0.35   0.44   0.35
  1.08   1.33   1.08
  1.44   1.78   1.44
 -1.26   0.51  -1.26
 -0.63   0.25  -0.63
  0.62  -0.27   0.62
  1.24  -0.52   1.24

Multiplying by the first eigenvector (0.967, 0.777, 0.967) gives the first axis scores:
-3.16, -2.95, -2.11, -0.02, 1.02, 3.12, 4.17, -2.04, -1.02, 0.99, 1.99

Multiplying by the second eigenvector (-0.253, 0.629, -0.253) gives the second axis scores:
-0.30, -0.28, -0.20, 0.00, 0.10, 0.29, 0.39, 0.96, 0.48, -0.48, -0.95
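A minimal R sketch of the score computation (xyz is the data frame from the earlier sketch; eigen() returns unit-length eigenvectors, so these scores match the slide's only up to sign and scale):

> Z <- scale(xyz)                     # the normalised data as a matrix
> e <- eigen(cor(xyz))$vectors        # columns are the eigenvectors
> scores1 <- as.vector(Z %*% e[, 1])  # first axis scores
> scores2 <- as.vector(Z %*% e[, 2])  # second axis scores
> round(cbind(scores1, scores2), 2)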

17 PCA Example Excavations of prehistoric sites in northeast Thailand have produced a series of canid (dog) bones covering a period from about 3500 BC to the present. In order to clarify the ancestry of the prehistoric dogs, mandible measurements were made on the available specimens. These were then compared with similar measurements on the golden jackal, the Chinese wolf, the Indian wolf, the dingo, the cuon, and the modern dog from Thailand. How are these groups related, and how is the prehistoric group related to the others?
R data: "Phistdog". Variables:
Mbreadth - breadth of mandible
Mheight - height of mandible below 1st molar
mlength - length of 1st molar
mbreadth - breadth of 1st molar
mdist - length from 1st to 3rd molars inclusive
pmdist - length from 1st to 4th premolars inclusive

18 PCA Example
> Phistdog=read.csv("E:/Multivariate_analysis/Data/Prehist_dog.csv",header=T,row.names=1)
# read the "Phistdog" data, treating the first column as the row names
> round(sapply(Phistdog,var),2)
 Mbreath Mheight mlength mbreadth   mdist  pmdist
    2.88   10.56    9.61     1.36   24.30   31.52
Calculate the variance of each variable in the Phistdog data set. The round command limits the output to 2 decimal places to save space. The measurements are on a similar scale and the variances are not very different, so we can use either the correlation or the covariance matrix.

19 PCA Example Calculate the correlation matrix of the data:
> round(cor(Phistdog),2)
         Mbreath Mheight mlength mbreadth mdist pmdist
Mbreath     1.00    0.95    0.92     0.98  0.78   0.81
Mheight     0.95    1.00    0.88     0.95  0.71   0.85
mlength     0.92    0.88    1.00     0.97  0.88   0.94
mbreadth    0.98    0.95    0.97     1.00  0.85   0.91
mdist       0.78    0.71    0.88     0.85  1.00   0.89
pmdist      0.81    0.85    0.94     0.91  0.89   1.00

20 PCA Example Calculate the covariance matrix of the data:
> round(cov(Phistdog),2)
         Mbreath Mheight mlength mbreadth mdist pmdist
Mbreath     2.88    5.25    4.85     1.93  6.52   7.74
Mheight     5.25   10.56    8.90     3.59 11.45  15.58
mlength     4.85    8.90    9.61     3.51 13.39  16.31
mbreadth    1.93    3.59    3.51     1.36  4.86   5.92
mdist       6.52   11.45   13.39     4.86 24.30  24.60
pmdist      7.74   15.58   16.31     5.92 24.60  31.52

21 PCA Example Calculate the eigenvectors and eigenvalues of the correlation matrix:
> eigen(cor(Phistdog))
$values
[1] 5.429026124 0.369268401 0.128686279 0.064760299 0.006117398 0.002141499

$vectors
           [,1]        [,2]        [,3]         [,4]         [,5]       [,6]
[1,] -0.4099426  0.40138614 -0.45937507 -0.005510479  0.009871866  0.6779992
[2,] -0.4033020  0.48774128  0.29350469 -0.511169325 -0.376186947 -0.3324158
[3,] -0.4205855 -0.08709575  0.02680772  0.737388619 -0.491604714 -0.1714245
[4,] -0.4253562  0.16567935 -0.12311823  0.170218718  0.739406740 -0.4480710
[5,] -0.3831615 -0.67111237 -0.44840921 -0.404660012 -0.136079802 -0.1394891
[6,] -0.4057854 -0.33995660  0.69705234 -0.047004708  0.226871533  0.4245063

22 PCA Example Calculate the eigenvectors and eigenvalues of the covariance matrix:
> eigen(cov(Phistdog))
$values
[1] 72.512852567  4.855621390  2.156165476  0.666083782  0.024355099
[6]  0.005397877

$vectors
           [,1]       [,2]       [,3]        [,4]        [,5]         [,6]
[1,] -0.1764004 -0.2228937 -0.4113227 -0.10162260  0.65521113  0.557123088
[2,] -0.3363603 -0.6336812 -0.3401245  0.47472891 -0.36879498 -0.090818041
[3,] -0.3519843 -0.1506859 -0.1472096 -0.83773573 -0.36033271 -0.009453262
[4,] -0.1301150 -0.1132540 -0.1502766 -0.10976633  0.51257082 -0.820294484
[5,] -0.5446003  0.7091113 -0.3845381  0.20868622 -0.09193887 -0.026446421
[6,] -0.6467862 -0.1019554  0.7231913  0.08309978  0.18348673  0.087716189

23 PCA Example Extract the principal components from the correlation matrix:
> Phistdog_Cor=princomp(Phistdog,cor=TRUE)
> summary(Phistdog_Cor,loadings=TRUE)
Importance of components:
                          Comp.1     Comp.2     Comp.3
Standard deviation     2.3300271 0.60767458 0.35872870
Proportion of Variance 0.9048377 0.06154473 0.02144771
Cumulative Proportion  0.9048377 0.96638242 0.98783013

Loadings:
         Comp.1 Comp.2 Comp.3
Mbreath  -0.410  0.401 -0.459
Mheight  -0.403  0.488  0.294
mlength  -0.421
mbreadth -0.425  0.166 -0.123
mdist    -0.383 -0.671 -0.448
pmdist   -0.406 -0.340  0.697

The first principal component accounts for 90% of the variance. All other components account for less than 10% of the variance each.

24 PCA Example Extract the principal components from the covariance matrix:
> Phistdog_Cov=princomp(Phistdog)
> summary(Phistdog_Cov,loadings=TRUE)
Importance of components:
                          Comp.1     Comp.2     Comp.3
Standard deviation     7.8837728 2.04008853 1.35946380
Proportion of Variance 0.9039195 0.06052845 0.02687799
Cumulative Proportion  0.9039195 0.96444795 0.99132595

Loadings:
         Comp.1 Comp.2 Comp.3
Mbreath  -0.176  0.223 -0.411
Mheight  -0.336  0.634 -0.340
mlength  -0.352  0.151 -0.147
mbreadth -0.130  0.113 -0.150
mdist    -0.545 -0.709 -0.385
pmdist   -0.647  0.102  0.723

The loadings obtained from the covariance matrix differ from those obtained from the correlation matrix; the proportions of variance are similar.

25 PCA Example Plot the variances of the principal components (scree plot):
> screeplot(Phistdog_Cor,main="Phistdog",cex.names=0.75)

26 PCA Example Equations for the first two principal components from the correlation matrix (loadings from slide 23):

PC1 = -0.410 Mbreath - 0.403 Mheight - 0.421 mlength - 0.425 mbreadth - 0.383 mdist - 0.406 pmdist
PC2 =  0.401 Mbreath + 0.488 Mheight - 0.087 mlength + 0.166 mbreadth - 0.671 mdist - 0.340 pmdist

Equations for the first two principal components from the covariance matrix (loadings from slide 24):

PC1 = -0.176 Mbreath - 0.336 Mheight - 0.352 mlength - 0.130 mbreadth - 0.545 mdist - 0.647 pmdist
PC2 =  0.223 Mbreath + 0.634 Mheight + 0.151 mlength + 0.113 mbreadth - 0.709 mdist + 0.102 pmdist

All variables have negative loadings on the first principal axis; loadings on the second principal axis are mostly positive.

27 PCA Example Calculate the axis scores for the principal components from the correlation matrix:
> round(Phistdog_Cor$scores,2)
            Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Modern        1.47   0.04  -0.05  -0.18  -0.08   0.09
G.jackal      3.32  -0.66  -0.25   0.34   0.05  -0.01
C.wolf       -4.33   0.03  -0.23   0.11   0.09   0.03
I.wolf       -2.13  -0.58  -0.09   0.03  -0.14  -0.05
Cuon          0.45   1.16   0.29   0.30  -0.03  -0.02
Dingo         0.08  -0.47   0.73  -0.20   0.06  -0.01
Prehistoric   1.14   0.49  -0.40  -0.40   0.04  -0.05

28 PCA Example Calculate the axis scores for the principal components from the covariance matrix:
> round(Phistdog_Cov$scores,2)
            Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
Modern        4.77  -0.27  -0.18   0.49   0.01  -0.15
G.jackal     10.23  -2.76   0.26  -1.04   0.08   0.03
C.wolf      -13.89   0.18  -0.83  -0.39   0.22  -0.01
I.wolf       -8.25  -1.67  -0.25  -0.23  -0.29   0.00
Cuon          3.98   4.31   0.17  -0.76  -0.07   0.01
Dingo        -2.00   0.02   2.83   0.82   0.04   0.04
Prehistoric   5.16   0.20  -2.01   1.10   0.01   0.08

29 PCA Example Plot the first principal component vs. the second principal component obtained from the correlation matrix:
> plot(Phistdog_Cor$scores[,2]~Phistdog_Cor$scores[,1],xlab="PC1",ylab="PC2",pch=15,xlim=c(-4.5,3.5),ylim=c(-0.75,1.5))
> text(Phistdog_Cor$scores[,1],Phistdog_Cor$scores[,2],labels=row.names(Phistdog),cex=0.7,pos=rep(1,7))
> abline(h=0)
> abline(v=0)

and from the covariance matrix:
> plot(Phistdog_Cov$scores[,2]~Phistdog_Cov$scores[,1],xlab="PC1",ylab="PC2",pch=15,xlim=c(-14.5,11),ylim=c(-3.5,4.5))
> text(Phistdog_Cov$scores[,1],Phistdog_Cov$scores[,2],labels=row.names(Phistdog),cex=0.7,pos=rep(1,7))
> abline(v=0)
> abline(h=0)

30 PCA Example [Figures: PCA score diagram based on the covariance matrix; PCA score diagram based on the correlation matrix.]

31 PCA Example Although the scores given by the covariance and the correlation matrix are different, the information provided by the two diagrams is the same. The Modern dog has the mandible measurements closest to the Prehistoric dog, which suggests that the two groups are related. The Cuon and Dingo groups are the next closest to the Prehistoric dog. The Indian wolf, Chinese wolf, and golden jackal are not closely related to the Prehistoric dog or to any other group.

