PCA
Given a set of variables $x_1, x_2, \ldots, x_n$,
– find a set of coefficients $a_{11}, a_{12}, \ldots, a_{1n}$ so that $PC_1 = a_{11}x_1 + a_{12}x_2 + \ldots + a_{1n}x_n$ has the maximum variance ($v_1$), subject to the constraint that $a_1$ is a unit vector, i.e., $\sqrt{a_{11}^2 + a_{12}^2 + \ldots + a_{1n}^2} = 1$;
– find a 2nd set of coefficients $a_2$ so that $PC_2$ has the maximum variance ($v_2$), subject to the unit-vector constraint and the additional constraint that $a_2$ is orthogonal to $a_1$;
– find the 3rd, 4th, …, nth sets of coefficients so that $PC_3, PC_4, \ldots$ have the maximum variance ($v_3, v_4, \ldots$), subject to the unit-vector constraint and the constraint that $a_i$ is orthogonal to all preceding vectors $a_1, \ldots, a_{i-1}$.
– It turns out that $v_1, v_2, \ldots$ are the eigenvalues and $a_1, a_2, \ldots$ the eigenvectors of the variance-covariance matrix of $x_1, x_2, \ldots, x_n$. PCA therefore amounts to finding these eigenvalues and eigenvectors.
Slide 2
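The statement that the maximum-variance coefficients are eigenvectors of the variance-covariance matrix can be checked numerically. The following is a minimal sketch in R, assuming a small simulated data set (the variables x1, x2, x3 and the seed are hypothetical, not from the slides); it compares eigen() applied to the covariance matrix with prcomp().

```r
# Hypothetical data: 100 observations of 3 variables (not from the slides)
set.seed(1)
x1 <- rnorm(100)
x2 <- 0.8 * x1 + rnorm(100, sd = 0.5)
x3 <- rnorm(100)
md <- data.frame(x1, x2, x3)

S   <- cov(md)        # variance-covariance matrix
e   <- eigen(S)       # eigenvalues (v) and eigenvectors (a)
pca <- prcomp(md)     # PCA on the covariance matrix (the default)

v_eigen  <- e$values      # variances v1, v2, v3 from the eigen-decomposition
v_prcomp <- pca$sdev^2    # variances of PC1, PC2, PC3 from prcomp

# Each coefficient vector should have unit length
lens <- sqrt(colSums(e$vectors^2))

# Eigenvectors are determined only up to sign, so compare absolute values
same_vectors <- all.equal(abs(e$vectors), abs(unname(pca$rotation)),
                          check.attributes = FALSE)
```

Under these assumptions, v_eigen and v_prcomp should agree, and same_vectors should be TRUE.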
Xuhua Xia Slide 3 Typical Form of Data A data set in an 8x3 matrix. The rows could be species and the columns sampling sites. X = [8x3 data matrix; values not preserved] A matrix is often referred to as an n x p matrix (n for the number of rows and p for the number of columns). Our matrix has 8 rows and 3 columns, and is an 8x3 matrix. A variance-covariance matrix has n = p, and is called an n-dimensional square matrix.
Xuhua Xia Slide 4 What are Principal Components? Principal components are linear combinations of the observed variables. The coefficients of these principal components are chosen to meet four criteria. What are the four criteria? $PC = a_1X_1 + a_2X_2 + \ldots + a_nX_n$
Xuhua Xia Slide 5 What are Principal Components? The four criteria:
– There are exactly p principal components (PCs), each being a linear combination of the observed variables.
– The PCs are mutually orthogonal (i.e., perpendicular and uncorrelated).
– The components are extracted in order of decreasing variance.
– Each component is defined by an eigenvector of unit length, and its variance is the corresponding eigenvalue.
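The four criteria can be checked on any prcomp() result. The sketch below uses USArrests, a data set that ships with R, purely as a stand-in; it is not the data used in these slides.

```r
# USArrests is R's built-in data set, used here only as a stand-in
pca <- prcomp(USArrests, scale. = TRUE)
p   <- ncol(USArrests)

# Criterion 1: exactly p principal components
n_pcs <- ncol(pca$rotation)

# Criteria 2 and 4: eigenvectors are orthogonal and of unit length,
# so t(A) %*% A should be the identity matrix
ortho <- crossprod(pca$rotation)

# Criterion 2 (continued): the PC scores are uncorrelated
score_cor <- cor(pca$x)

# Criterion 3: variances (sdev^2) are in decreasing order
decreasing <- all(diff(pca$sdev) <= 0)
```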
Xuhua Xia Slide 6 A Simple Data Set [Table: observations of X and Y, with the correlation matrix and the covariance matrix of X and Y; values not preserved]
Xuhua Xia Slide 7 General observations The total variance is 3 (= 1 + 2). The two variables, X and Y, are perfectly correlated, with all points falling on the regression line. The spatial relationship among the 5 points can therefore be represented by a single dimension. For this reason, PCA is often referred to as a dimension-reduction technique. What would happen if we apply PCA to the data?
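To see what PCA does with perfectly correlated variables, here is a minimal sketch in R. The five data points are hypothetical: they are chosen only to match the stated variances (1 and 2) and the perfect correlation, and assigning variance 1 to X and 2 to Y is an assumption, since the original values are not preserved.

```r
# Hypothetical 5-point data set: var(X) = 1, var(Y) = 2, cor(X, Y) = 1
X <- c(-2, -1, 0, 1, 2) / sqrt(2.5)   # var(c(-2,-1,0,1,2)) is 2.5, so var(X) is 1
Y <- sqrt(2) * X                      # var(Y) is 2 and Y is a linear function of X
md <- data.frame(X, Y)

total_var   <- var(X) + var(Y)        # total variance
objPCA      <- prcomp(md)             # PCA on the covariance matrix
eigenvalues <- objPCA$sdev^2          # variance accounted for by each PC
```

With data of this kind, the first PC should carry all of the total variance and the second PC essentially none, which is why one dimension suffices.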
Xuhua Xia Slide 8 Graphic PCA [Figure: plot with axes X and Y]
Xuhua Xia Slide 9 R functions [Data: columns X1 and X2; values not preserved]
options("scipen"=100, "digits"=6)   # Don't use scientific notation.
objPCA<-prcomp(~X1+X2)              # PCA on the covariance matrix (the default) rather than the correlation matrix
objPCA<-prcomp(md)                  # The same, given the data frame md
objPCA<-prcomp(md,scale.=TRUE)      # Use scale.=TRUE to request PCA on the correlation matrix
predict(objPCA,md)                  # PC scores for the original data
predict(objPCA,data.frame(X1=0.3,X2=0.5))   # PC scores for a new observation
screeplot(objPCA)                   # Helps decide how many PCs to keep when there are many variables
Xuhua Xia Slide 10 A positive definite matrix When you run the SAS program, the log file will warn that “The Correlation Matrix is not positive definite.” What does that mean? A symmetric matrix M (such as a correlation matrix or a covariance matrix) is positive definite if z’Mz > 0 for all non-zero vectors z with real entries, where z’ is the transpose of z. Given our correlation matrix with all entries being 1, it is easy to find a z that leads to z’Mz = 0, so the matrix is not positive definite. Exercise: replace the correlation matrix with the covariance matrix and solve for z.
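A short R sketch of this check, using the 2x2 correlation matrix with all entries equal to 1. The vector z = (1, -1) is one illustrative choice, not necessarily the one used in the lecture.

```r
M <- matrix(1, nrow = 2, ncol = 2)    # correlation matrix with all entries 1
z <- c(1, -1)                         # a non-zero vector with real entries

q  <- as.numeric(t(z) %*% M %*% z)    # the quadratic form z'Mz
ev <- eigen(M)$values                 # a positive definite matrix has all eigenvalues > 0
```

Here q is 0, so M fails the definition. Equivalently, one of its eigenvalues is zero.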
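Xuhua Xia Slide 11 SAS Output [Output: standard deviations, the rotation matrix (eigenvectors) giving the coefficients of the two variables on PC1 and PC2, and the principal component scores for observations 1 to 5; values not preserved] What’s the variance in PC1? It is better to report the variance (eigenvalue) accounted for by each PC, which is the square of its standard deviation. The eigenvectors give the coefficients of PC1 as a linear combination of the two variables [coefficients not preserved].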
Xuhua Xia Slide 12 PCA on correlation matrix (scale.=T) [Output: standard deviations, rotation matrix for PC1 and PC2, and PC scores for observations 1 to 5; values not preserved]
Xuhua Xia Slide 13 The Eigenvalue Problem Let A be the covariance matrix [values not preserved]. The eigenvalues are the set of values $\lambda$ that satisfy the condition $|A - \lambda I| = 0$. There are n eigenvalues for n variables [resulting eigenvalues not preserved]. The sum of the eigenvalues is equal to the sum of the variances in the covariance matrix. Finding the eigenvalues and eigenvectors is called an eigenvalue problem (or a characteristic value problem).
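The two counting facts on this slide (n eigenvalues for n variables, and the sum of eigenvalues equal to the sum of variances) can be illustrated with eigen(). The sketch uses R's built-in USArrests data as a stand-in, not the lecture data.

```r
S  <- cov(USArrests)          # covariance matrix of a stand-in data set
ev <- eigen(S)$values         # eigenvalues of the covariance matrix

n_eigen   <- length(ev)       # one eigenvalue per variable
total_var <- sum(diag(S))     # sum of the variances (diagonal of S)
sum_ev    <- sum(ev)          # sum of the eigenvalues
```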
Xuhua Xia Slide 14 Get the Eigenvectors An eigenvector is a non-zero vector x that satisfies the following condition: $A x = \lambda x$, where $\lambda$ is the corresponding eigenvalue. In our case A is a variance-covariance matrix of order 2, and x is a vector specified by $x_1$ and $x_2$.
Xuhua Xia Slide 15 Get the Eigenvectors We want to find an eigenvector of unit length, i.e., $x_1^2 + x_2^2 = 1$. This is combined with the condition from the previous slide, $A x = \lambda x$ [equations not preserved]. The first eigenvector is the one associated with the largest eigenvalue. Solve for $x_1$.
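As a worked sketch of this calculation, assume the covariance matrix of the simple data set has variances 1 and 2 and, because the correlation is perfect, covariance $\sqrt{2}$. This matrix is an assumption inferred from the stated total variance and perfect correlation, not copied from the slides.

```latex
\begin{align*}
A &= \begin{pmatrix} 1 & \sqrt{2} \\ \sqrt{2} & 2 \end{pmatrix}, \qquad
|A - \lambda I| = (1-\lambda)(2-\lambda) - 2 = \lambda(\lambda - 3) = 0
\;\Rightarrow\; \lambda_1 = 3,\ \lambda_2 = 0 \\
A x &= \lambda_1 x \;\Rightarrow\; (1 - 3)\,x_1 + \sqrt{2}\,x_2 = 0
\;\Rightarrow\; x_2 = \sqrt{2}\,x_1 \\
x_1^2 + x_2^2 &= 1 \;\Rightarrow\; 3x_1^2 = 1
\;\Rightarrow\; x_1 = \tfrac{1}{\sqrt{3}} \approx 0.577,\quad
x_2 = \sqrt{\tfrac{2}{3}} \approx 0.816
\end{align*}
```

The eigenvalues sum to 3, the total variance, and the sign of the eigenvector is arbitrary.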
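Xuhua Xia Slide 16 Get the PC Scores The PC scores are obtained by multiplying the original data (x and y) by the eigenvectors: the first eigenvector gives the first PC score and the second eigenvector gives the second PC score [matrices not preserved]. The original data in a two-dimensional space are reduced to one dimension.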
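A minimal R sketch of the score calculation, assuming hypothetical data with perfect correlation (variances 1 and 2; the original values are not preserved). The data are centered before multiplication, which is also what prcomp() does by default.

```r
# Hypothetical perfectly correlated data (not the original values)
X  <- c(-2, -1, 0, 1, 2) / sqrt(2.5)
Y  <- sqrt(2) * X
md <- cbind(X, Y)

e        <- eigen(cov(md))                           # eigenvectors of the covariance matrix
centered <- scale(md, center = TRUE, scale = FALSE)  # subtract column means
scores   <- centered %*% e$vectors                   # column 1: first PC score; column 2: second

var_pc1 <- var(scores[, 1])          # variance of the first PC score
max_pc2 <- max(abs(scores[, 2]))     # second PC score is (numerically) zero everywhere
```

Because every second PC score is essentially zero, the five points are fully described by the first PC alone.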
Xuhua Xia Slide 17 Crime Data in 50 States [Table: one row per state, with columns STATE, MURDER, RAPE, ROBBE, ASSAU, BURGLA, LARCEN, AUTO; values not preserved] PROC PRINCOMP OUT=CRIMCOMP;
[Table: crime data for the 50 states, Alabama through Wyoming, with columns STATE, MURDER, RAPE, ROBBE, ASSAU, BURGLA, LARCEN, AUTO; values not preserved]
Crime data (cont.)
md<-read.fwf("crime.txt",c(14,6,5,6,6,7,7,6),header=TRUE)   # fixed-width columns: state name, then 7 crime rates
attach(md)
cor(md[,2:8])                              # correlation matrix of the 7 crime variables
objPCA<-prcomp(md[,2:8],scale.=TRUE)       # PCA on the correlation matrix
objPCA
summary(objPCA)
PCScore<-predict(objPCA,md)                # PC scores for each state
If you copy the data to a text file, add a top line with a comment sign #; otherwise you need to specify 'sep=' with read.fwf.
Xuhua Xia Slide 20 Correlation Matrix [Table: correlations among MURDER, RAPE, ROBBERY, ASSAULT, BURGLARY, LARCENY and AUTO; values not preserved] If the variables were not correlated, there would be no point in doing PCA. The correlation matrix is symmetric, so we only need to inspect either the upper or the lower triangular matrix.
Xuhua Xia Slide 21 > summary(objPCA) Importance of components: [Output: standard deviation, proportion of variance and cumulative proportion for PC1 to PC7; values not preserved] Eigenvalues: screeplot(objPCA,type = "lines")
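The rows of the summary() table can be reproduced from the standard deviations. The sketch below assumes R's built-in USArrests data as a stand-in for the crime data, which are not available here; the name objPCA is reused only for readability.

```r
objPCA <- prcomp(USArrests, scale. = TRUE)   # stand-in data, PCA on the correlation matrix

eigenvalues <- objPCA$sdev^2                      # variance of each PC
prop_var    <- eigenvalues / sum(eigenvalues)     # "Proportion of Variance"
cum_prop    <- cumsum(prop_var)                   # "Cumulative Proportion"

imp <- summary(objPCA)$importance            # the table printed by summary()
```

For a PCA on the correlation matrix, the eigenvalues sum to the number of variables, since each standardized variable has variance 1.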
Xuhua Xia Slide 22 Eigenvectors [Table: coefficients of MURDER, RAPE, ROBBE, ASSAU, BURGLA, LARCEN and AUTO on PC1 to PC7; values not preserved] Do these eigenvectors mean anything?
– All crimes are negatively correlated with the first eigenvector, which is therefore interpreted as a measure of overall safety.
– The 2nd eigenvector has positive loadings on AUTO, LARCENY and ROBBERY and negative loadings on MURDER, ASSAULT and RAPE. It is interpreted as measuring the preponderance of property crime over violent crime.
biplot(objPCA) Xuhua Xia Slide 23
Plot PC1 and PC2 [Figure: the 50 states plotted on axes PC1 and PC2] State labels, in the order extracted from the figure:
– Connecticut, Indiana, Iowa, Maine, Minnesota, Montana, Nebraska, New Hampshire, North Dakota, Rhode Island, Utah, Vermont, Wisconsin, Wyoming
– Alabama, Arkansas, Idaho, Kansas, Kentucky, Mississippi, North Carolina, Oklahoma, Pennsylvania, South Dakota, Tennessee, Virginia, West Virginia
– Alaska, Arizona, California, Colorado, Delaware, Hawaii, Illinois, Massachusetts, Michigan, New Jersey, New York, Ohio, Oregon, Washington
– Florida, Georgia, Louisiana, Maryland, Missouri, Nevada, New Mexico, South Carolina, Texas
Xuhua Xia Slide 25 PC Plot: Crime Data [Figure: PC plot with annotated states] Annotations: North and South Dakota; Nevada, New York, California; Mississippi, Alabama, Louisiana, South Carolina; Maryland
Xuhua Xia Slide 26 Steps in a PCA
– Generate a correlation or variance-covariance matrix
– Obtain eigenvalues and eigenvectors
– Generate principal component (PC) scores
– Choose the number of PCs
– Plot the PC scores in the space with reduced dimensions
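The five steps can be sketched end to end in R. This is one possible arrangement, not the lecture's own script: it uses R's built-in USArrests data as a stand-in, and the 80% cumulative-variance cutoff for choosing the number of PCs is a hypothetical rule of thumb.

```r
X <- USArrests                              # stand-in data set

R <- cor(X)                                 # step 1: correlation matrix
e <- eigen(R)                               # step 2: eigenvalues and eigenvectors

# Step 3: PC scores; standardize the data because the correlation matrix was used
scores <- scale(X) %*% e$vectors
colnames(scores) <- paste0("PC", seq_len(ncol(scores)))

# Step 4: keep the first k PCs that reach the (hypothetical) 80% cutoff
cum_prop <- cumsum(e$values) / sum(e$values)
k <- which(cum_prop >= 0.8)[1]

# Step 5: plot the scores on the first two PCs (only when run interactively)
if (interactive()) plot(scores[, 1:2])
```

The variance of each column of scores should equal the corresponding eigenvalue, which is a quick way to confirm that steps 2 and 3 agree.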