
1 Xuhua Xia Slide 1 Multivariate statistics: PCA (principal component analysis), correspondence analysis, canonical correlation, discriminant function analysis, cluster analysis, MANOVA.

2 PCA Given a set of variables x1, x2, …, xn:
–find a set of coefficients a11, a12, …, a1n so that PC1 = a11x1 + a12x2 + … + a1nxn has the maximum variance (v1), subject to the constraint that a1 is a unit vector, i.e., sqrt(a11^2 + a12^2 + … + a1n^2) = 1;
–find a 2nd set of coefficients a2 so that PC2 has the maximum variance (v2), subject to the unit-vector constraint and the additional constraint that a2 is orthogonal to a1;
–find the 3rd, 4th, …, nth sets of coefficients so that PC3, PC4, … have the maximum variance (v3, v4, …), subject to the unit-vector constraint and to each ai being orthogonal to all earlier vectors a1, …, ai-1.
–It turns out that v1, v2, … are the eigenvalues, and a1, a2, … the eigenvectors, of the variance-covariance matrix of x1, x2, …, xn. PCA amounts to finding these eigenvalues and eigenvectors, as sketched below.
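A minimal R sketch of this equivalence, using hypothetical data (the variable names and values are illustrative, not from the slides): prcomp() should report standard deviations whose squares match the eigenvalues of the covariance matrix, and a rotation matrix whose columns are the corresponding eigenvectors.
set.seed(1)                                  # hypothetical example, for reproducibility
x <- matrix(rnorm(40), nrow = 10, ncol = 4)  # 10 observations, 4 variables
pca <- prcomp(x)                             # PCA on the covariance matrix (no scaling)
pca$sdev^2                                   # variances of the PCs (v1, v2, ...)
eigen(cov(x))$values                         # eigenvalues of the covariance matrix: the same numbers
pca$rotation                                 # coefficients a1, a2, ...: the eigenvectors (up to sign)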

3 Xuhua Xia Slide 3 Typical Form of Data A data set in an 8x3 matrix X; the rows could be species and the columns sampling sites. A matrix is often referred to as an n x p matrix (n for the number of rows and p for the number of columns); our matrix has 8 rows and 3 columns, so it is an 8x3 matrix. A variance-covariance matrix has n = p and is called an n-dimensional square matrix.
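A short R illustration of this layout (the values are random placeholders, not the slide's data): the data matrix is n x p, while its variance-covariance matrix is p x p, i.e., square.
X <- matrix(rnorm(8 * 3), nrow = 8, ncol = 3)  # hypothetical 8 x 3 data matrix (n = 8, p = 3)
dim(X)        # 8 3: n rows, p columns
dim(cov(X))   # 3 3: the variance-covariance matrix is p x p (square)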

4 Xuhua Xia Slide 4 What are Principal Components? Principal components are linear combinations of the observed variables: PC = a1X1 + a2X2 + … + anXn. The coefficients of these principal components are chosen to meet four criteria. What are the four criteria?

5 Xuhua Xia Slide 5 What are Principal Components? The four criteria: –There are exactly p principal components (PCs), each being a linear combination of the observed variables; –The PCs are mutually orthogonal (i.e., perpendicular and uncorrelated); –The components are extracted in order of decreasing variance; –Each component corresponds to an eigenvalue (its variance) and an eigenvector of unit length. These properties can be checked numerically, as sketched below.
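A minimal R check of these criteria, using hypothetical data (names and values are illustrative): the PC scores are uncorrelated, their variances decrease, and the eigenvectors are of unit length and mutually orthogonal.
set.seed(2)
x <- matrix(rnorm(60), nrow = 20, ncol = 3)   # hypothetical 20 x 3 data set
pca <- prcomp(x)
round(cor(pca$x), 10)               # correlations among PC scores: identity matrix (uncorrelated PCs)
pca$sdev^2                          # variances of the PCs, in decreasing order
round(crossprod(pca$rotation), 10)  # inner products of the eigenvectors: identity (unit length, orthogonal)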

6 Xuhua Xia Slide 6 A Simple Data Set Two variables, X and Y, measured on five points, shown together with their correlation matrix and covariance matrix.

7 Xuhua Xia Slide 7 General observations The total variance is 3 (= 1 + 2). The two variables, X and Y, are perfectly correlated, with all points falling on the regression line. The spatial relationship among the 5 points can therefore be represented by a single dimension. For this reason, PCA is often referred to as a dimension-reduction technique. What would happen if we apply PCA to the data?
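A hedged reconstruction in R: the original values are not reproduced here, so the data below are hypothetical, chosen only to match the properties stated above (var(X) = 1, var(Y) = 2, perfect correlation). PCA then puts all of the total variance (3) on the first component.
X <- c(-2, -1, 0, 1, 2) / sqrt(2.5)  # hypothetical: five points with var(X) = 1
Y <- sqrt(2) * X                     # perfectly correlated with X, var(Y) = 2
c(var(X), var(Y), cor(X, Y))         # 1, 2, 1
pca <- prcomp(cbind(X, Y))           # PCA on the covariance matrix
pca$sdev^2                           # 3 and 0: a single dimension carries all the variance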

8 Xuhua Xia Slide 8 Graphic PCA A scatterplot of Y against X, illustrating the principal components graphically.

9 Xuhua Xia Slide 9 R functions
options("scipen"=100, "digits"=6)            # don't use scientific notation
objPCA<-prcomp(~X1+X2)                       # formula interface
objPCA<-prcomp(md)                           # PCA on the covariance matrix (the default) rather than the correlation matrix
objPCA<-prcomp(md,scale.=TRUE)               # use scale.=TRUE to request PCA on the correlation matrix
predict(objPCA,md)                           # principal component scores for the data
predict(objPCA,data.frame(X1=0.3,X2=0.5))    # scores for a new observation
screeplot(objPCA)                            # helps decide how many PCs to keep when there are many variables
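For reference, the object returned by prcomp() stores the quantities discussed on the following slides; a brief sketch (using the object name above):
objPCA$sdev       # standard deviations of the PCs (square them to get the variances/eigenvalues)
objPCA$rotation   # matrix of eigenvectors (loadings), one column per PC
objPCA$x          # principal component scores for the data used in the fit
objPCA$center     # variable means subtracted before the analysis
objPCA$scale      # scaling applied to the variables (FALSE unless scale.=TRUE was used)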

10 Xuhua Xia Slide 10 A positive definite matrix When you run the SAS program, the log file will warn that "The Correlation Matrix is not positive definite." What does that mean? A symmetric matrix M (such as a correlation matrix or a covariance matrix) is positive definite if z'Mz > 0 for all non-zero vectors z with real entries, where z' is the transpose of z. Given our correlation matrix, with all entries equal to 1, it is easy to find a z that leads to z'Mz = 0, so the matrix is not positive definite. Replace the correlation matrix with the covariance matrix and solve for z.
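A quick R illustration of such a z for the 2 x 2 case (the correlation matrix of our two perfectly correlated variables has all entries equal to 1):
M <- matrix(1, nrow = 2, ncol = 2)   # correlation matrix with all entries equal to 1
z <- c(1, -1)                        # a non-zero vector
t(z) %*% M %*% z                     # = 0, so M is not positive definite (only positive semi-definite)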

11 Xuhua Xia Slide 11 R Output (PCA on the covariance matrix) The prcomp() output reports the standard deviations of the PCs, the rotation matrix (the eigenvectors, expressing PC1 and PC2 as linear combinations of X1 and X2), and the principal component scores for the five observations. What is the variance of PC1? It is better to report the variance (eigenvalue) accounted for by each PC, i.e., the squared standard deviation. The eigenvectors give PC1 = a11 X1 + a12 X2.
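prcomp() reports standard deviations rather than variances; a one-line conversion (object name as on slide 9):
objPCA$sdev^2     # variances (eigenvalues) accounted for by each PC
summary(objPCA)   # also gives the proportion and cumulative proportion of variance per PC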

12 Xuhua Xia Slide 12 PCA on the correlation matrix (scale.=T) The same output layout as on the previous slide: standard deviations, the rotation matrix (PC1 and PC2 in terms of X1 and X2), and the principal component scores for the five observations, now computed from the standardized variables.

13 Xuhua Xia Slide 13 The Eigenvalue Problem Starting from the covariance matrix A, the eigenvalues are the values λ that satisfy the condition |A - λI| = 0 (there are n eigenvalues for n variables). The sum of the eigenvalues is equal to the sum of the variances in the covariance matrix. Finding the eigenvalues and eigenvectors is called an eigenvalue problem (or a characteristic value problem).
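Written out symbolically for a 2 x 2 covariance matrix (a generic version of the slide's condition, since the original numbers are not shown here):
\det(A - \lambda I) =
\begin{vmatrix} s_{11} - \lambda & s_{12} \\ s_{12} & s_{22} - \lambda \end{vmatrix}
= (s_{11} - \lambda)(s_{22} - \lambda) - s_{12}^{2} = 0,
\qquad
\lambda_1 + \lambda_2 = s_{11} + s_{22}.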

14 Xuhua Xia Slide 14 Get the Eigenvectors An eigenvector is a vector (x) that satisfies the following condition: A x = λ x. In our case A is a variance-covariance matrix of order 2, and x is a vector specified by x1 and x2.

15 Xuhua Xia Slide 15 Get the Eigenvectors We want to find an eigenvector of unit length, i.e., x1^2 + x2^2 = 1. Combining this constraint with A x = λ x from the previous slide, we can solve for x1 and x2. The first eigenvector is the one associated with the largest eigenvalue.
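A worked version using the hypothetical covariance matrix from the earlier sketch (var(X) = 1, var(Y) = 2, cov = √2; not the slide's original matrix), whose largest eigenvalue is λ1 = 3:
A - \lambda_1 I =
\begin{pmatrix} 1 - 3 & \sqrt{2} \\ \sqrt{2} & 2 - 3 \end{pmatrix}
= \begin{pmatrix} -2 & \sqrt{2} \\ \sqrt{2} & -1 \end{pmatrix},
\qquad
(A - \lambda_1 I)\,\mathbf{x} = \mathbf{0} \;\Rightarrow\; x_2 = \sqrt{2}\, x_1 .
x_1^2 + x_2^2 = 1 \;\Rightarrow\; 3 x_1^2 = 1 \;\Rightarrow\;
\mathbf{x}_1 = \left( \tfrac{1}{\sqrt{3}},\ \sqrt{\tfrac{2}{3}} \right) \approx (0.577,\ 0.816).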

16 Xuhua Xia Slide 16 Get the PC Scores The PC scores are obtained by multiplying the original data (x and y, centered on their means) by the eigenvectors: the first column gives the first PC score and the second column the second PC score. The original data in a two-dimensional space are thus reduced to one dimension.
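A sketch of this multiplication in R (hypothetical data as before; prcomp() centers the variables before projecting):
X <- c(-2, -1, 0, 1, 2) / sqrt(2.5); Y <- sqrt(2) * X                 # hypothetical data
md <- cbind(X, Y)
pca <- prcomp(md)
scores <- scale(md, center = TRUE, scale = FALSE) %*% pca$rotation    # centered data times eigenvectors
all.equal(scores, pca$x, check.attributes = FALSE)                    # same as the scores prcomp returns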

17 Xuhua Xia Slide 17 Crime Data in 50 States The data list one row per STATE with seven crime-rate variables: MURDER RAPE ROBBE ASSAU BURGLA LARCEN AUTO (the listing on this slide runs from ALABAMA through ILLINOIS). In SAS the analysis is run with PROC PRINCOMP OUT=CRIMCOMP;

18 Crime data listing (STATE MURDER RAPE ROBBE ASSAU BURGLA LARCEN AUTO), one row per state: Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware Florida Georgia Hawaii Idaho Illinois Indiana Iowa Kansas Kentucky Louisiana Maine Maryland Massachusetts Michigan Minnesota Mississippi Missouri Montana Nebraska Nevada New Hampshire New Jersey

19 Crime data (cont.): New Mexico New York North Carolina North Dakota Ohio Oklahoma Oregon Pennsylvania Rhode Island South Carolina South Dakota Tennessee Texas Utah Vermont Virginia Washington West Virginia Wisconsin Wyoming
md<-read.fwf("crime.txt",c(14,6,5,6,6,7,7,6),header=T)   # read the fixed-width data file
attach(md)
cor(md[,2:8])                                            # correlation matrix of the seven crime variables
objPCA<-prcomp(md[,2:8],scale.=T)                        # PCA on the correlation matrix
objPCA
summary(objPCA)
PCScore<-predict(objPCA,md)                              # principal component scores
If you copy the data to a text file, add a top line with a comment sign #; otherwise you need to specify 'sep=' with read.fwf.
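A possible follow-up for inspecting the result, using standard prcomp tools (the interpretation depends on the actual data):
objPCA$rotation[,1:2]   # loadings of the seven crime variables on the first two PCs
screeplot(objPCA)       # variance of each PC, to decide how many to keep
biplot(objPCA)          # states and variables plotted on PC1 and PC2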
