# Multivariate statistics

Xuhua Xia

Topics: PCA (principal component analysis), correspondence analysis, canonical correlation, discriminant function analysis, cluster analysis, MANOVA.

PCA
Given a set of variables $x_1, x_2, \dots, x_n$:

- Find a set of coefficients $a_{11}, a_{12}, \dots, a_{1n}$ so that $PC_1 = a_{11}x_1 + a_{12}x_2 + \dots + a_{1n}x_n$ has the maximum variance ($v_1$), subject to the constraint that $a_1$ is a unit vector, i.e., $\sqrt{a_{11}^2 + a_{12}^2 + \dots + a_{1n}^2} = 1$.
- Find a 2nd set of coefficients $a_2$ so that $PC_2$ has the maximum variance ($v_2$), subject to the unit vector constraint and the additional constraint that $a_2$ is orthogonal to $a_1$.
- Find the 3rd, 4th, …, nth sets of coefficients so that $PC_3, PC_4, \dots$ have the maximum variance ($v_3, v_4, \dots$), subject to the unit vector constraint and the constraint that $a_i$ is orthogonal to all preceding vectors $a_1, \dots, a_{i-1}$.

It turns out that $v_1, v_2, \dots$ are the eigenvalues and $a_1, a_2, \dots$ the eigenvectors of the variance-covariance matrix of $x_1, x_2, \dots, x_n$. PCA therefore amounts to finding these eigenvalues and eigenvectors.
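The link between maximum-variance components and the eigen decomposition can be checked numerically. The following is a minimal sketch on simulated data (the data and the names `X`, `e` and `pca` are assumptions for illustration, not taken from the slides). It compares `eigen()` applied to the variance-covariance matrix with the result of `prcomp()`.

```r
# Simulated data: an assumption for illustration, not the slide data
set.seed(1)
x1 <- rnorm(100)
x2 <- 0.8 * x1 + rnorm(100, sd = 0.5)
x3 <- rnorm(100)
X  <- cbind(x1, x2, x3)

e   <- eigen(cov(X))   # eigenvalues and eigenvectors of the variance-covariance matrix
pca <- prcomp(X)       # PCA on the covariance matrix (the default)

# The variances of the PCs equal the eigenvalues
all.equal(pca$sdev^2, e$values)

# Each eigenvector has unit length, and the eigenvectors are mutually orthogonal
colSums(e$vectors^2)
round(crossprod(e$vectors), 10)

# The eigenvectors agree with prcomp's rotation matrix up to sign
all.equal(abs(unname(pca$rotation)), abs(e$vectors))
```

The sign of an eigenvector is arbitrary, which is why the last comparison uses absolute values.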

Typical Form of Data
A data set in an 8x3 matrix X. The rows could be species and the columns sampling sites. A matrix is often referred to as an nxp matrix (n for the number of rows and p for the number of columns). Our matrix has 8 rows and 3 columns, and is an 8x3 matrix. A variance-covariance matrix has n = p, and is called an n-dimensional square matrix.

What are Principal Components?
$PC = a_1X_1 + a_2X_2 + \dots + a_nX_n$
Principal components are linear combinations of the observed variables. The coefficients of these principal components are chosen to meet four criteria. What are the four criteria?

What are Principal Components?
The four criteria:

1. There are exactly p principal components (PCs), each being a linear combination of the observed variables.
2. The PCs are mutually orthogonal (i.e., perpendicular and uncorrelated).
3. The components are extracted in order of decreasing variance.
4. Each component is described by an eigenvalue (its variance) and an eigenvector of unit length (its coefficients).

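A Simple Data Set

Correlation matrix:

|   | X | Y |
|---|---|---|
| X | 1 | 1 |
| Y | 1 | 1 |

Covariance matrix:

|   | X     | Y     |
|---|-------|-------|
| X | 1     | 1.414 |
| Y | 1.414 | 2     |

Note that, for standardized X and Y, the denominator of $r_{XY}$ is $\sqrt{SS_X \cdot SS_Y} = \sqrt{(\mathrm{var}_X \cdot df)(\mathrm{var}_Y \cdot df)} = \sqrt{df \cdot df} = df$, because both variances equal 1. Dividing the sum of products by df is exactly how the covariance is computed, so for standardized variables the correlation matrix and the variance-covariance matrix are the same.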
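The equivalence of the two matrices for standardized variables is easy to confirm in R. This is a small sketch with hypothetical values (`x`, `y` and `z` are illustrative names, and the numbers are not the slide data).

```r
# Hypothetical values, for illustration only
x <- c(-2, -1, 0, 1, 2)
y <- c(1, 3, 2, 5, 4)

z <- scale(cbind(x, y))   # standardize each column to mean 0 and variance 1

cov(z)                    # variance-covariance matrix of the standardized variables
cor(cbind(x, y))          # correlation matrix of the original variables

all.equal(cov(z), cor(cbind(x, y)))
```

This is why running a PCA on standardized data is equivalent to running it on the correlation matrix.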

General observations
The total variance is 3 (= 1 + 2).
The two variables, X and Y, are perfectly correlated, with all points falling on the regression line. The spatial relationship among the 5 points can therefore be represented by a single dimension. For this reason, PCA is often referred to as a dimension-reduction technique. What would happen if we apply PCA to the data?

Graphic PCA
[Figure: scatter plot of Y against X, with both axes running from -2 to 2.]

R functions

```r
# md is the data frame holding the columns X1 and X2 (its values were not preserved in this transcript)
options(scipen = 100, digits = 6)                # don't use scientific notation
objPCA <- prcomp(~ X1 + X2, data = md)           # formula interface
objPCA <- prcomp(md)                             # PCA on the covariance matrix (the default)
objPCA <- prcomp(md, scale. = TRUE)              # PCA on the correlation matrix
predict(objPCA, md)                              # PC scores for the original data
predict(objPCA, data.frame(X1 = 0.3, X2 = 0.5))  # PC scores for a new observation
screeplot(objPCA)                                # helps decide how many PCs to keep when there are many variables
```

A positive definite matrix
When you run the SAS program, the log file will warn that “The Correlation Matrix is not positive definite.” What does that mean? A symmetric matrix M (such as a correlation matrix or a covariance matrix) is positive definite if $z'Mz > 0$ for all non-zero vectors z with real entries, where $z'$ is the transpose of z. Given our correlation matrix with all entries being 1, it is easy to find a z that leads to $z'Mz = 0$, so the matrix is not positive definite. If the correlations (off-diagonal elements) were smaller than 1 in absolute value, $z'Mz = 0$ would have no non-zero real solution (the solutions would be complex), and the matrix would be positive definite. As an exercise, replace the correlation matrix with the covariance matrix and solve for z.
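A short worked check of the definition may help here. It assumes that the covariance entry 1.414 on the slides is $\sqrt{2}$ rounded to three decimals, and writes $z = (z_1, z_2)'$.

$$
\begin{aligned}
\text{correlation matrix:}\quad z'Mz &= z_1^2 + 2 z_1 z_2 + z_2^2 = (z_1 + z_2)^2, && \text{which is } 0 \text{ for } z = (1, -1)' \\
\text{general } r:\quad z'Mz &= z_1^2 + 2 r z_1 z_2 + z_2^2 = (z_1 + r z_2)^2 + (1 - r^2) z_2^2, && \text{which is } > 0 \text{ for all } z \neq 0 \text{ when } |r| < 1 \\
\text{covariance matrix:}\quad z'Mz &= z_1^2 + 2\sqrt{2}\, z_1 z_2 + 2 z_2^2 = (z_1 + \sqrt{2}\, z_2)^2, && \text{which is } 0 \text{ for } z = (\sqrt{2}, -1)'
\end{aligned}
$$

In the first and third cases the quadratic form is never negative but reaches zero for a non-zero z, so both matrices are positive semi-definite rather than positive definite.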

SAS Output
[Output listing; the numeric values were not preserved in this transcript.] The listing shows the standard deviations of the PCs (it would be better to output the variance, i.e., the eigenvalue, accounted for by each PC), the rotation matrix whose columns PC1 and PC2 are the eigenvectors (the coefficients of X1 and X2 in each PC), and the principal component scores of the five observations. What’s the variance in PC1?

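PCA on correlation matrix (scale.=T)
[Output listing with the same layout: standard deviations, rotation matrix with columns PC1 and PC2, and PC scores for the five observations; the numeric values were not preserved in this transcript.]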

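The Eigenvalue Problem
The covariance matrix:

$$A = \begin{pmatrix} 1 & 1.414 \\ 1.414 & 2 \end{pmatrix}$$

The eigenvalues are the set of values $\lambda$ that satisfy the condition $|A - \lambda I| = 0$. For our matrix this gives $(1 - \lambda)(2 - \lambda) - 1.414^2 \approx \lambda^2 - 3\lambda = 0$, so the resulting eigenvalues are $\lambda_1 = 3$ and $\lambda_2 = 0$ (there are n eigenvalues for n variables). The sum of the eigenvalues is equal to the sum of the variances in the covariance matrix. Finding the eigenvalues and eigenvectors is called an eigenvalue problem (or a characteristic value problem).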
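The eigenvalues of the 2x2 covariance matrix can also be obtained with R's `eigen()`. The sketch below assumes the off-diagonal entry 1.414 is $\sqrt{2}$ rounded; the object names `S` and `e` are illustrative.

```r
# Covariance matrix of the simple data set, assuming 1.414 is sqrt(2) rounded
S <- matrix(c(1, sqrt(2), sqrt(2), 2), nrow = 2,
            dimnames = list(c("X", "Y"), c("X", "Y")))

e <- eigen(S)
e$values        # eigenvalues, largest first

sum(e$values)   # sum of the eigenvalues ...
sum(diag(S))    # ... equals the sum of the variances (the total variance)
```

Because of floating-point arithmetic, the second eigenvalue may be printed as a very small number rather than exactly zero.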

Get the Eigenvectors
An eigenvector is a vector x that satisfies the following condition: $A x = \lambda x$, where $\lambda$ is the corresponding eigenvalue. In our case A is a variance-covariance matrix of order 2, and x is a vector specified by $x_1$ and $x_2$.

Get the Eigenvectors
We want to find an eigenvector of unit length, i.e., $x_1^2 + x_2^2 = 1$. Combining this constraint with the eigenvector condition from the previous slide and solving for $x_1$ gives the eigenvector. The first eigenvector is the one associated with the largest eigenvalue.
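The calculation can be written out as follows. This is a reconstruction under stated assumptions: the covariance matrix is taken to be $A = \begin{pmatrix} 1 & \sqrt{2} \\ \sqrt{2} & 2 \end{pmatrix}$ (with 1.414 read as $\sqrt{2}$), whose eigenvalues are 3 and 0 because they sum to the total variance 3 and the determinant $1 \cdot 2 - \sqrt{2} \cdot \sqrt{2}$ is 0.

$$
\begin{aligned}
A x = 3x:\quad & x_1 + \sqrt{2}\, x_2 = 3 x_1 \;\Rightarrow\; x_2 = \sqrt{2}\, x_1 \\
x_1^2 + x_2^2 = 1:\quad & x_1^2 + 2 x_1^2 = 1 \;\Rightarrow\; x_1 = \tfrac{1}{\sqrt{3}} \approx 0.577,\quad x_2 = \sqrt{\tfrac{2}{3}} \approx 0.816 \\
A x = 0:\quad & x_1 + \sqrt{2}\, x_2 = 0 \;\Rightarrow\; x_1 = -\sqrt{2}\, x_2 \\
x_1^2 + x_2^2 = 1:\quad & 2 x_2^2 + x_2^2 = 1 \;\Rightarrow\; x_2 = \tfrac{1}{\sqrt{3}} \approx 0.577,\quad x_1 = -\sqrt{\tfrac{2}{3}} \approx -0.816
\end{aligned}
$$

The sign of each eigenvector is arbitrary, so software may report either vector multiplied by -1.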

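Get the PC Scores
The PC scores are obtained by multiplying the original data (x and y) by the eigenvectors: the first eigenvector gives the first PC score and the second eigenvector gives the second PC score. The original data in a two-dimensional space are reduced to one dimension.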
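The score calculation can be sketched in R. The five data points below are hypothetical, chosen only so that var(x) = 1, var(y) = 2 and the correlation is 1, as in the simple data set; the original values were not preserved. The names `X`, `Xc` and `scores` are illustrative.

```r
# Hypothetical data with var(x) = 1, var(y) = 2 and perfect correlation
x <- c(-2, -1, 0, 1, 2) / sqrt(2.5)
y <- sqrt(2) * x
X <- cbind(x, y)

e  <- eigen(cov(X))                            # eigenvalues and eigenvectors
Xc <- scale(X, center = TRUE, scale = FALSE)   # center each column at its mean

scores <- Xc %*% e$vectors                     # PC scores = centered data times eigenvectors
colnames(scores) <- c("PC1", "PC2")

round(scores, 6)
round(apply(scores, 2, var), 6)                # variance of each PC
```

All of the variance ends up in PC1, and the PC2 scores are zero up to rounding error, which is the sense in which the two-dimensional data reduce to one dimension.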

Crime Data in 50 States
PROC PRINCOMP OUT=CRIMCOMP;

Columns: STATE, MURDER, RAPE, ROBBE, ASSAU, BURGLA, LARCEN, AUTO. [The numeric values of the table were not preserved in this transcript.]
Rows: Alabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Florida, Georgia, Hawaii, Idaho, Illinois, Indiana, Iowa, Kansas, Kentucky, Louisiana, Maine, Maryland, Massachusetts, Michigan, Minnesota, Mississippi, Missouri, Montana, Nebraska, Nevada, New Hampshire, New Jersey

Crime data (cont.)
Rows (continued): New Mexico, New York, North Carolina, North Dakota, Ohio, Oklahoma, Oregon, Pennsylvania, Rhode Island, South Carolina, South Dakota, Tennessee, Texas, Utah, Vermont, Virginia, Washington, West Virginia, Wisconsin, Wyoming

```r
# Fixed-width fields: the state name followed by the seven crime variables
md <- read.fwf("crime.txt", c(14, 6, 5, 6, 6, 7, 7, 6), header = TRUE)
attach(md)
cor(md[, 2:8])                               # correlation matrix of the crime variables
objPCA <- prcomp(md[, 2:8], scale. = TRUE)   # PCA on the correlation matrix
objPCA
summary(objPCA)
PCScore <- predict(objPCA, md)               # PC scores for each state
```

If you copy the data to a text file, add a top line with a comment sign #; otherwise you need to specify 'sep=' with read.fwf.

Correlation Matrix
[7x7 correlation matrix of MURDER, RAPE, ROBBERY, ASSAULT, BURGLARY, LARCENY and AUTO; the numeric values were not preserved in this transcript.]
If the variables were not correlated, there would be no point in doing PCA. The correlation matrix is symmetric, so we only need to inspect either the upper or the lower triangular matrix.

Eigenvalues

```r
summary(objPCA)
screeplot(objPCA, type = "lines")
```

[Output of summary(objPCA): an "Importance of components" table giving the standard deviation, proportion of variance and cumulative proportion for PC1 to PC7; the numeric values were not preserved in this transcript.]
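The quantities in the "Importance of components" table can be recomputed from the standard deviations. Since the crime data values are not available here, this sketch uses `USArrests`, a data set shipped with R, purely as a stand-in; it has four variables rather than seven, and the names `pca`, `eigenvalues`, `prop` and `cumprop` are illustrative.

```r
# USArrests ships with R; it stands in for the slide's crime data (an assumption)
pca <- prcomp(USArrests, scale. = TRUE)   # PCA on the correlation matrix

eigenvalues <- pca$sdev^2                 # variance accounted for by each PC
prop    <- eigenvalues / sum(eigenvalues) # proportion of variance
cumprop <- cumsum(prop)                   # cumulative proportion

round(rbind(eigenvalues, prop, cumprop), 4)

sum(eigenvalues)   # equals the number of variables for a correlation-matrix PCA
```

A scree plot shows these eigenvalues in decreasing order; a common informal rule is to keep the components before the curve levels off.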

Eigenvectors
[Table of eigenvectors: loadings of MURDER, RAPE, ROBBE, ASSAU, BURGLA, LARCEN and AUTO on PC1 to PC7; the numeric values were not preserved in this transcript.]
Do these eigenvectors mean anything? All crimes have negative loadings on the first eigenvector, which is therefore interpreted as a measure of overall safety. The 2nd eigenvector has positive loadings on AUTO, LARCENY and ROBBERY and negative loadings on MURDER, ASSAULT and RAPE. It is interpreted as measuring the preponderance of property crime over violent crime.

biplot(objPCA)

Plot PC1 and PC2
[Figure: scatter plot of the 50 states, labelled by name, with PC1 on the horizontal axis (about -4 to 4) and PC2 on the vertical axis (about -2 to 2). Massachusetts and Rhode Island appear near the top of the plot; South Carolina, Louisiana, Alabama and Mississippi appear near the bottom.]

PC Plot: Crime Data
Groups of states highlighted on the plot: Maryland; Nevada, New York, California; North and South Dakota; Mississippi, Alabama, Louisiana, South Carolina.

Steps in a PCA

1. Generate a correlation or variance-covariance matrix.
2. Obtain eigenvalues and eigenvectors.
3. Generate principal component (PC) scores.
4. Choose the number of PCs.
5. Plot the PC scores in the space with reduced dimensions.

When to use a correlation matrix:

1. When different units are used for different variables.
2. When data are from species of very different mean densities.
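The five steps can be traced by hand in R without calling `prcomp()`. This is a sketch under assumptions: `USArrests` (shipped with R) stands in for the crime data, the 80% cutoff used to choose the number of PCs is an arbitrary illustrative threshold, and the names `dat`, `R`, `e`, `Z`, `scores` and `k` are made up for the example.

```r
dat <- USArrests                     # stand-in data set shipped with R (an assumption)

R <- cor(dat)                        # 1. correlation matrix (the variables use different units)
e <- eigen(R)                        # 2. eigenvalues and eigenvectors

Z <- scale(dat)                      # standardized data, to match the correlation matrix
scores <- Z %*% e$vectors            # 3. PC scores
colnames(scores) <- paste0("PC", seq_len(ncol(scores)))

# 4. choose the number of PCs: the smallest number explaining at least 80% of the variance
k <- which(cumsum(e$values) / sum(e$values) >= 0.8)[1]
k

# 5. plot the scores in the reduced space, labelling each point with its row name
plot(scores[, 1], scores[, 2], xlab = "PC1", ylab = "PC2", type = "n")
text(scores[, 1], scores[, 2], labels = rownames(dat), cex = 0.7)
```

Using `cor()` together with `scale()` corresponds to `prcomp(dat, scale. = TRUE)`; replacing them with `cov()` and centering only would give the covariance-matrix version.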