# Principal Component Analysis (1er. Escuela Red ProTIC, Tandil, April 18-28, 2006)

## Presentation transcript

Principal component analysis (PCA) is a technique useful for the compression and classification of data. The purpose is to reduce the dimensionality of a data set (sample) by finding a new set of variables, smaller than the original set, that nonetheless retains most of the sample's information.

By information we mean the variation present in the sample, given by the correlations between the original variables. The new variables, called principal components (PCs), are uncorrelated and are ordered by the fraction of the total information each retains.

A principal component is a direction of maximum variance in the input space; it is also a principal eigenvector of the covariance matrix. Goal: relate these two definitions.

Variance: a random variable $x$ fluctuates about its mean value $\langle x \rangle$; the variance is the average of the square of the fluctuations,

$$\operatorname{var}[x] = \langle (x - \langle x \rangle)^2 \rangle.$$

Covariance: a pair of random variables $x_i$ and $x_j$ each fluctuate about their mean values; the covariance is the average of the product of the fluctuations,

$$\operatorname{cov}[x_i, x_j] = \langle (x_i - \langle x_i \rangle)(x_j - \langle x_j \rangle) \rangle.$$

Covariance matrix: for $N$ random variables, the covariances $C_{ij} = \operatorname{cov}[x_i, x_j]$ form an $N \times N$ symmetric matrix $C$ whose diagonal elements are the variances.

Principal components: the eigenvectors of the covariance matrix with the $k$ largest eigenvalues. Now you can calculate them, but what do they mean?
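As a sketch of this computation (not part of the slides; the data and names are illustrative), the covariance matrix of a centered sample and the eigenvectors with the k largest eigenvalues can be obtained with NumPy:

```python
import numpy as np

# Illustrative sample: n = 200 observations of p = 3 correlated variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.5, 0.2, 0.3]])
X = X - X.mean(axis=0)        # fluctuations about the sample mean

C = (X.T @ X) / len(X)        # N x N symmetric covariance matrix

# eigh is for symmetric matrices; it returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]          # largest eigenvalue first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
principal_components = eigvecs[:, :k]      # eigenvectors with the k largest eigenvalues
```

`np.linalg.eigh` is used rather than `eig` because the covariance matrix is symmetric; since it returns eigenvalues in ascending order, the columns are reordered before taking the first k.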

Covariance to variance: from the covariance matrix, the variance of any projection can be calculated. For a unit vector $w$, the variance of the projection of $x$ onto $w$ is

$$\operatorname{var}[w^T x] = w^T C w.$$

Maximizing the projected variance: the unit vector $w$ that maximizes $w^T C w$ is the principal eigenvector of $C$ (the one with the largest eigenvalue).
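This claim can be checked numerically. The sketch below (illustrative data and names, not from the slides) compares $w^T C w$ for the principal eigenvector against many random unit vectors:

```python
import numpy as np

# Illustrative anisotropic cloud: most variance along the first axis.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.5])
X = X - X.mean(axis=0)
C = (X.T @ X) / len(X)

eigvals, eigvecs = np.linalg.eigh(C)
w_star = eigvecs[:, -1]                  # principal eigenvector (largest eigenvalue)
best = w_star @ C @ w_star               # variance of the projection onto w_star

# No random unit vector should give a larger projected variance.
for _ in range(1000):
    w = rng.normal(size=3)
    w /= np.linalg.norm(w)
    assert w @ C @ w <= best + 1e-9
```

Note that the maximum projected variance achieved, `best`, coincides with the largest eigenvalue of C, as the derivation later in the deck shows.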

Geometric picture of principal components: the 1st PC $z_1$ is a minimum-distance fit to a line in $x$ space; the 2nd PC $z_2$ is a minimum-distance fit to a line in the plane perpendicular to the 1st PC.

Algebraic definition of PCs. Given a sample of $n$ observations on a vector of $p$ variables $x = (x_1, x_2, \ldots, x_p)$, define the first principal component of the sample by the linear transformation

$$z_1 = a_1^T x = \sum_{i=1}^{p} a_{i1} x_i,$$

where the vector $a_1 = (a_{11}, a_{21}, \ldots, a_{p1})$ is chosen such that $\operatorname{var}[z_1]$ is maximum.

Likewise, define the $k$th PC of the sample by the linear transformation

$$z_k = a_k^T x,$$

where the vector $a_k$ is chosen such that $\operatorname{var}[z_k]$ is maximum, subject to $a_k^T a_k = 1$ and $\operatorname{cov}[z_k, z_l] = 0$ for $l < k$.

To find $a_1$, first note that

$$\operatorname{var}[z_1] = \langle z_1^2 \rangle - \langle z_1 \rangle^2 = a_1^T S a_1,$$

where

$$S = \langle x x^T \rangle - \langle x \rangle \langle x \rangle^T$$

is the covariance matrix for the variables $x$.

To find $a_1$, maximize $\operatorname{var}[z_1] = a_1^T S a_1$ subject to $a_1^T a_1 = 1$. Let $\lambda$ be a Lagrange multiplier; then maximize

$$a_1^T S a_1 - \lambda (a_1^T a_1 - 1).$$

By differentiating with respect to $a_1$,

$$S a_1 - \lambda a_1 = 0,$$

therefore $a_1$ is an eigenvector of $S$ corresponding to eigenvalue $\lambda$.

We have maximized

$$\operatorname{var}[z_1] = a_1^T S a_1 = \lambda a_1^T a_1 = \lambda.$$

So $\lambda$ is the largest eigenvalue of $S$, denoted $\lambda_1$. The first PC retains the greatest amount of variation in the sample.

To find the vector $a_2$, maximize $\operatorname{var}[z_2] = a_2^T S a_2$ subject to $a_2^T a_2 = 1$ and $\operatorname{cov}[z_2, z_1] = 0$. First note that

$$\operatorname{cov}[z_2, z_1] = a_1^T S a_2 = \lambda_1 a_1^T a_2,$$

then let $\lambda$ and $\varphi$ be Lagrange multipliers, and maximize

$$a_2^T S a_2 - \lambda (a_2^T a_2 - 1) - \varphi\, a_2^T a_1.$$

We find that $a_2$ is also an eigenvector of $S$, whose eigenvalue $\lambda_2$ is the second largest. In general,

$$\operatorname{var}[z_k] = a_k^T S a_k = \lambda_k.$$

The $k$th largest eigenvalue of $S$ is the variance of the $k$th PC; the $k$th PC retains the $k$th greatest fraction of the variation in the sample.

Algebraic formulation of PCA. Given a sample of $n$ observations on a vector of $p$ variables $x$, define a vector of $p$ PCs according to

$$z = A^T x,$$

where $A$ is an orthogonal $p \times p$ matrix whose $k$th column is the $k$th eigenvector $a_k$ of $S$.

Then

$$\Lambda = A^T S A = \operatorname{diag}(\lambda_1, \ldots, \lambda_p)$$

is the covariance matrix of the PCs, being diagonal with elements $\lambda_k$.

In general it is useful to define standardized variables by

$$x_k^* = \frac{x_k}{\sqrt{\operatorname{var}[x_k]}}.$$

If the $x_k$ are each measured about their sample mean, then the covariance matrix of $x^*$ will be equal to the correlation matrix of $x$, and the PCs will be dimensionless.
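A sketch of this standardization (illustrative data; `X_star` plays the role of $x^*$): after dividing each mean-centered column by its standard deviation, the covariance matrix of the standardized variables matches the correlation matrix of the originals.

```python
import numpy as np

# Illustrative variables with very different scales (think different units).
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3)) * np.array([10.0, 1.0, 0.1])
X = X - X.mean(axis=0)                  # measure each variable about its sample mean

X_star = X / X.std(axis=0)              # standardized variables x* = x / sqrt(var[x])
S_star = (X_star.T @ X_star) / len(X)   # covariance matrix of the standardized variables

R = np.corrcoef(X, rowvar=False)        # correlation matrix of the originals
```

The diagonal of `S_star` is all ones (each standardized variable has unit variance), which is why PCA on standardized variables is the same as PCA on the correlation matrix.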

Practical computation of PCs. Given a sample of $n$ observations on a vector of $p$ variables $x$ (each measured about its sample mean), compute the covariance matrix

$$S = \frac{1}{n} X^T X,$$

where $X$ is the $n \times p$ matrix whose $i$th row is the $i$th observation $x_i^T$.

Then, compute the $n \times p$ matrix $Z$ whose $i$th row is the PC score $z_i^T$ for the $i$th observation:

$$Z = X A.$$

Write $X = Z A^T$ to decompose each observation into PCs:

$$x_i = \sum_{k=1}^{p} z_{ik}\, a_k.$$

Usage of PCA: data compression. Because the $k$th PC retains the $k$th greatest fraction of the variation, we can approximate each observation by truncating the sum at the first $m < p$ PCs:

$$x_i \approx \sum_{k=1}^{m} z_{ik}\, a_k.$$

This reduces the dimensionality of the data from $p$ to $m < p$ by approximating

$$X \approx \tilde{Z} \tilde{A}^T,$$

where $\tilde{Z}$ is the $n \times m$ portion of $Z$ and $\tilde{A}$ is the $p \times m$ portion of $A$.
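The whole pipeline, covariance, eigenvectors, scores, and rank-m truncation, can be sketched end to end (illustrative data, not from the slides; `Z_m` and `A_m` denote the n x m portion of Z and the p x m portion of A):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, m = 200, 5, 2
# Data concentrated near a 2-D subspace plus a little noise,
# so m = 2 PCs should capture almost all of the variation.
X = rng.normal(size=(n, 2)) @ rng.normal(size=(2, p))
X = X + 0.01 * rng.normal(size=(n, p))
X = X - X.mean(axis=0)

S = (X.T @ X) / n
eigvals, A = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
eigvals, A = eigvals[order], A[:, order]

Z = X @ A                  # n x p matrix of PC scores
Z_m = Z[:, :m]             # n x m portion of Z
A_m = A[:, :m]             # p x m portion of A
X_hat = Z_m @ A_m.T        # rank-m approximation of the data

rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

With data this close to a 2-D subspace the relative reconstruction error is tiny; for real data one typically chooses m by looking at the fraction of total variance retained, $\sum_{k \le m} \lambda_k / \sum_k \lambda_k$.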
