Slide 1
1er. Escuela Red ProTIC - Tandil, 18-28 de Abril, 2006
6. Principal Component Analysis
Principal component analysis (PCA) is a technique that is useful for the compression and classification of data. Its purpose is to reduce the dimensionality of a data set (sample) by finding a new set of variables, smaller than the original set, that nonetheless retains most of the sample's information.
Slide 2
By "information" we mean the variation present in the sample, given by the correlations between the original variables. The new variables, called principal components (PCs), are uncorrelated and are ordered by the fraction of the total information each retains.
Slide 3
A principal component can be defined in two ways:
- the direction of maximum variance in the input space
- a principal eigenvector of the covariance matrix
Goal: relate these two definitions.
Slide 4
Variance: a random variable fluctuates about its mean value; the variance is the average of the square of the fluctuations,
var[x] = E[(x - E[x])^2].
Slide 5
Covariance: a pair of random variables, each fluctuating about its mean value; the covariance is the average of the product of the fluctuations,
cov[x, y] = E[(x - E[x])(y - E[y])].
Slide 7
Covariance matrix: for N random variables, the N x N symmetric matrix C with elements C_ij = cov[x_i, x_j]. Its diagonal elements are the variances.
Slide 8
Principal components: the eigenvectors of C with the k largest eigenvalues. Now you can calculate them, but what do they mean?
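The recipe on this slide can be sketched in a few lines of NumPy. This is an illustrative example on synthetic data (the array shapes and the choice k = 2 are assumptions, not from the slides): center the data, form the covariance matrix, and keep the eigenvectors with the k largest eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # 200 observations of 3 variables (synthetic)
X = X - X.mean(axis=0)                   # measure each variable about its sample mean

C = np.cov(X, rowvar=False)              # 3x3 symmetric covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns eigenvalues in ascending order

order = np.argsort(eigvals)[::-1]        # re-sort into descending order
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

k = 2
pcs = eigvecs[:, :k]                     # eigenvectors with the k largest eigenvalues
```

`np.linalg.eigh` is the right routine here because C is symmetric; it returns orthonormal eigenvectors, so each column of `pcs` is already a unit vector.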
Slide 9
Covariance to variance: from the covariance matrix, the variance of any projection can be calculated. For a unit vector w,
var[w^T x] = w^T C w.
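This identity is easy to check numerically. A minimal sketch on assumed synthetic data: the sample variance of the projected observations matches w^T C w for a unit vector w.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])  # correlated data
X = X - X.mean(axis=0)
C = np.cov(X, rowvar=False)

w = np.array([1.0, 1.0])
w = w / np.linalg.norm(w)         # w must be a unit vector

var_direct = (X @ w).var(ddof=1)  # sample variance of the projection
var_from_C = w @ C @ w            # variance predicted from the covariance matrix
```

The two numbers agree to machine precision, which is the content of the slide: C alone determines the variance along any direction.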
Slide 10
Maximizing the projected variance: the unit vector w that maximizes w^T C w is the principal eigenvector of C (the one with the largest eigenvalue).
Slide 11
Geometric picture of principal components:
- the 1st PC z_1 is a minimum-distance fit to a line in X space
- the 2nd PC z_2 is a minimum-distance fit to a line in the plane perpendicular to the 1st PC
[Slides 12-26: figures and worked equations; no text survives in the transcript.]
Slide 27
Algebraic definition of PCs. Given a sample of n observations on a vector of p variables x = (x_1, x_2, …, x_p), define the first principal component of the sample by the linear transformation
z_1 = a_1^T x = Σ_{i=1}^p a_{i1} x_i,
where the vector a_1 = (a_11, a_21, …, a_p1) is chosen such that var[z_1] is maximum.
Slide 28
Likewise, define the k-th PC of the sample by the linear transformation
z_k = a_k^T x,  k = 1, …, p,
where the vector a_k is chosen such that var[z_k] is maximum, subject to cov[z_k, z_l] = 0 for l < k and to a_k^T a_k = 1.
Slide 29
To find a_1, first note that
var[z_1] = E[(z_1 - E[z_1])^2] = a_1^T S a_1,
where S is the covariance matrix for the variables x.
Slide 30
To find a_1, maximize a_1^T S a_1 subject to a_1^T a_1 = 1. Let λ be a Lagrange multiplier and maximize
a_1^T S a_1 - λ(a_1^T a_1 - 1).
Differentiating with respect to a_1 gives
S a_1 - λ a_1 = 0, i.e. (S - λ I_p) a_1 = 0,
therefore a_1 is an eigenvector of S corresponding to eigenvalue λ.
Slide 31
Algebraic derivation of a_1. We have maximized
var[z_1] = a_1^T S a_1 = λ a_1^T a_1 = λ,
so λ is the largest eigenvalue of S, written λ_1. The first PC retains the greatest amount of variation in the sample.
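This conclusion can be sanity-checked numerically. A sketch on assumed synthetic data: the sample variance of z_1 = a_1^T x equals the largest eigenvalue of S, and no random unit direction yields a larger projected variance.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 4)) @ rng.normal(size=(4, 4))  # correlated synthetic sample
X = X - X.mean(axis=0)
S = np.cov(X, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(S)   # ascending eigenvalue order
a1 = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
lam1 = eigvals[-1]

z1 = X @ a1                            # scores of the first PC
var_z1 = z1.var(ddof=1)                # equals lam1

# compare against many random unit vectors: none beats the principal eigenvector
others = rng.normal(size=(100, 4))
others = others / np.linalg.norm(others, axis=1, keepdims=True)
other_vars = np.array([(X @ w).var(ddof=1) for w in others])
```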
Slide 32
To find the vector a_2, maximize var[z_2] = a_2^T S a_2 subject to cov[z_2, z_1] = 0 and a_2^T a_2 = 1. First note that
cov[z_2, z_1] = a_1^T S a_2 = λ_1 a_1^T a_2,
then let λ and φ be Lagrange multipliers and maximize
a_2^T S a_2 - λ(a_2^T a_2 - 1) - φ a_2^T a_1.
Slide 33
We find that a_2 is also an eigenvector of S, the one whose eigenvalue λ_2 is the second largest. In general
var[z_k] = a_k^T S a_k = λ_k:
the k-th largest eigenvalue of S is the variance of the k-th PC, and the k-th PC retains the k-th greatest fraction of the variation in the sample.
Slide 34
Algebraic formulation of PCA. Given a sample of n observations on a vector of p variables x, define a vector of p PCs according to
z = A^T x,
where A is an orthogonal p x p matrix whose k-th column a_k is the k-th eigenvector of S.
Slide 35
Then
Λ = A^T S A
is the covariance matrix of the PCs, being diagonal with elements λ_1, …, λ_p.
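A quick numerical check of this statement, on assumed synthetic data: rotating the sample into the eigenbasis diagonalizes the covariance, so the PCs come out uncorrelated with variances equal to the eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(250, 3)) @ rng.normal(size=(3, 3))  # correlated synthetic sample
X = X - X.mean(axis=0)
S = np.cov(X, rowvar=False)

lam, A = np.linalg.eigh(S)        # columns of A are the eigenvectors a_k
Z = X @ A                         # PC scores, z = A^T x applied to every observation
Lambda = np.cov(Z, rowvar=False)  # covariance matrix of the PCs: diagonal
```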
Slide 36
In general it is useful to define standardized variables by
x'_k = x_k / (s_kk)^{1/2},  k = 1, …, p,
where s_kk is the sample variance of x_k. If the x_k are each measured about their sample mean, then the covariance matrix of x' = (x'_1, …, x'_p) will be equal to the correlation matrix of x, and the PCs will be dimensionless.
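This equivalence is easy to verify numerically. A sketch with synthetic data whose variables are given deliberately mismatched scales (an assumption made for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 3)) * np.array([1.0, 10.0, 100.0])  # very different scales
X = X - X.mean(axis=0)                  # measure about the sample mean

Xs = X / X.std(axis=0, ddof=1)          # x'_k = x_k / sqrt(s_kk)

cov_std = np.cov(Xs, rowvar=False)      # covariance of the standardized variables
corr = np.corrcoef(X, rowvar=False)     # correlation matrix of the originals
```

The two matrices coincide, which is why PCA on standardized variables is the same as PCA on the correlation matrix.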
Slide 37
Practical computation of PCs. Given a sample of n observations on a vector of p variables x (each measured about its sample mean), compute the covariance matrix
S = (1 / (n - 1)) X^T X,
where X is the n x p matrix whose i-th row is the i-th observation x_i^T.
Slide 38
Then compute the n x p matrix Z whose i-th row is the PC score z_i^T for the i-th observation:
Z = X A.
Write X = Z A^T to decompose each observation into PCs:
x_i^T = z_i^T A^T = Σ_{k=1}^p z_{ik} a_k^T.
Slide 39
Usage of PCA: data compression. Because the k-th PC retains the k-th greatest fraction of the variation, we can approximate each observation by truncating the sum at the first m < p PCs:
x_i^T ≈ Σ_{k=1}^m z_{ik} a_k^T.
Slide 40
This reduces the dimensionality of the data from p to m < p by approximating
X ≈ Z_m A_m^T,
where Z_m is the n x m portion of Z and A_m is the p x m portion of A.
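Slides 37-40 can be put together as one short sketch. The data and the choice m = 2 are assumptions for illustration; the steps are exactly the ones above: covariance from the centered data matrix, eigenvectors, scores, then a rank-m reconstruction.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, m = 100, 5, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))  # correlated synthetic sample
X = X - X.mean(axis=0)                 # each variable measured about its sample mean

S = X.T @ X / (n - 1)                  # covariance matrix, slide 37
lam, A = np.linalg.eigh(S)
A = A[:, ::-1]                         # columns in descending eigenvalue order
Z = X @ A                              # PC scores, Z = X A (slide 38)

X_hat = Z[:, :m] @ A[:, :m].T          # rank-m approximation Z_m A_m^T (slide 40)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

Storing Z_m and A_m costs n*m + p*m numbers instead of n*p, and `rel_err` measures how much variation the discarded p - m PCs carried.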