1er. Escuela Red ProTIC - Tandil, April 2006

6. Principal Component Analysis

Principal component analysis (PCA) is a technique that is useful for the compression and classification of data. The purpose is to reduce the dimensionality of a data set (sample) by finding a new set of variables, smaller than the original set, that nonetheless retains most of the sample's information.

By information we mean the variation present in the sample, given by the correlations between the original variables. The new variables, called principal components (PCs), are uncorrelated, and are ordered by the fraction of the total information each retains.

A principal component can be defined in two ways:

- the direction of maximum variance in the input space
- the principal eigenvector of the covariance matrix

Goal: relate these two definitions.

Variance: a random variable x fluctuates about its mean value ⟨x⟩; the variance is the average of the square of the fluctuations,

var[x] = ⟨(x − ⟨x⟩)²⟩.

Covariance: for a pair of random variables xᵢ and xⱼ, each fluctuating about its mean value, the covariance is the average of the product of the fluctuations,

cov[xᵢ, xⱼ] = ⟨(xᵢ − ⟨xᵢ⟩)(xⱼ − ⟨xⱼ⟩)⟩.

Covariance matrix: for N random variables x₁, …, x_N, the covariances Cᵢⱼ = cov[xᵢ, xⱼ] form an N × N symmetric matrix C whose diagonal elements are the variances, Cᵢᵢ = var[xᵢ].
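The covariance matrix is easy to form directly from a sample. A minimal NumPy sketch (the random sample and all variable names here are illustrative assumptions, not part of the lecture):

```python
# Sketch: empirical covariance matrix of a sample of N = 3 variables.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))      # 1000 observations of 3 variables
Xc = X - X.mean(axis=0)             # fluctuations about the sample mean
C = (Xc.T @ Xc) / len(Xc)           # N x N covariance matrix

# Symmetric, with the variances on the diagonal:
assert np.allclose(C, C.T)
assert np.allclose(np.diag(C), Xc.var(axis=0))
```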

Principal components: the eigenvectors of C with the k largest eigenvalues. Now you can calculate them, but what do they mean?
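Computing them is indeed mechanical. A sketch of extracting the eigenvectors with the k largest eigenvalues (again with invented sample data; `numpy.linalg.eigh` is the standard routine for symmetric matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
Xc = X - X.mean(axis=0)
C = (Xc.T @ Xc) / len(Xc)           # symmetric covariance matrix

# eigh returns eigenvalues of a symmetric matrix in ascending order
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]   # re-sort to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

k = 2
pcs = eigvecs[:, :k]                # eigenvectors with the k largest eigenvalues
```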

From covariance to variance: from the covariance matrix, the variance of any projection can be calculated. For a unit vector w, the variance of the projection wᵀx is

var[wᵀx] = ⟨(wᵀx − ⟨wᵀx⟩)²⟩ = wᵀC w.
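This identity is worth checking numerically. A sketch (the anisotropic test sample and the direction `w` are my own choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 3)) @ np.diag([3.0, 1.0, 0.5])   # anisotropic sample
Xc = X - X.mean(axis=0)
C = (Xc.T @ Xc) / len(Xc)

w = np.array([1.0, 1.0, 0.0])
w /= np.linalg.norm(w)              # unit vector

proj_var_direct = (Xc @ w).var()    # variance of the projected data
proj_var_from_C = w @ C @ w         # w^T C w from the covariance matrix
assert np.allclose(proj_var_direct, proj_var_from_C)
```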

Maximizing the projected variance: the unit vector w that maximizes wᵀC w is the principal eigenvector of C (the one with the largest eigenvalue).

Geometric picture of principal components:

- the 1st PC z₁ is a minimum-distance fit to a line in X space
- the 2nd PC z₂ is a minimum-distance fit to a line in the plane perpendicular to the 1st PC
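The minimum-distance picture can also be verified numerically: among lines through the mean, the one along the principal eigenvector minimizes the mean squared perpendicular distance to the points. A sketch under assumed 2-D test data (the helper `sq_dist_to_line` is mine, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(400, 2)) @ np.diag([3.0, 0.5])   # elongated 2-D cloud
X = X - X.mean(axis=0)
C = (X.T @ X) / len(X)
a1 = np.linalg.eigh(C)[1][:, -1]    # 1st PC direction (largest eigenvalue)

def sq_dist_to_line(w):
    """Mean squared perpendicular distance from the (centered) points
    to the line through the origin along unit vector w."""
    return np.mean(np.sum(X ** 2, axis=1) - (X @ w) ** 2)

# No other direction gives a closer line fit than the 1st PC:
trials = rng.normal(size=(100, 2))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
assert all(sq_dist_to_line(a1) <= sq_dist_to_line(w) + 1e-12 for w in trials)
```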

Algebraic definition of PCs

Given a sample of n observations on a vector of p variables x = (x₁, x₂, …, xₚ), define the first principal component of the sample by the linear transformation

z₁ = a₁ᵀx = Σᵢ aᵢ₁ xᵢ (sum over i = 1, …, p),

where the vector a₁ = (a₁₁, a₂₁, …, aₚ₁)ᵀ is chosen such that var[z₁] is maximum.

Likewise, define the kth PC of the sample by the linear transformation

zₖ = aₖᵀx, k = 1, …, p,

where the vector aₖ is chosen such that var[zₖ] is maximum, subject to cov[zₖ, zₗ] = 0 for k > l ≥ 1 and to aₖᵀaₖ = 1.

To find a₁, first note that

var[z₁] = ⟨z₁²⟩ − ⟨z₁⟩² = a₁ᵀC a₁,

where C = ⟨xxᵀ⟩ − ⟨x⟩⟨x⟩ᵀ is the covariance matrix for the variables x.

To find a₁, maximize var[z₁] = a₁ᵀC a₁ subject to a₁ᵀa₁ = 1. Let λ be a Lagrange multiplier and maximize

a₁ᵀC a₁ − λ(a₁ᵀa₁ − 1);

by differentiating with respect to a₁,

C a₁ − λ a₁ = 0,

therefore a₁ is an eigenvector of C corresponding to the eigenvalue λ.

Algebraic derivation of a₁: we have maximized

var[z₁] = a₁ᵀC a₁ = λ a₁ᵀa₁ = λ.

So λ₁ is the largest eigenvalue of C, and the first PC retains the greatest amount of variation in the sample.
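A quick numerical check of this result: the largest eigenvalue equals the variance along its eigenvector, and no other unit vector gives more projected variance. A sketch with assumed test data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 3)) @ np.diag([2.0, 1.0, 0.3])
Xc = X - X.mean(axis=0)
C = (Xc.T @ Xc) / len(Xc)

eigvals, eigvecs = np.linalg.eigh(C)      # ascending eigenvalues
lam1, a1 = eigvals[-1], eigvecs[:, -1]    # largest eigenvalue, its eigenvector

# var[z1] = a1^T C a1 = lambda1 ...
assert np.allclose(a1 @ C @ a1, lam1)
# ... and no other unit vector does better:
trials = rng.normal(size=(200, 3))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
assert all(w @ C @ w <= lam1 + 1e-12 for w in trials)
```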

To find the vector a₂, maximize var[z₂] = a₂ᵀC a₂ subject to a₂ᵀa₂ = 1 and cov[z₂, z₁] = 0. First note that

cov[z₂, z₁] = a₂ᵀC a₁ = λ₁ a₂ᵀa₁,

then let λ and φ be Lagrange multipliers, and maximize

a₂ᵀC a₂ − λ(a₂ᵀa₂ − 1) − φ a₂ᵀa₁.

We find that a₂ is also an eigenvector of C, the one whose eigenvalue λ₂ is the second largest. In general, the kth largest eigenvalue λₖ of C is the variance of the kth PC,

var[zₖ] = aₖᵀC aₖ = λₖ,

and the kth PC zₖ retains the kth greatest fraction of the variation in the sample.

Algebraic formulation of PCA

Given a sample of n observations on a vector of p variables x, define a vector of p PCs

z = Aᵀx,

where A is an orthogonal p × p matrix whose kth column aₖ is the kth eigenvector of C.

Then

Λ = AᵀC A = cov[z]

is the covariance matrix of the PCs, being diagonal with elements λ₁, λ₂, …, λₚ.

In general it is useful to define standardized variables by

xₖ* = xₖ / (var[xₖ])^½.

If the xₖ are each measured about their sample mean, then the covariance matrix of x* = (x₁*, x₂*, …, xₚ*) will be equal to the correlation matrix of x, and the PCs will be dimensionless.
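The equivalence between the covariance of standardized variables and the correlation matrix of the originals is easy to confirm. A sketch (the wildly different column scales are an assumption chosen to make the point):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(500, 3)) * np.array([10.0, 1.0, 0.1])  # mixed scales
Xc = X - X.mean(axis=0)             # measured about the sample mean
Xstar = Xc / Xc.std(axis=0)         # x_k* = x_k / sqrt(var[x_k])

C_star = (Xstar.T @ Xstar) / len(Xstar)   # covariance of standardized vars
R = np.corrcoef(X, rowvar=False)          # correlation matrix of originals
assert np.allclose(C_star, R)
```

Standardizing matters in practice: without it, variables with large units dominate the leading PCs regardless of how informative they are.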

Practical computation of PCs

Given a sample of n observations on a vector of p variables x (each measured about its sample mean), compute the covariance matrix

C = (1/n) XᵀX,

where X is the n × p matrix whose ith row is the ith observation xᵢᵀ.

Then compute the n × p matrix Z whose ith row is the PC score zᵢᵀ for the ith observation:

Z = X A.

Write X = Z Aᵀ to decompose each observation into PCs:

xᵢ = Σₖ zᵢₖ aₖ (sum over k = 1, …, p).
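These two slides together are the whole pipeline: scores via Z = XA, and an exact decomposition X = ZAᵀ because A is orthogonal. A sketch with assumed data:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 4))
X = X - X.mean(axis=0)              # each variable measured about its mean
C = (X.T @ X) / len(X)

eigvals, A = np.linalg.eigh(C)
A = A[:, ::-1]                      # columns = eigenvectors, descending eigenvalue

Z = X @ A                           # PC scores, one row per observation
assert np.allclose(X, Z @ A.T)      # exact decomposition: X = Z A^T
```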

Usage of PCA: data compression

Because the kth PC retains the kth greatest fraction of the variation, we can approximate each observation by truncating the sum at the first m < p PCs:

xᵢ ≈ Σₖ zᵢₖ aₖ (sum over k = 1, …, m).

This reduces the dimensionality of the data from p to m < p by approximating

X ≈ Zₘ Aₘᵀ,

where Zₘ is the n × m portion of Z and Aₘ is the p × m portion of A.
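A compression sketch that also shows why the truncation is good: the mean squared reconstruction error equals the sum of the discarded eigenvalues (data and the choice m = 2 are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
# Most variance lies in 2 of 5 directions, so m = 2 compresses well
X = rng.normal(size=(300, 5)) @ np.diag([5.0, 3.0, 0.2, 0.1, 0.05])
X = X - X.mean(axis=0)
C = (X.T @ X) / len(X)

eigvals, A = np.linalg.eigh(C)
eigvals, A = eigvals[::-1], A[:, ::-1]  # descending order

m = 2
Zm = (X @ A)[:, :m]                 # n x m portion of the scores
Am = A[:, :m]                       # p x m portion of the eigenvectors
X_approx = Zm @ Am.T                # X ~ Z_m A_m^T

# Mean squared reconstruction error = sum of the discarded eigenvalues
mse = np.mean(np.sum((X - X_approx) ** 2, axis=1))
assert np.allclose(mse, eigvals[m:].sum())
```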
