
1 Why Reduce Dimensionality?
Reduces time complexity: less computation.
Reduces space complexity: fewer parameters.
Simpler models are more robust on small datasets.
More interpretable: simpler explanation.
Data visualization (beyond 2 attributes, it gets complicated).
(Lecture Notes for E. Alpaydın, Introduction to Machine Learning 2e, © 2010 The MIT Press, V1.0)

2 Feature Selection vs. Extraction
Feature selection: choose the k < d most important features and ignore the remaining d − k (e.g., data snooping, genetic algorithms).
Feature extraction: project the original d attributes onto a new k < d dimensional feature space, e.g., principal components analysis (PCA), linear discriminant analysis (LDA), factor analysis (FA), auto-association ANN.

3 Principal Components Analysis (PCA)
Assume the attributes in the dataset are drawn from a multivariate normal distribution: p(x) = N(μ, Σ), where x and μ are d×1 vectors and Σ = E[(x − μ)(x − μ)ᵀ] is the d×d covariance matrix (a d×1 column times a 1×d row gives a d×d matrix).
The variance generalizes to a matrix called the covariance: the diagonal elements are the variances σ² of the individual attributes, and the off-diagonal elements describe how fluctuations in one attribute affect fluctuations in another.

4 Dividing each off-diagonal element by the product of the corresponding standard deviations gives the correlation coefficients, ρ_ij = Σ_ij / (σ_i σ_j). Correlation among attributes makes it difficult to say how any one attribute contributes to an effect.
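As a quick numerical check of this relationship (not from the original slides; the data matrix X below is arbitrary synthetic data), the covariance-to-correlation conversion can be written in MATLAB:

% Sketch: covariance vs. correlation on arbitrary synthetic data
X = randn(200, 3) * [2 0 0; 1 1 0; 0 0.5 3];  % 200 samples, 3 correlated attributes
S = cov(X);                                   % d-by-d covariance matrix
s = sqrt(diag(S));                            % per-attribute standard deviations
R = S ./ (s * s');                            % divide each element by the product of std devs
disp(max(max(abs(R - corrcoef(X)))))          % ~0: agrees with MATLAB's corrcoef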

5 Consider a linear transformation of the attributes, z = Mx, where M is a d×d matrix. The d features z will also be normally distributed (proof later). A choice of M that results in a diagonal covariance matrix in feature space has the following advantages:
1. Interpretation of uncorrelated features is easier.
2. The total variance of the features is the sum of the diagonal elements.

6 Diagonalization of the covariance matrix
The transformation z = Mx that leads to a diagonal feature-space covariance has M = Wᵀ, where the columns of W are the eigenvectors of the covariance matrix Σ.
The collection of eigenvalue equations Σw_k = λ_k w_k can be written as ΣW = WD, where D = diag(λ₁, ..., λ_d) and W is formed from the column vectors [w₁ ... w_d].
Because Σ is symmetric, W is orthogonal: Wᵀ = W⁻¹, so WᵀΣW = W⁻¹ΣW = W⁻¹WD = D.
If we arrange the eigenvectors so that the eigenvalues λ₁, ..., λ_d are in decreasing order of magnitude, then z_i = w_iᵀx, i = 1, ..., k < d, are the "principal components".
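A minimal MATLAB sketch of this diagonalization (synthetic data; the variable names are my own, not from the slides):

% Sketch: W'*S*W is diagonal when the columns of W are the eigenvectors of S
X = randn(500, 4) * [1 0 0 0; 2 1 0 0; 0 1 3 0; 0 0 1 2];  % arbitrary correlated data
S = cov(X);                       % sample covariance (symmetric)
[W, D] = eig(S);                  % columns of W are eigenvectors, D = diag(lambda)
disp(W' * W)                      % identity: the eigenvectors are orthonormal
disp(W' * S * W)                  % diagonal matrix equal to D (up to round-off)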

7 How many principal components?
The proportion of variance (PoV) explained by the first k principal components (with the λ_i sorted in descending order) is
PoV = (λ₁ + ... + λ_k) / (λ₁ + ... + λ_d).
A plot of PoV vs. k shows how many eigenvalues are required to capture a given part of the total variance.
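Continuing the sketch above (W and D as returned by eig), the PoV curve is a couple of lines in MATLAB:

% Sketch: proportion of variance captured by the first k components
lambda = sort(diag(D), 'descend');       % eigenvalues, largest first
PoV    = cumsum(lambda) / sum(lambda);   % PoV(k) = (l1+...+lk)/(l1+...+ld)
plot(PoV, 'o-'); xlabel('k'); ylabel('PoV');
k = find(PoV >= 0.90, 1);                % smallest k capturing 90% of the variance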

8 Proof that if the attributes x are normally distributed with mean μ and covariance Σ, then z = wᵀx is normally distributed with mean wᵀμ and variance wᵀΣw:
Var(z) = Var(wᵀx) = E[(wᵀx − wᵀμ)²] = E[(wᵀx − wᵀμ)(xᵀw − μᵀw)] = E[wᵀ(x − μ)(x − μ)ᵀw] = wᵀ E[(x − μ)(x − μ)ᵀ] w = wᵀΣw.
The objective of PCA is to maximize Var(z) = wᵀΣw; this must be done subject to the constraint ||w₁|| = 1, i.e. w₁ᵀw₁ = 1.
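The identity Var(wᵀx) = wᵀΣw can also be checked by simulation; the Σ, μ, and w below are arbitrary choices, not values from the lecture:

% Sketch: Monte Carlo check that var(w'*x) approaches w'*Sigma*w
Sigma = [4 1 0.5; 1 2 0.3; 0.5 0.3 1];    % arbitrary positive-definite covariance
mu    = [1; -2; 0.5];                     % arbitrary mean
w     = [0.2; 0.5; -0.8]; w = w / norm(w);
X = randn(1e5, 3) * chol(Sigma) + mu';    % samples with covariance ~Sigma
z = X * w;                                % z = w'*x for every sample
fprintf('empirical %.4f  theoretical %.4f\n', var(z), w' * Sigma * w);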

9 Constrained optimization
Review: constrained optimization by Lagrange multipliers. Find the stationary point of f(x₁, x₂) = 1 − x₁² − x₂² subject to the constraint g(x₁, x₂) = x₁ + x₂ = 1.

10 Form the Lagrangian L(x, λ) = f(x₁, x₂) + λ(g(x₁, x₂) − c):
L(x, λ) = 1 − x₁² − x₂² + λ(x₁ + x₂ − 1)

11 Set the partial derivatives of L(x, λ) = 1 − x₁² − x₂² + λ(x₁ + x₂ − 1) with respect to x₁, x₂, and λ equal to zero:
−2x₁ + λ = 0
−2x₂ + λ = 0
x₁ + x₂ − 1 = 0
Then solve for x₁ and x₂.

12 In this case it is not necessary to find λ, which is why λ is sometimes called an "undetermined multiplier". The solution is x₁* = x₂* = ½.
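The stationary point can be double-checked numerically by substituting the constraint x₂ = 1 − x₁ and optimizing over x₁ alone (a sketch only; fminbnd minimizes, so the objective is negated):

% Sketch: verify x1* = x2* = 1/2 for max of 1 - x1^2 - x2^2 with x1 + x2 = 1
f  = @(x1) -(1 - x1.^2 - (1 - x1).^2);   % negated objective on the constraint line
x1 = fminbnd(f, -5, 5);                  % returns 0.5
x2 = 1 - x1;                             % also 0.5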

13 Application of Lagrange multipliers in PCA
Find w₁ such that w₁ᵀΣw₁ is maximum subject to the constraint ||w₁|| = 1 (w₁ᵀw₁ = 1).
Maximize L = w₁ᵀΣw₁ + c(w₁ᵀw₁ − 1).
Setting the gradient of L to zero gives 2Σw₁ + 2cw₁ = 0, so Σw₁ = −cw₁ and w₁ is an eigenvector of the covariance matrix.
Let c = −λ₁; then λ₁ is the eigenvalue associated with w₁.

14 Proof that λ₁ is the variance of principal component 1, z₁ = w₁ᵀx:
Σw₁ = λ₁w₁, so Var(z₁) = w₁ᵀΣw₁ = λ₁ w₁ᵀw₁ = λ₁.
To maximize Var(z₁), choose λ₁ as the largest eigenvalue.

15 More principal components
If Σ has at least 2 distinct eigenvalues, define the 2nd principal component by maximizing Var(z₂) such that ||w₂|| = 1 and w₂ is orthogonal to w₁.
Introduce Lagrange multipliers α and β and set the gradient of L with respect to w₂ to zero:
2Σw₂ − 2αw₂ − βw₁ = 0.
Premultiplying by w₁ᵀ shows β = 0, and with α = λ₂ we get Σw₂ = λ₂w₂.
To maximize Var(z₂), choose λ₂ as the second largest eigenvalue.

16 Review
For any d×d matrix M, z = Mᵀx is a linear transformation of the attributes x that defines features z.
If the attributes x are normally distributed with mean μ and covariance Σ, then z is normally distributed with mean Mᵀμ and covariance MᵀΣM (proof, slide 8).
If M = W, a matrix whose columns are the normalized eigenvectors of Σ, then the covariance of z is diagonal with elements equal to the eigenvalues of Σ (proof, slide 6).
Arrange the eigenvalues in decreasing order of magnitude and find λ₁, ..., λ_k that account for most (e.g. 90%) of the total variance; then z_i = w_iᵀx are the "principal components".

17 MatLab’s [V,D] = eig(A) returns both eigenvectors (columns of V) and eigenvalues D in increasing order. Invert the order and construct 17 Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) More review Chose k that captures the desired amount of total variance

18 Example: cancer diagnostics
Metabonomics data: 94 samples, 35 metabolites measured in each sample (d = 35); 60 control samples and 34 diseased samples.

19 Ranked eigenvalues: 73.6809, 18.7491, 2.8856, 1.9068, 0.7278, 0.5444, 0.4238, 0.3501, 0.1631. Proportion-of-variance plot of the ranked eigenvalues: 3 PCs capture > 95%.

20 Scatter plot of PCs 1 and 2: samples 1–34 are cancer, samples 35 and above are control. The samples from cancer patients cluster.
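The scatter plot itself is not reproduced in this transcript; a hedged sketch of how such a plot could be produced from a 94×35 metabolite matrix (M is a hypothetical variable name for that data):

% Sketch: scores on PCs 1 and 2 of the metabonomics data (M is 94-by-35, rows 1-34 cancer)
Mc = M - mean(M);                         % mean-center the attributes
[V, D] = eig(cov(Mc));
[~, idx] = sort(diag(D), 'descend');
Z = Mc * V(:, idx(1:2));                  % scores on PCs 1 and 2
plot(Z(1:34, 1),  Z(1:34, 2),  'r+'); hold on    % cancer samples
plot(Z(35:end, 1), Z(35:end, 2), 'bo'); hold off % control samples
xlabel('PC 1'); ylabel('PC 2');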

21 Assignment 5, due 10-30-15
Find the accuracy of a model that classifies all 6 types of beer bottles in glassdata.csv by multivariate linear regression.
Find the eigenvalues and eigenvectors of the covariance matrix for the full beer-bottle data set. How many eigenvalues are required to capture more than 90% of the variance?
Transform the attribute data by the eigenvectors of the 3 largest eigenvalues. What is the accuracy of a linear model that uses these features?
Plot the accuracy as you successively extend the linear model by including z₁², z₂², z₃², z₁z₂, z₁z₃, and z₂z₃.

22 PCA code for glass data
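The code on this slide is not preserved in the transcript. The following is a minimal reconstruction of what it might have looked like, assuming glassdata.csv holds the attributes in its leading columns and the class label in its last column (that layout is an assumption, not confirmed by the slides):

% Sketch: PCA of the glass data (column layout assumed)
data = readmatrix('glassdata.csv');
X = data(:, 1:end-1);                    % attributes
y = data(:, end);                        % class labels
Xc = X - mean(X);                        % mean-center
[V, D] = eig(cov(Xc));
[lambda, idx] = sort(diag(D), 'descend');
W = V(:, idx);
PoV = cumsum(lambda) / sum(lambda)       % how many eigenvalues reach 90%?
Z = Xc * W(:, 1:3);                      % features from the 3 largest eigenvalues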

23 eigenvalues indexed by decreasing magnitude

24 PoV

25 Extend MLR with PCA features
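This slide's code is also missing from the transcript. Below is a sketch of one way to extend the linear model with the squared and cross terms named in the assignment, reusing y and Z from the sketch after slide 22; the indicator-matrix regression used for classification here is my assumption about how the multivariate linear regression classifier was set up:

% Sketch: MLR classifier on PCA features, successively extended with
% z1^2, z2^2, z3^2, z1*z2, z1*z3, z2*z3
classes = unique(y);
Y = double(y == classes');               % n-by-K indicator matrix
extras = [Z(:,1).^2, Z(:,2).^2, Z(:,3).^2, ...
          Z(:,1).*Z(:,2), Z(:,1).*Z(:,3), Z(:,2).*Z(:,3)];
acc = zeros(1, size(extras, 2) + 1);
for j = 0:size(extras, 2)
    F = [ones(size(Z,1),1), Z, extras(:,1:j)];  % design matrix with j extra terms
    B = F \ Y;                                  % least-squares fit
    [~, p] = max(F * B, [], 2);                 % predicted class index
    acc(j+1) = mean(classes(p) == y);           % training accuracy
end
plot(0:size(extras,2), acc, 'o-'); xlabel('extra terms added'); ylabel('accuracy');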

26 Plot: accuracy of the linear model L as the terms x₁², x₂², x₃², x₁x₂, x₁x₃, x₂x₃ are added successively.

