# CHAPTER 6: Dimensionality Reduction


Author: Christoph Eick. The material is mostly based on the Shlens PCA Tutorial (http://www2.cs.uh.edu/~ceick/ML/pca.pdf) and to a lesser extent on material in the Alpaydın book.

2 Why Reduce Dimensionality?
1. Reduces time complexity: less computation
2. Reduces space complexity: fewer parameters
3. Saves the cost of acquiring the feature
4. Simpler models are more robust
5. Easier to interpret; simpler explanation
6. Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions
Ch. Eick: Dimensionality Reduction

3 Feature Selection/Extraction/Construction
Feature selection: choosing k < d important features and ignoring the remaining d − k (subset selection algorithms).
Feature extraction: projecting the original d dimensions onto k < d new dimensions.

4 Key Ideas: Dimensionality Reduction
Given a dataset X, find a low-dimensional linear projection.
Two possible formulations:
- The variance in the low-dimensional space is maximized
- The average projection cost is minimized
Both are equivalent.

Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)

5 Principal Components Analysis (PCA)
Find a low-dimensional space such that when x is projected there, information loss is minimized. The projection of x on the direction of w is z = wᵀx. Find w such that Var(z) is maximized:

Var(z) = Var(wᵀx)
= E[(wᵀx − wᵀμ)²]
= E[(wᵀx − wᵀμ)(wᵀx − wᵀμ)]
= E[wᵀ(x − μ)(x − μ)ᵀw]
= wᵀ E[(x − μ)(x − μ)ᵀ] w
= wᵀΣw

where Cov(x) = E[(x − μ)(x − μ)ᵀ] = Σ.

Question: Why does PCA maximize and not minimize the variance in z?
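As a numerical sanity check on the identity Var(wᵀx) = wᵀΣw, here is a minimal numpy sketch; the data and the direction w are made up for illustration:

```python
import numpy as np

# Made-up data: n=500 examples of a d=3 feature vector (rows are examples)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))

# A unit-norm projection direction w (hypothetical)
w = np.array([0.6, 0.8, 0.0])

z = X @ w                         # z = w^T x for every example
Sigma = np.cov(X, rowvar=False)   # sample covariance matrix of x

# The derivation says Var(z) = w^T Sigma w; check numerically
print(np.isclose(np.var(z, ddof=1), w @ Sigma @ w))   # True
```

Both sides use the same 1/(n−1) normalization, so they agree up to floating-point error.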

6 Clarifications
Assume the dataset x is d-dimensional with n examples and we want to reduce it to a k-dimensional dataset z. Then x is a d×n matrix (one example per column), wᵀ is a k×d matrix, and z = wᵀx is a k×n matrix (you take scalar products of the columns of x with the rows of wᵀ, obtaining a k-dimensional dataset).
Remarks:
- w contains the k eigenvectors of the covariance matrix Σ of x with the highest eigenvalues: Σwᵢ = λᵢwᵢ.
- k is usually chosen based on the variance captured, i.e., the size of the first k eigenvalues.
http://en.wikipedia.org/wiki/Eigenvalues_and_eigenvectors
Corrected on 2/24/2011

7 Example
A 4-dimensional dataset x with 5 examples (one per column), reduced to k = 2 dimensions via z = wᵀx:
x = [[1, 2, 0, 1, 4], [0, 0, 2, -1, 2], [1, 1, 1, 1, 1], [0, -1, 0, 3, 0]] (4×5)
wᵀ = [[0.5, 0.5, -1.0, 0.5], [1.0, 0.0, 1.0, 1.0]] (2×4)
Each example (a1, a2, a3, a4) is mapped to
(z1, z2) := (0.5·a1 + 0.5·a2 − a3 + 0.5·a4, a1 + a3 + a4)
Corrected on 2/24/2011

8 Shlens Tutorial on PCA
PCA is among the most valuable results of applied linear algebra (another being PageRank). "The goal of PCA is to compute the most meaningful basis to re-express a noisy dataset. The hope is that the new basis will filter out noise and reveal hidden structure." The goal of PCA is deciphering "garbled" data, referring to: rotation, redundancy, and noise. PCA is a non-parametric method; there is no way to incorporate preferences and other choices.

9 Computing Principal Components as Eigenvectors of the Covariance Matrix
1. Normalize x by subtracting from each attribute value its mean, obtaining y.
2. Compute Σ = 1/(n−1) · yyᵀ, the covariance matrix of x.
3. Diagonalize Σ, obtaining a set of eigenvectors e with eᵢᵀΣeᵢ = λᵢ (λᵢ is the eigenvalue of the i-th eigenvector).
4. Select how many and which eigenvectors of e to keep, obtaining w (based on the variance expressed, i.e., the size of the eigenvalues, and possibly other criteria).
5. Create your transformed dataset z = wᵀy.
Remark: Symmetric matrices are always orthogonally diagonalizable; see the proof on page 11 of the Shlens paper!
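The five steps above can be sketched directly in numpy; the toy data is made up, and `eigh` is used because the covariance matrix is symmetric:

```python
import numpy as np

# Made-up toy data: d=3 attributes, n=200 examples, one example per column
rng = np.random.default_rng(1)
x = rng.normal(size=(3, 200))
x[1] += 2 * x[0]                    # add correlation so one direction dominates

# 1. normalize: subtract each attribute's mean, obtaining y
y = x - x.mean(axis=1, keepdims=True)

# 2. covariance matrix Sigma = 1/(n-1) * y y^T
n = x.shape[1]
Sigma = (y @ y.T) / (n - 1)

# 3. diagonalize (eigh, since Sigma is symmetric)
lam, e = np.linalg.eigh(Sigma)

# 4. keep the k eigenvectors with the largest eigenvalues
k = 2
order = np.argsort(lam)[::-1]
w = e[:, order[:k]]                 # d x k

# 5. transformed dataset
z = w.T @ y                         # k x n
print(z.shape)                      # (2, 200)
```

Because the columns of w are eigenvectors of Σ, the covariance of z is the diagonal matrix of the kept eigenvalues, i.e., the new features are uncorrelated.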

10 Textbook's PCA Version
Maximize Var(z1) = w1ᵀΣw1 subject to ||w1|| = 1. The solution satisfies Σw1 = αw1; that is, w1 is an eigenvector of Σ. Choose as the first principal component the eigenvector with the largest eigenvalue, so that Var(z1) is maximized.
Second principal component: maximize Var(z2) subject to ||w2|| = 1 and w2 orthogonal to w1. Then Σw2 = αw2; that is, w2 is another eigenvector of Σ, and so on.

11 What PCA Does
z = Wᵀ(x − m), where the columns of W are the eigenvectors of Σ and m is the sample mean. This centers the data at the origin and rotates the axes.
http://www.youtube.com/watch?v=BfTMmoDFXyE

12 How to Choose k?
Proportion of Variance (PoV) explained, with the λi sorted in descending order:
PoV = (λ1 + ⋯ + λk) / (λ1 + ⋯ + λd)
Typically, stop at PoV > 0.9.
A scree graph plots PoV vs. k; stop at the "elbow".
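The PoV rule can be sketched as a small helper function; the eigenvalues below are made up for illustration:

```python
import numpy as np

def choose_k(eigenvalues, threshold=0.9):
    """Smallest k whose proportion of variance explained exceeds threshold."""
    lam = np.sort(np.asarray(eigenvalues))[::-1]   # descending order
    pov = np.cumsum(lam) / lam.sum()               # PoV for k = 1, ..., d
    return int(np.argmax(pov > threshold)) + 1

# made-up eigenvalues of a 5-d covariance matrix
lam = np.array([4.0, 2.5, 1.0, 0.3, 0.2])
print(choose_k(lam))   # 3, since (4 + 2.5 + 1) / 8 = 0.9375 > 0.9
```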


15 Visualizing Numbers after applying PCA

16 Multidimensional Scaling
Given pairwise distances between N points, d_ij, i, j = 1, ..., N, place the points on a low-dimensional map such that the distances are preserved. With z = g(x | θ), find θ that minimizes the Sammon stress:
E(θ | X) = Σ_{r,s} (‖z^r − z^s‖ − ‖x^r − x^s‖)² / ‖x^r − x^s‖²
L1-Norm: http://en.wikipedia.org/wiki/Taxicab_geometry
Lq-Norm: http://en.wikipedia.org/wiki/Lp_space
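Sammon mapping minimizes the stress above iteratively. As a simpler closed-form illustration of the same goal (placing points so pairwise distances are preserved), here is a sketch of classical metric MDS on made-up data:

```python
import numpy as np

# Hidden 2-d points; in practice only the pairwise distances D are given
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # d_ij

# Double-center the squared distances to recover a Gram matrix B
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J

# The top eigenpairs of B give the low-dimensional coordinates
lam, V = np.linalg.eigh(B)
top = np.argsort(lam)[::-1][:2]
Z = V[:, top] * np.sqrt(lam[top])

# Pairwise distances are preserved (up to rotation/translation)
D_hat = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
print(np.allclose(D, D_hat))   # True
```

Recovery is exact here because the hidden points really are 2-dimensional; with higher-dimensional data the embedding only approximates the distances.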

17 Map of Europe by MDS
Map from CIA – The World Factbook: http://www.cia.gov/
http://forrest.psych.unc.edu/teaching/p208a/mds/mds.html

