Dimensional reduction, PCA


1 Dimensional reduction, PCA

2 Curse of dimensionality
The higher the dimension, the more data is needed to draw any conclusion.
Probability density estimation: continuous case, histograms; discrete case, k-factorial designs.
Decision rules: nearest-neighbor and k-nearest-neighbor.

3 How to reduce dimension?
Assume we know something about the distribution.
Parametric approach: assume the data follow a distribution within a family H.
Example: a histogram for 10-D data needs (number of bins)^10 cells, but knowing the data are normal lets us summarize them with sufficient statistics: 10 mean parameters plus 10·11/2 = 55 covariance parameters, i.e. (number of bins)^10 vs. 65.
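As a rough illustration of the gap (not from the slides; the 10 bins per dimension are an assumed figure), the two parameter counts can be computed directly:

```python
import numpy as np

d = 10          # data dimension
bins = 10       # assumed number of bins per dimension for a histogram estimate

# Histogram density estimate: one count per cell; cells grow exponentially in d.
histogram_cells = bins ** d

# Gaussian (parametric) estimate: sufficient statistics are the mean vector
# and the symmetric covariance matrix.
gaussian_params = d + d * (d + 1) // 2

print(f"histogram cells:     {histogram_cells:,}")   # 10,000,000,000
print(f"Gaussian parameters: {gaussian_params}")     # 65
```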

4 Linear dimension reduction
The normality assumption is crucial for linear methods.
Examples: Principal Component Analysis (also Latent Semantic Indexing), Factor Analysis, Linear Discriminant Analysis.

5 Covariance structure of multivariate Gaussian
2-dimensional example.
No correlations --> diagonal covariance matrix, e.g. Σ = diag(σ1², σ2²); the diagonal entries are the variances in each dimension, the off-diagonal entries the correlations between dimensions.
Special case: Σ = I, where the negative log-likelihood is proportional (up to a constant) to the squared Euclidean distance to the center.
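A minimal numerical check of the Σ = I special case, on assumed random points (scipy's multivariate_normal is used only for convenience):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.zeros(2)
rv = multivariate_normal(mean=mu, cov=np.eye(2))   # Sigma = I

x = np.random.randn(5, 2)
neg_log_lik = -rv.logpdf(x)
sq_dist = 0.5 * np.sum((x - mu) ** 2, axis=1)      # half squared distance to center

# Up to the constant log(2*pi), the two quantities agree.
print(np.allclose(neg_log_lik - sq_dist, np.log(2 * np.pi)))  # True
```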

6 Covariance structure of multivariate Gaussian
Non-zero correlations --> full covariance matrix, Cov(X1, X2) ≠ 0, i.e. Σ has non-zero off-diagonal entries.
Nice property of Gaussians: they are closed under linear transformation.
This means we can remove the correlation by a rotation.
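A small sketch of both points, using an assumed correlated 2-D Gaussian: a sample is rotated by the eigenvectors of Σ, and the empirical correlation essentially disappears.

```python
import numpy as np

rng = np.random.default_rng(0)

# An assumed correlated 2-D Gaussian (illustrative values).
Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])
X = rng.multivariate_normal(mean=[0, 0], cov=Sigma, size=5000)

# Eigenvectors of Sigma define the rotation that removes the correlation.
eigvals, R = np.linalg.eigh(Sigma)
Y = X @ R                     # a linear transform of a Gaussian is still Gaussian

print(np.round(np.cov(X, rowvar=False), 2))   # full covariance
print(np.round(np.cov(Y, rowvar=False), 2))   # approximately diagonal
```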

7 Covariance structure of multivariate Gaussian
Rotation matrix: R = (w1, w2), where w1 and w2 are two unit vectors perpendicular to each other.
Examples (figures): rotation by 90 degrees and rotation by 45 degrees of the axes w1, w2.

8 Covariance structure of multivariate Gaussian
Matrix diagonalization: any 2×2 covariance matrix A can be written as A = R Λ Rᵀ, where R is a rotation matrix and Λ is diagonal.
Interpretation: we can always find a rotation that makes the covariance look "nice" -- no correlation between dimensions.
This IS PCA when applied to N dimensions.
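A brief numerical check of this decomposition on an assumed 2×2 covariance matrix:

```python
import numpy as np

# An assumed 2x2 covariance matrix (symmetric, positive definite).
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])

eigvals, R = np.linalg.eigh(A)     # columns of R are orthonormal eigenvectors
Lam = np.diag(eigvals)

print(np.allclose(R @ Lam @ R.T, A))        # True: A = R Lambda R^T
print(np.allclose(R.T @ R, np.eye(2)))      # True: R is orthogonal (a rotation,
                                            # possibly times a reflection)
```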

9 Computation of PCA
The new coordinates uniquely identify the rotation.
In computation, it is easier to identify one coordinate at a time (in 3-D, three coordinates w1, w2, w3).
Step 1: center the data, X <-- X - mean(X); we want to rotate around the center.

10 Computation of PCA
Step 2: find the direction of projection that has maximal variance.
Linear projection of X onto a vector w: Proj_w(X) = X·w, where X is N×d (centered) and w is d×1.
Now measure the stretch along w: this is the sample variance Var(X·w) (figure: data points x projected onto the direction w).
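A short sketch of Steps 1-2 on assumed synthetic data: center X, pick an arbitrary unit direction w, and measure the variance of the projection.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=1000)
X = X - X.mean(axis=0)            # Step 1: center the data

w = np.array([1.0, 1.0])
w = w / np.linalg.norm(w)         # an arbitrary unit direction

proj = X @ w                      # N-vector of projected coordinates
print(proj.var())                 # sample variance of the projection
```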

11 Computation of PCA
Step 3: formulate this as a constrained optimization problem.
Objective: Var(X·w).
Constraint on w: ||w|| = 1, otherwise the variance can be made arbitrarily large; we only care about the direction, not the scaling.
Formally: argmax_{||w||=1} Var(X·w).

12 Computation of PCA
Recall the single-variable case: Var(a·X) = a² Var(X).
Apply this to the multivariate case in matrix notation: Var(X·w) = wᵀ Cov(X) w, where Cov(X) = XᵀX/(N−1) for centered X.
Cov(X) is a d×d matrix; it is symmetric (easy to see) and positive semidefinite: for any y, yᵀ Cov(X) y ≥ 0.
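A quick numerical check of the identity Var(X·w) = wᵀ Cov(X) w on assumed random data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 4))
X = X - X.mean(axis=0)                      # centered, N=500, d=4

w = rng.standard_normal(4)
C = np.cov(X, rowvar=False)                 # d x d, uses the 1/(N-1) convention

lhs = (X @ w).var(ddof=1)                   # Var(X·w), same 1/(N-1) convention
rhs = w @ C @ w                             # w^T Cov(X) w
print(np.allclose(lhs, rhs))                # True
```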

13 Computation of PCA
Going back to the optimization problem: max_{||w||=1} Var(X·w) = max_{||w||=1} wᵀ Cov(X) w.
The maximum value is the largest eigenvalue of Cov(X), and the maximizer w1 is the corresponding eigenvector -- the first principal component (see demo: pca2d_demo(d, 1)).
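Putting Steps 1-3 together, the first PC falls out of the eigendecomposition of the sample covariance; a minimal sketch on assumed 2-D data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.multivariate_normal([0, 0], [[3.0, 1.5], [1.5, 1.0]], size=2000)
X = X - X.mean(axis=0)                       # Step 1: center

C = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)         # eigenvalues in ascending order

w1 = eigvecs[:, -1]                          # eigenvector of the largest eigenvalue
print("first PC direction:", w1)
print("its variance:", eigvals[-1])
print("matches Var(X·w1):", np.isclose((X @ w1).var(ddof=1), eigvals[-1]))
```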

14 More principal components
We keep looking among all projections perpendicular to w1.
Formally: max_{||w2||=1, w2 ⊥ w1} w2ᵀ Cov(X) w2.
The maximizer turns out to be the eigenvector corresponding to the 2nd largest eigenvalue, giving the second new coordinate w2 (see demo: pca2d_demo(d, 2)).

15 Rotation
We can keep going until we find all projections/coordinates w1, w2, ..., wd.
Putting them together gives a big matrix W = (w1, w2, ..., wd); W is called an orthogonal matrix.
It corresponds to a rotation (sometimes plus a reflection) of the pancake, and in the rotated coordinates the pancake has no correlation between dimensions (see demo: pca2d_demo(d3, 1); pca2d_demo(d3, 2); pca2d_demo(d3, 3)).
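A sketch of the full rotation on assumed 3-D "pancake" data: the eigenvector matrix W is orthogonal, and the rotated data X·W has an approximately diagonal covariance.

```python
import numpy as np

rng = np.random.default_rng(4)
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 0.1]])          # assumed "pancake" covariance
X = rng.multivariate_normal(np.zeros(3), Sigma, size=5000)
X = X - X.mean(axis=0)

eigvals, W = np.linalg.eigh(np.cov(X, rowvar=False))

print(np.allclose(W.T @ W, np.eye(3)))                  # W is orthogonal
print(np.round(np.cov(X @ W, rowvar=False), 2))         # approximately diagonal
```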

16 When does dimension reduction occur?
Decomposition of the covariance matrix: Cov(X) = λ1·w1w1ᵀ + λ2·w2w2ᵀ + ... + λd·wdwdᵀ.
If only the first few eigenvalues are significant, we can ignore the rest and keep only the corresponding coordinates (e.g. the 2-D coordinates X·(w1, w2)).
Note the PCs themselves are not the new coordinates; they are just the new basis vectors.
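A sketch of the reduction itself, again on assumed 3-D data: keep the two leading eigenvectors as the basis, form the 2-D coordinates, and reconstruct back in the original space.

```python
import numpy as np

rng = np.random.default_rng(5)
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 0.1]])
X = rng.multivariate_normal(np.zeros(3), Sigma, size=5000)
mean = X.mean(axis=0)
Xc = X - mean

eigvals, W = np.linalg.eigh(np.cov(Xc, rowvar=False))
W_top = W[:, ::-1][:, :2]            # basis: the 2 leading eigenvectors

Z = Xc @ W_top                       # new 2-D coordinates of X
X_hat = Z @ W_top.T + mean           # reconstruction in the original 3-D space

print(Z.shape)                       # (5000, 2)
print(np.mean((X - X_hat) ** 2))     # small reconstruction error
```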

17 Measuring “degree” of reduction
(Figure: pancake-shaped data in 3-D, with principal axis lengths a1 and a2 along the leading components.)
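The slide does not spell out the measure, but a common choice (an assumption here) is the fraction of total variance captured by the leading eigenvalues; the eigenvalues below are made up for illustration:

```python
import numpy as np

# Eigenvalues of the sample covariance, sorted in descending order
# (illustrative values for pancake-shaped 3-D data).
eigvals = np.array([4.5, 2.0, 0.05])

explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)

print(np.round(explained, 3))     # per-component fraction of variance
print(np.round(cumulative, 3))    # the first two components capture ~99%
```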

18 An application of PCA: Latent Semantic Indexing in document retrieval
Documents are represented as vectors of word counts (e.g. axes like #market, #stock, #bonds).
Try to extract "features" as linear combinations of word counts.
Caveats: the underlying geometry is unclear (what do the mean and distances represent?), and the meaning of the principal components is unclear (what is a rotation of word counts?).
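A toy sketch of the idea (the term-document counts are fabricated): take a truncated SVD of the document-by-word count matrix and use the leading components as document features.

```python
import numpy as np

# Rows: documents; columns: word counts for (market, stock, bonds, goal, match).
counts = np.array([[5, 4, 3, 0, 0],     # finance article
                   [4, 5, 2, 0, 0],     # finance article
                   [0, 1, 0, 6, 5],     # sports article
                   [0, 0, 1, 5, 6]],    # sports article
                  dtype=float)

Xc = counts - counts.mean(axis=0)        # center, as in PCA
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2
features = U[:, :k] * s[:k]              # k-dimensional document features
print(np.round(features, 2))             # finance vs. sports documents separate
```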

19 Summary of PCA
PCA looks for: a sequence of linear, orthogonal projections (a rotation) that reveal interesting structure in the data.
Defining "interesting": maximal variance under each projection, and uncorrelated structure after projection.

20 Departure from PCA: three directions of divergence
Other definitions of "interesting": Linear Discriminant Analysis, Independent Component Analysis.
Other methods of projection: linear but not orthogonal (sparse coding), implicit non-linear mappings.
Turning PCA into a generative model: Factor Analysis.

21 Re-thinking “interestingness”
It all depends on what you want.
Linear Discriminant Analysis (LDA): supervised learning.
Example: separating 2 classes -- the direction of maximal separation between the classes is generally not the direction of maximal variance (see figure).
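The slides do not give the LDA formula, but a minimal sketch of the standard two-class Fisher direction, w ∝ Sw⁻¹(μ1 − μ0), on assumed synthetic data, compared with the maximal-variance (PCA) direction:

```python
import numpy as np

rng = np.random.default_rng(6)
cov = np.array([[3.0, 2.5], [2.5, 3.0]])          # shared within-class covariance
X0 = rng.multivariate_normal([0.0, 0.0], cov, size=500)
X1 = rng.multivariate_normal([2.0, -2.0], cov, size=500)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
Sw = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)   # pooled within-class covariance

w_lda = np.linalg.solve(Sw, mu1 - mu0)            # Fisher discriminant direction
w_lda /= np.linalg.norm(w_lda)

# Compare with the first PC of the pooled data (maximal-variance direction).
X = np.vstack([X0, X1])
_, vecs = np.linalg.eigh(np.cov(X - X.mean(axis=0), rowvar=False))
w_pca = vecs[:, -1]

print("LDA direction:", np.round(w_lda, 2))
print("PCA direction:", np.round(w_pca, 2))       # generally different
```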

22 Re-thinking “interestingness”
Most high-dimensional data look Gaussian under linear projections, so maybe non-Gaussian projections are the interesting ones.
Independent Component Analysis and projection pursuit look for the projection that is most unlike a Gaussian (e.g. maximize kurtosis).
Example: ICA projection of 2-class data (see figure).
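A rough sketch of the "most non-Gaussian projection" idea on assumed two-class data, scanning candidate directions by the magnitude of their excess kurtosis (a real ICA or projection-pursuit optimizer is more involved):

```python
import numpy as np

def excess_kurtosis(z):
    z = (z - z.mean()) / z.std()
    return np.mean(z ** 4) - 3.0          # 0 for a Gaussian

rng = np.random.default_rng(7)
# Assumed 2-class data: a mixture of two Gaussians separated along the x-axis.
X = np.vstack([rng.normal([-3, 0], 1.0, size=(500, 2)),
               rng.normal([+3, 0], 1.0, size=(500, 2))])

angles = np.linspace(0, np.pi, 180, endpoint=False)
dirs = np.stack([np.cos(angles), np.sin(angles)], axis=1)

k = np.array([excess_kurtosis(X @ w) for w in dirs])
best = dirs[np.argmax(np.abs(k))]         # most non-Gaussian projection
print("direction with most extreme kurtosis:", np.round(best, 2))
```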

23 The “efficient coding” perspective
Sparse coding: projections do not have to be orthogonal, and there can be more basis vectors than the dimension of the space, giving a representation in an over-complete basis (figure: basis vectors w1, ..., w4 for a 2-D point x).
Basis expansion with p basis vectors in d dimensions: p << d gives compact coding (PCA); p > d gives sparse coding.

24 “Interesting” can be expensive
These methods often face difficult optimization problems: many constraints, lots of parameter sharing, and expensive computation -- no longer a simple eigenvalue problem.

25 PCA’s relatives: Factor Analysis
PCA is not a generative model: reconstruction error is not a likelihood, it is sensitive to outliers, and it is hard to build into bigger models.
Factor Analysis adds measurement noise to account for variability: the observation is x = Λz + ε, where ε is measurement noise N(0, R) with R diagonal, Λ is the loading matrix (scaled PCs), and the factors z follow a spherical Gaussian N(0, I).
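A small generative sketch of this model, with an assumed loading matrix and noise variances; the implied covariance of x is ΛΛᵀ + R:

```python
import numpy as np

rng = np.random.default_rng(8)

d, k, n = 3, 2, 5000
Lambda = np.array([[2.0, 0.0],        # assumed loading matrix (d x k)
                   [1.0, 1.0],
                   [0.0, 0.5]])
R = np.diag([0.1, 0.2, 0.1])          # diagonal measurement-noise covariance

z = rng.standard_normal((n, k))                        # factors ~ N(0, I)
eps = rng.multivariate_normal(np.zeros(d), R, size=n)  # noise ~ N(0, R)
x = z @ Lambda.T + eps                                 # observations

print(np.round(np.cov(x, rowvar=False), 2))   # empirical covariance of x
print(np.round(Lambda @ Lambda.T + R, 2))     # model covariance: Lambda Lambda^T + R
```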

26 PCA’s relatives: Factor Analysis
Generative view: sphere --> stretch and rotate --> add noise.
Learning: a version of the EM algorithm (see demo: Plot(r'); [C, R, syn] = fa_demo(r, 2, lab); Plot(syn')).
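Not the course demo, but an equivalent fit can be sketched with scikit-learn's FactorAnalysis (the synthetic data and true parameters below are assumptions):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(9)
Lambda = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, 0.5]])   # assumed true loadings
R = np.array([0.1, 0.2, 0.1])                             # assumed true noise variances

z = rng.standard_normal((5000, 2))
x = z @ Lambda.T + rng.standard_normal((5000, 3)) * np.sqrt(R)

fa = FactorAnalysis(n_components=2)
fa.fit(x)
print(np.round(fa.components_, 2))        # estimated loadings (up to rotation/sign)
print(np.round(fa.noise_variance_, 2))    # estimated noise, close to [0.1, 0.2, 0.1]
```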

