Dimensional reduction, PCA


Curse of dimensionality
- The higher the dimension, the more data are needed to draw any conclusion.
- Probability density estimation:
  - continuous: histograms
  - discrete: k-factorial designs
- Decision rules: nearest-neighbor and k-nearest-neighbor.

How to reduce dimension?
- Assume we know something about the distribution.
- Parametric approach: assume the data follow a distribution from some family H.
- Example: a histogram for 10-D data needs (number of bins)^10 cells, but knowing the data are normal lets us summarize them with sufficient statistics: 10 mean parameters plus 10*11/2 = 55 covariance parameters, i.e. 65 numbers vs. (number of bins)^10.
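
As a concrete illustration (a minimal NumPy sketch of the point above; the data and numbers are made up, not from the course):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 10))     # 1000 samples of 10-D (Gaussian) data

    mu = X.mean(axis=0)                 # 10 mean parameters
    Sigma = np.cov(X, rowvar=False)     # 10 x 10 symmetric covariance matrix

    # Mean and covariance are the sufficient statistics of a multivariate normal.
    # Only the upper triangle of Sigma is free: 10 * 11 / 2 = 55 parameters.
    n_params = mu.size + 10 * 11 // 2
    print(n_params)                     # 65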

Linear dimension reduction
- The normality assumption is crucial for linear methods.
- Examples:
  - Principal Components Analysis (also Latent Semantic Indexing)
  - Factor Analysis
  - Linear Discriminant Analysis

Covariance structure of a multivariate Gaussian
- 2-dimensional example.
- No correlations --> diagonal covariance matrix: the diagonal entries are the variances in each dimension, the off-diagonal entries the correlations between dimensions (here zero).
- Special case: Σ = I, where the negative log likelihood is, up to constants, the squared Euclidean distance to the center.
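
Writing out the step the slide alludes to (standard Gaussian algebra; the isotropic case Σ = σ²I covers Σ = I with σ = 1):

    -\log p(x) \;=\; \frac{1}{2\sigma^2}\,\lVert x - \mu \rVert^2 \;+\; \frac{d}{2}\log\left(2\pi\sigma^2\right)

so minimizing the negative log likelihood over x is the same as minimizing the squared Euclidean distance to the center μ.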

Covariance structure of a multivariate Gaussian
- Non-zero correlations --> full covariance matrix Σ, with Cov(X1, X2) ≠ 0.
- Nice property of Gaussians: they are closed under linear transformation.
- This means we can remove the correlation by a rotation.
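
Spelled out (a standard fact, not derived on the slide): if x ~ N(μ, Σ) and W is any matrix, then

    Wx \;\sim\; \mathcal{N}\!\left(W\mu,\; W \Sigma W^{\mathsf T}\right)

Choosing W to be the transpose of the rotation R from the eigendecomposition Σ = R Λ R^T (next slide) gives a Gaussian with diagonal covariance Λ, i.e. no correlation between dimensions.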

Covariance structure of a multivariate Gaussian
- Rotation matrix: R = (w1, w2), where w1 and w2 are two unit vectors perpendicular to each other.
- [Figure: rotation by 90 degrees and rotation by 45 degrees, showing the rotated axes w1 and w2.]

Covariance structure of a multivariate Gaussian
- Matrix diagonalization: any 2x2 covariance matrix A can be written as A = R Λ R^T, with R a rotation matrix and Λ diagonal.
- Interpretation: we can always find a rotation that makes the covariance look "nice" -- no correlation between dimensions.
- This IS PCA when applied to N dimensions.
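
A quick numeric check of the decomposition (a NumPy sketch of my own, not part of the slides):

    import numpy as np

    A = np.array([[2.0, 1.2],
                  [1.2, 1.0]])               # a full 2x2 covariance matrix

    lam, R = np.linalg.eigh(A)               # eigenvalues and orthonormal eigenvectors
    Lambda = np.diag(lam)

    print(np.allclose(A, R @ Lambda @ R.T))  # True: A = R Lambda R^T
    print(np.allclose(R.T @ R, np.eye(2)))   # True: R is orthogonal (rotation, possibly times a reflection)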

Computation of PCA
- The new coordinates uniquely identify the rotation; in computation it is easier to identify one coordinate at a time.
- Step 1: center the data, X <-- X - mean(X), since we want to rotate around the center.
- [Figure: in 3-D there are 3 coordinates, w1, w2, w3.]

Computation of PCA
- Step 2: find the direction of projection that has the maximal variance.
- Linear projection of the centered data X onto a vector w: Proj_w(X) = X_{N x d} * w_{d x 1}.
- Now measure the "stretch" along w: this is the sample variance, Var(X*w).

Computation of PCA
- Step 3: formulate this as a constrained optimization problem.
- Objective of the optimization: Var(X*w).
- Constraint on w: ||w|| = 1 (otherwise the variance can be made arbitrarily large); we only consider the direction, not the scaling.
- So formally: argmax_{||w||=1} Var(X*w).

Computation of PCA
- Recall the single-variable case: Var(a*X) = a^2 Var(X).
- Apply this to the multivariate case with matrix notation (X centered): Var(X*w) = (1/N) w^T X^T X w = w^T Cov(X) w.
- Cov(X) is a d x d matrix; it is symmetric (easy to see), and for any y, y^T Cov(X) y >= 0.
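
The middle step in full (a standard derivation; the 1/N convention for the sample covariance is my assumption), with the rows x_i of X already centered:

    \operatorname{Var}(Xw)
    \;=\; \frac{1}{N}\sum_{i=1}^{N}\left(x_i^{\mathsf T} w\right)^2
    \;=\; w^{\mathsf T}\left(\frac{1}{N}\sum_{i=1}^{N} x_i x_i^{\mathsf T}\right) w
    \;=\; w^{\mathsf T}\,\operatorname{Cov}(X)\, w

Since each term (x_i^T w)^2 is non-negative, y^T Cov(X) y >= 0 for every y: the covariance matrix is positive semidefinite.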

Computation of PCA
- Going back to the optimization problem: max_{||w||=1} Var(X*w) = max_{||w||=1} w^T Cov(X) w.
- The maximum is the largest eigenvalue of Cov(X), attained at the corresponding eigenvector w1 -- the first Principal Component! (see demo)
- Demo: pca2d_demo(d, 1);
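
The pca2d_demo script is not included in the transcript; the following NumPy sketch (my own, on made-up 2-D data) shows the same computation -- the top eigenvector of the covariance is the direction of maximal variance:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0],
                                              [1.5, 0.5]])    # correlated 2-D data
    X = X - X.mean(axis=0)                     # Step 1: center

    C = np.cov(X, rowvar=False)                # Steps 2-3: covariance and its eigenvectors
    lam, W = np.linalg.eigh(C)                 # eigenvalues in ascending order
    w1 = W[:, -1]                              # eigenvector of the largest eigenvalue

    print(np.var(X @ w1, ddof=1), lam[-1])     # equal: variance along w1 = top eigenvalue

    u = rng.normal(size=2)
    u /= np.linalg.norm(u)                     # a random unit direction
    print(np.var(X @ u, ddof=1) <= np.var(X @ w1, ddof=1) + 1e-9)   # True: w1 wins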

More principal components
- We keep looking among all the projections perpendicular to w1.
- Formally: max_{||w2||=1, w2 ⊥ w1} w2^T Cov(X) w2.
- This turns out to be another eigenvector, the one corresponding to the 2nd largest eigenvalue (see demo).
- Demo: pca2d_demo(d, 2);

Rotation
- We can keep going until we have found all the projections/coordinates w1, w2, ..., wd.
- Putting them together gives a big matrix W = (w1, w2, ..., wd); W is called an orthogonal matrix.
- This corresponds to a rotation (sometimes plus a reflection) of the pancake.
- The rotated pancake has no correlation between dimensions (see demo).
- Demo: pca2d_demo(d3, 1); pca2d_demo(d3, 2); pca2d_demo(d3, 3)
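
A NumPy sketch of the full rotation (again my own stand-in for the course demo, with synthetic 3-D "pancake" data): after rotating into the eigenvector basis, the off-diagonal covariances vanish.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(1000, 3)) @ np.diag([5.0, 2.0, 0.3])   # a 3-D pancake
    Q = np.linalg.qr(rng.normal(size=(3, 3)))[0]                # random orthogonal matrix to hide the axes
    X = X @ Q
    X = X - X.mean(axis=0)

    lam, W = np.linalg.eigh(np.cov(X, rowvar=False))   # W = (w1, w2, w3), orthogonal
    Y = X @ W                                          # coordinates in the rotated basis

    print(np.round(np.cov(Y, rowvar=False), 3))        # diagonal: no correlation left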

When does dimension reduction occur?
- Decomposition of the covariance matrix: if only the first few eigenvalues are significant, we can ignore the rest.
- The PCs are not the new coordinates -- they are just a new basis; the new coordinates of X (e.g. its 2-D coordinates) are its projections onto the retained PCs.
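
A NumPy sketch of this truncation (my own illustration, not the course demo): keep the eigenvectors of the k largest eigenvalues and use the projections onto them as the reduced coordinates.

    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.normal(size=(1000, 3)) @ np.diag([5.0, 2.0, 0.3])   # 3-D pancake
    X = X - X.mean(axis=0)

    lam, W = np.linalg.eigh(np.cov(X, rowvar=False))
    lam, W = lam[::-1], W[:, ::-1]            # sort eigenvalues in decreasing order

    k = 2
    Z = X @ W[:, :k]                          # the new k-D coordinates of X
    X_hat = Z @ W[:, :k].T                    # reconstruction from the first k PCs

    print(np.round(lam / lam.sum(), 3))       # fraction of variance carried by each PC
    print(np.mean((X - X_hat) ** 2))          # small reconstruction error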

Measuring the “degree” of reduction
- [Figure: pancake-shaped data in 3D, annotated with the quantities a1 and a2.]

An application of PCA
- Latent Semantic Indexing in document retrieval.
- Documents are represented as vectors of word counts (e.g. axes #market, #stock, #bonds).
- The idea is to extract "features" as linear combinations of word counts.
- Caveats: the underlying geometry is unclear (what do the mean and distances stand for?), and the meaning of the principal components is unclear (what is a "rotation" of documents?).
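
A toy sketch of the idea (the counts are entirely made up, and a plain SVD stands in for whatever LSI implementation the course used):

    import numpy as np

    # Rows = documents, columns = counts of (market, stock, bonds, gene, protein)
    counts = np.array([[5, 4, 3, 0, 0],
                       [4, 6, 2, 0, 1],
                       [0, 1, 0, 6, 5],
                       [1, 0, 0, 4, 7]], dtype=float)

    Xc = counts - counts.mean(axis=0)          # center the word-count vectors
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    Z = Xc @ Vt[:2].T                          # 2-D "semantic" coordinates per document
    print(np.round(Z, 2))                      # the two topics separate along the first axis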

Summary of PCA
- PCA looks for: a sequence of linear, orthogonal projections that reveal interesting structure in the data (a rotation).
- Defining "interesting": maximal variance under each projection, and uncorrelated structure after the projection.

Departure from PCA
Three directions of divergence:
- Other definitions of "interesting"?
  - Linear Discriminant Analysis
  - Independent Component Analysis
- Other methods of projection?
  - Linear but not orthogonal: sparse coding
  - Implicit, non-linear mappings
- Turning PCA into a generative model
  - Factor Analysis

Re-thinking "interestingness"
- It all depends on what you want.
- Linear Discriminant Analysis (LDA): supervised learning.
- Example: separating 2 classes -- the direction of maximal separation (LDA) is generally not the direction of maximal variance (PCA).
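
One standard way to make "maximal separation" precise is Fisher's criterion (not spelled out on the slide, added here for reference):

    w^{\ast} \;=\; \arg\max_{w}\; \frac{w^{\mathsf T} S_B\, w}{w^{\mathsf T} S_W\, w}

where S_B is the between-class scatter (built from the class means) and S_W the within-class scatter; for two classes the solution is w* ∝ S_W^{-1}(m1 - m2). Compare with PCA, which maximizes w^T Cov(X) w and never looks at the class labels.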

Re-thinking "interestingness"
- Most high-dimensional data look Gaussian under linear projections, so perhaps non-Gaussian projections are the interesting ones.
- Independent Component Analysis; projection pursuit.
- Example: an ICA projection of 2-class data looks for the direction that is most unlike a Gaussian (e.g. by maximizing kurtosis).
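
For reference (a standard definition, not given on the slide): for a zero-mean projection y = w^T x, the kurtosis is

    \operatorname{kurt}(y) \;=\; \mathbb{E}\!\left[y^4\right] \;-\; 3\left(\mathbb{E}\!\left[y^2\right]\right)^2

It is zero for a Gaussian, so ICA and projection pursuit search for directions where |kurt(y)| (or another non-Gaussianity measure, such as negentropy) is large.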

The “efficient coding” perspective
- Sparse coding: the projections do not have to be orthogonal, and there can be more basis vectors than the dimension of the space.
- Representation using an over-complete basis (basis expansion with p basis vectors): p << d gives compact coding (PCA); p > d gives sparse coding.

“Interesting” can be expensive
- These methods often face difficult optimization problems: many constraints, lots of parameter sharing.
- They are expensive to compute -- no longer just an eigenvalue problem.

PCA’s relatives: Factor Analysis
- PCA is not a generative model: the reconstruction error is not a likelihood, it is sensitive to outliers, and it is hard to build into bigger models.
- Factor Analysis adds measurement noise to account for variability in the observation:
  - factors: spherical Gaussian, N(0, I)
  - loading matrix (scaled PCs)
  - measurement noise N(0, R), with R diagonal
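
The generative model behind those labels, written out (standard factor-analysis notation; the symbols x, z, Λ, ε are my own choice, since the slide's equation did not survive the transcript):

    x \;=\; \Lambda z + \varepsilon,
    \qquad z \sim \mathcal{N}(0, I),
    \qquad \varepsilon \sim \mathcal{N}(0, R)\ \text{with } R \text{ diagonal}

which implies x ~ N(0, Λ Λ^T + R): the loading matrix Λ plays the role of scaled principal components, and R captures per-dimension measurement noise.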

PCA’s relatives: Factor Analysis
- Generative view: start from a sphere --> stretch and rotate --> add noise.
- Learning: a version of the EM algorithm.
- Demo: plot(r'); [C, R, syn] = fa_demo(r, 2, lab); plot(syn');
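
The fa_demo script is not part of the transcript; as a rough stand-in, here is a sketch using scikit-learn's FactorAnalysis (which fits the same Gaussian latent-variable model by maximum likelihood) on made-up data:

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(4)
    # Generative view: spherical factors --> stretch/rotate via a loading matrix --> add noise
    Z = rng.normal(size=(500, 2))                       # factors ~ N(0, I)
    Lambda = np.array([[3.0, 0.0],
                       [2.0, 1.0],
                       [0.0, 0.5]])                     # 3 observed dimensions, 2 factors
    X = Z @ Lambda.T + 0.1 * rng.normal(size=(500, 3))  # add diagonal measurement noise

    fa = FactorAnalysis(n_components=2)
    Z_hat = fa.fit_transform(X)                         # estimated factors
    print(fa.components_.shape)                         # (2, 3): estimated loading matrix
    print(np.round(fa.noise_variance_, 3))              # estimated diagonal of R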