Principal Component Analysis

Principal Component Analysis CMPUT 466/551 Nilanjan Ray

Overview
- Principal component analysis (PCA) is a way to reduce data dimensionality.
- PCA projects high-dimensional data onto a lower dimension.
- PCA projects the data in the least-squares sense: it captures the big (principal) variability in the data and ignores the small variability.

PCA: An Intuitive Approach
- Let us say we have data points xi, i = 1…N, in p dimensions (p is large).
- If we want to represent the data set by a single point x0, the natural choice is the sample mean: x0 = m = (1/N) Σi xi.
- Can we justify this choice mathematically? It turns out that if you minimize the squared-error criterion J0(x0) = Σi ||x0 − xi||², you get the above solution, viz., the sample mean (a short derivation is sketched below).
- Source: Chapter 3 of [DHS]
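Reconstructing the missing criterion and its minimizer from the standard treatment in [DHS], Chapter 3 (a sketch in the slide's notation):

```latex
J_0(x_0) = \sum_{i=1}^{N} \lVert x_0 - x_i \rVert^2
         = N \lVert x_0 - m \rVert^2 + \sum_{i=1}^{N} \lVert x_i - m \rVert^2 ,
\qquad m = \frac{1}{N} \sum_{i=1}^{N} x_i ,
```

because the cross term $\sum_i (x_0 - m)^\top (x_i - m)$ vanishes when $m$ is the sample mean; the second sum does not depend on $x_0$, so $J_0$ is minimized at $x_0 = m$.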

PCA: An Intuitive Approach…
- Representing the data set xi, i = 1…N, by its mean alone is quite uninformative.
- So let's try to represent the data by a straight line of the form x = m + a e.
- This is the equation of a straight line that passes through m; e is a unit vector along the straight line, and the signed distance of a point x from m along e is a.
- The training points projected onto this straight line would be m + ai e, i = 1…N.

PCA: An Intuitive Approach…
- Let's now determine the ai's by minimizing the squared error J1(a1, …, aN, e) = Σi ||(m + ai e) − xi||².
- Partially differentiating with respect to ai and setting the derivative to zero, we get ai = eᵀ(xi − m).
- Plugging this expression for ai back into J1, we get J1(e) = −eᵀS e + Σi ||xi − m||², where S = Σi (xi − m)(xi − m)ᵀ is called the scatter matrix (the algebra is worked out below).
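The intermediate algebra, reconstructed from [DHS] as a sketch (using the unit-vector constraint ‖e‖ = 1):

```latex
\frac{\partial J_1}{\partial a_i} = 2 a_i - 2\, e^\top (x_i - m) = 0
\;\Rightarrow\; a_i = e^\top (x_i - m),
```

```latex
J_1(e) = -\sum_{i=1}^{N} \bigl( e^\top (x_i - m) \bigr)^2 + \sum_{i=1}^{N} \lVert x_i - m \rVert^2
       = -\, e^\top S e + \text{const},
\qquad
S = \sum_{i=1}^{N} (x_i - m)(x_i - m)^\top .
```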

PCA: An Intuitive Approach…
- So minimizing J1 is equivalent to maximizing eᵀS e, subject to the constraint that e is a unit vector, eᵀe = 1.
- Use the Lagrange multiplier method to form the objective function, and differentiate to obtain the eigenvalue equation S e = λ e (worked out below).
- The solution is that e is the eigenvector of S corresponding to the largest eigenvalue.
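The Lagrangian step sketched explicitly ($\lambda$ denotes the multiplier):

```latex
u(e) = e^\top S e - \lambda \bigl( e^\top e - 1 \bigr),
\qquad
\frac{\partial u}{\partial e} = 2 S e - 2 \lambda e = 0
\;\Rightarrow\; S e = \lambda e .
```

At such a stationary point $e^\top S e = \lambda$, so the constrained maximum is attained by the eigenvector with the largest eigenvalue.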

PCA: An Intuitive Approach…
- The preceding analysis can be extended in the following way. Instead of projecting the data points onto a straight line, we may now want to project them onto a d-dimensional plane of the form x = m + a1 e1 + … + ad ed, where d is much smaller than the original dimension p.
- In this case one can form the analogous objective function Jd (written out below).
- It can also be shown that the minimizing vectors e1, e2, …, ed are the d eigenvectors of the scatter matrix corresponding to its d largest eigenvalues.
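The d-dimensional objective the slide refers to, reconstructed from [DHS] as a sketch:

```latex
J_d = \sum_{i=1}^{N} \Bigl\lVert \Bigl( m + \sum_{k=1}^{d} a_{ik}\, e_k \Bigr) - x_i \Bigr\rVert^2 ,
```

which, for orthonormal $e_k$ and $a_{ik} = e_k^\top (x_i - m)$, is minimized when $e_1, \dots, e_d$ are the eigenvectors of $S$ belonging to its $d$ largest eigenvalues.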

PCA: Visually
Data points are represented in a rotated orthogonal coordinate system: the origin is the mean of the data points and the axes are provided by the eigenvectors.

Computation of PCA
- In practice we compute PCA via the SVD (singular value decomposition).
- Form the centered data matrix X = [x1 − m, x2 − m, …, xN − m], a p-by-N matrix whose columns are the centered data points.
- Compute its SVD: X = U D Vᵀ, where U and V are orthogonal matrices and D is a diagonal matrix of singular values.

Computation of PCA…
- Note that the scatter matrix can be written as S = X Xᵀ = U D² Uᵀ.
- So the eigenvectors of S are the columns of U and the eigenvalues are the diagonal elements of D².
- Take only a few significant eigenvalue-eigenvector pairs, d << p; the new reduced-dimension representation becomes yi = Udᵀ(xi − m), where Ud holds the first d columns of U (a short NumPy sketch follows).
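A minimal NumPy sketch of this SVD-based computation (the function name pca_svd and the p-by-N, columns-as-samples convention are assumptions made here for illustration, not part of the original slides):

```python
import numpy as np

def pca_svd(X, d):
    """PCA of a p-by-N data matrix X (columns are data points) via SVD.

    Returns the top-d principal directions (eigenvectors of the scatter
    matrix S = Xc Xc^T), their eigenvalues, and the d-dimensional
    representation of each data point.
    """
    m = X.mean(axis=1, keepdims=True)          # sample mean
    Xc = X - m                                 # centered data matrix
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    # S = Xc Xc^T = U diag(s^2) U^T: eigenvectors of S are the columns of U,
    # eigenvalues are the squared singular values.
    Ud = U[:, :d]                              # d principal directions
    Y = Ud.T @ Xc                              # reduced representation, d-by-N
    return Ud, s[:d] ** 2, Y
```

For example, `Ud, eigvals, Y = pca_svd(X, 2)` would project the data onto its first two principal components.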

Computation of PCA…
- Sometimes we are given only a few high-dimensional data points, i.e., p >> N.
- In such cases compute the SVD of the much smaller N-by-p matrix Xᵀ: Xᵀ = V D Uᵀ, so that we still get X = U D Vᵀ without ever forming the p-by-p scatter matrix.
- Then proceed as before, choosing only d < N significant eigenvalues for the data representation yi = Udᵀ(xi − m) (sketched in code below).
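A sketch of this small-sample route via the N-by-N matrix XᵀX, which shares its nonzero eigenvalues with the scatter matrix (the function and variable names are illustrative assumptions):

```python
import numpy as np

def pca_small_sample(Xc, d):
    """PCA when p >> N, given a p-by-N matrix Xc of centered data points.

    Works with the small N-by-N matrix Xc^T Xc = V D^2 V^T instead of the
    huge p-by-p scatter matrix, then recovers U = Xc V D^{-1}.
    Assumes the d largest eigenvalues are strictly positive.
    """
    G = Xc.T @ Xc                              # N-by-N, cheap when N is small
    evals, V = np.linalg.eigh(G)               # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:d]          # keep the d largest
    lam, Vd = evals[idx], V[:, idx]
    Ud = (Xc @ Vd) / np.sqrt(lam)              # p-dim eigenvectors of Xc Xc^T
    Y = Ud.T @ Xc                              # d-by-N reduced representation
    return Ud, lam, Y
```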

PCA: A Gaussian Viewpoint
- Model the data as multivariate Gaussian, where the covariance matrix Σ is estimated from the scatter matrix as Σ = (1/N)S; the u's and λ's are respectively the eigenvectors and eigenvalues of S.
- If p is large, then we need an even larger number of data points to estimate the covariance matrix. So, when only a limited number of training data points is available, the estimate of the covariance matrix goes quite wrong. This is known as the curse of dimensionality in this context.
- To combat the curse of dimensionality, we discard the smaller eigenvalues and are content with the truncated expansion of Σ written out below.
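Reconstructing the Gaussian model and the truncated covariance the slide refers to (a sketch in the slide's notation):

```latex
p(x) = \frac{1}{(2\pi)^{p/2} \lvert \Sigma \rvert^{1/2}}
       \exp\!\Bigl( -\tfrac{1}{2} (x - m)^\top \Sigma^{-1} (x - m) \Bigr),
\qquad
\Sigma = \frac{1}{N} S = \frac{1}{N} \sum_{j=1}^{p} \lambda_j\, u_j u_j^\top
\;\approx\; \frac{1}{N} \sum_{j=1}^{d} \lambda_j\, u_j u_j^\top .
```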

PCA Examples
- Image compression example
- Novelty detection example

Kernel PCA
- The assumption behind PCA is that the data points x are multivariate Gaussian.
- Often this assumption does not hold.
- However, it may still be possible that a transformation Φ(x) of the data is Gaussian; then we can perform PCA in the space of Φ(x).
- Kernel PCA performs this PCA; however, because of the "kernel trick," it never computes the mapping Φ(x) explicitly!

KPCA: Basic Idea

Kernel PCA Formulation
- We need the following fact. Let v be an eigenvector of the scatter matrix of centered data points, S = Σi xi xiᵀ, with S v = λ v and λ ≠ 0.
- Then v belongs to the linear space spanned by the data points xi, i = 1, 2, …, N.
- Proof: sketched below.
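A short proof sketch of this fact (assuming centered data and $\lambda \neq 0$):

```latex
S v = \lambda v
\;\Rightarrow\;
v = \frac{1}{\lambda}\, S v
  = \frac{1}{\lambda} \sum_{i=1}^{N} x_i\, x_i^\top v
  = \sum_{i=1}^{N} \frac{x_i^\top v}{\lambda}\; x_i
\;\in\; \operatorname{span}\{ x_1, \dots, x_N \} .
```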

Kernel PCA Formulation…
- Let C be the scatter matrix of the centered mapping Φ(x): C = (1/N) Σi Φ(xi) Φ(xi)ᵀ.
- Let w be an eigenvector of C; then, by the fact above, w can be written as a linear combination w = Σi αi Φ(xi).
- Also, we have the eigenvalue equation C w = λ w.
- Combining, we get the expression worked out below.
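Substituting the expansion of $w$ into $C w = \lambda w$ gives the combined equation (reconstructed following Schölkopf et al.):

```latex
\frac{1}{N} \sum_{i=1}^{N} \Phi(x_i)\, \Phi(x_i)^\top
\sum_{j=1}^{N} \alpha_j\, \Phi(x_j)
= \lambda \sum_{i=1}^{N} \alpha_i\, \Phi(x_i) .
```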

Kernel PCA Formulation…
- Left-multiplying the combined equation by Φ(xk)ᵀ for each k, and defining the kernel or Gram matrix K with entries Kij = Φ(xi)ᵀΦ(xj) = k(xi, xj), gives K²α = Nλ Kα, which reduces to the eigenvalue problem Kα = Nλα.

Kernel PCA Formulation…
- From the eigen equation Kα = Nλα, and the fact that the eigenvector w is normalized to 1 (wᵀw = 1), we obtain wᵀw = Σij αi αj Kij = αᵀKα = Nλ (αᵀα) = 1, i.e., the coefficient vector α must be rescaled so that ||α||² = 1/(Nλ).

KPCA Algorithm
- Step 1: Compute the Gram matrix: Kij = k(xi, xj), i, j = 1…N.
- Step 2: Compute the (eigenvalue, eigenvector) pairs of K: K α^l = N λl α^l, l = 1…d.
- Step 3: Normalize the eigenvectors: ||α^l||² = 1/(N λl).
- Thus, an eigenvector wl of C is now represented as wl = Σi α_i^l Φ(xi).
- To project a test feature Φ(x) onto wl we need to compute wlᵀΦ(x) = Σi α_i^l k(xi, x). So, we never need Φ explicitly (see the sketch below).
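A compact NumPy sketch of these steps (the RBF kernel choice, the parameter gamma, and the function names are assumptions made here for illustration; the centering step uses the formula given on the next slide):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """k(x, y) = exp(-gamma ||x - y||^2) for the rows of X and Y (an assumed choice)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kpca(X, d, gamma=1.0):
    """Kernel PCA of an N-by-p data matrix X.

    Returns the expansion coefficients alpha (one column per component)
    and the projections of the training points onto the d components.
    """
    N = X.shape[0]
    K = rbf_kernel(X, X, gamma)                  # Step 1: Gram matrix
    one_n = np.full((N, N), 1.0 / N)             # center K in feature space
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    evals, evecs = np.linalg.eigh(Kc)            # Step 2: eigenpairs of the centered Gram matrix
    idx = np.argsort(evals)[::-1][:d]            # keep the d largest
    lam_tilde, alpha = evals[idx], evecs[:, idx] # lam_tilde = N * lambda
    alpha = alpha / np.sqrt(lam_tilde)           # Step 3: ||alpha^l||^2 = 1/(N lambda_l)
    Y = Kc @ alpha                               # projections of the training data
    return alpha, Y
```

Projecting a new test point would reuse the same alphas with the (centered) kernel evaluations k(xi, x), exactly as in the last bullet above.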

Feature Map Centering
- So far we assumed that the feature map Φ(x) is centered for the data points x1, …, xN.
- Actually, this centering can be done on the Gram matrix without ever explicitly computing the feature map Φ(x): K̃ = K − 1N K − K 1N + 1N K 1N is the kernel matrix for centered features, where 1N denotes the N-by-N matrix with every entry equal to 1/N (a small code sketch follows).
- A similar expression exists for projecting test features onto the feature eigenspace.
- Schölkopf, Smola, Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Technical Report No. 44, Max Planck Institute, 1996.
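Restating the centering step from the sketch above as a standalone helper (the name center_gram is an assumption; it implements exactly the formula on this slide):

```python
import numpy as np

def center_gram(K):
    """Center a Gram matrix in feature space without computing Phi:
    K~ = K - 1_N K - K 1_N + 1_N K 1_N, with 1_N the N-by-N matrix
    whose entries are all 1/N."""
    N = K.shape[0]
    one_n = np.full((N, N), 1.0 / N)
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n
```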

KPCA: USPS Digit Recognition
- Linear PCA (for comparison)
- Kernel function:
- Classifier: linear SVM with kernel principal components as features
- N = 3000 training examples, p = 16-by-16 pixel images
- Schölkopf, Smola, Müller, "Nonlinear component analysis as a kernel eigenvalue problem," Technical Report No. 44, Max Planck Institute, 1996.