# Summarized by Soo-Jin Kim

## Presentation on theme: "Summarized by Soo-Jin Kim"— Presentation transcript:

Ch 12. continuous latent variables Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
Summarized by Soo-Jin Kim Biointelligence Laboratory, Seoul National University

(C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
Contents 12.1 Principle Component Analysis Maximum variance formulation Minimum-error formulation Application of PCA PCA for high-dimensional data (C) 2007, SNU Biointelligence Lab,

12.1 Principal Component Analysis
Principal Component Analysis (PCA) PCA is used for applications such as dimensionality reduction, lossy data compression, feature extraction and data visualization. Also known as Karhunen-Loeve transform PCA can be defined as the orthogonal projection of the data onto a lower dimensional linear space, known as the principal subspace, such that the variance of the projected data is maximized. Principal subspace Orthogonal projection of the data points The subspace maximizes the variance of the projected points Minimizing the sum-of-squares of the projection errors (C) 2007, SNU Biointelligence Lab,

12.1.1 Maximum variance formulation (1/2)
Consider data set of observations {xn} where n = 1,…,N, and xn is a Euclidean variable with dimensionality D. To project the data onto a space having dimensionality M<D while maximizing the variance of the projected data. One-dimensional space (M=1) Define the direction of this space using a D-dimensional vector u1 The mean of the projected data is where is the sample set mean given by The variance of the projected data is given by Maximize the projected variance S is the data covariance matrix (C) 2007, SNU Biointelligence Lab,

12.1.1 Maximum variance formulation (2/2)
Lagrange multiplier Make an unconstrained maximization of (u1 must be an eigenvector of S) The variance will be maximum when we set u1 equal to the eigenvector having the largest eigenvalue λ1. PCA involves evaluating the mean x and the covariance matrix S of the data set and then finding the M eigenvectors of S corresponding to the M largest eigenvalues. The cost of computing the full eigenvector decomposition for a matrix of size D x D is O(D3) (this eigenvector is the first principal component) (C) 2007, SNU Biointelligence Lab,

12.1.2 Minimum-error formulation (1/4)
Based on projection error minimization A complete orthogonal set of D-dimensional basis vectors {ui} where i = 1,…,D that satisfy Each data point can be represented exactly by a linear combination of the basis vectors To approximate this data point using a representation involving a restricted number M<D of variables corresponding to projection onto a lower-dimensional subspace. coefficient (Without loss of generality) (C) 2007, SNU Biointelligence Lab,

12.1.2 Minimum-error formulation (2/4)
We approximate each data point xn by Distortion measure the squared distance between the original data point xn and its approximation , so that goal is to minimize the minimization with respect to the quantities {znj} where j = 1,…, M and similarly, where j = M+1, …, D. Depend on particular data point Constants that are the same for all data points to choose the {ui}, the {zni}, and the {bj} so as to minimize the distortion The displacement vector from xn to lies in the space orthogonal to the principal subspace (Fig ). (C) 2007, SNU Biointelligence Lab,

12.1.2 Minimum-error formulation (3/4)
Therefore, distortion measure J The case of a 2-dimensional data space D=2 and 1-dimensional principal subspace M=1 To choose a direction u2 so as to minimize Using a Lagrange multiplier λ2 to enforce the constraint The minimum value of J by choosing u2 to be the eigenvector corresponding to the smaller of the two eigenvalues choose the principal subspace to be aligned with the eigenvector having the larger eigenvalues (in order to minimize the average squared projection distance) (C) 2007, SNU Biointelligence Lab,

12.1.2 Minimum-error formulation (4/4)
General solution to the minimization of J for arbitrary M < D The distortion measure is simply the sum of the eigenvalues of those eigenvectors that are orthogonal to the principal subspace. Canonical correlation analysis (CCA) linear dimensionality reduction technique called CCA PCA works with a single random variable, CCA consider two (or more) variables (C) 2007, SNU Biointelligence Lab,

12.1.3 Applications of PCA (1/4)
Data compression The first five eigenvectors, along with the corresponding eigenvalues The distortion measure J introduced by projecting the data onto a principal component subspace of dimensionality M the eigenvalue spectrum (C) 2007, SNU Biointelligence Lab,

12.1.3 Applications of PCA (2/4)
PCA approximation to a data vector xn in the form PCA reconstruction of data points by M principal components (C) 2007, SNU Biointelligence Lab,

12.1.3 Applications of PCA (3/4)
Data pre-processing The transformation of a data set in order to standardize certain of its properties (allowing subsequent pattern recognition algorithm) PCA makes a more substantial normalization of the data to give it zero mean and unit covariance L, DxD diagonal matrix with elements λi U, DxD orthogonal matrix with columns given by ui (C) 2007, SNU Biointelligence Lab,

12.1.3 Applications of PCA (4/4)
Compare to PCA with the Fisher linear discriminant for linear dimensionality reduction PCA chooses the direction of maximum variance Fisher linear discriminant takes account of the class labels information (C) 2007, SNU Biointelligence Lab,

12.1.4 PCA for high-dimensional data
computationally infeasible (O(D3) ) Resolve this problem Let us define X to be the (NxD)-dimensional centered data matrix, whose nth row is given by Solve the eigenvector problem in spaces of lower dimensionality with computational cost O(N3) instead of O(D3) This is an eigenvector of S with eigenvalue λi (vi=Xui) (C) 2007, SNU Biointelligence Lab,