Feature space transformation methods

Basic idea
- Instead of using the original features as they are, we transform them into a new feature space.
- This is similar to how the φ() function worked for SVMs: instead of the original feature vector x, we create a new feature vector y.
- We hope that this transformation
  - reduces the number of features (so k < d), which makes training faster and helps avoid the "curse of dimensionality",
  - but retains the important information from the original features while reducing their redundancy, so it also helps the subsequent classification algorithm.
- Here we will study only linear transformation methods.

Simple transformation methods
- The simplest transformation methods transform each feature independently.
- Centering (or zero-centering): from each feature we subtract the mean of that feature. This simply shifts the mean of the data to the origin.
- Scaling: we multiply each feature by a constant: x → y = f(x) = ax.
- For example, min-max scaling guarantees that the value of x for each sample falls between given limits. For the range [0,1]: y = (x − min(x)) / (max(x) − min(x)).

Simple transformation methods
- Min-max scaling is very sensitive to outliers, i.e. data samples that take very large or very small values (perhaps only by mistake).
- It is therefore more reliable to scale the variance of x instead of its min and max.
- Centering each feature's mean to 0 and scaling its variance to 1 is known as data standardization (sometimes also called data normalization).
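
A minimal NumPy sketch of these three per-feature transformations (the toy matrix X, with rows as samples and columns as features, and all variable names are our own illustration, not part of the slides):

```python
import numpy as np

# Toy data: rows are samples, columns are features
X = np.array([[1.0, 200.0],
              [2.0, 250.0],
              [3.0, 900.0]])

# Centering: subtract each feature's mean -> the data mean moves to the origin
X_centered = X - X.mean(axis=0)

# Min-max scaling to [0, 1]: y = (x - min(x)) / (max(x) - min(x))
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: zero mean and unit variance for every feature
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_standardized.mean(axis=0).round(6), X_standardized.std(axis=0))
```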

Linear feature space transformation methods
- While data normalization shifted and scaled each feature independently, we now move on to methods that apply a general linear transformation to the features, so that the new feature vector y is obtained from x as y = Ax.
- A general linear transformation can also rotate the feature space.
- For example, we will see how to obtain decorrelated features.
- Whitening: decorrelation plus standardization.
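
Decorrelation is worked out in detail via PCA below; as a rough preview, a possible NumPy sketch of PCA-based whitening could look like the following (the function name whiten, the eps constant and the toy data are assumptions of ours, not from the slides):

```python
import numpy as np

def whiten(X, eps=1e-8):
    """PCA whitening: rotate onto decorrelated axes, then scale each axis to unit variance."""
    Xc = X - X.mean(axis=0)                   # center the data
    cov = np.cov(Xc, rowvar=False)            # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh: the covariance matrix is symmetric
    Xrot = Xc @ eigvecs                       # decorrelated (rotated) features
    return Xrot / np.sqrt(eigvals + eps)      # unit variance along every direction

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0.0, 0.0], [[4.0, 1.5], [1.5, 1.0]], size=500)
Xw = whiten(X)
print(np.cov(Xw, rowvar=False).round(3))      # approximately the identity matrix
```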

Principal components analysis (PCA)
- Main idea: we want to reduce the number of dimensions (features), but we also want to minimize the information loss caused by the dimension reduction.
- Example: suppose we have two features, x and y, and the data has a Gaussian distribution with a diagonal covariance matrix.
- The data has a much larger variance along x than along y (σx² is much larger than σy²).
- We want to drop one of the features to reduce the dimension from 2 to 1.
- We keep x and drop y, as this choice results in the smaller information loss.

Principal components analysis (PCA)
- Example with 3 dimensions (figure: original 3D data, its best 2D approximation, and its best 1D approximation).
- PCA will be similar, but instead of keeping or dropping features separately, it applies a linear transformation, so it is also able to rotate the data.
- Gaussian example: when the covariance matrix is not diagonal, PCA creates the new feature set by finding the directions with the largest variance.
- Although the above examples use Gaussian distributions, PCA does not require the data to be Gaussian.

Principal components analysis (PCA)
- Let's return to the example where we want to reduce the number of dimensions from 2 to 1.
- Instead of discarding one of the features, we now look for a projection into 1D; the projection axis will be our new 1D feature y.
- The projection is optimal if it preserves the direction along which the data has the largest variance.
- As we will see, this is equivalent to looking for the projection direction which minimizes the projection error.

Formalizing PCA
- We start by centering the data, and examine only projection directions that go through the origin.
- Any d-dimensional linear space can be represented by a set of d orthonormal basis vectors (orthonormal: they are orthogonal and their length is 1).
- If the basis vectors are denoted by v1, v2, …, vd, then any point x in this space can be written as x = x1v1 + x2v2 + … + xdvd (the xi are scalars, the vi are vectors!).
- PCA seeks a new orthonormal basis (i.e. coordinate system) of dimension k < d, so it defines a subspace of the original space.

Formalizing PCA
- We want to find the most accurate representation of our training data D = {x1, …, xn} in some subspace W of dimension k < d.
- Let {e1, …, ek} denote the orthonormal basis vectors of this subspace. Then any point in this subspace can be represented as α1e1 + α2e2 + … + αkek.
- However, the original data points lie in a higher-dimensional space (remember that k < d), so in general they cannot be represented exactly in this subspace.
- Still, we are looking for the representation with the smallest error. E.g. for data point x1, the best approximation minimizes ||α11e1 + … + α1kek − x1||².

Formalizing PCA
- Any training sample xj can be approximated in the subspace as αj1e1 + αj2e2 + … + αjkek.
- We want to minimize the error of this representation over all the training samples, so the total error can be written as J(e1, …, ek) = Σj ||Σi αji ei − xj||².
- We have to minimize J, while also ensuring that the resulting vectors e1, …, ek are orthonormal.
- The solution can be found by setting the derivative to zero and using Lagrange multipliers to satisfy the constraints.
- For the derivation see http://www.csd.uwo.ca/~olga/Courses/CS434a_541a/Lecture7.pdf

PCA – the result
- After a long derivation, J reduces to J = −Σi ei^T S ei + Σj ||xj||², where S = Σj xj xj^T.
- Apart from a constant multiplier, S is equivalent to the covariance matrix of the data (remember that μ is zero, as we centered the data).
- Minimizing J is therefore equivalent to maximizing Σi ei^T S ei.
- This is maximized by choosing as the new basis vectors e1, …, ek those eigenvectors of S which have the largest corresponding eigenvalues λ1, …, λk.
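
A minimal NumPy sketch of this result (the helper name pca_basis and the toy data are our own; np.linalg.eigh is used because S is symmetric):

```python
import numpy as np

def pca_basis(X, k):
    """Top-k principal directions: eigenvectors of S with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)               # PCA assumes centered data
    S = Xc.T @ Xc                         # scatter matrix = covariance up to a constant
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]     # decreasing eigenvalue order
    return eigvecs[:, order[:k]], eigvals[order[:k]]

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 3)) @ np.diag([5.0, 2.0, 0.5])  # anisotropic toy data
E, lam = pca_basis(X, k=2)
print(E.shape, lam)   # (3, 2) and the two largest eigenvalues
```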

PCA – the result
- The larger the eigenvalue λi, the larger the variance in the direction of the corresponding eigenvector.
- So the solution is indeed what we intuitively expected: PCA projects the data into a k-dimensional subspace along the directions in which the data shows the largest variance.
- PCA can therefore be interpreted as finding a new orthogonal basis by rotating the coordinate system until the directions of maximum variance are found.

How PCA approximates the data
- Let e1, …, ed be the eigenvectors of S, sorted in decreasing order of the corresponding eigenvalues.
- In the original space, any point x can be written as x = α1e1 + α2e2 + … + αded.
- In the transformed space, we keep only the first k components and drop the remaining ones, so we obtain only an approximation of x: x ≈ α1e1 + … + αkek.
- The αj components are called the principal components of x.
- The larger the value of k, the better the approximation.
- The components are arranged in order of importance: the more important components come first.
- So PCA keeps the k most important principal components of x as the approximation of x.
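
A small self-contained illustration (with our own toy data and naming) of how the approximation error shrinks as k grows:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))   # correlated toy data
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
E = eigvecs[:, ::-1]                     # columns reordered: most important direction first

for k in range(1, 6):
    Ek = E[:, :k]                        # keep the k most important directions
    X_approx = Xc @ Ek @ Ek.T            # project onto the subspace, then map back
    err = np.linalg.norm(Xc - X_approx) ** 2
    print(f"k={k}: total squared approximation error {err:.2f}")   # shrinks as k grows
```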

The last step – transforming the data
- Let's see how to transform the data into the new coordinate system.
- Define the linear transformation matrix E as the k×d matrix whose rows are the basis vectors e1^T, …, ek^T.
- Then each data item x can be transformed as y = Ex.
- So instead of the original d-dimensional feature vector x, we can now use the k-dimensional transformed feature vector y.
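
A possible NumPy sketch of this transformation step (again with our own toy data; E is built with the eigenvectors as rows so that y = Ex holds):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 4)) @ rng.standard_normal((4, 4))   # toy 4-D data
Xc = X - X.mean(axis=0)                                           # PCA works on centered data

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
k = 2
E = eigvecs[:, np.argsort(eigvals)[::-1][:k]].T    # k x d matrix: rows are e1, ..., ek

y = E @ Xc[0]             # one sample: y = Ex, a k-dimensional feature vector
Y = Xc @ E.T              # the whole (centered) data set at once
print(y.shape, Y.shape)   # (2,) and (100, 2)
```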

PCA – summary

Linear discriminant analysis (LDA)
- PCA keeps the directions along which the variance of the data is maximal. However, it ignores the class labels, so the directions with the maximal variance may be useless for separating the classes (see the example in the figure).
- Linear discriminant analysis (also known as Fisher's linear discriminant in the 2-class case) seeks to project the data along directions that are useful for separating the classes.
- PCA is an unsupervised feature space transformation method (no class labels required).
- LDA is a supervised feature space transformation method (it takes the class labels into consideration).

LDA – main idea
- The main idea is to find projection directions along which the samples belonging to the different classes are well separated.
- Example (2D, 2 classes): how can we measure the separation between the projections of the two classes?
- Let's try the distance between the projections of the class means μ1 and μ2; this will be denoted by |μ̃1 − μ̃2|, where μ̃i is the projection of μi.

LDA – example
- How good is |μ̃1 − μ̃2| as a measure of class separation?
- In the example above, projecting onto the vertical axis separates the classes better than projecting onto the horizontal axis, yet |μ̃1 − μ̃2| is larger for the latter.
- The problem is that we should also take the variances into consideration: both classes have a larger variance along the horizontal axis than along the vertical axis.

LDA – formalization for 2 classes
- We look for a projection direction that maximizes the distance between the projected class means while minimizing the class variances at the same time.
- Let yi denote the projected samples. The scatter of the two projected classes is defined as s̃1² = Σ over class 1 of (yi − μ̃1)² and s̃2² = Σ over class 2 of (yi − μ̃2)² (scatter is the same as variance, up to a constant multiplier).
- Fisher's linear discriminant seeks a projection direction v which maximizes J(v) = (μ̃1 − μ̃2)² / (s̃1² + s̃2²).
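
A small NumPy sketch of this criterion (the function fisher_criterion and the toy classes are our own illustration); it reproduces the earlier 2-D example, where the vertical axis separates the classes better than the horizontal one:

```python
import numpy as np

def fisher_criterion(X1, X2, v):
    """J(v) = (projected mean distance)^2 / (sum of projected scatters) for direction v."""
    y1, y2 = X1 @ v, X2 @ v                                   # project both classes onto v
    m1, m2 = y1.mean(), y2.mean()                             # projected class means
    s1, s2 = ((y1 - m1) ** 2).sum(), ((y2 - m2) ** 2).sum()   # projected scatters
    return (m1 - m2) ** 2 / (s1 + s2)

# Two classes that differ along the vertical axis but have large horizontal variance
rng = np.random.default_rng(4)
X1 = rng.multivariate_normal([0.0, 0.0], [[4.0, 0.0], [0.0, 0.5]], size=200)
X2 = rng.multivariate_normal([0.0, 2.0], [[4.0, 0.0], [0.0, 0.5]], size=200)
print(fisher_criterion(X1, X2, np.array([1.0, 0.0])))   # horizontal axis: poor separation
print(fisher_criterion(X1, X2, np.array([0.0, 1.0])))   # vertical axis: much larger J(v)
```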

LDA – interpretation
- A large J(v) value means that along v the projected class means are far apart relative to the scatter of the projected classes, i.e. the classes are well separated.
- To solve this maximization we have to express J(v) explicitly, and then maximize it.

LDA – derivation
- We define the scatter matrices of class 1 and class 2 (before projection) as S1 = Σ over class 1 of (x − μ1)(x − μ1)^T and S2 = Σ over class 2 of (x − μ2)(x − μ2)^T.
- The "within class" scatter matrix is SW = S1 + S2.
- The "between class" scatter matrix is SB = (μ1 − μ2)(μ1 − μ2)^T; it measures the separation between the means of the two classes (before projection).
- It can be shown that s̃1² + s̃2² = v^T SW v and (μ̃1 − μ̃2)² = v^T SB v.

LDA – derivation
- Using the above notation, we obtain J(v) = (v^T SB v) / (v^T SW v).
- We can maximize it by taking the derivative and setting it to zero. As with PCA, this leads to an eigenvalue problem, with the solution v = SW⁻¹(μ1 − μ2).
- This is the optimal direction when we have two classes.
- The derivation can be extended to multiple classes as well: in the case of c classes we obtain c−1 directions, so instead of a projection vector v we obtain a projection matrix V.
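
A minimal NumPy sketch of the two-class solution (the helper fisher_direction and the toy data are assumptions of ours):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher direction v = SW^-1 (mu1 - mu2)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - mu1).T @ (X1 - mu1)        # scatter matrix of class 1
    S2 = (X2 - mu2).T @ (X2 - mu2)        # scatter matrix of class 2
    SW = S1 + S2                          # within-class scatter matrix
    v = np.linalg.solve(SW, mu1 - mu2)    # SW^-1 (mu1 - mu2) without forming the inverse
    return v / np.linalg.norm(v)          # the scale of v is arbitrary; normalize it

rng = np.random.default_rng(3)
X1 = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.0], [1.0, 1.0]], size=100)
X2 = rng.multivariate_normal([2.0, 2.0], [[3.0, 1.0], [1.0, 1.0]], size=100)
v = fisher_direction(X1, X2)
print("projected means:", (X1 @ v).mean().round(2), (X2 @ v).mean().round(2))
```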

LDA – multi-class case
- We are looking for the projection matrix V, so that each data sample xi is converted into a new feature vector bi by the linear transformation bi = V^T xi.

LDA – multi-class case
- The function to be maximized is now J(V) = |V^T SB V| / |V^T SW V| (a ratio of determinants).
- Solving the maximization again requires an eigenvalue decomposition, which yields the eigenvectors v1, …, vc−1.
- The optimal projection matrix into a subspace of k dimensions is given by the k eigenvectors that correspond to the k largest eigenvalues.
- As the rank of SB is at most c−1, LDA cannot produce more than c−1 directions, so the number of projected features obtained by LDA is limited to c−1 (remember that in the case of PCA the number of projected features could be up to d, the dimension of the original feature space).
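
A hedged NumPy sketch of the multi-class case (the helper lda_directions and the toy data are ours; SB is computed in the common class-size-weighted form, which the slides do not spell out):

```python
import numpy as np

def lda_directions(X, y, k):
    """Multi-class LDA: top-k eigenvectors of SW^-1 SB (at most c-1 are useful)."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    SW = np.zeros((d, d))
    SB = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        SW += (Xc - mu_c).T @ (Xc - mu_c)          # within-class scatter
        diff = (mu_c - mu).reshape(-1, 1)
        SB += len(Xc) * (diff @ diff.T)            # between-class scatter (class-size weighted)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(SW, SB))   # SW^-1 SB is not symmetric
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:k]]              # d x k projection matrix V

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(m, 1.0, size=(50, 4))
               for m in ([0, 0, 0, 0], [3, 3, 3, 3], [0, 3, 0, 3])])
y = np.repeat([0, 1, 2], 50)
V = lda_directions(X, y, k=2)   # c = 3 classes, so at most c-1 = 2 directions
Y = X @ V                       # the projected (at most c-1 dimensional) features
print(Y.shape)                  # (150, 2)
```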