Feature Selection and Dimensionality Reduction

“Curse of dimensionality”:
– The higher the dimensionality of the data, the more data is needed to learn a good model.
– The learned model may be overly complex and overfit, especially if some dimensions are irrelevant to the classification task.
How to deal with the curse of dimensionality?
– Assume a strong “prior”: e.g., the data follow a Gaussian distribution, so all we have to do is estimate its parameters.
– Reduce the dimensionality, either by selecting only the most relevant / useful features, or by projecting the data to a lower-dimensional space (e.g., principal components analysis).

Feature Selection
Goal: select a subset of d of the original n features (d < n) in order to maximize classification performance.

Types of Feature Selection Methods
– Filter
– Wrapper
– Embedded

Filter Methods
Independent of the classification algorithm: a filtering function is applied to the features before the classifier ever sees them.
Examples of filtering functions:
– Information gain of individual features
– Statistical variance of individual features
– Statistical correlation among features
Pipeline: training data → filter → selected features → classifier
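As a concrete illustration (not part of the slides), here is a minimal filter-method sketch in Python using scikit-learn: each feature is scored by its estimated mutual information with the class label, independently of any classifier, and only the top-scoring features are passed on. The synthetic dataset and the choice of k = 5 are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for real training data: 50 features, only 5 informative.
X, y = make_classification(n_samples=200, n_features=50, n_informative=5, random_state=0)

# Filter step: rank features by estimated mutual information with the label,
# then keep the 5 best. This happens before (and independently of) the classifier.
selector = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
X_selected = selector.transform(X)

# The classifier is trained only on the selected features.
clf = LogisticRegression(max_iter=1000).fit(X_selected, y)
print("kept feature indices:", selector.get_support(indices=True))
```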

Information Gain
Measures, from the training data, how much information about the classification task an individual feature f provides.
Let S denote the training set. Then:
InformationGain(S, f) = Entropy(S) − Entropy_f(S)
where Entropy_f(S) is the entropy of S with respect to feature f (defined below).

Intuitive idea of entropy
Entropy (or “information”) = “degree of disorder”: entropy measures how non-uniform a collection is.
Example: {1, 1, 1, 1, 1, 1, 1} vs. {1, −1, −1, −1, 1, 1, −1}
Note that this is the same as {1, 1, 1, 1, 1, 1, 1} vs. {−1, −1, −1, −1, 1, 1, 1}: only the proportions of each value matter, not their order.

Intuitive idea of entropy of a training set
Task: classify as Female or Male.
Training set 1: Barbara, Elizabeth, Susan, Lisa, Ellen, Kristi, Mike, John (6 female, 2 male: lower entropy)
Training set 2: Jane, Mary, Alice, Jenny, Bob, Allen, Doug, Steve (4 female, 4 male: higher entropy)

Intuitive idea of entropy with respect to a feature
Task: classify as Female or Male.
Instances: Jane, Mary, Alice, Jenny, Bob, Allen, Doug, Steve.
Each instance has two binary features: “wears lipstick” and “has long hair”.
Wears lipstick: T → {Jane, Mary, Alice, Jenny}, F → {Bob, Allen, Doug, Steve}. This is a pure split (the feature has low entropy).
Has long hair: T → {Jane, Mary, Bob, Steve}, F → {Alice, Jenny, Allen, Doug}. This is an impure split (the feature has high entropy).

Entropy of a training set
Let S be a set of training examples, with p+ = proportion of positive examples and p− = proportion of negative examples. Then:
Entropy(S) = − p+ log2(p+) − p− log2(p−)
This roughly measures how predictable the collection is, based only on the distribution of positive and negative examples.

Entropy with respect to a feature
Let f be a numeric feature and let t be a threshold. Split the training set S into two subsets:
S1 = {x in S : f(x) ≤ t} and S2 = {x in S : f(x) > t}
Then define:
Entropy_f(S) = (|S1| / |S|) Entropy(S1) + (|S2| / |S|) Entropy(S2)

Example
Training set S: Jane, Mary, Alice, Jenny, Bob, Allen, Doug, Steve (4 female, 4 male), so Entropy(S) = 1.
Wears lipstick: T → {Jane, Mary, Alice, Jenny}, F → {Bob, Allen, Doug, Steve}. Each subset is pure, so Entropy_lipstick(S) = 0 and the information gain is 1.
Has long hair: T → {Jane, Mary, Bob, Steve}, F → {Alice, Jenny, Allen, Doug}. Each subset is half female and half male, so Entropy_hair(S) = 1 and the information gain is 0.
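These numbers can be checked directly; the following Python sketch (an illustration, not from the slides) computes the entropy and information gain for this example.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a collection of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(labels, feature_values):
    """Entropy(S) minus the weighted entropy of the subsets induced by a feature."""
    total = len(labels)
    split_entropy = 0.0
    for value in set(feature_values):
        subset = [lab for lab, v in zip(labels, feature_values) if v == value]
        split_entropy += (len(subset) / total) * entropy(subset)
    return entropy(labels) - split_entropy

# Jane, Mary, Alice, Jenny, Bob, Allen, Doug, Steve
labels         = ['F', 'F', 'F', 'F', 'M', 'M', 'M', 'M']
wears_lipstick = [True, True, True, True, False, False, False, False]
has_long_hair  = [True, True, False, False, True, False, False, True]

print(entropy(labels))                           # 1.0
print(information_gain(labels, wears_lipstick))  # 1.0 (pure split)
print(information_gain(labels, has_long_hair))   # 0.0 (impure split)
```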

Wrapper Methods
The classifier itself is used to evaluate candidate feature subsets:
– Choose a subset of features from the training data.
– Train the classifier on that subset using cross-validation and record its accuracy on the validation folds.
– Repeat for different subsets and keep the subset with the best cross-validation accuracy.
After cross-validation: retrain the classifier on the entire training set using the best subset of features, then evaluate it on the test set.
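A minimal wrapper-method sketch in Python (an illustration, not the course's code): a greedy forward search in which each candidate feature subset is scored by the cross-validated accuracy of the classifier itself. The synthetic data, the linear SVM, and the target subset size of 4 are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=20, n_informative=4, random_state=0)

selected, remaining = [], list(range(X.shape[1]))
clf = LinearSVC(dual=False)

for _ in range(4):  # grow the subset one feature at a time
    # Score each candidate addition by 5-fold cross-validated accuracy.
    scores = {f: cross_val_score(clf, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    best = max(scores, key=scores.get)  # the feature whose addition helps most
    selected.append(best)
    remaining.remove(best)

print("selected features:", selected)

# After cross-validation: retrain on the full training set with the chosen
# subset; a held-out test set would be used for the final evaluation.
final_clf = clf.fit(X[:, selected], y)
```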

Wrapper Methods
– Pros: features are evaluated in the context of classification; the wrapper method selects the number of features to use.
– Cons: slow.
Filter Methods
– Pros: fast.
– Cons: the chosen filter might not be relevant for a specific kind of classifier; doesn’t take into account interactions among features; often hard to know how many features to select.
Intermediate method, often used with SVMs:
– Train an SVM using all features.
– Select the features f_i with the highest |w_i|.
– Retrain the SVM with the selected features.
– Test the SVM on the test set.
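A sketch of this intermediate method (illustrative only), assuming scikit-learn's LinearSVC and a synthetic dataset; keeping the 10 features with the largest |w_i| is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=50, n_informative=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Train a linear SVM using all features.
svm_all = LinearSVC(dual=False).fit(X_train, y_train)

# 2. Select the features with the largest |w_i|.
top = np.argsort(np.abs(svm_all.coef_[0]))[::-1][:10]

# 3. Retrain with only the selected features, then 4. test on the test set.
svm_sel = LinearSVC(dual=False).fit(X_train[:, top], y_train)
print("test accuracy with 10 features:", svm_sel.score(X_test[:, top], y_test))
```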

[Figure: results as a function of the number of features used; from M. Thomure, The Role of Prototype Learning in Hierarchical Models of Vision.]

Embedded Methods
Feature selection is an intrinsic part of the learning algorithm.
One example: the L1 SVM. Instead of minimizing the (squared) L2 norm of the weight vector, Σ_i w_i^2, we minimize its L1 norm, Σ_i |w_i|.
The result is that most of the weights go to zero, leaving a small subset of nonzero weights.
Cf. the field of “sparse coding”.
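A minimal sketch of this embedded approach (illustrative, not the course's code) using scikit-learn's L1-penalized LinearSVC on a synthetic dataset; the regularization strength C = 0.1 is an arbitrary choice. The L1 penalty drives many weights to exactly zero, so the surviving features are selected as a by-product of training.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=50, n_informative=8, random_state=0)

# L1-regularized linear SVM: the penalty term is the L1 norm of w.
l1_svm = LinearSVC(penalty='l1', dual=False, C=0.1).fit(X, y)

# Most weights are exactly zero; the nonzero ones are the selected features.
nonzero = np.flatnonzero(l1_svm.coef_[0])
print("features with nonzero weight:", nonzero)
```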

Dimensionality Reduction: Principal Components Analysis
Reading: Smith, A Tutorial on Principal Components Analysis (linked to class webpage)

[Figure: 2-D data plotted on axes x1 and x2.]

[Figure: the same data with the first principal component drawn in; it gives the direction of largest variation of the data.]

[Figure: the data with both components drawn in; the second principal component gives the direction of second-largest variation.]

[Figure: rotation of axes; the data re-plotted in the coordinate system defined by the principal components.]

[Figure: dimensionality reduction; the data projected onto the first principal component only.]

[Figure: classification (+ vs. −) in the reduced-dimensionality space.]
Note: this can be used for labeled or unlabeled data.

Principal Components Analysis (PCA)
Summary:
– PCA finds new orthogonal axes in the directions of largest variation in the data.
– PCA is used to create high-level features in order to improve classification and to reduce the dimensionality of the data without much loss of information.
– It is used in machine learning, in signal processing, and in image compression (among other things).

Background for PCA
Suppose the features are A1 and A2, and we have n training examples. The x’s denote the values of A1 and the y’s denote the values of A2 over the training examples.
Variance of a feature:
Var(x) = Σ_i (x_i − mean(x))^2 / (n − 1)

Covariance of two features:
Cov(x, y) = Σ_i (x_i − mean(x)) (y_i − mean(y)) / (n − 1)
If the covariance is positive, both features tend to increase together. If it is negative, as one increases, the other decreases. If it is zero, the two features are uncorrelated (no linear relationship between them).
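The variance and covariance formulas above can be checked numerically; the feature values in this Python sketch are made-up illustrative numbers.

```python
import numpy as np

# x holds the values of feature A1, y the values of feature A2, over n examples.
x = np.array([2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1])
y = np.array([2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9])

n = len(x)
var_x  = np.sum((x - x.mean()) ** 2) / (n - 1)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)
print(var_x, cov_xy)   # both positive: x and y tend to increase together

# np.cov builds the full covariance matrix; its off-diagonal entry equals cov_xy.
print(np.cov(x, y))
```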

Covariance matrix
– Suppose we have n features, A1, ..., An.
– The covariance matrix is the n × n matrix whose (i, j) entry is Cov(Ai, Aj); its diagonal entries are the variances of the individual features.

Review of Matrix Algebra
Eigenvectors:
– Let M be an n × n matrix. A nonzero vector v is an eigenvector of M if M v = λ v for some scalar λ; λ is called the eigenvalue associated with v.
– For any eigenvector v of M and any nonzero scalar a, a v is also an eigenvector of M with the same eigenvalue.
– Thus you can always choose eigenvectors of length 1 (unit eigenvectors, with ||v|| = 1).
– If M is symmetric with real entries, it has n eigenvectors, and they are orthogonal to one another.
– Thus the eigenvectors can be used as a new basis for an n-dimensional vector space.
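A quick numerical check of these facts (illustrative, not from the slides): for a real symmetric matrix, NumPy returns real eigenvalues and orthonormal eigenvectors.

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 3.0]])            # a real symmetric matrix

eigvals, eigvecs = np.linalg.eigh(M)  # eigh is meant for symmetric/Hermitian matrices

v = eigvecs[:, 0]                     # columns are unit-length eigenvectors
print(np.allclose(M @ v, eigvals[0] * v))           # M v = lambda v
print(np.allclose(eigvecs.T @ eigvecs, np.eye(2)))  # eigenvectors are orthonormal
```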

Principal Components Analysis (PCA)
1. Given the original data set S = {x_1, ..., x_k}, produce a new data set by subtracting the mean of each feature A_i from that feature's values, so that each feature in the new data set has mean 0.

2. Calculate the covariance matrix of the mean-adjusted data.
3. Calculate the (unit) eigenvectors and eigenvalues of the covariance matrix.

The eigenvector with the largest eigenvalue traces the strongest linear pattern in the data.

4. Order the eigenvectors by eigenvalue, highest to lowest. In general, you get n components. To reduce the dimensionality to p, ignore the n − p components at the bottom of the list.

Construct a new “feature vector” from the first p eigenvectors (assuming each v_i is a column vector):
FeatureVector = (v_1, v_2, ..., v_p)

5. Derive the new data set:
TransformedData = RowFeatureVector × RowDataAdjust
where RowFeatureVector is the matrix whose rows are the chosen eigenvectors (the feature vector transposed) and RowDataAdjust is the transpose of the mean-adjusted data (one example per column).
This gives the original data in terms of the chosen components (eigenvectors), that is, along these new axes.

Intuition: we projected the data onto new axes that capture the strongest linear trends in the data set. Each transformed data point tells us how far above or below those trend lines it lies.

Reconstructing the original data
We did:
TransformedData = RowFeatureVector × RowDataAdjust
so we can do:
RowDataAdjust = RowFeatureVector^(-1) × TransformedData = RowFeatureVector^T × TransformedData
(the inverse equals the transpose here because the rows of RowFeatureVector are orthonormal), and then:
RowDataOriginal = RowDataAdjust + OriginalMean
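The whole pipeline (steps 1-5 plus this reconstruction) fits in a few lines of NumPy. This is a from-scratch sketch on made-up data, with rows as examples rather than columns, and p = 2 chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))          # 100 examples (rows), 5 features (columns)

# Steps 1-2: subtract each feature's mean, then form the covariance matrix.
mean = X.mean(axis=0)
X_adj = X - mean
C = np.cov(X_adj, rowvar=False)

# Steps 3-4: unit eigenvectors ordered by eigenvalue (largest first); keep the top p.
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
p = 2
V = eigvecs[:, order[:p]]              # the "feature vector" (v_1, ..., v_p) as columns

# Step 5: each row of X_transformed is a data point expressed along the new axes.
X_transformed = X_adj @ V              # shape (100, p)

# Reconstruction: back to the original space (lossy, since components were dropped),
# then add the original mean back.
X_reconstructed = X_transformed @ V.T + mean
print(np.mean((X - X_reconstructed) ** 2))   # small if the top p components dominate
```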

What you need to remember
General idea of what PCA does:
– Finds a new, rotated set of orthogonal axes that capture the directions of largest variation.
– Allows some axes to be dropped, so the data can be represented in a lower-dimensional space.
– This can improve classification performance and avoid overfitting due to a large number of dimensions.

Example: Linear discrimination using PCA for face recognition (“Eigenfaces”)
1. Preprocessing: “normalize” the faces:
– Make the images the same size.
– Line them up with respect to the eyes.
– Normalize the intensities.

2. The raw features are the pixel intensity values (2061 features).
3. Each image is encoded as a vector Γ_i of these features.
4. Compute the “mean” face in the training set: Ψ = (1/M) Σ_i Γ_i, where M is the number of training images.

5. Subtract the mean face from each face vector.
6. Compute the covariance matrix C.
7. Compute the (unit) eigenvectors v_i of C.
8. Keep only the first K principal components (the eigenvectors with the largest eigenvalues).
[Figure: example eigenfaces, from W. Zhao et al., Discriminant Analysis of Principal Components for Face Recognition.]

Interpreting and Using Eigenfaces
The eigenfaces encode the principal sources of variation in the dataset (e.g., absence/presence of facial hair, skin tone, glasses, etc.).
We can represent any face as a linear combination of these “basis” faces.
Use this representation for:
– Face recognition (e.g., Euclidean distance from known faces)
– Linear discrimination (e.g., “glasses” versus “no glasses”, or “male” versus “female”)
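A rough eigenfaces sketch (an illustration, not the course's implementation) using scikit-learn's PCA on the Olivetti faces dataset, which scikit-learn downloads on first use. Recognition here is plain nearest-neighbor matching in the space of eigenface coefficients; K = 50 components is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

faces = fetch_olivetti_faces()
X, y = faces.data, faces.target            # each row is a flattened 64x64 face image

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# PCA centers the data and finds the eigenfaces (the rows of pca.components_).
pca = PCA(n_components=50).fit(X_train)
X_train_pca = pca.transform(X_train)       # each face as 50 eigenface coefficients
X_test_pca = pca.transform(X_test)

# Recognize a face by the nearest known face in coefficient space.
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train_pca, y_train)
print("recognition accuracy:", knn.score(X_test_pca, y_test))
```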

Eigenfaces Demo

Kernel PCA
– PCA assumes the directions of variation are all straight lines.
– Kernel PCA implicitly maps the data to a higher-dimensional space and performs PCA there, so it can capture nonlinear structure in the original space.

Kernel PCA
Note that the covariance terms used by PCA can be expressed entirely in terms of dot products between data points.
Kernel PCA replaces each dot product x · y with a kernel function K(x, y), which corresponds to a dot product in an (implicit) higher-dimensional space.
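A minimal kernel PCA sketch with scikit-learn (illustrative only) on the classic two-concentric-circles data, which ordinary PCA cannot unfold; the RBF kernel and gamma = 10 are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)                          # linear PCA
X_kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10).fit_transform(X)

# Plotting X_pca shows the circles essentially unchanged (PCA only rotates the axes),
# while plotting X_kpca shows the two classes pulled apart.
print(X_pca.shape, X_kpca.shape)
```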

[Figure: the original data, and the same data after kernel PCA; from Wikipedia.]