Feature Selection. Readings: Jain, A.K.; Duin, R.P.W.; Mao, J., “Statistical pattern recognition: a review”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4–37, Jan. 2000 (Section 4.2). Genetic Algorithms: Chapter 7 (Duda et al.), Section 7.5. CS479/679 Pattern Recognition, Dr. George Bebis.

Feature Selection. Given a set of n features, the role of feature selection is to select a subset of d features (d < n) in order to minimize the classification error. It is fundamentally different from dimensionality reduction (e.g., PCA or LDA), which is based on feature combinations (i.e., feature extraction).

Feature Selection vs Dimensionality Reduction
Dimensionality Reduction
– When classifying novel patterns, all features need to be computed.
– The measurement units (length, weight, etc.) of the features are lost.
Feature Selection
– When classifying novel patterns, only a small number of features need to be computed (i.e., faster classification).
– The measurement units (length, weight, etc.) of the features are preserved.

Feature Selection Steps Feature selection is an optimization problem. – Step 1: Search the space of possible feature subsets. – Step 2: Pick the subset that is optimal or near-optimal with respect to some objective function.

Feature Selection Steps (cont’d)
Search strategies:
– Optimum
– Heuristic
– Randomized
Evaluation strategies:
– Filter methods
– Wrapper methods

Search Strategies. Assuming n features, an exhaustive search would require: – Examining all possible subsets of size d; there are C(n, d) = n!/(d!(n-d)!) such subsets. – Selecting the subset that performs best according to the criterion function. The number of subsets grows combinatorially, making exhaustive search impractical. In practice, heuristics are used to speed up the search, but they cannot guarantee optimality.
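
To make this combinatorial growth concrete, here is a small sketch (plain Python, with illustrative values of n and d) that counts how many subsets of size d an exhaustive search would have to examine:

```python
from math import comb

# Number of feature subsets of size d out of n features that an
# exhaustive search would need to evaluate.
for n, d in [(20, 5), (50, 10), (100, 20)]:
    print(f"n={n}, d={d}: {comb(n, d):,} subsets")
# n=20, d=5:    15,504 subsets
# n=50, d=10:   10,272,278,170 subsets
# n=100, d=20:  ~5.4e20 subsets
```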

Evaluation Strategies Filter Methods –Evaluation is independent of the classification algorithm. –The objective function evaluates feature subsets by their information content, typically interclass distance, statistical dependence or information-theoretic measures.
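
As an illustration of a filter criterion, the sketch below scores each feature with a simple Fisher-style separability measure (between-class distance over within-class scatter) on synthetic two-class data; the helper name fisher_score and the data are illustrative, not taken from the slides:

```python
import numpy as np

def fisher_score(X, y):
    """Filter criterion: between-class separation divided by within-class
    scatter, computed independently for each feature (two-class case)."""
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0) + 1e-12   # avoid division by zero
    return num / den

# Rank features by the filter score; no classifier is involved.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = (rng.random(100) > 0.5).astype(int)
X[y == 1, 3] += 2.0                                  # make feature 3 informative
print("features ranked by filter score:", np.argsort(fisher_score(X, y))[::-1])
```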

Evaluation Strategies Wrapper Methods –Evaluation uses criteria related to the classification algorithm. –The objective function is a pattern classifier, which evaluates feature subsets by their predictive accuracy (recognition rate on test data), estimated by statistical resampling or cross-validation.
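
A minimal wrapper-style evaluation, sketched here with scikit-learn (an assumption, not part of the slides): the criterion is the cross-validated accuracy of a classifier trained on the candidate subset. The helper wrapper_score is hypothetical and is reused by the later sketches:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def wrapper_score(X, y, subset, clf=None):
    """Wrapper criterion: estimated recognition rate of a classifier
    trained on the candidate feature subset (via cross-validation)."""
    clf = clf or KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(clf, X[:, list(subset)], y, cv=3).mean()

# Compare two candidate subsets using the same classifier.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = (X[:, 2] + 0.5 * X[:, 5] > 0).astype(int)
print(wrapper_score(X, y, [2, 5]), wrapper_score(X, y, [0, 1]))
```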

Filter vs Wrapper Approaches

Filter vs Wrapper Approaches (cont’d)

Naïve Search. Sort the given n features in order of their probability of correct recognition. Select the top d features from this sorted list. Disadvantages: – Feature correlation is not considered. – The best pair of features may not even contain the best individual feature.
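
A minimal sketch of this naïve ranking, reusing the hypothetical wrapper_score, X, and y from the earlier wrapper sketch:

```python
# Naive search: score each feature individually, then keep the top d.
d = 3
scores = [wrapper_score(X, y, [j]) for j in range(X.shape[1])]
top_d = sorted(range(X.shape[1]), key=lambda j: scores[j], reverse=True)[:d]
print("naively selected features:", top_d)
# Caveat: feature correlation is ignored; the best pair of features need
# not contain the best individual feature.
```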

Sequential forward selection (SFS) (heuristic search). First, the best single feature is selected (i.e., using some criterion function). Then, pairs of features are formed using one of the remaining features and this best feature, and the best pair is selected. Next, triplets of features are formed using one of the remaining features and these two best features, and the best triplet is selected. This procedure continues until a predefined number of features are selected. SFS performs best when the optimal subset is small.
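
A possible SFS sketch, again reusing the hypothetical wrapper_score criterion from the wrapper example:

```python
def sfs(X, y, d, score=wrapper_score):
    """Sequential forward selection: greedily add the feature that most
    improves the criterion until d features have been selected."""
    selected, remaining = [], set(range(X.shape[1]))
    while len(selected) < d:
        best_j = max(remaining, key=lambda j: score(X, y, selected + [j]))
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

print("SFS subset:", sfs(X, y, 3))
```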

Example. Results of sequential forward feature selection for classification of a satellite image using 28 features. The x-axis shows the classification accuracy (%) and the y-axis shows the features added at each iteration (the first iteration is at the bottom). The highest accuracy value is shown with a star.

Sequential backward selection (SBS) (heuristic search). First, the criterion function is computed for all n features. Then, each feature is deleted one at a time, the criterion function is computed for all subsets with n-1 features, and the worst feature is discarded. Next, each feature among the remaining n-1 is deleted one at a time, and the worst feature is discarded to form a subset with n-2 features. This procedure continues until a predefined number of features are left. SBS performs best when the optimal subset is large.
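
The corresponding SBS sketch, under the same assumptions:

```python
def sbs(X, y, d, score=wrapper_score):
    """Sequential backward selection: start from all features and greedily
    discard the feature whose removal hurts the criterion the least."""
    selected = list(range(X.shape[1]))
    while len(selected) > d:
        worst_j = max(selected,
                      key=lambda j: score(X, y, [k for k in selected if k != j]))
        selected.remove(worst_j)
    return selected

print("SBS subset:", sbs(X, y, 3))
```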

Example. Results of sequential backward feature selection for classification of a satellite image using 28 features. The x-axis shows the classification accuracy (%) and the y-axis shows the features removed at each iteration (the first iteration is at the top). The highest accuracy value is shown with a star.

Bidirectional Search (BDS) BDS applies SFS and SBS simultaneously: – SFS is performed from the empty set – SBS is performed from the full set To guarantee that SFS and SBS converge to the same solution: – Features already selected by SFS are not removed by SBS – Features already removed by SBS are not added by SFS

SFS and SBS The main disadvantage of SFS is that it is unable to remove features that become obsolete after the addition of other features. The main limitation of SBS is its inability to reevaluate the usefulness of a feature after it has been discarded.

“Plus-L, minus-R” selection (LRS). A generalization of SFS and SBS. – If L > R, LRS starts from the empty set and repeatedly adds L features, then removes R features. – If L < R, LRS starts from the full set and repeatedly removes R features, then adds L features. Its main limitation is the lack of a theory to help predict the optimal values of L and R.
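
A rough sketch of LRS for the L > R case, with the same hypothetical wrapper_score criterion (the L < R case mirrors it, starting from the full set):

```python
def lrs(X, y, d, L=2, R=1, score=wrapper_score):
    """Plus-L, minus-R selection sketched for L > R: alternately perform
    L forward (SFS) steps and R backward (SBS) steps until d features remain."""
    selected, n = [], X.shape[1]
    while len(selected) != d:
        for _ in range(L):                        # plus-L: add features
            if len(selected) in (n, d):
                break
            remaining = [j for j in range(n) if j not in selected]
            selected.append(max(remaining, key=lambda j: score(X, y, selected + [j])))
        for _ in range(R):                        # minus-R: remove features
            if len(selected) in (1, d):
                break
            worst = max(selected,
                        key=lambda j: score(X, y, [k for k in selected if k != j]))
            selected.remove(worst)
    return selected

print("LRS subset:", lrs(X, y, 3))
```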

Sequential floating selection (SFFS and SFBS) An extension to LRS with flexible backtracking capabilities – Rather than fixing the values of L and R, floating methods determine these values from the data. – The dimensionality of the subset during the search can be thought to be “floating” up and down There are two floating methods: – Sequential floating forward selection (SFFS) – Sequential floating backward selection (SFBS) P. Pudil, J. Novovicova, J. Kittler, Floating search methods in feature selection, Pattern Recognition Lett. 15 (1994) 1119–1125.

Sequential floating forward selection (SFFS) Sequential floating forward selection (SFFS) starts from the empty set. After each forward step, SFFS performs backward steps as long as the objective function increases.
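
A simplified SFFS sketch under the same assumptions; a full implementation also keeps the best subset found at each size (which guarantees termination), so a simple step cap is used here instead:

```python
def sffs(X, y, d, score=wrapper_score, max_steps=100):
    """Sequential floating forward selection (sketch): after each forward
    step, perform backward steps as long as the criterion improves."""
    selected, steps = [], 0
    while len(selected) < d and steps < max_steps:
        steps += 1
        # forward (SFS) step: add the single best feature
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        selected.append(max(remaining, key=lambda j: score(X, y, selected + [j])))
        # conditional backward steps (never remove the feature just added)
        while len(selected) > 2:
            current = score(X, y, selected)
            worst = max(selected[:-1],
                        key=lambda j: score(X, y, [k for k in selected if k != j]))
            if score(X, y, [k for k in selected if k != worst]) <= current:
                break
            selected.remove(worst)
    return selected

print("SFFS subset:", sffs(X, y, 3))
```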

Sequential floating backward selection (SFBS) Sequential floating backward selection (SFBS) starts from the full set. After each backward step, SFBS performs forward steps as long as the objective function increases. SFBS algorithm is analogous to SFFS.

Feature Selection using Genetic Algorithms (GAs) (randomized search). Pipeline: pre-processing → feature extraction → feature selection (GA) → feature subset → classifier. GAs provide a simple, general, and powerful framework for feature selection.

Genetic Algorithms (GAs) What are GAs? – An optimization technique for searching very large spaces. – Inspired by the biological mechanisms of natural selection and reproduction. Main characteristics of GAs – Global optimization technique. – Searches probabilistically using a population of structures, each properly encoded as a string of symbols. – Uses an objective function, not derivatives.

Encoding. Encoding scheme – Represent solutions in parameter space as finite-length strings (chromosomes) over some finite set of symbols.

Fitness Evaluation. Fitness function – Evaluates the goodness of a solution.

Searching. (Diagram: the current generation is transformed into the next generation through selection, crossover, and mutation.)

Selection. Probabilistically filters out solutions that perform poorly, choosing high-performance (high-fitness) solutions to exploit.

Crossover and Mutation. These operators explore new solutions. Crossover: information exchange between strings. Mutation: restores lost genetic material (e.g., by flipping a bit).
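
A small sketch of these two operators on binary chromosomes (standard single-point crossover and bit-flip mutation; the function names are illustrative):

```python
import random

def crossover(parent1, parent2):
    """Single-point crossover: exchange the tail segments of two bit strings."""
    point = random.randrange(1, len(parent1))
    return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]

def mutate(chromosome, rate=0.05):
    """Bit-flip mutation: flip each bit with a small probability."""
    return [bit ^ 1 if random.random() < rate else bit for bit in chromosome]

p1, p2 = [1, 1, 1, 1, 1, 1], [0, 0, 0, 0, 0, 0]
c1, c2 = crossover(p1, p2)
print(c1, c2, mutate(c1, rate=0.2))
```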

Feature Selection Using GAs. Binary encoding: each chromosome is a bit string of length N (Feature #1 … Feature #N), where each bit indicates whether the corresponding feature is included. Fitness evaluation (to be maximized): fitness = 10^4 × accuracy + #zeros, where accuracy is the classification accuracy using a validation set and #zeros is the number of features excluded.
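
Putting the pieces together, a compact sketch of a GA wrapper for feature selection. It reuses the crossover/mutate helpers and the hypothetical wrapper_score criterion from the earlier sketches; the 10^4 weighting of accuracy over subset size follows the fitness shown on this slide, while the population size and generation count are illustrative:

```python
import random

def fitness(chromosome, X, y):
    """fitness = 10^4 * validation accuracy + number of excluded features."""
    subset = [j for j, bit in enumerate(chromosome) if bit == 1]
    if not subset:
        return 0.0
    return 1e4 * wrapper_score(X, y, subset) + chromosome.count(0)

def ga_feature_selection(X, y, pop_size=20, generations=30):
    n = X.shape[1]
    population = [[random.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(c, X, y) for c in population]
        # fitness-proportional selection of parents
        parents = random.choices(population, weights=scores, k=pop_size)
        # produce the next generation via crossover and mutation
        next_pop = []
        for i in range(0, pop_size, 2):
            c1, c2 = crossover(parents[i], parents[i + 1])
            next_pop += [mutate(c1), mutate(c2)]
        population = next_pop
    best = max(population, key=lambda c: fitness(c, X, y))
    return [j for j, bit in enumerate(best) if bit == 1]

print("GA-selected features:", ga_feature_selection(X, y))
```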

Case Study 1: Gender Classification. Determine the gender of a subject from facial images. – Challenges: race, age, facial expression, hair style, etc. Z. Sun, G. Bebis, X. Yuan, and S. Louis, "Genetic Feature Subset Selection for Gender Classification: A Comparison Study", IEEE Workshop on Applications of Computer Vision, Orlando, December 2002.

Feature Extraction Using PCA. PCA maps the data to a lower-dimensional space using a linear transformation. The columns of the projection matrix are the “best” eigenvectors (i.e., eigenfaces) of the covariance matrix of the data, i.e., those corresponding to the largest eigenvalues.
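
A brief sketch of this projection step (computed via the SVD of the centered data, whose right singular vectors are the eigenvectors of the covariance matrix); the image dimensions and data are illustrative:

```python
import numpy as np

def pca_project(X, num_components):
    """Project data onto the top eigenvectors (eigenfaces) of its covariance matrix."""
    mean = X.mean(axis=0)
    Xc = X - mean
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)  # rows of Vt = eigenvectors
    W = Vt[:num_components].T                          # projection matrix (d x k)
    return Xc @ W, W, mean

rng = np.random.default_rng(0)
faces = rng.normal(size=(200, 1024))     # e.g., 200 face images of 32x32 pixels, flattened
coeffs, W, mean = pca_project(faces, num_components=30)
print(coeffs.shape)                      # (200, 30)
```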

Which eigenvectors encode mostly gender information? (Figure: eigenfaces EV#1–EV#6, EV#8, EV#10, EV#12, EV#14, EV#19, EV#20.) IDEA: Use GAs to search the space of eigenvectors!

Dataset 400 frontal images from 400 different people –200 male, 200 female –Different races, lighting conditions, and facial expressions Images were registered and normalized –No hair information –Normalized to account for different lighting conditions

Experiments Classifiers –LDA –Bayes classifier –Neural Networks (NNs) –Support Vector Machines (SVMs) Three-fold cross validation –Training set: 75% of the data –Validation set: 12.5% of the data –Test set: 12.5% of the data Comparison with SFBS

Error Rates. ERM: error rate using the top eigenvectors. ERG: error rate using GA-selected eigenvectors. (Chart comparing ERM and ERG for each classifier; reported error rates range from about 4.7% to 22.4%.)

Ratio of Features / Information Kept. RN: percentage of eigenvectors in the selected feature subset. RI: percentage of information contained in the selected eigenvector subset. (Chart of RN and RI for each classifier; reported values range from 8.4% to 69.0%.)

Histograms of Selected Eigenvectors: (a) LDA, (b) Bayes, (c) NN, (d) SVMs.

Reconstructed Images. Reconstructed faces using GA-selected EVs do not contain information about identity but disclose strong gender information! (Figure rows: original images; reconstructions using the top 30 EVs; using EVs selected by SVM+GA; using EVs selected by NN+GA.)

Comparison with SFBS. (Figure rows: original images; reconstructions using the top 30 EVs; using EVs selected by SVM+GA; using EVs selected by SVM+SFBS.)

Case Study 2: Vehicle Detection. Rear views captured by a low-light camera; the non-vehicle class is much larger than the vehicle class. Z. Sun, G. Bebis, and R. Miller, "Object Detection Using Feature Subset Selection", Pattern Recognition, vol. 37. (Ford Motor Company.)

Which eigenvectors encode the most important vehicle features?

Experiments. Training data set (collected in Fall 2001): 2102 images (1051 vehicles and 1051 non-vehicles). Test data sets (collected in Summer 2001): 231 images (vehicles and non-vehicles). SVM for classification. Three-fold cross-validation. Comparison with SFBS.

Error Rate 6.49%

Histograms of Selected Eigenvectors (SFBS-SVM vs GA-SVM). Number of eigenvectors selected by SFBS: 87 (43.5% of the information). Number of eigenvectors selected by the GA: 46 (23% of the information).

Vehicle Detection. Reconstructed images using the selected feature subsets (figure rows: original; top 50 EVs; EVs selected by SFBS; EVs selected by GAs). Lighting differences have been disregarded by the GA approach.

Case Study 3: Fusion of Visible-Thermal IR Imagery for Face Recognition. G. Bebis, A. Gyaourova, S. Singh, and I. Pavlidis, "Face Recognition by Fusing Thermal Infrared and Visible Imagery", Image and Vision Computing, vol. 24, no. 7, 2006.

Visible vs Thermal IR Visible spectrum –High resolution, less sensitive to the presence of eyeglasses. –Sensitive to changes in illumination and facial expression. Thermal IR spectrum –Robust to illumination changes and facial expressions. –Low resolution, sensitive to air currents, face heat patterns, aging, and the presence of eyeglasses (i.e., glass is opaque to thermal IR).

How should we fuse information from visible and thermal IR? (Pipeline: feature extraction → fusion using genetic algorithms → image reconstruction → fused image.) We have tested two feature extraction methods: PCA and wavelets.

Dataset (Equinox database). Co-registered thermal infrared and visible images. Variations among images: – Eyeglasses – Illumination (front, left, right) – Facial expression (smile, frown, surprise, and vowels).

Eyeglasses/Illumination Tests. (Diagram of the data partitions by glasses vs. no glasses and frontal vs. lateral illumination: EG, ELG, EFG, EnG, ELnG, EFnG.)

Experiments. (Table pairing database subsets with test subsets over EG, ELG, EFG, EnG, ELnG, EFnG, defining the eyeglasses tests (glasses vs. no glasses) and the illumination tests.)

Results

Facial Expression Tests. Smile, frown, and surprise data: EA, EF (frontal illumination), EL (lateral illumination). Speaking-vowels data: VA, VF (frontal illumination), VL (lateral illumination).

Results

Overall Results