Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi.


1 Feature Selection Methods Part-I By: Dr. Rajeev Srivastava IIT(BHU), Varanasi

2 Introduction A feature is a function of one or more measurements, each of which specifies some quantifiable property of an image, computed so that it quantifies some significant characteristic of the object. Feature selection is the process of selecting a subset of relevant features for use in model construction. The features removed should be useless, redundant, or of the least possible use. The goal of feature selection is to find the subset of features that produces the best target detection and recognition performance while requiring the least computational effort.

3 Reasons for Feature Selection Feature selection is important to target detection and recognition systems for three main reasons: First, using more features increases system complexity, yet it does not always lead to higher detection/recognition accuracy. Many features may be available to a detection/recognition system; these features are not independent and may be correlated, and a bad feature can greatly degrade the performance of the system. Thus, selecting a subset of good features is important. Second, selecting many features means a complicated model is being used to approximate the training data. According to the minimum description length principle (MDLP), a simple model is better than a complex model. Third, using fewer features reduces the computational cost, which is important for real-time applications; it may also lead to better classification accuracy due to the finite-sample-size effect. Feature selection techniques provide three main benefits when constructing predictive models: improved model interpretability, shorter computation times, and enhanced generalisation through dimensionality reduction.

4 Advantages of Feature Selection It reduces the dimensionality of the feature space, limiting storage requirements and increasing algorithm speed. It removes redundant, irrelevant, or noisy data. The immediate effects for data analysis tasks are speeding up the running time of the learning algorithms, improving data quality, and increasing the accuracy of the resulting model. Further benefits include feature-set reduction, to save resources in the next round of data collection or during utilization; performance improvement, to gain predictive accuracy; and data understanding, to gain knowledge about the process that generated the data or simply to visualize it.

5 Taxonomy of Feature Selection [Figure: taxonomy diagram of feature selection methods in statistical pattern recognition; the annotated branch covers deterministic methods, which produce the same subset on a given problem every time.]

6 Feature Selection Approaches There are two approaches to feature selection: Forward Selection: start with no variables and add them one by one, at each step adding the one that decreases the error the most, until any further addition does not significantly decrease the error. Backward Selection: start with all the variables and remove them one by one, at each step removing the one that decreases the error the most (or increases it only slightly), until any further removal increases the error significantly. To reduce overfitting, the error referred to above is the error on a validation set that is distinct from the training set.
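
The forward procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a production routine: validation_error is a made-up stand-in for the error on a held-out validation set, in which features 0 and 2 are hypothetically the informative ones.

```python
def validation_error(subset):
    # Stand-in for a validation-set error: each (hypothetically) informative
    # feature lowers the error, each noise feature raises it slightly.
    informative = {0, 2}
    return (1.0 - 0.4 * len(informative & set(subset))
            + 0.05 * len(set(subset) - informative))

def forward_selection(n_features, tol=1e-6):
    selected, best_err = [], validation_error([])
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        # Try adding each remaining feature; keep the one that helps most.
        err, f = min((validation_error(selected + [f]), f) for f in candidates)
        if err < best_err - tol:   # stop once no addition reduces the error
            selected.append(f)
            best_err = err
        else:
            break
    return selected, best_err

print(forward_selection(5))   # selects [0, 2], then stops
```

Backward selection is the mirror image: start from the full set and remove the feature whose removal decreases (or least increases) the validation error, stopping when every removal hurts significantly.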

7 Schemes for Feature Selection The relationship between an FSA (feature selection algorithm) and the inducer chosen to evaluate the usefulness of the feature selection process can take three main forms: Filter Methods: these methods select features based on discriminating criteria that are relatively independent of the classifier. The minimum redundancy–maximum relevance (mRMR) method is an example of a filter method: it supplements the maximum-relevance criterion with a minimum-redundancy criterion, choosing additional features that are maximally dissimilar to those already identified.

8 Wrapper Methods: These methods evaluate candidate feature subsets by the predictive performance of the classifier itself, using the learning machine of interest as a black box to score each subset. Embedded Methods: The inducer has its own FSA (either explicit or implicit). Methods that induce logical conjunctions provide an example of this embedding; other traditional machine learning tools such as decision trees or artificial neural networks also fall into this scheme.

9 Filters vs Wrappers
Filters:
Fast execution (+): filters generally involve a non-iterative computation on the dataset, which can execute much faster than a classifier training session.
Generality (+): since filters evaluate the intrinsic properties of the data, rather than their interactions with a particular classifier, their results exhibit more generality: the solution will be "good" for a larger family of classifiers.
Tendency to select large subsets (-): since filter objective functions are generally monotonic, the filter tends to select the full feature set as the optimal solution, forcing the user to choose an arbitrary cutoff on the number of features to be selected.
Wrappers:
Accuracy (+): wrappers generally achieve better recognition rates than filters, since they are tuned to the specific interactions between the classifier and the dataset.
Ability to generalize (+): wrappers have a mechanism to avoid overfitting, since they typically use cross-validation measures of predictive accuracy.
Slow execution (-): since the wrapper must train a classifier for each feature subset (or several classifiers if cross-validation is used), the method can become infeasible for computationally intensive classifiers.
Lack of generality (-): the solution lacks generality, since it is tied to the bias of the classifier used in the evaluation function; the "optimal" feature subset will be specific to the classifier under consideration.
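
A wrapper score in miniature: the sketch below (an illustrative assumption, not from the slides) rates a candidate subset by the cross-validated accuracy of a 1-nearest-neighbour classifier, so the score is tied to that particular classifier, exactly as the "lack of generality" point warns.

```python
import random

def one_nn_predict(train, sample, subset):
    # 1-NN classification using only the features listed in `subset`.
    dist = lambda item: sum((item[0][f] - sample[f]) ** 2 for f in subset)
    return min(train, key=dist)[1]

def cv_accuracy(data, subset, folds=3):
    # Wrapper objective: cross-validated accuracy of the classifier on `subset`.
    correct = 0
    for i in range(folds):
        test = data[i::folds]
        train = [d for j, d in enumerate(data) if j % folds != i]
        correct += sum(one_nn_predict(train, x, subset) == y for x, y in test)
    return correct / len(data)

# Toy data: feature 0 separates the two classes, feature 1 is pure noise.
random.seed(0)
data = [([y + random.gauss(0, 0.2), random.gauss(0, 1)], y)
        for y in [0, 1] * 20]

print(cv_accuracy(data, [0]))   # high: informative feature
print(cv_accuracy(data, [1]))   # near chance: noise feature
```

Note the cost: every subset evaluated means `folds` classifier runs, which is why wrappers scale poorly compared with filters.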

10 Naïve method and sequential methods
Sequential methods: begin with a single solution (feature subset) and iteratively add or remove features until some termination criterion is met. Bottom-up (forward method): begin with an empty set and add features. Top-down (backward method): begin with a full set and delete features. These "greedy" methods do not examine all possible subsets, so there is no guarantee of finding the optimal subset.
Naïve method: sort the given d features in order of their probability of correct recognition, then select the top m features from this sorted list. Disadvantage: feature correlation is not considered; the best pair of features may not even contain the best individual feature.
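
The naïve method reduces to a sort. In this sketch the scores are made-up numbers standing in for each feature's individual probability of correct recognition; note that nothing here accounts for correlation between features.

```python
def select_top_m(scores, m):
    # scores[i] = prob. of correct recognition using feature i alone
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:m]

accuracies = [0.62, 0.91, 0.55, 0.87, 0.70]
print(select_top_m(accuracies, 2))   # → [1, 3]
```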

11 Sequential Forward Selection (SFS)
1. Start with the empty set Y0 = {∅}
2. Select the next best feature x+ (the one that most improves the selection criterion)
3. Update Yk+1 = Yk + x+; k = k + 1
4. Go to 2
SFS performs best when the optimal subset is small. When the search is near the empty set, a large number of states can potentially be evaluated; towards the full set, the region examined by SFS is narrower, since most features have already been selected. The search space is drawn like an ellipse to emphasize that there are fewer states towards the full or empty sets. Disadvantage: once a feature is retained, it cannot be discarded (the nesting problem).
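
The four steps above map directly onto a short loop. J is a hypothetical criterion function invented for the example; it is chosen so that features 1 and 3 are jointly strong while feature 0 is only strong alone, which exposes the nesting problem.

```python
def sfs(n_features, k, J):
    Y = []                                    # step 1: Y0 = {}
    while len(Y) < k:
        # step 2: the next best feature is the one that maximizes J
        best = max((f for f in range(n_features) if f not in Y),
                   key=lambda f: J(Y + [f]))
        Y.append(best)                        # step 3: Yk+1 = Yk + x+
    return Y

# Hypothetical criterion: features 1 and 3 are jointly strong, 0 only alone.
J = lambda S: sum({0: 2.0, 1: 1.5, 3: 1.5}.get(f, 0.1) for f in S) \
    + (1.0 if {1, 3} <= set(S) else 0.0)

print(sfs(5, 2, J))   # → [0, 1]; the better pair {1, 3} is never reached
```

Once feature 0 is retained it cannot be discarded, so SFS settles on {0, 1} even though J({1, 3}) is higher: the nesting problem in action.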

12 Sequential Backward Selection (SBS)
1. Start with the full set Y0 = Y (all features)
2. Remove the worst feature x−
3. Update Yk+1 = Yk − x−; k = k + 1
4. Go to 2
Sequential backward elimination works in the opposite direction of SFS. SBS works best when the optimal feature subset is large, since SBS spends most of its time visiting large subsets. The main limitation of SBS is its inability to re-evaluate the usefulness of a feature after it has been discarded.
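
The mirror-image loop, with the same style of hypothetical criterion J: the worst feature is the one whose removal leaves the criterion highest.

```python
def sbs(n_features, k, J):
    Y = list(range(n_features))                # step 1: Y0 = full feature set
    while len(Y) > k:
        # step 2: the worst feature is the one whose removal keeps J highest
        worst = max(Y, key=lambda f: J([g for g in Y if g != f]))
        Y.remove(worst)                        # step 3: Yk+1 = Yk - x-
    return Y

J = lambda S: sum({0: 2.0, 1: 1.5, 3: 1.5}.get(f, 0.1) for f in S)
print(sbs(5, 2, J))   # → [0, 3]
```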

13 Generalized Sequential Selection
Generalized sequential forward selection: start with the empty set, X = ∅; repeatedly add the most significant m-subset of (Y − X), found through exhaustive search.
Generalized sequential backward selection: start with the full set, X = Y; repeatedly delete the least significant m-subset of X, found through exhaustive search.
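
The exhaustive m-subset step can be written with itertools.combinations. J is again a hypothetical criterion; because the pair {1, 3} is evaluated as a unit, the generalized step finds it directly, where one-at-a-time SFS would be trapped by feature 0.

```python
from itertools import combinations

def best_m_subset_to_add(X, Y, m, J):
    # Exhaustively score every m-subset of (Y - X) joined onto X.
    return max(combinations(sorted(set(Y) - set(X)), m),
               key=lambda S: J(list(X) + list(S)))

# Hypothetical criterion where the pair {1, 3} is jointly the strongest.
J = lambda F: sum({0: 2.0, 1: 1.5, 3: 1.5}.get(f, 0.1) for f in F) \
    + (1.0 if {1, 3} <= set(F) else 0.0)

print(best_m_subset_to_add([], range(5), 2, J))   # → (1, 3)
```

The price of this power is combinatorial: each step scores C(|Y − X|, m) candidate subsets.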

14 Bidirectional Search (BDS)
BDS is a parallel implementation of SFS and SBS: SFS is performed from the empty set, SBS from the full set. To guarantee that SFS and SBS converge to the same solution: features already selected by SFS are not removed by SBS, and features already removed by SBS are not selected by SFS.
1. Start SFS with YF = {∅}
2. Start SBS with YB = Y (the full set)
3. Select the best feature x+; update YF(k+1) = YFk + x+; k = k + 1
4. Remove the worst feature x−; update YB(k+1) = YBk − x−; k = k + 1
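
A sketch of BDS with the two consistency constraints made explicit; J is the same style of hypothetical criterion used for SFS/SBS. The loop alternates a forward step on YF with a backward step on YB until the two sets meet.

```python
def bds(n_features, J):
    YF, YB = [], list(range(n_features))      # SFS set (empty), SBS set (full)
    while len(YF) < len(YB):
        # SFS step: may only add features that SBS has not already removed.
        add = max((f for f in YB if f not in YF), key=lambda f: J(YF + [f]))
        YF.append(add)
        if len(YF) == len(YB):
            break
        # SBS step: may only remove features that SFS has not already selected.
        removable = [f for f in YB if f not in YF]
        worst = max(removable, key=lambda f: J([g for g in YB if g != f]))
        YB.remove(worst)
    return YF

J = lambda S: sum({0: 2.0, 1: 1.5, 3: 1.5}.get(f, 0.1) for f in S)
print(bds(5, J))   # → [0, 1, 3]
```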

15 Sequential Floating Selection (SFFS & SFBS)
There are two floating methods. Sequential floating forward selection (SFFS) starts from the empty set; after each forward step, SFFS performs backward steps as long as the objective function increases. Sequential floating backward selection (SFBS) starts from the full set; after each backward step, SFBS performs forward steps as long as the objective function increases.
SFFS algorithm (J(·) = criterion function):
1. Y0 = {∅}
2. Select the best feature x+; update Yk+1 = Yk + x+; k = k + 1
3. Select the worst feature x− in Yk
4. If J(Yk − x−) > J(Yk), then Yk+1 = Yk − x−; k = k + 1; go to step 3. Else go to step 2.
(Some book-keeping is needed to avoid an infinite loop.)
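
An SFFS sketch, with the book-keeping done by remembering the best criterion value seen at each subset size (a backward step is accepted only if it beats the best subset of that smaller size, which prevents add/remove cycles). The criterion here is a hypothetical lookup table in which the pair {1, 2} outscores any pair containing feature 0, so the floating step visibly pays off.

```python
def sffs(n_features, k, J):
    Y, best = [], {}                 # best[m] = best J seen for a size-m subset
    while len(Y) < k:
        # Forward step: add the best remaining feature.
        f = max((g for g in range(n_features) if g not in Y),
                key=lambda g: J(Y + [g]))
        Y = Y + [f]
        best[len(Y)] = max(best.get(len(Y), float('-inf')), J(Y))
        # Floating backward steps: drop a feature while the result beats the
        # best subset already seen at that size (the anti-loop book-keeping).
        while len(Y) > 2:
            w = max(Y, key=lambda g: J([h for h in Y if h != g]))
            smaller = [h for h in Y if h != w]
            if J(smaller) > best.get(len(smaller), float('-inf')):
                Y = smaller
                best[len(Y)] = J(Y)
            else:
                break
    return Y

# Hypothetical criterion values; unlisted subsets score 0.
scores = {
    frozenset({0}): 3.0, frozenset({1}): 2.0, frozenset({2}): 2.0,
    frozenset({0, 1}): 4.0, frozenset({0, 2}): 3.5, frozenset({1, 2}): 5.0,
    frozenset({0, 1, 2}): 5.5,
}
J = lambda S: scores.get(frozenset(S), 0.0)

print(sffs(4, 3, J))   # → [1, 2, 0]
```

Plain SFS would commit to {0, 1} at size two; the floating backward step replaces it with the stronger pair {1, 2} before continuing, which is exactly what "floating" buys over SFS.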

16 Genetic Algorithm Feature Selection In a GA approach, a given feature subset is represented as a binary string (a "chromosome") of length n, with a zero or one in position i denoting the absence or presence of feature i in the set (n = total number of available features). A population of chromosomes is maintained. Each chromosome is evaluated by an evaluation function to determine its "fitness", which determines how likely the chromosome is to survive and breed into the next generation. New chromosomes are created from old chromosomes by two processes: crossover, where parts of two different parent chromosomes are mixed to create offspring, and mutation, where the bits of a single parent are randomly perturbed to create a child. Choosing an appropriate evaluation function is an essential step for successful application of GAs to any problem domain.
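
A toy GA for feature selection using the bit-string encoding above. Everything below is an illustrative assumption: the fitness function pretends features 1, 4, 6 are informative and penalises subset size, and the population size, generation count, and mutation rate are arbitrary choices.

```python
import random

N = 8                                     # total number of available features
GOOD = {1, 4, 6}                          # hypothetical informative features

def fitness(chrom):
    # Stand-in evaluation function: reward informative bits, penalise size.
    chosen = {i for i, bit in enumerate(chrom) if bit}
    return 2.0 * len(chosen & GOOD) - 0.25 * len(chosen)

def crossover(a, b):
    cut = random.randrange(1, N)          # single-point crossover
    return a[:cut] + b[cut:]

def mutate(chrom, rate=0.05):
    # Flip each bit independently with probability `rate`.
    return [bit ^ (random.random() < rate) for bit in chrom]

random.seed(1)
pop = [[random.randint(0, 1) for _ in range(N)] for _ in range(30)]
for _ in range(60):
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]                    # truncation selection with elitism
    pop = parents + [mutate(crossover(*random.sample(parents, 2)))
                     for _ in range(20)]

best = max(pop, key=fitness)
print([i for i, bit in enumerate(best) if bit])
```

Because the elite parents survive unchanged, the best fitness never decreases from one generation to the next; the chosen bits drift towards the informative features.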

17 Minimum Redundancy Maximum Relevance Feature Selection This approach is based on recognizing that combinations of individually good variables do not necessarily lead to good classification. To maximize the joint dependency of the top-ranking variables on the target variable, the redundancy among them must be reduced, so we select maximally relevant variables while avoiding redundant ones. First, the mutual information (MI) between the candidate variable and the target variable is calculated (the relevance term). Then the average MI between the candidate variable and the variables already selected is computed (the redundancy term). The mRMR score (the higher it is for a feature, the more that feature is needed) is obtained by subtracting the redundancy from the relevance. Both relevance and redundancy estimation are low-dimensional problems (each involves only two variables); this is much easier than directly estimating a multivariate density or mutual information in a high-dimensional space. A limitation is that mRMR only measures the quantity of redundancy between the candidate and selected variables, but does not deal with the type of this redundancy.
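
The relevance-minus-redundancy score can be computed directly for discrete variables, estimating mutual information from co-occurrence counts. The data below are a made-up toy: the target is f1 OR f4, and f2 merely duplicates f1.

```python
from collections import Counter
from math import log2

def mutual_info(xs, ys):
    # Plug-in MI estimate (in bits) from co-occurrence counts.
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mrmr_score(candidate, target, selected):
    relevance = mutual_info(candidate, target)            # I(candidate; target)
    redundancy = (sum(mutual_info(candidate, s) for s in selected)
                  / len(selected)) if selected else 0.0   # mean I(candidate; s)
    return relevance - redundancy

# Toy data: the target is f1 OR f4; f2 merely duplicates f1.
f1 = [0, 0, 0, 0, 1, 1, 1, 1]
f4 = [0, 0, 1, 1, 0, 0, 1, 1]
f2 = list(f1)
target = [int(a or b) for a, b in zip(f1, f4)]

# With f1 already selected, the duplicate f2 scores below the independent f4,
# even though both are equally relevant to the target on their own.
print(mrmr_score(f2, target, [f1]))
print(mrmr_score(f4, target, [f1]))
```

Each MI term involves only two variables, matching the slide's point that mRMR avoids estimating anything in the full high-dimensional joint space.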

18 References
L. Ladha, "Feature Selection Methods and Algorithms", Research Scholar, Department of Computer Science, Sri Ramakrishna College of Arts and Science for Women, Coimbatore, Tamil Nadu, India.
Anil Jain and Douglas Zongker, "Feature Selection: Evaluation, Application and Small Sample Performance", Michigan State University, USA.
Olcay Kurşun, C. Okan Şakar, and Oleg Favorov, "Using Covariates for Improving the Minimum Redundancy Maximum Relevance Feature Selection Method".

