
1 Feature Selection Poonam Buch

2 The Problem  The success of machine learning algorithms usually depends on the quality of the data they operate on.  If the data are inadequate, very high-dimensional, or contain extraneous and irrelevant information, machine learning algorithms produce less accurate and less understandable results.

3 Motivation  Our primary focus is on obtaining the best overall classification performance, regardless of the number of features needed to achieve it.  Feature selection can result in enhanced performance, a smaller hypothesis space, and reduced storage requirements.

4  Feature selection is necessary to make the learning task more efficient and more accurate.  Well-chosen features conserve computation.

5 Methods of Feature Selection  Wrapper method: estimates the accuracy of candidate feature subsets by repeatedly re-sampling and running the actual induction algorithm on each subset.  Filter method: eliminates undesirable features from the data before induction begins, using the entire training set during selection. It is faster than the wrapper method.
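To make the distinction concrete, here is a minimal sketch contrasting the two approaches, assuming scikit-learn is available; the synthetic data, the mutual-information filter, and the RFE wrapper are illustrative choices, not the setup used in the paper.

# Filter vs. wrapper feature selection on synthetic data (illustrative only).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=100, n_informative=10, random_state=0)

# Filter: score each feature independently of any classifier, then keep the top k.
filter_selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_filter = filter_selector.fit_transform(X, y)

# Wrapper: repeatedly fit the actual induction algorithm to rank feature subsets.
wrapper_selector = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=10)
X_wrapper = wrapper_selector.fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape)  # both reduce 100 features to 10

The wrapper pays for its classifier-in-the-loop search with a much higher computational cost, which is why the slide notes that the filter method is faster.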

6 Metrics of the filter method of feature selection:  Information Gain (IG): chooses the attribute that best splits the training instances into subsets corresponding to the values of that attribute, i.e., the attribute that most reduces the entropy of the class label.  Bi-normal separation (BNS): defined as F⁻¹(tpr) − F⁻¹(fpr), where F⁻¹ is the standard normal distribution's inverse cumulative distribution function, tpr is the true positive rate, and fpr is the false positive rate.
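As a worked illustration of these two metrics, the sketch below scores a single word from hypothetical document counts; it assumes NumPy and SciPy, and the counts (tp, fp, pos, neg) are made up for the example, not taken from the paper.

import numpy as np
from scipy.stats import norm

def bns(tp, fp, pos, neg, eps=0.0005):
    # Bi-normal separation: F^-1(tpr) - F^-1(fpr), with both rates clipped away
    # from 0 and 1 so the inverse normal CDF (norm.ppf) stays finite.
    tpr = np.clip(tp / pos, eps, 1 - eps)
    fpr = np.clip(fp / neg, eps, 1 - eps)
    return abs(norm.ppf(tpr) - norm.ppf(fpr))

def information_gain(tp, fp, pos, neg):
    # Entropy of the class label minus its expected entropy after splitting on
    # whether the word is present in the document.
    n = pos + neg
    def H(probs):
        p = np.clip(np.asarray(probs, dtype=float), 1e-12, 1.0)
        return float(-(p * np.log2(p)).sum())
    word, no_word = tp + fp, n - (tp + fp)
    h_prior = H([pos / n, neg / n])
    h_word = H([tp / word, fp / word]) if word else 0.0
    h_no_word = H([(pos - tp) / no_word, (neg - fp) / no_word]) if no_word else 0.0
    return h_prior - (word / n) * h_word - (no_word / n) * h_no_word

# Example: a word appearing in 40 of 50 positive and 5 of 450 negative documents.
print(bns(40, 5, 50, 450), information_gain(40, 5, 50, 450))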

7 Performance Goals Precision: tp/(tp + fp). Recall: tp/(tp + fn). F-measure: the harmonic mean of precision and recall, 2 · precision · recall / (precision + recall). Here tp = number of positive cases containing the word, fp = number of negative cases containing the word, fn = number of false negatives (positive cases not containing the word), tn = number of true negatives.
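A small worked example of these measures, using made-up confusion-matrix counts rather than numbers from the paper:

# Hypothetical confusion-matrix counts for one classifier on one problem.
tp, fp, fn, tn = 40, 10, 20, 430

precision = tp / (tp + fp)                                   # 0.80
recall    = tp / (tp + fn)                                   # ~0.67
f_measure = 2 * precision * recall / (precision + recall)    # harmonic mean, ~0.73
accuracy  = (tp + tn) / (tp + fp + fn + tn)                  # 0.94

print(precision, recall, f_measure, accuracy)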

8 This paper evaluates twelve feature selection metrics on a benchmark of 229 text classification problem instances.
Algorithm:
For each dataset d:
  For each of N = 5 trials:
    For each feature selection metric (over various sizes of the selected feature subset):
      * Train a classifier on the training split.
      * Measure performance on the testing split.
  End
  Evaluate performance in terms of precision, recall, F-measure, and accuracy.
End
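The evaluation loop above could be sketched roughly as follows; the dataset dictionary, the metric-based selection functions, the Naive Bayes classifier, and the subset size k are placeholders standing in for the paper's actual benchmark code.

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score

def evaluate(datasets, metrics, n_trials=5, k=100):
    # datasets: {name: (X, y)}; metrics: {name: select(X, y, k) -> column indices}
    results = []
    for name, (X, y) in datasets.items():
        for trial in range(n_trials):
            X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=trial)
            for metric_name, select in metrics.items():
                cols = select(X_tr, y_tr, k)            # feature subset chosen by this metric
                clf = MultinomialNB().fit(X_tr[:, cols], y_tr)
                pred = clf.predict(X_te[:, cols])
                results.append({
                    "dataset": name, "trial": trial, "metric": metric_name,
                    "precision": precision_score(y_te, pred),
                    "recall": recall_score(y_te, pred),
                    "f1": f1_score(y_te, pred),
                    "accuracy": accuracy_score(y_te, pred),
                })
    return results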

9 Some Important Results  F-measure, averaged over the 229 problems for each metric while varying the number of selected features: BNS performs better than using all features.  BNS obtained, on average, better recall than any other method, but if precision is the goal, IG performs best.

10  Under low class skew, IG performs best and matches the performance obtained using all features. Under high skew, BNS performs substantially better than IG.  Under low skew, BNS is the only metric that performs better than using all features. Under high skew, it performs best by a wide margin for any number of features selected.

11  Strong point: the paper succeeds in comparing all the feature selection metrics, and the results are well analyzed in terms of F-measure, recall, precision, and accuracy.  Weak point: in trying to analyze all the feature selection metrics, the paper fails to explain why one metric performs better than another.

12 Conclusion  The paper presents an extensive study of feature selection metrics for high-dimensional data.  The paper contributes a novel evaluation methodology that addresses the common problem of selecting one or more metrics with the best chance of achieving top performance on a given dataset.  BNS paired with Odds ratio yields good precision.  BNS paired with F1 optimizes recall.

