Download presentation

Presentation is loading. Please wait.

Published byLogan Klein Modified over 4 years ago

1
Robust Feature Selection by Mutual Information Distributions Marco Zaffalon & Marcus Hutter IDSIA IDSIA Galleria 2, 6928 Manno (Lugano), Switzerland www.idsia.ch/~{zaffalon,marcus}{zaffalon,marcus}@idsia.ch

2
Mutual Information (MI) Consider two discrete random variables (, ) Consider two discrete random variables (, ) (In)Dependence often measured by MI (In)Dependence often measured by MI –Also known as cross-entropy or information gain –Examples Inference of Bayesian nets, classification trees Inference of Bayesian nets, classification trees Selection of relevant variables for the task at hand Selection of relevant variables for the task at hand

3
MI-Based Feature-Selection Filter (F) Lewis, 1992 Classification Classification –Predicting the class value given values of features –Features (or attributes) and class = random variables –Learning the rule features class from data Filters goal: removing irrelevant features Filters goal: removing irrelevant features –More accurate predictions, easier models MI-based approach MI-based approach –Remove feature if class does not depend on it: –Or: remove if is an arbitrary threshold of relevance is an arbitrary threshold of relevance

4
Empirical Mutual Information a common way to use MI in practice Data ( ) contingency table Data ( ) contingency table –Empirical (sample) probability: –Empirical mutual information: Problems of the empirical approach Problems of the empirical approach – due to random fluctuations? (finite sample) –How to know if it is reliable, e.g. by j\ij\ij\ij\i12…r 1 n 11 n 12 … n 1r 2 n 21 n 22 … n 2r s n s1 n s2 … n sr

5
We Need the Distribution of MI Bayesian approach Bayesian approach –Prior distribution for the unknown chances (e.g., Dirichlet) –Posterior: Posterior probability density of MI: Posterior probability density of MI: How to compute it? How to compute it? –Fitting a curve by the exact mean, approximate variance

6
Mean and Variance of MI Hutter, 2001; Wolpert & Wolf, 1995 Exact mean Exact mean Leading and next to leading order term (NLO) for the variance Leading and next to leading order term (NLO) for the variance Computational complexity O(rs) Computational complexity O(rs) –As fast as empirical MI

7
MI Density Example Graphs

8
Robust Feature Selection Filters: two new proposals Filters: two new proposals –FF: include feature iff (include iff proven relevant) (include iff proven relevant) –BF: exclude feature iff (exclude iff proven irrelevant) (exclude iff proven irrelevant) Examples Examples FF includes BF includes FF excludes BF includes FF excludes BF excludes I

9
Comparing the Filters Experimental set-up Experimental set-up –Filter (F,FF,BF) + Naive Bayes classifier –Sequential learning and testing Collected measures for each filter Collected measures for each filter –Average # of correct predictions (prediction accuracy) –Average # of features used Naive Bayes Classification Test instance Filter Instance kInstance k+1Instance N Learning data Store after classification

10
Results on 10 Complete Datasets # of used features # of used features Accuracies NOT significantly different Accuracies NOT significantly different –Except Chess & Spam with FF

11
Results on 10 Complete Datasets - ctd

12
FF: Significantly Better Accuracies Chess Chess Spam Spam

13
Extension to Incomplete Samples MAR assumption MAR assumption –General case: missing features and class EM + closed-form expressions EM + closed-form expressions –Missing features only Closed-form approximate expressions for Mean and Variance Closed-form approximate expressions for Mean and Variance Complexity still O(rs) Complexity still O(rs) New experiments New experiments –5 data sets –Similar behavior

14
Conclusions Expressions for several moments of MI distribution are available Expressions for several moments of MI distribution are available –The distribution can be approximated well –Safer inferences, same computational complexity of empirical MI –Why not to use it? Robust feature selection shows power of MI distribution Robust feature selection shows power of MI distribution –FF outperforms traditional filter F Many useful applications possible Many useful applications possible –Inference of Bayesian nets –Inference of classification trees –…

Similar presentations

OK

On the Optimality of the Simple Bayesian Classifier under Zero-One Loss Pedro Domingos, Michael Pazzani Presented by Lu Ren Oct. 1, 2007.

On the Optimality of the Simple Bayesian Classifier under Zero-One Loss Pedro Domingos, Michael Pazzani Presented by Lu Ren Oct. 1, 2007.

© 2018 SlidePlayer.com Inc.

All rights reserved.

To make this website work, we log user data and share it with processors. To use this website, you must agree to our Privacy Policy, including cookie policy.

Ads by Google