Download presentation

Presentation is loading. Please wait.

Published byIan Fox Modified over 4 years ago

1
Bayesian Treatment of Incomplete Discrete Data applied to Mutual Information and Feature Selection Marcus Hutter & Marco Zaffalon IDSIA IDSIA Galleria 2, 6928 Manno (Lugano), Switzerland www.idsia.ch/~{marcus,zaffalon}{marcus,zaffalon}@idsia.ch

2
Abstract Given the joint chances of a pair of random variables one can compute quantities of interest, like the mutual information. The Bayesian treatment of unknown chances involves computing, from a second order prior distribution and the data likelihood, a posterior distribution of the chances. A common treatment of incomplete data is to assume ignorability and determine the chances by the expectation maximization (EM) algorithm. The two different methods above are well established but typically separated. This paper joins the two approaches in the case of Dirichlet priors, and derives efficient approximations for the mean, mode and the (co)variance of the chances and the mutual information. Furthermore, we prove the unimodality of the posterior distribution, whence the important property of convergence of EM to the global maximum in the chosen framework. These results are applied to the problem of selecting features for incremental learning and naive Bayes classification. A fast filter based on the distribution of mutual information is shown to outperform the traditional filter based on empirical mutual information on a number of incomplete real data sets. Incomplete data, Bayesian statistics, expectation maximization, global optimization, Mutual Information, Cross Entropy, Dirichlet distribution, Second order distribution, Credible intervals, expectation and variance of mutual information, missing data, Robust feature selection, Filter approach, naive Bayes classifier. Keywords

3
Mutual Information (MI) Consider two discrete random variables (, ) Consider two discrete random variables (, ) (In)Dependence often measured by MI (In)Dependence often measured by MI –Also known as cross-entropy or information gain –Examples Inference of Bayesian nets, classification trees Inference of Bayesian nets, classification trees Selection of relevant variables for the task at hand Selection of relevant variables for the task at hand

4
MI-Based Feature-Selection Filter (F) Lewis, 1992 Classification Classification –Predicting the class value given values of features –Features (or attributes) and class = random variables –Learning the rule features class from data Filters goal: removing irrelevant features Filters goal: removing irrelevant features –More accurate predictions, easier models MI-based approach MI-based approach –Remove feature if class does not depend on it: –Or: remove if is an arbitrary threshold of relevance is an arbitrary threshold of relevance

5
Empirical Mutual Information a common way to use MI in practice Data ( ) contingency table Data ( ) contingency table –Empirical (sample) probability: –Empirical mutual information: Problems of the empirical approach Problems of the empirical approach – due to random fluctuations? (finite sample) –How to know if it is reliable, e.g. by j\ij\ij\ij\i12…r 1 n 11 n 12 … n 1r 2 n 21 n 22 … n 2r s n s1 n s2 … n sr

6
Incomplete Samples Missing features/classes Missing features/classes –Missing class: (i,?) n i? = # features i with missing class label –Missing feature: (?,j) n ?j = # classes j with missing feature –Total sample size N ij =n ij +n i? +n ?j MAR assumption: i? = i+, ?j = +j MAR assumption: i? = i+, ?j = +j –General case: missing features and class EM + closed-form leading order in N -1 expressions EM + closed-form leading order in N -1 expressions –Missing features only Closed-form leading order expressions for Mean and Variance Closed-form leading order expressions for Mean and Variance Complexity O(rs) Complexity O(rs)

7
We Need the Distribution of MI Bayesian approach Bayesian approach –Prior distribution for the unknown chances (e.g., Dirichlet) –Posterior: Posterior probability density of MI: Posterior probability density of MI: How to compute it? How to compute it? –Fitting a curve using mode and approximate variance

8
Mean and Variance of and I (missing features only) Exact mode = leading mean Exact mode = leading mean Leading covariance: Leading covariance: with with Exact mode = = leading order mean Exact mode = = leading order mean Leading variance: Leading variance: Missing features & classes: EM converges globally, since p( |n) is unimodal Missing features & classes: EM converges globally, since p( |n) is unimodal

9
MI Density Example Graphs (complete sample)

10
Robust Feature Selection Filters: two new proposals Filters: two new proposals –FF: include feature iff (include iff proven relevant) (include iff proven relevant) –BF: exclude feature iff (exclude iff proven irrelevant) (exclude iff proven irrelevant) Examples Examples FF includes BF includes FF excludes BF includes FF excludes BF excludes I

11
Comparing the Filters Experimental set-up Experimental set-up –Filter (F,FF,BF) + Naive Bayes classifier –Sequential learning and testing Collected measures for each filter Collected measures for each filter –Average # of correct predictions (prediction accuracy) –Average # of features used Naive Bayes Classification Test instance Filter Instance kInstance k+1Instance N Learning data Store after classification

12
Results on 10 Complete Datasets # of used features # of used features Accuracies NOT significantly different Accuracies NOT significantly different –Except Chess & Spam with FF

13
Results on 10 Complete Datasets - ctd

14
FF: Significantly Better Accuracies Chess Chess Spam Spam

15
Results on 5 Incomplete Data Sets # Instances# Features# miss.valsDatasetFFFBF 22669317Audiology64.368.068.7 6901567Crx9.712.613.8 368181281Horse-Colic11.816.117.4 3163231980Hypothyroidloss4.38.313.2 683352337Soybean-large34.235.0

16
Conclusions Expressions for several moments of and MI distribution even for incomplete categorical data Expressions for several moments of and MI distribution even for incomplete categorical data –The distribution can be approximated well –Safer inferences, same computational complexity of empirical MI –Why not to use it? Robust feature selection shows power of MI distribution Robust feature selection shows power of MI distribution –FF outperforms traditional filter F Many useful applications possible Many useful applications possible –Inference of Bayesian nets –Inference of classification trees –…

Similar presentations

OK

A statistical model Μ is a set of distributions (or regression functions), e.g., all uni-modal, smooth distributions. Μ is called a parametric model if.

A statistical model Μ is a set of distributions (or regression functions), e.g., all uni-modal, smooth distributions. Μ is called a parametric model if.

© 2018 SlidePlayer.com Inc.

All rights reserved.

To make this website work, we log user data and share it with processors. To use this website, you must agree to our Privacy Policy, including cookie policy.

Ads by Google