
1 Suboptimality of Bayes and MDL in Classification. Peter Grünwald, CWI/EURANDOM, www.grunwald.nl. Joint work with John Langford, TTI Chicago. A preliminary version appeared in the Proceedings of the 17th Annual Conference on Learning Theory (COLT 2004).

2 Our Result
– We study Bayesian and Minimum Description Length (MDL) inference in classification problems
– Bayes and MDL should automatically deal with overfitting
– We show there exist classification domains where Bayes and MDL, when applied in a standard manner, perform suboptimally (overfit!) even if sample size tends to infinity

3 Why is this interesting?
Practical viewpoint:
– Bayesian methods are used a lot in practice and are sometimes claimed to be 'universally optimal'
– MDL methods were even designed to deal with overfitting
– Yet MDL and Bayes can 'fail' even with infinite data
Theoretical viewpoint:
– How can the result be reconciled with various strong Bayesian consistency theorems?

4 Menu
1. Classification
2. Abstract statement of main result
3. Precise statement of result
4. Discussion

5 Classification
Given:
– a feature space X
– a label space Y = {-1, +1}
– a sample S = ((x_1, y_1), ..., (x_n, y_n)), each (x_i, y_i) ∈ X × Y
– a set of hypotheses (classifiers) C, each c: X → Y
Goal: find a c ∈ C that makes few mistakes on future data from the same source
– We say 'c has small generalization error/classification risk'

6 Classification Models
Types of classifiers:
1. Hard classifiers (-1/1-valued output): decision trees, stumps, forests (initial focus)
2. Soft classifiers (real-valued output): support vector machines, neural networks
3. Probabilistic classifiers: naive Bayes/Bayesian network classifiers, logistic regression

7 Generalization Error
As is customary in statistical learning theory, we analyze classification by postulating some (unknown) distribution D on the joint (input, label)-space. The performance of a classifier c is measured in terms of its generalization error (classification risk), defined as
err_D(c) = P_{(X,Y) ~ D}( c(X) ≠ Y )
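To make the definition concrete, here is a minimal Python sketch (not part of the talk; the distribution D below is a toy stand-in, assumed purely for illustration):

```python
import random

def sample_D():
    """Toy stand-in for the unknown distribution D (assumed for illustration):
    a single binary feature that agrees with the label 80% of the time."""
    y = random.choice([-1, 1])
    x = y if random.random() < 0.8 else -y
    return x, y

def generalization_error(c, n_samples=100_000):
    """Monte Carlo estimate of err_D(c) = P(c(X) != Y)."""
    mistakes = 0
    for _ in range(n_samples):
        x, y = sample_D()
        mistakes += (c(x) != y)
    return mistakes / n_samples

print(generalization_error(lambda x: x))  # approximately 0.2
```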

8 Learning Algorithms
A learning algorithm LA based on a set of candidate classifiers C is a function that, for each sample S of arbitrary length, outputs a classifier LA(S) ∈ C.
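A standard example of such a learning algorithm, sketched in Python (the candidate classifiers are illustrative; the talk does not commit to empirical risk minimization):

```python
def empirical_error(c, S):
    """Fraction of examples in the sample S = [(x, y), ...] that c misclassifies."""
    return sum(1 for x, y in S if c(x) != y) / len(S)

def erm(candidates, S):
    """LA(S): output the candidate classifier with the fewest mistakes on S."""
    return min(candidates, key=lambda c: empirical_error(c, S))

# Usage with two toy classifiers on a one-dimensional feature:
S = [(1, 1), (1, 1), (-1, -1), (1, -1)]
candidates = [lambda x: x, lambda x: -x]
best = erm(candidates, S)
print(empirical_error(best, S))  # 0.25
```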

9 Consistent Learning Algorithms
Suppose (X_1, Y_1), (X_2, Y_2), ... are i.i.d. ~ D. A learning algorithm LA is consistent or asymptotically optimal if, no matter what the 'true' distribution D is,
err_D(LA(S)) → inf_{c ∈ C} err_D(c)   in D-probability, as n → ∞.
Here LA(S) is the 'learned' classifier, and the infimum is the error of the 'best' classifier in C.


11 Main Result
There exist
– an input domain X
– a prior P, non-zero on a countable set of classifiers C
– a 'true' distribution D
– a constant K > 0
such that the Bayesian learning algorithm is asymptotically K-suboptimal:
err_D(Bayes(S)) ≥ inf_{c ∈ C} err_D(c) + K   as n → ∞.

12 Main Result (continued)
The same holds for the MDL learning algorithm.

13 Remainder of Talk
1. How is the "Bayes learning algorithm" defined?
2. What is the scenario? What do the set of classifiers, the 'true' distribution D, and the prior P look like?
3. How dramatic is the result? How large is K? How strange are the choices made in the scenario?
4. How bad can Bayes get?
5. Why is the result surprising? Can it be reconciled with Bayesian consistency results?

14 Bayesian Learning of Classifiers
Problem: Bayesian inference is defined for models that are sets of probability distributions. In our scenario, models are sets of classifiers, i.e. functions c: X → Y. How can we find a posterior over classifiers using Bayes rule?
Standard answer: convert each c into a corresponding distribution p_c and apply Bayes to the set of distributions thus obtained.

15 Classifiers → probability distributions
Standard conversion method from classifiers c to distributions p_c: the logistic (sigmoid) transformation. For each c ∈ C and η > 0, set
p_{c,η}(y | x) = 1 / (1 + e^{-η · y · c(x)})
Define priors P(c) on C and P(η) on η, and set P(c, η) = P(c) · P(η).

16 Logistic transformation - intuition
Consider 'hard' classifiers c: X → {-1, 1}. For each η > 0,
-log p_{c,η}(y^n | x^n) = M · log(1 + e^η) + (n - M) · log(1 + e^{-η})
where M/n is the empirical error that c makes on the data and M is the number of mistakes c makes on the data.

17 Logistic transformation - intuition
For fixed η:
– the log-likelihood is a (decreasing) linear function of the number of mistakes c makes on the data
– so the (log-)likelihood is maximized for the c that is optimal for the observed data
For fixed c:
– maximizing the likelihood over η also makes sense: the maximizing η matches the model's mistake probability to c's empirical error rate (see the sketch below)
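To make slides 15-17 concrete, here is a small Python sketch under the parametrization written above (the talk's exact constants may differ):

```python
import math

def log_lik(eta, n, M):
    """log p_{c,eta}(y^n | x^n) for a hard classifier that makes M mistakes:
    each correct example contributes -log(1 + e^-eta), each mistake
    -log(1 + e^eta). For fixed eta this is linear in M, so it is maximized
    by the classifier with the fewest mistakes on the data."""
    return -(M * math.log1p(math.exp(eta)) + (n - M) * math.log1p(math.exp(-eta)))

# For fixed c, the likelihood is maximized over eta when the model's mistake
# probability 1/(1 + e^eta) equals the empirical error M/n:
n, M = 100, 20
eta_hat = math.log((n - M) / M)   # then sigma(eta_hat) = 1 - M/n
print(log_lik(eta_hat, n, M))
```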

18 Logistic transformation - intuition
In Bayesian practice, the logistic transformation is a standard tool, nowadays performed without any motivation or explanation.
– We did not find it in Bayesian textbooks, …
– …but tested it with three well-known Bayesians!
It is analogous to turning a set of predictors with squared error into conditional distributions with normally distributed noise: the transformation expresses Y = c(X) · Z, where Z ∈ {-1, 1} is an independent noise bit.

19 Main Result
There exist an input domain X, a prior P on a countable set of classifiers C, a 'true' distribution D, and a constant K > 0 such that the Bayesian learning algorithm is asymptotically K-suboptimal:
err_D(Bayes(S)) ≥ inf_{c ∈ C} err_D(c) + K   as n → ∞.
This holds both for full Bayes and for Bayes (S)MAP. (Grünwald & Langford, COLT 2004)

20 Definition of Bayes(S)
Posterior: P(c, η | S) ∝ p_{c,η}(y^n | x^n) · P(c) · P(η)
Predictive distribution: p(y | x, S) = Σ_c ∫ p_{c,η}(y | x) dP(c, η | S)
"Full Bayes" learning algorithm: Bayes(S)(x) = arg max_y p(y | x, S)
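A toy sketch of this "full Bayes" algorithm, with a finite grid of η values standing in for the continuous prior P(η) (all numbers illustrative):

```python
import math

def sigma(t):
    return 1.0 / (1.0 + math.exp(-t))

def posterior(candidates, etas, prior_c, prior_eta, S):
    """Posterior P(c,eta|S) ∝ p_{c,eta}(y^n|x^n) P(c) P(eta), normalized
    over the finite grid. Fine for small samples; a real implementation
    would work in log space to avoid underflow."""
    post = {}
    for i, c in enumerate(candidates):
        for j, eta in enumerate(etas):
            loglik = sum(math.log(sigma(eta * y * c(x))) for x, y in S)
            post[(i, j)] = prior_c[i] * prior_eta[j] * math.exp(loglik)
    Z = sum(post.values())
    return {k: v / Z for k, v in post.items()}

def bayes_predict(candidates, etas, post, x):
    """Predictive p(y=1|x,S) = sum_{c,eta} p_{c,eta}(1|x) P(c,eta|S);
    predict the label with predictive probability >= 1/2."""
    p1 = sum(post[(i, j)] * sigma(etas[j] * candidates[i](x))
             for i, _ in enumerate(candidates) for j, _ in enumerate(etas))
    return 1 if p1 >= 0.5 else -1

# Usage:
S = [(1, 1), (1, 1), (-1, -1)]
cands = [lambda x: x, lambda x: -x]
etas = [0.5, 1.0, 2.0]   # grid standing in for P(eta)
post = posterior(cands, etas, [0.5, 0.5], [1/3, 1/3, 1/3], S)
print(bayes_predict(cands, etas, post, x=1))  # 1
```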

21 Issues/Remainder of Talk
1. How is the "Bayes learning algorithm" defined?
2. What is the scenario? What do the set of classifiers, the 'true' distribution D, and the prior P look like?
3. How dramatic is the result? How large is K? How strange are the choices made in the scenario?
4. How bad can Bayes get?
5. Why is the result surprising? Can it be reconciled with Bayesian consistency results?

22 Scenario
Definition of Y, X and C: …
Definition of the prior:
– for some small …, for all large n, …
– the prior on η can be any strictly positive smooth prior (or a discrete prior with sufficient precision)

23 Scenario - II: Definition of the true D
1. Toss a fair coin to determine the value of Y.
2. Toss a coin Z with bias … (this decides whether the example is easy or hard).
3. If Z = 1 (easy example), then for all j, set X_j = Y.
4. If Z = 0 (hard example), then set X_0 = … and, for all j ≥ 1, independently set X_j = …
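A hypothetical sampler for this D: the slide's actual bias values were lost in transcription, so the constants THETA, P0_HARD and PJ_HARD below are placeholders, chosen only so that c_0 comes out best, as slide 24 states:

```python
import random

THETA = 0.3        # placeholder: bias of the easy/hard coin Z
P0_HARD = 0.7      # placeholder: P(X_0 = Y) on hard examples
PJ_HARD = 0.5      # placeholder: P(X_j = Y), j >= 1, on hard examples
NUM_FEATURES = 10

def sample_D():
    y = random.choice([-1, 1])              # 1. fair coin for Y
    easy = random.random() < THETA          # 2. coin Z with bias THETA
    if easy:                                # 3. easy: every feature copies Y
        x = [y] * NUM_FEATURES
    else:                                   # 4. hard: noisy copies of Y
        x = [y if random.random() < (P0_HARD if j == 0 else PJ_HARD) else -y
             for j in range(NUM_FEATURES)]
    return x, y
```

With these placeholders, err_D(c_0) = 0.3(1-THETA) while err_D(c_j) = 0.5(1-THETA) for j ≥ 1, so all features are informative (through the easy examples) and c_0 is strictly best.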

24 Result
All features are informative of Y, but X_0 is more informative than all the others, so c_0 is the best classifier:
err_D(c_0) = inf_{c ∈ C} err_D(c)
Nevertheless, with 'true' D-probability 1, the Bayes classifier remains suboptimal as n → ∞ (but note: for each fixed j, …)

25 Issues/Remainder of Talk
1. How is the "Bayes learning algorithm" defined?
2. What is the scenario? What do the set of classifiers, the 'true' distribution D, and the prior P look like?
3. How dramatic is the result? How large is K? How strange are the choices made in the scenario?
4. How bad can Bayes get?
5. Why is the result surprising? Can it be reconciled with Bayesian consistency results?

26 Theorem 1
There exist an input domain X, a prior P on a countable set of classifiers C, a 'true' distribution D, and a constant K > 0 such that the Bayesian learning algorithm is asymptotically K-suboptimal:
err_D(Bayes(S)) ≥ inf_{c ∈ C} err_D(c) + K   as n → ∞.
This holds both for full Bayes and for Bayes MAP. (Grünwald & Langford, COLT 2004)

27 Theorem 1, extended
[Figure: x-axis: …; one curve shows the maximum suboptimality of Bayes MAP/MDL, another the maximum suboptimality of full Bayes (involving the binary entropy); the maximum difference is achieved at …]

28 How 'natural' is the scenario?
The basic scenario is quite unnatural; we chose it because we could prove something about it! But:
1. The priors are natural (take e.g. Rissanen's universal prior)
2. Clarke (2002) reports practical evidence that Bayes performs suboptimally with large yet misspecified models in a regression context
3. Bayesian inference is consistent under very weak conditions, so even if the scenario is unnatural, the result is still interesting!

29 Issues/Remainder of Talk
1. How is the "Bayes learning algorithm" defined?
2. What is the scenario? What do the set of classifiers, the 'true' distribution D, and the prior P look like?
3. How dramatic is the result? How large is K? How strange are the choices made in the scenario?
4. How bad can Bayes get?
5. Why is the result surprising? Can it be reconciled with Bayesian consistency results?

30 Bayesian Consistency Results
Doob (1949, special case): Suppose
– the model is countable
– it contains the 'true' conditional distribution
Then with D-probability 1, the posterior converges to the true distribution (weakly/in Hellinger distance).

31 Bayesian Consistency Results
If the posterior converges to the true distribution, …then we must also have that the classification error of Bayes converges to the optimum. Our result says that this does not happen in our scenario. Hence the (countable!) model we constructed must be misspecified: the model is homoskedastic (for fixed η, the same noise rate on every x), while the 'true' D is heteroskedastic (hard examples are noisier than easy ones)!

32 Bayesian consistency under misspecification
Suppose we use Bayesian inference based on a 'model' M that does not contain the true D. Then, under 'mild' generality conditions, Bayes still converges to the distribution in M that is closest to D in KL-divergence (relative entropy). The logistic transformation ensures that the minimum KL-divergence is achieved for the c that also achieves the minimum generalization error (see the numeric check below).
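This last claim can be checked numerically under the parametrization of slide 15 (a sketch of the calculation, not the paper's proof): the η-optimized expected log loss of p_{c,η} depends on c only through e = err_D(c), and it equals the binary entropy H(e), which is increasing on [0, 1/2). So minimizing KL-divergence and minimizing generalization error pick out the same c.

```python
import math

def min_expected_log_loss(e):
    """min over eta of  e*log(1 + e^eta) + (1-e)*log(1 + e^-eta),
    i.e. the eta-optimized expected log loss of p_{c,eta} when err_D(c) = e.
    The minimizer is eta = log((1-e)/e)."""
    eta = math.log((1 - e) / e)
    return e * math.log1p(math.exp(eta)) + (1 - e) * math.log1p(math.exp(-eta))

def binary_entropy(e):
    return -e * math.log(e) - (1 - e) * math.log(1 - e)

for e in [0.1, 0.2, 0.3, 0.4]:
    # the two columns agree, and increase with e
    print(e, min_expected_log_loss(e), binary_entropy(e))
```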

33 Bayesian consistency under misspecification
In our case, the Bayesian posterior does not converge to the distribution with the smallest classification generalization error, so it also does not converge to the distribution closest to the true D in KL-divergence. Apparently, the 'mild' generality conditions for 'Bayesian consistency under misspecification' are violated. Conditions for consistency under misspecification are much stronger than conditions for standard consistency!
– the model must either be convex or 'simple' (e.g. parametric)

34 Is consistency achievable at all?
Methods for avoiding overfitting proposed in the statistical and computational learning theory literature are consistent:
– Vapnik's methods (based on VC-dimension etc.)
– McAllester's PAC-Bayes methods
These methods invariably punish 'complex' (low-prior) classifiers much more than ordinary Bayes does – in the simplest version of PAC-Bayes, the penalty on c grows like the square root of log(1/P(c))/n rather than log(1/P(c))/n (see the sketch below).
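A sketch of this kind of prior-penalized selection, using the standard Occam's-razor bound for a countable class (the exact constants vary by theorem; this particular form is an assumption of the sketch, not taken from the talk): with probability at least 1-δ, simultaneously for all c, err_D(c) ≤ err_S(c) + sqrt((ln(1/P(c)) + ln(1/δ)) / (2n)).

```python
import math

def occam_select(candidates, prior, S, delta=0.05):
    """Pick the classifier index minimizing
    empirical error + sqrt((ln(1/prior) + ln(1/delta)) / (2n)),
    i.e. complex (low-prior) classifiers pay a larger penalty."""
    n = len(S)
    def bound(i):
        c = candidates[i]
        emp = sum(1 for x, y in S if c(x) != y) / n
        penalty = math.sqrt((math.log(1 / prior[i]) + math.log(1 / delta)) / (2 * n))
        return emp + penalty
    return min(range(len(candidates)), key=bound)
```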

35 Consistency and Data Compression - I
Our inconsistency result also holds for (various incarnations of) the MDL learning algorithm. MDL is a learning method based on data compression; in practice it closely resembles Bayesian inference with certain special priors. …however…

36 Consistency and Data Compression - II
There already exist (in)famous inconsistency results for Bayesian inference, due to Diaconis and Freedman: for some highly non-parametric models, even if the "true" D is in the model, Bayes may not converge to it. These inconsistency results do not apply to MDL, since Diaconis and Freedman use priors that do not compress the data. With MDL priors, if the true D is in the model, then consistency is guaranteed under no further conditions at all (Barron '98).

37 Issues/Remainder of Talk
1. How is the "Bayes learning algorithm" defined?
2. What is the scenario? What do the set of classifiers, the 'true' distribution D, and the prior P look like?
3. How dramatic is the result? How large is K? How strange are the choices made in the scenario?
4. How bad can Bayes get? (& "what happens")
5. Why is the result surprising? Can it be reconciled with Bayesian consistency results?

38 Thm 2: full Bayes result is 'tight'
[Figure, same format as slide 27 – x-axis: …; maximum Bayes MAP/MDL suboptimality and maximum full Bayes suboptimality (binary entropy); maximum difference achieved at …]

39 Theorem 2

40 Proof Sketch
1. Log loss of Bayes upper-bounds 0/1-loss: for every sequence, the accumulated 0/1-loss of the Bayes classifier is at most the accumulated log loss (in bits) of the Bayes predictions.
2. Log loss of Bayes is upper-bounded by the log loss of the 0/1-optimal classifier plus a log-term (law of large numbers/Hoeffding).
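Step 1 can be spelled out as follows (our reconstruction of the standard argument; the notation S_{i-1} for the first i-1 examples is ours):

```latex
% If the Bayes classifier errs on example i, the predictive probability of the
% true label was at most 1/2, so the log loss (in bits) on that example is >= 1.
\begin{align*}
\mathrm{Bayes}(S_{i-1})(x_i) \neq y_i
  \;&\Longrightarrow\; p(y_i \mid x_i, S_{i-1}) \le \tfrac12
  \;\Longrightarrow\; -\log_2 p(y_i \mid x_i, S_{i-1}) \ge 1, \\
\text{hence}\quad
\sum_{i=1}^{n} \mathbf{1}\!\left[\mathrm{Bayes}(S_{i-1})(x_i) \neq y_i\right]
  \;&\le\; \sum_{i=1}^{n} -\log_2 p(y_i \mid x_i, S_{i-1}).
\end{align*}
```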


44 Wait a minute…
The accumulated log loss of sequential Bayesian predictions is always within log(1/P(c)) of the accumulated log loss of the optimal classifier c. So Bayes is 'good' with respect to log loss/KL-divergence, but 'bad' with respect to 0/1-loss. How is this possible?
The Bayesian posterior effectively becomes a mixture of 'bad' distributions (a different mixture at different sample sizes m):
– the mixture is closer to the true distribution D than the optimal classifier's distribution in KL-divergence/log loss prediction
– but it performs worse than the optimal classifier in terms of 0/1 error

45 Bayes predicts too well
Let M be a set of distributions, and let the Bayes predictive distribution be defined with respect to a prior that makes it a universal data-compressor with respect to M. One can show that the only true distributions D for which Bayes can ever become inconsistent in the KL-divergence sense… …are those under which the posterior predictive distribution becomes closer in KL-divergence to D than the best single distribution in M.

46 Conclusion
– Our result applies to hard classifiers and (equivalently) to probabilistic classifiers under slight misspecification
– A Bayesian may argue that the Bayesian machinery was never intended for misspecified models
– Yet, computational resources and human imagination being limited, in practice Bayesian inference is applied to misspecified models all the time. In this case, Bayes may overfit even in the limit of an infinite amount of data

47 Thank you for your attention!

