Suboptimality of Bayes and MDL in Classification
Peter Grünwald, CWI/EURANDOM (www.grunwald.nl)
Joint work with John Langford, TTI Chicago
A preliminary version appeared in the Proceedings of the 17th Annual Conference on Learning Theory (COLT 2004).

Our Result
We study Bayesian and Minimum Description Length (MDL) inference in classification problems.
Bayes and MDL are supposed to deal with overfitting automatically.
We show that there exist classification domains where Bayes and MDL, when applied in the standard manner, perform suboptimally (they overfit!) even if the sample size tends to infinity.

Why is this interesting?
Practical viewpoint:
– Bayesian methods are used a lot in practice and are sometimes claimed to be 'universally optimal'.
– MDL methods were even designed to deal with overfitting.
– Yet MDL and Bayes can 'fail' even with infinite data.
Theoretical viewpoint:
– How can the result be reconciled with various strong Bayesian consistency theorems?

Menu
1. Classification
2. Abstract statement of main result
3. Precise statement of result
4. Discussion

Classification
Given:
– a feature space $\mathcal{X}$
– a label space $\mathcal{Y}$ (here $\mathcal{Y} = \{-1, 1\}$)
– a sample $S = ((x_1, y_1), \ldots, (x_n, y_n))$ of labelled examples
– a set of hypotheses (classifiers) $\mathcal{C}$, each a function $c : \mathcal{X} \to \mathcal{Y}$
Goal: find a $c \in \mathcal{C}$ that makes few mistakes on future data from the same source.
– We then say 'c has small generalization error / classification risk'.

Classification Models
Types of classifiers:
1. Hard classifiers (-1/1 output): decision trees, stumps, forests
2. Soft classifiers (real-valued output): support vector machines, neural networks
3. Probabilistic classifiers: Naive Bayes / Bayesian network classifiers, logistic regression
Initial focus: hard classifiers.

Generalization Error
As is customary in statistical learning theory, we analyze classification by postulating some (unknown) distribution D on the joint (input, label) space $\mathcal{X} \times \mathcal{Y}$.
The performance of a classifier c is measured in terms of its generalization error (classification risk), defined as
$$ \mathrm{err}_D(c) \;=\; P_{(X,Y) \sim D}\bigl[\, c(X) \neq Y \,\bigr]. $$
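To make the definition concrete, here is a minimal Python sketch that estimates $\mathrm{err}_D(c)$ by Monte Carlo sampling; the sampler `sample_from_D` and the classifier `c` are illustrative stand-ins, not part of the original talk.

```python
import random

def sample_from_D():
    """Hypothetical sampler for the joint distribution D over (x, y).
    Here: x is a fair coin flip and y equals x with probability 0.9."""
    x = random.choice([-1, 1])
    y = x if random.random() < 0.9 else -x
    return x, y

def c(x):
    """A trivial classifier: predict the label equal to the feature."""
    return x

def estimate_generalization_error(classifier, n_samples=100_000):
    """Monte Carlo estimate of err_D(c) = P[c(X) != Y]."""
    mistakes = sum(classifier(x) != y
                   for x, y in (sample_from_D() for _ in range(n_samples)))
    return mistakes / n_samples

print(estimate_generalization_error(c))  # should be close to 0.1
```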

Learning Algorithms
A learning algorithm LA, based on a set of candidate classifiers $\mathcal{C}$, is a function that, for each sample S of arbitrary length, outputs a classifier $\mathrm{LA}(S) \in \mathcal{C}$.

Consistent Learning Algorithms
Suppose $(X_1, Y_1), (X_2, Y_2), \ldots$ are i.i.d. $\sim D$. A learning algorithm LA is consistent or asymptotically optimal if, no matter what the 'true' distribution D is,
$$ \mathrm{err}_D\bigl(\mathrm{LA}(S)\bigr) \;\to\; \mathrm{err}_D(\tilde{c}) \quad \text{in } D\text{-probability, as } n \to \infty, $$
where $\mathrm{LA}(S)$ is the 'learned' classifier and $\tilde{c}$ is the 'best' classifier in $\mathcal{C}$, i.e. $\tilde{c} = \arg\min_{c \in \mathcal{C}} \mathrm{err}_D(c)$.

Main Result
There exist
– an input domain $\mathcal{X}$,
– a prior P that is non-zero on a countable set of classifiers $\mathcal{C}$,
– a 'true' distribution D,
– a constant $K > 0$,
such that the Bayesian learning algorithm is asymptotically K-suboptimal: with D-probability 1, the generalization error of the classifier output by Bayes remains at least K above that of the best classifier in $\mathcal{C}$ as the sample size tends to infinity.
The same holds for the MDL learning algorithm.

Remainder of Talk
1. How is the "Bayes learning algorithm" defined?
2. What is the scenario? What do $\mathcal{C}$, the 'true' distribution D, and the prior P look like?
3. How dramatic is the result? How large is K? How strange are the choices for $\mathcal{C}$, D and P?
4. How bad can Bayes get?
5. Why is the result surprising? Can it be reconciled with Bayesian consistency results?

Bayesian Learning of Classifiers
Problem: Bayesian inference is defined for models that are sets of probability distributions.
In our scenario, models are sets of classifiers, i.e. functions $c : \mathcal{X} \to \mathcal{Y}$.
How can we find a posterior over classifiers using Bayes' rule?
Standard answer: convert each classifier c to a corresponding distribution and apply Bayes to the set of distributions thus obtained.

From classifiers to probability distributions
Standard conversion method from $\mathcal{C}$ to a set of conditional distributions: the logistic (sigmoid) transformation.
For each $c \in \mathcal{C}$ and each $\eta > 0$, set
$$ P_{c,\eta}(y \mid x) \;=\; \frac{1}{1 + e^{-\eta\, y\, c(x)}}, $$
so that the probability of a mistake, $P_{c,\eta}(Y \neq c(x) \mid x) = e^{-\eta}/(1 + e^{-\eta})$, is the same for every x.
Define priors on $\mathcal{C}$ and on $\eta$, and combine them into a prior on the set of distributions $\{P_{c,\eta}\}$.
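A minimal Python sketch of the logistic transformation just described, assuming labels in {-1, +1}; the function names are illustrative, not taken from the paper.

```python
import math

def p_label_given_x(c, x, y, eta):
    """Logistic transformation: conditional probability P_{c,eta}(y | x)
    for a hard classifier c with outputs in {-1, +1} and noise parameter eta > 0.
    The probability of a mistake is e^(-eta) / (1 + e^(-eta)), independent of x."""
    return 1.0 / (1.0 + math.exp(-eta * y * c(x)))

def log_likelihood(c, eta, sample):
    """Log-likelihood of a sample [(x1, y1), ..., (xn, yn)] under P_{c,eta}."""
    return sum(math.log(p_label_given_x(c, x, y, eta)) for x, y in sample)
```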

Logistic transformation - intuition
Consider 'hard' classifiers $c : \mathcal{X} \to \{-1, 1\}$.
For each $(c, \eta)$,
$$ -\log P_{c,\eta}(y^n \mid x^n) \;=\; \eta\, \hat{E}(c) \,+\, n \log\bigl(1 + e^{-\eta}\bigr). $$
Here $\hat{e}(c) = \hat{E}(c)/n$ is the empirical error that c makes on the data, and $\hat{E}(c)$ is the number of mistakes c makes on the data.
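For completeness, a short derivation of the formula above, assuming the logistic transformation as defined earlier; this is the standard calculation, reconstructed rather than copied from the original slide.

```latex
% Each correctly classified example contributes probability 1/(1+e^{-\eta}),
% each mistake contributes e^{-\eta}/(1+e^{-\eta}).  Hence:
\begin{align*}
P_{c,\eta}(y^n \mid x^n)
  &= \Bigl(\tfrac{e^{-\eta}}{1+e^{-\eta}}\Bigr)^{\hat{E}(c)}
     \Bigl(\tfrac{1}{1+e^{-\eta}}\Bigr)^{n-\hat{E}(c)}
   = \frac{e^{-\eta \hat{E}(c)}}{(1+e^{-\eta})^{n}}, \\
-\log P_{c,\eta}(y^n \mid x^n)
  &= \eta\, \hat{E}(c) + n \log\bigl(1+e^{-\eta}\bigr).
\end{align*}
% For fixed eta this is minimized (likelihood maximized) by the c with the
% fewest mistakes; for fixed c, setting the derivative in eta to zero gives
% e^{-\eta}/(1+e^{-\eta}) = \hat{E}(c)/n, i.e. the maximum-likelihood noise
% rate equals the empirical error rate.
```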

Logistic transformation - intuition
For fixed $\eta$:
– the log-likelihood is a linear function of the number of mistakes c makes on the data;
– the (log-)likelihood is maximized by the c that is optimal for the observed data.
For fixed c:
– maximizing the likelihood over $\eta$ also makes sense: the maximizing $\eta$ makes the noise rate of $P_{c,\eta}$ equal to the empirical error rate of c.

Logistic transformation - intuition
In Bayesian practice the logistic transformation is a standard tool, nowadays performed without giving any motivation or explanation.
– We did not find it in Bayesian textbooks, …
– …, but tested it with three well-known Bayesians!
It is analogous to turning a set of predictors with squared error into conditional distributions with normally distributed noise: the logistic transformation expresses $Y = c(X) \cdot Z$, where Z is an independent noise bit.
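A brief rendering of the regression analogy mentioned above: least-squares prediction corresponds to maximum likelihood under additive Gaussian noise, just as 0/1-error minimization corresponds to maximum likelihood under the logistic transformation. The formulas below are the standard ones, not taken from the slides.

```latex
% Regression analogy: a predictor f with squared error is turned into the
% conditional density of Y = f(X) + Z with Z ~ N(0, sigma^2):
\[
  p_{f,\sigma}(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\Bigl(-\frac{(y - f(x))^2}{2\sigma^2}\Bigr),
  \qquad
  -\log p_{f,\sigma}(y^n \mid x^n)
     = \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - f(x_i))^2 + \frac{n}{2}\log(2\pi\sigma^2).
\]
% For fixed sigma, maximizing the likelihood over f is exactly least squares,
% just as, for fixed eta, maximizing the logistic likelihood over c minimizes
% the number of classification mistakes.
```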

Main Result
There exist
– an input domain $\mathcal{X}$,
– a prior P on a countable set of classifiers $\mathcal{C}$,
– a 'true' distribution D,
– a constant $K > 0$,
such that the Bayesian learning algorithm is asymptotically K-suboptimal.
This holds both for full Bayes and for Bayes (S)MAP (Grünwald & Langford, COLT 2004).

Definition of the Bayes learning algorithm
Posterior:
$$ P(c, \eta \mid S) \;\propto\; P_{c,\eta}(y^n \mid x^n)\; P(c, \eta). $$
Predictive distribution:
$$ P(y \mid x, S) \;=\; \sum_{c \in \mathcal{C}} \int P_{c,\eta}(y \mid x)\; P(c, \eta \mid S)\, d\eta. $$
"Full Bayes" learning algorithm: classify each new x by the label that is most likely under the predictive distribution,
$$ \mathrm{Bayes}(S)(x) \;=\; \arg\max_{y \in \{-1,1\}} P(y \mid x, S). $$
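A minimal Python sketch of the "full Bayes" learning algorithm over a finite grid of $(c, \eta)$ pairs; the discretization of $\eta$, the uniform prior on that grid, and all function names are illustrative choices, not the paper's.

```python
import math
from itertools import product

def full_bayes_classifier(classifiers, prior, sample, etas=(0.5, 1.0, 2.0, 4.0)):
    """Toy 'full Bayes' learner.  `classifiers` is a list of functions
    x -> {-1, +1}, `prior` a list of their prior weights (same length),
    `sample` a list of (x, y) pairs.  eta is discretized and given a
    uniform prior for simplicity."""
    # Unnormalized log-posterior weight of each (c, eta): log prior + log likelihood.
    posts = {}
    for (i, c), eta in product(enumerate(classifiers), etas):
        loglik = sum(-math.log(1 + math.exp(-eta * y * c(x))) for x, y in sample)
        posts[(i, eta)] = math.log(prior[i] / len(etas)) + loglik
    # Normalize in log space to obtain the posterior.
    m = max(posts.values())
    z = sum(math.exp(v - m) for v in posts.values())
    posts = {k: math.exp(v - m) / z for k, v in posts.items()}

    def predict(x):
        # Predictive probability that Y = +1, mixed over the posterior.
        p_plus = sum(w / (1 + math.exp(-eta * classifiers[i](x)))
                     for (i, eta), w in posts.items())
        return 1 if p_plus >= 0.5 else -1

    return predict
```

The returned `predict` function plays the role of Bayes(S): it classifies a new x by the label with the larger predictive probability.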

Issues/Remainder of Talk
1. How is the "Bayes learning algorithm" defined?
2. What is the scenario? What do $\mathcal{C}$, the 'true' distribution D, and the prior P look like?
3. How dramatic is the result? How large is K? How strange are the choices for $\mathcal{C}$, D and P?
4. How bad can Bayes get?
5. Why is the result surprising? Can it be reconciled with Bayesian consistency results?

Scenario
Definition of Y, X and $\mathcal{C}$:
– labels Y are binary; each input X consists of a sequence of binary features $X_0, X_1, X_2, \ldots$;
– the classifiers are $c_0, c_1, c_2, \ldots$, where $c_j$ classifies according to feature j.
Definition of the prior:
– for some small $\epsilon$, for all large n, the prior of $c_n$ is not too small;
– the prior on $\eta$ can be any strictly positive smooth prior (or a discrete prior with sufficient precision).

Scenario – II: Definition of the true D
Each example is generated as follows:
1. Toss a fair coin to determine the value of Y.
2. Toss a coin Z with a fixed bias to determine whether the example is easy or hard.
3. If Z = 1 (easy example), then for all j, set $X_j = Y$.
4. If Z = 0 (hard example), then set $X_0 = Y$ with a certain probability, and for all $j \geq 1$, independently set $X_j$ at random.
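A minimal Python sketch of a data-generating process with the easy/hard structure described above; the specific numbers (bias of Z, accuracy of feature 0 on hard examples, number of features) are placeholder values chosen for illustration, not the ones used in the paper.

```python
import random

def sample_example(n_features=100, p_easy=0.5, p_feature0_correct=0.75):
    """Sketch of the easy/hard data-generating process.  All numeric
    parameters are illustrative placeholders, not the paper's values."""
    y = random.choice([-1, 1])                      # step 1: fair coin for the label
    easy = random.random() < p_easy                 # step 2: biased coin Z
    if easy:                                        # step 3: easy example
        x = [y] * n_features                        #   every feature equals the label
    else:                                           # step 4: hard example
        x = [y if random.random() < p_feature0_correct else -y]       # feature 0: informative
        x += [random.choice([-1, 1]) for _ in range(n_features - 1)]  # other features: noise
    return x, y

sample = [sample_example() for _ in range(1000)]
```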

Result
All features are informative of Y, but feature 0 is more informative than all the others, so $c_0$ is the best classifier in $\mathcal{C}$.
Nevertheless, with 'true' D-probability 1, the classifier learned by Bayes remains at least K worse than $c_0$ as $n \to \infty$ (but note: for each fixed j, …).

Issues/Remainder of Talk
1. How is the "Bayes learning algorithm" defined?
2. What is the scenario? What do $\mathcal{C}$, the 'true' distribution D, and the prior P look like?
3. How dramatic is the result? How large is K? How strange are the choices for $\mathcal{C}$, D and P?
4. How bad can Bayes get?
5. Why is the result surprising? Can it be reconciled with Bayesian consistency results?

Theorem 1
There exist
– an input domain $\mathcal{X}$,
– a prior P on a countable set of classifiers $\mathcal{C}$,
– a 'true' distribution D,
– a constant $K > 0$,
such that the Bayesian learning algorithm is asymptotically K-suboptimal.
This holds both for full Bayes and for Bayes MAP (Grünwald & Langford, COLT 2004).

Theorem 1, extended
[Figure: curves showing the maximum suboptimality of Bayes MAP/MDL and of full Bayes (the latter involving the binary entropy), plotted against a parameter of the scenario on the X-axis; the maximum difference is achieved at an intermediate value of that parameter.]

How 'natural' is the scenario?
The basic scenario is quite unnatural; we chose it because we could prove something about it!
But:
1. The priors are natural (take e.g. Rissanen's universal prior).
2. Clarke (2002) reports practical evidence that Bayes performs suboptimally with large yet misspecified models in a regression context.
3. Bayesian inference is consistent under very weak conditions.
So even if the scenario is unnatural, the result is still interesting!

Issues/Remainder of Talk
1. How is the "Bayes learning algorithm" defined?
2. What is the scenario? What do $\mathcal{C}$, the 'true' distribution D, and the prior P look like?
3. How dramatic is the result? How large is K? How strange are the choices for $\mathcal{C}$, D and P?
4. How bad can Bayes get?
5. Why is the result surprising? Can it be reconciled with Bayesian consistency results?

Bayesian Consistency Results
Doob (1949, special case): Suppose the model $\mathcal{M}$
– is countable, and
– contains the 'true' conditional distribution D.
Then, with D-probability 1, the posterior concentrates on D: the predictive distribution converges to D weakly / in Hellinger distance.

Bayesian Consistency Results
If the predictive distribution converges to the true D, then we must also have that the Bayes classifier converges to the optimal classification performance.
Our result says that this does not happen in our scenario.
Hence the (countable!) model $\mathcal{M}$ we constructed must be misspecified: it does not contain the true conditional distribution D.
Indeed: the model is homoskedastic, whereas the 'true' D is heteroskedastic!

Bayesian consistency under misspecification
Suppose we use Bayesian inference based on a 'model' $\mathcal{M}$.
If $D \notin \mathcal{M}$, then under 'mild' generality conditions, Bayes still converges to the distribution in $\mathcal{M}$ that is closest to D in KL-divergence (relative entropy).
The logistic transformation ensures that this minimum KL-divergence is achieved for a $P_{c,\eta}$ whose c also achieves the minimum generalization error in $\mathcal{C}$.
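A rendering, in standard notation, of the KL projection just mentioned; the calculation in the second display assumes the logistic transformation as defined earlier and is a reconstruction of the standard argument, not copied from the slides.

```latex
% Under misspecification, Bayes converges (under conditions) to the KL projection
\[
  \tilde{P} \;=\; \arg\min_{P \in \mathcal{M}} \; D_{\mathrm{KL}}(D \,\|\, P).
\]
% For the logistic model M = {P_{c,eta}}, minimizing the expected log loss over
% eta for fixed c gives the binary entropy of c's generalization error,
\[
  \inf_{\eta} \; \mathbb{E}_{D}\bigl[-\log P_{c,\eta}(Y \mid X)\bigr]
    \;=\; H\bigl(\mathrm{err}_D(c)\bigr),
\]
% which is increasing in err_D(c) on [0, 1/2]; hence the KL-closest (c, eta)
% uses the c with the smallest generalization error.
```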

Bayesian consistency under misspecification
In our case, the Bayesian posterior does not converge to the distribution with the smallest classification generalization error, so it also does not converge to the distribution closest to the true D in KL-divergence.
Apparently, the 'mild' generality conditions for 'Bayesian consistency under misspecification' are violated.
Conditions for consistency under misspecification are much stronger than conditions for standard consistency!
– The model $\mathcal{M}$ must either be convex or 'simple' (e.g. parametric).

Is consistency achievable at all?
Methods for avoiding overfitting proposed in the statistical and computational learning theory literature are consistent:
– Vapnik's methods (based on VC-dimension etc.)
– McAllester's PAC-Bayes methods
These methods invariably punish 'complex' (low-prior) classifiers much more than ordinary Bayes does; see the bound sketched below for the simplest version of PAC-Bayes.
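As an illustration of the stronger complexity penalty, here is the standard Occam's-razor-style bound that the simplest PAC-Bayes/union-bound argument gives for a countable class with prior $\pi$; this is the textbook form of the bound, reconstructed here rather than copied from the original slide.

```latex
% With probability at least 1 - delta over a sample S of size n,
% simultaneously for all classifiers c in the countable class C:
\[
  \mathrm{err}_D(c) \;\le\; \hat{\mathrm{err}}_S(c)
    \;+\; \sqrt{\frac{\ln\frac{1}{\pi(c)} + \ln\frac{1}{\delta}}{2n}} .
\]
% The penalty grows like sqrt(-ln pi(c) / n), so low-prior ('complex')
% classifiers must beat simple ones by a large margin on the sample before
% they are selected; ordinary Bayes/MDL only charges the additive -log pi(c)
% against the log-likelihood, a much weaker penalty for large n.
```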

Consistency and Data Compression - I
Our inconsistency result also holds for (various incarnations of) the MDL learning algorithm.
MDL is a learning method based on data compression; in practice it closely resembles Bayesian inference with certain special priors.
… however …

Consistency and Data Compression - II
There already exist (in)famous inconsistency results for Bayesian inference, due to Diaconis and Freedman.
For some highly non-parametric models $\mathcal{M}$, even if the "true" D is in $\mathcal{M}$, Bayes may not converge to it.
These types of inconsistency results do not apply to MDL, since Diaconis and Freedman use priors that do not compress the data.
With MDL priors, if the true D is in $\mathcal{M}$, then consistency is guaranteed under no further conditions at all (Barron '98).

Issues/Remainder of Talk
1. How is the "Bayes learning algorithm" defined?
2. What is the scenario? What do $\mathcal{C}$, the 'true' distribution D, and the prior P look like?
3. How dramatic is the result? How large is K? How strange are the choices for $\mathcal{C}$, D and P?
4. How bad can Bayes get? (& "what happens")
5. Why is the result surprising? Can it be reconciled with Bayesian consistency results?

Thm 2: full Bayes result is 'tight'
[Figure: same plot as before, showing the maximum suboptimality of Bayes MAP/MDL and of full Bayes (binary entropy) against a parameter of the scenario; the maximum difference is achieved at an intermediate value.]

Theorem 2

Proof Sketch
1. The log loss of Bayes upper-bounds the 0/1-loss: this holds for every sequence.
2. The log loss of Bayes is upper-bounded by the log loss of the 0/1-optimal classifier plus a log-term (law of large numbers / Hoeffding).
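The two steps can be written out as follows; this is a reconstruction in standard notation (with logarithms to base 2 and the Bayesian predictive distribution defined earlier), not a verbatim copy of the original formulas.

```latex
% Step 1: whenever the Bayes predictive classifier errs on (x_i, y_i), the
% predictive probability of the true label is at most 1/2, so its log loss
% (base 2) is at least 1 bit.  Hence, for every sequence,
\[
  \sum_{i=1}^{n} \mathbf{1}\bigl[\,\mathrm{Bayes}(S^{i-1})(x_i) \neq y_i\,\bigr]
  \;\le\;
  \sum_{i=1}^{n} -\log_2 P\bigl(y_i \mid x_i, S^{i-1}\bigr).
\]
% Step 2: for a discrete prior on (c, eta), the cumulative log loss of
% Bayesian sequential prediction is within the (fixed) negative log prior of
% the log loss of any single (c, eta), in particular of the one based on the
% 0/1-optimal classifier; the law of large numbers / Hoeffding then relates
% that log loss to n times the optimal error rate.
\[
  \sum_{i=1}^{n} -\log_2 P\bigl(y_i \mid x_i, S^{i-1}\bigr)
  \;\le\;
  -\log_2 P_{c,\eta}\bigl(y^n \mid x^n\bigr) \;-\; \log_2 P(c,\eta)
  \quad \text{for every } (c, \eta).
\]
```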

Wait a minute…
The accumulated log loss of sequential Bayesian predictions is always within a constant (the negative log prior) of the accumulated log loss of the optimal single distribution $P_{c,\eta}$.
So Bayes is 'good' with respect to log loss / KL-divergence.
But Bayes is 'bad' with respect to 0/1-loss. How is this possible?
The Bayesian posterior effectively becomes a mixture of 'bad' distributions (a different mixture at different sample sizes m):
– the mixture is closer to the true distribution D than $P_{\tilde{c},\eta}$ in terms of KL-divergence / log loss prediction,
– but it performs worse than $\tilde{c}$ in terms of 0/1 error.

Bayes predicts too well
Let $\mathcal{M}$ be a set of distributions, and let the Bayesian predictive distribution be defined with respect to a prior that makes it a universal data-compressor with respect to $\mathcal{M}$.
One can show that the only true distributions D for which Bayes can ever become inconsistent in the KL-divergence sense…
…are those under which the posterior predictive distribution becomes closer in KL-divergence to D than the best single distribution in $\mathcal{M}$.

Conclusion
Our result applies to hard classifiers and (equivalently) to probabilistic classifiers under slight misspecification.
A Bayesian may argue that the Bayesian machinery was never intended for misspecified models.
Yet, computational resources and human imagination being limited, in practice Bayesian inference is applied to misspecified models all the time.
In this case, Bayes may overfit even in the limit of an infinite amount of data.

Thank you for your attention!