
1 How to be a Bayesian without believing. Yoav Freund, joint work with Rob Schapire and Yishay Mansour.

2 Motivation. Statistician: “Are you a Bayesian or a Frequentist?” Yoav: “I don’t know, you tell me…” I need a better answer.

3 Toy example. The computer receives a telephone call, measures the pitch of the voice, and decides the gender of the caller (male or female).

4 Generative modeling. Figure: probability vs. voice pitch, with one Gaussian fitted to each class (mean1, var1 and mean2, var2).
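A minimal sketch of this generative route, assuming one Gaussian per class with equal class priors; the function names and data layout below are illustrative, not from the talk:

```python
import numpy as np

def fit_gaussian(samples):
    """Maximum-likelihood mean and variance of a 1-D sample."""
    samples = np.asarray(samples, dtype=float)
    return samples.mean(), samples.var()

def log_density(x, mu, var):
    """Log of the Gaussian density N(mu, var) at x."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def classify_pitch(pitch, male_pitches, female_pitches):
    """Fit one Gaussian per class and label the caller by the higher likelihood."""
    m_mu, m_var = fit_gaussian(male_pitches)
    f_mu, f_var = fit_gaussian(female_pitches)
    return "male" if log_density(pitch, m_mu, m_var) >= log_density(pitch, f_mu, f_var) else "female"
```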

5 Discriminative approach. Figure: number of training mistakes as a function of the decision threshold on voice pitch.
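A hedged sketch of the discriminative alternative, assuming the classifier is a single pitch threshold with a fixed orientation (predict +1 above the threshold); the helper name is illustrative:

```python
import numpy as np

def best_threshold(pitches, labels):
    """Return the pitch threshold with the fewest training mistakes,
    predicting +1 when pitch > threshold and -1 otherwise."""
    pitches = np.asarray(pitches, dtype=float)
    labels = np.asarray(labels)
    best_t, best_mistakes = None, len(labels) + 1
    for t in np.sort(pitches):
        mistakes = int(np.sum(np.where(pitches > t, 1, -1) != labels))
        if mistakes < best_mistakes:
            best_t, best_mistakes = t, mistakes
    return best_t, best_mistakes
```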

6 Discriminative Bayesian approach. Figure: probability vs. voice pitch; a prior over classifiers is combined with the conditional probability of the labels to give a posterior.

7 Suggested approach. Figure: number of mistakes vs. voice pitch, with the pitch axis divided into “definitely female”, “unsure”, and “definitely male” regions.

8 Formal frameworks. For stating theorems regarding the dependence of the generalization error on the size of the training set.

9 The PAC set-up
1. Learner chooses a classifier set C, where each c ∈ C is a map c: X → {-1,+1}, and requests m training examples.
2. Nature chooses a target classifier c from C and a distribution P over X.
3. Nature generates the training set (x_1,y_1), (x_2,y_2), …, (x_m,y_m).
4. Learner generates h: X → {-1,+1}.
Goal: P(h(x) ≠ c(x)) < ε for every c and P.

10 The agnostic set-up (Vapnik’s pattern-recognition problem)
1. Learner chooses a classifier set C, where each c ∈ C is a map c: X → {-1,+1}, and requests m training examples.
2. Nature chooses a distribution D over X × {-1,+1}.
3. Nature generates the training set (x_1,y_1), (x_2,y_2), …, (x_m,y_m) according to D.
4. Learner generates h: X → {-1,+1}.
Goal: P_D(h(x) ≠ y) < P_D(c*(x) ≠ y) + ε for every D, where c* = argmin_{c ∈ C} P_D(c(x) ≠ y).

11 Self-bounding learning (Freund 97)
1. Learner selects a concept class C.
2. Nature generates the training set T = (x_1,y_1), (x_2,y_2), …, (x_m,y_m) IID according to a distribution D over X × {-1,+1}.
3. Learner generates h: X → {-1,+1} and a bound ε_T such that, with high probability over the random choice of the training set T,
P_D(h(x) ≠ y) < P_D(c*(x) ≠ y) + ε_T.

12 Learning a region predictor (Vovk 2000)
1. Learner selects a concept class C.
2. Nature generates the training set (x_1,y_1), (x_2,y_2), …, (x_m,y_m) IID according to a distribution D over X × {-1,+1}.
3. Learner generates h: X → { {-1}, {+1}, {-1,+1}, {} } such that, with high probability,
P_D(y ∉ h(x)) < P_D(c*(x) ≠ y) + ε_1 and P_D(h(x) = {-1,+1}) < ε_2.

13 Intuitions. The rough idea.

14 A motivating example. Figure: a cloud of + and - labeled examples forming two well-separated groups, with a few points marked “?” in the region between them.

15 Distribution of errors. Figure: distributions of true and empirical errors on the interval [0, 1/2], in the worst case and in the typical case. Concepts whose empirical error is close to the best are “contenders for best”: predict with their majority vote. The non-contenders are ignored.

16 Main result. Finite concept class.

17 Notation. Data distribution: D over X × {-1,+1}. Generalization error: err(c) = P_D(c(x) ≠ y). Training set: T = (x_1,y_1), …, (x_m,y_m), drawn IID from D. Training error: êrr(c) = (1/m) |{ i : c(x_i) ≠ y_i }|.

18 The algorithm. Its ingredients: the parameters, the hypothesis weight, the empirical log ratio (ELR), and the prediction rule.
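A minimal sketch of a prediction rule of this shape, assuming exponential hypothesis weights exp(-eta * training error) and an abstention threshold delta; eta, delta, and all names below are assumptions for illustration rather than the exact formulas from the talk:

```python
import numpy as np

def empirical_log_ratio(x, concepts, train_x, train_y, eta=1.0):
    """Weight each concept by exp(-eta * its training error), then return
    (1/eta) * log of (total weight voting +1 at x) / (total weight voting -1 at x)."""
    w_plus = w_minus = 1e-12   # tiny guard so neither side is exactly zero
    for c in concepts:
        err = np.mean([c(xi) != yi for xi, yi in zip(train_x, train_y)])
        w = np.exp(-eta * err)
        if c(x) == 1:
            w_plus += w
        else:
            w_minus += w
    return float(np.log(w_plus / w_minus)) / eta

def predict_with_confidence(x, concepts, train_x, train_y, eta=1.0, delta=0.1):
    """Predict the sign of the ELR; return 0 ('unsure') when |ELR| <= delta."""
    elr = empirical_log_ratio(x, concepts, train_x, train_y, eta)
    if elr > delta:
        return +1
    if elr < -delta:
        return -1
    return 0   # unsure: both labels remain plausible
```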

19 Suggested tuning, and what it yields.

20 Main properties.
1. The ELR is very stable: the probability of large deviations is independent of the size of the concept class.
2. The expected value of the ELR is close to the True Log Ratio (TLR), which uses the true hypothesis errors instead of estimates.
3. The TLR is a good proxy for the best concept in the class.

21 McDiarmid’s theorem. If X_1, …, X_m are independent random variables and the function f satisfies the bounded-differences condition |f(x_1,…,x_i,…,x_m) − f(x_1,…,x_i′,…,x_m)| ≤ c_i for every i and all argument values, then P( f(X_1,…,X_m) − E[f] ≥ ε ) ≤ exp( −2ε² / Σ_i c_i² ).

22 The empirical log ratio is stable. Replacing a single training example changes each concept’s training error (the “training error with one example changed”) by at most 1/m, so the ELR satisfies the bounded-differences condition needed for McDiarmid’s theorem.

23 Bounded variation proof.

24 Infinite concept classes. Geometry of the concept class.

25 Infinite concept classes. The stated bounds are vacuous. How can we approximate an infinite class with a finite class? Unlabeled examples give useful information.

26 A metric space of classifiers. Figure: classifier space and example space, with two classifiers f and g at distance d(f,g) = P( f(x) ≠ g(x) ). Neighboring models make similar predictions.

27 ε-covers. Figure: an ε-cover of the classifier class within classifier space; the number of neighbors of a cover point increases as ε shrinks.

28 Computational issues. How do we compute the ε-cover? We can use unlabeled examples to generate the cover, and estimate the prediction by ignoring concepts with high error.
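A rough sketch of how unlabeled examples could be used for this, assuming the distance between two classifiers is estimated by their disagreement rate on the unlabeled sample and the cover is built greedily; this is an illustration, not the talk’s exact procedure:

```python
import numpy as np

def empirical_distance(f, g, unlabeled):
    """Estimate d(f,g) = P(f(x) != g(x)) by the disagreement rate on unlabeled examples."""
    return np.mean([f(x) != g(x) for x in unlabeled])

def greedy_epsilon_cover(concepts, unlabeled, eps):
    """Greedily pick representatives so every concept is within eps
    (in empirical distance) of some chosen representative."""
    cover = []
    for c in concepts:
        if all(empirical_distance(c, r, unlabeled) > eps for r in cover):
            cover.append(c)
    return cover
```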

29 Application: comparing “perfect” features. 45,000 features. Training examples: roughly 10^2 negative, 2-10 positive, and roughly 10^4 unlabeled. More than one feature has zero training error. Which feature(s) should we use? How should we combine them?

30 A typical “perfect” feature. Figure: number of images vs. feature value, shown separately for the negative, positive, and unlabeled examples.

31 Pseudo-Bayes for a single threshold. The set of possible thresholds is uncountably infinite. Using an ε-cover over the thresholds is equivalent to using the distribution of the unlabeled examples as the prior distribution over the set of thresholds.
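A hedged sketch of that equivalence in code, assuming threshold classifiers of one orientation (predict +1 when the feature exceeds t), exponential weights in the training error, and an assumed scale parameter eta; the names are illustrative:

```python
import numpy as np

def threshold_log_ratio(x, labeled_x, labeled_y, unlabeled_x, eta=1.0):
    """Pseudo-Bayes over threshold classifiers 'predict +1 when feature > t':
    the candidate thresholds t are the unlabeled feature values (playing the role
    of the prior), each weighted by exp(-eta * its training error)."""
    labeled_x = np.asarray(labeled_x, dtype=float)
    labeled_y = np.asarray(labeled_y)
    w_plus = w_minus = 1e-12   # guard against an empty side
    for t in unlabeled_x:
        err = np.mean(np.where(labeled_x > t, 1, -1) != labeled_y)
        w = np.exp(-eta * err)
        if x > t:
            w_plus += w
        else:
            w_minus += w
    return float(np.log(w_plus / w_minus)) / eta
```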

32 What it will do. Figure: the negative examples, the prior weights, and the error factor plotted along the feature-value axis, with the output scale running from 0 to +1.

33 Relation to large margins. Figure: a neighborhood of good classifiers. SVM and AdaBoost search for a linear discriminator with a large margin.

34 Relation to bagging. Bagging: generate classifiers from random subsets of the training set, then predict according to the majority vote among the classifiers. (Another possibility: flip the labels of a small random subset of the training set.) Bagging can be seen as a randomized estimate of the log ratio.
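One way to picture that connection, sketched under the assumption that `fit` is a user-supplied routine returning a classifier callable; the averaged vote plays a role loosely analogous to the log ratio:

```python
import numpy as np

def bagging_vote_margin(x, train_x, train_y, fit, n_rounds=50, rng=None):
    """Train a classifier on bootstrap samples of the training set and return the
    average vote in [-1, +1] at x; its magnitude serves as a rough confidence."""
    rng = rng or np.random.default_rng(0)
    m = len(train_y)
    votes = []
    for _ in range(n_rounds):
        idx = rng.integers(0, m, size=m)          # bootstrap resample
        h = fit([train_x[i] for i in idx], [train_y[i] for i in idx])
        votes.append(h(x))
    return float(np.mean(votes))
```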

35 Bias/variance for classification. Bias: the error of predicting with the sign of the True Log Ratio (infinite training set). Variance: the additional error from predicting with the sign of the Empirical Log Ratio, which is based on a finite training sample.

36 New directions. How a measure of confidence can help in practice.

37 Face detection. Paul Viola and Mike Jones developed a face detector that works in real time (15 frames per second).

38 Using confidence to save time. The detector combines 6,000 simple features using AdaBoost, yet in most boxes only 8-9 features are calculated. Figure: all boxes are scored on feature 1; most are rejected as “definitely not a face”, and only the boxes that “might be a face” go on to feature 2, and so on.
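A generic sketch of this early-rejection idea (not the Viola-Jones detector itself); `stages` and `thresholds` are assumed to be supplied by the caller:

```python
def cascade_classify(box, stages, thresholds):
    """Evaluate cheap feature stages in order; reject as soon as a stage's score
    falls below its threshold, so most boxes are rejected after a few features."""
    for stage, thresh in zip(stages, thresholds):
        if stage(box) < thresh:
            return False          # definitely not a face
    return True                   # survived every stage: might be a face
```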

39 Selective sampling. Figure: a loop in which a partially trained classifier scores the unlabeled data, a sample of the unconfident examples is sent for labeling, and the resulting labeled examples are fed back to train the classifier further.
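A minimal sketch of such a loop, assuming user-supplied `fit`, `oracle`, and `confidence` functions; the batching and round counts are arbitrary illustration choices:

```python
def selective_sampling(unlabeled, oracle, fit, confidence, rounds=10, batch=5):
    """Repeatedly train on the labeled pool, then ask the oracle to label the
    unlabeled examples the current classifier is least confident about."""
    labeled = []
    pool = list(unlabeled)
    for _ in range(rounds):
        clf = fit(labeled) if labeled else None
        # rank the remaining pool by confidence, least confident first
        pool.sort(key=lambda x: confidence(clf, x) if clf else 0.0)
        for x in pool[:batch]:
            labeled.append((x, oracle(x)))
        pool = pool[batch:]
    return fit(labeled)
```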

40 Co-training. Figure: images that might contain faces are described by two views, color info and shape info; a partially trained color-based classifier and a partially trained shape-based classifier each pass their confident predictions to the other as extra training data.
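A hedged sketch of a co-training loop in that spirit, assuming user-supplied `fit` and `confidence` functions and two pre-computed views per image; this is an illustration of the general scheme, not the specific system on the slide:

```python
def co_training(labeled, unlabeled, fit, confidence, rounds=5, thresh=0.9):
    """labeled: list of ((view_a, view_b), y); unlabeled: list of (view_a, view_b).
    fit trains on (view, y) pairs; confidence(clf, view) returns (label, score).
    Each classifier donates its confident labels to the other view's training set."""
    train_a = [(va, y) for (va, vb), y in labeled]
    train_b = [(vb, y) for (va, vb), y in labeled]
    pool = list(unlabeled)
    for _ in range(rounds):
        clf_a, clf_b = fit(train_a), fit(train_b)
        still_unlabeled = []
        for va, vb in pool:
            ya, sa = confidence(clf_a, va)
            yb, sb = confidence(clf_b, vb)
            if sa >= thresh:
                train_b.append((vb, ya))    # A's confident label trains B
            elif sb >= thresh:
                train_a.append((va, yb))    # B's confident label trains A
            else:
                still_unlabeled.append((va, vb))
        pool = still_unlabeled
    return fit(train_a), fit(train_b)
```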

41 Summary. Bayesian averaging is justifiable even without Bayesian assumptions. For infinite concept classes, use ε-covers. Efficient implementations (thresholds, SVM, boosting, bagging, …) are still largely open. Calibration (recent work of Vovk). A good measure of confidence is very important in practice. More than two classes (predicting with a subset of the labels).

