1 On-line Learning and Boosting
Overview of "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting," by Freund and Schapire (1997).
Tim Miller, University of Minnesota, Department of Computer Science and Engineering

2 Hedge - Motivation
- A generalization of the Weighted Majority Algorithm
- Given a set of expert predictions, minimize mistakes over time
- Slight emphasis in the motivation on the possibility of treating the weight vector w^t as a prior

3 Hedge Algorithm
- Parameters: β ∈ [0, 1], initial weight vector w^1, number of trials T
- For t = 1..T:
  1. Choose allocation p^t (a probability distribution formed from the weights)
  2. Receive loss vector ℓ^t
  3. Suffer loss p^t · ℓ^t
  4. Set the new weight vector to w^{t+1} = w^t · β^{ℓ^t} (componentwise)
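
A minimal Python sketch of this loop (the function name hedge and the T × N layout of the losses array are illustrative choices, not the paper's notation):

```python
import numpy as np

def hedge(beta, w_init, losses):
    """Sketch of Hedge(beta): `losses` is a T x N array holding the loss of
    each of the N strategies on each of the T trials, each loss in [0, 1]."""
    w = np.asarray(w_init, dtype=float)        # w^1: initial weights
    total_loss = 0.0
    for loss in np.asarray(losses, dtype=float):
        p = w / w.sum()                        # 1. allocation p^t from the weights
        total_loss += p @ loss                 # 2-3. receive l^t, suffer p^t . l^t
        w = w * beta ** loss                   # 4. w^{t+1} = w^t * beta^{l^t}
    return total_loss, w
```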

4 Hedge Analysis
- Hedge does not perform "too much worse" than the best strategy: for any strategy i,
  L_Hedge(β) ≤ (−ln(w_i^1) − L_i ln β) · Z, where Z = 1 / (1 − β)
- Is it possible to do better?

5 Boosting
- If we have n classifiers, possibly looking at the problem from different perspectives, how can we optimally combine them?
- Example: given a collection of "rules of thumb" for predicting horse races, how should they be weighted?

6 Definitions
- Given labeled data, where c is the target concept, c: X → {0, 1}, and c ∈ C, the concept class
- Strong PAC-learning algorithm: for parameters ε, δ > 0, the hypothesis has error at most ε with probability at least 1 − δ
- Weak learning algorithm: error ε ≤ 0.5 − γ for some γ > 0
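
In LaTeX, the two conditions the slide contrasts can be written as follows (a sketch of the standard PAC phrasing, not the paper's exact statement):

```latex
% Strong PAC learning: for every choice of epsilon, delta > 0,
\Pr\bigl[\,\mathrm{err}(h) \le \epsilon\,\bigr] \ge 1 - \delta
% Weak learning: only for some fixed gamma > 0,
\Pr\bigl[\,\mathrm{err}(h) \le \tfrac{1}{2} - \gamma\,\bigr] \ge 1 - \delta
```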

7 AdaBoost Algorithm
- Input:
  - A sequence of N labeled examples
  - A distribution D over the N examples
  - A weak learning algorithm (called WeakLearn)
  - The number of iterations T

8 AdaBoost (contd.)
- Initialize: w^1 = D
- For t = 1..T:
  1. Form probability distribution p^t from w^t
  2. Call WeakLearn with distribution p^t; get back hypothesis h_t
  3. Calculate the error ε_t = Σ_{i=1..N} p_i^t |h_t(x_i) − y_i|
  4. Set β_t = ε_t / (1 − ε_t)
  5. Multiplicatively adjust the weights: w_i^{t+1} = w_i^t · β_t^{1 − |h_t(x_i) − y_i|}
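
A minimal Python sketch of steps 1-5, assuming a hypothetical weak_learn(X, y, p) that returns a hypothesis h such that h(X) is an array of predictions in [0, 1]:

```python
import numpy as np

def adaboost(X, y, D, weak_learn, T):
    """Sketch of the AdaBoost training loop for labels y in {0, 1}."""
    w = np.asarray(D, dtype=float)             # initialize w^1 = D
    hypotheses, betas = [], []
    for t in range(T):
        p = w / w.sum()                        # 1. distribution p^t from w^t
        h = weak_learn(X, y, p)                # 2. call WeakLearn
        err = np.abs(h(X) - np.asarray(y))     # |h_t(x_i) - y_i| per example
        eps = p @ err                          # 3. weighted error eps_t
        beta = eps / (1.0 - eps)               # 4. beta_t
        w = w * beta ** (1.0 - err)            # 5. shrink weights of easy examples
        hypotheses.append(h)
        betas.append(beta)
    return hypotheses, betas
```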

9 AdaBoost Output
- Output 1 if Σ_{t=1..T} (log 1/β_t) h_t(x) ≥ ½ Σ_{t=1..T} log(1/β_t), and 0 otherwise
- This is a weighted majority vote: a weighted average of the h_t(x), thresholded at ½
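
A sketch of this final vote in Python, assuming each h_t can also be evaluated on a single example x:

```python
import numpy as np

def adaboost_predict(hypotheses, betas, x):
    """Output 1 iff the log(1/beta_t)-weighted sum of the h_t(x) is at least
    half of the total vote weight."""
    alphas = np.log(1.0 / np.asarray(betas))       # vote weight log(1/beta_t)
    votes = np.array([h(x) for h in hypotheses])   # h_t(x) in [0, 1]
    return int(alphas @ votes >= 0.5 * alphas.sum())
```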

10 AdaBoost Analysis
- Note the "dual" relationship with Hedge:
  - Strategies ↔ Examples
  - Trials ↔ Weak hypotheses
- Hedge increases the weight of successful strategies; AdaBoost increases the weight of difficult examples
- AdaBoost has a dynamic β (β_t varies per round, while Hedge's β is fixed)

11 AdaBoost Bounds
- ε ≤ 2^T ∏_{t=1..T} √(ε_t (1 − ε_t))
- Previous bounds depended on the maximum error of the weakest hypothesis ("weak link" syndrome)
- AdaBoost takes advantage of gains from the best hypotheses

12 Multi-class Setting
- k > 2 output labels, i.e. Y = {1, 2, …, k}
- Error: the probability of an incorrect prediction
- Two algorithms:
  - AdaBoost.M1 – more direct
  - AdaBoost.M2 – somewhat complex constraints on the weak learners
- Could also just divide the problem into "one vs. one" or "one vs. all" categories

13 AdaBoost.M1
- Requires each classifier to have error less than 50% (a stronger requirement than in the binary case)
- Similar to the regular AdaBoost algorithm except:
  - The per-example loss is 1 if h_t(x_i) ≠ y_i, and 0 otherwise
  - Can't use weak learners with error > 0.5
  - The algorithm outputs a vector of length k with values between 0 and 1
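
A minimal sketch of the M1-style final vote in Python (the name m1_predict and the assumption that each weak hypothesis returns an integer label in {0, …, k−1} are mine):

```python
import numpy as np

def m1_predict(hypotheses, betas, x, k):
    """Each weak hypothesis h_t(x) names one of the k labels; that label
    collects log(1/beta_t) vote weight, and the heaviest label wins."""
    scores = np.zeros(k)
    for h, beta in zip(hypotheses, betas):
        scores[h(x)] += np.log(1.0 / beta)    # weighted vote for h's predicted label
    return int(np.argmax(scores))
```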

14 AdaBoost.M1 Analysis
- ε ≤ 2^T ∏_{t=1..T} √(ε_t (1 − ε_t))
- The same as the bound for regular AdaBoost
- The proof converts the multi-class problem to a binary setup
- Can we improve this algorithm?

15 AdaBoost.M2
- More expressive, with more complex constraints on the weak hypotheses
- Defines the idea of "pseudo-loss"
- The pseudo-loss of each weak hypothesis must be better than chance
- Benefit: allows contributions from hypotheses with accuracy < 0.5

16 Pseudo-loss
- Replaces the straightforward loss of AdaBoost.M1:
  ploss_q(h, i) = ½ (1 − h(x_i, y_i) + Σ_{y ≠ y_i} q(i, y) h(x_i, y))
- Intuition: for each incorrect label, pit it against the known label in a binary classification (the second term), then take a weighted average
- Makes use of the information in the entire hypothesis vector, not just the prediction
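
A small Python sketch of this formula for one example (h(x, y) is assumed to return a plausibility in [0, 1], and q_i[y] holds the weights q(i, y) over the incorrect labels):

```python
def pseudo_loss(h, q_i, x_i, y_i, k):
    """ploss_q(h, i) = 1/2 * (1 - h(x_i, y_i) + sum_{y != y_i} q(i, y) h(x_i, y))."""
    weighted_wrong = sum(q_i[y] * h(x_i, y) for y in range(k) if y != y_i)
    return 0.5 * (1.0 - h(x_i, y_i) + weighted_wrong)
```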

17 AdaBoost.M2 Details
- Extra initialization: w^1_{i,y} = D(i) / (k − 1) for each incorrect label y ≠ y_i
- For each iteration t = 1..T:
  - W^t_i = Σ_{y ≠ y_i} w^t_{i,y}
  - q_t(i, y) = w^t_{i,y} / W^t_i
  - D_t(i) = W^t_i / Σ_{i=1..N} W^t_i
  - WeakLearn gets D_t as well as q_t
  - Calculate the pseudo-loss ε_t as on the previous slide
  - β_t = ε_t / (1 − ε_t)
  - w^{t+1}_{i,y} = w^t_{i,y} · β_t^{½ (1 + h_t(x_i, y_i) − h_t(x_i, y))}
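
A rough NumPy sketch of one round of this bookkeeping (the array layout, with an N × k weight matrix w and an N × k matrix h_vals of h_t(x_i, y) values, is my own framing rather than the paper's notation):

```python
import numpy as np

def m2_round(w, y, h_vals, k):
    """One AdaBoost.M2 round: `w` is the N x k weight matrix w^t_{i,y},
    `y` the integer true labels, `h_vals[i, y]` the value h_t(x_i, y)."""
    N = len(y)
    mask = np.ones((N, k), dtype=bool)
    mask[np.arange(N), y] = False                  # positions with y != y_i
    W = (w * mask).sum(axis=1)                     # W^t_i = sum_{y != y_i} w^t_{i,y}
    q = np.where(mask, w / W[:, None], 0.0)        # q_t(i, y) = w^t_{i,y} / W^t_i
    D = W / W.sum()                                # D_t(i) = W^t_i / sum_i W^t_i

    # pseudo-loss over all examples, then beta_t and the weight update
    h_true = h_vals[np.arange(N), y]               # h_t(x_i, y_i)
    eps = 0.5 * (D @ (1.0 - h_true + (q * h_vals).sum(axis=1)))
    beta = eps / (1.0 - eps)
    expo = 0.5 * (1.0 + h_true[:, None] - h_vals)
    w_next = w * beta ** expo                      # the paper only keeps y != y_i entries;
    return w_next, q, D, eps, beta                 # the true-label column is masked out anyway
```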

18 Error Bounds
- ε ≤ (k − 1) 2^T ∏_{t=1..T} √(ε_t (1 − ε_t)), where ε is the traditional (multi-class) error and the ε_t are pseudo-losses

19 Regression Setting
- Instead of picking from a discrete set of output labels, choose a continuous value; more formally, Y = [0, 1]
- Minimize the mean squared error E[(h(x) − y)^2]
- Reduce to binary classification and use AdaBoost!

20 How it works (roughly)
- For each example (x_i, y_i) in the training set, create a continuum of associated instances x̃ = (x_i, y) for y ∈ [0, 1]
- The label of (x_i, y) is 1 if y ≥ y_i, and 0 otherwise
- This maps to an infinite training set, so the discrete distributions must be converted to density functions
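
A tiny Python sketch of this reduction for a single training pair, using a discretized grid of thresholds instead of the continuum (the helper name regression_to_binary is mine):

```python
def regression_to_binary(x_i, y_i, y_grid):
    """Each threshold y in the grid over [0, 1] yields an associated
    instance (x_i, y) whose binary label is 1 iff y >= y_i."""
    return [((x_i, y), int(y >= y_i)) for y in y_grid]

# e.g. regression_to_binary("x1", 0.3, [0.0, 0.25, 0.5, 0.75, 1.0])
#   -> [(('x1', 0.0), 0), (('x1', 0.25), 0), (('x1', 0.5), 1),
#       (('x1', 0.75), 1), (('x1', 1.0), 1)]
```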

21 AdaBoost.R Bounds
- ε ≤ 2^T ∏_{t=1..T} √(ε_t (1 − ε_t))

22 Conclusions
- Starting from an on-line learning perspective, it is possible to generalize to boosting
- Boosting can take weak learners and convert them into strong learners
- This paper presented several boosting algorithms, with proofs of error bounds

