The Boosting Approach to Machine Learning


1 The Boosting Approach to Machine Learning
Maria-Florina Balcan 10/31/2016

2 Boosting
A general method for improving the accuracy of any given learning algorithm. Works by creating a series of challenge datasets such that even modest performance on these can be used to produce an overall high-accuracy predictor. Works amazingly well in practice: AdaBoost and its variations are among the top 10 machine learning algorithms. Backed up by solid theoretical foundations.

3 Readings and plan for today
Readings: "The Boosting Approach to Machine Learning: An Overview" (Rob Schapire, 2001) and "Theory and Applications of Boosting" (NIPS tutorial). Plan for today: motivation, a bit of history, and AdaBoost (algorithm, guarantees, discussion). The focus is on supervised classification.

4 An Example: Spam Detection
E.g., classify which emails are spam and which are important. Key observation/motivation: it is easy to find rules of thumb that are often correct, e.g., "If 'buy now' appears in the message, then predict spam," or "If 'say good-bye to debt' appears in the message, then predict spam." It is much harder to find a single rule that is very highly accurate.

5 An Example: Spam Detection
Boosting: a meta-procedure that takes in an algorithm for finding rules of thumb (a weak learner) and produces a highly accurate rule by calling the weak learner repeatedly on cleverly chosen datasets. Apply the weak learner to a subset of the emails to obtain a rule of thumb $h_1$; apply it to a second subset to obtain $h_2$; apply it to a third subset to obtain $h_3$; repeat $T$ times, then combine the weak rules $h_1, \ldots, h_T$ into a single highly accurate rule.
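Schematically, the meta-procedure on this slide might be written as the following sketch. It is an illustration of mine, not code from the slides; the `weak_learner`, `choose_dataset`, and `combine` callables are hypothetical stand-ins for "apply the weak learner to a cleverly chosen subset" and "combine the weak rules".

```python
def boost(examples, labels, weak_learner, choose_dataset, T, combine):
    """Generic boosting meta-procedure (illustrative sketch, not AdaBoost itself).

    weak_learner(xs, ys) -> a rule of thumb h with h(x) in {-1, +1}
    choose_dataset(examples, labels, rules) -> a subset chosen to be
        informative given the rules of thumb found so far
    combine(rules) -> a single combined rule (e.g., a majority vote)
    """
    rules = []
    for _ in range(T):
        xs, ys = choose_dataset(examples, labels, rules)
        rules.append(weak_learner(xs, ys))
    return combine(rules)

def majority_vote(rules):
    # Unweighted majority vote over the rules of thumb.
    return lambda x: 1 if sum(h(x) for h in rules) >= 0 else -1
```

AdaBoost, described later in the deck, instantiates the dataset-choosing step by reweighting the examples and the combining step by a weighted majority vote.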

6 Boosting: Important Aspects
How to choose examples on each round? Typically, concentrate on the "hardest" examples (those most often misclassified by the previous rules of thumb). How to combine the rules of thumb into a single prediction rule? Take a (weighted) majority vote of the rules of thumb.

7 Historically…

8 Weak Learning vs Strong Learning
[Kearns & Valiant '88] defined weak learning: being able to predict better than random guessing (error $\le \frac{1}{2} - \gamma$), consistently. They posed an open problem: "Does there exist a boosting algorithm that turns a weak learner into a strong learner (one that can produce arbitrarily accurate hypotheses)?" Informally, given a "weak" learning algorithm that can consistently find classifiers of error $\le \frac{1}{2} - \gamma$, a boosting algorithm would provably construct a single classifier with error $\le \epsilon$.

9 Weak Learning = Strong Learning
Surprisingly, weak learning = strong learning. Original construction [Schapire '89]: a poly-time boosting algorithm that exploits the fact that we can learn a little on every distribution. A modest booster is obtained by calling the weak learning algorithm on 3 distributions: an error of $\beta < \frac{1}{2} - \gamma$ becomes an error of $3\beta^2 - 2\beta^3$. Learn $h_1$ from $D_0$ with error $< 49\%$. Modify $D_0$ so that $h_1$ has error $50\%$ (call this $D_1$): flip a coin; if heads, wait for an example $x$ where $h_1(x) = f(x)$, otherwise wait for an example where $h_1(x) \ne f(x)$. Learn $h_2$ from $D_1$ with error $< 49\%$. Modify $D_1$ so that $h_1$ and $h_2$ always disagree (call this $D_2$). Learn $h_3$ from $D_2$ with error $< 49\%$. Now take the vote of $h_1$, $h_2$, and $h_3$: this has error better than any of the "weak" hypotheses $h_1$, $h_2$, or $h_3$. Repeating this as needed, i.e., applying the construction recursively, amplifies the modest boost of accuracy and lowers the error further. Cool conceptually and technically, but not very practical.
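As a quick numeric illustration of the $\beta \rightarrow 3\beta^2 - 2\beta^3$ amplification (my own check, not from the slides), iterating the map starting from a weak error of 0.49 shows the error inching below $\frac{1}{2}$ and then heading toward 0:

```python
def amplified_error(beta):
    # Error of the majority vote of h1, h2, h3 in the three-distribution
    # construction, when each weak hypothesis has error beta.
    return 3 * beta**2 - 2 * beta**3

beta = 0.49
for level in range(1, 8):
    beta = amplified_error(beta)
    print(f"after {level} level(s) of recursion: error ~ {beta:.4f}")
# ~0.4850 after one level, ~0.4775 after two, then falling toward 0
# as the construction is applied recursively.
```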

10 An explosion of subsequent work

11 Adaboost (Adaptive Boosting)
"A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting" [Freund & Schapire, JCSS '97]. Gödel Prize winner, 2003.

12 Informal Description of AdaBoost
Boosting turns a weak algorithm into a strong learner. Input: $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$ with $x_i \in X$ and $y_i \in Y = \{-1, 1\}$, plus a weak learning algorithm A (e.g., Naïve Bayes, decision stumps). For $t = 1, 2, \ldots, T$: construct a distribution $D_t$ on $\{x_1, \ldots, x_m\}$; run A on $D_t$, producing a weak classifier $h_t: X \to \{-1, 1\}$; let $\epsilon_t = \Pr_{x_i \sim D_t}[h_t(x_i) \ne y_i]$ be the error of $h_t$ over $D_t$. Output $H_{final}(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$. Roughly speaking, $D_{t+1}$ increases the weight on $x_i$ if $h_t$ is incorrect on $x_i$ and decreases it if $h_t$ is correct.

13 Adaboost (Adaptive Boosting)
Given a weak learning algorithm A. For $t = 1, 2, \ldots, T$: construct $D_t$ on $\{x_1, \ldots, x_m\}$ and run A on $D_t$, producing $h_t$.
Constructing $D_t$: $D_1$ is uniform on $\{x_1, \ldots, x_m\}$ [i.e., $D_1(i) = \frac{1}{m}$]. Given $D_t$ and $h_t$, set
$D_{t+1}(i) = \frac{D_t(i)}{Z_t}\, e^{-\alpha_t y_i h_t(x_i)}$,
i.e., $D_{t+1}(i) = \frac{D_t(i)}{Z_t}\, e^{-\alpha_t}$ if $y_i = h_t(x_i)$ and $D_{t+1}(i) = \frac{D_t(i)}{Z_t}\, e^{\alpha_t}$ if $y_i \ne h_t(x_i)$,
where $Z_t = \sum_i D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}$ is the normalizer and $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right) > 0$.
$D_{t+1}$ puts half of its weight on the examples $x_i$ where $h_t$ is incorrect and half on the examples where $h_t$ is correct.
Final hypothesis: $H_{final}(x) = \mathrm{sign}\left(\sum_t \alpha_t h_t(x)\right)$.
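Below is a minimal NumPy sketch of this procedure, added for concreteness. It is an illustration rather than code from the lecture; the weak-learner interface, the function names, and the small clipping constant are my own assumptions.

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """Minimal AdaBoost sketch (illustrative).

    X: (m, d) array of examples; y: (m,) array of labels in {-1, +1}.
    weak_learner(X, y, D) -> a function h with h(X) in {-1, +1}^m,
        trained to have small weighted error under the distribution D.
    Returns a list of (alpha_t, h_t) pairs defining H_final.
    """
    m = len(y)
    D = np.full(m, 1.0 / m)                 # D_1 is uniform
    ensemble = []
    for _ in range(T):
        h = weak_learner(X, y, D)
        pred = h(X)
        eps = D[pred != y].sum()            # weighted error eps_t of h_t under D_t
        eps = float(np.clip(eps, 1e-12, 1 - 1e-12))   # guard against division by zero (assumption)
        alpha = 0.5 * np.log((1 - eps) / eps)
        D = D * np.exp(-alpha * y * pred)   # down-weight correct, up-weight mistaken examples
        D /= D.sum()                        # divide by Z_t so D_{t+1} is a distribution
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, X):
    # H_final(x) = sign( sum_t alpha_t h_t(x) )
    votes = sum(alpha * h(X) for alpha, h in ensemble)
    return np.where(votes >= 0, 1, -1)
```

The decision-stump learner sketched after the toy example below can be passed in as `weak_learner`.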

14 Adaboost: A toy example
Weak classifiers: vertical or horizontal half-planes (a.k.a. decision stumps)
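For these weak classifiers (axis-aligned half-planes, i.e., decision stumps), a weak learner could look like the following sketch. This is again my own illustration with assumed names, not code from the lecture; it exhaustively tries every feature, threshold, and sign and keeps the stump with the smallest weighted error under $D_t$.

```python
import numpy as np

def stump_learner(X, y, D):
    """Best one-feature threshold classifier under the weights D (illustrative)."""
    m, d = X.shape
    best_err, best_params = np.inf, None
    for j in range(d):                        # feature index (x- or y-axis in the 2-D toy example)
        for thr in np.unique(X[:, j]):        # candidate threshold
            for sign in (+1, -1):             # which side of the half-plane is labeled +1
                pred = np.where(X[:, j] >= thr, sign, -sign)
                err = D[pred != y].sum()      # weighted error under D
                if err < best_err:
                    best_err, best_params = err, (j, thr, sign)
    j, thr, sign = best_params
    return lambda Z: np.where(Z[:, j] >= thr, sign, -sign)
```

Plugged into the `adaboost` sketch above, a few rounds of such stumps reproduce the flavor of the toy example in the next slides.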

15 Adaboost: A toy example

16 Adaboost: A toy example

17 Adaboost (Adaptive Boosting)
Recap of the construction. For $t = 1, 2, \ldots, T$: construct $D_t$ on $\{x_1, \ldots, x_m\}$ and run A on $D_t$, producing $h_t$. $D_1$ is uniform [$D_1(i) = \frac{1}{m}$]; given $D_t$ and $h_t$, $D_{t+1}(i) = \frac{D_t(i)}{Z_t}\, e^{-\alpha_t y_i h_t(x_i)}$, i.e., the weight of $x_i$ is multiplied by $e^{-\alpha_t}$ if $y_i = h_t(x_i)$ and by $e^{\alpha_t}$ if $y_i \ne h_t(x_i)$, with $\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right) > 0$. $D_{t+1}$ puts half of its weight on the examples where $h_t$ is incorrect and half on the examples where it is correct. Final hypothesis: $H_{final}(x) = \mathrm{sign}\left(\sum_t \alpha_t h_t(x)\right)$.

18 Nice Features of Adaboost
Very general: a meta-procedure, it can use any weak learning algorithm (e.g., Naïve Bayes, decision stumps). Very fast (a single pass through the data each round) and simple to code, with no parameters to tune. Shift in mindset: the goal is now just to find classifiers a bit better than random guessing. Grounded in rich theory.

19 Guarantees about the Training Error
Theorem: with $\epsilon_t = \frac{1}{2} - \gamma_t$ (the error of $h_t$ over $D_t$),
$err_S(H_{final}) \le \exp\left(-2\sum_t \gamma_t^2\right)$.
So, if $\gamma_t \ge \gamma > 0$ for all $t$, then $err_S(H_{final}) \le \exp(-2\gamma^2 T)$: the training error drops exponentially in $T$! To get $err_S(H_{final}) \le \epsilon$, we need only $T = O\left(\frac{1}{\gamma^2}\log\frac{1}{\epsilon}\right)$ rounds. AdaBoost is adaptive: it does not need to know $\gamma$ or $T$ a priori, and it can exploit rounds with $\gamma_t \gg \gamma$.
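To make the $T = O\left(\frac{1}{\gamma^2}\log\frac{1}{\epsilon}\right)$ statement concrete, here is a tiny calculation with illustrative numbers of my own choosing (not from the slides):

```python
import math

def rounds_needed(gamma, eps):
    # Smallest T with exp(-2 * gamma**2 * T) <= eps,
    # i.e. T >= ln(1/eps) / (2 * gamma**2).
    return math.ceil(math.log(1 / eps) / (2 * gamma**2))

print(rounds_needed(gamma=0.10, eps=0.01))   # 231 rounds suffice for the bound
print(rounds_needed(gamma=0.25, eps=0.01))   # 37 rounds
```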

20 Understanding the Updates & Normalization
Claim: $D_{t+1}$ puts half of the weight on the $x_i$ where $h_t$ was incorrect and half of the weight on the $x_i$ where $h_t$ was correct. Recall $D_{t+1}(i) = \frac{D_t(i)}{Z_t}\, e^{-\alpha_t y_i h_t(x_i)}$. The weight on the mistakes is
$\Pr_{D_{t+1}}[y_i \ne h_t(x_i)] = \sum_{i:\, y_i \ne h_t(x_i)} \frac{D_t(i)}{Z_t}\, e^{\alpha_t} = \frac{\epsilon_t}{Z_t}\, e^{\alpha_t} = \frac{\epsilon_t}{Z_t}\sqrt{\frac{1-\epsilon_t}{\epsilon_t}} = \frac{\sqrt{\epsilon_t(1-\epsilon_t)}}{Z_t}$,
and the weight on the correct examples is
$\Pr_{D_{t+1}}[y_i = h_t(x_i)] = \sum_{i:\, y_i = h_t(x_i)} \frac{D_t(i)}{Z_t}\, e^{-\alpha_t} = \frac{1-\epsilon_t}{Z_t}\, e^{-\alpha_t} = \frac{1-\epsilon_t}{Z_t}\sqrt{\frac{\epsilon_t}{1-\epsilon_t}} = \frac{\sqrt{\epsilon_t(1-\epsilon_t)}}{Z_t}$.
The two probabilities are equal. Moreover,
$Z_t = \sum_i D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} = \sum_{i:\, y_i = h_t(x_i)} D_t(i)\, e^{-\alpha_t} + \sum_{i:\, y_i \ne h_t(x_i)} D_t(i)\, e^{\alpha_t} = (1-\epsilon_t)e^{-\alpha_t} + \epsilon_t e^{\alpha_t} = 2\sqrt{\epsilon_t(1-\epsilon_t)}$,
so each side receives weight exactly $\frac{1}{2}$.
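A quick numeric check of the claim (an illustration of mine, with arbitrary made-up weights and a pretend $h_t$):

```python
import numpy as np

# Arbitrary distribution D_t over m = 5 examples, and a pretend h_t that is
# correct on the first three examples and wrong on the last two.
D = np.array([0.10, 0.30, 0.20, 0.25, 0.15])
correct = np.array([True, True, True, False, False])

eps = D[~correct].sum()                       # eps_t = 0.4
alpha = 0.5 * np.log((1 - eps) / eps)         # alpha_t = 0.5 * ln(0.6 / 0.4)

D_next = D * np.exp(np.where(correct, -alpha, alpha))
D_next /= D_next.sum()                        # divide by the normalizer Z_t

print(D_next[~correct].sum())   # 0.5 -- half of the weight on the mistakes
print(D_next[correct].sum())    # 0.5 -- half on the correctly classified examples
```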

21 Generalization Guarantees (informal)
Theorem: $err_S(H_{final}) \le \exp\left(-2\sum_t \gamma_t^2\right)$, where $\epsilon_t = \frac{1}{2} - \gamma_t$. How about generalization guarantees? Original analysis [Freund & Schapire '97]: let $H$ be the set of rules that the weak learner can use, and let $G$ be the set of weighted majority rules over $T$ elements of $H$ (i.e., the things that AdaBoost might output). Theorem [Freund & Schapire '97]: for all $g \in G$, $err(g) \le err_S(g) + \tilde{O}\left(\sqrt{\frac{Td}{m}}\right)$, where $T$ is the number of rounds and $d$ is the VC dimension of $H$.

22 Generalization Guarantees
Theorem [Freund & Schapire '97]: for all $g \in G$, $err(g) \le err_S(g) + \tilde{O}\left(\sqrt{\frac{Td}{m}}\right)$, where $T$ is the number of rounds and $d$ is the VC dimension of $H$. [Figure: generalization error as a function of $T$ = number of rounds, decomposed into train error plus a complexity term that grows with $T$.]

23 Generalization Guarantees
Experiments with boosting showed that the test error of the generated classifier usually does not increase as its size becomes very large. Experiments showed that continuing to add new weak learners after correct classification of the training set had been achieved could further improve test set performance!!!

24 Generalization Guarantees
Experiments with boosting showed that the test error of the generated classifier usually does not increase as its size becomes very large. Experiments also showed that continuing to add new weak learners after correct classification of the training set had been achieved could further improve test-set performance! These results seem to contradict the [Freund & Schapire '97] bound and Occam's razor (which says that in order to achieve good test error the classifier should be as simple as possible)!

25 How can we explain the experiments?
R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee present a nice theoretical explanation in "Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods." Key idea: training error does not tell the whole story; we also need to consider the classification confidence.

26 What you should know
The difference between weak and strong learners. The AdaBoost algorithm and the intuition for the distribution update step. The training error bound for AdaBoost, and the overfitting/generalization aspects.

27 Advanced (Not required) Material

28 Analyzing Training Error: Proof Intuition
Theorem: with $\epsilon_t = \frac{1}{2} - \gamma_t$ (the error of $h_t$ over $D_t$), $err_S(H_{final}) \le \exp\left(-2\sum_t \gamma_t^2\right)$. On round $t$, we increase the weight of the $x_i$ for which $h_t$ is wrong. If $H_{final}$ incorrectly classifies $x_i$, then $x_i$ is incorrectly classified by the (weighted) majority of the $h_t$'s, which implies that the final probability weight of $x_i$ is large: one can show it is $\ge \frac{1}{m}\cdot\frac{1}{\prod_t Z_t}$. Since the probabilities sum to 1, there cannot be too many examples of such high weight: one can show that the number of incorrectly classified examples is $\le m\prod_t Z_t$. And $\prod_t Z_t \to 0$.

29 Analyzing Training Error: Proof Math
Step 1 (unwrapping the recurrence): $D_{T+1}(i) = \frac{1}{m}\cdot\frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$, where $f(x_i) = \sum_t \alpha_t h_t(x_i)$ [the unthresholded weighted vote of the $h_t$'s on $x_i$]. Step 2: $err_S(H_{final}) \le \prod_t Z_t$. Step 3: $\prod_t Z_t = \prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)} = \prod_t \sqrt{1-4\gamma_t^2} \le e^{-2\sum_t \gamma_t^2}$.

30 Analyzing Training Error: Proof Math
Step 1 (unwrapping the recurrence): $D_{T+1}(i) = \frac{1}{m}\cdot\frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$, where $f(x_i) = \sum_t \alpha_t h_t(x_i)$. Recall $D_1(i) = \frac{1}{m}$ and $D_{t+1}(i) = \frac{D_t(i)\exp(-y_i \alpha_t h_t(x_i))}{Z_t}$. Then
$D_{T+1}(i) = \frac{\exp(-y_i \alpha_T h_T(x_i))}{Z_T}\cdot D_T(i) = \frac{\exp(-y_i \alpha_T h_T(x_i))}{Z_T}\cdot\frac{\exp(-y_i \alpha_{T-1} h_{T-1}(x_i))}{Z_{T-1}}\cdot D_{T-1}(i) = \cdots = \frac{\exp(-y_i \alpha_T h_T(x_i))}{Z_T}\cdots\frac{\exp(-y_i \alpha_1 h_1(x_i))}{Z_1}\cdot\frac{1}{m} = \frac{1}{m}\cdot\frac{\exp(-y_i(\alpha_1 h_1(x_i) + \cdots + \alpha_T h_T(x_i)))}{Z_1\cdots Z_T} = \frac{1}{m}\cdot\frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$.

31 Analyzing Training Error: Proof Math
Step 1 (unwrapping the recurrence): $D_{T+1}(i) = \frac{1}{m}\cdot\frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$, where $f(x_i) = \sum_t \alpha_t h_t(x_i)$. Step 2: $err_S(H_{final}) \le \prod_t Z_t$. Indeed, since the 0/1 loss is upper-bounded by the exponential loss ($\mathbf{1}[z \le 0] \le e^{-z}$),
$err_S(H_{final}) = \frac{1}{m}\sum_i \mathbf{1}[y_i \ne H_{final}(x_i)] = \frac{1}{m}\sum_i \mathbf{1}[y_i f(x_i) \le 0] \le \frac{1}{m}\sum_i \exp(-y_i f(x_i)) = \sum_i D_{T+1}(i)\prod_t Z_t = \prod_t Z_t$.

32 Analyzing Training Error: Proof Math
Step 1 (unwrapping the recurrence): $D_{T+1}(i) = \frac{1}{m}\cdot\frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$, where $f(x_i) = \sum_t \alpha_t h_t(x_i)$. Step 2: $err_S(H_{final}) \le \prod_t Z_t$. Step 3: $\prod_t Z_t = \prod_t 2\sqrt{\epsilon_t(1-\epsilon_t)} = \prod_t \sqrt{1-4\gamma_t^2} \le e^{-2\sum_t \gamma_t^2}$. Note: recall that $Z_t = (1-\epsilon_t)e^{-\alpha_t} + \epsilon_t e^{\alpha_t} = 2\sqrt{\epsilon_t(1-\epsilon_t)}$, and that $\alpha_t$ is the minimizer of $\alpha \mapsto (1-\epsilon_t)e^{-\alpha} + \epsilon_t e^{\alpha}$.
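For completeness, a short derivation (not on the original slide) of the two facts used in Step 3: that $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ minimizes $Z_t$, and that $\sqrt{1-4\gamma_t^2} \le e^{-2\gamma_t^2}$.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Setting the derivative of $Z_t(\alpha) = (1-\epsilon_t)e^{-\alpha} + \epsilon_t e^{\alpha}$ to zero,
\[
  -(1-\epsilon_t)e^{-\alpha} + \epsilon_t e^{\alpha} = 0
  \;\Longrightarrow\; e^{2\alpha} = \frac{1-\epsilon_t}{\epsilon_t}
  \;\Longrightarrow\; \alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}.
\]
Plugging $e^{\alpha_t} = \sqrt{(1-\epsilon_t)/\epsilon_t}$ back in,
\[
  Z_t = (1-\epsilon_t)\sqrt{\tfrac{\epsilon_t}{1-\epsilon_t}}
        + \epsilon_t\sqrt{\tfrac{1-\epsilon_t}{\epsilon_t}}
      = 2\sqrt{\epsilon_t(1-\epsilon_t)}
      = \sqrt{1-4\gamma_t^2},
\]
using $\epsilon_t = \tfrac{1}{2}-\gamma_t$. Finally, $1+x \le e^{x}$ gives
$1-4\gamma_t^2 \le e^{-4\gamma_t^2}$, hence
$\sqrt{1-4\gamma_t^2} \le e^{-2\gamma_t^2}$ and
$\prod_t Z_t \le e^{-2\sum_t \gamma_t^2}$.
\end{document}
```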

