The Boosting Approach to Machine Learning


The Boosting Approach to Machine Learning Maria-Florina Balcan 10/31/2016

Boosting: a general method for improving the accuracy of any given learning algorithm. Works by creating a series of challenge datasets such that even modest performance on these can be used to produce an overall high-accuracy predictor. Works amazingly well in practice: AdaBoost and its variations are among the top 10 data mining algorithms. Backed up by solid theoretical foundations.

Readings:
- The Boosting Approach to Machine Learning: An Overview. Rob Schapire, 2001.
- Theory and Applications of Boosting. NIPS tutorial. http://www.cs.princeton.edu/~schapire/talks/nips-tutorial.pdf
Plan for today: motivation; a bit of history; AdaBoost (algorithm, guarantees, discussion). Focus on supervised classification.

An Example: Spam Detection. E.g., classify which emails are spam and which are important. Key observation/motivation: it is easy to find rules of thumb that are often correct, e.g., "If 'buy now' appears in the message, then predict spam," or "If 'say good-bye to debt' appears in the message, then predict spam." It is much harder to find a single rule that is very highly accurate.

An Example: Spam Detection. Boosting is a meta-procedure that takes in an algorithm for finding rules of thumb (a weak learner) and produces a highly accurate rule by calling the weak learner repeatedly on cleverly chosen datasets:
- apply the weak learner to a subset of emails, obtain a rule of thumb $h_1$;
- apply it to a 2nd subset of emails, obtain a 2nd rule of thumb $h_2$;
- apply it to a 3rd subset of emails, obtain a 3rd rule of thumb $h_3$;
- repeat $T$ times; combine the weak rules $h_1, \dots, h_T$ into a single highly accurate rule.
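To make the idea of a rule of thumb concrete, here is a minimal sketch in Python (my own illustration, not code from the slides); the keywords and example messages are hypothetical.

```python
# A minimal sketch of "rules of thumb" for spam detection (hypothetical keywords).
# Each rule maps an email's text to a label: +1 = spam, -1 = not spam.

def rule_buy_now(text: str) -> int:
    """Predict spam if the phrase 'buy now' appears in the message."""
    return +1 if "buy now" in text.lower() else -1

def rule_debt(text: str) -> int:
    """Predict spam if the phrase 'say good-bye to debt' appears in the message."""
    return +1 if "say good-bye to debt" in text.lower() else -1

# Individually these rules are only "often correct"; boosting will combine many
# such weak rules into a single highly accurate predictor.
emails = ["Buy now and save big!", "Meeting moved to 3pm", "Say good-bye to debt today"]
print([rule_buy_now(e) for e in emails])   # [1, -1, -1]
print([rule_debt(e) for e in emails])      # [-1, -1, 1]
```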

Boosting: Important Aspects. How to choose examples on each round? Typically, concentrate on the "hardest" examples (those most often misclassified by the previous rules of thumb). How to combine the rules of thumb into a single prediction rule? Take a (weighted) majority vote of the rules of thumb, as in the sketch below.
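A tiny Python sketch of the weighted majority vote (the weak rules and weights here are made-up placeholders):

```python
def weighted_majority_vote(x, hypotheses, alphas):
    """Combine weak rules h_t (each returning +1/-1) with weights alpha_t."""
    score = sum(a * h(x) for h, a in zip(hypotheses, alphas))
    return 1 if score >= 0 else -1

# Example with three made-up weak rules on a real-valued input.
hs = [lambda x: 1 if x > 0 else -1,
      lambda x: 1 if x > 2 else -1,
      lambda x: -1 if x > 5 else 1]
alphas = [0.9, 0.4, 0.3]
print(weighted_majority_vote(1.0, hs, alphas))  # +1: first and third rules outvote the second
```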

Historically….

Weak Learning vs Strong Learning. [Kearns & Valiant '88] defined weak learning: being able to predict better than random guessing (error $\le \frac{1}{2} - \gamma$), consistently. They posed an open problem: "Does there exist a boosting algorithm that turns a weak learner into a strong learner (one that can produce arbitrarily accurate hypotheses)?" Informally, given a "weak" learning algorithm that can consistently find classifiers of error $\le \frac{1}{2} - \gamma$, a boosting algorithm would provably construct a single classifier with error $\le \epsilon$ for any desired $\epsilon > 0$.

Surprisingly… Weak Learning = Strong Learning. Original construction [Schapire '89]: a poly-time boosting algorithm, exploiting the fact that we can learn a little on every distribution. A modest booster is obtained by calling the weak learning algorithm on 3 distributions: error $\beta < \frac{1}{2} - \gamma$ becomes error $3\beta^2 - 2\beta^3$.
- Learn $h_1$ from $D_0$ with error < 49%.
- Modify $D_0$ so that $h_1$ has error 50% (call this $D_1$): flip a coin; if heads, wait for an example $x$ where $h_1(x) = f(x)$, otherwise wait for an example where $h_1(x) \ne f(x)$.
- Learn $h_2$ from $D_1$ with error < 49%.
- Modify $D_1$ so that $h_1$ and $h_2$ always disagree (call this $D_2$).
- Learn $h_3$ from $D_2$ with error < 49%.
- Now take a majority vote of $h_1$, $h_2$, and $h_3$: this has error better than any of the "weak" hypotheses $h_1$, $h_2$, $h_3$.
- Repeat this as needed to lower the error rate further.
The modest boost in accuracy is then amplified by running this construction recursively. Cool conceptually and technically, but not very practical.
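As a quick numeric sanity check of this error improvement (a worked example, not on the slide): starting from weak error $\beta = 0.4$,
$$3\beta^2 - 2\beta^3 = 3(0.4)^2 - 2(0.4)^3 = 0.48 - 0.128 = 0.352 < 0.4,$$
so one level of the construction already strictly reduces the error, and recursing amplifies the gain toward any target $\epsilon$.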

An explosion of subsequent work

AdaBoost (Adaptive Boosting). "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting" [Freund-Schapire, JCSS '97]. Gödel Prize winner, 2003.

Informal Description of AdaBoost. Boosting turns a weak algorithm into a strong learner.
Input: $S = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, with $x_i \in X$, $y_i \in Y = \{-1, +1\}$, plus a weak learning algorithm $A$ (e.g., Naïve Bayes, decision stumps).
For $t = 1, 2, \ldots, T$:
- Construct a distribution $D_t$ on $\{x_1, \ldots, x_m\}$.
- Run $A$ on $D_t$, producing a weak classifier $h_t : X \to \{-1, +1\}$.
- Let $\epsilon_t = \Pr_{x_i \sim D_t}(h_t(x_i) \ne y_i)$ be the error of $h_t$ over $D_t$.
Output: $H_{final}(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.
Roughly speaking, $D_{t+1}$ increases the weight on $x_i$ if $h_t$ is incorrect on $x_i$ and decreases it if $h_t$ is correct.

AdaBoost (Adaptive Boosting). Given a weak learning algorithm $A$; for $t = 1, 2, \ldots, T$, construct $D_t$ on $\{x_1, \ldots, x_m\}$ and run $A$ on $D_t$, producing $h_t$.
Constructing $D_t$:
- $D_1$ is uniform on $\{x_1, \ldots, x_m\}$, i.e., $D_1(i) = \frac{1}{m}$.
- Given $D_t$ and $h_t$, set
$$D_{t+1}(i) = \frac{D_t(i)}{Z_t} e^{-\alpha_t y_i h_t(x_i)},$$
i.e., $D_{t+1}(i) = \frac{D_t(i)}{Z_t} e^{-\alpha_t}$ if $y_i = h_t(x_i)$ and $D_{t+1}(i) = \frac{D_t(i)}{Z_t} e^{\alpha_t}$ if $y_i \ne h_t(x_i)$, where $Z_t = \sum_i D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}$ is the normalizer and $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right) > 0$.
$D_{t+1}$ puts half of its weight on the examples $x_i$ where $h_t$ is incorrect and half on the examples where $h_t$ is correct.
Final hypothesis: $H_{final}(x) = \mathrm{sign}\left(\sum_t \alpha_t h_t(x)\right)$.
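The update rules above translate directly into code. Below is a minimal AdaBoost sketch in Python/NumPy (my own illustration, not code from the slides); `weak_learner` stands for any routine that takes the data and a weight distribution and returns a ±1 classifier.

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """AdaBoost sketch. X: (m, d) features, y: (m,) labels in {-1, +1}.
    weak_learner(X, y, D) must return a function h with h(X) in {-1, +1}^m."""
    m = len(y)
    D = np.full(m, 1.0 / m)                        # D_1: uniform distribution
    hypotheses, alphas = [], []
    for t in range(T):
        h = weak_learner(X, y, D)
        pred = h(X)
        eps = float(np.sum(D[pred != y]))          # weighted error of h_t under D_t
        eps = np.clip(eps, 1e-10, 1 - 1e-10)       # avoid division by zero / log(0)
        alpha = 0.5 * np.log((1 - eps) / eps)      # alpha_t = 1/2 ln((1-eps_t)/eps_t)
        D = D * np.exp(-alpha * y * pred)          # up-weight mistakes, down-weight correct
        D = D / D.sum()                            # normalize by Z_t
        hypotheses.append(h)
        alphas.append(alpha)
    def H_final(Xq):
        """Weighted majority vote sign(sum_t alpha_t h_t(x))."""
        scores = sum(a * h(Xq) for h, a in zip(hypotheses, alphas))
        return np.where(scores >= 0, 1, -1)
    return H_final
```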

AdaBoost: A toy example. Weak classifiers: vertical or horizontal half-planes (a.k.a. decision stumps).
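To accompany the toy example, here is a sketch of a decision-stump weak learner compatible with the `adaboost` sketch above (again my own illustration; the data points below are made up in the spirit of the slides).

```python
import numpy as np

def stump_learner(X, y, D):
    """Return the axis-aligned threshold classifier (decision stump) with the
    lowest weighted error under the distribution D."""
    m, d = X.shape
    best_err, best_params = np.inf, None
    for j in range(d):                      # feature to split on
        for thr in np.unique(X[:, j]):      # candidate thresholds
            for sign in (+1, -1):           # which side is labeled +1
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = float(np.sum(D[pred != y]))
                if err < best_err:
                    best_err, best_params = err, (j, thr, sign)
    j, thr, sign = best_params
    return lambda Xq: np.where(Xq[:, j] > thr, sign, -sign)

# Usage on a small 2D toy dataset (made-up, linearly separable points).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 3.0], [4.0, 4.0], [5.0, 3.5], [4.5, 1.0]])
y = np.array([+1, +1, +1, -1, -1, -1])
H = adaboost(X, y, stump_learner, T=5)
print(H(X))   # should reproduce y on this separable toy set
```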

AdaBoost: A toy example (continued; the remaining rounds are shown as figures in the original slides and are not reproduced in the transcript).


Nice Features of AdaBoost
- Very general: a meta-procedure, it can use any weak learning algorithm (e.g., Naïve Bayes, decision stumps).
- Very fast (a single pass through the data each round) and simple to code, with no parameters to tune.
- Shift in mindset: the goal is now just to find classifiers a bit better than random guessing.
- Grounded in rich theory.

Guarantees about the Training Error.
Theorem: Let $\epsilon_t = 1/2 - \gamma_t$ be the error of $h_t$ over $D_t$. Then
$$err_S(H_{final}) \le \exp\Big(-2 \sum_t \gamma_t^2\Big).$$
So if $\gamma_t \ge \gamma > 0$ for all $t$, then $err_S(H_{final}) \le \exp(-2\gamma^2 T)$: the training error drops exponentially in $T$! To get $err_S(H_{final}) \le \epsilon$, we need only $T = O\left(\frac{1}{\gamma^2} \log \frac{1}{\epsilon}\right)$ rounds.
AdaBoost is adaptive: it does not need to know $\gamma$ or $T$ a priori, and it can exploit $\gamma_t \gg \gamma$.
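As a back-of-the-envelope illustration of the bound (my own example, not from the slides), take a weak-learner edge of $\gamma = 0.1$ and a target training error of $\epsilon = 0.01$:

```python
import math

gamma, eps = 0.1, 0.01
# Training-error bound: err_S(H_final) <= exp(-2 * gamma^2 * T)
for T in (50, 100, 200, 300):
    print(T, math.exp(-2 * gamma**2 * T))
# Rounds needed so that exp(-2 * gamma^2 * T) <= eps:
T_needed = math.ceil(math.log(1 / eps) / (2 * gamma**2))
print("T needed:", T_needed)   # 231 rounds for gamma = 0.1, eps = 0.01
```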

Understanding the Updates & Normalization.
Claim: $D_{t+1}$ puts half of the weight on the $x_i$ where $h_t$ was incorrect and half of the weight on the $x_i$ where $h_t$ was correct. Recall $D_{t+1}(i) = \frac{D_t(i)}{Z_t} e^{-\alpha_t y_i h_t(x_i)}$ and $e^{\alpha_t} = \sqrt{\frac{1-\epsilon_t}{\epsilon_t}}$.
Weight on mistakes:
$$\Pr_{D_{t+1}}[y_i \ne h_t(x_i)] = \sum_{i:\, y_i \ne h_t(x_i)} \frac{D_t(i)}{Z_t} e^{\alpha_t} = \frac{\epsilon_t}{Z_t} e^{\alpha_t} = \frac{\epsilon_t}{Z_t} \sqrt{\frac{1-\epsilon_t}{\epsilon_t}} = \frac{\sqrt{\epsilon_t (1-\epsilon_t)}}{Z_t}.$$
Weight on correct examples:
$$\Pr_{D_{t+1}}[y_i = h_t(x_i)] = \sum_{i:\, y_i = h_t(x_i)} \frac{D_t(i)}{Z_t} e^{-\alpha_t} = \frac{1-\epsilon_t}{Z_t} e^{-\alpha_t} = \frac{1-\epsilon_t}{Z_t} \sqrt{\frac{\epsilon_t}{1-\epsilon_t}} = \frac{\sqrt{\epsilon_t (1-\epsilon_t)}}{Z_t}.$$
The two probabilities are equal, and
$$Z_t = \sum_i D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} = (1-\epsilon_t) e^{-\alpha_t} + \epsilon_t e^{\alpha_t} = 2\sqrt{\epsilon_t (1-\epsilon_t)},$$
so each probability equals $1/2$.
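A quick numeric check of the claim (my own sketch), applying the update rule directly:

```python
import numpy as np

# Suppose h_t makes mistakes on a 0.3 fraction of the (uniformly weighted) data.
m = 10
D = np.full(m, 1.0 / m)
correct = np.array([1] * 7 + [-1] * 3)    # y_i * h_t(x_i): +1 if correct, -1 if wrong
eps = float(np.sum(D[correct == -1]))     # weighted error: 0.3
alpha = 0.5 * np.log((1 - eps) / eps)
D_next = D * np.exp(-alpha * correct)
D_next /= D_next.sum()                    # normalize by Z_t
print(D_next[correct == -1].sum())        # ~0.5: half the weight on the mistakes
print(D_next[correct == +1].sum())        # ~0.5: half the weight on the correct examples
```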

Generalization Guarantees (informal).
Theorem: $err_S(H_{final}) \le \exp(-2 \sum_t \gamma_t^2)$, where $\epsilon_t = 1/2 - \gamma_t$. How about generalization guarantees?
Original analysis [Freund & Schapire '97]: let $H$ be the set of rules that the weak learner can use, and let $G$ be the set of weighted majority rules over $T$ elements of $H$ (i.e., the things that AdaBoost might output).
Theorem [Freund & Schapire '97]: for all $g \in G$,
$$err(g) \le err_S(g) + \tilde{O}\!\left(\sqrt{\frac{T d}{m}}\right),$$
where $T$ is the number of rounds and $d$ is the VC dimension of $H$.

Generalization Guarantees.
Theorem [Freund & Schapire '97]: for all $g \in G$, $err(g) \le err_S(g) + \tilde{O}\big(\sqrt{Td/m}\big)$, where $T$ is the number of rounds and $d$ is the VC dimension of $H$. [The slide plots generalization error vs. $T$, decomposed into a train-error curve and a complexity term that grows with the number of rounds $T$.]

Generalization Guarantees. Experiments with boosting showed that the test error of the generated classifier usually does not increase as its size becomes very large, and that continuing to add new weak learners after the training set has been correctly classified can further improve test-set performance. These results seem to contradict the FS'97 bound and Occam's razor (which says that, in order to achieve good test error, the classifier should be as simple as possible)!

How can we explain the experiments? R. Schapire, Y. Freund, P. Bartlett, and W. S. Lee give a nice theoretical explanation in "Boosting the margin: A new explanation for the effectiveness of voting methods." Key idea: the training error does not tell the whole story; we also need to consider the classification confidence.
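For reference (a standard definition from the margins literature, not spelled out on this slide), the confidence of the combined vote on an example $(x, y)$ is usually measured by the normalized margin
$$\mathrm{margin}(x, y) \;=\; \frac{y \sum_t \alpha_t h_t(x)}{\sum_t \alpha_t} \;\in\; [-1, 1],$$
which is positive iff $H_{final}$ classifies $(x, y)$ correctly and close to $+1$ when the weighted vote is nearly unanimous; Schapire et al. show that boosting tends to keep increasing these margins even after the training error reaches zero.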

What you should know: the difference between weak and strong learners; the AdaBoost algorithm and the intuition for the distribution update step; the training error bound for AdaBoost and the overfitting/generalization aspects.

Advanced (Not required) Material

Analyzing Training Error: Proof Intuition.
Theorem: with $\epsilon_t = 1/2 - \gamma_t$ (the error of $h_t$ over $D_t$), $err_S(H_{final}) \le \exp(-2 \sum_t \gamma_t^2)$.
On round $t$, we increase the weight of the $x_i$ for which $h_t$ is wrong. If $H_{final}$ incorrectly classifies $x_i$:
- then $x_i$ is incorrectly classified by the (weighted) majority of the $h_t$'s,
- which implies the final probability weight of $x_i$ is large; one can show it is $\ge \frac{1}{m} \cdot \frac{1}{\prod_t Z_t}$.
Since the probabilities sum to 1, we can't have too many examples of high weight: one can show that the number of incorrectly classified examples is $\le m \prod_t Z_t$. And $\prod_t Z_t \to 0$.

Analyzing Training Error: Proof Math.
Step 1 (unwrapping the recurrence): $D_{T+1}(i) = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$, where $f(x_i) = \sum_t \alpha_t h_t(x_i)$ [the unthresholded weighted vote of the $h_t$'s on $x_i$].
Step 2: $err_S(H_{final}) \le \prod_t Z_t$.
Step 3: $\prod_t Z_t = \prod_t 2\sqrt{\epsilon_t (1-\epsilon_t)} = \prod_t \sqrt{1 - 4\gamma_t^2} \le e^{-2 \sum_t \gamma_t^2}$.

Analyzing Training Error: Proof Math.
Step 1 (unwrapping the recurrence): $D_{T+1}(i) = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$, where $f(x_i) = \sum_t \alpha_t h_t(x_i)$.
Recall $D_1(i) = \frac{1}{m}$ and $D_{t+1}(i) = D_t(i)\, \frac{\exp(-y_i \alpha_t h_t(x_i))}{Z_t}$. So
$$D_{T+1}(i) = \frac{\exp(-y_i \alpha_T h_T(x_i))}{Z_T} \cdot D_T(i) = \frac{\exp(-y_i \alpha_T h_T(x_i))}{Z_T} \cdot \frac{\exp(-y_i \alpha_{T-1} h_{T-1}(x_i))}{Z_{T-1}} \cdot D_{T-1}(i) = \cdots$$
$$= \frac{\exp(-y_i \alpha_T h_T(x_i))}{Z_T} \cdots \frac{\exp(-y_i \alpha_1 h_1(x_i))}{Z_1} \cdot \frac{1}{m} = \frac{1}{m} \cdot \frac{\exp\!\big(-y_i (\alpha_1 h_1(x_i) + \cdots + \alpha_T h_T(x_i))\big)}{Z_1 \cdots Z_T} = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t}.$$

Analyzing Training Error: Proof Math.
Step 1: $D_{T+1}(i) = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$, where $f(x_i) = \sum_t \alpha_t h_t(x_i)$.
Step 2: $err_S(H_{final}) \le \prod_t Z_t$. Indeed, bounding the 0/1 loss by the exponential loss,
$$err_S(H_{final}) = \frac{1}{m} \sum_i \mathbf{1}\{y_i \ne H_{final}(x_i)\} = \frac{1}{m} \sum_i \mathbf{1}\{y_i f(x_i) \le 0\} \le \frac{1}{m} \sum_i \exp(-y_i f(x_i)) = \sum_i D_{T+1}(i) \prod_t Z_t = \prod_t Z_t.$$

Analyzing Training Error: Proof Math.
Step 1: $D_{T+1}(i) = \frac{1}{m} \cdot \frac{\exp(-y_i f(x_i))}{\prod_t Z_t}$, where $f(x_i) = \sum_t \alpha_t h_t(x_i)$.
Step 2: $err_S(H_{final}) \le \prod_t Z_t$.
Step 3: $\prod_t Z_t = \prod_t 2\sqrt{\epsilon_t (1-\epsilon_t)} = \prod_t \sqrt{1 - 4\gamma_t^2} \le e^{-2 \sum_t \gamma_t^2}$, using $\epsilon_t = 1/2 - \gamma_t$ and $1 + x \le e^x$.
Note: recall $Z_t = (1-\epsilon_t)e^{-\alpha_t} + \epsilon_t e^{\alpha_t} = 2\sqrt{\epsilon_t (1-\epsilon_t)}$; indeed, $\alpha_t$ is the minimizer of $\alpha \mapsto (1-\epsilon_t)e^{-\alpha} + \epsilon_t e^{\alpha}$.
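For completeness, a short derivation of the last remark (standard, though not shown on the slide): setting the derivative of $\alpha \mapsto (1-\epsilon_t)e^{-\alpha} + \epsilon_t e^{\alpha}$ to zero,
$$-(1-\epsilon_t)e^{-\alpha} + \epsilon_t e^{\alpha} = 0 \;\Longrightarrow\; e^{2\alpha} = \frac{1-\epsilon_t}{\epsilon_t} \;\Longrightarrow\; \alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t},$$
and the second derivative is positive, so this is a minimum. Substituting back,
$$Z_t = (1-\epsilon_t)\sqrt{\tfrac{\epsilon_t}{1-\epsilon_t}} + \epsilon_t\sqrt{\tfrac{1-\epsilon_t}{\epsilon_t}} = 2\sqrt{\epsilon_t(1-\epsilon_t)}.$$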