The Hedge$_\beta$ Algorithm and Its Applications
Yogev Bar-On
Seminar on Experts and Bandits, Tel-Aviv University, November 2017
Based on: Y. Freund and R. E. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting"

Review – Prediction With Expert Advice (Binary Setting)
The problem: predict the next bit in a sequence over $T$ steps, where at step $i$:
- We choose a bit $b_i$
- We obtain the outcome bit $c_i$ and suffer loss 1 if $b_i \neq c_i$
$N$ experts help us: before choosing $b_i$ we receive the advice vector $\boldsymbol{b}_i = (b_i^1, \dots, b_i^N)$.
Our goal is to minimize the regret:
$$\sum_{i=1}^{T} \mathbf{1}\!\left[b_i \neq c_i\right] \;-\; \min_{1 \le k \le N} \sum_{i=1}^{T} \mathbf{1}\!\left[b_i^k \neq c_i\right]$$

Review – Prediction With Expert Advice (Binary Setting) – Cont.
We used the Weighted-Majority Algorithm (WMA), with parameter $\beta \in (0,1)$:
- Initialize a weight vector $\boldsymbol{w}_0 = (1, \dots, 1)$
- At step $i$:
  - Choose $b_i = 1$ if $\sum_{k=1}^{N} w_{i-1}^k \, b_i^k \ge \frac{1}{2} \sum_{k=1}^{N} w_{i-1}^k$; otherwise choose $b_i = 0$
  - Obtain the outcome $c_i$
  - If expert $k$ was correct, set $w_i^k = w_{i-1}^k$; otherwise set $w_i^k = w_{i-1}^k \cdot \beta$
Let $L_A$ be our overall loss with WMA, and $L_k$ the overall loss of expert $k$; then $L_A = O\!\left(L_k + \ln N\right)$.
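To make the update rule concrete, here is a minimal Python sketch of the Weighted-Majority Algorithm as listed above (the class and method names are illustrative, not from the paper):

```python
import numpy as np

class WeightedMajority:
    """Binary-prediction Weighted-Majority Algorithm with parameter beta in (0, 1)."""

    def __init__(self, n_experts, beta=0.5):
        self.beta = beta
        self.w = np.ones(n_experts)  # w_0 = (1, ..., 1)

    def predict(self, advice):
        """advice: length-N array of expert bits; predict 1 iff the weighted vote
        for 1 is at least half of the total weight."""
        return int(np.dot(self.w, advice) >= 0.5 * self.w.sum())

    def update(self, advice, outcome):
        """Multiply the weight of every expert that was wrong by beta."""
        self.w[np.asarray(advice) != outcome] *= self.beta
```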

What Are We Doing Today?
- Look at a more general setting for the Prediction With Expert Advice problem
- Generalize the Weighted-Majority Algorithm to this setting
- See an application (AdaBoost)

Prediction With Expert Advice (General Setting)
At step $i$ we obtain a loss vector $\boldsymbol{\ell}_i$, where expert $k$ suffers loss $\ell_i^k \in [0,1]$.
Instead of choosing a bit, we choose a distribution $\boldsymbol{p}_i$ over the experts and suffer the expected loss:
$$\boldsymbol{p}_i \cdot \boldsymbol{\ell}_i = \sum_{k=1}^{N} p_i^k \, \ell_i^k$$
Again, our goal is to minimize the regret:
$$\sum_{i=1}^{T} \boldsymbol{p}_i \cdot \boldsymbol{\ell}_i \;-\; \min_{1 \le k \le N} \sum_{i=1}^{T} \ell_i^k$$

The Hedge$_\beta$ Algorithm
We generalize the Weighted-Majority Algorithm to fit the new setting.
The Hedge$_\beta$ algorithm, with parameter $\beta \in (0,1)$:
- Initialize a weight vector $\boldsymbol{w}_0 = \left(\frac{1}{N}, \dots, \frac{1}{N}\right)$
- For $T$ steps, at step $i$:
  - Choose $\boldsymbol{p}_i = \boldsymbol{w}_{i-1} \big/ \sum_{k=1}^{N} w_{i-1}^k$
  - Obtain the loss vector $\boldsymbol{\ell}_i$
  - Set $w_i^k = w_{i-1}^k \cdot \beta^{\ell_i^k}$
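The following is a minimal Python sketch of Hedge$_\beta$ exactly as listed above; the class and method names are my own, and the loss vectors are assumed to arrive from the environment one step at a time.

```python
import numpy as np

class Hedge:
    """Hedge_beta for the general expert setting, with per-expert losses in [0, 1]."""

    def __init__(self, n_experts, beta=0.9):
        self.beta = beta
        self.w = np.full(n_experts, 1.0 / n_experts)  # w_0 = (1/N, ..., 1/N)

    def distribution(self):
        """p_i: the current weights, normalized to sum to 1."""
        return self.w / self.w.sum()

    def update(self, losses):
        """w_i^k = w_{i-1}^k * beta ** loss_i^k for the observed loss vector."""
        self.w *= self.beta ** np.asarray(losses, dtype=float)
```

A round then consists of calling `distribution()`, suffering the loss `p @ losses`, and calling `update(losses)`.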

Hedge$_\beta$ Analysis
The analysis is similar to the WMA analysis:
- We derive upper and lower bounds for $\sum_{k=1}^{N} w_T^k$
- These bounds imply a bound on our overall loss with the algorithm
We denote by $L_A$ our overall loss and by $L_k$ the overall loss of expert $k$.

Hedge$_\beta$ Analysis – Cont.
We start with the easier lower bound. For all $1 \le k \le N$:
$$\sum_{k'=1}^{N} w_T^{k'} \;\ge\; w_T^k \;=\; w_0^k \, \beta^{\ell_1^k} \cdots \beta^{\ell_T^k} \;=\; w_0^k \, \beta^{\sum_{i=1}^{T} \ell_i^k} \;=\; \frac{1}{N} \, \beta^{L_k}$$

Hedge$_\beta$ Analysis – Cont.
For the upper bound, we first notice that for all $\ell \in [0,1]$:
$$\beta^{\ell} \;\le\; 1 - (1-\beta)\,\ell$$
$\beta^{\ell}$ is convex (its second derivative is $\ln^2\!\beta \cdot \beta^{\ell}$, which is always positive), and $1 - (1-\beta)\,\ell$ is the line between the points $\left(0, \beta^0\right)$ and $\left(1, \beta^1\right)$.
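Spelled out (this one-line step is not on the slide, but it is the standard convexity argument): writing $\ell$ as a convex combination of the endpoints and using the convexity of $x \mapsto \beta^x$ gives

```latex
\beta^{\ell}
  = \beta^{(1-\ell)\cdot 0 \,+\, \ell\cdot 1}
  \;\le\; (1-\ell)\,\beta^{0} + \ell\,\beta^{1}
  = 1 - (1-\beta)\,\ell ,
  \qquad \ell \in [0,1].
```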

Hedge$_\beta$ Analysis – Cont.
Now we can derive the upper bound for $\sum_{k=1}^{N} w_T^k$. For all $1 \le i \le T$:
$$\begin{aligned}
\sum_{k=1}^{N} w_i^k &= \sum_{k=1}^{N} w_{i-1}^k \, \beta^{\ell_i^k}
\;\le\; \sum_{k=1}^{N} w_{i-1}^k \left(1 - (1-\beta)\,\ell_i^k\right) \\
&= \sum_{k=1}^{N} w_{i-1}^k \;-\; (1-\beta) \sum_{k=1}^{N} w_{i-1}^k \, \ell_i^k
\;=\; \left(\sum_{k=1}^{N} w_{i-1}^k\right) \left(1 - (1-\beta)\,\frac{\sum_{k=1}^{N} w_{i-1}^k \, \ell_i^k}{\sum_{k=1}^{N} w_{i-1}^k}\right) \\
&= \left(\sum_{k=1}^{N} w_{i-1}^k\right) \left(1 - (1-\beta)\,\boldsymbol{p}_i \cdot \boldsymbol{\ell}_i\right)
\end{aligned}$$

Hedge$_\beta$ Analysis – Cont.
Hence, using the inequality $1 + x \le e^x$:
$$\begin{aligned}
\sum_{k=1}^{N} w_T^k &\le \left(\sum_{k=1}^{N} w_{T-1}^k\right)\left(1 - (1-\beta)\,\boldsymbol{p}_T \cdot \boldsymbol{\ell}_T\right)
\;\le\; \dots
\;\le\; \left(\sum_{k=1}^{N} w_0^k\right)\left(1 - (1-\beta)\,\boldsymbol{p}_1 \cdot \boldsymbol{\ell}_1\right)\cdots\left(1 - (1-\beta)\,\boldsymbol{p}_T \cdot \boldsymbol{\ell}_T\right) \\
&= \prod_{i=1}^{T}\left(1 - (1-\beta)\,\boldsymbol{p}_i \cdot \boldsymbol{\ell}_i\right)
\;\le\; \prod_{i=1}^{T} e^{-(1-\beta)\,\boldsymbol{p}_i \cdot \boldsymbol{\ell}_i}
\;=\; e^{-(1-\beta)\sum_{i=1}^{T}\boldsymbol{p}_i \cdot \boldsymbol{\ell}_i}
\;=\; e^{-(1-\beta)\,L_A}
\end{aligned}$$

Hedge$_\beta$ Analysis – Cont.
Combining both bounds, we obtain for all $1 \le k \le N$:
$$e^{-(1-\beta)\,L_A} \;\ge\; \frac{1}{N}\,\beta^{L_k}$$
Thus:
$$L_A \;\le\; \frac{-\ln\beta}{1-\beta}\,L_k + \frac{1}{1-\beta}\ln N$$
and specifically:
$$L_A \;\le\; \frac{-\ln\beta}{1-\beta}\,\min_{1 \le k \le N} L_k + \frac{1}{1-\beta}\ln N$$
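The algebra behind "Thus" (taking logarithms of both sides; not spelled out on the slide):

```latex
e^{-(1-\beta)L_A} \ge \tfrac{1}{N}\,\beta^{L_k}
\;\Longrightarrow\;
-(1-\beta)\,L_A \ge -\ln N + L_k \ln\beta
\;\Longrightarrow\;
L_A \le \frac{-\ln\beta}{1-\beta}\,L_k + \frac{\ln N}{1-\beta}.
```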

Hedge$_\beta$ Analysis – Cont.
$$L_A \;\le\; \frac{-\ln\beta}{1-\beta}\,\min_{1 \le k \le N} L_k + \frac{1}{1-\beta}\ln N$$

Hedge$_\beta$ Analysis – Cont.
It can be shown that for every other algorithm for the problem that satisfies, for some $a, b$,
$$L_A \;\le\; a \min_{1 \le k \le N} L_k + b \ln N,$$
either $a \ge \frac{-\ln\beta}{1-\beta}$ or $b \ge \frac{1}{1-\beta}$ for all $\beta \in (0,1)$.

Choosing β
We would like to choose $\beta$ in a way that exploits any prior knowledge we have about the problem.
Let $\tilde{L}$ be an upper bound on the overall loss of the best expert. We will choose:
$$\beta = \frac{1}{1 + \sqrt{\frac{2\ln N}{\tilde{L}}}}$$

Choosing β – Cont.
We can use the inequality:
$$\frac{1+\beta}{2\beta} + \frac{\ln\beta}{1-\beta} \;\ge\; 0 \qquad \text{for all } \beta \in (0,1)$$
(equivalently, $\frac{-\ln\beta}{1-\beta} \le \frac{1+\beta}{2\beta}$).
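How the inequality is used (this intermediate step is not shown on the slide; it follows the derivation in the Freund–Schapire paper): substituting it into the Hedge$_\beta$ bound, and then plugging in the choice of $\beta$ above together with $\min_k L_k \le \tilde{L}$, gives

```latex
L_A \;\le\; \frac{1+\beta}{2\beta}\,\min_{1\le k\le N} L_k + \frac{\ln N}{1-\beta},
\qquad\text{and with }\;\beta = \Bigl(1+\sqrt{2\ln N / \tilde{L}}\Bigr)^{-1}:\qquad
L_A \;\le\; \min_{1\le k\le N} L_k + \sqrt{2\tilde{L}\ln N} + \ln N .
```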

Choosing β – Cont.
If we know the number of steps $T$ ahead of time, we can bound the loss of the best expert by $\tilde{L} = T$.
We will choose:
$$\beta = \frac{1}{1 + \sqrt{\frac{2\ln N}{T}}}$$
and obtain:
$$L_A \;\le\; \min_{1 \le k \le N} L_k + \ln N + \sqrt{2T\ln N}$$
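As a quick numeric sanity check (the numbers below are illustrative, not from the talk), here is the tuned $\beta$ and the additive regret term for a hypothetical run with $N = 100$ experts and $T = 10{,}000$ steps:

```python
import math

def tuned_beta(n_experts, loss_bound):
    """beta = 1 / (1 + sqrt(2 ln N / L~)), the tuning from the slide."""
    return 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_experts) / loss_bound))

def regret_term(n_experts, horizon):
    """ln N + sqrt(2 T ln N): the additive term when we take L~ = T."""
    ln_n = math.log(n_experts)
    return ln_n + math.sqrt(2.0 * horizon * ln_n)

print(tuned_beta(100, 10_000))   # ~0.97
print(regret_term(100, 10_000))  # ~308 -> at most ~308 more loss than the best expert
```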

Applications – Boosting
One important application of Hedge$_\beta$ is boosting.
Consider a "weak" learning algorithm, whose error is relatively large.
Boosting turns a weak learner into a strong learner.

PAC Learning Model – Brief Intro
We have a domain $X$, and we want to learn a function $c : X \to \{0,1\}$.
A PAC-learning algorithm:
- Input: labeled samples $(x, c(x))$, with $x \in X$ drawn from an unknown distribution $\mathcal{D}$, and parameters $\varepsilon, \delta > 0$
- Output: a hypothesis $h : X \to \{0,1\}$, limited to some class of hypotheses to avoid over-fitting
- Guarantee: with high probability ($1-\delta$), $h$ has low error ($\varepsilon$):
$$\Pr\!\left[\; \mathbb{E}_{x\sim\mathcal{D}}\!\left[\,\left|h(x) - c(x)\right|\,\right] < \varepsilon \;\right] > 1-\delta$$
- Runs in time polynomial in $\frac{1}{\varepsilon}$ and $\frac{1}{\delta}$

PAC Learning Model – Brief Intro – Cont.
A weak PAC-learner is the same, except that its error is only guaranteed to be at most $\gamma$, for some $\frac{1}{2} > \gamma \ge 0$ (so it may be far larger than the target $\varepsilon$).
We can call the weak learner many times, each time with a different distribution $D_i$ over the samples.
The learner will minimize the observed error relative to $D_i$:
$$\mathbb{E}_{x \sim D_i}\!\left[\,\left|h_i(x) - c(x)\right|\,\right]$$
Combining the resulting hypotheses can lead to a smaller error (this is called boosting).

AdaBoost
AdaBoost is a boosting algorithm based on Hedge$_\beta$.
The input:
- Some weak learner
- $N$ samples: $(x_1, y_1), \dots, (x_N, y_N)$
- The number of steps $T$
The main idea:
- Use the samples as experts
- The distribution $\boldsymbol{p}_i$ over the experts is the distribution $D_i$ we provide to the weak learner
- Give more weight to samples with larger error

AdaBoost – Cont.
We initialize a weight vector $\boldsymbol{w}_0 = \left(\frac{1}{N}, \dots, \frac{1}{N}\right)$.
For $T$ steps, at step $i$:
- Choose $D_i = \boldsymbol{w}_{i-1} \big/ \sum_{k=1}^{N} w_{i-1}^k$
- Call the weak learner with $D_i$ as the distribution over the samples, and obtain a hypothesis $h_i$
- Calculate the observed error $\varepsilon_i$ of $h_i$:
$$\varepsilon_i = \mathbb{E}_{(x,y) \sim D_i}\!\left[\,\left|h_i(x) - y\right|\,\right] = \sum_{k=1}^{N} D_i^k \left|h_i(x_k) - y_k\right|$$
- Set $\beta_i = \frac{\varepsilon_i}{1 - \varepsilon_i}$ and $\ell_i^k = 1 - \left|h_i(x_k) - y_k\right|$
- Update the weight vector: $w_i^k = w_{i-1}^k \cdot \beta_i^{\ell_i^k}$
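A minimal Python sketch of this loop, assuming a `weak_learner(X, y, D)` callable that returns a hypothesis `h` mapping a sample to 0 or 1 (the callable and the array layout are my assumptions, not part of the slides):

```python
import numpy as np

def adaboost(weak_learner, X, y, T):
    """Run T rounds of the boosting loop above; return the hypotheses and their beta_i."""
    N = len(y)
    y = np.asarray(y, dtype=float)
    w = np.full(N, 1.0 / N)          # w_0 = (1/N, ..., 1/N)
    hypotheses, betas = [], []
    for _ in range(T):
        D = w / w.sum()              # D_i: normalized weights over the samples
        h = weak_learner(X, y, D)    # weak hypothesis trained w.r.t. D_i
        preds = np.array([h(x) for x in X], dtype=float)
        eps = np.dot(D, np.abs(preds - y))   # observed error eps_i
        beta = eps / (1.0 - eps)             # beta_i = eps_i / (1 - eps_i)
        loss = 1.0 - np.abs(preds - y)       # loss_i^k = 1 - |h_i(x_k) - y_k|
        w = w * beta ** loss                 # w_i^k = w_{i-1}^k * beta_i ** loss_i^k
        hypotheses.append(h)
        betas.append(beta)
    return hypotheses, betas
```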

AdaBoost – Cont.
The final hypothesis $h$ is a weighted majority of the hypotheses $\left\{ h_i \mid 1 \le i \le T \right\}$:
$$h(x) = \begin{cases} 1 & \text{if } \dfrac{\sum_{i=1}^{T} \ln\!\frac{1}{\beta_i}\, h_i(x)}{\sum_{i=1}^{T} \ln\!\frac{1}{\beta_i}} \ge \dfrac{1}{2} \\[1.5ex] 0 & \text{otherwise} \end{cases}$$
Notice the major differences between AdaBoost and Hedge$_\beta$:
- The loss $\ell_i$ measures how well the expert (sample) did: correctly classified samples have their weight shrink, while misclassified ones keep it
- $\beta$ is no longer fixed: a different $\beta_i$ is used at each step
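And the corresponding combination step, continuing the sketch above (again, only an illustration of the formula on this slide):

```python
import math

def final_hypothesis(hypotheses, betas):
    """Weighted-majority combination: h(x) = 1 iff the ln(1/beta_i)-weighted
    average of the h_i(x) is at least 1/2."""
    weights = [math.log(1.0 / b) for b in betas]  # ln(1/beta_i)
    total = sum(weights)

    def h(x):
        vote = sum(a * hi(x) for a, hi in zip(weights, hypotheses))
        return int(vote >= 0.5 * total)

    return h
```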

AdaBoost Analysis
It can be shown that the error $\varepsilon$ of the final hypothesis $h$ is bounded by:
$$\varepsilon \;\le\; 2^T \prod_{i=1}^{T} \sqrt{\varepsilon_i \left(1 - \varepsilon_i\right)} \;\le\; e^{-2\sum_{i=1}^{T}\left(\frac{1}{2} - \varepsilon_i\right)^2} \;\le\; e^{-2T\left(\frac{1}{2} - \gamma\right)^2}$$
Notice we don't need to know $\gamma$ for AdaBoost to work.
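The middle inequality, spelled out (a standard step, not shown on the slide): for each factor, using $1 - x \le e^{-x}$ with $x = 4\left(\frac{1}{2} - \varepsilon_i\right)^2$,

```latex
2\sqrt{\varepsilon_i\,(1-\varepsilon_i)}
  = \sqrt{1 - 4\bigl(\tfrac{1}{2}-\varepsilon_i\bigr)^{2}}
  \;\le\; \sqrt{e^{-4\left(\frac{1}{2}-\varepsilon_i\right)^{2}}}
  = e^{-2\left(\frac{1}{2}-\varepsilon_i\right)^{2}},
```

and multiplying over $i = 1, \dots, T$ gives the stated bound.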

Questions?