The Hedge(β) Algorithm and Its Applications
Yogev Bar-On, Seminar on Experts and Bandits, Tel-Aviv University, November 2017
Based on: Y. Freund and R. E. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting"
Review – Prediction With Expert Advice (Binary Setting)
The problem: predict the next bit in a sequence over $T$ steps, where at step $t$:
- We choose a bit $b_t$.
- We obtain the outcome bit $c_t$, and suffer loss $1$ if $b_t \neq c_t$.
- $N$ experts help us: before choosing $b_t$ we receive the advice vector $\xi_t = (\xi_t^1, \ldots, \xi_t^N)$.
Our goal is to minimize the regret:
$$\sum_{t=1}^{T} |b_t - c_t| \;-\; \min_{1 \le i \le N} \sum_{t=1}^{T} |\xi_t^i - c_t|$$
Review – Prediction With Expert Advice (Binary Setting) – Cont.
We used the Weighted-Majority Algorithm (WMA), with parameter $\beta \in (0,1)$:
- Initialize a weight vector $w_0 = (1, \ldots, 1)$.
- At step $t$:
  - Choose $b_t = 1$ if $\sum_{i=1}^{N} w_{t-1}^i \xi_t^i \ge \frac{1}{2} \sum_{i=1}^{N} w_{t-1}^i$; otherwise, choose $b_t = 0$.
  - Obtain the outcome $c_t$.
  - If expert $i$ was correct, set $w_t^i = w_{t-1}^i$; otherwise, set $w_t^i = w_{t-1}^i \cdot \beta$.
Let $L_A$ be our overall loss with WMA, and $L_i$ the overall loss of expert $i$. Then
$$L_A = O\!\left(L_i + \ln N\right).$$
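To make the review concrete, here is a minimal Python sketch of WMA; the function and variable names (`wma_run`, `advice_rounds`, `outcomes`) are illustrative and not from the original slides.

```python
def wma_run(advice_rounds, outcomes, beta=0.5):
    """Sketch of the Weighted-Majority Algorithm (names are illustrative).
    advice_rounds[t][i] is expert i's predicted bit at step t; outcomes[t] is the true bit."""
    n_experts = len(advice_rounds[0])
    weights = [1.0] * n_experts                       # w_0 = (1, ..., 1)
    mistakes = 0
    for advice, outcome in zip(advice_rounds, outcomes):
        # Predict 1 iff the experts advising 1 hold at least half of the total weight.
        weight_for_one = sum(w for w, a in zip(weights, advice) if a == 1)
        prediction = 1 if weight_for_one >= 0.5 * sum(weights) else 0
        mistakes += int(prediction != outcome)
        # Multiply the weight of every mistaken expert by beta; correct experts keep their weight.
        weights = [w * (beta if a != outcome else 1.0) for w, a in zip(weights, advice)]
    return mistakes

# Illustrative run: 3 experts over 2 steps.
print(wma_run([[1, 0, 1], [0, 0, 1]], [1, 1]))
```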
What Are We Doing Today?
- Look at a more general setting of the Prediction With Expert Advice problem.
- Generalize the Weighted-Majority Algorithm to this setting.
- See applications (AdaBoost).
Prediction With Expert Advice (General Setting)
At step $t$ we obtain a loss vector $\ell_t$, where expert $i$ suffers loss $\ell_t^i \in [0,1]$.
Instead of choosing a bit, we only choose a distribution $p_t$ over the experts and suffer the average loss:
$$p_t \cdot \ell_t = \sum_{i=1}^{N} p_t^i \ell_t^i$$
Again, our goal is to minimize the regret:
$$\sum_{t=1}^{T} p_t \cdot \ell_t \;-\; \min_{1 \le i \le N} \sum_{t=1}^{T} \ell_t^i$$
The Hedge(β) Algorithm
We generalize the Weighted-Majority Algorithm to fit our new setting.
The Hedge algorithm, with parameter $\beta \in (0,1)$:
- Initialize a weight vector $w_0 = \left(\frac{1}{N}, \ldots, \frac{1}{N}\right)$.
- For $T$ steps, at step $t$:
  - Choose $p_t = \frac{w_{t-1}}{\sum_{i=1}^{N} w_{t-1}^i}$.
  - Obtain the loss vector $\ell_t$.
  - Set $w_t^i = w_{t-1}^i \cdot \beta^{\ell_t^i}$.
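As a concrete illustration, here is a minimal sketch of Hedge(β) in Python under the setting above; names such as `hedge` and `loss_matrix` are mine, not from the slides.

```python
import numpy as np

def hedge(loss_matrix, beta):
    """Sketch of Hedge(beta); loss_matrix[t, i] is the loss of expert i at step t, in [0, 1]."""
    T, N = loss_matrix.shape
    w = np.full(N, 1.0 / N)                      # w_0 = (1/N, ..., 1/N)
    total_loss = 0.0
    for t in range(T):
        p = w / w.sum()                          # p_t is proportional to the current weights
        total_loss += float(p @ loss_matrix[t])  # suffer the average loss p_t . l_t
        w = w * beta ** loss_matrix[t]           # w_t^i = w_{t-1}^i * beta^{l_t^i}
    return total_loss

# Illustrative run: 5 experts, 100 steps of random losses; compare to the best expert's loss.
rng = np.random.default_rng(0)
losses = rng.random((100, 5))
print(hedge(losses, beta=0.8), losses.sum(axis=0).min())
```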
Hedge(β) Analysis
The analysis is similar to the WMA analysis:
- We derive upper and lower bounds for $\sum_{i=1}^{N} w_T^i$.
- Those imply a bound on our overall loss with the algorithm.
We denote by $L_A$ our overall loss and by $L_i$ the overall loss of expert $i$.
Hedge(β) Analysis – Cont.
We start with the easier lower bound. For all $1 \le i \le N$:
$$\sum_{i'=1}^{N} w_T^{i'} \;\ge\; w_T^i \;=\; w_0^i \cdot \beta^{\ell_1^i} \cdots \beta^{\ell_T^i} \;=\; w_0^i \cdot \beta^{\sum_{t=1}^{T} \ell_t^i} \;=\; \frac{1}{N} \beta^{L_i}$$
Hedge(β) Analysis – Cont.
For the upper bound, we first notice that for all $\ell \in [0,1]$:
$$\beta^{\ell} \le 1 - (1-\beta)\ell$$
This holds because $\beta^{\ell}$ is convex in $\ell$ (its second derivative is $\ln^2\!\beta \cdot \beta^{\ell}$, which is always positive), while $1 - (1-\beta)\ell$ is the line between $\beta^0$ and $\beta^1$.
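A quick numerical sanity check of this inequality on a grid of values (illustrative only):

```python
import numpy as np

# Sanity check of beta**l <= 1 - (1 - beta) * l on a grid of l in [0, 1] and a few betas.
l = np.linspace(0.0, 1.0, 1001)
for beta in (0.1, 0.5, 0.9):
    assert np.all(beta ** l <= 1.0 - (1.0 - beta) * l + 1e-12)
print("inequality holds on the sampled grid")
```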
Hedge(β) Analysis – Cont.
Now we can derive the upper bound for $\sum_{i=1}^{N} w_t^i$. For all $1 \le t \le T$:
$$\sum_{i=1}^{N} w_t^i = \sum_{i=1}^{N} w_{t-1}^i \beta^{\ell_t^i} \le \sum_{i=1}^{N} w_{t-1}^i \left(1 - (1-\beta)\ell_t^i\right) = \sum_{i=1}^{N} w_{t-1}^i - (1-\beta)\sum_{i=1}^{N} w_{t-1}^i \ell_t^i$$
$$= \left(\sum_{i=1}^{N} w_{t-1}^i\right)\left(1 - (1-\beta)\,\frac{\sum_{i=1}^{N} w_{t-1}^i \ell_t^i}{\sum_{i=1}^{N} w_{t-1}^i}\right) = \left(\sum_{i=1}^{N} w_{t-1}^i\right)\left(1 - (1-\beta)\, p_t \cdot \ell_t\right)$$
Hedge(β) Analysis – Cont.
Hence (using the inequality $1 + x \le e^x$ in the last step, and $\sum_{i=1}^{N} w_0^i = 1$):
$$\sum_{i=1}^{N} w_T^i \le \left(\sum_{i=1}^{N} w_{T-1}^i\right)\left(1 - (1-\beta)\, p_T \cdot \ell_T\right) \le \ldots \le \left(\sum_{i=1}^{N} w_0^i\right)\prod_{t=1}^{T}\left(1 - (1-\beta)\, p_t \cdot \ell_t\right)$$
$$= \prod_{t=1}^{T}\left(1 - (1-\beta)\, p_t \cdot \ell_t\right) \le \prod_{t=1}^{T} e^{-(1-\beta)\, p_t \cdot \ell_t} = e^{-(1-\beta)\sum_{t=1}^{T} p_t \cdot \ell_t} = e^{-(1-\beta) L_A}$$
Hedge(β) Analysis – Cont.
Combining both bounds, we obtain for all $1 \le i \le N$:
$$e^{-(1-\beta) L_A} \ge \frac{1}{N} \beta^{L_i}$$
Taking logarithms of both sides and rearranging:
$$L_A \le \frac{-\ln\beta}{1-\beta}\, L_i + \frac{\ln N}{1-\beta}$$
And specifically:
$$L_A \le \frac{-\ln\beta}{1-\beta}\, \min_{1 \le i \le N} L_i + \frac{\ln N}{1-\beta}$$
Hedge(β) Analysis – Cont.
$$L_A \le \frac{-\ln\beta}{1-\beta}\, \min_{1 \le i \le N} L_i + \frac{\ln N}{1-\beta}$$
Hedge(β) Analysis – Cont.
It can be shown that any other algorithm for the problem that satisfies, for some constants $a, c$,
$$L_A \le a \min_{1 \le i \le N} L_i + c \ln N$$
must have either $a \ge \frac{-\ln\beta}{1-\beta}$ or $c \ge \frac{1}{1-\beta}$, for all $\beta \in (0,1)$. In other words, the coefficients achieved by Hedge(β) are essentially optimal.
Choosing β
We would like to choose $\beta$ in a way that exploits any prior knowledge we have about the problem.
Let $\tilde{L}$ be an upper bound on the overall loss of the best expert. We will choose:
$$\beta = \frac{1}{1 + \sqrt{2\ln N / \tilde{L}}}$$
Choosing β – Cont.
We can use the inequality
$$\frac{1+\beta}{2\beta} + \frac{\ln\beta}{1-\beta} \ge 0 \quad \text{for all } \beta \in (0,1),$$
or equivalently $\frac{-\ln\beta}{1-\beta} \le \frac{1+\beta}{2\beta}$.
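As a sketch of how this inequality is used (following the derivation in Freund and Schapire, and assuming $L_i \le \tilde{L}$ with $\beta$ chosen as on the previous slide), one obtains the bound shown on the next slide:

```latex
% Sketch of the derivation, assuming L_i <= \tilde{L} and
% \beta = 1/(1+z) with z = \sqrt{2 \ln N / \tilde{L}}.
\begin{align*}
\frac{-\ln\beta}{1-\beta} &\le \frac{1+\beta}{2\beta} = 1 + \frac{1-\beta}{2\beta} = 1 + \frac{z}{2},
  &\frac{1}{1-\beta} &= \frac{1+z}{z} = 1 + \frac{1}{z},\\
L_A &\le \frac{-\ln\beta}{1-\beta}\, L_i + \frac{\ln N}{1-\beta}
     \le L_i + \frac{z\tilde{L}}{2} + \frac{\ln N}{z} + \ln N
     = L_i + \sqrt{2\tilde{L}\ln N} + \ln N.
\end{align*}
```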
Choosing β – Cont.
If we know the number of steps $T$ ahead of time, we can bound the loss of the best expert with $\tilde{L} = T$. We will choose:
$$\beta = \frac{1}{1 + \sqrt{2\ln N / T}}$$
and obtain:
$$L_A \le \min_{1 \le i \le N} L_i + \ln N + \sqrt{2 T \ln N}$$
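For a sense of scale, a small illustrative computation of $\beta$ and the resulting bound for concrete values of $N$ and $T$ (the numbers are arbitrary examples):

```python
import math

# Illustrative numbers: with beta chosen as above, compare the regret bound to the horizon T.
N, T = 100, 10_000
beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(N) / T))
regret_bound = math.log(N) + math.sqrt(2.0 * T * math.log(N))
print(f"beta = {beta:.4f}, regret bound = {regret_bound:.1f}, horizon T = {T}")
```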
Applications - Boosting
One important application of Hedge(β) is boosting.
Consider a "weak" learning algorithm with a relatively large error.
Boosting turns such a weak-learner into a strong-learner.
PAC Learning Model – Brief Intro
We have a domain $X$, and we want to learn a target function $c: X \to \{0,1\}$.
A PAC-learning algorithm:
- Input: labeled samples $(x, c(x))$, with $x \in X$ drawn randomly from an unknown distribution $P$, and parameters $\varepsilon, \delta > 0$.
- Output: a hypothesis $h: X \to \{0,1\}$, limited to some class of hypotheses to avoid over-fitting.
- Guarantee: with high probability ($1-\delta$), $h$ has low error ($\varepsilon$):
$$\Pr\!\left[\, \mathbb{E}_{x \sim P}\left|h(x) - c(x)\right| < \varepsilon \,\right] > 1 - \delta$$
- Runs in time polynomial in $\frac{1}{\varepsilon}$ and $\frac{1}{\delta}$.
PAC Learning Model – Brief Intro – Cont.
A weak PAC-learner is the same, except its error is always larger than $\gamma$, for some $\frac{1}{2} > \gamma \ge 0$ (it cannot be made arbitrarily small).
We can call the weak-learner many times, each time with a different distribution $D_t$ on the samples.
The learner will minimize the observed error relative to $D_t$:
$$\mathbb{E}_{x \sim D_t}\left|h(x) - c(x)\right|$$
Combining the resulting hypotheses can lead to a smaller error (this is called boosting).
AdaBoost
AdaBoost is a boosting algorithm based on Hedge(β). The input:
- A weak-learner
- $N$ samples: $(x_1, y_1), \ldots, (x_N, y_N)$
- The number of steps $T$
The main idea:
- Use the samples as experts.
- The distribution $p_t$ on the experts is the distribution $D_t$ we provide to the weak-learner.
- Give more weight to samples with larger error.
AdaBoost – Cont.
We initialize a weight vector $w_0 = \left(\frac{1}{N}, \ldots, \frac{1}{N}\right)$.
For $T$ steps, at step $t$:
- Choose $D_t = \frac{w_{t-1}}{\sum_{i=1}^{N} w_{t-1}^i}$.
- Call the weak-learner, providing $D_t$ as the distribution on the samples, and obtain a hypothesis $h_t$.
- Calculate the observed error $\varepsilon_t$ of $h_t$:
$$\varepsilon_t = \mathbb{E}_{(x,y) \sim D_t}\left|h_t(x) - y\right| = \sum_{i=1}^{N} D_t^i \left|h_t(x_i) - y_i\right|$$
- Set $\beta_t = \frac{\varepsilon_t}{1-\varepsilon_t}$ and $\ell_t^i = 1 - \left|h_t(x_i) - y_i\right|$.
- Update the weight vector: $w_t^i = w_{t-1}^i \cdot \beta_t^{\ell_t^i}$.
AdaBoost – Cont.
The final hypothesis $h$ is a weighted majority of the hypotheses $\{h_t \mid 1 \le t \le T\}$:
$$h(x) = \begin{cases} 1 & \text{if } \dfrac{\sum_{t=1}^{T} \ln\!\left(\frac{1}{\beta_t}\right) h_t(x)}{\sum_{t=1}^{T} \ln\!\left(\frac{1}{\beta_t}\right)} \ge \dfrac{1}{2} \\[2ex] 0 & \text{otherwise} \end{cases}$$
Notice the major differences between AdaBoost and Hedge(β):
- The loss $\ell_t^i$ measures how well the hypothesis did on expert (sample) $i$, so correctly classified samples are down-weighted while misclassified samples keep their weight.
- $\beta$ is no longer fixed: it changes from step to step.
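Putting the pieces together, here is a compact Python sketch of AdaBoost as described above; the weak learner is an abstract callable and all names are illustrative, not from the slides.

```python
import math
import numpy as np

def adaboost(weak_learner, X, y, T):
    """Sketch of AdaBoost (names are illustrative).
    weak_learner(X, y, D) must return a hypothesis h: x -> {0, 1}; X is an array of samples,
    y an array of 0/1 labels, D a distribution over the samples.
    Assumes every returned hypothesis has observed error 0 < eps_t < 1/2."""
    N = len(X)
    w = np.full(N, 1.0 / N)                    # w_0 = (1/N, ..., 1/N)
    hypotheses, betas = [], []
    for _ in range(T):
        D = w / w.sum()                        # D_t proportional to the current weights
        h = weak_learner(X, y, D)
        preds = np.array([h(x) for x in X])
        eps = float(D @ np.abs(preds - y))     # observed error of h_t under D_t
        beta_t = eps / (1.0 - eps)
        loss = 1.0 - np.abs(preds - y)         # l_t^i = 1 - |h_t(x_i) - y_i|
        w = w * beta_t ** loss                 # correctly classified samples are down-weighted
        hypotheses.append(h)
        betas.append(beta_t)

    def final_hypothesis(x):
        # Weighted majority of h_1, ..., h_T with weights ln(1/beta_t).
        num = sum(math.log(1.0 / b) * h(x) for h, b in zip(hypotheses, betas))
        den = sum(math.log(1.0 / b) for b in betas)
        return 1 if num >= 0.5 * den else 0

    return final_hypothesis

# Tiny illustrative weak learner: the best threshold rule (decision stump) on 1-D data under D.
def stump_learner(X, y, D):
    best = None
    for thr in sorted(set(X)):
        for sign in (0, 1):
            preds = [(1 - sign) if x < thr else sign for x in X]
            err = sum(d for d, p, t in zip(D, preds, y) if p != t)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    _, thr, sign = best
    return lambda x: (1 - sign) if x < thr else sign

X = np.array([0.1, 0.3, 0.45, 0.6, 0.8, 0.9])
y = np.array([0, 0, 1, 0, 1, 1])
h = adaboost(stump_learner, X, y, T=5)
print([h(x) for x in X], list(y))
```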
AdaBoost Analysis
It can be shown that the error $\varepsilon$ of the final hypothesis $h$ is bounded by:
$$\varepsilon \le 2^T \prod_{t=1}^{T} \sqrt{\varepsilon_t (1 - \varepsilon_t)} \le e^{-2\sum_{t=1}^{T} \gamma_t^2} \le e^{-2 T \gamma^2}$$
where $\gamma_t = \frac{1}{2} - \varepsilon_t$ and $\gamma \le \gamma_t$ for all $t$.
Notice we do not need to know $\gamma$ for AdaBoost to work.
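A quick illustrative check of the last two inequalities in this chain, for some example per-round errors (the values are arbitrary):

```python
import math

# Check 2^T * prod sqrt(eps_t (1-eps_t)) <= exp(-2 sum gamma_t^2) <= exp(-2 T gamma^2).
eps = [0.40, 0.35, 0.45, 0.30]                # example weak-learner errors, all below 1/2
T = len(eps)
gammas = [0.5 - e for e in eps]               # gamma_t = 1/2 - eps_t
gamma = min(gammas)                           # the largest gamma with gamma <= gamma_t for all t
first = 2 ** T * math.prod(math.sqrt(e * (1 - e)) for e in eps)
second = math.exp(-2 * sum(g * g for g in gammas))
third = math.exp(-2 * T * gamma ** 2)
print(first <= second <= third, round(first, 4), round(second, 4), round(third, 4))
```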
Questions?