The $\mathrm{Hedge}_\beta$ Algorithm and Its Applications
Yogev Bar-On
Seminar on Experts and Bandits, Tel-Aviv University, November 2017
Based on: Y. Freund and R. E. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting"
Review – Prediction With Expert Advice (Binary Setting)

- The problem: predict the next bit in a sequence, for $T$ steps, where at step $i$:
  - We choose a bit $b_i$
  - Obtain the outcome bit $c_i$ and suffer loss $1$ if $b_i \neq c_i$
- $N$ experts help us: before choosing $b_i$ we receive the advice vector $\boldsymbol{b}_i = (b_i^1, \ldots, b_i^N)$
- Our goal is to minimize the regret:
$$\sum_{i=1}^{T} \mathbb{1}\left[b_i \neq c_i\right] - \min_{1 \leq k \leq N} \sum_{i=1}^{T} \mathbb{1}\left[b_i^k \neq c_i\right]$$
Review – Prediction With Expert Advice (Binary Setting) – Cont.

- We used the Weighted-Majority Algorithm, with parameter $\beta \in (0,1)$:
  - Initialize a weight vector $\boldsymbol{w}_0 = (1, \ldots, 1)$
  - At step $i$:
    - Choose $b_i = 1$ if $\sum_{k=1}^{N} w_{i-1}^k \cdot b_i^k \geq \frac{1}{2} \sum_{k=1}^{N} w_{i-1}^k$; otherwise, choose $b_i = 0$
    - Obtain outcome $c_i$
    - If expert $k$ was correct, set $w_i^k = w_{i-1}^k$; otherwise, set $w_i^k = w_{i-1}^k \cdot \beta$
- Let $L_A$ be our overall loss with WMA, and $L_k$ the overall loss of expert $k$:
$$L_A = O\left(L_k + \ln N\right)$$
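A minimal sketch of the Weighted-Majority Algorithm in code (illustrative only; the array-based interface for the advice and outcomes is an assumption, not part of the original material):

```python
import numpy as np

def weighted_majority(advice, outcomes, beta=0.5):
    """Weighted-Majority Algorithm (binary setting).

    advice   -- T x N array, advice[i][k] is expert k's predicted bit at step i
    outcomes -- length-T array of outcome bits c_i
    beta     -- penalty factor in (0, 1)
    """
    T, N = advice.shape
    w = np.ones(N)                             # w_0 = (1, ..., 1)
    loss = 0
    for i in range(T):
        # Predict 1 iff the weighted vote for 1 is at least half the total weight
        b = 1 if w @ advice[i] >= w.sum() / 2 else 0
        loss += int(b != outcomes[i])
        # Multiply the weights of mistaken experts by beta, keep the rest unchanged
        w = np.where(advice[i] == outcomes[i], w, w * beta)
    return loss
```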
What Are We Doing Today?

- Look at a more general setting of the Prediction With Expert Advice problem
- Generalize the Weighted-Majority Algorithm to this setting
- See applications (AdaBoost)
Prediction With Expert Advice (General Setting)

- At step $i$, we obtain a loss vector $\boldsymbol{\ell}_i$: expert $k$ suffers loss $\ell_i^k \in [0,1]$
- Instead of choosing a bit, we only choose a distribution $\boldsymbol{p}_i$ over the experts and suffer the average loss:
$$\boldsymbol{p}_i \cdot \boldsymbol{\ell}_i = \sum_{k=1}^{N} p_i^k \ell_i^k$$
- Again, our goal is to minimize the regret:
$$\sum_{i=1}^{T} \boldsymbol{p}_i \cdot \boldsymbol{\ell}_i - \min_{1 \leq k \leq N} \sum_{i=1}^{T} \ell_i^k$$
The $\mathrm{Hedge}_\beta$ Algorithm

- We generalize the Weighted-Majority Algorithm to fit our new setting
- The Hedge Algorithm, with parameter $\beta \in (0,1)$:
  - Initialize a weight vector $\boldsymbol{w}_0 = \left(\frac{1}{N}, \ldots, \frac{1}{N}\right)$
  - For $T$ steps, at step $i$:
    - Choose $\boldsymbol{p}_i = \frac{\boldsymbol{w}_{i-1}}{\sum_{k=1}^{N} w_{i-1}^k}$
    - Obtain the loss vector $\boldsymbol{\ell}_i$
    - Set $w_i^k = w_{i-1}^k \cdot \beta^{\ell_i^k}$
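A minimal sketch of $\mathrm{Hedge}_\beta$ in code (illustrative only; the loss vectors are assumed to be given upfront as a $T \times N$ array, whereas in the online setting they arrive one step at a time):

```python
import numpy as np

def hedge(loss_vectors, beta):
    """Hedge_beta: loss_vectors is a T x N array with entries in [0, 1]."""
    T, N = loss_vectors.shape
    w = np.full(N, 1.0 / N)                  # w_0 = (1/N, ..., 1/N)
    total_loss = 0.0
    for i in range(T):
        p = w / w.sum()                      # p_i proportional to the current weights
        total_loss += p @ loss_vectors[i]    # suffer the average loss p_i . l_i
        w = w * beta ** loss_vectors[i]      # w_i^k = w_{i-1}^k * beta^{l_i^k}
    return total_loss
```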
$\mathrm{Hedge}_\beta$ Analysis

- Similar to the WMA analysis
- We derive upper and lower bounds for $\sum_{k=1}^{N} w_T^k$
- These imply a bound on our overall loss with the algorithm
- We denote by $L_A$ our overall loss and by $L_k$ the overall loss of expert $k$
$\mathrm{Hedge}_\beta$ Analysis – Cont.

- We start with the easier lower bound
- For all $1 \leq k \leq N$:
$$\sum_{k'=1}^{N} w_T^{k'} \;\geq\; w_T^k \;=\; w_0^k\, \beta^{\ell_1^k} \cdots \beta^{\ell_T^k} \;=\; w_0^k\, \beta^{\sum_{i=1}^{T} \ell_i^k} \;=\; \frac{1}{N}\, \beta^{L_k}$$
$\mathrm{Hedge}_\beta$ Analysis – Cont.

- For the upper bound, we first notice that for all $\ell \in [0,1]$:
$$\beta^{\ell} \leq 1 - (1-\beta)\ell$$
- $\beta^{\ell}$ is convex, and $1 - (1-\beta)\ell$ is the line between $\beta^0$ and $\beta^1$
  - The second derivative of $\beta^{\ell}$ is $\ln^2(\beta) \cdot \beta^{\ell}$, which is always positive
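Spelling out the convexity step: for $\ell \in [0,1]$ write $\ell = \ell \cdot 1 + (1-\ell) \cdot 0$, so convexity of $x \mapsto \beta^x$ applied at the endpoints $0$ and $1$ gives

$$\beta^{\ell} = \beta^{\ell \cdot 1 + (1-\ell) \cdot 0} \;\leq\; \ell\,\beta^{1} + (1-\ell)\,\beta^{0} \;=\; \beta\ell + 1 - \ell \;=\; 1 - (1-\beta)\ell.$$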
$\mathrm{Hedge}_\beta$ Analysis – Cont.

- Now we can derive the upper bound for $\sum_{k=1}^{N} w_T^k$
- For all $1 \leq i \leq T$:
$$\begin{aligned}
\sum_{k=1}^{N} w_i^k &= \sum_{k=1}^{N} w_{i-1}^k\, \beta^{\ell_i^k} \;\leq\; \sum_{k=1}^{N} w_{i-1}^k \left(1 - (1-\beta)\ell_i^k\right) \\
&= \sum_{k=1}^{N} w_{i-1}^k - (1-\beta)\sum_{k=1}^{N} w_{i-1}^k \ell_i^k \\
&= \left(\sum_{k=1}^{N} w_{i-1}^k\right)\left(1 - (1-\beta)\,\frac{\sum_{k=1}^{N} w_{i-1}^k \ell_i^k}{\sum_{k=1}^{N} w_{i-1}^k}\right) \\
&= \left(\sum_{k=1}^{N} w_{i-1}^k\right)\left(1 - (1-\beta)\, \boldsymbol{p}_i \cdot \boldsymbol{\ell}_i\right)
\end{aligned}$$
$\mathrm{Hedge}_\beta$ Analysis – Cont.

- Hence:
$$\begin{aligned}
\sum_{k=1}^{N} w_T^k &\leq \left(\sum_{k=1}^{N} w_{T-1}^k\right)\left(1 - (1-\beta)\, \boldsymbol{p}_T \cdot \boldsymbol{\ell}_T\right) \;\leq\; \ldots \\
&\leq \left(\sum_{k=1}^{N} w_0^k\right)\left(1 - (1-\beta)\, \boldsymbol{p}_1 \cdot \boldsymbol{\ell}_1\right) \cdots \left(1 - (1-\beta)\, \boldsymbol{p}_T \cdot \boldsymbol{\ell}_T\right) \\
&= \prod_{i=1}^{T} \left(1 - (1-\beta)\, \boldsymbol{p}_i \cdot \boldsymbol{\ell}_i\right) \;\leq\; \prod_{i=1}^{T} e^{-(1-\beta)\, \boldsymbol{p}_i \cdot \boldsymbol{\ell}_i} \;=\; e^{-(1-\beta)\sum_{i=1}^{T} \boldsymbol{p}_i \cdot \boldsymbol{\ell}_i} \;=\; e^{-(1-\beta) L_A}
\end{aligned}$$
- Using the inequality $1 + x \leq e^x$
$\mathrm{Hedge}_\beta$ Analysis – Cont.

- Combining both bounds, we obtain for all $1 \leq k \leq N$:
$$e^{-(1-\beta) L_A} \geq \frac{1}{N}\, \beta^{L_k}$$
- Thus:
$$L_A \leq \frac{-\ln\beta}{1-\beta}\, L_k + \frac{1}{1-\beta}\ln N$$
- And specifically:
$$L_A \leq \frac{-\ln\beta}{1-\beta}\, \min_{1 \leq k \leq N} L_k + \frac{1}{1-\beta}\ln N$$
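Spelling out the step behind "thus": take natural logarithms of both sides of the combined bound and rearrange (note that $\ln\beta < 0$):

$$-(1-\beta)L_A \;\geq\; -\ln N + L_k \ln\beta \quad\Longrightarrow\quad (1-\beta)L_A \;\leq\; \ln N - L_k\ln\beta \quad\Longrightarrow\quad L_A \;\leq\; \frac{-\ln\beta}{1-\beta}L_k + \frac{\ln N}{1-\beta}.$$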
$\mathrm{Hedge}_\beta$ Analysis – Cont.

$$L_A \leq \frac{-\ln\beta}{1-\beta}\, \min_{1 \leq k \leq N} L_k + \frac{1}{1-\beta}\ln N$$
$\mathrm{Hedge}_\beta$ Analysis – Cont.

- It can be shown that any other algorithm for the problem that satisfies, for some $a, b$:
$$L_A \leq a \min_{1 \leq k \leq N} L_k + b \ln N$$
must have either $a \geq \frac{-\ln\beta}{1-\beta}$ or $b \geq \frac{1}{1-\beta}$, for all $\beta \in (0,1)$
- In this sense, the constants achieved by $\mathrm{Hedge}_\beta$ cannot be uniformly improved
Choosing $\beta$

- We would like to choose $\beta$ in a way that exploits any prior knowledge we have about the problem
- Let $\tilde{L}$ be an upper bound on the overall loss of the best expert. We will choose:
$$\beta = \frac{1}{1 + \sqrt{\frac{2\ln N}{\tilde{L}}}}$$
Choosing $\beta$ – Cont.

- We can use the inequality:
$$\frac{1+\beta}{2\beta} + \frac{\ln\beta}{1-\beta} \geq 0 \quad \text{for all } \beta \in (0,1)$$
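A sketch of how this inequality is used (the computation is not spelled out here; it follows the analysis in the Freund–Schapire paper, assuming $\min_{1 \leq k \leq N} L_k \leq \tilde{L}$). Applying the inequality to the $\mathrm{Hedge}_\beta$ bound:

$$L_A \;\leq\; \frac{-\ln\beta}{1-\beta}\min_{1 \leq k \leq N} L_k + \frac{\ln N}{1-\beta} \;\leq\; \frac{1+\beta}{2\beta}\min_{1 \leq k \leq N} L_k + \frac{\ln N}{1-\beta}.$$

With $\beta = \frac{1}{1+\sqrt{2\ln N/\tilde{L}}}$ we have $\frac{1+\beta}{2\beta} = 1 + \sqrt{\frac{\ln N}{2\tilde{L}}}$ and $\frac{1}{1-\beta} = 1 + \sqrt{\frac{\tilde{L}}{2\ln N}}$, so since $\min_{1 \leq k \leq N} L_k \leq \tilde{L}$:

$$L_A \;\leq\; \min_{1 \leq k \leq N} L_k + \sqrt{2\tilde{L}\ln N} + \ln N.$$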
Choosing $\beta$ – Cont.

- If we know the number of steps $T$ ahead of time, we can bound the loss of the best expert by $\tilde{L} = T$
- We will choose:
$$\beta = \frac{1}{1 + \sqrt{\frac{2\ln N}{T}}}$$
and obtain:
$$L_A \leq \min_{1 \leq k \leq N} L_k + \ln N + \sqrt{2T\ln N}$$
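For example, this choice could be plugged into the earlier hedge sketch like this (a minimal illustration; the random loss array is only a stand-in for whatever losses are actually observed):

```python
import numpy as np

T, N = 1000, 10                                    # known horizon and number of experts
beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(N) / T))  # the choice above, with L-tilde = T

loss_vectors = np.random.rand(T, N)                # stand-in per-expert losses in [0, 1]
total_loss = hedge(loss_vectors, beta)             # hedge() is the sketch from the earlier slide
```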
Applications – Boosting

- One important application of $\mathrm{Hedge}_\beta$ is boosting
- Consider a "weak" learning algorithm, with relatively large error
- Boosting turns a weak learner into a strong learner
PAC Learning Model – Brief Intro

- We have a domain $X$
- We want to learn a function $c: X \to \{0,1\}$
- A PAC-learning algorithm:
  - Input:
    - Labeled samples $(x, c(x))$, with $x \in X$ chosen randomly according to an unknown distribution $\mathcal{D}$
    - $\varepsilon, \delta > 0$
  - Output: a hypothesis $h: X \to \{0,1\}$
    - Limited to some class of hypotheses, to avoid over-fitting
    - With high probability ($1-\delta$), $h$ has low error ($\varepsilon$):
$$\Pr\left[\,\mathbb{E}_{x \sim \mathcal{D}}\left|h(x) - c(x)\right| < \varepsilon\,\right] > 1 - \delta$$
  - Runs in time polynomial in $\frac{1}{\varepsilon}, \frac{1}{\delta}$
PAC Learning Model – Brief Intro – Cont.

- A weak PAC-learner is the same, but its error is relatively large: it only guarantees error at most $\gamma$, for some fixed $\frac{1}{2} > \gamma \geq 0$ (possibly only slightly better than random guessing)
- We can call the weak learner many times, each time with a different distribution $D_i$ on the samples
- The learner will minimize the observed error relative to $D_i$:
$$\mathbb{E}_{x \sim D_i}\left|h_i(x) - c(x)\right|$$
- Combining the resulting hypotheses can lead to a smaller error (this is called boosting)
AdaBoost

- AdaBoost is a boosting algorithm based on $\mathrm{Hedge}_\beta$
- The input:
  - Some weak learner
  - $N$ samples: $(x_1, y_1), \ldots, (x_N, y_N)$
  - The number of steps $T$
- The main idea:
  - Use the samples as experts
  - The distribution $\boldsymbol{p}_i$ on the experts is the distribution $D_i$ we provide to the weak learner
  - Give more weight to samples with larger error
AdaBoost – Cont.

- We initialize a weight vector $\boldsymbol{w}_0 = \left(\frac{1}{N}, \ldots, \frac{1}{N}\right)$
- For $T$ steps, at step $i$:
  - Choose $D_i = \frac{\boldsymbol{w}_{i-1}}{\sum_{k=1}^{N} w_{i-1}^k}$
  - Call the weak learner, providing $D_i$ as the distribution on the samples
  - Obtain hypothesis $h_i$
  - Calculate the observed error $\varepsilon_i$ of $h_i$:
$$\varepsilon_i = \mathbb{E}_{(x,y) \sim D_i}\left|h_i(x) - y\right| = \sum_{k=1}^{N} D_i^k \left|h_i(x_k) - y_k\right|$$
  - Set:
$$\beta_i = \frac{\varepsilon_i}{1 - \varepsilon_i}, \qquad \ell_i^k = 1 - \left|h_i(x_k) - y_k\right|$$
  - Update the weight vector: $w_i^k = w_{i-1}^k \cdot \beta_i^{\ell_i^k}$
AdaBoost – Cont.

- The final hypothesis $h$ is a weighted majority of the hypotheses $\{h_i \mid 1 \leq i \leq T\}$:
$$h(x) = 1 \;\text{ if }\; \frac{\sum_{i=1}^{T} \ln\left(\frac{1}{\beta_i}\right) h_i(x)}{\sum_{i=1}^{T} \ln\left(\frac{1}{\beta_i}\right)} \geq \frac{1}{2}, \quad \text{and } h(x) = 0 \text{ otherwise}$$
- Notice the major differences between AdaBoost and $\mathrm{Hedge}_\beta$:
  - The loss $\ell_i^k$ measures how well the expert (sample) did, so weight shifts toward the samples the hypotheses get wrong
  - $\beta$ is no longer fixed
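A minimal Python sketch of the procedure above (illustrative only; the weak_learner interface and hypotheses returning values in $[0,1]$ are assumptions, and degenerate errors $\varepsilon_i \in \{0, 1\}$ are not handled):

```python
import numpy as np

def adaboost(weak_learner, X, y, T):
    """AdaBoost as described on the previous slides (sketch).

    Assumptions: weak_learner(X, y, D) returns a hypothesis h with h(x) in [0, 1],
    y contains labels in {0, 1}, and every observed error satisfies 0 < eps < 1.
    """
    N = len(X)
    w = np.full(N, 1.0 / N)                      # w_0 = (1/N, ..., 1/N)
    hypotheses, alphas = [], []
    for _ in range(T):
        D = w / w.sum()                          # distribution D_i over the samples
        h = weak_learner(X, y, D)
        errors = np.abs(np.array([h(x) for x in X]) - np.asarray(y))
        eps = D @ errors                         # observed error of h_i w.r.t. D_i
        beta = eps / (1.0 - eps)
        loss = 1.0 - errors                      # l_i^k = 1 - |h_i(x_k) - y_k|
        w = w * beta ** loss                     # shrink weights of well-classified samples
        hypotheses.append(h)
        alphas.append(np.log(1.0 / beta))        # vote weight ln(1 / beta_i)

    def final_hypothesis(x):
        # weighted majority vote of the weak hypotheses
        votes = sum(a * h(x) for a, h in zip(alphas, hypotheses))
        return 1 if votes >= sum(alphas) / 2 else 0

    return final_hypothesis
```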
AdaBoost Analysis

- It can be shown that the error $\varepsilon$ of the final hypothesis $h$ is bounded by:
$$\varepsilon \leq 2^T \prod_{i=1}^{T} \sqrt{\varepsilon_i\left(1 - \varepsilon_i\right)} \leq e^{-2\sum_{i=1}^{T}\left(\frac{1}{2} - \varepsilon_i\right)^2} \leq e^{-2T\left(\frac{1}{2} - \gamma\right)^2}$$
(the last step uses $\varepsilon_i \leq \gamma < \frac{1}{2}$)
- Notice that we don't need to know $\gamma$ for AdaBoost to work
Questions?