1 The Hedge$_\beta$ Algorithm and Its Applications
Yogev Bar-On, Seminar on Experts and Bandits, Tel-Aviv University, November 2017. Based on: Y. Freund and R. E. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting".

2 Review – Prediction With Expert Advice (Binary Setting)
The problem: predict the next bit in a sequence over $T$ steps, where at step $i$:
- We choose a bit $b_i$
- We obtain the outcome bit $c_i$ and suffer loss $1$ if $b_i \ne c_i$
$N$ experts help us: before choosing $b_i$ we receive the advice vector $\boldsymbol{b}_i = (b_i^1, \ldots, b_i^N)$
Our goal is to minimize the regret:
$$\sum_{i=1}^{T} \mathbf{1}[b_i \ne c_i] - \min_{1 \le k \le N} \sum_{i=1}^{T} \mathbf{1}[b_i^k \ne c_i]$$

3 Review – Prediction With Expert Advice (Binary Setting) – Cont.
We used the Weighted-Majority Algorithm (WMA), with parameter $\beta \in (0,1)$:
- Initialize a weight vector $\boldsymbol{w}_0 = (1, \ldots, 1)$
- At step $i$:
  - Choose $b_i = 1$ if $\sum_{k=1}^{N} w_{i-1}^k \cdot b_i^k \ge \frac{1}{2} \sum_{k=1}^{N} w_{i-1}^k$, otherwise choose $b_i = 0$
  - Obtain the outcome $c_i$
  - If expert $k$ was correct, set $w_i^k = w_{i-1}^k$; otherwise, set $w_i^k = w_{i-1}^k \cdot \beta$
Let $L_A$ be our overall loss with WMA, and $L_k$ the overall loss of expert $k$. Then for every $k$:
$$L_A = O(L_k + \ln N)$$
A short sketch of the algorithm appears below.
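As a concrete illustration, here is a minimal Python sketch of the Weighted-Majority Algorithm described above; the function name and the way advice and outcomes are supplied (plain lists) are illustrative choices, not part of the original slides.

```python
def weighted_majority(advice, outcomes, beta=0.5):
    """Run the Weighted-Majority Algorithm.

    advice[i][k] is the bit predicted by expert k at step i,
    outcomes[i] is the true bit at step i, beta is the penalty factor in (0, 1).
    Returns the algorithm's total number of mistakes.
    """
    n = len(advice[0])          # number of experts
    w = [1.0] * n               # initial weights w_0 = (1, ..., 1)
    loss = 0
    for b_vec, c in zip(advice, outcomes):
        total = sum(w)
        # weighted vote: predict 1 iff the experts saying 1 hold at least half the weight
        b = 1 if sum(w[k] * b_vec[k] for k in range(n)) >= total / 2 else 0
        if b != c:
            loss += 1
        # multiply the weights of the experts that were wrong by beta
        w = [w[k] * (beta if b_vec[k] != c else 1.0) for k in range(n)]
    return loss
```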

4 What Are We Doing Today?
- Look at a more general setting of the Prediction With Expert Advice problem
- Generalize the Weighted-Majority Algorithm to this setting
- See an application (AdaBoost)

5 Prediction With Expert Advice (General Setting)
At step $i$, we obtain a loss vector $\boldsymbol{\ell}_i$; expert $k$ suffers loss $\ell_i^k \in [0,1]$
Instead of choosing a bit, we only choose a distribution $\boldsymbol{p}_i$ over the experts and suffer the average loss:
$$\boldsymbol{p}_i \cdot \boldsymbol{\ell}_i = \sum_{k=1}^{N} p_i^k \ell_i^k$$
Again, our goal is to minimize the regret:
$$\sum_{i=1}^{T} \boldsymbol{p}_i \cdot \boldsymbol{\ell}_i - \min_{1 \le k \le N} \sum_{i=1}^{T} \ell_i^k$$

6 The Hedge$_\beta$ Algorithm
We generalize the Weighted-Majority Algorithm to fit our new setting.
The Hedge$_\beta$ algorithm, with parameter $\beta \in (0,1)$:
- Initialize a weight vector $\boldsymbol{w}_0 = (\frac{1}{N}, \ldots, \frac{1}{N})$
- For $T$ steps, at step $i$:
  - Choose $\boldsymbol{p}_i = \boldsymbol{w}_{i-1} / \sum_{k=1}^{N} w_{i-1}^k$
  - Obtain the loss vector $\boldsymbol{\ell}_i$
  - Set $w_i^k = w_{i-1}^k \cdot \beta^{\ell_i^k}$
A short sketch follows below.
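A minimal Python sketch of the update above, assuming the losses arrive as a list of per-expert loss vectors in $[0,1]$; the function name and interface are illustrative, not from the slides.

```python
def hedge(loss_vectors, beta=0.9):
    """Run Hedge_beta on a sequence of loss vectors.

    loss_vectors[i][k] is the loss of expert k at step i, in [0, 1].
    Returns the algorithm's total expected loss sum_i p_i . l_i.
    """
    n = len(loss_vectors[0])
    w = [1.0 / n] * n                       # w_0 = (1/N, ..., 1/N)
    total_loss = 0.0
    for l in loss_vectors:
        s = sum(w)
        p = [w_k / s for w_k in w]          # p_i proportional to w_{i-1}
        total_loss += sum(p_k * l_k for p_k, l_k in zip(p, l))
        # exponential weight update: w_i^k = w_{i-1}^k * beta ** l_i^k
        w = [w_k * beta ** l_k for w_k, l_k in zip(w, l)]
    return total_loss
```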

7 Hedge$_\beta$ Analysis
Similar to the WMA analysis:
We derive upper and lower bounds for $\sum_{k=1}^{N} w_T^k$
Those imply a bound on our overall loss with the algorithm
We denote by $L_A$ our overall loss and by $L_k$ the overall loss of expert $k$

8 Hedge$_\beta$ Analysis – Cont.
We start with the easier lower bound. For all $1 \le k \le N$:
$$\sum_{k'=1}^{N} w_T^{k'} \ge w_T^k = w_0^k \, \beta^{\ell_1^k} \cdots \beta^{\ell_T^k} = w_0^k \, \beta^{\sum_{i=1}^{T} \ell_i^k} = \frac{1}{N} \beta^{L_k}$$

9 Hedge$_\beta$ Analysis – Cont.
For the upper bound, we first notice that for all $\ell \in [0,1]$: $\beta^{\ell} \le 1 - (1-\beta)\ell$
$\beta^{\ell}$ is convex in $\ell$, and $1 - (1-\beta)\ell$ is the line between $\beta^0$ and $\beta^1$
The second derivative of $\beta^{\ell}$ is $(\ln \beta)^2 \cdot \beta^{\ell}$, which is always positive
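Spelled out, the convexity argument for the inequality above (a standard chord bound, written out here for completeness): with $f(\ell) = \beta^{\ell}$,
$$\beta^{\ell} = f\big((1-\ell)\cdot 0 + \ell\cdot 1\big) \le (1-\ell)\,f(0) + \ell\,f(1) = 1 - (1-\beta)\,\ell, \qquad \ell \in [0,1],$$
where the inequality is exactly the convexity of $f$.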

10 Hedge$_\beta$ Analysis – Cont.
Now we can derive the upper bound for $\sum_{k=1}^{N} w_T^k$. For all $1 \le i \le T$:
$$\sum_{k=1}^{N} w_i^k = \sum_{k=1}^{N} w_{i-1}^k \beta^{\ell_i^k} \le \sum_{k=1}^{N} w_{i-1}^k \left(1 - (1-\beta)\ell_i^k\right) = \sum_{k=1}^{N} w_{i-1}^k - (1-\beta) \sum_{k=1}^{N} w_{i-1}^k \ell_i^k$$
$$= \left(\sum_{k=1}^{N} w_{i-1}^k\right)\left(1 - (1-\beta)\,\frac{\sum_{k=1}^{N} w_{i-1}^k \ell_i^k}{\sum_{k=1}^{N} w_{i-1}^k}\right) = \left(\sum_{k=1}^{N} w_{i-1}^k\right)\left(1 - (1-\beta)\,\boldsymbol{p}_i \cdot \boldsymbol{\ell}_i\right)$$

11 Hedge$_\beta$ Analysis – Cont.
Hence:
$$\sum_{k=1}^{N} w_T^k \le \left(\sum_{k=1}^{N} w_{T-1}^k\right)\left(1 - (1-\beta)\,\boldsymbol{p}_T \cdot \boldsymbol{\ell}_T\right) \le \ldots \le \left(\sum_{k=1}^{N} w_0^k\right) \prod_{i=1}^{T}\left(1 - (1-\beta)\,\boldsymbol{p}_i \cdot \boldsymbol{\ell}_i\right)$$
$$= \prod_{i=1}^{T}\left(1 - (1-\beta)\,\boldsymbol{p}_i \cdot \boldsymbol{\ell}_i\right) \le \prod_{i=1}^{T} e^{-(1-\beta)\,\boldsymbol{p}_i \cdot \boldsymbol{\ell}_i} = e^{-(1-\beta)\sum_{i=1}^{T} \boldsymbol{p}_i \cdot \boldsymbol{\ell}_i} = e^{-(1-\beta) L_A}$$
using the inequality $1 + x \le e^x$ (and $\sum_{k=1}^{N} w_0^k = 1$)

12 Hedge$_\beta$ Analysis – Cont.
Combining both bounds we obtain for all $1 \le k \le N$:
$$e^{-(1-\beta) L_A} \ge \frac{1}{N} \beta^{L_k}$$
Thus:
$$L_A \le \frac{-L_k \ln \beta + \ln N}{1-\beta}$$
and specifically:
$$L_A \le \frac{-\left(\min_{1 \le k \le N} L_k\right) \ln \beta + \ln N}{1-\beta}$$
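For completeness, the algebra behind the last step, spelled out since the slide skips it (a plain rearrangement of the combined bound):
$$e^{-(1-\beta) L_A} \ge \frac{1}{N}\,\beta^{L_k} \;\Longrightarrow\; -(1-\beta) L_A \ge L_k \ln \beta - \ln N \;\Longrightarrow\; L_A \le \frac{-L_k \ln \beta + \ln N}{1-\beta},$$
using that $\ln$ is monotone and $1-\beta > 0$.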

13 Hedge$_\beta$ Analysis – Cont.
$$L_A \le \frac{-\left(\min_{1 \le k \le N} L_k\right) \ln \beta + \ln N}{1-\beta}$$

14 Hedge$_\beta$ Analysis – Cont.
It can be shown that if any other algorithm for the problem satisfies, for some constants $a, b$:
$$L_A \le a \min_{1 \le k \le N} L_k + b \ln N$$
then for every $\beta \in (0,1)$, either $a \ge \frac{-\ln \beta}{1-\beta}$ or $b \ge \frac{1}{1-\beta}$

15 Choosing $\beta$
We would like to choose $\beta$ in a way that exploits any prior knowledge we have about the problem.
Let $\widetilde{L}$ be an upper bound on the overall loss of the best expert. We will choose:
$$\beta = \frac{1}{1 + \sqrt{2 \ln N / \widetilde{L}}}$$

16 Choosing $\beta$ – Cont.
We can use the inequality:
$$\frac{1+\beta}{2\beta} + \frac{\ln \beta}{1-\beta} \ge 0 \quad \text{for all } \beta \in (0,1),$$
equivalently, $-\ln \beta \le \frac{1-\beta^2}{2\beta}$.
Plugging this and our choice of $\beta$ into the bound gives the derivation sketched below.
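A sketch of how the inequality and the choice of $\beta$ yield the bound from Freund and Schapire (which the next slide specializes to $\widetilde{L} = T$); the intermediate algebra is a reconstruction of a step the slides skip. Writing $c = \sqrt{2 \ln N / \widetilde{L}}$, so that $\beta = \frac{1}{1+c}$, $\frac{1+\beta}{2\beta} = 1 + \frac{c}{2}$ and $\frac{1}{1-\beta} = 1 + \frac{1}{c}$:
$$L_A \le \frac{-L_k \ln \beta + \ln N}{1-\beta} \le L_k\,\frac{1+\beta}{2\beta} + \frac{\ln N}{1-\beta} = L_k\Big(1 + \frac{c}{2}\Big) + \ln N\Big(1 + \frac{1}{c}\Big) \le L_k + \sqrt{2\widetilde{L}\ln N} + \ln N,$$
where the last step uses $L_k \le \widetilde{L}$ and $\frac{c}{2}\widetilde{L} = \frac{\ln N}{c} = \frac{1}{2}\sqrt{2\widetilde{L}\ln N}$.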

17 Choosing $\beta$ – Cont.
If we know the number of steps $T$ ahead of time, we can bound the best expert's loss with $\widetilde{L} = T$
We will choose:
$$\beta = \frac{1}{1 + \sqrt{2 \ln N / T}}$$
and obtain:
$$L_A \le \min_{1 \le k \le N} L_k + \sqrt{2T \ln N} + \ln N$$
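To make the numbers concrete, a small sketch (the function names and example values are illustrative, not from the slides) that computes the tuned $\beta$ and the additive regret term for given $N$ and $T$:

```python
import math

def tuned_beta(n_experts, horizon):
    """beta = 1 / (1 + sqrt(2 ln N / T)), the choice above with L~ = T."""
    return 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_experts) / horizon))

def regret_bound(n_experts, horizon):
    """Additive regret term sqrt(2 T ln N) + ln N from the bound above."""
    return math.sqrt(2.0 * horizon * math.log(n_experts)) + math.log(n_experts)

# For example, with N = 10 experts and T = 1000 steps:
# tuned_beta(10, 1000) ~ 0.936 and regret_bound(10, 1000) ~ 70.2
```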

18 Applications - Boosting
One important application of Hedge$_\beta$ is boosting
Consider a "weak" learning algorithm with relatively large error
Boosting turns a weak learner into a strong learner

19 PAC Learning Model – Brief Intro
We have a domain $X$
We want to learn a target function $c: X \to \{0,1\}$
A PAC-learning algorithm:
- Input: labeled samples $(x, c(x))$, with $x \in X$ drawn from an unknown distribution $\mathcal{D}$, and parameters $\varepsilon, \delta > 0$
- Output: a hypothesis $h: X \to \{0,1\}$, limited to some class of hypotheses to avoid over-fitting
- Guarantee: with high probability ($1-\delta$) the hypothesis $h$ has low error ($\varepsilon$), and the algorithm runs in time polynomial in $\frac{1}{\varepsilon}$ and $\frac{1}{\delta}$:
$$\Pr\left[\,\mathbb{E}_{x \sim \mathcal{D}}\left|h(x) - c(x)\right| < \varepsilon\,\right] > 1 - \delta$$

20 PAC Learning Model – Brief Intro – Cont.
A weak PAC-learner is the same, but its error is only guaranteed to be at most $\gamma$, for some $\frac{1}{2} > \gamma \ge 0$ (i.e., it may be only slightly better than random guessing)
We can call the weak learner many times, each time with a different distribution $D_i$ on the samples
The learner will minimize the observed error relative to $D_i$: $\mathbb{E}_{x \sim D_i}\left|h_i(x) - c(x)\right|$
Combining the resulting hypotheses can lead to a smaller error (this is called boosting)

21 AdaBoost
AdaBoost is a boosting algorithm based on Hedge$_\beta$
The input:
- Some weak learner
- $N$ samples: $(x_1, y_1), \ldots, (x_N, y_N)$
- The number of steps $T$
The main idea:
- Use the samples as experts
- The distribution $\boldsymbol{p}_i$ on the experts is the distribution $D_i$ we provide to the weak learner
- Give more weight to samples with larger error

22 AdaBoost – Cont.
We initialize a weight vector $\boldsymbol{w}_0 = (\frac{1}{N}, \ldots, \frac{1}{N})$
For $T$ steps, at step $i$:
- Choose $D_i = \boldsymbol{w}_{i-1} / \sum_{k=1}^{N} w_{i-1}^k$
- Call the weak learner, providing $D_i$ as the distribution on the samples, and obtain a hypothesis $h_i$
- Calculate the observed error $\varepsilon_i$ of $h_i$:
$$\varepsilon_i = \mathbb{E}_{(x,y) \sim D_i}\left|h_i(x) - y\right| = \sum_{k=1}^{N} D_i^k \left|h_i(x_k) - y_k\right|$$
- Set $\beta_i = \frac{\varepsilon_i}{1 - \varepsilon_i}$ and $\ell_i^k = 1 - \left|h_i(x_k) - y_k\right|$
- Update the weight vector: $w_i^k = w_{i-1}^k \cdot \beta_i^{\ell_i^k}$

23 AdaBoost – Cont.
The final hypothesis $h$ is a weighted majority of the hypotheses $h_1, \ldots, h_T$: $h(x) = 1$ if
$$\frac{\sum_{i=1}^{T} \ln\left(\frac{1}{\beta_i}\right) h_i(x)}{\sum_{i=1}^{T} \ln\left(\frac{1}{\beta_i}\right)} \ge \frac{1}{2}$$
and $h(x) = 0$ otherwise
Notice the major differences between AdaBoost and Hedge$_\beta$:
- The loss $\ell_i$ measures how well the expert (sample) did, so misclassified samples gain relative weight
- $\beta$ is no longer fixed: $\beta_i$ depends on the observed error $\varepsilon_i$
A minimal sketch of the whole procedure appears below.
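A minimal Python sketch of the loop from slide 22 together with the final weighted-majority vote above, assuming a `weak_learner(samples, labels, dist)` callable that returns a hypothesis mapping $x$ to $\{0,1\}$; the interface and names are illustrative, not from the slides.

```python
import math

def adaboost(samples, labels, weak_learner, T):
    """AdaBoost as sketched above.

    samples, labels: N training examples (x_k, y_k) with y_k in {0, 1}.
    weak_learner(samples, labels, dist): returns a hypothesis h(x) -> {0, 1},
        trained to have small observed error under the distribution dist.
    Returns the final hypothesis h(x) -> {0, 1}.
    """
    n = len(samples)
    w = [1.0 / n] * n                          # w_0 = (1/N, ..., 1/N)
    hypotheses, betas = [], []
    for _ in range(T):
        s = sum(w)
        dist = [w_k / s for w_k in w]          # D_i proportional to w_{i-1}
        h = weak_learner(samples, labels, dist)
        # observed error of h under D_i
        err = sum(d * abs(h(x) - y) for d, x, y in zip(dist, samples, labels))
        err = min(max(err, 1e-12), 1 - 1e-12)  # clamp to avoid division by zero (not in the slides)
        beta = err / (1.0 - err)
        # loss l_i^k = 1 - |h(x_k) - y_k|: correctly classified samples get loss 1,
        # so their weights shrink by beta and misclassified samples gain relative weight
        w = [w_k * beta ** (1.0 - abs(h(x) - y))
             for w_k, x, y in zip(w, samples, labels)]
        hypotheses.append(h)
        betas.append(beta)

    def final_hypothesis(x):
        # weighted majority vote with weights ln(1/beta_i)
        votes = [math.log(1.0 / b) for b in betas]
        score = sum(v * h(x) for v, h in zip(votes, hypotheses))
        return 1 if score >= 0.5 * sum(votes) else 0

    return final_hypothesis
```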

24 AdaBoost Analysis
It can be shown that the error $\varepsilon$ of the final hypothesis $h$ is bounded by:
$$\varepsilon \le 2^T \prod_{i=1}^{T} \sqrt{\varepsilon_i (1-\varepsilon_i)} \le e^{-2 \sum_{i=1}^{T} \left(\frac{1}{2} - \varepsilon_i\right)^2} \le e^{-2T\left(\frac{1}{2} - \gamma\right)^2}$$
For example, if every observed error satisfies $\varepsilon_i \le \gamma = 0.4$ and $T = 200$, the bound is $e^{-4} \approx 0.018$.
Notice that we don't need to know $\gamma$ for AdaBoost to work.

25 Questions?

