The Hedge(β) Algorithm and Its Applications
Yogev Bar-On, Seminar on Experts and Bandits, Tel-Aviv University, November 2017
Based on: Y. Freund and R. E. Schapire, "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting"
Review – Prediction With Expert Advice (Binary Setting)
The problem: predict the next bit in a sequence over $T$ steps, where at step $t$:
- We choose a bit $b_t$.
- We obtain the outcome bit $c_t$, and suffer loss $1$ if $b_t \neq c_t$.
- $N$ experts help us: before choosing $b_t$ we receive the advice vector $\xi_t = (\xi_t^1, \ldots, \xi_t^N)$.
Our goal is to minimize the regret:
$$\sum_{t=1}^{T} |b_t - c_t| \;-\; \min_{1 \le i \le N} \sum_{t=1}^{T} |\xi_t^i - c_t|$$
Review – Prediction With Expert Advice (Binary Setting) – Cont.
We used the Weighted-Majority Algorithm (WMA), with parameter $\beta \in (0,1)$:
- Initialize a weight vector $w_0 = (1, \ldots, 1)$.
- At step $t$:
  - Choose $b_t = 1$ if $\sum_{i=1}^{N} w_{t-1}^i \xi_t^i \ge \frac{1}{2} \sum_{i=1}^{N} w_{t-1}^i$; otherwise, choose $b_t = 0$.
  - Obtain the outcome $c_t$.
  - If expert $i$ was correct, set $w_t^i = w_{t-1}^i$; otherwise, set $w_t^i = w_{t-1}^i \cdot \beta$.
Let $L_A$ be our overall loss with WMA, and $L_i$ the overall loss of expert $i$. Then
$$L_A = O\!\left(L_i + \ln N\right).$$
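To make the review concrete, here is a minimal Python sketch of WMA; the function and variable names (`wma_run`, `advice_rounds`, `outcomes`) are illustrative and not from the original slides.

```python
def wma_run(advice_rounds, outcomes, beta=0.5):
    """Sketch of the Weighted-Majority Algorithm (names are illustrative).
    advice_rounds[t][i] is expert i's predicted bit at step t; outcomes[t] is the true bit."""
    n_experts = len(advice_rounds[0])
    weights = [1.0] * n_experts                       # w_0 = (1, ..., 1)
    mistakes = 0
    for advice, outcome in zip(advice_rounds, outcomes):
        # Predict 1 iff the experts advising 1 hold at least half of the total weight.
        weight_for_one = sum(w for w, a in zip(weights, advice) if a == 1)
        prediction = 1 if weight_for_one >= 0.5 * sum(weights) else 0
        mistakes += int(prediction != outcome)
        # Multiply the weight of every mistaken expert by beta; correct experts keep their weight.
        weights = [w * (beta if a != outcome else 1.0) for w, a in zip(weights, advice)]
    return mistakes

# Illustrative run: 3 experts over 2 steps.
print(wma_run([[1, 0, 1], [0, 0, 1]], [1, 1]))
```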
What Are We Doing Today?
- Look at a more general setting of the Prediction With Expert Advice problem.
- Generalize the Weighted-Majority Algorithm to this setting.
- See applications (AdaBoost).
Prediction With Expert Advice (General Setting)
At step $t$ we obtain a loss vector $\ell_t$, where expert $i$ suffers loss $\ell_t^i \in [0,1]$.
Instead of choosing a bit, we only choose a distribution $p_t$ over the experts and suffer the average loss:
$$p_t \cdot \ell_t = \sum_{i=1}^{N} p_t^i \ell_t^i$$
Again, our goal is to minimize the regret:
$$\sum_{t=1}^{T} p_t \cdot \ell_t \;-\; \min_{1 \le i \le N} \sum_{t=1}^{T} \ell_t^i$$
The Hedge(β) Algorithm
We generalize the Weighted-Majority Algorithm to fit our new setting.
The Hedge algorithm, with parameter $\beta \in (0,1)$:
- Initialize a weight vector $w_0 = \left(\frac{1}{N}, \ldots, \frac{1}{N}\right)$.
- For $T$ steps, at step $t$:
  - Choose $p_t = \frac{w_{t-1}}{\sum_{i=1}^{N} w_{t-1}^i}$.
  - Obtain the loss vector $\ell_t$.
  - Set $w_t^i = w_{t-1}^i \cdot \beta^{\ell_t^i}$.
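As a concrete illustration, here is a minimal sketch of Hedge(β) in Python under the setting above; names such as `hedge` and `loss_matrix` are mine, not from the slides.

```python
import numpy as np

def hedge(loss_matrix, beta):
    """Sketch of Hedge(beta); loss_matrix[t, i] is the loss of expert i at step t, in [0, 1]."""
    T, N = loss_matrix.shape
    w = np.full(N, 1.0 / N)                      # w_0 = (1/N, ..., 1/N)
    total_loss = 0.0
    for t in range(T):
        p = w / w.sum()                          # p_t is proportional to the current weights
        total_loss += float(p @ loss_matrix[t])  # suffer the average loss p_t . l_t
        w = w * beta ** loss_matrix[t]           # w_t^i = w_{t-1}^i * beta^{l_t^i}
    return total_loss

# Illustrative run: 5 experts, 100 steps of random losses; compare to the best expert's loss.
rng = np.random.default_rng(0)
losses = rng.random((100, 5))
print(hedge(losses, beta=0.8), losses.sum(axis=0).min())
```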
Hedge(β) Analysis
The analysis is similar to the WMA analysis:
- We derive upper and lower bounds for $\sum_{i=1}^{N} w_T^i$.
- Those imply a bound on our overall loss with the algorithm.
We denote by $L_A$ our overall loss and by $L_i$ the overall loss of expert $i$.
Hedge(β) Analysis – Cont.
We start with the easier lower bound. For all $1 \le i \le N$:
$$\sum_{i'=1}^{N} w_T^{i'} \;\ge\; w_T^i \;=\; w_0^i \cdot \beta^{\ell_1^i} \cdots \beta^{\ell_T^i} \;=\; w_0^i \cdot \beta^{\sum_{t=1}^{T} \ell_t^i} \;=\; \frac{1}{N} \beta^{L_i}$$
Hedge(β) Analysis – Cont.
For the upper bound, we first notice that for all $\ell \in [0,1]$:
$$\beta^{\ell} \le 1 - (1-\beta)\ell$$
This holds because $\beta^{\ell}$ is convex in $\ell$ (its second derivative is $\ln^2\!\beta \cdot \beta^{\ell}$, which is always positive), while $1 - (1-\beta)\ell$ is the line between $\beta^0$ and $\beta^1$.
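A quick numerical sanity check of this inequality on a grid of values (illustrative only):

```python
import numpy as np

# Sanity check of beta**l <= 1 - (1 - beta) * l on a grid of l in [0, 1] and a few betas.
l = np.linspace(0.0, 1.0, 1001)
for beta in (0.1, 0.5, 0.9):
    assert np.all(beta ** l <= 1.0 - (1.0 - beta) * l + 1e-12)
print("inequality holds on the sampled grid")
```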
Hedge(β) Analysis – Cont.
Now we can derive the upper bound for $\sum_{i=1}^{N} w_t^i$. For all $1 \le t \le T$:
$$\sum_{i=1}^{N} w_t^i = \sum_{i=1}^{N} w_{t-1}^i \beta^{\ell_t^i} \le \sum_{i=1}^{N} w_{t-1}^i \left(1 - (1-\beta)\ell_t^i\right) = \sum_{i=1}^{N} w_{t-1}^i - (1-\beta)\sum_{i=1}^{N} w_{t-1}^i \ell_t^i$$
$$= \left(\sum_{i=1}^{N} w_{t-1}^i\right)\left(1 - (1-\beta)\,\frac{\sum_{i=1}^{N} w_{t-1}^i \ell_t^i}{\sum_{i=1}^{N} w_{t-1}^i}\right) = \left(\sum_{i=1}^{N} w_{t-1}^i\right)\left(1 - (1-\beta)\, p_t \cdot \ell_t\right)$$
Hedge(β) Analysis – Cont.
Hence (using the inequality $1 + x \le e^x$ in the last step, and $\sum_{i=1}^{N} w_0^i = 1$):
$$\sum_{i=1}^{N} w_T^i \le \left(\sum_{i=1}^{N} w_{T-1}^i\right)\left(1 - (1-\beta)\, p_T \cdot \ell_T\right) \le \ldots \le \left(\sum_{i=1}^{N} w_0^i\right)\prod_{t=1}^{T}\left(1 - (1-\beta)\, p_t \cdot \ell_t\right)$$
$$= \prod_{t=1}^{T}\left(1 - (1-\beta)\, p_t \cdot \ell_t\right) \le \prod_{t=1}^{T} e^{-(1-\beta)\, p_t \cdot \ell_t} = e^{-(1-\beta)\sum_{t=1}^{T} p_t \cdot \ell_t} = e^{-(1-\beta) L_A}$$
Hedge(β) Analysis – Cont.
Combining both bounds, we obtain for all $1 \le i \le N$:
$$e^{-(1-\beta) L_A} \ge \frac{1}{N} \beta^{L_i}$$
Taking logarithms of both sides and rearranging:
$$L_A \le \frac{-\ln\beta}{1-\beta}\, L_i + \frac{\ln N}{1-\beta}$$
And specifically:
$$L_A \le \frac{-\ln\beta}{1-\beta}\, \min_{1 \le i \le N} L_i + \frac{\ln N}{1-\beta}$$
Hedge(β) Analysis – Cont.
$$L_A \le \frac{-\ln\beta}{1-\beta}\, \min_{1 \le i \le N} L_i + \frac{\ln N}{1-\beta}$$
Hedge(β) Analysis – Cont.
It can be shown that any other algorithm for the problem that satisfies, for some constants $a, c$,
$$L_A \le a \min_{1 \le i \le N} L_i + c \ln N$$
must have either $a \ge \frac{-\ln\beta}{1-\beta}$ or $c \ge \frac{1}{1-\beta}$, for all $\beta \in (0,1)$. In other words, the coefficients achieved by Hedge(β) are essentially optimal.
Choosing β
We would like to choose $\beta$ in a way that exploits any prior knowledge we have about the problem.
Let $\tilde{L}$ be an upper bound on the overall loss of the best expert. We will choose:
$$\beta = \frac{1}{1 + \sqrt{2\ln N / \tilde{L}}}$$
Choosing β – Cont.
We can use the inequality
$$\frac{1+\beta}{2\beta} + \frac{\ln\beta}{1-\beta} \ge 0 \quad \text{for all } \beta \in (0,1),$$
or equivalently $\frac{-\ln\beta}{1-\beta} \le \frac{1+\beta}{2\beta}$.
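As a sketch of how this inequality is used (following the derivation in Freund and Schapire, and assuming $L_i \le \tilde{L}$ with $\beta$ chosen as on the previous slide), one obtains the bound shown on the next slide:

```latex
% Sketch of the derivation, assuming L_i <= \tilde{L} and
% \beta = 1/(1+z) with z = \sqrt{2 \ln N / \tilde{L}}.
\begin{align*}
\frac{-\ln\beta}{1-\beta} &\le \frac{1+\beta}{2\beta} = 1 + \frac{1-\beta}{2\beta} = 1 + \frac{z}{2},
  &\frac{1}{1-\beta} &= \frac{1+z}{z} = 1 + \frac{1}{z},\\
L_A &\le \frac{-\ln\beta}{1-\beta}\, L_i + \frac{\ln N}{1-\beta}
     \le L_i + \frac{z\tilde{L}}{2} + \frac{\ln N}{z} + \ln N
     = L_i + \sqrt{2\tilde{L}\ln N} + \ln N.
\end{align*}
```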
Choosing β – Cont.
If we know the number of steps $T$ ahead of time, we can bound the loss of the best expert with $\tilde{L} = T$. We will choose:
$$\beta = \frac{1}{1 + \sqrt{2\ln N / T}}$$
and obtain:
$$L_A \le \min_{1 \le i \le N} L_i + \ln N + \sqrt{2 T \ln N}$$
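For a sense of scale, a small illustrative computation of $\beta$ and the resulting bound for concrete values of $N$ and $T$ (the numbers are arbitrary examples):

```python
import math

# Illustrative numbers: with beta chosen as above, compare the regret bound to the horizon T.
N, T = 100, 10_000
beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(N) / T))
regret_bound = math.log(N) + math.sqrt(2.0 * T * math.log(N))
print(f"beta = {beta:.4f}, regret bound = {regret_bound:.1f}, horizon T = {T}")
```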
Applications - Boosting
One important application of Hedge(β) is boosting.
Consider a "weak" learning algorithm with a relatively large error.
Boosting turns such a weak-learner into a strong-learner.
PAC Learning Model – Brief Intro
We have a domain $X$, and we want to learn a target function $c: X \to \{0,1\}$.
A PAC-learning algorithm:
- Input: labeled samples $(x, c(x))$, with $x \in X$ drawn randomly from an unknown distribution $P$, and parameters $\varepsilon, \delta > 0$.
- Output: a hypothesis $h: X \to \{0,1\}$, limited to some class of hypotheses to avoid over-fitting.
- Guarantee: with high probability ($1-\delta$), $h$ has low error ($\varepsilon$):
$$\Pr\!\left[\, \mathbb{E}_{x \sim P}\left|h(x) - c(x)\right| < \varepsilon \,\right] > 1 - \delta$$
- Runs in time polynomial in $\frac{1}{\varepsilon}$ and $\frac{1}{\delta}$.
PAC Learning Model – Brief Intro – Cont.
A weak PAC-learner is the same, except its error is always larger than $\gamma$, for some $\frac{1}{2} > \gamma \ge 0$ (it cannot be made arbitrarily small).
We can call the weak-learner many times, each time with a different distribution $D_t$ on the samples.
The learner will minimize the observed error relative to $D_t$:
$$\mathbb{E}_{x \sim D_t}\left|h(x) - c(x)\right|$$
Combining the resulting hypotheses can lead to a smaller error (this is called boosting).
AdaBoost
AdaBoost is a boosting algorithm based on Hedge(β). The input:
- A weak-learner
- $N$ samples: $(x_1, y_1), \ldots, (x_N, y_N)$
- The number of steps $T$
The main idea:
- Use the samples as experts.
- The distribution $p_t$ on the experts is the distribution $D_t$ we provide to the weak-learner.
- Give more weight to samples with larger error.
AdaBoost – Cont.
We initialize a weight vector $w_0 = \left(\frac{1}{N}, \ldots, \frac{1}{N}\right)$.
For $T$ steps, at step $t$:
- Choose $D_t = \frac{w_{t-1}}{\sum_{i=1}^{N} w_{t-1}^i}$.
- Call the weak-learner, providing $D_t$ as the distribution on the samples, and obtain a hypothesis $h_t$.
- Calculate the observed error $\varepsilon_t$ of $h_t$:
$$\varepsilon_t = \mathbb{E}_{(x,y) \sim D_t}\left|h_t(x) - y\right| = \sum_{i=1}^{N} D_t^i \left|h_t(x_i) - y_i\right|$$
- Set $\beta_t = \frac{\varepsilon_t}{1-\varepsilon_t}$ and $\ell_t^i = 1 - \left|h_t(x_i) - y_i\right|$.
- Update the weight vector: $w_t^i = w_{t-1}^i \cdot \beta_t^{\ell_t^i}$.
AdaBoost – Cont.
The final hypothesis $h$ is a weighted majority of the hypotheses $\{h_t \mid 1 \le t \le T\}$:
$$h(x) = \begin{cases} 1 & \text{if } \dfrac{\sum_{t=1}^{T} \ln\!\left(\frac{1}{\beta_t}\right) h_t(x)}{\sum_{t=1}^{T} \ln\!\left(\frac{1}{\beta_t}\right)} \ge \dfrac{1}{2} \\[2ex] 0 & \text{otherwise} \end{cases}$$
Notice the major differences between AdaBoost and Hedge(β):
- The loss $\ell_t^i$ measures how well the hypothesis did on expert (sample) $i$, so correctly classified samples are down-weighted while misclassified samples keep their weight.
- $\beta$ is no longer fixed: it changes from step to step.
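Putting the pieces together, here is a compact Python sketch of AdaBoost as described above; the weak learner is an abstract callable and all names are illustrative, not from the slides.

```python
import math
import numpy as np

def adaboost(weak_learner, X, y, T):
    """Sketch of AdaBoost (names are illustrative).
    weak_learner(X, y, D) must return a hypothesis h: x -> {0, 1}; X is an array of samples,
    y an array of 0/1 labels, D a distribution over the samples.
    Assumes every returned hypothesis has observed error 0 < eps_t < 1/2."""
    N = len(X)
    w = np.full(N, 1.0 / N)                    # w_0 = (1/N, ..., 1/N)
    hypotheses, betas = [], []
    for _ in range(T):
        D = w / w.sum()                        # D_t proportional to the current weights
        h = weak_learner(X, y, D)
        preds = np.array([h(x) for x in X])
        eps = float(D @ np.abs(preds - y))     # observed error of h_t under D_t
        beta_t = eps / (1.0 - eps)
        loss = 1.0 - np.abs(preds - y)         # l_t^i = 1 - |h_t(x_i) - y_i|
        w = w * beta_t ** loss                 # correctly classified samples are down-weighted
        hypotheses.append(h)
        betas.append(beta_t)

    def final_hypothesis(x):
        # Weighted majority of h_1, ..., h_T with weights ln(1/beta_t).
        num = sum(math.log(1.0 / b) * h(x) for h, b in zip(hypotheses, betas))
        den = sum(math.log(1.0 / b) for b in betas)
        return 1 if num >= 0.5 * den else 0

    return final_hypothesis

# Tiny illustrative weak learner: the best threshold rule (decision stump) on 1-D data under D.
def stump_learner(X, y, D):
    best = None
    for thr in sorted(set(X)):
        for sign in (0, 1):
            preds = [(1 - sign) if x < thr else sign for x in X]
            err = sum(d for d, p, t in zip(D, preds, y) if p != t)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    _, thr, sign = best
    return lambda x: (1 - sign) if x < thr else sign

X = np.array([0.1, 0.3, 0.45, 0.6, 0.8, 0.9])
y = np.array([0, 0, 1, 0, 1, 1])
h = adaboost(stump_learner, X, y, T=5)
print([h(x) for x in X], list(y))
```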
AdaBoost Analysis
It can be shown that the error $\varepsilon$ of the final hypothesis $h$ is bounded by:
$$\varepsilon \le 2^T \prod_{t=1}^{T} \sqrt{\varepsilon_t (1 - \varepsilon_t)} \le e^{-2\sum_{t=1}^{T} \gamma_t^2} \le e^{-2 T \gamma^2}$$
where $\gamma_t = \frac{1}{2} - \varepsilon_t$ and $\gamma \le \gamma_t$ for all $t$.
Notice we do not need to know $\gamma$ for AdaBoost to work.
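A quick illustrative check of the last two inequalities in this chain, for some example per-round errors (the values are arbitrary):

```python
import math

# Check 2^T * prod sqrt(eps_t (1-eps_t)) <= exp(-2 sum gamma_t^2) <= exp(-2 T gamma^2).
eps = [0.40, 0.35, 0.45, 0.30]                # example weak-learner errors, all below 1/2
T = len(eps)
gammas = [0.5 - e for e in eps]               # gamma_t = 1/2 - eps_t
gamma = min(gammas)                           # the largest gamma with gamma <= gamma_t for all t
first = 2 ** T * math.prod(math.sqrt(e * (1 - e)) for e in eps)
second = math.exp(-2 * sum(g * g for g in gammas))
third = math.exp(-2 * T * gamma ** 2)
print(first <= second <= third, round(first, 4), round(second, 4), round(third, 4))
```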
Questions?