Published by Sebastian Gill. Modified over 3 years ago.

1
Regret to the Best vs. Regret to the Average
Eyal Even-Dar, Computer and Information Science, University of Pennsylvania
Collaborators: Michael Kearns (Penn), Yishay Mansour (Tel Aviv), Jenn Wortman (Penn)

2
The No-Regret Setting
Learner maintains a weighting over N experts
On each of T trials, learner observes payoffs for all N experts
–Payoff to the learner = weighted payoff
–Learner then dynamically adjusts weights
Let $R_{i,T}$ be the cumulative payoff of expert i on some sequence of T trials
Let $R_{A,T}$ be the cumulative payoff of learning algorithm A
Classical no-regret results: we can produce a learning algorithm A such that on any sequence of trials, $R_{A,T} > \max_i R_{i,T} - \sqrt{T \log N}$
–No regret: per-trial regret $\sqrt{\log(N)/T}$ approaches 0 as T grows
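
As a minimal sketch of the setting above, the following Python computes the learner's cumulative payoff and its regret to the best and to the average expert for a given sequence of payoff vectors and weightings. The function name `regrets` and the list-of-lists input layout are illustrative assumptions, not part of the talk.

```python
from typing import List, Tuple


def regrets(payoffs: List[List[float]],
            weights: List[List[float]]) -> Tuple[float, float]:
    """Return (regret to best expert, regret to average expert).

    payoffs[t][i] is the payoff of expert i on trial t;
    weights[t][i] is the learner's weight on expert i on trial t
    (each weights[t] is assumed to sum to 1).
    """
    n = len(payoffs[0])
    # Cumulative payoff R_{i,T} of each expert.
    expert_totals = [sum(row[i] for row in payoffs) for i in range(n)]
    # The learner's payoff on each trial is the weighted payoff.
    learner_total = sum(
        sum(w * g for w, g in zip(w_row, g_row))
        for w_row, g_row in zip(weights, payoffs)
    )
    best = max(expert_totals)
    average = sum(expert_totals) / n
    return best - learner_total, average - learner_total


# Uniform weights on two experts: zero regret to the average by construction.
gains = [[1, 0], [1, 0], [0, 1], [1, 0]]
uniform = [[0.5, 0.5]] * len(gains)
print(regrets(gains, uniform))  # (1.0, 0.0)
```

Holding the weights uniform never loses to the average, which is why regret to the average alone requires no learning.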

3
This Work
We simultaneously examine:
–Regret to best expert in hindsight
–Regret to the average return of all experts
Note that no learning is required to achieve just this!
Why look at the average?
–A safety net or sanity check
–Simple algorithm outperforms
–Future direction: S&P 500
We assume a fixed horizon T
–But this can easily be relaxed…

4
Our Results
Every difference-based algorithm with regret $O(T^{\alpha})$ to the best expert has $\Omega(T^{1-\alpha})$ regret to the average
There exists a simple difference-based algorithm achieving the tradeoff
Every algorithm with $O(T^{1/2})$ regret to the best expert must have $\Omega(T^{1/2})$ regret to the average
We can produce an algorithm with $O(T^{1/2} \log T)$ regret to the best and $O(1)$ regret to the average

5
Oscillations: The Cost of an Update
Consider 2 experts with instantaneous gains in {0,1}
Let w be the weight on the first expert and initialize w = ½
Suppose expert 1 gets a gain of 1 on the first time step, and expert 2 gets a gain of 1 on the second…
[Figure: weight on expert 1 moves $w \to w + \Delta \to w$ under gains (1,0), (0,1)]
Best, worst, and average all earn 1
Algorithm earns $w + (1 - w - \Delta) = 1 - \Delta$
Regret to Best = Regret to Worst = Regret to Average = $\Delta$
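
Spelling out the two-step computation may help. This assumes, consistently with the later slides, that the update step is written $\Delta$ and that the weight on expert 1 moves from $w$ to $w + \Delta$ after the gain (1,0):

```latex
\begin{align*}
\text{step 1, gains } (1,0):&\quad w\cdot 1 + (1-w)\cdot 0 = w \\
\text{step 2, gains } (0,1):&\quad (w+\Delta)\cdot 0 + (1-w-\Delta)\cdot 1 = 1-w-\Delta \\
\text{total}:&\quad w + (1-w-\Delta) = 1-\Delta
\end{align*}
```

Each expert earns exactly 1 over the two steps, so the shortfall of $\Delta$ is the same against the best, the worst, and the average.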

6
A Bad Sequence
Consider the following sequence
–Expert 1: 1,0,1,0,1,0,1,0,…,1,0
–Expert 2: 0,1,0,1,0,1,0,1,…,0,1
We can examine w over time for existing algorithms…
Follow the Perturbed Leader: ½, ½ + $\frac{1}{\sqrt{T(1+\ln 2)}} - \frac{1}{2T}$, ½, ½ + $\frac{1}{\sqrt{T(1+\ln 2)}} - \frac{1}{2T}$, ½, …
Weighted Majority: ½, ½ + $\frac{\sqrt{\ln(2)/(2T)}}{1+\sqrt{\ln(2)/(2T)}}$, ½, ½ + $\frac{\sqrt{\ln(2)/(2T)}}{1+\sqrt{\ln(2)/(2T)}}$, ½, ...
Both will lose to best, worst, and average
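
To see how the loss accumulates on this sequence, here is a small sketch. It does not implement Follow the Perturbed Leader or Weighted Majority; it assumes a hypothetical learner whose weight on expert 1 alternates between ½ and ½ + `delta`, which is the pattern both algorithms show above. The function name and parameters are illustrative.

```python
from fractions import Fraction


def alternating_regret_to_average(T: int, delta: Fraction) -> Fraction:
    """Regret to the average on the sequence (1,0),(0,1),(1,0),...

    Assumes a hypothetical learner whose weight on expert 1 is 1/2
    before each (1,0) trial and 1/2 + delta before each (0,1) trial.
    """
    w = Fraction(1, 2)
    learner = Fraction(0)
    for t in range(T):
        if t % 2 == 0:          # gains (1, 0)
            learner += w
            w += delta          # shift toward expert 1
        else:                   # gains (0, 1)
            learner += 1 - w
            w -= delta          # shift back
    # The two experts together earn 1 per trial, so their average is T/2.
    average = Fraction(T, 2)
    return average - learner


print(alternating_regret_to_average(100, Fraction(1, 10)))  # 5
```

Every pair of trials costs `delta`, so the regret to the average grows as `delta * T / 2`.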

7
A Simple Trade-off: The $\Omega(T)$ Barrier
Again, consider 2 experts with instantaneous gains in {0,1}
Let w be the weight on the first expert and initialize w = ½
Will first examine algorithms that depend only on cumulative difference in payoffs
–Insight holds more generally for aggressive updating
[Figure: repeated gains (1,0) move w from ½ through ½ + $\Delta$ up to 2/3]
–L steps to reach w = 2/3: regret to best > L/3
–Some update $\Delta_t > 1/(6L)$
–T steps oscillating at that update: regret to average ~ $(T/2) \cdot 1/(6L)$ ~ $\Omega(T/L)$
Regret to Best * Regret to Average ~ $\Omega(T)$!
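
The arithmetic behind the product bound can be written out as follows, taking the slide's quantities at face value. Here $L$ is the number of (1,0) steps needed to move $w$ from ½ to 2/3, and the constants are only indicative:

```latex
\begin{align*}
\text{Regret to best} &> \frac{L}{3}, \\
\text{Regret to average} &\gtrsim \frac{T}{2}\cdot\frac{1}{6L} = \frac{T}{12L}, \\
\text{Regret to best}\times\text{Regret to average} &\gtrsim \frac{L}{3}\cdot\frac{T}{12L} = \frac{T}{36} = \Omega(T).
\end{align*}
```

The dependence on $L$ cancels, so updating more cautiously (larger $L$) only moves regret from the average to the best.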

8
Exponential Weights [F94]
Unnormalized weight on expert i at time t: $w_{i,t} = e^{\eta R_{i,t}}$
Define $W_t = \sum_i w_{i,t}$, so we have $p_{i,t} = w_{i,t} / W_t$
Let N be the number of experts
Setting $\eta = O(1/T^{1/2})$ achieves $O(T^{1/2})$ regret to the best
Setting $\eta = O(1/T^{1/2+\alpha})$ achieves $O(T^{1/2+\alpha})$ regret to the best
It can be shown that with $\eta = O(1/T^{1/2+\alpha})$ the regret to the average is $O(T^{1/2-\alpha})$
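
A minimal sketch of the Exponential Weights update described above. It assumes the distribution used on a trial is computed from the cumulative payoffs observed before that trial; the function name `exponential_weights` and the input layout are illustrative.

```python
import math
from typing import List


def exponential_weights(payoffs: List[List[float]], eta: float) -> List[List[float]]:
    """Return the probability vector p_t used on each trial.

    p_{i,t} is proportional to exp(eta * R_i), where R_i is the cumulative
    payoff of expert i before trial t (an assumption about indexing).
    """
    n = len(payoffs[0])
    cumulative = [0.0] * n
    history = []
    for gains in payoffs:
        # Subtract the max before exponentiating for numerical stability;
        # this does not change the normalized probabilities.
        top = max(cumulative)
        w = [math.exp(eta * (r - top)) for r in cumulative]
        total = sum(w)
        history.append([x / total for x in w])
        cumulative = [r + g for r, g in zip(cumulative, gains)]
    return history


T = 1000
gains = [[1.0, 0.0]] * T                  # expert 1 always wins
probs = exponential_weights(gains, eta=1 / math.sqrt(T))
learner = sum(p[0] for p in probs)        # learner's payoff is its weight on expert 1
print(T - learner)                        # regret to the best expert
```

With `eta` near $1/\sqrt{T}$ the printed regret stays on the order of $\sqrt{T}$ on this sequence; smaller `eta` keeps the weights closer to uniform and so closer to the average.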

9
So far…
[Figure: plot of regret to best ~ $T^x$ against regret to average ~ $T^y$, with ticks at 1/2 and 1, labeled "cumulative difference algorithms"]

10
An Unrestricted Lower Bound
Any algorithm achieving $O(T^{1/2})$ regret to best must suffer $\Omega(T^{1/2})$ regret to average
Any algorithm achieving $O((T \log T)^{1/2})$ regret to best must suffer $\Omega(T^{\epsilon})$ regret to the average, for some constant $\epsilon > 0$
Not restricted to cumulative difference algorithms!
[Figure: plot of regret to best ~ $T^x$ against regret to average ~ $T^y$, with ticks at 1/2 and 1, labeled "all algorithms" and "cumulative difference algorithms"]

11
A Simple Additive Algorithm
Once again, 2 experts with instantaneous gains in {0,1}, w initialized to ½
Let $D_t$ be the difference in cumulative payoffs of the two experts at time t
The algorithm will make the following updates
–If expert gains are (0,0) or (1,1): no change to w
–If expert gains are (1,0): $w \leftarrow w + \Delta$
–If expert gains are (0,1): $w \leftarrow w - \Delta$
Assume we never reach w = 1
For any difference $D_t = d$ we have $w = \tfrac{1}{2} + d\Delta$
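
The update rule can be sketched in a few lines of Python. The step size is written `delta` here (the symbol $\Delta$ used on later slides), exact fractions avoid rounding noise, and the function name is illustrative.

```python
from fractions import Fraction
from typing import Iterable, List, Tuple


def additive_weights(gains: Iterable[Tuple[int, int]], delta: Fraction) -> List[Fraction]:
    """Weight on expert 1 after each trial of the additive update.

    Assumes, as on the slide, that w never reaches 1 (or 0).
    """
    w = Fraction(1, 2)
    trace = []
    for g1, g2 in gains:
        if (g1, g2) == (1, 0):
            w += delta
        elif (g1, g2) == (0, 1):
            w -= delta
        # (0,0) and (1,1) leave w unchanged.
        trace.append(w)
    return trace


delta = Fraction(1, 100)
sequence = [(1, 0), (1, 0), (1, 1), (0, 1), (1, 0)]
trace = additive_weights(sequence, delta)
d = sum(g1 - g2 for g1, g2 in sequence)          # cumulative difference D_t
print(trace[-1] == Fraction(1, 2) + d * delta)   # True
```

The final check confirms that the weight depends only on the cumulative difference, not on the order in which the gains arrived.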

12
Breaking the $\Omega(T)$ Barrier
While $|D_t| < H$
–(0,0) or (1,1): no change to w
–(1,0): $w \leftarrow w + \Delta$
–(0,1): $w \leftarrow w - \Delta$
Play EW with learning rate $\eta$
Will analyze what happens:
1. If we stay in the loop
2. If we exit the loop

13
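Staying in the Loop
While $|D_t| < H$
–(0,0) or (1,1): no change to w
–(1,0): $w \leftarrow w + \Delta$
–(0,1): $w \leftarrow w - \Delta$
Observe $R_{\mathrm{best},t} - R_{\mathrm{avg},t} < H$
Enough to compute regret to the average
[Figure: distance $D_t$ against time t, oscillating between d and d+1; weight moves $w \to w + \Delta \to w$ under gains (1,0), (0,1)]
Lose $\Delta$ to Best & Average
Regret to the Average at most $\Delta T$
Regret to the Best at most $H + \Delta T$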

14
Exiting the Loop
While $|D_t| < H$
–(0,0) or (1,1): no change to w
–(1,0): $w \leftarrow w + \Delta$
–(0,1): $w \leftarrow w - \Delta$
Play EW with learning rate $\eta$
[Figure: distance $D_t$ against time t, rising from d to d+1 under gain (1,0); weight moves $w \to w + \Delta$; lose $1 - w$ to Best, gain $w - \tfrac{1}{2}$ over Average]
Upon exit from loop:
–Regret to the best: still at most $H + \Delta T$
–Gain over the average: $(\dots + H\Delta) - \Delta T \sim \Delta H^2 - \Delta T$
So e.g. $H = T^{2/3}$ and $\Delta = 1/T$ gives
–Regret to best: < $T^{2/3}$ in loop or upon exit
–Regret to average: constant in loop; but gain $T^{1/3}$ upon exit
Now EW regret to the best $T^{2/3}$ and to the average $T^{1/3}$
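
Plugging the example values into the bounds above gives the following, where $\Delta$ denotes the additive step size and constants are ignored as on the slide:

```latex
\begin{align*}
H = T^{2/3},\ \Delta = \tfrac{1}{T}:\qquad
\Delta T &= 1, \\
H + \Delta T &= T^{2/3} + 1, \\
\Delta H^2 - \Delta T &= \frac{T^{4/3}}{T} - 1 = T^{1/3} - 1 .
\end{align*}
```

The gain of roughly $T^{1/3}$ banked on exit is what pays for the $T^{1/3}$ regret to the average that EW may incur afterwards.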

15
[Figure: plot of regret to best ~ $T^x$ against regret to avg ~ $T^y$, with ticks at 1/2, 2/3 and 1, labeled "all algorithms" and "cumulative difference algorithms"]

16
Obliterating the $\Omega(T)$ Barrier
Instead of playing the additive algorithm inside the loop, we can play EW with $\eta = \Delta = 1/T$
Instead of having one phase, we can have many
Set $\eta = 1/T$, $k = \log T$
For i = 1 to k
–Reset and run EW with the current value of $\eta$ until $R_{\mathrm{best},t} - R_{\mathrm{avg},t} > H = O(T^{1/2})$
–Set $\eta = \eta \cdot 2$
Reset and run EW with the final value of $\eta$
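
The phased scheme can be sketched as below. Several details are assumptions rather than anything stated on the slide: $H$ is taken to be exactly $\sqrt{T}$, $k$ is taken as $\lfloor \log_2 T \rfloor$, and "reset" is read as restarting the cumulative payoffs from zero in each phase. The function names are illustrative, and the sketch makes no claim to reproduce the regret bounds of the talk.

```python
import math
from typing import List


def ew_probs(cumulative: List[float], eta: float) -> List[float]:
    """Exponential-weights distribution for the given cumulative payoffs."""
    top = max(cumulative)
    w = [math.exp(eta * (r - top)) for r in cumulative]
    total = sum(w)
    return [x / total for x in w]


def phased_ew(payoffs: List[List[float]]) -> float:
    """Run the phased scheme and return the learner's total payoff.

    Assumptions (not stated on the slide): H = sqrt(T), k = floor(log2 T),
    and "reset" means cumulative payoffs restart from zero in each phase.
    """
    T = len(payoffs)
    n = len(payoffs[0])
    H = math.sqrt(T)
    k = int(math.log2(T))
    eta = 1.0 / T
    phase = 1                       # phases 1..k, then a final phase k + 1
    cumulative = [0.0] * n          # phase-local cumulative payoffs
    learner = 0.0
    for gains in payoffs:
        p = ew_probs(cumulative, eta)
        learner += sum(pi * g for pi, g in zip(p, gains))
        cumulative = [r + g for r, g in zip(cumulative, gains)]
        gap = max(cumulative) - sum(cumulative) / n
        if phase <= k and gap > H:
            eta *= 2                # become more aggressive
            phase += 1
            cumulative = [0.0] * n  # reset for the next phase
    return learner


T = 1024
drift = [[1.0, 0.0]] * T
alternating = [[1.0, 0.0], [0.0, 1.0]] * (T // 2)
print(T - phased_ew(drift))            # regret to the best on a drifting sequence
print(T / 2 - phased_ew(alternating))  # regret to the average on an oscillating one
```

On the oscillating sequence the gap between best and average never exceeds $H$, so the learner stays in the first phase with $\eta = 1/T$ and loses very little to the average. On the drifting sequence each phase ends after the best expert has pulled ahead by more than $H$, and the learning rate doubles.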

17
Extensions and Open Problems
Known Extensions to Our Algorithm:
–Instead of average, can use any static weight inside the simplex
Future Goals:
–Nicer dependence on the number of experts: ours is $O(\log N)$, typically $O(\sqrt{\log N})$
–Generalization to the returns setting and to other loss functions

18
Thanks! Questions?
