Download presentation

Presentation is loading. Please wait.

Published byFrankie Ury Modified over 2 years ago

1
Regret Minimization: Algorithms and Applications Yishay Mansour Tel Aviv Univ. Many thanks for my co-authors: A. Blum, N. Cesa-Bianchi, and G. Stoltz

2
2

3
3 Weather Forecast Sunny: Rainy: No meteorological understanding! –using other web sites forecastWeb site CNN BBC weather.com OUR Goal: Nearly the most accurate forecast

4
4 Route selection Challenge: Partial Information Goal: Fastest route

5
5 Rock-Paper-Scissors (1,-1)(-1,1) (0, 0) (-1,1)(0, 0)(1,-1) (0, 0)(1,-1)(-1,1)

6
6 Rock-Paper-Scissors Play multiple times –a repeated zero-sum game How should you “learn” to play the game?! –How can you know if you are doing “well” –Highly opponent dependent –In retrospect we should always win … (1,-1)(-1,1) (0, 0) (-1,1)(0, 0)(1,-1) (0, 0)(1,-1)(-1,1)

7
7 Rock-Paper-Scissors The (1-shot) zero-sum game has a value –Each player has a mixed strategy that can enforce the value Alternative 1: Compute the minimax strategy –Value V= 0 –Strategy = (⅓, ⅓, ⅓) Drawback: payoff will always be the value V –Even if the opponent is “weak” (always plays ) (1,-1)(-1,1) (0, 0) (-1,1)(0, 0)(1,-1) (0, 0)(1,-1)(-1,1)

8
8 Rock-Paper-Scissors Alternative 2: Model the opponent –Finite Automata Optimize our play given the opponent model. Drawbacks: –What is the “right” opponent model. –What happens if the assumption is wrong. (1,-1)(-1,1) (0, 0) (-1,1)(0, 0)(1,-1) (0, 0)(1,-1)(-1,1)

9
9 Rock-Paper-Scissors Alternative 3: Online setting –Adjust to the opponent play –No need to know the entire game in advance –Payoff can be more than the game’s value V Conceptually: –Have a set of comparison class of strategies. –Compare performance in hindsight (1,-1)(-1,1) (0, 0) (-1,1)(0, 0)(1,-1) (0, 0)(1,-1)(-1,1)

10
10 Rock-Paper-Scissors Comparison Class H: –Example : A={,, } –Other plausible strategies: Play what your opponent played last time Play what will beat your opponent previous play Goal: Online payoff near the best strategy in the class H Tradeoff: –The larger the class H, the difference grows. (1,-1)(-1,1) (0, 0) (-1,1)(0, 0)(1,-1) (0, 0)(1,-1)(-1,1)

11
11 Rock-Paper-Scissors: Regret Consider A={,, } –All the pure strategies Zero-sum game: Given any mixed strategy σ of the opponent, there exists a pure strategy a ∊ A whose expected payoff is at least V Corollary: For any sequence of actions (of the opponent) We have some action whose average value is V

12
12 Rock-Paper-Scissors: Regret payoffopponentwe 1 0 0 Average payoff -1/3 payoffopponentplay 1 0 0 1 1 New average payoff 1/3

13
13 Rock-Paper-Scissors: Regret More formally: After T games, Û = our average payoff, U(h) = the payoff if we play using h regret(h) = U(h)- Û External regret: max h ∊ H regret(h) Claim: If for every a ∊ A we have regret(a) ≤ ε, then Û ≥ V- ε

14
14 [Blum & M] and [Cesa-Bianchi, M & Stoltz]

15
15 Regret Minimization: Setting Online decision making problem (single agent) At each time, the agent: –selects an action –observes the loss/gain Goal: minimize loss (or maximize gain) Environment model: –stochastic versus adversarial Performance measure: –optimality versus regret

16
16 Regret Minimization: Model Actions A={1, …,N} Number time steps: t ∊ { 1, …, T} At time step t: –The agent selects a distribution p i t over A –Environment returns costs c i t ∊ [0,1] –Online loss: l t = Σ i c i t p i t Cumulative loss : L online = Σ t l t Information Models: –Full information: observes every action’s cost –Partial information: observes only its own cost

17
17 Stochastic Environment Costs: for each action i, c i t are i.i.d. r.v. (for diff. t) –Assuming an oblivious opponent Full information Greedy Algorithm: –selects the action with the lowest average cost. Analysis: –Assume two actions –Boolean cost: Pr[c i =1]=p i and p 1 < p 2 –Concentration bound: Pr[avg 1 t > avg 2 t ] < exp{-(p 1 -p 2 ) 2 t} –Expected regret: E[Σ (c 2 t+1 -c 1 t+1 ) I(avg 1 t > avg 2 t )] –sum over t=1, … Σ t (p 2 -p 1 ) exp{-(p 1 -p 2 ) 2 t}= O( 1/ (p 2 -p 1 ) )

18
18 Stochastic Environment Partial information Tradeoff: Exploration versus Exploitation Simple approximate solution: –sample each action O(ε -2 log N/δ) times –Afterwards, select the best observed action Gittin’s Index –Simple optimal selection rule Assuming a Bayesian prior and discounting

19
19 Competitive Analysis Costs: c i t are generated adversarially, –might depend on the online algorithm decisions inline with our game theory applications Online Competitive Analysis: –Strategy class = any dynamic policy –too permissive Always wins rock-paper-scissors Performance measure: –compare to the best strategy in a class of strategies

20
20 External Regret Static class –Best fixed solution Compares to a single best strategy (in H) The class H is fixed beforehand. –optimization is done with respect to H Assume H=A –Best action: L best = MIN i {Σ t c i t } –External Regret = L online – L best Normalized regret is divided by T

21
21 External regret: Bounds Average external regret goes to zero –No regret –Hannan [1957] Explicit bounds –Littstone & Warmuth ‘94 –CFHHSW ‘97 –External regret = O(√Tlog N)

22
22

23
23 External Regret: Greedy Simple Greedy: –Go with the best action so far. For simplicity loss is {0,1} Loss can be N times the best action –holds for any deterministic online algorithm Can not be worse: –L online < N L best

24
24 External Regret: Randomized Greedy Randomized Greedy: –Go with a random best action. Loss is ln(N) times the best action Analysis: best Expected loss during the time that best increases from k to k+1 is 1/N + 1/(N-1) + … ≈ ln(N)

25
25 External Regret: PROD Algorithm Regret is √Tlog N PROD Algorithm: –plays sub-best actions –Uses exponential weights w a = (1-η) L a –Normalize weights Analysis: –W t = weights of all actions at time t –F t = fraction of weight of actions with loss 1 at time t Also, expected loss: L ON = ∑ F t W t+1 = W t (1-ηF t ) FtFt WtWt W t+1 (1-η)

26
26 External Regret: Bounds Derivation Bounding W T Lower bound: W T > (1-η) L min Upper bound: W T = W 1 Π t (1-ηF t ) ≤ W 1 Π t exp{-ηF t } = W 1 exp{-η L ON } using 1-x ≤ e -x Combined bound: (1-η) L min ≤ W 1 exp{-η L ON } Taking logarithms: L min log(1- η) ≤ log(W 1 ) -ηL ON Final bound: L ON ≤ L min + ηL min +log(N)/η Optimizing the bound: η = √log(N)/T L ON ≤ L min +2√ T log(N)

27
27 External Regret: Summary We showed a bound of 2√T log N More refined bounds √Q log N where Q = Σ t (c t best ) 2

28
28 External Regret: Summary How surprising are the results … –Near optimal result in online adversarial setting very rear … –Lower bound: stochastic model stochastic assumption does not help … –Models an “improved” greedy –An “automatic” optimization methodology Find the best fixed setting of parameters

29
29 External Regret and classification Connections to Machine Learning: H – the hypothesis class cost – an abstract loss function –no need to specify in advance Learning setting – online –learner: observes point, predicts, observes loss Regret guarantee: –compares to the best classifier in H.

30
30

31
31 Types of Regret (Internal) Comparison is based on online’s decisions. –Depends on the actions of the online algorithm –modify a single decision (consistently) Each time action A was done do action B Comparison class is not well define in advanced. Scope: –Stronger then External Regret –Weaker then competitive analysis.

32
32 Internal Regret Assume action sequence A=a 1 … a T –Modified input (b → d ) : Change every a i t =b to a i t =d, and create actions seq. B. L(b→d) is the cost of B –using the same costs c i t Internal regret L online –min {b,d} L online (b→d) = max {b,d} Σ t (c b t –c d t )p b t Swap Regret: –Change action i to action F(i)

33
33 Internal regret No regret bounds –Foster & Vohra –Hart & Mas-Colell –Based on the approachability theorem Blackwell ’56 –Cesa-Bianchi & Lugasi ’03 Internal regret = O(log N + √T log N)

34
34 Regret: Example External regret –You should have bought S&P 500 –Match boy i to girl i Internal regret –Each time you bought IBM you should have bought SUN –Stable matching

35
35 Internal Regret Intuition: –considers swapping two actions (consistently) –A more complex benchmark: depends on the online actions selected! Game theory applications: –Avoiding dominated actions –Correlated equilibrium Reduction from External Regret [Blum & M]

36
36 Dominated actions Action a i is dominated by b i if for every a -i we have u i (a i, a -i ) < u i (b i, a -i ) Clearly, we like to avoid dominated action –Remark: an action can be also dominated by a mixed action Q: can we guarantee to avoid dominated actions?! 13021 129152 a b

37
37 Dominated Actions & Internal Regret How can we test it?! –in retrospect a i is dominates b i –Every time we played a i we do better with b i Define internal regret –swapping a pair of actions No internal regret no dominated actions Internal Regret (a b) Modified Payoff (a b) Our Payoff our actions 2-1=121a b c 5-2=352a d 9-3=693a b d 1-0=110a

38
38 Dominated actions & swap regret Swap regret –An action sequence σ = σ 1, …, σ t –Modification function F:A A –A modified sequence σ(F) = F(σ 1 ), …, F(σ t ) Swap_regret = max F V(σ (F)) - V(σ ) Theorem: If Swap_regret < R then in at most R/ε steps we play ε-dominated actions. σ(F)σ ba cb cc ba bd ba cb bd ba

39
39 Calibration Model: each step predict a probability and observe outcome Goal: prediction calibrated with outcome –During time steps where the prediction is p the average outcome is (approx) p.3.5.3.5 predictionsoutcome Calibration:.3.5 1/3 1/2 Predict prob. of rain

40
Calibration to Regret Reduction to Swap/Internal regret: –Discrete Probabilities Say: {0.0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1.0} –Loss of action x at time t: (x – c t ) 2 –y*(x)= argmax y Regret(x,y) y*(x)=avg(c t |x) –Consider Regret(x,y*(x))

41
41 Correlated Equilibrium Q a distribution over joint actions Scenario: –Draw joint action a from Q, –player i receives action a i and no other information Q is a correlated Eq if: –for every player i, the recommended action a i is a best response given the induced distribution. Q a1a1 a2a2 a3a3 a4a4

42
42 Swap Regret & Correlated Eq. Correlated Eq NO Swap regret Repeated game setting Assume swap_regret ≤ ε –Consider the empirical distribution A distribution Q over joint actions –For every player it is ε best response –Empirical history is an ε correlated Eq.

43
43 External Regret to Internal Regret: Generic reduction Input: –N (External Regret) algorithms –Algorithm A i, for any input sequence : L A i ≤ L best,i + R i A1A1 AiAi AnAn

44
44 External to Internal: Generic reduction General setting (at time t): –Each A i outputs a distribution q i A matrix Q –We decide on a distribution p –Adversary decides on costs c= –We return to A i some cost vector A1A1 AiAi AnAn q1q1 qiqi qnqn Q pc

45
45 Combining the experts Approach I: –Select an expert A i with probability r i –Let the “selected” expert decide the outcome –action distribution p=Qr Approach II: –Directly decide on p. Our approach: make p=r –Find a p such that p=Qp Q p

46
46 Distributing loss Adversary selects costs c= Reduction: –Return to A i cost vector c i = p i c –Note: Σ c i = c A1A1 AiAi AnAn c

47
47 External to Internal: Generic reduction Combination rule: –Each A i outputs a distribution q i Defines a matrix Q –Compute p such that p=Qp p j = Σ i p i q i,j –Adversary selects costs c= –Return to A i cost vector p i c Q p

48
48 Motivation Dual view of p: –p i is the probability of selecting action i –p i is the probability of selecting algo. A i Then use A i probability, namely q i Breaking symmetry: –The feedback to A i depends on p i

49
49 Proof of reduction: Loss of A i (from its view) – = p i Regret guarantee (for any action i) w.r.t. action j: –L i = Σ t p i t ≤ Σ t p i t c j t + R i Online loss: –L online = Σ t = Σ t = Σ t Σ i p i t =Σ i L i For any swap function F: –L online ≤ L online,F + Σ i R i

50
50 Swap regret Corollary: For any swap F: Improved bound: –Note that i L max,i T worse case: all L max,i are equal. –Improved bound:

51
51

52
52 Reductions between Regrets External Regret Full Information External Regret Partial Information Internal Regret Full Information Internal Regret Partial Information [AM] [BM] [CMS]

53
53 More elaborate regret notions Time selection functions [Blum & M] –determines the relevance of the next time step –identical for all actions –multiple time-selection functions Wide range regret [Lehrer, Blum & M] –Any set of modification functions mapping histories to actions

54
54 Conclusion and Open problems Reductions –External to Internal Regret full information partial information SWAP regret Lower Bound –poly in N= |A| lower bounds [Auer] Wide Range Regret –Applications …

55
55 Many thanks for my co-authors: A. Blum, N. Cesa-Bianchi, and G. Stoltz

56
56 Partial Information Partial information: –Agent selects action i –Observes the loss of action i –No information regarding the loss of other actions How can we handle this case?

57
57 Reductions between Regrets External Regret Full Information External Regret Partial Information Internal Regret Full Information Internal Regret Partial Information [AM, podc 03] [BM, colt 05] [CMS, colt 05]

58
58 Information: Full versus Partial Importance Sampling: –maintain weights as before. –update the selected action k by loss c k t /p k t –Expectation is maintain –Need to argue directly on the algorithm. Used in: [ACFS], [HM] and others –Regret Bound about T 1/2

59
59 Information: Full versus Partial Partial information Works in blocks of size B –explore each action with small probability with high probability explores each action once –Otherwise uses FI action distribution At the end of B: Feeds the explored action to FI Regrets: –Regret of FI on T/B time steps (each of size B) –Exploration regret T/B

60
60 Information model: Full versus Partial Full Information Regret Algorithm Partial Information exploit sample

61
61 Information: Full versus Partial Anslysis: –Regret of FI on T/B time steps (each of size B) Regret ~ sqrt{ B T} –Exploration regret O(N log N) in block Regret ~ T/B Optimizing B ~ T 1/3 Performance guarantee: Regret ~ T 2/3 Benefit: Simpler and wider applicability.

62
62 More elaborate regret notions Time selection functions [Blum & M] –determines the relevance of the next time step –identical for all actions –multiple time-selection functions Wide range regret [Lehrer, Blum & M] –Any set of modification functions mapping histories to actions

63
63

64
64 Routing Model: select a path for each sessions Loss: The latency on the path Performance Goal: Achieve the best latencies

65
65 Financial Markets Model: Select a portfolio each day. Gain: The changes in stock values. Performance Goal: Compare well with the best busy and hold policy.

66
66 Parameter Tuning Model: Multiple parameters. Optimization: any Performance Goal: guarantee the performance of (almost) the best set of parameters

67
67 General Motivation Development Cycle –develop product (software) –test performance –tune parameters –deliver “tuned” product Challenge: can we combine –testing –tuning –runtime

68
68 Challenges Large action spaces –Regret is logarithmic –For d dimension regret ~ d Computational issue: –How to aggregate weights efficiently Some existing results [KV,KKL]

69
69 Summary Regret bounds –simple algorithms –impressive bounds near optimal Applications –methodological –algorithmic Great potential

70
70 Many thanks for my co-authors: A. Blum, N. Cesa-Bianchi, and G. Stoltz

Similar presentations

OK

Yossi Azar Tel Aviv University Joint work with Ilan Cohen Serving in the Dark 1.

Yossi Azar Tel Aviv University Joint work with Ilan Cohen Serving in the Dark 1.

© 2017 SlidePlayer.com Inc.

All rights reserved.

Ads by Google