Regret Minimization: Algorithms and Applications Yishay Mansour Tel Aviv Univ. Many thanks to my co-authors: A. Blum, N. Cesa-Bianchi, and G. Stoltz

3 Weather Forecast Sunny or Rainy? No meteorological understanding! –using the forecasts of other web sites (CNN, BBC, weather.com, …) OUR goal: be nearly as accurate as the most accurate forecast

4 Route selection Challenge: Partial Information Goal: Fastest route

5 Rock-Paper-Scissors Payoff matrix (row = our action, column = opponent's action; entries are (our payoff, opponent's payoff)):
            Rock      Paper     Scissors
  Rock      (0, 0)    (-1, 1)   (1, -1)
  Paper     (1, -1)   (0, 0)    (-1, 1)
  Scissors  (-1, 1)   (1, -1)   (0, 0)

6 Rock-Paper-Scissors Play multiple times –a repeated zero-sum game How should you “learn” to play the game? –How can you know if you are doing “well”? –Highly opponent dependent –In retrospect we should always win …

7 Rock-Paper-Scissors The (1-shot) zero-sum game has a value –Each player has a mixed strategy that can enforce the value Alternative 1: Compute the minimax strategy –Value V = 0 –Strategy = (⅓, ⅓, ⅓) Drawback: the payoff will always be the value V –Even if the opponent is “weak” (e.g., always plays the same fixed action)

8 Rock-Paper-Scissors Alternative 2: Model the opponent –e.g., as a finite automaton Optimize our play given the opponent model. Drawbacks: –What is the “right” opponent model? –What happens if the assumption is wrong?

9 Rock-Paper-Scissors Alternative 3: Online setting –Adjust to the opponent's play –No need to know the entire game in advance –Payoff can be more than the game's value V Conceptually: –Fix a comparison class of strategies –Compare performance in hindsight

10 Rock-Paper-Scissors Comparison class H: –Example: A = {Rock, Paper, Scissors} –Other plausible strategies: play what your opponent played last time; play what beats your opponent's previous play Goal: online payoff near the best strategy in the class H Tradeoff: –The larger the class H, the larger the regret bound grows.

11 Rock-Paper-Scissors: Regret Consider A = {Rock, Paper, Scissors} –All the pure strategies Zero-sum game: given any mixed strategy σ of the opponent, there exists a pure strategy a ∊ A whose expected payoff is at least V Corollary: for any sequence of actions of the opponent, some pure action has average payoff at least V

12 Rock-Paper-Scissors: Regret Example (table of plays, icons omitted): against the opponent's actual sequence our average payoff was -1/3, while replaying the best fixed action against the same sequence gives average payoff 1/3.

13 Rock-Paper-Scissors: Regret More formally: after T games, let Û = our average payoff and U(h) = the average payoff had we played using h. regret(h) = U(h) − Û External regret: max_{h ∊ H} regret(h) Claim: if for every a ∊ A we have regret(a) ≤ ε, then Û ≥ V − ε
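As a concrete illustration, here is a minimal sketch (function and variable names are mine, not from the talk) of computing external regret in hindsight, given full information about what every action would have paid:

```python
import numpy as np

def external_regret(our_payoffs, action_payoffs):
    """our_payoffs: length-T array of the payoffs we actually received.
    action_payoffs: T x N array; entry [t, a] is the payoff action a would
    have earned at round t (full-information hindsight).
    Returns (best fixed action's average payoff) - (our average payoff)."""
    our_avg = float(np.mean(our_payoffs))
    best_fixed_avg = float(np.max(np.mean(action_payoffs, axis=0)))
    return best_fixed_avg - our_avg
```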

14 [Blum & M] and [Cesa-Bianchi, M & Stoltz]

15 Regret Minimization: Setting Online decision making problem (single agent) At each time, the agent: –selects an action –observes the loss/gain Goal: minimize loss (or maximize gain) Environment model: –stochastic versus adversarial Performance measure: –optimality versus regret

16 Regret Minimization: Model Actions A = {1, …, N} Time steps t ∊ {1, …, T} At time step t: –The agent selects a distribution p^t = (p_1^t, …, p_N^t) over A –The environment returns costs c_i^t ∊ [0,1] –Online loss: ℓ^t = Σ_i c_i^t p_i^t Cumulative loss: L_online = Σ_t ℓ^t Information models: –Full information: the agent observes every action's cost –Partial information: the agent observes only the cost of its own action

17 Stochastic Environment Costs: for each action i, the c_i^t are i.i.d. random variables (across t) –Assuming an oblivious opponent Full information Greedy algorithm: –select the action with the lowest average cost so far. Analysis: –Assume two actions –Boolean costs: Pr[c_i = 1] = p_i with p_1 < p_2 –Concentration bound: Pr[avg_1^t > avg_2^t] < exp{-(p_1 - p_2)^2 t} –Expected regret: E[Σ_t (c_2^{t+1} - c_1^{t+1}) · I(avg_1^t > avg_2^t)] ≤ Σ_t (p_2 - p_1) exp{-(p_1 - p_2)^2 t} = O(1/(p_2 - p_1))
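A minimal sketch of this greedy rule (my own naming; `cost_sampler` is a hypothetical stand-in for the stochastic environment):

```python
import numpy as np

def greedy_full_information(cost_sampler, n_actions, T):
    """Greedy in the stochastic, full-information setting: play the action
    with the lowest empirical average cost so far.
    cost_sampler(t) is assumed to return the full cost vector c^t in [0,1]^N,
    drawn i.i.d. across rounds."""
    cost_sums = np.zeros(n_actions)
    total_loss = 0.0
    for t in range(T):
        averages = cost_sums / max(t, 1)     # round 0: all averages are 0
        action = int(np.argmin(averages))    # greedy choice
        costs = np.asarray(cost_sampler(t))
        total_loss += float(costs[action])
        cost_sums += costs                   # full information: see every cost
    return total_loss
```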

18 Stochastic Environment Partial information Tradeoff: exploration versus exploitation Simple approximate solution: –sample each action O(ε^{-2} log(N/δ)) times –afterwards, select the best observed action Gittins index –a simple optimal selection rule, assuming a Bayesian prior and discounting

19 Competitive Analysis Costs c_i^t are generated adversarially –they might depend on the online algorithm's decisions, in line with our game theory applications Online competitive analysis: –strategy class = any dynamic policy –too permissive: such a benchmark always wins rock-paper-scissors Performance measure: –compare to the best strategy in a (restricted) class of strategies

20 External Regret Static class –Best fixed solution Compares to a single best strategy (in H) The class H is fixed beforehand –optimization is done with respect to H Assume H = A –Best action: L_best = min_i Σ_t c_i^t –External regret = L_online − L_best Normalized regret: divide by T

21 External Regret: Bounds Average external regret goes to zero –“No regret” –Hannan [1957] Explicit bounds –Littlestone & Warmuth '94 –CFHHSW '97 –External regret = O(√(T log N))

23 External Regret: Greedy Simple greedy: –Go with the best action so far. For simplicity, losses are in {0,1} The loss can be N times the best action's loss –this holds for any deterministic online algorithm It cannot be worse: –L_online ≤ N · L_best (up to an additive N)

24 External Regret: Randomized Greedy Randomized greedy: –Go with a uniformly random best action so far. The loss is at most ln(N) times the best action's loss Analysis: the expected loss while L_best increases from k to k+1 is at most 1/N + 1/(N-1) + … + 1 ≈ ln(N)
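A minimal sketch of randomized greedy under the slide's assumptions ({0,1} losses, full information); names are mine:

```python
import numpy as np

def randomized_greedy(loss_stream, n_actions, rng):
    """Randomized greedy with {0,1} losses and full information:
    each round, play a uniformly random action among those whose
    cumulative loss so far is minimal."""
    cumulative = np.zeros(n_actions)
    total_loss = 0.0
    for costs in loss_stream:                 # costs: {0,1}^N vector per round
        best = np.flatnonzero(cumulative == cumulative.min())
        action = int(rng.choice(best))        # random tie-breaking over leaders
        total_loss += float(costs[action])
        cumulative += costs
    return total_loss
```

Here `rng` would be, e.g., `np.random.default_rng(0)`, and `loss_stream` any iterable of per-round loss vectors.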

25 External Regret: PROD Algorithm Regret is O(√(T log N)) PROD algorithm: –also plays sub-optimal actions, with probability proportional to their weights –uses exponential weights w_a = (1-η)^{L_a} –normalizes the weights Analysis: –W^t = total weight of all actions at time t –F^t = fraction of the weight on actions that incur loss 1 at time t –Expected loss at time t is F^t, so L_ON = Σ_t F^t –Weight update: W^{t+1} = W^t (1 - ηF^t)
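A minimal sketch of this multiplicative-weights update (my own naming; assuming cost vectors in [0,1]^N and full information):

```python
import numpy as np

def prod_style_weights(loss_stream, n_actions, eta):
    """Multiplicative ("PROD"-style) weights with full information:
    keep w_a = (1 - eta)^{L_a}, play the normalized weights, and
    accumulate the expected online loss sum_t <p^t, c^t>."""
    weights = np.ones(n_actions)
    online_loss = 0.0
    for costs in loss_stream:                 # costs: vector in [0,1]^N per round
        costs = np.asarray(costs, dtype=float)
        p = weights / weights.sum()           # normalized weights
        online_loss += float(p @ costs)       # expected loss this round
        weights *= (1.0 - eta) ** costs       # multiplicative update
    return online_loss
```

With η = √(log(N)/T), this matches the tuning used in the bound derivation on the next slide.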

26 External Regret: Bounds Derivation Bounding W^T Lower bound: W^T > (1-η)^{L_min} Upper bound: W^T = W^1 Π_t (1 - ηF^t) ≤ W^1 Π_t exp{-ηF^t} = W^1 exp{-η L_ON}, using 1 - x ≤ e^{-x} Combined bound: (1-η)^{L_min} ≤ W^1 exp{-η L_ON} Taking logarithms: L_min log(1-η) ≤ log(W^1) − η L_ON Final bound: L_ON ≤ L_min + η L_min + log(N)/η Optimizing the bound: η = √(log(N)/T) gives L_ON ≤ L_min + 2√(T log N)

27 External Regret: Summary We showed a bound of 2√(T log N) More refined bounds: √(Q log N), where Q = Σ_t (c_best^t)^2

28 External Regret: Summary How surprising are the results … –A near optimal result in an online adversarial setting is very rare … –Lower bound: it holds already in the stochastic model, so the stochastic assumption does not help … –Models an “improved” greedy –An “automatic” optimization methodology: find the best fixed setting of parameters

29 External Regret and classification Connections to Machine Learning: H – the hypothesis class cost – an abstract loss function –no need to specify in advance Learning setting – online –learner: observes point, predicts, observes loss Regret guarantee: –compares to the best classifier in H.

31 Types of Regret (Internal) The comparison is based on the online algorithm's own decisions –it depends on the actions the online algorithm selected –modify a single decision (consistently): each time action A was played, play action B instead The comparison class is not well defined in advance Scope: –stronger than external regret –weaker than competitive analysis

32 Internal Regret Assume our action sequence is A = a^1 … a^T –Modified sequence (b → d): change every a^t = b to a^t = d, creating the action sequence B –L(b→d) is the cost of B, using the same costs c_i^t Internal regret: L_online − min_{(b,d)} L_online(b→d) = max_{(b,d)} Σ_t (c_b^t − c_d^t) p_b^t Swap regret: –change each action i to an action F(i)
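A minimal sketch (my own naming) of computing swap regret in hindsight from the play distributions and the observed costs; internal regret is the same quantity restricted to a single pair (b, d):

```python
import numpy as np

def swap_regret(play_probs, costs):
    """play_probs: T x N array; play_probs[t, i] is the probability we put
    on action i at round t.  costs: T x N array of the realized costs.
    Swap regret = max over modification functions F of
        sum_t sum_i p_i^t * (c_i^t - c_{F(i)}^t);
    the maximum decomposes, so the best replacement j is chosen
    independently for each action i."""
    weighted = play_probs.T @ costs              # [i, j] = sum_t p_i^t * c_j^t
    our_cost_per_action = np.diag(weighted)      # j = i: what we actually paid
    best_replacement = weighted.min(axis=1)      # best alternative j for each i
    return float(np.sum(our_cost_per_action - best_replacement))
```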

33 Internal Regret No-regret results –Foster & Vohra –Hart & Mas-Colell –based on Blackwell's approachability theorem ('56) –Cesa-Bianchi & Lugosi '03: internal regret = O(log N + √(T log N))

34 Regret: Example External regret –You should have bought S&P 500 –Match boy i to girl i Internal regret –Each time you bought IBM you should have bought SUN –Stable matching

35 Internal Regret Intuition: –considers swapping two actions (consistently) –A more complex benchmark: depends on the online actions selected! Game theory applications: –Avoiding dominated actions –Correlated equilibrium Reduction from External Regret [Blum & M]

36 Dominated Actions Action a_i is dominated by b_i if for every a_{-i} we have u_i(a_i, a_{-i}) < u_i(b_i, a_{-i}) Clearly, we would like to avoid dominated actions –Remark: an action can also be dominated by a mixed action Q: can we guarantee to avoid dominated actions?!

37 Dominated Actions & Internal Regret How can we test it?! –In retrospect, a_i is dominated by b_i: every time we played a_i we would have done better playing b_i Define internal regret –swapping a pair of actions No internal regret ⇒ no dominated actions Example (rounds where we played a; payoffs under the swap a → b):
  our payoff 1, modified payoff 2, regret 2−1 = 1
  our payoff 2, modified payoff 5, regret 5−2 = 3
  our payoff 3, modified payoff 9, regret 9−3 = 6
  our payoff 0, modified payoff 1, regret 1−0 = 1
(rounds where we played other actions are unaffected by the swap)

38 Dominated Actions & Swap Regret Swap regret –An action sequence σ = σ^1, …, σ^T –A modification function F: A → A –The modified sequence σ(F) = F(σ^1), …, F(σ^T) Swap_regret = max_F V(σ(F)) − V(σ) Theorem: if Swap_regret < R, then we play ε-dominated actions in at most R/ε steps. Example with F(a)=b, F(b)=c, F(c)=c, F(d)=b:
  σ    = a b c a d a b d a
  σ(F) = b c c b b b c b b

39 Calibration Model: at each step, predict a probability and then observe the outcome Goal: predictions calibrated with outcomes –during the time steps where the prediction is p, the average outcome is (approximately) p Example (predicting the probability of rain): on the steps where we predicted 0.3 the empirical frequency of rain was 1/3; where we predicted 0.5 it was 1/2.

40 Calibration to Regret Reduction to swap/internal regret: –Discretize the probabilities, say to {0.0, 0.1, 0.2, …, 0.9, 1.0} –Loss of action x at time t: (x − c^t)^2 –Let y*(x) = argmax_y Regret(x, y); for the squared loss, y*(x) = avg(c^t | prediction was x) –Consider Regret(x, y*(x))
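A minimal sketch (my own naming) of measuring how calibrated a sequence of discretized predictions is, in the sense used on these slides:

```python
import numpy as np

def calibration_error(predictions, outcomes):
    """predictions: length-T array of forecasts taken from a finite grid
    (e.g. {0.0, 0.1, ..., 1.0}); outcomes: length-T array of 0/1 results.
    Returns the largest gap between a predicted value x and the empirical
    frequency of the outcome on the rounds where x was predicted."""
    predictions = np.asarray(predictions)
    outcomes = np.asarray(outcomes, dtype=float)
    worst_gap = 0.0
    for x in np.unique(predictions):
        mask = predictions == x
        worst_gap = max(worst_gap, abs(outcomes[mask].mean() - x))
    return worst_gap
```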

41 Correlated Equilibrium Q is a distribution over joint actions Scenario: –Draw a joint action a from Q –Player i receives its own action a_i and no other information Q is a correlated equilibrium if: –for every player i, the recommended action a_i is a best response given the induced conditional distribution.

42 Swap Regret & Correlated Eq. No swap regret ⟷ correlated equilibrium Repeated game setting Assume swap_regret ≤ ε –Consider the empirical distribution of joint play: a distribution Q over joint actions –For every player, following it is an ε-best response –The empirical history is an ε-correlated equilibrium.
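A minimal sketch (names mine) of the object this slide refers to: the empirical distribution of the jointly played action profiles, which by the argument above is an ε-correlated equilibrium whenever every player's swap regret over the history is at most ε per step:

```python
from collections import Counter

def empirical_joint_distribution(history):
    """history: list of joint action profiles (one tuple per round).
    Returns the empirical distribution of play; if every player's swap
    regret over the history is at most eps * T, this distribution is an
    eps-correlated equilibrium (the claim on the slide)."""
    counts = Counter(history)
    T = len(history)
    return {profile: count / T for profile, count in counts.items()}
```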

43 External Regret to Internal Regret: Generic Reduction Input: –N external-regret algorithms –Algorithm A_i guarantees, for any input sequence: L_{A_i} ≤ L_{best,i} + R_i

44 External to Internal: Generic Reduction General setting (at time t): –Each A_i outputs a distribution q_i; together they form a matrix Q –We decide on a distribution p –The adversary decides on costs c = (c_1, …, c_N) –We return to each A_i some cost vector

45 Combining the Experts Approach I: –Select an expert A_i with probability r_i –Let the “selected” expert decide the action: action distribution p = Qr Approach II: –Decide on p directly. Our approach: make p = r –Find a p such that p = Qp

46 Distributing the Loss The adversary selects costs c = (c_1, …, c_N) Reduction: –Return to A_i the cost vector c^i = p_i · c –Note: Σ_i c^i = c

47 External to Internal: Generic Reduction Combination rule: –Each A_i outputs a distribution q_i; the q_i form a matrix Q –Compute p such that p = Qp, i.e., p_j = Σ_i p_i q_{i,j} –The adversary selects costs c = (c_1, …, c_N) –Return to A_i the cost vector p_i · c
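The fixed point p = Qp (p_j = Σ_i p_i q_{i,j}) is a stationary distribution of the row-stochastic matrix Q. A minimal sketch (my own naming) of finding it by power iteration; in practice one can equally solve the eigenvector problem exactly:

```python
import numpy as np

def stationary_distribution(Q, iters=1000):
    """Q: N x N row-stochastic matrix whose i-th row is the distribution
    q_i output by the i-th external-regret algorithm A_i.
    Finds p with p_j = sum_i p_i * Q[i, j] (the slide's p = Qp) by simple
    power iteration on the Markov chain Q."""
    n = Q.shape[0]
    p = np.ones(n) / n
    for _ in range(iters):
        p = p @ Q            # one step of the chain: p_j <- sum_i p_i q_{i,j}
        p /= p.sum()         # guard against numerical drift
    return p
```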

48 Motivation Dual view of p: –p_i is the probability of selecting action i –p_i is the probability of selecting algorithm A_i, which then uses its own distribution q_i Breaking the symmetry: –the feedback to A_i depends on p_i

49 Proof of the Reduction Loss of A_i (from its view): p_i^t ⟨q_i^t, c^t⟩ Regret guarantee of A_i (w.r.t. any action j): L_i = Σ_t p_i^t ⟨q_i^t, c^t⟩ ≤ Σ_t p_i^t c_j^t + R_i Online loss: L_online = Σ_t ⟨p^t, c^t⟩ = Σ_t Σ_i p_i^t ⟨q_i^t, c^t⟩ = Σ_i L_i (using p = Qp) For any swap function F: L_online ≤ L_online,F + Σ_i R_i

50 Swap Regret Corollary: for any swap function F, L_online ≤ L_online,F + Σ_i R_i, with R_i = O(√(L_max,i log N)) Improved bound: –Note that Σ_i L_max,i ≤ T; the worst case is when all the L_max,i are equal (= T/N) –Improved bound: swap regret = O(√(N T log N))

52 Reductions between Regrets (diagram) External regret / full information, external regret / partial information, internal regret / full information, internal regret / partial information — connected by the reductions of [AM], [BM], [CMS]

53 More elaborate regret notions Time selection functions [Blum & M] –determines the relevance of the next time step –identical for all actions –multiple time-selection functions Wide range regret [Lehrer, Blum & M] –Any set of modification functions mapping histories to actions

54 Conclusion and Open Problems Reductions –external to internal regret –full information to partial information –swap regret Lower bounds –lower bounds polynomial in N = |A| [Auer] Wide range regret –Applications …

55 Many thanks to my co-authors: A. Blum, N. Cesa-Bianchi, and G. Stoltz

56 Partial Information Partial information: –Agent selects action i –Observes the loss of action i –No information regarding the loss of other actions How can we handle this case?

57 Reductions between Regrets (diagram) External regret / full information, external regret / partial information, internal regret / full information, internal regret / partial information — connected by the reductions of [AM, PODC 03], [BM, COLT 05], [CMS, COLT 05]

58 Information: Full versus Partial Importance sampling: –maintain weights as before –update the selected action k with the loss estimate c_k^t / p_k^t –the expectation is maintained (the estimate is unbiased) –need to argue directly about the algorithm. Used in [ACFS], [HM] and others –Regret bound of about T^{1/2}
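A minimal sketch of one round of this importance-sampling idea, in the style of EXP3 (this is my own simplified rendering, not the exact algorithm of [ACFS]):

```python
import numpy as np

def bandit_round(weights, true_costs, eta, rng):
    """One round of an EXP3-style update: sample an action from the current
    weights, observe only its cost, and feed the importance-weighted
    estimate c_k / p_k (zero elsewhere) into an exponential-weights update.
    The estimate is unbiased: its expectation is the true cost vector."""
    p = weights / weights.sum()
    k = int(rng.choice(len(p), p=p))          # play action k, observe c_k only
    estimate = np.zeros(len(p))
    estimate[k] = true_costs[k] / p[k]        # importance-weighted loss estimate
    new_weights = weights * np.exp(-eta * estimate)
    return new_weights, k, float(true_costs[k])
```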

59 Information: Full versus Partial Partial information: work in blocks of size B –explore each action with a small probability; with high probability each action is explored once per block –otherwise use the FI (full-information) action distribution At the end of each block: feed the explored costs to the FI algorithm Regrets: –regret of FI over T/B time steps (each of size B) –exploration regret over the T/B blocks

60 Information Model: Full versus Partial (diagram) The partial-information algorithm wraps a full-information regret algorithm: it mostly exploits the FI distribution and occasionally samples (explores) to feed costs back to the FI algorithm.

61 Information: Full versus Partial Analysis: –regret of FI over T/B time steps (each of size B): regret ~ √(BT) –exploration regret of O(N log N) per block: regret ~ T/B Optimizing, B ~ T^{1/3} Performance guarantee: regret ~ T^{2/3} Benefit: simpler and wider applicability.
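Filling in the balancing step behind B ~ T^{1/3} (constants and log factors omitted, as on the slide):

```latex
% FI regret over T/B blocks vs. exploration regret:
\underbrace{\sqrt{BT}}_{\text{FI regret}} \;\approx\; \underbrace{T/B}_{\text{exploration}}
\;\Longrightarrow\; B^{3/2} \approx \sqrt{T}
\;\Longrightarrow\; B \approx T^{1/3},
\qquad \text{total regret} \;\approx\; \sqrt{BT} \;\approx\; T^{2/3}.
```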

62 More elaborate regret notions Time selection functions [Blum & M] –determines the relevance of the next time step –identical for all actions –multiple time-selection functions Wide range regret [Lehrer, Blum & M] –Any set of modification functions mapping histories to actions

64 Routing Model: select a path for each session Loss: the latency on the path Performance goal: achieve the best latencies

65 Financial Markets Model: select a portfolio each day. Gain: the changes in stock values. Performance goal: compare well with the best buy-and-hold policy.

66 Parameter Tuning Model: Multiple parameters. Optimization: any Performance Goal: guarantee the performance of (almost) the best set of parameters

67 General Motivation Development Cycle –develop product (software) –test performance –tune parameters –deliver “tuned” product Challenge: can we combine –testing –tuning –runtime

68 Challenges Large action spaces –regret is only logarithmic in the number of actions –so for action spaces of dimension d (exponentially many actions), regret ~ d Computational issues: –how to aggregate the weights efficiently Some existing results [KV, KKL]

69 Summary Regret bounds –simple algorithms –impressive bounds near optimal Applications –methodological –algorithmic Great potential

70 Many thanks to my co-authors: A. Blum, N. Cesa-Bianchi, and G. Stoltz