Response Regret Martin Zinkevich AAAI Fall Symposium November 5th, 2005 This work was supported by NSF Career Grant #IIS-0133689.

Outline  Introduction  Repeated Prisoners’ Dilemma Tit-for-Tat Grim Trigger  Traditional Regret  Response Regret  Conclusion

The Prisoner’s Dilemma  Two prisoners (Alice and Bob) are caught for a small crime. They make a deal not to squeal on each other for a large crime.  Then, the authorities meet with each prisoner separately, and offer a pardon for the small crime if the prisoner turns (his/her) partner in for the large crime.  Each has two options: Cooperate with (his/her) fellow prisoner, or Defect from the deal.

Bimatrix Game

                  Bob Cooperates                 Bob Defects
Alice Cooperates  Alice: 1 year,  Bob: 1 year    Alice: 6 years, Bob: 0 years
Alice Defects     Alice: 0 years, Bob: 6 years   Alice: 5 years, Bob: 5 years

Bimatrix Game

                  Bob Cooperates   Bob Defects
Alice Cooperates  -1, -1           -6,  0
Alice Defects      0, -6           -5, -5

Nash Equilibrium

                  Bob Cooperates   Bob Defects
Alice Cooperates  -1, -1           -6,  0
Alice Defects      0, -6           -5, -5

(The unique Nash equilibrium is mutual defection: both defect and receive -5, -5.)

The Problem  Each player, acting to slightly improve his or her own circumstances, hurts the other player, so that if both acted “irrationally”, they would both do better.

A Better Model for Real Life  In real life there are consequences for misbehavior  These consequences improve everyone’s outcomes  A better model: infinitely repeated games

The Goal  Can we come up with algorithms with performance guarantees in the presence of other intelligent agents, guarantees that take the delayed consequences of actions into account?  Side effect: a goal for reinforcement learning in infinite POMDPs.

Regret Versus Standard RL  Guarantees of performance during learning.  No guarantee for the “final” policy… …for now.

A New Measure of Regret  Traditional Regret measures immediate consequences  Response Regret measures delayed effects

Outline  Introduction  Repeated Prisoners’ Dilemma Tit-for-Tat Grim Trigger  Traditional Regret  Response Regret  Conclusion

Repeated Bimatrix Game

                  Bob Cooperates   Bob Defects
Alice Cooperates  -1, -1           -6,  0
Alice Defects      0, -6           -5, -5

Finite State Machine (for Bob) [Diagram: Bob’s strategy as a two-state machine with states “Bob cooperates” and “Bob defects”; edges labeled by Alice’s action (“Alice cooperates”, “Alice defects”, “Alice *”) determine Bob’s next state.]

Grim Trigger [FSM diagram: start in “Bob cooperates” and stay there while Alice cooperates; if Alice ever defects, move to “Bob defects” and stay there forever (Alice *).]

Always Cooperate [FSM diagram: a single state, “Bob cooperates”; every Alice action (Alice *) loops back to it.]

Always Defect [FSM diagram: a single state, “Bob defects”; every Alice action (Alice *) loops back to it.]

Tit-for-Tat [FSM diagram: states “Bob cooperates” and “Bob defects”; Bob’s next state copies Alice’s last action (Alice cooperates → “Bob cooperates”, Alice defects → “Bob defects”).]
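
To make these machines concrete, here is a minimal Python sketch (my own illustration, not from the slides); the state names, dictionary layout, and the play helper are assumptions of the sketch.

```python
# Minimal sketch (not from the slides): Bob's strategies as finite state machines.
# Each FSM maps (current state, Alice's last action) -> next state, and the state
# name is the action Bob plays that round. 'C' = cooperate, 'D' = defect.

TIT_FOR_TAT = {("C", "C"): "C", ("C", "D"): "D",
               ("D", "C"): "C", ("D", "D"): "D"}

GRIM_TRIGGER = {("C", "C"): "C", ("C", "D"): "D",
                ("D", "C"): "D", ("D", "D"): "D"}   # once defecting, defect forever

ALWAYS_COOPERATE = {("C", "C"): "C", ("C", "D"): "C"}
ALWAYS_DEFECT = {("D", "C"): "D", ("D", "D"): "D"}  # use start_state="D"


def play(bob_fsm, alice_actions, start_state="C"):
    """Return Bob's action sequence against a fixed sequence of Alice's actions."""
    state, bob_actions = start_state, []
    for a in alice_actions:
        bob_actions.append(state)      # Bob plays the action named by his current state
        state = bob_fsm[(state, a)]    # then transitions on Alice's action
    return bob_actions


print(play(TIT_FOR_TAT, list("CCDCC")))   # ['C', 'C', 'C', 'D', 'C']
print(play(GRIM_TRIGGER, list("CCDCC")))  # ['C', 'C', 'C', 'D', 'D']
```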

Discounted Utility [Diagram: a play of the repeated game against tit-for-tat in which, after each round, a coin decides GO with Pr = 2/3 or STOP with Pr = 1/3; the sample trace of Alice’s actions and payoffs is C (-1), D (0), C (-6), D (0), C (-6), accumulated until STOP.]

Discounted Utility  The expected value of that process   t=1 1 u t  t-1

Optimal Value Functions for FSMs  V_γ*(s): discounted utility of the OPTIMAL policy from state s  V_0*(s): immediate maximum utility at state s  V_γ*(B): discounted utility of the OPTIMAL policy given belief B over states  V_0*(B): immediate maximum utility given belief B over states  (GO with Pr = γ, STOP with Pr = 1-γ)

Best Responses, Discounted Utility  If >1/5, a policy is a best response to grim trigger iff it always cooperates when playing grim trigger. Bob cooperates Bob defects Alice defects Alice * Alice cooperates

Best Responses, Discounted Utility  Similarly, if >1/5, a policy is a best response to tit-for-tat iff it always cooperates when playing tit-for-tat. Bob cooperates Bob defects Alice defects Alice defects Alice cooperates Alice cooperates

Knowing Versus Learning  Given a known FSM for the opponent, we can determine the optimal policy (for some γ) from an initial state.  However, if it is an unknown FSM, by the time we learn what it is, it will be too late to act optimally.

Grim Trigger or Always Cooperate? [Diagrams: the grim trigger FSM and the always-cooperate FSM; until Alice defects, the two are indistinguishable.] For learning, optimality from the initial state is a bad goal.

Deterministic Infinite SMs  Can represent any deterministic policy  Obtained by de-randomization [Diagram: an infinite tree of states labeled C and D.]

New Goal  Can a measure of regret allow us to play like tit-for-tat in the Infinitely Repeated Prisoner’s Dilemma?  In addition, it should be possible for one algorithm to minimize regret against all possible opponents (finite and infinite SMs).

Outline  Introduction  Repeated Prisoners’ Dilemma Tit-for-Tat Grim Trigger  Traditional Regret  Response Regret  Conclusion

Traditional Regret: Rock-Paper-Scissors

                      Bob plays Rock   Bob plays Paper   Bob plays Scissors
Alice plays Rock      Tie              Bob wins $1       Alice wins $1
Alice plays Paper     Alice wins $1    Tie               Bob wins $1
Alice plays Scissors  Bob wins $1      Alice wins $1     Tie

Traditional Regret: Rock-Paper-Scissors

                      Bob plays Rock   Bob plays Paper   Bob plays Scissors
Alice plays Rock       0,  0           -1,  1             1, -1
Alice plays Paper      1, -1            0,  0            -1,  1
Alice plays Scissors  -1,  1            1, -1             0,  0

Rock-Paper-Scissors [Diagram: Bob’s strategy as a state machine that plays the best response (BR) to Alice’s last action.]

Utility of the Algorithm  Define u_t to be the utility of ALG at time t.  Define u_0^ALG to be: u_0^ALG = (1/T) ∑_{t=1}^T u_t  Here: u_0^ALG = (1/5)(0 + 1 + (-1) + 1 + 0) = 1/5

Rock-Paper-Scissors [Diagram: visit counts for Bob’s internal states — 3 visits and 1 visit.]  u_0^ALG = 1/5

Rock-Paper-Scissors [Diagram: the same states annotated with visit frequencies — 3/5 and 1/5.]  u_0^ALG = 1/5

Rock-Paper-Scissors [Diagram: Alice is dropped into Bob’s states according to the visit frequencies (3/5 and 1/5); against that belief, her options are worth 0, 2/5, and -2/5.]  u_0^ALG = 1/5

Traditional Regret  Consider B to be the empirical frequency states were visited.  Define u 0 ALG to be the average utility of the algorithm.  Traditional regret of ALG is: R= V  * ( B )-u 0 ALG R=(2/5)-(1/5) u 0 ALG =1/5 0 2/5 -2/5

Traditional Regret  Goal: regret approach zero a.s.  Exists an algorithm that will do this for all opponents.

What Algorithm?  Gradient Ascent With Euclidean Projection (Zinkevich, 2003):  (when p i strictly positive)

What Algorithm?  Exponential Weighted Experts (Littlestone + Warmuth, 1994):  And a close relative:

What Algorithm?  Regret Matching:

What Algorithm?  Lots of them!

Extensions to Traditional Regret (Foster and Vohra, 1997)  Into the past…  Keep a short history of recent play  Optimal against “BR to Alice’s Last”

Extensions to Traditional Regret  (Auer et al)  Only see u t, not u i,t :  Use an unbiased estimator of u i,t :

Outline  Introduction  Repeated Prisoners’ Dilemma Tit-for-Tat Grim Trigger  Traditional Regret  Response Regret  Conclusion

This Talk  Do you want to?  Even then, is it possible?

Traditional Regret: Prisoner’s Dilemma [Diagram: the tit-for-tat FSM, together with a sample trace of Alice’s play that begins C, D, C, D, C and then settles into always defecting.]

Traditional Regret: Prisoner’s Dilemma [Diagram: tit-for-tat’s states weighted by empirical visit frequency — “Bob cooperates” with probability 0.2, “Bob defects” with probability 0.8. Against this belief, the immediate expected utility of defecting is -4 and of cooperating is -5.]

Traditional Regret

                  Bob Cooperates   Bob Defects
Alice Cooperates  -1, -1           -6,  0
Alice Defects      0, -6           -5, -5

The New Dilemma  Traditional regret forces greedy, short-sighted behavior.  A new concept is needed.

A New Measurement of Regret [Diagram: the same belief over tit-for-tat’s states — 0.2 on “Bob cooperates”, 0.8 on “Bob defects”.]  Use V_γ*(B) instead of V_0*(B)

Response Regret  Consider B to be the empirical distribution over states visited.  Define u 0 ALG to be the average utility of the algorithm.  Traditional regret is: R 0 = V  * ( B )-u 0 ALG  Response regret is: R  = V  * ( B )-?

Averaged Discounted Utility  Utility of the algorithm at time t': u_{t'}  Discounted utility from time t: ∑_{t'=t}^∞ γ^{t'-t} u_{t'}  Averaged discounted utility from 1 to T: u_γ^ALG = (1/T) ∑_{t=1}^T ∑_{t'=t}^∞ γ^{t'-t} u_{t'}  Dropped in at random but playing optimally: V_γ*(B)  Response regret: R_γ = V_γ*(B) - u_γ^ALG

Response Regret  Consider B to be the empirical distribution over states visited.  Traditional regret is: R 0 = V  * ( B )-u 0 ALG  Response regret is: R  = V  * ( B )-u  ALG

Comparing Regret Measures: when Bob Plays Tit-for-Tat [Diagram: belief 0.2 on “Bob cooperates”, 0.8 on “Bob defects”; Alice’s play: CCDCDDDDDDDDDDDDDDDDCCDCDDDDDDDDDDDDDDDDDDDDDDD]  R_0 = 1/10 (defect)  R_{1/5} = 0 (any policy)  R_{2/3} = 203/30 ≈ 6.76 (always cooperate)

Comparing Regret Measures: when Bob Plays Tit-for-Tat [Diagram: belief 1.0 on “Bob cooperates”, 0.0 on “Bob defects”; Alice’s play: CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC]  R_0 = 1 (defect)  R_{1/5} = 0 (any policy)  R_{2/3} = 0 (always cooperate/tit-for-tat/grim trigger)

Comparing Regret Measures: when Bob Plays Grim Trigger [Diagram: belief 0.2 on “Bob cooperates”, 0.8 on “Bob defects”; Alice’s play: CCDCDDDDDDDDDDDDDDDDCCDCDDDDDDDDDDDDDDDDDDDDDDD]  R_0 = 1/10 (defect)  R_{1/5} = 0 (grim trigger/tit-for-tat/always defect)  R_{2/3} = 11/30 (grim trigger/tit-for-tat)

Comparing Regret Measures: when Bob Plays Grim Trigger [Diagram: belief 1.0 on “Bob cooperates”, 0.0 on “Bob defects”; Alice’s play: CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC]  R_0 = 1 (defect)  R_{1/5} = 0 (always cooperate/always defect/tit-for-tat/grim trigger)  R_{2/3} = 0 (always cooperate/tit-for-tat/grim trigger)

Regrets vs Tit-for-Tat vs Grim Trigger

                                 vs Tit-for-Tat                             vs Grim Trigger
Play CDDDDDDDDD… / CCDDDDDDDD…   R_0 = 0.1, R_{1/5} = 0, R_{2/3} ≈ 6.76     R_0 = 0.1, R_{1/5} = 0, R_{2/3} ≈ 0.36
Play CCCCCCCCCCC…                R_0 = 1,   R_{1/5} = 0, R_{2/3} = 0        R_0 = 1,   R_{1/5} = 0, R_{2/3} = 0

What it Measures:  constantly recurring missed opportunities → high response regret  a few drastic mistakes → low response regret  convergence (to zero) implies a Nash equilibrium of the repeated game

Philosophy  Response regret cannot be known without knowing the opponent.  Response regret can be estimated while playing the opponent, so that the estimate in the limit will be exact a.s.

Determining Utility of a Policy in a State  If I want to know the discounted utility of using a policy P from the third state visited…  Use the policy P from the third time step ad infinitum, and take the discounted reward. S1S1 S2S2 S3S3 S4S4 S5S5

Determining Utility of a Policy in a State in Finite Time  Start using the policy P from the third time step: with probability , continue using P. Take the total reward over time steps P was used.  In EXPECTATION, the same as before. S1S1 S2S2 S3S3 S4S4 S5S5

Determining Utility of a Policy in a State in Finite Time Without ALWAYS Using It  With some probability α, start using the policy P from the third time step: with probability γ, continue using P. Take the total reward over the time steps P was used and multiply it by 1/α.  In EXPECTATION, the same as before.  Can estimate any finite number of policies at the same time this way. [Diagram: the sequence of visited states S_1, S_2, S_3, S_4, S_5.]
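
A sketch of this estimation idea (my own code; the probe probability, which I call alpha, and the env_step/policy interfaces are assumptions of the sketch): with probability alpha, switch to policy P at the chosen step, keep following it with probability γ per step, and importance-weight the accumulated reward by 1/alpha.

```python
import random

def estimate_policy_value(env_step, policy, alpha, gamma, rng=random):
    """One importance-weighted sample of the discounted value of following `policy`.
    env_step(action) -> reward is a stand-in for playing one round of the game."""
    if rng.random() > alpha:               # usually we do not probe at all
        return 0.0
    total = 0.0
    while True:
        total += env_step(policy())        # follow P for this step
        if rng.random() > gamma:           # stop following P with probability 1 - gamma
            break
    return total / alpha                   # weight by 1/alpha to keep the estimate unbiased

# Toy check with a dummy environment paying +1 per step:
samples = [estimate_policy_value(lambda a: 1.0, lambda: "C", alpha=0.1, gamma=2/3)
           for _ in range(100_000)]
print(sum(samples) / len(samples))         # approaches 1/(1 - 2/3) = 3
```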

Traditional Regret  Goal: regret approach zero a.s.  Exists an algorithm for all opponents.

Response Regret  Goal: regret approach zero a.s.  Exists an algorithm for all opponents.

A Hard Environment: The Combination Lock Problem [Diagram: a chain of states in which Bob defects (B_d); only one particular sequence of Alice’s actions advances along the chain to the single state where Bob cooperates (B_c), and other actions lead back toward the start.]

SPEED!  Response regret takes time to minimize (the combination lock problem).  Current work: restricting the adversary’s choice of policies. In particular, if the number of policies is N, then the regret is linear in N and polynomial in 1/(1-γ).

Related Work  Other work: De Farias and Megiddo 2004; Browning, Bowling, and Veloso 2004; Bowling and McCracken 2005  Episodic solutions: similar problems to the Finitely Repeated Prisoner’s Dilemma.

What is in a Name?  Why not Consequence Regret?

Questions? Thanks to: Avrim Blum (CMU) Michael Bowling (U Alberta) Amy Greenwald (Brown) Michael Littman (Rutgers) Rich Sutton (U Alberta)

Always Cooperate [Diagram: the always-cooperate FSM; Alice’s play: CCDCDDDDDDDDDDDDDDDDCCDCDDDDDDDDDDDDDDDDDDDDDDD]  R_0 = 1/10  R_{1/5} = 1/10  R_{2/3} = 1/10

Practice  Using these estimation techniques, it is possible to minimize response regret (make it approach zero almost surely in the limit in an ARBITRARY environment).  Similar to the Folk Theorems, it is also possible to converge to the socially optimal behavior if γ is close enough to 1. (???)

Traditional Regret: Prisoner’s Dilemma [Diagram: the tit-for-tat FSM.]

Possible Outcomes  Alice cooperates, Bob cooperates: Alice: 1 year Bob: 1 year  Alice defects, Bob cooperates: Alice: 0 years Bob: 6 years  Alice cooperates, Bob defects: Alice: 6 years Bob: 0 years  Alice defects, Bob defects: Alice: 5 years Bob: 5 years

Bimatrix Game

                  Bob Cooperates                 Bob Defects
Alice Cooperates  Alice: 1 year,  Bob: 1 year    Alice: 6 years, Bob: 0 years
Alice Defects     Alice: 0 years, Bob: 6 years   Alice: 5 years, Bob: 5 years

Repeated Bimatrix Game  The same one-shot game is played repeatedly.  Either average reward or discounted reward is considered.

Rock-Paper-Scissors [Diagram: Bob’s strategy as a state machine that plays the best response (BR) to Alice’s last action.]

One Slide Summary  Problem: Prisoner’s Dilemma  Solution: Infinitely Repeated Prisoner’s Dilemma  Same Problem: Traditional Regret  Solution: Response Regret

Formalism for FSMs (S, A, Ω, O, u, T)  States S  Finite actions A  Finite observations Ω  Observation function O: S → Ω  Utility function u: S × A → R (or u: S × O → R)  Transition function T: S × A → S  V*(s) = max_{a∈A} [u(s,a) + γ V*(T(s,a))]
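
A sketch (my own code) of solving V*(s) = max_a [u(s,a) + γ V*(T(s,a))] by value iteration for a deterministic FSM opponent; the tables below encode Alice’s view of the repeated prisoner’s dilemma against tit-for-tat.

```python
STATES = ["Bob_C", "Bob_D"]
ACTIONS = ["C", "D"]
# Alice's one-round utility u(s, a): Bob's action is determined by the state.
U = {("Bob_C", "C"): -1, ("Bob_C", "D"): 0,
     ("Bob_D", "C"): -6, ("Bob_D", "D"): -5}
# Tit-for-tat transition T(s, a): Bob's next state copies Alice's action.
T = {("Bob_C", "C"): "Bob_C", ("Bob_C", "D"): "Bob_D",
     ("Bob_D", "C"): "Bob_C", ("Bob_D", "D"): "Bob_D"}

def value_iteration(gamma, iters=1000):
    V = {s: 0.0 for s in STATES}
    for _ in range(iters):
        V = {s: max(U[s, a] + gamma * V[T[s, a]] for a in ACTIONS) for s in STATES}
    return V

print(value_iteration(gamma=2/3))   # cooperating forever is optimal: V(Bob_C) = -3
print(value_iteration(gamma=1/10))  # with little patience, defecting forever is better
```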

Beliefs  Suppose S is a set of states.  T(s,a): state  O(s): observation  u(s,a): value  V*(s) = max_{a∈A} [u(s,a) + γ V*(T(s,a))]  Suppose B is a distribution over states.  T(B,a,o): belief  O(B,o): probability  u(B,a): expected value  V*(B) = max_{a∈A} [u(B,a) + γ ∑_{o∈Ω} O(B,o) V*(T(B,a,o))]