# Response Regret Martin Zinkevich AAAI Fall Symposium November 5 th, 2005 This work was supported by NSF Career Grant #IIS-0133689.

## Presentation on theme: "Response Regret Martin Zinkevich AAAI Fall Symposium November 5 th, 2005 This work was supported by NSF Career Grant #IIS-0133689."— Presentation transcript:

Response Regret Martin Zinkevich AAAI Fall Symposium November 5 th, 2005 This work was supported by NSF Career Grant #IIS-0133689.

Outline  Introduction  Repeated Prisoners’ Dilemma Tit-for-Tat Grim Trigger  Traditional Regret  Response Regret  Conclusion

The Prisoner’s Dilemma  Two prisoners (Alice and Bob) are caught for a small crime. They make a deal not to squeal on each other for a large crime.  Then, the authorities meet with each prisoner separately, and offer a pardon for the small crime if the prisoner turns (his/her) partner in for the large crime.  Each has two options: Cooperate with (his/her) fellow prisoner, or Defect from the deal.

Bimatrix Game Alice: 5 years Bob: 5 years Alice: 0 years Bob: 6 years Alice Defects Alice: 6 years Bob: 0 years Alice: 1 year Bob: 1 year Alice Cooperates Bob Defects Bob Cooperates

Bimatrix Game Bob Cooperates Bob Defects Alice Cooperates -1,-1-6,0 Alice Defects 0,-6-5,-5

Nash Equilibrium Bob Cooperates Bob Defects Alice Cooperates -1,-1-6,0 Alice Defects 0,-6-5,-5

The Problem  Each acting to slightly improve his/her circumstances hurts the other player, such that if they both acted “irrationally”, they would both do better.

A Better Model for Real Life  Consequences for misbehavior  These improve life  A better model: Infinitely repeated games

The Goal  Can we come up with algorithms with performance guarantees in the presence of other intelligent agents which take into account the delayed consequences?  Side effect: a goal for reinforcement learning in infinite POMDPs.

Regret Versus Standard RL  Guarantees of performance during learning.  No guarantee for the “final” policy… …for now.

A New Measure of Regret  Traditional Regret measures immediate consequences  Response Regret measures delayed effects

Outline  Introduction  Repeated Prisoners’ Dilemma Tit-for-Tat Grim Trigger  Traditional Regret  Response Regret  Conclusion

Outline  Introduction  Repeated Prisoners’ Dilemma Tit-for-Tat Grim Trigger  Traditional Regret  Response Regret  Conclusion

Repeated Bimatrix Game -5,-50,-6 Alice Defects -6,0-1,-1 Alice Cooperates Bob Defects Bob Cooperates

Finite State Machine (for Bob) Bob cooperates Bob defects Alice defects Alice defects Alice * Alice cooperates Bob cooperates Alice cooperates

Grim Trigger Bob cooperates Bob defects Alice defects Alice * Alice cooperates

Always Cooperate Bob cooperates Alice *

Always Defect Bob defects Alice *

Tit-for-Tat Bob cooperates Bob defects Alice defects Alice defects Alice cooperates Alice cooperates

Discounted Utility Bob cooperates Bob defects Alice cooperates Alice defects Alice defects Alice cooperates GO STOP GO STOP GO STOP GO STOP GO STOP GO STOP GO STOP GO STOP GO STOP GO STOP GO STOP GO STOP C -1 D 0 C -6 D 0 C -6 GO Pr[ ]=2/3 STOP Pr[ ]=1/3

Discounted Utility  The expected value of that process   t=1 1 u t  t-1

Optimal Value Functions for FSMs  V  * (s) discounted utility of OPTIMAL policy from state s  V  * (s) immediate maximum utility at state s  V  * ( B ) discounted utility of OPTIMAL policy given belief over states B  V  * ( B ) immediate maximum utility given belief over states B GO Pr[ ]= STOP Pr[ ]=(1-)

Best Responses, Discounted Utility  If >1/5, a policy is a best response to grim trigger iff it always cooperates when playing grim trigger. Bob cooperates Bob defects Alice defects Alice * Alice cooperates

Best Responses, Discounted Utility  Similarly, if >1/5, a policy is a best response to tit-for-tat iff it always cooperates when playing tit-for-tat. Bob cooperates Bob defects Alice defects Alice defects Alice cooperates Alice cooperates

Knowing Versus Learning  Given a known FSM for the opponent, we can determine the optimal policy (for some ) from an initial state.  However, if it is an unknown FSM, by the time we learn what it is, it will be too late to act optimally.

Grim Trigger or Always Cooperate? Bob cooperates Bob defects Alice defects Alice * Alice cooperates Bob cooperates Alice * Grim TriggerAlways Cooperate For learning, optimality from the initial state is a bad goal.

Deterministic Infinite SMs  represent any deterministic policy  de-randomization C D C C D D D

New Goal  Can a measure of regret allow us to play like tit-for-tat in the Infinitely Repeated Prisoner’s Dilemma?  In addition, it should be possible for one algorithm to minimize regret against all possible opponents (finite and infinite SMs).

Outline  Introduction  Repeated Prisoners’ Dilemma Tit-for-Tat Grim Trigger  Traditional Regret  Response Regret  Conclusion

Traditional Regret: Rock-Paper-Scissors Bob plays Rock Bob plays Paper Bob plays Scissors Alice plays Rock Tie Bob wins \$1 Alice wins \$1 Alice plays Paper Alice wins \$1 Tie Bob wins \$1 Alice plays Scissors Bob wins \$1 Alice wins \$1 Tie

Traditional Regret: Rock-Paper-Scissors Bob plays Rock Bob plays Paper Bob plays Scissors Alice plays Rock 0,00,0-1,11,-1 Alice plays Paper 1,-10,00,0-1,1 Alice plays Scissors -1,11,-10,00,0

Rock-Paper-Scissors Bob plays BR to Alice’s Last

Utility of the Algorithm  Define u t to be the utility of ALG at time t.  Define u 0 ALG to be: u 0 ALG =(1/T) t=1 T u t  Here: u 0 ALG =(1/5)(0+1+(-1)+1+0)=1/5 u 0 ALG =1/5

Rock-Paper-Scissors Visit Counts for Bob’s Internal States 3 Visits 1 Visit u 0 ALG =1/5

Rock-Paper-Scissors Frequencies 3/5 Visits 1/5 Visits u 0 ALG =1/5

Rock-Paper-Scissors Dropped according to Frequencies 3/5 Visits 1/5 Visits 0 2/5 -2/5 u 0 ALG =1/5

Traditional Regret  Consider B to be the empirical frequency states were visited.  Define u 0 ALG to be the average utility of the algorithm.  Traditional regret of ALG is: R= V  * ( B )-u 0 ALG R=(2/5)-(1/5) u 0 ALG =1/5 0 2/5 -2/5

Traditional Regret  Goal: regret approach zero a.s.  Exists an algorithm that will do this for all opponents.

What Algorithm?  Gradient Ascent With Euclidean Projection (Zinkevich, 2003):  (when p i strictly positive)

What Algorithm?  Exponential Weighted Experts (Littlestone + Warmuth, 1994):  And a close relative:

What Algorithm?  Regret Matching:

What Algorithm?  Lots of them!

Extensions to Traditional Regret (Foster and Vohra, 1997)  Into the past… Have a short history Optimal against BR to Alice’s Last.

Extensions to Traditional Regret  (Auer et al)  Only see u t, not u i,t :  Use an unbiased estimator of u i,t :

Outline  Introduction  Repeated Prisoners’ Dilemma Tit-for-Tat Grim Trigger  Traditional Regret  Response Regret  Conclusion

This Talk  Do you want to?  Even then, is it possible?

Traditional Regret: Prisoner’s Dilemma Bob cooperates Bob defects Alice defects Alice defects Alice cooperates Alice cooperates C DCDC D D D D D D D D

Traditional Regret: Prisoner’s Dilemma Bob cooperates (0.2) Bob defects (0.8) Alice cooperates Alice defects Alice defects Alice cooperates Alice defects:-4 Alice cooperates:-5

Traditional Regret Bob Cooperates Bob Defects Alice Cooperates -1,-1-6,0 Alice Defects 0,-6-5,-5

The New Dilemma  Traditional regret forces greedy, short-sighted behavior.  A new concept is needed.

A New Measurement of Regret Bob cooperates (0.2) Bob defects (0.8) Alice defects Alice defects Alice cooperates Alice cooperates V*(B)V*(B) instead of V 0 * ( B )

Response Regret  Consider B to be the empirical distribution over states visited.  Define u 0 ALG to be the average utility of the algorithm.  Traditional regret is: R 0 = V  * ( B )-u 0 ALG  Response regret is: R  = V  * ( B )-?

Averaged Discounted Utility Utility of algorithm at time t’= u t’ Discounted utility from time t=  t’=t 1 u t’  t’-t Averaged discounted utility from 1 to T u  ALG =(1/T) t=1 T  t’=t 1 u t’  t’-t Dropped in at random but play optimally: V*(B)V*(B) Response Regret R  = V  * ( B )-u  ALG

Response Regret  Consider B to be the empirical distribution over states visited.  Traditional regret is: R 0 = V  * ( B )-u 0 ALG  Response regret is: R  = V  * ( B )-u  ALG

Comparing Regret Measures: when Bob Plays Tit-for-Tat Bob cooperates (0.2) Bob defects (0.8) Alice defects Alice defects Alice cooperates Alice cooperates CCDCDDDDDDDDDDDDDDDDCCDCDDDDDDDDDDDDDDDDDDDDDDD R 0 =1/10 (defect) R 1/5 =0 (any policy) R 2/3 =(203/30) ¼ 6.76 (always cooperate)

Comparing Regret Measures: when Bob Plays Tit-for-Tat Bob cooperates (1.0) Bob defects (0.0) Alice defects Alice defects Alice cooperates Alice cooperates CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC R 0 =1 (defect) R 1/5 =0 (any policy) R 2/3 =0 (always cooperate/tit-for-tat/grim trigger)

Comparing Regret Measures: when Bob Plays Grim Trigger Bob cooperates (0.2) Bob defects (0.8) Alice defects Alice * Alice cooperates CCDCDDDDDDDDDDDDDDDDCCDCDDDDDDDDDDDDDDDDDDDDDDD R 0 =1/10 (defect) R 1/5 =0 (grim trigger/tit-for-tat/always defect) R 2/3 =11/30 (grim trigger/tit-for-tat)

Comparing Regret Measures: when Bob Plays Grim Trigger Bob cooperates (1.0) Bob defects (0.0) Alice defects Alice * Alice cooperates CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC R 0 =1 (defect) R 1/5 =0 (always cooperate/always defect/tit-for-tat/grim trigger) R 2/3 =0 (always cooperate/tit-for-tat/grim trigger)

Regrets vs Tit-for-Tat vs Grim Trigger CDDDDDDDDD CCDDDDDDDD R 0 =0.1 R 1/5 =0 R 2/3 ¼ 6.76 R 0 =0.1 R 1/5 =0 R 2/3 ¼ 0.36 CCCCCCCCCCC R 0 =1 R 1/5 =0 R 2/3 =0 R 0 =1 R 1/5 =0 R 2/3 =0

What it Measures:  constant opportunities high response regret  a few drastic mistakes low response regret  convergence implies Nash Equilibrium of the repeated game

Philosophy  Response regret cannot be known without knowing the opponent.  Response regret can be estimated while playing the opponent, so that the estimate in the limit will be exact a.s.

Determining Utility of a Policy in a State  If I want to know the discounted utility of using a policy P from the third state visited…  Use the policy P from the third time step ad infinitum, and take the discounted reward. S1S1 S2S2 S3S3 S4S4 S5S5

Determining Utility of a Policy in a State in Finite Time  Start using the policy P from the third time step: with probability , continue using P. Take the total reward over time steps P was used.  In EXPECTATION, the same as before. S1S1 S2S2 S3S3 S4S4 S5S5

Determining Utility of a Policy in a State in Finite Time Without ALWAYS Using It  With a probability, start using the policy P from the third time step: with probability , continue using P. Take the total reward over time steps P was used and multiply it by 1/.  In EXPECTATION, the same as before.  Can estimate any finite number of policies at the same time this way. S1S1 S2S2 S3S3 S4S4 S5S5

Traditional Regret  Goal: regret approach zero a.s.  Exists an algorithm for all opponents.

Response Regret  Goal: regret approach zero a.s.  Exists an algorithm for all opponents.

A Hard Environment: The Combination Lock Problem BdBd BdBd Ad Ac BdBd BdBd BcBc Ad A* Ad Ac

SPEED!  Response regret takes time to minimize (combination lock problem).  Current work: restricting the adversary’s choice of policies. In particular, if the number of policies is N, then the regret is linear in N and polynomial in 1/(1-).

Related Work  Other work De Farias and Meggido 2004 Browning, Bowling, and Veloso 2004 Bowling and McCracken 2005  Episodic solutions: similar problems to Finitely Repeated Prisoner’s Dilemma.

What is in a Name?  Why not Consequence Regret?

Questions? Thanks to: Avrim Blum (CMU) Michael Bowling (U Alberta) Amy Greenwald (Brown) Michael Littman (Rutgers) Rich Sutton (U Alberta)

Always Cooperate Bob cooperates Alice * CCDCDDDDDDDDDDDDDDDDCCDCDDDDDDDDDDDDDDDDDDDDDDD R 0 =1/10 R 1/5 =1/10 R 2/3 =1/10

Practice  Using these estimation techniques, it is possible to minimize response regret (make it approach zero almost surely in the limit in an ARBITRARY environment).  Similar to the Folk Theorems, it is also possible to converge to the socially optimal behavior if  is close enough to 1.(???)

Traditional Regret: Prisoner’s Dilemma Bob cooperates Bob defects Alice defects Alice defects Alice cooperates Alice cooperates

Possible Outcomes  Alice cooperates, Bob cooperates: Alice: 1 year Bob: 1 year  Alice defects, Bob cooperates: Alice: 0 years Bob: 6 years  Alice cooperates, Bob defects: Alice: 6 years Bob: 0 years  Alice defects, Bob defects: Alice: 5 years Bob: 5 years

Bimatrix Game Bob Cooperates Bob Defects Alice Cooperates Alice: 1 year Bob: 1 year Alice: 6 years Bob: 0 years Alice Defects Alice: 0 years Bob: 6 years Alice: 5 years Bob: 5 years

Repeated Bimatrix Game  The same one-shot game is played repeatedly.  Either average reward or discounted reward is considered.

Rock-Paper-Scissors Bob plays BR to Alice’s Last

One Slide Summary  Problem: Prisoner’s Dilemma  Solution: Infinitely Repeated Prisoner’s Dilemma  Same Problem: Traditional Regret  Solution: Response Regret

Formalism for FSMs (S,A, ,O,u,T)  States S  Finite actions A  Finite observations   Observation function O:S !   Utility function u:S £ A ! R (or u:S £ O ! R)  Transition function T:S £ A ! S  V * (s)=max a 2 A [u(s,a)+V * (T(s,a))]

Beliefs  Suppose S is a set of states.  T(s,a) state  O(s) observation  u(s,a) value  V * (s)=max a 2 A [u(s,a)+ V * (T(s,a))]  Suppose B is a distribution over states.  T( B,a,o) belief  O( B,o) probability  u( B,a) expected value  V * ( B )=max a 2 A [u( B,a)+  o 2  O( B,o)V * (T( B,a,o))]

Download ppt "Response Regret Martin Zinkevich AAAI Fall Symposium November 5 th, 2005 This work was supported by NSF Career Grant #IIS-0133689."

Similar presentations