# Regret Minimization in Stochastic Games

Shie Mannor and Nahum Shimkin, Technion, Israel Institute of Technology, Dept. of Electrical Engineering. Presented at UAI 2000, 6/30/00.


## Slide 1: Regret Minimization in Stochastic Games

Shie Mannor and Nahum Shimkin, Technion, Israel Institute of Technology, Dept. of Electrical Engineering.

## Slide 2: Introduction

Modeling of a dynamic decision process as a stochastic game:

- Non-stationarity of the environment
- Environments are not (necessarily) hostile
- Looking for the best possible strategy in light of the environment's actions

## Slide 3: Repeated Matrix Games

- The sets of single-stage mixed strategies P and Q are simplices
- Rewards are defined by a reward matrix G: r(p,q) = pGq
- Reward criterion: average reward
- The average reward need not converge, since stationarity of the opponent is not assumed
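To make the reward definition r(p,q) = pGq concrete, here is a minimal sketch; the 2x2 matrix (matching pennies) and the strategies are invented for illustration and are not from the talk:

```python
import numpy as np

# Illustrative 2x2 reward matrix G (matching pennies); not from the paper.
G = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

def reward(p: np.ndarray, q: np.ndarray, G: np.ndarray) -> float:
    """Single-stage expected reward r(p, q) = p G q for mixed strategies p, q."""
    return float(p @ G @ q)

# Uniform mixing by P1 earns the value (0) of matching pennies
# against any mixed strategy q of P2.
p_uniform = np.array([0.5, 0.5])
q = np.array([0.7, 0.3])
print(reward(p_uniform, q, G))  # 0.0
```

The same bilinear form extends to the stochastic-game setting later in the talk, where r(p,q) becomes a long-run average rather than a single-stage expectation.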

## Slide 4: Regret for Repeated Matrix Games

- Suppose that by time t the average reward is r̂_t and the opponent's empirical strategy is q_t
- The regret is defined as the gap between the best-response reward against q_t and the realized average reward r̂_t
- A policy is called regret minimizing if the regret vanishes asymptotically, for any strategy of the opponent
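The slide's formula images did not survive transcription; in standard notation (a reconstruction based on the definitions stated on the slide), they read:

```latex
% Regret with respect to the opponent's empirical mixed strategy q_t:
L_t \;=\; r^{*}(q_t) \;-\; \hat{r}_t ,
\qquad \text{where } r^{*}(q) \;=\; \max_{p \in P} r(p, q).

% A policy of P1 is regret minimizing if, for every strategy of P2,
\limsup_{t \to \infty} \bigl( r^{*}(q_t) - \hat{r}_t \bigr) \;\le\; 0
\quad \text{a.s.}
```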

## Slide 5: Regret Minimization for Repeated Matrix Games

- Such policies do exist (Hannan, 1956)
- A proof using approachability theory (Blackwell, 1956)
- Also for games with partial observation (Auer et al., 1995; Rustichini, 1999)
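As a concrete illustration that such policies exist, the sketch below runs regret matching (Hart and Mas-Colell's rule, chosen here for simplicity rather than the approachability-based construction of the talk) in matching pennies against a periodic opponent; the matrix, opponent pattern, and horizon are all invented for the example:

```python
import numpy as np

# Regret matching: play each action with probability proportional to its
# positive cumulative regret. Illustration only; not the paper's policy.
rng = np.random.default_rng(0)
G = np.array([[1.0, -1.0],
              [-1.0, 1.0]])   # matching pennies, for illustration
n = G.shape[0]

cum_regret = np.zeros(n)      # cumulative regret of each pure action
T = 20000

for t in range(T):
    pos = np.maximum(cum_regret, 0.0)
    # Mix proportionally to positive regrets (uniform if there are none).
    p = pos / pos.sum() if pos.sum() > 0 else np.full(n, 1.0 / n)
    a = rng.choice(n, p=p)
    b = 0 if t % 3 else 1     # opponent's fixed periodic pattern
    # Regret update: what each pure action would have earned vs. what we got.
    cum_regret += G[:, b] - G[a, b]

avg_regret = cum_regret.max() / T
print(avg_regret)
```

The average regret shrinks toward zero as T grows, which is exactly the no-regret property the slide asserts exists.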

## Slide 6: Stochastic Games

Formal model:

- S = {1, ..., s}: state space
- A = A(s): actions of the regret-minimizing player, P1
- B = B(s): actions of the "environment", P2
- r: reward function, r(s,a,b)
- P: transition kernel, P(s'|s,a,b)
- Expected average reward for p ∈ P, q ∈ Q is r(p,q)
- Single-state recurrence assumption

## Slide 7: Bayes Reward in Strategy Space

- For every stationary strategy q ∈ Q, the Bayes reward is defined as the best achievable reward against q
- Problems:
  - P2's strategy is not completely observed
  - P1's observations may depend on the strategies of both players
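The formula was lost in transcription; a reconstruction consistent with the slide's definition:

```latex
% Bayes reward against a stationary strategy q of P2:
r^{*}(q) \;=\; \max_{p \in P} \, r(p, q)
```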

## Slide 8: Bayes Reward in State-Action Space

- Let ψ_sb be the observed frequency of P2's action b in state s
- A natural estimate of q is the per-state empirical distribution of P2's actions
- The associated Bayes envelope is defined in terms of the Bayes reward evaluated at this estimate
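The formulas here were also lost; a plausible reconstruction (an assumption, based on the state-action frequencies ψ_sb just defined):

```latex
% Empirical estimate of P2's stationary strategy from observed frequencies:
\hat{q}(b \mid s) \;=\; \frac{\psi_{sb}}{\sum_{b'} \psi_{sb'}}

% Bayes envelope: frequency--reward pairs that match or exceed the
% best response to the estimated strategy:
BE \;=\; \bigl\{ (\psi, r) \;:\; r \ge r^{*}(\hat{q}(\psi)) \bigr\}
```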

## Slide 9: Approachability Theory

- A standard tool in the theory of repeated matrix games (Blackwell, 1956)
- Applies to games with vector-valued rewards, under the average-reward criterion
- A set is approachable by P1 with a policy σ if the average reward vector converges to the set, regardless of P2's play
- Extended to recurrent stochastic games (Shimkin and Shwartz, 1993)
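In symbols, the standard approachability definition (reconstructed, since the slide's formula did not survive extraction):

```latex
% \bar{m}_t denotes the average of the vector-valued rewards up to time t.
% A set A \subseteq \mathbb{R}^k is approachable by P1 with policy \sigma if,
% for every strategy of P2,
\lim_{t \to \infty} \operatorname{dist}\bigl( \bar{m}_t, A \bigr) \;=\; 0
\quad \text{a.s.}
```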

## Slide 10: The Convex Bayes Envelope

- In general, BE is not approachable
- Define CBE = co(BE), the envelope obtained by replacing the Bayes reward with its lower convex hull
- Theorem: CBE is approachable
- (val denotes the value of the game)

## Slide 11: Single-Controller Games

Theorem: Assume that P2 alone controls the transitions, i.e. P(s'|s,a,b) = P(s'|s,b); then BE itself is approachable.

## Slide 12: An Application to Prediction with Expert Advice

- Given a channel and a set of experts
- At each time epoch, each expert states his prediction of the next symbol, and P1 has to choose his own prediction
- Then a letter appears in the channel and P1 receives the prediction reward r(·,·)
- The problem can be formulated as a stochastic game, with P2 standing for all the experts and the channel

## Slide 13: Prediction Example (cont.)

Theorem: P1 has a zero-regret strategy.

[Figure: state diagram of the prediction game; states are labeled by the experts' recommendation vectors, including (0,0,0), (k-1,k,k), and (k,k,k), with prediction reward r(a,b) and r = 0 elsewhere.]

## Slide 14: An Example in Which BE Is Not Approachable

It can be proved that BE for the game below is not approachable.

[Figure: a two-state game with states S0 and S1, reward r = b in both states, P1's actions a = 0 and a = 1 controlling the transitions with probability P = 0.99, and P2's action sets B(0) = B(1) = {-1, 1}.]

## Slide 15: Example (cont.)

In r*(q) space the envelopes are:

[Figure: the envelopes plotted in r*(q) space.]

## Slide 16: Open Questions

- Characterization of minimal approachable sets in reward-state-action space
- On-line learning schemes for stochastic games with unknown parameters
- Other ways of formulating optimality with respect to observed state-action frequencies

## Slide 17: Conclusions

- The problem of regret minimization for stochastic games was considered
- The proposed solution concept, CBE, is based on convexification of the Bayes envelope in the natural state-action space
- The CBE concept ensures an average reward that is higher than the value of the game when the opponent is suboptimal

## Slide 18: Regret Minimization in Stochastic Games

Shie Mannor and Nahum Shimkin, Technion, Israel Institute of Technology, Dept. of Electrical Engineering.

## Slide 19: Approachability Theory

- Let m(p,q) be the average vector-valued reward in the game when P1 and P2 play p and q
- Theorem (Blackwell, 1956): a convex set C is approachable if and only if for every q ∈ Q there exists p ∈ P with m(p,q) ∈ C
- Extended to stochastic games (Shimkin and Shwartz, 1993)
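Blackwell's proof also prescribes how to approach such a set; a standard summary of the steering rule (a reconstruction, not verbatim from the slide):

```latex
% Average reward vector after t stages:
\bar{m}_t \;=\; \frac{1}{t} \sum_{\tau=1}^{t} m_\tau .

% If \bar{m}_t \notin C, let c_t be the point of C closest to \bar{m}_t.
% P1 then plays a mixed action p_{t+1} such that, for every q, the expected
% reward vector lies on the C-side of the hyperplane through c_t
% perpendicular to \bar{m}_t - c_t:
\bigl\langle \bar{m}_t - c_t ,\; m(p_{t+1}, q) - c_t \bigr\rangle \;\le\; 0
\qquad \forall q \in Q .
```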

## Slide 20: A Related Vector-Valued Game

Define the following vector-valued game: if in state s action b is played by P2 and a reward r is gained, then the vector-valued reward m_t records both the scalar reward and the visited state-action pair.
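A plausible reconstruction of the missing definition (an assumption, chosen to be consistent with the state-action frequencies ψ_sb used earlier):

```latex
% m_t has one reward coordinate plus one coordinate per (state, P2-action) pair:
m_t \;=\; \Bigl( r_t ,\; \bigl( \mathbf{1}\{ s_t = s,\, b_t = b \} \bigr)_{(s,b)} \Bigr),

% so that the running average recovers both the average reward and the
% empirical state-action frequencies:
\bar{m}_t \;=\; \Bigl( \hat{r}_t ,\; \bigl( \psi_{sb}(t) \bigr)_{(s,b)} \Bigr).
```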

