
Slide 1: Regret Minimization in Stochastic Games
Shie Mannor and Nahum Shimkin
Technion - Israel Institute of Technology, Dept. of Electrical Engineering
UAI 2000, June 30, 2000

Slide 2: Introduction
We model a dynamic decision process as a stochastic game:
- The environment is non-stationary.
- Environments are not (necessarily) hostile.
- We look for the best possible strategy in light of the environment's actions.

Slide 3: Repeated Matrix Games
- The sets of single-stage mixed strategies P and Q are simplices.
- Rewards are defined by a reward matrix G: r(p, q) = pGq.
- Reward criterion: the average reward. It need not converge, since stationarity is not assumed.

Slide 4: Regret for Repeated Matrix Games
Suppose that by time t the average reward is r̄_t and the opponent's empirical strategy is q_t. The regret is defined as
    R_t = r*(q_t) − r̄_t,  where r*(q) = max_{p ∈ P} r(p, q).
A policy is called regret minimizing if
    lim sup_{t→∞} R_t ≤ 0 almost surely, for every strategy of the opponent.

Slide 5: Regret Minimization for Repeated Matrix Games
- Such policies do exist (Hannan, 1956).
- A proof using approachability theory (Blackwell, 1956).
- Also for games with partial observation (Auer et al., 1995; Rustichini, 1999).

Slide 6: Stochastic Games
Formal model:
- S = {1, …, s}: state space.
- A = A(s): actions of the regret-minimizing player, P1.
- B = B(s): actions of the "environment", P2.
- r: reward function, r(s, a, b).
- P: transition kernel, P(s'|s, a, b).
- The expected average reward for p ∈ P, q ∈ Q is r(p, q).
- Single-state recurrence assumption.

Slide 7: Bayes Reward in Strategy Space
For every stationary strategy q ∈ Q, the Bayes reward is defined as
    r*(q) = max_{p ∈ P} r(p, q).
Problems:
- P2's strategy is not completely observed.
- P1's observations may depend on the strategies of both players.

Slide 8: Bayes Reward in State-Action Space
Let f_t(s, b) be the observed frequency of P2's action b in state s. A natural estimate of q is
    q̂_t(b|s) = f_t(s, b) / Σ_{b'} f_t(s, b').
The associated Bayes envelope is defined over these state-action frequencies as the set of (frequency, reward) pairs whose reward is at least the Bayes reward of the estimated strategy.
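The empirical estimate q̂_t(b|s) above is just a per-state frequency count; a minimal sketch (the function name and the history format are assumptions for illustration):

```python
from collections import Counter

def estimate_opponent(history):
    """Empirical estimate of q(b|s): the observed frequency of P2's
    action b among all visits to state s, from a list of
    (state, opponent-action) pairs."""
    counts = Counter(history)                 # counts[(s, b)]
    totals = Counter(s for s, _ in history)   # visits to each state s
    return {(s, b): c / totals[s] for (s, b), c in counts.items()}
```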

Slide 9: Approachability Theory
- A standard tool in the theory of repeated matrix games (Blackwell, 1956), for games with vector-valued rewards under the average-reward criterion.
- A set C is approachable by P1 with a policy σ if the average reward vector converges to C almost surely, uniformly over P2's strategies:
    d(m̄_t, C) → 0 a.s.
- The theory was extended to recurrent stochastic games (Shimkin and Shwartz, 1993).

Slide 10: The Convex Bayes Envelope
In general, BE is not approachable. Define CBE = co(BE), that is
    CBE = {(f, r) : r ≥ r**(f)},
where r** is the lower convex hull of r*.
Theorem: CBE is approachable. In particular, approaching CBE guarantees an average reward of at least val, the value of the game.
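The lower convex hull r** can be illustrated in one frequency dimension (the envelope in the talk lives over the full state-action frequency space; this one-dimensional sketch uses the standard monotone-chain construction):

```python
def lower_convex_hull(points):
    """Lower convex hull of 2-D points (x, r*(x)): a one-dimensional
    illustration of convexifying the Bayes envelope from below."""
    hull = []
    for x, y in sorted(points):
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            # drop the middle point if it does not lie below the chord
            if (x2 - x1) * (y - y1) - (x - x1) * (y2 - y1) <= 0:
                hull.pop()
            else:
                break
        hull.append((x, y))
    return hull
```

For instance, a point lying above the chord between its neighbors is removed, so the hull traces the largest convex function below the given values.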

Slide 11: Single-Controller Games
Theorem: Assume that P2 alone controls the transitions, i.e.
    P(s'|s, a, b) = P(s'|s, b) for all a;
then BE itself is approachable.

Slide 12: An Application to Prediction with Expert Advice
- Given a channel and a set of experts.
- At each time epoch, each expert states his prediction of the next symbol, and P1 has to choose his prediction a.
- Then a letter b appears in the channel, and P1 receives the prediction reward r(a, b).
- The problem can be formulated as a stochastic game, where P2 stands for all the experts and the channel.
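A classical scheme in this setting (deterministic weighted majority; a sketch of the expert-advice framework, not the talk's stochastic-game construction) predicts the weighted plurality of the experts' symbols and shrinks the weight of every expert that errs:

```python
def weighted_majority(expert_preds, outcomes, beta=0.5):
    """Deterministic weighted-majority prediction: expert_preds[t] is
    the tuple of symbols the experts predict at round t, outcomes[t]
    the symbol that actually appears. Returns P1's mistake count."""
    w = [1.0] * len(expert_preds[0])
    mistakes = 0
    for preds, y in zip(expert_preds, outcomes):
        votes = {}
        for wi, p in zip(w, preds):
            votes[p] = votes.get(p, 0.0) + wi
        guess = max(votes, key=votes.get)       # weighted plurality
        mistakes += (guess != y)
        w = [wi * (beta if p != y else 1.0)     # penalize wrong experts
             for wi, p in zip(w, preds)]
    return mistakes
```

When one expert is always correct, the mistake bound stays within a constant factor of that expert's (zero) mistakes plus a log term in the number of experts.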

Slide 13: Prediction Example (cont'd)
Theorem: P1 has a zero-regret strategy.
[Figure: state diagram of the prediction game; states track the experts' recommendations, from (0, 0, 0) through (k−1, k, k) and (k, k, k), with rewards r(a, b) and r = 0 on the marked transitions.]

Slide 14: An Example in Which BE Is Not Approachable
It can be proved that BE for the following game is not approachable.
[Figure: a two-state game with states S0 and S1, reward r = b in each state, P1 actions a = 0 and a = 1, transition probability P = 0.99, and B(0) = B(1) = {−1, 1}.]

Slide 15: Example (cont'd)
In r*(q) space the envelopes are:
[Figure: the Bayes envelope and its convexification plotted in r*(q) space.]

Slide 16: Open Questions
- Characterization of minimal approachable sets in reward-state-action space.
- On-line learning schemes for stochastic games with unknown parameters.
- Other ways of formulating optimality with respect to observed state-action frequencies.

Slide 17: Conclusions
- We considered the problem of regret minimization for stochastic games.
- The proposed solution concept, CBE, is based on convexification of the Bayes envelope in the natural state-action space.
- CBE ensures an average reward that is higher than the value of the game when the opponent is suboptimal.


Slide 19: Approachability Theory
Let m(p, q) be the average vector-valued reward in the game when P1 and P2 play p and q.
Theorem (Blackwell, 1956): A convex set C is approachable if and only if for every q ∈ Q there exists p ∈ P such that m(p, q) ∈ C.
Extended to stochastic games (Shimkin and Shwartz, 1993).
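Blackwell's condition can be probed numerically on a small example. The sketch below is a crude check, not a proof: it approximates "every q" by a finite grid and searches over a few candidate mixed strategies p (pure actions and pairwise midpoints); the function name and search strategy are assumptions for illustration.

```python
import numpy as np

def blackwell_condition(m, in_C, q_grid):
    """Numerical sketch of Blackwell's condition for a convex set C:
    for every opponent mixed strategy q (here: a finite grid), some
    mixed p puts the expected vector reward m(p, q) inside C.
    m[a][b] is the vector reward under pure actions (a, b)."""
    m = np.asarray(m, dtype=float)                 # shape (nA, nB, d)
    for q in q_grid:
        mq = np.tensordot(m, np.asarray(q), axes=([1], [0]))  # (nA, d)
        cands = [mq[i] for i in range(len(mq))]
        cands += [(mq[i] + mq[j]) / 2              # crude mixed-p search
                  for i in range(len(mq)) for j in range(i + 1, len(mq))]
        if not any(in_C(v) for v in cands):
            return False
    return True
```

On matching pennies (value 0), the half-line of rewards at least 0 passes the check, while rewards at least 0.5 fail at q = (0.5, 0.5), matching the game's value.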

Slide 20: A Related Vector-Valued Game
Define the following vector-valued game: if in state s action b is played by P2 and a reward r is gained, then the vector-valued stage reward m_t consists of an indicator of the realized pair (s, b) together with the scalar reward r.
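The stage vector above can be written out concretely: its time average then yields the state-action frequencies together with the average reward (the flattened indexing is an illustrative choice):

```python
import numpy as np

def stage_vector(s, b, r, n_states, n_actions):
    """Vector-valued stage reward for the auxiliary game: a one-hot
    indicator of the realized (state, P2-action) pair, concatenated
    with the scalar reward r."""
    v = np.zeros(n_states * n_actions + 1)
    v[s * n_actions + b] = 1.0   # indicator of the pair (s, b)
    v[-1] = r                    # scalar reward in the last coordinate
    return v
```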
