
1 Reinforcement Learning and Markov Decision Processes: A Quick Introduction Hector Munoz-Avila Stephen Lee-Urban www.cse.lehigh.edu/~munoz/InSyTe

2 Outline
- Introduction
  - Adaptive Game AI
  - Domination games in Unreal Tournament©
  - Reinforcement Learning
- Adaptive Game AI with Reinforcement Learning
  - RETALIATE – architecture and algorithm
- Empirical Evaluation
- Final Remarks – Main Lessons

3 Introduction: Adaptive Game AI, Unreal Tournament, Reinforcement Learning

4 Adaptive AI in Games

                           | Without (shipped) Learning      | With Learning
                           | Non-Stochastic | Stochastic     | Offline    | Online
Symbolic (FOL, etc.)       | Scripts        | HTN Planning   | Trained VS | Decision Tree
Sub-Symbolic (weights, etc.) | Stored NNs   | Genetic Alg.   | RL offline | RL online

In this class: using Reinforcement Learning to accomplish online learning of game AI for team-based first-person shooters.
HTNbots: we presented this before (Lee-Urban et al., ICAPS 2007)
http://www.youtube.com/watch?v=yO9CcEujJ64

5 Adaptive Game AI and Learning
Learning – Motivation
- Combinatorial explosion of possible situations:
  - Tactics (e.g., the competing team's tactics)
  - Game worlds (e.g., the map where the game is played)
  - Game modes (e.g., domination, capture the flag)
- Little time for development
Learning – the "Cons"
- Difficult to control and predict the game AI
- Difficult to test

6 Unreal Tournament© (UT)
- Online FPS developed by Epic Games Inc., released in 1999
- Six gameplay modes, including team deathmatch and domination games
- Gamebots: a client-server architecture for controlling bots, started by the U.S.C. Information Sciences Institute (ISI)

7 UT Domination Games
- A number of fixed domination locations.
- Ownership: a location belongs to the team of the last player to step into it.
- Scoring: a team point is awarded for every five seconds a location remains controlled.
- Winning: the first team to reach a pre-determined score (50) wins.
(Figure: top-down view of a domination map)

8 Reinforcement Learning

9 Some Introductory RL Videos
- http://demo.viidea.com/ijcai09_paduraru_rle/
- http://www.youtube.com/watch?v=NR99Hf9Ke2c
- http://demo.viidea.com/ijcai09_littman_rlrl/

10 Reinforcement Learning
- Agents learn policies through rewards and punishments.
- Policy: determines what action to take from a given state (or situation).
- The agent's goal is to maximize returns.
- Tabular techniques: we maintain a "Q-table"
  - Q-table: State × Action → value (see the sketch below)
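
To make the State × Action → value mapping concrete, here is a minimal Python sketch of a tabular Q-table for a three-location domination game; the "EFN" state encoding and the bot-to-location action tuples are illustrative assumptions, not the exact representation used in the slides.

```python
# Hypothetical tabular Q-table for a 3-location domination game.
# Each location is owned by the Enemy (E), a Friendly team (F), or Nobody (N).
states = [a + b + c for a in "EFN" for b in "EFN" for c in "EFN"]   # e.g. "EFE"

# One action = sending each of the three bots to one of the three locations.
actions = [(l1, l2, l3) for l1 in (1, 2, 3) for l2 in (1, 2, 3) for l3 in (1, 2, 3)]

# Q-table: State x Action -> value, with a neutral initial value.
q_table = {s: {a: 0.5 for a in actions} for s in states}

print(q_table["EFE"][(1, 2, 3)])   # 0.5 before any learning
```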

11 The DOM Game
(Figure: example map showing domination points, walls, and spawn points)
Let's write on the blackboard: a policy for this map and a potential Q-table.

12 Example of a Q-Table
(Figure: a Q-table with states as rows and actions as columns, highlighting a "good" action, a "bad" action, and the best action identified so far for state "EFE", in which the enemy controls two of the domination points)

13 Reinforcement Learning Problem
(Figure: the same states × actions Q-table)
How can we identify, for every state, which is the BEST action to take over the long run?
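
Part of the answer, once the Q-values have been learned, is to act greedily with respect to the table, usually with a little exploration mixed in. A minimal sketch, reusing the hypothetical q_table dictionary from the earlier example (the epsilon value is an assumption for illustration):

```python
import random

def best_action(q_table, state):
    """Greedy policy: pick the action with the highest learned Q-value."""
    return max(q_table[state], key=q_table[state].get)

def epsilon_greedy(q_table, state, epsilon=0.1):
    """Mostly exploit the best known action, but explore a random one occasionally."""
    if random.random() < epsilon:
        return random.choice(list(q_table[state]))
    return best_action(q_table, state)
```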

14 Let Us Model the Problem of Finding the best Build Order for a Zerg Rush as a Reinforcement Learning Problem

15 Adaptive Game AI with RL: RETALIATE (Reinforced Tactic Learning in Agent-Team Environments)

16 The RETALIATE Team
- Controls two or more UT bots.
- Commands bots to execute actions through the GameBots API.
- The UT server provides sensory (state and event) information about the UT world and controls all gameplay.
- Gamebots acts as middleware between the UT server and the game AI.

17 The RETALIATE Algorithm

18 Initialization
Game model (states):
- n is the number of domination points
- A state is (Owner_1, Owner_2, …, Owner_n), where each Owner_i is Team 1, Team 2, …, or None
Actions:
- m is the number of bots in the team
- An action is (goto_1, goto_2, …, goto_m), where each goto_j is one of loc_1, loc_2, …, loc_n
Initialization: for all states s and for all actions a, Q[s, a] ← 0.5
(A sketch of this initialization follows.)
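
A minimal Python sketch of this initialization, assuming three domination points and a three-bot team; the constants and names are illustrative, not the actual RETALIATE code.

```python
from itertools import product

NUM_LOCATIONS = 3   # n: number of domination points (illustrative value)
NUM_BOTS = 3        # m: number of bots in the team (illustrative value)

OWNERS = ("Team1", "Team2", "None")              # possible owner of each location
LOCATIONS = tuple(range(1, NUM_LOCATIONS + 1))   # goto targets loc_1 .. loc_n

# States: every assignment of owners to the n domination points.
states = list(product(OWNERS, repeat=NUM_LOCATIONS))

# Actions: every assignment of a goto location to each of the m bots.
actions = list(product(LOCATIONS, repeat=NUM_BOTS))

# For all states s and all actions a: Q[s][a] <- 0.5
Q = {s: {a: 0.5 for a in actions} for s in states}

print(len(states), len(actions))   # 27 27, i.e. 729 Q-table entries
```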

19 Rewards and Utilities
- U(s) = F(s) − E(s), where
  - F(s) is the number of friendly locations
  - E(s) is the number of enemy-controlled locations
- R = U(s') − U(s)
- Standard Q-learning (Sutton & Barto, 1998):
  - Q(s, a) ← Q(s, a) + α (R + γ max_a' Q(s', a') − Q(s, a))

20 Rewards and Utilities
- U(s) = F(s) − E(s), where
  - F(s) is the number of friendly locations
  - E(s) is the number of enemy-controlled locations
- R = U(s') − U(s)
- Standard Q-learning (Sutton & Barto, 1998):
  - Q(s, a) ← Q(s, a) + α (R + γ max_a' Q(s', a') − Q(s, a))
- The step-size parameter α was set to 0.2
- The discount-rate parameter γ was set close to 0.9
- Thus, more recent state-reward pairs are considered more important than earlier ones
(A sketch of the reward and update computation follows.)
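
A minimal sketch of the reward computation and the Q-learning update, continuing the hypothetical Q dictionary from the initialization example above; the function and variable names are mine, not RETALIATE's.

```python
ALPHA = 0.2    # step-size parameter, as on the slide
GAMMA = 0.9    # discount-rate parameter, as on the slide

def utility(state, friendly="Team1", enemy="Team2"):
    """U(s) = F(s) - E(s): friendly minus enemy-controlled domination points."""
    return sum(owner == friendly for owner in state) - sum(owner == enemy for owner in state)

def q_update(Q, s, a, s_next):
    """One Q-learning step: Q(s,a) <- Q(s,a) + alpha*(R + gamma*max_a' Q(s',a') - Q(s,a))."""
    reward = utility(s_next) - utility(s)       # R = U(s') - U(s)
    best_next = max(Q[s_next].values())         # max_a' Q(s', a')
    Q[s][a] += ALPHA * (reward + GAMMA * best_next - Q[s][a])
    return reward
```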

21 State Information and Actions
State information provided by the UT server / Gamebots:
- x, y, z coordinates
- Player scores, team scores
- Domination location ownership
- Map, time limit, score limit
- Max # teams, max team size
- Navigation (path nodes, …), reachability
- Items (id, type, location, …)
- Events (hear, incoming, …)
Bot commands:
- SetWalk, RunTo, Stop, Jump, Strafe
- TurnTo, Rotate
- Shoot, ChangeWeapon, StopShoot

22 Managing (State × Action) Growth
Our table:
- States: ({E,F,N}, {E,F,N}, {E,F,N}) → 3^3 = 27
- Actions: ({L1, L2, L3}, …) → 3^3 = 27
- 27 × 27 = 729 entries
- Generally, 3^#loc × #loc^#bot
Adding health, discretized as (high, med, low):
- States: (…, {h,m,l}) → 27 × 3 = 81
- Actions: ({L1, L2, L3, Health}, …) → 4^3 = 64
- 81 × 64 = 5184 entries
- Generally, 3^(#loc+1) × (#loc+1)^#bot
The number of locations and the size of the team frequently vary. (A small helper for checking these sizes follows.)
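
A quick, hypothetical helper for sanity-checking this growth under the assumptions above (three owner values per location, and one extra "get health" target per bot when health is added):

```python
def table_size(num_locations, num_bots, with_health=False):
    """Q-table entries: (# states) x (# actions) for the encoding above."""
    states = 3 ** num_locations                # each location owned by E, F, or N
    goto_targets = num_locations
    if with_health:
        states *= 3                            # health discretized: high / med / low
        goto_targets += 1                      # extra "get health" action per bot
    actions = goto_targets ** num_bots
    return states * actions

print(table_size(3, 3))                        # 729
print(table_size(3, 3, with_health=True))      # 5184
```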

23 Empirical Evaluation: Opponents, Performance Curves, Videos

24 The Competitors
Team Name        | Description
HTNBot           | HTN planning; we discussed this previously.
OpportunisticBot | Bots go from one domination location to the next; if a location is under the control of the opponent's team, the bot captures it.
PossessiveBot    | Each bot is assigned a single domination location that it attempts to capture and hold during the whole game.
GreedyBot        | Attempts to recapture any location that is taken by the opponent.
RETALIATE        | Reinforcement learning.

25 Summary of Results
- Against the opportunistic, possessive, and greedy control strategies, RETALIATE won all 3 games in the tournament.
- Within the first half of the first game, RETALIATE developed a competitive strategy.
(Figure: performance curves over 5 runs of 10 games against the opportunistic, possessive, and greedy bots)

26 Summary of Results: HTNBots vs RETALIATE (Round 1)

27 Summary of Results: HTNBots vs RETALIATE (Round 2)
(Figure: score vs. time for RETALIATE, HTNbots, and their score difference)

28 Video: Initial Policy (top-down view of RETALIATE vs. an opponent)
http://www.cse.lehigh.edu/~munoz/projects/RETALIATE/BadStrategy.wmv

29 Video: Learned Policy (RETALIATE vs. an opponent)
http://www.cse.lehigh.edu/~munoz/projects/RETALIATE/GoodStrategy.wmv

30 Final Remarks: Lessons Learned, Future Work

31 Final Remarks (1)
From our work with RETALIATE we learned the following lessons, beneficial to any real-world application of RL to these kinds of games:
- Separate individual bot behavior from team strategies.
- Model the problem of learning team tactics through a simple state formulation.

32 Final Remarks (2)
It is very hard to predict all strategies beforehand.
- As a result, RETALIATE was able to find a weakness and exploit it to produce a winning strategy that HTNBots could not counter.
- On the other hand, HTNBots produced winning strategies against the other opponents from the beginning, while it took RETALIATE half a game in some situations.
- Tactics emerging from RETALIATE can be difficult to predict, so a game developer will have a hard time maintaining the game AI.

33 Thank you! Questions?

