Published by Kristian Bradley. Modified over 6 years ago.
1
On-Line Markov Decision Processes for Learning Movement in Video Games
Aaron Arvey
2
Goals
Induce the human player (HP) movement strategy for the non-player character (NPC)
Learn in real time so that the HP strategy can be determined and mimicked
Use a reinforcement learning approach
Compare results with a (very) primitive finite state machine (FSM)
3
HP Movement
Every HP has an individual style and movement patterns
Best strategy for the NPC depends on the HP:
  “dumb” HP: FSM
  “smart” HP: learn from the HP
If you can’t beat ’em, join ’em!
4
Mimicking HP Movement I
How does the HP transition?
How does the HP react to the NPC?
Did the HP make the right move?
How long should we observe before mimicking actions already seen?
5
Mimicking HP Movement II
Use the FSM at the start
Observe the HP and record reactions
Once we have accumulated enough observations, determine the optimal policy
Assumptions:
  Game length is sufficient for learning
  All actions are reactions
6
Methods
Reinforcement learning
  Rewards
  States
  Actions
Probabilistic reinforcement learning
  Add in a probabilistic transition model
  Markov Decision Processes
7
Rewards
Experimentally determined (subjectively)
Represented as a function that looks at:
  Seeking the closest “dead” balls
  Dodging the closest “live” balls
  Maintaining distance from the HP
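A reward function combining these three terms might look like the sketch below. The weights, the danger radius, the preferred distance, and the linear form are all assumptions for illustration; the slides only say the rewards were tuned by hand.

```python
import math

# Hypothetical constants: the slides give no numeric values.
W_SEEK_DEAD = 1.0            # pull toward the closest dead (pick-up-able) ball
W_DODGE_LIVE = 2.0           # push away from the closest live (thrown) ball
W_KEEP_DISTANCE = 0.5        # penalty for straying from a preferred HP distance
DANGER_RADIUS = 3.0          # live balls farther than this no longer matter
PREFERRED_HP_DISTANCE = 5.0  # assumed ideal separation from the HP


def distance(a, b):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def reward(npc, hp, closest_dead_ball, closest_live_ball):
    """Hand-tuned reward for an NPC position (all arguments are (x, y) pairs)."""
    # Closer to a dead ball is better.
    seek = -W_SEEK_DEAD * distance(npc, closest_dead_ball)
    # Farther from a live ball is better, up to the danger radius.
    dodge = W_DODGE_LIVE * min(distance(npc, closest_live_ball), DANGER_RADIUS)
    # Staying near the preferred distance from the HP is better.
    keep = -W_KEEP_DISTANCE * abs(distance(npc, hp) - PREFERRED_HP_DISTANCE)
    return seek + dodge + keep
```

With these assumed weights, moving toward a dead ball raises the reward and standing next to a live ball lowers it.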
8
States
Discretize the world into a grid
State space includes:
  HP location
  NPC location
  Closest live and dead balls
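One way to encode such a state is sketched below. The world dimensions, the grid resolution, and the names `State` and `to_cell` are hypothetical; the slides do not give the actual grid size.

```python
from typing import NamedTuple, Tuple

Cell = Tuple[int, int]

# Hypothetical world and grid sizes (not given in the slides).
WORLD_WIDTH, WORLD_HEIGHT = 100.0, 50.0
COLS, ROWS = 10, 5


def to_cell(x: float, y: float) -> Cell:
    """Map a continuous world position to a (col, row) grid cell."""
    col = int(x / WORLD_WIDTH * COLS)
    row = int(y / WORLD_HEIGHT * ROWS)
    # Clamp so positions on or past the border still land in a valid cell.
    return (min(max(col, 0), COLS - 1), min(max(row, 0), ROWS - 1))


class State(NamedTuple):
    """Discrete state: grid cells of the four tracked objects."""
    hp: Cell
    npc: Cell
    live_ball: Cell
    dead_ball: Cell


def make_state(hp_xy, npc_xy, live_xy, dead_xy) -> State:
    return State(to_cell(*hp_xy), to_cell(*npc_xy),
                 to_cell(*live_xy), to_cell(*dead_xy))
```

Because `State` is a tuple it is hashable and can be used directly as a dictionary key. Note the cost of this encoding: a 10 x 5 grid with four tracked positions already gives 50^4 = 6,250,000 possible states, so a coarse grid matters.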
9
Actions
Very simplistic approach
Actions are “path” (strategy) oriented
NPC can plan to move in the four cardinal directions
Actions are chosen from a policy determined by a Markov Decision Process
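The four-direction action set can be sketched as follows. The `Action` enum, the screen-style convention that north decreases the row index, and the clamping at the grid border are assumptions for illustration.

```python
from enum import Enum


class Action(Enum):
    """Four cardinal moves as (d_col, d_row); north = row - 1 is an assumption."""
    NORTH = (0, -1)
    SOUTH = (0, 1)
    EAST = (1, 0)
    WEST = (-1, 0)


def apply_action(cell, action, cols, rows):
    """Return the cell reached by taking `action`, staying inside the grid."""
    d_col, d_row = action.value
    col = min(max(cell[0] + d_col, 0), cols - 1)
    row = min(max(cell[1] + d_row, 0), rows - 1)
    return (col, row)
```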
10
Markov Decision Processes (MDP)
An MDP consists of actions, states, rewards, a discount factor, and a probability model T
The discount factor weights immediate against future rewards
T describes the probability of moving from state s to state s’ when action a is performed
Solving the MDP produces a policy from which to choose actions
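For reference, the two quantities named on this slide in their usual textbook form. The symbols $R$, $\gamma$, $G$, and the time index $t$ are notation assumed here, not taken from the slides.

```latex
% Transition model: a probability distribution over next states
T(s, a, s') = P(s_{t+1} = s' \mid s_t = s,\ a_t = a),
\qquad \sum_{s'} T(s, a, s') = 1

% Discounted return: rewards t steps ahead are weighted by \gamma^t
G = \sum_{t=0}^{\infty} \gamma^{t} R(s_t), \qquad 0 \le \gamma < 1
```

With the discount of 0.8 used in this work, a reward three steps ahead is weighted by $0.8^3 = 0.512$, roughly half of an immediate reward.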
11
Policy
A policy is a mapping from states to actions
An optimal policy is the policy that maximizes the value of every state
The value of a state is determined by the potential rewards that could be received from being in that state
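Given state values, a policy can be read off greedily: in each state, pick the action with the highest expected next-state value. The sketch below assumes a nested-dictionary transition model `T[s][a][s2]` and a two-state toy problem; both are illustrative, not from the slides.

```python
def greedy_policy(states, actions, T, V):
    """Map each state to the action maximizing the expected value of the next state.

    T[s][a] is a dict {next_state: probability}; V maps states to values.
    """
    policy = {}
    for s in states:
        policy[s] = max(
            actions,
            key=lambda a: sum(p * V[s2] for s2, p in T[s][a].items()),
        )
    return policy


# Toy problem (hypothetical): being in "B" is valuable, "A" is not.
toy_states = ["A", "B"]
toy_actions = ["stay", "go"]
toy_T = {
    "A": {"stay": {"A": 1.0}, "go": {"B": 1.0}},
    "B": {"stay": {"B": 1.0}, "go": {"A": 1.0}},
}
toy_V = {"A": 0.0, "B": 10.0}

toy_policy = greedy_policy(toy_states, toy_actions, toy_T, toy_V)
```

Here the policy is a plain dictionary, which matches the definition on the slide: a mapping from states to actions.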
12
Value Iteration
Determine the approximate expected value of every state
An optimal policy can then be derived from those values
Formulated as a dynamic programming problem
Infinite time horizon
Update the expected value of states iteratively; halt when “close enough”
13
Value Iteration Algorithm
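A minimal sketch of textbook value iteration, not necessarily the variant shown in the talk. It assumes state-only rewards `R[s]`, a nested-dictionary transition model `T[s][a][s2]`, and a stopping threshold `epsilon`; the default discount of 0.8 is the value stated elsewhere in the deck, and the two-state toy problem is hypothetical.

```python
def value_iteration(states, actions, T, R, gamma=0.8, epsilon=1e-4):
    """Iterate V(s) = R(s) + gamma * max_a sum_s2 T(s, a, s2) * V(s2).

    Stops when the largest change in any state's value drops below epsilon
    (the "close enough" test).
    """
    V = {s: 0.0 for s in states}
    while True:
        new_V = {}
        delta = 0.0
        for s in states:
            best = max(
                sum(p * V[s2] for s2, p in T[s][a].items()) for a in actions
            )
            new_V[s] = R[s] + gamma * best
            delta = max(delta, abs(new_V[s] - V[s]))
        V = new_V
        if delta < epsilon:
            return V


# Toy problem (hypothetical): reward 1 per step in "B", 0 in "A".
toy_states = ["A", "B"]
toy_actions = ["stay", "go"]
toy_T = {
    "A": {"stay": {"A": 1.0}, "go": {"B": 1.0}},
    "B": {"stay": {"B": 1.0}, "go": {"A": 1.0}},
}
toy_R = {"A": 0.0, "B": 1.0}

toy_V = value_iteration(toy_states, toy_actions, toy_T, toy_R)
```

In the toy problem the values converge to about 5 for "B" (1 / (1 - 0.8)) and about 4 for "A" (0.8 times the value of "B"), since the best plan from "A" is to move to "B" and stay there.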
14
Method for Mimicking HP I
Use an MDP to determine the optimal policy
Possible actions, states, discount, and rewards are hard coded (discount = 0.8)
The transition model is the only element that must be determined during play
Utilize online methods for solving MDPs once we have a transition model
15
Method for Mimicking HP II
Determining T
  Use the FSM to start the game
  Observe how the HP “reacts” to the NPC
  Assume all actions are based on a reactive paradigm
  Once we have a frequency matrix, adjust for observation bias
  Use a Laplacian prior
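The counting step can be sketched as below. This reads the “Laplacian prior” as add-one (Laplace) smoothing, which is an assumption about what the slide means; the class name `TransitionModel` and its methods are hypothetical.

```python
from collections import defaultdict


class TransitionModel:
    """Estimate T(s, a, s2) from observed transitions with a Laplace prior."""

    def __init__(self, states, prior=1.0):
        self.states = list(states)
        self.prior = prior                 # pseudo-count added to every outcome
        self.counts = defaultdict(float)   # (s, a, s2) -> observed count
        self.totals = defaultdict(float)   # (s, a) -> observed count

    def observe(self, s, a, s2):
        """Record one observed transition; can be called during play."""
        self.counts[(s, a, s2)] += 1.0
        self.totals[(s, a)] += 1.0

    def prob(self, s, a, s2):
        """Smoothed estimate: (count + prior) / (total + prior * |S|)."""
        count = self.counts.get((s, a, s2), 0.0)
        total = self.totals.get((s, a), 0.0)
        return (count + self.prior) / (total + self.prior * len(self.states))


# Usage with four hypothetical states and one action.
model = TransitionModel(["s0", "s1", "s2", "s3"])
for nxt in ["s1", "s1", "s1", "s2"]:
    model.observe("s0", "north", nxt)
```

The prior keeps unseen transitions from getting probability zero, so a short observation window does not make the model overconfident. With no observations at all, every next state is equally likely.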
16
Platform: Dodgeball
Dodgeball
  ~16 KLOC of C++
  ~2 KLOC of AI code
Graphics using OpenGL
AI is modular
  Swap out the FSM for MDP-based AI
17
Example Seeking a ball via FSM
18
Specific Experiments
MDP Steering
MDP Reasoning, FSM Steering
MDP/FSM Steering
19
MDP Steering
Instead of high-level reasoning, the MDP does the grunt work
Every time step, the MDP returns an action
Pros:
  More agent autonomy in comparison
  Learned how to dodge balls
Cons:
  Gets stuck in between states
  Rigid movement due to the restricted action set
20
MDP with FSM Steering
MDP makes a plan (goal state)
  Similar to what was done in the steering experiment
FSM carries out the plan (go to goal state)
  Doesn’t go directly to the goal state
  Can deviate from the plan
Pros: Smoother than MDP steering
Cons: Less autonomy; the FSM does most of the work
21
MDP/FSM Hybrid Steering
Use both FSM (5-10%) and MDP (90-95%) steering
Pros:
  Smoother than MDP steering
  More autonomy than the MDP with FSM steering
  Learned how to dodge balls
Cons:
  Still uses the FSM
  Still gets stuck in between states
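One possible reading of the 5-10% / 90-95% split is a per-time-step random choice between the two controllers, sketched below. The slides do not say how the split was implemented, so the random mixing, the function name `hybrid_action`, and its parameters are all assumptions.

```python
import random

FSM_SHARE = 0.1  # assumed: upper end of the 5-10% range on the slide


def hybrid_action(state, mdp_policy, fsm_action, fsm_share=FSM_SHARE, rng=random):
    """Pick this time step's action from the FSM or from the MDP policy.

    mdp_policy is a dict {state: action}; fsm_action is a callable taking
    the state. With probability fsm_share the FSM steers, otherwise the MDP.
    """
    if rng.random() < fsm_share:
        return fsm_action(state)
    return mdp_policy[state]
```

Passing a seeded `random.Random` instance as `rng` makes runs repeatable, which helps when comparing this hybrid against the pure MDP and pure FSM controllers.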
22
Extensions I
Learn more; more autonomy
  States: waypoint learning, neural gas
  Rewards: apprenticeship and inverse RL
  Actions: hierarchical action learning
Take full advantage of the updateable model
  Reevaluate the policy
23
Extensions II
Apply to more standardized platforms
  Quake II via Matlab/Java connection through QASE
  TIELT game/simulation environment
Alternative value iteration algorithms
  “real time value iteration” (RTDP)
  Offline value iteration
24
Questions? Comments?