On-Line Markov Decision Processes for Learning Movement in Video Games

1 On-Line Markov Decision Processes for Learning Movement in Video Games
Aaron Arvey

2 Goals
Induce human player (HP) movement strategy for the non-player character (NPC)
Learn in real time so that the HP strategy can be determined and mimicked
Use a reinforcement learning approach
Compare results with a (very) primitive FSM

3 HP Movement
Every HP has an individual style and movement patterns
Best strategy for the NPC is to use:
“dumb” HP → FSM
“smart” HP → learn from the HP
If you can’t beat ‘em, join ‘em!

4 Mimicking HP Movement I
How does the HP transition?
How does the HP react to the NPC?
Did the HP make the right move?
How long should we observe before mimicking actions already seen?

5 Mimicking HP Movement II
Use the FSM at the start
Observe the HP and record reactions
Once we have accumulated enough observations, determine the optimal policy
Assumptions:
Game length is sufficient for learning
All actions are reactions

6 Methods
Reinforcement learning: rewards, states, actions
Probabilistic reinforcement learning: add in a probabilistic transition model
Markov Decision Processes

7 Rewards
Experimentally determined (subjectively)
Represented as a function that looks at:
Seeking the closest “dead” balls
Dodging the closest “live” balls
Maintaining distance from the HP
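
As a concrete illustration, a hand-tuned reward of this shape might look like the sketch below; the Vec2 type, the weights, and the distance helper are assumptions made for illustration, not the original implementation.

```cpp
#include <cmath>

struct Vec2 { float x, y; };

static float dist(const Vec2& a, const Vec2& b) {
    return std::sqrt((a.x - b.x) * (a.x - b.x) + (a.y - b.y) * (a.y - b.y));
}

// Higher reward for closing in on the nearest dead ball, staying away from
// the nearest live ball, and keeping some distance from the human player.
float reward(const Vec2& npc, const Vec2& hp,
             const Vec2& nearestDead, const Vec2& nearestLive) {
    const float wSeek = 1.0f, wDodge = 1.5f, wSpace = 0.5f;  // assumed weights
    return -wSeek  * dist(npc, nearestDead)   // seek ammunition
           + wDodge * dist(npc, nearestLive)  // dodge incoming balls
           + wSpace * dist(npc, hp);          // keep distance from the HP
}
```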

8 States
Discretize the world into a grid
State space includes:
HP location
NPC location
Closest live and dead balls
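
A minimal sketch of how such a discretized state could be represented; the grid resolution, the Cell/GameState names, and the toCell helper are assumptions.

```cpp
// Hypothetical discretized state: world positions snapped to grid cells.
struct Cell { int row, col; };

struct GameState {
    Cell hp;           // human player's grid cell
    Cell npc;          // non-player character's grid cell
    Cell nearestLive;  // closest live (in-flight) ball
    Cell nearestDead;  // closest dead (pick-up-able) ball
};

// Snap a continuous world position to a cell of an assumed 16x16 grid.
Cell toCell(float x, float y, float worldSize, int gridSize = 16) {
    const float cellSize = worldSize / gridSize;
    return Cell{ static_cast<int>(y / cellSize), static_cast<int>(x / cellSize) };
}
```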

9 Actions
Very simplistic approach
Actions are “path” (strategy) oriented
The NPC can plan to move in the four cardinal directions
Actions are chosen from a policy determined by a Markov Decision Process
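
A small sketch of the four-direction action set and a deterministic policy lookup; indexing states by a flat integer id is an assumption for illustration.

```cpp
#include <vector>

// The four cardinal "plan" directions described on this slide.
enum class Action { North, South, East, West };

// A policy maps each enumerated state to an action; its contents come from the MDP solver.
using Policy = std::vector<Action>;

Action chooseAction(const Policy& policy, int stateId) {
    return policy[stateId];  // deterministic policy lookup
}
```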

10 Markov Decision Processes (MDP)
Actions, states, rewards, discount factor, and probability model T
The discount factor weights immediate against future rewards
T describes the probability of moving from state s to state s’ when action a is performed
Produces a policy from which to choose actions
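
In the usual notation, the ingredients listed on this slide form the tuple

```latex
\langle S, A, T, R, \gamma \rangle, \qquad
T(s, a, s') = P\big(s_{t+1} = s' \mid s_t = s,\ a_t = a\big), \qquad
\gamma \in [0, 1).
```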

11 Policy
A policy is a mapping from states to actions
An optimal policy is the policy that maximizes the value of every state
The value of a state is determined by the potential rewards that could be received from being in that state
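
Written out in standard form, a policy, the value it induces, and the optimal policy are:

```latex
\pi : S \to A, \qquad
V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R\big(s_t, \pi(s_t)\big) \,\middle|\, s_0 = s\right], \qquad
\pi^{*} = \arg\max_{\pi} V^{\pi}(s) \ \text{for every } s.
```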

12 Value Iteration
Determine the approximate expected value of every state
An optimal policy can then be derived
The algorithm is formulated as a dynamic programming problem with an infinite time horizon
Update the expected values of states iteratively; halt when “close enough”
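
The iterative update and the “close enough” test, in standard form:

```latex
V_{k+1}(s) \leftarrow \max_{a}\Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s')\, V_{k}(s') \Big],
\qquad
\text{halt when } \max_{s} \big| V_{k+1}(s) - V_{k}(s) \big| < \epsilon.
```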

13 Value Iteration Algorithm
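
A minimal sketch of tabular value iteration matching the title of this slide; the data layout, the in-place sweeps, and the epsilon-based stopping test are assumptions rather than the original code.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Tabular value iteration over S enumerated states and A actions.
// R[s][a] is the immediate reward; T[s][a][s2] is the transition probability.
std::vector<int> valueIteration(const std::vector<std::vector<double>>& R,
                                const std::vector<std::vector<std::vector<double>>>& T,
                                double gamma, double epsilon) {
    const int S = static_cast<int>(R.size());
    const int A = static_cast<int>(R[0].size());
    std::vector<double> V(S, 0.0);
    std::vector<int> policy(S, 0);

    double delta = 0.0;
    do {
        delta = 0.0;
        for (int s = 0; s < S; ++s) {
            double best = -1e300;
            int bestAction = 0;
            for (int a = 0; a < A; ++a) {
                double q = R[s][a];                      // Bellman backup for (s, a)
                for (int s2 = 0; s2 < S; ++s2)
                    q += gamma * T[s][a][s2] * V[s2];
                if (q > best) { best = q; bestAction = a; }
            }
            delta = std::max(delta, std::fabs(best - V[s]));
            V[s] = best;
            policy[s] = bestAction;
        }
    } while (delta > epsilon);  // halt when "close enough"

    return policy;  // greedy policy with respect to the converged values
}
```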

14 Method for Mimicking HP I
Use the MDP to determine the optimal policy
Possible actions, states, discounts, and rewards are hard coded (discount = 0.8)
The transition model is the only element that we must determine during play time
Utilize online methods for solving MDPs once we have a transition model

15 Method for Mimicking HP II
Determining T:
Use the FSM to start the game
Observe how the HP “reacts” to the NPC
Assume all actions are based on a reactive paradigm
Once we have a frequency matrix, we need to adjust for observation bias
Use a Laplacian prior
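
A sketch of turning the observed frequency matrix into T with an add-one (Laplacian) prior, so unobserved transitions keep non-zero probability; the counts layout is an assumption.

```cpp
#include <cstddef>
#include <vector>

// counts[s][a][s2]: how often the world moved from state s to state s2 after
// action a, as observed while the FSM controls the NPC.
std::vector<std::vector<std::vector<double>>>
estimateTransitions(const std::vector<std::vector<std::vector<int>>>& counts) {
    std::vector<std::vector<std::vector<double>>> T(counts.size());
    for (std::size_t s = 0; s < counts.size(); ++s) {
        T[s].resize(counts[s].size());
        for (std::size_t a = 0; a < counts[s].size(); ++a) {
            const std::vector<int>& row = counts[s][a];
            double total = 0.0;
            for (int c : row) total += c + 1;            // add-one (Laplacian) prior
            T[s][a].resize(row.size());
            for (std::size_t s2 = 0; s2 < row.size(); ++s2)
                T[s][a][s2] = (row[s2] + 1) / total;     // smoothed probability
        }
    }
    return T;
}
```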

16 Platform: Dodgeball
Dodgeball: ~16 KLOC of C++, ~2 KLOC of AI code
Graphics using OpenGL
AI is modular: swap out the FSM for the MDP-based AI
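
A sketch of what the modular AI might look like: one controller interface that both the FSM and the MDP-based AI implement, so one can be swapped for the other. The class and method names are assumptions, not the actual Dodgeball code.

```cpp
struct GameState;  // discretized world state, as on the States slide

enum class Action { North, South, East, West };

class AIController {
public:
    virtual ~AIController() = default;
    virtual Action decide(const GameState& state) = 0;
};

class FSMController : public AIController {
public:
    // Hand-written rules would go here; this sketch returns a fixed action.
    Action decide(const GameState&) override { return Action::North; }
};

class MDPController : public AIController {
public:
    // Would look up the action from the policy computed by value iteration;
    // this sketch also returns a fixed action.
    Action decide(const GameState&) override { return Action::North; }
};
```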

17 Example: Seeking a ball via FSM
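
As a rough, assumed illustration of FSM-style ball seeking (the state names and pickup threshold below are not from the original):

```cpp
#include <cmath>

// Hypothetical two-state FSM for seeking the closest dead ball.
enum class SeekState { MoveTowardBall, PickUp };

struct Vec2 { float x, y; };

SeekState stepSeekFSM(SeekState current, const Vec2& npc, const Vec2& ball) {
    const float pickupRange = 1.0f;  // assumed threshold
    const float d = std::hypot(ball.x - npc.x, ball.y - npc.y);
    switch (current) {
        case SeekState::MoveTowardBall:
            return d < pickupRange ? SeekState::PickUp : SeekState::MoveTowardBall;
        case SeekState::PickUp:
            return SeekState::MoveTowardBall;  // after pickup, seek the next ball
    }
    return current;
}
```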

18 Specific Experiments
MDP Steering
MDP Reasoning, FSM Steering
MDP/FSM Hybrid Steering

19 MDP Steering
Instead of doing high-level reasoning, the MDP does the grunt work
Every time step, the MDP returns an action
Pros:
More agent autonomy in comparison to the other approaches
Learned how to dodge balls
Cons:
Gets stuck in between states
Rigid movement due to the restricted action set

20 MDP with FSM Steering
The MDP makes a plan (a goal state), similar to what was done in the steering experiment
The FSM carries out the plan (go to the goal state)
The FSM doesn’t go directly to the goal state and can deviate from the plan
Pros:
Smoother than MDP steering
Cons:
Less autonomy; the FSM does most of the work

21 MDP/FSM Hybrid Steering
Use both FSM (5-10%) and MDP (90-95%) steering
Pros:
Smoother than MDP steering
More autonomy than MDP with FSM steering
Learned how to dodge balls
Cons:
Still uses the FSM
Still gets stuck in between states
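
One plausible reading of the 5-10% / 90-95% split is a per-decision random mix of the two controllers, sketched below; the interface and the coin flip are assumptions.

```cpp
#include <random>

struct GameState;

enum class Action { North, South, East, West };

class AIController {
public:
    virtual ~AIController() = default;
    virtual Action decide(const GameState& state) = 0;
};

// Picks the FSM's action a small fraction of the time (assumed 5-10%)
// and the MDP policy's action otherwise.
class HybridController : public AIController {
public:
    HybridController(AIController& fsm, AIController& mdp, double fsmShare = 0.05)
        : fsm_(fsm), mdp_(mdp), fsmShare_(fsmShare), rng_(std::random_device{}()) {}

    Action decide(const GameState& state) override {
        std::uniform_real_distribution<double> coin(0.0, 1.0);
        return coin(rng_) < fsmShare_ ? fsm_.decide(state) : mdp_.decide(state);
    }

private:
    AIController& fsm_;
    AIController& mdp_;
    double fsmShare_;
    std::mt19937 rng_;
};
```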

22 Extensions I
Learn more; gain more autonomy
States: waypoint learning, neural gas
Rewards: apprenticeship learning and inverse RL
Actions: hierarchical action learning
Take full advantage of the updateable model: reevaluate the policy

23 Extensions II
Apply to more standardized platforms:
Quake II via a Matlab/Java connection through QASE
TIELT game/simulation environment
Alternative value iteration algorithms:
“Real-time value iteration” (RTDP)
Offline value iteration

24 Questions? Comments?

