AI in Game Programming, IT University of Copenhagen: Reinforcement Learning [Outro], Marco Loog.

Presentation transcript:

1 AI in Game Programming, IT University of Copenhagen
Reinforcement Learning [Outro]
Marco Loog

2 Rationale
- How can an agent learn if there is no teacher around to tell it, for every action, what is right and what is wrong?
- E.g., an agent can learn how to play chess by supervised learning, given that examples of states and their correct actions are provided
- But what if these examples are not available?

3 Rationale
- But what if these examples are not available?
- Through random moves, i.e., exploratory behavior, the agent may be able to infer knowledge about the environment it is in
- But what is good and what is bad? This is exactly the knowledge needed to decide what to do in order to reach the goal

4 Rationale
- But what is good and what is bad? This is exactly the knowledge needed to decide what to do in order to reach the goal
- 'Rewarding' the agent when it did something good and 'punishing' it when it did something bad is called reinforcement
- The task of reinforcement learning is to use observed rewards to learn a [best] policy for the environment

5 Reinforcement Learning
- Use observed rewards to learn an [almost?] optimal policy for an environment
- The reward R(s) assigns a number to every state s
- The utility of an environment history is [as an example] the sum of the rewards received
- A policy describes the agent's action from any state s in order to reach the goal
- The optimal policy is the policy with the highest expected utility

6 Reinforcement Learning
- Might be considered to encompass all of AI: an agent is dropped off somewhere and has to figure everything out by itself
- We will concentrate on simple settings and agent designs to keep things manageable
- E.g., a fully observable environment

7 Typically in Games
- Offline / during development: episodic reinforcement learning
  - Multiple training instances / several runs from start to end
- Online / during actual game playing: incremental reinforcement learning
  - One continuous sequence of states / possibly without a clear 'end'

8 3 Agent Designs
- Utility-based agent: learns a utility function, based on which it chooses actions
- Q-learning agent: learns an action-value function giving the expected utility of taking a given action in a given state
- Reflex agent: learns a policy that maps directly from states to actions

9 Passive Reinforcement
- The policy is fixed: state s always leads to the same action
- The goal is simply to learn how good this policy is
- [Of course, this can be extended 'easily' to policy learning...]

10 Direct Utility Estimation
- Idea: the utility of a state is the expected total reward from that state onward
- Each trial provides a sample of this value for each state visited
- After a trial, the utility of each observed state is simply updated using a running average
- In the limit, the sample average converges to the true expectation
- Direct utility estimation = standard supervised inductive learning
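The running-average idea on this slide can be sketched as follows; a minimal Python sketch, assuming each trial is a list of (state, reward) pairs and that the return is the plain (undiscounted, gamma = 1) sum of remaining rewards. The state names and reward values below are invented for illustration.

```python
from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    """Average the sampled returns observed for each state.

    trials: list of trials, each a list of (state, reward) pairs
    from the start of an episode to its end.
    """
    totals = defaultdict(float)  # sum of sampled returns per state
    counts = defaultdict(int)    # number of samples per state
    for trial in trials:
        for i, (state, _) in enumerate(trial):
            # Sampled return = discounted sum of rewards from step i onward.
            ret = sum(gamma ** k * r for k, (_, r) in enumerate(trial[i:]))
            totals[state] += ret
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Two invented trials in a toy world with a -0.04 step cost.
trials = [
    [("A", -0.04), ("B", -0.04), ("GOAL", 1.0)],
    [("A", -0.04), ("GOAL", 1.0)],
]
U = direct_utility_estimation(trials)
```

State A is visited in both trials with sampled returns 0.92 and 0.96, so its estimate is their average, 0.94; more trials simply refine these averages.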

11 More Direct Utility Estimation
- The 'reduction' of the problem to 'standard learning' is nice [of course]
- However, an important source of information is not used: the utilities of states are not independent
- The utility of each state is its own reward + the expected utility of its successor states
  - Bellman equations
- Using this prior knowledge can improve [e.g., speed up] learning considerably
  - As is generally the case

12 Adaptive Dynamic Programming
- Take the constraints between states into account
- A passive learning agent learns based on the observed rewards and a transition model
- The latter models the probability of reaching state s' from state s when performing action a(s)
- Two possibilities:
  - Solve the system of linear equations [for small systems]
  - Update iteratively
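The 'update iteratively' option amounts to repeatedly applying the Bellman equation U(s) = R(s) + gamma * sum over s' of P(s'|s, a(s)) * U(s') under the fixed policy; a minimal sketch, where the two-state chain, gamma, and iteration count are all invented for illustration.

```python
def policy_evaluation(R, P, gamma=0.9, iters=100):
    """Iteratively evaluate a fixed policy.

    R: dict mapping state -> reward.
    P: dict mapping state -> list of (prob, next_state) pairs
       for the transition taken under the fixed policy.
    """
    U = {s: 0.0 for s in R}
    for _ in range(iters):
        # One synchronous sweep of the Bellman update over all states.
        U = {s: R[s] + gamma * sum(p * U[s2] for p, s2 in P[s]) for s in R}
    return U

# Invented toy chain: A moves deterministically to B; B absorbs.
R = {"A": 0.0, "B": 1.0}
P = {"A": [(1.0, "B")], "B": [(1.0, "B")]}
U = policy_evaluation(R, P)
```

For this chain the updates converge toward U(B) = 1 / (1 - 0.9) = 10 and U(A) = 0.9 * U(B) = 9, illustrating how the constraints between states propagate value backward.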

13 Temporal Difference
- Take the constraints between states into account
- Idea: use observed transitions to adjust the utility values of observed states so that they agree [better] with the constraints
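This adjustment is usually written as the TD update U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s)), which nudges U(s) toward the one-step target; a minimal sketch, with invented values for alpha, gamma, and the utility table.

```python
def td_update(U, s, r, s2, alpha=0.1, gamma=0.9):
    """One temporal-difference step after observing transition s -> s'
    with reward r: move U(s) a fraction alpha toward r + gamma * U(s')."""
    U[s] = U[s] + alpha * (r + gamma * U[s2] - U[s])

# Invented utilities; observe A -> B with reward 0.
U = {"A": 0.0, "B": 0.5}
td_update(U, "A", 0.0, "B")
```

Here the target is 0 + 0.9 * 0.5 = 0.45, so U(A) moves a tenth of the way toward it, to 0.045; no transition probabilities appear in the update, only the transition that was actually observed.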

14 Active Reinforcement
- A passive learning agent has a fixed policy...
- An active agent must decide [learn] what action to take, i.e., it should find the optimal policy
- The agent should make a trade-off between exploitation and exploration

15 Exploitation & Exploration
- Exploitation: use the best action [at that time] in order to obtain the highest reward
- Exploration: attempt to reach all possible states by trying all possible actions [resulting in experience from which can be learned]

16 Exploitation & Exploration
- An agent relying completely on exploitation is called greedy and is often very suboptimal
- The trade-off between the greed and the curiosity of the agent is controlled by an exploration function
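One of the simplest exploration functions is epsilon-greedy: explore with a random action with probability epsilon, exploit the current best-known action otherwise. A minimal sketch; the lecture's actual exploration function may differ, and the Q-table entries below are invented.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon take a random action (exploration);
    otherwise take the action with the highest current Q-value
    (exploitation). Unseen (action, state) pairs default to 0."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((a, state), 0.0))

# Invented Q-values; with epsilon=0 the agent is purely greedy.
Q = {("left", "s0"): 0.2, ("right", "s0"): 0.7}
a = epsilon_greedy(Q, "s0", ["left", "right"], epsilon=0.0)
```

Setting epsilon to 0 gives the fully greedy agent the slide warns about; decaying epsilon over time is a common way to shift from curiosity to greed as the estimates improve.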

17 Learning Action-Value
- Temporal difference learning can also be used for active reinforcement learning
- An action-value function gives the expected utility of taking a given action in a given state
- Q-learning is an alternative to temporal difference learning that learns an action-value function Q(a,s) instead of utilities
- The important difference is that the former is 'model-free': no transition model has to be learned, nor the actual utilities
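The Q-learning update has the same shape as the TD update but works on Q(a,s) directly: Q(a,s) <- Q(a,s) + alpha * (r + gamma * max over a' of Q(a',s') - Q(a,s)). A minimal sketch of one such step; the state and action names, alpha, and gamma are invented for illustration.

```python
def q_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    """One model-free Q-learning step: only the observed experience
    (s, a, r, s') is used, never a transition model."""
    best_next = max(Q.get((a2, s2), 0.0) for a2 in actions)
    old = Q.get((a, s), 0.0)
    Q[(a, s)] = old + alpha * (r + gamma * best_next - old)

# Start from an empty table and observe one rewarding transition.
Q = {}
q_update(Q, "s0", "go", 1.0, "s1", ["go", "stay"])
```

Because the target uses the max over next actions rather than the policy's own action, the agent can keep exploring while still learning about the greedy policy, which is exactly what makes Q-learning model-free and off-policy.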

18 Of Course: Generalization
- For large state spaces, exact inference of the utility and/or Q-function as a table becomes unrealistic
- Function approximation is needed, i.e., a representation that is not in tabular form
- This makes it possible to represent utility functions for very large state spaces
- More importantly, it allows for generalization
- All this relates, of course, to decision trees, MAP, regression, density estimation, ML, hypothesis spaces, etc.
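The simplest such approximator is a linear one, U(s) ~ theta . phi(s), where the TD error updates the weight vector theta instead of a table entry; a minimal sketch, with the feature vectors, alpha, and gamma all invented for illustration.

```python
def u_hat(theta, features):
    """Linear utility approximation: dot product of weights and features."""
    return sum(t * f for t, f in zip(theta, features))

def td_weight_update(theta, phi_s, r, phi_s2, alpha=0.1, gamma=0.9):
    """Gradient-style TD step: the same TD error as in the tabular case,
    but distributed over the weights in proportion to the features of s."""
    error = r + gamma * u_hat(theta, phi_s2) - u_hat(theta, phi_s)
    return [t + alpha * error * f for t, f in zip(theta, phi_s)]

# Two invented features per state; one observed transition with reward 1.
theta = [0.0, 0.0]
theta = td_weight_update(theta, [1.0, 2.0], 1.0, [0.0, 0.0])
```

Because states with similar features now share weights, updating on one state changes the estimate for all similar states, which is the generalization the slide refers to.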

19 E.g., Inverted Pendulum
- MPI Magdeburg, Germany

20 ...and Triple Inverted
- MPI Magdeburg, Germany

21 E.g., Lee05a.pdf

22 Finally... a Summary
- Reinforcement learning enables agents to become skilled in an unknown environment based only on percepts and occasional rewards
- 3 approaches:
  - Direct utility estimation: treats observations as independent
  - Adaptive dynamic programming: learns a model + reward function and uses these to determine utilities or an optimal policy
  - Temporal difference: adjusts utility values so that they agree with the constraints

23 More Summary...
- The trade-off between exploitation and exploration is important
- Large state spaces call for approximate methods, giving rise to function learning, regression, etc.
- Reinforcement learning is one of the most active areas of machine learning research, because of its potential for eliminating the hand-coding of control strategies...

24 Next Week
- Guest lecturer Peter Andreasen on... I don't know yet
- Place: Auditorium 1
- Start: ±0900
- [Next next week: final lecture, probably including an hour's lecture on NERO, some words on the course evaluation, and the possibility to ask questions...]


