Presentation transcript: "Introduction to Reinforcement Learning", Dr Kathryn Merrick, 2008 Spring School on Optimisation, Learning and Complexity

1 Introduction to Reinforcement Learning
Dr Kathryn Merrick, k.merrick@adfa.edu.au
2008 Spring School on Optimisation, Learning and Complexity
Friday 7th November, 15:30-17:00

2 Reinforcement Learning is… learning from trial-and-error and reward, by interaction with an environment.

3 Today’s Lecture
A formal framework: Markov Decision Processes
Optimality criteria
Value functions
Solution methods: Q-learning
Examples and exercises
Alternative models
Summary and applications

4 Markov Decision Processes
The reinforcement learning problem can be represented as:
A set S of states {s1, s2, s3, …}
A set A of actions {a1, a2, a3, …}
A transition function T: S x A → S (deterministic) or T: S x A x S → [0, 1] (stochastic)
A reward function R: S x A → Real or R: S x A x S → Real
A policy π: S → A (deterministic) or π: S x A → [0, 1] (stochastic)
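The components above can be written down directly as data structures. A minimal sketch in Python, using a made-up two-state deterministic MDP (the state and action names, transitions and rewards here are illustrative only, not taken from the slides):

```python
# A toy deterministic MDP as plain Python data structures (illustrative values).
states = ["s1", "s2"]
actions = ["a1", "a2"]

# Deterministic transition function T: S x A -> S
T = {
    ("s1", "a1"): "s2",
    ("s1", "a2"): "s1",
    ("s2", "a1"): "s1",
    ("s2", "a2"): "s2",
}

# Reward function R: S x A -> Real
R = {
    ("s1", "a1"): 1.0,
    ("s1", "a2"): 0.0,
    ("s2", "a1"): 0.0,
    ("s2", "a2"): 0.0,
}

# A deterministic policy pi: S -> A
policy = {"s1": "a1", "s2": "a1"}
```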

5 Optimality Criteria
Suppose an agent receives a reward r_t at time t. Then optimal behaviour might:
Maximise the sum of expected future reward: E[ Σ_t r_t ]
Maximise over a finite horizon: E[ Σ_{t=0..h} r_t ]
Maximise over an infinite horizon: E[ Σ_{t=0..∞} r_t ]
Maximise over a discounted infinite horizon: E[ Σ_{t=0..∞} γ^t r_t ], with 0 ≤ γ < 1
Maximise average reward: lim_{h→∞} (1/h) E[ Σ_{t=0..h} r_t ]
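For concreteness, the discounted criterion can be checked by hand on a short reward sequence; the rewards and γ = 0.9 below are assumptions chosen just for the arithmetic:

```python
# Discounted return for a made-up reward sequence r_0..r_3 with gamma = 0.9.
rewards = [1.0, 0.0, 1.0, 1.0]
gamma = 0.9
discounted_return = sum(gamma ** t * r for t, r in enumerate(rewards))
print(discounted_return)  # 1.0 + 0.0 + 0.81 + 0.729 ≈ 2.539
```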

6 Value Functions
State value function Vπ: S → Real, written Vπ(s): the expected sum of discounted reward for following the policy π from state s to the end of time.
State-action value function Qπ: S x A → Real, written Qπ(s, a): the expected sum of discounted reward for starting in state s, taking action a once, then following the policy π from state s’ to the end of time.
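Written out in conventional notation (the slide gives only the verbal definitions), and assuming the discounted infinite-horizon criterion from the previous slide, these correspond to:

```latex
V^{\pi}(s)    = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s \right]
Q^{\pi}(s, a) = E_{\pi}\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k} \,\middle|\, s_t = s,\ a_t = a \right]
```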

7 Optimal State Value Function
V*(s) = max_a E{ R(s, a, s’) + γ V*(s’) | s, a } = max_a Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]
A Bellman equation
Can be solved using dynamic programming
Requires knowledge of the transition function T
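A minimal sketch of the dynamic-programming solution (value iteration), assuming the transition function is stored as T[(s, a)] = {s’: probability} and rewards as R[(s, a, s’)]; these container layouts are my assumption, not something fixed by the slides:

```python
def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
    # Repeatedly apply the Bellman optimality backup
    #   V(s) <- max_a sum_s' T(s,a,s') [ R(s,a,s') + gamma V(s') ]
    # until the largest change falls below the threshold theta.
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            return V
```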

8 Optimal State-Action Value Function
Q*(s, a) = E{ R(s, a, s’) + γ max_{a’} Q*(s’, a’) | s, a } = Σ_{s’} T(s, a, s’) [ R(s, a, s’) + γ max_{a’} Q*(s’, a’) ]
Also a Bellman equation
Also requires knowledge of the transition function T to solve using dynamic programming
Can now define action selection: π*(s) = argmax_a Q*(s, a)
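Once Q* is known (or estimated), the greedy policy can be read straight off the table; a small sketch assuming Q is stored as a dictionary keyed by (state, action):

```python
def greedy_policy(Q, states, actions):
    # pi*(s) = argmax_a Q*(s, a), read from a tabular Q.
    return {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
```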

9 A Possible Application…

10 Solution Methods
Model based:
– For example dynamic programming
– Require a model (transition function) of the environment for learning
Model free:
– Learn from interaction with the environment without requiring a model
– For example Q-learning…

11 Q-Learning by Example: Driving in Canberra
[State-transition diagram: four states (Parked Clean, Driving Clean, Parked Dirty, Driving Dirty) linked by Drive, Park and Clean actions]

12 Formulating the Problem
States: s1 Park clean, s2 Park dirty, s3 Drive clean, s4 Drive dirty
Actions: a1 Drive, a2 Clean, a3 Park
Reward: r_t = 1 for transitions to a ‘clean’ state, 0 otherwise
State-Action Table (Q-Table):
     a1  a2  a3
s1    ?   ?   ?
s2    ?   ?   ?
s3    ?   ?   ?
s4    ?   ?   ?
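One way the slide’s Q-Table and reward could be set up in code; the string names and the numpy layout are my choices, and the reward test simply follows the ‘clean state’ rule above:

```python
import numpy as np

# States s1..s4 and actions a1..a3 as numbered on the slide.
states = ["park_clean", "park_dirty", "drive_clean", "drive_dirty"]
actions = ["drive", "clean", "park"]

# Q-Table with every '?' entry initialised to zero.
Q = np.zeros((len(states), len(actions)))

def reward(next_state):
    # r = 1 for any transition into a 'clean' state, 0 otherwise.
    return 1.0 if next_state.endswith("clean") else 0.0
```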

13 A Q-Learning Agent
[Diagram: the agent receives state s_t and reward r_t from the environment, performs a learning update to π_t, selects action a_t from π_t, and acts back on the environment]

14 Q-Learning Algorithmic Components
Learning update (to Q-Table):
Q(s, a) ← (1 - α) Q(s, a) + α [ r + γ max_{a’} Q(s’, a’) ]
or
Q(s, a) ← Q(s, a) + α [ r + γ max_{a’} Q(s’, a’) - Q(s, a) ]
Action selection (from Q-Table): a = f(Q(s, a))
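A sketch of both components, assuming a tabular Q indexed by integer state and action indices; ε-greedy is used here as one common choice for the selection function f (the slide leaves f unspecified), and the α, γ, ε values are placeholders:

```python
import random

def epsilon_greedy(Q, s, epsilon=0.1):
    # One possible f(Q(s, a)): random action with probability epsilon,
    # otherwise the current greedy action for state s.
    n_actions = len(Q[s])
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[s][a])

def q_update(Q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    # Q(s,a) <- Q(s,a) + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ]
    Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
```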

15 Matlab Code Available on Request

16 Exercise You need to program a small robot to learn to find food. What assumptions will you make about the robot’s sensors and actuators to represent the environment? How could you model the problem as an MDP? Calculate a few learning iterations in your domain by hand.

17 Alternatives
Function approximation of the Q-table:
– Neural networks
– Decision trees
– Gradient descent methods
Reinforcement learning variants:
– Relational reinforcement learning
– Hierarchical reinforcement learning
– Intrinsically motivated reinforcement learning

18 A final application…

19 References and Further Reading
Sutton, R., Barto, A. (2000) Reinforcement Learning: An Introduction, The MIT Press. http://www.cs.ualberta.ca/~sutton/book/the-book.html
Kaelbling, L., Littman, M., Moore, A. (1996) Reinforcement Learning: A Survey, Journal of Artificial Intelligence Research, 4:237-285
Barto, A., Mahadevan, S. (2003) Recent Advances in Hierarchical Reinforcement Learning, Discrete Event Dynamic Systems: Theory and Applications, 13(4):41-77

