1 Introduction to Reinforcement Learning Freek Stulp

2 Overview
General principles of RL
Markov Decision Process as model
Values of states: V(s)
Values of state-actions: Q(a,s)
Exploration vs. Exploitation
Issues in RL
Conclusion

3 General principles of RL
Neural networks are supervised learning algorithms: for each input, we know the output.
What if we don't know the output for each input? Example: a flight control system.
Let the agent learn how to achieve certain goals itself, through interaction with the environment.

4 General principles of RL
Let the agent learn how to achieve certain goals itself, through interaction with the environment.
This alone does not solve the problem! Rewards are used to specify the goals (example: dogs).
[Figure: agent-environment loop: the agent sends an action to the environment and receives back a percept and a reward.]

5 Popular model: MDPs
Markov Decision Process = {S, A, R, T}
Set of states S
Set of actions A
Reward function R
Transition function T
Markov property: the transition probability $T_{ss'}$ depends only on the current state s and the next state s', not on the earlier history.
Policy: $\pi: S \rightarrow A$
Problem: find the policy $\pi$ that maximizes the discounted reward $r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots + \gamma^n r_n$.
[Figure: a trajectory $s_0 \xrightarrow{a_0,\, r_0} s_1 \xrightarrow{a_1,\, r_1} s_2 \xrightarrow{a_2,\, r_2} s_3$]
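As a concrete illustration of the tuple {S, A, R, T}, here is a minimal Python sketch of a tabular MDP. The tiny four-state chain, the action names, and the probabilities are invented for illustration; they are not from the slides.

```python
import random

# Minimal tabular MDP sketch (all names and numbers are illustrative).
states = ["s0", "s1", "s2", "s3"]                    # S
actions = ["left", "right"]                          # A
R = {"s0": 0.0, "s1": 0.0, "s2": 0.0, "s3": 1.0}     # reward function R(s)
T = {                                                # T[s][a] -> {s': probability}
    "s0": {"left": {"s0": 1.0}, "right": {"s1": 0.9, "s0": 0.1}},
    "s1": {"left": {"s0": 1.0}, "right": {"s2": 0.9, "s1": 0.1}},
    "s2": {"left": {"s1": 1.0}, "right": {"s3": 0.9, "s2": 0.1}},
    "s3": {"left": {"s3": 1.0}, "right": {"s3": 1.0}},   # "s3" acts as terminal
}

def step(s, a):
    """Sample the next state from T and return (s', R(s'))."""
    next_states = list(T[s][a].keys())
    probs = list(T[s][a].values())
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, R[s_next]

# A policy maps states to actions, e.g. always go right:
policy = {s: "right" for s in states}
```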

6 Values of states: $V^\pi(s)$
Definition of value $V^\pi(s)$: the cumulative reward obtained when starting in state s and executing policy $\pi$ until a terminal state is reached.
The optimal policy yields $V^*(s)$.
[Figure: grid-world maps of the rewards R, $V^\pi(s)$ under a random policy, and $V^*(s)$ under the optimal policy.]
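Using the discount factor $\gamma$ from slide 5, this verbal definition is usually written as an expected discounted sum of rewards (a standard textbook formulation, not spelled out on the slide):

$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t \ge 0} \gamma^{t}\, r_t \;\middle|\; s_0 = s,\; a_t = \pi(s_t)\right]$$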

7 Determining $V^\pi(s)$
Dynamic programming: $V(s) = R(s) + \gamma \sum_{s'} T_{ss'} V(s')$
  (minus: necessary to consider all states)
TD-learning: $V(s) \leftarrow V(s) + \alpha \, (R(s) + \gamma V(s') - V(s))$
  (plus: only visited states are used)
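The TD rule translates almost directly into code. The sketch below reuses the toy MDP from the slide-5 example (states, R, step, policy); the learning rate alpha, discount gamma, and episode count are illustrative choices, not values from the slides.

```python
# TD(0) value learning on the toy MDP above, following the fixed policy.
alpha, gamma = 0.1, 0.9
V = {s: 0.0 for s in states}

for episode in range(500):
    s = "s0"
    while s != "s3":                       # "s3" is the terminal state of the toy MDP
        s_next, _ = step(s, policy[s])
        # TD update: nudge V(s) toward the bootstrapped target R(s) + gamma * V(s')
        V[s] += alpha * (R[s] + gamma * V[s_next] - V[s])
        s = s_next
```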

8 Values of state-actions: Q(a,s)
Q-values Q(a,s): the value of doing action a in state s.
Dynamic programming: $Q(a,s) = R(s) + \gamma \sum_{s'} T_{ss'} \max_{a'} Q(a',s')$
TD-learning: $Q(a,s) \leftarrow Q(a,s) + \alpha \, (R(s) + \gamma \max_{a'} Q(a',s') - Q(a,s))$
T does not appear in the TD formula: model-free learning!
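A corresponding tabular Q-learning sketch, again continuing the toy MDP and the alpha/gamma values from above. Actions are chosen uniformly at random here purely to show the update rule; note that the transition function T is never consulted.

```python
# Tabular Q-learning on the toy MDP (illustrative sketch, not the slides' code).
Q = {(a, s): 0.0 for s in states for a in actions}

for episode in range(500):
    s = "s0"
    while s != "s3":
        a = random.choice(actions)                   # naive, purely random exploration
        s_next, _ = step(s, a)
        best_next = max(Q[(a2, s_next)] for a2 in actions)
        # Model-free update: only sampled transitions are used, never T itself.
        Q[(a, s)] += alpha * (R[s] + gamma * best_next - Q[(a, s)])
        s = s_next
```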

9 Exploration vs. Exploitation
Only exploitation: new (maybe better) paths are never discovered.
Only exploration: what is learned is never exploited.
Good trade-off: explore first to learn, exploit later to benefit.
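One standard way to realize such a trade-off, assuming the tabular Q from the previous sketch, is epsilon-greedy action selection (epsilon-greedy is not named on the slide; it is offered here only as a common example).

```python
import random

def choose_action(s, Q, actions, epsilon=0.1):
    """Epsilon-greedy selection: explore with probability epsilon, otherwise
    exploit the current Q-values. Decreasing epsilon over time implements
    'explore first, exploit later'."""
    if random.random() < epsilon:
        return random.choice(actions)                 # exploration
    return max(actions, key=lambda a: Q[(a, s)])      # exploitation
```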

10 Some issues
Hidden state: if you don't know where you are, you can't know what to do.
Curse of dimensionality: very large state spaces.
Continuous state/action spaces: the algorithms above use discrete tables; what about continuous values?
Many of your articles discuss solutions to these problems.

11 Conclusion
RL: learning through interaction and rewards.
Markov Decision Processes are a popular model.
Values of states: V(s)
Values of state-actions: Q(a,s) (model-free learning is possible!)
Still some problems... not quite ready for complex real-world problems yet, but research is underway!

12 Literature
Artificial Intelligence: A Modern Approach, Stuart Russell and Peter Norvig
Machine Learning, Tom M. Mitchell
Reinforcement Learning: A Tutorial, Mance E. Harmon and Stephanie S. Harmon