Between MDPs and Semi-MDPs: Learning, Planning and Representing Knowledge at Multiple Temporal Scales
Richard S. Sutton, Doina Precup (University of Massachusetts), Satinder Singh (University of Colorado)
with thanks to Andy Barto, Amy McGovern, Andrew Fagg, Ron Parr, Csaba Szepesvári

Related Work
"Classical" AI: Fikes, Hart & Nilsson (1972); Newell & Simon (1972); Sacerdoti (1974, 1977)
Macro-Operators: Korf (1985); Minton (1988); Iba (1989); Kibler & Ruby (1992)
Qualitative Reasoning: Kuipers (1979); de Kleer & Brown (1984); Dejong (1994); Laird et al. (1986); Drescher (1991); Levinson & Fuchs (1994); Say & Selahatin (1996); Brafman & Moshe (1997)
Robotics and Control Engineering: Brooks (1986); Maes (1991); Koza & Rice (1992); Brockett (1993); Grossman et al. (1993); Dorigo & Colombetti (1994); Asada et al. (1996); Uchibe et al. (1996); Huber & Grupen (1997); Kalmar et al. (1997); Mataric (1997); Sastry (1997); Toth et al. (1997)
Reinforcement Learning and MDP Planning: Mahadevan & Connell (1992); Singh (1992); Lin (1993); Dayan & Hinton (1993); Kaelbling (1993); Chrisman (1994); Bradtke & Duff (1995); Ring (1995); Sutton (1995); Thrun & Schwartz (1995); Boutilier et al. (1997); Dietterich (1997); Wiering & Schmidhuber (1997); Precup, Sutton & Singh (1997); McGovern & Sutton (1998); Parr & Russell (1998); Drummond (1998); Hauskrecht et al. (1998); Meuleau et al. (1998); Ryan and Pendrith (1998)

Abstraction in Learning and Planning
A long-standing, key problem in AI! How can we give abstract knowledge a clear semantics? (e.g. "I could go to the library") How can different levels of abstraction be related? Spatial: states; temporal: time scales. How can we handle stochastic, closed-loop, temporally extended courses of action? Use RL/MDPs to provide a theoretical foundation.

Outline
RL and Markov Decision Processes (MDPs); Options and Semi-MDPs; Rooms Example; Between MDPs and Semi-MDPs (Termination Improvement, Intra-option Learning, Subgoals)

RL is Learning from Interaction
[Diagram: the Agent selects actions; the Environment returns perceptions and rewards]
A complete agent, temporally situated, doing continual learning and planning; the objective is to affect the environment; the environment is stochastic and uncertain. RL is like Life!

More Formally: Markov Decision Problems (MDPs)
An MDP is defined by:
S - the set of states of the environment
A(s) - the set of actions possible in state s
P(s' | s, a) - the probability of transitioning from s to s' when action a is executed
R(s, a) - the expected reward when executing a in s
γ - the discount rate for expected reward
Assumption: discrete time t = 0, 1, 2, ...
Interaction generates a trajectory s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, ...
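To make the definition concrete, here is a minimal Python sketch of a finite MDP built from exactly these ingredients; the class and field names are illustrative assumptions, not part of the original presentation.

```python
import random


class MDP:
    """A finite MDP: states S, actions A(s), transition probabilities,
    expected rewards, and a discount rate gamma (illustrative sketch only)."""

    def __init__(self, states, actions, P, R, gamma=0.9):
        self.states = states    # S: the set of environment states
        self.actions = actions  # A(s): dict mapping a state to its available actions
        self.P = P              # P[(s, a)]: list of (next_state, probability) pairs
        self.R = R              # R[(s, a)]: expected immediate reward
        self.gamma = gamma      # discount rate for expected reward

    def step(self, s, a):
        """Sample one transition, returning (next_state, reward)."""
        next_states, probs = zip(*self.P[(s, a)])
        s_next = random.choices(next_states, weights=probs, k=1)[0]
        return s_next, self.R[(s, a)]
```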

The Objective
Find a policy (a way of acting) that gets a lot of reward in the long run. Policies are evaluated by the expected discounted return from each state; these are called value functions (cf. evaluation functions).
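For reference, a sketch of the standard definitions this slide alludes to (the equations did not survive the transcript), written with the discount rate γ from the MDP definition above:

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[ r_{t+1} + \gamma r_{t+2} + \gamma^{2} r_{t+3} + \cdots \,\middle|\, s_t = s \right],
\qquad
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a \right].
```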

Outline
RL and Markov Decision Processes (MDPs); Options and Semi-MDPs; Rooms Example; Between MDPs and Semi-MDPs (Termination Improvement, Intra-option Learning, Subgoals)

Options
A generalization of actions to include temporally extended courses of action. An option is specified by an initiation set I, an internal policy π, and a termination condition β. Options can take a variable number of steps, and option execution is assumed to be call-and-return.
Example: docking
I: all states in which the charger is in sight
π: a hand-crafted docking controller
β: terminate when docked or when the charger is not visible
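Below is a minimal sketch of an option as the triple ⟨I, π, β⟩ with call-and-return execution. It reuses the hypothetical environment interface from the MDP sketch above (env.step and env.gamma); the names are illustrative, not from the slides.

```python
import random


class Option:
    """An option <I, pi, beta>: initiation set I, internal policy pi,
    and termination condition beta (probability of stopping in each state)."""

    def __init__(self, initiation_set, policy, beta):
        self.initiation_set = initiation_set  # I: states where the option may be started
        self.policy = policy                  # pi: maps a state to a primitive action
        self.beta = beta                      # beta: maps a state to a termination probability

    def can_start(self, state):
        return state in self.initiation_set

    def execute(self, env, state):
        """Call-and-return execution: follow pi until beta says to terminate.
        Returns (terminal_state, discounted_reward, number_of_steps)."""
        total_reward, discount, steps = 0.0, 1.0, 0
        while True:
            state, reward = env.step(state, self.policy(state))
            total_reward += discount * reward
            discount *= env.gamma
            steps += 1
            if random.random() < self.beta(state):
                return state, total_reward, steps
```

For the docking example, initiation_set would be the states where the charger is in sight, policy the hand-crafted controller, and beta a function returning 1 when docked or when the charger is not visible and 0 otherwise.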

Rooms Example
[Gridworld diagram: four rooms connected by four hallways, with hallway options O1 and O2 shown and a goal location G]
4 rooms, 4 hallways, 8 multi-step options (one to each of a room's 2 hallways). 4 unreliable primitive actions (up, down, left, right) that fail 33% of the time. All rewards are zero; goal states are given a terminal value of 1; γ = 0.9. Task: given a goal location, quickly plan the shortest route.

Options define a Semi-Markov Decision Process (SMDP)
MDP: discrete time, homogeneous discount.
SMDP: continuous time, discrete events, interval-dependent discount.
Options over an MDP: discrete time, overlaid discrete events, interval-dependent discount.
A discrete-time SMDP overlaid on an MDP; it can be analyzed at either level.

MDP + Options = SMDP
Theorem: For any MDP and any set of options, the decision process that chooses among the options, executing each to termination, is an SMDP.
Thus all Bellman equations and DP results extend to value functions over options and to models of options (cf. SMDP theory).

What does the SMDP connection give us? A theoretical foundation for what we really need! But the most interesting issues are beyond SMDPs...

Value Functions for Options
Define value functions for options, similar to the MDP case.
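Concretely, for a policy μ over options, the definitions take the following form (a sketch following the Sutton, Precup & Singh formulation, since the equations did not survive the transcript):

```latex
V^{\mu}(s) = \mathbb{E}\!\left[ r_{t+1} + \gamma r_{t+2} + \cdots \,\middle|\, \mu \text{ initiated in } s \text{ at time } t \right],
\qquad
Q^{\mu}(s,o) = \mathbb{E}\!\left[ r_{t+1} + \gamma r_{t+2} + \cdots \,\middle|\, o \text{ initiated in } s \text{ at time } t,\ \mu \text{ followed thereafter} \right].
```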

Models of Options
Knowing how an option is executed is not enough for reasoning about it, or planning with it; we need information about its consequences. This form follows from SMDP theory. Such models can be used interchangeably with models of primitive actions in Bellman equations.
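A sketch of the form that follows from SMDP theory: the model of an option o consists of a reward part and a (discounted) next-state part,

```latex
r^{o}_{s} = \mathbb{E}\!\left[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} \,\middle|\, o \text{ initiated in } s \text{ at time } t \text{ and lasting } k \text{ steps} \right],
```

```latex
p^{o}_{ss'} = \sum_{k=1}^{\infty} \gamma^{k}\, \Pr\!\left[ o \text{ terminates in } s' \text{ after } k \text{ steps} \,\middle|\, o \text{ initiated in } s \right].
```

With these two quantities, an option can be substituted for a primitive action in any Bellman equation, which is what makes the interchangeability claimed on the slide possible.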

Outline
RL and Markov Decision Processes (MDPs); Options and Semi-MDPs; Rooms Example; Between MDPs and Semi-MDPs (Termination Improvement, Intra-option Learning, Subgoals)

Rooms Example
[Gridworld diagram: four rooms connected by four hallways, with hallway options O1 and O2 shown and a goal location G]
4 rooms, 4 hallways, 8 multi-step options (one to each of a room's 2 hallways). 4 unreliable primitive actions (up, down, left, right) that fail 33% of the time. All rewards are zero; goal states are given a terminal value of 1; γ = 0.9. Task: given a goal location, quickly plan the shortest route.

Example: Synchronous Value Iteration Generalized to Options
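A hedged sketch of the backup this slide refers to: synchronous value iteration with the maximization taken over the options O(s) available in each state, using the option models r^o_s and p^o_{ss'} defined above:

```latex
V_{k+1}(s) \leftarrow \max_{o \in \mathcal{O}(s)} \Big[\, r^{o}_{s} + \sum_{s'} p^{o}_{ss'}\, V_{k}(s') \,\Big] \qquad \text{for all } s.
```

When O(s) contains only the primitive actions this reduces to ordinary value iteration; adding multi-step options changes only the models used in the backup.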

Rooms Example

Example with Goal ≠ Subgoal
Planning with both primitive actions and options.

What does the SMDP connection give us? A theoretical foundation for what we really need! But the most interesting issues are beyond SMDPs...

Advantages of the Dual MDP/SMDP View
At the SMDP level: compute value functions and policies over options, with the benefit of increased speed and flexibility.
At the MDP level: learn how to execute an option for achieving a given goal.
Between the MDP and SMDP level: improve over existing options (e.g. by terminating them early), and learn about the effects of several options in parallel, without executing them to termination.

Outline
RL and Markov Decision Processes (MDPs); Options and Semi-MDPs; Rooms Example; Between MDPs and Semi-MDPs (Termination Improvement, Intra-option Learning, Subgoals)

Between MDPs and SMDPs
Termination Improvement: improving the value function by changing the termination conditions of options.
Intra-Option Learning: learning the values of options in parallel, without executing them to termination; learning the models of options in parallel, without executing them to termination.
Tasks and Subgoals: learning the policies inside the options.

Termination Improvement
Idea: we can do better by sometimes interrupting ongoing options, forcing them to terminate before β says to.
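One way to state the underlying condition (a sketch of the termination-improvement result in this line of work): while executing option o under a policy μ over options, interrupt in state s whenever

```latex
Q^{\mu}(s, o) < V^{\mu}(s),
```

i.e. whenever continuing with o is worth less than terminating and re-choosing; the resulting interrupted policy is at least as good as μ in every state.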

Landmarks Task
Task: navigate from S to G as fast as possible. There are 4 primitive actions, for taking tiny steps up, down, left, and right, and 7 controllers for going straight to each of the landmarks, from within a circular region where that landmark is visible. In this task, planning at the level of primitive actions is computationally intractable; we need the controllers.

Termination Improvement for the Landmarks Task
Allowing early termination based on models improves the value function at no additional cost!

Illustration: Reconnaissance Mission Planning (Problem)
Mission: fly over (observe) the most valuable sites and return to base. Stochastic weather affects the observability (cloudy or clear) of the sites, and fuel is limited. The problem is intractable with classical optimal control methods.
Temporal scales - Actions: which direction to fly now; Options: which site to head for.
Options compress space and time: they reduce the number of steps from ~600 to ~6 (decisions at the 6 sites only, rather than at any state) and the number of states from ~10^11 to ~10^6.

Illustration: Reconnaissance Mission Planning (Results)
SMDP planner: assumes options are followed to completion; plans the optimal SMDP solution.
SMDP planner with re-evaluation: plans as if options must be followed to completion, but actually takes them for only one step and re-picks a new option on every step.
Static re-planner: assumes the weather will not change; plans an optimal tour among the clear sites; re-plans whenever the weather changes.
[Chart: expected reward per mission for the three planners, under low-fuel and high-fuel conditions]
Temporal abstraction finds a better approximation than the static planner, with little more computation than the SMDP planner.

Intra-Option Learning Methods for Markov Options
Proven to converge to correct values, under the same assumptions as 1-step Q-learning. Idea: take advantage of each fragment of experience.
SMDP Q-learning: execute an option to termination, keeping track of the reward along the way; at the end, update only the option taken, based on the accumulated reward and the value of the state in which the option terminates.
Intra-option Q-learning: after each primitive action, update all the options that could have taken that action, based on the reward and the expected value from the next state on.

Intra-Option Learning Methods for Markov Options
Proven to converge to correct values, under the same assumptions as 1-step Q-learning. Idea: take advantage of each fragment of experience.
Intra-option learning: after each primitive action, update all the options that could have taken that action.
SMDP learning: execute the option to termination, then update only the option taken.
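A minimal sketch of the two update rules just described, for Markov options with deterministic internal policies. It reuses the hypothetical Option objects sketched earlier; Q is assumed to be a table keyed by (state, option) and alpha is the step size - these names are illustrative, not from the slides.

```python
def smdp_q_update(Q, s, o, discounted_reward, s_next, k, options, gamma, alpha):
    """SMDP Q-learning: after option o has run to termination (k steps),
    update only the option that was actually taken."""
    target = discounted_reward + gamma**k * max(Q[(s_next, o2)] for o2 in options)
    Q[(s, o)] += alpha * (target - Q[(s, o)])


def intra_option_q_update(Q, s, a, r, s_next, options, gamma, alpha):
    """Intra-option Q-learning: after each primitive action a, update every
    option whose internal policy would have chosen a in state s."""
    for o in options:
        if o.policy(s) != a:
            continue  # this experience fragment is not consistent with option o
        # Value of s_next from o's point of view: continue o with probability
        # 1 - beta(s_next), or terminate and re-choose with probability beta(s_next).
        beta = o.beta(s_next)
        u = (1.0 - beta) * Q[(s_next, o)] + beta * max(Q[(s_next, o2)] for o2 in options)
        Q[(s, o)] += alpha * (r + gamma * u - Q[(s, o)])
```

The intra-option update is applied to every consistent option on every time step, which is why it can learn the values of options that are never actually executed.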

Example of Intra-Option Value Learning
Intra-option methods learn correct values without ever taking the options! SMDP methods are not applicable here. (Setup: random start, goal in the right hallway, random actions.)

Intra-Option Value Learning Is Faster Than SMDP Value Learning
(Setup: random start, goal in the right hallway, choice from A ∪ H (primitive actions and hallway options), 90% greedy.)

Intra-Option Model Learning
Intra-option methods work much faster than SMDP methods. (Setup: random start state, no goal, options picked at random.)

Tasks and Subgoals
It is natural to define options as solutions to subtasks, e.g. treat hallways as subgoals and learn shortest paths to them.
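One hedged way to set up such a subtask (a sketch, not necessarily the exact construction used in the talk): assign terminal outcome values g(s') to the subgoal states and learn the option's internal action values Q_o by Q-learning toward that terminal value,

```latex
Q_{o}(s,a) \leftarrow Q_{o}(s,a) + \alpha \Big[ r + \gamma \big( (1-\beta(s')) \max_{a'} Q_{o}(s',a') + \beta(s')\, g(s') \big) - Q_{o}(s,a) \Big],
```

where β(s') is 1 at the subgoal (and any other termination states) and 0 elsewhere.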

Options Depend on Outcome Values
[Two panels: the option policies learned with large vs. small outcome values at the subgoal, with small negative rewards on each step - one learned policy takes shortest paths, the other avoids the negative rewards]

Between MDPs and SMDPs
Termination Improvement: improving the value function by changing the termination conditions of options.
Intra-Option Learning: learning the values of options in parallel, without executing them to termination; learning the models of options in parallel, without executing them to termination.
Tasks and Subgoals: learning the policies inside the options.

Summary: Benefits of Options
Transfer: solutions to sub-tasks can be saved and reused, and domain knowledge can be provided as options and subgoals.
Potentially much faster learning and planning, by representing action at an appropriate temporal scale.
Models of options are a form of knowledge representation: expressive, clear, and suitable for learning and planning.
Much more to learn than just one policy, one set of values: a framework for "constructivism", for finding models of the world that are useful for rapid planning and learning.