Hierarchical Reinforcement Learning Amir massoud Farahmand

Presentation transcript:

Hierarchical Reinforcement Learning Amir massoud Farahmand

Markov Decision Problems Markov processes formulate a wide range of dynamical systems. The goal is to find an optimal solution of an objective function via [stochastic] dynamic programming. Planning: known environment. Learning: unknown environment.
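For reference, a minimal formal sketch in standard notation (not shown on the slide): an MDP is a tuple (S, A, P, R, γ), and dynamic programming solves the Bellman optimality equation

V^*(s) = \max_{a \in \mathcal{A}} \Big[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big].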

MDP

Reinforcement Learning (1) A very important machine learning method: an approximate, online solution to MDPs –Monte Carlo methods –Stochastic approximation –[Function approximation]

Reinforcement Learning (2) Q-Learning and SARSA are among the most important RL algorithms.
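A minimal sketch of the tabular Q-learning update, added here for concreteness (my own illustration, not from the slides; the env.reset()/env.step() interface and env.actions list are assumptions):

import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    # Tabular Q-learning. `env` is assumed to expose reset() -> s,
    # step(a) -> (s', r, done), and a list of discrete actions env.actions.
    Q = defaultdict(float)  # Q[(state, action)] -> estimated return

    def greedy(s):
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy exploration
            a = random.choice(env.actions) if random.random() < epsilon else greedy(s)
            s_next, r, done = env.step(a)
            # off-policy TD(0) target: r + gamma * max_a' Q(s', a')
            target = r + (0.0 if done else gamma * Q[(s_next, greedy(s_next))])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q

SARSA differs only in the target: it bootstraps from the action actually chosen (epsilon-greedily) in s_next rather than from the greedy one.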

Curses of DP Curse of modeling –RL addresses this problem Curse of dimensionality –Value-function approximation –Hierarchical methods

Hierarchical RL (1) Use some kind of hierarchy in order to … –Learn faster –Update fewer values (smaller storage) –Incorporate the designer's a priori knowledge –Increase reusability –Obtain a more meaningful structure than a mere Q-table

Hierarchical RL (2) Is there any unified meaning of hierarchy? NO! Different methods: –Temporal abstraction –State abstraction –Behavioral decomposition –…

Hierarchical RL (3) Feudal Q-Learning [Dayan, Hinton] Options [Sutton, Precup, Singh] MaxQ [Dietterich] HAM [Russell, Parr, Andre] HexQ [Hengst] Weakly-Coupled MDP [Bernstein, Dean & Lin, …] Structure Learning in SSA [Farahmand, Nili] …

Feudal Q-Learning Divide each task into a few smaller sub-tasks A state abstraction method Different layers of managers Each manager receives orders from its super-manager and gives orders to its sub-managers (Hierarchy diagram: super-manager → managers 1 and 2 → sub-managers 1 and 2)

Feudal Q-Learning Principles of Feudal Q-Learning –Reward Hiding: Managers must reward sub-managers for doing their bidding whether or not this satisfies the commands of the super-managers. Sub-managers should just learn to obey their managers and leave it up to them to determine what it is best to do at the next level up. –Information Hiding: Managers only need to know the state of the system at the granularity of their own choices of tasks. Indeed, allowing some decision making to take place at a coarser grain is one of the main goals of the hierarchical decomposition. Information is hidden both downwards (sub-managers do not know the task the super-manager has set the manager) and upwards (a super-manager does not know what choices its manager has made to satisfy its command).
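A rough sketch of how reward hiding and information hiding could look in code (a hypothetical two-level setup; SUBGOALS, ACTIONS, coarse_state and the defaultdict Q-tables are my own assumptions, not part of the original formulation):

def feudal_update(Q_manager, Q_worker, s_mgr, subgoal, transitions,
                  env_reward, alpha=0.1, gamma=0.9):
    # The manager picked `subgoal` in its coarse state s_mgr; the worker then
    # produced the low-level `transitions` = [(s, a, s_next), ...].
    # Reward hiding: the worker is rewarded only for reaching the commanded
    # subgoal, regardless of the external reward the environment gave.
    for (s, a, s_next) in transitions:
        intrinsic = 1.0 if s_next == subgoal else 0.0
        best_next = max(Q_worker[(subgoal, s_next, b)] for b in ACTIONS)
        td = intrinsic + gamma * best_next - Q_worker[(subgoal, s, a)]
        Q_worker[(subgoal, s, a)] += alpha * td

    # Information hiding: the manager only sees its own coarse state and is
    # updated with the environment reward accumulated while the worker acted.
    s_mgr_next = coarse_state(transitions[-1][2])
    best_next = max(Q_manager[(s_mgr_next, g)] for g in SUBGOALS)
    td = env_reward + gamma * best_next - Q_manager[(s_mgr, subgoal)]
    Q_manager[(s_mgr, subgoal)] += alpha * td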

Feudal Q-Learning

Options: Introduction People make decisions at different time scales –Traveling example It is desirable to have a method that supports such temporally extended actions over different time scales

Options: Concept Macro-actions The temporal abstraction method of Hierarchical RL Options are temporally extended actions, each built from a set of primitive actions Example: –Primitive actions: walk North/South/West/East –Options: go to {door, corner, table, straight} Options can be open-loop or closed-loop Semi-Markov Decision Process theory [Puterman]

Options: Formal Definitions
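The standard formal definition, in the usual Sutton–Precup–Singh notation (the slide shows it as an image): an option is a triple

o = \langle \mathcal{I}, \pi, \beta \rangle, \qquad \mathcal{I} \subseteq \mathcal{S} \ \text{(initiation set)}, \quad \pi : \mathcal{S} \times \mathcal{A} \to [0,1] \ \text{(internal policy)}, \quad \beta : \mathcal{S} \to [0,1] \ \text{(termination condition)},

i.e., o can be started in any state of \mathcal{I}; it then follows \pi, and in each state s it terminates with probability \beta(s).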

Options: Rise of SMDP! Theorem: MDP + Options = SMDP
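A minimal sketch of SMDP Q-learning over options, i.e., Q-learning where the "action" is a whole option executed to termination (the Option interface with policy()/terminates() and the env.options list are assumptions of this sketch):

def smdp_q_update(Q, s, option, env, alpha=0.1, gamma=0.99):
    # Run `option` from state s, accumulating the discounted reward R over the
    # k steps it lasts, then apply the SMDP target R + gamma^k * max_o' Q(s', o').
    s_t, R, discount, done = s, 0.0, 1.0, False
    while not done and not option.terminates(s_t):
        a = option.policy(s_t)
        s_t, r, done = env.step(a)
        R += discount * r
        discount *= gamma   # after k steps, discount == gamma ** k
    best_next = 0.0 if done else max(Q[(s_t, o)] for o in env.options)
    Q[(s, option)] += alpha * (R + discount * best_next - Q[(s, option)])
    return s_t, done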

Options: Value function

Options: Bellman-like optimality condition
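In standard notation (a sketch of the usual equations, since the slides show them as images), the optimal option-value function over a set of options \mathcal{O} satisfies the Bellman-like condition

Q^*_{\mathcal{O}}(s, o) = \mathbb{E}\Big[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} + \gamma^{k} \max_{o' \in \mathcal{O}_{s_{t+k}}} Q^*_{\mathcal{O}}(s_{t+k}, o') \ \Big|\ s_t = s,\ o \text{ executed for } k \text{ steps} \Big],

which is the MDP optimality equation with the one-step reward and discount replaced by their multi-step (SMDP) counterparts.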

Options: A simple example

Interrupting Options An option’s policy is followed until the option terminates. This is a somewhat unnecessary restriction –you may change your decision in the middle of executing your previous decision. Interruption theorem: yes –switching to a better option mid-execution yields a policy that is at least as good.
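A small sketch of such interrupted (polling) execution, using the same assumed Option interface as above: at every step, if some option now looks strictly better than continuing the current one, switch.

def interrupting_step(Q, s, env, current=None):
    # One step of interrupted option execution: instead of committing to
    # `current` until it terminates, compare the value of continuing with the
    # best available option and switch when that is strictly better.
    best = max(env.options, key=lambda o: Q[(s, o)])
    if current is None or current.terminates(s) or Q[(s, best)] > Q[(s, current)]:
        current = best  # interrupt: a better option is available now
    a = current.policy(s)
    s_next, r, done = env.step(a)
    return current, s_next, r, done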

Interrupting Options: An example

Options: Other issues Intra-option {model, value} learning Learning each option’s policy –Defining a sub-goal reward function

MaxQ MaxQ Value Function Decomposition Somewhat related to Feudal Q-Learning Decomposes the value function over a hierarchy of subtasks

MaxQ

MaxQ: Value decomposition
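The decomposition itself, in standard MaxQ notation (the slide shows it as an image): the value of invoking child subtask a inside parent subtask i splits into the value of completing a and a completion term for finishing i afterwards,

Q^{\pi}(i, s, a) = V^{\pi}(a, s) + C^{\pi}(i, s, a), \qquad V^{\pi}(i, s) = \begin{cases} Q^{\pi}(i, s, \pi_i(s)) & i \text{ composite} \\ \sum_{s'} P(s' \mid s, i)\, R(s' \mid s, i) & i \text{ primitive.} \end{cases}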

MaxQ: Existence theorem MaxQ yields a recursively optimal policy. In general there may be many recursively optimal policies, possibly with different value functions, and a recursively optimal policy is not necessarily an optimal policy of the original MDP. Theorem: if H is a stationary macro hierarchy for MDP M, then all recursively optimal policies w.r.t. M and H have the same value.

MaxQ: Learning Theorem: if M is an MDP, H is a stationary macro hierarchy, the exploration policy is GLIE (Greedy in the Limit with Infinite Exploration), and the common convergence conditions hold (bounded V and C, suitably decaying learning rates α), then with probability 1 the MaxQ-0 algorithm converges!
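For concreteness, these are the MaxQ-0 updates the theorem refers to, written in the standard form (a sketch, not the full recursive algorithm): after child subtask a runs for N steps from s and terminates in s',

C_{t+1}(i, s, a) \leftarrow (1 - \alpha_t)\, C_t(i, s, a) + \alpha_t\, \gamma^{N} \max_{a'} \big[ C_t(i, s', a') + V_t(a', s') \big],

and for a primitive action i taken in s with reward r,

V_{t+1}(i, s) \leftarrow (1 - \alpha_t)\, V_t(i, s) + \alpha_t\, r.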

MaxQ Faster learning: all-states updating –similar to Kaelbling’s “all-goals updating”

MaxQ

MaxQ: State abstraction Advantages –Memory reduction –Less exploration needed –Increased reusability, since a subtask does not depend on its higher-level parents Is it possible?!

MaxQ: State abstraction Exact preservation of value function Approximate preservation

MaxQ: State abstraction Does it converge? –It has not been proved formally yet. What can we do if we want to use an abstraction that violates Theorem 3? –Reward function decomposition: design a reward function that reinforces the parts of the architecture responsible for the outcome.

MaxQ: Other issues Undesired terminal states Non-hierarchical execution (polling execution) –Better performance –Computationally intensive

Learning in Subsumption Architecture Structure learning –How should behaviors be arranged in the architecture? Behavior learning –How should a single behavior act? Structure/Behavior learning

SSA: Purely Parallel Case (Layer diagram: the sensors feed the parallel behaviors – manipulate the world, build maps, explore, avoid obstacles, locomote.)

SSA: Structure learning issues How should we represent the structure? –Sufficient (the problem space can be covered) –Tractable (small hypothesis space) –Well-defined credit assignment How should we assign credit within the architecture?

SSA: Structure learning issues Purely parallel structure –Is it the most plausible choice (regarding the SSA-BBS assumptions)? Some different representations –Behavior learning –Behavior/Layer learning –Order learning

SSA: Behavior learning issues Reinforcement signal decomposition: each behavior has its own reward function Reinforcement signal design: how should we transform our desires into a reward function? –Reward shaping –Emotional learning –…? Hierarchical credit assignment
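A minimal sketch of such reinforcement-signal decomposition, where every behavior keeps its own Q-table and its own reward function (the behavior set, ACTIONS, and the per-behavior reward functions are purely illustrative assumptions):

def decomposed_update(Q_tables, reward_fns, s, proposed, s_next, alpha=0.1, gamma=0.95):
    # Q_tables[b] is behavior b's own (defaultdict) Q-table, reward_fns[b] its
    # own reward function, and proposed[b] the action it suggested this step.
    # Each behavior is updated only from its own reward signal.
    for b, Q in Q_tables.items():
        r_b = reward_fns[b](s, proposed[b], s_next)        # behavior-specific reward
        best_next = max(Q[(s_next, a)] for a in ACTIONS)   # greedy bootstrap
        Q[(s, proposed[b])] += alpha * (r_b + gamma * best_next - Q[(s, proposed[b])])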

SSA: Structure Learning example Suppose we have correct behaviors and want to arrange them in an architecture so as to achieve a specific overall behavior Subjective evaluation: we want to lift an object to a specific height while its slope does not become too high. Objective evaluation: how should we design it?!

SSA: Structure Learning example