
UNIVERSITY OF MASSACHUSETTS, AMHERST, Department of Computer Science
Hierarchical Reinforcement Learning Using Graphical Models
Victoria Manfredi and Sridhar Mahadevan
Rich Representations for Reinforcement Learning, ICML'05 Workshop, August 7, 2005

Slide 2: Introduction
Abstraction is necessary to scale RL → hierarchical RL. We want to learn abstractions automatically.
Other approaches:
Find subgoals: McGovern & Barto '01; Simsek & Barto '04; Simsek, Wolfe, & Barto '05; Mannor et al. '04; ...
Build policy hierarchy: Hengst '02
Potentially proto-value functions: Mahadevan '05
Our approach: learn an initial policy hierarchy using a graphical model framework, then learn how to use those policies using reinforcement learning and reward.
Related to imitation: Price & Boutilier '03; Abbeel & Ng '04

Slide 3: Outline
Dynamic Abstraction Networks; Approach; Experiments; Results; Summary; Future Work

Slide 4: Dynamic Abstraction Network (DAN)
[Figure: a two-slice dynamic Bayesian network with a policy hierarchy (P1 over P0), a state hierarchy (S1 over S0), termination nodes (F1, F0), and an observation node in each time slice t=1, t=2; illustrative task hierarchy: Attend ICML'05 → Register → Conference Center Bonn.]
Just one realization of a DAN; others are possible.
Related models: HHMM (Fine, Singer, & Tishby '98); AHMM (Bui, Venkatesh, & West '02); DAN (Manfredi & Mahadevan '05).
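To make the figure's structure concrete, here is a minimal sketch of one possible DAN realization written as plain Python data structures. The node names (P1, P0, S1, S0, F1, F0, Obs) follow the figure, but the exact edge set is an assumption, since the slide stresses that other realizations are possible.

```python
# Minimal sketch of one DAN realization as a two-slice DBN (2-TBN).
# Node names follow the slide's figure: P1/P0 = abstract policies,
# S1/S0 = abstract states, F1/F0 = termination indicators, Obs = observation.

# Intra-slice edges (within a single time step)
intra_slice_edges = [
    ("P1", "P0"),   # higher-level policy conditions the lower-level policy
    ("S1", "S0"),   # higher-level abstract state constrains the lower-level state
    ("S0", "Obs"),  # abstract state generates the observation
    ("P0", "Obs"),  # lower-level policy also influences the observation
    ("S0", "F0"),   # lower-level termination depends on state and policy
    ("P0", "F0"),
    ("S1", "F1"),
    ("P1", "F1"),
    ("F0", "F1"),   # the upper level can only terminate once its child has
]

# Inter-slice edges (from time t to time t+1)
inter_slice_edges = [
    ("P1", "P1"), ("P0", "P0"),   # policies persist unless their F node fires
    ("S1", "S1"), ("S0", "S0"),   # abstract states evolve over time
    ("F1", "P1"), ("F0", "P0"),   # termination gates whether a new policy is chosen
]

if __name__ == "__main__":
    for edge in intra_slice_edges + inter_slice_edges:
        print(edge)
```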

Slide 5: Approach
Phase 1 (from an expert): hand-code skills; observe trajectories; learn a DAN using EM. Design choices: discrete or continuous variables? How many state values? How many levels?
Phase 2: extract abstractions; policy improvement, e.g., SMDP Q-learning. The two phases are outlined in the sketch below.
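The following is a hedged outline of the slide's two-phase pipeline; every helper here is a hypothetical placeholder for a step named on the slide, not a real API from the paper.

```python
# Hedged outline of the two-phase approach; all helpers are placeholders.

def observe_trajectories(expert, num_episodes):
    """Phase 1, step 1: record (flat state, action) sequences from the expert."""
    raise NotImplementedError("domain-specific data collection")

def learn_dan_with_em(trajectories, num_levels, num_values):
    """Phase 1, step 2: fit the DAN's parameters with EM (abstract nodes are hidden)."""
    raise NotImplementedError("EM over the DAN's conditional probability tables")

def improve_policy_with_smdp_q(env, dan):
    """Phase 2: treat the DAN's learned policies as options and run SMDP Q-learning."""
    raise NotImplementedError("see the SMDP Q-learning update later in the slides")

def two_phase_hierarchical_rl(expert, env, num_levels=2, num_values=25, num_episodes=100):
    # Phase 1: learn abstractions from the expert's behavior.
    trajectories = observe_trajectories(expert, num_episodes)
    dan = learn_dan_with_em(trajectories, num_levels, num_values)
    # Phase 2: learn how to use the extracted policies with reward.
    return improve_policy_with_smdp_q(env, dan)
```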

Slide 6: DANs vs MAXQ/HAMs
HAMs [Parr & Russell '98] require: the number of levels; a hierarchy of stochastic finite-state machines; explicit action, call, choice, and stop states.
MAXQ [Dietterich '00] requires: the number of levels and the number of tasks at each level; the connections between levels; an initiation set and a termination set for each task.
DANs require: the number of levels in the state/policy hierarchies; the number of values for each (abstract) state/policy node; training sequences of (flat state, action) pairs. DANs infer the rest from the training sequences.

Slide 7: Why Graphical Models?
Advantages: joint learning of multiple policy/state abstractions; continuous/hidden domains; the full machinery of inference can be used.
Disadvantages: parameter learning with hidden variables is expensive; Expectation-Maximization can get stuck in local maxima.

Slide 8: Domain
Dietterich's Taxi (2000).
States: Taxi Location (TL), 25 values; Passenger Location (PL), 5 values; Passenger Destination (PD), 5 values.
Actions: North, South, East, West, Pickup, Putdown.
Hand-coded policies: GotoRed, GotoGreen, GotoYellow, GotoBlue, Pickup, Putdown.
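A small sketch of the state factorization and of one hand-coded Goto skill; the landmark coordinates and the wall-free greedy movement are simplifying assumptions for illustration, not the exact skills used in the paper.

```python
# Sketch of the Taxi factorization from the slide and one hand-coded Goto skill.
# Assumed landmark coordinates on a 5x5 grid (row, col), row 0 at the top;
# the greedy move below ignores walls, so it is only illustrative.

LANDMARKS = {"Red": (0, 0), "Green": (0, 4), "Yellow": (4, 0), "Blue": (4, 3)}
ACTIONS = ["North", "South", "East", "West", "Pickup", "Putdown"]

def goto(landmark, taxi_row, taxi_col):
    """Hand-coded skill: move one step toward the named landmark."""
    goal_row, goal_col = LANDMARKS[landmark]
    if taxi_row > goal_row:
        return "North"
    if taxi_row < goal_row:
        return "South"
    if taxi_col < goal_col:
        return "East"
    if taxi_col > goal_col:
        return "West"
    return None  # already at the landmark; a higher-level policy decides Pickup/Putdown

# Flat state per the slide's factorization: 25 taxi locations x 5 passenger
# locations x 5 destinations = 625 combinations.
num_flat_states = 25 * 5 * 5
```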

Slide 9: Experiments
[Figure: the Taxi DAN, with observations TL, PL, PD and Action, abstract states S0 and S1, termination nodes F0 and F1, and policy nodes at two levels across time slices.]
Phase 1: |S1| = 5, |S0| = 25, |Π1| = 6, |Π0| = …; training sequences {TL, PL, PD, A}_1, …, {TL, PL, PD, A}_n from an SMDP Q-learner; learned with the Bayes Net Toolbox (Murphy '01).
Phase 2: SMDP Q-learning. Choose policy π1 using ε-greedy; compute the most likely abstract state s0 given TL, PL, PD; select action π0 using Pr(Π0 | Π1 = π1, S0 = s0).
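The Phase 2 action-selection step described above might look like the following sketch; the conditional tables pr_s0_given_obs and pr_a_given_p1_s0 and the Q-table layout are assumed data structures for illustration, not the authors' implementation.

```python
import numpy as np

# Sketch of Phase 2 action selection over the DAN's learned policies.
rng = np.random.default_rng(0)

def choose_abstract_policy(Q, s, epsilon, num_policies):
    """Epsilon-greedy choice of the high-level DAN policy pi_1 for flat state s."""
    if rng.random() < epsilon:
        return int(rng.integers(num_policies))
    return int(np.argmax(Q[s]))

def select_primitive_action(pi1, tl, pl, pd, pr_s0_given_obs, pr_a_given_p1_s0):
    """Most likely abstract state s0 given (TL, PL, PD), then a primitive action
    drawn from Pr(action | Pi1 = pi1, S0 = s0)."""
    s0 = int(np.argmax(pr_s0_given_obs[tl, pl, pd]))   # MAP abstract state
    action_dist = pr_a_given_p1_s0[pi1, s0]            # Pr(Pi0 | Pi1, S0)
    return int(rng.choice(len(action_dist), p=action_dist))
```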

Slide 10: Policy Improvement
The policy learned over the DAN policies performs well. Each plot is an average over 10 RL runs and 1 EM run.

Slide 11: Policy Recognition
[Figure: an example taxi trajectory with the initial passenger location and passenger destination marked, and action segments labeled by the DAN's policies (Policy 1, Policy 6, PU, PD).]
The DAN can (sometimes!) recognize a specific sequence of actions as composing a single policy.

Slide 12: Summary
A two-phase method for automating hierarchical RL using graphical models.
Advantages: limited information is needed (number of levels, number of values); permits continuous and partially observable states/actions.
Disadvantages: EM is expensive; a mentor is needed; the learned abstractions can be hard to decipher (local maxima?).

Slide 13: Future Work
Approximate inference in DANs: Saria & Mahadevan '04 (Rao-Blackwellized particle filtering for multi-agent AHMMs); Johns & Mahadevan '05 (variational inference for AHMMs).
Take advantage of the ability to do inference in the hierarchical RL phase.
Incorporate reward into the DAN.

Slide 14: Thank You
Questions?

Slide 15: Abstract State Transitions: S0
Regardless of which abstract P0 policy is being executed, abstract S0 states self-transition with high probability. Depending on the abstract P0 policy, they may alternatively transition to one of a few other abstract S0 states. The same holds for abstract S1 states under abstract P1 policies.

Slide 16: State Abstractions
The abstract state to which the agent is most likely to transition is, in part, a consequence of the learned state abstractions.

Slide 17: Semi-MDP Q-learning
Q(s, o) ← Q(s, o) + α [ r + γ^τ max_{o' ∈ O} Q(s', o') - Q(s, o) ]
Q(s, o): activity value for state s and activity o
α: learning rate
γ^τ: discount rate raised to the number of time steps τ that o took
r: accumulated discounted reward since o began
s': state in which o terminated
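A direct transcription of this update as code, assuming a dict-of-dicts Q-table; the option-execution machinery around it is omitted.

```python
# Sketch of the slide's SMDP Q-learning update. Q is assumed to be a
# dict mapping state -> {option: value}; this layout is an assumption.

def smdp_q_update(Q, s, o, r, tau, s_next, alpha, gamma):
    """Q(s,o) <- Q(s,o) + alpha [ r + gamma^tau * max_o' Q(s',o') - Q(s,o) ]
    r      : discounted reward accumulated while option o ran
    tau    : number of time steps option o took
    s_next : state in which o terminated"""
    target = r + (gamma ** tau) * max(Q[s_next].values())
    Q[s][o] += alpha * (target - Q[s][o])
    return Q

# Example usage on a toy two-state, two-option table:
Q = {"s0": {"opt_a": 0.0, "opt_b": 0.0}, "s1": {"opt_a": 1.0, "opt_b": 0.5}}
smdp_q_update(Q, s="s0", o="opt_a", r=2.0, tau=3, s_next="s1", alpha=0.1, gamma=0.9)
```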

Slide 18: Abstract State S1 Transitions
[Figure: abstract state S1 transitions under abstract policy P1.]

Slide 19: Expectation-Maximization (EM)
Hidden variables and unknown parameters.
E(xpectation) step: assume the parameters are known and compute the conditional expected values of the hidden variables.
M(aximization) step: assume the hidden variables are observed and compute the argmax parameters.
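As a concrete illustration of this E/M pattern (on a toy problem, not the DAN learning itself), here is EM for a mixture of two biased coins, where the hidden variable is which coin produced each session of flips; the data and initialization are made up for the example.

```python
import numpy as np

# Toy EM: two coins with unknown biases; each session of 10 flips came from
# one of them, but we do not observe which (the hidden variable).
data = np.array([8, 5, 9, 4, 7])      # heads observed in each of 5 sessions
n_flips = 10
theta = np.array([0.6, 0.5])          # initial guess of each coin's heads probability

for _ in range(50):
    # E-step: parameters assumed known; expected responsibility of each coin per session
    likel = np.array([t ** data * (1 - t) ** (n_flips - data) for t in theta])  # (2, 5)
    resp = likel / likel.sum(axis=0)
    # M-step: hidden assignments treated as "observed" via resp; maximize the parameters
    theta = (resp * data).sum(axis=1) / (resp * n_flips).sum(axis=1)

print("estimated coin biases:", np.round(theta, 3))
```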

Slide 20: Abstract State S0 Transitions
[Figure: abstract state S0 transitions under abstract policy P0.]