
1. Hierarchical Reinforcement Learning Using Graphical Models
Victoria Manfredi and Sridhar Mahadevan
Department of Computer Science, University of Massachusetts, Amherst
Rich Representations for Reinforcement Learning, ICML'05 Workshop, August 7, 2005

2. Introduction
- Abstraction is necessary to scale RL, hence hierarchical RL.
- We want to learn abstractions automatically.
- Other approaches:
  - Find subgoals: McGovern & Barto '01; Simsek & Barto '04; Simsek, Wolfe, & Barto '05; Mannor et al. '04; ...
  - Build a policy hierarchy: Hengst '02
  - Potentially proto-value functions: Mahadevan '05
- Our approach: learn an initial policy hierarchy using a graphical-model framework, then learn how to use those policies with reinforcement learning and reward.
- Related to imitation: Price & Boutilier '03; Abbeel & Ng '04

3. Outline
- Dynamic Abstraction Networks
- Approach
- Experiments
- Results
- Summary
- Future Work

4. Dynamic Abstraction Network
A DAN is a two-slice dynamic Bayesian network coupling a policy hierarchy and a state hierarchy: at each time slice, abstract policy nodes P1 and P0, abstract state nodes S1 and S0, termination nodes F1 and F0, and an observation node. This is just one realization of a DAN; others are possible.
[Figure: two-slice DAN (t=1, t=2) showing the policy hierarchy (P1, P0), state hierarchy (S1, S0), termination nodes (F1, F0), and observations, illustrated with an example task of attending ICML'05: register, go to the conference center, Bonn.]
Related models: HHMM (Fine, Singer, & Tishby '98); AHMM (Bui, Venkatesh, & West '02); DAN (Manfredi & Mahadevan '05).
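To make the structure concrete, here is a minimal sketch of a two-slice DAN written as plain Python data. The node names follow the slide's figure, but the particular edge set is an assumption, not the authors' exact model.

```python
# Hypothetical sketch of a two-slice Dynamic Abstraction Network (DAN).
# The edge set below is an assumption inferred from the slide's figure,
# not the exact structure used by Manfredi & Mahadevan.

# Nodes in one time slice: a policy hierarchy (P1 over P0), a state
# hierarchy (S1 over S0), termination flags (F1, F0), and an observation.
NODES = ["P1", "P0", "S1", "S0", "F1", "F0", "Obs"]

# Within-slice edges: higher levels condition lower levels, and the flat
# observation is emitted from the bottom abstract state.
INTRA_EDGES = [
    ("P1", "P0"), ("S1", "S0"),
    ("S1", "P1"), ("S0", "P0"),
    ("P0", "F0"), ("S0", "F0"),
    ("P1", "F1"), ("S1", "F1"), ("F0", "F1"),
    ("S0", "Obs"),
]

# Between-slice edges: states persist over time, and policies persist
# unless the corresponding termination flag fired at the previous step.
INTER_EDGES = [
    ("S0", "S0"), ("S1", "S1"),
    ("P0", "P0"), ("P1", "P1"),
    ("F0", "P0"), ("F1", "P1"),
]
```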

5. Approach
Phase 1: an expert hand-codes skills; we observe trajectories of the expert executing them; we learn a DAN from those trajectories using EM.
Phase 2: policy improvement over the extracted abstractions, e.g., with SMDP Q-learning.
Design questions: discrete or continuous variables? how many state values? how many levels?

6. DANs vs MAXQ/HAMs: what the designer must specify
- DANs: number of levels in the state/policy hierarchies; number of values for each (abstract) state/policy node; training sequences of (flat state, action) pairs. Everything else DANs infer from the training sequences.
- MAXQ [Dietterich '00]: number of levels; number of tasks at each level; connections between levels; initiation set for each task; termination set for each task.
- HAMs [Parr & Russell '98]: number of levels; a hierarchy of stochastic finite state machines with explicit action, call, choice, and stop states.

7. Why Graphical Models?
Advantages:
- Joint learning of multiple policy/state abstractions
- Handles continuous and hidden domains
- The full machinery of probabilistic inference can be used
Disadvantages:
- Parameter learning with hidden variables is expensive
- Expectation-Maximization can get stuck in local maxima

8. Domain: Dietterich's Taxi (2000)
States:
- Taxi Location (TL): 25
- Passenger Location (PL): 5
- Passenger Destination (PD): 5
Actions: North, South, East, West, Pickup, Putdown
Hand-coded policies: GotoRed, GotoGreen, GotoYellow, GotoBlue, Pickup, Putdown
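For concreteness, a small sketch of how the Taxi domain's factored state and action sets might be encoded; all identifiers here are illustrative naming choices, not code from the paper.

```python
# Illustrative encoding of Dietterich's Taxi domain as described on this slide.

ACTIONS = ["North", "South", "East", "West", "Pickup", "Putdown"]

# Hand-coded skills the mentor demonstrates (Phase 1).
SKILLS = ["GotoRed", "GotoGreen", "GotoYellow", "GotoBlue", "Pickup", "Putdown"]

# Factored flat state: 25 taxi locations x 5 passenger locations
# (4 landmarks + in-taxi) x 5 passenger destinations = 500 flat states.
NUM_TAXI_LOCATIONS = 25
NUM_PASSENGER_LOCATIONS = 5
NUM_PASSENGER_DESTINATIONS = 5

def flat_state(tl: int, pl: int, pd: int) -> int:
    """Map the factored observation (TL, PL, PD) to a single state index."""
    return (tl * NUM_PASSENGER_LOCATIONS + pl) * NUM_PASSENGER_DESTINATIONS + pd
```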

9. Experiments
[Figure: the Taxi DAN, with abstract policy and state nodes (S1, S0, F1, F0), the primitive action node, and observations TL, PL, PD at each time slice.]
Phase 1:
- |S1| = 5, |S0| = 25, |Π1| = 6, |Π0| = 6
- 1000 training sequences from an SMDP Q-learner: {TL, PL, PD, A}_1, ..., {TL, PL, PD, A}_n
- Learned with the Bayes Net Toolbox (Murphy '01)
Phase 2: SMDP Q-learning
- Choose policy π1 using ε-greedy
- Compute the most likely abstract state s0 given TL, PL, PD
- Select action π0 using Pr(π0 | Π1 = π1, S0 = s0)
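Below is a hedged sketch of the Phase 2 control loop described above: SMDP Q-learning over the learned DAN policies, with an ε-greedy choice of π1 and the primitive action taken from the learned conditional Pr(π0 | Π1, S0). The inference helper and the CPT layout are stand-ins for quantities the DAN/EM phase would provide (the actual experiments used the Bayes Net Toolbox), so treat the names as assumptions.

```python
import numpy as np

# Assumed sizes matching the slide: 6 abstract policies, 25 abstract
# low-level states, 6 primitive actions, 500 flat Taxi states.
NUM_P1, NUM_S0, NUM_A = 6, 25, 6
Q = np.zeros((500, NUM_P1))                                       # Q(s, pi1)
p_a_given_p1_s0 = np.full((NUM_P1, NUM_S0, NUM_A), 1.0 / NUM_A)   # Pr(pi0 | pi1, s0)

def most_likely_s0(tl, pl, pd):
    """Hypothetical stand-in for DAN inference: argmax_s0 Pr(S0 | TL, PL, PD)."""
    return tl % NUM_S0   # placeholder; the real computation queries the DAN

def choose_policy(s, epsilon=0.1):
    """Epsilon-greedy choice of the abstract policy pi1 for flat state s."""
    if np.random.rand() < epsilon:
        return np.random.randint(NUM_P1)
    return int(np.argmax(Q[s]))

def choose_action(pi1, tl, pl, pd):
    """Pick a primitive action from the learned conditional Pr(pi0 | pi1, s0)."""
    s0 = most_likely_s0(tl, pl, pd)
    return int(np.argmax(p_a_given_p1_s0[pi1, s0]))
```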

10. Policy Improvement
The policy learned over the DAN policies performs well. Each plot is an average over 10 RL runs and 1 EM run.

11. Policy Recognition
The DAN can (sometimes!) recognize a specific sequence of actions as composing a single policy.
[Figure: recognized policies (Policy 1 through Policy 6, PU, PD) shown against the initial passenger location and passenger destination.]

12. Summary
A two-phase method for automating hierarchical RL using graphical models.
Advantages:
- Limited information needed (number of levels, number of values)
- Permits continuous and partially observable states/actions
Disadvantages:
- EM is expensive
- Needs a mentor
- The learned abstractions can be hard to decipher (local maxima?)

13. Future Work
- Approximate inference in DANs:
  - Saria & Mahadevan '04: Rao-Blackwellized particle filtering for multi-agent AHMMs
  - Johns & Mahadevan '05: variational inference for AHMMs
- Take advantage of the ability to do inference in the hierarchical RL phase
- Incorporate reward in the DAN

14. Thank You
Questions?

15. Abstract State Transitions: S0
- Regardless of which abstract P0 policy is being executed, abstract S0 states self-transition with high probability.
- Depending on the abstract P0 policy, they may instead transition to one of a few other abstract S0 states.
- The same holds for abstract S1 states under abstract P1 policies.

16. State Abstractions
The abstract state to which the agent is most likely to transition is a consequence, in part, of the learned state abstractions.

17. Semi-MDP Q-Learning
$Q(s,o) \leftarrow Q(s,o) + \alpha \left[ r + \gamma^{\tau} \max_{o' \in O} Q(s', o') - Q(s,o) \right]$
- $Q(s,o)$: activity-value for state $s$ and activity $o$
- $\alpha$: learning rate
- $\gamma^{\tau}$: discount factor raised to $\tau$, the number of time steps $o$ took
- $r$: accumulated discounted reward since $o$ began
- $s'$: the state in which $o$ terminated
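A minimal sketch of this update rule in Python; the variable names are illustrative, not from the paper.

```python
import numpy as np

def smdp_q_update(Q, s, o, r, s_next, tau, alpha=0.1, gamma=0.95):
    """One SMDP Q-learning update.

    Q       : table of activity-values, indexed as Q[state, option]
    s, o    : state in which option o was initiated, and the option itself
    r       : accumulated discounted reward while o executed
    s_next  : state in which o terminated
    tau     : number of time steps o took
    """
    target = r + (gamma ** tau) * np.max(Q[s_next])
    Q[s, o] += alpha * (target - Q[s, o])
    return Q
```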

18. Abstract State S1 Transitions
Abstract state S1 transitions under abstract policy P1.

19. Expectation-Maximization (EM)
Used when a model has hidden variables and unknown parameters.
- E(xpectation) step: assume the parameters are known and compute the conditional expected values of the hidden variables.
- M(aximization) step: assume the hidden variables are observed (at their expected values) and compute the argmax parameters.
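A schematic of the EM loop this slide outlines; `e_step` and `m_step` are placeholders for the conditional-expectation and maximization computations (in the experiments these were carried out with the Bayes Net Toolbox).

```python
def expectation_maximization(data, params, e_step, m_step, max_iters=50, tol=1e-4):
    """Generic EM skeleton for a model with hidden variables.

    e_step(data, params) -> (expected_stats, log_likelihood)
        With parameters fixed, compute expected sufficient statistics
        of the hidden variables given the observations.
    m_step(expected_stats) -> params
        With the hidden variables "filled in" by their expectations,
        choose parameters that maximize the expected log-likelihood.
    """
    prev_ll = float("-inf")
    for _ in range(max_iters):
        stats, ll = e_step(data, params)      # E-step
        params = m_step(stats)                # M-step
        if ll - prev_ll < tol:                # converged (a local maximum)
            break
        prev_ll = ll
    return params
```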

20. Abstract State S0 Transitions
Abstract state S0 transitions under abstract policy P0.

