Hierarchical POMDP Solutions

1 Hierarchical POMDP Solutions
Georgios Theocharous

2 Sequential Decision Making Under Uncertainty
[Diagram: an AGENT interacts with an ENVIRONMENT that has HIDDEN STATES, issuing ACTIONS (tests, treatments) and receiving OBSERVATIONS & REWARDS (symptoms).]
What is the optimal policy?

3 Manufacturing Processes (Mahadevan, Theocharous FLAIRS 98)
[Figure: a production line of machines and buffers]
Observations: parts in buffers, throughput
States: machine internal state
Reward: reward for consuming; penalties for filling buffers and for machine breakdown
Actions: produce, maintenance
What is the optimal policy?

4 Foveated Active Vision (Minut)
States: objects
Observations: local features
Reward: reward for finding the object
Actions: where to saccade next, which features to use
What is the optimal policy?

5 Many More Partially Observable Problems
Assistive technologies: web searching, preference elicitation
Sophisticated computing: distributed file access, network troubleshooting
Industrial: machine maintenance, manufacturing processes
Social: education, medical diagnosis, health care policymaking
Corporate: marketing, corporate policy
...

6 Overview
Learning models of partially observable problems is far from solved, and computing policies for partially observable domains is intractable. We propose hierarchical solutions that:
- learn models using less space and time, and
- compute robust policies that cannot be computed by previous approaches.

7 How? Spatial and Time Abstractions Reduce Uncertainty
Spatial abstraction and temporal abstraction. [Figure: corridor map (MIT) illustrating the two kinds of abstraction]

8 Outline
- Sequential decision-making under uncertainty
- A Hierarchical POMDP model for robot navigation
- Heuristic macro-action selection in H-POMDPs
- Near Optimal macro-action selection for arbitrary POMDPs
- Representing H-POMDPs as DBNs
- Current and Future directions

9 A Real System: Robot Navigation
[Figure: part of the transition matrix for the Go-Forward action over corridor states (e.g., S1, S5, S9, S15) and the observation model for S1 over wall/opening signatures (e.g., WWWW, OWOW, OOOO).]
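To make the representation concrete, here is a minimal Python/NumPy sketch of how such a model can be stored; the probabilities are illustrative, not the slide's exact values, and the state and observation names simply mirror the figure's labels.

import numpy as np

# Hypothetical corridor model (illustrative numbers only).
# T[a][s, s'] = P(s' | s, a);  O[s, z] = P(z | s).
states = ["S1", "S5", "S9", "S15"]
observations = ["WWWW", "OWOW", "OOOO"]

T = {
    "go-forward": np.array([
        # S1    S5    S9    S15
        [0.1,  0.8,  0.1,  0.0],   # from S1: mostly advances, sometimes slips
        [0.0,  0.1,  0.8,  0.1],   # from S5
        [0.0,  0.0,  0.2,  0.8],   # from S9
        [0.0,  0.0,  0.0,  1.0],   # from S15: end of corridor, self-loop
    ])
}

O = np.array([
    # WWWW  OWOW  OOOO
    [0.70, 0.15, 0.15],  # S1
    [0.10, 0.80, 0.10],  # S5
    [0.15, 0.15, 0.70],  # S9
    [0.70, 0.20, 0.10],  # S15
])

# Sanity checks: each row is a probability distribution.
assert np.allclose(T["go-forward"].sum(axis=1), 1.0)
assert np.allclose(O.sum(axis=1), 1.0)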

10 Belief States (Probability Distributions over states)
[Figure: the true state and the corresponding belief state as the robot moves]

13 Learning POMDPs
Given the actions (A) and observations (Z), estimate the transition model (T) and observation model (O):
- Estimate the probability distribution over the hidden states
- Count the (expected) number of times each state was visited
- Update T and O, and repeat
This is an Expectation-Maximization (EM) algorithm: an iterative procedure for maximum-likelihood parameter estimation with hidden state variables. It converges to a local maximum.
[Figure: a two-slice DBN with states S1, S2, S3, observations Z1, Z2, Z3, and actions A1, A2, showing the parameters T(S2 = j | S1 = i, A1 = a) and O(O2 = z | S2 = i, A1 = a).]
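Below is a minimal sketch of one such EM (Baum-Welch-style) iteration for a flat POMDP, assuming NumPy arrays as in the earlier example and a simplified observation model P(z | s) (the slide's model also conditions the observation on the previous action); all function and variable names here are illustrative.

import numpy as np

def em_step(T, O, b0, actions, obs):
    """One EM iteration for a POMDP with known actions.
    T: dict action -> (S, S) transition matrix; O: (S, Z) observation matrix;
    b0: (S,) initial belief; actions: a_0..a_{N-2}; obs: observation indices z_0..z_{N-1}."""
    S, Z = O.shape
    N = len(obs)

    # E-step: forward-backward smoothing over the hidden states.
    alpha = np.zeros((N, S))
    alpha[0] = b0 * O[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, N):
        alpha[t] = (alpha[t - 1] @ T[actions[t - 1]]) * O[:, obs[t]]
        alpha[t] /= alpha[t].sum()

    beta = np.ones((N, S))
    for t in range(N - 2, -1, -1):
        beta[t] = T[actions[t]] @ (O[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()

    gamma = alpha * beta                          # smoothed state posteriors
    gamma /= gamma.sum(axis=1, keepdims=True)

    # M-step: expected visit/transition counts replace hard counts.
    T_new = {a: np.zeros((S, S)) for a in T}
    O_new = np.zeros((S, Z))
    for t in range(N - 1):
        a = actions[t]
        xi = alpha[t][:, None] * T[a] * O[:, obs[t + 1]] * beta[t + 1]
        T_new[a] += xi / xi.sum()                 # expected transition counts at time t
    for t in range(N):
        O_new[:, obs[t]] += gamma[t]              # expected observation counts

    for a in T_new:
        row = T_new[a].sum(axis=1, keepdims=True)
        T_new[a] = np.divide(T_new[a], row, out=T[a].copy(), where=row > 0)
    row = O_new.sum(axis=1, keepdims=True)
    O_new = np.divide(O_new, row, out=O.copy(), where=row > 0)
    return T_new, O_new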

14 Planning in POMDPs
Belief states constitute a sufficient statistic for making decisions (the Markov property holds: Åström 1965). Bellman equation over beliefs:
V(b) = \max_a \big[ \rho(b,a) + \gamma \sum_z P(z \mid b, a)\, V(\tau(b,a,z)) \big]
where \rho(b,a) = \sum_s b(s) R(s,a) and \tau(b,a,z) is the updated belief.
[Diagram: the agent consists of a state estimator and a policy π; the environment emits an observation z, the state estimator updates the belief state b, and the policy maps b to an action a.]
Since the resulting belief MDP has an infinite (continuous) state space, the problem is computationally intractable: PSPACE-hard for the finite horizon and undecidable for the infinite horizon.
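A short sketch of the state estimator and of a greedy one-step application of the Bellman equation above, assuming the T/O representation from the earlier examples, a reward table R[a] indexed by state, and V standing in for any approximate value function over beliefs (none of these names come from the talk).

import numpy as np

def belief_update(b, a, z, T, O):
    """State estimator: b'(s') is proportional to O[s', z] * sum_s T[a][s, s'] * b[s]."""
    b_next = O[:, z] * (b @ T[a])
    return b_next / b_next.sum()

def greedy_action(b, T, O, R, V, gamma=0.95):
    """One application of the belief-MDP Bellman backup at belief b."""
    best_a, best_q = None, -np.inf
    for a in T:
        q = float(b @ R[a])                        # expected immediate reward rho(b, a)
        for z in range(O.shape[1]):
            p_z = float((b @ T[a]) @ O[:, z])      # P(z | b, a)
            if p_z > 0:
                q += gamma * p_z * V(belief_update(b, a, z, T, O))
        if q > best_q:
            best_a, best_q = a, q
    return best_a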

15 Our Solution: Spatial and Temporal Abstraction
Learning:
- A hierarchical Baum-Welch algorithm, derived from the Baum-Welch algorithm for training HHMMs (with Rohanimanesh and Mahadevan, ICRA 2001)
- Structure learning from weak priors (with Mahadevan, IROS 2002)
- Inference in linear time by representing H-POMDPs as Dynamic Bayesian Networks (DBNs) (with Murphy and Kaelbling, ICRA 2004)
Planning:
- Heuristic macro-action selection (with Mahadevan, ICRA 2002)
- Near optimal macro-action selection (with Kaelbling, NIPS 2003)
Structure learning and planning combined:
- Dynamic POMDP abstractions (with Mannor and Kaelbling)

16 Outline: A Hierarchical POMDP model for robot navigation
- Sequential decision-making under uncertainty
- A Hierarchical POMDP model for robot navigation
- Heuristic macro-action selection in H-POMDPs
- Near Optimal macro-action selection for arbitrary POMDPs
- Representing H-POMDPs as DBNs
- Current and Future directions

17 Hierarchical POMDPs
[Figure: a corridor modeled with abstract east/west states]
1. Require less data to train than flat approaches
2. Provide better state estimation than flat approaches
3. Produce robust behavior
4. Plan faster

18 Hierarchical POMDPs: abstract states + abstract actions
(Fine, Singer, Tishby, MLJ 98)

19 Experimental Environments
[Figure: two experimental environments, one with 600 states and one with 1200 states]

20 The Robot Navigation Domain
The robot Pavlov in the real MSU environment, and the Nomad 200 simulator.

21 Learning Feature Detectors (Mahadevan, Theocharous, Khaleeli: MLJ 98)
736 hand-labeled grids, 8-fold cross-validation, classification error (m = 7.33, s = 3.7).

22 Learning and Planning in H-POMDPs for Robot Navigation
[Flowchart: a hand-coded topological map is compiled into an initial H-POMDP; EM training in the environment yields a trained H-POMDP, which the navigation system uses for planning and execution.]

23 Outline: Heuristic macro-action selection in H-POMDPs
- Sequential decision-making under uncertainty
- A Hierarchical POMDP model for robot navigation
- Heuristic macro-action selection in H-POMDPs
- Near Optimal macro-action selection for arbitrary POMDPs
- Representing H-POMDPs as DBNs
- Current and Future directions

24 Planning in H-POMDPs (Theocharous, Mahadevan: ICRA 2002)
Abstract actions (macros) are evaluated with hierarchical MDP solutions, using the options framework [Sutton, Precup, Singh, AIJ]; at the belief level, macro-actions are then selected with heuristic POMDP methods such as MLS and executed as sequences of primitive actions (a sketch follows below).
[Figure: an example belief b(s) over corridor states and the option values used to pick a macro-action, e.g., π(b) = go-west when v(go-west) > v(go-east).]
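Below is a minimal sketch of belief-based macro-action selection with the two heuristics that appear in the result slides (MLS and QMDP); option_values is a hypothetical table giving, for each macro-action, its fully observable (options-framework) value in each state.

import numpy as np

def choose_macro_mls(belief, option_values):
    """MLS heuristic: act as if the most likely state were the true state.
    option_values[o] is a length-S array of values for starting macro o in each state."""
    s_ml = int(np.argmax(belief))
    return max(option_values, key=lambda o: option_values[o][s_ml])

def choose_macro_qmdp(belief, option_values):
    """QMDP-style heuristic: weight each macro's per-state value by the belief."""
    return max(option_values, key=lambda o: float(belief @ option_values[o]))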

25 Plan Execution

29 Intuition
- The probability distribution at the higher level evolves more slowly.
- The agent does not have to decide on the best macro-action at every time step.
- Long-term actions help the robot localize.

30 F-MLS Demo

31 H-MLS Demo

32 Hierarchical is More Successful
[Chart: success rate (%) from an unknown initial position, per environment, for the hierarchical approach versus flat MLS and QMDP.]

33 Hierarchical Takes Less Time to Reach Goal
[Chart: average number of steps to the goal from an unknown initial position, per environment, for the hierarchical approach versus flat MLS and QMDP.]

34 Hierarchical Plans are Computed Faster
[Chart: planning time per environment and goal (Goal 1, Goal 2) for each algorithm.]

35 Outline: Near Optimal macro-action selection for arbitrary POMDPs
- Sequential decision-making under uncertainty
- A Hierarchical POMDP model for robot navigation
- Heuristic macro-action selection in H-POMDPs
- Near Optimal macro-action selection for arbitrary POMDPs
- Representing H-POMDPs as DBNs
- Current and Future directions

36 Near Optimal Macro-action Selection (Theocharous, Kaelbling NIPS 2003)
- Agents usually do not require the entire belief space, and macro-actions reduce the reachable belief space even more
- Tested in large-scale robot navigation: only a small part of the belief space is required
- Learns approximate POMDP policies fast
- High success rate and better policies
- Performs information gathering

37 Dynamic Grids
Given a resolution, grid points are sampled dynamically from regular discretizations of the belief simplex by simulating trajectories.
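A rough sketch of this sampling step, under stated assumptions: beliefs are snapped to the regular simplex grid of the chosen resolution, and simulate_macro is a placeholder for a simulator that returns the belief reached after executing one macro-action.

import numpy as np

def snap_to_grid(b, resolution):
    """Snap a belief to the regular simplex discretization whose coordinates
    are multiples of 1/resolution (returned as integer grid coordinates)."""
    scaled = b * resolution
    base = np.floor(scaled).astype(int)
    leftover = resolution - int(base.sum())        # probability mass still to place
    order = np.argsort(scaled - base)[::-1]        # largest fractional parts first
    base[order[:leftover]] += 1
    return tuple(int(x) for x in base)

def sample_dynamic_grid(b0, macros, simulate_macro, resolution,
                        n_trajectories=100, horizon=20, seed=0):
    """Keep only the grid points actually reached by simulated macro trajectories."""
    rng = np.random.default_rng(seed)
    grid = {snap_to_grid(b0, resolution)}
    for _ in range(n_trajectories):
        b = b0
        for _ in range(horizon):
            b = simulate_macro(b, macros[rng.integers(len(macros))])
            grid.add(snap_to_grid(b, resolution))
    return grid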

38 The Algorithm
[Figure: the true trajectory and true belief state; the nearest grid point g to the current belief b; simulation trajectories of macro-action A from g, used to estimate the value at g; the value of the resulting belief b'' is interpolated from its neighboring grid points.]
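A sketch of the selection step the figure describes, under the same assumptions: rollout_macro is a placeholder returning the resulting belief, accumulated reward, and number of primitive steps for one simulated execution of a macro, and interpolate_value looks up a belief's value from its neighboring grid points.

import numpy as np

def macro_value(g, macro, rollout_macro, interpolate_value, n_rollouts=20, gamma=0.95):
    """Monte-Carlo estimate of the value of starting `macro` at the grid belief g."""
    total = 0.0
    for _ in range(n_rollouts):
        b_next, reward, steps = rollout_macro(g, macro)
        total += reward + (gamma ** steps) * interpolate_value(b_next)
    return total / n_rollouts

def choose_macro(b, grid, macros, rollout_macro, interpolate_value, resolution):
    """Evaluate macros at the grid point nearest the current belief b."""
    nearest = min(grid, key=lambda p: np.linalg.norm(np.asarray(p) / resolution - b))
    g = np.asarray(nearest, dtype=float) / resolution
    return max(macros, key=lambda m: macro_value(g, m, rollout_macro, interpolate_value))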

39 Experimental Setup

40 Fewer States Needed

41 Fewer Steps to Goal

42 More Successful

43 Information Gathering

44 Information Gathering (scaling up)

45 Dynamic POMDP Abstractions (Theocharous, Mannor, Kaelbling)
[Figure: a route from start to goal; when the belief entropy exceeds a threshold, localization macros are triggered.]
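A minimal sketch of this idea, with an entropy threshold whose value here is purely illustrative; goal_policy and localization_macro are placeholders for the goal-directed policy and an information-gathering macro.

import numpy as np

def belief_entropy(b):
    """Shannon entropy of the belief; high entropy means the robot is poorly localized."""
    nz = b[b > 0]
    return float(-(nz * np.log(nz)).sum())

def select_macro(b, goal_policy, localization_macro, entropy_threshold=2.0):
    """Run a localization macro when the belief is too diffuse; otherwise
    follow the goal-directed policy (threshold chosen only for illustration)."""
    if belief_entropy(b) > entropy_threshold:
        return localization_macro
    return goal_policy(b)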

46 Fewer Steps to Goal

47 Outline: Representing H-POMDPs as DBNs
- Sequential decision-making under uncertainty
- A Hierarchical POMDP model for robot navigation
- Heuristic macro-action selection in H-POMDPs
- Near Optimal macro-action selection for arbitrary POMDPs
- Representing H-POMDPs as DBNs
- Current and Future directions

48 Dynamic Bayesian Networks
[Figure: a flat state POMDP versus a factored DBN POMDP, comparing the number of parameters each requires.]

49 DBN Inference

50 Representing H-POMDPs as Dynamic Bayesian Networks (Theocharous, Murphy, Kaelbling: ICRA 2004)
[Figure: the state H-POMDP for an east/west corridor and its factored DBN representation.]

55 Complexity of Inference
[Figure: inference complexity compared for the flat state POMDP, the state H-POMDP, the DBN H-POMDP, and the factored DBN H-POMDP.]

56 Hierarchical Localizes better
[Chart: localization performance before and after training for the original model, the factored DBN tied H-POMDP, the factored DBN H-POMDP, the DBN H-POMDP, and the state POMDP.]

57 Hierarchical Fits Data Better
[Chart: data likelihood (model fit) before and after training for the same models.]

58 Directions for Future Research
In the future we will explore:
- Structure learning: Bayesian model selection approaches, methods for learning compositional hierarchies (recurrent nets, hierarchical sparse n-grams), natural language acquisition methods, identifying isomorphic processes
- Online learning
- Interactive learning
- Application to real-world problems

59 Major Contributions
- The H-POMDP model: requires less training data, provides better state estimation, enables fast planning
- Macro-actions in POMDPs: reduce uncertainty, perform information gathering
- Application of the algorithms to large-scale robot navigation: map learning, planning, and execution

