1
Hierarchical POMDP Solutions
Georgios Theocharous
2
Sequential Decision Making Under Uncertainty
Diagram: agent-environment loop. The agent takes actions (tests, treatments) and receives observations and rewards (symptoms) from an environment with hidden states. What is the optimal policy?
3
Manufacturing Processes (Mahadevan, Theocharous FLAIRS 98)
Diagram: machines connected by buffers.
Observations: parts in buffers, throughput
States: machine internal state
Reward: reward for consuming parts; penalties for filling buffers and for machine breakdown
Actions: produce, maintenance
What is the optimal policy?
4
Foveated Active Vision (Minut)
States: objects
Observations: local features
Reward: reward for finding the object
Actions: where to saccade next, which features to use
What is the optimal policy?
5
Many More Partially Observable Problems
Assistive technologies: web searching, preference elicitation
Sophisticated computing: distributed file access, network troubleshooting
Industrial: machine maintenance, manufacturing processes
Social: education, medical diagnosis, health-care policy making
Corporate: marketing, corporate policy
...
6
Overview
Learning models of partially observable problems is far from a solved problem.
Computing policies for partially observable domains is intractable.
We propose hierarchical solutions: learn models using less space and time, and compute robust policies that cannot be computed by previous approaches.
7
How? Spatial and Temporal Abstractions Reduce Uncertainty
Figure: spatial abstraction (MIT map) and temporal abstraction.
8
Outline
Sequential decision-making under uncertainty
A Hierarchical POMDP model for robot navigation
Heuristic macro-action selection in H-POMDPs
Near-optimal macro-action selection for arbitrary POMDPs
Representing H-POMDPs as DBNs
Current and future directions
9
A Real System: Robot Navigation
Figure: the transition matrix for the go-forward action over states S1, S5, S9, S15, ..., and the observation model for state S1 over wall/opening observations (WWWW, ..., OWOW, OOOO).
10
Belief States (Probability Distributions over states)
Figure: the robot's true state and its belief state.
13
Learning POMDPs: given action and observation sequences (As and Zs), estimate the transition and observation models (Ts and Os)
Estimate the probability distribution over hidden states, count the (expected) number of times each state and transition was visited, update T and O, and repeat. This is an Expectation-Maximization (EM) algorithm: an iterative procedure for maximum-likelihood parameter estimation over hidden state variables, which converges to a local maximum. Figure: a dynamic Bayesian network with actions A1, A2, hidden states S1, S2, S3, observations Z1, Z2, Z3, transition parameters T(S1=i, A1=a, S2=j), and observation parameters O(O2=z, S2=i, A1=a). A minimal sketch of one EM iteration follows.
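Below is a minimal NumPy sketch of one EM (Baum-Welch) iteration for a flat POMDP, assuming per-action transition matrices T[a][i, j] = P(s'=j | s=i, a) and observation matrices O[a][j, z] = P(z | s'=j, a), matching the parameterization in the figure. All names are illustrative; the thesis's hierarchical Baum-Welch extends this idea to abstract states.

```python
import numpy as np

def forward_backward(T, O, actions, obs, b0):
    """E-step: smoothed posteriors over hidden states for one action/observation sequence."""
    n, S = len(obs), len(b0)
    alpha = np.zeros((n, S))
    beta = np.ones((n, S))
    b = b0
    for t in range(n):                              # forward (filtering) pass
        a = actions[t]
        alpha[t] = (b @ T[a]) * O[a][:, obs[t]]
        alpha[t] /= alpha[t].sum()
        b = alpha[t]
    for t in range(n - 2, -1, -1):                  # backward pass
        a = actions[t + 1]
        beta[t] = T[a] @ (O[a][:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)       # P(s_t | all data)
    return alpha, beta, gamma

def em_iteration(T, O, actions, obs, b0):
    """M-step: accumulate expected visit counts and renormalize T and O."""
    A, S, Z = len(T), len(b0), O[0].shape[1]
    alpha, beta, gamma = forward_backward(T, O, actions, obs, b0)
    T_cnt = np.full((A, S, S), 1e-9)                # tiny pseudo-counts avoid divide-by-zero
    O_cnt = np.full((A, S, Z), 1e-9)
    prev = b0
    for t in range(len(obs)):
        a = actions[t]
        # Expected transition counts: xi[i, j] ~ P(s_{t-1}=i, s_t=j | all data)
        xi = prev[:, None] * T[a] * (O[a][:, obs[t]] * beta[t])[None, :]
        T_cnt[a] += xi / xi.sum()
        O_cnt[a][:, obs[t]] += gamma[t]             # expected observation counts
        prev = alpha[t]
    return (T_cnt / T_cnt.sum(axis=2, keepdims=True),
            O_cnt / O_cnt.sum(axis=2, keepdims=True))
```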
14
Planning in POMDPs: belief states constitute a sufficient statistic for making decisions (the Markov property holds; Astrom 1965). The Bellman equation over belief states is V(b) = max_a [ R(b, a) + γ Σ_z P(z | b, a) V(b_a^z) ], where b_a^z is the belief reached by taking action a in belief b and observing z. Figure: the agent consists of a state estimator, which maps the previous belief state, action (a), and observation (z) to a new belief state (b), and a policy (π), which maps the belief state to the next action, all interacting with the environment. Since the belief space is a continuous (infinite) state space, the problem becomes computationally intractable: PSPACE-hard for the finite horizon and undecidable for the infinite horizon. A small sketch of the belief update and of this backup appears below.
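As an illustration of the state-estimator box and the Bellman equation above, here is a minimal sketch of the exact belief update and a one-step lookahead backup. The value function V over beliefs is a placeholder (approximating it over the continuous belief simplex is exactly what is hard), and the array conventions are the same illustrative ones used earlier.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """State estimator: b'(s') is proportional to O(z | s', a) * sum_s T(s' | s, a) * b(s)."""
    b_next = O[a][:, z] * (b @ T[a])
    return b_next / b_next.sum()

def bellman_backup(b, T, O, R, V, gamma=0.95):
    """One application of the belief-space Bellman equation:
    V(b) = max_a [ R(b,a) + gamma * sum_z P(z|b,a) * V(b_a^z) ],
    where V is any approximation of the value function over belief states."""
    best = -np.inf
    for a in range(len(T)):
        pred = b @ T[a]                       # P(s' | b, a)
        q = float(b @ R[:, a])                # expected immediate reward R(b, a)
        for z in range(O[a].shape[1]):
            p_z = float(pred @ O[a][:, z])    # P(z | b, a)
            if p_z > 0:
                q += gamma * p_z * V(belief_update(b, a, z, T, O))
        best = max(best, q)
    return best
```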
15
Our Solution: Spatial and Temporal Abstraction
Learning:
A hierarchical Baum-Welch algorithm, derived from the Baum-Welch algorithm for training HHMMs (with Rohanimanesh and Mahadevan, ICRA 2001)
Structure learning from weak priors (with Mahadevan, IROS 2002)
Inference in linear time by representing H-POMDPs as Dynamic Bayesian Networks (DBNs) (with Murphy and Kaelbling, ICRA 2004)
Planning:
Heuristic macro-action selection (with Mahadevan, ICRA 2002)
Near-optimal macro-action selection (with Kaelbling, NIPS 2003)
Structure learning and planning combined:
Dynamic POMDP abstractions (with Mannor and Kaelbling)
16
Outline: A Hierarchical POMDP model for robot navigation
Sequential decision-making under uncertainty
A Hierarchical POMDP model for robot navigation
Heuristic macro-action selection in H-POMDPs
Near-optimal macro-action selection for arbitrary POMDPs
Representing H-POMDPs as DBNs
Current and future directions
17
Hierarchical POMDPs (figure: east/west corridor abstraction)
Require less data to train than flat approaches
Provide better state estimation than flat approaches
Produce robust behavior
Allow faster planning
18
Hierarchical POMDPs: abstract states + abstract actions
(Fine, Singer, Tishby, MLJ 98)
19
Experimental Environments
Two environments: one with 600 states and one with 1200 states.
20
The Robot Navigation Domain
The robot Pavlov in the real MSU environment; the Nomad 200 simulator.
21
Learning Feature Detectors (Mahadevan, Theocharous, Khaleeli: MLJ 98)
736 hand-labeled grids; 8-fold cross-validation; classification error (mean 7.33, s.d. 3.7).
22
Learning and Planning in H-POMDPs for Robot Navigation
Flowchart: a hand-coded topological map of the environment is compiled into an initial H-POMDP, which is trained with EM in the environment; the trained H-POMDP is then used for planning and plan execution in the navigation system.
23
Outline: Heuristic macro-action selection in H-POMDPs
Sequential decision-making under uncertainty
A Hierarchical POMDP model for robot navigation
Heuristic macro-action selection in H-POMDPs
Near-optimal macro-action selection for arbitrary POMDPs
Representing H-POMDPs as DBNs
Current and future directions
24
Planning in H-POMDPs (Theocharous, Mahadevan: ICRA 2002)
Abstract actions: hierarchical MDP solutions (using the options framework [Sutton, Precup, Singh, AIJ]), combined with heuristic POMDP solutions such as MLS.
Primitive actions.
Figure: a worked example in which the belief b(s) (0.35, 0.3, 0.2, 0.1, 0.05) and the per-state values v(go-west), v(go-east) are combined to select the macro-action π(b) = go-west. A sketch of the MLS and QMDP heuristics appears below.
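A small sketch of the two standard heuristics named on the slide, applied at the level of abstract states and macro-actions. Here Q would come from solving the underlying (hierarchical) MDP with the options framework; all names are illustrative rather than the thesis's actual code.

```python
import numpy as np

def mls_choice(b, mdp_policy):
    """MLS heuristic: act as if the most likely state under the belief were the true state."""
    return mdp_policy[int(np.argmax(b))]

def qmdp_choice(b, Q):
    """QMDP heuristic: pick the (macro-)action whose MDP value, weighted by the belief,
    is highest: argmax_a sum_s b(s) * Q(s, a)."""
    return int(np.argmax(b @ Q))
```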
25
Plan Execution
29
Intuition
The probability distribution at the higher level evolves more slowly.
The agent does not have to decide on the best macro-action at every time step.
Long-term actions result in robot localization.
30
F-MLS Demo
31
H-MLS Demo
32
Hierarchical is More Successful
Chart: success rate (%) with unknown initial position, by environment and algorithm (flat vs. hierarchical MLS and QMDP).
33
Hierarchical Takes Less Time to Reach Goal
Chart: average number of steps to the goal with unknown initial position, by environment and algorithm (flat vs. hierarchical MLS and QMDP).
34
Hierarchical Plans are Computed Faster
Chart: planning time by environment, algorithm, and goal (Goal 1, Goal 2).
35
Outline: Near-optimal macro-action selection for arbitrary POMDPs
Sequential decision-making under uncertainty
A Hierarchical POMDP model for robot navigation
Heuristic macro-action selection in H-POMDPs
Near-optimal macro-action selection for arbitrary POMDPs
Representing H-POMDPs as DBNs
Current and future directions
36
Near Optimal Macro-action Selection (Theocharous, Kaelbling NIPS 2003)
Usually agents don't require the entire belief space; macro-actions can reduce the belief space even more.
Tested in large-scale robot navigation:
Only a small part of the belief space is required
Approximate POMDP policies are learned fast
High success rate
Better policies
Does information gathering
37
Dynamic Grids: given a resolution, grid points are sampled dynamically from regular discretizations of the belief space by simulating trajectories (a sketch follows).
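A rough sketch of the dynamic-grid idea, under the assumption that beliefs are snapped to a regular resolution-r grid on the simplex and that only grid points actually visited by simulated trajectories are kept. The simulator object and its methods are hypothetical placeholders, and the rounding shown is a simple largest-remainder scheme rather than the exact discretization used in the thesis.

```python
import numpy as np

def snap_to_grid(b, r):
    """Round a belief to a nearby grid point whose coordinates are multiples of 1/r
    (largest-remainder rounding keeps the integer coordinates summing to r)."""
    scaled = b * r
    base = np.floor(scaled).astype(int)
    leftover = r - int(base.sum())
    order = np.argsort(scaled - base)[::-1]      # largest fractional parts first
    base[order[:leftover]] += 1
    return tuple(base)

def sample_grid_points(sim, b0, r, n_steps, rng):
    """Grow the grid dynamically: record only the grid points a simulated trajectory visits.
    `sim.n_actions` and `sim.belief_step(b, a)` are hypothetical simulator hooks."""
    grid = {snap_to_grid(b0, r)}
    b = b0
    for _ in range(n_steps):
        a = int(rng.integers(sim.n_actions))     # e.g. random or heuristic exploration
        b = sim.belief_step(b, a)                # exact belief update inside the simulator
        grid.add(snap_to_grid(b, r))
    return grid
```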
38
The Algorithm
From the true belief state b (along the true trajectory), find the nearest grid point g. Estimate the value of each macro-action A at g by simulating trajectories of that macro from g; the value of the resulting belief b'' is interpolated from its neighboring grid points. A sketch of this macro-action value estimation appears below.
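A hedged sketch of the value-estimation step just described: macro-action values at a grid point are estimated by Monte Carlo rollouts, and the value of the resulting belief is interpolated from its grid neighbors. `simulate_macro`, `neighbors`, and `nearest_grid_point` are hypothetical placeholders standing in for the simulator, the grid interpolation scheme, and the grid lookup.

```python
def macro_q_value(g, macro, simulate_macro, V, neighbors, n_rollouts=20, gamma=0.95):
    """Estimate Q(g, macro) at grid point g by simulating the macro-action.
    Each rollout returns (discounted_reward, n_steps, b_next); the value of the
    resulting belief b'' is interpolated from neighboring grid points p with weights w."""
    total = 0.0
    for _ in range(n_rollouts):
        reward, steps, b_next = simulate_macro(g, macro)
        v_next = sum(w * V[p] for p, w in neighbors(b_next))
        total += reward + (gamma ** steps) * v_next
    return total / n_rollouts

def act_from_belief(b, macros, nearest_grid_point, q_value):
    """Execution: snap the true belief b to its nearest grid point and run the best macro."""
    g = nearest_grid_point(b)
    return max(macros, key=lambda m: q_value(g, m))
```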
39
Experimental Setup
40
Fewer States
41
Fewer Steps to Goal
42
More Successful
43
Information Gathering
44
Information Gathering (scaling up)
45
Dynamic POMDP Abstractions (Theocharous, Mannor, Kaelbling)
Figure: entropy thresholds trigger localization macros along a path from start to goal. A sketch of the entropy-threshold rule follows.
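A small sketch of the entropy-threshold rule suggested by the figure: when the belief becomes too uncertain, the agent switches to a localization macro before continuing toward the goal. The threshold value and the two policies are placeholders, not the thesis's actual settings.

```python
import numpy as np

def belief_entropy(b):
    """Shannon entropy of the belief; high entropy means the robot is poorly localized."""
    p = b[b > 0]
    return float(-np.sum(p * np.log(p)))

def choose_macro(b, goal_policy, localization_macro, entropy_threshold=1.5):
    """Run a localization macro whenever belief entropy crosses the threshold,
    otherwise follow the goal-directed policy."""
    if belief_entropy(b) > entropy_threshold:
        return localization_macro(b)
    return goal_policy(b)
```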
46
Fewer Steps to Goal
47
Outline: Representing H-POMDPs as DBNs
Sequential decision-making under uncertainty
A Hierarchical POMDP model for robot navigation
Heuristic macro-action selection in H-POMDPs
Near-optimal macro-action selection for arbitrary POMDPs
Representing H-POMDPs as DBNs
Current and future directions
48
Dynamic Bayesian Networks
Diagram: a flat state POMDP and a factored DBN POMDP, comparing the number of parameters required (a worked count appears below).
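To make the parameter comparison on the slide concrete, here is a small worked count with purely illustrative numbers (not the thesis's): a flat transition model needs one |S| x |S| matrix per action, while a factored DBN needs only a conditional table per state variable over that variable's parents.

```python
def flat_transition_params(n_states, n_actions):
    """Flat POMDP: |A| transition matrices of size |S| x |S|."""
    return n_actions * n_states * n_states

def factored_transition_params(var_sizes, parent_sizes, n_actions):
    """Factored DBN: per action, each state variable i has a CPT of size
    (product of its parents' arities) x (its own arity)."""
    return n_actions * sum(v * p for v, p in zip(var_sizes, parent_sizes))

# 1000 states factored into three 10-valued variables, each with parents of total arity <= 100:
print(flat_transition_params(1000, 4))                               # 4000000
print(factored_transition_params([10, 10, 10], [10, 100, 100], 4))   # 8400
```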
49
DBN Inference
50
Representing H-POMDPs as Dynamic Bayesian Networks (Theocharous, Murphy, Kaelbling: ICRA 2004)
Figure: a state-based H-POMDP for an east/west corridor and the equivalent factored DBN H-POMDP.
55
Complexity of Inference
Chart: complexity of inference for the state POMDP, DBN H-POMDP, state H-POMDP, and factored DBN H-POMDP.
56
Hierarchical Localizes better
Chart: localization accuracy before and after training for the original model, factored DBN tied H-POMDP, factored DBN H-POMDP, DBN H-POMDP, and state POMDP.
57
Hierarchical Fits Data Better
Chart: fit to the data before and after training for the original model, factored DBN tied H-POMDP, factored DBN H-POMDP, DBN H-POMDP, and state POMDP.
58
Directions for Future Research
In the future we will explore structure learning:
Bayesian model selection approaches
Methods for learning compositional hierarchies (recurrent nets, hierarchical sparse n-grams)
Natural language acquisition methods
Identifying isomorphic processes
On-line learning
Interactive learning
Application to real-world problems
59
Major Contributions
The H-POMDP model:
Requires less training data
Provides better state estimation
Fast planning
Macro-actions in POMDPs:
Reduce uncertainty
Information gathering
Application of the algorithms to large-scale robot navigation:
Map learning
Planning and execution