1
Hierarchical POMDP Solutions
Georgios Theocharous
2
Sequential Decision Making Under Uncertainty
Diagram: agent-environment loop. The agent takes actions (tests, treatments) and receives observations and rewards (symptoms) from an environment with hidden states. What is the optimal policy?
3
Manufacturing Processes (Mahadevan, Theocharous FLAIRS 98)
Diagram: machines connected by buffers.
Observations: parts in buffers, throughput
States: machine internal state
Reward: reward for consuming parts; penalties for filling buffers and for machine breakdown
Actions: produce, maintenance
What is the optimal policy?
4
Foveated Active Vision (Minut)
States: objects
Observations: local features
Reward: reward for finding the object
Actions: where to saccade next, which features to use
What is the optimal policy?
5
Many More Partially Observable Problems
Assistive technologies: web searching, preference elicitation
Sophisticated computing: distributed file access, network troubleshooting
Industrial: machine maintenance, manufacturing processes
Social: education, medical diagnosis, health-care policy making
Corporate: marketing, corporate policy
...
6
Overview
Learning models of partially observable problems is far from a solved problem.
Computing policies for partially observable domains is intractable.
We propose hierarchical solutions: learn models using less space and time, and compute robust policies that cannot be computed by previous approaches.
7
How? Spatial and Temporal Abstractions Reduce Uncertainty
Figure: spatial abstraction (MIT map) and temporal abstraction.
8
Outline
Sequential decision-making under uncertainty
A Hierarchical POMDP model for robot navigation
Heuristic macro-action selection in H-POMDPs
Near-optimal macro-action selection for arbitrary POMDPs
Representing H-POMDPs as DBNs
Current and future directions
9
A Real System: Robot Navigation
Figure: the transition matrix for the go-forward action over states S1, S5, S9, S15, ..., and the observation model for state S1 over wall/opening observations (WWWW, ..., OWOW, OOOO).
10
Belief States (Probability Distributions over states)
Figure: the robot's true state and its belief state.
13
Learning POMDPs: given action and observation sequences (As and Zs), estimate the transition and observation models (Ts and Os)
Estimate the probability distribution over hidden states, count the (expected) number of times each state and transition was visited, update T and O, and repeat. This is an Expectation-Maximization (EM) algorithm: an iterative procedure for maximum-likelihood parameter estimation over hidden state variables, which converges to a local maximum. Figure: a dynamic Bayesian network with actions A1, A2, hidden states S1, S2, S3, observations Z1, Z2, Z3, transition parameters T(S1=i, A1=a, S2=j), and observation parameters O(O2=z, S2=i, A1=a). A minimal sketch of one EM iteration follows.
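Below is a minimal NumPy sketch of one EM (Baum-Welch) iteration for a flat POMDP, assuming per-action transition matrices T[a][i, j] = P(s'=j | s=i, a) and observation matrices O[a][j, z] = P(z | s'=j, a), matching the parameterization in the figure. All names are illustrative; the thesis's hierarchical Baum-Welch extends this idea to abstract states.

```python
import numpy as np

def forward_backward(T, O, actions, obs, b0):
    """E-step: smoothed posteriors over hidden states for one action/observation sequence."""
    n, S = len(obs), len(b0)
    alpha = np.zeros((n, S))
    beta = np.ones((n, S))
    b = b0
    for t in range(n):                              # forward (filtering) pass
        a = actions[t]
        alpha[t] = (b @ T[a]) * O[a][:, obs[t]]
        alpha[t] /= alpha[t].sum()
        b = alpha[t]
    for t in range(n - 2, -1, -1):                  # backward pass
        a = actions[t + 1]
        beta[t] = T[a] @ (O[a][:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)       # P(s_t | all data)
    return alpha, beta, gamma

def em_iteration(T, O, actions, obs, b0):
    """M-step: accumulate expected visit counts and renormalize T and O."""
    A, S, Z = len(T), len(b0), O[0].shape[1]
    alpha, beta, gamma = forward_backward(T, O, actions, obs, b0)
    T_cnt = np.full((A, S, S), 1e-9)                # tiny pseudo-counts avoid divide-by-zero
    O_cnt = np.full((A, S, Z), 1e-9)
    prev = b0
    for t in range(len(obs)):
        a = actions[t]
        # Expected transition counts: xi[i, j] ~ P(s_{t-1}=i, s_t=j | all data)
        xi = prev[:, None] * T[a] * (O[a][:, obs[t]] * beta[t])[None, :]
        T_cnt[a] += xi / xi.sum()
        O_cnt[a][:, obs[t]] += gamma[t]             # expected observation counts
        prev = alpha[t]
    return (T_cnt / T_cnt.sum(axis=2, keepdims=True),
            O_cnt / O_cnt.sum(axis=2, keepdims=True))
```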
14
Planning in POMDPs: belief states constitute a sufficient statistic for making decisions (the Markov property holds; Astrom 1965). The Bellman equation over belief states is V(b) = max_a [ R(b, a) + γ Σ_z P(z | b, a) V(b_a^z) ], where b_a^z is the belief reached by taking action a in belief b and observing z. Figure: the agent consists of a state estimator, which maps the previous belief state, action (a), and observation (z) to a new belief state (b), and a policy (π), which maps the belief state to the next action, all interacting with the environment. Since the belief space is a continuous (infinite) state space, the problem becomes computationally intractable: PSPACE-hard for the finite horizon and undecidable for the infinite horizon. A small sketch of the belief update and of this backup appears below.
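As an illustration of the state-estimator box and the Bellman equation above, here is a minimal sketch of the exact belief update and a one-step lookahead backup. The value function V over beliefs is a placeholder (approximating it over the continuous belief simplex is exactly what is hard), and the array conventions are the same illustrative ones used earlier.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """State estimator: b'(s') is proportional to O(z | s', a) * sum_s T(s' | s, a) * b(s)."""
    b_next = O[a][:, z] * (b @ T[a])
    return b_next / b_next.sum()

def bellman_backup(b, T, O, R, V, gamma=0.95):
    """One application of the belief-space Bellman equation:
    V(b) = max_a [ R(b,a) + gamma * sum_z P(z|b,a) * V(b_a^z) ],
    where V is any approximation of the value function over belief states."""
    best = -np.inf
    for a in range(len(T)):
        pred = b @ T[a]                       # P(s' | b, a)
        q = float(b @ R[:, a])                # expected immediate reward R(b, a)
        for z in range(O[a].shape[1]):
            p_z = float(pred @ O[a][:, z])    # P(z | b, a)
            if p_z > 0:
                q += gamma * p_z * V(belief_update(b, a, z, T, O))
        best = max(best, q)
    return best
```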
15
Our Solution: Spatial and Temporal Abstraction
Learning:
A hierarchical Baum-Welch algorithm, derived from the Baum-Welch algorithm for training HHMMs (with Rohanimanesh and Mahadevan, ICRA 2001)
Structure learning from weak priors (with Mahadevan, IROS 2002)
Inference in linear time by representing H-POMDPs as Dynamic Bayesian Networks (DBNs) (with Murphy and Kaelbling, ICRA 2004)
Planning:
Heuristic macro-action selection (with Mahadevan, ICRA 2002)
Near-optimal macro-action selection (with Kaelbling, NIPS 2003)
Structure learning and planning combined:
Dynamic POMDP abstractions (with Mannor and Kaelbling)
16
Outline: A Hierarchical POMDP model for robot navigation
Sequential decision-making under uncertainty
A Hierarchical POMDP model for robot navigation
Heuristic macro-action selection in H-POMDPs
Near-optimal macro-action selection for arbitrary POMDPs
Representing H-POMDPs as DBNs
Current and future directions
17
Hierarchical POMDPs (figure: east/west corridor abstraction)
Require less data to train than flat approaches
Provide better state estimation than flat approaches
Produce robust behavior
Allow faster planning
18
Hierarchical POMDPs: abstract states + abstract actions
(Fine, Singer, Tishby, MLJ 98)
19
Experimental Environments
Two environments: one with 600 states and one with 1200 states.
20
The Robot Navigation Domain
The robot Pavlov in the real MSU environment; the Nomad 200 simulator.
21
Learning Feature Detectors (Mahadevan, Theocharous, Khaleeli: MLJ 98)
736 hand-labeled grids; 8-fold cross-validation; classification error (mean 7.33, s.d. 3.7).
22
Learning and Planning in H-POMDPs for Robot Navigation
Flowchart: a hand-coded topological map of the environment is compiled into an initial H-POMDP, which is trained with EM in the environment; the trained H-POMDP is then used for planning and plan execution in the navigation system.
23
Outline: Heuristic macro-action selection in H-POMDPs
Sequential decision-making under uncertainty
A Hierarchical POMDP model for robot navigation
Heuristic macro-action selection in H-POMDPs
Near-optimal macro-action selection for arbitrary POMDPs
Representing H-POMDPs as DBNs
Current and future directions
24
Planning in H-POMDPs (Theocharous, Mahadevan: ICRA 2002)
Abstract actions: hierarchical MDP solutions (using the options framework [Sutton, Precup, Singh, AIJ]), combined with heuristic POMDP solutions such as MLS.
Primitive actions.
Figure: a worked example in which the belief b(s) (0.35, 0.3, 0.2, 0.1, 0.05) and the per-state values v(go-west), v(go-east) are combined to select the macro-action π(b) = go-west. A sketch of the MLS and QMDP heuristics appears below.
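A small sketch of the two standard heuristics named on the slide, applied at the level of abstract states and macro-actions. Here Q would come from solving the underlying (hierarchical) MDP with the options framework; all names are illustrative rather than the thesis's actual code.

```python
import numpy as np

def mls_choice(b, mdp_policy):
    """MLS heuristic: act as if the most likely state under the belief were the true state."""
    return mdp_policy[int(np.argmax(b))]

def qmdp_choice(b, Q):
    """QMDP heuristic: pick the (macro-)action whose MDP value, weighted by the belief,
    is highest: argmax_a sum_s b(s) * Q(s, a)."""
    return int(np.argmax(b @ Q))
```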
25
Plan Execution
29
Intuition
The probability distribution at the higher level evolves more slowly.
The agent does not have to decide on the best macro-action at every time step.
Long-term actions result in robot localization.
30
F-MLS Demo
31
H-MLS Demo
32
Hierarchical is More Successful
Chart: success rate (%) with unknown initial position, by environment and algorithm (flat vs. hierarchical MLS and QMDP).
33
Hierarchical Takes Less Time to Reach Goal
Chart: average number of steps to the goal with unknown initial position, by environment and algorithm (flat vs. hierarchical MLS and QMDP).
34
Hierarchical Plans are Computed Faster
Chart: planning time by environment, algorithm, and goal (Goal 1, Goal 2).
35
Outline: Near-optimal macro-action selection for arbitrary POMDPs
Sequential decision-making under uncertainty
A Hierarchical POMDP model for robot navigation
Heuristic macro-action selection in H-POMDPs
Near-optimal macro-action selection for arbitrary POMDPs
Representing H-POMDPs as DBNs
Current and future directions
36
Near Optimal Macro-action Selection (Theocharous, Kaelbling NIPS 2003)
Usually agents don't require the entire belief space; macro-actions can reduce the belief space even more.
Tested in large-scale robot navigation:
Only a small part of the belief space is required
Approximate POMDP policies are learned fast
High success rate
Better policies
Does information gathering
37
Dynamic Grids: given a resolution, grid points are sampled dynamically from regular discretizations of the belief space by simulating trajectories (a sketch follows).
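A rough sketch of the dynamic-grid idea, under the assumption that beliefs are snapped to a regular resolution-r grid on the simplex and that only grid points actually visited by simulated trajectories are kept. The simulator object and its methods are hypothetical placeholders, and the rounding shown is a simple largest-remainder scheme rather than the exact discretization used in the thesis.

```python
import numpy as np

def snap_to_grid(b, r):
    """Round a belief to a nearby grid point whose coordinates are multiples of 1/r
    (largest-remainder rounding keeps the integer coordinates summing to r)."""
    scaled = b * r
    base = np.floor(scaled).astype(int)
    leftover = r - int(base.sum())
    order = np.argsort(scaled - base)[::-1]      # largest fractional parts first
    base[order[:leftover]] += 1
    return tuple(base)

def sample_grid_points(sim, b0, r, n_steps, rng):
    """Grow the grid dynamically: record only the grid points a simulated trajectory visits.
    `sim.n_actions` and `sim.belief_step(b, a)` are hypothetical simulator hooks."""
    grid = {snap_to_grid(b0, r)}
    b = b0
    for _ in range(n_steps):
        a = int(rng.integers(sim.n_actions))     # e.g. random or heuristic exploration
        b = sim.belief_step(b, a)                # exact belief update inside the simulator
        grid.add(snap_to_grid(b, r))
    return grid
```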
38
The Algorithm
From the true belief state b (along the true trajectory), find the nearest grid point g. Estimate the value of each macro-action A at g by simulating trajectories of that macro from g; the value of the resulting belief b'' is interpolated from its neighboring grid points. A sketch of this macro-action value estimation appears below.
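A hedged sketch of the value-estimation step just described: macro-action values at a grid point are estimated by Monte Carlo rollouts, and the value of the resulting belief is interpolated from its grid neighbors. `simulate_macro`, `neighbors`, and `nearest_grid_point` are hypothetical placeholders standing in for the simulator, the grid interpolation scheme, and the grid lookup.

```python
def macro_q_value(g, macro, simulate_macro, V, neighbors, n_rollouts=20, gamma=0.95):
    """Estimate Q(g, macro) at grid point g by simulating the macro-action.
    Each rollout returns (discounted_reward, n_steps, b_next); the value of the
    resulting belief b'' is interpolated from neighboring grid points p with weights w."""
    total = 0.0
    for _ in range(n_rollouts):
        reward, steps, b_next = simulate_macro(g, macro)
        v_next = sum(w * V[p] for p, w in neighbors(b_next))
        total += reward + (gamma ** steps) * v_next
    return total / n_rollouts

def act_from_belief(b, macros, nearest_grid_point, q_value):
    """Execution: snap the true belief b to its nearest grid point and run the best macro."""
    g = nearest_grid_point(b)
    return max(macros, key=lambda m: q_value(g, m))
```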
39
Experimental Setup
40
Fewer States
41
Fewer Steps to Goal
42
More Successful
43
Information Gathering
44
Information Gathering (scaling up)
45
Dynamic POMDP Abstractions (Theocharous, Mannor, Kaelbling)
Figure: entropy thresholds trigger localization macros along a path from start to goal. A sketch of the entropy-threshold rule follows.
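A small sketch of the entropy-threshold rule suggested by the figure: when the belief becomes too uncertain, the agent switches to a localization macro before continuing toward the goal. The threshold value and the two policies are placeholders, not the thesis's actual settings.

```python
import numpy as np

def belief_entropy(b):
    """Shannon entropy of the belief; high entropy means the robot is poorly localized."""
    p = b[b > 0]
    return float(-np.sum(p * np.log(p)))

def choose_macro(b, goal_policy, localization_macro, entropy_threshold=1.5):
    """Run a localization macro whenever belief entropy crosses the threshold,
    otherwise follow the goal-directed policy."""
    if belief_entropy(b) > entropy_threshold:
        return localization_macro(b)
    return goal_policy(b)
```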
46
Fewer Steps to Goal
47
Outline: Representing H-POMDPs as DBNs
Sequential decision-making under uncertainty
A Hierarchical POMDP model for robot navigation
Heuristic macro-action selection in H-POMDPs
Near-optimal macro-action selection for arbitrary POMDPs
Representing H-POMDPs as DBNs
Current and future directions
48
Dynamic Bayesian Networks
Diagram: a flat state POMDP and a factored DBN POMDP, comparing the number of parameters required (a worked count appears below).
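To make the parameter comparison on the slide concrete, here is a small worked count with purely illustrative numbers (not the thesis's): a flat transition model needs one |S| x |S| matrix per action, while a factored DBN needs only a conditional table per state variable over that variable's parents.

```python
def flat_transition_params(n_states, n_actions):
    """Flat POMDP: |A| transition matrices of size |S| x |S|."""
    return n_actions * n_states * n_states

def factored_transition_params(var_sizes, parent_sizes, n_actions):
    """Factored DBN: per action, each state variable i has a CPT of size
    (product of its parents' arities) x (its own arity)."""
    return n_actions * sum(v * p for v, p in zip(var_sizes, parent_sizes))

# 1000 states factored into three 10-valued variables, each with parents of total arity <= 100:
print(flat_transition_params(1000, 4))                               # 4000000
print(factored_transition_params([10, 10, 10], [10, 100, 100], 4))   # 8400
```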
49
DBN Inference
50
Representing H-POMDPs as Dynamic Bayesian Networks (Theocharous, Murphy, Kaelbling: ICRA 2004)
Figure: a state-based H-POMDP for an east/west corridor and the equivalent factored DBN H-POMDP.
55
Complexity of Inference
Chart: complexity of inference for the state POMDP, DBN H-POMDP, state H-POMDP, and factored DBN H-POMDP.
56
Hierarchical Localizes better
Chart: localization accuracy before and after training for the original model, factored DBN tied H-POMDP, factored DBN H-POMDP, DBN H-POMDP, and state POMDP.
57
Hierarchical Fits Data Better
Chart: fit to the data before and after training for the original model, factored DBN tied H-POMDP, factored DBN H-POMDP, DBN H-POMDP, and state POMDP.
58
Directions for Future Research
In the future we will explore structure learning:
Bayesian model selection approaches
Methods for learning compositional hierarchies (recurrent nets, hierarchical sparse n-grams)
Natural language acquisition methods
Identifying isomorphic processes
On-line learning
Interactive learning
Application to real-world problems
59
Major Contributions
The H-POMDP model:
Requires less training data
Provides better state estimation
Fast planning
Macro-actions in POMDPs:
Reduce uncertainty
Information gathering
Application of the algorithms to large-scale robot navigation:
Map learning
Planning and execution