A Hybridized Planner for Stochastic Domains Mausam and Daniel S. Weld University of Washington, Seattle Piergiorgio Bertoli ITC-IRST, Trento
Planning under Uncertainty (ICAPS'03 Workshop)
Qualitative (disjunctive) uncertainty: which real problem can you solve?
Quantitative (probabilistic) uncertainty: which real problem can you model?
The Quantitative View: Markov Decision Processes
models uncertainty with probabilistic outcomes
a general decision-theoretic framework
but algorithms are slow
Do we need the full power of decision theory?
Is an unconverged partial policy any good?
The Qualitative View: Conditional Planning
models uncertainty as a logical disjunction of outcomes
exploits classical planning techniques: FAST
but ignores probabilities: poor solutions
How bad are purely qualitative solutions?
Can we improve the qualitative policies?
HybPlan: A Hybridized Planner
combines probabilistic + disjunctive planners
produces good solutions in intermediate times
anytime: makes effective use of resources
bounds termination with a quality guarantee
From the quantitative view: completes a partial probabilistic policy by using qualitative policies in some states
From the qualitative view: improves qualitative policies in the more important regions
Outline Motivation Planning with Probabilistic Uncertainty (RTDP) Planning with Disjunctive Uncertainty (MBP) Hybridizing RTDP and MBP (HybPlan) Experiments Conclusions and Future Work
Markov Decision Process
S: a set of states; A: a set of actions; Pr: probabilistic transition model; C: cost model; s0: start state; G: a set of goals
Find a policy π: S → A that minimizes the expected cost to reach a goal, over an indefinite horizon, in a fully observable Markov decision process.
Optimal cost function J* ⇒ optimal policy
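As a concrete reading of the tuple ⟨S, A, Pr, C, s0, G⟩, here is a minimal Python sketch; the toy two-state chain, the names, and the unit cost model are illustrative assumptions, not the paper's domain:

```python
# MDP tuple <S, A, Pr, C, s0, G> as plain Python data; the Q-value
# Q(s,a) = C(s,a) + sum_s' Pr(s'|s,a) * J(s') is the expected cost of
# doing a in s and then following the cost function J.

S = ["s0", "s1", "g"]
A = ["move", "stay"]
Pr = {  # Pr[(s, a)] -> {successor: probability}
    ("s0", "move"): {"s1": 0.8, "s0": 0.2},
    ("s0", "stay"): {"s0": 1.0},
    ("s1", "move"): {"g": 0.8, "s0": 0.2},
    ("s1", "stay"): {"s1": 1.0},
}
C = lambda s, a: 1.0        # unit cost model (an assumption)
s0, G = "s0", {"g"}         # start state and goal set

def q_value(J, s, a):
    """Expected cost of doing a in s under cost function J."""
    return C(s, a) + sum(p * J[s2] for s2, p in Pr[(s, a)].items())

J = {s: 0.0 for s in S}     # admissible (all-zero) initial cost function
print(q_value(J, "s0", "move"))  # -> 1.0, since J is all zeros
```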
Example (grid figure): start state s0 and a goal; annotations mark a longer path, a wrong direction from which the goal is still reachable, and a region where all states are dead-ends.
Goal Optimal State Costs
Goal Optimal Policy
Bellman Backup: create a better approximation to the cost of state s
Trial = simulate the greedy policy and update visited states
Real Time Dynamic Programming (Barto et al. '95; Bonet & Geffner '03): repeat trials until the cost function converges
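The trial loop can be sketched in Python. This is a hedged illustration of RTDP, not the authors' code; the two-state chain MDP, unit action costs, and trial count below are invented for the example:

```python
import random

random.seed(0)  # for reproducibility of the simulated trials

# Toy MDP (an invented example): actions cost 1, "go" from s0 succeeds
# with probability 0.5, and "g" is the absorbing goal.
Pr = {("s0", "go"): {"s1": 0.5, "s0": 0.5},
      ("s0", "wait"): {"s0": 1.0},
      ("s1", "go"): {"g": 1.0},
      ("s1", "wait"): {"s1": 1.0}}
actions = {"s0": ["go", "wait"], "s1": ["go", "wait"]}
goal = {"g"}

def backup(J, s):
    """Bellman backup: set J(s) to the best one-step lookahead value."""
    qs = {a: 1.0 + sum(p * J.get(s2, 0.0) for s2, p in Pr[(s, a)].items())
          for a in actions[s]}
    best = min(qs, key=qs.get)
    J[s] = qs[best]
    return best

def trial(J, s0, max_steps=50):
    """Simulate the greedy policy from s0, updating every visited state."""
    s = s0
    for _ in range(max_steps):
        if s in goal:
            break
        a = backup(J, s)                  # update, then act greedily
        succ = Pr[(s, a)]
        s = random.choices(list(succ), weights=list(succ.values()))[0]

J = {}                                    # admissible (zero) initial costs
for _ in range(200):                      # repeat trials until J converges
    trial(J, "s0")
print(round(J["s0"], 2), round(J["s1"], 2))  # -> 3.0 1.0
```

Note that J is only updated along simulated greedy trajectories, which is why RTDP can leave rarely visited states with unconverged values; HybPlan exploits exactly this gap.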
Planning with Disjunctive Uncertainty
S: a set of states; A: a set of actions; T: disjunctive transition model; s0: the start state; G: a set of goals
Find a strong-cyclic policy π: S → A that guarantees reaching a goal, over an indefinite horizon, in a fully observable planning problem
Model Based Planner (Bertoli et al.)
States, transitions, etc. are represented logically
Uncertainty = multiple possible successor states
Planning algorithm: iteratively removes "bad" states, where bad = states that reach nowhere or reach only other bad states
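The pruning idea can be sketched as a fixed point. This is a loose, explicit-state illustration only: the real MBP works symbolically, and the tiny domain below is invented. A state survives if some action keeps all its possible outcomes inside the surviving set and the goal stays reachable from it:

```python
def reachable(good, succ, goal):
    """States in `good` that can reach a goal via actions whose possible
    outcomes all stay inside good | goal."""
    ok = set(goal)
    changed = True
    while changed:
        changed = False
        for s in good:
            if s in ok:
                continue
            for (s1, a), outs in succ.items():
                if s1 == s and set(outs) <= good | goal and any(o in ok for o in outs):
                    ok.add(s)
                    changed = True
                    break
    return ok

# Invented example: succ[(s, a)] lists the possible (disjunctive) outcomes.
succ = {("s0", "a"): ["s1", "s0"],   # may loop back, but goal stays reachable
        ("s1", "b"): ["g"],
        ("s2", "c"): ["s2"]}         # s2 only loops forever: bad
goal = {"g"}

good = {"s0", "s1", "s2"}
while True:                          # iteratively remove bad states
    kept = good & reachable(good, succ, goal)
    if kept == good:
        break
    good = kept
print(sorted(good))  # -> ['s0', 's1']
```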
Goal MBP Policy Sub-optimal solution
HybPlan Top-Level Code
0. run MBP to find a solution π_mbp to the goal
1. run RTDP for some time
2. compute the partial greedy policy π_rtdp
3. compute the hybridized policy π_hyb:
   π_hyb(s) = π_rtdp(s) if visited(s) > threshold
   π_hyb(s) = π_mbp(s) otherwise
4. clean π_hyb by removing (a) dead-ends and (b) probability-1 cycles
5. evaluate π_hyb
6. save the best policy obtained so far
Repeat steps 1-6 until (1) resources are exhausted or (2) a satisfactory policy is found
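Step 3 can be sketched as below, a minimal illustration assuming policies and visit counts are plain dicts; the state names, counts, and threshold are invented:

```python
def hybridize(pi_rtdp, pi_mbp, visited, threshold):
    """pi_hyb(s) = pi_rtdp(s) if visited(s) > threshold, else pi_mbp(s)."""
    pi_hyb = dict(pi_mbp)                 # fall back to MBP everywhere
    for s, a in pi_rtdp.items():
        if visited.get(s, 0) > threshold:
            pi_hyb[s] = a                 # trust RTDP where it has explored
    return pi_hyb

# Invented example data: RTDP has explored s0 heavily, s1 barely.
pi_rtdp = {"s0": "north", "s1": "east"}
pi_mbp = {"s0": "south", "s1": "east", "s2": "west"}
visited = {"s0": 12, "s1": 1}
print(hybridize(pi_rtdp, pi_mbp, visited, threshold=5))
# -> {'s0': 'north', 's1': 'east', 's2': 'west'}
```

The threshold controls the trade-off: a high threshold keeps the guaranteed MBP actions except where RTDP's estimates are well supported by many visits.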
Goal First RTDP Trial 1.run RTDP for some time
Bellman Backup (1. run RTDP for some time)
With J0 ≡ 0 and unit action costs:
Q1(s,N) = 1 + Σ_s' Pr(s'|s,N) × J0(s') = 1
Q1(s,S) = Q1(s,W) = Q1(s,E) = 1
J1(s) = 1; let the greedy action be North
Goal Simulation of Greedy Action 1.run RTDP for some time
Goal Continuing First Trial 1.run RTDP for some time
Goal Continuing First Trial 1.run RTDP for some time
Goal Finishing First Trial 1.run RTDP for some time
Goal Cost Function after First Trial 1.run RTDP for some time
Goal Partial Greedy Policy 2. compute the partial greedy policy (π_rtdp)
Goal Construct Hybridized Policy with MBP 3. compute the hybridized policy (π_hyb); threshold = 0
Goal Evaluate Hybridized Policy (after the first trial) 5. evaluate π_hyb 6. store π_hyb
Goal Second Trial
Partial Greedy Policy
Absence of MBP Policy: an MBP policy doesn't exist here, since there is no path to the goal.
Third Trial
Partial Greedy Policy
Probability-1 Cycles
repeat
  find a state s in the cycle
  π_hyb(s) = π_mbp(s)
until the cycle is broken
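The repair loop can be sketched as below, a hedged illustration assuming explicit successor sets: states that cannot possibly reach the goal under π_hyb are stuck in a probability-1 cycle, and one of them is switched to the MBP action until the cycle breaks. The tiny domain is invented.

```python
def can_reach_goal(policy, succ, goal):
    """States that have some possible path to a goal under `policy`."""
    ok = set(goal)
    changed = True
    while changed:
        changed = False
        for s, a in policy.items():
            if s not in ok and any(s2 in ok for s2 in succ[(s, a)]):
                ok.add(s)
                changed = True
    return ok

# Invented example: under pi_hyb, s0 and s1 only cycle between each other.
succ = {("s0", "loop"): ["s1"], ("s1", "loop"): ["s0"],
        ("s1", "exit"): ["g"], ("s0", "loop2"): ["s1"]}
goal = {"g"}
pi_hyb = {"s0": "loop", "s1": "loop"}
pi_mbp = {"s0": "loop2", "s1": "exit"}   # MBP's strong-cyclic fallback

while True:                               # repeat until the cycle is broken
    trapped = set(pi_hyb) - can_reach_goal(pi_hyb, succ, goal)
    if not trapped:
        break
    # pick some trapped state whose action still differs from MBP's
    s = next(s for s in sorted(trapped) if pi_hyb[s] != pi_mbp[s])
    pi_hyb[s] = pi_mbp[s]
print(pi_hyb)  # -> {'s0': 'loop2', 's1': 'exit'}
```

Because π_mbp is a proper (strong-cyclic) policy, repeatedly substituting its actions is guaranteed to eventually break every such cycle.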
Error Bound
After the 1st trial: J(π_hyb) = 5, so J*(s0) ≤ 5
RTDP's admissible cost function gives J*(s0) ≥ 1
⇒ Error(π_hyb) = 5 - 1 = 4
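The bound can be illustrated with a small policy-evaluation sketch: evaluating π_hyb yields an upper bound on J*(s0), while RTDP's admissible cost function yields a lower bound. The chain MDP and the lower-bound value below are invented; only the subtraction mirrors the slide.

```python
# Evaluate pi_hyb by iterating J(s) = 1 + sum_s' Pr(s'|s, pi_hyb(s)) J(s'),
# then bound suboptimality at s0 by J_hyb(s0) - J_lower(s0).

Pr = {("s0", "a"): {"s1": 1.0}, ("s1", "a"): {"g": 0.5, "s0": 0.5}}
pi_hyb = {"s0": "a", "s1": "a"}
goal = {"g"}

def evaluate(policy, iters=200):
    """Iterative policy evaluation with unit action costs."""
    J = {s: 0.0 for s in list(policy) + list(goal)}
    for _ in range(iters):
        for s, a in policy.items():
            J[s] = 1.0 + sum(p * J[s2] for s2, p in Pr[(s, a)].items())
    return J

J_hyb = evaluate(pi_hyb)     # upper bound on J*(s0): pi_hyb is executable
J_lower = 1.0                # e.g. RTDP's current admissible estimate at s0
error = J_hyb["s0"] - J_lower
print(round(J_hyb["s0"], 2), round(error, 2))  # -> 4.0 3.0
```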
Termination
when a policy with the required error bound is found
when planning time is exhausted
when available memory is exhausted
Properties
outputs a proper policy
anytime algorithm (once MBP terminates)
HybPlan = RTDP, if infinite resources are available
HybPlan = MBP, if resources are extremely limited
HybPlan = better than both, otherwise
Outline Motivation Planning with Probabilistic Uncertainty (RTDP) Planning with Disjunctive Uncertainty (MBP) Hybridizing RTDP and MBP (HybPlan) Experiments Anytime Properties Scalability Conclusions and Future Work
Domains NASA Rover Domain Factory Domain Elevator domain
Anytime Properties RTDP
Scalability
Problem   Time before memory exhausts   J(π_rtdp)   J(π_mbp)   J(π_hyb)
Rov5      ~1100 sec
Rov2      ~800 sec
Mach9     ~1500 sec
Mach6     ~300 sec
Elev14    ~10000 sec
Elev15    ~10000 sec
Conclusions
First algorithm that integrates disjunctive and probabilistic planners.
Experiments show that HybPlan:
is anytime
scales better than RTDP
produces better-quality solutions than MBP
can interleave planning and execution
Hybridized Planning: A General Notion
Hybridize other pairs of planners:
an optimal or close-to-optimal planner
a sub-optimal but fast planner
to yield a planner that produces good-quality solutions in intermediate running times
Examples:
POMDPs: RTDP/PBVI with POND/MBP/BBSP
Oversubscription planning: A* with greedy solutions
Concurrent MDPs: sampled RTDP with single-action RTDP