A Hybridized Planner for Stochastic Domains
Mausam and Daniel S. Weld, University of Washington, Seattle
Piergiorgio Bertoli, ITC-IRST, Trento


Planning under Uncertainty (ICAPS'03 Workshop)
- Qualitative (disjunctive) uncertainty: which real problem can you solve?
- Quantitative (probabilistic) uncertainty: which real problem can you model?

The Quantitative View
- Markov Decision Process
  - models uncertainty with probabilistic outcomes
  - general decision-theoretic framework
  - algorithms are slow
- do we need the full power of decision theory?
- is an unconverged partial policy any good?

The Qualitative View
- Conditional Planning
  - models uncertainty as a logical disjunction of outcomes
  - exploits classical planning techniques: FAST
  - ignores probabilities: poor solutions
- how bad are purely qualitative solutions?
- can we improve the qualitative policies?

HybPlan: A Hybridized Planner
- combines probabilistic + disjunctive planners
- produces good solutions in intermediate times
- anytime: makes effective use of resources
- bounds termination with a quality guarantee
- Quantitative view: completes a partial probabilistic policy by using the qualitative policy in some states
- Qualitative view: improves the qualitative policy in the more important regions

Outline
- Motivation
- Planning with Probabilistic Uncertainty (RTDP)
- Planning with Disjunctive Uncertainty (MBP)
- Hybridizing RTDP and MBP (HybPlan)
- Experiments
- Conclusions and Future Work

Markov Decision Process
- S: a set of states
- A: a set of actions
- Pr: probabilistic transition model
- C: cost model
- s0: start state
- G: a set of goals
Find a policy (S → A) that minimizes the expected cost to reach a goal, for an indefinite horizon, in a fully observable Markov decision process. The optimal cost function J* yields an optimal policy (see the equations below).
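For reference, the objective on this slide is the standard Bellman fixed point; the following is a textbook formulation written in the slide's S, A, Pr, C, G notation, not text from the talk:

\[
J^*(s) =
\begin{cases}
0 & \text{if } s \in G,\\
\min_{a \in A} \sum_{s' \in S} \Pr(s' \mid s, a)\,\bigl[\,C(s, a, s') + J^*(s')\,\bigr] & \text{otherwise,}
\end{cases}
\qquad
\pi^*(s) = \operatorname*{arg\,min}_{a \in A} \sum_{s' \in S} \Pr(s' \mid s, a)\,\bigl[\,C(s, a, s') + J^*(s')\,\bigr].
\]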

Example (figure): a grid world with start state s0 and a goal; annotations mark a longer path, a wrong direction from which the goal is still reachable, and a region in which all states are dead ends.

Optimal State Costs (figure)

Optimal Policy (figure)

Bellman Backup: create a better approximation of the cost at state s (figure).

Bellman Backup: create a better approximation of the cost at state s (figure). Trial = simulate the greedy policy and update the visited states.

Bellman Backup: create a better approximation of the cost at state s (figure). Trial = simulate the greedy policy and update the visited states. Real-Time Dynamic Programming (Barto et al. '95; Bonet & Geffner '03): repeat trials until the cost function converges.
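A minimal Python sketch of this trial loop, assuming a hypothetical MDP object with s0, actions(s), transition_probs(s, a) (returning (successor, probability) pairs), cost(s, a, s2), and is_goal(s); it illustrates RTDP, not the authors' implementation:

```python
import random

def bellman_backup(mdp, J, s):
    """Update J[s] with a one-step lookahead and return the greedy action."""
    best_a, best_q = None, float("inf")
    for a in mdp.actions(s):
        q = sum(p * (mdp.cost(s, a, s2) + J.get(s2, 0.0))
                for s2, p in mdp.transition_probs(s, a))
        if q < best_q:
            best_a, best_q = a, q
    J[s] = best_q
    return best_a

def rtdp_trial(mdp, J, max_steps=1000):
    """One trial: simulate the greedy policy from s0 and update visited states."""
    s = mdp.s0
    for _ in range(max_steps):
        if mdp.is_goal(s):
            break
        a = bellman_backup(mdp, J, s)
        if a is None:                      # dead end: no applicable action
            break
        succs, probs = zip(*mdp.transition_probs(s, a))
        s = random.choices(succs, weights=probs)[0]   # sample a successor

def rtdp(mdp, num_trials=1000):
    """Run a fixed budget of trials; real RTDP repeats until convergence."""
    J = {}                                 # costs default to an admissible 0
    for _ in range(num_trials):
        rtdp_trial(mdp, J)
    return J
```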

Planning with Disjunctive Uncertainty
- S: a set of states
- A: a set of actions
- T: disjunctive transition model
- s0: the start state
- G: a set of goals
Find a strong-cyclic policy (S → A) that guarantees reaching a goal, for an indefinite horizon, in a fully observable planning problem.

Model Based Planner (MBP; Bertoli et al.)
- states, transitions, etc. are represented logically
- uncertainty: multiple possible successor states
- planning algorithm: iteratively removes "bad" states
  - bad = states that don't reach anywhere, or that reach other bad states
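MBP operates on symbolic (BDD-based) representations, so the following explicit-state Python sketch is only an illustration of the "iteratively remove bad states" idea behind strong-cyclic planning; the interface (states, actions(s), succ(s, a), goals) is an assumption, not MBP's API:

```python
def strong_cyclic_policy(states, actions, succ, goals):
    """Prune 'bad' states until every remaining state can reach the goal
    using only actions whose possible outcomes all stay in the kept set."""
    goals = set(goals)
    good = set(states) | goals
    while True:
        reach, policy = set(goals), {}
        changed = True
        while changed:                     # backward reachability inside 'good'
            changed = False
            for s in good - reach:
                for a in actions(s):
                    nxt = set(succ(s, a))  # disjunctive outcomes of a in s
                    # keep a only if every outcome stays good and at least
                    # one outcome makes progress toward the goal
                    if nxt <= good and nxt & reach:
                        reach.add(s)
                        policy[s] = a
                        changed = True
                        break
        bad = good - reach                 # cannot (safely) reach the goal
        if not bad:
            return policy                  # strong-cyclic policy (may be empty)
        good -= bad                        # prune bad states and recompute
```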

MBP Policy (figure): a sub-optimal solution.

Outline
- Motivation
- Planning with Probabilistic Uncertainty (RTDP)
- Planning with Disjunctive Uncertainty (MBP)
- Hybridizing RTDP and MBP (HybPlan)
- Experiments
- Conclusions and Future Work

HybPlan Top-Level Code
0. run MBP to find a solution to the goal
1. run RTDP for some time
2. compute the partial greedy policy (π_rtdp)
3. compute the hybridized policy (π_hyb):
   π_hyb(s) = π_rtdp(s) if visited(s) > threshold
   π_hyb(s) = π_mbp(s) otherwise
4. clean π_hyb by removing (a) dead ends and (b) probability-1 cycles
5. evaluate π_hyb
6. save the best policy obtained so far
Repeat steps 1-6 until (1) resources are exhausted or (2) a satisfactory policy is found.
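Step 3 is a per-state switch between the two policies; a minimal Python sketch, with policies as dictionaries from states to actions and an assumed visit-count dictionary maintained by RTDP (illustrative names, not the authors' code):

```python
def hybridize(pi_rtdp, pi_mbp, visits, threshold=0):
    """Step 3 of HybPlan: take the greedy RTDP action in states RTDP has
    explored often enough, and fall back to the MBP action elsewhere."""
    pi_hyb = dict(pi_mbp)                  # default: qualitative (MBP) policy
    for s, a in pi_rtdp.items():
        if visits.get(s, 0) > threshold:   # state explored enough by RTDP
            pi_hyb[s] = a
    return pi_hyb
```

The surrounding loop then cleans π_hyb (step 4; see the repair sketch after the probability-1 cycle slides), evaluates it (step 5), and keeps the best policy found so far (step 6).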

First RTDP Trial (figure; step 1: run RTDP for some time)

Bellman Backup (figure; step 1: run RTDP for some time): the one-step values Q1(s, a) are computed for the actions N, S, W, and E, J1(s) is set to their minimum (1 in the figure), and the greedy action is taken to be North.

Simulation of the Greedy Action (figure; step 1: run RTDP for some time)

Continuing the First Trial (figure; step 1: run RTDP for some time)

Continuing the First Trial (figure; step 1: run RTDP for some time)

Finishing the First Trial (figure; step 1: run RTDP for some time)

Cost Function after the First Trial (figure; step 1: run RTDP for some time)

Partial Greedy Policy (figure; step 2: compute the greedy policy π_rtdp)

Construct Hybridized Policy with MBP (figure; step 3: compute the hybridized policy π_hyb, threshold = 0)

Evaluate Hybridized Policy (figure; steps 5-6: evaluate π_hyb and store it); the figure shows J(π_hyb) after the first trial.
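Step 5 can be approximated by simulation; the Monte Carlo sketch below is one simple way to estimate J(π_hyb) from the start state (the MDP interface and parameter names are the same illustrative assumptions used earlier, not the paper's evaluation procedure):

```python
import random

def evaluate_policy(mdp, policy, episodes=200, horizon=1000):
    """Estimate the expected cost of following `policy` from the start state."""
    total = 0.0
    for _ in range(episodes):
        s, cost = mdp.s0, 0.0
        for _ in range(horizon):
            if mdp.is_goal(s):
                break
            a = policy.get(s)
            if a is None:                  # policy undefined here: stop the episode
                break
            succs, probs = zip(*mdp.transition_probs(s, a))
            s2 = random.choices(succs, weights=probs)[0]
            cost += mdp.cost(s, a, s2)
            s = s2
        total += cost
    return total / episodes
```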

Second Trial (figure)

Partial Greedy Policy

Absence of an MBP Policy (figure): the MBP policy does not exist here because there is no path to the goal.

Third Trial (figure)

Partial Greedy Policy

Probability-1 Cycles (figure sequence): repeat { find a state s in the cycle; set π_hyb(s) = π_mbp(s) } until the cycle is broken.
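Step 4's clean-up (removing dead ends and the probability-1 cycles shown here) can be sketched as follows; the detection simply finds states that cannot reach a goal when following π_hyb, and the repair loop mirrors the slide by redirecting one such state at a time to its MBP action (same illustrative MDP interface as above; a sketch, not the paper's procedure):

```python
def cannot_reach_goal(mdp, policy, start):
    """States reachable from `start` under `policy` from which no execution
    of the policy can reach a goal (dead ends and probability-1 cycles)."""
    reachable, frontier = {start}, [start]
    while frontier:
        s = frontier.pop()
        if mdp.is_goal(s) or s not in policy:
            continue
        for s2, _ in mdp.transition_probs(s, policy[s]):
            if s2 not in reachable:
                reachable.add(s2)
                frontier.append(s2)
    can_reach = {s for s in reachable if mdp.is_goal(s)}
    changed = True
    while changed:                         # backward reachability under the policy
        changed = False
        for s in reachable - can_reach:
            if s in policy and any(s2 in can_reach
                                   for s2, _ in mdp.transition_probs(s, policy[s])):
                can_reach.add(s)
                changed = True
    return reachable - can_reach

def clean_policy(mdp, pi_hyb, pi_mbp, start):
    """Redirect trapped states to their MBP action until none remain."""
    while True:
        trapped = cannot_reach_goal(mdp, pi_hyb, start)
        fixable = next((s for s in trapped
                        if s in pi_mbp and pi_hyb.get(s) != pi_mbp[s]), None)
        if not trapped or fixable is None:
            return pi_hyb                  # clean, or nothing left to repair
        pi_hyb[fixable] = pi_mbp[fixable]  # fall back to the qualitative action
```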

Error Bound (figure): after the first trial, evaluating π_hyb gives J*(s0) ≤ 5 while RTDP's lower bound gives J*(s0) ≥ 1, so Error(π_hyb) = 5 - 1 = 4.
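In symbols, the bound sandwiches the optimal cost between RTDP's admissible cost function (a lower bound) and the evaluated cost of the hybridized policy (an upper bound); this just restates the slide:

\[
J_{\mathrm{rtdp}}(s_0) \;\le\; J^*(s_0) \;\le\; J^{\pi_{\mathrm{hyb}}}(s_0),
\qquad
\mathrm{Error}(\pi_{\mathrm{hyb}}) = J^{\pi_{\mathrm{hyb}}}(s_0) - J_{\mathrm{rtdp}}(s_0) = 5 - 1 = 4 \ \text{in this example.}
\]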

Termination
- when a policy within the required error bound is found
- when the planning time is exhausted
- when the available memory is exhausted
Properties
- outputs a proper policy
- anytime algorithm (once MBP terminates)
- HybPlan = RTDP if infinite resources are available
- HybPlan = MBP if resources are extremely limited
- HybPlan is better than both otherwise

Outline
- Motivation
- Planning with Probabilistic Uncertainty (RTDP)
- Planning with Disjunctive Uncertainty (MBP)
- Hybridizing RTDP and MBP (HybPlan)
- Experiments
  - Anytime Properties
  - Scalability
- Conclusions and Future Work

Domains
- NASA Rover domain
- Factory domain
- Elevator domain

Anytime Properties (plots; comparison with RTDP)

Scalability

Problem   Time before memory exhausts   J(π_rtdp)   J(π_mbp)   J(π_hyb)
Rov5      ~1100 sec
Rov2      ~800 sec
Mach9     ~1500 sec
Mach6     ~300 sec
Elev14    ~10000 sec
Elev15    ~10000 sec

Conclusions
- first algorithm that integrates disjunctive and probabilistic planners
- experiments show that HybPlan
  - is anytime
  - scales better than RTDP
  - produces better-quality solutions than MBP
  - can interleave planning and execution

Hybridized Planning: A General Notion
- Hybridize other pairs of planners
  - an optimal or close-to-optimal planner
  - a sub-optimal but fast planner
  to yield a planner that produces a good-quality solution in intermediate running times
- Examples
  - POMDP: RTDP/PBVI with POND/MBP/BBSP
  - Oversubscription planning: A* with greedy solutions
  - Concurrent MDP: sampled RTDP with single-action RTDP