Sungwook Yoon – Probabilistic Planning via Determinization in Hindsight (FF-Hindsight)


Probabilistic Planning via Determinization in Hindsight (FF-Hindsight). Sungwook Yoon, joint work with Alan Fern, Bob Givan, and Rao Kambhampati.

Probabilistic Planning Competition. Client: the participants' planners, which send actions. Server: the competition host, which simulates the actions' outcomes.

The Winner Was ... FF-Replan, a replanner: the probabilistic domain is determinized and FF is used to plan in it. Interesting contrast: many probabilistic planning techniques work in theory but not in practice, while FF-Replan has no theory but works in practice.
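
To make the contrast concrete, here is a minimal sketch of the replanning loop that FF-Replan embodies; `determinize`, `ff_plan`, and `sample_outcome` are hypothetical stand-ins, and FF itself is treated purely as a black-box classical planner:

```python
def ff_replan(state, goal_test, prob_domain):
    """Plan in a determinized domain, execute, and replan whenever the
    observed outcome differs from the one the classical plan assumed."""
    det_domain = determinize(prob_domain)          # e.g. all-outcome determinization
    plan = ff_plan(det_domain, state)              # classical plan; probabilities are ignored
    while not goal_test(state):
        if not plan:                               # FF found no plan from here: fail this trial
            return False
        action, predicted_next = plan.pop(0)
        state = sample_outcome(prob_domain, state, action)   # what the simulator actually does
        if state != predicted_next:                # surprise outcome
            plan = ff_plan(det_domain, state)      # replan from the observed state
    return True
```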

The Paper's Objective. A better determinization approach (determinization in hindsight); theoretical consideration of the new determinization; a new view of FF-Replan; experimental studies with determinization in hindsight (FF-Hindsight).

Probabilistic Planning (goal-oriented). [Figure: a two-step tree from initial state I; each state offers actions A1 and A2, each with two probabilistic outcomes, and the left outcome of each action is more likely. Some leaves are goal states, one is a dead end; the objective is to maximize goal achievement.]

All-Outcome Replanning (FFR_A), ICAPS-07. [Figure: a probabilistic action with Effect 1 (Probability 1) and Effect 2 (Probability 2) is split into two deterministic actions, Action1 with Effect 1 and Action2 with Effect 2.]
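
A minimal, runnable sketch of all-outcome determinization under an assumed action representation (each probabilistic action is a list of (probability, effect) outcomes; the probabilities are simply dropped):

```python
prob_actions = {
    "A1": [(0.8, "effect-1a"), (0.2, "effect-1b")],   # illustrative outcomes, not from the slides
    "A2": [(0.5, "effect-2a"), (0.5, "effect-2b")],
}

def all_outcome_determinize(prob_actions):
    """One deterministic action per probabilistic outcome, e.g. A1 -> A1-1, A1-2."""
    det_actions = {}
    for name, outcomes in prob_actions.items():
        for i, (_prob, effect) in enumerate(outcomes, start=1):
            det_actions[f"{name}-{i}"] = effect
    return det_actions

print(all_outcome_determinize(prob_actions))
# {'A1-1': 'effect-1a', 'A1-2': 'effect-1b', 'A2-1': 'effect-2a', 'A2-2': 'effect-2b'}
```

The A1-1, A1-2, A2-1, A2-2 naming matches the determinized tree on the next slide.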

Probabilistic Planning, All-Outcome Determinization. [Figure, over two slides: the same two-step tree, with each probabilistic action A1/A2 replaced by the deterministic actions A1-1, A1-2, A2-1, A2-2, one per outcome; the task becomes "find the goal" in this deterministic tree, which still contains the dead end.]

Problem of FF-Replan, and a better alternative: sampling. FF-Replan's static determinizations do not respect probabilities. We need probabilistic and dynamic determinization: sample future outcomes and determinize in hindsight. Each sampled future becomes a known-future deterministic problem.
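
A sketch of what "sampling a future" means here, using the representation given later on the mathematical-summary slide: a future fixes one outcome for every (state, action, time) triple, so planning against it is deterministic. `prob_outcomes(s, a)` is an assumed helper returning (probability, next_state) pairs:

```python
import random

def sample_future(states, actions, horizon, prob_outcomes, rng=random):
    """Draw one future F_H : (state, action, time) -> next state."""
    future = {}
    for t in range(horizon):
        for s in states:
            for a in actions:
                probs, next_states = zip(*prob_outcomes(s, a))
                future[(s, a, t)] = rng.choices(next_states, weights=probs, k=1)[0]
    return future
```

Once a future is fixed, the effect of every action at every step is known, so FF (or any classical planner) can be run on it.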

Probabilistic Planning (goal-oriented). [Figure: the same two-step tree as before, from initial state I with actions A1 and A2; left outcomes are more likely.]

Start Sampling. Note: sampling will reveal which action, A1 or A2, is better at state I.

Hindsight Sample 1. [Figure: one sampled future of the two-step tree; in this future the goal is reachable after A1 but not after A2.] Running tally: A1: 1, A2: 0.

Hindsight Sample 2. [Figure: a second sampled future.] Running tally: A1: 2, A2: 1.

Hindsight Sample 3. [Figure: a third sampled future.] Running tally: A1: 2, A2: 1.

Hindsight Sample 4. [Figure: a fourth sampled future.] Running tally: A1: 3, A2: 1, so A1 is preferred at state I.

Summary of the Idea: the decision process for estimating the Q-value Q(s,a), where s is the current state and a(s) → s':
1. For each action a, draw future samples.
2. Solve the resulting deterministic problems; each sample is a known-future deterministic planning problem.
3. Aggregate the solutions for each action; for goal-oriented problems the solution length is used, giving Q(s,a).
4. Select the action with the best aggregation, argmax_a Q(s,a). (A sketch of this loop follows below.)
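
A sketch of that decision loop. Assumptions: `sample_future` draws an independent future for each action and each sample (as in the earlier sketch), and `solve` stands in for running FF on the known-future deterministic problem, returning a plan length or None; the aggregate below is the fraction of futures in which a plan exists, one simple choice for goal-oriented problems:

```python
import random

def hindsight_choose_action(state, actions, horizon, width, sample_future, solve, rng=random):
    """Estimate Q(s, a) from `width` sampled futures per action and return argmax_a Q(s, a)."""
    q = {}
    for a in actions:
        solved = 0
        for _ in range(width):
            future = sample_future(state, a, horizon)   # independent per action and per sample
            if solve(state, a, future) is not None:     # FF solves the known-future problem
                solved += 1
        q[a] = solved / width
    best = max(q.values())
    return rng.choice([a for a in actions if q[a] == best])  # random tie breaking (see later slide)
```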

Mathematical Summary of the Algorithm. An H-horizon future F_H for M = [S, A, T, R] is a mapping of state, action, and time (h < H) to a state, F_H : S × A × h → S; each future is a deterministic problem. Writing R(s, F_H, π) for the value of a policy π under future F_H, the hindsight value is V_HS(s,H) = E_{F_H}[ max_π R(s, F_H, π) ]. Compare this with the real value V*(s,H) = max_π E_{F_H}[ R(s, F_H, π) ] and with FF-Replan's value V_FFRa(s) = max_F V(s,F); then V_FFRa(s) ≥ V_HS(s,H) ≥ V*(s,H). Action selection uses Q(s,a,H) = R(a) + E_{F_{H-1}}[ max_π R(a(s), F_{H-1}, π) ], where the inner computation max_π R(·, F_{H-1}, π) is done approximately by FF [Hoffmann and Nebel '01].
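
A tiny illustrative example (not from the slides) of why these quantities are ordered this way: one fair-coin future, action a1 pays 1 on heads and 0 on tails, action a2 pays 0.6 regardless.

```latex
\begin{align*}
V^*(s,1)     &= \max_\pi \mathbb{E}_{F}\,[R(s,F,\pi)] = \max(0.5,\;0.6) = 0.6\\
V_{HS}(s,1)  &= \mathbb{E}_{F}\,[\max_\pi R(s,F,\pi)] = 0.5\cdot 1 + 0.5\cdot 0.6 = 0.8\\
V_{FFR_a}(s) &= \max_{F} V(s,F) = 1
\end{align*}
```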

Key Technical Results. The importance of independent sampling across states, actions, and time; the necessity of random tie breaking in decision making. Theorem 1: when there is a policy that achieves the goal with probability 1 within the horizon, the hindsight decision-making algorithm reaches the goal with probability 1. Theorem 2: a polynomial number of samples suffices, in terms of the horizon, the number of actions, and the minimum Q-value advantage. We also characterize FF-Replan in terms of hindsight decision making: V_FFRa(s) = max_F V(s,F).

Empirical Results (IPPC-04 problems; numbers are solved trials):
Problem        | FFRa | FF-Hindsight
Blocksworld    |  –   |  –
Boxworld       |  –   |  –
Fileworld      |  29  |  14
R-Tireworld    |   3  |   0
ZenoTravel     |  30  |   0
Exploding BW   |   5  |  28
G-Tireworld    |   7  |  18
Tower of Hanoi |  11  |  17
For ZenoTravel, using importance sampling improved the solved trials to 26.

Empirical Results (domains designed to defeat FF-Replan; numbers are the percentage of successful runs):
Planner   | Climber | River | Bus-Fare | Tire1 | Tire2 | Tire3 | Tire4 | Tire5 | Tire6
FFRa      |   60%   |  65%  |    1%    |  50%  |   0%  |       |       |       |
Paragraph |  100%   |  65%  |  100%    |   3%  |   1%  |   0%  |       |       |
FPG       |  100%   |  65%  |   22%    | 100%  |  92%  |  60%  |  35%  |  19%  |  13%
FF-HS     |  100%   |  65%  |  100%    |       |       |       |       |       |
These domains were developed just to beat FF-Replan, and unsurprisingly FF-Replan did not do well; FF-Hindsight did very well, showing probabilistic reasoning ability while retaining scalability.

Conclusion. [Figure: determinization as a bridge that carries the scalability of deterministic planning (classical planning, machine learning for planning, net-benefit optimization, temporal planning) over to probabilistic planning (Markov decision processes, machine learning for MDPs, temporal MDPs).]

Conclusion. We devised an algorithm that brings the significant advances in deterministic planning to bear on probabilistic planning, making many deterministic planning techniques available to it. Most learning-for-planning techniques were developed solely for deterministic planning; now they are relevant to probabilistic planning too. Likewise, advanced net-benefit planners can be used for reward-maximization style probabilistic planning problems.

Discussion. Mercier and Van Hentenryck analyzed the difference between V*(s,H) = max_π E_{F_H}[ R(s, F_H, π) ] and V_HS(s,H) = E_{F_H}[ max_π R(s, F_H, π) ]. Ng and Jordan analyzed the difference between V*(s,H) and the sample-average estimate V̂(s,H) = max_π (1/m) Σ_i R(s, F_H^(i), π), where m is the number of samples.

IPPC-2004 Results. [Table: successful runs per domain (BW, Box, File, Zeno, Tire-r, Tire-g, TOH, Exploding) for the planners NMR, C, J1, Classy, mGPT, FFR_S, and FFR_A; callouts mark which entries used human control knowledge, which used learned knowledge, and which planners were the second-place winners.] Legend: NMR, Non-Markovian Reward Decision Process planner; Classy, approximate policy iteration with a policy-language bias; mGPT, heuristic-search probabilistic planning; C, symbolic heuristic search. Numbers are successful runs; the winner of IPPC-04 was FFR_S.

IPPC-2006 Results. [Table: percentage of successful runs per domain (BW, Zenotravel, Random, Elevator, Exploding, Drive, Schedule, PitchCatch, Tire) for the planners FFR_A, FPG, FOALP, sfDP, Paragraph, and FFR_S.] Legend: FPG, Factored Policy Gradient planner; FOALP, first-order approximate linear programming; sfDP, symbolic stochastic focused dynamic programming with decision diagrams; Paragraph, a Graphplan-based probabilistic planner. Numbers are the percentage of successful runs; the unofficial winner of IPPC-06 was FFR_A.

Sampling Problem: the time-dependency issue. [Figure: a small example with states Start, S1, S2, S3, Goal, and a Dead End; actions A and B leave Start, and actions C and D reach the Goal or the Dead End with probabilities p and 1-p that differ depending on where and when they are taken.]

Sampling Problem: the time-dependency issue (continued). [Figure: the same example.] S3 is a worse state than S1, but if the same sampled futures are shared it looks as if there is always a path to the Goal; the futures need to be sampled independently across actions.

Action Selection Problem: random tie breaking is essential. [Figure: states Start, S1, and Goal; action A always stays in Start, action B moves on with probability p and fails with probability 1-p, and action C reaches the Goal with probability p and fails with probability 1-p.] In the Start state C is clearly the better action, but A can be used to wait until a future in which C's goal-reaching effect is realized, so A and C can look equally good in hindsight; random tie breaking keeps the planner from committing to waiting forever.

Sampling Problem: importance sampling (IS). [Figure: states Start, S1, and Goal; action B reaches the Goal with extremely low probability and S1 with very high probability.] Sampling futures uniformly would make the problem look unsolvable, so importance sampling is used; identifying the region that needs importance sampling is left for further study. In the benchmarks, ZenoTravel needs the IS idea.
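
A sketch of the importance-sampling idea in this setting (not the paper's exact scheme): draw the outcome from a proposal distribution that boosts the rare, goal-critical outcome, and weight that future's contribution by the ratio of true to proposal probability.

```python
import random

def importance_sample_outcome(outcomes, proposal_probs, rng=random):
    """outcomes: list of (true_probability, next_state); proposal_probs: the probabilities
    actually used for drawing. Returns (next_state, importance_weight = p / q)."""
    true_probs = [p for p, _ in outcomes]
    next_states = [ns for _, ns in outcomes]
    idx = rng.choices(range(len(outcomes)), weights=proposal_probs, k=1)[0]
    return next_states[idx], true_probs[idx] / proposal_probs[idx]

# Example: a 0.001-probability "reach the goal" outcome is drawn half of the time instead.
state, weight = importance_sample_outcome([(0.001, "Goal"), (0.999, "S1")], [0.5, 0.5])
```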

Theoretical Results. Theorem 1: for goal-achieving probabilistic planning problems, if there is a policy that solves the problem with probability 1 within a bounded horizon, then hindsight planning solves it with probability 1; if there is no such policy, hindsight planning returns a success ratio below 1, because any future in which no plan can achieve the goal can be sampled. Theorem 2: the number of future samples needed to correctly identify the best action is w > 4Δ^{-2} T ln(|A|H/δ), where Δ is the minimum Q-advantage of the best action over the other actions and δ is the confidence parameter; the bound follows from the Chernoff bound.
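
An illustrative instantiation of the Theorem 2 bound; all of the numbers below are made up, purely to show the formula's scale.

```python
import math

delta_q = 0.2     # minimum Q-advantage of the best action (hypothetical value)
T, H    = 20, 20  # horizon-related constants from the theorem (hypothetical values)
n_act   = 5       # |A|, number of actions (hypothetical value)
conf    = 0.05    # confidence parameter delta (hypothetical value)

w = 4 * delta_q ** -2 * T * math.log(n_act * H / conf)
print(math.ceil(w))   # about 15,200 sampled futures for these hypothetical values
```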

Probabilistic Planning: the expecti-max solution. [Figure: the two-step tree evaluated by alternating Max nodes over actions and Expectation nodes over probabilistic outcomes, maximizing goal achievement.]
