1. Knowledge Representation Meets Stochastic Planning
Bob Givan, joint work with Alan Fern and SungWook Yoon
Electrical and Computer Engineering, Purdue University
Dagstuhl, May 12-16, 2003

2. Overview
- We present a form of approximate policy iteration designed specifically for large relational MDPs.
- We describe a novel application that views entire planning domains as MDPs, from which we automatically induce domain-specific planners.
- The induced planners are state-of-the-art on deterministic planning benchmarks and on stochastic variants of those benchmarks.

3. Ideas from Two Communities
From traditional planning: induction of control knowledge, planning heuristics. From decision-theoretic planning: policy rollout, approximate policy iteration (API).
Two views of the new technique:
- iterative improvement of control knowledge
- API with a policy-space bias

4. Planning Problems
States are first-order interpretations of a particular language. A planning problem gives:
- a current state
- a goal state (or goal region)
- a list of actions and their semantics (which may be stochastic)
[Figure: a current state and a goal state/region in a blocks world; available actions: Pickup(x), PutDown(y)]

5. Planning Domains
A planning domain is a distribution over problems that share one set of actions (but may have different object domains and sizes).
[Figure: the Blocks World domain with several problem instances; available actions: Pickup(x), PutDown(y)]

6. Control Knowledge
- Traditional planners solve problems, not domains: there is little or no generalization between problems in a domain.
- Planning domains are "solved" by control knowledge, which prunes some actions and typically eliminates search.
[Figure: blocks-world states with pruned actions marked; e.g. "don't pick up a solved block"]

7. Recent Control Knowledge Research
- Human-written control knowledge often eliminates search:
  [Bacchus & Kabanza, 1996] TLPlan
- Helpful control knowledge can be learned from "small problems":
  [Khardon, 1996 & 1999] learning Horn-clause action strategies
  [Huang, Selman & Kautz, 2000] learning action-selection and action-rejection rules
  [Martin & Geffner, 2000] learning generalized policies in concept languages
  [Yoon, Fern & Givan, 2002] inductive policy selection for stochastic planning domains

8. Unsolved Problems
- Finding control knowledge without immediate access to small problems: can we learn directly in a large domain?
- Improving buggy control knowledge: all previous techniques produce unreliable control knowledge, occasionally with fatal flaws.
- Our approach: view control knowledge as an MDP policy and apply policy improvement. (A policy is a choice of action for each MDP state.)

9. Planning Domains as MDPs
View the domain as one big state space in which each state is a planning problem. This view facilitates generalization between problems.
[Figure: the Blocks World domain as one MDP; available actions: Pickup(x), PutDown(y); example transition under Pickup(Purple)]

10. Ideas from Two Communities (outline recap; see slide 3)

11. Policy Iteration
- Given a policy π and a state s, can we improve π(s)?
- Suppose π(s) is the action o, leading to successor states s1,…,sk with reward Ro, and b is an alternative action leading to t1,…,tn with reward Rb:
  V^π(s) = Q^π(s,o) = Ro + γ E_{s' ∈ {s1,…,sk}} V^π(s')
  Q^π(s,b) = Rb + γ E_{s' ∈ {t1,…,tn}} V^π(s')
- If V^π(s) < Q^π(s,b), then π(s) can be improved to b.
- Policy improvement: making such improvements at all states at once maps a base policy to an improved policy.
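A minimal sketch of the policy-evaluation and policy-improvement step above, on a tiny made-up tabular MDP (the states, transition probabilities, and rewards here are illustrative, not from the talk):

```python
GAMMA = 0.9
STATES = ["s", "t"]
ACTIONS = ["o", "b"]

# P[(state, action)] -> list of (next_state, probability); R[(state, action)] -> reward
P = {("s", "o"): [("s", 0.5), ("t", 0.5)], ("s", "b"): [("t", 1.0)],
     ("t", "o"): [("t", 1.0)],             ("t", "b"): [("s", 1.0)]}
R = {("s", "o"): 0.0, ("s", "b"): 1.0, ("t", "o"): 2.0, ("t", "b"): 0.0}

def evaluate(policy, sweeps=500):
    """Iterative policy evaluation: V(s) = R(s, pi(s)) + gamma * E[V(s')]."""
    V = {s: 0.0 for s in STATES}
    for _ in range(sweeps):
        V = {s: R[(s, policy[s])]
                + GAMMA * sum(p * V[s2] for s2, p in P[(s, policy[s])])
             for s in STATES}
    return V

def q_value(V, s, a):
    """Q(s,a) = R(s,a) + gamma * E_{s' ~ P(.|s,a)} V(s')."""
    return R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in P[(s, a)])

def improve(policy):
    """At each state, switch to an action b whenever Q(s,b) beats V(s)."""
    V = evaluate(policy)
    return {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES}

base = {"s": "o", "t": "o"}   # base policy pi
print(improve(base))          # improved policy pi'
```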

12. Flowchart View of Policy Iteration
Current policy π → compute V^π at all states → compute Q^π for each action at all states → choose the best action at each state → improved policy π'.
Problem: too many states.

13. Flowchart View of Policy Rollout
Policy rollout performs the improvement step only at the current state s, using sampling:
- For each action a at s, sample successors s' from s1,…,sk (the possible outcomes of a, with reward Ra).
- Estimate V^π(s') by drawing trajectories under π from s'.
- This gives an estimate of Q^π(s,a) at s; choosing the best action yields the improved choice π'(s) at s.
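A minimal sketch of the rollout computation in this flowchart, assuming a generative simulator `simulate(s, a) -> (next_state, reward)`; the function and parameter names are illustrative assumptions, not the talk's implementation:

```python
GAMMA, WIDTH, HORIZON = 0.95, 20, 30   # made-up sampling parameters

def rollout_value(simulate, policy, s, horizon=HORIZON):
    """Estimate V_pi(s) from one trajectory that follows pi for `horizon` steps."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        s, r = simulate(s, policy(s))
        total += discount * r
        discount *= GAMMA
    return total

def rollout_q(simulate, policy, s, a, width=WIDTH):
    """Estimate Q_pi(s,a): sample outcomes of a at s, then follow pi from each."""
    samples = []
    for _ in range(width):
        s2, r = simulate(s, a)
        samples.append(r + GAMMA * rollout_value(simulate, policy, s2))
    return sum(samples) / len(samples)

def rollout_choice(simulate, policy, actions, s):
    """The improved choice pi'(s): the action with the best estimated Q at s."""
    return max(actions(s), key=lambda a: rollout_q(simulate, policy, s, a))
```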

14. Approximate Policy Iteration
Idea: use machine learning to control the number of samples needed.
- Draw a training set of pairs (s, π'(s)), where π'(s) is computed by policy rollout at s.
- Learn a new policy from the training set.
- Repeat.
Refinement: use pairs (s, Q^π(s,·)) to define misclassification costs.
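A minimal sketch of this loop; `sample_states`, `improved_action` (e.g. the rollout choice sketched above), and `learn` are stand-in callables, not the talk's actual components:

```python
def approximate_policy_iteration(policy, sample_states, improved_action, learn,
                                 iterations=5, n_states=200):
    """Repeat: label sampled states with rollout-improved actions, fit a new policy."""
    for _ in range(iterations):
        # Training set of pairs (s, pi'(s)).
        data = [(s, improved_action(policy, s)) for s in sample_states(n_states)]
        # Learn a compact policy imitating the improved choices; the slide's
        # refinement would additionally weight examples by Q-value gaps
        # (cost-sensitive classification).
        policy = learn(data)
    return policy
```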

15. Challenge Problem
Consider the following stochastic blocks-world problem:
- Goal: Clear(A)
- Assume: block color affects the success probability of pickup()
- The optimal policy is compact, but the value function is not: a state's value depends on the set of colors above A.
[Figure: a stack of colored blocks above block A]
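A minimal sketch of the domain assumption that block color affects pickup() success; the particular colors, probabilities, and state encoding are made up for illustration:

```python
import random

PICKUP_SUCCESS = {"red": 0.9, "blue": 0.6, "purple": 0.3}   # made-up probabilities

def pickup(state, block):
    """Stochastic pickup: succeeds with a color-dependent probability,
    otherwise leaves the state unchanged (bookkeeping simplified)."""
    if random.random() < PICKUP_SUCCESS.get(state["color"][block], 0.5):
        state = dict(state,
                     holding=block,
                     clear=state["clear"] - {block},
                     on={x: y for x, y in state["on"].items() if x != block})
    return state
```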

16. Policy for Example Problem
A compact policy for this problem:
1. If holding a block, put it down on the table; otherwise
2. pick up a clear block above A.
How can we formalize this policy?
[Figure: the two rules illustrated on the example stack]

17. Action Selection Rules [Martin & Geffner, KR 2000]
"Pick up a clear block above block A…"
Action-selection rules are based on classes of objects:
- Apply action a to an object in class C (if possible), abbreviated C : a.
How can we describe the object classes?

18. Formal Policy for Example Problem
The policy is a decision list of class : action rules, given in English and in taxonomic syntax:
1. "blocks being held" : putdown  (taxonomic syntax: holding : putdown)
2. "clear blocks above block A" : pickup  (taxonomic syntax: clear ∩ (on* A) : pickup)
We find this policy with a heuristic search guided by the training data.
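A minimal sketch of executing this two-rule decision list on a simple relational encoding of a blocks-world state; the state representation and helper names are illustrative assumptions:

```python
def above_A(state, x):
    """True if x sits (transitively) above block 'A' via the on relation, i.e. x is in on*(A)."""
    y = state["on"].get(x)
    while y is not None:
        if y == "A":
            return True
        y = state["on"].get(y)
    return False

def decision_list_policy(state):
    # Rule 1: "blocks being held" : putdown
    if state["holding"] is not None:
        return ("putdown", state["holding"])
    # Rule 2: "clear blocks above block A" : pickup   (clear ∩ on* A)
    candidates = [b for b in state["blocks"]
                  if b in state["clear"] and above_A(state, b)]
    if candidates:
        return ("pickup", candidates[0])
    return None   # no rule applies

# Example: C on B on A, hand empty -> rule 2 picks up C, the clear block above A.
s = {"blocks": {"A", "B", "C"}, "on": {"B": "A", "C": "B"},
     "clear": {"C"}, "holding": None}
print(decision_list_policy(s))   # ('pickup', 'C')
```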

19. Ideas from Two Communities (outline recap; see slide 3)

20. API with a Policy Language Bias
As in the rollout flowchart: at each sampled state s, estimate Q^π(s,·) by drawing trajectories under the current policy π, choose the best action to obtain π'(s), and then train a new policy π' in the policy language from these examples.

21. Incorporating Value Estimates
- What happens if the policy can't find reward? Use a value estimate at the states where the trajectories under π are cut off.
- For learning control knowledge, we use the plangraph heuristic of FF.
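A minimal sketch of this idea: when a rollout trajectory is cut off before finding reward, bootstrap with a heuristic value estimate (the talk uses FF's plangraph heuristic; here `heuristic` is a stand-in callable):

```python
GAMMA = 0.95

def truncated_rollout_value(simulate, policy, heuristic, s, horizon=30):
    """Follow pi for up to `horizon` steps, then substitute a heuristic
    estimate of the remaining value at the cut-off state."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        s, r = simulate(s, policy(s))
        total += discount * r
        discount *= GAMMA
    return total + discount * heuristic(s)
```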

22. Initial Policy Choice
Policy iteration requires an initial base policy. Options include (see the sketch below):
- a random policy
- a policy that is greedy with respect to a planning heuristic
- a policy learned from small problems
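A minimal sketch of the first two options; `actions`, `simulate`, and `heuristic` are stand-in callables, and lower heuristic value is assumed to mean closer to the goal:

```python
import random

def random_policy(actions):
    """Option 1: choose uniformly among the applicable actions."""
    return lambda s: random.choice(list(actions(s)))

def heuristic_greedy_policy(simulate, actions, heuristic, samples=10):
    """Option 2: one-step greedy with respect to a planning heuristic,
    averaging the heuristic over sampled outcomes of each action."""
    def policy(s):
        def score(a):
            return sum(heuristic(simulate(s, a)[0]) for _ in range(samples)) / samples
        return min(actions(s), key=score)
    return policy
```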

23. Experimental Domains
- (Stochastic) Blocks World: SBW(n)
- (Stochastic) Painted Blocks World: SPW(n)
- (Stochastic) Logistics World: SLW(t,p,c)

24. API Results
Starting with flawed policies learned from small problems.
[Chart: success rate]

25. API Results
Starting with a policy greedy with respect to a domain-independent heuristic; we used the heuristic of FF (Hoffmann and Nebel, JAIR 2001).

26. How Good is the Induced Planner?

                 Success Rate      Avg. Plan Length    Running Time (s)
                 FF      API       FF      API         FF       API
BW(10)           1       0.99      33      25          0.1      0.5
BW(15)           0.96    0.99      53      39          4.8      0.9
BW(20)           0.72    0.98      74      55          35.2     1.4
BW(30)           0.11    0.99      112     86          176.1    2.4
LW(4,6,4)        1       1         16                  0.0      0.5
LW(5,14,20)      1       1         73      74          0.7      3.4

27. Conclusions
- Using a policy-space bias, we can learn good policies for extremely large structured MDPs.
- We can automatically learn domain-specific planners that compete favorably with state-of-the-art domain-independent planners.

28. Approximate Policy Iteration
Sample states s and compute Q-values at each; form a training set of tuples (s, b, Q^{π,b}(s)); learn a new policy from this training set.
Computing Q^{π,b}(s): estimate Rb + γ E_{s' ∈ {t1,…,tn}} V^π(s') by
- sampling states ti from t1,…,tn (the outcomes of action b at s), and
- drawing trajectories under π from ti to estimate V^π.
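A minimal sketch of forming the training tuples described here; `estimate_q` stands in for the sampling-based Q estimator (e.g. the rollout sketch earlier) and is an assumption, not the talk's code:

```python
def q_training_set(states, actions, estimate_q):
    """Collect tuples (s, b, Q_{pi,b}(s)) for every sampled state and action."""
    data = []
    for s in states:
        for b in actions(s):
            data.append((s, b, estimate_q(s, b)))
    return data
```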

29. Markov Decision Process (MDP)
Ingredients:
- system state x in state space X
- control action a in A(x)
- reward R(x,a)
- state-transition probability P(x,y,a)
Goal: find a control policy that maximizes the objective function.
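A minimal sketch of these ingredients as a typed interface; the names mirror the slide (X, A(x), R(x,a), P(x,y,a)), but the interface itself is only an illustrative assumption:

```python
from typing import Hashable, Iterable, Protocol

State = Hashable
Action = Hashable

class MDP(Protocol):
    def states(self) -> Iterable[State]: ...                           # state space X
    def actions(self, x: State) -> Iterable[Action]: ...               # A(x)
    def reward(self, x: State, a: Action) -> float: ...                # R(x, a)
    def transition(self, x: State, y: State, a: Action) -> float: ...  # P(x, y, a)
```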

30. Control Knowledge vs. Policy
- Perhaps the biggest difference between the communities: deterministic planning works with action sequences, while decision-theoretic planning works with policies.
- Policies are needed because uncertainty may carry you to any state. Compare: control knowledge also handles every state.
- Good control knowledge eliminates search and thus defines a policy over the possible state/goal pairs.

