Published by Adrien Pinkerman. Modified about 1 year ago.

1
Efficient Approaches for Solving Large-scale MDPs
Slides on LRTDP and UCT are courtesy of Mausam/Kolobov

2
Ideas for Efficient Algorithms
Use heuristic search (and reachability information)
– LAO*, RTDP
Use execution and/or simulation
– “Actual execution”: reinforcement learning (the main motivation for RL is to “learn” the model)
– “Simulation”: simulate the given model to sample possible futures (policy rollout, hindsight optimization, etc.)
Use “factored” representations
– Factored representations for actions, reward functions, values and policies
– Directly manipulating factored representations during the Bellman update

3
Real Time Dynamic Programming [Barto et al 95]
Original motivation
– agent acting in the real world
Trial
– simulate the greedy policy starting from the start state
– perform a Bellman backup on visited states
– stop when you hit the goal
RTDP: repeat trials forever
– Converges in the limit as #trials → ∞
We will do the discussion in terms of SSP MDPs
– Recall they subsume infinite horizon MDPs

4
Trial (figure, built up over slides 4–11: state graph with start state s0, goal Sg and intermediate states s1–s8; states not yet backed up carry heuristic values h, backed-up states carry values V)
start at start state
repeat
– perform a Bellman backup
– simulate greedy action
until hit the goal
Backup all states on trajectory
RTDP: repeat forever

12
Real Time Dynamic Programming [Barto et al 95]
Original motivation
– agent acting in the real world
Trial
– simulate the greedy policy starting from the start state
– perform a Bellman backup on visited states
– stop when you hit the goal
RTDP: repeat trials forever
– Converges in the limit as #trials → ∞
No termination condition!

13
RTDP Family of Algorithms
repeat
  s ← s0
  repeat // trials
    REVISE s; identify a_greedy
    FIND: pick s’ s.t. T(s, a_greedy, s’) > 0
    s ← s’
  until s ∈ G
until termination test
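The trial loop above can be sketched in Python. This is a minimal illustration, not the authors' code: the dictionary layout of `mdp` (state → action → (cost, {successor: probability})), the fixed trial budget standing in for the "termination test", and the toy problem are all assumptions made for the example.

```python
import random

def q_value(mdp, V, s, a):
    """Expected cost of taking a in s, then following the current estimate V."""
    cost, outcomes = mdp[s][a]
    return cost + sum(p * V[s2] for s2, p in outcomes.items())

def rtdp(mdp, s0, goals, heuristic, trials=200, max_steps=1000, seed=0):
    """Plain RTDP for an SSP MDP; the trial budget replaces the termination test."""
    rng = random.Random(seed)
    V = {s: 0.0 if s in goals else heuristic(s) for s in set(mdp) | set(goals)}
    for _ in range(trials):
        s = s0
        for _ in range(max_steps):
            if s in goals:
                break
            # REVISE: Bellman backup at s, which also identifies the greedy action
            a = min(mdp[s], key=lambda act: q_value(mdp, V, s, act))
            V[s] = q_value(mdp, V, s, a)
            # FIND: sample a successor of the greedy action
            _, outcomes = mdp[s][a]
            s = rng.choices(list(outcomes), weights=list(outcomes.values()))[0]
    return V

# Hypothetical toy SSP: from s0, "risky" reaches the goal with probability 0.8.
toy_mdp = {
    "s0": {"safe": (1.0, {"s1": 1.0}), "risky": (1.0, {"g": 0.8, "s0": 0.2})},
    "s1": {"go": (1.0, {"g": 1.0})},
}
V = rtdp(toy_mdp, "s0", {"g"}, heuristic=lambda s: 0.0)
```

With the zero heuristic (admissible here), the estimate at s0 rises from below toward the optimal expected cost 1.25 of the risky action.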

14
Admissible heuristic & monotonicity
⇒ V(s) ≤ V*(s)
⇒ Q(s,a) ≤ Q*(s,a)
Label a state s as solved
– if V(s) has converged
(Figure: state s whose best action leads to the goal sg; its other actions have high Q costs)
Res_V(s) < ε ⇒ V(s) won’t change! Label s as solved

15
Labeling (contd)
(Figure: states s, s' and the goal sg; the best action is marked and the other actions have high Q costs)
Res_V(s) < ε and s' already solved ⇒ V(s) won’t change! Label s as solved

16
Labeling (contd)
(Figure, first panel: states s, s' and the goal sg; the best action is marked and the other actions have high Q costs)
Res_V(s) < ε and s' already solved ⇒ V(s) won’t change! Label s as solved
(Figure, second panel: the same states, with best actions marked at both s and s')
Res_V(s) < ε and Res_V(s’) < ε ⇒ V(s), V(s’) won’t change! Label s, s’ as solved

17
Labeled RTDP [Bonet & Geffner 03b]
repeat
  s ← s0
  label all goal states as solved
  repeat // trials
    REVISE s; identify a_greedy
    FIND: sample s’ from T(s, a_greedy, s’)
    s ← s’
  until s is solved
  for all states s in the trial
    try to label s as solved
until s0 is solved

18
LRTDP
terminates in finite time
– due to labeling procedure
anytime
– focuses attention on more probable states
fast convergence
– focuses attention on unconverged states

19
Picking a Successor, Take 2
Labeled RTDP/RTDP: sample s’ ∝ T(s, a_greedy, s’)
– Adv: more probable states are explored first
– Labeling adv: no time wasted on converged states
– Disadv: labeling is a hard constraint
– Disadv: sampling ignores “amount” of convergence
If we knew how much V(s) is expected to change?
– sample s’ ∝ expected change

20
Upper Bounds in SSPs
RTDP/LAO* maintain lower bounds
– call it V_l
Additionally associate an upper bound with s
– V_u(s) ≥ V*(s)
Define gap(s) = V_u(s) – V_l(s)
– low gap(s): a more converged state
– high gap(s): more expected change in its value

21
Backups on Bounds
Recall monotonicity
Backups on lower bound
– continue to be lower bounds
Backups on upper bound
– continue to be upper bounds
Intuitively
– V_l will increase to converge to V*
– V_u will decrease to converge to V*

22
Bounded RTDP [McMahan et al 05]
repeat
  s ← s0
  repeat // trials
    identify a_greedy based on V_l
    FIND: sample s’ ∝ T(s, a_greedy, s’) · gap(s’)
    s ← s’
  until gap(s) < ε
  for all states s in trial, in reverse order
    REVISE s
until gap(s0) < ε
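The FIND step of Bounded RTDP differs from RTDP only in how the successor is sampled. A minimal sketch of that step follows; the function name, the dictionary arguments and the choice to return None when every successor has zero gap are assumptions made for illustration.

```python
import random

def sample_successor_by_gap(outcomes, V_lower, V_upper, rng=random):
    """Sample s' with probability proportional to T(s, a, s') * gap(s').

    outcomes maps each successor s' to T(s, a, s'); gap(s') = V_upper - V_lower.
    Returns None when all successors have zero gap (nothing left to refine).
    """
    weights = {s2: p * (V_upper[s2] - V_lower[s2]) for s2, p in outcomes.items()}
    if sum(weights.values()) <= 0:
        return None
    states = list(weights)
    return rng.choices(states, weights=[weights[s2] for s2 in states])[0]

# Hypothetical example: "a" is far more probable but already converged.
outcomes = {"a": 0.9, "b": 0.1}
V_lower = {"a": 5.0, "b": 0.0}
V_upper = {"a": 5.0, "b": 10.0}
picked = sample_successor_by_gap(outcomes, V_lower, V_upper, random.Random(0))
```

Here the likely successor "a" is never picked because its bounds have already met, which is the behaviour plain probability-proportional sampling cannot give.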

23
RTDP Trial (figure: Bellman backup at s0 over actions a1, a2, a3; each Q_n+1(s0, a) is computed from the successors' J_n values, J_n+1(s0) is the min over the actions, and a_greedy = a2 is followed toward the goal)

24
Greedy “On-Policy” RTDP without execution
Using the current utility values, select the action with the highest expected utility (the greedy action) at each state, until you reach a terminating state. Update the values along this path. Loop back until the values stabilize.

25
Comments
Properties
– if all states are visited infinitely often then J_n → J*
– Only relevant states will be considered
  A state is relevant if the optimal policy could visit it. Notice the emphasis on “optimal policy”: just because a rough neighborhood surrounds the National Mall doesn’t mean that you will need to know what to do in that neighborhood
Advantages
– Anytime: more probable states explored quickly
Disadvantages
– complete convergence is slow!
– no termination condition
Do we care about complete convergence? Think Cpt. Sullenberger

26

27
9/26

28
The “Heuristic”
The value function is (equation not preserved)
They approximate it by (equation not preserved)
Exactly what are they relaxing?
– They are assuming that they can make the best outcome of the action happen
What if we pick the s’ corresponding to the highest P?

29
Monte-Carlo Planning
Consider the Sysadmin problem:
A: Restart(Ser1), Restart(Ser2), Restart(Ser3)
T (from time t-1 to time t):
– P(Ser1^t | Restart^(t-1)(Ser1), Ser1^(t-1), Ser2^(t-1))
– P(Ser2^t | Restart^(t-1)(Ser2), Ser1^(t-1), Ser2^(t-1), Ser3^(t-1))
– P(Ser3^t | Restart^(t-1)(Ser3), Ser2^(t-1), Ser3^(t-1))
R: ∑_i [Ser_i = ↑]

30
Monte-Carlo Planning: Motivation
Characteristics of Sysadmin:
– FH MDP turned SSP_s0 MDP
  Reaching the goal is trivial; determinization approaches not really helpful
– Enormous reachable state space
– High-entropy T (2^|X| outcomes per action, many likely ones)
  Building determinizations can be super-expensive
  Doing Bellman backups can be super-expensive
Try Monte-Carlo planning
– Does not manipulate T or C/R explicitly: no Bellman backups
– Relies on a world simulator: independent of MDP description size

31
UCT: A Monte-Carlo Planning Algorithm
UCT [Kocsis & Szepesvari, 2006] computes a solution by simulating the current best policy and improving it
– Similar principle as RTDP
– But action selection, value updates, and guarantees are different
– Useful when we have
  Enormous reachable state space
  High-entropy T (2^|X| outcomes per action, many likely ones)
  – Building determinizations can be super-expensive
  – Doing Bellman backups can be super-expensive
Success stories:
– Go (thought impossible in ‘05, human grandmaster level at 9x9 in ‘08)
– Klondike Solitaire (wins 40% of games)
– General Game Playing Competition
– Real-Time Strategy Games
– Probabilistic Planning Competition
– The list is growing…

32
Background: Multi-Armed Bandit Problem
(Figure: state s with arms a1, a2, …, ak yielding rewards R(s,a1), R(s,a2), …, R(s,ak))
Select an arm that probably (w/ high probability) has approximately the best expected reward
Use as few simulator calls (or pulls) as possible
Just like an FH MDP with horizon 1!
Slide courtesy of A. Fern

33
UCT Example (figures, built up over slides 33–38; slides courtesy of A. Fern)
Build a state-action tree rooted at the current world state
– Initially the tree is a single leaf
– At a leaf node perform a random rollout with the rollout policy (first rollout: terminal, reward = 1)
– Must select each action at a node at least once (second rollout: terminal, reward = 0)
– When all node actions have been tried once, select the action according to the tree policy, then continue with the rollout policy (this rollout returns 0; figure annotation on the tree: /2)
What is an appropriate tree policy? Rollout policy?

39
UCT Details
Rollout policy:
– Basic UCT uses random
Tree policy:
– Q(s,a): average reward received in current trajectories after taking action a in state s
– n(s,a): number of times action a taken in s
– n(s): number of times state s encountered
– (Selection formula with its exploration term: not preserved)
Exploration term: weighted by a theoretical constant that must be selected empirically in practice. Setting it to distance to horizon guarantees arriving at the optimal policy eventually, if R
Slide courtesy of A. Fern
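For reference, the tree policy in UCT [Kocsis & Szepesvari, 2006] is the UCB1 bandit rule applied at each tree node. The formula below is the standard statement of that rule, supplied here rather than recovered from the slide; c denotes the exploration constant discussed above, and the symbol name is an assumption.

```latex
\pi_{\text{tree}}(s) \;=\; \arg\max_{a} \left[\, Q(s,a) \;+\; c \,\sqrt{\frac{\ln n(s)}{n(s,a)}} \,\right]
```

The second term is large for actions that have been tried rarely relative to n(s), so it shrinks as n(s,a) grows and the rule gradually shifts from exploration toward the empirically best action.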

40
UCT Example (figures, slides 40–41; slides courtesy of A. Fern)
When all node actions have been tried once, select the action according to the tree policy
– The tree policy chooses between a1 and a2 at the current world state
– Figure annotations on the tree: /2, then /2 and 1/3

42
UCT Summary & Theoretical Properties
To select an action at a state s
– Build a tree using N iterations of Monte-Carlo tree search
  Default policy is uniform random up to level L
  Tree policy is based on bandit rule
– Select action that maximizes Q(s,a) (note that this final action selection does not take the exploration term into account, just the Q-value estimate)
The more simulations, the more accurate
– Guaranteed to pick suboptimal actions exponentially rarely after convergence (under some assumptions)
Possible improvements
– Initialize the state-action pairs with a heuristic (need to pick a weight)
– Think of a better-than-random rollout policy
Slide courtesy of A. Fern
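The summary above can be turned into a compact sketch. This is an illustrative reading of basic UCT, not A. Fern's or Kocsis & Szepesvari's code: `simulate_step` is a hypothetical world simulator returning (next_state, reward, done), `depth` plays the role of level L, and the exploration constant `c` is an arbitrary choice.

```python
import math
import random

def uct_choose_action(simulate_step, actions, root, n_iter=200, depth=10, c=1.0, seed=0):
    """Run n_iter Monte-Carlo tree search iterations from root, then pick argmax Q."""
    rng = random.Random(seed)
    Q, n_sa, n_s = {}, {}, {}

    def rollout(s, d):
        # Default policy: uniform random actions for at most d steps
        total = 0.0
        for _ in range(d):
            s, r, done = simulate_step(s, rng.choice(actions), rng)
            total += r
            if done:
                break
        return total

    def search(s, d):
        if d == 0:
            return 0.0
        if s not in n_s:
            # New leaf: add it to the tree and estimate its value by a rollout
            n_s[s] = 0
            for a in actions:
                Q[(s, a)], n_sa[(s, a)] = 0.0, 0
            return rollout(s, d)
        untried = [a for a in actions if n_sa[(s, a)] == 0]
        if untried:
            a = untried[0]  # each action must be selected at least once
        else:
            a = max(actions, key=lambda b: Q[(s, b)]
                    + c * math.sqrt(math.log(n_s[s]) / n_sa[(s, b)]))
        s2, r, done = simulate_step(s, a, rng)
        ret = r + (0.0 if done else search(s2, d - 1))
        n_s[s] += 1
        n_sa[(s, a)] += 1
        Q[(s, a)] += (ret - Q[(s, a)]) / n_sa[(s, a)]  # running average
        return ret

    for _ in range(n_iter):
        search(root, depth)
    # Final selection uses the Q estimate only, without the exploration term
    return max(actions, key=lambda a: Q[(root, a)])

# Hypothetical one-step simulator: "good" pays 1, "bad" pays 0, both terminate.
def toy_step(state, action, rng):
    return "end", (1.0 if action == "good" else 0.0), True

best = uct_choose_action(toy_step, ["bad", "good"], "start")
```

States must be hashable because the tree is stored in dictionaries keyed by state; that is a simplification that only suits small discrete problems.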

43
LRTDP or UCT?
AAAI 2012!

44
Other Advances
Ordering the Bellman backups to maximise information flow
– [Wingate & Seppi’05]
– [Dai & Hansen’07]
Partition the state space and combine value iterations from different partitions
– [Wingate & Seppi’05]
– [Dai & Goldsmith’07]
External memory version of value iteration
– [Edelkamp, Jabbar & Bonet’07]
…

45
Two Models of Evaluating Probabilistic Planning
IPPC (Probabilistic Planning Competition)
– How often did you reach the goal under the given time constraints
  FF-HOP
  FF-Replan
Evaluate on the quality of the policy
– Converging to optimal policy faster
  LRTDP
  mGPT
  Kolobov’s approach

46
Online Action Selection
Off-line policy generation
– First compute the whole policy
  Get the initial state
  Compute the optimal policy given the initial state and the goals
– Then just execute the policy
  Loop
    Do action recommended by the policy
    Get the next state
  Until reaching goal state
– Pros: Can anticipate all problems
– Cons: May take too much time to start executing
Online action selection
– Loop
  Compute the best action for the current state
  Execute it
  Get the new state
– Pros: Provides fast first response
– Cons: May paint itself into a corner
(Figure: timelines contrasting one policy computation followed by execution with interleaved select and execute steps)

47
FF-Replan
Simple replanner
Determinizes the probabilistic problem
– If an action has multiple effect sets with different probabilities
  Select the most likely one, or
  Split the action into multiple actions, one for each setup
Solves for a plan in the determinized problem
(Figure: a plan a1, a2, a3, a4 from S to G, with replanned branches a2, a3, a4 and a5 leading to G)

48
All Outcome Replanning (FFR_A)
(Figure, ICAPS-07: an action with Effect 1 at Probability 1 and Effect 2 at Probability 2 is split into Action 1 with Effect 1 and Action 2 with Effect 2)
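The two determinizations used by FF-Replan can be sketched as follows. The representation of a probabilistic action as a list of (probability, effect) pairs and the "A1-1" naming of split actions are assumptions made for the example; effects are left as opaque labels.

```python
def most_likely_determinization(actions):
    """Keep only the most probable effect of each probabilistic action."""
    return {name: max(outcomes, key=lambda o: o[0])[1]
            for name, outcomes in actions.items()}

def all_outcomes_determinization(actions):
    """Split each action into one deterministic action per possible effect."""
    det = {}
    for name, outcomes in actions.items():
        for i, (_prob, effect) in enumerate(outcomes, start=1):
            det[f"{name}-{i}"] = effect  # the probability is discarded
    return det

# Hypothetical two-action domain with opaque effect labels.
prob_actions = {
    "A1": [(0.7, "left"), (0.3, "right")],
    "A2": [(0.4, "left"), (0.6, "right")],
}
ml = most_likely_determinization(prob_actions)
ao = all_outcomes_determinization(prob_actions)
```

Either result can be handed to a classical planner such as FF; note that both throw the probabilities away, which is the weakness the later hindsight slides address.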

49
1st IPPC & Post-Mortem
IPPC Competitors
– Most IPPC competitors used different approaches for offline policy generation
– One group implemented a simple online “replanning” approach in addition to offline policy generation
  Determinize the probabilistic problem (most-likely vs. all-outcomes)
  Loop
    Get the state S; call a classical planner (e.g. FF) with [S,G] as the problem
    Execute the first action of the plan
– Umpteen reasons why such an approach should do quite badly
Results and Post-mortem
– To everyone’s surprise, the replanning approach wound up winning the competition
– Lots of hand-wringing ensued
  Maybe we should require that the planners really, really use probabilities?
  Maybe the domains should somehow be made “probabilistically interesting”?
– Current understanding:
  No reason to believe that off-line policy computation must dominate online action selection
  The “replanning” approach is just a degenerate case of hindsight optimization

50
Reducing calls to FF
We can reduce calls to FF by memoizing successes
– If we were given s0 and sG as the problem, and solved it using our determinization to get the plan s0-a0-s1-a1-s2-a2-s3…an-sG
– Then in addition to sending the first action a0 to the simulator, we can memoize {si → ai} as the partial policy
  Whenever a new state is given by the simulator, we can see if it is already in the partial policy
Additionally, FF-Replan can consider every state in the partial policy table as a goal state (in that if it reaches them, it knows how to get to the goal state)
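A minimal sketch of the memoization idea follows. The `planner` and `simulator` callables are hypothetical stand-ins for FF on the determinized problem and for the competition simulator; the planner is assumed to always return a non-empty list of (state, action) pairs.

```python
def run_with_memoized_replanning(s0, goal, planner, simulator, max_steps=100):
    """Replanning loop that only calls the planner on states it has not seen."""
    partial_policy = {}
    planner_calls = 0
    s = s0
    for _ in range(max_steps):
        if s == goal:
            break
        if s not in partial_policy:
            planner_calls += 1
            # Memoize every (state, action) pair along the returned plan
            for si, ai in planner(s, goal):
                partial_policy.setdefault(si, ai)
        s = simulator(s, partial_policy[s])
    return s, planner_calls

# Hypothetical chain world 0..3: the planner always proposes moving right.
def chain_planner(s, goal):
    return [(i, "right") for i in range(s, goal)]

def make_slippery_simulator(slips):
    slips = iter(slips)
    def simulator(s, a):
        # A slip leaves the agent where it is; otherwise it moves right.
        return s if next(slips, False) else s + 1
    return simulator

final_state, calls = run_with_memoized_replanning(
    0, 3, chain_planner, make_slippery_simulator([False, True, True, False]))
```

In this run the two slips leave the agent in a state that is already in the partial policy, so the planner is called once instead of three times.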

51
Hindsight Optimization for Anticipatory Planning/Scheduling
Consider a deterministic planning (scheduling) domain, where the goals arrive probabilistically
– Using up resources and/or doing greedy actions may preclude you from exploiting the later opportunities
How do you select actions to perform?
– Answer: If you have a distribution of the goal arrival, then
  Sample goals up to a certain horizon using this distribution
  Now, we have a deterministic planning problem with known goals
  Solve it; do the first action from it
– Can improve accuracy with multiple samples
FF-Hop uses this idea for stochastic planning. In anticipatory planning, the uncertainty is exogenous (it is the uncertain arrival of goals). In stochastic planning, the uncertainty is endogenous (the actions have multiple outcomes)

52
Probabilistic Planning (goal-oriented)
(Figure: two-step tree from the initial state I over Time 1 and Time 2; at each state the actions A1 and A2 have probabilistic outcomes, and left outcomes are more likely; leaves are goal states or dead ends)
Objective: maximize goal achievement

53
Probabilistic Planning: All Outcome Determinization
(Figure, slides 53–54: the same two-step tree from the initial state I, with each probabilistic action A1, A2 split into deterministic actions A1-1, A1-2, A2-1, A2-2, one per outcome; leaves are goal states or dead ends)
Objective: find the goal

55
Problems of FF-Replan and better alternative sampling
FF-Replan’s static determinizations don’t respect probabilities
We need “probabilistic and dynamic determinization”
– Sample future outcomes and determinize in hindsight
– Each future sample becomes a known-future deterministic problem

56
Solving stochastic planning problems via determinizations
Quite an old idea (e.g. envelope extension methods)
What is new is the increasing realization that determinizing approaches provide state-of-the-art performance
– Even for probabilistically interesting domains
Should be a happy occasion

57
Hindsight Optimization (Online Computation of V_HS)
H-horizon future F_H for M = [S,A,T,R]
– Mapping of state, action and time (h

58
Implementation: FF-Hindsight
Constructs a set of futures
Solves the planning problem using the H-horizon futures using FF
Sums the rewards of each of the plans
Chooses action with largest Q_hs value

59
Hindsight Optimization (Online Computation of V_HS)
Pick action a with highest Q(s,a,H) where
– Q(s,a,H) = R(s,a) + Σ_s’ T(s,a,s’) V*(s’,H-1)
Compute V* by sampling
– H-horizon future F_H for M = [S,A,T,R]
  Mapping of state, action and time (h
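A rough sketch of the sampling idea, under stated assumptions: a future is represented as one pre-drawn random number per (state, action, time) triple, drawn lazily; the deterministic problem inside each future is solved by exhaustive search rather than by FF; and outcomes carry their own rewards as (probability, next_state, reward) triples. None of these choices come from the slides.

```python
import random

def outcome_in_future(model, future, rng, s, a, t):
    """Resolve the outcome of (s, a, t) in this future, drawing it on first use."""
    u = future.setdefault((s, a, t), rng.random())
    acc = 0.0
    for p, s2, r in model[s][a]:
        acc += p
        if u < acc:
            return s2, r
    _, s2, r = model[s][a][-1]  # guard against floating-point round-off
    return s2, r

def best_value(model, future, rng, s, t, horizon):
    """Optimal value of the deterministic problem induced by one fixed future."""
    if t >= horizon or s not in model:
        return 0.0
    best = float("-inf")
    for a in model[s]:
        s2, r = outcome_in_future(model, future, rng, s, a, t)
        best = max(best, r + best_value(model, future, rng, s2, t + 1, horizon))
    return best

def hindsight_q_values(model, s, horizon, n_futures=1000, seed=0):
    """Average, over sampled futures, the hindsight-optimal return of each action."""
    rng = random.Random(seed)
    totals = {a: 0.0 for a in model[s]}
    for _ in range(n_futures):
        future = {}
        for a in model[s]:
            s2, r = outcome_in_future(model, future, rng, s, a, 0)
            totals[a] += r + best_value(model, future, rng, s2, 1, horizon)
    return {a: total / n_futures for a, total in totals.items()}

# Hypothetical one-step problem: a sure reward of 1 versus a fair gamble on 3.
toy_model = {"s": {"safe": [(1.0, "end", 1.0)],
                   "gamble": [(0.5, "end", 3.0), (0.5, "end", 0.0)]}}
q_hs = hindsight_q_values(toy_model, "s", horizon=1)
```

Because each future is solved with its outcomes already known, the averaged values are optimistic estimates of the true Q-values for horizons beyond one step.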

60
Probabilistic Planning (goal-oriented)
(Figure: two-step tree from the initial state I over Time 1 and Time 2; at each state the actions A1 and A2 have probabilistic outcomes, and left outcomes are more likely; leaves are goal states or dead ends)
Objective: maximize goal achievement

61
Improvement Ideas
Reuse
– Generated futures that are still relevant
– Scoring for action branches at each step
– If expected outcomes occur, keep the plan
Future generation
– Not just probabilistic
– Somewhat even distribution of the space
Adaptation
– Dynamic width and horizon for sampling
– Actively detect and avoid unrecoverable failures on top of sampling

62
Hindsight Sample 1
(Figure: the same two-step tree from the initial state I with one sampled future fixed; left outcomes are more likely; leaves are goal states or dead ends)
Objective: maximize goal achievement
Score in this sample: A1: 1, A2: 0

63
Exploiting Determinism
Find the longest prefix for all plans
Apply the actions in the prefix continuously until one is not applicable
Resume ZSL/OSL steps

64
Exploiting Determinism
(Figure: three plans from S1 to G, each generated for the chosen action a*)
Plans generated for chosen action, a*
Longest prefix for each plan is identified and executed without running ZSL, OSL or FF!

65
Handling unlikely outcomes: All-outcome Determinization
Assign each possible outcome an action
Solve for a plan
Combine the plan with the plans from the HOP solutions

66
Deterministic Techniques for Stochastic Planning
No longer the Rodney Dangerfield of Stochastic Planning?

67
Determinizations
Most-likely outcome determinization
– Inadmissible
– e.g. if the only path to the goal relies on a less likely outcome of an action
All outcomes determinization
– Admissible, but not very informed
– e.g. a very unlikely outcome of an action leads you straight to the goal

68
Problems with transition systems
Transition systems are a great conceptual tool to understand the differences between the various planning problems
…However, direct manipulation of transition systems tends to be too cumbersome
– The size of the explicit graph corresponding to a transition system is often very large
– The remedy is to provide “compact” representations for transition systems
  Start by explicating the structure of the “states”
  – e.g. states specified in terms of state variables
  Represent actions not as incidence matrices but rather as functions specified directly in terms of the state variables
  – An action will work in any state where some state variables have certain values. When it works, it will change the values of certain (other) state variables

69
State Variable Models
World is made up of states which are defined in terms of state variables
– Can be boolean (or multi-ary or continuous)
States are complete assignments over state variables
– So, k boolean state variables can represent how many states?
Actions change the values of the state variables
– Applicability conditions of actions are also specified in terms of partial assignments over state variables

70
Blocks world
State variables: Ontable(x), On(x,y), Clear(x), hand-empty, holding(x)
Stack(x,y)
  Prec: holding(x), clear(y)
  Eff: on(x,y), ~clear(y), ~holding(x), hand-empty
Unstack(x,y)
  Prec: on(x,y), hand-empty, clear(x)
  Eff: holding(x), ~clear(x), clear(y), ~hand-empty
Pickup(x)
  Prec: hand-empty, clear(x), ontable(x)
  Eff: holding(x), ~ontable(x), ~hand-empty, ~clear(x)
Putdown(x)
  Prec: holding(x)
  Eff: ontable(x), hand-empty, clear(x), ~holding(x)
Initial state: complete specification of T/F values to state variables
– By convention, variables with F values are omitted
Goal state: a partial specification of the desired state variable/value combinations
– Desired values can be both positive and negative
Init: Ontable(A), Ontable(B), Clear(A), Clear(B), hand-empty
Goal: ~clear(B), hand-empty
All the actions here have only positive preconditions, but this is not necessary
STRIPS ASSUMPTION: If an action changes a state variable, this must be explicitly mentioned in its effects
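The blocks world model above maps directly onto sets of true atoms. The sketch below is an illustration only: the `Action` tuple, the string encoding of atoms and the helper names are assumptions, and the preconditions and effects of Pickup and Stack are transcribed from the slide as given.

```python
from typing import FrozenSet, NamedTuple

class Action(NamedTuple):
    name: str
    pre: FrozenSet[str]     # atoms that must be true
    add: FrozenSet[str]     # atoms made true
    delete: FrozenSet[str]  # atoms made false (the "~" effects)

def pickup(x):
    return Action(f"Pickup({x})",
                  frozenset({"hand-empty", f"clear({x})", f"ontable({x})"}),
                  frozenset({f"holding({x})"}),
                  frozenset({f"ontable({x})", "hand-empty", f"clear({x})"}))

def stack(x, y):
    return Action(f"Stack({x},{y})",
                  frozenset({f"holding({x})", f"clear({y})"}),
                  frozenset({f"on({x},{y})", "hand-empty"}),
                  frozenset({f"clear({y})", f"holding({x})"}))

def apply(state, action):
    """STRIPS assumption: only atoms named in the effects change."""
    if not action.pre <= state:
        raise ValueError(f"{action.name} is not applicable")
    return (state - action.delete) | action.add

def satisfies(state, positive, negative):
    """A goal is a partial assignment: some atoms true, some atoms false."""
    return positive <= state and not (negative & state)

# Init and Goal from the slide; atoms with F values are simply absent.
init = frozenset({"ontable(A)", "ontable(B)", "clear(A)", "clear(B)", "hand-empty"})
final = apply(apply(init, pickup("A")), stack("A", "B"))
goal_reached = satisfies(final, {"hand-empty"}, {"clear(B)"})
```

The two-step plan Pickup(A), Stack(A,B) reaches the goal ~clear(B), hand-empty; the initial state does not satisfy it because clear(B) is true there.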

71
Why is STRIPS representation compact? (compared with explicit transition systems)
In explicit transition systems, actions are represented as state-to-state transitions, wherein each action is represented by an incidence matrix of size |S|x|S|
In the state-variable model, actions are represented only in terms of the state variables whose values they care about, and whose values they affect
Consider a state space of 1024 states. It can be represented by log2(1024) = 10 state variables. If an action needs variable v1 to be true and makes v7 false, it can be represented by just 2 bits (instead of a 1024x1024 matrix)
– Of course, if the action has a complicated mapping from states to states, in the worst case the action representation will be just as large
– The assumption being made here is that the actions will have effects on a small number of state variables
(Figure: spectrum of representations: transition rep (atomic), STRIPS rep (relational/propositional), situation calculus (first order))
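The arithmetic can be checked directly. In this small sketch, the variable names v1..v10 and the dictionary encoding of the action are assumptions for illustration; k boolean variables give 2^k complete assignments, and one two-entry action description applies across half of them.

```python
from itertools import product

k = 10
variables = [f"v{i}" for i in range(1, k + 1)]
num_states = 2 ** k  # 10 boolean variables cover 1024 states

# The action from the slide: needs v1 true, makes v7 false.
action = {"pre": {"v1": True}, "eff": {"v7": False}}

def applicable(state, act):
    return all(state[v] == val for v, val in act["pre"].items())

def apply_action(state, act):
    new_state = dict(state)
    new_state.update(act["eff"])  # untouched variables keep their values
    return new_state

# Enumerate the explicit state space only to show what the compact form covers.
all_states = [dict(zip(variables, bits)) for bits in product([False, True], repeat=k)]
applicable_count = sum(applicable(s, action) for s in all_states)
matrix_entries = num_states * num_states
```

The action description mentions two variables, yet it defines a transition out of 512 of the 1024 states; an explicit incidence matrix would need 1,048,576 entries.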

72
Factored Representations for MDPs: Actions
Actions can be represented directly in terms of their effects on the individual state variables (fluents)
– Write a Bayes network relating the value of fluents at the state before and after the action
  Bayes networks representing fluents at different time points are called “Dynamic Bayes Networks”
  We look at 2TBN (2-time-slice dynamic Bayes nets)
– The CPTs of the BNs can be represented compactly too!
Go further by using the STRIPS assumption
– Fluents not affected by the action are not represented explicitly in the model
– Called the Probabilistic STRIPS Operator (PSO) model

73
Action CLK

74

75

76
Factored Representations: Reward, Value and Policy Functions
Reward functions can be represented in factored form too. Possible representations include
– Decision trees (made up of fluents)
– ADDs (Algebraic decision diagrams)
Value functions are like reward functions (so they too can be represented similarly)
Bellman update can then be done directly using factored representations

77

78
SPUDD’s use of ADDs

79
Direct manipulation of ADDs in SPUDD

80
Ideas for Efficient Algorithms
Use heuristic search (and reachability information)
– LAO*, RTDP
Use execution and/or simulation
– “Actual execution”: reinforcement learning (the main motivation for RL is to “learn” the model)
– “Simulation”: simulate the given model to sample possible futures (policy rollout, hindsight optimization, etc.)
Use “factored” representations
– Factored representations for actions, reward functions, values and policies
– Directly manipulating factored representations during the Bellman update

81
Probabilistic Planning
– The competition (IPPC)
– The action language (PPDDL)

82

83
Not ergodic

84

85

86
Reducing Heuristic Computation Cost by exploiting factored representations
The heuristics computed for a state might give us an idea about the heuristic value of other “similar” states
– Similarity is possible to determine in terms of the state structure
Exploit overlapping structure of heuristics for different states
– E.g. SAG idea for McLUG
– E.g. Triangle tables idea for plans (c.f. Kolobov)

87
A Plan is a Terrible Thing to Waste
Suppose we have a plan
– s0-a0-s1-a1-s2-a2-s3…an-sG
– We realize that this tells us not just the estimated value of s0, but also of s1, s2, …, sn
– So we don’t need to compute the heuristic for them again
Is that all?
– If we have states and actions in factored representation, then we can explain exactly what aspects of si are relevant for the plan’s success
– The “explanation” is a proof of correctness of the plan
  » Can be based on regression (if the plan is a sequence) or causal proof (if the plan is a partially ordered one)
  » The explanation will typically be just a subset of the literals making up the state
– That means the plan suffix from si may actually be relevant in many more states that are consistent with that explanation

88
Triangle Table Memoization
Use triangle tables / memoization
(Figure: a three-block problem over blocks A, B, C, and below it a two-block problem over blocks A, B)
If the above problem is solved, then we don’t need to call FF again for the below

89
Explanation-based Generalization (of Successes and Failures)
Suppose we have a plan P that solves a problem [S, G]
We can first find out what aspects of S this plan actually depends on
– Explain (prove) the correctness of the plan, and see which parts of S actually contribute to this proof
– Now you can memoize this plan for just that subset of S
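For a sequential plan with positive preconditions, the explanation can be computed by regressing the goal backwards through the plan. The sketch below assumes actions given as (preconditions, add list, delete list) triples over string atoms and handles positive goals only; the blocks world plan Pickup(A), Stack(A,B) is used as a hypothetical example.

```python
def regress(needed, pre, add, delete):
    """Atoms that must hold before the action so that `needed` holds after it."""
    if needed & delete:
        raise ValueError("action destroys a needed atom; the plan is not a proof")
    return (needed - add) | pre

def explain(plan, goal):
    """Regress the goal through the whole plan, last action first."""
    needed = frozenset(goal)
    for pre, add, delete in reversed(plan):
        needed = regress(needed, frozenset(pre), frozenset(add), frozenset(delete))
    return needed

# Hypothetical plan for the goal on(A,B): Pickup(A), then Stack(A,B).
pickup_a = ({"hand-empty", "clear(A)", "ontable(A)"},
            {"holding(A)"},
            {"ontable(A)", "hand-empty", "clear(A)"})
stack_a_b = ({"holding(A)", "clear(B)"},
             {"on(A,B)", "hand-empty"},
             {"clear(B)", "holding(A)"})
explanation = explain([pickup_a, stack_a_b], {"on(A,B)"})

# The memoized plan applies in any state that contains the explanation.
other_state = frozenset({"ontable(A)", "clear(A)", "clear(B)", "hand-empty",
                         "on(B,C)", "ontable(C)"})
reusable = explanation <= other_state
```

The explanation never mentions where B sits, so the same plan can be reused when B is on C instead of on the table, without another planner call.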

90
Relaxations for Stochastic Planning
Determinizations can also be used as a basis for heuristics to initialize the V for value iteration [mGPT; GOTH etc]
Heuristics come from relaxation
We can relax along two separate dimensions:
– Relax –ve interactions
  Consider +ve interactions alone using relaxed planning graphs
– Relax uncertainty
  Consider determinizations
– Or a combination of both!

91
Solving Determinizations
If we relax –ve interactions
– Then compute relaxed plan
  Admissible if optimal relaxed plan is computed
  Inadmissible otherwise
If we keep –ve interactions
– Then use a deterministic planner (e.g. FF/LPG)
  Inadmissible unless the underlying planner is optimal

92
Dimensions of Relaxation
(Figure: two axes, uncertainty and negative interactions, with increasing consideration along each; the relaxed plan heuristic, McLUG and FF/LPG are placed in this space)
Reducing uncertainty
– Bound the number of stochastic outcomes: stochastic “width”
– Limited width stochastic planning?

93
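Dimensions of Relaxation
(Table: amount of uncertainty considered (None, Some, Full) against amount of –ve interactions considered (None, Some, Full))
– –ve interactions None: Relaxed Plan, McLUG
– –ve interactions Some: (no entries)
– –ve interactions Full: FF/LPG, Limited width Stoch Planning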

94
Expressiveness v. Cost
(Figure: node expansions v. heuristic computation cost, with axes “Nodes Expanded” and “Computation Cost”, comparing h = 0, McLUG, FF-Replan (FF_R), FF and limited width stochastic planning)

95
