1 5/6: Summary and Decision Theoretic Planning
- Last homework socket opened (two more problems to be added: Scheduling, MDPs)
- Project 3 due today
- Sapa homework points sent.

2 Current Grades…

3 Sapa homework grades

4 "Classical Planning" assumes a Static, Deterministic, Observable, Instantaneous, Propositional world. Relaxing each assumption calls for different machinery:
- Dynamic: Replanning / Situated Plans
- Durative: Temporal Reasoning (Semi-MDP Policies when also stochastic)
- Continuous/Numeric: Constraint reasoning (LP/ILP)
- Stochastic: Contingent/Conformant Plans, Interleaved execution; MDP Policies
- Partially Observable: Contingent/Conformant Plans, Interleaved execution; POMDP Policies

5 All that water under the bridge…
- Actions, Proofs, Planning Strategies (Week 2; 1/28; 1/30)
- More PO planning, dealing with partially instantiated actions, and start of deriving heuristics (Week 3; 2/4; 2/6)
- Reachability Heuristics contd. (2/11; 2/13)
- Heuristics for Partial order planning; Graphplan search (2/18; 2/20)
- EBL for Graphplan; Solving the planning graph by compilation strategies (2/25; 2/27)
- Compilation to SAT, ILP and Naive Encoding (3/4; 3/6)
- Knowledge-based planners
- Metric-Temporal Planning: Issues and Representation
- Search Techniques; Heuristics
- Tracking multiple objective heuristics (cost propagation); partialization; LPG
- Temporal Constraint Networks; Scheduling
- Incompleteness and Uncertainty; Belief States; Conformant planning (4/22; 4/24)
- Conditional Planning (4/29; 5/1)
- Decision Theoretic Planning…

6 Problems, Solutions, Success Measures: 3 orthogonal dimensions
Problems:
- Incompleteness in the initial state
- Un/partial observability of states
- Non-deterministic actions
- Uncertainty in state or effects
- Complex reward functions (allowing degrees of satisfaction)
Solutions:
- Conformant Plans: Don't look, just do (sequences)
- Contingent/Conditional Plans: Look, and based on what you see, do; look again (directed acyclic graphs)
- Policies: If in (belief) state S, do action a ((belief) state -> action tables)
Success Measures:
- Deterministic Success: Must reach a goal state with probability 1
- Probabilistic Success: Must succeed with probability >= k (0 <= k <= 1)
- Maximal Expected Reward: Maximize the expected reward (an optimization problem)
MDPs correspond to stochastic actions with policies and expected-reward measures; POMDPs add partial observability.

7 The Trouble with Probabilities
Once we have probabilities associated with the action effects, as well as with the constituents of a belief state:
- The belief space size explodes; it becomes infinitely large. We may be able to find a plan if one exists, but exhaustively searching to prove that no plan exists is out of the question.
- Conformant probabilistic planning is known to be semi-decidable, so solving POMDPs is semi-decidable too.
- Probabilities introduce the notion of "partial satisfaction" and the "expected value" of a plan (rather than a 0-1 valuation).

8 MDPs are generalizations of Markov chains where transitions are under the control of an agent; HMMs are correspondingly generalized to POMDPs. They are useful as normative modeling tools in many places: planning, (reinforcement) learning, multi-agent interactions, etc.

9 [aka action cost C(a,s)] If the transition matrix M_ij is not known a priori, then we have a reinforcement learning scenario.

10 MDPs vs. Markov Chains
- Markov chains are transition systems where transitions happen automatically.
- HMMs (hidden Markov models) are Markov chains where the current state is only partially observable; they have been very useful in many different areas. Their generalization to MDPs leads to POMDPs.
- MDPs are generalizations of Markov chains where transitions are under the control of an agent (a minimal representation is sketched below).
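
To make the pieces concrete, here is a minimal sketch of an MDP as plain Python data. The state/action names and numbers are made up for illustration (later sketches reuse this toy model); T plays the role of the M_ij transition matrix and R is the reward, equivalently -C(a,s) in cost terms.

```python
# A toy MDP: states, actions, transition function, reward function, and a simulator.
import random

states = ["s0", "s1", "goal"]
actions = ["a", "b"]

# T[s][a] is a list of (next_state, probability) pairs.
T = {
    "s0":   {"a": [("s1", 0.8), ("s0", 0.2)],   "b": [("s0", 1.0)]},
    "s1":   {"a": [("goal", 0.9), ("s0", 0.1)], "b": [("s1", 1.0)]},
    "goal": {"a": [("goal", 1.0)],              "b": [("goal", 1.0)]},
}

def R(s, a):
    # Reaching for the goal from s1 pays off; everything else costs a little.
    return 10.0 if s == "s1" and a == "a" else -1.0

def sample_next_state(s, a):
    """Simulate one stochastic transition (useful for RL / RTDP-style methods)."""
    next_states, probs = zip(*T[s][a])
    return random.choices(next_states, weights=probs)[0]
```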

11

12 Policies change with rewards.

13

14

15

16 Why are values coming down first? Why are some states reaching their optimal value faster? Updates can be done synchronously OR asynchronously; convergence is guaranteed as long as each state is updated infinitely often.
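
A rough sketch of the synchronous version of these value-iteration updates, assuming the toy states/actions/T/R from the earlier sketch plus a discount factor gamma (an assumption here). An asynchronous variant would simply update U in place, state by state, in any order.

```python
gamma = 0.95  # discount factor (an assumption for the sketch)

def bellman_backup(s, U):
    # max over actions of: immediate reward + discounted expected value of successors
    return max(R(s, a) + gamma * sum(p * U[ns] for ns, p in T[s][a]) for a in actions)

def value_iteration(epsilon=1e-6):
    U = {s: 0.0 for s in states}
    while True:
        new_U = {s: bellman_backup(s, U) for s in states}   # one synchronous sweep
        delta = max(abs(new_U[s] - U[s]) for s in states)
        U = new_U
        if delta < epsilon * (1 - gamma) / gamma:           # standard stopping rule
            return U
```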

17 Policies converge earlier than values. Given a utility vector U_i we can compute the greedy policy pi_{U_i}. The policy loss of pi is ||U^pi - U*|| (the max-norm difference of two vectors is the maximum amount by which they differ on any dimension). So search in the space of policies.
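
A small sketch of the two operations mentioned here: extracting the greedy policy from a utility vector, and the max-norm distance used in the policy-loss bound. It assumes the toy model and the gamma from the earlier sketches.

```python
def greedy_policy(U):
    # For each state, pick the action with the highest one-step lookahead value under U.
    return {s: max(actions,
                   key=lambda a: R(s, a) + gamma * sum(p * U[ns] for ns, p in T[s][a]))
            for s in states}

def max_norm(U1, U2):
    return max(abs(U1[s] - U2[s]) for s in states)  # ||U1 - U2|| in the slide's sense
```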

18 We can either solve the linear equations exactly, or solve them approximately by running the value iteration update a few times (for a fixed policy the update won't have the max factor).
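
A sketch of the exact alternative: for a fixed policy pi the update has no max, so the values satisfy the linear system U(s) = R(s, pi(s)) + gamma * sum over s' of P(s'|s, pi(s)) U(s'), i.e. (I - gamma * P_pi) U = R_pi. Numpy is assumed; the toy states/T/R and gamma come from the earlier sketches.

```python
import numpy as np

def evaluate_policy(pi):
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    P_pi = np.zeros((n, n))   # transition matrix under pi
    R_pi = np.zeros(n)        # immediate rewards under pi
    for s in states:
        R_pi[idx[s]] = R(s, pi[s])
        for ns, p in T[s][pi[s]]:
            P_pi[idx[s], idx[ns]] += p
    U = np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)   # exact solve, no iteration
    return {s: U[idx[s]] for s in states}

# Usage: U_pi = evaluate_policy(greedy_policy(U)) for some current utility vector U.
```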

19 The Big Computational Issues in MDPs
- MDP models are quite easy to specify and understand conceptually. The big issues are "compactness" and "efficiency".
- Policy construction is polynomial in the size of the state space (which is bad news…!). For POMDPs, the state space is the belief space, which is infinite.
- Compact representations are needed for actions, the reward function, the policy, and the value function.
- Efficient methods are needed for policy/value updates.
- Representations that have been tried include decision trees, neural nets, Bayesian nets, and ADDs (algebraic decision diagrams, a generalization of BDDs where the leaf nodes can have real-valued valuations instead of T/F).

20 SPUDD: Using ADDs to Represent Actions, Rewards and Policies

21 MDPs and Planning Problems
- FOMDPs (fully observable MDPs) can be used to model planning problems with fully observable states but non-deterministic transitions.
- POMDPs (partially observable MDPs), a generalization of the MDP framework where the current state can only be partially observed, are needed to handle planning problems with partial observability.
- POMDPs can be solved by converting them into FOMDPs, but the conversion takes us from world states to belief states (a continuous space).

22 SSPP: Stochastic Shortest Path Problem (an MDP with Init and Goal states)
- MDPs don't have a notion of an "initial" and a "goal" state (a process orientation instead of a "task" orientation).
  - Goals are, in a sense, modeled by reward functions, which allows pretty expressive goals (in theory).
  - Normal MDP algorithms don't use initial state information (since the policy is supposed to cover the entire state space anyway). One could instead consider "envelope extension" methods: compute a "deterministic" plan (which gives the policy for some of the states), then extend the policy to other states that are likely to happen during execution; RTDP methods also fall here.
- SSPPs are a special case of MDPs where (a) the initial state is given, (b) there are absorbing goal states, and (c) actions have costs, with goal states having zero cost.
- A proper policy for an SSPP is a policy that is guaranteed to ultimately put the agent in one of the absorbing goal states.
- For SSPPs, it is worth finding a partial policy that only covers the "relevant" states (states reachable from the initial state under an optimal policy).
  - Value/Policy Iteration don't consider the notion of relevance, so consider "heuristic state search" algorithms, where the heuristic can be seen as an "estimate" of the value of a state: (L)AO* or RTDP algorithms (or envelope extension methods).

23 AO* search for solving SSP problems
Main issues:
- The cost of a node is the expected cost of its children.
- The AND tree can have LOOPS, which makes cost backup complicated.
Intermediate nodes are given admissible heuristic estimates; these can be just the shortest paths (or their estimates).

24 LAO*--turning bottom-up labeling into a full DP

25 RTDP Approach: Interleave Planning & Execution (Simulation)
Start from the current state S. Expand the tree (either uniformly to k levels, or non-uniformly, going deeper in some branches). Evaluate the leaf nodes; back up the values to S. Update the stored value of S. Pick the action that leads to the best value and do it (or simulate it). Loop back.
Leaf nodes are evaluated using their "cached" values: if a node has been evaluated in a previous RTDP iteration, use its remembered value; otherwise use heuristics to estimate it, e.g. (a) immediate reward values or (b) reachability heuristics.
This is somewhat like depth-limited game playing (expectimax) -- but who is the game against?
Can also do "reinforcement learning" this way: in RL the transition probabilities M_ij are not known.
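
A minimal RTDP-flavoured trial over the toy model from the earlier sketches (T, R, actions, gamma, sample_next_state). Values are cached lazily; unseen states are seeded with a heuristic h(), here a trivial placeholder that is purely an assumption for the sketch.

```python
def h(s):
    return 0.0   # placeholder heuristic seed for unvisited states

V = {}  # lazily filled value cache

def q_value(s, a):
    return R(s, a) + gamma * sum(p * V.setdefault(ns, h(ns)) for ns, p in T[s][a])

def rtdp_trial(s, max_depth=50):
    for _ in range(max_depth):
        if s == "goal":
            break
        a = max(actions, key=lambda act: q_value(s, act))  # greedy action under current V
        V[s] = q_value(s, a)                               # back up the visited state
        s = sample_next_state(s, a)                        # simulate (or execute) the action
```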

26 Greedy "On-Policy" RTDP without execution
Using the current utility values, select the action with the highest expected utility (the greedy action) at each state, until you reach a terminating state. Update the values along this path. Loop back until the values stabilize.

27

28 Envelope Extension Methods
- For each action, take the most likely outcome and discard the rest (see the sketch below).
- Find a plan (deterministic path) from the Init to the Goal state. This is a (very partial) policy covering just the states that fall on the maximum-probability state sequence.
- Consider the states that are most likely to be encountered while traveling this path, and find a policy for those states too.
- The tricky part is to show that we can converge to the optimal policy.
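
A sketch of the first two steps: keep only the most likely outcome of each action (determinization), then run a stand-in breadth-first "planner" in that deterministic model. It assumes the toy T and actions from the earlier sketches; a real envelope-extension method would call an actual classical planner here.

```python
from collections import deque

def most_likely_outcome(s, a):
    # Determinization: drop all but the highest-probability outcome of (s, a).
    return max(T[s][a], key=lambda outcome: outcome[1])[0]

def deterministic_plan(start, goal):
    frontier, seen = deque([(start, [])]), {start}
    while frontier:
        s, path = frontier.popleft()
        if s == goal:
            return path  # the states along this path seed the initial envelope
        for a in actions:
            ns = most_likely_outcome(s, a)
            if ns not in seen:
                seen.add(ns)
                frontier.append((ns, path + [(s, a)]))
    return None
```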

29 Incomplete observability (the dreaded POMDPs)
- To model partial observability, all we need to do is look at the MDP in the space of belief states (belief states are fully observable even when world states are not). The policy then maps belief states to actions.
- In practice, this causes (humongous) problems:
  - The space of belief states is "continuous" (even if the underlying world is discrete and finite).
  - Even approximate policies are hard to find (PSPACE-hard).
  - Problems with a few dozen world states are hard to solve currently.
  - "Depth-limited" exploration (such as that done in adversarial games) is often the only option…
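
A sketch of the belief-state machinery over the toy model (states and T from the earlier sketches): after doing action a and receiving observation o, the new belief is the old belief pushed through the transition model, reweighted by an observation model, and renormalized. The observation model O(o, s) here is made up purely for illustration.

```python
def O(o, s):
    # Toy observation model (an assumption): the agent sees "at_goal" iff it is in "goal".
    return 1.0 if (o == "at_goal") == (s == "goal") else 0.0

def belief_update(belief, a, o):
    # Prediction: push the belief through the transition model for action a.
    predicted = {s: 0.0 for s in states}
    for s, b in belief.items():
        for ns, p in T[s][a]:
            predicted[ns] += b * p
    # Correction: reweight by the likelihood of the observation, then normalize.
    weighted = {s: predicted[s] * O(o, s) for s in states}
    total = sum(weighted.values())
    return {s: v / total for s, v in weighted.items()} if total > 0 else predicted
```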

30

31

