
1 4/1 Agenda: Markov Decision Processes (& Decision Theoretic Planning)

2 MDPs as a Way of Reducing Planning Complexity
MDPs provide a normative basis for talking about optimal plans (policies) in the context of stochastic actions and complex reward models.
Optimal policies for MDPs can be computed in polynomial time.
In contrast, even classical planning is NP-complete or PSPACE-complete (depending on whether the plans are of polynomial or exponential length).
So: convert planning problems to MDP problems, and we get polynomial-time performance.
–To see this, note that the sorting problem can be written as a planning problem, but sorting takes only polynomial time. Thus the inherent complexity of planning is only polynomial; all the MDP conversion does is let planning exhibit its inherent polynomial complexity.

3 MDP Complexity: The Real Deal
Complexity results are stated in terms of the size of the input (measured in some way).
MDP complexity results are typically in terms of the size of the state space.
Planning complexity results are typically in terms of the factored input (i.e., the state variables).
The state space is already exponential in the number of state variables. So polynomial in the state space implies exponential in the factored representation.
–More depressingly, optimal policy construction is exponential and undecidable, respectively, for POMDPs with finite and infinite horizons, even with input size measured in terms of the explicit state space.
So clearly, we don't compile planning problems to the MDP model for efficiency…

4 Forget your homework grading. Forget your project grading. We’ll make it look like you remembered

5 Agenda
General (FO)MDP model
–Action (transition) model; action cost model
–Reward model
–Histories; horizon
–Policies
–Optimal value and policy
–Value iteration / policy iteration / RTDP
Special cases of the MDP model relevant to planning
–Pure cost models (goal states are absorbing)
–Reward/cost models
–Over-subscription models
–Connections to heuristic search
–Efficient approaches for policy construction

6 Markov Decision Process (MDP)
S: a set of states
A: a set of actions
Pr(s'|s,a): transition model (aka M^a_{s,s'})
C(s,a,s'): cost model
G: set of goals
s_0: start state
γ: discount factor
R(s,a,s'): reward model
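As a concrete illustration (not from the slides), here is one minimal way these components might be bundled in Python; the class name, field layout, and toy numbers are assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class MDP:
    """A finite MDP: states, actions, transition model, rewards/costs, goals, start state, discount."""
    states: list                                  # S
    actions: list                                 # A
    transitions: dict                             # Pr(s'|s,a) as {(s, a): {s': prob}}
    rewards: dict                                 # R(s,a,s') as {(s, a, s'): reward}
    costs: dict = field(default_factory=dict)     # C(s,a,s'), optional
    goals: set = field(default_factory=set)       # G (absorbing goal states)
    start: object = None                          # s_0
    gamma: float = 0.9                            # discount factor γ

# A made-up two-state, two-action example used in later sketches:
toy = MDP(
    states=["s1", "s2"],
    actions=["a", "b"],
    transitions={("s1", "a"): {"s1": 0.2, "s2": 0.8},
                 ("s1", "b"): {"s1": 1.0},
                 ("s2", "a"): {"s2": 1.0},
                 ("s2", "b"): {"s1": 0.5, "s2": 0.5}},
    rewards={("s1", "a", "s2"): 1.0},             # unspecified triples default to reward 0
    start="s1",
    gamma=0.9,
)
```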

7 Objective of a Fully Observable MDP
Find a policy π : S → A which optimises
–minimises expected cost to reach a goal
–maximises expected reward
–maximises expected (reward − cost)
given a ____ horizon
–finite / infinite / indefinite
assuming full observability
–discounted or undiscounted

8 Histories; Value of Histories; (expected) Value of a policy; Optimal Value & Bellman Principle

9 Policy evaluation vs. Optimal Value Function in Finite vs. Infinite Horizon

10 [Can generalize to have action costs C(a,s).] If the transition matrix M_ij is not known a priori, then we have a reinforcement learning scenario.

11 What does a solution to an MDP look like?
The solution should tell us the optimal action to take in each state (called a "policy").
–A policy is a function from states to actions (*see the finite horizon case below*).
–It is not a sequence of actions anymore; this is needed because of the non-deterministic actions.
–If there are |S| states and |A| actions that we can take in each state, then there are |A|^|S| policies.
How do we get the best policy?
–Pick the policy that gives the maximal expected reward.
–For each policy π: simulate the policy (take the actions it suggests) to get behavior traces, evaluate the behavior traces, and take their average value (see the sketch below).
We will concentrate on infinite horizon problems. (Infinite horizon doesn't necessarily mean that all behavior traces are infinite; they could be finite and end in a sink state.)
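A minimal sketch of the simulate-and-average idea just described, written against the hypothetical MDP class shown earlier; the function names, trace cutoff, and sample count are assumptions, not part of the slides.

```python
import random

def simulate_trace(mdp, policy, max_steps=100):
    """Follow the policy from the start state and return the discounted sum of rewards along one trace."""
    s, total, discount = mdp.start, 0.0, 1.0
    for _ in range(max_steps):                    # cut off long traces for the estimate
        a = policy[s]
        dist = mdp.transitions[(s, a)]
        s_next = random.choices(list(dist), weights=list(dist.values()))[0]
        total += discount * mdp.rewards.get((s, a, s_next), 0.0)
        discount *= mdp.gamma
        if s_next in mdp.goals:                   # absorbing/sink state ends the trace
            break
        s = s_next
    return total

def evaluate_policy_mc(mdp, policy, n_traces=1000):
    """Estimate the expected utility of a policy by averaging sampled behavior traces."""
    return sum(simulate_trace(mdp, policy) for _ in range(n_traces)) / n_traces

# e.g. evaluate_policy_mc(toy, {"s1": "a", "s2": "a"})
```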

12 Horizon & Policy
How long should behavior traces be?
–Each trace is no longer than k (finite horizon case). The policy will be horizon-dependent: the optimal action depends not just on what state you are in, but on how far away your horizon is. E.g., financial portfolio advice for yuppies vs. retirees.
–No limit on the size of the trace (infinite horizon case). The policy is not horizon-dependent.
We will concentrate on infinite horizon problems. (Infinite horizon doesn't necessarily mean that all behavior traces are infinite; they could be finite and end in a sink state.)
"If you are twenty and not a liberal, you are heartless; if you are sixty and not a conservative, you are mindless." --Churchill

13 How to handle unbounded state sequences?
If we don't have a horizon, then we can have potentially infinitely long state sequences. There are three ways to handle them:
1. Use a discounted reward model (the i-th state in the sequence contributes only γ^i R(s_i)).
2. Assume that the policy is proper (i.e., each sequence terminates in an absorbing state with non-zero probability).
3. Consider the "average reward per step".

14 How to evaluate a policy?
Step 1: Define the utility of a sequence of states in terms of their rewards.
–Assume "stationarity" of preferences: if you prefer future f1 to f2 starting tomorrow, you should prefer them the same way if they start today.
–Then there are only two reasonable ways to define the utility of a sequence of states:
–U(s_1, s_2, …, s_n) = Σ_i R(s_i)
–U(s_1, s_2, …, s_n) = Σ_i γ^i R(s_i)   (0 ≤ γ < 1)
–With discounting, the maximum utility is bounded from above by R_max/(1 − γ) (see the derivation below).
Step 2: The utility of a policy π is the expected utility of the behaviors exhibited by an agent following it: E[ Σ_{t=0}^∞ γ^t R(s_t) | π ].
Step 3: The optimal policy π* is the one that maximizes this expectation: argmax_π E[ Σ_{t=0}^∞ γ^t R(s_t) | π ].
–Since there are only |A|^|S| different policies, you can evaluate them all in finite time (ha ha…).
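A short derivation of the R_max/(1 − γ) bound mentioned above, using the geometric series:

```latex
U(s_0, s_1, \ldots) \;=\; \sum_{i=0}^{\infty} \gamma^{i} R(s_i)
\;\le\; \sum_{i=0}^{\infty} \gamma^{i} R_{\max}
\;=\; \frac{R_{\max}}{1-\gamma}
\qquad (0 \le \gamma < 1).
```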

15 Utility of a State
The (long term) utility of a state s with respect to a policy π is the expected value of all state sequences starting with s:
–U^π(s) = E[ Σ_{t=0}^∞ γ^t R(s_t) | π, s_0 = s ]
The true utility of a state s is just its utility w.r.t. the optimal policy: U(s) = U^{π*}(s).
Thus U and π* are closely related:
–π*(s) = argmax_a Σ_s' M^a_{ss'} U(s')
As are the utilities of neighboring states (the Bellman equation):
–U(s) = R(s) + γ max_a Σ_s' M^a_{ss'} U(s')
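A small worked instance of the Bellman equation (the numbers are made up for illustration): suppose state s has reward R(s) = 1, discount γ = 0.9, and two actions; action a reaches s_1 or s_2 with probability 0.5 each, action b reaches s_2 with probability 1, and U(s_1) = 10, U(s_2) = 2. Then:

```latex
U(s) = R(s) + \gamma \max_a \sum_{s'} M^a_{ss'}\, U(s')
     = 1 + 0.9 \,\max\{\, 0.5\cdot 10 + 0.5\cdot 2,\;\; 1.0\cdot 2 \,\}
     = 1 + 0.9 \cdot 6 = 6.4,
\qquad \pi^*(s) = a .
```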

16 Think of these as related to h*() values… U* is called the (optimal) value function: the maximal expected utility (value), assuming the optimal policy is followed.

17 Optimal policies depend on rewards.

18 Bellman Equations as a Basis for Computing the Optimal Policy
Qn: Is there a simpler way than having to evaluate |A|^|S| policies?
–Yes. The optimal value and the optimal policy are related by the Bellman equations:
–U(s) = R(s) + γ max_a Σ_s' M^a_{ss'} U(s')
–π*(s) = argmax_a Σ_s' M^a_{ss'} U(s')
The equations can be solved exactly through
–"value iteration" (iteratively compute U, then compute π*),
–"policy iteration" (iterate over policies),
–or solved approximately through "real-time dynamic programming" (a sketch of value iteration follows below).
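A minimal value iteration sketch (not the course's code), written against the hypothetical MDP class introduced earlier. It uses that class's R(s,a,s') reward form rather than the slides' R(s), assumes every (state, action) pair has a transition entry, and anticipates the max-norm termination test of slide 22.

```python
def value_iteration(mdp, epsilon=1e-6):
    """Apply Bellman backups until the value vector changes by less than epsilon in max norm."""
    U = {s: 0.0 for s in mdp.states}
    while True:
        U_new = {}
        for s in mdp.states:
            # backup: U(s) <- max_a sum_s' Pr(s'|s,a) * (R(s,a,s') + gamma * U(s'))
            U_new[s] = max(
                sum(p * (mdp.rewards.get((s, a, s2), 0.0) + mdp.gamma * U[s2])
                    for s2, p in mdp.transitions[(s, a)].items())
                for a in mdp.actions
            )
        if max(abs(U_new[s] - U[s]) for s in mdp.states) < epsilon:   # ||U_i - U_{i+1}|| < epsilon
            return U_new
        U = U_new

# e.g. value_iteration(toy)
```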

19 U(i) = R(i) + γ max_a Σ_j M^a_{ij} U(j)

20 Value Iteration Demo
http://www.cs.ubc.ca/spider/poole/demos/mdp/vi.html
Things to note:
–The way the values change (states far from absorbing states may first decrease and then increase their values).
–The difference in convergence speed between policies and values.

21 Why are values coming down first? Why are some states reaching their optimal value faster?
Updates can be done synchronously OR asynchronously; convergence is guaranteed as long as each state is updated infinitely often.

22 Terminating Value Iteration
The basic idea is to terminate the value iteration when the values have "converged" (i.e., they are not changing much from iteration to iteration).
–Set a threshold ε and stop when the change across two consecutive iterations is less than ε.
–There is a minor problem, since the value is a vector: we bound by ε the maximum change allowed in any dimension between two successive iterations.
The max norm ||.|| of a vector is the maximal value among all its dimensions. We terminate when ||U_i − U_{i+1}|| < ε.
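The termination test written out as a standalone helper (a sketch; U_prev and U_next are assumed to be per-state value dicts as in the earlier examples):

```python
def max_norm(U_a, U_b):
    """||U_a - U_b||: the largest per-state (per-dimension) difference between two value vectors."""
    return max(abs(U_a[s] - U_b[s]) for s in U_a)

def converged(U_prev, U_next, epsilon=1e-6):
    """Stop value iteration when ||U_i - U_{i+1}|| < epsilon."""
    return max_norm(U_prev, U_next) < epsilon
```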

23 Policies converge earlier than values
There are a finite number of policies but an infinite number of value functions, so entire regions of the value-vector space are mapped to a single policy. Hence policies may converge faster than values. This motivates searching in the space of policies.
Given a utility vector U_i we can compute the greedy policy π_{U_i} (see the sketch below).
The policy loss of π_{U_i} is ||U^{π_{U_i}} − U*|| (the max-norm difference of two vectors is the maximum amount by which they differ in any dimension).
[Figure: the value space V(S_1) × V(S_2) of an MDP with 2 states and 2 actions, partitioned into regions P_1–P_4 that each map to one policy; U* lies in one of the regions.]
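One possible way to extract the greedy policy from a value vector (a sketch against the same hypothetical MDP class; not the demo's code):

```python
def greedy_policy(mdp, U):
    """pi_U(s) = argmax_a sum_s' Pr(s'|s,a) * (R(s,a,s') + gamma * U(s'))."""
    return {
        s: max(mdp.actions,
               key=lambda a: sum(p * (mdp.rewards.get((s, a, s2), 0.0) + mdp.gamma * U[s2])
                                 for s2, p in mdp.transitions[(s, a)].items()))
        for s in mdp.states
    }

# e.g. greedy_policy(toy, value_iteration(toy))
```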

24 For a fixed policy, the Bellman equations become n linear equations with n unknowns (the update no longer has the "max" operation). We can either solve these linear equations exactly, or solve them approximately by running the value-iteration-style update a few times.
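A sketch of the exact solution (one possible implementation, using NumPy; not from the slides), solving (I − γ M^π) U^π = r^π for the hypothetical MDP class above:

```python
import numpy as np

def evaluate_policy_exact(mdp, policy):
    """Solve the |S| linear Bellman equations for a fixed policy: (I - gamma * M_pi) U = r_pi."""
    idx = {s: i for i, s in enumerate(mdp.states)}
    n = len(mdp.states)
    M = np.zeros((n, n))      # M_pi[i, j] = Pr(s_j | s_i, policy(s_i))
    r = np.zeros(n)           # r_pi[i]   = expected immediate reward in s_i under the policy
    for s in mdp.states:
        a = policy[s]
        for s2, p in mdp.transitions[(s, a)].items():
            M[idx[s], idx[s2]] = p
            r[idx[s]] += p * mdp.rewards.get((s, a, s2), 0.0)
    U = np.linalg.solve(np.eye(n) - mdp.gamma * M, r)
    return {s: U[idx[s]] for s in mdp.states}
```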

25 Bellman equations when actions have costs
The model discussed in class ignores action costs and only considers state rewards.
–Let C(s,a) be the cost of doing action a in state s, and assume costs are just negative rewards. The Bellman equation then becomes:
–U(s) = R(s) + γ max_a [ −C(s,a) + Σ_s' M^a_{ss'} U(s') ]
–Notice that the only difference is that −C(s,a) is now inside the maximization.
With this model, we can talk about "partial satisfaction" planning problems, where actions have costs, goals have utilities, and the optimal plan may not satisfy all goals.
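For completeness, a sketch of how a single backup might change with action costs. Here R, C, T, and U are assumed to be plain dicts (state rewards, per-(state, action) costs, the transition model, and the current value estimates) rather than the class used earlier, to stay close to the equation on the slide.

```python
def backup_with_costs(s, R, C, T, U, gamma):
    """One Bellman backup with action costs:
       U(s) = R(s) + gamma * max_a [ -C[(s, a)] + sum_s' T[(s, a)][s'] * U(s') ]"""
    applicable = {a for (st, a) in T if st == s}          # actions applicable in s
    return R[s] + gamma * max(
        -C[(s, a)] + sum(p * U[s2] for s2, p in T[(s, a)].items())
        for a in applicable
    )
```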

