
1 The Value of Plans

2 Now and Then
Last time:
  Value in stochastic worlds
  Maximum expected utility
  Value function calculation
Today:
  Example: gridworld navigation
  The shape of value functions
  Optimal planning: the policy iteration algorithm
  Discussion of R2

3 Administrivia
R1 back today ([whew] finally... :-P)
R3 assigned today

4 R3
Due: Apr 17
Abbeel, Coates, Quigley, & Ng, “An Application of Reinforcement Learning to Aerobatic Helicopter Flight”. Proc. Neural Information Processing Systems (NIPS), 2006. http://ai.stanford.edu/~pabbeel/

5 The Bellman equation
The final recursive equation is known as the Bellman equation:
  Vπ(s) = R(s) + γ Σ_s' T(s, π(s), s') Vπ(s')
The unique solution to this equation gives the value of a fixed policy π when operating in a known MDP M = 〈S, A, T, R〉.
When state/action spaces are discrete, we can think of Vπ and R as vectors and Tπ as a matrix, and get the matrix equation:
  Vπ = R + γ Tπ Vπ
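A quick illustration (not from the slides) of how the matrix form can be solved directly: rearrange Vπ = R + γ Tπ Vπ into (I - γ Tπ) Vπ = R and hand it to a linear solver. The 3-state transition matrix, reward vector, and discount below are made-up values purely for demonstration.

```python
import numpy as np

# Made-up 3-state example: T_pi[i, j] = T(s_i, pi(s_i), s_j),
# R[i] = reward in state s_i, gamma = discount factor.
T_pi = np.array([[0.8, 0.1, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.0, 0.1, 0.9]])
R = np.array([0.0, 0.0, 1.0])
gamma = 0.9

# Matrix Bellman equation: V = R + gamma * T_pi @ V
# Rearranged into a plain linear system: (I - gamma * T_pi) V = R
V_pi = np.linalg.solve(np.eye(len(R)) - gamma * T_pi, R)
print(V_pi)   # value of the fixed policy pi in each state
```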

6 Exercise
Solve the matrix Bellman equation Vπ = R + γ Tπ Vπ (i.e., find Vπ).
I formulated the Bellman equations for “state-based” rewards: R(s).
Formulate & solve the B.E. for “state-action” rewards (R(s,a)).

7 Exercise
Solve the matrix Bellman equation (i.e., find Vπ).
Formulate & solve the B.E. for:
  “state-action” rewards (R(s,a))
  “state-action-state” rewards (R(s,a,s'))
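For reference, here is one standard way to write the fixed-policy Bellman equation under the other two reward conventions; this is a sketch of the formulation, not necessarily the notation used in lecture.

```latex
% State-action rewards R(s,a):
V^{\pi}(s) = R(s,\pi(s)) + \gamma \sum_{s'} T(s,\pi(s),s')\, V^{\pi}(s')

% State-action-state rewards R(s,a,s'):
V^{\pi}(s) = \sum_{s'} T(s,\pi(s),s') \bigl[ R(s,\pi(s),s') + \gamma\, V^{\pi}(s') \bigr]
```

In both cases, defining a vector Rπ by Rπ(s) = R(s, π(s)) (or Σ_s' T(s, π(s), s') R(s, π(s), s')) collapses the system back to Vπ = Rπ + γ Tπ Vπ, so the same linear solve applies.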

8 Policy values in practice
“Robot” navigation in a grid maze (figure: the maze, with a marked goal state)

9 The MDP formulation
State space:
Action space:
Reward function:
Transition function: ...

10 The MDP formulation
Transition function:
  If desired direction is unblocked:
    Move in desired direction with probability 0.7
    Stay in same place w/ prob 0.1
    Move “forward right” w/ prob 0.1
    Move “forward left” w/ prob 0.1
  If desired direction is blocked (wall):
    Stay in same place w/ prob 1.0
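A Python sketch of the transition model described above, under a few assumptions of my own: grid cells are (row, col) tuples, `walls` is a set of blocked cells, and “forward right”/“forward left” are interpreted as the diagonal cells ahead of the agent (the slide does not pin this down). Treat it as illustrative, not as the lecture's exact model.

```python
# Compass directions and their left/right neighbours (relative to facing).
DIRS = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
LEFT_OF = {"N": "W", "W": "S", "S": "E", "E": "N"}
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}

def transition(state, action, walls, n_rows, n_cols):
    """Return {next_state: probability} for one (state, action) pair."""
    def step(s, offsets):
        # Apply one or more (dr, dc) offsets; stay put if the result is off-grid or a wall.
        nxt = (s[0] + sum(o[0] for o in offsets), s[1] + sum(o[1] for o in offsets))
        in_bounds = 0 <= nxt[0] < n_rows and 0 <= nxt[1] < n_cols
        return nxt if in_bounds and nxt not in walls else s

    fwd = DIRS[action]
    # Desired direction blocked by a wall: stay in place with probability 1.0.
    if step(state, [fwd]) == state:
        return {state: 1.0}

    outcomes = [
        (step(state, [fwd]), 0.7),                           # desired direction
        (state, 0.1),                                        # stay in same place
        (step(state, [fwd, DIRS[RIGHT_OF[action]]]), 0.1),   # "forward right"
        (step(state, [fwd, DIRS[LEFT_OF[action]]]), 0.1),    # "forward left"
    ]
    probs = {}
    for nxt, p in outcomes:
        probs[nxt] = probs.get(nxt, 0.0) + p                 # merge duplicate outcomes
    return probs
```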

11 Policy values in practice
Optimal policy, π* (figure: each cell shows the chosen action: EAST, SOUTH, WEST, or NORTH)

12 Policy values in practice Value function for optimal policy, V* Why does it look like this?

13 A harder “maze”... (figure: a maze with walls and doors)

14 A harder “maze”... Optimal policy, π*

15 A harder “maze”... Value function for optimal policy, V*

16 A harder “maze”... Value function for optimal policy, V*

17 Still more complex...

18 Optimal policy, π*

19 Still more complex... Value function for optimal policy, V*

20 Still more complex... Value function for optimal policy, V*

21 Planning: finding π*
So we know how to evaluate a single policy, π. How do you find the best policy?
Remember: still assuming that we know M = 〈S, A, T, R〉.

22 Planning: finding π*
So we know how to evaluate a single policy, π. How do you find the best policy?
Remember: still assuming that we know M = 〈S, A, T, R〉.
Non-solution: iterate through all possible π, evaluating each one; keep the best. (There are |A|^|S| deterministic policies, so exhaustive search is exponential in the number of states.)

23 Policy iteration & friends
Many different solutions are available. All exploit some characteristics of MDPs:
  For infinite-horizon discounted reward in a discrete, finite MDP, there exists at least one optimal, stationary policy (there may be more than one equivalent optimal policy)
  The Bellman equation expresses the recursive structure of an optimal policy (written out below)
This leads to a series of closely related planning solutions: policy iteration, value iteration, generalized policy iteration, etc.
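As a reminder of that recursive structure, written here for the state-based rewards R(s) used earlier in the lecture, the optimality form of the Bellman equation and the greedy policy it induces are:

```latex
V^{*}(s) = R(s) + \gamma \max_{a \in A} \sum_{s'} T(s,a,s')\, V^{*}(s')
\qquad
\pi^{*}(s) = \operatorname*{argmax}_{a \in A} \sum_{s'} T(s,a,s')\, V^{*}(s')
```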

24 The policy iteration alg.
Function: policy_iteration
Input: MDP M = 〈S, A, T, R〉, discount γ
Output: optimal policy π*; opt. value func. V*
Initialization: choose π_0 arbitrarily
Repeat {
  V_i = eval_policy(M, π_i, γ)                 // from Bellman eqn
  π_{i+1} = local_update_policy(π_i, V_i)
} Until (π_{i+1} == π_i)

Function: π' = local_update_policy(π, V)
for i = 1..|S| {
  π'(s_i) = argmax_{a ∈ A} ( sum_j ( T(s_i, a, s_j) * V(s_j) ) )
}
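A compact NumPy sketch of the algorithm above, assuming a tabular MDP with T stored as a |S|×|A|×|S| array and state-based rewards R(s); the function and variable names are my own, not the lecture's.

```python
import numpy as np

def eval_policy(T, R, policy, gamma):
    """Policy evaluation: solve the linear Bellman system for a fixed policy."""
    n = len(R)
    T_pi = T[np.arange(n), policy]               # T_pi[s, s'] = T(s, policy[s], s')
    return np.linalg.solve(np.eye(n) - gamma * T_pi, R)

def policy_iteration(T, R, gamma):
    """T: (|S|, |A|, |S|) transition array; R: (|S|,) reward vector."""
    n_states = T.shape[0]
    policy = np.zeros(n_states, dtype=int)       # pi_0 chosen arbitrarily
    while True:
        V = eval_policy(T, R, policy, gamma)     # from the Bellman eqn
        # Local update: greedy action w.r.t. the current value function.
        new_policy = np.argmax(T @ V, axis=1)    # (|S|, |A|) -> best a per state
        if np.array_equal(new_policy, policy):   # until pi_{i+1} == pi_i
            return policy, V
        policy = new_policy
```

Wiring this to the gridworld above would just mean packing the output of a transition model like the earlier sketch into the T array.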

25 Why does this work? 2 explanations:

26 Why does this work? 2 explanations:
Theoretical: the local update w.r.t. the policy value is a contraction mapping, ergo a fixed point exists and will be reached.
See “contraction mapping”, “Banach fixed-point theorem”, etc.:
http://math.arizona.edu/~restrepo/475A/Notes/sourcea/node22.html
http://planetmath.org/encyclopedia/BanachFixedPointTheorem.html
Contracts w.r.t. the Bellman error: ‖BV - BV'‖_∞ ≤ γ ‖V - V'‖_∞, where B is the Bellman backup operator.

27 Why does this work? The intuitive explanation:
It's doing a dynamic-programming “backup” of reward from reward “sources”.
At every step, the policy is locally updated to take advantage of new information about reward that is propagated back by the evaluation step.
Value “propagates away” from the sources, and the policy is able to say “hey! there's reward over there! I can get some of that action if I change a bit!”

28 P.I. in action (figure: policy and value function, iteration 0)

29 P.I. in action (figure: policy and value function, iteration 1)

30 P.I. in action (figure: policy and value function, iteration 2)

31 P.I. in action (figure: policy and value function, iteration 3)

32 P.I. in action (figure: policy and value function, iteration 4)

33 P.I. in action (figure: policy and value function, iteration 5)

34 P.I. in action (figure: policy and value function, iteration 6: done)

35 Properties
Policy iteration:
  Known to converge (provable)
  Observed to converge exponentially quickly: # iterations is O(ln(|S|)) (empirical observation; strongly believed but no proof yet)
  O(|S|^3) time per iteration (policy evaluation)

36 Variants
Other methods possible:
  Linear program (poly-time solution exists)
  Value iteration
  Generalized policy iteration (often best in practice)
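Of those variants, value iteration is easy to sketch on the same tabular representation used above; again a sketch, with an arbitrary stopping tolerance rather than anything specified in lecture.

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-6):
    """T: (|S|, |A|, |S|) transition array; R: (|S|,) reward vector."""
    V = np.zeros(len(R))
    while True:
        # Bellman optimality backup: V(s) <- R(s) + gamma * max_a sum_s' T(s,a,s') V(s')
        V_new = R + gamma * np.max(T @ V, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    # Read off the greedy policy from the (near-)converged value function.
    return np.argmax(T @ V, axis=1), V
```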

