Value and Planning in MDPs

Administrivia
Reading 3 assigned today: Mahadevan, S., “Representation Policy Iteration”, in Proc. of the 21st Conference on Uncertainty in Artificial Intelligence (UAI-2005)
Due: Apr 20
Groups assigned this time

Where we are
Last time: expected value of policies; the principle of maximum expected utility; the Bellman equation
Today: a little intuition (pictures); finding π*: the policy iteration algorithm; the Q function; on to actual learning (maybe?)

The Bellman equation
The final recursive equation is known as the Bellman equation:
  V^π(s) = R(s) + γ Σ_{s′} T(s, π(s), s′) V^π(s′)
The unique solution to this equation gives the value of a fixed policy π when operating in a known MDP M = 〈S, A, T, R〉
When the state/action spaces are discrete, can think of V^π and R as vectors and T^π as a matrix, and get the matrix equation:
  V^π = R + γ T^π V^π, i.e., V^π = (I − γ T^π)⁻¹ R

Exercise
Solve the matrix Bellman equation (i.e., find V)
I formulated the Bellman equations for “state-based” rewards: R(s)
Formulate & solve the B.E. for: “state-action” rewards ( R(s,a) ) and “state-action-state” rewards ( R(s,a,s′) )
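
(Not part of the original exercise: a minimal numerical sketch of the state-based case, using a made-up 3-state MDP, just to show that the matrix form is a one-line linear solve in numpy.)

import numpy as np

gamma = 0.9                            # discount factor
T_pi = np.array([[0.8, 0.2, 0.0],      # T_pi[i, j] = P(s_j | s_i, pi(s_i))
                 [0.1, 0.7, 0.2],
                 [0.0, 0.1, 0.9]])
R = np.array([0.0, 0.0, 1.0])          # state-based rewards R(s)

# Unique fixed point of V = R + gamma * T_pi @ V:
V = np.linalg.solve(np.eye(3) - gamma * T_pi, R)
print(V)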

Policy values in practice: “robot” navigation in a grid maze with a goal state

The MDP formulation State space: Action space: Reward function: Transition function:...

The MDP formulation
Transition function:
If desired direction is unblocked:
  Move in desired direction with prob 0.7
  Stay in same place with prob 0.1
  Move “forward right” with prob 0.1
  Move “forward left” with prob 0.1
If desired direction is blocked (wall):
  Stay in same place with prob 1.0
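
(A sketch of how this transition model might be coded. The maze layout, the reading of “forward right/left” as the two forward diagonal cells, and the choice that a blocked noisy outcome leaves the robot in place are all assumptions made here for illustration; the slide only fixes the probabilities.)

GRID = ["....",
        ".#..",
        "...."]                        # '#' = wall; made-up layout
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}

def blocked(cell):
    r, c = cell
    return (not (0 <= r < len(GRID) and 0 <= c < len(GRID[0]))
            or GRID[r][c] == "#")

def shift(cell, delta):
    nxt = (cell[0] + delta[0], cell[1] + delta[1])
    return cell if blocked(nxt) else nxt        # bump into a wall -> stay put

def transition(cell, action):
    """Return {next_cell: probability} for taking `action` in `cell`."""
    dr, dc = MOVES[action]
    if blocked((cell[0] + dr, cell[1] + dc)):   # desired direction blocked
        return {cell: 1.0}
    probs = {}
    for nxt, p in [(shift(cell, (dr, dc)),           0.7),   # desired direction
                   (cell,                            0.1),   # stay in place
                   (shift(cell, (dr + dc, dc - dr)), 0.1),   # "forward right" diagonal
                   (shift(cell, (dr - dc, dc + dr)), 0.1)]:  # "forward left" diagonal
        probs[nxt] = probs.get(nxt, 0.0) + p    # merge outcomes that coincide
    return probs

print(transition((0, 0), "E"))   # -> (0,1) w.p. 0.7, stay w.p. ~0.3 (both diagonals are blocked here)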

Policy values in practice: the optimal policy, π* (figure: one arrow per cell showing the greedy action, EAST/SOUTH/WEST/NORTH)

Policy values in practice: the value function for the optimal policy, V*. Why does it look like this?

A harder “maze”... with walls and doors

A harder “maze”... Optimal policy, π*

A harder “maze”... Value function for optimal policy, V*

Still more complex...

Still more complex... Optimal policy, π*

Still more complex... Value function for optimal policy, V*

Planning: finding π*
So we know how to evaluate a single policy, π
How do you find the best policy?
Remember: still assuming that we know M = 〈S, A, T, R〉
Non-solution: iterate through all possible π, evaluating each one, and keep the best (there are |A|^|S| deterministic stationary policies, so exhaustive enumeration is hopeless for all but tiny MDPs)

Policy iteration & friends
Many different solutions available. All exploit some characteristics of MDPs:
For infinite-horizon discounted reward in a discrete, finite MDP, there exists at least one optimal, stationary policy (there may be more than one equivalent optimal policy)
The Bellman equation expresses the recursive structure of an optimal policy
Leads to a series of closely related policy solutions: policy iteration, value iteration, generalized policy iteration, etc.

The policy iteration alg.
Function: policy_iteration
Input: MDP M = 〈S, A, T, R〉, discount γ
Output: optimal policy π*; optimal value function V*
Initialization: choose π_0 arbitrarily
Repeat {
  V_i = eval_policy( M, π_i, γ )                    // from the Bellman eqn
  π_{i+1} = local_update_policy( π_i, V_i )
} Until ( π_{i+1} == π_i )
Function: π′ = local_update_policy( π, V )
for i = 1..|S| {
  π′(s_i) = argmax_{a ∈ A} ( Σ_j T(s_i, a, s_j) · V(s_j) )
}
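
(A runnable rendering of the same pseudocode for a generic tabular MDP. The array conventions T[s, a, s2] and state-based rewards R[s], and the random example MDP at the end, are invented here for illustration.)

import numpy as np

def policy_iteration(T, R, gamma):
    """T[s, a, s2] = P(s2 | s, a); R[s] = state-based reward. Returns (pi, V)."""
    n_states = T.shape[0]
    pi = np.zeros(n_states, dtype=int)              # arbitrary initial policy
    while True:
        # eval_policy: solve V = R + gamma * T_pi @ V exactly (the Bellman eqn)
        T_pi = T[np.arange(n_states), pi]           # row s is T(s, pi(s), .)
        V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)
        # local_update_policy: greedy one-step lookahead w.r.t. V
        new_pi = (T @ V).argmax(axis=1)             # argmax_a sum_s2 T(s, a, s2) V(s2)
        if np.array_equal(new_pi, pi):
            return pi, V                            # policy unchanged: done
        pi = new_pi

# Throwaway random 5-state, 3-action MDP, just to exercise the function:
rng = np.random.default_rng(0)
T = rng.random((5, 3, 5)); T /= T.sum(axis=2, keepdims=True)
R = rng.random(5)
print(policy_iteration(T, R, gamma=0.9))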

Why does this work? 2 explanations.
Theoretical: the update w.r.t. the policy value is a contractive mapping, ergo a fixed point exists and will be reached
See “contraction mapping”, “Banach fixed-point theorem”, etc.
Contracts w.r.t. the Bellman error: the backup B^π V = R + γ T^π V satisfies ‖B^π V − B^π V′‖_∞ ≤ γ ‖V − V′‖_∞, so repeated backups converge to the unique fixed point V^π
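
(A quick numerical sanity check of the contraction claim, on a made-up random row-stochastic T^π: after one backup, the sup-norm gap between any two value vectors shrinks by at least a factor of γ.)

import numpy as np

rng = np.random.default_rng(1)
n, gamma = 6, 0.9
T_pi = rng.random((n, n)); T_pi /= T_pi.sum(axis=1, keepdims=True)   # row-stochastic
R = rng.random(n)

def backup(V):                                     # B_pi V = R + gamma * T_pi V
    return R + gamma * T_pi @ V

V1, V2 = rng.random(n), rng.random(n)
gap_before = np.max(np.abs(V1 - V2))
gap_after = np.max(np.abs(backup(V1) - backup(V2)))
print(gap_after <= gamma * gap_before)             # True: a gamma-contraction in sup norm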

Why does this work? The intuitive explanation
It's doing a dynamic-programming “backup” of reward from reward “sources”
At every step, the policy is locally updated to take advantage of new information about reward that is propagated back by the evaluation step
Value “propagates away” from the sources and the policy is able to say “hey! there's reward over there! I can get some of that action if I change a bit!”

P.I. in action (policy and value function): iteration 0

P.I. in action (policy and value function): iteration 1

P.I. in action (policy and value function): iteration 2

P.I. in action (policy and value function): iteration 3

P.I. in action (policy and value function): iteration 4

P.I. in action (policy and value function): iteration 5

P.I. in action (policy and value function): iteration 6: done

Properties
Policy iteration:
Known to converge (provable)
Observed to converge exponentially quickly: # iterations is O(ln(|S|)) (empirical observation; strongly believed but no proof yet)
O(|S|³) time per iteration (policy evaluation)

Variants
Other methods possible:
Linear programming (a polynomial-time solution exists)
Value iteration
Generalized policy iteration (often best in practice)
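
(For comparison, a minimal sketch of the value iteration variant, under the same made-up array conventions as the policy iteration code above: repeated Bellman optimality backups instead of alternating exact evaluation with a policy update.)

import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    """Repeated Bellman optimality backups; returns (greedy policy, approx V*)."""
    V = np.zeros(T.shape[0])
    while True:
        Q = R[:, None] + gamma * (T @ V)     # Q[s, a] = R(s) + gamma * sum_s2 T(s, a, s2) V(s2)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:  # stop when the values have converged
            return Q.argmax(axis=1), V_new
        V = V_new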