Policy Evaluation & Policy Iteration S&B: Sec 4.1, 4.3; 6.5.

The Bellman equation. The final recursive equation is known as the Bellman equation; for state-based rewards R(s) and a fixed policy π it reads

    V^π(s) = R(s) + γ Σ_{s'} T(s, π(s), s') V^π(s')

The unique solution of this equation gives the value of a fixed policy π when operating in a known MDP M = 〈S, A, T, R〉. When the state/action spaces are discrete, we can think of V and R as vectors and T^π as a matrix (with entries T^π_{ij} = T(s_i, π(s_i), s_j)) and get the matrix equation

    V^π = R + γ T^π V^π
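In the discrete, finite setting the matrix equation can be solved for V^π directly. Here is a minimal NumPy sketch; the array layout (T as an |S|×|A|×|S| array, R as an |S|-vector, a deterministic policy as an integer index array) is an assumption made for this example, not something fixed by the slides.

```python
import numpy as np

def eval_policy(T, R, pi, gamma):
    """Exact policy evaluation for state-based rewards R(s).

    T     : |S| x |A| x |S| array with T[s, a, s2] = P(s2 | s, a)
    R     : length-|S| reward vector, R[s]
    pi    : length-|S| integer array, pi[s] = action taken in state s
    gamma : discount factor in [0, 1)
    Returns V_pi solving V = R + gamma * T_pi @ V.
    """
    n_states = len(R)
    T_pi = T[np.arange(n_states), pi, :]   # |S| x |S| transition matrix induced by pi
    return np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)


# Tiny made-up 2-state, 2-action MDP, purely for illustration.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([0.0, 1.0])
pi = np.array([1, 0])                      # a deterministic policy
print(eval_policy(T, R, pi, gamma=0.9))
```

Solving the linear system is exact but costs O(|S|^3); for large state spaces one usually falls back on iterative evaluation (repeatedly applying the Bellman backup).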

Exercise. Solve the matrix Bellman equation (i.e., find V^π in closed form). The Bellman equation above was formulated for “state-based” rewards, R(s); formulate and solve the Bellman equation for “state-action” rewards, R(s,a), and for “state-action-state” rewards, R(s,a,s').
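As a hedged sketch of where the exercise points (assuming a deterministic policy π), the three reward conventions give the following fixed-policy Bellman equations, each of which reduces to the same matrix form:

```latex
% State-based rewards R(s):
V^{\pi}(s) = R(s) + \gamma \sum_{s'} T(s,\pi(s),s')\, V^{\pi}(s')

% State-action rewards R(s,a):
V^{\pi}(s) = R(s,\pi(s)) + \gamma \sum_{s'} T(s,\pi(s),s')\, V^{\pi}(s')

% State-action-state rewards R(s,a,s'):
V^{\pi}(s) = \sum_{s'} T(s,\pi(s),s')\,\bigl[ R(s,\pi(s),s') + \gamma V^{\pi}(s') \bigr]

% Each case can be written as V^{\pi} = R^{\pi} + \gamma T^{\pi} V^{\pi},
% so V^{\pi} = (I - \gamma T^{\pi})^{-1} R^{\pi} with the appropriate R^{\pi} vector.
```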

Policy values in practice: “robot” navigation in a grid maze

Policy values in practice: the optimal policy, π*

Policy values in practice: the value function for the optimal policy, V*

A harder “maze”...

A harder “maze”: the optimal policy, π*

A harder “maze”: the value function for the optimal policy, V*

Still more complex...

Still more complex: the optimal policy, π*

Still more complex: the value function for the optimal policy, V*

Planning: finding π*. So we know how to evaluate a single policy, π. How do you find the best policy? Remember: we are still assuming that we know the model M = 〈S, A, T, R〉. Non-solution: iterate through all possible policies π, evaluate each one, and keep the best; with |A|^|S| deterministic policies to check, this is hopeless.

Policy iteration & friends. Many different solution methods are available, and all exploit some characteristics of MDPs: for infinite-horizon discounted reward in a discrete, finite MDP, there exists at least one optimal, stationary policy (more than one equivalent optimal policy may exist), and the Bellman equation expresses the recursive structure of an optimal policy. This leads to a series of closely related solution methods: policy iteration, value iteration, generalized policy iteration, etc.

The policy iteration algorithm.

Function: policy_iteration
Input: MDP M = 〈S, A, T, R〉, discount γ
Output: optimal policy π*; optimal value function V*
Initialization: choose π_0 arbitrarily
Repeat {
    V_i = eval_policy(M, π_i, γ)            // from the Bellman equation
    π_{i+1} = local_update_policy(π_i, V_i)
} Until (π_{i+1} == π_i)

Function: π' = local_update_policy(π, V)
for i = 1 .. |S| {
    π'(s_i) = argmax_{a∈A} { Σ_j T(s_i, a, s_j) * V(s_j) }
}
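Here is a minimal runnable NumPy version of the pseudocode above, using the same assumed conventions as the earlier evaluation sketch (T as |S|×|A|×|S|, state-based rewards R as an |S|-vector, deterministic policies as integer index arrays); it is an illustrative sketch, not the canonical implementation.

```python
import numpy as np

def policy_iteration(T, R, gamma, max_iters=1000):
    """Policy iteration for a finite MDP with state-based rewards R(s).

    T : |S| x |A| x |S| transition array, T[s, a, s2] = P(s2 | s, a)
    R : length-|S| reward vector
    Returns (pi_star, V_star).
    """
    n_states, n_actions, _ = T.shape
    pi = np.zeros(n_states, dtype=int)           # arbitrary initial policy

    for _ in range(max_iters):
        # Policy evaluation: solve V = R + gamma * T_pi V exactly.
        T_pi = T[np.arange(n_states), pi, :]
        V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)

        # Local policy update: greedy one-step lookahead on V.
        # (T @ V)[s, a] = sum_j T[s, a, j] * V[j]
        pi_new = np.argmax(T @ V, axis=1)

        if np.array_equal(pi_new, pi):           # policy is stable: done
            return pi, V
        pi = pi_new

    return pi, V
```

Because the rewards are state-based, R(s) and the factor γ are the same for every action in a given state, so taking the argmax of Σ_j T(s,a,s_j)V(s_j) alone matches the local update in the pseudocode.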

Why does this work? Two explanations. Theoretical: the local update w.r.t. the policy value is a contraction mapping, so a fixed point exists and will be reached (see “contraction mapping”, “Banach fixed-point theorem”, etc.). It contracts w.r.t. the Bellman error, as sketched below.
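To spell the argument out, here is a standard sketch (not copied from the slides) of why the fixed-policy backup operator is a γ-contraction in the sup norm; the Banach fixed-point theorem then gives existence, uniqueness, and geometric convergence to V^π.

```latex
(B^{\pi} V)(s) \;=\; R(s) + \gamma \sum_{s'} T(s,\pi(s),s')\, V(s')

\| B^{\pi} V - B^{\pi} V' \|_{\infty}
  \;=\; \gamma \max_{s} \Bigl| \sum_{s'} T(s,\pi(s),s')\,[V(s') - V'(s')] \Bigr|
  \;\le\; \gamma \, \| V - V' \|_{\infty}

% Hence B^{\pi} has a unique fixed point V^{\pi}, and repeated backups converge to it:
\| (B^{\pi})^{k} V - V^{\pi} \|_{\infty} \;\le\; \gamma^{k}\, \| V - V^{\pi} \|_{\infty}.
```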

Why does this work? The intuitive explanation: it is doing a dynamic-programming “backup” of reward from reward “sources”. At every step, the policy is locally updated to take advantage of new information about reward that is propagated back by the evaluation step. Value “propagates away” from the sources, and the policy gets to say, “hey! there’s reward over there! I can get some of that action if I change a bit!”

P.I. in action: policy and value function, iteration 0

P.I. in action: policy and value function, iteration 1

P.I. in action: policy and value function, iteration 2

P.I. in action: policy and value function, iteration 3

P.I. in action: policy and value function, iteration 4

P.I. in action: policy and value function, iteration 5

P.I. in action: policy and value function, iteration 6 (done)

Properties & variants. Policy iteration is known to converge (provably), and is observed to converge exponentially quickly: the number of iterations appears to be O(ln(|S|)), an empirical observation that is strongly believed but not yet proven. Each iteration takes O(|S|^3) time (for the policy evaluation step). Other methods are possible: linear programming (a polynomial-time solution exists), value iteration, and generalized policy iteration (often the best in practice).

Q: a key operative. The critical step in policy iteration is

    π'(s_i) = argmax_{a∈A} { Σ_j T(s_i, a, s_j) * V(s_j) }

It asks: “What happens if I ignore π for just one step, do a instead, and then resume doing π thereafter?” This is such a frequently used operation that it gets a special name. Definition: the Q function (here with state-based rewards) is

    Q^π(s, a) = R(s) + γ Σ_{s'} T(s, a, s') V^π(s')

Policy iteration says: “Figure out Q, act greedily according to Q, then update Q and repeat, until you can’t do any better...”

What to do with Q. You can think of Q as a big table, with one entry for each state/action pair: “If I’m in state s and take action a, this is my expected discounted reward...” It is a “one-step” exploration: “In state s, if I deviate from my policy π for one timestep, then keep doing π, is my life better or worse?” You can recover both V and π from Q:

    V(s) = max_{a∈A} Q(s, a)        π(s) = argmax_{a∈A} Q(s, a)
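A small NumPy sketch of that last point, using the same assumed array conventions as the earlier sketches (the helper names are illustrative, not from the slides):

```python
import numpy as np

def q_from_v(T, R, V, gamma):
    """Q[s, a] = R(s) + gamma * sum_s' T(s, a, s') * V(s')   (state-based rewards)."""
    return R[:, None] + gamma * (T @ V)     # shape |S| x |A|

def greedy_from_q(Q):
    """Recover the greedy value function and policy from a Q table."""
    V = Q.max(axis=1)        # V(s)  = max_a Q(s, a)
    pi = Q.argmax(axis=1)    # pi(s) = argmax_a Q(s, a)
    return V, pi
```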