
Markov Systems, Markov Decision Processes, and Dynamic Programming: Prediction and Search in Probabilistic Worlds. Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University, www.cs.cmu.edu/~awm. April 21st, 2002. Copyright © 2002, 2004, Andrew W. Moore. Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the link to the source repository of Andrew's tutorials. Comments and corrections gratefully received.

Markov Systems: Slide 2. Discounted Rewards. An assistant professor gets paid, say, 20K per year. How much, in total, will the A.P. earn in their life? 20K + 20K + 20K + … = Infinity. What's wrong with this argument?

Markov Systems: Slide 3. Discounted Rewards. "A reward (payment) in the future is not worth quite as much as a reward now." Because of chance of obliteration. Because of inflation. Example: being promised $10,000 next year is worth only 90% as much as receiving $10,000 right now. Assuming a payment n years in the future is worth only (0.9)^n of a payment now, what is the A.P.'s Future Discounted Sum of Rewards?
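For reference, one way to answer the question above under the slide's own assumptions (a constant 20K per year, each payment n years out worth (0.9)^n of a payment now); this worked sum is an addition to the transcript, not part of the original slide:

```latex
\sum_{n=0}^{\infty} 20{,}000\,(0.9)^{n} \;=\; \frac{20{,}000}{1-0.9} \;=\; 200{,}000
```

So the discounted lifetime earnings are finite even though the undiscounted sum diverges.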

Markov Systems: Slide 4. Discount Factors. People in economics and probabilistic decision-making do this all the time. The "discounted sum of future rewards" using discount factor γ is: (reward now) + γ (reward in 1 time step) + γ² (reward in 2 time steps) + γ³ (reward in 3 time steps) + … (an infinite sum).
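Written compactly (a restatement of the sum above, using r_t for the reward received t steps from now):

```latex
\text{Discounted sum of future rewards} \;=\; \sum_{t=0}^{\infty} \gamma^{t}\, r_{t}, \qquad 0 < \gamma < 1 .
```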

Markov Systems: Slide 5. The Academic Life. States (with per-step rewards): A. Assistant Prof (20), B. Assoc. Prof (60), T. Tenured Prof (400), S. On the Street (10), D. Dead (0). Define: J_A = expected discounted future rewards starting in state A; J_B = expected discounted future rewards starting in state B; and likewise J_T, J_S, J_D for states T, S, and D. How do we compute J_A, J_B, J_T, J_S, J_D? Assume a discount factor γ.

Markov Systems: Slide 6. Computing the Future Rewards of an Academic.

Markov Systems: Slide 7. A Markov System with Rewards… Has a set of states {S_1, S_2, ··, S_N}. Has a transition probability matrix P (an N×N matrix with entries P_11, P_12, ··, P_NN), where P_ij = Prob(Next = S_j | This = S_i). Each state has a reward: {r_1, r_2, ··, r_N}. There's a discount factor γ, with 0 < γ < 1. On each time step: 0. Assume your state is S_i. 1. You are given reward r_i. 2. You randomly move to another state, with P(NextState = S_j | This = S_i) = P_ij. 3. All future rewards are discounted by γ.

Markov Systems: Slide 8. Solving a Markov System. Write J*(S_i) = expected discounted sum of future rewards starting in state S_i. Then J*(S_i) = r_i + γ × (expected future rewards starting from your next state) = r_i + γ (P_i1 J*(S_1) + P_i2 J*(S_2) + ··· + P_iN J*(S_N)). Using vector notation, write J = (J*(S_1), J*(S_2), ··, J*(S_N))ᵀ, R = (r_1, r_2, ··, r_N)ᵀ, and P = the N×N matrix of transition probabilities P_ij. Question: can you invent a closed-form expression for J in terms of R, P and γ?
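For reference, the closed form the question is asking for follows directly from the vector equation J = R + γPJ (this derivation is an addition to the transcript; it is what the matrix-inversion slides below rely on):

```latex
J = R + \gamma P J
\;\;\Longrightarrow\;\;
(I - \gamma P)\, J = R
\;\;\Longrightarrow\;\;
J = (I - \gamma P)^{-1} R .
```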

Markov Systems: Slide 9. Solving a Markov System with Matrix Inversion. Upside: you get an exact answer. Downside:

Markov Systems: Slide 10. Solving a Markov System with Matrix Inversion. Upside: you get an exact answer. Downside: if you have 100,000 states, you're solving a 100,000 by 100,000 system of equations.
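A minimal NumPy sketch of the matrix-inversion solution (my own illustration, not from the slides); the rewards and discount echo the SUN/WIND/HAIL example on Slide 13, but the transition matrix is an assumed one, so treat the numbers as illustrative:

```python
import numpy as np

# Illustrative 3-state Markov system with rewards (transition matrix is assumed).
P = np.array([[0.5, 0.5, 0.0],   # P[i, j] = Prob(next = S_j | this = S_i)
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
R = np.array([4.0, 0.0, -8.0])   # per-state rewards r_i (SUN, WIND, HAIL)
gamma = 0.5                      # discount factor

# Solve (I - gamma * P) J = R rather than forming the inverse explicitly.
J = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(J)                         # expected discounted future reward from each state
```

Using np.linalg.solve avoids forming (I − γP)⁻¹ explicitly, but the answer is the same; the downside quoted on this slide is exactly the cost of this solve when N is large.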

Markov Systems: Slide 11. Value Iteration: another way to solve a Markov System. Define J^1(S_i) = expected discounted sum of rewards over the next 1 time step; J^2(S_i) = expected discounted sum of rewards during the next 2 steps; J^3(S_i) = expected discounted sum of rewards during the next 3 steps; …; J^k(S_i) = expected discounted sum of rewards during the next k steps. J^1(S_i) = (what?) J^2(S_i) = (what?) … J^{k+1}(S_i) = (what?)

Markov Systems: Slide 12. Value Iteration: another way to solve a Markov System. Define J^1(S_i) = expected discounted sum of rewards over the next 1 time step; J^2(S_i) = expected discounted sum of rewards during the next 2 steps; J^3(S_i) = expected discounted sum of rewards during the next 3 steps; …; J^k(S_i) = expected discounted sum of rewards during the next k steps. J^1(S_i) = r_i. J^2(S_i) = (what?) … J^{k+1}(S_i) = (what?) N = number of states.
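Filling in the "(what?)" blanks for reference (this is the standard recurrence the next slides apply, reconstructed from the definitions above):

```latex
J^{1}(S_i) = r_i,
\qquad
J^{k+1}(S_i) = r_i + \gamma \sum_{j=1}^{N} P_{ij}\, J^{k}(S_j).
```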

Markov Systems: Slide 13. Let's do Value Iteration. Example (from the slide's diagram): a three-state Markov system with states SUN (reward +4), WIND (reward 0), and HAIL (reward -8), transition probabilities of 1/2 along the arcs of the diagram, and discount factor γ = 0.5. The slide builds a table of J^k(SUN), J^k(WIND), J^k(HAIL) for successive k.

Markov Systems: Slide 14. Let's do Value Iteration (continued): the same SUN/WIND/HAIL example with γ = 0.5, with the table of J^k(SUN), J^k(WIND), J^k(HAIL) filled in for successive k.

Markov Systems: Slide 15. Value Iteration for solving Markov Systems. Compute J^1(S_i) for each i; compute J^2(S_i) for each i; …; compute J^k(S_i) for each i. As k→∞, J^k(S_i)→J*(S_i). Why? When to stop? When max_i |J^{k+1}(S_i) − J^k(S_i)| < ξ. This is faster than matrix inversion (which is O(N³)) if the transition matrix is sparse.
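A short Python sketch of this loop (my own illustration, not from the slides), reusing the illustrative SUN/WIND/HAIL numbers from the earlier sketch, with the same assumed transition matrix:

```python
import numpy as np

def value_iterate(P, R, gamma, tol=1e-6):
    """Value iteration for a Markov system with rewards.

    P[i, j] = Prob(next = S_j | this = S_i), R[i] = reward r_i,
    gamma = discount factor.  Returns an approximation of J*(S_i).
    """
    J = np.zeros(len(R))
    while True:
        J_next = R + gamma * P @ J               # J^{k+1} = R + gamma * P * J^k
        if np.max(np.abs(J_next - J)) < tol:     # stop when the largest change is below tol
            return J_next
        J = J_next

# Illustrative SUN / WIND / HAIL chain (transition matrix is assumed).
P = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
R = np.array([4.0, 0.0, -8.0])
print(value_iterate(P, R, gamma=0.5))
```

With a sparse representation of P (e.g. scipy.sparse), each sweep touches only the nonzero entries, which is why this can beat the O(N³) exact solve.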

Markov Systems: Slide 16. A Markov Decision Process. γ = 0.9. You run a startup company. In every state you must choose between Saving money (S) or Advertising (A). States (with rewards): Poor & Unknown (+0), Poor & Famous (+0), Rich & Unknown (+10), Rich & Famous (+10). The slide's diagram labels each transition arc with its action and transition probability.

Markov Systems: Slide 17. Markov Decision Processes. An MDP has… a set of states {S_1 ··· S_N}; a set of actions {a_1 ··· a_M}; a set of rewards {r_1 ··· r_N} (one for each state); and a transition probability function P^k_ij = Prob(Next = S_j | This = S_i and the action is a_k). On each step: 0. Call the current state S_i. 1. Receive reward r_i. 2. Choose an action ∈ {a_1 ··· a_M}. 3. If you choose action a_k you'll move to state S_j with probability P^k_ij. 4. All future rewards are discounted by γ.

Markov Systems: Slide 18. A Policy. A policy is a mapping from states to actions. Examples: Policy Number 1: PU→S, PF→A, RU→S, RF→A. Policy Number 2: PU→A, PF→A, RU→A, RF→A. (Each policy is drawn on the slide as its own diagram over the states PU (+0), PF (+0), RU (+10), RF (+10).) How many possible policies are there in our example? Which of the above two policies is best? How do you compute the optimal policy?

Markov Systems: Slide 19. Interesting Fact. For every M.D.P. there exists an optimal policy. It's a policy such that for every possible start state there is no better option than to follow the policy. (Not proved in this lecture.)

Markov Systems: Slide 20. Computing the Optimal Policy. Idea One: run through all possible policies. Select the best. What's the problem?

Markov Systems: Slide 21. Optimal Value Function. Define J*(S_i) = expected discounted future rewards, starting from state S_i, assuming we use the optimal policy. (The slide's diagram shows a small MDP with states S_1 (+0), S_2 (+3) and S_3 (+2), actions A and B on each arc, and transition probabilities such as 1/3.) Question: what (by inspection) is an optimal policy for that MDP (assume γ = 0.9)? What is J*(S_1)? What is J*(S_2)? What is J*(S_3)?

Markov Systems: Slide 22. Computing the Optimal Value Function with Value Iteration. Define J^k(S_i) = maximum possible expected sum of discounted rewards I can get if I start at state S_i and I live for k time steps. Note that J^1(S_i) = r_i.
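The update this definition leads to, written out for reference (it is the MDP analogue of the Markov-system recurrence above, with a maximisation over actions):

```latex
J^{k+1}(S_i) \;=\; \max_{a}\Big[\, r_i + \gamma \sum_{j=1}^{N} P^{a}_{ij}\, J^{k}(S_j) \Big].
```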

Markov Systems: Slide 23. Let's compute J^k(S_i) for our example. Table to fill in: k | J^k(PU) | J^k(PF) | J^k(RU) | J^k(RF).

Markov Systems: Slide 24. Let's compute J^k(S_i) for our example (continued): the same table of J^k(PU), J^k(PF), J^k(RU), J^k(RF), filled in for successive k.

Markov Systems: Slide 25. Bellman's Equation. Value Iteration for solving MDPs: compute J^1(S_i) for all i; compute J^2(S_i) for all i; …; compute J^n(S_i) for all i; … until converged. Also known as Dynamic Programming.
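A compact Python sketch of value iteration for MDPs, applying the Bellman backup J^{k+1}(S_i) = max_a [r_i + γ Σ_j P^a_ij J^k(S_j)] from the slides above (my own illustration; the tiny two-state, two-action MDP at the bottom is invented purely so the code runs end to end):

```python
import numpy as np

def mdp_value_iteration(P, R, gamma, tol=1e-6):
    """Value iteration for an MDP.

    P[a, i, j] = Prob(next = S_j | this = S_i, action = a),
    R[i]       = reward received in state S_i,
    gamma      = discount factor.
    Returns (approximate J*, greedy policy).
    """
    J = np.zeros(P.shape[1])
    while True:
        # Q[a, i] = r_i + gamma * sum_j P^a_ij * J(S_j)
        Q = R + gamma * P @ J
        J_next = Q.max(axis=0)            # Bellman backup: maximise over actions
        if np.max(np.abs(J_next - J)) < tol:
            return J_next, Q.argmax(axis=0)
        J = J_next

# Invented 2-state, 2-action MDP, just to exercise the function.
P = np.array([[[0.9, 0.1],    # action 0
               [0.2, 0.8]],
              [[0.5, 0.5],    # action 1
               [0.1, 0.9]]])
R = np.array([0.0, 10.0])
J_star, policy = mdp_value_iteration(P, R, gamma=0.9)
print(J_star, policy)
```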

Markov Systems: Slide 26. Finding the Optimal Policy. 1. Compute J*(S_i) for all i using Value Iteration (a.k.a. Dynamic Programming). 2. Define the best action in state S_i as the greedy choice with respect to J*, as in the formula sketched below. (Why?)
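The formula the slide refers to is the standard greedy-policy extraction rule, reconstructed here from the value-iteration notation above:

```latex
\pi^{*}(S_i) \;=\; \arg\max_{a}\Big[\, r_i + \gamma \sum_{j=1}^{N} P^{a}_{ij}\, J^{*}(S_j) \Big].
```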

Markov Systems: Slide 27. Applications of MDPs. This extends the search algorithms of your first lectures to the case of probabilistic next states. Many important problems are MDPs… robot path planning, travel route planning, elevator scheduling, bank customer retention, autonomous aircraft navigation, manufacturing processes, network switching & routing.

Markov Systems: Slide 28. Asynchronous D.P. Value Iteration: "Backup S_1", "Backup S_2", ···, "Backup S_N", then "Backup S_1", "Backup S_2", ···, and repeat. There's no reason that you need to do the backups in order! Random order: still works, and is easy to parallelize (Dyna, Sutton 91). On-policy order: simulate the states that the system actually visits. Efficient order: e.g. Prioritized Sweeping [Moore 93], Q-Dyna [Peng & Williams 93].

Markov Systems: Slide 29. Policy Iteration (another way to compute optimal policies). Write π(S_i) = the action selected in the i'th state; then π is a policy. Write π^t = the policy on the t'th iteration. Algorithm: π^0 = any randomly chosen policy; for all i compute J^0(S_i) = the long-term reward starting at S_i using π^0; π^1(S_i) = the greedy action with respect to J^0; J^1 = the long-term reward using π^1; π^2(S_i) = …; and so on. Keep computing π^1, π^2, π^3, … until π^k = π^{k+1}. You now have an optimal policy.
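A short Python sketch of this alternation between exact policy evaluation and greedy improvement (my own illustration, using the same P[a, i, j] layout as the MDP value-iteration sketch above):

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Policy iteration. P[a, i, j] = Prob(next = S_j | this = S_i, action = a)."""
    n_actions, n_states, _ = P.shape
    policy = np.zeros(n_states, dtype=int)          # pi^0: arbitrary initial policy
    while True:
        # Policy evaluation: solve J = R + gamma * P_pi J exactly for the current policy.
        P_pi = P[policy, np.arange(n_states)]       # row i is P[policy[i], i, :]
        J = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # Policy improvement: act greedily with respect to J.
        Q = R + gamma * P @ J                       # Q[a, i]
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):      # pi^{k+1} = pi^k  =>  optimal
            return policy, J
        policy = new_policy
```

It can be called with the same invented P, R and gamma used in the earlier MDP sketch, and it terminates because there are only finitely many policies and each improvement step never decreases J.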

Markov Systems: Slide 30. Policy Iteration & Value Iteration: which is best? It depends. Lots of actions? Choose Policy Iteration. Already got a fair policy? Policy Iteration. Few actions, acyclic? Value Iteration. Best of both worlds: Modified Policy Iteration [Puterman], a simple mix of value iteration and policy iteration. A third approach: Linear Programming.

Markov Systems: Slide 31. Time to Moan. What's the biggest problem (or problems) with what we've seen so far?

Markov Systems: Slide 32. Dealing with large numbers of states. Don't store the value function as a table of (state, value) entries for S_1, S_2, …. Instead, use a function approximator (generalizers) or hierarchies: splines, variable resolution, multi-resolution, memory-based methods [Munos 1999].

Markov Systems: Slide 33. Function approximation for value functions. Polynomials [Samuel, Boyan, much O.R. literature]. Neural nets [Barto & Sutton, Tesauro, Crites, Singh, Tsitsiklis]. Splines [economists, controls]. Applications include backgammon, pole balancing, elevators, Tetris, cell phones, checkers, channel routing, and radiotherapy. Downside: all convergence guarantees disappear.

Markov Systems: Slide 34. Memory-based Value Functions. J("state") = J(most similar state in memory to "state"), or the average of J over the 20 most similar states, or a weighted average of J over the 20 most similar states. [Jeff Peng; Atkeson & Schaal; Geoff Gordon (proved stuff); Schneider, Boyan & Moore 98, the "Planet Mars Scheduler".]

Markov Systems: Slide 35. Hierarchical Methods. A kind of decision-tree value function, continuous state space: "split a state when it is statistically significant that a split would improve performance" (e.g. Simmons et al. 83, Chapman & Kaelbling 92, Mark Ring 94, …, Munos 96 with interpolation!), or "prove a need for a higher resolution" (Moore 93, Moore & Atkeson 95). Discrete space: Chapman & Kaelbling 92, McCallum 95 (includes hidden state). Multiresolution, continuous space. A hierarchy with high-level "managers" abstracting low-level "servants": many O.R. papers, Dayan & Sejnowski's Feudal learning, Dietterich 1998 (MAX-Q hierarchy), Moore, Baird & Kaelbling 2000 (airports hierarchy).

Markov Systems: Slide 36. What You Should Know. The definition of a Markov system with discounted rewards; how to solve it with matrix inversion; how (and why) to solve it with value iteration; the definition of an MDP, and value iteration to solve an MDP; policy iteration; great respect for the way this formalism generalizes the deterministic searching of the start of the class, but awareness of what has been sacrificed.