CSE 473: Markov Decision Processes
Dan Weld
Many slides from Chris Bishop, Mausam, Dan Klein, Stuart Russell, Andrew Moore & Luke Zettlemoyer

MDPs (Markov Decision Processes): Planning Under Uncertainty
- Mathematical framework
- Bellman equations
- Value iteration
- Real-time dynamic programming
- Policy iteration
- Reinforcement learning
[Photo: Andrey Markov (1856-1922)]

Recap: Defining MDPs
A Markov decision process consists of:
- States S, with a start state s0
- Actions A
- Transitions P(s'|s, a), a.k.a. the transition model T(s, a, s')
- Rewards R(s, a, s') (and a discount factor γ)
The goal is to compute the optimal policy π*:
- π is a function that chooses an action for each state
- We want the policy that maximizes the expected sum of discounted rewards
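To make the recap concrete, here is a minimal Python sketch of these ingredients (the class name MDP and the combined transition/reward signature are my own choices, not from the lecture): an MDP is just a set of states, an action function A(s), a model mapping (s, a) to a list of (probability, next state, reward) outcomes, and a discount γ.

    from typing import Callable, Hashable, List, Tuple

    State = Hashable
    Action = Hashable
    Outcome = Tuple[float, State, float]   # (P(s'|s,a), s', R(s,a,s'))

    class MDP:
        def __init__(self,
                     states: List[State],
                     actions: Callable[[State], List[Action]],
                     transitions: Callable[[State, Action], List[Outcome]],
                     gamma: float = 0.9):
            self.states = states            # S
            self.actions = actions          # A(s): legal actions in s ([] if terminal)
            self.transitions = transitions  # (s, a) -> [(P(s'|s,a), s', R(s,a,s')), ...]
            self.gamma = gamma              # discount factor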

High-Low as an MDP
- States: 2, 3, 4, done
- Actions: High, Low
- Model T(s, a, s'), with card distribution P(2) = 1/2, P(3) = 1/4, P(4) = 1/4:
  P(s'=4 | 4, Low) = 1/4, P(s'=3 | 4, Low) = 1/4, P(s'=2 | 4, Low) = 1/2, P(s'=done | 4, Low) = 0
  P(s'=4 | 4, High) = 1/4, P(s'=3 | 4, High) = 0, P(s'=2 | 4, High) = 0, P(s'=done | 4, High) = 3/4
  ...
- Rewards R(s, a, s'): the number shown on s' if the guess was correct (s' > s for a = "high", s' < s for a = "low"); 0 otherwise (a tie pays nothing but the game continues)
- Start state: 3
So, what's a policy?  π : {2, 3, 4, D} → {hi, lo}
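As an illustration, the High-Low MDP can be written against the sketch above. The card distribution P(2)=1/2, P(3)=1/4, P(4)=1/4 and the "ties pay nothing but the game continues" convention are read off the Q-value arithmetic on the later slides; the function names are mine.

    CARD_PROBS = {2: 0.5, 3: 0.25, 4: 0.25}

    def highlow_actions(s):
        return [] if s == 'done' else ['hi', 'lo']

    def highlow_transitions(s, a):
        outcomes = []
        for card, p in CARD_PROBS.items():
            if card == s:                       # tie: reward 0, game continues
                outcomes.append((p, s, 0.0))
            elif (card > s) == (a == 'hi'):     # correct guess: reward = card shown
                outcomes.append((p, card, float(card)))
            else:                               # wrong guess: game over
                outcomes.append((p, 'done', 0.0))
        return outcomes

    highlow = MDP(states=[2, 3, 4, 'done'],
                  actions=highlow_actions,
                  transitions=highlow_transitions,
                  gamma=1.0)  # the worked example later in the deck uses undiscounted rewards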

Bellman Equations for MDPs
Define V*(s) (the optimal value) as the maximum expected discounted reward achievable from state s. V* must satisfy the following equation:
  V*(s) = max_a Σ_s' P(s'|s, a) [ R(s, a, s') + γ V*(s') ]

Bellman Equations for MDPs: the Q function
Q*(s, a) is the expected discounted reward of taking action a in state s and acting optimally thereafter:
  Q*(s, a) = Σ_s' P(s'|s, a) [ R(s, a, s') + γ V*(s') ]
  V*(s) = max_a Q*(s, a)

Existence
If the MDP has positive-reward cycles, does V* exist? (Only if rewards are discounted, γ < 1, or the horizon is finite; otherwise the sum of rewards can diverge.)

Bellman Backup (MDP)
Given an estimate of the V* function (say V_n), a Bellman backup of V_n at state s calculates a new estimate V_{n+1}:
  Q_{n+1}(s, a) = Σ_s' P(s'|s, a) [ R(s, a, s') + γ V_n(s') ]
  V_{n+1}(s) = max_{a ∈ Ap(s)} Q_{n+1}(s, a)
Here Q_{n+1}(s, a) is the value/cost of the strategy "execute action a in s, then execute π_n", where
  π_n(s) = argmax_{a ∈ Ap(s)} Q_n(s, a)
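A sketch of the backup just described, written against the MDP class defined earlier (again my naming, not the lecture's code): Q_{n+1}(s, a) is an expectation over outcomes, and V_{n+1}(s) is the max over actions.

    def q_value(mdp, V, s, a):
        # Q_{n+1}(s,a) = sum_{s'} P(s'|s,a) * [ R(s,a,s') + gamma * V_n(s') ]
        return sum(p * (r + mdp.gamma * V[s2]) for p, s2, r in mdp.transitions(s, a))

    def bellman_backup(mdp, V, s):
        # Returns (V_{n+1}(s), greedy action); terminal states keep value 0.
        acts = mdp.actions(s)
        if not acts:
            return 0.0, None
        qs = {a: q_value(mdp, V, s, a) for a in acts}
        best = max(qs, key=qs.get)
        return qs[best], best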

Bellman Backup: a worked example
From state s0 there are three actions, leading to successor states with current estimates V0(s1) = 0, V0(s2) = 1, V0(s3) = 2 (taking γ ≈ 1):
  Q1(s0, a1) = 2 + γ·0 ≈ 2
  Q1(s0, a2) = 5 + γ·(0.9·1 + 0.1·2) ≈ 6.1
  Q1(s0, a3) = 4.5 + γ·2 ≈ 6.5
  V1(s0) = max_a Q1(s0, a) = 6.5,  a_greedy = a3

Value Iteration [Bellman '57]
- Assign an arbitrary value V0 to each state.
- Repeat: for all states s, compute V_{n+1}(s) by a Bellman backup at s.
- Until max_s |V_{n+1}(s) − V_n(s)| < ε   (this max is the residual of iteration n+1)
Theorem: value iteration converges to the unique optimal values.
Basic idea: the approximations get refined towards the optimal values.
The policy may converge long before the values do.
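The loop above, as a sketch using bellman_backup from earlier (eps and max_iters are illustrative defaults, not values from the lecture):

    def value_iteration(mdp, eps=1e-6, max_iters=10_000):
        V = {s: 0.0 for s in mdp.states}             # arbitrary initial assignment V_0
        for _ in range(max_iters):
            newV = {s: bellman_backup(mdp, V, s)[0] for s in mdp.states}
            residual = max(abs(newV[s] - V[s]) for s in mdp.states)
            V = newV
            if residual < eps:                        # max_s |V_{n+1}(s) - V_n(s)| < eps
                break
        return V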

Example: Value Iteration


What about the Policy?
Which action should we choose from state s:
- Given the optimal Q-values Q*?   π*(s) = argmax_a Q*(s, a)
- Given the optimal values V*?     π*(s) = argmax_a Σ_s' P(s'|s, a) [ R(s, a, s') + γ V*(s') ]
Lesson: actions are easier to select from Q's!
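The contrast can be seen in code (a sketch, reusing q_value from earlier): extracting a policy from Q* is a bare argmax, while extracting it from V* needs one more expectimax step through the transition model.

    def policy_from_q(mdp, Q):
        # Q is a dict keyed by (state, action); no model needed here.
        return {s: max(mdp.actions(s), key=lambda a: Q[(s, a)])
                for s in mdp.states if mdp.actions(s)}

    def policy_from_v(mdp, V):
        # From V we must look ahead through P(s'|s,a) and R(s,a,s').
        return {s: max(mdp.actions(s), key=lambda a: q_value(mdp, V, s, a))
                for s in mdp.states if mdp.actions(s)}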

Policy Computation
The optimal policy is obtained by a greedy one-step lookahead on V*:
  π*(s) = argmax_{a ∈ Ap(s)} Σ_s' P(s'|s, a) [ R(s, a, s') + γ V*(s') ]
Theorem: for infinite / indefinite horizon problems, the optimal policy is
- stationary (time-independent), and
- deterministic.

Example: Value Iteration (High-Low)
States 2, 3, 4, D; initial values V0(s2) = 0, V0(s3) = 0, V0(s4) = 0, V0(sD) = 0.
  Q1(s3, hi) = .25·(4 + 0) + .25·(0 + 0) + .5·0 = 1

Example: Value Iteration (continued)
With V0(s2) = 0, V0(s3) = 0, V0(s4) = 0, V0(sD) = 0:
  Q1(s3, hi) = .25·(4 + 0) + .25·(0 + 0) + .5·0 = 1
  Q1(s3, lo) = .5·(2 + 0) + .25·(0 + 0) + .25·0 = 1
  V1(s3) = max_a Q1(s3, a) = max(1, 1) = 1

Example: Value Iteration (continued)
V0(s2) = 0, V1(s3) = 1, V0(s4) = 0, V0(sD) = 0

Example: Value Iteration (continued)
With V0(s2) = 0, V0(s3) = 0, V0(s4) = 0, V0(sD) = 0:
  Q1(s2, hi) = .5·(0 + 0) + .25·(3 + 0) + .25·(4 + 0) = 1.75

Example: Value Iteration (continued)
  Q1(s2, hi) = .5·(0 + 0) + .25·(3 + 0) + .25·(4 + 0) = 1.75
  Q1(s2, lo) = .5·(0 + 0) + .25·0 + .25·0 = 0
  V1(s2) = max_a Q1(s2, a) = max(1.75, 0) = 1.75

Example: Value Iteration (after round 1)
V1(s2) = 1.75, V1(s3) = 1, V1(s4) = 1.75, V1(sD) = 0
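These round-1 numbers can be reproduced with the sketches above by doing one synchronous sweep of Bellman backups from V0 = 0:

    V0 = {s: 0.0 for s in highlow.states}
    V1 = {s: bellman_backup(highlow, V0, s)[0] for s in highlow.states}
    print(V1)   # {2: 1.75, 3: 1.0, 4: 1.75, 'done': 0.0}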


Round 2
With V1(s2) = 1.75, V1(s3) = 1, V1(s4) = 1.75, V1(sD) = 0:
  Q2(s3, hi) = .25·(4 + 1.75) + .25·(0 + 1) + .5·0 = 1.6875

Round 2 (continued)
  Q2(s2, hi) = .5·(0 + 1.75) + .25·(3 + 1) + .25·(4 + 1.75) = 3.3125
  Q2(s2, lo) = .5·(0 + 1.75) + .25·(0 + 0) + .25·(0 + 0) = 0.875
  V2(s2) = max_a Q2(s2, a) = max(3.3125, 0.875) = 3.3125

Example: Bellman Updates
(Gridworld example: γ = 0.9, living reward = 0, noise = 0.2)

Example: Value Iteration
Information propagates outward from the terminal states, and eventually all states have correct value estimates (compare V1 and V2 in the figure).

Convergence
Define the max-norm: ||U|| = max_s |U(s)|.
Theorem: for any two approximations U and V,
  ||U_{t+1} − V_{t+1}|| ≤ γ ||U_t − V_t||
I.e., any two distinct approximations must get closer to each other; so, in particular, any approximation must get closer to the true U, and value iteration converges to a unique, stable, optimal solution.
Theorem:
  if ||U_{t+1} − U_t|| < ε, then ||U_{t+1} − U*|| < εγ / (1 − γ)
I.e., once the change in our approximation is small, it must also be close to correct.

Value Iteration: Complexity
- Problem size: |A| actions and |S| states
- Each iteration: computation O(|A|·|S|²), space O(|S|)
- Number of iterations: can be exponential in the discount factor γ

Summary: Value Iteration
Idea:
- Start with V0*(s) = 0, which we know is right (why?)
- Given Vi*, calculate the values for all states for depth i+1:
    V_{i+1}(s) ← max_a Σ_s' T(s, a, s') [ R(s, a, s') + γ Vi(s') ]
  This is called a value update or Bellman update.
- Repeat until convergence.
Theorem: will converge to the unique optimal values.
Basic idea: approximations get refined towards the optimal values.
The policy may converge long before the values do.


Asynchronous Value Iteration
- States may be backed up in any order, instead of sweep by sweep.
- As long as every state is backed up infinitely often, asynchronous value iteration converges to the optimal values.
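A minimal sketch of the asynchronous, in-place variant (backing up states in a random order, which satisfies the "every state infinitely often" condition in the limit; the names are mine):

    import random

    def async_value_iteration(mdp, num_backups=100_000):
        V = {s: 0.0 for s in mdp.states}
        for _ in range(num_backups):
            s = random.choice(mdp.states)        # any order works if no state is starved
            V[s], _ = bellman_backup(mdp, V, s)  # in-place backup: later backups see the new value
        return V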

Asynchronous VI: Prioritized Sweeping
- Why back up a state if the values of its successors haven't changed?
- Prefer backing up a state whose successors had the most change.
- Keep a priority queue of (state, expected change in value).
- Back up states in order of priority.
- After backing up a state, update the priority queue for all of its predecessors.
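One way this could look in code, as a rough sketch (the predecessors map is assumed to be supplied by the caller, and the priority is estimated by recomputing a backup; neither detail is from the lecture):

    import heapq, itertools

    def prioritized_sweeping(mdp, predecessors, eps=1e-6, max_backups=100_000):
        V = {s: 0.0 for s in mdp.states}
        tie = itertools.count()                  # tie-breaker so heapq never compares states
        pq = []
        for s in mdp.states:                     # seed: how much would one backup change each state?
            est, _ = bellman_backup(mdp, V, s)
            heapq.heappush(pq, (-abs(est - V[s]), next(tie), s))
        for _ in range(max_backups):
            if not pq:
                break
            neg_prio, _, s = heapq.heappop(pq)
            if -neg_prio < eps:
                break
            V[s], _ = bellman_backup(mdp, V, s)  # back up the highest-priority state
            for p in predecessors.get(s, ()):    # its predecessors may now be stale: re-prioritize
                est, _ = bellman_backup(mdp, V, p)
                heapq.heappush(pq, (-abs(est - V[p]), next(tie), p))
        return V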

Asynchronous Value Iteration: Real-Time Dynamic Programming [Barto, Bradtke, Singh '95]
- Trial: simulate the greedy policy starting from the start state, performing a Bellman backup on each visited state.
- RTDP: repeat trials until the value function converges.
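A sketch of RTDP under these definitions (the start state and terminal set are passed in explicitly; this uses the reward-maximization convention of the earlier slides rather than the cost-minimization form on the next slide):

    import random

    def rtdp_trial(mdp, V, start, terminals, max_steps=1000):
        s = start
        for _ in range(max_steps):
            if s in terminals:
                break
            V[s], a = bellman_backup(mdp, V, s)              # back up the visited state
            if a is None:                                    # no actions: treat as terminal
                break
            outcomes = mdp.transitions(s, a)                 # then simulate the greedy action
            weights = [p for p, _, _ in outcomes]
            _, s, _ = random.choices(outcomes, weights=weights)[0]
        return V

    def rtdp(mdp, start, terminals, num_trials=1000):
        V = {st: 0.0 for st in mdp.states}
        for _ in range(num_trials):
            rtdp_trial(mdp, V, start, terminals)
        return V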


RTDP Trial
Starting from s0, compute Q_{n+1}(s0, a) for each action (a1, a2, a3) from the current estimates V_n of its successors, set V_{n+1}(s0) = min_a Q_{n+1}(s0, a), take the greedy action (a_greedy = a2 in the figure), sample a successor, and repeat until the Goal is reached, backing up every visited state along the way. (The min here comes from a cost-based, stochastic-shortest-path formulation: backups minimize expected cost to the goal rather than maximize expected reward.)

RTDP: Comments
- Properties: if all states are visited infinitely often, then V_n → V*.
- Advantages: anytime — the more probable states are explored quickly.
- Disadvantages: complete convergence can be slow!