
1 QUIZ!! • T/F: Optimal policies can be defined from an optimal Value function. TRUE • T/F: “Pick the MEU action first, then follow optimal policy” is optimal. TRUE • T/F: π*(s) = max_s’ V*(s’). FALSE • T/F: The Bellman equation can be satisfied by sub-optimal value functions. FALSE • T/F: Value Iteration: The policy cannot converge before the value function. FALSE • Explain the difference between Policy Iteration and Value Iteration. • Why can Policy Iteration be faster than Value Iteration?

2 CS 511a: Artificial Intelligence, Spring 2013 Lecture 11: MDPs / Reinforcement Learning Feb 25, 2013 Robert Pless; course adapted from Kilian Weinberger, with many slides from Dan Klein, Stuart Russell, or Andrew Moore

3 Announcements • Project 2 due Thursday night. • HW 1 due Friday 5pm* • * accepted with no penalty or late-day charge until Monday 10am.

4 Policy Iteration • Why do we compute V* or Q*, if all we care about is the best policy π*?

5 Utilities for Fixed Policies • Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy • Define the utility of a state s, under a fixed policy π: V^π(s) = expected total discounted rewards (return) starting in s and following π • Recursive relation (one-step look-ahead / Bellman equation): [diagram: one-step look-ahead tree rooted at s, taking a = π(s) into the q-state (s, a), with transition probabilities T(s,a,s’) and rewards R(s,a,s’) leading to successors s’]
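The recursive relation referenced on this slide (its equation does not survive in the transcript) is the standard fixed-policy Bellman equation, written here in the notation of these slides:

```latex
V^{\pi}(s) \;=\; \sum_{s'} T\bigl(s, \pi(s), s'\bigr)\,\bigl[\, R\bigl(s, \pi(s), s'\bigr) + \gamma\, V^{\pi}(s') \,\bigr]
```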

6 Policy Evaluation • How do we calculate the V’s for a fixed policy π? • Idea one: modify the Bellman updates to use the fixed action π(s) and iterate • Idea two: the fixed-policy values are a stationary point (equality holds), so it’s just a linear system; solve with Matlab (or whatever)
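A minimal sketch of both ideas, assuming the MDP is represented as dictionaries T[(s, a)] = list of (s', prob) pairs, R[(s, a, s')] = reward, and pi[s] = the fixed action (these names are illustrative, not from the course project code):

```python
import numpy as np

def policy_evaluation_iterative(states, pi, T, R, gamma=0.9, iters=100):
    """Idea one: repeat the simplified (fixed-policy) Bellman update until V stops changing."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: sum(p * (R[(s, pi[s], s2)] + gamma * V[s2])
                    for s2, p in T[(s, pi[s])])
             for s in states}
    return V

def policy_evaluation_linear(states, pi, T, R, gamma=0.9):
    """Idea two: V^pi solves the linear system (I - gamma * P_pi) v = r_pi exactly."""
    idx = {s: i for i, s in enumerate(states)}
    P = np.zeros((len(states), len(states)))
    r = np.zeros(len(states))
    for s in states:
        for s2, p in T[(s, pi[s])]:
            P[idx[s], idx[s2]] += p
            r[idx[s]] += p * R[(s, pi[s], s2)]
    v = np.linalg.solve(np.eye(len(states)) - gamma * P, r)
    return {s: v[idx[s]] for s in states}
```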

7 Policy Iteration • Policy evaluation: with the current policy π fixed, find values with simplified Bellman updates: • Iterate until values converge • Policy improvement: with the utilities fixed, find the best action according to a one-step look-ahead
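A sketch of the full loop, reusing the hypothetical MDP representation and the policy_evaluation_iterative sketch above (not the project's actual implementation):

```python
def policy_iteration(states, actions, T, R, gamma=0.9):
    """Alternate evaluation of the current policy and greedy one-step improvement."""
    pi = {s: actions[0] for s in states}                          # arbitrary initial policy
    while True:
        # evaluation step: values of the frozen policy (sketch defined earlier)
        V = policy_evaluation_iterative(states, pi, T, R, gamma)
        # improvement step: one-step look-ahead with the frozen values
        new_pi = {s: max(actions,
                         key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                           for s2, p in T[(s, a)]))
                  for s in states}
        if new_pi == pi:                                          # policy stable: done
            return pi, V
        pi = new_pi
```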

8 Comparison • In value iteration: • Every pass (or “backup”) updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy) • Policy might not change between updates (wastes computation) • In policy iteration: • Several passes to update utilities with the policy frozen • Occasional passes to update the policy • Value update can be solved as a linear system • Can be faster, if the policy changes infrequently • Hybrid approaches (asynchronous policy iteration): • Any sequence of partial updates to either policy entries or utilities will converge if every state is visited infinitely often

9 Asynchronous Value Iteration • In value iteration, we update every state in each iteration • Actually, any sequence of Bellman updates will converge if every state is visited infinitely often • In fact, we can update the policy as seldom or often as we like, and we will still converge • Idea: update states whose value we expect to change: if |V_{i+1}(s) - V_i(s)| is large, then update the predecessors of s
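One way to realize this idea is a priority queue keyed on the size of the Bellman residual (a prioritized-sweeping-style sketch, not the exact algorithm from the lecture); preds[s] is an assumed map from each state to its predecessor states:

```python
import heapq

def asynchronous_value_iteration(states, actions, T, R, preds, gamma=0.9, theta=1e-3):
    """Update states in (approximate) order of how much their value is expected to change."""
    V = {s: 0.0 for s in states}

    def backup(s):
        # one Bellman backup at s under the current value estimates
        return max(sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                   for a in actions)

    heap = [(-abs(backup(s) - V[s]), s) for s in states]
    heapq.heapify(heap)
    while heap:
        _, s = heapq.heappop(heap)
        diff = abs(backup(s) - V[s])            # re-check: the stored priority may be stale
        if diff < theta:
            continue
        V[s] = backup(s)
        for sp in preds[s]:                     # a big change at s makes its predecessors stale
            d = abs(backup(sp) - V[sp])
            if d > theta:
                heapq.heappush(heap, (-d, sp))
    return V
```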

10 Reinforcement Learning

11 Reinforcement Learning • Reinforcement learning: • Still have an MDP: • A set of states s ∈ S • A set of actions (per state) A • A model T(s,a,s’) • A reward function R(s,a,s’) • Still looking for a policy π(s) • New twist: don’t know T or R • I.e., don’t know which states are good or what the actions do • Must actually try actions and states out to learn [Demo]

12 Example: Animal Learning • RL studied experimentally for more than 60 years in psychology • Rewards: food, pain, hunger, drugs, etc. • Mechanisms and sophistication debated • Example: foraging • Bees learn a near-optimal foraging plan in a field of artificial flowers with controlled nectar supplies • Bees have a direct neural connection from nectar intake measurement to motor planning area

13 Passive Learning • Simplified task • You don’t know the transitions T(s,a,s’) • You don’t know the rewards R(s,a,s’) • You are given a policy π(s) • Goal: learn the state values • … what policy evaluation did • In this case: • Learner is “along for the ride” • No choice about what actions to take • Just execute the policy and learn from experience • We’ll get to the active case soon • This is NOT offline planning! You actually take actions in the world and see what happens…

14 Passive Model-Based Learning • Idea: • Learn the model empirically through experience • Solve for values as if the learned model were correct • Simple empirical model learning: • Count outcomes for each (s, a) • Normalize to give an estimate of T(s,a,s’) • Discover R(s,a,s’) when we experience (s,a,s’) • Solving the MDP with the learned model: • Iterative policy evaluation, for example [diagram: experienced transitions s, π(s), s’]
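A minimal sketch of the counting idea (the class and method names are illustrative, not from the course projects); it assumes the reward observed for a transition is deterministic:

```python
from collections import defaultdict

class ModelBasedLearner:
    """Learn T and R empirically from experienced transitions, then solve the learned MDP."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
        self.rewards = {}                                      # (s, a, s') -> observed reward

    def observe(self, s, a, s2, r):
        """Record one experienced transition while following the fixed policy."""
        self.counts[(s, a)][s2] += 1
        self.rewards[(s, a, s2)] = r        # assumes R(s,a,s') is deterministic once seen

    def T_hat(self, s, a, s2):
        """Normalized outcome counts give the empirical estimate of T(s, a, s')."""
        total = sum(self.counts[(s, a)].values())
        return self.counts[(s, a)][s2] / total if total else 0.0
```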

15 Example: Model-Based Learning • Episodes (γ = 1, grid world with exit rewards +100 and -100): (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100 (done); and (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100 (done) • Learned model: T((3,3), right, (4,3)) = 1 / 3, T((2,3), right, (3,3)) = 2 / 2

16 Passive Model-Free Learning • Big idea: why bother learning T? • 1. Direct Estimation: • Average the observed discounted returns for each state to estimate V(s) directly. • No need to compute T or R.
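A sketch of direct estimation, assuming each episode is recorded as a list of (state, reward) pairs in the order they occurred (as in the episodes on the example slides):

```python
from collections import defaultdict

def direct_estimation(episodes, gamma=1.0):
    """Estimate V(s) as the average of the discounted returns observed after each visit to s."""
    returns = defaultdict(list)                    # state -> list of sampled returns
    for episode in episodes:                       # episode: [(state, reward), ...]
        G = 0.0
        for s, r in reversed(episode):             # accumulate the return backwards in time
            G = r + gamma * G
            returns[s].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```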

17 Model-Free Learning • Want to compute an expectation weighted by P(x): • Model-based: estimate P(x) from samples, compute expectation • Model-free: estimate expectation directly from samples • Why does this work? Because samples appear with the right frequencies!
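A tiny illustration of the two routes to the same expectation, using a made-up distribution and function (not from the slides):

```python
from collections import Counter
import random

random.seed(0)
samples = [random.choice([1, 2, 2, 3]) for _ in range(10000)]   # draws from an unknown P(x)
f = lambda x: x * x

# Model-based: first estimate P(x) from the samples, then compute the weighted expectation.
P_hat = {x: c / len(samples) for x, c in Counter(samples).items()}
model_based = sum(p * f(x) for x, p in P_hat.items())

# Model-free: average f(x) over the samples directly;
# the samples already appear with the right frequencies.
model_free = sum(f(x) for x in samples) / len(samples)

print(model_based, model_free)    # identical here: the two computations are algebraically the same
```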

18 Example: Model-Free Estimation • Episodes (γ = 1, R = -1 per step, exit rewards +100 and -100): (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100 (done); and (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100 (done) • V(2,3) ≈ (96 - 103) / 2 = -3.5 • V(3,3) ≈ (99 + 97 - 102) / 3 ≈ 31.3

19 Sample-Based Policy Evaluation? • Update V without building T or R. [diagram: from state s, taking π(s), sampled successors s1’, s2’, s3’]

20 Passive Model-Free Learning • Big idea: why bother learning T? • 1. Direct Estimation: • Average the observed discounted returns for each state to estimate V(s) directly. • No need to compute T or R. • 2. Temporal-Difference Learning: • Update the value function towards whatever successor occurs; maintain a running average.

21 Temporal-Difference Learning • Big idea: learn from every experience! • Update V(s) each time we experience (s,a,s’,r) • Likely s’ will contribute updates more often • Temporal difference learning • Policy still fixed! • Move values toward value of whatever successor occurs: running average! • Sample of V(s): sample = R(s,π(s),s’) + γ V^π(s’) • Update to V(s): V^π(s) ← (1 - α) V^π(s) + α · sample • Same update: V^π(s) ← V^π(s) + α (sample - V^π(s))
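The same update as a one-line sketch in code (values stored in a plain dictionary; names are illustrative):

```python
def td_update(V, s, s2, r, alpha=0.1, gamma=1.0):
    """One TD(0) update after experiencing (s, pi(s), s', r): move V(s) toward the sample."""
    sample = r + gamma * V[s2]                       # sample of V(s) given the observed successor
    V[s] = (1 - alpha) * V[s] + alpha * sample       # same as V[s] += alpha * (sample - V[s])
```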

22 Exponential Moving Average • Exponential moving average: x̄_n = (1 - α) · x̄_{n-1} + α · x_n • Makes recent samples more important • Forgets about the past (distant past values were wrong anyway) • Easy to compute incrementally from the running average • Decreasing learning rate can give converging averages
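A short sketch of the incremental computation; the function name is illustrative:

```python
def exponential_moving_average(samples, alpha=0.5):
    """x_bar_n = (1 - alpha) * x_bar_{n-1} + alpha * x_n: recent samples dominate the average."""
    avg = samples[0]
    for x in samples[1:]:
        avg = (1 - alpha) * avg + alpha * x
    return avg

# A sample seen k steps ago contributes with weight alpha * (1 - alpha)**k,
# so the distant past is exponentially "forgotten".
```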

23 Problems with TD Value Learning • TD value learning is a model-free way to do policy evaluation • However, if we want to turn values into a (new) policy, we’re sunk: the greedy one-step look-ahead π(s) = argmax_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V(s’) ] still needs T and R • Idea: learn Q-values directly • Makes action selection model-free too!
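The contrast in a few lines (Q stored as a dictionary keyed by (state, action); illustrative sketch):

```python
# With only V, acting greedily still needs the model (one-step look-ahead through T and R):
#   pi(s) = argmax_a  sum_{s'} T(s,a,s') * [ R(s,a,s') + gamma * V(s') ]
# With Q-values, action selection is model-free:
def greedy_action(Q, s, actions):
    return max(actions, key=lambda a: Q[(s, a)])
```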

24 Active Learning • Full reinforcement learning • You don’t know the transitions T(s,a,s’) • You don’t know the rewards R(s,a,s’) • You can choose any actions you like • Goal: learn the optimal policy • … what value iteration did! • In this case: • Learner makes choices! • Fundamental tradeoff: exploration vs. exploitation • This is NOT offline planning! You actually take actions in the world and find out what happens…

25 Detour: Q-Value Iteration • Value iteration: find successive approximations of the optimal values • Start with V_0*(s) = 0, which we know is right (why?) • Given V_i*, calculate the values for all states for depth i+1: V_{i+1}(s) = max_a Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ V_i(s’) ] • But Q-values are more useful! • Start with Q_0*(s,a) = 0, which we know is right (why?) • Given Q_i*, calculate the q-values for all q-states for depth i+1: Q_{i+1}(s,a) = Σ_s’ T(s,a,s’) [ R(s,a,s’) + γ max_a’ Q_i(s’,a’) ]
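The Q-value recursion in code, using the same hypothetical dictionary-based MDP representation as the earlier sketches:

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, iters=100):
    """Q_{i+1}(s,a) = sum_{s'} T(s,a,s') * [ R(s,a,s') + gamma * max_{a'} Q_i(s',a') ]."""
    Q = {(s, a): 0.0 for s in states for a in actions}       # Q_0 = 0: correct for a horizon of 0
    for _ in range(iters):
        Q = {(s, a): sum(p * (R[(s, a, s2)] + gamma * max(Q[(s2, a2)] for a2 in actions))
                         for s2, p in T[(s, a)])
             for s in states for a in actions}
    return Q
```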

26 Q-Learning • Q-Learning: sample-based Q-value iteration • Learn Q*(s,a) values • Receive a sample (s,a,s’,r) • Consider your old estimate: Q(s,a) • Consider your new sample estimate: sample = r + γ max_a’ Q(s’,a’) • Incorporate the new estimate into a running average: Q(s,a) ← (1 - α) Q(s,a) + α · sample [DEMO – Grid Q’s]
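The same three steps as one update function (Q stored as a dictionary keyed by (state, action); a sketch, not the project code):

```python
def q_learning_update(Q, s, a, s2, r, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update from the received sample (s, a, s', r)."""
    sample = r + gamma * max(Q[(s2, a2)] for a2 in actions)   # new sample estimate
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample      # blend into the running average
```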


28 Example: Tom Erez, Hopper: http://www.youtube.com/watch?feature=player_embedded&v=kUfmnoobTHQ

