1 CMSC 471 – Spring 2014 Class #25 – Thursday, May 1
MDPs and the RL Problem
CMSC 471 – Spring 2014, Class #25 – Thursday, May 1
Russell & Norvig Chapter
Thanks to Rich Sutton and Andy Barto for the use of their slides (modified with additional slides and an in-class exercise).
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

2 Learning Without a Model
Last time, we saw how to learn a value function and/or a policy from a transition model. What if we don't have a transition model?
Idea #1: Explore the environment for a long time, record all transitions, learn the transition model, then apply value iteration / policy iteration. Slow, requires a lot of exploration, and gives no intermediate learning!
Idea #2: Learn a value function (or policy) directly from interactions with the environment, while exploring.
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
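To make Idea #1 concrete, here is a minimal sketch (not from the slides; the function and variable names are illustrative) of estimating a transition model and reward function from logged transitions by counting:

    from collections import defaultdict

    def estimate_model(transitions):
        """Estimate T(s' | s, a) and R(s, a) from logged (s, a, r, s') tuples by counting."""
        counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
        reward_sum = defaultdict(float)                  # (s, a) -> summed observed reward
        visits = defaultdict(int)                        # (s, a) -> number of visits
        for s, a, r, s_next in transitions:
            counts[(s, a)][s_next] += 1
            reward_sum[(s, a)] += r
            visits[(s, a)] += 1
        T = {sa: {s2: c / visits[sa] for s2, c in succ.items()}   # empirical P(s' | s, a)
             for sa, succ in counts.items()}
        R = {sa: reward_sum[sa] / visits[sa] for sa in visits}    # mean observed reward
        return T, R

The estimated T and R could then be handed to value iteration or policy iteration, which is why Idea #1 is slow: nothing useful is learned until enough transitions have been logged.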

3 Simple Monte Carlo
[Figure: Monte Carlo backup diagram; the value of the visited state is backed up from the complete episode ending in a terminal state (marked T).]
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
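A minimal sketch of the update behind this diagram, assuming the standard constant-α every-visit Monte Carlo rule V(s_t) ← V(s_t) + α [R_t − V(s_t)] from Sutton & Barto (the code itself is my own illustration):

    def mc_update(V, episode, alpha=0.1, gamma=1.0):
        """Constant-alpha every-visit Monte Carlo.
        episode: list of (s_t, r_{t+1}) pairs for a complete episode (ending at a terminal state)."""
        G = 0.0
        targets = []
        for s, r in reversed(episode):      # compute the return R_t backward from the end
            G = r + gamma * G
            targets.append((s, G))
        for s, G in targets:                # move each V(s_t) toward its actual return R_t
            V[s] = V[s] + alpha * (G - V[s])
        return V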

4 TD Prediction
Policy Evaluation (the prediction problem): for a given policy π, compute the state-value function Vπ.
Recall the simple (constant-α) Monte Carlo update: V(s_t) ← V(s_t) + α [R_t − V(s_t)], where the target R_t is the actual return after time t.
The simplest TD method, TD(0): V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)], where the target r_{t+1} + γ V(s_{t+1}) is an estimate of the return.
γ: a discount factor in [0,1] (relative value of future rewards)
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
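As a small illustration with made-up numbers (not from the slides), the two targets for the same time step can be computed side by side:

    # Toy comparison of the two prediction targets for state s_t (illustrative numbers).
    gamma = 0.9
    rewards_after_t = [1.0, 0.0, 2.0]     # r_{t+1}, r_{t+2}, r_{t+3}; the episode then ends
    V_next = 1.5                          # current estimate V(s_{t+1})

    mc_target = sum(gamma**k * r for k, r in enumerate(rewards_after_t))   # actual return R_t
    td_target = rewards_after_t[0] + gamma * V_next                        # estimated return
    print(mc_target, td_target)           # 1.0 + 0 + 0.81*2.0 = 2.62 vs 1.0 + 0.9*1.5 = 2.35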

5 Simplest TD Method
[Figure: TD(0) backup diagram; the value of the visited state is backed up from the immediate reward plus the estimated value of the next state (terminal states marked T).]
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

6 Temporal Difference Learning
TD-learning:
  Uπ(s) ← Uπ(s) + α (R(s) + γ Uπ(s') − Uπ(s))
or equivalently:
  Uπ(s) ← α [R(s) + γ Uπ(s')] + (1 − α) [Uπ(s)]
where Uπ(s) is the previous utility estimate, Uπ(s') is the previous utility estimate for the successor state, R(s) is the observed reward, α is the learning rate, and γ is the discount rate.
General idea: Iteratively update utility values, assuming that current utility values for the other (local) states are correct.
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
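A minimal sketch of this update loop as code, assuming a generic environment interface (env.reset(), and env.step(a) returning (next_state, reward, done)) and a fixed policy function; these names are assumptions for illustration, not part of the slides:

    from collections import defaultdict

    def td0_policy_evaluation(env, policy, episodes=1000, alpha=0.1, gamma=0.9):
        """Estimate U^pi with the TD update U(s) <- U(s) + alpha*(r + gamma*U(s') - U(s))."""
        U = defaultdict(float)                        # utility estimates, initialized to 0
        for _ in range(episodes):
            s = env.reset()
            done = False
            while not done:
                a = policy(s)                         # follow the fixed policy pi
                s_next, r, done = env.step(a)
                target = r + gamma * (0.0 if done else U[s_next])
                U[s] += alpha * (target - U[s])       # move U(s) toward the one-step target
                s = s_next
        return U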

7 Exploration vs. Exploitation
Problem with naive reinforcement learning: what action should we take?
The best apparent action, based on learning to date (the greedy strategy)? Often prematurely converges to a suboptimal policy!
A random action? Will cover the entire state space, but is very expensive and slow to learn, and when do we stop being random?
Balance exploration (try random actions) with exploitation (use the best action so far).
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
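One common way to strike this balance is an ε-greedy rule. ε-greedy is a standard technique, but this particular function is my own minimal sketch:

    import random

    def epsilon_greedy(Q, state, actions, epsilon=0.1):
        """With probability epsilon explore (random action); otherwise exploit (greedy action)."""
        if random.random() < epsilon:
            return random.choice(actions)                           # explore
        return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploit: argmax_a Q(s, a)

In practice ε is often decayed over time, which is one answer to "when do we stop being random?".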

8 Q-Learning
Q-value: the value of taking action a in state s (as opposed to V, the value of state s).
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
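The exercise on the next slides is consistent with the standard tabular Q-learning update, Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') − Q(s,a)]. A minimal sketch of that update as a function (the code and its names are my illustration):

    def q_update(Q, s, a, r, s_next, actions, alpha=0.9, gamma=0.9, terminal=False):
        """One tabular Q-learning update for the observed transition (s, a, r, s_next)."""
        best_next = 0.0 if terminal else max(Q.get((s_next, a2), 0.0) for a2 in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
        return Q[(s, a)]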

9 Q-Learning Exercise
Grid world (three cells, left to right): A | B | G     Actions: ← (left), → (right)
Q-learning reminder: Q(s_t, a_t) ← Q(s_t, a_t) + α [R_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
Starting state: A
Reward function: ← in A yields -1 (at time t+1!); → in B yields +1; all other actions yield -0.1; G is a terminal state
Action sequence: ←, ←, →, ←, →, →
All Q-values are initialized to zero (including Q(G, *))
Fill in the following table for the six Q-learning updates:

 t | a_t | s_t | R_{t+1} | s_{t+1} | Q'(s_t, a_t)
 0 |     |  A  |         |         |
 1 |     |     |         |         |
 2 |     |     |         |         |
 3 |     |     |         |         |
 4 |     |     |         |         |
 5 |     |     |         |         |

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

10 Q-Learning Exercise (solution)
Grid world (three cells, left to right): A | B | G     Actions: ← (left), → (right)
Q-learning reminder: Q(s_t, a_t) ← Q(s_t, a_t) + α [R_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
Starting state: A
Reward function: ← in A yields -1 (at time t+1!); → in B yields +1; all other actions yield -0.1; G is a terminal state
Action sequence: ←, ←, →, ←, →, →
All Q-values are initialized to zero (including Q(G, *)); α and γ are 0.9
Fill in the following table for the six Q-learning updates:

 t | a_t | s_t | R_{t+1} | s_{t+1} | Q'(s_t, a_t)
 0 |  ←  |  A  |   -1    |    A    | -0.9
 1 |  ←  |  A  |   -1    |    A    | -0.99
 2 |  →  |  A  |  -0.1   |    B    | -0.09
 3 |  ←  |  B  |  -0.1   |    A    | -0.162
 4 |  →  |  A  |  -0.1   |    B    | -0.099
 5 |  →  |  B  |   +1    |    G    | 0.9

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
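As a check on these numbers, here is a small script (my own illustration; the step list encodes the episode as reconstructed above, with 'L'/'R' standing for the left/right arrows) that replays the six updates:

    # Replay the six Q-learning updates from the exercise (alpha = gamma = 0.9).
    alpha, gamma = 0.9, 0.9
    Q = {}                                           # (state, action) -> value, default 0
    actions = ['L', 'R']                             # 'L' = left arrow, 'R' = right arrow

    # (s_t, a_t, R_{t+1}, s_{t+1}) for t = 0..5, as reconstructed from the slide.
    steps = [('A', 'L', -1.0, 'A'), ('A', 'L', -1.0, 'A'), ('A', 'R', -0.1, 'B'),
             ('B', 'L', -0.1, 'A'), ('A', 'R', -0.1, 'B'), ('B', 'R', +1.0, 'G')]

    for t, (s, a, r, s_next) in enumerate(steps):
        best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)   # 0 at terminal G (all zeros)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
        print(t, s, a, round(Q[(s, a)], 4))
    # Expected output: -0.9, -0.99, -0.09, -0.1629 (shown as -0.162 on the slide), -0.099, 0.9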

