1 Temporal-Difference Learning Week #6

2 Introduction Temporal-Difference (TD) Learning
–is a combination of DP and MC methods:
 updates estimates based on other learned estimates, i.e., it bootstraps (as DP methods do);
 does not require a model of the environment and learns from raw experience (as MC methods do).
–constitutes a basis for reinforcement learning.
–Convergence to V^π is guaranteed (asymptotically, as in MC methods):
 in the mean, for a constant learning rate α that is sufficiently small;
 with probability 1, if α decreases in accordance with the usual stochastic approximation conditions.

3 Update Rules for DP, MC and TD
–DP update: V(s_t) ← E_π[ r_{t+1} + γ V(s_{t+1}) ] (target: expectation over the completely known model)
–MC update: V(s_t) ← V(s_t) + α [ R_t − V(s_t) ] (target: R_t, the most recent complete return)
–TD update: V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ] (target: the latest reward plus the current value estimate of the next state)

4 Update Rules for DP, MC and TD General update rule:
–NewEstimate = OldEstimate + LearningRate × (Target − OldEstimate)
DP update
–The new estimate is recomputed at every time step using the completely specified model of the environment.
MC update
–The target is the actual return obtained at the end of an episode.
TD update
–The target is the most recent reward received from the environment plus the (discounted) current value estimate of the next state.
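To make the general rule concrete, here is a minimal Python sketch (illustrative only, not from the slides) of the incremental update together with the MC and TD(0) choices of target; all function names are hypothetical.

```python
# Sketch of the general incremental update rule,
# NewEstimate = OldEstimate + alpha * (Target - OldEstimate),
# with the MC and TD(0) choices of target. Names are illustrative.

def incremental_update(old_estimate, target, alpha):
    """Move the old estimate toward the target by a fraction alpha."""
    return old_estimate + alpha * (target - old_estimate)

def mc_target(episode_return):
    """MC target: the actual return observed at the end of the episode."""
    return episode_return

def td_target(reward, gamma, value_next_state):
    """TD(0) target: immediate reward plus discounted value estimate of the next state."""
    return reward + gamma * value_next_state
```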

5 Algorithm for TD(0)
Initialize V(s) arbitrarily, and π to the policy to be evaluated
Repeat for each episode
–Initialize s
–Repeat for each step of episode
 a ← action given by π for s
 Take action a; observe reward r and next state s'
 V(s) ← V(s) + α [ r + γ V(s') − V(s) ]
 s ← s'
–Until s is terminal
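Below is a minimal Python sketch of the TD(0) prediction loop above. The environment interface (reset() returning a state, step(action) returning (next_state, reward, done)) and the policy callable are assumptions made for the sketch, not part of the original slides.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation (a sketch).

    Assumes a hypothetical environment with reset() -> state and
    step(action) -> (next_state, reward, done), and policy(state) -> action.
    """
    V = defaultdict(float)  # V(s) initialized to 0 for every state
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + gamma * V[s_next] * (not done)  # TD(0) target
            V[s] += alpha * (target - V[s])              # move V(s) toward the target
            s = s_next
    return V
```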

6 Advantages of TD Prediction Methods
They bootstrap (i.e., learn a guess from other guesses)
–Question: Does learning a guess from a guess still guarantee convergence to the correct state values?
–Answer: Yes; in the mean for a constant and sufficiently small α, and with probability 1 (wp1) if α decreases according to the usual stochastic approximation conditions.
They need no model of the environment
–an advantage over DP methods
They do not wait until the end of an episode to update state/action values
–an advantage over MC methods

7 Advantages of TD Prediction Methods... (2)
Both constant-α MC and TD methods converge (asymptotically) to the correct predictions.
Question: Which method converges faster?
–There is no mathematical proof answering that question.
–In practice, however, TD methods have usually been observed to converge faster than constant-α MC methods on stochastic tasks.

8 Example: Random Walk
–Episodic, undiscounted RL task
–States: A, ..., E; available actions: left (L), right (R)
–Goal: the red terminal square
–Termination: at either end square
–Values: ?
[Figure: the five states A–E in a row, with a terminal square at each end.]

9 Example: Random Walk True values (under the equiprobable random policy):
–V(A)=1/6; V(B)=2/6; V(C)=3/6; V(D)=4/6; V(E)=5/6
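As an illustration, the following self-contained Python sketch runs TD(0) on this random walk and compares the estimates with the true values. The center start state and the +1 reward for reaching the right terminal square are assumptions consistent with the values listed above, not details stated on the slide.

```python
import random

# States A..E are indexed 1..5; 0 and 6 are the terminal squares.
TRUE_VALUES = {1: 1/6, 2: 2/6, 3: 3/6, 4: 4/6, 5: 5/6}

def td0_random_walk(num_episodes=1000, alpha=0.05, gamma=1.0):
    """TD(0) prediction on the 5-state random walk under the random policy."""
    V = {s: 0.0 for s in range(7)}  # terminal states keep value 0
    for _ in range(num_episodes):
        s = 3  # every episode starts in the center state C (assumed)
        while s not in (0, 6):
            s_next = s + random.choice((-1, 1))   # move left or right at random
            r = 1.0 if s_next == 6 else 0.0       # +1 only for reaching the right square
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return {s: V[s] for s in TRUE_VALUES}

if __name__ == "__main__":
    estimates = td0_random_walk()
    for s, label in zip(range(1, 6), "ABCDE"):
        print(label, round(estimates[s], 3), "true:", round(TRUE_VALUES[s], 3))
```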

10 Batch Updating
–Motivation: when the amount of available experience is limited, an alternative in incremental learning is to present that experience repeatedly until the value estimates converge.
–The name batch updating comes from the fact that increments are computed for every step of the batch, but the value estimates are changed only once, after the entire batch of data has been processed.
–Comparing constant-α MC and TD methods on the random walk experiment under batch updating, TD methods are observed to converge faster.
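A rough Python sketch of the batch-updating idea (a sketch under assumed data structures, not the slides' code): increments are computed for every transition in the stored batch, applied only once per sweep, and the sweeps repeat until the value function stops changing.

```python
def batch_td0(episodes, alpha=0.01, gamma=1.0, tol=1e-6, max_sweeps=10000):
    """Batch TD(0) over a fixed, stored set of episodes (a sketch).

    `episodes` is assumed to be a list of trajectories, each a list of
    (state, reward, next_state, done) transitions.
    """
    V = {}
    for _ in range(max_sweeps):
        increments = {}
        for episode in episodes:
            for s, r, s_next, done in episode:
                target = r + gamma * (0.0 if done else V.get(s_next, 0.0))
                delta = alpha * (target - V.get(s, 0.0))
                increments[s] = increments.get(s, 0.0) + delta
        if not increments:
            break
        # apply all increments at once, only after the whole batch was processed
        for s, delta in increments.items():
            V[s] = V.get(s, 0.0) + delta
        if max(abs(d) for d in increments.values()) < tol:
            break
    return V
```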

11 SARSA: On-Policy TD Control
GPI using TD methods follows two approaches:
–On-policy
–Off-policy
Sarsa considers transitions between state-action pairs and learns action-value functions.
Value update:
–Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]

12 Algorithm: SARSA
Initialize Q(s,a) arbitrarily
Repeat for each episode
–Initialize s
–Choose a from s using a policy derived from Q (e.g., ε-greedy)
–Repeat for each step of episode
 Take action a; observe reward r and next state s'
 Choose a' from s' using a policy derived from Q (e.g., ε-greedy)
 Q(s,a) ← Q(s,a) + α [ r + γ Q(s',a') − Q(s,a) ]
 s ← s'; a ← a'
–Until s is terminal
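The following is a compact Python sketch of the Sarsa loop above; the environment interface and the ε-greedy helper are illustrative assumptions, not part of the original slides.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular Sarsa (on-policy TD control), sketched.

    Assumes a hypothetical env with reset() -> state and
    step(action) -> (next_state, reward, done).
    """
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, actions, epsilon)
            target = r + gamma * Q[(s_next, a_next)] * (not done)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q
```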

13 Q-Learning: Off-Policy TD Control
–The learned action-value function Q directly approximates the optimal action-value function Q*, independently of the behavior policy being followed. This simplifies the analysis of the algorithm.
–For convergence, the only requirement on the behavior policy is that it continue to visit, and thereby update, all state-action pairs.

14 Algorithm for Q-learning
Initialize Q(s,a) arbitrarily
Repeat for each episode
–Initialize s
–Repeat for each step of episode
 Choose a from s using a policy derived from Q (e.g., ε-greedy)
 Take action a; observe reward r and next state s'
 Q(s,a) ← Q(s,a) + α [ r + γ max_a' Q(s',a') − Q(s,a) ]
 s ← s'
–Until s is terminal
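And a matching Python sketch of the Q-learning loop, under the same assumed environment interface as in the Sarsa sketch; the only substantive change is that the target uses the maximum action value of the next state rather than the value of the action actually taken.

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular Q-learning (off-policy TD control), sketched.

    Same hypothetical env interface as in the Sarsa sketch:
    reset() -> state, step(action) -> (next_state, reward, done).
    """
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # target uses the greedy (max) action value of the next state
            best_next = max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next * (not done) - Q[(s, a)])
            s = s_next
    return Q
```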

15 References
[1] Sutton, R. S. and Barto, A. G., "Reinforcement Learning: An Introduction," MIT Press, 1998.