
Temporal Difference Learning By John Lenz

Reinforcement Learning
An agent interacts with its environment and receives a reward signal based on its previous action. The goal is to maximize the reward signal over time. Tasks may be epoch-based (episodic) or continuous. There is no expert teacher.

Value Function
The actual return, R_t, is the sum of all future rewards. The value function V^π(s) = E_π{ R_t | s_t = s } is the expected return starting from state s and following policy π. The state-action value function Q(s,a) is the expected return starting from state s, taking action a, and then following policy π.

Temporal Difference Learning
The constant-α Monte Carlo method updates the weights only when the actual return is known: ΔV(s) = α [R_t − V(s)]. This means waiting until the end of the epoch, when the actual return becomes available. Instead, TD learning estimates the actual return as the next reward plus the value of the next state.
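A minimal sketch of the two tabular updates just described, assuming a vector-indexed value table; the function and variable names are illustrative, not taken from the original implementation.

```cpp
#include <vector>

// Constant-alpha Monte Carlo: wait until the actual return R_t is known,
// then move V(s) toward it.
void monteCarloUpdate(std::vector<double>& V, int s, double actualReturn, double alpha) {
    V[s] += alpha * (actualReturn - V[s]);
}

// TD(0): estimate the return as the next reward plus the discounted value
// of the next state, so the update can happen at every step.
void td0Update(std::vector<double>& V, int s, double reward, int sNext,
               double alpha, double gamma) {
    double target = reward + gamma * V[sNext];
    V[s] += alpha * (target - V[s]);
}
```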

Forward view of TD(λ)
Estimate the actual return by the n-step return: R_t^(n) = r_{t+1} + γ r_{t+2} + … + γ^(n-1) r_{t+n} + γ^n V(s_{t+n}). TD(λ) uses a weighted average of n-step returns: R_t^λ = (1 − λ) Σ_n λ^(n-1) R_t^(n). With λ = 0 we only look at the next reward (one-step TD); with λ = 1 we recover constant-α Monte Carlo.
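A sketch of the forward view, assuming the episode has been recorded in full and the terminal state has value zero; `nStepReturn` and `lambdaReturn` are illustrative names, not part of the original code.

```cpp
#include <vector>
#include <cmath>

// Conventions assumed here: rewards[t+1..T] are the rewards of the episode
// (T = rewards.size() - 1), values[k] approximates V(s_k), and values[T] = 0
// for the terminal state.
double nStepReturn(const std::vector<double>& rewards,
                   const std::vector<double>& values,
                   int t, int n, double gamma) {
    double g = 0.0;
    for (int k = 1; k <= n; ++k)
        g += std::pow(gamma, k - 1) * rewards[t + k];
    return g + std::pow(gamma, n) * values[t + n];
}

// Lambda-return for step t: weighted average (1 - lambda) * sum_n lambda^(n-1) * R_t^(n),
// with the residual weight assigned to the full return to termination.
double lambdaReturn(const std::vector<double>& rewards,
                    const std::vector<double>& values,
                    int t, double gamma, double lambda) {
    int horizon = static_cast<int>(rewards.size()) - 1 - t;  // steps until termination
    double g = 0.0;
    for (int n = 1; n < horizon; ++n)
        g += (1.0 - lambda) * std::pow(lambda, n - 1)
             * nStepReturn(rewards, values, t, n, gamma);
    g += std::pow(lambda, horizon - 1) * nStepReturn(rewards, values, t, horizon, gamma);
    return g;
}
```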

Backward view of TD(λ)
Each state has an eligibility trace. The eligibility trace of the current state is incremented by one, and all traces are decayed by γλ each time step. Then, at every time step, the value function of every recently visited state is updated; the eligibility trace determines which states have been recently visited.

Backward view continued
e_t(s) = γλ e_{t-1}(s)       if s ≠ s_t
e_t(s) = γλ e_{t-1}(s) + 1   if s = s_t
δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t)
ΔV_t(s) = α δ_t e_t(s)       for all s
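These update rules translate directly into one backward-view TD(λ) step over a tabular value function; the sketch below assumes integer state indices, and all names are illustrative.

```cpp
#include <vector>

// One backward-view TD(lambda) step, following the update rules above.
void tdLambdaStep(std::vector<double>& V,      // V_t(s) for each state
                  std::vector<double>& e,      // eligibility trace e_t(s)
                  int s, double reward, int sNext,
                  double alpha, double gamma, double lambda) {
    // Decay every trace by gamma*lambda, then bump the current state's trace.
    for (double& trace : e) trace *= gamma * lambda;
    e[s] += 1.0;

    // One-step TD error.
    double delta = reward + gamma * V[sNext] - V[s];

    // Update every state in proportion to its eligibility.
    for (std::size_t i = 0; i < V.size(); ++i)
        V[i] += alpha * delta * e[i];
}
```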

Control Problem: Sarsa(λ)
Use an ε-greedy policy based on Q(s,a) instead of V(s). Each state-action pair has its own eligibility trace e(s,a).
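A hedged sketch of Sarsa(λ) with an ε-greedy policy over a flat tabular Q(s,a); the class layout and names are assumptions, not the structure of the original code.

```cpp
#include <vector>
#include <random>

struct SarsaLambda {
    int numStates, numActions;
    double alpha, gamma, lambda, epsilon;
    std::vector<double> Q, e;   // flat tables of size numStates * numActions
    std::mt19937 rng{42};

    SarsaLambda(int S, int A, double alpha_, double gamma_, double lambda_, double eps)
        : numStates(S), numActions(A), alpha(alpha_), gamma(gamma_),
          lambda(lambda_), epsilon(eps), Q(S * A, 0.0), e(S * A, 0.0) {}

    int idx(int s, int a) const { return s * numActions + a; }

    // Epsilon-greedy action selection over Q(s, .).
    int selectAction(int s) {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        if (u(rng) < epsilon) {
            std::uniform_int_distribution<int> d(0, numActions - 1);
            return d(rng);                 // explore
        }
        int best = 0;
        for (int a = 1; a < numActions; ++a)
            if (Q[idx(s, a)] > Q[idx(s, best)]) best = a;
        return best;                       // exploit
    }

    // One Sarsa(lambda) update for the transition (s, a) -> (sNext, aNext).
    void step(int s, int a, double reward, int sNext, int aNext) {
        double delta = reward + gamma * Q[idx(sNext, aNext)] - Q[idx(s, a)];
        for (double& t : e) t *= gamma * lambda;   // decay all traces
        e[idx(s, a)] += 1.0;                       // accumulate for (s, a)
        for (std::size_t i = 0; i < Q.size(); ++i)
            Q[i] += alpha * delta * e[i];
    }
};
```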

Function Approximation
Until now we assumed V(s) and Q(s,a) were implemented as huge tables. Instead, the value function V(s) and the state-action value function Q(s,a) can be approximated with any supervised learning method: radial basis function networks, support vector machines, artificial neural networks, clustering, etc.
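The talk uses a neural network; as a minimal illustration of the idea, the sketch below uses a linear approximator V(s) = w·φ(s) with a semi-gradient TD(0) update, where for a linear model the gradient of V with respect to each weight is simply the corresponding feature. Names are illustrative.

```cpp
#include <vector>

// Semi-gradient TD(0) with a linear value function V(s) = w . phi(s).
void linearTd0Update(std::vector<double>& w,
                     const std::vector<double>& phiS,      // features of s
                     const std::vector<double>& phiSNext,  // features of s'
                     double reward, double alpha, double gamma) {
    double v = 0.0, vNext = 0.0;
    for (std::size_t i = 0; i < w.size(); ++i) {
        v     += w[i] * phiS[i];
        vNext += w[i] * phiSNext[i];
    }
    double delta = reward + gamma * vNext - v;
    for (std::size_t i = 0; i < w.size(); ++i)
        w[i] += alpha * delta * phiS[i];   // gradient of V w.r.t. w_i is phi_i(s)
}
```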

My Implementation
Written in C++. Uses a two-layer feed-forward artificial neural network for function approximation. The agent, which implements the learning algorithm, is built on top of the neural network and is independent of the environment.
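One way to keep the agent independent of the environment is to route all interaction through a small abstract interface; the sketch below is an assumed design, not the original class hierarchy.

```cpp
#include <vector>

// The environment exposes only states, rewards, and the number of actions.
struct Environment {
    virtual ~Environment() = default;
    virtual std::vector<double> reset() = 0;                  // initial state features
    virtual bool step(int action, std::vector<double>& nextState,
                      double& reward) = 0;                    // returns true if episode ended
    virtual int numActions() const = 0;
};

// The agent only sees feature vectors and rewards, never environment internals.
struct Agent {
    virtual ~Agent() = default;
    virtual int selectAction(const std::vector<double>& state) = 0;
    virtual void observe(const std::vector<double>& state, int action,
                         double reward, const std::vector<double>& nextState,
                         bool terminal) = 0;                  // learning update
};
```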

Hill Climbing Problem
The goal is to reach the top of the hill, but the car cannot accelerate hard enough to drive straight up. It must first move away from the goal to build up momentum.
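This is the classic mountain-car task from Sutton and Barto; the exact dynamics used in this implementation are not given, but the standard formulation looks like the following sketch.

```cpp
#include <cmath>

// Standard mountain-car dynamics (action in {-1, 0, +1}); the talk's own
// environment may differ in its constants and bounds.
void mountainCarStep(double& position, double& velocity, int action) {
    velocity += 0.001 * action - 0.0025 * std::cos(3.0 * position);
    if (velocity < -0.07) velocity = -0.07;
    if (velocity >  0.07) velocity =  0.07;
    position += velocity;
    if (position < -1.2) { position = -1.2; velocity = 0.0; }
    // The episode ends once position >= 0.5; reward is -1 per step until then.
}
```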

Hill Climbing Results
The initial run took 65,873 steps, but by the ninth epoch the agent reached the goal in 186 steps.

Games
N-in-a-row games like tic-tac-toe and board games like checkers, backgammon, and chess. Use the after-state value function to select moves. TD-Gammon plays at the level of the best human players and can learn through self-play.
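A sketch of after-state move selection: evaluate the position that would result from each legal move and pick the highest-valued one. Board, Move, and the helper functions are assumed placeholders, not the original code.

```cpp
#include <vector>
#include <limits>

// Pick the move whose resulting after-state has the highest estimated value.
// Assumes legalMoves is non-empty.
template <typename Board, typename Move>
Move selectMove(const Board& board,
                const std::vector<Move>& legalMoves,
                double (*valueOf)(const Board&),
                Board (*applyMove)(const Board&, const Move&)) {
    Move best = legalMoves.front();
    double bestValue = -std::numeric_limits<double>::infinity();
    for (const Move& m : legalMoves) {
        double v = valueOf(applyMove(board, m));   // value of the after-state
        if (v > bestValue) { bestValue = v; best = m; }
    }
    return best;
}
```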

Implementation
I implemented tic-tac-toe and checkers. After around 30,000 games of self-play, the tic-tac-toe program learned to play a decent game. Checkers was less successful: even after 400,000 self-play games the agent could not beat a traditional AI.