
Reinforcement Learning Generalization and Function Approximation Subramanian Ramamoorthy School of Informatics 28 February, 2012

Function Approximation in RL: Why?
With a tabular V^π, what should we do when the state space is very large and the table becomes computationally expensive? Represent V^π as a surface; use function approximation.

Generalization
... how experience with a small part of the state space is used to produce good behaviour over a large part of the state space.

Rich Toolbox
Neural networks, decision trees, multivariate regression, ...
Use of statistical clustering for state aggregation.
You can swap in/out your favourite function approximation method as long as it can deal with:
– learning while interacting, online
– non-stationarity induced by policy changes
So, combining gradient descent methods with RL requires care.
Generalization methods may be used to approximate states, actions, value functions, Q-value functions, and policies.

Value Prediction with FA
As usual, policy evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.

Adapt Supervised Learning Algorithms
A parameterized function maps inputs to outputs. Training info = desired (target) outputs; error = (target output – actual output); training example = {input, target output}.

Backups as Training Examples
A backup can be treated as a training example: the input is a description of s_t and the target output is r_{t+1} + γ V(s_{t+1}).

Feature Vectors
A feature vector gives you a summary of a state, e.g., state → weighted linear combination of features.

Linear Methods
Represent the value function as a linear combination of features: V_θ(s) = θ⊤φ(s), where φ(s) is the feature vector of state s.
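As a sketch, the linear form in code (the names theta and phi_s are illustrative, not from the lecture):

import numpy as np

def linear_value(theta, phi_s):
    # V_theta(s) = theta^T phi(s): the approximate value is a weighted sum of features.
    return np.dot(theta, phi_s)

# e.g., with three features active to different degrees:
theta = np.array([0.5, -0.2, 1.0])
phi_s = np.array([1.0, 0.0, 2.0])
v = linear_value(theta, phi_s)   # 0.5*1.0 + (-0.2)*0.0 + 1.0*2.0 = 2.5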

What are we learning? From what?

Performance Measures
Many are applicable, but a common and simple one is the mean-squared error (MSE) over a distribution P.
Why P? Why minimize MSE? Let us assume that P is always the distribution of states at which backups are done. The on-policy distribution is the distribution created while following the policy being evaluated; stronger results are available for this distribution.
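In the standard form, the objective is:

MSE(\theta) = \sum_{s \in S} P(s)\, \big[ V^{\pi}(s) - V_{\theta}(s) \big]^{2}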

RL with FA – Basic Setup
For the policy evaluation (i.e., value prediction) problem, we are minimizing the MSE between the approximated 'surface' V_θ and the values obtained, e.g., by backup updates. This can be solved by gradient descent, updating the parameter vector θ that specifies the function V, provided the target is an unbiased estimator of V^π.

Gradient Descent
Iteratively move down the gradient of the error with respect to the parameter vector θ.

Gradient Descent
Apply this to the MSE given earlier, expanding the gradient with the chain rule.

Gradient Descent
Use just the sample gradient instead: since each sample gradient is an unbiased estimate of the true gradient, this converges to a local minimum of the MSE if the step size α decreases appropriately with t.
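Written out, following the standard textbook derivation (with v_t denoting the training target that stands in for the unknown V^π(s_t)):

\theta_{t+1} = \theta_t - \tfrac{1}{2}\, \alpha\, \nabla_{\theta} \big[ V^{\pi}(s_t) - V_{\theta_t}(s_t) \big]^{2}
            = \theta_t + \alpha\, \big[ V^{\pi}(s_t) - V_{\theta_t}(s_t) \big]\, \nabla_{\theta} V_{\theta_t}(s_t)
            \approx \theta_t + \alpha\, \big[ v_t - V_{\theta_t}(s_t) \big]\, \nabla_{\theta} V_{\theta_t}(s_t)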

But We Don't Have These Targets

RL with FA, TD Style
In practice, how do we obtain the target v_t?
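The usual answer (the standard TD construction, stated here for completeness) is to bootstrap:

v_t = r_{t+1} + \gamma\, V_{\theta_t}(s_{t+1}) \quad \text{(TD(0))}, \qquad v_t = R_t^{\lambda} \quad \text{(the } \lambda\text{-return, for TD(}\lambda\text{))}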

Nice Properties of Linear FA Methods
The gradient is very simple: ∇_θ V_θ(s) = φ(s).
For MSE, the error surface is simple: a quadratic surface with a single minimum.
Linear gradient-descent TD(λ) converges if:
– the step size decreases appropriately
– sampling is on-line (states sampled from the on-policy distribution)
It converges to a parameter vector whose error is bounded in terms of the best parameter vector.
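The convergence result referred to above is the Tsitsiklis and Van Roy bound relating the asymptotic solution θ_∞ to the best parameter vector θ*:

MSE(\theta_{\infty}) \le \frac{1 - \gamma\lambda}{1 - \gamma}\, MSE(\theta^{*})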

On-Line Gradient-Descent TD(λ)
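A minimal sketch of linear gradient-descent TD(λ) with accumulating traces; env, policy, and the feature map phi are assumed interfaces with the obvious semantics, and the hyperparameter values are illustrative:

import numpy as np

def td_lambda_linear(env, policy, phi, n_features,
                     alpha=0.01, gamma=0.99, lam=0.8, n_episodes=100):
    # Linear gradient-descent TD(lambda) for policy evaluation, accumulating traces.
    theta = np.zeros(n_features)
    for _ in range(n_episodes):
        s = env.reset()
        e = np.zeros(n_features)              # eligibility trace, one per parameter
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            v = phi(s) @ theta
            v_next = 0.0 if done else phi(s_next) @ theta
            delta = r + gamma * v_next - v     # TD error
            e = gamma * lam * e + phi(s)       # accumulate trace (grad of V_theta(s) is phi(s))
            theta += alpha * delta * e         # semi-gradient update
            s = s_next
    return theta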

On Basis Functions: Coarse Coding

Shaping Generalization in Coarse Coding

Learning and Coarse Coding

Tile Coding
There is a binary feature for each tile.
The number of features present at any one time is constant.
Binary features mean the weighted sum is easy to compute.
It is easy to compute the indices of the features present.
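A minimal 2-D tile-coding sketch: each tiling is a uniformly offset grid, and each tile contributes one binary feature. The grid sizes and offsets are illustrative choices, not values from the lecture:

import numpy as np

def tile_features(x, y, n_tilings=8, tiles_per_dim=8, low=0.0, high=1.0):
    # Returns a binary feature vector with exactly n_tilings active features,
    # one per tiling (each tiling is the same grid shifted by a small offset).
    n_features = n_tilings * tiles_per_dim * tiles_per_dim
    active = np.zeros(n_features)
    scale = tiles_per_dim / (high - low)
    for t in range(n_tilings):
        offset = t / n_tilings / scale
        ix = int((x - low + offset) * scale) % tiles_per_dim
        iy = int((y - low + offset) * scale) % tiles_per_dim
        active[t * tiles_per_dim * tiles_per_dim + ix * tiles_per_dim + iy] = 1.0
    return active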

Tile Coding, Contd.
Irregular tilings; hashing.

Radial Basis Functions (RBFs), e.g., Gaussians
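A sketch of Gaussian RBF features, φ_i(s) = exp(−‖s − c_i‖² / (2σ²)); the centres and width below are illustrative:

import numpy as np

def rbf_features(s, centers, sigma=0.1):
    # centers is an (n, d) array of prototype states, s is a length-d state vector.
    dists = np.linalg.norm(centers - s, axis=1)
    return np.exp(-dists ** 2 / (2 * sigma ** 2))

# e.g., three prototypes on the diagonal of the unit square:
centers = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]])
phi = rbf_features(np.array([0.4, 0.6]), centers)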

Beating the “Curse of Dimensionality”
Can you keep the number of features from going up exponentially with the dimension? Function complexity, not dimensionality, is the problem.
Kanerva coding:
– select a set of binary prototypes
– use Hamming distance as the distance measure
– dimensionality is no longer a problem, only complexity
“Lazy learning” schemes:
– remember all the data
– to get a new value, find the nearest neighbours and interpolate
– e.g., (nonparametric) locally weighted regression

Going from Value Prediction to GPI
So far, we have only discussed policy evaluation, where the value function is represented as an approximated function. To extend this to a GPI-like setting:
1. First, we need to use action-value functions.
2. Combine that with the policy improvement and action selection steps.
3. For exploration, we need to think about on-policy vs. off-policy methods.

Gradient-Descent Update for Action-Value Function Prediction
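In the standard form, mirroring the state-value case:

\theta_{t+1} = \theta_t + \alpha\, \big[ v_t - Q_{\theta_t}(s_t, a_t) \big]\, \nabla_{\theta} Q_{\theta_t}(s_t, a_t), \qquad \text{e.g. } v_t = r_{t+1} + \gamma\, Q_{\theta_t}(s_{t+1}, a_{t+1})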

How to Plug in Policy Improvement or Action Selection?
If the spaces are very large, or continuous, this is an active research topic and there are no conclusive answers.
For small-ish discrete spaces:
– For each action a available at a state s_t, compute Q_t(s_t, a) and find the greedy action according to it (see the sketch after this list).
– Then, one could use this as part of an ε-greedy action selection or as the estimation policy in off-policy methods.
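A minimal ε-greedy selection sketch over a vector of action values Q_t(s_t, ·); the function name and defaults are illustrative:

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1, rng=None):
    # Pick the greedy action with probability 1 - epsilon, otherwise a random action.
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))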

On Eligibility Traces with Function Approximation
The formulation of control with function approximation, so far, mainly assumes accumulating traces.
Replacing traces have advantages over this; however, they do not directly extend to the function approximation case:
– Notice that we now maintain a column vector of eligibility traces, one trace for each component of the parameter vector.
– So we cannot update separately for a state (as in tabular methods).
A practical way around this is to apply the replacing-traces procedure to features rather than states (see the sketch below).
– One could also utilize an optional clearing procedure over features.
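One plausible rendering of the feature-level replacing-traces update, assuming binary features (a sketch, not code from the lecture):

import numpy as np

def replace_traces(e, phi, gamma, lam):
    # Decay all traces, then reset the trace of every currently active feature to 1
    # (contrast with accumulating traces: e = gamma * lam * e + phi).
    e = gamma * lam * e
    e[phi > 0] = 1.0
    return e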

On-Policy Control with Function Approximation: SARSA(λ)
Let's discuss it using a grid world...
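A minimal linear SARSA(λ) sketch with ε-greedy action selection and replacing traces over features; the env and the state-action feature map phi_sa(s, a) are assumed interfaces, and all parameter values are illustrative:

import numpy as np

def sarsa_lambda_linear(env, phi_sa, n_features, n_actions,
                        alpha=0.05, gamma=0.99, lam=0.9, epsilon=0.1, n_episodes=200):
    rng = np.random.default_rng()
    theta = np.zeros(n_features)

    def q(s, a):
        return phi_sa(s, a) @ theta

    def select(s):
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax([q(s, a) for a in range(n_actions)]))

    for _ in range(n_episodes):
        e = np.zeros(n_features)
        s = env.reset()
        a = select(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            delta = r - q(s, a)                # start of the TD error
            phi = phi_sa(s, a)
            e = gamma * lam * e
            e[phi > 0] = 1.0                   # replacing traces over (binary) features
            if not done:
                a_next = select(s_next)
                delta += gamma * q(s_next, a_next)
                s, a = s_next, a_next
            theta += alpha * delta * e         # semi-gradient SARSA(lambda) update
    return theta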

Off-Policy Control with Function Approximation: Q-learning
Let's discuss it using the same grid world...
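For reference, the corresponding off-policy (Q-learning) update in the standard form, which bootstraps from the greedy action:

\theta_{t+1} = \theta_t + \alpha\, \big[ r_{t+1} + \gamma \max_{a'} Q_{\theta_t}(s_{t+1}, a') - Q_{\theta_t}(s_t, a_t) \big]\, \nabla_{\theta} Q_{\theta_t}(s_t, a_t)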

Example: Mountain-Car Task
Drive an underpowered car up a steep mountain road.
Gravity is stronger than the engine (as in the cart-pole example).
It is an example of a continuous control task where the system must first move away from the goal, then converge to it.
Reward of −1 on every step until the car 'escapes'.
Actions: forward throttle (+), reverse throttle (−), zero throttle (0).
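As a concrete sketch, the bounded position/velocity update commonly used for this task (the constants follow Sutton and Barto's formulation of mountain car and should be treated as assumptions of that formulation):

import math

def mountain_car_step(position, velocity, action):
    # One step of the mountain-car dynamics; action is -1, 0, or +1.
    velocity += 0.001 * action - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))
    position += velocity
    if position < -1.2:                 # left wall: inelastic collision
        position, velocity = -1.2, 0.0
    done = position >= 0.5              # goal at the top of the hill
    reward = -1.0
    return position, velocity, reward, done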

Mountain-Car Example: Cost-to-Go Function (SARSA(λ) solution)

Mountain-Car Solution with RBFs [computed by M. Kretchmar]

Parameter Choices: Mountain-Car Example

Off-Policy Bootstrapping
All methods except MC make use of an existing value estimate in order to update the value function.
If the value function is being separately approximated, what is the combined effect of the two iterative processes?
As we already saw, value prediction with linear gradient-descent function approximation converges to a sub-optimal point (as measured by MSE).
– The further λ strays from 1, the greater the deviation from optimality.
The real issue is that V is learned from an on-policy distribution.
– Off-policy bootstrapping with function approximation can lead to divergence (MSE tending to infinity).

Some Issues with Function Approximation: Baird's Counterexample
Reward = 0 on all transitions; true V^π(s) = 0 for all s. Parameter updates as below.

Baird's Counterexample
If we back up according to a uniform off-policy distribution (as opposed to the on-policy distribution), the estimates diverge for some initial parameters!
– Similar counterexamples exist for Q-learning as well...

Another Example
Baird's counterexample has a simple fix: instead of taking small steps towards expected one-step returns, change the value function to the best least-squares approximation. However, even this is not a general rule!
Reward = 0 on all transitions; true V^π(s) = 0 for all s.

Tsitsiklis and Van Roy's Counterexample
Some function approximation methods that do not extrapolate, such as nearest neighbours or locally weighted regression, can avoid this.
– Another solution is to change the objective (e.g., minimize the one-step expected error).

Some Case Studies
Positive examples (instead of just counterexamples)...

TD-Gammon (Tesauro 1992, 1994, 1995, ...)
White has just rolled a 5 and a 2, so he can move one of his pieces 5 steps and one (possibly the same piece) 2 steps.
The objective is to advance all pieces to the final points and then off the board; hitting and doubling are further elements of the game.
30 pieces and 24 locations imply an enormous number of configurations, with an effective branching factor of about 400.

A Few Details
Reward: 0 at all times except when the game is won, when it is 1.
Episodic (game = episode), undiscounted.
Gradient-descent TD(λ) with a multi-layer neural network:
– weights initialized to small random numbers
– backpropagation of the TD error
– four input units for each point; unary encoding of the number of white pieces, plus other features
Use of afterstates; learning during self-play.

Multi-layer Neural Network

Summary of TD-Gammon Results

Samuel's Checkers Player (Arthur Samuel, 1959, 1967)
Score board configurations by a “scoring polynomial” (after Shannon, 1950).
Minimax to determine the “backed-up score” of a position; alpha-beta cutoffs.
Rote learning: save each board configuration encountered together with its backed-up score.
– This needed a “sense of direction”: like discounting.
Learning by generalization: similar to the TD algorithm.

Samuel's Backups

The Basic Idea
“... we are attempting to make the score, calculated for the current board position, look like that calculated for the terminal board positions of the chain of moves which most probably occur during actual play.”
– A. L. Samuel, Some Studies in Machine Learning Using the Game of Checkers, 1959