1 Off-Policy Temporal-Difference Learning with Function Approximation. Doina Precup (McGill University), Rich Sutton and Sanjoy Dasgupta (AT&T Labs)

2 Off-policy Learning
Learning about a way of behaving without behaving (exactly) that way.
The target policy must be covered by the source (behavior) policy: every action the target policy might take must have some chance of being tried by the behavior policy.
E.g., Q-learning learns about the greedy policy while following something more exploratory.
It also lets us learn about many macro-action policies at once.
We need off-policy learning!

3 RL Algorithm Space
[Diagram: three desired properties, TD, linear FA, and off-policy learning. Linear TD(λ) combines TD with linear FA and is stable (Tsitsiklis & Van Roy 1997; Tadic 2000); Q-learning and options combine TD with off-policy learning; combining all three has led to divergence ("Boom!", Baird 1995). Other references on the slide: Gordon 1995, NDP 1996.]
We need all 3, but we can only get 2 at a time.

4 Baird’s Counterexample
Markov chain (no actions).
All states are updated equally often, synchronously.
An exact solution exists: θ = 0.
Initial θ_0 = (1, 1, 1, 1, 1, 10, 1)ᵀ.
[Diagram of the Markov chain.]

5 Importance Sampling
Re-weighting samples according to their "importance," correcting for a difference in sampling distribution.
For example, an episode $s_0, a_0, s_1, a_1, \ldots, s_T$ has probability under the target policy π equal to $\prod_{t=0}^{T-1} \pi(s_t, a_t)\, p(s_{t+1} \mid s_t, a_t)$, so its importance weight relative to the behavior policy b is $\rho = \prod_{t=0}^{T-1} \frac{\pi(s_t, a_t)}{b(s_t, a_t)}$ (the transition probabilities cancel in the ratio).
This corrects for oversampling under the behavior policy.
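
For concreteness, a minimal Python sketch of the episode weight, assuming hypothetical functions `target_prob(s, a)` and `behavior_prob(s, a)` that return each policy's action probability; since the transition probabilities cancel, only the policies are needed:

```python
def episode_importance_weight(episode, target_prob, behavior_prob):
    """Importance weight of one episode: the product over time steps of
    pi(s_t, a_t) / b(s_t, a_t).  `episode` is a list of (state, action)
    pairs; `target_prob` and `behavior_prob` are hypothetical helpers
    returning action probabilities under the two policies."""
    rho = 1.0
    for s, a in episode:
        rho *= target_prob(s, a) / behavior_prob(s, a)
    return rho
```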

6 Naïve Importance Sampling Algorithm
$\Delta\theta_t = \rho \cdot (\text{regular linear TD}(\lambda)\ \text{update}_t)$, where ρ is the importance weight of the whole episode.
This converts the off-policy problem into an on-policy one in expectation, so the on-policy convergence theorem applies (Tsitsiklis & Van Roy, 1997; Tadic, 2000).
But the variance is high and convergence is very slow. We can do better!
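
A minimal sketch of this naive scheme, under one reading of the slide in which the episode's accumulated linear TD(λ) update is scaled by the single full-episode weight ρ; `phi`, `target_prob`, and `behavior_prob` are hypothetical helpers as above:

```python
import numpy as np

def naive_is_td_lambda(theta, episode, phi, target_prob, behavior_prob,
                       alpha=0.1, gamma=1.0, lam=0.9):
    """Accumulate the ordinary linear TD(lambda) update over one episode,
    then weight the whole batch by the episode's importance weight rho.
    `theta` is a float NumPy vector; `episode` is a list of
    (s, a, r, s_next, a_next) tuples whose last transition ends the
    episode (terminal value taken to be zero)."""
    e = np.zeros_like(theta)            # eligibility trace
    delta_theta = np.zeros_like(theta)  # accumulated (unweighted) update
    rho = 1.0                           # full-episode importance weight
    for i, (s, a, r, s_next, a_next) in enumerate(episode):
        rho *= target_prob(s, a) / behavior_prob(s, a)
        x = phi(s, a)
        q = theta @ x
        q_next = 0.0 if i == len(episode) - 1 else theta @ phi(s_next, a_next)
        delta = r + gamma * q_next - q  # TD error
        e = gamma * lam * e + x
        delta_theta += alpha * delta * e
    return theta + rho * delta_theta    # one importance-weighted batch update
```

The source of the high variance is visible here: ρ is a product of many per-step ratios, so a single episode weight can be enormous or vanishingly small.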

7 Linear Function Approximation
Approximate the action-value function $Q^\pi(s,a)$ as a linear form, $Q_\theta(s,a) = \theta^\top \phi(s,a)$, where $\phi(s,a)$ is a feature vector representing the pair (s, a) and θ is the modifiable parameter vector.
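
In code this is just a dot product; a tiny sketch, assuming a hypothetical feature function `phi(s, a)` that returns a NumPy vector:

```python
import numpy as np

def q_value(theta, phi, s, a):
    """Linear action-value estimate: Q_theta(s, a) = theta^T phi(s, a),
    where theta and phi(s, a) are NumPy vectors of the same length."""
    return float(theta @ phi(s, a))

# Example with a hypothetical feature map that one-hot encodes the action:
theta = np.array([0.5, -0.2, 0.0])
phi = lambda s, a: np.eye(3)[a]               # ignores s; purely illustrative
print(q_value(theta, phi, s="anywhere", a=1))  # -> -0.2
```

Its gradient with respect to θ is simply φ(s, a), which is what the eligibility traces in the TD(λ) updates accumulate.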

8 Per-Decision Importance-Sampled TD(λ): The New Algorithm!
Updating after each episode: the update has the same form as the linear TD(λ) episode update, but with per-decision importance ratios $\rho(s_{t+1}, a_{t+1}) = \frac{\pi(s_{t+1}, a_{t+1})}{b(s_{t+1}, a_{t+1})}$ inserted, so that each term is corrected only for the decisions it actually depends on, rather than by the single full-episode weight (see the paper for general λ). The per-decision idea is sketched below.
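
To illustrate what "per-decision" means, here is a hedged sketch of the per-decision importance-sampled return for an episode's first state-action pair; this illustrates the general per-decision idea rather than the paper's exact TD(λ) update. Each reward is weighted only by the ratios of the decisions that preceded it, not by the full-episode product:

```python
def per_decision_is_return(rewards, rhos, gamma=1.0):
    """Per-decision importance-sampled return for estimating Q^pi(s_0, a_0)
    from one behavior-policy episode.

    rewards[k] is the reward r_{k+1} received after the (k+1)-th action.
    rhos[k] is pi(s_k, a_k) / b(s_k, a_k); rhos[0] is unused because the
    first action a_0 is the one being evaluated and needs no correction."""
    g, weight = 0.0, 1.0
    for k, r in enumerate(rewards):
        if k > 0:
            weight *= rhos[k]          # correction for the action a_k
        g += (gamma ** k) * weight * r  # reward weighted by preceding ratios only
    return g
```

The paper builds this weighting into the TD(λ) update itself, which is where the ρ(s_{t+1}, a_{t+1}) factors on the slide come from.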

9 Main Result
The expected total change in θ over an episode under the new algorithm (following the behavior policy) equals the expected total change under conventional TD(λ) (following the target policy); see the identity below.
Convergence Theorem (based on Tsitsiklis & Van Roy 1997): under the usual assumptions, plus one annoying assumption (e.g., bounded episode length), the new algorithm converges to the same θ_∞ as on-policy TD(λ).
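
The displayed equation on the slide did not survive the transcript; a plausible rendering of the expectation-matching claim, with $E_b$ denoting expectation over episodes generated by the behavior policy and $E_\pi$ over episodes generated by the target policy, is

$$ E_b\!\left[\Delta\theta_{\text{episode}}^{\text{new algorithm}}\right] \;=\; E_\pi\!\left[\Delta\theta_{\text{episode}}^{\mathrm{TD}(\lambda)}\right]. $$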

10 The variance assumption is restrictive
But it can often be satisfied with "artificial" terminations.
Consider a modified MDP with bounded episode length:
–We have data for this MDP.
–Our result assures good convergence for it.
–Its solution can be made close to the solution of the original problem by choosing the episode bound to be long relative to the discounting (γ) or the mixing time.
Consider the application to macro-actions:
–Here it is the macro-action that terminates; the termination is artificial and the real process is unaffected.
–Yet all the results apply directly to learning about macro-actions.
–We can choose the macro-action termination to satisfy the variance condition.

11 Empirical Illustration
The agent always starts at S; terminal states are marked G; actions are deterministic.
The behavior policy chooses up and down with probabilities 0.4 and 0.1; the target policy chooses up and down with probabilities 0.1 and 0.4.
If the algorithm is successful, it should give positive weight to the rightmost feature and negative weight to the leftmost one.

12 Trajectories of Two Components of θ
λ = 0.9, α decreased; θ appears to converge as advertised.
[Figure: the components θ_{leftmost,down} and θ_{rightmost,down} plotted over 500,000 episodes, together with their asymptotic values θ*_{leftmost,down} and θ*_{rightmost,down}.]

13 Comparison of Naïve and PD IS Algorithms
[Figure: root mean squared error as a function of log_2 α for the Naive IS and Per-Decision IS algorithms; λ = 0.9, α held constant; measured after 100,000 episodes, averaged over 50 runs.]
Precup, Sutton & Dasgupta, 2001

14 Can Weighted IS help the variance?
Return to the tabular case and consider two estimators of the action value, where R_i is the i-th return observed following s, a and w_i is the corresponding IS correction product:
–The ordinary IS estimate, $\hat{Q}(s,a) = \frac{1}{n}\sum_{i=1}^{n} w_i R_i$, converges with finite variance iff the w_i have finite variance.
–The weighted IS estimate, $\hat{Q}(s,a) = \sum_{i=1}^{n} w_i R_i \,/\, \sum_{i=1}^{n} w_i$, converges with finite variance even if the w_i have infinite variance.
Can this be extended to the FA case?
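
A small Python sketch of the two estimators (generic tabular setting; the inputs are the observed returns R_i and their IS correction products w_i):

```python
import numpy as np

def ordinary_is_estimate(returns, weights):
    """Ordinary importance-sampling estimate: (1/n) * sum_i w_i * R_i.
    Unbiased, but its variance is finite only if the w_i have finite variance."""
    returns, weights = np.asarray(returns, float), np.asarray(weights, float)
    return float(np.mean(weights * returns))

def weighted_is_estimate(returns, weights):
    """Weighted importance-sampling estimate: sum_i w_i R_i / sum_i w_i.
    Biased for finite n, but bounded by the observed returns, so it behaves
    well even when the w_i have infinite variance."""
    returns, weights = np.asarray(returns, float), np.asarray(weights, float)
    return float(np.sum(weights * returns) / np.sum(weights))
```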

15 Restarting within an Episode
We can consider episodes to start at any time.
This alters the weighting of states,
–But we still converge,
–And to near the best answer (for the new weighting).

16 Incremental Implementation
At the start of each episode: […]
On each step: […]
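
The equations themselves are given in the paper; purely as a hedged sketch of the episode-start / per-step structure (not the paper's exact update rules, and with the placement of the importance ratios as an assumption), an incremental implementation might be organized like this:

```python
import numpy as np

class PerDecisionISTD:
    """Skeleton of an incremental per-decision importance-sampled linear
    TD(lambda) learner.  Sketch of the episode-start / per-step structure
    only; see the paper for the exact update equations."""

    def __init__(self, n_features, alpha=0.1, gamma=1.0, lam=0.9):
        self.theta = np.zeros(n_features)
        self.alpha, self.gamma, self.lam = alpha, gamma, lam
        self.e = np.zeros(n_features)        # eligibility trace
        self.dtheta = np.zeros(n_features)   # accumulated episode update

    def start_episode(self):
        # At the start of each episode: reset the trace and the accumulator.
        self.e[:] = 0.0
        self.dtheta[:] = 0.0

    def step(self, x, r, x_next, rho_next, terminal):
        # On each step: an ordinary linear TD(lambda) increment, with the
        # per-decision ratio rho_next = pi(s', a') / b(s', a') folded into
        # the trace before the next step uses it (assumed placement).
        q = self.theta @ x
        q_next = 0.0 if terminal else self.theta @ x_next
        delta = r + self.gamma * q_next - q
        self.e = self.gamma * self.lam * self.e + x
        self.dtheta += self.alpha * delta * self.e
        if not terminal:
            self.e *= rho_next

    def end_episode(self):
        # Updating after each episode, as on slide 8.
        self.theta += self.dtheta
```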

17 Conclusion
These are the first off-policy TD methods with linear FA.
–Certainly not the last.
–Somewhat greater efficiencies are undoubtedly possible.
But the problem is so important: can't we do better?
–Is there no other approach? Something other than importance sampling?
I can't think of a credible alternative approach. Perhaps experimentation in a nontrivial domain would suggest other possibilities...

