
1 Reinforcement Learning: Eligibility Traces. Speaker: 虞台文, Intelligent Multimedia Lab, Institute of Computer Science and Engineering, Tatung University

2 Content: n-step TD Prediction; Forward View of TD(λ); Backward View of TD(λ); Equivalence of the Forward and Backward Views; Sarsa(λ); Q(λ); Eligibility Traces for Actor-Critic Methods; Replacing Traces; Implementation Issues

3 Reinforcement Learning Eligibility Traces: n-Step TD Prediction

4 Elementary Methods: Dynamic Programming, Monte Carlo Methods, TD(0)

5 Monte Carlo vs. TD(0). Monte Carlo – observes the rewards of all steps in an episode. TD(0) – observes the reward of one step only.

6 n-Step TD Prediction. [Backup diagrams spanning the spectrum from TD (1-step) through 2-step, 3-step, …, n-step backups, up to Monte Carlo.]

7 n-Step TD Prediction: the corrected n-step truncated return.
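The return's formula is an image in the original; in the notation used throughout (Sutton and Barto's), the corrected n-step truncated return should be

R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} V_t(s_{t+n}),

i.e., the first n rewards, truncated at step n and "corrected" by the current value estimate of the state reached there.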

8 Backups Monte Carlo TD(0) n-step TD

9 n-Step TD Backup: online vs. offline updating. When updating offline, the new V(s) takes effect only in the next episode.
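A reconstruction of the two update modes the slide contrasts, using the standard n-step TD rule:

Online (applied during the episode, at each step):
V_{t+1}(s_t) = V_t(s_t) + \alpha \bigl[ R_t^{(n)} - V_t(s_t) \bigr]

Offline (increments accumulated during the episode and applied only at its end):
\Delta V_t(s_t) = \alpha \bigl[ R_t^{(n)} - V_t(s_t) \bigr], \qquad V(s) \leftarrow V(s) + \sum_t \Delta V_t(s)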

10 Error Reduction Property. In the worst case, the expected n-step return is a better estimate of the true value function than the current estimate V is, by a factor of γ^n (for both online and offline updating).
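Stated precisely (a reconstruction of the slide's inequality):

\max_s \bigl| \mathbb{E}_\pi\{ R_t^{(n)} \mid s_t = s \} - V^\pi(s) \bigr| \;\le\; \gamma^{n} \max_s \bigl| V(s) - V^\pi(s) \bigr|

The left-hand side is the maximum error using the n-step return; the right-hand side bounds it by \gamma^n times the maximum error of the current value estimate V.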

11 Example (Random Walk). Five non-terminal states A, B, C, D, E in a row; episodes start in the middle state; all rewards are 0 except a reward of 1 on the transition into the right terminal state. The true values are V(A) = 1/6, V(B) = 2/6, V(C) = 3/6, V(D) = 4/6, V(E) = 5/6. Consider 2-step TD, 3-step TD, …: which n is optimal?

12 Example (19-state Random Walk). [Figure: the 19-state random walk task and the average RMS error over the first 10 trials, comparing online and offline n-step TD for various n and step sizes.]

13 Exercise (Random Walk)

14 1. Evaluate the value function for the random policy. 2. Approximate the value function using n-step TD (try different n's and α's) and compare their performance (a sketch is given below). 3. Find the optimal policy.
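The exercise is left open in the slides; the following is a minimal Python sketch of point 2 (n-step TD prediction on a 19-state random walk). The reward scheme (−1/+1 at the two terminals), the parameter values, and all helper names are assumptions for illustration, not taken from the slides.

```python
import numpy as np

N_STATES = 19          # non-terminal states 1..19; terminals are 0 and 20
START = 10             # episodes start in the middle state

def run_episode():
    """Roll out one episode of the random walk: returns visited states and rewards."""
    states, rewards = [START], []
    s = START
    while True:
        s_next = s + np.random.choice([-1, 1])
        if s_next == 0:                      # left terminal (assumed reward -1)
            return states + [s_next], rewards + [-1.0]
        if s_next == N_STATES + 1:           # right terminal (assumed reward +1)
            return states + [s_next], rewards + [1.0]
        states.append(s_next); rewards.append(0.0)
        s = s_next

def n_step_td(n, alpha, gamma=1.0, episodes=10):
    """n-step TD prediction; updates are applied state by state within each episode."""
    V = np.zeros(N_STATES + 2)               # terminal values stay 0
    for _ in range(episodes):
        states, rewards = run_episode()
        T = len(rewards)
        for t in range(T):
            end = min(t + n, T)
            # corrected n-step truncated return
            G = sum(gamma**(k - t) * rewards[k] for k in range(t, end))
            if end < T:
                G += gamma**(end - t) * V[states[end]]
            V[states[t]] += alpha * (G - V[states[t]])
    return V

if __name__ == "__main__":
    true_values = np.arange(-9, 10) / 10.0   # true V under the assumed +-1 rewards
    V = n_step_td(n=4, alpha=0.1)
    print("RMS error:", np.sqrt(np.mean((V[1:-1] - true_values) ** 2)))
```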

15 Reinforcement Learning Eligibility Traces: The Forward View of TD(λ)

16 Averaging n-step Returns. We are not limited to a single n-step TD return; for example, we could take a weighted average of n-step returns, as long as the weights sum to 1. The result still constitutes one backup.
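The example equation is an image in the original; the textbook's example of such an average, presumably what the slide showed, is the half-and-half mixture

R_t^{avg} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)},

i.e., equal weights on the 2-step and 4-step returns.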

17 TD(λ), λ-Return. TD(λ) is a method for averaging all n-step backups, weighting each n-step return in proportion to λ^{n−1} (decaying with the time since visitation). The resulting average is called the λ-return, and the backup is made toward it. [Diagram: weights w_1, w_2, w_3, …, w_{T−t−1} on the successive n-step backups.]

18 TD(λ), λ-Return (continued). Each n-step backup receives weight (1−λ)λ^{n−1}; the final (Monte Carlo) backup receives the remaining weight λ^{T−t−1}.
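A reconstruction of the λ-return these two slides build up, in the book's notation:

R_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}

or, for an episodic task terminating at time T,

R_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t,

where R_t is the complete (Monte Carlo) return. The corresponding backup is \Delta V_t(s_t) = \alpha \bigl[ R_t^{\lambda} - V_t(s_t) \bigr].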

19 Forward View of TD(λ): a theoretical view.

20 TD(λ) on the Random Walk

21 Reinforcement Learning Eligibility Traces: The Backward View of TD(λ)

22 Why Backward View? The forward view is acausal and therefore not implementable. The backward view is causal and implementable, and in the offline case it achieves the same result as the forward view.

23 Eligibility Traces. Each state is associated with an additional memory variable, its eligibility trace, defined by the recursion below.

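A reconstruction of the accumulating-trace recursion the slides build up:

e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \\ \gamma\lambda\, e_{t-1}(s) & \text{otherwise} \end{cases}

so every state's trace decays by γλ at each step, and the trace of the state just visited is bumped up by 1.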

26 Eligibility ≈ Recency of Visiting. At any time, the traces record which states have recently been visited, where "recently" is defined in terms of γλ. The traces indicate the degree to which each state is eligible for undergoing learning changes should a reinforcing event occur. The reinforcing event here is the moment-by-moment 1-step TD error.

27 Reinforcing Event: the moment-by-moment 1-step TD error.
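The error itself is an image in the original; in the book's notation it is

\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t).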

28 TD(λ): eligibility traces, reinforcing events, and value updates.
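Combining the trace and the TD error defined above, the value update (reconstructed) is

V_{t+1}(s) = V_t(s) + \alpha\, \delta_t\, e_t(s) \quad \text{for all } s,

i.e., every state is nudged toward the TD target in proportion to its current eligibility.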

29 Online TD(λ)
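The slide's algorithm box is an image; below is a minimal tabular sketch of online TD(λ) with accumulating traces. The environment interface (reset/step under a fixed policy) and the parameter values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def online_td_lambda(env, n_states, alpha=0.1, gamma=1.0, lam=0.9, episodes=100):
    """Tabular online TD(lambda) prediction with accumulating traces.

    `env.reset() -> state` and `env.step(state) -> (next_state, reward, done)`
    advance one step under a fixed policy; these names are hypothetical.
    """
    V = np.zeros(n_states)
    for _ in range(episodes):
        e = np.zeros(n_states)            # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(s)
            # 1-step TD error (terminal states are taken to have value 0)
            target = r if done else r + gamma * V[s_next]
            delta = target - V[s]
            e *= gamma * lam              # decay all traces
            e[s] += 1.0                   # accumulating trace for the visited state
            V += alpha * delta * e        # every state updated by its trace
            s = s_next
    return V
```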

30 Backward View of TD(λ)

31 Backward View vs. MC and TD(0). Setting λ to 0, we get TD(0). Setting λ to 1, we get MC, but in a better way: TD(1) can be applied to continuing tasks, and it works incrementally and online instead of waiting until the end of the episode. How about 0 < λ < 1?

32 Reinforcement Learning Eligibility Traces: Equivalence of the Forward and Backward Views

33 Offline TD(λ)'s: the offline forward TD(λ) (λ-return) algorithm vs. offline backward TD(λ).
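A reconstruction of the two per-episode update rules being compared:

Offline forward (λ-return) algorithm:  \Delta V_t(s_t) = \alpha \bigl[ R_t^{\lambda} - V_t(s_t) \bigr]

Offline backward TD(λ):  \Delta V_t(s) = \alpha\, \delta_t\, e_t(s) \quad \text{for all } s

In both cases the increments are accumulated over the episode and applied to V only at its end.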

34 Forward View = Backward View: the sum of the backward updates over an episode equals the sum of the forward updates. See the proof slides near the end of this deck.

35 Forward View = Backward View: backward updates vs. forward updates.
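Written out for the offline case, the claim (reconstructed) is that for every state s

\sum_{t=0}^{T-1} \alpha\, \delta_t\, e_t(s) \;=\; \sum_{t=0}^{T-1} \alpha \bigl[ R_t^{\lambda} - V_t(s_t) \bigr]\, \mathbf{1}[s_t = s].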

36 TD(λ) on the Random Walk. [Figure: average RMS error over the first 10 trials, comparing the offline λ-return algorithm (forward view) with online TD(λ) (backward view) across a range of λ and step-size values.]

37 Reinforcement Learning Eligibility Traces: Sarsa(λ)

38 Sarsa(λ). TD(λ) uses eligibility traces for policy evaluation. How can eligibility traces be used for control? Learn Q_t(s, a) rather than V_t(s).

39 Sarsa(λ): eligibility traces, reinforcing events, and updates, now over state-action pairs.
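A reconstruction of the slide's equations, which simply replace states by state-action pairs:

e_t(s,a) = \gamma\lambda\, e_{t-1}(s,a) + \mathbf{1}[s = s_t, a = a_t]

\delta_t = r_{t+1} + \gamma\, Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)

Q_{t+1}(s,a) = Q_t(s,a) + \alpha\, \delta_t\, e_t(s,a) \quad \text{for all } s, a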

40 Sarsa(λ)
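The algorithm box on this slide is an image; below is a minimal tabular Sarsa(λ) sketch with ε-greedy action selection and accumulating traces. The environment interface and parameter values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def sarsa_lambda(env, n_states, n_actions, alpha=0.05, gamma=0.9,
                 lam=0.9, epsilon=0.05, episodes=500):
    """Tabular Sarsa(lambda) with accumulating traces and an epsilon-greedy policy.

    `env.reset() -> state` and `env.step(action) -> (next_state, reward, done)`
    are assumed, gym-style interfaces.
    """
    Q = np.zeros((n_states, n_actions))

    def epsilon_greedy(s):
        if np.random.rand() < epsilon:
            return np.random.randint(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        e = np.zeros_like(Q)                 # traces over state-action pairs
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(s_next)
            target = r if done else r + gamma * Q[s_next, a_next]
            delta = target - Q[s, a]
            e *= gamma * lam                 # decay all traces
            e[s, a] += 1.0                   # accumulate for the pair just visited
            Q += alpha * delta * e           # update every pair by its trace
            s, a = s_next, a_next
    return Q
```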

41 Sarsa(λ), Traces in a Grid World. After a single trial the agent already has much more information about how to get to the goal (though not necessarily the best way), which considerably accelerates learning.

42 Reinforcement Learning Eligibility Traces: Q(λ)

43 Q-Learning is an off-policy method: it breaks from the estimation policy from time to time to take exploratory actions, so a simple time-decayed trace cannot be applied directly. How can eligibility traces be combined with Q-learning? Three methods: Watkins's Q(λ), Peng's Q(λ), and naïve Q(λ).

44 Watkins's Q(λ). Behavior policy (e.g., ε-greedy) vs. estimation policy (e.g., greedy). [Diagram: a greedy path, a non-greedy path, and the first non-greedy action.]

45 Backups in Watkins's Q(λ). Two cases: 1. Both the behavior and estimation policies follow the greedy path. 2. The behavior policy takes a non-greedy action before the episode ends. How should the eligibility traces be defined?
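A reconstruction of Watkins's trace rule: traces decay as usual only while greedy actions are being taken, and are reset to zero at the first exploratory action.

e_t(s,a) = \mathbf{1}[s = s_t, a = a_t] +
\begin{cases}
\gamma\lambda\, e_{t-1}(s,a) & \text{if } Q_{t-1}(s_t, a_t) = \max_a Q_{t-1}(s_t, a) \\
0 & \text{otherwise}
\end{cases}

with TD error \delta_t = r_{t+1} + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t) and update Q_{t+1}(s,a) = Q_t(s,a) + \alpha\, \delta_t\, e_t(s,a).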

46 Watkins's Q(λ)

47

48 Peng's Q(λ). Cutting off traces loses much of the advantage of using eligibility traces: if exploratory actions are frequent, as they often are early in learning, then backups of more than one or two steps will rarely be done, and learning may be little faster than 1-step Q-learning. Peng's Q(λ) is an alternative version of Q(λ) meant to remedy this.

49 Backups in Peng's Q(λ) [Peng, J. and Williams, R. J. (1996). Incremental Multi-Step Q-Learning. Machine Learning, 22(1/2/3)]. Traces are never cut; each component backup uses the max action, except at the end. The book says it outperforms Watkins's Q(λ) and performs almost as well as Sarsa(λ). Disadvantage: it is difficult to implement.

50 Peng's Q(λ). See Peng, J. and Williams, R. J. (1996), Incremental Multi-Step Q-Learning, Machine Learning, 22(1/2/3), for the notation.

51 Naïve Q(λ). Idea: is it really a problem to back up exploratory actions? Traces are never zeroed, and the backup always uses the max at the current action (unlike Peng's or Watkins's). Is this truly naïve? It works well in preliminary empirical studies.

52 Naïve Q(λ)

53 Comparisons [McGovern, Amy and Sutton, Richard S. (1997). Towards a better Q(λ). Presented at the Fall 1997 Reinforcement Learning Workshop]. Deterministic gridworld with obstacles: 10x10 gridworld, 25 randomly generated obstacles, 30 runs, α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05, accumulating traces.

54 Comparisons

55 Convergence of the Q(λ)'s. None of the methods has been proven to converge (much extra credit if you can prove any of them). Watkins's is thought to converge to Q*. Peng's is thought to converge to a mixture of Q^π and Q*. Naïve Q(λ): Q*? Unknown.

56 Reinforcement Learning Eligibility Traces: Eligibility Traces for Actor-Critic Methods

57 Actor-Critic Methods. [Diagram: the actor (policy) emits actions; the environment returns the state and reward; the critic (value function) produces the TD error that drives learning in both.] Critic: on-policy learning of V^π, using TD(λ) as described before. Actor: needs eligibility traces for each state-action pair.

58 Policy Parameter Update, Method 1. [Same actor-critic diagram as the previous slide.]
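The formula on this slide is an image; a hedged reconstruction following the textbook's first actor-critic trace method: the actor's action preferences p(s,a) are updated by

p_{t+1}(s,a) = p_t(s,a) + \alpha\, \delta_t\, e_t(s,a),

where e_t(s,a) is the usual accumulating state-action trace, e_t(s,a) = \gamma\lambda\, e_{t-1}(s,a) + \mathbf{1}[s = s_t, a = a_t].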

59 Policy Parameter Update, Method 2. [Same actor-critic diagram as the previous slide.]
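A hedged reconstruction of the second method: the same preference update, but the trace increment is weighted by how unexpected the chosen action was under the current policy,

e_t(s,a) = \gamma\lambda\, e_{t-1}(s,a) + \mathbf{1}[s = s_t, a = a_t]\,\bigl(1 - \pi_t(s_t, a_t)\bigr).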

60 Reinforcement Learning Eligibility Traces: Replacing Traces

61 Accumulating vs. Replacing Traces (definitions below).
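A reconstruction of the two definitions; they differ only in what happens when a state is revisited:

Accumulating traces:  e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \\ \gamma\lambda\, e_{t-1}(s) & \text{otherwise} \end{cases}

Replacing traces:  e_t(s) = \begin{cases} 1 & \text{if } s = s_t \\ \gamma\lambda\, e_{t-1}(s) & \text{otherwise} \end{cases}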

62 Why Replacing Traces? With accumulating traces, frequently visited states can have eligibilities greater than 1, which can be a problem for convergence. Replacing traces can significantly speed learning, make the system perform well over a broader range of parameters, and avoid the poor behavior accumulating traces show on certain types of tasks.

63 Example (19-State Random Walk)

64 Extension to Action Values. When you revisit a state, what should you do with the traces for the other actions? Singh and Sutton (1996) suggest setting the traces of all the other actions from the revisited state to 0.
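Written out (a reconstruction of the Singh and Sutton replacing trace for action values):

e_t(s,a) = \begin{cases} 1 & \text{if } s = s_t \text{ and } a = a_t \\ 0 & \text{if } s = s_t \text{ and } a \neq a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}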

65 Reinforcement Learning Eligibility Traces: Implementation Issues

66 Implementation Issues. In practice we cannot afford to maintain every trace down to the very last state; dropping very small trace values is recommended and encouraged. If you implement it in Matlab, the backup is only one line of code and is very fast (Matlab is optimized for matrix operations). Using traces with neural networks and backpropagation generally only doubles the required computation.
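A minimal Python sketch of the "drop tiny traces" idea (the cutoff value and all names are illustrative assumptions): keep traces in a dictionary and delete entries once they decay below a threshold, so each backup touches only recently visited states.

```python
TRACE_CUTOFF = 1e-4   # illustrative threshold, not from the slides

def sparse_td_lambda_step(V, traces, s, s_next, r, done,
                          alpha=0.1, gamma=1.0, lam=0.9):
    """One online TD(lambda) backup using a sparse dict of traces {state: value}."""
    target = r if done else r + gamma * V[s_next]
    delta = target - V[s]
    # decay existing traces and drop the negligible ones
    for state in list(traces):
        traces[state] *= gamma * lam
        if traces[state] < TRACE_CUTOFF:
            del traces[state]
    traces[s] = traces.get(s, 0.0) + 1.0      # accumulating trace
    for state, e in traces.items():           # the backup touches few states
        V[state] += alpha * delta * e
```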

67 Variable λ. We can generalize to a variable λ, where λ_t is a function of time, e.g., a function of the state visited at time t.
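The generalized trace recursion (a reconstruction; the slide's own example did not survive transcription):

e_t(s) = \gamma \lambda_t\, e_{t-1}(s) + \mathbf{1}[s = s_t],

where λ_t may, for example, be chosen close to 1 in states whose values are uncertain and close to 0 in states whose values are well estimated.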

68 Proof. An accumulating eligibility trace can be written explicitly (non-recursively) as shown below.
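The formula is an image in the original; unrolling the recursion gives

e_t(s) = \sum_{k=0}^{t} (\gamma\lambda)^{t-k}\, \mathbf{1}[s_k = s].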

69 Proof

70

71

