
1 ECE-517 Reinforcement Learning in Artificial Intelligence. Lecture 11: Temporal Difference Learning (cont.), Eligibility Traces. Dr. Itamar Arel, College of Engineering, Department of Electrical Engineering and Computer Science, The University of Tennessee. Fall 2010, October 11, 2010.

2 Outline: Actor-Critic Model (TD); Eligibility Traces.

3 Actor-Critic Methods Explicit (and independent) representation of the policy as well as the value function. A scalar critique signal drives all learning in both actor and critic. These methods received much attention early on and are being revisited now. They are appealing in the context of psychological and neural models, e.g., dopamine neurons (W. Schultz et al., Université de Fribourg).

4 Actor-Critic Details Typically, the critic is a state-value function. After each action selection, an evaluation error is obtained in the form δ_t = r_{t+1} + γV(s_{t+1}) − V(s_t), where V is the critic's current value function. A positive error means action a_t should be strengthened for the future. The typical actor is a parameterized mapping of states to actions. Suppose actions are generated by the Gibbs softmax π_t(s, a) = e^{p(s,a)} / Σ_b e^{p(s,b)}; then the agent can update the preferences as p(s_t, a_t) ← p(s_t, a_t) + β δ_t.
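
A minimal sketch of one such actor-critic update in Python, assuming a tabular state-value critic, Gibbs-softmax action preferences, and an environment whose step() returns (next_state, reward, done); the interface and step sizes are illustrative assumptions, not from the slides:

```python
import numpy as np

def softmax(prefs):
    """Gibbs/Boltzmann distribution over action preferences p(s, .)."""
    z = np.exp(prefs - prefs.max())
    return z / z.sum()

def actor_critic_step(V, p, s, env, alpha=0.1, beta=0.1, gamma=0.99):
    """One on-line actor-critic update for tabular V[s] and preferences p[s, a]."""
    probs = softmax(p[s])
    a = np.random.choice(len(probs), p=probs)              # draw action from softmax policy
    s_next, r, done = env.step(a)                          # assumed env interface
    delta = r + (0 if done else gamma * V[s_next]) - V[s]  # TD error (critic's evaluation)
    V[s] += alpha * delta                                  # critic update
    p[s, a] += beta * delta                                # strengthen/weaken the chosen action
    return s_next, done
```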

5 Actor-Critic Models (cont.) Actor-critic methods offer a powerful framework for scalable RL systems (as will be shown later). They are particularly interesting since they: operate inherently online; require minimal computation to select actions (e.g., draw a number from a given distribution; in a neural network this is equivalent to a single feed-forward pass); and can cope with non-Markovian environments.

6 Summary of TD TD is based on prediction (and the associated error). We introduced one-step tabular model-free TD methods and extended prediction to control by employing some form of GPI: on-policy control (Sarsa) and off-policy control (Q-learning). These methods bootstrap and sample, combining aspects of DP and MC methods. They have been shown to have some correlation with biological systems.

7 Unified View of RL Methods (so far)

8 Eligibility Traces Eligibility traces (ET) are one of the basic practical mechanisms in RL. Almost any TD method can be combined with ET to obtain a more efficient learning engine. They combine TD concepts with Monte Carlo ideas and address the gap between events and training data. An eligibility trace is a temporary record of the occurrence of an event; it marks the memory parameters associated with the event as eligible for undergoing learning changes. When a TD error is recorded, eligible states or actions are assigned credit or "blame" for the error. There will be two views of ET: the forward view (more theoretical) and the backward view (more mechanistic).

9 n-step TD Prediction Idea: look farther into the future when you do the TD backup (1, 2, 3, …, n steps).

10 Mathematics of n-step TD Prediction Monte Carlo return: R_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … + γ^{T−t−1} r_T. TD (1-step return), using V to estimate the remaining return: R_t^{(1)} = r_{t+1} + γ V_t(s_{t+1}). 2-step return: R_t^{(2)} = r_{t+1} + γ r_{t+2} + γ² V_t(s_{t+2}). n-step return at time t: R_t^{(n)} = r_{t+1} + γ r_{t+2} + … + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n}).
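
A small sketch of the n-step return and the corresponding backup in Python, assuming the rewards r_{t+1}..r_{t+n} are collected in a list and V is a tabular estimate (names are illustrative):

```python
def n_step_return(rewards, next_value, gamma, n):
    """R_t^(n) = r_{t+1} + gamma*r_{t+2} + ... + gamma^{n-1}*r_{t+n} + gamma^n * V(s_{t+n}).
    `rewards` holds r_{t+1}, ..., r_{t+n}; `next_value` is V(s_{t+n})."""
    G = 0.0
    for k, r in enumerate(rewards[:n]):
        G += (gamma ** k) * r
    return G + (gamma ** n) * next_value

# n-step backup of the visited state (alpha is the step size):
# V[s_t] += alpha * (n_step_return(rewards, V[s_t_plus_n], gamma, n) - V[s_t])
```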

11 Learning with n-step Backups Backup (on-line or off-line): ΔV_t(s_t) = α [R_t^{(n)} − V_t(s_t)]. Error reduction property of n-step returns: max_s |E_π[R_t^{(n)} | s_t = s] − V^π(s)| ≤ γ^n max_s |V_t(s) − V^π(s)|, i.e., the maximum error using the n-step return is at most γ^n times the maximum error using V(s). Using this, one can show that n-step methods converge. This yields a family of methods, of which TD and MC are members.

12 On-line vs. Off-line Updating In on-line updating, updates are done during the episode, as soon as the increment is computed; in that case V_{t+1}(s) = V_t(s) + ΔV_t(s). In off-line updating, we update the value of each state at the end of the episode: increments are accumulated and calculated "on the side", and values are constant throughout the episode. Given a value V(s), the new value (used in the next episode) will be V(s) + Σ_t ΔV_t(s).
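
A rough sketch of the two update schedules in Python, assuming a tabular V, a list of visited states, and a hypothetical helper delta_V(t, s) that returns the increment computed at step t:

```python
# On-line: apply each increment as soon as it is computed.
for t, s in enumerate(visited_states):
    V[s] += delta_V(t, s)

# Off-line: accumulate "on the side" and apply only at the end of the episode.
pending = {s: 0.0 for s in V}
for t, s in enumerate(visited_states):
    pending[s] += delta_V(t, s)      # V stays constant during the episode
for s, inc in pending.items():
    V[s] += inc
```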

13 Random Walk Revisited: e.g., the 19-state random walk.

14 Averaging n-step Returns n-step methods were introduced to help with understanding TD(λ). Idea: back up an average of several returns, e.g., half of the 2-step return and half of the 4-step return: R_t^{avg} = ½ R_t^{(2)} + ½ R_t^{(4)}. Such an average is called a complex backup: draw each component and label it with the weight for that component. TD(λ) can be viewed as one particular way of averaging n-step backups.

15 Forward View of TD(λ) TD(λ) is a method for averaging all n-step backups, weighting the n-step return by λ^{n−1} (decaying with time since visitation). λ-return: R_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} R_t^{(n)}. Backup using the λ-return: ΔV_t(s_t) = α [R_t^λ − V_t(s_t)].
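
A sketch of the forward-view λ-return for an episodic trajectory in Python, truncating the infinite sum at termination as on the next slide (argument names and layout are illustrative assumptions):

```python
def lambda_return(rewards, values_after, lam, gamma):
    """Forward-view lambda-return R_t^lambda.
    `rewards` = [r_{t+1}, ..., r_T]; `values_after[n-1]` = V(s_{t+n}), 0 at the terminal state."""
    T = len(rewards)                                  # number of remaining steps, T - t
    def n_step(n, tail_value):
        G = sum((gamma ** k) * r for k, r in enumerate(rewards[:n]))
        return G + (gamma ** n) * tail_value
    G_lam = sum((1 - lam) * (lam ** (n - 1)) * n_step(n, values_after[n - 1])
                for n in range(1, T))                 # weights (1-lam)*lam^(n-1) until termination
    return G_lam + (lam ** (T - 1)) * n_step(T, 0.0)  # remaining weight on the full (MC) return
```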

16 λ-Return Weighting Function For episodic tasks, the weight on the n-step return is (1 − λ)λ^{n−1} until termination; after termination, all remaining weight, λ^{T−t−1}, falls on the complete return R_t.

17 Relation of the λ-Return to TD(0) and Monte Carlo The λ-return can be rewritten as R_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} R_t^{(n)} + λ^{T−t−1} R_t. If λ = 1, you get Monte Carlo: R_t^λ = R_t. If λ = 0, you get TD(0): R_t^λ = R_t^{(1)}.
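
A short worked check of the two limiting cases (standard algebra, using the usual 0^0 = 1 convention, which the slide does not spell out):

```latex
R_t^{\lambda} = (1-\lambda)\sum_{n=1}^{T-t-1}\lambda^{n-1}R_t^{(n)} + \lambda^{T-t-1}R_t
% \lambda = 0: (1-\lambda)=1, every term with n > 1 vanishes, and \lambda^{T-t-1}=0,
%              so R_t^{\lambda} = R_t^{(1)}  (the TD(0) target)
% \lambda = 1: the factor (1-\lambda) removes the sum entirely,
%              so R_t^{\lambda} = R_t        (the Monte Carlo return)
```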

18 Forward View of TD(λ) Look forward from each state to determine its update from future states and rewards. Q: Can this be practically implemented?

19 λ-Return on the Random Walk Same 19-state random walk as before. Q: Why do you think intermediate values of λ are best?

20 Backward View The forward view was theoretical; the backward view provides a practical mechanism. "Shout" the TD error δ_t backwards over time; the strength of your voice decreases with temporal distance by γλ.

21 Backward View of TD(λ) TD(λ) parametrically shifts from TD to MC. A new variable is introduced, the eligibility trace e_t(s). On each step, decay all traces by γλ and increment the trace for the current state by 1 (γ is the discount rate, λ is the return weighting coefficient). The accumulating trace is thus e_t(s) = γλ e_{t−1}(s) for s ≠ s_t, and e_t(s_t) = γλ e_{t−1}(s_t) + 1.

22 On-line Tabular TD(λ)
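
A minimal sketch of on-line tabular TD(λ) with accumulating traces in Python; the policy, the env interface returning (next_state, reward, done), and the hyperparameters are illustrative assumptions:

```python
import numpy as np

def td_lambda_episode(env, policy, V, alpha=0.1, gamma=0.99, lam=0.8):
    """One episode of on-line tabular TD(lambda) with accumulating eligibility traces."""
    e = np.zeros_like(V)                                     # traces reset at episode start
    s, done = env.reset(), False                             # assumed env interface
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)
        delta = r + (0 if done else gamma * V[s_next]) - V[s]  # TD error
        e *= gamma * lam                                     # decay all traces by gamma*lambda
        e[s] += 1.0                                          # accumulating trace for current state
        V += alpha * delta * e                               # update every eligible state
        s = s_next
    return V
```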

23 Relation of the Backward View to MC & TD(0) Using the update rule ΔV_t(s) = α δ_t e_t(s): as before, if you set λ to 0, you get TD(0). If you set λ = 1 (no trace decay beyond γ), you get MC, but in a better way: TD(1) can be applied to continuing tasks, and it works incrementally and on-line (instead of waiting until the end of the episode). In between, earlier states are given less credit for the TD error.

24 Forward View = Backward View The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating: the sum of the backward-view updates over an episode equals the sum of the forward-view updates (the algebra is shown in the book, pp. 176-178). On-line updating with small α is similar.

25 On-line versus Off-line on the Random Walk Same 19-state random walk. On-line performs better over a broader range of parameters.

26 Control: Sarsa(λ) Next we want to use eligibility traces for control, not just prediction (i.e., estimation of value functions). Idea: we keep an eligibility trace for each state-action pair instead of just for each state.

27 Sarsa(λ) Algorithm
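
A sketch of one episode of tabular Sarsa(λ) with accumulating state-action traces in Python; the ε-greedy policy, env interface, and hyperparameters are assumptions for illustration:

```python
import numpy as np

def epsilon_greedy(Q, s, eps=0.1):
    """Pick a random action with probability eps, otherwise the greedy one."""
    if np.random.rand() < eps:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def sarsa_lambda_episode(env, Q, alpha=0.1, gamma=0.99, lam=0.8, eps=0.1):
    """One episode of Sarsa(lambda) with accumulating eligibility traces e[s, a]."""
    e = np.zeros_like(Q)
    s, done = env.reset(), False                 # assumed env interface
    a = epsilon_greedy(Q, s, eps)
    while not done:
        s_next, r, done = env.step(a)
        a_next = epsilon_greedy(Q, s_next, eps)
        delta = r + (0 if done else gamma * Q[s_next, a_next]) - Q[s, a]
        e[s, a] += 1.0                           # accumulating trace for the visited pair
        Q += alpha * delta * e                   # update all eligible pairs
        e *= gamma * lam                         # decay traces
        s, a = s_next, a_next
    return Q
```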

28 Implementing Q(λ) Two methods have been proposed that combine eligibility traces with Q-learning: Watkins's Q(λ) and Peng's Q(λ). Recall that Q-learning is an off-policy method: it learns about the greedy policy while following exploratory actions. Suppose the agent follows the greedy policy on the first two steps, but not on the third. Watkins: zero out the eligibility traces after a non-greedy action, and use the max when backing up at the first non-greedy choice.

29 Watkins's Q(λ)
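
A sketch of the per-step update of Watkins's Q(λ) in Python, showing the trace cut after a non-greedy (exploratory) action; function and argument names are illustrative:

```python
import numpy as np

def watkins_q_lambda_step(Q, e, s, a, r, s_next, a_next, done,
                          alpha=0.1, gamma=0.99, lam=0.8):
    """One step of Watkins's Q(lambda): back up toward the greedy target,
    then cut the traces if the action actually taken (a_next) is non-greedy."""
    a_star = int(np.argmax(Q[s_next]))                        # greedy action at s'
    delta = r + (0 if done else gamma * Q[s_next, a_star]) - Q[s, a]
    e[s, a] += 1.0                                            # accumulating trace
    Q += alpha * delta * e                                    # update all eligible pairs
    if not done and a_next == a_star:
        e *= gamma * lam                                      # greedy action: keep decaying traces
    else:
        e[:] = 0.0                                            # exploratory action: zero out traces
```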

30 Peng's Q(λ) Disadvantage of Watkins's method: early in learning, the eligibility traces will be "cut" (zeroed out) frequently, resulting in little advantage from traces. Peng: back up the max action except at the end, and never cut traces. Disadvantage: complicated to implement.

31 Variable λ Eligibility-trace methods can be improved by allowing λ to change over time, i.e., by generalizing to a variable λ_t. It can be defined, for example, as a function of the state visited at time t: states visited with high-certainty value estimates get λ near 0, so that value estimate is used fully and subsequent states are ignored; states visited with uncertain value estimates get λ near 1, which causes their estimated values to have little effect on any updates.
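
A toy sketch of a state-dependent λ in Python, assuming a per-state certainty score in [0, 1]; the mapping is entirely illustrative, as the slide does not prescribe a specific form:

```python
def state_lambda(certainty, lam_max=0.9):
    """Map a certainty estimate in [0, 1] to a trace-decay parameter:
    high certainty -> lambda near 0 (trust V(s), cut the return there);
    low certainty  -> lambda near lam_max (let that estimate carry little weight)."""
    return lam_max * (1.0 - certainty)

# Inside a TD(lambda) loop the trace decay then becomes state-dependent, e.g.:
# e *= gamma * state_lambda(certainty[s_next])
```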

32 Conclusions Eligibility traces offer an efficient, incremental way to combine MC and TD. They include the advantages of MC (they can deal with a lack of the Markov property; considering an n-step interval improves performance) and the advantages of TD (use of the TD error; bootstrapping). They can significantly speed learning, but do have a cost in computation.

