Published by Albert Stacks. Modified over 3 years ago.

1
TEMPORAL DIFFERENCE LEARNING Mark Romero – 11/03/2011

2
Introduction
Temporal Difference (TD) Learning combines ideas from Monte Carlo methods and Dynamic Programming
    Like Monte Carlo: still samples the environment based on some policy
    Like Dynamic Programming: determines the current estimate based on previous estimates
Predictions are adjusted as time goes on to match other, more accurate predictions
Temporal Difference Learning is popular for its simplicity and on-line applications

3
MC vs TD
Constant-α MC: V(s_t) <- V(s_t) + α[R(t) - V(s_t)]
    R(t) – actual return (cumulative reward) following time t
    α – constant step-size parameter
Because the actual return is used, we must wait until the end of the episode to determine the update to V.
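The constant-α MC update above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `mc_constant_alpha_update`, the episode format (a list of `(state, reward)` pairs, where the reward is the one received after leaving that state), and the every-visit variant are choices made here, not something the slides specify.

```python
def mc_constant_alpha_update(V, episode, alpha=0.1, gamma=1.0):
    """Every-visit constant-alpha MC update (illustrative sketch).

    V       -- dict mapping state -> current value estimate
    episode -- list of (state, reward) pairs for one COMPLETE episode;
               reward is the reward received after leaving that state
               (assumed format, not taken from the slides)
    """
    G = 0.0  # return following the current time step
    # Walk backwards so G accumulates the actual return R(t) for each step.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])
    return V


# The update can only run once the episode has finished.
V = {"A": 0.0, "B": 0.0}
episode = [("A", 0.0), ("B", 1.0)]
mc_constant_alpha_update(V, episode, alpha=0.5, gamma=1.0)
```

Note that nothing can be updated until the whole `episode` list exists, which is the waiting cost the slide describes.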

4
MC vs TD
TD(0): V(s_t) <- V(s_t) + α[r_{t+1} + γV(s_{t+1}) - V(s_t)]
    r_{t+1} – observed reward
    γ – discount rate
The TD method only waits for the next time step. At time t+1 a target can be formed and an update made using the observed reward, r_{t+1}, and the estimate, V(s_{t+1}).
In effect, TD(0) targets r_{t+1} + γV(s_{t+1}) instead of R(t) in the MC method.
Called bootstrapping because the update is based on an existing estimate.
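A single TD(0) update, as described above, might look like the following sketch. The function name `td0_update` and the `terminal` flag are assumptions introduced for illustration; the flag simply treats the value of a terminal next state as zero.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """Apply one TD(0) update to V[s] and return the new estimate.

    The target is r + gamma * V[s_next], i.e. the observed reward plus the
    discounted CURRENT estimate of the next state (bootstrapping).
    """
    target = r + (0.0 if terminal else gamma * V[s_next])
    V[s] += alpha * (target - V[s])
    return V[s]


# One transition A -> B with reward 0: the update uses only r and V["B"].
V = {"A": 0.0, "B": 1.0}
new_value = td0_update(V, "A", 0.0, "B", alpha=0.5, gamma=0.9)
```

Here the target is 0 + 0.9 × 1.0 = 0.9, so with α = 0.5 the estimate for A moves halfway from 0 to 0.9.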

5
Pseudo Code
Initialize V(s) arbitrarily, and π to the policy to be evaluated
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a <- action given by π for s
        Take action a; observe reward r and next state s'
        V(s) <- V(s) + α[r + γV(s') - V(s)]
        s <- s'
    until s is terminal
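The pseudocode above can be turned into a small runnable sketch. Everything environment-specific here is an assumption made for illustration: a seven-state random walk (states 0 and 6 terminal, start in state 3, reward 1 only on reaching state 6), a uniformly random policy, and the helper names `random_walk_step` and `random_policy`. Values are initialized to zero rather than arbitrarily.

```python
import random


def td0_policy_evaluation(step, policy, start_state, states,
                          episodes=5000, alpha=0.05, gamma=1.0, seed=0):
    """Tabular TD(0) policy evaluation, following the pseudocode's loop."""
    rng = random.Random(seed)
    V = {s: 0.0 for s in states}           # Initialize V(s)
    for _ in range(episodes):              # Repeat (for each episode)
        s = start_state                    # Initialize s
        done = False
        while not done:                    # Repeat (for each step of episode)
            a = policy(s, rng)             # a <- action given by pi for s
            s_next, r, done = step(s, a)   # take action, observe r and s'
            # A terminal next state contributes no future value.
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])
            s = s_next                     # s <- s'
    return V


def random_walk_step(s, a):
    """Hypothetical environment: move left (-1) or right (+1) on states 0..6."""
    s_next = s + a
    reward = 1.0 if s_next == 6 else 0.0
    return s_next, reward, s_next in (0, 6)


def random_policy(s, rng):
    """Hypothetical policy: go left or right with equal probability."""
    return rng.choice((-1, 1))


V = td0_policy_evaluation(random_walk_step, random_policy,
                          start_state=3, states=range(7))
```

For this walk the true value of a non-terminal state s is s/6, so the estimates for states 1 to 5 should end up near 1/6, 2/6, ..., 5/6, with some residual noise because α is constant.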

6
Advantages over MC
Lends itself naturally to on-line applications
    MC must wait until the end of the episode to update its estimates; TD only needs one time step
    This turns out to be a critical consideration: some applications have long episodes or no episodes at all
TD learns from every transition
    MC methods generally discount or throw out episodes where an experimental action was taken
TD converges faster than constant-α MC in practice
    No formal proof has been developed
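To illustrate the "no episodes at all" case, here is a minimal sketch of TD(0) on a continuing task, where an MC method would never reach an episode end to update from. The two-state cycle, its rewards, and the function name `td0_continuing` are assumptions made up for this example.

```python
def td0_continuing(steps=5000, alpha=0.1, gamma=0.9):
    """TD(0) on a hypothetical two-state cycle that never terminates.

    State 0 always moves to state 1 with reward 1;
    state 1 always moves to state 0 with reward 0.
    """
    V = [0.0, 0.0]
    s = 0
    for _ in range(steps):
        s_next = 1 - s
        r = 1.0 if s == 0 else 0.0
        # Update immediately after each transition; no episode boundary needed.
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next
    return V


V = td0_continuing()
```

Solving V(0) = 1 + γV(1) and V(1) = γV(0) with γ = 0.9 gives V(0) = 1/(1 - 0.81) ≈ 5.263 and V(1) ≈ 4.737, which the estimates approach as the number of steps grows.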

7
Soundness
Is TD sound?
Yes. For any fixed policy π, the TD algorithm has been proven to converge to V^π: in the mean, provided a sufficiently small constant step-size parameter is used, and with probability 1 if the step-size parameter decreases according to the usual stochastic approximation conditions.
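The usual stochastic approximation conditions require that the step sizes α_n sum to infinity while their squares sum to a finite value. A standard schedule that satisfies both is α_n = 1/n. The sketch below (the function name `running_estimate` is an assumption for illustration) shows that with this schedule the incremental update reduces to a plain sample average of the targets seen so far.

```python
def running_estimate(targets):
    """Incremental update v <- v + alpha_n * (x - v) with alpha_n = 1/n.

    With this decreasing step size the estimate after n targets equals
    their sample mean.
    """
    v = 0.0
    for n, x in enumerate(targets, start=1):
        alpha = 1.0 / n
        v += alpha * (x - v)
    return v


estimate = running_estimate([1.0, 2.0, 6.0])
```

A constant α, by contrast, keeps weighting recent targets more heavily, which is why its estimates keep fluctuating around the true value rather than settling on it exactly.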
