TEMPORAL DIFFERENCE LEARNING Mark Romero – 11/03/2011
Introduction. Temporal Difference (TD) learning combines ideas from Monte Carlo methods and dynamic programming. As in Monte Carlo, we still sample the environment according to some policy. As in dynamic programming, we determine the current estimate from previous estimates. Predictions are adjusted over time to match other, more accurate predictions. TD learning is popular for its simplicity and its suitability for on-line applications.
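MC vs TD. Constant-α MC: $V(s_t) \leftarrow V(s_t) + \alpha\,[R(t) - V(s_t)]$, where R(t) is the actual return (the cumulative reward following time t) and α is a constant step-size parameter. Because the actual return is used, we must wait until the end of the episode to determine the update to V.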
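The constant-α MC update can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses an every-visit variant, stores an episode as (state, reward) pairs, and adds a discount parameter `gamma`; the name `mc_update` and the dictionary value table are not from the slides.

```python
def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Constant-alpha, every-visit MC update from one finished episode.

    episode: list of (state, reward) pairs, where reward is the reward
    received on leaving that state (an assumed representation).
    """
    G = 0.0
    # Walk backwards so the return R(t) can be accumulated incrementally.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])
    return V


V = {"A": 0.0, "B": 0.0}
# The update can only run once the whole episode is available.
mc_update(V, [("A", 0.0), ("B", 1.0)], alpha=0.5)
```

Note that `mc_update` takes the complete episode as input, which is exactly the waiting requirement described above.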
MC vs TD. TD(0): $V(s_t) \leftarrow V(s_t) + \alpha\,[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$, where $r_{t+1}$ is the observed reward and γ is the discount rate. The TD method waits only until the next time step. At time t+1 a target can be formed and an update made using the observed reward, $r_{t+1}$, and the estimate, $V(s_{t+1})$. In effect, TD(0) targets $r_{t+1} + \gamma V(s_{t+1})$ instead of R(t) as in the MC method. This is called bootstrapping because the update is based on an existing estimate.
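A single TD(0) update might look like the following sketch. The function name `td0_update`, the `terminal` flag, and the dictionary value table are illustrative assumptions rather than anything given in the slides.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, terminal=False):
    """Apply one TD(0) update to V[s] after observing (s, r, s_next)."""
    # Bootstrapped target: observed reward plus discounted estimate of the
    # next state (a terminal next state is assumed to have value 0).
    target = r + (0.0 if terminal else gamma * V[s_next])
    V[s] += alpha * (target - V[s])
    return V[s]


V = {"A": 0.0, "B": 1.0}
# Target is 0.5 + 0.9 * 1.0 = 1.4, so V["A"] moves halfway from 0 to 1.4.
td0_update(V, "A", 0.5, "B", alpha=0.5, gamma=0.9)
```

Only one transition is needed, so the update can be applied as soon as the next state is observed.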
Pseudocode
Initialize V(s) arbitrarily, and π to the policy to be evaluated
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a <- action given by π for s
        Take action a; observe reward r and next state s’
        V(s) <- V(s) + α[r + γV(s’) - V(s)]
        s <- s’
    until s is terminal
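The loop above can be turned into a small runnable sketch. The environment here is an assumption chosen for illustration: a five-state random walk with terminal states at both ends, reward 1 for exiting on the right and 0 otherwise, evaluated under a uniformly random policy. For that setup the true values are 1/6, 2/6, ..., 5/6, so the estimates should land near those numbers. The function name and parameters are hypothetical.

```python
import random


def td0_random_walk(episodes=2000, alpha=0.05, gamma=1.0, seed=0):
    """Tabular TD(0) policy evaluation on an assumed 5-state random walk."""
    rng = random.Random(seed)
    n = 5                     # non-terminal states 1..n; 0 and n+1 are terminal
    V = [0.0] * (n + 2)       # terminal entries are never updated and stay 0
    for _ in range(episodes):
        s = (n + 1) // 2      # initialize s in the middle state
        while s not in (0, n + 1):
            a = rng.choice((-1, 1))              # a <- action given by pi for s
            s_next = s + a                       # take action a
            r = 1.0 if s_next == n + 1 else 0.0  # observe reward r
            V[s] += alpha * (r + gamma * V[s_next] - V[s])
            s = s_next
    return V[1:n + 1]


values = td0_random_walk()
print([round(v, 2) for v in values])
```

With a constant step size the estimates keep fluctuating slightly around the true values rather than settling exactly.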
Advantages over MC. TD lends itself naturally to on-line applications: MC must wait until the end of the episode to update its value estimates, whereas TD needs only one time step. This turns out to be a critical consideration, because some applications have very long episodes or no episodes at all. TD learns from every transition, whereas MC methods generally discount or throw out episodes in which an experimental action was taken. In practice, TD converges faster than constant-α MC, although no formal proof of this has been developed.
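To illustrate the "no episodes at all" case, here is a sketch of TD(0) on a continuing task. The task is a hypothetical deterministic two-state cycle (A to B pays reward 1, B to A pays reward 0) with an assumed discount of 0.5, for which the exact values are V(A) = 4/3 and V(B) = 2/3. An MC method has no episode end to wait for here, while TD(0) updates after every transition.

```python
def td0_continuing(steps=5000, alpha=0.1, gamma=0.5):
    """TD(0) on an assumed two-state cycle with no terminal state."""
    next_state = {"A": "B", "B": "A"}   # deterministic transitions
    reward = {"A": 1.0, "B": 0.0}       # reward received on leaving each state
    V = {"A": 0.0, "B": 0.0}
    s = "A"
    for _ in range(steps):              # one long stream, no episode boundaries
        s_next, r = next_state[s], reward[s]
        V[s] += alpha * (r + gamma * V[s_next] - V[s])  # learn from every transition
        s = s_next
    return V


V = td0_continuing()
print(V)
```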
Soundness. Is TD sound? Yes: for any fixed policy π, the TD algorithm has been proven to converge to $V^{\pi}$, in the mean for a sufficiently small constant step-size parameter, and with probability 1 if the step-size parameter decreases according to the usual stochastic approximation conditions.
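For reference, the usual stochastic approximation conditions on a step-size sequence are commonly stated as follows, where $\alpha_k$ (notation assumed here, not taken from the slides) is the step size used on the k-th update:

```latex
\sum_{k=1}^{\infty} \alpha_k = \infty
\qquad \text{and} \qquad
\sum_{k=1}^{\infty} \alpha_k^{2} < \infty
```

The first condition keeps the steps large enough to overcome initial conditions and random fluctuations; the second makes them shrink fast enough for the estimates to settle. For example, $\alpha_k = 1/k$ satisfies both, while a constant step size violates the second.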