
1 Tópicos Especiais em Aprendizagem — Prof. Reinaldo Bianchi, Centro Universitário da FEI, 2006

2 Objective of this Lecture
– Reinforcement Learning: Monte Carlo methods; Temporal Difference learning; Eligibility Traces.
– Today's class: chapters 5, 6, and 7 of Sutton & Barto.

3 Temporal Difference Learning — Chapter 6 of the Sutton & Barto book.

4 Temporal Difference Learning — objectives of this chapter:
– Introduce Temporal Difference (TD) learning.
– Focus first on policy evaluation, or prediction, methods.
– Then extend to control methods.

5 Central Idea: If one had to identify one idea as central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning. TD learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas.
– Like Monte Carlo methods, TD methods can learn directly from raw experience without a model of the environment's dynamics.
– Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).

6 TD(0) Prediction. Policy evaluation (the prediction problem): for a given policy π, compute the state-value function V^π. Recall that the Monte Carlo target is the actual return after time t, while the TD target is an estimate of the return.
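The update rules behind these two targets were shown as images on the slide and did not survive the transcript; they are the standard constant-α rules from Sutton & Barto:

```latex
% Constant-alpha Monte Carlo: target is the actual return R_t
V(s_t) \leftarrow V(s_t) + \alpha \bigl[ R_t - V(s_t) \bigr]

% TD(0): target is the estimated return r_{t+1} + \gamma V(s_{t+1})
V(s_t) \leftarrow V(s_t) + \alpha \bigl[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \bigr]
```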

7 TD(0) Prediction. We know from DP: V^π(s) = E_π{ r_{t+1} + γ V^π(s_{t+1}) | s_t = s }.

8 Simple Monte Carlo (backup diagram).

9 Simplest TD Method (backup diagram).

10 cf. Dynamic Programming (backup diagram).

11 Tabular TD(0) Algorithm.
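The algorithm box itself was an image and is missing from the transcript. The following is a minimal Python sketch of tabular TD(0) policy evaluation, assuming a toy environment interface (`env.reset()`, `env.step(a)`) and a `policy` callable that are illustrative placeholders, not any specific library's API:

```python
from collections import defaultdict

def td0_prediction(env, policy, alpha=0.1, gamma=1.0, num_episodes=100):
    """Tabular TD(0) policy evaluation (sketch).

    Assumed interface: env.reset() -> state,
    env.step(action) -> (next_state, reward, done),
    policy(state) -> action.
    """
    V = defaultdict(float)  # state-value estimates, initialized to 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # TD(0) update: move V(s) toward the bootstrapped target
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])
            s = s_next
    return V
```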

12 TD Bootstraps and Samples
– Bootstrapping: the update involves an estimate. MC does not bootstrap; DP bootstraps; TD bootstraps.
– Sampling: the update does not involve an expected value. MC samples; DP does not sample; TD samples.

13 Example: Driving Home.

14 Driving Home: changes recommended by Monte Carlo methods (α = 1) versus changes recommended by TD methods (α = 1).

15 Advantages of TD Learning
– TD methods do not require a model of the environment, only experience.
– TD, but not MC, methods can be fully incremental: you can learn before knowing the final outcome (less memory, less peak computation) and you can learn without the final outcome (from incomplete sequences).
– Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?

16 Random Walk Example: empirically compare the prediction abilities of TD(0) and constant learning rate MC applied to a small Markov process.

17 Random Walk Example
– All episodes start in the center state, C, and proceed either left or right by one state on each step, with equal probability.
– Episodes terminate on either the extreme left or the extreme right.
– Rewards: when an episode terminates on the right, a reward of +1 occurs; all other rewards are zero.

18 Random Walk Example: values learned by TD(0) after various numbers of episodes.

19 Random Walk Example: the final estimate is about as close as the estimates ever get to the true values. With a constant step-size parameter (as in this example), the values fluctuate indefinitely in response to the outcomes of the most recent episodes.

20 TD and MC on the Random Walk: learning curves for TD(0) and constant learning rate MC, data averaged over 100 sequences of episodes.

21 Optimality of TD(0). Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence. Compute updates according to TD(0), but only update estimates after each complete pass through the data. For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α. Constant-α MC also converges under these conditions, but to a different answer!

22 Random Walk under Batch Updating: after each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. All repeated 100 times.

23 Example: You are the Predictor. Suppose you observe the following 8 episodes: A, 0, B, 0; then six episodes of B, 1; then B, 0.

24 You are the Predictor.

25 You are the Predictor
– The prediction that best matches the training data is V(A) = 0: this minimizes the mean-squared error on the training set, and it is what a batch Monte Carlo method gets.
– If we consider the sequentiality of the problem, then we would set V(A) = 0.75. This is the correct maximum-likelihood estimate for a Markov model generating the data: i.e., if we fit a best-fit Markov model, assume it is exactly correct, and then compute what it predicts (how?). This is called the certainty-equivalence estimate, and it is what TD(0) gets.
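To answer the slide's "how?": under the best-fit Markov model for the 8 episodes above, state A always transitions to B with reward 0, and 6 of the 8 observed visits to B terminate with reward 1, giving (with γ = 1):

```latex
V(B) = \tfrac{6}{8}\cdot 1 + \tfrac{2}{8}\cdot 0 = 0.75, \qquad
V(A) = 0 + \gamma\, V(B) = 0.75
```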

26 And now for some methods...

27 Learning an Action-Value Function
– We turn now to the use of TD prediction methods for the control problem. As usual, we follow the pattern of generalized policy iteration (GPI), only this time using TD methods for the evaluation or prediction part.
– As with Monte Carlo methods, we face the need to trade off exploration and exploitation, and again approaches fall into two main classes: on-policy and off-policy.

28 Sarsa: On-Policy TD Control.

29 Sarsa: On-Policy TD Control. This update is done after every transition from a nonterminal state s_t. If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) is defined as zero. This rule uses every element of the quintuple of events, (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}), that make up a transition from one state-action pair to the next. This quintuple gives rise to the name Sarsa for the algorithm.
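The update itself was an image on the slide; it is the standard Sarsa rule from Sutton & Barto:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \bigl[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \bigr]
```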

30 Sarsa: On-Policy TD Control. Turn this into a control method by always updating the policy to be greedy with respect to the current estimate.

31 Windy Gridworld: undiscounted, episodic, reward = –1 until the goal is reached.

32 Results of Sarsa on the Windy Gridworld.

33 Q-Learning: Off-Policy TD Control. One of the most important breakthroughs in reinforcement learning was the development of an off-policy TD control algorithm known as Q-learning (Watkins, 1989). Its simplest form, one-step Q-learning, is defined by:
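The one-step Q-learning rule, again lost with the slide image, is the standard one:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t)
  + \alpha \bigl[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \bigr]
```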

34 Q-Learning: Off-Policy TD Control. In this case, the learned action-value function, Q, directly approximates Q*, the optimal action-value function, independent of the policy being followed. This dramatically simplifies the analysis of the algorithm and enabled early convergence proofs: all that is required for correct convergence is that all state-action pairs continue to be updated.

35 Q-Learning Algorithm.
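The algorithm box did not survive the transcript; here is a minimal Python sketch of one-step Q-learning with an ε-greedy behaviour policy, reusing the same assumed toy environment interface as the TD(0) sketch (the names are illustrative, not a specific library):

```python
import random
from collections import defaultdict

def q_learning(env, actions, alpha=0.1, gamma=1.0, epsilon=0.1, num_episodes=500):
    """One-step Q-learning with an epsilon-greedy behaviour policy (sketch)."""
    Q = defaultdict(float)  # Q[(state, action)]
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection (behaviour policy)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # off-policy target uses the greedy (max) action in s_next
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```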

36 Example 6.6: Cliff Walking
– This example compares Sarsa and Q-learning, highlighting the difference between on-policy (Sarsa) and off-policy (Q-learning) methods.
– Consider the gridworld shown in the upper part of Figure 6.13: a standard undiscounted, episodic task, with start and goal states, and the usual actions causing movement up, down, right, and left. Reward is –1 on all transitions except those into the region marked "The Cliff." Stepping into this region incurs a reward of –100 and sends the agent instantly back to the start.

37 Example 6.6: Cliff Walking (ε-greedy, ε = 0.1).

38 Example 6.6: Cliff Walking. After an initial transient, Q-learning learns values for the optimal policy, the one that travels right along the edge of the cliff. Unfortunately, this results in its occasionally falling off the cliff because of the ε-greedy action selection. Sarsa, on the other hand, takes the action selection into account and learns the longer but safer path through the upper part of the grid.

39 Example 6.6: Cliff Walking. Although Q-learning actually learns the values of the optimal policy, its on-line performance is worse than that of Sarsa, which learns the roundabout policy. Of course, if ε were gradually reduced, then both methods would asymptotically converge to the optimal policy.

40 Actor-Critic Methods
– Explicit representation of the policy as well as the value function.
– Minimal computation to select actions.
– Can learn an explicit stochastic policy.
– Can put constraints on policies.
– Appealing as psychological and neural models.

41 Actor-Critic Details.
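The details slide was a diagram with equations that did not survive the transcript. The sketch below shows the usual tabular actor-critic scheme in which the critic's TD error drives both updates; the dictionaries `V` (critic values) and `prefs` (actor preferences, turned into a policy by softmax) and their names are assumptions for illustration only:

```python
import math
import random

def actor_critic_step(V, prefs, s, a, r, s_next, done,
                      alpha=0.1, beta=0.1, gamma=1.0):
    """One tabular actor-critic update (sketch).

    V and prefs should tolerate missing keys, e.g. defaultdict(float).
    """
    # critic: TD error from the one-step TD(0) target
    target = r + (0.0 if done else gamma * V[s_next])
    delta = target - V[s]
    V[s] += alpha * delta
    # actor: strengthen (or weaken) the tendency to pick a in s
    prefs[(s, a)] += beta * delta

def softmax_policy(prefs, s, actions):
    """Sample an action from the Gibbs/softmax distribution over preferences."""
    exps = [math.exp(prefs[(s, a)]) for a in actions]
    total = sum(exps)
    x, acc = random.random() * total, 0.0
    for a, e in zip(actions, exps):
        acc += e
        if x <= acc:
            return a
    return actions[-1]
```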

42 Summary – TD Learning
– TD prediction: introduced the one-step, tabular, model-free TD(0) method.
– Extend prediction to control by employing some form of GPI: on-policy control (Sarsa) and off-policy control (Q-learning).
– These methods bootstrap and sample, combining aspects of DP and MC methods.

43 Eligibility Traces — Chapter 7 of the Sutton & Barto book.

44 Eligibility Traces.

45 N-step TD Prediction. Idea: look farther into the future when you do a TD backup (1, 2, 3, …, n steps).

46 Mathematics of N-step TD Prediction
– Monte Carlo: use the complete return as the target.
– TD(0): use the one-step return, with V estimating the remaining return.
– n-step TD: use the 2-step return, ..., the n-step return (written out below).
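The returns themselves were images on the slide; in the book's notation they are:

```latex
% complete (Monte Carlo) return
R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_{T}

% one-step (TD(0)) return, using V to estimate the remaining return
R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})

% two-step return
R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})

% n-step return
R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^{n} V_t(s_{t+n})
```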

47 Learning with N-step Backups
– Backup (on-line or off-line): ΔV_t(s_t) = α [ R_t^{(n)} − V_t(s_t) ].
– Error reduction property of n-step returns: the maximum error of the expected n-step return is at most γ^n times the maximum error of V, i.e. max_s | E_π{ R_t^{(n)} | s_t = s } − V^π(s) | ≤ γ^n max_s | V_t(s) − V^π(s) |.

48 Random Walk Examples: how does 2-step TD work here? How about 3-step TD?

49 A Larger Example. Task: 19-state random walk. Do you think there is an optimal n (for everything)?

50 Averaging N-step Returns
– n-step methods were introduced to help with understanding TD(λ).
– Idea: back up an average of several returns, e.g. back up half of the 2-step return and half of the 4-step return.
– This is called a complex backup: draw each component and label it with the weight for that component; the result is still one backup.

51 Forward View of TD(λ). TD(λ) is a method for averaging all n-step backups, weighting the n-step return by λ^{n-1} (time since visitation). The weighted average is the λ-return; the backup uses the λ-return as its target.
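Written out, the λ-return and the corresponding backup (both lost with the slide images) are:

```latex
% lambda-return: geometrically weighted average of n-step returns
R_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{\,n-1} R_t^{(n)}

% backup using the lambda-return
\Delta V_t(s_t) = \alpha \bigl[ R_t^{\lambda} - V_t(s_t) \bigr]
```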

52 λ-return Weighting Function.

53 Relation to TD(0) and MC. The λ-return can be rewritten as a weighted sum of n-step returns until termination plus the complete return, weighted by λ^{T-t-1}, after termination. If λ = 1, you get MC; if λ = 0, you get TD(0).

54 Forward View of TD(λ) II: look forward from each state to determine its update from future states and rewards.

55 λ-return on the Random Walk: same 19-state random walk as before. Why do you think intermediate values of λ are best?

56 Backward View of TD(λ)
– The forward view was for theory; the backward view is for mechanism.
– New variable called the eligibility trace, e_t(s): on each step, decay all traces by γλ and increment the trace for the current state by 1 (an accumulating trace).

57 Backward View of TD(λ).

58 On-line Tabular TD(λ).
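The algorithm box was an image; the following Python sketch shows one episode of on-line tabular TD(λ) with accumulating traces, reusing the assumed environment/policy interface from the earlier sketches:

```python
from collections import defaultdict

def td_lambda_episode(env, policy, V, alpha=0.1, gamma=1.0, lam=0.9):
    """One episode of on-line tabular TD(lambda), accumulating traces (sketch).

    V is a defaultdict(float) of state values, updated in place.
    """
    e = defaultdict(float)  # eligibility traces, one per state
    s = env.reset()
    done = False
    while not done:
        a = policy(s)
        s_next, r, done = env.step(a)
        # TD error for this step
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        e[s] += 1.0  # accumulating trace: increment the current state
        for state in list(e.keys()):
            V[state] += alpha * delta * e[state]  # all eligible states share the error
            e[state] *= gamma * lam               # then decay every trace
        s = s_next
```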

59 Backward View: shout the TD error δ_t backwards over time; the strength of your voice decreases with temporal distance by γλ.

60 Relation of the Backward View to MC and TD(0)
– Using the update rule ΔV_t(s) = α δ_t e_t(s): as before, if you set λ to 0, you get TD(0); if you set λ to 1, you get MC, but in a better way.
– TD(1) can be applied to continuing tasks, and it works incrementally and on-line (instead of waiting until the end of the episode).

61 Forward View = Backward View. The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating: the sum of the backward updates equals the sum of the forward updates (algebra shown in the book).

62 On-line versus Off-line on the Random Walk: same 19-state random walk; on-line performs better over a broader range of parameters.

63 Control: Sarsa(λ). Save eligibility traces for state-action pairs instead of just states.

64 Sarsa(λ) Algorithm.

65 Sarsa(λ) Gridworld Example: with one trial, the agent has much more information about how to get to the goal (not necessarily the best way), which can considerably accelerate learning.

66 Three Approaches to Q(λ)
– How can we extend this to Q-learning?
– If you mark every state-action pair as eligible, you back up over the non-greedy (behaviour) policy.
– There are three approaches to extending Q-learning: Watkins's Q(λ), Peng's Q(λ), and naïve Q(λ).

67 Watkins's Q(λ)
– Zero out the eligibility trace after a non-greedy action.
– Do the max when backing up at the first non-greedy choice.
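As a rough Python sketch (not the slide's own pseudocode), one Watkins's Q(λ) update with accumulating traces might look like this; `Q` and `e` are dicts keyed by (state, action) that tolerate missing keys (e.g. defaultdict(float)), and `a_next` is the action actually chosen ε-greedily in the next state — all illustrative assumptions:

```python
def watkins_q_lambda_update(Q, e, s, a, r, s_next, a_next, actions,
                            done, alpha=0.1, gamma=1.0, lam=0.9):
    """One Watkins's Q(lambda) update with accumulating traces (sketch)."""
    best_next = 0.0 if done else max(Q[(s_next, x)] for x in actions)
    # the backup always uses the greedy value in the next state
    delta = r + gamma * best_next - Q[(s, a)]
    e[(s, a)] += 1.0
    for key in list(e.keys()):
        Q[key] += alpha * delta * e[key]
        e[key] *= gamma * lam
    # cut (zero out) the traces if the action actually taken was exploratory
    if not done and Q[(s_next, a_next)] < best_next:
        e.clear()
```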

68 Watkins's Q(λ).

69 Watkins's Q(λ).

70 Peng's Q(λ)
– Disadvantage of Watkins's method: early in learning, the eligibility trace will be "cut" (zeroed out) frequently, resulting in little advantage from traces.
– Peng: back up the max action except at the end, and never cut traces.
– Disadvantage: complicated to implement.

71 Naïve Q(λ)
– Idea: is it really a problem to back up exploratory actions? Never zero traces; always back up the max at the current action (unlike Peng's or Watkins's).
– Is this truly naïve? It works well in preliminary empirical studies. What is the backup diagram?

72 Comparison Task. Compared Watkins's, Peng's, and naïve (called McGovern's here) Q(λ) on several tasks; see McGovern and Sutton (1997), "Towards a Better Q(λ)", for other tasks and results (stochastic tasks, continuing tasks, etc.).
– Deterministic gridworld with obstacles: 10x10 gridworld, 25 randomly generated obstacles, 30 runs; α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05, accumulating traces.

73 Comparison Results. From McGovern and Sutton (1997), "Towards a Better Q(λ)".

74 Convergence of the Q(λ)'s
– None of the methods is proven to converge (much extra credit if you can prove any of them).
– Watkins's is thought to converge to Q*; Peng's is thought to converge to a mixture of Q^π and Q*; naïve: Q*?

75 Replacing Traces
– Using accumulating traces, frequently visited states can have eligibilities greater than 1, which can be a problem for convergence.
– Replacing traces: instead of adding 1 when you visit a state, set that trace to 1.
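Side by side, the two trace updates (shown as formulas on the slide) are:

```latex
% accumulating trace
e_t(s) =
\begin{cases}
\gamma \lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \\
\gamma \lambda\, e_{t-1}(s)     & \text{otherwise}
\end{cases}

% replacing trace
e_t(s) =
\begin{cases}
1                           & \text{if } s = s_t \\
\gamma \lambda\, e_{t-1}(s) & \text{otherwise}
\end{cases}
```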

76 Replacing Traces.

77 Replacing Traces Example: same 19-state random walk task as before; replacing traces perform better than accumulating traces over more values of λ.

78 Why Replacing Traces?
– Replacing traces can significantly speed learning.
– They can make the system perform well for a broader set of parameters.
– Accumulating traces can do poorly on certain types of tasks. Why is this task particularly onerous for accumulating traces?

79 More Replacing Traces
– Off-line replacing-trace TD(1) is identical to first-visit MC.
– Extension to action values: when you revisit a state, what should you do with the traces for the other actions? Singh and Sutton say to set them to zero.

80 Implementation Issues
– Could require much more computation, but most eligibility traces are very close to zero.
– If you implement it in Matlab, the backup is only one line of code and is very fast (Matlab is optimized for matrices).

81 Variable λ: can generalize to a variable λ, where λ_t is a function of time; one could define it, for example, as a function of the current state.

82 Conclusions
– Eligibility traces provide an efficient, incremental way to combine MC and TD: they include the advantages of MC (can deal with lack of the Markov property) and the advantages of TD (use the TD error, bootstrapping).
– They can significantly speed learning.
– They do have a cost in computation.

83 Something Here is Not Like the Other.

84 The End!

