1
Computational Modeling Lab, Wednesday 18 June 2003
Reinforcement Learning: an introduction, part 4
Ann Nowé, Ann.nowe@vub.ac.be, http://como.vub.ac.be
Based on the book by Sutton and Barto
2
Backup diagrams in DP
State-value function for a policy: V(s)
Action-value function for a policy: Q(s,a)
[Backup diagrams: V(s) is backed up from the values V(s') of its successor states; Q(s,a) is backed up from the action values Q(s',a') of successor state-action pairs]
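These diagrams correspond to the standard Bellman expectation equations; written out in the usual Sutton and Barto notation (the transition probabilities P^a_{ss'} and expected rewards R^a_{ss'} are the textbook's symbols, not taken from the slide image):

\[ V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma V^{\pi}(s') \right] \]
\[ Q^{\pi}(s,a) = \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma \sum_{a'} \pi(s',a') Q^{\pi}(s',a') \right] \]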
3
Dynamic Programming, model based
[Backup diagram: full backups branching over all possible successor states down to terminal states T, using the model of the environment]
4
Recall Value Iteration in DP
[Value iteration update for Q(s,a)]
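The update referred to here is the standard value iteration backup on action values (a reconstruction, in the same notation as above):

\[ Q_{k+1}(s,a) = \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma \max_{a'} Q_k(s',a') \right] \]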
5
RL, model free
[Backup diagram: sample backups along a single experienced trajectory down to a terminal state T; no model of the environment is needed]
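In the model-free case a single sampled transition replaces the expectation over all successors; the basic TD(0) update (standard, not shown explicitly on the slide) is:

\[ V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right] \]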
6
Q-Learning, a value iteration approach
Q-learning is off-policy.
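For reference, the standard one-step Q-learning update this slide refers to:

\[ Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha_t \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1},a) - Q(s_t,a_t) \right] \]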
7
Example
[Graph: an MDP with states 1-6 and actions a, b, c, d; rewards R = 1, 2, 4, 5, 10 and transition probabilities 0.2, 0.3, 0.7, 0.8, 1 label the edges]
Epoch 1: 1, 2, 4
Epoch 2: 1, 6
Epoch 3: 1, 3
Epoch 4: 1, 2, 5
Epoch 6: 2, 5
8
Some convergence issues
Q-learning is guaranteed to converge in a Markovian setting.
Tsitsiklis, J.N. Asynchronous Stochastic Approximation and Q-learning. Machine Learning, Vol. 16:185-202, 1994.
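Convergence also requires the usual stochastic-approximation conditions (standard assumptions in this result, not spelled out on the slide): every state-action pair is updated infinitely often and the learning rates satisfy

\[ \sum_{t} \alpha_t(s,a) = \infty, \qquad \sum_{t} \alpha_t^2(s,a) < \infty. \]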
9
Proof by Tsitsiklis, cont.
On the convergence of Q-learning
10
Proof by Tsitsiklis
On the convergence of Q-learning
[Annotated iteration: a "learning factor", a contraction mapping, a noise term, and a q vector with possibly outdated components Q(s,a)]
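The annotations refer to the general asynchronous stochastic approximation iteration analysed by Tsitsiklis (1994); written out (a reconstruction, with symbols following the paper rather than the slide image):

\[ q_{t+1}(i) = \left(1 - \alpha_t(i)\right) q_t(i) + \alpha_t(i) \left( F_i(q_t) + w_t(i) \right) \]

where \alpha_t(i) is the learning factor, F is a contraction mapping, w_t(i) is a noise term, and F_i may be evaluated on possibly outdated components of q.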
11
Proof by Tsitsiklis, cont.
Stochastic approximation, as a vector
[Figure: the components q_i and q_j of the vector evolve over time t, each moved towards F_i plus noise]
12
Proof by Tsitsiklis, cont.
Relating Q-learning to stochastic approximation
[Annotated Q-learning update: the i-th component of the vector, a noise term, a contraction mapping given by the Bellman operator, and a learning rate that can vary in time]
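Making the correspondence explicit (a standard reconstruction, indexing the vector by state-action pairs):

\[ Q_{t+1}(s,a) = \left(1 - \alpha_t(s,a)\right) Q_t(s,a) + \alpha_t(s,a) \left( r_{t+1} + \gamma \max_{a'} Q_t(s',a') \right) \]

The sampled target equals the Bellman optimality operator \( (HQ)(s,a) = \sum_{s'} P^{a}_{ss'} \left[ R^{a}_{ss'} + \gamma \max_{a'} Q(s',a') \right] \) plus a zero-mean noise term; H is a contraction under discounting, and the learning rate \alpha_t(s,a) can vary in time.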
13
Sarsa: On-Policy TD Control
When is Sarsa = Q-learning?
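For reference, the standard Sarsa update; comparing it with the Q-learning update above suggests the answer to the question: the two coincide when the next action is always chosen greedily, i.e. a_{t+1} = argmax_a Q(s_{t+1},a).

\[ Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha_t \left[ r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) \right] \]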
14
Q-Learning versus Sarsa
Q-learning is off-policy; Sarsa is on-policy.
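A minimal tabular sketch of the difference (illustrative only: the epsilon_greedy helper, the env.step call and the action set are hypothetical names, not taken from the slides):

import random
from collections import defaultdict

Q = defaultdict(float)  # tabular Q-values, default 0.0

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    # Off-policy: the target uses the greedy (max) action in s_next,
    # regardless of which action the behaviour policy actually takes next.
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy: the target uses the action a_next that was actually selected
    # (e.g. epsilon-greedily) in s_next.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# Inside an episode loop one would do something like:
#   a = epsilon_greedy(Q, s, actions)
#   s_next, r = env.step(a)          # hypothetical environment interface
#   q_learning_update(Q, s, a, r, s_next, actions)
# or, for Sarsa, first pick a_next with epsilon_greedy and call sarsa_update.

If actions are selected greedily (epsilon = 0), both updates compute the same target, which answers the question posed on the previous slide.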
15
Cliff Walking example
Actions: up, down, left, right
Reward: cliff -100, goal 0, default -1
Action selection: ε-greedy, with ε = 0.1
Sarsa takes exploration into account: it learns the longer, safer path away from the cliff, while Q-learning learns the shortest path along the cliff edge but occasionally falls off during ε-greedy exploration.
16
Q-learning for CAC
[Figure: states s1 = (2,4), s2 = (3,4), s3 = (3,3) for two call classes (class 1, class 2); in each state the Q-values for accepting or rejecting a call are compared, e.g. Q(s1,A1) vs Q(s1,R1) and Q(s3,A2) vs Q(s3,R2)]
Acceptance criterion: maximize network revenue
17
Continuous Time Q-learning for CAC
[Timeline: events at t0 = 0, t1, t2, ..., tn are call arrivals and call departures; the system state changes, e.g. from x to y, at these event times] [Bratke]