
1 Computational Modeling Lab. Wednesday 18 June 2003. Reinforcement Learning: An Introduction, Part 4. Ann Nowé (Ann.nowe@vub.ac.be, http://como.vub.ac.be). Based on Sutton and Barto.

2 Computational Modeling Lab: Backup diagrams in DP. [Figure: backup diagram for the state-value function V^π(s), expanding a state over its actions and successor states with values V(s_1), V(s_1'), V(s_2), V(s_2'), V(s_3), V(s_3'); and backup diagram for the action-value function Q^π(s,a), expanding states s_1, s_2 over actions a_1, a_2 with values Q(s_1,a_1), Q(s_1,a_2), Q(s_2,a_1), Q(s_2,a_2).]
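The relations these diagrams depict are the standard Bellman equations for a policy π (a reconstruction from the figure labels; the equations themselves did not survive the transcript):

    V^\pi(s) = \sum_a \pi(s,a) \, Q^\pi(s,a)
    Q^\pi(s,a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \, V^\pi(s') \right]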

3 Computational Modeling Lab: Dynamic Programming, model based. [Figure: full-width backup tree from the current state over all actions and all successor states down to terminal states T; computing this expectation requires a model of the environment.]

4 Computational Modeling Lab: Recall value iteration in DP. Each sweep computes Q(s,a) = \sum_{s'} P(s'|s,a) [ R(s,a,s') + \gamma V(s') ] and then backs up V(s) <- \max_a Q(s,a).
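Since the slide's formula survives only as the label Q(s,a), here is a minimal tabular value-iteration sketch in Python; the data layout (P[s][a] as a list of (probability, next state) pairs, R[s][a] as the expected immediate reward) is an assumption, not from the slides:

    import numpy as np

    def value_iteration(P, R, gamma=0.9, theta=1e-8):
        """Tabular value iteration.
        P[s][a] is a list of (prob, next_state) pairs; R[s][a] is the
        expected immediate reward (assumed layout)."""
        n_states = len(P)
        V = np.zeros(n_states)
        while True:
            delta = 0.0
            for s in range(n_states):
                # Full backup over all actions and successors (needs the model).
                q = [R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                     for a in range(len(P[s]))]
                v_new = max(q)
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:   # stop when the largest update is negligible
                return V

Each sweep is exactly the model-based full backup of the previous slide, applied to every state.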

5 Computational Modeling Lab: RL, model free. [Figure: a single sampled trajectory from the current state down to a terminal state T; a model-free learner backs up only along the transitions it actually experiences.]

6 Computational Modeling Lab: Q-learning, a value iteration approach. Update rule: Q(s,a) <- Q(s,a) + \alpha [ r + \gamma \max_{a'} Q(s',a') - Q(s,a) ]. Q-learning is off-policy: the target uses the greedy next action \max_{a'} Q(s',a'), independent of the action the behavior policy actually takes next.
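A minimal sketch of this update in Python (the [state, action] array layout is an assumption):

    import numpy as np

    def q_learning_step(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
        """One off-policy Q-learning backup on a 2-D array Q[state, action]."""
        td_target = r + gamma * np.max(Q[s_next])  # greedy bootstrap, regardless of behavior
        Q[s, a] += alpha * (td_target - Q[s, a])
        return Q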

7 Computational Modeling Lab: Example. [Figure: example graph over states 1-6 with actions a, b, c, d, edge rewards R = 4, 5, 2, 1, 10, 1, and transition probabilities 1, 0.2, 0.8, 0.7, 1, 0.3.] Sampled epochs (visited state sequences): Epoch 1: 1,2,4. Epoch 2: 1,6. Epoch 3: 1,3. Epoch 4: 1,2,5. Epoch 6: 2,5.
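A sketch of how Q-learning would consume such epochs, replaying logged transitions one by one. The concrete log below is hypothetical: the slide's per-step actions and rewards are not recoverable from the transcript, so the tuples are illustrative only.

    import numpy as np

    # Hypothetical log for epoch 1 (state sequence 1 -> 2 -> 4).
    # Format: (state, action, reward, next_state); indices and rewards
    # are illustrative, not recovered from the slide.
    epoch = [(1, 0, 5.0, 2), (2, 0, 4.0, 4)]

    alpha, gamma = 0.1, 0.9
    Q = np.zeros((7, 2))   # states 0..6 (1-based labels) x 2 actions: an assumption
    for s, a, r, s_next in epoch:
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])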

8 Computational Modeling Lab: Some convergence issues. Q-learning is guaranteed to converge in a Markovian setting. Tsitsiklis, J.N. Asynchronous Stochastic Approximation and Q-learning. Machine Learning, 16:185-202, 1994.
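The standard conditions behind this guarantee (standard in the cited proof, though not spelled out in the transcript): every state-action pair must be updated infinitely often, and the learning rates must satisfy

    \sum_t \alpha_t(s,a) = \infty, \qquad \sum_t \alpha_t^2(s,a) < \infty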

9 Computational Modeling Lab: Proof by Tsitsiklis, cont. On the convergence of Q-learning.

10 Computational Modeling Lab: Proof by Tsitsiklis. On the convergence of Q-learning: the update is cast as an asynchronous stochastic approximation,
q_i(t+1) = (1 - \alpha_i(t)) q_i(t) + \alpha_i(t) [ F_i(q^i(t)) + w_i(t) ],
where \alpha_i(t) is the learning factor, F is a contraction mapping, w_i(t) is a noise term, and q^i(t) is the q vector, but with possibly outdated components. For Q-learning, the components q_i correspond to the entries Q(s,a).

11 Computational Modeling Lab: Proof by Tsitsiklis, cont. [Figure: stochastic approximation viewed as a vector update at time t; components q_i and q_j are moved toward F_i(q), with the observed target being F_i plus noise.]

12 Computational Modeling Lab: Proof by Tsitsiklis, cont. Relating Q-learning to stochastic approximation: the i-th component corresponds to a state-action pair (s,a); the contraction mapping F is the Bellman operator; the noise term is the sampling error of the observed transition; and the learning factor can vary in time.
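Written out (a standard reconstruction; the slide's own formulas did not survive the transcript), the Bellman operator H and its contraction property in the max norm are

    (H Q)(s,a) = \sum_{s'} P(s'|s,a) \left[ R(s,a,s') + \gamma \max_{a'} Q(s',a') \right]
    \| H Q_1 - H Q_2 \|_\infty \le \gamma \, \| Q_1 - Q_2 \|_\infty

The noise term in the Q-learning update is then w = r + \gamma \max_{a'} Q(s',a') - (H Q)(s,a), which has zero mean given the current state-action pair.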

13 Computational Modeling Lab: Sarsa, on-policy TD control. Update rule: Q(s,a) <- Q(s,a) + \alpha [ r + \gamma Q(s',a') - Q(s,a) ], where a' is the action the current policy actually selects in s'. When is Sarsa = Q-learning?
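A minimal Sarsa step in the same style as the Q-learning sketch above (array layout again an assumption). As for the slide's question: the two coincide when actions are selected greedily, since then Q(s',a') = \max_{a'} Q(s',a').

    import numpy as np

    def sarsa_step(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
        """One on-policy Sarsa backup: bootstrap on the action a_next that the
        behavior policy actually chose in s_next."""
        td_target = r + gamma * Q[s_next, a_next]
        Q[s, a] += alpha * (td_target - Q[s, a])
        return Q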

14 Computational Modeling Lab: Q-learning versus Sarsa. Q-learning is off-policy: its target uses the greedy next action, \max_{a'} Q(s',a'). Sarsa is on-policy: its target uses the next action a' actually taken by the current policy.

15 Computational Modeling Lab: Cliff Walking example. Actions: up, down, left, right. Reward: cliff -100, goal 0, default -1. Action selection: ε-greedy, with ε = 0.1. Sarsa takes exploration into account: it learns the safer path away from the cliff, whereas Q-learning learns the shortest path along the cliff edge and occasionally falls off during ε-greedy exploration.
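A minimal ε-greedy action selection in the same style (a sketch; uniform random exploration over all actions is the usual choice, not stated on the slide):

    import numpy as np

    def epsilon_greedy(Q, s, epsilon=0.1):
        """Pick a uniformly random action with probability epsilon, else greedy."""
        if np.random.rand() < epsilon:
            return int(np.random.randint(Q.shape[1]))   # explore
        return int(np.argmax(Q[s]))                     # exploit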

16 Computational Modeling Lab: Q-learning for CAC (call admission control). [Figure: states count ongoing calls per traffic class, e.g. s1 = (2,4), s2 = (3,4), s3 = (3,3); on a class-1 or class-2 call arrival the learner compares the Q-values for accepting or rejecting, e.g. Q(s1,A1) vs. Q(s1,R1) and Q(s3,A2) vs. Q(s3,R2).] Acceptance criterion: maximize network revenue.
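A sketch of the admission decision this setup implies (the state encoding and Q layout below are assumptions; the slide gives only the accept/reject comparison):

    # Accept an arriving call of class k iff accepting looks at least as
    # valuable as rejecting under the learned Q-values. Q maps
    # (state, action) -> value, with actions ("accept", k) and ("reject", k);
    # this layout is an assumption.
    def admit(Q, state, k):
        return Q[(state, ("accept", k))] >= Q[(state, ("reject", k))]

    # Example: with state (2, 4) (two class-1 and four class-2 calls in
    # progress), admit(Q, (2, 4), 1) decides on an arriving class-1 call.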

17 Computational Modeling Lab: Continuous-time Q-learning for CAC. [Figure: event timeline starting at t_0 = 0 with the system in state x, followed by call arrivals and call departures at times t_1, t_2, ..., t_n; a later arrival finds the system in state y.] [Bratke]
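The continuous-time (SMDP) Q-learning update this setting calls for, following Bradtke and Duff (1995), which the slide's "[Bratke]" citation presumably refers to, discounts over the sojourn time τ between decision events:

    Q(x,a) \leftarrow Q(x,a) + \alpha \left[ \frac{1 - e^{-\beta\tau}}{\beta}\, r + e^{-\beta\tau} \max_{a'} Q(y,a') - Q(x,a) \right]

where r is the reward rate accrued while in state x and \beta > 0 is the continuous-time discount rate.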

