
1
Value Iteration & Q-learning
CS 5368
Song Cui

2
Outline
- Recap
- Value Iteration
- Q-learning

3
Recap
- Meanings of "Markov"
- MDPs
- Solving MDPs

4
Recap
- Utility of sequences
- The discount rate determines the "vision" of the agent.
- State-value function V^π(s)
- Action-value function Q^π(s,a)
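For reference, the standard definitions behind this notation (the slide names the functions but does not spell out the formulas; γ is the discount rate and r_t the reward at step t):

$$V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\ \pi\right]
\qquad
Q^{\pi}(s,a) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\ a_0 = a,\ \pi\right]$$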

5
Value Iteration
How to find V_k*(s) as k → infinity:
- Almost a solution: recursion
- Correct solution: dynamic programming (value iteration)

6
Bellman update:

$$V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} T(s,a,s')\big[\,R(s,a,s') + \gamma V_k(s')\,\big]$$

Another way, splitting the update across the V-node → Q-node → V-node layers of the one-step lookahead tree:

$$Q_{k+1}(s,a) = \sum_{s'} T(s,a,s')\big[\,R(s,a,s') + \gamma V_k(s')\,\big], \qquad V_{k+1}(s) = \max_{a} Q_{k+1}(s,a)$$

7
Value Iteration
Algorithm:

for i = 1, 2, 3, ...
    for each state s:
        V_{i+1}(s) ← max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_i(s') ]
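As a concrete illustration, here is a minimal Python sketch of this loop. The dict-based MDP representation (T, R), the discount gamma, and the stopping threshold theta are assumptions for the example, not from the slides.

```python
def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
    """Repeatedly apply the Bellman update until values stop changing.

    T[(s, a)] is a list of (s_next, prob) pairs;
    R[(s, a, s_next)] is the immediate reward.
    """
    V = {s: 0.0 for s in states}
    while True:
        V_new = {
            s: max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                for a in actions
            )
            for s in states
        }
        # Stop once the largest change across all states is below the threshold.
        if max(abs(V_new[s] - V[s]) for s in states) < theta:
            return V_new
        V = V_new
```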

8
Value Iteration
- Theorem: converges to an optimal value
- The policy may converge faster than the values
- Three components of the return

9
Value Iteration
Advantages compared with expectimax. Given an MDP with state space {1, 2}, actions {1, 2}, transitions that succeed 80% of the time, and rewards state 1 → 1, state 2 → 0:

[Diagram: the expectimax search tree for this MDP; the subtrees rooted at S1 and S2 repeat at every level]

The tree repeats! Value iteration caches each state's value instead of re-expanding identical subtrees.
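To make the toy MDP runnable, here is one plausible encoding in Python, reusing the value_iteration sketch from the earlier slide. The exact transition semantics are an assumption: action a tries to move to state a and succeeds 80% of the time, and the reward of 1 is earned on landing in state 1.

```python
states, actions = [1, 2], [1, 2]

# Action a targets state a with probability 0.8, and ends in the other
# state with probability 0.2 (assumed reading of the 80%/20% diagram).
T = {(s, a): [(a, 0.8), (3 - a, 0.2)] for s in states for a in actions}

# Reward 1 for transitions that land in state 1, otherwise 0.
R = {(s, a, s2): 1.0 if s2 == 1 else 0.0
     for s in states for a in actions for s2 in states}

V = value_iteration(states, actions, T, R)  # sketch from the previous slide
print(V)
```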

10
Q-learning
Compared with value iteration:
- Same: an underlying MDP; both seek a policy
- Different: T(s,a,s') and R(s,a,s') are unknown
- Different ways of solving the MDP (learned model vs. unlearned model)

Reinforcement learning:
- policy, experience, reward
- model-based vs. model-free
- passive learning vs. active learning

11
Q-learning
Value iteration updates V-values:

$$V_{k+1}(s) = \max_{a} \sum_{s'} T(s,a,s')\big[\,R(s,a,s') + \gamma V_k(s')\,\big]$$

The same idea over Q-values:

$$Q_{k+1}(s,a) = \sum_{s'} T(s,a,s')\big[\,R(s,a,s') + \gamma \max_{a'} Q_k(s',a')\,\big]$$
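A minimal sketch of Q-value iteration in the same dict-based style as the earlier value_iteration example (the fixed sweep count is an assumption; a convergence test would also work):

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, sweeps=100):
    """Bellman updates over Q-values: expectation over s', then a max at s'."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        # The comprehension reads the old Q throughout, so this is one
        # synchronous sweep producing Q_{k+1} from Q_k.
        Q = {
            (s, a): sum(
                p * (R[(s, a, s2)] + gamma * max(Q[(s2, a2)] for a2 in actions))
                for s2, p in T[(s, a)]
            )
            for s in states for a in actions
        }
    return Q
```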

12
Q-learning
Q-learning: sample-based Q-value iteration.
Process: observe a sample s → a → s' with reward r.
Update the Q-value toward the sample:

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\big[\,r + \gamma \max_{a'} Q(s',a')\,\big]$$
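In code, the per-sample update looks like this minimal sketch (alpha is the learning rate; the dict-of-tuples Q table is an assumption carried over from the earlier sketches):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update from a single observed sample (s, a, r, s')."""
    # The sample estimate: immediate reward plus discounted best next Q-value.
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    # Move the old estimate a fraction alpha of the way toward the sample.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```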

13
Q-learning
- Q-learning converges to the optimal policy if we sample enough and the learning rate is small enough.
- Ways to explore: epsilon-greedy action selection chooses between acting randomly and acting according to the best current Q-value.
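A minimal sketch of epsilon-greedy selection over the same Q table (epsilon = 0.1 is an arbitrary choice for the example):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon act randomly; otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit the current Q
```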
