
1
Value Iteration & Q-learning
CS 5368
Song Cui

2
Outline
- Recap
- Value Iteration
- Q-learning

3
Recap
- Meanings of "Markov"
- MDPs
- Solving MDPs

4
Recap
- Utility of sequences
- The discount rate determines the "vision" of the agent.
- State-value function V^π(s)
- Action-value function Q^π(s,a)
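For reference, the standard definitions behind this notation (the slide names the functions but does not spell out the formulas; γ is the discount rate and r_t the reward at step t):

$$V^{\pi}(s) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\ \pi\right]
\qquad
Q^{\pi}(s,a) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s,\ a_0 = a,\ \pi\right]$$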

5
Value Iteration
How to find V_k*(s) as k → infinity:
- Almost a solution: recursion
- Correct solution: dynamic programming (value iteration)

6
Bellman update:

$$V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} T(s,a,s')\big[\,R(s,a,s') + \gamma V_k(s')\,\big]$$

Another way, splitting the update across the V-node → Q-node → V-node layers of the one-step lookahead tree:

$$Q_{k+1}(s,a) = \sum_{s'} T(s,a,s')\big[\,R(s,a,s') + \gamma V_k(s')\,\big], \qquad V_{k+1}(s) = \max_{a} Q_{k+1}(s,a)$$

7
Value Iteration
Algorithm:

for i = 1, 2, 3, ...
    for each state s:
        V_{i+1}(s) ← max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V_i(s') ]
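As a concrete illustration, here is a minimal Python sketch of this loop. The dict-based MDP representation (T, R), the discount gamma, and the stopping threshold theta are assumptions for the example, not from the slides.

```python
def value_iteration(states, actions, T, R, gamma=0.9, theta=1e-6):
    """Repeatedly apply the Bellman update until values stop changing.

    T[(s, a)] is a list of (s_next, prob) pairs;
    R[(s, a, s_next)] is the immediate reward.
    """
    V = {s: 0.0 for s in states}
    while True:
        V_new = {
            s: max(
                sum(p * (R[(s, a, s2)] + gamma * V[s2]) for s2, p in T[(s, a)])
                for a in actions
            )
            for s in states
        }
        # Stop once the largest change across all states is below the threshold.
        if max(abs(V_new[s] - V[s]) for s in states) < theta:
            return V_new
        V = V_new
```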

8
Value Iteration
- Theorem: converges to an optimal value
- The policy may converge faster than the values
- Three components of the return

9
Value Iteration
Advantages compared with expectimax. Given an MDP with state space {1, 2}, actions {1, 2}, transitions that succeed 80% of the time, and rewards state 1 → 1, state 2 → 0:

[Diagram: the expectimax search tree for this MDP; the subtrees rooted at S1 and S2 repeat at every level]

The tree repeats! Value iteration caches each state's value instead of re-expanding identical subtrees.
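To make the toy MDP runnable, here is one plausible encoding in Python, reusing the value_iteration sketch from the earlier slide. The exact transition semantics are an assumption: action a tries to move to state a and succeeds 80% of the time, and the reward of 1 is earned on landing in state 1.

```python
states, actions = [1, 2], [1, 2]

# Action a targets state a with probability 0.8, and ends in the other
# state with probability 0.2 (assumed reading of the 80%/20% diagram).
T = {(s, a): [(a, 0.8), (3 - a, 0.2)] for s in states for a in actions}

# Reward 1 for transitions that land in state 1, otherwise 0.
R = {(s, a, s2): 1.0 if s2 == 1 else 0.0
     for s in states for a in actions for s2 in states}

V = value_iteration(states, actions, T, R)  # sketch from the previous slide
print(V)
```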

10
Q-learning
Compared with value iteration:
- Same: an underlying MDP; both seek a policy
- Different: T(s,a,s') and R(s,a,s') are unknown
- Different ways of solving the MDP (learned model vs. unlearned model)

Reinforcement learning:
- policy, experience, reward
- model-based vs. model-free
- passive learning vs. active learning

11
Q-learning
Value iteration updates V-values:

$$V_{k+1}(s) = \max_{a} \sum_{s'} T(s,a,s')\big[\,R(s,a,s') + \gamma V_k(s')\,\big]$$

The same idea over Q-values:

$$Q_{k+1}(s,a) = \sum_{s'} T(s,a,s')\big[\,R(s,a,s') + \gamma \max_{a'} Q_k(s',a')\,\big]$$
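A minimal sketch of Q-value iteration in the same dict-based style as the earlier value_iteration example (the fixed sweep count is an assumption; a convergence test would also work):

```python
def q_value_iteration(states, actions, T, R, gamma=0.9, sweeps=100):
    """Bellman updates over Q-values: expectation over s', then a max at s'."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        # The comprehension reads the old Q throughout, so this is one
        # synchronous sweep producing Q_{k+1} from Q_k.
        Q = {
            (s, a): sum(
                p * (R[(s, a, s2)] + gamma * max(Q[(s2, a2)] for a2 in actions))
                for s2, p in T[(s, a)]
            )
            for s in states for a in actions
        }
    return Q
```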

12
Q-learning
Q-learning: sample-based Q-value iteration.
Process: observe a sample s → a → s' with reward r.
Update the Q-value toward the sample:

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\big[\,r + \gamma \max_{a'} Q(s',a')\,\big]$$
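In code, the per-sample update looks like this minimal sketch (alpha is the learning rate; the dict-of-tuples Q table is an assumption carried over from the earlier sketches):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning update from a single observed sample (s, a, r, s')."""
    # The sample estimate: immediate reward plus discounted best next Q-value.
    sample = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    # Move the old estimate a fraction alpha of the way toward the sample.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
```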

13
Q-learning
- Q-learning converges to the optimal policy if we sample enough and the learning rate is small enough.
- Ways to explore: epsilon-greedy action selection chooses between acting randomly and acting according to the best current Q-value.
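A minimal sketch of epsilon-greedy selection over the same Q table (epsilon = 0.1 is an arbitrary choice for the example):

```python
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon act randomly; otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit the current Q
```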
