
1 Q-learning, SARSA, and Radioactive Breadcrumbs S&B: Ch.6 and 7

2 Administrivia
Office hours truncated (9:00-10:15) on Nov 17 (someone scheduled a meeting for me :-P)
HW3 assigned today; due Dec 2
Large HW, but you have a little extra time on it

3 The Q-learning algorithm
Algorithm: Q_learn
Inputs: State space S; Action space A; Discount γ (0<=γ<1); Learning rate α (0<=α<1)
Outputs: Q
Repeat {
  s = get_current_world_state()
  a = pick_next_action(Q, s)
  (r, s') = act_in_world(a)
  Q(s,a) = Q(s,a) + α*(r + γ*max_a'(Q(s',a')) - Q(s,a))
} Until (bored)
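As a concrete rendering of the pseudocode above, here is a minimal tabular Q-learning sketch in Python. The environment interface (env.state, env.step(a) returning (reward, next_state)) and the ε-greedy action picker are assumptions for illustration, not anything specified on the slides.

import random
from collections import defaultdict

def q_learn(env, actions, gamma=0.9, alpha=0.1, epsilon=0.1, num_steps=10000):
    """Tabular Q-learning, mirroring the slide's pseudocode."""
    Q = defaultdict(float)                  # Q[(s, a)] defaults to 0

    def pick_next_action(s):
        # epsilon-greedy exploration (see the later slide on picking actions)
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_steps):              # "Repeat ... Until (bored)"
        s = env.state                       # s = get_current_world_state()
        a = pick_next_action(s)             # a = pick_next_action(Q, s)
        r, s_next = env.step(a)             # (r, s') = act_in_world(a)
        best_next = max(Q[(s_next, a2)] for a2 in actions)
        # Q(s,a) += α*(r + γ*max_a'(Q(s',a')) - Q(s,a))
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q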

4 Why does this work? Still... Why should that weighted avg be the right thing? Compare w/ Bellman eqn:

5 Why does this work? Still... why should that weighted avg be the right thing? Compare w/ the Bellman eqn: the update is based on a single sample from the true transition distribution, T, rather than the full expectation used in the Bellman eqn/policy iteration alg. The first time the agent finds a rewarding state s_r, a fraction α of that reward is propagated back one step, via the Q update, to the state one step away from s_r. The next time, the state two steps away from s_r gets updated, and so on...
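The Bellman equation the slide is comparing against (standard form for Q*, written here in the slides' notation; the exact reward convention R(s,a,s') is my assumption about the slide's figure):

Q*(s,a) = Σ_s' T(s,a,s') * [ R(s,a,s') + γ * max_a' Q*(s',a') ]

The Q-learning update replaces the expectation over T with a single sampled (r, s') obtained by acting in the world.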

6 Picking the action
One critical step is underspecified in the Q_learn alg: a = pick_next_action(Q, s). How should you pick an action at each step?
Could pick greedily according to Q: might tend to keep doing the same thing and not explore at all. Need to force exploration.
Could pick an action at random: ignores everything you've learned about Q so far. Would you still converge?

7 Off-policy learning
Exploit a critical property of the Q_learn alg.
Lemma (w/o proof): the Q-learning algorithm will converge to the correct Q* independently of the policy being executed, so long as:
Every (s,a) pair is visited infinitely often in the infinite limit
α is chosen to be small enough (usually decayed; see the step-size conditions below)
I.e., Q-learning doesn't care what policy is being executed -- it will still converge.
Called an off-policy method: the policy being learned can be different from the policy being executed.
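"Small enough (usually decayed)" is normally made precise by the standard stochastic-approximation step-size conditions (not spelled out on the slide, but this is the usual statement): the learning rates α_t used on successive visits to a given (s,a) pair must satisfy

Σ_t α_t = ∞  and  Σ_t α_t² < ∞

e.g., α_t = 1/t on the t-th visit works.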

8 “Almost greedy” exploring
The off-policy property tells us we're free to pick any policy we like to explore, so long as we guarantee infinite visits to each (s,a) pair.
Might as well choose one that does (mostly) as well as we know how to do at each step. Can't be just greedy w.r.t. Q (why?)
Typical answers:
ε-greedy: execute argmax_a{Q(s,a)} w/ prob (1-ε) and a random action w/ prob ε
Boltzmann exploration: pick action a w/ prob P(a|s) = exp(Q(s,a)/τ) / Σ_a' exp(Q(s,a')/τ), where τ > 0 is a temperature parameter
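A small Python sketch of the two exploration rules; the Q table is assumed to be a dict keyed by (state, action) as in the earlier Q-learning sketch, and the ε and τ defaults are purely illustrative.

import math
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With prob (1 - epsilon) take argmax_a Q(s,a); otherwise a uniformly random action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def boltzmann(Q, s, actions, tau=1.0):
    """Pick action a with probability proportional to exp(Q(s,a)/tau)."""
    weights = [math.exp(Q[(s, a)] / tau) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]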

9 The value of experience
We observed that Q-learning converges slooooooowly... the same is true of many other RL algs. But we can do better (sometimes by orders of magnitude).
What're the biggest hurdles to Q convergence?

10 The value of experience
We observed that Q-learning converges slooooooowly... the same is true of many other RL algs. But we can do better (sometimes by orders of magnitude).
What're the biggest hurdles to Q convergence? Well, there are many. The big one, though, is poor use of experience: each timestep only changes one Q(s,a) value, so it takes many steps to "back up" experience very far.

11 That eligible state
Basic problem: every step, Q only does a one-step backup. It forgets where it was before that; there is no sense of the sequence of states/actions that got it where it is now.
Want a long-term memory of where the agent has been, and to update the Q values for all of those states.
The idea is called eligibility traces: keep a memory cell for each state/action pair, set the memory when you visit that state/action, and each step update all eligible states (a sketch of the bookkeeping follows below).
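A minimal sketch of just this trace bookkeeping (dict-based, same (state, action) keying as the earlier sketches; the full SARSA(λ) algorithm appears a few slides below):

from collections import defaultdict

e = defaultdict(float)                 # eligibility e[(s, a)], defaults to 0

def mark_visit(s, a):
    e[(s, a)] += 1.0                   # the pair just visited becomes (more) eligible

def backup_all(Q, delta, alpha, lam, gamma):
    # spread the one-step TD error delta over every eligible pair, then decay the traces
    for key in list(e.keys()):
        Q[key] += alpha * e[key] * delta
        e[key] *= lam * gamma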

12 Retrenching from Q
Can integrate eligibility traces w/ Q-learning, but it's a bit of a pain: need to track when the agent is "on policy" or "off policy", etc. Good discussion in Sutton & Barto.
We'll focus on a (slightly) simpler learning alg: SARSA learning. Very similar to Q-learning, but strictly on-policy: it only learns about the policy it's actually executing. I.e., it learns Q^π (the value of the policy being followed) instead of Q* (the value of the optimal policy).

13 The Q-learning algorithm
Algorithm: Q_learn
Inputs: State space S; Action space A; Discount γ (0<=γ<1); Learning rate α (0<=α<1)
Outputs: Q
Repeat {
  s = get_current_world_state()
  a = pick_next_action(Q, s)
  (r, s') = act_in_world(a)
  Q(s,a) = Q(s,a) + α*(r + γ*max_a'(Q(s',a')) - Q(s,a))
} Until (bored)

14 SARSA-learning algorithm
Algorithm: SARSA_learn
Inputs: State space S; Action space A; Discount γ (0<=γ<1); Learning rate α (0<=α<1)
Outputs: Q
s = get_current_world_state()
a = pick_next_action(Q, s)
Repeat {
  (r, s') = act_in_world(a)
  a' = pick_next_action(Q, s')
  Q(s,a) = Q(s,a) + α*(r + γ*Q(s',a') - Q(s,a))
  a = a'; s = s';
} Until (bored)
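A Python rendering of the SARSA pseudocode, under the same assumed environment interface and ε-greedy action picker as the earlier Q-learning sketch; note the update target is the Q value of the action actually chosen at s', not a max.

import random
from collections import defaultdict

def sarsa_learn(env, actions, gamma=0.9, alpha=0.1, epsilon=0.1, num_steps=10000):
    """Tabular SARSA, mirroring the slide's pseudocode."""
    Q = defaultdict(float)

    def pick_next_action(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.state                          # s = get_current_world_state()
    a = pick_next_action(s)                # a = pick_next_action(Q, s)
    for _ in range(num_steps):
        r, s_next = env.step(a)            # (r, s') = act_in_world(a)
        a_next = pick_next_action(s_next)  # a' = pick_next_action(Q, s')
        # Q(s,a) += α*(r + γ*Q(s',a') - Q(s,a))
        Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
        s, a = s_next, a_next
    return Q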

15 SARSA vs. Q
SARSA and Q-learning are very similar.
SARSA updates Q(s,a) for the policy it's actually executing: it lets the pick_next_action() function pick the action used in the update.
Q-learning updates Q(s,a) for the greedy policy w.r.t. the current Q: it uses max_a' to pick the action used in the update, which might be different from the action it actually executes at s'.
In practice: Q-learning will learn the "true" π*, but SARSA will learn about what it's actually doing. Exploration can get Q-learning in trouble... (the two update rules are contrasted below)
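For reference, the two update rules from the pseudocode slides, side by side in the slides' notation:

Q-learning: Q(s,a) = Q(s,a) + α*(r + γ*max_a'(Q(s',a')) - Q(s,a))
SARSA:      Q(s,a) = Q(s,a) + α*(r + γ*Q(s',a') - Q(s,a)), where a' is the action actually picked at s'

The only difference is the backup target: the max over next actions (greedy) vs. the next action the exploration policy actually chose.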

16 Getting Q in trouble... “Cliff walking” example (Sutton & Barto, Sec 6.5)

17 Getting Q in trouble... "Cliff walking" example, continued (Sutton & Barto, Sec 6.5): with ε-greedy exploration, Q-learning learns the optimal path right along the cliff edge but keeps falling off while exploring, whereas SARSA learns the safer, longer path and earns better online reward.

18 Radioactive breadcrumbs
Can now define eligibility traces for SARSA. In addition to the Q(s,a) table, keep an e(s,a) table that records the "eligibility" (a real number) of each state/action pair.
At every step (i.e., every (s,a,r,s',a') tuple):
Increment e(s,a) for the current (s,a) pair by 1
Update all Q(s'',a'') vals in proportion to their e(s'',a'')
Decay all e(s'',a'') by a factor of λγ
Leslie Kaelbling calls this the "radioactive breadcrumbs" form of RL.

19 SARSA(λ)-learning alg.
Algorithm: SARSA(λ)_learn
Inputs: S, A, γ (0<=γ<1), α (0<=α<1), λ (0<=λ<1)
Outputs: Q
e(s,a) = 0  // for all s, a
s = get_curr_world_st(); a = pick_nxt_act(Q, s)
Repeat {
  (r, s') = act_in_world(a)
  a' = pick_next_action(Q, s')
  δ = r + γ*Q(s',a') - Q(s,a)
  e(s,a) += 1
  foreach (s'', a'') pair in (S X A) {
    Q(s'',a'') = Q(s'',a'') + α*e(s'',a'')*δ
    e(s'',a'') *= λγ
  }
  a = a'; s = s';
} Until (bored)
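A Python sketch of SARSA(λ), under the same assumed environment interface as the earlier sketches. For efficiency it loops only over pairs with a nonzero trace rather than literally over all of S X A; that is a common implementation shortcut, not what the pseudocode says.

import random
from collections import defaultdict

def sarsa_lambda_learn(env, actions, gamma=0.9, alpha=0.1, lam=0.9,
                       epsilon=0.1, num_steps=10000):
    """Tabular SARSA(lambda) with accumulating eligibility traces."""
    Q = defaultdict(float)
    e = defaultdict(float)                 # e(s,a) = 0 for all s, a

    def pick_next_action(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    s = env.state                          # s = get_curr_world_st()
    a = pick_next_action(s)                # a = pick_nxt_act(Q, s)
    for _ in range(num_steps):
        r, s_next = env.step(a)            # (r, s') = act_in_world(a)
        a_next = pick_next_action(s_next)  # a' = pick_next_action(Q, s')
        delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
        e[(s, a)] += 1.0                   # e(s,a) += 1
        for key in list(e.keys()):         # every pair with a nonzero trace
            Q[key] += alpha * e[key] * delta
            e[key] *= lam * gamma          # decay by λγ
        s, a = s_next, a_next
    return Q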

