
1 RL at Last! Q-learning and buddies

2 Administrivia
R3 due today
Class discussion
Project proposals back (mostly)
Only if you gave me paper; e-copies yet to be read (I warned you)

3 Proposal analysis
Overall: excellent job! Congrats to all! In general: better than previous semesters.
You do not need to revise them, but do pay attention to my comments. I’m available for questions.
Overall scores (to date):
Writing & Clarity (W&C): 7/10 ± 1.3
Background & Context (B&C): 7.9/10 ± 1.1
Research Plan: 7.9/10 ± 0.6

4 Reminders
Agent acting in a Markov decision process (MDP): M = ⟨S, A, T, R⟩
E.g., robot in a maze, airplane, etc.
Fully observable, finite state and action spaces, finite history, bounded rewards
Last time: planning given known M
Policy evaluation: find the value V^π of a fixed policy π
Policy iteration: find the best policy, π*
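As a concrete illustration of the pieces of M, here is one way the tuple ⟨S, A, T, R⟩ might be held as NumPy arrays when states and actions are just indices; the sizes and array layout are my assumptions, not anything from the slides:

```python
import numpy as np

n_states, n_actions = 4, 2          # toy sizes, purely illustrative

# T[s, a, s'] = probability of landing in s' after doing a in s
T = np.full((n_states, n_actions, n_states), 1.0 / n_states)

# R[s, a] = expected immediate reward for doing a in s
R = np.zeros((n_states, n_actions))
R[3, :] = 1.0                       # pretend state 3 is a goal

gamma = 0.9                         # discount factor

# Sanity check: each T[s, a, :] must be a probability distribution
assert np.allclose(T.sum(axis=2), 1.0)
```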

5 Q: A key operative
Critical step in policy iteration: π'(s_i) = argmax_{a ∈ A} Σ_j T(s_i, a, s_j)*V(s_j)
Asks: “What happens if I ignore π for just one step, do a instead, and then resume doing π thereafter?”
This operation is used so often that it gets a special name. Definition: the Q function is
Q^π(s, a) = R(s, a) + γ * Σ_{s'} T(s, a, s')*V^π(s')
Policy iteration says: “Figure out Q, act greedily according to Q, then update Q and repeat, until you can’t do any better...”
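A minimal sketch of that one-step lookahead in Python, assuming the model is stored as arrays T[s, a, s'] and R[s, a] with a current value estimate V[s] (all names, shapes, and toy numbers are illustrative assumptions):

```python
import numpy as np

def q_from_v(T, R, V, gamma):
    """Q(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') * V(s')."""
    return R + gamma * np.einsum('ijk,k->ij', T, V)

def greedy_policy(Q):
    """pi'(s) = argmax_a Q(s,a): act greedily with respect to Q."""
    return Q.argmax(axis=1)

# toy model: 3 states, 2 actions, uniform transitions
T = np.full((3, 2, 3), 1/3)
R = np.zeros((3, 2)); R[2, :] = 1.0
V = np.array([0.0, 0.5, 1.0])       # some current value estimate

Q = q_from_v(T, R, V, gamma=0.9)
print(greedy_policy(Q))             # improved policy, one entry per state
```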

6 What to do with Q
Can think of Q as a big table: one entry for each state/action pair.
“If I’m in state s and take action a, this is my expected discounted reward...”
A “one-step” exploration: “In state s, if I deviate from my policy π for one timestep, then keep doing π, is my life better or worse?”
Can get V and π from Q: V(s) = max_a Q(s, a) and π(s) = argmax_a Q(s, a) (act greedily with respect to Q)
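A sketch of the “big table” view, again assuming integer state/action indices (the sizes and the single hand-set entry are made up for illustration):

```python
import numpy as np

n_states, n_actions = 6, 4
Q = np.zeros((n_states, n_actions))   # one entry per (state, action) pair

# "If I'm in state s and take action a, this is my expected discounted reward"
Q[2, 1] = 0.73

# Recovering V and pi from the table:
V  = Q.max(axis=1)      # V(s)  = max_a Q(s, a)
pi = Q.argmax(axis=1)   # pi(s) = argmax_a Q(s, a)
```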

7 Learning with Q
Q and the notion of policy evaluation give us a nice way to do actual learning:
Use the Q table to represent the policy
Update Q through experience: every time you see an (s, a, r, s') tuple, update Q
Each (s, a, r, s') example is a sample from T(s, a, s') and from R
With enough samples, you can get a good idea of how the world works, where the reward is, etc.
Note: we never actually learn T or R; Q encodes everything you need to know about the world
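A sketch of that per-sample update as a function, assuming Q is a NumPy array indexed by integer state and action; the default α and γ are the values used later on slide 9:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.65, gamma=0.9):
    """Fold one (s, a, r, s') sample into the Q table.

    Note that T and R never appear: the sample itself stands in for them.
    """
    target = r + gamma * Q[s_next].max()      # sampled one-step lookahead
    Q[s, a] += alpha * (target - Q[s, a])     # move Q(s, a) toward the target
    return Q
```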

8 The Q-learning algorithm
Algorithm: Q_learn
Inputs: state space S; action space A; discount γ (0 <= γ < 1); learning rate α (0 <= α < 1)
Outputs: Q
Repeat {
  s = get_current_world_state()
  a = pick_next_action(Q, s)
  (r, s') = act_in_world(a)
  Q(s, a) = Q(s, a) + α*(r + γ*max_a'(Q(s', a')) - Q(s, a))
} Until (bored)
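A runnable Python sketch of the loop above. The environment interface (reset/step), the ε-greedy choice standing in for pick_next_action, and the episode handling are all my assumptions; the slide's pseudocode leaves them abstract:

```python
import numpy as np

def q_learn(env, n_states, n_actions, gamma=0.9, alpha=0.65,
            epsilon=0.1, n_steps=100_000, seed=0):
    """Tabular Q-learning; `env` must provide reset() -> s and step(a) -> (r, s', done)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    s = env.reset()
    for _ in range(n_steps):                      # "Repeat ... Until (bored)"
        # pick_next_action: explore with probability epsilon, else act greedily
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))
        else:
            a = int(Q[s].argmax())
        r, s_next, done = env.step(a)             # act_in_world
        # Q(s,a) = Q(s,a) + alpha*(r + gamma*max_a' Q(s',a') - Q(s,a))
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = env.reset() if done else s_next
    return Q
```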

9 Q-learning in action
15×15 maze world; R(goal) = 1, R(other) = 0; γ = 0.9, α = 0.65

10 Q-learning in action: Initial policy

11 Q-learning in action: After 20 episodes

12 Q-learning in action: After 30 episodes

13 Q-learning in action: After 100 episodes

14 Q-learning in action: After 150 episodes

15 Q-learning in action: After 200 episodes

16 Q-learning in action: After 250 episodes

17 Q-learning in action: After 300 episodes

18 Q-learning in action: After 350 episodes

19 Q-learning in action: After 400 episodes

20 Well, it looks good anyway But are we sure it’s actually learning? How to measure whether it’s actually getting any better at the task? (Finding the goal state)

21 Well, it looks good anyway
But are we sure it’s actually learning?
How to measure whether it’s actually getting any better at the task? (Finding the goal state)
Every 10 episodes, “freeze” the policy (turn off learning)
Measure avg time to goal from a number of starting states
Average over a number of test episodes to iron out noise
Plot the learning curve: # episodes of learning vs. avg performance
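A sketch of that measurement protocol, assuming the same kind of env interface as in the earlier sketch and a fixed list of start states (all names here are hypothetical):

```python
import numpy as np

def steps_to_goal(env, Q, start, max_steps=500):
    """Run the frozen greedy policy from `start`; return steps taken to reach the goal."""
    s = env.reset(start)
    for t in range(1, max_steps + 1):
        _, s, done = env.step(int(Q[s].argmax()))   # no exploration, no learning
        if done:
            return t
    return max_steps                                 # never reached the goal

def eval_policy(env, Q, starts):
    """Average time-to-goal over several start states to iron out noise."""
    return float(np.mean([steps_to_goal(env, Q, s0) for s0 in starts]))

# Learning curve: every 10 training episodes, freeze Q and record
#   curve.append((episode, eval_policy(env, Q, starts)))
```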

22 Learning performance

23 Notes on learning perf.
After 400 learning episodes, it still hasn’t asymptoted
Note: that’s ~700,000 steps of experience!!!
Q-learning is really, really slow!!!
The same holds for many RL methods (sadly)
Fixing this is a good research topic... ;-)

24 Why does this work?
Multiple ways to think of it. The (more nearly) intuitive one:
Look at the key update step in the Q-learning algorithm:
Q(s, a) = Q(s, a) + α*(r + γ*max_a'(Q(s', a')) - Q(s, a))
I.e., a weighted average between the current Q(s, a) and the sampled r + γ*max_a'(Q(s', a'))
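Writing that update out and regrouping its terms (the same equation as slide 8, nothing new) makes the weighted average explicit:

```latex
Q(s,a) \;\leftarrow\; Q(s,a) + \alpha\bigl(r + \gamma \max_{a'} Q(s',a') - Q(s,a)\bigr)
       \;=\; (1-\alpha)\,Q(s,a) + \alpha\bigl(r + \gamma \max_{a'} Q(s',a')\bigr)
```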

25 Why does this work? Still... Why should that weighted avg be the right thing? Compare update eqn w/ Bellman eqn:

26 Why does this work? Still... Why should that weighted avg be the right thing? Compare w/ Bellman eqn:
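A hedged reconstruction of the comparison the slide points at: the Q-learning update next to the Bellman optimality equation for Q, written in the T, R notation of slide 4 (the exact form shown in class may differ):

```latex
\underbrace{Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\bigl(r + \gamma \max_{a'} Q(s',a')\bigr)}_{\text{Q-learning update}}
\qquad
\underbrace{Q^{*}(s,a) = R(s,a) + \gamma \sum_{s'} T(s,a,s')\,\max_{a'} Q^{*}(s',a')}_{\text{Bellman equation}}
```

Intuitively, the update's target r + γ*max_a' Q(s', a') replaces the expectation over T and R in the Bellman equation with a single sampled (r, s'), so repeated small-α updates average toward the Bellman fixed point.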

