
1 Reinforcement Learning (for real this time)

2 Administrivia
No noose is good noose (Crazy weather?)

3 Historical perspective
Last time:
- P2 handout
- Discussion of P2
This time:
- Questions/answers -- generics, etc.
- Q2 (for real this time!)
- Intro to RL

4 Meet Mack the Mouse*
Mack lives a hard life as a psychology test subject:
- Has to run around mazes all day, finding food and avoiding electric shocks
- Needs to know how to find cheese quickly, while getting shocked as little as possible
Q: How can Mack learn to find his way around?
* Mickey is still copyrighted

5 Start with an easy case
A very simple maze:
- Whenever Mack goes left, he gets cheese
- Whenever he goes right, he gets shocked
- After the reward/punishment, he's reset back to the start of the maze
Q: How can Mack learn to act well in this world?

6 Learning in the easy case
Say there are two labels: "cheese" and "shock"
Mack tries a bunch of trials in the world -- that generates a bunch of experiences: (left, cheese), (right, shock), (left, cheese), ...
Now what? (See the sketch below.)
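To make this concrete, here is a minimal sketch of the easy case (my own illustration; the function names and the 20-trial count are assumptions, not from the slides): run trials in the one-step maze and tally which outcome follows each action.

```python
from collections import Counter, defaultdict
import random

# Deterministic one-step maze from slide 5: left -> cheese, right -> shock.
# (Hypothetical encoding; the slides don't give code.)
def maze(action):
    return "cheese" if action == "left" else "shock"

# Run trials and tally which outcome follows each action.
experience = defaultdict(Counter)
for _ in range(20):
    action = random.choice(["left", "right"])
    experience[action][maze(action)] += 1

# Mack's learned mapping: the most common outcome for each action.
learned = {a: counts.most_common(1)[0][0] for a, counts in experience.items()}
print(learned)  # {'left': 'cheese', 'right': 'shock'}
```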

7 But what to do?
So we know that Mack can learn a mapping from actions to outcomes
But what should Mack do in any given situation? What action should he take at any given time?
Suppose Mack is the subject of a psychotropic drug study and has actually come to like shocks and hate cheese -- how does he act now?

8 Reward functions
In general, we think of a reward function: R() tells us whether Mack thinks a particular outcome is good or bad
Mack before drugs: R(cheese) = +1, R(shock) = -1
Mack after drugs: R(cheese) = -1, R(shock) = +1
Behavior always depends on rewards (utilities)

9 Maximizing reward
So Mack wants to get the maximum possible reward (whatever that means to him)
For a one-shot case like this, this is fairly easy: take the action whose outcome has the highest reward (sketched below)
Now what about a harder case?
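A minimal sketch of the one-shot decision (again an assumption-laden illustration, reusing the hypothetical `learned` mapping from the previous sketch): encode R() as a table and pick the action whose known outcome has the highest reward. Swapping in the post-drug reward table flips the behavior, as slide 8 suggests.

```python
# Reward tables for the two versions of Mack (values from slide 8).
R_before = {"cheese": +1, "shock": -1}
R_after  = {"cheese": -1, "shock": +1}

# One-shot decision: take the action whose (learned) outcome has the highest reward.
def best_action(learned, R):
    return max(learned, key=lambda a: R[learned[a]])

learned = {"left": "cheese", "right": "shock"}  # from the earlier trials
print(best_action(learned, R_before))  # left
print(best_action(learned, R_after))   # right
```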

10 Reward over time
In general:
- The agent can be in a state s_i at any time t
- It can choose an action a_j to take in that state
- A reward can be associated with a state, R(s_i), or with a state/action transition, R(s_i, a_j)
- A series of actions leads to a series of rewards: (s_1, a_1) → s_3: R(s_3); (s_3, a_7) → s_14: R(s_14); ...
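Here is a small sketch of this notation in code (the particular states, reward values, and transition table are hypothetical; the slides only give the notation): follow a transition sequence and collect the reward at each state reached.

```python
# Hypothetical state rewards; the slides give only the notation R(s_i).
R = {"s1": 0, "s3": 2, "s14": -1}

# Transition function mapping (state, action) -> next state,
# matching the slide's example: (s1, a1) -> s3, (s3, a7) -> s14.
T = {("s1", "a1"): "s3", ("s3", "a7"): "s14"}

state, rewards = "s1", []
for action in ["a1", "a7"]:
    state = T[(state, action)]
    rewards.append(R[state])   # collect R(s3), then R(s14)

print(rewards)  # [2, -1]
```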

11 Reward over time
[Figure: a tree of states s_1 through s_11, branching out from s_1 via successive actions]

12 Reward over time
[Figure: the same state tree, with the path s_1 → s_4 → s_11 → s_10 highlighted]
V(s_1) = R(s_1) + R(s_4) + R(s_11) + R(s_10) + ...

13 Reward over time
[Figure: the same state tree, with the alternative path s_1 → s_2 → s_6 highlighted]
V(s_1) = R(s_1) + R(s_2) + R(s_6) + ...
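The point of slides 12 and 13 is that different action choices trace different paths from the same start state, and so give different total rewards. A minimal sketch with hypothetical reward values (the slides give none):

```python
# Hypothetical rewards for the states on the two example paths.
R = {"s1": 0, "s2": 1, "s4": -1, "s6": 3, "s10": 5, "s11": 2}

# V(s_1) along a path is just the sum of rewards of the states visited.
def path_value(path, R):
    return sum(R[s] for s in path)

print(path_value(["s1", "s4", "s11", "s10"], R))  # slide 12's path: 6
print(path_value(["s1", "s2", "s6"], R))          # slide 13's path: 4
```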

