Reinforcement Learning, Cont’d Useful refs: Sutton & Barto, Reinforcement Learning: An Introduction, MIT Press 1998.

Reinforcement Learning, Cont’d
Useful refs:
Sutton & Barto, Reinforcement Learning: An Introduction, MIT Press, 1998. http://www.cs.ualberta.ca/~sutton/book/the-book.html
Kaelbling, Littman, & Moore, "Reinforcement Learning: A Survey," Journal of Artificial Intelligence Research, Volume 4, 1996. http://people.csail.mit.edu/u/l/lpk/public_html/papers/rl-survey.ps

Administrivia
Mid-class survey results (momentarily)
Reading 2 due today
New assignments:
  Final project proposal: due Nov 5 (Fri), 5:00 PM, to me or in my mailbox; paper preferred
  Reading 3: due Nov 9. Bentivegna, D. C. and Atkeson, C. G., "Learning How to Behave from Observing Others," SAB'02 Workshop on Motor Control in Humans and Robots, Edinburgh, UK, August 2002.

Civics
Reminder: Nov 2 is the US election
Vote! (If you’re a citizen & registered, etc.)
  Do your research first
  Think about what you want
  Vote responsibly
In practice:
  I will be here & lecture that day
  No assignments, quizzes, etc. that day
  Notes will be on the web shortly after class

Survey Results: Lectures
[Chart: ratings for Pacing, Content, Math, Intuition, Slides, and Access., each on a scale from "Too little" to "Too much"]

Survey Results: Exercises
[Chart: ratings for Useful? (binary), Quantity, Length, and Graded? (binary), on a scale from "Too little" to "Too much"]

Back to RL
Mack lives a hard life as a psychology test subject
Has to run around mazes all day, finding food and avoiding electric shocks
Needs to know how to find cheese quickly, while getting shocked as little as possible
Q: How can Mack learn to find his way around?

Start with an easy case
Very simple maze:
  Whenever Mack goes left, he gets cheese
  Whenever he goes right, he gets shocked
  After reward/punishment, he’s reset back to the start of the maze
Q: How can Mack learn to act well in this world?

Reward functions
In general, we think of a reward function: R() tells us whether Mack thinks a particular outcome is good or bad
Mack before drugs: R(cheese) = +1, R(shock) = -1
Mack after drugs: R(cheese) = -1, R(shock) = +1
Behavior always depends on rewards (utilities)
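As a minimal sketch, the two reward functions above can be written as lookup tables from outcome to reward. The names `R_BEFORE_DRUGS`, `R_AFTER_DRUGS`, and `preferred_outcome` are illustrative and do not come from the lecture.

```python
# Reward functions from the slide, as outcome -> reward tables.
R_BEFORE_DRUGS = {"cheese": +1, "shock": -1}
R_AFTER_DRUGS = {"cheese": -1, "shock": +1}

def preferred_outcome(reward):
    """Return the outcome with the highest reward under `reward`."""
    return max(reward, key=reward.get)

print(preferred_outcome(R_BEFORE_DRUGS))  # cheese
print(preferred_outcome(R_AFTER_DRUGS))   # shock
```

The same outcomes lead to opposite preferences once the reward table flips, which is the sense in which behavior depends on rewards.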

Maximizing reward
So Mack wants to get the maximum possible reward (whatever that means to him)
For a one-shot case like this one, this is fairly easy
Now what about a harder case?
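One way to see why the one-shot case is easy: Mack can try each action once, remember the reward, and then always pick the best one. The sketch below assumes the simple left/right maze from the earlier slide; `environment` and `learn_one_shot` are hypothetical names, and the rewards are the "before drugs" values.

```python
def environment(action):
    """Hypothetical simple maze: left -> cheese (+1), right -> shock (-1)."""
    return +1 if action == "left" else -1

def learn_one_shot(actions, env):
    """Try every action once, then return the action with the best observed reward."""
    observed = {a: env(a) for a in actions}
    return max(observed, key=observed.get), observed

best, observed = learn_one_shot(["left", "right"], environment)
print(best, observed)  # left {'left': 1, 'right': -1}
```

This works only because each action has a single immediate, deterministic outcome; the harder cases that follow break that assumption.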

Reward over time
In general: the agent can be in a state $s_i$ at any time $t$
Can choose an action $a_j$ to take in that state
Reward is associated with a state, $R(s_i)$, or with a state/action transition, $R(s_i, a_j)$
A series of actions leads to a series of rewards:
  $(s_1, a_1) \to s_3$: $R(s_3)$; $(s_3, a_7) \to s_{14}$: $R(s_{14})$; ...
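The series of rewards above can be sketched as a walk through a transition table. The transitions mirror the slide's example; the numeric rewards in `R` are made up for illustration, since the lecture gives none, and the sketch assumes deterministic transitions.

```python
# Deterministic transitions matching the slide's example:
# (s1, a1) -> s3 and (s3, a7) -> s14.
NEXT_STATE = {("s1", "a1"): "s3", ("s3", "a7"): "s14"}
# Hypothetical state rewards R(s); these numbers are assumptions.
R = {"s1": 0.0, "s3": 1.0, "s14": -1.0}

def reward_sequence(start, actions):
    """Follow `actions` from `start` and collect R(s') for each state reached."""
    state, rewards = start, []
    for action in actions:
        state = NEXT_STATE[(state, action)]
        rewards.append(R[state])
    return rewards

print(reward_sequence("s1", ["a1", "a7"]))  # [1.0, -1.0]
```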

Reward over time
[Figure: diagram of states $s_1$ through $s_{11}$ and the transitions between them]

Reward over time
[Figure: the same state diagram, with one path from $s_1$ highlighted]
$V(s_1) = R(s_1) + R(s_4) + R(s_{11}) + R(s_{10}) + \dots$

Reward over time
[Figure: the same state diagram, with a different path from $s_1$ highlighted]
$V(s_1) = R(s_1) + R(s_2) + R(s_6) + \dots$
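The two sums for V(s1) differ only in which path is followed from s1. A minimal sketch, with made-up reward values (the slides give none) and with each path truncated where the slide's "..." begins:

```python
# Hypothetical rewards for the states on the two paths; the numbers are assumptions.
R = {"s1": 0, "s2": -1, "s4": 1, "s6": -2, "s10": 5, "s11": 2}

def path_value(path, reward):
    """Sum R(s) over the states visited, as in V(s1) = R(s1) + R(s4) + ..."""
    return sum(reward[s] for s in path)

print(path_value(["s1", "s4", "s11", "s10"], R))  # 8
print(path_value(["s1", "s2", "s6"], R))          # -3
```

The start state is the same in both calls, so the difference in value comes entirely from the choices made along the way.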

Where can you go?
Definition: The complete set of all states the agent could be in is called the state space, S
  Could be discrete or continuous
  We’ll usually work with discrete
  Size of state space: |S|
Definition: The complete set of actions an agent could take is called the action space, A
  Again, discrete or continuous
  Again, we work with discrete
  Again, size: |A|

Policies
Total accumulated reward (value, V) depends on:
  Where the agent starts
  What the agent does at each step (duh)
A plan of action is called a policy, π
A policy defines what action to take in every state of the system: $\pi : S \to A$
Value is a function of start state and policy: $V^{\pi}(s)$
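A minimal sketch of a policy as a state-to-action table, and of value as a function of both the start state and the policy. The three-state world, its rewards, and the finite `horizon` cutoff are all assumptions made for illustration; the lecture does not specify them.

```python
# Hypothetical 3-state world: deterministic transitions and state rewards.
NEXT_STATE = {
    ("start", "left"): "cheese", ("start", "right"): "shock",
    ("cheese", "left"): "start", ("cheese", "right"): "start",
    ("shock", "left"): "start", ("shock", "right"): "start",
}
R = {"start": 0, "cheese": 1, "shock": -1}

def value(start, policy, horizon):
    """Sum R(s) over `horizon` visited states, starting in `start` and following `policy`."""
    state, total = start, 0
    for _ in range(horizon):
        total += R[state]
        state = NEXT_STATE[(state, policy[state])]
    return total

# A policy gives one action for every state.
go_left = {"start": "left", "cheese": "left", "shock": "left"}
go_right = {"start": "right", "cheese": "left", "shock": "left"}

print(value("start", go_left, 3))   # 1
print(value("start", go_right, 3))  # -1
print(value("cheese", go_left, 3))  # 2
```

Changing the policy changes the value from the same start state, and changing the start state changes the value under the same policy.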

Experience & histories
In supervised learning, the "fundamental unit of experience" is a feature vector plus a label
Fundamental unit of experience in RL:
  At time $t$ in some state $s_i$, take action $a_j$, get reward $r_t$, end up in state $s_k$
  Called an experience tuple or SARSA tuple: $\langle s_i, a_j, r_t, s_k \rangle$
The set of all experience during a single episode up to time $t$ is called a history
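As a sketch, an experience tuple can be stored as a small record, and a history as the list of such records gathered during one episode. The names `Experience`, `record_episode`, and `maze_step` are hypothetical, and the step function assumes the simple left/right maze with a reset to the start after each reward.

```python
from typing import NamedTuple

class Experience(NamedTuple):
    """One experience tuple: state, action, reward, next state."""
    state: str
    action: str
    reward: float
    next_state: str

def record_episode(start, actions, step):
    """Run `actions` from `start`; return the history as a list of experience tuples."""
    history, state = [], start
    for action in actions:
        reward, next_state = step(state, action)
        history.append(Experience(state, action, reward, next_state))
        state = next_state
    return history

def maze_step(state, action):
    """Hypothetical simple maze: left -> cheese (+1), right -> shock (-1), then reset."""
    return (+1.0, "start") if action == "left" else (-1.0, "start")

history = record_episode("start", ["left", "right"], maze_step)
for experience in history:
    print(experience)
```

Unlike a supervised example, each record depends on the action the agent chose, so the data the learner sees is shaped by its own behavior.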
