Reinforcement Learning: Learning to get what you want... Sutton & Barto, Reinforcement Learning: An Introduction, MIT Press 1998.

Reinforcement Learning: Learning to get what you want...
Refs: Sutton & Barto, Reinforcement Learning: An Introduction, MIT Press, 1998. Kaelbling, Littman, & Moore, "Reinforcement Learning: A Survey," Journal of Artificial Intelligence Research, Volume 4, 1996.

Meet Mack the Mouse*
Mack lives a hard life as a psychology test subject: he has to run around mazes all day, finding food and avoiding electric shocks. He needs to know how to find cheese quickly while getting shocked as little as possible. Q: How can Mack learn to find his way around? (* Mickey is still copyrighted.)

Start with an easy case
A very simple maze: whenever Mack goes left, he gets cheese; whenever he goes right, he gets shocked. After the reward/punishment, he's reset back to the start of the maze. Q: how can Mack learn to act well in this world?

Learning in the easy case
Say there are two outcome labels: "cheese" and "shock". Mack tries a bunch of trials in the world -- that generates a bunch of experiences, each pairing an action with the outcome it produced. Now what?
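
A minimal sketch of recording and tallying those trials (the data and names here are hypothetical, just to make the idea concrete):

```python
from collections import Counter, defaultdict

# Hypothetical log of Mack's trials in the one-step maze:
# each trial is (action taken, outcome label observed).
trials = [
    ("left", "cheese"), ("right", "shock"), ("left", "cheese"),
    ("right", "shock"), ("left", "cheese"),
]

# Tally how often each action led to each outcome.
outcome_counts = defaultdict(Counter)
for action, outcome in trials:
    outcome_counts[action][outcome] += 1

print(dict(outcome_counts))
# {'left': Counter({'cheese': 3}), 'right': Counter({'shock': 2})}
```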

But what to do?
So we know that Mack can learn a mapping from actions to outcomes. But what should Mack do in any given situation? What action should he take at any given time? Suppose Mack is the subject of a psychotropic drug study and has actually come to like shocks and hate cheese -- how does he act now?

Reward functions
In general, we think of a reward function: R() tells us whether Mack thinks a particular outcome is good or bad. Mack before drugs: R(cheese) = +1, R(shock) = -1. Mack after drugs: R(cheese) = -1, R(shock) = +1. Behavior always depends on rewards (utilities).
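
This is small enough to sketch directly as a lookup table (only the +1/-1 values come from the slide; everything else is illustrative):

```python
# Mack's reward function before the drug study: cheese is good, shock is bad.
R_before = {"cheese": +1, "shock": -1}

# After the drug study the preferences flip. Note that only R() changes;
# what Mack has learned about which action leads to which outcome stays the same.
R_after = {"cheese": -1, "shock": +1}

def reward(outcome, R):
    """Scalar reward Mack assigns to an outcome under reward table R."""
    return R[outcome]

print(reward("cheese", R_before), reward("cheese", R_after))  # 1 -1
```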

Maximizing reward
So Mack wants to get the maximum possible reward (whatever that means to him). For a one-shot case like this, this is fairly easy. Now what about a harder case?
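
In the one-shot maze, "fairly easy" means: score each action by the reward of the outcome it leads to and take the best one. A sketch, reusing the hypothetical mapping and rewards from above:

```python
# Learned action -> outcome mapping from Mack's trials (hypothetical).
outcome_of = {"left": "cheese", "right": "shock"}

R = {"cheese": +1, "shock": -1}  # Mack's current reward function

# One-shot decision: pick the action whose outcome Mack values most.
best_action = max(outcome_of, key=lambda a: R[outcome_of[a]])
print(best_action)  # 'left' under this R; it would be 'right' if the rewards were flipped
```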

Reward over time
In general: the agent can be in a state $s_i$ at any time $t$, and can choose an action $a_j$ to take in that state. A reward is associated with a state, $R(s_i)$, or with a state/action transition, $R(s_i, a_j)$. A series of actions leads to a series of rewards: $(s_1, a_1) \to s_3: R(s_3)$; $(s_3, a_7) \to s_{14}: R(s_{14})$; ...

Reward over time
[Figure: a branching diagram of possible trajectories through states $s_1, s_2, \ldots, s_{11}$.]

Reward over time
[Figure: the same trajectory diagram, with the path $s_1 \to s_4 \to s_{11} \to s_{10} \to \cdots$ highlighted.]
$V(s_1) = R(s_1) + R(s_4) + R(s_{11}) + R(s_{10}) + \cdots$

Reward over time
[Figure: the same trajectory diagram, with the path $s_1 \to s_2 \to s_6 \to \cdots$ highlighted.]
$V(s_1) = R(s_1) + R(s_2) + R(s_6) + \cdots$
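
The value of any such trajectory is just the running sum of the state rewards along it; a sketch with a made-up reward table and the two paths from the figures:

```python
# Hypothetical per-state rewards R(s) for the states in the figure.
R = {"s1": 0, "s2": 0, "s4": 1, "s6": -1, "s10": 5, "s11": 2}

def trajectory_value(states, R):
    """Sum of rewards collected along a finite trajectory of visited states."""
    return sum(R[s] for s in states)

print(trajectory_value(["s1", "s4", "s11", "s10"], R))  # 0 + 1 + 2 + 5 = 8
print(trajectory_value(["s1", "s2", "s6"], R))          # 0 + 0 - 1 = -1
```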

Where can you go?
Definition: the complete set of all states the agent could be in is called the state space, $S$. It could be discrete or continuous; we'll usually work with discrete. Size of the state space: $|S|$. $S = \{s_1, s_2, \ldots, s_{|S|}\}$

What can you do?
Definition: the complete set of actions an agent could take is called the action space, $A$. Again, discrete or continuous; again, we work with discrete. Again, size: $|A|$. $A = \{a_1, \ldots, a_{|A|}\}$
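
With discrete spaces, $S$ and $A$ can be represented directly as finite collections; a small sketch with hypothetical contents:

```python
# A discrete state space S and action space A (contents are illustrative).
S = [f"s{i}" for i in range(1, 12)]   # S = {s_1, ..., s_|S|}, here |S| = 11
A = ["left", "right", "up", "down"]   # A = {a_1, ..., a_|A|}, here |A| = 4

print(len(S), len(A))  # |S| = 11, |A| = 4
```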

Experience & histories
In supervised learning, the "fundamental unit of experience" is a feature vector + label. The fundamental unit of experience in RL: at time $t$, in some state $s_i$, take action $a_j$, get reward $r_t$, and end up in state $s_k$. This is called an experience tuple or SARSA tuple.
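
The experience tuple maps naturally onto a small record type; the field names below are one reasonable choice, not fixed by the slides:

```python
from typing import NamedTuple

class Experience(NamedTuple):
    """One unit of RL experience: in state s, took action a, got reward r, landed in s_next."""
    s: str
    a: str
    r: float
    s_next: str

e = Experience(s="s1", a="left", r=0.0, s_next="s4")
print(e)  # Experience(s='s1', a='left', r=0.0, s_next='s4')
```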

The value of history...
The set of all experience during a single episode up to time $t$ is a history, $h_t$. A.k.a. a trace or trajectory.
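
Concretely, a history is just the ordered list of experience tuples collected during one episode; a tiny hypothetical example:

```python
# A (tiny, made-up) history for one episode: the agent's trace/trajectory,
# stored as an ordered list of (s, a, r, s_next) experience tuples.
history = [
    ("s1",  "right", 0.0, "s4"),
    ("s4",  "up",    1.0, "s11"),
    ("s11", "up",    2.0, "s10"),
]

rewards_seen = [r for (_, _, r, _) in history]
print(rewards_seen)  # [0.0, 1.0, 2.0]
```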

Policies
Total accumulated reward (value, $V$) depends on where the agent starts (the initial $s$) and what the agent does at each step (duh), the $a$'s. A plan of action is called a policy, $\pi$. A policy defines what action to take in every state of the system: $\pi: S \to A$. A.k.a. a controller, control law, decision rule, etc.
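
For a discrete state space, a deterministic policy is nothing more than a table from states to actions; a sketch (entries hypothetical):

```python
# A deterministic policy pi: one chosen action per state.
policy = {
    "s1":  "right",
    "s2":  "left",
    "s4":  "up",
    "s11": "up",
}

def act(state, policy):
    """Return the action the policy prescribes in the given state."""
    return policy[state]

print(act("s1", policy))  # 'right'
```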

Policies
Value is a function of the start state and the policy: $V^{\pi}(s)$. It is useful to think about finite-horizon and infinite-horizon values, defined next.

Finite horizon reward
Assuming that an episode is finite: the agent acts in the world for a finite number of time steps, $T$, and experiences the history $h_T$. What should the total aggregate value be?

Finite horizon reward
Assuming that an episode is finite: the agent acts in the world for a finite number of time steps, $T$, and experiences the history $h_T$. What should the total aggregate value be? Total accumulated reward: $V^{\pi}(s_0) = \sum_{t=0}^{T} R(s_t)$. Occasionally it is useful to use the average reward instead: $\frac{1}{T} \sum_{t=0}^{T} R(s_t)$.
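
Both finite-horizon criteria are one-liners over the rewards collected along an episode; a sketch with made-up numbers:

```python
# Rewards collected along one finite episode (hypothetical values).
rewards = [0.0, 1.0, 2.0, 5.0]
T = len(rewards)

total_reward = sum(rewards)        # total accumulated reward over the horizon
average_reward = total_reward / T  # average reward per step

print(total_reward, average_reward)  # 8.0 2.0
```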

Gonna live forever...
Often, we want to model a process that is indefinite: infinitely long, or of unknown length (we don't know in advance when it will end), or one that runs 'til it's stopped (randomly). So we have to consider infinitely long histories. Q: what does value mean over an infinite history?

Reaaally long-term reward
Let $h_\infty$ be an infinite history. We define the infinite-horizon discounted value to be $V^{\pi}(s_0) = \sum_{t=0}^{\infty} \gamma^t R(s_t)$, where $0 \le \gamma < 1$ is the discount factor. Q1: Why does this work? Q2: if $R_{\max}$ is the max possible reward attainable in the environment, what is $V_{\max}$?
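
A direct sketch of the definition (values illustrative): with $\gamma < 1$ the terms shrink geometrically, so the infinite sum is finite and a long truncation approximates it well.

```python
def discounted_value(rewards, gamma):
    """Sum of gamma**t * r_t over a (possibly truncated) reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

gamma = 0.9
# An "infinite" history approximated by a long truncation: constant reward of 1 per step.
long_history = [1.0] * 1000
print(discounted_value(long_history, gamma))  # ~10.0, i.e. close to 1 / (1 - gamma)
```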