CS 188: Artificial Intelligence Fall 2008 Lecture 11: Reinforcement Learning 10/2/2008 Dan Klein – UC Berkeley Many slides over the course adapted from either Stuart Russell or Andrew Moore 1

Announcements  P3: MDPs and Reinforcement Learning is up!  Homework session: Today 6-8pm in 10 Evans 2

Reinforcement Learning  Reinforcement learning:  Still have an MDP:  A set of states s ∈ S  A set of actions (per state) a ∈ A  A model T(s,a,s’)  A reward function R(s,a,s’)  Still looking for a policy π(s)  New twist: don’t know T or R  I.e. don’t know which states are good or what the actions do  Must actually try actions and states out to learn [DEMO] 3

Example: Animal Learning  RL studied experimentally for more than 60 years in psychology  Rewards: food, pain, hunger, drugs, etc.  Mechanisms and sophistication debated  Example: foraging  Bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies  Bees have a direct neural connection from nectar intake measurement to motor planning area 4

Example: Backgammon  Reward only for win / loss in terminal states, zero otherwise  TD-Gammon learns a function approximation to V(s) using a neural network  Combined with depth 3 search, one of the top 3 players in the world  You could imagine training Pacman this way…  … but it’s tricky! 5

Passive Learning  Simplified task  You don’t know the transitions T(s,a,s’)  You don’t know the rewards R(s,a,s’)  You are given a policy  (s)  Goal: learn the state values (and maybe the model)  I.e., policy evaluation  In this case:  Learner “along for the ride”  No choice about what actions to take  Just execute the policy and learn from experience  We’ll get to the active case soon  This is NOT offline planning! 6

Example: Direct Estimation  Episodes: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100 (done); and (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100 (done)  With γ = 1 and a living reward of -1, estimate each state’s value as the average of the returns observed after visiting it: V(1,1) ≈ -7, and V(3,3) ≈ (99 + 97 + -102) / 3 ≈ 31.3 from its three visits [DEMO – Optimal Policy]

Model-Based Learning  Idea:  Learn the model empirically (rather than values)  Solve the MDP as if the learned model were correct  Empirical model learning  Simplest case:  Count outcomes for each s,a  Normalize to give estimate of T(s,a,s’)  Discover R(s,a,s’) the first time we experience (s,a,s’)  More complex learners are possible (e.g. if we know that all squares have related action outcomes, e.g. “stationary noise”) 8
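
As a concrete sketch of the count-and-normalize idea above (not code from the lecture; the episode format of (s, a, r, s_next) tuples and the dictionary representation are assumptions made for illustration):

from collections import defaultdict

def estimate_model(episodes):
    """Estimate T(s,a,s') and R(s,a,s') by counting observed outcomes.

    episodes: list of episodes, each a list of (s, a, r, s_next) tuples.
    Returns (T, R): T[(s, a)] maps s_next -> empirical probability,
    R[(s, a, s_next)] is the reward seen the first time we experienced that transition.
    """
    counts = defaultdict(lambda: defaultdict(int))
    R = {}
    for episode in episodes:
        for (s, a, r, s_next) in episode:
            counts[(s, a)][s_next] += 1
            R.setdefault((s, a, s_next), r)    # discover R the first time we see (s,a,s')
    T = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        T[(s, a)] = {s_next: n / total for s_next, n in outcomes.items()}
    return T, R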

Example: Model-Based Learning  Episodes: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100 (done); and (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100 (done)  Counting the outcomes of taking right in (3,3) across these episodes (γ = 1): T((3,3), right, (4,3)) = 1 / 3, T((3,3), right, (3,2)) = 2 / 3 9

Recap: Model-Based Policy Evaluation  Simplified Bellman updates to calculate V for a fixed policy π: Vπ_{i+1}(s) ← Σ_{s’} T(s, π(s), s’) [ R(s, π(s), s’) + γ Vπ_i(s’) ]  New V is expected one-step look-ahead using current V  Unfortunately, need T and R 10

Sample Avg to Replace Expectation?  Who needs T and R? Approximate the expectation with samples (drawn from T!)  Each time we are in s and follow π(s), the observed successor s’_k gives one sample of the target: sample_k = R(s, π(s), s’_k) + γ Vπ_i(s’_k)  Average the samples: Vπ_{i+1}(s) ← (1/k) Σ_k sample_k 11

Model-Free Learning  Big idea: why bother learning T?  Update V each time we experience a transition  Frequent outcomes will contribute more updates (over time)  Temporal difference learning (TD)  Policy still fixed!  Move values toward the value of whatever successor actually occurs, as a running average: sample = R(s, π(s), s’) + γ Vπ(s’), then Vπ(s) ← (1 - α) Vπ(s) + α · sample 12
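
A minimal sketch of this TD(0) update for a fixed policy (illustrative only, not from the lecture; it assumes experience arrives as (s, r, s_next) triples generated by following π, with terminal successors worth 0):

from collections import defaultdict

def td_policy_evaluation(episodes, alpha=0.5, gamma=1.0):
    """Estimate V^pi from experience using the TD(0) running-average update."""
    V = defaultdict(float)                    # all values start at 0
    for episode in episodes:
        for (s, r, s_next) in episode:        # transition observed while following pi
            sample = r + gamma * V[s_next]    # one sample of the return from s
            V[s] = (1 - alpha) * V[s] + alpha * sample
    return V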

Example: TD Policy Evaluation Take γ = 1, α = 0.5 (1,1) up -1 (1,2) up -1 (1,3) right -1 (2,3) right -1 (3,3) right -1 (3,2) up -1 (3,3) right -1 (4,3) exit +100 (done) (1,1) up -1 (1,2) up -1 (1,3) right -1 (2,3) right -1 (3,3) right -1 (3,2) up -1 (4,2) exit -100 (done) 13
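
To see one step of the update worked out (assuming all values start at 0, which the transcript does not state): after the first observed transition (1,1) up -1 to (1,2), the sample is -1 + 1 · V(1,2) = -1, so V(1,1) ← (1 - 0.5) · 0 + 0.5 · (-1) = -0.5.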

Problems with TD Value Learning  TD value learning is model-free for policy evaluation  However, if we want to turn our value estimates into a policy, we’re sunk: picking the best action requires a one-step look-ahead, π(s) = argmax_a Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ V(s’) ], which needs T and R again  Idea: learn Q-values directly  Makes action selection model-free too! 14

Active Learning  Full reinforcement learning  You don’t know the transitions T(s,a,s’)  You don’t know the rewards R(s,a,s’)  You can choose any actions you like  Goal: learn the optimal policy (maybe values)  In this case:  Learner makes choices!  Fundamental tradeoff: exploration vs. exploitation  This is NOT offline planning! 15

Model-Based Learning  In general, want to learn the optimal policy, not evaluate a fixed policy  Idea: adaptive dynamic programming  Learn an initial model of the environment:  Solve for the optimal policy for this model (value or policy iteration)  Refine model through experience and repeat  Crucial: we have to make sure we actually learn about all of the model 16

Example: Greedy ADP  Imagine we find the lower path to the good exit first  Some states will never be visited following this policy from (1,1)  We’ll keep re-using this policy, because following it never gathers experience in the regions of the model we would need to learn the optimal policy 17

What Went Wrong?  Problem with following optimal policy for current model:  Never learn about better regions of the space if current policy neglects them  Fundamental tradeoff: exploration vs. exploitation  Exploration: must take actions with suboptimal estimates to discover new rewards and increase eventual utility  Exploitation: once the true optimal policy is learned, exploration reduces utility  Systems must explore in the beginning and exploit in the limit 18

Q-Value Iteration  Value iteration: find successive approximations of the optimal values  Start with V*_0(s) = 0, which we know is right (why?)  Given V*_i, calculate the values for all states for depth i+1: V*_{i+1}(s) ← max_a Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ V*_i(s’) ]  But Q-values are more useful!  Start with Q*_0(s,a) = 0, which we know is right (why?)  Given Q*_i, calculate the q-values for all q-states for depth i+1: Q*_{i+1}(s,a) ← Σ_{s’} T(s,a,s’) [ R(s,a,s’) + γ max_{a’} Q*_i(s’,a’) ] 19
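
A compact sketch of one sweep of this Q-value iteration, assuming the model is available as the T and R dictionaries estimated earlier; the actions(s) helper returning legal actions (empty in terminal states) is an assumption, not something defined in the lecture:

def q_value_iteration_step(Q, T, R, actions, gamma=1.0):
    """One synchronous sweep: compute Q_{i+1} for every q-state from Q_i."""
    Q_next = {}
    for (s, a), outcomes in T.items():
        total = 0.0
        for s_next, prob in outcomes.items():
            best_next = max((Q.get((s_next, b), 0.0) for b in actions(s_next)), default=0.0)
            total += prob * (R[(s, a, s_next)] + gamma * best_next)
        Q_next[(s, a)] = total
    return Q_next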

Q-Learning  Learn Q*(s,a) values  Receive a sample (s,a,s’,r)  Consider your old estimate: Q(s,a)  Consider your new sample estimate: sample = r + γ max_{a’} Q(s’,a’)  Incorporate the new estimate into a running average: Q(s,a) ← (1 - α) Q(s,a) + α · sample [DEMO – Grid Q’s] 20
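
A tabular sketch of the update above (illustrative; Q is assumed to be a dictionary defaulting to 0, and actions(s) to return the legal actions, empty in terminal states):

from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=1.0):
    """Fold one observed sample (s, a, r, s_next) into the running average Q(s,a)."""
    best_next = max((Q[(s_next, b)] for b in actions(s_next)), default=0.0)
    sample = r + gamma * best_next             # new sample estimate
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

# Example setup: Q = defaultdict(float), so every q-value starts at 0.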

Q-Learning Properties  Will converge to optimal policy  If you explore enough  If you make the learning rate small enough  But not decrease it too quickly!  Basically doesn’t matter how you select actions (!)  Neat property: learns optimal q-values regardless of action selection noise (some caveats) [DEMO – Grid Q’s] 21

Exploration / Exploitation  Several schemes for forcing exploration  Simplest: random actions (ε-greedy)  Every time step, flip a coin  With probability ε, act randomly  With probability 1 - ε, act according to current policy  Problems with random actions?  You do explore the space, but keep thrashing around once learning is done  One solution: lower ε over time  Another solution: exploration functions [DEMO – RL Pacman] 22
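
An ε-greedy action selector matching the scheme above (a sketch; the legal_actions(s) helper and the tabular Q are assumptions made for illustration):

import random

def epsilon_greedy(Q, s, legal_actions, epsilon=0.1):
    """With probability epsilon act randomly; otherwise act greedily w.r.t. current Q."""
    acts = legal_actions(s)
    if random.random() < epsilon:
        return random.choice(acts)             # explore
    return max(acts, key=lambda a: Q[(s, a)])  # exploit current estimates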

Exploration Functions  When to explore  Random actions: explore a fixed amount  Better idea: explore areas whose badness is not (yet) established  Exploration function  Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u, n) = u + k/n (exact form not important) 23
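
In code this could look like the following sketch (the constant k and the visit-count table N are illustrative assumptions, not part of the lecture):

def exploration_bonus(u, n, k=2.0):
    """Optimistic utility: inflate the estimate of rarely tried q-states."""
    return u + k / (n + 1)      # +1 keeps unvisited pairs finite instead of dividing by zero

# Action selection would then maximize exploration_bonus(Q[(s, a)], N[(s, a)]) over actions a.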

Q-Learning  Q-learning produces tables of q-values: [DEMO – Crawler Q’s] 24

Q-Learning  In realistic situations, we cannot possibly learn about every single state!  Too many states to visit them all in training  Too many states to hold the q-tables in memory  Instead, we want to generalize:  Learn about some small number of training states from experience  Generalize that experience to new, similar states  This is a fundamental idea in machine learning, and we’ll see it over and over again 25

Example: Pacman  Let’s say we discover through experience that this state is bad:  In naïve q-learning, we know nothing about this state or its q-states:  Or even this one! 26

Feature-Based Representations  Solution: describe a state using a vector of features  Features are functions from states to real numbers (often 0/1) that capture important properties of the state  Example features:  Distance to closest ghost  Distance to closest dot  Number of ghosts  1 / (dist to dot)²  Is Pacman in a tunnel? (0/1)  … etc.  Can also describe a q-state (s, a) with features (e.g. action moves closer to food) 27

Linear Feature Functions  Using a feature representation, we can write a q function (or value function) for any state using a few weights: V(s) = w_1 f_1(s) + w_2 f_2(s) + … + w_n f_n(s), and Q(s,a) = w_1 f_1(s,a) + … + w_n f_n(s,a)  Advantage: our experience is summed up in a few powerful numbers  Disadvantage: states may share features but be very different in value! 28
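
A one-line sketch of such a linear q-function (the feature extractor and the {name: value} dictionary representation are assumptions for illustration):

def linear_q(weights, features):
    """Q(s,a) = sum_i w_i * f_i(s,a), with features supplied as a {name: value} dict."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())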

Function Approximation  Q-learning with linear q-functions: on a transition (s, a, r, s’), compute difference = [ r + γ max_{a’} Q(s’,a’) ] - Q(s,a), then update every weight w_i ← w_i + α · difference · f_i(s,a) (with exact q-tables this reduces to Q(s,a) ← Q(s,a) + α · difference)  Intuitive interpretation:  Adjust weights of active features  E.g. if something unexpectedly bad happens, disprefer all states with that state’s features  Formal justification: online least squares 29
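
Putting the pieces together, a sketch of one approximate q-learning update (illustrative; weights and features are dictionaries, and extract_features and legal_actions are assumed helpers not given in the lecture):

def approx_q_update(weights, extract_features, s, a, r, s_next, legal_actions,
                    alpha=0.05, gamma=1.0):
    """w_i <- w_i + alpha * difference * f_i(s,a) for every active feature."""
    feats = extract_features(s, a)
    q_sa = sum(weights.get(f, 0.0) * v for f, v in feats.items())
    q_next = max((sum(weights.get(f, 0.0) * v
                      for f, v in extract_features(s_next, b).items())
                  for b in legal_actions(s_next)), default=0.0)
    difference = (r + gamma * q_next) - q_sa       # TD error under the linear q-function
    for f, v in feats.items():
        weights[f] = weights.get(f, 0.0) + alpha * difference * v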

Example: Q-Pacman 30

Linear regression  Given examples (x_i, y_i) for i = 1, …, n  Predict y* given a new point x* 31

Linear regression  Prediction with a single input: ŷ = w_0 + w_1 x  With many features of the input: ŷ = w_0 + Σ_k w_k f_k(x) 32

Ordinary Least Squares (OLS)  Fit the weights by minimizing the total squared error between observations and predictions: total error = Σ_i (y_i - ŷ_i)² = Σ_i ( y_i - Σ_k w_k f_k(x_i) )²  [Figure: the vertical gap between an observation and the prediction is the error or “residual”] 33

Minimizing Error  For a single point x with features f(x) and target y: error(w) = ½ ( y - Σ_k w_k f_k(x) )²  Gradient: ∂error/∂w_m = - ( y - Σ_k w_k f_k(x) ) f_m(x)  Gradient-descent step: w_m ← w_m + α ( y - Σ_k w_k f_k(x) ) f_m(x)  Value update explained: the approximate q-learning weight update is exactly this step, with the target y replaced by r + γ max_{a’} Q(s’,a’) 34
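
As an illustration of this online least-squares step (a sketch, not code from the lecture; features arrive as a {name: value} dict):

def lstsq_gradient_step(weights, feats, y, alpha=0.01):
    """One online least-squares update: w_m += alpha * (y - prediction) * f_m(x)."""
    prediction = sum(weights.get(f, 0.0) * v for f, v in feats.items())
    error = y - prediction
    for f, v in feats.items():
        weights[f] = weights.get(f, 0.0) + alpha * error * v
    return weights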

Overfitting  Example: degree 15 polynomial fit [DEMO] 35

Policy Search 36

Policy Search  Problem: often the feature-based policies that work well aren’t the ones that approximate V / Q best  E.g. your value functions from project 2 were probably horrible estimates of future rewards, but they still produced good decisions  We’ll see this distinction between modeling and prediction again later in the course  Solution: learn the policy that maximizes rewards rather than the value that predicts rewards  This is the idea behind policy search, such as what controlled the upside-down helicopter 37

Policy Search  Simplest policy search:  Start with an initial linear value function or q-function  Nudge each feature weight up and down and see if your policy is better than before  Problems:  How do we tell the policy got better?  Need to run many sample episodes!  If there are a lot of features, this can be impractical 38
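
A naive sketch of this nudge-and-evaluate search (purely illustrative; evaluate_policy(weights) would have to run many sample episodes and average the returns, and is assumed here):

def hill_climb_weights(weights, evaluate_policy, step=0.1, sweeps=10):
    """Nudge each weight up and down; keep a change only if the policy's return improves."""
    best = evaluate_policy(weights)
    for _ in range(sweeps):
        for f in list(weights):
            for delta in (step, -step):
                candidate = dict(weights)
                candidate[f] += delta
                score = evaluate_policy(candidate)   # expensive: many sample episodes
                if score > best:
                    weights, best = candidate, score
    return weights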

Policy Search*  Advanced policy search:  Write a stochastic (soft) policy, e.g. a softmax over the q-values: π_w(a | s) ∝ exp( Q_w(s,a) )  Turns out you can efficiently approximate the derivative of the returns with respect to the parameters w (details in the book, but you don’t have to know them)  Take uphill steps, recalculate derivatives, etc. 39

Take a Deep Breath…  We’re done with search and planning!  Next, we’ll look at how to reason with probabilities  Diagnosis  Tracking objects  Speech recognition  Robot mapping  … lots more!  Last part of course: machine learning 40