CS 188: Artificial Intelligence Spring 2007 Lecture 23: Reinforcement Learning: III 4/17/2007 Srini Narayanan – ICSI and UC Berkeley

Announcements  Othello tournament rules up.  On-line readings for this week.

Recap: MDPs  Markov decision processes:  States S  Actions A  Transitions P(s’|s,a) (or T(s,a,s’))  Rewards R(s,a,s’)  Start state s0  Examples:  Gridworld, High-Low, N-Armed Bandit  Any process where the result of your action is stochastic  Goal: find the “best” policy π  Policies are maps from states to actions  What do we mean by “best”?  This is like search – it’s planning using a model, not actually interacting with the environment

Bellman’s Equation for Selecting actions  Definition of utility leads to a simple relationship amongst optimal utility values: Optimal rewards = maximize over first action and then follow optimal policy Formally: Bellman’s Equation That’s my equation!
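Written out, the equation referred to here is the standard Bellman optimality equation, in the S, A, T, R notation from the MDP recap:
V*(s) = max_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V*(s') ]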

Example: GridWorld

Value Iteration  Idea:  Start with bad guesses at all utility values (e.g. V0(s) = 0)  Update all values simultaneously using the Bellman equation (called a value update or Bellman update):  Repeat until convergence  Theorem: will converge to unique optimal values  Basic idea: bad guesses get refined towards optimal values  Policy may converge long before values do
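A minimal Python sketch of value iteration, assuming the MDP is given as a list of states, a function actions(s), a function T(s, a) returning (next_state, probability) pairs, and a reward function R(s, a, s2). These names are illustrative, not part of the course code:

def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """Repeatedly apply the Bellman update to every state until the values stop changing."""
    V = {s: 0.0 for s in states}                  # start with V0(s) = 0
    while True:
        V_new = {}
        for s in states:
            acts = actions(s)
            if not acts:                          # terminal state: nothing to update
                V_new[s] = 0.0
                continue
            # Bellman update: best expected one-step reward plus discounted future value
            V_new[s] = max(
                sum(p * (R(s, a, s2) + gamma * V[s2]) for s2, p in T(s, a))
                for a in acts
            )
        if max(abs(V_new[s] - V[s]) for s in states) < tol:   # converged
            return V_new
        V = V_new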

Example: Bellman Updates

Example: Value Iteration  Information propagates outward from terminal states and eventually all states have correct value estimates [DEMO]

Convergence*  Define the max-norm: ||U|| = max over s of |U(s)|  Theorem: for any two approximations U and V (any two utility vectors), one Bellman update shrinks their max-norm distance by a factor of γ  I.e. any distinct approximations must get closer to each other after the Bellman update, so, in particular, any approximation must get closer to the true U (the true U is a fixed point of the Bellman update), and value iteration converges to a unique, stable, optimal solution  Theorem: once the change in our approximation between successive iterations is small, the approximation must also be close to correct
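The two theorem statements, reconstructed in their standard form (as in Russell & Norvig; ||·|| is the max-norm, U* the true utilities, γ the discount):
|| U_{k+1} - V_{k+1} || ≤ γ || U_k - V_k ||      (one Bellman update is a contraction by γ)
if || U_{k+1} - U_k || < ε(1-γ)/γ  then  || U_{k+1} - U* || < ε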

Policy Iteration  Alternate approach:  Policy evaluation: calculate utilities for a fixed policy until convergence (remember last lecture)  Policy improvement: update policy based on resulting converged utilities  Repeat until policy converges  This is policy iteration  Can converge faster under some conditions

Policy Iteration  If we have a fixed policy π, use simplified Bellman equation to calculate utilities:  For fixed utilities, easy to find the best action according to one-step look-ahead
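In symbols, the fixed-policy Bellman equation (no max, since π fixes the action) and the one-step look-ahead improvement step are, in their standard forms:
V^π(s) = Σ_{s'} T(s, π(s), s') [ R(s, π(s), s') + γ V^π(s') ]
π_new(s) = argmax_a Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V^π(s') ]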

Comparison  In value iteration:  Every pass (or “backup”) updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy)  In policy iteration:  Several passes to update utilities with frozen policy  Occasional passes to update policies  Hybrid approaches (asynchronous policy iteration):  Any sequences of partial updates to either policy entries or utilities will converge if every state is visited infinitely often

Reinforcement Learning  Reinforcement learning:  Still have an MDP:  A set of states s ∈ S  A set of actions (per state) A  A model T(s,a,s’)  A reward function R(s,a,s’)  Still looking for a policy π(s)  New twist: don’t know T or R  I.e. don’t know which states are good or what the actions do  Must actually try actions and states out to learn [DEMO]

Example: Animal Learning  RL studied experimentally for more than 60 years in psychology  Rewards: food, pain, hunger, drugs, etc.  Mechanisms and sophistication debated  Example: foraging  Bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies  Bees have a direct neural connection from nectar intake measurement to motor planning area

Passive Learning  Simplified task  You don’t know the transitions T(s,a,s’)  You don’t know the rewards R(s,a,s’)  You are given a policy π(s)  Goal: learn the state values (and maybe the model)  In this case:  No choice about what actions to take  Just execute the policy and learn from experience  We’ll get to the general case soon

Example: Direct Estimation (Simple Monte Carlo)  Episodes: (1,1) up -1 (1,2) up -1 (1,3) right -1 (2,3) right -1 (3,3) right -1 (3,2) up -1 (3,3) right -1 (4,3) exit +100 (done)  (1,1) up -1 (1,2) up -1 (1,3) right -1 (2,3) right -1 (3,3) right -1 (3,2) up -1 (4,2) exit -100 (done)  With γ = 1 and R = -1 per non-terminal step (±100 at exit):  U(1,1) ≈ average of the two sampled returns from (1,1) ≈ -7  U(3,3) ≈ (99 + 97 - 102) / 3 ≈ 31.3
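A minimal sketch of direct estimation (every-visit Monte Carlo return averaging). The episode encoding as (state, reward) pairs and the function name are illustrative, not from the slides; γ = 1 matches the example above.

def direct_estimation(episodes, gamma=1.0):
    """Estimate U(s) as the average of all observed returns from s."""
    returns = {}                                   # state -> list of sampled returns
    for episode in episodes:                       # episode: list of (state, reward) pairs,
        G = 0.0                                    # reward = reward received on leaving that state
        for state, reward in reversed(episode):    # walk backwards so G is the return from `state`
            G = reward + gamma * G
            returns.setdefault(state, []).append(G)
    return {s: sum(rs) / len(rs) for s, rs in returns.items()}

# The first episode above would be encoded as
# [((1,1), -1), ((1,2), -1), ((1,3), -1), ((2,3), -1), ((3,3), -1), ((3,2), -1), ((3,3), -1), ((4,3), +100)]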

Model-Based Learning  Idea:  Learn the model empirically (rather than values)  Solve the MDP as if the learned model were correct  Empirical model learning  Simplest case:  Count outcomes for each s,a  Normalize to give estimate of T(s,a,s’)  Discover R(s,a,s’) the first time we experience (s,a,s’)  More complex learners are possible (e.g. if we know that all squares have related action outcomes, e.g. “stationary noise”)
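A minimal sketch of the count-and-normalize model estimate described above. The input format (a list of observed (s, a, s', r) transitions) and the names are illustrative assumptions:

from collections import defaultdict

def learn_model(transitions):
    """Estimate T(s,a,s') from counts and record R(s,a,s') the first time it is seen."""
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s': count}
    R = {}                                           # (s, a, s') -> reward
    for s, a, s2, r in transitions:
        counts[(s, a)][s2] += 1
        R.setdefault((s, a, s2), r)                  # keep the first observed reward
    T = {}
    for (s, a), outcomes in counts.items():
        total = sum(outcomes.values())
        T[(s, a)] = {s2: c / total for s2, c in outcomes.items()}   # normalize counts
    return T, R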

Example: Model-Based Learning  Episodes: (1,1) up -1 (1,2) up -1 (1,3) right -1 (2,3) right -1 (3,3) right -1 (3,2) up -1 (3,3) right -1 (4,3) exit +100 (done)  (1,1) up -1 (1,2) up -1 (1,3) right -1 (2,3) right -1 (3,3) right -1 (3,2) up -1 (4,2) exit -100 (done)  From these episodes (with γ = 1): T((3,3), right, (4,3)) = 1/3, T((3,3), right, (3,2)) = 2/3

Model-Based Learning  In general, want to learn the optimal policy, not evaluate a fixed policy  Idea: adaptive dynamic programming  Learn an initial model of the environment:  Solve for the optimal policy for this model (value or policy iteration)  Refine model through experience and repeat  Crucial: we have to make sure we actually learn about all of the model

Example: Greedy ADP  Imagine we find the lower path to the good exit first  Some states will never be visited following this policy from (1,1)  We’ll keep re-using this policy because following it never collects the regions of the model we need to learn the optimal policy

What Went Wrong?  Problem with following optimal policy for current model:  Never learn about better regions of the space if current policy neglects them  Fundamental tradeoff: exploration vs. exploitation  Exploration: must take actions with suboptimal estimates to discover new rewards and increase eventual utility  Exploitation: once the true optimal policy is learned, exploration reduces utility  Systems must explore in the beginning and exploit in the limit

Dynamic Programming [backup diagram]

Simple Monte Carlo [backup diagram]

Simplest TD Method [backup diagram]

Model-Free Learning  Big idea: why bother learning T?  Update each time we experience a transition  Frequent outcomes will contribute more updates (over time)  Temporal difference learning (TD)  Policy still fixed!  Move values toward value of whatever successor occurs
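The update being described, in its standard form (α is the learning rate, π the fixed policy being evaluated):
sample = R(s, π(s), s') + γ V^π(s')
V^π(s) ← (1 - α) V^π(s) + α · sample
equivalently: V^π(s) ← V^π(s) + α [ R(s, π(s), s') + γ V^π(s') - V^π(s) ]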

Example: Passive TD  Take γ = 1, α = 0.5  (1,1) up -1 (1,2) up -1 (1,3) right -1 (2,3) right -1 (3,3) right -1 (3,2) up -1 (3,3) right -1 (4,3) exit +100 (done)  (1,1) up -1 (1,2) up -1 (1,3) right -1 (2,3) right -1 (3,3) right -1 (3,2) up -1 (4,2) exit -100 (done)
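For instance, assuming all values start at 0, the first transition of the first episode, (1,1) up -1 landing in (1,2), gives:
V(1,1) ← (1 - 0.5)·0 + 0.5·(-1 + 1·V(1,2)) = 0.5·(-1 + 0) = -0.5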

TD Learning Features  On-line  Incremental  Bootstrapping (like DP, unlike MC)  Model-free  Does not require special treatment of exploration actions (unlike DP)  Converges for any fixed policy to the correct value of each state under that policy, as long as:  the learning rate α starts high and decays toward zero (e.g. α = 1/k on the k-th update)  the discount γ is less than one

Problems with TD Value Learning  TD value learning is model-free for policy evaluation  However, if we want to turn our value estimates into a policy, we’re sunk: picking the best action requires a one-step look-ahead using T and R, which we don’t know  Idea: learn values for state-action pairs (Q-values) directly  Makes action selection model-free too!

Q-Functions  To simplify things, introduce a q-value for a state and action under a policy:  the utility of starting in state s, taking action a, and then following π thereafter
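Written out (standard definition, consistent with the MDP notation used earlier):
Q^π(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ V^π(s') ],  with V^π(s) = Q^π(s, π(s))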

The Bellman Equations  Definition of utility leads to a simple relationship amongst optimal utility values: Optimal rewards = maximize over first action and then follow optimal policy  Formally:
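Using the q-values just introduced, the same optimality relationship can be written as:
V*(s) = max_a Q*(s,a)
Q*(s,a) = Σ_{s'} T(s,a,s') [ R(s,a,s') + γ max_{a'} Q*(s',a') ]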

Q-Learning  Learn Q*(s,a) values  Receive a sample (s,a,s’,r)  Consider your old estimate:  Consider your new sample estimate:  Nudge the old estimate towards the new sample:
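The three quantities referred to above, in the standard Q-learning form (α is the learning rate):
old estimate: Q(s,a)
new sample estimate: sample = r + γ max_{a'} Q(s',a')
nudge: Q(s,a) ← (1 - α) Q(s,a) + α · sample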

Q-Learning

Exploration / Exploitation  Several schemes for forcing exploration  Simplest: random actions (ε-greedy)  Every time step, flip a coin  With probability ε, act randomly  With probability 1-ε, act according to current policy  Problems with random actions?  You do explore the space, but keep thrashing around once learning is done  One solution: lower ε over time  Another solution: exploration functions
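A minimal sketch of ε-greedy action selection over a Q-table; the Q dictionary keyed by (state, action) and the decay schedule in the comment are illustrative assumptions:

import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon take a random action, otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)                              # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))      # exploit

# One simple way to lower epsilon over time (illustrative schedule):
#   epsilon = 1.0 / (1 + episode_number)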

Demo of Q-Learning  Demo: robot crawling  Parameters:  k (learning rate)  discount (high weight on future rewards)  exploration rate (should decrease with time)  MDP:  reward = number of pixels moved to the right per iteration  Actions: arm up and down (yellow line), hand up and down (red line)

Q-Learning Properties  Will converge to optimal policy  If you explore enough  If you make the learning rate small enough  Neat property: does not learn policies which are optimal in the presence of action selection noise

Exploration Functions  When to explore  Random actions: explore a fixed amount  Better idea: explore areas whose badness is not (yet) established  Exploration function  Takes a value estimate and a count, and returns an optimistic utility, e.g. (exact form not important)
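One common choice of exploration function (the exact form is not important, and this particular one is an assumption, not taken from the slides):
f(u, n) = u + k / n
where u is the current value estimate, n is the number of times that state-action pair has been tried, and k is a constant controlling how optimistic the agent is about rarely-tried actions.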