CS 188: Artificial Intelligence Spring 2006

Presentation transcript:

CS 188: Artificial Intelligence Spring 2006 Lecture 22: Reinforcement Learning 4/11/2006 Dan Klein – UC Berkeley

Today: More MDPs: policy iteration. Reinforcement learning: passive learning and active learning.

Recap: MDPs Markov decision processes (MDPs): A set of states s ∈ S. A model T(s,a,s’): the probability that the outcome of action a in state s is s’. A reward function R(s). Solutions to an MDP: A policy π(s) specifies an action for each state. We want to find a policy which maximizes total expected utility = expected (discounted) rewards.
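For concreteness in the sketches that follow, here is one minimal way such an MDP might be encoded in Python. The dictionary layout, the toy states, and the numbers are illustrative assumptions, not anything from the course or its projects.

# Illustrative MDP encoding (assumed for the sketches below, not the course's code):
#   states      : list of states
#   actions[s]  : actions available in s (empty for terminal states)
#   T[(s, a)]   : list of (s_next, probability) pairs
#   R[s]        : reward received in state s
#   gamma       : discount factor
mdp = {
    "states": ["A", "B", "exit"],
    "actions": {"A": ["go"], "B": ["go"], "exit": []},
    "T": {("A", "go"): [("B", 0.9), ("A", 0.1)],
          ("B", "go"): [("exit", 1.0)]},
    "R": {"A": -1.0, "B": -1.0, "exit": 10.0},
    "gamma": 0.9,
}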

Bellman Equations The value of a state according to a policy π. The policy according to a value function U. The optimal value of a state.
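The equations themselves appeared as images on the slide. A standard reconstruction, written here in LaTeX and consistent with the state-based reward R(s) used in this lecture (an assumed form, not a transcription of the slide):

U^{\pi}(s) = R(s) + \gamma \sum_{s'} T(s, \pi(s), s') \, U^{\pi}(s')

\pi_{U}(s) = \arg\max_{a} \sum_{s'} T(s, a, s') \, U(s')

U^{*}(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s') \, U^{*}(s')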

Recap: Value Iteration Idea: Start with (bad) value estimates (e.g. U0(s) = 0). Start with the corresponding (bad) policy π0(s). Update values using the Bellman relations (once). Update the policy based on the new values. Repeat until convergence.
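A minimal sketch of this loop in Python, written against the dictionary MDP encoding assumed earlier (only the value update is shown here; extracting the corresponding greedy policy appears with policy improvement below):

def value_iteration(mdp, iterations=100):
    # Start with (bad) estimates U0(s) = 0 and repeatedly apply the Bellman update.
    U = {s: 0.0 for s in mdp["states"]}
    for _ in range(iterations):
        new_U = {}
        for s in mdp["states"]:
            acts = mdp["actions"][s]
            if not acts:
                new_U[s] = mdp["R"][s]            # terminal state: value is its reward
                continue
            best = max(sum(p * U[s2] for s2, p in mdp["T"][(s, a)]) for a in acts)
            new_U[s] = mdp["R"][s] + mdp["gamma"] * best
        U = new_U
    return U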

Policy Iteration Alternate approach: Policy evaluation: calculate exact utility values for a fixed policy. Policy improvement: update the policy based on those values. Repeat until convergence. This is policy iteration. It can converge faster under some conditions.

Policy Evaluation If we have a fixed policy π, use a simplified Bellman update to calculate utilities: Unlike in value iteration, the policy does not change during the update process. Converges to the expected utility values for this π. Can also solve for U with linear algebra methods instead of iteration.
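A sketch of the iterative version, again over the assumed dictionary encoding; pi is a dict mapping each non-terminal state to the action the fixed policy chooses. The linear-algebra alternative mentioned above would instead solve the resulting system of |S| linear equations directly.

def policy_evaluation(mdp, pi, iterations=100):
    U = {s: 0.0 for s in mdp["states"]}
    for _ in range(iterations):
        new_U = {}
        for s in mdp["states"]:
            if not mdp["actions"][s]:
                new_U[s] = mdp["R"][s]            # terminal state
            else:
                a = pi[s]                         # policy is fixed: no max over actions
                new_U[s] = mdp["R"][s] + mdp["gamma"] * sum(
                    p * U[s2] for s2, p in mdp["T"][(s, a)])
        U = new_U
    return U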

Policy Improvement Once values are correct for the current policy, update the policy. Note: Value iteration: update U, π, U, π, U, π, … Policy iteration: U, U, U, U, …, π, U, U, U, U, …, π, … Otherwise, basically the same!
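Putting evaluation and improvement together, a sketch of the full policy iteration loop; it reuses policy_evaluation from the previous sketch and the assumed MDP encoding:

def policy_iteration(mdp):
    # Start from an arbitrary policy, then alternate: evaluate it, improve it greedily.
    pi = {s: (mdp["actions"][s][0] if mdp["actions"][s] else None)
          for s in mdp["states"]}
    while True:
        U = policy_evaluation(mdp, pi)
        stable = True
        for s in mdp["states"]:
            if not mdp["actions"][s]:
                continue
            best_a = max(mdp["actions"][s],
                         key=lambda a: sum(p * U[s2] for s2, p in mdp["T"][(s, a)]))
            if best_a != pi[s]:
                pi[s], stable = best_a, False     # a better-looking action was found
        if stable:                                # policy has stopped changing
            return pi, U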

Reinforcement Learning Still have an MDP: A set of states s ∈ S. A model T(s,a,s’). A reward function R(s). Still looking for a policy π(s). New twist: don’t know T or R, i.e. don’t know which states are good or what the actions do. Must actually try actions and states out to learn.

Example: Animal Learning RL studied experimentally for more than 60 years in psychology Rewards: food, pain, hunger, drugs, etc. Mechanisms and sophistication debated Example: foraging Bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies Bees have a direct neural connection from nectar intake measurement to motor planning area

Example: Autonomous Helicopter

Example: Backgammon Reward only for win / loss in terminal states, zero otherwise TD-Gammon learns a function approximation to U(s) using a neural network Combined with depth 3 search, one of the top 3 players in the world (We’ll cover game playing in a few weeks)

Passive Learning Simplified task: You don’t know the transitions T(s,a,s’). You don’t know the rewards R(s). You DO know the policy π(s). Goal: learn the state values (and maybe the model). In this case: no choice about what actions to take; just execute the policy and learn from experience. We’ll get to the general case soon.

Example: Direct Estimation Episode 1: (1,1) -1 up (1,2) -1 up (1,3) -1 right (2,3) -1 right (3,3) -1 right (3,2) -1 up (4,3) +100. Episode 2: (1,1) -1 up (1,2) -1 up (1,3) -1 right (2,3) -1 right (3,3) -1 right (3,2) -1 up (4,2) -100. U(1,1) ~ (92 + -106) / 2 = -7. U(3,3) ~ (99 + 97 + -102) / 3 = 31.3.
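A sketch of direct estimation, assuming each episode is recorded as a list of (state, reward) pairs in the order visited (that format is an assumption made for illustration):

from collections import defaultdict

def direct_estimation(episodes, gamma=1.0):
    # Average the observed discounted return following each visit to each state.
    totals, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:
        for i, (s, _) in enumerate(episode):
            ret = sum((gamma ** k) * r for k, (_, r) in enumerate(episode[i:]))
            totals[s] += ret
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}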

Model-Based Learning Idea: Learn the model empirically (rather than values). Solve the MDP as if the learned model were correct. Empirical model learning, simplest case: count outcomes for each s, a; normalize to give an estimate of T(s,a,s’); discover R(s) the first time we enter s. More complex learners are possible (e.g. if we know that all squares have related action outcomes, “stationary noise”).
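A sketch of the simplest case, assuming experience arrives as (s, a, r, s_next) tuples where r is the reward observed in s (the tuple format is an illustrative assumption):

from collections import Counter, defaultdict

def learn_model(transitions):
    counts = defaultdict(Counter)
    R = {}
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1               # count outcomes for each s, a
        R[s] = r                                  # R(s) is discovered once s is entered
    T = {sa: [(s2, n / sum(c.values())) for s2, n in c.items()]
         for sa, c in counts.items()}             # normalize counts into probabilities
    return T, R

The learned T and R can then be handed to value or policy iteration as if they were the true model.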

Example: Model-Based Learning Episode 1: (1,1) -1 up (1,2) -1 up (1,3) -1 right (2,3) -1 right (3,3) -1 right (3,2) -1 up (4,3) +100. Episode 2: (1,1) -1 up (1,2) -1 up (1,3) -1 right (2,3) -1 right (3,3) -1 right (3,2) -1 up (4,2) -100. T(<3,3>, right, <4,3>) = 1 / 3. T(<2,3>, right, <3,3>) = 2 / 2. R(3,3) = -1, R(4,1) = 0?

Model-Free Learning Big idea: why bother learning T? Update the value estimate each time we experience a transition. Frequent outcomes will contribute more updates (over time). Temporal difference learning (TD): the policy is still fixed! Move values toward the value of whatever successor occurs.
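A sketch of the passive TD(0) update, over the same (state, reward) episode format assumed earlier; the update rule U(s) ← U(s) + α[R(s) + γ U(s') − U(s)] is the standard one this slide describes in words:

from collections import defaultdict

def passive_td(episodes, gamma=1.0, alpha=0.1):
    U = defaultdict(float)                        # value estimates, initialized to 0
    for episode in episodes:
        for (s, r), (s_next, _) in zip(episode, episode[1:]):
            sample = r + gamma * U[s_next]        # value of whatever successor occurred
            U[s] += alpha * (sample - U[s])       # move U(s) toward the sample
    return dict(U)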

Example: Passive TD Episode 1: (1,1) -1 up (1,2) -1 up (1,3) -1 right (4,3) +100. Episode 2: (1,1) -1 up (1,2) -1 up (1,3) -1 right (2,3) -1 right (3,3) -1 right (3,2) -1 up (4,2) -100. Take γ = 1, α = 0.1.
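With those settings and value estimates initialized to 0 (an assumption consistent with the U0(s) = 0 initialization used earlier), the first observed transition (1,1) → (1,2) gives the worked update:

U(1,1) ← U(1,1) + α [ R(1,1) + γ U(1,2) − U(1,1) ] = 0 + 0.1 × (−1 + 1·0 − 0) = −0.1

Subsequent visits keep nudging each value toward the sampled successor values.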

(Greedy) Active Learning In general, we want to learn the optimal policy. Idea: Learn an initial model of the environment. Solve for the optimal policy for this model (value or policy iteration). Refine the model through experience and repeat.
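A skeleton of that loop; env_step(s, a) → (s_next, r), is_terminal(s), actions(s), and solve(transitions) → policy are all assumed interfaces supplied by the caller (for example, learn_model plus value iteration from the sketches above). Nothing here is the course's API.

def greedy_active_rl(env_step, is_terminal, actions, start_state, solve, episodes=50):
    transitions, pi = [], {}
    for _ in range(episodes):
        if transitions:
            pi = solve(transitions)               # re-solve the currently learned model
        s = start_state
        while not is_terminal(s):
            a = pi.get(s, actions(s)[0])          # greedy w.r.t. the model; arbitrary if s is unmodeled
            s_next, r = env_step(s, a)
            transitions.append((s, a, r, s_next))
            s = s_next
    return pi

Note that states the greedy policy never reaches stay unmodeled forever, which is exactly the failure mode on the next slide.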

Example: Greedy Active Learning Imagine we find the lower path to the good exit first. Some states will never be visited following this policy from (1,1). We’ll keep re-using this policy, because following it never visits the regions of the model we would need to learn the optimal policy.

What Went Wrong? Problem with following the optimal policy for the current model: we never learn about better regions of the space. Fundamental tradeoff: exploration vs. exploitation. Exploration: must take actions with suboptimal estimates to discover new rewards and increase eventual utility. Exploitation: once the true optimal policy is learned, exploration reduces utility. Systems must explore in the beginning and exploit in the limit.
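One standard way to act on this tradeoff is epsilon-greedy action selection; it is not named on this slide, so the sketch below is an illustration rather than the lecture's method. value_of is an assumed callable giving the current estimated value of an action.

import random

def epsilon_greedy(actions, value_of, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)             # explore: may look suboptimal now
    return max(actions, key=value_of)             # exploit: best under current estimates

Decaying epsilon over time gives "explore in the beginning, exploit in the limit."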

Next Time Active reinforcement learning: Q-learning, balancing exploration / exploitation. Function approximation: generalization for reinforcement learning, modeling utilities for complex spaces.