
1 CS 188: Artificial Intelligence Spring 2007 Lecture 21: Reinforcement Learning II (MDP) 4/12/2007 Srini Narayanan – ICSI and UC Berkeley

2 Announcements  Othello tournament signup  Please send email to cs188@imail.berkeley.edu  HW on classification out  Due 4/23  Can work in pairs

3 Types of Learning: Recap  Past (learning to classify the world)  Supervised  Naïve Bayes, Perceptron  Unsupervised  Clustering (k-means)  Next (learning to act in the world)  Reinforcement Learning

4 Reinforcement Learning  Basic idea:  Receive feedback in the form of rewards  Agent’s utility is defined by the reward function  Must learn to act so as to maximize expected utility  Change the rewards, change the behavior  Examples:  Learning optimal paths  Playing a game, reward at the end for winning / losing  Vacuuming a house, reward for each piece of dirt picked up  Automated taxi, reward for each passenger delivered

5 Recap: MDPs  Markov decision processes:  States S  Actions A  Transitions P(s’|s,a) (or T(s,a,s’))  Rewards R(s,a,s’)  Start state s0  Examples:  Gridworld, High-Low, N-Armed Bandit  Any process where the result of your action is stochastic  Goal: find the “best” policy π  Policies are maps from states to actions  What do we mean by “best”?  This is like search – it’s planning using a model, not actually interacting with the environment
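Not from the slides, but a minimal sketch of how these MDP pieces might be written down as plain Python data. The states, action, probabilities, and rewards below are invented for illustration; only the shape (states, actions, T, R, start state, discount) mirrors the recap.

```python
# Illustrative toy MDP, not course code: two non-terminal states and one stochastic action.

STATES = ['A', 'B', 'term']
ACTIONS = {'A': ['go'], 'B': ['go'], 'term': []}   # terminal state has no actions

# T[(s, a)] is a list of (s', probability) pairs, i.e. P(s' | s, a).
T = {
    ('A', 'go'): [('B', 0.8), ('A', 0.2)],
    ('B', 'go'): [('term', 0.9), ('A', 0.1)],
}

# R[(s, a, s')] is the reward received on that transition.
R = {
    ('A', 'go', 'B'): -1.0, ('A', 'go', 'A'): -1.0,
    ('B', 'go', 'term'): 10.0, ('B', 'go', 'A'): -1.0,
}

START_STATE = 'A'
GAMMA = 0.9   # discount factor
```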

6 MDP Solutions  In deterministic single-agent search, want an optimal sequence of actions from start to a goal  In an MDP, like expectimax, want an optimal policy π(s)  A policy gives an action for each state  Optimal policy maximizes expected utility (i.e. expected rewards) if followed  Defines a reflex agent  [Figure caption: optimal policy when R(s, a, s’) = -0.04 for all non-terminals s]

7 Example Optimal Policies  [Four gridworld policy panels, one per living reward: R(s) = -2.0, R(s) = -0.4, R(s) = -0.03, R(s) = -0.01]

8 Stationarity  In order to formalize optimality of a policy, we need to understand utilities of reward sequences  Typically consider stationary preferences: if [r, r1, r2, …] is preferred to [r, r1’, r2’, …], then [r1, r2, …] is preferred to [r1’, r2’, …]  Theorem: there are only two ways to define stationary utilities  Additive utility: U([r0, r1, r2, …]) = r0 + r1 + r2 + …  Discounted utility: U([r0, r1, r2, …]) = r0 + γ r1 + γ² r2 + …

9 Infinite Utilities?!  Problem: infinite state sequences with infinite rewards  Solutions:  Finite horizon:  Terminate after a fixed T steps  Gives nonstationary policy (π depends on time left)  Absorbing state(s): guarantee that for every policy, the agent will eventually “die” (like “done” for High-Low)  Discounting: for 0 < γ < 1  Smaller γ means smaller horizon
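Not stated on the slide, but the reason discounting makes the infinite sum finite is a standard geometric-series bound: if every reward is bounded by some Rmax, then

```latex
\left|\sum_{t=0}^{\infty} \gamma^{t} r_t\right|
\;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
\;=\; \frac{R_{\max}}{1-\gamma}, \qquad 0 < \gamma < 1 .
```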

10 How (Not) to Solve an MDP  The inefficient way:  Enumerate policies  For each one, calculate the expected utility (discounted rewards) from the start state  E.g. by simulating a bunch of runs  Choose the best policy  We’ll return to a (better) idea like this later

11 Utility of a State  Define the utility of a state under a policy: V^π(s) = expected total (discounted) rewards starting in s and following π  Recursive definition (one-step look-ahead): V^π(s) = Σs’ T(s, π(s), s’) [ R(s, π(s), s’) + γ V^π(s’) ]

12 Policy Evaluation  Idea one: turn recursive equations into updates  Idea two: it’s just a linear system, solve with Matlab (or Mosek, or Cplex)
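A minimal sketch of both ideas, not code from the slides. It assumes a hypothetical mdp object exposing mdp.states, mdp.T(s, a) returning a list of (s’, prob) pairs, mdp.R(s, a, s’), and mdp.gamma, with the policy given as a dict mapping each state to an action (None for terminal states).

```python
import numpy as np

def evaluate_policy_iteratively(mdp, policy, iterations=100):
    """Idea one: turn the recursive equation into repeated updates."""
    V = {s: 0.0 for s in mdp.states}
    for _ in range(iterations):
        new_V = {}
        for s in mdp.states:
            a = policy[s]
            if a is None:                      # terminal state: no future rewards
                new_V[s] = 0.0
                continue
            new_V[s] = sum(p * (mdp.R(s, a, s2) + mdp.gamma * V[s2])
                           for s2, p in mdp.T(s, a))
        V = new_V
    return V

def evaluate_policy_linear(mdp, policy):
    """Idea two: for a fixed policy the equations are linear, so solve
    (I - gamma * P_pi) V = r_pi directly."""
    states = list(mdp.states)
    index = {s: i for i, s in enumerate(states)}
    n = len(states)
    P = np.zeros((n, n))                       # P_pi[i, j] = T(s_i, pi(s_i), s_j)
    r = np.zeros(n)                            # r_pi[i] = expected immediate reward in s_i
    for s in states:
        a = policy[s]
        if a is None:
            continue
        for s2, p in mdp.T(s, a):
            P[index[s], index[s2]] += p
            r[index[s]] += p * mdp.R(s, a, s2)
    V = np.linalg.solve(np.eye(n) - mdp.gamma * P, r)
    return {s: V[index[s]] for s in states}
```

The linear solve is exact in one shot (this is what you would hand to Matlab or another solver); the iterative version is the form that generalizes to the updates used later in the lecture.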

13 Example: High-Low  Policy: always say “high”  Iterative updates:

14 Optimal Utilities  Goal: calculate the optimal utility of each state: V*(s) = expected (discounted) rewards with optimal actions  Why: given optimal utilities, MEU (maximum expected utility) tells us the optimal policy

15 Bellman’s Equation for Selecting Actions  Definition of utility leads to a simple relationship among optimal utility values: optimal rewards = maximize over the first action and then follow the optimal policy  Formally (Bellman’s equation): V*(s) = maxa Σs’ T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]  “That’s my equation!”

16 Practice: Computing Actions  Which action should we choose from state s:  Given optimal q-values Q?  Given optimal values V?
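A small sketch of the two answers (illustrative helper names, not course code). The point of the exercise: with optimal q-values the greedy action needs no model at all, while with values alone we still need T and R for a one-step look-ahead.

```python
def action_from_q(Q, s, actions):
    # Given optimal q-values Q[(s, a)], just take the argmax over available actions.
    return max(actions, key=lambda a: Q[(s, a)])

def action_from_v(mdp, V, s):
    # Given optimal values V only, do a one-step look-ahead through the model.
    def q(a):
        return sum(p * (mdp.R(s, a, s2) + mdp.gamma * V[s2]) for s2, p in mdp.T(s, a))
    return max(mdp.actions(s), key=q)
```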

17 Example: GridWorld

18 Value Iteration  Idea:  Start with bad guesses at all utility values (e.g. V0(s) = 0)  Update all values simultaneously using the Bellman equation (called a value update or Bellman update): Vk+1(s) ← maxa Σs’ T(s, a, s’) [ R(s, a, s’) + γ Vk(s’) ]  Repeat until convergence  Theorem: will converge to unique optimal values  Basic idea: bad guesses get refined towards optimal values  Policy may converge long before the values do
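A minimal value-iteration sketch under the same assumed mdp interface as before (states, actions(s), T(s, a) → list of (s’, prob), R(s, a, s’), gamma); it is illustrative, not the course's code.

```python
def value_iteration(mdp, epsilon=1e-6):
    V = {s: 0.0 for s in mdp.states}            # start with bad guesses: V_0(s) = 0
    while True:
        new_V = {}
        for s in mdp.states:
            acts = mdp.actions(s)
            if not acts:                         # terminal state
                new_V[s] = 0.0
                continue
            # Bellman update: max over actions of expected reward plus discounted value.
            new_V[s] = max(sum(p * (mdp.R(s, a, s2) + mdp.gamma * V[s2])
                               for s2, p in mdp.T(s, a))
                           for a in acts)
        if max(abs(new_V[s] - V[s]) for s in mdp.states) < epsilon:
            return new_V                         # change is small everywhere: converged
        V = new_V
```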

19 Example: Bellman Updates

20 Example: Value Iteration  Information propagates outward from terminal states and eventually all states have correct value estimates [DEMO]

21 Convergence*  Define the max-norm: ||U|| = maxs |U(s)|  Theorem: for any two approximations U and V (any two utility vectors), ||Ut+1 – Vt+1|| ≤ γ ||Ut – Vt||  I.e. any two distinct approximations must get closer to each other after a Bellman update, so, in particular, any approximation must get closer to the true utilities U (a fixed point of the Bellman update), and value iteration converges to a unique, stable, optimal solution  Theorem: if ||Ut+1 – Ut|| < ε (1 – γ) / γ, then ||Ut+1 – U|| < ε  I.e. once the change in our approximation is small, it must also be close to correct

22 Policy Iteration  Alternate approach:  Policy evaluation: calculate utilities for a fixed policy until convergence (remember the beginning of lecture)  Policy improvement: update policy based on resulting converged utilities  Repeat until policy converges  This is policy iteration  Can converge faster under some conditions

23 Policy Iteration  If we have a fixed policy π, use the simplified Bellman equation to calculate utilities: V^π(s) = Σs’ T(s, π(s), s’) [ R(s, π(s), s’) + γ V^π(s’) ]  For fixed utilities, it is easy to find the best action according to a one-step look-ahead: π(s) ← argmaxa Σs’ T(s, a, s’) [ R(s, a, s’) + γ V^π(s’) ]
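A rough policy-iteration sketch, again under the assumed mdp interface; it reuses a policy-evaluation routine like the one sketched after slide 12 and is illustrative only.

```python
def policy_iteration(mdp, evaluate_policy):
    # Start from an arbitrary policy (first legal action; None for terminals).
    policy = {s: (mdp.actions(s)[0] if mdp.actions(s) else None) for s in mdp.states}
    while True:
        V = evaluate_policy(mdp, policy)         # evaluation: utilities of the fixed policy
        changed = False
        for s in mdp.states:
            acts = mdp.actions(s)
            if not acts:
                continue
            def q(a):                            # improvement: one-step look-ahead
                return sum(p * (mdp.R(s, a, s2) + mdp.gamma * V[s2])
                           for s2, p in mdp.T(s, a))
            best = max(acts, key=q)
            if best != policy[s]:
                policy[s] = best
                changed = True
        if not changed:                          # policy stable: done
            return policy, V
```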

24 Comparison  In value iteration:  Every pass (or “backup”) updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy)  In policy iteration:  Several passes to update utilities with a frozen policy  Occasional passes to update policies  Hybrid approaches (asynchronous policy iteration):  Any sequence of partial updates to either policy entries or utilities will converge if every state is visited infinitely often

25 Reinforcement Learning  Reinforcement learning:  Still have an MDP:  A set of states s ∈ S  A model T(s,a,s’)  A reward function R(s)  Still looking for a policy π(s)  New twist: don’t know T or R  I.e. don’t know which states are good or what the actions do  Must actually try actions and states out to learn

26 Example: Animal Learning  RL studied experimentally for more than 60 years in psychology  Rewards: food, pain, hunger, drugs, etc.  Mechanisms and sophistication debated  Example: foraging  Bees learn near-optimal foraging plan in field of artificial flowers with controlled nectar supplies  Bees have a direct neural connection from nectar intake measurement to motor planning area

27 Example: Backgammon  Reward only for win / loss in terminal states, zero otherwise  TD-Gammon learns a function approximation to U(s) using a neural network  Combined with depth 3 search, one of the top 3 players in the world

28 Passive Learning  Simplified task  You don’t know the transitions T(s,a,s’)  You don’t know the rewards R(s,a,s’)  You are given a policy π(s)  Goal: learn the state values (and maybe the model)  In this case:  No choice about what actions to take  Just execute the policy and learn from experience  We’ll get to the general case soon

29 Example: Direct Estimation  Episodes (gridworld with terminal rewards +100 and -100, γ = 1):  Episode 1: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right +100 (done)  Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) right -1, (4,2) right -100 (done)  Observed returns: U(1,1) ≈ (93 + -105) / 2 = -6, U(3,3) ≈ (100 + 98 + -101) / 3 = 32.3
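A sketch of direct estimation from logged episodes, not course code. The episode format is an assumption: each episode is a list of (state, action, reward) steps, and every visit to a state contributes one observed return, matching the averaging on the slide.

```python
from collections import defaultdict

def direct_estimates(episodes, gamma=1.0):
    totals = defaultdict(float)   # sum of observed returns from each state
    counts = defaultdict(int)     # number of visits to each state
    for episode in episodes:
        # Walk the episode backwards so G accumulates the return from each step onward.
        G = 0.0
        visited = []
        for (s, a, r) in reversed(episode):
            G = r + gamma * G
            visited.append((s, G))
        for s, G in visited:      # every visit counts as one sample of U(s)
            totals[s] += G
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}
```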

30 Model-Based Learning  Idea:  Learn the model empirically (rather than values)  Solve the MDP as if the learned model were correct  Empirical model learning  Simplest case:  Count outcomes for each s,a  Normalize to give estimate of T(s,a,s’)  Discover R(s,a,s’) the first time we experience (s,a,s’)  More complex learners are possible (e.g. if we know that all squares have related action outcomes, e.g. “stationary noise”)
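A sketch of the simplest empirical model learner described above: count outcomes for each (s, a), normalize the counts to estimate T, and record R the first time a transition is experienced. Class and method names are illustrative.

```python
from collections import defaultdict

class EmpiricalModel:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # counts[(s, a)][s'] = N(s, a, s')
        self.rewards = {}                                    # rewards[(s, a, s')] = observed R

    def observe(self, s, a, r, s2):
        self.counts[(s, a)][s2] += 1
        self.rewards.setdefault((s, a, s2), r)               # discovered on first experience

    def T(self, s, a):
        outcomes = self.counts[(s, a)]
        total = sum(outcomes.values())
        return [(s2, n / total) for s2, n in outcomes.items()]  # normalized estimate

    def R(self, s, a, s2):
        return self.rewards.get((s, a, s2), 0.0)
```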

31 Example: Model-Based Learning  Episodes: the same two episodes as in the direct estimation example above (γ = 1, terminal rewards +100 and -100)  Estimated transitions from the counts (the grid-cell arguments appear on the slide but are missing from this transcript): T(·, right, ·) = 1 / 3, T(·, right, ·) = 2 / 2

32 Model-Based Learning  In general, want to learn the optimal policy, not evaluate a fixed policy  Idea: adaptive dynamic programming  Learn an initial model of the environment:  Solve for the optimal policy for this model (value or policy iteration)  Refine model through experience and repeat  Crucial: we have to make sure we actually learn about all of the model
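A rough sketch of the adaptive dynamic programming loop described above: learn a model from experience, solve it, act, and repeat. The environment interface (reset, step, actions) and the random fallback for unseen states are assumptions for illustration; `plan` stands in for value or policy iteration on the learned model, and `model` is an empirical model like the earlier sketch.

```python
import random

def adp_loop(env, model, plan, episodes=100):
    for _ in range(episodes):
        policy = plan(model)                       # re-solve the currently learned MDP
        s = env.reset()
        done = False
        while not done:
            # Fall back to a random action where the learned model knows nothing yet;
            # otherwise this loop only learns about states the greedy policy visits.
            a = policy.get(s) or random.choice(env.actions(s))
            s2, r, done = env.step(a)
            model.observe(s, a, r, s2)             # refine the model through experience
            s = s2
    return plan(model)
```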

