1 CS 188: Artificial Intelligence Spring 2007 Lecture 21: Reinforcement Learning I: Utilities and Simple Decisions 4/10/2007 Srini Narayanan – ICSI and UC Berkeley

2 Announcements  Othello tournament signup  Please send email to cs188@imail.berkeley.edu  HW on classification out today  Due 4/23  Can work in pairs

3 Reinforcement Learning  Basic idea:  Receive feedback in the form of rewards  Agent’s utility is defined by the reward function  Must learn to act so as to maximize expected utility  Change the rewards, change the behavior  Examples:  Playing a game, reward at the end for winning / losing  Vacuuming a house, reward for each piece of dirt picked up  Automated taxi, reward for each passenger delivered

4 Maximum Expected Utility  MEU: An agent should choose the action which maximizes its expected utility, given its knowledge  General principle for decision making  Often taken as the definition of rationality  Let’s decompress this definition…

5 Reminder: Expectations  Often a quantity of interest depends on a random variable  The expected value of a function is the average output, weighted by some distribution over inputs  Example: How late will I be?  Lateness is a function of traffic: L(T=none) = -10, L(T=light) = -5, L(T=heavy) = 15  What is my expected lateness?  Need to specify some belief over T to weight the outcomes  Say P(T) = {none: 2/5, light: 2/5, heavy: 1/5}  The expected lateness: E[L] = (2/5)(-10) + (2/5)(-5) + (1/5)(15) = -3, i.e., 3 minutes early on average

6 Expectations  Real valued functions of random variables:  Expectation of a function of a random variable: E[f(X)] = Σ_x P(X=x) f(x)  Example: Expected value of a fair die roll: X ∈ {1, …, 6}, P(X=x) = 1/6, f(x) = x, so E[f(X)] = (1+2+3+4+5+6)/6 = 3.5
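The two expectations above take only a few lines to check in code. This is just a sketch of the weighted-average computation; the function and variable names (expectation, traffic, lateness) are mine, not the course's.

```python
# A minimal sketch of E[f(X)] = sum_x P(X=x) f(x), with distributions written
# as plain {outcome: probability} dictionaries.

def expectation(dist, f):
    """Expected value of f(X) under the given distribution."""
    return sum(p * f(x) for x, p in dist.items())

# Expected lateness: L(none) = -10, L(light) = -5, L(heavy) = 15
traffic  = {'none': 2/5, 'light': 2/5, 'heavy': 1/5}
lateness = {'none': -10, 'light': -5, 'heavy': 15}
print(expectation(traffic, lateness.get))     # -3.0 (three minutes early on average)

# Expected value of a fair die roll
die = {x: 1/6 for x in range(1, 7)}
print(expectation(die, lambda x: x))          # 3.5
```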

7 Utilities  Utilities are functions from outcomes (states of the world) to real numbers that describe an agent’s preferences  Where do utilities come from?  In a game, may be simple (+1/-1)  Utilities summarize the agent’s goals  Theorem: any set of preferences between outcomes can be summarized as a utility function (provided the preferences meet certain conditions)  In general, utilities are determined by rewards, and behavior emerges from acting to maximize expected utility.

8 Expectimax Search  In expectimax search, we have a probabilistic model of how the opponent (or environment) will act  Model could be a simple uniform distribution (roll a die)  Model could be sophisticated and require a great deal of computation  We have a node for every outcome out of our control: opponent or environment  The model can predict that smart actions are likely!  For now, assume we are given P(A|S), a function which, for any state s, maps actions to probabilities: P(A=a|S=s)  Having a probabilistic belief about an agent’s action does not mean that agent is flipping any coins!
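As a concrete illustration of how chance nodes are averaged, here is a rough expectimax sketch over a hand-built two-ply tree. The tree, its probabilities, and the node encoding are invented for illustration; in the lecture's setting the chance-node weights would come from the model P(A=a|S=s).

```python
# Expectimax on a tiny tree: max nodes take the best child, chance nodes take
# the probability-weighted average of their children.

def expectimax_value(node):
    kind, data = node
    if kind == 'leaf':        # terminal: data is the utility of the outcome
        return data
    if kind == 'max':         # our move: data is a list of child nodes
        return max(expectimax_value(child) for child in data)
    if kind == 'chance':      # opponent/environment: data is (probability, child) pairs
        return sum(p * expectimax_value(child) for p, child in data)
    raise ValueError(f"unknown node kind: {kind}")

# Two available actions, each leading to a chance node over outcomes.
tree = ('max', [
    ('chance', [(0.5, ('leaf', 8)), (0.5, ('leaf', 2))]),    # expected value 5.0
    ('chance', [(0.9, ('leaf', 4)), (0.1, ('leaf', 20))]),   # expected value 5.6
])
print(expectimax_value(tree))   # 5.6
```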

9 Expectimax Evaluation  For minimax search, the evaluation function is insensitive to monotonic transformations  We just want better states to have higher evaluations (get the ordering right)  For expectimax, we need the scales to be meaningful as well  We need to know whether a 50/50 chance of A or B is better than C  100 or -10 vs 0 is different than 10 or -100 vs 0

10 Preferences  An agent chooses among:  Prizes: A, B, etc.  Lotteries: situations with uncertain prizes, written L = [p, A; (1-p), B]  Notation: A > B means the agent prefers A over B; A ~ B means the agent is indifferent between A and B

11 Rational Preferences  We want some constraints on preferences before we call them rational  For example: an agent with intransitive preferences can be induced to give away all its money  If B > C, then an agent with C would pay (say) 1 cent to get B  If A > B, then an agent with B would pay (say) 1 cent to get A  If C > A, then an agent with A would pay (say) 1 cent to get C

12 Rational Preferences  Preferences of a rational agent must obey constraints.  These constraints (plus one more) are the axioms of rationality  Theorem: Rational preferences imply behavior describable as maximization of expected utility

13 MEU Principle  Theorem:  [Ramsey, 1931; von Neumann & Morgenstern, 1944]  Given any preferences satisfying these constraints, there exists a real-valued function U such that: U(A) ≥ U(B) exactly when the agent prefers A to B (or is indifferent), and U([p1,S1; … ; pn,Sn]) = Σ_i p_i U(S_i)  Maximum expected utility (MEU) principle:  Choose the action that maximizes expected utility  Note: an agent can be entirely rational (consistent with MEU) without ever representing or manipulating utilities and probabilities  E.g., a lookup table for perfect tic-tac-toe, a reflex vacuum cleaner

14 Human Utilities  Utilities map states to real numbers. Which numbers?  Standard approach to assessment of human utilities:  Compare a state A to a standard lottery L_p between  “best possible prize” u+ with probability p  “worst possible catastrophe” u- with probability 1-p  Adjust lottery probability p until A ~ L_p  Resulting p is a utility in [0,1]

15 Utility Scales  Normalized utilities: u+ = 1.0, u- = 0.0  Micromorts: one-millionth chance of death, useful for paying to reduce product risks, etc.  QALYs: quality-adjusted life years, useful for medical decisions involving substantial risk  One year with good health = 1 QALY  Note: behavior is invariant under positive linear transformation, U’(x) = a U(x) + b with a > 0  With deterministic prizes only (no lottery choices), only ordinal utility can be determined, i.e., a total order on prizes

16 Example: Insurance  Consider the lottery [0.5,$1000; 0.5,$0]  What is its expected monetary value? ($500)  What is its certainty equivalent?  Monetary value acceptable in lieu of lottery  $400 for most people  Difference of $100 is the insurance premium  There’s an insurance industry because people will pay to reduce their risk  If everyone were risk-prone, no insurance needed!

17 Money  Money does not behave as a utility function  Given a lottery L:  Define expected monetary value EMV(L)  Usually U(L) < U(EMV(L))  I.e., people are risk-averse  Utility curve: for what probability p am I indifferent between:  A prize x  A lottery [p,$M; (1-p),$0] for large M?  Typical empirical data, extrapolated with risk-prone behavior  [Utility-of-money curve not reproduced here]

18 Example: Human Rationality?  Famous example of Allais (1953)  A: [0.8,$4k; 0.2,$0]  B: [1.0,$3k; 0.0,$0]  C: [0.2,$4k; 0.8,$0]  D: [0.25,$3k; 0.75,$0]  Most people prefer B > A, C > D  But if U($0) = 0, then  B > A implies U($3k) > 0.8 U($4k)  C > D implies 0.8 U($4k) > U($3k)  These two inequalities contradict each other, so no utility function is consistent with both preferences

19 Reinforcement Learning  Basic idea:  Receive feedback in the form of rewards  Agent’s utility is defined by the reward function  Must learn to act so as to maximize expected utility  Change the rewards, change the behavior  Examples:  Learning your way around, reward for reaching the destination.  Playing a game, reward at the end for winning / losing  Vacuuming a house, reward for each piece of dirt picked up  Automated taxi, reward for each passenger delivered DEMO

20 Markov Decision Processes  Markov decision processes (MDPs)  A set of states s ∈ S  A model T(s,a,s’) = P(s’ | s,a)  Probability that action a in state s leads to s’  A reward function R(s, a, s’) (sometimes just R(s) for leaving a state or R(s’) for entering one)  A start state (or distribution)  Maybe a terminal state  MDPs are the simplest case of reinforcement learning  In general reinforcement learning, we don’t know the model or the reward function
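To make the definition concrete, here is one possible way to write the pieces of a small MDP down in code. The tiny two-state "stay or quit" example is invented purely for illustration, as is the dictionary convention (T[(s, a)] maps each successor s' to P(s' | s, a), and R[(s, a, s')] gives the reward); the same convention is reused in the later sketches.

```python
# A hypothetical two-state MDP: keep playing for a small per-step reward, or
# quit for a lump-sum reward. Not from the course; just to show the data layout.

states  = ['in', 'done']
actions = {'in': ['stay', 'quit'], 'done': []}   # terminal state has no actions

T = {
    ('in', 'stay'): {'in': 0.9, 'done': 0.1},    # staying usually keeps us in the game
    ('in', 'quit'): {'done': 1.0},               # quitting always ends it
}
R = {
    ('in', 'stay', 'in'):   1.0,                 # reward for surviving a step
    ('in', 'stay', 'done'): 0.0,
    ('in', 'quit', 'done'): 5.0,                 # lump-sum for quitting
}
gamma = 0.9                                      # discount factor
start = 'in'
```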

21 Example: High-Low  Three card types: 2, 3, 4  Infinite deck, twice as many 2’s  Start with 3 showing  After each card, you say “high” or “low”  New card is flipped  If you’re right, you win the points shown on the new card  Ties are no-ops  If you’re wrong, game ends

22 High-Low  States: 2, 3, 4, done  Actions: High, Low  Model: T(s, a, s’):  P(s’=done | 4, High) = 3/4  P(s’=2 | 4, High) = 0  P(s’=3 | 4, High) = 0  P(s’=4 | 4, High) = 1/4  P(s’=done | 4, Low) = 0  P(s’=2 | 4, Low) = 1/2  P(s’=3 | 4, Low) = 1/4  P(s’=4 | 4, Low) = 1/4  … (similarly for states 2 and 3)  Rewards: R(s, a, s’):  Number shown on s’ if s ≠ s’  0 otherwise  Start: 3  Note: could choose actions with search. How?
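The High-Low model on this slide can be written out in the same dictionary convention as the sketch above. The helper below builds T and R from the deck probabilities P(2) = 1/2, P(3) = P(4) = 1/4; the function name and structure are mine.

```python
# Build the High-Low MDP: from a showing card s, guessing 'High' or 'Low' wins
# the value of the new card when correct, ties are no-ops, wrong guesses end the game.

CARD_P = {2: 1/2, 3: 1/4, 4: 1/4}   # twice as many 2's in the infinite deck

def highlow_mdp():
    states  = [2, 3, 4, 'done']
    actions = {2: ['High', 'Low'], 3: ['High', 'Low'], 4: ['High', 'Low'], 'done': []}
    T, R = {}, {}
    for s in (2, 3, 4):
        for a in ('High', 'Low'):
            dist = {}
            for card, p in CARD_P.items():
                correct = card > s if a == 'High' else card < s
                tie = (card == s)
                s2 = card if (correct or tie) else 'done'   # wrong guess ends the game
                dist[s2] = dist.get(s2, 0.0) + p
                if correct:
                    R[(s, a, card)] = card                  # win the points on the new card
            T[(s, a)] = dist
    return states, actions, T, R

states, actions, T, R = highlow_mdp()
print(T[(4, 'High')])   # {'done': 0.75, 4: 0.25}, matching the slide
print(T[(4, 'Low')])    # {2: 0.5, 3: 0.25, 4: 0.25}, matching the slide
```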

23 Elements of RL  Transition model, how action influences states  Reward R, immediate value of state-action transition  Policy π, maps states to actions  [Diagram: the agent-environment loop, in which the agent’s policy picks an action and the environment returns a new state and a reward]

24 MDP Solutions  In deterministic single-agent search, want an optimal sequence of actions from start to a goal  In an MDP, like expectimax, want an optimal policy π*(s)  A policy gives an action for each state  Optimal policy maximizes expected utility (i.e. expected rewards) if followed  Defines a reflex agent  [Figure: optimal policy when R(s, a, s’) = -0.04 for all non-terminals s]

25 Example Optimal Policies  [Figure: four grid policies, for R(s) = -2.0, R(s) = -0.4, R(s) = -0.03, and R(s) = -0.01]

26 Finding Optimal Policies Demo

27 Stationarity  In order to formalize optimality of a policy, need to understand utilities of reward sequences  Typically consider stationary preferences: if [r, r1, r2, …] is preferred to [r, r1’, r2’, …], then [r1, r2, …] is preferred to [r1’, r2’, …]  Theorem: only two ways to define stationary utilities  Additive utility: U([r0, r1, r2, …]) = r0 + r1 + r2 + …  Discounted utility: U([r0, r1, r2, …]) = r0 + γ r1 + γ² r2 + …
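Both definitions are easy to state in code; the short reward sequence below is made up just to show the difference between adding rewards and discounting them.

```python
# Additive vs. discounted utility of a reward sequence [r0, r1, r2, ...].

def additive_utility(rewards):
    return sum(rewards)

def discounted_utility(rewards, gamma):
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1, 0, 2, 2]
print(additive_utility(rewards))           # 5
print(discounted_utility(rewards, 0.5))    # 1 + 0 + 0.5 + 0.25 = 1.75
```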

28 Infinite Utilities?!  Problem: infinite state sequences with infinite rewards  Solutions:  Finite horizon:  Terminate after a fixed T steps  Gives nonstationary policy (π depends on time left)  Absorbing state(s): guarantee that for every policy, agent will eventually “die” (like “done” for High-Low)  Discounting: for 0 < γ < 1, U([r0, r1, …]) = Σ_t γ^t r_t ≤ R_max / (1-γ), so discounted utilities stay finite  Smaller γ means smaller horizon

29 How (Not) to Solve an MDP  The inefficient way:  Enumerate policies  For each one, calculate the expected utility (discounted rewards) from the start state  E.g. by simulating a bunch of runs  Choose the best policy  We’ll return to a (better) idea like this later

30 Utility of a State  Define the utility of a state under a policy: V^π(s) = expected total (discounted) rewards starting in s and following π  Recursive definition (one-step look-ahead): V^π(s) = Σ_s’ T(s, π(s), s’) [ R(s, π(s), s’) + γ V^π(s’) ]

31 Policy Evaluation  Idea one: turn recursive equations into updates  Idea two: it’s just a linear system, solve with Matlab (or Mosek, or Cplex)
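Here is a rough sketch of both ideas, written against the T/R dictionary convention used in the earlier sketches: repeated one-step updates, and a direct linear solve with numpy (standing in for Matlab). Function names and signatures are mine.

```python
import numpy as np

def evaluate_policy_iterative(states, T, R, policy, gamma, iters=1000):
    """Idea one: repeatedly apply the recursive V^pi equation as an update."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        V = {s: sum(p * (R.get((s, policy.get(s), s2), 0.0) + gamma * V[s2])
                    for s2, p in T.get((s, policy.get(s)), {}).items())
             for s in states}
    return V

def evaluate_policy_linear(states, T, R, policy, gamma):
    """Idea two: solve the linear system (I - gamma * P_pi) V = r_pi directly."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    P = np.zeros((n, n))
    r = np.zeros(n)
    for s in states:
        for s2, p in T.get((s, policy.get(s)), {}).items():
            P[idx[s], idx[s2]] += p
            r[idx[s]] += p * R.get((s, policy.get(s), s2), 0.0)
    V = np.linalg.solve(np.eye(n) - gamma * P, r)
    return dict(zip(states, V))
```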

32 Example: High-Low  Policy: always say “high”  Iterative updates: V_{i+1}(s) = Σ_s’ T(s, High, s’) [ R(s, High, s’) + γ V_i(s’) ], starting from V_0(s) = 0
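Tying the sketches together (assuming the highlow_mdp encoding and the evaluate_policy_linear function from the earlier sketches), evaluating the fixed "always say high" policy looks roughly like this; the printed values are what the linear solve should give for undiscounted (gamma = 1) High-Low, but treat them as a sanity check rather than course-provided numbers.

```python
states, actions, T, R = highlow_mdp()
always_high = {s: 'High' for s in (2, 3, 4)}

V = evaluate_policy_linear(states, T, R, always_high, gamma=1.0)
print(V)   # roughly {2: 4.17, 3: 1.33, 4: 0.0, 'done': 0.0}
```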

33 Optimal Utilities  Goal: calculate the optimal utility of each state V*(s) = expected (discounted) rewards with optimal actions  Why: Given optimal utilities, MEU tells us the optimal policy

34 Bellman’s Equation for Selecting Actions  Definition of utility leads to a simple relationship amongst optimal utility values: optimal rewards = maximize over the first action and then follow the optimal policy  Formally, Bellman’s equation: V*(s) = max_a Σ_s’ T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]

35 Practice: Computing Actions  Which action should we choose from state s:  Given optimal q-values Q: π*(s) = argmax_a Q*(s, a)  Given optimal values V: π*(s) = argmax_a Σ_s’ T(s, a, s’) [ R(s, a, s’) + γ V*(s’) ]
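In the dictionary convention from the earlier sketches, both cases are a few lines: with Q-values you take the argmax directly, and with V you first do a one-step look-ahead to recover Q. These helper names are mine.

```python
def q_from_v(s, a, T, R, V, gamma):
    """Q(s, a) recovered from state values by a one-step look-ahead."""
    return sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
               for s2, p in T.get((s, a), {}).items())

def greedy_action(s, actions, T, R, V, gamma):
    """pi(s) = argmax_a of the one-step look-ahead Q(s, a)."""
    return max(actions[s], key=lambda a: q_from_v(s, a, T, R, V, gamma))
```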

36 Example: GridWorld

37 Value Iteration  Idea:  Start with bad guesses at all utility values (e.g. V_0(s) = 0)  Update all values simultaneously using the Bellman equation (called a value update or Bellman update): V_{i+1}(s) ← max_a Σ_s’ T(s, a, s’) [ R(s, a, s’) + γ V_i(s’) ]  Repeat until convergence  Theorem: will converge to unique optimal values  Basic idea: bad guesses get refined towards optimal values  Policy may converge long before the values do
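A rough value-iteration sketch in the same T/R dictionary convention: apply the Bellman update to every state until the largest change falls below a tolerance. The stopping rule and function name are my own choices, not the course's.

```python
def value_iteration(states, actions, T, R, gamma, tol=1e-6):
    V = {s: 0.0 for s in states}                          # V_0(s) = 0 everywhere
    while True:
        new_V = {}
        for s in states:
            if not actions[s]:                            # terminal: no actions, value 0
                new_V[s] = 0.0
                continue
            new_V[s] = max(                               # Bellman update
                sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2, p in T[(s, a)].items())
                for a in actions[s])
        if max(abs(new_V[s] - V[s]) for s in states) < tol:
            return new_V
        V = new_V
```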

38 Example: Bellman Updates

39 Example: Value Iteration  Information propagates outward from terminal states and eventually all states have correct value estimates [DEMO]

40 Convergence*  Define the max-norm: ||V|| = max_s |V(s)|  Theorem: For any two approximations U and V, ||U_{i+1} - V_{i+1}|| ≤ γ ||U_i - V_i||  I.e. any two distinct approximations must get closer to each other, so, in particular, any approximation must get closer to the true values, and value iteration converges to a unique, stable, optimal solution  Theorem: if ||V_{i+1} - V_i|| < ε, then ||V_{i+1} - V*|| < ε γ / (1-γ)  I.e. once the change in our approximation is small, it must also be close to correct

41 Policy Iteration  Alternate approach:  Policy evaluation: calculate utilities for a fixed policy until convergence (remember the beginning of lecture)  Policy improvement: update policy based on resulting converged utilities  Repeat until policy converges  This is policy iteration  Can converge faster under some conditions
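A rough policy-iteration sketch, again in the same dictionary convention and assuming the evaluate_policy_linear function from the policy-evaluation sketch earlier: evaluate the current policy exactly, then improve it greedily, and stop when it no longer changes.

```python
def policy_iteration(states, actions, T, R, gamma):
    policy = {s: actions[s][0] for s in states if actions[s]}    # arbitrary initial policy
    while True:
        V = evaluate_policy_linear(states, T, R, policy, gamma)  # policy evaluation
        new_policy = {
            s: max(actions[s],
                   key=lambda a: sum(p * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                                     for s2, p in T[(s, a)].items()))
            for s in states if actions[s]
        }
        if new_policy == policy:                                 # policy improvement converged
            return policy, V
        policy = new_policy
```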

