Presentation transcript: "Planning in MDPs" (S&B: Sec 3.6; Ch. 4)

1 Planning in MDPs S&B: Sec 3.6; Ch. 4

2 Administrivia
Reminder: Final project proposal due this Friday
If you haven’t talked to me yet, you still have the chance!
HW2, R2 returned today

3 MDPs defined
Full definition: A Markov decision process (MDP), M, is a model of a stochastic, dynamic, controllable, rewarding process given by: M = 〈S, A, T, R〉
S: state space
A: action space
T: transition function
R: reward function
For most of RL, we’ll assume the agent is living in an MDP
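To make the four components concrete, here is a minimal sketch (not from the slides) of how a small discrete MDP could be stored as arrays; the 2-state, 2-action numbers are made up purely to show the shapes involved.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class MDP:
    """M = <S, A, T, R> with discrete state and action spaces."""
    n_states: int     # |S|; states indexed 0 .. n_states-1
    n_actions: int    # |A|; actions indexed 0 .. n_actions-1
    T: np.ndarray     # T[s, a, s'] = Pr(s' | s, a), shape (|S|, |A|, |S|)
    R: np.ndarray     # R[s] = reward in state s ("state-based" rewards), shape (|S|,)

# Made-up toy MDP, for illustration only.
T = np.array([[[0.9, 0.1],    # from s0, action a0
               [0.2, 0.8]],   # from s0, action a1
              [[0.0, 1.0],    # from s1, action a0
               [0.5, 0.5]]])  # from s1, action a1
R = np.array([0.0, 1.0])
M = MDP(n_states=2, n_actions=2, T=T, R=R)

# Each T[s, a, :] must be a probability distribution over next states.
assert np.allclose(M.T.sum(axis=2), 1.0)
```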

4 Exercise
Given the following tasks, describe the corresponding MDP: what are the state space, action space, transition function, and reward function? How many states/actions are there? How many policies are possible?
Flying an airplane and trying to get from point A to B
Flying an airplane and trying to emulate recorded human behaviors
Delivering a set of packages to buildings on UNM campus
Winning at the stock market

5 The true meaning of Value
First we said: value is the sum of reward experienced along a history, under a fixed policy (possibly with discounting for infinite processes)
Now we know: a fixed policy + a stochastic environment => a distribution over histories
So what’s value?

6 The true meaning of Value
Definition: the expected value, V, is the expected aggregate reward, averaged across all possible histories h, weighted by their probabilities:
Vπ(s) = Σh Pr(h | q1 = s, π) V(h)
Note: this may be an infinite sum
V(h) itself may also involve an infinite sum (bummer...)
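Taken literally, this definition can be turned into a brute-force computation: enumerate histories, weight each history’s discounted reward by its probability, and sum. The sketch below does this for a finite horizon under a fixed deterministic policy; the function name, the discount gamma, and the horizon cutoff are illustrative assumptions, and this is only feasible for tiny MDPs.

```python
import numpy as np
from itertools import product

def value_by_enumeration(T, R, pi, s0, gamma=0.9, horizon=4):
    """Brute-force V(s0) under policy pi: sum over all length-`horizon`
    histories of Pr(history | s0, pi) * (discounted reward along it)."""
    n_states = T.shape[0]
    total = 0.0
    for hist in product(range(n_states), repeat=horizon):  # every possible state sequence
        prob = 1.0
        reward = R[s0]                        # reward collected in the start state
        prev = s0
        for t, s in enumerate(hist):
            prob *= T[prev, pi[prev], s]      # probability of this step under pi[prev]
            reward += gamma ** (t + 1) * R[s]
            prev = s
        total += prob * reward                # weight the history's reward by its probability
    return total
```

Truncating at a finite horizon sidesteps the infinite sums the slide warns about; the Bellman equation later in the lecture gives an exact alternative.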

7 Histories & dynamics
Fixed policy π
Pr({s2, s5, s8} | q1 = s1, π) = 0.0138    V({s2, s5, s8}) = +1.9
Pr({s4, s11, s9} | q1 = s1, π) = 0.0379   V({s4, s11, s9}) = -2.3

8 Maximizing reward
Which policy should the agent prefer?
Definition: The principle of maximum expected utility says that an agent should act so as to obtain the maximum reward over time, on average.
Because the dynamics of the environment are stochastic, no specific reward can be guaranteed
Note: other optimality criteria are possible (e.g., risk aversion); this is the easiest to work with

9 The RL problem
Now the RL question is concrete: given an agent acting in an MDP, M, how can the agent learn π* (or a good approximation of it) as quickly as possible?
(More generally: while receiving as much reward along the way as possible)

10 Planning & Learning
Need to distinguish when the agent knows the full MDP, M, vs. has to learn M from experience:
Definition: In the reinforcement learning problem, the agent is acting in an initially unknown MDP
Needs experience tuples from the environment
Equivalent to: learn class models (PDFs), then figure out the best decision surface
Definition: In the planning problem, the agent is acting in a fully known MDP
No experience needed! Can figure out π* purely by thinking about it
Equivalent to: given known class models, figure out the best decision surface

11 Planning
Planning is a useful subroutine for learning, so we’ll do it first
Goal: given a known MDP, M = 〈S, A, T, R〉, find π*
2 sub-problems:
Evaluate a single policy
Search through the space of all possible policies
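The second sub-problem can, in principle, be handled by brute force: enumerate all |A|^|S| deterministic policies, evaluate each one, and keep the best. The sketch below is an illustrative assumption rather than anything from the slides; `evaluate` stands for any policy-evaluation routine (such as the matrix solve sketched after the Bellman-equation slide), and comparing policies by their value at an assumed start state s0 is a simplification. Only practical for very small MDPs.

```python
import numpy as np
from itertools import product

def brute_force_policy_search(T, R, evaluate, s0=0, gamma=0.9):
    """Enumerate all |A|^|S| deterministic policies, evaluate each with the
    supplied policy-evaluation routine, and keep the one whose value at the
    (assumed) start state s0 is highest."""
    n_states, n_actions = T.shape[0], T.shape[1]
    best_pi, best_value = None, -np.inf
    for choice in product(range(n_actions), repeat=n_states):
        pi = np.array(choice)                  # pi[s] = action taken in state s
        V = evaluate(T, R, pi, gamma=gamma)    # e.g. iterative or matrix-based evaluation
        if V[s0] > best_value:
            best_pi, best_value = pi, V[s0]
    return best_pi, best_value
```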

12 How good is that plan?
Policy evaluation: given M = 〈S, A, T, R〉 and π, find Vπ(si) for all i
Recall: V is the average over all histories: Vπ(s) = Σh Pr(h | q1 = s, π) V(h)
How to calculate this? Start from the definition of V(h) as the (discounted) sum of rewards along the history h
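One standard way to compute Vπ without summing over histories is iterative policy evaluation (S&B Ch. 4): start from an arbitrary guess and repeatedly apply the recursive equation from the Bellman-equation slide as an update until the values stop changing. A sketch, assuming the T[s, a, s′] / R[s] array layout from the earlier example, a deterministic policy pi, and a discount gamma:

```python
import numpy as np

def policy_evaluation(T, R, pi, gamma=0.9, tol=1e-8):
    """Iteratively apply V(s) <- R(s) + gamma * sum_s' T(s, pi(s), s') V(s')."""
    n_states = T.shape[0]
    T_pi = T[np.arange(n_states), pi, :]   # T_pi[s, s'] = T(s, pi(s), s')
    V = np.zeros(n_states)                 # arbitrary initial guess
    while True:
        V_new = R + gamma * T_pi @ V       # one Bellman backup for every state
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```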

13 Policy evaluation

15 The Bellman equation
The final recursive equation is known as the Bellman equation:
Vπ(s) = R(s) + γ Σs′ T(s, π(s), s′) Vπ(s′)
The unique solution to this equation gives the value of a fixed policy π when operating in a known MDP M = 〈S, A, T, R〉
When the state/action spaces are discrete, we can think of Vπ and R as vectors and Tπ as a matrix, and get the matrix equation:
Vπ = R + γ Tπ Vπ
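Because the matrix equation is linear in Vπ, it can be solved exactly: Vπ = R + γ Tπ Vπ rearranges to the linear system (I - γ Tπ) Vπ = R. A sketch under the same array-layout assumptions as the earlier examples (γ < 1 keeps the system solvable):

```python
import numpy as np

def solve_bellman(T, R, pi, gamma=0.9):
    """Solve the matrix Bellman equation (I - gamma * T_pi) V = R exactly."""
    n_states = T.shape[0]
    T_pi = T[np.arange(n_states), pi, :]   # T_pi[s, s'] = T(s, pi(s), s')
    return np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)
```

For example, solve_bellman(M.T, M.R, pi=np.array([0, 1])) evaluates one particular deterministic policy in the toy MDP sketched earlier.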

16 Exercise
Solve the matrix Bellman equation (i.e., find Vπ)
I formulated the Bellman equations for “state-based” rewards, R(s)
Formulate & solve the B.E. for “state-action” rewards, R(s,a), and “state-action-state” rewards, R(s,a,s′)
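For the last part, one possible approach (a sketch, not the solution given in class): under a fixed policy π, state-action and state-action-state rewards collapse into an expected per-state reward, e.g. Rπ(s) = Σs′ T(s, π(s), s′) R(s, π(s), s′) for the R(s,a,s′) case (and simply Rπ(s) = R(s, π(s)) for R(s,a)); the matrix equation then has the same form with R replaced by Rπ.

```python
import numpy as np

def solve_bellman_sas(T, R_sas, pi, gamma=0.9):
    """Policy evaluation with R(s, a, s') rewards, shape (|S|, |A|, |S|)."""
    n_states = T.shape[0]
    idx = np.arange(n_states)
    T_pi = T[idx, pi, :]                               # T_pi[s, s'] = T(s, pi(s), s')
    R_pi = np.sum(T_pi * R_sas[idx, pi, :], axis=1)    # expected one-step reward under pi
    return np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
```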

