
1 MDPs (cont) & Reinforcement Learning
Tamara Berg, CS 560 Artificial Intelligence
Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell, Andrew Moore, Percy Liang, Luke Zettlemoyer

2 Announcements
HW2 online: CSPs and Games. Due Oct 8, 11:59pm (start now!)
Mid-term exam next Wed, Sept 30, held during regular class time. Closed book; you may bring a calculator. Written questions with short answers (no coding).
Ric will lead an in-class mid-term review/exercise session on Sept 28.

3 Exam topics
1) Intro to AI, agents and environments: Turing test; rationality; expected utility maximization; PEAS; environment characteristics: fully vs. partially observable, deterministic vs. stochastic, episodic vs. sequential, static vs. dynamic, discrete vs. continuous, single-agent vs. multi-agent, known vs. unknown.
2) Search: search problem formulation (initial state, actions, transition model, goal state, path cost); state space graph; search tree; frontier, explored set; evaluation of search strategies (completeness, optimality, time complexity, space complexity); uninformed search strategies (breadth-first search, uniform cost search, depth-first search, iterative deepening search); informed search strategies (greedy best-first, A*, weighted A*); heuristics (admissibility, dominance).
As a reminder, these are the main topics we have covered in class so far. You should be able to define these terms and answer basic questions about them. For the most part, questions will test your knowledge of the concepts and algorithms we have covered, much as you would find in any similar class. For example, I could ask you to demonstrate what a particular algorithm would do on a problem, to compare and contrast two algorithms (e.g., compare two search algorithms in terms of completeness, optimality, etc.), or to prove that an algorithm is optimal. I might also give you an example (e.g., an agent in some particular environment) and ask you to formulate it as a search problem, i.e., write down the formulation in terms of initial state, actions, and so on.

4 Exam topics
3) Constraint satisfaction problems: backtracking search; heuristics (most constrained/most constraining variable, least constraining value); forward checking, constraint propagation, arc consistency; taking advantage of structure (connected components, tree-structured CSPs); local search; formulating photo ordering as a CSP.
4) Games: zero-sum games; game tree; minimax/expectimax/expectiminimax search; alpha-beta pruning; evaluation function; quiescence search; horizon effect; stochastic elements in games.
We also covered CSPs in detail, including algorithms for backtracking search as well as heuristics for selecting good variable and value orderings during search. I might also ask high-level questions about particular CSPs we saw in class, for example standard problems like map coloring, or the CSP that orders images in time using local search. Then we covered games, so you should understand algorithms for things like minimax search, alpha-beta pruning, and so on.

5 Exam topics
5) Markov decision processes: Markov assumption, transition model, policy; Bellman equation; value iteration; policy iteration.
6) Reinforcement learning: model-based vs. model-free approaches; passive vs. active; exploration vs. exploitation; direct estimation; TD learning; TD Q-learning; applications to backgammon, quadruped locomotion, helicopter flying.
Finally, last week we talked about MDPs, and this week we will finish up the search and planning portion of the course with reinforcement learning and its applications. Anything covered in lecture is fair game for exam questions. For example, I may ask a question or two about specific examples we have covered, like formulating a CSP to order images in time, but the bulk of the exam will focus on the general concepts and algorithms we have learned. I would recommend studying from the lecture slides and also making sure you have read the appropriate materials from the textbook.

6 Markov Decision Processes
Stochastic, sequential environments
As a reminder, last class we introduced MDPs: sequential decision problems where an agent acts in an environment, potentially forever, and its utility depends on the sequence of actions it takes and the rewards it receives. In particular, we are talking about stochastic environments, where the outcomes of actions are non-deterministic. In each state the agent receives some reward, and its total utility is the sum of discounted rewards received.
Image credit: P. Abbeel and D. Klein

7 Markov Decision Processes
Components:
States s, beginning with initial state s0
Actions a; each state s has actions A(s) available from it
Transition model P(s' | s, a). Markov assumption: the probability of going to s' from s depends only on s and a, not on any other past actions or states
Reward function R(s)
Policy π(s): the action that an agent takes in any given state; the "solution" to an MDP
What is the solution or goal of an MDP? Our goal is to learn a policy π. Q: what's a policy? The action that an agent takes in any given state. How is this related to the definition of games that we saw last time? A stochastic game with only one player is an example of an MDP, so you could use algorithms like expectimax. However, we will formalize these ideas as a Markov decision process, where we can handle intermediate rewards and infinite plans, and use more efficient algorithms to solve for the optimal policy.
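For concreteness, here is a minimal Python sketch of these components (not from the slides; the class name, the field layout, and the encoding of P(s' | s, a) as (probability, next state) pairs are illustrative choices):

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, List, Tuple

State = Hashable
Action = str

@dataclass
class MDP:
    states: List[State]                   # all states s
    actions: Dict[State, List[Action]]    # A(s): actions available from s
    transitions: Dict[Tuple[State, Action], List[Tuple[float, State]]]  # P(s' | s, a)
    reward: Dict[State, float]            # R(s)
    terminals: List[State] = field(default_factory=list)
    gamma: float = 0.9                    # discount factor (introduced later, slide 17)

    def A(self, s):
        """Actions available in s (none once a terminal state is reached)."""
        return [] if s in self.terminals else self.actions[s]

    def P(self, s, a):
        """Transition model: list of (probability, next_state) pairs for (s, a)."""
        return self.transitions[(s, a)]

    def R(self, s):
        """Reward for being in state s."""
        return self.reward[s]

# A policy is then just a mapping from states to actions, e.g. {s: 'N', ...}.
```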

8 Overview
First, we will look at how to "solve" MDPs, i.e., find the optimal policy when the transition model and the reward function are known.
Then, next week, we will consider reinforcement learning, where we don't know the rules of the environment or the consequences of our actions.

9 Grid world
Transition model: 0.8 chance of moving in the intended direction, 0.1 chance of slipping to each perpendicular direction (figure: 0.1 / 0.8 / 0.1 arrows). If the agent moves into a wall, it stays put.
R(s) = -0.04 for every non-terminal state
Let's look at another example MDP: grid world. Our agent lives in a grid of squares and can move between squares, except when walls block its path. What does stochastic mean here? The agent's actions are stochastic, i.e., they don't always go as planned: 80% of the time the action North takes the agent north (if there is no wall there), 10% of the time North takes the agent west, and 10% east. The other actions are defined similarly. If there is a wall in the direction the agent would have gone, the agent stays put. The agent receives a small reward at each step simply for being alive, and a big reward comes at the end if the agent ends up in particular grid locations (terminal states), here representing the game ending in death or winning.
Source: P. Abbeel and D. Klein
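A sketch of this transition model in code, reusing the MDP representation above (the grid coordinates and helper names are illustrative; repeated next-states, e.g. when two slips both hit a wall, are simply left as separate entries, which is harmless for computing expectations):

```python
# Grid-world moves: 0.8 intended direction, 0.1 to each perpendicular direction.
DIRS     = {'N': (0, 1), 'S': (0, -1), 'E': (1, 0), 'W': (-1, 0)}
LEFT_OF  = {'N': 'W', 'W': 'S', 'S': 'E', 'E': 'N'}
RIGHT_OF = {'N': 'E', 'E': 'S', 'S': 'W', 'W': 'N'}

def grid_transitions(s, a, free_cells):
    """P(s' | s, a) for cell s = (x, y); free_cells is the set of non-wall cells."""
    def move(direction):
        dx, dy = DIRS[direction]
        nxt = (s[0] + dx, s[1] + dy)
        return nxt if nxt in free_cells else s      # blocked by a wall: stay put
    return [(0.8, move(a)),                         # intended direction
            (0.1, move(LEFT_OF[a])),                # slip left of intended
            (0.1, move(RIGHT_OF[a]))]               # slip right of intended
```

For example, on a grid where (0, 1) and (1, 0) are free but (-1, 0) is off the grid, grid_transitions((0, 0), 'N', free_cells) returns [(0.8, (0, 1)), (0.1, (0, 0)), (0.1, (1, 0))].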

10 Goal: Policy Our goal is to determine the best policy for every state (what the agent should do in each grid square). Source: P. Abbeel and D. Klein

11 Grid world
Optimal policy when R(s) = -0.04 for every non-terminal state
So we want to specify, for each grid location, what action the agent should take under this transition model and reward function. For example, what should the agent do if it's in row 2, column 1? What if it's in row 3, column 2? And so on. It turns out this is the optimal policy for the grid world agent under this transition function and reward model.

12 Grid world
Optimal policies for other values of R(s):
The optimal policy varies depending on the particular reward function and transition function used. For R(s) < -1.6, the agent heads straight for the nearest terminal state, even the -1 state. For -0.4 < R(s) < -0.08, the agent takes the shortest route (even if that brings it close to the -1 terminal state). For -0.02 < R(s) < 0, the agent is more risk averse, moving away from the -1 terminal state and taking a safer route to the +1 terminal state. For R(s) > 0, what is the agent doing? Staying put, because just staying alive as long as possible gives the best reward.

13 Solving MDPs
MDP components: states s, actions a, transition model P(s' | s, a), reward function R(s)
The solution: policy π(s), a mapping from states to actions
How do we find the optimal policy?
So, as we said, an MDP has various components (states, actions, transition model, and reward function) that are specified a priori, and our job is to figure out the policy. But how do we find the optimal policy? Let's see.

14 In search problems we want to find a plan or sequence of actions from the start to the goal.
In an MDP we want an optimal policy – for each state, what action should the agent take. An optimal policy should maximize expected utility if followed.

15 So why not solve for the best policy with expectimax, as we saw in games? That's fine for a game where you start, and at some point someone wins, loses, or you have a draw. But what about agents acting in the real world, e.g., robots helping you around your house? What's different? The tree will usually be infinite, some states will appear over and over, and you also want to be able to handle intermediate rewards, not just rewards at the terminal nodes. So expectimax would not necessarily be the best way to go; let's try to come up with a different method.

16 Maximizing expected utility
The optimal policy should maximize the expected utility over all possible state sequences produced by following that policy.
How to define the utility of a state sequence? Sum of rewards of individual states. Problem: infinite state sequences.
Remember that in MDPs the world is stochastic, so executing a fixed policy can still produce many possible state sequences (the agent tries to go north and usually does, but sometimes ends up east or west, so different state sequences are produced under the same policy). An optimal policy should maximize the expected utility over all possible state sequences produced by following that policy: the sum, over all sequences produced by starting from the start state and following the policy, of Probability(sequence) * Utility(sequence). One question is how to define the utility of a state sequence. How about the sum of rewards of the individual states in the sequence? That sounds pretty good. What's the problem? What about infinite state sequences? The utility defined that way might be infinite, so we'd have no way of knowing which policy was good and which was bad. Eep!
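In symbols, the criterion described in these notes can be written as follows (a standard formulation, not copied from the slide image):

\pi^* = \arg\max_{\pi} \; E\big[\, U([s_0, s_1, s_2, \ldots]) \mid \pi \,\big] = \arg\max_{\pi} \sum_{\text{sequences}} P(\text{sequence} \mid \pi, s_0)\, U(\text{sequence})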

17 Utilities of state sequences
Normally, we would define the utility of a state sequence as the sum of the rewards of the individual states.
Problem: infinite state sequences.
Solution: discount the individual state rewards by a factor γ between 0 and 1.
Sooner rewards count more than later rewards; this makes sure the total utility stays bounded and helps the algorithms converge.
So normally we would define the utility of a state sequence as the sum of the rewards of the individual states, but a problem arises for infinite state sequences. What's a reasonable solution? Discount the individual state rewards by a factor γ between 0 and 1. The utility of a state sequence is then
U([s_0, s_1, s_2, \ldots]) = R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots = \sum_{t=0}^{\infty} \gamma^t R(s_t)
This discounting means that sooner rewards count more than later rewards (which makes sense in a real-world situation), it ensures that the total utility is bounded (not infinite), and it helps the algorithm converge when finding the optimal policy.
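The "stays bounded" point follows from the geometric series: assuming every reward is at most R_max and 0 ≤ γ < 1,

\sum_{t=0}^{\infty} \gamma^t R(s_t) \;\le\; \sum_{t=0}^{\infty} \gamma^t R_{\max} \;=\; \frac{R_{\max}}{1-\gamma}

so even an infinite state sequence has finite utility.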

18 Utilities of states
Expected utility obtained by policy π starting in state s:
U^{\pi}(s) = E\left[ \sum_{t=0}^{\infty} \gamma^t R(s_t) \right], where the expectation is over the state sequences produced by following π from s_0 = s.
The "true" utility (value) of a state, U(s), is the expected sum of discounted rewards if the agent executes an optimal policy starting in state s.
So recall that the expected utility obtained by a policy π starting in state s is the expression above, where the sequences we sum over are all the possible sequences produced by following π. The true utility of a state is the expected sum of discounted rewards if the agent executes an optimal policy starting in state s.

19 Finding the utilities of states
What is the expected utility of taking action a in state s? How do we choose the optimal action?
(Diagram: a max node for the state s, chance nodes for each action a, and branches to successor states s' with probability P(s' | s, a) and utility U(s').)
Here's an illustration of a little part of the world. Say the agent is in some state s. If it takes an action a, then based on the transition model it ends up in a state s' with probability P(s' | s, a), and s' has utility U(s'). What is the expected utility of taking action a in state s? The sum over successor states s' of P(s' | s, a) * U(s'). How do we choose the optimal action at s? The argmax over actions of the expected utility of each action. Why? Because our rational agent should choose the action that maximizes expected utility! So, if we knew the utilities of the states U, we could easily get the optimal policy (by taking this argmax for each state). But how do we calculate the utilities of the states? We can define a recursive expression for U(s) in terms of the reward for being in state s plus the discounted expected utility of the next state under the optimal policy. What is the recursive expression for U(s) in terms of the utilities of its successor states?
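The two quantities in these notes, written against the MDP sketch from slide 7 (expected_utility and greedy_action are illustrative names, not from the slides):

```python
def expected_utility(mdp, s, a, U):
    """Expected utility of taking action a in state s: sum over s' of P(s'|s,a) * U(s')."""
    return sum(p * U[s_next] for p, s_next in mdp.P(s, a))

def greedy_action(mdp, s, U):
    """Optimal action at s given utilities U: argmax over actions of expected utility."""
    return max(mdp.A(s), key=lambda a: expected_utility(mdp, s, a, U))
```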

20 The Bellman equation
Recursive relationship between the utilities of successive states:
U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')
Reading the right-hand side left to right: receive reward R(s), choose the optimal action a, end up in s' with probability P(s' | s, a), and get utility U(s') (discounted by γ).
This is the Bellman equation: a recursive relationship between the utilities of successive states. U(s) is the reward in s plus the discounted expected utility of the next state under the optimal policy.
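As a quick check on the formula, here is a worked one-step example with made-up numbers (the utilities, probabilities, γ = 0.9, and R(s) = -0.04 below are hypothetical): suppose the best action a from s reaches successors with utilities 0.8, 0.6, and 0.4 with probabilities 0.8, 0.1, and 0.1. Then

\sum_{s'} P(s' \mid s, a)\, U(s') = 0.8(0.8) + 0.1(0.6) + 0.1(0.4) = 0.74

U(s) = R(s) + \gamma \times 0.74 = -0.04 + 0.9 \times 0.74 = 0.626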

21 The Bellman equation
Recursive relationship between the utilities of successive states:
U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U(s')
For N states, we get N equations in N unknowns; solving them solves the MDP.
Two methods: value iteration and policy iteration.
If we have N states, then we have N equations in N unknowns, and solving them solves the MDP. If these equations were linear, we could just solve for the unknowns with standard solvers, but since max is non-linear we take an iterative approach. There are two iterative methods for solving the Bellman equations, called value iteration and policy iteration.

22 Method 1: Value iteration
Start out with every U(s) = 0. Iterate until convergence.
During the i-th iteration, update the utility of all states (simultaneously) according to this rule:
U_{i+1}(s) \leftarrow R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a)\, U_i(s')
In the limit of infinitely many iterations, this is guaranteed to find the correct utility values; in practice, we don't need an infinite number of iterations.
Value iteration initializes every U(s) to 0, then iterates until convergence: at the i-th iteration, update the utility of each state according to the rule above, where the update is applied simultaneously to all states. We keep updating the utilities until they converge. In the limit this is guaranteed to find the correct utility values; in practice you don't need an infinite number of iterations to reach equilibrium.
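A sketch of value iteration in code, reusing the MDP class and the expected_utility / greedy_action helpers from the earlier sketches (the stopping test and epsilon are common practical choices, not taken from the slides):

```python
def value_iteration(mdp, epsilon=1e-4):
    """Repeatedly apply the Bellman update until the utilities barely change."""
    U = {s: 0.0 for s in mdp.states}                 # start with every U(s) = 0
    while True:
        U_next, delta = {}, 0.0
        for s in mdp.states:
            if s in mdp.terminals:
                U_next[s] = mdp.R(s)                 # terminal: utility is its reward
            else:
                U_next[s] = mdp.R(s) + mdp.gamma * max(
                    expected_utility(mdp, s, a, U) for a in mdp.A(s))
            delta = max(delta, abs(U_next[s] - U[s]))
        U = U_next                                   # simultaneous update of all states
        if delta < epsilon:                          # "converged" for practical purposes
            return U

def extract_policy(mdp, U):
    """Read the optimal policy off the converged utilities via one-step lookahead."""
    return {s: greedy_action(mdp, s, U) for s in mdp.states if s not in mdp.terminals}
```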

23 Value iteration Run value iteration on the following grid-world:
Transition model: 0.7 chance of going in desired direction, 0.1 chance of going in any of the other 3 directions. If agent moves into a wall, it stays put. R(s)= for all non-terminal states Find a partner. Try to run the value iteration algorithm for a few iterations on the following grid-world MDP.

24 Value iteration Run value iteration on the following grid-world:
Transition model: 0.7 chance of going in desired direction, 0.1 chance of going in any of the other 3 directions. If agent moves into a wall, it stays put. R(s)= for all non-terminal states Here’s the initialization. Now, compute the value of each state over 3 more iterations…
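After trying a few iterations by hand, one possible way to check your answer is to run the value-iteration sketch from slide 22. The slide's actual grid layout and step reward did not survive the transcript, so the 3x3 grid, the single +1 terminal, the -0.04 step reward, and the 0.9 discount below are all hypothetical stand-ins; only the 0.7 / 0.1 / 0.1 / 0.1 transition model comes from the slide.

```python
free = {(x, y) for x in range(3) for y in range(3)}   # hypothetical 3x3 grid, no walls
terminals = [(2, 2)]                                   # hypothetical +1 terminal state

def exercise_transitions(s, a):
    """0.7 intended direction, 0.1 each for the other three; walls mean staying put."""
    def move(direction):
        dx, dy = DIRS[direction]
        nxt = (s[0] + dx, s[1] + dy)
        return nxt if nxt in free else s
    return [(0.7, move(a))] + [(0.1, move(d)) for d in DIRS if d != a]

mdp = MDP(
    states=sorted(free),
    actions={s: list(DIRS) for s in free},
    transitions={(s, a): exercise_transitions(s, a) for s in free for a in DIRS},
    reward={s: (1.0 if s in terminals else -0.04) for s in free},  # -0.04 is a guess
    terminals=terminals,
    gamma=0.9,                                          # assumed discount
)
U = value_iteration(mdp)
print(extract_policy(mdp, U))
```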

25 Value iteration
Value iteration demo
Here's a Java applet demo of value iteration. You can see that, over the iterations, the values from the terminal state spread to other parts of the grid.

26 Basic idea: approximations get refined towards optimal values
Policy may converge long before values do
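One way to see this in code (a sketch reusing the value-iteration helpers above): stop as soon as the greedy policy read off the current utilities stops changing, rather than waiting for the values themselves to settle. In practice one might require the policy to stay unchanged for several sweeps before trusting it.

```python
def value_iteration_policy_stop(mdp, max_iters=1000):
    """Bellman updates, but stop once the greedy policy stops changing between sweeps."""
    U = {s: 0.0 for s in mdp.states}
    policy = extract_policy(mdp, U)
    for _ in range(max_iters):
        # Simultaneous Bellman update (the comprehension reads the old U throughout).
        U = {s: mdp.R(s) if s in mdp.terminals
                else mdp.R(s) + mdp.gamma * max(
                    expected_utility(mdp, s, a, U) for a in mdp.A(s))
             for s in mdp.states}
        new_policy = extract_policy(mdp, U)
        if new_policy == policy:          # policy stable even though values still move
            return U, new_policy
        policy = new_policy
    return U, policy
```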


Download ppt "MDPs (cont) & Reinforcement Learning"

Similar presentations


Ads by Google