Decision Theoretic Planning

Decisions Under Uncertainty  Some areas of AI (e.g., planning) focus on decision making in domains where the environment is understood with certainty  What if an agent has to make decisions in a domain that involves uncertainty?  An agent’s decision will depend on: what actions are available. They often don’t have deterministic outcome what beliefs the agent has over the world the agent’s goals and preferences

Decisions Under Uncertainty  In CPSC 322, you have covered How to express agent’s preferences in terms of utility functions One-off decisions Started looking at sequential decision problems  In this section of the course we will look in more detail at sequential decision problems

Overview  Decision processes and Markov Decision Processes (MDP)  Rewards and Optimal Policies  Defining features of Markov Decision Process  Solving MDPs Value Iteration Policy Iteration  POMDPs

Decision Processes  Often an agent needs to decide how to act in situations that involve sequences of decisions The agent’s utility depends upon the final state reached, and the sequence of actions taken to get there  Would like to have an ongoing decision process. At any stage of the process The agent decides which action to perform The new state of the world depends probabilistically upon the previous state as well as the action performed The agent receives rewards or punishments at various points in the process  Aim: maximize the reward received

Example  Agent moves in the above grid via actions Up, Down, Left, Right  Each action has 0.8 probability to reach its intended effect 0.1 probability to move at right angles of the intended direction If the agents bumps into a wall, it says there

Example

- Can the sequence [Up, Up, Right, Right, Right] take the agent to the terminal state (4,3)?
- Can the sequence reach the goal in any other way?

Example  Can the sequence [Up, Up, Right, Right, Right] take the agent in terminal state (4,3)? Yes, with probability =  Can the sequence reach the goal in any other way? yes, going the other way around with prob x0.8

Markov Decision Processes (MDP)
- We will focus on decision processes that can be represented as Markovian (as in Markov models)
  - Actions have probabilistic outcomes that depend only on the current state
  - Let s_t be the state at time t; then
    P(s_{t+1} | s_0, a_0, ..., s_t, a_t) = P(s_{t+1} | s_t, a_t)
- The process is stationary if P(s_{t+1} | s_t, a_t) is the same for every t
  - In that case we write P(s' | s, a) for the probability that the agent will be in state s' after doing action a in state s
- We also need to specify the utility function for the agent
  - It depends on the sequence of states involved in a decision (the environment history), rather than on a single final state
  - It is defined based on a reward R(s) that the agent receives in each state s, which can be negative (punishment)

MDP specification  For an MDP you specify: set S of states, set A of actions Initial state s 0 the process’ dynamics (or transition model) P(s'|s,a) The reward function R(s, a,, s’) describing the reward that the agent receives when it performs action a in state s and ends up in state s’ We will use R(s) when the reward depends only on the state s and not on how the agent got there

MDP specification  An MDP augments a Markov chain with actions and values

Overview  Decision processes and Markov Decision Processes (MDP)  Rewards and Optimal Policies  Defining features of Markov Decision Process  Solving MDPs Value Iteration Policy Iteration  POMDPs

Solving MDPs  In search problems, aim is to find an optimal state sequence  In MDPs, aim is to find an optimal policy π(s) A policy π(s) specifies what the agent should do for each state s Because of the stochastic nature of the environment, a policy can generate a set of environment histories (sequences of states) with different probabilities  Optimal policy maximizes the expected total reward, where the expectation is take over the set of possible state sequences generated by the policy Each state sequence associated with that policy has a given amount of total reward Total reward is a function of the rewards of its individual states (we’ll see how)

Optimal Policy in our Example
- Let's suppose that, in our example, the total reward of an environment history is simply the sum of the individual rewards
  - For instance, with a penalty of -0.04 in non-terminal states, reaching (4,3) in 10 steps gives a total reward of 1 + 10 × (-0.04) = 0.6
  - The penalty is designed to make the agent go for shorter solution paths

Rewards and Optimal Policy
- Optimal policy when the penalty in non-terminal states is -0.04
- Note that here the cost of taking steps is small compared to the cost of ending up in (4,2)
  - Thus, the optimal policy for state (3,1) is to take the long way around the obstacle rather than risk falling into (4,2) by taking the shorter way that passes next to it
- But the optimal policy may change if the reward in the non-terminal states (let's call it r) changes

Rewards and Optimal Policy
- Optimal policy when r < -1.6284
- Why is the agent heading straight into (4,2) from its surrounding states?

Rewards and Optimal Policy
- Optimal policy when r < -1.6284
- The cost of taking a step is so high that the agent heads straight for the nearest terminal state, even if this is (4,2) (reward -1)

Rewards and Optimal Policy
- Optimal policy when -0.4278 < r < -0.0850
- The cost of taking a step is high enough to make the agent take the shortcut to (4,3) from (3,1)

Rewards and Optimal Policy
- Optimal policy when -0.0221 < r < 0
- Why is the agent heading straight into the obstacle from (3,2)?

Rewards and Optimal Policy
- Optimal policy when -0.0221 < r < 0
- Staying longer in the grid is not penalized as much as before, so the agent is willing to take longer routes to avoid (4,2)
  - This is true even when it means banging against the obstacle a few times when moving from (3,2)

Rewards and Optimal Policy
- Optimal policy when r > 0
- What happens when the agent is rewarded for every step it takes?
  (figure label: state where every action belongs to an optimal policy)

Rewards and Optimal Policy
- Optimal policy when r > 0
- What happens when the agent is rewarded for every step it takes?
  - It is basically rewarded for sticking around
  - The only actions that need to be specified are the ones in states adjacent to the terminal states: take the agent away from them
  (figure label: state where every action belongs to an optimal policy)

Overview  Decision processes and Markov Decision Processes (MDP)  Rewards and Optimal Policies  Defining features of Markov Decision Process  Solving MDPs Value Iteration Policy Iteration  POMDPs

Planning Horizons
- The planning horizon defines the timing for decision making
  - Indefinite horizons: the decision process ends, but it is unknown when
    - In the previous example, the process ends when the agent enters one of the two terminal (absorbing) states
  - Infinite horizons: the process never halts
    - e.g., if we change the previous example so that, when the agent enters one of the terminal states, it is flung back to one of the two left corners of the grid
  - Finite horizons: the process must end at a specific time N

Finite Planning Horizons
- Decision processes with a finite planning horizon are especially tricky to handle
- Assume that, in the previous example with r = -0.04, the agent starts at (3,1)
  - If it has a limit of 3 steps that it can take, it can only reach the high-reward state (4,3) if it goes straight to it: Up is the optimal action here
  - If it has a limit of 100 steps, there is plenty of time to take the safe action of going Left, as it did in our initial example

Finite Planning Horizons  So the optimal action in a given state can change overtime: the optimal policy is non-stationary  If there is no time limit, there is no reason to behave differently in the same state at different times: optimal policy is stationary we will only consider processes with stationary policies here

Information Availability
- What information is available when the agent decides what to do?
- Fully-observable MDP (FOMDP): the agent gets to observe the current state s_t when deciding on action a_t
- Partially-observable MDP (POMDP): the agent can't observe s_t directly; it can only use sensors to get information about the state
  - Similar to Hidden Markov Models
- We will first look at FOMDPs (but we will call them simply MDPs)

Types of Reward Functions
- Suppose the agent goes through states s_1, s_2, ..., s_k and receives rewards r_1, r_2, ..., r_k
- We will look at two ways to define the reward for this sequence, i.e., its utility for the agent

Discounted Reward  If we are dealing with infinite horizon, the environment histories can be infinitely long their values will generally be infinite if we use additive rewards -> hard to compare  The problem disappears if we use a discounted reward with γ < 1 and reward for each state bounded by R max using the standard formula for an infinite geometric series  Thus, with discounted reward, the utility of an infinite sequence is finite

Discounted Reward  The discount factor γ determines the preference of the agent for current rewards vs. future rewards  The closer γ is to 0, the less importance is given to future rewards We could see the resulting policies as being “short sighted”, i.e. not taking much into account the consequences of the agent’s action in the distant future For γ = 1, discounted rewards are equivalent to additive rewards

Total (additive) Reward
- If the environment includes terminal states and the agent is guaranteed to reach one, the environment histories are guaranteed to be finite, and so are their values
  - It is fine to use an additive reward function
  - It allows for "far sighted" policies that value rewards in the future as much as rewards in the present
- A policy guaranteed to reach a terminal state is a proper policy
- In environments with terminal states but improper policies (i.e., policies that never reach them), we can't use an additive reward function
  - An improper policy can gain infinite reward just by staying away from the terminal states
- It is generally safer to use discounted reward unless a finite horizon is guaranteed
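For instance (my own illustration, not from the slides), with a positive per-step reward r in non-terminal states, an improper policy that never enters a terminal state gets unbounded additive reward, while its discounted reward stays finite:

```latex
\underbrace{\sum_{i=1}^{\infty} r = \infty}_{\text{additive reward}}
\qquad\text{vs.}\qquad
\underbrace{\sum_{i=1}^{\infty} \gamma^{\,i-1} r = \frac{r}{1-\gamma} < \infty}_{\text{discounted reward, } 0 \le \gamma < 1}
```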

Example of Improper Policy

A Few More Definitions  A stationary policy is a function: For every state s, л(s) specifies which action should be taken in that state.  Utility of a state Defined as the expected utility of the state sequences that may follow it These sequences depend on the policy л being followed, so we define the utility of a state s with respected to л (a.k.a value of л for that state) as Where S i is the random variable representing the state the agent reaches after executing л for i steps starting from s

A Few More Definitions  An optimal policy is one with maximum expected discounted reward for every state. A policy л* is optimal if there is no other policy л’ and state s such as U п’ (s) > U п* (s) That is an optimal policy gives the Maximum Expected Utility (MEU)  For a fully-observable MDP with stationary dynamics and rewards with infinite or indefinite horizon, there is always an optimal stationary policy.