MDP Presentation CS594 Automated Optimal Decision Making Sohail M Yousof Advanced Artificial Intelligence

Topic Planning and Control in Stochastic Domains With Imperfect Information

Objective Markov Decision Processes (Sequences of decisions) – Introduction to MDPs – Computing optimal policies for MDPs

Markov Decision Process (MDP) Sequential decision problems under uncertainty – Not just the immediate utility, but the longer-term utility as well – Uncertainty in outcomes Roots in operations research; also used in economics, communications engineering, ecology, performance modeling and, of course, AI! – Also referred to as stochastic dynamic programs

Markov Decision Process (MDP) Defined as a tuple: – S: State – A: Action – P: Transition function Table P(s’| s, a), prob of s’ given action “a” in state “s” – R: Reward R(s, a) = cost or reward of taking action a in state s Choose a sequence of actions (not just one decision or one action) – Utility based on a sequence of decisions
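
To make the tuple concrete in code, here is a minimal sketch of one possible container for a finite MDP (S, A, P, R). The class and field names are our own illustrative choices, not something from the presentation or from any particular library.

```python
from dataclasses import dataclass

@dataclass
class MDP:
    """A finite MDP given as the tuple (S, A, P, R)."""
    states: list        # S: finite set of states
    actions: list       # A: finite set of actions
    transitions: dict   # P: (s, a) -> {s2: P(s2 | s, a)}
    rewards: dict       # R: (s, a) -> immediate reward or cost

    def p(self, s2, s, a):
        """P(s' | s, a); unlisted successors have probability 0."""
        return self.transitions.get((s, a), {}).get(s2, 0.0)

    def r(self, s, a):
        """R(s, a)."""
        return self.rewards[(s, a)]
```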

Example: What SEQUENCE of actions should our agent take? (Grid-world figure: a Start cell, a blocked cell, and labeled reward cells, one of them +1.) Each action costs –1/25. The agent can take actions N, E, S, W and faces uncertainty in every state.

MDP Tuple: S: state of the agent on the grid, e.g., (4,3) – Note that cells are denoted by (x, y) A: actions of the agent, i.e., N, E, S, W P: transition function – Table P(s' | s, a), prob of s' given action "a" in state "s" – E.g., P( (4,3) | (3,3), N ) = 0.1 – E.g., P( (3,2) | (3,3), N ) = 0.8 – (Robot movement, uncertainty of another agent's actions, …) R: reward (more comments on the reward function later) – R( (3,3), N ) = –1/25 – R( (4,1) ) = +1
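
Continuing the illustrative container sketched two slides back, the specific numbers quoted here could be entered as follows. Only the entries stated on the slide are filled in; the rest of the transition and reward tables is omitted, so this is a fragment, not a full grid-world model.

```python
# Reuses the MDP dataclass from the earlier sketch; only the slide's numbers appear here.
grid = MDP(
    states=[(x, y) for x in range(1, 5) for y in range(1, 4)],   # 4 x 3 grid of (x, y) cells
    actions=["N", "E", "S", "W"],
    transitions={
        ((3, 3), "N"): {(4, 3): 0.1, (3, 2): 0.8},   # remaining probability mass not stated on the slide
    },
    rewards={
        ((3, 3), "N"): -1 / 25,
    },
)

print(grid.p((3, 2), (3, 3), "N"))   # -> 0.8
print(grid.r((3, 3), "N"))           # -> -0.04
```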

Terminology Before describing policies, let's go through some terminology that will be useful throughout this set of lectures. Policy: a complete mapping from states to actions.

MDP Basics and Terminology An agent must make decisions to control a probabilistic system; the goal is to choose a sequence of actions for optimality. MDP models are distinguished by their optimality criterion: – Finite horizon: maximize the expected reward over the next n steps – Infinite horizon: maximize the expected discounted reward – Average reward: maximize the average expected reward per transition – Goal state: maximize the expected reward (minimize the expected cost) of reaching some target state G
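
As a small illustration of the infinite-horizon discounted criterion, the value of a reward sequence r_0, r_1, r_2, … is Σ_t γ^t r_t. The sketch below just computes that sum; the reward sequence and discount factor are made up for the example.

```python
# Discounted return of a (truncated) reward sequence; the numbers are invented.
gamma = 0.9
rewards = [1.0, 0.0, 0.0, 5.0, 1.0]

discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
print(discounted_return)   # 1.0 + 0.9**3 * 5.0 + 0.9**4 * 1.0 ≈ 5.30
```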

Reward Function According to Chapter 2, the reward is directly associated with the state – Denoted R(I) – This simplifies computations seen later in the algorithms presented Sometimes the reward is assumed to be associated with a (state, action) pair – R(S, A) – We could also assume a mix of R(S, A) and R(S) Sometimes the reward is associated with a (state, action, destination-state) triple – R(S, A, J) – R(S, A) = Σ_J R(S, A, J) * P(J | S, A)
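
The last line says a destination-dependent reward can always be collapsed into the R(S, A) form by taking its expectation over successor states. A minimal sketch of that conversion, with invented tables:

```python
# Collapse R(S, A, J) into R(S, A) = sum_J R(S, A, J) * P(J | S, A).
# The transition and reward entries below are invented for illustration.
P = {("s0", "a"): {"s1": 0.7, "s2": 0.3}}
R_saj = {("s0", "a", "s1"): 10.0, ("s0", "a", "s2"): -2.0}

def r_sa(s, a):
    """Expected immediate reward of taking action a in state s."""
    return sum(p * R_saj[(s, a, j)] for j, p in P[(s, a)].items())

print(r_sa("s0", "a"))   # 0.7 * 10.0 + 0.3 * (-2.0) = 6.4
```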

Markov Assumption Markov Assumption: transition probabilities (and rewards) from any given state depend only on that state and not on the previous history. Where you end up after an action depends only on the current state – Named after the Russian mathematician A. A. Markov (1856–1922) – (He did not come up with Markov decision processes, however) – Transitions from state (1,2) do not depend on the prior state (1,1) or (1,2)

MDP vs POMDPs Accessibility: the agent's percepts in any given state identify the state it is in, e.g., state (4,3) vs. (3,3) – Given the observations, the state is uniquely determined – Hence, we will not explicitly consider observations, only states Inaccessibility: the agent's percepts in any given state DO NOT identify the state it is in, e.g., it may be (4,3) or (3,3) – Given the observations, the state is not uniquely determined – POMDP: partially observable MDP, for inaccessible environments We will focus on MDPs in this presentation.

MDP vs POMDP (Figure: in the MDP loop, the agent observes the world's states and emits actions; in the POMDP loop, the agent receives only observations, a state estimator (SE) maintains a belief state b, and the policy maps beliefs to actions.)

Stationary and Deterministic Policies A policy is denoted by the symbol π.

Policy A policy is like a plan, but not quite – Certainly, it is generated ahead of time, like a plan Unlike traditional plans, it is not a sequence of actions that an agent must execute – If there are failures in execution, the agent can continue to execute the policy It prescribes an action for every state It maximizes expected reward, rather than just reaching a goal state

The MDP Problem The MDP problem consists of: – finding the optimal control policy for all possible states; – finding the sequence of optimal control functions for a specific initial state; – finding the best control action (decision) for a specific state.

Non-Optimal vs. Optimal Policy (Grid-world figure showing candidate policies as colored paths from Start toward the +1 cell.) Choose the Red policy or the Yellow policy? The Red policy or the Blue policy? Which is optimal (if any)? Value iteration: one popular algorithm for determining the optimal policy

Value Iteration: Key Idea Iterate: update the utility of state I using the old utilities of its neighbor states J, given actions A – U_{t+1}(I) = max_A [ R(I, A) + Σ_J P(J | I, A) * U_t(J) ] – P(J | I, A): probability of J if A is taken in state I – max_A F(A) returns the highest value of F(A) over the actions A – Both the immediate reward and the longer-term reward are taken into account

Value Iteration: Algorithm Initialize: U_0(I) = 0 Iterate: U_{t+1}(I) = max_A [ R(I, A) + Σ_J P(J | I, A) * U_t(J) ] – until close-enough(U_{t+1}, U_t) At the end of the iteration, calculate the optimal policy: Policy(I) = argmax_A [ R(I, A) + Σ_J P(J | I, A) * U_{t+1}(J) ]
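
Here is a compact sketch of this loop in Python. The MDP is a made-up two-state example, "close enough" is a simple max-norm test with a tolerance of our choosing, and a discount factor γ is included for convergence (the slide's update is the γ = 1 case), so treat it as an illustration rather than a faithful reproduction of the slide.

```python
# Value iteration on a toy MDP with invented numbers.
P = {  # (s, a) -> {s2: P(s2 | s, a)}
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s1": 0.9, "s0": 0.1},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 1.0},
}
R = {("s0", "stay"): 0.0, ("s0", "go"): -0.04, ("s1", "stay"): 1.0, ("s1", "go"): -0.04}

states = ["s0", "s1"]
actions = {s: [a for (s2, a) in P if s2 == s] for s in states}
gamma, eps = 0.95, 1e-6

def q(s, a, U):
    """One-step lookahead value of taking action a in state s under utilities U."""
    return R[(s, a)] + gamma * sum(p * U[j] for j, p in P[(s, a)].items())

U = {s: 0.0 for s in states}                       # U_0(I) = 0
while True:
    U_new = {s: max(q(s, a, U) for a in actions[s]) for s in states}
    close_enough = max(abs(U_new[s] - U[s]) for s in states) < eps
    U = U_new
    if close_enough:
        break

policy = {s: max(actions[s], key=lambda a: q(s, a, U)) for s in states}
print(U)        # converged utilities
print(policy)   # e.g., {'s0': 'go', 's1': 'stay'}
```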

Forward Method for Solving MDP Decision Tree

Markov Chain Given a fixed policy, you get a Markov chain from the MDP – Markov chain: the next state depends only on the previous state – Next state: not dependent on an action (there is only one action per state) – Next state: history dependence only via the previous state – P(S_{t+1} | S_t, S_{t-1}, S_{t-2}, …) = P(S_{t+1} | S_t) How do we evaluate the Markov chain? Could we try simulations? Are there other, more sophisticated methods around?
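
One simple way to evaluate the induced chain is by simulation: fix the policy, sample trajectories, and average the discounted return from a start state. The sketch below does exactly that for an invented two-state chain; exact methods (e.g., solving the linear system U = R + γ P U) would also work.

```python
import random

# Markov chain induced by some fixed policy: per-state successor distribution and reward.
# All numbers are invented for illustration.
P_chain = {"s0": {"s0": 0.2, "s1": 0.8}, "s1": {"s0": 0.5, "s1": 0.5}}
R_chain = {"s0": 0.0, "s1": 1.0}
gamma = 0.9

def sample_next(s):
    """Sample a successor state from the chain's transition distribution."""
    r, acc = random.random(), 0.0
    for s2, p in P_chain[s].items():
        acc += p
        if r < acc:
            return s2
    return s2  # numerical-safety fallback

def rollout(s, horizon=200):
    """Discounted return of one simulated trajectory starting in s."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        total += discount * R_chain[s]
        discount *= gamma
        s = sample_next(s)
    return total

random.seed(0)
estimate = sum(rollout("s0") for _ in range(2000)) / 2000
print(estimate)   # Monte Carlo estimate of the value of s0 under the fixed policy
```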

Influence Diagram

Expanded Influence Diagram

Relation between time & steps-to-go

Decision Tree

Dynamic Construction of the Decision Tree

Incremental-expansion(MDP, γ, s_I, ε, V_L, V_U)
  initialize tree T with s_I and ubound(s_I), lbound(s_I) using V_L, V_U;
  repeat until (a single action remains for s_I, or ubound(s_I) - lbound(s_I) <= ε)
    call Improve-tree(T, MDP, γ, V_L, V_U);
  return the action with the greatest lower bound as the result;

Improve-tree(T, MDP, γ, V_L, V_U)
  if root(T) is a leaf then
    expand root(T);
    set bounds lbound, ubound of the new leaves using V_L, V_U;
  else
    for all decision subtrees T' of T do
      call Improve-tree(T', MDP, γ, V_L, V_U);
  recompute bounds lbound(root(T)), ubound(root(T)) for root(T);
  when root(T) is a decision node, prune suboptimal action branches from T;
  return;
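
Below is a minimal, self-contained Python sketch of the same idea, written for illustration rather than as the paper's exact algorithm: it uses a toy MDP with invented numbers, crude constant value bounds V_L = R_min/(1-γ) and V_U = R_max/(1-γ), a fixed number of Improve-tree passes instead of the ε stopping test, and it omits pruning of suboptimal branches.

```python
# Illustrative sketch only (not the paper's exact algorithm): toy MDP, crude constant
# bounds, a fixed number of expansion passes, and no pruning of suboptimal branches.
from dataclasses import dataclass, field

GAMMA = 0.9

# Toy MDP: P[s][a] = {s2: prob}, R[s][a] = immediate reward (numbers are invented).
P = {
    "s0": {"a": {"s0": 0.5, "s1": 0.5}, "b": {"s1": 1.0}},
    "s1": {"a": {"s1": 1.0},            "b": {"s0": 1.0}},
}
R = {"s0": {"a": 0.0, "b": 1.0}, "s1": {"a": 2.0, "b": 0.0}}

R_MIN = min(min(ra.values()) for ra in R.values())
R_MAX = max(max(ra.values()) for ra in R.values())

def v_lower(s):  # crude lower bound V_L on the optimal value of any state
    return R_MIN / (1.0 - GAMMA)

def v_upper(s):  # crude upper bound V_U on the optimal value of any state
    return R_MAX / (1.0 - GAMMA)

@dataclass
class Node:
    state: str
    children: dict = field(default_factory=dict)  # action -> {s2: (prob, child Node)}
    lbound: float = 0.0
    ubound: float = 0.0

def expand(node):
    """Expand a leaf one step ahead and initialize the new leaves' bounds."""
    for a, dist in P[node.state].items():
        node.children[a] = {s2: (p, Node(s2, lbound=v_lower(s2), ubound=v_upper(s2)))
                            for s2, p in dist.items()}
    recompute(node)

def recompute(node):
    """Back up lower/upper bounds: expectation over successors, max over actions."""
    node.lbound = max(R[node.state][a] + GAMMA * sum(p * c.lbound for p, c in kids.values())
                      for a, kids in node.children.items())
    node.ubound = max(R[node.state][a] + GAMMA * sum(p * c.ubound for p, c in kids.values())
                      for a, kids in node.children.items())

def improve_tree(node):
    """One Improve-tree pass: expand leaves, recurse into subtrees, back up bounds."""
    if not node.children:
        expand(node)
    else:
        for kids in node.children.values():
            for _, child in kids.values():
                improve_tree(child)
        recompute(node)

def best_action(root):
    """Action at the root with the greatest lower-bound value."""
    return max(root.children,
               key=lambda a: R[root.state][a] +
                             GAMMA * sum(p * c.lbound for p, c in root.children[a].values()))

root = Node("s0", lbound=v_lower("s0"), ubound=v_upper("s0"))
for _ in range(5):              # fixed number of passes instead of the epsilon test
    improve_tree(root)
print(best_action(root), root.lbound, root.ubound)
```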

Incremental expansion function: Basic Method for the Dynamic Construction of the Decision Tree (Flowchart: start with MDP, γ, s_I, ε, V_L, V_U; initialize the leaf node of the partially built decision tree; repeatedly call Improve-tree(T, MDP, γ, ε, V_L, V_U) until a single action remains for s_I or ubound(s_I) - lbound(s_I) <= ε; then return and terminate.)

Compute Decisions Using Bound Iteration (Same Incremental-expansion / Improve-tree procedure as on the previous slides.)

Solving Large MDP Problems

If You Want to Read More on MDPs Book: – Martin L. Puterman, Markov Decision Processes, Wiley Series in Probability – Available on Amazon.com