Partial Observability

References:
–“Planning and acting in partially observable stochastic domains”, Leslie Pack Kaelbling, Michael L. Littman, Anthony R. Cassandra; Artificial Intelligence, 1998
–“Efficient dynamic-programming updates in partially observable Markov decision processes”, Michael L. Littman, Anthony R. Cassandra, Leslie Pack Kaelbling; Operations Research, 1995
–“Spoken Dialogue Management Using Probabilistic Reasoning”, Nicholas Roy, Joelle Pineau, Sebastian Thrun; ACL, 2000
–“Solving POMDPs with Continuous or Large Discrete Observation Spaces”, Jesse Hoey, Pascal Poupart; Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2005

Review of MDP

For MDPs we can compute the optimal policy π and use it to act by simply executing π(s) in the current state s. What happens if the agent is no longer able to determine, with complete reliability, the state it is currently in?
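For reference, a minimal value-iteration sketch of the fully observable case is below. The data layout is an illustrative assumption (T[s][a] as a list of (next state, probability) pairs, R[s][a] as an immediate reward), not notation from the paper.

def value_iteration(states, actions, T, R, gamma=0.95, eps=1e-6):
    # V(s) = max_a [ R(s,a) + gamma * sum_s' T(s,a,s') V(s') ]
    V = {s: 0.0 for s in states}
    while True:
        V_new = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])
                        for a in actions)
                 for s in states}
        if max(abs(V_new[s] - V[s]) for s in states) < eps:
            return V_new
        V = V_new

def greedy_policy(states, actions, T, R, V, gamma=0.95):
    # pi(s): one-step lookahead with the converged value function
    return {s: max(actions,
                   key=lambda a: R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a]))
            for s in states}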

POMDP framework A POMDP can be described as a tuple ⟨S, A, T, R, Ω, O⟩, where
–S, A, T, and R describe an MDP
–Ω is a finite set of observations the agent can experience of its world
–O: S × A → Π(Ω) is the observation function, which gives, for each action and resulting state, a probability distribution over possible observations (we write O(s′, a, o) for the probability of making observation o given that the agent took action a and landed in state s′)
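As a concrete, purely illustrative container for this tuple, one might bundle the six components as below; the field names are assumptions, not the paper's notation.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class POMDP:
    states: List[str]                    # S
    actions: List[str]                   # A
    T: Callable[[str, str, str], float]  # T(s, a, s') = Pr(s' | s, a)
    R: Callable[[str, str], float]       # R(s, a): expected immediate reward
    observations: List[str]              # Omega
    O: Callable[[str, str, str], float]  # O(s', a, o) = Pr(o | a, s')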

Problem structure Because the agent does not know the exact state, it keeps an internal belief state b that summarizes its previous experience. The problem is decomposed into two parts:
–State estimator: updates the belief state based on the last action, the current observation, and the previous belief state
–The policy: maps the current belief state to an action
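A sketch of that decomposition as an agent loop; state_estimator, policy, and env.step are placeholders standing in for the two components and the environment, not functions from the paper.

def run_agent(pomdp, state_estimator, policy, b0, env, horizon):
    b = b0
    for _ in range(horizon):
        a = policy(b)                  # policy: belief state -> action
        o = env.step(a)                # act in the world, receive an observation
        b = state_estimator(b, a, o)   # SE: previous belief, last action, current observation -> new belief
    return b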

An example (a four-state hallway: the states are arranged in a row, the third state is the goal, and after each move the agent observes only whether it has reached the goal). There are two actions, EAST and WEST; each succeeds with probability 0.9, and when it fails the agent moves in the opposite direction. If no movement is possible in a particular direction, the agent remains in the same location. The belief state lists the probability of being in each of the four states:
–Initially [0.33, 0.33, 0, 0.33]
–After taking one EAST movement → [0.1, 0.45, 0, 0.45]
–After taking another EAST movement → [0.1, 0.164, 0, 0.736]
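The numbers above can be checked with a short belief-update sketch. It assumes the reading given in the parenthetical: four states in a row, the goal at index 2, and an observation after each move that only tells the agent it has not yet reached the goal.

EAST, WEST, N = 0, 1, 4
GOAL = 2

def transition(s, a):
    # returns {next_state: probability}: intended move with prob 0.9, opposite with 0.1,
    # and bumping into a wall leaves the agent where it is
    step = 1 if a == EAST else -1
    intended = s + step if 0 <= s + step < N else s
    opposite = s - step if 0 <= s - step < N else s
    out = {intended: 0.9}
    out[opposite] = out.get(opposite, 0.0) + 0.1
    return out

def belief_update_not_goal(b, a):
    # predict: b'(s') = sum_s T(s, a, s') b(s)
    bp = [0.0] * N
    for s in range(N):
        for s2, p in transition(s, a).items():
            bp[s2] += p * b[s]
    # correct: the agent observes that it is not at the goal, then renormalize
    bp[GOAL] = 0.0
    z = sum(bp)
    return [x / z for x in bp]

b = [1/3, 1/3, 0.0, 1/3]
b = belief_update_not_goal(b, EAST)
print([round(x, 3) for x in b])   # [0.1, 0.45, 0.0, 0.45]
b = belief_update_not_goal(b, EAST)
print([round(x, 3) for x in b])   # [0.1, 0.164, 0.0, 0.736]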

Computing belief states
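The belief update used above is the standard state-estimation rule from the Kaelbling et al. paper:

\[
b'(s') \;=\; SE(b, a, o)(s') \;=\; \frac{O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s)}{\Pr(o \mid a, b)},
\qquad
\Pr(o \mid a, b) \;=\; \sum_{s' \in S} O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s).
\]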

Value functions for POMDPs As in the case of discrete MDPs, if we can compute the optimal value function, then we can use it to directly determine the optimal policy. The basic building block is the policy tree.
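One natural data structure for a policy tree is a root action plus one subtree per observation. The class below is illustrative (the names and env.step are our placeholders); it also shows how a tree is executed by conditioning only on observations.

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class PolicyTree:
    action: str
    subtrees: Dict[str, "PolicyTree"] = field(default_factory=dict)  # observation -> (t-1)-step subtree

    def execute(self, env):
        # follow the tree: act, observe, descend into the matching subtree
        node = self
        while node is not None:
            o = env.step(node.action)
            node = node.subtrees.get(o)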

Policy tree for value iteration In the simplest case, p is a 1-step policy tree (a single action). The value of executing that action in state s is
–V_p(s) = R(s, a(p))
In the general case, p is a t-step policy tree; its value is given by the recursion shown below.
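With discount factor γ, root action a(p), and o(p) denoting the (t−1)-step subtree that is followed after observation o, the recursion from the paper is:

\[
V_p(s) \;=\; R(s, a(p)) \;+\; \gamma \sum_{s' \in S} T(s, a(p), s') \sum_{o \in \Omega} O(s', a(p), o)\, V_{o(p)}(s').
\]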

Because the agent will never know the exact state of the world, it must be able to determine the value of executing a policy tree p from some belief state b. A useful expression is given below.
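The value of executing p from belief b is simply the expectation of its state values under b:

\[
V_p(b) \;=\; \sum_{s \in S} b(s)\, V_p(s).
\]

Writing the state values of p as a vector α_p, this is the dot product V_p(b) = b · α_p, so it is linear in b.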

In general, the agent will want to execute different policy trees from different initial belief states. Let P be the finite set of all t-step policy trees; then the t-step value function is the pointwise maximum shown below. This definition of the value function leads us to some important geometric insights into its form: each policy tree p induces a value function that is linear in b, and V_t is the upper surface of those functions. So V_t is piecewise-linear and convex.
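In symbols, using the α_p notation from the previous slide:

\[
V_t(b) \;=\; \max_{p \in P} \sum_{s \in S} b(s)\, V_p(s) \;=\; \max_{p \in P}\, b \cdot \alpha_p .
\]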

Some examples If there are only two states, the belief is determined by b(s1) alone (since b(s2) = 1 − b(s1)), so each policy tree induces a line over the unit interval and V_t is the upper surface of those lines.

If there are three states, the belief space is a two-dimensional simplex: each policy tree induces a plane over it, and V_t is again the upper surface.

Once we choose the optimal tree according to the current belief state, the entire policy tree p can be executed from this point by conditioning the choice of further actions directly on observations, without updating the belief state!

Parsimonious representation There are generally many policy trees whose value functions are totally dominated by, or tied with, the value functions associated with other policy trees.

Given a set of policy trees V, it is possible to define a unique minimal subset of V that represents the same value function. We call this a parsimonious representation of the value function.

One step of value iteration The new problem is how to compute a parsimonious representation of V_t from a parsimonious representation of V_{t−1}. A naive algorithm:
–V_{t−1}, the set of useful (t−1)-step policy trees, can be used to construct a superset V_t^+ of the useful t-step policy trees
–A t-step policy tree is composed of a root node with an associated action a and |Ω| subtrees, each a (t−1)-step policy tree
–There are |A|·|V_{t−1}|^{|Ω|} elements in V_t^+
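The naive construction can be sketched directly with itertools, reusing the illustrative PolicyTree class from above: every root action combined with every assignment of a (t−1)-step subtree to each observation, to be followed by pruning back to the useful trees.

from itertools import product

def enumerate_candidates(actions, observations, V_prev):
    # builds all |A| * |V_prev| ** |Omega| candidate t-step trees
    return [PolicyTree(a, dict(zip(observations, subtrees)))
            for a in actions
            for subtrees in product(V_prev, repeat=len(observations))]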

The witness algorithm Instead of computing V_t directly, we compute, for each action a, a set Q_t^a of t-step policy trees that have action a at their root. We can then obtain V_t by taking the union of the Q_t^a sets over all actions and pruning. In belief space, Q_t^a can be expressed as shown below.
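Written over belief states, as in the paper, this is the one-step lookahead value of taking action a from belief b and continuing optimally for t−1 more steps:

\[
Q_t^a(b) \;=\; \sum_{s \in S} b(s)\, R(s, a) \;+\; \gamma \sum_{o \in \Omega} \Pr(o \mid a, b)\; V_{t-1}\!\big(SE(b, a, o)\big).
\]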

The structure of the algorithm We try to find a minimal set of policy trees representing Q_t^a for each a. We initialize the set U_a of policy trees with a single tree, the best one for some arbitrary belief state. At each iteration we ask: is there some belief state b (a “witness”) at which the true value Q_t^a(b), computed by one-step lookahead using V_{t−1}, differs from the value estimated using the current set U_a? Once a witness is identified, we find the policy tree with action a at the root that yields the best value at that belief state. To construct this tree, we must find, for each observation o, the (t−1)-step policy tree that should be executed if observation o is made after executing action a.
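The overall loop for one action a looks roughly like the sketch below; find_witness (the linear program discussed later) and best_tree_at (one-step lookahead against V_{t−1}) are placeholders for the steps described above, not functions from the paper.

def witness_for_action(a, observations, V_prev, some_belief):
    # start with the best a-rooted tree at an arbitrary belief state
    U_a = [best_tree_at(some_belief, a, observations, V_prev)]
    while True:
        b = find_witness(a, U_a, V_prev)    # a belief where U_a misestimates Q_t^a, or None
        if b is None:
            return U_a                      # U_a now represents Q_t^a exactly
        U_a.append(best_tree_at(b, a, observations, V_prev))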

The witness algorithm The set U_a returned when no more witness points can be found is the collection of policy trees that specifies Q_t^a, and it is minimal.

To find a witness point Witness theorem: we must search for a p ∈ U_a, an o ∈ Ω, a p′ ∈ V_{t−1}, and a b ∈ B such that condition (1) below holds, or guarantee that no such quadruple exists.
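Condition (1) states, roughly, that the tree p_new obtained from p by replacing its o-subtree with p′ is strictly better at b than every tree currently in U_a:

\[
V_{p_{\text{new}}}(b) \;>\; V_{\tilde p}(b) \quad \text{for all } \tilde p \in U_a,
\qquad \text{where } p_{\text{new}} \text{ agrees with } p \text{ except that its subtree for } o \text{ is } p'.
\]

If such a quadruple exists, b is a witness: the current set U_a does not yet represent Q_t^a correctly at b.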

The linear program to find witness points
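One standard formulation (a sketch; the variable names are ours) maximizes the advantage ε of the candidate tree p_new over the current set U_a, over all belief states:

\[
\begin{aligned}
\max_{\varepsilon,\, b}\quad & \varepsilon \\
\text{subject to}\quad & \sum_{s \in S} b(s)\,\big(V_{p_{\text{new}}}(s) - V_{\tilde p}(s)\big) \;\ge\; \varepsilon
      \qquad \text{for all } \tilde p \in U_a, \\
 & \sum_{s \in S} b(s) = 1, \qquad b(s) \ge 0 \;\; \text{for all } s \in S .
\end{aligned}
\]

If the optimal ε is strictly positive, the optimizing b is a witness point; if it is non-positive for every choice of (p, o, p′), no witness exists and U_a is complete.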