Partially Observable Markov Decision Process (POMDP)

By Ye Fang, Department of Computer Science, Rice University
Douglas Aberdeen, National ICT Australia, 2003

Overview
- Recap: POMDP and the exact solution
- Heuristic methods:
  - Heuristics for exact methods
  - Grid methods
  - Factored belief states
  - Simulation
  - Methods for continuous state and action spaces

Learning with a Model
The agent knows the model: the transition function T(s, a, s'), the observation function O(s', a, o), and the reward function R(s, a). From the observation/action history h_t = (a_0, o_1, ..., a_{t-1}, o_t) it maintains a belief state b_t(s) = P(s_t = s | h_t), a probability distribution over world states.
[Figure: grid-world example with a goal; the belief assigns 1/3 to each of three candidate cells, 1/2 to each of two, or 1 once the state is known.]

Learning with a Model
Update beliefs after taking action a and observing o:
b'(s') = P(s' | b, a, o) = O(s', a, o) * sum_s T(s, a, s') b(s) / P(o | b, a).
Long-term value of a belief state. Define:
V(b) = max_a [ R(b, a) + gamma * sum_o P(o | b, a) V(b_{a,o}) ],  with R(b, a) = sum_s b(s) R(s, a),
where b_{a,o} denotes the updated belief after action a and observation o.
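
A minimal sketch of the belief update above in Python/NumPy, assuming hypothetical arrays T[s, a, s'] (transition probabilities), O[s', a, o] (observation probabilities), and a belief vector b; these names are illustrative, not from the slides.

```python
import numpy as np

def update_belief(b, a, o, T, O):
    """Bayes-filter update: b'(s') is proportional to O(s', a, o) * sum_s T(s, a, s') b(s)."""
    predicted = T[:, a, :].T @ b              # sum_s T(s, a, s') b(s), one entry per s'
    unnormalized = O[:, a, o] * predicted     # weight by the observation likelihood
    return unnormalized / unnormalized.sum()  # normalize by P(o | b, a)
```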

Complexity of Exact Methods
- Exponential in the number of state variables: updating the belief state is expensive, and belief-state monitoring is hard.
- Exponential number of belief states: even simplified finite-horizon POMDPs are PSPACE-hard, and finding a policy is NP-hard in restricted cases.

How to Make POMDPs Feasible?
It is almost impossible to find an exact solution for a POMDP model. Where does the complexity of exact solutions come from?
- Infinitely many belief states.
- Updating belief states and their value functions.
This motivates heuristic methods that approximate the exact methods.

How to Make POMDPs Feasible?
Why can heuristics work?
- Simplify the representation of the value function by treating the system as an MDP.
- Replace the belief state b with a real world state.
- Then we have finitely many states.

Heuristics for Exact Methods
The intuition behind these heuristics is to treat the system as an MDP by finding an approximate projection from belief states to world states. Instead of using all possible world states in a belief state b and all of their possible transitions, as the exact solution does, we use a heuristic projection. Each approximation method needs:
1) a mapping from a belief state to an MDP state,
2) an update scheme, and
3) an appropriate value function.

Heuristics for Exact Methods
Goal: find a good approximation of the projection from belief states to world states, and a good policy for each belief state.

Heuristics for Exact Methods
- MLS (most likely state) heuristic
- Voting heuristic
- QMDP heuristic
- Heuristics using the uncertainty of the belief state

MLS Heuristic
Assume the system is in its most likely world state (MLS) i at time t, and execute the action with the largest Q-value at state i. The state in the model is then a world state, just as in an MDP, so we can reuse the MDP methods learned before. The heuristic ignores the agent's confusion about which world state it is actually in.

MLS Heuristic
This method neglects every world state except the most likely one under the belief state b.
Example: in a world with three states and two actions, with optimal MDP actions u(s0) = a0, u(s1) = a0, u(s2) = a1 and belief b = [0.3, 0.3, 0.4], MLS picks s2 and executes a1, even though the probability of not being in s2 is 0.6, in which case the best action is a0.
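
A tiny sketch of the MLS heuristic on this toy example; the arrays just encode the numbers from the slide.

```python
import numpy as np

b = np.array([0.3, 0.3, 0.4])        # belief over s0, s1, s2
u_mdp = np.array([0, 0, 1])          # optimal MDP actions: u(s0)=a0, u(s1)=a0, u(s2)=a1

mls_state = int(np.argmax(b))        # most likely state: s2
mls_action = int(u_mdp[mls_state])   # MLS executes a1, although P(not s2) = 0.6
print(mls_state, mls_action)         # -> 2 1
```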

Voting Heuristic
The voting heuristic forms a probability distribution over actions instead of over states. Given the optimal MDP action u(s) for each state, each state s votes for u(s) with weight b(s) (the probability of being in state j times the best action at that state), and the action chosen for the belief state is the one with the most probability mass:
a = argmax_a sum_s b(s) * 1[u(s) = a].

Voting Heuristic
Example: with the same three states and two actions, u(s0) = a0, u(s1) = a0, u(s2) = a1, b = [0.3, 0.3, 0.4], and Q-values Q(s0, a0) = 5, Q(s0, a1) = 4, Q(s1, a0) = 5, Q(s1, a1) = 4, Q(s2, a0) = 0, Q(s2, a1) = 10. Under b, the expected values are E[Q(., a0)] = 3 and E[Q(., a1)] = 6.4. Voting picks a0, which receives 0.6 of the vote, but a1 may be better given the expected reward.

Voting Heuristic
This method does not take the reward of an action into account. This motivates QMDP, which emphasizes the Q-function of the optimal policy rather than the policy itself.

QMDP Heuristic
Find the action with the largest expected Q-value under the belief: a = argmax_a sum_s b(s) Q_MDP(s, a). QMDP accounts for the belief state only at the first step and then assumes the state is entirely known. If an action does little to disambiguate the state, this method cannot improve its choice over time; it never acts purely to gather information.
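
A minimal sketch comparing the voting and QMDP heuristics on the same toy example; the Q-values are the illustrative numbers from the slides, and the argmax over expected Q-values is the QMDP rule stated above.

```python
import numpy as np

b = np.array([0.3, 0.3, 0.4])        # belief over s0, s1, s2
Q = np.array([[5.0, 4.0],            # Q(s0, a0), Q(s0, a1)
              [5.0, 4.0],            # Q(s1, a0), Q(s1, a1)
              [0.0, 10.0]])          # Q(s2, a0), Q(s2, a1)
u_mdp = Q.argmax(axis=1)             # optimal MDP action per state: [a0, a0, a1]

# Voting: each state votes for its optimal action with weight b(s).
votes = np.bincount(u_mdp, weights=b, minlength=Q.shape[1])   # [0.6, 0.4]
voting_action = int(votes.argmax())                           # a0

# QMDP: a = argmax_a sum_s b(s) Q(s, a); expected Q-values are [3.0, 6.4].
qmdp_action = int((b @ Q).argmax())                           # a1
```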

Shortcomings of the Heuristics
What if the belief state is close to uniform? Consider a robot trying to reach the other end of a featureless desert: from its observations it has nearly the same belief of being anywhere. What if there is a lot of uncertainty in the information state? We should consider that uncertainty when taking actions.

A Formal Measure of Uncertainty
Entropy measures how spiked or spread out the probability mass of a distribution is, capturing the amount of uncertainty in a single number: H(f) = -sum_x f(x) log f(x), where f(.) is a discrete probability mass function.
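
A small sketch of the normalized entropy of a belief vector, as used by the uncertainty-based heuristics below; the eps guard and the normalization by log|S| are implementation choices, not taken from the slides.

```python
import numpy as np

def normalized_entropy(b, eps=1e-12):
    """H(b) = -sum_s b(s) log b(s), divided by log|S| so the result lies in [0, 1]."""
    b = np.asarray(b, dtype=float)
    return float(-np.sum(b * np.log(b + eps)) / np.log(len(b)))

# normalized_entropy([1, 0, 0]) is ~0 (state known);
# normalized_entropy([1/3, 1/3, 1/3]) is ~1 (maximal uncertainty).
```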

Two Objectives
When choosing actions, we want:
- to take actions that will yield the highest rewards, and
- to reduce the entropy of the information state.

Weighted Entropy Control
Intuition: relate the entropy to the rewards to give a rough measure of the value of information.

Weighted Entropy Control
When the normalized entropy is near 1, the environment is effectively unobservable; when it is near 0, the model is almost an MDP.

Weighted Entropy Control
Define V_L to be a lower bound on the POMDP value function (it bounds the value of any belief state from below), and V_CO to be the value assuming the state is completely observable. The value at each belief state blends these two according to the entropy, and the control strategy picks the action that maximizes this blended value.
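
A hedged sketch of one way to turn this into code: the value of a belief is blended between the completely observable estimate V_CO and the lower bound V_L according to the normalized entropy H(b). This particular blend is an illustrative assumption about the weighting, not a formula taken from the slides.

```python
import numpy as np

def weighted_entropy_value(b, V_co, V_L, eps=1e-12):
    """Blend sum_s b(s) V_co(s) (trusted when H(b) is low) with the lower bound V_L."""
    b = np.asarray(b, dtype=float)
    h = float(-np.sum(b * np.log(b + eps)) / np.log(len(b)))   # normalized entropy
    return (1.0 - h) * float(b @ np.asarray(V_co, dtype=float)) + h * V_L
```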

Other Heuristics for POMDPs
- Grid methods
- Factored belief states
- Simulation

Grid Methods
Instead of projecting the belief state down to a single world state, grid methods work with a finite grid of points in belief space. Two questions arise: how to choose the grid points (an interesting region of the belief space), and how to interpolate between them?

How to Choose Grid Points?
- Use simulation to find useful points.
- Add points where the value differs a lot even though the observations are similar.

How to Interpolate?
A value function over a continuous belief space can be approximated by a finite set of grid points G and an interpolation-extrapolation rule that estimates the value of an arbitrary point of the belief space from the grid points and their associated values f(g, u), the value of grid point g under action u. The interpolation should maintain the convex nature of the value function; examples include nearest neighbors and linear interpolation. The MLS heuristic can be thought of as a grid method with points at the belief-simplex corners and a simple 1-NN interpolation.
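
A minimal sketch of a grid-based value approximation using the simple 1-NN interpolation mentioned above; the grid points and their values are assumed to come from some offline computation, and the names are illustrative.

```python
import numpy as np

def grid_value(b, grid_points, grid_values):
    """Estimate V(b) as the value of the nearest belief-space grid point (1-NN rule)."""
    dists = np.linalg.norm(np.asarray(grid_points) - np.asarray(b), axis=1)
    return grid_values[int(np.argmin(dists))]

# Example: a grid at the belief-simplex corners reduces to an MLS-style lookup.
corners = np.eye(3)                        # beliefs [1,0,0], [0,1,0], [0,0,1]
corner_values = np.array([5.0, 5.0, 10.0])
v = grid_value([0.3, 0.3, 0.4], corners, corner_values)   # nearest corner is s2 -> 10.0
```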

Factored Belief States
Intuition: exploit the dependencies among state variables. For example, if the state variable "raining" is true at time t, then "ground is dry" is unlikely to be true at time t+1. A two-slice temporal Bayes network shows the dependencies between state variables over successive time steps.

Factored Belief States
We can use a subset of the dependencies among state variables to construct a Bayes network (BN), and search over belief-state projections to find a BN suited to a specific problem (belief monitoring); this amounts to learning the belief network parametrized by ϕ. The value function can be factored as well, for example as a weighted linear combination of polynomial basis functions, and a search procedure can determine a belief-state projection with bounded error.
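
A hedged sketch of a fully factored belief approximation over a two-slice DBN: the joint belief is kept as a product of per-variable marginals, and each marginal is predicted from its parents' marginals. The data layout (parents, cpts) is an assumption for illustration, and the observation-correction step is omitted.

```python
import numpy as np

def factored_belief_predict(marginals, parents, cpts):
    """marginals: list of 1-D arrays, one per state variable.
    parents[i]: indices of the time-t variables that variable i depends on.
    cpts[i]: array of shape (|X_p1|, ..., |X_pk|, |X_i|) giving P(X_i' | parents)."""
    new_marginals = []
    for i, cpt in enumerate(cpts):
        # Approximate the joint over variable i's parents as a product of marginals.
        parent_joint = np.array(1.0)
        for p in parents[i]:
            parent_joint = np.multiply.outer(parent_joint, marginals[p])
        # Marginalize the parents out to obtain the predicted marginal for X_i'.
        new_m = np.tensordot(parent_joint, cpt, axes=parent_joint.ndim)
        new_marginals.append(new_m / new_m.sum())
    return new_marginals
```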

Simulation and Belief States
Concentrate the learning effort on the belief states that are most likely to be encountered. In the spirit of Q-learning, we can simulate a trajectory through the POMDP and perform value-function updates only on the monitored current belief states. This is still not good for POMDPs with more than a few hundred states, since the full DP update is not efficient.

Simulation and Belief States
Learn a Q-function that generalizes across all belief states. An artificial neural network can also be used to approximate the value function over the full belief space.
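
A minimal sketch of a simulation-based update: Q(b, a) is approximated linearly in the belief vector and adjusted with a Q-learning step along simulated transitions. The linear approximator is one simple choice for illustration; the slides also mention neural networks.

```python
import numpy as np

def q_learning_step(W, b, a, r, b_next, alpha=0.1, gamma=0.95):
    """W has shape (num_actions, num_states) so that Q(b, a) = W[a] @ b."""
    td_target = r + gamma * float(np.max(W @ b_next))       # bootstrapped target
    td_error = td_target - float(W[a] @ b)
    W[a] += alpha * td_error * np.asarray(b, dtype=float)   # gradient step on the weights
    return W
```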

Continuous State and Action Spaces
Representing a belief over a continuous state space exactly would require infinite memory, so we use sampled belief states: particle filters update the belief state, and approximate continuous beliefs are formed from the samples using Gaussian kernels. Filtering here means determining the distribution of some variable at a given time from the observations up to that time. The value function is approximated using the average of the k nearest neighbors. The number of samples n trades off running time against accuracy, and hyperplane-based methods cannot be used because the state space is continuous.
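
A hedged sketch of a particle-filter belief update for a continuous state space; sample_transition and observation_likelihood stand in for the problem's (unspecified) dynamics and sensor model and are assumptions of this sketch.

```python
import numpy as np

def particle_filter_update(particles, action, obs,
                           sample_transition, observation_likelihood, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # Propagate each particle through the stochastic transition model.
    propagated = np.array([sample_transition(p, action, rng) for p in particles])
    # Weight each propagated particle by the observation likelihood, then resample.
    weights = np.array([observation_likelihood(obs, p, action) for p in propagated])
    weights = weights / weights.sum()
    idx = rng.choice(len(propagated), size=len(propagated), p=weights)
    return propagated[idx]
```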

Policy Search vs. Value Search
It is simpler to determine how to act than to determine the value of acting. Approximate value-function methods usually produce deterministic policies, and the heuristic methods above are only approximate projections from belief states to world states, so it can be better to introduce randomness into the policy.

Policy Search vs. Value Search
Policy search can be very difficult. Value search can be better for small POMDPs, since it imposes the Bellman equations as constraints. Hybrid methods combine the two.

Policy Search
Policy search can be implemented by policy iteration:
Step 1: evaluate the current policy.
Step 2: improve the policy.
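
A minimal sketch of tabular policy iteration for the fully observable MDP that the earlier heuristics rely on (evaluate the current policy exactly, then improve it greedily); T and R are assumed to be dense arrays.

```python
import numpy as np

def policy_iteration(T, R, gamma=0.95):
    """T[s, a, s'] are transition probabilities, R[s, a] expected rewards."""
    n_states, n_actions = R.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Step 1: evaluate the current policy by solving (I - gamma * T_pi) V = R_pi.
        T_pi = T[np.arange(n_states), policy]          # (S, S)
        R_pi = R[np.arange(n_states), policy]          # (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
        # Step 2: greedy improvement using a one-step lookahead on V.
        Q = R + gamma * np.einsum('sap,p->sa', T, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```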

Recap
Different heuristics for:
- projecting a belief state to a world state,
- evaluating the values of belief states, and
- finding a good policy.