
1 Incremental Pruning: A Simple, Fast, Exact Method for Partially Observable Markov Decision Processes. Anthony Cassandra, Computer Science Dept., Brown University, Providence, RI 02912, arc@cs.brown.edu. Michael L. Littman, Dept. of Computer Science, Duke University, Durham, NC 27708-0129, mlittman@cs.duke.edu. Nevin L. Zhang, Computer Science Dept., The Hong Kong U. of Sci. & Tech., Clear Water Bay, Kowloon, HK, lzhang@cs.ust.hk. Presented by Costas Djouvas.

2 POMDPs: Who Needs Them? Tony Cassandra, St. Edwards University, Austin, TX. http://www.cassandra.org/pomdp/talks/who-needs-pomdps/index.shtml

3 Markov Decision Processes (MDP). A discrete model for decision making under uncertainty. The four components of the MDP model:
States: the world is divided into states.
Actions: each state has a finite number of actions to choose from.
Transition Function: the probabilistic relationship between states and the actions available in each state.
Reward Function: the expected reward of taking action a in state s.

4 MDP More Formally.
S = a set of possible world states.
A = a set of possible actions.
Transition Function: a real-valued function T(s, a, s') = Pr(s' | s, a).
Reward Function: a real-valued function R(s, a).

5 MDP Example (1/2).
S = {OK, DOWN}.
A = {NO-OP, ACTIVE-QUERY, RELOCATE}.
Reward Function R(a, s):
                    OK    DOWN
  NO-OP             +1     -10
  ACTIVE-QUERY      -5      -5
  RELOCATE         -22     -20

6 MDP Example (2/2). Transition Functions:
T(s, NO-OP, s'):
            s'=OK   s'=DOWN
  s=OK       0.98     0.02
  s=DOWN     0.00     1.00
T(s, ACTIVE-QUERY, s'):
            s'=OK   s'=DOWN
  s=OK       0.98     0.02
  s=DOWN     0.00     1.00
T(s, RELOCATE, s'):
            s'=OK   s'=DOWN
  s=OK       1.00     0.00
  s=DOWN     1.00     0.00
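To make the example easy to experiment with, the tables above can be stored directly as arrays. The following is a minimal sketch (variable names are illustrative, values transcribed from the slides; the RELOCATE rewards are read from the flattened table as -22 and -20).

```python
import numpy as np

# States and actions for the machine-administration example (indices: OK=0, DOWN=1).
STATES = ["OK", "DOWN"]
ACTIONS = ["NO-OP", "ACTIVE-QUERY", "RELOCATE"]

# R[a, s]: expected reward of taking action a in state s (values from the slide).
R = np.array([
    [  1.0, -10.0],   # NO-OP
    [ -5.0,  -5.0],   # ACTIVE-QUERY
    [-22.0, -20.0],   # RELOCATE (as read from the flattened table)
])

# T[a, s, s']: Pr(s' | s, a), one row-stochastic matrix per action.
T = np.array([
    [[0.98, 0.02], [0.00, 1.00]],   # NO-OP
    [[0.98, 0.02], [0.00, 1.00]],   # ACTIVE-QUERY
    [[1.00, 0.00], [1.00, 0.00]],   # RELOCATE always ends in OK
])

assert np.allclose(T.sum(axis=2), 1.0)  # each T[a, s, :] is a probability distribution
```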

7 Best Strategy. Value Iteration Algorithm:
Input: actions, states, reward function, probabilistic transition function.
Derives a mapping from states to "best" actions for a given horizon of time.
Starts with horizon length 1 and iteratively finds the value function for the desired horizon.
Optimal Policy: maps states to actions (S -> A). It depends only on the current state (Markov property). To apply it, we must know the agent's state.
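As a concrete illustration of the value-iteration step described above, here is a minimal finite-horizon sketch in Python. It assumes the R and T arrays from the previous sketch; the function name and the discount factor are illustrative, not part of the talk.

```python
import numpy as np

def mdp_value_iteration(R, T, horizon, gamma=1.0):
    """Finite-horizon value iteration for a fully observable MDP.

    R[a, s]     : immediate reward of action a in state s.
    T[a, s, s'] : Pr(s' | s, a).
    Returns the horizon-step value function V[s] and a greedy policy[s].
    """
    n_actions, n_states = R.shape
    V = np.zeros(n_states)                 # horizon-0 values
    policy = np.zeros(n_states, dtype=int)
    for _ in range(horizon):
        # Q[a, s] = R[a, s] + gamma * sum_{s'} T[a, s, s'] * V[s']
        Q = R + gamma * (T @ V)
        policy = Q.argmax(axis=0)          # best action for each state
        V = Q.max(axis=0)
    return V, policy

# Example usage: a 3-step lookahead for the admin MDP above.
# V3, pi3 = mdp_value_iteration(R, T, horizon=3, gamma=0.95)
```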

8 Partially Observable Markov Decision Processes. Domains with only partial information available about the current state (we cannot observe the current state directly). The observations can be probabilistic, so we need an observation function. There is uncertainty about the current state. The process is non-Markovian: it requires keeping track of the entire history.

9 Partially Observable Markov Decision Processes. In addition to the MDP model we have:
Z = a set of observations of the state.
Observation Function: the relation between the state and the observation, O(s, a, z) = Pr(z | s, a).

10 POMDP Example. In addition to the definitions of the MDP example, we must define the observation set and the observation probability function.
Z = {pink-ok (PO), pink-timeout (PT), active-ok (AO), active-down (AD)}.
O(s, ACTIVE-QUERY, z):
              PO      PT      AO      AD
  OK        0.000   0.999   0.000   0.001
  DOWN      0.000   0.010   0.000   0.990
O(s, NO-OP, z):
              PO      PT      AO      AD
  OK        0.970   0.000   0.030   0.000
  DOWN      0.025   0.000   0.975   0.000
O(s, RELOCATE, z):
              PO      PT      AO      AD
  OK        0.250   0.250   0.250   0.250
  DOWN      0.250   0.250   0.250   0.250

11 Background on Solving POMDPs. We have to find a mapping from probability distributions over states to actions.
Belief State: a probability distribution over states.
Belief Space: the entire space of such probability distributions.
Assuming a finite number of possible actions and observations, there is a finite number of possible next belief states.
The next belief state is fully determined by, and depends only on, the current belief state, action, and observation (Markov property).

12 Background on Solving POMDPs. Next Belief State: given current belief b, action a, and observation z, the next belief is
  b'(s') = Pr(s' | b, a, z) = Pr(z | s', a) * sum_s Pr(s' | s, a) b(s) / Pr(z | b, a),
where Pr(z | b, a) is the normalizing constant that makes b' sum to 1.
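A minimal sketch of this belief update, assuming the T array from the earlier sketch plus an observation array O[a, s', z] = Pr(z | s', a) (the indexing convention is an assumption of this sketch):

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Return b'(s') = Pr(s' | b, a, z) after taking action a and observing z.

    b : current belief over states, shape (S,)
    T : T[a, s, s'] = Pr(s' | s, a)
    O : O[a, s', z] = Pr(z | s', a)
    """
    predicted = b @ T[a]                 # sum_s b(s) * Pr(s' | s, a)
    unnormalized = O[a, :, z] * predicted
    prob_z = unnormalized.sum()          # Pr(z | b, a)
    if prob_z == 0.0:
        raise ValueError("Observation z has zero probability under belief b.")
    return unnormalized / prob_z
```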

13 Background on Solving POMDPs. Start from belief state b (the yellow dot in the figure). Two states, s1 and s2; two actions, a1 and a2; three observations, z1, z2, z3. The figure shows the resulting belief states in the belief space.

14 Policies for POMDPs. An optimal POMDP policy maps belief states to actions. To use a computed policy, start with some a priori belief about where you are in the world, then continually:
1. Use the policy to select an action for the current belief state;
2. Execute the action;
3. Receive an observation;
4. Update the belief state using the current belief, action, and observation;
5. Repeat.
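The loop above might look like the following sketch. The `policy` and `env` objects are placeholders (assumptions of this sketch) standing in for a computed policy and the real system; `belief_update` is the function from the earlier sketch.

```python
def run_agent(policy, env, b, T, O, steps=100):
    """Repeatedly: choose an action for the current belief, act, observe, update."""
    for _ in range(steps):
        a = policy(b)                         # 1. map the current belief to an action
        env.execute(a)                        # 2. execute the action in the world
        z = env.observe()                     # 3. receive an observation
        b = belief_update(b, a, z, T, O)      # 4. Bayesian belief update
    return b                                  # 5. repeat until the step budget runs out
```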

15 Example for Optimal Policy. The value function over the belief space (Pr(OK) from 0 to 1) partitions it into regions labeled RELOCATE, ACTIVE, and NO-OP:
  Pr(OK)            Action
  0.000 - 0.237     RELOCATE
  0.237 - 0.485     ACTIVE
  0.485 - 0.493     ACTIVE
  0.493 - 0.713     NO-OP
  0.713 - 0.928     NO-OP
  0.928 - 0.989     NO-OP
  0.989 - 1.000     NO-OP
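Acting from such a table reduces to locating Pr(OK) among sorted interval boundaries. A small sketch (thresholds copied from the table above, with adjacent intervals that share an action merged; names are illustrative):

```python
import bisect

# Upper boundaries of the merged intervals and the action chosen inside each.
BOUNDARIES = [0.237, 0.493, 1.000]
ACTIONS_BY_INTERVAL = ["RELOCATE", "ACTIVE-QUERY", "NO-OP"]

def policy_from_table(p_ok):
    """Map the belief Pr(OK) to an action using the slide's interval table."""
    return ACTIONS_BY_INTERVAL[bisect.bisect_left(BOUNDARIES, p_ok)]

# policy_from_table(0.1)  -> "RELOCATE"
# policy_from_table(0.6)  -> "NO-OP"
```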

16 Policy Graph

17 Value Function. The optimal policy computation is based on value iteration. The main problem in using value iteration is that the space of all belief states is continuous.

18 Value Function. For each belief state we get a single expected value. Finding the expected value of all belief states yields a value function defined over the entire belief space.
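Concretely, such a value function is piecewise linear and convex and can be stored as a finite set of |S|-vectors; evaluating it at a belief is just a maximum over dot products. A minimal sketch (function name is illustrative):

```python
import numpy as np

def value_at(b, vectors):
    """V(b) = max over alpha of b . alpha, for a finite set of |S|-vectors.

    b       : belief, shape (S,)
    vectors : array of alpha-vectors, shape (K, S)
    Returns the value and the index of the maximizing vector (whose associated
    action is the one the policy would choose at b).
    """
    dots = vectors @ b              # dot product of every alpha-vector with b
    best = int(dots.argmax())
    return dots[best], best
```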

19 Value Iteration Example. Two states, two actions, three observations. We will use a figure to represent the belief space and the transformed value function. We will use the S(a, z) sets to transform the continuous-space value function. (The figure shows the belief space, the transformed value, and the dot product.)

20 Value Iteration Example. Start from belief state b. One available action, a1, for the first decision, and then two actions, a1 and a2. Three possible observations: z1, z2, z3.

21 Value Iteration Example. For each of the three new belief states, compute the new value function for all actions. (Figures: the transformed value functions for all observations, and the partition for action a1.)

22 Value Iteration Example. (Figures: the value function and partition for action a1; the value function and partition for action a2; the combined a1 and a2 value functions; the value function for horizon 2.)

23 Transformed Value Example (based on the earlier MDP example).

24 Incremental Pruning: A Simple, Fast, Exact Method for Partially Observable Markov Decision Processes. The agent is not aware of its current state; it only knows its information (belief) state x, a probability distribution over the possible states. After taking action a and observing z, it moves to a new information state x_z^a.
Notation:
S: a finite set of states; A: a finite set of possible actions; Z: a finite set of possible observations.
a in A, s in S, z in Z; r_a(s) is the immediate reward (a real number).
Transition function: Pr(s' | s, a) in [0, 1].
Observation function: Pr(z | s', a) in [0, 1].

25 Introduction. Algorithms for POMDPs use a form of dynamic programming called dynamic programming updates, in which one value function is transformed into another. Some of the algorithms using DP updates:
One pass (Sondik 1971)
Exhaustive (Monahan 1982)
Linear support (Cheng 1988)
Witness (Littman, Cassandra & Kaelbling 1996)
Dynamic Pruning (Zhang & Liu 1996)

26 Dynamic Programming Updates. Idea: define a new value function V' in terms of a given value function V. With value iteration in the infinite-horizon case, repeated application of this update yields an approximation arbitrarily close to the optimal value function. V' is defined by
  V'(x) = max_a [ sum_s r_a(s) x(s) + gamma * sum_z Pr(z | a, x) V(x_z^a) ].
The value function V can be expressed as a maximum of dot products over a finite set of |S|-vectors, and the update is computed through the sets S_{a,z}, S_a, and S'. These transformations preserve piecewise linearity and convexity (Smallwood & Sondik, 1973).

27 Dynamic Programming Updates: some more notation.
Vector comparison: α1 > α2 if and only if α1(s) > α2(s) for all s in S.
Vector dot product: α . β = Σ_s α(s) β(s).
Cross sum: A ⊕ B = {α + β | α in A, β in B}.
Set subtraction: A \ B = {α in A | α not in B}.

28 Dynamic Programming Updates. Using these notations, we can characterize the S sets described earlier as:
  S_{a,z} = purge({ τ(α, a, z) | α in S }),
  S_a = purge( ⊕_{z in Z} S_{a,z} ),
  S' = purge( ∪_{a in A} S_a ),
where τ(α, a, z)(s) = r_a(s) / |Z| + gamma * Σ_{s'} α(s') Pr(z | s', a) Pr(s' | s, a), and purge(.) takes a set of vectors and reduces it to its unique minimum form.
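A sketch of the τ transform that builds the candidate vectors of an S_{a,z} set, assuming the R and T arrays from the earlier sketches, an observation array O[a, s', z] = Pr(z | s', a), and a discount factor gamma (function names are illustrative):

```python
import numpy as np

def tau(alpha, a, z, R, T, O, gamma=0.95):
    """One candidate vector of S_{a,z}, transformed from an alpha-vector of V.

    Component-wise, following the formula above:
        tau(alpha, a, z)(s) = R[a, s] / |Z|
                              + gamma * sum_{s'} alpha(s') * O[a, s', z] * T[a, s, s']
    """
    n_observations = O.shape[2]
    return R[a] / n_observations + gamma * (T[a] * O[a, :, z]) @ alpha

def build_S_az(V_vectors, a, z, R, T, O, gamma=0.95):
    """All candidate vectors of S_{a,z} (still to be purged)."""
    return np.array([tau(alpha, a, z, R, T, O, gamma) for alpha in V_vectors])
```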

29 Pruning Sets of Vectors. Given a set A of |S|-vectors and a vector α, define
  R(α, A) = { x | x . α > x . α' for all α' in A \ {α} },
called the witness region: the set of information states for which vector α is the clear winner (has the largest dot product) compared with all the other vectors in A. Using the definition of R, we can define
  purge(A) = { α in A | R(α, A) ≠ ∅ },
the set of vectors in A with a non-empty witness region, which is precisely the minimum-size set representing the same value function.
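The witness-region test can be phrased as a linear program: look for a belief x and a margin delta > 0 with x . alpha >= x . beta + delta for every beta in A. Here is a sketch using scipy.optimize.linprog (variable names and the tolerance are assumptions of this sketch):

```python
import numpy as np
from scipy.optimize import linprog

def dominate(alpha, A, eps=1e-9):
    """Find a belief x where alpha strictly beats every vector in A, or return None.

    Linear program:  maximize delta
                     subject to  x . (alpha - beta) >= delta  for every beta in A,
                                 sum(x) = 1,  x >= 0.
    A positive optimal delta means the witness region R(alpha, A) is non-empty.
    """
    alpha = np.asarray(alpha, dtype=float)
    n = alpha.size
    if len(A) == 0:
        return np.full(n, 1.0 / n)        # nothing to beat: any belief is a witness
    # Decision variables: [x_1, ..., x_n, delta]; linprog minimizes, so minimize -delta.
    c = np.zeros(n + 1)
    c[-1] = -1.0
    A_ub = np.hstack([np.asarray(A, dtype=float) - alpha, np.ones((len(A), 1))])
    b_ub = np.zeros(len(A))               # encodes x.(beta - alpha) + delta <= 0
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])
    b_eq = np.array([1.0])                # x is a probability distribution
    bounds = [(0.0, 1.0)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    if res.success and -res.fun > eps:    # optimal delta > 0: alpha has a witness point
        return res.x[:n]
    return None
```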

30 Pruning Sets of Vectors: implementation of purge(F). The FILTER routine returns the vectors in F with a non-empty witness region; its DOMINATE subroutine returns an information state x for which α gives a larger dot product than any vector in A, or reports that no such x exists.
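A simplified purge built on the dominate() sketch above. Note this is a stand-in, not the incremental FILTER routine from the paper: it tests each vector against all the others instead of growing a winner set from witness points, and it assumes no exact duplicate vectors (which the real FILTER handles via tie-breaking).

```python
import numpy as np

def purge(F):
    """Keep only the vectors of F that win somewhere in belief space.

    Simplified stand-in for FILTER: each vector is kept iff dominate() finds a
    belief where it strictly beats every other vector in F.
    """
    F = [np.asarray(v, dtype=float) for v in F]
    kept = []
    for i, alpha in enumerate(F):
        others = [v for j, v in enumerate(F) if j != i]
        if dominate(alpha, others) is not None:   # non-empty witness region
            kept.append(alpha)
    return kept
```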

31 Incremental Pruning. Computes S_a efficiently:
  S_a = purge( ... purge( purge( S_{a,z1} ⊕ S_{a,z2} ) ⊕ S_{a,z3} ) ... ⊕ S_{a,zk} ).
It is conceptually easier than witness, with superior performance and asymptotic complexity. If A = purge(A) and B = purge(B), then W = purge(A ⊕ B) satisfies |W| ≥ max(|A|, |B|), so an intermediate set never grows explosively compared to its final size.
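With purge() from the previous sketch, the incremental scheme is a fold that purges after every pairwise cross sum instead of once at the end. A sketch of the plain IP variant (not the RR variant discussed later; names are illustrative):

```python
import numpy as np

def cross_sum(A, B):
    """A (+) B = { alpha + beta : alpha in A, beta in B } for 2-D arrays of vectors."""
    A, B = np.atleast_2d(A), np.atleast_2d(B)
    return (A[:, None, :] + B[None, :, :]).reshape(-1, A.shape[1])

def incremental_cross_sum_purge(vector_sets):
    """purge(S_{a,z1} (+) S_{a,z2} (+) ... (+) S_{a,zk}), purging after every step."""
    result = purge(vector_sets[0])
    for S_next in vector_sets[1:]:
        result = purge(cross_sum(np.asarray(result), np.asarray(purge(S_next))))
    return result

# S_a would then be, for example:
# S_a = incremental_cross_sum_purge([build_S_az(V, a, z, R, T, O) for z in range(n_z)])
```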

32 Incremental Pruning. We first construct all of the S(a, z) sets. We then form all combinations (the cross sum) of the S(a, z1) and S(a, z2) vectors.

33 Incremental Pruning. This yields the new value function. We then eliminate all useless (light blue in the figure) vectors.

34 Incremental Pruning. We are left with just three vectors, which we then combine with the vectors in S(a, z3). This is repeated for the other action.

35 Generalizing Incremental Pruning. Modify FILTER to take advantage of the fact that the set of vectors has a great deal of regularity: replace x <- DOMINATE(Φ, W) with x <- DOMINATE(Φ, D \ {Φ}).
Recall:
A ⊕ B: the set of vectors being filtered.
W: the set of winning vectors found so far.
Φ: the candidate vector being tested against W.
D ⊆ A ⊕ B.

36 Generalizing Incremental Pruning. D must satisfy any one of several properties, numbered (1) through (5) on the slide (the equations are not reproduced in this transcript). Different choices of D result in different incremental pruning algorithms; the smaller the set D, the more efficient the algorithm.

37 Generalizing Incremental Pruning. The IP algorithm uses property (1). A variation of the incremental pruning method using a combination of (4) and (5) is referred to as the restricted region (RR) algorithm. The asymptotic total number of linear programs does not change; RR actually requires slightly more linear programs than IP in the worst case. Empirically, however, the savings in total constraints usually saves more time than the extra linear programs require.

38 Generalizing Incremental Pruning Complete RR algorithm

39 Empirical Results. Total execution time, and total time spent constructing the S_a sets (charts shown on the slide).

40 Conclusions. We examined the incremental pruning method for performing dynamic programming updates in partially observable Markov decision processes. It compares favorably, in terms of ease of implementation, to the simplest of the previous algorithms. It has asymptotic performance as good as or better than the most efficient of the previous algorithms, and is empirically the fastest algorithm of its kind.

41 Conclusion. In any event, even the slowest variation of the incremental pruning method that we studied is a consistent improvement over earlier algorithms. This algorithm will make it possible to greatly expand the set of POMDP problems that can be solved efficiently. Issues to be explored: all the algorithms studied have a precision parameter ε whose meaning differs from algorithm to algorithm, and better best-case and worst-case analyses for RR remain to be developed.

