
1 Partial Observability
–“Planning and acting in partially observable stochastic domains”, Leslie Pack Kaelbling, Michael L. Littman, Anthony R. Cassandra; Artificial Intelligence, 1998
–“Efficient dynamic-programming updates in partially observable Markov decision processes”, Michael L. Littman, Anthony R. Cassandra, Leslie Pack Kaelbling; Operations Research, 1995
–“Spoken Dialogue Management Using Probabilistic Reasoning”, Nicholas Roy, Joelle Pineau, Sebastian Thrun; ACL, 2000
–“Solving POMDPs with Continuous or Large Discrete Observation Spaces”, Jesse Hoey, Pascal Poupart; Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 2005

2 Review of MDP
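The body of this slide did not survive the transcript. As a reminder (standard definitions, not reconstructed from the slide itself): an MDP is a tuple ⟨S, A, T, R⟩ with states S, actions A, transition function T(s, a, s′) = Pr(s′ | s, a), and reward R(s, a); with discount factor γ its optimal value function and greedy optimal policy satisfy

```latex
V^*(s) = \max_{a \in A} \Big[ R(s,a) + \gamma \sum_{s' \in S} T(s,a,s')\, V^*(s') \Big],
\qquad
\pi^*(s) = \operatorname*{argmax}_{a \in A} \Big[ R(s,a) + \gamma \sum_{s' \in S} T(s,a,s')\, V^*(s') \Big].
```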

3 For MDPs we can compute the optimal policy π and use it to act by simply executing π(s) for current state s. What happens if the agent is no longer able to determine the state it is currently in with complete reliability?

4 POMDP framework A POMDP can be described as a tuple ⟨S, A, T, R, Ω, O⟩, where
–S, A, T, and R describe an MDP
–Ω is a finite set of observations the agent can experience of its world
–O: S × A → Π(Ω) is the observation function, which gives, for each action and resulting state, a probability distribution over possible observations (we write O(s′, a, o) for the probability of making observation o given that the agent took action a and landed in state s′)

5 Problem structure Because the agent does not know the exact state, it keeps an internal belief state, b, that summarizes its previous experience. The problem is decomposed into two parts:
–State estimator: updates the belief state based on the last action, the current observation, and the previous belief state
–The policy: maps the belief state to actions

6 An example There are four states in a row; the third is the goal state, and it is the only state the agent can recognize from its observation. There are two actions, EAST and WEST; each succeeds with probability 0.9, and when it fails the movement is in the opposite direction. If no movement is possible in a particular direction, the agent remains in the same location. Conditioning on the goal not yet having been observed:
–Initially [0.33, 0.33, 0, 0.33]
–After taking one EAST movement → [0.1, 0.45, 0, 0.45]
–After taking another EAST movement → [0.1, 0.164, 0, 0.736]
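A small Python sketch (not part of the slides) that reproduces these belief states under the setup described above — four states in a row, the third being a perfectly detectable goal, and beliefs conditioned on the goal not yet having been observed. It uses the standard Bayes belief update that the next slide covers; numpy is assumed.

```python
import numpy as np

# Four states in a row (0, 1, 2, 3); state 2 is the goal, and it is the only
# state the agent can recognize from its observation.
GOAL = 2
EAST, WEST = +1, -1

def transition(a):
    """T[s, s'] = Pr(s' | s, a): the action succeeds with probability 0.9 and
    moves the opposite way with probability 0.1; bumping a wall means staying."""
    T = np.zeros((4, 4))
    for s in range(4):
        for direction, prob in ((a, 0.9), (-a, 0.1)):
            s_next = min(max(s + direction, 0), 3)   # walls at both ends
            T[s, s_next] += prob
    return T

def update(b, a, saw_goal=False):
    """Bayes update of the belief after taking action a and observing whether
    or not the goal state was reached."""
    pred = b @ transition(a)              # predicted distribution over next states
    at_goal = np.zeros(4); at_goal[GOAL] = 1.0
    likelihood = at_goal if saw_goal else 1.0 - at_goal
    post = likelihood * pred
    return post / post.sum()              # renormalize

b = np.array([1/3, 1/3, 0.0, 1/3])
b = update(b, EAST)    # -> approx. [0.10, 0.45, 0.00, 0.45]
b = update(b, EAST)    # -> approx. [0.10, 0.164, 0.00, 0.736]
print(np.round(b, 3))
```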

7 Computing belief states
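The update equation itself is missing from the transcript; the state estimator in Kaelbling et al. (1998) computes the new belief b′ = SE(b, a, o) from the previous belief b, the last action a, and the current observation o as

```latex
b'(s') \;=\; \frac{O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s)}{\Pr(o \mid a, b)},
\qquad
\Pr(o \mid a, b) \;=\; \sum_{s' \in S} O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s).
```

The denominator is just a normalizing factor, as in the example above.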

8 Value functions for POMDPs As in the case of discrete MDPs, if we can compute the optimal value function, then we can use it to directly determine the optimal policy. The key construct is the policy tree: a root action followed, for each possible observation, by a subtree specifying future behavior.

9 Policy tree for value iteration In the simplest case, p is a 1-step policy tree (a single action). The value of executing that action in state s is
–V_p(s) = R(s, a(p))
In the general case, p is a t-step policy tree, and its value takes the recursive form shown below.
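For a t-step policy tree p with action a(p) at the root and, for each observation o, a (t−1)-step subtree o(p), the value (following the paper) is

```latex
V_p(s) \;=\; R(s, a(p)) \;+\; \gamma \sum_{s' \in S} T(s, a(p), s')
             \sum_{o \in \Omega} O(s', a(p), o)\, V_{o(p)}(s').
```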

10 Because the agent will never know the exact state of the world, it must be able to determine the value of executing a policy tree p from some belief state b. A useful expression for this value is given below.
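The expression in question is the expected value of the tree under the belief state, which is linear in b:

```latex
V_p(b) \;=\; \sum_{s \in S} b(s)\, V_p(s) \;=\; b \cdot \alpha_p,
\qquad
\alpha_p \;=\; \big( V_p(s_1), \ldots, V_p(s_{|S|}) \big).
```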

11 In general, the optimal t-step value is obtained by executing different trees from different initial belief states. Let P be the finite set of all t-step policy trees; then V_t(b) = max_{p ∈ P} b · α_p. This definition of the value function leads us to some important geometric insights into its form. Each policy tree, p, induces a value function that is linear in b, and V_t is the upper surface of those functions. So, V_t is piecewise-linear and convex.

12 Some examples If there are only two states, a belief state is determined by a single number b(s₁) ∈ [0, 1]; each value function b · α_p is then a line over that interval, and V_t is the upper surface of those lines.

13 If there are three states, the belief simplex is a triangle; each b · α_p is a plane over it, and V_t is the upper surface of those planes.

14 Once we choose the optimal tree according to the current belief state (i.e., the tree p maximizing b · α_p), the entire policy tree p can be executed from this point by conditioning the choice of further actions directly on observations, without updating the belief state!

15 Parsimonious representation There are generally many policy trees whose value functions are totally dominated by or tied with value functions associated with other policy trees

16 Given a set of policy trees, V, it is possible to define a unique minimal subset of V that represents the same value function. We call this a parsimonious representation of the value function.
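As a first illustration (a sketch, not the paper's algorithm), duplicate and pointwise-dominated value vectors can be filtered out directly; obtaining the truly minimal set in general also requires a linear-programming test such as the witness LP sketched after slide 22. numpy is assumed, and each tree is represented only by its α-vector.

```python
import numpy as np

def remove_dominated(vectors, tol=1e-12):
    """Drop alpha-vectors that are identical to, or pointwise dominated by,
    another vector; the survivors represent the same upper surface.  This is
    only a first pass toward a parsimonious set."""
    vecs = [np.asarray(v, dtype=float) for v in vectors]
    kept = []
    for v in vecs:
        # Skip v if some already-kept vector is at least as good everywhere.
        if any(np.all(w >= v - tol) for w in kept):
            continue
        # Remove previously kept vectors that v beats everywhere.
        kept = [w for w in kept if not np.all(v >= w + tol)]
        kept.append(v)
    return kept
```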

17 One step of value iteration The new problem is how to compute a parsimonious representation of V_t from a parsimonious representation of V_{t-1}. A naive algorithm is:
–V_{t-1}, the set of useful (t−1)-step policy trees, can be used to construct a superset V_t^+ of the useful t-step policy trees
–A t-step policy tree is composed of a root node with an associated action a and |Ω| subtrees, each a (t−1)-step policy tree
–There are |A|·|V_{t-1}|^|Ω| elements in V_t^+ (for example, with 2 actions, 2 observations, and 5 useful (t−1)-step trees, that is already 2 · 5² = 50 candidate trees), so enumerating them all quickly becomes expensive

18 The witness algorithm Instead of computing V_t directly, we will compute, for each action a, a set Q_t^a of t-step policy trees that have action a at their root. We can compute V_t by taking the union of the Q_t^a sets over all actions and pruning. The value Q_t^a(b) can be expressed as a one-step lookahead: Q_t^a(b) = Σ_s b(s) R(s, a) + γ Σ_o Pr(o | a, b) V_{t-1}(SE(b, a, o)).

19 The structure of the algorithm We try to find a minimal set of policy trees for representing Q_t^a for each a. We initialize the set U_a of policy trees with a single policy tree, which is the best for some arbitrary belief state. At each iteration we ask: is there some belief state b (a “witness”) for which the true value Q_t^a(b), computed by one-step lookahead using V_{t-1}, differs from the estimated value Q̂_t^a(b), computed using the set U_a? Once a witness is identified, we find the policy tree with action a at the root that yields the best value at that belief state. To construct this tree, we must find, for each observation o, the (t−1)-step policy tree that should be executed if observation o is made after executing action a; a sketch of this step follows.
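A sketch of that last step (illustrative code, not the authors'), assuming each (t−1)-step tree is summarized by its α-vector: at the witness belief b, the best subtree can be chosen independently for each observation, because the maximization decomposes over observations.

```python
import numpy as np

def best_tree_at(b, a, T, O, R, V_prev, gamma=0.95):
    """Build the best t-step policy tree with action a at the root for belief b.
    b: belief over |S| states; T[a]: |S|x|S| transition matrix; O[a]: |S|x|Omega|
    observation matrix; R: |S|x|A| reward matrix; V_prev: non-empty list of
    (tree_id, alpha) pairs representing V_{t-1}.  Returns the chosen subtree
    per observation and the value of the new tree at b."""
    subtrees = {}
    value = float(b @ R[:, a])                       # expected immediate reward
    for o in range(O[a].shape[1]):
        # Unnormalized belief over next states after doing a and seeing o:
        #   g(s') = O(s', a, o) * sum_s T(s, a, s') b(s)
        g = O[a][:, o] * (b @ T[a])
        # Best (t-1)-step subtree for this observation branch.
        best_id, best_val = max(((tid, float(g @ alpha)) for tid, alpha in V_prev),
                                key=lambda pair: pair[1])
        subtrees[o] = best_id
        value += gamma * best_val                    # gamma is an assumed discount
    return subtrees, value
```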

20 The witness algorithm Let U_a be the collection of policy trees that the algorithm returns to specify Q_t^a; it is minimal (a parsimonious representation of Q_t^a).

21 To find a witness point Witness theorem (informally): the current set U_a fails to represent Q_t^a exactly if and only if there exist some p ∈ U_a, an o ∈ Ω, a p′ ∈ V_{t-1}, and a belief state b such that the tree obtained from p by replacing its o-subtree with p′ has a greater value at b than every tree in U_a (condition (1)). The witness theorem therefore requires us to search for a p ∈ U_a, an o ∈ Ω, a p′ ∈ V_{t-1} and a b ∈ B such that condition (1) holds, or to guarantee that no such quadruple exists.

22 The linear program to find witness points
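The LP itself is missing from the transcript; the construction below follows the papers' idea, phrased in terms of α-vectors: given the candidate tree's vector α_new and the vectors of the trees currently in U_a, look for a belief b at which α_new improves on every current vector by the largest possible margin δ; a strictly positive δ means b is a witness point. A scipy-based sketch (an illustration of the construction, not the authors' code):

```python
import numpy as np
from scipy.optimize import linprog

def find_witness(alpha_new, U_a, eps=1e-9):
    """Solve:  maximize delta
               subject to  b . alpha_new >= b . alpha + delta  for all alpha in U_a
                           sum_s b(s) = 1,  b(s) >= 0
    Returns a witness belief b if the optimal delta > 0, else None."""
    alpha_new = np.asarray(alpha_new, dtype=float)
    n = len(alpha_new)
    if not U_a:                                    # nothing to compare against yet
        return np.full(n, 1.0 / n)
    # Each constraint rewritten as  b . (alpha - alpha_new) + delta <= 0.
    A_ub = np.array([np.append(np.asarray(alpha, float) - alpha_new, 1.0)
                     for alpha in U_a])
    b_ub = np.zeros(len(U_a))
    A_eq = np.array([np.append(np.ones(n), 0.0)])  # belief sums to one
    b_eq = np.array([1.0])
    c = np.append(np.zeros(n), -1.0)               # linprog minimizes, so use -delta
    bounds = [(0, 1)] * n + [(None, None)]         # b(s) in [0,1], delta free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    if res.success and -res.fun > eps:             # optimal delta is -res.fun
        return res.x[:n]
    return None
```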

