# Meeting 3: POMDP (Partially Observable MDP), 阮鶴鳴 and 李運寰 (CSIE, 4th year), Advisor: Prof. 李琳山



Reference
- "Planning and acting in partially observable stochastic domains", Leslie Pack Kaelbling, Michael L. Littman, Anthony R. Cassandra, Artificial Intelligence, 1998
- "Spoken Dialogue Management Using Probabilistic Reasoning", Nicholas Roy, Joelle Pineau, and Sebastian Thrun, ACL 2000

MDP (Markov Decision Process)
- An MDP model contains:
  - A set of states S
  - A set of actions A
  - A state transition function T, which may be deterministic or stochastic
  - A reward function R(s, a)

MDP
- For MDPs we can compute the optimal policy π and use it to act by simply executing π(s) for the current state s.
- What happens if the agent is no longer able to determine the state it is currently in with complete reliability?
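As a rough illustration of "compute the optimal policy π, then execute π(s)", the sketch below runs value iteration on a tiny two-state MDP. The discount factor `gamma`, the toy states `a` and `b`, and the actions `stay` and `go` are all assumptions made for this example; none of them come from the slides.

```python
from typing import Dict, List, Tuple

State = str
Action = str
# T[s][a] is a list of (next_state, probability) pairs.
Transitions = Dict[State, Dict[Action, List[Tuple[State, float]]]]
Rewards = Dict[State, Dict[Action, float]]


def q_value(T: Transitions, R: Rewards, V: Dict[State, float],
            s: State, a: Action, gamma: float) -> float:
    """One-step lookahead: immediate reward plus discounted expected value."""
    return R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a])


def value_iteration(T: Transitions, R: Rewards, gamma: float = 0.9,
                    tol: float = 1e-8):
    """Return (V, policy) for a finite MDP; gamma is an assumed discount."""
    V = {s: 0.0 for s in T}
    while True:
        delta = 0.0
        for s in T:
            best = max(q_value(T, R, V, s, a, gamma) for a in T[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # The greedy policy with respect to the converged values.
    policy = {s: max(T[s], key=lambda a: q_value(T, R, V, s, a, gamma))
              for s in T}
    return V, policy


# Hypothetical toy MDP: "stay" keeps the state, "go" switches it;
# only staying in state "b" is rewarded.
toy_T: Transitions = {
    "a": {"stay": [("a", 1.0)], "go": [("b", 1.0)]},
    "b": {"stay": [("b", 1.0)], "go": [("a", 1.0)]},
}
toy_R: Rewards = {
    "a": {"stay": 0.0, "go": 0.0},
    "b": {"stay": 1.0, "go": 0.0},
}
toy_V, toy_policy = value_iteration(toy_T, toy_R)
```

Acting then amounts to looking up `toy_policy[s]` for the current state `s`, which is exactly the step that breaks down once the state is no longer directly observable.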

POMDP
- A POMDP model contains:
  - A set of states S
  - A set of actions A
  - A state transition function T
  - A reward function R(s, a)
  - A finite set of observations Ω
  - An observation function O: S × A → Π(Ω), where O(s', a, o) is the probability of observing o after taking action a and arriving in state s'

POMDP Problem
1. Belief state
   - First approach: choose the most probable state of the world, given past experience
     - Informational properties described via observations
   - Not explicit
   - Second approach: probability distributions over states of the world
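For the second approach, the belief state is updated after each action and observation by Bayes' rule. The form below follows the cited Kaelbling et al. paper; the three-argument notation T(s, a, s') for the transition probability is an assumption, since the slides only name T.

```latex
b'(s') = \frac{O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s)}{\Pr(o \mid a, b)},
\qquad
\Pr(o \mid a, b) = \sum_{s' \in S} O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s)
```

The denominator only normalizes the result, so that b' sums to 1 over the states.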

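An example
- Setting, as in the Kaelbling et al. reference: four states in a row, the third being the goal; the agent observes only whether it is in the goal state, and the vectors below are beliefs over the four states.
- Actions: EAST and WEST
  - Each succeeds with probability 0.9; when an action fails, the movement is in the opposite direction. If no movement is possible in a particular direction, the agent remains in the same location.
  - Initially: [0.33, 0.33, 0, 0.33]
  - After taking one EAST movement (and not observing the goal): [0.1, 0.45, 0, 0.45]
  - After taking another EAST movement (again not observing the goal): [0.1, 0.164, 0, 0.736]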
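The belief vectors in this example can be reproduced with a few lines of code. This is a minimal sketch under stated assumptions: the four states lie in a row with the third one as the goal (the layout used in the cited Kaelbling et al. paper), and after each move the agent observes that it is not at the goal. The function names are illustrative only.

```python
from typing import List

N_STATES = 4
GOAL = 2          # assumption: the third cell (index 2) is the goal
P_SUCCESS = 0.9   # each action succeeds with probability 0.9


def east_transition(s: int) -> List[float]:
    """Distribution over next states after taking EAST in state s."""
    probs = [0.0] * N_STATES
    east = min(s + 1, N_STATES - 1)  # blocked by the wall: stay in place
    west = max(s - 1, 0)             # a failed EAST moves west instead
    probs[east] += P_SUCCESS
    probs[west] += 1.0 - P_SUCCESS
    return probs


def update_belief_east_no_goal(belief: List[float]) -> List[float]:
    """Belief after taking EAST and observing 'not at the goal'."""
    predicted = [0.0] * N_STATES
    for s, b_s in enumerate(belief):
        for s2, p in enumerate(east_transition(s)):
            predicted[s2] += b_s * p
    predicted[GOAL] = 0.0            # the observation rules out the goal
    total = sum(predicted)
    return [x / total for x in predicted]


b0 = [1 / 3, 1 / 3, 0.0, 1 / 3]
b1 = update_belief_east_no_goal(b0)   # approximately [0.1, 0.45, 0, 0.45]
b2 = update_belief_east_no_goal(b1)   # approximately [0.1, 0.164, 0, 0.736]
```

The prediction step spreads probability mass according to the transition model, and the observation step removes the mass on the goal and renormalizes.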

POMDP Problem
2. Finding an optimal policy
   - Maps the belief state to actions

Policy Tree
- A tree of depth t that specifies a complete t-step policy.
  - Nodes: actions; the top node determines the first action to be taken.
  - Edges: the resulting observations

Sample Policy Tree

Policy Tree
- Value Evaluation:
  - $V_p(s)$ is the t-step value of starting from state s and executing policy tree p.
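The value of a policy tree can be written recursively. The form below follows the cited Kaelbling et al. paper and is a reconstruction, not the slide's own equation. The symbols are assumptions taken from that paper: a(p) is the action at the root of p, o(p) is the (t − 1)-step subtree followed after observation o, and γ is a discount factor.

```latex
V_p(s) = R\bigl(s, a(p)\bigr)
  + \gamma \sum_{s' \in S} T\bigl(s, a(p), s'\bigr)
    \sum_{o \in \Omega} O\bigl(s', a(p), o\bigr)\, V_{o(p)}(s')
```

The first term is the immediate reward for the root action. The second is the discounted expected value of whichever subtree the next observation selects.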

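Policy Tree
- Value Evaluation:
  - Expected value under policy tree p:
    - $V_p(b) = \sum_{s \in S} b(s)\, V_p(s) = b \cdot \alpha_p$, where $\alpha_p = \langle V_p(s_1), \ldots, V_p(s_n) \rangle$
  - Expected value of executing different policy trees from different initial belief states: $V_t(b) = \max_{p} b \cdot \alpha_p$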

Policy Tree
- Value Evaluation:
  - $V_t$ with only two states: (figure not preserved in this transcript)
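With two states, a belief is a single number, the probability of the first state. Each policy tree p then contributes a straight line over that interval, and the t-step value function is the upper surface of those lines. The sketch below evaluates that upper surface; the three vectors in `example_alphas` are hypothetical values chosen for illustration, not taken from the slides.

```python
from typing import Dict, Tuple


def value_and_best_tree(b_s1: float,
                        alphas: Dict[str, Tuple[float, float]]):
    """Return (V(b), maximizing tree) for the belief (b_s1, 1 - b_s1).

    alphas[p] holds (V_p(s1), V_p(s2)), the value of tree p in each state.
    """
    def expected(p: str) -> float:
        v1, v2 = alphas[p]
        return b_s1 * v1 + (1.0 - b_s1) * v2

    best = max(alphas, key=expected)
    return expected(best), best


# Hypothetical per-state values for three policy trees.
example_alphas = {"p1": (10.0, 0.0), "p2": (0.0, 10.0), "p3": (6.0, 6.0)}

v_sure_s1 = value_and_best_tree(1.0, example_alphas)   # (10.0, "p1")
v_unsure = value_and_best_tree(0.5, example_alphas)    # (6.0, "p3")
```

Because the result is a maximum of linear functions, it is piecewise linear and convex in the belief: the "safe" tree `p3` wins only in the middle, where the agent is most uncertain.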

Policy Tree
- Value Evaluation:
  - $V_t$ with three states: (figure not preserved in this transcript)

Infinite Horizon
- Three algorithms to compute V:
  - Naive approach
  - Improved by choosing useful policy trees
  - Witness algorithm

Infinite Horizon
- Naive approach:
  - ε is a small number
  - A t-step policy tree contains:
    - $(|\Omega|^t - 1) / (|\Omega| - 1)$ nodes
    - Each node can be labeled with |A| possible actions
  - Total number of policy trees: $|A|^{(|\Omega|^t - 1) / (|\Omega| - 1)}$
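To get a feel for how quickly the naive enumeration grows, the counts can be computed directly. This sketch assumes a complete policy tree in which every node has one child per observation. The function names are illustrative, and the example sizes (3 actions, 2 observations) match the tiger problem described later in the slides.

```python
def policy_tree_nodes(num_observations: int, horizon: int) -> int:
    """Nodes in a complete policy tree of depth `horizon`.

    Level k of the tree has num_observations ** k nodes, so the total is
    the geometric sum 1 + |Ω| + ... + |Ω|^(horizon - 1).
    """
    return sum(num_observations ** level for level in range(horizon))


def num_policy_trees(num_actions: int, num_observations: int,
                     horizon: int) -> int:
    """Number of distinct policy trees: one action choice per node."""
    return num_actions ** policy_tree_nodes(num_observations, horizon)


# 3 actions and 2 observations, for horizons 1 through 4.
tree_counts = [num_policy_trees(3, 2, t) for t in range(1, 5)]
# tree_counts == [3, 27, 2187, 14348907]
```

The count is doubly exponential in the horizon, which is why the remaining approaches only keep policy trees that are useful for some belief state.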

Infinite Horizon
- Improved by choosing useful policy trees:
  - $V_{t-1}$, the set of useful (t − 1)-step policy trees, can be used to construct a superset $V_t^+$ of the useful t-step policy trees.
  - There are $|A|\,|V_{t-1}|^{|\Omega|}$ elements in $V_t^+$.
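The superset can be built by pairing every root action with every assignment of a useful (t − 1)-step tree to each observation. The sketch below only enumerates the candidates and does not prune them. Representing a tree as an `(action, {observation: subtree})` tuple is an assumption made for this example, as are the placeholder subtree names.

```python
from itertools import product
from typing import Dict, Hashable, Iterator, Sequence, Tuple

Tree = Tuple[str, Dict[str, Hashable]]


def candidate_trees(actions: Sequence[str],
                    observations: Sequence[str],
                    prev_trees: Sequence[Hashable]) -> Iterator[Tree]:
    """Yield every t-step tree made of a root action plus one
    (t-1)-step subtree chosen independently for each observation."""
    for a in actions:
        for subtrees in product(prev_trees, repeat=len(observations)):
            yield (a, dict(zip(observations, subtrees)))


# Hypothetical sizes: 3 actions, 2 observations, 2 useful subtrees.
candidates = list(candidate_trees(
    ["left", "right", "listen"], ["T_l", "T_r"], ["p1", "p2"]))
# len(candidates) == 3 * 2 ** 2 == 12
```

A pruning step would then discard the candidates that are not optimal at any belief state.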

Infinite Horizon
- Improved by choosing useful policy trees: (slide content not preserved in this transcript)

Infinite Horizon
- Witness algorithm: (slide content not preserved in this transcript)

Infinite Horizon
- Witness algorithm:
  - $Q_t^a$ is a set of t-step policy trees that have action a at their root
  - $Q_t^a(b)$ is the value function
  - And (equation not preserved in this transcript)

Infinite Horizon Witness algorithm: Witness algorithm: –Finding witness:  At each iteration we ask, Is there some belief state,b, for which the true value,, computed by one-step lookahead using Vt-1, is different from the estimated value,, computed using the set U?  Provided

Infinite Horizon Witness algorithm: Witness algorithm: –Finding witness:  Now we can state the witness theorem [25]: The true Q-function,, differs from the approximate Q-function,, if and only if there is some,, and for which there is some b such that

Infinite Horizon
- Witness algorithm:
  - Finding a witness: (slide content not preserved in this transcript)

Infinite Horizon Witness algorithm: Witness algorithm: –Finding witness:  The linear program used to find witness points:
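A linear program of the following shape can search for a witness point. It is a reconstruction along the lines of the cited Kaelbling et al. paper, not a copy of the slide. The inputs are assumed to be the current set U and a candidate tree p_new. The variables are δ and the belief components b(s).

```latex
\begin{aligned}
\text{maximize}\quad & \delta \\
\text{subject to}\quad
  & \sum_{s \in S} b(s)\, V_{p_{\text{new}}}(s)
    - \sum_{s \in S} b(s)\, V_{\tilde{p}}(s) \ge \delta
    && \text{for all } \tilde{p} \in U, \\
  & b(s) \ge 0 && \text{for all } s \in S, \\
  & \sum_{s \in S} b(s) = 1 .
\end{aligned}
```

If the optimal δ is strictly positive, the maximizing b is a belief state at which p_new beats every tree in U, which makes it a witness point.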

Infinite Horizon
- Witness algorithm:
  - Complete value iteration:
    - An agenda containing any single policy tree
    - A set U containing the desired policy trees
    - Using $p_{\text{new}}$ to determine whether it is an improvement over the policy trees in U
  1. If no witness point is discovered, then that policy tree is removed from the agenda. When the agenda is empty, the algorithm terminates.
  2. If a witness point is discovered, the best policy tree for that point is calculated and added to U, and all policy trees that differ from the current policy tree in a single subtree are added to the agenda.

Infinite Horizon
- Witness algorithm:
  - Complexity:
    - Since we know that no more than [expression not preserved] witness points are discovered (each adds a tree to the set of useful policy trees),
  - only [expression not preserved] trees can ever be added to the agenda (in addition to the one tree in the initial agenda).
    - Each of these linear programs either removes a policy from the agenda (this happens at most [expression not preserved] times) or discovers a witness point (this happens at most [expression not preserved] times).

Tiger Problem
- Two doors:
  - Behind one door is a tiger
  - Behind the other door is a large reward
- Two states:
  - $s_l$ is the state of the world when the tiger is on the left, and $s_r$ when it is on the right
- Three actions:
  - left, right, and listen
- Rewards:
  - The reward for opening the correct door is +10, the penalty for choosing the door with the tiger behind it is -100, and the cost of listen is -1
- Observations:
  - Hearing the tiger on the left ($T_l$) or hearing the tiger on the right ($T_r$)
  - In state $s_l$, the listen action results in observation $T_l$ with probability 0.85 and observation $T_r$ with probability 0.15; conversely for world state $s_r$
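The numbers above are enough to sketch how listening sharpens the belief and when opening a door starts to look attractive. The code below is a minimal illustration, not the optimal policy. It assumes that the actions `left` and `right` mean opening that door, that listening does not move the tiger, and it compares only immediate (one-step) expected rewards. The function names are illustrative.

```python
P_CORRECT = 0.85  # listening reliability from the slide

# REWARD[action][state]: opening the tiger's door costs 100,
# opening the other door earns 10, and listening costs 1.
REWARD = {
    "left":   {"s_l": -100.0, "s_r": 10.0},
    "right":  {"s_l": 10.0, "s_r": -100.0},
    "listen": {"s_l": -1.0, "s_r": -1.0},
}


def listen_update(b_left: float, heard: str,
                  p_correct: float = P_CORRECT) -> float:
    """Return Pr(tiger on the left) after listening and hearing
    'T_l' or 'T_r', using Bayes' rule."""
    like_left = p_correct if heard == "T_l" else 1.0 - p_correct
    like_right = 1.0 - p_correct if heard == "T_l" else p_correct
    numerator = like_left * b_left
    return numerator / (numerator + like_right * (1.0 - b_left))


def one_step_values(b_left: float) -> dict:
    """Immediate expected reward of each action at belief b_left."""
    return {a: b_left * r["s_l"] + (1.0 - b_left) * r["s_r"]
            for a, r in REWARD.items()}


def greedy_action(b_left: float) -> str:
    values = one_step_values(b_left)
    return max(values, key=values.get)


b_once = listen_update(0.5, "T_l")      # 0.85
b_twice = listen_update(b_once, "T_l")  # about 0.970
```

Under these assumptions, starting from a uniform belief, one `T_l` observation is not enough to justify opening the right door (its expected reward is still below the cost of listening), while two consistent observations are. Passing a lower `p_correct`, such as 0.65, makes each observation less informative, so more listening is needed before opening a door pays off.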

Tiger Problem

Decreasing listening reliability from 0.85 down to 0.65: (figure not preserved in this transcript)

The End