# 5/11/2015 Mahdi Naser-Moghadasi Texas Tech University.



An MDP model contains:

- A set of states S
- A set of actions A
- A state transition function T
- A reward function R(s, a)

(Figure: the agent-environment loop; the agent sends an action, the environment returns a state and a reward.)
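The components above are enough to compute optimal values by value iteration. The following is a hypothetical toy example: the two states, the transition probabilities, the rewards, and the discount factor GAMMA are all made up for illustration, and the update rule is the standard Bellman backup rather than anything specific to the slides.

```python
# Toy MDP: states S, actions A, transitions T(s, a, s'), rewards R(s, a).
# All numbers below are invented for illustration.
S = ["s0", "s1"]
A = ["stay", "go"]
GAMMA = 0.9  # discount factor (an assumption; the slides do not state one)

# T[(s, a)] maps to {s': probability}.
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s1": 1.0},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 1.0},
}
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0,
     ("s1", "stay"): 2.0, ("s1", "go"): 0.0}

def value_iteration(n_steps=100):
    """Repeatedly apply the Bellman backup V(s) = max_a [R + gamma * E[V(s')]]."""
    V = {s: 0.0 for s in S}
    for _ in range(n_steps):
        V = {s: max(R[(s, a)]
                    + GAMMA * sum(p * V[sp] for sp, p in T[(s, a)].items())
                    for a in A)
             for s in S}
    return V
```

With these numbers the agent should prefer to sit in s1 collecting reward 2, giving V(s1) near 2 / (1 - 0.9) = 20 and V(s0) near 1 + 0.9 * 20 = 19.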

A POMDP model contains:

- A set of states S
- A set of actions A
- A state transition function T
- A reward function R(s, a)
- A finite set of observations Ω
- An observation function O: S × A → Π(Ω), written O(s', a, o)
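The transition function T and observation function O together determine how an agent's belief over states is updated after acting and observing: b'(s') is proportional to O(s', a, o) times the probability mass flowing into s'. A minimal sketch, with a dict-based encoding and function name of my own choosing:

```python
# Bayesian belief update for a POMDP:
#   b'(s') ∝ O(s', a, o) * Σ_s T(s, a, s') * b(s)
def belief_update(b, a, o, states, T, O):
    """b: {s: prob}. T[(s, a)]: {s': prob}. O[(s', a)]: {o: prob}."""
    new_b = {}
    for sp in states:
        # Predicted probability of landing in s' under action a.
        pred = sum(T[(s, a)].get(sp, 0.0) * b[s] for s in states)
        # Weight by the likelihood of the observation actually seen.
        new_b[sp] = O[(sp, a)].get(o, 0.0) * pred
    norm = sum(new_b.values())
    return {s: v / norm for s, v in new_b.items()}
```

For example, with two states, a "listen" action that leaves the state unchanged, and an observation that is correct with probability 0.85, a uniform belief shifts to 0.85 on the state matching the observation.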

Outline:

- Value function
- Policy: a description of the agent's behavior
- Policy trees
- The Witness algorithm

A policy tree of depth t specifies a complete t-step policy:

- Nodes are labeled with actions; the top node determines the first action to be taken.
- Edges are labeled with the resulting observations.
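A policy tree can be evaluated recursively: take the root action, collect its reward, then for each possible next state and observation, follow the matching subtree. This is a hypothetical sketch; the class, the function name, and the discount factor GAMMA are my own, and the recurrence is the standard one for policy-tree values.

```python
GAMMA = 0.9  # discount factor (an assumption, not stated on the slides)

class PolicyTree:
    def __init__(self, action, subtrees=None):
        self.action = action            # node label: the action to take
        self.subtrees = subtrees or {}  # edge label (observation) -> subtree

def tree_value(p, s, states, obs, T, O, R):
    """V_p(s): expected value of executing policy tree p starting in state s."""
    a = p.action
    v = R[(s, a)]
    if not p.subtrees:                  # depth-1 tree: just the immediate reward
        return v
    for sp in states:
        t = T[(s, a)].get(sp, 0.0)
        for o in obs:
            v += (GAMMA * t * O[(sp, a)].get(o, 0.0)
                  * tree_value(p.subtrees[o], sp, states, obs, T, O, R))
    return v
```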

Value function: V_p(s) is the t-step value of starting in state s and executing policy tree p.

Value evaluation: V_t with only two states.

Value function: V_t with three states.
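The pictures behind these slides show that each policy tree p contributes a linear function of the belief, Σ_s b(s) V_p(s), and the value function V_t is the upper surface of these lines (or planes, with three states). A minimal sketch; the vectors in the example are invented "alpha vectors", not values from the slides:

```python
def value_at(belief, alpha_vectors):
    """V_t(b) = max over policy trees p of Σ_s b(s) * V_p(s).

    belief: tuple of state probabilities.
    alpha_vectors: one tuple of per-state values V_p(s) per policy tree.
    """
    return max(sum(b * v for b, v in zip(belief, alpha))
               for alpha in alpha_vectors)
```

Because V_t is a maximum of linear functions, it is piecewise linear and convex over the belief simplex, which is what the two- and three-state plots illustrate.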

The value function is improved (and kept small) by retaining only useful policy trees.
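The simplest notion of a "useless" policy tree is one whose alpha vector is pointwise dominated by another's, so it can never achieve the maximum at any belief. The sketch below implements only that simple rule; fully pruning trees dominated by combinations of others requires a linear program, which is omitted here.

```python
def prune_dominated(alpha_vectors):
    """Drop any vector that is pointwise dominated by some other vector."""
    kept = []
    for i, a in enumerate(alpha_vectors):
        dominated = any(j != i
                        and b != a
                        and all(b[k] >= a[k] for k in range(len(a)))
                        for j, b in enumerate(alpha_vectors))
        if not dominated:
            kept.append(a)
    return kept
```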

The Witness algorithm.

Witness algorithm: finding a witness.

At each iteration we ask: is there some belief state b for which the true value V_t(b), computed by one-step lookahead using V_{t-1}, differs from the estimated value computed using the set U?

Witness algorithm: complete value iteration.

- Maintain an agenda, initialized with any single policy tree, and a set U containing the desired policy trees found so far.
- Use each candidate tree p_new to determine whether it is an improvement over the policy trees in U:
  1. If no witness point is discovered, that policy tree is removed from the agenda. When the agenda is empty, the algorithm terminates.
  2. If a witness point is discovered, the best policy tree for that point is computed and added to U, and all policy trees that differ from the current policy tree in a single subtree are added to the agenda.
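The core witness test compares the true one-step-lookahead value against the estimate from U. The real algorithm finds a witness belief exactly with a linear program; the sketch below substitutes a coarse grid scan over two-state beliefs purely for illustration, and every name in it is my own.

```python
def find_witness(lookahead_value, U, n_points=101):
    """Search for a belief where U underestimates the true value.

    lookahead_value(b): the true one-step-lookahead value at belief b,
        computed from V_{t-1} (supplied by the caller).
    U: list of alpha vectors (V_p(s0), V_p(s1)) for trees found so far.
    """
    for i in range(n_points):
        b = (i / (n_points - 1), 1 - i / (n_points - 1))
        estimated = max(b[0] * a[0] + b[1] * a[1] for a in U)
        if lookahead_value(b) > estimated + 1e-9:
            return b        # a witness: U's estimate is wrong here
    return None             # no witness found on the grid
```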

The tiger problem:

- Two doors: behind one door is a tiger; behind the other is a large reward.
- Two states: s_l when the tiger is behind the left door, s_r when it is behind the right door.
- Three actions: left, right, and listen.
- Rewards: +10 for opening the correct door, -100 for opening the door with the tiger behind it, and -1 for each listen.
- Observations: hearing the tiger on the left (T_l) or on the right (T_r). In state s_l, the listen action yields observation T_l with probability 0.85 and T_r with probability 0.15; conversely for world state s_r.
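Because listening never changes the state, the belief update for the tiger problem reduces to a one-line application of Bayes' rule. Using the slide's 0.85 / 0.15 numbers (the function name is my own):

```python
def update_on_Tl(b, reliability=0.85):
    """Posterior P(tiger on left) after hearing T_l, given prior b."""
    return reliability * b / (reliability * b + (1 - reliability) * (1 - b))

b = 0.5
b = update_on_Tl(b)   # one listen hearing T_l: belief becomes 0.85
b = update_on_Tl(b)   # a second consistent T_l: belief is about 0.97
```

This is why the optimal policy listens before opening a door: a couple of -1 listens buy enough certainty to make the +10 / -100 gamble favorable.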

 Decreasing listening reliability from 0.85 down to 0.65:
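A quick Bayes-rule comparison shows why lower reliability changes the policy: at 0.65 the belief sharpens much more slowly, so more -1 listen actions are needed before opening a door is worth the risk. The function name below is my own.

```python
def update_on_Tl(b, reliability):
    """Posterior P(tiger on left) after hearing T_l, given prior b."""
    return reliability * b / (reliability * b + (1 - reliability) * (1 - b))

b85, b65 = 0.5, 0.5
for _ in range(2):      # two consistent T_l observations
    b85 = update_on_Tl(b85, 0.85)
    b65 = update_on_Tl(b65, 0.65)
# b85 is about 0.970, while b65 is only about 0.775
```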

Discussion questions:

- How does the horizon length affect the complexity of solving POMDPs? Can we conclude that pruning non-useful policy trees is the key to solving POMDPs?
- On page 5, the authors say "sometimes we need to … compute a greedy policy given a function". Why would you need the greedy policy?
- Can you explain the Witness algorithm? I don't understand it at all. (Page 15)
- Did you find any papers that implement the techniques in this paper and provide a discussion of timing or accuracy?
- Can you give some more real-world POMDP problems besides the tiger problem?