Slide 1
Mahdi Naser-Moghadasi, Texas Tech University (5/11/2015)

Slide 2
An MDP model contains:
- A set of states S
- A set of actions A
- A state transition function T
- A reward function R(s, a)
(Diagram: the agent–environment loop — the agent takes an action, the environment returns a state and a reward.)
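The (S, A, T, R) tuple above can be sketched as plain data structures; a minimal sketch with illustrative states, actions, and rewards (all values here are made up for the example):

```python
# Minimal MDP sketch (S, A, T, R); the concrete numbers are illustrative.
S = ["s0", "s1"]
A = ["stay", "go"]

# T[s][a] maps next-state -> probability.
T = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}

# R(s, a): immediate reward for taking action a in state s.
R = {("s0", "stay"): 0.0, ("s0", "go"): 1.0,
     ("s1", "stay"): 0.0, ("s1", "go"): -1.0}

def step_value(s, a, V, gamma=0.9):
    """One-step lookahead: R(s, a) + gamma * E[V(s')]."""
    return R[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())

V = {"s0": 0.0, "s1": 0.0}
print(step_value("s0", "go", V))  # 1.0
```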

Slide 3
A POMDP model contains:
- A set of states S
- A set of actions A
- A state transition function T
- A reward function R(s, a)
- A finite set of observations Ω
- An observation function O: S × A → Π(Ω), written O(s', a, o)
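The new ingredient relative to an MDP is the observation function O(s', a, o): the probability of receiving observation o after action a lands the system in state s'. A sketch with illustrative names and probabilities (the 0.85/0.15 split foreshadows the tiger example later in the deck):

```python
# Sketch of a POMDP observation function O(s', a, o); names and numbers
# are illustrative, not part of a specific library API.
Omega = ["o_left", "o_right"]

# O[(s_next, a)] is a probability distribution over observations.
O = {
    ("s_left", "listen"):  {"o_left": 0.85, "o_right": 0.15},
    ("s_right", "listen"): {"o_left": 0.15, "o_right": 0.85},
}

def obs_prob(s_next, a, o):
    """O(s', a, o): probability of observing o after a leads to s'."""
    return O[(s_next, a)][o]

# Each O(s', a, .) must be a proper distribution over Omega:
for dist in O.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```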

Slide 4
Outline:
- Value functions
- Policy: a description of the agent's behavior
- Policy trees
- The witness algorithm

Slide 5
A policy tree of depth t specifies a complete t-step policy.
- Nodes are actions; the top node determines the first action to be taken.
- Edges are labeled with the resulting observations.
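The node/edge structure described above can be sketched as a small recursive data type; the action and observation names below are illustrative (borrowed from the tiger example later in the deck):

```python
# Sketch of a policy tree: the root holds the first action to take, and each
# observation edge leads to the subtree to execute next.
class PolicyTree:
    def __init__(self, action, children=None):
        self.action = action            # action chosen at this node
        self.children = children or {}  # observation -> PolicyTree

def execute(tree, observations):
    """Follow a sequence of observations down the tree, collecting actions."""
    actions = []
    node = tree
    for o in observations:
        actions.append(node.action)
        node = node.children[o]
    actions.append(node.action)
    return actions

# A 2-step tree: listen first, then open the door away from the tiger.
leaf_l = PolicyTree("open_right")
leaf_r = PolicyTree("open_left")
root = PolicyTree("listen", {"hear_left": leaf_l, "hear_right": leaf_r})
print(execute(root, ["hear_left"]))  # ['listen', 'open_right']
```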

Slide 7
Value function: V_p(s) is the t-step value of starting in state s and executing policy tree p.
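V_p(s) decomposes recursively: the immediate reward for the root action, plus the expected value of the observation-selected subtree in the next state. A sketch with placeholder model functions R, T, and O (all names here are assumptions for illustration):

```python
# Recursive evaluation of a policy tree's value, a sketch of
# V_p(s) = R(s, a) + gamma * sum_{s'} T(s, a, s') * sum_o O(s', a, o) * V_{p_o}(s')
def tree_value(p, s, R, T, O, S, Omega, gamma=1.0):
    a = p["action"]
    v = R(s, a)
    for s2 in S:
        pt = T(s, a, s2)
        if pt == 0.0 or not p["children"]:
            continue
        for o in Omega:
            v += gamma * pt * O(s2, a, o) * tree_value(
                p["children"][o], s2, R, T, O, S, Omega, gamma)
    return v

# Degenerate one-state model: each step is worth 1, so a depth-2 tree is worth 2.
S, Omega = ["s"], ["o"]
R = lambda s, a: 1.0
T = lambda s, a, s2: 1.0
O = lambda s2, a, o: 1.0
leaf = {"action": "a", "children": {}}
root = {"action": "a", "children": {"o": leaf}}
print(tree_value(root, "s", R, T, O, S, Omega))  # 2.0
```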

Slide 8
Value evaluation: V_t with only two states.

Slide 9
Value function: V_t with three states.

Slide 10
Improved by keeping only the useful policy trees:

Slide 11
The witness algorithm:

Slide 12
The witness algorithm — finding a witness: at each iteration we ask, "Is there some belief state b for which the true value, computed by one-step lookahead using V_{t-1}, is different from the estimated value computed using the set U?"
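The question above reduces to comparing two numbers at a candidate belief b. A sketch of that comparison, representing each policy tree in U by its alpha-vector of per-state values (the vectors below are illustrative, loosely modeled on the tiger rewards):

```python
# Sketch of the witness test: at belief b, compare the value estimated from
# the current set U with the true one-step-lookahead value. The alpha-vectors
# and the true value used below are illustrative assumptions.
def value_from_set(b, U):
    """Estimated value: max over alpha-vectors (one per tree in U) of b . alpha."""
    return max(sum(b[s] * alpha[s] for s in b) for alpha in U)

def is_witness(b, U, true_value, eps=1e-9):
    """b is a witness if the true value exceeds what U can explain."""
    return true_value - value_from_set(b, U) > eps

# Two "open a door" trees cover the uniform belief poorly:
U = [{"s_l": 10.0, "s_r": -100.0}, {"s_l": -100.0, "s_r": 10.0}]
b = {"s_l": 0.5, "s_r": 0.5}
print(value_from_set(b, U))               # -45.0
print(is_witness(b, U, true_value=-1.0))  # True: a listening tree does better here
```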

Slide 13
The witness algorithm — complete value iteration:
- An agenda containing any single policy tree.
- A set U containing the desired (useful) policy trees.
- Each candidate p_new is tested to determine whether it is an improvement over the policy trees in U.
1. If no witness point is discovered, that policy tree is removed from the agenda. When the agenda is empty, the algorithm terminates.
2. If a witness point is discovered, the best policy tree for that point is computed and added to U, and all policy trees that differ from the current policy tree in a single subtree are added to the agenda.
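The agenda/U control flow above can be sketched as a loop; `find_witness` and `best_tree_at` are placeholders for the linear-program-based subroutines of the real algorithm, so this only illustrates the bookkeeping, not the full method:

```python
# Control-flow sketch of witness value iteration. find_witness, best_tree_at,
# and neighbors are caller-supplied placeholders (assumptions for illustration):
#   find_witness(p, U) -> a belief where p beats everything in U, or None
#   best_tree_at(b)    -> the best policy tree at belief b
#   neighbors(p)       -> trees differing from p in a single subtree
def witness_iteration(initial_tree, find_witness, best_tree_at, neighbors):
    agenda = [initial_tree]   # any single policy tree
    U = set()                 # the set of useful policy trees
    while agenda:
        p = agenda.pop()
        b = find_witness(p, U)
        if b is None:
            continue                      # step 1: no witness, drop p
        p_best = best_tree_at(b)          # step 2: best tree at the witness point
        U.add(p_best)
        agenda.extend(neighbors(p_best))  # enqueue single-subtree variants
    return U

# Stub run: one witness is found, its best tree is kept, no further witnesses.
result = witness_iteration(
    "p0",
    lambda p, U: ("b0" if not U else None),
    lambda b: "p*",
    lambda p: ["p1"],
)
print(result)  # {'p*'}
```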

Slide 14
The tiger problem:
- Two doors: behind one door is a tiger; behind the other is a large reward.
- Two states: s_l when the tiger is behind the left door, s_r when it is behind the right.
- Three actions: left, right, and listen.
- Rewards: +10 for opening the correct door, -100 for opening the door with the tiger behind it, and -1 for each listen.
- Observations: hearing the tiger on the left (T_l) or on the right (T_r). In state s_l, the listen action yields T_l with probability 0.85 and T_r with probability 0.15; conversely for state s_r.
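With the 0.85/0.15 listening model above, each listen updates the agent's belief by Bayes' rule (listening does not move the tiger). A short sketch of that update:

```python
# Bayes belief update for the tiger problem using the slide's numbers:
# after hearing T_l, the belief that the tiger is on the left rises.
def update_belief(b_left, p_hear_correct=0.85):
    """P(tiger left | heard T_l), given prior b_left; listen leaves the state fixed."""
    num = p_hear_correct * b_left
    den = num + (1.0 - p_hear_correct) * (1.0 - b_left)
    return num / den

b = 0.5
b = update_belief(b)
print(round(b, 4))  # 0.85
b = update_belief(b)
print(round(b, 4))  # hearing T_l again pushes the belief higher (~0.97)
```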

Slide 16
Decreasing the listening reliability from 0.85 down to 0.65:

Slide 17
Discussion questions:
- How does the horizon length affect the complexity of solving POMDPs?
- Can we conclude that pruning non-useful policies is the key to solving POMDPs?
- On page 5, the authors say "sometimes we need to … compute a greedy policy given a function" — why would you need the greedy policy?
- Can you explain the witness algorithm? I don't understand it at all. (Page 15)
- Did you find any papers that implement the techniques in this paper and provide a discussion of timing or accuracy?
- Can you give some more real-world POMDP problems besides the tiger problem?
