# CS594 Automated decision making University of Illinois, Chicago


CS594 Automated decision making University of Illinois, Chicago
Planning and Acting in Partially Observable Stochastic Domains. Leslie P. Kaelbling, Michael L. Littman, Anthony R. Cassandra. CS594 Automated decision making course presentation. Professor: Piotr. Hui Huang, University of Illinois, Chicago. Oct 30, 2002. Hello, everyone. Happy Halloween, first of all. I'm here today to introduce the famous paper… by…

Outline: MDP & POMDP basics  POMDP → belief-space MDP  Policy trees
Piecewise linear value functions  The tiger problem. Before we start, let's look at the outline for this presentation. We already know the basics of MDPs and POMDPs from previous lectures, met the famous α vectors from Brad, and got some fresh ideas about piecewise linear value functions from Mingqing. Now it is our turn to look deeply at these concepts again through a new interpretation, the policy tree, and to think about them as a whole. First, we quickly go over this material again to get a more comprehensive understanding of these topics from the authors' point of view. Second, for complex POMDP problems, this paper introduces an original idea for solving them: to solve the resulting belief MDP, we first look at two key properties and then build on them. Finally, we look at a very interesting example and see how it is solved using the above machinery. Presentation: Planning and Acting in Partially Observable Stochastic Domains

MDP Model. Process: observe state st ∈ S; choose action at ∈ A;
receive immediate reward rt; the state changes to st+1. MDP model: <S, A, T, R>. (Figure: agent–environment loop, trajectory s0, a0, r0, s1, a1, r1, s2, a2, r2, s3, …) As we already know, the MDP serves as the basis for solving the more complex POMDP problem we are ultimately interested in. The figure shows that an MDP is a model of an agent interacting synchronously with the world. The MDP model can be expressed in the mathematical form <S, A, T, R> we have seen several times before. Keep in mind that the Markov property holds in this model: the state and reward at time t+1 depend only on the state and the action at time t.

Policy & Value Function
Which action, a, should the agent take? In MDPs, a policy is a mapping from states to actions, π: S → A. The value function Vt(s), given a policy π, is the expected sum of rewards gained from starting in state s and executing the non-stationary policy π for t steps. Relation: the value function evaluates a policy based on the long-run value the agent expects to gain from executing it. In an MDP, our goal can be finding the optimal policy for all possible states, or finding the sequence of optimal value functions. As we have seen, the concept of a policy is closely related to the concept of a value function, which is a mapping from a state to a real number given a policy. Each policy has an associated value function, and the better the policy, the better its value function.

Optimization (MDPs). Recursively calculate the expected long-term reward for each state/belief, then find the action that maximizes the expected reward. Based on the policy and value function above, one can construct a value function given a policy, or construct a policy given a value function; the better the value function, the better the constructed policy. This leads us to the optimal policy and the sequence of optimal value functions for a given starting state, and thus to our goal. For algorithms, we are familiar with value iteration, which I don't show here. As we already know, it improves value functions iteratively; each iteration is referred to as a dynamic-programming update (DP update), and it stops when the current value function is sufficiently close to the optimal one.
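Since the slides describe the DP update only in words, here is a minimal value-iteration sketch for a discrete MDP <S, A, T, R>. The two-state MDP at the bottom is an invented toy example, not taken from the paper.

```python
import numpy as np

def value_iteration(T, R, gamma=0.95, eps=1e-6):
    """T has shape (|A|, |S|, |S'|); R has shape (|S|, |A|)."""
    V = np.zeros(R.shape[0])
    while True:
        # DP update: Q(s,a) = R(s,a) + gamma * sum_s' T(s,a,s') V(s')
        Q = R + gamma * np.einsum('ast,t->sa', T, V)
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < eps:          # termination test
            return V_new, Q.argmax(axis=1)         # values and greedy policy
        V = V_new

# Toy 2-state, 2-action MDP (illustrative numbers only)
T = np.array([[[0.9, 0.1], [0.1, 0.9]],    # action 0
              [[0.5, 0.5], [0.5, 0.5]]])   # action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
V, pi = value_iteration(T, R)   # greedy policy: action 0 in s0, action 1 in s1
```

The loop is a max-norm contraction for gamma < 1, so the termination test is eventually satisfied.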

POMDP: UNCERTAINTY Case #1: Uncertainty about the action outcome
Case #2: Uncertainty about the world state due to imperfect (partial) information. Now we enter the POMDP problem. There are two sources of uncertainty here. Case 1 is the same as in an MDP: the outcome of an action is stochastic. Case 2 is new: the agent never observes the world state directly.

A broad perspective. (Figure: WORLD state feeds OBSERVATIONS into the AGENT, which maintains a belief state and emits ACTIONS.) In a POMDP, by contrast, the current feedback does not provide sufficient information about the world state, so the question "how should the agent select actions?" arises. A better approach is for the agent to take information from previous steps (its history) into account, based on observations of the real world state. All such information can be summarized by a probability distribution over the set of states; in the literature, this probability distribution is referred to as a belief state. As the figure shows, the agent selects actions according to the belief state computed from observations of the real world state. What exactly is a belief state? How does the POMDP model work via belief states? How do we find an optimal policy? Answers to these questions come in later slides. In MDPs our goal was to find a mapping from states to actions; here it is instead a mapping from probability distributions over states to appropriate actions. GOAL = selecting appropriate actions.

What are POMDPs? Components: set of states s∈S,
set of actions a∈A, set of observations o∈Ω. (Figure: states S1, S2, S3 with stochastic transitions, e.g. a1 from S1 splitting 0.5/0.5, and observation likelihoods such as Pr(o1)=0.9, Pr(o2)=0.1 attached to states.) POMDP parameters: initial belief b0(s)=Pr(S=s); belief state updating b'(s')=Pr(s'|o, a, b); observation probabilities O(s',a,o)=Pr(o|s',a); transition probabilities T(s,a,s')=Pr(s'|s,a); rewards R(s,a). So formally we define POMDPs on this slide. There are three basic components; let's examine the parameters needed. For the belief state, we need an initial belief and a belief-update function for computing the new belief state. O(s',a,o) is the probability of making observation o given that the agent took action a and landed in state s'. The transition probabilities measure the chance the agent moves to the new state s' from state s after taking action a. The reward R(s,a) specifies the immediate reward for taking action a in state s. The state transitions are shown in the figure.

Belief state: a probability distribution over the states of the underlying MDP. The agent keeps an internal belief state, b, that summarizes its experience, and uses a state estimator, SE, to update the belief state b' from the last action at-1, the current observation ot, and the previous belief state b. The belief state is a sufficient statistic (it satisfies the Markov property). (Figure: Action → World → Observation → SE → b → π, inside the agent.) Let's go deeper into the belief state concept. From the figure we can easily see where the belief state comes from. Note that the policy π is a function of the belief state rather than of the state of the world.

1D belief space for a 2-state POMDP
We have already seen this graphical representation of the belief space and belief-state updating in Mingqing's lectures. The belief space is the set of all possible probability distributions over states. For a simple case, a two-state POMDP, we can represent the belief state with a single number: since a belief state is a probability distribution, all probabilities must sum to 1, so if the probability of being in one state is p, the probability of being in the other must be 1−p. The entire belief space can therefore be represented as a line segment. Next, back to the belief-state updating discussed earlier. Assume we start with a particular belief state b, take action a1, and receive observation z1; then our next belief state is fully determined. In fact, since there are finitely many actions and finitely many observations, a given belief state has only finitely many possible next belief states, one per action–observation combination. The figure shows this process for a POMDP with two states (s1 and s2), two actions (a1 and a2), and three observations (z1, z2 and z3): the starting belief state is the big yellow dot, the resulting belief states are the smaller black dots, and the arcs represent the belief-state transformations. The formula for computing the new degree of belief in a state s' is b'(s') = O(s', a, o) Σ_s T(s, a, s') b(s) / Pr(o | a, b); its derivation (basically Bayes' rule and some careful inference) was shown in Martai's lectures already.
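The belief-update formula can be written directly as code. This is a generic sketch, assuming the array layout T[a, s, s'] and O[a, s', o]; the normalizer is exactly Pr(o | a, b).

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """b'(s') = O(s',a,o) * sum_s T(s,a,s') b(s) / Pr(o|a,b)."""
    unnormalized = O[a, :, o] * (b @ T[a])   # numerator, one entry per s'
    pr_o = unnormalized.sum()                # Pr(o | a, b)
    return unnormalized / pr_o, pr_o

# Example: identity transition, noisy observation (illustrative numbers)
T = np.array([[[1.0, 0.0], [0.0, 1.0]]])
O = np.array([[[0.9, 0.1], [0.1, 0.9]]])
b1, p = belief_update(np.array([0.5, 0.5]), 0, 0, T, O)
# b1 = [0.9, 0.1]: observation 0 shifts belief toward state 0
```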

POMDP  Continuous-Space Belief MDP
A POMDP can be seen as a continuous-space "belief MDP", since the agent's belief is encoded by a continuous "belief state". We may solve this belief MDP as before, using the value iteration algorithm to find the optimal policy in a continuous space; however, some adaptations of the algorithm are needed. (Figure: chain S0 —A1/O1→ S1 —A2/O2→ S2 —A3/O3→ S3.) It turns out that the process of maintaining the belief state is Markovian: the next belief state depends only on the current belief state (together with the current action and observation). In fact, we can convert a discrete POMDP problem into a continuous-space belief MDP problem whose state space is the belief space. The transitions of this new continuous-space belief MDP are easily derived from the transition and observation probabilities of the POMDP. This means we are back to solving a completely observable MDP, so we can use the value iteration (VI) algorithm and the Bellman equation; however, we will need to adapt the algorithm somewhat.

Belief MDP. The policy of a POMDP maps the current belief state into an action. As the belief state holds all relevant information about the past, the optimal policy of the POMDP is the solution of the (continuous-space) belief MDP. A belief MDP is a tuple <B, A, ρ, P>: B = infinite set of belief states; A = finite set of actions; ρ(b, a) = Σ_s b(s) R(s, a) (reward function); P(b'|b, a) = Σ_o P(b'|b, a, o) Pr(o|a, b) (transition function), where P(b'|b, a, o) = 1 if SE(b, a, o) = b', and P(b'|b, a, o) = 0 otherwise. Now we can give the formal mathematical definition of the belief MDP model.

Thinking (can we solve this belief MDP?)
The Bellman equation for this belief MDP is V*(b) = max_a [ρ(b, a) + γ Σ_{b'} P(b'|b, a) V*(b')]. In the general case, continuous-space MDPs are very hard to solve. Unfortunately, DP updates cannot be carried out directly because there are uncountably many belief states; one cannot enumerate the value function point by point. To conduct the DP steps implicitly, we need to discuss the properties of value functions. Two special properties can be exploited to simplify the problem: the policy tree, and the piecewise linear and convex property. We then find an algorithm that constructs the optimal t-step discounted value function over the belief space using value iteration. Value iteration is the standard approach: as we already know, it computes a near-optimal value function in two basic steps, the dynamic-programming (DP) update and a termination test. The big problem with value iteration here is the continuous state space: if the POMDP has n states, the belief MDP has an n-dimensional continuous state space, and the value function is some arbitrary function over belief space. So continuous MDPs are hard to solve in general, and the following section studies the properties of value functions that make implicit DP steps possible.

Policy Tree. With one step remaining, the agent must take a single action. With two steps to go, it takes an action, makes an observation, and then takes the final action. In general, an agent's t-step policy can be represented as a policy tree. (Figure: a root action node A with t steps to go, observation branches O1…Ok leading to (t−1)-step subtrees; at the bottom, the 2-steps-to-go and 1-step-to-go levels.) A policy tree is just a graphical representation of a policy, whether for an MDP or a POMDP. Why do we say that? For an MDP there are no observation probabilities: the agent has full information about where it is, so the subtrees of a node simply represent the possible next actions. For a POMDP, the next action also depends on the observation received, so there is uncertainty over the actions taken, in addition to the uncertainty over states (which the picture does not show explicitly). In general, an agent's non-stationary t-step policy is a tree of depth t: the top node determines the first action to be taken; then, depending on the resulting observation, an arc is followed to a node on the next level, which determines the next action. Formally, a t-step policy tree has two components: an action a for the current time point and a (t−1)-step policy tree for each possible observation, covering the remaining t−1 time points. By simply altering the actions at the action nodes, one obtains different t-step policy trees.

Value Function for policy tree p
If p is a one-step policy tree, then the value of executing its action in state s is Vp(s) = R(s, a(p)). More generally, if p is a t-step policy tree, then Vp(s) = R(s, a(p)) + γ (expected value of the future) = R(s, a(p)) + γ Σ_{s'} T(s, a(p), s') Σ_o O(s', a(p), o) V_{o(p)}(s'), where o(p) is the (t−1)-step subtree of p followed after observation o. Thus Vp can be thought of as a vector associated with the policy tree p, since its dimension equals the number of states; we often use the notation α_p for this vector. Knowing this, what is the expected discounted value gained from executing a policy tree p? For the simplest case, it is the immediate reward, where a(p) is the action specified at the top node of p. More generally, the value function sums the immediate reward for performing the root action with the expected discounted rewards the agent receives over the next t−1 time points if the world is currently in state s and the agent behaves according to p. The expected value of the future first takes an expectation over the possible next states s' from the given start state s; then, since the subtree to execute depends on the observation, it takes a further expectation, over all possible observations, of the value of executing the associated subtree o(p) starting in state s'.
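The recursion above can be sketched directly. The `PolicyTree` class is an invented minimal representation (subtrees stored as a list indexed by observation), and the array layout T[a, s, s'], O[a, s', o] is an assumption; the numbers reuse the tiger problem's listen action from later in the talk.

```python
import numpy as np

class PolicyTree:
    """Root action plus one (t-1)-step subtree per observation (None for a 1-step tree)."""
    def __init__(self, action, subtrees=None):
        self.action, self.subtrees = action, subtrees

def alpha_vector(p, T, O, R, gamma=0.95):
    """V_p(s) = R(s,a(p)) + gamma * sum_s' T(s,a,s') * sum_o O(s',a,o) * V_{o(p)}(s')."""
    a = p.action
    v = R[:, a].astype(float).copy()          # immediate reward R(s, a(p))
    if p.subtrees is not None:
        for o, sub in enumerate(p.subtrees):  # expectation over s' and o
            v += gamma * T[a] @ (O[a, :, o] * alpha_vector(sub, T, O, R, gamma))
    return v

# Tiger-style listen action: identity transition, 0.85/0.15 observations, reward -1
R = np.array([[-1.0, 10.0], [-1.0, -100.0]])
T = np.array([[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]])
O = np.array([[[0.85, 0.15], [0.15, 0.85]], [[0.5, 0.5], [0.5, 0.5]]])
leaf = PolicyTree(0)                   # 1-step tree: alpha is just R(., listen)
two_step = PolicyTree(0, [leaf, leaf]) # listen twice: alpha = -1 + 0.95*(-1) per state
```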

Value Function Over Belief Space
As the exact world state cannot be observed, the agent must compute an expectation over world states when executing policy tree p from belief state b: Vp(b) = Σ_s b(s) Vp(s). If we let α_p = ⟨Vp(s1), …, Vp(sn)⟩, then Vp(b) = b · α_p. To construct an optimal t-step policy, we maximize over all t-step policy trees P: Vt(b) = max_{p∈P} α_p · b. As Vp(b) is linear in b for each p∈P, Vt(b) is the upper surface of those functions; that is, Vt(b) is piecewise linear and convex. Because the agent will never know the exact state of the world, it must be able to determine the value of executing a policy tree p from some belief state b; this is just an expectation over world states of executing p in each state. If we represent the belief state as a vector (the probability of each state), the value of a belief point b for a given policy tree p is simply the dot product of the two vectors. To find an optimal t-step policy, we compute the maximum value by searching the set P of all possible t-step policy trees.
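The upper surface Vt(b) = max_p α_p · b is a one-liner. As a check, plugging in the tiger problem's t=1 α vectors from later in the talk reproduces the fact that listening wins at the uniform belief.

```python
import numpy as np

def value_at(b, alphas):
    """V_t(b): upper surface (max) of the linear functions alpha_p . b."""
    return max(float(a @ b) for a in alphas)

# Tiger t=1 alpha vectors, components ordered (tiger-left, tiger-right):
alphas = [np.array([-100.0, 10.0]),   # open-left
          np.array([-1.0, -1.0]),     # listen
          np.array([10.0, -100.0])]   # open-right
print(value_at(np.array([0.5, 0.5]), alphas))   # -1.0: listening beats -45 for either door
```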

Illustration: Piecewise Linear Value Function
Let Vp1, Vp2 and Vp3 be the value functions induced by policy trees p1, p2, and p3. Each of these value functions is of the form Vp(b) = b · α_p, which is a linear function of b. Thus each value function can be shown as a line, plane, or hyperplane, depending on the number of states, and the optimal t-step value function can, for example in the case of two states (b(s1) = 1 − b(s2)), be illustrated as the upper surface in two dimensions. To understand this piecewise linear property of the value function, let's look at a simple example. Suppose Vp1, Vp2, Vp3 are as drawn.

Picture: Optimal t-step Value Function
(Figure: Vp1, Vp2, Vp3 drawn as lines over the belief axis B(s0) from 0 (S=S1) to 1 (S=S0).) Showing the value function of each policy tree p1, p2, p3 as a line, we see that the belief space is split into regions. In each belief region, the optimal t-step value is obtained by selecting the best policy tree, so the final optimal t-step value function, shown as the red line, is the maximum over these lines.

Optimal t-Step Policy
The optimal t-step policy is determined by projecting the optimal value function back down onto the belief space. The projection partitions the belief space into regions; within each region there is a single policy tree p such that b · α_p is maximal over the entire region, and the optimal action in that region is a(p), the action at the root of p. (Figure: regions A(p1), A(p2), A(p3) beneath the lines Vp1, Vp2, Vp3, over B(s0) from 0 to 1.) So far we have examined the two important properties of the belief MDP: the policy tree and the piecewise linear value function. With these, we obtain the optimal policy, a mapping from belief states to actions. We don't present the detailed value iteration algorithms for solving POMDPs here; you can find them in the paper.

First Step of Value Iteration
One-step policy trees are just actions. To do a DP backup, we evaluate every possible 2-step policy tree. (Figure: the two 1-step trees a0 and a1, and the eight 2-step trees formed by a root action a0 or a1 with a 1-step subtree for each observation O0, O1.) Let's see how the value function is computed for a POMDP using value iteration, with the same basic structure as in the discrete MDP case. One of the simplest algorithms is exhaustive enumeration followed by pruning. The basic idea: from Vt−1, the set of useful (t−1)-step policy trees, we can construct a superset Vt+ of the useful t-step policy trees, since a t-step policy tree consists of a root node with an associated action a and |Ω| subtrees, each a (t−1)-step policy tree.
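The exhaustive backup enumerates |A| · |Vt−1|^|Ω| candidate trees, which for |A| = |Ω| = 2 gives the eight 2-step trees drawn on the slide. A minimal sketch (tree = root action plus an observation-to-subtree mapping):

```python
from itertools import product

def backup_superset(actions, observations, prev_trees):
    """V_t^+: every root action paired with every assignment of
    (t-1)-step trees to observations."""
    return [(a, dict(zip(observations, subs)))
            for a in actions
            for subs in product(prev_trees, repeat=len(observations))]

one_step = ['a0', 'a1']   # 1-step policy trees are just actions
two_step = backup_superset(['a0', 'a1'], ['O0', 'O1'], one_step)
print(len(two_step))      # |A| * |V1|^|Omega| = 2 * 2**2 = 8
```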

Pruning. Some policy trees are dominated and are never useful.
They are pruned and not considered in the next step. The key idea is to prune before evaluating (the Witness and Incremental Pruning algorithms do this). (Figure: Vp1, Vp2, Vp3 over B(s0) from 0 to 1, with one line lying below the others everywhere.) A value function dominated over the whole belief space is never useful. This yields a parsimonious representation of the value function over the whole belief space: given a set of policy trees V, it is possible to define a unique minimal subset of V that represents the same value function, and a policy tree is useful if it is a component of this parsimonious representation. Although the algorithm keeps parsimonious representations of the value functions at each step, there is still much more work to do: even if Vt is very small, it goes through the step of generating Vt+, whose size is always exponential in |Ω|. This paper therefore proposes a novel algorithm, the witness algorithm, which attempts to be more efficient than exhaustively generating Vt+.
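Full pruning to the parsimonious set requires solving linear programs over the belief simplex (which is what Witness and Incremental Pruning address); a cheap first pass, sketched here with made-up vectors, removes only vectors that are pointwise-dominated by another single vector.

```python
import numpy as np

def prune_pointwise(alphas):
    """Keep only vectors not pointwise-dominated by some other vector.
    Weaker than full LP-based pruning: a vector dominated only by a
    *mixture* of others survives this pass."""
    keep = []
    for i, a in enumerate(alphas):
        dominated = any(j != i and np.all(b >= a) and np.any(b > a)
                        for j, b in enumerate(alphas))
        if not dominated:
            keep.append(a)
    return keep

alphas = [np.array([-100.0, 10.0]),
          np.array([-1.0, -1.0]),
          np.array([-2.0, -2.0])]       # dominated by (-1, -1) everywhere
pruned = prune_pointwise(alphas)        # two vectors survive
```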

Value Iteration (Belief MDP)
Keep doing backups until the value function doesn't change much anymore. In the worst case, you end up considering every possible policy tree, but hopefully you will converge before this happens.

A POMDP example: The tiger problem
S0 "tiger-left": Pr(o=TL | S0, listen)=0.85, Pr(o=TR | S0, listen)=0.15. S1 "tiger-right": Pr(o=TL | S1, listen)=0.15, Pr(o=TR | S1, listen)=0.85. Actions = {0: listen, 1: open-left, 2: open-right}. Finally, let's look at an interesting POMDP: the tiger problem, where we can examine the properties of the POMDP policies discussed above. Imagine an agent standing in front of two closed doors. Behind one of the doors is a tiger, and behind the other is a large reward. If the agent opens the door with the tiger, a large penalty of −100 is received; conversely, the agent gets a reward of +10 for opening the correct door. Instead of opening one of the two doors, the agent can listen in order to gain some information about the location of the tiger. Unfortunately, listening is not free: it costs −1. We refer to the state of the world as S0 when the tiger is on the left and S1 when the tiger is on the right. The actions are listen, open the left door, and open the right door. There are only two observations: hearing the tiger on the left (TL) or hearing the tiger on the right (TR). Immediately after the agent opens a door and receives a reward or penalty, the problem resets, randomly relocating the tiger behind one of the two doors. The transition and observation models are described in detail on the next slides. Reward function: penalty for the wrong opening, −100; reward for the correct opening, +10; cost of the listening action, −1.

Tiger Problem (Transition Probabilities)
| Prob. (LISTEN) | Tiger: left | Tiger: right |
|---|---|---|
| Tiger: left | 1.0 | 0.0 |
| Tiger: right | 0.0 | 1.0 |

| Prob. (LEFT) | Tiger: left | Tiger: right |
|---|---|---|
| Tiger: left | 0.5 | 0.5 |
| Tiger: right | 0.5 | 0.5 |

| Prob. (RIGHT) | Tiger: left | Tiger: right |
|---|---|---|
| Tiger: left | 0.5 | 0.5 |
| Tiger: right | 0.5 | 0.5 |

For the transition model T(s, a, s'), we specify the following: the LISTEN action does not change the state of the world, while OPEN-LEFT and OPEN-RIGHT cause a transition to world state S0 with probability 0.5 and to state S1 with probability 0.5 (essentially resetting the problem).

Tiger Problem (Observation Probabilities)
| Prob. (LISTEN) | o: TL | o: TR |
|---|---|---|
| Tiger: left | 0.85 | 0.15 |
| Tiger: right | 0.15 | 0.85 |

| Prob. (LEFT) | o: TL | o: TR |
|---|---|---|
| Tiger: left | 0.5 | 0.5 |
| Tiger: right | 0.5 | 0.5 |

| Prob. (RIGHT) | o: TL | o: TR |
|---|---|---|
| Tiger: left | 0.5 | 0.5 |
| Tiger: right | 0.5 | 0.5 |

For the observation model O(s', a, o): when the world is in state S0, the LISTEN action results in observation TL with probability 0.85 and observation TR with probability 0.15, and conversely for world state S1. No matter what state the world is in, the LEFT and RIGHT actions result in either observation with probability 0.5.

Tiger Problem (Immediate Rewards)
| Reward | Tiger: left | Tiger: right |
|---|---|---|
| LISTEN | -1 | -1 |
| LEFT | -100 | +10 |
| RIGHT | +10 | -100 |
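Plugging the tables above into the belief update: listening is an identity transition, so each TL observation simply reweights the belief by the 0.85/0.15 likelihoods and renormalizes.

```python
import numpy as np

T_listen = np.eye(2)                  # listening does not move the tiger
O_listen = np.array([[0.85, 0.15],    # s0 (tiger-left):  Pr(TL), Pr(TR)
                     [0.15, 0.85]])   # s1 (tiger-right): Pr(TL), Pr(TR)

def update(b, o):
    """Belief update for the listen action; o=0 is TL, o=1 is TR."""
    unnorm = O_listen[:, o] * (b @ T_listen)
    return unnorm / unnorm.sum()

b = np.array([0.5, 0.5])
b = update(b, 0)    # hear TL once  -> [0.85, 0.15]
b = update(b, 0)    # hear TL again -> roughly [0.97, 0.03]
```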

The tiger problem: State tracking
Belief vector over S1 "tiger-left" and S2 "tiger-right". From the current belief state, we listen according to the policy, reach the new state, make an observation, and compute the new belief state from the last belief, the last action, and the observation.

The tiger problem: State tracking
Belief vector S1 "tiger-left" S2 "tiger-right". obs=hear-tiger-left, action=listen.

The tiger problem: State tracking
Belief vector S1 "tiger-left" S2 "tiger-right". obs=growl-left, action=listen.

Tiger Example Optimal Policy t=1
Optimal policy for t=1. α vectors, components ordered (tiger-left, tiger-right): α(open-left)=(−100.0, 10.0); α(listen)=(−1.0, −1.0); α(open-right)=(10.0, −100.0). Belief space partition over B(s0): [0.00, 0.10] open-left, [0.10, 0.90] listen, [0.90, 1.00] open-right. Let us begin with the situation at step t=1, when the agent only gets to make a single decision. If the agent believes with high probability that the tiger is on the left, then the best action is to open the right door; if it believes the tiger is on the right, it should open the left door. But what if the agent is highly uncertain about the tiger's location? The best thing to do is listen. That's because guessing incorrectly incurs a penalty of −100, whereas guessing correctly yields a reward of +10. When the agent's belief has no bias either way, it will guess wrong as often as right, so its expected reward for opening a door is (−100 + 10)/2 = −45, while listening always has value −1, which is still greater than the value of opening a door at random. The one-step optimal policy is shown here: each policy tree is shown as a node, below it is the belief interval over which it dominates, and above it the actual α vector. The intervals form a partition of the belief space.
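The thresholds 0.10 and 0.90 fall straight out of the α vectors: with p = Pr(tiger-left), open-right beats listen when 110p − 100 > −1, i.e. p > 0.9, and symmetrically open-left wins for p < 0.1. A quick numerical check:

```python
import numpy as np

# t=1 alpha vectors, components ordered (tiger-left, tiger-right)
alphas = {'open-left':  np.array([-100.0, 10.0]),
          'listen':     np.array([-1.0, -1.0]),
          'open-right': np.array([10.0, -100.0])}

def best_action(p_left):
    """Greedy action at belief b = (p_left, 1 - p_left)."""
    b = np.array([p_left, 1.0 - p_left])
    return max(alphas, key=lambda name: float(alphas[name] @ b))

print(best_action(0.05), best_action(0.5), best_action(0.95))
# open-left listen open-right
```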

Tiger Example Optimal Policy for t=2
(Figure: the t=2 policy, nodes over belief intervals [0.00, 0.02], [0.02, 0.39], [0.39, 0.61], [0.61, 0.98], [0.98, 1.00]; every root action is listen, with subtrees open-right, listen, or open-left selected by the observation TL/TR.) We now move to the case in which the agent can act for two steps. The policy for t=2 is shown in this figure. There is an interesting property: the agent always chooses to listen first. There is a logical reason for this: if the agent were to open one of the doors at t=2, then, given the way the problem is formulated, the tiger would be randomly placed behind one of the doors and the agent's belief state would reset to (0.5, 0.5). After opening the door, the agent would be left with no information about the tiger's location and one action remaining, and we just saw that with one step to go and b=(0.5, 0.5) the best thing to do is listen. It is therefore a better strategy to listen at t=2 in order to make a more informed decision on the last step. Another interesting feature is that there are multiple policy trees with the same action at the root; these vectors actually partition the belief space into pieces that have structural similarities.

References. Planning and Acting in Partially Observable Stochastic Domains, Leslie P. Kaelbling et al. Optimal Policies for Partially Observable Markov Decision Processes, Anthony R. Cassandra, 1995. Planning and Control in Stochastic Domains with Imperfect Information, Milos Hauskrecht. Hierarchical Methods for Planning under Uncertainty, Joelle Pineau. POMDP tutorial on Tony's POMDP Page. Partial Observability, Dan Bernstein, 2002.

Thank You.