
1 ECE-517: Reinforcement Learning in Artificial Intelligence Lecture 15: Partially Observable Markov Decision Processes (POMDPs) November 5, 2015 Dr. Itamar Arel College of Engineering Electrical Engineering and Computer Science Department The University of Tennessee Fall 2015

2 Outline Why use POMDPs? Formal definition Belief state Value function

3 Partially Observable Markov Decision Problems (POMDPs)
To introduce POMDPs, let us consider an example where an agent learns to drive a car in New York City
The agent can look forward, backward, left or right
It cannot change speed, but it can steer into the lane it is looking at
The different types of observations are:
the direction in which the agent's gaze is directed
the closest object in the agent's gaze
whether the object is looming or receding
the color of the object
whether a horn is sounding
To drive safely, the agent must steer out of its lane to avoid slow cars ahead and fast cars behind

4 POMDP Example The agent is in control of the middle car
The car behind is fast and will not slow down
The car ahead is slower
To avoid a crash, the agent must steer right
However, when the agent is gazing to the right, there is no immediate observation that tells it about the impending crash
The agent therefore needs to learn how its observations can aid its performance

5 POMDP Example (cont.) This is not easy when the agent has no explicit goals beyond “performing well”
There are no explicit training patterns such as “if there is a car ahead and left, steer right”
However, a scalar reward is provided to the agent as a performance indicator (just like in MDPs)
The agent is penalized for colliding with other cars or with the road shoulder
The only goal hard-wired into the agent is that it must maximize a long-term measure of the reward

6 POMDP Example (cont.) Two significant problems make it difficult to learn under these conditions:
Temporal credit assignment – if our agent hits another car and is consequently penalized, how does the agent reason about which sequence of actions should not be repeated, and in what circumstances? This is generally the same as in MDPs
Partial observability – if the agent is about to hit the car ahead of it, and there is a car to the left, then circumstances dictate that the agent should steer right. However, when it looks to the right it has no sensory information about what goes on elsewhere
To solve the latter, the agent needs memory, which creates knowledge of the state of the world around it

7 Forms of Partial Observability
Partial observability coarsely pertains to either:
Lack of important state information in observations – must be compensated for using memory
Extraneous information in observations – the agent needs to learn to ignore it
In our example:
The color of the car in its gaze is extraneous (unless red cars really do drive faster)
The agent needs to build a memory-based model of the world in order to accurately predict what will happen
This creates “belief state” information (we'll see later)
If the agent has access to the complete state, such as a chess-playing machine that can view the entire board:
It can choose optimal actions without memory
The Markov property holds – i.e. the future state of the world is simply a function of the current state and action

8 Modeling the world as a POMDP
Our setting is that of an agent taking actions in a world according to its policy
The agent still receives feedback about its performance through a scalar reward received at each time step
Formally stated, a POMDP consists of:
|S| states S = {1, 2, …, |S|} of the world
|U| actions (or controls) U = {1, 2, …, |U|} available to the policy
|Y| observations Y = {1, 2, …, |Y|}
a (possibly stochastic) reward r(i) for each state i in S
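As a quick illustrative sketch (not part of the lecture), the tuple above can be encoded directly; the class name, field layout and the validity check below are all assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass
class POMDP:
    """Finite POMDP tuple from the slide; names are illustrative."""
    T: list  # T[a][s][s2] = Pr(s2 | s, a), transition probabilities
    O: list  # O[a][s2][o] = Pr(o | s2, a), observation probabilities
    r: list  # r[i] = (possibly expected) reward for state i

    def is_valid(self, tol=1e-9):
        # Every transition and observation row must be a probability distribution.
        rows = [row for per_action in (self.T + self.O) for row in per_action]
        return all(abs(sum(row) - 1.0) < tol and min(row) >= 0.0 for row in rows)

# A tiny two-state, one-action, two-observation example:
model = POMDP(
    T=[[[0.9, 0.1], [0.2, 0.8]]],
    O=[[[0.7, 0.3], [0.4, 0.6]]],
    r=[0.0, 1.0],
)
```

A check like `is_valid` catches a common modeling error: rows of T or O that do not sum to one.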

9 Modeling the world as a POMDP (cont.)

10 MDPs vs. POMDPs In an MDP: one observation for each state
The concepts of observation and state are interchangeable
A memoryless policy that does not make use of internal state suffices
In POMDPs, different states may have similar probability distributions over observations
Different states may look the same to the agent
For this reason, POMDPs are said to have hidden state
Two hallways may look the same to a robot's sensors
Optimal action for the first → take left
Optimal action for the second → take right
A memoryless policy cannot distinguish between the two

11 MDPs vs. POMDPs (cont.) Noise can create ambiguity in state inference
The agent's sensors are always limited in the amount of information they can pick up
One way of overcoming this is to add sensors
Specific sensors that help it to “disambiguate” hallways
Only when possible, affordable or desirable
In general, we're now considering agents that need to be proactive (also called “anticipatory”)
Not only react to environmental stimuli
Self-create context using memory
POMDP problems are harder to solve, but represent realistic scenarios

12 POMDP solution techniques – model based methods
If an exact model of the environment is available, POMDPs can (in theory) be solved, i.e. an optimal policy can be found
As with model-based MDPs, it is not so much a learning problem:
No real “learning”, or trial and error, is taking place
No exploration/exploitation dilemma
Rather, it is a probabilistic planning problem → find the optimal policy
In POMDPs the above is broken into two elements:
Belief state computation, and
Value function computation based on belief states

13 The belief state is a probability distribution over the states
Instead of maintaining the complete action/observation history, we maintain a belief state b
The belief state is a probability distribution over the states, updated after each action and observation
Dim(b) = |S| − 1, since the probabilities must sum to 1
The belief space is the entire probability simplex over the states
We'll use a two-state POMDP as a running example
Probability of being in state one = p → probability of being in state two = 1 − p
Therefore, the entire space of belief states can be represented as a line segment

14 The belief space Here is a representation of the belief space when we have two states (s0,s1)

15 The belief space (cont.)
The belief space is continuous, but we only visit a countable number of belief points
Assumptions:
Finite action set
Finite observation set
Next belief state: b' = f(b, a, o), where b is the current belief state, a the action, and o the observation

16 The Tiger Problem Standing in front of two closed doors
The world is in one of two states: the tiger is behind the left door or the right door
Three actions: open left door, open right door, listen
Listening is not free, and not accurate (it may give wrong information)
Reward:
Open the wrong door and get eaten by the tiger (large negative reward)
Open the correct door and get a prize (small positive reward)

17 Tiger Problem: POMDP Formulation
Two states: SL and SR (the tiger is really behind the left or the right door)
Three actions: LEFT, RIGHT, LISTEN
Transition probabilities:
Listening does not change the tiger's position
Opening either door ends the episode (a “reset”): the tiger is re-placed behind each door with probability 0.5

                       next SL   next SR
Listen:  current SL      1.0       0.0
         current SR      0.0       1.0
Left:    current SL      0.5       0.5
         current SR      0.5       0.5
Right:   current SL      0.5       0.5
         current SR      0.5       0.5

18 Tiger Problem: POMDP Formulation (cont.)
Observations: TL (tiger heard on the left) or TR (tiger heard on the right)
Observation probabilities (given the next state and the action):

                        TL      TR
Listen:  next SL       0.85    0.15
         next SR       0.15    0.85
Left:    next SL       0.5     0.5
         next SR       0.5     0.5
Right:   next SL       0.5     0.5
         next SR       0.5     0.5

Rewards:
R(SL, Listen) = R(SR, Listen) = −1
R(SL, Left) = R(SR, Right) = −100
R(SL, Right) = R(SR, Left) = +10
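The tables above translate directly into code. This is an illustrative sketch (the dictionary layout and variable names are my assumptions, not from the lecture):

```python
# Tiger problem model, encoded from the tables on this and the previous slide.
S = ["SL", "SR"]                  # tiger behind left / right door
A = ["LEFT", "RIGHT", "LISTEN"]   # open left, open right, listen
Z = ["TL", "TR"]                  # hear the tiger on the left / right

# T[a][s][s2] = Pr(s2 | s, a): listening keeps the state,
# opening a door resets the episode (tiger re-placed at random).
T = {
    "LISTEN": {"SL": {"SL": 1.0, "SR": 0.0}, "SR": {"SL": 0.0, "SR": 1.0}},
    "LEFT":   {"SL": {"SL": 0.5, "SR": 0.5}, "SR": {"SL": 0.5, "SR": 0.5}},
    "RIGHT":  {"SL": {"SL": 0.5, "SR": 0.5}, "SR": {"SL": 0.5, "SR": 0.5}},
}

# O[a][s2][o] = Pr(o | s2, a): listening is 85% accurate,
# observations after opening a door are uninformative.
O = {
    "LISTEN": {"SL": {"TL": 0.85, "TR": 0.15}, "SR": {"TL": 0.15, "TR": 0.85}},
    "LEFT":   {"SL": {"TL": 0.5, "TR": 0.5},   "SR": {"TL": 0.5, "TR": 0.5}},
    "RIGHT":  {"SL": {"TL": 0.5, "TR": 0.5},   "SR": {"TL": 0.5, "TR": 0.5}},
}

# R[(s, a)] from the slide.
R = {("SL", "LISTEN"): -1,   ("SR", "LISTEN"): -1,
     ("SL", "LEFT"):  -100,  ("SR", "RIGHT"): -100,
     ("SL", "RIGHT"):  10,   ("SR", "LEFT"):   10}
```

Every row of T and O sums to one, as required of a probability distribution.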

19 POMDP Policy Tree (Fake Policy)
[Policy tree diagram: starting from the belief state P(tiger left) = 0.3, the root action is Listen. Each observation (tiger roar left / tiger roar right) leads to a new belief state, at which the next action is chosen (Listen again, or Open left door), and so on down the tree.]

20 POMDP Policy Tree (cont’)
[Generic policy tree diagram: a root action A1 branches on observations o1, o2, o3 into next actions A2, A3, A4; these branch again on observations o4, o5, … into actions A5, A6, A7, A8.]

21 How many POMDP policies are possible?
[Policy tree diagram with levels annotated by node counts: 1 node at the root, |O| nodes at depth one, |O|^2 at depth two, …]
How many policy trees are there, with |A| actions, |O| observations, and horizon T?
Number of nodes in one tree: N = Σ_{i=0}^{T−1} |O|^i = (|O|^T − 1) / (|O| − 1)
Number of trees: |A|^N
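The counting formula can be checked with a small sketch (the function names are mine):

```python
def num_nodes(n_obs: int, horizon: int) -> int:
    """N = sum_{i=0}^{T-1} |O|^i; equals (|O|^T - 1) / (|O| - 1) for |O| > 1."""
    return sum(n_obs ** i for i in range(horizon))

def num_policy_trees(n_actions: int, n_obs: int, horizon: int) -> int:
    """Each of the N nodes independently picks one of the |A| actions."""
    return n_actions ** num_nodes(n_obs, horizon)
```

For the tiger problem (|A| = 3, |O| = 2) with horizon T = 4, a tree has N = 1 + 2 + 4 + 8 = 15 nodes, giving 3^15 = 14,348,907 possible trees: the policy space explodes very quickly.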

22 Computing Belief States
b'(s') = Pr(s' | o, a, b)
       = Pr(o | s', a, b) Pr(s' | a, b) / Pr(o | a, b)     (Bayes' rule)
       = Pr(o | s', a) Pr(s' | a, b) / Pr(o | a, b)        (the observation depends only on s' and a)
We will not repeat Pr(o | a, b) on the next slide, but assume it is there!
It is treated as a normalizing factor, so that b' sums to 1

23 Computing Belief States: Numerator
Pr(o | s', a) Pr(s' | a, b)
   = O(s', a, o) Pr(s' | a, b)
   = O(s', a, o) Σ_s Pr(s' | a, s) Pr(s | a, b)
   = O(s', a, o) Σ_s Pr(s' | a, s) b(s)        ; since Pr(s | a, b) = Pr(s | b) = b(s)
   = O(s', a, o) Σ_s T(s, a, s') b(s)
(Please work out some of the details at home!)

24 The belief state is updated proportionally to:
Overall formula:
b'(s') = O(s', a, o) Σ_s T(s, a, s') b(s) / Pr(o | a, b)
The belief state is updated proportionally to:
the probability of seeing the current observation given state s', and
the probability of arriving at state s' given the action and our previous belief state b
All of the above quantities are given by the model
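The update takes only a few lines of code. The sketch below re-states the Listen dynamics of the tiger problem so it is self-contained; the function name and dictionary layout are my assumptions:

```python
def update_belief(b, a, o, T, O):
    """b'(s2) = O(s2, a, o) * sum_s T(s, a, s2) * b(s), then normalize."""
    unnorm = {s2: O[a][s2][o] * sum(T[a][s][s2] * b[s] for s in b) for s2 in b}
    z = sum(unnorm.values())  # Pr(o | a, b), the normalizing factor
    return {s2: p / z for s2, p in unnorm.items()}

# Tiger problem, Listen action only (from the earlier slides):
T = {"LISTEN": {"SL": {"SL": 1.0, "SR": 0.0}, "SR": {"SL": 0.0, "SR": 1.0}}}
O = {"LISTEN": {"SL": {"TL": 0.85, "TR": 0.15}, "SR": {"TL": 0.15, "TR": 0.85}}}

b0 = {"SL": 0.5, "SR": 0.5}                   # uniform initial belief
b1 = update_belief(b0, "LISTEN", "TL", T, O)  # hear the tiger on the left
```

Hearing the tiger on the left shifts the belief from [0.5, 0.5] to [0.85, 0.15]; listening again and hearing TL a second time pushes it further, to roughly [0.97, 0.03].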

25 Let’s look at an example:
Belief State (cont.) Let's look at an example:
Consider a robot that is initially completely uncertain about its location
Seeing a door may, as specified by the model, occur in three different locations
Suppose that the robot takes an action and observes a T-junction
It may be that, given the action, only one of the three states could have led to an observation of a T-junction
The agent now knows with certainty which state it is in
The uncertainty does not always disappear as neatly as this

26 Finding an optimal policy
The policy component of a POMDP agent must map the current belief state into an action
It turns out that the belief state is a sufficient statistic (i.e. it is Markovian)
We cannot do better even if we remembered the entire history of observations and actions
We have now transformed the POMDP into an MDP over belief states
Good news: we have ways of solving those (GPI algorithms)
Bad news: the belief state space is continuous!

27 The belief state is a point in a continuous space of |S| − 1 dimensions!
Value function The belief state is the input to the second component of the method: the value function computation
The belief state is a point in a continuous space of |S| − 1 dimensions!
The value function must be defined over this infinite space
Naïve application of dynamic programming techniques → infeasible

28 Value function (cont.)
Let's assume only two states: S1 and S2
The belief state [0.25, 0.75] indicates b(s1) = 0.25, b(s2) = 0.75
With two states, b(s1) is sufficient to specify the belief state: b(s2) = 1 − b(s1)
[Figure: V(b) plotted over the belief line b, from S1 = [1, 0] to S2 = [0, 1], with midpoint [0.5, 0.5]]

29 Piecewise linear and Convex (PWLC)
It turns out that the value function is, or can be accurately approximated by, a piecewise-linear and convex function
Intuition for convexity: being certain of a state yields high value, whereas uncertainty lowers the value
[Figure: a convex, piecewise-linear V(b) over the belief line from S1 = [1, 0] to S2 = [0, 1], lowest near [0.5, 0.5]]

30 Why does PWLC help?
We can directly work with regions (intervals) of the belief space!
Each linear segment corresponds to a vector (a policy), and indicates the right action to take in its region of the space
[Figure: three linear value functions Vp1, Vp2, Vp3 over the belief line from S1 = [1, 0] to S2 = [0, 1]; their upper surface partitions the belief space into region 1, region 2 and region 3]
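This is the usual “alpha-vector” representation of a PWLC value function. The sketch below uses made-up numbers for the vectors, purely to illustrate the max-over-linear-functions idea:

```python
# Each region's linear value function is an alpha-vector; the overall value
# function is their upper surface: V(b) = max_p (alpha_p · b).
def value(b, alphas):
    return max(sum(bi * ai for bi, ai in zip(b, alpha)) for alpha in alphas)

def best_region(b, alphas):
    """Index of the vector (and hence the policy/action) that dominates at b."""
    return max(range(len(alphas)),
               key=lambda i: sum(x * y for x, y in zip(b, alphas[i])))

# Three made-up alpha-vectors over a two-state belief space:
alphas = [(10.0, -2.0), (4.5, 4.5), (-2.0, 10.0)]
```

Near the corners (certainty) the outer vectors dominate; near [0.5, 0.5] (maximum uncertainty) the flat middle vector wins, so `best_region` directly picks the action for the current belief region.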

31 POMDPs → better modeling of realistic scenarios

Summary POMDPs → better modeling of realistic scenarios
They rely on belief states that are derived from observations and actions
A POMDP can be transformed into an MDP over belief states, with a PWLC value function approximation

