
ECE-517: Reinforcement Learning in Artificial Intelligence
Lecture 15: Partially Observable Markov Decision Processes (POMDPs)
Dr. Itamar Arel
College of Engineering, Electrical Engineering and Computer Science Department
The University of Tennessee
Fall 2011 – October 27, 2011

Slide 2: Outline
- Why use POMDPs?
- Formal definition
- Belief state
- Value function

Slide 3: Partially Observable Markov Decision Problems (POMDPs)
To introduce POMDPs, consider an example in which an agent learns to drive a car in New York City.
The agent can look forward, backward, left, or right. It cannot change speed, but it can steer into the lane it is looking at.
The observations consist of:
- the direction in which the agent's gaze is directed
- the closest object in the agent's gaze
- whether that object is looming or receding
- the color of the object
- whether a horn is sounding
To drive safely, the agent must steer out of its lane to avoid slow cars ahead and fast cars behind.

Slide 4: POMDP Example
The agent is in control of the middle car:
- The car behind is fast and will not slow down.
- The car ahead is slower.
To avoid a crash, the agent must steer right. However, while the agent is gazing to the right, there is no immediate observation that tells it about the impending crash.

Slide 5: POMDP Example (cont.)
This is not easy when the agent has no explicit goals beyond "performing well."
There are no explicit training patterns such as "if there is a car ahead and left, steer right." Instead, a scalar reward is provided to the agent as a performance indicator (just as in MDPs):
- The agent is penalized for colliding with other cars or the road shoulder.
- The only goal hard-wired into the agent is that it must maximize a long-term measure of the reward.

Slide 6: POMDP Example (cont.)
Two significant problems make it difficult to learn under these conditions:
- Temporal credit assignment – If the agent hits another car and is consequently penalized, how does it reason about which sequence of actions should not be repeated, and under what circumstances? This is generally the same as in MDPs.
- Partial observability – If the agent is about to hit the car ahead of it, and there is a car to its left, then circumstances dictate that it should steer right. However, when it looks to the right it has no sensory information about what is going on elsewhere.
To solve the latter problem, the agent needs memory, which lets it build up knowledge of the state of the world around it.

Slide 7: Forms of Partial Observability
Partial observability coarsely pertains to either:
- a lack of important state information in observations, which must be compensated for using memory, or
- extraneous information in observations, which the agent must learn to ignore.
In our example:
- The color of the car in the agent's gaze is extraneous (unless red cars really do drive faster).
- The agent needs to build a memory-based model of the world in order to accurately predict what will happen. This creates "belief state" information (covered later).
If the agent has access to the complete state, such as a chess-playing machine that can view the entire board:
- It can choose optimal actions without memory.
- The Markov property holds, i.e., the future state of the world is simply a function of the current state and action.

Slide 8: Modeling the World as a POMDP
Our setting is that of an agent taking actions in a world according to its policy. The agent still receives feedback about its performance through a scalar reward received at each time step.
Formally stated, a POMDP consists of:
- |S| states S = {1, 2, ..., |S|} of the world
- |U| actions (or controls) U = {1, 2, ..., |U|} available to the policy
- |Y| observations Y = {1, 2, ..., |Y|}
- a (possibly stochastic) reward r(i) for each state i in S
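As a concrete sketch (not from the slides), the tuple above might be represented in Python as a small container class; all identifiers here are illustrative, and the transition/observation layouts are one of several reasonable conventions:

```python
import random

class POMDP:
    """A finite POMDP: states S, actions U, observations Y,
    transition probabilities T[s][u][s2], observation
    probabilities Obs[u][s2][y], and per-state rewards r[s]."""

    def __init__(self, n_states, n_actions, n_obs, T, Obs, r):
        self.S = range(n_states)
        self.U = range(n_actions)
        self.Y = range(n_obs)
        self.T = T      # T[s][u][s2] = P(s2 | s, u)
        self.Obs = Obs  # Obs[u][s2][y] = P(y | s2, u)
        self.r = r      # r[s] = reward for landing in state s
        # Sanity check: every probability row must sum to 1.
        for s in self.S:
            for u in self.U:
                assert abs(sum(T[s][u]) - 1.0) < 1e-9
                assert abs(sum(Obs[u][s]) - 1.0) < 1e-9

    def step(self, s, u):
        """Sample (next_state, observation, reward) for action u in state s."""
        s2 = random.choices(list(self.S), weights=self.T[s][u])[0]
        y = random.choices(list(self.Y), weights=self.Obs[u][s2])[0]
        return s2, y, self.r[s2]
```

A two-state deterministic instance, for example, can be stepped with `POMDP(2, 1, 2, T, Obs, [0.0, 1.0]).step(0, 0)`.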

Slide 9: Modeling the World as a POMDP (cont.)
[figure only: the slide's diagram was not captured in the transcript]

Slide 10: MDPs vs. POMDPs
In an MDP there is one observation for each state:
- Observation and state are interchangeable concepts.
- A memoryless policy that makes no use of internal state suffices.
In a POMDP, different states may have similar probability distributions over observations:
- Different states may look the same to the agent; for this reason, POMDPs are said to have hidden state.
- For example, two hallways may look identical to a robot's sensors, yet the optimal action in the first is to take a left, while the optimal action in the second is to take a right. A memoryless policy cannot distinguish between the two.

Slide 11: MDPs vs. POMDPs (cont.)
Noise can create ambiguity in state inference: an agent's sensors are always limited in the amount of information they can pick up.
One way of overcoming this is to add sensors:
- specific sensors that help the robot "disambiguate" the hallways
- but only when this is possible, affordable, or desirable.
In general, we are now considering agents that need to be proactive (also called "anticipatory"):
- They do not merely react to environmental stimuli.
- They self-create context using memory.
POMDP problems are harder to solve, but they represent realistic scenarios.

Slide 12: POMDP Solution Techniques – Model-Based Methods
If an exact model of the environment is available, POMDPs can (in theory) be solved, i.e., an optimal policy can be found.
As with model-based MDPs, this is not so much a learning problem:
- No real "learning" or trial and error takes place.
- There is no exploration/exploitation dilemma.
- Rather, it is a probabilistic planning problem: find the optimal policy.
In POMDPs, this task is broken into two elements:
- belief state computation, and
- value function computation based on belief states.

Slide 13: The Belief State
Instead of maintaining the complete action/observation history, we maintain a belief state b:
- The belief state is a probability distribution over the states, updated with each observation.
- Dim(b) = |S| − 1, since the entries must sum to one.
The belief space is the entire probability simplex over the states.
We will use a two-state POMDP as a running example:
- If the probability of being in state one is p, the probability of being in state two is 1 − p.
- Therefore, the entire space of belief states can be represented as a line segment.

Slide 14: The Belief Space
[figure: the belief space for two states (s0, s1), drawn as a line segment between the beliefs b(s0) = 1 and b(s1) = 1]

Slide 15: The Belief Space (cont.)
The belief space is continuous, but we only ever visit a countable number of belief points, under two assumptions:
- a finite action set, and
- a finite observation set.
The next belief state is b' = f(b, a, o), where b is the current belief state, a is the action, and o is the observation.

Slide 16: The Tiger Problem
- You are standing in front of two closed doors.
- The world is in one of two states: the tiger is behind the left door or behind the right door.
- There are three actions: open the left door, open the right door, or listen.
- Listening is not free, and it is not accurate (you may get wrong information).
- Reward: open the wrong door and get eaten by the tiger (large negative reward); open the correct door and get a prize (small positive reward).

Slide 17: Tiger Problem – POMDP Formulation
Two states: SL and SR (the tiger is really behind the left or the right door).
Three actions: LEFT, RIGHT, LISTEN.
Transition probabilities (rows: current state; columns: next state):

  Listen   SL    SR
  SL      1.0   0.0
  SR      0.0   1.0

  Left     SL    SR
  SL      0.5   0.5
  SR      0.5   0.5

  Right    SL    SR
  SL      0.5   0.5
  SR      0.5   0.5

Listening does not change the tiger's position. Opening either door ends the episode with a "reset," so the tiger is equally likely to be behind each door afterwards.

Slide 18: Tiger Problem – POMDP Formulation (cont.)
Observations: TL (tiger heard on the left) or TR (tiger heard on the right).
Observation probabilities (rows: next state; columns: observation):

  Listen   TL     TR
  SL      0.85   0.15
  SR      0.15   0.85

  Left     TL     TR
  SL      0.5    0.5
  SR      0.5    0.5

  Right    TL     TR
  SL      0.5    0.5
  SR      0.5    0.5

Rewards:
- R(SL, Listen) = R(SR, Listen) = -1
- R(SL, Left) = R(SR, Right) = -100
- R(SL, Right) = R(SR, Left) = +10
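The tables above transcribe directly into code; here is a sketch in Python (the dictionary names and layout are illustrative choices, not part of the slides):

```python
# Tiger POMDP model, transcribed from the tables above.
STATES = ["SL", "SR"]           # tiger behind the left / right door
ACTIONS = ["Left", "Right", "Listen"]
OBSERVATIONS = ["TL", "TR"]     # tiger heard on the left / right

# T[a][s][s2] = P(s2 | s, a): listening keeps the tiger in place;
# opening a door resets the episode (uniform over the two doors).
T = {
    "Listen": {"SL": {"SL": 1.0, "SR": 0.0},
               "SR": {"SL": 0.0, "SR": 1.0}},
    "Left":   {"SL": {"SL": 0.5, "SR": 0.5},
               "SR": {"SL": 0.5, "SR": 0.5}},
    "Right":  {"SL": {"SL": 0.5, "SR": 0.5},
               "SR": {"SL": 0.5, "SR": 0.5}},
}

# O[a][s2][o] = P(o | s2, a): listening is 85% accurate;
# door-opening actions yield uninformative observations.
O = {
    "Listen": {"SL": {"TL": 0.85, "TR": 0.15},
               "SR": {"TL": 0.15, "TR": 0.85}},
    "Left":   {"SL": {"TL": 0.5, "TR": 0.5},
               "SR": {"TL": 0.5, "TR": 0.5}},
    "Right":  {"SL": {"TL": 0.5, "TR": 0.5},
               "SR": {"TL": 0.5, "TR": 0.5}},
}

# R[(s, a)] = immediate reward for taking action a in state s.
R = {("SL", "Listen"): -1,   ("SR", "Listen"): -1,
     ("SL", "Left"): -100,   ("SR", "Right"): -100,
     ("SL", "Right"): 10,    ("SR", "Left"): 10}
```

Encoding the model this way makes the sanity checks trivial: every row of T and O should sum to one.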

Slide 19: POMDP Policy Tree (Fake Policy)
[figure: an example policy tree for the tiger problem. The root action is Listen; edges are labeled with the observations "tiger roar left" and "tiger roar right" and lead to further Listen or Open-left-door nodes. Belief states annotate the nodes: a starting belief (tiger-left probability 0.3) and updated beliefs of 0.6, 0.15, and 0.9 along the branches.]

Slide 20: POMDP Policy Tree (cont.)
[figure: a generic policy tree with actions A1–A8 at the nodes and observations o1–o6 labeling the edges; each action node branches on every possible observation into a further action node.]

Slide 21: How Many POMDP Policies Are Possible?
With |A| actions, |O| observations, and horizon T, a policy tree has one node at depth 0, |O| nodes at depth 1, |O|^2 at depth 2, and so on.
Number of nodes in a tree:

  N = sum_{i=0}^{T-1} |O|^i = (|O|^T - 1) / (|O| - 1)

Number of distinct trees: |A|^N (each node independently picks one of |A| actions).
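The count above is easy to compute; a small sketch (the function name is mine) illustrates how quickly the number of trees explodes:

```python
def num_policy_trees(n_actions, n_obs, horizon):
    """Count distinct policy trees for a finite-horizon POMDP.

    A depth-T tree has N = 1 + |O| + ... + |O|**(T-1) nodes,
    and each node independently chooses one of |A| actions,
    giving |A|**N distinct trees.
    """
    n_nodes = sum(n_obs ** i for i in range(horizon))
    return n_actions ** n_nodes

# Tiger problem: |A| = 3 actions, |O| = 2 observations.
print(num_policy_trees(3, 2, 1))  # horizon 1: 1 node  -> 3 trees
print(num_policy_trees(3, 2, 3))  # horizon 3: 7 nodes -> 3**7 = 2187 trees
```

Already at horizon 4 (15 nodes) there are 3**15 = 14,348,907 candidate trees, which is why naive enumeration is hopeless.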

Slide 22: Belief State
Overall formula (the slide's equation, reconstructed):

  b'(s') = P(o | s', a) * sum_s P(s' | s, a) b(s) / P(o | a, b)

where the denominator P(o | a, b) = sum_{s'} P(o | s', a) sum_s P(s' | s, a) b(s) is a normalizing factor.
The belief state is updated proportionally to:
- the probability of seeing the current observation given state s', and
- the probability of arriving at state s' given the action and our previous belief state b.
All of these quantities are given by the model.
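The update can be written out directly; here is a sketch in Python, demonstrated with the tiger problem's Listen action (identifiers and table layout are illustrative):

```python
def belief_update(b, T, O, a, o):
    """Bayesian belief update b' = f(b, a, o).

    b: dict mapping state -> probability.
    T[a][s][s2] = P(s2 | s, a), O[a][s2][o] = P(o | s2, a),
    as in the model tables.
    """
    unnorm = {}
    for s2 in b:
        # P(o | s2, a) * sum_s P(s2 | s, a) * b(s)
        unnorm[s2] = O[a][s2][o] * sum(T[a][s][s2] * b[s] for s in b)
    z = sum(unnorm.values())  # normalizer P(o | a, b)
    return {s2: p / z for s2, p in unnorm.items()}

# Tiger problem, Listen action (85%-accurate hearing).
T = {"Listen": {"SL": {"SL": 1.0, "SR": 0.0},
                "SR": {"SL": 0.0, "SR": 1.0}}}
O = {"Listen": {"SL": {"TL": 0.85, "TR": 0.15},
                "SR": {"TL": 0.15, "TR": 0.85}}}
b0 = {"SL": 0.5, "SR": 0.5}
b1 = belief_update(b0, T, O, "Listen", "TL")
print(b1["SL"])  # 0.85: one roar on the left shifts the belief to 85/15
```

A second TL observation pushes b(SL) to 0.85^2 / (0.85^2 + 0.15^2) ≈ 0.97, showing how repeated listening sharpens the belief.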

Slide 23: Belief State (cont.)
An example: consider a robot that is initially completely uncertain about its location.
- Seeing a door may, as specified by the model, occur in three different locations.
- Suppose the robot takes an action and observes a T-junction.
- It may be that, given the action, only one of the three states could have led to an observation of a T-junction.
- The agent then knows with certainty which state it is in.
The uncertainty does not disappear this cleanly in all cases.

Slide 24: Finding an Optimal Policy
The policy component of a POMDP agent must map the current belief state into an action.
It turns out that the belief state is a sufficient statistic (i.e., it is Markovian): we can do no better even if we remember the entire history of observations and actions.
We have thus transformed the POMDP into an MDP over belief states:
- Good news: we have ways of solving MDPs (GPI algorithms).
- Bad news: the belief state space is continuous!

Slide 25: Value Function
The belief state is the input to the second component of the method: the value function computation.
The belief state is a point in a continuous space of |S| − 1 dimensions, so the value function must be defined over this infinite space.
Naive application of dynamic programming techniques is therefore infeasible.

Slide 26: Value Function (cont.)
Assume only two states, s1 and s2:
- The belief state [0.25 0.75] indicates b(s1) = 0.25 and b(s2) = 0.75.
- With two states, b(s1) alone suffices to specify the belief state, since b(s2) = 1 − b(s1).
[figure: V(b) plotted over the belief segment from s1 = [1, 0] to s2 = [0, 1], with the midpoint [0.5, 0.5] marked]

Slide 27: Piecewise Linear and Convex (PWLC)
It turns out that the value function is, or can be accurately approximated by, a piecewise linear and convex function.
Intuition for the convexity: being certain of a state yields high value, whereas uncertainty lowers the value.
[figure: a convex, piecewise-linear V(b) over the belief segment from s1 = [1, 0] to s2 = [0, 1]]

Slide 28: Why Does PWLC Help?
We can work directly with regions (intervals) of the belief space.
Each linear piece is represented by a vector; the vectors correspond to policies and indicate the right action to take in each region of the space.
[figure: three value vectors V_p1, V_p2, V_p3 over the belief segment; their upper surface partitions the segment into region 1, region 2, and region 3]
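A minimal sketch of this idea (the vector values and action names below are made up for illustration): each vector's value is linear in the belief, and the value function is the upper surface, i.e., the maximum over the vectors.

```python
def pwlc_value(b, vectors):
    """Evaluate a PWLC value function at belief b.

    vectors: list of (weights, action) pairs. The value of following
    a vector's policy from belief b is the dot product weights . b;
    the value function is the maximum over all vectors.
    Returns (value, action of the maximizing vector).
    """
    best = max(vectors,
               key=lambda va: sum(w * p for w, p in zip(va[0], b)))
    value = sum(w * p for w, p in zip(best[0], b))
    return value, best[1]

# Three illustrative vectors over a two-state belief [b(s1), b(s2)]:
# risky actions pay off only when the belief is sharp, while the
# safe action dominates in the uncertain middle region.
vectors = [([10.0, -100.0], "a1"),      # best when confident of s1
           ([-1.0, -1.0],   "listen"),  # best under uncertainty
           ([-100.0, 10.0], "a2")]      # best when confident of s2
print(pwlc_value([1.0, 0.0], vectors))  # (10.0, 'a1')
print(pwlc_value([0.5, 0.5], vectors))  # (-1.0, 'listen')
```

Reading off the maximizing vector at each belief is exactly how the regions in the figure determine the optimal action.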

Slide 29: Summary
- POMDPs model realistic scenarios more accurately than MDPs.
- They rely on belief states that are derived from observations and actions.
- A POMDP can be transformed into a belief-state MDP, with a PWLC approximation of the value function.
- What if we don't have a model? Next class: (recurrent) neural networks come to the rescue.

