
1 An Introduction to PO-MDP Presented by Alp Sardağ

2 MDP
 Components:
 – State
 – Action
 – Transition
 – Reinforcement
 Problem:
 – Choose the action that makes the right tradeoff between immediate rewards and future gains, to yield the best possible solution.
 Solution:
 – Policy: value function

3 Definition
 Horizon length
 Value Iteration:
 – Temporal Difference Learning: Q(x,a) ← Q(x,a) + α(r + γ max_b Q(y,b) − Q(x,a)), where α is the learning rate and γ the discount rate.
 Adding PO to a CO-MDP is not trivial:
 – Value iteration requires complete observability of the state.
 – PO clouds the current state.
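
As an illustration, a minimal Python sketch of the tabular TD update above; the state/action space sizes, learning rate α, and discount rate γ are assumed values, not taken from the slides.

```python
# Minimal sketch of the TD (Q-learning) update above; sizes and rates are illustrative.
import numpy as np

n_states, n_actions = 4, 2
alpha, gamma = 0.1, 0.95            # learning rate and discount rate (assumed)
Q = np.zeros((n_states, n_actions))

def td_update(x, a, r, y):
    """Q(x,a) <- Q(x,a) + alpha * (r + gamma * max_b Q(y,b) - Q(x,a))."""
    Q[x, a] += alpha * (r + gamma * Q[y].max() - Q[x, a])
```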

4 PO-MDP
 Components:
 – States
 – Actions
 – Transitions
 – Reinforcement
 – Observations

5 Mapping in CO-MDP & PO-MDP
 In CO-MDPs, the mapping is from states to actions.
 In PO-MDPs, the mapping is from probability distributions (over states) to actions.

6 VI in CO-MDP & PO-MDP
 In a CO-MDP:
 – Track our current state.
 – Update it after each action.
 In a PO-MDP:
 – Maintain a probability distribution over states.
 – Perform an action and make an observation, then update the distribution.

7 Belief State and Space
 Belief State: a probability distribution over states.
 Belief Space: the entire probability space.
 Example:
 – Assume a two-state PO-MDP.
 – P(s1) = p and P(s2) = 1 − p, so the belief space is a line segment.
 – The line becomes a hyperplane in higher dimensions.

8 Belief Transform
 Assumptions:
 – Finite set of actions
 – Finite set of observations
 – Next belief state b' = T(b, a, o), where b is the current belief state, a the action, and o the observation
 There is a finite number of possible next belief states.
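
A minimal sketch of the belief transform T(b, a, o), assuming the standard Bayes update; the arrays trans[a, s, s'] and obs[a, s', o] are hypothetical placeholders for the PO-MDP's transition and observation probabilities.

```python
# Minimal sketch of b' = T(b, a, o); trans and obs are illustrative arrays.
import numpy as np

def belief_update(b, a, o, trans, obs):
    """b'(s') is proportional to obs[a, s', o] * sum_s trans[a, s, s'] * b(s)."""
    b_next = obs[a, :, o] * (trans[a].T @ b)   # unnormalized next belief
    return b_next / b_next.sum()               # normalize over states
```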

9 PO-MDP into Continuous CO-MDP
 The process is Markovian; the next belief state depends only on:
 – Current belief state
 – Current action
 – Observation
 A discrete PO-MDP problem can therefore be converted into a continuous-space CO-MDP problem, where the continuous space is the belief space.

10 Problem
 Using VI in a continuous state space.
 There is no nice tabular representation as before.

11 PWLC
 Restrictions on the form of the solutions to the continuous-space CO-MDP:
 – The finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length.
 – The value of a belief point is simply the dot product of two vectors: the belief and the vector of one linear segment.
 GOAL: for each iteration of value iteration, find a finite number of linear segments (vectors) that make up the value function.

12 Steps in VI
 Represent the value function for each horizon as a set of vectors.
 – This overcomes the problem of representing a value function over a continuous space.
 To evaluate a belief state, find the vector that has the largest dot product with it.
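
A minimal sketch of a PWLC value function stored as a set of vectors and evaluated at a belief point; the two vectors below are illustrative.

```python
# Minimal sketch: value function as a set of vectors, evaluated by max dot product.
import numpy as np

alphas = np.array([[1.0, 0.0],     # vector for one linear segment (illustrative)
                   [0.0, 1.5]])    # vector for another linear segment (illustrative)

def value(b, alphas):
    """V(b) = max over vectors of the dot product vector . b."""
    return float(np.max(alphas @ b))

def best_vector(b, alphas):
    """Index of the maximizing vector (it determines the action to take)."""
    return int(np.argmax(alphas @ b))
```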

13 PO-MDP Value Iteration Example
 Assumptions:
 – Two states
 – Two actions
 – Three observations
 Example: horizon length is 1, b = [0.25 0.75].
 Immediate rewards:
        s1    s2
   a1    1    0
   a2    0    1.5
 V(a1, b) = 0.25×1 + 0.75×0 = 0.25
 V(a2, b) = 0.25×0 + 0.75×1.5 = 1.125
 For this belief a2 is the best action; for beliefs weighted sufficiently toward s1, a1 is the best, which partitions the belief space.
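
A quick check of this horizon-1 computation, using the rewards from the slide.

```python
# Reproducing the horizon-1 computation above with the slide's rewards.
import numpy as np

b = np.array([0.25, 0.75])
R = {"a1": np.array([1.0, 0.0]),   # immediate rewards for (s1, s2) under a1
     "a2": np.array([0.0, 1.5])}   # immediate rewards for (s1, s2) under a2

V = {a: float(r @ b) for a, r in R.items()}
print(V)   # {'a1': 0.25, 'a2': 1.125} -> a2 is best for this belief
```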

14 PO-MDP Value Iteration Example
 The value of a belief state for horizon length 2, given b, a1, z1:
 – The immediate reward plus the value of the next action.
 – Find the best achievable value for the belief state that results from the initial belief state b when we perform action a1 and observe z1.

15
 Find the value for all belief points, given this fixed action and observation.
 The transformed value function is also PWLC.
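
A minimal sketch of how each vector can be transformed for a fixed action and observation (the S(a, z) functions discussed here); trans, obs, and gamma are hypothetical placeholders, not from the slides.

```python
# Minimal sketch: transform every vector for a fixed action a and observation z.
import numpy as np

def transform_vectors(alphas, a, z, trans, obs, gamma=1.0):
    """alpha'(s) = gamma * sum_s' trans[a, s, s'] * obs[a, s', z] * alpha(s').
    The transform is linear in each vector, so S(a, z) stays PWLC."""
    return gamma * ((trans[a] * obs[a, :, z]) @ alphas.T).T   # shape (n_vectors, |S|)
```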

16 PO-MDP Value Iteration Example
 How do we compute the value of a belief state given only the action?
 The horizon-2 value of the belief state, given:
 – The value for each observation: z1: 0.7, z2: 0.8, z3: 1.2
 – P(z1 | b, a1) = 0.6; P(z2 | b, a1) = 0.25; P(z3 | b, a1) = 0.15
 is the expected value over observations: 0.6×0.7 + 0.25×0.8 + 0.15×1.2 = 0.80
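
The same weighted sum, as a quick check.

```python
# Reproducing the weighted sum over observations above.
p = {"z1": 0.6, "z2": 0.25, "z3": 0.15}   # P(z | b, a1)
v = {"z1": 0.7, "z2": 0.8, "z3": 1.2}     # value after each observation
expected = sum(p[z] * v[z] for z in p)
print(round(expected, 3))                 # 0.8
```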

17 Transformed Value Functions  Each of these transformed functions partitions the belief space differently.  Best next action to perform depends upon the initial belief state and observation.

18 Best Value for Belief States
 The value of every single belief point is the sum of:
 – The immediate reward.
 – The line segments from the S() functions for each observation's future strategy.
 Since adding lines gives lines, the result is again linear.
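
A minimal sketch of assembling one horizon-2 vector for a fixed action from the immediate-reward vector and the chosen per-observation vectors; the inputs are hypothetical placeholders.

```python
# Minimal sketch: combine immediate reward with one chosen S(a, z) vector per observation.
import numpy as np

def combine(reward_vec, chosen_per_obs):
    """alpha(s) = R(s, a) + sum_z chosen_per_obs[z](s); a sum of lines is a line."""
    return reward_vec + np.sum(chosen_per_obs, axis=0)   # chosen_per_obs: (n_obs, |S|)
```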

19 Best Strategy for Any Belief Point
 All the useful future strategies are easy to pick out:

20 Value Function and Partition
 For the specific action a1, the value function and corresponding partitions:

21 Value Function and Partition
 For the specific action a2, the value function and corresponding partitions:

22 Which Action to Choose?
 Put the value functions for each action together to see where each action gives the highest value.

23 Compact Horizon 2 Value Function

24 Value Function for Action a1 with a Horizon of 3

25 Value Function for Action a2 with a Horizon of 3

26 Value Function for Both Actions with a Horizon of 3

27 Value Function for Horizon of 3

