
1 An Introduction to PO-MDP Presented by Alp Sardağ

2 MDP
 Components:
 – State
 – Action
 – Transition
 – Reinforcement
 Problem:
 – Choose the action that makes the right tradeoff between immediate rewards and future gains, to yield the best possible solution.
 Solution:
 – Policy: value function

3 Definition
 Horizon length
 Value Iteration:
 – Temporal Difference Learning: Q(x,a) ← Q(x,a) + α(r + γ max_b Q(y,b) − Q(x,a)), where α is the learning rate and γ the discount rate.
 Adding PO to a CO-MDP is not trivial:
 – Value iteration requires complete observability of the state.
 – PO clouds the current state.
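
As an illustration, a minimal Python sketch of the tabular TD update above; the state/action space sizes, learning rate α, and discount rate γ are assumed values, not taken from the slides.

```python
# Minimal sketch of the TD (Q-learning) update above; sizes and rates are illustrative.
import numpy as np

n_states, n_actions = 4, 2
alpha, gamma = 0.1, 0.95            # learning rate and discount rate (assumed)
Q = np.zeros((n_states, n_actions))

def td_update(x, a, r, y):
    """Q(x,a) <- Q(x,a) + alpha * (r + gamma * max_b Q(y,b) - Q(x,a))."""
    Q[x, a] += alpha * (r + gamma * Q[y].max() - Q[x, a])
```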

4 PO-MDP
 Components:
 – States
 – Actions
 – Transitions
 – Reinforcement
 – Observations

5 Mapping in CO-MDP & PO-MDP
 In CO-MDPs, the mapping is from states to actions.
 In PO-MDPs, the mapping is from probability distributions (over states) to actions.

6 VI in CO-MDP & PO-MDP
 In a CO-MDP:
 – Track our current state.
 – Update it after each action.
 In a PO-MDP:
 – Maintain a probability distribution over states.
 – Perform an action and make an observation, then update the distribution.

7 Belief State and Space
 Belief State: a probability distribution over states.
 Belief Space: the entire probability space.
 Example:
 – Assume a two-state PO-MDP.
 – P(s1) = p and P(s2) = 1 − p, so the belief space is a line segment.
 – The line becomes a hyperplane in higher dimensions.

8 Belief Transform
 Assumptions:
 – Finite set of actions
 – Finite set of observations
 – Next belief state b' = T(b, a, o), where b is the current belief state, a the action, and o the observation
 There is a finite number of possible next belief states.
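
A minimal sketch of the belief transform T(b, a, o), assuming the standard Bayes update; the arrays trans[a, s, s'] and obs[a, s', o] are hypothetical placeholders for the PO-MDP's transition and observation probabilities.

```python
# Minimal sketch of b' = T(b, a, o); trans and obs are illustrative arrays.
import numpy as np

def belief_update(b, a, o, trans, obs):
    """b'(s') is proportional to obs[a, s', o] * sum_s trans[a, s, s'] * b(s)."""
    b_next = obs[a, :, o] * (trans[a].T @ b)   # unnormalized next belief
    return b_next / b_next.sum()               # normalize over states
```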

9 PO-MDP into Continuous CO-MDP
 The process is Markovian; the next belief state depends only on:
 – Current belief state
 – Current action
 – Observation
 A discrete PO-MDP problem can therefore be converted into a continuous-space CO-MDP problem, where the continuous space is the belief space.

10 Problem
 Using VI in a continuous state space.
 There is no nice tabular representation as before.

11 PWLC
 Restrictions on the form of the solutions to the continuous-space CO-MDP:
 – The finite-horizon value function is piecewise linear and convex (PWLC) for every horizon length.
 – The value of a belief point is simply the dot product of two vectors: the belief and the vector of one linear segment.
 GOAL: for each iteration of value iteration, find a finite number of linear segments (vectors) that make up the value function.

12 Steps in VI
 Represent the value function for each horizon as a set of vectors.
 – This overcomes the problem of representing a value function over a continuous space.
 To evaluate a belief state, find the vector that has the largest dot product with it.
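
A minimal sketch of a PWLC value function stored as a set of vectors and evaluated at a belief point; the two vectors below are illustrative.

```python
# Minimal sketch: value function as a set of vectors, evaluated by max dot product.
import numpy as np

alphas = np.array([[1.0, 0.0],     # vector for one linear segment (illustrative)
                   [0.0, 1.5]])    # vector for another linear segment (illustrative)

def value(b, alphas):
    """V(b) = max over vectors of the dot product vector . b."""
    return float(np.max(alphas @ b))

def best_vector(b, alphas):
    """Index of the maximizing vector (it determines the action to take)."""
    return int(np.argmax(alphas @ b))
```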

13 PO-MDP Value Iteration Example
 Assumptions:
 – Two states
 – Two actions
 – Three observations
 Example: horizon length is 1, b = [0.25 0.75].
 Immediate rewards:
        s1    s2
   a1    1    0
   a2    0    1.5
 V(a1, b) = 0.25×1 + 0.75×0 = 0.25
 V(a2, b) = 0.25×0 + 0.75×1.5 = 1.125
 For this belief a2 is the best action; for beliefs weighted sufficiently toward s1, a1 is the best, which partitions the belief space.
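
A quick check of this horizon-1 computation, using the rewards from the slide.

```python
# Reproducing the horizon-1 computation above with the slide's rewards.
import numpy as np

b = np.array([0.25, 0.75])
R = {"a1": np.array([1.0, 0.0]),   # immediate rewards for (s1, s2) under a1
     "a2": np.array([0.0, 1.5])}   # immediate rewards for (s1, s2) under a2

V = {a: float(r @ b) for a, r in R.items()}
print(V)   # {'a1': 0.25, 'a2': 1.125} -> a2 is best for this belief
```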

14 PO-MDP Value Iteration Example
 The value of a belief state for horizon length 2, given b, a1, z1:
 – The immediate reward plus the value of the next action.
 – Find the best achievable value for the belief state that results from the initial belief state b when we perform action a1 and observe z1.

15
 Find the value for all belief points, given this fixed action and observation.
 The transformed value function is also PWLC.
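
A minimal sketch of how each vector can be transformed for a fixed action and observation (the S(a, z) functions discussed here); trans, obs, and gamma are hypothetical placeholders, not from the slides.

```python
# Minimal sketch: transform every vector for a fixed action a and observation z.
import numpy as np

def transform_vectors(alphas, a, z, trans, obs, gamma=1.0):
    """alpha'(s) = gamma * sum_s' trans[a, s, s'] * obs[a, s', z] * alpha(s').
    The transform is linear in each vector, so S(a, z) stays PWLC."""
    return gamma * ((trans[a] * obs[a, :, z]) @ alphas.T).T   # shape (n_vectors, |S|)
```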

16 PO-MDP Value Iteration Example
 How do we compute the value of a belief state given only the action?
 The horizon-2 value of the belief state, given:
 – The value for each observation: z1: 0.7, z2: 0.8, z3: 1.2
 – P(z1 | b, a1) = 0.6; P(z2 | b, a1) = 0.25; P(z3 | b, a1) = 0.15
 is the expected value over observations: 0.6×0.7 + 0.25×0.8 + 0.15×1.2 = 0.80
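
The same weighted sum, as a quick check.

```python
# Reproducing the weighted sum over observations above.
p = {"z1": 0.6, "z2": 0.25, "z3": 0.15}   # P(z | b, a1)
v = {"z1": 0.7, "z2": 0.8, "z3": 1.2}     # value after each observation
expected = sum(p[z] * v[z] for z in p)
print(round(expected, 3))                 # 0.8
```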

17 Transformed Value Functions  Each of these transformed functions partitions the belief space differently.  Best next action to perform depends upon the initial belief state and observation.

18 Best Value for Belief States
 The value of every single belief point is the sum of:
 – The immediate reward.
 – The line segments from the S() functions for each observation's future strategy.
 Since adding lines gives lines, the result is again linear.
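
A minimal sketch of assembling one horizon-2 vector for a fixed action from the immediate-reward vector and the chosen per-observation vectors; the inputs are hypothetical placeholders.

```python
# Minimal sketch: combine immediate reward with one chosen S(a, z) vector per observation.
import numpy as np

def combine(reward_vec, chosen_per_obs):
    """alpha(s) = R(s, a) + sum_z chosen_per_obs[z](s); a sum of lines is a line."""
    return reward_vec + np.sum(chosen_per_obs, axis=0)   # chosen_per_obs: (n_obs, |S|)
```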

19 Best Strategy for Any Belief Point
 All the useful future strategies are easy to pick out:

20 Value Function and Partition
 For the specific action a1, the value function and corresponding partitions:

21 Value Function and Partition
 For the specific action a2, the value function and corresponding partitions:

22 Which Action to Choose?
 Put the value functions for each action together to see where each action gives the highest value.

23 Compact Horizon 2 Value Function

24 Value Function for Action a1 with a Horizon of 3

25 Value Function for Action a2 with a Horizon of 3

26 Value Function for Both Actions with a Horizon of 3

27 Value Function for Horizon of 3

