1 Background Material: Markov Decision Process

2 Reference: class notes. Further reading: Dynamic Programming and Optimal Control, D. Bertsekas, Volume 1, Chapters 1, 4, 5, 6, 7.

3 Discrete Time Framework
x_k: system state, belonging to a set S_k
u_k: control action, belonging to a set U(x_k) ⊆ C_k
w_k: random disturbance, characterized by a probability distribution P_k(· | x_k, u_k) that may depend on x_k and u_k but not on the values of the prior disturbances w_0, …, w_{k-1}
x_{k+1} = f_k(x_k, u_k, w_k)
N: number of times control is applied
g_k(x_k, u_k, w_k): cost in slot k
g_N(x_N): terminal cost

4 Finite Horizon Objective
Choose the controls so that the additive expected cost over N time slots is minimized, that is, minimize E{g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, u_k, w_k)}.
Control strategy: π = {μ_0, …, μ_{N-1}}, where u_k = μ_k(x_k).
Cost associated with strategy π and initial state x_0: J_π(x_0) = E_w{g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, μ_k(x_k), w_k)}.
Choose π such that J_π(x_0) is minimized for all initial states x_0.
Optimal controls need only be a function of the current state (history independence).

5 Types of Control
Open loop: cannot change in response to the system state. Optimal when the disturbance is a deterministic function of the state and the control.
Closed loop: can change in response to the system state.

6 Illustrating Example: Inventory Control
x_k: stock available at the beginning of the kth period; S_k is the set of integers
u_k: stock ordered at the beginning of the kth period; U(x_k) = C_k is the set of nonnegative integers
w_k: demand during the kth period, characterized by a probability distribution P_k(w_k), with w_0, …, w_{N-1} independent
x_{k+1} = x_k + u_k − w_k (negative stock represents backlogged demand)
N: time horizon of the optimization

7 g_k(x_k, u_k, w_k): the cost in slot k consists of two components, a penalty r(x_k) for storage and unfulfilled demand and an ordering cost c per unit ordered:
g_k(x_k, u_k, w_k) = c·u_k + r(x_k)
g_N(x_N) = r(x_N): terminal cost for being left with inventory x_N
Example control action (threshold type): u_k = σ_k − x_k if x_k < σ_k, and u_k = 0 otherwise.
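To make the setup concrete, here is a minimal simulation sketch of the threshold policy. The cost c, the penalty r, the Poisson demand, and the thresholds σ_k are all illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def threshold_policy(x, sigma):
    # Order up to the level sigma whenever the stock x falls below it
    return max(sigma - x, 0)

def run_inventory(x0, sigmas, c=1.0, demand_rate=3, N=10):
    """One sample path of x_{k+1} = x_k + u_k - w_k with stage cost c*u_k + r(x_k)."""
    r = lambda x: 2.0 * abs(x)          # assumed storage/backlog penalty, for illustration
    x, total = x0, 0.0
    for k in range(N):
        u = threshold_policy(x, sigmas[k])
        w = rng.poisson(demand_rate)    # assumed Poisson demand, for illustration
        total += c * u + r(x)           # ordering cost plus holding/backlog penalty
        x = x + u - w                   # negative stock = backlogged demand
    return total + r(x)                 # terminal cost g_N(x_N) = r(x_N)

print(run_inventory(x0=5, sigmas=[5] * 10))
```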

8 Bellman's Principle of Optimality
Let the optimal strategy be π* = {μ_0*, …, μ_{N-1}*}. Assume that a given state x occurs with positive probability at time j. If the system is in state x in slot j, then the truncated control sequence {μ_j*, …, μ_{N-1}*} minimizes the cost to go from slot j to N, that is, it minimizes E_w{g_N(x_N) + Σ_{k=j}^{N-1} g_k(x_k, u_k, w_k)}.

9 Dynamic Programming Algorithm
The optimal cost is given by the following iteration, which proceeds backwards:
J_N(x_N) = g_N(x_N)
J_k(x_k) = min_{u_k ∈ U(x_k)} E_w{g_k(x_k, u_k, w_k) + J_{k+1}(x_{k+1})}
         = min_{u_k ∈ U(x_k)} E_w{g_k(x_k, u_k, w_k) + J_{k+1}(f_k(x_k, u_k, w_k))}
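A minimal sketch of this backward recursion for finite state and control sets, assuming the problem data are supplied as plain callables (all names are illustrative, and f must map back into the given state set):

```python
def dp_solve(states, controls, f, g, g_N, w_dist, N):
    """Backward DP: J_N = g_N; J_k(x) = min_u E_w[g_k(x,u,w) + J_{k+1}(f(x,u,w))].

    states   : iterable of states (finite)
    controls : controls(x) -> iterable of feasible controls U(x)
    f        : f(x, u, w) -> next state
    g        : g(k, x, u, w) -> stage cost
    g_N      : g_N(x) -> terminal cost
    w_dist   : w_dist(x, u) -> iterable of (w, probability) pairs
    """
    J = {x: g_N(x) for x in states}
    policy = [None] * N
    for k in reversed(range(N)):
        J_k, mu_k = {}, {}
        for x in states:
            q = {u: sum(p * (g(k, x, u, w) + J[f(x, u, w)])
                        for w, p in w_dist(x, u))
                 for u in controls(x)}
            mu_k[x] = min(q, key=q.get)   # minimizing control at stage k
            J_k[x] = q[mu_k[x]]
        J, policy[k] = J_k, mu_k
    return J, policy                      # J is J_0; policy[k] is mu_k
```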

10 Optimizing a Chess Match Strategy
A player plays against an opponent who does not change his actions in accordance with the current state. They play N games; if the score is tied at the end, the players go to sudden death, where they play until one is ahead of the other. A draw fetches 0 points for both; a win fetches 1 point for the winner and 0 for the loser.

11 The player can play timid, in which case he draws a game with probability p_d and loses with probability 1 − p_d. The player can play bold, in which case he wins a game with probability p_w and loses with probability 1 − p_w. Optimal strategy in sudden death? Play bold.

12 Optimal Strategy in the Initial N Games
x_k: difference between the score of the player and that of his opponent; S_k is the set of integers between −k and k
u_k: timid (0) or bold (1); U(x_k) = {0, 1}
w_k: outcome of game k (score increment), with probability distribution {p_d, 1 − p_d} for timid play and {p_w, 1 − p_w} for bold play
x_{k+1} = x_k + w_k
N: time horizon of the optimization

13 Consider maximization of reward instead of minimization of cost.
g_N(x_N) = 0 if x_N < 0, p_w if x_N = 0, and 1 if x_N > 0
(the terminal reward is the probability of winning the match given the final score difference; when the score is tied, the match is won in sudden death with probability p_w).
g_k(x_k, u_k, w_k) = 0 for k < N: all reward is collected at the terminal stage.

14 J_N(x_N) = g_N(x_N)
J_k(x_k) = max_{u ∈ U(x_k)} E_w{J_{k+1}(x_{k+1})}
         = max{p_d J_{k+1}(x_k) + (1 − p_d) J_{k+1}(x_k − 1), p_w J_{k+1}(x_k + 1) + (1 − p_w) J_{k+1}(x_k − 1)}
Let's work it out!
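A sketch of this recursion, with p_d, p_w, and N chosen arbitrarily for illustration:

```python
from functools import lru_cache

p_d, p_w, N = 0.9, 0.45, 2   # assumed probabilities and horizon, for illustration

def g_N(x):
    # Terminal reward: probability of winning, with sudden death when tied
    return 1.0 if x > 0 else (p_w if x == 0 else 0.0)

@lru_cache(maxsize=None)
def J(k, x):
    if k == N:
        return g_N(x)
    timid = p_d * J(k + 1, x) + (1 - p_d) * J(k + 1, x - 1)
    bold = p_w * J(k + 1, x + 1) + (1 - p_w) * J(k + 1, x - 1)
    return max(timid, bold)

print(J(0, 0))  # probability of winning the match under optimal play
```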

15 State Augmentation
What if the next state depends not only on the preceding state and control, but also on earlier states and controls?
x_{k+1} = f_k(x_k, x_{k-1}, u_k, u_{k-1}, w_k), with x_1 = f_0(x_0, u_0, w_0)
Define y_k = x_{k-1} and s_k = u_{k-1}. The augmented state is (x_k, y_k, s_k), with dynamics
x_{k+1} = f_k(x_k, y_k, s_k, u_k, w_k), y_{k+1} = x_k, s_{k+1} = u_k
A time lag in the cost can be handled similarly.

16 Correlated Disturbances
What if w_0, …, w_{N-1} are not independent? Let w_j depend on w_{j-1}. Augment the state with y_k = w_{k-1}: the state is (x_k, y_k), with
x_{k+1} = f_k(x_k, u_k, w_k), y_{k+1} = w_k

17 Linear Systems and Quadratic Cost
x_{k+1} = A_k x_k + B_k u_k + w_k
g_N(x_N) = x_N^T Q_N x_N, g_k(x_k, u_k) = x_k^T Q_k x_k + u_k^T R_k u_k
The optimal policy is linear: μ_k(x_k) = L_k x_k, where
L_k = −(B_k^T K_{k+1} B_k + R_k)^{-1} B_k^T K_{k+1} A_k
K_N = Q_N
K_k = A_k^T (K_{k+1} − K_{k+1} B_k (B_k^T K_{k+1} B_k + R_k)^{-1} B_k^T K_{k+1}) A_k + Q_k

18 The optimal cost is J*(x_0) = x_0^T K_0 x_0 + Σ_{k=0}^{N-1} E(w_k^T K_{k+1} w_k).
Let A_k = A, B_k = B, R_k = R, Q_k = Q. Then as the number of stages to go N − k grows, K_k converges to the steady-state solution of the algebraic Riccati equation
K = A^T (K − K B (B^T K B + R)^{-1} B^T K) A + Q
and the steady-state policy is μ(x) = L x with L = −(B^T K B + R)^{-1} B^T K A.
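A sketch of the Riccati recursion iterated to its fixed point; the matrices A, B, Q, R below are arbitrary illustrative choices:

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])   # assumed system matrices, for illustration
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

K = Q.copy()                              # K_N = Q_N
for _ in range(500):                      # iterate the Riccati recursion backwards
    M = B.T @ K @ B + R
    K = A.T @ (K - K @ B @ np.linalg.inv(M) @ B.T @ K) @ A + Q

L = -np.linalg.inv(B.T @ K @ B + R) @ B.T @ K @ A   # steady-state gain, mu(x) = L x
print(L)
```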

19 Optimal Stopping Problem
One of the control actions allows the system to stop in any slot. The decision maker can terminate the system at a certain loss or choose to continue at a certain cost. The challenge is to decide when to stop so as to minimize the total cost.

20 Asset Selling Problem
A person has an asset for which he receives quotes w_0, …, w_{N-1}, one per slot. Quotes are independent from slot to slot. If the person accepts an offer, he can invest the proceeds at a fixed rate of interest r > 0. The control action is to sell or not to sell. The state is the offer of the previous slot if the asset has not been sold yet, or T if it has been sold:
x_{k+1} = T if the asset was sold in a previous slot, and x_{k+1} = w_k otherwise.

21 Reward:
g_N(x_N) = x_N if x_N ≠ T, and 0 otherwise
g_k(x_k, u_k, w_k) = (1 + r)^{N-k} x_k if x_k ≠ T and the decision is to sell, and 0 otherwise
J_N(x_N) = x_N if x_N ≠ T, and 0 otherwise
J_k(x_k) = max{(1 + r)^{N-k} x_k, E J_{k+1}(x_{k+1})} if x_k ≠ T, and 0 if x_k = T

22 Let α_k = E J_{k+1}(w_k) / (1 + r)^{N-k}.
Optimal strategy: accept the offer if x_k > α_k; reject the offer if x_k < α_k; act either way otherwise.
To show that α_k is a non-increasing function of k, we show by induction that J_k(x) / (1 + r)^{N-k} is non-increasing in k for every x ≠ T.
Base case: J_N(x) = x, and J_{N-1}(x) / (1 + r) = max{x, E J_N(w) / (1 + r)} ≥ x = J_N(x), so the base case holds.

23 J_k(x) / (1 + r)^{N-k} = max{x, E J_{k+1}(w) / (1 + r)^{N-k}}
J_{k+1}(x) / (1 + r)^{N-k-1} = max{x, E J_{k+2}(w) / (1 + r)^{N-k-1}}
By the induction hypothesis, E J_{k+1}(w) / (1 + r)^{N-k} ≥ E J_{k+2}(w) / (1 + r)^{N-k-1}.
The result follows.

24 Iterative Computation of the Threshold
Let V_k(x_k) = J_k(x_k) / (1 + r)^{N-k}. Then
V_N(x_N) = x_N if x_N ≠ T, and 0 otherwise
V_k(x_k) = max{x_k, (1 + r)^{-1} E V_{k+1}(w)}
Let α_k = E V_{k+1}(w) / (1 + r); then V_k(x_k) = max(x_k, α_k).

25 α_k = E V_{k+1}(w) / (1 + r) = E max(w, α_{k+1}) / (1 + r)
     = (∫_0^{α_{k+1}} α_{k+1} dP(w) + ∫_{α_{k+1}}^∞ w dP(w)) / (1 + r)
where P is the cumulative distribution function of w. Both integrals are bounded, and the α_k form a non-increasing sequence in k, so as the horizon N − k grows the thresholds converge to the ᾱ satisfying
ᾱ = (∫_0^{ᾱ} ᾱ dP(w) + ∫_{ᾱ}^∞ w dP(w)) / (1 + r)
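As a sketch, for offers assumed uniform on [0, 1] the integrals have a closed form and the threshold recursion can be iterated to its fixed point; the rate r and the offer distribution are illustrative assumptions:

```python
r = 0.05                                   # assumed interest rate, for illustration

def next_alpha(a):
    # alpha_k = E[max(w, alpha_{k+1})]/(1+r); for w ~ Uniform[0, 1],
    # E[max(w, a)] = a*a + (1 - a*a)/2 = (1 + a*a)/2
    return (1.0 + a * a) / (2.0 * (1.0 + r))

alpha = 0.5 / (1.0 + r)                    # alpha_{N-1} = E[w]/(1+r)
for _ in range(100):                       # iterate backwards in k
    alpha = next_alpha(alpha)
print(alpha)                               # approximates the limiting threshold
```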

26 General Stopping Problem
The decision maker can terminate the system in slot k at a cost t(x_k); the terminal cost is t(x_N).
J_N(x_N) = t(x_N)
J_k(x_k) = min{t(x_k), min_{u ∈ U(x_k)} E{g(x_k, u, w) + J_{k+1}(f(x_k, u, w))}}
It is optimal to stop at time k for the states x in the set
T_k = {x : t(x) ≤ min_{u ∈ U(x)} E{g(x, u, w) + J_{k+1}(f(x, u, w))}}

27 We show by induction that J_k(x) is non-decreasing in k, from which it follows that T_0 ⊆ T_1 ⊆ … ⊆ T_{N-1}.
Assume that T_{N-1} is an absorbing set, that is, if a state is in this set and termination is not selected, then the next state is also in this set. Consider a state x in T_{N-1}. Note that J_{N-1}(x) = t(x), and since f(x, u, w) ∈ T_{N-1},
min_{u ∈ U(x)} E{g(x, u, w) + J_{N-1}(f(x, u, w))} = min_{u ∈ U(x)} E{g(x, u, w) + t(f(x, u, w))} ≥ t(x),
where the inequality is precisely the condition defining x ∈ T_{N-1}. Hence J_{N-2}(x) = t(x).

28 Thus x is in T_{N-2}, so T_{N-1} ⊆ T_{N-2}. Continuing in the same way, T_{N-1} ⊆ … ⊆ T_1 ⊆ T_0, and therefore T_{N-1} = … = T_1 = T_0.
The optimal decision is to stop once the state is in a certain stopping set, and this set does not depend on the stage.

29 Modified Asset Selling Problem
Suppose it is possible to hold on to previous offers. T_{N-1} is then the set of states where the quote is above a certain value; once you enter this set you always remain in it. Thus the optimal decision is to accept the offer once it is above a certain threshold, where the threshold does not depend on the stage.

30 Multiaccess Communication
A number of terminals share a wireless medium; only one user can successfully transmit a packet at a time. A terminal attempts a packet transmission with a probability that is a function of the total queue length in the system. Multiple simultaneous attempts cause interference; no attempt causes poor utilization.

31 A single attempt clears a packet from the system. The objective is to choose an attempt probability that maximizes the number of successful transmissions, that is, reduces the queue length. Let the cost g(x) be an increasing function of the queue length x. Let every packet be attempted with probability u_k in slot k. The success probability is the probability that exactly one packet is attempted, which is x_k u_k (1 − u_k)^{x_k − 1}; refer to it as p(x_k, u_k). The disturbances are the arrivals.

32 J_k(x_k) = g_k(x_k) + min_{u ∈ [0,1]} E_w{p(x_k, u) J_{k+1}(x_k + w_k − 1) + (1 − p(x_k, u)) J_{k+1}(x_k + w_k)}
           = g_k(x_k) + E_w J_{k+1}(x_k + w_k) + min_{u ∈ [0,1]} E_w{p(x_k, u) (J_{k+1}(x_k + w_k − 1) − J_{k+1}(x_k + w_k))}
J_k(x) is an increasing function of x for each k, since g_k(x) is an increasing function of x. Thus J_{k+1}(x_k + w_k) ≥ J_{k+1}(x_k + w_k − 1), and the minimum is attained when p(x_k, u) is maximized, which happens at u_k = 1/x_k. However, every terminal then needs to know the entire queue length, which is not realistic.
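A quick numerical sanity check (a sketch, not part of the slides) that the success probability p(x, u) = x u (1 − u)^{x−1} is maximized at u = 1/x:

```python
import numpy as np

def p_success(x, u):
    # Probability that exactly one of x backlogged terminals attempts, each w.p. u
    return x * u * (1.0 - u) ** (x - 1)

for x in [2, 5, 10]:
    grid = np.linspace(0.001, 0.999, 9999)
    u_star = grid[np.argmax(p_success(x, grid))]
    print(x, round(u_star, 3), 1.0 / x)   # numerical argmax matches u = 1/x
```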

33 Imperfect State Information
The system has access only to imperfect information about the state: the observation is now z_k, not x_k, where z_k = h_k(x_k, u_{k-1}, v_k) and v_k is a random disturbance that may depend on the entire history.
Choose the controls so that the additive expected cost over N time slots, E{g_N(x_N) + Σ_{k=0}^{N-1} g_k(x_k, u_k, w_k)}, is minimized, with x_{k+1} = f_k(x_k, u_k, w_k).

34 Reformulation as a Perfect State Problem
Let I_k be the vector of all previous observations and controls, and consider I_k as the system state: I_{k+1} = (I_k, u_k, z_{k+1}).
J_{N-1}(I_{N-1}) = min_{u_{N-1}} E{g_N(f_{N-1}(x_{N-1}, u_{N-1}, w_{N-1})) + g_{N-1}(x_{N-1}, u_{N-1}, w_{N-1}) | I_{N-1}, u_{N-1}}
J_k(I_k) = min_{u_k} E{g_k(x_k, u_k, w_k) + J_{k+1}(I_k, u_k, z_{k+1}) | I_k, u_k}

35 Sufficient Statistic
The method is complex because of state-space explosion. Can all the information in I_k be carried by a function of I_k with lower dimensionality? Such a function is a sufficient statistic. Assume that the observation disturbance depends only on the current state and the previous control and disturbance. Then P(x_k | I_k) is a sufficient statistic.

36 J_k(I_k) = min_{u_k} E{g_k(x_k, u_k, w_k) + J_{k+1}(I_k, u_k, z_{k+1}) | I_k, u_k}
The expectation is a function of P(x_k, w_k, z_{k+1} | I_k, u_k), which is the product of P(z_{k+1} | I_k, u_k, x_k, w_k), P(w_k | x_k, u_k), and P(x_k | I_k). The first factor reduces to P(z_{k+1} | u_k, x_k, w_k) and the second is P(w_k | x_k, u_k), so the cost J depends on I_k only through P(x_k | I_k).
P(x_{k+1} | I_{k+1}) can be computed efficiently from P(x_k | I_k) using Bayes' rule. The system state is now the conditional probability distribution P(x_k | I_k).

37 Example: Treasure Searching
A site may contain a treasure. If it does, a search yields the treasure with probability β. The treasure is worth V units, each search costs C units, and the search has to terminate within N slots. The state is p_k, the probability that the site contains the treasure given the previous controls and observations. If it is optimal not to search in a slot, it is optimal not to search in any later slot.

38 Probability recursion:
p_{k+1} = p_k if the site is not searched at time k
        = 0 if the site is searched and the treasure is found
        = p_k(1 − β) / (p_k(1 − β) + 1 − p_k) if the site is searched and the treasure is not found
J_k(p_k) = max[0, −C + βVp_k + (1 − βp_k) J_{k+1}(p_{k+1})], with J_N(p) = 0
Search if and only if βVp_k ≥ C.
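A sketch of this recursion; the parameters β, V, C, and N are illustrative assumptions:

```python
beta, V, C, N = 0.6, 10.0, 1.0, 5     # assumed parameters, for illustration

def update(p):
    # Posterior that the site holds the treasure after an unsuccessful search
    return p * (1 - beta) / (p * (1 - beta) + 1 - p)

def J(k, p):
    if k == N:
        return 0.0
    # Stop (reward 0), or search: pay C, win beta*V*p, continue w.p. 1 - beta*p
    return max(0.0, -C + beta * V * p + (1 - beta * p) * J(k + 1, update(p)))

print(J(0, 0.5))                      # value of searching from a 50% prior
```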

39 General Form of the Recursion
P(x_{k+1} | I_{k+1}) = P(x_{k+1} | I_k, u_k, z_{k+1}) = P(x_{k+1}, z_{k+1} | I_k, u_k) / P(z_{k+1} | I_k, u_k)
= P(x_{k+1} | I_k, u_k) P(z_{k+1} | I_k, u_k, x_{k+1}) / ∫ P(x_{k+1} | I_k, u_k) P(z_{k+1} | I_k, u_k, x_{k+1}) dx_{k+1}
Since x_{k+1} = f_k(x_k, u_k, w_k), the prediction term P(x_{k+1} | I_k, u_k) is determined by P(w_k | I_k, u_k) = ∫ P(x_k | I_k) P(w_k | u_k, x_k) dx_k, and P(z_{k+1} | I_k, u_k, x_{k+1}) can be expressed in terms of P(v_{k+1} | x_k, u_k, w_k), P(w_k | x_k, u_k), and P(x_k | I_k).
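For a finite state space the recursion becomes a matrix computation; here is a minimal sketch, with the transition and observation probabilities assumed given as matrices (names illustrative):

```python
import numpy as np

def belief_update(belief, P_trans, P_obs, u, z):
    """One step of the recursion P(x_{k+1} | I_{k+1}) for a finite state space.

    belief[i]        = P(x_k = i | I_k)
    P_trans[u][i, j] = P(x_{k+1} = j | x_k = i, u_k = u)
    P_obs[u][j, z]   = P(z_{k+1} = z | x_{k+1} = j, u_k = u)
    """
    predicted = belief @ P_trans[u]            # prediction: P(x_{k+1} | I_k, u_k)
    unnormalized = predicted * P_obs[u][:, z]  # times the observation likelihood
    return unnormalized / unnormalized.sum()   # denominator is P(z_{k+1} | I_k, u_k)
```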

40 Suboptimal Control

41 Certainty Equivalence Control
Given the information vector I_k, compute the state estimate x̂_k(I_k). Then choose the controls that minimize the additive cost over the remaining slots, g_N(x_N) + Σ_{i=k}^{N-1} g_i(x_i, u_i, w̄_i), with the disturbances fixed at their expected values w̄_i and with initial condition x̂_k(I_k). The resulting deterministic optimization is easier to solve.

42 Further Simplification
Choose a heuristic that solves the optimization approximately. Find the cost-to-go function J̃_{k+1} associated with the heuristic for every control and state. Find the control that minimizes
g_k(x_k, u_k, E(w_k)) + J̃_{k+1}(f_k(x_k, u_k, E(w_k)))
and apply it in the kth stage.

43 Partially Stochastic Certainty Equivalence Control
Applies under imperfect state information. Solve the DP assuming perfect state information. At every stage, assume that the state equals its expected value given the observations and controls, and choose the control accordingly.

44 Applications
Multiaccess communication
Hidden Markov models

45 Open Loop Feedback Control
Similar to the certainty equivalence controller, except that it uses the measurements to update the probability distributions of the disturbances as well. OLFC performs at least as well as the optimal open-loop policy; CEC provides no such guarantee.

46 Limited Lookahead Policy
Find the control that minimizes E[g_k(x_k, u_k, w_k) + J̃_{k+1}(f_k(x_k, u_k, w_k))], where J̃_{k+1} is an approximation of the cost-to-go function, and apply it in the kth stage. This is the one-step lookahead policy.

47 Two-Step Lookahead Policy
Approximate the cost-to-go J̃_{k+2}, then compute a two-stage DP with terminal cost J̃_{k+2}.

48 Performance Bound
Let F_k(x_k) = min_{u ∈ U(x_k)} E[g_k(x_k, u, w_k) + J̃_{k+1}(f_k(x_k, u, w_k))], and suppose that F_k(x_k) ≤ J̃_k(x_k) for all x_k and k. Then the cost-to-go of the one-step lookahead policy from the kth stage is upper bounded by F_k(x_k).

49 How to Approximate?
Problem approximation: use the cost-to-go of a related but simpler problem.
Approximation architectures: approximate the cost-to-go function by a parametrized function and tune the parameters.
Rollout policy: approximate the cost-to-go by that of a suboptimal strategy which is expected to be reasonably close.

50 Problem Approximation
Vehicle routing: there is a graph with a reward associated with each node, and m vehicles that traverse the graph. The first vehicle traversing a node collects all of its reward. Each vehicle starts at a given node and must return to another node within a maximum number of arcs. Find a route for each vehicle so as to maximize the total reward.

51 The approximate cost-to-go is the value of the following suboptimal set of paths: fix the order of the vehicles, then obtain the path for each vehicle in that order, reducing the rewards of the traversed nodes to 0 at each step.

52 Rollout Policy
Start with a suboptimal policy, the base policy. One-step lookahead on top of the base policy always improves upon it.

53 Example: Quiz Problem
A person is given a list of N questions. Question j is answered correctly with probability p_j, and the person receives a reward v_j for answering it correctly. The quiz terminates at the first incorrect answer. The optimal ordering is to answer the questions in decreasing order of p_j v_j / (1 − p_j).
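A sketch of the index rule; the (p_j, v_j) pairs below are illustrative assumptions:

```python
def optimal_order(questions):
    # Sort (p_j, v_j) pairs by the index p*v/(1 - p), highest first (assumes p_j < 1)
    return sorted(questions, key=lambda q: q[0] * q[1] / (1 - q[0]), reverse=True)

def expected_reward(order):
    # Question j contributes v_j times the probability that questions 1..j
    # were all answered correctly (the quiz stops at the first mistake)
    total, alive = 0.0, 1.0
    for p, v in order:
        alive *= p
        total += alive * v
    return total

qs = [(0.9, 1.0), (0.5, 4.0), (0.8, 2.0)]     # assumed (p_j, v_j) pairs
print(expected_reward(optimal_order(qs)))      # expected reward of the index ordering
```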

54 Variants where this solution can be used as a base policy:
A limit on the maximum number of questions that can be answered.
A time window within which each question can be answered.
Precedence constraints.

55 Infinite Horizon Problem

56 Problem Description
The objective is to minimize the total cost over an infinite horizon,
lim_{N→∞} E Σ_{k=0}^{N-1} g(x_k, u_k, w_k).
This limit need not exist! Thus we instead minimize a discounted cost,
J_π(x) = lim_{N→∞} E Σ_{k=0}^{N-1} α^k g(x_k, u_k, w_k), with x_0 = x,
where the discount factor α is in (0, 1).

57 Classifications
Stochastic shortest path problems: here the discount factor can be taken as 1. There is a termination state such that the system stays there once it reaches it, and the system reaches the termination state with probability 1. The horizon is in effect finite, but its length is random.

58 Discounted problems with bounded cost per stage: here the discount factor is less than 1 and the absolute cost per stage is upper bounded, so lim_{N→∞} E Σ_{k=0}^{N-1} α^k g(x_k, u_k, w_k) exists.
Discounted problems with unbounded cost per stage: the analysis is more complicated.

59 Average Cost Problem
Minimize lim_{N→∞} (1/N) E Σ_{k=0}^{N-1} g(x_k, u_k, w_k).
In many cases lim_{α→1} (1 − α) J_α(x) is the average cost of the optimal strategy; the limit exists under certain special conditions.

60 Bellman's Equation
J*(x) = min_{u ∈ U(x)} E{g(x, u, w) + α J*(f(x, u, w))}
The optimal costs J*(x) satisfy Bellman's equation. Given any initial condition J_0(x), the iteration
J_{k+1}(x) = min_{u ∈ U(x)} E{g(x, u, w) + α J_k(f(x, u, w))}
converges to the optimal discounted cost J*(x) (value iteration).
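A minimal value-iteration sketch for a finite MDP given in transition-matrix form; the two-state example data are illustrative assumptions:

```python
import numpy as np

def value_iteration(P, g, alpha, tol=1e-9):
    """Solve J(x) = min_u E[g(x,u) + alpha * J(x')] by successive approximation.

    P[u] : n-by-n transition matrix under control u
    g    : n-by-m matrix of expected stage costs g(x, u)
    """
    n, m = g.shape
    J = np.zeros(n)
    while True:
        Q = np.column_stack([g[:, u] + alpha * P[u] @ J for u in range(m)])
        J_new = Q.min(axis=1)
        if np.max(np.abs(J_new - J)) < tol:
            return J_new, Q.argmin(axis=1)   # optimal cost and a stationary policy
        J = J_new

P = [np.array([[0.9, 0.1], [0.2, 0.8]]),     # assumed transitions, for illustration
     np.array([[0.5, 0.5], [0.6, 0.4]])]
g = np.array([[1.0, 2.0], [3.0, 0.5]])       # assumed stage costs g(x, u)
print(value_iteration(P, g, alpha=0.9))
```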

61 Cost of a Stationary Policy
A policy is stationary if it does not depend on the time index: the control action in any slot j is the same as in any other slot k whenever the state is the same. The discounted cost of a stationary policy μ can be found by solving
J_μ(x) = E{g(x, μ(x), w) + α J_μ(f(x, μ(x), w))}
The solution can also be obtained from the DP iteration, starting from any initial condition:
J_{k+1}(x) = E{g(x, μ(x), w) + α J_k(f(x, μ(x), w))}

62 A stationary policy is optimal if and only if, for every state x, it attains the minimum on the right side of Bellman's equation. There always exists an optimal stationary policy when the cost is bounded and the discount factor is less than 1. Similar results hold for stochastic shortest path problems with discount factor 1.

63 Stochastic Shortest Path: Battery Management Problem

64 Computational Strategies for Solving Bellman's Equation
Value iteration: an infinite number of iterations.
Policy iteration: a finite number of iterations.

65 Policy Iteration
Start from a stationary policy and generate a sequence of new policies. Let the policy in the kth iteration be μ_k. Compute its cost by solving the linear equations
J(x) = E{g(x, μ_k(x), w) + α J(f(x, μ_k(x), w))}
The new policy μ_{k+1} is obtained from the solution J(x) as
μ_{k+1}(x) = arg min_{u ∈ U(x)} E{g(x, u, w) + α J(f(x, u, w))}
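A sketch of policy iteration in the same finite-MDP, transition-matrix form used in the value-iteration sketch above (the data layout is an illustrative assumption):

```python
import numpy as np

def policy_iteration(P, g, alpha):
    """Alternate policy evaluation (a linear solve) and policy improvement.

    P[u] : n-by-n transition matrix under control u
    g    : n-by-m matrix of expected stage costs g(x, u)
    """
    n, m = g.shape
    mu = np.zeros(n, dtype=int)
    while True:
        # Evaluation: J = g_mu + alpha * P_mu @ J  =>  (I - alpha * P_mu) J = g_mu
        P_mu = np.array([P[mu[x]][x] for x in range(n)])
        g_mu = g[np.arange(n), mu]
        J = np.linalg.solve(np.eye(n) - alpha * P_mu, g_mu)
        # Improvement: greedy policy with respect to J
        Q = np.column_stack([g[:, u] + alpha * P[u] @ J for u in range(m)])
        mu_next = Q.argmin(axis=1)
        if np.array_equal(mu_next, mu):      # stop when the policy repeats itself
            return J, mu
        mu = mu_next
```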

66 The iteration stops when the current policy is the same as the previous one. Policy iteration terminates at an optimal policy in a finite number of iterations, and the costs of successive policies are non-increasing.

67 Continuous Time MDP
Time is no longer slotted; state transitions can occur at any time. Markov property: the system restarts itself at the instant of every transition. Fresh control decisions are taken at the instants of transitions. Discretize the system by looking at the transition epochs only (these act like slot boundaries).

68 Continuous Time MDP Formulation of the Inventory System
Unit demands arrive as a Poisson process of rate λ; unit orders arrive as a Poisson process of rate μ. The transition epochs are the demand epochs and the inventory-arrival epochs. Assume that any previous order and demand arrival process is cancelled at a transition epoch. The state is the inventory level together with an indicator of whether or not an order was placed at the previous transition. Penalties are charged at the transition epochs: demands which cannot be fulfilled incur penalties, and orders are charged at delivery.

69 Let x be the amount of inventory and y the indicator of whether or not fresh inventory was ordered. At a transition epoch,
J(x, y) = g_1(x) + g_2(y) + (λ/(λ+μ)) J(x − 1, y′) + (μ/(λ+μ)) J(x + y, y′)
where the first transition term corresponds to a demand epoch, the second to an order-arrival epoch, and y′ is the ordering decision taken at the new epoch.
g_1(x) = 0 if x is positive, and c otherwise
g_2(y) = 0 if y = 0, and p otherwise

