1
Plan Dynamic programming Introduction to Markov decision processes
Markov decision processes formulation Discounted markov decision processes Average cost markov decision processes Continuous-time Markov decision processes
2
Dynamic programming Basic principle of dynamic programming
Some applications Stochastic dynamic programming
3
Dynamic programming Basic principle of dynamic programming
Some applications Stochastic dynamic programming
4
Introduction Dynamic programming (DP) is a general optimization technique based on implicit enumeration of the solution space. The problems should have a particular sequential structure, such that the unknowns can be determined one after another. It is based on the "principle of optimality". A wide range of problems can be put in sequential form and solved by dynamic programming.
5
Introduction Applications :
• Optimal control • Most problems in graph theory • Investment • Deterministic and stochastic inventory control • Project scheduling • Production scheduling We limit ourselves to discrete optimization
6
Illustration of DP by shortest path problem
Problem : We are planning the construction of a highway from city A to city K. Different construction alternatives and their costs are given in the following graph. The problem consists in determining the highway with the minimum total cost. [Graph: nodes A to K, arcs labelled with construction costs]
7
BELLMAN's principle of optimality
General form: if C belongs to an optimal path from A to B, then the sub-paths from A to C and from C to B are also optimal; in other words, every sub-path of an optimal path is optimal. Corollary : SP(x0, y) = min { SP(x0, z) + l(z, y) | z : predecessor of y }
8
Solving a problem by DP 1. Extension
Extend the problem to a family of problems of the same nature 2. Recursive Formulation (application of the principle of optimality) Link optimal solutions of these problems by a recursive relation 3. Decomposition into steps or phases Define the order of the resolution of the problems in such a way that, when solving a problem P, optimal solutions of all other problems needed for computation of P are already known. 4. Computation by steps
9
Solving a problem by DP Difficulties in using dynamic programming :
Identification of the family of problems; transformation of the problem into a sequential form.
10
Shortest Path in an acyclic graph
• Problem setting : find a shortest path from x0 (root of the graph) to a given node y0 • Extension : find a shortest path from x0 to any node y, denoted SP(x0, y) • Recursive formulation SP(y) = min { SP(z) + l(z, y) : z predecessor of y } • Decomposition into steps : at each step k, consider only nodes y with unknown SP(y) but for which the SP of all predecessors are known. • Compute SP(y) step by step Remarks : • It is a backward dynamic programming • It is also possible to solve this problem by forward dynamic programming
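A minimal Python sketch of this backward recursion, assuming the graph is given as predecessor lists with arc lengths (the nodes and lengths below are illustrative, not the highway data of the earlier slide):

```python
# Shortest path from a root x0 to every node of an acyclic graph, by the
# Bellman recursion SP(y) = min over predecessors z of SP(z) + l(z, y).
def shortest_paths(root, predecessors, topo_order):
    """predecessors[y] is a list of (z, length_zy) pairs; topo_order lists
    the nodes so that every predecessor of y appears before y."""
    SP = {root: 0.0}
    for y in topo_order:
        if y == root:
            continue
        SP[y] = min(SP[z] + l for z, l in predecessors[y])
    return SP

# Illustrative data (not the A..K highway graph of the slide).
preds = {"B": [("A", 3)], "C": [("A", 5)], "D": [("B", 2), ("C", 1)]}
print(shortest_paths("A", preds, ["A", "B", "C", "D"]))   # D gets min(3+2, 5+1) = 5
```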
11
DP from a control point of view
Consider the control of a discrete-time dynamic system, with costs generated over time depending on the states and the control actions. [Diagram: state t, action and cost at the present decision epoch, leading to state t+1 at the next decision epoch]
12
DP from a control point of view
System dynamics : xt+1 = ft(xt, ut), t = 0, 1, ..., N-1, where t : time index, xt : state of the system, ut : control action to decide at t
13
DP from a control point of view
Criterion to optimize
14
DP from a control point of view
Value function or cost-to-go function:
15
DP from a control point of view
Optimality equation or Bellman equation
16
Applications Single machine scheduling (Knapsack) Inventory control
Traveling salesman problem
17
Applications Single machine scheduling (Knapsack)
Problem : Consider a set of N production requests, each needing a production time ti on a bottleneck machine and generating a profit pi. The capacity of the bottleneck machine is C. Question: determine the production requests to confirm in order to maximize the total profit. Formulation: max Σi pi Xi subject to Σi ti Xi ≤ C, with Xi ∈ {0, 1}
18
Knapsack Problem
Problem: Mr Radin can take 7 kg without paying an over-weight fee on his return flight. He decides to take advantage of it and looks for some local products that he can sell at home for an extra gain. He selects the n most interesting objects, weighs each of them, and bargains the prices. Which objects should he buy in order to maximize his gain? [Table: objects i = 1, ..., 6 with their weights wi and expected gains ri]
19
Knapsack Problem Generic formulation: Time = 1, …, 7
State st = remaining capacity for objects t, t+1, … State space = {0, 1, 2, …, 7} Action at time t = select object t or not Action space At(s) = {1=YES, 0=NO} if s ≥ wt, and = {0} if s < wt Immediate gain at time t: gt(st, ut) = rt if YES, = 0 if NO State transition or system dynamics: st+1 = st – wt if YES, = st if NO
20
Knapsack Problem Value function:
Jn(s) = maximal gain from objects n, n+1, …, 6 with a remaining capacity of s kg. Optimality equation:
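The recursion Jn(s) = max{ Jn+1(s), rn + Jn+1(s − wn) } (the second term only when s ≥ wn) can be sketched as follows; the weights and gains used here are hypothetical, since the slide's table values are not reproduced:

```python
# Backward induction for the 0/1 knapsack DP of the slides:
# J_n(s) = max over a in {NO, YES if s >= w_n} of a*r_n + J_{n+1}(s - a*w_n), J_7(s) = 0.
def knapsack_dp(weights, gains, capacity):
    n = len(weights)
    J = [[0] * (capacity + 1) for _ in range(n + 1)]     # J[n][s] = 0 (terminal)
    policy = [[0] * (capacity + 1) for _ in range(n)]
    for t in range(n - 1, -1, -1):
        for s in range(capacity + 1):
            best, act = J[t + 1][s], 0                   # action NO
            if s >= weights[t]:                          # action YES feasible
                yes = gains[t] + J[t + 1][s - weights[t]]
                if yes > best:
                    best, act = yes, 1
            J[t][s], policy[t][s] = best, act
    return J, policy

# Hypothetical weights and gains (the slide's data are not reproduced here).
J, policy = knapsack_dp([2, 3, 4, 1, 5, 3], [3, 4, 6, 1, 8, 5], 7)
print(J[0][7])   # maximal gain with 7 kg of capacity
```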
21
Knapsack Problem [Backward-induction table: for each stage n = 6, …, 1 and each state s = 0, …, 7, the value Jn(s) and the optimal action (YES/NO); -1 marks an infeasible action]
22
Control map or control policy
Knapsack Problem Control map or control policy [Table: optimal action (Y/N) for each stage 1, …, 6 and each state s = 0, …, 7]
23
Applications Inventory control
Problem: determine the purchasing quantity at the beginning of each period in order to minimize the total expense. Data: unit price and demand of each period, storage capacity 5 (in 000), initial stock = 0, fixed order cost K = 20 (00$), unit inventory holding cost h = 1 (00$)
24
Applications Inventory control
Generic formulation: Time = 1, …, 7 State st = inventory at the beginning of period t State space = {0, 1, 2, …, 5} Action at time t = purchasing quantity ut of period t Action space A(st) = {max(0, dt – st), …, 5 + dt - st} Immediate cost at time t: gt(st, ut) = K + pt·ut + h·(st + ut - dt) if ut > 0, = h·(st + ut - dt) if ut = 0 State transition or system dynamics: st+1 = st + ut - dt
25
Applications Inventory control
Value function: Jn(s) = minimal total cost over periods n, n+1, …, 6 by starting with an inventory s at the beginning of period n. Optimality equation:
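A rough sketch of the backward induction for this purchasing problem, using the slide's capacity 5, fixed cost K = 20 and holding cost h = 1, but with hypothetical price and demand series:

```python
# Backward induction for the deterministic purchasing problem:
# J_n(s) = min over u of [ K*1{u>0} + p_n*u + h*(s + u - d_n) + J_{n+1}(s + u - d_n) ].
def inventory_dp(prices, demands, cap=5, K=20, h=1):
    N = len(prices)
    INF = float("inf")
    J = [dict() for _ in range(N + 1)]
    J[N] = {s: 0 for s in range(cap + 1)}                # no cost after the horizon
    for t in range(N - 1, -1, -1):
        for s in range(cap + 1):
            J[t][s] = INF
            for u in range(max(0, demands[t] - s), cap + demands[t] - s + 1):
                nxt = s + u - demands[t]                 # inventory carried to t+1
                cost = (K if u > 0 else 0) + prices[t] * u + h * nxt + J[t + 1][nxt]
                J[t][s] = min(J[t][s], cost)
    return J

# Hypothetical prices and demands (the slide's data table is not reproduced here).
J = inventory_dp(prices=[8, 9, 7, 10, 8, 9], demands=[2, 3, 1, 4, 2, 3])
print(J[0][0])   # minimal total expense starting with zero stock
```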
26
Applications Traveling salesman problem
Data: a graph with N nodes and a distance matrix [dij] between any two nodes i and j. Question: determine a circuit of minimum total distance passing through each node exactly once. Extension: C(y, S) = shortest path from y to x0 passing exactly once through each node in S. Application: machine scheduling with setups.
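The extension C(y, S) gives the classical recursion C(y, S) = min over z in S of d(y, z) + C(z, S \ {z}), with C(y, Ø) = d(y, x0). A small sketch with an illustrative distance matrix:

```python
# Dynamic programming over subsets for the traveling salesman problem,
# following the recursion C(y, S) = min_{z in S} d(y, z) + C(z, S \ {z}).
from functools import lru_cache

def tsp(dist):
    n = len(dist)                        # node 0 plays the role of x0
    @lru_cache(maxsize=None)
    def C(y, S):                         # S = frozenset of nodes still to visit
        if not S:
            return dist[y][0]
        return min(dist[y][z] + C(z, S - {z}) for z in S)
    return C(0, frozenset(range(1, n)))

# Small illustrative distance matrix (not data from the slides).
d = [[0, 2, 9, 10],
     [2, 0, 6, 4],
     [9, 6, 0, 3],
     [10, 4, 3, 0]]
print(tsp(d))   # length of the best circuit starting and ending at node 0
```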
27
Total tardiness minimization on a single machine
Applications Total tardiness minimization on a single machine [Table: jobs 1, 2, 3 with due date di, processing time pi and weight wi]
28
Stochastic dynamic programming Model
Consider the control of a discrete-time stochastic dynamic system, with costs generated over time. [Diagram: state t, action, random perturbation and stage cost at the present decision epoch, leading to state t+1 at the next decision epoch]
29
Stochastic dynamic programming Model
System dynamics : xt+1 = ft(xt, ut, wt), t = 0, 1, ..., N-1, where t : time index, xt : state of the system, ut : decision at time t, wt : random perturbation
30
Stochastic dynamic programming Model
Criterion
31
Stochastic dynamic programming Example
Consider a problem of ordering a quantity of a certain item at each of N periods so as to meet a stochastic demand, while minimizing the incurred expected cost. xt : stock available at the beginning of period t ut : quantity ordered at the beginning of period t wt : random demand during period t with a given probability distribution. xt+1 = xt + ut - wt
32
Stochastic dynamic programming Example
Cost : purchasing cost c·ut, inventory cost r(xt + ut - wt), and the total cost over the horizon. [Diagram: inventory system with order quantity ut, demand wt and dynamics xt+1 = xt + ut - wt]
33
Stochastic dynamic programming Model
Open-loop control: Order quantities u1, u2, ..., uN-1 are determined once at time 0 Closed-loop control: Order quantity ut at each period is determined dynamically with the knowledge of state xt
34
Stochastic dynamic programming Control policy
The rule for selecting at each period t a control action ut for each possible state xt. Examples of inventory control policies: Order a constant quantity ut = E[wt] Order-up-to policy : ut = St – xt if xt ≤ St, ut = 0 if xt > St, where St is a constant order-up-to level.
35
Stochastic dynamic programming Optimal control policy
Mathematically, in closed-loop control, we want to find a sequence of functions mt, t = 0, ..., N-1, mapping state xt into control ut so as to minimize the total expected cost. The sequence p = {m0, ..., mN-1} is called a policy.
36
Stochastic dynamic programming Optimal control
Cost of a given policy p = {m0, ..., mN-1}, Optimal control: minimize Jp(x0) over all possible policies p
37
Stochastic dynamic programming State transition probabilities
State transition probability: pij(u, t) = P{xt+1 = j | xt = i, ut = u}, depending on the control policy.
38
Stochastic dynamic programming Basic problem
A discrete-time dynamic system : xt+1 = ft(xt, ut, wt), t = 0, 1, ..., N-1 Finite state space xt ∈ St Finite control space ut ∈ Ct Control policy p = {m0, ..., mN-1} with ut = mt(xt) State-transition probability: pij(u) Stage cost : gt(xt, mt(xt), wt)
39
Stochastic dynamic programming Basic problem
Expected cost of a policy Optimal control policy p* is the policy with minimal cost: where P is the set of all admissible policies. J*(x) : optimal cost function or optimal value function.
40
Stochastic dynamic programming Principle of optimality
Let p* = {m*0, ..., m*N-1} be an optimal policy for the basic problem over the N time periods. Then the truncated policy {m*i, ..., m*N-1} is optimal for the subproblem of minimizing the following total cost (called the cost-to-go function) from time i to time N, starting with state xi at time i
41
Stochastic dynamic programming DP algorithm
Theorem: For every initial state x0, the optimal cost J*(x0) of the basic problem is equal to J0(x0), given by the last step of the following algorithm, which proceeds backward in time from period N-1 to period 0 Furthermore, if u*t = m*t(xt) minimizes the right side of Eq (B) for each xt and t, the policy p* = {m*0, ..., m*N-1} is optimal.
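A generic sketch of this backward algorithm, assuming finite state and disturbance sets given explicitly (the function names below are illustrative, not the slides' notation for Eq (B)):

```python
# Generic backward DP: J_N(x) = g_N(x);  J_t(x) = min_u E_w[ g_t(x,u,w) + J_{t+1}(f_t(x,u,w)) ].
def backward_dp(states, actions, f, g, gN, w_dist, N):
    """actions(t, x): feasible controls; f(t, x, u, w): dynamics; g(t, x, u, w): stage cost;
    gN(x): terminal cost; w_dist: list of (w, prob) pairs."""
    J = {x: gN(x) for x in states}
    mu = []                                             # mu[t][x] = optimal control m*_t(x)
    for t in range(N - 1, -1, -1):
        Jt, mut = {}, {}
        for x in states:
            best_u, best_v = None, float("inf")
            for u in actions(t, x):
                v = sum(p * (g(t, x, u, w) + J[f(t, x, u, w)]) for w, p in w_dist)
                if v < best_v:
                    best_u, best_v = u, v
            Jt[x], mut[x] = best_v, best_u
        J = Jt
        mu.insert(0, mut)
    return J, mu                                        # J = J_0, mu = {m*_0, ..., m*_{N-1}}
```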
42
Stochastic dynamic programming Example
Consider the inventory control problem with the following: Excess demand is lost, i.e. xt+1 = max{0, xt + ut – wt} The inventory capacity is 2, i.e. xt + ut ≤ 2 The inventory holding/shortage cost is (xt + ut – wt)2 Unit ordering cost is a, i.e. gt(xt, ut, wt) = a·ut + (xt + ut – wt)2. N = 3 and the terminal cost gN(xN) = 0 Demand : P(wt = 0) = 0.1, P(wt = 1) = 0.2, P(wt = 2) = 0.7.
43
Stochastic dynamic programming Example
Generic formulation: Time = {1, 2, 3, 4=end} State xt = inventory level at the beginning of a period State space = {0, 1, 2} Action ut = order quantity of period t Action space = {0, 1, …, 2 – xt} Perturbation dt = demand of period t Immediate cost = aut + (xt + ut – dt)2 System dynamics xt+1 = max{0, xt + ut – dt}
44
Stochastic dynamic programming Example
Value function: Jn(s) = minimal total cost over periods n, n+1, …, 3 by starting with an inventory s at the beginning of period n. Optimality equation:
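The optimality equation can be run numerically; the short sketch below reproduces the tables of the next slides for a = 0.25 and demand probabilities 0.1, 0.2, 0.7:

```python
# Backward induction for the 3-period inventory example (a = 0.25, capacity 2,
# demand w = 0, 1, 2 with probabilities 0.1, 0.2, 0.7).
a, demand = 0.25, [(0, 0.1), (1, 0.2), (2, 0.7)]
J = {s: 0.0 for s in range(3)}                      # J_4 = terminal cost = 0
for n in (3, 2, 1):                                 # periods 3, 2, 1
    Jn, policy = {}, {}
    for s in range(3):
        best = None
        for u in range(3 - s):                      # order so that s + u <= 2
            cost = sum(p * (a * u + (s + u - w) ** 2 + J[max(0, s + u - w)])
                       for w, p in demand)
            if best is None or cost < best[0]:
                best = (cost, u)
        Jn[s], policy[s] = best
    J = Jn
    print(n, {s: (round(J[s], 3), policy[s]) for s in range(3)})
# J3 = {0: 1.05, 1: 0.8, 2: 0.6}, J2 = {0: 2.055, 1: 1.805, 2: 1.555},
# J1 = {0: 3.055, 1: 2.805, 2: 2.555}, with optimal orders (1,0,0), (2,1,0), (2,1,0).
```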
45
Stochastic dynamic programming Example – Immediate cost
Expected immediate cost for a = 0.25, demand w = 0, 1, 2 with Pw = 0.1, 0.2, 0.7 and g(s,u,w) = 0.25u + (s+u-w)2:
(s,u)=(0,0): 0, 1, 4, mean 3; (0,1): 1.25, 0.25, 1.25, mean 1.05; (0,2): 4.5, 1.5, 0.5, mean 1.1; (1,0): 1, 0, 1, mean 0.8; (1,1): 4.25, 1.25, 0.25, mean 0.85; (2,0): 4, 1, 0, mean 0.6.
Mean stage cost = 0.1·g(s,u,0) + 0.2·g(s,u,1) + 0.7·g(s,u,2)
Myopic policy: (s=0, u=1), (s=1, u=0), (s=2, u=0)
46
Stochastic dynamic programming Example – Period 3-problem
Period n = 3 (a = 0.25). Since J4(s') = 0, the expected total cost equals the expected stage cost 0.25u + (s+u-w)2 + J4((s+u-w)+):
(s,u)=(0,0): mean 3; (0,1): mean 1.05; (0,2): mean 1.1; (1,0): mean 0.8; (1,1): mean 0.85; (2,0): mean 0.6.
Hence J3(0) = 1.05 (order 1), J3(1) = 0.8 (order 0), J3(2) = 0.6 (order 0).
47
Stochastic dynamic programming Example – Periods 2+3-problem
Period n = 2 (a = 0.25), using J3(0) = 1.05, J3(1) = 0.8, J3(2) = 0.6; total cost = stage cost 0.25u + (s+u-w)2 + remaining cost J3((s+u-w)+), Pw = 0.1, 0.2, 0.7:
(s,u)=(0,0): 1.05, 2.05, 5.05, mean 4.05; (0,1): 2.05, 1.3, 2.3, mean 2.075; (0,2): 5.1, 2.3, 1.55, mean 2.055; (1,0): 1.8, 1.05, 2.05, mean 1.825; (1,1): 4.85, 2.05, 1.3, mean 1.805; (2,0): 4.6, 1.8, 1.05, mean 1.555.
Hence J2(0) = 2.055 (order 2), J2(1) = 1.805 (order 1), J2(2) = 1.555 (order 0).
48
Stochastic dynamic programming Example – Periods 1+2+3-problem
Period n = 1 (a = 0.25), using J2(0) = 2.055, J2(1) = 1.805, J2(2) = 1.555:
(s,u)=(0,0): 2.055, 3.055, 6.055, mean 5.055; (0,1): 3.055, 2.305, 3.305, mean 3.08; (0,2): 6.055, 3.305, 2.555, mean 3.055; (1,0): 2.805, 2.055, 3.055, mean 2.83; (1,1): 5.805, 3.055, 2.305, mean 2.805; (2,0): 5.555, 2.805, 2.055, mean 2.555.
Hence J1(0) = 3.055 (order 2), J1(1) = 2.805 (order 1), J1(2) = 2.555 (order 0).
49
Stochastic dynamic programming Example – value function & control
Optimal policy (a = 0.25). Cost-to-go (order quantity) for each stock level:
Stock 0: 3-period policy, stage 1: 3.055 (2); 2-period policy, stage 2: 2.055 (2); 1-period policy, stage 3: 1.05 (1).
Stock 1: 2.805 (1); 1.805 (1); 0.8 (0).
Stock 2: 2.555 (0); 1.555 (0); 0.6 (0).
50
Stochastic dynamic programming Example – Control map or policy
Control map by stock level and period, from long-term to short-term: the long-term policy (s=0, u=2), (s=1, u=1), (s=2, u=0) is used in periods 1 and 2, and the myopic policy (s=0, u=1), (s=1, u=0), (s=2, u=0) in period 3.
51
Stochastic dynamic programming Example – Sample paths
[Sample paths: starting from initial stock 1, stock and control trajectories over periods 1 to 4 under the control map, for demand scenarios such as (2,1,2), (1,2,1) and (0,0,1)]
52
Sequential decision model
Key ingredients: A set of decision epochs A set of system states A set of available actions A set of state/action dependent immediate costs A set of state/action dependent transition probabilities Policy: a sequence of decision rules chosen in order to minimize the cost function Issues: Existence of the optimal policy Form of the optimal policy Computation of the optimal policy
53
Applications Inventory management Bus engine replacement
Highway pavement maintenance Bed allocation in hospitals Personnel staffing in fire departments Traffic control in communication networks …
54
Example Consider a system with one machine producing one product. The processing time of a part is exponentially distributed with rate p. Demands arrive according to a Poisson process of rate d. State Xt = stock level, Action : at = make or rest. [Transition diagram: 'make' moves the stock up at rate p, demand moves it down at rate d]
55
Example Zero-stock policy (M/M/1 queue of backorders):
p0 = 1-r, p-n = r^n p0, with r = d/p; average cost = b·r/(1-r)
Hedging point policy with hedging point 1:
p1 = 1-r, p-n = r^(n+1) p1; average cost = h(1-r) + r·b·r/(1-r)
The hedging point policy is better iff h/b < r/(1-r)
56
MDP = Markov Decision Process
MDP Model formulation
57
Decision epochs Times at which decisions are made.
The set T of decision epochs can be either a discrete set or a continuum. The set T can be finite (finite horizon problem) or infinite (infinite horizon).
58
State and action sets At each decision epoch, the system occupies a state. S : the set of all possible system states. As : the set of allowable actions in state s. A = ∪s∈S As : the set of all possible actions. S and As can be: finite sets, countably infinite sets, compact sets
59
Costs and Transition probabilities
As a result of choosing action a ∈ As in state s at decision epoch t, the decision maker receives a cost Ct(s, a) and the system state at the next decision epoch is determined by the probability distribution pt(.|s, a). If the cost depends on the state at the next decision epoch, then Ct(s, a) = Σj∈S Ct(s, a, j) pt(j|s, a), where Ct(s, a, j) is the cost if the next state is j. A Markov decision process is characterized by {T, S, As, pt(.|s, a), Ct(s, a)}
60
Example of inventory management
Consider the inventory control problem with the following: Excess demand is lost, i.e. xt+1 = max{0, xt + ut – wt} The inventory capacity is 2, i.e. xt + ut ≤ 2 The inventory holding/shortage cost is (xt + ut – wt)2 Unit ordering cost is 1, i.e. gt(xt, ut, wt) = ut + (xt + ut – wt)2. N = 3 and the terminal cost gN+1(xN+1) = 0 Demand : P(wt = 0) = 0.1, P(wt = 1) = 0.7, P(wt = 2) = 0.2.
61
Example of inventory management
Decision Epochs T = {0, 1, 2, …, N} Set of states : S = {0, 1, 2} indicating the initial stock Xt Action set : As indicating the possible order quantity Ut A0 = {0, 1, 2}, A1 = {0, 1}, A2 = {0}
Transition probabilities P(j|s,a) (partial): (0,0): 1; (0,1): 0.9, 0.1; (0,2): 0.7, 0.2; rows for (1,0), (1,1), (2,0) …
Cost function C(s,a): (0,0): 3; (0,1): 1.05; (0,2): 1.1; (1,0): 0.8; (1,1): 0.85; (2,0): 0.6.
62
Decision Rules A decision rule prescribes a procedure for action selection in each state at a specified decision epoch. A decision rule can be either Markovian (memoryless) if the selection of action at is based only on the current state st; History dependent if the action selection depends on the past history, i.e. the sequence of state/actions ht = (s1, a1, …, st-1, at-1, st)
63
Decision Rules A decision rule can also be either
Deterministic if the decision rule selects one action with certainty Randomized if the decision rule only specifies a probability distribution on the set of actions.
64
Decision Rules As a result, the decision rules can be:
HR : history dependent and randomized HD : history dependent and deterministic MR : Markovian and randomized MD : Markovian and deterministic
65
Policies A policy specifies the decision rule to be used at every decision epoch. A policy p is a sequence of decision rules, i.e. p = {d1, d2, …, dN-1} A policy is stationary if dt = d for all t. Stationary deterministic and stationary randomized policies are important for infinite horizon Markov decision processes.
66
Example Decision epochs: T = {1, 2, …, N} State : S = {s1, s2}
Actions: As1 = {a11, a12}, As2 = {a21} Costs: Ct(s1, a11) = 5, Ct(s1, a12) = 10, Ct(s2, a21) = -1, terminal costs CN(s1) = CN(s2) = 0 Transition probabilities: pt(s1|s1, a11) = 0.5, pt(s2|s1, a11) = 0.5, pt(s1|s1, a12) = 0, pt(s2|s1, a12) = 1, pt(s1|s2, a21) = 0, pt(s2|s2, a21) = 1 [Diagram: two-state graph with arcs labelled {cost, probability}: a11 {5, .5}, a12 {10, 1}, a21 {-1, 1}]
67
(also called control map)
Example A deterministic Markov policy Decision epoch 1: d1(s1) = a11, d1(s2) = a21 Decision epoch 2: d2(s1) = a12, d2(s2) = a21 One state one action (also called control map) S1 S2 a11 {5, .5} {10, 1} a12 a21 {-1, 1}
68
one probability distribution over actions
Example A randomized Markov policy Decision epoch 1: P1,s1(a11) = 0.7, P1,s1(a12) = 0.3, P1,s2(a21) = 1 Decision epoch 2: P2,s1(a11) = 0.4, P2,s1(a12) = 0.6, P2,s2(a21) = 1 One state, one probability distribution over actions
69
A deterministic history-dependent policy
Example A deterministic history-dependent policy Decision epoch 1: d1(s1) = a11, d1(s2) = a21 Decision epoch 2: history h → d2(h): (s1, a11, s1) → a13; (s1, a12, s1) → infeasible; (s1, a13, s1) → a11; (s2, a21, s1) → infeasible; (*, *, s2) → a21 One history, one action (a third action a13 with {0, 1} is added to the graph)
70
A randomized history-dependent policy
Example A randomized history-dependent policy Decision epoch 1: P1,s1(a11) = 0.6, P1,s1(a12) = 0.3, P1,s1(a13) = 0.1, P1,s2(a21) = 1 Decision epoch 2: a probability distribution over {a11, a12, a13} for each feasible history h ∈ {(s1, a11, s1), (s1, a13, s1)}; the histories (s1, a12, s1) and (s2, a21, s1) are infeasible; at s = s2, select a21
71
Stochastic inventory example revisited
Decision Epochs T = {0, 1, 2, …, N} Set of states : S = {0, 1, 2} indicating the initial stock Xt Action set : As indicating the possible order quantity Ut A0 = {0, 1, 2}, A1 = {0, 1}, A2 = {0} (s,a) P(0|s,a) P(1|s,a) P(2|s,a) (0, 0) 1 (0, 1) 0,9 0,1 (0, 2) 0,7 0,2 (1, 0) (1, 1) (2, 0) (s,a) C(s,a) (0, 0) 3 (0, 1) 1,05 (0, 2) 1,1 (1, 0) 0,8 (1, 1) 0,85 (2, 0) 0,6 Transition probability Cost function
72
Stochastic inventory control policies
State s = inventory at the beginning of a period Action a = order quantity such that s + a ≤ 2 MD : Markovian and deterministic Stationary: {s=0: a=2, s=1: a=1, s=2: a=0} Nonstationary: {(s,a)=(0,2), (1,1), (2,0)} for periods 1 to 5, {(s,a)=(0,1), (1,0), (2,0)} from period 6 on MR : Markovian and randomized Stationary: {s=0: a=2 w.p. 0.5 and a=0 w.p. 0.5, s=1: a=1, s=2: a=0} Nonstationary: {(s,a)=(0, 2 w.p. 0.5 & 0 w.p. 0.5), (1,0), (2,0)} from period 6 on, where w.p. = with probability
73
Stochastic inventory control policies
HD (history dependent and deterministic): for state s = 0, order a = 2 if there were lost sales (s + a - d < 0) in the last two periods, a = 1 if there was demand in the last period, a = 0 if there was no demand in the last period; for s = 1, order a = 1 if there was a lost sale in the last period; for s = 2, a = 0.
HR (history dependent and randomized): for s = 0, order a = 2 if there were lost sales in the last two periods, a = 2 w.p. 0.5 & 0 w.p. 0.5 if there was demand in the last period, a = 1 w.p. 0.3 & 0 w.p. 0.7 if there was no demand in the last period; for s = 1, a = 1 w.p. 0.5 & 0 w.p. 0.5 if there was a lost sale in the last period, a = 0 if there was no demand in the last period; for s = 2, a = 0.
74
Remarks Each Markov policy leads to a discrete-time Markov chain, and the policy can be evaluated by solving the related Markov chain.
75
Remarks
MD : Markovian and deterministic: s=0: a=2, s=1: a=1, s=2: a=0. Transition matrix: each row (from s = 0, 1, 2) is (0.7, 0.2, 0.1).
MR : Markovian and randomized: s=0: a=2 w.p. 0.5 and a=0 w.p. 0.5, s=1: a=1, s=2: a=0. Transition matrix: row s=0 is (0.85, 0.1, 0.05), rows s=1 and s=2 are (0.7, 0.2, 0.1).
Each policy defines a stationary Markov chain.
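A small numerical check of this remark: solve the stationary distribution of the chain induced by the deterministic policy and compute its average cost per period (stage costs and transition row taken from the tables above):

```python
# Evaluating the stationary policy {s=0: order 2, s=1: order 1, s=2: order 0}
# by solving its Markov chain.
import numpy as np

P = np.array([[0.7, 0.2, 0.1],    # after ordering up to 2, the next stock is 0/1/2
              [0.7, 0.2, 0.1],
              [0.7, 0.2, 0.1]])
c = np.array([1.1, 0.85, 0.6])    # expected one-period cost of the chosen action per state

# Stationary distribution pi solves pi P = pi with sum(pi) = 1.
A = np.vstack([P.T - np.eye(3), np.ones(3)])
b = np.array([0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi, pi @ c)                 # approx [0.7, 0.2, 0.1] and average cost per period ~ 1.0
```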
76
Remarks Nonstationary MD : Markovian and deterministic
{(s,a)=(0,2), (1,1), (2,0)} for periods 1 to 2, {(s,a)=(0,1), (1,0), (2,0)} from period 3 on. The resulting Markov chain is nonstationary: every row of the transition matrix is (0.7, 0.2, 0.1) in periods 1 and 2, while from period 3 on the rows from states 0 and 1 become (0.9, 0.1, 0) and the row from state 2 remains (0.7, 0.2, 0.1). Nonstationary MR : Markovian and randomized {(s,a)=(0,2), (1,1), (2,0)} for periods 1 to 2, {(s,a)=(0, 2 w.p. 0.5 & 0 w.p. 0.5), (1,0), (2,0)} from period 3 on
77
Finite Horizon Markov Decision Processes
78
Assumptions Assumption 1: The decision epochs T = {1, 2, …, N}
Assumption 2: The state space S is finite or countable Assumption 3: The action space As is finite for each s Criterion: where PHR is the set of all possible policies.
79
Optimality of Markov deterministic policy
Theorem : Assume S is finite or countable, and that As is finite for each s S. Then there exists a Markovian deterministic policy which is optimal.
80
Optimality equations Theorem : The following value functions
satisfy the following optimality equation: and the action a that minimizes the above term defines the optimal policy.
81
Optimality equations The optimality equation can also be expressed as:
where Q(s,a) is a Q-function used to evaluate the consequence of an action from a state s.
82
Backward induction algorithm
1. Set t = N and initialize the terminal value function for every sN ∈ S. 2. Substitute t-1 for t and compute the following for each st ∈ S 3. Repeat 2 till t = 1.
83
Infinite Horizon discounted Markov decision processes
84
Assumptions Assumption 1: The decision epochs T = {1, 2, …}
Assumption 2: The state space S is finite or countable Assumption 3: The action space As is finite for each s Assumption 4: Stationary costs and transition probabilities; C(s, a) and p(j|s, a) do not vary from decision epoch to decision epoch Assumption 5: Bounded costs: | Ct(s, a) | ≤ M for all a ∈ As and all s ∈ S (to be relaxed)
85
Assumptions Criterion: where 0 < l < 1 is the discount factor
PHR is the set of all possible policies.
86
Discount factor Large discount factor (l close to 1) : long-term optimum
Small discount factor (l close to 0) : short-term or myopic optimum
87
Optimality equations Theorem: Under assumptions 1-5, the following optimal cost function V*(s) exists: and satisfies the following optimality equation: Further, V*(.) is the unique solution of the optimality equation. Moreover, a stationary policy p is optimal iff (if and only if) it attains the minimum in the optimality equation.
88
Computation of optimal policy Value Iteration
Value iteration algorithm: 1. Select any bounded value function V0, let n = 0 2. For each s ∈ S, compute 3. Repeat 2 until convergence. Meaning of Vn
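A sketch of this value iteration for a finite discounted MDP, with the model given as plain dictionaries; the stopping rule shown is the standard epsilon-optimality test:

```python
# Value iteration for a discounted MDP: P[s][a] = list of (next_state, prob),
# C[s][a] = immediate cost, 0 < lam < 1 the discount factor.
def value_iteration(S, A, P, C, lam, eps=1e-8):
    V = {s: 0.0 for s in S}                              # any bounded V0 works
    while True:
        Vn = {s: min(C[s][a] + lam * sum(p * V[j] for j, p in P[s][a]) for a in A[s])
              for s in S}
        # stop when the update is small enough to guarantee an eps-optimal value function
        if max(abs(Vn[s] - V[s]) for s in S) < eps * (1 - lam) / (2 * lam):
            V = Vn
            break
        V = Vn
    policy = {s: min(A[s], key=lambda a: C[s][a] + lam * sum(p * V[j] for j, p in P[s][a]))
              for s in S}
    return V, policy
```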
89
Computation of optimal policy Value Iteration
Theorem: Under assumptions 1-5, Vn converges to V* The stationary policy defined in the value iteration algorithm converges to an optimal policy.
90
Computation of optimal policy Policy Iteration
Policy iteration algorithm: 1. Select an arbitrary stationary policy p0, let n = 0 2. (Policy evaluation) Obtain the value function Vn of policy pn. 3. (Policy improvement) Choose pn+1 = {dn+1, dn+1, …} such that 4. Repeat 2-3 till pn+1 = pn.
91
Computation of optimal policy Policy Iteration
Policy evaluation: For any stationary deterministic policy p = {d, d, …}, its value function is the unique solution of the following equation:
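A sketch of policy iteration in which the policy evaluation step solves the linear system (I - l·Pd)V = Cd exactly; states are assumed to be indexed 0, …, n-1 and P[s][a] is a vector of transition probabilities:

```python
# Policy iteration: exact evaluation by a linear solve, then greedy improvement.
import numpy as np

def policy_iteration(n_states, A, P, C, lam):
    """A[s]: list of actions; P[s][a]: array of transition probabilities; C[s][a]: cost."""
    d = {s: A[s][0] for s in range(n_states)}            # arbitrary initial policy
    while True:
        Pd = np.array([P[s][d[s]] for s in range(n_states)])
        Cd = np.array([C[s][d[s]] for s in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - lam * Pd, Cd)        # policy evaluation
        d_new = {s: min(A[s], key=lambda a: C[s][a] + lam * np.dot(P[s][a], V))
                 for s in range(n_states)}                          # policy improvement
        if d_new == d:
            return V, d
        d = d_new
```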
92
Computation of optimal policy Policy Iteration
Theorem: The value functions Vn generated by the policy iteration algorithm are such that Vn+1 ≤ Vn. Further, if Vn+1 = Vn, then Vn = V*.
93
Computation of optimal policy Linear programming
Recall the optimality equation The optimal value function can be determined by the following linear programme with a(s) > 0 and Σs a(s) = 1:
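A sketch of this linear programme using scipy.optimize.linprog: the constraint V(s) ≤ C(s,a) + l·Σj p(j|s,a)V(j) gives one row per state-action pair, and the weights a(s) default to the uniform distribution:

```python
# LP for the discounted-cost optimality equation (cost minimization):
#   maximize sum_s alpha(s) V(s)  s.t.  V(s) <= C(s,a) + lam * sum_j p(j|s,a) V(j).
# linprog minimizes, so the objective is negated; V is unbounded in sign.
import numpy as np
from scipy.optimize import linprog

def mdp_lp(n, A, P, C, lam, alpha=None):
    alpha = np.full(n, 1.0 / n) if alpha is None else alpha
    A_ub, b_ub = [], []
    for s in range(n):
        for a in A[s]:
            row = -lam * np.asarray(P[s][a], dtype=float)
            row[s] += 1.0                        # coefficient of V(s)
            A_ub.append(row)
            b_ub.append(C[s][a])
    res = linprog(-alpha, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n, method="highs")
    return res.x                                 # optimal value function V*
```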
94
Computation of optimal policy Linear programming
Dual linear program 1/ An optimal basic solution x* gives a deterministic optimal policy. 2/ x(s, a) = total discounted joint probability, under the initial-state distribution a, that the system occupies state s and chooses action a 3/ The dual linear program extends to the constrained model with an upper limit C on a total discounted cost, i.e.
95
Extension to Unbounded Costs
Theorem 1. Under the condition C(s, a) ≥ 0 (or C(s, a) ≤ 0) for all states s and control actions a, the optimal cost function V*(s) among all stationary deterministic policies satisfies the optimality equation Theorem 2. Assume that the set of control actions is finite. Then, under the condition C(s, a) ≥ 0 for all states s and control actions a, we have where VN(s) is the solution of the value iteration algorithm with V0(s) = 0. Implication of Theorem 2 : The optimal cost can be obtained as the limit of value iteration, and the optimal stationary policy can also be obtained in the limit.
96
Example Consider a computer system consisting of M different processors. Using processor i for a job incurs a finite cost Ci with C1 < C2 < ... < CM. When we submit a job to this system, processor i is assigned to our job with probability pi. At this point we can (a) decide to go with this processor or (b) choose to hold the job until a lower-cost processor is assigned. The system periodically returns to our job and assigns a processor in the same way. Waiting until the next processor assignment incurs a fixed finite cost c. Question: How do we decide whether to go with the processor currently assigned to our job or to wait for the next assignment? Suggestions: The state definition should include all information useful for the decision. The problem belongs to the so-called stochastic shortest path problems.
97
Why does it work: Preliminary
Policy p → value function (cost minimization) Without loss of generality, 0 ≤ C(s, a) ≤ M (otherwise transform the costs by C'(s, a) = C(s, a) + M if | C(s, a) | ≤ M)
98
Why does it work: DP & optimality equation
DP (Dynamic Programming) Optimality equation
99
Why does it work: DP & optimality equation
DP operator T Contraction of the DP operator
100
Why does it work : DP convergence
Lemma 1: If 0 ≤ C(s,a) ≤ M, then VN(s) is monotone converging and limN→∞ VN(s) = V*(s). This property guarantees the existence of V*(s). Proof. Part one is due to VN(s) ≤ VN+1(s) and VN(s) ≤ M/(1-l). Due to C(s,a) ≤ M, Due to C(s,a) ≥ 0, Taking the min on both sides of the inequalities,
101
Why does it work : convergence of value iteration
Lemma 2: If 0 ≤ C(s,a) ≤ M, then for any bounded function f, limN→∞ TN(f)(s) = V*(s) and limN→∞ TpN(f)(s) = Vp(s). Similarly, limN→∞ TpN(f)(s) = Vp(s)
102
Why does it work : optimality equation
Theorem 1: If 0 ≤ C(s,a) ≤ M, V*(s) is the unique bounded solution of the optimality equation. Moreover, a stationary policy p is optimal iff p(s) is a minimizer of the right-hand term.
103
Why does it work : optimality equation
Theorem A: If 0 ≤ C(s,a) ≤ M, V*(s) is the unique bounded solution of the optimality equation. Moreover, a stationary policy p is optimal iff p(s) is a minimizer of the right-hand term.
105
Why does it work : convergence of policy iteration
Theorem B: The value functions Vn generated by the policy iteration algorithm are such that Vn+1 ≤ Vn.
106
Infinite Horizon average cost Markov decision processes
108
Assumptions Assumption 1: The decision epochs T = {1, 2, …}
Assumption 2: The state space S is finite Assumption 3: The action space As is finite for each s Assumption 4: Stationary costs and transition probabilities; C(s, a) and p(j|s, a) do not vary from decision epoch to decision epoch Assumption 5: Bounded costs: | Ct(s, a) | ≤ M for all a ∈ As and all s ∈ S Assumption 6: The Markov chain corresponding to any stationary deterministic policy contains a single recurrent class (unichain).
109
Assumptions Criterion: where PHR is the set of all possible policies.
110
Optimal policy Main Theorem: Under Assumptions 1-6,
There exists an optimal stationary deterministic policy. There exists a real g and a value function h(s) that satisfy the following optimality equation: For any solutions (g, h) and (g’, h’) of the optimality equation: (a) g = g’ is the optimal average reward; h(s) = h’(s) + k (translation closure); Any maximizer of the optimality equation is an optimal policy.
111
Relation between discounted and average cost MDP
It can be shown that (see the next slides) the discounted value function decomposes into an average-cost term plus a differential cost for any given reference state x0.
112
Relation between discounted and average cost MDP
where x0 = any given reference state h(s) : differential reward/cost (starting from s vs x0)
113
Relation between discounted and average cost MDP
Why: this holds if the limits are interchangeable. If the discounted policy converges to the average cost policy: Blackwell optimality
114
Computation of the optimal policy by LP
Recall the optimality equation: This leads to the following LP for optimal policy computation Remarks: Value iteration and policy iteration can also be extended to the average cost case.
115
Computation of optimal policy Value Iteration
1. Select any bounded value function h0 with h0(s0) = 0, let n = 0 2. For each s ∈ S, compute 3. Repeat 2 until convergence.
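A sketch of this relative value iteration, normalizing by the value at the reference state s0 at each pass (the model is given as dictionaries; s0 is any fixed state):

```python
# Relative value iteration for the average-cost optimality equation (unichain case):
# iterate h <- T h - (T h)(s0); the subtracted constant estimates the average cost g.
def relative_value_iteration(S, A, P, C, s0, eps=1e-9, max_iter=100000):
    h = {s: 0.0 for s in S}                              # h(s0) = 0 throughout
    g = 0.0
    for _ in range(max_iter):
        Th = {s: min(C[s][a] + sum(p * h[j] for j, p in P[s][a]) for a in A[s]) for s in S}
        g = Th[s0]                                       # current estimate of the average cost
        h_new = {s: Th[s] - g for s in S}
        if max(abs(h_new[s] - h[s]) for s in S) < eps:
            return g, h_new
        h = h_new
    return g, h
```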
116
Computation of optimal policy : Policy Iteration
Select any policy p0, let n =0 Policy evaluation: determine gp = stationary expected reward and solve Policy improvement: Set n := n+1 and repeat 2-3 till convergence.
117
Extensions to unbounded cost
Theorem. Assume that the set of control actions is finite. Suppose that there exists a finite constant L and some state x0 such that |Vl(x) - Vl(x0)| ≤ L for all states x and all l ∈ (0,1). Then, for some sequence {ln} converging to 1, the following limits exist and satisfy the optimality equation. Easy extension to policy iteration. More conditions: Sennott, L.I. (1999), Stochastic Dynamic Programming and the Control of Queueing Systems, New York: Wiley.
118
Why does it work : convergence of policy iteration
Theorem: If all policies generated by policy iteration are unichains, then gn+1 ≥ gn.
119
Continuous time Markov decision processes
120
Assumptions Assumption 1: The decision epochs T = R+
Assumption 2: The state space S is finite Assumption 3: The action space As is finite for each s Assumption 4: Stationary cost rates and transition rates; C(s, a) and m(j |s, a) do not vary from decision epoch to decision epoch
121
Assumptions Criterion:
122
Example Consider a system with one machine producing one product. The processing time of a part is exponentially distributed with rate p. Demands arrive according to a Poisson process of rate d. State Xt = stock level, Action : at = make or rest. [Transition diagram: 'make' moves the stock up at rate p, demand moves it down at rate d]
123
Uniformization Any continuous-time Markov chain can be converted to a discrete-time chain through a process called « uniformization ». Each continuous-time Markov chain is characterized by the transition rates mij of all possible transitions. The sojourn time Ti in each state i is exponentially distributed with rate m(i) = Σj≠i mij, i.e. E[Ti] = 1/m(i). Transitions out of different states are not synchronized and occur asynchronously, at rates depending on m(i).
124
Uniformization In order to synchronize (uniformize) the transitions at the same pace, we choose a uniformization rate g ≥ max{m(i)}. In the « uniformized » Markov chain, transitions occur only at instants generated by a common Poisson process of rate g (also called the standard clock), with state-transition probabilities pij = mij / g and pii = 1 - m(i)/g, where the self-loop transitions correspond to fictitious events.
125
Uniformization of a two-state CTMC with rates a (S1 → S2) and b (S2 → S1):
Step 1: Determine the rate of each state: m(S1) = a, m(S2) = b
Step 2: Select a uniformization rate g ≥ max{m(i)}
Step 3: Add self-loop transitions to the states of the CTMC (rates g-a at S1 and g-b at S2)
Step 4: Derive the corresponding uniformized DTMC (transition probabilities a/g and 1-a/g from S1, b/g and 1-b/g from S2)
126
Uniformization Rates associated to the states: m(0,0) = l1+l2, m(1,0) = m1+l2, m(0,1) = l1+m2, m(1,1) = m1
127
Uniformization For a Markov decision process, the uniformization rate should be such that g ≥ m(s, a) = Σj∈S m(j|s, a) for all states s and all possible control actions a. The state-transition probabilities of the uniformized Markov decision process become: p(j|s, a) = m(j|s, a)/g for j ≠ s, p(s|s, a) = 1 - Σj m(j|s, a)/g
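A small sketch of this uniformization step, turning rates m(j|s,a) into transition probabilities with a fictitious self-loop (the numerical rates below are illustrative, not the slides' data):

```python
# Uniformization: convert transition rates m(j|s,a) into probabilities p(j|s,a)
# of the uniformized chain; the slack 1 - sum_j m(j|s,a)/gamma becomes a self-loop.
def uniformize(rates, gamma=None):
    """rates[s][a] = dict {j: m(j|s,a)}. gamma must satisfy gamma >= sum_j m(j|s,a)."""
    if gamma is None:
        gamma = max(sum(out.values()) for acts in rates.values() for out in acts.values())
    probs = {}
    for s, acts in rates.items():
        probs[s] = {}
        for a, out in acts.items():
            p = {j: m / gamma for j, m in out.items()}
            p[s] = p.get(s, 0.0) + 1.0 - sum(out.values()) / gamma   # fictitious self-loop
            probs[s][a] = p
    return gamma, probs

# Illustrative two-state fragment loosely inspired by the make-or-rest machine (p = 2, d = 1).
rates = {0: {"make": {1: 2.0}, "rest": {}},
         1: {"make": {2: 2.0, 0: 1.0}, "rest": {0: 1.0}}}
print(uniformize(rates))
```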
128
Uniformized Markov decision process
Uniformization [Original MDP: from each stock level, 'make' raises the stock at rate p and demand lowers it at rate d] Uniformized Markov decision process at rate g = p+d: 'make' raises the stock with probability p/g and demand lowers it with probability d/g; 'not make' keeps the stock with a self-loop of probability p/g, while demand lowers it with probability d/g.
129
Uniformization Under the uniformization,
a sequence of discrete decision epochs T1, T2, … is generated where Tk+1 – Tk = EXP(g). The discrete-time markov chain describes the state of the system at these decision epochs. All criteria can be easily converted. continuous cost C(s,a) per unit time fixed cost k(s,a, j) fixed cost K(s,a) (s,a) j T0 T1 T2 T3 EXP(g) Poisson process at rate g
130
Cost function conversion for the uniformized Markov chain
Discounted cost of a stationary policy p (only with continuous cost): State change & action taken only at Tk Mutual independence of (Xk, ak) and event-clocks (Tk, Tk+1) Tk is a Poisson process at rate g Average cost of a stationary policy p (only with continuous cost):
131
Cost function conversion for the uniformized Markov chain
Tk is a Poisson process at rate g, i.e. Tk = t1 + … + tk, ti = EXP(g)
132
Optimality equation: discounted cost case
Equivalent discrete time discounted MDP a discrete-time Markov chain with uniform transition rate g a discount factor l = g/(g+b) a stage cost given by the sum of continuous cost C(s, a)/(b+g), K(s, a) for fixed cost incurred at T0 lk(s,a,j)p(j|s,a) for fixed cost incurred at T1 Optimality equation
133
Optimality equation: average cost case
Equivalent discrete time average-cost MDP a discrete-time Markov chain with uniform transition rate g a stage cost given by C(s, a)/g whenever a state s is entered and an action a is chosen. Optimality equation for average cost per uniformized period: where g = average cost/uniformized period, gg =average cost/time unit, h(s) = differential cost with respect to reference state s0 and h(s0) = 0
134
Optimality equation: average cost case
Multiplying both sides of the optimality equation by g leads to: Alternative optimality equation 1: where G = g·g is the optimal average cost per time unit and H(s) is the modified differential cost with H(s) = g(V(s) – V(s0))
135
Optimality equation: average cost case
Alternative optimality equation 2: where h(s) = differential cost with respect to a reference state s0 m(j|s,a) = transition rate from (s,a) to j with i.e. m(j|s,a) = gp(j|s,a) and m(s|s,a) = gp(s|s,a) - g Hamilton-Jacobi-Bellman equation
136
Example (continue) Uniformize the Markov decision process with rate g = p+d The optimality equation:
137
Example (continue) From the optimality equation:
If V(s) is convex, then there exists a K such that : V(s+1) – V(s) > 0 and the decision is not to produce for all s ≥ K, and V(s+1) – V(s) ≤ 0 and the decision is to produce for all s < K
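A numerical sketch supporting this threshold structure: value iteration on a truncated, uniformized version of the make-to-stock example, with illustrative rates, costs and discount rate (following the discounted conversion of the earlier slides):

```python
# Value iteration for the uniformized make-to-stock example: produce or not.
p, d, h, b, beta = 2.0, 1.0, 1.0, 10.0, 0.1    # illustrative rates and costs
gamma = p + d                                   # uniformization rate
lam = gamma / (gamma + beta)                    # discount factor of the uniformized chain
lo, hi = -20, 20                                # truncated stock levels for the sketch
states = range(lo, hi + 1)
cost = lambda s: h * max(s, 0) + b * max(-s, 0)
clip = lambda s: max(lo, min(hi, s))

V = {s: 0.0 for s in states}
for _ in range(2000):                           # synchronous value iteration
    V = {s: cost(s) / (gamma + beta) + lam * min(
            (p * V[clip(s + 1)] + d * V[clip(s - 1)]) / gamma,    # produce
            (p * V[s] + d * V[clip(s - 1)]) / gamma)              # do not produce
         for s in states}

policy = {s: "make" if V[clip(s + 1)] <= V[s] else "rest" for s in states}
K = min((s for s in states if policy[s] == "rest"), default=hi)
print("hedging point (first stock level where production stops):", K)
```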
138
Example (continue) Convexity proved by value iteration
Proof by induction. V0 is convex. If Vn is convex with minimum at s = K, then Vn+1 is convex.
139
Example (continue) Convexity proved by value iteration
Assume Vn is convex with minimum at s = K. Vn+1 is convex if ΔU(s) ≤ ΔU(s+1), where ΔU(s) = U(s+1) – U(s). This holds for s+1 < K-1 and for s > K-1 by induction. The proof is established by ΔU(K-2) ≤ ΔU(K-1), which follows from ΔVn(K-1) ≤ 0, and ΔU(K-1) ≤ ΔU(K), which follows from 0 ≤ ΔVn(K).
140
Condition for optimality of monotone policy (first order properties)
141
Monotone policy Monotone policy
p(s) nondecreasing or nonincreasing in s Question: when does there exist an optimal monotone policy? Answers: monotonicity (addressed here) and convexity (addressed in the previous example) Only the finite-horizon case is considered, but the results can be extended to the discounted and average cost cases.
142
Submodularity and Supermodularity
A function g(x, y) is said to be supermodular if, for x+ ≥ x- and y+ ≥ y-, g(x+, y+) + g(x-, y-) ≥ g(x+, y-) + g(x-, y+). It is said to be submodular if the reverse inequality holds. Supermodularity ⇔ increasing differences, i.e. g(x+, y) - g(x-, y) is nondecreasing in y; Submodularity ⇔ decreasing differences.
143
Submodularity and Supermodularity
Supermodular functions: Property 1: If g(x,y) is supermodular (submodular), then f(x) = the minimum or maximum selection from the set argmaxy g(x,y) of maximizers is monotone nondecreasing (nonincreasing) in x.
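A small numerical illustration of Property 1 with an illustrative supermodular function (not taken from the slides): the largest maximizer in y is nondecreasing in x.

```python
# For a supermodular g, the largest element of argmax_y g(x, y) is nondecreasing in x.
def largest_maximizer(g, ys):
    return lambda x: max(y for y in ys if g(x, y) == max(g(x, yy) for yy in ys))

g = lambda x, y: x * y - (y - 2) ** 2      # x*y has increasing differences => supermodular
f = largest_maximizer(g, range(6))
print([f(x) for x in range(5)])            # [2, 3, 3, 4, 4], nondecreasing in x
```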
144
Dynamic Programming Operator
DP operator T equivalently
145
DP Operator: monotonicity preservation
Property 2: T[Vt(s)] is nondecreasing (nonincreasing) in s if r(s,a) is nondecreasing (nonincreasing) in s for all a, and the tail transition probabilities Σ snext ≥ k p(snext|s,a) are nondecreasing in s for all a and k
146
DP Operator: control monotonicity
147
DP Operator: control monotonicity
148
Batch delivery model Customer demand Dt for a product arrives over time. State set S = {0, 1, …}: quantity of pending demand Action set A = {0 = no delivery, 1 = deliver all pending demand} Cost C(s,a) = hs(1-a) + aK, where h = unit holding cost and K = fixed delivery cost Transition snext = s(1-a) + D, where P(D=i) = pi, i = 0, 1, … GOAL: minimize the total cost Submodularity ⇒ a(s) nondecreasing
149
Batch delivery model Min submodular Max supermodular
150
A machine replacement model
Machine deteriorates by a random number I of states per period State set S = {0, 1, …} from best to worst condition Action set A = {1 = replace, 0 = do not replace} Reward r(s,a) = R – h(s(1-a)) – aK, where R = fixed income per period, h(s) = nondecreasing operating cost and K = replacement cost Transition snext = s(1-a) + I, where P(I=i) = pi, i = 0, 1, … GOAL: maximize the total reward Supermodularity ⇒ a(s) nondecreasing
151
A machine replacement model
152
A general framework for value function property analysis
Based on G. Koole, "Structural results for the control of queueing systems using event-based dynamic programming," Queueing Systems 30, 1998.
153
Introduction: event operators
154
Introduction: a single server queue
Exponential server; Poisson arrivals whose admission can be controlled; l: arrival rate; m: service rate, with l+m = 1; c: unit rejection cost; C(x): holding cost of x customers
155
Introduction: discrete-time queue
1: one customer arrival per period; p: geometric service rate; x: queue length before the admission decision and service completion
156
One-dimension models : operators
157
One-dimension models: operators
158
One-dimension models: operators
159
One-dimension models : operators
160
One-dimension models : properties
161
One-dimension models : property propagation
162
One-dimension models : property propagation
163
One-dimension models : property propagation
Proof of Lemma 1. Tcosts and Tunif : the results follow directly, as increasingness and convexity are closed under convex combinations. TA(1) : the results follow directly, by replacing x by x + e1 in the inequalities. TFS(1) : certain terms cancel out. TD(1) : increasingness follows as for TA(1), except if x1 = b1; in this case TD(1)f(x) = TD(1)f(x + e1). Also for convexity the only non-trivial case is x1 = b1, which reduces to f(x) ≤ f(x + e1). TMD(1) : roughly the same arguments are used. TAC(1) : … TCD(1) : similar proof as for TAC(1).
164
One-dimension models : property propagation
165
One-dimension models : property propagation
166
a single server queue: l: arrival rate; m: service rate, with l+m = 1;
c: unit rejection cost; C(x): holding cost of x customers
167
discrete-time queue: 1: one customer arrival per period; p: geometric service rate; x: queue length before the admission decision and service completion
168
Production-inventory system
169
Multi-machine production-inventory with preemption
170
Examples of Tenv(i) Control policy keeps its structure but depends on the environment
171
Examples
172
Two-dimension models : operators
173
Two-dimension models : properties
174
Two-dimension models : properties
[Diagrams illustrating the Super, SuperC, Sub, SubC, Conv and Inc inequalities]
Super(i,j) + SuperC(i,j) ⇒ Conv(i) + Conv(j)
Super(i,j) + Conv(i) + Conv(j) ⇒ SubC(i,j)
Sub(i,j) + SubC(i,j) ⇒ Conv(i) + Conv(j)
Sub(i,j) + Conv(i) + Conv(j) ⇒ SuperC(i,j)
175
2-dimension models : property propagation
176
2-dimension models : property propagation
177
2-dimension models : property propagation
178
2-dimension models : property propagation
179
2-dimension models : property propagation
180
2-dimension models : property propagation
181
2-dimension models : property propagation
182
2-dimension models Control structure under Super(1, 2) + SuperC(1, 2)
Super(1, 2) + SuperC(1, 2) ⇒ Conv(1) + Conv(2)
Conv(1) ⇒ TAC(1) : threshold admission in x1; Conv(2) ⇒ TAC(2) : threshold admission in x2
Super(1, 2) ⇒ TAC(1) is of threshold form in x2 and TAC(2) is of threshold form in x1
SuperC(1, 2) ⇒ for TAC(1) : rejection in x+e2 implies rejection in x+e1; for TAC(2) : rejection in x+e1 implies rejection in x+e2
TAC(1) & TAC(2) : a decreasing switching curve below which customers are admitted.
TCD(1) and TCD(2) can be seen as dual to TAC(1) and TAC(2), with corresponding results.
183
2-dimension models Control structure under Super(1, 2) + SuperC(1, 2)
TR: an increasing switching curve above (below) which customers are assigned to queue 1 (queue 2) TCJ(1,2): the optimal control is increasing in x1 and decreasing in x2, i.e. an increasing switching curve below which jockeying occurs. TCJ(2,1): an increasing switching curve above which jockeying occurs.
184
2-dimension models : property propagation
185
2-dimension models Control structure under Super(1, 2)
Admission control for class 1 is decreasing in the number of class-2 customers, and vice versa.
186
2-dimension models : property propagation
187
2-dimension models Control structure under Sub(1, 2) + SubC(1, 2)
Sub(1, 2) + SubC(1, 2) ⇒ Conv(1) + Conv(2)
Conv(i) ⇒ threshold admission rule for TAC(i) in xi
Sub(1, 2) ⇒ TAC(1) is of threshold form in x2 and TAC(2) is of threshold form in x1
SubC(1, 2) ⇒ TAC(1) (TAC(2)) : an increasing switching curve above (below) which customers are admitted. Also the effects of TCD(i) amount to balancing the two queues in some sense; the two queues "attract" each other. TACF(1,2) has a decreasing switching curve below which customers are admitted.
188
2-dimension models Control structure under Sub(1, 2) + SubC(1, 2)
[Switching-curve diagrams for TAC(1) and TAC(2): regions where customers are admitted to queue 1 / queue 2 and where they are rejected]
189
Examples: a queue served by two servers
A common queue served by two servers (1= fast, 2=slow) Poisson arrivals to the queue Exponential servers but with different mean service times Goal: minimizes the mean sojourn time
190
Examples: a queue served by two servers
[Switching curve for TCJ(1,2): the region in which customers are sent to the slow server]
191
Examples: production line with Poisson demand
M1 feeds buffer 1, M2 transfers parts from buffer 1 to buffer 2 Poisson demand is filled from buffer 2 Production rate control of both machines
192
Examples: tandem queues with Poisson demand
[Tandem queue diagram: M1 produces into queue 1, M2 produces into queue 2, Poisson demand]
193
Examples: admission control of tandem queues
Two tandem queues: queue 1 feeds queue 2 Convex holding cost hi(xi) Service rate control of both queues Admission control of arrivals to queue 1
194
Examples: cyclic tandem queues
Two cyclic queues: queue 1 feeds queue 2 and vice versa Convex holding cost hi(xi) Service rate control of both queues
195
Multi-machine production-inventory with non preemption
196
Examples: stochastic knapsack
Packing a knapsack of integer volume B with objects from 2 different classes, arriving according to Poisson processes, in order to maximize profit
197
Examples