Dynamic Programming Applications Lecture 6 Infinite Horizon.


1 Dynamic Programming Applications Lecture 6 Infinite Horizon

2 Infinite horizon
Rules of the game: infinite number of stages; stationary system; finite number of states.
Why do we care? It is a good approximation for problems with many stages; the analysis is elegant and insightful; the optimal policy is simple to implement.
Stationary policy: π = {μ, μ, μ, …}.

3 Total cost problems
J_π(x_0) = lim_{N→∞} E{ Σ_{k=0}^{N-1} α^k g(x_k, μ_k(x_k), w_k) },  J*(x_0) = min_π J_π(x_0).
Stochastic shortest paths (SSP): α = 1. Objective: reach a cost-free termination state.
Discounted problems with bounded cost per stage: α < 1 and |g| < M, so J_π(x_0) < M/(1−α) is well defined (e.g. if the state and control sets are finite).
Discounted problems with unbounded cost per stage: α ≤ 1. Hard: we don't do it here.

4 Average cost problems
When J_π(x_0) = ∞ for all feasible policies π and states x_0, use the average cost per stage: lim_{N→∞} (1/N) E{ Σ_{k=0}^{N-1} g(x_k, μ_k(x_k), w_k) }; this is well defined and finite (treated later).

5 Preview
Convergence: J*(x) = lim_{N→∞} J*_N(x), for all x.
Limiting solution (Bellman equation): J*(x) = min_u E_w{ g(x,u,w) + J*(f(x,u,w)) }.
Optimal stationary policy: the μ(x) that attains the minimum above.

6 SSP
Finite* constraint set U(i) for all i.
Zero-cost termination state 0: p_00(u) = 1, g(0,u,0) = 0 for all u ∈ U(0).
Special cases: deterministic shortest paths; finite-horizon problems.

7 Shorthand
J = (J(1), …, J(n)); J(0) = 0.
(TJ)(i) = min_{u∈U(i)} Σ_j p_ij(u) ( g(i,u,j) + J(j) ).
TJ: optimal cost-to-go for the one-stage problem with cost per stage g and terminal cost J.
(T_μ J)(i) = Σ_j p_ij(μ(i)) ( g(i,μ(i),j) + J(j) ).
T_μ J: cost-to-go under μ for the one-stage problem with cost per stage g and terminal cost J.
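A minimal sketch (not from the lecture) of how T and T_μ might be coded. The data layout is an assumption made for illustration: P[u] and G[u] are (n+1)×(n+1) NumPy arrays of transition probabilities p_ij(u) and stage costs g(i,u,j), with row/column 0 for the terminal state; U[i] lists the feasible controls at state i; J is a length-n vector over states 1,…,n with J(0) = 0 implicit.

```python
import numpy as np

def T(J, P, G, U):
    """(TJ)(i) = min over u in U(i) of sum_j p_ij(u) * (g(i,u,j) + J(j))."""
    n = len(J)
    Jfull = np.concatenate(([0.0], np.asarray(J, dtype=float)))   # prepend J(0) = 0
    return np.array([min(P[u][i] @ (G[u][i] + Jfull) for u in U[i])
                     for i in range(1, n + 1)])

def T_mu(J, mu, P, G):
    """(T_mu J)(i): same as T, but with the control at state i fixed to mu[i]."""
    n = len(J)
    Jfull = np.concatenate(([0.0], np.asarray(J, dtype=float)))
    return np.array([P[mu[i]][i] @ (G[mu[i]][i] + Jfull) for i in range(1, n + 1)])
```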

8 Shorthand
T_μ J = g_μ + P_μ J, where g_μ(i) = Σ_j p_ij(μ(i)) g(i,μ(i),j) and P_μ = ( p_ij(μ(i)) ) for i, j = 1,…,n (state 0 excluded).
TJ = g_μ̄ + P_μ̄ J, where μ̄ is a policy attaining the minimum in TJ, g_μ̄(i) = Σ_j p_ij(μ̄(i)) g(i,μ̄(i),j) and P = P_μ̄.

9 Value iteration
T^k J = T(T^{k−1} J), T^0 J = J.
T^k J: optimal cost-to-go for the k-stage problem with cost per stage g and terminal cost J.
…and similarly for T_μ.
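A minimal value-iteration sketch under the same assumed data layout (P[u], G[u], U[i]) as the earlier sketch: it iterates J_{k+1} = T J_k from J_0 = 0 and stops when successive iterates are close.

```python
import numpy as np

def value_iteration(P, G, U, n, tol=1e-9, max_iter=100_000):
    """Repeatedly apply the T mapping until the sup-norm change is below tol."""
    J = np.zeros(n)
    for _ in range(max_iter):
        Jfull = np.concatenate(([0.0], J))                      # J(0) = 0
        Jnew = np.array([min(P[u][i] @ (G[u][i] + Jfull) for u in U[i])
                         for i in range(1, n + 1)])
        if np.max(np.abs(Jnew - J)) < tol:
            return Jnew
        J = Jnew
    return J
```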

10 T properties
Monotonicity lemma: if J ≤ J′ and μ is stationary, then T^k J ≤ T^k J′ and T_μ^k J ≤ T_μ^k J′.
Subadditivity: if μ is stationary, e = (1,1,…,1) and r > 0, then T^k(J + re)(i) ≤ T^k J(i) + r and T_μ^k(J + re)(i) ≤ T_μ^k J(i) + r.

11 Properness and assumptions
Define: a stationary policy μ is proper if the terminal state is reachable from every state with probability > 0 (in at most n stages).
Assumptions:
1. There exists at least one proper μ.
2. The cost-to-go J_μ(i) of every improper μ is infinite for some i.
2′. Expected cost per stage: g(i,u) = Σ_j p_ij(u) g(i,u,j) ≥ 0 for all i ≠ 0 and u ∈ U(i).
What do these mean in the deterministic case?

12 Alternative assumption
(3) There exists an integer m such that, for any policy π and initial state x, the probability of reaching the terminal state from x within m stages under π is non-zero.
This is a stronger assumption than 1 & 2.

13 Main theorem
Under assumptions 1 and 2 (or under 3):
1. lim_{k→∞} T^k J = J*, for every vector J.
2. J* = TJ*, and J* is the only solution of J = TJ.
3. For any proper stationary policy μ and every vector J, lim_{k→∞} T_μ^k J = J_μ; moreover J_μ = T_μ J_μ and J_μ is the only solution.
4. A stationary μ is optimal iff T_μ J* = TJ*.

14 Lemma
Suppose all stationary policies are proper. Then there exists a weight vector ξ > 0 such that, for all stationary μ, T and T_μ are contraction mappings with respect to the weighted max-norm ||·||_ξ.
Weighted max-norm: ||J||_ξ = max_i |J(i)| / ξ(i).
Contraction mapping: ||TJ − TJ′||_ξ ≤ β ||J − J′||_ξ for some β < 1.

15 How to find J* and μ*?
Value iteration, policy iteration, and variants.

16 Asynchronous value iteration
Start with an arbitrary J_0. At stage k: pick a state i_k and update J_{k+1}(i_k) = (TJ_k)(i_k), leaving everything else unchanged: J_{k+1}(i) = J_k(i) for i ≠ i_k.
Assume each state is chosen as i_k infinitely often. Then J_k → J*.
This is also called the Gauss-Seidel method.
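A minimal sketch of the Gauss-Seidel variant under the same assumed layout: states are swept in round-robin order (so each i_k is chosen infinitely often) and every update immediately uses the freshest values of J.

```python
import numpy as np

def gauss_seidel_vi(P, G, U, n, sweeps=1000):
    """Asynchronous value iteration: update one state at a time, cycling through all states."""
    J = np.zeros(n)
    for _ in range(sweeps):
        for i in range(1, n + 1):                               # i_k cycles over 1..n
            Jfull = np.concatenate(([0.0], J))                  # freshest values, J(0) = 0
            J[i - 1] = min(P[u][i] @ (G[u][i] + Jfull) for u in U[i])
    return J
```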

17 Decomposition
Suppose S can be partitioned into S_1, S_2, …, S_M so that if i ∈ S_m then, under any policy, the successor state is j = 0 or j ∈ S_{m−k} for some k with 0 ≤ k ≤ m−1.
Then the problem decomposes into M SSPs solved sequentially, each using the optimal solutions of the preceding subproblems.
If k > 0 above (successors always lie in strictly lower-indexed subsets), then the Gauss-Seidel method that iterates on states in order of their membership in the S_m needs only one iteration per state to reach the optimum (e.g. finite-horizon problems).

18 Policy iteration
Start with a given policy μ_k.
Policy evaluation step: compute J_{μ_k}(i) by solving the linear system (with J(0) = 0): J = g_{μ_k} + P_{μ_k} J.
Policy improvement step: compute the new policy μ_{k+1} as a solution of T_{μ_{k+1}} J_{μ_k} = T J_{μ_k}, that is, μ_{k+1}(i) = arg min_{u∈U(i)} Σ_j p_ij(u) ( g(i,u,j) + J_{μ_k}(j) ).
Terminate iff J_{μ_k} = J_{μ_{k+1}} (no improvement); μ_k is then optimal.
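A minimal policy-iteration sketch under the same assumed layout. It terminates when the greedy policy stops changing, which, with consistent tie-breaking, matches the slide's test J_{μ_k} = J_{μ_{k+1}}.

```python
import numpy as np

def policy_iteration(mu0, P, G, U, n):
    """Alternate policy evaluation (a linear solve) and greedy policy improvement."""
    mu = dict(mu0)
    while True:
        # Evaluation: solve J = g_mu + P_mu J over states 1..n (J(0) = 0, so column 0 drops out).
        P_mu = np.array([P[mu[i]][i][1:] for i in range(1, n + 1)])
        g_mu = np.array([P[mu[i]][i] @ G[mu[i]][i] for i in range(1, n + 1)])
        J = np.linalg.solve(np.eye(n) - P_mu, g_mu)
        # Improvement: greedy policy with respect to the evaluated J.
        Jfull = np.concatenate(([0.0], J))
        mu_new = {i: min(U[i], key=lambda u: P[u][i] @ (G[u][i] + Jfull))
                  for i in range(1, n + 1)}
        if mu_new == mu:            # no improvement possible: mu is optimal
            return J, mu
        mu = mu_new
```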

19 Policy iteration theorem
The algorithm generates an improving sequence of proper policies, that is, for all i and k: J_{μ_{k+1}}(i) ≤ J_{μ_k}(i), and it terminates with an optimal policy.

20 Multistage look-ahead
Start at state i, make m subsequent decisions and incur the corresponding costs, end up in state j, and pay the terminal cost J_μ(j).
Multistage policy iteration terminates with an optimal policy under the same conditions.

21 Value vs. policy iteration
In general, value iteration requires an infinite number of iterations to obtain the optimal cost-to-go.
Policy iteration always terminates after finitely many iterations.
A value-iteration step is a cheaper operation than a policy-iteration step (which requires solving a linear system).
Idea: combine them.

22 Modified policy iteration
Let J_0 be such that TJ_0 ≤ J_0, and generate J_1, J_2, … and μ_0, μ_1, μ_2, … such that T_{μ_k} J_k = T J_k and J_{k+1} = (T_{μ_k})^{m_k} (J_k).
If m_k = 1 for all k: value iteration.
If m_k = ∞ for all k: policy iteration, with the evaluation step done iteratively via value iteration.
Heuristic choices of m_k > 1 are used in practice, keeping in mind that T_μ J is much cheaper to compute than TJ.
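A minimal sketch of modified policy iteration under the same assumed layout. For simplicity m_k is held constant here, and J0 must be supplied by the caller so that T J_0 ≤ J_0 (e.g. J_0 = J_μ for some proper μ), as the slide requires.

```python
import numpy as np

def modified_policy_iteration(J0, P, G, U, n, m_k=5, iters=500):
    """Greedy policy step (so T_mu_k J_k = T J_k) followed by m_k applications of T_mu_k."""
    J = np.asarray(J0, dtype=float).copy()
    for _ in range(iters):
        Jfull = np.concatenate(([0.0], J))
        mu = {i: min(U[i], key=lambda u: P[u][i] @ (G[u][i] + Jfull))
              for i in range(1, n + 1)}
        for _ in range(m_k):                                    # J_{k+1} = (T_mu_k)^{m_k} J_k
            Jfull = np.concatenate(([0.0], J))
            J = np.array([P[mu[i]][i] @ (G[mu[i]][i] + Jfull) for i in range(1, n + 1)])
    return J
```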

23 Asynchronous policy iteration
Generate a sequence of costs-to-go J_k and stationary policies μ_k. Given (J_k, μ_k): select a subset of states S_k and generate the new (J_{k+1}, μ_{k+1}) by alternating between the two updates:
a) Value update: J_{k+1}(i) = (T_{μ_k} J_k)(i) if i ∈ S_k, J_{k+1}(i) = J_k(i) otherwise; and μ_{k+1} = μ_k.
b) Policy update: μ_{k+1}(i) = arg min_{u∈U(i)} Σ_j p_ij(u) ( g(i,u,j) + J_k(j) ) if i ∈ S_k, μ_{k+1}(i) = μ_k(i) otherwise; and J_{k+1} = J_k.
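A minimal sketch under the same assumed layout. The random choice of S_k and of which update to apply is purely illustrative; the convergence conditions of the next slide (T_{μ_0} J_0 ≤ J_0, and both updates executed infinitely often for every state) still have to hold.

```python
import numpy as np
import random

def async_policy_iteration(J0, mu0, P, G, U, n, steps=50_000, seed=0):
    """Keep (J, mu); at each step pick a subset S_k and apply update (a) or (b) on it."""
    rng = random.Random(seed)
    J, mu = np.asarray(J0, dtype=float).copy(), dict(mu0)
    for _ in range(steps):
        S_k = rng.sample(range(1, n + 1), rng.randint(1, n))    # arbitrary nonempty subset
        Jfull = np.concatenate(([0.0], J))
        if rng.random() < 0.5:
            # (a) value update under the current policy, only on S_k; mu unchanged
            for i in S_k:
                J[i - 1] = P[mu[i]][i] @ (G[mu[i]][i] + Jfull)
        else:
            # (b) policy update on S_k; J unchanged
            for i in S_k:
                mu[i] = min(U[i], key=lambda u: P[u][i] @ (G[u][i] + Jfull))
    return J, mu
```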

24 Convergence
If both the value update and the policy update are executed infinitely often for all states, and if the initial conditions J_0 and μ_0 are such that T_{μ_0} J_0 ≤ J_0 (for example, select μ_0 and set J_0 = J_{μ_0}), then J_k converges to J*.

25 Linear programming
Since lim_{k→∞} T^k J = J* for all J, J ≤ TJ implies J ≤ J* = TJ*.
So J* = arg max{ J | J ≤ TJ }, that is:
maximize Σ_i J(i)
subject to J(i) ≤ Σ_j p_ij(u) ( g(i,u,j) + J(j) ), for i = 1,…,n and u ∈ U(i).
Problem: the LP becomes very big when n is big!
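A minimal sketch of this LP using scipy.optimize.linprog, under the same assumed layout as the earlier sketches; since linprog minimizes, the objective Σ_i J(i) is negated.

```python
import numpy as np
from scipy.optimize import linprog

def solve_ssp_lp(P, G, U, n):
    """max sum_i J(i)  s.t.  J(i) <= sum_j p_ij(u) (g(i,u,j) + J(j))  for all i, u in U(i)."""
    A_ub, b_ub = [], []
    for i in range(1, n + 1):
        for u in U[i]:
            row = np.zeros(n)
            row[i - 1] += 1.0
            row -= P[u][i][1:]                   # move sum_{j>=1} p_ij(u) J(j) to the left-hand side
            A_ub.append(row)
            b_ub.append(P[u][i] @ G[u][i])       # expected one-stage cost of (i, u)
    res = linprog(c=-np.ones(n),                 # linprog minimizes, so negate the objective
                  A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n)
    return res.x                                 # J*(1), ..., J*(n)
```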

26 Discounted problems
Let α < 1. No termination state. Proved as a special case of SSP by modifying the definitions and proofs.
(TJ)(i) = min_{u∈U(i)} Σ_j p_ij(u) ( g(i,u,j) + α J(j) ).
(T_μ J)(i) = Σ_j p_ij(μ(i)) ( g(i,μ(i),j) + α J(j) ).
T_μ J = g_μ + α P_μ J.

27 T properties (discounted case)
Monotonicity lemma: if J ≤ J′ and μ is stationary, then T^k J ≤ T^k J′ and T_μ^k J ≤ T_μ^k J′.
α-subadditivity: if μ is stationary and r > 0, then T^k(J + re)(i) ≤ T^k J(i) + α^k r and T_μ^k(J + re)(i) ≤ T_μ^k J(i) + α^k r.

28 Contraction
For any J and J′ and any policy μ, the following contraction properties hold:
||TJ − TJ′||_∞ ≤ α ||J − J′||_∞
||T_μ J − T_μ J′||_∞ ≤ α ||J − J′||_∞
Max-norm: ||J||_∞ = max_i |J(i)|.

29 Convergence theorem
Convert to an SSP: define a new terminal state 0 and transition probabilities P̃(j|i,u) = α P(j|i,u), P̃(0|i,u) = 1 − α.
All policies are proper, so all the previous algorithms and convergence properties apply.
A separate proof is needed for an infinite number of states.
Can be extended to a compact control set with continuous transition probabilities.
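A minimal sketch of this conversion, assuming the discounted problem is given as n×n arrays P[u] and G[u] over states 1,…,n (no terminal state). The output uses the (n+1)×(n+1) layout of the earlier sketches; the expected cost per stage is preserved by charging the expected one-stage cost on every transition.

```python
import numpy as np

def discounted_to_ssp(P, G, alpha):
    """Rescale transitions by alpha and send the leftover 1-alpha mass to an absorbing,
    cost-free terminal state 0."""
    P_ssp, G_ssp = {}, {}
    for u in P:
        n = P[u].shape[0]
        Pt = np.zeros((n + 1, n + 1))
        Pt[0, 0] = 1.0                           # terminal state is absorbing
        Pt[1:, 1:] = alpha * P[u]                # p~(j|i,u) = alpha * p(j|i,u)
        Pt[1:, 0] = 1.0 - alpha                  # p~(0|i,u) = 1 - alpha
        ghat = (P[u] * G[u]).sum(axis=1)         # expected one-stage cost g(i,u)
        Gt = np.zeros((n + 1, n + 1))
        Gt[1:, :] = ghat[:, None]                # charge g(i,u) regardless of the destination
        P_ssp[u], G_ssp[u] = Pt, Gt
    return P_ssp, G_ssp
```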

30 Applications
1. Asset selling with infinite horizon (continued).
2. Inventory with batch processing, infinite horizon: an order is placed at each time t with probability p. Given the current backlog j, the manufacturer can either process the whole batch at a fixed cost K, or postpone and incur a cost of c per unit of backlog. The maximum backlog is n. Which policy minimizes the expected total cost?
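A minimal sketch of problem 2 as a discounted DP solved by value iteration. The slide fixes only p, K, c, and n; the discount factor alpha and the exact timing of order arrivals below are illustrative assumptions, and the resulting policy is typically of threshold type (process once the backlog is large enough).

```python
import numpy as np

def batch_processing(p=0.5, K=5.0, c=1.0, n=10, alpha=0.9, iters=5000):
    """Value iteration over backlog states 0..n; returns the optimal cost and a
    process/postpone decision per backlog level."""
    J = np.zeros(n + 1)
    for _ in range(iters):
        Jnew = np.empty(n + 1)
        for i in range(n + 1):
            nxt = min(i + 1, n)                                   # backlog capped at n (assumption)
            postpone = c * i + alpha * (p * J[nxt] + (1 - p) * J[i])
            process = K + alpha * (p * J[1] + (1 - p) * J[0])     # batch cleared, new order may arrive
            Jnew[i] = min(postpone, process)
        J = Jnew
    decide = ["process" if K + alpha * (p * J[1] + (1 - p) * J[0])
                           <= c * i + alpha * (p * J[min(i + 1, n)] + (1 - p) * J[i])
              else "postpone" for i in range(n + 1)]
    return J, decide
```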

