Dynamic Programming Applications Lecture 6 Infinite Horizon.


DPA62 Infinite horizon
Rules of the game:
- Infinite number of stages
- Stationary system
- Finite number of states
Why do we care?
- Good approximation for problems with many stages
- The analysis is elegant and insightful
- The optimal policy is simple to implement
Stationary policy: $\pi = \{\mu, \mu, \dots\}$

DPA63 Total Cost Problems
$$J_\pi(x_0) = \lim_{N\to\infty} E\Big\{ \sum_{k=0}^{N-1} \alpha^k\, g(x_k, \mu_k(x_k), w_k) \Big\}, \qquad J^*(x_0) = \min_\pi J_\pi(x_0)$$
- Stochastic Shortest Paths (SSP): $\alpha = 1$; objective: reach a cost-free termination state.
- Discounted problems with bounded cost per stage: $\alpha < 1$ and $|g| < M$, so $J_\pi(x_0) < M/(1-\alpha)$ is well defined (e.g. if the state and control sets are finite).
- Discounted problems with unbounded cost per stage: $\alpha \le 1$. Hard; we don't do it here.

DPA64 Average cost problems
When $J_\pi(x_0) = \infty$ for all feasible policies $\pi$ and states $x_0$, use the average cost per stage instead:
$$\lim_{N\to\infty} \frac{1}{N}\, E\Big\{ \sum_{k=0}^{N-1} g(x_k, \mu_k(x_k), w_k) \Big\}$$
This is well defined and finite. LATER.

DPA65 Preview
- Convergence: $J^*(x) = \lim_{N\to\infty} J_N^*(x)$, for all x.
- Limiting solution (Bellman equation): $J^*(x) = \min_u E_w\{\, g(x,u,w) + J^*(f(x,u,w)) \,\}$.
- Optimal stationary policy: the $\mu(x)$ that attains the minimum above.

DPA66 SSP
- Finite* constraint set U(i) for all i.
- Zero-cost termination state 0: $p_{00}(u) = 1$, $g(0,u,0) = 0$ for all $u \in U(0)$.
- Special cases: deterministic shortest paths, finite horizon problems.

DPA67 Shorthand
- $J = (J(1), \dots, J(n))$; $J(0) = 0$.
- $(TJ)(i) = \min_{u \in U(i)} \sum_j p_{ij}(u)\,\big( g(i,u,j) + J(j) \big)$
  $TJ$: optimal cost-to-go for the one-stage problem with cost per stage g and initial cost J.
- $(T_\mu J)(i) = \sum_j p_{ij}(\mu(i))\,\big( g(i,\mu(i),j) + J(j) \big)$
  $T_\mu J$: cost-to-go under $\mu$ for the one-stage problem with cost per stage g and initial cost J.

DPA68 Shorthand
$T_\mu J = g_\mu + P_\mu J$, where $g_\mu(i) = \sum_j p_{ij}(\mu(i))\, g(i,\mu(i),j)$ and $P_\mu = \big(p_{ij}(\mu(i))\big)$ for $i,j = 1,\dots,n$ (state 0 excluded).
$TJ = g_{\bar\mu} + P_{\bar\mu} J$, where $\bar\mu(i)$ attains the minimum in $(TJ)(i)$, i.e. $g_{\bar\mu}(i) = \sum_j p_{ij}(\bar\mu(i))\, g(i,\bar\mu(i),j)$ and $P_{\bar\mu} = \big(p_{ij}(\bar\mu(i))\big)$.
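In matrix form the two operators are one line each. Below is a minimal sketch (mine, not from the slides) assuming a concrete encoding: `P[u]` is the n-by-n transition matrix among the nonterminal states under action u (any missing row mass goes to the terminal state 0), `gbar[u][i]` is the precomputed expected one-stage cost of applying u at state i, and every action is assumed feasible in every state.

```python
import numpy as np

def T_mu(J, mu, P, gbar):
    """Apply T_mu: (T_mu J) = g_mu + P_mu @ J."""
    n = len(J)
    g_mu = np.array([gbar[mu[i]][i] for i in range(n)])   # expected stage cost under mu
    P_mu = np.array([P[mu[i]][i] for i in range(n)])       # row i taken from P[mu(i)]
    return g_mu + P_mu @ J

def T(J, P, gbar):
    """Apply T: componentwise minimum of T_mu J over all actions."""
    return np.min([gbar[u] + P[u] @ J for u in range(len(P))], axis=0)
```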

DPA69 Value iteration
$T^k J = T(T^{k-1} J)$, with $T^0 J = J$.
$T^k J$: optimal cost-to-go for the k-stage problem with cost per stage g and initial cost J.
…and similarly for $T_\mu$.
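A minimal value-iteration sketch under the same assumed encoding as above (not the lecture's code): apply T repeatedly until successive iterates are close, then read off a greedy policy. The toy SSP at the bottom is made up for illustration.

```python
import numpy as np

def value_iteration(P, gbar, tol=1e-10, max_iter=100_000):
    """P[u]: (n, n) transitions among nonterminal states; gbar[u]: expected stage costs."""
    n = P[0].shape[0]
    J = np.zeros(n)                                            # start from J = 0
    for _ in range(max_iter):
        TJ = np.min([gbar[u] + P[u] @ J for u in range(len(P))], axis=0)
        if np.max(np.abs(TJ - J)) < tol:
            break
        J = TJ
    mu = np.argmin([gbar[u] + P[u] @ J for u in range(len(P))], axis=0)
    return J, mu

# Toy SSP: action 0 is cheap but may loop, action 1 exits straight to the terminal state.
P = [np.array([[0.5, 0.5], [0.0, 0.9]]), np.zeros((2, 2))]
gbar = [np.array([1.0, 2.0]), np.array([5.0, 6.0])]
print(value_iteration(P, gbar))   # expect J* = [5., 6.] and mu* = [1, 1]
```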

DPA610 T Properties
Monotonicity Lemma: If $J \le J'$ and $\mu$ is stationary, then $T^k J \le T^k J'$ and $T_\mu^k J \le T_\mu^k J'$.
Subadditivity: If $\mu$ is stationary, $e = (1,1,\dots,1)$, and $r > 0$, then $T^k(J + re)(i) \le T^k J(i) + r$ and $T_\mu^k(J + re)(i) \le T_\mu^k J(i) + r$.

DPA611 Property
Define a proper stationary policy $\mu$: the terminal state is reachable from every state with probability > 0 (within n stages).
Assumptions:
1. There exists at least one proper $\mu$.
2. The cost-to-go $J_\mu(i)$ of an improper $\mu$ is infinite for some i.
2'. Expected cost per stage: $g(i,u) = \sum_j p_{ij}(u)\, g(i,u,j) \ge 0$ for all $i \ne 0$ and $u \in U(i)$.
What do these mean in the deterministic case?

DPA612 Alternative assumption
(3) There exists an integer m such that, for any policy $\pi$ and initial state x, the probability of reaching the terminal state from x in m stages under $\pi$ is non-zero.
This is a stronger assumption than 1 & 2.

DPA613 Main Theorem
Under assumptions 1 and 2 (or under 3):
1. $\lim_{k\to\infty} T^k J = J^*$, for every vector J.
2. $J^* = TJ^*$, and $J^*$ is the only solution of $J = TJ$.
3. For any proper stationary policy $\mu$ and every vector J, $\lim_{k\to\infty} T_\mu^k J = J_\mu$; moreover $J_\mu = T_\mu J_\mu$ and $J_\mu$ is the only solution.
4. A stationary $\mu$ is optimal iff $T_\mu J^* = TJ^*$.

DPA614 Lemma
Suppose all stationary policies are proper. Then there exists a vector $\xi > 0$ such that, for every stationary $\mu$, T and $T_\mu$ are contraction mappings with respect to the weighted max-norm $\|\cdot\|_\xi$.
Weighted max-norm: $\|J\|_\xi = \max_i |J(i)|/\xi(i)$.
Contraction mapping: $\|TJ - TJ'\|_\xi \le \rho\, \|J - J'\|_\xi$ for some $\rho < 1$.

DPA615 How to find J* and $\mu^*$?
- Value iteration
- Policy iteration
- Variants

DPA616 Asynchronous Value Iteration
Start with an arbitrary $J_0$. At stage k: pick a state $i_k$ and update $J_{k+1}(i_k) = TJ_k(i_k)$ (everything else stays the same: $J_{k+1}(i) = J_k(i)$ for $i \ne i_k$).
Assume each state is chosen as $i_k$ infinitely often. Then $J_k \to J^*$.
This is also called the Gauss-Seidel method.
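A sketch of the Gauss-Seidel variant under the same assumed encoding as the earlier sketches: sweep through the states cyclically (so each is picked infinitely often) and overwrite one component of J at a time, immediately reusing the updated values.

```python
import numpy as np

def gauss_seidel_value_iteration(P, gbar, sweeps=1000):
    n = P[0].shape[0]
    J = np.zeros(n)
    for _ in range(sweeps):
        for i in range(n):                                     # i_k chosen cyclically
            J[i] = min(gbar[u][i] + P[u][i] @ J for u in range(len(P)))
    return J
```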

DPA617 Decomposition
Suppose S can be partitioned into $S_1, S_2, \dots, S_M$ so that if $i \in S_m$ then, under any policy, the successor state is $j = 0$ or $j \in S_{m-k}$ for some $m-1 \ge k \ge 0$.
Then the problem decomposes into M SSPs solved sequentially, each using the optimal solutions of the preceding subproblems.
If $k > 0$ above (successors always move to a lower-indexed subset), then the Gauss-Seidel method that iterates on states in the order of their membership in $S_m$ needs only one iteration per state to reach the optimum (e.g. finite horizon problems).

DPA618 Policy Iteration
Start with a given policy $\mu^k$:
Policy evaluation step: compute $J_{\mu^k}(i)$ by solving the linear system (with $J(0) = 0$): $J = g_{\mu^k} + P_{\mu^k} J$.
Policy improvement step: compute the new policy $\mu^{k+1}$ as a solution of $T_{\mu^{k+1}} J_{\mu^k} = T J_{\mu^k}$, that is, $\mu^{k+1}(i) = \arg\min_{u \in U(i)} \sum_j p_{ij}(u)\,\big( g(i,u,j) + J_{\mu^k}(j) \big)$.
Terminate iff $J_{\mu^k} = J_{\mu^{k+1}}$ (no improvement): output $\mu^k$.
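A policy-iteration sketch under the same assumed encoding (the starting policy mu0 should be proper). Evaluation solves $(I - P_\mu)J = g_\mu$ exactly; improvement is a one-step lookahead; the loop here stops when the policy no longer changes, a common stand-in for the slide's "no improvement in J" test.

```python
import numpy as np

def policy_iteration(P, gbar, mu0):
    n = P[0].shape[0]
    mu = np.asarray(mu0)
    while True:
        P_mu = np.array([P[mu[i]][i] for i in range(n)])
        g_mu = np.array([gbar[mu[i]][i] for i in range(n)])
        J = np.linalg.solve(np.eye(n) - P_mu, g_mu)            # policy evaluation
        Q = np.array([gbar[u] + P[u] @ J for u in range(len(P))])
        mu_new = np.argmin(Q, axis=0)                          # policy improvement
        if np.array_equal(mu_new, mu):
            return J, mu                                       # no change: optimal
        mu = mu_new
```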

DPA619 Policy Iteration Theorem
The algorithm generates an improving sequence of proper policies, that is, $J_{\mu^{k+1}}(i) \le J_{\mu^k}(i)$ for all i and k, and it terminates with an optimal policy.

DPA620 Multistage Look-ahead
Start at state i, make m subsequent decisions and incur the corresponding costs, end up in state j, and pay the terminal cost $J_\mu(j)$.
Multistage policy iteration terminates with an optimal policy under the same conditions.

DPA621 Value vs. Policy iteration
- In general, value iteration requires an infinite number of iterations to obtain the optimal cost-to-go.
- Policy iteration always terminates after finitely many iterations.
- A value-iteration step is a cheaper operation than a policy-iteration step.
- Idea: combine them.

DPA622 Modified policy iteration
Let $J_0$ be such that $TJ_0 \le J_0$, and generate $J_1, J_2, \dots$ and $\mu^0, \mu^1, \mu^2, \dots$ such that
$T_{\mu^k} J_k = T J_k$ and $J_{k+1} = (T_{\mu^k})^{m_k}(J_k)$.
- If $m_k = 1$ for all k: value iteration.
- If $m_k = \infty$ for all k: policy iteration, with the evaluation step done iteratively via value iteration.
- Heuristic choices of $m_k > 1$, keeping in mind that $T_\mu J$ is much cheaper to compute than $TJ$.
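A sketch of modified policy iteration under the same assumed encoding. The number of cheap $T_\mu$ sweeps per improvement, m, is a tunable assumption; $J_0$ is taken as the cost of a given proper policy mu0 so that $TJ_0 \le J_0$ holds.

```python
import numpy as np

def modified_policy_iteration(P, gbar, mu0, m=5, iters=200):
    n = P[0].shape[0]
    P0 = np.array([P[mu0[i]][i] for i in range(n)])
    g0 = np.array([gbar[mu0[i]][i] for i in range(n)])
    J = np.linalg.solve(np.eye(n) - P0, g0)                    # J_0 = J_mu0, so T J_0 <= J_0
    for _ in range(iters):
        Q = np.array([gbar[u] + P[u] @ J for u in range(len(P))])
        mu = np.argmin(Q, axis=0)                              # greedy policy: T_mu J = T J
        J = Q[mu, np.arange(n)]                                # first sweep (equals T J)
        for _ in range(m - 1):                                 # m_k - 1 further cheap T_mu sweeps
            J = np.array([gbar[mu[i]][i] + P[mu[i]][i] @ J for i in range(n)])
    return J, mu
```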

DPA623 Asynchronous Policy Iteration
Generate a sequence of costs-to-go $J_k$ and stationary policies $\mu^k$. Given $(J_k, \mu^k)$: select a subset of states $S_k$ and generate the new $(J_{k+1}, \mu^{k+1})$ by one of the two updates, applied alternately:
a) Value update: $J_{k+1}(i) = T_{\mu^k} J_k(i)$ if $i \in S_k$, $J_{k+1}(i) = J_k(i)$ otherwise; and $\mu^{k+1} = \mu^k$.
b) Policy update: $\mu^{k+1}(i) = \arg\min_{u \in U(i)} \sum_j p_{ij}(u)\,\big( g(i,u,j) + J_k(j) \big)$ if $i \in S_k$, $\mu^{k+1}(i) = \mu^k(i)$ otherwise; and $J_{k+1} = J_k$.

DPA624 Convergence
If both the value update and the policy update are executed infinitely often for all states, and if the initial conditions $J_0$ and $\mu^0$ are such that $T_{\mu^0} J_0 \le J_0$ (for example, select $\mu^0$ and set $J_0 = J_{\mu^0}$), then $J_k$ converges to $J^*$.

DPA625 Linear programming
Since $\lim_k T^k J = J^*$ for all J, we have $J \le TJ \;\Rightarrow\; J \le J^* = TJ^*$.
So $J^* = \arg\max\{ J \mid J \le TJ \}$, that is:
maximize $\sum_i \lambda_i$
subject to $\lambda_i \le \sum_j p_{ij}(u)\,\big( g(i,u,j) + \lambda_j \big)$, for $i = 1,\dots,n$ and $u \in U(i)$.
Problem: very big when n is big!
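The LP can be handed to an off-the-shelf solver. A sketch using scipy.optimize.linprog (assumed available) under the same assumed encoding: one constraint per state-action pair, and the decision variables are left free rather than nonnegative.

```python
import numpy as np
from scipy.optimize import linprog

def solve_ssp_lp(P, gbar):
    n = P[0].shape[0]
    A_ub, b_ub = [], []
    for u in range(len(P)):
        for i in range(n):
            row = -P[u][i].copy()
            row[i] += 1.0                       # lambda_i - sum_j p_ij(u) lambda_j <= gbar(i,u)
            A_ub.append(row)
            b_ub.append(gbar[u][i])
    res = linprog(c=-np.ones(n),                # maximize sum_i lambda_i
                  A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * n)    # lambda is free, not nonnegative
    return res.x                                # = J*
```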

DPA626 Discounted problems
Let $\alpha < 1$; there is no termination state. Either prove as a special case of SSP, or modify the definitions and proofs:
$(TJ)(i) = \min_{u \in U(i)} \sum_j p_{ij}(u)\,\big( g(i,u,j) + \alpha J(j) \big)$
$(T_\mu J)(i) = \sum_j p_{ij}(\mu(i))\,\big( g(i,\mu(i),j) + \alpha J(j) \big)$
$T_\mu J = g_\mu + \alpha P_\mu J$

DPA627 T Properties
Monotonicity Lemma: If $J \le J'$ and $\mu$ is stationary, then $T^k J \le T^k J'$ and $T_\mu^k J \le T_\mu^k J'$.
$\alpha$-Subadditivity: If $\mu$ is stationary and $r > 0$, then $T^k(J + re)(i) \le T^k J(i) + \alpha^k r$ and $T_\mu^k(J + re)(i) \le T_\mu^k J(i) + \alpha^k r$.

DPA628 Contraction
For any J and J' and any policy $\mu$, the following contraction properties hold:
$\|TJ - TJ'\|_\infty \le \alpha\, \|J - J'\|_\infty$
$\|T_\mu J - T_\mu J'\|_\infty \le \alpha\, \|J - J'\|_\infty$
Max-norm: $\|J\|_\infty = \max_i |J(i)|$

DPA629 Convergence Theorem
Convert to an SSP: define a new terminal state 0 and transition probabilities
$\tilde P(j \mid i,u) = \alpha\, P(j \mid i,u)$, $\tilde P(0 \mid i,u) = 1 - \alpha$.
- All policies are proper.
- All previous algorithms and convergence properties carry over.
- A separate proof is needed for an infinite number of states.
- Can be extended to a compact control set with continuous transition probabilities.
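In the encoding used by the earlier sketches the conversion is one line: scale every transition matrix by $\alpha$ and let the missing mass $1-\alpha$ flow to the cost-free terminal state. A sketch, assuming the discounted problem's P[u] are row-stochastic:

```python
import numpy as np

def discounted_to_ssp(P, alpha):
    """Scale each row-stochastic P[u] by alpha; the leftover 1 - alpha per row is the
    (implicit) probability of moving to the cost-free terminal state 0."""
    return [alpha * np.asarray(P_u) for P_u in P]

# e.g. value_iteration(discounted_to_ssp(P, 0.9), gbar) solves the alpha = 0.9 problem
# with the earlier SSP value-iteration sketch.
```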

DPA630 Applications
1. Asset selling with infinite horizon (continued).
2. Inventory with batch processing, infinite horizon:
   - An order is placed at time t with probability p.
   - Given the current backlog j, the manufacturer can either process the whole batch at a fixed cost K, or postpone and incur a cost c per unit.
   - The maximum backlog is n.
   - Which policy minimizes the expected total cost?
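A sketch of the batch-processing model as an infinite-horizon MDP solved by value iteration. The slide leaves the exact objective open, so this assumes a discounted total cost with made-up illustrative parameters (p, K, c, n, alpha), and assumes that processing clears the whole backlog and that the batch must be processed once the backlog reaches n.

```python
import numpy as np

def solve_batch_inventory(p=0.5, K=5.0, c=1.0, n=10, alpha=0.95, iters=5000):
    """State: current backlog j in {0, ..., n}. Actions: process (pay K, backlog cleared)
    or postpone (pay c per backlogged unit). A new order arrives w.p. p each period."""
    J = np.zeros(n + 1)
    for _ in range(iters):
        new_J = np.empty_like(J)
        for j in range(n + 1):
            process = K + alpha * ((1 - p) * J[0] + p * J[1])   # clear, then maybe a new order
            if j == n:
                new_J[j] = process                              # forced to process at max backlog
            else:
                postpone = c * j + alpha * ((1 - p) * J[j] + p * J[j + 1])
                new_J[j] = min(process, postpone)
        J = new_J
    # read off the optimal decision at each backlog level
    policy = []
    for j in range(n + 1):
        process = K + alpha * ((1 - p) * J[0] + p * J[1])
        if j == n:
            policy.append("process")
        else:
            postpone = c * j + alpha * ((1 - p) * J[j] + p * J[j + 1])
            policy.append("process" if process <= postpone else "postpone")
    return J, policy

print(solve_batch_inventory())
```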