Markov Decision Problems


1 Markov Decision Problems
The purpose of this week is to introduce the basic concepts of Markov Decision Problems (MDPs). As we will see in next week's module, there are relationships between MDPs and PA. The reference for this week is: Martin L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, Inc., 1994.

2 The Multistage System Figure 2.2.1 Flow chart for a multistage system
Applied Optimal Control: Optimization, Estimation, and Control, Arthur E. Bryson, Jr. and Yu-Chi Ho, Hemisphere Publishing Corporation, 1975, p. 44. First recall the multistage system introduced last week; it is the basis for understanding decision problems. A multistage system has three elements: the state, the input, and the transition functions. In the figure above, x represents the state, u the input, and f the transition function. In general, the state, the input, and the transition function vary with the stage. After a cost function of the state and input is defined, an optimization problem is formulated. Usually we are interested in the control that achieves the least cost while meeting the constraints; such a control is called the optimal control. Recall the big picture on the last slide of last week's module. One special application of optimal control is decision problems: u is the input, chosen by a person according to some policy. Since the cost(x,u) is not known before the action is taken, this is also an optimal control problem. Furthermore, if the state transition function depends only on the current state and the input, with no memory of earlier history, we have a Markov Decision Problem. More generally, if the lifetime of each state may follow a non-exponential distribution, we have a Semi-Markov Decision Problem. Copyright by Yu-Chi Ho
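To make the three elements concrete, here is a minimal simulation sketch of a multistage system in Python, with a made-up scalar state, a made-up linear transition function, and a made-up quadratic stage cost; none of these particular functions come from the slides.

# Minimal multistage-system sketch: state x, input u, transition f, stage cost.
def transition(x, u):
    return x + u          # state transition x_{k+1} = f(x_k, u_k), assumed linear here

def stage_cost(x, u):
    return x**2 + u**2    # cost of being in state x and applying input u, assumed quadratic

def simulate(x0, inputs):
    """Run the multistage system for a given input sequence and accumulate its cost."""
    x, total_cost = x0, 0.0
    for u in inputs:
        total_cost += stage_cost(x, u)
        x = transition(x, u)
    return x, total_cost

print(simulate(x0=5.0, inputs=[-2.0, -1.5, -1.0]))

The optimal control problem is then to choose the input sequence that minimizes the accumulated cost subject to the constraints.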

3 The Content of MDP Providing conditions under which there exist easily implementable optimal policies; Determining how to recognize these policies; Developing and enhancing algorithms for computing them; Establishing convergence of these algorithms. Martin L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, Inc., 1994, Preface xv. Markov Decision Problems are models for sequential decision making with uncertain outcomes. The stage in the multistage system becomes the decision epoch, and the input becomes the decision. The state transition function decides the next state based on the current state, the input, and the transition probability. A more precise formulation is given on the next slide. The four points above summarize the main topics studied in MDP theory; in this week's module we mainly illustrate the third point. Copyright by Yu-Chi Ho

4 Problem Formulation T, decision epochs and periods; S, states; A_s, action sets; p_t(·|s,a), transition probabilities; r_t(s,a), rewards; decision rules; policies.
We use the notation in Puterman's book to illustrate the basic concepts of MDP. A Markov Decision Problem can be described by a collection of five objects:
The decision epochs and periods. They are similar to the stages in a multistage system. They can be discrete or continuous, deterministic or stochastic, with exponential or non-exponential lifetimes.
The states. They are similar to the states in a multistage system. In an MDP, the current state contains all the information needed for the state transition function to determine the next state, possibly probabilistically.
The action sets. They are similar to the feasible events in a GSMP (recall week 3); compared with the multistage system, they play the role of the feasible input set. The action set is a function of the current state, since some actions cannot be taken in particular states, e.g., a light cannot be turned off if it is already off. The decision must be selected from this set.
The transition probabilities. To generalize beyond the deterministic case, transition probabilities are introduced. Even when the action at a state is decided, the next state generally cannot be determined; only its distribution is known. The transition probability is a function of the current state, the action, and the time; in the special case where it is independent of the time, it is called stationary.
The rewards. They are similar to the cost function in the multistage system, but more general: a reward can be positive for income or negative for cost. For the optimization problem, an optimality criterion must be introduced. Most optimality criteria are based on the rewards, though they take different forms; details are given in the classification of MDPs on a following slide.
Besides the five objects above, there are two other important concepts.
Decision rules. A decision rule determines the decision based on the current state, input, transition probabilities, and optimality criterion. There are many kinds of decision rules; in Puterman's book the classification includes history-dependent and randomized, history-dependent and deterministic, Markovian and randomized, and Markovian and deterministic.
Policies. Intuitively, a policy can be seen as a sequence of decision rules, one for each decision epoch.
Copyright by Yu-Chi Ho
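To make the five objects concrete, here is a minimal Python sketch of a toy two-state, two-action MDP; the states, actions, probabilities, and rewards are invented for illustration and are not from Puterman's book.

# The five MDP objects for a made-up "light switch" example.
T = range(3)                      # decision epochs 0, 1, 2 (finite horizon)
S = ["off", "on"]                 # state set
A = {"off": ["wait", "turn_on"],  # action sets A_s: a light that is off cannot be turned off
     "on":  ["wait", "turn_off"]}

# transition probabilities p(j | s, a), here stationary (independent of t)
p = {("off", "wait"):     {"off": 1.0},
     ("off", "turn_on"):  {"on": 0.9, "off": 0.1},   # switching on may fail
     ("on",  "wait"):     {"on": 1.0},
     ("on",  "turn_off"): {"off": 1.0}}

# rewards r(s, a): positive for income, negative for cost
r = {("off", "wait"): 0.0, ("off", "turn_on"): -1.0,
     ("on",  "wait"): 2.0, ("on",  "turn_off"): 0.0}

# A deterministic Markovian decision rule maps the current state to an action;
# a policy is one such rule per decision epoch.
decision_rule = {"off": "turn_on", "on": "wait"}
policy = [decision_rule for _ in T]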

5 Classification of MDP Horizon: finite & infinite. Criteria: expected total reward, expected total discounted reward, average reward and related criteria. Time: discrete & continuous.
We can classify MDPs along several dimensions; the following are some examples:
Horizon. It is similar to the number of stages in a multistage system and can be finite or infinite. The finite-horizon MDP is the simpler one, and many practical systems can be formulated with this model; its optimality criteria are based on the whole horizon. The infinite-horizon MDP corresponds to infinitely many stages, so its optimality criteria are based on the average reward or on the whole horizon.
Criteria. Although most optimality criteria are based on the per-stage rewards, different criteria serve different purposes. The expected total reward criterion can be used for both finite- and infinite-horizon MDPs; in the infinite case, the evaluation of the sum must be addressed to deal with divergence. Its purpose is to consider the overall performance of the process. The expected total discounted reward criterion is used for infinite-horizon problems; it captures the basic idea from economics that one unit of money tomorrow is worth only λ units today, where 0 < λ < 1. The average reward and related criteria consider the average performance of the system.
Time. In the discrete case, decisions can be made only at deterministic time points, usually after transitions. In the continuous case, a decision can be made at any time point after a transition, and transitions can occur over a continuous time period.
Copyright by Yu-Chi Ho
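As a small numeric illustration of the three criteria, the sketch below evaluates a made-up reward sequence under the total, discounted, and average reward criteria; the rewards and discount factor are arbitrary.

# Evaluating one sample path of rewards under the three criteria.
rewards = [2.0, -1.0, 3.0, 0.5, 2.0]    # r_1, r_2, ... observed along one sample path
lam = 0.9                               # discount factor, 0 < lambda < 1

total_reward      = sum(rewards)
discounted_reward = sum(lam**t * r for t, r in enumerate(rewards))
average_reward    = total_reward / len(rewards)

print(total_reward, discounted_reward, average_reward)
# Over an infinite horizon the total reward may diverge, which is why the
# discounted and average criteria are introduced.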

6 Optimality Equations and the Principle of Optimality
The Principle of Optimality: An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision. R. E. Bellman, Dynamic Programming (Princeton University Press, Princeton, NJ, 1957, p. 83). The optimality equation shown on this slide is from Puterman's book, p. 83; it is the optimality equation for the finite-horizon MDP, and the equation is similar in the other cases. The h here represents the history up to the current state. Recall the recursive equation for u in dynamic programming; the two are similar in form. Here we want to maximize the reward function, and in the general case we use the supremum rather than the maximum. The equation embodies the basic idea used to solve it, backward induction: if the values at stage t+1 are given, the decision at stage t can be determined through this equation. This is also the basic idea used in dynamic programming. Recall the principle of optimality in dynamic programming; it is one of the most important principles in DP. Copyright by Yu-Chi Ho
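The equation itself did not survive in this transcript. For the finite-horizon case, and using the notation of slide 4, the optimality equation in Puterman's book takes roughly the following form (a reconstruction, not a verbatim quote):

u_t(h_t) = \sup_{a \in A_{s_t}} \Big\{ r_t(s_t, a) + \sum_{j \in S} p_t(j \mid s_t, a)\, u_{t+1}(h_t, a, j) \Big\}

where h_t denotes the history up to the current state s_t, and u_{t+1} is the optimal value obtainable from stage t+1 onward.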

7 The Algorithms Finite: Dynamic Programming. Infinite: Value Iteration, Policy Iteration, Action Elimination, etc.
We now discuss how to solve the optimality equations. For finite-horizon MDPs, as shown on the last slide, DP is a useful and basic approach. For infinite-horizon MDPs we generally start from the optimality equation for the discounted MDP (Puterman's book, p. 159); the λ here is the discount factor. In many cases a deterministic stationary Markov policy can be found, so we discuss only this case. Determining v can then be seen as an iterative process, which is the basic idea of the Value Iteration algorithm:
Start from an initial v.
Use the optimality equation to compute the next iterate of v.
If the difference between the new v and the previous one is small enough, stop; otherwise, continue iterating.
There are theorems on the convergence of this algorithm; see Puterman's book.
The basic idea of Policy Iteration is:
Start from an initial policy.
Each iteration finds the best policy so far, i.e., the one maximizing the reward.
The iteration continues until the policies of two successive iterations are the same.
This algorithm requires the action sets to be the same for every state.
The difference between Value Iteration and Policy Iteration: the former tries to find the optimum of the reward function and requires little structure from the MDP, so it can be applied in many cases; the latter applies only to stationary infinite-horizon problems. Action Elimination combines Value Iteration and Policy Iteration. The basic idea is that, as the iteration proceeds, the action sets may change: sometimes we know that certain actions in the action set lead to less reward than the best found so far. This information can be obtained from Value Iteration and used to reduce the search space for Policy Iteration by simply eliminating the worse actions. In many cases this combination leads to faster convergence. There are also many other extensions for special cases; see Puterman's book.
We focus on the finite-horizon MDP and the discounted MDP to discuss the algorithms. In the other cases, the optimality equation is first adapted to the appropriate form, and then the iteration algorithms are adapted accordingly; that is the basic idea of the algorithms in the other cases. See also Puterman's book, Chapter 8. The continuous-time MDP is an extension of the discrete-time MDP; the model elements, such as the rewards, decision rules, and policies, can be defined similarly to the discrete case. Copyright by Yu-Chi Ho
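As a concrete illustration of Value Iteration for a discounted, stationary MDP, here is a minimal Python sketch; the two-state problem reuses the toy example sketched under slide 4, and the discount factor and stopping tolerance are arbitrary choices, not values from Puterman's book.

# Value iteration on a made-up two-state, two-action discounted MDP.
S = ["off", "on"]
A = {"off": ["wait", "turn_on"], "on": ["wait", "turn_off"]}
p = {("off", "wait"):     {"off": 1.0},
     ("off", "turn_on"):  {"on": 0.9, "off": 0.1},
     ("on",  "wait"):     {"on": 1.0},
     ("on",  "turn_off"): {"off": 1.0}}
r = {("off", "wait"): 0.0, ("off", "turn_on"): -1.0,
     ("on",  "wait"): 2.0, ("on",  "turn_off"): 0.0}
lam = 0.9        # discount factor
eps = 1e-6       # stopping tolerance

def q_value(v, s, a):
    """Immediate reward plus discounted expected value of the next state."""
    return r[(s, a)] + lam * sum(prob * v[j] for j, prob in p[(s, a)].items())

# Start from an arbitrary v and apply the optimality equation until
# successive iterates are close enough.
v = {s: 0.0 for s in S}
while True:
    v_new = {s: max(q_value(v, s, a) for a in A[s]) for s in S}
    if max(abs(v_new[s] - v[s]) for s in S) < eps:
        break
    v = v_new

# Extract a deterministic stationary policy that is greedy with respect to v.
greedy_policy = {s: max(A[s], key=lambda a: q_value(v, s, a)) for s in S}
print(v, greedy_policy)

Policy Iteration would instead alternate between evaluating the current policy and improving it greedily; the sketch above shows only the value-based variant.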

