
1 Introduction of MDP Speaker: Xu Jia-Hao Adviser: Ke Kai-Wei

2 Outline Simple Markov Process; Markov Process with Reward; Introduction of Alternatives; The Policy-Iteration Method for the Solution of Sequential Decision Processes; Conclusion; Reference

3 Outline Simple Markov Process; Markov Process with Reward; Introduction of Alternatives; The Policy-Iteration Method for the Solution of Sequential Decision Processes; Conclusion; Reference

4 Simple Markov Process (1) Definition: The probability of a transition to state j during the next time interval, given that the system now occupies state i, is a function only of i and j and not of any history of the system before its arrival in i. Example: a frog in a lily pond.

5 Simple Markov Process (2) We may specify a set of conditional probabilities p_ij that a system which now occupies state i will occupy state j after its next transition.
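In the standard notation (a sketch; the symbols on the original slide were lost in extraction), this set of conditional probabilities is:

```latex
p_{ij} = P\{\, s(n+1) = j \mid s(n) = i \,\},
\qquad p_{ij} \ge 0,
\qquad \sum_{j=1}^{N} p_{ij} = 1 .
```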

6 Toymaker Example (1) The toymaker's system has two states (a discrete-time Markov process). First state: the toy he is currently producing has found great favor with the public. Second state: his toy is out of favor. Define the state probability π_i(n) as the probability that the system will occupy state i after n transitions if its state at n = 0 is known.

7 Toymaker Example (2) Transition matrix (from Howard, 1960): P = [[0.5, 0.5], [0.4, 0.6]], where row i gives the probabilities of moving from state i to states 1 and 2. Transition diagram: the two states with their transition arcs (shown as a figure on the original slide).
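A minimal Python (NumPy) sketch, assuming the Howard (1960) transition matrix quoted above, of how the state probabilities evolve over successive transitions:

```python
import numpy as np

# Toymaker transition matrix (Howard, 1960):
# state 1 = toy in favor, state 2 = toy out of favor.
P = np.array([[0.5, 0.5],
              [0.4, 0.6]])

pi = np.array([1.0, 0.0])          # system starts in state 1 at n = 0
for n in range(1, 6):
    pi = pi @ P                    # pi(n) = pi(n-1) P
    print(f"n={n}: pi(n) = {pi}")

# pi(n) converges to the limiting state probabilities (4/9, 5/9).
```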

8 8 Outline Simple Markov Process Markov Process with Reward Introduction of Alternative The Policy-Iteration Method for the Solution of Sequential Decision Processes Conclusion Reference

9 Markov Processes with Rewards Definition: Suppose that an N-state Markov process earns r_ij “dollars” when it makes a transition from state i to state j. We call r_ij (which may be negative) the “reward” associated with the transition from i to j. The r_ij form a reward matrix R. Example: a frog in a lily pond.

10 Definition v_i(n): The expected total earnings in the next n transitions if the system is now in state i. q_i: The reward to be expected in the next transition out of state i; it will be called the expected immediate reward for state i.

11 Formula of v_i(n): $v_i(n) = \sum_{j=1}^{N} p_{ij}\,[\,r_{ij} + v_j(n-1)\,] = q_i + \sum_{j=1}^{N} p_{ij}\, v_j(n-1)$, where $q_i = \sum_{j=1}^{N} p_{ij} r_{ij}$ and $v_i(0) = 0$.

12 Toymaker with Reward Example (1) Reward matrix (Howard, 1960): R = [[9, 3], [3, -7]]. Transition matrix: P = [[0.5, 0.5], [0.4, 0.6]]. Can find the expected immediate rewards: q_1 = 0.5(9) + 0.5(3) = 6 and q_2 = 0.4(3) + 0.6(-7) = -3.

13 Toymaker with Reward Example (2) Applying the recurrence with v_i(0) = 0 (values recomputed from the matrices above): n = 1: v_1 = 6, v_2 = -3; n = 2: v_1 = 7.5, v_2 = -2.4; n = 3: v_1 = 8.55, v_2 = -1.44; n = 4: v_1 = 9.555, v_2 = -0.444; n = 5: v_1 = 10.5555, v_2 = 0.5556.

14 Toymaker with Reward Example (3) For large n, both v_1(n) and v_2(n) grow by about 1 per transition (the long-run gain of the process), and the difference v_1(n) - v_2(n) approaches 10.
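These values can be checked with a minimal Python sketch of the slide-11 recurrence, assuming the matrices quoted above:

```python
import numpy as np

P = np.array([[0.5, 0.5],
              [0.4, 0.6]])
R = np.array([[9.0, 3.0],
              [3.0, -7.0]])

q = (P * R).sum(axis=1)            # expected immediate rewards q = [6, -3]

v = np.zeros(2)                    # boundary condition v_i(0) = 0
for n in range(1, 6):
    v = q + P @ v                  # v(n) = q + P v(n-1)
    print(f"n={n}: v1 = {v[0]:.4f}, v2 = {v[1]:.4f}")
```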

15 Outline Simple Markov Process; Markov Process with Reward; Introduction of Alternatives; The Policy-Iteration Method for the Solution of Sequential Decision Processes; Conclusion; Reference

16 Introduction of Alternatives The concept of an alternative for an N-state system: in each state i, the decision maker may choose among several alternatives k, each carrying its own transition probabilities p_ij^k and rewards r_ij^k.

17 The Constraint on Alternatives The number of alternatives in any state must be finite, but the number of alternatives in each state may differ from the numbers in other states.

18 Concepts of Decision (1) d_i(n): The number of the alternative in the ith state that will be used at stage n. We call d_i(n) the “decision” in state i at the nth stage. When d_i(n) has been specified for all i and all n, a “policy” has been determined. The optimal policy is the one that maximizes total expected return for each i and n.

19 Concepts of Decision (2) Here we redefine v_i(n) as the total expected return in n stages starting from state i if an optimal policy is followed, so that $v_i(n) = \max_k \big[\, q_i^k + \sum_{j=1}^{N} p_{ij}^k\, v_j(n-1) \,\big]$.

20 The Toymaker’s Problem Solved by Value Iteration (1) The two alternatives in each state (Howard, 1960): State 1: alternative 1 (no advertising) has p = (0.5, 0.5), r = (9, 3), q = 6; alternative 2 (advertising) has p = (0.8, 0.2), r = (4, 4), q = 4. State 2: alternative 1 (no research) has p = (0.4, 0.6), r = (3, -7), q = -3; alternative 2 (research) has p = (0.7, 0.3), r = (1, -19), q = -5.

21 The Toymaker’s Problem Solved by Value Iteration (2) Applying the recurrence: n = 1 gives v = (6, -3) with decisions (1, 1); n = 2 gives (8.2, -1.7) with decisions (2, 2); n = 3 gives (10.22, 0.24); n = 4 gives (12.22, 2.23); from n = 2 onward the decisions settle on alternative 2 in both states.
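A minimal value-iteration sketch over the alternatives; the numbers are the Howard (1960) values assumed above, and the variable names are illustrative rather than from the slides:

```python
import numpy as np

# (transition probabilities, expected immediate reward q) per alternative,
# per state; numbers assumed from Howard (1960).
alternatives = [
    [(np.array([0.5, 0.5]), 6.0),    # state 1, alt 1: no advertising
     (np.array([0.8, 0.2]), 4.0)],   # state 1, alt 2: advertising
    [(np.array([0.4, 0.6]), -3.0),   # state 2, alt 1: no research
     (np.array([0.7, 0.3]), -5.0)],  # state 2, alt 2: research
]

v = np.zeros(2)                      # boundary condition v_i(0) = 0
for n in range(1, 5):
    new_v, decisions = np.zeros(2), []
    for i, alts in enumerate(alternatives):
        returns = [q + p @ v for p, q in alts]   # q_i^k + sum_j p_ij^k v_j(n-1)
        decisions.append(int(np.argmax(returns)) + 1)  # 1-based alternative
        new_v[i] = max(returns)
    v = new_v
    print(f"n={n}: v = {v}, decisions = {decisions}")
```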

22 Value-Iteration The method that has just been described for the solution of the sequential process may be called the value-iteration method because the v_i(n), or “values”, are determined iteratively. Even if we were patient enough to solve the long-duration process by value iteration, the convergence on the best alternative in each state is asymptotic and difficult to measure analytically.

23 Outline Simple Markov Process; Markov Process with Reward; Introduction of Alternatives; The Policy-Iteration Method for the Solution of Sequential Decision Processes; Conclusion; Reference

24 Policy For each state of the system, one of its alternatives is selected and marked with an X (shown as a diagram on the original slide).

25 25 Decision and Policy The alternative thus selected is called the “decision” for that state; it is no longer a function of n. The set of X’s or the set of decisions for all states is called a “policy”. Selection of a policy thus determines the Markov process with rewards that will describe the operations of the system.

26 Problems of Finding Optimal Policies In the previous figure there are 4 × 3 × 2 × 1 × 5 = 120 different policies. Enumerating them is still conceivable, but it becomes infeasible for very large problems; for example, a problem with 50 states and 50 alternatives in each state contains 50^50 (about 10^85) policies. Solution: the policy-iteration method, composed of two parts, the value-determination operation and the policy-improvement routine.

27 Formula The Value-Determination Operation: for the current policy, solve the N linear equations $g + v_i = q_i + \sum_{j=1}^{N} p_{ij} v_j$, $i = 1, \dots, N$ (setting $v_N = 0$), for the gain $g$ and the relative values $v_i$. The Policy-Improvement Routine: for each state i, find the alternative k that maximizes the test quantity $q_i^k + \sum_{j=1}^{N} p_{ij}^k v_j$ using the relative values of the current policy, and make it the new decision in state i.
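As a sketch of the value-determination step alone, for the original policy (alternative 1 in both states) with the assumed toymaker numbers: fixing v_2 = 0 leaves two unknowns, g and v_1:

```python
import numpy as np

# Policy (1, 1): transition probabilities and immediate rewards.
P = np.array([[0.5, 0.5],
              [0.4, 0.6]])
q = np.array([6.0, -3.0])

# g + v_i = q_i + sum_j p_ij v_j  with  v_2 = 0 reduces to:
#   g + 0.5 v_1 = 6
#   g - 0.4 v_1 = -3
A = np.array([[1.0, 0.5],
              [1.0, -0.4]])
g, v1 = np.linalg.solve(A, q)
print(g, v1)                       # g = 1.0, v1 = 10.0
```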

28 The Iteration Cycle Alternate the value-determination operation and the policy-improvement routine; when the policy is unchanged on two successive iterations, the iteration stops and that policy is optimal.

29 The Toymaker’s Problem Two alternatives in each state: in state 1, no advertising (1) or advertising (2); in state 2, no research (1) or research (2). Four possible policies: (1, 1), (1, 2), (2, 1), (2, 2).

30 Result Starting from the policy (1, 1) with gain g = 1, the iteration cycle converges in two cycles to the optimal policy (2, 2): advertise in state 1 and carry out research in state 2, with gain g = 2 per transition.
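A compact policy-iteration sketch (same assumed Howard (1960) numbers; the data layout is hypothetical) that reproduces this result:

```python
import numpy as np

# (transition probabilities, expected immediate reward q) per alternative, per state.
alts = [
    [(np.array([0.5, 0.5]), 6.0), (np.array([0.8, 0.2]), 4.0)],
    [(np.array([0.4, 0.6]), -3.0), (np.array([0.7, 0.3]), -5.0)],
]

policy = [0, 0]                        # start with alternative 1 in each state
while True:
    # Value determination: solve g + v_i = q_i + sum_j p_ij v_j with v_N = 0.
    P = np.array([alts[i][k][0] for i, k in enumerate(policy)])
    q = np.array([alts[i][k][1] for i, k in enumerate(policy)])
    N = len(policy)
    A = np.hstack([np.ones((N, 1)), (np.eye(N) - P)[:, :-1]])  # unknowns: g, v_1..v_{N-1}
    x = np.linalg.solve(A, q)
    g, v = x[0], np.append(x[1:], 0.0)
    # Policy improvement: maximize the test quantity q_i^k + sum_j p_ij^k v_j.
    new_policy = [int(np.argmax([qk + p @ v for p, qk in alts[i]])) for i in range(N)]
    print(f"policy = {[k + 1 for k in policy]}, g = {g}, v = {v}")
    if new_policy == policy:           # unchanged policy: optimal, stop
        break
    policy = new_policy
```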

31 Outline Simple Markov Process; Markov Process with Reward; Introduction of Alternatives; The Policy-Iteration Method for the Solution of Sequential Decision Processes; Conclusion; Reference

32 Conclusion For a Markov decision process, we can use the “iteration cycle” to select an optimal policy and so earn the maximum long-run reward.

33 Outline Simple Markov Process; Markov Process with Reward; Introduction of Alternatives; The Policy-Iteration Method for the Solution of Sequential Decision Processes; Conclusion; Reference

34 Reference Ronald A. Howard, Dynamic Programming and Markov Processes, MIT Press, 1960 (an introduction to Markov decision processes).

35 z-Transform For the study of transient behavior and for theoretical convenience, it is useful to study the Markov process from the point of view of the generating function or, as we shall call it, the z-transform. The relationship between a time function f(n) and its transform F(z) is unique; each time function has only one transform, and the inverse transformation of the transform will produce once more the original time function.
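A brief worked example under this definition of the transform (Howard's convention, F(z) as the power series in z of f(n)): the geometric time function has a closed-form transform by the geometric series:

```latex
F(z) = \sum_{n=0}^{\infty} f(n)\, z^{n},
\qquad\text{so for } f(n) = \alpha^{n}:\qquad
F(z) = \sum_{n=0}^{\infty} (\alpha z)^{n} = \frac{1}{1 - \alpha z},
\quad |z| < 1/|\alpha| .
```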

