1 Carnegie Mellon University
Solving Generalized Semi-Markov Decision Processes using Continuous Phase-Type Distributions Håkan L. S. Younes Reid G. Simmons Carnegie Mellon University This is joint work with my advisor Reid Simmons. The inspiration for this research came to me while attending a Dagstuhl seminar on probabilistic methods in verification and planning last Spring. The aim of the seminar was to bring researchers from the performance evaluation, model checking, and AI communities together to share ideas. This presentation is an example of what can be done by incorporating techniques from the other two fields to solve problems in AI.

2 Introduction Asynchronous processes are abundant in the real world
Telephone system, computer network, etc. Discrete-time and semi-Markov models are inappropriate for systems with asynchronous events Generalized semi-Markov (decision) processes, GSM(D)Ps, are great for this! Approximate solution using phase-type distributions and your favorite MDP solver I will present an approach to decision theoretic planning for asynchronous processes. Such processes are abundant in the real world. The most obvious example of an asynchronous process is a telephone system where calls are made without any synchronization between callers. Other examples include queuing systems, such as your local post office, and networks of computers. A lot of attention has been given, in the past decade or so, to decision theoretic planning, but almost exclusively using discrete-time models which are best suited for synchronous systems. Even semi-Markov models, with time as a continuous quantity, are not appropriate for systems with asynchronous events, as I will show you. I propose, instead, that we adopt the generalized semi-Markov decision process as the foundation for decision theoretic planning with asynchronous events. I will introduce you to this formalism and I will present an algorithm for approximately solving GSMDPs using phase-type distributions and your favorite MDP solver. And I really mean your favorite solver, because the approach I will present to you is not tied to a specific solver. I will introduce you to phase-type distributions later in the talk, so do not worry right now if you do not know what they are.

3 Asynchronous Processes: Example
But first let us look at a simple example of an asynchronous process: a simple network with two computers. Each computer can be either up or down. While a computer is up, a crash event may occur causing the computer to go down. While down, a reboot action is available that can bring the computer up again. Say that the system starts in a state with both computers up. The hourglass represents the fact that time is continuously progressing: this is a continuous-time system. [Timeline: (m1 up, m2 up) at t = 0]

4 Asynchronous Processes: Example
After some time, say 2.5 time units, computer m2 may crash, bringing the system to a state with m1 up and m2 down. When in this state, we may issue a reboot action for m2. [Timeline: (m1 up, m2 up) at t = 0 → m2 crashes → (m1 up, m2 down) at t = 2.5]

5 Asynchronous Processes: Example
While rebooting m2, computer m1 may crash as well. We are still rebooting m2, but now we are in a state with both machines down. [Timeline: (m1 up, m2 up) at t = 0 → m2 crashes → (m1 up, m2 down) at t = 2.5 → m1 crashes → (m1 down, m2 down) at t = 3.1]

6 Asynchronous Processes: Example
After some additional time, we complete the reboot of m2 and immediately start rebooting m1. Note, again, that the crash event for m1 occurred while we were rebooting m2, and this is an example of asynchrony. The process is stochastic because the time spent in each state is a random variable. The reason why discrete-time models are not appropriate for this kind of system is that no matter what discretization step we choose, we always run the risk of multiple events occurring within the same time interval, and we can also have long periods of time when nothing happens. The path at the bottom represents the execution history for the system. It consists of states, with state transitions caused by events. Another name for processes that give rise to execution histories like this is discrete event system. [Timeline: (m1 up, m2 up) at t = 0 → m2 crashes → (m1 up, m2 down) at t = 2.5 → m1 crashes → (m1 down, m2 down) at t = 3.1 → m2 rebooted → (m1 down, m2 up) at t = 4.9]

7 A Model of Stochastic Discrete Event Systems
Generalized semi-Markov process (GSMP) [Matthes 1962] A set of events E A set of states S GSMDP: Actions A ⊆ E are controllable events The generalized semi-Markov process, first introduced by Matthes in 1962, is an established formalism in queuing theory for stochastic discrete event systems. A GSMP consists of a set of events and a set of states. By identifying a subset of the events as controllable, that is actions, we obtain a decision problem.

8 Events With each event e is associated:
A condition e identifying the set of states in which e is enabled A distribution Ge governing the time e must remain enabled before it triggers A distribution pe(s′|s) determining the probability that the next state is s′ if e triggers in state s With each event is associated: a condition identifying the set of states in which the event is enabled (I will tell you what this means shortly) a distribution governing the time the event must remain enabled before it triggers a distribution determining the probability that the next state is s’ if the event triggers in state s When an event is enabled, it can trigger. The distribution Ge determines the probability that the event will trigger within a certain amount of time, provided it remains enabled. When an event triggers, a state transition occurs.

9 Events: Example Network with two machines Crash time: Exp(1)
Reboot time: U(0,1) [State diagram: (m1 up, m2 up), (m1 up, m2 down), (m1 down, m2 down), connected by the crash m2 and crash m1 events] Let us look at the network example again. Assume that the distribution associated with crash events is exponential with rate 1 and the distribution associated with reboot actions is uniform in the interval 0 to 1. Assume that we start at time zero in a state with both machines up. The two crash events are enabled and they race to trigger first. Say that the crash event for m2 triggers first after 0.6 time units and that we enable the reboot action in the following state. There is now a race between the crash event and the reboot action. Say that the crash event for m1 triggers first. We are now in a state with the reboot action still enabled. Note that because it was enabled in the previous state for 0.5 time units, the trigger time distribution for this action is not U(0,1) but U(0,0.5) at the entry of this state. This is the trigger time distribution conditioned on the time the action has remained enabled. Note that the crash event for m1 also remained enabled across state transitions, but it has an exponential distribution, so the time it was previously enabled does not matter. In general, the trigger time distribution for an event in a state may depend on the entire execution history. This history dependence allows us to model asynchronous events, but it also means that we have a model that goes beyond semi-Markov processes. We can of course satisfy the semi-Markov property by including real-valued clocks in the description of states, each clock associated with an event and representing the time that event has remained enabled. The result would be a general state-space semi-Markov decision process, but by using this representation we lose some of the structure provided by the GSMDP formulation. This is similar to the way we can represent a POMDP by a continuous-state MDP, but people do not generally solve POMDPs this way. [Timeline: t = 0 with Gc1 = Exp(1) and Gc2 = Exp(1) enabled; t = 0.6 with Gc1 = Exp(1) and Gr2 = U(0,1) enabled; t = 1.1 with Gr2 = U(0,0.5) enabled] Asynchronous events ⇒ beyond semi-Markov
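The conditioning step described above, where U(0,1) becomes U(0,0.5) after the action has already been enabled for 0.5 time units while the exponential crash clock is unaffected, is easy to check with a small sketch. This is our own illustration; the function names are hypothetical.

import random

def remaining_exponential(rate, elapsed):
    # Memoryless: the remaining trigger time is still Exp(rate), regardless of elapsed.
    return random.expovariate(rate)

def remaining_uniform(a, b, elapsed):
    # If the trigger time is U(a, b) and the event has survived `elapsed` time units,
    # the remaining time is uniform on (max(a - elapsed, 0), b - elapsed).
    return random.uniform(max(a - elapsed, 0.0), b - elapsed)

# The reboot action from the example: originally U(0,1), already enabled for 0.5
# time units when the state changes, so effectively U(0, 0.5) in the new state.
print(remaining_uniform(0.0, 1.0, 0.5))
print(remaining_exponential(1.0, 0.5))    # the crash clock is unaffected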

10 Policies Actions as controllable events
We can choose to disable an action even if its enabling condition is satisfied A policy determines the set of actions to keep enabled at any given time during execution An action is a controllable event, and this means that we can choose to disable an action in a state even if its enabling condition is satisfied. A policy for a GSMDP determines the set of actions to keep enabled at any given time during execution.

11 Rewards and Optimality
Lump sum reward k(s,e,s′) associated with transition from s to s′ caused by e Continuous reward rate r(s,A) associated with A being enabled in s Infinite-horizon discounted reward Unit reward earned at time t counts as e^(−αt), where α is the discount rate Optimal choice may depend on entire execution history We assume a traditional reward structure, with a lump sum reward associated with a state transition caused by a specific event, and a continuous reward rate associated with an action being enabled in a state. The optimality criterion we consider is infinite-horizon discounted reward. The optimal action choice at any point in time may depend on the entire execution history including the time spent in the current state, so a policy in its most general form is a mapping from execution histories to sets of actions.
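For concreteness, the discounted value of one execution history under this reward structure can be written as follows (our own notation: s_i is the state entered at time t_i via event e_i, A_i the set of actions kept enabled there, and α the discount rate):

v = Σ_i [ e^(−α t_{i+1}) · k(s_i, e_{i+1}, s_{i+1}) + ∫ from t_i to t_{i+1} of e^(−α t) · r(s_i, A_i) dt ]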

12 GSMDP Solution Method
[Pipeline: GSMDP → continuous-time MDP via phase-type distributions (approximation) → discrete-time MDP via uniformization [Jensen 1953]; the resulting MDP policy is executed as a GSMDP policy by simulating phase transitions] So how can we solve GSMDPs? In general, these models are quite complex, and we do not expect to be able to solve them optimally. What we propose is to use phase-type distributions to approximate a GSMDP with a continuous-time MDP. This technique is also known as “the method of supplementary variables”. Once we have a continuous-time MDP, we can use uniformization to obtain a discrete-time MDP which can be solved using standard MDP solution techniques. In order to execute the policy we obtain for the MDP in the original GSMDP model, we have to simulate the phase transitions.
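As a rough illustration of the uniformization step, here is a generic sketch under our own conventions (not the planner's code): pick a uniformization constant Λ no smaller than any total exit rate, turn rates into probabilities, add self-loops for the leftover rate, and use Λ/(Λ + α) as the discrete-time discount factor.

import numpy as np

def uniformize(rate_matrices, alpha):
    """Turn a continuous-time MDP into a discrete-time MDP (Jensen 1953).

    rate_matrices: dict mapping each action to an (S x S) array of transition
    rates between distinct states (diagonal entries are ignored).
    alpha: continuous-time discount rate.
    """
    exit_rates = {a: R.sum(axis=1) - np.diag(R) for a, R in rate_matrices.items()}
    Lam = max(q.max() for q in exit_rates.values())    # uniformization constant
    probs = {}
    for a, R in rate_matrices.items():
        P = (R - np.diag(np.diag(R))) / Lam             # probability of each real jump
        P += np.diag(1.0 - exit_rates[a] / Lam)         # self-loop soaks up the remaining rate
        probs[a] = P
    return probs, Lam / (Lam + alpha)                   # transition matrices, discount factor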

13 Continuous Phase-Type Distributions [Neuts 1981]
Time to absorption in a continuous-time Markov chain with n transient states [Diagrams: Exponential (one phase, exit rate λ); two-phase Coxian (phase 1 with rate pλ1 to phase 2 and rate (1 − p)λ1 to absorption, phase 2 exiting at rate λ2); n-phase generalized Erlang (rate λ in every phase, with probability p of continuing through phases 2..n and probability 1 − p of bypassing them)] A continuous phase-type distribution with n phases is simply a continuous-time Markov chain with n transient states. The distribution represents the time to absorption in the transient Markov chain. The simplest continuous phase-type distribution consists of a single transient state, with a rate λ of exiting this state. This is nothing else than the exponential distribution. A more interesting phase-type distribution is the two-phase Coxian distribution. After exiting the first phase, there is a probability (1 − p) of exiting without entering the second phase. Another phase-type distribution is the n-phase generalized Erlang distribution. It has the same exit rate for all phases, but there is a probability (1 − p) of bypassing the remaining phases after the first phase.
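A quick way to see these three families is to sample from them. The following sketch is our own code (parameter names hypothetical) and follows the structure just described, with p the probability of continuing past the first phase.

import random

def sample_exponential(lam):
    return random.expovariate(lam)

def sample_two_phase_coxian(lam1, lam2, p):
    # Spend Exp(lam1) in phase 1; with probability p continue to phase 2.
    t = random.expovariate(lam1)
    if random.random() < p:
        t += random.expovariate(lam2)
    return t

def sample_generalized_erlang(n, lam, p):
    # Exp(lam) in phase 1; with probability p go through the remaining n - 1 phases.
    t = random.expovariate(lam)
    if random.random() < p:
        for _ in range(n - 1):
            t += random.expovariate(lam)
    return t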

14 Approximating GSMDP with Continuous-time MDP
Approximate each distribution Ge with a continuous phase-type distribution Phases become part of state description Phases represent discretization into random-length intervals of the time events have been enabled To obtain a continuous-time MDP, we approximate each distribution associated with a GSMDP event with a continuous phase-type distribution. The phases of the phase-type distribution become part of the state description, which of course means that the size of the state space increases. Intuitively, the phases represent a discretization into random-length intervals of the time events have been enabled.

15 Policy Execution The policy we obtain is a mapping from modified state space to actions To execute a policy we need to simulate phase transitions Times when action choice may change: Triggering of actual event or action Simulated phase transition The policy we obtain for the MDP is a mapping from a modified state space, including phase information, to actions. In order to execute the policy in the original model, we have to simulate phase transitions. The action choice can change at the triggering of actual events and actions, but also at simulated phase transitions. This means that the action choice can change while remaining in a state, and this is important.

16 Method of Moments Approximate general distribution G with phase-type distribution PH by matching the first k moments Mean (first moment): 1 Variance:  2 = 2 – 12 The ith moment: i = E[X i] Coefficient of variation: cv =  /1 There are many ways in which we can approximate a general distribution with a phase type distribution, but we choose to use the method of moments because it is very efficient. The method provides a phase-type distribution that matches the first k moments of the general distribution. The first moment of a distribution is the mean. The variance is the second moment minus the square of the first moment. In general, the ith moment is the expectation of a random variable with the given distribution to the power i. The coefficient of variation is also commonly used instead of the second moment. It is defined as the standard deviation divided by the mean.

17 Matching One Moment Exponential distribution: λ = 1/μ1
To match a single moment, we only need to use an exponential distribution with rate 1 over the mean.

18 Matching Two Moments
[Diagram: the cv² axis with the exponential distribution at cv² = 1] If the square of the coefficient of variation is 1, then we can also match two moments with the same exponential distribution because the coefficient of variation is 1 for all exponential distributions.

19 Matching Two Moments
[Diagram: the cv² axis with the generalized Erlang distribution covering cv² < 1 and the exponential distribution at cv² = 1] If the square coefficient of variation is less than 1, then we can use a generalized Erlang distribution with the number of phases proportional to the inverse of the square coefficient of variation.

20 Matching Two Moments
[Diagram: the cv² axis with the generalized Erlang distribution covering cv² < 1, the exponential distribution at cv² = 1, and the two-phase Coxian distribution covering cv² > 1] If the square coefficient of variation is greater than 1, then we can use a two-phase Coxian distribution. That covers all possible distributions.
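Putting the three cases together, a two-moment fit can be sketched as follows. This uses one standard closed-form parameterization (our own choice; the talk's exact fitting formulas may differ), and it only matches the Erlang case exactly when 1/cv² is an integer.

def match_two_moments(mu1, cv2, eps=1e-9):
    """Choose a phase-type fit from the mean and squared coefficient of variation."""
    if abs(cv2 - 1.0) < eps:
        # cv^2 = 1: an exponential with rate 1/mu1 matches both moments.
        return "exponential", {"lam": 1.0 / mu1}
    if cv2 > 1.0:
        # cv^2 > 1: two-phase Coxian; these parameters match mu1 and cv2 exactly
        # (p is the probability of entering the second phase).
        return "coxian2", {"lam1": 2.0 / mu1, "lam2": 1.0 / (mu1 * cv2), "p": 1.0 / (2.0 * cv2)}
    # cv^2 < 1: Erlang-like, with the phase count proportional to 1/cv^2.
    # A plain n-phase Erlang with rate n/mu1 matches cv^2 exactly only when 1/cv^2
    # is an integer; the slide's generalized Erlang (with a bypass probability
    # after phase 1) covers the values in between -- here we simply round.
    n = max(2, round(1.0 / cv2))
    return "erlang", {"n": n, "lam": n / mu1}

print(match_two_moments(2.0, 5.0))       # the Weibull example later in the talk
print(match_two_moments(0.5, 1.0 / 3))   # the U(0,1) example: 3-phase Erlang, rate 6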

21 Matching Three Moments
Combination of Erlang and two-phase Coxian [Osogami & Harchol-Balter, TOOLS’03] [Diagram: an Erlang chain of phases followed by a two-phase Coxian, with a bypass probability 1 − p after the first phase] We can also match three moments by using a combination of an Erlang and a two-phase Coxian distribution. While we could conceivably match even more moments, there are no general closed-form solutions for doing so at present.

22 The Foreman’s Dilemma When to enable “Service” action in “Working” state? Service Exp(10) Fail G Serviced c = 0.5 Working c = 1 Failed c = 0 I will now show you a few examples to illustrate what we can do with our solution method. First consider a variation of Howard’s “the Foreman’s Dilemma”. A system can be in one of three states: working, serviced, and failed. The reward rate is 1 in the working state, 0.5 in the serviced state, and 0 in the failed state. A failure event, with distribution G, can cause a transition from the working state to the failed state, and the rate at which we return from the failed state is low (1 over 100). At any point in time in the working state, we can enable a service action, which after some delay triggers a transition to the serviced state. The return rate is 1 from this state. We want to remain in the working state for as long as possible, but it may make sense to issue a service action once in a while so that failure can be avoided. Another word for this is “preventive maintenance”. So the question is, when should we enable the service action in the working state? Return Exp(1) Replace Exp(1/100)

23 The Foreman’s Dilemma: Optimal Solution
Find the delay t0, after which the service action is enabled in the working state, that maximizes the expected value v0 To solve this problem optimally, we need to find the value of t0 that maximizes v0 (the expression is given on the slide). Now, this is a nasty expression, and if the failure-time distribution is not “nice”, then we need to use numerical integration and function maximization techniques to solve it. It can be done, but even for this seemingly simple problem, it looks pretty complicated. Y is the time to failure in the “Working” state

24 The Foreman’s Dilemma: SMDP Solution
Same formulas, but restricted choice: Action is immediately enabled (t0 = 0) Action is never enabled (t0 = ∞) The problem can also be solved as a semi-Markov decision process. However, if we use conventional SMDP solution techniques, then we can only choose to enable the action immediately or not at all.

25 The Foreman’s Dilemma: Performance
Failure-time distribution: U(5,x) [Plot: percent of optimal (50–100) versus x (5–50) for the different solution techniques] This graph shows the performance of different solution techniques as a percentage of optimal if the failure-time distribution is a uniform distribution. We can see that the conventional SMDP solution method suffers from not being able to delay the enabling of the service action in the working state. If we solve the problem using phase-type distributions, then we can gain in performance. Matching only a single moment, which just takes into account the mean of a distribution, we perform like the SMDP solution. By matching two or three moments, however, we can get closer to optimal because the phases allow us to delay the enabling of the service action.

26 The Foreman’s Dilemma: Performance
Failure-time distribution: W(1.6x,4.5) [Plot: percent of optimal (50–100) versus x (5–40)] This is for a different choice of failure-time distribution, but we see similar performance here. Matching two or three moments tends to give better performance.

27 System Administration
Network of n machines Reward rate c(s) = k in states where k machines are up One crash event and one reboot action per machine At most one action enabled at any time (single agent) The second example is a system administration problem. We have a network of n machines that can be either up or down, and a reward rate of k in states with k machines up. There is a crash event and a reboot action per machine. We assume that at most one action can be enabled at any time, which corresponds to a single agent system.
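As a tiny illustration of the state space and reward structure (our own encoding, not the planner's), each state can be a bitmask over the n machines, giving 2^n states before phases are added.

n = 5                                     # number of machines (our example size)

def reward_rate(state):
    # c(s) = k: the number of machines that are up in the bitmask state.
    return bin(state).count("1")

num_states = 2 ** n                       # state-space size before adding phases
print(num_states, reward_rate(0b10110))   # 32 states; 3 machines up in this state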

28 System Administration: Performance
Reboot-time distribution: U(0,1) [Plot: expected reward (15–50) versus n (1–13)] This system can no longer be modeled as a semi-Markov decision process, so we only solve it using phase-type distributions. With the reboot time having a uniform distribution in the interval 0 to 1, we can see that by matching more than one moment, we can increase the expected reward. This is possible because the phases permit us to remember that we have already spent some time rebooting a machine.

29 System Administration: Performance
              1 moment            2 moments           3 moments
size (n)   states    time (s)   states    time (s)   states    time (s)
4          16        0.36       32        3.57       112       10.30
5          –         0.82       80        7.72       272       22.33
6          64        1.89       192       16.24      640       40.98
7          128       3.65       448       28.04      1472      69.06
8          256       6.98       1024      48.11      3328      114.63
9          512       16.04      2304      80.27      7424      176.93
10         –         33.58      5120      136.4      16384     291.70
11         2048      66.00      24576     264.17     35840     481.10
12         4096      111.96     53248     646.97     77824     –
13         8192      210.03     114688    –          167936    –
           2^n                  (n+1)2^n             (1.5n+1)2^n
This table shows the increase in the state space that is the result of matching more moments. Clearly there is a tradeoff between speed and quality of the solution. The planning times are for our prototype GSMDP planner, which implements regular value iteration. With a more sophisticated MDP solver, we would expect the planning times to drop significantly.

30 Summary Generalized semi-Markov (decision) processes allow asynchronous events Phase-type distributions can be used to approximate a GSMDP with an MDP Allows us to approximately solve GSMDPs and SMDPs using existing MDP techniques Phase does matter! In summary, we have introduced the generalized semi-Markov decision process as a model for stochastic decision processes with asynchronous events. We have shown how to solve these problems approximately by using phase-type distributions, which allows us to use existing MDP techniques to solve GSMDPs and SMDPs. Our empirical evaluation shows that phase can make a difference in the quality of the solution.

31 Future Work Discrete phase-type distributions
Handles deterministic distributions Avoids uniformization step Other optimization criteria Finite horizon, etc. Computational complexity of optimal GSMDP planning In the future, we want to use discrete phase-type distributions as well. This would allow us to handle deterministic distributions, and it avoids the uniformization step. We would also like to consider other optimization criteria, in particular finite horizon problems. The finite horizon can be modeled as a deterministic event, and would therefore be easy to deal with if we used discrete phase-type distributions. We also need to investigate the computational complexity of optimal GSMDP planning. In general, we expect the problem to be undecidable, but we know, for example, that if all distributions are exponential, then the problem is no harder than optimal MDP planning (at least for infinite horizon problems).

32 Tempastic-DTP A tool for GSMDP planning:
A simple tool for GSMDP planning, called Tempastic-DTP, is available on the Internet.

33 Matching Moments: Example 1
Weibull distribution: W(1,1/2) μ1 = 2, cv² = 5 [Diagram: the matching two-phase Coxian, with rates 1 and 1/10 and branch probabilities 1/10 and 9/10; plot of F(t) for W(1,1/2) and the Coxian, t from 0 to 8] As an example consider a Weibull distribution with mean 2 and square coefficient of variation 5. We get the two-phase Coxian distribution shown in the diagram.

34 Matching Moments: Example 2
Uniform distribution: U(0,1) μ1 = 1/2, cv² = 1/3 [Diagram: the matching generalized Erlang, a 3-phase chain with rate 6 and p = 1; plot of F(t) for U(0,1) and the Erlang, t from 0 to 2]
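The two examples are easy to verify numerically. The sketch below is our own check: it recomputes μ1 and cv² for W(1,1/2) (read as scale 1 and shape 1/2, using the Weibull moment formula E[X^k] = Γ(1 + k/shape)) and for U(0,1), and confirms that the fits described above, a two-phase Coxian with rates 1 and 1/10 and continuation probability 1/10, and a 3-phase Erlang with rate 6, match those moments.

import math

mu1_w = math.gamma(1 + 1 / 0.5)          # Gamma(3) = 2
mu2_w = math.gamma(1 + 2 / 0.5)          # Gamma(5) = 24
cv2_w = (mu2_w - mu1_w ** 2) / mu1_w ** 2
print(mu1_w, cv2_w)                      # 2.0, 5.0

mu1_u, var_u = 0.5, 1 / 12               # U(0,1): mean 1/2, variance 1/12
print(var_u / mu1_u ** 2)                # 0.333... = 1/3

lam1, lam2, p = 1.0, 0.1, 0.1            # two-phase Coxian fit for the Weibull
mean_cox = 1 / lam1 + p / lam2
m2_cox = 2 / lam1 ** 2 + 2 * p / (lam1 * lam2) + 2 * p / lam2 ** 2
print(mean_cox, (m2_cox - mean_cox ** 2) / mean_cox ** 2)   # 2.0, 5.0

n, lam = 3, 6.0                          # 3-phase Erlang fit for the uniform
print(n / lam, (n / lam ** 2) / (n / lam) ** 2)             # 0.5, 0.333...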

