Slide 1 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Validating Computer System and Network Trustworthiness.


1 Slide 1 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Validating Computer System and Network Trustworthiness Prof. William H. Sanders Department of Electrical and Computer Engineering and Coordinated Science Laboratory University of Illinois at Urbana-Champaign

2 Slide 2 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Course Outline Issues in Model-Based Validation of High-Availability Computer Systems/Networks Combinatorial Modeling Stochastic Activity Network Concepts Analytic/Numerical State-Based Modeling Case Study: Embedded Fault-Tolerant Multiprocessor System Solution by Simulation Symbolic State-space Exploration and Numerical Analysis of State-sharing Composed Models Case Study: Security Evaluation of a Publish and Subscribe System The Art of System Trust Evaluation /Conclusions

3 Slide 3 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. What is Validation? Definition: Valid (Webster’s Third New International Dictionary) – “Able to effect or accomplish what is designed or intended” Two basic notions: – Specification - A description of what a system is supposed to do. – Realization - A description of what a system is and does. Definition (for class): Validation - the process of determining whether a realization meets its specification.

4 Slide 4 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. What’s a System? Many things, but in the context of this session, an embedded system consisting of –hardware –networks –operating systems, and –application software that is intended to be dependable, secure, survivable or have predictable performance.

5 Slide 5 What is Validated? -- Dependability
Dependability is the ability of a system to deliver a specified service.
System service is classified as proper if it is delivered as specified; otherwise it is improper.
System failure is a transition from proper to improper service.
System restoration is a transition from improper to proper service.
[Figure: two states, proper service and improper service, joined by a failure transition and a restoration transition.]
The “properness” of service depends on the user’s viewpoint!
Reference: J.C. Laprie (ed.), Dependability: Basic Concepts and Terminology, Springer-Verlag.

6 Slide 6 Basic Validation Terms
Measures -- What you want to know about a system. Used to determine whether a realization meets a specification.
Models -- Abstractions of the system, at an appropriate level of detail, from which the desired measures of a realization can be determined.
Dependability Model Solution Methods -- Methods by which one determines measures from a model. Models can be solved by a variety of techniques:
–Combinatorial Methods -- Structure of the model is used to obtain a simple arithmetic solution.
–Analytical/Numerical Methods -- A system of linear differential equations or linear equations is constructed and solved to obtain the desired measures.
–Simulation -- The realization of the system is executed, and estimates of the measures are calculated from the resulting executions (also known as sample paths or trajectories).
Möbius supports performance/reliability/availability validation by analytical/numerical and simulation-based methods.

7 Slide 7 Dependability Measures: Availability
Availability - quantifies the alternation between deliveries of proper and improper service.
–$A(t)$ is 1 if service is proper at time $t$, 0 otherwise.
–$E[A(t)]$ (expected value of $A(t)$) is the probability that service is proper at time $t$.
–$A(0,t)$ is the fraction of time the system delivers proper service during $[0,t]$.
–$E[A(0,t)]$ is the expected fraction of time service is proper during $[0,t]$.
–$P[A(0,t) > t^*]$ ($0 \le t^* \le 1$) is the probability that service is proper more than $100t^*\%$ of the time during $[0,t]$.
–$A(0,t)_{t \to \infty}$ is the fraction of time that service is proper in steady state.
–$E[A(0,t)_{t \to \infty}]$ and $P[A(0,t)_{t \to \infty} > t^*]$ are defined as above.
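The interval availability A(0,t) can be illustrated on a single sample path. The sketch below is only an illustration: the representation of a path as a list of (duration, is_proper) pairs and the function name are assumptions, not part of the slides.

```python
def interval_availability(sample_path, t):
    """Fraction of [0, t] spent in proper service.

    sample_path is a list of (duration, is_proper) pairs laid end to end
    from time 0; this representation is an assumption of the sketch.
    Time not covered by the path is counted as improper service.
    """
    if t <= 0:
        raise ValueError("t must be positive")
    elapsed = 0.0
    proper_time = 0.0
    for duration, is_proper in sample_path:
        if elapsed >= t:
            break
        # Clip the last interval so that nothing past t is counted.
        span = min(duration, t - elapsed)
        if is_proper:
            proper_time += span
        elapsed += span
    return proper_time / t


# Illustrative path: up 8, down 2, up 6, down 4 time units.
path = [(8.0, True), (2.0, False), (6.0, True), (4.0, False)]
print(interval_availability(path, 20.0))  # 14 of the 20 time units are proper
```

Averaging this quantity over many independent sample paths would give an estimate of E[A(0,t)].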

8 Slide 8 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Other Dependability Measures Reliability - a measure of the continuous delivery of service –R(t) is the probability that a system delivers proper service throughout [0,t]. Safety - a measure of the time to catastrophic failure –S(t) is the probability that no catastrophic failures occur during [0,t]. –Analogous to reliability, but concerned with catastrophic failures. Time to Failure - measure of the time to failure from last restoration. (Expected value of this measure is referred to as MTTF - Mean time to failure.) Maintainability - measure of the time to restoration from last experienced failure. (Expected value of this measure is referred to as MTTR - Mean time to repair.) Coverage - the probability that, given a fault, the system can tolerate the fault and continue to deliver proper service.

9 Slide 9 How is Validation Done?
[Figure: taxonomy of validation techniques.]
Validation
–Modeling
  –Simulation: Continuous State or Discrete Event (state); Sequential or Parallel
  –Analysis/Numerical: Deterministic or Non-Deterministic; Probabilistic or Non-Probabilistic; State-space-based or Non-State-space-based (Combinatorial)
–Measurement
  –Passive (no fault injection)
  –Active (Fault Injection on Prototype): Without Contact or With Contact; Hardware-Implemented or Software-Implemented; Stand-alone Systems or Networks/Distributed Systems
Möbius supports model-based validation of italicized (red) items.

10 Slide 10 Integrated Validation Procedure
[Figure: integrated validation procedure, showing requirement decomposition, a functional model of the system (probabilistic or logical), a functional model of the relevant subset of the system (Module A, Module B, ..., Module Z), assumptions, and supporting logical arguments and experimentation.]

11 Slide 11 Model Solution Issues (Many More Details will Follow)
In general:
–Use “tricks” from probability theory to reduce complexity of model
–Choose the right solution method
Simulation:
–Result is just an estimator based on a statistical experiment
–Estimation of accuracy of estimate essential
–Use confidence intervals!
Analytic/Numerical model solution:
–Avoid state space explosion: limit model complexity; use structure of model (symmetries) to reduce state space size
–Understand accuracy/limitations of chosen numerical method: transient solution (iterative or direct); steady-state solution
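As a minimal sketch of the advice to use confidence intervals, the function below computes a normal-approximation interval for the mean of independent simulation replications. The function name and the choice of z = 1.96 (roughly 95% coverage for large samples) are assumptions made for illustration; this is not the procedure Möbius uses.

```python
import math
import statistics


def confidence_interval(samples, z=1.96):
    """Normal-approximation confidence interval for the mean of i.i.d. samples.

    z = 1.96 gives roughly 95% coverage when the sample is large.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = statistics.fmean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(n)
    return mean - half_width, mean + half_width


# Illustrative values standing in for the results of five replications.
low, high = confidence_interval([1.0, 2.0, 3.0, 4.0, 5.0])
print(low, high)
```

With few replications, a Student t quantile would be a more careful choice than the fixed z value used here.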

12 Slide 12 Probability Review: Exponential Random Variables
An exponential random variable $X$ with parameter $\lambda$ has the CDF
$$P[X \le t] = F_X(t) = \begin{cases} 0 & t \le 0 \\ 1 - e^{-\lambda t} & t > 0. \end{cases}$$
The density function is given by
$$f_X(t) = \begin{cases} 0 & t \le 0 \\ \lambda e^{-\lambda t} & t > 0. \end{cases}$$
The exponential random variable is the only continuous random variable that is “memoryless.” To see this, let $X$ be an exponential random variable representing the time that an event occurs (e.g., a fault arrival). For $s, t > 0$,
$$P[X > s + t \mid X > s] = \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}} = e^{-\lambda t} = P[X > t].$$
Important Fact 1: The exponential random variable has the memoryless property!
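The memoryless property can be checked numerically from the CDF alone. The sketch below uses hypothetical helper names and illustrative parameter values.

```python
import math


def exp_survival(rate, t):
    """P[X > t] for an exponential random variable with the given rate."""
    return math.exp(-rate * t) if t > 0 else 1.0


def conditional_survival(rate, s, t):
    """P[X > s + t | X > s], computed directly from the definition."""
    return exp_survival(rate, s + t) / exp_survival(rate, s)


rate, s, t = 0.5, 3.0, 2.0  # illustrative values
# The two numbers agree: having survived to s does not change the outlook.
print(conditional_survival(rate, s, t), exp_survival(rate, t))
```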

13 Slide 13 Probability Review: Exponential Event Rate
The fact that the exponential random variable has the memoryless property indicates that the “rate” at which events occur is constant, i.e., it does not change over time.
Often, the event associated with a random variable $X$ is a failure, so the “event rate” is often called the failure rate or the hazard rate.
The event rate of $X$ is defined as the probability that the event associated with $X$ occurs within the small interval $[t, t + \Delta t]$, given that the event has not occurred by time $t$, per the interval size $\Delta t$:
$$\lim_{\Delta t \to 0} \frac{P[t < X \le t + \Delta t \mid X > t]}{\Delta t} = \frac{f_X(t)}{1 - F_X(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda.$$
This can be thought of as looking at $X$ at time $t$, observing that the event has not occurred, and measuring the number of events (probability of the event) that occur per unit of time at time $t$.
Important Fact 2: The exponential random variable has a constant failure rate!

14 Slide 14 Probability Review: Minimum of Two Independent Exponentials
Another interesting property of exponential random variables is that the minimum of two independent exponential random variables is also an exponential random variable.
Let $A$ and $B$ be independent exponential random variables with rates $\alpha$ and $\beta$ respectively. Let us define $X = \min\{A, B\}$. What is $F_X(t)$?
$$\begin{aligned}
F_X(t) &= P[X \le t] \\
&= P[\min\{A, B\} \le t] \\
&= P[A \le t \text{ OR } B \le t] \\
&= 1 - P[A > t \text{ AND } B > t] \\
&= 1 - P[A > t]\, P[B > t] \\
&= 1 - (1 - P[A \le t])(1 - P[B \le t]) \\
&= 1 - (1 - F_A(t))(1 - F_B(t)) \\
&= 1 - (1 - [1 - e^{-\alpha t}])(1 - [1 - e^{-\beta t}]) \\
&= 1 - e^{-\alpha t} e^{-\beta t} \\
&= 1 - e^{-(\alpha + \beta) t}
\end{aligned}$$
Important Fact 3: The minimum of two independent exponential random variables is itself exponential with rate the sum of the two rates!
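The derivation can be checked numerically by evaluating both ends of the chain of equalities. This is a small sketch with hypothetical function names and illustrative rates.

```python
import math


def exp_cdf(rate, t):
    """CDF of an exponential random variable with the given rate."""
    return 1.0 - math.exp(-rate * t) if t > 0 else 0.0


def min_cdf(alpha, beta, t):
    """CDF of min{A, B} for independent A and B, via 1 - P[A > t] P[B > t]."""
    return 1.0 - (1.0 - exp_cdf(alpha, t)) * (1.0 - exp_cdf(beta, t))


alpha, beta = 0.2, 0.7  # illustrative rates
for t in (0.5, 1.0, 4.0):
    # The two columns agree: min{A, B} is exponential with rate alpha + beta.
    print(t, min_cdf(alpha, beta, t), exp_cdf(alpha + beta, t))
```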

15 Slide 15 Probability Review: Competition of Two Independent Exponentials
If $A$ and $B$ are independent and exponential with rates $\alpha$ and $\beta$ respectively, and $A$ and $B$ are competing, then we know that one will “win” with an exponentially distributed time (with rate $\alpha + \beta$). But what is the probability that $A$ wins? Conditioning on the time at which $A$ occurs,
$$P[A < B] = \int_0^\infty P[B > t]\, f_A(t)\, dt = \int_0^\infty e^{-\beta t}\, \alpha e^{-\alpha t}\, dt = \frac{\alpha}{\alpha + \beta}.$$
Important Fact 4: If $A$ and $B$ are independent, competing exponentials with rates $\alpha$ and $\beta$ respectively, the probability that $A$ occurs before $B$ is $\frac{\alpha}{\alpha + \beta}$.
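A short Monte Carlo experiment gives an empirical check of Important Fact 4. The function name, trial count, seed, and rates below are illustrative assumptions; the estimate is itself only a statistical estimator.

```python
import random


def estimate_a_wins(alpha, beta, trials=100_000, seed=1):
    """Estimate P[A < B] for independent exponentials with rates alpha and beta."""
    rng = random.Random(seed)
    wins = sum(
        rng.expovariate(alpha) < rng.expovariate(beta) for _ in range(trials)
    )
    return wins / trials


alpha, beta = 2.0, 3.0  # illustrative rates
# The estimate should be close to alpha / (alpha + beta).
print(estimate_a_wins(alpha, beta), alpha / (alpha + beta))
```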

17 Slide 17 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Combinatorial Methods

18 Slide 18 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Introduction to Combinatorial Methods Combinatorial validation methods are the simplest kind of analytical/numerical techniques and can be used for reliability and availability modeling under certain assumptions. Assumptions are that component failures are independent, and for availability, repairs are independent. When these assumptions hold, simple formulas for reliability and availability exist.

19 Slide 19 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Lecture Outline Review definition of reliability Failure rate System reliability –Maximum –Minimum –k of N Reliability formalisms –Reliability block diagrams –Fault trees –Reliability graphs Reliability modeling process

20 Slide 20 Reliability
One key to building highly available systems is the use of reliable components and systems.
Reliability: The reliability of a system at time $t$, $R(t)$, is the probability that the system operation is proper throughout the interval $[0,t]$.
Probability theory and combinatorics can be directly applied to reliability models.
Let $X$ be a random variable representing the time to failure of a component. The reliability of the component at time $t$ is given by
$$R_X(t) = P[X > t] = 1 - P[X \le t] = 1 - F_X(t).$$
Similarly, we can define unreliability at time $t$ by
$$U_X(t) = P[X \le t] = F_X(t).$$

21 Slide 21 Failure Rate
What is the rate at which a component fails at time $t$? This is the probability, per unit of time, that a component that has not yet failed fails in the interval $(t, t + \Delta t)$, as $\Delta t \to 0$.
Note that we are not looking at $P[X \in (t, t + \Delta t)] \approx f_X(t)\, \Delta t$. Rather, we are seeking $P[X \in (t, t + \Delta t) \mid X > t]$:
$$r_X(t) = \lim_{\Delta t \to 0} \frac{P[X \in (t, t + \Delta t) \mid X > t]}{\Delta t} = \frac{f_X(t)}{1 - F_X(t)} = \frac{f_X(t)}{R_X(t)}.$$
$r_X(t)$ is called the failure rate or hazard rate.
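The failure rate can be evaluated directly as the density divided by the survival probability. In the sketch below the exponential case stays constant, while a Weibull distribution with shape 2 and scale 1 (added here only as a contrasting assumption, not taken from the slides) shows an increasing rate.

```python
import math


def hazard_rate(pdf, cdf, t):
    """Failure rate r(t) = f(t) / (1 - F(t))."""
    return pdf(t) / (1.0 - cdf(t))


lam = 0.25  # illustrative exponential rate


def exp_pdf(t):
    return lam * math.exp(-lam * t)


def exp_cdf(t):
    return 1.0 - math.exp(-lam * t)


# Weibull with shape 2 and scale 1, shown only as a wear-out contrast.
def weibull_pdf(t):
    return 2.0 * t * math.exp(-t * t)


def weibull_cdf(t):
    return 1.0 - math.exp(-t * t)


for t in (0.5, 1.0, 2.0):
    # Exponential column stays at lam; Weibull column grows as 2 * t.
    print(t, hazard_rate(exp_pdf, exp_cdf, t), hazard_rate(weibull_pdf, weibull_cdf, t))
```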

22 Slide 22 Typical Failure Rate
[Figure: failure rate $r_X(t)$ versus time, with three phases: break in, normal operation, and wear out.]

23 Slide 23 System Reliability
While $F_X$ can give the reliability of a component, how do you compute the reliability of a system?
System failure can occur when one, all, or some of the components fail. If one makes the independent failure assumption, system failure can be computed quite simply.
The independent failure assumption states that all component failures of a system are independent, i.e., the failure of one component does not cause another component to be more or less likely to fail.
Given this assumption, one can determine:
1) Minimum failure time of a set of components
2) Maximum failure time of a set of components
3) Probability that k of N components have failed at a particular time t.

24 Slide 24 Maximum of n Independent Failure Times
Let $X_1, \ldots, X_n$ be independent component failure times. Suppose the system fails at time $S$ if all the components fail. Thus, $S = \max\{X_1, \ldots, X_n\}$. What is $F_S(t)$?
$$\begin{aligned}
F_S(t) &= P[S \le t] \\
&= P[X_1 \le t \text{ AND } X_2 \le t \text{ AND } \ldots \text{ AND } X_n \le t] \\
&= P[X_1 \le t]\, P[X_2 \le t] \cdots P[X_n \le t] && \text{by independence} \\
&= F_{X_1}(t)\, F_{X_2}(t) \cdots F_{X_n}(t) && \text{by definition} \\
&= \prod_{i=1}^{n} F_{X_i}(t)
\end{aligned}$$

25 Slide 25 Minimum of n Independent Component Failure Times
Let $X_1, \ldots, X_n$ be independent component failure times. A system fails at time $S$ if any of the components fail. Thus, $S = \min\{X_1, \ldots, X_n\}$. What is $F_S(t)$?
$$F_S(t) = P[S \le t] = P[X_1 \le t \text{ OR } X_2 \le t \text{ OR } \ldots \text{ OR } X_n \le t]$$
This is an application of the law of total probability (LOTP).
[Figure: Venn diagram of events $A_1$, $A_2$, $A_3$.]

26 Slide 26 Minimum cont.
$$\begin{aligned}
F_S(t) &= P[X_1 \le t \text{ OR } X_2 \le t \text{ OR } \ldots \text{ OR } X_n \le t] \\
&= 1 - P[X_1 > t \text{ AND } X_2 > t \text{ AND } \ldots \text{ AND } X_n > t] && \text{by trick} \\
&= 1 - P[X_1 > t]\, P[X_2 > t] \cdots P[X_n > t] && \text{by independence} \\
&= 1 - (1 - P[X_1 \le t])(1 - P[X_2 \le t]) \cdots (1 - P[X_n \le t]) && \text{by LOTP} \\
&= 1 - \prod_{i=1}^{n} \left(1 - F_{X_i}(t)\right)
\end{aligned}$$
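The maximum and minimum results reduce to two one-line products once each component's F(t) has been evaluated at the time of interest. A minimal sketch, with hypothetical function names and illustrative probabilities:

```python
import math


def max_failure_cdf(component_cdfs):
    """F_S(t) when the system fails only after all components fail."""
    return math.prod(component_cdfs)


def min_failure_cdf(component_cdfs):
    """F_S(t) when the system fails as soon as any component fails."""
    return 1.0 - math.prod(1.0 - f for f in component_cdfs)


f_at_t = [0.1, 0.2, 0.05]  # illustrative values of F_Xi(t) at one fixed t
print(max_failure_cdf(f_at_t))  # product of the three values
print(min_failure_cdf(f_at_t))  # 1 - 0.9 * 0.8 * 0.95
```

Both functions rely on the independent failure assumption; they are not valid for dependent components.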

27 Slide 27 k of N
Let $X_1, \ldots, X_N$ be independent component failure times that have identical distributions (i.e., $F_{X_1} = F_{X_2} = \ldots = F_{X_N}$). The system fails at time $S$ if $k$ of the $N$ components fail.
$$\begin{aligned}
F_S(t) &= P[\text{at least } k \text{ components failed by time } t] \\
&= P[k \text{ failed OR } k + 1 \text{ failed OR } \ldots \text{ OR } N \text{ failed}] \\
&= P[k \text{ failed}] + P[k + 1 \text{ failed}] + \ldots + P[N \text{ failed}]
\end{aligned}$$
What is $P[\text{exactly } k \text{ failed}]$?
$$P[k \text{ failed and } (N - k) \text{ have not}] = \binom{N}{k} F_X(t)^k \left(1 - F_X(t)\right)^{N - k},$$
where $F_X(t)$ is the failure distribution of each component. Thus,
$$F_S(t) = \sum_{i=k}^{N} \binom{N}{i} F_X(t)^i \left(1 - F_X(t)\right)^{N - i}$$
by independence and axiom of probability.
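For identically distributed components the k of N formula is a binomial tail sum. The sketch below assumes the value of F_X(t) has already been evaluated at the time of interest; the function name is hypothetical.

```python
from math import comb


def k_of_n_failure_cdf(k, n, f):
    """P[at least k of n components have failed], each with probability f."""
    return sum(comb(n, i) * f**i * (1.0 - f) ** (n - i) for i in range(k, n + 1))


# Illustrative case: 2 of 3 with F_X(t) = 0.1 gives 3 * 0.01 * 0.9 + 0.001.
print(k_of_n_failure_cdf(2, 3, 0.1))
```

Setting k = 1 recovers the minimum case and k = n recovers the maximum case.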

28 Slide 28 k of N in General
For non-identical failure distributions, we must sum over all combinations of at least $k$ failures.
Let $G_k$ be the set of all subsets of $\{X_1, \ldots, X_N\}$ such that each element in $G_k$ is a set of size at least $k$, i.e.,
$$G_k = \{g_i \subseteq \{X_1, \ldots, X_N\} : |g_i| \ge k\}.$$
The set $G_k$ represents all the possible failure scenarios. Now $F_S$ is given by
$$F_S(t) = \sum_{g \in G_k} \left( \prod_{X_i \in g} F_{X_i}(t) \right) \left( \prod_{X_i \notin g} \left(1 - F_{X_i}(t)\right) \right).$$
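The general case can be sketched by enumerating every failure scenario in G_k explicitly. The function name is hypothetical, and the enumeration grows exponentially with the number of components, so it is meant only for small illustrative systems.

```python
from itertools import combinations
from math import prod


def k_of_n_general(k, fs):
    """P[at least k components have failed] for non-identical probabilities fs.

    fs[i] is F_Xi(t) for component i at one fixed time t.
    """
    n = len(fs)
    total = 0.0
    for size in range(k, n + 1):
        # Each subset of indices is one scenario g: exactly these have failed.
        for failed in combinations(range(n), size):
            failed = set(failed)
            total += prod(
                fs[i] if i in failed else 1.0 - fs[i] for i in range(n)
            )
    return total


print(k_of_n_general(2, [0.1, 0.2, 0.3]))
```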

29 Slide 29 Component Building Blocks
Complex systems can be analyzed hierarchically.
Example: A computer fails if both power supplies fail or both memories fail or the CPU fails.
$$F_S(t) = 1 - \left(1 - F_{P1}(t) F_{P2}(t)\right)\left(1 - F_{M1}(t) F_{M2}(t)\right)\left(1 - F_C(t)\right)$$
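The computer example composes the maximum rule (inside each redundant pair) with the minimum rule (across subsystems). A minimal sketch with a hypothetical function name and illustrative probabilities:

```python
def computer_failure_cdf(f_p1, f_p2, f_m1, f_m2, f_c):
    """F_S(t) for the example computer, given each component's F(t) at one t."""
    power_fails = f_p1 * f_p2    # both power supplies must fail
    memory_fails = f_m1 * f_m2   # both memories must fail
    # The system survives only if no subsystem has failed.
    return 1.0 - (1.0 - power_fails) * (1.0 - memory_fails) * (1.0 - f_c)


print(computer_failure_cdf(0.1, 0.1, 0.2, 0.2, 0.05))
```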

30 Slide 30 Summary
A system comprises $N$ components, where the component failure times are given by the random variables $X_1, \ldots, X_N$. The system fails at time $S$ with distribution $F_S$ if:
–all components fail: $F_S(t) = \prod_{i=1}^{N} F_{X_i}(t)$
–one component fails: $F_S(t) = 1 - \prod_{i=1}^{N} \left(1 - F_{X_i}(t)\right)$
–k components fail, identical distributions: $F_S(t) = \sum_{i=k}^{N} \binom{N}{i} F_X(t)^i \left(1 - F_X(t)\right)^{N - i}$
–k components fail, general case: $F_S(t) = \sum_{g \in G_k} \left( \prod_{X_i \in g} F_{X_i}(t) \right) \left( \prod_{X_i \notin g} \left(1 - F_{X_i}(t)\right) \right)$

31 Slide 31 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Reliability Formalisms There are several popular graphical formalisms to express system reliability. The core of the solvers is the methods we have just examined. In particular, we will examine Reliability Block Diagrams Fault Trees Reliability Graphs There is nothing particularly special about these formalisms except their popularity. It is easy to implement these formalisms, or design your own, in a spreadsheet, for example.

32 Slide 32 Reliability Block Diagrams
Blocks represent components. A system failure occurs if there is no path from source to sink.
Series: System fails if any component fails. [Figure: C1, C2, C3 in series between source and sink.]
Parallel: System fails if all components fail. [Figure: C1, C2, C3 in parallel between source and sink.]
k of N: System fails if at least k of N components fail. [Figure: C1, C2, C3 in parallel between source and sink, labeled 2 of 3.]

33 Slide 33 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Example A NASA satellite architecture under study is designed for high reliability. The major computer system components include the CPU system, the high-speed network for data collection and transmission, and the low-speed network for engineering and control. The satellite fails if any of the major systems fail. There are 3 computers, and the computer system fails if 2 or more of the computers fail. Failure distribution of a computer is given by F C. There is a redundant (2) high-speed network, and the high-speed network system fails if both networks fail. The distribution of a high-speed network failure is given by F H. The low-speed network is arranged similarly, with a failure distribution of F L.
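The satellite example can be worked through with the combinatorial rules from the earlier slides. The sketch below assumes that the low-speed network, being "arranged similarly", is also a redundant pair that fails only when both networks fail; the function name and the numeric values are illustrative.

```python
from math import comb


def satellite_failure_cdf(f_c, f_h, f_l):
    """F_S(t) for the satellite, given F_C(t), F_H(t), F_L(t) at one fixed t."""
    # Computer system: fails when at least 2 of the 3 computers have failed.
    computers = sum(comb(3, i) * f_c**i * (1.0 - f_c) ** (3 - i) for i in (2, 3))
    high_speed = f_h**2  # both high-speed networks must fail
    low_speed = f_l**2   # both low-speed networks must fail (assumed pair)
    # The satellite fails if any of the three major systems fails.
    return 1.0 - (1.0 - computers) * (1.0 - high_speed) * (1.0 - low_speed)


print(satellite_failure_cdf(0.1, 0.2, 0.3))
```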

34 Slide 34 RBD Example
[Figure: reliability block diagram of the satellite between source and sink, with computer blocks labeled 2 of 3, HSN blocks, and LSN blocks.]

35 Slide 35 Fault Trees
Components are leaves in the tree.
A component that fails has the logical value true; otherwise it is false.
The nodes in the tree are boolean AND, OR, and k of N gates.
The system fails if the root is true.
AND gates are true if all the components are true (fail).
OR gates are true if any of the components are true (fail).
k of N gates are true if at least k of the components are true (fail).
[Figure: three gates over leaves C1, C2, C3: an AND gate, an OR gate, and a 2 of 3 gate.]

36 Slide 36 Fault Tree Example
[Figure: fault tree with an OR gate at the root over three subtrees: a 2 of 3 gate over C1, C2, C3; an AND gate over H1, H2; and an AND gate over L1, L2.]
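A fault tree with AND, OR, and k of N gates can be evaluated recursively when every leaf is independent and appears only once. The sketch below encodes a tree with an OR root over a 2 of 3 gate (C1, C2, C3) and two AND gates (H1, H2 and L1, L2); the nested-tuple encoding, the function names, and the leaf probabilities are all assumptions made for illustration.

```python
from math import prod


def at_least_k(k, probs):
    """P[at least k of the independent events occur]."""
    # dist[j] = probability that exactly j of the events seen so far occurred
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, d in enumerate(dist):
            nxt[j] += d * (1.0 - p)
            nxt[j + 1] += d * p
        dist = nxt
    return sum(dist[k:])


def failure_probability(node, leaf_probs):
    """Probability that a fault-tree node is true (failed)."""
    if isinstance(node, str):
        return leaf_probs[node]
    gate, *args = node
    if gate == "KOFN":
        k, children = args
    else:
        (children,) = args
    probs = [failure_probability(c, leaf_probs) for c in children]
    if gate == "AND":
        return prod(probs)
    if gate == "OR":
        return 1.0 - prod(1.0 - p for p in probs)
    if gate == "KOFN":
        return at_least_k(k, probs)
    raise ValueError(f"unknown gate {gate!r}")


tree = ("OR", [
    ("KOFN", 2, ["C1", "C2", "C3"]),
    ("AND", ["H1", "H2"]),
    ("AND", ["L1", "L2"]),
])
leaf_probs = {"C1": 0.1, "C2": 0.1, "C3": 0.1,
              "H1": 0.2, "H2": 0.2, "L1": 0.3, "L2": 0.3}
print(failure_probability(tree, leaf_probs))
```

If a component appeared in more than one leaf, the independence assumption would be violated and this direct evaluation would no longer be valid.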

44 Slide 44 Reliability Graphs
The arcs represent components and have failure distributions.
A failure occurs if there is no path from source to sink.
Can implement series: [Figure: arcs $F_{C1}$ and $F_{C2}$ in sequence from source to sink.]
Can implement parallel: [Figure: arcs $F_{C1}$, $F_{C2}$, and $F_{C3}$ side by side from source to sink.]

45 Slide 45 Reliability Graph Example
Reliability graphs can implement more complex interactions. For example, a telephone network “fails” if there is no path from source to sink. How do we solve this?
[Figure: reliability graph from source to sink with arcs A, B, C, D, and E.]

46 Slide 46 Solving by Conditioning
[Figure: events E and F.]

47 Slide 47
First, condition the system on link C being failed. Then the system becomes the series AD in parallel with the series BE.
[Figure: the original graph with arcs A, B, C, D, E, and the reduced graph with arcs A, B, D, E after removing C.]

48 Slide 48
Second, condition the system on link C being up. Nodes 2 and 3 then merge, so the system becomes A in parallel with B, in series with D in parallel with E.
[Figure: reduced graph from source to sink with nodes 1, 2,3 (merged), and 4, and arcs A, B, D, E.]
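The two conditioned cases combine through the law of total probability. The sketch below assumes the graph is a bridge in which A and B leave the source, D and E enter the sink, and C joins the two intermediate nodes and can be traversed in either direction; it cross-checks the conditioned result against a brute-force enumeration. Function names and probabilities are illustrative.

```python
from itertools import product


def bridge_unreliability(f):
    """P[no source-to-sink path], conditioning on link C.

    f maps link names "A".."E" to failure probabilities at one fixed t.
    """
    # C failed: series A-D in parallel with series B-E.
    ad_fails = 1.0 - (1.0 - f["A"]) * (1.0 - f["D"])
    be_fails = 1.0 - (1.0 - f["B"]) * (1.0 - f["E"])
    fails_given_c_down = ad_fails * be_fails
    # C up: (A parallel B) in series with (D parallel E).
    fails_given_c_up = 1.0 - (1.0 - f["A"] * f["B"]) * (1.0 - f["D"] * f["E"])
    return f["C"] * fails_given_c_down + (1.0 - f["C"]) * fails_given_c_up


def bridge_unreliability_brute_force(f):
    """Same quantity by enumerating all 2^5 up/down link states."""
    names = "ABCDE"
    total = 0.0
    for state in product([True, False], repeat=5):
        up = dict(zip(names, state))
        connected = (
            (up["A"] and up["D"]) or (up["B"] and up["E"])
            or (up["A"] and up["C"] and up["E"])
            or (up["B"] and up["C"] and up["D"])
        )
        if not connected:
            weight = 1.0
            for name in names:
                weight *= (1.0 - f[name]) if up[name] else f[name]
            total += weight
    return total


f = {name: 0.1 for name in "ABCDE"}  # illustrative link failure probabilities
print(bridge_unreliability(f), bridge_unreliability_brute_force(f))
```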

49 Slide 49 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Conditioning Fault Trees It is also possible to use conditioning to solve more complex fault trees. If the same component appears more than once in a fault tree, it violates the independent failure assumption. However, a conditioned fault tree can be solved. Example: A component C appears multiple times in the fault tree.

50 Slide 50 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Reliability/Availability Point Estimates Frequently, the desired measure of a reliability model is the reliability at some time t. Thus, the distribution of the system reliability is superfluous; R(t) is the only thing of interest. This condition simplifies computation because all that is necessary for solution is the reliability of the components at time t. Solution then becomes a straightforward computation. If a system is described in terms of the availability of components at time t, then we may compute the system availability in the same way that reliability is computed. The restriction is that all component behaviors must be independent of one another.

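51 Slide 51 Reliability/Availability Tables
A system comprises $N$ components. Reliability of component $i$ at time $t$ is given by $R_{X_i}(t)$, and the availability of component $i$ at time $t$ is given by $A_{X_i}(t)$. System availability $A_S(t)$ has the same form as system reliability $R_S(t)$ in every case, with $A_{X_i}(t)$ in place of $R_{X_i}(t)$.
–system fails if all components fail: $R_S(t) = 1 - \prod_{i=1}^{N} \left(1 - R_{X_i}(t)\right)$
–system fails if one component fails: $R_S(t) = \prod_{i=1}^{N} R_{X_i}(t)$
–system fails if at least k components fail, identical distribution: $R_S(t) = 1 - \sum_{i=k}^{N} \binom{N}{i} \left(1 - R_X(t)\right)^i R_X(t)^{N - i}$
–system fails if at least k components fail, general case: $R_S(t) = 1 - \sum_{g \in G_k} \left( \prod_{X_i \in g} \left(1 - R_{X_i}(t)\right) \right) \left( \prod_{X_i \notin g} R_{X_i}(t) \right)$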

52 Slide 52 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Estimating Component Reliability For hardware, MIL-HDBK-217 is widely used. –Not always current with modern components. –Lacks distributions; it only contains failure rates. –While not perfect, it seems to be the best source that exists. However, numbers from MIL-HDBK-217 should be used with caution. Due to the nature of software, no accepted mechanism exists to predict software reliability before the software is built. –Best guess is the reliability of previously built similar software. In all cases, numbers should be used with caution and adjusted based on observation and experience. No substitute for empirical observation and experience!

53 Slide 53 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Modeling Process Reliability models are built only after proper service is specified. Reliability models are built to answer the question “What subsystem or components must be proper for the system to be proper?” Build models hierarchically out of subsystems. Estimation and guesses are acceptable, but state them explicitly. If unsure, do sensitivity analysis to see how much it matters.

54 Slide 54 Reliability Modeling Process
Realistic systems result in large RBDs and must be managed hierarchically.
RBD Process(system)
  Define the system
  Define “proper service”
  Create RBD out of components
  for each component
    if component is simple
      obtain reliability data of component
    else
      Do RBD Process(component)
    end if
  Compute reliability of system
  Do results meet specification?
  Modify design and repeat as necessary

55 Slide 55 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Summary –Reliability: review of definition –Failure rate –System reliability Independent failure assumption Minimum, maximum, k of N Reliability block diagrams, fault trees, reliability graphs –Reliability modeling process

56 Slide 56 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Stochastic Activity Network Concepts

57 Slide 57 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Session Outline Stochastic Petri nets –Places, tokens, input / output arcs, transitions –Readers / Writers example Stochastic activity networks –Input / output gates, cases, instantaneous and timed activities –Marking dependent behavior, well-specified, general distributions –Simple database server model Reward variables –Reward structures –Reward variable classification –Predicate / function implementation in Möbius Fault-tolerant computer example Composed models –Fault-tolerant computer revisited

58 Slide 58 Introduction
Stochastic activity networks, or SANs, are a convenient, graphical, high-level language for describing system behavior. SANs are useful in capturing the stochastic (or random) behavior of a system.
Examples:
–The amount of time a program takes to execute can be computed precisely if all factors are known, but this is nearly impossible and sometimes useless. At a more abstract level, we can approximate the running time by a random variable.
–Fault arrivals almost always must be modeled by a random process.
We begin by describing a subset of SANs: stochastic Petri nets.

59 Slide 59 Stochastic Petri Net Review
One of the simplest high-level modeling formalisms is called stochastic Petri nets. A stochastic Petri net is composed of the following components:
–places: which contain tokens, and are like variables
–tokens: which are the “value” or “state” of a place
–transitions: which change the number of tokens in places
–input arcs: which connect places to transitions
–output arcs: which connect transitions to places

60 Slide 60 Firing Rules for SPNs
A stochastic Petri net (SPN) executes according to the following rules:
A transition is said to be enabled if, for each place connected by input arcs, the number of tokens in the place is ≥ the number of input arcs connecting the place and the transition.
Example: Transition t1 is enabled.
[Figure: places P1 and P2 connected by input arcs to transition t1.]

61 Slide 61 Firing Rules, cont.
A transition may fire if it is enabled. (More about this later.)
If a transition fires, for each input arc, a token is removed from the corresponding place, and for each output arc, a token is added to the corresponding place.
Example: [Figure: places P1, P2, P3, and P4 around transition t1, shown before and after t1 fires.]
Note: tokens are not necessarily conserved when a transition fires.
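The enabling and firing rules can be sketched with a marking stored as a dictionary from place names to token counts, and arcs stored as dictionaries from place names to arc multiplicities. This representation, the function names, and the small net used below are hypothetical; they do not reproduce the figure on the slide.

```python
from collections import Counter


def is_enabled(marking, input_arcs):
    """True if every input place holds at least as many tokens as input arcs."""
    return all(marking.get(place, 0) >= count for place, count in input_arcs.items())


def fire(marking, input_arcs, output_arcs):
    """Return the marking reached by firing an enabled transition."""
    if not is_enabled(marking, input_arcs):
        raise ValueError("transition is not enabled")
    new_marking = Counter(marking)
    for place, count in input_arcs.items():
        new_marking[place] -= count  # one token removed per input arc
    for place, count in output_arcs.items():
        new_marking[place] += count  # one token added per output arc
    return dict(new_marking)


# Hypothetical net: t1 has one input arc from P1, two from P2, one output arc to P3.
marking = {"P1": 1, "P2": 2, "P3": 0}
t1_inputs = {"P1": 1, "P2": 2}
t1_outputs = {"P3": 1}
print(fire(marking, t1_inputs, t1_outputs))
```

In this example three tokens are consumed and one is produced, which illustrates that tokens are not necessarily conserved.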

62 Slide 62 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Specification of Stochastic Behavior of an SPN A stochastic Petri net is made from a Petri net by –Assigning an exponentially distributed time to all transitions. –Time represents the “delay” between enabling and firing of a transition. –Transitions “execute” in parallel with independent delay distributions. Since the minimum of multiple independent exponentials is itself exponential, time between transition firings is exponential. If a transition t becomes enabled, and before t fires, some other transition fires and changes the state of the SPN such that t is no longer enabled, then t aborts, that is, t will not fire. Since the exponential distribution is memoryless, one can say that transitions that remain enabled continue or restart, as is convenient, without changing the behavior of the network.

63 Slide 63 SPN Example: Readers/Writers Problem
There are at most N requests in the system at a time. Read requests arrive at rate ra, and write requests at rate wa. Any number of readers may read from a file at a time, but only one writer may write at a time. A reader and a writer may not access the file at the same time. Locks are obtained at rate L (for both read and write locks); reads and writes are performed at rates r and w, respectively. Locks are released at rate rel.
Note: an arc labeled N stands for N arcs.

64 Slide 64 SPN Representation of Reader/Writers Problem
[Figure: SPN with arcs labeled N and transitions with rates wa, L, w, rel and ra, L, r, rel]

65 Slide 65 Notes on SPNs
– SPNs are much easier to read, write, modify, and debug than Markov chains.
– SPN to Markov chain conversion can be automated, so that numerical solution methods for Markov chains can be applied.
– Most SPN formalisms include a special type of arc called an inhibitor arc, which enables the connected transition only if there are zero tokens in the associated place, and whose firing function is the identity (do nothing) function. Example: modify the SPN to give writes priority.
– SPNs are limited in their expressive power: they may only perform +, -, >, and test-for-zero operations. These very limited operations make it very difficult to model complex interactions.
– Simplicity allows for certain analyses, e.g., deadlock can be detected in a network protocol modeled by an SPN (if inhibitor arcs are not used).
– More general and flexible formalisms are needed to represent real systems.

66 Slide 66 Stochastic Activity Networks
The need for more expressive modeling languages has led to several extensions to stochastic Petri nets. One extension that we will examine is called stochastic activity networks. Because there are a number of subtle distinctions relative to SPNs, stochastic activity networks use different words to describe ideas similar to those of SPNs. Stochastic activity networks have the following properties:
– A general way to specify that an activity (transition) is enabled
– A general way to specify a completion (firing) rule
– A way to represent zero-timed events
– A way to represent probabilistic choices upon activity completion
– State-dependent parameter values
– General delay distributions on activities

67 Slide 67 SAN Symbols
Stochastic activity networks (hereafter SANs) have four new symbols in addition to those of SPNs:
– Input gate: used to define complex enabling predicates and completion functions
– Output gate: used to define complex completion functions
– Cases: (small circles on activities) used to specify probabilistic choices
– Instantaneous activities: used to specify zero-timed events

68 Slide 68 SAN Enabling Rules
An input gate has two components:
– enabling_function(state) $\to$ boolean; also called the enabling predicate
– input_function(state) $\to$ state; rule for changing the state of the model
An activity is enabled if, for every connected input gate, the enabling predicate is true, and for each input arc, the number of tokens in the connected place is $\ge$ the number of arcs. We use the notation MARK(P) to denote the number of tokens in place P. Note that in Möbius, this would be written as P->MARK.

69 Slide 69 Example SAN Enabling Rule
Example: IG1 Predicate:
if ((MARK(P1) > 0 && MARK(P2) == 0) || (MARK(P1) == 0 && MARK(P2) > 0)) return 1; else return 0;
Activity a1 is enabled if the IG1 predicate is true (1) and MARK(P3) > 0. (Note that in Möbius, “1” is used to denote true.)
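For readers who want to experiment with the rule outside Möbius, the same enabling condition can be sketched in Python. The dictionary-based marking and the function names are assumptions made for this illustration; they are not Möbius syntax.

```python
def ig1_predicate(mark):
    """IG1: exactly one of P1 and P2 holds tokens."""
    return (mark["P1"] > 0 and mark["P2"] == 0) or (mark["P1"] == 0 and mark["P2"] > 0)

def a1_enabled(mark):
    """a1 is enabled if the IG1 predicate holds and the input arc from P3 is satisfied."""
    return ig1_predicate(mark) and mark["P3"] > 0

enabled_example = a1_enabled({"P1": 2, "P2": 0, "P3": 1})
blocked_by_gate = a1_enabled({"P1": 1, "P2": 1, "P3": 1})
blocked_by_arc = a1_enabled({"P1": 0, "P2": 3, "P3": 0})
```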

70 Slide 70 Cases
Cases represent a probabilistic choice of an action to take when an activity completes. When activity a completes, a token is removed from place P1, and with probability $\alpha$ a token is put into place P2, and with probability $1 - \alpha$ a token is put into place P3. Note: cases are numbered, starting with 1, from top to bottom. [Figure: activity a with two cases, labeled $\alpha$ and $1 - \alpha$]

71 Slide 71 Output Gates
When an activity completes, an output gate allows for a more general change in the state of the system. This output gate function is usually expressed using pseudo-C code.
Example: OG Function: MARK(P) = 0; [Figure: activity with cases of probability c and 1 - c]

72 Slide 72 Instantaneous Activities
Another important feature of SANs is the instantaneous activity. An instantaneous activity is like a normal activity except that it completes in zero time after it becomes enabled. Instantaneous activities can be used with input gates, output gates, and cases. Instantaneous activities are useful when modeling events that have an effect on the state of the system, but happen in negligible time with respect to other activities in the system and the performance/dependability measures of interest.

73 Slide 73 SAN Terms
1. activation - time at which an activity begins
2. completion - time at which an activity completes
3. abort - time, after activation but before completion, when an activity is no longer enabled
4. active - the time after an activity has been activated but before it completes or aborts

74 Slide 74 Illustration of SAN Terms
[Figure: activity timelines over time t, marking the enabled period, activation, completion, simultaneous completion and activation, and an aborted activity]

75 Slide 75 Completion Rules
When an activity completes, the following events take place (in the order listed), possibly changing the marking of the network:
1. If the activity has cases, a case is (probabilistically) chosen.
2. The functions of all the connected input gates are executed (in an unspecified order).
3. Tokens are removed from places connected by input arcs.
4. The functions of all the output gates connected to the chosen case are executed (in an unspecified order).
5. Tokens are added to places connected by output arcs connected to the chosen case.
Ordering is important, since the effect of actions can be marking-dependent.
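The five steps above can be sketched as a single Python function. This is a minimal illustration, not Möbius code: the dictionary layout of an activity, the function names, and the example activity (modeled loosely on the cases example, with an added output gate that empties P1) are all assumptions.

```python
import random

def complete(marking, activity, rng=random):
    """Apply the completion rules, in order, and return the new marking."""
    m = dict(marking)
    # 1. Choose a case; probabilities are evaluated in the current marking.
    cases = activity["cases"]
    weights = [case["prob"](m) for case in cases]
    chosen = rng.choices(cases, weights=weights, k=1)[0]
    # 2. Execute the functions of all connected input gates.
    for gate_fn in activity["input_gate_functions"]:
        gate_fn(m)
    # 3. Remove tokens from places connected by input arcs.
    for place in activity["input_arcs"]:
        m[place] -= 1
    # 4. Execute the output gate functions of the chosen case.
    for gate_fn in chosen["output_gate_functions"]:
        gate_fn(m)
    # 5. Add tokens to places connected by output arcs of the chosen case.
    for place in chosen["output_arcs"]:
        m[place] = m.get(place, 0) + 1
    return m

def empty_p1(m):
    """Hypothetical output gate function: MARK(P1) = 0."""
    m["P1"] = 0

def make_activity(alpha):
    """Hypothetical activity: case 1 (prob. alpha) feeds P2, case 2 empties P1 and feeds P3."""
    return {
        "input_gate_functions": [],
        "input_arcs": ["P1"],
        "cases": [
            {"prob": lambda m: alpha, "output_gate_functions": [], "output_arcs": ["P2"]},
            {"prob": lambda m: 1.0 - alpha, "output_gate_functions": [empty_p1], "output_arcs": ["P3"]},
        ],
    }

start = {"P1": 3, "P2": 0, "P3": 0}
always_case_1 = complete(start, make_activity(1.0))
always_case_2 = complete(start, make_activity(0.0))
```

Because the case is chosen before any gate function runs, a marking-dependent case probability sees the marking as it was when the activity completed, which is one reason the ordering matters.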

76 Slide 76 Marking Dependent Behavior
Virtually every parameter may be any function of the state of the model. Examples of these are
– rates of exponential activities
– parameters of other activity distributions
– case probabilities
An example of this usefulness is a model of three redundant computers where the coverage (probability that a single computer crashing crashes the whole system) increases after a failure. [Figure: activity with cases of probability c and 1 - c]

77 Slide 77 Example Problem
A database server is composed of a compute server and three file servers, and can queue up to $N_c$ requests at a time (including the one in service). Requests arrive at rate a and spend on average $1/\mu_{CPU}$ time at the compute server being processed. The request is then forwarded to the file server that has the fewest outstanding requests. Requests are processed at a rate of D1, D2, and D3 for file servers D1, D2, and D3, respectively. File server buffers may hold at most $N_f$ requests (including requests in service); if all buffers are full, the request is discarded.

78 Slide 78 SAN Representation of Example Database Problem
[Figure: SAN with file servers D1, D2, D3]

79 Slide 79 Gate Functions for SAN

80 Slide 80 General Delay Distributions
SANs (and their implementation in Möbius) support many activity time distributions, including: Exponential, Hyperexponential, Deterministic, Weibull, Conditional Weibull, Normal, Erlang, Gamma, Beta, Uniform, Binomial, Negative Binomial.
– All distribution parameters can be marking-dependent.
– The obvious implication of general delay distributions is that there is no conversion to a CTMC. Hence, solution methods for CTMCs are not applicable.
– However, simulation is still possible.
– Analytical/numerical solution is possible for certain mixes of exponential and deterministic activities. See the Möbius manual for details.
– See [Kececioglu 91], for example, for appropriate use of some of these distributions.

81 Slide 81 Fault-Tolerant Computer Failure Model Example
A fault-tolerant computer system is made up of two redundant computers. Each computer is composed of three redundant CPU boards. A computer is operational if at least 1 CPU board is operational, and the system is operational if at least 1 computer is operational. CPU boards fail at a rate of 1 per $10^6$ hours, and there is a 0.5% chance that a board failure will cause a computer failure, and a 0.8% chance that a board will fail in a way that causes a catastrophic system failure.

82 Slide 82 SAN computer for Computer Failure Model

83 Slide 83 Activity Case Probabilities and Input Gate Definition

84 Slide 84 Output Gate Definitions

85 Slide 85 Reward Variables
Reward variables are a way of measuring performance- or dependability-related characteristics about a model. Examples:
– Expected time until service
– System availability
– Number of misrouted packets in an interval of time
– Processor utilization
– Length of downtime
– Operational cost
– Module or system reliability

86 Slide 86 Reward Structures
Reward may be “accumulated” in two different ways:
– A model may be in a certain state or states for some period of time, for example, “CPU idle” states. This is called a rate reward.
– An activity may complete. This is called an impulse reward.
The reward variable is the sum of the rate reward and the impulse reward structures.

87 Slide 87 Reward Structure Example
A web server failure model is used to predict profits. When the web server is fully operational, profits accumulate at $N/hour. In a degraded mode, profits accumulate at a degraded-mode rate. Repairs cost $K.
– Rate reward of a marking m: N if m is a fully functioning marking; the degraded-mode rate if m is a degraded-mode marking; 0 otherwise.
– Impulse reward of an activity a: -K if a is an activity representing repair; 0 otherwise.
By carefully integrating the reward structure from 0 to t, we get the profit at time t. This is an example of an “interval-of-time” variable.

88 Slide 88 Reward Variables
A reward variable is the sum of the impulse and rate reward structures over a certain time. Let [t, t + l] be the interval of time defined for a reward variable:
– If l is 0, then the reward variable is called an instant-of-time reward variable.
– If l > 0, then the reward variable is called an interval-of-time reward variable.
– If l > 0, then dividing an interval-of-time reward variable by l gives a time-averaged interval-of-time reward variable.
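To make the definitions concrete, the following Python sketch accumulates an interval-of-time reward over [0, l] along one sample path, then divides by l for the time-averaged value. Everything numeric here is a hypothetical illustration: the rates (100 per hour when fully operational, 40 per hour when degraded), the repair cost of 500, and the sample path itself are assumptions, not values from the slides.

```python
def interval_reward(path, impulses, end_time, rate_reward, impulse_reward):
    """Accumulate rate and impulse rewards over [0, end_time].

    path: list of (entry_time, marking_label), sorted by time, starting at time 0.
    impulses: list of (completion_time, activity_name).
    """
    total = 0.0
    boundaries = path[1:] + [(end_time, None)]
    for (t0, label), (t1, _) in zip(path, boundaries):
        # Rate reward: reward rate of the marking times the time spent in it.
        total += rate_reward(label) * (min(t1, end_time) - min(t0, end_time))
    # Impulse reward: one contribution per activity completion in the interval.
    total += sum(impulse_reward(name) for t, name in impulses if t <= end_time)
    return total

# Hypothetical reward structure for the web server example.
RATES = {"full": 100.0, "degraded": 40.0}

def rate_reward(label):
    return RATES.get(label, 0.0)

def impulse_reward(name):
    return -500.0 if name == "repair" else 0.0

# Hypothetical sample path: degraded from hour 10 to hour 15, then repaired.
path = [(0.0, "full"), (10.0, "degraded"), (15.0, "full")]
impulses = [(15.0, "repair")]
profit = interval_reward(path, impulses, 20.0, rate_reward, impulse_reward)
time_averaged_profit = profit / 20.0
```

With these assumed numbers the profit over 20 hours is 100·10 + 40·5 + 100·5 − 500 = 1200, so the time-averaged value is 60 per hour.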

89 Slide 89 Reward Variable Specification
A reward structure can be used to define three types of reward variable:
– Instant-of-Time: at time t, or in the limit as t goes to infinity
– Interval-of-Time: over [t, t + l], or in the limit as t goes to infinity, or in the limit as l goes to infinity
– Time-Averaged Interval-of-Time: over [t, t + l], or in the limit as l goes to infinity, or in the limit as t goes to infinity

90 Slide 90 Reward Variables are Random Variables
Since the behavior of a SAN is a stochastic process, and a reward variable is a measure defined on that stochastic process, a reward variable is a random variable. A tool can solve for the reward variables, but solving for the distribution can be difficult in many cases. It is often much simpler to solve for the mean or variance of the reward variable, especially when using numerical techniques. Example reward variables:
– A(0,t) - Fraction of time the system delivers proper service during [0,t]. Hard to compute.
– E[A(0,t)] - Expected value of A(0,t). Easier to compute.

91 Slide 91 Specifying Reward Variables in Möbius
When specifying a rate portion of a reward structure in Möbius, you must define a predicate and function.
– predicate: while true (i.e., integer greater than 0 in C), accumulate the reward
– function: the value (i.e., double in C) to accumulate
Note that both the predicate and function may be any C statement or expression.

92 Slide 92 Reward Variables for Computer Failure Model

93 Slide 93 Reward Variables for Computer Failure Model

94 Slide 94 Model Composition
A composed model is a way of connecting different SANs together to form a larger model. Model composition has two operations:
– Replicate: Combine 2 or more identical SANs and reward structures together, holding certain places common among the replicas.
– Join: Combine 2 or more different SANs and reward structures together, combining certain places to permit communication.

95 Slide 95 Composed Model Specification
– Replicate a submodel a certain number of times
– Hold certain places common to all replicas
– Join two or more submodels together
– Certain places in different submodels can be made common

96 Slide 96 Rationale
There are many good reasons for using composed models.
– Building highly reliable systems usually involves redundancy. The replicate operation models redundancy in a natural way.
– Systems are usually built in a modular way. Replicates and Joins are usually good for connecting together similar and different modules.
– Tools can take advantage of the Strong Lumping Theorem, which allows a tool to generate a Markov process with a smaller state space (to be described in Session 7).

97 Slide 97 Rules for Building Composed Models (as Implemented in Möbius)
– Places that are joined together must have the same name and initial marking.
– Places that are common at a certain level of the tree must be common at all lower levels.
– Places that are common cannot be connected to the input side of an instantaneous activity.

98 Slide 98 Computer Failure Model Revisited: Single Computer Model
(Note that the initial marking of NumComp is two, since there will be two computers in the composed model.)

99 Slide 99 Composed Model for Computer Failure Model

100 Slide 100 Reward Variables for Composed Model

101 Slide 101 Reward Variables for Composed Model

102 Slide 102 Composed Model
How does adding an additional computer affect reliability?
– In the composed model, change the number of replications to 3 and change various reward variables - easy. (Use a global variable if you suspect you may want to do this.)
– In the “flat” model, add another computer - hard.
In the composed model, the number of states in the underlying Markov chain is much smaller, especially for large numbers of replications. (Details will be given in Session 7.)

103 Slide 103 Analytic/Numerical State-Based Modeling

104 Slide 104 Session Outline
– Review of Markov process theory and fundamentals
– Methods for constructing state-level models from SANs
– Analytic/numerical solution techniques
  – Transient solution
    – Standard uniformization (instant-of-time variables)
    – Adaptive uniformization (instant-of-time variables)
    – Interval-of-time uniformization (interval-of-time variables)
  – Steady-state solution (steady-state instant-of-time variables)
    – Direct solution
    – Iterative solution

105 Slide 105 Weaknesses of Simulation
Simulation relies on good pseudo-random number generation, sufficient observations, and good statistical techniques to produce an approximate solution. Increasing accuracy by a factor of n requires on the order of $n^2$ more work, which can be prohibitively expensive. For example, a 5-Nines system reliability model will require approximately 100,000 observations to observe one failure. One digit of accuracy can easily require over 1,000,000 observations! (For many models, 1,000,000 observations can be generated quickly, but as system failure becomes even rarer, standard simulation quickly becomes infeasible.)
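The scaling argument can be illustrated with a rough sample-size estimate. The sketch below assumes that each observation is an independent Bernoulli trial with failure probability p, and that accuracy is measured by the relative half-width of a normal-approximation confidence interval at 95% (z = 1.96); these modeling choices and the function name are assumptions for illustration, not part of the course material.

```python
import math

def observations_for_relative_error(p, rel_error, z=1.96):
    """Observations needed so that z * sqrt(p(1-p)/n) <= rel_error * p."""
    return math.ceil(z * z * (1.0 - p) / (p * rel_error * rel_error))

p_fail = 1e-5                                   # a "5-Nines" model fails with probability 10^-5
per_failure = 1.0 / p_fail                      # expected observations per observed failure
one_digit = observations_for_relative_error(p_fail, 0.1)     # about 10% relative error
finer = observations_for_relative_error(p_fail, 0.05)        # halving the error
```

Under these assumptions, `one_digit` comes out in the tens of millions, and halving the relative error roughly quadruples the required number of observations, matching the $n^2$ growth described above.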

106 Slide 106 The Case for Analytical/Numerical Techniques
If you can model using exponential delays and your model is sufficiently small, continuous time Markov chains (CTMCs) offer some advantages. These include:
– Typically faster solution time for systems with rare events
– Typically takes less time to get more accurate answers
– Typically more confidence in the solution
In order to understand when we get these advantages, we must better understand the methods of obtaining solutions to CTMCs.

107 Slide 107 Random Variable Review
It is often convenient to assign a (real) number to every element in $\Omega$. This assignment, or rule, or function, is called a random variable: $X : \Omega \to \mathbb{R}$. [Figure: an element $\omega$ of $\Omega$ mapped to the point $X(\omega)$ on the real line]

108 Slide 108 Random Process Review
Random processes are useful for characterizing the behavior of real systems. A random process is a collection of random variables indexed by time. Example: X(t) is a random process. Let X(1) be the result of tossing a die. Let X(2) be the result of tossing a die plus X(1), and so on. Notice that time (T) = {1,2,3,...}. One can ask:

109 Slide 109 Random Process Review, cont.
If X is a random process, X(t) is a random variable. Remember that a random variable Y is a function that maps elements in $\Omega$ to elements in $\mathbb{R}$. Therefore, a random process X maps elements in the two-dimensional space $\Omega \times T$ to elements in $\mathbb{R}$. When we fix t, then X becomes a function of $\Omega$ to $\mathbb{R}$. However, if we fix $\omega$, then X becomes a function of T to $\mathbb{R}$. By fixing $\omega$ and observing X as a function of T, we are observing a sample path of X. This is extremely useful for simulation.

110 Slide 110 Describing a Random Process
Recall that for a random variable X, we can use the cumulative distribution $F_X$ to describe the random variable. In general, no such simple description exists for a random process. However, a random process can often be described succinctly in various different ways. For example, if Y is a random variable representing the roll of a die, and X(t) is the sum after t rolls, then we can describe X(t) by
– $X(t) - X(t - 1) = Y$,
– $P[X(t) = i \mid X(t - 1) = j] = P[Y = i - j]$, or
– $X(t) = Y_1 + Y_2 + \cdots + Y_t$, where the $Y_i$’s are independent.
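The second description gives a direct way to compute the distribution of X(t): condition on X(t - 1) and sum over its values. The following Python sketch does this with exact fractions; the function name and the convention X(0) = 0 are assumptions made for the illustration.

```python
from fractions import Fraction

def dice_sum_distribution(t):
    """Return {value: probability} for X(t), the sum of t fair die rolls."""
    die = {face: Fraction(1, 6) for face in range(1, 7)}
    dist = {0: Fraction(1)}            # assumed starting point: X(0) = 0
    for _ in range(t):
        nxt = {}
        for j, pj in dist.items():     # condition on X(t-1) = j
            for y, py in die.items():  # P[X(t) = j + y | X(t-1) = j] = P[Y = y]
                nxt[j + y] = nxt.get(j + y, Fraction(0)) + pj * py
        dist = nxt
    return dist

two_rolls = dice_sum_distribution(2)
```

For two rolls, the most likely sum is 7, with probability 1/6.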

111 Slide 111 Classifying Random Processes: Characteristics of T
If the number of time points defined for a random process, i.e., |T|, is finite or countable (e.g., integers), then the random process is said to be a discrete-time random process. If |T| is uncountable (e.g., real numbers), then the random process is said to be a continuous-time random process. Example: Let X(t) be the number of fault arrivals in a system up to time t. Since $t \in T$ is a real number, X(t) is a continuous-time random process.

112 Slide 112 Classifying Random Processes: State Space Type
Let X be a random process. The state space of a random process is the set of all possible values that the process can take on, i.e., $S = \{y : X(t) = y \text{ for some } t \in T\}$. If X is a random process that models a system, then the state space of X can represent the set of all possible configurations that the system could be in.

113 Slide 113 Random Process State Spaces
If the state space S of a random process X is finite or countable (e.g., S = {1,2,3,...}), then X is said to be a discrete-state random process. Example: Let X be a random process that represents the number of bad packets received over a network. X is a discrete-state random process.
If the state space S of a random process X is infinite and uncountable (e.g., $S = \mathbb{R}$), then X is said to be a continuous-state random process. Example: Let X be a random process that represents the voltage on a telephone line. X is a continuous-state random process.
We examine only discrete-state processes in this lecture.

114 Slide 114 Stochastic-Process Classification Examples
[Table: example processes classified by time (continuous or discrete) and state (continuous or discrete)]

115 Slide 115 Markov Process
A special type of random process that we will examine in detail is called the Markov process. A Markov process can be informally defined as follows. Given the state (value) of a Markov process X at time t (X(t)), the future behavior of X can be described completely in terms of X(t). Markov processes have the very useful property that, given the current state, their future behavior is independent of past values.

116 Slide 116 Markov Chains
A Markov chain is a Markov process with a discrete state space. We will always make the assumption that a Markov chain has a state space in {1,2,...} and that it is time-homogeneous. A Markov chain is time-homogeneous if its future behavior does not depend on what time it is, only on the current state (i.e., the current value). We make this concrete by looking at a discrete-time Markov chain (hereafter DTMC). A DTMC X has the following property:
$P[X(t+k) = j \mid X(t) = i, X(t-1) = n_{t-1}, \ldots, X(0) = n_0] = P[X(t+k) = j \mid X(t) = i]$ (1)
$= P_{ij}^{(k)}$ (2)

117 Slide 117 DTMCs
Notice that given i, j, and k, $P_{ij}^{(k)}$ is a number! $P_{ij}^{(k)}$ can be interpreted as the probability that if X has value i, then after k time-steps, X will have value j. Frequently, we write $P_{ij}$ to mean $P_{ij}^{(1)}$.

120 Slide 120 State Occupancy Probability Vector
Let $\pi$ be a row vector. We denote $\pi_i$ to be the i-th element of the vector. If $\pi$ is a state occupancy probability vector, then $\pi_i(k)$ is the probability that a DTMC has value i (or is in state i) at time-step k. Assume that a DTMC X has a state-space size of n, i.e., S = {1, 2,..., n}. We say formally $\pi_i(k) = P[X(k) = i]$. Note that $\sum_{i=1}^{n} \pi_i(k) = 1$ for all times k.

121 Slide 121 Computing State Occupancy Vectors: A Single Step Forward in Time
If we are given $\pi(0)$ (the initial probability vector), and $P_{ij}$ for $i, j = 1, \ldots, n$, how do we compute $\pi(1)$? Recall the definition of $P_{ij}$: $P_{ij} = P[X(k+1) = j \mid X(k) = i] = P[X(1) = j \mid X(0) = i]$. Since $P[X(1) = j] = \sum_{i=1}^{n} P[X(1) = j \mid X(0) = i] \, P[X(0) = i]$, we have $\pi_j(1) = \sum_{i=1}^{n} \pi_i(0) P_{ij}$.

122 Slide 122 Transition Probability Matrix
Notice that this resembles vector-matrix multiplication. In fact, if we arrange the matrix $P = \{P_{ij}\}$, that is, if
$P = \begin{pmatrix} p_{11} & \cdots & p_{1n} \\ \vdots & \ddots & \vdots \\ p_{n1} & \cdots & p_{nn} \end{pmatrix}$,
then $p_{ij} = P_{ij}$, and $\pi(1) = \pi(0)P$, where $\pi(0)$ and $\pi(1)$ are row vectors, and $\pi(0)P$ is a vector-matrix multiplication. The important consequence of this is that we can easily specify a DTMC in terms of an occupancy probability vector $\pi$ and a transition probability matrix P.

123 Slide 123 Transient Behavior of Discrete-Time Markov Chains
Given $\pi(0)$ and P, how can we compute $\pi(k)$? We can generalize from earlier that $\pi(k) = \pi(k-1)P$. Also, we can write $\pi(k-1) = \pi(k-2)P$, and so $\pi(k) = [\pi(k-2)P]P = \pi(k-2)P^2$. Similarly, $\pi(k-2) = \pi(k-3)P$, and so $\pi(k) = [\pi(k-3)P]P^2 = \pi(k-3)P^3$. By repeating this, it should be easy to see that $\pi(k) = \pi(0)P^k$.

124 Slide 124 A Simple Example
Suppose the weather at Urbana-Champaign, Illinois can be modeled the following way:
– If it’s sunny today, there’s a 60% chance of being sunny tomorrow, a 30% chance of being cloudy, and a 10% chance of being rainy.
– If it’s cloudy today, there’s a 40% chance of being sunny tomorrow, a 45% chance of being cloudy, and a 15% chance of being rainy.
– If it’s rainy today, there’s a 15% chance of being sunny tomorrow, a 60% chance of being cloudy, and a 25% chance of being rainy.
If it’s rainy on Friday, what is the forecast for Monday?

125 Slide 125 Simple Example, cont.
Clearly, the weather model is a DTMC.
1) Future behavior depends on the current state only
2) Discrete time, discrete state
3) Time homogeneous
The DTMC has 3 states. Let us assign 1 to sunny, 2 to cloudy, and 3 to rainy. Let time 0 be Friday.

126 Slide 126 Simple Example Solution
The weather on Saturday $\pi(1)$ is $\pi(1) = \pi(0)P = (0, 0, 1)P = (.15, .6, .25)$, that is, 15% chance sunny, 60% chance cloudy, 25% chance rainy. The weather on Sunday $\pi(2)$ is $\pi(2) = \pi(1)P = (.3675, .465, .1675)$. The weather on Monday $\pi(3)$ is $\pi(3) = \pi(2)P = (.4316, .42, .1484)$, that is, 43% chance sunny, 42% chance cloudy, and 15% chance rainy.
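The forecast can be reproduced with repeated vector-matrix multiplication. The following is a small pure-Python sketch; the function names are assumptions, and the matrix entries are taken directly from the weather model stated earlier.

```python
def step(pi, P):
    """One time-step: pi(k+1)_j = sum_i pi(k)_i * P[i][j]."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]

def transient(pi0, P, k):
    """Compute pi(k) as (...((pi(0) P) P) ...) P, without forming P^k."""
    pi = list(pi0)
    for _ in range(k):
        pi = step(pi, P)
    return pi

# States: index 0 = sunny, 1 = cloudy, 2 = rainy.
P = [
    [0.60, 0.30, 0.10],
    [0.40, 0.45, 0.15],
    [0.15, 0.60, 0.25],
]
friday = [0.0, 0.0, 1.0]          # rainy on Friday (time 0)
saturday = transient(friday, P, 1)
sunday = transient(friday, P, 2)
monday = transient(friday, P, 3)
```

Up to floating-point rounding, `monday` is (0.431625, 0.42, 0.148375), which rounds to the values on the slide.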

127 Slide 127 Solution, cont.
Alternatively, we could compute $P^3$, since we found $\pi(3) = \pi(0)P^3$. Working out solutions by hand can be tedious and error-prone, especially for “larger” models (i.e., models with many states). Software packages are used extensively for this sort of analysis. Software packages compute $\pi(k)$ by $(\ldots((\pi(0)P)P)P\ldots)P$ rather than computing $P^k$, since computing the latter results in a large “fill-in.”

128 Slide 128 Graphical Representation
It is frequently useful to represent the DTMC as a directed graph. Nodes represent states, and edges are labeled with probabilities. For example, our weather prediction model would look like this: [Figure: three-node directed graph; 1 = Sunny Day, 2 = Cloudy Day, 3 = Rainy Day]

129 Slide 129 “Simple Computer” Example
[Figure: three-state DTMC with states X = 1 (computer idle), X = 2 (computer working), and X = 3 (computer failed), and edges labeled $P_{idle}$, $P_{arr}$, $P_{busy}$, $P_{com}$, $P_r$, $P_{fi}$, $P_{fb}$, $P_{ff}$]

130 Slide 130 Limiting Behavior of DTMCs
It is sometimes useful to know the time-limiting behavior of a DTMC. This translates into the “long term,” where the system has settled into some steady-state behavior. Formally, we are looking for $\lim_{n \to \infty} \pi(n)$. To compute this, what we want is $\lim_{n \to \infty} P^n$. There are various ways to compute this. The simplest is to calculate $\pi(n)$ for increasingly large n, and when $\pi(n + 1) \approx \pi(n)$, we can believe that $\pi(n)$ is a good approximation to steady-state. This can be rather inefficient if n needs to be large.
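The “iterate until it stops changing” approach can be sketched as follows, using the weather model as the test case. The stopping tolerance, the iteration cap, and the function name are assumptions chosen for this illustration.

```python
def iterate_to_steady_state(pi0, P, tol=1e-12, max_steps=100000):
    """Repeat pi <- pi P until successive vectors differ by less than tol."""
    n = len(P)
    pi = list(pi0)
    for steps in range(1, max_steps + 1):
        nxt = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, pi)) < tol:
            return nxt, steps
        pi = nxt
    raise RuntimeError("no convergence; the chain may be periodic or slowly mixing")

P = [
    [0.60, 0.30, 0.10],
    [0.40, 0.45, 0.15],
    [0.15, 0.60, 0.25],
]
limit, steps_used = iterate_to_steady_state([0.0, 0.0, 1.0], P)
```

For this chain the iteration settles quickly, and the result agrees with the exact solution (129/283, 114/283, 40/283) of $\pi = \pi P$, which is roughly (0.456, 0.403, 0.141).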

131 Slide 131 Classifications
It is much easier to solve for the steady-state behavior of some DTMCs than others. To determine if a DTMC is “easy” to solve, we need to introduce some definitions. Definition: A state j is said to be accessible from state i if there exists an $n \ge 0$ such that $P_{ij}^{(n)} > 0$. We write $i \to j$. Note: recall that $P_{ij}^{(n)} = P[X(n) = j \mid X(0) = i]$. If one thinks of accessibility in terms of the graphical representation, a state j is accessible from state i if there exists a path of non-zero edges (arcs) from node i to node j.

132 Slide 132 State Classification in DTMCs
Definition: A DTMC is said to be irreducible if every state is accessible from every other state. Formally, a DTMC is irreducible if $i \to j$ for all $i, j \in S$. A DTMC is said to be reducible if it is not irreducible. It turns out that irreducible DTMCs are simpler to solve. One need only solve one linear equation: $\pi = \pi P$. We will see why this is so, but first there is one more issue we must confront.
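Using the graph view of accessibility, irreducibility can be checked by a reachability search from every state. The sketch below assumes a finite chain given as a list of rows and uses zero-based state indices; the function names and the second example matrix are hypothetical.

```python
def accessible_from(P, i):
    """Set of states j with i -> j (includes i itself, since n = 0 is allowed)."""
    seen = {i}
    stack = [i]
    while stack:
        u = stack.pop()
        for v, p in enumerate(P[u]):
            if p > 0 and v not in seen:   # follow only non-zero edges
                seen.add(v)
                stack.append(v)
    return seen

def is_irreducible(P):
    """True if every state is accessible from every other state."""
    n = len(P)
    return all(len(accessible_from(P, i)) == n for i in range(n))

weather = [
    [0.60, 0.30, 0.10],
    [0.40, 0.45, 0.15],
    [0.15, 0.60, 0.25],
]
# Hypothetical chain in which state 2 is absorbing, so states 0 and 1 are not accessible from it.
with_absorbing_state = [
    [0.5, 0.4, 0.1],
    [0.3, 0.6, 0.1],
    [0.0, 0.0, 1.0],
]
```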

133 Slide 133 Periodicity Consider the following DTMC: [Figure: periodic DTMC.] For this DTMC, $\lim_{n \to \infty} \pi(n)$ does not exist. However, $\lim_{n \to \infty} \frac{1}{n} \sum_{k=1}^{n} \pi(k)$ does exist; it is called the time-averaged steady-state distribution, and is denoted by $\pi^*$. Definition: A state i is said to be periodic with period d if $p_{ii}^{(n)} > 0$ only when n is some multiple of d. If d = 1, then i is said to be aperiodic. A steady-state solution for an irreducible DTMC exists if all the states are aperiodic. A time-averaged steady-state solution for an irreducible DTMC always exists.
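A two-state chain that flips state at every step makes the distinction concrete. This chain is an assumed example, not necessarily the one pictured on the slide: its distribution oscillates forever, while the running time average settles.

```python
def step(pi, P):
    """One DTMC step: return the row vector pi * P."""
    n = len(P)
    return [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]


# Assumed example: a chain with period 2 that alternates between its two states.
P = [[0.0, 1.0],
     [1.0, 0.0]]

N = 1000
pi = [1.0, 0.0]
history = []
running_sum = [0.0, 0.0]
for _ in range(N):
    pi = step(pi, P)
    history.append(pi)
    running_sum = [s + x for s, x in zip(running_sum, pi)]

time_avg = [s / N for s in running_sum]
print(history[0], history[1], time_avg)
```

The iterates alternate between the two unit vectors, so the limit of pi(n) does not exist, yet the time average approaches (0.5, 0.5).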

134 Slide 134 Steady-State Solution of DTMCs The steady-state behavior can be computed by solving the linear equation $\pi = \pi P$, with the constraint that $\sum_{i} \pi_i = 1$. For irreducible DTMCs, it can be shown that this solution is unique. If the DTMC is periodic, then this solution yields $\pi^*$. One can understand the equation $\pi = \pi P$ in two different ways.
1. In steady state, the probability distribution satisfies $\pi(n+1) = \pi(n)P$, and by definition $\pi(n+1) = \pi(n)$ in steady state.
2. “Flow” equations. Flow equations require some visualization. Imagine a DTMC graph in which each node is assigned its occupancy probability, that is, the probability that the DTMC has the value of the node.

135 Slide 135 Flow Equations Let $\pi_i P_{ij}$ be the “probability mass” that moves from state i to state j in one time-step. Since probability must be conserved, the probability mass entering a state must equal the probability mass leaving a state. Prob. mass in = Prob. mass out: $\sum_{j} \pi_j P_{ji} = \sum_{j} \pi_i P_{ij} = \pi_i$. Written in matrix form, $\pi = \pi P$. Probability must be conserved, i.e., $\sum_{i} \pi_i = 1$.

136 Slide 136 Continuous Time Markov Chains (CTMCs) For most systems of interest, events may occur at any point in time. This leads us to consider continuous time Markov chains. A continuous time Markov chain (CTMC) has the following property: for all $\tau > 0$, $P[X(t+\tau) = j \mid X(t) = i, \text{history of } X \text{ before } t] = P[X(t+\tau) = j \mid X(t) = i] = p_{ij}(\tau)$. A CTMC is completely described by the initial probability distribution $\pi(0)$ and the transition probability matrix $P(t) = [p_{ij}(t)]$. Then we can compute $\pi(t) = \pi(0) P(t)$. The problem is that $p_{ij}(t)$ is generally very difficult to compute.

137 Slide 137 CTMC Properties This definition of a CTMC is not very useful until we understand some of the properties. First, notice that $p_{ij}(\tau)$ is independent of how long the CTMC has previously been in state i, that is, $P[X(t+\tau) = j \mid X(s) = i \text{ for all } s \in [0,t]] = P[X(t+\tau) = j \mid X(t) = i] = p_{ij}(\tau)$. There is only one continuous random variable that has this property: the exponential random variable. This indicates that CTMCs have something to do with exponential random variables. First, we examine the exponential r.v. in some detail.

138 Slide 138 Exponential Random Variables Recall the property of the exponential random variable. An exponential random variable X with parameter $\lambda$ has the CDF $P[X \le t] = F_X(t) = \begin{cases} 0, & t \le 0 \\ 1 - e^{-\lambda t}, & t > 0. \end{cases}$ The density function is given by $f_X(t) = \begin{cases} 0, & t \le 0 \\ \lambda e^{-\lambda t}, & t > 0. \end{cases}$ The exponential random variable is the only continuous random variable that is “memoryless.” To see this, let X be an exponential random variable representing the time that an event occurs (e.g., a fault arrival). We will show that $P[X > s + t \mid X > s] = P[X > t]$.

139 Slide 139 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Memoryless Property Proof of the memoryless property:
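The body of the proof did not survive in this transcript. A standard derivation, writing $\lambda$ for the parameter of $X$ and taking $s, t > 0$, runs as follows; it is offered as a reconstruction and is not necessarily the exact argument shown on the slide.

```latex
\begin{align*}
P[X > s + t \mid X > s]
  &= \frac{P[X > s + t,\ X > s]}{P[X > s]}
   = \frac{P[X > s + t]}{P[X > s]} \\
  &= \frac{1 - F_X(s + t)}{1 - F_X(s)}
   = \frac{e^{-\lambda (s + t)}}{e^{-\lambda s}}
   = e^{-\lambda t}
   = P[X > t].
\end{align*}
```

In words: having already waited $s$ time units without the event occurring, the remaining waiting time has the same distribution as the original waiting time.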

140 Slide 140 Event Rate The fact that the exponential random variable has the memoryless property indicates that the “rate” at which events occur is constant, i.e., it does not change over time. Often, the event associated with a random variable X is a failure, so the “event rate” is often called the failure rate or the hazard rate. The event rate of X is defined as the probability that the event associated with X occurs within the small interval $[t, t + \Delta t]$, given that the event has not occurred by time t, per the interval size $\Delta t$: $\frac{P[t < X \le t + \Delta t \mid X > t]}{\Delta t}$. This can be thought of as looking at X at time t, observing that the event has not occurred, and measuring the number of events (probability of the event) that occur per unit of time at time t.

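141 Slide 141 Observe that: $\lim_{\Delta t \to 0} \frac{P[t < X \le t + \Delta t \mid X > t]}{\Delta t} = \lim_{\Delta t \to 0} \frac{F_X(t + \Delta t) - F_X(t)}{\Delta t \, (1 - F_X(t))} = \frac{f_X(t)}{1 - F_X(t)}$. In the exponential case, $\frac{f_X(t)}{1 - F_X(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda$. This is why we often say a random variable X is “exponential with rate $\lambda$.”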

142 Slide 142 Minimum of Two Independent Exponentials Another interesting property of exponential random variables is that the minimum of two independent exponential random variables is also an exponential random variable. Let A and B be independent exponential random variables with rates $\alpha$ and $\beta$ respectively. Let us define X = min{A, B}. What is $F_X(t)$?
$F_X(t) = P[X \le t]$
$= P[\min\{A,B\} \le t]$
$= P[A \le t \text{ OR } B \le t]$
$= 1 - P[A > t \text{ AND } B > t]$ (see the combinatorial methods section)
$= 1 - P[A > t] \, P[B > t]$
$= 1 - (1 - P[A \le t])(1 - P[B \le t])$
$= 1 - (1 - F_A(t))(1 - F_B(t))$
$= 1 - (1 - [1 - e^{-\alpha t}])(1 - [1 - e^{-\beta t}])$
$= 1 - e^{-\alpha t} e^{-\beta t}$
$= 1 - e^{-(\alpha + \beta) t}$
Thus, X is exponential with rate $\alpha + \beta$.

143 Slide 143 Competition of Two Independent Exponentials If A and B are independent and exponential with rates $\alpha$ and $\beta$ respectively, and A and B are competing, then we know that one will “win” with an exponentially distributed time (with rate $\alpha + \beta$). But what is the probability that A wins?
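The answer on the slide was lost in extraction. One standard way to obtain it, writing $\alpha$ and $\beta$ for the rates of A and B (these symbol names are an assumption), is to condition on the value of A:

```latex
\begin{align*}
P[A < B]
  &= \int_0^\infty P[B > a] \, f_A(a) \, da
   = \int_0^\infty e^{-\beta a} \, \alpha e^{-\alpha a} \, da \\
  &= \alpha \int_0^\infty e^{-(\alpha + \beta) a} \, da
   = \frac{\alpha}{\alpha + \beta}.
\end{align*}
```

By symmetry, $P[B < A] = \beta / (\alpha + \beta)$, and the two probabilities sum to one because $P[A = B] = 0$ for continuous random variables.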

144 Slide 144 Competing Exponentials in CTMCs Imagine a random process X with state space S = {1,2,3}. X(0) = 1, i.e., P[X(0) = 1] = 1. X goes to state 2 (takes on a value of 2) with an exponentially distributed time with parameter $\alpha$. Independently, X goes to state 3 with an exponentially distributed time with parameter $\beta$. These state transitions are like competing random variables. We say that from state 1, X goes to state 2 with rate $\alpha$ and to state 3 with rate $\beta$. X remains in state 1 for an exponentially distributed time with rate $\alpha + \beta$. This is called the holding time in state 1. Thus, the expected holding time in state 1 is $\frac{1}{\alpha + \beta}$. The probability that X goes to state 2 is $\frac{\alpha}{\alpha + \beta}$. The probability that X goes to state 3 is $\frac{\beta}{\alpha + \beta}$. This is a simple continuous-time Markov chain. [Figure: state 1 with arcs to state 2 (rate $\alpha$) and state 3 (rate $\beta$).]

145 Slide 145 Competing Exponentials vs. a Single Exponential With Choice Consider the following two scenarios:
1. Event A will occur after an exponentially distributed time with rate $\alpha$. Event B will occur after an independent exponential time with rate $\beta$.
2. After waiting an exponential time with rate $\alpha + \beta$, event A occurs with probability $\frac{\alpha}{\alpha + \beta}$ and event B occurs with probability $\frac{\beta}{\alpha + \beta}$.
These two scenarios are indistinguishable. In fact, we frequently interchange the two scenarios rather freely when analyzing a system modeled as a CTMC.
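A quick Monte Carlo experiment can make the equivalence of the two scenarios plausible. The sketch below uses only Python's standard `random` module; the rates 2 and 3, the sample size, the seeds, and all function names are arbitrary assumptions.

```python
import random


def scenario_competing(alpha, beta, rng):
    """Scenario 1: two independent exponential timers race; the earlier one wins."""
    a = rng.expovariate(alpha)
    b = rng.expovariate(beta)
    return min(a, b), ("A" if a < b else "B")


def scenario_choice(alpha, beta, rng):
    """Scenario 2: one exponential wait at rate alpha + beta, then a biased coin."""
    t = rng.expovariate(alpha + beta)
    winner = "A" if rng.random() < alpha / (alpha + beta) else "B"
    return t, winner


def summarize(sampler, alpha, beta, n, seed):
    """Return (mean waiting time, fraction of runs won by A) over n samples."""
    rng = random.Random(seed)
    total_time = 0.0
    a_wins = 0
    for _ in range(n):
        t, winner = sampler(alpha, beta, rng)
        total_time += t
        a_wins += (winner == "A")
    return total_time / n, a_wins / n


mean1, frac1 = summarize(scenario_competing, 2.0, 3.0, 100000, seed=1)
mean2, frac2 = summarize(scenario_choice, 2.0, 3.0, 100000, seed=2)
print(mean1, frac1)
print(mean2, frac2)
```

Both runs should report a mean waiting time near 1/(2 + 3) = 0.2 and a win fraction for A near 2/(2 + 3) = 0.4, up to sampling noise.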

146 Slide 146 State-Transition-Rate Matrix A CTMC can be completely described by an initial distribution $\pi(0)$ and a state-transition-rate matrix. A state-transition-rate matrix $Q = [q_{ij}]$ is defined as follows: $q_{ij} = \begin{cases} \text{rate of going from state } i \text{ to state } j, & i \ne j \\ -\sum_{k \ne i} q_{ik}, & i = j. \end{cases}$ Example: A computer is idle, working, or failed. When the computer is idle, jobs arrive with rate $\alpha$, and they are completed with rate $\beta$. When the computer is working, it fails with rate $\lambda_w$, and with rate $\lambda_i$ when it is idle.

147 Slide 147 “Simple Computer” CTMC Let X = 1 represent “the system is idle,” X = 2 “the system is working,” and X = 3 a failure. [Figure: CTMC over states 1, 2, and 3 with rates $\alpha$, $\beta$, $\lambda_i$, and $\lambda_w$.] If the computer is repaired with rate $\mu$, the new CTMC looks like [Figure: the same CTMC with an added repair transition of rate $\mu$ out of state 3.]
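To make the state-transition-rate matrix concrete, here is a minimal sketch that builds Q for the repairable “Simple Computer.” The parameter names (`alpha`, `beta`, `lam_i`, `lam_w`, `mu`), the numeric values, and the choice that repair returns the computer to the idle state are all assumptions made for illustration; the diagram itself is not recoverable from this transcript.

```python
def build_generator(alpha, beta, lam_i, lam_w, mu):
    """State-transition-rate matrix for states 0 = idle, 1 = working, 2 = failed.

    Off-diagonal entries are transition rates; each diagonal entry is set to
    minus the sum of the other entries in its row, so every row sums to zero.
    Assumption: repair (rate mu) takes the failed computer back to idle.
    """
    Q = [[0.0,   alpha, lam_i],   # idle:    job arrival, failure while idle
         [beta,  0.0,   lam_w],   # working: job completion, failure while working
         [mu,    0.0,   0.0]]     # failed:  repair
    for i in range(3):
        Q[i][i] = -sum(Q[i][j] for j in range(3) if j != i)
    return Q


# Illustrative (assumed) rates per hour.
Q = build_generator(alpha=1.0, beta=2.0, lam_i=0.01, lam_w=0.05, mu=0.5)
for row in Q:
    print(row)
```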

148 Slide 148 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Analysis of “Simple Computer” Model Some questions that this model can be used to answer: –What is the availability at time t? –What is the steady-state availability? –What is the expected time to failure? –What is the expected number of jobs lost due to failure in [0,t]? –What is the expected number of jobs served before failure? –What is the throughput of the system (jobs per unit time), taking into account failures and repairs?

149 Slide 149 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. State-Space Generation from SANs If the activity delays are exponential, it is straightforward to convert a SAN to a CTMC. We first look at the simple case, where there is no composed model.

150 Slide 150 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. State Space (Generated by Möbius)

151 Slide 151 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Underlying Markov Model (State Transition Rates Not Shown)

152 Slide 152 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Reduced Base Model Construction “Reduced Base Model” construction techniques make use of composed model structure to reduce the number of states generated. A state in the reduced base model is composed of a state tree and an impulse reward. During reduced base model construction, the use of state trees permits an algorithm to automatically determine valid lumpings based on symmetries in the composed model. The reduced base model is constructed by finding all possible (state tree, impulse reward) combinations and computing the transition rates between states. Generation of the detailed base model is avoided.

153 Slide 153 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Example Reduced Base Model State Generation Composed Model computer

154 Slide 154 Example Reduced Base Model States and Transitions [Figure: four reduced base model states (state 1 through state 4), each drawn as a state tree rooted at a replicate node R, with transitions labeled covered, uncovered, and catastrophic. The trees shown are: R (NumComp = 2) with two identical computer children (CPUboards = 3); R (NumComp = 2) with computer (CPUboards = 3) and computer (CPUboards = 2); R (NumComp = 1) with computer (CPUboards = 3) and computer (CPUboards = 0); R (NumComp = 0) with computer (CPUboards = 3) and computer (CPUboards = 0).]

155 Slide 155 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Markov Chain of Reduced Base Model (State Transition Rates not Shown)

156 Slide 156 State-Space Generation in Möbius (For generating random process representations of models with all exponential or exponential/deterministic timed activities) [Figure: state-space generator dialog, with the following options annotated.]
– Print out states and reward variables.
– Print out absorbing states. Useful to detect problems when attempting a steady-state solution.
– Place comments, as specified by edit comments, in file.
State-space generation must be done before all analytic/numerical solutions are done.

157 Slide 157 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Numerical/Analytical Solution Techniques 1) Transient Solution –Standard Uniformization (instant-of-time variables) –Adaptive Uniformization (instant-of-time variables) –Interval-of-time Uniformization (expected value, interval-of-time variables) 2) Steady-state Solution –Direct Solution (instant-of-time steady-state variables) –Iterative Solution (instant-of-time steady-state variables)

158 Slide 158 CTMC Transient Solution We have seen that it is easy to specify a CTMC in terms of the initial probability distribution $\pi(0)$ and the state-transition-rate matrix. Earlier, we saw that the transient solution of a CTMC is given by $\pi(t) = \pi(0)P(t)$, and we noted that $P(t)$ was difficult to compute. Due to the complexity of the math, we omit the derivation and show the relationship $\frac{d}{dt}P(t) = P(t)Q$, where Q is the state transition rate matrix of the Markov chain. Solving this differential equation in some form is difficult but necessary to compute a transient solution.

159 Slide 159 Transient Solution Techniques Solutions to $\frac{d}{dt}P(t) = P(t)Q$ can be done in many (dubious) ways*:
– Direct: If the CTMC has N states, one can write $N^2$ ODEs with $N^2$ initial conditions and solve $N^2$ linear equations.
– Laplace transforms: Unstable with multiple “poles.”
– Nth order differential equations: Uses determinants and hence is numerically unstable.
– Matrix exponentiation: $P(t) = e^{Qt}$, where $e^{Qt} = \sum_{k=0}^{\infty} \frac{(Qt)^k}{k!}$.
Matrix exponentiation has some potential. Directly computing $e^{Qt}$ by performing the summation can be expensive and prone to instability. If the CTMC is irreducible, it is possible to take advantage of the fact that $Q = ADA^{-1}$, where D is a diagonal matrix. Computing $e^{Qt}$ becomes $Ae^{Dt}A^{-1}$, where $e^{Dt} = \mathrm{diag}(e^{d_{11}t}, e^{d_{22}t}, \ldots, e^{d_{NN}t})$.
* See C. Moler and C. Van Loan, “Nineteen Dubious Ways to Compute the Exponential of a Matrix,” SIAM Review, vol. 20, no. 4, October 1978.

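160 Slide 160 Standard Uniformization Starting with the CTMC state transition rate matrix Q, construct
1. a Poisson process with rate $\lambda \ge \max_i |q_{ii}|$, and
2. a DTMC with transition matrix $P = I + Q/\lambda$.
Then $\pi(t) = \sum_{k=0}^{\infty} \pi(0) P^k \, e^{-\lambda t} \frac{(\lambda t)^k}{k!}$, where $\pi(0)P^k$ is the k-step state transition probability and $e^{-\lambda t} \frac{(\lambda t)^k}{k!}$ is the probability of k transitions in time t. Choose the truncation point $N_s$ to obtain the desired accuracy: $\pi(t) \approx \sum_{k=0}^{N_s} \pi(k) \, e^{-\lambda t} \frac{(\lambda t)^k}{k!}$. Compute $\pi(k)$ iteratively, $\pi(k+1) = \pi(k)P$, to avoid fill-in.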
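The uniformization recipe can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the Möbius solver: the function names are hypothetical, the matrices are dense lists, the uniformization rate is simply the largest exit rate, and the Poisson weights are computed naively, which underflows when the product of rate and time is large (roughly above 700).

```python
import math


def uniformize(Q):
    """Return (lam, P) with lam = max_i |q_ii| and P = I + Q / lam."""
    n = len(Q)
    lam = max(-Q[i][i] for i in range(n))
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(n)]
         for i in range(n)]
    return lam, P


def transient(pi0, Q, t, eps=1e-10, max_terms=100000):
    """Approximate pi(t) by a truncated uniformization sum.

    The sum stops once the accumulated Poisson probability exceeds 1 - eps,
    so each returned entry is a lower bound that is off by at most eps.
    """
    lam, P = uniformize(Q)
    n = len(Q)
    weight = math.exp(-lam * t)          # Poisson probability of k = 0 jumps
    pik = list(pi0)                      # pi(k) = pi(0) P^k, built iteratively
    result = [weight * x for x in pik]
    covered = weight
    k = 0
    while 1.0 - covered > eps:
        k += 1
        if k > max_terms:
            raise RuntimeError("truncation point not reached")
        pik = [sum(pik[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= lam * t / k            # Poisson probability of k jumps
        covered += weight
        for j in range(n):
            result[j] += weight * pik[j]
    return result, k


# Assumed two-state example with a known closed form:
# pi_0(t) = b/(a+b) + a/(a+b) * exp(-(a+b) t) when starting in state 0.
a, b, t = 1.0, 2.0, 0.5
Q = [[-a, a],
     [b, -b]]
pi_t, terms = transient([1.0, 0.0], Q, t)
exact = b / (a + b) + a / (a + b) * math.exp(-(a + b) * t)
print(pi_t, terms, exact)
```

Only vector-matrix products with P are needed, so a sparse Q stays sparse; that is the "avoid fill-in" point on the slide.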

161 Slide 161 Error Bound in Uniformization The answer computed is a lower bound, since each term in the summation is positive and the summation is truncated. The number of iterations needed to achieve a desired accuracy bound can be computed easily. Error for each state $\le 1 - \sum_{k=0}^{N_s} e^{-\lambda t} \frac{(\lambda t)^k}{k!}$. Choose the error bound $\epsilon$, then compute $N_s$ on-the-fly, as uniformization is done.

162 Slide 162 A Simple Example (In the Reverse Direction) Start with:
– Discrete time Markov chain [Figure: two-state DTMC.]
– Poisson process with rate $\lambda = 2$
Generate a CTMC:
– $Q = \lambda(P - I)$
– Make sense?
– Look at the sum of a geometric number of exponentials (geometric with parameter r).
– Result: exponential with rate $r\lambda$.
– Holding time in state 1 has mean 1/1.4, holding time in state 2 has mean 1.
– Matches that for the CTMC.

163 Slide 163 Transient Uniformization Solver (for transient solution of instant-of-time variables) [Figure: solver dialog, with the following options annotated.]
– Instant-of-time variable time points of interest. Multiple time points may be specified, separated by spaces.
– Number of digits of accuracy in the solution. Solution reported is a lower bound.
– Volume of intermediate results reported. “1” gives the greatest volume, greater numbers less.

164 Slide 164 Adaptive Uniformization Instead of uniformizing at the highest departure rate among all states, uniformize at a rate that changes, and is highest among the “reached” states after particular numbers of transitions. Then $\pi(t) = \sum_{k=0}^{\infty} U_k(t) \, \pi(0) P_0 P_1 \cdots P_{k-1}$, where $U_k(t)$ is the probability of k transitions in time t. In actual computation: $\pi(t) \approx \sum_{k=0}^{N} U_k(t) \, \pi(k)$, with $\pi(k+1) = \pi(k)P_k$.

165 Slide 165 Adaptive Uniformization Solver (atrs) (for transient solution of instant-of-time variables) [Figure: solver dialog, with the following options annotated.]
– Instant-of-time variable time points of interest. Multiple time points may be specified.
– Number of digits of accuracy in the solution. Solution reported is a lower bound.
– Volume of intermediate results reported. “1” gives the greatest volume, greater numbers less.

166 Slide 166 Hints for Effective Use of Uniformization and Adaptive Uniformization
The computation time of trs is primarily determined by the number of iterations.
– The number of iterations is proportional to the time point times the highest departure rate of a state.
Models with high-rate transitions relative to the time point of interest will take a long time to solve.
– E.g., a reliability model with slow failure and fast repair, evaluated at large time points.
Adaptive uniformization is more time-efficient than standard uniformization when high rates are not encountered immediately.
– Use this solver to get transient solutions in this case. See [van Moorsel 94, van Moorsel 97] for details.
For large values of t the result becomes identical to the steady-state result, and will not change any longer if t increases.
– Use the iss solver to detect when this occurs.

167 Slide 167 Accumulated Reward Solver (ars) (solves for expected values of interval-of-time and time-averaged interval-of-time variables on intervals $[t_0, t_1]$ when both $t_0$ and $t_1$ are finite) [Figure: solver dialog, with the following options annotated.]
– Number of digits of accuracy in the solution. Solution reported is a lower bound.
– Series of time intervals for which a solution is desired. Intervals are separated by spaces. Each interval can be specified as $t_1$:$t_2$.
– Volume of intermediate results reported. “1” gives the greatest volume, greater numbers less.
The accumulated reward solver is based on uniformization, so the hints given for the transient solver apply here as well.

168 Slide 168 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Steady-State Behavior of CTMCs since P(t)  0

169 Slide 169 Steady-State Behavior of CTMCs via Flow Equations Another way to arrive at the equation $\pi^* Q = 0$, where $\pi^* = \lim_{t \to \infty} \pi(t)$, is to use the flow equations. The “flow” of probability mass into a state i must equal the “flow” of probability mass out of state i. The “flow” of probability mass from state i to state j is simply $\pi_i q_{ij}$, which is the probability of being in state i times the rate at which transitions from i to j take place. Flow in = flow out: $\sum_{j \ne i} \pi_j q_{ji} = \sum_{j \ne i} \pi_i q_{ij} = -\pi_i q_{ii}$, so $\sum_{j} \pi_j q_{ji} = 0$. In matrix form, for all i, we obtain $\pi Q = 0$.

170 Slide 170 Steady-State Behavior of CTMCs, cont. This yields the elegant equation $\pi^* Q = 0$, where $\pi^* = \lim_{t \to \infty} \pi(t)$ is the steady-state probability distribution. If the CTMC is irreducible, then $\pi^*$ can be computed with the constraint that $\sum_{i} \pi^*_i = 1$. Definition: A CTMC is irreducible if every state in the CTMC is reachable from every other state. If the CTMC is not irreducible, then more complex solution methods are required. Notice that for irreducible CTMCs, the steady-state distribution is independent of the initial-state distribution.

171 Slide 171 Direct Steady-State Solution One steady-state solver in Möbius is the direct steady-state solver. This solver solves the augmented matrix using a form of Gaussian elimination.
Pros: Can get a very accurate solution in a fixed amount of time; “stiffness” (described later) does not affect solution time.
Cons: Solution complexity is $O(n^3)$, so it does not scale well to large models; memory requirements are high due to fill-in and are not known a priori.
Recommendation: Use for small CTMCs (tens of states) or medium-sized and stiff CTMCs (hundreds to a few thousands), or when high accuracy is required.
Reminder: High accuracy in solution does not mean high accuracy in prediction. Use accuracy to do relative comparisons.
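As an illustration of the direct approach, the sketch below solves the steady-state equation together with the normalization constraint by dense Gaussian elimination with partial pivoting. It is a toy version written under assumptions (dense lists, one balance equation replaced by the normalization row, hypothetical function name), and it makes no claim about how the Möbius dss solver is implemented.

```python
def steady_state_direct(Q):
    """Solve pi Q = 0 with sum(pi) = 1 by Gaussian elimination.

    The system is written as A x = b with A = Q^T; the last balance equation
    is redundant for an irreducible chain, so it is replaced by sum(x) = 1.
    """
    n = len(Q)
    A = [[Q[j][i] for j in range(n)] for i in range(n)]   # transpose of Q
    A[n - 1] = [1.0] * n                                  # normalization row
    b = [0.0] * (n - 1) + [1.0]

    # Forward elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[piv][col]) < 1e-300:
            raise ValueError("singular system: chain may not be irreducible")
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            if f != 0.0:
                for c in range(col, n):
                    A[r][c] -= f * A[col][c]
                b[r] -= f * b[col]

    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = b[i] - sum(A[i][c] * x[c] for c in range(i + 1, n))
        x[i] = s / A[i][i]
    return x


# Assumed two-state example: the exact answer is (2/3, 1/3).
Q2 = [[-1.0, 1.0],
      [2.0, -2.0]]
pi2 = steady_state_direct(Q2)

# Assumed three-state idle / working / failed example with illustrative rates.
Q3 = [[-1.01, 1.0, 0.01],
      [2.0, -2.05, 0.05],
      [0.5, 0.0, -0.5]]
pi3 = steady_state_direct(Q3)
residual3 = [sum(pi3[i] * Q3[i][j] for i in range(3)) for j in range(3)]
print(pi2, pi3, residual3)
```

The triple loop in the elimination phase is where the cubic cost comes from, and on a sparse matrix it is also where fill-in would appear.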

172 Slide 172 Direct Steady-State Solver (dss) (for steady-state solution of instant-of-time variables) [Figure: solver dialog, with the following options annotated.]
– Volume of intermediate results reported. “1” gives the greatest volume, greater numbers less.
– Stopping criterion used in the iterative refinement phase, after the direct solution is done.
– Number of rows to search for the “best” pivot when performing LU decomposition.
– “Grace” factor by which elements may become pivots.
– Value that, when multiplied by the smallest matrix element, is the threshold at which elements may be dropped in LU decomposition.

173 Slide 173 Hints for Effective Use of the Direct Steady-State Solver
dss can be used for steady-state distributions if the Markov model consists of a single class of recurrent non-null states. I.e., dss cannot be applied to a model with multiple absorbing states.
– The message invmnorm: zero diagonal element may indicate multiple closed communicating classes. Set the flag Flag Absorbing States in the state-space generator to help determine this.
Fill-in can be a serious problem for larger Markov chains. Therefore, dss is usually not used except for smaller Markov chains (10s or 100s, perhaps 1000s, of states). As the number of states n increases, memory consumption grows and computation time increases as $O(n^3)$.

174 Slide 174 Iterative Solution Methods The simplest iterative solution methods are called stationary iterative methods, and they can be expressed as $\pi(k+1) = \pi(k)M$, where M is a constant (stationary) matrix. Computing $\pi(k+1)$ from $\pi(k)$ requires one vector-matrix multiplication, or one iteration, which on modern workstations is extremely fast. The simplest stationary iterative method for CTMCs is called the power method. Recall $\pi^* Q = 0$. Let $M = Q + I$. Then $\pi(M - I) = 0 \Rightarrow \pi M - \pi = 0 \Rightarrow \pi M = \pi$, which gives the iteration $\pi(k+1) = \pi(k)(Q + I)$. The power method typically converges (gets close to the answer) slowly.
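A minimal power-method sketch follows. One practical detail is an addition here rather than something the slide states: iterating with Q + I directly only behaves like a probability update when every exit rate is at most 1, so the sketch first divides Q by a constant slightly larger than the largest exit rate. That scaling leaves the solutions of the steady-state equation unchanged. The function name, the 1.05 factor, and the example matrix are assumptions.

```python
def power_method(Q, tol=1e-12, max_iter=1000000):
    """Iterate pi(k+1) = pi(k) M with M = I + Q / lam until pi stops changing."""
    n = len(Q)
    # Scale so that M is a stochastic matrix with a strictly positive diagonal.
    lam = 1.05 * max(-Q[i][i] for i in range(n))
    M = [[(1.0 if i == j else 0.0) + Q[i][j] / lam for j in range(n)]
         for i in range(n)]
    pi = [1.0 / n] * n
    for k in range(1, max_iter + 1):
        nxt = [sum(pi[i] * M[i][j] for i in range(n)) for j in range(n)]
        if max(abs(x - y) for x, y in zip(nxt, pi)) < tol:
            return nxt, k
        pi = nxt
    raise RuntimeError("power method did not converge")


# Assumed two-state example: the exact steady state is (2/3, 1/3).
Q = [[-1.0, 1.0],
     [2.0, -2.0]]
pi, iterations = power_method(Q)
print(pi, iterations)
```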

175 Slide 175 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Iterative Solution Characteristics Stationary iterative solution methods have the following characteristics: –Low memory usage (no fill-in); predictable memory usage –Low time per iteration, proportional to the number of non-zero entries –Fast solution time for non-stiff matrices (tens or hundreds of iterations) –Stop when sufficiently accurate –Slow solution time for stiff matrices –Difficult to quantify accuracy, especially for stiff matrices –Easy to implement

176 Slide 176 Convergence of Iterative Methods We say that an iterative solution method converges if $\lim_{k \to \infty} \|\pi(k) - \pi^*\| = 0$. Convergence is of course an important property of any iterative solution method. The rate at which $\pi(k)$ converges to $\pi^*$ is an important problem, but a very difficult one. Loosely, we say method A converges faster than method B if the smallest k such that $\|\pi(k) - \pi^*\| < \epsilon$ is less for A than for B. Which iterative solution method is fastest depends on the Markov chain!

177 Slide 177 Stopping Criteria for Iterative Methods An important consideration in any iterative method is knowing when to stop. Computing the solution exactly is often wasteful and unnecessary. There are two popular methods of determining when to stop.
1. $\|\pi(k+1) - \pi(k)\| < \epsilon$, called the difference norm
2. $\|\pi(k) Q\| < \epsilon$, called the residual norm
The residual norm is usually better, but is sometimes a little more difficult to compute. Both norms do have a relationship with $\|\pi(k) - \pi^*\|$, although that relationship is complex. The unfortunate fact is that the more iterations necessary, the smaller $\epsilon$ must be to guarantee the same accuracy.

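178 Slide 178 Gauss-Seidel One of the most widely used stationary iterative methods is called Gauss-Seidel. The algorithm appears as follows: for k = 0, 1, 2, ... until convergence, and for i = 1 to n, $\pi_i^{(k+1)} = -\frac{1}{q_{ii}} \left( \sum_{j < i} \pi_j^{(k+1)} q_{ji} + \sum_{j > i} \pi_j^{(k)} q_{ji} \right)$. An intuitive explanation for this algorithm: the update enforces $-\pi_i q_{ii} = \sum_{j \ne i} \pi_j q_{ji}$, where the left side is the flow out of node i and the right side is the flow into node i.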

179 Slide 179 SOR There is an extension to Gauss-Seidel called successive over-relaxation, or SOR, that sometimes gives better performance: $\pi_i^{(k+1)} = \omega \left[ -\frac{1}{q_{ii}} \left( \sum_{j < i} \pi_j^{(k+1)} q_{ji} + \sum_{j > i} \pi_j^{(k)} q_{ji} \right) \right] + (1 - \omega) \, \pi_i^{(k)}$. Choosing $\omega$ is a hard problem in general. Automatic techniques for choosing $\omega$ exist but are not implemented in Möbius. Note: $\omega = 1$ is the same as Gauss-Seidel. Recommendation: Leave $\omega = 1$ unless you are solving a similar system many times and the matrix is stiff.
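Gauss-Seidel and SOR differ only in the relaxation weight, so one sketch covers both. The version below is an assumption-laden illustration: it renormalizes after every sweep (a convenience chosen here, not necessarily what the iss solver does), uses the infinity difference norm as its stopping test, and its function and parameter names are hypothetical.

```python
def sor_steady_state(Q, omega=1.0, eps=1e-12, max_iter=100000):
    """Solve pi Q = 0 by SOR sweeps; omega = 1.0 gives plain Gauss-Seidel."""
    n = len(Q)
    pi = [1.0 / n] * n
    for k in range(1, max_iter + 1):
        old = list(pi)
        for i in range(n):
            # Flow into node i, using already-updated entries for j < i
            # (pi is modified in place) and previous-sweep entries for j > i.
            inflow = sum(pi[j] * Q[j][i] for j in range(n) if j != i)
            gauss_seidel = -inflow / Q[i][i]      # flow out must match flow in
            pi[i] = omega * gauss_seidel + (1.0 - omega) * old[i]
        total = sum(pi)
        pi = [x / total for x in pi]              # keep pi a distribution
        if max(abs(x - y) for x, y in zip(pi, old)) < eps:
            return pi, k
    raise RuntimeError("no convergence; try a weight closer to 1")


# Assumed two-state example: the exact steady state is (2/3, 1/3).
Q2 = [[-1.0, 1.0],
      [2.0, -2.0]]
pi2, sweeps2 = sor_steady_state(Q2)

# Assumed three-state idle / working / failed example with illustrative rates.
Q3 = [[-1.01, 1.0, 0.01],
      [2.0, -2.05, 0.05],
      [0.5, 0.0, -0.5]]
pi3, sweeps3 = sor_steady_state(Q3)
residual3 = [sum(pi3[i] * Q3[i][j] for i in range(3)) for j in range(3)]
print(pi2, sweeps2, pi3, sweeps3)
```

A zero diagonal entry (an absorbing state) would make the division fail, which is consistent with the restriction that such models cannot be handled by this kind of iteration.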

180 Slide 180 Iterative Steady-State Solver (iss) (for steady-state solution of instant-of-time variables) [Figure: solver dialog, with the following options annotated.]
– Stopping criterion, expressed as $10^{-x}$, where x is given. The criterion used is the infinity difference norm.
– SOR weight factor. Values > 1 may speed convergence, but the iteration may then fail to converge.
– Maximum number of iterations allowed.

181 Slide 181 Hints for Effective Use of Iterative Steady-State Solver
The iss solver will not work with models with absorbing states. It will print the message iss_solver: zero on the diagonal and quit.
– Use Flag Absorbing States to determine if / which states are absorbing.
The algorithm used in iss stops when the difference norm is less than $10^{-x}$, where x is the stopping criterion value. Normalization is done after stopping, so the actual difference norm could be much less (or more). A value of 9 is typically sufficient. As a rule of thumb, the additional time to get an n-times-as-accurate result is of the order $\log_{10} n$. Hence, increased accuracy tends to be not too costly.
A weight equal to 1 is usually sufficient. Weight less than 1 (e.g., 0.99) guarantees convergence, but typically slows it down. Weights greater than 1 can increase the convergence rate. Weights only slightly larger than “optimum” can cause divergence. Weight must be between 0 and 2.

182 Slide 182 Tips for Using iss
A simple indicator of stiffness is the ratio of the highest rate transition to the lowest rate transition.
– About $10^4$ or $10^5$ may make a problem stiff.
– This is only a rule of thumb. Observed: fast_repair and slow_repair have the same stiffness ratio but considerably different convergence characteristics.
Over-relaxation can be a real time-saver, but only if you can invest the time. For example, $\omega = 1.4$ would work well for all experiments for both studies. For some other model, however, it could diverge.
Use “verbosity” to observe the stopping criterion.
– A “verbosity” of 1000 prints every 1000 iterations.
Use a stopping criterion at least as large as the accuracy required (in digits). Better, take the required accuracy and add the exponent of the stiffness ratio. E.g., a desired accuracy of $10^{-4}$ with a stiffness ratio of $10^5$ means using a stopping criterion of at least 9.
– This is also just a rule of thumb.
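The rule of thumb for choosing the stopping criterion is simple arithmetic on exponents, sketched below. The helper names are hypothetical, and the small example matrix is an assumption chosen so that the stiffness ratio is exactly $10^5$.

```python
import math


def stiffness_ratio(Q):
    """Ratio of the highest to the lowest non-zero transition rate in Q."""
    n = len(Q)
    rates = [Q[i][j] for i in range(n) for j in range(n)
             if i != j and Q[i][j] > 0]
    return max(rates) / min(rates)


def suggested_stopping_criterion(accuracy_digits, Q):
    """Rule of thumb: required digits plus the exponent of the stiffness ratio."""
    exponent = math.ceil(math.log10(stiffness_ratio(Q)) - 1e-9)
    return accuracy_digits + max(exponent, 0)


# Assumed example: fast repair (1e5 per hour) against slow failure (1 per hour).
Q = [[-1.0e5, 1.0e5],
     [1.0, -1.0]]
ratio = stiffness_ratio(Q)
criterion = suggested_stopping_criterion(4, Q)
print(ratio, criterion)
```

With four digits of required accuracy and a ratio of $10^5$, the suggestion is 9, matching the example on the slide; like the slide's advice, it is only a heuristic.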

183 Slide 183 Möbius Analytical Solvers [Table: summary of the Möbius analytical solvers.] Notes:
a. If only rate rewards are used, the time-averaged interval-of-time steady-state measure is identical to the instant-of-time steady-state measure (if both exist).
b. Provided the instant-of-time steady-state distribution is well-defined. Otherwise, the time-averaged interval-of-time steady-state variable is computed and only results for rate rewards should be derived.

184 Slide 184 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Case Study: Fault-Tolerant Embedded Multiprocessor System

185 Slide 185 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Session Outline Problem description Problem solution –Choice of SANs –Choice of activities –Choice of places –Tricks of the trade Discussion of constructed model –Composed model –SAN models Model solution

186 Slide 186 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Problem Origin This problem was originally posed in 1992 as a reliability model of a large, embedded fault-tolerant computer, presumably for space-borne applications. It was posed as a hierarchical model with non-perfect coverage at each level, with the purpose of showing the inadequacy of existing techniques. –Combinatorial methods were incapable of including coverage at all levels of the hierarchy, thus grossly overstating the reliability. –Markov- or SPN-based methods create far too many states to solve. –Monte-Carlo simulation works, but provides only an estimate (which is often not good enough). –A specialized tool was developed to do numerical integration of a semi- Markov process to solve this and similar problems. In Möbius, we solve a smaller version of the same architecture “exactly” using Markov models generated by SANs. This is made possible by automatic state lumping using composed models.

187 Slide 187 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Problem Description System consists of 2 computers Each computer consists of –3 memory modules (2 must be operational) –3 CPU units (2 must be operational) –2 I/O ports (1 must be operational) –2 error-handling chips (non-redundant) Each memory module consists of –41 RAM chips (39 must be operational) –2 interface chips (non-redundant) A CPU consists of 6 non-redundant chips An I/O port consists of 6 non-redundant chips 10 to 20 year operational life

188 Slide 188 Diagram of Fault-Tolerant Multiprocessor System [Figure: block diagram of one computer. Three memory modules (each with 41 RAM chips and 2 interface chips), three CPU modules (6 CPU chips each), two I/O ports (6 I/O chips each), and 2 error-handler chips are connected by an interface bus.]

189 Slide 189 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Definition of “Operational” The system is operational if at least one computer is operational A computer is operational if all the modules are operational –A memory module is operational if at least 39 RAM chips and both interface chips are operational. –A CPU unit is operational if all 6 CPU chips are operational –An I/O port is operational if all 6 I/O chips are operational –The error-handling unit is operational if both error-handling chips are operational Failure rate per chip is 100 failures per 1 billion hours

190 Slide 190 Coverage This system could be modeled using combinatorial methods if we did not take coverage into account. Coverage is the probability that the failure of a component does not cause the larger system to fail, given that sufficient redundancy exists. I.e., coverage is the probability that the fault is contained. The coverage probabilities are given in the following table: [Table: coverage probabilities by redundant component.] For example, if a RAM chip fails, there is a 0.2% chance the memory module will fail even if sufficient redundancy exists. If the memory module fails, there is a 5% chance the computer will fail. If a computer fails, there is a 5% chance the system will fail.

191 Slide 191 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Outline of Solution: List of SANs The model is composed of four SANs: 1. memory_module 2. cpu_module 3. errorhandlers 4. io_port_module Each SAN models the behavior of the module in the event of a module component failure.

192 Slide 192 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. List of Places Seven places represent the state of the system: 1. cpus – the number of operational CPU modules 2. ioports – the number of operational I/O modules 3. errorhandlers – whether the two error-handler chips are operational 4. computer_failed – the number of failed computers 5. memory_failed – the number of failed memory modules 6. memory_chips – number of operational RAM chips 7. interface_chips – number of operational interface chips

193 Slide 193 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. List of Activities Five activities represent failures in the system 1. cpu_failure – the failure of any CPU chip 2. ioport_failure – the failure of any I/O chip 3. errorhandling_chip_failure – the failure of either error-handler chip 4. memory_chip_failure – the failure of a RAM chip 5. interface_chip_failure – the failure of a memory interface chip Cases on these activities represent behavior based on coverage or non-coverage.

194 Slide 194 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Tricks of the Trade Since we intend to solve this model analytically, we want the fewest number of states possible. We don’t care which component failed or what particular failed state the model is in. Therefore, we lump all failure states into the same state. We don’t care which computer or which module is in what state. Therefore, we make use of replication to further reduce the number of states. We use marking-dependent rates to model RAM chip failure, making use of the fact that the minimum of independent exponentials is an exponential. We use cases to denote coverage probabilities, and adjust the probabilities depending on the state of the system.
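The marking-dependent rate trick rests on the fact that the minimum of n independent exponentials, each with rate λ, is exponential with rate nλ. The sketch below checks this empirically; the function names and the sample size are illustrative assumptions.

```python
import random


def marking_dependent_rate(operational_chips, chip_rate):
    """Rate of the single 'some chip fails' activity when
    operational_chips chips are each failing at chip_rate."""
    return operational_chips * chip_rate


def min_of_exponentials(n, rate, rng):
    """Time until the first of n independent exponential(rate) failures."""
    return min(rng.expovariate(rate) for _ in range(n))


rng = random.Random(1)
n_chips, chip_rate = 41, 1e-7
samples = [min_of_exponentials(n_chips, chip_rate, rng) for _ in range(20000)]
empirical_mean = sum(samples) / len(samples)
expected_mean = 1 / marking_dependent_rate(n_chips, chip_rate)
print(empirical_mean, expected_mean)
```

Because of this property, one activity with rate proportional to the number of operational chips can stand in for 41 separate chip-failure activities.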

195 Slide 195 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Composed Model

196 Slide 196 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. cpu_modules SAN

197 Slide 197 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. cpu_modules SAN, cont. cpu_modules input gate predicates and functions: cpu_modules activity time distributions:

198 Slide 198 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. cpu_modules SAN, cont. case 1: chip failure covered case 2: chip failure causes computer failure case 3: chip failure causes system (catastrophic) failure cpu_modules case probabilities for activities:

199 Slide 199 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. cpu_modules SAN, cont. cpu_modules output gate functions:

200 Slide 200 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. errorhandlers SAN

201 Slide 201 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. errorhandlers SAN cont. case 1: chip failure causes computer failure case 2: chip failure causes system failure Input gate definitions for SAN model errorhandlers: Activity time distributions for SAN model errorhandlers: Activity case probabilities for SAN model errorhandlers:

202 Slide 202 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. errorhandlers SAN cont. Output gate definitions for SAN model errorhandlers:

203 Slide 203 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. memory_module SAN Note: memory_module is replicated 3 times, computer_failed and memory_failed held in common.

204 Slide 204 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. memory_chip_failure of memory_modules SAN Input gate definition for SAN model memory_module: Activity time distributions for SAN model memory_module:

205 Slide 205 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. memory_chip_failure of memory_modules SAN, cont. case 1: chip failure, sufficient redundancy case 2: chip failure causes memory_module failure case 3: chip failure causes computer failure case 4: chip failure causes system failure Activity case probabilities for SAN model memory_module:

206 Slide 206 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. memory_chip_failure of memory_modules SAN, cont. Output gate definitions for SAN model memory_module:

207 Slide 207 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. interface_chip_failure of memory_modules SAN case 1: chip failure causes memory module failure case 2: chip failure causes computer failure case 3: chip failure causes system failure Input gate definitions for SAN model memory_module: Activity time distributions for SAN model memory_module: Activity case probabilities for SAN model memory_module:

208 Slide 208 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. interface_chip_failure of memory_modules SAN, cont. Output gate definitions for SAN model memory_module:

209 Slide 209 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. io_port_modules SAN

210 Slide 210 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. io_port_modules SAN, cont. I/O port modules input gate predicates and functions: I/O port modules activity time distributions:

211 Slide 211 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. io_port_modules SAN, cont. case 1: chip failure causes I/O port failure case 2: chip failure causes computer failure case 3: chip failure causes system failure I/O port modules case probabilities for activities:

212 Slide 212 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. io_port_modules SAN, cont. I/O port modules output gate functions:

213 Slide 213 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Model Solution For the modeled two-computer system with non-perfect coverage at all levels (i.e., the model as described), the state space contains 10,114 states. The 10-year mission reliability was computed to be

214 Slide 214 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Impact of Coverage Coverage can have a large impact on reliability and state-space size. Various coverage schemes were evaluated with the following results.

215 Slide 215 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Conclusion Because there are no fast rate transitions, this model affords efficient solution using uniformization. Rep / Join are a natural way to model redundancy and offer great state-space savings. Möbius is able to provide an accurate and efficient solution where previous solutions required simulation or approximations.

216 Slide 216 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Solution by Simulation

217 Slide 217 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Session Outline Advantages and disadvantages of simulation, relative to other model solution methods Review of simulation fundamentals Estimating measures: Estimators and confidence intervals Simulation in Möbius

218 Slide 218 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Motivation High-level formalisms (like SANs) make it easy to specify realistic systems, but they also make it easy to specify systems that have unreasonably large state spaces. State-of-the-art tools (like Möbius) can handle state-level models with a few tens of millions of states, but not more. When state spaces become too large, discrete-event simulation is often a viable alternative. Discrete-event simulation can be used to solve models with arbitrarily large state spaces, as long as the desired measure is not based on a “rare event.” When “rare events” are present, variance reduction techniques can sometimes be used.

219 Slide 219 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Advantages of Simulation Simulation can be applied to any SAN model. The most prominent difference, compared with analytic solvers, is that generally distributed activities can be used. Simulation does not require the generation of a state space and therefore does not require a finite state space. Therefore, much more detailed models can be solved.

220 Slide 220 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Disadvantages of Simulation Simulation only provides an estimate of the desired measure. An approximate confidence interval is constructed that contains the actual result with some user-specified probability. Higher desired accuracy dramatically increases the necessary simulation time. As a rule, to make the confidence interval n times narrower, the simulation has to be run $n^2$ times as long. The “rare event problem” may arise. If simulation is used to estimate a small probability, such as the reliability of a highly-reliable system, extremely long simulations may have to be performed to encounter the particular event often enough. Complicated models can require long simulation times, even if the rare event problem is not an issue. The simulators in Möbius perform the necessary event scheduling very efficiently, but it should be realized that simulation is not a panacea.
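The quadratic rule follows from the confidence-interval half-width shrinking like one over the square root of the number of observations. A minimal sketch, assuming a normal-theory interval with quantile 1.96 (about 95% confidence) and hypothetical function names:

```python
import math


def half_width(std_dev, n_observations, z=1.96):
    """Normal-theory confidence-interval half-width for a sample mean."""
    return z * std_dev / math.sqrt(n_observations)


def runs_needed(current_runs, narrowing_factor):
    """Observations required to make the interval narrowing_factor times narrower."""
    return math.ceil(current_runs * narrowing_factor ** 2)


print(half_width(2.0, 100), half_width(2.0, 400))  # 4x the runs halves the width
print(runs_needed(1000, 10))                       # 10x narrower needs 100x the runs
```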

221 Slide 221 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Simulation as Model Experimentation State-based methods (such as Markov chains) work by enumerating all possible states a system can be in, and then invoking a numerical solution method on the generated state space. Simulation, on the other hand, generates one or more trajectories (possible behaviors from the high-level model), and collects statistics from these trajectories to estimate the desired performance/dependability measures. Just how this trajectory is generated depends on the: –nature of the notion of state (continuous or discrete) –type of stochastic process (e.g., ergodic, reducible) –nature of the measure desired (transient or steady-state) –types of delay distributions considered (exponential or general) We will consider each of these issues in this module, as well as the simulation of systems with rare events.

222 Slide 222 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Types of Simulation Continuous-state simulation is applicable to systems where the notion of state is continuous and typically involves solving (numerically) systems of differential equations. Circuit-level simulators are an example of continuous-state simulation. Discrete-event simulation is applicable to systems in which the state of the system changes at discrete instants of time, with a finite number of changes occurring in any finite interval of time. Since we will focus on validating end-to-end systems, rather than circuits, we will focus on discrete-event simulation. There are two types of discrete-event simulation execution algorithms: –Fixed-time-stamp advance –Variable-time-stamp advance

223 Slide 223 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Fixed-Time-Stamp Advance Simulation Simulation clock is incremented a fixed time $\Delta t$ at each step of the simulation. After each time increment, each event type (e.g., activity in a SAN) is checked to see if it should have completed during the time of the last increment. All event types that should have completed are completed and a new state of the model is generated. Rules must be given to determine the ordering of events that occur in each interval of time. Example: [Figure: timeline divided into equal increments $\Delta t$, with events e1 through e6 falling within the increments.] Good for all models where most events happen at fixed increments of time (e.g., gate-level simulations). Has the advantage that no “future event list” needs to be maintained. Can be inefficient if events occur in a bursty manner, relative to time-step used.
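A toy sketch of the fixed-increment idea: the clock moves in equal steps, and after each step every event whose completion time fell inside the step is processed. The ordering rule used here (by scheduled completion time) and all names are assumptions chosen for illustration.

```python
def fixed_step_run(completion_times, dt, end_time):
    """Advance the clock in steps of dt and report which events completed
    in each step. completion_times maps event name -> completion time."""
    pending = sorted(completion_times.items(), key=lambda item: item[1])
    log = []
    next_index = 0
    step = 0
    clock = 0.0
    while clock < end_time:
        step += 1
        clock = step * dt
        fired = []
        # Ordering rule (assumed): events within a step fire in time order.
        while next_index < len(pending) and pending[next_index][1] <= clock:
            fired.append(pending[next_index][0])
            next_index += 1
        log.append((clock, fired))
    return log


trace = fixed_step_run({"e1": 0.4, "e2": 0.5, "e3": 1.7}, dt=1.0, end_time=3.0)
print(trace)
```

Note that the third step does work even though nothing happens in it, which is the inefficiency mentioned for bursty event patterns.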

224 Slide 224 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Variable-Time Step Advance Simulation Simulation clock advanced a variable amount of time each step of the simulation, to time of next event. If all event times are exponentially distributed, the next event to complete and time of next event can be determined using the equation for the minimum of n exponentials (since memoryless), and no “future event list” is needed. If event times are general (have memory) then “future event list” is needed. Has the advantage (over fixed-time-stamp increment) that periods of inactivity are skipped over, and models with a bursty occurrence of events are not inefficient.
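When every enabled event is exponential, the next event can be drawn without a future event list: the time to the next event is exponential with the sum of the rates, and event i is the one that completes with probability rate_i divided by that sum. A minimal sketch with hypothetical names:

```python
import random


def next_event(rates, rng):
    """rates maps event name -> exponential rate (all enabled events).
    Returns (name of the event that completes next, delay until it completes)."""
    total = sum(rates.values())
    delay = rng.expovariate(total)      # minimum of independent exponentials
    threshold = rng.random() * total    # pick the winner proportionally to its rate
    cumulative = 0.0
    for name, rate in rates.items():
        cumulative += rate
        if threshold < cumulative:
            return name, delay
    return name, delay                  # guard against floating-point round-off


rng = random.Random(0)
winners = [next_event({"a": 3.0, "b": 1.0}, rng)[0] for _ in range(20000)]
fraction_a = winners.count("a") / len(winners)
print(fraction_a)  # should be close to 3 / (3 + 1)
```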

225 Slide 225 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Basic Variable-Time-Step Advance Simulation Loop for SANs A) Set list_of_active_activities to null. B) Set current_marking to initial_marking. C) Generate potential_completion_time for each activity that may complete in the current_marking and add to list_of_active_activities. D) While list_of_active_activities ≠ null: 1) Set current_activity to activity with earliest potential_completion_time. 2) Remove current_activity from list_of_active_activities. 3) Compute new_marking by selecting a case of current_activity, and executing appropriate input and output gates. 4) Remove all activities from list_of_active_activities that are not enabled in new_marking. 5) Remove all activities from list_of_active_activities for which new_marking is a reactivation marking. 6) Select a potential_completion_time for all activities that are enabled in new_marking but not on list_of_active_activities and add them to list_of_active_activities. E) End While.
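The loop above can be sketched in a few lines. This is a simplified illustration, not the Möbius simulator: the `Activity` class, the dictionary used as the future event list, and the toy chip-failure model are all assumptions, cases and gates are folded into a single `fire` function, and step 5 (reactivation) is omitted.

```python
import random


class Activity:
    def __init__(self, name, enabled, sample_delay, fire):
        self.name = name
        self.enabled = enabled            # marking -> bool
        self.sample_delay = sample_delay  # (marking, rng) -> delay
        self.fire = fire                  # (marking, rng) -> None, updates marking


def simulate(activities, marking, rng, max_time):
    by_name = {a.name: a for a in activities}
    now = 0.0
    # Steps A-C: schedule every activity enabled in the initial marking.
    active = {a.name: now + a.sample_delay(marking, rng)
              for a in activities if a.enabled(marking)}
    while active:                                   # step D
        name = min(active, key=active.get)          # D1: earliest completion
        when = active.pop(name)                     # D2
        if when > max_time:
            break
        now = when
        by_name[name].fire(marking, rng)            # D3: compute new marking
        for other in list(active):                  # D4: drop disabled activities
            if not by_name[other].enabled(marking):
                del active[other]
        for a in activities:                        # D6: schedule newly enabled ones
            if a.enabled(marking) and a.name not in active:
                active[a.name] = now + a.sample_delay(marking, rng)
    return now, marking


def chip_fails(marking, rng):
    marking["chips"] -= 1


chip_failure = Activity(
    "chip_failure",
    enabled=lambda m: m["chips"] > 0,
    sample_delay=lambda m, rng: rng.expovariate(m["chips"] * 1e-3),  # marking-dependent rate
    fire=chip_fails,
)
end_time, final_marking = simulate([chip_failure], {"chips": 3},
                                   random.Random(0), max_time=float("inf"))
print(end_time, final_marking)
```

The run ends when no activity is enabled, which in the toy model is when all three chips have failed.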

226 Slide 226 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Types of Discrete-Event Simulation Basic simulation loop specifies how the trajectory is generated, but does not specify how measures are collected, or how long the loop is executed. How measures are collected, and how long (and how many times) the loop is executed depends on type of measures to be estimated. Two types of discrete-event simulation exist, depending on what type of measures are to be estimated. –Terminating - Measures to be estimated are measured at fixed instants of time or intervals of time with fixed finite point and length. This may also include random but finite (in some sense) times, such as a time to failure. –Steady-state - Measures to be estimated depend on instants of time or intervals whose starting points are taken to be $t \to \infty$.

227 Slide 227 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Issues in Discrete-Event Simulation 1) How to generate potential completion times for events 2) How to estimate dependability measures from generated trajectories –Transient measures –Steady-state measures 3) How to implement the basic simulation loop –Sequential or parallel

228 Slide 228 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Generation of Potential Completion Times 1) Generation of uniform [0,1] random variates –Used as a basis for all random variate samples –Types Linear congruential generators Tausworthe generators Other types of generators –Tests of uniform [0,1] generators 2) Generation of non-uniform random variates –Inverse transform technique –Convolution technique –Composition technique –Acceptance-rejection technique –Technique for discrete random variates 3) Recommendations/Issues

229 Slide 229 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Generation of Uniform [0,1] Random Number Samples Goal: Generate sequence of numbers that appears to have come from uniform [0,1] random variable. Importance: Can be used as a basis for all random variates. Issues: 1) Goal isn’t to be random (non-reproducible), but to appear to be random. 2) Many methods to do this (historically), many of them bad (picking numbers out of phone books, computing π to a million digits, counting gamma rays, etc.). 3) Generator should be fast, and not need much storage. 4) Should be reproducible (hence the appearance of randomness, not the reality). 5) Should be able to generate multiple sequences or streams of random numbers.

230 Slide 230 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Linear Congruential Generators (LCGs) Introduced by D. H. Lehmer (1951). He obtained $x_n = a^n \bmod m$, i.e., $x_n = (a x_{n-1}) \bmod m$. Today, LCGs take the following form: $x_n = (a x_{n-1} + b) \bmod m$, where the $x_n$ are integers between 0 and $m - 1$, and $a$, $b$, $m$ are non-negative integers. If a, b, m chosen correctly, sequence of numbers can appear to be uniform and have large period (up to m). LCGs can be implemented efficiently, using only integer arithmetic. LCGs have been studied extensively; good choices of a, b, and m are known. See, e.g., Law and Kelton (1991), Jain (1991).
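A minimal sketch of the LCG recurrence. The small parameters (a = 5, b = 3, m = 16) are chosen only because their full period is easy to check by hand; the defaults a = 16807, b = 0, m = 2^31 - 1 are the widely cited "minimal standard" Lehmer parameters, used here as an illustrative assumption rather than a recommendation from the slides.

```python
from itertools import islice


def lcg(seed, a, b, m):
    """Yield x_1, x_2, ... with x_n = (a * x_{n-1} + b) mod m and x_0 = seed."""
    x = seed
    while True:
        x = (a * x + b) % m
        yield x


def uniform_stream(seed, a=16807, b=0, m=2**31 - 1):
    """Scale LCG output to fractions in [0, 1)."""
    for x in lcg(seed, a, b, m):
        yield x / m


tiny_cycle = list(islice(lcg(0, 5, 3, 16), 16))
print(tiny_cycle)                             # visits every value 0..15 once
print(list(islice(uniform_stream(1), 3)))
```

Reusing the same seed reproduces the same sequence, which is the reproducibility property asked of a generator.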

231 Slide 231 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Tausworthe Generators Proposed by Tausworthe (1965); related to cryptographic methods. Operate on a sequence of binary digits (0,1). Numbers are formed by selecting bits from the generated sequence to form an integer or fraction. A Tausworthe generator has the following form: $b_n = c_{q-1} b_{n-1} \oplus c_{q-2} b_{n-2} \oplus \cdots \oplus c_0 b_{n-q}$, where $b_n$ is the nth bit, and $c_i$ (i = 0 to q - 1) are binary coefficients. As with LCGs, analysis has been done to determine good choices of the $c_i$. Less popular than LCGs, but fairly well accepted.
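The recurrence can be sketched as a simple bit generator. The coefficient choice below (q = 4, c_0 = c_1 = 1, so b_n = b_{n-3} XOR b_{n-4}) is a small illustrative assumption with period 15; practical generators use much larger q. Function names are hypothetical.

```python
from itertools import islice


def tausworthe_bits(coeffs, seed_bits):
    """coeffs = [c_0, ..., c_{q-1}]; seed_bits = [b_0, ..., b_{q-1}].
    Yields b_0, b_1, ... with b_n = XOR over i of c_i * b_{n-q+i}."""
    q = len(coeffs)
    state = list(seed_bits)              # oldest bit first: b_{n-q} ... b_{n-1}
    if len(state) != q or not any(state):
        raise ValueError("need q seed bits, not all zero")
    yield from state
    while True:
        new_bit = 0
        for c, b in zip(coeffs, state):  # c_i pairs with b_{n-q+i}
            new_bit ^= c & b
        state = state[1:] + [new_bit]
        yield new_bit


def bits_to_fraction(bits):
    """Interpret bits as a binary fraction 0.b1 b2 b3 ..."""
    return sum(b / 2 ** (k + 1) for k, b in enumerate(bits))


bits = list(islice(tausworthe_bits([1, 1, 0, 0], [1, 0, 0, 0]), 30))
print(bits[:15])
print(bits_to_fraction(bits[:8]))
```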

232 Slide 232 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Generation of Non-Uniform Random Variates Suppose you have a uniform [0,1] random variable, and you wish to have a random variable X with CDF $F_X$. How do we do this? All other random variates can be generated from uniform [0,1] random variates. Methods to generate non-uniform random variates include: –Inverse Transform - Direct computation from single uniform [0,1] variable based on observation about distribution. –Convolution - Used for random variables that can be expressed as sum of other random variables. –Composition - Used when the distribution of the desired random variable can be expressed as a weighted sum of the distributions of other random variables. –Acceptance-Rejection - Uses multiple uniform [0,1] variables and a function that “majorizes” the density of the random variate to be generated.

233 Slide 233 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Inverse Transform Technique Suppose we have a uniform [0,1] random variable U. If we define $X = F^{-1}(U)$, then X is a random variable with CDF $F_X = F$. To see this, $F_X(a) = P[X \le a] = P[F^{-1}(U) \le a] = P[U \le F(a)] = F(a)$. Thus, by starting with a uniform random variable, we can generate virtually any type of random variable.

234 Slide 234 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Example of Inverse Transform Let X be an exponentially distributed random variable with parameter $\lambda$. Let U be a uniform [0,1] random variable generated by a pseudo-random number generator.
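The slide's equations are not reproduced in this transcript. Under the standard exponential CDF $F(x) = 1 - e^{-\lambda x}$, solving $u = F(x)$ for x gives $x = -\ln(1 - u)/\lambda$. A minimal sketch under that assumption, with a hypothetical function name:

```python
import math
import random


def exponential_variate(rate, u):
    """Inverse transform: map a uniform [0,1) sample u to an
    exponential sample with the given rate."""
    return -math.log(1.0 - u) / rate


rng = random.Random(42)
samples = [exponential_variate(2.0, rng.random()) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean)  # should be close to 1 / rate = 0.5
```

Since 1 - U is also uniform, `-math.log(u) / rate` is an equally valid form as long as u is never exactly zero.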

235 Slide 235 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Convolution Technique Technique can be used for all random variables X that can be expressed as the sum of n random variables $X = Y_1 + Y_2 + \cdots + Y_n$. In this case, one can generate a random variate X by generating n random variates, one from each of the $Y_i$, and summing them. Examples of random variables: –Sum of n independent, identically distributed Bernoulli random variables is a binomial random variable. –Sum of n independent, identically distributed exponential random variables is an n-Erlang random variable.
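Both examples translate directly into code: draw n component variates and add them. A minimal sketch with hypothetical function names, assuming independent and identically distributed components:

```python
import random


def erlang_variate(n, rate, rng):
    """n-Erlang sample as the sum of n independent exponential(rate) samples."""
    return sum(rng.expovariate(rate) for _ in range(n))


def binomial_variate(n, p, rng):
    """Binomial(n, p) sample as the sum of n independent Bernoulli(p) samples."""
    return sum(1 for _ in range(n) if rng.random() < p)


rng = random.Random(7)
erlang_samples = [erlang_variate(3, 2.0, rng) for _ in range(10000)]
erlang_mean = sum(erlang_samples) / len(erlang_samples)
print(erlang_mean)                     # should be close to n / rate = 1.5
print(binomial_variate(10, 0.3, rng))
```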

236 Slide 236 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Composition Technique Technique can be used when the distribution of a desired random variable can be expressed as a weighted sum of other distributions. In this case F(x) can be expressed as $F(x) = \sum_{i} p_i F_i(x)$. The composition technique is as follows: 1) Generate random variate i such that $P[I = i] = p_i$ for i = 0, 1,... (This can be done as discussed for discrete random variables.) 2) Return x as random variate from distribution $F_i(x)$, where i is as chosen above. A variant of composition can also be used if the density function of the desired random variable can be expressed as weighted sum of other density functions.
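A hyperexponential distribution (a weighted mixture of exponentials) is a convenient illustration of composition: first pick a branch with probability $p_i$, then sample from that branch. The example distribution and names below are assumptions chosen for illustration.

```python
import random


def hyperexponential_variate(weights, rates, rng):
    """Composition: choose branch i with probability weights[i],
    then return an exponential(rates[i]) sample."""
    i = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.expovariate(rates[i])


rng = random.Random(11)
weights, rates = [0.3, 0.7], [1.0, 10.0]
samples = [hyperexponential_variate(weights, rates, rng) for _ in range(20000)]
mixture_mean = sum(samples) / len(samples)
expected_mean = sum(p / r for p, r in zip(weights, rates))  # 0.3/1 + 0.7/10
print(mixture_mean, expected_mean)
```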

237 Slide 237 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Acceptance-Rejection Technique Indirect method for generating random variates that should be used when other methods fail or are inefficient. Must find a function m(x) that “majorizes” the density function f(x) of the desired distribution. m(x) majorizes f(x) if $m(x) \ge f(x)$ for all x. Note: If random variates for m(x) can be easily computed, then random variates for f(x) can be found as follows: 1) Generate y with density m(x) 2) Generate u with uniform [0,1] distribution 3)
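The third step is not reproduced in this transcript. The sketch below uses the textbook acceptance test (accept y when u ≤ f(y)/m(y), otherwise retry), which is an assumption about what the slide showed. The target density f(x) = 6x(1 - x) on [0,1] and the constant majorizing function m(x) = 1.5 are likewise illustrative choices; candidates are drawn from m(x) normalized to a density, which here is uniform on [0,1].

```python
import random

MAJORANT = 1.5  # m(x) = 1.5 >= f(x) on [0, 1], since f peaks at f(0.5) = 1.5


def target_density(x):
    """f(x) = 6x(1 - x), the Beta(2, 2) density on [0, 1]."""
    return 6.0 * x * (1.0 - x)


def accept_reject(rng):
    """Draw one sample from target_density by acceptance-rejection."""
    while True:
        y = rng.random()                       # candidate from normalized m(x)
        u = rng.random()                       # uniform [0,1] test value
        if u <= target_density(y) / MAJORANT:  # assumed acceptance test
            return y


rng = random.Random(3)
samples = [accept_reject(rng) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
print(sample_mean)  # Beta(2, 2) has mean 0.5
```

The tighter m(x) hugs f(x), the fewer candidates are rejected; with this choice about two out of three candidates are accepted.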

238 Slide 238 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Generating Discrete Random Variates Useful for generating any discrete distribution, e.g., case probabilities in a SAN. More efficient algorithms exist for special cases; we will review most general case. Suppose random variable has probability distribution p(0), p(1), p(2),... on non-negative integers. Then a random variate for this random variable can be generated using the inverse transform method: 1) Generate u with distribution uniform [0,1] 2) Return j satisfying
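The condition in step 2 is not reproduced in this transcript. The usual inverse-transform rule for a discrete distribution returns the smallest j whose cumulative probability p(0) + ... + p(j) exceeds u; the sketch below assumes that rule, and the function name is hypothetical.

```python
def discrete_variate(probabilities, u):
    """Return the smallest index j with p(0) + ... + p(j) > u,
    for a uniform [0,1) sample u."""
    cumulative = 0.0
    for j, p in enumerate(probabilities):
        cumulative += p
        if u < cumulative:
            return j
    return len(probabilities) - 1   # guard against floating-point round-off


case_probabilities = [0.5, 0.3, 0.2]
print([discrete_variate(case_probabilities, u) for u in (0.1, 0.6, 0.95)])
```

In a SAN simulator the same routine could pick which case of an activity is taken when the activity completes.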

239 Slide 239 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Recommendations/Issues in Random Variate Generation Use standard/well-tested uniform [0,1] generators. Don’t assume that because a method is complicated, it produces good random variates. Make sure the uniform [0,1] generator that is used has a long enough period. Modern simulators can consume random variates very quickly (multiple per state change!). Use separate random number streams for different activities in a model system. Regular division of a single stream can cause unwanted correlation. Consider multiple random variate generation techniques when generating non-uniform random variates. Different techniques have very different efficiencies.

240 Slide 240 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Estimating Dependability Measures: Estimators and Confidence Intervals An execution of the basic simulation loop produces a single trajectory (one possible behavior of the system). Common mistake is to run the basic simulation loop a single time, and presume observations generated are “the answer.” Many trajectories and/or observations are needed to understand a system’s behavior. Need concept of estimators and confidence intervals from statistics: – Estimators provide an estimate of some characteristic (e.g., mean or variance) of the measure. – Confidence intervals provide an estimate of how “accurate” an estimator is.

241 Slide 241 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Typical Estimators of a Simulation Measure Can be: – Instant-of-time, at a fixed t, or in steady-state – Interval-of-time, for fixed interval, or in steady-state – Time-averaged interval-of-time, for fixed interval, or in steady-state Estimators on these measures include: – Mean – Variance – Interval - Probability that the measure lies in some interval [x,y] Don’t confuse with an interval-of-time measure. Can be used to estimate density and distribution function. – Percentile - The $\alpha$th percentile is the smallest value of estimator x such that $F(x) \ge \alpha$.

242 Slide 242 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Different Types of Processes and Measures Require Different Statistical Techniques Transient measures (terminating simulation): –Multiple trajectories are generated by running basic simulation loop multiple times using different random number streams. Called Independent Replications. –Each trajectory used to generate one observation of each measure. Steady-State measures (steady-state simulation): –Initial transient must be discarded before observations are collected. –If the system is ergodic (irreducible, recurrent non-null, aperiodic), a single long trajectory can be used to generate multiple observations of each measure. –For all other systems, multiple trajectories are needed.

243 Slide 243 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Confidence Interval Generation: Terminating Simulation Approach: –Generate multiple independent observations of each measure, one observation of each measure per trajectory of the simulation. –Observations of each measure will be independent of one another if different random number streams are used for each trajectory. –From a practical point of view, new stream is obtained by continuing to draw numbers from old stream (without resetting stream seed). Notation (for subsequent slides): –Let $F(x) = P[X \le x]$ be measure to be estimated. –Define $\mu = E[X]$, $\sigma^2 = E[(X - \mu)^2]$. –Define $x_i$ as the ith observation value of X (ith replication, for terminating simulation). Issue: How many trajectories are necessary to obtain a good estimate?

244 Slide 244 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Terminating Simulation: Estimating the Mean of a Measure I Wish to estimate $\mu = E[X]$. Standard point estimator of $\mu$ is the sample mean $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$, where N is the number of observations. To compute confidence interval, we need to compute sample variance: $s^2 = \frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2$.

245 Slide 245 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Terminating Simulation: Estimating the Mean of a Measure II Then, the $(1 - \alpha)$ confidence interval about $\bar{x}$ can be expressed as: Where –N is the number of observations. Equation assumes $x_n$ are distributed normally (good assumption for large number of $x_i$). The interpretation of the equation is that with $(1 - \alpha)$ probability the real value ($\mu$) lies within the given interval.
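The interval equation itself is not reproduced in this transcript. The sketch below uses the common large-sample form, sample mean plus or minus a normal quantile times s/√N; whether the slide uses a normal or a Student's t quantile is not known, so the quantile choice is an assumption, as is the function name.

```python
import math
from statistics import NormalDist, mean, variance


def mean_confidence_interval(observations, alpha=0.05):
    """Return (sample mean, (lower, upper)) for a (1 - alpha) interval,
    using the normal quantile (large-sample assumption)."""
    n = len(observations)
    x_bar = mean(observations)
    s2 = variance(observations)                 # sample variance, N - 1 denominator
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * math.sqrt(s2 / n)
    return x_bar, (x_bar - half_width, x_bar + half_width)


x_bar, (lower, upper) = mean_confidence_interval([1, 2, 3, 4, 5])
print(x_bar, lower, upper)
```

For a handful of replications a Student's t quantile with N - 1 degrees of freedom would give a wider, more conservative interval.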

246 Slide 246 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Terminating Simulation: Estimating the Variance of a Measure I Computation of estimator and confidence interval for variance could be done like that done for mean, but result is sensitive to deviations from the normal assumption. So, use a technique called jackknifing developed by Miller (1974). Define Where

247 Slide 247 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Terminating Simulation: Estimating the Variance of a Measure II Now define (where $s^2$ is the sample variance as defined for the mean) And Then is a $(1 - \alpha)$ confidence interval about $\sigma^2$.
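The jackknife definitions on these two slides are not reproduced in this transcript. The sketch below applies the generic jackknife construction directly to the sample variance: leave out one observation at a time, form pseudovalues, and build a normal-theory interval from their mean and variance. The slides' exact definitions may differ (for example, by working with a transformed statistic), so treat this as an assumption-laden illustration with hypothetical names.

```python
import math
from statistics import NormalDist, mean, variance


def jackknife_variance_interval(observations, alpha=0.05):
    """Return (point estimate, (lower, upper)) for the variance,
    using jackknife pseudovalues of the sample variance."""
    n = len(observations)
    if n < 3:
        raise ValueError("need at least 3 observations")
    s2_all = variance(observations)
    pseudovalues = []
    for i in range(n):
        rest = observations[:i] + observations[i + 1:]   # leave observation i out
        pseudovalues.append(n * s2_all - (n - 1) * variance(rest))
    z_bar = mean(pseudovalues)
    s_z2 = variance(pseudovalues)
    quantile = NormalDist().inv_cdf(1 - alpha / 2)       # normal quantile (assumed)
    half_width = quantile * math.sqrt(s_z2 / n)
    return z_bar, (z_bar - half_width, z_bar + half_width)


estimate, (lower, upper) = jackknife_variance_interval([1.0, 2.0, 3.0])
print(estimate, lower, upper)
```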

248 Slide 248 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Terminating Simulation: Estimating the Percentile of an Interval About an Estimator Computed in a manner similar to that for mean and variance. Formulation can be found in Lavenberg, ed., Computer Performance Modeling Handbook, Academic Press. Such estimators are very important, since mean and variance are not enough to plan from when simulating a single system.

249 Slide 249 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Confidence Interval Generation: Steady-State Simulation Informally speaking, steady-state simulation is used to estimate measures that depend on the “long run” behavior of a system. Note that the notion of “steady-state” is with respect to a measure (which has some initial transient behavior), not a model. Different measures in a model will converge to steady state at different rates. Simulation trajectory can be thought of as having two phases: the transient phase and the steady-state phase (with respect to a measure). Multiple approaches to collect observations and generate confidence intervals: –Replication/Deletion –Batch Means –Regenerative Method –Spectral Method Which method to use depends on characteristics of the system being simulated. Before discussing these methods, we need to discuss how the initial transient is estimated.

250 Slide 250 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Estimating the Length of the Transient Phase Problem: Observations of measures are different during so-called “transient phase,” and should be discarded when computing an estimator for steady-state behavior. Need: A method to estimate transient phase, to determine when we should begin to collect observations. Approaches: –Let the user decide: not sophisticated, but a practical solution. –Look at long-term trends: take a moving average and measure differences. –Use more sophisticated statistical measures, e.g., standardized time series (Schruben 1982). Recommendation: –Let the user decide, since automated methods can fail.

251 Slide 251 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Methods of Steady-State Measure Estimation: Replication/Deletion Statistics similar to those for terminating simulation, but observations collected only on steady-state portion of trajectory. One or more observations collected per trajectory: Compute $x_i = \frac{1}{M_i}\sum_{j=1}^{M_i} O_{ij}$ as the ith observation, where $M_i$ is the number of observations in trajectory i. The $x_i$ are considered to be independent, and confidence intervals are generated. Useful for a wide range of models/measures (the system need not be ergodic), but slower than other methods, since transient phase must be repeated multiple times. [Figure: trajectories 1 through n, each beginning with a transient phase followed by observations $O_{i1}, O_{i2}, \ldots$]

252 Slide 252 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Methods of Steady-State Measure Estimation: Batch Means Similar to Replication/Deletion, but constructs observations from a single trajectory by breaking it into multiple batches. Example: [Figure: a single trajectory with an initial transient followed by consecutive batches of observations $O_{11}, O_{12}, \ldots, O_{21}, O_{22}, \ldots$] Observations from each batch are combined to construct a single observation; these observations are assumed to be independent and are used to construct the point estimator and confidence interval. Issues: –How to choose batch size? –Only applicable to ergodic systems (i.e., those for which a single trajectory has the same statistics as multiple trajectories). –Initial transient only computed once. In summary, a good method, often used in practice.
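A minimal sketch of batch means: discard the initial transient, cut the remaining observations into equal batches, average each batch, and treat the batch averages as independent observations. The transient length and batch count are left to the caller, the interval uses a normal quantile, and all names are assumptions for illustration.

```python
import math
from statistics import NormalDist, mean, variance


def batch_means_interval(observations, n_batches, n_transient=0, alpha=0.05):
    """Return (point estimate, (lower, upper)) from a single trajectory."""
    steady = observations[n_transient:]          # drop the initial transient
    batch_size = len(steady) // n_batches
    if batch_size == 0 or n_batches < 2:
        raise ValueError("need at least 2 non-empty batches")
    batch_averages = [mean(steady[k * batch_size:(k + 1) * batch_size])
                      for k in range(n_batches)]
    center = mean(batch_averages)
    quantile = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = quantile * math.sqrt(variance(batch_averages) / n_batches)
    return center, (center - half_width, center + half_width)


trajectory = [100, 100, 1, 3, 2, 4, 3, 5]        # first two values: transient
center, (lower, upper) = batch_means_interval(trajectory, n_batches=3, n_transient=2)
print(center, lower, upper)
```

If the batches are too short, neighboring batch averages remain correlated and the interval comes out too narrow.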

253 Slide 253 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Other Steady-State Measure Estimation Methods I Regenerative Method (Crane and Iglehart 1974, Fishman 1974) –Uses “renewal points” in processes to divide “batches.” –Results in batches that are independent, so approach used earlier to generate confidence intervals applies. –However, usually no guarantee that renewal points will occur at all, or that they will occur often enough to efficiently obtain an estimator of the measure. Autoregressive Method (Fishman 1971, 1978) –Uses (as do the two following methods) the autocorrelation structure of process to estimate variance of measure. –Assumes process is covariance stationary and can be represented by an autoregressive model. –Above assumption often questionable.

254 Slide 254 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Other Steady-State Measure Estimation Methods II Spectral Method (Heidelberger and Welch 1981) –Assumes process is covariance stationary, but does not make further assumptions (as previous method does). –Efficient method, if certain parameters chosen correctly, but choice requires sophistication on part of user. Standardized Time Series (Schruben 1983) –Assumes process is strictly stationary and “phi-mixing.” –Phi-mixing means that $O_i$ and $O_{i+j}$ become uncorrelated if $j$ is large. –As with spectral method, has parameters whose values must be chosen carefully.

255 Slide 255 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Summary: Measure Estimation and Confidence Interval Generation 1) Only use the mean as an estimator if it has meaning for the situation being studied. Often a percentile gives more information. This is a common mistake! 2) Use some confidence interval generation method! Even if the results rely on assumptions that may not always be completely valid, the methods give an indication of how long a simulation should be run. 3) Pick a confidence interval generation method that is suited to the system that you are studying. In particular, be aware of whether the system being studied is ergodic. 4) If batch means is used, be sure that batch size is large enough that batches are practically uncorrelated. Otherwise the simulation can terminate prematurely with an incorrect result.

256 Slide 256 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Summary/Conclusions: Simulation-Based Validation Techniques 1) Know how random variates are generated in the simulator you use. Make sure: –A good uniform [0,1] generator is used –Independent streams are used when appropriate –Non-uniform random variates are generated in a proper way. 2) Compute and use confidence intervals to estimate the accuracy of your measures. –Choose correct confidence interval computation method based on the nature of your measures and process

257 Slide 257 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Simulator Statistics Editor Batch Size and Initial Transient in Steady-State Simulation Variable Type and Times for Terminating Simulation

258 Slide 258 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Simulator Statistics Editor Estimator Types Variable Type in Terminating Simulation

259 Slide 259 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Simulator Statistics Editor Confidence Interval Width and Level

260 Slide 260 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Setting Initial Transient and Batch Size in Steady-State Simulation Set the initial transient large enough that the transient behavior has “settled down” –Think about characteristics of model –Make long enough that any one-time events have occurred –For events that occur in a roughly cyclic manner, with a certain period, make the initial transient a large multiple (say, 1000 times) of the period, so markings related to these events will reach steady state Make batch size similar in size to initial transient, using above guidelines

261 Slide 261 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Simulator Editor Maximum and Minimum Number of Replications to Run Number of Batches between each calculation of the variance Trace-Level for Debugging File Name of Output File

262 Slide 262 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Batch and Replication Outputs (Variable Output Option) Typical batch output: Typical replication output:

263 Slide 263 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Möbius Simulation Techniques

264 Slide 264 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Hints for Successful Simulation
Use “Trace Level” option to look at sequence of completion of activities.
The batch size in the steady-state simulator must be large enough to assure independence of batches.
Enough batches must be collected to assure that the sample variance computed is an accurate reflection of real variance of the measure. Setting the minimum number of batches too low can yield an artificially low (incorrect) confidence interval width.
To determine expected run length, monitor the width of the confidence interval using the “Variable Output” option. As a rule of thumb, if the confidence interval is observed after $k$ batches or replications, the simulation will take about $kn^2$ additional batches or replications to decrease the width of the confidence interval by a factor of $n$.
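The rule of thumb follows from the way the confidence interval width shrinks with the number of samples. A short derivation, assuming the sample standard deviation $s$ stays roughly constant as more batches are collected and writing $z$ for the quantile used (both symbols are introduced here for illustration):

```latex
\begin{align*}
w_k &\approx \frac{2 z s}{\sqrt{k}}
  && \text{(width after $k$ batches or replications)}\\
w_N = \frac{w_k}{n}
  &\iff \sqrt{N} = n\sqrt{k}
  \iff N = k n^2
  && \text{(total number needed)}\\
N - k &= k\,(n^2 - 1) \approx k n^2
  && \text{(additional number, for large $n$)}
\end{align*}
```

For example, with $k = 100$ and $n = 2$, halving the width takes about $N = 400$ batches in total, that is, 300 beyond those already collected.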

265 Slide 265 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Symbolic State-space Exploration and Numerical Analysis of State-sharing Composed Models

266 Slide 266 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Motivation State-space (SS) explosion or largeness problem in discrete-state systems –Costly generation and representation of SS (space and time) –Costly representation of CTMC (space) –Costly representation of solution vector (space) and costly iteration/solution time (time) Typical solutions: –Largeness avoidance, e.g., using lumping techniques CTMC level Model level –Largeness tolerance using BDD, MDD, MTBDD, Kronecker, or Matrix Diagrams (MD)

267 Slide 267 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. What Is New?
Our approach combines
–Model-level lumping induced by structural symmetries
  Fewer states ⇒ smaller solution vector
  Fewer states ⇒ shorter iteration time
–MDD and Matrix Diagram (MD) data structures
  Enables us to represent lumped CTMCs that cannot be represented using a sparse matrix
  An order of magnitude faster than the unlumped sparse representation, although it induces a slowdown in solution time compared to the lumped sparse representation
State-sharing composed models as opposed to action-synchronization
–Maintain almost the same generality

268 Slide 268 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. State-sharing Composed Models
Join and Replicate operators
Any atomic model formalism that can share state variables
–E.g., SAN, PEPA$_k$, and Buckets and Balls
Replicate induces symmetry
Global and local actions
[Figure: a Join node composing atomic models M1 and M2 through shared state variable SV1, and a Rep(3) node replicating M1 three times with shared state variable SV1.]

271 Slide 271 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Introduction to MDD
An MDD (multi-valued decision diagram) represents a function $f$ on vectors of indices.
Special case: n = 1, in which $f$ represents a set of vectors, e.g., {(0,0,1), (0,0,2), (0,1,1), (0,1,2), (1,0,1), (1,0,2), (1,1,0), (1,1,1), (2,0,0), (2,0,1), (2,1,1), (2,1,2)}
Representation of a set of states of a discrete-state model
–Partition the set of SVs (state variables) into blocks
–Assign an index to each unique value assignment of the variables of each block
–A vector of indices represents a state
Augment by state offsets
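To make the set-of-vectors view concrete, the sketch below builds a small MDD for the twelve vectors listed on the slide, sharing structurally identical nodes through a unique table. It is an illustrative toy, not the data structure used in Möbius: the nested-tuple node encoding and the function names are assumptions, and offsets are computed on the fly here rather than stored in the diagram.

```python
TERMINAL = ()  # terminal node: reached after the last level

def build_mdd(vectors):
    """Return (root, unique_table) for a set of equal-length index vectors."""
    unique = {}  # unique table: structurally equal nodes are stored once

    def make(suffixes):
        if () in suffixes:           # all levels consumed
            return TERMINAL
        groups = {}
        for vec in suffixes:         # split by the index at this level
            groups.setdefault(vec[0], set()).add(vec[1:])
        node = tuple(sorted(((v, make(rest)) for v, rest in groups.items()),
                            key=lambda pair: pair[0]))
        return unique.setdefault(node, node)

    return make(set(vectors)), unique

def count(node):
    """Number of vectors (paths to the terminal) below a node."""
    if node == TERMINAL:
        return 1
    return sum(count(child) for _, child in node)

def contains(node, vector):
    """Membership test: follow one edge per level."""
    for value in vector:
        node = dict(node).get(value)
        if node is None:
            return False
    return node == TERMINAL

def state_index(node, vector):
    """Offset of a state: its position in lexicographic order."""
    index = 0
    for value in vector:
        children = dict(node)
        if value not in children:
            raise KeyError(vector)
        # Skip every state reachable through a smaller index at this level.
        index += sum(count(child) for v, child in node if v < value)
        node = children[value]
    return index

states = [(0, 0, 1), (0, 0, 2), (0, 1, 1), (0, 1, 2), (1, 0, 1), (1, 0, 2),
          (1, 1, 0), (1, 1, 1), (2, 0, 0), (2, 0, 1), (2, 1, 1), (2, 1, 2)]
root, unique = build_mdd(states)
print(count(root), len(unique), state_index(root, (1, 1, 0)))
```

The twelve states need only six non-terminal nodes because the bottom-level nodes for index sets {1,2} and {0,1} are shared by several paths; the offset maps each state to a position in the solution vector.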

272 Slide 272 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. MDD data structure by example
Partitioning SVs based on composition structure
–Maximizing efficiency of local SS exploration
–Simplifying global SS exploration
Dependability model for multicomputer system
[Figure: composed model tree and its MDD level assignment. A Join node combines the IO port, error handler, and cpu models with Rep 1 (M) of memory under an outer replicate Rep 2 (N); MDD levels are assigned to the outer replicate (Rep 2), the Join, mem, and the inner replicate (Rep 1).]

273 Slide 273 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Algorithm Overview
1. Generate MDD representation of unlumped SS
2. Build MD representation of CTMC
3. Convert unlumped SS to lumped SS
4. Solve CTMC by iterating through MD data structure

274 Slide 274 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Symbolic Generation of Unlumped SS
Maintains a set of visited states and a set of unexplored states
–Expands using sequences of firings of local actions
–Expands using a single firing of global actions
Never generates potential or unreachable states
Creates the necessary matrices and data structures to construct the MD of the CTMC at a later stage
No consideration of lumping properties

275 Slide 275 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Symbolic SSG (Local Actions)
Restriction: immediate actions are local
On-the-fly elimination of vanishing states
Local SS expansion in levels corresponding to atomic models
No assumption of knowing the local state space in advance
–Online computation of transitive closure based on Ibaraki and Katoh’s algorithm
  Avoids costly computation of the transitive closure from scratch
[Figure: adding a local transition from i to j and the resulting update of the closure.]
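The idea of updating a transitive closure online, rather than recomputing it, can be illustrated with a simple incremental scheme. The sketch below is a naive set-based update written for illustration; it is not claimed to be Ibaraki and Katoh's algorithm, and the class and method names are hypothetical.

```python
class IncrementalClosure:
    """Reachability sets maintained while local transitions are discovered."""

    def __init__(self):
        self.reach = {}  # state -> set of states reachable from it (itself included)

    def add_state(self, s):
        self.reach.setdefault(s, {s})

    def add_transition(self, i, j):
        """Record a local transition i -> j and update the closure in place."""
        self.add_state(i)
        self.add_state(j)
        if j in self.reach[i]:
            return  # closure already covers this transition
        gained = set(self.reach[j])  # copy, so the loop below cannot alias it
        # Every state that can reach i can now reach everything j reaches.
        for reachable in self.reach.values():
            if i in reachable:
                reachable |= gained

closure = IncrementalClosure()
closure.add_transition(1, 2)
closure.add_transition(3, 4)
closure.add_transition(2, 3)   # links the two chains: 1 -> 2 -> 3 -> 4
print(sorted(closure.reach[1]))
```

Only the reachability sets of states that can reach the source of the new transition are touched, which is what avoids a computation from scratch each time a local state is discovered.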

276 Slide 276 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Symbolic SSG (Global Actions)
Global action a in component c affects more than one level
No “product-form”-like restriction ⇒ the effect of a on each level need not be determined locally
More difficult to handle than synchronizing actions
Expensive operation

280 Slide 280 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Lumping
Redundant states (paths)
Rep node c implies equivalence relation $R_c$
Overall equivalence relation
Canonical representative state in each class: min(v)
The MDD involved may become exponentially large ⇒ break it up into many much smaller MDDs ⇒ faster computation
[Figure: a Rep node over atomic model A with shared state x and replica states 1 and 2, showing equivalent (redundant) paths.]

281 Slide 281 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Lumping
The lumped state space is the set of all states v where min(v) = v
The intermediate MDD may become huge ⇒ break it up into much smaller MDDs
–The MDD of the lumped state space is often less structured than that of the unlumped state space and therefore larger in number of nodes
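The role of the canonical representative min(v) can be shown on an explicit (non-symbolic) toy example. The sketch assumes a single Rep node whose replicas are fully interchangeable and a flat state encoding of (shared value, tuple of replica states); both the encoding and the function names are illustrative assumptions.

```python
from collections import Counter
from itertools import product

def canonical(state):
    """min(v): the lexicographically smallest permutation of the replicas."""
    shared, replicas = state
    return (shared, tuple(sorted(replicas)))

def lump(states):
    """Map each equivalence class (by its representative) to its size."""
    return Counter(canonical(s) for s in states)

# Hypothetical unlumped state space: shared variable x = 0 and
# three replicas, each in local state 0 or 1 (2**3 = 8 states).
unlumped = [(0, replicas) for replicas in product((0, 1), repeat=3)]
classes = lump(unlumped)

# The lumped state space keeps exactly the states with min(v) == v.
lumped = [s for s in unlumped if canonical(s) == s]
print(len(unlumped), len(lumped), classes[(0, (0, 0, 1))])
```

Here eight unlumped states collapse to four classes; with more replicas and more local states the reduction grows quickly, which matches the large reductions reported for the lumped state space.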

282 Slide 282 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. SSG and Lumping Performance Worst case example: No local behavior Drastic decrease in number of states in the lumped SS (up to 6 orders of magnitude) Increase in number of nodes in the lumped state space but still small compared to other entities Very small unlumped and lumped SS representation

283 Slide 283 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. CTMC Generation and Enumeration Use Matrix Diagrams (MD) (Ciardo/Miner) –CTMC of largest example has <40000 nodes and takes <3MB of memory

284 Slide 284 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. CTMC Generation and Enumeration
Use Matrix Diagrams (MD) (Ciardo/Miner)
–CTMC of largest example has <30000 nodes and takes <5MB of memory
Projection of the MD on the lumped SS?
Problem: some needed transitions are deleted
[Figure: a wrong projection next to the correct one.]

285 Slide 285 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. CTMC Generation and Enumeration
Use Matrix Diagrams (MD) (Ciardo/Miner)
–CTMC of largest example has <40000 nodes and takes <3MB of memory and at most a few seconds to build
Projection of the MD on the lumped SS?
Problem: some needed transitions are deleted
Project rows on lumped SS and columns on unlumped SS
Redirect transitions on-the-fly
DFS-based enumeration of MD using “sorting” MDD
[Figure: a wrong projection next to the correct one.]

286 Slide 286 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. CTMC Enumeration Performance Fairly fast iteration: less than 6 times slower than lumped sparse matrix Solving larger CTMCs

287 Slide 287 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Integration into Möbius

288 Slide 288 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Case Study: Survivability Evaluation

289 Slide 289 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Defending Against a Wide Variety of Attacks
[Figure: spectrum of attacks, ordered from LOW to HIGH along the dimensions innovation, planning, stealth, and coordination.]
Attackers: script kiddies; serious hackers; nation-states, terrorists, multinationals
Motives and attack types: curiosity, thrill-seeking, copy-cat attacks, embarrassing organizations, collecting trophies, harassment, civil disobedience, stealing credit cards, selling secrets, economic intelligence, military spying, information terrorism, disciplined strategic cyber attack

290 Slide 290 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Intrusion Tolerance: A New Paradigm for Security
1st Generation: Protection. Prevent intrusions (access controls, cryptography, trusted computing base). Technologies: cryptography, trusted computing base, access control & physical security.
But intrusions will occur.
2nd Generation: Detection. Detect intrusions, limit damage (firewalls, intrusion detection systems, virtual private networks, PKI). Technologies: firewalls, intrusion detection systems, boundary controllers, VPNs, PKI.
But some attacks will succeed.
3rd Generation: Tolerance. Tolerate attacks (redundancy, diversity, deception, wrappers, proof-carrying code, proactive secret sharing).
Other labels in the figure: intrusion tolerance, big board view of attacks, real-time situation awareness & response, graceful degradation, hardened operating system, multiple security levels.

291 Slide 291 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Validation of Computer System/Network Survivability Security is no longer absolute Trustworthy computer systems/networks must operate through attacks, providing proper service in spite of possible partially successful attacks Intrusion tolerance claims to provide proper operation under such conditions Validation of security/survivability must be done: –During all phases of the design process, to make design choices –During testing, deployment, operation, and maintenance, to gain confidence that the “amount” of intrusion tolerance provided is as advertised.

292 Slide 292 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Validating Computer System Security: Research Goal CONTEXT: Create robust software and hardware that are fault-tolerant, attack resilient, and easily adaptable to changes in functionality and performance over time. GOAL: Create an underlying scientific foundation, methodologies, and tools that will: –Enable clear and concise specifications, –Quantify the effectiveness of novel solutions, –Test and evaluate systems in an objective manner, and –Predict system assurance with confidence.

293 Slide 293 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Existing Security/Survivability Validation Approaches Most traditional approaches to security validation have focused on avoiding intrusions (non-circumventability), or have not been quantitative, instead focusing on and specifying procedures that should be followed during the design of a system (e.g., the Security Evaluation Criteria [DOD85, ISO99]). When quantitative methods have been used, they have typically either been based on formal methods (e.g., [Lan81]), aiming to prove that certain security properties hold given a specified set of assumptions, or been quite informal, using a team of experts (often called a “red team,” e.g., [Low01]) to try to compromise a system. Both of these approaches have been valuable in identifying system vulnerabilities, but probabilistic techniques are also needed.

294 Slide 294 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Example Probabilistic Validation Study Evaluation of DPASA-DV Project design –Designing Protection and Adaptation into a Survivability Architecture: Demonstration and Validation Design of a Joint Battlespace Infosphere –Publish, Subscribe and Query features (PSQ) –Ability to fulfill its mission in the presence of attacks, failures, or accidents Uses Multiple, synergistic validation techniques

295 Slide 295 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. JBI Design Overview JBI Management Staff Executive Zone Crumple Zone Operations Zone JBI Core Quad 1Quad 2Quad 3Quad 4 Network Protection Domains Isolation among selected functions on individual core hosts and on clients Access Proxy (Isolated Process Domains in SE-Linux) Domain6 First Restart Domains Eventually Restart Host Local Controller RMI STCPTCP PS Sensor Rpts TCPUDP IIOP PSQImpl IIOP TCP DC Eascii Domain1Domain2Domain3Domain4Domain5 Forward/ Ratelimit Proxy Logic Inspect / Forward / Rate Limit

296 Slide 296 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Survivability/Security Validation Goal Provide convincing evidence that the design, when implemented, will provide satisfactory mission support under real use scenarios and in the face of cyber-attacks. More specifically, determine whether the design, when implemented will meet the project goals: This assurance case is supported by: –Rigorous logical arguments –Experimental evaluation –A detailed executable model of the design

297 Slide 297 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Goal: Design a Publish and Subscribe Mechanism that …
–Provides 100% of critical functionality when under sustained attack by a “Class-A” red team with 3 months of planning.
–Detects 95% of large-scale attacks within 10 minutes of attack initiation and 99% of attacks within 4 hours, with less than a 1% false alarm rate.
–Displays meaningful attack state alarms. Prevents 95% of attacks from achieving attacker objectives for 12 hours.
–Reduces low-level alerts by a factor of 1000 and displays meaningful attack state alarms.
–Shows survivability versus cost/performance trade-offs.

298 Slide 298 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Integrated Survivability Validation Procedure
[Figure: layered argument structure with four layers. Requirement decomposition (nodes R, P, S, Q); functional model of the system, probabilistic or logical (functional model of the relevant subset of the system, with models for the client, access proxy, PSQ server, …); assumptions (AA1, AA2, AA3, AP1, AP2); supporting logical arguments and experimentation (M1 (Network Domains), M2 to M6, L1 (ADF), L2, L3).]

299 Slide 299 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Integrated Survivability Validation Procedure: Steps
1. A precise statement of the requirements
2. High-level functional model description:
a) Data and alert flows for the processes related to the requirements,
b) Assumed attacks and attack effects
[Threat/vulnerability analysis; whiteboarding]

300 Slide 300 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Integrated Survivability Validation Procedure: Steps
3. Detailed descriptions of model component behaviors representing 2a and 2b, along with statements of underlying assumptions made for each component. [Probabilistic modeling or logical argumentation, depending on requirement]

301 Slide 301 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Integrated Survivability Validation Procedure: Steps
In parallel:
4. Construct executable functional model [Probabilistic modeling, if model constructed in 3 is probabilistic]
5. a) Verification of the modeling assumptions of Step 3 [Logical argumentation] and, b) where possible, justification of model parameter values chosen in Step 4. [Experimentation]

302 Slide 302 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Integrated Survivability Validation Procedure: Steps
6. Run the executable model for the measures that correspond to the requirements of Step 1. [Probabilistic modeling]

303 Slide 303 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Integrated Survivability Validation Procedure: Steps
7. Comparison of results obtained in Step 6, noting in particular the configurations and parameter values for which the requirements of Step 1 are satisfied.
Note that if the requirement being addressed is not quantitative, steps 4 and 6 are skipped.

304 Slide 304 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Step 1: Requirement Specification Expressed in an argument graph: JBI critical mission objectives JBI critical functionality JBI mission Detection / Correlation Requirements Initialized JBI provides essential services JBI properly initialized Authorized publish processed successfully Authorized subscribe processed successfully Authorized query processed successfully Authorized join/leave processed successfully Unauthorized activity properly rejected Confidential info not exposed IDS objectives

305 Slide 305 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Argument Graph for the Design Requirements decomposition Executable model Model assumptions Supporting arguments

306 Slide 306 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Step 2: System and Attack Assumption Definition Example High level description … Steps 4-5 Access proxy verifies if the client is in valid session by sending the session key accompanying the IO to the Downstream Controller for verification Step 6 Access Proxy forwards the IO to the PSQ Server in its quadrant.....

307 Slide 307 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Attack Model Description Definitions –Intrusion, prevented intrusion, tolerated intrusion –New vulnerabilities Assumptions –Outside attackers only –Attacker(s) with unlimited resources –Consider successful (and harmful) attacks only –No patches applied for vulnerabilities found during the mission/scenario execution

308 Slide 308 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Attack Model Description
Attack propagation
–MTTD: mean time to discovery of a vulnerability
–MTTE: mean time to exploitation of a vulnerability
3 types of vulnerabilities:
–Infrastructure-level vulnerabilities ⇒ attacks in depth
  OS vulnerability
  Non-JBI-specific application-level vulnerability
  $p_{common}$: common-mode failure
–Data-level vulnerabilities ⇒ attacks in breadth
  Using the application data of JBI software
–Across process domains
  Flaw in protection domains

309 Slide 309 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Attack Model Description Attack effects –Compromise Launching pad for further attacks Malicious behavior –Crash Attack propagation stopped –(DoS) –Distinction between OSes with and without protection domains

310 Slide 310 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Attack Model Description Intrusion Detection –$p_{detect} = 0$ if the sensors are compromised –$p_{detect} > 0$ otherwise. Attack Responses –Restart Processes –Secure Reboot –Permanent Isolation

311 Slide 311 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Infrastructure Attacks Example
[Figure: JBI hosts arranged by zone (Crumple Zone, Operations Zone, Executive Zone) and by quadrant: Access Proxy, PSQ Server, Correlator, Guardian, DC, SM, and Policy Server hosts running OS 1 to OS 4, each behind an ADF NIC, with a Publishing Client (OS 1) outside the core.]
T=85 min.: discovery of a vulnerability on the Main PD, OS1

312 Slide 312 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Step 3: Detailed descriptions of model component behaviors and Assumptions (Access Proxy) Model of Access Proxy Assumptions 4.4 Access Proxy Model Description AM1: If a process domain in the DJM proxy is not corrupted, it forwards the traffic it is designated to handle from the Quadrant isolation switch to core quadrant elements and vice versa. All traffic being forwarded is well-formed (if the proxy is correct). The following kinds of traffic are handled: 1. IOs (together with tokens) sent from publishing clients to the core (we do not distinguish between IOs sent via different protocols such as RMI or SOAP/HTTP). …. AM2: Attacks on access proxy: attacks on an access proxy are enabled if either/both 1. Quadrant isolation switch is ON, and one or more clients are corrupted, leading to: a) Direct attacks: can cause the corruption of the process domain corresponding to the domain of the attacking process on the compromised client. …. AM3: If an attack occurs on the access proxy, it can have the following effects: 1. Direct attacks leading to process corruption: a) Enable corruption of other process domains on the host. … Facts and Simplifications AF1: Each access proxy runs on a dedicated host machine. AF2: DoS attacks result in increased delays. … Assumptions AA1: Only well-formed traffic is forwarded by a correct access proxy. AA2: The access proxy cannot access cryptographic keys used to sign messages that pass through it. AA3: Access proxy cannot access the contents on an IO if application-level end-to-end encryption is being used. AA4: Attacks on an access proxy can only be launched from compromised clients, or from corrupted core elements that interact with the access proxy during the normal course of a mission. ….

313 Slide 313 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Step 4: Construct Executable Functional Model

314 Slide 314 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Step 5: Supporting Logical Arguments

315 Slide 315 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Logical Argument Sample AA2: AP Application- layer Integrity AA3: AP Application-layer Confidentiality SA3: IO Integrity in PSQ Server SA4: Client Confidentiality in PSQ Server Private Key Confidentiality No Unauthorized Direct Access Keys Protected from Theft DoD Common Access Card (CAC) PKCS #11 Compliance Tamperproof Keys Not Guessable Algorithmic Framework Key Length Key Lifetime No Unauthorized Indirect Access Physical Protection of CAC device Protection of CAC Authentication Data No Compromise of Authorized Process Accessing CAC No Cryptography in Access Proxy Not Preconfigured Not Reconfigurable ADF NIC services protected PSQ Server Model Access Proxy Model Functional Model Model Assumptions Supporting Arguments

316 Slide 316 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Steps 6 and 7: Measures and Results
Assumptions: $C_{PUB}$ is the conjunction of
–$C1_{PUB}$ = the publishing client is successfully registered with the core
–$C2_{PUB}$ = the publishing client's mission application interacts with the client as intended
Definition of a successful publish: $E_{PUB}$ is the conjunction of
–$E1_{PUB}$ = the data flow for the IO is correct
–$E2_{PUB}$ = the time required for the publish operation is less than $t_{max}$
–$E3_{PUB}$ = the content of the IO received by the subscriber has the same essential content as that assembled by the publisher
Measure: $P[E_{PUB} \mid C_{PUB}]$
–Fraction of successful publishes in a 12-hour period
–Between clients that cannot be compromised
Objective
–$P[E_{PUB} \mid C_{PUB}] \ge p_{PUB}$ for a 12-hour mission
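One way to compare a simulated fraction of successful publishes with the objective is to check whether the lower end of a confidence interval clears the threshold. The sketch below is an illustration only: the counts, the threshold value 0.95, the normal quantile z = 1.96, and the function name are all hypothetical and are not results or parameters from the study.

```python
import math

def meets_objective(successes, attempts, p_required, z=1.96):
    """Check P[E_PUB | C_PUB] >= p_PUB against a simulated fraction.

    Uses a normal-approximation interval for a proportion (an assumption);
    the objective is declared met only if the lower bound clears p_required.
    """
    p_hat = successes / attempts
    std_err = math.sqrt(p_hat * (1.0 - p_hat) / attempts)
    lower = p_hat - z * std_err
    return p_hat, lower, lower >= p_required

# Hypothetical counts of publishes that satisfied E1, E2, and E3
p_hat, lower, ok = meets_objective(successes=970, attempts=1000, p_required=0.95)
print(f"estimate={p_hat:.3f} lower bound={lower:.4f} objective met={ok}")
```

Requiring the lower bound, rather than the point estimate, to exceed the threshold guards against declaring success on the basis of too few simulated publishes.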

317 Slide 317 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Vulnerability Discovery Rate Study Fraction of successful publishes versus MTTD Number of successful intrusions versus MTTD

318 Slide 318 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Varying the number of OS and OS w/ process domains

319 Slide 319 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Autonomic Distributed Firewall (ADF) NIC policies
[Plots: fraction of successful publishes; total number of intrusions.]
Per-pd policies considerably improve performance (10% unavailability vs. 1.5% at MTTD=100 minutes)
ADF NICs can handle per-port policies ⇒ the design should take advantage of this feature, which requires setting the communication ports in advance

320 Slide 320 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Design and Implementation Oriented Validation of Survivable Systems A. Agbaria, T. Courtney, M. Ihde, W. H. Sanders, M. Seri, and S. Singh University of Illinois at Urbana-Champaign, Information Trust Institute
Design Phase Validation (flowchart labels): Requirement; Decomposable? (Yes: Logical Decomposition); Sub-requirements; Quantitative? (No: Logical Argumentation; Yes: Build high-level description of system and its operational environment); Verify assumptions & parameter values (Not valid); Probabilistic model of the system and its operational environment; Probabilistic measures; Compare with requirement; System valid w.r.t. the requirement; System not valid
Step 1: Formulate a precise statement of R. Step 2: If R is logically decomposable, decompose it iteratively. Step 3: For every atomic requirement R a Step 4: Detailed description of components Step 5: Justify the modeling assumptions of Step 4 Step 6: Construct a simulation model Step 7: Evaluation and comparing
Example: Let PUB be the requirement “successfully process a publish request”. Let C be the preconditions. Let E be the desired event, i.e., the success of a request to publish. E is a conjunction of: E1 = the data flow of the publish is correct; E2 = timeliness; E3 = integrity; E4 = confidentiality. The requirement: PUB: P[E | C] ≥ p. A study of the design reveals that integrity and confidentiality can be regarded as probability-1 events. We obtain the following logical decomposition: PUB1: P[E1 ∧ E2 | E3 ∧ E4 ∧ C] ≥ p; PUB2: P[E3 | C] = 1; PUB3: P[E4 | C] = 1. It can be shown that (PUB1 ∧ PUB2 ∧ PUB3) ⇒ PUB.
Logical argument (diagram labels): Data Flow; Infrastructure-level attacks; AA2: AP Application-layer Integrity; AA3: AP Application-layer Confidentiality; Private Key Confidentiality; No Unauthorized Direct Access; Keys Protected from Theft; DoD (CAC); PKCS #11 Compliance; Tamperproof; Keys Not Guessable; Alg. Framework; Key Length; Key Lifetime; No Unauthorized Indirect Access; Physical Protection of CAC device; Protection of CAC Authentication Data; No Compromise of Authorized Process Accessing CAC; No Cryptography in AP; Not Preconfigured; Not Reconfigurable; ADF NIC services protected. Legend: Access Proxy Model; Functional Model; Model Assumptions; Supporting Arguments
Survivable Publish Subscribe System (diagram labels): Client Zone; Management Staff; Executive Zone; Crumple Zone; Operations Zone; Core; Quad 1; Quad 2; Quad 3; Quad 4; Network. Access Proxy (Isolated Process Domains in SE-Linux): Domain1; Domain2; Domain3; Domain4; Domain5; Domain6; First Restart Domains; Eventually Restart Host; Local Controller; RMI; TCP; PS; Sensor Rpts; TCP; UDP; IIOP; PSQImpl; IIOP; TCP; DC; Eascii; Forward/Rate limit; Proxy Logic; Inspect / Forward / Rate Limit
Implementation Phase Validation: Attack Tree; Attack Graph; Automatic construction
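The poster states without proof that the three sub-requirements together imply PUB. A short derivation, assuming the conditioning events have positive probability, uses the chain rule for conditional probability and the bound P[A ∧ B] ≥ P[A] + P[B] − 1:

```latex
\begin{align*}
P[E \mid C]
  &= P[E_1 \wedge E_2 \wedge E_3 \wedge E_4 \mid C] \\
  &= P[E_1 \wedge E_2 \mid E_3 \wedge E_4 \wedge C]\;
     P[E_3 \wedge E_4 \mid C], \\[4pt]
P[E_3 \wedge E_4 \mid C]
  &\ge P[E_3 \mid C] + P[E_4 \mid C] - 1 = 1
     \qquad \text{(by } \mathrm{PUB}_2 \text{ and } \mathrm{PUB}_3\text{)}, \\[4pt]
\text{hence}\quad P[E \mid C]
  &= P[E_1 \wedge E_2 \mid E_3 \wedge E_4 \wedge C] \;\ge\; p
     \qquad \text{(by } \mathrm{PUB}_1\text{)}.
\end{align*}
```

The practical payoff of the decomposition is that only PUB1 needs a probabilistic model; PUB2 and PUB3 are discharged by logical argument.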

321 Slide 321 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. The Art of Dependability Evaluation / Conclusions

322 Slide 322 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Course Outline Revisited Issues in Model-Based Validation of High-Availability Computer Systems/Networks Stochastic Activity Network Concepts Analytic/Numerical State-Based Modeling Case Study: Embedded Fault-Tolerant Multiprocessor System Solution by Simulation The Art of System Dependability /Conclusions

323 Slide 323 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Model Solution Issues In general: –Use “tricks” from probability theory to reduce complexity of model –Choose the right solution method Simulation: –Result is just an estimator based on a statistical experiment –Estimating the accuracy of the estimate is essential –Use confidence intervals! Analytic/Numerical model solution: –Avoid state space explosion: limit model complexity; use structure of model (symmetries) to reduce state space size –Understand accuracy/limitations of the chosen numerical method: transient solution (iterative or direct); steady-state solution
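To make the "use confidence intervals" advice concrete, here is a minimal sketch that turns the per-replication outputs of a simulation into a point estimate with an interval. It assumes independent replications and a normal approximation (z = 1.96 for roughly 95% coverage); for a small number of replications a Student-t quantile would be the more appropriate choice. The function name is hypothetical.

```python
import math
import statistics
from typing import Sequence, Tuple


def confidence_interval(samples: Sequence[float],
                        z: float = 1.96) -> Tuple[float, float]:
    """Return (low, high) for the mean of independent replication results.

    Uses the normal approximation: mean +/- z * s / sqrt(n), where s is the
    sample standard deviation. z = 1.96 gives roughly 95% coverage.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two replications to estimate variance")
    mean = statistics.fmean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(n)
    return mean - half_width, mean + half_width
```

The half-width shrinks only with the square root of the number of replications, so halving the interval costs about four times as many runs.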

324 Slide 324 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. The “Art” of Performance and Dependability Validation Performance and dependability validation is an art because: –There is no recipe for producing a good analysis, –The key is knowing how to abstract away unimportant details, while retaining important components and relationships, –This intuition only comes from experience, –Experience comes from making mistakes. There are many ways to make mistakes.

325 Slide 325 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Doing it Right: Model Construction Understand the desired measure before you build the model. The desired measure determines the type of model and the level of detail required. No model is universal! Steps in constructing a model: 1. Choose the desired measures: The choice of measures forms a basis for comparison. It’s easy to choose the wrong measure and see patterns where none exist. Measures should be refined during the design and validation process. 2. Choose the appropriate level of detail/abstraction for model components. The key is to represent the model at the right level of detail for the chosen measures. It is almost never possible or practical to include all system aspects. Model the system at the highest level possible to obtain a good estimate of the desired measures. 3. Build the model. Decide how to break up the model into modules, and how the modules will interact with one another. Test the model as you build it, to ensure it executes as intended.

326 Slide 326 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Doing it Right: Model Solution Use the appropriate model solution technique: –Just because you have a hammer doesn’t mean the world is a nail. –There is no universal model solution technique (not even simulation!) –The appropriate model solution technique depends on model characteristics. Use representative input values: –The results of a model solution are only as good as the inputs. –The inputs will never be perfect. –Understand how uncertainty in inputs affects measures. –Do sensitivity analysis. Include important points in the design/parameter space: –Parameterize choices when design or input values are not fixed. –A complete parametric study is usually not possible. –Some parameters will have to be fixed at “nominal” values. –Make sure you vary the important ones.
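A one-at-a-time sensitivity study is one simple way to follow the advice on sensitivity analysis and on varying the important parameters. The sketch below perturbs each input around its nominal value while holding the others fixed. The availability formula MTTF / (MTTF + MTTR) is only a stand-in for a real model solution, and all function and parameter names are hypothetical.

```python
from typing import Callable, Dict, Tuple


def steady_state_availability(mttf: float, mttr: float) -> float:
    """Stand-in model: steady-state availability of a single repairable unit."""
    return mttf / (mttf + mttr)


def one_at_a_time_sensitivity(
    model: Callable[..., float],
    nominal: Dict[str, float],
    rel_step: float = 0.1,
) -> Dict[str, Tuple[float, float, float]]:
    """Vary each parameter by +/- rel_step around nominal, others held fixed.

    Returns {parameter: (measure_at_low, measure_at_nominal, measure_at_high)}.
    """
    base = model(**nominal)
    result = {}
    for name, value in nominal.items():
        low = dict(nominal, **{name: value * (1.0 - rel_step)})
        high = dict(nominal, **{name: value * (1.0 + rel_step)})
        result[name] = (model(**low), base, model(**high))
    return result


if __name__ == "__main__":
    nominal = {"mttf": 1000.0, "mttr": 10.0}  # hours; illustrative values only
    for name, (lo, mid, hi) in one_at_a_time_sensitivity(
            steady_state_availability, nominal).items():
        print(f"{name}: {lo:.5f}  {mid:.5f}  {hi:.5f}")
```

Parameters whose low-to-high spread is largest are the ones worth varying in the full study; the rest can be fixed at nominal values. One-at-a-time variation does not capture interactions between parameters.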

327 Slide 327 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Doing it Right: Model Interpretation/Documentation Make all your assumptions explicit: –Results from models are only as good as the assumptions that were made in obtaining them. –It’s easy to forget assumptions if they are not recorded explicitly. Understand the meaning of the obtained measures: –Numbers are not insights. –Understand the accuracy of the obtained measures, e.g., confidence intervals for simulation. Keep social aspects in mind: –Performance and dependability analysts almost always bring bad news. –Bearers of bad news are rarely welcomed. –In presentations, concentrate on results, not the process.

328 Slide 328 ©2005 William H. Sanders. All rights reserved. Do not duplicate without permission of the author. Next Steps You have: –Learned theory related to reliability, availability, and performance validation using SANs and Möbius –Learned about the advantages and disadvantages of various (analytical/numerical and simulation-based) solution algorithms. There are many places to go for further information: –Möbius Software Web pages (www.mobius.uiuc.edu) –Performability Engineering Research Group Web pages (www.perform.csl.uiuc.edu)

