© 2006, Monash University, Australia CSE4884 Network Design and Management Lecturer: Dr Carlo Kopp, MIEEE, MAIAA, PEng Lecture 15-16 Reliability Theory.


© 2006, Monash University, Australia CSE4884 Network Design and Management Lecturer: Dr Carlo Kopp, MIEEE, MAIAA, PEng Lecture 15-16 Reliability Theory Concepts; Managing Reliability and Maintainability in Networks

References and Reading
Igor Bazovsky, Reliability Theory and Practice, Dover Books on Mathematics, 1961 (recommended reading).
Kopp C., System Reliability and Metrics of Reliability, Peter Harding & Associates Pty Ltd, lecture slides.
MIL-STD-756 Revision B, Reliability Modeling & Prediction, US Department of Defense, Nov 1991.

Why Study Reliability in Networking?
Networks are among the most complicated systems ever created. A modern network combines hardware, embedded software and host-resident software to provide a range of data transfer and management functions. This complexity makes reliability a major consideration. Prudent network design can improve reliability; poor network design can reduce it. Running costs and user satisfaction depend strongly on network reliability. Reliability matters!

What is Reliability?
Reliability is defined as the Probability of System Survival, P[S](t), over time t:
$R(t) = P[S](t) = 1 - Q(t)$
where R(t) is reliability and Q(t) is the probability of failure. It is a measure of the likelihood of no fault occurring. Reliability is related to system function and architecture.
Kopp's 5th Axiom: All systems will fail, the only issue is when, and how frequently.

Defining System Reliability
System Reliability includes the following components:
1. Hardware Reliability.
2. Software Reliability.
3. Reliability of interaction between hardware and software.
4. Reliability of interaction between the system and the operator.
Failure to consider any of these is always at the expense of the reliability of the end product.

Hardware Reliability Classification
Considers the reliability of electronic components, printed circuit boards (PCBs), cables and interconnections (i.e. connectors), and the failure modes across all such components.
Failure regimes include:
1. Hard failures - the component or system fails and remains in that state.
2. Transient failures - the component or system temporarily experiences a loss of function.
3. Intermittent failures - the component or system repeatedly and temporarily experiences a loss of function.
Failure types are primarily divided into:
1. Random failures - exponentially distributed.
2. Wearout failures - normally distributed.
3. Infant mortality failures.

Hardware Life Cycle vs Reliability
All hardware exhibits three distinct failure modes through its operational life cycle. The first weeks or months after introduction exhibit infant mortality failures, which arise from manufacturing defects that fail under stress. Once the equipment is established in operation, it exhibits random failures, which arise in components for a variety of reasons. Random failures arrive as a Poisson process, so the times between them are exponentially distributed. As the equipment reaches the end of its useful life, it begins to exhibit wearout failures. Wearout failures are normally (Gaussian) distributed. The Bathtub Curve illustrates failure frequency over the product life cycle.

Life Cycles - Bathtub Curve
[Figure: bathtub curve of failure frequency over the product life cycle.]

Random Failures
Random failures arise throughout the useful life of equipment. They are exponentially distributed. The failure rate λ is a measure of how frequently they arise. For a constant failure rate, $R(t) = e^{-\lambda t}$ and the Mean Time Between Failures is MTBF $= 1/\lambda$.
Temperature dependency of λ - failure rates always increase at high operating temperatures.
Electrical voltage dependency of λ - failure rates always increase at higher electrical stress levels.
High stress means high failure rates!
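The relationship between failure rate, MTBF and survival probability can be sketched in a few lines. This is a minimal illustration assuming a constant failure rate, so that R(t) = exp(-λt) and MTBF = 1/λ; the function names and the 100,000 hr MTBF are illustrative assumptions, not figures for any particular product.

```python
import math


def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """Survival probability R(t) = exp(-lambda * t) for a constant failure rate."""
    return math.exp(-failure_rate_per_hour * hours)


def mtbf(failure_rate_per_hour: float) -> float:
    """Mean Time Between Failures for a constant failure rate: 1 / lambda."""
    return 1.0 / failure_rate_per_hour


# Hypothetical device with an MTBF of 100,000 hours (assumed value).
lam = 1.0 / 100_000
one_year_hours = 8760

print("R(1 year) =", reliability(lam, one_year_hours))
print("R(MTBF)   =", reliability(lam, mtbf(lam)))
```

Note that the survival probability at t = MTBF is exp(-1), roughly 37%, so an MTBF figure should not be read as a guaranteed lifetime.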

Wearout Failures
Wearout failures arise at the end of the useful life of equipment. They are normally distributed. The mean wearout time μ is a measure of the average time at which they arise. The standard deviation σ of the wearout time is a measure of their spread in time. The spread in wearout failures depends on the quality of the components and the types of loads they are subjected to over their life cycle. Wearout arises in mechanical components and connectors due to cyclic mechanical loads, and in semiconductor chips due to cyclic thermal loads and junction diffusion effects.
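To make the roles of μ and σ concrete, the sketch below evaluates the normal cumulative distribution to estimate the fraction of a population that has worn out by time t. The function name and the values μ = 10,000 hr and σ = 1,000 hr are assumptions chosen only for illustration.

```python
import math


def wearout_probability(t_hours: float, mean_hours: float, sigma_hours: float) -> float:
    """Normal CDF: probability that wearout has occurred by time t."""
    z = (t_hours - mean_hours) / (sigma_hours * math.sqrt(2.0))
    return 0.5 * (1.0 + math.erf(z))


# Hypothetical component: mean wearout time 10,000 hr, standard deviation 1,000 hr.
mu, sigma = 10_000.0, 1_000.0
for t in (8_000.0, 9_000.0, 10_000.0, 11_000.0):
    print(t, wearout_probability(t, mu, sigma))
```

A smaller σ concentrates wearout failures tightly around μ, which makes scheduled replacement before the wearout period easier to plan.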

Wearout Examples
Connectors wear out with insertion and extraction cycles, which rub away plating and cause metal fatigue damage.
Electrical relays wear out with switching cycles, which rub away plating and cause metal fatigue damage.
Fans wear out due to bush or ball bearing failures, causing a loss of airflow rate and ultimately seizure of the bush or bearing. Fans with bushes - 10,000 hr life.
Cables wear out due to cyclic mechanical loads, especially near connectors, but also due to dielectric degradation caused by age and moisture ingress.
Many electrical components fail due to oxidation of metals, which is a corrosion effect. Water, especially salt water, can produce corrosion.

System Reliability - Lusser's Product Law
Lusser's Product Law was discovered in Germany during A4/V2 ballistic missile testing in the 1943-44 period. It superseded the earlier and dysfunctional weak link model, which attributed failures to the most failure-prone component in a design. Lusser's Product Law describes the behaviour of complex series systems, in which the function of the system depends on the function of each and every component. It provides the theoretical basis of the US Mil-Hdbk-217 and Mil-Std-756B standards, which are the industry benchmark for reliability modelling.

System Reliability - Lusser's Product Law
Lusser states that the probability of system survival is the product of the individual probabilities of survival of each component in the system. This means
$P[S]_{system} = P[S]_1 \cdot P[S]_2 \cdot P[S]_3 \cdots P[S]_N$
or:
$R_{system}(t) = \prod_{i=1}^{N} R_i(t)$
where $R(t) = P[S](t) = 1 - Q(t)$ and Q(t) is the probability of failure. If we know the failure rates λ for the components in a system, we can calculate the system reliability.
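The product law is straightforward to evaluate numerically. The sketch below multiplies component survival probabilities for a series system and, under the added assumption of constant failure rates (R_i(t) = exp(-λ_i t)), shows that the same result follows from summing the failure rates. Function names and numbers are illustrative assumptions.

```python
import math
from typing import Sequence


def series_reliability(component_reliabilities: Sequence[float]) -> float:
    """Lusser's Product Law: system survival is the product of component survivals."""
    return math.prod(component_reliabilities)


def series_reliability_from_rates(failure_rates_per_hour: Sequence[float], hours: float) -> float:
    """Series system with constant failure rates: the rates simply add."""
    return math.exp(-sum(failure_rates_per_hour) * hours)


# Three components, each with a 90% survival probability (assumed values).
print(series_reliability([0.9, 0.9, 0.9]))

# One hundred components at 99% each: high part reliability, poor system reliability.
print(series_reliability([0.99] * 100))
```

The second example illustrates why complex series systems demand very reliable components: the system figure falls quickly as the component count grows.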

Parallel Systems - Redundancy
Failure of a single element is survivable, but P[S] is then reduced as a result.
$R_p = 1 - Q^N$
where Q is the probability of failure of each of the N redundant components.
Used in aircraft flight control systems, the Space Shuttle and critical control applications. Large servers with multiple parallel interfaces are a good example of a parallel system built for reliability. RAID storage servers are a similar example.
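The parallel formula can be checked with a short sketch. It assumes N identical, independent redundant components, each with failure probability Q, and that the system survives as long as at least one of them survives; the function name and the value Q = 0.1 are illustrative assumptions.

```python
def parallel_reliability(q_failure: float, n_redundant: int) -> float:
    """R_p = 1 - Q**N: the system fails only if all N redundant components fail."""
    return 1.0 - q_failure ** n_redundant


# Assumed component failure probability of 10% over the period of interest.
for n in (1, 2, 3):
    print(n, parallel_reliability(0.1, n))
```

Each added redundant component multiplies the system failure probability by Q, so the gain from the first spare is much larger than from later ones.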

Complex System Reliability
Complex systems combine parallel and serial models. Such systems require detailed analysis to determine R(t) for subsystems and the complete system. It is necessary to analyse for dependencies. For instance, a cascading series of failures may arise if one component fails and overstresses other components, which fail in turn (refer to Mil-Std-756B). Designers must avoid Single Point of Failure (SPoF) items. Such items are typically shared components. The higher the complexity of the system, the higher the component reliability needed to achieve any given MTBF. In very complicated systems, this can require exceptionally high component reliability (and thus cost).

Example - RAID Array in Server System (2006)
N x 1 RAID array with a single cooling fan and power supply.
Assessment:
1. Disk drive redundancy in the array is good.
2. Power supply failure represents a single point of failure.
3. Fan failure represents a single point of failure.
Problem fixed by introducing redundant fans and power supplies. By removing single point of failure items we have significantly improved the reliability of the system.
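The effect of removing the single points of failure can be estimated by combining the series and parallel models. The sketch below treats the disk array, the cooling and the power as three blocks in series, then replaces the single fan and power supply with redundant pairs. All reliability values are hypothetical and chosen only to show the shape of the calculation.

```python
import math


def series(*blocks: float) -> float:
    """Series model: every block must survive."""
    return math.prod(blocks)


def parallel(r_component: float, n: int) -> float:
    """Parallel model: R_p = 1 - Q**N with Q = 1 - R for each identical component."""
    return 1.0 - (1.0 - r_component) ** n


# Hypothetical survival probabilities over some mission time (assumed values).
R_DISK_ARRAY = 0.999  # array already has redundant drives
R_FAN = 0.95
R_PSU = 0.97

single = series(R_DISK_ARRAY, R_FAN, R_PSU)
redundant = series(R_DISK_ARRAY, parallel(R_FAN, 2), parallel(R_PSU, 2))

print("single fan and PSU:    ", single)
print("redundant fans and PSU:", redundant)
```

With these assumed numbers the system is limited by the fan and power supply rather than the disks, which is the point of the assessment above.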

Example - P-38 Twin Engine Aircraft (1944)
Electrical power was needed for propeller pitch control, radiator and intercooler doors, dive flap actuators and turbocharger controls. The aircraft had two engines, but only one of them carried a generator. On loss of the generator-equipped engine, the propeller was feathered and the aircraft failed over to battery. Once the battery was flat, the propeller unfeathered and windmilled, the turbocharger ran away, and the aircraft crashed. The problem was fixed with dual generators, one per engine. There was a significant loss of pilot lives before the problem was solved.

Software vs Hardware Reliability
Hardware failures can induce software failures. Software failures can induce hardware failures. It is often extremely difficult to separate hardware and software failures. We cannot apply physical models to software failures. While Lusser's Product Law provides a model for system level reliability, we have no hard measures for calculating or estimating the component failure rates in software. The result of software and/or hardware failures is system failure. Networking equipment, especially routers, contains significant amounts of embedded software to handle protocol stacks, management, and data buffering.

Modes of Software Failure
We can identify four basic modes of software failure:
1. Transient Failure - the program produces an incorrect result, but the program continues to run.
2. Hard Failure - the program crashes (stack overrun, heap overrun, broken thread) and ceases to run.
3. Cascaded Failure - the program crashes and takes down other programs as a result.
4. Catastrophic Failure - the program crashes and takes down the operating system or the complete system, resulting in total failure.

Types of Software Failure
Numerical Failure - bad result calculated.
Propagated Numerical Failure - bad result used in other calculations.
Control Flow Failure - control flow of thread is diverted.
Propagated Control Flow Failure - bad control flow propagates through code.
Addressing Failure - bad pointer or array index.
Synchronisation Failure - two pieces of code misunderstand each other's state.
In networking equipment, the synchronisation failure is a very common occurrence. It usually arises due to bugs, misconfiguration or incompatible implementations of a protocol engine. Case study: PPP LCP failures.

Runtime Detection of Software Failures
Consistency checks on values - is the result that which was expected?
Watchdog timers - has an operation completed on time?
Bounds checking - is the result reasonable or within some safe limits?
Well designed embedded software in networking equipment must have runtime software failure detection functions built in. When choosing equipment it is essential to determine whether critical equipment items have any such capability.
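The three detection techniques above can be sketched as a wrapper around an operation. This is a simplified illustration, not how any particular vendor implements it: the names `checked_call` and `RuntimeCheckError` are hypothetical, and the deadline check here only reports a late result after the operation returns, whereas a real watchdog timer would interrupt or reset a hung operation.

```python
import time
from typing import Callable, Optional


class RuntimeCheckError(Exception):
    """Raised when a runtime failure detection check trips (hypothetical name)."""


def checked_call(
    operation: Callable[[], float],
    lower: float,
    upper: float,
    deadline_s: float,
    expected: Optional[float] = None,
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Run an operation, then apply deadline, bounds and consistency checks."""
    start = clock()
    result = operation()
    elapsed = clock() - start

    # Watchdog-style check: did the operation complete on time?
    if elapsed > deadline_s:
        raise RuntimeCheckError(f"deadline exceeded: {elapsed:.3f}s > {deadline_s:.3f}s")
    # Bounds check: is the result within safe limits?
    if not (lower <= result <= upper):
        raise RuntimeCheckError(f"result {result} outside [{lower}, {upper}]")
    # Consistency check: is the result the one that was expected?
    if expected is not None and result != expected:
        raise RuntimeCheckError(f"result {result} != expected {expected}")
    return result


# A well-behaved operation passes all three checks.
print(checked_call(lambda: 5.0, lower=0.0, upper=10.0, deadline_s=1.0, expected=5.0))
```

The `clock` parameter is injectable only so that the deadline check can be exercised in tests without actually waiting.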

Recovery Strategies - Runtime Failures
Redundant data structures - overwrite bad data with clean data.
Signal operator or log problem cause and then die.
Hot Start - restart from known position, do not reinitialise data structures.
Cold Start - reinitialise data structures and restart, or reboot.
Failover to Standby System in redundant scheme (e.g. flight controls).
If an item of networking equipment is critical, it is important to determine how it handles runtime failures. Far too often a failure in synchronisation is not detected, causing chaos as a result.

Typical Causes of Software Failures
Programmer did not understand the system design very well.
Programmer made unrealistic assumptions about operating conditions.
Programmer made coding error.
Programmers and hardware engineers did not talk to each other.
Inadequate or inappropriate testing of code.
A network designer is unlikely to have access to embedded code in equipment, or access to its designers. If an item is critical to the function of the network, it should be tested rigorously before it is introduced into a production network and made available to users.

Dormant Fault Problem
Statistical models used for hardware are irrelevant. Code may be operational for years with a fatal bug hidden somewhere. A set of conditions may one day arise which triggers the fault. If a major disaster arises, it may be impossible to recreate the same conditions. In a large and complex network, many dormant bugs may exist in embedded code inside equipment and in host operating systems. If such bugs result in transient or intermittent failures, they may be extremely difficult to isolate.

Complex System Problem
Extremely complex systems will be extremely difficult to simulate or test. Complexity may result in infeasible regression testing time. Components of the system may interact in unpredictable ways. Synchronisation failures may arise. Faults may be hidden and symptoms not easily detectable due to complexity. Networks are a typical case study of a complex system, insofar as they may have hundreds of switches and routers, all with embedded software running in them.

Network Design for Reliability
1. Network design objectives must be well understood.
2. Redundancy should be used as appropriate for critical portions of the design, especially if a formal reliability specification exists for the network.
3. Failure modes and consequences should be understood for all items of hardware and software in the network.
4. Each hardware and software module should be tested thoroughly before use in a network design.
5. A hardware reliability model should be produced, based on Mil-Std-756B or a serial model, as required.
6. Good estimates of hardware reliability are feasible where manufacturers are able to provide MTBF figures for equipment.

Maintainability
Regardless of how reliable a network might be, maintainability is a critical operational issue. When inevitable failures arise, they must be fixed as quickly as possible. Network failures can cause the loss of hundreds or thousands of personnel hours for every hour of network downtime. Time is money! From a user and management perspective, maintainability is very important and impacts any economic assessment of the running costs of a network. It is customary to measure network reliability in terms of overall MTBF, or in terms of Availability, which is the long-run fraction of time during which the network can be used.

Maintainability - MTTR
The most common measure of maintainability is MTTR. Unfortunately, multiple definitions exist for MTTR, so a designer or manager must be very careful when contracting:
Mean Time To Respond - average time for a maintenance crew to respond to a request.
Mean Time To Repair - average time to repair a fault.
Mean Time To Restart - average time to restart the network after a fault.
Mean Time To Restore - average time to restore network function after a fault.
Some hardware suppliers will provide MTTR (repair) numbers for their products. In general, care should be taken when specifying MTTR and when assessing MTTR in a bid.

Maintainability
Repairing hardware faults requires spare parts, or complete replacement equipment. If good MTTR is required, then it is necessary to maintain a stockpile of spares. The size of the stockpile is typically determined by the MTBF of the component and the number of items in operation. For instance, if the MTBF of a switch is 100,000 hrs and you have 100 of them in operation, annually you incur a total of 100 x 8,760 = 876,000 hrs of running time on these switches. You can thus expect, on average, 876,000 / 100,000 = 8.76 faults annually in these switches. A spares stockpile of around 10 switches would be needed, and an annual budget for 10 replacements.
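The spares arithmetic above can be reproduced in a few lines, and extended to ask how likely it is that a given stockpile is exhausted within a year. The second function assumes failures follow a Poisson distribution with the expected annual count as its mean and that spares are not replenished during the year; both function names are illustrative.

```python
import math


def expected_annual_failures(mtbf_hours: float, units: int, hours_per_year: float = 8760.0) -> float:
    """Expected failures per year = total running hours per year / MTBF."""
    return units * hours_per_year / mtbf_hours


def prob_more_failures_than(spares: int, expected: float) -> float:
    """Poisson probability that the annual failure count exceeds the spares held."""
    p_at_most = sum(
        math.exp(-expected) * expected ** k / math.factorial(k)
        for k in range(spares + 1)
    )
    return 1.0 - p_at_most


# The figures from the example: 100 switches, each with an MTBF of 100,000 hrs.
mean_failures = expected_annual_failures(100_000, 100)
print("expected failures per year:", mean_failures)
print("P(more than 10 failures):  ", prob_more_failures_than(10, mean_failures))
```

Because the failure count is random, a stockpile sized close to the mean still leaves a real chance of running short, which is worth weighing against resupply lead times.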

Frequency of Repairs vs Availability
Random failures follow a Poisson process, so although we can expect some number of faults annually, we cannot know exactly when they will occur. MTBF and population size for specific components determine how frequently, on average, one of them will fail. We can estimate the Availability of the network if we have good estimates for MTBF and MTTR. In practice, MTBF can be calculated accurately, but MTTR can be difficult to measure accurately, especially given the range of possible network failure modes and the debugging times which result.
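One common way to turn MTBF and MTTR into an availability estimate is the steady-state ratio A = MTBF / (MTBF + MTTR). The slides do not state this formula explicitly, so treat it as an assumption here, along with the function names and the 4 hr MTTR used in the example.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the item is usable."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(mtbf_hours: float, mttr_hours: float, hours_per_year: float = 8760.0) -> float:
    """Expected unusable hours per year implied by the availability figure."""
    return (1.0 - availability(mtbf_hours, mttr_hours)) * hours_per_year


# Hypothetical item: MTBF of 100,000 hrs, MTTR (restore) of 4 hrs.
print("availability:       ", availability(100_000, 4))
print("downtime hrs / year:", annual_downtime_hours(100_000, 4))
```

The result is only as good as the MTTR definition behind it: using Mean Time To Repair rather than Mean Time To Restore will overstate availability if restart and recovery take additional time.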

Axioms to Memorise
Murphy's Law applies 99% of the time (Vonada's Law).
Simpler solutions are usually easier to prove correct (Occam's Razor).
Paranoia Pays Off (Kopp's 4th Axiom).
All systems will fail, the only issue is when, and how frequently (Kopp's 5th Axiom).

Tutorial
Q&A and discussion, case studies.
