© 2006, Monash University, Australia CSE4884 Network Design and Management Lecturer: Dr Carlo Kopp, MIEEE, MAIAA, PEng Lecture 15-16 Reliability Theory.


© 2006, Monash University, Australia CSE4884 Network Design and Management Lecturer: Dr Carlo Kopp, MIEEE, MAIAA, PEng Lecture 15-16 Reliability Theory Concepts; Managing Reliability and Maintainability in Networks

© 2006, Monash University, Australia References and Reading
Igor Bazovsky, Reliability Theory and Practice, Dover Books on Mathematics, 1961 - recommended reading.
Kopp C., System Reliability and Metrics of Reliability, Peter Harding & Associates, Pty Ltd, lecture slides.
MIL-STD-756B, Reliability Modeling & Prediction, US Department of Defense, Revision B, Nov 1991.

© 2006, Monash University, Australia Why Study Reliability in Networking? Networks are among the most complicated systems ever created by man. A modern network combines hardware, embedded software and host system resident software, providing a range of data transfer and management functions. The complexity of a modern network makes its reliability a major consideration. Prudent network design can improve its reliability. Poor network design can reduce its reliability. Running costs and user satisfaction depend strongly on network reliability. Reliability matters!

© 2006, Monash University, Australia What is Reliability? Reliability is defined as the probability of system survival, P[S](t), over a time interval t. In symbols, R(t) = P[S](t) = 1 - Q(t), where R(t) is the reliability and Q(t) is the probability of failure. It is a measure of the likelihood of no fault occurring. Reliability is related to system function and architecture. Kopp's 5th Axiom: All systems will fail; the only issue is when, and how frequently.

© 2006, Monash University, Australia Defining System Reliability System reliability includes the following components:
1. Hardware reliability.
2. Software reliability.
3. Reliability of the interaction between hardware and software.
4. Reliability of the interaction between the system and the operator.
Failure to consider any of these is always at the expense of the reliability of the end product.

© 2006, Monash University, Australia Hardware Reliability Classification Considers the reliability of electronic components, printed circuit boards (PCBs), cables and interconnections (i.e. connectors), and the failure modes across all such components. Failure regimes include:
1. Hard failures - the component or system fails and remains in that state.
2. Transient failures - the component or system temporarily experiences a loss of function.
3. Intermittent failures - the component or system repeatedly and temporarily experiences a loss of function.
Failure types are primarily divided into:
1. Random failures - exponentially distributed.
2. Wearout failures - normally distributed.
3. Infant mortality failures - arising early in life from manufacturing defects.

© 2006, Monash University, Australia Hardware Life Cycle vs Reliability All hardware exhibits three distinct types of failure through its operational life cycle. The first weeks or months after introduction exhibit infant mortality failures, which arise from manufacturing defects that surface under stress. Once the equipment is established in operation, it exhibits random failures, which arise in components for a variety of reasons; the number of random failures in any given interval is Poisson distributed (equivalently, the times between them are exponentially distributed). As the equipment reaches the end of its useful life, it begins to exhibit wearout failures, which are normally (Gaussian) distributed. The Bathtub Curve illustrates failure frequency over the product life cycle.

© 2006, Monash University, Australia Life Cycles - Bathtub Curve

© 2006, Monash University, Australia Random Failures Random failures arise throughout the useful life of equipment. The times between them are exponentially distributed. The failure rate λ is a measure of how frequently they arise; for a constant failure rate, MTBF (Mean Time Between Failures) = 1/λ. Temperature dependency of λ - failure rates always increase at high operating temperatures. Electrical voltage dependency of λ - failure rates always increase at higher electrical stress levels. High stress means high failure rates!
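The exponential model can be put to work directly. Below is a minimal Python sketch (not from the lecture) of how one might compute R(t) from an MTBF figure, assuming a constant failure rate; the 100,000 hr MTBF is an illustrative value only.

```python
import math

def reliability(t_hours, mtbf_hours):
    """R(t) = exp(-lambda * t) for a constant (random) failure rate, lambda = 1 / MTBF."""
    failure_rate = 1.0 / mtbf_hours            # lambda, failures per hour
    return math.exp(-failure_rate * t_hours)

# Example: a device with an (assumed) 100,000 hr MTBF, run for one year (8,760 hrs)
print(reliability(8760, 100_000))              # ~0.916, i.e. ~91.6% chance of no failure
```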

© 2006, Monash University, Australia Wearout Failures Wearout failures arise at the end of the useful life of equipment. They are normally distributed. The mean wearout time μ is a measure of the average time at which they arise, and the standard deviation σ is a measure of their spread in time. The spread in wearout failures depends on the quality of the components and the types of loads they are subjected to over their life cycle. Wearout arises in mechanical components and connectors due to cyclic mechanical loads, and in semiconductor chips due to cyclic thermal loads and junction diffusion effects.
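As a rough illustration of the normal wearout model, the sketch below estimates what fraction of a population has worn out by time t, assuming wearout times are normally distributed with mean μ and standard deviation σ; the fan figures are assumptions, loosely based on the 10,000 hr bush-fan life quoted on the next slide.

```python
import math

def fraction_worn_out(t_hours, mean_wearout, sigma):
    """Fraction of a population expected to have failed through wearout by time t,
    assuming wearout times are normally distributed (mean mu, std dev sigma)."""
    z = (t_hours - mean_wearout) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # standard normal CDF

# Example: bush fans with an assumed 10,000 hr mean wearout life and 1,500 hr spread
print(fraction_worn_out(8000, 10_000, 1500))            # ~0.09, i.e. ~9% worn out by 8,000 hrs
```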

© 2006, Monash University, Australia Wearout Examples Connectors wear out with insertion and extraction cycles, which rub away plating and cause metal fatigue damage. Electrical relays wear out with switching cycles, which likewise rub away plating and cause metal fatigue damage. Fans wear out due to bush or ball bearing failures, causing a loss of airflow and ultimately seizure of the bush or bearing; fans with bushes have around a 10,000 hr life. Cables wear out due to cyclic mechanical loads, especially near connectors, and also due to dielectric degradation from age and moisture ingress. Many electrical components fail due to oxidation of metals, which is a corrosion effect; water, especially salt water, promotes corrosion.

© 2006, Monash University, Australia System Reliability – Lusser's Product Law Lusser's Product Law was discovered in Germany during A4/V2 ballistic missile testing in World War II. It superseded the earlier and dysfunctional weak link model, which attributed failures to the most failure-prone component in a design. Lusser's Product Law describes the behaviour of complex series systems, in which the function of the system depends on the function of each and every component. It provides the theoretical basis of the US Mil-Hdbk-217 and Mil-Std-756B standards, which are the industry benchmark for reliability modelling.

© 2006, Monash University, Australia System Reliability – Lusser's Product Law Lusser's Product Law states that the probability of system survival is the product of the individual probabilities of survival of each component in the system: P[S]_system = P[S]_1 * P[S]_2 * P[S]_3 * ... * P[S]_N, or equivalently R_system(t) = R_1(t) * R_2(t) * ... * R_N(t), where R(t) = P[S](t) = 1 - Q(t) and Q(t) is the probability of failure. If we know the failure rates λ of the components in a system, we can calculate the system reliability.
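A minimal Python sketch of Lusser's Product Law for components with constant failure rates; the device MTBF figures are assumptions chosen purely for illustration.

```python
import math

def series_reliability(t_hours, component_mtbfs):
    """Lusser's Product Law for a series system with constant failure rates:
    R_sys(t) is the product of the component reliabilities exp(-lambda_i * t)."""
    r = 1.0
    for mtbf in component_mtbfs:
        r *= math.exp(-t_hours / mtbf)   # R_i(t) = exp(-lambda_i * t)
    return r

# Example: a path through three devices with assumed MTBFs of 50k, 100k and 200k hrs
print(series_reliability(8760, [50_000, 100_000, 200_000]))   # ~0.74 over one year
```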

© 2006, Monash University, Australia Parallel Systems - Redundancy Failure of a single element is survivable, but P[S] is then reduced as a result. For N redundant elements, R_p = 1 - Q^N, where Q is the probability of failure of each of the redundant components. Used in aircraft flight control systems, the Space Shuttle and critical control applications. Large servers with multiple parallel interfaces are a good example of a parallel system built for reliability. RAID storage servers are a similar example.
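The parallel redundancy formula can be illustrated with a short sketch; the 90% component reliability is an assumed figure, not from the slides.

```python
def parallel_reliability(component_reliability, n):
    """N identical redundant components; the system survives if at least one survives:
    R_p = 1 - Q**n, where Q = 1 - R is each component's probability of failure."""
    q = 1.0 - component_reliability
    return 1.0 - q ** n

# Example: a single 90%-reliable element versus two or three in parallel
for n in (1, 2, 3):
    print(n, parallel_reliability(0.90, n))   # ~0.90, 0.99, 0.999
```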

© 2006, Monash University, Australia Complex System Reliability Complex systems combine parallel and serial models. Such systems require detailed analysis to determine R(t) for subsystems and the complete system. It is necessary to analyse for dependencies: for instance, a cascading series of failures may arise if one component fails and overstresses other components, which fail in turn (refer to Mil-Std-756B). Designers must avoid Single Point of Failure (SPoF) items; such items are typically shared components. The higher the complexity of the system, the higher the component reliability needed to achieve any given MTBF. In very complicated systems, this can require exceptionally high component reliability (and thus cost).
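A combined series-parallel model might be evaluated along the following lines. This is a sketch only: the topology (two redundant switches feeding one router and one server) and all MTBF values are assumptions for illustration, not figures from the lecture.

```python
import math

def r_exp(t, mtbf):
    """Component reliability under a constant failure rate: R(t) = exp(-t / MTBF)."""
    return math.exp(-t / mtbf)

def series(*rs):
    """Series block: every element must survive (Lusser's product)."""
    out = 1.0
    for r in rs:
        out *= r
    return out

def parallel(*rs):
    """Parallel block: the block survives if at least one element survives."""
    q = 1.0
    for r in rs:
        q *= (1.0 - r)
    return 1.0 - q

# Assumed topology: a redundant pair of switches in series with one router and one server,
# evaluated over one year (8,760 hrs); all MTBF figures are illustrative.
t = 8760
r_switch, r_router, r_server = r_exp(t, 80_000), r_exp(t, 120_000), r_exp(t, 60_000)
print(series(parallel(r_switch, r_switch), r_router, r_server))   # ~0.79
```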

© 2006, Monash University, Australia Example – RAID Array in Server System (2006) N x 1 RAID array with a single cooling fan and power supply. Assessment: 1. Disk drive redundancy in the array is good. 2. The power supply represents a single point of failure. 3. The fan represents a single point of failure. The problem is fixed by introducing redundant fans and power supplies. By removing the single point of failure items we have significantly improved the reliability of the system.

© 2006, Monash University, Australia Example - P-38 Twin Engine Aircraft (1944) Electrical propeller pitch control, radiator and intercooler doors, dive flap actuators and turbocharger controls. Twin engine aircraft, but only one generator, fitted to one of the engines. Loss of the generator-equipped engine - feather the propeller, fail over to battery power. Once the battery goes flat, the propeller unfeathers and windmills, the turbocharger runs away -> the aircraft crashes. The problem was fixed with dual generators, one per engine. Significant loss of pilot lives until the problem was solved.

© 2006, Monash University, Australia Software vs Hardware Reliability Hardware failures can induce software failures. Software failures can induce hardware failures. It is often extremely difficult to separate hardware and software failures. We cannot apply physical models to software failures. While Lusser's Product Law provides a model for system level reliability, we have no hard measures for calculating or estimating the component failure rates in software. The result of software and/or hardware failures is system failure. Networking equipment, especially routers, contains significant amounts of embedded software to handle protocol stacks, management, and data buffering.

© 2006, Monash University, Australia Modes of Software Failure We can identify four basic modes of software failure:
1. Transient Failure – the program produces an incorrect result, but continues to run.
2. Hard Failure – the program crashes (stack overrun, heap overrun, broken thread) and ceases to run.
3. Cascaded Failure – the program crashes and takes down other programs as a result.
4. Catastrophic Failure – the program crashes and takes down the operating system or the complete system -> total failure.

© 2006, Monash University, Australia Types of Software Failure Numerical Failure - a bad result is calculated. Propagated Numerical Failure - a bad result is used in other calculations. Control Flow Failure - the control flow of a thread is diverted. Propagated Control Flow Failure - bad control flow propagates through the code. Addressing Failure - a bad pointer or array index. Synchronisation Failure - two pieces of code misunderstand each other's state. In networking equipment, synchronisation failures are very common. They usually arise due to bugs, misconfiguration or incompatible implementations of a protocol engine. Case study – PPP LCP failures.

© 2006, Monash University, Australia Runtime Detection of Software Failures Consistency checks on values – is the result what was expected? Watchdog timers – has an operation completed on time? Bounds checking – is the result reasonable, or within some safe limits? Well-designed embedded software in networking equipment must have runtime failure detection functions built in. When choosing equipment, it is essential to determine whether critical equipment items have any such capability.
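The detection mechanisms listed above might look roughly like the following Python sketch; the Watchdog class, the bounded() helper and the timeout and range values are hypothetical, not drawn from any particular equipment.

```python
import time

def bounded(value, lo, hi, fallback):
    """Bounds check: accept a computed value only if it lies inside a safe range."""
    return value if lo <= value <= hi else fallback

class Watchdog:
    """Minimal software watchdog: the supervised task must kick() it before the
    deadline, otherwise expired() reports a hung or overrunning operation."""
    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.kick()

    def kick(self):
        self.last = time.monotonic()

    def expired(self):
        return time.monotonic() - self.last > self.timeout_s

wd = Watchdog(timeout_s=2.0)
# ... periodic protocol processing would call wd.kick() here ...
if wd.expired():
    print("watchdog expired - restart the protocol task")   # recovery action
print(bounded(1.7, 0.0, 1.0, fallback=0.0))                  # out of range -> 0.0
```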

© 2006, Monash University, Australia Recovery Strategies – Runtime Failures Redundant data structures - overwrite bad data with clean data. Signal operator or log problem cause and then die. Hot Start - restart from known position, do not reinitialise data structures. Cold Start - reinitialise data structures and restart, or reboot. Failover to Standby System in redundant scheme (eg flight controls). If an item of networking equipment is critical, it is important to determine how it handles runtime failures. Far too often a failure in synchronisation is not detected, causing chaos as a result.

© 2006, Monash University, Australia Typical Causes of Software Failures Programmer did not understand the system design very well. Programmer made unrealistic assumptions about operating conditions. Programmer made coding error. Programmers and hardware engineers did not talk to each other. Inadequate or inappropriate testing of code. A network designer is unlikely to have access to embedded code in equipment, or access to designers. If an item is critical to the function of the network, it should be tested rigorously before it is introduced into a production network and made available to users.

© 2006, Monash University, Australia Dormant Fault Problem Statistical models used for hardware are irrelevant. Code may be operational for years with a fatal bug hidden somewhere. A set of conditions may one day arise which triggers the fault. If a major disaster arises, it may be impossible to recreate the same conditions. In a large and complex network, many dormant bugs may exist in the embedded code inside equipment and in host operating systems. If such bugs result in transient or intermittent failures, they may be extremely difficult to isolate.

© 2006, Monash University, Australia Complex System Problem Extremely complex systems are extremely difficult to simulate or test. Complexity may result in infeasible regression testing time. Components of the system may interact in unpredictable ways. Synchronisation failures may arise. Faults may be hidden and their symptoms not easily detectable due to complexity. Networks represent a typical case study of a complex system, insofar as they may have hundreds of switches and routers, all with embedded software running in them.

© 2006, Monash University, Australia Network Design for Reliability
1. Network design objectives must be well understood.
2. Redundancy should be used as appropriate for critical portions of the design, especially if a formal reliability specification exists for the network.
3. Failure modes and consequences should be understood for all items of hardware and software in the network.
4. Each hardware and software module should be tested thoroughly before use in a network design.
5. A hardware reliability model should be produced, based on Mil-Std-756B or a serial model, as required.
6. Good estimates of hardware reliability are feasible where manufacturers are able to provide MTBF figures for equipment.

© 2006, Monash University, Australia Maintainability Regardless of how reliable a network might be, maintainability is a critical operational issue. When the inevitable failures arise, they must be fixed as quickly as possible. Network failures can cause the loss of hundreds or thousands of personnel hours for every hour of network downtime. Time is money! From a user and management perspective, maintainability is very important and impacts any economic assessment of the running costs of a network. It is customary to measure network reliability in terms of overall MTBF, or in terms of Availability, which is the fraction of time the network can be used.

© 2006, Monash University, Australia Maintainability - MTTR The most common measure of maintainability is MTTR. Unfortunately, multiple definitions exist for MTTR, so a designer or manager must be very careful when contracting:
Mean Time To Respond – average time for a maintenance crew to respond to a request.
Mean Time To Repair – average time to repair a fault.
Mean Time To Restart – average time to restart the network after a fault.
Mean Time To Restore – average time to restore network function after a fault.
Some hardware suppliers will provide MTTR (repair) numbers for their products. In general, care should be taken when specifying MTTR and when assessing MTTR in a bid.

© 2006, Monash University, Australia Maintainability Repairing hardware faults requires spare parts, or complete replacement equipment. If good MTTR is required, then it is necessary to maintain a stockpile of spares. The size of the stockpile is typically determined by the MTBF of the component, and the number of items in operation. For instance, if the MTBF of a switch is 100,000 hrs and you have 100 of them in operation, annually you incur a total of 876,000 hrs of running time on these switches. You can thus expect, on average, 8.76 faults annually in these switches. A spares stockpile of around 10 switches would be needed, and an annual budget for 10 replacements.
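The spares arithmetic on this slide can be captured in a few lines; the helper name and the one-unit margin over the expected failure count are assumptions, but the 100-switch, 100,000 hr MTBF figures are the slide's own.

```python
def expected_annual_failures(mtbf_hours, population, hours_per_year=8760):
    """Expected number of random failures per year across a fleet of identical units."""
    return population * hours_per_year / mtbf_hours

# The slide's worked example: 100 switches, each with a 100,000 hr MTBF
failures = expected_annual_failures(100_000, 100)
print(failures)              # 8.76 expected faults per year
print(round(failures) + 1)   # ~10 spares, allowing a one-unit margin over the expectation
```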

© 2006, Monash University, Australia Frequency of Repairs vs Availability We can expect some number of faults annually due to random failures, but since these are Poisson distributed we cannot know exactly when they will occur. The MTBF and population size for specific components determine, on average, how frequently one of them will fail. We can estimate the Availability of the network if we have good estimates for MTBF and MTTR. In practice, MTBF can be calculated accurately, but MTTR can be difficult to measure accurately, especially given the range of possible network failure modes and the debugging times which result.
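A common way to estimate Availability from MTBF and MTTR is the steady-state formula A = MTBF / (MTBF + MTTR); the sketch below assumes that formula, and the 10,000 hr MTBF and 4 hr MTTR figures are illustrative only.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the system is usable,
    A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Illustrative figures: a network element with a 10,000 hr MTBF restored in 4 hrs
print(availability(10_000, 4))   # ~0.9996, i.e. roughly 99.96% availability
```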

© 2006, Monash University, Australia Axioms to Memorise
Murphy's Law applies 99% of the time (Vonada's Law).
Simpler solutions are usually easier to prove correct (Occam's Razor).
Paranoia pays off (Kopp's 4th Axiom).
All systems will fail; the only issue is when, and how frequently (Kopp's 5th Axiom).

© 2006, Monash University, Australia Tutorial Q&A and Discussion, case studies