1 Introduction to Engineering, Spring 2007
Lecture 16: Reliability & Probability

2 Review
- Probability, Part 2
- Bayes' Rule
- Using MATLAB

3 Review: Probability
DEFINITION: The probability of an event is the ratio of the number of cases in which the event occurs to the total number of possible cases:
P(outcome) = (number of desired outcomes) / (number of possible outcomes)
EXAMPLE: The probability of drawing a diamond out of a deck of cards is P(diamond) = 13/52.
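As a quick check of the ratio, the same value can be computed directly; this is a minimal sketch that only restates the deck counts from the slide:

```python
# Probability as a ratio of counts, using the slide's card example
desired_outcomes = 13    # diamonds in a standard 52-card deck
possible_outcomes = 52   # total cards

p_diamond = desired_outcomes / possible_outcomes
print(p_diamond)  # 0.25
```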

4 Outline
- Introduction to Fault Tolerance
- Reliability
- Reliability calculations

5 Introduction to Fault Tolerance

6 A Dose of Reality
Everything breaks down:
- a switch stuck at open
- a wrong value in a program
- a deviation from expected performance

7 Failure Chain
- Problems at every stage can result in system failures: specification mistakes, implementation mistakes, external disturbances, and component defects lead to software faults and hardware faults, which lead to errors, which lead to system failures.
- Three types of control: Fault Avoidance, Fault Masking, Fault Tolerance.

8 Primary Design Techniques
- Fault Avoidance: prevents faults in the first place (e.g., design review).
- Fault Masking: localizes the fault and prevents the error from getting into the system's informational structure (e.g., error-correcting codes).
- Fault Tolerance: allows the system to perform its tasks in the presence of faults.

9 Ethical & Moral Responsibility
Computers are used where system failure would be catastrophic in terms of money, human lives, or the ecosystem. As engineers we have a responsibility to ensure that the systems we design provide the highest level of protection required by the application.

10 Failures

11 Downtime Costs

12 Downtime Survey
- In 1992, Stratus commissioned major research on the "Impact of Online Computer Systems Downtime on American Business".
- Interviewed 450 senior information executives from American corporations in telecommunications, financial services, retail manufacturing, insurance, travel, and transportation.
- RESULT: downtime equates to lost revenue and customer dissatisfaction.
- Executives reported $80,000 to $300,000 in losses per hour of downtime.
- The average company reported downtime 9 times per year, each outage averaging 4 hours.

13 Competing Concerns
- There is constant pressure to reduce costs and production time.
- Fault tolerance adds cost in hardware, design, and verification, which lengthens the development cycle.
- A compressed schedule can result in a greater number of errors, and those errors escape into the field.

14 Reliability

15 Reliability
- The reliability of a system, R(t), is a function of time: it defines the probability that the system will perform correctly from time 0 to time t.
- When reliability is specified as a design parameter, it is usually a high value; a reliability of 0.9999 is not uncommon. It is often noted by the number of 9's ("four 9's" reliability), sometimes written in shorthand with the count of nines as an exponent (0.9⁴).
- The design parameter may be something other than reliability: mean time to failure (MTTF), mean time between failures (MTBF), or mean time to repair (MTTR).

16 Failure Rate
- The failure rate, λ, is the expected number of failures of a device or system per given time period. If a system fails on average once every 2000 hours, then there are 1/2000 failures/hour, or λ = 0.0005.
- The failure rate for a device will change over time, and experience has shown that it follows a "bathtub" curve.
- (Figure: failure rate vs. time, with an infant mortality phase, a useful life period, and a wear-out phase.)

17 Exponential Failure Law
- During the useful life phase, when the failure rate is constant, the relationship between the reliability and the failure rate is exponential: R(t) = e^(-λt).
- DESIGN ISSUE: the design specifications will be in terms of a certain level of reliability over a given time period. To determine the reliability, however, we first need to know the failure rate of the components.
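A short sketch of how the exponential failure law behaves, reusing the once-per-2000-hours failure rate from the previous slide as an illustrative value (the mission times are arbitrary):

```python
import math

failure_rate = 1 / 2000          # lambda: one failure per 2000 hours, on average
for t in (100, 1000, 2000):      # mission times in hours
    reliability = math.exp(-failure_rate * t)
    print(f"R({t} h) = {reliability:.4f}")
# R(100 h) = 0.9512, R(1000 h) = 0.6065, R(2000 h) = 0.3679
```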

18 Design Issue
- Reliability is often expressed as a design parameter.
- PROBLEM: Given an estimate of the failure rate of a design, how do we calculate the reliability of the system?
- This is a common problem: going from what we know about a design to a measure of a requirement.
- There are several measures of reliability which are related to the failure rate: Mean Time to Failure, Mean Time Between Failures, Mean Time to Repair.

19 Mean Time to Failure
- The MTTF is the expected time that a system will operate before the first failure occurs.
- It is the expected (read: average) time to failure, and it can be computed from the reliability function of the system.
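Making the link to the reliability function explicit; this is a standard identity for the exponential failure law of slide 17, not a derivation shown on the slide itself:

```latex
\mathrm{MTTF} = \int_{0}^{\infty} R(t)\,dt
             = \int_{0}^{\infty} e^{-\lambda t}\,dt
             = \frac{1}{\lambda}
```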

20 Mean Time to Repair
- The mean time to repair (MTTR) is the average time required to repair a system.
- It is very difficult to estimate and is often determined experimentally by injecting faults into a system and measuring the time required to repair them.
- It is normally specified in terms of a repair rate, μ: MTTR = 1/μ.

21 Mean Time Between Failures
- The average time between failures in a system includes the mean time to fail and the mean time to repair: MTBF = MTTF + MTTR.
- (Figure: a timeline showing the relationship between MTBF and MTTF, with MTBF spanning one MTTF interval followed by one MTTR interval.)

22 Other Performance Measures
- There are several other performance measures related to reliability: maintainability and availability.
- Availability is the probability that the system will be "up" during its scheduled working period.
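The slide gives no formula, but steady-state availability is commonly expressed using the quantities defined on the previous slides:

```latex
A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} = \frac{\mathrm{MTTF}}{\mathrm{MTBF}}
```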

23 Safety
- Safety, S(t), is the probability that the system does not fail in the interval [0, t] in such a manner as to cause unacceptable damage or other catastrophic effects.
- Safety is a measure of the fail-safe capability of the system: a system can be unreliable, yet safe, when it is biased towards failing in a safe way.

24 Reliability Calculations

25 Example
- If we design a system made up of 4000 components, each with a failure rate of 2 × 10⁻⁷ per hour, what is the MTTF of the whole system?
  λ = (2 × 10⁻⁷)(4000) = 8 × 10⁻⁴ failures/hour
  MTTF = 1/λ = 1250 hours
- What is the reliability of the system when t = MTTF?
  R(t) = e^(-λt) = exp(-t/MTTF), so R(MTTF) = e⁻¹
- RESULT: a system with an MTTF of 100 hours has only a 36.8% chance of running 100 hours without a failure.
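The same arithmetic written out as a check; this minimal sketch just reproduces the numbers on this slide:

```python
import math

n_components = 4000
component_failure_rate = 2e-7            # failures per hour, per component
system_failure_rate = n_components * component_failure_rate   # 8e-4 failures/hour

mttf = 1 / system_failure_rate                                 # 1250 hours
reliability_at_mttf = math.exp(-system_failure_rate * mttf)    # e**-1, about 0.368

print(f"system failure rate = {system_failure_rate:.1e} failures/hour")
print(f"MTTF = {mttf:.0f} hours")
print(f"R(MTTF) = {reliability_at_mttf:.3f}")
```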

26 Reliable Architectures
- How do we make trade-offs in the system design to increase reliability?
- First, produce reliable systems by selecting reliable components and testing, testing, testing.
- Second, trade off cost vs. reliability and speed vs. reliability by adding extra components to the system.
- Adding extra components implies that designers need to understand the impact of extra circuits on system reliability: series/parallel systems and specific fault-tolerant architectures.

27 Series System
- A series system is one in which each subsystem must function if the system as a whole is to function.
- If subsystem failures are independent and R_i is the reliability of subsystem i, then the reliability of N subsystems connected in series is:
  R = R_1 · R_2 · ... · R_N = ∏_{i=1}^{N} R_i

28 Series Analysis
- Given a series system, what is its MTTF?
- From the results on the prior slide, with each subsystem following the exponential failure law:
  R(t) = ∏_{i=1}^{N} e^(-λ_i t) = e^(-(λ_1 + λ_2 + ... + λ_N) t)
- So the MTTF is:
  MTTF = 1 / (λ_1 + λ_2 + ... + λ_N)
- Thus the MTTF of the series system is much smaller than the MTTF of its components.
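A small sketch that puts slides 27 and 28 together; the three failure rates are made-up illustrative values, not numbers from the lecture:

```python
import math

# Hypothetical per-subsystem failure rates (failures per hour)
failure_rates = [1e-4, 2e-4, 5e-5]
t = 1000  # mission time in hours

# Series system: every subsystem must work, so reliabilities multiply
series_reliability = math.prod(math.exp(-lam * t) for lam in failure_rates)

# Equivalently, the failure rates add, and the series MTTF is their reciprocal
series_mttf = 1 / sum(failure_rates)

print(f"R_series({t} h) = {series_reliability:.4f}")
print(f"MTTF_series = {series_mttf:.0f} hours")
```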

29 Parallel Systems
- A parallel system is one in which the correct operation of just one subsystem is sufficient for the system to function.
- If the failures are independent and R_i is the reliability of subsystem i, then:
  R = 1 - ∏_{i=1}^{N} (1 - R_i)

30 Parallel Analysis
- For N identical modules, each with constant failure rate λ, the MTTF of a parallel system is given by:
  MTTF = (1/λ)(1 + 1/2 + 1/3 + ... + 1/N)
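A sketch of the parallel formulas from slides 29 and 30; the module count, failure rate, and mission time are illustrative assumptions:

```python
import math

n_modules = 3
lam = 1e-4        # per-module failure rate, failures per hour
t = 1000          # mission time in hours

# The parallel system fails only if every module fails
module_reliability = math.exp(-lam * t)
parallel_reliability = 1 - (1 - module_reliability) ** n_modules

# MTTF for N identical modules with constant failure rate lambda
parallel_mttf = (1 / lam) * sum(1 / i for i in range(1, n_modules + 1))

print(f"R_parallel({t} h) = {parallel_reliability:.6f}")
print(f"MTTF_parallel = {parallel_mttf:.0f} hours")
```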

31 Specific Architectures
- You are given a design specification that includes a required reliability of 0.9999, yet the best you can do for a given circuit is a documented reliability of 0.999. What do you do?
- The trade-off is to increase the cost of the system by embedding your design in a fault-tolerant architecture (and perhaps reduce speed as well).
- Possible architectures: Triple Modular Redundancy, Dynamic Redundancy, Hybrid Redundancy, Sift-Out Modular Redundancy, Self-Purging Redundancy, others...

32 Triple Modular Redundancy
- TMR is an example of static redundancy (masking redundancy): extra components are used so that the effect of a faulty component is instantaneously masked.
- TMR uses three identical modules and a voting element (majority voter).
- Originally suggested by John von Neumann in 1956.
- (Diagram: three modules M feeding a voter V.)

33 TMR Reliability
- ASSUME: the voting circuit does not fail. Is this a good assumption?
- If the reliability of the individual modules is R_M, then the reliability of the TMR scheme, R_TMR, is the probability that all three modules are functioning plus the probability that exactly two modules are functioning:
  R_TMR = R_M³ + 3·R_M²·(1 - R_M) = 3·R_M² - 2·R_M³
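A quick check of the TMR formula; the module reliability below is an arbitrary illustrative value:

```python
def tmr_reliability(r_module: float) -> float:
    """Reliability of a triple modular redundant system with a perfect voter."""
    # All three working, plus exactly two working (three ways to pick the failed one)
    return r_module**3 + 3 * r_module**2 * (1 - r_module)

r_m = 0.95  # assumed module reliability
print(f"{tmr_reliability(r_m):.5f}")          # 0.99275
print(f"{3 * r_m**2 - 2 * r_m**3:.5f}")       # same value via the simplified form
```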

34 Reliability Improvement
- A more useful parameter for evaluating reliable systems is the reliability improvement factor, RIF.
- It is the ratio of the probability of failure of the non-redundant system to that of the redundant system for a fixed mission time T.
- Given R_N and R_R as the reliabilities of the non-redundant and the redundant systems at time T:
  RIF = (1 - R_N) / (1 - R_R)

35 Simple Calculation
- Given a system with a reliability of R_N = 0.82 at T = 100 hours, what is the RIF of a TMR system?
- First, find the TMR reliability: (0.82)³ + 3(0.82)²(1 - 0.82) = 0.914
- Second, find the RIF: (1 - 0.82)/(1 - 0.914) = 2.1
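The same two steps in code, as a self-contained sketch that reuses the simplified TMR expression from slide 33:

```python
r_simplex = 0.82                                   # non-redundant reliability at T = 100 h
r_tmr = 3 * r_simplex**2 - 2 * r_simplex**3        # TMR reliability (slide 33 formula)

rif = (1 - r_simplex) / (1 - r_tmr)                # reliability improvement factor
print(f"R_TMR = {r_tmr:.3f}, RIF = {rif:.1f}")     # R_TMR = 0.914, RIF = 2.1
```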

36 NMR
- It is possible to use more than three copies of a system in a redundant architecture.
- M-of-N structure: N identical modules, of which M are required for the system to function properly. This system may tolerate N - M failures.
- The reliability of such a system is:
  R_{M-of-N} = Σ_{i=0}^{N-M} [N! / ((N-i)!·i!)] · R^(N-i) · (1-R)^i
- A 5MR system requires that 3 of the 5 modules remain fault free:
  R_{3-of-5} = R⁵ + 5R⁴(1-R) + 10R³(1-R)²
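A sketch of the M-of-N formula, checked against the 3-of-5 expansion on the slide; the module reliability value is an arbitrary assumption:

```python
from math import comb

def m_of_n_reliability(m: int, n: int, r: float) -> float:
    """Reliability of an M-of-N system: at most n - m modules may fail."""
    return sum(comb(n, i) * r**(n - i) * (1 - r)**i for i in range(n - m + 1))

r = 0.9  # assumed reliability of a single module
print(f"{m_of_n_reliability(3, 5, r):.5f}")                    # general formula
print(f"{r**5 + 5*r**4*(1 - r) + 10*r**3*(1 - r)**2:.5f}")     # slide's 3-of-5 expansion
```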

37 Possible Quiz
- Remember that even though each quiz is worth only 5 to 10 points, the points do add up to a significant contribution to your overall grade.
- If there is a quiz, it might cover these issues: Name one of the three types of fault control. What is MTBF? What is TMR?

