Fault-Tolerant Systems Design, Part 1
vargas@computer.org

1. Introduction: Basic Definitions
Fault tolerance is the ability of a system to continue performing its tasks correctly after a fault occurs.

The reliability of a system is the function R(t), defined as the probability that the system performs correctly throughout the time interval [t0, t], given that it was performing correctly at time t0.

The availability of a system is the function A(t), defined as the probability that the system is operating correctly and is available to perform its tasks at the instant of time t.

2. Design of FT Systems
Fault-tolerant systems can be designed by means of two basic approaches: fault masking; or detection, localization, and recovery (via reconfiguration of the system to remove the defective part).

If the option is reconfiguration, then fault detection techniques and fault location techniques are applied before the reconfiguration, and fault recovery techniques after it.

Fault recovery techniques: rollback recovery and forward recovery.

All techniques for designing FT systems are based on some type and degree of redundancy.

Redundancy is implemented through the use of hardware, software, information, or time beyond what is necessary for normal system operation. It has a non-negligible impact on the system in terms of performance, size, weight, power consumption, and reliability.

Redundancy at the HW level: active, passive, or hybrid.

HW Redundancy: 1. Passive
Passive redundancy is based on the concept of fault masking: it hides the occurrence of faults and prevents them from resulting in errors. It is developed around the concept of majority voting. Passive approaches do not provide fault detection; they simply mask the faults.

[Figure: Basic concept of Triple Modular Redundancy (TMR): Module 1, Module 2, and Module 3 feed a voter that produces the output.]
[Figure: The use of triplicated voters in a TMR configuration: Proc 1, Proc 2, and Proc 3; three voters; Mem 1, Mem 2, and Mem 3.]
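
The voter in a TMR arrangement can be sketched in a few lines of Python. This is only an illustration of majority voting, not the circuit from the slides: the function name `majority_vote` and the choice to raise an exception when all three modules disagree are assumptions made for the example.

```python
from collections import Counter


def majority_vote(a, b, c):
    """Return the value produced by at least two of the three modules.

    Raising ValueError when all three outputs differ is an assumption of
    this sketch; the slides do not say what the voter does in that case.
    """
    value, count = Counter([a, b, c]).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority among module outputs")
    return value


# Module 2 is faulty; the voter masks its output.
print(majority_vote(42, 7, 42))
```

Note that the voter never reports which module failed, which matches the point that passive redundancy masks faults without detecting them.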

[Figure: Example of SW voting: a voter task, Task A, and Task B on Proc 1, Proc 2, and Proc 3.]
HW voting or SW voting? The choice depends on:
1. The availability of a processor to perform the voting
2. The speed at which voting must be performed
3. The criticality of space, power, and weight limitations
4. The number of different voters that must be provided
5. The flexibility required of the voter with respect to future changes in the system

In practical applications of voting, the three results in a TMR system may not agree completely, even in a fault-free environment. For example, A/D converters in sensors may produce quantities that disagree in the least-significant bits. This disagreement can propagate into larger discrepancies after computation, which can significantly affect the voting process.

Solution: the mid-value select technique. A TMR system selects the value that lies in the middle of the others.
[Figure: corrupted signal, uncorrupted signals, and selected signal.]
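
A minimal sketch of mid-value selection for three redundant readings follows. The function name and the sample sensor values are hypothetical; the point is only that the selected value is the median, so small least-significant-bit disagreements and a single wildly wrong reading are both tolerated.

```python
def mid_value_select(x, y, z):
    """Return the reading that lies between the other two (the median)."""
    return sorted([x, y, z])[1]


# Two healthy sensors disagree slightly; the third is corrupted.
print(mid_value_select(10.01, 10.02, 57.3))
```

Unlike exact-match majority voting, this selection does not require any two readings to be identical.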

HW Redundancy: 2. Active (or Dynamic)
Active redundancy attempts to achieve fault tolerance by means of fault detection, fault location, reconfiguration, and recovery. The property of fault masking is not obtained: there is no attempt to prevent faults from producing errors within the system. It is more suitable for applications where temporary erroneous results are acceptable, as long as the system reconfigures and regains its operational status within a satisfactory length of time.

Duplication of functional units. Standby blocks: hot standby sparing and cold standby sparing.

[Figure: A software implementation of duplication with comparison: Processor A and Processor B each run a comparison task and hold their own result in private memory; shared memory holds both processors' results; the comparison tasks produce error signals A and B.]
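
The duplication-with-comparison arrangement can be imitated in software roughly as follows. This is a sketch under stated assumptions: the two "processors" are plain Python callables, the shared memory is a dictionary, and all names (`run_duplicated`, `healthy`, `faulty`) are hypothetical.

```python
def run_duplicated(compute_a, compute_b, data):
    """Run the same task on two units and cross-compare their results.

    Each unit keeps a private copy of its result and publishes it to a
    shared area; each comparison task checks its private result against
    the other unit's published result.
    """
    shared_memory = {}

    private_a = compute_a(data)
    shared_memory["A"] = private_a

    private_b = compute_b(data)
    shared_memory["B"] = private_b

    error_a = private_a != shared_memory["B"]  # comparison task on A
    error_b = private_b != shared_memory["A"]  # comparison task on B
    return error_a, error_b


def healthy(values):
    return sum(values)


def faulty(values):
    return sum(values) + 1  # simulated fault: result is off by one


print(run_duplicated(healthy, healthy, [1, 2, 3]))
print(run_duplicated(healthy, faulty, [1, 2, 3]))
```

A disagreement shows that a fault exists but not which unit is faulty; locating it needs additional checks.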

HW Redundancy: 3. Hybrid
Hybrid redundancy combines the attractive features of both the active and the passive approaches.

SW Redundancy: consistency checks, capability checks, N-self-checking programming, N-version programming, and recovery blocks.

Software Redundancy: Consistency Check
A consistency check uses prior knowledge about the characteristics of a given piece of information to check that the information is correct. For most applications, it is well known that a given quantity cannot assume values beyond predefined limits.

Examples:
In a typical control application, a processing system samples and stores many sensor readings; each reading can be checked against the limits known to be valid for that sensor.
The amount of cash requested by a patron at a bank's teller machine should never exceed the maximum withdrawal allowed.

The address generated by a computer should never lie outside the address range of the available memory.
In a computer, each instruction code can be checked to verify that it is not one of the illegal codes.
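
The consistency-check examples translate directly into range and membership tests. In the sketch below every constant (the withdrawal limit, the memory size, the set of legal opcodes) is a hypothetical value chosen for illustration, and the function names are not taken from any real system.

```python
MAX_WITHDRAWAL = 500                       # hypothetical limit
MEMORY_SIZE = 0x10000                      # hypothetical 64 KiB address space
LEGAL_OPCODES = {0x00, 0x01, 0x02, 0x10}   # hypothetical instruction set


def withdrawal_is_consistent(amount):
    """A request must be positive and within the allowed maximum."""
    return 0 < amount <= MAX_WITHDRAWAL


def address_is_consistent(address):
    """A generated address must fall inside the available memory."""
    return 0 <= address < MEMORY_SIZE


def opcode_is_consistent(opcode):
    """An instruction code must not be one of the illegal codes."""
    return opcode in LEGAL_OPCODES


print(withdrawal_is_consistent(200), withdrawal_is_consistent(5000))
print(address_is_consistent(0xFFFF), address_is_consistent(0x10000))
print(opcode_is_consistent(0x10), opcode_is_consistent(0xFF))
```

Checks of this kind catch only values that are impossible; a wrong value that stays inside the legal range passes unnoticed.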

Software Redundancy: Capability Check
Capability checks are performed to verify that a system possesses the capability expected of it.

Examples:
Check whether a computer has its complete memory available.
Check whether the processors in a multiprocessor system are working properly.
Periodically, a processor can execute specific instructions on specific data and compare the results with known results stored in a ROM; this checks both the ALU and the memory.
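
The periodic self-test in the last example can be sketched as a table of operations with known-good answers. The table below stands in for the results stored in ROM; its contents, the `execute` callback that models the ALU, and the stuck-bit fault model are all assumptions of this sketch.

```python
import operator

# Hypothetical stand-in for test vectors and known results held in ROM:
# (operation, operand 1, operand 2, expected result).
ROM_TEST_VECTORS = [
    (operator.add, 2, 3, 5),
    (operator.sub, 9, 4, 5),
    (operator.and_, 0b1100, 0b1010, 0b1000),
    (operator.xor, 0b1100, 0b1010, 0b0110),
]


def alu_self_test(execute):
    """Run every test vector through the ALU model and compare results."""
    return all(
        execute(op, a, b) == expected
        for op, a, b, expected in ROM_TEST_VECTORS
    )


def healthy_alu(op, a, b):
    return op(a, b)


def faulty_alu(op, a, b):
    return op(a, b) | 1  # simulated fault: output bit 0 stuck at 1


print(alu_self_test(healthy_alu))
print(alu_self_test(faulty_alu))
```

The test vectors need to exercise the bits of interest: here the stuck bit goes unnoticed by the add and subtract vectors, whose correct results already have bit 0 set, and is caught by the AND vector.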

Software Redundancy: N-Self Checking Programming

Software Redundancy: N-Version Programming

Software Redundancy: Recovery Blocks

Information Redundancy: parity and Berger codes, arithmetic codes, Hamming codes, checksum codes, and CRC (cyclic redundancy check) codes.
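
As a small illustration of information redundancy, here is a sketch of the simplest code in the list, a single even-parity bit. The layout (parity bit appended as the least-significant bit) and the function names are choices made for this example, not something the slides specify.

```python
def count_ones(word):
    """Number of 1 bits in a non-negative integer."""
    return bin(word).count("1")


def encode_even_parity(word):
    """Append one parity bit so the codeword has an even number of 1s."""
    parity = count_ones(word) % 2
    return (word << 1) | parity


def parity_ok(codeword):
    """A valid codeword always contains an even number of 1 bits."""
    return count_ones(codeword) % 2 == 0


codeword = encode_even_parity(0b1011)
print(bin(codeword), parity_ok(codeword))

corrupted = codeword ^ 0b01000  # flip a single bit
print(bin(corrupted), parity_ok(corrupted))
```

A single parity bit detects any odd number of flipped bits but cannot locate or correct them; codes such as Hamming codes add more check bits to do that.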

Time Redundancy: transient fault detection, permanent fault detection, and re-computation for error correction.

Time Redundancy: Transient Fault Detection
The fundamental concept is to perform the same computation two or more times and compare the results to determine whether a discrepancy exists.
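
Repeating a computation and comparing the outcomes might look like the sketch below. The helper `make_glitchy_sum`, which corrupts the result on chosen calls, is a hypothetical way to simulate a transient fault; it is not part of the technique itself.

```python
def detect_transient(compute, data, repetitions=2):
    """Run the same computation several times and flag any disagreement.

    Returns (first_result, discrepancy_found).
    """
    results = [compute(data) for _ in range(repetitions)]
    discrepancy = any(r != results[0] for r in results[1:])
    return results[0], discrepancy


def make_glitchy_sum(faulty_calls):
    """Build a sum function whose result is corrupted on the given calls."""
    state = {"calls": 0}

    def glitchy_sum(values):
        state["calls"] += 1
        total = sum(values)
        if state["calls"] in faulty_calls:
            return total ^ 0b100  # simulated transient: one bit flipped
        return total

    return glitchy_sum


print(detect_transient(sum, [1, 2, 3]))
print(detect_transient(make_glitchy_sum({2}), [1, 2, 3]))
```

A fault that affects every repetition in the same way produces identical wrong results and passes this comparison, so plain repetition is suited to transient faults only.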

Time Redundancy: Permanent Fault Detection
[Figure: At time t0, the data goes through the computation and the result is stored. At time t1, the data is encoded, goes through the same computation, and the result is decoded and stored. The two stored results are compared; a mismatch signals an error.]
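
The encode, compute, decode, and compare steps named on this slide can be tried out with a one-bit shift as the encoding. Everything specific here is an assumption for illustration: the shift encoding, the fault model (output bit 2 of an AND unit stuck at 0), and the use of unbounded Python integers in place of a fixed-width datapath.

```python
import operator

STUCK_BIT = 2  # hypothetical fault location


def faulty_and(a, b):
    """AND unit whose output bit STUCK_BIT is permanently stuck at 0."""
    return (a & b) & ~(1 << STUCK_BIT)


def detect_permanent(unit, a, b, shift=1):
    """Compute once on plain data and once on shifted (encoded) data.

    The second result is shifted back (decoded) before the comparison.
    Returns True when the two stored results disagree.
    """
    first = unit(a, b)                               # time t0
    second = unit(a << shift, b << shift) >> shift   # time t1
    return first != second


print(detect_permanent(operator.and_, 0b0110, 0b0111))
print(detect_permanent(faulty_and, 0b0110, 0b0111))
```

Because the encoded operands pass through the faulty bit position at a different place, the permanent fault corrupts a different result bit in each pass and the mismatch exposes it.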

Time Redundancy: Re-computation for Error Correction
The time redundancy approach can also provide error correction if the computations are repeated three or more times. Consider the example of a logical AND operation. Suppose the operation is performed three times: first, without shifting the operands; second, with a one-bit logical shift of the operands; and third, with a two-bit logical shift of the operands.
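
A sketch of the three-pass AND example follows. It assumes left shifts, a single output bit of the AND unit stuck at 0 (bit 2, chosen arbitrarily), unbounded Python integers rather than a fixed-width register, and a bitwise majority vote over the three decoded results; the slides name the shifts but not these other details.

```python
STUCK_BIT = 2  # hypothetical fault location


def faulty_and(a, b):
    """AND unit whose output bit STUCK_BIT is permanently stuck at 0."""
    return (a & b) & ~(1 << STUCK_BIT)


def bitwise_majority(x, y, z):
    """Each output bit takes the value held by at least two inputs."""
    return (x & y) | (x & z) | (y & z)


def corrected_and(unit, a, b):
    """Compute a AND b three times with 0-, 1-, and 2-bit shifts, then vote.

    Each result is shifted back before voting, so the stuck bit lands on
    a different bit of the result in each pass.
    """
    results = [unit(a << s, b << s) >> s for s in (0, 1, 2)]
    return bitwise_majority(*results)


a, b = 0b0110, 0b0111
print(bin(faulty_and(a, b)))                 # single pass: wrong result
print(bin(corrected_and(faulty_and, a, b)))  # three passes: corrected
```

Since any given result bit is affected in at most one of the three passes, the vote recovers the correct value for a single faulty bit position.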
