Fault-Tolerant Systems Design Part 1.

Presentation on theme: "Fault-Tolerant Systems Design Part 1."— Presentation transcript:

2 Fault-Tolerant Systems Design, Part 1 (vargas@computer.org)

3 1. Introduction: Basic Definitions. Fault tolerance is the ability of a system to continue performing its tasks correctly after a fault occurs.

4 1. Introduction: Basic Definitions. The reliability of a system is the function R(t), defined as the probability that the system performs correctly throughout the time interval [t0, t], given that it was performing correctly at time t0.

5 1. Introduction: Basic Definitions. The availability of a system is the function A(t), defined as the probability that the system is operating correctly and is available to perform its tasks at the instant of time t.

6 2. Design of FT Systems. Fault-tolerant systems can be designed using two basic approaches: (1) fault masking; (2) detection, location, and recovery, via reconfiguration of the system to remove the defective part.

7 2. Design of FT Systems. If the option is reconfiguration, then: before reconfiguring, fault detection techniques and fault location techniques are applied; after reconfiguring, fault recovery techniques are applied.

8 2. Design of FT Systems. Fault recovery techniques: rollback recovery; forward recovery.

9 2. Design of FT Systems. All techniques for designing FT systems are based on some type and degree of redundancy.

10 2. Design of FT Systems. Redundancy is implemented through the use of HW, SW, information, or time beyond what is necessary for normal system operation. It therefore has a non-negligible impact on the system in terms of performance, size, weight, power consumption, and reliability.

11 2. Design of FT Systems. Redundancy at the HW level: Active; Passive; Hybrid.

12 2. Design of FT Systems. HW Redundancy: 1. Passive. Based on the concept of fault masking: it hides the occurrence of faults and prevents them from resulting in errors (developed around the concept of majority voting). Passive techniques do not provide fault detection; they simply mask faults.

13 2. Design of FT Systems. HW Redundancy: 1. Passive. [Figure: Basic concept of Triple Modular Redundancy (TMR). Module 1, Module 2, and Module 3 feed a voter, which produces the output.] [Figure: The use of triplicated voters in a TMR configuration, with Proc 1 to Proc 3, three voters, and Mem 1 to Mem 3.]
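The majority voting at the heart of TMR can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not a hardware design: the names majority_vote, tmr, good, and faulty are hypothetical, and each replicated module is modeled as a plain function.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value reported by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value


def tmr(modules, x):
    """Run every replicated module on the same input and vote on the outputs."""
    return majority_vote([module(x) for module in modules])


def good(x):
    return x * x


def faulty(x):
    return x * x + 1  # hypothetical module with a permanent fault


# One faulty module out of three is outvoted by the two healthy ones.
result = tmr([good, faulty, good], 4)
```

Note that the voter returns the correct value without reporting which module disagreed, which is the sense in which passive redundancy masks faults rather than detecting them.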

14 2. Design of FT Systems. HW Redundancy: 1. Passive. [Figure: Example of SW voting, with Task A, Task B, and a voter task running on Proc 1, Proc 2, and Proc 3.]
HW voting or SW voting? The choice depends on:
1. The availability of a processor to perform the voting
2. The speed at which voting must be performed
3. The criticality of space, power, and weight limitations
4. The number of different voters that must be provided
5. The flexibility required of the voter with respect to future changes in the system

15 2. Design of FT Systems. HW Redundancy: 1. Passive. In practical applications of voting, the three results in a TMR system may not agree completely, even in a fault-free environment. For example, A/D converters in sensors may produce quantities that disagree in the least-significant bits. This disagreement can propagate into larger discrepancies after computation, which can significantly affect the voting process.

16 2. Design of FT Systems. HW Redundancy: 1. Passive. Solution: the mid-value select technique. A TMR system selects the value that lies between the other two. [Figure: a corrupted signal, the uncorrupted signals, and the selected signal.]
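Mid-value selection amounts to taking the median of the three results. The sketch below assumes numeric results; the function name mid_value_select and the sample readings are made up for illustration.

```python
def mid_value_select(a, b, c):
    """Return the value that lies between the other two."""
    return sorted((a, b, c))[1]


# Hypothetical A/D readings: two healthy channels that differ in the
# least-significant bit, and one corrupted channel stuck near full scale.
readings = (1000, 1001, 4095)
selected = mid_value_select(*readings)
```

An exact-match majority vote would find no two equal values here, whereas the mid-value selector still returns one of the healthy readings.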

17 2. Design of FT Systems. HW Redundancy: 2. Active (or Dynamic). Attempts to achieve fault tolerance by means of fault detection, fault location, reconfiguration, and recovery. The fault-masking property is not obtained: there is no attempt to prevent faults from producing errors within the system. This approach is more suitable for applications where temporary erroneous results are acceptable, as long as the system reconfigures and regains its operational status within a satisfactory length of time.

18 2. Design of FT Systems. HW Redundancy: 2. Active (or Dynamic). Duplication of functional units. Standby blocks: hot standby sparing; cold standby sparing.

19 2. Design of FT Systems. HW Redundancy: 2. Active (or Dynamic). [Figure: A software implementation of duplication with comparison. Processor A and Processor B each keep their own result in private memory and place a copy in shared memory; each processor runs a comparison task on Processor A's result and Processor B's result and produces an error signal.]
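The comparison step of duplication with comparison can be sketched as follows. This is a simplified single-process model, assuming each processor is represented by a function; the names duplicate_and_compare, unit_a, and unit_b are hypothetical and the shared-memory exchange from the figure is not modeled.

```python
def duplicate_and_compare(unit_a, unit_b, x):
    """Run the same computation on two units and flag any disagreement."""
    result_a = unit_a(x)
    result_b = unit_b(x)
    error = result_a != result_b
    return result_a, error


def unit_a(x):
    return 2 * x + 1


def unit_b(x):
    return 2 * x + 1


def broken_unit_b(x):
    return 2 * x  # hypothetical faulty duplicate


_, error_ok = duplicate_and_compare(unit_a, unit_b, 10)
_, error_bad = duplicate_and_compare(unit_a, broken_unit_b, 10)
```

A mismatch shows that a fault exists but not which unit is faulty, so locating the fault needs additional means.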

20 2. Design of FT Systems. HW Redundancy: 3. Hybrid. Combines the attractive features of both the active and the passive approaches.

21 2. Design of FT Systems. SW Redundancy: consistency checks; capability checks; N-self-checking programming; N-version programming; recovery blocks.

22 2. Design of FT Systems. Software Redundancy: Consistency Check. Uses prior knowledge about the characteristics of a piece of information to check that the information is correct. In most applications it is well known that a given quantity or operand cannot take values beyond predefined limits.

23 2. Design of FT Systems. Software Redundancy: Consistency Check. Examples: In a typical control application, a processing system can sample and store many sensor readings, each of which can be checked against the range of values the sensor is known to produce. The amount of cash requested by a patron at a bank's teller machine should never exceed the maximum withdrawal allowed.

24 2. Design of FT Systems. Software Redundancy: Consistency Check. Examples: The address generated by a computer should never lie outside the address range of the available memory. In a computer, each instruction code can be checked to verify that it is not one of the illegal codes.
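The consistency-check examples above can be written as simple range and membership tests. The limits and codes used here (MAX_WITHDRAWAL, MEMORY_SIZE, ILLEGAL_OPCODES) are invented values for illustration only.

```python
MAX_WITHDRAWAL = 500          # assumed maximum withdrawal allowed
MEMORY_SIZE = 0x10000         # assumed size of the available memory
ILLEGAL_OPCODES = {0xFE, 0xFF}  # assumed set of illegal instruction codes


def withdrawal_is_consistent(amount):
    """A requested amount must be positive and within the allowed maximum."""
    return 0 < amount <= MAX_WITHDRAWAL


def address_is_consistent(address):
    """A generated address must fall inside the available memory."""
    return 0 <= address < MEMORY_SIZE


def opcode_is_consistent(opcode):
    """An instruction code must not be one of the illegal codes."""
    return opcode not in ILLEGAL_OPCODES
```

A failed check signals that something upstream produced an impossible value, without saying what went wrong.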

25 2. Design of FT Systems. Software Redundancy: Capability Check. Capability checks are performed to verify that a system possesses the capabilities expected of it.

26 2. Design of FT Systems. Software Redundancy: Capability Check. Examples: Check whether a computer has its complete memory available. Check whether the processors in a multiprocessor system are working properly. Periodically, a processor can execute specific instructions on specific data and compare the results with known results stored in a ROM; this checks the ALU and the memory.
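The periodic self-test described in the last example might look like the sketch below. It assumes an 8-bit ALU modeled by a function execute(op, a, b); the table KNOWN_RESULTS stands in for the known results stored in ROM, and all names and test vectors are hypothetical.

```python
import operator

MASK = 0xFF  # assumed 8-bit data path

# (operation, operand a, operand b, expected result), standing in for a ROM table
KNOWN_RESULTS = [
    (operator.add, 0x55, 0xAA, 0xFF),
    (operator.and_, 0x55, 0xAA, 0x00),
    (operator.or_, 0x55, 0xAA, 0xFF),
    (operator.xor, 0xFF, 0x0F, 0xF0),
]


def alu_self_test(execute):
    """Return True if every test vector produces its known result."""
    return all(
        execute(op, a, b) == expected for op, a, b, expected in KNOWN_RESULTS
    )


def healthy_alu(op, a, b):
    return op(a, b) & MASK


def faulty_alu(op, a, b):
    return op(a, b) & MASK & ~0x10  # output bit 4 stuck at 0
```

Calling alu_self_test(healthy_alu) passes, while alu_self_test(faulty_alu) fails on the first vector because bit 4 of the expected 0xFF is cleared.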

27 2. Design of FT Systems. Software Redundancy: N-Self Checking Programming

28 2. Design of FT Systems. Software Redundancy: N-Self Checking Programming

29 2. Design of FT Systems. Software Redundancy: N-Version Programming

30 2. Design of FT Systems. Software Redundancy: N-Version Programming

31 2. Design of FT Systems. Software Redundancy: N-Version Programming

32 2. Design of FT Systems. Software Redundancy: Recovery Blocks

33 2. Design of FT Systems. Software Redundancy: Recovery Blocks

34 2. Design of FT Systems. Information Redundancy: parity and Berger codes; arithmetic codes; Hamming codes; checksum codes; CRC (cyclic redundancy check) codes.
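As a small illustration of information redundancy, the simplest of the listed codes, a single parity bit, can be sketched as follows. The slide does not specify a parity convention; even parity and the function names below are assumptions made for this example.

```python
def count_ones(value):
    """Number of 1 bits in a non-negative integer."""
    return bin(value).count("1")


def encode_even_parity(word):
    """Append one parity bit so that the codeword has an even number of 1s."""
    return (word << 1) | (count_ones(word) % 2)


def parity_is_valid(codeword):
    """A codeword is valid if it contains an even number of 1 bits."""
    return count_ones(codeword) % 2 == 0


codeword = encode_even_parity(0b1011001)
corrupted = codeword ^ 0b1000  # flip a single bit
```

A single flipped bit makes the parity check fail; two flipped bits would cancel out and go unnoticed, which is why stronger codes such as Hamming or CRC codes are listed as well.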

35 2. Design of FT Systems. Time Redundancy: transient fault detection; permanent fault detection; re-computation for error correction.

36 2. Design of FT Systems. Time Redundancy: Transient Fault Detection. The fundamental concept is to perform the same computation two or more times and compare the results to determine whether a discrepancy exists.
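The repeat-and-compare idea can be sketched as below. The transient fault is simulated by a wrapper that corrupts exactly one call; the names compute_with_time_redundancy and make_transient_faulty are hypothetical.

```python
def compute_with_time_redundancy(compute, x, repeats=2):
    """Repeat the computation and flag an error if any two runs disagree."""
    results = [compute(x) for _ in range(repeats)]
    error = any(r != results[0] for r in results[1:])
    return results[0], error


def make_transient_faulty(compute, faulty_call):
    """Wrap `compute` so that only call number `faulty_call` is corrupted."""
    calls = {"n": 0}

    def wrapped(x):
        calls["n"] += 1
        result = compute(x)
        if calls["n"] == faulty_call:
            return result ^ 1  # simulated transient bit flip
        return result

    return wrapped


def square(x):
    return x * x


_, clean_error = compute_with_time_redundancy(square, 7)
_, transient_error = compute_with_time_redundancy(
    make_transient_faulty(square, faulty_call=1), 7
)
```

Because the disturbance affects only one of the two runs, the results differ and the error is flagged. A permanent fault would corrupt both runs identically and would not be caught this way.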

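37 2. Design of FT Systems. Time Redundancy: Permanent Fault Detection. [Figure: At time t0 the data goes through the computation and the result is stored. At time t1 the data is encoded, goes through the same computation, is decoded, and the result is stored. The two stored results are compared to produce the error signal.]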
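The encode, compute, decode, and compare flow in the figure can be sketched for a bitwise AND unit. The slide does not name the encoding; a one-bit left shift is assumed here, along with 8-bit operands, a 9-bit unit, and a stuck-at-0 fault model. All names are hypothetical.

```python
ALU_WIDTH = 9                      # assumed: 8-bit operands plus one bit for the shift
ALU_MASK = (1 << ALU_WIDTH) - 1


def healthy_and(a, b):
    return (a & b) & ALU_MASK


def make_stuck_at_zero_and(stuck_bit):
    """AND unit whose output bit `stuck_bit` is permanently stuck at 0."""
    def and_unit(a, b):
        return (a & b) & ALU_MASK & ~(1 << stuck_bit)
    return and_unit


def detect_permanent_fault(a, b, and_unit):
    """Compute once on plain operands and once on encoded operands, then compare."""
    first = and_unit(a, b)                    # time t0: plain data
    second = and_unit(a << 1, b << 1) >> 1    # time t1: encode, compute, decode
    return first, first != second


a, b = 0b11001010, 0b10101100
_, healthy_error = detect_permanent_fault(a, b, healthy_and)
_, faulty_error = detect_permanent_fault(a, b, make_stuck_at_zero_and(3))
```

The shift makes the stuck output bit affect a different bit of the result in each run, so the two results disagree. The fault is flagged only for operands that actually exercise the stuck bit.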

38 2. Design of FT Systems. Time Redundancy: Re-computation for Error Correction. The time redundancy approach can also provide error correction if the computation is repeated three or more times. Consider the example of a logical AND operation. Suppose the operation is performed three times: first, without shifting the operands; second, with a one-bit logical shift of the operands; and third, with a two-bit logical shift of the operands.
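The three-pass AND example can be sketched as follows. The slide stops before saying how the three results are combined; a bit-by-bit majority vote after shifting the results back is assumed here, along with 8-bit operands, a 10-bit unit, and a single stuck output bit. All names are hypothetical.

```python
WIDTH = 8                          # assumed operand width
ALU_WIDTH = WIDTH + 2              # room for the one-bit and two-bit shifts
ALU_MASK = (1 << ALU_WIDTH) - 1
WORD_MASK = (1 << WIDTH) - 1


def make_faulty_and(stuck_bit, stuck_value):
    """AND unit whose output bit `stuck_bit` is stuck at `stuck_value`."""
    def and_unit(a, b):
        result = (a & b) & ALU_MASK
        if stuck_value:
            return result | (1 << stuck_bit)
        return result & ~(1 << stuck_bit)
    return and_unit


def and_with_recomputation(a, b, and_unit):
    """Run AND with 0-, 1-, and 2-bit shifted operands, then vote bit by bit."""
    results = []
    for shift in (0, 1, 2):
        raw = and_unit(a << shift, b << shift)
        results.append((raw >> shift) & WORD_MASK)  # undo the shift
    r0, r1, r2 = results
    return (r0 & r1) | (r0 & r2) | (r1 & r2)        # bitwise majority


a, b = 0b11001010, 0b10101100
corrected = and_with_recomputation(a, b, make_faulty_and(3, 0))
```

Because each pass lines the stuck bit up with a different bit of the result, any given result bit is wrong in at most one of the three passes, and the bitwise majority recovers the correct value.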

