Fault-Tolerant Systems Design Part 1.

Slide 1: Fault-Tolerant Systems Design Part 1
vargas@computer.org

Slide 2: 1. Introduction: Basic Definitions
Fault tolerance is the ability of a system to continue performing its tasks correctly after the occurrence of a fault.

Slide 3: 1. Introduction: Basic Definitions
Reliability of a system is the function R(t), defined as the probability that the system performs correctly throughout the time interval [t0, t], given that it was performing correctly at time t0.

Slide 4: 1. Introduction: Basic Definitions
Availability is the function A(t), defined as the probability that the system is operating correctly and is available to perform its tasks at the instant of time t.

Slide 5: 2. Design of FT Systems
Fault-tolerant systems can be designed by means of two basic approaches:
- Fault masking
- Detection, location, and recovery of the system (via reconfiguration) to remove the defective part

Slide 6: 2. Design of FT Systems
If the option is reconfiguration, then:
- Before reconfiguration: fault detection techniques and fault location techniques
- After reconfiguration: fault recovery techniques

Slide 7: 2. Design of FT Systems
Fault recovery techniques:
- Rollback recovery
- Forward recovery

Slide 8: 2. Design of FT Systems
All techniques to design FT systems are based on some type and degree of redundancy.

Slide 9: 2. Design of FT Systems
Redundancy is implemented through the use of HW, SW, information, or time beyond what is necessary for normal system operation.
- It has a non-negligible impact on the system in terms of performance, size, weight, power consumption, and reliability.

Slide 10: 2. Design of FT Systems
Redundancy at the HW level:
- Active
- Passive
- Hybrid

Slide 11: 2. Design of FT Systems, HW Redundancy: 1. Passive
- Based on the concept of fault masking: it hides the occurrence of faults and prevents them from resulting in errors (developed around the concept of majority voting).
- Does not provide fault detection; it simply masks faults.

Slide 12: 2. Design of FT Systems, HW Redundancy: 1. Passive
[Figure: Basic concept of Triple Modular Redundancy (TMR). Module 1, Module 2, and Module 3 feed a voter, which produces the output.]
[Figure: The use of triplicated voters in a TMR configuration, with Proc 1, Proc 2, and Proc 3, three voters, and Mem 1, Mem 2, and Mem 3.]
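The basic TMR idea in the first figure can be illustrated with a short Python sketch. This is an illustration under assumptions, not the presentation's own design: the function names `majority_vote` and `tmr` are hypothetical, module outputs are assumed to be hashable values, and the voter itself is assumed to be fault-free.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of the modules.

    Raises ValueError when no value reaches a strict majority.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among module outputs")
    return value


def tmr(modules, data):
    """Run the same input through three redundant modules and vote."""
    return majority_vote([module(data) for module in modules])


# Hypothetical modules: two healthy copies and one faulty copy.
def healthy(x):
    return x * x


def faulty(x):
    return x * x + 1


masked_result = tmr([healthy, faulty, healthy], 7)
```

With a single faulty module the wrong value is outvoted, so the fault is masked rather than detected, which matches the passive approach described on slide 11.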

Slide 13: 2. Design of FT Systems, HW Redundancy: 1. Passive
[Figure: Example of SW voting, showing a voter task and Tasks A and B running on Proc 1, Proc 2, and Proc 3.]
HW voting versus SW voting? The choice depends on:
1. The availability of a processor to perform the voting
2. The speed at which voting must be performed
3. The criticality of space, power, and weight limitations
4. The number of different voters that must be provided
5. The flexibility required of the voter with respect to future changes in the system

Slide 14: 2. Design of FT Systems, HW Redundancy: 1. Passive
In practical applications of voting, the three results in a TMR system may not completely agree, even in a fault-free environment. For example, A/D converters in sensors may produce quantities that disagree in the least-significant bits. This disagreement can propagate into larger discrepancies after computation, which can significantly affect the voting process.

Slide 15: 2. Design of FT Systems, HW Redundancy: 1. Passive
Solution: the mid-value select technique. A TMR system selects the value that lies in the middle of the others.
[Figure: Mid-value selection, showing the corrupted signal, the uncorrupted signals, and the selected signal.]
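A minimal Python sketch of mid-value selection follows. The function name `mid_value_select` and the sample readings are assumptions chosen for illustration; the sketch only shows why taking the middle value tolerates small disagreements that would defeat an exact-match vote.

```python
def mid_value_select(a, b, c):
    """Return the middle of three values (the median of the three)."""
    return sorted((a, b, c))[1]


# Hypothetical sensor readings: two differ only slightly,
# the third is corrupted.
readings = (10.01, 9.99, 57.3)
selected = mid_value_select(*readings)

# No two readings are exactly equal, so an exact-match majority
# vote would find no agreement here.
exact_agreement = len(set(readings)) < len(readings)
```

The selected value is always one of the two uncorrupted readings as long as at most one of the three is corrupted.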

Slide 16: 2. Design of FT Systems, HW Redundancy: 2. Active (or Dynamic)
- Attempts to achieve fault tolerance by means of fault detection, fault location, reconfiguration, and recovery. The property of fault masking is not obtained: there is no attempt to prevent faults from producing errors within the system.
- More suitable for applications where temporary erroneous results are acceptable, as long as the system reconfigures and regains its operational status in a satisfactory length of time.

Slide 17: 2. Design of FT Systems, HW Redundancy: 2. Active (or Dynamic)
- Duplication of functional units
- Standby blocks
  - Hot standby sparing
  - Cold standby sparing

Slide 18: 2. Design of FT Systems, HW Redundancy: 2. Active (or Dynamic)
[Figure: A software implementation of duplication with comparison. Processor A and Processor B each hold their own result in private memory, both results are held in shared memory, and a comparison task on each processor compares Processor A's result with Processor B's result and drives the error signals.]
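The arrangement in the figure can be sketched in Python as follows. This is a simplified, single-threaded illustration under assumptions: the function name `duplication_with_comparison` is hypothetical, the two "processors" are plain callables, and a dictionary stands in for the shared memory.

```python
def duplication_with_comparison(processor_a, processor_b, data):
    """Run two copies of a computation and flag any disagreement.

    Returns (result, error), where error is True if the copies disagree.
    """
    # Each processor computes and keeps its result in private memory.
    private_a = processor_a(data)
    private_b = processor_b(data)

    # Both results are also written to shared memory.
    shared = {"A": private_a, "B": private_b}

    # Each comparison task checks its private result against the
    # other processor's copy in shared memory.
    error_a = private_a != shared["B"]
    error_b = private_b != shared["A"]

    # Processor A's result is returned by convention (an assumption);
    # a mismatch alone does not reveal which copy is faulty.
    return private_a, (error_a or error_b)


def increment(x):
    return x + 1


def faulty_increment(x):
    return x + 2


agree = duplication_with_comparison(increment, increment, 4)
disagree = duplication_with_comparison(increment, faulty_increment, 4)
```

Unlike TMR, this scheme detects the fault but does not mask it, which is consistent with the active approach described on slide 16.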

Slide 19: 2. Design of FT Systems, HW Redundancy: 3. Hybrid
- Combines the attractive features of both the active and the passive approaches.

Slide 20: 2. Design of FT Systems
SW redundancy:
- Consistency checks
- Capability checks
- N-self-checking programming
- N-version programming
- Recovery blocks

Slide 21: 2. Design of FT Systems, SW Redundancy: Consistency Checks
Consistency checks use prior knowledge about the characteristics of a piece of information to check that the information is correct. In most applications it is well known that a given quantity or operand cannot assume values beyond predefined limits.

Slide 22: 2. Design of FT Systems, SW Redundancy: Consistency Checks
Examples:
- A processing system can sample and store many sensor readings in a typical control application; each reading can be checked against the limits it is known to respect.
- The amount of cash requested by a patron at a bank's teller machine should never exceed the maximum withdrawal allowed.

Slide 23: 2. Design of FT Systems, SW Redundancy: Consistency Checks
Examples:
- The address generated by a computer should never lie outside the address range of the available memory.
- In a computer, each instruction code can be checked to verify that it is not one of the illegal codes.
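The examples on slides 22 and 23 can be written as simple range and membership checks. The Python sketch below is illustrative only: the limits, the memory size, the opcode set, and all function names are assumptions made up for the example.

```python
MAX_WITHDRAWAL = 500                       # assumed limit, in currency units
MEMORY_SIZE = 64 * 1024                    # assumed memory size, in bytes
LEGAL_OPCODES = {0x00, 0x01, 0x02, 0x10}   # assumed instruction set


def reading_is_consistent(reading, low, high):
    """A sensor reading must stay within its known physical limits."""
    return low <= reading <= high


def withdrawal_is_consistent(amount):
    """A withdrawal must be positive and within the allowed maximum."""
    return 0 < amount <= MAX_WITHDRAWAL


def address_is_consistent(address):
    """A generated address must fall inside the available memory."""
    return 0 <= address < MEMORY_SIZE


def opcode_is_consistent(opcode):
    """An instruction code must be one of the legal codes."""
    return opcode in LEGAL_OPCODES
```

Each check needs only knowledge of the valid range, not a second copy of the computation, which is why consistency checks are a comparatively cheap form of SW redundancy.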

Slide 24: 2. Design of FT Systems, SW Redundancy: Capability Checks
Capability checks are performed to verify that a system possesses the capability expected.

Slide 25: 2. Design of FT Systems, SW Redundancy: Capability Checks
Examples:
- Check whether a computer has the complete memory available.
- Check whether the processors in a multiprocessor system are working properly.
- Periodically, a processor can execute specific instructions on specific data and compare the results to known results stored in a ROM, which checks the ALU and the memory.
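A periodic self-test of the kind described in the last example might look like the Python sketch below. It is an illustration under assumptions: the test vectors stand in for the known results stored in ROM, a `bytearray` stands in for the memory under test, and the function names are hypothetical. The memory test overwrites the buffer it is given.

```python
import operator

# Assumed "ROM" contents: (operation, operand, operand, known result).
ALU_TEST_VECTORS = [
    (operator.add, 2, 3, 5),
    (operator.sub, 7, 4, 3),
    (operator.and_, 0b1100, 0b1010, 0b1000),
    (operator.xor, 0b1100, 0b1010, 0b0110),
]


def alu_self_test(vectors=ALU_TEST_VECTORS):
    """Execute known operations and compare against the stored results."""
    return all(op(a, b) == expected for op, a, b, expected in vectors)


def memory_self_test(memory):
    """Write alternating bit patterns to every cell and read them back.

    Destructive: the previous contents of `memory` are lost.
    """
    for pattern in (0x55, 0xAA):
        for i in range(len(memory)):
            memory[i] = pattern
        if any(cell != pattern for cell in memory):
            return False
    return True


alu_ok = alu_self_test()
memory_ok = memory_self_test(bytearray(256))
```

In a real system the check would run on spare cycles or on a timer; how often to run it is a design decision that the slides do not specify.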

Slide 26: 2. Design of FT Systems, SW Redundancy: N-Self-Checking Programming
[Figure: The N-self-checking programming approach to software fault tolerance. The program inputs feed Program Version 1 through Program Version n; each version has acceptance tests, and selection logic produces the program outputs.]
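The structure in the figure can be sketched in Python as follows. This is a simplified, sequential illustration under assumptions: the function names are hypothetical, the square-root versions are made up for the example, and the selection logic simply picks the first output that passed its acceptance test. Other selection policies are possible.

```python
import math


def n_self_checking(versions, inputs):
    """Run every version on the same inputs and select an accepted output.

    `versions` is a list of (program, acceptance_test) pairs.
    """
    accepted = []
    for program, acceptance_test in versions:
        try:
            output = program(inputs)
        except Exception:
            continue  # a crashing version is treated as failing its test
        if acceptance_test(inputs, output):
            accepted.append(output)
    if not accepted:
        raise RuntimeError("no version passed its acceptance test")
    return accepted[0]  # assumed selection policy: first accepted output


def sqrt_acceptance(x, y):
    """Acceptance test: y must be a non-negative square root of x."""
    return y >= 0 and math.isclose(y * y, x, rel_tol=1e-9)


def sqrt_buggy(x):
    return x / 2  # deliberately wrong version


result = n_self_checking(
    [(sqrt_buggy, sqrt_acceptance), (math.sqrt, sqrt_acceptance)],
    9.0,
)
```

The buggy version's output fails its acceptance test, so the selection logic falls through to the output of the second version.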

Slide 27: 2. Design of FT Systems
Information redundancy:
- Parity, Berger, and m-of-n codes
- Arithmetic codes
- Hamming codes
- Checksum codes
- CRC (cyclic redundancy check) codes

Slide 28: 2. Design of FT Systems
Time redundancy:
- Transient fault detection
- Permanent fault detection
- Re-computation for error correction

Slide 29: 2. Design of FT Systems, Time Redundancy: Transient Fault Detection
The fundamental concept is to perform the same computation two or more times and compare the results to determine if a discrepancy exists.
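A minimal Python sketch of this repeat-and-compare idea follows. The function names and the simulated glitch are assumptions for illustration; in particular, `make_glitchy_sum` is a hypothetical stand-in for a computation hit by a transient fault on one specific call.

```python
def run_with_time_redundancy(compute, data, repetitions=2):
    """Repeat the same computation and report whether the results differ.

    Returns (results, discrepancy).
    """
    if repetitions < 2:
        raise ValueError("at least two repetitions are needed to compare")
    results = [compute(data) for _ in range(repetitions)]
    discrepancy = any(r != results[0] for r in results[1:])
    return results, discrepancy


def make_glitchy_sum(glitch_on_call):
    """Build a sum function that flips one result bit on a chosen call."""
    calls = {"n": 0}

    def glitchy_sum(values):
        calls["n"] += 1
        total = sum(values)
        if calls["n"] == glitch_on_call:
            total ^= 0x4  # simulated transient fault
        return total

    return glitchy_sum


clean_results, clean_flag = run_with_time_redundancy(sum, [1, 2, 3])
bad_results, bad_flag = run_with_time_redundancy(make_glitchy_sum(1), [1, 2, 3])
```

Two repetitions are enough to detect a transient fault but not to tell which result is correct. A fault that affects every repetition identically goes unnoticed by this scheme.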

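Slide 30: 2. Design of FT Systems, Time Redundancy: Permanent Fault Detection
[Figure: Permanent fault detection with time redundancy. At time t0 the data goes through the computation and the result is stored. At time t1 the data is encoded, goes through the computation, and the result is decoded and stored. The two stored results are compared, and a mismatch signals an error.]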

Slide 31: 2. Design of FT Systems, Time Redundancy: Re-computation for Error Correction
The time redundancy approach can also provide error correction if the computations are repeated three or more times. Consider the example of a logical AND operation. Suppose the operation is performed three times: first, without shifting the operands; second, with a one-bit logical shift of the operands; and third, with a two-bit logical shift of the operands.
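The slide stops after describing the three computations. The Python sketch below shows one way the scheme can be completed, under stated assumptions: the shifts are left shifts, each result is shifted back before comparison, the three results are combined by a bit-by-bit majority vote, and the ALU is two bits wider than the operands so that shifted bits are not lost. The fault model (a single ALU bit slice stuck at 0 or 1) and all names are likewise assumptions.

```python
WIDTH = 8                 # assumed operand width, in bits
ALU_WIDTH = WIDTH + 2     # assumed: two extra slices hold the shifted bits


def faulty_and(a, b, stuck_bit=None, stuck_value=0):
    """AND on an ALU where one bit slice may be stuck at 0 or 1."""
    result = (a & b) & ((1 << ALU_WIDTH) - 1)
    if stuck_bit is not None:
        if stuck_value:
            result |= 1 << stuck_bit
        else:
            result &= ~(1 << stuck_bit)
    return result


def bitwise_majority(x, y, z):
    """Each output bit takes the value held by at least two inputs."""
    return (x & y) | (x & z) | (y & z)


def and_with_recomputation(a, b, stuck_bit=None, stuck_value=0):
    """Compute a AND b three times with shifts 0, 1, 2 and vote bitwise."""
    results = []
    for shift in (0, 1, 2):
        raw = faulty_and(a << shift, b << shift, stuck_bit, stuck_value)
        # Shift the result back so all three line up bit for bit.
        results.append((raw >> shift) & ((1 << WIDTH) - 1))
    return bitwise_majority(*results)


corrected = and_with_recomputation(0b11110000, 0b10101010,
                                   stuck_bit=5, stuck_value=0)
```

Because each run uses a different shift, a faulty bit slice corrupts a different bit position of the realigned result in each run. Any given result bit is therefore wrong in at most one of the three runs, and the majority vote restores it.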
