Fault-Tolerant Systems Design Part 1.

1. Introduction: Basic Definitions
Fault tolerance is the ability of a system to continue performing its tasks correctly after the occurrence of a fault.

1. Introduction: Basic Definitions
Reliability of a system is the function R(t), defined as the probability that the system performs correctly throughout the time interval [t0, t], given that the system was performing correctly at time t0.
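Written as a conditional probability, the reliability definition reads as follows. This is only a restatement in standard notation; the wording of the two events is introduced here for illustration.

```latex
R(t) = P\bigl(\text{system performs correctly throughout } [t_0, t] \;\big|\; \text{system was performing correctly at } t_0\bigr)
```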

1. Introduction: Basic Definitions
Availability is the function A(t), defined as the probability that the system is operating correctly and is available to perform its tasks at the instant of time t.

2. Design of FT Systems
Fault-tolerant systems can be designed using two basic approaches:
- Fault masking
- Fault detection, location, and recovery (via reconfiguration of the system to remove the defective part)

2. Design of FT Systems
If the chosen approach is reconfiguration, then:
- Before reconfiguration: fault detection techniques and fault location techniques
- After reconfiguration: fault recovery techniques

2. Design of FT Systems
Fault recovery techniques:
- Rollback recovery
- Forward recovery

2. Design of FT Systems
All techniques for designing FT systems are based on some type and degree of redundancy.

2. Design of FT Systems
Redundancy is implemented through the use of HW, SW, information, or time beyond what is necessary for normal system operation.
- It has a non-negligible impact on the system in terms of performance, size, weight, power consumption, and reliability.

2. Design of FT Systems
Redundancy at the HW level:
- Active
- Passive
- Hybrid

2. Design of FT Systems
HW Redundancy: 1. Passive
- Based on the concept of fault masking: it hides the occurrence of faults and prevents them from resulting in errors (developed around the concept of majority voting).
- Does not provide fault detection; it simply masks the faults.

2. Design of FT Systems
HW Redundancy: 1. Passive
[Figure: Basic concept of Triple Modular Redundancy (TMR). Module 1, Module 2, and Module 3 feed a voter, which produces the output.]
[Figure: The use of triplicated voters in a TMR configuration. Labels: Proc 1, Proc 2, Proc 3; three voters; Mem 1, Mem 2, Mem 3.]
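The voter in a TMR arrangement can be sketched as a bitwise 2-of-3 majority function. This is a minimal illustration, not the voter design from the slides; the function name and the sample module outputs are hypothetical.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit takes the value
    that at least two of the three module outputs agree on."""
    return (a & b) | (a & c) | (b & c)


# Hypothetical module outputs: module 2 delivers one flipped bit.
good = 0b10110010
faulty = good ^ 0b00001000

# The single faulty module is outvoted, so the fault is masked.
output = majority_vote(good, faulty, good)
print(bin(output))
```

Note that the voter masks the fault without reporting it, which matches the passive approach: the output is correct, but nothing signals that module 2 misbehaved.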

2. Design of FT Systems
HW Redundancy: 1. Passive
[Figure: Example of SW voting. Labels: Voter Task, Task A, Task B; Proc 1, Proc 2, Proc 3.]
HW voting or SW voting? The choice depends on:
1. The availability of a processor to perform the voting
2. The speed at which voting must be performed
3. The criticality of space, power, and weight limitations
4. The number of different voters that must be provided
5. The flexibility required of the voter with respect to future changes in the system

2. Design of FT Systems
HW Redundancy: 1. Passive
In practical applications of voting, the three results in a TMR system may not agree completely, even in a fault-free environment. For example, A/D converters in sensors may produce values that disagree in the least-significant bits. This disagreement can propagate into larger discrepancies after computation, which can significantly affect the voting process.

2. Design of FT Systems
HW Redundancy: 1. Passive
Solution: the mid-value select technique. The TMR system selects the value that lies between the other two.
[Figure: Mid-value selection, showing the corrupted signal, the uncorrupted signals, and the selected signal.]
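Mid-value selection amounts to taking the median of the three results. The sketch below assumes numeric sensor readings; the function name and the sample values are hypothetical.

```python
def mid_value_select(x: float, y: float, z: float) -> float:
    """Return the value that lies between the other two (the median)."""
    return sorted((x, y, z))[1]


# Hypothetical readings: two channels differ only slightly
# (A/D noise), while the third is corrupted.
readings = (10.02, 9.98, 57.3)
selected = mid_value_select(*readings)
print(selected)
```

Unlike exact-match voting, this does not require any two results to be identical, so small disagreements in the least-significant bits do not disturb the selection.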

2. Design of FT Systems
HW Redundancy: 2. Active (or Dynamic)
- Attempts to achieve fault tolerance by means of fault detection, fault location, reconfiguration, and recovery. The property of fault masking is not obtained: there is no attempt to prevent faults from producing errors within the system.
- More suitable for applications where temporary erroneous results are acceptable, as long as the system reconfigures and regains its operational status within a satisfactory length of time.

2. Design of FT Systems
HW Redundancy: 2. Active (or Dynamic)
- Duplication of functional units
- Standby blocks
  - Hot standby sparing
  - Cold standby sparing

2. Design of FT Systems
HW Redundancy: 2. Active (or Dynamic)
[Figure: A software implementation of duplication with comparison. Processor A and Processor B each hold their own result in private memory and also write it to shared memory; each processor runs a comparison task on the two results and raises an error signal.]
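Duplication with comparison can be sketched as two units computing the same function, followed by a comparison that raises an error flag on disagreement. This is a simplified single-process illustration, not the shared-memory scheme of the figure; all names and the simulated fault are hypothetical.

```python
from typing import Callable, Tuple


def duplicate_and_compare(
    unit_a: Callable[[int], int],
    unit_b: Callable[[int], int],
    data: int,
) -> Tuple[int, bool]:
    """Run both units on the same data and compare their results.
    Returns unit A's result and an error flag (True on mismatch)."""
    result_a = unit_a(data)
    result_b = unit_b(data)
    return result_a, result_a != result_b


def healthy_unit(x: int) -> int:
    return x * x + 1


def faulty_unit(x: int) -> int:
    # Simulated fault: bit 2 of the result is inverted.
    return (x * x + 1) ^ 0b100


print(duplicate_and_compare(healthy_unit, healthy_unit, 7))
print(duplicate_and_compare(healthy_unit, faulty_unit, 7))
```

The comparison detects that something is wrong but cannot tell which unit failed, which is why this technique is paired with fault location and recovery.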

2. Design of FT Systems
HW Redundancy: 3. Hybrid
- Combines the attractive features of both the active and the passive approaches.

2. Design of FT Systems
SW Redundancy:
- Consistency checks
- Capability checks
- N-self-checking programming
- N-version programming
- Recovery blocks

2. Design of FT Systems
SW Redundancy: Consistency Checks
Consistency checks use prior knowledge about the characteristics of a piece of information to verify that the information is correct. In most applications it is well known that a given quantity or operand cannot take values beyond predefined limits.

2. Design of FT Systems
SW Redundancy: Consistency Checks
Examples:
- A processing system can sample and store many sensor readings in a typical control application; each reading can be checked against its predefined limits.
- The amount of cash requested by a patron at a bank's teller machine should never exceed the maximum withdrawal allowed.

2. Design of FT Systems
SW Redundancy: Consistency Checks
Examples:
- The address generated by a computer should never lie outside the address range of the available memory.
- In a computer, each instruction code can be checked to verify that it is not one of the illegal codes.
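The consistency-check examples can be sketched as simple predicates over known limits. The limits, the opcode set, and the function names below are hypothetical values chosen for illustration.

```python
# Hypothetical limits; a real system would take these from its specification.
MAX_WITHDRAWAL = 500
MEMORY_SIZE = 0x10000
LEGAL_OPCODES = frozenset({0x01, 0x02, 0x10})


def withdrawal_is_consistent(amount: int) -> bool:
    """A requested amount must be positive and within the allowed maximum."""
    return 0 < amount <= MAX_WITHDRAWAL


def address_is_consistent(address: int) -> bool:
    """A generated address must fall inside the available memory."""
    return 0 <= address < MEMORY_SIZE


def opcode_is_consistent(opcode: int) -> bool:
    """An instruction code must not be one of the illegal codes."""
    return opcode in LEGAL_OPCODES


print(withdrawal_is_consistent(200), withdrawal_is_consistent(900))
print(address_is_consistent(0x1234), address_is_consistent(0x20000))
print(opcode_is_consistent(0x10), opcode_is_consistent(0xFF))
```

Such checks are cheap because they need no redundant computation, only knowledge of what values are plausible.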

2. Design of FT Systems
SW Redundancy: Capability Checks
Capability checks are performed to verify that a system possesses the capability expected.

2. Design of FT Systems
SW Redundancy: Capability Checks
Examples:
- Check whether a computer has its complete memory available.
- Check whether the processors in a multiprocessor system are working properly.
- Periodically, a processor can execute specific instructions on specific data and compare the results to known results stored in a ROM; this checks the ALU and the memory.
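The periodic self-test in the last example can be sketched as a table of operations with known answers. Here a tuple stands in for the ROM; the table contents and the function name are hypothetical.

```python
import operator

# Stand-in for the ROM: (operation, operand a, operand b, expected result).
KNOWN_ANSWERS = (
    (operator.add, 0x55, 0xAA, 0xFF),
    (operator.and_, 0xF0, 0x3C, 0x30),
    (operator.xor, 0xFF, 0x0F, 0xF0),
)


def alu_self_test(known_answers=KNOWN_ANSWERS) -> bool:
    """Execute each stored operation and compare against the stored result.
    Returns True only if every result matches."""
    return all(op(a, b) == expected for op, a, b, expected in known_answers)


print(alu_self_test())
```

A scheduler would call such a routine periodically and trigger recovery when it returns False.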

2. Design of FT Systems
SW Redundancy: N-Self-Checking Programming
[Figure: The N-Self-Checking Programming approach to software fault tolerance. The program inputs feed Program Version 1 through Program Version n; each version's result goes through an acceptance test, and selection logic produces the program outputs.]
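The structure in the figure can be sketched as follows: every version runs on the same input, each result is checked by an acceptance test, and the selection logic picks a result that passed. This is a minimal sequential sketch under those assumptions; the integer square root task, the deliberately buggy version, and all names are hypothetical.

```python
import math
from typing import Callable, Optional, Sequence


def version_good(n: int) -> int:
    return math.isqrt(n)


def version_buggy(n: int) -> int:
    # Deliberately wrong version: always one too large.
    return math.isqrt(n) + 1


def acceptance_test(n: int, r: int) -> bool:
    """Accept r only if it is the integer square root of n."""
    return r >= 0 and r * r <= n < (r + 1) * (r + 1)


def run_n_self_checking(
    versions: Sequence[Callable[[int], int]], n: int
) -> Optional[int]:
    """Run all versions, keep results that pass the acceptance test,
    and select the first accepted one (None if every version fails)."""
    accepted = []
    for version in versions:
        result = version(n)
        if acceptance_test(n, result):
            accepted.append(result)
    return accepted[0] if accepted else None


print(run_n_self_checking([version_buggy, version_good], 17))
```

The quality of the scheme depends on the acceptance test: a test that accepts wrong results lets faulty versions through.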

2. Design of FT Systems
Information Redundancy:
- Parity, Berger, and m-of-n codes
- Arithmetic codes
- Hamming codes
- Checksum codes
- CRC (Cyclic Redundancy Check) codes
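As an illustration of the simplest code in this list, the sketch below appends an even-parity bit to a data word and checks it afterwards. The function names and the sample word are hypothetical.

```python
def even_parity_bit(word: int) -> int:
    """Parity bit that makes the total number of 1s even."""
    return bin(word).count("1") % 2


def encode(word: int) -> int:
    """Append the parity bit as the least-significant bit."""
    return (word << 1) | even_parity_bit(word)


def check(codeword: int) -> bool:
    """A valid codeword contains an even number of 1s."""
    return bin(codeword).count("1") % 2 == 0


codeword = encode(0b1011)
print(check(codeword))           # intact codeword
print(check(codeword ^ 0b100))   # one bit flipped
print(check(codeword ^ 0b110))   # two bits flipped
```

A single parity bit detects any odd number of bit errors but misses an even number, which is one reason the stronger codes in the list exist.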

2. Design of FT Systems
Time Redundancy:
- Transient fault detection
- Permanent fault detection
- Re-computation for error correction

2. Design of FT Systems
Time Redundancy: Transient Fault Detection
The fundamental concept is to perform the same computation two or more times and compare the results to determine whether a discrepancy exists.
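Repeating a computation and comparing the results can be sketched as below. The class that simulates a transient fault on one particular call is a hypothetical test fixture, as are all names.

```python
from typing import Callable, List, Tuple


def detect_transient(
    compute: Callable[[int], int], data: int, repetitions: int = 2
) -> Tuple[List[int], bool]:
    """Run the same computation several times on the same hardware
    and report whether any two results disagree."""
    results = [compute(data) for _ in range(repetitions)]
    discrepancy = any(r != results[0] for r in results[1:])
    return results, discrepancy


class GlitchyIncrementer:
    """Hypothetical unit: returns x + 1, except that a simulated
    transient flips bit 1 of the result on one chosen call."""

    def __init__(self, glitch_on_call: int) -> None:
        self.calls = 0
        self.glitch_on_call = glitch_on_call

    def __call__(self, x: int) -> int:
        self.calls += 1
        result = x + 1
        if self.calls == self.glitch_on_call:
            result ^= 0b10
        return result


print(detect_transient(GlitchyIncrementer(glitch_on_call=2), 5))
print(detect_transient(GlitchyIncrementer(glitch_on_call=0), 5))
```

This works for transient faults because the disturbance is gone by the time the computation is repeated; a permanent fault would corrupt both runs identically and go unnoticed.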

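2. Design of FT Systems
Time Redundancy: Permanent Fault Detection
[Figure: Permanent fault detection with time redundancy. At time t0 the computation is performed on the data and the result is stored. At time t1 the data is encoded, the computation is repeated, and the result is decoded and stored. The two stored results are compared; a mismatch signals an error.]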
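One way to realize the encode and decode steps is to shift the operands before the second computation and shift the result back afterwards, so that a permanently faulty bit position touches different data bits in the two runs. The sketch below assumes that encoding, a bitwise AND unit, and a stuck-at-0 output bit; these choices and all names are hypothetical.

```python
def healthy_and(a: int, b: int) -> int:
    return a & b


def faulty_and(a: int, b: int, stuck_bit: int = 2) -> int:
    """AND unit whose output bit `stuck_bit` is permanently stuck at 0."""
    return (a & b) & ~(1 << stuck_bit)


def detect_permanent(unit, a: int, b: int, shift: int = 1):
    """Compute once on the plain operands (time t0) and once on the
    encoded, i.e. left-shifted, operands (time t1), then decode the
    second result by shifting it back and compare the two."""
    first = unit(a, b)
    second = unit(a << shift, b << shift) >> shift
    return first, second, first != second


print(detect_permanent(healthy_and, 0b0110, 0b0111))
print(detect_permanent(faulty_and, 0b0110, 0b0111))
```

With the faulty unit the two runs disagree, because the stuck bit corrupts data bit 2 in the first run and data bit 1 in the second.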

2. Design of FT Systems
Time Redundancy: Re-computation for Error Correction
The time redundancy approach can also provide error correction if the computations are repeated three or more times. Consider the example of a logical AND operation. Suppose the operation is performed three times: first, without shifting the operands; second, with a one-bit logical shift of the operands; and third, with a two-bit logical shift of the operands.
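The three-run AND example can be sketched as follows. The slide text stops after describing the three runs, so the final step here, shifting each result back and taking a bitwise majority, is an assumption about how the correction is completed. The stuck-at-0 fault model and all names are hypothetical.

```python
def faulty_and(a: int, b: int, stuck_bit: int = 2) -> int:
    """AND unit whose output bit `stuck_bit` is permanently stuck at 0."""
    return (a & b) & ~(1 << stuck_bit)


def majority(x: int, y: int, z: int) -> int:
    """Bitwise 2-of-3 majority."""
    return (x & y) | (x & z) | (y & z)


def and_with_correction(unit, a: int, b: int) -> int:
    """Perform the AND with operand shifts of 0, 1, and 2 bits,
    shift each result back, and vote bit by bit."""
    results = [unit(a << s, b << s) >> s for s in (0, 1, 2)]
    return majority(*results)


a, b = 0b0110, 0b0111
print(bin(faulty_and(a, b)))                       # single faulty run
print(bin(and_with_correction(faulty_and, a, b)))  # corrected result
```

In this fault model the stuck bit lines up with a different data bit in each run, so every data bit is wrong in at most one of the three results and the majority recovers the correct value.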