Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.1 FAULT TOLERANT SYSTEMS Part 1 - Introduction Chapter 1 - Preliminaries

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.2 Prerequisites  Basic courses in  Digital Design  Hardware Organization/Computer Architecture  Probability

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.3 References  Textbook  I. Koren and C. M. Krishna, Fault Tolerant Systems, Morgan-Kaufman, 2007  Further Readings  M.L. Shooman, Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design, Wiley, 2002  D.P. Siewiorek and R.S. Swarz, Reliable Computer Systems: Design and Evaluation, A.K. Peters  D.K. Pradhan (ed.), Fault-Tolerant Computer System Design, Prentice-Hall  B.W. Johnson, Design and Analysis of Fault-Tolerant Digital Systems, Addison-Wesley, 1989.

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.4 Course Outline  Introduction - Basic concepts  Dependability measures  Redundancy techniques  Hardware fault tolerance  Error detecting and correcting codes  Redundant disks (RAID)  Fault-tolerant networks  Software fault tolerance  Checkpointing  Case studies of fault-tolerant systems  Defect tolerance in VLSI circuits  Fault detection in cryptographic systems  Simulation techniques

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.5 Fault Tolerance - Basic definition  Fault-tolerant systems - ideally, systems capable of executing their tasks correctly regardless of either hardware failures or software errors  In practice - we can never guarantee the flawless execution of tasks under all circumstances  We therefore limit ourselves to the types of failures and errors that are more likely to occur

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.6 Need For Fault Tolerance  1. Critical applications  2. Harsh environments  3. Highly complex systems

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.7 Need For Fault Tolerance - Critical Applications  Aircraft, nuclear reactors, chemical plants, medical equipment  A malfunction of a computer in such applications can lead to catastrophe  Their probability of failure must be extremely low, possibly one in a billion per hour of operation  Also included - financial applications

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.8 Need for Fault Tolerance - Harsh Environments  A computing system operating in a harsh environment where it is subjected to  electromagnetic disturbances  particle hits and the like  Such a large number of faults means that the system will not produce useful results unless some fault tolerance is incorporated

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.9 Need For Fault Tolerance - Highly Complex Systems  Complex systems consist of millions of devices  Every physical device has a certain probability of failure  A very large number of devices implies that the likelihood of failures is high  Without fault tolerance, the system will experience faults at a frequency that renders it useless

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.10 Hardware Faults Classification  Three types of faults:  Transient Faults - disappear after a relatively short time  Example - a memory cell whose contents are changed spuriously due to some electromagnetic interference  Overwriting the memory cell with the right content will make the fault go away  Permanent Faults - never go away, component has to be repaired or replaced  Intermittent Faults - cycle between active and benign states  Example - a loose connection  Another classification: Benign vs malicious

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.11 Faults Vs. Errors  Fault - either a hardware defect or a software/programming mistake  Error - a manifestation of a fault  Example: An adder circuit with one output line stuck at 1  This is a fault, but not (yet) an error  Becomes an error when the adder is used and the result on that line should be 0
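The stuck-at example can be made concrete with a small simulation. The following Python sketch is not from the slides; the 4-bit adder width and the choice of stuck bit are arbitrary. It shows that the stuck-at-1 fault is always present, but produces an error only for inputs whose correct sum has a 0 in the affected bit position.

    # Illustrative sketch: a 4-bit adder whose bit-2 output line is stuck at 1.
    # The fault exists permanently, but it becomes an *error* only when the
    # correct sum has a 0 in that bit position.

    STUCK_BIT = 2  # hypothetical position of the stuck-at-1 line

    def faulty_add(a, b):
        correct = (a + b) & 0xF                 # fault-free 4-bit sum
        observed = correct | (1 << STUCK_BIT)   # output line forced to 1
        return correct, observed

    for a, b in [(1, 3), (2, 3), (5, 6)]:
        correct, observed = faulty_add(a, b)
        status = "error" if correct != observed else "fault latent (no error)"
        print(f"{a}+{b}: correct={correct:04b} observed={observed:04b} -> {status}")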

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.12 Propagation of Faults and Errors  Faults and errors can spread throughout the system  If a chip shorts out power to ground, it may cause nearby chips to fail as well  Errors can spread - output of one unit is frequently used as input by other units  Adder example: erroneous result of faulty adder can be fed into further calculations, thus propagating the error

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.13 Redundancy  Redundancy is at the heart of fault tolerance  Redundancy - incorporation of extra components in the design of a system so that its function is not impaired in the event of a failure  We will study four forms of redundancy:  1. Hardware redundancy  2. Software redundancy  3. Information redundancy  4. Time redundancy

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.14 Hardware Redundancy  Extra hardware is added to override the effects of a failed component  Static Hardware Redundancy - for immediate masking of a failure  Example: Use three processors and vote on the result. The wrong output of a single faulty processor is masked  Dynamic Hardware Redundancy - Spare components are activated upon the failure of a currently active component  Hybrid Hardware Redundancy - A combination of static and dynamic redundancy techniques
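As a concrete illustration of the static technique, here is a minimal Python sketch of a triple modular redundancy (TMR) majority voter. The function names and the replica values are illustrative, not taken from the slides.

    # Illustrative sketch of static redundancy (TMR): three replicas compute
    # the same result and a majority voter masks a single faulty output.

    from collections import Counter

    def majority_vote(results):
        """Return the value produced by a majority of the replicas."""
        value, count = Counter(results).most_common(1)[0]
        if count < 2:
            raise RuntimeError("no majority: more than one replica failed")
        return value

    # Two correct replicas and one faulty one: the fault is masked.
    replicas = [42, 42, 17]
    print(majority_vote(replicas))   # -> 42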

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.15 Software Redundancy – Example  Multiple teams of programmers  Write different versions of software for the same function  The hope is that such diversity will ensure that not all the copies will fail on the same set of input data
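A minimal Python sketch of this idea (often called N-version programming); the two versions and the agreement rule below are purely illustrative.

    # Illustrative sketch of N-version programming: independently written
    # versions of the same function are run on the same input and compared,
    # in the hope that they do not all fail on the same data.

    def sum_v1(n):            # version 1: iterative
        total = 0
        for i in range(1, n + 1):
            total += i
        return total

    def sum_v2(n):            # version 2: closed-form
        return n * (n + 1) // 2

    def n_version_sum(n):
        results = [sum_v1(n), sum_v2(n)]
        if len(set(results)) != 1:
            raise RuntimeError("versions disagree: trigger recovery")
        return results[0]

    print(n_version_sum(100))  # -> 5050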

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.16 Information Redundancy  Add check bits to original data bits so that an error in the data bits can be detected and even corrected  Error detecting and correcting codes have been developed and are being used  Information redundancy often requires hardware redundancy to process the additional check bits
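The simplest instance of such a code is a single parity bit. The Python sketch below is illustrative only (real systems use stronger codes, e.g. Hamming codes, covered later in the course); it shows how one check bit allows detection, though not correction or location, of any single-bit error.

    # Illustrative sketch of information redundancy: one even-parity check bit
    # appended to the data bits detects any single-bit error.

    def add_parity(bits):
        """Append an even-parity bit to a list of 0/1 data bits."""
        return bits + [sum(bits) % 2]

    def check_parity(word):
        """Return True if the received word still has even parity."""
        return sum(word) % 2 == 0

    word = add_parity([1, 0, 1, 1])
    print(check_parity(word))        # True: no error

    word[2] ^= 1                     # flip one bit "in transit"
    print(check_parity(word))        # False: single-bit error detected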

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.17 Time Redundancy  Provide additional time during which a failed execution can be repeated  Most failures are transient - they go away after some time  If enough slack time is available, failed unit can recover and redo affected computation
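A minimal Python sketch of time redundancy as a simple retry loop; the failure probability, retry count, and delay are arbitrary placeholders, not values from the slides.

    # Illustrative sketch: retrying a computation in slack time masks a
    # transient fault; a permanent fault eventually exhausts the retries.

    import random, time

    def retry(task, attempts=3, delay=0.0):
        for i in range(attempts):
            try:
                return task()
            except Exception:
                if i == attempts - 1:
                    raise          # retries exhausted (fault not transient)
                time.sleep(delay)  # wait for the transient fault to clear

    def flaky_task():
        if random.random() < 0.3:  # hypothetical transient-failure probability
            raise RuntimeError("transient fault")
        return "result"

    print(retry(flaky_task))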

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.18 Fault Tolerance Measures  It is important to have proper yardsticks - measures - by which to measure the effect of fault tolerance  A measure is a mathematical abstraction, which expresses only some subset of the object's nature  Measures?  Covered next: reliability, availability, computational capacity, performability, and network connectivity/resilience

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.19 Traditional Measures - Reliability  Assumption: The system can be in one of two states: "up" or "down"  Examples:  Lightbulb - good or burned out  Wire - connected or broken  Reliability, R(t): Probability that the system is up during the whole interval [0,t], given it was up at time 0  Related measure - Mean Time To Failure, MTTF: Average time the system remains up before it goes down and has to be repaired or replaced
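The slide does not assume any particular failure law; under the common additional assumption of a constant failure rate \lambda (exponentially distributed time to failure), reliability and MTTF are related by

    R(t) = e^{-\lambda t}, \qquad \mathrm{MTTF} = \int_0^{\infty} R(t)\, dt = \frac{1}{\lambda}

so, for example, a failure rate of 10^{-4} per hour gives an MTTF of 10,000 hours.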

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.20 Traditional Measures - Availability  Availability, A(t): Fraction of time the system is up during the interval [0,t]  Point Availability, A_p(t): Probability that the system is up at time t  Long-Term Availability, A = lim A(t) as t goes to infinity = MTTF / (MTTF + MTTR)  Availability is used in systems with recovery/repair  Related measures:  Mean Time To Repair, MTTR  Mean Time Between Failures, MTBF = MTTF + MTTR
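A quick worked example with arbitrary numbers: a system with MTTF = 999 hours and MTTR = 1 hour has

    A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} = \frac{999}{999 + 1} = 0.999

i.e., in the long run it is up 99.9% of the time.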

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.21 Need For More Measures  The assumption that the system is either "up" or "down" is very limiting  Example: A processor with one of its several hundred million gates stuck at logic value 0, while the rest are functional, may affect the output of the processor only once in every 25,000 hours of use  The processor is not fault-free, but cannot be defined as being "down"  More detailed measures than the general reliability and availability are needed

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.22 Computational Capacity Measures  Example: N processors in a gracefully degrading system  System is useful as long as at least one processor remains operational  Let P_i = Prob{i processors are operational}  Let c = computational capacity of a processor (e.g., number of fixed-size tasks it can execute)  Computational capacity of i processors: C_i = i * c  Average computational capacity of the system: C_avg = sum over i = 1..N of C_i * P_i = c * sum over i = 1..N of i * P_i
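A minimal Python sketch of this computation; the probabilities below are made-up numbers for a hypothetical 4-processor system.

    # Illustrative sketch: average computational capacity of a gracefully
    # degrading system, where P[i] is the probability that exactly i
    # processors are operational and c is the capacity of one processor.

    def average_capacity(P, c):
        """P is a dict {i: probability that i processors are up}."""
        return sum(i * c * p for i, p in P.items())

    # Hypothetical 4-processor system (probabilities are arbitrary).
    P = {4: 0.90, 3: 0.07, 2: 0.02, 1: 0.01}
    print(average_capacity(P, c=1.0))   # -> 3.86 (up to float rounding)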

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.23 Another Measure - Performability  Another approach - consider everything from the perspective of the application  The application is used to define "accomplishment levels" L_1, L_2, ..., L_n  Each represents a level of quality of service delivered by the application  Example: L_i indicates i system crashes during the mission time period T  Performability is a vector (P(L_1), P(L_2), ..., P(L_n)) where P(L_i) is the probability that the computer functions well enough to permit the application to reach up to accomplishment level L_i

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.24 Network Connectivity Measures  Focus on the network that connects the processors  Classical Node and Line Connectivity - the minimum number of nodes and lines, respectively, that have to fail before the network becomes disconnected  Measure indicates how vulnerable the network is to disconnection  A network disconnected by the failure of just one (critically-positioned) node is potentially more vulnerable than another which requires several nodes to fail before it becomes disconnected
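For a small example, these two quantities can be computed directly with a graph library; the sketch below assumes the Python networkx package, and the 5-node ring topology is chosen purely for illustration.

    # Illustrative sketch (assumes networkx): classical node and line (edge)
    # connectivity of a small ring network.

    import networkx as nx

    G = nx.cycle_graph(5)   # 5-node ring: every node has degree 2

    print(nx.node_connectivity(G))   # -> 2: two nodes must fail to disconnect it
    print(nx.edge_connectivity(G))   # -> 2: two links must fail to disconnect it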

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.25 Connectivity - Examples

Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.26 Network Resilience Measures  Classical connectivity distinguishes between only two network states: connected and disconnected  It says nothing about how the network degrades as nodes fail before becoming disconnected  Two possible resilience measures:  Average node-pair distance  Network diameter - maximum node-pair distance  Both calculated given probability of node and/or link failure
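A minimal sketch (again assuming the Python networkx package) of the two resilience measures named above, computed before and after a single node failure on an arbitrary 6-node ring.

    # Illustrative sketch: average node-pair distance and network diameter,
    # before and after one node fails. The topology is an arbitrary 6-node ring.

    import networkx as nx

    G = nx.cycle_graph(6)
    print(nx.average_shortest_path_length(G), nx.diameter(G))   # -> 1.8, 3

    H = G.copy()
    H.remove_node(0)   # one node fails; the network degrades but stays connected
    print(nx.average_shortest_path_length(H), nx.diameter(H))   # -> 2.0, 4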