
2. Fault Tolerance

2 Fault - Error - Failure
- Fault = physical defect or flaw occurring in some component (hardware or software)
- Error = incorrect behavior caused by a fault; the manifestation of a fault
- Failure = inability of the system to perform its specified service
- Latent fault = a fault that has not yet produced an error
- Latent error = an error that has not yet produced a failure

3 Fault - Error - Failure
Note: the presence of a fault does not ensure that an error will occur, e.g. a memory cell stuck-at-0 produces no error as long as the value written to it is 0.

4 Origin of Defects in Objects (HW/SW)
- Good object wearing out with age
  - Hardware (software can age too)
  - Incorrect maintenance/operation
- Good object, unforeseen hostile environment
  - Environmental fault
- Marginal object: occasionally fails in target environment
  - Tight design/bad inputs
- Implementation mistakes
- Specification mistakes
Note: from top to bottom, human responsibility increases.

5 Bathtub Curve
Three phases of system lifetime:
- Infant mortality
- Normal lifetime
- Wear-out period

6 Life-time of a Software system

7 Faults Characteristics
1- Cause
- Specification errors: very dangerous, generic fault
- Implementation errors: very hard to formally verify
- Random component faults: random, not manufacturing defects
- External disturbance: noise, EMP, vibration, radiation; much like random component faults

8 Faults Characteristics
2- Origin
- software or hardware
  Physical device level (HW)
  Logic level (HW)
  Chip level (HW)
  System level (HW/SW)
- interfacing, specifications, …
- don’t care, except: hardware can be analog (indeterminate voltage level)

9 Faults Characteristics
3- Duration
- Permanent fault: occurs and doesn’t go away; easiest to diagnose
- Transient fault: occurs once and disappears; 10 times as likely as a permanent fault
- Intermittent fault: occurs occasionally; may appear to be transient (if the period is long); hard and expensive to detect

10 Faults Characteristics
4- Extent
- Global: e.g. a power supply fault
- Local: e.g. a memory fault
5- Value
- Determinate: e.g. memory stuck-at-0
- Indeterminate: a fault that is sensitive to data or time

11 What to do about Faults
Finding & identifying faults:
- Fault detection: is a fault there?
- Fault location: where?
- Fault diagnosis: which fault is it?
Automatic handling of faults:
- Fault containment: blocking error flow
  - Fault masking: fault has no effect
- Fault recovery: back to correct operation

12 System Response to faults
- Error on output: may be acceptable in non-critical systems if it happens only rarely
- Fault masking: output correct even when a fault from a specific class occurs
  - Critical applications: air/space/manufacturing
- Fault-secure: output correct or error indication
  - Retryable: banking, telephony
- Fail safe: output correct or in safe state
  - Flashing red traffic light, disabled ATM
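
The fault-secure response on slide 12 (output correct or error indication) can be illustrated with a duplicate-and-compare sketch. This is not from the slides: the function name `fault_secure` and the two sample computations are hypothetical, and the sketch assumes the two copies fail independently.

```python
def fault_secure(primary, shadow, x):
    """Run two copies of a computation and compare their outputs.

    Returns (value, True) when both copies agree, otherwise
    (None, False) as an explicit error indication.
    """
    a = primary(x)
    b = shadow(x)
    if a == b:
        return a, True
    return None, False


def square(x):
    # Hypothetical fault-free computation.
    return x * x


def faulty_square(x):
    # Hypothetical copy with a fault that corrupts the result.
    return x * x + 1


agreed = fault_secure(square, square, 3)
mismatch = fault_secure(square, faulty_square, 3)
```

A caller that receives the error indication can retry the request, which fits the "retryable" applications named on the slide. A fault that makes both copies produce the same wrong value is not caught by this comparison.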

13 What is Fault-Tolerance?
- A fault-tolerant system is one that continues to perform at the desired level of service, according to its specification, in the presence of faults.
- In a fault-tolerant system, the faults it is designed to tolerate do not lead to failures.
- Fault-tolerance is the ability of a system to provide a service complying with the specification in spite of faults.
- A better title might have been Dependable or Reliable or Available computing.

14 Fault Tolerance
In the physical universe:
- Fault detection
- Fault location
- Fault containment
- Fault recovery
- Continue servicing
In the informational universe:
- Error detection
- Error location
- Error containment
- Error recovery
- Continue servicing

15 Fault Recovery
- How quickly is the fault detected?
- How soon can recovery begin?
  - Does it require human intervention?
  - How is the system admin notified?
- How long does recovery take?
  - Restore from backup?
  - Purchase new HW?

16 Fault Coverage (C)
Measure of system’s ability to perform:
- fault detection
- fault location
- fault containment
- (and/or fault recovery)
C = P (fault detection | fault occurrence)
C = P (fault recovery | fault occurrence)
Note:
- recovery implies that the system as a whole is operational
- this does not imply that a repair occurred
- e.g. a duplex system with a benign fault can recover to continue operation on one non-faulty processor
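
One common way to estimate the conditional probabilities on slide 16 is to inject faults and count how many are handled. The sketch below is only an illustration under that assumption: the function name `coverage` and the campaign numbers are hypothetical, not data from the slides.

```python
def coverage(handled: int, occurred: int) -> float:
    """Estimate C = P(handled | fault occurrence) as handled / occurred."""
    if occurred <= 0:
        raise ValueError("need at least one fault occurrence")
    if not 0 <= handled <= occurred:
        raise ValueError("handled must be between 0 and occurred")
    return handled / occurred


# Hypothetical fault-injection campaign: 1000 injected faults,
# 970 of them detected and 940 of them recovered from.
detection_coverage = coverage(970, 1000)
recovery_coverage = coverage(940, 1000)
```

With these made-up counts, recovery coverage is lower than detection coverage, since a fault that is never detected cannot trigger recovery.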

17 Design Philosophies to Combat Faults
- Fault avoidance (off-line): attempts to prevent faults through:
  - Design review
  - Component selection
  - Quality control
  - Shielding
  - Testing
- Fault masking (on-line): attempts to prevent a fault in a system from introducing errors
  - Error correcting memory
  - Majority voting
- Fault tolerance (on-line): attempts to let a system continue performing its expected tasks after the occurrence of faults

18 Design Philosophies to Combat Faults
Fault avoidance, Fault masking, Fault tolerance

19 Fault Avoidance vs. Tolerance
Fault avoidance: eliminate problem sources
- Remove defects: testing and debugging
- Robust design: reduce probability of defects
- Minimize environmental stress: radiation shielding etc.
- Impossible to avoid faults completely
Fault tolerance: add redundancy to mask effect
- Additional resources needed (more later)
- Examples: error correction coding, backup storage, spare tire etc.

20 Fault Forecasting vs. Tolerance
Fault Tolerance
- Execution-time techniques that deal with the effects of faults
Fault Forecasting
- Estimate current number, future incidence and likely consequences of faults
You can’t tolerate what you don’t expect. But if we expected it, we would avoid or eliminate the fault!
In general:
- We can itemize the classes of faults that can occur
- We can define what we want done if the fault occurs and if the error is detected
Example: automobile tire
- Expected fault: loses air
- Not expected: electrical overload

21 Fault Tolerant computing
Deterministic approaches
- Based on simplifying assumptions: “fault model”
- Obtain methods using the models: test generation
- Evaluation of effectiveness
- Used for testing & combinatorial fault-tolerance
Probabilistic approaches
- We can’t predict exactly when a person will die, but we can still get “life expectancy = 77.2”, if we have data
- Used for evaluating, achieving and optimizing reliability
- Random testing

22 Fault Tolerant vs. Performance
There are many fault-tolerance approaches that sacrifice performance to tolerate faults
Ex. 1:
- Periodically stop the system and checkpoint its state to disk
  * If a fault occurs, recover state from the checkpoint and resume
Ex. 2:
- Log all changes made to system state in case recovery is needed
  * During recovery, undo the changes from the log
Ex. 3:
- Run two identical systems in parallel, compare their results before using them
Ex. 4:
- Run software with lots of error checking
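
Ex. 1 on slide 22 (checkpoint, then recover on a fault) can be sketched as follows. This is a minimal illustration, not the slides' own method: the class `CheckpointedState` and the helper `apply_batch` are hypothetical, the checkpoint lives in memory rather than on disk, and a `TypeError` stands in for a detected error.

```python
import copy


class CheckpointedState:
    """Holds a state dict plus one in-memory checkpoint (a stand-in for disk)."""

    def __init__(self, initial):
        self.state = copy.deepcopy(initial)
        self._saved = copy.deepcopy(initial)

    def checkpoint(self):
        # Save a full copy of the current state.
        self._saved = copy.deepcopy(self.state)

    def recover(self):
        # Discard the current state and resume from the last checkpoint.
        self.state = copy.deepcopy(self._saved)


def apply_batch(system, values):
    """Apply updates; on a detected error, roll back to the last checkpoint."""
    try:
        for v in values:
            system.state["total"] += v  # raises TypeError on a non-numeric value
            system.state["count"] += 1
        system.checkpoint()
        return True
    except TypeError:
        system.recover()
        return False


system = CheckpointedState({"total": 0, "count": 0})
ok_first = apply_batch(system, [1, 2, 3])
ok_second = apply_batch(system, [4, "bad", 5])
```

The second batch fails part-way through, so the partial update from the value 4 is discarded along with the rest of the batch. The deep copies are the performance cost the slide refers to.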

23 Fault Tolerant vs. Cost
There are many fault-tolerance approaches that increase cost to tolerate faults
Ex. 1:
- Replicate the hardware 3 times and vote to determine correct output
Ex. 2:
- Mirror the disks (RAID-1) to tolerate disk failures
Ex. 3:
- Use multiple independent versions of software to tolerate bugs (called N-version programming)
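
Ex. 1 on slide 23 (replicate 3 times and vote) can be sketched in software as a majority voter. The names `majority_vote` and `run_replicated`, and the sample replicas, are hypothetical; real triple-modular redundancy votes in hardware, so this is only an analogy under that assumption.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas, else raise."""
    if not outputs:
        raise ValueError("no outputs to vote on")
    value, votes = Counter(outputs).most_common(1)[0]
    if votes * 2 <= len(outputs):
        raise RuntimeError("no majority among replica outputs")
    return value


def run_replicated(replicas, x):
    """Run every replica on the same input and vote on the results."""
    return majority_vote([replica(x) for replica in replicas])


def good(x):
    # Hypothetical fault-free replica.
    return x * x


def faulty(x):
    # Hypothetical replica with a fault that corrupts its output.
    return x * x + 1


result = run_replicated([good, faulty, good], 7)
```

With three replicas, one faulty output is outvoted. If two replicas fail with different wrong values, the voter raises instead of returning a result.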

24 Fault Tolerant vs. Power
There are many fault-tolerance approaches that spend extra power to tolerate faults
Ex. 1, 2 & 3 (same as previous slide):
- Replicate the hardware 3 times and vote to determine correct output
- Mirror the disks (RAID-1) to tolerate disk failures
- Use multiple independent versions of software to tolerate bugs
Ex. 4:
- Add continuously running checking hardware to the system
Ex. 5:
- Add extra code to check for software faults

25 Need for Fault Tolerance: Universal
Natural objects:
- Fat deposits in body: survival in starvation
- Duplication of eyes: graceful degradation upon failure
Man-made objects:
- Redundancy in ordinary text
- Asking for password twice during initial set-up
- Duplicate tires in trucks

26 Forms of Redundancy
- Hardware redundancy: add extra hardware for detecting or tolerating faults
- Software redundancy: add extra software for detecting and possibly tolerating faults
- Information redundancy: extra information, i.e. codes
- Time redundancy: extra time for performing tasks for fault tolerance
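
Information redundancy on slide 26 ("extra information, i.e. codes") can be illustrated with the simplest such code, a single even-parity bit. The slides do not name a specific code, so the choice of parity and the function names `add_parity` and `parity_ok` are assumptions made for this sketch.

```python
def add_parity(bits):
    """Append one even-parity bit so the total number of 1s is even."""
    return bits + [sum(bits) % 2]


def parity_ok(word):
    """Return True when the word still contains an even number of 1s."""
    return sum(word) % 2 == 0


word = add_parity([1, 0, 1, 1])

# Simulate a single-bit fault by flipping one stored bit.
corrupted = word.copy()
corrupted[2] ^= 1

stored_ok = parity_ok(word)
fault_detected = not parity_ok(corrupted)
```

A single parity bit detects any odd number of flipped bits but cannot locate or correct them. Correction needs more redundant bits, as in the error correcting memory mentioned on slide 17.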

27 Redundancy base
[Figure: redundancy along the time axis (Try, Retry, Retry) versus along the space axis (Try)]
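
The Try, Retry, Retry sequence along the time axis on slide 27 can be sketched as a retry loop. The names `with_retry` and `FlakySensor` are hypothetical, and a `RuntimeError` stands in for a detected transient fault.

```python
def with_retry(operation, attempts=3):
    """Time redundancy: repeat the operation until it succeeds or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except RuntimeError as exc:  # treated here as a detected transient fault
            last_error = exc
    raise last_error


class FlakySensor:
    """Hypothetical device that fails a fixed number of times, then works."""

    def __init__(self, failures):
        self.remaining = failures
        self.calls = 0

    def read(self):
        self.calls += 1
        if self.remaining > 0:
            self.remaining -= 1
            raise RuntimeError("transient fault")
        return 42


sensor = FlakySensor(failures=2)
value = with_retry(sensor.read, attempts=3)
```

Retrying helps with transient faults only. A permanent fault fails on every attempt, which is why it needs redundancy in space instead.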