A Survey of Fault Tolerance in Distributed Systems By Szeying Tan Fall 2002 CS 633.

Introduction  Paper covers:  Definitions of faults and failures  Failure models and elements of fault tolerance  Hardware fault-tolerant techniques  Software fault-tolerant techniques

Why Fault Tolerance?  Mission critical systems – a requirement to ensure reliability and availability  High availability and need for reliability especially important in distributed real time systems  Complex issues raised in providing fault tolerance in distributed systems compared to single processor systems

What do we do with faults?  Error detection – find the error in the system  Damage control and assessment – contain and fix  Error recovery – return the system to an error-free state  Fault treatment/continued service – attempt uninterrupted execution regardless of the fault

Failure Models  Failstop  Crash  Crash+Link  Receive Omission  Send Omission  General Omission  Byzantine Failures

Types of Faults  Permanent – remains in the system indefinitely until corrective action is taken  Transient – disappears after a short period of time  Intermittent – appears and disappears repeatedly

Elements of Fault Tolerance  Redundancy – addition of information, resources, or time beyond what is needed for normal system operation  Failure semantics – knowledgebase of failure behaviors of a system  Group failure masking – Masks failures from others in group.

Hardware Fault Tolerant Techniques  Hardware redundancy – duplicate components to detect or tolerate faults  Passive techniques – fault masking  Active techniques – fault detection and removal  Hybrid techniques – a combination of both  Techniques listed on the next slide

Triple Modular Redundancy  Execute a task three times  Take a majority vote  In a fault-free system, all three results are identical  Does not work for Byzantine (arbitrary) failures
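
The execute-three-times-and-vote idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the names `tmr_vote` and `run_tmr` are hypothetical, the three executions run sequentially in one process, and results are assumed to be hashable values that can be compared for equality.

```python
from collections import Counter


def tmr_vote(results):
    """Return the value reported by at least two of three replicas."""
    if len(results) != 3:
        raise ValueError("TMR expects exactly three results")
    # most_common(1) yields the (value, count) pair with the highest count.
    value, count = Counter(results).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no majority: all three results disagree")
    return value


def run_tmr(task, *args):
    """Execute the task three times and vote on the outcomes."""
    return tmr_vote([task(*args) for _ in range(3)])
```

For example, `tmr_vote([4, 4, 5])` returns `4`: the single disagreeing result is outvoted. If all three results differ, the voter cannot mask the fault and raises instead.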

N-Modular Redundancy  Accomplished by replicating a module N times and voting on the outputs  Works similarly to TMR.  Masks symmetrical and asymmetrical failures
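
As a worked relation (a standard property of majority voting, stated here as background rather than taken from the slides): with $N$ modules feeding a majority voter, up to $f$ faulty modules can be outvoted as long as the correct modules still form a majority.

```latex
N \ge 2f + 1
\quad\Longleftrightarrow\quad
f \le \left\lfloor \frac{N-1}{2} \right\rfloor
```

TMR is the case $N = 3$, $f = 1$; five modules would allow $f = 2$.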

Standby Sparing  Replicate spares in the system (duplicate components)  Spares activated when fault is detected
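
A rough Python sketch of standby sparing follows. It assumes, purely for illustration, that a fault shows up as a raised exception and that components are plain callables; the class name `StandbySparing` is hypothetical. Real systems detect faults with the checks listed later in this deck and must also transfer state to the spare.

```python
class StandbySparing:
    """Route calls to the active component; switch to a spare on a fault."""

    def __init__(self, primary, spares):
        self.active = primary
        self.spares = list(spares)  # spares waiting to be activated, in order

    def call(self, *args):
        while True:
            try:
                return self.active(*args)
            except Exception:
                # Fault detected: activate the next spare, or give up
                # and re-raise if no spare is left.
                if not self.spares:
                    raise
                self.active = self.spares.pop(0)
```

Once a spare has been activated it stays active for later calls, so a permanently failed primary is not retried.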

Duplex Systems  Execute a task twice  Compare results for discrepancies  Execution can occur on separate hardware or sequentially on the same hardware
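
The compare-two-executions idea can be sketched as below. The function name `duplex_execute` is hypothetical, and the two callables stand in for the two executions (separate hardware, or the same hardware run twice). Note that a duplex comparison only detects a discrepancy; with two results there is no majority to say which one is wrong.

```python
def duplex_execute(task_a, task_b, *args):
    """Run two executions of the same task and compare their results."""
    result_a = task_a(*args)
    result_b = task_b(*args)
    if result_a != result_b:
        # A mismatch signals an error but cannot identify the faulty side.
        raise RuntimeError(f"duplex mismatch: {result_a!r} != {result_b!r}")
    return result_a
```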

An example of a hardware fault tolerant system  Stratus servers – Fault tolerant hardware servers that use TMR and fully replicated hardware design to provide fault tolerance. 

Software Fault Tolerant Techniques  Two main areas:  Provide for static redundancy  Provide for dynamic redundancy  N-Version Programming  Recovery Blocks or Primary-Backup technique

N-Version programming  Run n independently developed versions of a program on n processes.  Forward recovery scheme that masks faults  Relies on voting mechanisms
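
A small sketch of the voting mechanism for N-version programming, under simplifying assumptions: the versions are independent Python callables run one after another in a single process (the slide describes n processes), a version that raises casts no vote, and outputs are hashable. The name `n_version_execute` is hypothetical.

```python
from collections import Counter


def n_version_execute(versions, *args):
    """Run every version on the same input and return the majority output."""
    outputs = []
    for version in versions:
        try:
            outputs.append(version(*args))
        except Exception:
            pass  # a crashed version simply casts no vote
    if not outputs:
        raise RuntimeError("every version failed")
    value, count = Counter(outputs).most_common(1)[0]
    # Require a strict majority of all versions, not just of the survivors.
    if count <= len(versions) // 2:
        raise RuntimeError("no majority among versions")
    return value
```

With three versions of an absolute-value routine, one of which wrongly returns its input unchanged, calling the function on `-3` still yields `3` because the two correct versions outvote the faulty one.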

Agreement problems  Agreement problems arise when a processor is faulty and the non-faulty processors must still agree on a course of action  Some agreement problems covered in my paper:  Byzantine Generals Problem  Consensus Problem  Interactive Consistency
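
For context on why Byzantine agreement is costly (a classical result for synchronous systems with unauthenticated messages, given here as background and not drawn from the slides): agreement among $n$ processors is possible in the presence of $m$ Byzantine-faulty processors only if

```latex
n \ge 3m + 1
```

so tolerating a single arbitrary fault ($m = 1$) needs at least four processors.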

Application of agreement protocols  Fault-tolerant clock synchronization  Non-faulty processes must have clocks that are approximately equal in value  Atomic commits  Process actions have certain characteristics that must be followed (indivisible, instantaneous, non-revealing state changes, etc.)

Recovery Blocks  Backward error recovery scheme  Also known as the primary-backup approach  Relies on acceptance tests  Checks that the output is within an acceptable range
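
The recovery block structure (checkpoint, try the primary, run the acceptance test, roll back and try a backup) can be sketched as follows. This is an illustrative outline with hypothetical names: `recovery_block` takes the state, an ordered list of alternates (primary first), and an acceptance test. Rollback is modeled by handing each alternate a fresh copy of the checkpoint, so the caller's state is never modified.

```python
import copy


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate from a clean checkpoint until one is accepted."""
    checkpoint = copy.deepcopy(state)  # saved state for backward recovery
    for alternate in alternates:
        candidate = copy.deepcopy(checkpoint)  # roll back to the checkpoint
        try:
            result = alternate(candidate)
        except Exception:
            continue  # a crash counts as a failed alternate
        if acceptance_test(result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")
```

The acceptance test here is whatever range or sanity check the application defines, for example `lambda r: 0 <= r <= 100`.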

Error Detection Techniques  Effectiveness of any fault tolerant system depends on the effectiveness of its error detection techniques  Early detection or late detection  Concept of acceptability determines the thoroughness of error detection on a distributed system

Error Detection Techniques  Replication Checks  Timing Checks  Structural Checks  Reasonableness Checks  Reversal checks
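
Three of the checks listed on this slide can be illustrated with short Python functions. These are generic sketches with hypothetical names and parameters, not the paper's definitions: the reversal check uses a square root as its example computation, and the timing check measures a call after the fact rather than interrupting it.

```python
import math
import time


def reasonableness_check(value, low, high):
    """Accept a value only if it lies within the expected range."""
    return low <= value <= high


def reversal_check(x, y, rel_tol=1e-9):
    """Verify a claimed square root y of x by reversing the computation."""
    return math.isclose(y * y, x, rel_tol=rel_tol)


def timing_check(task, deadline_seconds):
    """Run the task and report whether it finished within the deadline."""
    start = time.monotonic()
    result = task()
    elapsed = time.monotonic() - start
    return result, elapsed <= deadline_seconds
```

Replication checks correspond to the voting and comparison schemes shown on the earlier redundancy slides.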

Conclusion  Fault tolerance can be provided in a distributed system by many different means  Topics not covered include error recovery and fault treatment