A Survey of Fault Tolerance in Distributed Systems By Szeying Tan Fall 2002 CS 633.


1 A Survey of Fault Tolerance in Distributed Systems By Szeying Tan Fall 2002 CS 633

2 Introduction  Paper covers:  Definitions of faults/failures  Failure models and elements of fault tolerance  Hardware fault tolerant techniques  Software fault tolerant techniques

3 Why Fault Tolerance?  Mission critical systems – a requirement to ensure reliability and availability  High availability and need for reliability especially important in distributed real time systems  Complex issues raised in providing fault tolerance in distributed systems compared to single processor systems

4 What do we do with faults?  Error detection – find the error in the system  Damage control and assessment – contain and fix  Error recovery – return the system back to an error-free state  Fault treatment/continued service – attempt uninterrupted execution regardless of fault

5 Failure Models  Failstop  Crash  Crash+Link  Receive Omission  Send Omission  General Omission  Byzantine Failures

6 Types of Faults  Permanent – remains in the system indefinitely until corrective action is taken  Transient – disappears after a short period of time  Intermittent – appears and disappears repeatedly

7 Elements of Fault Tolerance  Redundancy – addition of information, resources, or time beyond what is needed for normal system operation  Failure semantics – knowledge base of the failure behaviors of a system  Group failure masking – masks a member's failures from the others in the group

8 Hardware Fault Tolerant Techniques  Hardware redundancy – duplicate components to detect or tolerate faults  Passive techniques – fault masking  Active techniques – fault detection and removal  Hybrid techniques – a combination of both  Techniques listed on the next slide

9 Triple Modular Redundancy  Execute a task three times  Take a majority vote  In a fault-free system, all three results are identical  Does not work for Byzantine (arbitrary) failures
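
The execute-three-times-and-vote idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `majority_vote` and `run_tmr` are hypothetical, the three executions run sequentially in one process rather than on replicated hardware, and results must be hashable so they can be counted.

```python
from collections import Counter

def majority_vote(results):
    """Return the value produced by a strict majority of the results."""
    value, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(results):
        # No value was produced by more than half of the executions.
        raise RuntimeError("no majority among results")
    return value

def run_tmr(task, *args):
    """Execute `task` three times and return the majority result."""
    return majority_vote([task(*args) for _ in range(3)])
```

With one corrupted result, for example `majority_vote([4, 4, 5])`, the voter still returns 4. If all three results differ, no majority exists and the fault cannot be masked.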

10 N-Modular Redundancy  Accomplished by executing a task on N modules and voting to mask errors  Works similarly to TMR  Masks symmetrical and asymmetrical failures
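
As a small worked formula, assuming a strict-majority voter (the slide does not specify the voting rule), N modules can mask at most floor((N - 1) / 2) faulty modules, because the correct modules must still outnumber the faulty ones. The function name `max_masked_faults` is hypothetical.

```python
def max_masked_faults(n):
    """Faulty modules that n-modular redundancy can mask with strict-majority voting."""
    if n < 1:
        raise ValueError("need at least one module")
    # Correct modules must form a strict majority: n - f > f, so f <= (n - 1) // 2.
    return (n - 1) // 2
```

Under this assumption TMR (N = 3) masks one faulty module, and moving to N = 4 adds no extra masking capacity over N = 3.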

11 Standby Sparing  Replicate spares in the system (duplicate components)  Spares activated when fault is detected
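
A minimal sketch of standby sparing, assuming that a fault shows up as a raised exception (real systems use dedicated fault detectors) and that components are plain callables. The class name `StandbySparing` is hypothetical.

```python
class StandbySparing:
    """Keep one active component and switch to a spare when a fault is detected."""

    def __init__(self, primary, spares):
        self.active = primary
        self.spares = list(spares)

    def call(self, *args):
        while True:
            try:
                return self.active(*args)
            except Exception:
                # Fault detected: activate the next spare, or give up if none remain.
                if not self.spares:
                    raise
                self.active = self.spares.pop(0)
```

Once a spare has been activated it stays active for later calls, so the failed component is effectively removed from service.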

12 Duplex Systems  Execute a task twice  Compare results for discrepancies  Execution can occur on separate hardware or sequentially on the same hardware
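
The compare step can be sketched as follows, assuming sequential execution on the same hardware. The names `DuplexMismatch` and `duplex` are hypothetical. Note that a duplex pair can detect a discrepancy but cannot tell which of the two executions was faulty.

```python
class DuplexMismatch(Exception):
    """Raised when the two executions disagree."""

def duplex(task_a, task_b, *args):
    """Run two executions of a task and compare their results."""
    first = task_a(*args)
    second = task_b(*args)
    if first != second:
        # A fault is detected, but not located: either execution may be wrong.
        raise DuplexMismatch(f"results differ: {first!r} != {second!r}")
    return first
```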

13 An example of a hardware fault tolerant system  Stratus servers – Fault tolerant hardware servers that use TMR and fully replicated hardware design to provide fault tolerance.  http://www.stratus.com

14 Software Fault Tolerant Techniques  Two main areas:  Provide for static redundancy  Provide for dynamic redundancy  N-Version Programming  Recovery Blocks or Primary-Backup technique

15 N-Version programming  Run n independently developed versions of a program on n processes  Forward recovery scheme that masks faults  Relies on voting mechanisms
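
A hedged sketch of N-version programming with n = 3: three differently written integer square root routines are voted on, and one of them contains a deliberate bug. Everything here is illustrative. The function names are hypothetical, the versions run sequentially instead of on separate processes, and a version that crashes simply casts no vote.

```python
import math
from collections import Counter

def isqrt_v1(n):
    """Version 1: standard library."""
    return math.isqrt(n)

def isqrt_v2(n):
    """Version 2: integer Newton iteration."""
    if n < 2:
        return n
    x = n
    while x * x > n:
        x = (x + n // x) // 2
    return x

def isqrt_v3(n):
    """Version 3: deliberately buggy, rounds instead of taking the floor."""
    return round(n ** 0.5)

def n_version(versions, *args):
    """Run every version and return the result backed by a strict majority."""
    results = []
    for version in versions:
        try:
            results.append(version(*args))
        except Exception:
            pass  # a crashed version casts no vote
    if results:
        value, votes = Counter(results).most_common(1)[0]
        if votes * 2 > len(versions):
            return value
    raise RuntimeError("no majority among versions")
```

For the input 15, versions 1 and 2 return 3 while the buggy version returns 4, so the vote masks the fault and the caller sees 3 without any rollback.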

16 Agreement problems  Agreement problems arise when a processor is faulty and the non-faulty processors have to agree on a course of action  Some agreement problems covered in my paper  Byzantine Generals Protocol  Consensus Problem  Interactive Consistency
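
The slide does not state a bound, but a well-known result for the Byzantine Generals problem with unsigned (oral) messages in a synchronous system is that agreement requires at least 3m + 1 processors to tolerate m faulty ones. The helper below only encodes that inequality. Its name is hypothetical, and it assumes that particular variant of the problem.

```python
def agreement_possible(n, m):
    """True if n processors can tolerate m Byzantine faults (unsigned messages, synchronous)."""
    if n < 1 or m < 0:
        raise ValueError("need n >= 1 and m >= 0")
    return n >= 3 * m + 1
```

Under this assumption three processors cannot tolerate even one Byzantine fault, while four can.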

17 Application of agreement protocols  Fault tolerant clock syncs  Non-faulty processes must have clocks that are approximately equal in value  Atomic commits  Process actions have certain characteristics that must be followed (indivisible, instantaneous, non-revealing state changes etc.)

18 Recovery Blocks  Backward error recovery scheme  Also known as primary-backup approach  Relies on acceptance tests  Checks output is within an acceptable range
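
A minimal sketch of a recovery block, assuming that alternates are plain functions, that the checkpoint is a deep copy of the input state, and that the acceptance test is a range check as the slide describes. The names `recovery_block`, `primary`, `backup`, and `in_range` are hypothetical.

```python
import copy

def recovery_block(state, alternates, acceptance_test):
    """Try each alternate from the same checkpoint until one passes the acceptance test."""
    checkpoint = copy.deepcopy(state)
    for alternate in alternates:
        # Backward recovery: every attempt starts from a fresh copy of the checkpoint.
        working = copy.deepcopy(checkpoint)
        try:
            result = alternate(working)
        except Exception:
            continue  # treat a crash like a failed acceptance test
        if acceptance_test(result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")

def primary(readings):
    """Primary routine with a deliberate bug: forgets to divide by the count."""
    return sum(readings)

def backup(readings):
    """Backup routine: correct mean."""
    return sum(readings) / len(readings)

def in_range(value):
    """Acceptance test: a mean of percentages must lie in [0, 100]."""
    return 0 <= value <= 100
```

For the readings `[40, 60, 80]` the primary returns 180, which fails the range check, so the block rolls back and the backup's result of 60.0 is accepted.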

19 Error Detection Techniques  Effectiveness of any fault tolerant system depends on the effectiveness of its error detection techniques  Early detection or late detection  Concept of acceptability determines the thoroughness of error detection on a distributed system

20 Error Detection Techniques  Replication Checks  Timing Checks  Structural Checks  Reasonableness Checks  Reversal Checks
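
Three of the checks listed above can be illustrated with small helpers. These are sketches under assumptions, not definitions from the paper: the function names are hypothetical, the timing check only reports a missed deadline after the task has finished (it does not interrupt it), and the reversal check assumes the computation has a known inverse.

```python
import time

def reasonableness_check(value, low, high):
    """Is the output within the range the application considers plausible?"""
    return low <= value <= high

def reversal_check(x, y, inverse, tol=1e-9):
    """Apply the inverse computation to output y and compare it with input x."""
    return abs(inverse(y) - x) <= tol

def timing_check(task, deadline_s, *args):
    """Run task and report whether it finished within deadline_s seconds."""
    start = time.monotonic()
    result = task(*args)
    elapsed = time.monotonic() - start
    return result, elapsed <= deadline_s
```

For example, a square root routine that returns 3.0 for the input 9.0 passes `reversal_check(9.0, 3.0, lambda y: y * y)`, while a result of 3.1 does not.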

21 Conclusion  Fault tolerance can be provided in a distributed system by many different means  Sections not covered include error recovery and fault treatment

