A Survey of Fault Tolerance in Distributed Systems By Szeying Tan Fall 2002 CS 633.

Introduction
Paper covers:
- Definitions of faults and failures
- Failure models and elements of fault tolerance
- Hardware fault tolerant techniques
- Software fault tolerant techniques

Why Fault Tolerance?
- Mission critical systems – fault tolerance is a requirement for ensuring reliability and availability
- High availability and reliability are especially important in distributed real time systems
- Providing fault tolerance raises more complex issues in distributed systems than in single processor systems

What do we do with faults?
- Error detection – find the error in the system
- Damage control and assessment – contain and fix the damage
- Error recovery – return the system to an error-free state
- Fault treatment/continued service – attempt uninterrupted execution regardless of the fault

Failure Models
- Failstop
- Crash
- Crash+Link
- Receive Omission
- Send Omission
- General Omission
- Byzantine Failures

Types of Faults
- Permanent – remains in the system indefinitely until corrective action is taken
- Transient – disappears after a short period of time
- Intermittent – appears and disappears repeatedly

Elements of Fault Tolerance
- Redundancy – addition of information, resources, or time beyond what is needed for normal system operation
- Failure semantics – knowledge base of the failure behaviors of a system
- Group failure masking – masks failures from others in the group

Hardware Fault Tolerant Techniques
- Hardware redundancy – duplicate components to detect or tolerate faults
- Passive techniques – fault masking
- Active techniques – fault detection and removal
- Hybrid techniques – a combination of both
- Techniques are listed on the following slides

Triple Modular Redundancy
- Execute a task three times
- Take a majority vote on the results
- In a fault free system, all three results are identical
- Does not work for Byzantine (arbitrary) failures
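
The voting step above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the names `majority_vote`, `tmr`, `healthy`, and `faulty` are hypothetical, the three replicas are plain Python functions standing in for hardware modules, and the voter itself is assumed to be fault free.

```python
from collections import Counter

def majority_vote(results):
    """Return the value reported by a strict majority of replicas, else raise."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among replica results")
    return value

def tmr(replicas, x):
    """Run the same task on three replicas and vote on the outputs."""
    if len(replicas) != 3:
        raise ValueError("TMR needs exactly three replicas")
    return majority_vote([replica(x) for replica in replicas])

def healthy(x):
    return x * x

def faulty(x):
    return x * x + 1  # hypothetical replica that always returns a wrong value

print(tmr([healthy, healthy, faulty], 4))  # the single faulty replica is outvoted
```

If two replicas fail in different ways, no value reaches a majority and the sketch raises an error instead of returning a possibly wrong answer.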

N-Modular Redundancy
- Accomplished by replicating a module N times and voting to mask errors
- Works similarly to TMR
- Masks symmetrical and asymmetrical failures
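
As a rough guide to choosing N, the arithmetic behind majority voting can be written down directly. The sketch assumes a plain majority voter that is itself fault free and modules that fail independently; the function name `max_masked_faults` is hypothetical.

```python
def max_masked_faults(n):
    """Largest number of faulty modules a majority voter over n modules can outvote."""
    if n < 1:
        raise ValueError("need at least one module")
    return (n - 1) // 2

for n in (3, 5, 7):
    print(n, "modules mask up to", max_masked_faults(n), "faulty module(s)")
```

Under these assumptions, TMR is the N = 3 case and masks a single faulty module.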

Standby Sparing
- Keep spare (duplicate) components in the system
- Spares are activated when a fault is detected
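
A minimal software analogy of standby sparing is shown below. It is a sketch under stated assumptions: the class `StandbySystem` and the units `broken_unit` and `spare_unit` are hypothetical, components are modeled as functions, and a raised exception stands in for whatever fault detection mechanism the real hardware uses.

```python
class StandbySystem:
    """Route requests to the active unit; promote a spare when a fault is detected."""

    def __init__(self, units):
        self.units = list(units)  # units[0] is the primary, the rest are spares
        self.active = 0

    def call(self, x):
        while self.active < len(self.units):
            try:
                return self.units[self.active](x)
            except Exception:
                # Fault detected: retire the active unit and switch to the next spare.
                self.active += 1
        raise RuntimeError("all units have failed")

def broken_unit(x):
    raise OSError("hypothetical hardware fault")

def spare_unit(x):
    return x + 1

system = StandbySystem([broken_unit, spare_unit])
print(system.call(41))  # served by the spare after the primary fails
print(system.active)    # index of the unit now in service
```

Unlike voting schemes, this approach needs a separate way to detect the fault before the spare can take over.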

Duplex Systems
- Execute a task twice
- Compare the results for discrepancies
- Execution can occur on separate hardware or sequentially on the same hardware
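
The compare step can be sketched as follows. The names `duplex`, `adder`, and `faulty_adder` are hypothetical, and the two executions are modeled as two function calls, which corresponds to the "sequentially on the same hardware" variant.

```python
def duplex(run_a, run_b, x):
    """Execute a task twice and compare; return (result, fault_detected)."""
    a, b = run_a(x), run_b(x)
    if a != b:
        return None, True  # discrepancy: a fault is detected but cannot be masked
    return a, False

def adder(x):
    return x + 1

def faulty_adder(x):
    return x + 2  # hypothetical faulty execution

print(duplex(adder, adder, 1))
print(duplex(adder, faulty_adder, 1))
```

With only two results there is no majority, so a mismatch reveals that a fault occurred without showing which execution was wrong.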

An example of a hardware fault tolerant system
- Stratus servers – fault tolerant hardware servers that use TMR and a fully replicated hardware design to provide fault tolerance
- http://www.stratus.com

Software Fault Tolerant Techniques
- Two main areas:
  - Provide for static redundancy
  - Provide for dynamic redundancy
- N-Version Programming
- Recovery Blocks or Primary-Backup technique

N-Version programming
- Run n versions of a program on n processes
- Forward recovery scheme that masks faults
- Relies on voting mechanisms
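
A toy illustration of the idea: three versions of the same specification (the sum of 1 through n), one of which contains a design fault. All function names are hypothetical, and the versions run sequentially in one process here instead of on n separate processes.

```python
from collections import Counter

def sum_loop(n):
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_formula(n):
    return n * (n + 1) // 2

def sum_buggy(n):
    return sum(range(n))  # hypothetical design fault: off by one, omits n

def n_version(versions, x):
    """Run every version on the same input and return the majority output."""
    outputs = [version(x) for version in versions]
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("versions disagree and no majority exists")
    return value

print(n_version([sum_loop, sum_formula, sum_buggy], 10))
```

The scheme only helps when the versions fail differently; if a majority shares the same design fault, the voter returns the wrong answer.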

Agreement problems
- Agreement problems arise when a processor is faulty and the non-faulty processors have to agree on a course of action
- Some agreement problems covered in my paper:
  - Byzantine Generals Protocol
  - Consensus Problem
  - Interactive Consistency

Application of agreement protocols
- Fault tolerant clock synchronization
  - Non-faulty processes must have clocks that are approximately equal in value
- Atomic commits
  - Process actions have certain characteristics that must be followed (indivisible, instantaneous, non-revealing state changes, etc.)
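
For the clock synchronization case, one well-known building block is fault tolerant averaging: each process collects clock readings from its peers, discards the extremes, and averages the rest. The sketch below is an assumption about how this could look, not necessarily the method described in the paper; the function name and the sample readings are hypothetical, and `f` is the assumed maximum number of faulty clocks.

```python
def fault_tolerant_average(readings, f):
    """Discard the f lowest and f highest readings, then average the rest."""
    if f < 0:
        raise ValueError("f must be non-negative")
    if len(readings) < 3 * f + 1:
        raise ValueError("need at least 3f + 1 readings to tolerate f faulty clocks")
    kept = sorted(readings)[f:len(readings) - f]
    return sum(kept) / len(kept)

# Three clocks roughly agree; the fourth is a hypothetical faulty clock.
readings = [100.0, 100.2, 99.9, 500.0]
print(fault_tolerant_average(readings, f=1))
```

Dropping the f extremes on each side means a faulty clock can only influence the result if its reading falls within the range of the non-faulty ones.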

Recovery Blocks
- Backward error recovery scheme
- Also known as the primary-backup approach
- Relies on acceptance tests
  - Check that the output is within an acceptable range
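
The structure of a recovery block can be sketched as below. The names `recovery_block`, `primary`, `backup`, and `is_ascending` are hypothetical; each alternate works on a copy of the saved state, so a rejected attempt is rolled back simply by discarding the copy.

```python
import copy

def recovery_block(state, alternates, acceptance_test):
    """Try each alternate on a copy of the state; return the first accepted result."""
    for alternate in alternates:
        working = copy.deepcopy(state)  # the original state is the recovery point
        try:
            candidate = alternate(working)
            if acceptance_test(candidate):
                return candidate
        except Exception:
            pass  # treat an exception like a failed acceptance test
        # Rejected: the working copy is dropped, which rolls back to the saved state.
    raise RuntimeError("all alternates failed the acceptance test")

def primary(data):
    data.sort(reverse=True)  # hypothetical fault: sorts in the wrong order
    return data

def backup(data):
    return sorted(data)

def is_ascending(data):
    return all(a <= b for a, b in zip(data, data[1:]))

state = [3, 1, 2]
print(recovery_block(state, [primary, backup], is_ascending))
print(state)  # the saved state is untouched by the failed primary
```

The quality of the result depends entirely on the acceptance test: a weak test lets a faulty primary through, and an overly strict one rejects valid output.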

Error Detection Techniques  Effectiveness of any fault tolerant system depends on the effectiveness of its error detection techniques  Early detection or late detection  Concept of acceptability determines the thoroughness of error detection on a distributed system

Error Detection Techniques  Replication Checks  Timing Checks  Structural Checks  Reasonableness Checks  Reversal checks

Conclusion
- There are many different ways to provide fault tolerance in a distributed system
- Sections not covered include error recovery and fault treatment
