A Survey of Fault Tolerance in Distributed Systems
By Szeying Tan
Fall 2002, CS 633
Introduction
Paper covers:
- Definitions of faults and failures
- Failure models and the elements of fault tolerance
- Hardware fault-tolerant techniques
- Software fault-tolerant techniques
Why Fault Tolerance?
- Mission-critical systems – fault tolerance is a requirement for ensuring reliability and availability
- High availability and reliability are especially important in distributed real-time systems
- Providing fault tolerance raises more complex issues in distributed systems than in single-processor systems
What do we do with faults?
- Error detection – find the error in the system
- Damage control and assessment – contain the damage and fix it
- Error recovery – return the system to an error-free state
- Fault treatment/continued service – attempt uninterrupted execution regardless of the fault
Failure Models
- Failstop
- Crash
- Crash+Link
- Receive Omission
- Send Omission
- General Omission
- Byzantine Failures
Types of Faults
- Permanent – remains in the system indefinitely until corrective action is taken
- Transient – disappears after a short period of time
- Intermittent – appears and disappears repeatedly
Elements of Fault Tolerance
- Redundancy – addition of information, resources, or time beyond what is needed for normal system operation
- Failure semantics – knowledge of the failure behaviors a system can exhibit
- Group failure masking – the group masks the failure of one member from the others in the group
Hardware Fault Tolerant Techniques
- Hardware redundancy – duplicate components to detect or tolerate faults
- Passive techniques – fault masking
- Active techniques – fault detection and removal
- Hybrid techniques – a combination of both
- Techniques listed on the next slide
Triple Modular Redundancy
- Execute a task three times
- Take a majority vote
- In a fault-free system, all three results are identical
- Does not work for Byzantine (arbitrary) failures
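The execute-three-times-and-vote idea can be sketched in a few lines of Python. This is only an illustration, not code from the paper: `tmr` and `make_flaky_square` are hypothetical names, the three runs happen sequentially in one process, the results are assumed to be hashable, and the voter itself is assumed to be fault-free.

```python
from collections import Counter


def tmr(task, *args):
    """Run `task` three times and return the majority result."""
    results = [task(*args) for _ in range(3)]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        # All three results differ, so there is nothing to vote on.
        raise RuntimeError("no majority: all three results differ")
    return value


def make_flaky_square(faulty_run):
    """Hypothetical task: squares x, but returns a wrong value on one run."""
    calls = {"n": 0}

    def square(x):
        calls["n"] += 1
        if calls["n"] == faulty_run:
            return -1  # injected fault
        return x * x

    return square


print(tmr(make_flaky_square(faulty_run=2), 7))  # 49
```

The single wrong result is outvoted by the two correct ones, which is the masking behavior the slide describes.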
N-Modular Redundancy
- Generalizes TMR from three modules to N modules
- Works similarly to TMR: errors are masked by a majority vote
- Masks symmetrical and asymmetrical failures
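As a worked statement of how much a majority vote can mask, assume the modules fail independently and the voter itself is fault-free. The symbol f (the number of faulty modules) is introduced here for illustration and does not appear in the original slides.

```latex
N \ge 2f + 1
\quad\Longleftrightarrow\quad
f \le \left\lfloor \frac{N - 1}{2} \right\rfloor
```

For example, N = 3 (TMR) masks one faulty module, and N = 5 masks two.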
Standby Sparing
- Keep spare (duplicate) components in the system
- A spare is activated when a fault is detected
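A minimal sketch of standby sparing, under the assumption that a fault is detected as a raised exception (real systems typically use hardware checks or heartbeats). The class and the two unit functions are hypothetical names chosen for the example.

```python
class StandbySparing:
    """Route calls to the active unit; switch to a spare when it fails."""

    def __init__(self, primary, spares):
        self.active = primary
        self.spares = list(spares)

    def call(self, *args):
        while True:
            try:
                return self.active(*args)
            except Exception:
                # Fault detected: activate the next spare, if any remain.
                if not self.spares:
                    raise RuntimeError("all spares exhausted")
                self.active = self.spares.pop(0)


def failed_unit(x):
    raise IOError("unit down")  # simulated permanent fault


def healthy_unit(x):
    return x * 2


system = StandbySparing(failed_unit, [healthy_unit])
print(system.call(21))  # 42
```

After the switch, later calls go straight to the spare, since the failed unit is no longer the active one.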
Duplex Systems
- Execute the task twice
- Compare the results for discrepancies
- Execution can occur on separate hardware or sequentially on the same hardware
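The compare-two-executions idea might look like the following sketch. The function name `duplex` and the stand-in units are assumptions for illustration; both executions run sequentially in the same process here.

```python
def duplex(unit_a, unit_b, *args):
    """Run the same task on two units and compare the results."""
    a = unit_a(*args)
    b = unit_b(*args)
    if a != b:
        # A discrepancy shows that a fault occurred, but not which unit is wrong.
        raise RuntimeError(f"duplex mismatch: {a!r} != {b!r}")
    return a


print(duplex(sum, sum, [1, 2, 3]))  # 6
```

Unlike a three-way vote, a duplex comparison can only detect the fault. Something else (a retry, a spare, a diagnostic) has to decide which result to trust.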
An example of a hardware fault tolerant system
- Stratus servers – fault-tolerant hardware servers that use TMR and a fully replicated hardware design to provide fault tolerance
Software Fault Tolerant Techniques
Two main areas:
- Provide for static redundancy: N-Version Programming
- Provide for dynamic redundancy: Recovery Blocks (Primary-Backup technique)
N-Version programming
- Run n versions of a program on n processes
- Forward recovery scheme that masks faults
- Relies on voting mechanisms
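A small sketch of voting across versions. It assumes, as N-version programming usually does, that the versions are written independently so they are unlikely to share the same bug. All function names are hypothetical, the versions run sequentially instead of on separate processes, and `total_v3` contains a deliberately injected defect.

```python
from collections import Counter


def total_v1(xs):
    return sum(xs)


def total_v2(xs):
    acc = 0
    for x in xs:
        acc += x
    return acc


def total_v3(xs):
    return sum(xs[1:])  # injected bug: skips the first element


def n_version(versions, *args):
    """Run every version on the same input and return the majority result."""
    results = [version(*args) for version in versions]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("no majority among versions")
    return value


print(n_version([total_v1, total_v2, total_v3], [4, 5, 6]))  # 15
```

The faulty version is outvoted and no rollback is needed, which is why the scheme counts as forward recovery.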
Agreement problems
- An agreement problem arises when a processor is faulty and the non-faulty processors have to agree on a course of action
- Some agreement problems covered in my paper:
  - Byzantine Generals Protocol
  - Consensus Problem
  - Interactive Consistency
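To make the Byzantine generals setting concrete, here is a simplified sketch of a single relay round with one commander, three lieutenants, and at most one traitor. It is an illustration under stated assumptions, not the protocol as presented in the paper: all names are hypothetical, orders are limited to "attack" or "retreat", and a traitorous lieutenant always relays the opposite of what it received (a real traitor may lie differently to each recipient).

```python
from collections import Counter


def flip(order):
    return "retreat" if order == "attack" else "attack"


def majority(values, default="retreat"):
    value, votes = Counter(values).most_common(1)[0]
    return value if votes > len(values) / 2 else default


def one_round_agreement(sent, traitor=None):
    """sent maps each lieutenant to the order the commander sent it.

    Every lieutenant relays its order to the others, then each loyal
    lieutenant decides by majority over everything it has heard.
    """
    decisions = {}
    for lt in sent:
        if lt == traitor:
            continue  # a traitor's own decision does not matter
        relayed = [
            flip(sent[other]) if other == traitor else sent[other]
            for other in sent
            if other != lt
        ]
        decisions[lt] = majority([sent[lt]] + relayed)
    return decisions


# Loyal commander, traitorous lieutenant L3.
print(one_round_agreement({"L1": "attack", "L2": "attack", "L3": "attack"}, traitor="L3"))
# Traitorous commander sends conflicting orders; all lieutenants are loyal.
print(one_round_agreement({"L1": "attack", "L2": "attack", "L3": "retreat"}))
```

In both runs the loyal lieutenants end up with the same decision, which is the agreement property the problem asks for.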
Application of agreement protocols
- Fault-tolerant clock synchronization – non-faulty processes must have clocks that are approximately equal in value
- Atomic commits – process actions have certain characteristics that must be followed (indivisible, instantaneous, non-revealing state changes, etc.)
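One common way to keep non-faulty clocks close together is a trimmed (fault-tolerant) average: discard the f lowest and f highest readings, then average the rest. The sketch below assumes that approach for illustration. The slides do not say which algorithm the paper uses, and the function name and sample readings are made up.

```python
def fault_tolerant_average(readings, f):
    """Average clock readings after discarding the f lowest and f highest."""
    if len(readings) <= 2 * f:
        raise ValueError("need more than 2f readings")
    trimmed = sorted(readings)[f:len(readings) - f]
    return sum(trimmed) / len(trimmed)


# Three roughly agreeing clocks and one wildly wrong (faulty) clock.
clocks = [100.0, 101.0, 99.0, 5000.0]
print(fault_tolerant_average(clocks, f=1))  # 100.5
```

Trimming bounds the influence of up to f faulty clocks, because any extreme value they report is discarded before averaging.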
Recovery Blocks
- Backward error recovery scheme
- Also known as the primary-backup approach
- Relies on acceptance tests
- An acceptance test checks that the output is within an acceptable range
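A minimal sketch of a recovery block: run the primary, apply the acceptance test, and on failure roll the state back and try the backup. Assumptions made for the example: the state is a plain dict, a crash counts as a failed attempt, and `primary` contains a deliberately injected bug. None of these names come from the paper.

```python
import copy


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in turn, rolling back state after a failed attempt."""
    checkpoint = copy.deepcopy(state)  # recovery point
    for alternate in alternates:
        try:
            alternate(state)
            if acceptance_test(state):
                return state
        except Exception:
            pass  # a crash counts as a failed attempt
        # Backward recovery: discard whatever the failed attempt changed.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("every alternate failed the acceptance test")


def primary(state):
    r = state["readings"]
    state["avg"] = sum(r) / (len(r) - 1)  # injected bug: wrong divisor


def backup(state):
    r = state["readings"]
    state["avg"] = sum(r) / len(r)


def in_range(state):
    # Acceptance test: an average must lie between the min and max reading.
    r = state["readings"]
    return min(r) <= state["avg"] <= max(r)


state = {"readings": [20.0, 22.0, 21.0], "avg": None}
print(recovery_block(state, [primary, backup], in_range)["avg"])  # 21.0
```

The acceptance test here is a range check, as on the slide. It does not prove the result is correct, only that it is plausible.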
Error Detection Techniques
- The effectiveness of any fault-tolerant system depends on the effectiveness of its error detection techniques
- Detection can be early or late
- The concept of acceptability determines how thorough error detection is in a distributed system
Error Detection Techniques
- Replication Checks
- Timing Checks
- Structural Checks
- Reasonableness Checks
- Reversal Checks
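Three of the checks listed above are easy to show in miniature. The helpers below are hypothetical illustrations, not definitions from the paper: the reversal check assumes the computation is a square root (so it can be undone by squaring), and the timing check only measures elapsed time after the task returns, without interrupting a task that hangs.

```python
import math
import time


def reversal_check(x, y, tol=1e-9):
    """y is claimed to be sqrt(x): reverse the computation and compare."""
    return math.isclose(y * y, x, rel_tol=tol)


def reasonableness_check(value, low, high):
    """Accept the value only if it lies in a plausible range."""
    return low <= value <= high


def timing_check(task, deadline_s, *args):
    """Run the task and report whether it finished within the deadline."""
    start = time.monotonic()
    result = task(*args)
    elapsed = time.monotonic() - start
    return result, elapsed <= deadline_s


print(reversal_check(2.0, math.sqrt(2.0)))    # True
print(reversal_check(2.0, 1.5))               # False
print(reasonableness_check(21.0, -40.0, 60.0))  # True
```

Replication checks correspond to comparing or voting across redundant executions, and structural checks validate invariants of a data structure. Neither is shown here.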
Conclusion
- There are many different ways to provide fault tolerance in a distributed system
- Topics not covered include error recovery and fault treatment