Selected Notes on Fault-Tolerance (12)
CSE 6510 (461), Fall 2010
Alexander A. Shvartsman, Computer.
Copyright © 2002-2010 Alexander Allister Shvartsman

Fault-Tolerance -- An Overview

A fundamental property of distributed systems:
  - potential for fault tolerance
The main tool in achieving fault tolerance is
  - redundancy
Distributed systems consist of multiple components:
  - When more than one resource is capable of performing a certain function, some fault tolerance is achievable
Goal
  - Take advantage of the multiplicity of resources in constructing systems that tolerate failures

Fault Tolerance and Dependability

A system specification may call for fault-tolerance
  - By stating that the system must perform correctly
  - Even if certain internal or external components fail to perform according to their specifications
  - Additionally, the degradation in performance due to failures must be “graceful”
Dependability is a closely related notion
  - Trustworthiness of a computer system, i.e.,
  - Reliance can justifiably be placed on the system’s service
  - Dependability is achieved in part through fault-tolerance

Faults, Errors and Failures

We distinguish among faults, errors and failures:
  - Fault (or defect): a component or a subsystem fails to perform according to its specification
  - Error: a computation enters an incorrect state as the result of a fault
  - Failure: a system fails to meet its specification as the result of an error
A fault may or may not lead to an error
An error may or may not lead to a failure

Fault-Tolerance -- Basic Approaches

Fault prevention:
  - eliminating faults
  - before the system is put into use or
  - during periodic preventive maintenance
Fault tolerance:
  - a system detects errors caused by faults,
  - corrects its state and
  - does not fail for as long as the faults and errors are within its design parameters
Fault masking:
  - a fault-tolerant system is capable of dealing with faults and errors
  - in a way that is transparent to the users of the system’s services
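
Fault masking can be illustrated with a small majority-voting sketch. This is only one possible technique, not something the slides prescribe: the helper `masked_call` and the replica functions `good` and `faulty` are hypothetical names, and the sketch assumes replicas return wrong values rather than crash. With three replicas, one wrong answer is outvoted and the caller never sees it; two disagreeing wrong answers exceed the design parameters and the call fails.

```python
from collections import Counter


def masked_call(replicas, x):
    """Run every replica on x and return the answer a strict majority agrees on.

    Hypothetical helper: with 2f + 1 replicas, up to f wrong answers are masked.
    """
    answers = [replica(x) for replica in replicas]
    value, votes = Counter(answers).most_common(1)[0]
    if votes <= len(replicas) // 2:
        # Too many faults: the error can no longer be masked.
        raise RuntimeError("no majority: faults exceed design parameters")
    return value


def good(x):
    return x * x


def faulty(x):
    # Illustrative value fault: always off by one.
    return x * x + 1


# One faulty replica out of three is masked; the caller sees the correct value.
result = masked_call([good, faulty, good], 7)
```

The caller of `masked_call` cannot tell that a replica misbehaved, which is the transparency the slide describes.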

Fault Classification

Crash fault
  - Fail-stop processor (detectable crash)
  - Failure after a send/receive
Omission fault
  - Communication, send or receive omission
  - Operation
Timing fault
  - Processor delays
  - Link time-out
Byzantine fault
  - Arbitrary fault
  - Malicious behavior

[Figure: fault classes in order of increasing severity: Crash, Omission, Timing, Byzantine]
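
The severity ordering can be written down as an ordered enumeration. The sketch below assumes the classes nest, so that a design tolerating a more severe class also covers the less severe ones; the names `FaultClass` and `tolerates` are hypothetical, and the check ignores how many faults occur.

```python
from enum import IntEnum


class FaultClass(IntEnum):
    """Fault classes, numbered in order of increasing severity."""
    CRASH = 1
    OMISSION = 2
    TIMING = 3
    BYZANTINE = 4


def tolerates(designed_for, observed):
    """Assumption: tolerating a class implies tolerating every less severe class."""
    return observed <= designed_for


byzantine_covers_crash = tolerates(FaultClass.BYZANTINE, FaultClass.CRASH)
crash_covers_timing = tolerates(FaultClass.CRASH, FaultClass.TIMING)
```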

Models of Processor Failures and Restarts

Fail-stop processors
Model assumptions, e.g.,
  - Shared memory
  - Robust interconnect
  - Resilient memory
  - Timing guarantees
Undetectable restarts
Detectable restarts
Synchronous restarts
No restarts
Initial faults

Fault Tolerance, Redundancy and Efficiency

Fault tolerance is achieved through redundancy
Redundancy in components/resources -- space redundancy:
  - additional components (hardware or software) are provided or made available to deal with errors
  - distributed systems have inherent redundancy
Redundancy in computation -- time redundancy:
  - additional computation is performed to detect errors or to test components
  - here the cost is performance
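
Time redundancy can be sketched as computing a result twice and accepting it only when both runs agree. The names `checked` and `flaky_square` are hypothetical, and the sketch assumes the computation is deterministic when no fault occurs and that a fault does not repeat identically in both runs.

```python
def checked(compute, x, attempts=3):
    """Time redundancy: run compute twice per attempt, accept only matching results."""
    for _ in range(attempts):
        first = compute(x)
        second = compute(x)
        if first == second:
            return first
    raise RuntimeError("results never agreed")


calls = {"n": 0}


def flaky_square(x):
    """Illustrative computation whose very first call returns a wrong value."""
    calls["n"] += 1
    return x * x + (1 if calls["n"] == 1 else 0)


# The first pair disagrees, the second pair agrees.
value = checked(flaky_square, 5)
```

Here the error is detected at the price of four calls instead of one, which is the performance cost the slide refers to.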

Combining Fault-Tolerance and Efficiency

A fundamental conflict exists between efficiency and fault tolerance:
  - Efficiency implies low redundancy
  - Fault tolerance implies high redundancy
Robustness
  - Property of a system that combines
  - Efficiency and
  - Fault-tolerance, e.g., correctness under failures
Achieving robustness is very challenging in many cases
  - Efficiency often must be traded off for fault tolerance

Strategies for Fault Tolerance

Layered architecture:
  - a structuring technique in achieving fault tolerance
A failure of a lower-level component may/will manifest itself as a fault to a higher layer
An error at a lower layer may be contained or masked
When this is not possible, the layer attempts
  - to reduce the severity of the error and
  - to manifest itself through a more benign failure
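
One way a layer can turn an error into a more benign failure is a checksum layer that drops corrupted frames, so the layer above sees an omission instead of bad data. This is a sketch under assumptions: the `send` and `receive` functions and the 4-byte CRC trailer are illustrative choices, not something the slides specify.

```python
import zlib


def send(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload as a 4-byte big-endian trailer."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def receive(frame: bytes):
    """Return the payload, or None if the frame fails its integrity check."""
    payload, tag = frame[:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != tag:
        # Corruption at this layer surfaces upward as an omission.
        return None
    return payload


intact = receive(send(b"hello"))

# Simulate a link fault that flips one bit of the frame in transit.
damaged_frame = bytearray(send(b"hello"))
damaged_frame[0] ^= 0x01
damaged = receive(bytes(damaged_frame))
```

The higher layer now only has to handle missing messages, for example by retransmission, rather than arbitrary corrupted content.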

Phases in Fault Tolerance

Fault prevention and fault tolerance are complementary:
  - both are needed for dependability
Fault tolerance and its “phases”
  - Error detection
      - Tests, checks and diagnostics
  - Damage confinement
      - Dynamic assessment of damage boundaries
      - Static firewalls
  - Progress evaluation and error recovery
      - Backward recovery, checkpointing, roll back
      - Forward recovery and self-stabilization
      - Processor scheduling and load balancing
  - Fault treatment and continued system service
      - Fault location
      - System repair
      - Dynamic reconfiguration
      - Standby spare components
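
Backward recovery with checkpointing and rollback can be sketched in a few lines. The class `Checkpointed`, the helper `run_step`, and the account example are hypothetical; the sketch assumes the whole state fits in memory and can be deep-copied.

```python
import copy


class Checkpointed:
    """Holds mutable state plus the last checkpoint known to be good."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._saved)


def run_step(system, step, is_valid):
    """Apply step; on an exception or an invalid state, roll back to the checkpoint."""
    try:
        step(system.state)
        if not is_valid(system.state):  # error detection
            raise ValueError("error detected")
    except Exception:
        system.rollback()  # backward recovery
        return False
    system.checkpoint()  # progress: commit the new good state
    return True


def non_negative(state):
    return state["balance"] >= 0


account = Checkpointed({"balance": 100})
ok = run_step(account, lambda s: s.update(balance=s["balance"] - 30), non_negative)
bad = run_step(account, lambda s: s.update(balance=s["balance"] - 500), non_negative)
```

After both steps the balance is 70: the first step is committed, and the second is detected as erroneous and undone.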

Faults: Causes and Temporal Effects

Faulty system -- a system with defects
  - Faulty requirements
  - Design faults
  - Hardware faults
  - Software... bugs (I don’t know who put it there)
  - Operational faults
Faults -- temporal taxonomy
  - Transient fault -- limited duration
  - Intermittent fault -- occurs repeatedly
  - Permanent fault -- manifests itself until fixed
Faults and fault masking
  - Is fault masking “good”?
  - If a system is capable of tolerating k faults, is masking 1 fault good? Masking k-1 faults?
  - Are faults “bad”?
  - Is a system containing faults necessarily defective?
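
The temporal taxonomy matters when choosing a recovery strategy. As a sketch, a bounded retry tolerates a transient fault but cannot help with a permanent one; the names `retry`, `transient` and `permanent` are hypothetical, faults are modelled as `OSError`, and `attempts` is assumed to be at least 1.

```python
def retry(operation, attempts=3):
    """Call operation up to `attempts` times, re-raising the last error if all fail."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except OSError as error:
            last_error = error
    raise last_error


attempts_seen = {"n": 0}


def transient():
    """Fails on the first two calls, then recovers on its own."""
    attempts_seen["n"] += 1
    if attempts_seen["n"] < 3:
        raise OSError("temporary glitch")
    return "ok"


def permanent():
    """Fails on every call until someone repairs the component."""
    raise OSError("broken until repaired")


outcome = retry(transient)
```

Calling `retry(permanent)` exhausts its attempts and raises, which signals that fault treatment (location and repair) is needed rather than more retries. An intermittent fault may or may not be covered, depending on how its recurrence lines up with the retry budget.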

Models of Failure: Overall Considerations

Models need to capture/abstract/approximate reality
Type of failures --
  - severity: fail-stop, malicious failures, memory contamination
Kind of failure-causing adversary --
  - omniscient or oblivious; on-line adaptive or off-line
Duration:
  - no-restart or restartable
Frequency of failures --
  - rate of processor attrition (one time, arbitrary, probabilistic)
Fine/coarse granularity of failures --
  - components: processors / gates, processor / thread failures
Magnitude of failures --
  - total number of failures (and recoveries) during computation

Designing for F/T: Evaluation Criteria

What is the cost of failure? Is it bearable?
How much is one willing to pay for fault tolerance?
  - Is slower response preferable to a failure?
  - Is higher HW cost acceptable?
  - Is lower HW cost acceptable as long as failures are masked?
What is the goal of building in some fault tolerance?
  - Elimination of (some) failures?
  - Reduction in the severity of failures?
  - Error detection?
When the failures are corrected,
  - Is a slower response time acceptable as long as the computation is correct?
  - Is a slight error acceptable as long as the computation completes within the required time?
