Page 1 Copyright © 2002-2010 Alexander Allister Shvartsman
CSE 6510 (461) Fall 2010
Selected Notes on Fault-Tolerance (12)
Alexander A. Shvartsman
Computer Science and Engineering, University of Connecticut

Fault-Tolerance -- An Overview
  A fundamental property of distributed systems: potential for fault tolerance
  The main tool in achieving fault tolerance is redundancy
  Distributed systems consist of multiple components:
    When more than one resource is capable of performing a certain function, some fault tolerance is achievable
  Goal
    Take advantage of the multiplicity of resources in constructing systems that tolerate failures

Fault Tolerance and Dependability
  A system specification may call for fault tolerance
    by stating that the system must perform correctly
    even if certain internal or external components fail to perform according to their specifications
    Additionally, the degradation in performance due to failures must be “graceful”
  Dependability is a closely related notion
    Trustworthiness of a computer system, i.e., reliance can justifiably be placed on the system’s service
  Dependability is achieved in part through fault tolerance

Faults, Errors and Failures
  We distinguish among faults, errors and failures:
    Fault (or defect): a component or a subsystem fails to perform according to its specification
    Error: a computation enters an incorrect state as the result of a fault
    Failure: a system fails to meet its specification as the result of an error
  A fault may or may not lead to an error
  An error may or may not lead to a failure
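
The distinction above can be illustrated with a toy model. This is only a sketch under stated assumptions: the "component" is a hypothetical memory cell with one bit stuck at 1 (the fault), the "state" is the stored integer, and the "specification" is that the service returns the low 4 bits of the value written. The names `store`, `service` and `classify` are invented for this illustration and do not come from the slides.

```python
from typing import Optional


def store(value: int, stuck_bit: Optional[int] = None) -> int:
    """Hypothetical memory cell. If stuck_bit is set, that bit always reads 1 (the fault)."""
    if stuck_bit is None:
        return value
    return value | (1 << stuck_bit)


def service(stored: int) -> int:
    """Assumed specification: the service exposes only the low 4 bits of the state."""
    return stored & 0x0F


def classify(value: int, stuck_bit: int) -> str:
    """Report how far a stuck-at-1 fault propagates for a given written value."""
    stored = store(value, stuck_bit)
    if stored == value:
        return "fault only"          # the fault is present but the state is correct
    if service(stored) == service(value):
        return "error, no failure"   # the state is wrong but the service output is correct
    return "failure"                 # the wrong state is visible in the service output


if __name__ == "__main__":
    for value, bit in [(16, 4), (3, 4), (2, 0)]:
        print(value, bit, classify(value, bit))
```

The same fault (a stuck bit) stays dormant, corrupts hidden state, or breaks the service depending on what is stored and on what the specification exposes.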

Fault-Tolerance -- Basic Approaches
  Fault prevention: eliminating faults before the system is put into use or during periodic preventive maintenance
  Fault tolerance: a system detects errors caused by faults, corrects its state, and does not fail for as long as the faults and errors are within its design parameters
  Fault masking: a fault-tolerant system is capable of dealing with faults and errors in a way that is transparent to the users of the system’s services

Fault Classification
  Crash fault
    Fail-stop processor (detectable crash)
    Failure after a send/receive
  Omission fault
    Communication, send or receive omission
    Operation
  Timing fault
    Processor delays
    Link time-out
  Byzantine fault
    Arbitrary fault
    Malicious behavior
  Increasing severity: Crash, Omission, Timing, Byzantine
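
The severity ordering can be written down as an ordered enumeration. This is a minimal sketch, assuming the four classes are nested in the order listed, so that a design targeting a more severe class also covers the less severe ones; it ignores how many faults occur. The names `FaultClass` and `covers` are hypothetical.

```python
from enum import IntEnum


class FaultClass(IntEnum):
    """Fault classes in increasing order of severity, as listed on the slide."""
    CRASH = 1
    OMISSION = 2
    TIMING = 3
    BYZANTINE = 4


def covers(designed_for: FaultClass, observed: FaultClass) -> bool:
    """Assumption: tolerating a class also tolerates every less severe class."""
    return observed <= designed_for


if __name__ == "__main__":
    print(covers(FaultClass.BYZANTINE, FaultClass.CRASH))   # True
    print(covers(FaultClass.CRASH, FaultClass.OMISSION))    # False
```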

Models of Processor Failures and Restarts
  Fail-stop processors
  Model assumptions, e.g.,
    Shared memory
    Robust interconnect
    Resilient memory
    Timing guarantees
  Undetectable restarts
  Detectable restarts
  Synchronous restarts
  No restarts
  Initial faults

Fault Tolerance, Redundancy and Efficiency
  Fault tolerance is achieved through redundancy
  Redundancy in components/resources, or space redundancy:
    additional components (hardware or software) are provided or made available to deal with errors
    distributed systems inherently have redundancy
  Redundancy in computation, or time redundancy:
    additional computation is performed to detect errors or to test components
    here the cost is performance
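
The two kinds of redundancy can be sketched in a few lines. This is an illustration under assumptions, not a technique taken from the slides: space redundancy is shown as majority voting over replicated components, and time redundancy as repeating a computation and comparing the results. The function names `vote` and `compute_twice` are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def vote(replicas: Sequence[Callable[[int], int]], x: int) -> int:
    """Space redundancy: run every replica and return the strict-majority result."""
    results = [replica(x) for replica in replicas]
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority among replicas")
    return value


def compute_twice(f: Callable[[int], int], x: int) -> int:
    """Time redundancy: repeat the computation; a mismatch signals an error."""
    first, second = f(x), f(x)
    if first != second:
        raise RuntimeError("results disagree: error detected")
    return first


if __name__ == "__main__":
    good = lambda x: x * x
    faulty = lambda x: x * x + 1               # a replica with a fault
    print(vote([good, faulty, good], 3))       # the faulty replica is outvoted
    print(compute_twice(good, 3))              # costs two evaluations instead of one
```

Voting pays in extra components and can mask a minority of faulty replicas. Recomputation pays in running time and, in this form, only detects a disagreement without correcting it.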

Combining Fault-Tolerance and Efficiency
  A fundamental conflict exists between efficiency and fault tolerance:
    Efficiency implies low redundancy
    Fault tolerance implies high redundancy
  Robustness
    Property of a system that combines efficiency and fault tolerance, e.g., correctness under failures
  Achieving robustness is very challenging in many cases
  Efficiency often must be traded off for fault tolerance

Strategies for Fault Tolerance
  Layered architecture: a structuring technique for achieving fault tolerance
  A failure of a lower-layer component may/will manifest itself as a fault to a higher layer
  An error at a lower layer may be contained or masked
  When this is not possible, the layer attempts to reduce the severity of the error so that it manifests itself through a more benign failure
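
A layer that masks lower-layer failures where it can, and otherwise fails in a more benign way, might look like the sketch below. The scenario is assumed for illustration: a storage layer returns blocks that may be corrupted, and the layer above checks a CRC-32 checksum, falls back to another replica, and raises a clean exception rather than returning corrupted data. `BlockUnavailable` and `read_block` are hypothetical names.

```python
import zlib
from typing import List, Tuple


class BlockUnavailable(Exception):
    """Benign failure: the layer reports that it cannot serve the block."""


def read_block(replicas: List[Tuple[bytes, int]]) -> bytes:
    """Return the first replica whose data matches its recorded CRC-32.

    Each replica is a (data, expected_crc) pair from the lower layer.
    A corrupted replica is a lower-layer failure, seen here as a fault.
    """
    for data, expected_crc in replicas:
        if zlib.crc32(data) == expected_crc:
            return data              # fault masked: the caller never sees it
    # Masking is impossible, so fail detectably instead of returning bad data.
    raise BlockUnavailable("all replicas failed the checksum")


if __name__ == "__main__":
    good = b"payload"
    crc = zlib.crc32(good)
    print(read_block([(b"pay1oad", crc), (good, crc)]))
```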

Layer Architecture for Fault-Tolerance
  [Figure: layers N-1, N and N+1; within each layer a fault leads to an error and then to a failure, and the failure of a layer appears as a fault to the layer above]

Phases in Fault Tolerance
  Fault prevention and fault tolerance are complementary: both are needed for dependability
  Fault tolerance and its “phases”
    Error detection
      Tests, checks and diagnostics
    Damage confinement
      Dynamic assessment of damage boundaries
      Static firewalls
    Progress evaluation and error recovery
      Backward recovery, checkpointing, roll back
      Forward recovery and self-stabilization
      Processor scheduling and load balancing
    Fault treatment and continued system service
      Fault location
      System repair
      Dynamic reconfiguration
      Standby spare components
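
Backward recovery with checkpointing and rollback, one of the recovery options listed above, can be sketched as follows. This is a minimal single-process illustration under assumptions: the state is an in-memory dictionary, error detection is a caller-supplied validity check, and the names `CheckpointedState` and `apply_with_recovery` are hypothetical.

```python
import copy
from typing import Any, Callable, Dict

State = Dict[str, Any]


class CheckpointedState:
    """Holds a mutable state together with one saved checkpoint of it."""

    def __init__(self, state: State) -> None:
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self) -> None:
        """Save a deep copy of the current state."""
        self._saved = copy.deepcopy(self.state)

    def rollback(self) -> None:
        """Backward recovery: restore the last checkpoint."""
        self.state = copy.deepcopy(self._saved)


def apply_with_recovery(
    store: CheckpointedState,
    step: Callable[[State], None],
    is_valid: Callable[[State], bool],
) -> bool:
    """Checkpoint, run one step, detect errors, and roll back if needed."""
    store.checkpoint()
    step(store.state)
    if not is_valid(store.state):   # error detection
        store.rollback()            # error recovery
        return False
    return True


if __name__ == "__main__":
    account = CheckpointedState({"balance": 10})
    non_negative = lambda s: s["balance"] >= 0

    def withdraw_20(s: State) -> None:
        s["balance"] -= 20

    print(apply_with_recovery(account, withdraw_20, non_negative), account.state)
```

Forward recovery would instead repair the erroneous state into some acceptable state without returning to a saved one.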

Faults: Causes and Temporal Effects
  Faulty system -- a system with defects
    Faulty requirements
    Design faults
    Hardware faults
    Software... bugs (I don’t know who put it there)
    Operational faults
  Faults -- temporal taxonomy
    Transient fault -- has limited duration
    Intermittent fault -- occurs repeatedly
    Permanent fault -- manifests itself until fixed
  Faults and fault masking
    Is fault masking “good”?
    If a system is capable of tolerating k faults, is masking 1 fault good? Masking k-1 faults?
    Are faults “bad”?
    Is a system containing faults necessarily defective?

Models of Failure: Overall Considerations
  Models need to capture/abstract/approximate reality
  Type of failures -- severity: fail-stop, malicious failures, memory contamination
  Kind of failure-causing adversary -- omniscient or oblivious; on-line adaptive or off-line
  Duration: no-restart or restartable
  Frequency of failures -- rate of processor attrition (one time, arbitrary, probabilistic)
  Fine/coarse granularity of failures -- components: processors / gates, processor / thread failures
  Magnitude of failures -- total number of failures (and recoveries) during computation

Designing for F/T: Evaluation Criteria
  What is the cost of failure? Is it bearable?
  How much is one willing to pay for fault tolerance?
    Is slower response preferable to a failure?
    Is higher HW cost acceptable?
    Is lower HW cost acceptable as long as failures are masked?
  What is the goal of building in some fault tolerance?
    Elimination of (some) failures?
    Reduction in the severity of failures?
    Error detection?
  When the failures are corrected,
    Is a slower response time acceptable as long as the computation is correct?
    Is a slight error acceptable as long as the computation completes within the required time?