Page 1 Copyright © 2002-2010 Alexander Allister Shvartsman CSE 6510 (461) Fall 2010 Selected Notes on Fault-Tolerance (12) Alexander A. Shvartsman Computer.

Page 1
Copyright © Alexander Allister Shvartsman
CSE 6510 (461), Fall 2010
Selected Notes on Fault-Tolerance (12)
Alexander A. Shvartsman
Computer Science and Engineering, University of Connecticut

Page 2: Fault-Tolerance -- An Overview
A fundamental property of distributed systems:
  - potential for fault tolerance
The main tool in achieving fault tolerance is
  - redundancy
Distributed systems consist of multiple components:
  - When more than one resource is capable of performing a certain function, some fault tolerance is achievable
Goal
  - Take advantage of the multiplicity of resources in constructing systems that tolerate failures

Page 3: Fault Tolerance and Dependability
A system specification may call for fault-tolerance
  - By stating that the system must perform correctly
  - Even if certain internal or external components fail to perform according to their specifications
  - Additionally, the degradation in performance due to failures must be “graceful”
Dependability is a closely related notion
  - Trustworthiness of a computer system, i.e.,
  - Reliance can justifiably be placed on the system’s service
  - Dependability is achieved in part through fault-tolerance

Page 4: Faults, Errors and Failures
We distinguish among faults, errors and failures:
  - Fault (or defect): a component or a subsystem fails to perform according to its specification
  - Error: a computation enters an incorrect state as the result of a fault
  - Failure: a system fails to meet its specification as the result of an error
Faults may or may not lead to an error
Errors may or may not lead to a failure

Page 5: Fault-Tolerance -- Basic Approaches
Fault prevention:
  - eliminating faults
  - before the system is put into use or
  - during periodic preventive maintenance
Fault tolerance:
  - a system detects errors caused by faults,
  - corrects its state and
  - does not fail for as long as the faults and errors are within its design parameters
Fault masking:
  - a fault-tolerant system is capable of dealing with faults and errors
  - in a way that is transparent to the users of the system’s services

Page 6: Fault Classification
Crash fault
  - Fail-stop processor (detectable crash)
  - Failure after a send/receive
Omission fault
  - Communication, send or receive omission
  - Operation
Timing fault
  - Processor delays
  - Link time-out
Byzantine fault
  - Arbitrary fault
  - Malicious behavior
[Figure: Crash, Omission, Timing, Byzantine, in order of increased severity]
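The severity ordering on this slide can be written down as an ordered enumeration. The following is a minimal sketch, not part of the original notes: the names `FaultClass` and `covers` are hypothetical, and the rule that a design built for one class also covers every less severe class is the usual reading of this hierarchy, assumed here rather than stated on the slide.

```python
from enum import IntEnum


class FaultClass(IntEnum):
    """Fault classes from the slide, numbered by increasing severity."""
    CRASH = 1
    OMISSION = 2
    TIMING = 3
    BYZANTINE = 4


def covers(designed_for: FaultClass, observed: FaultClass) -> bool:
    """Assumption: tolerating a class implies tolerating all less severe ones."""
    return observed <= designed_for


# A design aimed at timing faults is assumed to cover crashes as well,
# but not arbitrary (Byzantine) behavior.
print(covers(FaultClass.TIMING, FaultClass.CRASH))
print(covers(FaultClass.TIMING, FaultClass.BYZANTINE))
```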

Page 7: Models of Processor Failures and Restarts
Fail-stop processors
Model assumptions, e.g.,
  - Shared memory
  - Robust interconnect
  - Resilient memory
  - Timing guarantees
Undetectable restarts
Detectable restarts
Synchronous restarts
No restarts
Initial faults

Page 8: Fault Tolerance, Redundancy and Efficiency
Fault tolerance is achieved through redundancy
Redundancy in components/resources -- space redundancy:
  - additional components (hardware or software) are provided or made available to deal with errors
  - distributed systems are inherently redundant
Redundancy in computation, or time redundancy:
  - additional computation is performed to detect errors or to test components
  - here the cost is performance
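The two kinds of redundancy can be contrasted in a few lines. This is an illustrative sketch under stated assumptions, not a method from the notes: `vote` and `run_twice` are hypothetical helper names, replica results are assumed to be hashable values, and a strict majority is assumed as the agreement rule.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def vote(results: Sequence[T]) -> T:
    """Space redundancy: return the value reported by a strict majority of replicas."""
    if not results:
        raise ValueError("no replica results to vote on")
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among replicas")
    return value


def run_twice(compute: Callable[[], T]) -> T:
    """Time redundancy: repeat the computation and compare results to detect an error."""
    first, second = compute(), compute()
    if first != second:
        raise RuntimeError("results disagree: error detected")
    return first
```

With three replicas, `vote([7, 7, 9])` masks the one deviating result, at the price of extra components. `run_twice` needs no extra components but doubles the running time, and it only detects a disagreement; it cannot tell which run was wrong.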

Page 9: Combining Fault-Tolerance and Efficiency
A fundamental conflict exists between efficiency and fault tolerance:
  - Efficiency implies low redundancy
  - Fault tolerance implies high redundancy
Robustness
  - Property of a system that combines
  - Efficiency and
  - Fault-tolerance, e.g., correctness under failures
Achieving robustness is very challenging in many cases
  - Efficiency often must be traded off for fault tolerance

Page 10: Strategies for Fault Tolerance
Layered architecture:
  - a structuring technique for achieving fault tolerance
A failure of a lower-level component may/will manifest itself as a fault to a higher layer
An error at a lower layer may be contained or masked
When this is not possible, the layer attempts
  - to reduce the severity of the error and
  - to manifest itself through a more benign failure

Page 11: Layer Architecture for Fault-Tolerance
[Figure: Layer N-1, Layer N and Layer N+1; within a layer a fault leads to an error and then to a failure, and the failure of one layer appears as a fault in the layer above]
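The layered strategy can be sketched in code. This is an illustration only, with hypothetical names throughout (`LowerLayerFailure`, `BenignFailure`, `read_block`), and it assumes that layer N has several layer N-1 replicas to call. A failure of layer N-1 arrives at layer N as a fault; layer N masks it if any replica answers and otherwise reports a more benign, clearly detectable failure to layer N+1.

```python
class LowerLayerFailure(Exception):
    """Failure of layer N-1, which layer N experiences as a fault."""


class BenignFailure(Exception):
    """Less severe, detectable failure that layer N reports to layer N+1."""


def read_block(replicas, key):
    """Layer N: try each layer N-1 replica and mask the fault if one answers."""
    for replica in replicas:
        try:
            return replica(key)      # call into layer N-1
        except LowerLayerFailure:
            continue                 # error contained; try the next replica
    # Masking was not possible: fail in a clean, detectable way instead.
    raise BenignFailure(f"block {key!r} unavailable")
```

The caller in layer N+1 only ever sees a correct value or a `BenignFailure`, never the raw lower-layer failure.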

Page 12: Phases in Fault Tolerance
Fault prevention and fault tolerance are complementary:
  - both are needed for dependability
Fault tolerance and its “phases”
  - Error detection
      Tests, checks and diagnostics
  - Damage confinement
      Dynamic assessment of damage boundaries
      Static firewalls
  - Progress evaluation and error recovery
      Backward recovery, checkpointing, roll back
      Forward recovery and self-stabilization
      Processor scheduling and load balancing
  - Fault treatment and continued system service
      Fault location
      System repair
      Dynamic reconfiguration
      Standby spare components
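Error detection followed by backward recovery can be sketched as follows. This is a minimal single-process illustration, not the scheme used in the notes: the `Checkpointed` class and `run_step` helper are hypothetical, the state is assumed to be copyable with `copy.deepcopy`, and the check function stands in for whatever tests or diagnostics detect the error.

```python
import copy


class Checkpointed:
    """Holds a state together with one saved checkpoint of it."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        """Save a copy of the current state."""
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        """Backward recovery: restore the most recent checkpoint."""
        self.state = copy.deepcopy(self._saved)


def run_step(proc, step, check):
    """Checkpoint, apply step, and roll back if check detects an error.

    Returns True when the step's result passed the check.
    """
    proc.checkpoint()
    try:
        step(proc.state)
        ok = check(proc.state)
    except Exception:
        ok = False               # an exception also counts as a detected error
    if not ok:
        proc.rollback()
    return ok
```

A step that leaves the state failing its check is undone, so the computation can be retried or continued from the last state known to be good.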

Page 13: Faults: Causes and Temporal Effects
Faulty system -- a system with defects
  - Faulty requirements
  - Design faults
  - Hardware faults
  - Software... bugs (I don’t know who put it there)
  - Operational faults
Faults -- temporal taxonomy
  - Transient fault -- limited duration
  - Intermittent fault -- occurs repeatedly
  - Permanent fault -- manifests itself until fixed
Faults and fault masking
  - Is fault masking “good”?
  - If a system is capable of tolerating k faults, is masking 1 fault good? Masking k-1 faults?
  - Are faults “bad”?
  - Is a system containing faults necessarily defective?

Page 14: Models of Failure: Overall Considerations
Models need to capture/abstract/approximate reality
Type of failures --
  - severity: fail-stop, malicious failures, memory contamination
Kind of failure-causing adversary --
  - omniscient or oblivious; on-line adaptive or off-line
Duration --
  - no-restart, restartable
Frequency of failures --
  - rate of processor attrition (one time, arbitrary, probabilistic)
Fine/coarse granularity of failures --
  - components: processors / gates, processor / thread failures
Magnitude of failures --
  - total number of failures (and recoveries) during computation

Page 15: Designing for F/T: Evaluation Criteria
What is the cost of failure? Is it bearable?
How much is one willing to pay for fault tolerance?
  - Is slower response preferable to a failure?
  - Is higher HW cost acceptable?
  - Is lower HW cost acceptable as long as failures are masked?
What is the goal of building in some fault tolerance?
  - Elimination of (some) failures?
  - Reduction in the severity of failures?
  - Error detection?
When the failures are corrected,
  - Is a slower response time acceptable as long as the computation is correct?
  - Is a slight error acceptable as long as the computation completes within the required time?