Faults and fault-tolerance One of the selling points of a distributed system is that the system will continue to perform even if some components / processes.

Slides:



Advertisements
Similar presentations
Fault Tolerance. Basic System Concept Basic Definitions Failure: deviation of a system from behaviour described in its specification. Error: part of.
Advertisements

Chapter 8 Fault Tolerance
Chapter 6 - Convergence in the Presence of Faults1-1 Chapter 6 Self-Stabilization Self-Stabilization Shlomi Dolev MIT Press, 2000 Shlomi Dolev, All Rights.
PROTOCOL VERIFICATION & PROTOCOL VALIDATION. Protocol Verification Communication Protocols should be checked for correctness, robustness and performance,
Copyright 2007 Koren & Krishna, Morgan-Kaufman Part.1.1 FAULT TOLERANT SYSTEMS Part 1 - Introduction.
Failure detector The story goes back to the FLP’85 impossibility result about consensus in presence of crash failures. If crash can be detected, then consensus.
Byzantine Generals Problem: Solution using signed messages.
Faults and fault-tolerance One of the selling points of a distributed system is that the system will continue to perform even if some components / processes.
CS 582 / CMPE 481 Distributed Systems Fault Tolerance.
Distributed systems Module 2 -Distributed algorithms Teaching unit 1 – Basic techniques Ernesto Damiani University of Bozen Lesson 3 – Distributed Systems.
CS-550 (M.Soneru): Recovery [SaS] 1 Recovery. CS-550 (M.Soneru): Recovery [SaS] 2 Recovery Computer system recovery: –Restore the system to a normal operational.
2/23/2009CS50901 Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial Fred B. Schneider Presenter: Aly Farahat.
1 Principles of Reliable Distributed Systems Lecture 5: Failure Models, Fault-Tolerant Broadcasts and State-Machine Replication Spring 2005 Dr. Idit Keidar.
Last Class: Weak Consistency
CS 603 Communication and Distributed Systems April 15, 2002.
Page 1 Copyright © Alexander Allister Shvartsman CSE 6510 (461) Fall 2010 Selected Notes on Fault-Tolerance (12) Alexander A. Shvartsman Computer.
Composition Model and its code. bound:=bound+1.
R R R Fault Tolerant Computing. R R R Acknowledgements The following lectures are based on materials from the following sources; –S. Kulkarni –J. Rushby.
Reliability and Fault Tolerance Setha Pan-ngum. Introduction From the survey by American Society for Quality Control [1]. Ten most important product attributes.
Faults and fault-tolerance
Distributed Algorithms – 2g1513 Lecture 9 – by Ali Ghodsi Fault-Tolerance in Distributed Systems.
Chapter 19 Recovery and Fault Tolerance Copyright © 2008.
PMIT-6102 Advanced Database Systems By- Jesmin Akhter Assistant Professor, IIT, Jahangirnagar University.
Review for Exam 2. Topics included Deadlock detection Resource and communication deadlock Graph algorithms: Routing, spanning tree, MST, leader election.
Distributed System Models (Fundamental Model). Architectural Model Goal Reliability Manageability Adaptability Cost-effectiveness Service Layers Platform.
Replicated State Machines ITV Model-based Analysis and Design of Embedded Software Techniques and methods for Critical Software Anders P. Ravn Aalborg.
Part.1.1 In The Name of GOD Welcome to Babol (Nooshirvani) University of Technology Electrical & Computer Engineering Department.
1 CSE 8343 Presentation # 2 Fault Tolerance in Distributed Systems By Sajida Begum Samina F Choudhry.
Secure Systems Research Group - FAU 1 Active Replication Pattern Ingrid Buckley Dept. of Computer Science and Engineering Florida Atlantic University Boca.
CprE 545Iowa State University CprE 558: Real-Time Systems Lectures 15-16: Dependability Concepts & Faul-Tolerance.
Fault-Tolerant Parallel and Distributed Computing for Software Engineering Undergraduates Ali Ebnenasir and Jean Mayo {aebnenas, Department.
Faults and fault-tolerance
Agenda Fail Stop Processors –Problem Definition –Implementation with reliable stable storage –Implementation without reliable stable storage Failure Detection.
Hwajung Lee. One of the selling points of a distributed system is that the system will continue to perform even if some components / processes fail.
1 Fault-Tolerant Computing Systems #1 Introduction Pattara Leelaprute Computer Engineering Department Kasetsart University
Hwajung Lee. One of the selling points of a distributed system is that the system will continue to perform even if some components / processes fail.
CS 542: Topics in Distributed Systems Self-Stabilization.
1 Fault tolerance in distributed systems n Motivation n robust and stabilizing algorithms n failure models n robust algorithms u decision problems u impossibility.
Introduction to Fault Tolerance By Sahithi Podila.
A Survey of Fault Tolerance in Distributed Systems By Szeying Tan Fall 2002 CS 633.
Fault Tolerance Chapter 7. Goal An important goal in distributed systems design is to construct the system in such a way that it can automatically recover.
COP 5611 Operating Systems Spring 2010 Dan C. Marinescu Office: HEC 439 B Office hours: M-Wd 1:00-2:00 PM.
SENG521 (Fall SENG 521 Software Reliability & Testing Fault Tolerant Software Systems: Techniques (Part 4a) Department of Electrical.
Classifying fault-tolerance Masking tolerance. Application runs as it is. The failure does not have a visible impact. All properties (both liveness & safety)
Fail-Stop Processors UNIVERSITY of WISCONSIN-MADISON Computer Sciences Department CS 739 Distributed Systems Andrea C. Arpaci-Dusseau One paper: Byzantine.
Fundamentals of Fault-Tolerant Distributed Computing In Asynchronous Environments Paper by Felix C. Gartner Graeme Coakley COEN 317 November 23, 2003.
Chapter 8 Fault Tolerance. Outline Introductions –Concepts –Failure models –Redundancy Process resilience –Groups and failure masking –Distributed agreement.
Operating System Reliability Andy Wang COP 5611 Advanced Operating Systems.
The Consensus Problem in Fault Tolerant Computing
Faults and fault-tolerance
When Is Agreement Possible
Faults and fault-tolerance
Fault Tolerance & Reliability CDA 5140 Spring 2006
Operating System Reliability
Operating System Reliability
Fault Tolerance In Operating System
Software Reliability: 2 Alternate Definitions
Faults and fault-tolerance
Agreement Protocols CS60002: Distributed Systems
Faults and fault-tolerance
Reliability and Fault Tolerance
Operating System Reliability
Fault Tolerance Distributed Web-based Systems
Operating System Reliability
Faults and fault-tolerance
Introduction to Fault Tolerance
Operating System Reliability
Abstractions for Fault Tolerance
Operating System Reliability
Operating System Reliability
Presentation transcript:

Faults and fault-tolerance One of the selling points of a distributed system is that the system will continue to perform even if some components / processes fail.

Cause and effect Study what causes what. We view the effect of failures at our level of abstraction, and then try to mask it, or recover from it. Be familiar with the terms MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair)

Classification of failures Crash failure Omission failure Transient failure Byzantine failure Software failure Temporal failure Security failure

Crash failures Crash failure is irreversible. In synchronous system, it is easy to detect crash failure (using heartbeat signals and timeout), but in asynchronous systems, it is never accurate, since it is not possible to distinguish between a process that has crashed, and a process that is running very slowly?. Some failures may be complex and nasty. Fail-stop failure is an simple abstraction that mimics crash failure when program execution becomes arbitrary. Implementations help detect which processor has failed. If a system cannot tolerate fail-stop failure, then it cannot tolerate crash.

Omission failures Message lost in transit. May happen due to various causes, like –Transmitter malfunction –Buffer overflow –Collisions at the MAC layer –Receiver out of range

Transient failure (Hardware) Arbitrary perturbation of the global state. May be induced by power surge, weak batteries, lightning, radio- frequency interferences etc. (Software) Heisenbugs, are a class of temporary internal faults and are intermittent. They are essentially permanent faults whose conditions of activation occur rarely or are not easily reproducible, so they are harder to detect during the testing phase. Over 99% of bugs in IBM DB2 production code are non- deterministic and transient

Byzantine failure Anything goes! Includes every conceivable form of erroneous behavior. Numerous possible causes. Includes malicious behaviors (like a process executing a different program instead of the specified one) too. Most difficult kind of failure to deal with.

Software failures Coding error or human error Design flaws Memory leak Incomplete specification (example Y2K) Many failures (like crash, omission etc) can be caused by software bugs too.

Specification of faulty behavior program example1; define x : boolean ( initially x = true); { a, b are messages); do {S}: x  send a {specified action}  {F}: true  send b {faulty action} od a a a a b a a a b b a a a a a a a … (Most faulty behaviors can be modeled as a fault action F over the normal action S. This is for specification purposes only)

Fault-tolerance F-intolerant vs F-tolerant systems Four types of tolerance: - Masking - Non-masking - Fail-safe - Graceful degradation faults tolerances A system that tolerates failure of type F

Fault-tolerance P is the invariant of the original fault-free system Q represents the worst possible behavior of the system when failures occur. It is called the fault span. Q is closed under S or F. P Q

Fault-tolerance Masking tolerance: P = Q (neither safety nor liveness is violated) Non-masking tolerance: P  Q (safety property may be temporarily violated, but not liveness). Eventually safety property is restored P Q