Faults and fault-tolerance

Presentation transcript:

One of the selling points of a distributed system is that the system will continue to perform even if some components or processes fail.

Cause and effect
- Study what causes what.
- We view the effect of failures at our level of abstraction, and then try to mask it or recover from it.
- Be familiar with the terms MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair).
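The two terms above are commonly combined into a steady-state availability figure. The sketch below is not from the slides. It assumes the common convention that MTBF measures the average up-time between repairs, so that availability = MTBF / (MTBF + MTTR). The function name `availability` is hypothetical.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the system is up, assuming MTBF counts up-time only."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # A component that runs 999 hours on average and takes 1 hour to repair.
    print(f"{availability(999, 1):.3f}")
```

Under this convention, shortening the repair time improves availability just as lengthening the time between failures does.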

Classification of failures
- Omission failure
- Crash failure
- Software failure
- Transient failure
- Temporal failure
- Security failure
- Byzantine failure

Crash failures
- A crash failure is irreversible.
- How can we distinguish between a process that has crashed and a process that is running very slowly? In a synchronous system it is easy to detect a crash failure (using heartbeat signals and timeouts), but in an asynchronous system detection is never accurate.
- Some failures may be complex and nasty. Arbitrary deviation from program execution is a form of failure that may not be as “nice” as a crash.
- Fail-stop failure is a simple abstraction that mimics crash failure when program execution becomes arbitrary. Such implementations help detect which processor has failed.
- If a system cannot tolerate fail-stop failure, then it cannot tolerate crash.
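The heartbeat-and-timeout idea mentioned above can be sketched as follows. This is a minimal illustration, not the slides' own code. The class name `HeartbeatDetector`, its methods, and the explicit `now` argument (used instead of a real clock so the example is deterministic) are all assumptions.

```python
class HeartbeatDetector:
    """Suspects a process if no heartbeat arrived within `timeout` time units."""

    def __init__(self, timeout: float):
        self.timeout = timeout
        self.last_seen = {}  # process id -> time of most recent heartbeat

    def heartbeat(self, pid: str, now: float) -> None:
        # Record that `pid` was alive at time `now`.
        self.last_seen[pid] = now

    def suspects(self, now: float) -> set:
        # A process is suspected when its last heartbeat is older than the timeout.
        return {pid for pid, seen in self.last_seen.items()
                if now - seen > self.timeout}


if __name__ == "__main__":
    d = HeartbeatDetector(timeout=3.0)
    d.heartbeat("p1", 0.0)
    d.heartbeat("p2", 0.0)
    d.heartbeat("p1", 2.0)
    print(d.suspects(4.0))   # p2 has been silent for longer than the timeout
    d.heartbeat("p2", 4.5)   # p2 was only slow, not crashed
    print(d.suspects(5.0))
```

In the usage above, p2 is suspected and later turns out to be merely slow. With a known bound on message delay (the synchronous case) the timeout can be chosen so this never happens. Without such a bound, any timeout can produce false suspicions.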

Omission failures
A message is lost in transit. This may happen due to various causes, such as:
- Transmitter malfunction
- Buffer overflow
- Collisions at the MAC layer
- Receiver out of range

Transient failure
- (Hardware) Arbitrary perturbation of the global state. May be induced by power surges, weak batteries, lightning, radio-frequency interference, etc.
- (Software) Heisenbugs are a class of temporary internal faults and are intermittent. They are essentially permanent faults whose conditions of activation occur rarely or are not easily reproducible, so they are harder to detect during the testing phase.
- Over 99% of bugs in IBM DB2 production code are non-deterministic and transient.

Byzantine failure
- Anything goes! Includes every conceivable form of erroneous behavior.
- Numerous possible causes.
- Includes malicious behaviors too (like a process executing a different program instead of the specified one).
- The most difficult kind of failure to deal with.

Software failures
- Coding error or human error
- Design flaws
- Memory leak
- Incomplete specification (example: Y2K)
Many failures (like crash, omission, etc.) can be caused by software bugs too.

Specification of faulty behavior

    program example1;
    define x : boolean (initially x = true);
    {a, b are messages}
    do
       {S}: x → send a       {specified action}
    [] {F}: true → send b    {faulty action}
    od

A possible output: a a a a b a a a b b a a a a a a a …
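The program above can be simulated to produce traces like the one shown. The sketch below is an illustration under assumptions: the original chooses between S and F nondeterministically, whereas this version picks the faulty action with a hypothetical probability `fault_prob`. The function name `run_example1` is also made up for this example.

```python
import random


def run_example1(steps: int, fault_prob: float, seed=None) -> list:
    """Simulate example1: each step sends 'a' (action S) or 'b' (action F)."""
    rng = random.Random(seed)
    x = True  # guard of the specified action S; never changed by S or F
    trace = []
    for _ in range(steps):
        # The guard of F is `true`, so F is always enabled.
        if rng.random() < fault_prob:
            trace.append("b")   # faulty action F
        elif x:
            trace.append("a")   # specified action S
    return trace


if __name__ == "__main__":
    print(" ".join(run_example1(17, fault_prob=0.2, seed=1)))
```

With `fault_prob=0.0` the trace contains only a, which is the fault-free behavior. Any positive value interleaves b messages as in the slide.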

Fault-tolerance
F-intolerant vs F-tolerant systems: an F-tolerant system is a system that tolerates failures of type F.
Four types of tolerance:
- Masking
- Non-masking
- Fail-safe
- Graceful degradation

Fault-tolerance
- P is the invariant of the original fault-free system.
- Q represents the worst possible behavior of the system when failures occur. It is called the fault span.
- Q is closed under S or F.
[Figure: regions P and Q]

Fault-tolerance
- Masking tolerance: P = Q (neither safety nor liveness is violated).
- Non-masking tolerance: P ⊂ Q (the safety property may be temporarily violated, but not liveness). Eventually the safety property is restored.
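For a small finite system, the relation between P and Q above can be computed directly. The sketch below is an assumption-laden toy, not the slides' method. It takes Q to be the smallest set that contains P and is closed under the given actions, and the four-state system with actions `S` and `F` is invented for illustration.

```python
def fault_span(invariant, actions):
    """Smallest set containing `invariant` and closed under every action."""
    span = set(invariant)
    frontier = list(invariant)
    while frontier:
        state = frontier.pop()
        for act in actions:
            for nxt in act(state):
                if nxt not in span:
                    span.add(nxt)
                    frontier.append(nxt)
    return span


def S(x):
    # Specified action: toggle between 0 and 1; from a bad state, recover to 0.
    return {1 - x} if x in (0, 1) else {0}


def F(x):
    # Faulty action: corrupt a legal state into 2 or 3.
    return {x + 2} if x < 2 else set()


if __name__ == "__main__":
    P = {0, 1}
    print(fault_span(P, [S]) == P)         # no faults: the span equals P
    print(sorted(fault_span(P, [S, F])))   # with F: the span is larger than P
```

In this toy system P is a proper subset of Q once F is allowed, and S leads every state of Q back into P, which matches the non-masking case: safety is violated temporarily and then restored.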