Chapter 8 Fault Tolerance


Chapter 8 Fault Tolerance
- Introduction
- Process resilience
- Reliable communication
- Failure recovery
- Distributed commit

Dependability
- Dependability is the ability to avoid service failures that are more frequent or severe than desired. It is an important goal of distributed systems.
- Requirements for dependable systems:
  - Availability: the probability that the system is available to perform its functions at any given moment
    - 99.999% availability (five 9s) → about 5 minutes of downtime per year
  - Reliability: the ability of the system to run continuously without failure
    - Down for 1 ms every hour → over 99.9999% availability, but highly unreliable
    - Down for two weeks every year → high reliability, but only 96% availability
  - Safety: when a system temporarily fails to operate correctly, nothing catastrophic happens
  - Maintainability: how easily a failed system can be repaired
  - Security: covered in Chapter 9
- Availability is readiness for usage; reliability is continuity of service delivery.
- Example: control systems for airplanes and nuclear power plants.
- Safety means a very low probability of catastrophes; maintainability is how easily a failed system can be repaired.
- Dependability is the trustworthiness of a computing system that allows reliance to be justifiably placed on the service it delivers.
- The dependability attributes are a way to assess how dependable a system is, that is, how confidently its service can be relied on.
- Failures, which are caused by faults, need to be prevented. Fault tolerance means that a system can provide its services even in the presence of faults.
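The availability figures on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are made up for this example, and a year is assumed to be 365 days.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year, in minutes, for an availability in [0, 1]."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_from_downtime(down: float, period: float) -> float:
    """Availability of a system that is down `down` units out of every `period` units."""
    return 1.0 - down / period


# Five 9s: roughly 5 minutes of downtime per year.
print(downtime_minutes_per_year(0.99999))

# Down for 1 ms every hour (both arguments in seconds): availability above 99.9999%.
print(availability_from_downtime(0.001, 3600.0))

# Down for two weeks out of a 52-week year: about 96% availability.
print(availability_from_downtime(2, 52))
```

The second and third calls show why availability and reliability are different attributes: the first system is almost always up yet fails every hour, while the second rarely fails yet is unavailable for a long stretch.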

Failures and Faults
- Building a dependable system comes down to preventing failures.
- A failure of a system occurs when the system cannot meet its promises.
- Failures are caused by faults. A fault is an anomalous condition.
- There are three categories of faults:
  - Transient faults: occur once and do not recur (e.g., wireless communication being interrupted by external interference)
  - Intermittent faults: recur irregularly (e.g., a loose contact on a connector)
  - Permanent faults: persist until the faulty component is replaced (e.g., software bugs)

Types of Failures
- Fail-stop: the server stops in a way that lets clients tell that it has halted
- Fail-silent: clients cannot tell that the server has halted
- State transition failure: execution of a component brings it into a wrong state
- Arbitrary failures are also known as Byzantine failures

Fault Tolerance
- In a single-machine system, a failure is almost always total.
  - All components are affected and the entire system may be brought down (e.g., OS crash, disk failures).
- Partial failures are possible in distributed systems.
  - When one component fails, it may affect some components while leaving others unaffected.
- Fault tolerance means that a system can provide its services even in the presence of faults.
- Fault tolerance requires:
  - preventing faults and failures from affecting other components of the system
  - automatically recovering from partial failures
- A distributed system consists of multiple independent nodes, so Prob(failure) = Prob(any one component fails).
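To get a feel for Prob(any one component fails), here is a minimal sketch. It assumes that all n components fail independently and with the same probability p. Neither assumption is stated on the slide, and the function name is hypothetical.

```python
def prob_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n components fails.

    Assumes independent failures with identical per-component probability p.
    """
    # Complement of "all n components survive".
    return 1.0 - (1.0 - p) ** n


# With a 1% failure probability per node, the chance that something is broken
# grows quickly with the number of nodes.
for n in (1, 10, 100):
    print(n, round(prob_any_failure(0.01, n), 3))
```

Under these assumptions, a system of 100 such nodes has some failed component more often than not, which is why partial failures have to be treated as the normal case.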

Failure Masking
- Failure masking is a fault tolerance technique that hides the occurrence of failures from other processes.
- The most common approach to failure masking is redundancy.
- Three types of redundancy:
  - Information redundancy: add extra bits to allow recovery from garbled bits
  - Time redundancy: repeat an action if needed
  - Physical redundancy: add extra equipment or processes so that the system can tolerate the loss or malfunctioning of some components
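Time redundancy is the easiest of the three to show in code. The sketch below retries an action a fixed number of times. The helper `with_retries` and the `flaky` function are hypothetical names for this example. Retrying can only mask faults that go away on their own (transient or intermittent ones), not permanent ones.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(action: Callable[[], T], attempts: int = 3, delay_s: float = 0.0) -> T:
    """Run `action`, repeating it on failure, for at most `attempts` tries in total."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return action()
        except Exception as err:  # deliberately broad in this sketch
            last_error = err
            if attempt < attempts - 1:
                time.sleep(delay_s)
    assert last_error is not None
    raise last_error  # every attempt failed: the fault could not be masked


# A stand-in for an operation hit by a transient fault: it fails twice, then works.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient fault")
    return "ok"


result = with_retries(flaky, attempts=5)
print(result, calls["n"])
```

The caller of `with_retries` never sees the two failed attempts, which is what masking means here.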

An Example of Physical Redundancy
- Triple modular redundancy: the effect of a single component failing is completely masked.
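The voting step at the heart of triple modular redundancy can be sketched as a majority function over the outputs of the three replicated components. This is an illustrative software analogue, not a circuit from the slides, and the name `majority_vote` is an assumption.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value produced by a strict majority of replicas, or None if there is none."""
    if not outputs:
        return None
    value, votes = Counter(outputs).most_common(1)[0]
    return value if votes > len(outputs) // 2 else None


# One of the three replicas is faulty; the voter still outputs the correct value.
print(majority_vote([1, 1, 0]))

# If two replicas fail in different ways, no majority exists and the fault is not masked.
print(majority_vote([1, 0, 7]))
```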

Process Resilience
- Protection against process failures can be achieved by organizing several identical processes into a group.
- Flat group: all processes are equal; the processes make decisions collectively.
  - No single point of failure, but decision making is more complicated.
- Hierarchical group: a single coordinator makes all decisions.
  - Decision making is simpler, but the coordinator is a single point of failure.
- The group is transparent to its users: the whole group is dealt with as a single process.

Fault Tolerance in Process Groups
- Having a group of identical processes allows us to mask one or more faulty processes in that group.
- A group of replicated processes is said to be k fault tolerant if it can survive k faults and still meet its specifications.
  - With crash failures, k+1 processes are sufficient to survive k faults.
  - With Byzantine failures, processes keep running even when faulty and may produce erroneous, random, or malicious results → 2k+1 processes are required to survive k faults (group output is defined by voting).
- Assumption: all requests arrive at all members of the group in the same order (this requires atomic multicast) → only then are we sure that all members do exactly the same thing.
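The group sizes on this slide follow from a simple counting argument, sketched below. The function names are hypothetical. The Byzantine case assumes, as the slide does, that the group output is decided by voting over the members' replies.

```python
from collections import Counter
from typing import Hashable, Sequence


def min_group_size(k: int, byzantine: bool) -> int:
    """Smallest group that survives k faults under the stated failure model."""
    # Crash failures: one survivor is enough, so k + 1.
    # Byzantine failures with voting: k correct-plus-one must outvote k liars, so 2k + 1.
    return 2 * k + 1 if byzantine else k + 1


def group_output(replies: Sequence[Hashable]) -> Hashable:
    """Voting: the reply given by the largest number of group members."""
    return Counter(replies).most_common(1)[0][0]


k = 2
n = min_group_size(k, byzantine=True)
# Worst case for the voter: all k faulty members collude on the same wrong reply.
replies = ["correct"] * (n - k) + ["wrong"] * k
print(n, group_output(replies))
```

With one member fewer (2k processes), the k colluding members could tie the vote, so the correct reply would no longer be guaranteed to win.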

Agreement in Faulty Systems
- The goal of distributed agreement algorithms is to have all the nonfaulty processes reach consensus on some issue within a finite number of steps.
- Q1: Can consensus be reached with nonfaulty processes and an unreliable communication channel?
  - A: No. Two nonfaulty processes can never reach agreement in the presence of an unreliable channel.
- Q2: Can consensus be reached with faulty (Byzantine) processes and a reliable channel?
  - A: It depends.
- Two-army problem: two blue armies must agree to attack simultaneously in order to defeat the white army.
  - The blue armies coordinate by messenger.
  - The messenger can be captured by the white army.
  - Can the two blue armies reach agreement?

Conditions for Consensus
- Assume processes may be faulty and communication is reliable.
- Table (flattened in this transcript): whether consensus can be reached, organized along four dimensions: process behavior (synchronous or asynchronous), communication delay (bounded or unbounded), message order (unordered or ordered), and message transmission (unicast or multicast). A "Yes" entry marks a combination in which consensus is possible.
- A system is synchronous iff the processes operate in lock-step mode (i.e., there is a constant c ≥ 1 such that if any process has taken c+1 steps, every other process has taken at least one step).

Byzantine Agreement Problem
- Byzantine agreement problem: can N generals reach consensus about each other's troop strengths when the communication channel is perfect but some of the generals are traitors who will lie to prevent agreement?
- Formally, there are N processes, and each process i provides a value v_i to the others. The goal is to let each process construct a vector V of length N such that if process i is nonfaulty, V[i] = v_i. Otherwise V[i] is undefined.
- Assume that processes are synchronous, messages are unicast while preserving ordering, and communication delay is bounded. Then, with k faulty processes, agreement can be achieved if there are 2k+1 nonfaulty processes, that is, 3k+1 processes in total [Lamport et al., 1982].
- In Lamport's paper, the Byzantine generals problem requires two conditions to be met:
  1) all loyal lieutenants obey the same order
  2) if the commanding general is loyal, then every loyal lieutenant obeys the order he sends

Byzantine Agreement Problem: An Example
- The Byzantine agreement problem for 3 nonfaulty processes and 1 faulty process, with v_i = i. Consensus is reached among the nonfaulty processes.
  (a) Each process sends its value to the others.
  (b) The vectors that each process assembles based on (a).
  (c) The vectors that each process receives after each process passes its vector from (b) to every other process.

Byzantine Agreement Problem: Another Example
- The Byzantine agreement problem for 2 nonfaulty processes and 1 faulty process. The algorithm fails to produce agreement.
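The two examples above can be reproduced with a small simulation of the two-round vector exchange, using v_i = i as in the first example. This is a rough sketch under stated assumptions rather than the algorithm from the original paper. The function name `byzantine_vectors`, the `UNKNOWN` marker, and the adversary model are all made up for illustration: every value sent by a faulty process is a fresh number from 100 upward, so no two lies ever agree. Each nonfaulty process decides element i by strict majority over the vectors it receives, and marks it `UNKNOWN` when no majority exists.

```python
from collections import Counter
from itertools import count
from typing import Dict, List, Optional, Set

UNKNOWN = None  # marker for "no majority for this element" (illustrative choice)


def byzantine_vectors(n: int, faulty: Set[int]) -> Dict[int, List[Optional[int]]]:
    """Simulate the exchange for processes 1..n; return each nonfaulty process's result vector."""
    procs = range(1, n + 1)
    lie = count(100)  # assumed adversary: every lie is a new value, distinct from all others

    # Round 1: every process sends its value to the others.
    # vector[r][s] is what process r recorded as the value of process s.
    vector = {
        r: {s: (next(lie) if s in faulty and s != r else s) for s in procs}
        for r in procs
    }

    result = {}
    for r in procs:
        if r in faulty:
            continue
        # Round 2: r receives a vector from every other process;
        # a faulty sender forwards a vector made up entirely of lies.
        received = [
            {i: next(lie) for i in procs} if s in faulty else vector[s]
            for s in procs
            if s != r
        ]
        # Decision: element i is the strict-majority value among the received vectors.
        decided = []
        for i in procs:
            value, votes = Counter(vec[i] for vec in received).most_common(1)[0]
            decided.append(value if votes > len(received) / 2 else UNKNOWN)
        result[r] = decided
    return result


# 3 nonfaulty processes and 1 faulty process (process 3): all nonfaulty processes
# end up with the same vector, with only the faulty process's entry unknown.
print(byzantine_vectors(4, {3}))

# 2 nonfaulty processes and 1 faulty process: no element reaches a majority.
print(byzantine_vectors(3, {3}))
```

In the first run, each nonfaulty process receives two honest vectors and one fabricated one, so the honest entries win 2 to 1. In the second run, each receives one honest and one fabricated vector, and a 1 to 1 split is not a majority. This matches the requirement of 2k+1 nonfaulty processes for k faulty ones.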