CS 582 / CMPE 481 Distributed Systems Fault Tolerance.



Class Overview
  Introduction
    – definition
    – failure classification
    – failure semantics
  Failure Model
  Fault-Tolerant Approaches
    – state machine
    – primary-backup

Introduction
  Definition
    – a system is considered faulty once its behavior is no longer consistent with its specification [Schneider]
  The separation property of distributed systems leads to the partial-failure property
    – components that a component depends on may fail to respond for various reasons
      system or network failure
      system or network overload

Introduction (cont)
  Failure classification [Cristian]
    – omission failure
      a server omits to respond to an input
    – timing failure (performance failure)
      a server responds correctly but not in time (early or late)
    – response failure
      a server responds incorrectly
        – value failure: incorrect output
        – state transition failure: incorrect state transition
    – crash failure
      a server repeatedly fails to respond to inputs
      failure management depends on the server state at restart
        – amnesia crash: no record of the state at the crash; the server resets to its initial state
        – partial amnesia crash: the state is only partially recorded
        – pause crash: the server restarts in the state it had before the crash
        – halting crash: the server never restarts
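The classification above can be turned into a small decision procedure. The following is only a sketch: the names `Outcome` and `classify`, and the idea of judging a single request by its expected value and an allowed response-time window, are assumptions made for illustration. Crash failures are left out because they can only be recognized over repeated requests.

```python
from enum import Enum
from typing import Any, Optional


class Outcome(Enum):
    CORRECT = "correct"
    OMISSION = "omission failure"
    TIMING = "timing failure"
    RESPONSE = "response failure"


def classify(expected: Any, observed: Any, elapsed: Optional[float],
             earliest: float, latest: float) -> Outcome:
    """Classify one request/response pair.

    elapsed is None when no response arrived at all.
    """
    if elapsed is None:
        return Outcome.OMISSION          # the server omitted to respond
    if observed != expected:
        return Outcome.RESPONSE          # the server responded incorrectly
    if not (earliest <= elapsed <= latest):
        return Outcome.TIMING            # correct value, but too early or too late
    return Outcome.CORRECT
```

The value is checked before the timing so that a timing failure is reported only for a response that is otherwise correct, matching the definition on the slide.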

Introduction (cont)
  Failure semantics
    – a description of the ways in which a service may fail
    – recovery actions depend on the likely failure behavior of the server when its failure is detected
    – the designer should ensure that the behavior of the server conforms to a specified failure semantics
      e.g. a network with omission/timing failure semantics
        – must guarantee detection of message corruption, for example with checksums
      stronger failure semantics generally cost more
        – judging whether a failure semantics is adequate requires a preliminary stochastic analysis
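The checksum example can be sketched as follows: the receiver silently drops any message whose checksum does not match, so corruption is never visible to higher layers and the network appears to have only omission failures. The function names `frame` and `deliver` and the choice of CRC32 are assumptions for illustration, not something the slides prescribe.

```python
import zlib
from typing import Optional


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC32 checksum to the payload before sending."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def deliver(raw: bytes) -> Optional[bytes]:
    """Return the payload if the checksum matches, otherwise None.

    Returning None models dropping the message, which turns a
    corruption failure into an omission failure.
    """
    if len(raw) < 4:
        return None
    payload, received = raw[:-4], int.from_bytes(raw[-4:], "big")
    if zlib.crc32(payload) != received:
        return None
    return payload
```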

Failure Model
  Representative faulty behavior
    – Byzantine failures
      the system exhibits arbitrary and possibly malicious behavior, and may collude with other faulty systems
    – fail-stop failures
      when the system fails, it changes to a state that allows others to detect the failure, and then stops

Fault-Tolerant Approaches
  Fault tolerance
    – the system can detect a fault and either fail predictably or mask the fault from users
    – masking hides the occurrence of errors in system components and communications
    – redundant processing components are incorporated to achieve fault tolerance
  k-resilient / k-fault-tolerant
    – a set of systems satisfies its specification provided that no more than k systems become faulty
    – k is chosen based on statistical measures of system reliability
    – number of systems required
      Byzantine failures: 2k+1
      fail-stop failures: k+1
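As a quick check of the two counts above, here is a minimal sketch. The helper name `replicas_needed` and the string labels for the failure models are hypothetical.

```python
def replicas_needed(k: int, failure_model: str) -> int:
    """Number of systems needed to tolerate up to k faulty ones."""
    if k < 0:
        raise ValueError("k must be non-negative")
    if failure_model == "byzantine":
        return 2 * k + 1   # the k+1 correct systems must outvote k faulty ones
    if failure_model == "fail-stop":
        return k + 1       # one surviving system is enough
    raise ValueError(f"unknown failure model: {failure_model}")
```

For example, tolerating one Byzantine system takes three systems, while tolerating one fail-stop system takes two.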

Fault-Tolerant Approaches (cont)
  Two approaches to support fault tolerance (fault masking)
    – hierarchical masking
      hierarchical failure and recovery management
        – error detection in layered communication protocols
        – various levels of error abstraction in the OS
    – group failure masking
      state-machine approach
      primary-backup approach
  Fault tolerance can be supported in
    – hardware
      stable storage
    – software
      replicated servers

Stable Storage
  Group masking at the disk block level
    – duplicates a careful storage service
      a careful storage service has only omission failure semantics
      corruption errors are masked by converting them to omission failures
    – stable block
      unit of storage represented by two careful disk blocks containing the same data
      operation semantics
        – operation invariant
          no more than one block of the pair is bad
          the pair holds the most recent data, except during a write
        – read: from one block of the pair
        – write: the two blocks are written one after the other, and the second write starts only once the first is guaranteed to have succeeded
        – recovery: the good block is copied to the bad one
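The read, write and recovery rules can be sketched in memory as follows. This is an illustration under stated assumptions, not a real disk driver: `CarefulBlock`, `StableBlock` and the `corrupt` test hook are hypothetical names, CRC32 stands in for whatever error detection the careful storage layer uses, and copying the first block over the second when both are readable but differ is an added assumption about recovery after an interrupted write.

```python
import zlib
from typing import Optional


class CarefulBlock:
    """A block whose only visible failure is omission: read returns data or None."""

    def __init__(self) -> None:
        self._data = b""
        self._crc = zlib.crc32(b"")

    def write(self, data: bytes) -> None:
        self._data = data
        self._crc = zlib.crc32(data)

    def read(self) -> Optional[bytes]:
        # A checksum mismatch is reported as "no data", not as wrong data.
        return self._data if zlib.crc32(self._data) == self._crc else None

    def corrupt(self) -> None:
        """Test hook: flip one bit of the stored data without fixing the checksum."""
        self._data = bytes([self._data[0] ^ 0x01]) + self._data[1:] if self._data else b"\x01"


class StableBlock:
    """A stable block: two careful blocks that hold the same data."""

    def __init__(self) -> None:
        self.first = CarefulBlock()
        self.second = CarefulBlock()

    def write(self, data: bytes) -> None:
        self.first.write(data)
        if self.first.read() != data:      # confirm the first write before touching the second
            raise OSError("first write failed; second copy left untouched")
        self.second.write(data)

    def read(self) -> Optional[bytes]:
        data = self.first.read()
        return data if data is not None else self.second.read()

    def recover(self) -> None:
        a, b = self.first.read(), self.second.read()
        if a is None and b is not None:
            self.first.write(b)            # copy the good block over the bad one
        elif a is not None and a != b:
            self.second.write(a)           # second is bad or stale: first holds the newest data
```

Writing the blocks one after the other is what keeps the invariant: at any moment at most one of the two copies can be bad or out of date.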

State-Machine Approach
  State machine specification
    – state variables + commands
      state variables
        – encode the state of the state machine
      commands
        – implemented by a deterministic program
        – execution of a command is atomic with respect to other commands
        – modify state variables and/or produce some output
    – assumptions on the ordering of requests made by clients
      O1: requests issued by a single client to a state machine sm are processed by sm in the order they were issued
      O2: if a request r issued by client c could have caused a request r’ issued by client c’, then sm processes r before r’
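A state machine in this sense can be as small as a key-value store. The sketch below assumes a hypothetical `KVStateMachine` with two commands, `put` and `get`; each call to `apply` runs one command to completion, which stands in for atomic execution in this single-threaded illustration.

```python
from typing import Dict, Optional, Tuple

# A command is (operation, key, value); value is ignored for "get".
Command = Tuple[str, str, Optional[int]]


class KVStateMachine:
    """State variables: a dict. Commands: deterministic put/get."""

    def __init__(self) -> None:
        self.state: Dict[str, Optional[int]] = {}

    def apply(self, command: Command) -> Optional[int]:
        op, key, value = command
        if op == "put":
            self.state[key] = value      # modifies a state variable
            return value                 # and produces an output
        if op == "get":
            return self.state.get(key)   # produces an output only
        raise ValueError(f"unknown command: {op}")
```

Because the commands are deterministic, two instances that apply the same commands in the same order end in the same state and give the same outputs.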

State-Machine Approach (cont)
  Fault-tolerant state machine
    – assumptions
      a k-fault-tolerant state machine can be implemented by replicating the state machine and running one replica on each processor of a distributed system
      if all replicas start in the same initial state and execute the same requests in the same order, each replica does the same thing and produces the same output
      if each failure can affect at most one replica, the output of the k-fault-tolerant state machine is obtained by combining the outputs of the replicas
    – degree of replication
      Byzantine failures: 2k + 1
      fail-stop failures: k + 1
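How the replica outputs are combined depends on the failure model. The sketch below assumes two hypothetical helpers: under Byzantine failures the client accepts a value reported by at least k+1 replicas, and under fail-stop failures it accepts the output of any replica that answered. Using `None` to mark a replica that did not answer is an assumption of this sketch.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def combine_byzantine(outputs: Sequence[Hashable], k: int) -> Optional[Hashable]:
    """Return the value reported by at least k+1 replicas, or None if there is none.

    With 2k+1 replicas and at most k faulty ones, the correct value
    always reaches k+1 votes and no wrong value can.
    """
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= k + 1 else None


def combine_fail_stop(outputs: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Return the first output from a replica that answered (None marks a stopped replica).

    Fail-stop replicas never give wrong answers, so any answer is correct.
    """
    for out in outputs:
        if out is not None:
            return out
    return None
```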

State-Machine Approach (cont)
  Requirements for a k-fault-tolerant state machine
    – all replicas receive and process the same sequence of requests
      agreement: every non-faulty replica receives every request
        – specifies how a client interacts with the state machine replicas
        – can be relaxed for read-only requests under fail-stop failures
      order: every non-faulty replica processes the requests it receives in the same relative order
        – specifies how the state machine replicas process requests from clients
        – can be relaxed for commutative requests

State-Machine Approach (cont)
  Agreement requirement
    – to satisfy the agreement requirement, the state machine replicas need a message broadcast protocol that conforms to
      IC1: all non-faulty processors agree on the same value
      IC2: if the sender of the request is non-faulty, then all non-faulty processors use its value as the one on which they agree
    – such a protocol is called a Byzantine agreement protocol or a reliable broadcast protocol
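IC1 and IC2 are properties of a protocol run rather than a protocol, so they can be written as a predicate over the outcome of one run. The sketch below does only that: the function name `satisfies_ic` and the representation of a run as a mapping from each non-faulty processor to the value it decided are assumptions, and no agreement protocol is implemented here.

```python
from typing import Dict, Hashable


def satisfies_ic(decisions: Dict[str, Hashable], sender_value: Hashable,
                 sender_is_faulty: bool) -> bool:
    """Check IC1 and IC2 for one run.

    decisions maps each NON-FAULTY processor to the value it decided on.
    """
    values = set(decisions.values())
    ic1 = len(values) <= 1                              # everyone decided the same value
    ic2 = sender_is_faulty or values <= {sender_value}  # and it is the sender's, if the sender is correct
    return ic1 and ic2
```

When the sender is faulty only IC1 applies: the non-faulty processors may settle on any value, as long as it is the same one.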

State-Machine Approach (cont)
  Order requirement
    – implementing the order requirement requires
      assignment of a unique identifier to each message
      a stability test (a request is ready to be processed once every request that precedes it has been delivered, so none with a smaller identifier can still arrive)
    – logical clock-based implementation
      only for fail-stop failures
      unique id assignment: logical clock
        – LC1: the timestamp is incremented after each event at p
        – LC2: upon receipt of a message with timestamp t, process p resets its timestamp Tp to max(Tp, t)+1
      stability test
        – a request is stable at replica smi if a request with a larger timestamp has been received by smi from every client running on a non-faulty processor
        – the test relies on two assumptions
          » messages between a pair of processors are delivered in the order sent
          » processor p detects that a fail-stop process q has failed only after p has received q’s last message sent to p
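Rules LC1 and LC2 and the stability test can be sketched as follows. The names `LogicalClock` and `is_stable` are hypothetical, timestamps are plain integers (a real identifier would also need a tie-breaker such as the client id to be unique), and the replica is assumed to track the largest timestamp it has received from each client.

```python
from typing import Dict, Iterable


class LogicalClock:
    def __init__(self) -> None:
        self.time = 0

    def tick(self) -> int:
        """LC1: advance the timestamp for a local event."""
        self.time += 1
        return self.time

    def receive(self, t: int) -> int:
        """LC2: on receiving a message stamped t, jump past both clocks."""
        self.time = max(self.time, t) + 1
        return self.time


def is_stable(request_ts: int, latest_from_client: Dict[str, int],
              non_faulty_clients: Iterable[str]) -> bool:
    """A request is stable once every non-faulty client has sent something later.

    latest_from_client holds the largest timestamp received from each client;
    a client that has sent nothing yet blocks stability.
    """
    return all(latest_from_client.get(c, -1) > request_ts for c in non_faulty_clients)
```

With in-order delivery between each pair of processors, a larger timestamp from a client means nothing smaller from that client is still in transit, which is why the test is safe.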

Primary-Backup Approach
  Primary-backup requirements
    – Pb1: at any time there is at most one server whose state satisfies the condition for being the primary
    – Pb2: each client maintains the identity of a server to which it can send messages
    – Pb3: if a client request arrives at a server that is not the primary, then that request is not enqueued
    – Pb4: there exist fixed values k and d such that the service behaves like a single server whose failures can all be grouped into at most k intervals of time, each of length at most d

Primary-Backup Approach (cont)
  Simple primary-backup protocol
    – assumption
      one primary server p1 and one backup server p2
    – operations
      when p1 receives a request, it
        – processes the request and updates its state
        – sends update information to p2 (state update message)
        – sends the response to the client without waiting for an ack from p2
    – spec conformance
      Pb1: (p1 has not crashed) ∧ (p2 has not received a message from p1 for T+d) = false, i.e. the two conditions for being the primary never hold together
      Pb2: the client sends its messages to p1
      Pb3: requests are not sent to p2 until after p1 has failed
      Pb4: the service behaves like a (1, T+4d)-bofo server
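The protocol steps can be sketched with two in-memory objects. Several things here are assumptions rather than part of the slides: the class names, the key-value request format, passing the current time explicitly instead of using real timers, replacing the network with a direct method call, and reading T as the interval at which p1 sends messages to p2 and d as a bound on message delay, so that the backup's timeout is T+d.

```python
from typing import Dict, Optional


class Backup:
    """p2: applies state updates and takes over after T+d of silence from p1."""

    def __init__(self, timeout: float) -> None:
        self.state: Dict[str, int] = {}
        self.timeout = timeout          # assumed to be T + d
        self.last_heard = 0.0

    def on_update(self, key: str, value: int, now: float) -> None:
        self.state[key] = value
        self.last_heard = now

    def is_primary(self, now: float) -> bool:
        return now - self.last_heard > self.timeout


class Primary:
    """p1: processes a request, forwards the update, replies without waiting for an ack."""

    def __init__(self, backup: Backup) -> None:
        self.state: Dict[str, int] = {}
        self.backup = backup
        self.crashed = False

    def handle(self, key: str, value: int, now: float) -> Optional[str]:
        if self.crashed:
            return None                          # a crashed server does not respond
        self.state[key] = value                  # 1. process the request, update state
        self.backup.on_update(key, value, now)   # 2. state update message to p2
        return "ok"                              # 3. reply to the client, no ack awaited
```

Because p1 replies before p2 acknowledges, an update whose message is lost when p1 crashes never reaches p2; this sketch does not model lost messages.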