SysRép / 2.5, A. Schiper, Summer 2007
2.5 The consensus problem

Motivation
Implementation of atomic broadcast and other group communication primitives in the presence of failures is a difficult problem.
Consensus: the problem that is the common denominator for the implementation of the various group communication primitives.
Model: static groups, crash-stop.

Definitions
Processes:
– correct process: a process that does not crash during its whole execution
– faulty process: a process that is not correct

Definitions (2)
Channels:
– Reliable channel: if p executes send(m) to q and q is correct, then q eventually receives m
– Quasi-reliable channel: if p executes send(m) to q and both p and q are correct, then q eventually receives m

Specification of consensus
Informal:
– n processes: p_1, …, p_n
– Each process p_i has an initial value v_i
– Processes must agree on a common value that is the initial value of one of the processes

Specification of consensus (2)
Formal: consensus is defined by two primitives:
– propose(v): primitive by which a process proposes an initial value
– decide(v): primitive by which a process decides
[Figure: the processes call propose(4), propose(7) and propose(1); the decision is decide(7).]

Specification of consensus (3)
Propose and decide must satisfy the following properties:
– Validity: if a process decides v, then v was proposed by some process (v is the initial value of some process)
– Agreement: two correct processes cannot decide differently
– Termination: every correct process eventually decides

Specification of consensus (4)
Uniform consensus:
– Validity: if a process decides v, then v was proposed by some process (v is the initial value of some process)
– Uniform agreement: two processes (correct or faulty) cannot decide differently
– Termination: every correct process eventually decides
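The properties above can be checked mechanically on a finished run. The sketch below is only an illustration: the function name `check_consensus` and the dictionary layout are assumptions, uniform agreement is taken in its usual sense (no two processes, correct or faulty, decide differently), and termination can only be approximated on a finite trace as "every correct process has decided by the end of the run".

```python
from typing import Dict, Set


def check_consensus(proposed: Dict[str, int],
                    decided: Dict[str, int],
                    correct: Set[str],
                    uniform: bool = False) -> Dict[str, bool]:
    """Check the consensus properties on one finished run.

    proposed: initial value of every process
    decided:  decision of every process that decided (possibly before crashing)
    correct:  processes that never crash in this run
    """
    # Validity: every decided value was proposed by some process.
    validity = all(v in proposed.values() for v in decided.values())
    # Agreement looks at correct processes only; uniform agreement at all of them.
    scope = decided if uniform else {p: v for p, v in decided.items() if p in correct}
    agreement = len(set(scope.values())) <= 1
    # Termination (finite-trace approximation): every correct process has decided.
    termination = all(p in decided for p in correct)
    return {"validity": validity, "agreement": agreement, "termination": termination}


proposed = {"p1": 4, "p2": 7, "p3": 1}
# Hypothetical run: p3 decides 1 and then crashes; p1 and p2 decide 7.
decided = {"p1": 7, "p2": 7, "p3": 1}
print(check_consensus(proposed, decided, correct={"p1", "p2"}))
print(check_consensus(proposed, decided, correct={"p1", "p2"}, uniform=True))
```

The hypothetical run satisfies agreement but not uniform agreement, which is exactly the difference between the two specifications.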

Solving consensus
Consensus is easy to solve if processes do not crash and if channels are reliable.
Otherwise it is not so easy …
Solvability of consensus depends on the system model (which defines the assumptions about processes and channels).

System models: synchronous system
Bound Δ on message delay: if message m is sent by process p to process q at time t, then q receives the message no later than at time t + Δ.
Bound β on relative process speed: if the fastest process takes x time units to do some computation, then the slowest process does not take more than xβ time units to do the same computation.

System models: synchronous system (2)
A synchronous system allows accurate failure detection.
Handling of “are you alive”: x time units for the fastest process ⇒ xβ time units for the slowest process (β is the bound on relative process speed).
Timeout of p: 2Δ + xβ (Δ is the bound on message delay).
[Figure: p sends “are you alive” to q and q answers “yes”; the round trip takes at most 2Δ + xβ.]
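The timeout arithmetic can be written down directly. In this small sketch the parameter names `delta` (bound on message delay), `x` (handling time of the fastest process) and `beta` (bound on relative process speed) are assumed names for the two bounds of the synchronous model, and the numbers are made up.

```python
def alive_timeout(delta: float, x: float, beta: float) -> float:
    """Timeout for an "are you alive" probe in the synchronous model.

    delta: bound on the (one-way) message delay
    x:     time the fastest process needs to handle the probe
    beta:  bound on relative speed (the slowest process needs at most x * beta)
    """
    # One delay for the request, x * beta to handle it, one delay for the reply.
    return 2 * delta + x * beta


def suspect_crash(elapsed: float, delta: float, x: float, beta: float) -> bool:
    # If no reply arrived within the timeout, the probed process has crashed;
    # under the synchronous assumptions this conclusion is never wrong.
    return elapsed > alive_timeout(delta, x, beta)


print(alive_timeout(delta=10.0, x=1.0, beta=3.0))
```

The estimate is only as good as the bounds: a larger `delta` makes detection safer but slower.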

System models: asynchronous system
– No bound on message delay
– No bound on relative process speed
– Not possible to know whether a process has crashed or not

Synchronous round model
First goal: solve consensus in the synchronous model.
As is often done, we express the consensus algorithm in a computation model composed of rounds, which can be implemented in the synchronous model.
Name: synchronous round model.

Synchronous round model
In every round r, each process p:
– Sends a message to all processes
– Receives the messages sent in round r
– Does some local computation
[Figure: in round r, process p moves from state st to state st’.]

Synchronous round model (2)
In every round r, each process p: … receives the messages sent in round r …
– If p does not crash in round r: all processes that do not crash in round r (or before) receive p’s message
– If p crashes in round r: some processes might receive p’s message, some other processes might not receive p’s message

floodSet: example 1
[Figure: f = 2; processes p1, p2, p3 with initial sets {3}, {7}, {5}; rounds r = 1, 2, 3. No crash is shown, the sets become {3,5,7}, and the processes execute DECIDE(3).]

Synchronous round model: floodSet algorithm
Parameter f: maximum number of processes that can crash
State: W_p: set of values, initially {v_p} (v_p is p’s initial value)
Round r:
  S_r: send ⟨W_p⟩ to all processes
  T_r: forall q from which ⟨W_q⟩ received do W_p ← W_p ∪ W_q
       if r = f+1 then DECIDE(min(W_p))
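The floodSet rounds can be simulated in a few lines. This is a minimal sketch, not a reference implementation: the function name `flood_set` and the `crashes` argument (which maps a process to the round in which it crashes and to the processes its last message still reaches) are assumptions made for the illustration, and the crash schedule in the usage lines is hypothetical.

```python
from typing import Dict, Optional, Set, Tuple


def flood_set(initial: Dict[str, int], f: int,
              crashes: Optional[Dict[str, Tuple[int, Set[str]]]] = None) -> Dict[str, int]:
    """Simulate floodSet for f + 1 synchronous rounds.

    crashes[p] = (r, reached): p crashes in round r and its round-r message
    is received only by the processes in `reached`.
    Returns the decision of every process that has not crashed.
    """
    crashes = crashes or {}
    w = {p: {v} for p, v in initial.items()}  # W_p, initially {v_p}
    alive = set(initial)
    for r in range(1, f + 2):
        # Send step: every live process sends its current W_p to all processes.
        sent = {p: frozenset(w[p]) for p in alive}
        crashing = {p for p in alive if p in crashes and crashes[p][0] == r}
        survivors = alive - crashing
        # Receive step: W_q grows by every W_p that actually reaches q.
        for q in survivors:
            for p, w_p in sent.items():
                if p not in crashing or q in crashes[p][1]:
                    w[q] |= w_p
        alive = survivors
    # After round f + 1, every surviving process decides min(W_p).
    return {p: min(w[p]) for p in alive}


values = {"p1": 3, "p2": 7, "p3": 5}
print(flood_set(values, f=2))
# Hypothetical schedule: p1 crashes in round 1 and reaches only p2,
# then p2 crashes in round 2 and reaches nobody.
print(flood_set(values, f=2, crashes={"p1": (1, {"p2"}), "p2": (2, set())}))
```

Running f + 1 rounds matters: with at most f crashes, at least one round is crash-free, and after that round all surviving processes hold the same set.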

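floodSet: example 2
[Figure: f = 2; processes p1, p2, p3 with initial sets {3}, {7}, {5}; rounds r = 1, 2, 3. Two crashes are marked; the intermediate sets {5,7} and {3,5,7} appear, and the surviving process executes DECIDE(5).]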

Proof
Properties to prove:
– Validity: if a process decides v, then v was proposed by some process (v is the initial value of some process)
– Termination: every correct process eventually decides
– Agreement: two correct processes cannot decide differently

FLP impossibility result
Consensus is solvable in the synchronous system. What about the asynchronous system model?
Fischer-Lynch-Paterson (1985): consensus is not solvable in an asynchronous system with reliable channels if one single process may crash.

FLP impossibility result (2)
What does “not solvable” mean?
There exists no algorithm A such that, in all runs of A compatible with the system model, consensus is solved.
This does not mean that an algorithm A can never solve consensus: A may well solve it in some runs, just not in all of them.

Discussion
Asynchronous system:
– too weak to solve consensus
Synchronous system:
– allows us to solve consensus
– drawback: requires estimating the worst-case message transmission delay Δ (e.g., Δ must include possible retransmissions)
– Δ has a direct impact on the crash detection time, and on the duration of the black-out period that follows a crash
Question: is it possible to solve consensus without making mistakes in the crash detection?

Discussion (2)
1. Partially synchronous model (Dwork, Lynch, Stockmeyer, 1988):
– Model in between the synchronous model and the asynchronous model.
– The bounds Δ (message delay) and β (relative process speed) of the synchronous model:
  1. exist but are unknown, or
  2. are known but hold only from some time T on, called the global stabilization time
2. Augmenting the asynchronous system with failure detectors (Chandra, Toueg, 1996)

Failure detectors
– Each FDi maintains a list of suspected processes
– Each FDi can make a mistake by suspecting a process that has not crashed
– Each FDi can change its mind by removing a suspected process from its list
– No agreement among the FDi’s is required
[Figure: processes p1 to p4, each with its local failure detector FD1 to FD4; the suspect lists shown include {p2, p3}, {p1}, { } and {p2}.]

Failure detectors (2)
Without adding constraints on the output of the failure detectors, the new model is equivalent to the asynchronous model.
Two types of constraints on the output of failure detectors:
– Constraints related to crashed processes: completeness properties
– Constraints related to correct processes: accuracy properties
A failure detector is defined by a pair (c, a):
– c: a completeness property
– a: an accuracy property

Completeness
– Strong completeness: every process that crashes is eventually permanently suspected by every correct process
– Weak completeness: every process that crashes is eventually permanently suspected by some correct process

Accuracy
– Strong accuracy: no process is suspected before it crashes
– Weak accuracy: some correct process is never suspected
– Eventual strong accuracy: there is a time after which correct processes are not suspected by any correct process
– Eventual weak accuracy: there is a time after which some correct process is never suspected by any correct process

Failure detectors
Perfect failure detector:
– Strong completeness, strong accuracy
– Notation: P
Eventually perfect failure detector:
– Strong completeness, eventual strong accuracy
– Notation: ◊P
Strong failure detector:
– Strong completeness, weak accuracy
– Notation: S
Eventually strong failure detector:
– Strong completeness, eventual weak accuracy
– Notation: ◊S
Eventually weak failure detector:
– Weak completeness, eventual weak accuracy
– Notation: ◊W

Solving consensus with ◊S
Proposed by Chandra, Toueg (1996)
Hypotheses:
– f < n/2
– ◊S:
  Eventual weak accuracy: there is a time after which some correct process is no longer suspected by any correct process
  Strong completeness: every process that crashes is eventually permanently suspected by every correct process

Solving consensus with ◊S (2)
Basic idea: process p1 tries to impose its initial value as the decision.
How many acks should p1 wait for? A majority, i.e., ⌈(n+1)/2⌉.
[Figure: p1 sends v1 to all processes, collects acks, then sends decide(v1).]
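A quick numeric check of the majority threshold may help. The helper names below are assumptions; the sketch only illustrates two facts behind the hypothesis f < n/2: any two majorities intersect, and the correct processes alone always form a majority, so the coordinator can wait for ⌈(n+1)/2⌉ acks without blocking forever.

```python
import math


def majority(n: int) -> int:
    """Smallest number of processes that forms a majority of n."""
    return math.ceil((n + 1) / 2)


def max_crashes(n: int) -> int:
    """Largest f such that f < n/2."""
    return (n - 1) // 2


for n in range(1, 8):
    m = majority(n)
    # Two majorities always share at least one process.
    assert 2 * m > n
    # Even if max_crashes(n) processes crash, a majority of acks can be collected.
    assert n - max_crashes(n) >= m
    print(n, m, max_crashes(n))
```

The intersection property is what later lets a new coordinator learn about a value that a previous coordinator may already have imposed.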

Solving consensus with ◊S (3)
What if p1 crashes? Process p2 takes over the role of p1.
Can p2 ignore what p1 has done previously? What is the problem?
[Figure: p2 sends v2 to all processes, collects acks, then sends decide(v2).]

Solving consensus with ◊S (4)
If some process has decided v1, then p2 must ignore v2 and must try to impose v1 as the decision.
p2 must be able to discover that v1 might have been decided:
– p_i: if v1 received from p1 then send v1 to p2
– p2: if v1 received then x = v1 else x = v2
[Figure: p2 sends x to all processes, collects acks, then sends the decision.]

Solving consensus with ◊S (5)
If p2 does not succeed, then p3 takes over.
If p3 does not succeed, then p4 takes over.
…
If p_n does not succeed, then p1 takes over again, and so on.
This is called a rotating coordinator.

Solving consensus with ◊S (6)
Rotating coordinator: p_i is the new coordinator. What value should p_i choose?
The values sent are time-stamped with round numbers; the value with the largest time-stamp is chosen.
[Figure: p_i receives value v_x from p_x and value v_y from p_y.]
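The selection rule for a new coordinator can be sketched as follows. The representation of an estimate as a `(value, timestamp)` pair and the function name `choose_value` are assumptions made for this illustration; the timestamp stands for the round number in which the value was last adopted.

```python
from typing import List, Tuple


def choose_value(estimates: List[Tuple[int, int]]) -> int:
    """Pick the value carrying the largest time-stamp.

    estimates: (value, timestamp) pairs received by the new coordinator
    from a majority of processes.
    """
    value, _timestamp = max(estimates, key=lambda e: e[1])
    return value


# Hypothetical estimates: 7 was adopted in round 2, so it wins over the others.
print(choose_value([(4, 0), (7, 2), (1, 1)]))
```

Because the estimates come from a majority and any two majorities intersect, a value that was acknowledged by a majority in an earlier round shows up among them with a recent time-stamp.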

Solving consensus with ◊S (7)
[Figure: one round of the algorithm with its coordinator, divided into phase 1, phase 2, phase 3 and phase 4.]

Atomic broadcast in the crash-stop model

Outline
– Reliable broadcast (specification)
– Atomic broadcast (specification)
– Reliable broadcast (implementation)
– Atomic broadcast (implementation)

Reliable broadcast
Unreliable broadcast of message m to group g:
– If the sender is correct, then every correct process in g eventually receives m
– If the sender crashes, then some correct processes in g might receive m, and others not
We may want stronger guarantees ⇒ reliable broadcast

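Reliable broadcast (2)
Defined by the primitives rbcast and rdeliver.
Convention:
– g is dropped from the notation
– the sender is a member of g
[Figure: layered architecture. The replication technique sits on top of group communication and calls rbcast(g, m) / gets rdeliver(m); group communication sits on top of the transport layer and calls send(m) to p / gets receive(m).]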

Reliable broadcast (3)
Rbcast and rdeliver satisfy the following properties:
– Validity: if a correct process executes rbcast(m), then it eventually rdelivers m
– Agreement: if a correct process rdelivers m, then all correct processes eventually rdeliver m
– Integrity: for any message m, every correct process rdelivers m at most once, and only if m was previously rbcast

Uniform reliable broadcast
Uniform reliable broadcast: agreement is replaced by uniform agreement.
Uniform agreement: if a process (correct or faulty) rdelivers m, then all correct processes eventually rdeliver m.

Atomic broadcast
Uniform reliable broadcast plus the following property:
– Uniform total order: if some process (correct or faulty) adelivers m before m’, then every process adelivers m’ only after having adelivered m
NB: this should be called uniform atomic broadcast. To simplify, the term atomic broadcast is often used.

Solving reliable broadcast
Can be solved in an asynchronous system with quasi-reliable channels for f < n.
To rbcast(m):
  send(m) to all processes
Upon reception of m for the first time do
  if p_i ≠ sender(m) then send(m) to all processes
  rdeliver(m)
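The relay rule can be tried out in a small single-threaded simulation. This is a sketch under simplifying assumptions, not a networked implementation: the message queue stands in for quasi-reliable channels, only the sender may crash, and the parameter `reached_before_crash` (hypothetical) lists the processes that the sender's messages reach before it crashes.

```python
from collections import deque
from typing import Dict, List, Optional, Set


def reliable_broadcast(processes: List[str], sender: str, m: str,
                       reached_before_crash: Optional[Set[str]] = None) -> Dict[str, List[str]]:
    """Simulate rbcast(m) by `sender`; return what every process rdelivers.

    reached_before_crash: None if the sender is correct; otherwise the sender
    crashes while sending and only these processes receive m from it.
    """
    delivered: Dict[str, List[str]] = {p: [] for p in processes}
    crashed: Set[str] = set()
    if reached_before_crash is None:
        targets = list(processes)
    else:
        targets = [p for p in processes if p in reached_before_crash]
        crashed.add(sender)
    # In-flight messages as (destination, message) pairs.
    in_flight = deque((q, m) for q in targets)
    while in_flight:
        p, msg = in_flight.popleft()
        if p in crashed or msg in delivered[p]:
            continue  # crashed process, or m was already received once
        if p != sender:
            # First reception: relay m to all processes before delivering it.
            in_flight.extend((q, msg) for q in processes)
        delivered[p].append(msg)
    return delivered


procs = ["p1", "p2", "p3"]
print(reliable_broadcast(procs, "p1", "m"))
# Hypothetical crash: p1's message reaches only p2, which relays it to p3.
print(reliable_broadcast(procs, "p1", "m", reached_before_crash={"p2"}))
```

The relay step is what provides agreement: as soon as one correct process receives m, it forwards m to everyone before delivering it.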

Solving atomic broadcast
Atomic broadcast is also subject to the FLP impossibility result:
– shown by contradiction: if atomic broadcast is solvable, then consensus is also solvable
We will show that if consensus is solvable, then atomic broadcast is also solvable.
Consensus and atomic broadcast are equivalent problems.

Solving atomic broadcast (2)
Assume atomic broadcast is solvable.
Solve consensus as follows (code of p_i with initial value v_i):
– abcast(v_i)
– let v be the first value adelivered
– decide(v)
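The three steps above can be sketched with a stand-in for atomic broadcast. The class `TotalOrderLog` is hypothetical: it models the total order guarantee with one shared list, which is an assumption of the simulation and says nothing about how atomic broadcast is implemented.

```python
from typing import Dict, List


class TotalOrderLog:
    """Hypothetical stand-in for atomic broadcast: every process adelivers
    the messages in the order of this single shared log."""

    def __init__(self) -> None:
        self.log: List[int] = []

    def abcast(self, v: int) -> None:
        self.log.append(v)

    def first_adelivered(self) -> int:
        return self.log[0]


def solve_consensus(initial: Dict[str, int], arrival_order: List[str]) -> Dict[str, int]:
    """Each process abcasts its initial value and decides the first value adelivered.

    arrival_order: the order in which the abcasts happen to be sequenced.
    """
    ab = TotalOrderLog()
    for p in arrival_order:
        ab.abcast(initial[p])
    # Total order: all processes see the same first value, hence the same decision.
    return {p: ab.first_adelivered() for p in initial}


print(solve_consensus({"p1": 4, "p2": 7, "p3": 1}, arrival_order=["p2", "p1", "p3"]))
```

Validity holds because the decided value was abcast by some process, and agreement follows directly from total order.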

Solving atomic broadcast (3)
[Figure: timeline showing abcast(m1), abcast(m2), abcast(m3) and abcast(m4), two consensus instances, and the resulting deliveries adeliver(m4), adeliver(m2), adeliver(m1) and adeliver(m3).]

Solving atomic broadcast (4)
Principle of the algorithm:
– Sequence of instances of consensus (numbered 1, 2, …)
– Each consensus is on a set of messages
– Initial value for each consensus: a set of messages
Let msg^k be the set of messages decided by consensus #k:
– The messages in msg^k are adelivered before the messages in msg^(k+1)
– The messages in msg^k are adelivered in some deterministic order (e.g., according to their IDs)

Solving atomic broadcast (5)
Initialization:
  k_i := 0; adelivered_i := ∅; rdelivered_i := ∅
To abcast(m):
  rbcast(m)
Upon rdeliver(m) do
  rdelivered_i := rdelivered_i ∪ {m}
Upon rdelivered_i \ adelivered_i ≠ ∅ do
  k_i := k_i + 1
  aUndelivered := rdelivered_i \ adelivered_i
  propose(k_i, aUndelivered)
  wait until decide(k_i, msg^(k_i))
  adeliver^(k_i) := msg^(k_i) \ adelivered_i
  adeliver the messages in adeliver^(k_i) in some deterministic order
  adelivered_i := adelivered_i ∪ adeliver^(k_i)
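The reduction can be exercised with two simulated processes. Everything named here is an assumption made for the sketch: `ConsensusStub` replaces a real consensus algorithm by deciding the first set proposed for instance k, `Process.step` plays the role of the "upon rdelivered \ adelivered ≠ ∅" handler, and sorting by message ID is the chosen deterministic order.

```python
from typing import Dict, FrozenSet, List, Set


class ConsensusStub:
    """Hypothetical consensus service: instance k decides the first set proposed for k."""

    def __init__(self) -> None:
        self.decisions: Dict[int, FrozenSet[str]] = {}

    def propose(self, k: int, msgs: Set[str]) -> FrozenSet[str]:
        # Returns the decision of instance k (the first proposal wins).
        return self.decisions.setdefault(k, frozenset(msgs))


class Process:
    def __init__(self, consensus: ConsensusStub) -> None:
        self.consensus = consensus
        self.k = 0
        self.rdelivered: Set[str] = set()
        self.adelivered: List[str] = []  # kept as a list to record the order

    def rdeliver(self, m: str) -> None:
        self.rdelivered.add(m)

    def step(self) -> bool:
        """Run one consensus instance if some rdelivered message is not yet adelivered."""
        undelivered = self.rdelivered - set(self.adelivered)
        if not undelivered:
            return False
        self.k += 1
        msgs = self.consensus.propose(self.k, undelivered)
        # Deterministic order inside a batch: sort by message ID.
        batch = sorted(msgs - set(self.adelivered))
        self.adelivered.extend(batch)
        return True


consensus = ConsensusStub()
p1, p2 = Process(consensus), Process(consensus)
# Reliable broadcast may deliver messages in different orders at p1 and p2.
p1.rdeliver("m2"); p1.rdeliver("m1")
p2.rdeliver("m3")
p1.step()
while p2.step():
    pass
# Eventually every message is rdelivered everywhere.
for m in ("m1", "m2", "m3"):
    p1.rdeliver(m)
    p2.rdeliver(m)
while p1.step() or p2.step():
    pass
print(p1.adelivered, p2.adelivered)
```

Both processes end with the same adelivery sequence even though they rdelivered the messages in different orders, because every batch is fixed by one consensus decision and ordered deterministically.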

Quorum systems vs. group communication
Example: a server with inc/dec operations, replicated on s1, s2, s3 and accessed by a client c.
[Figure: with group communication, c sends the inc/dec operation to s1, s2, s3. With quorum systems, c reads and writes the servers under mutual exclusion.]

Quorum systems vs. group communication (2)
Solution based on quorum systems:
– majority of correct servers
– mutual exclusion
– perfect failure detector
Solution based on group communication:
– majority of correct servers
– ◊S failure detector