Group Communication: from practice to theory
André Schiper, EPFL, Lausanne, Switzerland

Outline
- Introduction to fault tolerance
- Replication for fault tolerance
- Group communication for replication
- Implementation of group communication
- Practical issues
- Theoretical issues

Introduction
- Computer systems become more complex every day
- The probability of faults increases
- Techniques are needed to achieve fault tolerance

Introduction (2)
Fault-tolerance techniques developed over the years:
- Transactions (all-or-nothing property)
- Checkpointing (prevents state loss)
- Replication (masks faults)

Transactions
ACID properties (Atomicity, Consistency, Isolation, Durability)

  begin-transaction
    remove(100, account-1)
    deposit(100, account-2)
  end-transaction

Crash during the transaction: rollback to the state before the beginning of the transaction
[Figure: account-1 on site 1, account-2 on site 2]
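The all-or-nothing property can be illustrated in a few lines; a minimal sketch with a hypothetical in-memory accounts store (the names and the rollback-by-snapshot mechanism are illustrative, not part of the talk):

from copy import deepcopy

accounts = {"account-1": 500, "account-2": 200}

def transfer(amount, src, dst):
    snapshot = deepcopy(accounts)       # state before the transaction
    try:
        if accounts[src] < amount:
            raise RuntimeError("insufficient funds")
        accounts[src] -= amount         # remove(amount, src)
        accounts[dst] += amount         # deposit(amount, dst)
    except Exception:
        accounts.clear()                # crash/abort: roll back to the
        accounts.update(snapshot)       # state before the transaction
        raise

transfer(100, "account-1", "account-2")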

Checkpointing
Crash of p1: rollback of p1 to the latest saved state
[Figure: processes p1, p2, p3; p1 takes a checkpoint, later crashes, and rolls back to the checkpointed state]

Replication
- Crash of s1 is masked from the client
- Upon recovery of s1: state transfer to bring s1 up to date
[Figure: client sends a request to the replicated server s1, s2, s3 and receives a response]

Introduction to FT (6)
Comparison of the three techniques:
- Only replication masks crashes (i.e., ensures high availability)
- Transactions and checkpointing: progress is only possible after recovery of the crashed process
- We will concentrate on replication

Outline
- Introduction to fault tolerance
- Replication for fault tolerance
- Group communication for replication
- Implementation of group communication

Context
Replication in the general context of client-server interaction
[Figure: several clients c send requests to a server s and receive responses]

Context (2)
Fault tolerance by replicating the server
[Figure: several clients c interact with server s, replicated as s1, s2, s3]

Correctness criterion: linearizability
- Need to keep the replicas consistent
- Consistency is defined in terms of the operations executed by the clients
[Figure: clients ci and cj issue operation op on replicas S1, S2, S3]

Correctness criterion: linearizability (2)
Example:
- Server: FIFO queue
- Operations: enqueue / dequeue

Correctness criterion: linearizability (3)
Example 1
[Figure: timing diagram of processes p and q executing enq(a), enq(b), and deq(); a linearizable execution]

Correctness criterion: linearizability (4)
Example 2
[Figure: timing diagram of processes p and q executing enq(a), enq(b), and deq(); a non-linearizable execution]
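Since the details of the two timing diagrams did not survive extraction, here is a small brute-force check that makes the criterion concrete (an illustrative sketch, not from the talk): an execution of complete operations is linearizable if some permutation of the operations respects both the real-time order and the sequential semantics of the FIFO queue.

from itertools import permutations

# Each operation: (start, end, name, value). Illustrative history in which
# enq(a) finishes before enq(b) starts, yet deq() returns b.
history = [
    (0, 1, "enq", "a"),   # p: enq(a)
    (2, 3, "enq", "b"),   # q: enq(b)
    (4, 5, "deq", "b"),   # p: deq() -> b
]

def respects_real_time(seq):
    # an op must not be placed before another op that ended before it started
    return all(not (seq[j][1] < seq[i][0])
               for i in range(len(seq)) for j in range(i + 1, len(seq)))

def legal_fifo(seq):
    q = []
    for _, _, name, val in seq:
        if name == "enq":
            q.append(val)
        elif not q or q.pop(0) != val:   # deq must return the head
            return False
    return True

linearizable = any(respects_real_time(s) and legal_fifo(s)
                   for s in permutations(history))
print("linearizable" if linearizable else "not linearizable")  # not linearizable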

Replication techniques
Two main replication techniques for linearizability:
- Primary-backup replication (or passive replication)
- Active replication (or state-machine replication)

Primary-backup replication
[Figure: the client sends its request to S1 (primary); S1 processes it, sends the update to the backups S2 and S3, and returns the response]

Primary-backup replication (2)
Crash cases
[Figure: same setup, with crash points 1, 2, 3 marking the primary crashing at different moments of the request / update / response flow]

Primary-backup replication (3)
- Crash detection is usually based on timeouts
- Timeouts may lead to incorrectly suspecting that a process has crashed
- The technique must work correctly even if processes are incorrectly suspected to have crashed

Active replication
- The client sends its request to all replicas S1, S2, S3 and waits for the first reply
- Crash of a server replica is transparent to the client
[Figure: client sends the request to S1, S2, S3; each replica responds]

Active replication (2)
- If there is more than one client, the requests must be received by all the replicas in the same order
- This is ensured by a communication primitive called total order broadcast (or atomic broadcast)
- The complexity of active replication is hidden in the implementation of this primitive
[Figure: clients C and C' send requests to replicas S1, S2, S3]
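To see why total order suffices, note that each replica is a deterministic state machine: identical ordered inputs produce identical states. A minimal sketch (illustrative code, not from the talk; the total order broadcast is assumed and represented by a shared list of ordered requests):

class Replica:
    def __init__(self):
        self.queue = []                    # replicated FIFO-queue state

    def apply(self, op, arg=None):         # must be deterministic
        if op == "enq":
            self.queue.append(arg)
            return "ok"
        return self.queue.pop(0) if self.queue else None

# Stands in for the output of total order broadcast: every replica
# sees the same requests in the same order.
ordered_requests = [("enq", "a"), ("enq", "b"), ("deq", None)]

replicas = [Replica() for _ in range(3)]
for op, arg in ordered_requests:
    replies = [r.apply(op, arg) for r in replicas]
    result = replies[0]                    # the client keeps the first reply
assert all(r.queue == replicas[0].queue for r in replicas)  # identical states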

Outline
- Introduction to fault tolerance
- Replication for fault tolerance
- Group communication for replication
- Implementation of group communication

Role of group communication
- Active replication: requires a communication primitive that orders client requests
- Passive replication: we have shown the issues to be addressed
- Group communication: a communication infrastructure that provides solutions to these problems
[Figure: layered stack with the replication technique on top of group communication, itself on top of the transport layer]

Role of group communication (2)
- Let g be a group with members p, q, r
- multicast(g, m): allows m to be multicast to g without knowing the membership of g
- IP multicast (UDP) also provides such a feature
- Group communication provides stronger guarantees (reliability, order, etc.)

Role of group communication (3)
We will discuss:
- Group communication for active replication
- Group communication for passive replication

Atomic broadcast for active replication
Group communication primitive for active replication: atomic broadcast (sometimes denoted abcast), also called total order broadcast
[Figure: layered stack; the replication technique calls abcast(g, m) and receives adeliver(m) from the group communication layer, which in turn uses send(m) to p and receive(m) at the transport layer]

Atomic broadcast for active replication (2)
[Figure: clients C and C' issue abcast(gS, req) and abcast(gS, req') to the group gS = {S1, S2, S3}]

Atomic broadcast: specification
- If p executes abcast(g, m) and does not crash, then all processes in g eventually adeliver m
- If some process in g adelivers m and does not crash, then all processes in g that do not crash eventually adeliver m
- If p and q in g adeliver m and m', then they adeliver them in the same order
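For intuition about the ordering property, here is a fixed-sequencer sketch, a well-known but non-fault-tolerant way to obtain total order (illustrative only; it is not the consensus-based implementation discussed later in the talk, since a crash of the sequencer blocks the group):

import itertools

class Sequencer:
    def __init__(self):
        self.counter = itertools.count()

    def order(self, m):
        return (next(self.counter), m)        # assign a global sequence number

class Member:
    def __init__(self):
        self.pending = {}                     # seqno -> message
        self.next_seqno = 0
        self.adelivered = []

    def receive(self, seqno, m):
        self.pending[seqno] = m
        while self.next_seqno in self.pending:   # adeliver in seqno order
            self.adelivered.append(self.pending.pop(self.next_seqno))
            self.next_seqno += 1

seq = Sequencer()
members = [Member() for _ in range(3)]
for m in ["req", "req'"]:
    sn, msg = seq.order(m)
    for member in members:                    # even if the network reorders,
        member.receive(sn, msg)               # seqnos enforce the same order
assert all(mb.adelivered == members[0].adelivered for mb in members)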

Role of group communication (3)
We will discuss:
- Group communication for active replication
- Group communication for passive replication

Generic broadcast for passive replication
- Atomic broadcast can also be used to implement passive replication
- Better solution: generic broadcast
- Same specification as atomic broadcast, except that not all messages are ordered:
  - Generic broadcast is based on a conflict relation on the messages
  - Conflicting messages are ordered; non-conflicting messages are not

Generic broadcast for passive replication (2)
Two types of messages:
- Update
- PrimaryChange: issued if a process suspects the current primary
  - Upon delivery: cyclic permutation of the process list
  - New primary: the process at the head of the process list
Conflict relation:
                  PrimaryChange   Update
  PrimaryChange   no conflict     conflict
  Update          conflict        conflict

Generic broadcast for passive replication (3)
[Figure: the client sends its request to S1 (primary); Update and PrimaryChange messages are generic-broadcast among S1, S2, S3]

Generic broadcast for passive replication (4)
Another way to view the strategy:
- Replicas numbered 0 .. n-1
- Computation divided into rounds
- In round r, the primary is the process with number r mod n
- When process p suspects the primary of round r, it broadcasts newRound (= PrimaryChange)
Conflict relation:
              newRound      Update
  newRound    no conflict   conflict
  Update      conflict      conflict
All processes deliver an Update in the same round
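The conflict relation is simply a predicate over pairs of message types; a minimal sketch (illustrative representation, not from the talk):

# Generic broadcast must order a delivered pair of messages only if
# conflicts() holds for it; two newRound messages commute.
def conflicts(m1, m2):
    t1, t2 = m1["type"], m2["type"]
    return not (t1 == "newRound" and t2 == "newRound")

assert not conflicts({"type": "newRound"}, {"type": "newRound"})
assert conflicts({"type": "newRound"}, {"type": "Update"})
assert conflicts({"type": "Update"}, {"type": "Update"})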

Other group communication issues
To discuss:
- Static vs. dynamic groups
- Crash-stop vs. crash-recovery model

Static vs. dynamic groups
- Static group: a group whose membership does not change during the lifetime of the system
- A static group is sometimes too limiting from a practical point of view: one may want to replace a crashed replica with a new replica
- Dynamic group: a group whose membership changes during the lifetime of the system
- Dynamic groups require addressing two problems:
  - How to add / remove processes from the group (the group membership problem)
  - The semantics of the communication primitives

Group membership problem
Need to distinguish:
- (1) the group
- (2) the membership of the group at time t
New notion: the view of group g, defined by (i, s):
- i identifies the view
- s: the membership of view i (a set of processes)
- the membership of view i is also denoted v_i

Group membership problem (2)
Changing the membership of a group:
- add(g, p): for adding p to group g
- remove(g, p): for removing p from group g
Usual requirement: all the members of a group see the same sequence of views:
if (i, s_p) is view i at p and (i, s_q) is view i at q, then s_p = s_q
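A minimal data-type sketch of views and the same-sequence requirement (illustrative, not from the talk):

from dataclasses import dataclass

@dataclass(frozen=True)
class View:
    view_id: int                  # i: identifies the view
    members: frozenset            # s: membership of view i

def add(view, p):                 # install the next view with p added
    return View(view.view_id + 1, view.members | {p})

def remove(view, p):              # install the next view with p removed
    return View(view.view_id + 1, view.members - {p})

def consistent(history_p, history_q):
    # views with the same id must have the same membership at p and q
    views_p = {v.view_id: v.members for v in history_p}
    views_q = {v.view_id: v.members for v in history_q}
    return all(views_p[i] == views_q[i]
               for i in views_p.keys() & views_q.keys())

v0 = View(0, frozenset({"p", "q", "r"}))
v1 = remove(v0, "r")              # e.g., r is suspected to have crashed
assert consistent([v0, v1], [v0, v1])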

Process recovery
Usual theoretical model: crash-stop
- processes do not have access to stable storage (disk)
- if a process crashes, its whole state is lost (i.e., no recovery is possible)
Drawback of the static crash-stop model: with a non-null crash probability, eventually all processes will have crashed
Drawback of the dynamic crash-stop model: not tolerant to catastrophic failures (crash of all the processes in a group)
To tolerate catastrophic failures: the crash-recovery model

Combining the different GC models
Combining static/dynamic with crash-stop/crash-recovery:
1. Static groups in the crash-stop model (considered in most theoretical papers)
2. Dynamic groups in the crash-stop model (considered in most existing GC systems)
3. Static groups in the crash-recovery model
4. Dynamic groups in the crash-recovery model

Quorum systems vs. group communication
Quorum system: the context of read/write operations
Typical situation:
- Access a majority of replicas to perform a read operation
- Access a majority of replicas to perform a write operation

Quorum systems vs. group communication
Example: a replicated server managing an integer with read/write operations (called a register)
Operation to perform: increment

Quorum systems vs. group communication (2)
[Figure: side-by-side diagrams. With a quorum system, the client performs Read, a local Increment, then Write on s1, s2, s3, inside a critical section. With group communication, the client simply broadcasts Increment to s1, s2, s3]
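A sketch of the contrast (illustrative, not from the talk): with a read/write quorum register, increment is a read-modify-write, so concurrent clients need mutual exclusion; with group communication, the operation itself is broadcast and each replica applies it.

replicas = [0, 0, 0]                   # three copies of the integer

# Quorum style: read a majority, increment locally, write back a majority
# (version numbers omitted). Two clients interleaving this sequence can
# lose an update, hence the critical section in the slide.
def quorum_increment():
    value = max(replicas[:2])          # read quorum: 2 of 3 replicas
    for i in range(2):                 # write quorum
        replicas[i] = value + 1

# Group communication style: broadcast the operation, not the state; the
# loop stands in for abcast + adeliver at every replica.
def abcast_increment():
    for i in range(len(replicas)):
        replicas[i] += 1

quorum_increment()
abcast_increment()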

Outline
- Introduction to fault tolerance
- Replication for fault tolerance
- Group communication for replication
- Implementation of group communication

Implementation of group communication
Context:
- Static groups
- Crash-stop model
- Non-malicious processes
[Figure: atomic broadcast and generic broadcast both implemented on top of consensus]
Common denominator: consensus

Consensus (informal)
[Figure: processes p1, p2, p3 start with different values and agree on one of them]

Consensus (formal)
A set Π of processes, each p_i ∈ Π with an initial value v_i
Three properties:
- Validity: if a process decides v, then v is the initial value of some process
- Agreement: no two processes decide differently
- Termination: every process eventually decides some value

Impossibility result
Asynchronous system:
- No bound on the message transmission delay
- No bound on the relative process speeds
Fischer-Lynch-Paterson impossibility result (1985): consensus is not solvable by a deterministic algorithm in an asynchronous system, even with reliable links, if a single process may crash

Impossibility of atomic broadcast
By contradiction: atomic broadcast would solve consensus (each process abcasts its initial value and decides on the first value adelivered), contradicting the FLP result
[Figure: p1, p2, p3 execute abcast(4), abcast(7), abcast(1); each decides the first adelivered value, adeliver(7)]

Models for solving consensus
Synchronous system:
- There is a known bound on the transmission delay of messages
- There is a known bound on the relative process speeds
Consensus is solvable with up to n-1 faulty processes
Drawback: requires pessimistic (large) bounds, and large bounds lead to a long blackout period in case of a crash

Models for solving consensus (2)
Partially synchronous system (Dwork, Lynch, Stockmeyer, 1988). Two variants:
- There is a bound on the message transmission delay and on the relative process speeds, but these bounds are not known
- There are known bounds on the message transmission delay and on the relative process speeds, but these bounds hold only from some unknown point on
Consensus is solvable with a majority of correct processes

Models for solving consensus (3)
Failure detector model (Chandra, Toueg, 1995):
- Asynchronous model augmented with an oracle (called a failure detector) defined by abstract properties
- A failure detector is defined by a completeness property and an accuracy property
[Figure: processes p, q, r, each with a local failure detector module FD, connected by a network]

Failure detector model (2)
Example: failure detector ◊S
- Strong completeness: eventually every process that crashes is permanently suspected to have crashed by every correct process
- Eventual weak accuracy: there is a time after which some correct process is never suspected by any correct process
Consensus is solvable with ◊S and a majority of correct processes
Drawback: requires reliable links
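In practice, a ◊S-like failure detector is commonly approximated with heartbeats and adaptive timeouts; a minimal single-process sketch under assumed eventual partial synchrony (all names are illustrative):

import time

# Suspect a process when its heartbeat is overdue; on a wrong suspicion,
# unsuspect it and double its timeout, so that if the system is eventually
# partially synchronous, wrong suspicions eventually stop.
class FailureDetector:
    def __init__(self, processes, initial_timeout=1.0):
        self.timeout = {p: initial_timeout for p in processes}
        self.last_heartbeat = {p: time.monotonic() for p in processes}
        self.suspected = set()

    def on_heartbeat(self, p):
        self.last_heartbeat[p] = time.monotonic()
        if p in self.suspected:        # we suspected p wrongly: back off
            self.suspected.discard(p)
            self.timeout[p] *= 2

    def check(self):
        now = time.monotonic()
        for p, last in self.last_heartbeat.items():
            if now - last > self.timeout[p]:
                self.suspected.add(p)
        return self.suspected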

RbR model
Joint work with Bernadette Charron-Bost. A combination of:
- Gafni 1998, Round-by-round failure detectors (unifies the synchronous model and the failure detector model)
- Santoro and Widmayer 1989, Time is not a healer (shows that dynamic faults have the same destructive effect on solving consensus as asynchrony)
Existing models for solving consensus have put the emphasis on static faults; our new model handles static faults and dynamic faults in the same way

RbR machine
- Distributed algorithm A on the set Π of processes
- Communication predicate P on the HO sets
- In round r, process p receives messages from the set HO(p, r)
Examples:
- P1 : ∀p ∈ Π, ∀r > 0 : |HO(p, r)| > n/2
- P2 : ∃r0, ∃HO ∈ 2^Π, ∀p ∈ Π : HO(p, r0) = HO
A problem is solved by a pair (A, P)
[Figure: in round r, process p moves from state st to state st']
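Such predicates can be checked directly on a concrete HO assignment; a small sketch with hypothetical data (illustrative only):

# ho[r][p] is the set of processes that p hears from in round r.
n = 3
processes = ["p1", "p2", "p3"]
ho = {
    1: {"p1": {"p1", "p2"}, "p2": {"p2", "p3"}, "p3": {"p1", "p3"}},
    2: {"p1": {"p1", "p2", "p3"}, "p2": {"p1", "p2", "p3"},
        "p3": {"p1", "p2", "p3"}},
}

# P1: in every round, every process hears from a majority.
p1_holds = all(len(ho[r][p]) > n / 2 for r in ho for p in processes)

# P2: in some round, all processes hear from exactly the same set.
p2_holds = any(len({frozenset(ho[r][p]) for p in processes}) == 1 for r in ho)

print(p1_holds, p2_holds)   # True True (round 2 is uniform)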

Round-by-Round (RbR) model (2)
- Assume q ∉ HO(p, r): in the RbR model, there is no attempt to justify this (e.g., channel failure, crash of q, send-omission failure of q, etc.)
- This removes the difference between static and dynamic faults (static: crash-stop; dynamic: channel fault, crash-recovery)
- In the RbR model, there is no notion of a faulty process

Consensus algorithms
Many have been published. The most influential algorithms:
- The consensus algorithm based on the failure detector ◊S (Chandra-Toueg, 1995)
- Paxos (Lamport, 1998)

Consensus algorithms (2)
The CT (Chandra-Toueg) and Paxos algorithms have strong similarities but also important differences:
- CT is based on a rotating coordinator / Paxos on a dynamically chosen leader
- CT requires reliable links / Paxos tolerates lossy links
- The condition for liveness is clearly defined in CT
- The condition for liveness is less formally defined in Paxos (it can be formally expressed in the RbR model)

Paxos in the RbR model

Initialization:
  x_p := v_p
  vote_p ∈ V ∪ {?}, initially ?
  voteToSend_p := false; decided_p := false; ts_p := 0

Round r = 4Φ - 3:
  S_p^r: send ⟨x_p, ts_p⟩ to leader_p(Φ)
  T_p^r: if p = leader_p(Φ) and # of ⟨x, ts⟩ received > n/2 then
           let θ be the largest ts among the ⟨v, ts⟩ received
           vote_p := one v such that ⟨v, θ⟩ was received
           voteToSend_p := true

Round r = 4Φ - 2:
  S_p^r: if p = leader_p(Φ) and voteToSend_p then send ⟨vote_p⟩ to all processes
  T_p^r: if ⟨v⟩ received from leader_p(Φ) then x_p := v; ts_p := Φ

Round r = 4Φ - 1:
  S_p^r: if ts_p = Φ then send ⟨ack⟩ to leader_p(Φ)
  T_p^r: if p = leader_p(Φ) and # of ⟨ack⟩ received > n/2 then
           DECIDE(vote_p); decided_p := true

Round r = 4Φ:
  S_p^r: if p = leader_p(Φ) and decided_p then send ⟨vote_p⟩ to all processes
  T_p^r: if ⟨v⟩ received and not decided_p then DECIDE(v); decided_p := true
         voteToSend_p := false

Paxos (2)

Condition for safety: none
Condition for liveness: ∃Φ0 > 0 : ∀p : |HO(p, Φ0)| > n/2
                        and ∀p, q : leader_p(Φ0) = leader_q(Φ0)
                        and leader_p(Φ0) ∈ K(Φ0)

Kernel of round r: K(r) = ∩_{p ∈ Π} HO(p, r)
Kernel of phase Φ: K(Φ) = ∩_{r ∈ Φ} K(r)   (a phase is a sequence of rounds)

Atomic broadcast
- A lot of algorithms have been published
- Easy to implement using a sequence of instances of consensus
Consider atomic broadcast within a static group g:
- Consensus within g on a set of messages
- Let consensus #k decide on the set Msg(k)
- Each process delivers the messages in Msg(k) before those in Msg(k+1)
- Each process delivers the messages in Msg(k) in a deterministic order (e.g., in the order of the message ids)

Atomic broadcast (2)
Leads to the following algorithm:
- Each process p manages a counter k and a set undeliveredMessages
- Upon abcast(m): broadcast(m)
- Upon reception of m:
  - add m to undeliveredMessages
  - if no consensus instance is running then
    - Msg(k) ← consensus #k on undeliveredMessages
    - deliver the messages in Msg(k) in some deterministic order
    - remove the delivered messages from undeliveredMessages and increment k
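A minimal sequential sketch of this reduction (illustrative, not from the talk; consensus(k, proposal) is a stub that simply adopts the first proposal for instance #k, standing in for a real consensus algorithm such as the Paxos variant above):

# Decision of each consensus instance, shared by all processes via the stub.
decisions = {}

def consensus(k, proposal):
    return decisions.setdefault(k, frozenset(proposal))

class Process:
    def __init__(self):
        self.k = 0                      # next consensus instance number
        self.undelivered = set()
        self.adelivered = []

    def on_receive(self, m):            # m was broadcast by some abcast(m)
        self.undelivered.add(m)
        msg_k = consensus(self.k, self.undelivered)
        # deliver Msg(k) in a deterministic order (here: message id order)
        self.adelivered += sorted(msg_k - set(self.adelivered))
        self.undelivered -= msg_k
        self.k += 1

procs = [Process() for _ in range(3)]
for m in ["m2", "m1", "m3"]:            # broadcast: every process receives m
    for p in procs:
        p.on_receive(m)
assert all(p.adelivered == procs[0].adelivered for p in procs)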

Conclusion
A necessarily superficial presentation of group communication. Comments:
- The static/crash-stop model has reached maturity
- Maturity has not yet been reached in the other models (static/crash-recovery, dynamic/crash-stop, ...)
- With the RbR model we hope to bridge the gap between static/crash-stop and static/crash-recovery (ongoing implementation work)
- More work is needed to quantitatively compare the various atomic broadcast algorithms (and other algorithms)