Timeliness, Failure Detectors, and Consensus Performance
Alex Shraer, joint work with Dr. Idit Keidar
Technion – Israel Institute of Technology
PODC 2006

How do you survive failures and achieve high availability?

Replication

State Machine Replication
Replicas are identical deterministic state machines: processing operations in the same order keeps them consistent.
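
To make the slide concrete, here is a minimal sketch (illustrative, not from the talk; the Counter machine and operation names are invented) of two replicas that stay consistent by applying the same operations in the same order.

    # Two identical deterministic state machines: applying the same
    # operations in the same order keeps them consistent.
    class Counter:
        def __init__(self):
            self.state = 0

        def apply(self, op):
            # Deterministic transition: same op on same state gives same result.
            if op == "inc":
                self.state += 1
            elif op == "double":
                self.state *= 2

    ops = ["inc", "inc", "double"]     # agreed-upon order (e.g., via consensus)
    a, b = Counter(), Counter()
    for op in ops:
        a.apply(op)
        b.apply(op)
    assert a.state == b.state == 4     # replicas agree: (0 + 1 + 1) * 2 = 4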

Consensus
Building block for state machine replication. Each process has an input and should decide on an output so that:
– Agreement: decisions are the same
– Validity: the decision is the input of one of the processes
– Termination: eventually, all correct processes decide
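
To see what these properties require, here is a tiny checker (an illustration, not from the talk; check_consensus and its arguments are invented names) that validates agreement and validity on a finished run.

    def check_consensus(inputs, decisions):
        # inputs:    process id -> proposed value
        # decisions: process id -> decided value (for processes that decided)
        agreement = len(set(decisions.values())) <= 1
        validity = all(v in inputs.values() for v in decisions.values())
        # Termination is a liveness property (every correct process eventually
        # decides); it cannot be checked on a finite snapshot like this one.
        return agreement and validity

    assert check_consensus({1: "a", 2: "b", 3: "a"}, {1: "a", 2: "a", 3: "a"})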

Basic Model
Message passing. Links between every pair of processes; links do not create, duplicate, or alter messages (integrity). Both processes and links may fail.

Synchronous Model
Very convenient for algorithms: performance is easy to understand, and decision is early with no/few failures. [Figure: processes a, b, c, d exchanging messages in synchronous rounds.]

Synchronous Model
A known bound Δ on message delay and processing time. Very convenient for algorithms, but requires very conservative timeouts: in practice, the average latency can be less than 1/100 of the maximum latency [Cardwell, Savage, Anderson 2000], [Bakr, Keidar 2002], so computation might be far too slow!

Asynchronous Model
Unbounded message delay. Much more practical, but fault-tolerant consensus is impossible [FLP85].

The DLS Solution [Dwork, Lynch, Stockmeyer 88]
An asynchronous fault-tolerant service you can live with: always safe, no matter how slow the network is, and usually live. What does “usually” mean?

Eventual Synchrony (DLS)
Formally captures liveness conditions: bounds are known, but hold only eventually, from some unknown time GST onward. “Eventually” formally models “most of the time” (in stable periods); in practice, the bounds need not hold forever, just “long enough” for the algorithm to finish its task.

Unreliable Failure Detectors [Chandra, Toueg 96]
Asynchronous distributed system (no bounds), with reliable links (sent messages eventually arrive at correct processes). Each process has a local failure detector oracle, which typically outputs the list of processes suspected to have crashed at any given time. Unreliable: the failure detector’s output can be arbitrary for an unbounded (but finite) prefix of the run.

Failure Detectors: Examples
◊S (Eventually Strong):
– Strong Completeness: from some point on, every faulty process is suspected by every correct process
– Eventual Weak Accuracy: from some point on, some correct process is not suspected
Ω (Leader):
– outputs one trusted process
– from some point on, all correct processes trust the same correct process
The two are equivalent, and are the weakest failure detectors for solving consensus.
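
For intuition, here is a minimal sketch of the standard heartbeat-and-timeout construction of a failure detector, with Ω on top. This illustrates the idea only; it is not code from the talk, and all names are invented.

    import time

    class HeartbeatFailureDetector:
        def __init__(self, process_ids, timeout):
            self.timeout = timeout                     # seconds of silence tolerated
            self.last_heard = {p: time.time() for p in process_ids}

        def on_heartbeat(self, p):
            # Call whenever a heartbeat message from process p arrives.
            self.last_heard[p] = time.time()

        def suspects(self):
            # Unreliable output: may wrongly suspect slow processes, but a
            # crashed process stops sending and is eventually suspected
            # forever (strong completeness).
            now = time.time()
            return {p for p, t in self.last_heard.items() if now - t > self.timeout}

        def leader(self):
            # Omega on top of the suspect list: trust the smallest-id process
            # not currently suspected. Once suspicions stabilize, all correct
            # processes trust the same correct process.
            trusted = set(self.last_heard) - self.suspects()
            return min(trusted) if trusted else None

    fd = HeartbeatFailureDetector(process_ids=[0, 1, 2], timeout=2.0)
    fd.on_heartbeat(1)
    print(fd.suspects())   # empty now; a silent process appears after 2 seconds
    print(fd.leader())     # 0, until 0 stays silent long enough to be suspected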

Eventually Stable (Indulgent) Models
Initially asynchronous, for an unbounded period of time; eventually, stabilization is reached at GST (Global Stabilization Time), after which certain assumptions hold. Examples:
– ES (Eventual Synchrony): starting from GST, all links have a bound on message delay [Dwork, Lynch, Stockmeyer 88]
– failure detectors, e.g., the Ω (leader) failure detector: outputs one trusted process, and from some point on, all correct processes trust the same correct process [Chandra, Toueg 96], [Chandra, Hadzilacos, Toueg 96]

Indulgent Models: Research Trend
Weaken the post-GST assumptions as much as possible [Guerraoui, Schiper 96], [Aguilera et al. 03, 04], [Malkhi et al. 05]. Weaker = better?

Indulgent Models: Research Trend
“You only need ONE machine with eventually ONE timely link. Buy the hardware to ensure it, set the timeout accordingly, and EVERYTHING WILL WORK.”

Consensus with Weak Assumptions
[Cartoon: “Why isn’t anything happening???” “Don’t worry! It will eventually happen!”]

What’s Going On?
In practice, the bounds just need to hold “long enough” for the algorithm to finish, i.e., for time T_A. But T_A depends on our synchrony assumptions: with weak assumptions, T_A might be unbounded. For practical systems, eventual completion of the job is not enough!

Our Goal
Understand the relationship between:
– assumptions (one timely link, failure detectors, etc.) that eventually hold
– the performance of algorithms that exploit these assumptions, and only them
Challenge: how do we understand the performance of asynchronous algorithms that make very different assumptions?

Typical Metric: Count “Rounds”
Algorithms normally progress in rounds, though rounds are not synchronized among processes. At process p_i:

    forever do
        send messages
        receive messages while (!some condition)
        compute...

Previous work looks at synchronous runs (every message takes exactly δ time) and counts rounds or “δs” [Keidar, Rajsbaum 01], [Dutta, Guerraoui 02], [Guerraoui, Raynal 04], [Dutta et al. 03], etc.

Are All “Rounds” the Same?
Algorithm 1 waits, in each round, for messages from a majority that includes a pre-defined leader; it takes 3 rounds. Algorithm 2 waits, in each round, for messages from all (unsuspected) processes (e.g., group membership); it takes 2 rounds.

Do All Rounds Cost the Same?
[Cartoon, “LAN market”: apples $1.00, oranges $1.00.]

Do All “Rounds” Cost the Same?
On the Internet, n² timely links can be a rarity [Bakr, Keidar 02]. Timely communication with a leader, or with a majority, requires timeouts that are orders of magnitude smaller. [Cartoon, “WAN market”: apples $1.00, oranges $100.00.]

GIRAF: General Round-based Algorithm Framework
Inspired by Gafni’s RRFD, which it generalizes. Organizes algorithms into rounds and separates the algorithm logic from the waiting condition; the waiting condition defines the model. This allows reasoning about lower and upper bounds for rounds of different types.

GIRAF – The Generic Algorithm
The waiting condition is controlled by the environment; your pet algorithm plugs in via initialize and compute.

    Algorithm for process p_i:
        upon receive m:
            add m to M (message buffer)
        upon end-of-round:
            FD ← oracle(k)
            if (k = 0) then
                out_msg ← initialize(FD)
            else
                out_msg ← compute(k, M, FD)
            k ← k + 1
            enable sending of out_msg to all
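
A direct Python rendition of this skeleton may help; it is an illustrative sketch, not the paper’s code, and the class, oracle, and network names are assumptions. The point is that on_end_of_round is invoked by the environment when the model’s waiting condition holds, so the same algorithm logic runs unchanged under different models.

    class GirafProcess:
        # Generic round structure; the algorithm object supplies only
        # initialize(FD) and compute(k, M, FD).
        def __init__(self, pid, algorithm, oracle, network):
            self.pid = pid
            self.algorithm = algorithm
            self.oracle = oracle        # failure detector oracle, queried each round
            self.network = network      # assumed to offer send_to_all(...)
            self.k = 0                  # round number
            self.M = []                 # message buffer

        def on_receive(self, m):
            # Messages may arrive in any round; just buffer them.
            self.M.append(m)

        def on_end_of_round(self):
            # Invoked by the environment once the model's waiting condition
            # holds (e.g., "heard from a majority including the leader").
            fd = self.oracle(self.k)
            if self.k == 0:
                out_msg = self.algorithm.initialize(fd)
            else:
                out_msg = self.algorithm.compute(self.k, self.M, fd)
            self.k += 1
            self.network.send_to_all(self.pid, self.k, out_msg)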

GIRAF’s Generality
Does not require rounds to be synchronized among processes. Can capture any oracle model in the general failure detector model of [CHT96], e.g., a leader oracle plus a majority in each round. Messages can arrive in any round, which allows for untimely albeit reliable links.

Defining Properties in GIRAF
The environment can have perpetual properties and eventual properties. In every run r, there exists a round GSR(r): the first round from which no process fails and all eventual properties hold in each round.

Defining Properties
Timeliness of incoming, outgoing, and bidirectional links; some known failure detector properties. These properties are used to clearly define models.

Some Results: Context
The consensus problem, with the global decision time metric: the time until all correct processes decide. Message passing; crash failures (up to t < n/2 of the processes may crash).

◊LM Model: Leader and Majority
Nothing is required before GSR. In every round k ≥ GSR, every correct process receives a round-k message from a majority of processes, one of which is the Ω-leader. In practice, this requires much shorter timeouts than Eventual Synchrony [Bakr, Keidar].
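
In GIRAF terms, ◊LM is just a waiting condition over round-k messages. Here is a sketch of the predicate an environment could evaluate (names invented for illustration):

    def lm_waiting_condition(received_from, n, leader):
        # received_from: ids of processes whose round-k message has arrived.
        # From GSR on, the round may end once a majority has been heard from
        # and that majority includes the current Omega-leader.
        return len(received_from) > n // 2 and leader in received_from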

◊LM: Previous Work
Most Ω-based algorithms wait for a majority in each round, which ◊LM does not guarantee before GSR. Paxos [Lamport 98] does work in ◊LM: it takes a constant number of rounds in Eventual Synchrony (ES), but how many rounds without ES?

Paxos Run in ES
[Message-flow diagram: the Ω-leader sends (“prepare”, 21); the processes answer yes; the leader sends (Commit, 21, v1); and v1 is decided. BallotNum counts the attempts to decide initiated by leaders.]

Paxos in ◊LM (without ES)
[Diagram: in rounds GSR, GSR+1, GSR+2, GSR+3, the Ω-leader’s prepares with ballots 2, 9, 14, ... are answered no(5), no(8), no(13), ... as processes report ever-higher ballot numbers. Commit takes O(n) rounds!]

What Can We Hope For?
The tight lower bound for ES is 3 rounds from GSR [DGK05]. ◊LM is weaker than ES, so one might expect consensus to take longer in ◊LM than in ES.

Result 1: You Don’t Need ES
A leader and a majority can give you the same performance: an algorithm that matches the lower bound for ES!

Our ◊LM Algorithm in a Nutshell
Commit with increasing ballot numbers and decide on a value committed by a majority (like Paxos, etc.).
– Challenge: a process doesn’t know all ballots, so how can it choose its new ballot to be the highest? Solution: use the round number as the ballot.
– Challenge: rounds are wasted if a prepare/commit fails. Solution: pipeline prepares and commits, trying in each round.
– Challenge: do processes really need to say no? Solution: support the leader’s prepare even when holding a higher ballot number.
– Challenge: a higher ballot number may reflect a later decision; won’t agreement be compromised? Solution: a new field, “trustMe”, ensures a supported leader doesn’t miss real decisions.
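
A fragment illustrating only the first idea (invented for illustration, not the paper’s code): since the round counter is common to all processes and only grows, the round number itself serves as a ballot that dominates all earlier ballots.

    def next_ballot(k):
        # Ballot for a commit attempted in round k: rounds strictly increase
        # and are shared by all processes, so this ballot is at least as high
        # as any ballot from an earlier round, with no extra communication.
        return k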

Example Run
[Diagram: commits before GSR do not lead to decision; then all processes PREPARE, all COMMIT, and all DECIDE within three rounds of GSR.]

Question 2: Are ◊S and Ω Equivalent?
◊S and Ω are equivalent in the “classical” sense [Chandra, Hadzilacos, Toueg 96]: both are weakest for consensus. ◊S: eventually (from GSR onward), all faulty processes are suspected by every correct process, and there exists one correct process that is not suspected by any correct process. Can we substitute ◊S for Ω in ◊LM?

Result 2: ◊S and Ω Are Not That Equivalent
With ◊S, consensus takes linear time from GSR, shown by reduction to the mobile failure model [Santoro, Widmayer 89].

Result 3: Do We Need Oracles?
Timely communication with a majority suffices! ◊AFM (All-From-Majority), simplified: in every round k ≥ GSR, every correct process p receives a round-k message from a majority of processes, and p’s message reaches a majority of processes. Decision in 5 rounds from GSR: the first constant-time algorithm without an oracle or ES. The idea: information passes to all nodes in 2 rounds.
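
A toy check (not the paper’s proof) of why two rounds suffice: p’s round-1 message reaches some majority M, and in round 2 every process hears from some majority Q; any two majorities intersect, so every process hears from a member of M, which can relay p’s information.

    from itertools import combinations

    n = 5
    majorities = [set(c) for c in combinations(range(n), n // 2 + 1)]
    # Round 1: p's message reaches some majority M of processes.
    # Round 2: each process hears from some majority Q.
    # |M| + |Q| > n forces M and Q to intersect, so p's information arrives.
    assert all(M & Q for M in majorities for Q in majorities)
    print("any two majorities of", n, "processes intersect")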

Result 4: Can We Assume Less?
◊MFM (Majority-From-Majority): only a majority of the processes receive round-k messages from a majority; the rest receive a message from a minority. Only a little is missing for ◊AFM, and ◊MFM is stronger than models in the literature [Aguilera et al. 03, 04], [Malkhi et al. 05]. Yet bounded time from GSR is impossible!

Conclusions
Which guarantees should one implement?
– Weaker ≠ better: some previously suggested assumptions are too weak.
– Sometimes a little stronger = much better: worth longer timeouts / better hardware.
– ES is not essential: not worth longer timeouts / better hardware.
– Future: more models and bounds to explore with GIRAF.