
On the Weakest Failure Detector Ever
Petr Kouznetsov (Max Planck Institute for SWS)
Joint work with: Rachid Guerraoui (EPFL), Maurice Herlihy (Brown), Nancy Lynch (MIT), Calvin Newport (MIT)
© 2007 P. Kouznetsov

Big picture
Choosing a model:
- Optimistic model: the system is very efficient but likely to fail
- Conservative model: the system is very robust but inefficient (or impossible to implement)
What is the right model?

Synchrony assumptions
- Asynchronous read-write shared-memory model: no bounds on relative processing speed
- Very appealing in practice!
- Too conservative: most problems are not solvable [FLP85, LA87; HS93, SZ93, BG93] (though they are solvable in synchronous systems)

So what do we need exactly?
What is the minimal amount of synchrony that circumvents some asynchronous impossibility?
- “Minimal amount of synchrony”? The weakest failure detector

Model
Asynchronous read-write shared-memory system with failure detectors
[Figure: processes p, q, r and a failure detector FD]

Comparing failure detectors
Failure detector D is weaker than failure detector D’ if there exists an algorithm that emulates D using D’
[Figure: processes p, q, r emulating D on top of D’]

The weakest non-trivial failure detector
A failure detector X that is:
- non-trivial: circumvents some asynchronous impossibility
- weaker than any non-trivial failure detector
The “easiest” non-trivial problem?

A Very Weak Failure Detector
Y outputs a non-empty set of process ids
Eventually, the same set U is output at every correct process, and U is not the set of correct processes
Example:
- Π = {p,q,r}, C = {p,q}
- Y may output, e.g., {p}, {q}, {p,r}, {q,r}, {p,q,r}
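The example above can be checked mechanically. The following is a minimal sketch, assuming Y's stable output may be any non-empty subset of Π other than the set of correct processes; the function name `valid_stable_outputs` is hypothetical and is not taken from the slides. Under that reading the enumeration also yields {r}, which the definition as stated admits.

```python
from itertools import combinations

def valid_stable_outputs(processes, correct):
    """Return every set Y may stabilize on: non-empty subsets other than `correct`."""
    procs = sorted(processes)
    correct = frozenset(correct)
    subsets = (
        frozenset(c)
        for size in range(1, len(procs) + 1)
        for c in combinations(procs, size)
    )
    return [s for s in subsets if s != correct]

# Slide example: Π = {p,q,r}, C = {p,q}
outputs = valid_stable_outputs({"p", "q", "r"}, {"p", "q"})
for s in outputs:
    print(sorted(s))
```

Only the set {p,q} is excluded, so Y gives very little information: it rules out a single candidate for the set of correct processes.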

Y is non-trivial
Theorem 1: Y solves (N-1)-set agreement
Every process in P_1,…,P_N proposes a value and must decide on some proposed value so that:
- At most N-1 distinct values are decided
(!) not solvable in asynchronous systems [HS93, BG93, SZ93]
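The safety part of this specification can be written as a small checker. This is an illustrative sketch only: the function name `satisfies_set_agreement` and the dictionary representation of proposals and decisions are assumptions, and termination (every correct process eventually decides) is not checked.

```python
def satisfies_set_agreement(proposals, decisions, k):
    """Check validity and k-agreement for one finished run.

    proposals, decisions: dicts mapping a process id to a value.
    """
    proposed = set(proposals.values())
    decided = set(decisions.values())
    validity = decided <= proposed       # only proposed values are decided
    agreement = len(decided) <= k        # at most k distinct decisions
    return validity and agreement

# N = 3 processes, so (N-1)-set agreement allows at most 2 distinct decisions.
proposals = {"P1": "a", "P2": "b", "P3": "c"}
ok = satisfies_set_agreement(proposals, {"P1": "a", "P2": "a", "P3": "c"}, k=2)
bad = satisfies_set_agreement(proposals, {"P1": "a", "P2": "b", "P3": "c"}, k=2)
print(ok, bad)
```

The second run is rejected because every process kept its own value, which is exactly the outcome (N-1)-set agreement forbids.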

Set agreement is almost solvable
- Solvable if N-1 or fewer distinct values are proposed, e.g., if N-1 or fewer processes participate: k-convergence [YNG98]
- Y should handle the case when N values are around

Citizens and gladiators
- Split the system into Gladiators (the stable output of Y) and Citizens (all the rest)
- Gladiators eliminate at least one value using (G-1)-convergence, where G is the number of Gladiators, or adopt a value from Citizens
[Figure: the system split into Y (Gladiators) and Π-Y (Citizens)]

Correctness
Eventually, Gladiators are not the set of correct processes
⇨ At least one Gladiator is faulty, or at least one Citizen is correct
⇨ Gladiators commit on G-1 values or adopt a value from a Citizen
⇨ At least one process gives up its value
⇨ At most N-1 values survive!

Y is minimal
Theorem 2: Y is weaker than any stable non-trivial failure detector D
D is stable if, eventually, the same value is permanently output at every correct process (e.g., P, ◊P, Ω, Ω_k)
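As a concrete instance of "weaker than", Y can be emulated from the perfect failure detector P. The sketch below is an illustration and not the construction used in the talk. It assumes that P eventually outputs exactly the set of crashed processes at every correct process, that there are at least two processes, and that at least one process is correct; the name `emulate_y_from_p` is hypothetical.

```python
from itertools import combinations

def emulate_y_from_p(processes, suspected):
    """Map P's current output (suspected set) to an output of Y."""
    procs = sorted(processes)
    if len(procs) < 2:
        raise ValueError("this sketch assumes at least two processes")
    if suspected:
        # Somebody crashed, so the set of correct processes is not all of Π.
        return frozenset(procs)
    # Nobody crashed, so the set of correct processes is Π, not a singleton.
    return frozenset(procs[:1])

# Check every possible set of correct processes once P has stabilized.
processes = {"p", "q", "r"}
procs = sorted(processes)
checked = []
for size in range(1, len(procs) + 1):
    for correct in combinations(procs, size):
        crashed = processes - set(correct)
        u = emulate_y_from_p(processes, crashed)
        checked.append((frozenset(correct), u))
        print(sorted(correct), "->", sorted(u))
```

Once P has stabilized, every correct process computes the same non-empty set, and that set differs from the set of correct processes, which is all Y requires.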

Minimality proof: toy example
Consider a “faithful” failure detector D that solves a wait-free impossible problem P: in every execution E, D outputs the same value v that depends only on correct(E)
Claim 1: For all v, there is a non-empty set of processes C such that v cannot be output by D when C is the set of correct processes
Suppose not: v is valid for any C ⇨ D can be replaced with a “dummy” that always outputs v, a contradiction!
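Claim 1 suggests how Y can be extracted from a faithful detector: on seeing D output v, output some set C for which v is impossible, since that C cannot be the set of correct processes. The following is a sketch under assumptions. A faithful D is modelled as a plain function from the set of correct processes to a value, `leader` is a hypothetical example detector, and `extract_y` is an illustrative name rather than the slides' algorithm.

```python
from itertools import combinations

def nonempty_subsets(processes):
    """Yield all non-empty subsets in a fixed, deterministic order."""
    procs = sorted(processes)
    for size in range(1, len(procs) + 1):
        for c in combinations(procs, size):
            yield frozenset(c)

def extract_y(processes, detector, v):
    """Return a set C such that D cannot output v when C is the correct set."""
    for candidate in nonempty_subsets(processes):
        if detector(candidate) != v:
            return candidate
    # No such C: v is valid everywhere, so D could be replaced by a dummy.
    raise ValueError("v is valid for every correct set; D would be trivial")

def leader(correct):
    """Hypothetical faithful detector: the smallest correct process id."""
    return min(correct)

processes = {"p", "q", "r"}
results = []
for actual in nonempty_subsets(processes):
    v = leader(actual)                    # what D outputs in this execution
    u = extract_y(processes, leader, v)   # the emulated output of Y
    results.append((actual, u))
    print(sorted(actual), v, "->", sorted(u))
```

Because the search order is fixed, all correct processes that see the same v pick the same set, which gives the stability Y needs.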

Minimality proof: general case
Consider any non-trivial stable D
Claim 2: For all v, there exists an infinite execution E in which v cannot be the only value output by D
Reduction:
- As long as D is stable on v: use E(v) to extract Y

Conclusions
- Y is the weakest non-trivial stable failure detector (can be generalized to the f-resilient case: Y^f)
- (N-1)-set agreement is the easiest non-trivial problem?

Future
- Establishing the “weakest ever” result in the most general class of failure detectors (not Y!)
Y is not the weakest: an unstable “composition” of Ω_n and Y is even weaker! [Chen et al., Zielinski, …]

Thank you!

k-convergence [YNG98]
Processes propose values and commit on or adopt one of the proposed values:
- If a process commits, then at most k values are committed or adopted
- If k or fewer values are proposed, every process commits
(!) wait-free solvable for any k
(!!) (N-1)-convergence almost solves (N-1)-set agreement! But termination is an issue when all N values are around; that’s where Y is of use!
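The two k-convergence properties above can be stated as a checker over one finished run. This is a sketch under assumptions: each process's result is modelled as a `(value, committed)` pair, the name `satisfies_k_convergence` is hypothetical, and the code checks the specification only; it does not implement the wait-free algorithm of [YNG98].

```python
def satisfies_k_convergence(proposals, outputs, k):
    """Check the k-convergence properties for one finished run.

    proposals: dict process id -> proposed value
    outputs:   dict process id -> (value, committed flag)
    """
    proposed = set(proposals.values())
    returned = {value for value, _ in outputs.values()}
    someone_committed = any(committed for _, committed in outputs.values())
    everyone_committed = all(committed for _, committed in outputs.values())

    validity = returned <= proposed
    # If a process commits, at most k values are committed or adopted.
    agreement = (not someone_committed) or len(returned) <= k
    # If k or fewer values are proposed, every process commits.
    convergence = len(proposed) > k or everyone_committed
    return validity and agreement and convergence

all_distinct = {"P1": "a", "P2": "b", "P3": "c"}
stuck = {"P1": ("a", False), "P2": ("b", False), "P3": ("c", False)}
print(satisfies_k_convergence(all_distinct, stuck, k=2))

two_values = {"P1": "a", "P2": "a", "P3": "b"}
done = {"P1": ("a", True), "P2": ("a", True), "P3": ("b", True)}
print(satisfies_k_convergence(two_values, done, k=2))
```

The first run is legal even though nobody commits: with all N values around, k-convergence alone gives no progress, which is the gap the slide says Y fills.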

Minimality proof: general case
Consider any non-trivial stable D
Claim 2: For all v, there exists an infinite execution E in which v cannot be the only value output by D
Reduction:
- As long as D is stable:
  locate a faulty process in a finite prefix of E (including all steps of faulty(E)),
  or output correct(E)
- Y is extracted!

Generalization to f-resilience
f-resilient impossible problems: can be solved when fewer than f processes fail, but not when f processes may fail
- Y^f outputs a set of size ≥ N-f
- Eventually, the same set U is permanently output at every correct process (as for Y, U is not the set of correct processes)
Y^f is the weakest stable failure detector to circumvent an f-resilient impossibility

Big picture
Addressing the weakest failure detector (WFD) question contributes to:
- Understanding complexity and computability bounds of distributed abstractions
- Establishing a clean classification of problems in distributed computing
“WFD ever” corresponds to the easiest non-trivial problem in distributed computing