Distributed Algorithms for Failure Detection and Consensus in Crash, Crash-Recovery and Omission Environments
Mikel Larrea
Distributed Systems Group, University of the Basque Country, UPV/EHU

UPV / EHU 2 Mikel Larrea − Mannheim, May 2011
Context and Seminal Papers
In the Consensus problem, all correct processes propose a value and must reach a unanimous and irrevocable decision on some proposed value.
[FLP85] M. Fischer, N. Lynch, M. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 1985.
[CT96] T. Chandra, S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 1996.
[CHT96] T. Chandra, V. Hadzilacos, S. Toueg. The weakest failure detector for solving consensus. Journal of the ACM, 1996.
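The requirements in this definition can be made concrete as checks on a finished run. The sketch below is illustrative only: the function name `check_consensus` and the dictionary-based representation of a run are assumptions, not part of the talk, and irrevocability is not checked because a final snapshot cannot show whether a decision was ever changed.

```python
from typing import Dict, Hashable, Set


def check_consensus(proposals: Dict[int, Hashable],
                    decisions: Dict[int, Hashable],
                    correct: Set[int]) -> Dict[str, bool]:
    """Check one finished run against the consensus requirements (hypothetical helper).

    proposals: process id -> value that process proposed
    decisions: process id -> value it decided (missing if it never decided)
    correct:   ids of the processes that did not crash during the run
    """
    correct_decisions = {decisions[p] for p in correct if p in decisions}
    return {
        # Every correct process eventually reaches a decision.
        "termination": correct <= set(decisions),
        # Correct processes decide unanimously.
        "agreement": len(correct_decisions) <= 1,
        # Every decided value is one of the proposed values.
        "validity": set(decisions.values()) <= set(proposals.values()),
    }
```

For example, a run where processes 1 and 2 are correct and both decide "a", one of the proposed values, passes all three checks.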

Motivation

Motivation++ (Zurich, July 2010)

Crash Failure Detectors [CT96]

Strengthening Completeness

Guest Stars: ◊P and Omega
◊P: strong completeness, eventual strong accuracy
– Eventually every process that crashes is permanently suspected by every correct process
– There is a time after which correct processes are not suspected by any correct process
Omega satisfies the following property:
– There is a time after which all the correct processes always trust the same correct process
What is a correct process?
– It depends on the failure model :-)
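In the crash model, Omega can be obtained from ◊P with a simple rule: trust the smallest process id that is not currently suspected. Once ◊P has stabilized, every correct process suspects exactly the crashed processes, so all correct processes pick the same correct process. The function below is a hypothetical sketch of that rule, not an algorithm taken from the slides; it assumes `me` is listed in `processes`.

```python
from typing import Iterable, Set


def omega_from_eventually_perfect(me: int, processes: Iterable[int],
                                  suspected: Set[int]) -> int:
    """Return the leader trusted by process `me` (hypothetical helper).

    `suspected` is the current output of me's ◊P module. Process `me` is
    always kept as a candidate, so the result is defined whenever `me`
    appears in `processes`.
    """
    candidates = [p for p in processes if p == me or p not in suspected]
    return min(candidates)
```

For instance, if process 1 has crashed and ◊P has stabilized, processes 2, 3 and 4 all hold the suspect set {1} and all return 2.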

FD-based Consensus

Fault-tolerant Architecture

Outline
Part I: Crash Environments
– (Near-)communication-efficient algorithms for ◊P
– Communication-optimal algorithms for ◊P
Part II: Crash-Recovery Environments
– Implementing Omega with/without stable storage
– Communication-efficient algorithms for Omega
– From Omega to ◊P
– Fault-tolerant aggregator election and data aggregation in wireless sensor networks
Part III: Omission Environments
– Secure failure detection and consensus in TrustedPals
– Communication-efficient algorithm for ◊P

UPV / EHU Part I: P in Crash Environments Joint work with Roberto Cortiñas, Alberto Lafuente, Iratxe Soraluze, Joachim Wieland

UPV / EHU 12 Mikel Larrea − Mannheim, May 2011 The First P Algorithm [CT96]
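The ◊P algorithm of [CT96] for partially synchronous systems is heartbeat-based: each process periodically sends an "I am alive" message to every other process, suspects a process whose message does not arrive within a timeout, and, when a message from a suspected process does arrive, stops suspecting it and increases that process's timeout. The class below is a minimal sketch of the receiving side under assumed names (`HeartbeatEventuallyPerfect`, `on_heartbeat`, `on_tick`, a fixed timeout `increment`); time is passed in explicitly so it runs without a network, and the periodic sending is omitted.

```python
from typing import Dict, Iterable, Set


class HeartbeatEventuallyPerfect:
    """Sketch of a heartbeat-based ◊P module with adaptive timeouts (hypothetical names)."""

    def __init__(self, me: int, processes: Iterable[int],
                 initial_timeout: float, now: float = 0.0,
                 increment: float = 1.0) -> None:
        self.me = me
        self.others = [p for p in processes if p != me]
        # Per-process timeout, grown after every false suspicion.
        self.timeout: Dict[int, float] = {p: initial_timeout for p in self.others}
        # Time at which the last heartbeat from each process arrived.
        self.last_heard: Dict[int, float] = {p: now for p in self.others}
        self.suspected: Set[int] = set()
        self.increment = increment

    def on_heartbeat(self, sender: int, now: float) -> None:
        """Called when an 'I am alive' message from `sender` arrives."""
        self.last_heard[sender] = now
        if sender in self.suspected:
            # False suspicion: trust the sender again and wait longer next time.
            self.suspected.discard(sender)
            self.timeout[sender] += self.increment

    def on_tick(self, now: float) -> Set[int]:
        """Suspect every process whose heartbeat is overdue; return the suspects."""
        for p in self.others:
            if p not in self.suspected and now - self.last_heard[p] > self.timeout[p]:
                self.suspected.add(p)
        return set(self.suspected)
```

Because the timeout for a wrongly suspected process grows each time, under partial synchrony it eventually exceeds the unknown bound on message delay and false suspicions stop, while a crashed process stops sending and remains suspected forever.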

Part I. Summary of Results
Efficient implementations of ◊P
– Nearly communication-efficient algorithms (n+C links are used forever): Q-based, transformations
– Communication-efficient algorithms (n links): pure ring-based, optimizations
Optimal implementations of ◊P
– Communication-optimal algorithms (C links): RBcast-based, one-to-one, one-to-all
(n is the number of processes and C the number of correct processes.)
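To make the link counts on this slide concrete, the hypothetical helper below evaluates them for given n and C, next to an illustrative all-to-all heartbeat baseline in which every correct process keeps sending to every other process, which uses C*(n-1) links forever. For n = 10 and C = 7 this gives 63, 17, 10 and 7 links respectively.

```python
def links_used_forever(n: int, c: int) -> dict:
    """Links that keep carrying messages forever, for n processes of which c are correct."""
    if not 0 < c <= n:
        raise ValueError("expected 0 < c <= n")
    return {
        # Baseline (not on the slide): each correct process sends to all n-1 others.
        "all-to-all heartbeats": c * (n - 1),
        "nearly communication-efficient": n + c,
        "communication-efficient": n,
        "communication-optimal": c,
    }
```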

Reliable Broadcast [CT96]
“All correct processes deliver the same set of messages”
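[CT96] implements reliable broadcast by relaying: when a process receives a message for the first time, it forwards it to all processes and then delivers it, so every correct process delivers the same set of messages even if the original sender crashes after reaching only some of them. The simulation below is a small sketch of that idea; the `RBProcess` and `Network` classes, the single global queue standing in for reliable links, and the (sender, sequence number, payload) message format are all assumptions for illustration.

```python
from collections import deque
from typing import Deque, List, Set, Tuple

Message = Tuple[int, int, str]  # (original sender, sequence number, payload)


class RBProcess:
    """Reliable broadcast by relaying: on first receipt, forward to all, then deliver."""

    def __init__(self, pid: int, network: "Network") -> None:
        self.pid = pid
        self.network = network
        self.seen: Set[Message] = set()
        self.delivered: List[Message] = []

    def receive(self, m: Message) -> None:
        if m in self.seen:
            return  # duplicate: already relayed and delivered
        self.seen.add(m)
        # Relay before delivering, so a process never delivers a message
        # it has not passed on to everyone else.
        self.network.send_to_all(self.pid, m)
        self.delivered.append(m)


class Network:
    """Reliable links simulated with one global queue; crashed processes do nothing."""

    def __init__(self, n: int) -> None:
        self.procs = {i: RBProcess(i, self) for i in range(n)}
        self.crashed: Set[int] = set()
        self.queue: Deque[Tuple[int, Message]] = deque()

    def send_to_all(self, sender: int, m: Message) -> None:
        if sender in self.crashed:
            return
        for dst in self.procs:
            self.queue.append((dst, m))

    def run(self) -> None:
        while self.queue:
            dst, m = self.queue.popleft()
            if dst not in self.crashed:
                self.procs[dst].receive(m)
```

In the test scenario, process 0 gets its message only to process 1 before crashing; process 1's relay is what lets processes 2 and 3 deliver it as well.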

◊P in Crash Environments
[WLL07] J. Wieland, M. Larrea, A. Lafuente. An evaluation of ring-based algorithms for the Eventually Perfect failure detector class. 15th International Conference on Parallel, Distributed and Network-based Processing, 2007.
[LSCL08] M. Larrea, I. Soraluze, R. Cortiñas, A. Lafuente. An Evaluation of Communication-Optimal ◊P Algorithms. 16th International Conference on Parallel, Distributed and Network-based Processing, 2008.
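Ring-based ◊P algorithms arrange processes in a logical ring so that each process exchanges heartbeats with its ring neighbors instead of with everyone. The hypothetical helper below shows one generic way to pick those neighbors, skipping processes that are currently suspected; it is an illustrative assumption and not a reproduction of the algorithms evaluated in [WLL07] or [LSCL08], and it leaves out how suspicions are propagated around the ring.

```python
from typing import Set, Tuple


def ring_neighbors(me: int, n: int, suspected: Set[int]) -> Tuple[int, int]:
    """Return (predecessor, successor) of `me` on a logical ring 0..n-1,
    skipping the processes that `me` currently suspects."""

    def step(direction: int) -> int:
        q = (me + direction) % n
        while q != me and q in suspected:
            q = (q + direction) % n
        return q  # equals `me` only if every other process is suspected

    return step(-1), step(+1)
```

With this choice, process `me` would monitor its predecessor and send its own heartbeats to its successor.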

Part II: Omega in Crash-Recovery Environments
Joint work with José Javier Astrain, Ernesto Jiménez, Cristian Martín, Iratxe Soraluze

Part II. Summary of Results
Redefinition of Omega
– Take into account unstable processes
– Take into account the availability of stable storage
Implementation of Omega
– With and without stable storage
– Efficient algorithms
From Omega to ◊P
Fault-tolerant aggregator election and data aggregation in wireless sensor networks
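One approach for the stable-storage case, assumed here, keeps unstable processes (those that crash and recover infinitely often) from being elected: each process persists an incarnation counter, incremented on every recovery, and every process trusts the process with the smallest counter, breaking ties by id. An unstable process's counter grows without bound, so it eventually loses to any process that stays up. The class below sketches only this leader choice under assumed names (`CrashRecoveryOmega`, `recover`, `on_alive`, `forget`); a dict stands in for stable storage, and the timers that would call `forget` for silent processes are not shown.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CrashRecoveryOmega:
    """Leader choice for a crash-recovery Omega sketch (hypothetical names)."""
    me: int
    # Stands in for stable storage: survives crashes in this sketch.
    stable_storage: Dict[str, int] = field(default_factory=dict)
    # Volatile state: pid -> highest incarnation number heard from that pid.
    known: Dict[int, int] = field(default_factory=dict)

    def recover(self) -> int:
        """Run at start-up and after every recovery."""
        incarnation = self.stable_storage.get("incarnation", -1) + 1
        self.stable_storage["incarnation"] = incarnation  # persist the new value
        self.known = {self.me: incarnation}  # volatile state was lost in the crash
        return incarnation

    def on_alive(self, sender: int, incarnation: int) -> None:
        """Record an alive message carrying the sender's incarnation number."""
        self.known[sender] = max(self.known.get(sender, incarnation), incarnation)

    def forget(self, pid: int) -> None:
        """Called by a (not shown) timer when `pid` has been silent for too long."""
        if pid != self.me:
            self.known.pop(pid, None)

    def leader(self) -> int:
        """Trust the process with the fewest incarnations, ties broken by id."""
        return min(self.known, key=lambda p: (self.known[p], p))
```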

From Omega to ◊P

Part III: ◊P in Omission Environments
Joint work with Roberto Cortiñas, Felix Freiling, Marjan Ghajar-Azadanlou, Alberto Lafuente, Lucia Penso, Iratxe Soraluze

Part III. Summary of Results
Reduction from Byzantine to omission
– Processes are equipped with tamper-proof security modules (e.g., smartcards)
– Actually, omission + buffering/timing attacks
Omission models
– send | receive | general
– permanent | transient
– non-selective | selective
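Treating the three dimensions listed above as independent, as the slide's layout suggests, gives 3 × 2 × 2 = 12 omission models. The snippet below simply enumerates them with `itertools.product`; the naming scheme for each model is an illustrative assumption.

```python
from itertools import product

# The three dimensions listed on the slide.
WHERE = ("send", "receive", "general")
DURATION = ("permanent", "transient")
SELECTIVITY = ("non-selective", "selective")


def omission_models() -> list:
    """Enumerate every combination of the three omission dimensions."""
    return [f"{d} {s} {w}-omission"
            for w, d, s in product(WHERE, DURATION, SELECTIVITY)]
```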

Part III. Summary of Results
Impossibility result
– ◊P is impossible to implement in the (transient) general omission model
Redefinition and implementation of ◊P
– In-connected and out-connected processes
– All-to-all communication, sequence numbers, connectivity matrix
◊P-based Consensus
– Termination: every in-connected process eventually decides
– Adaptation of Chandra-Toueg’s algorithm
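The slide names two ingredients of the omission-model ◊P implementation: sequence numbers and a connectivity matrix. The sketch below shows only what each ingredient can compute in isolation, under assumed conventions: sequence numbers start at 1 per sender and links are FIFO, so a gap reveals omitted messages, and `matrix[p][q]` is True when messages from p currently reach q, so a breadth-first search yields the processes that p can reach directly or through relays. How the actual algorithm derives in-connected and out-connected processes from this information is not given on the slide and is not reproduced here; both function names are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List, Set


def detect_omissions(last_seq: Dict[int, int], sender: int, seq: int) -> List[int]:
    """Return the sequence numbers from `sender` that were skipped before `seq`.

    `last_seq` maps each sender to the highest sequence number received so far
    and is updated in place. Assumes numbering starts at 1 and FIFO links.
    """
    previous = last_seq.get(sender, 0)
    last_seq[sender] = max(previous, seq)
    return list(range(previous + 1, seq))


def reachable(matrix: List[List[bool]], src: int) -> Set[int]:
    """Processes that messages from `src` can reach, directly or via relays.

    matrix[p][q] is True when messages sent by p currently arrive at q.
    """
    seen = {src}
    frontier: Deque[int] = deque([src])
    while frontier:
        p = frontier.popleft()
        for q, link_ok in enumerate(matrix[p]):
            if link_ok and q not in seen:
                seen.add(q)
                frontier.append(q)
    return seen
```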

Thank you!