The Need for Language Support for Fault-Tolerant Distributed Systems Cezara Drăgoi, INRIA ENS CNRS Thomas A. Henzinger, IST Austria Damien Zufferey,


The Need for Language Support for Fault-Tolerant Distributed Systems
Cezara Drăgoi, INRIA ENS CNRS
Thomas A. Henzinger, IST Austria
Damien Zufferey, MIT CSAIL
SNAPL

Fault-tolerant distributed algorithms
How to get it right when things go wrong?
  Crash, network partition, …
  Mean time to failure (things eventually go wrong)
Replication using consensus
  Agreement: Every correct process must agree on the same value.
  Irrevocability: Every correct process decides at most one value.
  Validity: If all processes propose the same value v, then all correct processes decide v.
  Integrity: If value v is a decision, then v must have been proposed by some process.
  Termination: Every correct process decides some value.
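The five properties above can be read as predicates over one finished run. The following is a minimal sketch, not taken from the slides: the function name `check_consensus` and the representation of a run (a map of proposals, a per-process log of decide events, and the set of correct processes) are assumptions made for illustration.

```python
from typing import Dict, Hashable, List, Set


def check_consensus(proposals: Dict[str, Hashable],
                    decide_log: Dict[str, List[Hashable]],
                    correct: Set[str]) -> Dict[str, bool]:
    """Evaluate the consensus properties on one finished run.

    proposals  : value proposed by each process (hypothetical representation)
    decide_log : every value each process decided, in order
    correct    : processes that never failed during the run
    """
    proposed = set(proposals.values())
    # Values decided by correct processes, and by any process at all.
    correct_values = {v for p in correct for v in decide_log.get(p, [])}
    all_values = {v for log in decide_log.values() for v in log}
    return {
        # Correct processes never decide two different values.
        "agreement": len(correct_values) <= 1,
        # A correct process decides at most one value.
        "irrevocability": all(len(set(decide_log.get(p, []))) <= 1
                              for p in correct),
        # If everybody proposed the same v, correct processes decide only v.
        "validity": len(proposed) != 1 or correct_values <= proposed,
        # Every decided value was proposed by some process.
        "integrity": all_values <= proposed,
        # Every correct process has decided.
        "termination": all(decide_log.get(p) for p in correct),
    }


if __name__ == "__main__":
    run = check_consensus(
        proposals={"p1": 1, "p2": 2, "p3": 1},
        decide_log={"p1": [1], "p2": [1], "p3": []},  # p3 crashed
        correct={"p1", "p2"},
    )
    print(run)
```

Such a checker only inspects a single run. Termination in particular is a liveness property, so a finite run that violates it is only evidence that the run was cut short or that the algorithm stalled.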

Our journey starts on the island of Paxos …
… where archeologists made an interesting discovery about a parliament system …
Image credits: CC-BY-SA-NC Matt Taylor; Copyright ACM

The Paxos Algorithm [Lamport 98]
Used at Google (Chubby), Microsoft (Autopilot)
[Diagram: messages between Proposer and Acceptor: Prepare, Promise, Accept, Accepted]
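The four messages in the diagram correspond to the two phases of single-decree Paxos. The sketch below is an illustration under strong assumptions: everything runs in one process with direct method calls instead of a network, there are no learners and no retries, and the names `Acceptor`, `propose`, and `reachable` are hypothetical, not from the slides or from any production system.

```python
from dataclasses import dataclass
from typing import Any, List, Optional, Sequence, Tuple


@dataclass
class Acceptor:
    promised: int = -1                          # highest ballot promised
    accepted: Optional[Tuple[int, Any]] = None  # last (ballot, value) accepted

    def prepare(self, ballot: int) -> Tuple[bool, Optional[Tuple[int, Any]]]:
        """Handle Prepare; answer with a Promise or a refusal."""
        if ballot <= self.promised:
            return False, None
        self.promised = ballot
        return True, self.accepted

    def accept(self, ballot: int, value: Any) -> bool:
        """Handle Accept; answer Accepted unless a higher ballot was promised."""
        if ballot < self.promised:
            return False
        self.promised = ballot
        self.accepted = (ballot, value)
        return True


def propose(acceptors: List[Acceptor], ballot: int, value: Any,
            reachable: Optional[Sequence[int]] = None) -> Optional[Any]:
    """Run one proposer attempt; return the chosen value or None on failure.

    `reachable` lists the acceptor indices this attempt can talk to
    (an assumption used here to mimic dropped messages).
    """
    quorum = len(acceptors) // 2 + 1
    targets = acceptors if reachable is None else [acceptors[i] for i in reachable]

    # Phase 1: Prepare / Promise.
    replies = [a.prepare(ballot) for a in targets]
    promises = [acc for ok, acc in replies if ok]
    if len(promises) < quorum:
        return None

    # Adopt the value of the highest-ballot proposal already accepted, if any.
    prior = [acc for acc in promises if acc is not None]
    if prior:
        value = max(prior, key=lambda bv: bv[0])[1]

    # Phase 2: Accept / Accepted.
    acks = sum(a.accept(ballot, value) for a in targets)
    return value if acks >= quorum else None


if __name__ == "__main__":
    group = [Acceptor() for _ in range(3)]
    print(propose(group, ballot=1, value="a"))
    print(propose(group, ballot=2, value="b"))
```

In the usage above, the second proposer comes with a higher ballot but a different value. Because a majority has already accepted a value, phase 1 forces it to adopt that value instead of its own.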

Paxos in the Literature
  The part-time parliament [Lamport 98]
  Paxos made simple [Lamport 01]
  Paxos made live: An engineering perspective [Chandra et al. 07]
  In search of an understandable consensus algorithm [Ongaro and Ousterhout 14]
  Paxos made moderately complex [van Renesse and Altinbuken 15]
  ...
Claim: If it is hard, more of the same is not going to help. Changing the way we think about it might.

Why is the PL community concerned?
Quotes from Paxos made live [Chandra et al. 07]:
  “The fault-tolerance computing community has not developed the tools to make it easy to implement their algorithms.”
  “The fault-tolerance computing community has not paid enough attention to testing, a key ingredient for building fault-tolerant systems.”
  “In order to build a real-world system, an expert needs to use numerous ideas scattered in the literature and make several relatively small protocol extensions. The cumulative effort will be substantial and the final system will be based on an unproven protocol.”

Challenges to understanding what is going on
  Parametric systems
  Asynchrony (interleaving, delays)
  Channels
  Faults
[Figure: n processes]

Programming Models & Languages
Asynchronous
  Actor model, CSP, CCS, pi-calculus, …
  Many PLs are based on or implement those models.
  Consensus is not solvable with asynchrony and faults ([FLP 85]).
Synchronous (timed)
  Timed automata, timed process calculi
  Lustre, Esterel, Giotto, LabVIEW
  Not realistic for distributed systems
? (the middle ground)
  Partial synchrony
  Failure detectors
  Crash-stop, crash-recovery
  Benign, Byzantine faults
Faults introduce a middle ground: alternation between synchronous and asynchronous periods (network contention, crash).
We don’t want a model/language for each variation. We want a simple model that unifies all of them.

Structure of distributed algorithms: Communication-closed Rounds
[Diagram: messages between Proposer and Acceptor: Prepare, Promise, Accept, Accepted]
[Elrad & Francez 82]: decomposition of algorithms into communication-closed rounds.
[Dwork & Lynch & Stockmeyer, 88] defines a round model for non-synchronous models: partial synchrony.
A round defines the scope of its messages.
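The statement that a round defines the scope of its messages can be made concrete with a per-process mailbox that tags every message with its round number. This is a minimal sketch under assumptions: the class `RoundMailbox` and its methods are hypothetical, and the policy shown (drop late messages, buffer early ones) is one common way to enforce communication closure, not necessarily what the authors' runtime does.

```python
from collections import defaultdict
from typing import Any, Dict


class RoundMailbox:
    """Per-process mailbox that keeps rounds communication-closed."""

    def __init__(self) -> None:
        self.round = 0
        # round number -> {sender: payload}
        self.pending: Dict[int, Dict[str, Any]] = defaultdict(dict)

    def deliver(self, sender: str, rnd: int, payload: Any) -> bool:
        """Store a message; return False if it is dropped."""
        if rnd < self.round:
            # The message belongs to a round that is already closed.
            return False
        # Messages for the current or a future round are buffered.
        self.pending[rnd][sender] = payload
        return True

    def end_round(self) -> Dict[str, Any]:
        """Close the current round and return exactly its messages."""
        received = self.pending.pop(self.round, {})
        self.round += 1
        return received


if __name__ == "__main__":
    box = RoundMailbox()
    box.deliver("p1", 0, "prepare")
    box.deliver("p2", 1, "promise")      # early: kept for round 1
    print(box.end_round())
    print(box.deliver("p3", 0, "late"))  # late: dropped
    print(box.end_round())
```

With this discipline, the update of round r only ever sees messages sent in round r, which is what allows each round to be reasoned about in isolation.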

Faults: the environment as an adversary.
Semantics:
Execution: Compiler + runtime

Benefits for verification
[Diagram: the Promise and Accept rounds]
Reason about rounds in isolation.
Lock-step semantics, no interleaving.
Simple invariants that connect the rounds at their boundaries.
No messages in flight, only the local state of the processes.

The Heard-Of model [Charron-Bost & Schiper 09]
Intuitive model:
  communication-closed rounds
  send and update operations
Illusion of synchrony:
  a single process cannot distinguish between a synchronous and an asynchronous execution
Maps every fault to a message fault:
  A crashed process is the same as a process whose messages are dropped.
  Byzantine faults can be simulated by altering messages.
Simplifies the proofs: no need to case split on (in)correct processes
Handling transient/permanent faults is transparent at the algorithm level
Developed for theoretical simplicity
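The send and update operations and the idea of faults as dropped messages can be sketched as a lock-step executor. This is an illustration under assumptions, not the authors' language: `run_round` is a hypothetical helper, the heard-of sets are supplied by the caller in the role of the adversary, and the `update` rule is modeled on the OneThirdRule algorithm from the Heard-Of literature as best recalled here.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Set

State = Dict[str, Any]


def run_round(states: List[State], heard_of: List[Set[int]],
              send: Callable[[State], Any],
              update: Callable[[State, Dict[int, Any], int], State]) -> List[State]:
    """Execute one communication-closed round in lock-step.

    heard_of[p] is the set of processes whose message reaches p this round;
    the adversary models crashes and delays by shrinking these sets.
    """
    n = len(states)
    msgs = [send(s) for s in states]
    return [update(states[p], {q: msgs[q] for q in heard_of[p]}, n)
            for p in range(n)]


def send(state: State) -> Any:
    # Every process broadcasts its current estimate.
    return state["x"]


def update(state: State, mailbox: Dict[int, Any], n: int) -> State:
    new = dict(state)
    if 3 * len(mailbox) > 2 * n:                 # heard from more than 2n/3
        counts = Counter(mailbox.values())
        top = max(counts.values())
        # Adopt the smallest among the most frequently received values.
        new["x"] = min(v for v, c in counts.items() if c == top)
        for v, c in counts.items():
            if 3 * c > 2 * n:                    # more than 2n/3 copies of v
                new["decision"] = v
    return new


if __name__ == "__main__":
    init = [{"x": v, "decision": None} for v in (3, 1, 1, 1)]
    everyone = [{0, 1, 2, 3}] * 4
    print(run_round(init, everyone, send, update))
```

Note that the executor never asks whether a process is correct. A crashed process is simply one that no longer appears in anybody's heard-of set, so the algorithm code has no case split on failures.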

Conclusion
Building fault-tolerant distributed systems is hard and important.
The current programming abstractions are inadequate.
The DA community has models that streamline fault handling.
We started to build a language around those ideas.
Key elements (HO-model):
  Communication-closed rounds
  Asynchrony and faults as an adversary that drops messages
Benefits:
  Conceptually simpler
  Automated reasoning/verification becomes possible
  Acceptable runtime overhead (early results)