Consensus Algorithms Willem Visser RW334

Why do we need consensus?
Distributed Databases
– Need to know whether the others committed or aborted a transaction, to avoid inconsistency
– Also, to agree on the order of transaction log entries, to ensure eventual consistency: some actions may still have to be performed, but at least everyone agrees on what they should be
Leader Election
Many, many more applications

So what is the problem? FAILURES!
Network failures
Node failures
Two types of node failures
– Fail-Stop: once a node fails, it stops
– Fail-Recover: a failed node can recover at some later stage

And it gets worse!
"Impossibility of Distributed Consensus with One Faulty Process"
– Fischer, Lynch and Paterson, 1985
Consensus is possible in a synchronous system
– Wait one time step; if someone didn't respond, they are dead
In an asynchronous system it is impossible
– You cannot tell the difference between a node that is merely slow and one that is dead

The 2-Phase Commit (2PC) Protocol
Phase 1
– The coordinator sends a Request-to-Prepare message to each participant
– The coordinator waits for all participants to vote
– Each participant votes Prepared if it is ready to commit; it may vote No for any reason, and may delay voting indefinitely
Phase 2
– If the coordinator receives Prepared from all participants, it sends Commit to all; otherwise it sends Abort to all
– Participants reply Done

Commit case (message flow): Coordinator → Participant: Request-to-Prepare; Participant → Coordinator: Prepared; Coordinator → Participant: Commit; Participant → Coordinator: Done

Abort case (message flow): Coordinator → Participant: Request-to-Prepare; Participant → Coordinator: No; Coordinator → Participant: Abort; Participant → Coordinator: Done
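
The two message flows above can be condensed into a minimal, single-process sketch. This is only illustrative: class and method names (Coordinator, Participant, request_to_prepare) are assumptions, and a real implementation adds network transport, timeouts and logging.

```python
# Minimal, single-process sketch of the 2PC message flow shown above.

class Participant:
    def __init__(self, name, will_prepare=True):
        self.name = name
        self.will_prepare = will_prepare   # a participant may vote No for any reason
        self.state = "working"

    def request_to_prepare(self):
        # Vote Prepared only if ready to commit; otherwise vote No.
        self.state = "prepared" if self.will_prepare else "aborted"
        return self.state == "prepared"

    def decide(self, commit):
        # Apply the coordinator's decision and acknowledge with Done.
        self.state = "committed" if commit else "aborted"
        return "Done"

class Coordinator:
    def __init__(self, participants):
        self.participants = participants

    def run_2pc(self):
        # Phase 1: collect votes from every participant.
        votes = [p.request_to_prepare() for p in self.participants]
        # Phase 2: commit only if every vote was Prepared.
        commit = all(votes)
        acks = [p.decide(commit) for p in self.participants]
        assert all(a == "Done" for a in acks)
        return "commit" if commit else "abort"

if __name__ == "__main__":
    ps = [Participant("p1"), Participant("p2"), Participant("p3", will_prepare=False)]
    print(Coordinator(ps).run_2pc())   # prints "abort" because p3 votes No
```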

Performance
In the absence of failures, 2PC requires 3 rounds of messages before the participants learn the decision
– Request-to-Prepare
– Votes
– Decision
Done messages are just for bookkeeping
– they don't affect response time
– they can be batched

Uncertainty
Before it votes, a participant can abort unilaterally
After a participant votes Prepared and before it receives the coordinator's decision, it is uncertain: it can't unilaterally commit or abort during its uncertainty period
(Diagram: the uncertainty period runs from the moment the participant sends Prepared until it receives Commit or Abort)
The coordinator is never uncertain
If a participant fails or is disconnected from the coordinator while it's uncertain, at recovery it must find out the decision
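
A small sketch of the participant's uncertainty window, under the same illustrative naming as above: the only state in which a participant may decide unilaterally is before it has voted.

```python
# Sketch of a participant's uncertainty window; state names are illustrative.

class ParticipantState:
    def __init__(self):
        self.state = "working"          # working -> prepared -> committed/aborted

    def vote_prepared(self):
        self.state = "prepared"         # uncertainty period starts here

    def receive_decision(self, commit):
        self.state = "committed" if commit else "aborted"   # uncertainty ends

    def can_decide_unilaterally(self):
        # A participant may abort on its own only before it has voted Prepared;
        # once prepared, it must wait for (or ask for) the coordinator's decision.
        return self.state == "working"

p = ParticipantState()
assert p.can_decide_unilaterally()
p.vote_prepared()
assert not p.can_decide_unilaterally()   # now blocked until the decision arrives
```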

The problems
Blocking: if the coordinator fails while a participant is uncertain, that participant must wait for the decision before continuing
Failure handling: what to do if a coordinator or participant times out waiting for a message (summarised in the sketch below)
– A participant times out waiting for the coordinator's Request-to-Prepare: it decides to abort
– The coordinator times out waiting for a participant's vote: it decides to abort
– A participant that voted Prepared times out waiting for the coordinator's decision: it is blocked; use a termination protocol to decide what to do (the naïve termination protocol is to wait until the coordinator recovers)
– The coordinator times out waiting for Done: it must re-solicit the Done messages, so it can forget the decision
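
For reference, the timeout rules above restated as a small lookup table; the strings are purely descriptive.

```python
# Who timed out, while waiting for what, and what they should do (from the slide above).
TIMEOUT_ACTIONS = {
    ("participant", "waiting for Request-to-Prepare"): "abort unilaterally",
    ("coordinator", "waiting for votes"):              "abort, send Abort to all",
    ("participant", "prepared, waiting for decision"): "blocked: run termination protocol",
    ("coordinator", "waiting for Done"):               "re-solicit Done so the decision can be forgotten",
}

for (role, waiting_for), action in TIMEOUT_ACTIONS.items():
    print(f"{role:11s} | {waiting_for:35s} -> {action}")
```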

Logging 2PC State Changes
Logging may be eager, meaning the record is flushed to disk before the next message is sent
Or it may be lazy (i.e. not eager)
(Diagram: the coordinator logs Start2PC eagerly before sending Request-to-Prepare and logs commit eagerly before sending Commit; the participant logs prepared eagerly before voting Prepared; the remaining commit/done records may be logged lazily)
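
A minimal sketch of the eager/lazy distinction, assuming an append-only log file and a made-up record format; "eager" is modelled as an fsync before the protocol is allowed to send its next message.

```python
# Toy transaction log: eager records are forced to disk, lazy records may be flushed later.
import os

class TransactionLog:
    def __init__(self, path):
        self.f = open(path, "a", encoding="utf-8")

    def append(self, record, eager):
        self.f.write(record + "\n")
        self.f.flush()
        if eager:
            os.fsync(self.f.fileno())   # force the record to stable storage now

log = TransactionLog("2pc.log")
log.append("Start2PC T1 participants=p1,p2", eager=True)   # before Request-to-Prepare
log.append("commit T1", eager=True)                         # before sending Commit
log.append("done T1", eager=False)                          # bookkeeping, may be lazy
```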

Coordinator Recovery
If the coordinator fails and later recovers, it must know the decision. It must therefore log
– the fact that it began T's 2PC protocol, including the list of participants, and
– Commit or Abort, before sending Commit or Abort to any participant (so it knows whether to commit or abort after it recovers)
If the coordinator fails and recovers, it resends the decision to participants from whom it doesn't remember getting Done
– If the participant has forgotten the transaction, it replies Done
– The coordinator should therefore log Done after it has received them all
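
A sketch of what coordinator recovery could look like over the toy log format from the previous example (the file name and record format are assumptions, not a real system's).

```python
# Replay the coordinator's log and resend decisions that were not fully acknowledged.
def recover_coordinator(log_path="2pc.log"):
    started, decision, done = {}, {}, set()
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            kind, txn, *rest = line.split()
            if kind == "Start2PC":
                started[txn] = rest               # participant list
            elif kind in ("commit", "abort"):
                decision[txn] = kind
            elif kind == "done":
                done.add(txn)
    for txn, participants in started.items():
        if txn in done:
            continue                              # decision fully acknowledged, forget it
        if txn in decision:
            print(f"resend {decision[txn]} for {txn} to {participants}")
        else:
            print(f"no decision logged for {txn}: abort it")

recover_coordinator()
```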

Participant Recovery
If a participant P fails and later recovers, it first performs centralized recovery (Restart)
For each distributed transaction T that was active at the time of failure
– If P is not uncertain about T, it unilaterally aborts T
– If P is uncertain, it runs the termination protocol (which may leave P blocked)
To ensure it can tell whether it's uncertain, P must log its vote before sending it to the coordinator
To avoid becoming totally blocked due to one blocked transaction, P should reacquire T's locks during Restart and allow Restart to finish before T is resolved
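
And the participant-side counterpart, assuming (again, purely for illustration) that the participant logs "prepared T" before voting and "commit T" or "abort T" once it learns the decision.

```python
# Decide, per transaction, whether the recovering participant is uncertain.
def participant_recovery(log_lines):
    prepared, decided = set(), set()
    for line in log_lines:
        kind, txn = line.split()
        if kind == "prepared":
            prepared.add(txn)
        else:
            decided.add(txn)
    for txn in prepared - decided:
        print(f"{txn}: uncertain, run termination protocol (may block)")
    # Transactions with no prepared record were never voted on: abort them unilaterally.

participant_recovery(["prepared T1", "commit T1", "prepared T2"])
# T2 is uncertain: this participant voted Prepared but never saw the decision.
```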

Heuristic Commit
Suppose a participant recovers, but the termination protocol leaves T blocked
The operator can guess whether to commit or abort
– Must detect wrong guesses when the coordinator recovers
– Must run compensations for wrong guesses
Heuristic commit
– If T is blocked, the local resource manager (actually, the transaction manager) guesses
– At coordinator recovery, the transaction managers jointly detect wrong guesses

The Main Issue with 2PC
Once the coordinator sends the Commit message, each participant commits without considering the other participants
If the coordinator and all the participants that finished committing go down, the remaining participants don't know the state of the system
– All the nodes that knew the decision are now dead
– They cannot just abort, since the commit may already have completed at some participants and cannot be rolled back
– They also cannot commit, since the original decision might have been to abort

3-Phase Commit (3PC)
Phase 1 as in 2PC
– Coordinator sends the prepare request; participants reply Prepared or No
Phase 2 of 2PC is now split into two
– When the coordinator has received Prepared from all participants, it first sends Ready-to-Commit
– Once the Ready-to-Commit messages have been acknowledged, it sends the Commit message
The reason for the extra step is to let all the participants know what the decision is before anyone commits; in case of failure everyone then knows the decision and the state can be recovered
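
A failure-free sketch of the three rounds, with assumed method names (prepare, pre_commit, do_commit); it only shows how the decision is disseminated before anyone commits, not the termination protocol.

```python
# Minimal 3PC coordinator logic, happy path only; names are illustrative.

class ThreePCParticipant:
    def __init__(self, name):
        self.name = name
        self.state = "working"

    def prepare(self):            # Phase 1: vote
        self.state = "prepared"
        return True               # vote Prepared (a real participant may vote No)

    def pre_commit(self):         # Phase 2: learn the decision, but don't apply it yet
        self.state = "ready-to-commit"
        return "ack"

    def do_commit(self):          # Phase 3: actually commit
        self.state = "committed"
        return "done"

def run_3pc(participants):
    if not all(p.prepare() for p in participants):
        return "abort"                                # someone voted No
    acks = [p.pre_commit() for p in participants]     # everyone now knows the decision
    assert all(a == "ack" for a in acks)
    for p in participants:
        p.do_commit()
    return "commit"

print(run_3pc([ThreePCParticipant("p1"), ThreePCParticipant("p2")]))
```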

3PC Failure Handling
If the coordinator times out before receiving Prepared from all participants, it decides to abort
The coordinator ignores participants that don't acknowledge its Ready-to-Commit
Participants that voted Prepared and timed out waiting for Ready-to-Commit or Commit use the termination protocol
The termination protocol is where the complexity lies (e.g. see [Bernstein, Hadzilacos, Goodman 87], Section 7.4)

3PC can still fail
Network partition failure
– All the participants that received Ready-to-Commit end up on one side
– All the rest end up on the other side
Recovery will take place on both sides
– One side will commit
– The other side will abort
When the network merges back, you have an inconsistent state

2PC versus 3PC
FLP says you cannot have both safety and liveness (in an asynchronous system with even one faulty process)
Liveness
– 2PC can block
– 3PC will always make progress
Safety
– 2PC is safe
– 3PC is only safe-ish: as seen in the network-partition case, it can reach an inconsistent (wrong) result

Paxos!
Safety always, and liveness under favourable conditions (i.e. when processes behave synchronously for long enough)
Leslie Lamport
– Written around 1990, but only published in 1998
– The "Paxos Made Simple" paper from 2001 describes it so that mere mortals have a chance to understand it too

Paxos Core Idea
Majority voting nugget
– Any two majorities must share at least one member
– Thus if a failure leaves one majority unable to complete a decision, the shared member carries what it has seen into the next majority
Proposals are also ordered (numbered), so everyone can tell which proposal should be considered
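
The quorum-intersection property is easy to check mechanically; the snippet below enumerates all majorities of a five-node cluster and verifies that every pair overlaps.

```python
# Any two majority subsets of the same node set overlap in at least one node.
from itertools import combinations

nodes = {"a", "b", "c", "d", "e"}
majority = len(nodes) // 2 + 1          # 3 out of 5

quorums = [set(q) for q in combinations(sorted(nodes), majority)]
assert all(q1 & q2 for q1 in quorums for q2 in quorums)
print(f"checked {len(quorums)}x{len(quorums)} quorum pairs: every pair intersects")
```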

Paxos Continued
Paxos can tolerate
– lost messages, delayed messages, repeated messages, and messages delivered out of order
– in fact it can still work if nearly half of the nodes fail to reply: 2F+1 nodes can tolerate F failures
It will reach consensus if there is a single leader for long enough that the leader can talk to a majority of processes twice
Any process, including a leader, can fail and restart; in fact all processes can fail at the same time and the algorithm is still safe
There can be more than one leader at a time
Used in Google's Chubby system, ZooKeeper, etc.
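
To make the prepare/accept structure concrete, here is a compact single-decree Paxos sketch (one proposer routine, in-memory acceptors). Names and message shapes are illustrative assumptions; a real system exchanges these as network messages and persists acceptor state.

```python
# Single-decree Paxos (Synod) sketch: prepare phase, then accept phase, both on a majority.

class Acceptor:
    def __init__(self):
        self.promised = -1          # highest proposal number promised
        self.accepted = None        # (number, value) of highest accepted proposal

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return "accepted"
        return "rejected"

def propose(acceptors, n, value):
    majority = len(acceptors) // 2 + 1
    # Phase 1: prepare. Collect promises from a majority.
    promises = [a.prepare(n) for a in acceptors]
    granted = [acc for kind, acc in promises if kind == "promise"]
    if len(granted) < majority:
        return None
    # If any acceptor already accepted a value, propose the highest-numbered
    # such value instead of our own (this is what makes Paxos safe).
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2: accept. The value is chosen once a majority accepts it.
    votes = [a.accept(n, value) for a in acceptors]
    return value if votes.count("accepted") >= majority else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, n=1, value="commit"))    # -> "commit"
print(propose(acceptors, n=2, value="abort"))     # -> "commit": the chosen value survives
```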