Presentation on theme: "Consensus on Transaction Commit"— Presentation transcript:
1 Consensus on Transaction Commit. Paxos Commit. Jim Gray, Leslie Lamport, Microsoft Research. Preview of a paper in preparation. Presented at Microsoft Research Techfest, 3 March 2004, Redmond, WA. Article MSR-TR: Consensus on Transaction Commit.
2 Commit is Common. Marriage ceremony: Do you? I do. I now pronounce you… Theater: Ready on the set? Ready! Action! Contract law: Offer. Signature. Deal / lawsuit.
3 The Common Picture. Diagram: the director asks each of the actors "Ready?"; each actor answers "Ready"; the director then calls "Action!"
4 All or Nothing: If any actor says no, the deal is off. Diagram: the director asks the actors "Ready?"; if any actor answers "No!" or times out, the director announces "No deal!" to all the actors, including those that answered "Ready".
5 The Database Version. TM: Transaction Manager (the director). RM: Resource Manager (the actors). Diagram: the client sends Commit to the TM; the TM asks each RM "Ready?"; each RM answers "Ready"; the TM sends Commit to the RMs.
6 Two Phase Commit. N Resource Managers (RMs). Want all RMs to commit or all abort. Coordinated by Transaction Manager (TM). TM sends Prepare, Commit-Abort. RM responds Prepared, Aborted. Cost: 3N+1 messages, N+1 stable writes. Delay: 4 message delays, 2 stable-write delays. Blocking: if TM fails, Commit-Abort stalls. Diagram messages: RequestCommit, Prepare, Prepared, Commit. Resource Manager states: working, prepared, committed, aborted. Transaction Manager states: working, committed, aborted.
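The failure-free message flow on this slide can be sketched in a few lines of Python. This is a minimal in-memory model, not the presenters' implementation: the class `ResourceManager`, the `will_prepare` flag, and the function `two_phase_commit` are hypothetical names chosen for illustration, and stable writes and TM failure are not modeled. The message counter counts one RequestCommit reaching the TM, then Prepare, the reply, and the decision for each RM, which gives the slide's 3N+1.

```python
from dataclasses import dataclass


@dataclass
class ResourceManager:
    name: str
    will_prepare: bool = True   # hypothetical knob: False makes this RM answer Aborted
    state: str = "working"

    def on_prepare(self) -> str:
        # RM moves from working to prepared, or aborts unilaterally.
        self.state = "prepared" if self.will_prepare else "aborted"
        return "Prepared" if self.will_prepare else "Aborted"

    def on_decision(self, decision: str) -> None:
        self.state = "committed" if decision == "Commit" else "aborted"


def two_phase_commit(rms):
    """Run one failure-free round; return (decision, messages_sent)."""
    messages = 1                      # RequestCommit reaching the TM
    replies = []
    for rm in rms:
        messages += 1                 # TM -> RM: Prepare
        replies.append(rm.on_prepare())
        messages += 1                 # RM -> TM: Prepared or Aborted
    decision = "Commit" if all(r == "Prepared" for r in replies) else "Abort"
    for rm in rms:
        messages += 1                 # TM -> RM: Commit or Abort
        rm.on_decision(decision)
    return decision, messages


if __name__ == "__main__":
    rms = [ResourceManager(f"rm{i}") for i in range(3)]
    print(two_phase_commit(rms))      # N = 3, so 3N+1 = 10 messages
```

Because the decision exists only inside the single `two_phase_commit` call (the TM), a crash between the two loops would leave every prepared RM waiting, which is the blocking problem the slide names.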
7 The Problem With 2PC. Atomicity: all or nothing. Consistency: does the right thing. Isolation: no concurrency anomalies. Durability / Reliability: state survives failures. Availability: always up. Blocks if TM fails.
8 Problem Statement. ACID Transactions make error handling easy. One fault can make 2-Phase Commit block. Goal: ACID and Available. Non-blocking despite F faults.
9 Fault-Tolerant Two Phase Commit. If the 2PC Transaction Manager (TM) fails, the transaction blocks. Solution: add a "spare" transaction manager (non-blocking commit, 3-phase commit). Diagram: the client, two TMs, and the RMs exchange RequestCommit, Prepare, and Prepared.
10 Fault-Tolerant Two Phase Commit. If the 2PC Transaction Manager (TM) fails, the transaction blocks. Solution: add a "spare" transaction manager (non-blocking commit, 3-phase commit). But… what if…? Diagram: after RequestCommit, Prepare, and Prepared, the RMs receive commit from one TM and abort from the other. Inconsistent! Now what? The complexity is a mess.
11 Fault Tolerant 2PC. Several workarounds proposed in the database community, often called "3-phase" or "non-blocking" commit. None with a complete algorithm and correctness proof.
12 “Reaching Agreement in the Presence of Faults”. Shostak, Pease, & Lamport. JACM, 1980. 25 years of theory. Now called the Consensus problem. N processes want to agree on a value, even if F of them have failed.
13 Consensus. The consensus box collects proposed values, picks one proposed value, and remembers it forever. Diagram: one client proposes X and another proposes W; the box reports "W Chosen" to every client.
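The three properties of the consensus box can be illustrated with a single-process stand-in. This sketch assumes a first-proposal-wins policy, which is only one of many ways to pick "one proposed value"; the class name `ConsensusBox` and the method `propose` are hypothetical, and nothing here is fault tolerant or distributed.

```python
class ConsensusBox:
    """Toy, single-process model: collects proposals, picks one, never changes it."""

    _UNSET = object()  # sentinel so that None can also be a legal proposed value

    def __init__(self):
        self._chosen = ConsensusBox._UNSET

    def propose(self, value):
        # Assumption: the first proposal to arrive is the one picked.
        if self._chosen is ConsensusBox._UNSET:
            self._chosen = value
        # Every caller learns the same chosen value, forever.
        return self._chosen


if __name__ == "__main__":
    box = ConsensusBox()
    print(box.propose("W"))  # the first proposer's value is chosen
    print(box.propose("X"))  # a later proposer learns the same value
```

The rest of the talk is about building this box out of several acceptors so that it keeps working when some of them fail.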
14 Consensus for Commit: The Obvious Approach. Get consensus on TM’s decision. TM just learns consensus value. TM is “stateless”. Diagram: the client sends Request Commit; the TM sends Prepare; the RMs answer Prepared; the TM proposes Prepared to the consensus box; once "Prepared Chosen" is learned, Commit is sent to the RMs.
15 Consensus for Commit: The Paxos Commit Approach. Get consensus on each RM’s choice. TM just combines consensus values. TM is “stateless”. Diagram: the client sends Request Commit; the TM sends Prepare; "RM1 Prepared" and "RM2 Prepared" are each proposed to a separate consensus box; once "RM1 Prepared Chosen" and "RM2 Prepared Chosen" are both learned, Commit is sent to the RMs.
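"TM just combines consensus values" can be read as a pure function over the per-RM outcomes, which is why the TM needs no state of its own. The sketch below is an illustration under assumptions: the function name `combine`, the dictionary shape, and the use of `None` for "nothing chosen yet" are hypothetical choices, not taken from the talk.

```python
from typing import Dict, Optional


def combine(chosen: Dict[str, Optional[str]]) -> Optional[str]:
    """Combine per-RM consensus outcomes into a transaction outcome.

    chosen maps an RM name to "Prepared", "Aborted", or None (not yet chosen).
    Returns "Commit", "Abort", or None when the outcome is not yet determined.
    """
    values = list(chosen.values())
    if any(v == "Aborted" for v in values):
        return "Abort"            # a single Aborted decides the whole transaction
    if values and all(v == "Prepared" for v in values):
        return "Commit"           # every RM's instance chose Prepared
    return None                   # still waiting on at least one instance


if __name__ == "__main__":
    print(combine({"RM1": "Prepared", "RM2": "Prepared"}))
    print(combine({"RM1": "Prepared", "RM2": None}))
    print(combine({"RM1": "Prepared", "RM2": "Aborted"}))
```

Any TM, original or replacement, that reads the same chosen values computes the same answer, so a TM failure loses nothing.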
17 Consensus in Action. The normal (failure-free) case. Two message delays. Can optimize. Diagram: the consensus box is a set of acceptors; "Propose RM Prepared" goes to each acceptor; each acceptor sends "Vote RM Prepared" to the TMs; the TM concludes "RM Prepared Chosen".
18 Consensus in Action. TM can always learn what was chosen, or get Aborted chosen if nothing chosen yet, if a majority of acceptors are working. Diagram: the RM, the consensus box of acceptors, and several TMs.
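One reason a working majority is enough is simple counting: with 2F+1 acceptors, any two majorities of F+1 share at least one acceptor, so whichever majority a TM reaches includes someone who took part in any earlier choice. The short check below verifies that overlap by brute force for small F. The function names are hypothetical, and this shows only the quorum arithmetic, not the protocol itself.

```python
from itertools import combinations


def majorities(f: int):
    """All subsets of size F+1 drawn from 2F+1 acceptors numbered 0..2F."""
    acceptors = range(2 * f + 1)
    return [set(q) for q in combinations(acceptors, f + 1)]


def every_pair_intersects(f: int) -> bool:
    """True if any two majorities have at least one acceptor in common."""
    quorums = majorities(f)
    return all(a & b for a in quorums for b in quorums)


if __name__ == "__main__":
    for f in range(4):
        print(f, len(majorities(f)), every_pair_intersects(f))
```

With only F acceptors reachable the overlap argument no longer applies, which matches the slide's condition that a majority must be working.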
19 The Complete Algorithm. Subtle. More weird cases than most people imagine. Proved correct.
20 Paxos Commit. N RMs. 2F+1 acceptors (~2F+1 TMs). If F+1 acceptors see all RMs prepared, then transaction committed. 2F(N+1) + 3N + 1 messages, 5 message delays, 2 stable write delays. Diagram: Client, TM, RM 1…N, and Acceptors 0…2F exchange request commit, prepare, prepared, and all prepared.
21 Same algorithm when F=0 and TM = Acceptor. Two-Phase Commit: 3N+1 messages, N+1 stable writes, 4 message delays, 2 stable-write delays. Paxos Commit (tolerates F faults): 3N + 2F(N+1) + 1 messages, N+2F+1 stable writes, 5 message delays, 2 stable-write delays.
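The cost formulas on this slide are easy to tabulate and compare. The helper names `two_pc_cost` and `paxos_commit_cost` below are hypothetical; the formulas are copied from the slide as written. Note that the message-delay figures are left as the slide gives them (4 versus 5): the slide attributes the full equivalence to F=0 together with TM = Acceptor, which this arithmetic does not model.

```python
def two_pc_cost(n: int) -> dict:
    """Two-Phase Commit costs for N resource managers, as listed on the slide."""
    return {
        "messages": 3 * n + 1,
        "stable_writes": n + 1,
        "message_delays": 4,
        "stable_write_delays": 2,
    }


def paxos_commit_cost(n: int, f: int) -> dict:
    """Paxos Commit costs for N resource managers tolerating F faults."""
    return {
        "messages": 3 * n + 2 * f * (n + 1) + 1,
        "stable_writes": n + 2 * f + 1,
        "message_delays": 5,
        "stable_write_delays": 2,
    }


if __name__ == "__main__":
    n = 5
    print("2PC        ", two_pc_cost(n))
    for f in (0, 1, 2):
        print(f"Paxos F={f}", paxos_commit_cost(n, f))
```

For N = 5, tolerating one fault costs 28 messages instead of 16, and each additional tolerated fault adds 2(N+1) messages and 2 stable writes.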
22 Summary. Commit is common. Two Phase commit is good but… it is the un-availability protocol. Paxos commit is non-blocking if there are at most F faults. When F=0 (no fault-tolerance), Paxos Commit == 2PC.
24 Paxos Consensus. 6F+4 messages, 2F+1 stable writes, 4 message delays and 2 stable write delays. Group has a leader known to all; leader election is a subroutine. Process proposes a value v to leader. Leader sends proposal (phase 2) (ballot, value) to all acceptors. Acceptors respond with the max (ballot, value) they have seen. If leader gets no higher ballot, and gets at least F+1 responses, then leader can announce (ballot, value). Full protocol is 3-phase. Phase 1: leader starts new ballot. Phase 2: leader proposes value. Phase 3: if value accepted by F+1 then value is accepted. If not, leader tries to get majority value accepted.
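The phases above can be sketched as a textbook single-value Paxos ballot. This is an illustrative model under stated assumptions, not the algorithm from the paper: calls are synchronous and in memory, a failed acceptor is represented by `None`, leader election and stable storage are omitted, and the names `Acceptor`, `phase1`, `phase2`, and `run_ballot` are hypothetical.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1      # highest ballot this acceptor has joined
        self.accepted = None    # (ballot, value) of its latest vote, if any

    def phase1(self, ballot):
        """Leader starts a new ballot: promise, and report any earlier vote."""
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, self.accepted

    def phase2(self, ballot, value):
        """Leader proposes a value: vote unless a higher ballot was promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False


def run_ballot(acceptors, ballot, value, f):
    """Run one ballot; return the chosen value, or None if no majority of F+1."""
    reachable = [a for a in acceptors if a is not None]   # None models a failed acceptor
    granted = [acc for ok, acc in (a.phase1(ballot) for a in reachable) if ok]
    if len(granted) < f + 1:
        return None
    earlier = [acc for acc in granted if acc is not None]
    if earlier:
        # A value may already be chosen: adopt the one voted in the highest ballot.
        value = max(earlier, key=lambda bv: bv[0])[1]
    votes = sum(a.phase2(ballot, value) for a in reachable)
    return value if votes >= f + 1 else None


if __name__ == "__main__":
    acceptors = [Acceptor() for _ in range(3)]            # F = 1, so 2F+1 = 3
    print(run_ballot(acceptors, 1, "Prepared", 1))
    print(run_ballot(acceptors, 2, "Aborted", 1))         # learns the earlier choice
```

The second call shows a new leader trying to get Aborted chosen but learning Prepared instead, because a value chosen in an earlier ballot is never overwritten.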
25 Using Consensus. Have a consensus for each RM. Diagram: the client sends RequestCommit; the TM sends Prepare; each RM's Prepared goes through its own consensus box; Commit is then sent to the RMs.
28 Consensus. The distributed systems theory community has thought about this a lot. They call it Consensus: N processes want to agree on a value. Want to tolerate F faults: tolerate F processes stopping; tolerate F messages delayed or lost. If there are at most F faults in a window, then consensus is achieved. Benign faults need 2F+1 “acceptors”; Byzantine faults need 3F+1 “acceptors”. Stalls but safe if more than F faults.