1 Paxos Commit Jim Gray Leslie Lamport Microsoft Research Preview of a paper in preparation Presented Microsoft Research Techfest 3 March 2004, Redmond, WA Article MSR-TR-2003-96 Consensus on Transaction Commit http://research.microsoft.com/research/pubs/view.aspx?tr_id=701
2 Commit is Common. Marriage ceremony: Do you? I do. I now pronounce you… Theater: Ready on the set? Ready! Action! Contract law: Offer. Signature. Deal / lawsuit.
3 The Common Picture. Director and actors: Ready? Ready. Action!
4 All or Nothing: If any actor says no the deal is off. director actors Ready? Ready No! Ready No deal! No! or timeout
5 The Database Version. director, actors. TM: Transaction Manager. RM: Resource Manager. Diagram labels: client, TM, RM; Ready? Ready. Commit.
6 Two Phase Commit. N Resource Managers (RMs). Want all RMs to commit or all abort. Coordinated by Transaction Manager (TM). TM sends Prepare, Commit-Abort. RM responds Prepared, Aborted. 3N+1 messages, N+1 stable writes. Delay: 4 messages, 2 stable writes. Blocking: if TM fails, Commit-Abort stalls. Transaction Manager states: working, committed, aborted. Resource Manager states: working, prepared, committed, aborted. Messages: RequestCommit, Prepare, Prepared, Commit.
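The decision rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' code: the names `Vote`, `two_phase_commit`, and `message_count` are hypothetical, `None` is an assumed way to model an RM that times out, and the message count simply restates the slide's 3N+1 figure.

```python
from enum import Enum


class Vote(Enum):
    PREPARED = "prepared"
    ABORTED = "aborted"


def two_phase_commit(votes):
    """Return the TM's decision given each RM's reply to Prepare.

    votes: one entry per resource manager; None models an RM that
    timed out before answering (an assumption of this sketch).
    """
    # Phase 1: the TM has sent Prepare and collected the replies.
    all_prepared = all(v is Vote.PREPARED for v in votes)
    # Phase 2: commit only if every RM prepared; otherwise abort.
    return "commit" if all_prepared else "abort"


def message_count(n):
    """Message count for N RMs, as stated on the slide (3N+1)."""
    return 3 * n + 1
```

A single Aborted reply or timeout is enough to abort. The sketch says nothing about a TM failure after the RMs have prepared, which is the blocking case the slide points out.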
7 The Problem With 2PC. Atomicity: all or nothing. Consistency: does right thing. Isolation: no concurrency anomalies. Durability / Reliability: state survives failures. Availability: always up. Blocks if TM fails.
8 Problem Statement ACID Transactions make error handling easy. One fault can make 2-Phase Commit block. Goal: ACID and Available. Non-blocking despite F faults.
9 Fault-Tolerant Two Phase Commit. If the 2PC Transaction Manager (TM) fails, the transaction blocks. Solution: add a spare transaction manager (non-blocking commit, 3-phase commit). Diagram labels: client, TM, RM, TM, RM; RequestCommit, Prepare, Prepared.
10 Fault-Tolerant Two Phase Commit. If the 2PC Transaction Manager (TM) fails, the transaction blocks. Solution: add a spare transaction manager (non-blocking commit, 3-phase commit). But… what if…? Diagram labels: client, TM, RM, TM, RM, TM; RequestCommit, Prepare, Prepared, commit, abort. Inconsistent! Now what? The complexity is a mess.
11 Fault Tolerant 2PC Several workarounds proposed in database community: Often called "3-phase" or "non-blocking" commit. None with complete algorithm and correctness proof.
12 Reaching Agreement in the Presence of Faults 25 years of theory Now called the Consensus problem N processes want to agree on a value, even if F of them have failed. Shostak, Pease, & Lamport JACM, 1980
13 Consensus collects proposed values, picks one proposed value, and remembers it forever. Diagram: one client proposes X, another proposes W; the consensus box reports W Chosen to both.
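As a rough illustration of the consensus box's contract (collect proposals, pick one, remember it forever), here is a toy single-process stand-in in Python. The class name `ConsensusBox` and the rule that the first proposal to arrive wins are assumptions of this sketch; the slide only says that one proposed value is picked, and a real consensus service is distributed and fault-tolerant.

```python
class ConsensusBox:
    """Toy, single-process stand-in for a consensus service."""

    def __init__(self):
        self._chosen = None
        self._decided = False

    def propose(self, value):
        """Offer a value; return whichever value has been chosen."""
        # Assumption: the first proposal to arrive is the one picked.
        if not self._decided:
            self._chosen = value
            self._decided = True
        # Once chosen, the value never changes ("remembers it forever").
        return self._chosen

    def chosen(self):
        """Return the chosen value, or None if nothing was proposed yet."""
        return self._chosen
```

Every proposer learns the same value, no matter what it proposed itself.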
14 Consensus for Commit: The Obvious Approach. Get consensus on the TM's decision. TM just learns the consensus value. TM is stateless. Diagram labels: client, TM, RM, consensus box; RequestCommit, Prepare, Prepared, Propose Prepared, Prepared Chosen, Commit.
15 Consensus for Commit: The Paxos Commit Approach. Get consensus on each RM's choice. TM just combines the consensus values. TM is stateless. Diagram labels: client, TM, RM, two consensus boxes; RequestCommit, Prepare, Propose RM1 Prepared, Propose RM2 Prepared, RM1 Prepared Chosen, RM2 Prepared Chosen, Commit.
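The "TM just combines consensus values" step might look like the following Python sketch. The function name `combine`, the string values "Prepared" and "Aborted", and the use of `None` for an instance that has not yet chosen a value are all assumptions made for illustration.

```python
def combine(chosen_by_rm):
    """Combine the per-RM consensus outcomes into one transaction outcome.

    chosen_by_rm maps an RM name to the value its consensus instance
    chose: "Prepared", "Aborted", or None if nothing is chosen yet.
    Returns "commit", "abort", or None while the outcome is still open.
    """
    values = list(chosen_by_rm.values())
    # Any RM whose chosen value is Aborted forces the whole deal off.
    if any(v == "Aborted" for v in values):
        return "abort"
    # Commit only once every instance has chosen Prepared.
    if all(v == "Prepared" for v in values):
        return "commit"
    return None
```

Because the function reads only chosen values and keeps nothing of its own, any process that can learn those values can compute the same outcome, which is the sense in which the TM is stateless.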
22 Summary. Commit is common. Two Phase Commit is good, but… it is the un-availability protocol. Paxos Commit is non-blocking if there are at most F faults. When F=0 (no fault-tolerance), Paxos Commit == 2PC.
24 Paxos Consensus. Group has a leader known to all (leader election is a subroutine). Process proposes a value v to leader. Leader sends proposal (phase 2) (ballot, value) to all acceptors. Acceptors respond with: max(ballot, value) they have seen. If leader gets no higher ballot, and gets at least F+1 responses, then leader can announce (ballot, value). Full protocol is 3-phase. Phase 1: leader starts new ballot. Phase 2: leader proposes value. Phase 3: if value accepted by F+1, then value is accepted; if not, leader tries to get majority value accepted. 6F+4 messages, 2F+1 stable writes. 4 message delays and 2 stable write delays.
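The ballot mechanics can be illustrated with a small in-memory Python sketch of single-value Paxos. It follows the common textbook split into a prepare step and an accept step, which does not line up exactly with the slide's phase numbering. The names `Acceptor` and `run_ballot` are hypothetical, calls are synchronous, and failures, stable storage, and leader election are all left out.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1    # highest ballot this acceptor has promised
        self.accepted = None  # (ballot, value) last accepted, if any

    def prepare(self, ballot):
        """Promise to ignore lower ballots; report any accepted value."""
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, self.accepted

    def accept(self, ballot, value):
        """Accept the proposal unless a higher ballot was promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False


def run_ballot(acceptors, ballot, value, f):
    """Leader's attempt to get a value chosen; return it, or None."""
    quorum = f + 1  # a majority of 2F+1 acceptors
    replies = [a.prepare(ballot) for a in acceptors]
    promises = [acc for ok, acc in replies if ok]
    if len(promises) < quorum:
        return None  # a higher ballot is already in progress
    # If some acceptor already accepted a value, the leader must adopt
    # the one with the highest ballot instead of its own value.
    prior = [acc for acc in promises if acc is not None]
    if prior:
        value = max(prior, key=lambda bv: bv[0])[1]
    votes = sum(a.accept(ballot, value) for a in acceptors)
    return value if votes >= quorum else None
```

With F=1 and three acceptors, a first ballot proposing "Prepared" gets it chosen. A later ballot that proposes "Aborted" still returns "Prepared", because the leader adopts the value already accepted.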
25 Using Consensus. Have a consensus for each RM. Diagram labels: client, TM, RM, TM, RM, two consensus boxes; RequestCommit, Prepare, Prepared, Commit.
26 Diagram: RM proposes X, TM proposes W; the consensus box reports X Chosen to both.
27 Paxos Commit (success case) Acceptors working prepared committed aborted Resource Managers working AllPrepared aborted Commit Leader working committed aborted Request Commit Prepare Prepared Commit All Prepared
28 Consensus. The distributed systems theory community has thought about this a lot. They call it Consensus: N processes want to agree on a value. Want to tolerate F faults: tolerate F processes stopping; tolerate F messages delayed or lost. If there are at most F faults in a window, then consensus is achieved. Byzantine faults need 3F+1 acceptors. Benign faults need 2F+1 acceptors. Stalls but safe if more than F faults.
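One way to see why benign faults call for 2F+1 acceptors is that any two majorities (F+1 out of 2F+1) share at least one acceptor, so two conflicting values can never both gather a majority. The brute-force Python check below illustrates this for small sizes; the function name `quorums_always_intersect` is hypothetical.

```python
from itertools import combinations


def quorums_always_intersect(n, quorum):
    """True if every pair of quorum-sized subsets of n acceptors overlaps."""
    quorums = combinations(range(n), quorum)
    # Compare every pair of distinct quorums; an empty intersection
    # would let two different values be accepted by separate quorums.
    return all(set(a) & set(b) for a, b in combinations(quorums, 2))
```

For F=1, `quorums_always_intersect(3, 2)` holds, while four acceptors with quorums of two (`quorums_always_intersect(4, 2)`) fail the check because {0, 1} and {2, 3} are disjoint.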