 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 1 Principles of Reliable Distributed Systems Lecture 10: SMR with Paxos.

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 1 Principles of Reliable Distributed Systems Lecture 10: SMR with Paxos and Authenticated Byzantine Paxos Spring 2009 Idit Keidar

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 2 Material The Part-Time Parliament Lamport, TOCS 1998 Practical Byzantine Fault-Tolerance Castro and Liskov, OSDI 1999 The ABCDs of Paxos Lampson 2001

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 Reminder: Practical Models Eventually Synchronous Asynchronous with unreliable failure detectors –E.g., ◊S, ◊P,  –Output can be wrong (even arbitrary) for an unbounded (finite) prefix of a run –No safety properties 3

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 Rationale Prepare for the worst –Safety under asynchrony Hope for the best –Liveness & good performance in common cases Nice clean models –Eventual stability –Time-free abstractions: unreliable failure detectors 4

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 Indulgent Algorithms Algorithms for eventual synchrony or asynchronous model with unreliable failure detectors like ◊S, ◊P, or  An algorithm that tolerates unbounded periods of asynchrony is called indulgent [ Guerraoui 98 ] 5

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 Observations on Indulgent Consensus Algorithms Every indulgent consensus algorithm also solves uniform consensus [ Guerraoui 98 ] It is impossible to solve t-resilient indulgent consensus when t ≥ n/2 [ Chandra, Toueg 96; Guerraoui 98 ] 6

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 33 Example: n=4, t=2 p1 p2 P Decide 0 by p3 p4 Q p1 p2 P Decide p3 p4 Q 0 0 1 0 0 1 1 1 11 22 x x x x Decide 0 by validity and termination Can all fail because |Q| ≤ t Decide 1 7

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 33 Example: n=4, t=2 p1 p2 P Decide 0 by validity and terminati on p3 p4 Q 0 0 1 1 Decide 0 by validity and terminati on Decide 1 All messages between P and Q arrive after decisions are already made 8

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 9 SMR (Atomic Broadcast) by Running a Sequence of Consensus Instances

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 10 Reminder: State-Machine Replication (SMR) Data is replicated at n servers Operations are initiated by clients Operations need to be performed at all correct servers in the same order –Need to agree upon the sequence of operations –Use Atomic Broadcast = Reliable Broadcast + Total Order

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 11 Using Atomic Broadcast for SMR Network – Reliable Links between Correct Processes Atomic Broadcast Server deliver broadcast receivesend deliver broadcast receivesend Atomic Broadcast Client response request Server f(x)

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 12 Reminder: Paxos Algorithm for state machine replication with eventual synchrony (ES) –Uses  failure detector (leader election) Overcomes transient crashes & recoveries and message loss Main component: (one-shot) consensus protocol (aka Synod) –Last week

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 13 The 2 Phases of Paxos (Synod) 11 2 n...... (“accept”, b, v) 1 2 n...... 11 2 n...... (“prepare”, b) (“ack”, b, n’, v’) (“accept”, b, v) Phase 1: Learn about smaller ballotsPhase 2: Majority accepts v

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 14 SMR: Client-Server Interaction Leader-based: each process (client/server) has an estimate of who is the current leader A client sends a request to its current leader –E.g., “store X 100” or “f(x)” The leader runs the Paxos (Synod) consensus algorithm to agree on the place of the request in the sequence –Input value: request + proposed sequence number –E.g., “store X 100” is the 7 th operation in the sequence The leader sends the response to the client –After invoking the operation on its copy

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 15 Consensus per Request Number Many consensus instances are running at the same time, each for some request number, ReqNum Each node has unbounded arrays –AcceptNum[r], AcceptVal[r], r = 1,2, … –AcceptVal holds the client’s requested operation E.g., “store X 100” Invoke operations on the state machine –In order: AcceptVal[1], then AcceptVal[2], etc. –After the respective consensus algorithm decides –Then send outcome as response to client (leader only)

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 16 Failure-Free Message Flow S1 S2 Sn...... C S1 S2 Sn...... S1 S2 Sn...... (“accept”, b, r, v) (“prepare”, b) (“ack”, b, n’, v’) C Phase 1Phase 2 request response (“accept”, b, r, v) Client’s request ReqNum

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 17 Observation In Phase 1, no new consensus values sent: –Leader chooses largest unique ballot number –Gets a majority to join this ballot number –Learns the outcome of all smaller ballots from this majority In Phase 2, leader proposes either its initial value (request from client) or latest value it learned in Phase 1

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 18 Failure-Free Message Flow S1 S2 Sn...... C S1 S2 Sn...... S1 S2 Sn...... (“accept”, b, r, v) (“prepare”, b) (“ack”, b, n’, v’) C Phase 1Phase 2 request response (“accept”, b, r, v)

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 19 Message Flow: Take 2 S1 S2 Sn...... C S1 S2 Sn...... S1 S2 Sn...... C Phase 1 Phase 2 request response S1 (“accept”, b, r, v) (“prepare”, b) (“ack”, b, n’, v’) (“accept”, b, r, v)

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 20 Optimization Run Phase 1 only when the leader changes –Phase 1 is called “view change” or “recovery mode” –Phase 2 is the “normal mode” Each message includes BallotNum (from the last Phase 1) and ReqNum –E.g., ReqNum = 7 when we’re trying to agree what the 7 th operation to invoke on the state machine should be Respond only to messages with the “right” BallotNum

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 21 ReqNum vs. BallotNum BallotNum is the ballot when the current leader was elected –Changes only in “recovery mode”, when a new node thinks it is leader (by  ) –Can be once in a few hours –E.g., George W. Bush is 43 rd president ReqNum is the sequence number of the transaction being processed –E.g., 1000 transactions per second –E.g., bill number 127,356 is being passed

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 22 Paxos Atomic Broadcast: Normal Mode Upon receive (“request”, v) from client if (I am not the leader) then forward to leader else /* propose v as request number n */ ReqNum  ReqNum +1; send (“accept”, BallotNum, ReqNum, v) to all Upon receive (“accept”, b, r, v) with b = BallotNum /* accept proposal for request number n */ AcceptNum[r]  b; AcceptVal[r]  v send (“accept”, b, r, v) to all (first time only)

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 23 Recovery Mode Run once per ballot, for all request numbers –No ReqNum in “prepare” and “ack” messages The new leader must learn the outcome of all the pending requests that have smaller BallotNums –The “ack” messages include AcceptNum[r] and AcceptVal[r] for all pending requests For each of the pending requests, the leader sends an “accept” message What if there are holes? –E.g., leader learns of request number 13 and not of 12 –Fill in the gaps with dummy “do nothing” requests

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 24 Practical Byzantine Fault-Tolerance (PBFT) Aka Byzantine Paxos

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 25 Reminder: Byzantine Faults Faulty process can behave arbitrarily, i.e., they don’t have to follow the protocol. E.g., –Can suffer benign failures – crash, timing –Can send bogus values in messages –Can send messages at the wrong time –Can send different messages to different processes, etc. Captures software bugs, hacker intrusions

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 26 Reminder: Authenticated (Byzantine) Model Authentication: The receiver of a message can ascertain its origin –An intruder cannot masquerade as someone else Integrity: The receiver of a message can verify that it has not been modified in transit; –An intruder cannot substitute a false message for a legitimate one Nonrepudiation: A sender cannot falsely deny later that he sent a message

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 27 Byzantine Fault-Tolerant Consensus: Overview of Results Synchronous t-resilient algorithm – –iff t < n with authentication and weak unanimity –iff t < n/2 with authentication and strong unanimity –iff t < n/3 without authentication Eventually synchronous (ES) t-resilient algorithm –iff t < n/3 with or without authentication Homework problem: show the lower bound

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 28 Overcoming Byzantine Failures With 3t+1 Processes Recall what we did for crash failures – –We gathered “votes” from a majority in every ballot –Since every two majorities intersect, for every two ballots, at least one process votes in both But now, a faulty process can lie about what it did in the other ballot –We want a correct process in the intersection –Since n-t ≥ 2t+1, two sets of size n-t intersect by at least one correct process –Gather n-t votes in a ballot, to ensure that for every two ballots, at least one correct process votes in both

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 29 Byzantine Paxos Setting State machine replication Structured like Paxos: –Updates are sent to the current leader –Leader uses a consensus algorithm to have all replicas agree on the order of updates –Our focus today is the Consensus algorithm Used to implement BFS – Byzantine Fault Tolerant NFS –Only 3% slower than un-replicated NFS

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 30 Model n processes: {1,…n} Up to t Byzantine failures, t < n/3 –For simplicity, assume n = 3t+1 Authentication (PKI) Reliable links, no recovery (for now)

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 31 Reminder: Classic Paxos Phase I Periodically, until decision is reached do: if leader (by  ) then BallotNum   BallotNum.num+1, myId  send (“prepare”, BallotNum) to all Upon receive (“prepare”, b) from i if b  BallotNum then BallotNum  b send (“ack”, b, AcceptNum, AcceptVal) to i

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 32 Reminder: Classic Paxos Phase II Upon receive (“ack”, BallotNum, b, val) from n-t if all vals =  then myVal = initial value else myVal = received val with highest b send (“accept”, BallotNum, myVal) to all /* proposal */ Upon receive (“accept”, b, v) with b  BallotNum AcceptNum  b; AcceptVal  v /* accept proposal */ send (“accept”, b, v) to all (first time only)

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 33 How Can Byzantine Failures Cause Problems?

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 34 Safety Problems: Leader Can Lie Problem 1: Leader can choose a value different than the highest accepted by n-t processes –Solution: Can “prove” he’s not lying by sending the signed “ack” (Phase 1) messages to all processes Problem 2: If no previous ballot was accepted, leader can send different new values to different processes

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 35 Solution to the 2 nd Problem Before accepting a value proposed by the leader, verify that the value was proposed to “enough” processes Byzantine Paxos Phases: –Phase 1: Prepare –Phase 2: Propose – echo leader’s proposal –Phase 3: Accept – now only if n-t proposed Add new variable: PropNum, initially 0

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 36 Safety Problems: Others Can Lie Problem 3: Faulty users can send invalid “accept” messages –Solution: Wait for n-t=2t+1 “accept” messages Problem 4: Faulty users can send invalid values with higher AcceptNums in “ack” messages –Solution: Can “prove” value is valid by forwarding signed “propose” messages –Add new variable: Proof, initially empty set

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 37 Liveness Problems Problem 5: Faulty leader can deadlock algorithm –Solution: Propose a new leader when the current does not deliver –Use rotating coordinator until one is correct, leader will be (BallotNum mod n)+1 Problem 6: Faulty processes may keep selecting new leaders all the time (livelock) –Solution: Accept a new ballot only if t+1 processes propose a new leader

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 38 And Now For Our Feature Presentation The Byzantine Paxos Consensus Algorithm

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 39 Byzantine Paxos Variables Int BallotNum, initially 0 Int PropNum, initially 0 Int AcceptNum, initially 0 Value  {  } AcceptVal, initially  Message Set Proof, initially empty Define: Leader = (BallotNum mod n)+1

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 40 Byzantine Paxos Phase I: Prepare Upon timeout on Leader BallotNum  BallotNum +1 send (“prepare”, BallotNum) to all Upon receive (“prepare”, b) from t+1 if (b < BallotNum) then return if (b > BallotNum) then BallotNum  b send (“prepare”, BallotNum) to all send (“ack”, b, AcceptNum, AcceptVal, Proof) to Leader

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 41 Byzantine Paxos Phase II: Propose Upon receive (“ack”, BallotNum, b, val, proof) from n-t S = {received (signed) “ack” messages} if (all vals that have valid proofs in S are  then myVal  init value else myVal  val that has valid proof with highest b in S send (“propose”, BallotNum, myVal, S) to all Upon receive (“propose”, BallotNum, v, S) if (BallotNum  PropNum) then return if (v is not a valid choice given S) then return PropNum  BallotNum send (“propose”, BallotNum, v, S) to all

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 42 Byzantine Paxos Phase III: Accept Upon receive (“propose”, b, v, S) from n-t if (b < BallotNum) then return AcceptNum  b; AcceptVal  v Proof  set of n-t signed “propose” messages send (“accept”, b, v) to all Upon receive (“accept”, b, v) from n-t decide v

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 43 In Failure-Free Runs accept 1 prepareackpropose 2 n...... 1 2 n...... 11 2 n...... 1 2 n...... 1 2 n...... All send prepare All echo propose

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 44 Some Optimizations Prepare and its “ack” can be merged into one message round Proofs don’t have to be sent with messages: processes can have the information to check the proofs locally because the original messages are multicast

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 45 Invariant If proposals (b,v) and (b, v’) are accepted by correct processes i and j, (possibly i = j ) then v’=v Proof: –An accepted proposal is proposed by n-t processes –Two sets of n-t = 2t+1 processes have at least one correct process in common –A correct process sends no more than one propose message with the same b

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 46 Lemma 1 If a proposal (b,v) is accepted by t+1 correct processes, then for every proposal (b’, v’) with b’>b that is proposed by a correct process, v’=v. Again, follows from Lemma 2…

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 47 Lemma 2 If a proposal (b,v) is proposed by a correct process, then there is a set S including at least t+1 correct processes such that either –(1) no correct p in S accepts a proposal ranked less than b; or –(2) v is the value of the highest-ranked proposal among proposals ranked less than b accepted by correct processes in S

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 48 Proving Lemma 1 from Lemma 2 Assume (b,v) is accepted by t+1 correct processes, and consider the lowest ranked proposal (b’, v’) with b’>b proposed by a correct process Since two sets of t+1 correct processes have at least one correct process in common, case (1) of Lemma 2 is impossible, and by case (2), v’=v Continue by induction on ballot number

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 49 Proving Agreement Let v be a decided value. The first process that decides v receives a n-t accept messages for v with some ballot b, i.e., (b,v) is accepted by at least t+1 correct processes No other value is accepted by a correct process with the same b. Why? Let (v 1,b 1 ) be the first proposal accepted by n-t By Lemma 1, v 1 is the only possible decision value

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 50 Liveness Is the current leader making progress? –If yes, some correct process decides. This process can periodically forward the “proof” for its decision to others so they will decide too. –If not, all timeout on the leader and start a new ballot. Once there is a correct leader. –The n-t correct processes will send all the needed messages. –The t faulty processes will not be able to force a new ballot.

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 51 Atomic Broadcast: Issues Leader can propose invalid client requests Leader can refrain from proposing client requests Leader can lie to client about response Leader can refrain from sending client responses Solution: clients cannot trust a single server

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 52 Byzantine Message Flow accept S1 prepareackpropose S2 Sn...... S1 S2 Sn...... S1 S2 Sn...... S1 S2 Sn...... propose S1 S2 Sn...... S1 request S2 Sn...... C C response

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 1 Principles of Reliable Distributed Systems Lecture 10: SMR with Paxos.

Similar presentations

Presentation on theme: " Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 1 Principles of Reliable Distributed Systems Lecture 10: SMR with Paxos."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

 Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 1 Principles of Reliable Distributed Systems Lecture 10: SMR with Paxos.

Similar presentations

Presentation on theme: " Idit Keidar, Principles of Reliable Distributed Systems, Technion EE, Spring 2009 1 Principles of Reliable Distributed Systems Lecture 10: SMR with Paxos."— Presentation transcript:

Similar presentations

About project

Feedback