State Machine Replication
Replicas are identical deterministic state machines.
They process operations in the same order, and therefore remain consistent.
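The point of the slide can be sketched in a few lines of Python. This is a minimal illustration, not part of the original talk: the `CounterReplica` class, the `run` helper and the operation names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CounterReplica:
    """A deterministic state machine (hypothetical example)."""
    value: int = 0
    history: list = field(default_factory=list)

    def apply(self, op):
        # Each operation is a (kind, amount) pair; no randomness, no clocks.
        kind, amount = op
        if kind == "add":
            self.value += amount
        elif kind == "mul":
            self.value *= amount
        else:
            raise ValueError(f"unknown operation: {kind}")
        self.history.append(op)


def run(replica, log):
    """Apply every operation of the log in order and return the final state."""
    for op in log:
        replica.apply(op)
    return replica.value


log = [("add", 2), ("mul", 5), ("add", 1)]

# Two replicas that apply the same log in the same order end in the same state.
same_order = [run(CounterReplica(), log) for _ in range(2)]

# A replica that applies the same operations in a different order diverges.
reordered = run(CounterReplica(), [log[1], log[0], log[2]])
```

Agreeing on the order of operations is exactly what the replicas need consensus for.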
Consensus
Building block for state machine replication.
Each process has an input and should decide on an output so that:
– Agreement: decisions are the same
– Validity: the decision is the input of one process
– Termination: eventually all correct processes decide
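The three properties can be checked mechanically on a finished run. The sketch below assumes a hypothetical representation (dictionaries keyed by process name) and treats "eventually" as "by the end of the recorded run", which is a simplification.

```python
def check_consensus(inputs, decisions, correct):
    """Check the consensus properties on one completed run.

    inputs:    dict process -> input value
    decisions: dict process -> decided value (processes that never decided are absent)
    correct:   set of processes that did not fail
    """
    decided_values = set(decisions.values())
    return {
        # Agreement: no two processes decide differently.
        "agreement": len(decided_values) <= 1,
        # Validity: every decided value is some process's input.
        "validity": decided_values <= set(inputs.values()),
        # Termination: every correct process has decided.
        "termination": all(p in decisions for p in correct),
    }


good = check_consensus({"a": 1, "b": 2, "c": 2}, {"a": 2, "b": 2}, {"a", "b"})
bad = check_consensus({"a": 1, "b": 2, "c": 2}, {"a": 1, "b": 2}, {"a", "b", "c"})
```

In the second run the processes decide differently and `c` never decides, so agreement and termination both fail while validity still holds.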
Basic Model
Message passing
Channels between every pair of processes
– do not create, duplicate or alter messages (integrity)
Failures
What about timing?
Synchronous Model
Very convenient for algorithms
– understanding performance
– early decision with no/few failures
Synchronous Model: Limitation
Requires very conservative (long) timeouts
– in practice: avg. latency < max. latency / 100 [Cardwell, Savage, Anderson 2000], [Bakr-Keidar 2002]
Asynchronous Model
Unbounded message delay
Much more practical
Fault-tolerant consensus is impossible [FLP85]
Eventually Stable (Indulgent) Models
Initially asynchronous
– for an unbounded period of time
Eventually reach stabilization
– GST (Global Stabilization Time)
– following GST, certain assumptions hold
Examples
– ES (Eventual Synchrony): all links are ◊timely [Dwork, Lynch, Stockmeyer 88]
– failure detectors: Ω (eventual leader), ◊S [Chandra, Toueg 96], [Chandra, Hadzilacos, Toueg 96]
Why Eventual Stabilization?
Because “eventually” formally models “most of the time” (in stable periods).
In practice, stability does not have to last forever, just “long enough” for the algorithm (T_A).
T_A depends on our synchrony assumptions!
Our Goals
1. Understand the relationship between:
– assumptions (number of timely links, with or without Ω, etc.) and
– performance of algorithms that exploit them
In runs that eventually satisfy these assumptions
– unlike stable runs in previous work
And only these assumptions
– unlike synchronous runs in previous work
2. Understand how message complexity affects performance.
Reminder – GIRAF [Keidar&Shraer 06]
General Round-based Algorithm Framework
Organizes algorithms into rounds
Separates algorithm logic from the waiting condition
Does not require rounds to be synchronized among processes
Allows messages to arrive in any round
Can capture any oracle model of [CHT 96]
Can express models that cannot be expressed in RRFD [Gafni 98]
GIRAF – The Generic Algorithm
Algorithm for process p_i:
upon receive m
    add m to M (msg buffer)
upon end-of-round (waiting condition controlled by the environment)
    FD ← oracle(k)
    if (k = 0) then ⟨out_msg, Dest⟩ ← initialize(FD)
    else ⟨out_msg, Dest⟩ ← compute(k, M, FD)
    k ← k+1
    enable sending of out_msg to Dest
initialize() and compute(): your pet algorithm here
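The generic algorithm can be sketched in Python as follows. This is an illustrative reading of the slide, not the authors' implementation: the class name `GirafProcess`, the lockstep driver `run_lockstep` and the toy "propagate the maximum" algorithm plugged in as the pet algorithm are all assumptions made for the example.

```python
class GirafProcess:
    """One process running the generic round-based loop (hypothetical sketch)."""

    def __init__(self, pid, oracle, initialize, compute):
        self.pid = pid
        self.k = 0                    # current round number
        self.M = []                   # message buffer: (round, sender, payload)
        self.oracle = oracle          # oracle(k) -> failure detector output
        self.initialize = initialize  # initialize(FD) -> (out_msg, dest)
        self.compute = compute        # compute(k, M, FD) -> (out_msg, dest)

    def on_receive(self, m):
        self.M.append(m)

    def on_end_of_round(self):
        """Called by the environment, which controls the waiting condition."""
        fd = self.oracle(self.k)
        if self.k == 0:
            out_msg, dest = self.initialize(fd)
        else:
            out_msg, dest = self.compute(self.k, self.M, fd)
        self.k += 1
        return self.k, out_msg, dest  # the environment sends out_msg to dest


def make_max_algorithm(my_value, peers):
    """Toy pet algorithm: every process adopts the largest value it has seen."""
    state = {"est": my_value}

    def initialize(fd):
        return state["est"], peers

    def compute(k, M, fd):
        state["est"] = max([state["est"]] + [payload for (_, _, payload) in M])
        return state["est"], peers

    return state, initialize, compute


def run_lockstep(values, rounds):
    """Environment in which every link is timely in every round."""
    pids = list(range(len(values)))
    procs, states = [], []
    for pid in pids:
        state, init, comp = make_max_algorithm(values[pid], pids)
        states.append(state)
        procs.append(GirafProcess(pid, lambda k: None, init, comp))
    for _ in range(rounds):
        # All end-of-round events fire first, then all messages are delivered.
        outgoing = [(p.pid,) + p.on_end_of_round() for p in procs]
        for sender, k, out_msg, dest in outgoing:
            for d in dest:
                procs[d].on_receive((k, sender, out_msg))
    return [s["est"] for s in states]


estimates = run_lockstep([3, 7, 5], rounds=2)
```

The algorithm logic lives entirely in `initialize` and `compute`; when `on_end_of_round` fires is decided by the driver, which is how the framework separates the algorithm from the waiting condition.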
Defining Properties in GIRAF
The environment can have
– perpetual properties, φ
– eventual properties, ◊φ
In every run r, there exists a round GSR(r)
GSR(r): the first round from which
– no process fails
– all eventual properties hold in each round
Example Communication Properties
timely link in round k: p_d receives the round-k message of p_s in round k
– if p_d is correct and p_s executes round k (end-of-round_s occurs in round k)
j-source: the same j timely outgoing links in every round
j-source_v: j timely outgoing links in every round (can vary in each round)
j-destination: the same j incoming timely links from correct processes in every round
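The difference between the fixed and the varying ("v") versions of these properties is easy to see on a recorded trace. The sketch below assumes a hypothetical trace format (one set of timely `(source, destination)` pairs per round), reads "j links" as "at least j links", and does not model the side conditions on correctness of p_d or on p_s executing the round.

```python
def timely_out(round_links, src):
    """Destinations that src reaches over a timely link in one round."""
    return {d for (s, d) in round_links if s == src}


def timely_in(round_links, dst, correct):
    """Correct sources with a timely link to dst in one round."""
    return {s for (s, d) in round_links if d == dst and s in correct}


def is_j_source_v(trace, src, j):
    """At least j timely outgoing links in every round; the set may vary."""
    return all(len(timely_out(r, src)) >= j for r in trace)


def is_j_source(trace, src, j):
    """The same j timely outgoing links in every round."""
    if not trace:
        return True
    common = set.intersection(*(timely_out(r, src) for r in trace))
    return len(common) >= j


def is_j_destination(trace, dst, j, correct):
    """The same j timely incoming links from correct processes in every round."""
    if not trace:
        return True
    common = set.intersection(*(timely_in(r, dst, correct) for r in trace))
    return len(common) >= j


# Two rounds: process 0 always has two timely outgoing links, but not the same two.
trace = [
    {(0, 1), (0, 2), (3, 2)},
    {(0, 2), (0, 3), (3, 2)},
]
```

On this trace process 0 is a 2-source_v but only a 1-source, because link 0→2 is the only one that is timely in both rounds.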
Example Oracle Properties
leader: a correct process p_i such that for every round k and every process p_j: oracle_j(k) = i
– range of oracle() is the set of process indices
Ω failure detector: ◊leader
Timing Models
ES (Eventual Synchrony) [Dwork et al. 88]
– all links between correct processes are ◊timely
– consensus in 3 rounds (optimal) [Dutta et al. 04]
◊AFM (All-From-Majority), simplified:
– every correct process is ◊majority-destination_v and ◊majority-source_v
– consensus in 5 rounds [Keidar&Shraer 06]
◊LM (Leader and Majority):
– Ω, the leader is ◊n-source, every correct process is ◊majority-destination_v
– consensus in 3 rounds [Keidar&Shraer 06]
Legend:
– Ω: from some round onward, one process is trusted by all (leader)
– ◊timely: from some round onward, the link delivers messages in the round they were sent
– majority-destination_v: majority of timely incoming links (v means the majority can change each round)
– majority-source_v: majority of timely outgoing links
New Model: ◊WLM
Ω, the leader is ◊n-source and ◊majority-destination_v
– only the leader, unlike all processes in ◊LM
– similar to [Malkhi et al. 05], a little stronger
Previous Work
Most Ω-based algorithms wait for a majority in each round
Paxos [Lamport 98] can make progress in ◊WLM
– takes a constant number of rounds in ES
– but how many rounds without ES?
Paxos Run in ES
BallotNum: number of attempts to decide initiated by leaders
[Figure: message diagram of a Paxos run in ES, showing the Ω leader and each process's BallotNum (values 1, 2, 5, 20, later 21). Labels: (“prepare”, 2) with replies “no” and “yes”; (“prepare”, 21) with reply “yes”; (Commit, 21, v_1); decide v_1.]
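Paxos in ◊WLM (w/out ES)
[Figure: message diagram over rounds GSR to GSR+3, showing the Ω leader and each process's BallotNum (initially 1, 5, 20, 8, 13). The leader sends (“prepare”, 2), (“prepare”, 9) and (“prepare”, 14) in turn; replies include ok, no (5), no (8) and no (13).]
Commit takes O(n) rounds!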
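Why the prepare phase can take O(n) rounds is easiest to see by replaying a schedule of replies. The sketch below is a deliberately simplified model, not Paxos itself: the function `prepare_attempts` and the reply schedule are hypothetical, with ballot numbers chosen to resemble those in the figure. In each attempt the leader only hears from some of the processes, so each refusal may reveal just one more higher ballot.

```python
def prepare_attempts(leader_ballot, replies_per_attempt):
    """Count prepare attempts until one is not refused.

    replies_per_attempt[r] lists the BallotNum values held by the processes
    whose replies reach the leader in attempt r (a hypothetical schedule).
    Returns (attempts, final_ballot), or (None, ballot) if the schedule ends
    before any prepare is accepted.
    """
    attempts = 0
    for heard in replies_per_attempt:
        attempts += 1
        higher = [b for b in heard if b > leader_ballot]
        if not higher:
            return attempts, leader_ballot   # nobody refused
        leader_ballot = max(higher) + 1      # retry above every refusal seen
    return None, leader_ballot


# The set of processes heard changes from attempt to attempt, so the
# leader discovers the higher ballots 8, 13 and 20 one at a time.
schedule = [
    [1, 5, 8],      # refusals carry 5 and 8  -> retry with 9
    [9, 9, 13],     # refusal carries 13      -> retry with 14
    [14, 14, 20],   # refusal carries 20      -> retry with 21
    [21, 21, 21],   # no refusal
]
result = prepare_attempts(2, schedule)
```

With n processes holding distinct stale ballots, a schedule of this shape can force a number of attempts proportional to n.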
New Consensus Algorithm for WLM
Tolerates unbounded periods of asynchrony
A minority of the processes can crash
Message efficient: O(n) stable-state message complexity
Achieves global decision in 4 rounds if the leader is stable before GSR
– 5 otherwise
Our ◊WLM Algorithm in a Nutshell
Commit with increasing ballot numbers, decide on a value committed by a majority
– like Paxos, etc.
Challenge: we don't know all ballots, so how do we choose the new one to be the highest?
Solution: choose it to be the round number
Challenge: rounds are wasted if a prepare/commit fails
Solution: pipeline prepares and commits: try in each round
Challenge: do processes really need to say no?
Solution: support the leader's prepare even when holding a higher ballot number
– challenge: a higher number may reflect a later decision! Won't agreement be compromised?
– solution: a new field “trustMe” ensures the supported leader doesn't miss real decisions: it is set in round k+1 if a majority trusts the leader in round k
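Example Run: GSR=100
[Figure: run over rounds GSR to GSR+4, showing the Ω leader and each process's BallotNum (initially 1, 5, 20, 8, 13; the value 102 appears later). Annotations, in order: “All Prepare with !trustMe”, “All Commit”, “Did not lead to decision”, “Leader Decides”, “All Decide”.]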
Probabilistic Analysis
Each link is timely with probability p in each round
– Independent and Identically Distributed (IID) Bernoulli random variables
Other simplifying assumptions:
– synchronous rounds
– no failures
Good starting point to understand behavior in real systems
For each model M, calculate:
– P_M: probability that the requirements of M hold in a round
– expected number of rounds until the requirements of M hold long enough
– E(D_M): expected number of rounds until (global) decision in M
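Under the IID assumption these quantities have simple closed forms. The sketch below is one possible way to compute them, not the paper's exact derivation. Assumptions made here: self-links are excluded, a majority counts the process itself, and "long enough" is passed in as a number of consecutive good rounds, since that number depends on the algorithm. All function names are hypothetical.

```python
from math import comb


def binom_tail(m, p, k):
    """P[at least k successes in m independent Bernoulli(p) trials]."""
    return sum(comb(m, i) * p**i * (1 - p) ** (m - i) for i in range(k, m + 1))


def p_es(n, p):
    """ES round: all n*(n-1) directed links are timely."""
    return p ** (n * (n - 1))


def p_wlm(n, p):
    """WLM round: the leader's n-1 outgoing links are timely, and enough of its
    n-1 incoming links are timely to form a majority together with itself."""
    others_needed = n // 2
    return p ** (n - 1) * binom_tail(n - 1, p, others_needed)


def expected_rounds_until_run(q, length):
    """Expected number of rounds until `length` consecutive rounds succeed,
    when each round succeeds independently with probability q."""
    return sum(q ** -i for i in range(1, length + 1))


n, p = 8, 0.97
es_round = p_es(n, p)
wlm_round = p_wlm(n, p)
```

Because ES needs every one of the n(n-1) links to be timely while WLM constrains only the leader's links, `wlm_round` is much larger than `es_round` for the same p, and `expected_rounds_until_run` grows quickly as the per-round probability drops.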
Comparing the Models (IID)
[Figure: expected number of rounds for global decision (n=8).]
ES requires 350 rounds for p=0.97
LAN measurements
How frequent is a stable round in each model?
– compare measured P_M to the IID prediction
For IID: p = fraction of timely messages (over all rounds)
– example: for timeout = 0.1ms, p = 0.7; for timeout = 0.2ms, p = 0.976
Compared to the IID prediction:
– ES is slightly better in practice (a slow round)
– AFM is slightly worse in practice (a slow node)
– WLM, LM are better in practice (good leader)
WLM rounds are the most frequent!
GIRAF implementation for WAN
Some round synchronization is needed for all models
– in a LAN, computers often have synchronized clocks
A simple algorithm to implement GIRAF:
– L_i[j]: average latency between n_i and n_j as measured by n_i (pings)
– timeout: input parameter
Sender thread:
    send message to peers
    wait for timeout
    compute next round msg.
    upon notify: stop waiting, jump to round k_j (duration: timeout - L_i[j])
Receiver thread:
    upon receive m
        add m to M (msg buffer)
        if m belongs to round k_j > k_i, notify sender thread
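The round-jumping rule can be sketched without threads as a small state object. This is an interpretation of the slide rather than the measured implementation: the class `RoundClock` is hypothetical, and it assumes that after jumping to a later round the node waits only `timeout - L_i[j]`, since the message that triggered the jump has already been in flight for about `L_i[j]`.

```python
class RoundClock:
    """Round bookkeeping for one node n_i (hypothetical sketch)."""

    def __init__(self, timeout, latency):
        self.timeout = timeout   # round length, an input parameter
        self.latency = latency   # latency[j]: average latency to n_j, measured by pings
        self.k = 0               # current round k_i
        self.M = []              # message buffer

    def on_receive(self, sender, msg_round, payload):
        """Receiver side: buffer the message and decide whether to jump.

        Returns None if the sender thread should keep waiting, otherwise the
        time to wait in the round that was jumped to.
        """
        self.M.append((msg_round, sender, payload))
        if msg_round > self.k:
            self.k = msg_round   # jump to the later round k_j
            return max(0.0, self.timeout - self.latency[sender])
        return None


clock = RoundClock(timeout=0.2, latency={1: 0.05})
wait_after_jump = clock.on_receive(sender=1, msg_round=3, payload="x")
no_jump = clock.on_receive(sender=1, msg_round=2, payload="y")
```

A real implementation would run the receiver in its own thread and use the returned duration to re-arm the sender thread's timer.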
WAN measurements
Questions:
– How frequent is a stable round in each model? (P_M)
– For each model M, measure time and #rounds until global decision in M
– How to set the timeout?
The experiment:
– 33 runs for each timeout, 300 rounds per run
– a run is represented by the average over 15 different points in the run
Asynchronous node startup
– rounds before the first stable round of the model are not considered
Question 1: Stable Rounds (P_M)
Up to 99% of messages arrive by a timeout of 350ms
– waiting for 100% requires orders of magnitude longer [Cardwell et al. 98]
LM is sensitive to a single slow node
– in some runs P_LM = 95%, in others P_LM = 15%
AFM is constantly low: around 40%
ES is constantly rare for small timeouts
– occasionally good for larger timeouts; sensitive to individual slow messages
WLM rounds are the most frequent (15% better than LM for 160ms), with the lowest variance!
Question 2: Global decision
WLM is best for timeout < 180ms; same as the others for higher timeouts.
Choice of leader matters…
– with a bad leader, use AFM
Question 3: Choosing the Timeout
Tradeoff:
– longer timeouts: more stable rounds, so fewer rounds until decision
– but: each round takes longer, which can make the decision time longer
– the values are right for our system and might be different for yours
[Figure annotations: “Fewer rounds, each one is longer” (long timeouts); “More rounds, each one is shorter” (short timeouts).]
With their optimal timeouts, WLM is just 80ms worse
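The tradeoff can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that decision time is roughly the timeout multiplied by the number of rounds to decision; the function `best_timeout` is hypothetical and the numbers are made up for illustration, not taken from the measurements.

```python
def best_timeout(rounds_to_decision):
    """Pick the timeout that minimizes estimated decision time.

    rounds_to_decision: dict timeout_ms -> average number of rounds to decision.
    Decision time is estimated as timeout * rounds (an assumption).
    Returns (timeout_ms, estimated_decision_time_ms).
    """
    times = {t: t * r for t, r in rounds_to_decision.items()}
    t_best = min(times, key=times.get)
    return t_best, times[t_best]


# Made-up values: short timeouts need many rounds, long timeouts need few.
hypothetical = {100: 12.0, 150: 6.0, 200: 5.0, 300: 4.5}
choice = best_timeout(hypothetical)
```

In this made-up table the 100ms timeout wastes time on many unstable rounds and the 300ms timeout wastes time on waiting, so the minimum lies in between.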
Conclusions
WLM: a new timing model
New algorithm for WLM
– tolerates unbounded periods of asynchrony
– O(n) stable-state message complexity
– achieves global decision in 4 or 5 rounds
Thanks to the weak stability requirements, our algorithm performs better than, or comparably to, algorithms that take fewer rounds.
– even though the other algorithms send more messages (Ω(n^2))