Distributed Systems 2006: Overcoming Failures in a Distributed System (with material adapted from Ken Birman)


Slide 1: Overcoming Failures in a Distributed System (with material adapted from Ken Birman)

Slide 2: Leslie Lamport
“A distributed system is one in which the failure of a machine you have never heard of can cause your own machine to become unusable.”

Slide 3: Plan
- Goals
- Static and Dynamic Membership
- Logical Time
- Distributed Commit

Slide 4: Thought question
Suppose that a distributed system were built by interconnecting a set of extremely reliable components running on fault-tolerant hardware.
- Would such a system be expected to be reliable?
- Perhaps not. The pattern of interaction, the need to match rates of data production and consumption, and other “distributed” factors can all prevent a system from operating correctly!

Slide 5: Example (1)
The Web’s components are individually reliable.
- But the Web can fail by returning inconsistent or stale data, can freeze up or claim that a server is not responding (even when both browser and server are operational), and it can be so slow that we consider it faulty even when it is working.
- For stateful systems (the Web is stateless) this issue extends to the joint behavior of sets of programs.

Slide 6: Example (2)
Ariane 5
- June 4, 1996, 40 seconds after takeoff
- Self-destruction after an abrupt course correction
- “… caused by the complete loss of guidance and attitude information … due to specification and design errors in the software of the inertial reference system”
- A loss of $500 million, but no loss of life
Where are the distribution aspects?

Slide 7: Our Goal Here
We want to replicate data and computation
- for availability
- for performance
while guaranteeing consistent behavior.
We work towards “virtual synchronous communication”:
- the system appears to have no replicated data
- the system appears to have only multi-threaded concurrency

Slide 8: Synchronous and Asynchronous Executions
[Timeline figure: processes p, q, r]
In the synchronous model, processes share a synchronized clock, messages arrive on time, and failures are easily detected.
None of these properties holds in an asynchronous model.

Slide 9: Reality: Neither One
Real distributed systems aren’t synchronous
- although some can come close.
Nor are they asynchronous
- software often treats them as asynchronous
- in reality, clocks work well, so in practice we often use time cautiously and can even put limits on message delays.
For our purposes we usually start with an asynchronous model
- and subsequently enrich it with sources of time when useful.

Slide 10: Steps Towards Our Goal
Our tools form a stack, from the lowest layer up:
- 2PC and 3PC: our first “tools” (lowest layer)
- Tracking group membership: we’ll base it on 2PC and 3PC
- Fault-tolerant multicast: we’ll use membership
- Ordered multicast: we’ll base it on fault-tolerant multicast
- Tools for solving practical replication and availability problems: we’ll base them on ordered multicast
- Robust Web Services: we’ll build them with these tools

Slide 11: Membership
Which processes are available in a distributed system?
- Dynamic membership: use a group membership protocol to track members. Performant, but complicated.
- Static membership: use a static list of potential group members and resolve liveness on a per-operation basis. May be slow, but simpler.
(The approaches may be combined.)

Slide 12: Dynamic Membership
Provides a Group Membership Service (GMS)
- processes as members
- processes may join or leave the group and monitor other processes in the group
(More next time.)
Roughly “80,000 updates per second with 5 members”, versus static membership at “tens of updates per second with 5 members”.

Slide 13: Static Membership
Example
- A static set of potential members, e.g., {p, q, r, s, t}
- Support replicated data on the members, e.g., x: an integer value with a version history such as [t0, v0] -> [t21, v17] -> [t25, v97], …
- Each process records its version of x and its value of x
- Can p read a value by just looking at its own copy? No: its copy may be stale, since x may have been changed at other members.

Slide 14: Quorum Update and Read
Simple fix
- Make sure that operations reach a majority of processes in the system: update and read only if supported by a majority.
- A read of x is then sure to see the latest updated value: just take the reply with the largest version.
General fix: two basic rules
- A quorum read should intersect every prior quorum write in at least one process.
- Likewise, quorum writes should intersect prior quorum writes.
In a group of size N:
- Qr + Qw > N
- Qw + Qw > N
The example again, with N = 5:
- Qr = 3, Qw = 3
- Other possibilities? Note that the second rule forces Qw > N/2, so Qw = 1 will not do!
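The two quorum rules can be checked mechanically. A minimal sketch (not from the slides; the function name is illustrative):

```python
def valid_quorums(n, qr, qw):
    """A read quorum must intersect every prior write quorum, and
    write quorums must intersect each other (pigeonhole argument)."""
    return qr + qw > n and qw + qw > n

# The slide's choice for N = 5 works:
assert valid_quorums(5, 3, 3)
# Read-one/write-all is another legal choice:
assert valid_quorums(5, 1, 5)
# Qw = 2 fails for N = 5: two write quorums need not overlap.
assert not valid_quorums(5, 4, 2)
```

Any (Qr, Qw) pair passing this check guarantees that some process in every read quorum holds the latest committed version.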

Slide 15: Update Protocol
1) p issues RPC-style read requests to one replica after another
- p collects at least Qr replies
- p notes the versions (and values)
2) p computes the new version of the data
- larger than the maximum current version received
3) p issues RPCs to Qw members asking them to “prepare”
- processes reply to p
4) p checks the number of acknowledgements
- at least Qw -> “commit”, otherwise -> “abort”
(Actually a two-phase commit protocol (2PC) is used in steps 3 and 4; more later.)

Slide 16: Time
We were somewhat careful to avoid time in static membership. In a distributed system we need practical ways to deal with time:
- e.g., we may need to agree that update A occurred ‘before’ update B
- or offer a “lease” on a resource that expires ‘at’ time 10:10:01.50
- or guarantee that a time-critical event will reach all interested parties ‘within’ 100 ms

Slide 17: But what does Time “Mean”?
Time on a machine’s local clock?
- But was it set accurately?
- And could it drift, e.g., run fast or slow?
- What about faults, like stuck bits?
Time on a global clock?
- E.g., with a GPS receiver
- Still not accurate enough to determine which events happen before which other events
Or we could try to agree on time.

Slide 18: Lamport’s Approach
Leslie Lamport suggested that we should reduce time to its basics
- We cannot order events according to a global clock: none is available
- We can use a logical clock
Time basically becomes a way of labeling events so that we may ask whether event A happened before event B
- The answer should be consistent with what could have happened with respect to a global clock
- Often this is what matters

Slide 19: Drawing time-line pictures
[Timeline figure: process p sends message m (event sndp(m)); process q receives it (rcvq(m)), delivers it (delivq(m)), and later performs event D]

Slide 20: Drawing time-line pictures
A, B, C and D are “events”
- They could be anything meaningful to the application: microcode, program code, a file write, message handling, …
- So are snd(m), rcv(m) and deliv(m)
What ordering claims are meaningful?
[Timeline figure: events A, B, C, D together with sndp(m), rcvq(m) and delivq(m) on the time-lines of p and q]

Slide 21: Drawing time-line pictures
A happens before B, and C before D
- “Local ordering” at a single process
- Write A ->p B and C ->q D
[Timeline figure as before]

Slide 22: Drawing time-line pictures
sndp(m) also happens before rcvq(m)
- “Distributed ordering” introduced by a message
- Write sndp(m) ->m rcvq(m)
[Timeline figure as before]

Slide 23: Drawing time-line pictures
A happens before D
- Transitivity: A happens before sndp(m), which happens before rcvq(m), which happens before D
[Timeline figure as before]

Slide 24: Drawing time-line pictures
B and D are concurrent
- It looks like B happens first, but D has no way to know: no information flowed…
[Timeline figure as before]

Slide 25: The Happens-Before Relation
We’ll say that “A happens before B”, written A -> B, if
1) A ->p B according to the local ordering, or
2) A is a snd and B is a rcv and A ->m B, or
3) A and B are related under the transitive closure of rules 1) and 2)
Thus, A -> D in our picture.
So far, this is just a mathematical notation, not a “systems tool”:
- a new event seen by a process happens logically after the other events seen by that process
- a message receive happens logically after the message was sent
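The definition is directly computable: happens-before is reachability in the graph whose edges are the local orderings plus the send-to-receive edges. A small sketch, using the event names from the time-line pictures (the exact local order at each process is an assumption for illustration):

```python
# Local order at p: A -> snd_p(m) -> B; at q: rcv_q(m) -> deliv_q(m) -> D,
# with C before rcv_q(m); plus the message edge snd_p(m) -> rcv_q(m).
edges = {
    ("A", "snd_p(m)"), ("snd_p(m)", "B"),
    ("C", "rcv_q(m)"), ("rcv_q(m)", "deliv_q(m)"), ("deliv_q(m)", "D"),
    ("snd_p(m)", "rcv_q(m)"),
}

def happens_before(a, b):
    # Depth-first search over the edges computes the transitive closure.
    stack, seen = [a], set()
    while stack:
        e = stack.pop()
        if e == b:
            return True
        if e not in seen:
            seen.add(e)
            stack.extend(y for (x, y) in edges if x == e)
    return False

assert happens_before("A", "D")       # via snd_p(m) -> rcv_q(m), by transitivity
assert not happens_before("B", "D")   # concurrent: no path either way
assert not happens_before("D", "B")
```

Events with no path between them in either direction are exactly the concurrent ones.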

Slide 26: “Simultaneous” Actions
There are many situations in which we want to talk about some form of simultaneous event
- Think about updating replicated data: perhaps we have multiple conflicting updates
- The need is to ensure that they will happen in the same order at all copies
- This “looks” like a kind of simultaneous action
We want to know the states of a distributed system that might have occurred at an instant of real time.

Slide 27: Temporal distortions
Things can be complicated because we can’t predict
- message delays (they vary constantly)
- execution speeds (often a process shares a machine with many other tasks)
- the timing of external events
Lamport looked at this question too.

Slide 28: Temporal distortions
What does “now” mean?
[Timeline figure: processes p0, p1, p2, p3 with events a–f]


Slide 30: Temporal distortions
Timelines can “stretch”
- caused by scheduling effects, message delays, message loss, …
[Timeline figure: processes p0–p3 with events a–f]

Slide 31: Temporal distortions
Timelines can “shrink”
- e.g., something lets a machine speed up
[Timeline figure: processes p0–p3 with events a–f]

Slide 32: Temporal distortions
Cuts represent instants of time
- i.e., subsets of events, one per process
- e.g., {a, c} or {a, rcv(d), f, rcv(e)}
But not every “cut” makes sense
- the black cuts in the figure could occur, but not the gray ones
[Timeline figure: processes p0–p3 with events a–f and candidate cuts drawn across the time-lines]

Slide 33: Temporal distortions
In the figure, the red messages cross the gray cuts “backwards”
- We need to avoid capturing states in which a message is received but nobody is shown as having sent it
Consistent cuts
- If rcv(m) is in the cut, then snd(m) (or an earlier event) is in the cut
- snd(m) may be in the cut without rcv(m) being in the cut: m is then “in the message channel”
[Timeline figure: processes p0–p3 with events a–f]
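The consistency condition is a one-line check once we represent a cut as a set of events and each message as a (send, receive) pair. A minimal sketch (event names are illustrative):

```python
# Each message is a pair (snd_event, rcv_event).
messages = [("snd(m)", "rcv(m)")]

def consistent(cut):
    """A cut is consistent iff every receive in it has its send in it too."""
    return all(snd in cut for (snd, rcv) in messages if rcv in cut)

assert consistent({"snd(m)"})              # fine: m is "in the channel"
assert consistent({"snd(m)", "rcv(m)"})    # fine: both endpoints captured
assert not consistent({"rcv(m)"})          # a message from nobody
```

This is the test a snapshot algorithm must pass: the recorded global state may show messages in flight, but never a receive without its send.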

Slide 34: Who Cares?
Suppose
- p holds a lock
- m = “release lock”
- p sends m to q, so snd(m) -> rcv(m)
An inconsistent cut, e.g. {rcv(m)}, contains the receive but not the send
- it sees a state in which both p and q hold the lock

Slide 35: Logical clocks
A simple tool that can capture parts of the happens-before relation.
The first version uses just a single integer
- designed as a big (64-bit or larger) counter
- each process p maintains LTp, a local counter
- a message m will carry LTm

Slide 36: Rules for managing logical clocks
When an event happens at a process p, it increments LTp
- any event that matters to p
- normally also snd and rcv events (since we want a receive to occur “after” the matching send)
When p sends m, it sets
- LTm = LTp
When q receives m, it sets
- LTq = max(LTq, LTm) + 1
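These three rules fit in a few lines of code. A minimal sketch (class and method names are illustrative):

```python
class Process:
    """A process with a Lamport logical clock LTp."""
    def __init__(self):
        self.lt = 0
    def event(self):               # any local event that matters: LTp += 1
        self.lt += 1
        return self.lt
    def send(self):                # snd counts as an event; the message carries LTm
        return self.event()
    def recv(self, lt_m):          # LTq = max(LTq, LTm) + 1
        self.lt = max(self.lt, lt_m) + 1
        return self.lt

p, q = Process(), Process()
p.event()                          # LT(A) = 1
m = p.send()                       # LT(snd_p(m)) = LT(m) = 2
assert m == 2
assert q.recv(m) == 3              # max(0, 2) + 1 = 3
```

The receive rule is what pushes the receiver’s clock past the sender’s, so every message edge increases LT.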

Slide 37: Time-line with LT annotations
LT(A) = 1, LT(sndp(m)) = 2, LT(m) = 2
LT(rcvq(m)) = max(1, 2) + 1 = 3, etc.
[Timeline figure annotated with the successive values of LTp and LTq at each event]

Slide 38: Logical clocks
If A happens before B (A -> B), then LT(A) < LT(B)
- A -> B means there is a chain A = E0 -> … -> En = B, where each pair is ordered either by ->p or by ->m
- The LT values along such a chain only increase
But the converse might not be true:
- if LT(A) < LT(B), we can’t be sure that A -> B
- this is because processes that don’t communicate still assign timestamps, so their events will “seem” to have an order

Slide 39: Can we do better?
One option is to use vector clocks.
Here we treat timestamps as a list
- one counter for each process
The rules for managing vector times differ from what we did with logical clocks.

Slide 40: Vector clocks
The clock is a vector, e.g., VT(A) = [1, 0]
- We’ll just assign p index 0 and q index 1
- Vector clocks require either agreement on the numbering (static membership), or that the actual process ids be included with the vector
Rules for managing the vector clock
- When an event happens at p, increment VTp[indexp]; normally, also increment for snd and rcv events
- When sending a message, set VT(m) = VTp
- When receiving, set VTq = max(VTq, VT(m)), where “max” is taken component-wise
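A minimal sketch of these rules for two processes, with p at index 0 and q at index 1 as on the slide (class and method names are illustrative; whether a receive also counts as a local event is a design choice, made here as the slide suggests):

```python
class VProc:
    """A process with a vector clock, under static numbering."""
    def __init__(self, index, n=2):
        self.i, self.vt = index, [0] * n
    def event(self):               # local event: increment own component
        self.vt[self.i] += 1
        return list(self.vt)
    def send(self):                # snd counts as an event; message carries VT
        return self.event()
    def recv(self, vt_m):          # component-wise max, then count the rcv event
        self.vt = [max(a, b) for a, b in zip(self.vt, vt_m)]
        return self.event()

p, q = VProc(0), VProc(1)
p.event()                          # VT(A) = [1, 0]
vt_m = p.send()                    # VT(m) = [2, 0]
assert vt_m == [2, 0]
assert q.recv(vt_m) == [2, 1]      # q now knows of two events at p, one at q
```

Each component VTp[i] records how many events of process i the holder currently knows about.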

Slide 41: Time-line with VT annotations
VT(m) = [2, 0]
- It could also be [1, 0] if we decide not to increment the clock on a snd event; the decision depends on how the timestamps will be used.
[Timeline figure annotated with the successive values of VTp and VTq at each event]

Slide 42: Rules for comparison of VTs
We’ll say that VTA ≤ VTB if
- for all i, VTA[i] ≤ VTB[i]
And we’ll say that VTA < VTB if
- VTA ≤ VTB but VTA ≠ VTB
- that is, for some i, VTA[i] < VTB[i]
Examples
- [2,4] ≤ [2,4]
- [1,3] < [7,3]
- [1,3] is “incomparable” to [3,1]
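The comparison rules, with the slide’s examples as checks (function names are illustrative):

```python
def vt_leq(a, b):
    """VTA <= VTB iff every component of a is <= the same component of b."""
    return all(x <= y for x, y in zip(a, b))

def vt_lt(a, b):
    """Strictly less: <= holds and the vectors differ somewhere."""
    return vt_leq(a, b) and a != b

assert vt_leq([2, 4], [2, 4])
assert vt_lt([1, 3], [7, 3])
# "Incomparable": neither direction of <= holds.
assert not vt_leq([1, 3], [3, 1]) and not vt_leq([3, 1], [1, 3])
```

Incomparability is what makes vector clocks strictly stronger than Lamport clocks: it is the signature of concurrent events.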

Slide 43: Time-line with VT annotations
VT(A) = [1,0] and VT(D) = [2,4], so VT(A) < VT(D)
VT(B) = [3,0], so VT(B) and VT(D) are incomparable
[Timeline figure annotated with VTp and VTq; VT(m) = [2,0]]

Slide 44: Vector time and happens before
If A -> B, then VT(A) < VT(B)
- Write out a chain of events from A to B: step by step the vector clocks only get larger
But also, if VT(A) < VT(B), then A -> B. Two cases:
- If A and B both happen at the same process p, every event seen by p increments its vector clock
- If A happens at p and B at q, we can trace back the path by which q “learned” VT(A)[p], since q only updates its p-component on receipt of a message from some process q'; if q' ≠ p, trace further back
(Otherwise A and B happened concurrently.)

Slide 45: Introducing “wall clock time”
There are several options
- “Extend” a logical clock or vector clock with the wall-clock time and use it to break ties
- This permits meaningful statements like “B and D were concurrent, although B occurred first”
- But unless clocks are closely synchronized, such statements could be erroneous!
- We use a clock synchronization algorithm to reconcile differences between the clocks on the various computers in the network

Slide 46: Synchronizing clocks
Without help, clocks will often differ by many milliseconds
- The problem is that when a machine downloads the time from a network clock, it can’t be sure what the delay was
- This is because the “uplink” and “downlink” delays are often very different in a network
Outright failures of clocks are rare.

Slide 47: Synchronizing clocks
Suppose p asks time.windows.com “What time is it?”, gets the reply “09:23.02921”, and notes that 123 ms elapsed while the protocol was running. What time is it now?

Slide 48: Synchronizing clocks
Options?
- p could guess that the delay was evenly split, but this is rarely the case in WAN settings (downlink speeds are higher)
- p could ignore the delay
- p could factor in only the “certain” delay, e.g., if we know that the link takes at least 5 ms in each direction. This works best with GPS time sources!
In general we can’t do better than the uncertainty in the link delay from the time source down to p.
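The “evenly split” guess and its error bound can be written down directly, in the style of Cristian’s algorithm (a sketch; the function and parameter names are illustrative, and times are plain milliseconds rather than real timestamps):

```python
def synchronize(server_time_ms, round_trip_ms, min_one_way_ms=0):
    """Estimate the current time from a server reading and the measured
    round trip; min_one_way_ms is the known lower bound on link delay."""
    # Best guess: the server's reading is half a round trip old.
    estimate = server_time_ms + round_trip_ms / 2
    # All we really know is that each direction took at least
    # min_one_way_ms, so this much uncertainty remains.
    uncertainty = round_trip_ms / 2 - min_one_way_ms
    return estimate, uncertainty

# The slide's scenario: 123 ms elapsed, links take at least 5 ms each way.
est, err = synchronize(server_time_ms=0, round_trip_ms=123, min_one_way_ms=5)
assert est == 61.5 and err == 56.5
```

Note how the known minimum delay shrinks the error bound, which is why GPS-style sources with tightly characterized links synchronize so well.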

Slide 49: Consequences?
In a network of processes, we must assume that clocks are
- not perfectly synchronized; even GPS has (small) uncertainty
- we say that clocks are “inaccurate” (with respect to real time)
- and that clocks can drift during the periods between synchronizations
- the relative drift between clocks is their “precision” (with respect to each other)

Slide 50: Thought question
We are building an anti-missile system
- Radar tells the interceptor where it should be and what time to get there
- Do we want the radar and interceptor to be as accurate as possible, or as precise as possible?

Slide 51: Thought question
We want them to agree on the time, but it isn’t important whether they are accurate with respect to “true” time
- “Precision” matters more than “accuracy” here
- Although for this application, a GPS time source would be the way to go: it might achieve higher precision than we can with an “internal” synchronization protocol!

Slide 52: Transactions in distributed systems
A client and a database might not run on the same computer
- The two may not fail at the same time
- Also, either could time out waiting for the other in normal situations
When this happens, we normally abort the transaction
- The exception is a timeout that occurs while the commit is being processed
- If the server fails, one effect of the crash is to break locks, even for read-only access

Slide 53: Transactions in distributed systems
What if the data is on multiple servers?
- In a networked system, transactions run against a single database system; indeed, many systems are structured to use just a single operation, a “one-shot” transaction!
- In true distributed systems we may want one application to talk to multiple databases
- The main issue that arises is that multiple database servers can now be touched by one transaction
Reasons?
- Data spread around: each server owns a subset
- Some data objects may be replicated on multiple servers, e.g., to load-balance read access for a large client set
- We might also replicate for high availability
We solve this using the two-phase commit (2PC) protocol!

Slide 54: Two-phase commit in transactions
Phase 1
- The transaction wishes to commit. Data managers force updates and lock records to the disk (e.g., to the log) and then say “prepared to commit”
Phase 2
- The transaction manager makes sure all are prepared, then says “commit” (or “abort”, if some are not)
- Data managers then make the updates permanent, or roll back to the old values, and release their locks
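The decision logic of the two phases is tiny once the durable machinery (logs, locks, RPCs) is abstracted away. A minimal sketch, with a callback standing in for each data manager’s prepare step (all names are illustrative):

```python
def two_phase_commit(members, prepare):
    """prepare(m) forces m's updates to its log, holds m's locks, and
    returns True iff m votes 'prepared to commit'."""
    # Phase 1: collect votes from every data manager.
    votes = [prepare(m) for m in members]
    # Phase 2: commit only if all voted to commit; otherwise abort.
    return "commit" if all(votes) else "abort"

assert two_phase_commit(["p", "q", "r", "s", "t"], lambda m: True) == "commit"
assert two_phase_commit(["p", "q", "r"], lambda m: m != "q") == "abort"
```

Everything interesting about 2PC lies in what this sketch omits: votes must be durable before they are sent, which is exactly what creates the blocking “bad state” discussed below.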

Slide 55: As a time-line picture
[Timeline figure: the 2PC initiator asks p, q, r, s, t “Vote?”; all vote “commit”; the initiator then announces “Commit!”]

Slide 56: As a time-line picture
[Timeline figure: the 2PC initiator asks p, q, r, s, t “Vote?” and all vote “commit” (Phase 1); the initiator announces “Commit!” (Phase 2)]

Slide 57: Missing Stuff
Eventually we will need to do some form of garbage collection
- The issue is that participants need a memory of the protocol, at least for a while
- But we can delay garbage collection and run it later on behalf of many protocol instances
This is part of any real implementation, but is not usually thought of as part of the protocol.

Slide 58: Fault tolerance
We can separate this into three cases
- A group member fails; the initiator remains healthy
- The initiator fails; the group members remain healthy
- Both the initiator and a group member fail
Further separation
- Handling recovery of a failed member
- Recovery after a “total” failure of the whole group

Slide 59: Fault tolerance
Some cases are pretty easy
- E.g., if a member fails before voting, we just treat it as an abort
- If a member fails after voting commit, we assume that when it recovers it will finish up the commit and perform whatever action we requested
The hard cases involve a crash of the initiator.

Slide 60: Initiator fails, members healthy
When did it fail?
- It could fail before starting the 2PC protocol. In this case, if the members were expecting the protocol to run, e.g., to terminate a pending transaction on a database, they do a “unilateral abort”
- It could fail after some members are prepared to commit. Those members need to learn the outcome before they can “finish” the protocol
- It could fail after some members have learned the outcome. Others may still be in a prepared state

Slide 61: How to handle initiator failures?
Wait for the initiator to come up again?
- This may hold resources on the members.
Rather
- The initiator should record its decision in a logging server for use after crashes. If the decision is logged, a process may learn the outcome by examining the log when the initiator fails (a timeout is needed here)
- Also, members can help one another terminate the protocol. This is needed if a failure happens before the initiator has a chance to log its decision; a member may repeat phase 1

Slide 62: Problems?
2PC has a “bad state”
- Suppose that the initiator and a member, p, both fail and we are not using a log (we may not always want to use a log because of the extra overhead and reliability concerns)
- The other members cannot determine whether to commit or abort. p may have transferred $10M to a bank account; we want to be consistent with that
- There is a case in which we can’t terminate the protocol!

Slide 63: As a time-line picture
[Timeline figure: the 2PC initiator asks p, q, r, s, t “Vote?” and all vote “commit” (Phase 1); the initiator announces “Commit!” (Phase 2)]

Slide 64: Can we do Better?
Three-phase commit (3PC)
- Assumes detectable failures. (We happen to know that real systems can’t detect failures, unless they can unplug the power of a faulty node)
- The idea is to add an extra “prepared to commit” stage

Slide 65: 3PC
[Timeline figure: the 3PC initiator asks p, q, r, s, t “Vote?” and all vote “commit” (Phase 1); the initiator sends “Prepare to commit” and all say “ok” (Phase 2); the initiator sends “Commit!” and they commit (Phase 3)]

Slide 66: Why 3PC?
A “new leader” in the group can deduce the outcome when this protocol is used.
Main insight?
- In 2PC, the decision to commit can be known by only the initiator and one other process
- In 3PC, nobody can enter the commit state unless all are first in the prepared state
- This makes it possible to determine the state, and then push the protocol forward (or back)
But it does require accurate failure detection
- Commit only if all operational members are in the “prepared to commit” state; abort if all operational members are in the “ok to commit” state
- Failed processes may learn the outcome when they become operational again
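The recovery rule a new leader applies can be sketched as a function of the states of the operational members (a sketch under the slide's assumption of accurate failure detection; state names are illustrative):

```python
def new_leader_decides(states):
    """Deduce the outcome from the states of the operational members.
    Invariant used: nobody enters 'committed' unless ALL members first
    reached 'prepared', so if any member is still 'ok-to-commit',
    no member can have committed yet."""
    if all(s in ("prepared", "committed") for s in states):
        return "commit"     # safe: everyone is past the prepared barrier
    return "abort"          # safe: nobody can have committed yet

assert new_leader_decides(["prepared", "prepared", "committed"]) == "commit"
assert new_leader_decides(["ok-to-commit", "ok-to-commit"]) == "abort"
```

With 2PC no such function exists: a crashed pair (initiator plus one member) may be the only holders of the decision, which is exactly the blocking case above.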

Slide 67: Value of 3PC?
Even with inaccurate failure detection, it greatly reduces the window of vulnerability
- The bad case for 2PC is not so uncommon, especially if a group member is the initiator; in that case one badly timed failure freezes the whole group
- With 3PC in real systems, the troublesome case becomes very unlikely
But problems remain
- e.g., in a network partition, one half may be “prepared to commit” while the other half is “ok to commit”

Slide 68: State diagram for a non-faulty member
[State diagram, with the annotations:]
- The protocol starts in the initial state; the initiator sends the “OK to commit?” inquiry
- We collect responses; if any is an abort, we enter the abort state
- Otherwise we send the prepare-to-commit messages out
- A coordinator failure sends us into an inquiry mode in which someone (anyone) tries to figure out the situation
- The commit state corresponds to the coordinator sending out the commit messages; we enter it when all members have received them
- In one recovery case we “finish off” the prepare stage if a crash interrupted it, by resending the prepare message (needed in case only some processes saw the coordinator’s message before it crashed)
- In the other recovery case some processes were still in the initial “OK to commit?” stage; here it is safe to abort, and we do so

Slide 69: Summary
We looked at goals and prerequisites for consistent replication
- Static and Dynamic Membership
- Logical Time
- Distributed Commit

