CPSC 668 Set 17: Fault-Tolerant Register Simulations. CPSC 668 Distributed Algorithms and Systems, Fall 2009, Prof. Jennifer Welch.


1 CPSC 668 Distributed Algorithms and Systems, Fall 2009, Prof. Jennifer Welch

2 Fault-Tolerant Shared Memory Simulations: The previous algorithms implemented a shared variable on top of message passing, assuming no failures. What if some processors might crash? Can we still provide a shared read/write variable on top of message passing? Yes, even in an asynchronous system, if we have enough nonfaulty processors. First, we must specify a failure-prone shared memory.

3 Specification of f-Resilient Shared Memory: Inputs are invocations on the shared object. Outputs are responses of the shared object. A sequence of inputs and outputs is allowable iff: –there is a partitioning of proc. indices into "faulty" and "nonfaulty" –Correct Interaction: each proc. alternates invocations and matching responses –Nonfaulty Liveness: every invocation by a nonfaulty proc. has a matching response –Extended Linearizability: linearizability holds for all the completed operations and some subset of the pending operations

4 Assumptions for Algorithm: Each read/write variable ("register") to be simulated has –one reader and –one writer –(next topic will be to build more powerful variables out of these). There are n procs. which are cooperating to simulate a collection of such variables. The underlying communication system is asynchronous message passing. n > 2f (less than half the processors can crash).

5 Main Ideas of Algorithm: Each simulated register has a replica stored at each of the n procs., not just at the designated reader and writer of that register. Use the redundant storage to provide fault-tolerance. Describe the algorithm just for one simulated register; use a separate copy of the same algorithm in parallel for each simulated register.

6 Writing the Simulated Register: generate the next sequence number; send a message with the value and the sequence number to all the procs. –each recipient updates its local copy of the register; wait to get back an ack from > n/2 procs. –safe since n - f > n/2; then do the ack for the write.

7 Reading the Simulated Register: send a request to all the procs. –each recipient sends back the current value of its replica; wait to get a reply from > n/2 procs.; return the value associated with the largest sequence number.
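
The write and read procedures on slides 6 and 7 can be sketched as a small sequential Python model. This is only an illustration under stated assumptions: the class name `QuorumRegister`, the `responders` parameter (the set of procs. whose acks or replies arrive in time), and the use of an exception to stand in for "would block" are all hypothetical, and real message passing, concurrency, and the link-status mechanism are not modeled.

```python
class QuorumRegister:
    """Sequential sketch of one simulated register replicated at n procs."""

    def __init__(self, n, initial=0):
        self.n = n
        # One (sequence number, value) replica per proc.
        self.replicas = [(0, initial)] * n
        self.seq = 0  # writer's local sequence counter

    def _require_majority(self, responders):
        # Hypothetical stand-in for "wait for > n/2 procs.".
        if len(set(responders)) <= self.n / 2:
            raise RuntimeError("no majority responded; operation would block")

    def write(self, value, responders):
        self._require_majority(responders)
        self.seq += 1
        for p in set(responders):
            # Each recipient keeps the pair with the larger sequence number.
            if self.seq > self.replicas[p][0]:
                self.replicas[p] = (self.seq, value)

    def read(self, responders):
        self._require_majority(responders)
        replies = [self.replicas[p] for p in set(responders)]
        # Return the value associated with the largest sequence number.
        return max(replies, key=lambda pair: pair[0])[1]
```

For example, with n = 5, a write acknowledged by procs. {0, 1, 2} is visible to a read that hears from {2, 3, 4}, because the two majorities share proc. 2.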

8 Key Idea for Correctness: Each read should return the value of "the most recent" write. Each read or write communicates with > n/2 procs., so the set of procs. participating in operation $O_1$ is guaranteed to intersect with the set of procs. participating in any other operation $O_2$.
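
The intersection claim can be checked exhaustively for small n. The sketch below is illustrative only; the function name `quorums_intersect` is hypothetical. It enumerates every pair of quorums of a given size and tests whether they share a proc.

```python
from itertools import combinations


def quorums_intersect(n, size):
    """Return True iff every two size-`size` subsets of n procs. intersect."""
    quorums = [set(c) for c in combinations(range(n), size)]
    return all(a & b for a in quorums for b in quorums)


def smallest_majority(n):
    """Smallest quorum size that is strictly greater than n/2."""
    return n // 2 + 1
```

With n = 4, quorums of size 3 always intersect, while quorums of size 2 (exactly n/2) do not: {0, 1} and {2, 3} are disjoint. This is the same split the lower bound proof exploits.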

9 But What About Asynchrony? The underlying communication system is asynchronous: –a message on behalf of one operation could be overtaken by a message on behalf of a later operation. Avoid such problems by adding an additional mechanism to the algorithm: –reader and writer keep track of the "status" of each link –don't send a msg on a link until the ack from the previous msg has been received

10 Outline of Correctness Proof: The interesting part is proving linearizability. Let $ts(W)$ = sequence number of $W$. Let $ts(R)$ = sequence number of the write that $R$ reads from. Let $O_1 \to O_2$ denote that $O_1$ finishes before $O_2$ starts. Key lemmas: If $W_1 \to W_2$, then $ts(W_1) < ts(W_2)$. If $W \to R$, then $ts(W) \le ts(R)$. If $R \to W$, then $ts(R) < ts(W)$. If $R_1 \to R_2$, then $ts(R_1) \le ts(R_2)$.

11 Matching Lower Bound on Resiliency: Theorem (10.22): No simulation of a 1-reader, 1-writer read/write linearizable register using n procs and asynchronous message passing can tolerate f ≥ n/2 crash failures. Proof: Suppose in contradiction there is an algorithm A that tolerates f = n/2 crashes and simulates a 1-reader, 1-writer linearizable register on top of asynchronous message passing.

12 Lower Bound Proof: Partition the procs into two sets, $S_0$ and $S_1$, each of size f. Let $\alpha_0$ be an admissible exec. of A s.t. –the initial value of the simulated register is 0 –all procs. in $S_1$ crash initially –proc. $p_0$ in $S_0$ invokes write(1) at time 0 and no other operations are invoked –the write completes at some time $t_0$ without any proc in $S_0$ receiving a message from any proc in $S_1$: must happen since A is supposed to tolerate f failures.

13 [Figure: execution $\alpha_0$, showing the sets $S_0$ (containing $p_0$) and $S_1$]

14 Lower Bound Proof: Let $\alpha_1$ be an admissible exec. of A s.t. –the initial value of the simulated register is 0 –all procs. in $S_0$ crash initially –proc. $p_1$ in $S_1$ invokes a read at time $t_0 + 1$ and no other operations are invoked –the read completes at some time $t_1$ without any proc. in $S_1$ receiving a message from any proc. in $S_0$: must happen since A is supposed to tolerate f failures –the read returns 0: must be since A guarantees linearizability

15 [Figure: execution $\alpha_1$, showing $p_1$]

16 Lower Bound Proof: Now create an admissible execution $\beta$ by "merging" the views of procs in $S_0$ from $\alpha_0$ and the views of procs in $S_1$ from $\alpha_1$: –messages that go between $S_0$ and $S_1$ are delayed so that they don't arrive until after time $t_1$. $\beta$ is not linearizable, since read(0) follows write(1). Contradiction.

17 [Figure: executions $\alpha_0$ (with $S_0$, $S_1$, $p_0$) and $\alpha_1$ (with $p_1$) merged into $\beta$ (with $p_0$ and $p_1$); messages between the two sets are delayed until after $t_1$]

18 Lower Bound Diagram for n = 2 [Figure: timelines for $p_0$ and $p_1$ over times 0, $t_0$, $t_0 + 1$, $t_1$ in three executions: $\alpha_0$ with write(1), $\alpha_1$ with read(0), and $\beta$ with write(1) followed by read(0)]

19 Simulating R/W Registers Using R/W Registers: The previous algorithm showed how to simulate a 1-reader, 1-writer register on top of message passing. How can we get more powerful (flexible) registers, i.e., with –more readers –more writers? We'll start with a warm-up: –simulate a multi-valued register using binary-valued registers –1-reader and 1-writer

20 Wait-Free Register Simulations: Asynchronous model. Linearizable shared registers. Wait-free –tolerate any number of crash failures. We want to simulate one kind of (n-1)-resilient shared memory with another kind of (n-1)-resilient memory –recall the earlier definition of f-resilient shared memory –recall the earlier definition of one kind of communication system simulating another

21 Alternative Definition of Wait-Free Simulation: Alternative definition for the wait-free case: The failure-free version of one communication system simulates the failure-free version of the other, and for any prefix of an admissible execution of the simulation algorithm in which $p_i$ has a pending operation, there is an extension in which the operation completes and only $p_i$ takes steps. Equivalent to the previous definition, sometimes more convenient.

22 Proving Linearizability: We've seen one approach: –explicitly construct a permutation and prove that it has the desired properties. Alternative approach: –identify a time point for each operation, between invocation and response: linearization points –linearization points give the permutation –obviously real-time order is preserved –just need to show that legality holds

23 Overview of Register Simulations [Figure: chain of simulations among four register types: single-reader single-writer binary-valued; single-reader single-writer multi-valued; multi-reader single-writer multi-valued; multi-reader multi-writer multi-valued]

24 Multi-Valued From Binary: Some ideas… Use a different binary register to store each bit of the multi-valued register being simulated. Read algorithm is to read all the binary registers and return the resulting value. Write algorithm is to write the new bits in some order. Difficulties arise if the reader overlaps a slow write and sees some new bits and some old bits.

25 A Unary Approach: Suppose the simulated register is to take on the values {0,…,K-1}. Use an array of K binary registers, B[0..K-1] –represent value v by having B[v] = 1 and the other entries 0. Read algorithm: read B[0], B[1],…, until finding the first 1; return the index. Write algorithm: zero out the old entry of B and set the new entry.

26 Problems with Unary Approach: OK if reads and writes don't overlap. If they do, have to worry about –reader never finding a 1 in B –new-old inversion: writer writes 1, then 2, but reader reads 2, then 1. Counter-example execution on next slide –since binary registers are linearizable, we just mark the linearization points of the reads and writes on the binary registers

27 Counter-Example: Initially B[0] = B[1] = B[2] = 0 and B[3] = 1. [Figure: write 1 consists of "write 1 to B[1]" and "write 0 to B[3]"; write 2 consists of "write 1 to B[2]" and "write 0 to B[1]"; the first read performs "read 0 from B[0]", "read 0 from B[1]", "read 1 from B[2]" and returns 2; the second read performs "read 0 from B[0]", "read 1 from B[1]" and returns 1]
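
The counter-example can be replayed step by step. The sketch below is an assumption-laden reconstruction: the slide lists the low-level operations, but the exact interleaving chosen here is one ordering consistent with them, and the function name `replay_counter_example` is hypothetical.

```python
def replay_counter_example():
    """Replay one interleaving in which reads return 2 and then 1."""
    B = [0, 0, 0, 1]  # the simulated register initially holds 3

    # First high-level read starts its upward scan.
    first_scan = [B[0], B[1]]  # reads 0 from B[0], 0 from B[1]

    # High-level write(1): set the new entry, then clear the old entry.
    B[1] = 1
    B[3] = 0

    # High-level write(2) begins: set the new entry.
    B[2] = 1

    # First read continues and finds a 1 in B[2].
    first_scan.append(B[2])
    first_result = first_scan.index(1)

    # Second high-level read runs while write(2) is still pending.
    second_scan = [B[0], B[1]]  # reads 0 from B[0], 1 from B[1]
    second_result = second_scan.index(1)

    # write(2) finishes by clearing the old entry.
    B[1] = 0

    return first_result, second_result
```

The first read returns the newer value 2 and the later read returns the older value 1, which is the new-old inversion.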

28 Corrected Multi-Valued Algorithm: To prevent "falling off the edge" of the end of B without finding a 1, the write algorithm only clears (sets to 0) entries that are smaller than the entry that is set (to 1). To prevent new-old inversions, the read algorithm scans up to find the first 1, and then scans down to make sure those entries are still 0. –returns the smallest value associated with a 1 entry in B that is observed during the downward scan
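
A minimal sketch of the corrected reader and writer follows. It assumes the writer sets B[v] first and then clears the smaller entries from v-1 down to 0; that ordering, and the names `mv_write` and `mv_read`, are assumptions for illustration and are not spelled out on the slide. The list `B` stands in for the K binary registers, and the code is exercised sequentially only.

```python
def mv_write(B, v):
    """Write value v: set B[v], then clear only the entries below v."""
    B[v] = 1
    for i in range(v - 1, -1, -1):
        B[i] = 0


def mv_read(B):
    """Scan up to the first 1, then scan down and keep the smallest 1 seen."""
    i = 0
    while B[i] == 0:  # upward scan; some entry always holds a 1
        i += 1
    v = i
    for j in range(i - 1, -1, -1):  # downward scan over the entries below
        if B[j] == 1:
            v = j
    return v
```

Because the writer never clears entries above the one it sets, stale 1s may remain at higher indices (for example `[0, 1, 0, 1]` after writing 1 over 3); the reader ignores them since it stops its upward scan at the first 1.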

29 Multi-Valued Construction [Figure: the reader runs the reader alg. and the writer runs the writer alg.; both access binary registers B[0], ..., B[K-1], each holding 0/1, through low-level reads and writes]

30 Algorithm is Wait-Free: Algorithm for writer does not involve any waiting: just do at most K (low-level) writes. Algorithm for reader does not involve any waiting: just do at most 2K-1 (low-level) reads.

31 Algorithm Ensures Linearizability: Describe an ordering of the (high-level) operations that is obviously legal (by the definition of the ordering). Then show that it respects the real-time ordering of non-overlapping operations. Fix any admissible execution of the algorithm. Fix any linearization of the low-level operations (on the binary registers) –exists since the execution is admissible, which implies the underlying communication system (the binary registers) behaves properly (is linearizable)

32 Reads-From Relations: Low-level read r on a binary register B[v] reads from low-level write w on the register if w is the latest write to B[v] that precedes r in the linearization of the low-level operations. High-level read R on the simulated multi-valued register reads from high-level write W on the register if R returns v and W contains the low-level write that R's last read of B[v] reads from.

33 Reads-From Diagram [Figure: high-level "write 1" consisting of "write 1 to B[1]" and "write 0 to B[0]"; high-level "read 1" consisting of "read 0 from B[0]", "read 1 from B[1]", "read 0 from B[0]"; arrows mark the low-level reads-from relationships and the high-level reads-from relationship]

34 Construct Permutation: Place all (high-level) writes in the order in which they occur –no concurrent writes. Consider each (high-level) read in the order in which they occur –no concurrent reads. Suppose read R reads from write W. Place R immediately before the write that follows W in the permutation.

35 Correctness of Permutation: Permutation is legal by construction –each read is placed after the write that it reads from. Why does it preserve the order of non-overlapping operations? –two writes: by construction –a read that precedes a write in the execution: OK, since the read cannot read from a later write.

36 Correctness of Permutation: Lemma (10.1): Suppose –(high-level) read R returns v, –R reads B[u], with u < v, during its upward scan, and –this read of B[u] reads from a (low-level) write contained in high-level write $W_1$. Then R reads from a write that follows $W_1$.

37 Figure for Lemma 10.1 [Figure: high-level "write w" containing "write 1 to B[w]" and "write 0 to B[u]"; high-level "write v" containing "write 1 to B[v]"; high-level "read v" containing "read 0 from B[u]" during the upward scan (u < v) and "read 1 from B[v]" at the top of the upward scan or during the downward scan; arrows mark the low-level reads-from relationships and the high-level reads-from relationship]

38 Correctness of Permutation: Two cases remain to show that real-time order of non-overlapping operations is preserved: –a write that precedes a read in the execution –two reads. The proofs of both cases are by contradiction, showing that there is a situation that violates Lemma 10.1.

39 Multi-Reader from Single-Reader: First consider a simple idea: Use a different single-reader register for each reader (Val[1],…,Val[n]). –n is the number of readers. Write algorithm: write the new value in each of the single-reader registers. Read algorithm: read your own single-reader register and return that value.

40 Counter-Example: Suppose 0 is the initial value of the multi-reader register. Suppose n = 2. [Figure: writer $p_w$ performs "write 1" as "write 1 to Val[1]" followed by "write 1 to Val[2]"; reader $p_1$ performs "read 1 from Val[1]" and returns 1; reader $p_2$ performs "read 0 from Val[2]" and returns 0]

41 New Idea for Correct Algorithm: Have the multi-reader algorithm write some information to the single-reader registers to prevent new-old inversions on the simulated register. This is provably necessary…

42 Readers Must Write: Theorem (10.3): In any wait-free simulation of a multi-reader single-writer register from single-reader single-writer registers, at least one reader must write. Proof: Suppose in contradiction there is an algorithm in which readers never write.

43 Readers Must Write: $p_w$ is the writer, $p_1$ and $p_2$ are the readers. The initial value of the simulated register is 0. $S_1$ is the set of single-reader registers that are read by $p_1$. $S_2$ is the set of single-reader registers that are read by $p_2$.

44 Readers Must Write: Consider an execution in which $p_w$ writes 1 to the simulated register. The write algorithm performs a series of writes, $w_1, \dots, w_k$, to the single-reader registers. Each $w_j$ is a write to a register in either $S_1$ or $S_2$. Let $v_j^i$ be the value that would be returned if $p_i$ were to do a read immediately after $w_j$.

45 Readers Must Write [Figure: $p_w$ performs "write 1" as the sequence of low-level writes $w_1, \dots, w_j, w_{j+1}, \dots, w_k$; a read by $p_i$ immediately after $w_j$ returns $v_j^i$]

46 Readers Must Write: For each reader ($p_1$ and $p_2$), there is a point when the writes $w_1, \dots, w_k$ cause the value of the simulated register, as it would be observed by that reader, to "switch" from 0 (old) to 1 (new). For $p_1$: $v_1^1 = v_2^1 = \dots = v_{a-1}^1 = 0$ and $v_a^1 = \dots = v_k^1 = 1$. For $p_2$: $v_1^2 = v_2^2 = \dots = v_{b-1}^2 = 0$ and $v_b^2 = \dots = v_k^2 = 1$.

47 Readers Must Write: Why must a and b be different? a marks the point when $p_1$'s view of the simulated register's current value changes from old to new. So $w_a$ must write to a register in $S_1$. Similarly, $w_b$ must write to a register in $S_2$. W.l.o.g., assume a < b.

48 Readers Must Write [Figure: $p_w$ performs "write 1" as the low-level writes $w_1, \dots, w_a, w_{a+1}, \dots, w_k$; after $w_a$, $p_1$ reads $v_a^1 = 1$ and $p_2$ reads $v_a^2 = 0$]

49 Readers Must Write: Where did we use the assumption in this proof that readers don't write? The writer doing the slow write of 1 is oblivious to whether any readers are concurrently reading. The readers are oblivious to each other.

50 Corrected Multi-Reader Algorithm: As part of the algorithm for the read on the simulated register, announce the value to be returned. Before deciding what value to return, check what values have been returned by previous reads and don't pick anything earlier. Need timestamps to be able to determine the relative age of returned values. Reader $p_i$ uses row i of a matrix to report its most recently returned value to all the other readers (remember, we only have single-reader variables at our disposal).

51 Writer's Algorithm: get the next sequence number –use integers that are increased by one each time; write the value and sequence number to Val[1],…,Val[n] (one copy for each reader)

52 Reader $p_i$'s Algorithm: read the value and timestamp written by the writer to Val[i]; read the value and timestamp written by each reader j to Report[j,i]; choose the value-timestamp pair with the largest timestamp; write that pair to row i of Report; return the value associated with that pair
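
The writer and reader algorithms on slides 51 and 52 can be sketched as follows. This is a sequential illustration under assumptions: the class name `MultiReaderRegister`, the 0-based indexing, and the generator `write_steps` (used to pause a write between its low-level writes so a caller can interleave reads) are hypothetical, and plain Python lists stand in for the single-reader registers.

```python
class MultiReaderRegister:
    """Sketch of a multi-reader register built from single-reader cells."""

    def __init__(self, n, initial=0):
        self.n = n
        self.seq = 0
        # val[i]: written by the writer, read only by reader i.
        self.val = [(0, initial)] * n
        # report[i][j]: written by reader i, read only by reader j.
        self.report = [[(0, initial)] * n for _ in range(n)]

    def write_steps(self, value):
        """Perform the write one low-level write at a time."""
        self.seq += 1
        for i in range(self.n):
            self.val[i] = (self.seq, value)
            yield  # pause so reads can be interleaved

    def write(self, value):
        for _ in self.write_steps(value):
            pass

    def read(self, i):
        # Gather the writer's pair and every reader's reported pair.
        candidates = [self.val[i]]
        candidates += [self.report[j][i] for j in range(self.n)]
        best = max(candidates, key=lambda pair: pair[0])
        # Announce the chosen pair to all readers before returning.
        for j in range(self.n):
            self.report[i][j] = best
        return best[1]
```

Replaying the earlier counter-example with n = 2: if the write of 1 is paused after updating only the first reader's copy, the first reader returns 1 and reports it, so a subsequent read by the second reader also returns 1 even though its own copy still holds 0.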

53 Multi-Reader Construction [Figure: 3 readers and one writer; the writer alg. writes to Val; each reader alg. reads from Val and both reads from and writes to Report]

54 Correctness of Multi-Reader Algorithm: Wait-free –writer does n low-level writes –reader does n+1 low-level reads and n low-level writes. To prove linearizability, explicitly construct a permutation of the high-level operations that is clearly legal and then prove it preserves the real-time order of non-overlapping operations.

55 Constructing the Permutation: Put in all writes in the order in which they occur in the execution –since single-writer, writes do not overlap. Consider the reads in the order of their responses in the execution. –read R reads from write W if W generates the timestamp associated with the value R returns –place R immediately before the write that follows W. By construction, the permutation is legal.

56 Preserving Real-Time Order: write-write: by construction of the permutation. read-write: Suppose R precedes W in the execution. Then R cannot read from W or any succeeding write, so R is placed in the permutation before W. write-read: Suppose W precedes R in the execution. Then R reads W's timestamp or a larger one from Val[ ] and reads from W or a later write. Thus R is placed in the permutation after W. read-read: Suppose $R_i$ by $p_i$ precedes $R_j$ by $p_j$ in the execution. Then $p_j$ reads $R_i$'s timestamp or a larger one from Report[i,j]. So $R_j$ reads from the same write that $R_i$ reads from or a later write. Thus $R_j$ is placed in the permutation after $R_i$.

57 Multi-Writer from Single-Writer: Idea: –each writer should announce each value it wants to write to all the readers, by writing the value to its own (SW,MR) register –each reader reads all the values written by the writers and returns the latest one. How to determine the latest value? –use timestamps –new wrinkle is that multiple processes generate timestamps, so they need to coordinate

58 Using Vector Timestamps: Data structure VT at each proc consisting of a vector of m integers –m is the number of writers. To get a new timestamp, writer $p_i$ increments VT[i] by one. To compare timestamps, use lexicographic order –this is a total order that extends the partial order defined for vector timestamps
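
The relationship between the two orders can be illustrated in a few lines. This sketch assumes the usual componentwise partial order on vector timestamps; the function names are hypothetical, and the exhaustive check only covers small vectors.

```python
from itertools import product


def vt_leq(a, b):
    """Partial order on vector timestamps: a <= b in every component."""
    return all(x <= y for x, y in zip(a, b))


def lex_less(a, b):
    """Lexicographic order; Python compares tuples this way natively."""
    return tuple(a) < tuple(b)


def lex_extends_partial_order(m, bound):
    """Check, for all length-m vectors with entries below `bound`,
    that a <= b componentwise (with a != b) implies a is lex-smaller."""
    vectors = list(product(range(bound), repeat=m))
    return all(
        lex_less(a, b)
        for a in vectors
        for b in vectors
        if a != b and vt_leq(a, b)
    )
```

Vectors such as (2, 0) and (0, 5) are incomparable in the partial order, yet lexicographic order still ranks (0, 5) before (2, 0), which is what makes it a total order.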

59 Writer $p_w$'s Algorithm: get the next vector timestamp: –read the timestamp written by each writer to TS[0],…,TS[m-1] –extract the i-th entry of each TS[i] –increment own entry by 1 –write my new timestamp to TS[w]; then write the value and timestamp to Val[w]

60 Reader $p_r$'s Algorithm: read the value and timestamp written by each writer to Val[0], …, Val[m-1]; choose the value-timestamp pair with the largest timestamp; return the value associated with that pair
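
The writer and reader algorithms on slides 59 and 60 can be sketched as below. It is a sequential illustration only: the class name `MultiWriterRegister` is hypothetical, Python lists stand in for the single-writer multi-reader registers TS and Val, and timestamps are stored as tuples so that Python's built-in tuple comparison supplies the lexicographic order.

```python
class MultiWriterRegister:
    """Sketch of a multi-writer register using vector timestamps."""

    def __init__(self, m, initial=0):
        self.m = m
        zero = (0,) * m
        self.ts = [zero] * m              # ts[w]: written only by writer w
        self.val = [(zero, initial)] * m  # val[w]: (timestamp, value)

    def write(self, w, value):
        # Build the new timestamp from the i-th entry of each TS[i].
        new_ts = [self.ts[i][i] for i in range(self.m)]
        new_ts[w] += 1  # increment own entry by 1
        new_ts = tuple(new_ts)
        self.ts[w] = new_ts
        self.val[w] = (new_ts, value)

    def read(self):
        # Return the value paired with the lexicographically largest timestamp.
        return max(self.val, key=lambda pair: pair[0])[1]
```

With two writers, successive writes by writers 0, 1, and 0 produce the timestamps (1, 0), (1, 1), and (2, 1), so each read after a write returns that write's value.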

61 Multi-Writer Construction [Figure: 3 readers and 2 writers; each writer alg. reads and writes TS and writes Val; each reader alg. reads Val]

62 Correctness of Multi-Writer Algorithm: Wait-free –writer does m low-level reads and 2 low-level writes –reader does m low-level reads. To prove linearizability, explicitly construct a permutation of the high-level operations that is clearly legal and then prove it preserves the real-time order of non-overlapping operations.

63 Constructing the Permutation: Put in all writes in timestamp order –Lemma 10.6 shows this preserves the order of non-overlapping writes. Consider the reads in the order of their responses in the execution. –read R reads from write W if W generates the timestamp associated with the value R returns –place R immediately before the write that follows W. By construction, the permutation is legal.

64 Preserving Real-Time Order: write-write: by construction of the permutation. read-write: Suppose R precedes W in the execution. Then R cannot read from W or any succeeding write, so R is placed in the permutation before W. write-read: Suppose W precedes R in the execution. Then R reads W's timestamp or a larger one from Val[ ] and reads from W or a later write. Thus R is placed in the permutation after W. read-read: Suppose $R_i$ by $p_i$ precedes $R_j$ by $p_j$ in the execution. By Lemmas 10.6 and 10.7, $p_j$ reads $R_i$'s timestamp or a larger one from Val[ ]. So $R_j$ reads from the same write that $R_i$ reads from or a later write. Thus $R_j$ is placed in the permutation after $R_i$.

65 Atomic Snapshot Objects (ASO): An array of elements: –each one can be updated by just one proc. –a proc. can scan the whole array "atomically". Useful abstraction for designing shared memory algorithms. Can be wait-free implemented from read/write variables.

66 ASO Sequential Specification: Operations are –invocation $scan_i$, response $return_i(V)$ where V is an array of n values, 0 ≤ i ≤ n-1 –invocation $update_i(d)$ where d is a data value, response $ack_i$, 0 ≤ i ≤ n-1. Legal sequences: for each V returned by a scan, V[i] equals the parameter of the latest preceding $update_i$ (or the initial value of entry i if there is none).

67 ASO Example: Suppose array = [a,b,c] initially. This sequence is legal: $update_1(x)$, $update_2(y)$, scan([a,x,y]), $update_0(z)$, scan([z,x,y])
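
The sequential specification can be turned into a small legality checker. This is a sketch with a hypothetical encoding: each operation is a tuple, either `("update", i, d)` or `("scan", V)`, and the function name `is_legal` is an assumption for illustration.

```python
def is_legal(initial, ops):
    """Check a sequence of ASO operations against the sequential spec."""
    state = list(initial)
    for op in ops:
        if op[0] == "update":
            _, i, d = op
            state[i] = d  # latest update to entry i
        else:
            _, view = op
            # A scan must return exactly the current array contents.
            if list(view) != state:
                return False
    return True


# The example sequence from the slide, in the hypothetical encoding.
EXAMPLE = [
    ("update", 1, "x"),
    ("update", 2, "y"),
    ("scan", ["a", "x", "y"]),
    ("update", 0, "z"),
    ("scan", ["z", "x", "y"]),
]
```

Calling `is_legal(["a", "b", "c"], EXAMPLE)` accepts the slide's sequence, while a scan that still reports `b` in position 1 after `update_1(x)` would be rejected.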

68 Sketch of Implementation: Store each array entry ("segment") in a different read/write variable. Update algorithm: –write to the variable holding that segment. Scan algorithm: –collect (read) all the values in the segments twice –if no segment is updated during the "double collect", then we got a valid snapshot, so return it. Issues: –how to tell if a segment is updated? –what to do if a segment is updated?

69 Detecting Updates: Simple idea is to tag each value stored in a segment with a counter (1,2,3,…) –requires unbounded space. More complex, bounded-space, solution is given in the textbook –uses a "handshaking" mechanism
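
The double collect with counter tags can be sketched as follows. This is an illustration under assumptions: the function names and the `between_collects` hook (used to inject an update between the two collects) are hypothetical, and the sketch covers only a single scan attempt. It does not include the embedded scans that make the full algorithm wait-free.

```python
def update(segments, i, value):
    """Write a new value to segment i, tagged with the next counter."""
    counter, _ = segments[i]
    segments[i] = (counter + 1, value)


def collect(segments):
    """Read every segment once, returning (counter, value) pairs."""
    return list(segments)


def try_scan(segments, between_collects=None):
    """One double collect: return a snapshot, or None if a segment changed."""
    first = collect(segments)
    if between_collects is not None:
        between_collects()  # hypothetical hook to simulate a concurrent update
    second = collect(segments)
    if first == second:
        return [value for _, value in second]
    return None
```

The counter is what lets a scanner detect an update that rewrites the same value: the pair changes from `(1, "z")` to `(2, "z")` even though the data is unchanged.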

70 Reacting to Update During Scan: If a scanner observes enough changes to a particular segment, then the corresponding updater has performed a complete update during this scan. Embed a scan at the beginning of each update: –the view obtained in this scan is written with the data to the segment. Scanner returns view obtained in last collect.

71 Complexity of ASO Algorithm: Number of building-block read/write variables is O(n) (although some are large). Scan algorithm uses $O(n^2)$ low-level reads and writes. Update algorithm uses $O(n^2)$ low-level reads and writes.

