Time Bounds for Shared Objects in Partially Synchronous Systems Jennifer L. Welch Dagstuhl Seminar on Consistency in Distributed Systems Feb 2013.

Acknowledgment
- Joint work with
  - Jiaqi Wang
  - Hyunyoung Lee
  - Edward Talmage
- Jiaqi Wang's M.S. thesis, CSE, TAMU, 2011
- PODC 2011 brief announcement

Model
- Fixed set of n nodes
- Nodes communicate through reliable message passing
  - message delay in range [d−u, d]
- Nodes have approximately synchronized clocks with skew ε
  - ε ≥ (1−1/n)u [Lundelius and Lynch 1984]
  - no clock drift
- No node failures
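The model parameters can be sanity-checked with a few lines of Python. This is only an illustrative sketch: the function names `min_skew` and `admissible_delay` are made up here, and d is read as the maximum message delay with u the delay uncertainty.

```python
def min_skew(n: int, u: float) -> float:
    """Smallest clock skew achievable with n nodes and delay uncertainty u."""
    return (1 - 1 / n) * u


def admissible_delay(delay: float, d: float, u: float) -> bool:
    """A message delay is admissible when it lies in [d - u, d]."""
    return d - u <= delay <= d
```

For example, with n = 4 and u = 4 the skew ε cannot be pushed below 3, and with d = 10 a delay of 6 is admissible while a delay of 5.9 is not.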

Problem
- Each node runs an application process
- Application processes communicate through (logically) shared variables
  - arbitrary data types
- How to implement the shared variables that the application processes use?
- Desired consistency condition is linearizability
- Focus on elapsed time of implemented operations

Related Work: Lower Bounds
- Lipton and Sandberg 1988
  - |read| + |write| ≥ d (for sequential consistency)
- Attiya and Welch 1991, 1994
  - |read| ≥ u/4
  - |write/enq/push| ≥ u/2
- Mavronicolas and Roth 1991, 1992, 1999
  - |read/write| ≥ min{ε/2, u/2}
  - |read| + |write| ≥ d + min{ε/2, u/2}

Related Work: Upper Bounds for Read-Write Registers
- Mavronicolas and Roth 1991, 1992, 1999:
  - |read| ≤ βd + 3u + min{ε, u} + γ
  - |write| ≤ (1−β)d + 3u
  - β is a tradeoff parameter in [0, 1−u/d)
  - γ is a small constant
- Chaudhuri, Gawlick and Lynch 1993:
  - |read| ≤ u + c
  - |write| ≤ d + u − c
  - c is a tradeoff parameter in [0, d]

Related Work: General Upper Bounds
Folklore algorithm #1:
- Centralized (single copy):
  - send operation invocation to the node with the copy
  - node with the copy serializes invocations and updates the copy
  - node with the copy sends response to invoker
- Each operation takes 2d time

Related Work: General Upper Bounds
Folklore algorithm #2:
- Use atomic broadcast (full replication):
  - broadcast invocation
  - upon receipt, do the operation
  - invoker waits for the broadcast time and provides response
- Each operation takes h time, where h is the broadcast time:
  - h = 2d

Overview of Our Results
- Lower bound #1:
  - (1−1/n)u for operations that can be executed in any order but result in different states for different orders
  - includes write, push and enq
  - improves previously known bound of u/2
  - uses classic shifting technique

Overview of Results
- Lower bound #2:
  - d + min{ε, u, d/3} for operations that "immediately" do not commute with themselves (invalidate each other)
  - includes RMW, pop, deq
  - improves previous lower bound of d
  - uses a new shifting technique, which provides a larger bound by shifting by a larger amount and then manipulating the new execution to fix message delays that are too big or too small

Overview of Results
- New generic algorithm for any data type
- Partitions operations into
  - pure accessors (don't change state)
  - pure mutators (don't observe state)
  - other
- Upper bounds are, for any X in [0, d+ε−u],
  - d + ε − X for pure accessor
  - ε + X for pure mutator
  - d + ε for other
- Improves on folklore algorithms (2d time per op)
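The tradeoff governed by X can be made concrete with a small calculation. The Python sketch below simply evaluates the three upper bounds listed above; the function name `upper_bounds` and the dictionary keys are illustrative choices, not names from the talk.

```python
def upper_bounds(d, u, eps, X):
    """Upper bounds of the generic algorithm for a tradeoff parameter X."""
    if not 0 <= X <= d + eps - u:
        raise ValueError("X must lie in [0, d + eps - u]")
    return {
        "pure_accessor": d + eps - X,  # shrinks as X grows
        "pure_mutator": eps + X,       # grows as X grows
        "other": d + eps,              # independent of X
    }


fast_mutators = upper_bounds(d=10, u=4, eps=3, X=0)
fast_accessors = upper_bounds(d=10, u=4, eps=3, X=10 + 3 - 4)
```

With d = 10, u = 4, ε = 3, choosing X = 0 gives a pure mutator bound of ε = 3 and a pure accessor bound of 13, while X = d+ε−u = 9 gives a pure accessor bound of u = 4 and a pure mutator bound of 12. In both cases the two bounds add up to d + 2ε = 16.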

Bounds for Read-Modify-Write Register

operation         | lower bound            | upper bound
read-modify-write | d + min{ε, u, d/3}     | d + ε (all X)
read              | u/2 [Kosa]             | u (X = d+ε−u)
write             | (1−1/n)u               | ε (X = 0)
read + write      | d [Lipton & Sandberg]  | d + 2ε (all X)

Recall ε can be as small as (1−1/n)u

Bounds for Queue

operation  | lower bound         | upper bound
enq        | (1−1/n)u            | ε (X = 0)
deq        | d + min{ε, u, d/3}  | d + ε (all X)
peek       | u/2 [Kosa]          | u (X = d+ε−u)
peek + enq | d + min{ε, u, d/3}  | d + 2ε (all X)

Recall ε can be as small as (1−1/n)u

Bounds for Stack

operation   | lower bound         | upper bound
push        | (1−1/n)u            | ε (X = 0)
pop         | d + min{ε, u, d/3}  | d + ε (all X)
peek        | u/2 [Kosa]          | u (X = d+ε−u)
peek + push | d + min{ε, u, d/3}  | d + 2ε (all X)

Recall ε can be as small as (1−1/n)u

Terminology
- Operation: operation without argument and return value. Ex: read
- Operation instance: operation with argument and return value. Ex: read(-,3)
- Legal op sequence: one of the sequences in the sequential spec of the data type. Ex: for a register, every read returns the value of the latest preceding write
- Equivalent sequences of ops, ρ1 and ρ2: for all op sequences ρ3, ρ1.ρ3 is legal iff ρ2.ρ3 is legal
- OP is a mutator: there exist an op sequence ρ and an op instance op in OP s.t. ρ.op and ρ are not equivalent
- OP is an accessor: there exist a legal op sequence ρ and an op instance op in OP s.t. ρ.op is illegal
- Pure mutator: mutator but not accessor
- Pure accessor: accessor but not mutator
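These definitions can be exercised on a read/write register. The Python sketch below is illustrative only: the tuple encoding `(name, arg, ret)` of an operation instance, the helper names, and the initial register value 0 are all assumptions made for the example. It exhibits one witness showing that write is a mutator and one showing that read is an accessor.

```python
INITIAL = 0  # assumed initial register value (not stated in the talk)


def legal(seq):
    """Sequential spec of a register: every read returns the latest written value."""
    value = INITIAL
    for name, arg, ret in seq:
        if name == "write":
            value = arg
        elif name == "read":
            if ret != value:
                return False
        else:
            raise ValueError(f"unknown operation {name}")
    return True


def distinguishes(rho1, rho2, rho3):
    """True if continuation rho3 witnesses that rho1 and rho2 are not equivalent."""
    return legal(rho1 + rho3) != legal(rho2 + rho3)


write1 = ("write", 1, None)  # write(1,-)
read1 = ("read", None, 1)    # read(-,1)
read2 = ("read", None, 2)    # read(-,2)

# write is a mutator: appending write(1) to the empty sequence changes what may follow.
write_is_mutator = distinguishes([write1], [], [read1])

# read is an accessor: [write(1)] is legal, but [write(1), read(-,2)] is not.
read_is_accessor = legal([write1]) and not legal([write1, read2])
```

A write is legal after every sequence, so it is not an accessor, and appending a legal read never changes which continuations are legal, so read is not a mutator. That makes write a pure mutator and read a pure accessor for this data type.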

Lower Bound #1 (write, push, enq, etc.)
- If
  - for all operation sequences ρ and all instances op1 and op2 of OP, ρ.op1 and ρ.op2 legal => ρ.op1.op2 and ρ.op2.op1 are both legal, and
  - there exist an operation sequence ρ and instances op1, op2, ..., opn of OP s.t.
    - ρ.opi is legal, i = 1, ..., n, and
    - for all permutations π1 and π2 of op1, ..., opn, last(π1) ≠ last(π2) => ρ.π1 and ρ.π2 are not equivalent
- then |OP| ≥ (1−1/n)u.
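As an illustration of the second hypothesis, the Python sketch below checks it by brute force for n = 3 distinct enq instances on an initially empty queue. Everything here is an assumption of the example: the tuple encoding of instances, the helper names, and the use of the final queue contents as a stand-in for equivalence (reasonable for a queue, since different contents can be told apart by later dequeues).

```python
from itertools import permutations


def run_queue(seq):
    """Return the final queue contents as a tuple, or None if seq is illegal."""
    q = []
    for name, arg, ret in seq:
        if name == "enq":
            q.append(arg)
        elif name == "deq":
            head = q.pop(0) if q else None  # None models "queue empty" (assumption)
            if ret != head:
                return None
        else:
            raise ValueError(f"unknown operation {name}")
    return tuple(q)


def last_op_determines_state(rho, ops):
    """Check: each rho.op is legal, and permutations with different last
    operations lead to different final states."""
    if any(run_queue(rho + [op]) is None for op in ops):
        return False
    perms = list(permutations(ops))
    return all(
        run_queue(rho + list(p1)) != run_queue(rho + list(p2))
        for p1 in perms
        for p2 in perms
        if p1[-1] != p2[-1]
    )


n = 3
enqs = [("enq", i, None) for i in range(n)]
holds_for_enq = last_op_determines_state([], enqs)
```

The first hypothesis holds for enq trivially, because an enq is legal after any sequence. Together the two hypotheses give |enq| ≥ (1−1/n)u.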

Classic Shifting Proof Idea
- Assume for contradiction that there is an implementation with |OP| < (1−1/n)u
- Specify a carefully designed reference execution
  - Specify which operations are invoked when, the message delays, and the clock skews
- Shift the real times at which events occur in the reference execution to get a new execution that should still be correct, but in which, because of the shifting, the semantics of OP are violated
- Carefully design the shift amounts to keep message delays and clock skews within bounds
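The bookkeeping behind a shift can be sketched in Python. The sketch assumes the usual convention that shifting node i by x[i] moves all of its events x[i] later in real time, so a message from i to j has its delay changed by x[j] − x[i] and node i's clock offset (local clock minus real time) changes by −x[i]. The names `shift` and `admissible` are illustrative.

```python
def shift(delays, offsets, x):
    """Shift every event of node i by x[i] in real time.

    delays[(i, j)] is the delay of the message from i to j;
    offsets[i] is node i's local clock minus real time.
    """
    new_delays = {(i, j): dl - x[i] + x[j] for (i, j), dl in delays.items()}
    new_offsets = [c - xi for c, xi in zip(offsets, x)]
    return new_delays, new_offsets


def admissible(delays, offsets, d, u, eps):
    """All delays lie in [d - u, d] and all pairwise clock skews are at most eps."""
    delays_ok = all(d - u <= dl <= d for dl in delays.values())
    skew_ok = max(offsets) - min(offsets) <= eps
    return delays_ok and skew_ok


d, u, eps = 10, 4, 3
reference = ({(0, 1): 8, (1, 0): 8}, [0, 0])
small = shift(*reference, x=[0, 2])  # delays become 10 and 6, skew 2
large = shift(*reference, x=[0, 3])  # delays become 11 and 5: out of bounds
```

Here shifting the second node by 2 keeps the execution admissible, while shifting it by 3 pushes one delay above d and the other below d−u. This is why the shift amounts have to be chosen carefully.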

Classic Shifting Picture
[Figure: two executions on nodes p1 to p4, each consisting of ρ, the concurrent operations op1 to op4, and the observing ops. The second execution is obtained from the first by shifting p3, which changes the operation that is linearized last.]

Shifting Proof Idea: Some Details
- Reference execution:
  - Execute ρ sequentially (from the second condition)
  - Have n procs concurrently invoke op1, ..., opn
  - Argue that the responses of the concurrent operations are the same as for the opi's
  - Execute a sequence of operations that "observe" the result of the concurrent operations
  - Specify the message delays carefully
- Identify the last operation of the permutation into which the opi's are linearized
- Shift carefully so that this last operation finishes before the first one starts => the permutation into which the operations are linearized in the shifted execution has a different last operation
- Since different last operations produce non-equivalent states, the "observer" sequence is incorrect, a contradiction

Lower Bound #2 (rmw, pop, deq, etc.)
- If
  - there exist an operation sequence ρ and instances op1 and op2 of OP s.t. ρ.op1 and ρ.op2 are both legal and ρ.op1.op2 and ρ.op2.op1 are both illegal
- then |OP| ≥ d + min{ε, u, d/3}.
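The hypothesis can be checked mechanically for a concrete data type. The Python sketch below does so for a FIFO queue; the tuple encoding `(name, arg, ret)`, the use of `None` as the return value of deq or peek on an empty queue, and the function names are assumptions made for illustration.

```python
def legal(seq):
    """Sequential spec of a FIFO queue with enq, deq and peek."""
    q = []
    for name, arg, ret in seq:
        if name == "enq":
            q.append(arg)
        elif name == "deq":
            head = q.pop(0) if q else None  # None models "queue empty" (assumption)
            if ret != head:
                return False
        elif name == "peek":
            head = q[0] if q else None
            if ret != head:
                return False
        else:
            raise ValueError(f"unknown operation {name}")
    return True


def satisfies_lb2(rho, op1, op2):
    """rho.op1 and rho.op2 are legal, but neither order of the pair is."""
    return (
        legal(rho + [op1])
        and legal(rho + [op2])
        and not legal(rho + [op1, op2])
        and not legal(rho + [op2, op1])
    )


def lower_bound_2(d, u, eps):
    return d + min(eps, u, d / 3)


rho = [("enq", 7, None)]
deq7 = ("deq", None, 7)
peek7 = ("peek", None, 7)
```

With ρ = enq(7), two deq instances that both return 7 satisfy the hypothesis, since whichever runs second would have to report an empty queue. Two peek instances, or two enq instances, do not. For d = 9, u = 4, ε = 3 the resulting bound is 9 + min{3, 4, 3} = 12.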

Proof Idea
- New shifting method:
  - Shift the reference execution by a (larger) amount so that there is one pair of nodes with too large a message delay
  - Chop the shifted execution as late as possible before the first violation of the message delay bound
  - Different nodes are chopped at different, carefully chosen points that form a consistent cut
  - Extend the prefix of the shifted execution from the cut so that it has correct message delays

Proof Idea
[Figure: three executions on nodes p1 and p2, with op1 = op(arg1,resp1) at p1 and op2 = op(arg2,resp2) at p2.]
- Reference execution: op1 starts at t, op2 starts at t+m, where m = min{ε, u, d/3}
- Shift p2 by −m: the operations become op(arg1,resp1′) and op(arg2,resp2′). A shift amount of m is too large for the classic shift; use the new shift and the operation properties to prove that resp1′ = resp1 and resp2′ = resp2. Thus the operations are still op1 and op2.
- Shift p1 by m: the operations become op(arg1,resp1″) and op(arg2,resp2″). A shift amount of m is too large for the classic shift; use the new shift and the operation properties to prove that resp1″ = resp1 and resp2″ = resp2. Thus the operations are still op1 and op2.

Algorithm Intuition for Mutators
- Mutators must be executed in the same order at every node
- On invocation, broadcast to all nodes with a timestamp
- If pure mutator, wait ε+X and return to user
- Wait d−u to simulate the minimum message delay to self; when a broadcast is received, add the operation to the pending set
- Wait long enough (u+ε) to ensure that no operation with a smaller timestamp can still be received, and then execute locally all pending ops with smaller or equal timestamp
- If not pure mutator, then return to user

Algorithm Intuition for Pure Accessors
- Pure accessors only need to execute locally, so there is no need to exchange messages
- This allows squeezing the timing, since we only have to make sure that no remote invocations with smaller timestamps will arrive after the pure accessor executes and returns
- Give the pure accessor a special timestamp that is X in the past (T−X for invocation at clock time T)
- Wait d+ε−X time, then execute locally all pending ops with smaller timestamp, execute locally the pure accessor, and return to user

Algorithm
- when a pure accessor aop(arg) is invoked at node i at clock time T:
  - set timer to respond to (aop,arg,(T−X,i)) for d+ε−X in the future
- when timer to respond to (aop,arg,ts) expires:
  - execute all ops in pending set with timestamp < ts, in timestamp order, and cancel associated execute timers
  - execute aop
  - respond to user
- when an operation op(arg) that is not a pure accessor is invoked at node i at clock time T:
  - if op is a pure mutator then set timer to respond to (op,arg,(T,i)) for ε+X in the future
  - set timer to add (op,arg,(T,i)) to pending set for d−u in the future
  - send (op,arg,(T,i)) msg to all other nodes
- when timer to respond to pure mutator (mop,arg,ts) expires:
  - respond to user
- when timer to add (op,arg,ts) to pending set expires or (op,arg,ts) msg is received:
  - add (op,arg,ts) to pending
  - set timer to execute (op,arg,ts) for u+ε in the future
- when timer to execute (op,arg,ts) expires:
  - execute all ops in pending set with timestamp ≤ ts, in timestamp order, and cancel associated execute timers
  - if i is the invoker of (op,arg,ts) then respond to user
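The handlers above can be tried out in a small discrete-event simulation. The Python sketch below is one possible reading of the pseudocode, not the authors' implementation. Its assumptions: clocks are modelled as fixed offsets from real time, the caller supplies the message delays, timer cancellation is modelled by letting a late execute timer find nothing left to execute, and an invoker's own execute timer still triggers the response (using the stored return value) if the operation was already executed earlier. The names `simulate`, `apply_register`, and `KIND`, and the use of fetch-and-add as the RMW operation, are illustrative.

```python
import heapq
from itertools import count


def simulate(n, d, u, eps, X, offsets, delay, invocations, apply, kind, init):
    """Return {(node, invoke_time): (response_time, value)}.

    offsets[i]  : clock offset of node i (pairwise differences at most eps)
    delay(i, j) : message delay from i to j, expected to lie in [d - u, d]
    invocations : list of (real_time, node, op, arg)
    apply(state, op, arg) -> (new_state, return_value)
    kind(op)    : "accessor" (pure), "mutator" (pure) or "other"
    """
    tie = count()  # tie-breaker so heap entries never compare payloads
    events = []

    def at(t, node, what, item):
        heapq.heappush(events, (t, next(tie), node, what, item))

    state = [init] * n                  # local copy at each node
    pending = [[] for _ in range(n)]    # per-node pending set of (ts, op, arg)
    returned = [{} for _ in range(n)]   # per-node: ts -> return value once executed
    responses = {}

    def run_pending(i, ts, inclusive):
        ready = sorted(p for p in pending[i]
                       if p[0] < ts or (inclusive and p[0] == ts))
        for item in ready:              # execute in timestamp order
            pending[i].remove(item)
            state[i], returned[i][item[0]] = apply(state[i], item[1], item[2])

    for t, i, op, arg in invocations:
        at(t, i, "invoke", (op, arg))

    while events:
        t, _, i, what, item = heapq.heappop(events)
        if what == "invoke":
            op, arg = item
            T = t + offsets[i]          # local clock reading at invocation
            if kind(op) == "accessor":
                at(t + d + eps - X, i, "respond_acc", ((T - X, i), op, arg, t))
                continue
            msg = ((T, i), op, arg, t)
            if kind(op) == "mutator":
                at(t + eps + X, i, "respond_mut", msg)
            at(t + d - u, i, "add", msg)             # simulated delay to self
            for j in range(n):
                if j != i:
                    at(t + delay(i, j), j, "add", msg)
            continue
        ts, op, arg, t0 = item
        if what == "add":
            pending[i].append((ts, op, arg))
            at(t + u + eps, i, "execute", item)
        elif what == "execute":
            run_pending(i, ts, inclusive=True)
            if ts[1] == i and kind(op) == "other":
                responses[(i, t0)] = (t, returned[i][ts])
        elif what == "respond_mut":
            responses[(i, t0)] = (t, None)
        elif what == "respond_acc":
            run_pending(i, ts, inclusive=False)
            _, value = apply(state[i], op, arg)      # pure accessor: state unchanged
            responses[(i, t0)] = (t, value)
    return responses


def apply_register(state, op, arg):
    if op == "read":
        return state, state
    if op == "write":
        return arg, None
    if op == "rmw":                     # fetch-and-add as an example RMW
        return state + arg, state
    raise ValueError(f"unknown operation {op}")


KIND = {"read": "accessor", "write": "mutator", "rmw": "other"}

result = simulate(
    n=3, d=10, u=4, eps=3, X=0, offsets=[0, 1, 3], delay=lambda i, j: 10,
    invocations=[(0, 0, "write", 5), (20, 1, "rmw", 2), (50, 2, "read", None)],
    apply=apply_register, kind=KIND.get, init=0,
)
```

In this run the three operations do not overlap, so each takes exactly its bound: the write responds after ε+X = 3, the RMW after d+ε = 13 (returning the old value 5), and the read after d+ε−X = 13 (returning 7).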

Algorithm Example: Operations in Isolation
[Figure: real-time timelines for p0, p1 and p2, each operation invoked at time t. Read: invoked at t, executed and returned at t+d+ε−X. Write: invoked at t, response at t+ε+X, added to pending at t+d−u, executed at t+d+ε. RMW: invoked at t, added to pending at t+d−u, executed and responded to at t+d+ε. The other nodes add and execute the write and the RMW as well.]

Algorithm Example: Operations Interacting (T2 < T1)
[Figure: real-time timelines for p0, p1 and p2 showing the same read, write and RMW events (invoke, add, execute, respond) when the operations overlap, with clock times T1 and T2 marked.]

Algorithm Analysis
- Linearizability is shown in a standard way (provide an ordering of the operations and show it satisfies the properties)
  - Mutators are linearized by timestamps
  - Accessors fit in between to reflect what they saw
- Time bounds:
  - pure accessor: timer ensures d+ε−X
  - pure mutator: timer ensures ε+X
  - other: two timers ensure (d−u)+(u+ε) = d+ε
- X is a parameter to trade off the time of pure accessors and pure mutators (as in [Mavronicolas and Roth 1999] for registers)

Conclusion
- Summary:
  - Showed improved lower bounds on elapsed time of operations for linearizable implementations of arbitrary data types in partially synchronous systems
  - Presented a generic algorithm for the problem
  - Tight and almost tight bounds in many cases for some common data types
- Open problems:
  - Tighten gaps
  - Consider clock drift, failures, churn, ...
  - Other consistency conditions?
