Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 Alexander A. Shvartsman University of Connecticut, USA Specifying, Reasoning About, Optimizing, and Implementing Atomic Data Services for Dynamic Distributed.

Similar presentations

Presentation on theme: "1 Alexander A. Shvartsman University of Connecticut, USA Specifying, Reasoning About, Optimizing, and Implementing Atomic Data Services for Dynamic Distributed."— Presentation transcript:

1 1 Alexander A. Shvartsman University of Connecticut, USA Specifying, Reasoning About, Optimizing, and Implementing Atomic Data Services for Dynamic Distributed Systems FRIDA 2014, Vienna, Austria

2 2 “Three R’s” (Lesen, Schreiben, Rechnen)  Reading, ’Riting, and ’Rithmetic  Underlay much of human intellectual activity  Venerable foundation of computing technology  With networking, communication became a major activity  Email – electronic counterpart of postal service  Yet, it is natural to deal with reading, writing, and computing  In the Web, a browser app may load (i.e., read) a page, perform computation, and save (i.e., write) the results  In distributed databases we retrieve and store data, and rarely talk about sending and receiving data  Arguably, it is also easier to develop distributed algorithms with readable/writeable objects, than to use message passing

3 3 Sharing Memory in a Networked System  Let’s place a shareable object at a node in a network  Not fault-tolerant – single point of failure  Not efficient – performance bottleneck  Not very available, does not provide longevity, etc…

4 4 Sharing Memory in a Networked System  So we replicate – we’d have to anyway, since redundancy is the only means for providing fault-tolerance

5 5 Sharing Memory in a Networked System  With replication come challenges:  How to preserve consistency while managing replicas?  What kind of consistency?  How to provide it?  How to use it? a a b

6 6 Sharing Memory in Networked Systems  Let’s place a shareable object at a node in a network  Not fault-tolerant – single point of failure  Not efficient – performance bottleneck  Not very available, does not provide longevity, etc…  So we replicate – we’d have to anyway, since redundancy is the only means for providing fault-tolerance  With replication come challenges:  How to preserve consistency while managing replicas?  What kind of consistency?  How to provide it?  How to use it?

7 7Consistency  Easiest for users: a single copy view  Sequence of operations; a read sees all previous writes  Captured by atomicity [Lamport] / linearizability [Herlihy Wing]  Not cheap to implement even without general updates  Cheapest to implement: a read sees a subset of prior writes  Not the most natural semantic for the users  Additional complications in dynamic systems  Ever-changing sets of replicas and participants  Crashes never stop, timing variations persist  Evolving topology, ultimately mobility

8 8 Atomicity / Linearizability [Lamport / Herlihy Wing ]  “Shrink” the interval of each operation to a serialization point so that the behavior of the object is consistent with its sequential type      read(0) write(8) read(8) Time read(8) read(0) write(8) read(8) Time read(0) write(8) read(8) Time

9 9  Valid (atomic) executions:  Invalid executions: More Atomicity Examples read( ) write(8) ret(0) ack( ) read( ) write(8) ret(0) read( ) ret(8) Time read( ) write(8) ret(0) ack( ) read( ) write(8) ret(8)read( )ret(0) Time read( )ret(0)

10 10 Consistency Polemics  Parallel / distributed systems  Performance  Speed-up  Distributed theory focus  Fault-tolerance  Consistency  User: Yes, mine is wrong… But it is fast! Yes, mine is slow… But it is correct! I wish they reconciled…

11 11 Using Majorities/Quorums for Consistency  Consistency of replicated data: using intersecting sets  Starting with Gifford (79) and Thomas (79)  Upfal and Wigderson (85)  Majority sets of readers and writers to emulate shared memory in a synchronous distributed setting  Vitanyi and Awerbuch (86)  MW/MR registers using matrices of SW/SR registers where rows and columns are read and written  Attiya, Bar-Noy, and Dolev (91/95)  Atomic SW/MR objects in message passing systems, majorities of processors, minority may crash  Two-phase protocol (ABD)

12 12 Quorum Systems and Examples AC B Majorities [Thomas79,Gifford79] Matrix Quorums: Processor ids arranged in a matrix. A quorum: Row U Column Quorum system Q over P, a set of processor ids: Q = {Q 1, Q 2, … } Q i  P Q i  Q j ≠ Ø for all i, j 1234 5678 9101112 13141516 Lemma: The join of quorum system Q a over P a and system Q b over P b, Q a ⋈ Q b, is a quorum system over P a U P b.

13 13 Main Idea: (Logical) Timestamps and Quorums  An object is represented as a pair (value, timestamp)  A write records (new-value, new-timestamp) in a quorum  A read obtains (value, timestamp) pairs from a quorum, then returns the value with the largest timestamp read(v max ) (v max,t max ) write(v 1 ) (v 1,t 1 ) writeread Time

14 14 “Readers Must Write”  If operations are concurrent and a reader simply returns the latest value, then atomicity can be violated:  Solution: If readers first help the writer to record the value in a quorum, then it is safe to return the latest value Write(v 1 )... (happened to be slow) v0v0 v1v1 Read returns v 1 Read returns v 0 v0v0 v0v0 v0v0 v0v0

15 15 Lastly: “Writers Must Read” I assume you’re being facetious, Josef, I distinctly yelled ‘second’ before you did! Writers can “read” before writing (and riding) to obtain the latest timestamp in order to compute a new timestamp

16 16 The ABD Algorithm [Attiya Bar-Noy Dolev]

17 17 The Complete Algorithm [Attiya Bar-Noy Dolev] Replica hosts respond to Get and Put requests Read and write uses identical two-phase communication patterns: Get phase queries and obtains values from a quorum, Put phase propagates values to a quorum. The only difference is in what is sent out in the Put phase.

18 18 New Frontiers: Dynamic Atomic Memory  Goal: Develop and analyze algorithms to solve problems of coherent data sharing in dynamic distributed environments  “Dynamic” encompasses  Changing sets of participants: nodes come and go as they please  Wide range of failures  Asynchrony, timing variations  Crashes, message loss, weak delivery guarantees  Changes in network topology  Processor mobility

19 19 Approach: Dynamic Quorums [Lynch Shvartsman]  Objects are replicated at several network locations  To accommodate small, transient changes:  Use quorum configurations: members, quorums  Maintains atomicity during “normal operation”  Allows concurrency  To handle larger, more permanent changes:  Reconfigure  Maintains atomicity across configuration changes  Any configuration can be installed at any time  Reconfigure concurrently with reads/writes -- no heavy-handed view change

20 20 Algorithmic Ideas  Reconfigurable quorum systems  Quorums maintain consistency during modest and transient changes  Reconfigurations accommodate more drastic and permanent changes  Read/write operations are frequent  Use quorum access and allow concurrency  Isolate from reconfiguration  Reconfigurations are infrequent  Use consensus to impose total order (Paxos)  Optimistic dissemination without formal installation  Conservative garbage collection of obsolete config-s

21 21 Related Other Approaches  Using consensus to agree on each operation [Lamport]  Performance overhead for each reads and write op  Termination of operations depends on consensus  Use group communication service [Birman 87] with TO bcast [Amir, Dolev, Melliar-Smith, Moser 94], [Keidar, Dolev 96]  View change delays reads/writes  One change may trigger view formation  DynaStore: Reconfiguration without consensus [Aguilera, Keidar, Malkhi, Shraer 11]  Initial quorum system, incremental adds/removes  Changes yield DAGs of possibilities  Reads/writes use ABD-like phases, traverse DAGs  Assuming finite reconfigurations guarantees termination

22 22 GCS: Group Communication Services (a Detour)  Pioneering work of Ken Birman & Co. made GCSs one of the most important distributed system building blocks  GCS: group membership + multicast communication  Isis, Transis, Consul, Ensemble, Totem, Horus, Newtop, Relacs …  Commercial variants of Birman’s Isis used in  French air traffic control system  Stock exchange on Wall Street  Swiss Electronic Bourse  GCSs for years defied sufficiently formal specification that would both  Detail the utility of the services and  Enable compelling reasoning about correctness

23 23 View-Synchrony [Fekete Lynch Shvartsman 01]  A new, simple formal specification for a partitionable view- oriented group communication service.  To demonstrate the value of our specification, we used it to construct an algorithm for an ordered-broadcast application that reconciles information derived from different views.  Atomic memory is now easily obtained  Our algorithm is based on algorithms of Amir, Dolev, Keidar, Melliar-Smith and Moser.  Our specification has a simple implementation, based on the membership algorithm of Cristian and Schmuck.

24 24 View-Synchrony [Fekete Lynch Shvartsman 01]  We proved the correctness and analyzed the performance and fault-tolerance of the algorithm  Our work on GCS was (we believe) the first successful formal specification + rigorous proofs  Safety: Invariants and simulation  Conditional performance analysis  Myla Archer @ NRL checked several invariants using PVS – invaluable contribution to perfecting the proofs  Despite perhaps only modest algorithmic advances the paper was/is highly cited -- the value is in the proofs!

25 25 Our Ground-Up Approach: RAMBO  Reconfigurable Atomic Memory for Basic Objects [Lynch, Shvartsman 02], [Gilbert, Lynch, Shvartsman 10]  For small, transient changes, use quorum configurations  Maintain atomicity and allow concurrency, a la ABD  For larger, permanent changes, reconfigure:  Dynamically install any quorum system  Guarantee atomicity across configurations  Guarantees:  Linearizability in all cases  Good latency in common cases  High availability  Everything proceeds concurrently  Use low-level gossip communication

26 26Methodology  Specify algorithm  Interacting state machines  Using non-deterministic “gossip”  Show correctness/safety for  arbitrary patterns of asynchrony  assuming arbitrary crash-failures and message loss  Analyze performance for a subset of timed executions  Bounded message delay, 0-time local processing  Some “gossip” becomes deliberate, some periodic  Non-failure of certain quorums for certain periods  Reason about operation latency  Of course none of this impacts safety

27 27 Reconfigurable Atomic Memory for Basic Objects  Global service specification  Implementation: Main algorithm + “recon” service  Recon service:  “Advance reconnaissance”  Consistent sequence of configurations  Loosely coupled  Main algorithm:  Reading, writing  Receives, disseminates new configuration information; no explicit installation  Reconfigures: upgrade to new and remove old  Reads/writes may use several configurations Net RAMBO Recon RAMBO

28 28 Architectural View Communication Network Node i Joiner i R/W i Node j Recon Joiner j R/W j Application RAMBO

29 29 High-Level Functions  Joiner  Introduces new participants to the system  Reader-Writer  Routine read and write operations  Two-phased algorithm using all “known” configurations  Using tags  Recon  Chooses new next configuration, e.g., using Paxos  Informs members of the previous configuration  Configuration upgrade (“packaged” with Reader-Writer)  Identify and remove obsolete configurations (garbage collection)

30 30 Configurations and Reconfiguration  Configuration: quorum system  Collection of subsets of replica host ids where any two subsets intersect  Or collection of read-quorums and write-quorums, where any read-quorum intersects with any write-quorum  Reconfiguration process involves two decoupled steps  Recon: Emitting a new configuration, then, at some later time  Garbage-collecting obsolete configurations while “upgrading” to the latest (locally) known configuration  No constraints on the membership of distinct quorum systems Q1Q1 Q2Q2 Q3Q3 …

31 31 Consensus Recon Net Implementation of Recon  Uses distributed consensus to determine new configurations 1,2,3,…  Note: when the universe of configurations is finite and known, then consensus is not needed even with unbounded reconfiguration [GeoQuorums]  Members of old configuration may propose a new config.  Proposals reconciled using consensus  Consensus is a fairly heavy mechanism, but it is  Used only for reconfigurations, which are infrequent  Does not delay Read/Write operations

32 32 Configurations and Config Maps  Configuration c  members(c) -- set of members of configuration c  read-quorums(c), write-quorums(c) -- sets of quorums  Configuration map cm  mapping from naturals to configurations  cm(k) is configuration k  Can be defined ( c ), undefined (  ), garbage-collected ( ± ) ±± ccc  c ...  GC ’ d Defined Mixed Undefined c

33 33 Configuration Upgrade [Gilbert, Lynch, S 10]  Reconfigure to last configuration in a contiguous segment  Phase 1:  Informs write-quorum of c j … c k-1 about c k  Collects (value,tag) from read-quorums of c j … c k-1  Phase 2:  Propagates latest (value, tag) to a write-quorum of c k  Garbage-collect: Set cmap(j…k -1) to ±  Constant-time upgrade regardless of the number of obsolete configurations (conditioned on failures)  Maintains good read/write latency during system instability or frequent reconfigurations ± cjcj...ckck 

34 34 Configuration Map Changes (Local View) c0c0  c0c0 c1c1  c0c0 c1c1 c2c2  ckck  ± c1c1 c2c2  ckck  ±± c2c2  ckck  TIME... ±±± c3c3  ckck  ±±±±± ccc  c 

35 35 On to Reads and Writes: Values and Tags  Each value v has an associated tag t (logical timestamp)  Tag is made up of the sequence-processor pair  Reads:  a set of value-tag pairs is obtained  the result is the value with the maximum tag  Writes:  a set of value-tag pairs is obtained  new-value is propagated with a new-tag that is a lexicographic increment of tag : new-tag :=  tag.seq + 1, pid 

36 36 Dynamic Reader-Writer and Recon  The work is split between Reader-Writer and Recon  Recon emits consistent configurations  Reader-Writer processes run two-phase quorum-based algorithm, with all “active” configurations  Background “gossip” builds fixed-points  If Recon emits new configuration, Reader-Writer continues reads/writes in progress, until fixed-point is reached Net R/W i Recon R/W j

37 37 Processing Reads and Writes  Reads and Writes perform Query and Propagation phases using known configurations, stored in op.cmap.  Query: Gets “fresh” value, tag, and cmap information from read-quorums  Propagation: Gives latest (value,tag) to write-quorums  Both phases: Extend op.cmap with newly-discovered configurations that now must also be involved.  Each phase ends with a fixed point, involving all the configurations currently in op.cmap Read or Write Propagation PhaseQuery Phase Start Query StartProp End Query End Prop

38 38Methodology  Algorithms are presented formally, using interacting state machine models: Input/Output automata)  service specifications  algorithm descriptions  models for applications  Safety: rigorous proof of correctness (atomicity) for arbitrary patterns of asynchrony and change  Conditional performance analysis  E.g., when message latency < d, quorum configurations are “viable”, then read and write operations take time between 4d and 8d, under reasonable “steady-state” assumptions. … Net …

39 39 39 Input Output Automata (IOA)  Input Output Automata [Lynch & Tuttle]  A language and formalism based on simple mathematical foundation  Elegant, suitable for analysis, well documented  Supports composition, abstraction  Model system components as a collection of interacting state machines  Allows rigorous reasoning about system properties  Widely used in research – hundreds of algorithms

40 40 Example Spec’n: Asynchronous Lossy Channel Input send i,j (m)Output recv j, i (m) Internal lose (m) Channel i,j

41 41 Details: Reader-Writer: Send and Recv Code Specification of gossip using Input/Output Automata of [Lynch Tuttle] Send Receive

42 42 Details: Reader-Writer Fixed Points Phase 1 fixed point Specification of fixed points using Input/Output Automata Phase 2 fixed point

43 43 Showing Read/Write Atomicity  We show atomicity using a partial order  For  be any sequence of reads and writes  Let  be an irreflexive PO of all op-s in   To prove atomicity show that  satisfies:  For any operation , finitely many op-s     If  precedes , then not     If  is a write then either    or     Any read returns value written by last write, per  (or init value if no such writes)  [Lynch, Lemma 13.16]  Rigorous proof that in RAMBO the max-tag of reads and new-tag of writes, lexi-ordered, provide such a PO

44 44 Some Latency Analysis Details  Restrict attention to a subset of timed executions  Reminder: Read and write operations are not affected by Recon delays or non-termination  Configuration upgrade (garbage collection) takes 4d  If reconfigurations are “rare” -- operations take 4d  If configurations are in “steady state” -- operations take 8d  Logarithmic time “catch-up” after a bursty period of “bad timing behavior”  A recovering node joins quickly after a long absence

45 45Implementation  Experimental system implementations [Musial 07]  Platform for refinement, optimization, tuning  Observe of algorithms in a local area setting  Cluster with 16+/- Linux machines & fast switch  Developed by manually translating the Input/Output Automata specification to Java code  Precise rules are followed to mitigate error introduction during translation  Rigorous proofs [Georgiou, Musial, Sh., Sonderegger 07, 11]  Next steps:  Specification in Tempo [Lynch Michel S 08] (Timed IOA)  Code generation ( [Georgiou Lynch Mavrommatis Tauber 09] )

46 46 Optimizations – Rigorously Proved  Parallel upgrade [Gilbert Lynch S 10]  Improving communication efficiency [Georgiou, Musial, S 06]  Explicit “leave” service  Incremental gossip for long-lived data  Improving liveness & gossip efficiency [Gramoli, Musial, S 05]  Non-deterministic heuristic for prodding slow operations  Constrained gossip patterns  Providing complete shared memory [Georgiou, Musial, S 05]  Domain-oriented Rambo  Optimizing performance for groups of related objects

47 47 Optimization and Development Methodology Atomicity properties Abstract Rambo Abstract Rambo with graceful departures Rambo with graceful departures Long-Lived Rambo Long-Lived Rambo Proof Derivation Running System Simulation Manual derivation [Musial 07] or semi-automated code generation [Georgiou + 09,10] [Lynch, S 02] [Gilbert, L, S 10] [Georgiou, Musial, S 04, 06]

48 48 Forward Simulation  Let C (low-level, concrete) and A (high-level, abstract) be two systems. The goal is to show that “C implements A” by “trace inclusion”  An execution is a finite or infinite alternating sequence s 0 a 1 s 1 a 2 s 2 … of states and actions, where s 0 is a start state  A trace is obtained by removing all states and internal actions  Forward simulation  from C to A    states(C) x states(A)  If s  start(C) then  (s)  start(A)  If (s,a,s’)  transitions(C), then there exists an execution fragment  =  (s), …,  (s’) such that trace(a) = trace(  ) Theorem: If there exists a simulation relation  from C to A, then traces(C)  traces(A).  This implies that any safety property of A is also satisfied by C

49 49 Simulation (or Abstraction) Function  Simulation function F maps each state of specification C (concrete) to a state of specification A (abstract)  F is required to satisfy the following two conditions 1. If s is an initial state of C then F(s) is an initial state of A 2. If s and F(s) are reachable states of C and A respectively, and (s, , s') is a step of C, then there is a step  ’of A from F(s) to F(s'), having the same trace A: C: F(s)F(s)F(s') s s' FF  ’’

50 50 Optimization: Improving performance  Long-Lived RAMBO: Graceful Leave + Incremental Gossip  Rigorous proof of correctness by simulation  Performance study  [Georgiou, Musial, Shvartsman 06]

51 51 Complete Shared Memory  Atomicity is compositional  Naïve approach:  Implement a singleton memory location  Get a complete shared memory by running several implementations  Not so fast! Literally.

52 52 Complete Shared Memory  Domain-oriented reconfigurable atomic memory  Optimizing performance for groups of related objects - Composition - Domain

53 53 Atila: Atomicity through Indirect Learning  Abstract Rambo assumes all-to-all connectivity  Atila: “Atomicity Through Indirect Learning Algorithm”  Indirect learning enables progress without any routing protocol or complete connecticivity [Konwar,Musial, Nicolaou, S 07]  Prove atomicity guarantees  Conditional analysis of operation latency  Identify deployment settings that lead to reasonable communication cost

54 54 RDS: Rambo  Paxos  Reconfigurable Distributed Storage for Dynamic Networks  RDS [Chockler, Gilbert, Gramoli, Musial, Shvartsman 09]  How can liveness of Rambo get hurt?  Reconfiguration bursts and  Rapid rate of failures  Solution:  Paxos-based procedure for installing new configurattons  Integrate configuration upgrade with installation  Overlap phases of Paxos with “garbage collection”  Obsolete configuration are removed quicker  Reads and writes use only 1 or 2 configurations

55 55 Federated Array of Bricks (FAB)  Storage system developed and evaluated at HP Labs  [Saito Frølund Veitch Merchant Spence 05]  Distributes workload and handles failures and recoveries without disturbing client requests  Read or write protocol involves majority quorums of storage “bricks” following the Rambo algorithm  Evaluations of the implementation showed  FAB performance is similar to centralized solutions,  While offering at the same time continuous service and high availability

56 56 DynaDisk Implementation  Data-center read/write storage system  Allows add/remove of storage devices on-the-fly  Based on DynaStore, but with and without consensus  [Shraer Martin Malkhi Keidar 10]

57 57GeoQuorums  Dynamic atomic read/write memory for mobile settings  [Dolev, Gilbert, Lynch, Shvartsman, Welch 04, 05]  Use Rambo architecture over Virtual Node layer  Nodes: fixed geographical locations called Focal Points  Centers of populated, compact geographical areas:  Traffic intersections, buildings, bridges, points-of-interest  Continuously populated, thus able to maintain state  Implementations:  Virtual Node layer over the physical mobile network  Atomic read/write memory over the Virtual Node layer

58 58GeoQuorums  Mobile nodes  Focal points – implemented as Virtual Nodes  Quorums are defined over focal points  Use GPS as timestamps  Fast(er) read/write operations  Single phase writes  One or two phase reads  Simplified, consensus-free, reconfiguration  Two-phase algorithm using fixed configurations  Can be motivated by performance: e.g., if writes are frequent, install smaller write quorums

59 59 Read-Modify-Write Storage Systems  RMW is strictly stronger than atomic read and write  Some storage systems implement atomic RMW operations. This type of consistency is expensive, and requires at its core atomic updates  Reduce parts of the system to a single-writer model (e.g., Microsoft’s Azure)  Depend on clock synchronization hardware (Google’s Spanner)  Rely on complex mechanisms for resolving event ordering such as vector clocks (Amazon’s Dynamo)

60 60 Tempo @  A toolkit for the Timed Input/Output Automata formalism  Computer-aided design framework  Specification language  Suite of tools used to specify systems and reasons about them (interfaces to UPPAAL and PVS)  Eclipse GUI (and alternatively a command line interface)  Tempo is used to  Specify distributed algorithms / protocols  Verify their correctness / safety with theorem provers / model checkers  Simulate their behavior (native simulator)  Tempo availability: beta at

61 61 Tempo – bird’s eye view Syntax Error Reports Syntax Error Reports Parser Vocabularies Automata Files Syntax Tree Semantic Analysis Back-end libraries Internal representation Back-end Plugin loader Simulator (interactive!) Uppaal Model checker Probabilistic Model checker PVS Generation Code Generation Deployment Optimization Legend Existing back-end tools: Proposed back-end tools:

62 62 Tempo Availability (beta for research use) Pure Java Runs on the Eclipse Rich Client Platform Runs on multiple platforms Mac OS X Windows Linux

63 63 Problems & Futures  Use Tempo ( )  Complete specifications  Code generation  “Weaker” consistency – crafting atomicity?  Local optimizations that preserve global atomicity  Practical settings  Recoveries, reconfiguration, and self-stabilization  Indirect learning in mobile settings  Reconfiguration  When to reconfigure? How to choose configurations?  Optimization of deployment [Sonderegger 11]  Going beyond read/write objects

64 64 Thank You! Questions and Discussion ?

65 65 Specifying, Reasoning About, Optimizing, and Implementing Atomic Data Services for Distributed Systems  Consistent shareable data services supporting atomic (linearizable) objects provide convenient building blocks for dynamic distributed systems. In general it is challenging to combine the guarantee of linearizability with efficiency in practical systems subject to perturbations in the computing medium. We describe our work on specification and implementation of consistent data services for network settings. We present a general framework that can be tailored to yield services for various target network settings. In these services, long-lived data availability is ensured through replication, while atomicity is ensured by performing reads and writes using evolvable quorum configurations. We describe on-the-fly reconfiguration that only modestly interferes with on-going operations, in particular, the termination of reads and writes does not depend on the termination of reconfiguration. Safety (atomicity) holds for arbitrary patterns of asynchrony, processor crashes, and link failures. We describe specific examples of specification, rigorous reasoning about correctness, provable optimizations, and methodical implementations of consistent data services in distributed systems.

Download ppt "1 Alexander A. Shvartsman University of Connecticut, USA Specifying, Reasoning About, Optimizing, and Implementing Atomic Data Services for Dynamic Distributed."

Similar presentations

Ads by Google