
1 Programming Model and Protocols for Reconfigurable Distributed Systems. Cosmin Ionel Arad, Doctoral Thesis Defense, 5th June 2013, KTH Royal Institute of Technology. https://www.kth.se/profile/icarad/page/doctoral-thesis/

2 Presentation Overview
– Context, motivation, and thesis goals
– Kompics: introduction & design philosophy
– Distributed abstractions & P2P framework
– Component execution & scheduling
– Distributed systems experimentation: development cycle (build, test, debug, deploy)
– CATS: a scalable & consistent key-value store: system architecture and testing using Kompics; scalability, elasticity, and performance evaluation
– Conclusions

3 Trend 1: Computer systems are increasingly distributed
– For fault-tolerance, e.g. replicated state machines
– For scalability, e.g. distributed databases
– Due to inherent geographic distribution, e.g. content distribution networks

4 Trend 2: Distributed systems are increasingly complex connection management, location and routing, failure detection, recovery, data persistence, load balancing, scheduling, self-optimization, access-control, monitoring, garbage collection, encryption, compression, concurrency control, topology maintenance, bootstrapping,... 4

5 Trend 3: Modern hardware is increasingly parallel
Multi-core and many-core processors; concurrent/parallel software is needed to leverage hardware parallelism.
Major software concurrency models:
– Message-passing concurrency (the model adopted in this thesis), with data-flow concurrency viewed as a special case
– Shared-state concurrency

6 Distributed Systems are still Hard… … to implement, test, and debug Sequential sorting is easy – Even for a first-year computer science student Distributed consensus is hard – Even for an experienced practitioner having all the necessary expertise 6

7 Experience from building Chubby, Google’s lock service, using Paxos: “The fault-tolerance computing community has not developed the tools to make it easy to implement their algorithms. The fault-tolerance computing community has not paid enough attention to testing, a key ingredient for building fault-tolerant systems.” [Paxos Made Live] – Tushar Deepak Chandra, winner of the Edsger W. Dijkstra Prize in Distributed Computing, 2010

8 A call to action: “It appears that the fault-tolerant distributed computing community has not developed the tools and know-how to close the gaps between theory and practice with the same vigor as for instance the compiler community. Our experience suggests that these gaps are non-trivial and that they merit attention by the research community.” [Paxos Made Live] – Tushar Deepak Chandra, winner of the Edsger W. Dijkstra Prize in Distributed Computing, 2010

9 Thesis Goals
– Raise the level of abstraction in programming distributed systems
– Make it easy to implement, test, debug, and evaluate distributed systems
– Attempt to bridge the gap between the theory and the practice of fault-tolerant distributed computing

10 We want to build distributed systems 10

11 by composing distributed protocols 11

12 implemented as reactive, concurrent components 12

13 with asynchronous communication and message-passing concurrency. (Diagram: a component stack with Application on top of Consensus, Broadcast, Failure Detector, Network, and Timer.)

14 Design principles
– Tackle increasing system complexity through abstraction and hierarchical composition
– Decouple components from each other: publish-subscribe component interaction; dynamic reconfiguration for always-on systems
– Decouple component code from its executor: the same code is executed in different modes – production deployment, interactive stress testing, and deterministic simulation for replay debugging

15 Nested hierarchical composition
– Model entire sub-systems as first-class composite components: richer architectural patterns
– Tackle system complexity by hiding implementation details and providing isolation
– Natural fit for developing distributed systems: virtual nodes; model the entire system with each node as a component

16 Message-passing concurrency
– Compositional concurrency: free from the idiosyncrasies of locks and threads
– Easy to reason about: many concurrency formalisms – the Actor model (1973), CSP (1978), CCS (1980), the π-calculus (1992)
– Easy to program: see the success of Erlang, Go, Rust, Akka, ...
– Scales well on multi-core hardware, i.e. almost all modern hardware

17 Loose coupling
“Where ignorance is bliss, 'tis folly to be wise.” – Thomas Gray, Ode on a Distant Prospect of Eton College (1742)
– Communication integrity: the Law of Demeter
– Publish-subscribe communication
– Dynamic reconfiguration

18 Design Philosophy: 1. Nested hierarchical composition, 2. Message-passing concurrency, 3. Loose coupling, 4. Multiple execution modes

19 Component Model: Event, Port, Component, Channel, Handler, Subscription, Publication (event trigger). (Diagram: components connected by channels between their ports; handlers subscribe to events arriving on ports.)
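The concepts on this slide map to code roughly as follows. This is a deliberately simplified, self-contained Java sketch of events, ports, handlers, subscription, and triggering; it is not the actual Kompics API, and all class and method names below are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// A minimal illustration of events, ports, handlers, and subscriptions.
// This is NOT the Kompics API; it only mirrors the concepts on this slide.
class Event {}

class Port {
    private final List<Consumer<Event>> handlers = new ArrayList<>();

    // Subscription: a handler registers interest in events on this port.
    void subscribe(Consumer<Event> handler) {
        handlers.add(handler);
    }

    // Publication / event trigger: deliver the event to all subscribed handlers.
    void trigger(Event event) {
        for (Consumer<Event> h : handlers) {
            h.accept(event);
        }
    }
}

class Ping extends Event {}

class PingerComponent {
    PingerComponent(Port network) {
        // Handler: reacts to events arriving on the connected port.
        network.subscribe(e -> {
            if (e instanceof Ping) {
                System.out.println("got a Ping");
            }
        });
    }
}

class ComponentModelSketch {
    public static void main(String[] args) {
        Port network = new Port();          // a shared "channel" between components
        new PingerComponent(network);       // component wires its handler to the port
        network.trigger(new Ping());        // another component publishes an event
    }
}
```

In the real framework, triggered events are queued and handlers are executed by a component scheduler rather than synchronously as above, which is what gives message-passing concurrency.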

20

21

22 A simple distributed system. (Diagram: Process1 and Process2 each contain an Application component and a Network component wired together; Ping and Pong messages are exchanged between the processes over the network.)

23 A Failure Detector Abstraction using a Network and a Timer Abstraction. (Diagram: a Ping Failure Detector component provides the Eventually Perfect Failure Detector port – events StartMonitoring, StopMonitoring, Suspect, Restore – and requires the Network and Timer ports provided by MyNetwork and MyTimer.)
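An eventually perfect failure detector of the kind shown here is commonly implemented by pinging monitored nodes and suspecting those that miss the current timeout, restoring them (and increasing the timeout) when a late reply arrives. The following is a minimal standalone Java sketch of that logic, driven by explicit tick and reply calls; it is illustrative only and not the thesis component, which would interact through the Timer and Network ports.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of an eventually perfect failure detector (increasing-timeout style).
class EventuallyPerfectFailureDetector {
    private final Set<String> monitored = new HashSet<>();
    private final Set<String> suspected = new HashSet<>();
    private final Map<String, Long> lastHeard = new HashMap<>();
    private long timeoutMillis = 1000;

    void startMonitoring(String node, long now) {
        monitored.add(node);
        lastHeard.put(node, now);
    }

    void stopMonitoring(String node) {
        monitored.remove(node);
        suspected.remove(node);
        lastHeard.remove(node);
    }

    // Called when a pong/heartbeat from 'node' is received.
    void onReply(String node, long now) {
        lastHeard.put(node, now);
        if (suspected.remove(node)) {
            timeoutMillis += 500;                      // suspected wrongly: back off
            System.out.println("Restore " + node);     // deliver Restore event
        }
    }

    // Called periodically, e.g. on every timer tick.
    void onTick(long now) {
        for (String node : monitored) {
            boolean late = now - lastHeard.get(node) > timeoutMillis;
            if (late && suspected.add(node)) {
                System.out.println("Suspect " + node); // deliver Suspect event
            }
        }
    }
}
```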

24 A Leader Election Abstraction using a Failure Detector Abstraction. (Diagram: an Ω Leader Elector component provides the Leader Election port – Leader event – and requires the Eventually Perfect Failure Detector port provided by the Ping Failure Detector.)
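A textbook way to build the Ω (eventual leader elector) shown here on top of an eventually perfect failure detector is to trust the lowest-ranked process that is not currently suspected, emitting a new Leader event whenever that choice changes. A small illustrative Java sketch under that assumption (not the thesis code):

```java
import java.util.Set;
import java.util.TreeSet;

// Omega leader elector built on an eventually perfect failure detector:
// the leader is the lowest-id process not currently suspected.
class OmegaLeaderElector {
    private final TreeSet<Integer> alive = new TreeSet<>(); // non-suspected processes
    private Integer currentLeader = null;

    OmegaLeaderElector(Set<Integer> allProcesses) {
        alive.addAll(allProcesses);
        reelect();
    }

    void onSuspect(int process) {      // Suspect event from the failure detector
        alive.remove(process);
        reelect();
    }

    void onRestore(int process) {      // Restore event from the failure detector
        alive.add(process);
        reelect();
    }

    private void reelect() {
        Integer leader = alive.isEmpty() ? null : alive.first();
        if (leader != null && !leader.equals(currentLeader)) {
            currentLeader = leader;
            System.out.println("Leader event: trust process " + leader);
        }
    }
}
```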

25 A Reliable Broadcast Abstraction using a Best-Effort Broadcast Abstraction. (Diagram: a Reliable Broadcast component provides a Broadcast port and requires the Broadcast port of a Best-Effort Broadcast component, which in turn requires Network; the Broadcast port carries Broadcast and Deliver events – RbBroadcast/RbDeliver at the upper layer, BebBroadcast/BebDeliver at the lower layer.)
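The classic construction of reliable broadcast from best-effort broadcast is the eager algorithm: on the first best-effort delivery of a message, relay it with another best-effort broadcast before delivering it, so a message delivered by any correct process is eventually delivered by all. An illustrative Java sketch of that relaying logic (the slide's RbBroadcast/RbDeliver and BebBroadcast/BebDeliver events correspond to the calls below; this is not the thesis code):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Eager reliable broadcast on top of a best-effort broadcast primitive.
class EagerReliableBroadcast {
    private final Set<String> delivered = new HashSet<>(); // message ids seen so far
    private final Consumer<String> bebBroadcast;            // lower abstraction
    private final Consumer<String> rbDeliver;               // upper abstraction

    EagerReliableBroadcast(Consumer<String> bebBroadcast, Consumer<String> rbDeliver) {
        this.bebBroadcast = bebBroadcast;
        this.rbDeliver = rbDeliver;
    }

    // RbBroadcast request from the application.
    void rbBroadcast(String messageId) {
        bebBroadcast.accept(messageId);
    }

    // BebDeliver indication from the best-effort broadcast below.
    void onBebDeliver(String messageId) {
        if (delivered.add(messageId)) {
            bebBroadcast.accept(messageId); // relay before delivering (eager step)
            rbDeliver.accept(messageId);    // RbDeliver indication to the application
        }
    }
}
```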

26 A Consensus Abstraction using a Broadcast, a Network, and a Leader Election Abstraction. (Diagram: a Paxos Consensus component provides the Consensus port and requires Leader Election from the Ω Leader Elector, Broadcast from a Best-Effort Broadcast component, and Network from MyNetwork.)

27 A Shared Memory Abstraction. (Diagram: an ABD Atomic Register component provides the Atomic Register port – events ReadRequest, WriteRequest, ReadResponse, WriteResponse – and requires Broadcast from a Best-Effort Broadcast component and Network from MyNetwork.)
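The ABD algorithm behind such an atomic register abstraction proceeds in quorum phases: a read queries a majority for their (timestamp, value) pairs, adopts the highest-timestamped value, and writes it back to a majority before returning; a write first learns the highest timestamp from a majority and then stores the value under a larger timestamp at a majority. A compact illustrative Java sketch in which the replica round-trips are abstracted as synchronous calls (not the thesis implementation, which exchanges these messages over the Broadcast and Network ports):

```java
import java.util.List;

// Sketch of the coordinator logic of an ABD-style atomic read/write register.
// The List<Replica> abstracts the per-replica network round-trips; in the real
// protocol only a majority quorum of replies is awaited, and calls are async.
class AbdRegisterCoordinator {
    record Tagged(long timestamp, int writerId, String value) {}

    interface Replica {
        Tagged readLocal();              // phase-1 reply: the replica's (ts, value)
        void writeLocal(Tagged t);       // phase-2: store t if it is newer
    }

    private final List<Replica> replicas;
    private final int myId;

    AbdRegisterCoordinator(List<Replica> replicas, int myId) {
        this.replicas = replicas;
        this.myId = myId;
    }

    // Query the replicas and keep the pair with the highest (timestamp, writerId).
    private Tagged highestTag() {
        Tagged best = null;
        for (Replica r : replicas) {
            Tagged t = r.readLocal();
            boolean newer = best == null
                    || t.timestamp() > best.timestamp()
                    || (t.timestamp() == best.timestamp() && t.writerId() > best.writerId());
            if (newer) {
                best = t;
            }
        }
        return best;
    }

    String read() {
        Tagged latest = highestTag();                 // phase 1: read quorum
        replicas.forEach(r -> r.writeLocal(latest));  // phase 2: write-back quorum
        return latest.value();
    }

    void write(String value) {
        Tagged latest = highestTag();                 // phase 1: learn highest timestamp
        Tagged mine = new Tagged(latest.timestamp() + 1, myId, value);
        replicas.forEach(r -> r.writeLocal(mine));    // phase 2: store at quorum
    }
}
```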

28 A Replicated State Machine using a Total-Order Broadcast Abstraction. (Diagram: a State Machine Replication component provides the Replicated State Machine port – Execute and Output events – and requires the Total-Order Broadcast port – TobBroadcast and TobDeliver events – provided by a Uniform Total-Order Broadcast component, which requires the Consensus port – Propose and Decide events – provided by Paxos Consensus.)
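State machine replication on top of total-order broadcast is conceptually simple: every replica submits commands via TobBroadcast and applies the commands arriving via TobDeliver, in delivery order, to a deterministic state machine, so all replicas move through the same sequence of states. A minimal illustrative Java sketch of that apply loop (the command format and names are invented for the example):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

// Replicated state machine driven by total-order broadcast (TOB).
// Commands here are simple "put key value" strings; any deterministic
// state machine works as long as all replicas apply the same sequence.
class ReplicatedStateMachine {
    private final Map<String, String> state = new HashMap<>();
    private final Consumer<String> tobBroadcast; // TobBroadcast request port

    ReplicatedStateMachine(Consumer<String> tobBroadcast) {
        this.tobBroadcast = tobBroadcast;
    }

    // Execute request from a client: submit the command to the TOB layer.
    void execute(String command) {
        tobBroadcast.accept(command);
    }

    // TobDeliver indication: every replica applies commands in the same order.
    void onTobDeliver(String command) {
        String[] parts = command.split(" ");
        if (parts.length == 3 && parts[0].equals("put")) {
            state.put(parts[1], parts[2]);
        }
        System.out.println("Output: applied " + command); // Output indication
    }
}
```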

29 Probabilistic Broadcast and Topology Maintenance Abstractions using a Peer Sampling Abstraction. (Diagram: an Epidemic Dissemination component provides Probabilistic Broadcast and a T-Man component provides Topology; both require the Peer Sampling port provided by a Cyclon Random Overlay component, which requires Network and Timer.)

30 A Structured Overlay Network implements a Distributed Hash Table. (Diagram: the Structured Overlay Network composite provides the Distributed Hash Table port and contains a One-Hop Router providing Overlay Router, a Chord Periodic Stabilization component providing Consistent Hashing Ring Topology, a Ping Failure Detector providing Failure Detector, and a Cyclon Random Overlay providing Peer Sampling, all on top of Network and Timer.)

31 A Video on Demand Service using a Content Distribution Network and a Gradient Topology Overlay. (Diagram: a Video On-Demand component requires the Content Distribution Network port, provided by a BitTorrent component, and the Gradient Topology port, provided by a Gradient Overlay component; the tracker is either a Centralized Tracker or a Distributed Tracker with Tracker Client and Peer Exchange, on top of Distributed Hash Table, Peer Sampling, Network, and Timer.)

32 Generic Bootstrap and Monitoring Services provided by the Kompics Peer-to-Peer Protocol Framework. (Diagram: three process mains – BootstrapServerMain, MonitorServerMain, and PeerMain – each compose MyNetwork, MyTimer, and MyWebServer with the BootstrapServer, MonitorServer, or Peer component, wired through Network, Timer, and Web ports.)

33 Whole-System Repeatable Simulation. (Diagram: the whole system executes on a deterministic simulation scheduler, together with a network model and an experiment scenario.)

34 Experiment scenario DSL
– Define parameterized scenario events: node failures, joins, system requests, operations
– Define “stochastic processes”: a finite sequence of scenario events; specify the distribution of event inter-arrival times, the type and number of events in the sequence, and the distribution of each event parameter value
– A scenario is a composition of “stochastic processes”: sequential, parallel

35 Local Interactive Stress Testing. (Diagram: the whole system executes on a work-stealing multi-core scheduler, together with a network model and an experiment scenario.)

36 Execution profiles
– Distributed production deployment: one distributed system node per OS process; multi-core component scheduler (work stealing)
– Local / distributed stress testing: the entire distributed system in one OS process; interactive stress testing on the multi-core scheduler
– Local repeatable whole-system simulation: deterministic simulation component scheduler; correctness testing, stepped / replay debugging

37 Incremental Development & Testing
– Define emulated network topologies: processes and their addresses, and the properties of the links between processes – latency (ms) and loss rate (%) – as sketched below
– Define small-scale execution scenarios: the sequence of service requests initiated by each process in the distributed system
– Experiment with various topologies / scenarios: launch all processes locally on one machine
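Concretely, such a topology definition amounts to a list of process addresses plus per-link latency and loss-rate parameters. The following is a purely illustrative, hand-rolled Java sketch of that kind of description; it is not the actual Kompics launcher or topology DSL, and all names and values are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative in-code description of an emulated topology:
// three local processes and the properties of the links between them.
class EmulatedTopologySketch {
    record Node(int id, String host, int port) {}
    record Link(int from, int to, long latencyMs, double lossRate) {}

    public static void main(String[] args) {
        List<Node> processes = List.of(
                new Node(1, "127.0.0.1", 22031),
                new Node(2, "127.0.0.1", 22032),
                new Node(3, "127.0.0.1", 22033));

        List<Link> links = new ArrayList<>();
        links.add(new Link(1, 2, 50, 0.00));   // 50 ms latency, no loss
        links.add(new Link(2, 3, 120, 0.02));  // 120 ms latency, 2% loss
        links.add(new Link(1, 3, 200, 0.05));  // 200 ms latency, 5% loss

        processes.forEach(p -> System.out.println("process " + p));
        links.forEach(l -> System.out.println("link " + l));
    }
}
```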

38 Distributed System Launcher 38

39 (Screenshot: the script of service requests of the process is shown in one panel; after the Application completes the script, it can process further commands typed interactively.)

40 Programming in the Large
– Events and ports are interfaces: service abstractions, packaged together as libraries
– Components are implementations: they provide or require interfaces; dependencies on provided / required interfaces are expressed as library dependencies [Apache Maven]
– Multiple implementations may exist for an interface: separate libraries, deploy-time composition

41 Kompics Scala, by Lars Kroll 41

42 Kompics Python, by Niklas Ekström 42

43 Case study A Scalable, Self-Managing Key-Value Store with Atomic Consistency and Partition Tolerance 43

44 Key-Value Store? Store.Put(key, value) → OK [write]; Store.Get(key) → value [read]. Example: Put(”www.sics.se”, ”193.10.64.51”) → OK; Get(”www.sics.se”) → ”193.10.64.51”
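The client-visible contract of such a store is just these two operations. A tiny illustrative Java rendering, with a trivial single-node in-memory stand-in to exercise it (names are illustrative; this is not the actual CATS client API):

```java
import java.util.concurrent.ConcurrentHashMap;

// The client-visible contract of the key-value store on this slide.
interface KeyValueStore {
    void put(String key, String value);   // Put(key, value) -> OK
    String get(String key);               // Get(key) -> value
}

// Trivial single-node, in-memory stand-in, just to exercise the interface.
class InMemoryStore implements KeyValueStore {
    private final ConcurrentHashMap<String, String> data = new ConcurrentHashMap<>();
    public void put(String key, String value) { data.put(key, value); }
    public String get(String key) { return data.get(key); }
}

class KeyValueStoreExample {
    public static void main(String[] args) {
        KeyValueStore store = new InMemoryStore();
        store.put("www.sics.se", "193.10.64.51");
        System.out.println(store.get("www.sics.se")); // prints 193.10.64.51
    }
}
```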

45 Consistent Hashing: incremental scalability, self-organization, simplicity. (Used by Dynamo and Project Voldemort, among others.)
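Consistent hashing places node identifiers and key hashes on the same identifier ring; a key is served by the first node at or after its position, wrapping around, so adding or removing a node only affects the keys in one arc of the ring. A minimal illustrative Java sketch of the ring lookup using a sorted map (a toy hash function stands in for the cryptographic hash a real system would use):

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hashing ring: a key maps to the first node whose
// position on the ring is at or after the key's hash (wrapping around).
class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    private int position(String id) {
        return id.hashCode() & 0x7fffffff;   // toy hash; real systems use e.g. SHA-1
    }

    void addNode(String node)    { ring.put(position(node), node); }
    void removeNode(String node) { ring.remove(position(node)); }

    String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(position(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addNode("node-A");
        ring.addNode("node-B");
        ring.addNode("node-C");
        System.out.println("www.sics.se is stored at " + ring.nodeFor("www.sics.se"));
        ring.addNode("node-D");  // incremental scalability: only one arc is affected
        System.out.println("www.sics.se is now at " + ring.nodeFor("www.sics.se"));
    }
}
```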

46 Single client, single server. (Diagram: the client sends Put(X, 1); the server's state changes from X = 0 to X = 1 and it replies Ack(X); a subsequent Get(X) returns 1.)

47 Multiple clients, multiple servers. (Diagram: Client 1 sends Put(X, 1); Server 1 updates X from 0 to 1 and replies Ack(X), while Server 2 still holds X = 0. Client 2 issues Get(X) twice and may see Return(1) from one server but Return(0) from the other, i.e. a stale read.)

48 Atomic Consistency (informally): put/get operations appear to occur instantaneously.
– Once a put(key, newValue) completes, the new value is immediately visible to all readers; each get returns the value of the last completed put
– Once a get(key) returns a new value, no other get may return an older, stale value

49 CATS Node. (Architecture diagram: the CATS Node composite contains, among others, a CATS Web Application, a Load Balancer, an Operation Coordinator, a Reconfiguration Coordinator, a Group Member, a Consistent Hashing Ring, a One-Hop Router, a Cyclon Random Overlay, an Epidemic Dissemination component, a Ping Failure Detector, a Bootstrap Client, a Status Monitor with Aggregation, a Garbage Collector, a Bulk Data Transfer component, and Persistent Storage, wired through ports such as Distributed Hash Table, Replication, Ring Topology, Peer Sampling, Overlay Router, Broadcast, Status, Bootstrap, Local Store, Data Transfer, Failure Detector, Network, Timer, and Web.)

50 Simulation and Stress Testing. (Diagram: the same set of CATS Node components – each exposing Web, DHT, Timer, and Network ports – is hosted either by CATS Simulation Main, using a discrete-event simulator with a deterministic simulation scheduler, a network model, and an experiment scenario, or by CATS Stress Testing Main, using a generic orchestrator with a multi-core scheduler, a network model, and an experiment scenario.)

51 Example Experiment Scenario 51

52 Reconfiguration Protocols Testing and Debugging
Use whole-system repeatable simulation.
– Protocol correctness testing: each experiment scenario is a unit test
– Regression test suite: covered all “types” of churn scenarios; tested each scenario for 1 million RNG seeds (sketched below)
– Debugging: global state snapshot on every change; traverse state snapshots forward and backward in time
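The seed-sweeping regression idea mentioned above looks roughly like the loop below: the same scenario is replayed under a deterministic simulation for each RNG seed and an invariant check is asserted every time. This is an illustrative Java sketch only; the simulation call is a stand-in, not the Kompics simulation API.

```java
import java.util.Random;
import java.util.function.LongPredicate;

// Sketch of seed-sweeping a deterministic simulation: one churn scenario is
// replayed under many RNG seeds and a correctness invariant is checked each time.
class ChurnScenarioRegressionTest {

    // Stand-in for "run the whole-system simulation of one churn scenario
    // with this seed and report whether the invariants held".
    static boolean runDeterministicSimulation(long seed) {
        Random rng = new Random(seed);   // all nondeterminism is drawn from this RNG
        // ... drive joins, failures, and operations from 'rng'; check invariants ...
        return rng.nextDouble() >= 0.0;  // placeholder: a real run reports the checks
    }

    static void sweepSeeds(long seeds, LongPredicate simulation) {
        for (long seed = 0; seed < seeds; seed++) {
            if (!simulation.test(seed)) {
                throw new AssertionError("invariant violated for seed " + seed);
            }
        }
    }

    public static void main(String[] args) {
        // The thesis reports sweeping 1,000,000 seeds per scenario.
        sweepSeeds(1_000_000, ChurnScenarioRegressionTest::runDeterministicSimulation);
        System.out.println("all seeds passed");
    }
}
```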

53 Global State Snapshot: 25 joined

54 Snapshot During Reconfiguration

55 Reconfiguration Completed OK. Distributed Systems Debugging Done Right!

56 CATS Architecture for Distributed Production Deployment 56

57 Demo: SICS Cluster Deployment

58

59 (Screenshots: an interactive Put operation and an interactive Get operation.)

60 CATS Architecture for Production Deployment and Performance Evaluation. (Diagram: Bootstrap Server Main hosts the CATS Bootstrap Server with a Jetty Web Server; CATS Client Main hosts the YCSB Benchmark and the CATS Client over the Distributed Hash Table port; CATS Peer Main hosts the CATS Node with a Jetty Web Server application; all three use Grizzly Network and MyTimer for the Network and Timer ports, plus Web ports.)

61 Experimental Setup
– 128 Rackspace Cloud virtual machines: 16 GB RAM, 4 virtual cores; 1 client for every 3 servers
– Yahoo! Cloud Serving Benchmark (YCSB): read-intensive workload (95% reads, 5% writes) and write-intensive workload (50% reads, 50% writes)
– CATS nodes equally distanced on the ring to avoid load imbalance

62 Performance (50% reads, 50% writes) 62

63 Performance (95% reads, 5% writes) 63

64 Scalability (50% reads, 50% writes) 64

65 Scalability (95% reads, 5% writes) 65

66 Elasticity (read-only workload) * Experiment ran on SICS cloud machines [1 YCSB client, 32 threads] 66

67 Overheads (50% reads, 50% writes): 24%

68 Overheads (95% reads, 5% writes): 4%

69 CATS vs Cassandra (50% read, 50% write) 69

70 CATS vs Cassandra (95% reads, 5% writes) 70

71 Summary: self-organization, elasticity, scalability, atomic data consistency, network partition tolerance, fault tolerance, decentralization. Atomic data consistency is affordable!

72 Related work
– Dynamo [SOSP’07], Cassandra, Riak, Voldemort (key-value stores): scalable, not consistent
– Chubby [OSDI’06], ZooKeeper (meta-data stores): consistent, not scalable, not auto-reconfigurable
– RAMBO [DISC’02], RAMBO II [DSN’03], SMART [EuroSys’06], RDS [JPDC’09], DynaStore [JACM’11] (replication systems): reconfigurable, consistent, not scalable
– Scatter [SOSP’11]: a scalable and linearizable DHT, but reconfiguration needs distributed transactions

73 Kompics is practical: it has been used to build scalable key-value stores, structured overlay networks, gossip-based protocols, peer-to-peer media streaming and video-on-demand systems, and NAT-aware peer-sampling services, and for teaching (broadcast, concurrent objects, consensus, replicated state machines, etc.).

74 Related work
– Component models and ADLs (Fractal, OpenCom, ArchJava, ComponentJ, ...): blocking interface calls vs. message passing
– Protocol composition frameworks (x-Kernel, Ensemble, Horus, Appia, Bast, Live Objects, ...): static, layered vs. dynamic, hierarchical composition
– Actor models (Erlang, Kilim, Scala, Unix pipes): flat / stacked vs. hierarchical architecture
– Process calculi (π-calculus, CCS, CSP, Oz/K): synchronous vs. asynchronous message passing

75 Summary
– Message-passing, hierarchical component model facilitating concurrent programming
– Good for distributed abstractions and systems
– Multi-core hardware exploited for free
– Hot upgrades by dynamic reconfiguration
– The same code is used in production deployment, deterministic simulation, and local execution
– DSL to specify complex simulation scenarios
– Battle-tested in many distributed systems

76 Acknowledgements: Seif Haridi, Jim Dowling, Tallat M. Shafaat, Muhammad Ehsan ul Haque, Frej Drejhammar, Lars Kroll, Niklas Ekström, Alexandru Ormenian, Hamidreza Afzali

77 http://kompics.sics.se/

78 BACKUP SLIDES

79 Sequential consistency A concurrent execution is sequentially consistent if there is a sequential way to reorder the client operations such that: – (1) it respects the semantics of the objects, as defined by their sequential specification – (2) it respects the order of operations at the client that issued the operations 79

80 Linearizability A concurrent execution is linearizable if there is a sequential way to reorder the client operations such that: – (1) it respects the semantics of the objects, as defined by their sequential specification – (2) it respects the order of non-overlapping operations among all clients 80

81 Consistency: naïve solution – replicas act as a distributed shared-memory register. (Diagram: nodes 20, 30, 35, 40, 45 on the identifier ring; keys 21 – 30 are stored at replicas r1, r2, r3.)

82 The problem: asynchrony. It is impossible to accurately detect process failures.

83 50 42 52 48 successor pointer predecessor pointer 50 thinks 48 has failed 48 thinks replication group for (42, 48] is {48, 50, 52} 50 thinks replication group for (42, 48] is {50, 52, 60} PUT(45) may contact majority quorum {48, 50} GET(45) may contact majority quorum {52, 60} X Incorrect failure detection Key 45 Non- intersecting quorums 83

84 Solution: Consistent Quorums
– A consistent quorum is a quorum of nodes that are in the same view when the quorum is assembled (see the sketch below)
– Maintain a consistent view of the replication group membership: a modified Paxos using consistent quorums, essentially a reconfigurable RSM whose state is the view
– A modified ABD using consistent quorums yields a dynamic linearizable read-write register
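The defining test of a consistent quorum is that a majority of replies were all produced in one and the same view of the replication group; replies assembled across different views are rejected. An illustrative Java sketch of that check (not the CATS code; the View and Reply shapes are invented for the example):

```java
import java.util.List;

// Sketch of the "consistent quorum" test: a set of replies forms a consistent
// quorum only if it is a majority AND all replies carry the same group view.
class ConsistentQuorumCheck {
    record View(long viewId, List<String> members) {}
    record Reply(String fromNode, View view) {}

    static boolean isConsistentQuorum(List<Reply> replies, int replicationDegree) {
        if (replies.size() <= replicationDegree / 2) {
            return false;                          // not even a majority quorum
        }
        long firstView = replies.get(0).view().viewId();
        return replies.stream()
                .allMatch(r -> r.view().viewId() == firstView); // same view everywhere
    }

    public static void main(String[] args) {
        View v7 = new View(7, List.of("n48", "n50", "n52"));
        List<Reply> replies = List.of(new Reply("n48", v7), new Reply("n50", v7));
        System.out.println(isConsistentQuorum(replies, 3)); // true: majority, one view
    }
}
```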

85 Guarantees Concurrent reconfigurations are applied at every node in a total order For every replication group, any two consistent quorums always intersect – Same view, consecutive, non-consecutive views In a partially synchronous system, reconfigurations and operations terminate – once network partitions cease Consistent mapping from key-ranges to replication groups 85

