AMAZON’S KEY-VALUE STORE: DYNAMO DeCandia, Hastorun, Jampani, Kakulapati, Lakshman, Pilchin, Sivasubramanian, Vosshall, Vogels: Dynamo: Amazon's highly available.


1 AMAZON’S KEY-VALUE STORE: DYNAMO
DeCandia, Hastorun, Jampani, Kakulapati, Lakshman, Pilchin, Sivasubramanian, Vosshall, Vogels: Dynamo: Amazon's highly available key-value store. SOSP 2007.
UCSB CS271. Adapted from Amazon’s Dynamo Presentation.

2 Motivation
– Reliability at a massive scale
– Slightest outage → significant financial consequences
– High write availability
– Amazon’s platform: tens of thousands of servers and network components, geographically dispersed
– Provide persistent storage in spite of failures
– Sacrifice consistency to achieve performance, reliability, and scalability

3 Dynamo Design rationale
Most services need only key-based access:
– Best-seller lists, shopping carts, customer preferences, session management, sales rank, product catalog, and so on.
The prevalent application design, based on RDBMS technology, would be catastrophic.
Dynamo therefore provides a primary-key-only interface.

4 Dynamo Design Overview
– Data partitioning using consistent hashing
– Data replication
– Consistency via version vectors
– Replica synchronization via quorum protocol
– Gossip-based failure-detection and membership protocol

5 System Requirements
Data & Query Model:
– Read/write operations via primary key
– No relational schema: use objects
– Object size < 1 MB, typically
Consistency guarantees:
– Weak
– Only single-key updates
– Not clear whether read-modify-write operations are isolated
Efficiency:
– SLAs stated at the 99.9th percentile of operations
Notes:
– Commodity hardware
– Minimal security measures, since the system is for internal use

6 Service Level Agreements (SLA)
For an application to deliver its functionality in a bounded time, every dependency in the platform needs to deliver its functionality with even tighter bounds.
Example SLA: a service guarantees that it will respond within 300 ms for 99.9% of its requests at a peak client load of 500 requests per second.
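A minimal sketch of how the example SLA could be checked against measured latencies. The function names, the nearest-rank percentile definition, and the sample data are assumptions made for illustration; they are not taken from the slides.

```python
import math
from fractions import Fraction

def percentile(latencies_ms, pct):
    """Nearest-rank percentile: the smallest sample that pct% of samples do not exceed."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Fraction avoids floating-point rounding when computing the 1-based rank.
    rank = math.ceil(Fraction(str(pct)) * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def meets_sla(latencies_ms, pct=99.9, bound_ms=300):
    """True if the pct-th percentile latency is within the bound."""
    return percentile(latencies_ms, pct) <= bound_ms

# Hypothetical measurements: 10,000 requests each.
within = [20] * 9990 + [900] * 10   # 0.1% of requests are slow
beyond = [20] * 9980 + [900] * 20   # 0.2% of requests are slow
print(meets_sla(within), meets_sla(beyond))
```

Both samples have a mean latency far below 300 ms, so a mean-based check would not distinguish them; the percentile check does.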

7 System Interface
Two basic operations:
– Get(key): locates the replicas and returns the object + context (which encodes metadata, including the version)
– Put(key, context, object): writes the replicas to disk
Context: version (vector timestamp)
Hash(key) → 128-bit identifier
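The interface above can be sketched as a toy, single-replica, in-memory store. The class name `ToyStore`, the node name `"n1"`, and the choice of MD5 as the 128-bit hash are assumptions for illustration; the slide only states that the key hashes to a 128-bit identifier and that the context carries the version.

```python
import hashlib

def key_to_id(key: bytes) -> int:
    """Map a key to a 128-bit identifier (MD5 is assumed here; its digest is 128 bits)."""
    return int.from_bytes(hashlib.md5(key).digest(), "big")

class ToyStore:
    """Hypothetical single-replica store that mimics the get/put signatures."""

    def __init__(self, node_name="n1"):
        self.node_name = node_name
        self.data = {}  # 128-bit key id -> (object, context)

    def get(self, key: bytes):
        # Returns the object together with its context (here, a vector timestamp).
        return self.data.get(key_to_id(key), (None, {}))

    def put(self, key: bytes, context: dict, obj):
        # The caller passes back the context from an earlier get();
        # this node advances its own counter in that vector timestamp.
        new_context = dict(context)
        new_context[self.node_name] = new_context.get(self.node_name, 0) + 1
        self.data[key_to_id(key)] = (obj, new_context)

store = ToyStore()
store.put(b"cart:42", {}, ["book"])
cart, ctx = store.get(b"cart:42")
store.put(b"cart:42", ctx, cart + ["pen"])
```

Passing the context from `get` back into `put` is what lets the store tell which version an update was derived from.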

8 Partition Algorithm
Consistent hashing: the output range of a hash function is treated as a fixed circular space or “ring”, à la Chord.
“Virtual nodes”: each physical node can be responsible for more than one virtual node (to deal with non-uniform data and load distribution).

9 Virtual Nodes

10 Advantages of using virtual nodes
– The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.
– A real node’s load is distributed across the ring, so a hot spot is not concentrated on a single node.
– If a node becomes unavailable, the load handled by this node is evenly dispersed across the remaining available nodes.
– When a node becomes available again, it accepts a roughly equivalent amount of load from each of the other available nodes.
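A rough sketch of a consistent-hashing ring with virtual nodes. The `Ring` class, the `"node#i"` token labels, the MD5-based positions, and the per-node virtual-node counts are all assumptions chosen for illustration, not Dynamo's actual implementation.

```python
import bisect
import hashlib

def ring_position(label: str) -> int:
    """Position on the ring: a 128-bit integer derived from the label (MD5 assumed)."""
    return int.from_bytes(hashlib.md5(label.encode()).digest(), "big")

class Ring:
    def __init__(self):
        self.tokens = []  # sorted list of (position, physical_node)

    def add_node(self, node: str, virtual_nodes: int):
        # A node with more capacity can be given more virtual nodes.
        for i in range(virtual_nodes):
            bisect.insort(self.tokens, (ring_position(f"{node}#{i}"), node))

    def remove_node(self, node: str):
        self.tokens = [t for t in self.tokens if t[1] != node]

    def owner(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's position,
        # wrapping around at the end of the ring.
        i = bisect.bisect_left(self.tokens, (ring_position(key),))
        return self.tokens[i % len(self.tokens)][1]

ring = Ring()
ring.add_node("a", 8)
ring.add_node("b", 8)
ring.add_node("c", 16)  # assumed to have twice the capacity of a or b

keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}
ring.remove_node("b")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Only keys that were owned by the removed node change owner; because that node's virtual nodes are scattered around the ring, its keys are picked up by several successors rather than by a single neighbor.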

11 Replication
Each data item is replicated at N hosts.
Preference list: the list of nodes that are responsible for storing a particular key.
Some fine-tuning is needed to account for virtual nodes.

12 Replication

13 Replication

14 Preference Lists
– List of nodes responsible for storing a particular key.
– To account for failures, the preference list contains more than N nodes.
– Because of virtual nodes, the preference list skips positions to ensure that it contains only distinct physical nodes.
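One way to picture the position-skipping rule is the sketch below. The function name, the token layout, and the small integer ring positions are hypothetical; the ring is given as a sorted list of (position, physical node) pairs.

```python
import bisect

def preference_list(tokens, key_position, count):
    """Walk clockwise from the key and collect `count` distinct physical nodes.

    tokens: sorted list of (position, physical_node) virtual-node entries.
    """
    start = bisect.bisect_left(tokens, (key_position,))
    chosen = []
    for step in range(len(tokens)):
        node = tokens[(start + step) % len(tokens)][1]
        if node in chosen:
            continue  # another virtual node of a host we already picked: skip it
        chosen.append(node)
        if len(chosen) == count:
            break
    return chosen

# Hypothetical ring: physical nodes A and B each own two virtual nodes.
tokens = [(10, "A"), (20, "A"), (30, "B"), (40, "C"), (50, "B"), (60, "D")]
print(preference_list(tokens, 5, 3))   # walks past the second A token
print(preference_list(tokens, 45, 3))  # wraps around the end of the ring
```

Asking for more than N entries (for example `count = N + 1`) gives the extra nodes that the list holds to cover failures.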

15 Data Versioning
– A put() call may return to its caller before the update has been applied at all the replicas.
– A get() call may return many versions of the same object.
Challenge: an object may have distinct versions.
Solution: use vector clocks to capture causality between different versions of the same object.

16 Vector Clock
– A vector clock is a list of (node, counter) pairs.
– Every version of every object is associated with one vector clock.
– If all the counters in the first object’s clock are less than or equal to the corresponding counters in the second clock, then the first is an ancestor of the second and can be forgotten.
– The application reconciles divergent versions and collapses them into a single new version.
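The ancestor rule can be sketched with clocks represented as dictionaries from node name to counter. The function names and the sample clocks (nodes `Sx`, `Sy`, `Sz`) are illustrative assumptions; a missing entry is treated as a counter of 0.

```python
def descends(a, b):
    """True if clock a is at least as new as clock b on every node in b."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def is_ancestor(a, b):
    """True if a is a strict ancestor of b, so a can be forgotten."""
    return descends(b, a) and not descends(a, b)

def drop_ancestors(versions):
    """Keep only the versions that no other version supersedes."""
    return [
        (clock, value)
        for clock, value in versions
        if not any(is_ancestor(clock, other) for other, _ in versions)
    ]

d2 = {"Sx": 2}
d3 = {"Sx": 2, "Sy": 1}           # written via Sy on top of d2
d4 = {"Sx": 2, "Sz": 1}           # written via Sz on top of d2, concurrently with d3
d5 = {"Sx": 3, "Sy": 1, "Sz": 1}  # reconciled by the application, written via Sx

remaining = drop_ancestors([(d2, "v2"), (d3, "v3"), (d4, "v4")])
```

Here `d2` is dropped because both `d3` and `d4` descend from it, while `d3` and `d4` both survive: neither is an ancestor of the other, so the application has to reconcile them, producing something like `d5`.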

17 Vector clock example

18 Routing requests
Two options for routing a request:
– Route it through a generic load balancer that will select a node based on load information.
– Use a partition-aware client library that routes requests directly to the relevant node.
A gossip protocol propagates membership changes: each node contacts a peer chosen at random every second, and the two nodes reconcile their membership change histories.
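A simplified sketch of gossip-based membership reconciliation. Representing each node's view as a map from member name to a (version, status) pair, and letting the higher version win, is an assumption made for illustration; the slides do not specify the format of the membership change history.

```python
import random

def reconcile(view_a, view_b):
    """Merge two membership views in place; the newest entry per member wins."""
    for member in set(view_a) | set(view_b):
        newest = max(
            view_a.get(member, (0, "unknown")),
            view_b.get(member, (0, "unknown")),
        )
        view_a[member] = newest
        view_b[member] = newest

def gossip_round(views, rng):
    """Each node contacts one randomly chosen peer and reconciles with it."""
    names = list(views)
    for name in names:
        peer = rng.choice([p for p in names if p != name])
        reconcile(views[name], views[peer])

# Hypothetical cluster: only node B has heard that C joined and that D left.
views = {
    "A": {"A": (1, "up"), "B": (1, "up"), "D": (1, "up")},
    "B": {"A": (1, "up"), "B": (1, "up"), "C": (1, "up"), "D": (2, "down")},
    "C": {"C": (1, "up"), "B": (1, "up")},
}
rng = random.Random(0)  # seeded so that a run is repeatable
for _ in range(5):      # one round stands in for one second
    gossip_round(views, rng)
```

Information spreads without any central coordinator: every exchange leaves the two participants with identical views, and repeated random exchanges carry each change to the rest of the cluster.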

19 Sloppy Quorum
R and W are the minimum numbers of nodes that must participate in a successful read and write operation, respectively.
Setting R + W > N yields a quorum-like system.
In this model, the latency of a get (or put) operation is dictated by the slowest of the R (or W) replicas. For this reason, R and W are usually configured to be less than N, to provide better latency and availability.
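The two claims above can be checked with a small sketch. The function names are hypothetical, and the latency model assumes the coordinator contacts all N replicas and returns as soon as the first R (or W) have answered.

```python
from itertools import combinations

def quorums_always_overlap(n, r, w):
    """Brute force: does every read set of size r share a replica with every write set of size w?"""
    replicas = range(n)
    return all(
        set(reads) & set(writes)
        for reads in combinations(replicas, r)
        for writes in combinations(replicas, w)
    )

def operation_latency(replica_latencies_ms, quorum):
    """Latency of the slowest among the `quorum` fastest responders."""
    return sorted(replica_latencies_ms)[quorum - 1]

latencies = [5, 120, 8]  # hypothetical response times of N = 3 replicas, in ms
print(quorums_always_overlap(3, 2, 2))  # R + W > N
print(quorums_always_overlap(3, 1, 2))  # R + W = N
print(operation_latency(latencies, 2), operation_latency(latencies, 3))
```

With R = 2 the one slow replica does not affect the read, whereas R = N = 3 forces every read to wait for it.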

20 Highlights of Dynamo
– High write availability
– Optimistic: vector clocks for resolution
– Consistent hashing (Chord) in a controlled environment
– Quorums for relaxed consistency

21 CASSANDRA (FACEBOOK)
Lakshman and Malik: Cassandra - A Decentralized Structured Storage System. LADIS 2009

22 Data Model
Key-value store, more like Bigtable: basically, a distributed multi-dimensional map indexed by a key.
The value is structured into Columns, which are grouped into Column Families: simple and super (a column family within a column family).
An operation is atomic on a single row.
API: insert, get and delete.

23 System Architecture
Like Dynamo (and Chord).
Uses an order-preserving hash function on a fixed circular space.
The node responsible for a key is called the coordinator.
Non-uniform data distribution: keep track of the data distribution and reorganize if necessary.

24 Replication
Each item is replicated at N hosts.
Replica placement can be: Rack Unaware; Rack Aware (within a data center); Datacenter Aware.
The system has an elected leader. When a node joins the system, the leader assigns it a range of data items and replicas.
Each node is aware of every other node in the system and the range each is responsible for.

25 Membership and Failure Detection
Gossip-based mechanism to maintain cluster membership.
A node determines which nodes are up and down using a failure detector.
The Φ accrual failure detector returns a suspicion level, Φ, for each monitored node. If a node suspects A when Φ = 1, 2, or 3, then the likelihood of a mistake is 10%, 1%, or 0.1%, respectively.
Every node maintains a sliding window of the interarrival times of gossip messages from other nodes, uses it to estimate the distribution of interarrival times, and then calculates Φ. The distribution is approximated by an exponential distribution.
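Under the exponential approximation, the probability that a gap between gossip messages exceeds t is exp(-t / mean), so Φ = -log10 of that probability = t / (mean · ln 10). The sketch below follows that formula; the class name, the method names, and the window size are assumptions for illustration, not Cassandra's actual code.

```python
import math
from collections import deque

class PhiAccrualDetector:
    """Hypothetical detector for one monitored node, using the exponential approximation."""

    def __init__(self, window_size=1000):
        self.intervals = deque(maxlen=window_size)  # sliding window of interarrival times
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Record the arrival time of a gossip message from the monitored node."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Suspicion level: -log10 of P(gap > elapsed) under an exponential model."""
        if not self.intervals:
            return 0.0  # not enough history to suspect anything yet
        mean = sum(self.intervals) / len(self.intervals)
        if mean == 0:
            return math.inf
        elapsed = now - self.last_heartbeat
        return elapsed / (mean * math.log(10))

detector = PhiAccrualDetector()
for t in [0.0, 1.0, 2.0, 3.0]:  # messages arriving one second apart
    detector.heartbeat(t)

suspicion = detector.phi(3.0 + math.log(10))  # about 2.3 s of silence
mistake_probability = 10 ** -suspicion
```

With a mean interarrival time of one second, roughly 2.3 seconds of silence gives Φ ≈ 1, matching the 10% chance of a mistaken suspicion quoted above.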

26 Operations
Use quorums: R and W.
If R + W > N, then a read will return the latest value.
– Read operations return the value with the highest timestamp, so they may return older versions.
– Read Repair: with every read, send the newest version to any out-of-date replicas.
– Anti-Entropy: compute a Merkle tree to catch any out-of-sync data (expensive).
Each write goes first into a persistent commit log, then into an in-memory data structure.
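Read repair can be sketched with each replica modeled as a plain dictionary from key to a (timestamp, value) pair. The function name and the rule of contacting the first R replicas in the list are simplifying assumptions, not Cassandra's actual read path.

```python
def read_with_repair(replicas, key, r):
    """Read from the first r replicas, return the newest value, and repair stale copies.

    replicas: list of dicts mapping key -> (timestamp, value).
    """
    contacted = replicas[:r]
    found = [rep[key] for rep in contacted if key in rep]
    if not found:
        return None
    newest = max(found, key=lambda entry: entry[0])  # highest timestamp wins
    for rep in contacted:
        if rep.get(key) != newest:
            rep[key] = newest  # push the newest version to the out-of-date replica
    return newest[1]

# Hypothetical state: replica 0 is stale, replica 2 never received the key.
replicas = [
    {"user:7": (1, "old address")},
    {"user:7": (2, "new address")},
    {},
]
value = read_with_repair(replicas, "user:7", r=2)
```

After the read, replica 0 holds the newer version. Replica 2 was not contacted, so it stays out of date until a later read or an anti-entropy pass reaches it.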

