Scaling Services: Partitioning, Hashing, Key-Value Storage


1 Scaling Services: Partitioning, Hashing, Key-Value Storage
CS 240: Computing Systems and Concurrency Lecture 14 Marco Canini Credits: Michael Freedman and Kyle Jamieson developed much of the original material. Selected content adapted from B. Karp, R. Morris.

2 Horizontal or vertical scalability?
Vertical Scaling vs. Horizontal Scaling
Today we're going to talk about how we can scale services to increase their capacity to handle millions of users and billions of transactions per second, without simply building faster computers (vertical scaling). The trend since the 2000s has instead been horizontal scaling: large clusters of relatively cheap servers in datacenters.

3 Horizontal scaling is chaotic
Probability of any failure in a given period = 1 − (1 − p)^n
p = probability a machine fails in the given period; n = number of machines
For 50K machines, each with ___% availability, the data center experiences failures 16% of the time
For 100K machines, failures 30% of the time!
As we'll see today, it's possible to build really powerful systems this way, but the problem with horizontal scaling over cheap servers is the high frequency of failure. SEGUE: Today we'll see algorithms that overcome these failures.
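To make the arithmetic concrete, here is a minimal Python sketch of the formula above; the per-machine failure probability p used here is an assumed illustrative value, not a number from the slide.

# Probability that at least one of n machines fails in a given period:
# P(any failure) = 1 - (1 - p)^n
def prob_any_failure(p, n):
    return 1 - (1 - p) ** n

p = 3.5e-6   # assumed illustrative per-machine failure probability for the period
for n in (50_000, 100_000):
    print(n, round(prob_any_failure(p, n), 2))
# With p = 3.5e-6 this prints roughly 0.16 for 50K machines and 0.3 for 100K machines.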

4 Today Techniques for partitioning data Metrics for success
Case study: Amazon Dynamo key-value store
So today we'll begin with some general techniques for partitioning data, and identify what matters in their performance. Then we'll look at Amazon Dynamo, which partitions data for the company's web store. All of the techniques we've been discussing in recent lectures are going to come into play.

5 Scaling out: Partition and place
Partition management
- Including how to recover from node failure, e.g., bringing another node into the partition group
- Changes in system size, i.e., nodes joining/leaving
Data placement
- On which node(s) to place a partition?
- Maintain mapping from data object to responsible node(s)
- Centralized: cluster manager
- Decentralized: deterministic hashing and algorithms
So the main tool we'll use for scaling horizontally you've seen before in Chord, and it breaks into two subproblems: PARTITIONING and PLACING data. >>> Two general ways the placement mapping can be maintained: either a cluster manager coordinates it, or it's fully decentralized. SEGUE: So the differences between all these systems are in how they partition and how they place data.

6 Modulo hashing
Consider the problem of data partitioning: given object id X, choose one of k servers to use.
Suppose we use modulo hashing: place X on server i = hash(X) mod k.
What happens if a server fails or joins (k → k ± 1)? Or if different clients have different estimates of k?
The simplest scheme for partitioning and placement we'll talk about is modulo hashing; it used to be the preferred way of partitioning and placing data before consistent hashing was invented. SEGUE: So let's look at what happens when the number of servers changes.

7 Problem for modulo hashing: Changing number of servers
h(x) = x + 1 (mod 4). Add one machine: h(x) = x + 1 (mod 5).
All entries get remapped to new nodes! → Need to move objects over the network.
(Figure: objects with serial numbers 5, 7, 10, 11, 27, 29, 36, 38, 40 mapped to servers 1–4, then remapped after the fifth server is added.)
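A minimal Python sketch (the hash function and key set are illustrative choices) of how much data moves when a server is added under modulo placement.

import hashlib

def server_for(key, k):
    # Modulo placement: hash the object id, then take it mod the server count
    h = int(hashlib.sha1(str(key).encode()).hexdigest(), 16)
    return h % k

keys = range(10_000)
moved = sum(1 for x in keys if server_for(x, 4) != server_for(x, 5))
print(f"{moved / len(keys):.0%} of objects move when k goes from 4 to 5")
# Roughly 80% move: an object stays put only if its hash lands in the same
# residue class under both mod 4 and mod 5.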

8 Consistent hashing
Assign n tokens to random points on the mod 2^k circle; hash key size = k.
Hash each object to a random circle position; put the object in the closest clockwise bucket: successor(key) → bucket.
(Figure: a ring with example tokens/buckets at positions 4, 8, 12, 14.)
Desired features
- Balance: no bucket has "too many" objects
- Smoothness: addition/removal of a token minimizes object movement for other buckets
So with Chord we saw consistent hashing, where each server gets one token, and that token lies somewhere randomly on the circle. SEGUE: When we discussed Chord, we saw how it improves over modulo hashing with respect to smoothness. But as usual we're not quite done.
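A minimal consistent-hashing sketch in Python, assuming one token per node and a SHA-1 hash onto a 2^32 circle; the Ring class and node names are mine, not Chord's or Dynamo's actual code.

import bisect, hashlib

def h(s, bits=32):
    # Position on the mod 2^bits circle
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** bits)

class Ring:
    def __init__(self, nodes):
        self.nodes = {h(n): n for n in nodes}   # token position -> node
        self.tokens = sorted(self.nodes)        # sorted token positions

    def bucket(self, key):
        # successor(key): first token clockwise from the key's position
        i = bisect.bisect_left(self.tokens, h(key)) % len(self.tokens)
        return self.nodes[self.tokens[i]]

ring = Ring(["node-A", "node-B", "node-C"])
print(ring.bucket("cart:42"))   # e.g. 'node-B', depending on the hashes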

9 Consistent hashing’s load balancing problem
Each node owns 1/nth of the ID space in expectation Says nothing of request load per bucket If a node fails, its successor takes over bucket Smoothness goal ✔: Only localized shift, not O(n) But now successor owns two buckets: 2/nth of key space The failure has upset the load balance So consider a Chord ring of n nodes, let’s think about the impact of node failures on smoothness and balance.

10 Virtual nodes Idea: Each physical node now maintains v > 1 tokens
Each token corresponds to a virtual node. Each virtual node owns an expected 1/(vn)th of the ID space.
Upon a physical node's failure, v successors take over; each now stores (v+1)/(vn) of the ID space.
Result: better load balance with larger v.
So the Chord designers realized this, and came up with a solution to more evenly distribute load. SEGUE: So that's exactly what Dynamo does as a starting point for Amazon's system.
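A sketch of the same idea with v tokens per physical node (the helper names and values of v are assumed for illustration); with a larger v, the per-node key load evens out.

import bisect, hashlib
from collections import Counter

def h(s):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** 32)

def make_tokens(nodes, v):
    # v tokens per physical node; token "A#3" still maps back to physical node A
    return sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(v))

def bucket(tokens, key):
    i = bisect.bisect_left(tokens, (h(key), "")) % len(tokens)
    return tokens[i][1]

for v in (1, 32):
    tokens = make_tokens(["A", "B", "C", "D"], v)
    load = Counter(bucket(tokens, f"key{k}") for k in range(20_000))
    print(v, load)   # the spread across physical nodes tightens as v grows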

11 Today Techniques for partitioning data
Case study: the Amazon Dynamo key-value store

12 Dynamo: The P2P context Chord and DHash intended for wide-area P2P systems Individual nodes at Internet’s edge, file sharing Central challenges: low-latency key lookup with small forwarding state per node Techniques: Consistent hashing to map keys to nodes Replication at successors for availability under failure So to give you context for Dynamo, let's think back to P2P. Chord and Dhash intended for much more chaotic, wide-area, P2P systems like file sharing. So setting was very different than a centrally managed datacenter. SEGUE: Yet they introduced two very powerful techniques. These techniques worked in these very chaotic systems (users stealing files on the internet).

13 Amazon’s workload (in 2007)
Tens of thousands of servers in globally-distributed data centers Peak load: Tens of millions of customers Tiered service-oriented architecture Stateless web page rendering servers, atop Stateless aggregator servers, atop Stateful data stores (e.g. Dynamo) put( ), get( ): values “usually less than 1 MB” Today we’ll look how these same techniques apply to settings like this. SEGUE: So it's a much more managed, engineered environment.

14 How does Amazon use Dynamo?
Shopping cart; session info; maybe "recently visited products", etc.?
Product list: mostly read-only, replicated for high read throughput.
So how does Amazon actually use Dynamo? One of the most important parts of their business is the shopping cart. They want to make it fast and available so that they make money; every millisecond of user response time matters here.

15 Dynamo requirements
- Highly available writes despite failures: despite disks failing, network routes flapping, "data centers destroyed by tornadoes"; always respond quickly, even during failures → replication
- Low request-response latency: focus on the 99.9th-percentile SLA
- Incrementally scalable as servers grow to workload: adding "nodes" should be seamless
- Comprehensible conflict resolution: high availability in the above sense implies conflicts
- Non-requirement: security, viz. authentication and authorization (used in a non-hostile environment)
The first requirement is availability. Availability is measured in the success of users' operations: "rejecting customer updates could result in poor customer experience" [shopping cart]. Latency matters because higher latency means a lower "conversion rate" on purchases. Each service in the call chain must meet its performance contract.

16 Design questions How is data placed and replicated?
How are requests routed and handled in a replicated system? How to cope with temporary and permanent node failures? So going a level deeper, the previous requirements are going to force us to make decisions about these parts of the system.

17 Dynamo’s system interface
Basic interface is a key-value store: get(k) and put(k, v); keys and values are opaque to Dynamo.
get(key) → value, context: returns one value or multiple conflicting values; the context describes the version(s) of the value(s).
put(key, context, value) → "OK": the context indicates which versions this version supersedes or merges.
The API that the aggregator servers see looking at Dynamo is a basic key-value store. Keys and values are opaque to Dynamo, but there is a wrinkle to support consistency.
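A toy, hypothetical sketch of that interface shape in Python (the class and method names are mine, not Amazon's client library); it only illustrates the get/put signatures and the role of the context.

class ToyDynamo:
    def __init__(self):
        self.store = {}   # key -> list of (context, value) versions

    def get(self, key):
        versions = self.store.get(key, [])
        values = [v for _, v in versions]    # one value, or several conflicting ones
        context = [c for c, _ in versions]   # opaque version info handed back to the client
        return values, context

    def put(self, key, context, value):
        # The context tells the store which versions this write supersedes or merges;
        # this toy simply replaces whatever it was told about.
        self.store[key] = [(context, value)]
        return "OK"

db = ToyDynamo()
db.put(b"cart:42", None, b"v1")
print(db.get(b"cart:42"))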

18 Key trade-offs: Response time vs. consistency vs. durability
Dynamo's techniques:
- Place replicated data on nodes with consistent hashing
- Maintain consistency of replicated data with vector clocks
- Eventual consistency for replicated data: prioritize success and low latency of writes over reads, and availability over consistency (unlike DBs)
- Efficiently synchronize replicas using Merkle trees
So at a high level, this is what Dynamo does. All of these techniques were known at the time this work was done, but it was a novel combination for its time atop a unique workload. So we get to see how these techniques really work at scale, and what Amazon had to do to make them work with this big, real workload.

19 Data placement
(Figure: ring of virtual nodes; the coordinator node for key K says "put(K, …), get(K) requests go to me.")
Each data item is replicated at N virtual nodes (e.g., N = 3).
So let's now get into the details of how Dynamo works, starting with how data is placed at nodes. First we have the concept of a coordinator node… Remember, these nodes are virtual. SEGUE: Each data item is replicated at some number N of other nodes.

20 Data replication
Much like in Chord: a key-value pair → the key's N successors (its preference list).
The coordinator receives a put for some key, then replicates the data onto nodes in the key's preference list.
Preference list size > N, to account for node failures.
For robustness, the preference list skips tokens to ensure distinct physical nodes (see the sketch below).
So replication in Dynamo is similar to DHash replication, with the data replicated at the N successor tokens of a key. >>> Two modifications improve robustness to failure: the preference list is larger than N, giving Dynamo more options when there's a failure (we'll see this in a second), and it skips tokens so the replicas land on distinct physical nodes.
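A sketch of building such a preference list on a token ring, walking clockwise and skipping tokens so that the list contains N distinct physical nodes (the helper names and hash are assumptions, not Dynamo's code).

import bisect, hashlib

def h(s):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** 32)

def preference_list(tokens, key, n):
    # tokens: sorted list of (position, physical_node); start at successor(key)
    # and skip tokens whose physical node has already been chosen
    start = bisect.bisect_left(tokens, (h(key), ""))
    chosen, seen = [], set()
    for j in range(len(tokens)):
        node = tokens[(start + j) % len(tokens)][1]
        if node not in seen:
            seen.add(node)
            chosen.append(node)
        if len(chosen) == n:
            break
    return chosen

tokens = sorted((h(f"{n}#{i}"), n) for n in "ABCDE" for i in range(8))
print(preference_list(tokens, "cart:42", 3))   # e.g. ['C', 'A', 'E'], depending on hashes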

21 Gossip and “lookup” Gossip: Once per second, each node contacts a randomly chosen other node They exchange their lists of known nodes (including virtual node IDs) Each node learns which others handle all key ranges Result: All nodes can send directly to any key’s coordinator (“zero-hop DHT”) Reduces variability in response times So how do nodes learn where a given key should go?

22 Partitions force a choice between availability and consistency
Suppose three replicas are partitioned into two and one If one replica fixed as master, no client in other partition can write In Paxos-based primary-backup, no client in the partition of one can write Traditional distributed databases emphasize consistency over availability when there are partitions So let’s see the impact of failures, and in particular network partitions that might for example be caused by the failure of a top-of-rack switch in the datacenter. Partition of one doesn’t have a majority.

23 Alternative: Eventual consistency
Dynamo emphasizes availability over consistency when there are partitions: tell the client the write is complete when only some replicas have stored it, and propagate to the other replicas in the background.
This allows writes in both partitions… but risks returning stale data, and write conflicts when the partition heals (e.g., put(k, v0) in one partition and put(k, v1) in the other).
Just like we saw with Bayou, Dynamo emphasizes availability over consistency. Therefore it tells the client the write is complete when only some of the replicas on the preference list (NOT ALL) have stored it.

24 Mechanism: Sloppy quorums
If no failure, reap consistency benefits of single master Else sacrifice consistency to allow progress Dynamo tries to store all values put() under a key on first N live nodes of coordinator’s preference list BUT to speed up get() and put(): Coordinator returns “success” for put when W < N replicas have completed write Coordinator returns “success” for get when R < N replicas have completed read The strategy that Dynamo uses for consistency is called SLOPPY QUORUMS. The goal is that if there is no failure…
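A minimal sketch of a sloppy-quorum write path, assuming a hypothetical replica_put RPC stub; real Dynamo sends the replica writes in parallel rather than one at a time.

import random

def replica_put(node, key, value):
    # Hypothetical RPC stub: pretend ~10% of replicas are unreachable
    if random.random() < 0.1:
        raise TimeoutError(node)

def coordinator_put(replicas, key, value, w):
    # Sloppy-quorum write path (sketch): report success after W acks, not N.
    # 'replicas' stands in for the first N live nodes of the preference list.
    acks = 0
    for node in replicas:
        try:
            replica_put(node, key, value)
            acks += 1
        except TimeoutError:
            continue                 # tolerate failed or slow replicas
        if acks >= w:
            return "OK"
    return "FAIL"

print(coordinator_put(["A", "B", "C"], b"cart:42", b"v1", w=2))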

25 Sloppy quorums: Hinted handoff
Suppose coordinator doesn’t receive W replies when replicating a put() Could return failure, but remember goal of high availability for writes… Hinted handoff: Coordinator tries next successors in preference list (beyond first N) if necessary Indicates the intended replica node to recipient Recipient will periodically try to forward to the intended replica node But then Dynamo takes this a step further with another mechanism called HINTED HANDOFF. >>> Then when it goes to the recipient it leaves a POINTER to the INTENDED node at the recipient.
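A minimal sketch of the hinted-handoff bookkeeping (the data structures and the send() stub are assumptions, not Dynamo's actual code): the stand-in node records which replica the data was intended for and periodically retries the handoff.

hints = []   # (intended_node, key, value) records held by the stand-in replica

def store_with_hint(local_store, intended_node, key, value):
    # The stand-in node stores the data plus a pointer to the intended replica
    local_store[key] = value
    hints.append((intended_node, key, value))

def retry_hints(send):
    # Periodically try to hand the data back to the intended replica;
    # send() is a hypothetical RPC that returns True once the node is reachable
    for intended, key, value in list(hints):
        if send(intended, key, value):
            hints.remove((intended, key, value))

store_e = {}                                        # node E's local store
store_with_hint(store_e, "node-C", b"K", b"v1")     # C was down, so E holds the replica
retry_hints(lambda node, k, v: True)                # pretend C is back up
print(store_e, hints)                               # hint list is empty after handoff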

26 Hinted handoff: Example
Suppose C fails Node E is in preference list Needs to receive replica of the data Hinted Handoff: replica at E points to node C When C comes back E forwards the replicated data back to C Key K Coordinator So for example, the coordinator of KEY K is NODE B. B stores the write also at node E and includes a pointer to NODE C.

27 Wide-area replication
Last ¶,§4.6: Preference lists always contain nodes from more than one data center Consequence: Data likely to survive failure of entire data center Blocking on writes to a remote data center would incur unacceptably high latency Compromise: W < N, eventual consistency And finally, recall that they are obsessed with a tornado wiping out one of their data centers. >>> But that means that they will have writes to a remote data center. So they don’t block on those. SO GUARANTEE ISN’T IMMEDIATE.

28 Sloppy quorums and get()s
Suppose coordinator doesn’t receive R replies when processing a get() Penultimate ¶,§4.5: “R is the min. number of nodes that must participate in a successful read operation.” Sounds like these get()s fail Why not return whatever data was found, though? As we will see, consistency not guaranteed anyway…

29 Sloppy quorums and freshness
Common case given in the paper: N = 3, R = W = 2.
With these values, do sloppy quorums guarantee that a get() sees all prior put()s?
If there are no failures, yes: two replicas stored each put() and two replicas responded to each get(), so the write and read quorums must overlap!
So let's see why consistency isn't guaranteed, and it's because of the sloppy quorums. >>> First, the case with no failures.
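A tiny brute-force check (illustrative only) of why R + W > N forces every read quorum to overlap every write quorum.

from itertools import combinations

N, R, W = 3, 2, 2
nodes = range(N)
overlap_always = all(set(ws) & set(rs)
                     for ws in combinations(nodes, W)
                     for rs in combinations(nodes, R))
print(overlap_always)   # True: with R + W > N, every read quorum meets every write quorum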

30 Sloppy quorums and freshness
Common case given in paper: N = 3, R = W = 2 With these values, do sloppy quorums guarantee a get() sees all prior put()s? With node failures, no: Two nodes in preference list go down put() replicated outside preference list Two nodes in preference list come back up get() occurs before they receive prior put() Now with node failures, consider the following sequence of events. SEGUE: So sloppy quorums increase probability get()s observe recent put()s, but no guarantee

31 Conflicts Suppose N = 3, W = R = 2, nodes are named A, B, C
1st put(k, …) completes on A and B 2nd put(k, …) completes on B and C Now get(k) arrives, completes first at A and C Conflicting results from A and C Each has seen a different put(k, …) Dynamo returns both results; what does client do now? So now let's see how conflicts can arise.

32 Conflicts vs. applications
Shopping cart: Could take union of two shopping carts What if second put() was result of user deleting item from cart stored in first put()? Result: “resurrection” of deleted item Can we do better? Can Dynamo resolve cases when multiple values are found? Sometimes. If it can’t, application must do so.

33 Version vectors (vector clocks)
Version vector: List of (coordinator node, counter) pairs e.g., [(A, 1), (B, 3), …] Dynamo stores a version vector with each stored key-value pair Idea: track “ancestor-descendant” relationship between different versions of data stored under the same key k One way that Dynamo resolves conflicts is with a familiar concept to us -- vector clocks. Their terminology for these is VERSION VECTORS.

34 Version vectors: Dynamo’s mechanism
Rule: If vector clock comparison of v1 < v2, then the first is an ancestor of the second – Dynamo can forget v1 Each time a put() occurs, Dynamo increments the counter in the V.V. for the coordinator node Each time a get() occurs, Dynamo returns the V.V. for the value(s) returned (in the “context”) Then users must supply that context to put()s that modify the same key So given two versions of an object v1 and v2, each with a VV, if Dynamo can make the comparison of the vector clocks then the LATER version is more up to date and it can FORGET the EARLIER version.
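A sketch of the ancestor/descendant test on version vectors, here represented as node-to-counter dicts (the representation and function names are mine, not Dynamo's).

def descends(a, b):
    # True if version vector a descends from (or equals) b:
    # a's counter is >= b's for every node appearing in b
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(b, a):
        return "a is an ancestor of b: keep b, forget a"
    if descends(a, b):
        return "b is an ancestor of a: keep a, forget b"
    return "concurrent: return both, client must reconcile"

v1 = {"A": 1}
v2 = {"A": 1, "C": 1}
v3 = {"A": 1, "B": 1}
print(compare(v1, v2))   # v1 is an ancestor of v2, so v1 can be forgotten
print(compare(v2, v3))   # concurrent versions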

35 Version vectors (auto-resolving case)
v1 [(A,1)]: put handled by node A
v2 [(A,1), (C,1)]: put handled by node C
v2 > v1, so Dynamo nodes automatically drop v1 in favor of v2.
So in this example, a client makes a put that node A handles. >>> Then it makes another put updating the same key, which node C handles; this version is stamped [(A,1), (C,1)]. >>> The vector clock comparison holds, so Dynamo nodes automatically drop v1.

36 Version vectors (app-resolving case)
v1 [(A,1)]: put handled by node A
v2 [(A,1), (B,1)]: put handled by node B
v3 [(A,1), (C,1)]: put handled by node C
v2 || v3, so a client must perform semantic reconciliation.
Client reads v2 and v3; context: [(A,1), (B,1), (C,1)]
v4 [(A,2), (B,1), (C,1)]: client reconciles v2 and v3; node A handles the put
So suppose now a client makes two updates to an object, which nodes A then B handle. >>> Then a different client reads and updates v1, and server C handles that put. >>> Now a client reads both v2 and v3 and gets a write context that summarizes both versions. >>> Then, when the client reconciles the versions, node A handles the put, which the version vector reflects.

37 Trimming version vectors
Many nodes may process a series of put()s to the same key, so version vectors may get long – do they grow forever?
No, there is a clock truncation scheme: Dynamo stores a time of modification with each V.V. entry, and when the V.V. grows beyond 10 entries, it drops the entry for the node that least recently processed that key.

38 Impact of deleting a VV entry?
v1 [(A,1)]: put handled by node A
v2 [(A,1), (C,1)]: put handled by node C
After node A's entry is trimmed, v2 || v1, so it looks like application resolution is required.
So in this example, we forget the entry for node A, since it least recently processed a put to this key. Forgetting the oldest entry is clever, since that's the element most likely to be present in other versions; forgetting the newest might erase evidence of a recent update that hasn't yet made it to other versions at other nodes.

39 Concurrent writes What if two clients concurrently write w/o failure?
e.g. add different items to same cart at same time Each does get-modify-put They both see the same initial version And they both send put() to same coordinator Will coordinator create two versions with conflicting VVs? We want that outcome, otherwise one was thrown away Paper doesn't say, but coordinator could detect problem via put() context We want that outcome b/c this is a legitimate conflict.

40 Removing threats to durability
Hinted handoff node crashes before it can replicate data to node in preference list Need another way to ensure that each key-value pair is replicated N times Mechanism: replica synchronization Nodes nearby on ring periodically gossip Compare the (k, v) pairs they hold Copy any missing keys the other has Problem: A node holding data received via “hinted handoff” crashes before it can pass data to a previously unavailable node in the key’s preference list. >>> SEGUE: So we’re seeking an efficient synchronization How to compare and copy replica state quickly and efficiently?

41 Efficient synchronization with Merkle trees
Merkle trees hierarchically summarize the key-value pairs a node holds One Merkle tree for each virtual node key range Leaf node = hash of one key’s value Internal node = hash of concatenation of children Compare roots; if match, values match If they don’t match, compare children Iterate this process down the tree
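A small Merkle-tree sketch, assuming a fixed, power-of-two number of key slots per range; hashes are compared from the root down and matching subtrees are pruned.

import hashlib

def H(b):
    return hashlib.sha256(b).digest()

def build(values):
    # values: one value per key slot in a fixed range (must be a power of two)
    level = [H(v) for v in values]            # leaves: hash of each key's value
    tree = [level]
    while len(level) > 1:
        level = [H(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)                    # internal node: hash of concatenated children
    return tree                               # tree[-1][0] is the root

def diff(ta, tb, level=None, idx=0):
    # Walk both trees from the root, descending only where hashes differ
    if level is None:
        level = len(ta) - 1
    if ta[level][idx] == tb[level][idx]:
        return []                             # subtrees match: prune
    if level == 0:
        return [idx]                          # differing leaf -> differing key slot
    return diff(ta, tb, level - 1, 2 * idx) + diff(ta, tb, level - 1, 2 * idx + 1)

a = [b"v0", b"v1", b"v2", b"v3"]
b = [b"v0", b"vX", b"v2", b"v3"]
print(diff(build(a), build(b)))               # [1]: only key slot 1 differs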

42 Merkle tree reconciliation
B is missing the orange key; A is missing the green one.
Exchange and compare hash nodes from the root downwards, pruning when hashes match.
(Figure: A's and B's Merkle trees over the key range [0, 2^128), each split into children [0, 2^127) and [2^127, 2^128).)
Suppose virtual node A and virtual node B are syncing. This finds the keys that differ quickly and with minimum information exchange, and it also reduces the number of disk reads needed.

43 How useful is it to vary N, R, W?
N, R, W → Behavior
3, 2, 2 → Parameters from the paper: good durability, good R/W latency
3, 3, 1 → Slow reads, weak durability, fast writes
3, 1, 3 → Slow writes, strong durability, fast reads
3, 3, 3 → More likely that reads see all prior writes?
3, 1, 1 → Read quorum doesn't overlap write quorum
They claim a main advantage of Dynamo is flexibility in configuring N, R, and W, so let's discuss the impact of changing each of those parameters. 3, 3, 1: will need to wait longer for reads, and although writes are fast, durability for writes is weaker. 3, 1, 3: waiting for a write quorum of 3 makes writes SLOW – a showstopper. 3, 3, 3: the downside would be a latency increase. 3, 1, 1 (and others): NO, because even in the common case of no failures there is a LIKELIHOOD that a READ doesn't see an EARLIER WRITE!

44 Evolution of partitioning and placement
Strategy 1: Chord + virtual nodes partitioning and placement New nodes “steal” key ranges from other nodes Scan of data store from “donor” node took a day Burdensome recalculation of Merkle trees on join/leave First strategy is as described before (called Strategy 1 in paper): Chord + virtual nodes partitioning & placement. >>> When a new node joins, needs to steal keys from others; other node must SCAN its store to retrieve the right key-value pairs BUT scan of key store took too long, so JOIN took too long. >>> Second, when a node joins OR leaves, key ranges change so entire Merkle trees need to be recalculated, adding to the join/leave time.

45 Evolution of partitioning and placement
Strategy 2: Fixed-size partitions, random token placement Q partitions: fixed and equally sized Placement: T virtual nodes per physical node (random tokens) Place the partition on first N nodes after its end So that led to what the paper calls Strategy 2, where instead of letting partitions be of a random length, they fixed the size of partitions. >>> Still choose tokens randomly and follow a similar rule for placing partitions on virtual nodes. SEGUE: Turns out the combo of fixed partitions and random tokens is worst for LOAD BALANCING, so…

46 Evolution of partitioning and placement
Strategy 3: Fixed-size partitions, equal tokens per node
Q partitions: fixed and equally sized; S total nodes in the system
Placement: Q/S tokens per node
Their production strategy (Strategy 3) allocates an equal number of tokens to each node, achieving the best load balancing. SEGUE: So it's much more structured than Chord, but it still uses the consistent hashing idea for the smoothness property.
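A rough sketch of fixed, equal-sized partitions with Q/S tokens per node; the random partition-to-node assignment here is a simplification of Dynamo's actual token assignment, and only the primary owner is shown (replication onto the next N nodes is omitted).

import hashlib, random

Q = 16                                   # fixed number of equal-sized partitions
nodes = ["A", "B", "C", "D"]             # S = 4 physical nodes
SPACE = 2 ** 32

def partition(key):
    # Partition boundaries are fixed: partition i covers [i*SPACE/Q, (i+1)*SPACE/Q)
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16) % SPACE
    return h * Q // SPACE

# Q/S partitions (tokens) per node: hand out the Q partitions evenly, in random order
parts = list(range(Q))
random.shuffle(parts)
owner = {p: nodes[i % len(nodes)] for i, p in enumerate(parts)}

key = "cart:42"
print(partition(key), owner[partition(key)])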

47 Dynamo: Take-away ideas
Consistent hashing is broadly useful for replication, not only in P2P systems.
Extreme emphasis on availability and low latency, unusually at the cost of some inconsistency.
Eventual consistency lets writes and reads return quickly, even during partitions and failures.
Version vectors allow some conflicts to be resolved automatically; others are left to the application.

48 Next topic: Strong consistency and CAP Theorem

