Distributed Hash Tables Chord and Dynamo Costin Raiciu, Advanced Topics in Distributed Systems 18/12/2012
Motivation: file sharing Many users want to share files online. If a file’s location is known, downloading is easy – The challenge is to find who stores the file we want. Early attempts – Napster (centralized), Kazaa – Gnutella (March 2000): completely decentralized
How should we fix Gnutella’s problems? Decouple storage from lookup – Gnutella: a node only answers queries for files it stores locally. Requirements – Extreme scalability: millions of nodes – Load balance: spread load across nodes evenly – Availability: must cope with node churn (nodes joining/leaving/failing)
Chord [Stoica et al, Sigcomm 2001] Opens a new body of research on “Distributed Hash Tables” – Together with Content Addressable Networks (also Sigcomm 2001) Most popular application: a Distributed Hash Table (DHT)
Chord basics A single fundamental operation: lookup(key) – Given a key, find the node responsible for that key How do we do this?
Consistent hashing Assign unique m-bit identifiers to both nodes and objects (e.g. files) – E.g. m=160, use SHA1 – Node identifier: hash of IP address – Object identifier: hash of name. Split key space across all servers – Not necessary to store keys for the files you have! Who is responsible for storing metadata relating to a given key?
Key assignment Identifiers are ordered in an identifier circle modulo $2^m$. Key k is assigned to the first node whose identifier is equal to or follows (the identifier of) k in the identifier space. – This node is called the successor node of k (successor(k)) – If identifiers are represented as a circle of numbers from 0 to $2^m - 1$, then successor(k) is the first node clockwise from k
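The key-assignment rule can be sketched in a few lines of Python. This is a minimal illustration, assuming the whole membership list is known locally (which a real Chord node does not have); the helper names `chord_id` and `successor` are made up for this example.

```python
import hashlib
from bisect import bisect_left

M = 160  # identifier bits; SHA-1 produces 160-bit digests

def chord_id(name):
    """Hash a string (IP address or object name) onto the circle [0, 2^M)."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** M)

def successor(node_ids, key):
    """Return the first node id equal to or following key, wrapping at the top."""
    ring = sorted(node_ids)
    i = bisect_left(ring, key)      # index of first id >= key
    return ring[i % len(ring)]      # past the last id: wrap to the smallest

# Tiny 3-bit style example with node ids 0, 1 and 3 (ids given directly).
nodes = [0, 1, 3]
print(successor(nodes, 1))  # 1
print(successor(nodes, 2))  # 3
print(successor(nodes, 6))  # 0 (wraps around the circle)
```

In practice the ids would come from `chord_id("10.0.0.1")` for a node and `chord_id("song.mp3")` for an object; the small integers above just make the wrap-around easy to see.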
Lookup Each node n maintains a routing table with (at most) m entries called the finger table. The i-th entry in the table at node n contains the identity of the first node s that succeeds n by at least $2^{i-1}$ on the circle – n.finger[i] = successor(n + $2^{i-1}$), $1 \le i \le m$ (arithmetic modulo $2^m$)
Lookup (2) Each node stores information about only a small number of other nodes (O(log N) in an N-node network). Nodes know more about nodes closely following them on the circle than about nodes farther away. Is there enough information in the finger table to find the successor of an arbitrary key?
How should we use finger pointers to guide the lookup?
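One way to answer this is the usual greedy rule: forward the query to the farthest finger that still precedes the key, and stop when the key falls between the current node and its successor. The sketch below simulates that in a single process. It assumes a global view of the ring only to build the finger tables; the lookup itself reads nothing but the fingers. The ring contents, `M = 6`, and all function names are illustrative choices, not part of the slides.

```python
from bisect import bisect_left

M = 6  # identifier bits; a tiny 0..63 circle chosen for illustration

def successor(ring, key):
    """First node id equal to or following key on the sorted ring."""
    i = bisect_left(ring, key % 2**M)
    return ring[i % len(ring)]

def in_open(x, a, b):
    """True if x lies in the circular open interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b  # interval wraps past zero (a == b: all but a)

def build_fingers(ring):
    """finger[i] = successor(n + 2^(i-1)) for i = 1..M (stored 0-based)."""
    return {n: [successor(ring, n + 2**(i - 1)) for i in range(1, M + 1)]
            for n in ring}

def find_successor(fingers, start, key):
    """Route from start using only finger tables; returns (owner, hops)."""
    n, hops = start, 0
    # Stop once key falls in (n, n.successor]; finger[0] is n's successor.
    while not (key == fingers[n][0] or in_open(key, n, fingers[n][0])):
        # Jump to the farthest finger that still precedes key.
        n = next(f for f in reversed(fingers[n]) if in_open(f, n, key))
        hops += 1
    return fingers[n][0], hops

ring = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]
fingers = build_fingers(ring)
print(fingers[8])                      # [14, 14, 14, 21, 32, 42]
print(find_successor(fingers, 8, 54))  # (56, 2): 8 -> 42 -> 51, owner 56
```

Each hop at least halves the remaining distance to the key, which is where the logarithmic hop count comes from.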
Node joins To maintain correctness, Chord maintains two invariants: – Each node’s successor is correctly maintained – For every key k, successor(k) is responsible for k
Node joins: detail Chord uses a predecessor pointer to walk counterclockwise – Maintains Chord ID and IP address of previous node – Why? When a node joins the network Chord: – Initializes the predecessor and fingers of node n; – Updates the fingers and predecessors of existing nodes to reflect the addition of n – Notifies the higher layer software so that it can transfer state associated with keys that n is now responsible for
Stabilization: Dealing with Concurrent Joins and Failures In practice Chord needs to deal with nodes joining the system concurrently and with nodes that fail or leave voluntarily. Solution: every node runs a stabilize process periodically – When n runs stabilize, it asks n’s successor for the successor’s predecessor p, and decides whether p should be n’s successor instead – stabilize also notifies n’s successor of n’s existence, giving the successor the chance to change its predecessor to n
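The stabilize/notify exchange can be sketched as follows. This is a single-process toy, assuming in-memory object references stand in for remote calls and ignoring failures and finger tables; the class and method names are chosen for this example.

```python
def between(x, a, b):
    """True if x lies in the circular open interval (a, b)."""
    if a < b:
        return a < x < b
    return x > a or x < b  # wraps past zero (a == b: everything but a)

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self      # a lone node is its own successor
        self.predecessor = None

    def find_successor(self, key):
        """Walk successor pointers until key falls in (n, n.successor]."""
        n = self
        while not (key == n.successor.id
                   or between(key, n.id, n.successor.id)):
            n = n.successor
        return n.successor

    def join(self, known):
        """Learn an initial successor from any node already in the ring."""
        self.successor = known.find_successor(self.id)

    def stabilize(self):
        """Adopt the successor's predecessor if it sits between us."""
        p = self.successor.predecessor
        if p is not None and between(p.id, self.id, self.successor.id):
            self.successor = p
        self.successor.notify(self)

    def notify(self, candidate):
        """candidate believes it may be our predecessor."""
        if (self.predecessor is None
                or between(candidate.id, self.predecessor.id, self.id)):
            self.predecessor = candidate

a, b, c = Node(10), Node(40), Node(25)
b.join(a)   # both joins happen before any stabilization round,
c.join(a)   # so b and c initially both point at a
for _ in range(3):
    for node in (a, b, c):
        node.stabilize()
print(a.successor.id, c.successor.id, b.successor.id)  # 25 40 10
```

After three rounds the successor and predecessor pointers form the ring 10 -> 25 -> 40 -> 10, even though neither joining node knew about the other.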
Implementing a Distributed Hash Table over Chord put(k, v) – look up n, the node responsible for k, and store v on n. get(k) – look up the node responsible for k, return the value. How long does it take to join/leave Chord? – Fix: store on n and a few of its successors – Locally broadcast query
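A minimal sketch of put/get with the "store on n and a few of its successors" fix, assuming integer keys that are already identifiers and a single process that plays every node. `ToyDHT`, its `replicas` parameter, and the `alive` set are hypothetical names for this illustration.

```python
from bisect import bisect_left

class ToyDHT:
    def __init__(self, node_ids, replicas=3):
        self.ring = sorted(node_ids)
        self.store = {n: {} for n in self.ring}  # per-node local storage
        self.alive = set(self.ring)
        self.replicas = replicas

    def _holders(self, key_id):
        """successor(key_id) followed by the next replicas-1 nodes clockwise."""
        i = bisect_left(self.ring, key_id) % len(self.ring)
        count = min(self.replicas, len(self.ring))
        return [self.ring[(i + j) % len(self.ring)] for j in range(count)]

    def put(self, key_id, value):
        for n in self._holders(key_id):
            if n in self.alive:
                self.store[n][key_id] = value

    def get(self, key_id):
        """Ask the responsible node first, then fall back to its successors."""
        for n in self._holders(key_id):
            if n in self.alive and key_id in self.store[n]:
                return self.store[n][key_id]
        return None

dht = ToyDHT([1, 8, 14, 21], replicas=2)
dht.put(10, "value")       # stored on node 14 and its successor 21
dht.alive.discard(14)      # the responsible node fails
print(dht.get(10))         # value (served by node 21)
```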
Other aspects of Distributed Hash Tables How do we deal with security? – Nodes that return wrong answers – Nodes that do not forward messages – …
Applications of Distributed Hash Tables? A whole body of research – Distributed Filesystems (Past, Oceanstore) – Distributed Search – None deployed. Why? Today: – Kademlia is used for “tracker-less” torrents
Amazon Dynamo [DeCandia et al, SOSP 2007] (slides adapted from DeCandia et al)
Context Want a distributed storage system to support some of Amazon’s tasks: – best seller lists – shopping carts – customer preferences – session management – sales rank – product catalog. Traditional databases scale poorly and have poor availability
Amazon Dynamo Requirements – Scale – Simple: key-value – Highly available – Guarantee Service Level Agreements (SLA) Uses key-value store as abstraction
System Assumptions and Requirements Query Model – Read and write operations to a data item that is uniquely identified by a key – No schema needed – Small Objects (<1MB) stored as blobs ACID Properties? – Atomicity and weaker consistency, durability Efficiency – Commodity hardware – Mind the SLA! Other Assumptions – Environment is friendly (no security issues)
Design Considerations Sacrifice strong consistency for availability – Why are consistency and availability at odds? Optimistic replication increases availability – Allow disconnected operations – This may lead to concurrent updates to the same object: conflict – When to perform conflict resolution? Delaying writes unacceptable (e.g. shopping cart update) Solve conflicts during read instead of write, i.e. “always writeable”. Who resolves conflict? – App – e.g. merge shopping cart contents – Datastore – last write wins.
Other design considerations Incremental scalability Symmetry Decentralization Heterogeneity
Partitioning Algorithm Dynamo uses consistent hashing. Consistent hashing issues: – Load imbalance – Dealing with heterogeneity. “Virtual Nodes”: each node can be responsible for more than one virtual node.
Advantages of using virtual nodes If a node becomes unavailable, the load handled by this node is evenly dispersed across the remaining available nodes. When a node becomes available again, the newly available node accepts a roughly equivalent amount of load from each of the other available nodes. The number of virtual nodes that a node is responsible for can be decided based on its capacity, accounting for heterogeneity in the physical infrastructure.
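The following sketch places several tokens per physical node on the ring and checks what happens when one node disappears. It is an illustration under assumptions: tokens are derived by hashing a made-up `"node#index"` label with SHA-1, and the capacities are arbitrary; Dynamo's actual token assignment is not described in these slides.

```python
import hashlib
from bisect import bisect_left

def h(s):
    """Position of a string on the ring (160-bit SHA-1 value)."""
    return int.from_bytes(hashlib.sha1(s.encode()).digest(), "big")

def build_ring(capacity):
    """capacity maps a physical node to its number of virtual nodes."""
    tokens = {h(f"{node}#{v}"): node
              for node, count in capacity.items()
              for v in range(count)}
    return sorted(tokens.items())   # list of (token, physical node)

def owner(ring, key):
    """Physical node owning the first token at or after the key's hash."""
    points = [t for t, _ in ring]
    i = bisect_left(points, h(key)) % len(ring)
    return ring[i][1]

keys = [f"object-{i}" for i in range(1000)]
ring = build_ring({"A": 8, "B": 8, "C": 16})   # C has twice the capacity
before = {k: owner(ring, k) for k in keys}

without_b = [(t, n) for t, n in ring if n != "B"]   # B becomes unavailable
after = {k: owner(without_b, k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Only keys that lived on B change owner, and because B's tokens are scattered around the ring they are handed to whichever nodes follow each token rather than to a single neighbour.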
Replication Each data item is replicated at N hosts – N is specified per instance. “Preference list”: the list of nodes responsible for storing a key, i.e. the key’s coordinator and its N-1 clockwise successors.
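With virtual nodes, consecutive ring positions may belong to the same physical host, so a preference list has to skip repeats to end up with N distinct hosts. A small sketch, assuming the ring is given as sorted `(token, physical node)` pairs with hand-picked integer tokens; `preference_list` is a name chosen for this example.

```python
from bisect import bisect_left

def preference_list(ring, key_pos, n):
    """First n distinct physical nodes met walking clockwise from key_pos."""
    tokens = [t for t, _ in ring]
    start = bisect_left(tokens, key_pos)
    result = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in result:       # skip further tokens of a host we hold
            result.append(node)
        if len(result) == n:
            break
    return result

# Host A owns two virtual nodes (tokens 10 and 30).
ring = [(10, "A"), (20, "B"), (30, "A"), (40, "C"), (50, "D")]
print(preference_list(ring, 5, 3))   # ['A', 'B', 'C'] (second A skipped)
print(preference_list(ring, 45, 3))  # ['D', 'A', 'B'] (wraps around)
```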
Data Versioning A put() call may return to its caller before the update has been applied at all the replicas A get() call may return many versions of the same object Challenge: an object having distinct version sub- histories, which the system will need to reconcile in the future. Solution: uses vector clocks in order to capture causality between different versions of the same object.
Vector Clock A vector clock is a list of (node, counter) pairs. Every version of every object is associated with one vector clock. If every counter in the first object’s clock is less than or equal to the corresponding counter in the second object’s clock, then the first is an ancestor of the second and can be forgotten.
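The ancestor test can be written directly from that definition. In this sketch a clock is a dict from node name to counter and a missing entry is treated as 0, which is an assumption of the example; the node names Sx, Sy, Sz and the cart values are illustrative.

```python
def descends(a, b):
    """True if clock a has seen everything in clock b (a >= b on every node)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def prune_ancestors(versions):
    """Keep only versions whose clock is not strictly dominated by another."""
    keep = []
    for clock, value in versions:
        dominated = any(other != clock and descends(other, clock)
                        for other, _ in versions)
        if not dominated:
            keep.append((clock, value))
    return keep

d2 = ({"Sx": 2}, "cart v2")
d3 = ({"Sx": 2, "Sy": 1}, "cart v3")          # written via Sy after d2
d4 = ({"Sx": 2, "Sz": 1}, "cart v4")          # written via Sz after d2
d5 = ({"Sx": 3, "Sy": 1, "Sz": 1}, "merged")  # reconciled at Sx

siblings = prune_ancestors([d2, d3, d4])  # d2 is an ancestor of both
resolved = prune_ancestors([d3, d4, d5])  # d5 descends from d3 and d4
print(len(siblings), len(resolved))       # 2 1
```

A get() that finds `d3` and `d4` has to return both, since neither clock dominates the other; once the client writes back the merged `d5`, the two siblings become ancestors and can be dropped.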
Execution of get() and put() operations A client can select a node in one of two ways: 1. Route its request through a generic load balancer that will select a node based on load information. 2. Use a partition-aware client library that routes requests directly to the appropriate coordinator nodes.
Quorum systems We are balancing writes and reads over N nodes How do we make sure a read sees the latest write? – Write on all nodes, wait for reply from all; read from any node – Or write to one, read from all Quorum systems: write to W, read from R such that W+R>N
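The condition W+R>N can be checked by brute force for small N: enumerate every possible write set and read set and see whether they always share a node. This is only a sanity check of the counting argument, with a function name invented for the example.

```python
from itertools import combinations

def quorums_always_overlap(n, w, r):
    """True if every W-node write set intersects every R-node read set."""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

print(quorums_always_overlap(3, 2, 2))  # True:  W + R = 4 > 3
print(quorums_always_overlap(3, 3, 1))  # True:  write all, read any
print(quorums_always_overlap(3, 1, 3))  # True:  write one, read all
print(quorums_always_overlap(3, 1, 2))  # False: W + R = 3, a read can miss
```

When the sets always overlap, at least one node in any read set holds the latest write, so the reader can pick it out by version.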
Dynamo uses Sloppy Quorum Send write to all nodes – Return when W reply. Send read to all nodes – Return result(s) when R reply. What did we lose?
Hinted handoff Assume N = 3. When B is temporarily down or unreachable during a write, send the replica to E. E’s metadata hints that the replica belongs to B, and E will deliver it to B when B has recovered. The write will succeed as long as there are W nodes (any) available in the system
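A toy version of this scenario, assuming a five-node ring A..E with one token each, a key whose first N = 3 nodes are B, C, D, and plain dicts standing in for each node's storage and hint database. The function names and data layout are invented for the sketch and do not reflect Dynamo's internals.

```python
def sloppy_write(ring, start, n, up, key, value, data, hints):
    """Write to n reachable nodes, walking clockwise from ring[start].

    Intended nodes that are down are replaced by the next reachable
    nodes further along, which record a hint naming the intended owner.
    """
    size = len(ring)
    intended = [ring[(start + i) % size] for i in range(n)]
    written = []
    for node in intended:
        if node in up:
            data[node][key] = value
            written.append(node)
    offset = n
    for owner in (x for x in intended if x not in up):
        while offset < size and ring[(start + offset) % size] not in up:
            offset += 1
        if offset == size:
            break                      # no stand-in left on the ring
        stand_in = ring[(start + offset) % size]
        hints[stand_in].append((owner, key, value))
        written.append(stand_in)
        offset += 1
    return written

def deliver_hints(stand_in, up, data, hints):
    """Hand hinted replicas back to owners that are reachable again."""
    remaining = []
    for owner, key, value in hints[stand_in]:
        if owner in up:
            data[owner][key] = value
        else:
            remaining.append((owner, key, value))
    hints[stand_in] = remaining

ring = ["A", "B", "C", "D", "E"]
data = {x: {} for x in ring}
hints = {x: [] for x in ring}
up = {"A", "C", "D", "E"}                       # B is unreachable

acked = sloppy_write(ring, 1, 3, up, "cart", "v1", data, hints)
hinted = list(hints["E"])                       # E holds B's replica
up.add("B")                                     # B recovers
deliver_hints("E", up, data, hints)
```

The write is acknowledged by C, D and E, so it still reaches N nodes; the cost is that a read quorum drawn from B, C, D may briefly miss the copy parked on E.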
Dynamo membership Membership changes are manually configured – Gossip-based protocol propagates membership information – Every node knows about every other node’s range. Failures are detected by each node via timeouts – Enable hinted handoffs, etc.
Implementation Written in Java. Local persistence component allows for different storage engines to be plugged in: – Berkeley Database (BDB) Transactional Data Store: objects of tens of kilobytes – MySQL: objects larger than tens of kilobytes – BDB Java Edition, etc.