
1 DYNAMO: AMAZON’S HIGHLY AVAILABLE KEY-VALUE STORE Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels. Graduate Seminar: Advanced Topics in Storage Systems, Spring 2013. Amit Kleinman, Electrical Engineering, Tel-Aviv University

2 SERVICE-ORIENTED ARCHITECTURE OF AMAZON’S PLATFORM Different stores serve different access patterns, e.g., SimpleDB for complex queries, S3 for large files, and Dynamo for primary-key access

3 DYNAMO: REQUIREMENTS => FEATURES
- Fault tolerant - even if an entire data center fails; failure handling is the normal case, not the exception
- Gives tight control over the tradeoffs between consistency, availability, and performance: an RDBMS chooses consistency, Dynamo prefers availability
- Highly available distributed storage system: no locks, always writable
- SLAs - specified over the request-rate distribution and service latency; load distribution
- Eventually consistent: replicas are loosely synchronized, and inconsistencies are resolved by clients
- Trusted network => no authentication, no data-integrity checks, no authorization

4 THE BIG PICTURE

5 EASY USAGE: INTERFACE
- The client does not need to know all nodes; it can send a request to any node
- get(key) - returns a single object, or a list of objects with conflicting versions, together with a context
- put(key, context, object) - stores the object and context under the key, e.g., “add to cart” and “delete item from cart”
- The context encodes system metadata, e.g., the version number
- N = 3
(A usage sketch follows.)
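A minimal, single-node sketch of how a client might use this get/put interface; the FakeNode class, the merge_carts helper, and the key names are illustrative assumptions, not Amazon's actual API.

```python
# Toy stand-in for "any node" the client may contact; names are assumptions.
class FakeNode:
    def __init__(self):
        self.store = {}                                  # key -> (versions, context)

    def get(self, key):
        # Returns all (possibly conflicting) versions plus an opaque context.
        return self.store.get(key, ([], None))

    def put(self, key, context, obj):
        # Stores the object under the key; the context carries system metadata
        # (e.g., the vector clock) obtained from the preceding get().
        self.store[key] = ([obj], context)

def merge_carts(versions):
    # Application-level reconciliation for a shopping cart: union of the items.
    merged = []
    for cart in versions:
        for item in cart:
            if item not in merged:
                merged.append(item)
    return merged

node = FakeNode()                                        # the client may talk to any node
versions, ctx = node.get("cart:alice")
cart = merge_carts(versions)
cart.append("book")                                      # "add to cart"
node.put("cart:alice", ctx, cart)
```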

6 DHT
- Lookup service: retrieve the value associated with a given key; manage items (insert, delete)
- Properties: autonomy and decentralization; fault tolerance - lookups remain possible even if some nodes fail; scalability - a change in the set of participants causes minimal disruption
- The hash table is distributed among a set of cooperating nodes; each node is responsible for part of the items (= key/value pairs)
- A key-space partitioning scheme (KPS) splits ownership of the keys among the nodes
- A (structured) overlay network is used for routing: leaf-set tables (Pastry) hold logical neighbors in the key space; finger tables (Chord) hold relatively far-away nodes in the key space to enable fast routing
- Key-based routing (KBR) finds the node closest to the data; the network’s topology = the structure for picking neighbors
- DHT protocols specify: KPS - how keys are assigned to nodes; KBR - how a node can discover the value for a given key by first locating the node responsible for that key

7 DHT PROTOCOLS
Properties:
- Organization of the peer address space, e.g., flat or hierarchical
- Degree of decentralization, e.g., completely decentralized
- Next-hop decision criteria / distance function, e.g., prefix matching, Euclidean distance in a d-dimensional space, linear distance in a ring
- Geometry of the overlay, e.g., how the node degree changes as the size of the overlay grows (logarithmic, constant)
- Strategies for overlay maintenance, e.g., active versus correct-on-use
- Maximum number of hops taken by a request, given an overlay of N peers
Tradeoff between routing state and path distance: latency, chance to fail, space, topology-maintenance bandwidth
Dynamo - a zero-hop DHT based on consistent hashing

DHT protocol      | Architecture                              | Routing hops    | Use
Tapestry, Pastry  | Plaxton mesh                              | O(log_B N)      | PRR routing
CAN               | d-dimensional coordinate space            | O(d * N^(1/d))  |
Chord             | Uni-directional & circular nodeId space   | O(log N)        | Consistent hashing

8 CONSISTENT HASHING
- Maps each object to a point on the edge of a circle; the SHA-1 algorithm is the base hash function
- To find where an object should be placed: find its location on the edge of the circle, then walk clockwise around the circle until the first bucket (node) is reached
- Removal or addition of one node changes only the set of keys owned by the nodes with adjacent IDs
- Resizing the hash table requires remapping only about k/n keys (k = number of keys, n = number of slots)
- May still lead to load imbalance, in terms of storage bits, popularity of the items, processing required to serve the items, ...
(A sketch of the ring follows.)
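A minimal sketch of the ring walk just described, using SHA-1 positions and a sorted list; the node names and class layout are illustrative, not Dynamo's implementation.

```python
# Consistent hashing: hash nodes and keys onto a circle, walk clockwise to the owner.
import bisect
import hashlib

def ring_position(name: str) -> int:
    # SHA-1 places both node identifiers and keys on the edge of the circle.
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        self.ring = sorted((ring_position(n), n) for n in nodes)

    def add_node(self, node):
        # Only keys between the new node and its predecessor change owners.
        bisect.insort(self.ring, (ring_position(node), node))

    def remove_node(self, node):
        self.ring.remove((ring_position(node), node))

    def owner(self, key):
        # Walk clockwise from the key's position to the first node (wrapping around).
        positions = [p for p, _ in self.ring]
        idx = bisect.bisect_right(positions, ring_position(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-A", "node-B", "node-C"])
print(ring.owner("cart:alice"))          # node responsible for this key
ring.add_node("node-D")                  # remaps only the keys adjacent to node-D
```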

9 LOAD IMBALANCE (1/4) Node identifiers may not be balanced

11 LOAD IMBALANCE (2/4) Node identifiers may not be balanced Data identifiers may not be balanced

12 LOAD IMBALANCE (3/4) Node identifiers may not be balanced Data identifiers may not be balanced Hot spots

13 LOAD IMBALANCE (4/4) Node identifiers may not be balanced Data identifiers may not be balanced Hot spots Heterogeneous nodes

14 LOAD BALANCING VIA VIRTUAL NODES Each physical node picks multiple random identifiers; each identifier represents a virtual node. A physical node thus runs multiple virtual nodes and is responsible for noncontiguous regions of the ring

15 VIRTUAL NODE PLACEMENT How many virtual nodes? In a homogeneous cluster, all nodes run log N virtual nodes; in a heterogeneous cluster, nodes run c*log N virtual nodes, where c is small for weak nodes and large for powerful nodes. Virtual nodes can be moved from heavily loaded physical nodes to lightly loaded ones. (See the sketch below.)
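A sketch extending the consistent-hashing ring above with virtual nodes; the per-node virtual-node counts stand in for c*log N and are illustrative assumptions.

```python
# Virtual nodes: each physical node appears at several random ring positions,
# so its load is spread over noncontiguous regions of the key space.
import bisect
import hashlib

def ring_position(name: str) -> int:
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)

class VirtualNodeRing:
    def __init__(self, node_weights):
        # node_weights: physical node -> number of virtual nodes (proportional to capacity)
        self.ring = []
        for node, vnode_count in node_weights.items():
            for i in range(vnode_count):
                # "node#i" is one virtual identity of the physical node.
                self.ring.append((ring_position(f"{node}#{i}"), node))
        self.ring.sort()

    def owner(self, key):
        positions = [p for p, _ in self.ring]
        idx = bisect.bisect_right(positions, ring_position(key)) % len(self.ring)
        return self.ring[idx][1]

# A powerful node is given more virtual nodes than a weak one.
ring = VirtualNodeRing({"big-node": 8, "small-node": 2})
print(ring.owner("cart:alice"))
```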

16 REPLICATION Successor-list (preference-list) replication: each key is replicated at the N-1 clockwise successor nodes in the ring, skipping positions as needed so that the list contains only distinct physical nodes. (See the sketch below.)
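A sketch of building such a preference list on the VirtualNodeRing (and ring_position helper) from the previous sketch: walk clockwise and skip virtual nodes whose physical node has already been chosen. Function and variable names are illustrative.

```python
import bisect

def preference_list(ring, key, n_replicas=3):
    # Walk clockwise from the key's position; skip virtual nodes belonging to a
    # physical node we already selected, until n_replicas distinct nodes are found.
    positions = [p for p, _ in ring.ring]
    idx = bisect.bisect_right(positions, ring_position(key)) % len(ring.ring)
    distinct_physical = {node for _, node in ring.ring}
    nodes = []
    while len(nodes) < min(n_replicas, len(distinct_physical)):
        node = ring.ring[idx][1]
        if node not in nodes:
            nodes.append(node)
        idx = (idx + 1) % len(ring.ring)
    return nodes

ring = VirtualNodeRing({"A": 4, "B": 4, "C": 4, "D": 4})
print(preference_list(ring, "cart:alice"))     # three distinct physical nodes
```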

17 THE BIG PICTURE

18 DATA VERSIONING
- No master, no clock synchronization; updates are propagated asynchronously
- Each update of an item creates a new immutable version of the data, so multiple versions of an object may exist
- Target applications are aware that multiple versions can exist; replicas eventually become consistent
- Version branching can happen due to node failures, network failures/partitions, etc.

19 SOLVING CONFLICTS
Who/when/how to resolve?
- Business-logic-specific reconciliation, e.g., the shopping-cart service
- Timestamp-based reconciliation, e.g., the customer’s session-info service
Syntactic reconciliation: new versions subsume older versions, so the system itself can determine the authoritative version
Semantic reconciliation: conflicting versions of an object that the system cannot reconcile; the client must perform the reconciliation and collapse the multiple branches of data evolution back into one
Vector clocks capture causality: a vector clock is a list of (node, counter) pairs. If one version is causally older, it can be forgotten; if the versions are concurrent (neither clock dominates the other), a conflict exists and reconciliation is needed. (See the sketch below.)
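A minimal vector-clock sketch matching the description above; the clock is represented as a dict, and the node names (Sx, Sy, Sz) follow the reconciliation example on the next slide. This is illustrative, not Dynamo's internal representation.

```python
# Vector clocks: a clock is {node: counter}; Dynamo carries it in the get/put context.
def descends(a: dict, b: dict) -> bool:
    """True if clock a is causally equal to or newer than clock b."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def compare(a: dict, b: dict) -> str:
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"      # b is an ancestor and can be forgotten
    if descends(b, a):
        return "b is newer"
    return "concurrent"          # conflict: reconciliation is needed

def bump(clock: dict, node: str) -> dict:
    """A new version written through `node`: copy the clock and increment its counter."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new

d1 = bump({}, "Sx")              # D1: [(Sx, 1)]
d2 = bump(d1, "Sx")              # D2: [(Sx, 2)]
d3 = bump(d2, "Sy")              # D3: [(Sx, 2), (Sy, 1)]
d4 = bump(d2, "Sz")              # D4: [(Sx, 2), (Sz, 1)]
print(compare(d3, d4))           # "concurrent" -> the client must reconcile
```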

20 RECONCILIATION - EXAMPLE Client C1 writes a new object, say via Sx (D1); C1 updates the object, again via Sx (D2); C1 updates the object via Sy (D3); C2 reads D2 and updates the object via Sz (D4); the divergent versions are later reconciled

21 PUT (KEY, VALUE, CONTEXT)
Quorum systems: R / W = the minimum number of nodes that must participate in a successful read / write; R + W > N guarantees an overlap
The coordinator:
- Generates a new vector clock and writes the new version locally
- Sends the new version to the N nodes in the preference list
- Waits for responses from W-1 other nodes (its own local write counts toward W)
Using W=1 gives high availability for writes but low durability. (See the sketch below.)
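A toy sketch of the coordinator's write path under a W-of-N quorum; the Replica class, the node name "Sx", and the in-memory store are illustrative assumptions, not Dynamo internals.

```python
# Toy W-of-N quorum write. Replica stands in for a remote storage node.
N, W = 3, 2

class Replica:
    def __init__(self):
        self.data = {}                              # key -> list of (vector_clock, value)

    def store(self, key, clock, value):
        self.data.setdefault(key, []).append((clock, value))
        return True                                 # acknowledge the write

def coordinator_put(key, value, context, preference_list, me="Sx"):
    # The coordinator generates the new vector clock from the supplied context.
    clock = dict(context or {})
    clock[me] = clock.get(me, 0) + 1
    acks = 0
    for replica in preference_list[:N]:             # itself plus the next replicas
        if replica.store(key, clock, value):
            acks += 1
        if acks >= W:
            return True                             # success after W acknowledgements
    return False                                    # fewer than W nodes responded

replicas = [Replica() for _ in range(N)]
coordinator_put("cart:alice", ["book"], {}, replicas)
```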

22 (VALUE, CONTEXT) <- GET (KEY)
The coordinator:
- Requests all existing versions of the key from the N nodes
- Waits for responses from R nodes
- If there are multiple versions, returns all versions that are causally unrelated
- Divergent versions are then reconciled, and the reconciled version is written back (“read repair”)
Using R=1 gives a high-performance read engine. (See the sketch below.)
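A matching toy sketch of the read path with an R-of-N quorum and read repair, reusing N and the Replica class from the write sketch above; the dominance test repeats the descends() helper from the vector-clock sketch.

```python
R = 2

def descends(a, b):
    # True if clock a is causally equal to or newer than clock b.
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def coordinator_get(key, preference_list):
    versions = []
    for replica in preference_list[:R]:             # wait for R responses
        for clock, value in replica.data.get(key, []):
            if (clock, value) not in versions:
                versions.append((clock, value))
    # Keep only causally unrelated versions; versions subsumed by a newer one are dropped.
    latest = [v for v in versions
              if not any(descends(other[0], v[0]) and other[0] != v[0] for other in versions)]
    # Read repair: write the surviving version(s) back to the replicas consulted.
    for replica in preference_list[:N]:
        for clock, value in latest:
            if (clock, value) not in replica.data.get(key, []):
                replica.data.setdefault(key, []).append((clock, value))
    return latest
```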

23 MEMBERSHIP
- A managed system: an administrator explicitly adds and removes nodes
- Gossiping propagates membership changes and yields an eventually consistent view: each node contacts a random peer every second, and the two reconcile their persisted membership-change histories
- This keeps the overlay at O(1) hops; compare a log(n)-hop overlay, e.g., n=1024: 10 hops at 50 ms/hop = 500 ms
- Seed nodes are discovered via an external mechanism and are known to all nodes
- Clients update their view of the membership periodically (every 10 seconds)
(A toy gossip sketch follows.)
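A toy sketch of gossip-based reconciliation of membership-change histories; the data layout (per-node timestamped status entries) and the class names are assumptions for illustration only.

```python
# Each node keeps a history of membership changes and merges it with a random
# peer every round; entries with newer timestamps win.
import random

class Member:
    def __init__(self, name):
        self.name = name
        self.history = {}                        # node -> (timestamp, "added"/"removed")

    def local_change(self, node, status, ts):
        self.history[node] = (ts, status)        # explicit, administrator-driven change

    def gossip_with(self, peer):
        # Reconcile: for each node, keep whichever entry has the newer timestamp.
        for node in set(self.history) | set(peer.history):
            mine = self.history.get(node, (0, None))
            theirs = peer.history.get(node, (0, None))
            newest = mine if mine[0] >= theirs[0] else theirs
            self.history[node] = newest
            peer.history[node] = newest

members = [Member(f"n{i}") for i in range(4)]
members[0].local_change("n9", "added", ts=1)     # administrator adds a node at n0
for _ in range(10):                              # a few gossip rounds
    a, b = random.sample(members, 2)
    a.gossip_with(b)
print(members[3].history)                        # change has propagated (with high probability)
```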

24 THE BIG PICTURE

25 FAILURE DETECTION
- Permanent node additions and removals are explicit
- In the absence of client requests, A doesn’t need to know whether B is alive
- Local failure detection: A considers B failed if B doesn’t respond to a message; A then periodically checks whether B is alive again (pings are used only to detect the transition from failed back to alive)
- Membership changes are stored with a timestamp
(A sketch follows.)
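A minimal sketch of this purely local detection; the send_fn/ping_fn callables and the timing constant are placeholders, not Dynamo parameters.

```python
# Node A marks B failed when a request times out and re-probes it periodically.
import time

class FailureDetector:
    def __init__(self, probe_interval=10.0):
        self.probe_interval = probe_interval
        self.failed_since = {}                   # peer -> time it was marked failed

    def request(self, peer, send_fn):
        # Failure is only noticed in the course of real traffic to the peer.
        try:
            return send_fn(peer)
        except TimeoutError:
            self.failed_since[peer] = time.time()   # A's local view; no global agreement
            raise

    def maybe_probe(self, peer, ping_fn):
        # Pings are used only for the failed -> alive transition.
        if peer in self.failed_since and time.time() - self.failed_since[peer] > self.probe_interval:
            if ping_fn(peer):
                del self.failed_since[peer]
```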

26 HANDLING TRANSIENT FAILURES: HINTED HANDOFF
- Uses a sloppy quorum: reads and writes go to the first N healthy nodes from the preference list
- Say A is unreachable: a ’put’ that would normally go to A will use D instead
- Later, when D detects that A is alive again, it sends the (hinted) replica to A and then removes the hinted replica locally
- To tolerate the failure of a data center, each object is replicated across multiple data centers
(A sketch follows.)
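A toy sketch of hinted handoff: a write meant for an unreachable node is stored on the next healthy node together with a hint naming the intended owner, and handed back once the owner recovers. The Node class and helper functions are illustrative assumptions.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}            # normal replicas: key -> value
        self.hinted = []          # (intended_owner_name, key, value), kept in a separate local store

def put_with_hints(key, value, preference_list, n=3):
    # Sloppy quorum: for each unreachable node among the first n, hand the
    # replica (with a hint) to the next healthy node further down the list.
    extras = iter(preference_list[n:])
    for owner in preference_list[:n]:
        if owner.alive:
            owner.data[key] = value
        else:
            fallback = next(node for node in extras if node.alive)
            fallback.hinted.append((owner.name, key, value))

def handoff(node, nodes_by_name):
    # When an intended owner becomes reachable again, deliver its hinted
    # replicas and drop them from the local hinted store.
    remaining = []
    for owner_name, key, value in node.hinted:
        owner = nodes_by_name[owner_name]
        if owner.alive:
            owner.data[key] = value
        else:
            remaining.append((owner_name, key, value))
    node.hinted = remaining

a, b, c, d = Node("A"), Node("B"), Node("C"), Node("D")
a.alive = False
put_with_hints("cart:alice", ["book"], [a, b, c, d])   # D holds A's replica plus a hint
a.alive = True
handoff(d, {"A": a, "B": b, "C": c, "D": d})           # the replica is delivered back to A
```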

27 HANDLING PERMANENT FAILURES
- Anti-entropy for replica synchronization: gossip information until it is made obsolete by newer information
- Merkle trees are used for fast inconsistency detection and minimal transfer of data
- Nodes exchange the roots of the Merkle trees corresponding to the key ranges they host in common; matching roots mean the replicas agree, otherwise the trees are descended to find the differing keys
(A sketch follows.)
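A minimal Merkle-tree sketch over a sorted key range: roots are compared first, and only differing subtrees are descended. It assumes both replicas hold the same key set (so the trees have the same shape); all names are illustrative.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def build_tree(items):
    """items: sorted list of (key, value). Returns a node (hash, left, right, items)."""
    if len(items) <= 1:
        payload = b"".join(f"{k}={v}".encode() for k, v in items)
        return (h(payload), None, None, items)
    mid = len(items) // 2
    left, right = build_tree(items[:mid]), build_tree(items[mid:])
    return (h(left[0] + right[0]), left, right, items)

def diff(a, b, out):
    """Collect keys whose hashes differ between two same-shaped trees."""
    if a[0] == b[0]:
        return                                   # identical subtrees: nothing to transfer
    if a[1] is None or b[1] is None:             # reached leaves: report the differing keys
        out.update(k for k, _ in a[3] + b[3])
        return
    diff(a[1], b[1], out)
    diff(a[2], b[2], out)

replica1 = build_tree(sorted({"k1": "a", "k2": "b", "k3": "c"}.items()))
replica2 = build_tree(sorted({"k1": "a", "k2": "X", "k3": "c"}.items()))
mismatches = set()
diff(replica1, replica2, mismatches)
print(mismatches)                                 # {'k2'} - only this key needs to be synced
```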

28 KEY WEAKNESSES / TECHNIQUES USED BY DYNAMO
- Lacks information on: full routing info at each node (how does the gossip-based membership protocol scale?); dynamic placement of virtual nodes; geo-distribution
- Trusted environment => no security
- Reconciliation at the application: more complex business logic, and the application is less portable (platform-dependent)
- (N, W, R) values: static vs. dynamic - should they adapt based on component failures?
- Improper & incomplete experiments: only on homogeneous machines; only a strict (not sloppy) quorum; consistent hashing is well known
- More details are needed, e.g., data arrival distributions, performance under high churn

29 MAIN TAKEAWAY In several cases it is easier and more cost-effective to reformulate the problem to work in an eventually consistent system than to try to provide stronger consistency

