DYNAMO: AMAZON’S HIGHLY AVAILABLE KEY-VALUE STORE


DYNAMO: AMAZON’S HIGHLY AVAILABLE KEY-VALUE STORE
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels
Graduate Seminar: Advanced Topics in Storage Systems, Spring 2013
Amit Kleinman, Electrical Engineering, Tel-Aviv University

SERVICE-ORIENTED ARCHITECTURE OF AMAZON’S PLATFORM
- E.g., SimpleDB for complex queries, S3 for large files, Dynamo for primary-key access

DYNAMO: REQUIREMENTS => FEATURES
- Fault tolerant: keeps working even if an entire data center fails
- Failure handling is treated as the normal case, not the exception
- Gives tight control over the tradeoffs between consistency, availability, and performance
  - An RDBMS chooses consistency; Dynamo prefers availability
- Highly available distributed storage system: no locks, always writable
- SLAs stated over the request-rate distribution and service latency
- Load distribution across nodes
- Eventually consistent: replicas are loosely synchronized; inconsistencies are resolved by clients
- Trusted network: no authentication, no authorization, no data-integrity checks

THE BIG PICTURE

EASY USAGE: INTERFACE
- The client does not need to know all nodes; a request can be sent to any node
- get(key): returns a single object, or a list of objects with conflicting versions, plus a context
- put(key, context, object): stores the object and context under the key
  - e.g., “add to cart” and “delete item from cart”
- The context encodes system metadata, e.g., the version number
- N = 3 (default replication factor)
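To make the interface concrete, here is a minimal, single-process sketch of a Dynamo-style get/put API. The class and structure are illustrative, not Amazon's actual client library; there is no replication here, only the shape of the calls.

```python
# Minimal in-memory sketch of a Dynamo-style get/put interface.
# Hypothetical names; no replication or networking, just the call shape.
class TinyStore:
    def __init__(self):
        self.data = {}  # key -> list of (context, object) versions

    def get(self, key):
        """Return (objects, contexts): a single object, or several
        causally-unrelated versions together with their contexts."""
        versions = self.data.get(key, [])
        objects = [obj for _ctx, obj in versions]
        contexts = [ctx for ctx, _obj in versions]
        return objects, contexts

    def put(self, key, context, obj):
        """Store obj under key. The context (metadata from a prior get)
        tells the store which versions this write supersedes; the real
        system merges vector clocks here, this sketch simply replaces."""
        self.data[key] = [(context, obj)]

store = TinyStore()
store.put("cart:alice", None, {"items": ["book"]})
objects, contexts = store.get("cart:alice")
```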

DHT Lookup service: Retrieve value associated with a given key Manage items (insert, deleting) Properties: Autonomy and Decentralization Fault-tolerant - possible even if some nodes fail Scalability – change in the set of participants causes minimal disruption HT Distributed among a set of cooperating nodes Each node is responsible for part of the items (= key/value pairs) Key space partitioning scheme (KPS) - splits ownership among nodes Using (a structured) overlay network (Pastry) Leafset tables - logical neighbors in a key space (Chord) Finger tables - relatively far away nodes in the key space to enable fast routing Key-Based Routing (KBR) - find the closest node for the data Network's topology = structure for picking neighbors DHT Protocols – Specify: KPS: How keys are assigned to nodes KBR: How a node can discover the value for a given key by first locating the node responsible for that key 6

DHT PROTOCOLS
- Properties:
  - Organization of the peer address space, e.g., flat or hierarchical
  - Degree of decentralization, e.g., completely decentralized
  - Next-hop decision criteria / distance function, e.g., prefix matching, Euclidean distance in a d-dimensional space, linear distance on a ring
  - Geometry of the overlay, e.g., how the node degree changes as the overlay grows (logarithmic, constant)
  - Strategy for overlay maintenance, e.g., active versus correct-on-use
  - Maximum number of hops taken by a request in an overlay of N peers
- Tradeoff between routing state and path distance: latency, chance of failure, space, topology-maintenance bandwidth
- Dynamo: a zero-hop DHT based on consistent hashing
- Protocol comparison (protocol / architecture / routing hops / use):
  - Tapestry, Pastry: Plaxton mesh, O(log_B N) hops, PRR routing
  - CAN: d-dimensional coordinate space, O(d·N^(1/d)) hops
  - Chord: uni-directional, circular node-ID space, O(log N) hops, consistent hashing

CONSISTENT HASHING
- Maps each object to a point on the edge of a circle; SHA-1 is the base hashing function
- To find where an object should be placed: find its location on the edge of the circle, then walk clockwise until reaching the first bucket (node)
- Removal or addition of one node changes only the set of keys owned by the nodes with adjacent IDs
- Resizing the hash table remaps only about k/n keys (k = number of keys, n = number of slots)
- May still lead to load imbalance, depending on storage bits, the popularity of an item, the processing required to serve an item, etc.
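A small sketch of consistent hashing under these rules (SHA-1 positions on a ring, clockwise walk to the first node); the node names and helper function are illustrative.

```python
# Consistent-hashing sketch: SHA-1 positions on a ring, clockwise lookup.
import hashlib
from bisect import bisect_right

def ring_pos(name: str) -> int:
    """SHA-1 maps keys and node names onto the (160-bit) circle."""
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((ring_pos(n), n) for n in nodes)

    def owner(self, key):
        """Find the key's position, then walk clockwise to the first node."""
        positions = [pos for pos, _ in self.ring]
        i = bisect_right(positions, ring_pos(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["node-A", "node-B", "node-C"])
print(ring.owner("shopping-cart-42"))
# Adding or removing one node only moves the keys between it and its neighbor.
```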

LOAD IMBALANCE (1/4)
- Node identifiers may not be balanced

LOAD IMBALANCE (2/4)
- Node identifiers may not be balanced
- Data identifiers may not be balanced

LOAD IMBALANCE (3/4)
- Node identifiers may not be balanced
- Data identifiers may not be balanced
- Hot spots

LOAD IMBALANCE (4/4)
- Node identifiers may not be balanced
- Data identifiers may not be balanced
- Hot spots
- Heterogeneous nodes

LOAD BALANCING VIA VIRTUAL NODES
- Each physical node picks multiple random identifiers; each identifier represents a virtual node
- Each physical node runs multiple virtual nodes and is therefore responsible for noncontiguous regions of the ring

VIRTUAL NODE PLACEMENT
- How many virtual nodes?
  - Homogeneous cluster: every node runs log N virtual nodes
  - Heterogeneous cluster: a node runs c·log N virtual nodes, where c is small for weak nodes and large for powerful nodes
- Virtual nodes can be moved from heavily loaded physical nodes to lightly loaded ones
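A sketch of how virtual-node placement might look, with a heterogeneous token count per physical node; the token counts and naming scheme are assumptions for illustration.

```python
# Virtual-node sketch: each physical node places several tokens on the ring.
import hashlib

def ring_pos(name: str) -> int:
    return int(hashlib.sha1(name.encode()).hexdigest(), 16)

def build_ring(tokens_per_node):
    """tokens_per_node: physical node -> number of virtual nodes.
    Returns a sorted list of (position, physical_node) entries, so each
    physical node ends up owning many noncontiguous ranges."""
    ring = []
    for node, count in tokens_per_node.items():
        for i in range(count):
            ring.append((ring_pos(f"{node}#vn{i}"), node))
    return sorted(ring)

# e.g., roughly log2(N) tokens for a weak node, c*log2(N) for a powerful one.
ring = build_ring({"weak-node": 4, "powerful-node": 12})
```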

REPLICATION
- Successor ("preference list") replication: each key is stored at the node that owns it and replicated at the N-1 clockwise successor nodes on the ring
- Positions are skipped while walking the ring so that the preference list contains only distinct physical nodes
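The preference list can then be built by walking the ring clockwise from the key and skipping virtual nodes that belong to a physical node already chosen. A sketch under the ring representation used above (a sorted list of (position, physical_node) pairs):

```python
# Preference-list sketch: first N *distinct* physical nodes clockwise of the key.
from bisect import bisect_right

def preference_list(ring, key_pos, n):
    """ring: sorted list of (position, physical_node) virtual-node entries."""
    positions = [pos for pos, _ in ring]
    start = bisect_right(positions, key_pos) % len(ring)
    prefs = []
    for step in range(len(ring)):              # at most one full lap of the ring
        node = ring[(start + step) % len(ring)][1]
        if node not in prefs:                  # skip duplicates of the same physical node
            prefs.append(node)
        if len(prefs) == n:
            break
    return prefs

ring = [(10, "A"), (25, "B"), (40, "A"), (60, "C"), (80, "B")]
print(preference_list(ring, 30, 3))            # ['A', 'C', 'B']
```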

THE BIG PICTURE

DATA VERSIONING
- No master, no clock synchronization; updates are propagated asynchronously
- Each update of an item creates a new, immutable version of the data, so multiple versions of an object may exist at the same time
- Target applications are aware that multiple versions can exist; replicas eventually become consistent
- Version branching can happen due to node failures, network failures/partitions, etc.

SOLVING CONFLICTS
- Who/when/how to resolve?
  - Business-logic-specific reconciliation, e.g., the shopping-cart service
  - Timestamp-based reconciliation, e.g., the customer’s session-information service
- Syntactic reconciliation: newer versions subsume older versions, so the system itself can determine the authoritative version
- Semantic reconciliation: the system cannot reconcile conflicting versions of the same object, so the client must perform the reconciliation and collapse multiple branches of data evolution back into one
- Vector clocks capture causality: a vector clock is a list of (node, counter) pairs
  - If one version is causally older, it can be forgotten
  - If the versions are concurrent (neither clock dominates the other), a conflict exists and reconciliation is needed
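A small sketch of the vector-clock comparison this implies, representing each clock as a dict of node -> counter; the particular clock values below are illustrative.

```python
# Vector-clock sketch: decide whether one version supersedes another.
def descends(a, b):
    """True if clock a has seen at least everything recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a supersedes b"   # syntactic reconciliation: drop b
    if descends(b, a):
        return "b supersedes a"   # syntactic reconciliation: drop a
    return "concurrent"           # conflict: semantic reconciliation by the client

d2 = {"Sx": 2}
d3 = {"Sx": 2, "Sy": 1}
d4 = {"Sx": 2, "Sz": 1}
print(compare(d2, d3))            # "b supersedes a": D3 subsumes D2
print(compare(d3, d4))            # "concurrent": reconciliation needed
```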

RECONCILIATION - EXAMPLE
- Client C1 writes a new object (D1), say via node Sx
- C1 updates the object (D2), say via Sx
- C1 updates the object (D3), say via Sy
- Client C2 reads D2 and updates the object (D4), say via Sz
- D3 and D4 are concurrent, so a read returns both and reconciliation is needed

PUT (KEY, VALUE, CONTEXT)
- Quorum system: R / W are the minimum numbers of nodes that must participate in a successful read / write, with R + W > N so that read and write sets overlap
- The coordinator:
  - Generates a new vector clock and writes the new version locally
  - Sends the new version to the N nodes in the preference list
  - Waits for responses from W-1 of them
- Using W = 1 gives high availability for writes at the cost of lower durability
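A sketch of the write path described above; the local store, the RPC stub, and the W value are assumptions for illustration, not Dynamo's actual code.

```python
# Write-coordinator sketch: new vector clock, local write, W-1 remote acks.
N, W = 3, 2
local_store = {}                                   # coordinator's own storage

def write_local(key, version):
    local_store.setdefault(key, []).append(version)

def send_write(replica, key, version):
    """Stand-in for the RPC that ships the version to a replica."""
    return True                                    # pretend every replica acks

def coordinate_put(key, context, value, preference, node_id="Sx"):
    clock = dict(context or {})
    clock[node_id] = clock.get(node_id, 0) + 1     # generate the new vector clock
    version = (clock, value)
    write_local(key, version)                      # write the new version locally
    acks = 1
    for replica in preference:
        if replica == node_id:
            continue
        if send_write(replica, key, version):
            acks += 1
        if acks >= W:                              # wait for W-1 responses
            return "ok"
    return "ok" if acks >= W else "failed"

print(coordinate_put("cart:alice", {}, ["book"], ["Sx", "Sy", "Sz"]))
```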

(VALUE, CONTEXT) ← GET (KEY)
- The coordinator:
  - Requests all existing versions from the N nodes in the preference list
  - Waits for responses from R nodes
  - If there are multiple versions, returns all versions that are causally unrelated
- Divergent versions are then reconciled by the client, and the reconciled version is written back (“read repair”)
- Using R = 1 gives a high-performance read engine
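A matching sketch of the read path: gather R replies, drop superseded versions, and return whatever remains causally unrelated. The replica representation and the R value are illustrative.

```python
# Read-coordinator sketch: wait for R replies, return causally-unrelated versions.
R = 2

def descends(a, b):
    return all(a.get(node, 0) >= count for node, count in b.items())

def coordinate_get(key, replicas):
    """replicas: list of per-node stores mapping key -> list of (clock, value)."""
    responses, answered = [], 0
    for replica in replicas:
        versions = replica.get(key)
        if versions is None:
            continue
        responses.extend(versions)
        answered += 1
        if answered >= R:                          # enough replies to answer
            break
    # Keep only versions that no other response supersedes.
    latest = []
    for clock, value in responses:
        dominated = any(descends(c2, clock) and c2 != clock for c2, _ in responses)
        if not dominated and (clock, value) not in latest:
            latest.append((clock, value))
    return latest   # more than one entry => divergent versions for the client to
                    # reconcile and write back ("read repair")

replicas = [{"k": [({"Sx": 2, "Sy": 1}, "v3")]}, {"k": [({"Sx": 2, "Sz": 1}, "v4")]}]
print(coordinate_get("k", replicas))               # both concurrent versions returned
```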

MEMBERSHIP
- A managed system: an administrator explicitly adds and removes nodes
- Gossip propagates membership changes, giving an eventually consistent view
  - Each node contacts a random peer every second and the two reconcile their persisted membership-change histories
- O(1)-hop overlay, versus O(log n) hops in a classic DHT: e.g., for n = 1024, 10 hops at 50 ms/hop ≈ 500 ms
- Seed nodes are discovered via an external mechanism and are known to all nodes
- Clients update their view of membership periodically (every 10 seconds)
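A tiny gossip sketch, assuming a node's view is just a map from member to the timestamp of its latest membership change; the merge keeps the newer timestamp per member. The data layout is an assumption for illustration.

```python
# Gossip sketch: each round, every node reconciles its view with one random peer.
import random

def merge(view_a, view_b):
    """Keep the newer membership-change timestamp for every member."""
    merged = dict(view_a)
    for member, ts in view_b.items():
        if ts > merged.get(member, -1):
            merged[member] = ts
    return merged

def gossip_round(views):
    for node in list(views):
        peer = random.choice([n for n in views if n != node])
        merged = merge(views[node], views[peer])
        views[node], views[peer] = dict(merged), dict(merged)

views = {"A": {"A": 0, "C": 5}, "B": {"B": 1}, "C": {"C": 5}}
for _ in range(5):
    gossip_round(views)   # views converge to an eventually consistent membership
```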

THE BIG PICTURE

FAILURE DETECTION
- Permanent node additions and removals are explicit (administrator-driven)
- In the absence of client requests, node A does not need to know whether node B is alive
- Purely local failure detection: A marks B as failed if B does not respond to a message, then periodically checks whether B is alive again
  - Pings are used only to detect the transition from failed back to alive
- Membership changes are stored with a timestamp
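A sketch of this local, on-demand failure detector; send_ping stands in for whatever transport the node uses and is an assumption here.

```python
# Local failure-detection sketch: mark a peer failed on timeout, ping until it recovers.
import time

class FailureDetector:
    def __init__(self, send_ping):
        self.send_ping = send_ping        # hypothetical transport callback
        self.failed_since = {}            # peer -> time it was marked failed

    def record_timeout(self, peer):
        """Called when a request to the peer went unanswered."""
        self.failed_since.setdefault(peer, time.time())

    def probe_failed_peers(self):
        """Periodic task: only already-failed peers are pinged, and a reply
        flips them back to alive."""
        for peer in list(self.failed_since):
            if self.send_ping(peer):
                del self.failed_since[peer]

    def is_alive(self, peer):
        return peer not in self.failed_since

fd = FailureDetector(send_ping=lambda peer: True)
fd.record_timeout("B")
fd.probe_failed_peers()
print(fd.is_alive("B"))                   # True again once B answers a ping
```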

HANDLING TRANSIENT FAILURES: HINTED HANDOFF
- Sloppy quorum: reads and writes go to the first N healthy nodes from the preference list
- Example: if A is unreachable, a put will use D instead; later, when D detects that A is alive again, it sends the (hinted) replica to A and then removes its local copy
- To tolerate the failure of an entire data center, each object is replicated across multiple data centers
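A sketch of hinted handoff under these rules; the node names (A–D), the store/hint data structures, and the fixed N are assumptions for illustration.

```python
# Hinted-handoff sketch: write to the first N healthy nodes, remember the intended owner.
N = 3

def sloppy_put(key, version, preference, alive, stores, hints):
    home = preference[:N]                              # the N "real" replica holders
    down_home = [n for n in home if not alive[n]]
    targets = [n for n in preference if alive[n]][:N]  # first N healthy nodes
    for node in targets:
        stores[node][key] = version
        if node not in home and down_home:
            intended = down_home.pop(0)                # this replica belongs elsewhere
            hints.append({"holder": node, "intended": intended, "key": key})

def handoff(hints, alive, stores):
    """Periodic task: once the intended owner is reachable, forward the
    hinted replica to it and delete the local copy."""
    for hint in list(hints):
        if alive[hint["intended"]]:
            value = stores[hint["holder"]].pop(hint["key"])
            stores[hint["intended"]][hint["key"]] = value
            hints.remove(hint)

stores = {n: {} for n in "ABCD"}
alive = {"A": False, "B": True, "C": True, "D": True}
hints = []
sloppy_put("k", "v1", ["A", "B", "C", "D"], alive, stores, hints)  # D stands in for A
alive["A"] = True
handoff(hints, alive, stores)                          # D forwards the replica to A
```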

HANDLING PERMANENT FAILURES
- Anti-entropy protocol for replica synchronization: gossip information until it is made obsolete by newer information
- Merkle trees are used for fast inconsistency detection and minimal transfer of data
- Nodes exchange the roots of the Merkle trees corresponding to the key ranges they host in common, and descend only into subtrees whose hashes differ
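A minimal Merkle-tree sketch: build a root over the hashes of a key range and compare roots, so equal roots mean no data needs to move. The leaf encoding and hash choice are assumptions.

```python
# Merkle-tree sketch for anti-entropy: compare roots, sync only if they differ.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Hash the leaves of a key range and fold them pairwise up to a root."""
    level = [h(leaf) for leaf in leaves]
    if not level:
        return h(b"")
    while len(level) > 1:
        if len(level) % 2:                       # duplicate the last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Two replicas exchange only the roots of the key ranges they host in common,
# then descend into children (not shown) only where the hashes disagree.
range_on_a = [b"k1=v1", b"k2=v2", b"k3=v3"]
range_on_b = [b"k1=v1", b"k2=v2", b"k3=stale"]
print(merkle_root(range_on_a) == merkle_root(range_on_b))   # False -> sync needed
```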

KEY WEAKNESSES
- Lacks information on:
  - Full routing information at each node: how does the gossip-based membership protocol scale?
  - Dynamic placement of virtual nodes
  - Geo-distribution
- Trusted environment only, so no security
- Reconciliation is pushed to the application: more complex business logic, and the application becomes less portable (platform-dependent)
- (N, W, R) values: static vs. dynamic, e.g., adjusted based on component failures?
- Improper and incomplete experiments: only on homogeneous machines, only with strict (not sloppy) quorum; consistent hashing is already well known
- More details are needed, e.g., data-arrival distributions and performance under high churn

MAIN TAKEAWAY
- In several cases it is easier and more cost-effective to reformulate the problem to work in an eventually consistent system, rather than trying to provide stronger consistency