CSCI5570 Large Scale Data Processing Systems: NoSQL
James Cheng, CSE, CUHK
Slide acknowledgement: adapted from the slides by Peter Vosshall

Dynamo: Amazon's Highly Available Key-value Store
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels
SOSP 2007

Why are we reading this paper?
– An eventually consistent database: write any replica
– A real system: used e.g. for the shopping cart at Amazon
– More availability, less consistency than PNUTS
– Influential design; inspired e.g. Facebook's Cassandra

Amazon’s eCommerce Platform Architecture
– Loosely coupled, service-oriented architecture
– Stateful services own and manage their own state
– Stringent latency requirements: services must adhere to formal SLAs, measured at the 99.9th percentile
– Availability is paramount
– Large scale (and growing)

Motivation
Amazon.com: one of the largest e-commerce operations in the world
Reliability at massive scale:
– tens of thousands of servers and network components
– highly decentralized, loosely coupled
– the slightest outage has significant financial consequences and impacts on customer trust

Motivation
Most services on Amazon need only primary-key access to a data store => key-value store
State management is the primary factor for scalability and availability => need a highly available, scalable storage system
RDBMS is a poor fit:
– most features go unused
– scales up, not out
– availability limitations
Consistency vs. availability:
– high availability is very important
– user-perceived consistency is very important
– trade off strong consistency in favor of high availability

Key Requirements
"Always writable":
– accept writes during failure scenarios
– allow writes without prior context
User-perceived consistency
Guaranteed performance (99.9th percentile latency)
Incremental scalability
"Knobs" to tune trade-offs between cost, consistency, durability and latency
No existing production-ready solution met these requirements

What is Dynamo?
A highly available and scalable distributed data storage system:
– a key-value store
– data partitioned and replicated using consistent hashing (convenient for adding/removing nodes)
– consistency facilitated by object versioning
– a quorum-like technique and a decentralized replica synchronization protocol to maintain consistency among replicas during updates
– always writable and eventually consistent
– a gossip-based protocol for failure detection and membership

System Assumptions & Requirements
Query model:
– simple reads/writes by key
– no operation spans multiple data items
– no need for a relational schema
ACID properties:
– weaker consistency
– no isolation guarantees; only single-key updates
Security:
– runs internally, so no security requirements

System Interface
A simple primary-key-only interface:
– get(key): locates the object replicas associated with key; returns a single object, or a list of objects with conflicting versions, along with a context
– put(key, context, object): determines where the replicas of the object should be placed based on key, and writes the replicas to disk; context is system metadata about the object, opaque to the caller
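A minimal Python sketch of this interface (the class and method names are illustrative, not Amazon's actual API; the merge helper in the usage comment is hypothetical):

```python
class DynamoClient:
    """Illustrative sketch of Dynamo's primary-key interface."""

    def get(self, key):
        # Returns (versions, context): a single object, or a list of
        # causally conflicting versions, plus opaque system metadata.
        raise NotImplementedError

    def put(self, key, context, obj):
        # Writes obj under key; passing back the context from a prior
        # get() tells the system which version(s) this write supersedes.
        raise NotImplementedError

# Typical read-modify-write cycle; conflict resolution is the caller's job:
#   versions, ctx = client.get("cart:123")
#   merged = merge(versions)            # application-level reconciliation
#   client.put("cart:123", ctx, merged)
```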

Techniques used in Dynamo (problem / technique used / advantage):
– Partitioning / consistent hashing / incremental scalability
– High availability for writes / vector clocks with reconciliation during reads / version size is decoupled from update rates
– Handling temporary failures / sloppy quorum and hinted handoff / provides high availability and durability guarantees when some of the replicas are not available
– Recovering from permanent failures / anti-entropy using Merkle trees / synchronizes divergent replicas in the background
– Membership and failure detection / gossip-based membership protocol and failure detection / preserves symmetry and avoids a centralized registry for storing membership and node liveness information

Partitioning
For incremental scaling => dynamically partition the data as storage nodes are added/removed
Done by consistent hashing

Consistent hashing
[Ring figure: positions of nodes A, B, C and hashed keys h(key1), h(key2)]
Each node is assigned a random position on the ring
Each data item is hashed to a position on the ring and stored at the first node encountered moving clockwise

Incremental Scaling
[Ring figure: new node D added to the ring with A, B, C]
Problem: random node assignment => non-uniform data and load distribution
Solution: next slide

Load Balancing
[Ring figure: each physical node A, B, C, D appears at multiple positions on the ring]
Virtual nodes: each physical node is assigned to multiple points on the ring

Replication
For high availability and durability
Done by replicating each data item on N storage nodes

Replication
[Ring figure: nodes A–F; key1 stored at its coordinator and the next two clockwise nodes]
Each key is stored at its coordinator node and replicated at the N-1 clockwise successor nodes
With N=3, each node i stores the keys in the ranges (i-3, i-2], (i-2, i-1] and (i-1, i]
These N nodes are called the preference list of the key

Load Balancing
[Ring figure: virtual nodes of A, B, C, D; h(key2) falls near virtual nodes of C and B]
The preference list (PL) may contain multiple virtual nodes of the same physical node
– e.g., if N=3, the PL of key2 is {C, B, C}
Fix: allow only one virtual node of each physical node when constructing the PL (see the sketch below)
– e.g., the PL of key2 now becomes {C, B, A}
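A minimal Python sketch of consistent hashing with virtual nodes and duplicate-skipping preference lists (the hash function, node names and vnode count are illustrative assumptions):

```python
import hashlib
from bisect import bisect_right

def ring_pos(s: str) -> int:
    # Illustrative hash: MD5 truncated to a 32-bit ring position.
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (2 ** 32)

class Ring:
    def __init__(self, nodes, vnodes=8):
        # Each physical node is placed at several points (virtual nodes).
        self.points = sorted((ring_pos(f"{n}#{i}"), n)
                             for n in nodes for i in range(vnodes))
        self.positions = [p for p, _ in self.points]

    def preference_list(self, key, n=3):
        # Walk clockwise from the key's position, skipping virtual nodes
        # of physical nodes already chosen, until n distinct nodes remain.
        start = bisect_right(self.positions, ring_pos(key))
        chosen = []
        for i in range(len(self.points)):
            node = self.points[(start + i) % len(self.points)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == n:
                    break
        return chosen

ring = Ring(["A", "B", "C", "D"])
print(ring.preference_list("key2"))   # three distinct physical nodes
```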

Tradeoffs
– efficiency/scalability => partitioning
– availability => replication
– always writable => allowed to write to just one replica
– always writable + replicas + partitions => conflicting versions
Dynamo's solution: eventual consistency via data versioning

Eventual Consistency
– accept writes at any replica
– allow divergent replicas
– allow reads to see stale or conflicting data
– resolve conflicts when failures go away: the reader must merge and then write

Unhappy Consequences of Eventual Consistency
– no notion of a "latest version"
– reads can return multiple conflicting versions
– the application must merge and resolve conflicts
– no atomic operations (e.g., no PNUTS-style test-and-set-write)

Techniques Used
– Vector clocks (distributed time)
– "Sloppy quorum" with hinted handoff
– Anti-entropy mechanism using Merkle trees

Distributed Time
The notion of time is well-defined (and measurable) at each single location
But the relationship between time at different locations is unclear
– discrepancies can be minimized, but never eliminated
Example: if two file servers get different update requests to the same file, what should be the order of those requests?

A Baseball Example
Four locations: pitcher's mound (P), home plate, first base, and third base
[Diagram: baseball diamond with pitcher P, batter B at home plate, runner R on third base]
Ten events:
e1: pitcher (P) throws ball toward home
e2: ball arrives at home
e3: batter (B) hits ball toward pitcher
e4: batter runs toward first base
e5: runner runs toward home
e6: ball arrives at pitcher
e7: pitcher throws ball toward first base
e8: runner arrives at home
e9: ball arrives at first base
e10: batter arrives at first base

A Baseball Example
The pitcher knows e1 happens before e6, which happens before e7
The home-plate umpire knows e2 is before e3, which is before e4, which is before e8, ...
The relationship between e8 and e9 is unclear

Ways to Synchronize
Send a message from first base to home when the ball arrives?
– or both home and first base send messages to a central timekeeper when the runner/ball arrives
– but: how long does this message take to arrive?

Logical Time

Global Logical Time

Concurrency

Back to Baseball
e1: pitcher (P) throws ball toward home
e2: ball arrives at home
e3: batter (B) hits ball toward pitcher
e4: batter runs toward first base
e5: runner runs toward home
e6: ball arrives at pitcher
e7: pitcher throws ball toward first base
e8: runner arrives at home
e9: ball arrives at first base
e10: batter arrives at first base

Vector Clocks

Vector Clock Algorithm
Each location i keeps a vector V_i with one entry per location
– on a local event (including a send), location i increments V_i[i]
– every message carries the sender's vector
– on receipt, location j first sets V_j to the elementwise max of V_j and the message's vector, then increments V_j[j]
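A minimal Python sketch of this algorithm (the class and variable names are my own; the slot order follows the next slide's [p, f, h, t] convention):

```python
class VectorClock:
    """One clock per location; v has one slot per location."""

    def __init__(self, idx, n):
        self.idx, self.v = idx, [0] * n   # idx: this location's slot

    def event(self):
        # A local event (including a send) increments our own slot.
        self.v[self.idx] += 1
        return list(self.v)               # snapshot to attach to a message

    def receive(self, msg):
        # Merge: elementwise max, then count the receipt as an event.
        self.v = [max(a, b) for a, b in zip(self.v, msg)]
        return self.event()

# Reproducing e1 and e2 from the baseball example (slots [p, f, h, t]):
pitcher, home = VectorClock(0, 4), VectorClock(2, 4)
e1 = pitcher.event()     # [1, 0, 0, 0]: pitcher throws ball toward home
e2 = home.receive(e1)    # [1, 0, 1, 0]: ball arrives at home
```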

Vector clocks on the baseball example (vector slots [p, f, h, t]: pitcher, first base, home, third base)
Event  Vector     Action
e1     [1,0,0,0]  pitcher throws ball to home
e2     [1,0,1,0]  ball arrives at home
e3     [1,0,2,0]  batter hits ball to pitcher
e4     [1,0,3,0]  batter runs to first base
e5     [0,0,0,1]  runner runs to home
e6     [2,0,2,0]  ball arrives at pitcher
e7     [3,0,2,0]  pitcher throws ball to first base
e8     [1,0,4,1]  runner arrives at home
e9     [3,1,2,0]  ball arrives at first base
e10    [3,2,3,0]  batter arrives at first base

Important Points
Physical clocks: can be kept closely synchronized, but never perfectly
Logical clocks: encode the causality relationship
Vector clocks provide exact causality information

Vector Clocks in Dynamo
Consistency management:
– each put() creates a new, immutable version
– Dynamo tracks the version history
When vector clocks grow large, keep only the recently-updated entries
Example version evolution:
– D1 ([Sx,1]): write handled by Sx
– D2 ([Sx,2]): write handled by Sx
– D3 ([Sx,2],[Sy,1]): write handled by Sy
– D4 ([Sx,2],[Sz,1]): write handled by Sz (concurrent with D3)
– D5 ([Sx,3],[Sy,1],[Sz,1]): D3 and D4 reconciled and written by Sx
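A small Python sketch of the causality check this requires (representing clocks as node-to-counter dicts is an illustrative choice):

```python
def descends(a: dict, b: dict) -> bool:
    # True if version clock a has seen at least everything b has,
    # i.e. a supersedes b. Clocks map node -> counter.
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a: dict, b: dict) -> bool:
    # Neither descends from the other: causally unrelated versions.
    return not descends(a, b) and not descends(b, a)

D3 = {"Sx": 2, "Sy": 1}
D4 = {"Sx": 2, "Sz": 1}
D5 = {"Sx": 3, "Sy": 1, "Sz": 1}

print(concurrent(D3, D4))                  # True: both returned to the reader
print(descends(D5, D3), descends(D5, D4))  # True True: D5 reconciles both
```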

Execution of get() and put()
A consistency protocol similar to quorum
Quorum: R + W > N
– consider N healthy nodes
– R: the minimum number of nodes that must participate in a successful read operation
– W: the minimum number of nodes that must participate in a successful write operation
– never wait for all N
– but the R and W sets must overlap => at least one node in the read quorum will have seen the latest write

The main advantage of Dynamo is the flexible N, R, W. What do you get by varying them? Configurability:
N  R  W  Application
3  2  2  consistent, durable, interactive user state (typical configuration)
n  1  n  high-performance read engine
1  1  1  distributed web cache

Write by Quorum
[Ring figure: put(key1, v1) arrives at coordinator A; preference list: A, F, B]
Generate a vector clock for the new version and write locally (key1=v1)

Write by Quorum
[Ring figure: A forwards the write to F and B]
Send the new version to the top-N reachable nodes in the preference list

Write by Quorum
[Ring figure: a replica acknowledges; the coordinator reports success]
With W=2, the write is successful once W-1 other nodes respond (the coordinator's local write counts as the first)

Read by Quorum
[Ring figure: get(key1) arrives at coordinator A, which reads locally and forwards reads to F and B]
Request all existing versions from the top-N reachable nodes in the preference list

Read by Quorum
[Ring figure: replicas return v1; the coordinator returns v1 to the client]
With R=2, the read is successful once R nodes respond
If there are multiple causally unrelated versions, return all of them
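A simplified Python sketch of a quorum coordinator under these rules (sequential rather than parallel; the node-level read/write API and the distinct_versions helper are assumptions for illustration):

```python
N, R, W = 3, 2, 2   # R + W > N, so read and write quorums overlap

def put(key, value, preference_list, nodes):
    acks = 0
    for name in preference_list[:N]:        # top-N reachable nodes
        if nodes[name].write(key, value):   # assumed node-level API
            acks += 1
            if acks >= W:
                return True                 # success after W acks
    return False

def get(key, preference_list, nodes):
    replies = []
    for name in preference_list[:N]:
        version = nodes[name].read(key)     # assumed node-level API
        if version is not None:
            replies.append(version)
            if len(replies) >= R:
                # Return every causally unrelated version to the caller
                # (see the vector clock comparison sketch earlier).
                return distinct_versions(replies)
    raise RuntimeError("fewer than R replicas responded")

def distinct_versions(replies):
    # Simplified to deduplication; real Dynamo drops versions whose
    # vector clock is dominated by another reply's clock.
    out = []
    for v in replies:
        if v not in out:
            out.append(v)
    return out
```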

Failures: two levels
Temporary failures vs. permanent failures
A node is unreachable: what to do?
– if it is really dead, we need to make new copies to maintain fault tolerance
– if it is really dead, we want to avoid repeatedly waiting for it
– if the failure is just temporary, it is hugely wasteful to make new copies

Temporary failure handling: quorum
Goals:
– do not block waiting for unreachable nodes
– get should have a high probability of seeing the most recent puts
"Sloppy quorum": N is not all nodes, but the first N reachable nodes in the preference list
– each node pings its peers to keep a rough estimate of who is up/down

Sloppy Quorum
[Ring figure: put(key1, v2) arrives while B, which holds key1=v1, is unreachable]
Preference list: A, F, B => A, F, D

Sloppy Quorum
[Ring figure: coordinator A writes key1=v2 locally; B still holds the stale key1=v1]

Sloppy Quorum
[Ring figure: A forwards the write to F, and to D with "hint: B"]
Send the write to another node (D) with a hint naming the intended recipient (B, now temporarily down)

Sloppy Quorum
[Ring figure: the forwarded writes are acknowledged; the coordinator reports success]

Sloppy Quorum
[Ring figure: B recovers; D transfers the hinted replica key1=v2 to B]
When the dead node recovers, the hinted replica is transferred to it and deleted from the other node

Sloppy Quorum
[Ring figure: B holds key1=v2 again; the hinted copy at D is gone]
The preference list is A, F, B once more
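A Python sketch of sloppy-quorum writes with hinted handoff (the per-node store API, the alive() oracle and the function names are illustrative assumptions):

```python
N = 3

def put_sloppy(key, value, clockwise_nodes, preference_list, alive, stores):
    # clockwise_nodes: all nodes in ring order starting from the key's
    # position. Take the first N reachable nodes; a stand-in outside the
    # preference list stores the replica with a hint naming one of the
    # unreachable intended recipients.
    unreachable = iter(p for p in preference_list if not alive(p))
    written = 0
    for node in clockwise_nodes:
        if written == N:
            break
        if not alive(node):
            continue
        hint = None if node in preference_list else next(unreachable, None)
        stores[node].write(key, value, hint=hint)   # assumed node API
        written += 1

def hand_back(node, alive, stores):
    # Run periodically on each node: once an intended recipient is
    # reachable again, transfer the hinted replica and delete it locally.
    for key, value, hint in list(stores[node].hinted_replicas()):
        if hint is not None and alive(hint):
            stores[hint].write(key, value, hint=None)
            stores[node].delete(key)
```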

Permanent failure handling: anti-entropy
Anti-entropy: compare all the replicas of each piece of data and update each replica to the newest version
Merkle tree (a hash tree):
– a leaf is the hash of the value of an individual key
– an internal node is the hash of its children

Permanent failure handling: anti-entropy
Use a Merkle tree to detect inconsistencies between replicas:
– compare the hash values of the roots of the two (sub)trees
– equal: all leaves are equal => no synchronization needed
– not equal: compare the children of the two trees, recursing until the leaves are reached; the replicas that need synchronization are the leaves with different hash values
– no data is transferred at the internal-node level

Permanent failure handling: anti-entropy
Use of Merkle trees in Dynamo:
– each node maintains a Merkle tree for each key range it hosts on the ring
– two nodes hosting a common key range check whether the keys within that range are up to date by exchanging the roots of their Merkle trees
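A compact Python sketch of building a Merkle tree over a key range and recursing only into differing subtrees (both replicas are assumed to build trees over the same key partition; the function names are illustrative):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build(leaf_hashes):
    # Bottom-up construction: levels[0] holds the leaf hashes (one per
    # key's value); an odd node is carried up unchanged; root is last.
    levels = [list(leaf_hashes)]
    while len(levels[-1]) > 1:
        lvl = levels[-1]
        nxt = [h(lvl[i] + lvl[i + 1]) for i in range(0, len(lvl) - 1, 2)]
        if len(lvl) % 2:
            nxt.append(lvl[-1])
        levels.append(nxt)
    return levels

def diff(a, b, level=None, idx=0):
    # Compare from the roots down; descend only where hashes differ.
    # Returns the leaf indices (keys) that need synchronization.
    if level is None:
        level = len(a) - 1
    if a[level][idx] == b[level][idx]:
        return []                       # identical subtrees: nothing to sync
    if level == 0:
        return [idx]                    # divergent leaf: transfer this key
    out = diff(a, b, level - 1, 2 * idx)
    if 2 * idx + 1 < len(a[level - 1]):
        out += diff(a, b, level - 1, 2 * idx + 1)
    return out

replica_a = [h(f"value{i}".encode()) for i in range(8)]
replica_b = list(replica_a)
replica_b[5] = h(b"stale value")
print(diff(build(replica_a), build(replica_b)))   # [5]: only key 5 differs
```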

Membership and Failure Detection
Ring membership:
– addition of nodes (for incremental scaling)
– removal of nodes (due to failures)
One node is chosen (possibly at random each time) to write any membership change to a persistent store
A gossip-based protocol propagates membership changes: every second, each node contacts a peer chosen at random and the two reconcile their membership information
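A toy Python sketch of such gossip-based reconciliation (the view representation, node to (status, version) with higher versions winning, is an illustrative assumption):

```python
import random

def reconcile(view_a, view_b):
    # Merge two membership views, keeping the higher-versioned entry
    # for each node; entries are (status, version) pairs.
    merged = dict(view_a)
    for node, (status, version) in view_b.items():
        if node not in merged or merged[node][1] < version:
            merged[node] = (status, version)
    return merged

def gossip_round(views):
    # One exchange between two random nodes; in the real protocol every
    # node initiates one such exchange per second.
    a, b = random.sample(sorted(views), 2)
    views[a] = views[b] = reconcile(views[a], views[b])

# A change recorded at one node ("D joined", version 2) spreads to all.
views = {n: {"A": ("up", 1), "B": ("up", 1), "C": ("up", 1)} for n in "ABC"}
views["A"]["D"] = ("up", 2)
for _ in range(20):
    gossip_round(views)
print(all("D" in v for v in views.values()))   # very likely True
```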

Wrap-up
Main ideas: eventual consistency, consistent hashing, allowing conflicting writes, client-side merges
Maybe a good way to get high availability with no blocking on the WAN
An awkward model for some applications (stale reads, merges)
Services that use Dynamo: best-seller lists, shopping carts, customer preferences, session management, sales rank, product catalog, etc.
No agreement on whether it is the right design for storage systems
– unclear what has happened to Dynamo at Amazon in the meantime
– almost certainly significant changes (2007 -> 2016)