Dynamo: Amazon's Highly Available Key-value Store
Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, Werner Vogels
Amazon.com, SOSP 2007
Introduction
- What is Amazon?
  - Amazon provides an e-commerce platform for millions of customers
  - Data centers all over the world
  - Reliability is important since it impacts customer trust
Overview
- Founded in 1994; did not turn a profit until 2001
- Headquartered in Seattle, Washington, USA
- Started as an online bookstore
- Today it offers many products and services, including music, DVDs, videos, electronics, cameras and photography, apparel, shoes, etc.
- The largest online retailer in the US
Overview
- In 1999 it offered its own auction service
  - eBay had too much momentum
- Launched AmazonFresh, a grocery store service, in 2007
- Attracted 615 million users in 2008, twice as many as Walmart
- 50 million US users a month
Overview
- Amazon makes 40% of its money from third parties
- Features include one-click shopping, customer reviews, and order verification
Overview
- Amazon has many data centers:
  - Hundreds of services
  - Thousands of commodity machines
  - Millions of customers at peak times
  - Performance + Reliability + Efficiency = $$$$$
  - Outages are bad: customers lose confidence, the business loses money
  - Accidents happen
Requirements
- Need a distributed storage system that is:
  - Scalable
  - Simple: a key-value interface
  - Highly available
  - Able to guarantee Service Level Agreements (SLAs)
System Assumptions and Requirements
- Query model:
  - Simple read and write operations to a data item that is uniquely identified by a key
- ACID properties (Atomicity, Consistency, Isolation, Durability):
  - Weak consistency
  - Permits only single-key updates
System Assumptions and Requirements
- Efficiency:
  - Latency requirements, generally measured at the 99.9th percentile of the distribution
- Other assumptions:
  - The operating environment is assumed to be non-hostile; there are no security-related requirements such as authentication and authorization
Design Considerations
- Sacrifice strong consistency for availability
- Conflict resolution is executed during reads instead of writes, i.e. the store is "always writeable"
- Other principles:
  - Incremental scalability
  - Symmetry
  - Decentralization
  - Heterogeneity
Summary of Techniques

Problem | Technique | Advantage
Partitioning | Consistent hashing | Incremental scalability
High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates
Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantees when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background
Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids a centralized registry for storing membership and node liveness information
Data Storage
- Dynamo is a distributed storage system
- It stores information to be retrieved later
- It does not store data in tables
- The data stored per object is relatively small
  - GFS/Bigtable assume large amounts of data
- Instead, all objects are stored and looked up via a key
Data Storage
- Say you go to the web page for "The Catcher in the Rye", which has the URL Salinger/dp/ /ref=pd_sim_b_1
- What do you see?
  - Description
  - Customer reviews
  - Related books
  - etc.
Data Storage
- To render this page, Amazon's infrastructure has to perform many database lookups
- Example: it grabs information about the book from its URL or from its ASIN, the unique identifier that Amazon assigns to a product
- Dynamo offers a simple put and get interface
- Features:
  - Physical nodes are thought of as organized in a ring
  - Virtual nodes are created by the system and mapped to physical nodes, so that hardware can be swapped out for maintenance and failure
  - The partitioning algorithm (Chord-like) specifies which nodes will store an object
  - Every object is replicated to N nodes
Request Routing
- Requests come in; how do we find the data they specify?
- Data is partitioned among a set of storage hosts
- Consistent hashing is used (Chord-like)
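As a sketch of the scheme above, consistent hashing with virtual nodes might look like the following. This is illustrative, not Dynamo's actual implementation: the hash function, token naming, and the `Ring` class are our assumptions.

```python
import hashlib
from bisect import bisect_right

def _hash(key: str) -> int:
    # Map a string onto a point on the ring (MD5 keeps the sketch simple).
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hash ring with several virtual nodes per physical node."""

    def __init__(self, nodes, vnodes=4):
        # Each physical node owns `vnodes` points ("tokens") on the ring.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )

    def preference_list(self, key, n=3):
        """Walk clockwise from the key's position, collecting the first
        n distinct physical nodes: the replicas responsible for the key."""
        idx = bisect_right([p for p, _ in self._points], _hash(key))
        out = []
        for i in range(len(self._points)):
            node = self._points[(idx + i) % len(self._points)][1]
            if node not in out:
                out.append(node)
            if len(out) == n:
                break
        return out
```

Because placement depends only on hash positions, adding or removing one node relocates only the keys adjacent to its tokens, which is the incremental scalability the techniques table refers to.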
Replication
- Each data item is replicated at N hosts
- Preference list: the list of nodes responsible for storing a particular key
- The node handling a read/write operation is the coordinator
  - Typically the first node in the preference list
- Read/write operations involve the healthy nodes in the list
Replication
- To account for node failures, the preference list contains more than N nodes
Execution of put()/get()
- For a put() operation the coordinator:
  - Generates the vector clock for the new version and writes the new version locally
  - Sends the new version (with the new vector clock) to the N highest-ranked reachable nodes
  - If at least W-1 nodes respond, the write is considered successful
  - A put() call may return to its caller before the update has been applied at all the replicas
Execution of put()/get()
- For a get() operation the coordinator:
  - Requests all existing versions of the data for that key from the N highest-ranked reachable nodes in the preference list for that key
  - Waits for R responses before returning the result to the client
  - If the coordinator ends up gathering multiple versions of the data, it returns all the versions it deems to be causally unrelated
  - The divergent versions are reconciled, and the reconciled version is written back (more later)
Execution of put()/get()
- Two configurable parameters: R, W
  - R is the minimum number of nodes that must participate in a successful read operation
  - W is the minimum number of nodes that must participate in a successful write operation
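A toy coordinator illustrating how R and W gate success. The `replicas` list of dicts is a hypothetical stand-in for remote nodes (a `None` entry models an unreachable node); real Dynamo sends network requests to the nodes in the preference list.

```python
def put(replicas, key, value, w):
    """Toy coordinator write: succeed once w replicas acknowledge.
    Each replica is a dict acting as that node's local store."""
    acks = 0
    for rep in replicas:
        if rep is not None:
            rep[key] = value
            acks += 1
        if acks >= w:
            return True
    return acks >= w

def get(replicas, key, r):
    """Toy coordinator read: gather r responses and return every version
    seen; reconciling divergent versions is left to the application."""
    seen = []
    for rep in replicas:
        if rep is not None and key in rep:
            seen.append(rep[key])
        if len(seen) >= r:
            break
    return seen
```

Choosing R + W > N makes the read set and write set of replicas overlap, so a successful read observes at least one copy of the latest successful write; the sliders trade latency and availability against consistency.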
Data Versioning
- Updates are propagated to all replicas asynchronously
- A put() call may return to its caller before the update has been applied at all the replicas
- A subsequent get() operation may therefore return an object that does not have the latest updates
Data Versioning
- OK, so why do we reconcile on reads and not on writes?
- Dynamo wants an always-writeable data store
  - Rejecting customer updates could result in a poor customer experience
  - Do we want a shopping cart we often have to wait on? (The answer, it turns out, is no)
- The advantage of reconciling different versions on the read path is that it can be done based on the application's needs
Data Versioning
- Example: shopping cart
  - An "add to cart" operation can never be forgotten or rejected
  - If the most recent state of the cart is unavailable and a user makes a change to an older version of the cart, that change is considered meaningful and should be preserved
  - However, it should not supersede the currently unavailable state of the cart
  - "Add to cart" and "delete item from cart" operations are translated into Dynamo put operations
Data Versioning
- When a customer wants to add to or remove from a cart and the latest version is not available, the item is added to or removed from the version the node has
- Most of the time, new versions subsume previous versions
- Version branching may happen, resulting in conflicting versions of the object
- The client application must perform reconciliation to collapse multiple branches back into one
- Adds are preserved
- Deleted items can resurface
- Dynamo uses vector clocks to track causality
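One plausible application-level reconciliation matching the behavior above, sketched under the simplifying assumption that a cart is just a set of item names (Amazon's actual cart logic is richer than this): merge divergent versions by set union.

```python
def reconcile_carts(versions):
    """Merge causally-unrelated cart versions by union. Adds are never
    lost; the known anomaly is that an item deleted in one branch but
    present in another resurfaces in the merged cart."""
    merged = set()
    for cart in versions:
        merged |= cart
    return merged
```

This is exactly the trade-off the slides describe: the union preserves every add, at the cost of occasionally resurrecting a deleted item.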
Data Versioning
- Associate a vector clock with each value
  - A versioned value is a (value, vector clock) tuple
  - Multiple versioned values can exist for a key
  - A vector clock lets us determine causality
  - If two versioned values are not causally related, the application is allowed to reconcile them
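A minimal sketch of the vector-clock checks, assuming a clock is a dict mapping node name to an update counter (the function names are ours, not Dynamo's API):

```python
def descends(a: dict, b: dict) -> bool:
    """True if clock a causally descends from (or equals) clock b,
    i.e. a has seen every update that b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a: dict, b: dict) -> bool:
    # Neither version descends from the other: a conflict that the
    # application must reconcile at read time.
    return not descends(a, b) and not descends(b, a)

def merge(a: dict, b: dict) -> dict:
    """Pointwise max: the clock of a reconciled write that subsumes
    both diverged versions."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}
```

When `descends(new, old)` holds, the new version simply replaces the old one; only `concurrent` pairs are returned to the client for reconciliation.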
Failure Handling
- If node A is temporarily down or unreachable during a write operation, a replica is put on node D instead
- The replica sent to D carries a hint in its metadata indicating that A is the intended recipient
- When A recovers, D sends the replica to A
- This is called hinted handoff
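A minimal sketch of hinted handoff; the `Node` class and its method names are illustrative assumptions, not Dynamo's code.

```python
class Node:
    """Toy storage node that can hold hinted replicas for a peer."""

    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> value
        self.hints = {}  # intended node's name -> list of (key, value)

    def put(self, key, value, hint_for=None):
        self.store[key] = value
        if hint_for:
            # Remember that this replica really belongs on `hint_for`.
            self.hints.setdefault(hint_for, []).append((key, value))

    def deliver_hints(self, recovered):
        """Once the intended node is reachable again, hand the hinted
        replicas back and drop them from the local hint list."""
        for key, value in self.hints.pop(recovered.name, []):
            recovered.store[key] = value
```

The hint keeps the write durable and available while A is down, without permanently changing which node owns the key.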
Replica Synchronization
- What if a hinted replica becomes unavailable before it can be returned to the original replica node?
- Dynamo implements an anti-entropy protocol
- It uses Merkle trees
  - Leaves are hashes of the values of individual keys
Replica Synchronization
- Leaves are hashes of the values of individual keys
- Parent nodes are hashes of their respective children
- Two nodes can exchange their roots to determine whether anything has changed
Replica Synchronization
- Trees can be compared incrementally, without transferring the whole tree
- If a part of the tree is unmodified, the corresponding parent hashes will be identical
- So parts of the tree can be compared without sending data between the two replicas
- Only keys that are out of sync are transferred
Replica Synchronization
- Each node maintains a Merkle tree for each key range it hosts
  - Remember that a physical node hosts a set of virtual nodes
  - Each virtual node is associated with a key range
  - Virtual nodes are replicated on other physical nodes
- Say a node goes down
  - When it comes back up, it can request another node's Merkle tree
  - It compares the Merkle trees to determine whether they differ
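A minimal Merkle-root construction over one key range, assuming SHA-256 and a list of raw values; in the real protocol, two replicas that find unequal roots would then walk down the mismatching subtrees to localize the out-of-sync keys, but here we only build the root that is exchanged first.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(values):
    """Build a Merkle tree bottom-up and return its root hash.
    Leaves hash individual values; each parent hashes its two children."""
    level = [h(v) for v in values]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```

If two replicas compute equal roots for a key range, the whole range is in sync and no data needs to cross the network, which is what makes anti-entropy cheap in the common case.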
Summary
- We discussed the different algorithms and showed how they fit within Amazon's infrastructure