Copysets: Reducing the Frequency of Data Loss in Cloud Storage

Presentation on theme: "Copysets: Reducing the Frequency of Data Loss in Cloud Storage"— Presentation transcript:

1 Copysets: Reducing the Frequency of Data Loss in Cloud Storage
Asaf Cidon, Stephen M. Rumble, Ryan Stutsman, Sachin Katti, John Ousterhout and Mendel Rosenblum (Stanford University)
Hi everyone, my name is Asaf Cidon from Stanford University. Today I’m going to talk about techniques that control the frequency of data loss in cloud storage systems. This is joint work with Steve Rumble, Ryan Stutsman, Sachin Katti, John Ousterhout and Mendel Rosenblum.

2 Goal: Tolerate Node Failures
Random replication is used by: HDFS, GFS, Windows Azure, RAMCloud.
Cloud storage systems typically spray their data across thousands of commodity servers. With thousands of nodes there is a high likelihood of node failures, so one of the main goals of these systems is to tolerate them. The common approach is to replicate each data chunk on 3 randomly chosen servers on different racks. This technique is used by most cloud storage systems, most prominently Hadoop’s HDFS, Google’s GFS, Windows Azure and RAMCloud. If you assume failures are independent, the chance of losing all three copies is nearly zero.
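Below is a minimal sketch of this random placement policy, assuming a flat list of nodes. Real systems add rack-awareness and load balancing; the function and names here are illustrative, not the actual HDFS or GFS code.

```python
import random

def random_replicate(nodes, r=3):
    """Place one chunk's replicas on r distinct nodes chosen uniformly at random."""
    return random.sample(nodes, r)

# Example: one chunk, 3-way replicated, in a 9-node cluster.
nodes = list(range(1, 10))
print(random_replicate(nodes))        # e.g. [1, 5, 6]
```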

3 Not All Failures are Independent
Power outages, 1-2 times a year [Google, LinkedIn, Yahoo]. Large-scale network failures, 5-10 times a year [Google, LinkedIn]. And more: rolling software/hardware upgrades, planned power downs.
Unfortunately, not all node failures are independent. There are frequent correlated failures, where multiple nodes fail at the same time. These can be caused by cluster-wide power outages, in which the entire cluster loses power and a small percentage of nodes (usually about 1%) fails to reboot properly.

4 Random Replication Fails Under Simultaneous Failures
In this talk we’ll focus on one particular failure mode, power outages, where some percentage of machines fails to reboot. This graph shows the probability of losing all copies of at least one chunk (y axis) as a function of the number of nodes in the cluster (x axis), when 1% of the nodes fail at the same time. As you can see, the probability of losing data in this scenario is very high. This effect has been documented by several system operators; confirmed by: Facebook, Yahoo, LinkedIn. Intuitively, with enough chunks, almost any three nodes that fail together hold all copies of some chunk in common.
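The shape of this curve can be approximated with a rough back-of-the-envelope calculation. This is a sketch under simplifying assumptions (3-way replication, an illustrative 10,000 chunks per node, and chunks treated as independent); it is not the paper's exact model, but it shows why the probability climbs toward 1 as clusters grow.

```python
from math import comb

def p_data_loss(num_nodes, r=3, failed_frac=0.01, chunks_per_node=10_000):
    """Approximate P(at least one chunk loses all r replicas) under random
    replication when failed_frac of the nodes fail simultaneously."""
    failed = int(num_nodes * failed_frac)
    if failed < r:
        return 0.0
    # Probability that a single chunk's r replicas all landed on failed nodes.
    p_chunk = comb(failed, r) / comb(num_nodes, r)
    total_chunks = num_nodes * chunks_per_node
    return 1 - (1 - p_chunk) ** total_chunks

for n in (300, 1000, 5000, 10_000):
    print(n, round(p_data_loss(n), 4))   # probability of loss rises quickly with cluster size
```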

5 Random Replication
To understand why this happens, let’s look at the following example. Assume we have a cluster of 9 nodes, Node 1 through Node 9.

6 Random Replication
Node 1 randomly places a chunk’s replicas on nodes 5 and 6.

7 Random Replication
From the perspective of this single chunk, we will only lose data if nodes 1, 5 and 6 fail at the same time.

8 Random Replication
Now let’s add another chunk, from node 2, that is randomly replicated on nodes 6 and 8.

9 Random Replication
Now we will only lose data if we either lose the combination of nodes 1, 5 and 6, or the combination of nodes 2, 6 and 8.

10 Random Replication
It’s time to introduce a key concept of our paper, called a copyset. A copyset is a set of nodes that contains all of the copies of a single chunk. For example, nodes 1, 5 and 6 form a copyset. A copyset is in essence a unit of failure: if all of the nodes of a copyset fail at the same time, we will lose data. Copysets so far: {1, 5, 6}, {2, 6, 8}.

11 Random Replication

12 Random Replication
As more chunks are added, random replication keeps creating new copysets: {1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 2, 6}, {1, 2, 7}, {1, 2, 8}, ...

13 Random Replication Causes Frequent Data Loss
Random replication eventually creates the maximum number of copysets. With 9 nodes, any combination of 3 nodes eventually becomes a copyset, for a total of C(9,3) = 84 copysets. If any 3 nodes fail simultaneously, there is a 100% probability of data loss.
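A small sketch that illustrates this claim for the 9-node example (the chunk count is arbitrary, chosen just to saturate the combinations):

```python
import random
from math import comb

nodes = list(range(1, 10))
copysets = set()
for _ in range(1000):                                   # keep placing chunks at random
    copysets.add(tuple(sorted(random.sample(nodes, 3))))

print(len(copysets), "of", comb(9, 3), "possible copysets used")   # quickly reaches 84
# Once every 3-node combination is a copyset, any 3 simultaneous node
# failures are guaranteed to destroy all copies of some chunk.
```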

14 MinCopysets

15 MinCopysets
MinCopysets divides the nodes into fixed replication groups, and every chunk is replicated entirely within a single group. Copysets: {1, 5, 7}, {2, 4, 9}, {3, 6, 8}.

16 MinCopysets Minimizes Data Loss Frequency
MinCopysets creates the minimum number of copysets: only {1, 5, 7}, {2, 4, 9}, {3, 6, 8}. If 3 nodes fail simultaneously, the probability of data loss is only about 3.5% (3 of the C(9,3) = 84 possible 3-node combinations).
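A minimal sketch of MinCopysets placement using the static groups from this example (in the actual scheme nodes are assigned to groups as they join the cluster; the helper names here are mine):

```python
from math import comb

# Static replication groups: each group is the only copyset it ever creates.
groups = [(1, 5, 7), (2, 4, 9), (3, 6, 8)]
group_of = {node: g for g in groups for node in g}

def mincopysets_place(primary_node):
    """All replicas of a chunk stay inside the primary node's replication group."""
    return group_of[primary_node]

print(mincopysets_place(5))            # (1, 5, 7)
# If 3 of the 9 nodes fail simultaneously, data is lost only when the failed
# set is exactly one of the 3 groups: 3 / C(9,3) = 3/84, roughly 3.5%.
print(len(groups) / comb(9, 3))
```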

17 MinCopysets Reduces Probability of Data Loss
In terms of reducing the probability of data loss, this more or less eliminates the problem. But it does have two major disadvantages.

18 The Trade-off
5,000-node cluster, one power outage per year:
                       MinCopysets    Random Replication
Mean Time to Failure   625 years      1 year
Amount of Data Lost    1 TB           5.5 GB
Many system designers prefer this trade-off, including Facebook and LinkedIn. Confirmed by: Facebook, LinkedIn, NetApp, Google.

19 Problem: MinCopysets Increases Single Node Recovery Time
With random replication, a failed node’s data is sprayed across a whole bunch of nodes, so it can be reconstituted very quickly in parallel. With MinCopysets, you more or less have to copy all of the data from the two other nodes in its group.

20 Facebook Extension to HDFS
Many HDFS operators have noticed this and worked around it. Facebook’s extension to HDFS is one example: instead of choosing replica locations at random across the whole cluster, the replicas are chosen randomly within a small "buddy group" of nodes around the primary.

21 A Compromise XXX – Facebook extension to HDFS

22 Can We Do Better? Facebook extension to HDFS

23 Definition: Scatter Width
A node’s scatter width is the number of other nodes that store copies of its data, i.e. the number of nodes its data can be recovered from in parallel. MinCopysets: scatter width = 2. Facebook extension to HDFS: scatter width = 10.
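To make the definition concrete, here is a small helper that computes a node's scatter width from whatever set of copysets a placement scheme produces (a sketch, not the paper's code):

```python
def scatter_width(node, copysets):
    """Number of distinct other nodes that hold copies of this node's data."""
    partners = {m for cs in copysets if node in cs for m in cs}
    return len(partners - {node})

mincopysets = [(1, 5, 7), (2, 4, 9), (3, 6, 8)]
print(scatter_width(1, mincopysets))   # 2: node 1's data lives only on nodes 5 and 7
```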

24 Facebook Extension to HDFS
Each node replicates its chunks within a buddy group: a sliding window of nodes that follow it. In this 9-node example the window has 4 nodes, so Node 1’s copysets are: {1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 3, 4}, {1, 3, 5}, {1, 4, 5}. Overall: 54 copysets. If 3 nodes fail simultaneously, the probability of data loss is 54/84, about 64%.
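A sketch of this buddy-group (sliding window) placement for the 9-node example. The window size of 4 is inferred from the copysets listed on the slide, and the wrap-around at the end of the node list is an assumption that reproduces the 54 copysets:

```python
from itertools import combinations
from math import comb

N, R, WINDOW = 9, 3, 4
nodes = range(1, N + 1)

def node_copysets(primary):
    """Secondary replicas go on 2 of the WINDOW nodes that follow the primary (wrapping)."""
    buddies = [(primary - 1 + i) % N + 1 for i in range(1, WINDOW + 1)]
    return [tuple(sorted((primary,) + pair)) for pair in combinations(buddies, R - 1)]

all_copysets = {cs for n in nodes for cs in node_copysets(n)}
print(sorted(node_copysets(1)))                    # the 6 copysets listed for node 1
print(len(all_copysets))                           # 54
print(round(len(all_copysets) / comb(N, R), 2))    # 54/84 ~ 0.64 chance of loss if 3 nodes fail
```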

25 Copyset Replication: Intuition
Same scatter width (4), but a different scheme: {1, 2, 3}, {4, 5, 6}, {7, 8, 9} plus {1, 4, 7}, {2, 5, 8}, {3, 6, 9}, which is only 6 copysets. Ingredients of the ideal scheme: maximize scatter width, minimize the overlap between copysets. If you place chunks carefully so that they create a small number of copysets, you get much better reliability for the same scatter width. This paper explores the idea that, rather than randomly spraying data, you can place it so that you get both reliable storage and fast recovery.

26 Copyset Replication: Initialization
Solving this optimally is a very complex problem. In the paper we present a heuristic scheme, which we call Copyset Replication. Take the nodes 1 2 3 4 5 6 7 8 9, draw a random permutation, for example 7 3 5 6 2 9 1 8 4, and split it into consecutive copysets of three: {7, 3, 5}, {6, 2, 9}, {1, 8, 4}. A single permutation gives each node a scatter width of 2.
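A sketch of this initialization phase: draw random permutations of the nodes and chop each into consecutive copysets of size R. The number of permutations is chosen to reach the desired scatter width, roughly S / (R - 1); the function and variable names below are mine, not the paper's code.

```python
import random
from math import ceil

def copyset_replication_init(nodes, r=3, scatter_width=2):
    """Build ceil(S / (r-1)) random permutations and split each into consecutive
    groups of r nodes; every group becomes a copyset."""
    num_perms = ceil(scatter_width / (r - 1))
    copysets = []
    for _ in range(num_perms):
        perm = list(nodes)
        random.shuffle(perm)
        # Note: if len(nodes) is not divisible by r, this sketch drops the tail.
        copysets += [tuple(perm[i:i + r]) for i in range(0, len(perm) - r + 1, r)]
    return copysets

nodes = list(range(1, 10))
print(copyset_replication_init(nodes, scatter_width=2))   # 3 copysets, one permutation
print(copyset_replication_init(nodes, scatter_width=4))   # 6 copysets, two permutations
```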

27 Copyset Replication: Initialization
To increase the scatter width, add more permutations. Permutation 1 (7 3 5 6 2 9 1 8 4) gives scatter width 2; adding permutation 2 (9 7 1 5 6 8 4 2 3) raises it to 4; by permutation 5 the scatter width reaches 10.

28 Copyset Replication: Replication
To replicate a chunk, place the primary replica on any node, then randomly choose one of that node’s copysets (from the permutations 7 3 5 | 6 2 9 | 1 8 4 and 9 7 1 | 5 6 8 | 4 2 3) and place the secondary replicas on its other two members.
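And a sketch of the replication phase to match: the primary replica can go on any node, and the secondaries land on the other members of one randomly chosen copyset that contains that node. The copysets below come from the two permutations on the previous slides.

```python
import random

def copyset_replicate(primary, copysets):
    """Pick a random copyset containing the primary node; all replicas of the
    chunk are placed on exactly that copyset."""
    candidates = [cs for cs in copysets if primary in cs]
    return random.choice(candidates)

copysets = [(7, 3, 5), (6, 2, 9), (1, 8, 4),   # from permutation 7 3 5 6 2 9 1 8 4
            (9, 7, 1), (5, 6, 8), (4, 2, 3)]   # from permutation 9 7 1 5 6 8 4 2 3
print(copyset_replicate(7, copysets))          # either (7, 3, 5) or (9, 7, 1)
```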

29 Insignificant Overhead

30 Copyset Replication

31 Inherent Trade-off

32 Related Work BIBD (Balanced Incomplete Block Designs)
Originally proposed for designing agricultural experiments in the 1930s! [Fisher, ’40] Other applications: power downs [Harnik et al. ’09, Leverich et al. ’10, Thereska ’11], multi-fabric interconnects [Mehra, ’99].

33 Summary
Many storage systems randomly spray their data across a large number of nodes. This causes a serious problem under correlated failures. Copyset Replication is a better way of spraying data that decreases the probability of data loss under correlated failures.

34 Thank You! Stanford University
Intro notes: We initially designed a replication system for RAMCloud called MinCopysets and gave a talk on it. We saw that it doesn’t translate well to disk-based systems because it hurts their node recovery time. We then designed a replication scheme, Copyset Replication, that addresses this.

35 More Failures (Facebook)

36 RAMCloud

37 HDFS

