Published by Semaj Reaney. Modified over 2 years ago.

1
**Copysets: Reducing the Frequency of Data Loss in Cloud Storage**

Asaf Cidon, Stephen M. Rumble, Ryan Stutsman, Sachin Katti, John Ousterhout and Mendel Rosenblum

Stanford University

Hi everyone, my name is Asaf Cidon, from Stanford University. Today I’m going to talk about techniques that control the frequency of data loss in cloud storage systems. This is joint work with Steve Rumble, Ryan Stutsman, Sachin Katti, John Ousterhout and Mendel Rosenblum.

2
**Goal: Tolerate Node Failures**

Random replication is used by: HDFS, GFS, Windows Azure, RAMCloud, …

Cloud storage systems typically spray their data across thousands of commodity servers. When you have thousands of nodes, there’s a high likelihood of node failures, so one of the main goals of these systems is to tolerate node failures. The common approach is to replicate each data chunk on 3 random servers on different racks. This technique is used by most cloud storage systems, prominently by Hadoop, Google, Windows Azure and RAMCloud. If you assume independent failures, the chance of losing all three copies is nearly 0.

3
**Not All Failures are Independent**

Power outages: 1-2 times a year [Google, LinkedIn, Yahoo]. Large-scale network failures: 5-10 times a year [Google, LinkedIn]. And more: rolling software/hardware upgrades, power downs.

Unfortunately, not all node failures are independent. There are frequent correlated failures, where multiple nodes fail at the same time. These can be caused by cluster power outages, where the entire cluster loses power and a small percentage of nodes (usually about 1%) doesn’t reboot properly.

4
**Random Replication Fails Under Simultaneous Failures**

In this talk we’ll focus on one particular failure, namely power outages, where some percentage of machines fails to reboot. This graph shows the probability of losing all copies of at least one chunk, on the y axis, as a function of the number of nodes in the cluster, on the x axis, when 1% of the nodes fail at the same time. As you can see, the probability of losing data in this scenario is very high. This effect has been documented by several system operators, such as Yahoo, LinkedIn and Facebook. For any three nodes that fail, there are probably some blocks in common.

Confirmed by: Facebook, Yahoo, LinkedIn
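The shape of that curve can be approximated with a short calculation. The sketch below is not taken from the talk: it assumes that each chunk's 3 replicas land on uniformly random distinct nodes (ignoring the rack constraint), that chunks are placed independently, and that each node holds a hypothetical 10,000 chunk replicas. The function name and its parameters are illustrative only.

```python
from math import comb

def loss_probability(num_nodes, replicas_per_node=10_000,
                     replication=3, failed_fraction=0.01):
    """Approximate P(at least one chunk loses all of its replicas)."""
    failed = round(num_nodes * failed_fraction)
    num_chunks = num_nodes * replicas_per_node // replication
    # Chance that one chunk's replica set lies entirely inside the failed nodes.
    p_chunk = comb(failed, replication) / comb(num_nodes, replication)
    # Treat chunk placements as independent of each other.
    return 1 - (1 - p_chunk) ** num_chunks

for n in (1000, 2000, 5000, 10000):
    print(n, round(loss_probability(n), 3))
```

Under these assumed numbers the probability is already high for a 1,000-node cluster and approaches 1 as the cluster grows, which is the qualitative behaviour described above.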

5
**Random Replication**

To understand why this happens, let’s look at the following example. Assume we have a cluster of 9 nodes.

6
**Random Replication**

Node number 1 randomly places a chunk on nodes 5 and 6.

7
**Random Replication**

From the perspective of this single chunk, we will only lose data if nodes 1, 5 and 6 fail at the same time.

8
**Random Replication**

Now let’s add another chunk, from node 2, that is randomly replicated on nodes 6 and 8.

9
**Random Replication**

Now we will lose data only if we lose either the combination of nodes 1, 5 and 6, or the combination of nodes 2, 6 and 8.

10
**Random Replication**

It’s time to introduce a key concept of our paper, called a copyset. A copyset is a set of nodes that contains all copies of a single chunk. For example, nodes 1, 5 and 6 form a copyset. A copyset is in essence a unit of failure, because if all nodes of a copyset fail at the same time, we will lose data.

Copysets: {1, 5, 6}, {2, 6, 8}
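The "unit of failure" idea can be written as a one-line check: data is lost exactly when some copyset is fully contained in the set of failed nodes. This is a minimal sketch using the two copysets from the example; the function name `data_lost` is made up for illustration.

```python
def data_lost(copysets, failed_nodes):
    """Return True if some copyset lies entirely within the failed nodes."""
    failed = set(failed_nodes)
    return any(set(cs) <= failed for cs in copysets)

copysets = [{1, 5, 6}, {2, 6, 8}]
print(data_lost(copysets, {1, 5, 6}))  # every replica of the first chunk is gone
print(data_lost(copysets, {1, 2, 6}))  # each chunk still has a surviving replica
```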

11
**Random Replication**

12
**Random Replication**

Copysets: {1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 2, 6}, {1, 2, 7}, {1, 2, 8}, …

13
**Random Replication Causes Frequent Data Loss**

Random replication eventually creates the maximum number of copysets: any combination of 3 nodes, which is 84 copysets for 9 nodes. If 3 nodes fail, there is a 100% probability of data loss.
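A small simulation illustrates the saturation. It is a sketch under simplifying assumptions (9 nodes, 3 replicas per chunk, all three replica nodes chosen uniformly at random, no rack constraint), and the helper name is hypothetical.

```python
import random
from math import comb

def random_replication_copysets(num_nodes, replication, num_chunks, seed=0):
    """Place chunks on random node sets and collect the distinct copysets."""
    rng = random.Random(seed)
    nodes = range(1, num_nodes + 1)
    copysets = set()
    for _ in range(num_chunks):
        copysets.add(frozenset(rng.sample(nodes, replication)))
    return copysets

print("possible copysets:", comb(9, 3))
for chunks in (10, 100, 1000, 5000):
    used = len(random_replication_copysets(9, 3, chunks))
    print(chunks, "chunks ->", used, "copysets")
```

Once every 3-node combination is in use, any simultaneous failure of 3 nodes removes all replicas of at least one chunk.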

14
**MinCopysets**

15
**MinCopysets**

Copysets: {1, 5, 7}, {2, 4, 9}, {3, 6, 8}

16
**MinCopysets Minimizes Data Loss Frequency**

MinCopysets creates the minimum number of copysets: only {1, 5, 7}, {2, 4, 9}, {3, 6, 8}. If 3 nodes fail, there is a 3.5% probability of data loss.
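The figure can be checked by enumerating every way 3 of the 9 nodes can fail and counting the cases that wipe out a whole copyset. This is a sketch that assumes all 3-node failure sets are equally likely; the function name is illustrative. It gives 3 out of 84, about 3.6%, which the slide states as 3.5%.

```python
from itertools import combinations

def loss_probability(copysets, num_nodes, num_failed):
    """Fraction of equally likely failure sets that contain a whole copyset."""
    copysets = [frozenset(cs) for cs in copysets]
    outcomes = list(combinations(range(1, num_nodes + 1), num_failed))
    losing = sum(1 for failed in outcomes
                 if any(cs <= set(failed) for cs in copysets))
    return losing / len(outcomes)

min_copysets = [{1, 5, 7}, {2, 4, 9}, {3, 6, 8}]
print(loss_probability(min_copysets, num_nodes=9, num_failed=3))
```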

17
**MinCopysets Reduces Probability of Data Loss**

In terms of reducing the probability of data loss, this more or less eliminates the problem. But it does have two major disadvantages.

18
**The Trade-off**

| | MinCopysets | Random Replication |
|---|---|---|
| Mean Time to Failure | 625 years | 1 year |
| Amount of Data Lost | 1 TB | 5.5 GB |

Setting: 5000-node cluster; a power outage occurs every year.

Many system designers prefer this trade-off, including Facebook and LinkedIn.

Confirmed by: Facebook, LinkedIn, NetApp, Google

19
**Problem: MinCopysets Increases Single Node Recovery Time**

With random replication, since you sprayed your data across a whole bunch of nodes, you can reconstitute it very quickly, whereas with MinCopysets you more or less have to copy all the data from two nodes.

20
**Facebook Extension to HDFS**

Many HDFS users have noticed this and worked around it. Here’s an example.

Figure labels: Choose random; Buddy group

21
**A Compromise**

XXX – Facebook extension to HDFS

22
**Can We Do Better?**

Facebook extension to HDFS

23
**Definition: Scatter Width**

MinCopysets: scatter width = 2. Facebook extension to HDFS: scatter width = 10.
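The slide text does not spell out the definition. The sketch below assumes that scatter width means the number of other nodes that can hold replicas of a given node's chunks, a reading that reproduces the value 2 quoted for MinCopysets. The function name is illustrative.

```python
def scatter_width(copysets, node):
    """Count the distinct other nodes that share at least one copyset with `node`."""
    peers = set()
    for cs in copysets:
        if node in cs:
            peers |= set(cs) - {node}
    return len(peers)

min_copysets = [{1, 5, 7}, {2, 4, 9}, {3, 6, 8}]
print([scatter_width(min_copysets, n) for n in range(1, 10)])
```

With a scatter width of 2, a failed node's data can only be rebuilt from two other nodes, which is why recovery is slow under MinCopysets.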

24
**Facebook Extension to HDFS**

Nodes: 1 2 3 4 5 6 7 8 9

Node 1’s copysets: {1, 2, 3}, {1, 2, 4}, {1, 2, 5}, {1, 3, 4}, {1, 3, 5}, {1, 4, 5}

Overall: 54 copysets

If 3 nodes fail simultaneously:

Figure label: Buddy group

Presenter notes: Maybe graphical? Do not introduce terminology. Show a sliding window.
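The count of 54 can be reproduced with a sliding-window model. This is an assumption about the scheme rather than a description of Facebook's code: each node's buddy group is taken to be the next 4 nodes (wrapping around), and the two secondary replicas are any 2 nodes from that group. The slide leaves the failure probability blank; the last line only prints the ratio that this model implies.

```python
from itertools import combinations
from math import comb

def windowed_copysets(num_nodes, window, replication=3):
    """Copysets formed by a primary plus secondaries from its sliding window."""
    copysets = set()
    for primary in range(num_nodes):
        buddies = [(primary + k) % num_nodes for k in range(1, window + 1)]
        for secondaries in combinations(buddies, replication - 1):
            # Shift to 1-based node numbers to match the slide.
            copysets.add(frozenset(n + 1 for n in (primary, *secondaries)))
    return copysets

copysets = windowed_copysets(num_nodes=9, window=4)
# Copysets in which node 1 is the primary (its window is nodes 2 to 5).
node1 = sorted(sorted(cs) for cs in copysets if 1 in cs and cs <= {1, 2, 3, 4, 5})
print(len(copysets), node1)
print(len(copysets) / comb(9, 3))  # share of 3-node failure sets that equal a copyset
```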

25
**Copyset Replication: Intuition**

Same scatter width (4), different scheme: {1, 2, 3}, {4, 5, 6}, {7, 8, 9}, {1, 4, 7}, {2, 5, 8}, {3, 6, 9}

Ingredients of an ideal scheme: maximize scatter width, minimize overlaps.

If you carefully place the chunks so that they create a smaller number of copysets, you can do much better for the same scatter width. This paper explores that idea: rather than spraying data randomly, place it so that you get both high reliability and fast recovery.
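The two ingredients can be checked directly on the six copysets above. The sketch treats scatter width as the number of other nodes a node shares a copyset with, and "overlap" as the number of nodes two copysets have in common; both readings are assumptions made for illustration.

```python
from itertools import combinations

scheme = [{1, 2, 3}, {4, 5, 6}, {7, 8, 9},
          {1, 4, 7}, {2, 5, 8}, {3, 6, 9}]

# Largest number of nodes shared by any two copysets.
max_overlap = max(len(a & b) for a, b in combinations(scheme, 2))

# For each node, the other nodes it shares a copyset with.
peers = {n: set().union(*(cs for cs in scheme if n in cs)) - {n}
         for n in range(1, 10)}

print(len(scheme), max_overlap, {n: len(p) for n, p in peers.items()})
```

Every node reaches 4 peers while only 6 copysets exist, compared with the 54 copysets quoted for the Facebook extension.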

26
**Copyset Replication: Initialization**

Nodes: 1 2 3 4 5 6 7 8 9

Random permutation: 7 3 5 6 2 9 1 8 4

Split into copysets (scatter width = 2): {7, 3, 5}, {6, 2, 9}, {1, 8, 4}

It’s a very complex problem to solve this optimally. In the paper we present a heuristic scheme, which we call Copyset Replication.

27
**Copyset Replication: Initialization**

Nodes: 1 2 3 4 5 6 7 8 9

Permutation 1 (scatter width = 2): 7 3 5 6 2 9 1 8 4

Permutation 2 (scatter width = 4): 9 7 1 5 6 8 4 2 3

…

Permutation 5 (scatter width = 10)
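The initialization step can be sketched as follows. The slide suggests that each permutation adds 2 to the scatter width when there are 3 replicas, so the sketch assumes one permutation per (replication - 1) units of scatter width, rounded up. Function names are hypothetical, and dropping leftover nodes when the cluster size is not a multiple of the replication factor is a simplification of this sketch.

```python
import random

def split_into_copysets(permutation, replication):
    """Cut a permutation into consecutive groups of `replication` nodes."""
    usable = len(permutation) - len(permutation) % replication
    return [frozenset(permutation[i:i + replication])
            for i in range(0, usable, replication)]

def build_copysets(nodes, replication, scatter_width, rng=None):
    """Generate random permutations and split each one into copysets."""
    rng = rng or random.Random()
    num_permutations = -(-scatter_width // (replication - 1))  # ceiling division
    copysets = []
    for _ in range(num_permutations):
        perm = list(nodes)
        rng.shuffle(perm)
        copysets.extend(split_into_copysets(perm, replication))
    return copysets

# The first permutation from the slide yields {7, 3, 5}, {6, 2, 9}, {1, 8, 4}.
print(split_into_copysets([7, 3, 5, 6, 2, 9, 1, 8, 4], 3))
print(build_copysets(range(1, 10), replication=3, scatter_width=4,
                     rng=random.Random(1)))
```

Because the permutations are random, two of them can occasionally produce overlapping copysets, so the scatter width actually achieved may fall slightly short of the target.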

28
**Copyset Replication: Replication**

Nodes: 1 2 3 4 5 6 7 8 9

Randomly choose a copyset from the permutations:

7 3 5 6 2 9 1 8 4

9 7 1 5 6 8 4 2 3

…
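The replication step can be sketched like this. The slide only says to choose a copyset at random; the sketch assumes that the chunk's primary node is already known and that the choice is uniform among the copysets containing it. The copysets are the ones obtained by splitting the two permutations shown above into groups of three, and `choose_replicas` is a made-up name.

```python
import random

copysets = [
    {7, 3, 5}, {6, 2, 9}, {1, 8, 4},   # from permutation 1
    {9, 7, 1}, {5, 6, 8}, {4, 2, 3},   # from permutation 2
]

def choose_replicas(primary, copysets, rng=random):
    """Pick one copyset containing the primary; its other members hold the copies."""
    candidates = [cs for cs in copysets if primary in cs]
    chosen = rng.choice(candidates)
    return primary, sorted(set(chosen) - {primary})

print(choose_replicas(7, copysets))
```

Whatever the random choice, all replicas of a chunk stay inside a single precomputed copyset, so writing more chunks never creates new copysets.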

29
**Insignificant Overhead**

Add the Copyset Replication – show the overhead

30
**Copyset Replication**

31
**Inherent Trade-off**

32
**Related Work**

BIBD (Balanced Incomplete Block Designs): originally proposed for designing agricultural experiments in the 1930s! [Fisher, ’40]

Other applications: power downs [Harnik et al. ’09, Leverich et al. ’10, Thereska ’11], multi-fabric interconnects [Mehra, ’99]

33
**Summary**

Many storage systems randomly spray their data across a large number of nodes. This causes a serious problem under correlated failures. Copyset Replication is a better way of spraying data that decreases the probability of data loss under correlated failures.

34
**Thank You! Stanford University**

Intro: We initially designed a replication system for RAMCloud called MinCopysets and gave a talk on it. We saw that it doesn’t translate well to disk-based systems because it impacts their node recovery. We then designed a replication system that

35
**More Failures (Facebook)**

36
**RAMCloud**

37
**HDFS**
