
1 MinCopysets: Derandomizing Replication in Cloud Storage
Asaf Cidon, Ryan Stutsman, Stephen Rumble, Sachin Katti, John Ousterhout and Mendel Rosenblum
Stanford University
Unpublished – Please do not distribute

2 Overview
Assumptions: no geo-replication; Azure uses much smaller clusters in practice

3 RAMCloud
Primary data stored on the master (memory)
Divide each master's data into chunks
Chunks are replicated on backups (disk)
– When a master crashes, recover from thousands of backups
[Diagram: masters, backups, and a crashed master]

4 Random Replication
[Diagram: chunks 1–3, each with one primary and two secondary replicas scattered randomly across nodes 1–10]

5 The Problem
Randomized replication loses data in power outages
– 0.5–1% of the nodes fail to reboot
– Happens 1–2 times a year
– Result: a handful of chunks (GBs of data) are unavailable (LinkedIn '12)
Sub-problem: managed power downs
– Software upgrades
– Reduced power consumption

6 Intuition
If we have one chunk, we are safe:
– Replicate the chunk on three nodes
– Data is lost only if the failed nodes contain all three copies of the chunk
– 1% of the nodes fail: ~0.0001% probability of data loss
If we have millions of chunks, we lose data:
– A 1000-node HDFS cluster has 10 million chunks
– 1% of the nodes fail: 99.93% probability of data loss
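The slide's numbers can be checked with a short probability sketch (an illustrative calculation, not from the talk; treating the chunks' replica sets as independent is an approximation):

```python
from math import comb

def random_replication_loss_prob(n_nodes, n_failed, r, n_chunks):
    # Probability that all R replicas of one chunk land on failed nodes,
    # when replicas are placed on R distinct uniformly random nodes
    p_chunk = comb(n_failed, r) / comb(n_nodes, r)
    # Probability that at least one of the chunks loses all its replicas,
    # treating chunks' replica sets as roughly independent
    return 1 - (1 - p_chunk) ** n_chunks

# Slide's example: 1000-node cluster, 1% of nodes fail, 10M chunks, R = 3
print(random_replication_loss_prob(1000, 10, 3, 10_000_000))  # ~0.9993

# Single-chunk case: ~7.2e-7, i.e. ~0.00007%, the slide's "0.0001%" rounded up
print(comb(10, 3) / comb(1000, 3))
```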

7 Mathematical Intuition

8 Changing R Doesn't Help

9 Changing the Chunk Size Doesn't Help

10 MinCopysets: Decouple Load Balancing and Durability
Split nodes into fixed replication groups
Random distribution: place the primary replica on a random node
Deterministic replication: place the secondary replicas deterministically on the same replication group as the primary
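A minimal sketch of the placement rule above (hypothetical function names; in the real systems group membership is managed by a coordinator or NameNode):

```python
import random

R = 3  # replication factor

def make_replication_groups(nodes, r=R):
    # Statically partition the cluster into fixed groups of R nodes;
    # leftover nodes (len(nodes) % r) would wait for new nodes to join
    nodes = list(nodes)
    random.shuffle(nodes)
    usable = len(nodes) - len(nodes) % r
    return [nodes[i:i + r] for i in range(0, usable, r)]

def place_chunk(groups):
    # Load balancing: pick a random group (the primary can be any member).
    # Durability: all R replicas stay inside that one group.
    group = random.choice(groups)
    primary, *secondaries = group
    return primary, secondaries

groups = make_replication_groups(range(9))
print(place_chunk(groups))
```

Because every chunk's replicas land on exactly one fixed group, the number of distinct replica sets ("copysets") drops from O(N^R) to N/R, which is what drives the loss-probability improvement.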

11 MinCopysets Architecture
[Diagram: chunks 1–4 assigned to replication groups 1–3; each chunk's primary and secondary replicas stay within a single group, e.g. chunk 1 on nodes 55, 7 and 24 of group 1]

12–14 [Figure-only slides; no transcript text]

15 Extreme Failure Scenarios
Even in the extreme scenario where 3–4% of the cluster's nodes fail to reboot, MinCopysets provides low data loss probabilities
For example:
– 4000-node HDFS cluster
– 120 nodes fail to reboot after a power outage
– Only a 3.5% probability of data loss

16 Extreme Failure Scenarios: Normal Clusters

17 Extreme Failure Scenarios: Big Clusters

18 MinCopysets Trade-off
Trades off the frequency of failures against their magnitude
– Expected data loss is the same
– Data loss occurs very rarely
– The magnitude of data loss, when it happens, is greater

19 Frequency vs. Magnitude of Failures
Setup:
– 5000-node HDFS cluster
– 3 TB per machine
– R = 3
– Power outage once a year
Random replication:
– Loses ~5.5 GB every single year
MinCopysets:
– Loses data once every 625 years
– Loses an entire node's worth of data in case of failure
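The once-in-centuries figure can be approximated with the same combinatorics as on the earlier slides (a back-of-the-envelope sketch assuming ~1% of nodes fail to reboot in each yearly outage; it lands in the same ballpark as the slide's 625 years):

```python
from math import comb

def mincopysets_loss_prob(n_nodes, n_failed, r):
    # With MinCopysets, data is lost only if ALL R members of some
    # fixed replication group are among the failed nodes
    n_groups = n_nodes // r
    p_group = comb(n_failed, r) / comb(n_nodes, r)
    return 1 - (1 - p_group) ** n_groups

# 5000-node cluster, ~1% (50 nodes) fail per yearly outage, R = 3
p = mincopysets_loss_prob(5000, 50, 3)
print(p, 1 / p)  # ~0.0016 per outage -> data loss roughly once in ~640 years
```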

20 RAMCloud Implementation
The RAMCloud implementation was relatively straightforward
Two non-trivial issues:
1. Need to manage groups of nodes
– Allocate chunks on entire groups
– Manage nodes joining and leaving groups
2. Machine failures are more complex
– Need to re-replicate an entire group, rather than individual nodes

21 RAMCloud Implementation
[Diagram: a RAMCloud master issues an "Open New Chunk" RPC and receives a replication group in reply; the coordinator issues "Assign Replication Group" RPCs to backups and maintains a server list mapping server IDs (e.g. 05, 10, 25, 37) to replication group IDs]

22 HDFS Implementation
Even simpler than RAMCloud
In HDFS, replication decisions are centralized on the NameNode; in RAMCloud they are distributed
– The NameNode assigns DataNodes to replication groups
Prototyped in 200 LoC

23 HDFS Issues
Has the same issues as RAMCloud in managing groups of nodes
Issue: repair bandwidth
– Solution: hybrid scheme
Issue: network bottlenecks and load balancing
– Solution: kill the replication group, re-replicate its data elsewhere
Issue: a replication group's capacity is limited by the node with the smallest capacity
– Solution: form replication groups from nodes with similar capacities

24 Facebook's HDFS Replication
Facebook constrains the placement of secondary replicas to a group of 10 nodes to prevent data loss
Facebook's algorithm:
– The primary replica is placed on node j and rack k
– Secondary replicas are placed on randomly selected nodes among (j+1, …, j+5), on racks (k+1, k+2)
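A sketch of the window-based selection the slide describes (illustrative only; the node/rack indexing, cluster dimensions, and wraparound are assumptions, not Facebook's actual code):

```python
import random

def facebook_placement(j, k, nodes_per_rack=20, n_racks=10):
    # Primary on node j of rack k; each secondary on a random node in the
    # window (j+1 .. j+5) and a random rack in (k+1, k+2), wrapping around
    primary = (j, k)
    secondaries = [
        ((j + random.randint(1, 5)) % nodes_per_rack,
         (k + random.randint(1, 2)) % n_racks)
        for _ in range(2)
    ]
    return [primary] + secondaries

print(facebook_placement(0, 0))
```

Constraining secondaries to a small window shrinks the number of possible copysets per primary, which is the same lever MinCopysets pulls, just less aggressively.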

25 Facebook's Replication

26 Hybrid MinCopysets
Split nodes into replication groups of 2 and 15
The first and second replicas are always placed on the group of 2
The third replica is randomly placed on the group of 15
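A minimal sketch of the hybrid placement (hypothetical structure; only the group sizes come from the slide):

```python
import random

def hybrid_place(pair_group, big_group):
    # Replicas 1 and 2: the fixed 2-node group (deterministic, few copysets).
    # Replica 3: a random member of the associated 15-node group, which
    # restores some scatter and hence repair bandwidth.
    assert len(pair_group) == 2 and len(big_group) == 15
    return list(pair_group) + [random.choice(big_group)]

replicas = hybrid_place(["n1", "n2"], [f"b{i}" for i in range(15)])
print(replicas)
```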

27 [Figure-only slide; no transcript text]

28 Thank You!
Stanford University
