Slide 1: Paper presentation: MinCopysets – Derandomizing Replication in Cloud Storage
Presenter: Noam Presman (presmann@eng.tau.ac.il)
Advanced Topics in Storage Systems – Semester B 2013, 29/5/2013
Authors: A. Cidon, R. Stutsman, S. Rumble, S. Katti, J. Ousterhout and M. Rosenblum (Stanford University)

Slide 2: Talk Outline
- The problem: random replication results in a high probability of data loss due to cluster-wide failures.
- MinCopysets as a proposed solution.
- The tradeoffs: the good and the challenges.
- Lessons from the field: integration in RAMCloud and HDFS.
- Relaxed MinCopysets.

Slide 3: Replication & Copysets
Cloud applications are built on data center storage systems that span thousands of machines (nodes). Typically the data is partitioned into chunks, and these chunks are distributed across the nodes. Replication is used to protect against data loss when a node fails.
A copyset is a set of nodes that together contain all the replicas of a data chunk, i.e., a single unit of failure.
[Diagram: a chunk replicated to a set of nodes that form a copyset.]
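To make the definition concrete, here is a minimal Python sketch (my own illustration, not from the paper): given a mapping from chunks to the nodes holding their replicas, the cluster's copysets are simply the distinct replica sets.

```python
def copysets(placement):
    """placement: dict mapping chunk id -> list of node ids holding its replicas."""
    return {frozenset(nodes) for nodes in placement.values()}

# Example: two chunks share the copyset {1, 2, 3}; a third chunk uses {2, 4, 5}.
example = {"chunk_a": [1, 2, 3], "chunk_b": [3, 1, 2], "chunk_c": [2, 4, 5]}
print(copysets(example))  # {frozenset({1, 2, 3}), frozenset({2, 4, 5})}
```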

Slide 4: Random Replication
Replica targets are typically assigned at random, usually to nodes residing in different failure domains.
- Simple and fast assignment; allows load balancing.
- Provides strong protection against:
  - Independent node failures (thousands of times a year on a large cluster, due to software, hardware and disk failures).
  - Correlated failures within a failure domain (dozens of times a year, due to rack and network failures).
- Fails to protect against cluster-wide failures (1-2 times a year on a large cluster, typically due to power outages).
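A minimal sketch of random replica placement, under the simplifying assumption that "failure domain" means rack and that the rack layout is known (the names and layout are illustrative, not the paper's implementation):

```python
import random

def random_placement(nodes_by_rack, r=3):
    """Pick R replica nodes uniformly at random, at most one per rack.

    nodes_by_rack: dict rack id -> list of node ids (assumed layout, for illustration).
    """
    racks = random.sample(list(nodes_by_rack), r)               # R distinct failure domains
    return [random.choice(nodes_by_rack[rack]) for rack in racks]

# Example: 6 racks with 4 nodes each.
cluster = {rack: [f"n{rack}-{i}" for i in range(4)] for rack in range(6)}
print(random_placement(cluster, r=3))
```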

Slide 5: Cluster-Wide Failures
A non-negligible percentage of nodes (typically ~0.5%-1%) do not recover after power has been restored. This causes data loss if a copyset is entirely contained within the set of unrecovered nodes, which happens with high probability in commercial systems with more than 300 nodes.
Figure 1: Computed probability of data loss when 1% of the nodes don't survive a restart after a power failure.
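A rough Monte Carlo sketch of the effect shown in Figure 1 (my own simulation with scaled-down, assumed parameters; the figure itself is a computed probability, not a simulation): place chunks with random replication, fail 1% of the nodes, and check whether some chunk lost all of its replicas.

```python
import random

def one_power_failure(n_nodes=1000, chunks_per_node=1000, r=3, frac_down=0.01):
    """Return True if at least one chunk loses all R replicas in a single power event.

    chunks_per_node is deliberately small here so the simulation runs quickly.
    """
    n_chunks = n_nodes * chunks_per_node // r
    down = set(random.sample(range(n_nodes), int(n_nodes * frac_down)))
    for _ in range(n_chunks):
        replicas = random.sample(range(n_nodes), r)   # random replication
        if all(node in down for node in replicas):
            return True                               # a whole copyset failed
    return False

trials = 20
lost = sum(one_power_failure() for _ in range(trials))
print(f"data loss in {lost}/{trials} simulated power failures")
```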

Slide 6: Cluster-Wide Failures (cont'd)
Assumption: each data-loss event carries a high fixed cost, independent of the size of the lost data (e.g., locating and rolling out magnetic tape archives for recovery).
Most storage systems would therefore prefer to change the profile of data loss:
- Lower the frequency of data-loss events.
- Accept losing a larger amount of data when such an event does occur.

Slide 7: Why is Random Replication Bad?
R = replication factor; N = number of nodes in the cluster; F = number of unrecovered nodes = dN; C = number of chunks per node.
The NC/R distinct chunks are assigned randomly into replica sets of size R.
The probability of losing a specific chunk (all R of its replicas fall inside the F failed nodes):
  P(chunk lost) = (F choose R) / (N choose R)
The probability of losing at least one chunk in the cluster:
  P(data loss) = 1 - (1 - (F choose R) / (N choose R))^(NC/R)
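A small sketch (my own code, not the paper's) that evaluates these two expressions directly, following the slide's accounting of NC/R distinct chunks:

```python
from math import comb

def p_loss_random(n, c, r, d):
    """Probability of losing at least one chunk under random replication.

    n: nodes, c: chunks per node, r: replication factor, d: fraction of failed nodes.
    """
    f = int(n * d)                        # F = dN unrecovered nodes
    p_chunk = comb(f, r) / comb(n, r)     # all R replicas land inside the failed set
    return 1 - (1 - p_chunk) ** (n * c // r)

# Illustrates how quickly the probability approaches 1 as the cluster grows.
for n in (300, 1000, 3000, 10000):
    print(n, round(p_loss_random(n, c=8000, r=3, d=0.01), 4))
```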

Slide 8: Random Replication is Bad – Possible Workarounds
Workaround I – increase R:
- Improves durability under simultaneous failures, but supporting thousands of nodes requires R >= 5.
- Hurts system performance (increases write network latency and disk bandwidth).
- Increases the cost of storage.
Figure 2: RAMCloud with C = 8K, d = 1%.

Slide 9: Random Replication is Bad – Possible Workarounds (cont'd)
Workaround II – decrease C (i.e., increase the chunk size):
- Improves durability under simultaneous failures, but supporting thousands of nodes requires increasing the chunk size by 3-4 orders of magnitude.
- Data objects are spread over fewer nodes → compromises the parallelism and load balancing of the data center.
Figure 3: RAMCloud with 64 GB of capacity per node, d = 1% (the extreme point of one chunk per node corresponds to disk mirroring).

Slide 10: Talk Outline
- The problem: random replication results in a high probability of data loss due to cluster-wide failures.
- MinCopysets as a proposed solution.
- The tradeoffs: the good and the challenges.
- Lessons from the field: integration in RAMCloud and HDFS.
- Relaxed MinCopysets.

Slide 11: MinCopysets
The nodes are partitioned into replication groups of size R. When a chunk is replicated, a primary node v is selected at random from the entire cluster to store the first replica. The R-1 secondary replicas are always placed on the other members of v's replication group.
Figure 4: MinCopysets illustration.
- Chunks are still distributed uniformly → load balancing is preserved.
- The number of copysets is limited to N/R → durability.
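A minimal sketch of the placement rule (my own illustration; group formation details such as failure-domain awareness are omitted):

```python
import random

def make_groups(nodes, r=3):
    """Partition the nodes into fixed replication groups of size R (leftovers dropped)."""
    nodes = list(nodes)
    random.shuffle(nodes)
    return [nodes[i:i + r] for i in range(0, len(nodes) - len(nodes) % r, r)]

def place_chunk(groups):
    """MinCopysets rule: random primary; the secondaries are the rest of its group."""
    group = random.choice(groups)      # uniform over equal-sized groups == uniform over primaries
    primary = random.choice(group)
    return [primary] + [n for n in group if n != primary]

groups = make_groups(range(12), r=3)
print(place_chunk(groups))             # always one whole replication group
```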

Slide 12: MinCopysets Have Better Durability
A data-loss event occurs if the set of unrecovered nodes contains at least one copyset, so the probability of a data-loss event increases with the number of copysets.
Figure 5: MinCopysets compared with random replication, C = 8K. For N = 1K and R = 3, random replication has a 99.7% chance of data loss, while MinCopysets has only a 0.02% chance. MinCopysets can scale up to N = 100K.
As the number of chunks increases, random replication keeps creating new copysets, while MinCopysets stays at N/R = O(N) copysets.
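The gain can be seen by plugging the two copyset counts into the loss formula from slide 7. This is my own sketch; it illustrates the trend rather than reproducing the exact figure values.

```python
from math import comb

def p_loss(n, r, d, num_copysets):
    """P(at least one copyset lies entirely inside the failed-node set)."""
    f = int(n * d)
    p_one = comb(f, r) / comb(n, r)           # a specific copyset is wiped out
    return 1 - (1 - p_one) ** num_copysets

n, r, c, d = 1000, 3, 8000, 0.01
print("random replication:", p_loss(n, r, d, num_copysets=n * c // r))  # ~one copyset per chunk
print("MinCopysets       :", p_loss(n, r, d, num_copysets=n // r))      # fixed N/R copysets
```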

Slide 13: MinCopysets – the Good and the Challenges
The good:
- Increases durability.
- Simple and scalable.
- General purpose (integration examples: HDFS, RAMCloud).
- Simplifies planned power-downs → keep at least one member of each group up to preserve availability.
The challenges:
- Requires a centralized entity to enforce the replication groups (large data center storage systems usually have such a service).
- Two problems arise with respect to node failure and recovery:
  - Challenge 1: the group administration challenge.
  - Challenge 2: the recovery bandwidth challenge.

Slide 14: The Group Administration Challenge
When a node fails, the coordinator cannot simply re-replicate each of its replicas independently → that would violate the replication group boundaries.
- Solution 1: keep a small number of unassigned servers that act as replacements for failed nodes. Unassigned servers are not utilized during normal operation, and it is difficult to predict how many are needed.
- Solution 2: allow nodes to be members of several replication groups. This increases the number of copysets → reduces durability.
- Solution 3: re-replicate the entire group when one of its nodes fails; the surviving nodes are reassigned to a new group once there are enough unassigned nodes. This increases network and disk consumption during recovery.

Slide 15: The Recovery Bandwidth Challenge
When a node fails, the system can recover its data from:
- MinCopysets: only the members of its group.
- Random replication: many nodes across the cluster.
→ Recovery takes much longer under MinCopysets than under random replication.
- Solution 1: give each node a "buddy group" from which it is allowed to choose its replicas.
- Solution 2: Relaxed MinCopysets (next slide).
The Facebook (HDFS) solution:
- Each node v has a buddy group of size 11 (a window of 2 racks and five nodes around v).
- For each chunk: the primary replica node v is selected randomly; the second and third replicas are selected randomly from v's buddy group.
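A sketch of a Facebook-style buddy-group choice (my own simplification: nodes are identified by indices and the "window of 2 racks and five nodes around v" is approximated by a window of neighbouring indices; the real HDFS policy works on rack/node topology):

```python
import random

def buddy_group(v, n_nodes, window=5):
    """Nodes within +/- `window` positions of v (wrapping around), excluding v itself."""
    return [(v + off) % n_nodes for off in range(-window, window + 1) if off != 0]

def place_chunk(n_nodes, r=3):
    v = random.randrange(n_nodes)                        # primary chosen uniformly
    others = random.sample(buddy_group(v, n_nodes), r - 1)  # remaining replicas from the buddies
    return [v] + others

print(place_chunk(n_nodes=100))
```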

Slide 16: Relaxed MinCopysets
Two sets of replication groups:
- Set A contains groups of R-1 nodes.
- Set B contains groups of G nodes.
Each node is a member of a single group in A and a single group in B.
Replica choice:
- The primary replica holder v is chosen randomly.
- The R-2 secondary nodes are the other members of v's group in A.
- One secondary node is selected randomly from v's group in B.
Example: if a system needs to be able to recover from 10 nodes with R = 3, then Set A contains groups of 2 nodes and Set B contains groups of 10 nodes.
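A minimal sketch of the relaxed rule (my own illustration; group construction is simplified to slicing a shuffled node list, and a real implementation would also ensure the two secondaries are distinct nodes):

```python
import random

def make_groups(nodes, size):
    nodes = list(nodes)
    random.shuffle(nodes)
    return [nodes[i:i + size] for i in range(0, len(nodes) - len(nodes) % size, size)]

def relaxed_placement(n_nodes, r=3, g=10):
    # Set A: groups of R-1 nodes; Set B: groups of G nodes. Each node belongs to one of each.
    group_a_of = {n: grp for grp in make_groups(range(n_nodes), r - 1) for n in grp}
    group_b_of = {n: grp for grp in make_groups(range(n_nodes), g) for n in grp}

    def place_chunk():
        v = random.randrange(n_nodes)                                  # random primary
        from_a = [n for n in group_a_of[v] if n != v]                  # R-2 fixed secondaries
        from_b = random.choice([n for n in group_b_of[v] if n != v])   # 1 random secondary
        return [v] + from_a + [from_b]

    return place_chunk

place = relaxed_placement(n_nodes=100, r=3, g=10)
print(place())
```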

Slide 17: Relaxed MinCopysets (cont'd)
N = number of nodes in the cluster; R = replication factor; S = the number of nodes available for recovery after a single failure.
How many copysets do we have?
- Relaxed MinCopysets: G = S + 1 - (R - 2).
- Facebook (HDFS): buddy group size = S + 1.
Figure 7: d = 1%, R = 3, C = 10K; buddy group size = 11, G = 11.
* In the paper a different assignment method is implied, which enables S = 10 with G = 11 and far fewer copysets.
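The copyset counts behind comparisons like this can also be estimated empirically: generate many placements under a policy and count the distinct replica sets. A small sketch of such a counter (my own tooling, not from the paper), shown here with plain random replication:

```python
import random

def count_copysets(place_chunk, n_chunks=200_000):
    """Empirically count the distinct copysets produced by a placement policy."""
    return len({frozenset(place_chunk()) for _ in range(n_chunks)})

n_nodes, r = 100, 3
random_policy = lambda: random.sample(range(n_nodes), r)
print(count_copysets(random_policy))   # grows toward C(100, 3) = 161700 as n_chunks grows
```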

Slide 18: Implementation of MinCopysets in RAMCloud
RAMCloud basics: a copy of each chunk is stored in RAM, while R = 3 replicas are kept on disk. The players:
- The coordinator: manages the cluster by keeping an up-to-date list of the servers.
- The masters: serve client requests from the in-memory copy of the data → low-latency read access.
- The backups: keep the data persistent; they receive write operations from the masters and are read only during recovery.
MinCopysets integration:
- The coordinator assigns backups to replication groups (a new field in its server DB). Groups typically contain members of different failure domains.
- When a master creates a new chunk, it first selects the primary backup randomly. If that backup has been assigned a replication group, it accepts the master's write request and responds with the other members of the group; otherwise it rejects the request and the master retries its write RPC on a new primary backup.
Figure 8: MinCopysets implemented in RAMCloud.
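A schematic sketch of the master's write path described on this slide (pseudocode-level Python of my own; real RAMCloud is C++ and the state shown here lives on separate servers reached over RPC, so all names are invented for illustration):

```python
import random

# Hypothetical in-memory stand-in for coordinator-assigned group membership.
replication_group_of = {}   # backup id -> tuple of group members

def backup_accepts_write(backup_id):
    """A backup accepts the first replica only if it already belongs to a group."""
    group = replication_group_of.get(backup_id)
    if group is None:
        return None                                   # reject: no replication group yet
    return [m for m in group if m != backup_id]       # respond with the other group members

def master_replicate_chunk(all_backups):
    """Retry random primaries until one that has a replication group accepts."""
    while True:
        primary = random.choice(all_backups)
        secondaries = backup_accepts_write(primary)
        if secondaries is not None:
            return [primary] + secondaries            # the write goes to the whole group

# Toy setup: 6 backups, coordinator formed two groups of R = 3.
for group in [(0, 1, 2), (3, 4, 5)]:
    for b in group:
        replication_group_of[b] = group
print(master_replicate_chunk(all_backups=list(range(6))))
```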

Slide 19: Implementation of MinCopysets in RAMCloud (cont'd)
When a backup node fails:
- The coordinator changes the replication group ID of the group's other members to "limbo".
- Limbo backups can serve read requests but cannot accept write requests.
- The coordinator treats limbo backups as new backups that have not yet been assigned a replication group, and forms a new group once there are enough unassigned backups.
- All masters are notified of the failure → a master that has data stored on the failed node's group re-replicates that data to a new replication group.
- Backup servers do not garbage-collect the data until the masters have fully re-replicated it on a new group.
Performance benchmark:
- Normal client operations and master recovery: not affected by MinCopysets.
- Backup recovery: a single backup was crashed on a cluster of 39 masters and 72 backups storing 33 GB. Masters re-replicate data in parallel → +51%; the entire group is re-replicated → +190%.
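A sketch of the failure-handling steps listed above, in the same schematic style (my own illustration, not RAMCloud code; "None" stands in for the limbo group ID):

```python
R = 3
replication_group_of = {b: g for g in [(0, 1, 2), (3, 4, 5)] for b in g}
unassigned = []                          # limbo backups waiting to be regrouped

def handle_backup_failure(failed):
    """Mark the survivors of the failed backup's group as limbo (read-only)."""
    group = replication_group_of.pop(failed)
    for b in (n for n in group if n != failed):
        replication_group_of[b] = None   # "limbo": serves reads, refuses writes
        unassigned.append(b)
    # (masters holding data on this group are notified and re-replicate it elsewhere;
    #  limbo backups keep their data until that re-replication completes)
    maybe_form_new_group()

def maybe_form_new_group():
    """Once enough unassigned backups exist, the coordinator forms a fresh group."""
    while len(unassigned) >= R:
        group = tuple(unassigned.pop() for _ in range(R))
        for b in group:
            replication_group_of[b] = group

handle_backup_failure(4)
print(replication_group_of)   # backups 3 and 5 are now limbo, awaiting a new group
```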

Slide 20: Implementation of MinCopysets in HDFS
HDFS basics:
- The NameNode: controls all the file system metadata and dictates the placement of every chunk replica.
- The DataNodes: store the data.
- Chunk rebalancing: migration of replicas to different DataNodes to spread the data more uniformly (e.g., when the topology changes).
- Pipelined replication: DataNodes replicate data along a pipeline from one node to the next, in an order that minimizes the network distance from the client to the last DataNode.
MinCopysets integration:
- The NameNode was modified to assign new DataNodes to replication groups and to choose replica placements based on these group assignments.
- Chunk rebalancing becomes more complicated: groups must be taken into account.
- Pipelined replication is hard to implement with MinCopysets → over-utilization of a relatively small number of links. A possible workaround: the NameNode should take the network topology into account when forming groups, and periodically reassign certain nodes to different groups for load balancing.

Slide 21: Summary and Conclusions
- Random replication has a high probability of data loss under cluster-wide failures.
- MinCopysets changes the profile of the data-loss distribution: it decreases the frequency of data-loss events while increasing the amount of data lost per event.
- It has been integrated in RAMCloud and HDFS.
- Backup recovery and node rebalancing are still weak points of the scheme.

Slide 22: Talk Outline
- The problem: random replication results in a high probability of data loss due to cluster-wide failures.
- MinCopysets as a proposed solution.
- The tradeoffs: the good and the challenges.
- Lessons from the field: integration in RAMCloud and HDFS.
- Relaxed MinCopysets.
Thank you!

