
(1) Paper presentation: MinCopysets – Derandomizing Replication in Cloud Storage
Authors: A. Cidon, R. Stutsman, S. Rumble, S. Katti, J. Ousterhout and M. Rosenblum (Stanford University)
Presenter: Noam Presman, Advanced Topics in Storage Systems – Semester B 2013, 29/5/2013

(2) Talk Outline
– The problem: random replication results in a high probability of data loss under cluster-wide failures.
– MinCopysets as a proposed solution.
– The tradeoffs: the good and the challenges.
– Lessons from the field: integration in RAMCloud and HDFS.
– Relaxed MinCopysets.

(3) Replication & Copysets
– Cloud applications are built on data-center storage systems that span thousands of machines (nodes).
– Typically the data is partitioned into chunks, and these chunks are distributed across the nodes.
– Replication is used to protect against data loss when a node failure occurs.
– A copyset = a set of nodes that contains all the replicas of a data chunk (= a single unit of failure).
(Illustration: a chunk and a copyset.)

(4) Random Replication
– The replica targets are typically assigned at random (usually to nodes residing in different failure domains).
– Simple and fast assignment; allows load balancing.
– Provides strong protection against:
  – Independent node failures (thousands of times a year on a large cluster, due to software, hardware and disk failures).
  – Correlated failures within a failure domain (dozens of times a year, due to rack and network failures).
– Fails to protect against cluster-wide failures (1-2 times a year on a large cluster, typically due to power outages).

(5) Cluster-Wide Failures
– A non-negligible percentage of nodes (typically ~0.5%-1%) do not recover after power is restored.
– Data loss occurs if some copyset is contained entirely within the set of unrecovered nodes → this happens with high probability for commercial systems with more than 300 nodes.
Figure 1: Computed probability of data loss when 1% of the nodes don't survive a restart after a power failure.

(6) Cluster-Wide Failures (cont'd)
– Assumption: each data-loss event carries a high fixed cost, independent of the size of the lost data (e.g., the need to locate and roll out magnetic tape archives for recovery).
– Most storage systems would therefore probably prefer:
  – a lower frequency of data-loss events,
  – at the price of losing a larger amount of data when such an event occurs,
  i.e., changing the profile of data loss.

(7) Why is Random Replication Bad?
– R = replication size; N = number of nodes in the cluster; F = number of unrecovered nodes = dN; C = number of chunks per node.
– The NC/R chunks are assigned randomly to replication groups of size R.
– Of interest are the probability of losing a specific chunk and the probability of losing at least one chunk in the cluster (see the reconstruction below).
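A reconstruction of the two probabilities referenced on this slide, from the definitions above (a sketch: a chunk is lost when all R of its replicas fall among the F unrecovered nodes, and the NC/R chunks are treated as independent):

\[
P_{\text{chunk}} \;=\; \frac{\binom{F}{R}}{\binom{N}{R}},
\qquad
P_{\text{loss}} \;=\; 1 - \left(1 - \frac{\binom{F}{R}}{\binom{N}{R}}\right)^{NC/R}.
\]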

(8) Random Replication is Bad – Possible Workarounds
Workaround I – increase R:
– Increases durability against simultaneous failures → supporting thousands of nodes requires R ≥ 5.
– Hurts system performance (increases write network latency and disk bandwidth).
– Increases the cost of storage.
Figure 2: RAMCloud with C = 8K, d = 1%.

(9) Random Replication is Bad – Possible Workarounds (cont'd)
Workaround II – decrease C (i.e., increase the chunk size):
– Increases durability against simultaneous failures → supporting thousands of nodes requires increasing the chunk size by 3-4 orders of magnitude.
– Data objects are distributed to fewer nodes → compromises the parallelism and load balancing of the data center.
Figure 3: RAMCloud with a capacity of 64GB per node, d = 1% (plot annotation: "disk mirroring").

(10) Talk Outline
– The problem: random replication results in a high probability of data loss under cluster-wide failures.
– MinCopysets as a proposed solution.
– The tradeoffs: the good and the challenges.
– Lessons from the field: integration in RAMCloud and HDFS.
– Relaxed MinCopysets.

(11) MinCopysets
– The nodes are partitioned into replication groups of size R.
– When a chunk is replicated, a primary node v is selected randomly from the entire cluster to store the first replica.
– The other R-1 secondary replicas are always placed on the other members of v's replication group (a sketch follows below).
– Chunks are distributed uniformly → load balancing.
– The number of copysets is limited (= N/R) → durability.
Figure 4: MinCopysets illustration.
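The placement rule above fits in a few lines; the following is a minimal sketch (function names such as make_groups and place_chunk are mine, not the paper's):

import random

R = 3  # replication factor used throughout the deck

def make_groups(nodes, r=R):
    # Statically partition the cluster into disjoint replication groups of size r
    nodes = list(nodes)
    random.shuffle(nodes)
    return [nodes[i:i + r] for i in range(0, len(nodes) - len(nodes) % r, r)]

def place_chunk(groups):
    # MinCopysets placement: a uniformly random primary; secondaries = the rest of its group
    node_to_group = {n: g for g in groups for n in g}
    primary = random.choice(list(node_to_group))
    secondaries = [n for n in node_to_group[primary] if n != primary]
    return primary, secondaries

groups = make_groups(range(12))
print(place_chunk(groups))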

(12) MinCopysets have Better Durability
– A data-loss event occurs if the set of unrecovered nodes contains at least one copyset.
– The probability of a data-loss event increases with the number of copysets (a small computation is sketched below).
(Plot: the number of copysets as the number of chunks increases – it grows for random replication, but stays at N/R = O(N) for MinCopysets.)
Figure 5: MinCopysets compared with random replication, C = 8K. For N = 1K and R = 3, random replication has a 99.7% chance of data loss, while MinCopysets has only a 0.02% chance. MinCopysets can scale up to N = 100K.
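A small computation in the spirit of Figure 5, using the expressions from slide 7 (a sketch: the function names are mine, failed nodes are assumed uniformly random, and chunk placements are treated as independent, so the printed values are approximations rather than the exact figures quoted above):

from math import comb

def p_group_lost(n, f, r):
    # Probability that one fixed set of r nodes lies entirely inside the f failed nodes
    return comb(f, r) / comb(n, r)

def p_loss_random(n, c, r, d):
    # Random replication: each of the N*C/R chunks gets an independent random copyset
    f = int(n * d)
    return 1 - (1 - p_group_lost(n, f, r)) ** (n * c // r)

def p_loss_mincopysets(n, r, d):
    # MinCopysets: only N/R disjoint copysets exist, regardless of the number of chunks
    f = int(n * d)
    return 1 - (1 - p_group_lost(n, f, r)) ** (n // r)

for n in (1000, 10000, 100000):
    print(n, p_loss_random(n, c=8000, r=3, d=0.01), p_loss_mincopysets(n, r=3, d=0.01))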

(13) MinCopysets – the Good and the Challenges
The good:
– Increases durability.
– Simple and scalable.
– General purpose (integration examples: HDFS, RAMCloud).
– Simplifies planned power-downs → keep at least one member of each group up to provide availability.
Note: MinCopysets requires a centralized entity to enforce the replication groups (large data-center storage systems usually have such a service).
The challenges: there are two problems with MinCopysets with respect to node failure and recovery:
– Challenge 1: the group administration challenge.
– Challenge 2: the recovery bandwidth challenge.

(14) The Group Administration Challenge
When a node fails, the coordinator cannot simply re-replicate each of its replicas on an arbitrary node → that would violate the replication group boundaries.
– Solution 1: keep a small number of unassigned servers that act as replacements for failed nodes. Drawbacks: unassigned servers are not utilized during normal operation, and it is difficult to predict how many are needed.
– Solution 2: allow nodes to be members of several replication groups. Drawback: increases the number of copysets → reduces durability.
– Solution 3: re-replicate the entire group when one of its nodes fails; the nodes that did not fail are reassigned to a new group once there are enough unassigned nodes. Drawback: increases network and disk consumption during recovery.

(15) The Recovery Bandwidth Challenge
When a node fails, the system can recover its data from:
– MinCopysets: only the members of the node's group.
– Random replication: many nodes across the cluster.
→ Long recovery time for MinCopysets compared to random replication.
– Solution 1: each node has a "buddy group" from which it is allowed to choose its replicas.
– Solution 2: Relaxed MinCopysets.
The Facebook (HDFS) solution (a buddy-group scheme; sketched below):
– Each node v has a buddy group of size 11 (a window of 2 racks and five nodes around v).
– For each chunk: the primary replica node v is selected randomly; the second and third replicas are randomly selected from v's buddy group.
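A rough sketch of the buddy-group placement just described (the window arithmetic, the toy cluster layout and the names buddy_group / place_chunk are my simplifications, not Facebook's actual code):

import random

R = 3                      # replication factor
RACKS, PER_RACK = 20, 20   # assumed cluster layout (illustrative numbers)

# Nodes are identified by (rack, slot)
nodes = [(r, s) for r in range(RACKS) for s in range(PER_RACK)]

def buddy_group(v, rack_window=2, node_window=5):
    # A window of rack_window racks and node_window slots "around" v, excluding v itself;
    # this is a simplification of the real placement window
    rack, slot = v
    window = {((rack + dr) % RACKS, (slot + ds) % PER_RACK)
              for dr in range(rack_window) for ds in range(node_window)}
    window.discard(v)
    return sorted(window)

def place_chunk():
    primary = random.choice(nodes)                              # first replica: anywhere in the cluster
    secondaries = random.sample(buddy_group(primary), R - 1)    # second and third: from the buddy window
    return primary, secondaries

print(place_chunk())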

(16) Relaxed MinCopysets
Two sets of replication groups:
– Set A: contains groups of R-1 nodes.
– Set B: contains groups of G nodes.
Each node is a member of a single group in A and a single group in B.
Replica choice (sketched below):
– The primary replica holder (v) is chosen randomly.
– R-2 secondary nodes are the other members of v's group in A.
– One secondary node is selected randomly from v's group in B.
Example: if a system needs to be able to recover from 10 nodes with R = 3, then Set A contains groups of 2 nodes and Set B contains groups of 10 nodes.
(Illustration: a group in A and a group in B.)
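A minimal sketch of the Relaxed MinCopysets replica choice (names and the toy cluster size are illustrative, not from the paper):

import random

R, G = 3, 10   # replication factor and the size of the groups in Set B

def partition(nodes, size):
    # Randomly partition the nodes into disjoint groups of the given size (leftovers dropped)
    nodes = list(nodes)
    random.shuffle(nodes)
    return [nodes[i:i + size] for i in range(0, len(nodes) - len(nodes) % size, size)]

nodes = list(range(120))
group_a = {n: g for g in partition(nodes, R - 1) for n in g}   # Set A: groups of R-1 nodes
group_b = {n: g for g in partition(nodes, G) for n in g}       # Set B: groups of G nodes

def place_chunk():
    v = random.choice(nodes)                                   # primary replica holder, chosen at random
    from_a = [n for n in group_a[v] if n != v]                 # R-2 secondaries: the rest of v's A-group
    candidates_b = [n for n in group_b[v] if n != v and n not in from_a]
    return v, from_a + [random.choice(candidates_b)]           # one more secondary from v's B-group

print(place_chunk())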

(17) Relaxed MinCopysets (cont'd)
– N = number of nodes in the cluster; R = replication size; S = the number of nodes available for recovery after a single failure.
– How many copysets do we have? To support recovery from S nodes, Relaxed MinCopysets needs G = S+1-(R-2), while the Facebook (HDFS) scheme needs a buddy group of size S+1.
Figure 7: d = 1%, R = 3, C = 10K; buddy group size = 11, G = 11.
* In the paper, a different assignment method is implied, which enables S = 10 with G = 11 and far fewer copysets.

(18) Implementation of MinCopysets in RAMCloud
RAMCloud basics:
– A copy of each chunk is stored in RAM, while R = 3 replicas are kept on disk.
– The players: the coordinator manages the cluster by keeping an up-to-date list of the servers; the masters serve client requests from the in-memory copy of the data (→ low-latency read access); the backups keep the data persistent (write operations from the masters, read operations only during recovery).
MinCopysets integration:
– The coordinator assigns backups to replication groups (a new field in the coordinator's database). The groups typically contain members of different failure domains.
– When a master tries to create a new chunk, it first selects the primary backup randomly (sketched below):
  – If the backup has already been assigned a replication group, it accepts the master's write request and responds with the other members of the group.
  – Otherwise, the backup rejects the request and the master retries its write RPC on a new primary backup.
Figure 8: MinCopysets implemented in RAMCloud.
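A sketch of the master-side write path described above (Backup, open_chunk and replicate_new_chunk are illustrative stand-ins, not RAMCloud's actual API):

import random

class Backup:
    # Illustrative stand-in for a RAMCloud backup server
    def __init__(self, name, group=None):
        self.name = name
        self.group = group            # replication group assigned by the coordinator, or None

    def open_chunk(self):
        # Accept the write only if this backup already belongs to a replication group
        if self.group is None:
            return None               # not assigned yet: reject, the master must retry elsewhere
        return [b for b in self.group if b is not self]   # the secondary replica holders

def replicate_new_chunk(backups):
    # Master-side loop: pick a random primary backup, retry until one with a group accepts
    while True:
        primary = random.choice(backups)
        secondaries = primary.open_chunk()
        if secondaries is not None:
            return primary, secondaries

# Toy usage: two assigned groups of three backups plus one still-unassigned backup
backups = [Backup(f"b{i}") for i in range(7)]
for group in (backups[0:3], backups[3:6]):
    for b in group:
        b.group = group
primary, secondaries = replicate_new_chunk(backups)
print(primary.name, [b.name for b in secondaries])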

(19) Implementation of MinCopysets in RAMCloud (cont'd)
When a backup node fails (sketched below):
– The coordinator changes the replication group ID of the other group members to "limbo".
– Limbo backups can serve read requests but cannot accept write requests.
– Limbo backups are treated by the coordinator as new backups that have not yet been assigned a replication group; the coordinator forms a new group once there are enough unassigned backups.
– All masters are notified of the failure → a master that has data stored on the failed node's group re-replicates that data to a new replication group.
– Backup servers do not garbage-collect the data until the masters have replicated it entirely on a new group.
Performance benchmark (a single backup crashed on a cluster of 39 masters and 72 backups, storing 33 GB):
– Normal client operations and master recovery: not affected by MinCopysets.
– Backup recovery: masters re-replicate their data in parallel → +51%; the entire group's data is re-replicated → +190%.
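A sketch of the coordinator-side handling just described (class and method names are illustrative, not taken from RAMCloud's code):

R = 3
LIMBO = "limbo"

class Master:
    # Illustrative stand-in: a real master would re-replicate any data it stored on the failed group
    def notify_failure(self, failed_backup):
        pass

class Coordinator:
    def __init__(self):
        self.group_of = {}        # backup -> replication group id, LIMBO, or None (unassigned)
        self.unassigned = []      # backups waiting to be placed in a fresh group
        self.next_group_id = 0

    def on_backup_failure(self, failed, group_members, masters):
        # Move the surviving members of the failed backup's group to limbo:
        # they may still serve reads, reject writes, and are treated as unassigned backups
        for b in group_members:
            if b is not failed:
                self.group_of[b] = LIMBO
                self.unassigned.append(b)
        for m in masters:
            m.notify_failure(failed)          # masters re-replicate affected data on a new group
        self.maybe_form_group()

    def maybe_form_group(self):
        # Form a new replication group once enough unassigned backups have accumulated
        while len(self.unassigned) >= R:
            new_group, self.unassigned = self.unassigned[:R], self.unassigned[R:]
            for b in new_group:
                self.group_of[b] = self.next_group_id
            self.next_group_id += 1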

(20) Implementation of MinCopysets in HDFS
HDFS basics:
– The NameNode: the node that controls all the file-system metadata and dictates the placement of every chunk replica.
– The DataNodes: the nodes that store the data.
– Chunk rebalancing: migration of replicas to different DataNodes to spread the data more uniformly (e.g., when the topology changes).
– Pipelined replication: DataNodes replicate data along a pipeline from one node to the next, in an order that minimizes the network distance from the client to the last DataNode.
MinCopysets integration:
– The NameNode was modified to assign new DataNodes to replication groups and to choose replica placements based on these group assignments.
– Chunk rebalancing becomes more complicated: the groups must be taken into account.
– Pipelined replication is hard to implement with MinCopysets → over-utilization of a relatively small number of links. A possible workaround: the NameNode should take the network topology into account when forming the groups, and periodically reassign certain nodes to different groups for load balancing.

(21) Summary and Conclusions
– Random replication has a high probability of data loss under cluster-wide failures.
– MinCopysets changes the profile of the data-loss distribution: it decreases the frequency of data-loss events and increases the amount of data lost per event.
– Integrated in RAMCloud and HDFS.
– Backup recovery and node rebalancing are still weak points of the scheme.

(22) Talk Outline
– The problem: random replication results in a high probability of data loss under cluster-wide failures.
– MinCopysets as a proposed solution.
– The tradeoffs: the good and the challenges.
– Lessons from the field: integration in RAMCloud and HDFS.
– Relaxed MinCopysets.
Thank you!