Enabling Efficient and Reliable Transitions from Replication to Erasure Coding for Clustered File Systems
Runhui Li, Yuchong Hu, Patrick P. C. Lee
The Chinese University of Hong Kong
DSN'15

Motivation
• Clustered file systems (CFSes), e.g., GFS, HDFS, and Azure, are widely adopted by enterprises
• A CFS comprises nodes connected via a network; nodes are prone to failures, so data availability is crucial
• CFSes store data with redundancy:
  - New, hot data is stored with replication
  - Data is transitioned to erasure coding once it turns cold → the encoding operation
• Question: can we improve the encoding operation in both performance and reliability?

Background: CFS
• Nodes are grouped into racks:
  - Nodes in one rack are connected to the same top-of-rack (ToR) switch
  - ToR switches are connected to the network core
• Link conditions:
  - Intra-rack bandwidth is sufficient
  - Cross-rack bandwidth is scarce
[Figure: racks of nodes, each connected through its ToR switch to the network core]

Replication vs. Erasure Coding
• Replication has better read throughput, while erasure coding has smaller storage overhead
• Hybrid redundancy balances performance and storage:
  - Replication for new, hot data
  - Erasure coding for old, cold data
[Figure: 3-way replication spreads copies of blocks A and B across Node1-Node4; a (4, 2) erasure code stores the stripe A, B, A+B, A+2B across the four nodes]
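To make the trade-off concrete, here is a minimal Python sketch (not from the paper) that computes the figure's (4, 2) stripe A, B, A+B, A+2B over GF(2^8) and compares storage overheads; the field arithmetic and block contents are illustrative assumptions.

```python
# Illustrative sketch: the slide's (4, 2) code stores A, B, A+B, A+2B.
# GF(2^8) arithmetic (AES field) stands in for a real coding library.

def gf_mul2(x: int) -> int:
    """Multiply a byte by 2 in GF(2^8) with the AES polynomial 0x11B."""
    x <<= 1
    return (x ^ 0x11B) if x & 0x100 else x

def encode_stripe(a: bytes, b: bytes):
    """Return the parity blocks A+B and A+2B for data blocks A and B."""
    p1 = bytes(x ^ y for x, y in zip(a, b))           # A + B
    p2 = bytes(x ^ gf_mul2(y) for x, y in zip(a, b))  # A + 2B
    return p1, p2

A, B = b"\x12\x34", b"\x56\x78"
P1, P2 = encode_stripe(A, B)

# Storage overhead: r-way replication stores r copies; an (n, k) code stores n/k.
print("3-way replication overhead:", 3.0)   # 3x the original data
print("(4, 2) code overhead:", 4 / 2)       # 2x the original data
```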

Encoding
• Consider a 5-rack cluster and a 4-block file stored with 3-way replication, to be encoded with a (5, 4) code
• 3-step encoding:
  1. Download the data blocks
  2. Encode and upload the parity block
  3. Remove the redundant replicas
• Replication policy of HDFS: 3 replicas on 3 nodes in 2 racks
[Figure: the four data blocks and the parity block P placed across Racks 1-5]
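The 3-step flow can be sketched as below. This is a self-contained Python illustration that uses an in-memory dict in place of a real CFS and a plain XOR parity in place of the actual (5, 4) code, so every name in it is an assumption rather than the paper's interface.

```python
# Hedged sketch of the 3-step encoding: download, encode + upload, cleanup.

def xor_blocks(blocks):
    """Bitwise XOR of equal-length blocks (stand-in for real coding)."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)

# replicas[block_id] -> list of (node, payload); two replicas per block.
replicas = {
    "blk1": [("n1", b"\x01\x02"), ("n5", b"\x01\x02")],
    "blk2": [("n2", b"\x03\x04"), ("n6", b"\x03\x04")],
}

# Step 1: "download" one replica of each data block to the encoder node.
data = [replicas[b][0][1] for b in list(replicas)]

# Step 2: encode and "upload" the parity block to a new node.
replicas["parity"] = [("n3", xor_blocks(data))]

# Step 3: remove redundant replicas, keeping one copy of each data block.
for b in ("blk1", "blk2"):
    replicas[b] = replicas[b][:1]
```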

Problem
• To tolerate a single rack failure, each stripe must keep AT MOST ONE block in each rack
• With random replication, an encoded stripe may end up with multiple blocks in one rack → relocation!
[Figure: an encoded stripe with parity P across Racks 1-5 in which two blocks share a rack, forcing a relocation]
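The relocation test itself is just a per-rack count, as in this small Python sketch (the stripe layouts are made-up examples):

```python
# Reliability check: an encoded stripe tolerates a single rack failure
# only if each rack holds at most one of its blocks.
from collections import Counter

def needs_relocation(block_racks):
    """block_racks: the rack ID of each block in one encoded stripe."""
    return any(c > 1 for c in Counter(block_racks).values())

print(needs_relocation([1, 1, 3, 4, 5]))  # True  -> must relocate a block
print(needs_relocation([1, 2, 3, 4, 5]))  # False -> layout is safe
```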

Our Contributions
• Propose encoding-aware replication (EAR), which enables efficient and reliable encoding:
  - Eliminates cross-rack downloads during encoding
  - Guarantees reliability by avoiding relocation after encoding
  - Maintains the load balance of random replication (RR)
• Implement an EAR prototype and integrate it with Hadoop-20
• Conduct testbed experiments on a 13-node cluster
• Perform discrete-event simulations to compare EAR and RR in large-scale clusters

Related Work
• Asynchronous encoding: DiskReduce [Fan et al., PDSW'09]
• Erasure coding in CFSes:
  - Local repair codes (LRC), e.g., Azure [Huang et al., ATC'12], HDFS [Rashmi et al., SIGCOMM'14]
  - Regenerating codes, e.g., HDFS [Li et al., MSST'13]
• Replica placement:
  - Reducing block loss probability: Copysets [Cidon et al., ATC'13]
  - Improving write performance by leveraging network capacities: Sinbad [Chowdhury et al., SIGCOMM'13]
• To the best of our knowledge, there is no explicit study of the encoding operation

Motivating Example
• Consider the previous example of a 5-rack cluster and a 4-block file
• Performance goal: eliminate cross-rack downloads during encoding
• Reliability goal: avoid relocation after encoding
[Figure: a replica layout across Racks 1-5 where encoding needs no cross-rack downloads and the encoded stripe with parity P needs no relocation]

Eliminate Cross-Rack Downloads
• Formation of a stripe: choose blocks that each have at least one replica stored in the same rack
  - We call this rack the core rack of the stripe
  - Pick a node in the core rack to encode the stripe → NO cross-rack downloads
• We do NOT interfere with the replication algorithm; we just group blocks according to replica locations
• Example (each block lists the racks storing its replicas; see the sketch below):
  Blk1: 1, 2   Blk2: 3, 2   Blk3: 3, 2   Blk4: 1, 3
  Blk5: 1, 2   Blk6: 1, 2   Blk7: 3, 1   Blk8: 3, 2
  → Stripe 1: Blk1, Blk4, Blk5, Blk6 (core rack 1)
  → Stripe 2: Blk2, Blk3, Blk7, Blk8 (core rack 3)
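The grouping can be sketched as a greedy pass over replica locations. The following Python snippet is an assumption-laden illustration, not the paper's exact policy, but it reproduces the slide's two stripes from the 8-block layout:

```python
# Greedy stripe formation: a stripe's blocks all have a replica in one
# rack (the core rack), so encoding there needs no cross-rack downloads.
from collections import defaultdict

def form_stripes(replica_racks, k):
    """replica_racks: {block_id: set of racks holding its replicas}."""
    by_rack = defaultdict(list)
    for blk, racks in replica_racks.items():
        for r in racks:
            by_rack[r].append(blk)
    stripes, used = [], set()
    for rack, blks in by_rack.items():
        group = [b for b in blks if b not in used]
        while len(group) >= k:
            stripe, group = group[:k], group[k:]
            used.update(stripe)
            stripes.append((rack, stripe))  # (core rack, blocks)
    return stripes

# The slide's layout: blk1..blk8 with two replica racks each, k = 4.
layout = {1: {1, 2}, 2: {3, 2}, 3: {3, 2}, 4: {1, 3},
          5: {1, 2}, 6: {1, 2}, 7: {3, 1}, 8: {3, 2}}
print(form_stripes(layout, k=4))
# e.g. [(1, [1, 4, 5, 6]), (3, [2, 3, 7, 8])]
```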

Availability Issues

Modeling Reliability Problem
• Model the replica layout as a bipartite graph:
  - Left side: replicated blocks
  - Right side: nodes
  - Edge: a replica of the block stored on the node
• A replica layout is valid ↔ a valid maximum matching exists in the bipartite graph
[Figure: bipartite graph between Blocks 1-3 and nodes grouped into Racks 1-4]

Modeling Reliability Problem
• Cast the matching as a max-flow problem: source S connects to each block, each block connects to the nodes storing its replicas, each node connects to its rack, and each rack connects to sink T, with capacity-1 edges bounding how many blocks of a stripe each node and rack may keep
[Figure: flow network S → blocks → nodes → racks → T with capacity-1 edges]

Incremental Algorithm
• Build a stripe block by block, starting from the core rack: add a candidate block to the flow network and recompute the max flow, admitting the block only if the max flow grows by one (Max flow = 1, 2, 3, ...)
[Figure: the flow network after each admitted block, with the max flow increasing from 1 to 3]
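As an illustration of the validity test behind these two slides, here is a Python sketch using networkx max flow; the edge capacities (1 per block and per node, a per-rack budget at the sink) are assumptions matching the slides' single-rack-failure example, not necessarily the paper's full model.

```python
# Valid layout <=> a flow from S to T saturates every block's unit edge.
import networkx as nx

def layout_is_valid(block_nodes, node_rack, per_rack_budget=1):
    """block_nodes: {block: [nodes holding a replica]}; node_rack: {node: rack}."""
    G = nx.DiGraph()
    for blk, nodes in block_nodes.items():
        G.add_edge("S", ("b", blk), capacity=1)
        for n in nodes:
            G.add_edge(("b", blk), ("n", n), capacity=1)
    for n, r in node_rack.items():
        G.add_edge(("n", n), ("r", r), capacity=1)       # one stripe block per node
        G.add_edge(("r", r), "T", capacity=per_rack_budget)
    flow, _ = nx.maximum_flow(G, "S", "T")
    return flow == len(block_nodes)

node_rack = {"n1": 1, "n2": 1, "n3": 2, "n4": 3}
blocks = {"A": ["n1", "n3"], "B": ["n2", "n4"], "C": ["n1", "n2"]}
print(layout_is_valid(blocks, node_rack))  # True: A->n3, B->n4, C->n1 (or n2)
```

In the incremental algorithm, a candidate block would be admitted into a stripe only if re-running this check (equivalently, augmenting the existing flow) raises the max flow by one.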

Implementation
• Leverage the locality preservation of MapReduce:
  - RaidNode: attaches locality information (the core rack) to each stripe
  - JobTracker: guarantees that encoding is carried out by a slave in the core rack
[Figure: EAR in the NameNode passes stripe info (block list, etc.) to the RaidNode, which submits an encoding MapReduce job; the JobTracker schedules Task1 (stripe1: rack1) and Task2 (stripe2: rack2) onto slaves in the matching racks]

Testbed Experiments

Encoding Throughput
• Encoding while UDP background traffic is injected
• The more injected traffic, the higher EAR's throughput gain over RR: the gain rises from 57.5% to 119.7%

Write Response Time
• Run the encoding operation alongside write requests (arrival intervals follow a Poisson distribution at 2 requests/s; encoding starts at 30 s; each data point averages the response times of 3 consecutive write requests)
• Compared with RR, EAR:
  - Has similar write response time without encoding
  - Reduces write response time during encoding by 12.4%
  - Reduces encoding duration by 31.6%

Impact on MapReduce Jobs
• A 50-job MapReduce workload generated by SWIM to mimic a one-hour workload trace from a Facebook cluster
• EAR shows very similar performance to RR

Discrete-Event Simulations
• C++-based simulator built on CSIM 20
• Validated by replaying the write response time experiment: both with and without encoding, the simulated write response times (s) of RR and EAR match the testbed measurements
• Our simulator captures the performance of both the write and encoding operations precisely!

Discrete-Event Simulation

Simulation Results

Simulation Results
• Bandwidth ↑: encoding gain ↓, write gain flat (−)
  - Encoding throughput gain: up to 165.2%
  - Write throughput gain: around 20%
• Request rate ↑: encoding gain ↑, write gain flat (−)
  - Encoding throughput gain: up to 89.1%
  - Write throughput gain: between 25% and 28%

Simulation Results
• Tolerable rack failures ↑: encoding gain ↓, write gain ↓
  - Encoding throughput gain: from 82.1% down to 70.1%
  - Write throughput gain: from 34.7% down to 20.5%
• Number of replicas ↑: encoding gain flat (−), write gain ↓
  - Encoding throughput gain: around 70%
  - Write throughput gain: up to 34.7%

Load Balancing Analysis
• Example: each block's read requests split evenly across the racks holding its replicas

  Rack ID         | 1    | 2   | 3
  Stored blk IDs  | 1, 2 | 1   | 2
  Request percent | 50%  | 25% | 25%
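The percentages follow from splitting each block's requests evenly over its replica racks; this small Python sketch (with the table's illustrative layout) reproduces them:

```python
# Per-rack read share: each block's requests split evenly over its replica racks.
from collections import defaultdict

def rack_request_share(block_racks):
    """block_racks: {block: [racks with a replica]} -> {rack: share of requests}."""
    share = defaultdict(float)
    for racks in block_racks.values():
        for r in racks:
            share[r] += 1 / (len(racks) * len(block_racks))
    return dict(share)

print(rack_request_share({1: [1, 2], 2: [1, 3]}))
# {1: 0.5, 2: 0.25, 3: 0.25} -> rack 1 serves 50%; racks 2 and 3 serve 25% each
```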

Conclusions
• Build EAR to:
  - Eliminate cross-rack downloads during encoding
  - Eliminate relocation after the encoding operation
  - Maintain the load balance of random replication
• Implement an EAR prototype in Hadoop-20
• Show the performance gain of EAR over RR via testbed experiments and discrete-event simulations
• Source code of EAR is available at: