Repair Pipelining for Erasure-Coded Storage

Presentation transcript:

Repair Pipelining for Erasure-Coded Storage
Runhui Li, Xiaolu Li, Patrick P. C. Lee, Qun Huang
The Chinese University of Hong Kong
USENIX ATC 2017

Introduction
- Fault tolerance for distributed storage is critical
  - Availability: data remains accessible under failures
  - Durability: no data loss even under failures
- Erasure coding is a promising redundancy technique
  - Minimum data redundancy via "data encoding"
  - Higher reliability than replication at the same storage redundancy
  - Reportedly deployed in Google, Azure, and Facebook
  - e.g., Azure reduces redundancy from 3x (replication) to 1.33x (erasure coding) -> PBs of savings

Erasure Coding
- Divide file data into k blocks
- Encode the k (uncoded) blocks into n coded blocks
- Distribute the set of n coded blocks (a stripe) to n nodes
- Fault tolerance: any k out of n blocks can recover the file data
- [Figure, (n, k) = (4, 2): the file is divided into pieces A, B, C, D and encoded; the four nodes store [A, B], [C, D], [A+C, B+D], and [A+D, B+C+D]]
- Remark: for systematic codes, k of the n coded blocks are the original k uncoded blocks
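Below is a minimal sketch (not the authors' code) of the (n, k) = (4, 2) example above, assuming the node layout described for the figure. It encodes the four pieces with XORs and checks that any two of the four nodes suffice to recover the file.

```python
from itertools import combinations

def xor(x, y):
    """XOR two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(x, y))

# File divided into four equal-size pieces.
A, B, C, D = b"AAAA", b"BBBB", b"CCCC", b"DDDD"
pieces = [A, B, C, D]

# Coefficient vectors over (A, B, C, D) for each stored piece, mirroring the
# assumed layout: N1 = [A, B], N2 = [C, D], N3 = [A+C, B+D], N4 = [A+D, B+C+D].
layout = {
    "N1": [(1, 0, 0, 0), (0, 1, 0, 0)],
    "N2": [(0, 0, 1, 0), (0, 0, 0, 1)],
    "N3": [(1, 0, 1, 0), (0, 1, 0, 1)],
    "N4": [(1, 0, 0, 1), (0, 1, 1, 1)],
}

def encode(coeffs):
    """XOR together the file pieces selected by a coefficient vector."""
    out = bytes(len(pieces[0]))
    for c, p in zip(coeffs, pieces):
        if c:
            out = xor(out, p)
    return out

stored = {node: [encode(c) for c in coeffs] for node, coeffs in layout.items()}

def decode(node_pair):
    """Recover A, B, C, D from any two nodes via GF(2) Gauss-Jordan elimination."""
    rows = [(list(c), stored[node][i])
            for node in node_pair for i, c in enumerate(layout[node])]
    for col in range(4):
        pivot = next(r for r in range(col, 4) if rows[r][0][col])
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for r in range(4):
            if r != col and rows[r][0][col]:
                rows[r] = ([a ^ b for a, b in zip(rows[r][0], rows[col][0])],
                           xor(rows[r][1], rows[col][1]))
    return [rows[i][1] for i in range(4)]

for pair in combinations(layout, 2):
    assert decode(pair) == pieces   # any k = 2 of the n = 4 nodes recover the file
print("all node pairs recover the original file data")
```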

Erasure Coding
- Practical erasure codes satisfy linearity and addition associativity
  - Each block can be expressed as a linear combination of any k blocks in the same stripe, based on Galois Field arithmetic
  - e.g., block B = a1*B1 + a2*B2 + a3*B3 + a4*B4 for k = 4, coefficients ai, and blocks Bi
  - Also applicable to XOR-based erasure codes
- Examples: Reed-Solomon codes, regenerating codes, LRC
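To make the linearity concrete, here is a small sketch of such a linear combination over GF(2^8) (assumptions: the common polynomial 0x11d and byte-wise coding; the coefficients and blocks below are hypothetical, since real coefficients come from the code's generator matrix).

```python
def gf_mul(a, b, poly=0x11d):
    """Multiply two elements of GF(2^8), reduced modulo the field polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result

def scale(block, coeff):
    """Multiply every byte of a block by a GF(2^8) coefficient."""
    return bytes(gf_mul(x, coeff) for x in block)

def add(x, y):
    """Addition in GF(2^8) is XOR."""
    return bytes(a ^ b for a, b in zip(x, y))

def linear_combine(coeffs, blocks):
    """Compute a1*B1 + a2*B2 + ... over GF(2^8), byte by byte."""
    out = bytes(len(blocks[0]))
    for a, blk in zip(coeffs, blocks):
        out = add(out, scale(blk, a))
    return out

# Hypothetical usage: rebuild block B from k = 4 surviving blocks.
B1, B2, B3, B4 = b"\x01\x02", b"\x03\x04", b"\x05\x06", b"\x07\x08"
B = linear_combine([3, 7, 11, 13], [B1, B2, B3, B4])
print(B.hex())
```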

Erasure Coding
- Good: low redundancy with high fault tolerance
- Bad: high repair penalty
  - In general, k blocks are retrieved to repair a single failed block
- Mitigating the repair penalty of erasure coding is a hot topic
  - New erasure codes to reduce repair bandwidth or I/O, e.g., regenerating codes, LRC, Hitchhiker
  - Efficient repair approaches for general erasure codes, e.g., lazy repair, PPR

Conventional Repair
- Single-block repair:
  - Retrieve k blocks from k working nodes (helpers)
  - Store the repaired block at the requestor
- [Figure: k = 4 helpers N1-N4 send their blocks over the network to the requestor R; R's downlink is the bottleneck]
- Repair time = k timeslots
  - Bottlenecked by the requestor's downlink
  - Uneven bandwidth usage (e.g., links among helpers are idle)

Partial-Parallel-Repair (PPR) [Mitra, EuroSys'16]
- Exploit linearity and addition associativity to perform repair in a "divide-and-conquer" manner
- Example with k = 4 helpers N1-N4 and requestor R:
  - Timeslot 1: N1 sends a1*B1 to N2 (-> a1*B1 + a2*B2); N3 sends a3*B3 to N4 (-> a3*B3 + a4*B4)
  - Timeslot 2: N2 sends a1*B1 + a2*B2 to N4 (-> a1*B1 + a2*B2 + a3*B3 + a4*B4)
  - Timeslot 3: N4 sends the repaired block to R
- Repair time = ceil(log2(k+1)) timeslots
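A minimal sketch (not the PPR implementation) of this divide-and-conquer aggregation: in each timeslot, holders of partial sums are paired up and merged in parallel, so the repair completes in ceil(log2(k+1)) timeslots, reproducing the k = 4 schedule above.

```python
import math

def ppr_schedule(k):
    """Return, per timeslot, the (sender, receiver) partial-sum transfers."""
    # Helpers 1..k hold a_i*B_i; node 0 is the requestor R, which receives last.
    holders = list(range(1, k + 1)) + [0]
    slots = []
    while len(holders) > 1:
        transfers, nxt = [], []
        for i in range(0, len(holders) - 1, 2):   # pair up holders; merge in parallel
            transfers.append((holders[i], holders[i + 1]))
            nxt.append(holders[i + 1])
        if len(holders) % 2:
            nxt.append(holders[-1])               # odd one out waits for the next slot
        slots.append(transfers)
        holders = nxt
    return slots

k = 4
schedule = ppr_schedule(k)
for t, transfers in enumerate(schedule, start=1):
    moves = ", ".join(f"N{s} -> {'R' if r == 0 else f'N{r}'}" for s, r in transfers)
    print(f"timeslot {t}: {moves}")
print("repair time =", len(schedule), "timeslots = ceil(log2(k+1)) =",
      math.ceil(math.log2(k + 1)))
```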

Open Question
- The repair time of erasure coding remains larger than the normal read time
  - Repair-optimal erasure codes still read more data than the amount of failed data
- Erasure coding is mainly used for warm/cold data
  - Repair penalty only applies to less frequently accessed data
  - Hot data remains replicated
- Can we reduce the repair time of erasure coding to almost the same as the normal read time?
  - This would create an opportunity for storing hot data with erasure coding

Our Contributions
- Repair pipelining, a technique to speed up repair for general erasure coding
  - Applicable to degraded reads and full-node recovery
  - O(1) repair time in homogeneous settings
  - Extensions to heterogeneous settings
- A prototype, ECPipe, integrated with HDFS and QFS
- Experiments on a local cluster and Amazon EC2
  - Reduction of repair time by 90% and 80% over conventional repair and partial-parallel-repair (PPR), respectively

Repair Pipelining
- Goals:
  - Eliminate bottlenecked links: no link transmits more traffic than others
  - Effectively utilize available bandwidth resources in repair: links should not be idle most of the time
- Key observation: the coding unit (word) is much smaller than the read/write unit (block)
  - e.g., word size ~ 1 byte; block size ~ 64 MiB
  - Words at the same offset across the n blocks of a stripe are encoded together

Repair Pipelining
- Idea: slicing a block
  - Each slice comprises multiple words (e.g., slice size ~ 32 KiB)
  - Pipeline the repair of each slice through a linear path of helpers
- Single-block repair (k = 4, s = 6 slices):
  - [Figure: each slice's partial sums a1*b1, a1*b1 + a2*b2, a1*b1 + a2*b2 + a3*b3, a1*b1 + a2*b2 + a3*b3 + a4*b4 flow along the path N1 -> N2 -> N3 -> N4 -> R; successive slices overlap in time]
- Repair time = 1 + (k-1)/s -> 1 timeslot if s is large
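A small sketch (my own timing model, not the paper's code) that simulates this pipeline and shows the repair time approaching one timeslot as the number of slices s grows.

```python
def pipelined_repair_time(k, s):
    """Finish time (in block-transmission timeslots) of one pipelined block repair.

    The path has k hops (k-1 helper-to-helper links plus one link to the
    requestor); each slice takes 1/s of a timeslot per hop, and a link can
    carry only one slice at a time.
    """
    free = [0.0] * k                # time at which each link becomes free
    t = 0.0
    for _ in range(s):              # slices enter the pipeline in order
        t = 0.0
        for j in range(k):
            t = max(t, free[j]) + 1.0 / s
            free[j] = t
    return t

k = 4
for s in (1, 6, 32, 1024):
    print(f"s = {s:5d}: repair time = {pipelined_repair_time(k, s):.3f} timeslots")
# s = 1 degenerates to k sequential transmissions; for large s the time tends
# to 1 + (k-1)/s, i.e., about one timeslot.
```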

Repair Pipelining
- Two types of single-failure repair (the most common case):
  - Degraded read: repairing an unavailable block at a client
  - Full-node recovery: repairing all lost blocks of a failed node at one or multiple nodes
    - Greedy scheduling of multiple stripes across helpers (see the sketch below)
- Challenge: repair is degraded by stragglers
  - Any repair of erasure coding faces similar problems due to data retrievals from multiple helpers
  - Our approach: address heterogeneity and bypass stragglers
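A simplified sketch of the kind of greedy multi-stripe scheduling referred to above (an assumption of mine, not the paper's exact algorithm): for each lost block, pick the k currently least-loaded surviving nodes of its stripe as helpers, so that repair load stays balanced across helpers during full-node recovery.

```python
from collections import Counter

def greedy_schedule(stripes, k):
    """stripes: list of surviving-node lists, one per lost block."""
    load = Counter()                      # how many blocks each node serves as helper
    plan = []
    for survivors in stripes:
        # Choose the k least-loaded survivors of this stripe as its helpers.
        helpers = sorted(survivors, key=lambda n: load[n])[:k]
        for n in helpers:
            load[n] += 1
        plan.append(helpers)
    return plan, load

# Toy usage: 3 lost blocks of a (hypothetical) (n, k) = (6, 4) code; each
# stripe lists its 5 surviving nodes.
stripes = [
    ["N1", "N2", "N3", "N4", "N5"],
    ["N2", "N3", "N4", "N5", "N6"],
    ["N1", "N3", "N4", "N5", "N6"],
]
plan, load = greedy_schedule(stripes, k=4)
for i, helpers in enumerate(plan):
    print(f"stripe {i}: helpers = {helpers}")
print("helper load:", dict(load))
```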

Extension to Heterogeneity
- Heterogeneity: link bandwidths differ
- Case 1: limited bandwidth when a client issues reads to a remote storage system
  - Cyclic version of repair pipelining: allows a client to issue parallel reads from multiple helpers
- Case 2: arbitrary link bandwidths
  - Weighted path selection: select the "best" path of helpers for repair

Repair Pipelining (Cyclic Version)
- [Figure: slices are organized into groups; within each group, the pipelining paths over helpers N1-N4 are cyclically shifted, and the last helper of each path sends its repaired data to the requestor R]
- The requestor receives repaired data from k-1 helpers in parallel
- Repair time in homogeneous environments -> 1 timeslot for large s

Weighted Path Selection
- Goal: find a path of k + 1 nodes (i.e., k helpers and the requestor) that minimizes the maximum link weight
  - e.g., set the link weight as the inverse of the link bandwidth, so any straggler is associated with a large weight
- Brute-force search is expensive: (n-1)!/(n-1-k)! permutations
- Our algorithm: apply brute-force search, but avoid searching non-optimal paths (see the sketch below)
  - If link L has a weight larger than the maximum weight of the current optimal path, any path containing L must be non-optimal
  - Remains optimal, with much less search time
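A minimal sketch (not the paper's implementation) of this pruned search: enumerate paths of k helpers ending at the requestor, but stop extending a partial path as soon as its bottleneck weight already reaches the best found so far, which is exactly the pruning rule stated above. The link weights in the usage example are hypothetical.

```python
import itertools

def best_repair_path(helpers, requestor, weight, k):
    """Return (bottleneck_weight, path) minimizing the maximum link weight.

    weight[(u, v)]: weight of link u -> v (e.g., 1 / bandwidth).
    The returned path is [h1, ..., hk, requestor].
    """
    best = [float("inf"), None]

    def extend(path, cur_max):
        if cur_max >= best[0]:                 # prune: cannot beat the current optimum
            return
        if len(path) == k:                     # all k helpers chosen: close at requestor
            total = max(cur_max, weight[(path[-1], requestor)])
            if total < best[0]:
                best[0], best[1] = total, path + [requestor]
            return
        for h in helpers:
            if h not in path:
                w = cur_max if not path else max(cur_max, weight[(path[-1], h)])
                extend(path + [h], w)

    extend([], 0.0)
    return best[0], best[1]

# Toy usage: pick k = 3 helpers out of 4 candidates; one link is a straggler.
nodes = ["N1", "N2", "N3", "N4", "R"]
weight = {(u, v): 1.0 for u, v in itertools.permutations(nodes, 2)}
weight[("N2", "N3")] = 5.0                     # slow (straggler) link
bottleneck, path = best_repair_path(["N1", "N2", "N3", "N4"], "R", weight, k=3)
print("best path:", " -> ".join(path), "| bottleneck weight:", bottleneck)
```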

Implementation
- ECPipe: a middleware atop the distributed storage system
- [Figure: architecture with the requestor, coordinator, and helpers at storage nodes, connected by control flow and data flow]
- The requestor is implemented as a C++/Java class
- Each helper daemon directly reads local blocks via the native file system
- The coordinator accesses block locations and block-to-stripe mappings
- ECPipe is integrated with HDFS and QFS, with around 110 and 180 LOC of changes, respectively

Evaluation
- ECPipe performance on a 1Gb/s local cluster
- [Figure: single-block repair time vs. slice size for (n, k) = (14, 10)]
- Trade-off of slice size:
  - Too small: transmission overhead is significant
  - Too large: less parallelization
  - Best slice size = 32 KiB
- Repair pipelining (basic and cyclic) outperforms conventional repair and PPR by 90.9% and 80.4%, respectively
- Only 7% more than the direct send time over a 1Gb/s link -> O(1) repair time

Evaluation
- ECPipe performance on a 1Gb/s local cluster
- [Figure: full-node recovery rate vs. number of requestors for (n, k) = (14, 10)]
- Recovery rate increases with the number of requestors
- Repair pipelining (RP and RP+scheduling) achieves a high recovery rate
- Greedy scheduling balances repair load across helpers when there are more requestors (i.e., more resource contention)

Evaluation
- ECPipe performance on Amazon EC2
- Weighted path selection reduces the single-block repair time of basic repair pipelining by up to 45%

Evaluation
- Single-block repair performance on HDFS and QFS
- [Figures: QFS repair time vs. slice size, QFS repair time vs. block size, HDFS repair time vs. (n, k)]
- ECPipe significantly improves repair performance
  - Conventional repair under ECPipe outperforms the original conventional repair inside the distributed file systems (by ~20%) by avoiding block fetches via the distributed storage system routine
  - The performance gain is mainly due to repair pipelining (by ~90%)

Conclusions
- Repair pipelining, a general technique that enables very fast repair for erasure-coded storage
- Contributions:
  - Designs for both degraded reads and full-node recovery
  - Extensions to heterogeneity
  - Prototype implementation ECPipe
  - Extensive experiments on a local cluster and Amazon EC2
- Source code: http://adslab.cse.cuhk.edu.hk/software/ecpipe