Analysis and Construction of Functional Regenerating Codes with Uncoded Repair for Distributed Storage Systems Yuchong Hu, Patrick P. C. Lee, Kenneth.

Slides:



Advertisements
Similar presentations
Disk Arrays COEN 180. Large Storage Systems Collection of disks to store large amount of data. Performance advantage: Each drive can satisfy only so many.
Advertisements

1 A triple erasure Reed-Solomon code, and fast rebuilding Mark Manasse, Chandu Thekkath Microsoft Research - Silicon Valley Alice Silverberg Ohio State.
1 On the Speedup of Single-Disk Failure Recovery in XOR-Coded Storage Systems: Theory and Practice Yunfeng Zhu 1, Patrick P. C. Lee 2, Yuchong Hu 2, Liping.
RAID Oh yes Whats RAID? Redundant Array (of) Independent Disks. A scheme involving multiple disks which replicates data across multiple drives. Methods.
Availability in Globally Distributed Storage Systems
current hadoop architecture
Alex Dimakis based on collaborations with Dimitris Papailiopoulos Arash Saber Tehrani USC Network Coding for Distributed Storage.
CSCE430/830 Computer Architecture
Henry C. H. Chen and Patrick P. C. Lee
1 NCFS: On the Practicality and Extensibility of a Network-Coding-Based Distributed File System Yuchong Hu 1, Chiu-Man Yu 2, Yan-Kit Li 2 Patrick P. C.
HAIL (High-Availability and Integrity Layer) for Cloud Storage
BASIC Regenerating Codes for Distributed Storage Systems Kenneth Shum (Joint work with Minghua Chen, Hanxu Hou and Hui Li)
Coding and Algorithms for Memories Lecture 12 1.
Simple Regenerating Codes: Network Coding for Cloud Storage Dimitris S. Papailiopoulos, Jianqiang Luo, Alexandros G. Dimakis, Cheng Huang, and Jin Li University.
Yuchong Hu1, Henry C. H. Chen1, Patrick P. C. Lee1, Yang Tang2
1 Degraded-First Scheduling for MapReduce in Erasure-Coded Storage Clusters Runhui Li, Patrick P. C. Lee, Yuchong Hu The Chinese University of Hong Kong.
1 STAIR Codes: A General Family of Erasure Codes for Tolerating Device and Sector Failures in Practical Storage Systems Mingqiang Li and Patrick P. C.
2P13 Week 11. A+ Guide to Managing and Maintaining your PC, 6e2 RAID Controllers Redundant Array of Independent (or Inexpensive) Disks Level 0 -- Striped.
Availability in Globally Distributed Storage Systems
REDUNDANT ARRAY OF INEXPENSIVE DISCS RAID. What is RAID ? RAID is an acronym for Redundant Array of Independent Drives (or Disks), also known as Redundant.
Beyond the MDS Bound in Distributed Cloud Storage
Typhoon: An Ultra-Available Archive and Backup System Utilizing Linear-Time Erasure Codes.
Network Coding for Large Scale Content Distribution Christos Gkantsidis Georgia Institute of Technology Pablo Rodriguez Microsoft Research IEEE INFOCOM.
A Hybrid Approach of Failed Disk Recovery Using RAID-6 Codes: Algorithms and Performance Evaluation Yinlong Xu University of Science and Technology of.
Redundant Data Update in Server-less Video-on-Demand Systems Presented by Ho Tsz Kin.
Cooperative regenerating codes for distributed storage systems Kenneth Shum (Joint work with Yuchong Hu) 22nd July 2011.
Mario Vodisek 1 HEINZ NIXDORF INSTITUTE University of Paderborn Algorithms and Complexity Erasure Codes for Reading and Writing Mario Vodisek ( joint work.
Failures in the System  Two major components in a Node Applications System.
Network Coding vs. Erasure Coding: Reliable Multicast in MANETs Atsushi Fujimura*, Soon Y. Oh, and Mario Gerla *NEC Corporation University of California,
NCCloud: A Network-Coding-Based Storage System in a Cloud-of-Clouds
Network Coding for Distributed Storage Systems IEEE TRANSACTIONS ON INFORMATION THEORY, SEPTEMBER 2010 Alexandros G. Dimakis Brighten Godfrey Yunnan Wu.
Network Coding Distributed Storage Patrick P. C. Lee Department of Computer Science and Engineering The Chinese University of Hong Kong 1.
1 Convergent Dispersal: Toward Storage-Efficient Security in a Cloud-of-Clouds Mingqiang Li 1, Chuan Qin 1, Patrick P. C. Lee 1, Jin Li 2 1 The Chinese.
Repairable Fountain Codes Megasthenis Asteris, Alexandros G. Dimakis IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 32, NO. 5, MAY /5/221.
22/07/ The MDS Scaling Problem for Cloud Storage Yu-chong Hu Institute of Network Coding.
1 Failure Correction Techniques for Large Disk Array Garth A. Gibson, Lisa Hellerstein et al. University of California at Berkeley.
Min Xu1, Yunfeng Zhu2, Patrick P. C. Lee1, Yinlong Xu2
Degraded-First Scheduling for MapReduce in Erasure-Coded Storage Clusters Runhui Li, Patrick P. C. Lee, Yuchong Hu th Annual IEEE/IFIP International.
1 Network Coding and its Applications in Communication Networks Alex Sprintson Computer Engineering Group Department of Electrical and Computer Engineering.
Growth Codes: Maximizing Sensor Network Data Persistence abhinav Kamra, Vishal Misra, Jon Feldman, Dan Rubenstein Columbia University, Google Inc. (SIGSOMM’06)
A Cost-based Heterogeneous Recovery Scheme for Distributed Storage Systems with RAID-6 Codes Yunfeng Zhu 1, Patrick P. C. Lee 2, Liping Xiang 1, Yinlong.
Cooperative Recovery of Distributed Storage Systems from Multiple Losses with Network Coding Yuchong Hu, Yinlong Xu, Xiaozhao Wang, Cheng Zhan and Pei.
1 Making MapReduce Scheduling Effective in Erasure-Coded Storage Clusters Runhui Li and Patrick P. C. Lee The Chinese University of Hong Kong LANMAN’15.
1 Enabling Efficient and Reliable Transitions from Replication to Erasure Coding for Clustered File Systems Runhui Li, Yuchong Hu, Patrick P. C. Lee The.
On Coding for Real-Time Streaming under Packet Erasures Derek Leong *#, Asma Qureshi *, and Tracey Ho * * California Institute of Technology, Pasadena,
Exact Regenerating Codes on Hierarchical Codes Ernst Biersack Eurecom France Joint work and Zhen Huang.
20/10/ Cooperative Recovery of Distributed Storage Systems from Multiple Losses with Network Coding Yuchong Hu Institute of Network Coding Please.
A Fast Repair Code Based on Regular Graphs for Distributed Storage Systems Yan Wang, East China Jiao Tong University Xin Wang, Fudan University 1 12/11/2013.
Coding and Algorithms for Memories Lecture 13 1.
Data & Storage Services CERN IT Department CH-1211 Genève 23 Switzerland t DSS Data architecture challenges for CERN and the High Energy.
Secret Sharing in Distributed Storage Systems Illinois Institute of Technology Nexus of Information and Computation Theories Paris, Feb 2016 Salim El Rouayheb.
Seminar On Rain Technology
Reliability of Disk Systems. Reliability So far, we looked at ways to improve the performance of disk systems. Next, we will look at ways to improve the.
Pouya Ostovari and Jie Wu Computer & Information Sciences
SEMINAR TOPIC ON “RAIN TECHNOLOGY”
A Tale of Two Erasure Codes in HDFS
Double Regenerating Codes for Hierarchical Data Centers
IERG6120 Lecture 22 Kenneth Shum Dec 2016.
Steve Ko Computer Sciences and Engineering University at Buffalo
Steve Ko Computer Sciences and Engineering University at Buffalo
A Simulation Analysis of Reliability in Erasure-coded Data Centers
Repair Pipelining for Erasure-Coded Storage
Presented by Haoran Wang
RAID RAID Mukesh N Tekwani
Maximally Recoverable Local Reconstruction Codes
Xiaoyang Zhang1, Yuchong Hu1, Patrick P. C. Lee2, Pan Zhou1
On Sequential Locally Repairable Codes
Dr. Zhijie Huang and Prof. Hong Jiang University of Texas at Arlington
Erasure Correcting Codes for Highly Available Storage
RAID RAID Mukesh N Tekwani April 23, 2019
Presentation transcript:

Analysis and Construction of Functional Regenerating Codes with Uncoded Repair for Distributed Storage Systems Yuchong Hu, Patrick P. C. Lee, Kenneth W. Shum The Chinese University of Hong Kong INFOCOM’13

Distributed Storage Distributed storage is widely adopted P2P storage: OceanStore, TotalRecall, etc. Cloud storage: Azure, GFS, etc. Data availability is important Cause: node failures are common Solution: redundancy over multiple storage nodes Redundancy schemes Replication: make identical copies Erasure codes: encode data into parity blocks Erasure codes have less storage overhead than replication One class of erasure codes: maximum distance separable (MDS) codes Replication: Simple but Expensive. Maximum Distance Separable (MDS) codes: Much less redundancy overhead than replication.

(n, k) MDS codes Encode a file of size M into chunks Distribute encoded chunks into n nodes Each node stores M/k units of data MDS property: any k out of n nodes can recover file Nodes A A B B C C File A divide encode D D B C A+C A+C D B+D B+D File of size M A+D A+D B+C+D B+C+D n = 4, k = 2

Repairing a Failure Q: Can we minimize repair traffic? Conventional repair: download data from any k nodes repaired node Node 1 A B File of size M A B C D Node 2 C C D D A B Node 3 A+C A+C B+D B+D Node 4 A+D B+C+D Disk Read = = M + Repair Traffic = = M + Q: Can we minimize repair traffic?

Regenerating Codes Repair in regenerating codes: [Dimakis et al.; ToIT’10] Repair in regenerating codes: Surviving nodes encode chunks (network coding) Download one encoded chunk from each node Node 1 A repaired node File of size M A B C D B C Node 2 C D A Node 3 A+C A+C B B+D Node 4 A+D A+B+C B+C+D Disk Read = = M + Repair Traffic = = 0.75M + + Q: Can we minimize disk read further?

Functional Minimum Storage Regenerating (FMSR) Codes [Hu et al., FAST’12] Repair in FMSR codes: No chunk encoding in surviving nodes (uncoded repair) Download one chunk from each surviving node Parity chunk Pi = linear combination of original native chunks New parity chunk P’i = linear combination of downloaded chunks Node 1 P1 repaired node File of size M A B C D P2 P3 Node 2 P3 Remark: A trade-off of FMSR codes is that the whole encoded file must be decoded first if parts of a file are accessed. Nevertheless, FMSR codes are suited to long-term archival applications, since data backups are rarely read and it is common to restore the whole file rather than file parts. P4 P1’ P5 Node 3 P5 P2’ P6 P7 Node 4 P7 P8 Disk Read = = 0.75M + Repair Traffic = = 0.75M + +

FMSR Codes NCCloud [Hu’12] implements and evaluates FMSR codes FMSR codes in NCCloud aim for following goals: Double-fault tolerance and optimal storage efficiency (k = n-2) MDS property: any n-2 out of n nodes can recover file Each node stores M/k units of data Minimum repair bandwidth Up to 50% saving compared to conventional repair Uncoded repair Non-systematic codes Suited to long-term archival storage whose data is rarely read

Related Work Theoretical studies on FMSR codes [Dimakis et al. ’10] Uncoded MSR codes e.g., [Tamo'11], [Wang'10], [Wang'11] Implementation complexity unknown Applied work: NCCloud [Hu’12] Implement and evaluate FMSR codes for k=n-2 Correctness of FMSR codes proven for n=4, k=2 [Shum’12] Explanations: Regenerating codes exploit the optimal trade-off between storage and repair traffic. One end of the trade-off corresponding to storage refers MSR codes; the other end corresponding to repair traffic refers to MBR codes.

Challenge Can FMSR codes maintain double-fault tolerance after any number of rounds of repair? NCCloud only shows via simulations that FMSR codes maintain MDS property after hundreds of repair rounds In other words, can we achieve benefits of network coding without network coding? No node encoding required Provide theoretical foundations for applied work Explanations: given that new parity chunks are regenerated in each round of repair, we need to ensure that such chunks preserve the fault tolerance of MDS codes after multiple rounds of repair.

Our Contributions Prove existence of FMSR codes Preserve double-fault tolerance after any number of repair rounds with uncoded repair  Minimize disk read and repair bandwidth Provide a deterministic FMSR code construction: Specify chunks to be read from surviving nodes Specify encoding coefficients for regenerating new chunks  Minimize computational time to construct new parity chunks Evaluate our deterministic FMSR codes random FMSR codes: exponentially increasing time deterministic FMSR codes: done in 0.5 seconds

FMSR Codes in NCCloud NCCloud NCCloud n(n-k) chunks k(n-k) chunks File Upload NCCloud n(n-k) chunks P1,1 P1,1 k(n-k) chunks P1,2 P1,2 File F1 P2,1 P2,1 divide F2 encode P2,2 upload P2,2 F3 P3,1 F4 P3,2 P3,1 P4,1 P3,2 P4,2 n=4, k=2 P4,1 P4,2 File Download NCCloud P1,1 P1,2 Remark: In this slide, we introduce the following definition MDS property: For any subset of k out of n nodes, if the k(n − k) parity chunks from the k nodes can be decoded to the k(n −k) native chunks of the original file, then the MDS property is satisfied. Decodability. We say that a collection of k(n−k) parity chunks is decodable if the parity chunks can be decoded to the original file. download k(n-k) chunks k(n-k) chunks P2,1 File P1,1 F1 P2,2 decode MDS Property P1,2 F2 merge P3,1 P2,1 F3 P3,2 P2,2 F4 decodable P4,1 n=4, k=2 P4,2

Counter-Example NCCloud NCCloud New node 1 New node 2 1st repair encode P’1,1 repair P’1,1 P3,1 P3,1 P’1,2 P’1,2 P3,2 P4,1 Node 4 P4,1 P4,2 2nd repair NCCloud Node 1 P’1,1 Remark: NCCloud randomly read the chunk from the surviving nodes, so the first two repairs may be shown in this slide. P’1,2 Node 2 P2,1 New node 2 P’1,1 P2,2 encode P’2,1 repair P’2,1 Node 3 P3,1 P3,1 P’2,2 P’2,2 P3,2 P4,1 Node 4 P4,1 P4,2

Counter-Example Observe file decoding after the 2nd repair NCCloud download NCCloud P’2,1 P’2,2 P’1,1 P’1,2 P’2,1 P’2,2 P’1,1 P2,1 P3,1 P4,1 reduce reduce P’1,2 P3,1 P3,1 P3,2 P4,1 P4,1 File decoding fails!! linear dependent chunks P4,2

LDC Linear Dependent Collection (LDC) encode P’1,1 lead to P3,1 P’1,2 P4,1 1st repair Three LDCs of the 1st repair Definition: An LDC of the rth repair is a collection of k(n-k) chunks formed by less than k(n-k) chunks of the rth repair Problem: If the selected chunks after the (r+1)th repair are formed by an LDC of the rth repair , then file decoding fails

RBC Repair-based Collection (RBC) Definition: An RBC of the rth round of repair is a collection of k(n-k) chunks formed as follows: Step 1 Select any n-1 out of n nodes. Step 2 Select k-1 out of the n-1 nodes found in Step 1 and collect n-k chunks from each selected node. Step 3 Collect one chunk from each of the non-selected n-k nodes. Node 1 P’1,1 P’1,1 P’1,1 P’1,1 P’1,2 P’1,2 P’1,2 P’1,2 In order to solve the problem, we first introduce a concept “RBC” Node 2 P2,1 P2,1 P2,1 P2,1 P2,2 P2,2 P3,1 Node 3 P3,1 P3,1 P3,1 P3,2 one RBC after the 1st repair P3,2 Node 4 P4,1 Step 1 Step 2 Step 3 P4,2 Fact:RBCs contain all the LDCs.

Repair MDS Property Repair MDS (rMDS) property Definition: If all RBCs, after excluding the LDCs, of the rth repair are decodable, rMDS property is satisfied NCCloud: two-phase checking MDS property check of current round of repair rMDS property check: MDS property check for every possible failure in next round of repair Can two-phase checking maintain MDS property after any number of rounds of repair?

Lemma 1 There are two or more common chunks between selected chunks from surviving nodes for repairing and each LDC P’1,1 P’1,2 P2,1 P3,1 P’1,1 P’1,2 P2,1 P4,1 P’1,1 P’1,2 P3,1 P4,1 P2,1 P3,1 P4,1 Selected chunks in 1st repair Three LDCs of the 1st repair P2,1 P3,1 P2,1 P4,1 P3,1 P4,1

Lemma 2 If the rMDS property is satisfied, there always exist n-1 chunks from any n-1 nodes (each offers one chunk) s.t. any RBC containing them is decodable If the RBC doesn’t have green chunks, it is not a LDC If the RBC has a green chunk, we replace the chunk by a blue chunk (P2,2, P3,2 or P4,2). So it won’t be a LDC Node 1 P1,1 P’1,1 P1,2 P’1,2 Corollary: We can always use the un-selected chunks of the repair to construct an RBC such that it is decodable. P’1,1 P’1,2 P2,1 P3,1 P’1,1 P’1,2 P2,1 P4,1 P’1,1 P’1,2 P3,1 P4,1 Node 2 P2,1 P2,1 P2,2 P2,2 Node 3 P3,1 P3,1 P3,2 P3,2 Three LDCs of the 1st repair Node 4 P4,1 P4,1 P4,2 P4,2

Theorem Consider a file encoded using FMSR codes with k = n-2: In the rth repair, the lost chunks are reconstructed by random linear combinations of n-1 chunks selected from n-1 surviving nodes (each offers one chunk) After the repair, the reconstructed file still satisfies both MDS and rMDS properties with probability driven arbitrarily to 1 with increasing field size. Theorem  prove existence of FMSR codes

Theorem: Proof Proof: (by induction). Initialization We use Reed-Solomon codes to encode a file into n(n-k) = 2n chunks that satisfy both the MDS and rMDS properties MDS property check The chunks of any k nodes are linearly combined with a certain RBC which is decodable via Lemma 2 Sufficient field size ensures the probability of MDS property can be driven to one via Schwartz-Zippel Theorem

Theorem: Proof Proof (cont.): rMDS property check: all RBCs excluding the LDCs are decodable. (3.1) The RBC selects the repaired node in Step 2 We exclude the LDCs, and prove that remaining RBCs are decodable via induction hypothesis (3.2) The RBC selects the repaired node in Step 3 We can prove the RBC is not an LDC via Lemma 1 and is decodable

Deterministic FMSR codes Code construction: Store a file (1.1) Divide a file into k(n-k) = 2k equal-size chunks (1.2) Encode them into n(n-k) = 2(k+2) chunks by P1,1,P1,2;…;Pk+2,1,Pk+2,2 using Reed-Solomon codes (1.3) Upload them to n nodes. P1,1 P1,1 2k chunks P1,2 P1,2 File F1 P2,1 P2,1 divide F2 encode P2,2 upload P2,2 F3 P3,1 F4 P3,2 P3,1 RS codes P4,1 P3,2 n=4, k=2 P4,2 P4,1 P4,2

Deterministic FMSR codes Code construction: 1st repair (suppose node 1 fails) (2.1) Select k+1 chunks P2,1,...,Pk+2,1. (2.2) Construct coefficients that satisfy inequalities (2.3) Regenerate the chunks from (2,1) and (2.2) Node 1 P1,1 P1,2 Node 2 P2,1 P2,2 P2,1 Node 3 encode P’1,1 P3,1 P3,1 P’1,2 P3,2 P4,1 Node 4 P4,1 Encoding coefficients based on given inequalities P4,2

Deterministic FMSR codes Code construction: rth repair (only chunk selection is different) (3.1) Select k+1 chunks different from those selected in (r-1)th repair (3.2) Construct coefficients based on given inequalities Node 1 P’1,1 P’1,2 Node 2 P2,1 P’1,1 P2,2 encode P’2,1 Node 3 P3,1 P3,2 P’2,2 P3,2 P4,2 Encoding coefficients based on given inequalities Node 4 P4,1 P4,2 E.g., In the 2nd repair, we select P’1,1(or P’1,2) and P3,2 and P4,2 to perform the repair, since they are different from P2,1, P3,1 and P4,1 which are selected in the 1st repair.

Experiments Implementation Coding schemes Metric C language Galois Field (28) Intel CPU at 2.4GHZ Coding schemes random FMSR codes in NCCloud [Hu’12] our deterministic FMSR codes Metric Checking time spent on finding the chunks from surviving nodes for recovery

Evaluation Aggregate checking time of 50 rounds of repair Our investigation finds that the checking time of random FMSR codes increases dramatically as the value of n increases. For example, when n = 12 (not shown in our figures), we find that the repair operation of our random FMSR code implementation still cannot return a right set of regenerated chunks after running for two hours. In contrast, our deterministic FMSR codes can return a solution within 0.5 seconds. Aggregate checking time of 50 rounds of repair Random FMSR codes: exponentially increasing time Deterministic FMSR codes : time remains significantly small

Evaluation

Evaluation Summary: Deterministic FMSR codes significantly reduce the number of iterations of two-phase checking overhead of ensuring that the MDS property is preserved during repair.

Conclusions Formulate an uncoded repair problem based on FMSR codes Prove existence of FMSR codes Provide a deterministic FMSR code construction Show via our evaluation that our deterministic FMSR codes significantly reduce the repair time overhead of random FMSR codes. Slides available: http://www.cse.cuhk.edu.hk/~pclee