Toward Optimal Storage Scaling via Network Coding: From Theory to Practice
Xiaoyang Zhang1, Yuchong Hu1, Patrick P. C. Lee2, Pan Zhou1
1Huazhong University of Science and Technology
2The Chinese University of Hong Kong
INFOCOM 2018

Introduction
Fault tolerance for distributed storage is critical:
- Availability: data remains accessible under failures
- Durability: no data loss even under failures
Erasure coding is a promising redundancy technique:
- Minimum data redundancy via data encoding
- Higher reliability than replication at the same storage redundancy
- Reportedly deployed in the data centers of Google, Azure, and Facebook
- e.g., Azure reduces redundancy from 3x (replication) to 1.33x, saving PBs of storage

Erasure Coding
- An (n, k) code (k < n) encodes k data blocks into n-k parity blocks
- Distribute the set of n blocks (a stripe) to n nodes
- Maximum Distance Separable (MDS): any k out of n blocks can recover the file data, and the redundancy n/k is the minimum possible
[Figure: a (4, 2) code encodes data blocks D1, D2 into parity blocks P1, P2, one block per node]
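To make the MDS property concrete, here is a minimal sketch (not code from the paper) of a systematic (4, 2) code. It uses the prime field GF(257) instead of the GF(2^8) arithmetic of real Reed-Solomon implementations, purely to keep the example short; the Vandermonde-style generator matrix, the encoding, and the "any k of n blocks suffice" decoding follow the standard construction.

```python
# Minimal (n, k) = (4, 2) systematic code over GF(257) (a toy field choice).
P = 257

# Generator matrix G: 2 identity rows (data blocks) + 2 Vandermonde rows (parity blocks).
G = [[1, 0],   # D1
     [0, 1],   # D2
     [1, 1],   # P1 = D1 + D2     (evaluation point a = 1)
     [1, 2]]   # P2 = D1 + 2*D2   (evaluation point a = 2)

def encode(data):
    """Return the full stripe [D1, D2, P1, P2] for two data symbols."""
    return [sum(g * d for g, d in zip(row, data)) % P for row in G]

def decode(two_blocks):
    """Recover [D1, D2] from any two surviving (block_index, value) pairs."""
    (i, x), (j, y) = two_blocks
    a, b = G[i]
    c, d = G[j]
    det_inv = pow((a * d - b * c) % P, -1, P)   # invert the 2x2 submatrix mod P
    return [((d * x - b * y) * det_inv) % P,
            ((-c * x + a * y) * det_inv) % P]

stripe = encode([11, 42])
# Any 2 of the 4 blocks recover the data, e.g., D2 and P2:
assert decode([(1, stripe[1]), (3, stripe[3])]) == [11, 42]
```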

Storage Scaling
Requirements in storage systems:
- Adapt to increasing storage demands (i.e., new nodes added)
- Adapt to workload changes by tuning redundancy to trade between performance and storage overhead
- e.g., Facebook f4 [Muralidhar et al., OSDI'14]; HACFS [Xia et al., FAST'15]
Storage scaling: (n, k) code → (n', k') code
- Relocate some existing data to different nodes
- Recompute erasure-coded data based on the new data layout

Scaling Problem
Scaling problem: minimize the scaling bandwidth (i.e., the amount of network traffic incurred by scaling)
- Related to the classical repair problem of minimizing repair bandwidth, but fundamentally different: the repair problem maintains redundancy, whereas the scaling problem changes it
Challenges:
- How to quantify the scaling bandwidth?
- What is the fundamental limit of the scaling bandwidth?
- Can we design a scaling approach that matches the limit?

State-of-the-Art
Scale a (3, 2) code to a (4, 3) code with Scale-RS*:
- Existing nodes: X1, X2, X3; add one new node Y1
- Parity block updates: send 4 data blocks to X3 for parity updates
- Data block migration: move 4 data blocks to Y1
- Scaling bandwidth: 8 blocks
Note that the parity blocks are stored in a dedicated node in Scale-RS.
*Huang et al., "Scale-RS: An Efficient Scaling Scheme for RS-coded Storage Clusters", TPDS 2015

Network Coding
Idea: combine parity block updates and data block migration
- Update parity blocks locally
- Send encoded outputs to new nodes (e.g., send Q4 to Y1)
- Maintain a balanced parity layout
Scaling bandwidth: 4 blocks

Our Contributions
- Prove the information-theoretically minimum scaling bandwidth: the first formal study of applying network coding to storage scaling
- Design and implement NCScale, a prototype that realizes network-coding-based scaling and achieves minimum (or near-minimum) scaling bandwidth
- Conduct experiments on Amazon EC2: up to 50% scaling bandwidth savings, consistent with theoretical findings

Problem Formulation
(n, k, s)-scaling: transform an (n, k) code into an (n+s, k+s) code, for s > 0
Properties:
- MDS: tolerate any n-k failures with minimum redundancy
- Systematic: the k original data blocks are kept in each stripe
- Uniform data and parity distribution
- Decentralized scaling
Goal: minimize the scaling bandwidth subject to the above properties
* Our ISIT'18 paper covers a more general case of scaling

Vandermonde-based Reed-Solomon Codes
Building scaling on Vandermonde-based Reed-Solomon codes:
- Parity blocks are computed from an (n-k) x k Vandermonde matrix
- Each new parity block can be computed by adding an existing parity block to a parity delta block
[Figure: scaling a (4, 2) code to a (6, 4) code via parity delta blocks]
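The parity-delta idea can be illustrated by extending the GF(257) toy code sketched earlier (again a stand-in for the GF(2^8) arithmetic actually used): because the Vandermonde coefficients of the existing data blocks do not change when the stripe grows, each new parity equals the old parity plus a delta computed only from the newly added data blocks.

```python
# Parity-delta sketch for scaling the toy (4, 2) stripe to (6, 4) over GF(257).
P = 257

def parity(data, a):
    """Vandermonde parity at evaluation point a: sum_j a^j * data[j]."""
    return sum(pow(a, j, P) * d for j, d in enumerate(data)) % P

old_data, new_data = [11, 42], [7, 99]   # existing and newly added data blocks
for a in (1, 2):                          # the two parity rows of the toy code
    old_parity = parity(old_data, a)
    # The delta involves only the new data blocks, shifted to columns k .. k+s-1.
    delta = sum(pow(a, len(old_data) + j, P) * d for j, d in enumerate(new_data)) % P
    assert (old_parity + delta) % P == parity(old_data + new_data, a)
```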

Vandermonde-based Reed-Solomon Codes
Note: systematic Vandermonde-based Reed-Solomon codes are generally non-MDS in finite fields, i.e., the data blocks cannot always be recovered from an arbitrary set of k blocks
Nevertheless, the MDS property holds for practical parameters
See: http://www.cse.cuhk.edu.hk/~pclee/www/pubs/infocom18_corrections.pdf
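Since the MDS property has to be verified for the chosen parameters, here is a brute-force checker in the same GF(257) toy setting (a real check would run over GF(2^8)): a systematic code is MDS exactly when every k x k submatrix of its generator matrix is invertible.

```python
# Brute-force MDS check over the GF(257) toy field used in the earlier sketches.
from itertools import combinations

P = 257

def det_mod_p(m):
    """Determinant of a small square matrix mod P via Gaussian elimination."""
    m = [row[:] for row in m]
    size, det = len(m), 1
    for col in range(size):
        pivot = next((r for r in range(col, size) if m[r][col] % P != 0), None)
        if pivot is None:
            return 0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det = det * m[col][col] % P
        inv = pow(m[col][col], -1, P)
        for r in range(col + 1, size):
            factor = m[r][col] * inv % P
            for c in range(col, size):
                m[r][c] = (m[r][c] - factor * m[col][c]) % P
    return det % P

def is_mds(G, k):
    """True iff every k x k submatrix of the generator matrix G is invertible."""
    return all(det_mod_p([G[i] for i in rows]) != 0
               for rows in combinations(range(len(G)), k))

# The (4, 2) generator matrix from the earlier sketch passes the check.
print(is_mds([[1, 0], [0, 1], [1, 1], [1, 2]], k=2))   # True
```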

Lower Bound Analysis
Let (n, k, s) = (4, 2, 2), i.e., scale a (4, 2) code to a (6, 4) code; file size = M
β = bandwidth from any existing node Xi (1 ≤ i ≤ n) to any new node Yj (1 ≤ j ≤ s)
Goal: minimize β
Scaling bandwidth = n · s · β
[Figure: existing nodes X1-X4 each transfer β units of data to each of the new nodes Y1 and Y2]

Lower Bound Analysis
[Figure: information flow graph from a source S to a sink T over existing nodes X1-X4 and new nodes Y1, Y2, with nodes split into "in", "mid", and "out" vertices and edge capacities ∞, β, M/k, and M/(k+s)]

Lower Bound Analysis
Lemma 1: β must be at least M / (n(k+s)).
Proof idea: each new node Yj (1 ≤ j ≤ s) must receive at least M/(k+s) units of data in total from the n existing nodes Xi (1 ≤ i ≤ n), so n·β ≥ M/(k+s), i.e., β ≥ M / (n(k+s)).
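A quick numeric reading of the bound, assuming the per-link bound is met with equality and using the scaling-bandwidth expression n·s·β from the previous slide:

```python
# Minimum scaling bandwidth implied by Lemma 1 (illustrative calculation only).
def min_scaling_bandwidth(n, k, s, M=1.0):
    beta = M / (n * (k + s))   # Lemma 1: lower bound on each Xi -> Yj link
    return n * s * beta        # total traffic into the new nodes, = M*s/(k+s)

print(min_scaling_bandwidth(4, 2, 2))   # 0.5  -> half the file for the (4, 2, 2) example
print(min_scaling_bandwidth(9, 6, 2))   # 0.25 -> the (9, 6) setup used on EC2 with s = 2
```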

Lower Bound Analysis
Lemma 2: Suppose that β equals the lower bound M / (n(k+s)). Then the capacity of each possible min-cut is at least M; that is, the lower bound is tight.
Proof idea: classify nodes into four types.

Lower Bound Analysis
[Figures: example cuts in the information flow graph when the sink T connects to Type 1, Type 2, Type 3, and Type 4 nodes]

Lower Bound Analysis
Suppose that T connects to t_i nodes of Type i, where 1 ≤ i ≤ 4. Let Λ(t1, t2, t3, t4) denote the capacity of a cut. Then:
Λ = t1·(M/k) + t2·(M/(k+s)) + t3·(M/(k+s)) + t4·(n−t1)·β
Since t1 + t2 + t3 + t4 = k + s and β = M / (n(k+s)),
Λ ≥ M + M · t1·(n·s − k·t4) / (k·(k+s)·n) ≥ M.
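The inequality can also be checked numerically. The sketch below (an illustration, not the formal proof) plugs β = M/(n(k+s)) into the cut-capacity expression above and enumerates every (t1, t2, t3, t4) summing to k+s for the (4, 2, 2) example:

```python
# Verify that every cut has capacity at least M for (n, k, s) = (4, 2, 2)
# when beta sits at the Lemma 1 lower bound.
from itertools import product

n, k, s, M = 4, 2, 2, 1.0
beta = M / (n * (k + s))

for t1, t2, t3, t4 in product(range(k + s + 1), repeat=4):
    if t1 + t2 + t3 + t4 != k + s:
        continue
    cap = t1 * M / k + (t2 + t3) * M / (k + s) + t4 * (n - t1) * beta
    assert cap >= M - 1e-9, (t1, t2, t3, t4, cap)
print("every cut has capacity >= M")
```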

NCScale
NCScale: a distributed storage system prototype that operates on systematic Reed-Solomon codes
- Preserves the systematic and MDS properties after scaling
- Preserves the uniform data and parity distribution
- Supports decentralized scaling
Constraints: if n−k = 1, any s > 0 gives optimal scaling; otherwise, s < n/(n−k−1) gives near-optimal scaling (see the check below)
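A small helper (illustrative only, not code from the prototype) that restates the constraint above:

```python
# Classify a requested (n, k, s)-scaling according to the constraints above.
def scaling_mode(n, k, s):
    if s <= 0:
        return "invalid"
    if n - k == 1:
        return "optimal"                 # any s > 0 is optimal for single-parity codes
    return "near-optimal" if s < n / (n - k - 1) else "unsupported"

print(scaling_mode(9, 8, 3))   # optimal: n - k = 1
print(scaling_mode(9, 6, 2))   # near-optimal: s = 2 < 9 / 2
```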

NCScale
Prepare step: identify the set of blocks involved in scaling operations, organized into two groups of stripes
- PG: a group of stripes in which parity blocks will be updated
- DG: a group of stripes whose data blocks are used to update the parity blocks in PG

NCScale
Compute step: update the parity blocks in PG using the data blocks in DG

NCScale
Send step: send the locally updated parity blocks and data blocks to the new nodes
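Putting the compute and send steps together in the GF(257) toy setting from earlier (the block grouping and coefficients here are illustrative assumptions; the real NCScale follows the prepare step and works over GF(2^8)): an existing node folds data blocks from DG into its local parity and then ships a single encoded block to a new node, instead of first migrating raw data blocks and separately sending blocks for a parity update.

```python
# Toy illustration of the compute and send steps over GF(257).
P = 257

def compute_and_send(old_parity, dg_blocks, coeffs, new_node_blocks):
    # Compute step: fold the DG data blocks into the local parity block.
    delta = sum(c * d for c, d in zip(coeffs, dg_blocks)) % P
    new_parity = (old_parity + delta) % P
    # Send step: a single encoded block crosses the network to the new node.
    new_node_blocks.append(new_parity)
    return new_parity

new_node = []
# Old parity 95 from the (4, 2) sketch, two DG data blocks, coefficients 2^2 and 2^3.
compute_and_send(95, dg_blocks=[7, 99], coeffs=[4, 8], new_node_blocks=new_node)
print(new_node)   # [144]
```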

Implementation
NCScale prototype:
- Intel ISA-L for the erasure coding implementation
- Both Scale-RS and network-coding-based scaling are implemented for comparison
Setup: Amazon EC2, with up to 14 m4.4xlarge instances

Amazon EC2 Experiments
Impact of the bandwidth between VM instances (64 MB block size)
Finding: empirical results are consistent with numerical results

Amazon EC2 Experiments
Impact of s (the number of new nodes), with 1 Gb/s bandwidth, 64 MB block size, and (n, k) = (9, 6)
Findings:
- As s increases, both Scale-RS and NCScale send more blocks, and the gap between them decreases
- NCScale still reduces the scaling time of Scale-RS by 11.5-24.6%

Conclusions
- Study how to apply network coding to storage scaling from both theoretical and applied perspectives
- Formally prove the information-theoretically minimum scaling bandwidth
- Build the NCScale prototype to realize network-coding-based scaling
- Conduct extensive experiments on Amazon EC2; empirical results match the theory