A Cost-based Heterogeneous Recovery Scheme for Distributed Storage Systems with RAID-6 Codes Yunfeng Zhu 1, Patrick P. C. Lee 2, Liping Xiang 1, Yinlong.

Presentation transcript:

A Cost-based Heterogeneous Recovery Scheme for Distributed Storage Systems with RAID-6 Codes
Yunfeng Zhu (1), Patrick P. C. Lee (2), Liping Xiang (1), Yinlong Xu (1), Lingling Gao (1)
(1) University of Science and Technology of China
(2) The Chinese University of Hong Kong
DSN'12

Fault Tolerance
- Fault tolerance becomes more challenging in modern distributed storage systems
  - Increase in scale
  - Usage of inexpensive but less reliable storage nodes
- Fault tolerance is ensured by introducing redundancy across storage nodes
  - Replication
  - Erasure codes (e.g., Reed-Solomon codes)
[Figure: blocks A and B stored with replication (copies of A and B) and with an erasure code (A, B, A+B, A+2B)]

XOR-Based Erasure Codes
- Encoding and decoding involve XOR operations only
  - Low computational overhead
- Different redundancy levels
  - 2-fault tolerant: RDP, EVENODD, X-Code
  - 3-fault tolerant: STAR
  - General-fault tolerant: Cauchy Reed-Solomon (CRS)

Failure Recovery
- Recovering node failures is necessary
  - Preserve the required redundancy level
  - Avoid data unavailability
- Single-node failure recovery
  - Single-node failures occur more frequently than concurrent multi-node failures

Example: Recovery in RDP
- An RDP code example with 8 nodes
[Figure: 6 x 8 array of symbols d_{i,j} (rows i = 0..5, nodes j = 0..7); nodes 0-5 hold data symbols, and nodes 6 and 7 hold the parity symbols computed by XOR (⊕)]
Let's say node 0 fails. How do we recover node 0?

Conventional Recovery
- Idea: use only row parity sets. Recover each lost data symbol (i.e., data chunk) independently
[Figure: symbols read across nodes 0-7 under conventional recovery]
Read symbols: 36
Then how do we recover node 0 efficiently? Different metrics can be used to measure the efficiency of a recovery scheme.
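The conventional scheme above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the 16-byte symbol size and all names are assumptions, and only the row parity node is modelled because conventional recovery never touches the diagonal parity.

```python
import os
from functools import reduce

P = 7              # RDP parameter: 8 nodes, as in the example above
ROWS = P - 1       # symbols stored per node
DATA_NODES = P - 1 # nodes 0..5 hold data; node 6 holds row parity
SYMBOL_BYTES = 16  # hypothetical, tiny symbol size for illustration


def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length symbols."""
    return bytes(x ^ y for x, y in zip(a, b))


# stripe[i][j] is symbol d_{i,j}; data symbols are random bytes.
stripe = [[os.urandom(SYMBOL_BYTES) for _ in range(DATA_NODES)]
          for _ in range(ROWS)]
for row in stripe:
    row.append(reduce(xor, row))  # row parity symbol on node 6


def recover_conventional(stripe, failed):
    """Rebuild every symbol of the failed node from its row parity set."""
    recovered, reads = [], 0
    for row in stripe:
        survivors = [sym for j, sym in enumerate(row) if j != failed]
        reads += len(survivors)            # every surviving symbol in the row is read
        recovered.append(reduce(xor, survivors))
    return recovered, reads


original = [row[0] for row in stripe]
recovered, reads = recover_conventional(stripe, failed=0)
print(recovered == original, reads)  # True 36
```

Each of the 6 lost symbols needs the 6 surviving symbols of its row, which gives the 36 reads quoted on the slide.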

Minimize Number of Read Symbols
- Idea: use a combination of row and diagonal parity sets to maximize overlapping symbols [Xiang, ToS'11]
[Figure: symbols read across nodes 0-7 under the combined row/diagonal recovery]
Read symbols: 27
Improvement rate: 25%

Need A New Metric?
- A modern storage system is naturally composed of heterogeneous types of storage nodes
  - System upgrades
  - New node addition
- A heterogeneous environment
[Figure: a proxy connected to nodes 0-7 and a new node; link bandwidths shown: 26Mbps, 68Mbps, 109Mbps, 110Mbps, 113Mbps, 10Mbps, 110Mbps, 86Mbps]
Need a new efficient failure recovery solution for heterogeneous environments!

Related Work
- Hybrid recovery
  - Minimize number of read symbols
  - RAID-6 XOR-based erasure codes, e.g., RDP [Xiang, ToS'11], EVENODD [Wang, Globecom'10]
- Enumeration recovery [Khan, FAST'12]
  - Enumerate all recovery possibilities to achieve optimal recovery for general XOR-based erasure codes
- Greedy recovery [Zhu, MSST'12]
  - Efficient search of recovery solutions for general XOR-based erasure codes
- Regenerating codes [Dimakis, ToIT'10]
  - Nodes encode data during recovery
  - Minimize recovery bandwidth
  - Heterogeneous case considered in [Li, Infocom'10], but requires node encoding and collaboration

Challenges
- How to enable efficient failure recovery for heterogeneous settings?
  - Minimizing # of read symbols → homogeneous settings
  - Performance bottlenecked by poorly performing nodes
- How to quickly find the recovery strategy?
  - Minimizing # of read symbols → deterministic metric
  - Minimizing general cost → non-deterministic metric
- Recovery decision typically can't be pre-determined

Our Contributions
- Target two RAID-6 codes: RDP and EVENODD
  - XOR-based encoding operations
- Goals:
  - Minimize search time
  - Minimize recovery cost
Cost-based single-node failure recovery for heterogeneous distributed storage systems

Our Contributions
- Formulate an optimization problem for single-node failure recovery in heterogeneous settings
- Propose a cost-based heterogeneous recovery (CHR) algorithm
  - Narrow down search space
  - Suitable for online recovery
- Implement and experiment on a heterogeneous networked storage testbed

Model Formulation
- Our formulation:
  Node:                   V_0 (Node 0), V_1 (Node 1), ..., V_k (Node k), ..., V_{p-1} (Node p-1), V_p (Node p)
  Weight:                 w_0, w_1, ..., w_{p-1}, w_p
  Download distribution:  y_0, y_1, ..., y_{p-1}, y_p
- Goal: minimize the total recovery cost

Physical Meanings
- w_i = 1 for all i: C is the total number of symbols read from surviving nodes
- w_i = inverse of the transmission bandwidth of node V_i: C is the total transmission time to download symbols from surviving nodes
- w_i = monetary cost of migrating one unit of data outbound from node V_i: C is the total monetary cost of migrating data from surviving nodes (or clouds)
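All three interpretations are consistent with a weighted-sum objective. As a sketch, assuming V_k is the failed node, w_i is the weight of surviving node V_i, and y_i is the number of symbols downloaded from it (the feasible download distributions are constrained by the parity sets of the code and are not written out here):

```latex
\min_{(y_0, \dots, y_p)} \; C \;=\; \sum_{\substack{i = 0 \\ i \neq k}}^{p} w_i \, y_i
```

With w_i = 1 for every i, C is simply the number of symbols read, so minimizing the number of read symbols is the special case for homogeneous nodes.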

Solving the Model
- Important: which symbols are fetched from surviving nodes must follow the inherent rules of the specific coding scheme
- To solve the model, we introduce the recovery sequence (x_0, x_1, …, x_{p-2}, 0)
  - x_i = 0: d_{i,k} is recovered from its row parity set
  - x_i = 1: d_{i,k} is recovered from its diagonal parity set
- An example:
  - recovery sequence: (0, 0, 1, 1, 0)
  - download distribution: (3, 2, 2, 3, 2)
[Figure: 4 x 6 array of symbols d_{i,j} (rows i = 0..3, nodes j = 0..5)]
1) Each recovery sequence represents a feasible recovery solution; 2) the download distribution can be represented by the recovery sequence.
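The mapping from a recovery sequence to a download distribution can be reproduced with a short sketch. It assumes the standard RDP layout (nodes 0..p-2 hold data, node p-1 holds row parity, node p holds diagonal parity, and symbol d_{r,j} lies on diagonal (r + j) mod p, with diagonal p-1 having no parity); the function name is hypothetical.

```python
def download_distribution(p, k, seq):
    """Symbols read from each surviving node when data node k fails.

    seq has p entries with a fixed trailing 0, as on the slide:
    seq[i] = 0 recovers d_{i,k} from its row parity set,
    seq[i] = 1 recovers it from its diagonal parity set.
    """
    if len(seq) != p or seq[-1] != 0:
        raise ValueError("expected p entries ending in 0")
    needed = set()                          # distinct symbols (row, node) to read
    for i, x in enumerate(seq[:-1]):
        if x == 0:                          # row i on nodes 0..p-1, except node k
            needed.update((i, j) for j in range(p) if j != k)
        else:
            d = (i + k) % p                 # diagonal containing d_{i,k}
            if d == p - 1:
                raise ValueError("this symbol has no diagonal parity set")
            for j in range(p):
                r = (d - j) % p
                if j != k and r != p - 1:   # row p-1 does not exist
                    needed.add((r, j))
            needed.add((d, p))              # the diagonal parity symbol itself
    return [sum(1 for _, j in needed if j == node)
            for node in range(p + 1) if node != k]


print(download_distribution(5, 0, (0, 0, 1, 1, 0)))             # [3, 2, 2, 3, 2]
print(sum(download_distribution(7, 0, (0, 0, 0, 0, 0, 0, 0))))  # 36
print(sum(download_distribution(7, 0, (0, 0, 0, 1, 1, 1, 0))))  # 27
```

The first call reproduces the download distribution of the example; the other two reproduce the 36 and 27 read symbols of the conventional and min-read recoveries for the 8-node case.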

Solving the Model (2)
- Step 1: use the recovery sequence to represent downloads
- Step 2: narrow down the search space by only considering min-read recovery sequences (i.e., those that download the minimum number of read symbols during recovery)
- Step 3: reformulate the model as minimizing the total recovery cost over the min-read recovery sequences

Expensive Enumeration
[Table: for each p, the total # of recovery sequences, the # of min-read recovery sequences, and the # of unique min-read recovery sequences]
Challenge: there are too many min-read recovery sequences to enumerate, even after we narrow down the search space
Observation: many min-read recovery sequences return the same download distribution

Optimize Enumeration Process
- Two conditions under which different recovery sequences have the same download distribution:
  - Shift condition: (0, 0, 0, 1, 1, 1, 0) → (0, 0, 1, 1, 1, 0, 0) → (0, 1, 1, 1, 0, 0, 0) → (1, 1, 1, 0, 0, 0, 0) → …
  - Reverse condition: (0, 0, 0, 1, 1, 1, 0) → (0, 1, 1, 1, 0, 0, 0)
Key idea: not all recovery sequences need to be enumerated (details in the paper)
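The two conditions can be turned into a small helper that lists the sequences equivalent to a given one. This is a sketch under two assumptions: a shift is a cyclic rotation that keeps the fixed trailing entry at 0, and a reversed sequence is only used when it also ends in 0. Skipping a candidate is always safe, since it only means that sequence is evaluated separately. The function names are hypothetical.

```python
def to_index(seq):
    """Integer value of a recovery sequence, ignoring the fixed trailing 0."""
    return int("".join(str(bit) for bit in seq[:-1]), 2)


def equivalents(seq):
    """Other sequences with the same download distribution as seq,
    obtained by the shift and reverse conditions."""
    seq = tuple(seq)
    candidates = {seq[s:] + seq[:s] for s in range(1, len(seq))}  # cyclic shifts
    candidates.add(seq[::-1])                                     # reversal
    # Keep only well-formed sequences (trailing 0) that differ from seq.
    return {c for c in candidates if c[-1] == 0 and c != seq}


r = (0, 0, 0, 1, 1, 1, 0)
print(to_index(r))                                  # 7
print(sorted(to_index(s) for s in equivalents(r)))  # [14, 28, 56]
```

For this sequence the reversal (0, 1, 1, 1, 0, 0, 0) coincides with one of the shifts, so one cost evaluation covers four of the min-read sequences.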

Cost-based Heterogeneous Recovery (CHR) Algorithm: Intuition
- Step 1: initialize a bitmap to track all possible min-read recovery sequences R
- Step 2: compute the recovery cost of R
- Step 3: mark all shifted and reversed sequences of R as enumerated
- Step 4: switch to another R; return the one with minimum cost

Example
[Figure: a proxy connected to nodes 0-7 and a new node (link bandwidths shown: 26Mbps, 68Mbps, 109Mbps, 110Mbps, 113Mbps, 10Mbps, 110Mbps, 86Mbps), comparing our proposed CHR algorithm with the hybrid approach [Xiang, ToS'11]]

Recovery Cost Comparison
- CHR approach
- Hybrid approach
- Conventional approach
[Figure annotations: "reduce by 25.89%" and "reduce by 40.91%"]

Simulation Studies (1): Traverse Efficiency
- Evaluate the computational time of CHR
[Table: for each p, the naive traverse time (ms), CHR's traverse time (ms), and the improvement rate (%)]
CHR reduces the traverse time of the naive approach by over 90% as p increases!

Simulation Studies (2): Robustness Efficiency
- Evaluate whether CHR achieves the global optimum among all feasible recovery sequences
[Table: for each p, the probability of hitting the global optimum (%) and the maximum improvement of the global optimum (%)]
CHR has a very high probability (over 93%) of hitting the globally optimal recovery cost!

Simulation Studies (3): Recovery Efficiency
- Evaluate, via 100 runs for each p, the recovery efficiency of CHR in a heterogeneous storage environment
CHR can reduce recovery cost by up to 50% over the conventional approach
CHR can reduce recovery cost by up to 30% over the hybrid approach

Experiments
- Experiments on a networked storage testbed
  - Conventional vs. Hybrid vs. CHR
  - Default chunk size = 1MB
  - Communication via ATA over Ethernet (AoE)
- Consider two codes: RDP and EVENODD
  - Only RDP results shown in this talk
- Recovery operation:
  - Read chunks from surviving nodes
  - Reconstruct lost chunks
  - Write reconstructed chunks to a new node
[Figure: the recovery process and the storage nodes connected through a Gigabit switch]

Experiments
- The physical storage devices are equipped with two types of Ethernet interface cards
  - 100Mbps → set weight = 1/(100Mbps)
  - 1Gbps → set weight = 1/(1Gbps)
[Table: configuration for the RDP code: for each p, the total # of nodes, the # of nodes with 100Mbps, and the # of nodes with 1Gbps]
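With weights set to the inverse of the link bandwidth, the recovery cost becomes a transfer-time estimate. The arithmetic below is a hedged illustration only: the assignment of 100Mbps and 1Gbps links to nodes is made up, the download distribution is borrowed from the earlier 6-node example, and 1MB is taken as 8 megabits.

```python
CHUNK_MEGABITS = 8.0  # one 1MB chunk, taking 1MB = 8 megabits (assumption)


def recovery_cost(distribution, bandwidths_mbps):
    """Weighted cost sum(w_i * y_i) with w_i = 1 / bandwidth of node i."""
    return sum(y / bw for y, bw in zip(distribution, bandwidths_mbps))


# Hypothetical setup: five surviving nodes, two of them on 100Mbps links.
distribution = [3, 2, 2, 3, 2]           # symbols read per surviving node
bandwidths = [100, 1000, 1000, 100, 1000]

cost = recovery_cost(distribution, bandwidths)
print(round(cost, 3))                    # 0.066
print(round(cost * CHUNK_MEGABITS, 3))   # 0.528 (seconds, if transfers are serialized)
```

The two slow nodes account for 0.06 of the 0.066 total, which is why a sequence that reads fewer symbols from slow nodes can beat another sequence with the same number of reads.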

Different Number of Storage Nodes
- Total recovery time for RDP
CHR improves on conventional by 21-31%
CHR improves on hybrid by 15-20%

Different Chunk Size
- Total recovery time for RDP (p = 11)
CHR improves on conventional by 18-26%
CHR improves on hybrid by 14-19%

Different Failed Nodes
- Total recovery time for RDP (p = 11)
CHR still outperforms conventional and hybrid

Conclusions
- Address single-node failure recovery in RAID-6 coded heterogeneous storage systems
- Formulate a computation-efficient optimization model
- Propose a cost-based heterogeneous recovery algorithm
- Validate the effectiveness of the CHR algorithm through extensive simulations and testbed experiments
- Future work:
  - Different cost formulations
  - Extension for general XOR-based erasure codes
  - Degraded reads
- Source code:

Backup

Cost-based Heterogeneous Recovery (CHR) Algorithm
Notation:
- F: a bitmap that identifies whether a min-read recovery sequence has been enumerated
- R, C: a min-read recovery sequence and its recovery cost
- R*, C*: the min-cost recovery sequence and its (minimum) total recovery cost
Algorithm:
1. Initialize F[0…2^{p-1}-1] with 0-bits; initialize R with 1-bits followed by 0-bits; initialize R* with R; initialize C* with MAX_VALUE
2. If R is null, go to Step 4. Convert R into its integer value v; if R has already been enumerated, go to Step 3. Otherwise, mark all the shifted and reversed recovery sequences of R as enumerated; calculate the recovery cost C of R; update R* and C* if necessary
3. Get the next min-read recovery sequence R and go to Step 2
4. Finally, initialize R with all 0-bits; calculate the recovery cost C of R; update R* and C* if necessary
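The steps above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: it covers RDP only, assumes node 0 is the failed node, takes the min-read sequences to be those with (p-1)/2 ones among the first p-1 entries, and visits them in a different order than the slides. All function names and the bandwidth values are hypothetical.

```python
from itertools import combinations


def download_distribution(p, seq):
    """Symbols read from nodes 1..p when node 0 of an RDP stripe fails."""
    needed = set()
    for i, x in enumerate(seq[:-1]):            # trailing entry is fixed to 0
        if x == 0:                              # row parity set of d_{i,0}
            needed.update((i, j) for j in range(1, p))
        else:                                   # diagonal parity set (diagonal i)
            needed.update(((i - j) % p, j) for j in range(1, p)
                          if (i - j) % p != p - 1)
            needed.add((i, p))                  # diagonal parity symbol on node p
    return [sum(1 for _, j in needed if j == node) for node in range(1, p + 1)]


def recovery_cost(p, seq, weights):
    """weights[n-1] is the weight w_n of surviving node n = 1..p."""
    return sum(w * y for w, y in zip(weights, download_distribution(p, seq)))


def index(seq):
    """Integer value v of a sequence, ignoring the trailing 0."""
    return int("".join(map(str, seq[:-1])), 2)


def chr_search(p, weights):
    seen = [False] * 2 ** (p - 1)               # bitmap F
    best_seq, best_cost = None, float("inf")    # R*, C*
    for ones in combinations(range(p - 1), (p - 1) // 2):
        seq = tuple(int(i in ones) for i in range(p - 1)) + (0,)
        if seen[index(seq)]:
            continue                            # an equivalent sequence was costed
        # Mark seq, its valid cyclic shifts, and its reversal as enumerated.
        variants = [seq[s:] + seq[:s] for s in range(p)] + [seq[::-1]]
        for v in variants:
            if v[-1] == 0:
                seen[index(v)] = True
        cost = recovery_cost(p, seq, weights)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    conventional = (0,) * p                     # Step 4: all-row recovery
    cost = recovery_cost(p, conventional, weights)
    if cost < best_cost:
        best_seq, best_cost = conventional, cost
    return best_seq, best_cost


# Hypothetical bandwidths (Mbps) for surviving nodes 1..7; weight = 1/bandwidth.
weights = [1 / bw for bw in (68, 109, 110, 113, 10, 110, 86)]
print(chr_search(7, weights))
print(chr_search(7, [1] * 7)[1])  # 27: with unit weights the cost is the read count
```

The conventional all-zero sequence is checked last because it is not a min-read sequence, yet it can still be the cheapest choice when the diagonal parity node is very slow.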

Example
[Figure: a proxy connected to nodes 0-7 and a new node; link bandwidths shown: 26Mbps, 68Mbps, 109Mbps, 110Mbps, 113Mbps, 10Mbps, 110Mbps, 86Mbps]
- Step 1: Initialize F[0..63] with 0-bits, R = { }, the recovery cost C = MAX_VALUE
- Step 2: F[7]=1; mark R's shifted and reversed recovery sequences: F[56]=F[28]=F[14]=1; calculate the recovery cost for R, C will be α; R*, C* will be updated by R, C
- Step 3: Get the next min-read recovery sequence R and go to Step 2
- Step 4: Finally, we find that R* = { } and C* = α


Different Number of Storage Nodes
- Consider the overall performance of the complete recovery operation for EVENODD

Different Chunk Size
- Evaluate the impact of chunk size on the recovery time performance for EVENODD

Different Failed Nodes
- Evaluate the recovery time performance for EVENODD when the failed node is in a different column