On the Speedup of Single-Disk Failure Recovery in XOR-Coded Storage Systems: Theory and Practice. Yunfeng Zhu 1, Patrick P. C. Lee 2, Yuchong Hu 2, Liping Xiang 1, Yinlong Xu 1.

Presentation transcript:

On the Speedup of Single-Disk Failure Recovery in XOR-Coded Storage Systems: Theory and Practice
Yunfeng Zhu 1, Patrick P. C. Lee 2, Yuchong Hu 2, Liping Xiang 1, Yinlong Xu 1
1 University of Science and Technology of China
2 The Chinese University of Hong Kong
MSST '12

Modern Storage Systems
Large-scale storage systems have seen deployment in practice:
- Cloud storage
- Data centers
- P2P storage
Data is distributed over a collection of disks, where a "disk" is any physical storage device.

How to Ensure Data Reliability?
Disks can crash or hold corrupted data. Data reliability is achieved by keeping data redundancy across disks:
- Replication: efficient computation, but high storage overhead
- Erasure codes (e.g., Reed-Solomon codes): less storage overhead than replication with the same fault tolerance, but more expensive computation

XOR-Based Erasure Codes
- Encoding/decoding involve XOR operations only, so computational overhead is low
- Different redundancy levels:
  - 2-fault tolerant: RDP, EVENODD, X-Code
  - 3-fault tolerant: STAR
  - General-fault tolerant: Cauchy Reed-Solomon (CRS)

Example
EVENODD with number of disks = 4 (+ denotes the XOR operation):

  Disk0  Disk1  Disk2  Disk3
  a      c      a+c    b+c
  b      d      b+d    a+b+d

If Disk0 fails, its symbols a and b are recovered from the row parity sets:
  a = c + (a+c)
  b = d + (b+d)
A runnable sketch of this recovery follows.
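The following is a minimal Python sketch of the row-parity recovery above. The symbol values are arbitrary example integers, and the slide's + becomes ^ (bitwise XOR) here.

```python
def xor(*symbols):
    """XOR an arbitrary number of symbols together."""
    out = 0
    for s in symbols:
        out ^= s
    return out

# Data and parity symbols on the four disks (example values).
a, b = 0b1010, 0b0110             # Disk0 (about to fail)
c, d = 0b1100, 0b0011             # Disk1
p0, p1 = xor(a, c), xor(b, d)     # Disk2: row parities a+c, b+d
q0, q1 = xor(b, c), xor(a, b, d)  # Disk3: diagonal parities b+c, a+b+d

# Disk0 fails: rebuild a and b from the surviving row parity sets.
a_rec = xor(c, p0)  # a = c + (a+c)
b_rec = xor(d, p1)  # b = d + (b+d)
assert (a_rec, b_rec) == (a, b)
```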

Failure Recovery Problem
Recovering disk failures is necessary to:
- Preserve the required redundancy level
- Avoid data unavailability
Single-disk failure recovery: a single-disk failure occurs more frequently than a concurrent multi-disk failure. One objective of efficient single-disk failure recovery is to minimize the amount of data being read from the surviving disks.

Related Work
- Hybrid recovery: minimize the amount of data being read for double-fault tolerant XOR-based erasure codes, e.g., RDP [Xiang, ToS'11], EVENODD [Wang, Globecom'10], X-Code [Xu, Tech Report '11]
- Enumeration recovery [Khan, FAST'12]: enumerate all recovery possibilities to achieve optimal recovery for general XOR-based erasure codes
- Regenerating codes [Dimakis, ToIT'10]: disks encode data during recovery to minimize recovery bandwidth

Example: Recovery in RDP
RDP with 8 disks. Each stripe is a 6x8 array of symbols d_{i,j} (row i = 0..5, disk j = 0..7): Disk0-Disk5 hold data, Disk6 holds the row parities, and Disk7 holds the diagonal parities.
[Figure: the 6x8 symbol array d_{0,0} ... d_{5,7}]
Let's say Disk0 fails. How do we recover Disk0?

Conventional Recovery
Idea: use only row parity sets, recovering each lost data symbol independently. Each of the six lost symbols is rebuilt from the six surviving symbols in its row (five data symbols plus the row parity).
Total number of read symbols: 36

Hybrid Recovery [Xiang, ToS'11]
Idea: use a combination of row and diagonal parity sets to maximize overlapping symbols; a symbol read for one parity set can be reused by another.
Total number of read symbols: 27

Enumeration Recovery [Khan, FAST'12]
Example: four data symbols D0-D3 are encoded by a generator matrix into the codeword (D0, D1, D2, D3, C0, C1, C2, C3), stored as Disk0 = {D0, D1}, Disk1 = {D2, D3}, Disk2 = {C0, C1}, Disk3 = {C2, C3}. Suppose Disk0 fails.
Conventional recovery downloads 4 symbols (D2, D3, C0, C1) to recover D0 and D1.
Recovery equations for D0: {D2, C0}, {D3, C2}, {D3, C0, C1, C3}, {D2, C1, C2, C3}
Recovery equations for D1: {D3, C1}, {D2, C0, C1, C2}, {D2, C3}, {D3, C0, C2, C3}
Enumeration picks one equation per lost symbol so that the union of symbols read is smallest, e.g., D0 = D2 + C0 and D1 = D2 + C3.
Total read symbols: 3
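A toy Python illustration of the enumeration idea, hard-coding the recovery equations listed on this slide (they correspond to the parities C0 = D0+D2, C1 = D1+D3, C2 = D0+D3, C3 = D1+D2). The exhaustive search over equation combinations is what makes enumeration expensive in general, as quantified later.

```python
from itertools import product

# Recovery equations from the slide: each set lists the surviving symbols
# whose XOR reconstructs the lost symbol.
eqs_D0 = [{"D2", "C0"}, {"D3", "C2"},
          {"D3", "C0", "C1", "C3"}, {"D2", "C1", "C2", "C3"}]
eqs_D1 = [{"D3", "C1"}, {"D2", "C0", "C1", "C2"},
          {"D2", "C3"}, {"D3", "C0", "C2", "C3"}]

# Enumeration recovery: try every combination of one equation per lost
# symbol and keep the one whose union of read symbols is smallest.
best = min(product(eqs_D0, eqs_D1), key=lambda pair: len(pair[0] | pair[1]))
print(best, "-> reads", len(best[0] | best[1]), "symbols")  # 3 symbols
```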

Challenges
- Hybrid recovery cannot be easily generalized to STAR and CRS codes, due to their different data layouts
- Enumeration recovery has exponential computational overhead
Can we develop an efficient scheme for single-disk failure recovery?

Our Work
Speed up single-disk failure recovery for XOR-based erasure codes, in three aspects:
- Minimize the search time for returning a recovery solution
- Minimize the I/Os for recovery (and hence the recovery time)
- Extend to parallelized recovery using multi-core technologies
Applications: when no pre-computed recovery solutions are available, or in online recovery

Our Work
- Design a replace recovery algorithm: a hill-climbing approach that incrementally replaces a feasible recovery solution with one requiring fewer disk reads
- Implement and experiment on a networked storage testbed
- Show recovery time reduction in both single-threaded and parallelized implementations

Key Observation
[Figure: an array of n disks = k data disks + m parity disks, each storing a strip of ω symbols per stripe]
When a data disk fails, a strip of ω data symbols is lost per stripe. There likely exists an optimal recovery solution that uses exactly ω parity symbols!

Simplified Recovery Model
To recover a failed disk, choose a collection of parity symbols (per stripe) such that:
- The collection has ω parity symbols
- The collection can correctly resolve the ω lost data symbols
- The total number of data symbols encoded in the ω parity symbols is minimum, which minimizes disk reads

Replace Recovery Algorithm
Notation:
- P_i: set of parity symbols in the i-th (1 ≤ i ≤ m) parity disk
- X: collection of ω parity symbols used for recovery
- Y: collection of parity symbols considered for inclusion in X
Target: reduce the number of read symbols.
Algorithm (a sketch in code follows):
1. Initialize X with the ω parity symbols of P_1
2. Set Y to be the collection of parity symbols in P_2; replace some parity symbols in X with the same number of symbols from Y, such that X remains valid to resolve the ω lost data symbols and the number of read symbols decreases
3. Repeat Step 2, resetting Y to P_3, ..., P_m in turn
4. Output the resulting X and the corresponding encoding data symbols
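Below is a minimal Python sketch of replace recovery under the simplified model above. It assumes each parity symbol is described by the set of data symbols it encodes (its generator-matrix row); all names are illustrative, not taken from the authors' implementation. Validity of X is a rank check over GF(2): the equations of X, restricted to the lost symbols, must have full rank ω.

```python
def rank_gf2(rows):
    """Rank over GF(2) of a matrix given as a list of int bitmasks."""
    rows = [r for r in rows if r]
    rank = 0
    while rows:
        pivot = rows.pop()
        rank += 1
        low = pivot & -pivot  # eliminate using the pivot's lowest set bit
        rows = [r ^ pivot if r & low else r for r in rows]
        rows = [r for r in rows if r]
    return rank

def is_valid(X, lost, parity_map):
    """X resolves the lost symbols iff its equations, restricted to the
    lost data symbols, have full rank over GF(2)."""
    masks = [sum(1 << lost.index(d) for d in parity_map[p] if d in lost)
             for p in X]
    return rank_gf2(masks) == len(lost)

def read_cost(X, lost, parity_map):
    """Symbols read: the chosen parities plus every surviving data symbol
    encoded by any of them."""
    data = set().union(*(parity_map[p] for p in X)) - set(lost)
    return len(X) + len(data)

def replace_recovery(parity_disks, lost, parity_map):
    """Hill-climbing: seed X with the first parity disk's symbols, then for
    each later parity disk keep applying single swaps that leave X valid
    and strictly reduce the read cost."""
    X = set(parity_disks[0])
    for Y in parity_disks[1:]:
        improved = True
        while improved:
            improved = False
            for p_out in list(X):
                for p_in in (p for p in Y if p not in X):
                    cand = (X - {p_out}) | {p_in}
                    if (is_valid(cand, lost, parity_map) and
                            read_cost(cand, lost, parity_map) <
                            read_cost(X, lost, parity_map)):
                        X, improved = cand, True
                        break
                if improved:
                    break
    return X

# Demo on the slides' example code: C0=D0+D2, C1=D1+D3, C2=D0+D3, C3=D1+D2.
parity_map = {"C0": {"D0", "D2"}, "C1": {"D1", "D3"},
              "C2": {"D0", "D3"}, "C3": {"D1", "D2"}}
X = replace_recovery([["C0", "C1"], ["C2", "C3"]], ["D0", "D1"], parity_map)
print(sorted(X), read_cost(X, ["D0", "D1"], parity_map))  # ['C1', 'C2'] 3
```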

Example
Same code as on the enumeration recovery slide (Disk0 = {D0, D1} fails; the parity disks hold {C0, C1} and {C2, C3}):
- Step 1: Initialize X = {C0, C1}. The number of read symbols of X is 4.
- Step 2: Consider Y = {C2, C3}. C2 can replace C0 (X remains valid), reducing the number of read symbols to 3.
- Step 3: Replace C0 with C2, so X = {C2, C1}. Note that this is an optimal solution.

Algorithmic Extensions
Replace recovery has polynomial complexity. The extensions below increase the search space while maintaining polynomial complexity (a sketch of the first follows the list):
- Multiple rounds: use different parity disks for initialization
- Successive searches: after considering P_i, reconsider the previously considered i-2 parity symbol collections (univariate search)
- Can be extended for a general I/O recovery cost
Details in the paper.
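One way to read the "multiple rounds" extension, reusing replace_recovery and read_cost from the sketch above: rerun the search once per choice of the initializing parity disk and keep the cheapest valid result. This rotation-based interpretation is an assumption for illustration, not the paper's exact procedure.

```python
def replace_recovery_multiround(parity_disks, lost, parity_map):
    """Run replace recovery once per initialization disk; keep the best X."""
    best = None
    for r in range(len(parity_disks)):
        order = parity_disks[r:] + parity_disks[:r]  # rotate which P_i seeds X
        X = replace_recovery(order, lost, parity_map)
        if best is None or (read_cost(X, lost, parity_map)
                            < read_cost(best, lost, parity_map)):
            best = X
    return best
```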

Evaluation: Recovery Performance
Recovery performance for STAR:
[Figure: recovery performance for STAR, comparing conventional recovery, replace recovery, and the lower bound]
Replace recovery is close to the lower bound.

Evaluation: Recovery Performance
Recovery performance for CRS:
[Figures: recovery performance for CRS with m = 3, ω = 4 and with m = 3, ω = 5]
Replace recovery is close to optimal (< 3.5% difference).

Evaluation: Search Performance
Enumeration recovery has a huge search space: the maximum number of recovery equations being enumerated is 2^(mω).
Search performance for CRS (Intel 3.2GHz CPU, 2GB RAM):

  (k, m, ω)   | Time (Enumeration) | Time (Replace)
  (10, 3, 5)  | 6m 32s             | 0.08s
  (12, 4, 4)  | 17m 17s            | 0.09s
  (10, 3, 6)  | 18h 15m 17s        | 0.24s
  (12, 4, 5)  | 13d 18h 6m 43s     | 0.30s

Replace recovery uses significantly less search time than enumeration recovery.

Design and Implementation
Recovery thread:
- Reads data from the surviving disks
- Reconstructs the lost data of the failed disk
- Writes the reconstructed data to a new disk
Parallel recovery architecture:
- Stripe-oriented recovery: each recovery thread recovers the data of one stripe (see the sketch below)
- Multi-thread, multi-server
Details in the paper.
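A minimal sketch of stripe-oriented parallel recovery, under stated assumptions: storage is an in-memory stand-in for disk I/O (a real deployment would issue network reads/writes such as AoE), and read_strip, write_strip, and solution are hypothetical names, with solution mapping each lost symbol to the (disk, symbol) pairs whose XOR rebuilds it.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# In-memory stand-in for disk I/O so the sketch is self-contained.
storage = {}  # (disk, stripe_id, symbol) -> bytes

def read_strip(disk, stripe_id, symbol):
    return storage[(disk, stripe_id, symbol)]

def write_strip(disk, stripe_id, symbol, data):
    storage[(disk, stripe_id, symbol)] = data

def bytes_xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def recover_stripe(stripe_id, new_disk, solution):
    # 'solution' maps each lost symbol to the (disk, symbol) pairs whose
    # XOR reconstructs it (e.g., derived from the replace recovery output).
    for lost_symbol, sources in solution.items():
        chunks = [read_strip(disk, stripe_id, sym) for disk, sym in sources]
        write_strip(new_disk, stripe_id, lost_symbol, reduce(bytes_xor, chunks))

def parallel_recover(num_stripes, new_disk, solution, threads=4):
    # Stripe-oriented recovery: each submitted task recovers one stripe.
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(recover_stripe, s, new_disk, solution)
                   for s in range(num_stripes)]
        for f in futures:
            f.result()  # propagate any worker exception
```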

Experiments
Experiments on a networked storage testbed: conventional recovery vs. replace recovery.
- Default chunk size = 512KB
- Communication via ATA over Ethernet (AoE)
- Types of disks (physical storage devices): Pentium 4 PCs, network attached storage (NAS) drives, Intel quad-core servers
[Figure: recovery architecture, with the disks attached via a Gigabit switch]

Recovery Time Performance
Conventional vs. replace recovery for double-fault tolerant codes:
[Figures: recovery times for RDP, EVENODD, X-Code, and CRS(k, m=2)]

Recovery Time Performance
Conventional vs. replace recovery for triple- and general-fault tolerant codes:
[Figures: recovery times for STAR, CRS(k, m=3), and CRS(k, m>3)]

Summary of Results
- Replace recovery reduces the recovery time of conventional recovery by 10-30%
- Impact of chunk size: as the chunk size increases, recovery time decreases; replace recovery still shows the recovery time reduction
- Parallel recovery: overall recovery time decreases with the multi-thread, multi-server implementation; replace recovery still shows the recovery time reduction
Details in the paper.

Conclusions
- Propose a replace recovery algorithm that provides near-optimal recovery performance for STAR and CRS codes and has polynomial computational complexity
- Implement replace recovery on a parallelized architecture
- Show via testbed experiments that replace recovery speeds up recovery over conventional recovery
Source code:

Backup

Impact of Chunk Size
[Figures: recovery time vs. chunk size for conventional recovery and for replace recovery]
- Recovery time decreases as the chunk size increases
- Recovery time stabilizes for large chunk sizes

Parallel Recovery
Recovery performance of the multi-threaded implementation:
[Figure: recovery time vs. number of threads, STAR (p = 13), quad-core case]
- Recovery time decreases as the number of threads increases
- The improvement is bounded by the number of CPU cores
- This shows the applicability of replace recovery in a parallelized implementation
- Similar results are observed in our multi-server recovery implementation