
1 A “Hitchhiker’s” Guide to Fast and Efficient Data Reconstruction in Erasure-coded Data Centers
K. V. Rashmi, Nihar Shah, D. Gu, H. Kuang, D. Borthakur, K. Ramchandran (UC Berkeley, Facebook). ACM SIGCOMM 2014.

2 A Solution to the Network Challenges of Data Recovery in Erasure-coded Distributed Storage Systems: A Study on the Facebook Warehouse Cluster. K. V. Rashmi, Nihar Shah, D. Gu, H. Kuang, D. Borthakur, K. Ramchandran (UC Berkeley, Facebook). The 5th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage 2013).

3 Outline
• Introduction
• Hitchhiker's erasure code
• Evaluation results
• Conclusion

4 Need for Redundant Storage in Data Centers
• Frequent unavailability events in data centers: unreliable components, software glitches, maintenance shutdowns, power failures, etc.
• Redundancy is necessary for reliability and availability

5 Popular Approach for Redundant Storage: Replication
• Distributed file systems used in data centers store multiple copies of data on different machines
• Machines are typically chosen on different racks to tolerate rack failures
• E.g., the Hadoop Distributed File System (HDFS) stores 3 replicas by default

6 HDFS

7 Massive Data Sizes: Need Alternative to Replication
• For small to moderately sized data, disk storage is inexpensive, so replication is viable
• This is no longer true at massive scales of operation: e.g., the Facebook data warehouse cluster stores multiple tens of petabytes (PB)
• "Erasure codes" are an alternative

8 Erasure Codes in Data Centers
• The Facebook data warehouse cluster uses Reed-Solomon (RS) codes instead of 3-replication on a portion of the data
• This yields savings of multiple petabytes of storage space

9 Erasure Codes
[Figure: Replication of data blocks a and b stores a, a, b, b; the Reed-Solomon (RS) code stores data blocks a, b and parity blocks a+b, a+2b. Both have 2x redundancy.]


14 Erasure Codes
[Figure: Replication stores blocks a, a, b, b; the Reed-Solomon (RS) code stores blocks a, b, a+b, a+2b.]
First-order comparison: both have 2x redundancy, but replication tolerates any one failure while the RS code tolerates any two failures.
In general: replication has a lower MTTDL (Mean Time To Data Loss) and a higher storage requirement; RS codes give an order of magnitude higher MTTDL with much less storage.

15 Erasure Codes
• Using RS codes instead of 3-replication on less frequently accessed data has led to savings of multiple petabytes in the Facebook warehouse cluster
• The cluster employs a (k=10, r=4) RS code, resulting in a 1.4x storage requirement ((10+4)/10 = 1.4, versus 3x under replication)

16 Reed-Solomon (RS) Codes
• A (#data, #parity) RS code tolerates the failure of any #parity blocks; these (#data + #parity) blocks constitute a "stripe"
• The Facebook warehouse cluster uses a (10, 4) RS code
• Example: a (2, 2) RS code has #data = 2 (data blocks a, b) and #parity = 2 (parity blocks a+b, a+2b), i.e. 4 blocks in a stripe
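A minimal encoding sketch of the (2, 2) example, assuming byte-wise arithmetic in GF(2^8) with the polynomial 0x11d (the slides leave the field implicit): "a+b" is XOR and "2b" is a field multiplication by 2. Function names are illustrative, not from the paper's code.

# Toy (2, 2) RS-style encoding over GF(2^8); a sketch, not the deck's implementation.
def gf_mul(x, y):
    # Carry-less multiplication reduced by x^8 + x^4 + x^3 + x^2 + 1 (0x11d).
    result = 0
    while y:
        if y & 1:
            result ^= x
        y >>= 1
        x <<= 1
        if x & 0x100:
            x ^= 0x11d
    return result

def encode_rs22(a, b):
    # Two data bytes produce two parity bytes: a+b and a+2b.
    return [a, b, a ^ b, a ^ gf_mul(2, b)]

# Any 2 of the 4 blocks suffice to recover a and b,
# e.g. a = (a+b) ^ b when blocks 2 and 3 survive.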

17 Existing Systems
• Need additional storage: Huang et al. (Windows Azure) 2012, Sathiamoorthy et al. (Xorbas) 2013, and Esmaili et al. (CORE) 2013 add additional parities to reduce download; Hu et al. (NCFS) 2011
• Highly restricted parameters: Khan et al. (Rotated-RS) 2012 requires #parity ≤ 3; Xiang et al. 2010, Wang et al. 2010, and Hu et al. (NCCloud) 2012 require #parity ≤ 2
• Hitchhiker performs as well as or better than these even in the restricted settings

18 Erasure codes in Data Centers: HDFS-RAID
Borthakur, "HDFS and Erasure Codes (HDFS-RAID)"; Fan, Tantisiriroj, Xiao and Gibson, "DiskReduce: RAID for Data-Intensive Scalable Computing", PDSW '09

19 Erasure codes in Data Centers: HDFS-RAID
(10, 4) Reed-Solomon code:
• Any 10 blocks are sufficient to recover the data
• Can tolerate any 4 failures
Borthakur, "HDFS and Erasure Codes (HDFS-RAID)"; Fan, Tantisiriroj, Xiao and Gibson, "DiskReduce: RAID for Data-Intensive Scalable Computing", PDSW '09

20 Impact on Data Center Network

21 Impact on Data Center Network
RS codes significantly increase network usage during reconstruction

22 Impact on Data Center Network
Burdens the already oversubscribed Top-of-Rack (TOR) switches and the routers above them

23 Machine Unavailability Events
• From HDFS NameNode logs: an unavailability event is logged when no heartbeat is received from a machine for more than 15 minutes
• 15 minutes is the cluster's default wait time before flagging a machine as unavailable
• Blocks on the machine are marked unavailable and handled by a periodic recovery process
• Measurement period: 22nd Jan. to 24th Feb. 2013
[Figure: number of machines unavailable for more than 15 minutes, per day]
Rashmi et al., "A Solution to the Network Challenges of Data Recovery in Erasure-coded Distributed Storage Systems: A Study on the Facebook Warehouse Cluster", USENIX HotStorage Workshop, 2013

24 Machine Unavailability Events
Median of ≈50 machine-unavailability events logged per day

25 Missing blocks per stripe
Dominant scenario: Single block recovery

26 Facebook Data Warehouse Cluster
• Median of 180 TB transferred across racks per day for recovery operations, around 5 times that under 3-replication
• Reduction of more than 50 TB of cross-rack traffic per day

27 RS codes: The Good and The Bad
• Maximum possible fault tolerance for a given storage overhead: storage-capacity optimal ("maximum distance separable" in coding-theory parlance)
• Flexibility in the choice of parameters: supports any number of data and parity blocks
• But not designed to handle reconstruction operations efficiently: negative impact on the network

28 Goal To build a system with:

29 Hitchhiker Is a system with:

30 At an Abstract Level HITCHHIKER

31 Hitchhiker’s Erasure Code: Toy Example
Take a (2, 2) Reed-Solomon code, with each block holding two 1-byte values:
Block 1 (data): a1, b1
Block 2 (data): a2, b2
Block 3 (parity): a1+a2, b1+b2
Block 4 (parity): a1+2a2, b1+2b2

32 Hitchhiker’s Erasure Code: Toy Example
To recover Block 1 (a1, b1) under the (2, 2) RS code, download a2 and a1+a2 (to solve for a1) and b2 and b1+b2 (to solve for b1): recovery download & IO = 4 bytes.

33 Hitchhiker’s Erasure Code: Toy Example
Add information from the first group onto the parities of the second group: the second byte of Block 4 becomes b1+2b2+a1.
Block 1: a1, b1 | Block 2: a2, b2 | Block 3: a1+a2, b1+b2 | Block 4: a1+2a2, b1+2b2+a1
No additional storage!

34 Fault-Tolerance (Toy Example)
Same fault tolerance as the RS code: can tolerate the failure of any 2 nodes. Example: suppose Blocks 1 and 2 fail, leaving Block 3 (a1+a2, b1+b2) and Block 4 (a1+2a2, b1+2b2+a1).

35 Fault-Tolerance (Toy Example)
Step 1: recover a1 and a2 from a1+a2 and a1+2a2 (RS decoding on the first group).

36 Fault-Tolerance (Toy Example)
Step 2: subtract a1 from b1+2b2+a1 to obtain b1+2b2.

37 Fault-Tolerance (Toy Example)
Step 3: recover b1 and b2 from b1+b2 and b1+2b2 (RS decoding on the second group).

38 Efficient Reconstruction
To reconstruct Block 1 (a1, b1), the data transferred (download & IO) is only 3 bytes, instead of 4 bytes as in RS.

39 Efficient Reconstruction
Step 1: download b2, b1+b2, and b1+2b2+a1 (3 bytes).

40 Efficient Reconstruction
Step 2: subtract b2 from b1+b2 to obtain b1.

41 Efficient Reconstruction
Step 3: subtract b1 and 2b2 from b1+2b2+a1 to obtain a1.
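A self-contained sketch of this toy example (again assuming GF(2^8) arithmetic, which the slides leave implicit; function names are illustrative): it encodes the four blocks with the piggybacked parity and recovers Block 1 from only 3 downloaded bytes.

def gf_mul(x, y):
    # Multiplication in GF(2^8) with polynomial 0x11d.
    r = 0
    while y:
        if y & 1:
            r ^= x
        y >>= 1
        x <<= 1
        if x & 0x100:
            x ^= 0x11d
    return r

def encode_toy(a1, b1, a2, b2):
    # 4 blocks of 2 bytes each; a1 is piggybacked onto Block 4's second byte.
    return [(a1, b1),
            (a2, b2),
            (a1 ^ a2, b1 ^ b2),                              # a1+a2, b1+b2
            (a1 ^ gf_mul(2, a2), b1 ^ gf_mul(2, b2) ^ a1)]   # a1+2a2, b1+2b2+a1

def reconstruct_block1(blocks):
    # Only 3 bytes are touched: b2, b1+b2, and b1+2b2+a1.
    b2 = blocks[1][1]
    b1 = blocks[2][1] ^ b2                 # subtract b2 from b1+b2
    a1 = blocks[3][1] ^ b1 ^ gf_mul(2, b2) # strip b1 and 2b2 off the piggybacked parity
    return (a1, b1)

blocks = encode_toy(7, 11, 13, 17)
assert reconstruct_block1(blocks) == (7, 11)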

42 Hitchhiker’s Erasure Code
• Builds on top of RS codes
• Uses our theoretical framework of "Piggybacking"*
• Three versions: XOR, XOR+, and non-XOR
* K. V. Rashmi, N. B. Shah, K. Ramchandran, "A Piggybacking Design Framework for Read- and Download-Efficient Distributed Storage Codes", IEEE International Symposium on Information Theory, 2013.

43 Hop and couple (disk layout)
• A way of choosing which bytes to couple: coupling bytes farther apart within a block minimizes the discontiguity of disk reads during data reconstruction
• Translates the savings in network transfer into savings in disk IO as well, by making the reads contiguous

44 RS vs Hitchhiker from the Network’s Perspective…

45 Data Transfer during Reconstruction in RS-based System
Transfer: 10 full blocks; connect to 10 machines

46 Data Transfer during Reconstruction in Hitchhiker
Reconstruction of data blocks 1-9: transfer 2 full blocks + 9 half blocks (= 6.5 blocks total); connect to 11 machines

47 Data Transfer during Reconstruction in Hitchhiker
Reconstruction of data block 10: transfer 13 half blocks (= 6.5 blocks total); connect to 13 machines
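As a back-of-the-envelope check (a plain sketch, not the paper's code), the transfer volumes above for (k=10, r=4) work out as follows, with one block as the unit:

rs_transfer = 10 * 1.0                         # RS: 10 full blocks
hh_blocks_1_to_9 = 2 * 1.0 + 9 * 0.5           # Hitchhiker, blocks 1-9: 6.5 blocks
hh_block_10 = 13 * 0.5                         # Hitchhiker, block 10: 6.5 blocks
savings = 1 - hh_blocks_1_to_9 / rs_transfer   # 0.35, the ~35% reduction reported in the evaluation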

48 Hop-and-Couple: a technique to pair bytes under Hitchhiker's erasure code
Makes disk reads during reconstruction contiguous

49 Hop-and-Couple: hop length = 1 byte
Figure 7(a): coupling adjacent bytes (hop length 1) to form stripes for Hitchhiker's erasure code. The shaded bytes are read and downloaded for the reconstruction of the first unit; with this coupling the read is discontiguous.

50 Hop-and-Couple: hop length = half the size of a unit
Figure 7(b): coupling bytes that are half a unit apart. Both ways of coupling require the same amount of data to be read, but here the data to be read is contiguous.
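A small illustration of the two coupling patterns in Figure 7 (the function below and the 8-byte unit size are assumptions for the sketch, not the system's actual layout code): with hop length 1 the coupled partners alternate across the unit, while a hop of half the unit size puts all partners of the first half in one contiguous run.

def coupled_pairs(unit_size, hop):
    # Couple byte i with byte i + hop, so each byte belongs to exactly one pair.
    pairs = []
    for start in range(0, unit_size, 2 * hop):
        for i in range(start, start + hop):
            pairs.append((i, i + hop))
    return pairs

unit = 8
print(coupled_pairs(unit, hop=1))          # [(0, 1), (2, 3), ...]: partners scattered -> discontiguous reads
print(coupled_pairs(unit, hop=unit // 2))  # [(0, 4), (1, 5), ...]: partners form one contiguous half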

51 Hitchhiker’s Erasure Code
Figure 2: Two stripes of a (k=10, r=4) Reed-Solomon (RS) code. Ten units of data (first ten rows) are encoded using the RS code to generate four parity units (last four rows).

52 Hitchhiker’s Erasure Code
Figure 2: Two stripes of a (k=10, r=4) Reed-Solomon (RS) code. Ten units of data (first ten rows) are encoded using the RS code to generate four parity units (last four rows). Figure 3: The theoretical framework of Piggybacking [22] for parameters (k=10, r=4). Each row represents one unit of data.

53 Hitchhiker-XOR The XOR-only feature of these erasure codes significantly reduces the computational complexity of decoding, making degraded reads and failure recovery faster. Hitchhiker's erasure code optimizes only the reconstruction of data units; reconstruction of parity units is performed as in RS codes.

54 Hitchhiker-XOR Figure 4: Hitchhiker-XOR code for (k=10, r=4). Each row represents one unit of data.

55 Hitchhiker-XOR+ Hitchhiker-XOR+ further reduces the amount of data required for reconstruction compared to Hitchhiker-XOR, and employs only additional XOR operations. It requires the underlying RS code to satisfy a property we term the all-XOR-parity property: at least one parity function of the RS code must be an XOR of all the data units.

56 Hitchhiker-XOR+ Figure 5: Hitchhiker-XOR+ for (k=10, r=4). Parity 2 of the underlying RS code is all-XOR.

57 Hitchhiker-nonXOR Hitchhiker-nonXOR guarantees the same savings as Hitchhiker-XOR+ even when the underlying RS code does not possess the all-XOR-parity property, but at the cost of additional finite-field arithmetic. Hitchhiker-nonXOR can be built on top of any RS code.

58 Hitchhiker-nonXOR Figure 6: Hitchhiker-nonXOR code for (k=10, r=4). This can be built on any RS code. Each row is one unit of data.

59 Implementation & Evaluation Setup(1)
• Implemented on top of HDFS-RAID, the RS-based erasure-coding module in HDFS used in the Facebook data warehouse cluster
• Deployed and tested on a 60-machine test cluster at Facebook
• Verified the 35% reduction in network transfer during reconstruction

60 Implementation & Evaluation Setup(2)
• Timing metrics evaluated on the Facebook data warehouse cluster in production, under real-time production traffic and workloads
• Encoding and reconstruction jobs run via MapReduce, just as in HDFS-RAID

61 Decoding Time
• 36% reduction in decoding time: RS decoding is needed on only half of the block
• Faster computation for degraded reads and recovery
• XOR versions: 25% less decoding time than the non-XOR version

62 Read & Transfer Time
• 35% less data read and transferred; read & transfer time about 30% lower in Hitchhiker (HH)
• Similar reduction for other block sizes (4 MB, 64 MB) as well

63 The 95th percentile of the read time
[Figure: median and 95th-percentile read times during reconstruction]


65 Encoding Time
• Encoding time is 72% higher than RS
• Benefits outweigh the higher encoding cost in many systems (e.g., HDFS): encoding raw data into erasure-coded data is a one-time operation, often run as a background job

66 Hitchhiker

67 Conclusion
We present Hitchhiker, a new erasure-coded storage system.
Hitchhiker reduces both network and disk traffic during reconstruction by 25% to 45% compared to RS-based systems, while requiring no additional storage and retaining the same fault tolerance.

68 References
[6] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. Wainwright, and K. Ramchandran. Network coding for distributed storage systems. IEEE Trans. Inf. Theory, Sept. 2010.
[17] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur. XORing elephants: Novel erasure codes for big data. In VLDB, 2013.
[19] D. Papailiopoulos, A. Dimakis, and V. Cadambe. Repair optimal erasure codes through Hadamard designs. IEEE Trans. Inf. Theory, May 2013.
[20] K. V. Rashmi, N. B. Shah, D. Gu, H. Kuang, D. Borthakur, and K. Ramchandran. A solution to the network challenges of data recovery in erasure-coded distributed storage systems: A study on the Facebook warehouse cluster. In Proc. USENIX HotStorage, June 2013.
[21] K. V. Rashmi, N. B. Shah, and P. V. Kumar. Optimal exact-regenerating codes for the MSR and MBR points via a product-matrix construction. IEEE Trans. Inf. Theory, 2011.
[22] K. V. Rashmi, N. B. Shah, and K. Ramchandran. A piggybacking design framework for read- and download-efficient distributed storage codes. In IEEE International Symposium on Information Theory, 2013.
[26] N. Shah, K. Rashmi, P. Kumar, and K. Ramchandran. Distributed storage codes with repair-by-transfer and non-achievability of interior points on the storage-bandwidth tradeoff. IEEE Trans. Inf. Theory, 2012.

69 Figure 9: Data read patterns during reconstruction of blocks 1, 4 and 10 in Hitchhiker-XOR+: the shaded bytes are read and downloaded.

