
1 Degraded-First Scheduling for MapReduce in Erasure-Coded Storage Clusters. Runhui Li, Patrick P. C. Lee, Yuchong Hu. The 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2014.

2 Outline  Introduction  Background  Motivating Example  Design of Degraded-First Scheduling  Simulation  Experiments  Conclusion

3 Introduction(1/5)  As a storage system scales, node failures are commonplace.  To ensure data availability at any time, traditional designs of GFS, HDFS, and Azure replicate each data block into three copies to provide double-fault tolerance.  However, as the volume of global data surges to the zettabyte scale, the 200% redundancy overhead of 3-way replication becomes a scalability bottleneck.

4 Introduction(2/5)  Erasure coding incurs less storage overhead than replication under the same fault tolerance.  Extensive efforts [13, 20, 29] have studied the use of erasure coding in clustered storage systems that provide data analytics services. [13] D. Ford, F. Labelle, F. I. Popovici, M. Stokely, V.-A. Truong, L. Barroso, C. Grimes, and S. Quinlan. Availability in Globally Distributed Storage Systems. In Proc. of USENIX OSDI, Oct 2010. [20] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin. Erasure Coding in Windows Azure Storage. In Proc. of USENIX ATC, Jun 2012. [29] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur. XORing Elephants: Novel Erasure Codes for Big Data. In Proc. of VLDB Endowment, pages 325–336, 2013.

5 Introduction(3/5)  In particular, when data is unavailable due to node failures, reads become degraded in erasure-coded storage, as they must download data from surviving nodes to reconstruct the missing data.  Several studies [20, 22, 29] propose to optimize degraded reads in erasure-coded clustered storage systems by reducing the amount of data downloaded for reconstruction. [20] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin. Erasure Coding in Windows Azure Storage. In Proc. of USENIX ATC, Jun 2012. [22] O. Khan, R. Burns, J. Plank, W. Pierce, and C. Huang. Rethinking Erasure Codes for Cloud File Systems: Minimizing I/O for Recovery and Degraded Reads. In Proc. of USENIX FAST, Feb 2012. [29] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur. XORing Elephants: Novel Erasure Codes for Big Data. In Proc. of VLDB Endowment, pages 325–336, 2013.

6 Introduction(4/5)  Despite the extensive studies on erasure-coded clustered storage systems, it remains an open issue how to customize data analytics paradigms, such as MapReduce, for erasure-coded storage.  In this work, we explore Hadoop's MapReduce on HDFS-RAID [18], a middleware layer that extends HDFS to support erasure coding. [18] HDFS-RAID. http://wiki.apache.org/hadoop/HDFS-RAID.

7 Introduction(5/5)  Traditional MapReduce scheduling emphasizes locality and implements locality-first scheduling.  MapReduce is designed for replication-based storage: in the presence of node failures, it re-schedules tasks to run on other nodes that hold replicas of the lost blocks.  A key motivation of this work is to customize MapReduce scheduling for erasure-coded storage operating in failure mode.

8 Background(1/6)  Hadoop  Hadoop runs on a distributed file system, HDFS.  HDFS divides a file into fixed-size blocks, which form the basic units for read and write operations.  HDFS uses replication to maintain data availability: each block is replicated into multiple copies distributed across different nodes.

9 Background(2/6)  In typical deployment environments of MapReduce, network bandwidth is scarce.  MapReduce emphasizes data locality by trying to schedule a map task to run on a (slave) node that stores a replica of the data block, or on a node located near the data block.  This saves the time of downloading blocks from other nodes over the network.

10 Background(3/6)  A map task can be classified into three types: 1. Node-local: the task processes a block stored on the same node. 2. Rack-local: the task downloads and processes a block stored on another node of the same rack. 3. Remote: the task downloads and processes a block stored on a node of a different rack.  The default task scheduling scheme in Hadoop first assigns map slots to local tasks, followed by remote tasks.
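To make this priority ordering concrete, here is a minimal sketch, assuming a hypothetical task/node model (the attributes `block_nodes` and `rack` are placeholders, not Hadoop APIs):

```python
# Minimal sketch (not Hadoop's actual code) of how a locality-first scheduler
# ranks candidate map tasks for a free map slot on `node`.

def task_priority(task, node):
    """Lower value = scheduled earlier."""
    if node in task.block_nodes:                          # node-local
        return 0
    if node.rack in {n.rack for n in task.block_nodes}:   # rack-local
        return 1
    return 2                                              # remote (cross-rack download)

def pick_task(pending_tasks, node):
    # Locality-first: choose the pending task with the best locality for this node.
    return min(pending_tasks, key=lambda t: task_priority(t, node))
```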

11 Background(4/6)  Erasure Coding  To reduce the redundancy overhead of replication, erasure coding can be used.  Under replication, a read to a lost block can be redirected to another replica.  Under erasure coding, reading a lost block requires a degraded read, which reads blocks from any k surviving nodes of the same stripe and reconstructs the lost block.
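As a toy illustration of a degraded read, the sketch below uses a single-parity (k+1, k) stripe; HDFS-RAID also supports Reed-Solomon codes, but the read pattern is the same: fetch k surviving blocks of the stripe, then decode the lost one.

```python
# Toy degraded read with a single-parity stripe (illustrative only).

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def degraded_read(stripe, lost_index):
    """stripe: k data blocks plus 1 parity block; the lost entry is None."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    assert all(b is not None for b in survivors), "need k surviving blocks"
    # XOR of all surviving blocks reconstructs the lost one.
    return xor_blocks(survivors)

# Example: a (3, 2) stripe with data blocks "aa", "bb" and parity = their XOR.
data = [b"aa", b"bb"]
parity = xor_blocks(data)
assert degraded_read([None, b"bb", parity], 0) == b"aa"
```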

12 Background(5/6)  MapReduce follows locality-first scheduling.  When the input block of a map task is lost, HDFS-RAID reconstructs it via a degraded read.  Degraded tasks: tasks that first read data from surviving nodes to reconstruct the lost block and then process the reconstructed block.  Degraded tasks are given the lowest priority in the default locality-first scheduling; they are scheduled after local and remote tasks.
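Continuing the earlier sketch, failure mode simply adds a fourth, lowest priority level for degraded tasks (again with hypothetical attributes):

```python
# Extending the locality-first sketch to failure mode: a degraded task gets
# the lowest priority and is only launched after local and remote tasks.

def task_priority_with_failures(task, node):
    if task.block_lost:            # input block unavailable -> degraded task
        return 3                   # scheduled last under locality-first
    return task_priority(task, node)   # reuse the earlier sketch otherwise
```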

13 Background(6/6) (figure only)

14 Motivating Example(1/2)  HDFS uses 3-way replication and places the three replicas as follows.  The first replica is placed on a random node.  The second and third replicas are placed on two different random nodes located in a different rack from the first replica.  This placement policy can tolerate an arbitrary double-node failure and an arbitrary single-rack failure.
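A minimal sketch of this placement policy under an assumed cluster model (the rack/node layout and the function name are illustrative, not HDFS code):

```python
import random

# One replica on a random node, two more on distinct nodes of a single
# different rack, as described on the slide.

def place_replicas(racks):
    """racks: dict mapping rack id -> list of node ids (hypothetical layout)."""
    first_rack = random.choice(list(racks))
    first_node = random.choice(racks[first_rack])
    other_rack = random.choice([r for r in racks if r != first_rack])
    second_node, third_node = random.sample(racks[other_rack], 2)
    return [(first_rack, first_node),
            (other_rack, second_node),
            (other_rack, third_node)]

# Example: losing any two nodes, or any single rack, still leaves one replica.
print(place_replicas({"r1": ["n1", "n2"], "r2": ["n3", "n4"], "r3": ["n5", "n6"]}))
```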

15 Motivating Example(2/2) (figure only)

16 Design of Degraded-First Scheduling(1/7)  The main idea is to move some of the degraded tasks to an earlier stage of the map phase.  These degraded tasks can take advantage of the unused network resources while the local tasks are running.  This avoids network resource competition among degraded tasks at the end of the map phase.

17 Design of Degraded-First Scheduling(2/7)  Basic Design  M: total number of map tasks to be launched.  M_d: total number of degraded tasks to be launched.  m: number of map tasks that have been launched so far.  m_d: number of degraded tasks that have been launched so far.
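The sketch below expresses the basic launching rule implied by this notation; it is our reading of the design, not the paper's verbatim algorithm: whenever a map slot frees up, launch a degraded task if the launched fraction of degraded tasks lags the launched fraction of all map tasks.

```python
# Basic degraded-first rule (sketch): keep m_d / M_d in step with m / M.

def should_launch_degraded(m, M, m_d, M_d):
    if M_d == 0 or m_d >= M_d:
        return False                   # no degraded tasks remaining
    if m >= M:
        return False                   # nothing left to launch at all
    return m_d / M_d <= m / M          # degraded tasks are not ahead of schedule

# When a map slot frees up:
#   launch a degraded task if should_launch_degraded(...) is True,
#   otherwise launch a local (or remote) map task as in locality-first scheduling.
```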

18 Design of Degraded-First Scheduling(3/7) (figure only)

19 Design of Degraded-First Scheduling(4/7)  T: processing time of a map task.  S: input block size.  W: download bandwidth of each rack.  F: total number of native blocks to be processed by MapReduce.  N: number of nodes.  R: number of racks.  L: number of map slots allocated to each node.  An (n, k) erasure code encodes k native blocks to generate n − k parity blocks.  The slide derives: the runtime of a MapReduce job without any node failure, the number of degraded tasks in each rack, and the expected time for downloading blocks from other racks (formulas shown as figures on the slide).
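For intuition only, a back-of-envelope reading of these symbols (not the paper's exact derivation) is:

```latex
% With N nodes, L map slots per node, and F map tasks of duration T each,
% the no-failure map-phase runtime is roughly
T_{\text{normal}} \approx \left\lceil \frac{F}{N\,L} \right\rceil T .
% A degraded read fetches k surviving blocks of size S, mostly over a rack
% link of download bandwidth W, so each reconstruction adds roughly
t_{\text{download}} \approx \frac{k\,S}{W}
% of download time when the rack link is otherwise idle.
```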

20 Design of Degraded-First Scheduling(5/7)  Parameters: N = 40, R = 4, L = 4, S = 128 MB, W = 1 Gbps, T = 20 s, F = 1440, (n, k) = (16, 12). (a) runtime reduction ranging from 15% to 32%; (b) runtime reduction ranging from 25% to 28%; (c) runtime reduction ranging from 18% to 43%.

21 Design of Degraded-First Scheduling(6/7)  Enhanced Design  Locality preservation  Launching additional remote tasks is clearly undesirable, as they compete for network resources just as degraded tasks do.  We implement locality preservation by restricting the launch of degraded tasks, so that local map tasks are not unexpectedly assigned to other nodes.  Rack awareness  In failure mode, launching multiple degraded tasks in the same rack may cause competition for network resources, since these degraded tasks download data through the same top-of-rack switch.

22 Design of Degraded-First Scheduling(7/7)  t_s: processing time for the local map tasks of slave s.  E[t_s]: expected processing time for the local map tasks across all slaves.  t_r: duration since the last degraded task was assigned to rack r.  E[t_r]: expected duration across all racks.
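A sketch of how these quantities might gate the assignment of a degraded task to slave s in rack r; the attribute names are hypothetical and this is our reading of the enhancement, not the authors' exact algorithm:

```python
# slave.t_s : time slave s has spent processing its local map tasks
# rack.t_r  : time since rack r was last assigned a degraded task

def can_assign_degraded(slave, rack, expected_t_s, expected_t_r):
    # Locality preservation: divert slave s to a degraded task only after it
    # has spent at least the cluster-average time on its own local tasks.
    if slave.t_s < expected_t_s:
        return False
    # Rack awareness: avoid overloading one top-of-rack switch; require rack r
    # to have waited at least the average interval since its last degraded task.
    if rack.t_r < expected_t_r:
        return False
    return True
```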

23 Simulation(1/5)  Compare enhanced degraded-first scheduling (EDF) with locality-first scheduling (LF).  Compare the basic and enhanced versions of degraded-first scheduling (BDF and EDF).  The MapReduce simulator is a C++-based discrete-event simulator built on CSIM20 [8]. [8] CSIM. http://www.mesquite.com/products/csim20.htm.

24 Simulation(2/5)  Locality-First vs. Degraded-First Scheduling  40 nodes evenly grouped into four racks.  The rack download bandwidth is 1 Gbps.  The block size is 128 MB.  A (20, 15) erasure code is used.  The total number of map tasks is 1440, while the number of reduce tasks is fixed at 30.

25 Simulation(3/5) (figure results) (a) runtime reduction from 17.4% for (8, 6) to 32.9% for (20, 15); (b) 34.8% to 39.6%; (c) 35.1% on average when the rack bandwidth is 500 Mbps.

26 Simulation(4/5) (figure results) (d) 33.2%, 22.3%, and 5.9% on average; (e) EDF reduces the runtime of LF by 20.0% to 33.2%; (f) EDF reduces the runtime of LF by 28.6% to 48.6%.

27 Simulation(5/5)  Basic vs. Enhanced Degraded-First Scheduling  Heterogeneous cluster:  The same configuration as the homogeneous one, except that half of the nodes have lower processing power (longer mean map and reduce task times).

28 Experiments(1/4)  Run experiments on a small-scale Hadoop cluster testbed composed of a single master node and 12 slave nodes.  The 12 slaves are grouped into three racks, with four slaves each.  Three I/O-heavy MapReduce jobs:  WordCount  Grep  LineCount

29 Experiments(2/4)  The HDFS block size is 64 MB, and a (12, 10) erasure code is used to provide fault tolerance.  We generate 15 GB of plain-text data from the Gutenberg website [17].  The data is divided into 240 blocks and written to HDFS. [17] Gutenberg. http://www.gutenberg.org.

30 Experiments(3/4) (figure results) (a) 27.0%, 26.1%, and 24.8%; (b) 16.6%, 28.4%, and 22.6%.

31 Experiments(4/4)  Compare the average runtime of normal map tasks (local and remote tasks), degraded tasks, and reduce tasks.  The runtime of a task includes the data transmission time and the data processing time.

32 Conclusion  We present degraded-first scheduling, a new MapReduce scheduling scheme designed to improve MapReduce performance in erasure-coded clustered storage systems operating in failure mode.  Degraded-first scheduling reduces the MapReduce runtime of locality-first scheduling by 27.0% for a single-job scenario and 28.4% for a multi-job scenario.

