
Degraded-First Scheduling for MapReduce in Erasure-Coded Storage Clusters
Runhui Li, Patrick P. C. Lee, Yuchong Hu
The Chinese University of Hong Kong
44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2014)

Outline
- Introduction
- Background
- Motivating Example
- Design of Degraded-First Scheduling
- Simulation
- Experiments
- Conclusion

Introduction (1/5)
- As a storage system scales, node failures become commonplace.
- To ensure data availability at all times, traditional designs such as GFS, HDFS, and Azure replicate each data block into three copies to provide double-fault tolerance.
- However, as the volume of global data surges to the zettabyte scale, the 200% redundancy overhead of 3-way replication becomes a scalability bottleneck.

Introduction (2/5)
- Erasure coding incurs less storage overhead than replication under the same fault tolerance.
- Extensive efforts [13, 20, 29] have studied the use of erasure coding in clustered storage systems that provide data analytics services.

[13] D. Ford, F. Labelle, F. I. Popovici, M. Stokely, V.-A. Truong, L. Barroso, C. Grimes, and S. Quinlan. Availability in Globally Distributed Storage Systems. In Proc. of USENIX OSDI, Oct 2010.
[20] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin. Erasure Coding in Windows Azure Storage. In Proc. of USENIX ATC, Jun 2012.
[29] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur. XORing Elephants: Novel Erasure Codes for Big Data. In Proc. of VLDB Endowment, pages 325–336, 2013.
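As a quick back-of-envelope illustration of the overhead claim above (my own arithmetic, using the (16, 12) code that appears later in the simulations):

\[
\text{redundancy overhead} = \frac{n-k}{k}
\;\Rightarrow\;
\frac{16-12}{12} \approx 33\% \ \text{for a } (16,12) \text{ code,}
\quad\text{versus}\quad
\frac{3-1}{1} = 200\% \ \text{for 3-way replication.}
\]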

Introduction (3/5)
- In particular, when data is unavailable due to node failures, reads in erasure-coded storage become degraded: they must download data from surviving nodes to reconstruct the missing data.
- Several studies [20, 22, 29] propose to optimize degraded reads in erasure-coded clustered storage systems by reducing the amount of data downloaded for reconstruction.

[20] C. Huang, H. Simitci, Y. Xu, A. Ogus, B. Calder, P. Gopalan, J. Li, and S. Yekhanin. Erasure Coding in Windows Azure Storage. In Proc. of USENIX ATC, Jun 2012.
[22] O. Khan, R. Burns, J. Plank, W. Pierce, and C. Huang. Rethinking Erasure Codes for Cloud File Systems: Minimizing I/O for Recovery and Degraded Reads. In Proc. of USENIX FAST, Feb 2012.
[29] M. Sathiamoorthy, M. Asteris, D. Papailiopoulos, A. G. Dimakis, R. Vadali, S. Chen, and D. Borthakur. XORing Elephants: Novel Erasure Codes for Big Data. In Proc. of VLDB Endowment, pages 325–336, 2013.

Introduction (4/5)
- Despite the extensive studies on erasure-coded clustered storage systems, it remains an open issue how a data analytics paradigm such as MapReduce should be customized for erasure-coded storage.
- In this work, we explore Hadoop's version of MapReduce running on HDFS-RAID [18], a middleware layer that extends HDFS to support erasure coding.

[18] HDFS-RAID.

Introduction (5/5)
- Traditional MapReduce scheduling emphasizes locality and implements locality-first scheduling.
- MapReduce is designed around replication-based storage: in the presence of node failures, it re-schedules tasks to run on other nodes that hold replicas of the lost blocks.
- A key motivation of this work is to customize MapReduce scheduling for erasure-coded storage operating in failure mode.

Background (1/6)
- Hadoop
  - Hadoop runs on a distributed file system, HDFS.
  - HDFS divides a file into fixed-size blocks, which form the basic units for read and write operations.
  - HDFS uses replication to maintain data availability: each block is replicated into multiple copies and distributed across different nodes.

Background (2/6)
- In typical deployment environments of MapReduce, network bandwidth is scarce.
- MapReduce emphasizes data locality by trying to schedule a map task on a (slave) node that stores a replica of the data block, or on a node that is located near the data block.
- This saves the time of downloading blocks from other nodes over the network.

Background (3/6)
- A map task can be classified into three types:
  1. Node-local: the task processes a block stored on the same node.
  2. Rack-local: the task downloads and processes a block stored on another node in the same rack.
  3. Remote: the task downloads and processes a block stored on a node in a different rack.
- The default task scheduling scheme in Hadoop first assigns map slots to local tasks, followed by remote tasks, as sketched below.
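A minimal sketch of the locality-first priority order described above. The class and field names are illustrative only and do not correspond to Hadoop's actual scheduler API.

    # Locality-first map-task selection: prefer node-local, then rack-local, then remote.
    from collections import namedtuple

    Task = namedtuple("Task", ["block_node", "block_rack"])   # where the input block lives
    Slave = namedtuple("Slave", ["node", "rack"])             # the node requesting a task

    def locality(task, slave):
        """0 = node-local, 1 = rack-local, 2 = remote."""
        if task.block_node == slave.node:
            return 0
        if task.block_rack == slave.rack:
            return 1
        return 2

    def pick_map_task(pending, slave):
        """Locality-first: choose the pending task with the best (lowest) locality type."""
        return min(pending, key=lambda t: locality(t, slave), default=None)

    # Example: a slave on node "n1" in rack "r1" prefers its node-local task.
    pending = [Task("n2", "r1"), Task("n1", "r1"), Task("n5", "r2")]
    print(pick_map_task(pending, Slave("n1", "r1")))  # -> Task(block_node='n1', block_rack='r1')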

Background (4/6)
- Erasure Coding
  - To reduce the redundancy overhead of replication, erasure coding can be used.
  - With replication, a read to a lost block can simply be redirected to another replica.
  - With erasure coding, reading a lost block requires a degraded read, which reads blocks from any k surviving nodes of the same stripe and reconstructs the lost block. A toy reconstruction example is sketched below.
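The slides describe general (n, k) codes; purely for readability, the following toy sketch uses a single-parity (k+1, k) XOR code to show what a degraded read reconstructs.

    # Toy degraded read with a single-parity XOR code (one stripe, k = 3).
    def xor_blocks(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    data_blocks = [b"AAAA", b"BBBB", b"CCCC"]     # native blocks of one stripe
    parity = data_blocks[0]
    for blk in data_blocks[1:]:
        parity = xor_blocks(parity, blk)          # parity = d0 ^ d1 ^ d2

    # Suppose the node holding data_blocks[1] fails. A degraded read fetches the
    # k surviving blocks of the stripe (here: d0, d2, parity) and reconstructs d1.
    surviving = [data_blocks[0], data_blocks[2], parity]
    reconstructed = surviving[0]
    for blk in surviving[1:]:
        reconstructed = xor_blocks(reconstructed, blk)

    assert reconstructed == data_blocks[1]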

Background (5/6)
- MapReduce follows locality-first scheduling.
- HDFS-RAID reconstructs a lost block via a degraded read.
- Degraded tasks: map tasks that first read data from other surviving nodes to reconstruct the lost block, and then process the reconstructed block.
- Degraded tasks are given the lowest priority in the default locality-first scheduling; they are scheduled after local and remote tasks.

Background (6/6): figure (not included in the transcript)

Motivating Example (1/2)
- HDFS uses 3-way replication and places the three replicas as follows:
  - The first replica is placed on a random node.
  - The second and third replicas are placed on two different random nodes located in a different rack from the first replica.
- This placement policy can tolerate an arbitrary double-node failure and an arbitrary single-rack failure. A sketch of the placement policy follows.
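A short sketch of the placement policy as stated on this slide (illustrative only, not HDFS source code): replica 1 on a random node, replicas 2 and 3 on two different random nodes in a different rack from replica 1.

    import random

    def place_replicas(nodes_by_rack):
        """nodes_by_rack: dict mapping rack id -> list of node ids."""
        rack1 = random.choice(list(nodes_by_rack))
        first = random.choice(nodes_by_rack[rack1])
        other_racks = [r for r in nodes_by_rack if r != rack1]
        rack2 = random.choice(other_racks)
        second, third = random.sample(nodes_by_rack[rack2], 2)
        return [first, second, third]

    # Example: 2 racks with 3 nodes each.
    cluster = {"r1": ["n1", "n2", "n3"], "r2": ["n4", "n5", "n6"]}
    print(place_replicas(cluster))   # e.g. ['n2', 'n5', 'n4']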

Motivating Example (2/2): figure (not included in the transcript)

Design of Degraded-First Scheduling (1/7)
- The main idea is to move some of the degraded tasks to an earlier stage of the map phase.
- These degraded tasks can take advantage of the unused network resources while the local tasks are running.
- This avoids the network resource competition among degraded tasks at the end of the map phase.

Design of Degraded-First Scheduling (2/7)
- Basic Design
  - M: total number of map tasks to be launched.
  - M_d: total number of degraded tasks to be launched.
  - m: number of map tasks that have been launched so far.
  - m_d: number of degraded tasks that have been launched so far.
- A sketch of the basic launch rule follows.
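The slide only defines the counters; the sketch below shows one plausible way to use them (my reading, not a verbatim copy of the paper's algorithm): launch a degraded task whenever the fraction of launched degraded tasks m_d/m falls behind the overall fraction M_d/M, so degraded tasks are spread across the map phase instead of piling up at the end.

    def should_launch_degraded(m, m_d, M, M_d):
        """True if the next map slot should run a degraded task."""
        if M_d == 0:
            return False            # no lost blocks: behave like locality-first
        if m == 0:
            return True             # start spreading degraded tasks immediately
        return m_d / m < M_d / M    # keep the launched degraded fraction on pace

    # Example: M = 10 map tasks, M_d = 2 of them degraded.
    m = m_d = 0
    schedule = []
    for _ in range(10):
        if should_launch_degraded(m, m_d, 10, 2):
            schedule.append("degraded")
            m_d += 1
        else:
            schedule.append("normal")
        m += 1
    print(schedule)   # degraded tasks are interleaved into the map phase, not left to the end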

Design of Degraded-First Scheduling (3/7): figure/algorithm (not included in the transcript)

Design of Degraded-First Scheduling (4/7)
- T: processing time of a map task.
- S: input block size.
- W: download bandwidth of each rack.
- F: total number of native blocks to be processed by MapReduce.
- N: number of nodes.
- R: number of racks.
- L: number of map slots allocated to each node.
- An (n, k) erasure code encodes k native blocks to generate n - k parity blocks.
- Quantities analyzed on this slide (equations not included in the transcript):
  - runtime of a MapReduce job without any node failure;
  - number of degraded tasks in each rack;
  - expected time for downloading blocks from other racks.
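Purely as a back-of-envelope sketch under my own simplifying assumptions (uniform block placement, a map phase that runs fully in parallel over all N*L slots, a single failed node, and a degraded read that fetches k blocks of size S over the rack link of bandwidth W), one would expect roughly:

\[
\text{failure-free runtime} \approx \left\lceil \frac{F}{NL} \right\rceil T,
\qquad
\text{degraded tasks per rack} \approx \frac{F}{NR},
\qquad
\text{download time per degraded task} \approx \frac{kS}{W}.
\]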

Design of Degraded-First Scheduling (5/7)
- Parameters: N = 40, R = 4, L = 4, S = 128 MB, W = 1 Gbps, T = 20 s, F = 1440, (n, k) = (16, 12).
- Analytical results (figures not included in the transcript):
  - (a) runtime reduction ranging from 15% to 32%;
  - (b) runtime reduction ranging from 25% to 28%;
  - (c) runtime reduction ranging from 18% to 43%.

Design of Degraded-First Scheduling (6/7)
- Enhanced Design
  - Locality preservation
    - Having additional remote tasks is clearly undesirable, as they compete for network resources just as degraded tasks do.
    - We implement locality preservation by restricting the launch of degraded tasks, so that local map tasks are not unexpectedly assigned to other nodes.
  - Rack awareness
    - In failure mode, launching multiple degraded tasks in the same rack may cause competition for network resources, since these degraded tasks download data through the same top-of-rack switch.

Design of Degraded-First Scheduling (7/7)
- t_s: processing time of the local map tasks of each slave s.
- E[t_s]: expected processing time of the local map tasks across all slaves.
- t_r: duration since the last degraded task was assigned to each rack r.
- E[t_r]: expected duration across all racks.
- An illustrative sketch combining these quantities follows.
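The slides define these quantities but not the exact decision rule, so the following is only an illustrative guess at how they could gate the launch of a degraded task on slave s in rack r; it is not the paper's actual enhanced algorithm.

    from statistics import mean

    def may_launch_degraded(s, r, local_time, last_degraded_gap):
        """local_time: dict slave -> t_s; last_degraded_gap: dict rack -> t_r."""
        t_s, E_t_s = local_time[s], mean(local_time.values())
        t_r, E_t_r = last_degraded_gap[r], mean(last_degraded_gap.values())
        locality_ok = t_s >= E_t_s   # locality-preservation condition (one possible form)
        rack_ok = t_r >= E_t_r       # rack-awareness condition (one possible form)
        return locality_ok and rack_ok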

Simulation (1/5)
- Compare enhanced degraded-first scheduling (EDF) with locality-first scheduling (LF).
- Compare the basic (BDF) and enhanced (EDF) versions of degraded-first scheduling.
- The MapReduce simulator is a C++-based discrete event simulator built on CSIM20 [8].

[8] CSIM.

Simulation (2/5)
- Locality-First vs. Degraded-First Scheduling
  - 40 nodes evenly grouped into four racks.
  - The rack download bandwidth is 1 Gbps.
  - The block size is 128 MB.
  - A (20, 15) erasure code is used.
  - The total number of map tasks is 1440; the number of reduce tasks is fixed.

Simulation (3/5): results (figures not included in the transcript)
- (a) Runtime reduction from 17.4% for (8, 6) to 32.9% for (20, 15).
- (b) Runtime reduction from 34.8% to 39.6%.
- (c) Runtime reduction of 35.1% on average when the bandwidth is 500 Mbps.

Simulation (4/5): results (figures not included in the transcript)
- (d) Runtime reductions of 33.2%, 22.3%, and 5.9% on average.
- (e) EDF reduces the runtime of LF by 20.0% to 33.2%.
- (f) EDF reduces the runtime of LF by 28.6% to 48.6%.

Simulation (5/5)
- Basic vs. Enhanced Degraded-First Scheduling
  - Heterogeneous cluster:
    - Same configuration as the homogeneous one.
    - Half of the nodes have worse processing power, with larger mean times for their map and reduce tasks.

Experiments (1/4)
- We run experiments on a small-scale Hadoop cluster testbed composed of a single master node and 12 slave nodes.
- The 12 slaves are grouped into three racks, with four slaves each.
- Three I/O-heavy MapReduce jobs:
  - WordCount
  - Grep
  - LineCount

Experiments (2/4)
- The HDFS block size is 64 MB, and a (12, 10) erasure code is used to provide failure tolerance.
- We generate 15 GB of plain text data from the Gutenberg website [17].
- The data is divided into 240 blocks and written to HDFS.

[17] Gutenberg.

Experiments (3/4): results (figures not included in the transcript)
- (a) Runtime reductions of 27.0%, 26.1%, and 24.8%.
- (b) Runtime reductions of 16.6%, 28.4%, and 22.6%.

Experiments (4/4)
- We compare the average runtimes of normal map tasks (local and remote tasks), degraded tasks, and reduce tasks.
- The runtime of a task includes the data transmission time and the data processing time.

Conclusion
- We present degraded-first scheduling, a new MapReduce scheduling scheme designed to improve MapReduce performance in erasure-coded clustered storage systems running in failure mode.
- Degraded-first scheduling reduces the MapReduce runtime of locality-first scheduling by 27.0% for a single-job scenario and 28.4% for a multi-job scenario.