MC²: Map Concurrency Characterization for MapReduce on the Cloud
Mohammad Hammoud and Majd Sakr



Hadoop MapReduce
 MapReduce is now a pervasive data-processing framework on the cloud
 Hadoop is an open-source implementation of MapReduce
 Hadoop MapReduce incorporates two phases, Map and Reduce, which encompass multiple Map and Reduce tasks
[Figure: a dataset stored in HDFS blocks is divided into splits (Split 0-3), each consumed by a Map task in the Map phase; Map output partitions are shuffled and merged (Shuffle and Merge stages) to Reduce tasks in the Reduce phase, whose output is written back to HDFS]
2
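To make the two phases concrete, here is a toy illustration of the MapReduce programming model in plain Python (this is not Hadoop code, just the classic word-count example showing map, shuffle and reduce):

```python
from collections import defaultdict
from itertools import chain

def map_task(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]

def reduce_task(word, counts):
    """Reduce: sum all counts shuffled to this key."""
    return word, sum(counts)

splits = ["the cat sat", "the cat ran"]            # HDFS-like input splits
intermediate = chain.from_iterable(map_task(s) for s in splits)

groups = defaultdict(list)                         # shuffle: group by key
for word, count in intermediate:
    groups[word].append(count)

result = dict(reduce_task(w, c) for w, c in groups.items())
# result == {"the": 2, "cat": 2, "sat": 1, "ran": 1}
```

In Hadoop the splits, the shuffle and the grouping are handled by the framework; only the map and reduce functions are user code.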

How to Effectively Configure Hadoop?
 Hadoop has more than 190 configuration parameters, many of which can have a significant impact on job performance
 A main challenge facing Hadoop users on the cloud is running MapReduce applications in the most economical way while still achieving good performance
 The burden falls on Hadoop users to configure Hadoop effectively
 Hadoop's default configuration is not necessarily optimal: several-fold speedups/slowdowns have been observed between tuned and default Hadoop
3

Map Tasks and Map Concurrency
Among the influential configuration parameters in Hadoop are:
 Number of Map Tasks: determined by the number of HDFS blocks
 Number of Map Slots: allocated to run Map Tasks
[Figure: a JobTracker behind a core switch schedules Map tasks (MT1-MT3) onto TaskTrackers (TaskTracker1-5) spread across two rack switches; a TaskTracker requests a Map task and the JobTracker schedules it at an empty Map slot on that TaskTracker]
Map Concurrency = Map Tasks / Map Slots
4
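A minimal sketch of this arithmetic (not part of MC²; the dataset size, block size and slot count below are made up for illustration) showing how the Map task count, and hence the number of Map waves, follows from the HDFS block count and the available slots:

```python
import math

def map_concurrency(dataset_bytes, hdfs_block_bytes, map_slots):
    """Number of Map tasks is driven by the HDFS block count; the number
    of Map waves is the task count divided by the available Map slots."""
    map_tasks = math.ceil(dataset_bytes / hdfs_block_bytes)
    waves = math.ceil(map_tasks / map_slots)
    return map_tasks, waves

# e.g. a 1 GiB dataset with 64 MiB blocks on a cluster with 8 Map slots
tasks, waves = map_concurrency(1 << 30, 64 << 20, 8)
# 16 Map tasks, scheduled as 2 waves of 8
```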

Impact of Map Concurrency
[Figure: runtimes of Sobel, K-Means, Sort and WordCount-CD under default Hadoop versus tuned Hadoop]
Observations:
 Map concurrency has a strong impact on Hadoop performance
 Hadoop's default Map-concurrency settings are not optimal
 For effective execution, Hadoop might require different Map concurrencies for different applications
5

Our Work
We propose MC²:
 A simple, fast and static "utility" program that predicts the best Map concurrency for any given MapReduce application
MC² is based on a mathematical model that exploits two main MapReduce internal characteristics:
 Map Setup Time (MST): the total overhead of setting up all Map tasks in a job
 Early Shuffle: the process of shuffling intermediate data while the Map phase is still running
6

Talk Roadmap
 Characterizing Map Concurrency
– Map Concurrency ≤ 1
– Map Concurrency > 1
 A Mathematical Model for Predicting Runtimes of MapReduce Jobs
 The MC² Predictor
 Quantitative Evaluation
 Concluding Remarks
7


Concurrency with a Single Map Wave
 The maximum number of concurrent Map tasks is bounded by the total number of Map slots in a Hadoop cluster
 We refer to a set of Map tasks running concurrently, at most one per Map slot, as a Map wave
[Figure: three timelines over six Map slots (MS1-MS6), with MST = Map Setup Time. Running only two Map tasks (MT1, MT2) at a time ends at time t + MST; spreading four tasks across the slots in one wave ends at t/2 + MST; leaving slots idle forces a second wave and ends at t/2 + 2MST]
 Fill as many Map slots as possible within a Map wave: more parallelism and better utilization
9

Concurrency with Multiple Map Waves
What are the tradeoffs as the number of Map waves is varied?
[Figure: timelines for one, two and four Map waves over four Map slots (MS1-MS4) and two Reduce slots (RS1-RS2). With one wave there is no overlap between the Map phase and the Shuffle/Merge and Reduce stages, and the total setup cost is MST; with two waves shuffling overlaps the Map phase but setup cost grows to 2MST; with four waves the overlap increases further while setup cost grows to 4MST. The total Map time t is the same in every case: t = t/2 + t/2 = t/4 + t/4 + t/4 + t/4]
As the number of Map waves is increased:
 (-) Map Setup Time increases (cost)
 (+) Data shuffling starts earlier, i.e., earlier Early Shuffle (opportunity)
10
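The tradeoff can be illustrated numerically. A small sketch, with hypothetical numbers (t is the total Map work in seconds, mst the per-wave setup cost): splitting the same work into more waves multiplies the setup cost, but the first wave, and hence early shuffle, finishes sooner:

```python
def wave_tradeoff(t, mst, waves):
    """For a fixed amount of Map work t split into `waves` waves:
    total setup cost grows with the wave count (cost), while the first
    wave's output becomes available for shuffling sooner (opportunity)."""
    total_setup = waves * mst           # one setup cost per wave
    first_wave_done = t / waves + mst   # early shuffle can start here
    return total_setup, first_wave_done

t, mst = 120.0, 2.0  # hypothetical: 120 s of Map work, 2 s setup per wave
for w in (1, 2, 4):
    setup, shuffle_start = wave_tradeoff(t, mst, w)
    # 1 wave: setup 2 s, shuffle starts at 122 s
    # 2 waves: setup 4 s, shuffle starts at 62 s
    # 4 waves: setup 8 s, shuffle starts at 32 s
```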

When to Trigger Early Shuffle?
 Early Shuffle can be activated earlier by increasing the number of Map waves
 When exactly the early shuffle process should be activated varies across applications
 The more data an application shuffles, the earlier the early shuffle process should be triggered: with larger shuffle data, a larger number of Map waves is preferred
 We devise a mathematical model that allows locating the best number of Map waves for any given MapReduce application
11


A Mathematical Model (1)
Assumptions:
 Map tasks start and finish at similar times
 The time impact of speculative execution is masked
 Slow Mappers and Reducers are ignored
 Map time is typically longer than Map Setup Time
[Figure: job timeline over four Map slots (MS1-MS4) and two Reduce slots (RS1-RS2); the total Map Setup Time (MST), the Hidden Shuffle Time (HST) overlapped with the Map phase, the Exposed Shuffle Time (EST), and the Reduce time together make up the job runtime]
13

A Mathematical Model (2)
[Figure: the same timeline (MST, HST, EST, Reduce time, Runtime) annotated with equations (1)-(4); the equations themselves did not survive transcription]
14
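Since the slide's equations were lost in transcription, the following is only a plausible reconstruction consistent with the timeline figure and the quantities the talk names (an assumption, not the paper's exact formulation). Let W be the number of Map waves, t the total Map time, MST the per-wave Map setup time, D the shuffle data size, R the shuffle rate, and T_R the reduce time:

```latex
% Hedged reconstruction of equations (1)-(4); symbols as defined above.
\begin{align}
\mathit{TotalMST} &= W \cdot \mathit{MST} \tag{1} \\
\mathit{HST} &= t - \frac{t}{W} \tag{2} \\
\mathit{EST} &= \max\!\left(0,\ \frac{D}{R} - \mathit{HST}\right) \tag{3} \\
\mathit{Runtime} &= t + \mathit{TotalMST} + \mathit{EST} + T_R \tag{4}
\end{align}
```

Equation (2) captures early shuffle: with W waves, shuffling can overlap all of the Map phase except the first wave, so only the shuffle time exceeding HST is exposed in (3).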


MC²: Map Concurrency Characterization
Our mathematical model can be utilized to predict the best number of Map waves for any given MapReduce application:
– Fix all the model's factors except the number of Map waves
– From the inputs (single-Map-wave time, MST, shuffle data, shuffle rate, reduce time and the initial number of Map slots), compute Total MST, HST, EST and the Runtime for a range of Map-wave counts
– Select the wave count with the minimum Runtime (the "sweet spot")
16
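The sweep itself is simple enough to sketch in a few lines. The runtime model inside the loop is an assumed form (total Map work fixed, one setup cost per wave, and only the shuffle time not hidden under the Map phase extending the runtime), not the paper's published equations, and all input values are hypothetical:

```python
def predict_best_waves(t, mst, shuffle_data, shuffle_rate,
                       reduce_time, max_waves=32):
    """Sweep the number of Map waves and return (waves, runtime) for the
    wave count minimizing the modeled job runtime (the 'sweet spot')."""
    best = None
    for w in range(1, max_waves + 1):
        total_mst = w * mst                    # one setup cost per wave
        hst = t - t / w                        # shuffle hidden under Map phase
        est = max(0.0, shuffle_data / shuffle_rate - hst)  # exposed shuffle
        runtime = t + total_mst + est + reduce_time
        if best is None or runtime < best[1]:
            best = (w, runtime)
    return best

# hypothetical job: 120 s of Map work, 2 s setup/wave, 2000 units of
# shuffle data at 10 units/s, 30 s of reduce work
waves, runtime = predict_best_waves(t=120.0, mst=2.0, shuffle_data=2_000.0,
                                    shuffle_rate=10.0, reduce_time=30.0)
```

With these numbers the sweep trades growing setup cost against shrinking exposed shuffle time and settles on an intermediate wave count rather than either extreme.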


Quantitative Methodology
We evaluate MC² on:
 A private cloud with 14 machines
 Amazon EC2 with 20 large instances
We use Apache Hadoop and various benchmarks with different dataset sizes:

Benchmark      Dataset Size (Private and Public)
Sobel          4.3GB and 8.7GB
WordCount-CE   28GB and 20GB
K-Means        5.5GB and 11.1GB
Sort           28GB and 20GB
WordCount-CD   14GB and 20GB
18

Results: WordCount-CE 19

Results: K-Means 20

Results: Sort 21

Results: WordCount-CD 22

Results: Sobel 23

MC² Results: Summary
Runtime speedups provided by MC² versus default Hadoop:

Benchmark      Private Cloud   Amazon EC2
WordCount-CE   2.1X            1.2X
K-Means        1.34X           1.13X
Sort           1.07X           1.1X
WordCount-CD   1.1X            2.2X
Sobel          1.43X           1.04X

 MC² correctly predicts the best numbers of Map waves for WordCount-CE, K-Means, Sort, WordCount-CD and Sobel on our private cloud and on Amazon EC2
 Even when a misprediction occurs, the predicted sweet spot is typically very close to the observed minimum
24


Concluding Remarks
 We observed a strong dependency between Map concurrency and MapReduce performance
 We realized that a good Map-concurrency configuration can be determined by simply leveraging two main MapReduce characteristics: data shuffling and Map Setup Time (MST)
 We developed a mathematical model that exploits data shuffling and MST, and built MC², which uses the model to predict the best Map concurrency for any given MapReduce application
 MC² works successfully on a private cloud and on Amazon EC2
26

Thank You! Questions? 27