Profiling, What-if Analysis and Cost-based Optimization of MapReduce Programs Oct 7th, 2013 Database Lab. Wonseok Choi.



2 The night before the presentation

3 If I can't present this time, it's over!!!! There's no way I'll pass the course!!!! + the graduation exam!!

4 Can I finish preparing this presentation in time without dying?

5 Contents 1. Introduction 2. Profiler 3. What-if Engine 4. Cost-based Optimizer 5. Experimental Evaluation 6. Conclusion

6 Introduction MapReduce has emerged as a viable competitor to database systems in big data analytics. Three components: Profiler, What-if Engine, Cost-based Optimizer  Profiler: collects detailed statistical information from unmodified MapReduce programs  What-if Engine: performs fine-grained cost estimation  Cost-based Optimizer: optimizes configuration parameter settings

7 Introduction A MapReduce job J is represented as J = ⟨p, d, r, c⟩  p: MapReduce program  d: input data, processed by the two functions map(k1, v1) and reduce(k2, list(v2))  r: cluster resources  c: configuration parameter settings

8 Introduction Configuration parameter settings include:  The number of map tasks  The number of reduce tasks  The amount of memory  The settings for multiphase external sorting  Whether the output data from the map (reduce) tasks should be compressed before being written to disk  Whether a program-specified Combiner function should be used to pre-aggregate map outputs before their transfer to reduce tasks
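For illustration only, the settings above can be sketched as a classic Hadoop 1.x configuration; the parameter names are the era-appropriate Hadoop ones, but the values are arbitrary examples, not recommendations.

```python
# Illustrative only: a configuration vector c expressed as classic
# Hadoop 1.x parameters. Values are arbitrary examples.
hadoop_conf = {
    "mapred.map.tasks": 40,               # number of map tasks (a hint in Hadoop)
    "mapred.reduce.tasks": 20,            # number of reduce tasks
    "mapred.child.java.opts": "-Xmx400m", # memory per task JVM
    "io.sort.mb": 200,                    # buffer size for map-side sorting
    "io.sort.factor": 10,                 # streams merged at once in external sort
    "mapred.compress.map.output": True,   # compress map output before disk write
}

# A Combiner is set in code rather than configuration, e.g.
# job.setCombinerClass(MyCombiner.class) in the Java API.
```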

9 Introduction

10 Introduction

11 Introduction Cost-based optimization to select configuration parameter settings automatically  perf = F(p, d, r, c)  perf is some performance metric of interest for jobs  Optimizing the performance of program p for given input data d and cluster resources r requires finding configuration parameter settings that give near-optimal values of perf.
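A minimal sketch of this formulation, with the What-if Engine stood in by a toy cost function (`whatif_cost` and the `"reducers"` knob are illustrative names, not the real system's API):

```python
# Toy stand-in for the What-if Engine: estimated cost as a function of the
# configuration. Here the cost is minimized at 16 reducers.
def whatif_cost(program, data, resources, conf):
    return abs(conf["reducers"] - 16) + 10

def optimize(program, data, resources, candidates):
    """Pick the candidate configuration with the lowest estimated cost."""
    return min(candidates, key=lambda c: whatif_cost(program, data, resources, c))

candidates = [{"reducers": r} for r in (4, 8, 16, 32)]
best = optimize("wordcount", "logs.txt", "20-node-cluster", candidates)
# best == {"reducers": 16}
```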

12 Introduction MapReduce program optimization poses new challenges compared to conventional database query optimization:  Black-box map and reduce functions  Lack of schema and statistics about the input data  Differences in plan spaces The system consists of three components:  Profiler  What-if Engine  Cost-based Optimizer

13 Profiler Phases of Map Task Execution  Read, Map, Collect, Spill, Merge Phases of Reduce Task Execution  Shuffle, Merge, Reduce, Write

14 Profiler Job Profile  A MapReduce job profile is a vector in which each field captures some unique aspect of dataflow or cost during job execution, at the task level or the phase level within tasks.  Dataflow fields  Cost fields  Dataflow Statistics fields  Cost Statistics fields
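One way to picture the profile vector; the field names below are illustrative assumptions, not the paper's exact schema.

```python
from dataclasses import dataclass, field

# Illustrative sketch of a job profile as four groups of fields, mirroring
# the slide above. Field names are assumptions, not the paper's schema.
@dataclass
class JobProfile:
    dataflow: dict = field(default_factory=dict)        # e.g. map output records
    costs: dict = field(default_factory=dict)           # e.g. time spent per phase
    dataflow_stats: dict = field(default_factory=dict)  # e.g. map selectivity
    cost_stats: dict = field(default_factory=dict)      # e.g. I/O cost per byte

profile = JobProfile(
    dataflow={"map_output_records": 1_000_000},
    dataflow_stats={"map_selectivity": 0.8},
)
```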

15 Profiler Using Profiles to Analyze Job Behavior

16 Profiler Generating Profiles via Measurement  Job profiles are generated in two distinct ways: by measurement (the Profiler) and by estimation (the What-if Engine)  Monitoring through dynamic instrumentation  From raw monitoring data to profile fields  Task-level sampling to generate approximate profiles
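Task-level sampling can be sketched as profiling only a random fraction of a job's tasks; the helper below is hypothetical, not the Profiler's actual interface.

```python
import random

# Hypothetical helper (not the Profiler's real interface): profile only a
# random fraction of a job's tasks to cut instrumentation overhead, at the
# price of an approximate profile.
def sample_tasks(task_ids, fraction, seed=0):
    k = max(1, int(len(task_ids) * fraction))
    return random.Random(seed).sample(task_ids, k)

profiled = sample_tasks(list(range(100)), fraction=0.1)
# len(profiled) == 10
```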

17 What-if Engine A what-if question has the following form:  Given the profile of a job j = ⟨p, d1, r1, c1⟩ that runs a MapReduce program p over input data d1 and cluster resources r1 using configuration c1, what will the performance of program p be if p is run over input data d2 and cluster resources r2 using configuration c2? That is, how will job j' = ⟨p, d2, r2, c2⟩ perform? The What-if Engine executes the following two steps to answer a what-if question:  Estimating a virtual job profile for the hypothetical job j'  Using the virtual profile to simulate how j' will execute We will discuss these steps in turn.

18 What-if Engine Estimating the Virtual Profile  Estimating Dataflow and Cost fields  Estimating Dataflow Statistics fields  Estimating Cost Statistics fields

19 What-if Engine Estimating Dataflow and Cost fields  A detailed set of analytical (white-box) models estimates the Dataflow and Cost fields in the virtual job profile for j' Estimating Dataflow Statistics fields  Dataflow proportionality assumption Estimating Cost Statistics fields  Cluster node homogeneity assumption Simulating the Job Execution  Task Scheduler Simulator
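The dataflow proportionality assumption can be sketched as simple linear scaling of the measured dataflow fields by the input-size ratio; this is an assumed simplification, not the engine's full set of models.

```python
# Minimal sketch of the dataflow proportionality assumption: dataflow
# fields scale linearly with input size, so a virtual profile for new
# input data d2 can be estimated from the measured profile on d1.
def estimate_virtual_dataflow(measured, old_input_bytes, new_input_bytes):
    """Scale each measured dataflow field by the input-size ratio."""
    ratio = new_input_bytes / old_input_bytes
    return {name: value * ratio for name, value in measured.items()}

measured = {"map_output_records": 1_000_000, "shuffle_bytes": 50_000_000}
virtual = estimate_virtual_dataflow(measured,
                                    old_input_bytes=10 * 2**30,   # 10 GB
                                    new_input_bytes=20 * 2**30)   # 20 GB
# virtual["map_output_records"] == 2_000_000
```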

20 Cost-based Optimizer (CBO) MapReduce program optimization can be defined as:  Given a MapReduce program p to be run on input data d and cluster resources r, find the setting of configuration parameters c_opt = argmin over c ∈ S of F(p, d, r, c), where F is the cost model represented by the What-if Engine and S is the full space of configuration parameter settings. The CBO addresses this problem by making what-if calls with settings c of the configuration parameters, selected through enumeration and search over S. Once a job profile to input to the What-if Engine is available, the CBO uses a two-step process, discussed next.

21 Cost-based Optimizer (CBO) Subspace Enumeration  A straightforward approach the CBO can take is to apply enumeration and search techniques to the full space of parameter settings S  More efficient search techniques can be developed if the individual parameters in c can be grouped into clusters  Equation 2 states that the globally-optimal setting c_opt can be found using a divide-and-conquer approach by: breaking the higher-dimensional space S into the lower-dimensional subspaces S(i); considering an independent optimization problem in each smaller subspace; composing the optimal parameter settings found per subspace to give the setting c_opt
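The divide-and-conquer enumeration can be sketched as optimizing each subspace independently and composing the results; the cost functions below are toy stand-ins for per-subspace what-if calls, and the parameter names are classic Hadoop ones used purely for illustration.

```python
# Toy divide-and-conquer enumeration over two parameter subspaces.
def optimize_subspace(cost, subspace):
    """Exhaustively search one low-dimensional subspace."""
    return min(subspace, key=cost)

def compose(optima):
    """Merge per-subspace optimal settings into one configuration c_opt."""
    merged = {}
    for setting in optima:
        merged.update(setting)
    return merged

map_subspace = [{"io.sort.mb": m} for m in (100, 200, 400)]
reduce_subspace = [{"mapred.reduce.tasks": r} for r in (8, 16, 32)]

c_opt = compose([
    optimize_subspace(lambda c: abs(c["io.sort.mb"] - 200), map_subspace),
    optimize_subspace(lambda c: abs(c["mapred.reduce.tasks"] - 16), reduce_subspace),
])
# c_opt == {"io.sort.mb": 200, "mapred.reduce.tasks": 16}
```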

22 Cost-based Optimizer (CBO) Search Strategy within a Subspace  Searching within each enumerated subspace to find the optimal configuration in that subspace  Gridding (Equispaced or Random)  Recursive Random Search (RRS) RRS provides probabilistic guarantees on how close the setting it finds is to the optimal setting RRS is fairly robust to deviations of estimated costs from actual performance RRS scales to a large number of dimensions
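A simplified sketch of the RRS idea (not the published algorithm in full): sample a region uniformly at random, then recursively re-center and shrink the region around the best point found so far.

```python
import random

# Simplified sketch of Recursive Random Search in one dimension: sample
# uniformly, then recursively shrink the search region around the best point.
def recursive_random_search(cost, low, high, samples=30, rounds=5, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(rounds):
        points = [rng.uniform(low, high) for _ in range(samples)]
        if best is not None:
            points.append(best)          # never lose the best point found
        best = min(points, key=cost)
        span = (high - low) / 4          # shrink the region around the best
        low, high = best - span, best + span
    return best

# One-dimensional toy cost surface minimized at x = 16.
x = recursive_random_search(lambda v: (v - 16) ** 2, 0.0, 64.0)
# x is typically close to 16
```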

23 Cost-based Optimizer (CBO) In summary, there are two choices for subspace enumeration: Full or Clustered, which deal respectively with the full space or smaller subspaces for map and reduce tasks. There are three choices for search within a subspace: Gridding (Equispaced or Random) and RRS.

24 Experimental Evaluation

25 Experimental Evaluation

26 Experimental Evaluation

27 Experimental Evaluation

28 Experimental Evaluation

29 Experimental Evaluation

30 Discussion and Future Work Proposed a Cost-based Optimizer for simple to arbitrarily complex MapReduce programs. Several new research challenges arise when we consider the full space of optimization opportunities provided by higher-level systems. Proposed a lightweight Profiler to collect detailed statistical information from unmodified MapReduce programs. Proposed a What-if Engine for the fine-grained cost estimation needed by the Cost-based Optimizer.

31 Q & A

32 Alright! I'd say I pulled it off…