Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University.

Presentation transcript:

Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu Duke University

Roadmap
– Call to action to improve automatic optimization techniques in MapReduce frameworks
– Challenges & promising directions
(Diagram labels: Hadoop, HDFS, Pig, Hive, JAQL, …)

Lifecycle of a MapReduce Job
(Code figure labels: Map function, Reduce function)
Run this program as a MapReduce job

Lifecycle of a MapReduce Job
(Timeline figure labels: Input Splits, Map Wave 1, Map Wave 2, Reduce Wave 1, Reduce Wave 2; axis: Time)
How are the number of splits, the number of map and reduce tasks, the memory allocation to tasks, etc., determined?
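As a rough illustration of the question on this slide, the arithmetic below sketches how an input size, a split size, and the number of map slots translate into a number of map tasks and map waves. The function name and all numeric values (64 MB splits, 16 worker nodes, 2 map slots per node) are assumptions chosen for illustration, not figures from the slides, and real Hadoop split computation has more inputs than this.

```python
def map_waves(input_bytes, split_bytes, nodes, map_slots_per_node):
    """Return (number of input splits, number of map waves).

    Simplified model: one map task per split, and every slot runs
    exactly one task per wave.
    """
    splits = -(-input_bytes // split_bytes)   # integer ceiling division
    slots = nodes * map_slots_per_node        # map tasks that can run at once
    waves = -(-splits // slots)
    return splits, waves


GB = 1024 ** 3
MB = 1024 ** 2

# Hypothetical cluster and job: 50 GB of input, 64 MB splits.
splits, waves = map_waves(50 * GB, 64 * MB, nodes=16, map_slots_per_node=2)
print(splits, waves)  # 800 25
```

Halving the split size in this model doubles the number of map tasks and waves, which is one reason such settings cannot be chosen independently of the cluster size.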

Job Configuration Parameters
– 190+ parameters in Hadoop
– Set manually, or defaults are used
– Are defaults or rules of thumb good enough?

Experiments
– On EC2 and local clusters
(Plots: running time in seconds and in minutes)

Illustrative Result: 50GB Terasort
– 17-node cluster, concurrent map+reduce slots
– Performance at default and rule-of-thumb settings can be poor
– Cross-parameter interactions are significant
(Table columns: mapred.reduce.tasks, io.sort.factor, io.sort.record.percent, Running time; annotation: "Based on popular rule-of-thumb")
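Because the parameters interact, they cannot safely be tuned one at a time, and the number of joint settings grows multiplicatively. The sketch below enumerates a small grid over the three parameters named on this slide. The candidate values are hypothetical (the slides do not list the values that were tried), and `enumerate_settings` is an illustrative helper, not a Hadoop API.

```python
from itertools import product

# Hypothetical candidate values for each parameter; assumptions only.
candidates = {
    "mapred.reduce.tasks": [17, 34, 68],
    "io.sort.factor": [10, 100, 500],
    "io.sort.record.percent": [0.05, 0.15, 0.25],
}


def enumerate_settings(space):
    """Yield every joint setting (one value per parameter) as a dict."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


settings = list(enumerate_settings(candidates))
print(len(settings))  # 27
```

Three values for each of three parameters already give 27 joint settings; at three values each, every additional parameter triples the count, which is why exhaustive trials over 190+ parameters are not practical.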

Problem Space
Current approaches:
– Predominantly manual
– Post-mortem analysis
(Diagram labels: Job configuration parameters; Declarative HiveQL/Pig operations; Multi-job workflows; Performance objectives; Cost in pay-as-you-go environment; Energy considerations; Complexity; Space of execution choices)
Is this where we want to be?

Can DB Query Optimization Technology Help?
(Diagram labels: Query; Optimizer: Enumerate, Cost, Search; Good plan; Database Execution Engine; Results; MapReduce job; Good setting of parameters; Hadoop)
But:
– MapReduce jobs are not declarative
– No schema about the data
– Impact of concurrent jobs & scheduling?
– Space of parameters is huge
Can we:
– Borrow/adapt ideas from the wide spectrum of query optimizers that have been developed over the years
Or innovate!
– Exploit design & usage properties of MapReduce frameworks
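The enumerate, cost, and search steps named in the optimizer diagram can be sketched for parameter settings as follows. This is a minimal sketch under loud assumptions: `toy_cost` is an invented formula standing in for a cost model (it is not a model of Hadoop), the parameter ranges are arbitrary, and random sampling is just one simple search strategy for a space too large to enumerate fully.

```python
import random


def toy_cost(setting):
    """Hypothetical stand-in for a cost model; NOT a real Hadoop model.

    The last term couples the two parameters, so the best value of one
    depends on the value of the other (a cross-parameter interaction).
    """
    r = setting["mapred.reduce.tasks"]
    f = setting["io.sort.factor"]
    return 1000.0 / r + 2.0 * r + 500.0 / f + 0.01 * r * f


def enumerate_plans(rng, n):
    """Enumerate: sample n candidate settings from assumed ranges."""
    for _ in range(n):
        yield {
            "mapred.reduce.tasks": rng.randint(1, 200),
            "io.sort.factor": rng.randint(2, 500),
        }


def optimize(cost, n=200, seed=0):
    """Search: return the sampled setting with the lowest estimated cost."""
    rng = random.Random(seed)
    return min(enumerate_plans(rng, n), key=cost)


best = optimize(toy_cost)
print(best, round(toy_cost(best), 2))
```

The structure mirrors a query optimizer only loosely: a database optimizer costs alternative plans for a declarative query, whereas here the "plans" are configuration settings for an opaque job.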

Spectrum of Query Optimizers
– Conventional Optimizers: rule-based; cost models + statistics about data
– Learning Optimizers (learn from execution & adapt)
– Tuning Optimizers (proactively try different plans)
AT’s Conjecture: Rule-based Optimizers (RBOs) will trump Cost-based Optimizers (CBOs) in MapReduce frameworks
Insight: Predictability(RBO) >> Predictability(CBO)

Spectrum of Query Optimizers
– Conventional Optimizers: rule-based; cost models + statistics about data
– Learning Optimizers (learn from execution & adapt)
– Tuning Optimizers (proactively try different plans)
Exploit usage & design properties of MapReduce frameworks:
– High ratio of repeated jobs to new jobs
– Schema can be learned (e.g., Pig scripts)
– Common sort-partition-merge skeleton
– Mechanisms for adaptation stemming from design for robustness (speculative execution, storing intermediate results)
– Fine-grained and pluggable scheduler
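The first property above, a high ratio of repeated jobs to new jobs, suggests one very simple form of learning optimizer: remember how each recurring job performed under each setting and reuse the best one seen so far. The sketch below assumes a hypothetical `JobHistory` class, a job "signature" string chosen by the caller, and invented running times; none of these come from the slides.

```python
from collections import defaultdict


class JobHistory:
    """Remembers observed running times per job signature (hypothetical)."""

    def __init__(self):
        # signature -> list of (running time in seconds, setting used)
        self._runs = defaultdict(list)

    def record(self, signature, setting, seconds):
        """Store the outcome of one execution of a job."""
        self._runs[signature].append((seconds, dict(setting)))

    def best_setting(self, signature, default):
        """Return the fastest setting seen for this job, else the default."""
        runs = self._runs.get(signature)
        if not runs:
            return dict(default)
        return min(runs, key=lambda run: run[0])[1]


history = JobHistory()
default = {"mapred.reduce.tasks": 1}

# Invented observations for a recurring job.
history.record("nightly-sort", {"mapred.reduce.tasks": 1}, seconds=900.0)
history.record("nightly-sort", {"mapred.reduce.tasks": 34}, seconds=310.0)

print(history.best_setting("nightly-sort", default))  # {'mapred.reduce.tasks': 34}
print(history.best_setting("new-job", default))       # {'mapred.reduce.tasks': 1}
```

A tuning optimizer in the sense of the slide would go one step further and deliberately run a repeated job under a setting it has not tried yet, then record the result in the same history.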

Summary
Call to action to improve automatic optimization techniques in MapReduce frameworks
– Automated generation of optimized Hadoop configuration parameter settings, HiveQL/Pig/JAQL query plans, etc.
– Rich history to learn from
– MapReduce execution creates unique opportunities/challenges