Towards Automatic Optimization of MapReduce Programs (Position Paper)
Shivnath Babu, Duke University
Roadmap
- A call to action to improve automatic optimization techniques in MapReduce frameworks
- Challenges & promising directions
[Stack pictured: JAQL, Pig, Hive, ... running over Hadoop / HDFS]
Lifecycle of a MapReduce Job
The user supplies a Map function and a Reduce function, then runs the program as a MapReduce job.
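To make the two user-supplied functions concrete, here is a minimal, framework-free sketch of the classic word-count job, assuming the usual (key, value) contract; the `run_job` driver only simulates what the framework's map, shuffle/sort, and reduce phases do.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(_, line):
    """Map: emit (word, 1) for every word in an input line."""
    for word in line.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: sum the counts for one word."""
    yield (word, sum(counts))

def run_job(lines):
    """Simulate the framework: map, shuffle (sort by key), then reduce."""
    mapped = [kv for i, line in enumerate(lines) for kv in map_fn(i, line)]
    mapped.sort(key=itemgetter(0))  # the shuffle/sort phase groups keys
    out = {}
    for word, group in groupby(mapped, key=itemgetter(0)):
        for k, v in reduce_fn(word, (c for _, c in group)):
            out[k] = v
    return out

print(run_job(["the quick fox", "the fox"]))
# {'fox': 2, 'quick': 1, 'the': 2}
```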
Lifecycle of a MapReduce Job
[Timeline figure: input splits feed Map Wave 1 and Map Wave 2, followed by Reduce Wave 1 and Reduce Wave 2.]
How are the number of splits, the number of map and reduce tasks, the memory allocation to tasks, etc., determined?
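One of these quantities is easy to illustrate: in Hadoop, `FileInputFormat` computes the split size as max(minSize, min(maxSize, blockSize)), and each split becomes one map task. A hedged sketch of that rule (defaults here are illustrative, not Hadoop's exact configuration values):

```python
import math

def split_size(block_size, min_size=1, max_size=float("inf")):
    # FileInputFormat's rule: max(minSize, min(maxSize, blockSize))
    return max(min_size, min(max_size, block_size))

def num_map_tasks(file_size, block_size, min_size=1, max_size=float("inf")):
    # One map task per input split
    return math.ceil(file_size / split_size(block_size, min_size, max_size))

# A 50 GB input with 64 MB HDFS blocks yields 800 splits, hence 800 map tasks
print(num_map_tasks(50 * 1024**3, 64 * 1024**2))  # 800
```

With only 64 concurrent map slots (as in the experiment later in the deck), those 800 tasks run in multiple map waves, which is exactly what the timeline figure depicts.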
Job Configuration Parameters
- 190+ parameters in Hadoop
- Set manually, or defaults are used
- Are defaults or rules-of-thumb good enough?
Experiments
On EC2 and local clusters. [Charts: running times in seconds and minutes across parameter settings.]
Illustrative Result: 50GB Terasort
17-node cluster, 64+32 concurrent map+reduce slots

Settings compared (running times were shown as a bar chart in the original):

mapred.reduce.tasks | io.sort.factor | io.sort.record.percent
10                  | ?              | 0.15
10                  | 500            | 0.15
28                  | 10             | 0.15   <- based on popular rule-of-thumb
300                 | 10             | 0.15
300                 | 500            | 0.15

Findings:
- Performance at default and rule-of-thumb settings can be poor
- Cross-parameter interactions are significant
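The rule-of-thumb setting in this experiment matches a commonly cited Hadoop guideline: choose slightly fewer reduce tasks than available reduce slots, so that all reduces finish in a single wave. A hedged sketch (the 0.9 factor is an assumption chosen to reproduce the slide's value of 28; published guidance varies, e.g. 0.95):

```python
def rule_of_thumb_reduce_tasks(reduce_slots, factor=0.9):
    """Popular rule of thumb: set mapred.reduce.tasks slightly below the
    cluster's total reduce slots, so all reduces run in one wave.
    The factor here (0.9) is illustrative."""
    return int(reduce_slots * factor)

# The 17-node cluster above has 32 concurrent reduce slots
print(rule_of_thumb_reduce_tasks(32))  # 28
```

The slide's point is that even this "reasonable" setting can be far from optimal once interactions with parameters like io.sort.factor are taken into account.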
Problem Space
- Space of execution choices: job configuration parameters; declarative HiveQL/Pig operations; multi-job workflows
- Performance objectives: cost in a pay-as-you-go environment; energy considerations
- Complexity grows along both axes
- Current approaches are predominantly manual, post-mortem analysis. Is this where we want to be?
Can DB Query Optimization Technology Help?
[Diagram: Query -> Optimizer (Enumerate, Cost, Search) -> good plan -> Database Execution Engine -> Results; analogously, MapReduce job -> good setting of parameters -> Hadoop -> Results.]

But:
- MapReduce jobs are not declarative
- No schema about the data
- Impact of concurrent jobs & scheduling?
- Space of parameters is huge

Can we:
- Borrow/adapt ideas from the wide spectrum of query optimizers developed over the years, or innovate?
- Exploit design & usage properties of MapReduce frameworks
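The optimizer loop in the diagram (enumerate, cost, search) transfers directly to configuration tuning. A minimal sketch, with an exhaustive search over a tiny grid and a purely hypothetical toy cost model (a real cost model for Hadoop jobs is precisely the open problem the slide raises):

```python
from itertools import product

def enumerate_cost_search(param_grid, cost_model):
    """The classic optimizer loop: enumerate candidate settings, cost
    each one with a model, and search (here, exhaustively) for the
    cheapest setting."""
    best_setting, best_cost = None, float("inf")
    for values in product(*param_grid.values()):
        setting = dict(zip(param_grid.keys(), values))
        c = cost_model(setting)
        if c < best_cost:
            best_setting, best_cost = setting, c
    return best_setting, best_cost

# Hypothetical toy cost model, for illustration only.
def toy_cost(s):
    return abs(s["mapred.reduce.tasks"] - 28) + s["io.sort.factor"] * 0.01

grid = {"mapred.reduce.tasks": [10, 28, 300], "io.sort.factor": [10, 500]}
print(enumerate_cost_search(grid, toy_cost))
```

Exhaustive enumeration is only feasible because the grid is tiny; with 190+ interacting parameters, both the search strategy and the cost model become hard problems, which is the "But:" list above in code form.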
Spectrum of Query Optimizers
Conventional optimizers range from rule-based to those using cost models + statistics about the data; beyond them lie Learning Optimizers (learn from execution & adapt) and Tuning Optimizers (proactively try different plans).

AT's Conjecture: Rule-based Optimizers (RBOs) will trump Cost-based Optimizers (CBOs) in MapReduce frameworks.
Insight: Predictability(RBO) >> Predictability(CBO)
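The predictability argument is easiest to see in code: an RBO applies fixed if-then rules, so the same job description always yields the same configuration, whereas a CBO's output shifts with estimated statistics. A hedged sketch of a rule-based tuner (the rules and thresholds are illustrative, not Hadoop's):

```python
def rbo(job):
    """Rule-based configuration: fixed, deterministic if-then rules.
    Rules below are illustrative examples, not real Hadoop defaults."""
    conf = {}
    # Rule 1: aim for a single reduce wave -> reduces slightly under
    # the slot count (0.9 factor is an assumption).
    conf["mapred.reduce.tasks"] = int(job["reduce_slots"] * 0.9)
    # Rule 2: large inputs get a bigger merge factor for the sort phase.
    conf["io.sort.factor"] = 100 if job["input_gb"] > 10 else 10
    return conf

print(rbo({"reduce_slots": 32, "input_gb": 50}))
# {'mapred.reduce.tasks': 28, 'io.sort.factor': 100}
```

Given the same job description, this always returns the same answer; a CBO would additionally depend on data statistics that, per the earlier slide, MapReduce frameworks typically lack.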
Spectrum of Query Optimizers (contd.)
Learning Optimizers and Tuning Optimizers can exploit usage & design properties of MapReduce frameworks:
- High ratio of repeated jobs to new jobs
- Schema can be learned (e.g., from Pig scripts)
- Common sort-partition-merge skeleton
- Mechanisms for adaptation stemming from design for robustness (speculative execution, storing intermediate results)
- Fine-grained and pluggable scheduler
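The first property, a high ratio of repeated jobs to new jobs, suggests one simple learning-optimizer design: remember the best-observed configuration per job signature and reuse it, falling back to a default for unseen jobs. A minimal sketch (class and method names are illustrative):

```python
class LearningTuner:
    """Toy learning optimizer: learns from executions of repeated jobs."""

    def __init__(self, default_conf):
        self.default_conf = default_conf
        self.history = {}  # job signature -> (best conf, best runtime)

    def suggest(self, signature):
        """Reuse the best-observed conf for this job, else the default."""
        entry = self.history.get(signature)
        return dict(entry[0]) if entry else dict(self.default_conf)

    def observe(self, signature, conf, runtime):
        """Record a completed run; keep the fastest conf seen so far."""
        best = self.history.get(signature)
        if best is None or runtime < best[1]:
            self.history[signature] = (conf, runtime)

tuner = LearningTuner({"mapred.reduce.tasks": 28})
tuner.observe("terasort-50GB", {"mapred.reduce.tasks": 300}, 1200.0)
tuner.observe("terasort-50GB", {"mapred.reduce.tasks": 28}, 2100.0)
print(tuner.suggest("terasort-50GB"))  # {'mapred.reduce.tasks': 300}
```

A tuning optimizer would go one step further and proactively generate the runs to observe, e.g. by scheduling a few candidate configurations for a recurring job.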
Summary
A call to action to improve automatic optimization techniques in MapReduce frameworks:
- Automated generation of optimized Hadoop configuration parameter settings, HiveQL/Pig/JAQL query plans, etc.
- Rich history to learn from
- MapReduce execution creates unique opportunities/challenges