Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu, Duke University

1 Towards Automatic Optimization of MapReduce Programs (Position Paper) Shivnath Babu, Duke University

2 Roadmap A call to action to improve automatic optimization techniques in MapReduce frameworks; challenges & promising directions. [Diagram: JAQL, Pig, and Hive running atop Hadoop and HDFS]

3 Lifecycle of a MapReduce Job [Figure: an example program, with its map function and reduce function highlighted, to be run as a MapReduce job]
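The transcript names only the map and reduce functions, so as a concrete stand-in for the program on the slide, here is a minimal, self-contained Hadoop job: the canonical WordCount, written against the Hadoop 2.x-style API (class names and paths are illustrative, not taken from the talk).

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
      // Map function: emit (word, 1) for every word in the input split.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            ctx.write(word, ONE);
          }
        }
      }

      // Reduce function: sum the counts emitted for each word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) sum += v.get();
          ctx.write(key, new IntWritable(sum));
        }
      }

      // Run this program as a MapReduce job.
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.addOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }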

5 Lifecycle of a MapReduce Job [Diagram: input splits feed Map Wave 1 and Map Wave 2, followed by Reduce Wave 1 and Reduce Wave 2, laid out over time] How are the number of splits, the number of map and reduce tasks, the memory allocation to tasks, etc., determined?
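On the first part of that question: in classic Hadoop, the number of splits (and hence map tasks) follows from the input size, the requested map count, the minimum split size, and the HDFS block size, while the reduce-task count is simply whatever mapred.reduce.tasks says. A back-of-the-envelope sketch of the split-size rule, with illustrative constants (not the talk's settings):

    public class SplitSizing {
      // Mirrors the split-size rule used by Hadoop's classic FileInputFormat:
      // splitSize = max(minSize, min(goalSize, blockSize)).
      static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
      }

      public static void main(String[] args) {
        long totalSize = 50L << 30;      // 50 GB of input
        long blockSize = 64L << 20;      // 64 MB HDFS block (old default)
        long minSize = 1;                // mapred.min.split.size default
        long requestedMaps = 100;        // hint via mapred.map.tasks (illustrative)
        long goalSize = totalSize / requestedMaps;
        long splitSize = computeSplitSize(goalSize, minSize, blockSize);
        // One map task per split; here the block size caps the split size.
        System.out.println("split size = " + splitSize + " bytes, ~"
            + (totalSize + splitSize - 1) / splitSize + " map tasks");
      }
    }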

6 Job Configuration Parameters Hadoop has 190+ configuration parameters, which are set manually or left at their defaults. Are defaults or rules-of-thumb good enough?
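For concreteness, parameters can be set per job in code (overriding the defaults shipped in the *-default.xml files) or on the command line via -D. A minimal sketch, using parameter names that appear in the Terasort experiment later in the talk; the values here are illustrative:

    import org.apache.hadoop.conf.Configuration;

    public class ShowConfig {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Explicit settings override defaults loaded from core/mapred-default.xml.
        conf.setInt("mapred.reduce.tasks", 28);          // number of reduce tasks
        conf.setInt("io.sort.factor", 10);               // streams merged at once during sorts
        conf.setFloat("io.sort.record.percent", 0.15f);  // share of sort buffer for record metadata
        // Equivalent command-line form: hadoop jar job.jar MyJob -Dio.sort.factor=10 ...
        System.out.println("io.sort.factor = " + conf.get("io.sort.factor"));
      }
    }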

7 Experiments On EC2 and local clusters. [Charts: running times, in seconds and minutes, under different configuration settings]

8 Illustrative Result: 50GB Terasort (17-node cluster, 64+32 concurrent map+reduce slots)
Performance at default and rule-of-thumb settings can be poor, and cross-parameter interactions are significant. Settings compared (running times appeared as a bar chart on the slide):

    mapred.reduce.tasks  io.sort.factor  io.sort.record.percent
    10                   10              0.15
    10                   500             0.15
    28                   10              0.15   <- based on the popular rule-of-thumb (about 0.9 × the cluster's 32 reduce slots)
    300                  10              0.15
    300                  500             0.15

9 Problem Space [Chart: complexity of performance objectives plotted against the space of execution choices]
Space of execution choices: job configuration parameters, declarative HiveQL/Pig operations, multi-job workflows
Complexity of performance objectives: cost in pay-as-you-go environments, energy considerations
Current approaches: predominantly manual, post-mortem analysis
Is this where we want to be?

10 Can DB Query Optimization Technology Help?
[Diagram: in a database system, a query goes to an optimizer (enumerate, cost, search) that picks a good plan for the execution engine, which produces results; in Hadoop, a MapReduce job needs a good setting of parameters before execution produces results]
But:
– MapReduce jobs are not declarative
– No schema about the data
– Impact of concurrent jobs & scheduling?
– Space of parameters is huge
Can we:
– Borrow/adapt ideas from the wide spectrum of query optimizers that have been developed over the years
– Or innovate: exploit design & usage properties of MapReduce frameworks
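The enumerate/cost/search loop in the diagram carries over directly if query plans are replaced by parameter settings. A hedged sketch of that loop; the cost model here is a made-up stand-in, since obtaining a trustworthy one is precisely the hard part flagged above:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ParameterOptimizerSketch {
      // Hypothetical cost model: predicts running time for a candidate setting.
      interface CostModel { double predictSeconds(Map<String, Integer> setting); }

      // Enumerate the candidates, cost each one, and search for the minimum.
      static Map<String, Integer> pickBest(List<Map<String, Integer>> candidates, CostModel model) {
        Map<String, Integer> best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (Map<String, Integer> s : candidates) {            // enumerate
          double cost = model.predictSeconds(s);               // cost
          if (cost < bestCost) { bestCost = cost; best = s; }  // search
        }
        return best;
      }

      public static void main(String[] args) {
        List<Map<String, Integer>> candidates = new ArrayList<>();
        for (int reducers : new int[] {10, 28, 300}) {
          for (int sortFactor : new int[] {10, 500}) {
            Map<String, Integer> s = new HashMap<>();
            s.put("mapred.reduce.tasks", reducers);
            s.put("io.sort.factor", sortFactor);
            candidates.add(s);
          }
        }
        // Toy cost model, for illustration only; not a real Hadoop performance model.
        CostModel toy = s -> 600.0 + 20000.0 / s.get("mapred.reduce.tasks")
            + 3000.0 / s.get("io.sort.factor");
        System.out.println("predicted best setting: " + pickBest(candidates, toy));
      }
    }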

11 Spectrum of Query Optimizers Conventional optimizers range from rule-based to those using cost models + statistics about the data. AT's Conjecture: Rule-based Optimizers (RBOs) will trump Cost-based Optimizers (CBOs) in MapReduce frameworks. Insight: Predictability(RBO) >> Predictability(CBO)

12 Spectrum of Query Optimizers The conventional spectrum (rule-based through cost models + statistics about the data) extends to Learning Optimizers (learn from execution & adapt) and Tuning Optimizers (proactively try different plans). AT's Conjecture: RBOs will trump CBOs in MapReduce frameworks. Insight: Predictability(RBO) >> Predictability(CBO)

13 Spectrum of Query Optimizers Learning and Tuning Optimizers can exploit usage & design properties of MapReduce frameworks (a sketch of the first property follows):
– High ratio of repeated jobs to new jobs
– Schema can be learned (e.g., from Pig scripts)
– Common sort-partition-merge skeleton
– Mechanisms for adaptation stemming from the design for robustness (speculative execution, storing intermediate results)
– Fine-grained and pluggable scheduler
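A minimal sketch of how a learning optimizer could exploit the high ratio of repeated jobs: remember measured runtimes per (job, configuration) pair and reuse the best known configuration when a job recurs. All names and numbers here are hypothetical:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;

    public class LearningOptimizerSketch {
      // Observed history: job signature -> (configuration -> best measured seconds).
      private final Map<String, Map<String, Double>> history = new HashMap<>();

      // Record a finished run; keep the best (minimum) time seen per configuration.
      void recordRun(String jobSignature, String config, double seconds) {
        history.computeIfAbsent(jobSignature, k -> new HashMap<>())
               .merge(config, seconds, Math::min);
      }

      // For a repeated job, return the best known configuration; for a new job,
      // return null and fall back to rules-of-thumb (and keep exploring).
      String bestKnownConfig(String jobSignature) {
        Map<String, Double> runs = history.get(jobSignature);
        if (runs == null || runs.isEmpty()) return null;
        return Collections.min(runs.entrySet(), Map.Entry.comparingByValue()).getKey();
      }

      public static void main(String[] args) {
        LearningOptimizerSketch opt = new LearningOptimizerSketch();
        opt.recordRun("nightly-sort", "reduce=28,sort.factor=10", 2100);   // illustrative numbers
        opt.recordRun("nightly-sort", "reduce=300,sort.factor=500", 900);
        System.out.println(opt.bestKnownConfig("nightly-sort"));
      }
    }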

14 Summary A call to action to improve automatic optimization techniques in MapReduce frameworks:
– Automated generation of optimized Hadoop configuration parameter settings, HiveQL/Pig/JAQL query plans, etc.
– A rich history to learn from
– MapReduce execution creates unique opportunities/challenges

