
1 Predicting Execution Bottlenecks in Map-Reduce Clusters
Edward Bortnikov, Ari Frank, Eshcar Hillel, Sriram Rao
Presenting: Alex Shraer
Yahoo! Labs

2 The Map Reduce (MR) Paradigm
- Architecture for scalable information processing
  - Simple API
  - Computation scales to Web-scale data collections
- Google MR
  - Pioneered the technology in the early 2000s
- Hadoop: open-source implementation
  - In use at Amazon, eBay, Facebook, Yahoo!, …
  - Scales to 10,000s of nodes (Hadoop 2.0)
- Many proprietary implementations
  - MR technologies at Microsoft, Yandex, …

3 Computational Model
[Diagram: mappers M1–M4 read input from the DFS and feed reducers R1–R2, which write output to the DFS]
- Synchronous execution: every reducer starts computing only after all mappers have completed
- The slowest task (the straggler) affects the job latency
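To make the latency model concrete, here is a minimal sketch with hypothetical task durations, assuming a single wave of mappers and reducers and no slot contention:

```python
# Hypothetical task durations (seconds); a single wave of tasks is assumed.
map_times = {"M1": 40, "M2": 42, "M3": 41, "M4": 95}   # M4 is the straggler
reduce_times = {"R1": 30, "R2": 33}

# Every reducer starts only after the slowest mapper finishes, so the
# straggler dominates the job latency.
job_latency = max(map_times.values()) + max(reduce_times.values())
print(job_latency)  # 128; without the straggler it would be 42 + 33 = 75
```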

4 Predicting Straggler Tasks
- Straggler tasks are an inherent bottleneck
  - They affect job latency and, to some extent, throughput
- Two approaches to tackling stragglers
  - Avoidance – reduce the probability that a straggler emerges
  - Detection – once a task goes astray, speculatively fire a duplicate task somewhere else
- This work: straggler prediction
  - Fits both the avoidance and the detection scenarios

5 Background
- Detection and speculative execution
  - First implemented in Google MR (OSDI '04)
  - Hadoop employs a crude detection heuristic
  - The LATE scheduler (OSDI '08) addresses heterogeneous hardware; evaluated only at small scale
  - Microsoft MR (Mantri project, OSDI '10)
- Avoidance
  - Local/rack-local data access is preferred for mappers
  - … so the network is less likely to become the bottleneck
- All of these optimizations are heuristic

6 Machine-Learned vs. Heuristic Prediction
- Heuristics are hard to get right
  - Difficult to tune for real workloads
  - Miss transient bottlenecks
- Some evidence from Hadoop grids at Yahoo!
  - Speculative scheduling is untimely and wasteful – 90% of the fired duplicates are eventually killed
  - Data-local computation amplifies contention
- Can we use the wealth of historical grid performance data to train a machine-learned bottleneck classifier?

7 Why Should Machine Learning Work?

  Recurrence (over 5 months)   Fraction of jobs   Fraction of maps   Fraction of reduces
  1                                   3.3%               0.1%              <0.1%
  2                                   2.0%               0.2%               0.1%
  3-5                                11.4%               0.3%               0.5%
  6-10                                6.2%               0.8%               0.5%
  11-20                               7.9%               1.1%               1.0%
  21-50                              10.9%               2.6%               2.7%
  51-100                             11.0%               2.9%               7.2%
  101-200                            32.9%               6.2%               9.2%
  201-500                             4.5%               9.8%               5.2%
  501-1000                            1.3%               5.6%               2.0%
  1001+                               8.6%              68.5%              71.5%

- Huge recurrence of large jobs in production grids
- 95% of mappers and reducers belong to jobs that ran 50+ times in the 5-month sample – this is the target workload

8 The Slowdown Metric
- Task slowdown factor
  - The ratio between the task's running time and the median running time among its sibling tasks in the same job
- Root causes
  - Data skew – input or output significantly exceeds the median for the job
    - Tasks with skew above 4x are rare
  - Hotspots – all the other reasons
    - Congested, misconfigured, or degraded nodes, disks, or network
    - Typically transient; the resulting slowdown can be very high
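A minimal sketch of the metric as defined above, using made-up mapper runtimes:

```python
import statistics

def slowdown_factors(runtimes_by_task):
    """Slowdown factor of each task: its running time divided by the
    median running time among the sibling tasks of the same job."""
    median = statistics.median(runtimes_by_task.values())
    return {task: t / median for task, t in runtimes_by_task.items()}

# Hypothetical mapper runtimes (seconds) for one job.
print(slowdown_factors({"m0": 58, "m1": 60, "m2": 62, "m3": 310}))
# m3 comes out at ~5x (a straggler); the others stay close to 1x
```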

9 Jobs with Mapper Slowdown > 5x
- 1% among all jobs
- 5% among jobs with 1,000 mappers or more
- 40% due to data skew (2x or above), 60% due to hotspots
(Sample of ~50K jobs)

10 Jobs with Reducer Slowdown > 5x
- 5% among all jobs
- 50% among jobs with 1,000 reducers or more
- 10% due to data skew (2x or above), 90% due to hotspots
(Sample of ~60K jobs)

11 Locality Is No Silver Bullet
- The same nodes are constantly lagging behind
  - Weaker CPUs (grid HW is heterogeneous), data hotspots, etc.
- Pushing too hard for locality amplifies the problem!
[Chart: top contributor of straggler tasks over 6 hours]

12 Slowdown Predictor
- An oracle plugged into a Map Reduce system
  - Input: node features + task features
  - Output: slowdown estimate
- Features
  - M/R metrics (job- and task-level)
  - DFS metrics (datanode-level)
  - System metrics (host-level: CPU, RAM, disk I/O, JVM, …)
  - Network traffic (host-, rack-, and cross-rack-level)
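A sketch of what the predictor's interface could look like; the feature names below are illustrative placeholders rather than the exact feature schema, and a scikit-learn-style regressor is assumed as the model:

```python
from dataclasses import dataclass

@dataclass
class TaskFeatures:                  # M/R metrics, job- and task-level
    is_map: bool
    input_bytes: float
    job_median_task_runtime: float

@dataclass
class NodeFeatures:                  # host-, datanode-, and rack-level metrics
    cpu_util: float                  # system metrics
    disk_io_wait: float
    dfs_bytes_read: float            # DFS metrics
    rack_net_mbps: float             # network traffic

def predict_slowdown(model, task: TaskFeatures, node: NodeFeatures) -> float:
    """Ask a trained regressor for a slowdown estimate of running `task` on `node`."""
    x = [float(task.is_map), task.input_bytes, task.job_median_task_runtime,
         node.cpu_util, node.disk_io_wait, node.dfs_bytes_read,
         node.rack_net_mbps]
    return float(model.predict([x])[0])
```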

13 Slowdown Prediction – Mappers
[Prediction-quality chart; some cases are mis-predicted and need improvement]

14 Slowdown Prediction – Reducers
[Prediction-quality chart; the reducer results are more dispersed than the mappers']

15 Some Conclusions
- Data skew is the most important signal, but many other signals matter as well
- Node hardware generation is a very significant signal for both mappers and reducers
  - Large grids undergo continuous hardware upgrades
- Network traffic features (intra-rack and cross-rack) are much more important for reducers than for mappers
  - How can they be collected efficiently in a real-time setting?
- Need to enhance data sampling/weighting to capture outliers better

16 Takeaways
- Slowdown prediction
  - An ML approach to straggler avoidance and detection
  - Initial evaluation showed viability
  - Need to enhance training to capture outliers better
- Challenge: runtime implementation
  - Can it blend well with the modern MR system architecture?

17 Thank you

18 Machine Learning Technique: Gradient Boosted Decision Trees (GBDT)
- Additive regression model
- Based on an ensemble of binary decision trees
- 100 trees, 10 leaf nodes each
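For reference, a minimal scikit-learn stand-in with the stated ensemble size and leaf count; the actual training pipeline and feature set are not described on this slide:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Additive regression over an ensemble of binary decision trees.
gbdt = GradientBoostingRegressor(
    n_estimators=100,    # 100 trees
    max_leaf_nodes=10,   # 10 leaf nodes each
)
# X: task + node feature matrix, y: observed slowdown factors
# gbdt.fit(X, y); estimates = gbdt.predict(X_new)
```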

19 Challenges – Hadoop Use Case
- Hadoop 1.0 – centralized architecture
  - A single Job Tracker process manages all task assignment and scheduling
  - It has a full picture of the Map and Reduce slots across the cluster
- Hadoop 2.0 – distributed architecture
  - Resource management and scheduling functions are split
  - A thin, centralized Resource Manager (RM) creates application containers (e.g., for running a Map Reduce job)
  - A per-job App Master (AM) does the scheduling within a container
    - It may negotiate resource allocation with the RM
- Challenge: working with a limited set of local signals

20 Possible Design – Hadoop 2.0
[Diagram: the Application Master (hosting the Model) sends resource requests to the Resource Manager, which creates app containers on the Node Managers; metrics collection extends the existing heartbeat protocol (a new component or API) within the job execution environment]
- Some metrics are already collected (CPU ticks, bytes read/written)
- Others might be collected either by the NM or externally
- Centralized prediction will not scale – will distributed prediction be accurate enough?
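A rough sketch of the kind of per-node metrics record that could be piggybacked on the heartbeat and consumed by a per-AM predictor. All names here are hypothetical; this is not Hadoop's actual heartbeat API, and the model is again assumed to be a scikit-learn-style regressor:

```python
from dataclasses import dataclass, field
import time

@dataclass
class NodeMetrics:
    node_id: str
    timestamp: float = field(default_factory=time.time)
    cpu_ticks: int = 0               # already collected by Hadoop today
    bytes_read: int = 0              # already collected
    bytes_written: int = 0
    disk_io_wait_pct: float = 0.0    # extra signals this design would add
    rack_net_mbps: float = 0.0

def rank_nodes(model, task_features, node_metrics):
    """Distributed flavour: each App Master ranks the nodes it knows about
    by predicted slowdown, using only locally available signals."""
    def score(m: NodeMetrics) -> float:
        x = task_features + [m.cpu_ticks, m.bytes_read, m.bytes_written,
                             m.disk_io_wait_pct, m.rack_net_mbps]
        return float(model.predict([x])[0])
    return sorted(node_metrics, key=score)
```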

