Lecture 14:Combating Outliers in MapReduce Clusters Xiaowei Yang.

Lecture 14:Combating Outliers in MapReduce Clusters Xiaowei Yang

References: – Reining in the Outliers in Map-Reduce Clusters using Mantri by Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, Edward Harris – http://research.microsoft.com/en- us/UM/people/srikanth/data/Combating%20Outlier s%20in%20Map-Reduce.web.pptx

log(size of dataset) GB 10 9 TB 10 12 PB 10 15 EB 10 18 log(size of cluster) 10 4 1 10 3 10 2 10 1 10 5 HPC, || databases mapreduce MapReduce Decouples customized data operations from mechanisms to scale Is widely used Cosmos (based on SVC’s Dryad) + Scope @ Bing MapReduce @ Google Hadoop inside Yahoo! and on Amazon’s Cloud (AWS) e.g., the Internet, click logs, bio/genomic data 3

Local write An Example How it Works: Goal Find frequent search queries to Bing SELECT Query, COUNT(*) AS Freq FROM QueryTable HAVING Freq > X What the user says: Read Map Reduce file block 0 job manager task output block 0 output block 1 file block 1 file block 2 file block 3 assign work, get progress 4

Outliers slow down map-reduce jobs Map.Read 22K Map.Move 15K Map 13K Reduce 51K Barrier File System Goals Speeding up jobs improves productivity Predictability supports SLAs … while using resources efficiently We find that: 5

What is an outlier A phase (map or reduce) has n tasks and s slots (available compute resources) Every task takes T seconds to run t i = f (datasize, code, machine, network) Ideally run time = ceiling (n/s) * T A naïve scheduler Goal is to be closer to

From a phase to a job A job may have many phases An outlier in an early phase has a cumulative effect Data loss may cause multi-phase recompute  outliers

Delay due to a recompute readily cascades Why outliers? reduce sort Delay due to a recompute map Problem: Due to unavailable input, tasks have to be recomputed 8

Previous work The original MapReduce paper observed the problem But didn’t deal with it in depth Solution was to duplicate the slow tasks Drawbacks – Some may be unnecessary – Use extra resources – Placement may be the problem

Quantifying the Outlier Problem Approach: – Understanding the problem first before proposing solutions – Understanding often leads to solutions 1.Prevalence of outliers 2.Causes of outliers 3.Impact of outliers

stragglers = Tasks that take  1.5 times the median task in that phase recomputes = Tasks that are re-run because their output was lost 50% phases have 10% stragglers and no recomputes 10% of the stragglers take >10X longer 50% phases have 10% stragglers and no recomputes 10% of the stragglers take >10X longer Why bother? Frequency of Outliers straggler Outlier 11

Causes of outliners: data skew In 40% of the phases, all the tasks with high runtimes (>1.5x the median task) correspond to large amount of data over the network Duplicating will not help!

Non-outliers can be improved as well 20% of them are 55% longer than median

Reduce task Map output uneven placement is typical in production reduce tasks are placed at first available slot Problem: Tasks reading input over the network experience variable congestion 14

Causes of outliers: cross rack traffic 70% of cross track traffic is reduce traffic Tasks in a spot with slow network run slower Tasks compete network among themselves Reduce reads from every map Reduce is put into any spare slot 50% phases takes 62% longer to finish than ideal placement

Cause of outliers: bad and busy machines 50% of recomputes happen on 5% of the machines Recompute increases resource usage

Outliers cluster by time – Resource contention might be the cause Recomputes cluster by machines – Data loss may cause multiple recomputes

Why bother? Cost of outliers (what-if analysis, replays logs in a trace driven simulator) At median, jobs slowed down by 35% due to outliers 18

Mantri Design

High-level idea Cause aware, and resource aware Runtime = f (input, network, machine, datatoProcess, …) Fix each problem with different strategies

Resource-aware restarts Duplicate or kill long outliers

When to restart Every ∆ seconds, tasks report progress Estimate t rem and t new

γ= 3 Schedule a duplicate if the total running time is smaller P(c t rem > (c+1) t new ) > δ When there are available slots, restart if reduction time is more than restart time – E(t rem – t new ) > ρ ∆

Network Aware Placement Compute the rack location for each task Find the placement that minimizes the maximum data transfer time If rack i has d i map output and u i, v i bandwidths available on uplink and downlink, Place a i fraction of reduces such that:

Avoid recomputation Replicating the output – Restart a task if data are lost – Replicate the most costly job

Data-aware task ordering Outliers due to large input Schedule tasks in descending order of dataToProcess At most 33% worse than optimal scheduling

Estimation of t rem and t new d: input data size d read : the amount read

Estimation of t new processRate: estimated of all tasks in the phase locationFactor: machine performance d: input size

Results Deployed in production cosmos clusters Prototype Jan’10  baking on pre-prod. clusters  release May’10 Trace driven simulations thousands of jobs mimic workflow, task runtime, data skew, failure prob. compare with existing schemes and idealized oracles 29

Evaluation Methodology Mantri run on production clusters Baseline is results from Dryad Use trace-driven simulations to compare with other systems

Comparing jobs in the wild w/ and w/o Mantri for one month of jobs in Bing production cluster 31 340 jobs that each repeated at least five times during May 25-28 (release) vs. Apr 1-30 (pre-release)

In production, restarts… improve on native cosmos by 25% while using fewer resources 32

In trace-replay simulations, restarts… are much better dealt with in a cause-, resource- aware manner. Each job repeated thrice CDF % cluster resources 33

Network-aware Placement Equal: all links have the same bandwidth Start: same as the start Ideal: available bandwidth at run time 34

Protecting against recomputes CDF % cluster resources 35

Summary a)Reduce recomputation: preferentially replicate costly-to-recompute tasks b) Poor network: each job locally avoids network hot-spots c) Bad machines: quarantine persistently faulty machines d) DataToProcess: schedule in descending order of data size e)Others: restart or duplicate tasks, cognizant of resource cost. Prune

Conclusion Outliers in map-reduce clusters are a significant problem happen due to many causes – interplay between storage, network and map-reduce cause-, resource- aware mitigation improves on prior art 37

Lecture 14:Combating Outliers in MapReduce Clusters Xiaowei Yang.

Similar presentations

Presentation on theme: "Lecture 14:Combating Outliers in MapReduce Clusters Xiaowei Yang."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Lecture 14:Combating Outliers in MapReduce Clusters Xiaowei Yang.

Similar presentations

Presentation on theme: "Lecture 14:Combating Outliers in MapReduce Clusters Xiaowei Yang."— Presentation transcript:

Similar presentations

About project

Feedback