Lecture 14: Combating Outliers in MapReduce Clusters Xiaowei Yang.

1 Lecture 14: Combating Outliers in MapReduce Clusters Xiaowei Yang

2 References: – Reining in the Outliers in Map-Reduce Clusters using Mantri by Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, Edward Harris – us/UM/people/srikanth/data/Combating%20Outliers%20in%20Map-Reduce.web.pptx

3 MapReduce – Decouples customized data operations from mechanisms to scale – Is widely used: Cosmos (based on SVC’s Dryad) + Bing; Google; Hadoop inside Yahoo! and on Amazon’s Cloud (AWS) – Data: e.g., the Internet, click logs, bio/genomic data [Figure: log(size of dataset), from GB (10^9) through TB and PB to EB, versus log(size of cluster); regions labeled HPC, parallel databases, and mapreduce]

4 An Example Goal: Find frequent search queries to Bing What the user says: SELECT Query, COUNT(*) AS Freq FROM QueryTable HAVING Freq > X How it works: the job manager assigns work to tasks and gets progress; the stages are Read, Map, Local write, Reduce [Figure: file blocks 0 to 3 feed the map tasks; reduce tasks produce output blocks 0 and 1]
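The example query on this slide can be sketched as a tiny in-memory map-reduce job. This is only an illustration: the function names (`map_block`, `shuffle`, `reduce_counts`, `frequent_queries`) and the sample blocks are hypothetical, and a real cluster would run the map and reduce steps as separate tasks on different machines.

```python
from collections import defaultdict


def map_block(block):
    """Map: emit a (query, 1) pair for every query in one file block."""
    return [(query, 1) for query in block]


def shuffle(mapped_blocks):
    """Group the emitted counts by query, as the reduce input would be."""
    groups = defaultdict(list)
    for pairs in mapped_blocks:
        for query, count in pairs:
            groups[query].append(count)
    return groups


def reduce_counts(groups, threshold):
    """Reduce: sum the counts per query and keep those with Freq > threshold."""
    totals = {query: sum(counts) for query, counts in groups.items()}
    return {query: freq for query, freq in totals.items() if freq > threshold}


def frequent_queries(blocks, threshold):
    """Run map, shuffle, and reduce over all file blocks."""
    return reduce_counts(shuffle(map_block(b) for b in blocks), threshold)


# Hypothetical input: four file blocks of search queries.
blocks = [["weather", "news"], ["weather", "maps"], ["weather", "news"], ["maps"]]
print(frequent_queries(blocks, 2))
```

With the sample blocks, only "weather" appears more than twice, so it is the only query that survives the threshold.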

5 Goals – Speeding up jobs improves productivity – Predictability supports SLAs – … while using resources efficiently We find that: Outliers slow down map-reduce jobs [Figure: example job reading from the File System, with phases Map.Read 22K, Map.Move 15K, Map 13K, Reduce 51K, and a Barrier]

6 What is an outlier A phase (map or reduce) has n tasks and s slots (available compute resources) If every task takes T seconds to run, ideally run time = ceiling(n/s) * T In practice, t_i = f(datasize, code, machine, network) A naïve scheduler: [formula not preserved in the transcript] Goal is to be closer to: [formula not preserved in the transcript]
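A small simulation can make the gap between the ideal run time and an actual phase concrete. The sketch below assumes a simple greedy scheduler that hands each task, in list order, to whichever slot frees up first; that scheduler and the function names are illustrative assumptions, not the formulas from the slide.

```python
import heapq
import math


def ideal_runtime(n, s, T):
    """Ideal phase run time when all n tasks take T seconds on s slots."""
    return math.ceil(n / s) * T


def greedy_phase_runtime(task_times, s):
    """Phase run time when each task starts on the earliest free slot."""
    slots = [0.0] * s  # time at which each slot becomes free
    heapq.heapify(slots)
    for t in task_times:
        start = heapq.heappop(slots)
        heapq.heappush(slots, start + t)
    return max(slots)


print(ideal_runtime(8, 4, 10))                   # uniform tasks, ideal case
print(greedy_phase_runtime([10] * 8, 4))         # uniform tasks, simulated
print(greedy_phase_runtime([10] * 7 + [50], 4))  # one outlier scheduled last
```

In this toy setting a single task that runs five times longer than the rest triples the phase completion time, because every other slot sits idle while the outlier finishes.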

7 From a phase to a job A job may have many phases An outlier in an early phase has a cumulative effect Data loss may cause multi-phase recompute → outliers

8 Why outliers? Problem: Due to unavailable input, tasks have to be recomputed Delay due to a recompute readily cascades [Figure: map, sort, and reduce phases, marking the delay due to a recompute]

9 Previous work The original MapReduce paper observed the problem But didn’t deal with it in depth Solution was to duplicate the slow tasks Drawbacks – Some may be unnecessary – Use extra resources – Placement may be the problem

10 Quantifying the Outlier Problem Approach: – Understanding the problem first before proposing solutions – Understanding often leads to solutions 1. Prevalence of outliers 2. Causes of outliers 3. Impact of outliers

11 Why bother? Frequency of Outliers stragglers = Tasks that take ≥ 1.5 times the median task in that phase recomputes = Tasks that are re-run because their output was lost 50% of phases have 10% stragglers and no recomputes 10% of the stragglers take >10X longer [Figure: frequency of outliers, split into stragglers and recomputes]
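The straggler definition above translates directly into a few lines of code. The sketch below flags tasks whose duration is at least 1.5 times the phase median; the function name and the sample durations are made up for illustration.

```python
from statistics import median


def find_stragglers(durations, factor=1.5):
    """Return indices of tasks that take >= factor times the median task."""
    threshold = factor * median(durations)
    return [i for i, t in enumerate(durations) if t >= threshold]


# Hypothetical task durations (seconds) for one phase.
durations = [10, 11, 9, 10, 12, 10, 10, 11, 10, 120]
print(find_stragglers(durations))
```

Here the median is 10 seconds, so the threshold is 15 seconds and only the last task counts as a straggler.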

12 Causes of outliers: data skew In 40% of the phases, all the tasks with high runtimes (>1.5x the median task) correspond to a large amount of data over the network Duplicating will not help!

13 Non-outliers can be improved as well 20% of them are 55% longer than median

14 Problem: Tasks reading input over the network experience variable congestion Uneven placement is typical in production: reduce tasks are placed at the first available slot [Figure: map output and reduce task placement]

15 Causes of outliers: cross-rack traffic 70% of cross-rack traffic is reduce traffic Tasks in a spot with a slow network run slower Tasks compete for the network among themselves Reduce reads from every map Reduce is put into any spare slot 50% of phases take 62% longer to finish than with ideal placement

16 Cause of outliers: bad and busy machines 50% of recomputes happen on 5% of the machines Recompute increases resource usage

17 Outliers cluster by time – Resource contention might be the cause Recomputes cluster by machines – Data loss may cause multiple recomputes

18 Why bother? Cost of outliers (what-if analysis, replays logs in a trace-driven simulator) At median, jobs slowed down by 35% due to outliers

19 Mantri Design

20 High-level idea Cause aware, and resource aware Runtime = f (input, network, machine, datatoProcess, …) Fix each problem with different strategies

21 Resource-aware restarts Duplicate or kill long outliers

22 When to restart Every ∆ seconds, tasks report progress Estimate t_rem and t_new

23 γ = 3 Schedule a duplicate if the total running time is smaller: P(c · t_rem > (c+1) · t_new) > δ When there are available slots, restart if the reduction in time is more than the restart time: – E(t_rem – t_new) > ρ∆
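The two conditions on this slide can be sketched as predicates over estimates of t_rem and t_new. The version below assumes that uncertainty is represented by paired lists of samples, so the probability and the expectation are simple empirical averages; that representation, the function names, and the parameter values in the example are all assumptions made for illustration, and the slide does not say how the distributions are obtained.

```python
def should_duplicate(t_rem_samples, t_new_samples, c, delta):
    """Check P(c * t_rem > (c + 1) * t_new) > delta on paired samples."""
    pairs = list(zip(t_rem_samples, t_new_samples))
    hits = sum(1 for rem, new in pairs if c * rem > (c + 1) * new)
    return hits / len(pairs) > delta


def should_restart_with_free_slots(t_rem_samples, t_new_samples, rho, interval):
    """Check E(t_rem - t_new) > rho * interval on paired samples."""
    pairs = list(zip(t_rem_samples, t_new_samples))
    mean_saving = sum(rem - new for rem, new in pairs) / len(pairs)
    return mean_saving > rho * interval


# Hypothetical estimates (seconds) for one slow task with c = 1 running copy.
t_rem = [100, 120, 90, 110]
t_new = [30, 40, 35, 50]
print(should_duplicate(t_rem, t_new, c=1, delta=0.25))
print(should_restart_with_free_slots(t_rem, t_new, rho=3, interval=10))
```

The first test is deliberately strict: with c copies already running, a duplicate pays off only if c + 1 copies of the new run time cost less than letting the c current copies finish.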

24 Network-Aware Placement Compute the rack location for each task Find the placement that minimizes the maximum data transfer time If rack i has d_i map output and u_i, v_i bandwidths available on uplink and downlink, place a_i fraction of reduces such that: [formula not preserved in the transcript]
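One way to read the goal on this slide is as a min-max problem over the fractions a_i. The sketch below assumes a specific cost model that is not stated in the transcript: rack i uploads the share of its map output read elsewhere, taking d_i(1 - a_i)/u_i, and downloads its share of everyone else's output, taking (D - d_i)a_i/v_i, where D is the total map output. Under that assumption, a bisection on the time limit finds fractions that sum to 1 and minimize the slowest transfer. The function names are hypothetical.

```python
def max_transfer_time(d, u, v, a):
    """Slowest upload or download over all racks under the assumed model."""
    total = sum(d)
    worst = 0.0
    for d_i, u_i, v_i, a_i in zip(d, u, v, a):
        upload = d_i * (1.0 - a_i) / u_i
        download = (total - d_i) * a_i / v_i
        worst = max(worst, upload, download)
    return worst


def place_reduces(d, u, v, iterations=60):
    """Bisect on the time limit; return reduce fractions a_i summing to 1."""
    n = len(d)
    total = sum(d)

    def bounds(limit):
        # Smallest and largest a_i that keep rack i within the time limit.
        lows = [max(0.0, 1.0 - limit * u_i / d_i) if d_i > 0 else 0.0
                for d_i, u_i in zip(d, u)]
        highs = [min(1.0, limit * v_i / (total - d_i)) if total > d_i else 1.0
                 for d_i, v_i in zip(d, v)]
        return lows, highs

    def feasible(limit):
        lows, highs = bounds(limit)
        fits = all(lo <= hi for lo, hi in zip(lows, highs))
        return fits and sum(lows) <= 1.0 <= sum(highs)

    lo_t = 0.0
    hi_t = max_transfer_time(d, u, v, [1.0 / n] * n)  # equal split always fits
    for _ in range(iterations):
        mid = (lo_t + hi_t) / 2.0
        if feasible(mid):
            hi_t = mid
        else:
            lo_t = mid

    lows, highs = bounds(hi_t)
    fractions = list(lows)
    remaining = 1.0 - sum(fractions)
    for i in range(n):  # hand out the leftover share within each upper bound
        extra = min(highs[i] - fractions[i], remaining)
        fractions[i] += extra
        remaining -= extra
    return fractions


# Hypothetical two-rack phase: rack 0 holds most of the map output.
d, u, v = [90.0, 10.0], [1.0, 1.0], [1.0, 1.0]
a = place_reduces(d, u, v)
print(a, max_transfer_time(d, u, v, a))
```

In the two-rack example, placing about 90% of the reduces on the rack that holds 90% of the map output cuts the worst transfer time from 45 (equal split) to about 9 time units.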

25 Avoid recomputation Replicating the output – Restart a task if data are lost – Replicate the most costly job

26 Data-aware task ordering Outliers due to large input Schedule tasks in descending order of dataToProcess At most 33% worse than optimal scheduling
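The effect of ordering can be seen with a toy scheduler. The sketch below assumes that a task's run time is proportional to its dataToProcess and that each task goes to the slot that frees up first; sorting in descending order is then the classic longest-processing-time-first heuristic, whose worst case is known to be within a factor of 4/3 of the optimal makespan. The function names and sizes are illustrative.

```python
import heapq


def makespan(sizes, s):
    """Finish time when tasks are assigned, in order, to the earliest free slot."""
    slots = [0] * s
    heapq.heapify(slots)
    for size in sizes:
        start = heapq.heappop(slots)
        heapq.heappush(slots, start + size)
    return max(slots)


def data_aware_order(sizes):
    """Schedule tasks in descending order of data to process."""
    return sorted(sizes, reverse=True)


# Hypothetical phase: four small tasks and one large one, on two slots.
sizes = [2, 2, 2, 2, 8]
print(makespan(sizes, 2))                    # large task started last
print(makespan(data_aware_order(sizes), 2))  # large task started first
```

Starting the large task first lets the small tasks fill the other slot in parallel, so the phase ends when the large task does instead of long after.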

27 Estimation of t_rem and t_new d: input data size d_read: the amount read

28 Estimation of t_new processRate: estimated over all tasks in the phase locationFactor: machine performance d: input size
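The formulas for these two estimates did not survive in the transcript, so the sketch below is only one plausible shape built from the variables the slides name. It assumes that t_rem is a linear extrapolation from the data read so far, and that t_new is the product of processRate (taken here as seconds per unit of data), locationFactor, and d. Both formulas and the function names are assumptions for illustration, not the estimators from the original slides.

```python
def estimate_t_rem(elapsed, d, d_read):
    """Assumed linear extrapolation: time left if the read rate stays constant."""
    if d_read <= 0:
        raise ValueError("no progress reported yet; cannot extrapolate")
    return elapsed * (d - d_read) / d_read


def estimate_t_new(process_rate, location_factor, d):
    """Assumed cost of a fresh copy: per-unit time scaled by machine and input."""
    return process_rate * location_factor * d


# Hypothetical task: 30 s elapsed, 25 of 100 units read so far.
print(estimate_t_rem(elapsed=30, d=100, d_read=25))
print(estimate_t_new(process_rate=0.5, location_factor=1.2, d=100))
```

Estimates of this kind would feed the restart conditions: a task that has read a quarter of its input in 30 seconds is projected to need 90 more, which can then be compared against the projected cost of a new copy.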

29 Results Deployed in production Cosmos clusters: prototype Jan’10 → baking on pre-prod. clusters → release May’10 Trace-driven simulations: thousands of jobs; mimic workflow, task runtime, data skew, failure prob.; compare with existing schemes and idealized oracles

30 Evaluation Methodology Mantri runs on production clusters Baseline is results from Dryad Use trace-driven simulations to compare with other systems

31 Comparing jobs in the wild w/ and w/o Mantri for one month of jobs in Bing production cluster jobs that each repeated at least five times during May (release) vs. Apr 1-30 (pre-release)

32 In production, restarts… improve on native Cosmos by 25% while using fewer resources

33 In trace-replay simulations, restarts… are much better dealt with in a cause-, resource-aware manner. Each job repeated thrice [Figure: CDF; axis: % cluster resources]

34 Network-aware Placement Equal: all links have the same bandwidth Start: bandwidth same as at the start Ideal: available bandwidth at run time

35 Protecting against recomputes [Figure: CDF; axis: % cluster resources]

36 Summary a) Reduce recomputation: preferentially replicate costly-to-recompute tasks b) Poor network: each job locally avoids network hot-spots c) Bad machines: quarantine persistently faulty machines d) DataToProcess: schedule in descending order of data size e) Others: restart or duplicate tasks, cognizant of resource cost. Prune

37 Conclusion Outliers in map-reduce clusters – are a significant problem – happen due to many causes: interplay between storage, network and map-reduce Cause-, resource-aware mitigation improves on prior art

