Reining in the Outliers in Map-Reduce Clusters using Mantri
Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, Edward Harris
Presenter: Weiyue Xu
OSDI'10: Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation

Credits
Modified version of:
– research.microsoft.com/en-us/UM/people/srikanth/data/Combating%20Outliers%20in%20Map-Reduce.web.pptx

Outline
Introduction
Causes of Outliers
Mantri
Evaluation
Discussion and Conclusion

MapReduce
Decouples customized data operations from mechanisms to scale
Widely used: Cosmos (based on SVC's Dryad) + Bing, Google, Hadoop inside Yahoo! and on Amazon's Cloud (AWS)
(figure: log(size of dataset) — GB, TB, PB, EB — vs. log(size of cluster); MapReduce sits beyond HPC and parallel databases, on datasets such as the Internet, click logs, bio/genomic data)

An Example: How it Works
Goal: Find frequent search queries to Bing
What the user says: SELECT Query, COUNT(*) AS Freq FROM QueryTable HAVING Freq > X
(figure: the job manager assigns work to tasks and tracks progress; map tasks read file blocks 0–3, reduce tasks write output blocks 0–1 with a local write)
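A minimal sketch (not from the talk) of how this query maps onto the two phases, using an in-memory stand-in for the file blocks; `query_blocks` and the threshold X are illustrative assumptions:

```python
from collections import Counter
from itertools import chain

def map_phase(block):
    # Map: emit (query, 1) for every query in one input file block.
    return [(query, 1) for query in block]

def reduce_phase(pairs, threshold):
    # Reduce: sum counts per query and keep the frequent ones (Freq > X).
    counts = Counter()
    for query, one in pairs:
        counts[query] += one
    return {q: c for q, c in counts.items() if c > threshold}

# Illustrative input: four "file blocks" of queries, threshold X = 2.
query_blocks = [["mantri", "osdi"], ["mantri"], ["outliers", "mantri"], ["osdi", "osdi"]]
mapped = chain.from_iterable(map_phase(b) for b in query_blocks)
print(reduce_phase(mapped, threshold=2))   # {'mantri': 3, 'osdi': 3}
```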

Outliers slow down map-reduce jobs
(figure: phases in the workload — Map.Read 22K, Map.Move 15K, Map 13K, Reduce 51K — separated by barriers and the file system)
We find that by tackling outliers we can speed up jobs while using resources efficiently:
– Quicker response improves productivity
– Predictability supports SLAs
– Better resource utilization

From a phase to a job
A job may have many phases
An outlier in an early phase has a cumulative effect
Data loss may cause multi-phase recomputes → outliers

Why outliers?
Due to unavailable input, tasks have to be recomputed
Delay due to a recompute readily cascades
(figure: timeline of map, sort, and reduce phases; a recompute in the map phase delays the downstream phases)

Frequency of Outliers
Stragglers: tasks that take ≥ 1.5 times the median task in that phase
Recomputes: tasks that are re-run because their output was lost
50% of phases have > 10% stragglers and no recomputes
10% of the stragglers take > 10x longer
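A small sketch (illustrative, not from the talk) of flagging stragglers with this threshold; `durations` is an assumed list of per-task runtimes for one phase:

```python
from statistics import median

def find_stragglers(durations, factor=1.5):
    # A task is a straggler if it runs at least `factor` times the phase median.
    m = median(durations)
    return [i for i, d in enumerate(durations) if d >= factor * m]

# Illustrative phase: most tasks take ~10s, two take much longer.
durations = [10, 11, 9, 10, 42, 10, 95]
print(find_stragglers(durations))  # [4, 6]
```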

Cost of Outliers
At the median, jobs are slowed down by 35% due to outliers

Previous solutions
The original MapReduce paper observed the problem but did not solve it in depth
Current schemes (e.g., Hadoop, LATE) duplicate long-running tasks based on some metric
Drawbacks:
– Some duplicates may be unnecessary
– They use extra resources
– Placement may be a problem

What this Paper is About
Identify fundamental causes of outliers
Mantri: a cause- and resource-aware mitigation scheme
– Case-by-case analysis: takes distinct actions based on the cause
– Considers the opportunity cost of actions
Results from a production deployment

Causes of Outliers Data Skew: data size varies across tasks in a phase.

Causes of Outliers: Crossrack Traffic
Reduce tasks are placed at the first available slots
Uneven placement is typical in production
(figure: racks holding map output; reduce tasks in one rack pull map output from across the racks)

Causes of Outliers: Bad and Busy Machines
50% of recomputes happen on 5% of the machines
Recomputes increase resource usage

Causes of Outliers: Crossrack Traffic
70% of cross-rack traffic is reduce traffic
Reduce reads from every map; tasks in a spot with a slow network run slower
Tasks compete with one another for network bandwidth
50% of phases take 62% longer to finish than with ideal placement

Outliers cluster by time
– Resource contention might be the cause
Recomputes cluster by machine
– Data loss may cause multiple recomputes

Mantri
Cause-aware and resource-aware
Fixes each problem with a different strategy
Runtime = f(input, network, dataToProcess, ...)

Mantri [Avoid Recomputations]
Recall: due to unavailable input, tasks have to be recomputed, and the delay due to a recompute readily cascades
(figure: timeline of map, sort, and reduce phases with cascading recompute delays)

Idea: Replicate intermediate data; use the copy if the original is unavailable
Challenge: What data to replicate?
Insight: Compare the cost to recompute with the cost to replicate
For a task M2 that reads the output of M1: t_redo = r_2 (t_2 + t_1_redo), where t is the predicted runtime of a task and r is the predicted probability of recomputation at its machine
The cost to recompute depends on data loss probabilities and time taken, and recursively looks at prior phases
Replicate when t_redo > t_rep; Mantri preferentially acts on the more costly inputs
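A toy sketch (my illustration, not Mantri's code) of this comparison for a chain of tasks; the probabilities, runtimes, and `t_rep` value are assumed inputs:

```python
def recompute_cost(tasks):
    # tasks: list of (t, r) pairs ordered by phase, where t is the predicted
    # runtime and r the predicted probability of recomputation at its machine.
    # The cost of redoing the last task includes, recursively, the chance that
    # earlier outputs are also lost and must be redone first.
    t_redo = 0.0
    for t, r in tasks:
        t_redo = r * (t + t_redo)
    return t_redo

def should_replicate(tasks, t_rep):
    # Replicate the output iff the expected recompute cost exceeds the
    # known cost of replicating it now.
    return recompute_cost(tasks) > t_rep

# Illustrative chain M1 -> M2: M1 takes 30s with 10% loss chance,
# M2 takes 20s with 20% loss chance; replication would cost 4s.
chain = [(30.0, 0.1), (20.0, 0.2)]
print(should_replicate(chain, t_rep=4.0))  # True: 0.2*(20 + 0.1*30) = 4.6 > 4.0
```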

Mantri [Network Aware Placement]
Recall: reduce tasks are placed at the first available slots, and uneven placement is typical in production
(figure: racks holding map output; reduce tasks pull map output from across the racks)

Idea: Avoid hot-spots; keep traffic on a link proportional to its bandwidth
Challenges: global coordination, congestion detection
Insights:
– Local control is a good approximation (each job balances its own traffic)
– Link utilizations average out in the long term and are steady in the short term
If rack i has d_i map output, it must upload d_i_u and download d_i_v, with available bandwidths b_i_u and b_i_v
Place an a_i fraction of the reduces in each rack so that the slowest transfer is minimized: choose the a_i to minimize the maximum over racks of max(d_i_u / b_i_u, d_i_v / b_i_v)
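A small sketch of evaluating candidate placements for two racks (my reconstruction under the definitions above, not Mantri's implementation); it assumes that placing an a_i fraction of the reduces in rack i makes the rack upload (1 − a_i)·d_i and download a_i·(D − d_i), where D is the total map output:

```python
def transfer_time(a, d, b_up, b_down):
    # a[i]: fraction of reduce tasks placed in rack i
    # d[i]: map output already in rack i; b_up/b_down: link bandwidths
    D = sum(d)
    times = []
    for i in range(len(d)):
        upload = (1 - a[i]) * d[i]        # data leaving rack i
        download = a[i] * (D - d[i])      # data entering rack i
        times.append(max(upload / b_up[i], download / b_down[i]))
    return max(times)  # the shuffle finishes when the slowest rack finishes

def best_two_rack_split(d, b_up, b_down, steps=1000):
    # Coarse search over a_1 (with a_2 = 1 - a_1) for the placement that
    # minimizes the slowest cross-rack transfer.
    candidates = ((i / steps, 1 - i / steps) for i in range(steps + 1))
    return min(candidates, key=lambda a: transfer_time(a, d, b_up, b_down))

# Illustrative numbers: rack 1 holds 80 GB of map output, rack 2 holds 20 GB;
# rack 2 has half the bandwidth of rack 1.
print(best_two_rack_split(d=[80, 20], b_up=[10, 5], b_down=[10, 5]))
```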

Mantri [Data Aware Task Ordering]
Data skew: about 25% of outliers occur due to more dataToProcess (workload imbalance)
Ignoring these is better than duplicating them (the state-of-the-art)

Problem: Workload imbalance causes tasks to straggle
Idea: Restarting outliers that are merely lengthy is counter-productive; schedule around the imbalance instead
Mantri builds an estimator T ~ f(dataToProcess) and schedules tasks in descending order of dataToProcess
Insight, Theorem [Graham, 1969]: scheduling tasks with the longest processing time first is at most 33% worse than the optimal schedule (see the sketch below)
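A minimal sketch (my illustration, not Mantri's scheduler) of longest-processing-time-first assignment onto a fixed number of slots; the task sizes and slot count are assumed:

```python
import heapq

def schedule_lpt(data_to_process, num_slots):
    # Longest-processing-time-first: sort tasks by descending data size and
    # repeatedly give the next task to the least-loaded slot.
    loads = [(0.0, slot) for slot in range(num_slots)]
    heapq.heapify(loads)
    assignment = {}
    for task, size in sorted(enumerate(data_to_process), key=lambda kv: -kv[1]):
        load, slot = heapq.heappop(loads)
        assignment[task] = slot
        heapq.heappush(loads, (load + size, slot))
    makespan = max(load for load, _ in loads)
    return assignment, makespan

# Illustrative phase: one skewed task (90) and several small ones, 3 slots.
print(schedule_lpt([90, 20, 20, 20, 10, 10], num_slots=3))
```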

Mantri [Resource Aware Restart]
Problem: 25% of outliers remain, likely due to contention on the machine
Idea: Restart such tasks elsewhere in the cluster as soon as possible
Challenge: restart (kill the original) or duplicate?
(figure: three cases (a)–(c) comparing a running task with remaining time t_rem against a potential restart taking t_new; only the first is worth acting on)
A duplicate saves both time and resources iff P(c · t_rem > (c+1) · t_new) > δ, where c is the number of copies currently running
If there is pending work, duplicate only if it saves both time and resources; else, duplicate if the expected savings are high
Continuously observe and kill wasteful copies (see the sketch below)
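A toy sketch (my illustration of the inequality above, not Mantri's code) that evaluates the duplicate decision from sampled predictions of t_rem and t_new; the distributions and δ are assumed:

```python
import random

def should_duplicate(t_rem_samples, t_new_samples, copies, delta=0.25):
    # Duplicate only if, with probability > delta, one more copy finishes
    # enough ahead of the current copies to repay its resource cost:
    # P(c * t_rem > (c + 1) * t_new) > delta.
    trials = list(zip(t_rem_samples, t_new_samples))
    wins = sum(1 for t_rem, t_new in trials
               if copies * t_rem > (copies + 1) * t_new)
    return wins / len(trials) > delta

# Illustrative predictions: the running task likely needs ~100s more,
# a fresh copy elsewhere would likely take ~30s.
random.seed(0)
t_rem = [random.gauss(100, 20) for _ in range(1000)]
t_new = [random.gauss(30, 10) for _ in range(1000)]
print(should_duplicate(t_rem, t_new, copies=1))  # True for these numbers
```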

Summary
Reduce recomputation: preferentially replicate the output of tasks that are costly to recompute
Poor network: each job locally avoids network hot-spots
DataToProcess: schedule tasks in descending order of data size
Bad machines: quarantine persistently faulty machines
Others: restart or duplicate tasks, cognizant of the resource cost

Evaluation Methodology
Mantri was run on production clusters
The baseline is results from Dryad
Trace-driven simulations are used to compare with other systems

Comparing Jobs in the Wild
With and without Mantri, for one month of jobs in the Bing production cluster
340 jobs that each repeated at least five times, May (post-release) vs. April 1-30 (pre-release)

In Production, Restarts …

In Trace-Replay Simulations, Restarts …
(figure: CDF vs. % cluster resources)

Protecting Against Recomputes
(figure: CDF vs. % cluster resources)

Conclusion
Outliers are a significant problem
They happen due to many causes
Mantri: cause- and resource-aware mitigation that outperforms prior schemes

Discussion
Mantri does a case-by-case analysis for each cause; what if the causes are interdependent?

Questions or Comments? Thanks!

Estimation of t_rem and t_new
For t_rem — d: input data size; d_read: the amount read so far

Estimation of t_new
processRate: estimated from all tasks in the phase
locationFactor: relative machine performance
d: input size
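The estimator formulas themselves did not survive extraction; below is a plausible sketch consistent with the variables listed on these two slides (my reconstruction, not necessarily the paper's exact estimators), with `t_elapsed`, `t_wrapup`, and `sched_lag` as assumed extra inputs:

```python
def estimate_t_rem(t_elapsed, d, d_read, t_wrapup=0.0):
    # Assume the task keeps processing at its observed rate: remaining time is
    # the data still to be read divided by that rate, plus any fixed wrap-up
    # work after reading finishes.
    rate = d_read / t_elapsed            # bytes processed per second so far
    return (d - d_read) / rate + t_wrapup

def estimate_t_new(process_rate, location_factor, d, sched_lag=0.0):
    # A fresh copy processes d bytes at the phase-wide rate, scaled by how
    # fast or slow the candidate machine is, plus scheduling delay.
    return process_rate * location_factor * d + sched_lag

# Illustrative numbers: a task has read 1 GB of 4 GB in 100 s;
# the phase processes data at 50 s per GB on an average machine.
print(estimate_t_rem(t_elapsed=100.0, d=4.0, d_read=1.0))              # 300.0
print(estimate_t_new(process_rate=50.0, location_factor=1.2, d=4.0))   # 240.0
```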