Effective Straggler Mitigation: Attack of the Clones [1]


Effective Straggler Mitigation: Attack of the Clones [1]
[1] Ananthanarayanan, Ganesh, et al. "Effective straggler mitigation: attack of the clones." Proc. NSDI, 2013.
Shy-Yauer Lin (slin52), CS525 Spring 2014, ver04221417

Contribution of this paper
New programming paradigms (e.g., MapReduce) have an intrinsic straggler problem. What is the problem? (next slide)
The proposed technique mitigates the straggler problem: it shortens job time, speeding jobs up by 34% to 46%, and the trade-off is inexpensive (about 5% additional resources). What is the technique? (discussed later)
Note the scope of this paper: smaller jobs (discussed later), in practical scenarios (Facebook, Bing).
The proposed system is called Dolly (mainly a modified scheduler).

Straggler problem
A job consists of multiple parallel tasks; a task is the unit of processing. Job time = longest task time (the slowest task dominates).
[Diagram: Job 1 (tasks 1.1-1.4) and Job 2 (tasks 2.1-2.4) on a timeline t, with job time marked]
Single stage? The more imbalanced the task times are, the more severe the straggler problem.

Solving the straggler problem
Insight: clone the job (run the clones concurrently). Assumption: task times are not constant. Job time = shortest clone's job time (the fastest clone dominates). Trade-off: more resource usage. Job-level vs. task-level cloning?
[Diagram: Job 1 (tasks 1.1-1.4) and its clone Job 1c (tasks 1c.1-1c.4) on a timeline t, with job time marked]

A good clone approach shall…
Trade-off: shorten job time at the expense of additional resource usage. Rule of thumb: shorten job time as much as possible subject to a constraint on additional resource usage (Budget Clone).
Where does speculative execution sit? There is a spectrum of clone timing: proactive/eager (clone before a straggler appears; costs more resources) vs. reactive/lazy (clone after a straggler appears; costs more job time).

Solving the straggler problem (another approach)
Insight: a blacklist (run tasks on good machines; block bad machines).
[Diagram: Job 2 (tasks 2.1-2.4) without and with a blacklist, on a timeline t]
How to predict bad machines? Are cloning and blacklisting orthogonal?

We are not satisfied with these solutions
(Reactive) job cloning and blacklisting work well for larger jobs, but not for smaller jobs (jobs with fewer than 10 tasks): 6.2% improvement vs. 29% (ideal), and 8.4% vs. 47% (ideal).
Metric: the slowdown ratio (y-axis), for a single stage. Slowdown ratio = "progress rate" of the median task / "progress rate" of the slowest task, where "progress rate" = task input data size / task duration. A high progress rate means faster data processing. A higher slowdown ratio means the slowest (dominant) task lags further behind the median task, i.e., a more severe straggler problem: the ratio is 1 for ideal balancing, but >5 for smaller jobs even with LATE/Mantri mitigation.
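The slowdown ratio above is simple to compute from per-task inputs and durations. A minimal sketch in Python (the task sizes and durations are made-up numbers, not from the paper):

```python
def progress_rate(input_bytes, duration_s):
    """Progress rate = task input data size / task duration."""
    return input_bytes / duration_s

def slowdown_ratio(tasks):
    """tasks: list of (input_bytes, duration_s) for one single-stage job.
    Slowdown ratio = progress rate of the median task / progress rate of
    the slowest task (the slowest task has the lowest rate)."""
    rates = sorted(progress_rate(b, d) for b, d in tasks)
    median = rates[len(rates) // 2]   # upper middle for even-sized jobs
    return median / rates[0]

# Hypothetical job: four 100 MB tasks, one straggler running 5x longer.
job = [(100e6, 10), (100e6, 11), (100e6, 12), (100e6, 55)]
print(slowdown_ratio(job))   # well above 1, so a straggler is present
```

A perfectly balanced job gives a ratio of 1; the slide's LATE/Mantri numbers (>5 for smaller jobs) correspond to a slowest task lagging the median by more than 5x.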

Why care about smaller jobs
This paper focuses on smaller jobs, with a highly proactive job-cloning approach. Why focus on smaller jobs?
- The problem is not solved yet.
- The heavy-tail fact: the smaller-job scenario is indeed the most important. There are so many small jobs at Facebook; at Google, 92% of jobs consume only 2% of the resources.
- Cloning smaller jobs needs much less resource overhead, and the paper still uses many resource-aware techniques to reduce this overhead further.

Insight of this paper
Unconditionally and proactively clone smaller jobs. Clone timing: the cloned jobs run simultaneously from the start.
This alone is not good enough, so some modifications follow (next slides): less resource overhead is even better, and multi-stage jobs raise more problems to solve (discussed later).
Rule of thumb: shorten job time as much as possible subject to a constraint on additional resource usage: Budget Clone. End of story?

Modification 1: task-level cloning
[Diagram: Job 1 plus a whole-job clone Job 1c (job-level cloning) vs. Job 1 with per-task clones 1c.1-1c.4 (task-level cloning), with job times marked]
Task-level cloning always yields a job time at least as short as job-level cloning. What is the trade-off?

Job-level vs. task-level cloning (worked example)
job2 has task2.1 (4 sec) and task2.2 (5 sec); its clone job2c has task2c.1 (5 sec) and task2c.2 (4 sec).
Job-level clone: 5 sec = min(job2, job2c), where job2 = max(task2.1, task2.2) = max(4, 5) = 5 and job2c = max(task2c.1, task2c.2) = max(5, 4) = 5.
Task-level clone: 4 sec = max(t1, t2), where t1 = min(task2.1, task2c.1) = min(4, 5) = 4 and t2 = min(task2.2, task2c.2) = min(5, 4) = 4.
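The arithmetic above can be checked in code. A small sketch using the same hypothetical 4 s / 5 s durations:

```python
# The slide's hypothetical durations, in seconds.
job2  = {"task2.1": 4, "task2.2": 5}   # original copy
job2c = {"task2.1": 5, "task2.2": 4}   # clone copy

def job_level_time(copies):
    """Job-level cloning: each whole-job copy finishes when its slowest task
    does; the job finishes when the fastest copy does."""
    return min(max(copy.values()) for copy in copies)

def task_level_time(copies):
    """Task-level cloning: each task finishes when its fastest copy does;
    the job finishes when the slowest task does."""
    return max(min(copy[t] for copy in copies) for t in copies[0])

print(job_level_time([job2, job2c]))   # 5
print(task_level_time([job2, job2c]))  # 4
```

Task-level cloning can mix the fast copy of each task across clones, which is why it never does worse than job-level cloning.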

Modification 1: task-level cloning (analysis)
The source of the straggler problem is too complicated to pin down, so the phenomenon is modeled as a probabilistic event: a straggler occurs with probability p. For a job of n tasks and c clones, the probability that the job stragglers is (1-(1-p)^n)^c with job-level cloning but 1-(1-p^c)^n with task-level cloning. For a fixed probability, we need far fewer copies at the task level than at the job level; a smaller required clone count c means less resource overhead.
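The two expressions can be compared numerically. A sketch assuming p = 0.05 and n = 10 (illustrative values, not from the paper):

```python
def job_level_straggler_prob(p, n, c):
    """P(job stragglers) with c job-level clones: all c whole-job copies
    contain at least one straggler task, i.e. (1 - (1-p)^n)^c."""
    return (1 - (1 - p) ** n) ** c

def task_level_straggler_prob(p, n, c):
    """P(job stragglers) with c task-level clones: at least one task has
    all c of its copies straggle, i.e. 1 - (1 - p^c)^n."""
    return 1 - (1 - p ** c) ** n

p, n = 0.05, 10  # assumed straggler probability and number of tasks
for c in (1, 2, 3):
    print(c, job_level_straggler_prob(p, n, c), task_level_straggler_prob(p, n, c))
```

With a single copy (c = 1) the two formulas agree; for c ≥ 2 the task-level probability drops far faster, which is why fewer task-level clones reach the same target.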

Budget Clone insight
Step 1: estimate system parameters, such as the straggler probability p, the available resource budget B, and the desired upper bound ɛ on the straggler probability.
Step 2: calculate the required clone count c given p and ɛ.
Step 3: spend budget B on the c clones if the budget is enough; otherwise do not clone at all. (This spend-or-not-at-all approach leaves room for improvement.)
Step 4: update the budget B (available budget decreases; utilized budget increases).
Admission control? A larger job is unlikely to be cloned; the budget mainly serves smaller-job cloning.
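The four steps can be sketched as follows; `clones_needed` uses the task-level formula from the previous slide, and the budget accounting (one extra task slot per extra copy) is a simplifying assumption, not the paper's exact policy:

```python
def clones_needed(p, n, eps, c_max=10):
    """Step 2: smallest c with task-level straggler prob 1-(1-p**c)**n <= eps,
    or None if even c_max copies are not enough."""
    for c in range(1, c_max + 1):
        if 1 - (1 - p ** c) ** n <= eps:
            return c
    return None

def admit(n_tasks, p, eps, budget):
    """Steps 3-4, spend-or-not-at-all: clone only if the whole extra cost
    fits the budget. Returns (copies to run per task, remaining budget)."""
    c = clones_needed(p, n_tasks, eps)
    if c is not None:
        extra = (c - 1) * n_tasks        # extra task slots beyond the original run
        if extra <= budget:
            return c, budget - extra     # spend, then update the budget
    return 1, budget                     # budget not enough: no cloning at all

print(admit(4, p=0.05, eps=0.01, budget=20))   # small job: admitted with clones
print(admit(4, p=0.05, eps=0.01, budget=2))    # budget too small: no cloning
```

Because `extra` scales with the task count, a large job rarely fits the budget, matching the slide's note that the budget mainly serves smaller-job cloning.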

Cloning seems good, but…
The multi-stage fact (e.g., MapReduce) introduces a new assignment problem. We clone maps and clone reduces (clone groups), and the mapping strategy between map clones and reduce clones is an assignment problem:
- a one-to-many assignment (one map clone to many reduce clones) may cause (I/O) contention;
- a one-to-one assignment may be inefficient (a reduce clone waits for its slow map clone).
Shuffle constraint: U1 → D1, D2; U2 → D1, D2.
[Diagram: a clone group with upstream clones U1, U1, U2, U2 and downstream clones D1, D1, D2, D2] Shuffle vs. assignment?

Shuffle vs. assignment
Shuffle: mappers → reducers. Assignment: mapper clones → reducer clones.
Shuffle constraint: U1 → D1, D2; U2 → D1, D2.
[Diagram: the un-cloned stage U1, U2 → D1, D2 vs. the clone group U1, U1, U2, U2 → D1, D1, D2, D2]

One-to-one assignment: Contention-Avoidance Cloning (CAC)
Each reduce clone reads over an exclusive data path, e.g., U1 → D1 and U1c → D1c. But a one-to-one assignment may be inefficient: a reduce clone waits for its own slow map clone, so with higher probability a longer wait time leads to a longer job time. When should assignment happen?
Shuffle constraint: U1 → D1, D2; U2 → D1, D2.

One-to-many assignment: Contention Cloning (CC)
Every reduce clone reads from whichever map clone finishes first, so there is no waiting at all. But a one-to-many assignment (one map clone to many reduce clones) may cause (I/O) contention on the shared data path, e.g., both D1 and D1c read from U1c when U1 is a straggler. More contention means a longer job time. When should assignment happen?
Shuffle constraint: U1 → D1, D2; U2 → D1, D2.

New problem: the assignment problem for multi-stage jobs
For a fixed probability, we need far fewer copies with CC than with CAC. The core tension: wait (CAC) vs. no-wait (CC).

Solution: delay assignment
CC is a "no wait at all" approach: bad. CAC is an "always wait" (w = ∞) approach: even worse. Insight: a "wait for a short period" approach (delay assignment), where the short period is a window w:
- if the wait is shorter than the window (wait < w), use an exclusive (one-to-one) data path;
- after waiting longer than the window (wait > w), fall back to a contention-prone (one-to-many) data path.
Fine-tuning the window w yields better results than both CC and CAC. Properly choose w to minimize the expected task time (p. 192 of the paper; see the next slide), periodically re-estimating the parameters of those formulas (such as bandwidth). A window of 2.5 s to 4.7 s is decent. When should assignment happen? This is the answer.
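The wait-then-fall-back rule for one reduce clone can be sketched as below; `exclusive_ready_in` (how long until this clone's own upstream copy is ready) is a hypothetical input, not the paper's API:

```python
def choose_path(exclusive_ready_in, w):
    """Delay assignment for one reduce clone: wait up to window w for its own
    (exclusive) upstream copy; otherwise fall back to an already-finished copy
    and accept possible contention."""
    if exclusive_ready_in <= w:
        return "exclusive"   # waited less than w: one-to-one, exclusive data path
    return "contended"       # window expired: one-to-many, contention data path

w = 3.0                      # seconds; the slide reports 2.5-4.7 s as decent
print(choose_path(1.2, w))   # exclusive: upstream copy ready within the window
print(choose_path(8.0, w))   # contended: window expired, read the other copy
```

Setting w = 0 recovers CC (never wait) and w = ∞ recovers CAC (always wait), so the window interpolates between the two extremes.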

Solution (formula 1)
Objective: find the window w that minimizes the expected job time p_c T_C + (1 - p_c) T_E, where the contention probability p_c is the probability that the wait time exceeds the window w.
Case 1: T_C (wait for the full window w, then enforce contention): T_C = w + r / (αB), where r is the data size, B is the bandwidth without contention (not the budget), and αB is the bandwidth under contention.

Solution (formula 2)
Objective: find the window w that minimizes the expected job time p_c T_C + (1 - p_c) T_E, where the contention probability p_c is the probability that the wait time exceeds the window w.
Case 2: T_E (wait less than w, no contention): T_E ≤ w + r / B is an upper bound, where r is the data size and B is the bandwidth without contention (not the budget).
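The window choice can be illustrated by scanning candidate values of w. The model below assumes T_C = w + r/(αB), the upper bound T_E = w + r/B, a contention probability p_c = P(wait > w), and an exponentially distributed wait; these are all illustrative assumptions layered on the slide's symbols:

```python
import math

def expected_read_time(w, r, B, alpha, tau):
    """Expected time p_c * T_C + (1 - p_c) * T_E, with
    p_c = P(wait > w), assumed exponential with mean tau (an assumption),
    T_C = w + r / (alpha * B)  (waited the full window, read under contention),
    T_E = w + r / B            (upper bound: waited less than w, no contention)."""
    p_c = math.exp(-w / tau)
    t_c = w + r / (alpha * B)
    t_e = w + r / B
    return p_c * t_c + (1 - p_c) * t_e

# Illustrative numbers: r = 1 GB, B = 125 MB/s (~1 Gbps link), contention
# halves the bandwidth (alpha = 0.5), mean wait tau = 2 s.
r, B, alpha, tau = 1e9, 125e6, 0.5, 2.0
candidates = [i / 10 for i in range(0, 101)]          # w from 0.0 to 10.0 s
best_w = min(candidates, key=lambda w: expected_read_time(w, r, B, alpha, tau))
print(best_w)
```

With these assumed numbers the minimizing window lands near 2.8 s, inside the 2.5-4.7 s range the slides report as decent; waiting a little buys a good chance of an uncontended read without paying much delay.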

Evaluation settings
150 machines; 1 Gbps network links; 12-core CPUs; 24 GB RAM; 2 TB storage.
Competitors: LATE and Mantri (speculative-execution solutions).
Workloads: Facebook (Hadoop) and Bing (Dryad) traces.
Job scales (number of tasks): 1-10 (bin-1), 11-50, 50-150, 150-500, >500, from smaller to larger jobs.

Straggler mitigation (Facebook workload)
Roughly 4X% improvement for smaller jobs; Dolly is still better than the baselines for larger jobs.
[Figure: improvement per job bin, smaller jobs to larger jobs]

Straggler mitigation (Bing workload)
3X% improvement for smaller jobs. Why is the improvement (3X%) smaller than for the Facebook workload (4X%)?
[Figure: improvement per job bin, smaller jobs to larger jobs]

Straggler mitigation (balance)
By the slowdown ratio (y-axis), smaller jobs are well balanced under Dolly: 1.06-1.17, versus >5 under LATE and Mantri.

Resource trade-off
A 5% budget is enough for smaller and medium jobs.
[Figure]

Resource trade-off (sensitivity)
A 3%-5% budget is enough for smaller jobs; the improvement "saturates" once the resource budget reaches 5%.
[Figure: improvement vs. budget for smaller jobs]

Delay assignment vs. CC and CAC
Delay assignment wins for smaller jobs (delay assignment > CC > CAC).
[Figures: smaller jobs]

Delay assignment for multi-stage jobs
The benefit of delay assignment decays slowly as the number of stages grows.

Communication patterns
Delay assignment works not only for all-to-all communication patterns.

Admission control
The spend-or-not-at-all policy (step 3) is good enough, compared with a preemption policy that runs smaller jobs first.
[Figure: smaller jobs]

Parameter estimation
A window w of 2.5 to 4.7 seconds is a good rule of thumb for smaller jobs. What about larger jobs? The estimation period seems unimportant: other interval settings give similar results (per the paper).

Discussion
- Smaller jobs vs. larger jobs: Dolly still benefits larger jobs; why?
- Analytical probability formulas are given for CAC but not for CC; why?
- Is the probabilistic formulation of the straggler phenomenon a reasonable assumption?
- Delay assignment should be at least as good as the better of CAC and CC, but that is not true for larger jobs; why? (Hint: the parameter-w graph does not show the larger-job case.)