UC Berkeley Job Scheduling for MapReduce Matei Zaharia, Dhruba Borthakur *, Joydeep Sen Sarma *, Scott Shenker, Ion Stoica 1 RAD Lab, * Facebook Inc.

Motivation

Hadoop was designed for large batch jobs:
– FIFO queue + locality optimization

At Facebook, we saw a different workload:
– Many users want to share a cluster
– Many jobs are small ( tasks): sampling, ad-hoc queries, periodic reports, etc.

How should we schedule tasks in a shared MapReduce cluster?

Benefits of Sharing

– Higher utilization due to statistical multiplexing
– Data consolidation (ability to query disjoint data sets together)

Why is it Interesting?

– Data locality is crucial for performance
– There is a conflict between locality and fairness
– A simple algorithm yields a 70% gain

Outline

– Task scheduling in Hadoop
– Two problems: head-of-line scheduling and slot stickiness
– A solution (global scheduling)
– Lots more problems (future work)

Task Scheduling in Hadoop

Slaves send heartbeats to the master periodically. If a slot is free, the master responds with a task, picking the task with data closest to the node.

(diagram: master with a job queue exchanging heartbeat/task messages with slaves)
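The heartbeat exchange above can be sketched as follows. The data structures and hostnames are made up for illustration (not Hadoop's actual classes), and the toy topology treats the first character of a hostname as its rack:

```python
# Minimal sketch of the heartbeat protocol: the slave reports a free slot,
# and the master replies with the head-of-queue job's task whose input data
# is closest to that node.

def distance(task_hosts, node):
    """0 = node-local, 1 = rack-local, 2 = off-rack (rack = first char)."""
    if node in task_hosts:
        return 0
    if any(h[0] == node[0] for h in task_hosts):
        return 1
    return 2

def respond_to_heartbeat(job_queue, node):
    """FIFO scheduling: only the head-of-queue job's tasks are considered."""
    if not job_queue:
        return None
    pending = job_queue[0]["pending"]  # list of (task_id, data_hosts) pairs
    if not pending:
        return None
    task = min(pending, key=lambda t: distance(t[1], node))
    pending.remove(task)
    return task

queue = [{"pending": [("t1", ["a1"]), ("t2", ["b2"])]}]
print(respond_to_heartbeat(queue, "b2"))  # t2 is node-local on b2, so it wins
```

Because only the head-of-queue job is consulted on each heartbeat, whichever node happens to check in next may hold none of that job's data, which is the cause of Problem 1 below.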

Problem 1: Poor Locality for Small Jobs

(figure: job sizes at Facebook)

Cause

– Only the head-of-queue job is schedulable on each heartbeat
– The chance of the heartbeat node having local data is low
– Jobs with blocks on X% of nodes get X% locality

Problem 2: Sticky Slots

Suppose we do fair sharing as follows:
– Divide task slots equally between jobs
– When a slot becomes free, give it to the job that is farthest below its fair share
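The slot-assignment rule above can be sketched as follows (toy job records, not the Hadoop scheduler's real types):

```python
def pick_job(jobs):
    """Give the free slot to the job farthest below its fair share."""
    return max(jobs, key=lambda j: j["fair_share"] - j["running"])

jobs = [
    {"name": "Job 1", "fair_share": 2, "running": 1},
    {"name": "Job 2", "fair_share": 2, "running": 2},
]
print(pick_job(jobs)["name"])  # Job 1 is one task below its share
```

Under this rule, a job that just finished a task is immediately the farthest below its share, so the freed slot goes straight back to it; that is the stickiness the next slides illustrate.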

Problem 2: Sticky Slots

(diagram: master assigning tasks to slaves on heartbeat)

Job     Fair Share   Running Tasks
Job 1   2            1
Job 2   2            2

Problem: Jobs never leave their original slots

Calculations

Solution: Locality Wait

Scan through the job queue in order of priority. Jobs must wait before they are allowed to run non-local tasks:
– If wait < T1, only allow node-local tasks
– If T1 < wait < T2, also allow rack-local tasks
– If wait > T2, also allow off-rack tasks
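The threshold rule maps a job's accumulated wait time to the locality levels it may use; a minimal sketch (T1 and T2 stand for the thresholds on the slide, and the function name is illustrative):

```python
def allowed_localities(wait, t1, t2):
    """Locality levels a job may launch at after waiting `wait` seconds."""
    levels = {"node-local"}
    if wait > t1:
        levels.add("rack-local")  # waited past T1: rack-local is acceptable
    if wait > t2:
        levels.add("off-rack")    # waited past T2: anything goes
    return levels

# A job that has waited 7s with T1=5, T2=10 may go rack-local, not off-rack:
print(allowed_localities(7.0, 5.0, 10.0))
```

The scheduler would consult this per job while scanning the queue in priority order, resetting a job's wait once it launches a task.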

Locality Wait Example

Before:
Job     Fair Share   Running Tasks
Job 1   2            1
Job 2   2            2

Jobs can now shift between slots:

Job     Fair Share   Running Tasks
Job 1   2            1
Job 2   2            3

Evaluation – Locality Gains

Throughput Gains

(figure: gains of 70%, 31%, 20%, and 18%)

Network Traffic Reduction

(figure: network traffic in sort workload, with and without locality wait)

Further Analysis

When is it worthwhile to wait, and how long?

For throughput:
– Always worth it, unless there's a hotspot
– If there is a hotspot, prefer to run IO-bound tasks on the hotspot node and CPU-bound tasks remotely (rationale: maximize the rate of local IO)

Further Analysis

When is it worthwhile to wait, and how long?

For response time:
E(gain) = (1 – e^(–w/t))(D – t)
where w is the wait amount, t is the expected time to get a local heartbeat, and D is the delay from running non-locally.
– Worth it if E(wait) < cost of running non-locally
– Optimal wait time is infinity

Problem 3: Memory-Aware Scheduling

a) How much memory does each job need?
– Asking users for per-job memory limits leads to overestimation
– Use historic data about working set size?

b) High-memory jobs may starve
– Reservation scheme + finish time estimation?

Problem 4: Reduce Scheduling

(figure sequence: map-slot and reduce-slot timelines for Jobs 1–3 as the workload evolves: Job 2 maps done; Job 2 reduces done; Job 3 submitted; Job 3 maps done)

Problem: Job 3 can't launch reduces until Job 1 finishes

Conclusion

A simple idea improves throughput by 70%.

Lots of future work:
– Memory-aware scheduling
– Reduce scheduling
– Intermediate-data-aware scheduling
– Using past history / learning job properties
– Evaluation using richer benchmarks