Network-Aware Scheduling for Data-Parallel Jobs: Plan When You Can
Virajith Jalaparti, Peter Bodik, Ishai Menache, Sriram Rao, Konstantin Makarychev, Matthew Caesar
Network scheduling is important for data-parallel jobs
- Network-intensive stages (e.g., shuffle, join): more than 50% of time spent in network transfers*
- Oversubscribed network from rack to core: ratios between 3:1 and 10:1
- Cross-rack bandwidth shared across apps: nearly 50% used for background transfers**

*Efficient Coflow Scheduling with Varys, SIGCOMM 2014. **Leveraging Endpoint Flexibility in Data-Intensive Clusters, SIGCOMM 2013.
Several techniques proposed
[Diagram: a MapReduce job, with maps (M) reading input and transferring data to reducers (R)]
- Focus on placing tasks (e.g., Tetris, Quincy)
- Focus on scheduling network flows (e.g., Varys, Baraat, D3)
- Limitation: they assume a fixed input data placement
Limitation of existing techniques
[Diagram: a MapReduce job with map input (and hence map output / reduce input) spread across Racks 1-4]
- Map input data spread randomly (HDFS); hence, reduce input is spread randomly
- Problems: transfers use congested cross-rack links; contention with other jobs
Our proposal: Place input in a few racks
[Diagram: a MapReduce job with all maps and reducers placed in Rack 1 of Racks 1-4]
- Map input data placed in one rack
- Reduce input in the same rack
- All transfers stay within the rack: rack-level locality
Our proposal: Place input in a few racks (cont.)
- High bandwidth between tasks
- Reduced contention across jobs
- Scenarios: recurring jobs (~40% of jobs) known ahead of time; separate storage and compute clusters
- Is placing data feasible? What are the benefits?
Challenges
- How many racks to assign a job? One rack may be sufficient for 75-95% of jobs; 5-25% of jobs need more than one rack
- How to avoid hotspots? Offline planning using job properties
- How to determine job characteristics? Use history; input data size can be predicted with low error for recurring jobs (~6.5% on average)
[Chart: input data size (log10 scale) of a recurring job over days 1-10]
- What about jobs without history (ad hoc jobs)? They benefit from freed-up resources
Our system: Corral
- Coordinated placement of data and compute
- Exploits predictable characteristics of jobs
Corral architecture
[Diagram: offline, a planner takes future job estimates and produces placement hints Job i → {Racks(i), Prio(i)}; online, when Job i is submitted and its data uploaded, the cluster scheduler applies the data and task placement policies]
- Offline planner solves an offline planning problem, producing Racks(i) and Prio(i) for each job i
- Data placement: one data replica constrained to Racks(i); other replicas spread randomly for fault tolerance
- Task placement: tasks assigned slots only in Racks(i); ties broken using Prio(i)
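The two online policies can be sketched in a few lines. This is a minimal Python illustration, not Corral's actual HDFS/YARN code: the function names, the `hints` structure, and the slot-selection loop are all my own assumptions.

```python
import random

def place_replicas(job, all_racks, hints, n_replicas=3):
    """Data placement policy: the first replica is constrained to the
    job's planned racks; remaining replicas spread randomly for fault
    tolerance."""
    planned = hints[job]["racks"]
    first = random.choice(planned)
    others = random.sample([r for r in all_racks if r != first], n_replicas - 1)
    return [first] + others

def pick_task_slot(job, free_slots, hints):
    """Task placement policy: assign a slot only if it is in the job's
    planned racks; otherwise the job waits (ties between competing jobs
    would be broken by Prio(i), omitted here)."""
    planned = set(hints[job]["racks"])
    for rack, slot in free_slots:
        if rack in planned:
            return (rack, slot)
    return None  # no slot free in the planned racks yet
```

In the real system these hooks live inside modified HDFS block placement and the YARN Resource Manager, respectively.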
Outline
- Motivation
- Architecture
- Planning problem: formulation and solution
- Evaluation
- Conclusion
Planning problem: Formulation
- Given a set of jobs and their arrival times, find a schedule Job i → {Racks(i), Prio(i)} which meets their goals
- Scenarios: batch (minimize makespan); online (minimize average job completion time)
- Focus on the batch scenario in this talk
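The two objectives can be written compactly (the notation is mine, not from the slides): let $a_i$ be the arrival time and $C_i$ the completion time of job $i$, with $n$ jobs in total.

```latex
% Batch scenario: minimize the makespan
\min \; \max_{i} C_i
% Online scenario: minimize the average job completion time
\min \; \frac{1}{n} \sum_{i=1}^{n} \left( C_i - a_i \right)
```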
Planning problem: Solution
- Provisioning phase: how many racks to allocate to each job?
- Prioritization phase: which rack(s) to allocate to each job?
- Initialize: all jobs assigned one rack; schedule on the cluster
- Iterate: increase the #racks of the longest job; re-schedule
- Select the schedule with minimum makespan
Planning problem: Solution (example)
[Figure: iterations allocating 2-3 racks among jobs A, B, C; as the longest job gets more racks, the makespan drops from 300s to 250s to 225s]
- Job latency at a given #racks is determined using latency-response curves
- Schedule widest-job first; ties broken using longest-job first (longest-job first alone can lead to wasted resources)
- Planning assumption: jobs are "rectangular" and use racks exclusively; the runtime is work-conserving
- Performs at most 3-15% worse than optimal
- Select the schedule with minimum makespan
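The provisioning loop can be sketched as follows. This is a simplified Python sketch, not the paper's exact algorithm: the greedy widest-first makespan estimator and all names are my own approximation, assuming latency-response curves are given as a dict mapping job → #racks → latency.

```python
def makespan(jobs, total_racks):
    """Greedy list schedule: widest job first, ties broken by longest
    first. Each job occupies `racks` racks exclusively for `latency`
    seconds (the 'rectangular' planning assumption)."""
    order = sorted(jobs, key=lambda j: (-j["racks"], -j["latency"]))
    free_at = [0.0] * total_racks  # time at which each rack becomes free
    end = 0.0
    for job in order:
        free_at.sort()  # grab the racks that free up earliest
        start = free_at[job["racks"] - 1]  # must wait for all needed racks
        finish = start + job["latency"]
        for r in range(job["racks"]):
            free_at[r] = finish
        end = max(end, finish)
    return end

def provision(curves, total_racks):
    """curves[i][r] = latency of job i when run on r racks.
    Start every job at one rack, repeatedly give the longest job one
    more rack, and keep the allocation with the smallest makespan."""
    alloc = {i: 1 for i in curves}
    best = None
    while True:
        jobs = [{"racks": r, "latency": curves[i][r]} for i, r in alloc.items()]
        ms = makespan(jobs, total_racks)
        if best is None or ms < best[0]:
            best = (ms, dict(alloc))
        longest = max(alloc, key=lambda i: curves[i][alloc[i]])
        if alloc[longest] + 1 not in curves[longest] or alloc[longest] + 1 > total_racks:
            break  # no curve point (or no room) for a wider allocation
        alloc[longest] += 1
    return best
```

For example, with `curves = {"A": {1: 300, 2: 160, 3: 110}, "B": {1: 125}, "C": {1: 50}}` on 3 racks, widening job A from one to two racks lowers the estimated makespan, mirroring the iteration shown on the slide.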
Outline
- Motivation
- Architecture
- Planning problem: formulation and solution
- Evaluation
- Conclusion
Evaluation
- MapReduce workloads (recurring)
- Mix of recurring and ad hoc jobs
- DAG-style workloads
- Sensitivity analysis
- Benefits with flow-level schedulers
Evaluation: Setup
- Implemented Corral in YARN: modified HDFS and the Resource Manager
- Cluster: 210 machines, 7 racks; 10 Gbps/machine, 5:1 oversubscription; background traffic (~50% of cross-rack bandwidth)
- Baselines: Yarn-CS (capacity scheduler in YARN); ShuffleWatcher [ATC '14], which schedules to minimize cross-rack data transferred
Evaluation: MapReduce workloads # of jobs Quantcast (W1) 200 Yahoo (W2) 400 Microsoft Cosmos (W3) Reasons Avg. reducer time (sec) ~42% ~40% Corral reduces makespan by 10-33% Improved network locality Reduced contention
Evaluation: Mix of jobs
- 100 recurring jobs and 50 ad hoc jobs from W1
[Charts: improvements of ~42% and ~37% for recurring jobs, ~27% and ~10% for ad hoc jobs]
- Recurring jobs finish faster and free up resources
Corral summary
- Exploits predictable characteristics of data-parallel jobs
- Places data and compute together in a few racks
- Up to 33% reduction in makespan and up to 56% reduction in average job time
- Provides benefits orthogonal to flow-level techniques