Network-Aware Scheduling for Data-Parallel Jobs: Plan When You Can


1 Network-Aware Scheduling for Data-Parallel Jobs: Plan When You Can
Virajith Jalaparti Peter Bodik, Ishai Menache, Sriram Rao, Konstantin Makarychev, Matthew Caesar

2 Network scheduling important for data-parallel jobs
- Network-intensive stages (e.g., shuffle, join): more than 50% of job time spent in network transfers*
- Oversubscribed network from rack to core: ratios between 3:1 and 10:1
- Cross-rack bandwidth shared across applications: nearly 50% used for background transfers**
*Efficient Coflow Scheduling with Varys, SIGCOMM 2014
**Leveraging Endpoint Flexibility in Data-Intensive Clusters, SIGCOMM 2013

3 Several techniques proposed
[Figure: a MapReduce job, with map tasks (M) reading the input and feeding reducers (R)]
- Techniques that place tasks (e.g., Tetris, Quincy)
- Techniques that schedule network flows (e.g., Varys, Baraat, D3)
Limitation: all of these assume a fixed placement of the input data

4 Limitation of existing techniques
[Figure: map (M) and reduce (R) tasks of a MapReduce job scattered across Racks 1-4; arrows mark map input and map output/reduce input transfers]
Map input data is spread randomly (HDFS); hence, reduce input is spread randomly too.
Problems:
- Transfers use congested cross-rack links
- Contention with other jobs

5 Our proposal: Place input in a few racks

6 Our proposal: Place input in a few racks
[Figure: all map (M) and reduce (R) tasks of the job placed in Rack 1; Racks 2-4 carry none of its map input or map output/reduce input]
- Map input data placed in one rack
- Reduce input in the same rack
- All transfers stay within the rack: rack-level locality

7 Our proposal: Place input in a few racks
Benefits?
- High bandwidth between tasks
- Reduced contention across jobs
Is placing data feasible? Scenarios where it is:
- Recurring jobs (~40% of jobs), known ahead of time
- Separate storage and compute clusters

8 Challenges
- How many racks to assign a job? One rack may be sufficient for 75-95% of jobs; 5-25% of jobs need more than one rack.
- How to avoid hotspots? Plan offline using job properties.
- How to determine job characteristics? Use history: for recurring jobs, characteristics such as input data size can be predicted with low error (~6.5% on average). [Figure: input data size (log10 scale) of a recurring job over 10 days]
- What about jobs without history (ad hoc jobs)? They benefit from the resources freed up by recurring jobs.

9 Coordinated placement of data and compute
Our system, Corral:
- Coordinated placement of data and compute
- Exploits predictable characteristics of jobs

10 Corral architecture
[Figure: offline, a planner takes future job estimates and produces placement hints Job i → {Racks(i), Prio(i)}; online, these hints drive the data placement policy (when data for Job i is uploaded) and the task placement policy in the cluster scheduler (when Job i is submitted)]
- Offline planner: solves an offline planning problem to compute Racks(i) and Prio(i) for each job i
- Data placement policy: one data replica constrained to Racks(i); other replicas spread randomly for fault tolerance
- Task placement policy: tasks assigned slots only in Racks(i); ties broken using Prio(i)
A sketch of the two online policies follows below.
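The following is a minimal sketch of these two online policies, not Corral's actual HDFS/YARN code; the hint map, function names, and the direction of the priority tie-break are assumptions made for illustration.

```python
import random

# Placement hints produced by the offline planner (assumed shape):
# job_id -> (list of assigned racks, priority)
hints = {"job_1": (["rack1"], 0), "job_2": (["rack2", "rack3"], 1)}

def place_replicas(job_id, hints, all_racks, num_replicas=3):
    """Data placement: one replica constrained to the job's racks,
    remaining replicas spread over other racks for fault tolerance."""
    assigned, _prio = hints[job_id]
    constrained = random.choice(assigned)
    others = random.sample([r for r in all_racks if r != constrained],
                           num_replicas - 1)
    return [constrained] + others

def pick_task_slot(job_id, hints, free_slots):
    """Task placement: offer the job a free slot only if it lies in the
    job's racks (free_slots is a list of (rack, slot_id) pairs). When
    several jobs compete for a slot, the scheduler breaks ties using
    the lower Prio value (assumed tie-break direction)."""
    assigned, _prio = hints[job_id]
    for rack, slot in free_slots:
        if rack in assigned:
            return (rack, slot)
    return None  # no slot available in the job's racks right now

print(place_replicas("job_1", hints, ["rack1", "rack2", "rack3", "rack4"]))
print(pick_task_slot("job_2", hints, [("rack1", 5), ("rack3", 2)]))
```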

11 Outline
- Motivation
- Architecture
- Planning problem: formulation, solution
- Evaluation
- Conclusion

12 Planning problem: Formulation
Given a set of jobs and their arrival times, find a schedule Job i → {Racks(i), Prio(i)} that meets their goals.
Scenarios:
- Batch: minimize makespan
- Online: minimize average job completion time
This talk focuses on the batch scenario. A rough formalization of the two objectives is given below.
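In our own notation (the slide names only the objectives), with C_i the completion time of job i under the chosen schedule and a_i its arrival time:

```latex
% Batch scenario: minimize the makespan
\min_{\{\mathrm{Racks}(i),\,\mathrm{Prio}(i)\}} \; \max_{i} \; C_i
% Online scenario: minimize the average job completion time
\min_{\{\mathrm{Racks}(i),\,\mathrm{Prio}(i)\}} \; \frac{1}{n} \sum_{i=1}^{n} \left( C_i - a_i \right)
```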

13 Planning problem: Solution
Two phases:
- Provisioning phase: how many racks to allocate to each job?
- Prioritization phase: which rack(s) to allocate to each job?
Provisioning loop:
- Initialize: every job is assigned one rack
- Iterate: schedule the jobs on the cluster, then increase the #racks of the longest job
- Select the schedule with minimum makespan
A sketch of the provisioning loop follows; the prioritization phase is detailed on the next slide.
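A minimal sketch of the provisioning loop, assuming a latency(job, num_racks) function derived from the job's latency-response curve and a schedule(jobs, racks) routine (for example, the prioritization sketch on the next slide) that returns the makespan of a packing; the function names and the stopping condition are our own, not Corral's implementation.

```python
def provision(jobs, total_racks, latency, schedule):
    """Iteratively widen the currently-longest job and keep the best schedule."""
    racks = {j: 1 for j in jobs}                      # initialize: one rack per job
    best_racks, best_makespan = dict(racks), schedule(jobs, racks)
    while True:
        # the job whose estimated latency is largest under its current allocation
        longest = max(jobs, key=lambda j: latency(j, racks[j]))
        if racks[longest] == total_racks:
            break                                     # cannot widen it any further
        racks[longest] += 1                           # iterate: widen the longest job
        makespan = schedule(jobs, racks)
        if makespan < best_makespan:
            best_racks, best_makespan = dict(racks), makespan
    return best_racks, best_makespan                  # schedule with minimum makespan
```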

14 Planning problem: Solution
[Figure: example with jobs A, B, and C packed onto the cluster's racks over successive provisioning iterations (Iter=1, 2, 3, ...); the makespan drops from 300s to 250s to 225s, and the schedule with the minimum makespan is selected]
[Figure: latency-response curve, i.e., job latency as a function of #racks]
- Job latency is determined using latency-response curves
- Prioritization: schedule widest-job first; ties broken using longest-job first (longest-job first alone can lead to wasted resources)
- Planning assumptions: jobs are "rectangular" and use their racks exclusively; the runtime is work-conserving
- The resulting schedule performs at most 3%-15% worse than optimal
A sketch of the widest-job-first packing follows.
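A minimal sketch of the prioritization phase under the rectangular-job assumption (our simplification: each job holds racks[j] racks exclusively for latency(j, racks[j]) seconds, and all jobs are available at time zero as in the batch scenario); it can also serve as the schedule() routine used by the provisioning sketch above.

```python
def prioritize(jobs, racks, total_racks, latency):
    """Pack jobs widest-first (ties: longest-first); return finish times and makespan."""
    order = sorted(jobs,
                   key=lambda j: (racks[j], latency(j, racks[j])),
                   reverse=True)                           # widest first, then longest
    free_at = [0.0] * total_racks                          # time at which each rack frees up
    finish = {}
    for j in order:
        # pick the racks[j] racks that become free the earliest
        chosen = sorted(range(total_racks), key=lambda r: free_at[r])[:racks[j]]
        start = max(free_at[r] for r in chosen)            # wait until all chosen racks are free
        end = start + latency(j, racks[j])
        for r in chosen:
            free_at[r] = end                               # exclusive use of the chosen racks
        finish[j] = end
    return finish, max(finish.values())
```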

15 Outline
- Motivation
- Architecture
- Planning problem: formulation, solution
- Evaluation
- Conclusion

16 Evaluation
- (Recurring) MapReduce workloads
- Mix of recurring and ad hoc jobs
- DAG-style workloads
- Sensitivity analysis
- Benefits with flow-level schedulers

17 Evaluation: Setup
Implementation: Corral implemented in YARN (modified HDFS and the Resource Manager)
Cluster: 210 machines across 7 racks; 10 Gbps per machine with 5:1 oversubscription; background traffic consuming ~50% of the cross-rack bandwidth
Baselines:
- Yarn-CS: the capacity scheduler in YARN
- ShuffleWatcher [ATC'14]: schedules jobs to minimize cross-rack data transferred

18 Evaluation: MapReduce workloads
Workloads: Quantcast (W1, 200 jobs), Yahoo (W2, 400 jobs), Microsoft Cosmos (W3)
[Figure: makespan and average reducer time (sec) per workload, with reductions of roughly 40-42% annotated]
Corral reduces makespan by 10-33%. Reasons:
- Improved network locality
- Reduced contention

19 Evaluation: Mix of jobs
Workload: 100 recurring jobs and 50 ad hoc jobs from W1
[Figure: improvements for recurring jobs (~42%, ~37%) and for ad hoc jobs (~27%, ~10%)]
Recurring jobs finish faster and free up resources for the ad hoc jobs.

20 Corral summary
- Exploits predictable characteristics of data-parallel jobs
- Places data and compute together in a few racks
- Up to 33% (56%) reduction in makespan (average job time)
- Provides benefits orthogonal to flow-level techniques

