
Slide 1: ValuePack: Value-Based Scheduling Framework for CPU-GPU Clusters

Vignesh Ravi, Michela Becchi, Gagan Agrawal, Srimat Chakradhar

Slide 2: Context

GPUs are used in supercomputers; some of the Top500 machines rely on them:
- Tianhe-1A: 14,336 Xeon X5670 processors and 7,168 Nvidia Tesla M2050 GPUs
- Stampede: about 6,000 nodes with Xeon E5-2680 8C CPUs and Intel Xeon Phi coprocessors

GPUs are also used in cloud computing.

Takeaway: there is a need for resource managers and scheduling schemes for heterogeneous clusters that include many-core GPUs.

Slide 3: Categories of Scheduling Objectives

Traditional schedulers for supercomputers aim to improve system-wide metrics: throughput and latency.

A market-based service world is emerging, focused on the provider's profit and the user's satisfaction:
- Cloud: pay-as-you-go model; Amazon offers different user classes (On-Demand, Free, Spot, ...)
- Recent resource managers for supercomputers (e.g., MOAB) have the notion of a service-level agreement (SLA)

Slide 4: Motivation (State of the Art)

Open-source batch schedulers are starting to support GPUs:
- TORQUE, SLURM
- Users guide the mapping of jobs to heterogeneous nodes
- Simple scheduling schemes (goals: throughput and latency)

Recent proposals describe runtime systems and virtualization frameworks for clusters with GPUs:
- [gViM HPCVirt'09] [vCUDA IPDPS'09] [rCUDA HPCS'10] [gVirtuS Euro-Par'10] [our HPDC'11, CCGRID'12, HPDC'12]
- Simple scheduling schemes (goals: throughput and latency)

Proposals on market-based scheduling policies focus on homogeneous CPU clusters:
- [Irwin HPDC'04] [Sherwani Soft.Pract.Exp.'04]

Our goal: reconsider market-based scheduling for heterogeneous clusters that include GPUs.

Slide 5: Considerations

The community is looking into code portability between CPU and GPU:
- OpenCL
- PGI CUDA-x86
- MCUDA (CUDA to C), Ocelot, SWAN (CUDA to OpenCL), OpenMPC
→ Opportunity to flexibly schedule a job on either the CPU or the GPU

In cloud environments, oversubscription is commonly used to reduce infrastructure costs.
→ Use resource sharing to improve performance by maximizing hardware utilization

Slide 6: Problem Formulation

Given a CPU-GPU cluster, schedule a set of jobs on the cluster:
- To maximize the provider's profit / aggregate user satisfaction
- Exploit the portability offered by OpenCL: flexibly map each job onto either the CPU or the GPU
- Maximize resource utilization: allow sharing of a multi-core CPU or a GPU

Assumptions/Limitations:
- One multi-core CPU and one GPU per node
- Single-node, single-GPU jobs
- Only space sharing, limited to two jobs per resource

Slide 7: Value Function (Market-Based Scheduling Formulation)

For each job, a linear-decay value function [Irwin HPDC'04]:
- Max value → importance/priority of the job
- Decay rate → urgency of the job
- Delay is due to queuing, execution on a non-optimal resource, and resource sharing

(Figure: yield/value as a function of execution time, starting at the max value and falling at the decay rate.)

Yield = maxValue - decayRate * delay
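A minimal sketch of this value function in Python; the function and field names are illustrative, not taken from the ValuePack code:

```python
def yield_value(max_value, decay_rate, delay):
    """Linear-decay value function [Irwin HPDC'04]: yield starts at the
    job's max value and drops at the decay rate for every unit of delay
    (queuing, non-optimal resource, sharing). It can go negative if a
    job is delayed long enough."""
    return max_value - decay_rate * delay
```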

Slide 8: Overall Scheduling Approach

Jobs arrive in batches and flow through three phases (a skeleton of the flow is sketched below):
- Phase 1: Mapping. Each job is enqueued into the CPU queue or the GPU queue of its optimal resource (based on its optimal walltime). This phase is oblivious of other jobs.
- Phase 2: Sorting. Jobs in each queue are sorted to improve yield, taking inter-job scheduling considerations into account.
- Phase 3: Re-mapping. Jobs may be moved to their non-optimal resource; different schemes answer two questions: when to remap, and what to remap?
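A skeleton of the three-phase flow, under stated assumptions: optimal_resource(), reward(), and remap() are hypothetical helpers, sketched under the corresponding phase slides below.

```python
def schedule_batch(batch, cpu_queue, gpu_queue):
    # Phase 1: mapping -- each job goes to the queue of its optimal
    # resource, oblivious of the other jobs in the batch.
    for job in batch:
        (cpu_queue if optimal_resource(job) == "cpu" else gpu_queue).append(job)

    # Phase 2: sorting -- order each queue by reward to improve the
    # aggregate yield (inter-job consideration).
    for queue in (cpu_queue, gpu_queue):
        snapshot = list(queue)  # a job's reward depends on the other queued jobs
        queue.sort(key=lambda j: reward(j, snapshot), reverse=True)

    # Phase 3: re-mapping -- move jobs between the queues according to
    # the chosen scheme (when to remap / what to remap; sketched below).
    remap(cpu_queue, gpu_queue)
```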

Slide 9: Phase 1: Mapping

Users provide a walltime for both CPU and GPU:
- The walltimes are used as indicators of the optimal / non-optimal resource
- Each job is mapped onto its optimal resource

NOTE: in our experiments we assumed maxValue = optimal walltime.
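A one-function sketch of this mapping; walltime_cpu and walltime_gpu are assumed field names for the user-supplied estimates:

```python
def optimal_resource(job):
    # The resource with the shorter user-supplied walltime is treated
    # as the job's optimal resource.
    return "cpu" if job.walltime_cpu <= job.walltime_gpu else "gpu"
```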

Slide 10: Phase 2: Sorting

Sort jobs based on Reward [Irwin HPDC'04]:

Present value: f(maxValue_i, discount_rate)
- The value after discounting the risk of running a job
- The shorter the job, the lower the risk

Opportunity cost:
- The degradation in value due to the selection of one among several alternatives
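A sketch of reward-based sorting under stated assumptions: the discounting form in present_value() is one plausible choice, not necessarily the one used in [Irwin HPDC'04] or in ValuePack, and all field names are illustrative.

```python
def present_value(job, discount_rate=0.01):
    # Discount the job's max value by the risk of occupying a resource
    # for its expected runtime; shorter jobs are discounted less.
    # (Assumed discounting form, for illustration only.)
    return job.max_value / (1.0 + discount_rate * job.optimal_walltime)

def reward(job, queue):
    # Opportunity cost: value the other queued jobs lose (at their
    # decay rates) while this job holds the resource.
    opportunity_cost = sum(other.decay_rate * job.optimal_walltime
                           for other in queue if other is not job)
    return present_value(job) - opportunity_cost
```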

Slide 11: Phase 3: Re-mapping

When to remap:
- Uncoordinated schemes: when a queue is empty and its resource is idle
- Coordinated scheme: when the CPU and GPU queues are imbalanced

What to remap:
- Which job will have the best reward on the non-optimal resource?
- Which job will suffer the least reward penalty?

Slide 12: Phase 3: Uncoordinated Schemes

1. Last Optimal Reward (LOR)
   - Remap the job with the least reward on its optimal resource
   - Idea: least reward → least risk in moving

2. First Non-Optimal Reward (FNOR)
   - Compute the reward each job could produce on its non-optimal resource
   - Remap the job with the highest reward on the non-optimal resource
   - Idea: take the non-optimal penalty into account

3. Last Non-Optimal Reward Penalty (LNORP)
   - Remap the job with the least reward degradation:
     RewardDegradation_i = OptimalReward_i - NonOptimalReward_i
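The three selection policies could look like the sketch below; reward_on(job, resource) is an assumed helper that evaluates a job's reward on its optimal or non-optimal resource.

```python
def pick_job_to_remap(queue, scheme, reward_on):
    if scheme == "LOR":    # least reward on the optimal resource
        return min(queue, key=lambda j: reward_on(j, "optimal"))
    if scheme == "FNOR":   # highest reward on the non-optimal resource
        return max(queue, key=lambda j: reward_on(j, "non_optimal"))
    if scheme == "LNORP":  # least reward degradation when moved
        return min(queue, key=lambda j: reward_on(j, "optimal")
                                      - reward_on(j, "non_optimal"))
    raise ValueError(f"unknown scheme: {scheme}")
```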

Slide 13: Phase 3: Coordinated Scheme

Coordinated Least Penalty (CORLP)

When to remap: on imbalance between the queues
- Imbalance is affected by the decay rates and execution times of the jobs
- Total Queuing-Delay Decay-Rate Product (TQDP)
- Remap if |TQDP_CPU - TQDP_GPU| > threshold

What to remap:
- The job with the least remapping penalty (reward degradation)
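A sketch of the trigger condition, directly following the slide's definition; queuing_delay and decay_rate are assumed per-job fields.

```python
def tqdp(queue):
    # Total Queuing-Delay x Decay-Rate Product: the "pressure" of a
    # queue, weighted by the urgency (decay rate) of its pending jobs.
    return sum(job.queuing_delay * job.decay_rate for job in queue)

def should_remap(cpu_queue, gpu_queue, threshold):
    # CORLP triggers remapping only when the queues are imbalanced.
    return abs(tqdp(cpu_queue) - tqdp(gpu_queue)) > threshold
```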

Slide 14: Resource Sharing Heuristic

Limitation: two jobs can space-share a CPU/GPU.

Factors affecting sharing:
- (-) Slowdown incurred by jobs using half of a resource
- (+) More resources become available for other jobs

Jobs are categorized as low, medium, or high scaling (based on models/profiling).

When to enable sharing:
- When a large fraction of jobs in the pending queues have negative yield

What jobs share a resource (see the sketch below):
- Scalability-DecayRate factor: jobs are grouped by scalability; within each group, jobs are ordered by decay rate (urgency)
- Pick the top K fraction of jobs, where K is tunable (favoring low scalability and low decay)
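A sketch of the candidate-selection step, assuming each pending job carries a scalability category and a decay rate (illustrative field names):

```python
def pick_sharing_candidates(pending, k_fraction):
    # Group by scalability, then order within each group by decay rate
    # (urgency); the tuple key realizes "group, then order".
    order = {"low": 0, "medium": 1, "high": 2}
    ranked = sorted(pending,
                    key=lambda j: (order[j.scalability], j.decay_rate))
    # Take the top K fraction: low-scalability, low-decay jobs first,
    # since they lose the least from running on half a resource.
    return ranked[:int(len(ranked) * k_fraction)]
```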


Slide 19: Overall System Prototype

(Diagram: a master node connected over TCP to multiple compute nodes, each with a multi-core CPU and a GPU.)

Master node (centralized decision making):
- Cluster-level scheduler with the scheduling schemes and policies
- TCP communicator
- Submission queue, pending queues, execution queues, finished queues

Compute nodes (execution and sharing mechanisms), each running a node-level runtime:
- TCP communicator
- CPU execution processes: OS-based scheduling and sharing
- GPU execution processes: GPU consolidation framework

Assumption: a shared file system.

Slide 21: GPU-Related Node-Level Runtime (GPU Sharing Framework)

(Diagram: CUDA apps 1..N, each linked against a CUDA interception library, communicate over a front-end/back-end channel with a back-end server sitting on top of the CUDA runtime and driver.)

- Front-end: the GPU execution processes; each CUDA app's calls are caught by a CUDA interception library and forwarded to the back-end
- Back-end server (GPU consolidation framework): a workload consolidator maintains a single virtual context and maps the incoming apps onto CUDA streams 1..N
- Kernel configurations are manipulated to allow GPU space sharing
- A simplified version of our HPDC'11 runtime

Slide 22: Experimental Setup

16-node cluster:
- CPU: 8-core Intel Xeon E5520 (2.27 GHz), 48 GB memory
- GPU: Nvidia Tesla C2050 (1.15 GHz), 3 GB device memory

256-job workload:
- 10 benchmark programs
- 3 configurations: small, large, very large datasets
- Various application domains: scientific computations, financial analysis, data mining, machine learning

Baselines:
- TORQUE (always the optimal resource)
- Minimum Completion Time (MCT) [Maheswaran et al., HCW'99]

Slide 23: Throughput & Latency: Comparison on TORQUE-Based Metrics

(Charts: completion time and average latency; completion time is 10-20% better, average latency about 20% better than the baselines.)

- The baselines suffer from idle resources
- By privileging shorter jobs, our schemes reduce queuing delays

Slide 24: Yield: Effect of Job Mix (Average Yield Metric)

(Charts: uniform, skewed-CPU, and skewed-GPU job mixes; up to 2.3x and up to 8.8x better average yield than the baselines.)

Our schemes do better on skewed job mixes:
- More idle time in the case of the baseline schemes
- More room for dynamic mapping

Slide 25: Yield: Effect of Value Function (Average Yield Metric)

(Charts: two value-function settings; up to 3.8x and up to 6.9x better average yield than the baselines.)

- Our schemes adapt to different value functions

Slide 26: Yield: Effect of System Load (Average Yield Metric)

(Chart: average yield vs. system load; up to 8.2x better than the baselines.)

- As load increases, the yield from the baselines decreases linearly
- The proposed schemes initially achieve increased yield, then sustain it

Slide 27: Yield: Effect of Sharing

(Chart: yield improvement vs. the fraction of jobs allowed to share; up to 23x improvement.)

- Careful space sharing can help performance by freeing resources
- Excessive sharing can be detrimental to performance

Slide 28: Conclusion

- Value-based scheduling on CPU-GPU clusters; goal: improve the aggregate yield
- Coordinated and uncoordinated scheduling schemes for dynamic mapping
- Automatic space sharing of resources based on heuristics
- A prototype framework for evaluating the proposed schemes
- Improvement over the state of the art, both in completion time & latency and in average yield

