Chimera: Collaborative Preemption for Multitasking on a Shared GPU


1 Chimera: Collaborative Preemption for Multitasking on a Shared GPU
Jason Jong Kyu Park (University of Michigan, Ann Arbor), Yongjun Park (Hongik University), and Scott Mahlke (University of Michigan, Ann Arbor)

2 GPUs in Modern Computer Systems
GPUs are now a default component in modern computer systems: servers, desktops, laptops, and mobile devices. They offload data-parallel kernels written in CUDA, OpenCL, and similar frameworks.

3 GPU Execution Model
A kernel is launched as a grid of thread blocks, and each thread block is a group of threads; thread blocks are the unit of work the hardware assigns to SMs. [Figure: a grid of thread blocks, each composed of threads.]
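
This hierarchy can be made concrete with a minimal CUDA example (the kernel name and launch sizes are illustrative, not from the talk):

    #include <cuda_runtime.h>

    // One kernel; every thread scales one element. blockIdx, blockDim, and
    // threadIdx locate a thread within the grid of thread blocks.
    __global__ void scale(float *data, float alpha, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (idx < n)
            data[idx] *= alpha;
    }

    int main() {
        const int n = 1 << 15;                 // 32,768 elements
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        scale<<<256, 128>>>(d, 2.0f, n);       // grid of 256 thread blocks, 128 threads each
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }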

4 Multitasking Needs in GPUs
Example: augmented reality combines 3D rendering (a graphics workload) with body tracking (Bodytrack, a data-parallel algorithm) on the same GPU, so multiple kernels must share the device.

5 Traditional Context Switching
On a GTX 780 (Kepler), each SM holds up to 256 kB of registers plus 48 kB of shared memory. With 288.4 GB/s of memory bandwidth shared by 12 SMs, saving and restoring that context takes roughly 88 µs per SM, versus about 1 µs for a CPU context switch. [Timeline: K2 launches while K1 runs; K1's context is saved and K2's context is loaded before K2 executes.]

6 Challenge 1: Preemption latency
The same GTX 780 numbers from slide 5 apply: roughly 88 µs per SM of context save and load time is far too long for latency-critical kernels waiting to preempt. [Timeline as on slide 5.]

7 Challenge 2: Throughput overhead
During those ~88 µs of context save and load, the SM does no useful work: preemption by context switching also costs system throughput. [Timeline as on slide 5.]

8 Objective of This Work
[Plot: preemption cost vs. thread block progress (0% to 100%). Prior work, context switching, pays roughly the same cost regardless of how far a thread block has progressed.]

9 SM Draining [Tanasic '14]
Stop issuing new thread blocks to the SM and let the ones already running finish; K2 then launches on the drained SM. No work is wasted, but preemption can be slow when the running thread blocks have only just begun. [Timeline: K2 launches, K1's thread blocks drain, K2 takes over the SM.]

10 Chimera Opportunity
[Plot: preemption cost vs. thread block progress. The switch cost stays flat while the drain cost falls as a thread block approaches completion, so switching is cheaper early in a block's execution and draining is cheaper near the end.]

11 SM Flushing
Instant preemption: throw away what was running on the SM and re-execute it from the beginning later. This is safe for an idempotent kernel, one whose read state is not modified, so re-execution produces the same result. Global memory is the only observable state of a GPU kernel. [Diagram: CPU, GPU, and global memory.]

12 Finding Relaxed Idempotence
Relaxed idempotence is detected by the compiler. In the CUDA source below, atomic operations and overwrites of read state bound the idempotent regions:

    __global__ void kernel_cuda(const float *in, float *out, float *inout) {
        ... = inout[idx];    // inout is read: it becomes read state
        ...
        atomicAdd(...);      // atomic operation: ends an idempotent region
        ...
        out[idx] = ...;      // global overwrite of data never read: allowed
        inout[idx] = ...;    // overwrites read state: ends idempotence
    }

13 Chimera
Chimera chooses the cheapest technique for each thread block based on its progress: flush near the beginning (little work is lost), context switch in the middle, and drain near the end (little work remains). [Plot: preemption cost vs. thread block progress, with the optimal technique tracking the lower envelope of the flush, switch, and drain curves.]

14 Independent Thread Block Execution
There is no shared state between SMs or between thread blocks, so each SM (and each thread block) can be preempted with a different preemption technique independently. [Diagram: GPU with per-SM thread blocks and no shared state between them.]

15 Chimera: Collaborative Preemption
[Diagram: a GPU where, depending on each thread block's progress, one SM is flushed, another is drained, and another is context-switched; the three techniques collaborate to preempt the kernel.]

16 Architecture
Chimera uses a two-level scheduler. The kernel scheduler sets the SM scheduling policy, deciding how many SMs each kernel will have, and produces a TB-to-kernel mapping that the thread block scheduler consumes. [Diagram: kernel scheduler (SM scheduling policy, TB-to-kernel mapping) feeding the thread block scheduler, with Chimera inside it.]

17 Architecture
Within the thread block scheduler, Chimera decides which SM will be preempted and which preemption technique to use for it. [Same diagram as slide 16, highlighting the Chimera component.]

18 Architecture
The thread block scheduler decides which thread block is scheduled next, drawing from a per-kernel thread block queue and the TB-to-kernel mapping, and carries out Chimera's preemption decision; a preempted thread block re-enters its kernel's queue. A structural sketch follows. [Diagram: per-kernel thread block queues, TB-to-kernel mapping, next-TB issue, and the preempt signal.]
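
The components on slides 16-18 can be summarized in a structural sketch. The talk names the components but no API, so every identifier below is hypothetical:

    #include <cstdint>
    #include <queue>
    #include <vector>

    enum class Technique { Flush, Drain, Switch };

    struct ThreadBlock {
        uint32_t id;
        uint32_t kernel_id;
        uint64_t executed_insts;  // progress, measured at warp granularity
    };

    // Level 1: the kernel scheduler sets the SM scheduling policy
    // (how many SMs each kernel gets) and the TB-to-kernel mapping.
    struct KernelScheduler {
        std::vector<int> sms_per_kernel;                // SM quota per kernel
        std::vector<std::queue<ThreadBlock>> tb_queues; // one queue per kernel
    };

    // Level 2: the thread block scheduler issues the next thread block
    // and carries out Chimera's preemption decisions.
    struct ThreadBlockScheduler {
        ThreadBlock next(KernelScheduler &ks, uint32_t kernel_id) {
            ThreadBlock tb = ks.tb_queues[kernel_id].front();
            ks.tb_queues[kernel_id].pop();
            return tb;
        }
        void preempt(KernelScheduler &ks, const ThreadBlock &tb, Technique t) {
            // Per slide 18, a preempted thread block re-enters its kernel's
            // queue: flushed blocks restart from the beginning, switched
            // blocks resume from saved context. Drained blocks ran to
            // completion, so they need no requeue.
            if (t != Technique::Drain)
                ks.tb_queues[tb.kernel_id].push(tb);
        }
    };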

19 Cost Estimation: Preemption Latency
Switch: estimated latency = context size / (memory bandwidth / # of SMs).
Drain: instructions are measured at warp granularity; estimated remaining instructions (average instructions of a completed thread block minus the block's progress in instructions) × CPI = estimated preemption latency.
Flush: zero preemption latency.

20 Cost Estimation: Throughput
Switch: overhead = IPC × preemption latency × 2, doubled because the context must be both saved now and loaded back later.
Drain: again estimated from per-warp instruction counts and each thread block's progress; the most-progressed thread blocks on the same SM finish first and sit idle until the least-progressed one drains, and those idle issue slots are the overhead.
Flush: overhead = the instructions already executed (at warp granularity), since they are discarded and re-executed. Both estimates are sketched in code below.
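
To make the estimates of slides 19 and 20 concrete, here is a hedged host-side sketch. All names (SmState, the per-block vectors) are ours, not the paper's; latencies are kept in cycles by expressing memory bandwidth as bytes per cycle per SM, and the drain-overhead term follows our reading of slide 20 (early-finishing thread blocks idle until the last one drains):

    #include <algorithm>
    #include <vector>

    // Hypothetical per-SM state used by the cost estimators.
    struct SmState {
        double context_bytes;               // registers + shared memory in use
        double ipc;                         // instructions per cycle on this SM
        double cpi;                         // cycles per instruction
        std::vector<double> executed_insts; // per thread block, warp granularity
        double avg_total_insts;             // average insts of a completed block
    };

    // Switch: move the context over this SM's share of memory bandwidth.
    double switch_latency(const SmState &sm, double bytes_per_cycle_per_sm) {
        return sm.context_bytes / bytes_per_cycle_per_sm;   // cycles
    }
    double switch_overhead(const SmState &sm, double bytes_per_cycle_per_sm) {
        // Doubled: the context is saved now and must be loaded back later.
        return sm.ipc * switch_latency(sm, bytes_per_cycle_per_sm) * 2.0;
    }

    // Drain: the block with the most remaining instructions finishes last.
    double drain_latency(const SmState &sm) {
        double max_remaining = 0.0;
        for (double done : sm.executed_insts)
            max_remaining = std::max(max_remaining, sm.avg_total_insts - done);
        return max_remaining * sm.cpi;                      // cycles
    }
    double drain_overhead(const SmState &sm) {
        // Blocks that finish before the last one leave issue slots idle;
        // count those lost instruction slots (our reading of slide 20).
        double max_remaining = 0.0, total_remaining = 0.0;
        for (double done : sm.executed_insts) {
            double rem = sm.avg_total_insts - done;
            max_remaining = std::max(max_remaining, rem);
            total_remaining += rem;
        }
        return max_remaining * sm.executed_insts.size() - total_remaining;
    }

    // Flush: instant, but everything already executed is thrown away.
    double flush_latency(const SmState &) { return 0.0; }
    double flush_overhead(const SmState &sm) {
        double sum = 0.0;
        for (double done : sm.executed_insts) sum += done;
        return sum;
    }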

21 Chimera Algorithm
The preemption victim is the (SM, technique) choice with the least throughput overhead among those that meet the preemption latency constraint. [Diagram: a GPU where one SM is flushed, one is switched, and one is drained, each labeled with its chosen technique.]

22 Chimera Algorithm
Each candidate is checked against the latency constraint and ranked by estimated overhead; Chimera picks the qualifying candidate with the lowest overhead, as sketched below. [Diagram: per-SM latency and overhead estimates feeding the constraint check.]
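
A minimal sketch of that selection loop, assuming per-candidate costs were produced by estimators like the ones above; the single-victim simplification and every name here are ours:

    #include <limits>
    #include <vector>

    enum class Technique { Flush, Drain, Switch };

    struct Candidate {
        int sm_id;
        Technique technique;
        double latency;    // estimated preemption latency
        double overhead;   // estimated throughput overhead
    };

    // Among all (SM, technique) candidates, keep those whose estimated
    // latency meets the constraint, then pick the least throughput overhead.
    // Returns the index of the chosen candidate, or -1 if none qualifies.
    int choose_victim(const std::vector<Candidate> &cands, double latency_limit) {
        int best = -1;
        double best_overhead = std::numeric_limits<double>::infinity();
        for (int i = 0; i < (int)cands.size(); ++i) {
            if (cands[i].latency > latency_limit) continue;  // violates constraint
            if (cands[i].overhead < best_overhead) {
                best_overhead = cands[i].overhead;
                best = i;
            }
        }
        return best;  // least-overhead candidate meeting the latency constraint
    }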

23 Experimental Setup
Simulator: GPGPU-Sim v3.2.2 modeling a Fermi-architecture GPU, with up to 32,768 registers (128 kB) and up to 48 kB of shared memory per SM.
Workloads: 14 benchmarks from the Nvidia SDK, Parboil, and Rodinia, run in two kinds of pairs. GPGPU benchmark + synthetic benchmark: the synthetic kernel mimics a periodic, real-time task (e.g., a graphics kernel) with a 1 ms period and 200 µs execution time (a sketch of such a task follows). GPGPU benchmark + GPGPU benchmark.
Baseline: non-preemptive first-come first-served scheduling.
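
A sketch of what such a periodic synthetic task might look like, using the period and execution time from the slide. The spin kernel and its iteration count are our stand-ins; the talk does not give the actual benchmark code:

    #include <chrono>
    #include <thread>
    #include <cuda_runtime.h>

    // Busy-work kernel standing in for the synthetic real-time task.
    __global__ void busy_kernel(float *out, int iters) {
        float x = threadIdx.x * 1e-3f;
        for (int i = 0; i < iters; ++i)   // tune iters so one launch takes ~200 us
            x = x * 1.000001f + 1e-6f;
        out[threadIdx.x] = x;             // keep the work observable
    }

    int main() {
        float *d;
        cudaMalloc(&d, 256 * sizeof(float));
        auto next = std::chrono::steady_clock::now();
        for (int frame = 0; frame < 1000; ++frame) {
            busy_kernel<<<1, 256>>>(d, 200000);   // ~200 us of work (hypothetical count)
            cudaDeviceSynchronize();
            next += std::chrono::milliseconds(1); // 1 ms period
            std::this_thread::sleep_until(next);
        }
        cudaFree(d);
        return 0;
    }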

24 Preemption Latency Violations
With the GPGPU + synthetic benchmark pairs and a 15 µs preemption latency constraint (for the real-time task), Chimera violates the constraint in only 0.2% of preemptions. The violations occur for non-idempotent kernels with short thread block execution times, where Chimera estimated a shorter preemption latency than actually occurred.

25 System Throughput Case study: LUD + Other GPGPU benchmark
LUD has many kernel launches with varying number of thread blocks Drain has lower average normalized turnaround time (5.17x for Drain, 5.50x for Chimera)

26 Preemption Technique Distribution
[Chart: how often Chimera chose flush, drain, and switch across the GPGPU benchmark + synthetic benchmark pairs.]

27 Summary
Context switching can have high overhead on GPUs, both in preemption latency and in throughput.
Chimera contributes SM flushing for instant preemption, and collaborative preemption that combines flush, switch, and drain.
Chimera almost always meets the preemption latency constraint, with 0.2% violations caused by underestimated preemption latency.
It improves ANTT by 5.5x and STP by 12.2% for GPGPU benchmark + GPGPU benchmark combinations.

28 Questions?
[Closing diagram: drain, context switch, and flush.]

