University of Michigan Electrical Engineering and Computer Science Chimera: Collaborative Preemption for Multitasking on a Shared GPU Jason Jong Kyu Park.

Presentation transcript:

1 University of Michigan, Electrical Engineering and Computer Science. Chimera: Collaborative Preemption for Multitasking on a Shared GPU. Jason Jong Kyu Park (1), Yongjun Park (2), and Scott Mahlke (1). (1) University of Michigan, Ann Arbor; (2) Hongik University.

2 GPUs in Modern Computer Systems. The GPU is now a default component in modern computer systems: servers, desktops, laptops, and mobile devices. It offloads data-parallel kernels written in CUDA, OpenCL, etc.

3 GPU Execution Model. [Figure: a kernel is launched as a collection of thread blocks, each made up of threads.]

4 Multitasking Needs in GPUs. [Figure: graphics (3D rendering) and data-parallel algorithms (e.g., augmented reality, Bodytrack) both need the GPU at the same time.]

5 Traditional Context Switching. [Timeline: when K2 launches, K1's context is saved and then K2's context is loaded before K2 runs.] The context per SM is 256 kB of registers + 48 kB of shared memory. On a GTX 780 (Kepler), with 288 GB/s of memory bandwidth shared by 12 SMs, a context switch takes ~88 us per SM (vs. ~1 us for a CPU context switch).

6 Challenge 1: Preemption Latency. The context switch (256 kB registers + 48 kB shared memory, ~88 us per SM on a GTX 780) is too long for latency-critical kernels.

7 Challenge 2: Throughput Overhead. While the context is being saved and loaded, no useful work is done on the SM.

8 Objective of This Work. [Chart: preemption cost vs. thread block progress (0% to 100%). Prior work uses switch, whose cost does not depend on progress.]

9 SM Draining [Tanasic '14]. [Timeline: when K2 launches, the SM stops issuing new thread blocks of K1; the running thread blocks drain to completion before K2 starts.]

10 Chimera. [Chart: preemption cost vs. thread block progress. Switch is cheaper for less-progressed thread blocks and drain is cheaper for nearly finished ones; this crossover is the opportunity Chimera exploits.]

11 SM Flushing. Instant preemption: throw away whatever was running on the SM and re-execute it from the beginning later (requires an idempotent kernel). Global memory is the only state observable by the GPU and CPU; a region is idempotent if the state it reads is not modified.

12 Finding Relaxed Idempotence. Idempotent regions are detected by the compiler; a region ends where re-execution would no longer be safe, i.e., at a CUDA atomic operation or at an overwrite of global state that was read:

__global__ void kernel_cuda(const float *in, float *out, float *inout) {
    ... = inout[idx];    // reads global state
    ...
    atomicAdd(...);      // CUDA atomic operation
    ...
    out[idx] = ...;      // global overwrite
    inout[idx] = ...;    // global overwrite of read state
}

13 Chimera. Flush near the beginning of a thread block, context switch in the middle, and drain near the end. [Chart: taking the minimum of the flush, switch, and drain costs over thread block progress approaches the optimal preemption cost.]

14 Independent Thread Block Execution. There is no shared state between SMs or between thread blocks, so each SM or thread block can be preempted with a different preemption technique.

15 Chimera: Collaborative Preemption. [Figure: across the SMs of the GPU, each thread block is preempted with flush, switch, or drain depending on its progress.]

16 Architecture. Chimera uses a two-level scheduler: a kernel scheduler plus a thread block scheduler. The kernel scheduler's SM scheduling policy decides how many SMs each kernel gets and produces the TB-to-kernel mapping.

17 Architecture. Chimera itself decides which SM will be preempted and which preemption technique to use.

18 Architecture. The thread block scheduler decides which thread block is scheduled next from a per-kernel thread block queue, and carries out the preemption decision: the preempted thread block is returned to its queue and the next thread block is issued.
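The two-level structure on these slides can be sketched in plain C++. This is a minimal illustration, not the paper's implementation: the names (partitionSMs, nextBlock) and the proportional SM-split policy are assumptions made here for clarity.

```cpp
#include <cassert>
#include <deque>
#include <utility>

// Illustrative sketch of the two-level scheduler: the kernel scheduler
// decides how many SMs each kernel gets, and a per-kernel thread block
// queue feeds the thread block scheduler.
struct Kernel {
    int id;
    std::deque<int> pendingBlocks;  // thread blocks not yet issued to an SM
};

// Kernel scheduler: split the SMs between two kernels in proportion to
// their remaining thread blocks (one possible SM scheduling policy,
// assumed here for illustration).
std::pair<int, int> partitionSMs(const Kernel& a, const Kernel& b, int numSMs) {
    std::size_t total = a.pendingBlocks.size() + b.pendingBlocks.size();
    if (total == 0) return {numSMs, 0};
    int smsForA = static_cast<int>(numSMs * a.pendingBlocks.size() / total);
    return {smsForA, numSMs - smsForA};
}

// Thread block scheduler: issue the next thread block of the kernel that
// owns the SM; a preempted block would be pushed back onto this queue.
int nextBlock(Kernel& k) {
    if (k.pendingBlocks.empty()) return -1;  // nothing left to schedule
    int tb = k.pendingBlocks.front();
    k.pendingBlocks.pop_front();
    return tb;
}
```

The point of the split is that the kernel scheduler only reasons about SM counts, while the thread block scheduler handles the mechanics of issuing and re-queuing individual blocks.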

19 Cost Estimation: Preemption Latency. Switch: context size / (memory bandwidth / # of SMs). Drain: estimated remaining instructions x CPI, where the remaining instructions are the average executed instructions per thread block minus the progress in instructions, measured at warp granularity. Flush: zero preemption latency.
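The three latency estimates can be written down as a small C++ cost model. The struct fields and function names are hypothetical, and units are simplified so that CPI is treated directly as time per instruction.

```cpp
#include <cassert>

// Illustrative per-SM state for the latency estimates (field names are
// assumptions; CPI is simplified to "time per instruction").
struct SMState {
    double contextBytes;     // registers + shared memory to save
    double perSMBandwidth;   // memory bandwidth available to this SM
    double avgBlockInsts;    // average instructions executed per thread block
    double progressInsts;    // instructions already executed, warp granularity
    double cpi;              // time per instruction
};

// Switch: time to move the context at this SM's share of the bandwidth.
double switchLatency(const SMState& s) {
    return s.contextBytes / s.perSMBandwidth;
}

// Drain: estimated remaining instructions times CPI.
double drainLatency(const SMState& s) {
    double remaining = s.avgBlockInsts - s.progressInsts;
    return (remaining > 0 ? remaining : 0) * s.cpi;
}

// Flush: preempts instantly.
double flushLatency(const SMState&) {
    return 0.0;
}
```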

20 Cost Estimation: Throughput Overhead. Switch: IPC * preemption latency * 2, doubled because the context is both saved and loaded. Flush: the instructions already executed (at warp granularity) are thrown away; the overhead is that of the most-progressed thread block in the SM. Drain: the instructions (at warp granularity) that must still be issued before the SM is free.
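The overhead estimates can be sketched the same way, in instructions of lost (or delaying) work. Again the helper names are assumptions, and the drain term is taken as the remaining instruction count from the slide.

```cpp
#include <cassert>

// Switch: IPC * preemption latency, doubled for context save and load.
double switchOverhead(double ipc, double preemptionLatency) {
    return ipc * preemptionLatency * 2.0;
}

// Flush: everything executed so far is thrown away and redone later.
double flushOverhead(double executedInsts) {
    return executedInsts;
}

// Drain: instructions still to be issued before the SM frees up.
double drainOverhead(double remainingInsts) {
    return remainingInsts;
}
```

Together with the latency model on the previous slide, this gives each (SM, technique) candidate a (latency, overhead) pair to compare.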

21 Chimera Algorithm. The preemption victim is chosen to have the least throughput overhead while meeting the preemption latency constraint. [Figure: candidate assignments of flush, switch, and drain to thread blocks across the SMs of the GPU.]

22 Chimera Algorithm. [Figure: for each candidate SM, the latency and overhead of each technique are estimated; candidates whose latency violates the constraint are discarded, and the remaining candidate with the least overhead is preempted.]
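The victim-selection rule on these two slides reduces to a single scan over (SM, technique) candidates. The Candidate struct and pickVictim function below are illustrative assumptions, not the paper's code.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// One (SM, technique) candidate with its estimated costs.
struct Candidate {
    int sm;
    const char* technique;  // "flush", "switch", or "drain"
    double latency;         // estimated preemption latency
    double overhead;        // estimated throughput overhead
};

// Pick the candidate with the least throughput overhead among those that
// meet the latency constraint. Returns its index, or -1 if none qualifies.
int pickVictim(const std::vector<Candidate>& cands, double latencyConstraint) {
    int best = -1;
    double bestOverhead = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < cands.size(); ++i) {
        if (cands[i].latency > latencyConstraint) continue;  // violates constraint
        if (cands[i].overhead < bestOverhead) {
            bestOverhead = cands[i].overhead;
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

With a tight constraint only flush (zero latency) survives the filter, which matches the intuition that flush is the fallback when switch and drain are both too slow.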

23 Experimental Setup. GPGPU-Sim v3.2.2, modeling a Fermi-architecture GPU: up to 32,768 registers (128 kB) and up to 48 kB of shared memory per SM. Workloads: 14 benchmarks from the Nvidia SDK, Parboil, and Rodinia. Two pairings: (1) GPGPU benchmark + synthetic benchmark, where the synthetic benchmark mimics a periodic real-time task (e.g., a graphics kernel) with a 1 ms period and 200 us execution time; (2) GPGPU benchmark + GPGPU benchmark. Baseline: non-preemptive first-come first-served.

24 Preemption Latency Violations. GPGPU benchmark + synthetic benchmark, with a 15 us preemption latency constraint (real-time task). Only 0.2% of preemptions violate the constraint; the violations come from non-idempotent kernels with short thread block execution times, for which the preemption latency was estimated shorter than it turned out to be.

25 System Throughput. Case study: LUD + another GPGPU benchmark. LUD has many kernel launches with varying numbers of thread blocks. For this case, Drain has a lower average normalized turnaround time than Chimera (5.17x for Drain vs. 5.50x for Chimera).

26 Preemption Technique Distribution. GPGPU benchmark + synthetic benchmark. [Chart: breakdown of preemptions handled by flush, switch, and drain for each benchmark.]

27 Summary. Context switching can have high overhead on GPUs, in both preemption latency and throughput. Chimera adds flushing for instant preemption and combines flush, switch, and drain collaboratively. It almost always meets the preemption latency constraint (0.2% violations, caused by estimating a shorter preemption latency than actually occurs), and it achieves a 5.5x ANTT improvement and a 12.2% STP improvement for GPGPU benchmark + GPGPU benchmark combinations.

28 Questions? [Recap figure: context switch, drain, and flush.]

