1 Automatically Exploiting Implicit Pipeline Parallelism from Multiple Dependent Kernels for GPUs
Gwangsun Kim, Jiyun Jeong, John Kim, Mark Stephenson

2 GPU Background
A kernel is launched as a grid of CTAs (Cooperative Thread Arrays), also called thread blocks; each CTA is a group of threads. The GPU consists of multiple SMs (Streaming Multiprocessors), each containing cores and control logic, and the CTAs of the grid are dispatched onto the SMs.
[Figure: a kernel grid of CTAs mapped onto the GPU's SMs.]
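For readers new to CUDA, a minimal kernel makes this hierarchy concrete; the kernel name, data, and launch shape below are hypothetical, not from the paper.

```cuda
// A grid of CTAs (thread blocks) is launched; each CTA is a group of
// threads that the hardware schedules onto one SM.
__global__ void add_one(int *data, int n) {
    // Global index = CTA index within the grid * CTA size
    //              + thread index within the CTA.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Launch: a grid of 4 CTAs, 256 threads per CTA (covers n = 1024):
//   add_one<<<4, 256>>>(d_data, 1024);
```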

3 Different Stages of GPU Workloads
Prelude: input data initialization (e.g., reading from an SSD).
H2D (host-to-device) copy.
Kernel execution.
D2H (device-to-host) copy.
Postlude: writing output data (e.g., to an SSD).
Non-kernel overhead takes ~77% of runtime on average [Zhang et al., PACT'15].
[Figure: timeline of the five stages over the system hierarchy: SSD to CPU at ~5 GB/s, host memory to device memory over PCIe 3.0 at ~32 GB/s, device memory (GDDR5) at ~290 GB/s.]

4 Overlapping Stages
Prior work enabled overlapping the different stages:
GPUfs [ASPLOS'13]: file I/O from GPUs.
GPUnet [OSDI'14]: network I/O from GPUs.
Full/empty bit approach [HPCA'13]: can overlap memcpy and kernel execution.
HSA (Heterogeneous System Architecture): allows page faults from GPUs.
Overlapping the stages yields a significant reduction in runtime; a sketch of the idea follows.
[Figure: timeline comparing the serialized stages against the overlapped stages.]
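To make the single-kernel overlap concrete, here is a minimal CUDA-streams sketch in the spirit of the memcpy/kernel overlap above; the kernel `scale`, the sizes, and the four-way chunking are all assumptions for illustration.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20, CHUNKS = 4, C = N / CHUNKS;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));  // pinned host memory: required for async copies
    cudaMalloc(&d, N * sizeof(float));

    cudaStream_t s[CHUNKS];
    for (int k = 0; k < CHUNKS; k++) cudaStreamCreate(&s[k]);

    // Chunk k's H2D copy overlaps with chunk k-1's kernel: the stages
    // pipeline instead of serializing.
    for (int k = 0; k < CHUNKS; k++) {
        cudaMemcpyAsync(d + k * C, h + k * C, C * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        scale<<<(C + 255) / 256, 256, 0, s[k]>>>(d + k * C, C);
    }
    cudaDeviceSynchronize();

    for (int k = 0; k < CHUNKS; k++) cudaStreamDestroy(s[k]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```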

5 Limitation with Multiple Dependent Kernels
Many workloads have multiple dependent kernels (e.g., 2/3 of the workloads from the Rodinia and Parboil benchmark suites).
Dependent kernels are serialized by implicit synchronization barriers, which limits the achievable speedup.
[Figure: timeline with Kernels 0-3 serialized between the H2D and D2H copies.]

6 Our Contributions
Overlap multiple dependent kernels without any programmer effort.
Coarse-grained Reference Counting-based Scoreboarding (CRCS): the enabling mechanism that tracks dependencies across kernels.
Pipeline Parallelism-aware CTA Scheduler (PPCS): properly schedules CTAs from the overlapped kernels.
[Figure: timeline with Kernels 0-3 overlapped with the copies; without PPCS the overlap is limited.]

7 Outline
Introduction/Background
Coarse-grained Reference Counting-based Scoreboarding (CRCS)
Pipeline Parallelism-aware CTA Scheduler (PPCS)
Methodology
Results
Conclusion

8 Enabling Overlapped Kernel Execution
Coarse-grained Reference Counting-based Scoreboarding (CRCS).
An existing scoreboard tracks dependencies between instructions through registers; CRCS tracks dependencies between CTAs through pages.
Each page in the address space carries two fields:
Owner kernel: which kernel owns this page?
Reference counter: how many CTAs from the current owner kernel access this page?
[Figure: CTAs of Kernel 0 and Kernel 1 accessing pages 0-2, each page tagged with its owner kernel and reference counter.]

9 Enabling Overlapped Kernel Execution
As each CTA of the owner kernel finishes, the reference counters of the pages it accessed are decremented; when a page's counter reaches zero, ownership passes to the next kernel.
Two types of information are needed:
How many CTAs access each page? (to initialize the counter) → pre-profiling.
Which pages are accessed by a given CTA? (to decrement the counter) → post-profiling.
A minimal sketch of the per-page state follows.
[Figure: step-by-step animation of the counters being decremented as CTAs of Kernels 0 and 1 complete.]

10 Profiling Memory Access Range
Determine the memory access range of each CTA.
We use a sampling method [Kim et al., PPoPP'11]: only the "corner" threads of each CTA are inspected (the threads at the extremes of each dimension of a 1D, 2D, or 3D CTA), so the overhead is low.
The union of the ranges is taken over multiple memory-access statements.
We assume all pages in the range are accessed by the CTA.
Profiler kernels are generated through source-to-source translation; an example is sketched below.

11 Pre-profiling
Compute the reference count for each page.
One pre-profiler kernel is launched for each original kernel; it executes before the corresponding kernel and can be overlapped with other kernels.
Each page has a reference count table holding read and write reference counts per kernel ID; a host-side sketch of the counting step follows.
[Figure: reference count tables for pages 0 and 1, filled in from the CTAs of Kernels 0 and 1.]

12 Post-profiling
Decrement the reference counters after each CTA finishes.
Keeping all CTA-page dependency information would be very costly: the maximum number of CTAs per kernel is ~10^19 on an NVIDIA Kepler GPU.
Instead, redo the profiling, but for the finishing CTA only; a sketch follows.
[Figure: remaining read/write CTA counters and owner kernel per page, decremented as CTAs of Kernel 0 complete.]
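The release step could then look like this hypothetical sketch, which takes the finished CTA's re-derived [lo, hi] range (corner sampling again, so no per-CTA state is ever stored) and decrements the counters of the pages it covers.

```cuda
#include <unordered_map>

// Called when one CTA finishes: decrement the counter of every page in the
// CTA's re-derived [lo, hi] byte range; a page whose count reaches zero is
// released, i.e., its ownership can pass to the next kernel.
void release_pages(unsigned long long lo, unsigned long long hi,
                   std::unordered_map<unsigned long long, int> &ref_count) {
    const unsigned long long PAGE_SIZE = 4096;  // assumed page size
    for (unsigned long long p = lo / PAGE_SIZE; p <= hi / PAGE_SIZE; p++)
        if (--ref_count[p] == 0)
            ref_count.erase(p);  // page released to the next owner kernel
}
```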

13 Baseline Execution (No Overlap)
[Figure: execution timeline over two SMs with two CTA slots each. Pages P0-P7 are initialized during the prelude; the CTAs of Kernel 0 through Kernel 3 then process pages P0-P7 in turn, each kernel completing fully before the next begins; finally the postlude writes out P0-P7.]

14 Execution with CRCS + FIFO Scheduler
CRCS enables some overlap between kernels, but a FIFO scheduler leaves SMs idle: two dependent kernels should not run on the same SM, and there is almost no kernel overlap during most of the prelude.
[Figure: timeline with CTAs of Kernels 0-3 partially overlapped across the two SMs, with idle CTA slots.]

15 Execution with CRCS + PPCS
PPCS schedules the kernel with the largest value of (page share) − (SM share).
Page share of a kernel: the portion of pages owned by the kernel out of all initialized pages.
SM share of a kernel: the portion of SMs running the kernel.
A sketch of this pick rule follows.

16 Execution with CRCS + PPCS
[Figure: timeline under PPCS. CTAs of Kernels 0-3 are interleaved across both SMs' CTA slots as pages P0-P7 become ready during the prelude, leaving only brief idle periods before the postlude.]

17 Methodology
Simulator: modified GPGPU-sim version 3.0.1.
Configuration: MHz, 6 memory controllers.
I/O device: two SSDs (throughput: 500 MB/s each).
PCIe 3.0: GB/s in each direction.
Prior work models for comparison:
A model with no memory copy between host and device.
A model with perfect single-kernel overlap (first and last kernels only).
Profiler code generator based on Clang from the LLVM compiler.
Focus on workloads with multiple dependent kernels.

18 Performance Result
[Figure: speedup per workload; labeled values are 50%, 51%, 33%, 39%, 19%, 14%, and 2%.]

19 Impact of Kernel Portion in Runtime
The number of kernels is varied for Hotspot from Rodinia.
[Figure: speedup versus the portion of kernel execution in runtime (9%, 23%, 33%, 50%, 66%); labeled speedups are 51%, 39%, 30%, 27%, 26%, 17%, 9%, 11%, 12%, and 14%.]

20 Overhead
Storage overhead:
CRCS: 1 KB per SM.
PPCS: 3.25 KB per GPU.
Assumptions: 64 SMs; a 128-entry TLB in each SM; at most 1024 kernels overlapped.
This amounts to 0.77% storage overhead for the entire GPU.
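For concreteness, those numbers sum to 64 SMs × 1 KB (CRCS) + 3.25 KB (PPCS) ≈ 67.25 KB of added state across the whole GPU.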

21 Conclusion
System performance can be improved by overlapping the different stages of GPU workloads, but prior work cannot overlap multiple dependent kernels.
Coarse-grained Reference Counting-based Scoreboarding (CRCS) enables overlapped execution of multiple dependent kernels.
The Pipeline Parallelism-aware CTA Scheduler (PPCS) further improves performance by properly scheduling CTAs across kernels.
Combining CRCS with PPCS yields up to 67% speedup (33% on average).

