1 Transparent CPU-GPU Collaboration for Data-Parallel Kernels on Heterogeneous Systems. Janghaeng Lee, Mehrzad Samadi, Yongjun Park, and Scott Mahlke. University of Michigan, Ann Arbor; Electrical Engineering and Computer Science.

2 Heterogeneity
Computer systems have become heterogeneous:
– Laptops, servers
– Mobile devices
GPUs are integrated with CPUs, and discrete GPUs are added for:
– Gaming
– Supercomputing
Co-processors target massive data-parallel workloads:
– Intel Xeon Phi

3 Example of a Heterogeneous System
[Block diagram] Host: a multi-core CPU (4-16 cores) with integrated GPU cores (> 16 cores), a shared L3, and a memory controller; host memory bandwidth is below 20 GB/s. External GPUs are attached over PCIe (< 16 GB/s): faster GPUs (< 300 cores) with global memory bandwidth above 150 GB/s, and slower GPUs (< 100 cores) with global memory bandwidth above 100 GB/s.

4 Typical Heterogeneous Execution
[Timeline] The CPU runs the sequential code, transfers the input to GPU 1, and waits while the kernel runs on GPU 1; it then transfers the output back. During kernel execution, the CPU and GPU 0 are idle.

5 Collaborative Heterogeneous Execution
[Timeline] Instead of leaving the CPU and GPU 0 idle, the sequential code and input transfer are followed by running the kernel on the CPU, GPU 0, and GPU 1 at the same time; the partial outputs are then merged and transferred back, giving a speedup over running the whole kernel on GPU 1.
Three issues:
– How to transform the kernel
– How to merge the outputs efficiently
– How to partition the work

6 OpenCL Execution Model
[Diagram] An OpenCL data-parallel kernel is expressed as a grid of work-items grouped into work-groups. A compute device consists of several compute units (CUs), each containing processing elements (PEs) and shared memory; all compute units access the device's global and constant memory.
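To make these terms concrete, here is a minimal host-side sketch, assuming the command queue, kernel, and buffers (hypothetical handles) have already been created elsewhere: the global size is the total number of work-items, the local size is the work-group size, and the runtime distributes the resulting work-groups across the device's compute units.

#include <CL/cl.h>

/* Enqueue a 1-D data-parallel kernel over n_items work-items. */
void launch_1d(cl_command_queue queue, cl_kernel kernel,
               cl_mem in_buf, cl_mem out_buf, size_t n_items)
{
    size_t local_size  = 256;        /* work-items per work-group       */
    size_t global_size = n_items;    /* total work-items in the NDRange */

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &in_buf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &out_buf);

    /* n_items / 256 work-groups are dispatched to the compute units. */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);
    clFinish(queue);
}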

7 Virtualizing the Compute Device
[Diagram] On a GPU, the work-items of an OpenCL data-parallel kernel are executed by an NVIDIA SMX: the scheduler maps the work-items of each work-group onto the SMX's lanes, which share the register file and shared memory.

8 Virtualization on the CPU
[Diagram] On the CPU, each core (with its own ALU, registers, control logic, and cache) executes work-items: a work-group is mapped to a core, which runs the group's work-items.
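A hedged sketch of what this amounts to, not the vendor runtime's actual implementation: lacking hardware work-item scheduling, a CPU core executes an entire work-group by looping over its work-items, with the element-wise body below standing in for an arbitrary kernel.

/* One CPU core runs a whole work-group by iterating over its work-items.
 * group_size and the element-wise body are illustrative placeholders. */
static void run_work_group_on_core(int group_id, int group_size,
                                   const float *in, float *out)
{
    for (int local_id = 0; local_id < group_size; ++local_id) {
        int gid = group_id * group_size + local_id;  /* get_global_id(0)     */
        out[gid] = in[gid] * 2.0f;                   /* stand-in kernel body */
    }
}

Different work-groups can then be spread across the cores, for example with a parallel loop over group IDs.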

9 Collaborative Execution
[Diagram] The OpenCL kernel's input is copied from host (global) memory to device (global) memory, the work-groups are flattened and split across the devices, and the partial outputs are merged back together.
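As a rough illustration of this flow, the sketch below, with hypothetical handles, argument indices, and split point, writes the input to the discrete GPU, launches a transformed partial kernel on both the CPU device and the GPU with complementary work-group ranges, and reads the GPU's share back into a staging buffer for merging; error handling is omitted.

#include <CL/cl.h>

/* Work-groups [0, split-1] run on the GPU, [split, n_wgs-1] on the CPU.
 * Queues, kernels, and buffers are assumed to be set up elsewhere. */
void run_collaborative(cl_command_queue cpu_q, cl_command_queue gpu_q,
                       cl_kernel cpu_kern, cl_kernel gpu_kern,
                       cl_mem gpu_in, cl_mem gpu_out,
                       const float *host_in, float *gpu_result,
                       size_t n_items, size_t wg_size, int split)
{
    int n_wgs    = (int)(n_items / wg_size);
    int gpu_from = 0,     gpu_to = split - 1;   /* GPU's share of the WGs */
    int cpu_from = split, cpu_to = n_wgs - 1;   /* CPU's share of the WGs */

    /* Copy the input into the discrete GPU's global memory. */
    clEnqueueWriteBuffer(gpu_q, gpu_in, CL_FALSE, 0,
                         n_items * sizeof(float), host_in, 0, NULL, NULL);

    /* Pass each device its work-group range (args 3 and 4 by assumption). */
    clSetKernelArg(gpu_kern, 3, sizeof(int), &gpu_from);
    clSetKernelArg(gpu_kern, 4, sizeof(int), &gpu_to);
    clSetKernelArg(cpu_kern, 3, sizeof(int), &cpu_from);
    clSetKernelArg(cpu_kern, 4, sizeof(int), &cpu_to);

    /* Launch the partial kernel on both devices concurrently. */
    clEnqueueNDRangeKernel(gpu_q, gpu_kern, 1, NULL, &n_items, &wg_size,
                           0, NULL, NULL);
    clEnqueueNDRangeKernel(cpu_q, cpu_kern, 1, NULL, &n_items, &wg_size,
                           0, NULL, NULL);

    /* Read the GPU's partial output into a staging buffer; a merge step
     * then folds it into the host copy of the output. */
    clEnqueueReadBuffer(gpu_q, gpu_out, CL_TRUE, 0,
                        n_items * sizeof(float), gpu_result, 0, NULL, NULL);
    clFinish(cpu_q);
}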

10 OpenCL: Single Kernel Multiple Devices (SKMD)
The SKMD framework sits transparently between the application binary and the OpenCL API. When the application launches a kernel, SKMD's kernel transformer, buffer manager, and partitioner split its work-groups across the CPU device (the host, a multi-core CPU with 4-16 cores and integrated GPU cores with > 16 cores), the faster GPUs (< 300 cores), and the slower GPUs (< 100 cores), which are connected over PCIe (< 16 GB/s).
WG-variant profile data (execution time per work-group count, for a kernel with num_groups(0) x num_groups(1) x num_groups(2) work-groups):
Work-groups | Dev 0 | Dev 1 | Dev 2
16          | 22.42 | 14.21 | 42.11
32          | 22.34 | 34.39 | 55.12
...         | ...   | ...   | ...
512         | 22.41 | 39.21 | 120.23
Key ideas:
– Kernel transform: each device works on a subset of the work-groups, and the outputs in different address spaces are merged efficiently
– Buffer management: manages the working set for each device
– Partitioning: assigns the optimal workload to each device

11 Kernel Transform: Partial Work-group Execution
The original kernel, __kernel void original_program(...) { [KERNEL CODE] }, is rewritten as __kernel void partial_program(..., int wg_from, int wg_to). The N-dimensional work-group space (num_groups(0) x num_groups(1) x num_groups(2)) is flattened, and a prologue makes each work-group exit unless its flattened ID lies in [wg_from, wg_to]:
int idx = get_group_id(0);
int idy = get_group_id(1);
int size_x = get_num_groups(0);
int flat_id = idx + idy * size_x;
if (flat_id < wg_from || flat_id > wg_to) return;
[KERNEL CODE]
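For instance, a hedged sketch of what this transformation could produce for a hypothetical vector-add kernel (illustrative names, not the framework's actual generated code):

// Hypothetical vector-add kernel after the partial work-group transform:
// work-groups whose flattened ID falls outside [wg_from, wg_to] exit
// immediately, so each device can be handed a slice of the work-group space.
__kernel void vecadd_partial(__global const float *a,
                             __global const float *b,
                             __global float *out,
                             int wg_from, int wg_to)
{
    int idx     = get_group_id(0);
    int idy     = get_group_id(1);
    int size_x  = get_num_groups(0);
    int flat_id = idx + idy * size_x;   /* flatten the work-group ID    */
    if (flat_id < wg_from || flat_id > wg_to)
        return;                         /* not this device's share      */

    int tid = get_global_id(0);         /* original kernel body follows */
    out[tid] = a[tid] + b[tid];
}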

12 Buffer Management: Classifying Kernels
[Diagram] Work-groups are split across the CPU device (host), the faster GPUs, and the slower GPUs (e.g., 33% / 50% / 17%). In a contiguous-memory-access kernel, each device's share of the work-groups reads and writes a contiguous range of input and output addresses, so partial outputs in separate address spaces are easy to merge. In a discontiguous-memory-access kernel, the reads and writes of different devices are interleaved across the address range, making the outputs hard to merge.
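The two classes can be illustrated with a pair of hypothetical kernels; the names and bodies are only for illustration.

// Contiguous access: each work-item writes out[i] at its own global ID, so a
// device's work-groups produce one contiguous block of the output, which is
// trivial to copy back and merge.
__kernel void scale_contiguous(__global const float *in, __global float *out)
{
    int i = get_global_id(0);
    out[i] = in[i] * 2.0f;
}

// Discontiguous access: the write location depends on the data (an index
// array here), so the elements a device produces are scattered through the
// output buffer and interleaved with other devices' results.
__kernel void scatter_discontiguous(__global const float *in,
                                    __global const int *idx,
                                    __global float *out)
{
    int i = get_global_id(0);
    out[idx[i]] = in[i];
}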

13 Merging Outputs
Intuition:
– The CPU would produce the same results as the GPU if it executed the rest of the work-groups
– It already has the rest of the results, transferred from the GPU
Simple solution: a merge kernel
– Enable the work-groups that were enabled on the GPUs
– Replace each global store value with a copy (a load from the GPU result)
Logging output locations and checking the log when merging is a bad idea: it adds overhead.

14 Merge Kernel Transformation
The partial kernel
__kernel void partial_program(..., __global float *output, int wg_from, int wg_to)
{
    int flat_id = idx + idy * size_x;
    if (flat_id < wg_from || flat_id > wg_to) return;
    // kernel body
    for (...) { ... sum += ...; }
    output[tid] = sum;   // store to global memory
}
is turned into merge_program by adding a __global float *gpu_out parameter: the kernel body is removed by dead code elimination, and the store to global memory is replaced by a copy, output[tid] = gpu_out[tid];.
Merging cost: the merge runs at memory bandwidth (≈ 20 GB/s), which is < 0.1 ms in most applications.
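Putting the pieces together, a hedged sketch of what such a generated merge kernel could look like (hypothetical names, not the framework's actual output):

// After dead code elimination only the range check and the copy remain: for
// the work-group range that ran on the GPU, the CPU copies the GPU's results
// into its own output buffer at memory bandwidth.
__kernel void merge_program(__global float *output,
                            int wg_from, int wg_to,
                            __global const float *gpu_out)
{
    int idx     = get_group_id(0);
    int idy     = get_group_id(1);
    int size_x  = get_num_groups(0);
    int flat_id = idx + idy * size_x;
    if (flat_id < wg_from || flat_id > wg_to)
        return;                      /* only the GPU's work-group range */

    int tid = get_global_id(0);
    output[tid] = gpu_out[tid];      /* the store is now a copy         */
}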

15 Partitioning
Uses the profile data.
Done at runtime:
– Must be fast
Uses a greedy decision-tree heuristic:
– Start from the root node, where all work-groups are assigned to the fastest device
– At each child node, a fixed number of work-groups is offloaded from the fastest device to another device, relative to the parent node

16 Decision Tree Example
[Tree diagram] Each node f(Dev1, Dev2, Dev3) is a candidate assignment of work-groups to the three devices, annotated with its estimated execution time. Starting from the root f(256,0,0), children such as f(255,1,0) and f(255,0,1) are expanded, then f(254,2,0), f(254,1,1), f(253,2,1), and f(253,1,2); the estimated times shown fall from 200 ms to 190 ms as work is offloaded from the fastest device.
At each node, the heuristic also considers:
– A balancing factor: do not choose a child whose execution times are less balanced between devices
– Data transfer time
– Merging costs
– Performance variation with the number of work-groups
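A hedged sketch of this greedy descent follows; the per-device cost model and the step size are hypothetical stand-ins, and the real partitioner additionally folds in transfer time, merge cost, the balancing factor, and the profiled variation with work-group count.

#define N_DEV 3
#define STEP  1   /* work-groups moved per tree level */

/* Hypothetical cost model: per-device time per work-group, with completion
 * time taken as the slowest device (values are made up for illustration). */
static const float ms_per_wg[N_DEV] = { 0.35f, 0.78f, 1.20f };

static float estimate_time(const int wgs[N_DEV])
{
    float worst = 0.0f;
    for (int d = 0; d < N_DEV; d++) {
        float t = wgs[d] * ms_per_wg[d];
        if (t > worst) worst = t;
    }
    return worst;
}

/* Greedy descent: start with everything on the fastest device and keep
 * offloading STEP work-groups to whichever device improves the estimate. */
void partition(int total_wgs, int fastest, int wgs[N_DEV])
{
    for (int d = 0; d < N_DEV; d++) wgs[d] = 0;
    wgs[fastest] = total_wgs;
    float best = estimate_time(wgs);

    for (;;) {
        int   best_dev  = -1;
        float best_time = best;
        for (int d = 0; d < N_DEV; d++) {          /* children of this node */
            if (d == fastest || wgs[fastest] < STEP) continue;
            wgs[fastest] -= STEP; wgs[d] += STEP;  /* try the move          */
            float t = estimate_time(wgs);
            if (t < best_time) { best_time = t; best_dev = d; }
            wgs[fastest] += STEP; wgs[d] -= STEP;  /* undo the trial        */
        }
        if (best_dev < 0) break;                   /* no child improves     */
        wgs[fastest] -= STEP; wgs[best_dev] += STEP;
        best = best_time;
    }
}

For example, partition(256, 0, wgs) starts at the root assignment f(256,0,0) and walks down the tree, offloading work-groups to devices 1 and 2 until no single move improves the estimated completion time.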

17 Experimental Setup
Device:         Intel Xeon E3-1230 (Sandy Bridge) | NVIDIA GTX 560 (Fermi) | NVIDIA Quadro (Fermi)
# of cores:     4 (8 threads)                     | 336                    | 96
Clock freq.:    3.2 GHz                           | 1.62 GHz               | 1.28 GHz
Memory:         8 GB DDR3                         | 1 GB GDDR5             | 1 GB GDDR3
Peak perf.:     409 GFLOPS                        | 1,088 GFLOPS           | 245 GFLOPS
OpenCL driver:  Enhanced Intel SDK 1.5            | NVIDIA SDK 4.0         | NVIDIA SDK 4.0
PCIe:           N/A                               | 2.0 x16                | 2.0 x16
OS: Ubuntu Linux 12.04 LTS
Benchmarks: AMD APP SDK, NVIDIA SDK

18 Results
[Chart] Per-benchmark results compared against running on the Intel Xeon only.

19 Results
[Chart] The same comparison, with the 29% average performance improvement highlighted.

20 Results (Breakdown)
[Charts] Breakdown of the results for Vector Add and Matrix Multiplication.

21 Summary
Systems have become more heterogeneous:
– Configured with several types of devices
Existing CPU + GPU heterogeneous execution:
– A single device executes a single kernel
Single Kernel Multiple Devices (SKMD):
– CPUs and GPUs work on a single kernel
– Transparent framework
– Partial kernel execution
– Merging of partial outputs
– Optimal partitioning
– Performance improvement of 29% over single-device execution

22 Q & A

