
1 Orchestrating Multiple Data-Parallel Kernels on Multiple Devices
Janghaeng Lee, Mehrzad Samadi, and Scott Mahlke
October 2015
University of Michigan - Ann Arbor, Electrical Engineering and Computer Science

2 Data Parallelism is Everywhere
Financial Modeling, Medical Imaging, Audio Processing, Machine Learning, Physics Simulation, Games, Image Processing, Statistics, Video Processing
– Multiple data-parallel kernels in a single application
– Dependencies between kernels

3 Parallel Hardware is Everywhere
Laptops, Desktops, Servers, Supercomputers, Cell Phones
Intel Core i7 CPU, Intel Xeon 16-Core CPU, Intel Xeon Phi Coprocessor, Iris Pro GPU, NVIDIA GTX 980, AMD Radeon R9 GPU

4 Performance
– More data to process means more demand for computing power
(Chart: peak performance across NVIDIA GPU generations: GTX 280, GTX 480, GTX 580, GTX 680, GTX 780 Ti, GTX 980, GTX 980 Ti, GTX 980 Ti x2)

5 Exploiting Multiple Devices
(Diagram: CPU and GPUs 1..N connected over a PCIe interconnect, each with its own memory)
Challenges: memory access pattern, buffer transfer, physical memory size, divide & conquer
Runtime steps for a single kernel:
1. Select target
2. Compile for the target at runtime
3. Data transfer
4. Launch
5. Data back

6 Exploiting Multiple Devices – Single Kernel

7 Multiple Kernels
(Diagram: Kernels 1–8 with dependencies, mapped onto CPU and GPUs 1..N over a PCIe interconnect)
– Dataflow unknown at compile time

8 Example of Multiple Kernels
Matrix equation (slide figure not preserved)
– Matrix sizes: A, C: 1K x 1K; B: 1K x 8K
– Kernel types and sizes: 1Kx8K matrix transpose, 1Kx1Kx8K matrix multiply, 1Kx8Kx1K matrix multiply, 1Kx1Kx1K matrix multiply
– Data sizes: 1Kx1K matrix, 1Kx8K (8Kx1K) matrix
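The equation itself is lost with the slide image, but the four listed kernel types and shapes are consistent with a dependency chain such as C = (A·B)·Bᵀ·A. The NumPy sketch below (at reduced scale, 1K → 16 and 8K → 128) is an illustrative assumption about the chain, not the slide's actual equation:

```python
import numpy as np

# Reduced-scale stand-ins for the slide's matrices (1K -> 16, 8K -> 128).
n, m = 16, 128
A = np.random.rand(n, n)   # A: 1K x 1K
B = np.random.rand(n, m)   # B: 1K x 8K

# Four dependent kernels matching the slide's kernel list:
Bt = B.T          # 1Kx8K matrix transpose   -> 8K x 1K
AB = A @ B        # 1Kx1Kx8K matrix multiply -> 1K x 8K
G  = AB @ Bt      # 1Kx8Kx1K matrix multiply -> 1K x 1K
C  = G @ A        # 1Kx1Kx1K matrix multiply -> C: 1K x 1K
```

Each product consumes the previous kernel's output, which is exactly the kind of inter-kernel dependency MKMD must schedule around.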

9 Problem Complexity
– Kernel dependency
– Data flow & interconnect
– Performance variance across different kernels, target processors, and input sizes
(Diagram: CPU and GPUs over PCIe; matrices A (1Kx1K), B (1Kx8K), C (1Kx1K))

10 Multiple Kernels on Multiple Devices (MKMD)
Objective:
– Finish multiple kernels as quickly as possible
– Fully utilize resources
(Diagram: MKMD framework mapping the matrix kernels onto CPU and GPUs over a PCIe interconnect)

11 MKMD Overview
Offline analysis (profiling mode): static analysis of the application's OpenCL kernels, profile runs, execution-time modeling
Execution mode: graph construction, coarse-grain scheduling, fine-grain multi-kernel partitioning, sub-kernel generation, execution on N compute units with unified memory

12 Coarse-Grained Scheduling
– Coarse grained: kernel granularity
– List scheduling:
– Prioritization (rank): reverse BFS from the sink, accumulating each kernel's average execution time over all devices plus estimated transfer time
– Schedule: assign the highest-ranked ready kernel to the device with the earliest finish time
(Diagram: DAG of the matrix kernels from source s to sink t annotated with ranks; resulting schedule on CPU, GPU 1, GPU 2)
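A minimal sketch of this list scheduler: rank kernels by a reverse traversal from the sinks, then place each in rank order on the device giving the earliest finish time. The kernel names, per-device time estimates, DAG, and flat transfer cost below are invented for illustration; MKMD derives such estimates from its offline model:

```python
from collections import defaultdict

# Invented example inputs (MKMD would get these from its regression model).
avg_time = {"K1": 10, "K2": 20, "K3": 15}   # avg exec time over all devices
exe = {"K1": {"cpu": 12, "gpu": 8},          # per-device exec time estimates
       "K2": {"cpu": 25, "gpu": 15},
       "K3": {"cpu": 18, "gpu": 12}}
deps = {"K3": ["K1", "K2"]}                  # K3 consumes K1's and K2's outputs
trans = 2                                    # estimated buffer-transfer time

def rank(k):
    # Reverse traversal from the sinks: a kernel's rank accumulates its
    # average execution time plus transfer time along its critical path.
    succs = [s for s, ps in deps.items() if k in ps]
    return avg_time[k] + (trans + max(rank(s) for s in succs) if succs else 0)

ready_at = defaultdict(int)                  # when each device becomes free
finish = {}
for k in sorted(exe, key=rank, reverse=True):        # highest rank first
    dep_done = max((finish[p] for p in deps.get(k, [])), default=0)
    # Place the kernel on the device that yields the earliest finish time.
    dev = min(exe[k], key=lambda d: max(ready_at[d], dep_done) + exe[k][d])
    finish[k] = max(ready_at[dev], dep_done) + exe[k][dev]
    ready_at[dev] = finish[k]
```

This mirrors classic HEFT-style list scheduling; in MKMD these two phases run at kernel granularity, leaving idle slots for the fine-grained partitioner to fill.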

13 Fine-Grained Partitioning
– Kernel granularity causes idle time slots
– Spatial partitioning across idle time slots (SKMD [PACT’13]):
– Start from the kernel that starts earliest
– Balance execution time at work-group granularity
– Consider buffer transfer
(Diagram: schedule with idle slots on CPU, GPU 1, GPU 2; OpenCL work-item/work-group hierarchy)

14 Fine-Grained Partitioning (cont.)
– Partitioning the earliest-starting kernel across the idle slots at work-group granularity, accounting for buffer transfer, shortens the overall schedule
(Diagram: schedule on CPU, GPU 1, GPU 2 with the saved time highlighted)
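The balancing step can be sketched as solving for a common finish time T across devices, where device d gets n_d = (T − transfer_d) · throughput_d work-groups. All throughput and transfer numbers here are hypothetical stand-ins, not measurements from the paper:

```python
# Split W work-groups across devices so transfer + compute finishes at
# roughly the same time everywhere (illustrative numbers only).
W = 4096                                          # total work-groups
tput = {"cpu": 1.0, "gpu1": 4.0, "gpu2": 2.0}     # work-groups per ms
xfer = {"cpu": 0.0, "gpu1": 3.0, "gpu2": 3.0}     # buffer-transfer cost (ms)

# Solve sum_d (T - xfer_d) * tput_d = W for the common finish time T.
total_tput = sum(tput.values())
T = (W + sum(xfer[d] * tput[d] for d in tput)) / total_tput

share = {d: max(0.0, (T - xfer[d]) * tput[d]) for d in tput}
n = {d: int(share[d]) for d in tput}              # whole work-groups
n[max(tput, key=tput.get)] += W - sum(n.values()) # remainder to fastest device
```

Faster devices get proportionally more work-groups, while devices that must pay a transfer cost get correspondingly fewer.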

15 Execution Time Estimation
– Execution time varies with the device, the type of kernel, and the input and work-group sizes
– Profiling all possible combinations is unrealistic: 3 devices x 100 input sizes x 4,096 work-groups = 1,228,800 profiles per kernel
– Instead, model execution time offline: time-complexity analysis + regression model

16 Time Complexity
Example OpenCL kernel (N = number of work-items; loop bound M):

    __kernel void SquareMatmul(__global float *C,
                               __global float *A,
                               __global float *B,
                               int M)
    {
        int i = get_global_id(0);
        int j = get_global_id(1);
        float tmp = 0.0f;
        for (int k = 0; k < M; ++k)
            tmp += A[i * M + k] * B[k * M + j];
        C[i * M + j] = tmp;
    }

Each work-item runs the loop M times, so the kernel's time complexity is O(N * M).

17 Linearity
(Chart: measured execution time of OpenCL kernels on an NVIDIA GTX 760 scales linearly with the modeled complexity term)

18 Linear Regression
– Modeled offline; plugged in at runtime
– For modeling: executed 20 and 40 profile runs on an NVIDIA GTX 760
– Validation: 30 actual runs with a random number of work-groups and random (sufficiently large) inputs, comparing predicted vs. actual execution time

Average error rate:

    Kernel         20 profiles   40 profiles
    BlackScholes   0.0212        0.0127
    Nbody          0.0172        0.0123
    MatrixMul      0.014         0.009
    FDTD3d         0.0198        0.0145
    SobelFilter    0.0118        0.0097
    MedianFilter   0.0106        0.0098
    K-means        0.0142        0.0118
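The modeling step amounts to an ordinary least-squares fit of measured time against the kernel's complexity term (N·M for SquareMatmul). This sketch uses synthetic profile data with hypothetical device constants, not the paper's measurements:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic profile runs: complexity term (N * M) and noisy timings
# generated from hypothetical device constants true_a, true_b.
complexity = rng.integers(1_000, 100_000, size=20)
true_a, true_b = 2e-4, 5.0
time_ms = true_a * complexity + true_b + rng.normal(0, 0.1, 20)

# Offline: least-squares fit of t ~ a * complexity + b.
X = np.column_stack([complexity, np.ones_like(complexity, dtype=float)])
(a, b), *_ = np.linalg.lstsq(X, time_ms, rcond=None)

# Runtime: the fitted model predicts execution time for unseen sizes.
predict = lambda n_times_m: a * n_times_m + b
```

With a handful of profile runs the fitted slope and intercept recover the device's behavior closely, which is why 20–40 profiles suffice where exhaustive profiling would need over a million.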

19 Experimental Setup

    Device             Intel Core i7-3770       NVIDIA GTX 760           NVIDIA GTX 750 Ti
    # of Cores         4 (8 threads)            1,152                    640
    Clock Freq.        3.2 GHz                  0.98 GHz                 1.02 GHz
    Memory (Peak B/W)  32 GB DDR3 (12.8 GB/s)   2 GB GDDR5 (192 GB/s)    2 GB GDDR5 (86.4 GB/s)
    Peak Perf.         435 GFLOPS               2,258 GFLOPS             1,306 GFLOPS
    OpenCL Library     Intel SDK 2013           NVIDIA CUDA SDK 6.0      NVIDIA CUDA SDK 6.0
    PCIe               N/A                      3.0 x8 (7.88 GB/s each)  3.0 x8 (7.88 GB/s each)

OS: Ubuntu Linux 12.04 LTS

20 Benchmarks

    Name                                                    Domain
    Algebraic Bernoulli (ABE)                               System Theory
    Biconjugate gradient stabilized (BiCGSTAB), 11 ops      Linear Systems
    Triple commutator                                       Mathematics
    Generalized Algebraic Bernoulli (GABE)                  System Theory
    Reachability Gramian                                    Control Theory
    Jacobi                                                  Linear Systems
    Continuous Lyapunov                                     Control Theory
    Continuous Algebraic Riccati (CARE)                     Control Theory
    Stein                                                   Probability
    Singular value decomposition (SVD)                      Signal Processing
    Sylvester                                               Mathematics

Matrix / vector size: 4K x 4K / 4K

21 Speedup
(Chart: speedup over serial in-order execution on the GTX 760; higher is better; MKMD reaches 1.89x)

22 Device Utilization
(Chart: finish time for each scheduling scheme; MKMD reaches 84% device utilization)

23 Scheduling Overhead
(Chart: scheduling overhead as a ratio of the entire execution time)

24 MKMD Summary
– Mapping kernels onto multiple devices is hard:
– Inter-kernel dependencies
– Interconnect
– Device performance varies with the type of kernel and the input size
– MKMD eases these burdens for programmers:
– Accurate execution-time prediction
– Temporal scheduling at kernel granularity
– Spatial partitioning at work-group granularity
– 1.89x speedup over serial in-order execution

25 Orchestrating Multiple Data-Parallel Kernels on Multiple Devices
Janghaeng Lee, Mehrzad Samadi, and Scott Mahlke
October 2015
University of Michigan - Ann Arbor, Electrical Engineering and Computer Science

