
1 GPGPU Architectures Martin Kruliš, 11. 4. 2017

2 History
- 1996 – 3Dfx Voodoo 1: the first graphics (3D) accelerator for desktop PCs
- 1999 – NVIDIA GeForce 256: the first Transform & Lighting unit
- NVIDIA GeForce2, ATI Radeon – the GPU gets programmable parts
- DirectX – vertex and fragment shaders (v1.0); OpenGL
- DirectX 10, Windows Vista – unified shader architecture in HW, geometry shader added

3 History
- 2007 – NVIDIA CUDA: the first GPGPU solution, restricted to NVIDIA GPUs
- 2007 – AMD Stream SDK (previously CTM)
- OpenCL, DirectCompute
- OpenCL revision 1.2; NVIDIA Kepler architecture
- 2013 – OpenCL revision 2.0
- 2014 – NVIDIA Maxwell architecture
- 2016 – OpenCL revision 2.2, NVIDIA Pascal
- 2017 – NVIDIA OpenCL 2.0 beta

4 GPU in comparison with CPU
CPU
- Few cores per chip
- General-purpose cores
- Processing different threads
- Huge caches to reduce memory latency
- Locality of reference problem
GPU
- Many cores per chip
- Cores specialized for numeric computations
- SIMT thread processing
- Huge amount of threads and fast context switch
- Results in more complex memory transfers
Architecture convergence

5 NVIDIA Fermi
GPU architecture
- 16 SMP units
- 512 CUDA cores
- 768 kB L2 cache
Note that one CUDA core corresponds to one 5D AMD Stream Processor (VLIW5). Therefore, Radeon 5870 has 320 cores with 4-way SIMD capabilities and one SFU.

6 NVIDIA Fermi
Streaming multiprocessor
- 32 CUDA cores
- 64 kB shared memory (or L1 cache)
- 1024 registers per core
- 16 load/store units
- 4 special function units
- 16 double-precision ops per clock
- 1 instruction decoder; all cores run in lockstep
Streaming multiprocessors are denoted SM on Fermi, SMX on Kepler, and SMM on Maxwell.

7 GPU Execution Model
Data parallelism
- Many data elements are processed concurrently by the same routine
- GPUs are designed under this particular paradigm; they also have only limited means of expressing task parallelism
Threading execution model
- One function (the kernel) is executed by many threads (a minimal sketch follows below)
- The threads are much more lightweight than CPU threads
- Threads are grouped into blocks/work groups of the same size
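The threading model can be illustrated with a minimal kernel sketch. It is written in OpenCL C (introduced later in this deck); the kernel name vec_add and its arguments are illustrative, not part of the slides:

    // each of the many lightweight threads processes one data element
    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c)
    {
        int i = get_global_id(0);   // this thread's position in the data
        c[i] = a[i] + b[i];
    }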

8 SIMT Execution
Single Instruction Multiple Threads
- All cores execute the same instruction
- Each core has its own set of registers
(Diagram: one instruction decoder and warp scheduler feeding all cores, each with its private registers.)

9 SIMT vs. SIMD
Single Instruction Multiple Threads
- Width-independent programming model
- Serial-like code
- Achieved by the hardware with a little help from the compiler
- Allows code divergence
Single Instruction Multiple Data
- Explicitly exposes the width of the SIMD vector
- Special instructions, generated by the compiler or written directly by the programmer
- Code divergence is usually unsupported or tedious to express
A sketch contrasting the two styles follows below.
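The contrast can be sketched in OpenCL C (both kernels are illustrative). The SIMT style is plain, width-independent scalar code; the SIMD style makes the vector width explicit in the types:

    // SIMT style: serial-like code, no vector width anywhere;
    // the hardware runs the threads in lockstep
    __kernel void scale_simt(__global float *data, float f)
    {
        int i = get_global_id(0);
        data[i] *= f;
    }

    // SIMD style: the 4-wide width is exposed explicitly;
    // one float4 operation processes 4 floats at once
    __kernel void scale_simd(__global float4 *data, float f)
    {
        int i = get_global_id(0);
        data[i] *= f;
    }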

10 Thread-Core Mapping
How threads are assigned to the hardware:
- Grid → GPU: all blocks of a grid run the same kernel
- Block → streaming multiprocessor: each block is assigned to an SM
- Warp → SM cores: the threads of a warp run simultaneously on the cores
- Thread → core
A warp is made of 32 consecutive threads of a block according to their thread ID (i.e., threads 0-31, 32-63, …). The warp size is 32 for all compute capabilities. The thread ID is threadIdx.x in one dimension, threadIdx.x + threadIdx.y*blockDim.x in two dimensions, and threadIdx.x + threadIdx.y*blockDim.x + threadIdx.z*blockDim.x*blockDim.y in three dimensions. Multiple kernels may run simultaneously on the GPU (since Fermi), and multiple blocks may be assigned simultaneously to an SM (if the registers, the schedulers, and the shared memory can accommodate them).

11 SIMT and Branches
Masking instructions
- Used in case of data-driven branches (if-else conditions, while loops, …)
- All branches are traversed; each thread masks out the execution of the branches that do not apply to it

    if (threadId % 2 == 0) {
        // ... we are even ...
    } else {
        // ... this is odd ...
    }

(Diagram: four cores step through both branches in lockstep, with the inactive half of the threads masked in each branch.)

12 GPU Memory
(Diagram of the memory hierarchy: host memory attached to the CPU (~25 GBps); host and GPU device connected by PCI Express (16/32 GBps); on the GPU chip, each SMP has registers and an L1 cache, backed by a shared L2 cache and the off-chip global memory (> 100 GBps).)
Note that the details of the host memory interconnection are platform specific.

13 Host-Device Transfers
PCIe transfers
- Much slower than internal GPU data transfers
- Issued explicitly by host code, as a bulk transfer operation
- Or issued by the HW/operating system when GPU memory is mapped into the host memory space
- The transfers have a significant overhead
- Overlapping: up to 2 asynchronous transfers may run whilst the GPU is computing
- Data must be binary safe (no pointers)
- CUDA introduces unified memory addressing

14 Global Memory
Global memory properties
- Off-chip, but on the GPU device
- High bandwidth and high latency (~100 GBps; latency of hundreds of clock cycles)
- Operated in transactions: continuous aligned segments of 32 B - 128 B
- The number of transactions caused by a global memory access depends on the access pattern; certain access patterns are optimized (a sketch follows below)
- Data are cached in L2; on Fermi the data are also cached in L1, while Kepler/Maxwell use L1 for specific purposes
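The difference between an optimized and an unoptimized pattern can be sketched as follows (illustrative kernels; the exact number of transactions depends on the architecture):

    // coalesced: threads i, i+1, ... read consecutive addresses,
    // so the loads of a warp fall into a few aligned segments
    __kernel void copy_coalesced(__global const float *in, __global float *out)
    {
        int i = get_global_id(0);
        out[i] = in[i];
    }

    // strided: consecutive threads read addresses 32 floats apart,
    // so almost every load may cost its own memory transaction
    __kernel void copy_strided(__global const float *in, __global float *out, int n)
    {
        int i = get_global_id(0);
        out[i] = in[(i * 32) % n];
    }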

15 Shared Memory
Memory shared by the SM
- Divided into banks; each bank can be accessed independently
- Consecutive 32-bit words are in consecutive banks
- Optionally, a 64-bit word division is used (Kepler)
- Bank conflicts are serialized, except for reads of the same address (broadcast)

    Architecture   Mem. size   # of banks   Latency
    Tesla          16 kB       16           32 bits / 2 cycles
    Fermi          48 kB       32           32 bits / 2 cycles
    Kepler         48 kB       32           64 bits / 1 cycle

16 Shared Memory
Linear addressing
- Each thread in a warp accesses a different memory bank
- No collisions

17 Shared Memory
Linear addressing with stride 2
- Each thread accesses the 2i-th item
- 2-way conflicts (2× slowdown) on Fermi and older
- No collisions on Kepler and newer, due to its 64-bits-per-cycle throughput

18 Shared Memory
Linear addressing with stride 3
- Each thread accesses the 3i-th item
- No collisions, since the number of banks is not divisible by the stride
A padding trick based on the same observation is sketched below.
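A common application of these banking rules is to pad local-memory arrays so that column-wise accesses stop hitting the same bank. A sketch (assuming 32 banks as on Fermi/Kepler, a square matrix whose width is a multiple of 32, and 32×32 work groups; the kernel name and layout are illustrative):

    __kernel void transpose_tile(__global const float *in,
                                 __global float *out, int width)
    {
        // 33 columns instead of 32: a column walk now visits
        // 32 different banks instead of conflicting on one
        __local float tile[32][33];
        int x = get_global_id(0), y = get_global_id(1);
        int lx = get_local_id(0), ly = get_local_id(1);
        tile[ly][lx] = in[y*width + x];            // coalesced, conflict-free
        barrier(CLK_LOCAL_MEM_FENCE);
        int tx = get_group_id(1)*32 + lx;          // transposed coordinates
        int ty = get_group_id(0)*32 + ly;
        out[ty*width + tx] = tile[lx][ly];         // conflict-free thanks to padding
    }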

19 Registers
- One register pool per multiprocessor: 8-64k 32-bit registers (depending on the architecture)
- Registers are assigned to threads by the compiler
- As fast as the cores (no extra clock cycles)
- A read-after-write dependency takes ~24 clock cycles; it can be hidden if there are enough active warps
- The hardware scheduler (and the compiler) attempts to avoid register bank conflicts whenever possible; the programmer has no direct control over them

20 Hiding Latency
Fast context switch
- When a warp gets stalled (e.g., by a data load/store), the scheduler switches to the next active warp

21 Kepler Architecture
Kepler's major improvements
- Next-generation streaming multiprocessors (SMX): 192 cores, 32 SFUs, 32 load/store units; 3 cores share a DP unit, 6 cores share an LD unit and an SFU
- Dynamic parallelism: a kernel may spawn child kernels (to a depth of 24); this implies the work-group context-switch capability
- Hyper-Q: up to 32 simultaneous GPU-host connections; better throughput when multiple processes/threads use the GPU (concurrent connections are managed in HW)

22 Kepler Architecture (diagram slide)

23 Maxwell Architecture
Maxwell's major improvements
- Maxwell streaming multiprocessor (SMM)
- Many internal optimizations, better power efficiency
- Improved scheduling, increased occupancy
- Reduced arithmetic instruction latency
- Larger L2 cache (2 MB)
- Dedicated shared memory (separate L1 cache)
- Native shared-memory atomics
- Better support for dynamic parallelism

24 Maxwell Architecture (diagram slide)

25 AMD GPU Architecture
AMD's Graphics Core Next (GCN)
- Abandons the VLIW4 architecture: 1 VLIW × 4 ALU ops ⇒ 4 SIMD × 1 ALU op
- 32 compute units (Radeon HD7970); 4 SIMD units per CU (each processing 16 elements); 10 planned wavefronts per SIMD unit
- Emphasis on vector processing (instructions, registers, memory, …)
- OpenCL 1.2, DirectCompute 11.1, and C++ AMP compatibility

26 AMD GPU Architecture (diagram slide)

27 GPU Programming Summary
Data parallelism
- SIMT execution model
- Load balancing needs to be carefully considered
Host-device transfers
- Data needs to be transferred to the GPU and back
- Computations should overlap with the data transfers
- There should be a sufficient amount of computation
Memory architecture
- Various types of memories and caches
- Coalesced loads/stores from/to global memory
- Banking properties of the shared memory

28 Discussion

29 OpenCL and GPU Programming
Martin Kruliš

30 OpenCL
Universal framework for parallel computations
- Specification created by the Khronos group
- Multiple implementations exist (AMD, NVIDIA, Mac, …)
API for different parallel architectures
- Multicore CPUs, manycore GPUs, Xeon Phi, FPGAs, …
- The host runtime handles device detection, data transfers, and code execution
Extended version of C99 for programming devices
- The code is compiled at runtime for the selected device
- Theoretically, we may choose the best device for our application dynamically; however, we have to consider HW-specific optimizations…

31 OpenCL – Hardware Model
- Device (a CPU die or a GPU card)
- Compute unit (a CPU core or a GPU SM)
- Processing element (a slot in SSE registers or a GPU core)

32 OpenCL – Logical Model
Logical layers
- Platform: an implementation of OCL
- Context: groups devices of a selected kind; buffers, programs, and other objects live in a context
- Device
- Command queue: created for a device; controls data transfers, kernel execution, and synchronization
(Diagram examples: Intel Core i7 (4 cores with HT) as a CPU device, ATI Radeon 5870 (320 cores) as a GPU device.)

33 OpenCL – Client Application

    // List all platforms
    std::vector<cl::Platform> platforms;
    cl_int err = cl::Platform::get(&platforms);
    if (err != CL_SUCCESS) return 1;

    // Context of all GPUs on the 1st platform
    cl_context_properties cps[3] = {CL_CONTEXT_PLATFORM,
        (cl_context_properties)(platforms[0]()), 0};
    cl::Context context(CL_DEVICE_TYPE_GPU, cps, NULL, NULL, &err);

    // List all GPUs in the context
    std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

    // Allocate memory on the GPU
    cl::Buffer buf(context, CL_MEM_READ_ONLY, sizeof(cl_float)*n);

    // Create and compile the program from string source
    cl::Program program(context, cl::Program::Sources(1,
        std::make_pair(source.c_str(), source.length())));
    err = program.build(devices);

    // Mark a function as a kernel
    cl::Kernel kernel(program, "function_name", &err);
    err = kernel.setArg(0, buf);

34 OpenCL – Client Application

    // GPU (in-order) command queue
    cl::CommandQueue cmdQueue(context, devices[0], 0, &err);

    // Copy input data to the GPU buffer
    cmdQueue.enqueueWriteBuffer(buf, CL_TRUE, 0, sizeof(cl_float)*n, data);

    // Execute the kernel and wait for it to finish (the finish() is not
    // necessary before the following readBuffer, since the queue is in-order)
    cmdQueue.enqueueNDRangeKernel(kernel, cl::NullRange,
        cl::NDRange(n), cl::NDRange(grp), NULL, NULL);
    cmdQueue.finish();

    // Copy the results back from the GPU
    cmdQueue.enqueueReadBuffer(buf, CL_TRUE, 0, sizeof(cl_float)*n, data);

35 OpenCL – Kernels
A kernel
- Written in OpenCL C (an extended version of C99)
- Compiled at runtime for the destination platform, with a (possibly) high degree of optimization
Kernel execution
- Task parallelism: multiple kernels are enlisted in command queue(s) and executed concurrently
- Data parallelism: multiple instances (threads) are created from a single kernel, each operating on its own data elements
Note that the NVIDIA implementation does not support out-of-order queues at present. To achieve true task parallelism, or to overlap data transfers with kernel execution, multiple queues to the same device must be used.
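A sketch of the multiple-queue workaround mentioned in the note (the variable names follow the client-application example on slides 33-34; nextBuf and nextData are assumed to exist):

    // two in-order queues to the same device
    cl::CommandQueue computeQueue(context, devices[0], 0, &err);
    cl::CommandQueue transferQueue(context, devices[0], 0, &err);

    // the kernel runs in one queue...
    computeQueue.enqueueNDRangeKernel(kernel, cl::NullRange,
        cl::NDRange(n), cl::NDRange(grp), NULL, NULL);

    // ...while a non-blocking (CL_FALSE) transfer proceeds in the other
    transferQueue.enqueueWriteBuffer(nextBuf, CL_FALSE, 0,
        sizeof(cl_float)*n, nextData, NULL, NULL);

    computeQueue.finish();
    transferQueue.finish();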

36 OpenCL – Kernel Execution
Data parallelism
- Each kernel instance has its own ID: a 1-3 dimensional vector of numbers from 0 to N-1
- The ID identifies the portion of data to be processed
- Threads form groups (blocks); the threads of one group are executed on one SMP
- Groups have IDs as well:
  T_ID = G_ID × G_SIZE + T_LOCAL_ID
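In OpenCL C, both sides of this identity are available as built-in functions; a sketch (illustrative kernel; the identity holds when the global work offset is zero):

    __kernel void id_demo(__global int *out)
    {
        size_t tid = get_global_id(0);                   // T_ID
        size_t computed = get_group_id(0) * get_local_size(0)
                        + get_local_id(0);               // G_ID * G_SIZE + T_LOCAL_ID
        out[tid] = (tid == computed);                    // always 1
    }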

37 OpenCL – Memory Model
Types of memory
- private – memory that belongs to one thread (registers)
- local – memory shared by a work group (shared memory)
- global – memory of the device
- constant – a read-only version of global memory, more easily cached
A sketch of the corresponding qualifiers follows below.
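The four address spaces map directly to qualifiers in kernel code; a sketch (all names are illustrative):

    __kernel void memory_spaces(__global float *data,      // global memory
                                __constant float *coeffs,  // constant memory (read-only)
                                __local float *scratch)    // local memory of the work group
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);
        float x = data[gid] * coeffs[0];    // x is private (kept in a register)
        scratch[lid] = x;
        barrier(CLK_LOCAL_MEM_FENCE);       // make local writes visible to the group
        // read a neighbour's value from the shared scratchpad
        data[gid] = scratch[(lid + 1) % get_local_size(0)];
    }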

38 OpenCL – Programming Kernels
Data types
- Almost all standard C99 types (int, float, …)
- Some predefined types: size_t, ptrdiff_t
- The special type half – a 16-bit equivalent of float
Vector data types
- typeN, where type is a standard type and N ∈ {2, 4, 8, 16}; e.g., float4, int16
- Items are accessed as structure members:

    int4 vec = (int4)(1, 2, 3, 4);
    int x = vec.x;

- Special swizzling operations are allowed:

    int4 vec2 = vec.xxyy;
    vec = vec.wzyx;

39 OpenCL – Programming Kernels
Functions
- Other functions (besides the kernel) can be defined in the program; a kernel is just an entry point
- Some devices may impose limitations on the call stack
- It is possible to call standard functions like printf(); however, they may not work on GPUs
There are many built-in functions (a sketch follows below)
- Thread-related functions (e.g., get_global_id())
- Mathematical and geometrical functions, originally designed for graphics; some of them are translated into a single instruction
- Functions for asynchronous memory transfers
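A sketch of an ordinary helper function next to a kernel, using two of the built-ins (the function names are illustrative):

    // an ordinary function: callable from kernels, but not an entry point
    float sigmoid(float x)
    {
        return 1.0f / (1.0f + exp(-x));     // exp() is a built-in math function
    }

    __kernel void apply_sigmoid(__global float *data)
    {
        size_t i = get_global_id(0);        // built-in thread-related function
        data[i] = sigmoid(data[i]);
    }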

40 OpenCL – Programming Kernels
Restrictions and optimization issues
- The branching problem (if-else): a work group runs in SIMT, thus all branches are followed; use conditional assignment rather than branches (a sketch follows below)
- for loops: the compiler attempts to unroll them automatically
- while loops: the same problem as branching
- Vector operations: translated into a single instruction if possible (e.g., SSE); the compiler attempts to generate them automatically; different efficiency on different architectures
Note that vector operations have virtually no effect on current NVIDIA architectures, but they may be particularly useful on CPUs and AMD GPUs.
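The conditional-assignment advice can be sketched as follows (illustrative kernel; in practice the fabs() built-in would do this particular job):

    __kernel void abs_inplace(__global float *data)
    {
        size_t i = get_global_id(0);
        float x = data[i];
        // conditional assignment typically compiles to a predicated/select
        // instruction, so no thread has to traverse a masked branch
        data[i] = (x < 0.0f) ? -x : x;
        // equivalent form with a built-in: data[i] = select(x, -x, x < 0.0f);
    }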

41 OpenCL – Synchronization
Global
- Explicit barriers added to the command queue
- Events and event dependencies
Within a kernel
- Local barriers (for a work group)
- Memory fences
- Atomic operations: common operations (add, sub, xchg, cmpxchg, …) and extended ones (min, max, and, or, xor)
A sketch combining a barrier and an atomic follows below.
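A sketch combining a local barrier with a global atomic (a work-group-wise sum; the names are illustrative, and atomic_add on global memory is assumed as in OpenCL 1.1 and newer):

    __kernel void block_sum(__global const int *in, __global int *total,
                            __local int *partial)
    {
        int lid = get_local_id(0);
        partial[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);          // local barrier for the work group
        if (lid == 0) {
            int sum = 0;
            for (int i = 0; i < (int)get_local_size(0); ++i)
                sum += partial[i];
            atomic_add(total, sum);            // one atomic per group, not per thread
        }
    }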

42 Example – Matrix Multiplication

    __kernel void mul_matrix(__global const float *A,
                             __global const float *B,
                             __global float *C)
    {
        int width = get_global_size(0);
        int x = get_global_id(0);
        int y = get_global_id(1);
        float sum = 0;
        for (int i = 0; i < width; ++i)
            sum += A[y*width + i] * B[i*width + x];
        C[y*width + x] = sum;
    }

43 Example – Matrix Multiplication
Optimized solution
- A work group computes a 16×16 block of the results
- In each step, the appropriate 16×16 blocks of the input matrices are loaded into local memory and the intermediate results are updated
- Exact numbers may vary a little between architectures
A sketch of this kernel follows below.
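A sketch of the described kernel (assuming the global work size equals the matrix width in both dimensions and the width is a multiple of 16; the kernel name and the TILE constant are illustrative):

    #define TILE 16

    __kernel void mul_matrix_tiled(__global const float *A,
                                   __global const float *B,
                                   __global float *C)
    {
        __local float tileA[TILE][TILE];
        __local float tileB[TILE][TILE];
        int width = get_global_size(0);
        int x = get_global_id(0), y = get_global_id(1);
        int lx = get_local_id(0), ly = get_local_id(1);
        float sum = 0;
        for (int t = 0; t < width; t += TILE) {
            // each work item loads one element of each 16x16 block
            tileA[ly][lx] = A[y*width + (t + lx)];
            tileB[ly][lx] = B[(t + ly)*width + x];
            barrier(CLK_LOCAL_MEM_FENCE);      // wait until both blocks are loaded
            for (int i = 0; i < TILE; ++i)     // update intermediate results
                sum += tileA[ly][i] * tileB[i][lx];
            barrier(CLK_LOCAL_MEM_FENCE);      // wait before overwriting the blocks
        }
        C[y*width + x] = sum;
    }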

44 OpenCL 2.0
OpenCL 2.0 (July 2013)
- First implementations from Intel and AMD
- Shared virtual memory: for sharing complex data structures with pointers
- Dynamic parallelism: device kernels can enqueue other kernels
- Generic address space: pointer arguments of functions need not declare their address space
- A subset of C11 atomics was added
- Pipes: FIFO data structures that can be read and written by kernels
The OpenCL 2.2 specification has been introduced recently. NVIDIA still implements only OCL 1.2 (in stable drivers).

45 Discussion

