GPU Memory Details
Martin Kruliš (v1.1), 03.11.2016


1 GPU Memory Details
Martin Kruliš

2 Overview
[Diagram: the memory hierarchy. The Host CPU and Host Memory (~25 GBps) connect to the GPU device over PCI Express (16/32 GBps). The GPU device contains Global Memory behind an L2 Cache (>100 GBps); each SMP on the GPU chip has an L1 Cache, Registers, and Cores.]
Note that details about the host memory interconnection are platform specific.

3 Host-Device Transfers
PCIe Transfers
Much slower than internal GPU data transfers
Issued explicitly by the host code: cudaMemcpy(dst, src, size, direction);
One exception: when the GPU memory is mapped into the host memory address space
The transfer call has a significant overhead, so bulk transfers are preferred
Overlapping
Up to 2 asynchronous transfers can run while the GPU is computing (as sketched below)
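A minimal sketch of both ideas, assuming illustrative kernel and buffer names that are not part of the original slides:

    #include <cuda_runtime.h>

    __global__ void process(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] += 1.0f;   // placeholder computation
    }

    void transferExample(const float *hostA, const float *hostB,
                         float *devA, float *devB, int n) {
        size_t bytes = n * sizeof(float);

        // One bulk transfer amortizes the per-call overhead better than
        // many small copies.
        cudaMemcpy(devA, hostA, bytes, cudaMemcpyHostToDevice);

        cudaStream_t compute, copy;
        cudaStreamCreate(&compute);
        cudaStreamCreate(&copy);

        // Issued into different streams, the asynchronous copy can overlap
        // with the kernel (hostB must be page-locked, e.g., allocated with
        // cudaMallocHost, for a true overlap).
        process<<<(n + 255) / 256, 256, 0, compute>>>(devA, n);
        cudaMemcpyAsync(devB, hostB, bytes, cudaMemcpyHostToDevice, copy);

        cudaDeviceSynchronize();
        cudaStreamDestroy(compute);
        cudaStreamDestroy(copy);
    }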

4 Global Memory
Global Memory Properties
Off-chip, but on the GPU device
High bandwidth and high latency: ~100 GBps, hundreds of clock cycles
Operated in transactions: contiguous aligned segments of 32 B to 128 B
The number of transactions depends on the caching model, GPU architecture, and memory access pattern

5 Global Memory
Global Memory Caching
Data are cached in the L2 cache
Relatively small (up to 2 MB on new Maxwell GPUs)
On CC 2.x (Fermi), data are also cached in the L1 cache, configurable by a compiler flag:
-Xptxas -dlcm=ca (Cache Always, i.e., also in L1; the default)
-Xptxas -dlcm=cg (Cache Global, i.e., L2 only)
CC 3.x (Kepler) reserves the L1 cache for local memory caching and register spilling
CC 5.x (Maxwell) separates the L1 cache from shared memory and unifies it with the texture cache
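For instance, the L2-only caching policy might be selected at compile time like this (file names are illustrative):

    nvcc -Xptxas -dlcm=cg kernel.cu -o kernel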

6 Global Memory
Coalesced Transfers
The number of transactions caused by a global memory access depends on the access pattern
Certain access patterns are optimized:
CC 1.x: threads sequentially access an aligned memory block; subsequent threads access subsequent words
CC 2.0 and later: threads access an aligned memory block; accesses within the block can be permuted
(See the sketch of both patterns below.)
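A minimal sketch contrasting the two patterns; the kernel names and the stride parameter are illustrative assumptions:

    __global__ void coalescedCopy(const float *in, float *out) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // Subsequent threads access subsequent words of an aligned block:
        // the warp is served by the minimal number of transactions.
        out[i] = in[i];
    }

    __global__ void stridedCopy(const float *in, float *out, int stride) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        // The warp's accesses spread over many aligned segments,
        // multiplying the number of transactions.
        out[i] = in[i];
    }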

7 Global Memory Access Patterns
Perfectly aligned sequential access

8 Global Memory Access Patterns
Perfectly aligned with permutation

9 Global Memory Access Patterns
Continuous sequential, but misaligned

10 Global Memory Coalesced Loads Impact

11 Shared Memory
Memory Shared by SM
Divided into banks
Each bank can be accessed independently
Consecutive 32-bit words are in consecutive banks
Optionally, a 64-bit word division is used (CC 3.x)
Bank conflicts are serialized, except for reading the same address (broadcast)

Compute capability | Mem. size | # of banks | Latency
1.x | 16 kB | 16 | 32 bits / 2 cycles
2.x | 48 kB | 32 | 32 bits / 2 cycles
3.x | 48 kB | 32 | 64 bits / 1 cycle

In newer architectures (CC 5.x and 6.x), the size of the shared memory may vary a little, but the limit per thread block remains 48 kB.

12 Shared Memory Linear Addressing
Each thread in a warp accesses a different memory bank
No collisions

13 Shared Memory Linear Addressing with Stride
Each thread accesses the 2*i-th item
2-way conflicts (2x slowdown) on CC < 3.0
No collisions on CC 3.x, due to the 64-bit per cycle throughput

14 Shared Memory Linear Addressing with Stride
Each thread accesses the 3*i-th item
No collisions, since the number of banks is not divisible by the stride (see the sketch below)

15 Shared Memory Broadcast
One set of threads accesses the value in bank #12 and the remaining threads access the value in bank #20
On CC 1.x, broadcasts are served independently, i.e., the sample below causes a 2-way conflict
CC 2.x and newer serve the broadcasts simultaneously
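A minimal sketch of this broadcast pattern, assuming a one-warp block; the bank numbers follow the slide:

    __global__ void broadcastDemo(const float *in, float *out) {
        __shared__ float s[32];
        s[threadIdx.x] = in[threadIdx.x];
        __syncthreads();
        // Threads 0-15 read the word in bank #12, threads 16-31 the word
        // in bank #20; each group is a broadcast. CC 2.x and newer serve
        // both broadcasts simultaneously, CC 1.x serializes them.
        float v = (threadIdx.x < 16) ? s[12] : s[20];
        out[threadIdx.x] = v;
    }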

16 Shared Memory
Shared Memory vs. L1 Cache
On CC 2.x and 3.x, they are the same resource
The division can be set for each kernel by cudaFuncSetCacheConfig(kernel, cacheConfig);
The cache configuration can prefer L1 or shared memory (i.e., selecting 48 kB of the 64 kB for the preferred one)
Shared Memory Configuration
Some devices (CC 3.x) can configure the memory banks by cudaFuncSetSharedMemConfig(kernel, config);
The config selects between 32-bit and 64-bit bank mode
The 32-bit mode on CC 3.x devices has one strange feature: if two threads access different addresses in the same bank, but both addresses lie in an aligned block of 64 32-bit words (the index of the second is the index of the first + 32), the memory can handle both requests without a collision.
Note that Maxwell (CC 5.x) returned to the previous (Fermi) configuration, i.e., the bank size is not configurable and is fixed at 32-bit words.
(A configuration sketch follows below.)
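A hedged usage sketch of both calls; myKernel is a placeholder name, while the enum values are the standard CUDA runtime constants:

    __global__ void myKernel(float *data) { /* ... */ }

    void configureKernel() {
        // Prefer shared memory (48 kB shared / 16 kB L1) for this kernel.
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
        // On CC 3.x, switch the shared memory banks to 64-bit mode.
        cudaFuncSetSharedMemConfig(myKernel, cudaSharedMemBankSizeEightByte);
    }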

17 Registers
Registers
One register pool per multiprocessor
8-64k of 32-bit registers (depending on CC)
Register allocation is defined by the compiler
As fast as the cores (no extra clock cycles)
A read-after-write dependency takes 24 clock cycles; it can be hidden if there are enough active warps
The hardware scheduler (and the compiler) attempts to avoid register bank conflicts whenever possible
The programmer has no direct control over these conflicts

18 Local Memory
Per-thread Global Memory
Allocated automatically by the compiler
The compiler can report the amount of allocated local memory (use --ptxas-options=-v)
Large local structures and arrays are placed here instead of in registers
Register Pressure
When there are not enough registers to accommodate the thread's data, registers are spilled into local memory
Can be moderated by selecting smaller thread blocks (see the sketch below)
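One way to request smaller blocks is the __launch_bounds__ qualifier, which tells the compiler the block size it must support; this sketch and its kernel body are illustrative:

    // Compile with nvcc --ptxas-options=-v to see the registers and
    // local memory used per thread.
    __global__ void __launch_bounds__(128, 4)   // <=128 threads/block, >=4 blocks/SM
    smallBlockKernel(float *data, int n) {
        // Knowing the block size limit, the compiler can budget more
        // registers per thread before spilling into local memory.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }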

19 Constant and Texture Memory
Constant Memory
A special 64 KB memory space for read-only data, served by a cache; 8 KB is the cache working set per multiprocessor
CC 2.x introduces the LDU (LoaD Uniform) instruction, which the compiler uses to force loading read-only, thread-independent variables through this cache
Texture Memory
The texture cache is optimized for 2D spatial locality
Additional functionality such as fast data interpolation, a normalized coordinate system, or handling of boundary cases
(A constant memory sketch follows below.)
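A minimal constant memory sketch; the symbol name and sizes are illustrative:

    __constant__ float coeffs[16];   // lives in the 64 KB constant space

    __global__ void applyCoeffs(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        // All threads read the same element: it is served from the
        // constant cache and broadcast to the whole warp.
        if (i < n) data[i] *= coeffs[0];
    }

    // Host side: fill the symbol before the launch.
    // cudaMemcpyToSymbol(coeffs, hostCoeffs, sizeof(coeffs));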

20 Memory Allocation
Global Memory
cudaMalloc(), cudaFree()
Dynamic kernel allocation: malloc() and free() called from a kernel; the heap size is set by cudaDeviceSetLimit(cudaLimitMallocHeapSize, size)
Shared Memory
Statically (e.g., __shared__ int foo[16];)
Dynamically (by a kernel launch parameter):
extern __shared__ float bar[];
float *bar1 = &(bar[0]);
float *bar2 = &(bar[size_of_bar1]);
(A launch sketch follows below.)
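A hedged sketch of the dynamic variant including the launch; the kernel name and the two sub-array sizes are illustrative:

    __global__ void dynSharedKernel(int count1, int count2) {
        extern __shared__ float bar[];   // size supplied at launch time
        float *bar1 = &bar[0];           // first logical array
        float *bar2 = &bar[count1];      // second array placed right behind it
        // ... use bar1[0..count1) and bar2[0..count2) ...
    }

    // The third launch parameter is the dynamic shared memory size in bytes:
    // dynSharedKernel<<<blocks, threads, (count1 + count2) * sizeof(float)>>>(count1, count2);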

21 Implications and Guidelines
Global Memory
Data should be accessed in a coalesced manner
Hot data should be manually cached in shared memory
Shared Memory
Bank conflicts need to be avoided:
Redesign data structures in a column-wise manner (see the sketch below)
Use strides that are not divisible by the number of banks
Registers and Local Memory
Use as few registers as possible and avoid register spilling
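A hedged illustration of the column-wise (structure-of-arrays) redesign; the type and field names are assumptions:

    // Array-of-structures: thread i loads p[i].x from scattered offsets,
    // so a warp's accesses span many memory segments (or shared banks).
    struct Particle { float x, y, z; };

    // Structure-of-arrays: the x values of consecutive elements are
    // adjacent, so a warp reads them in one coalesced transaction.
    struct Particles { float *x, *y, *z; };

    __global__ void moveX(Particles p, float dx, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p.x[i] += dx;   // consecutive threads, consecutive words
    }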

22 Implications and Guidelines
Memory Caching
Structures should be designed to utilize the caches in the best possible way
The working set of the active blocks should fit into the L2 cache
Provide maximum information to the compiler:
Use const for constant data
Use __restrict__ to indicate that no pointer aliasing will occur (see the sketch below)
Data Alignment
Operate on 32-bit/64-bit values only
Align data structures to suitable powers of 2
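A minimal sketch of the qualifiers in use; this is the common SAXPY form, not code from the slides:

    // const plus __restrict__ tells the compiler the x values are read-only
    // and never aliased by y, enabling caching and reordering of the loads.
    __global__ void saxpy(int n, float a,
                          const float * __restrict__ x,
                          float * __restrict__ y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }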

23 Maxwell Architecture
What is new in Maxwell:
L1 is merged with the texture cache; data are cached in L1 the same way as in Fermi
Shared memory is independent: 64 kB or 96 kB, no longer shared with L1
Shared memory uses 32-bit banks, reverting to the Fermi-like style while keeping the aggregated bandwidth
Faster shared memory atomic operations

24 Discussion

