Published by Haden Saunders. Modified over 9 years ago.

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2011. GPUMemories.ppt
GPU Memories
These notes will introduce:
  The basic memory hierarchy in the NVIDIA GPU: global memory, shared memory, register file, constant memory
  How to declare variables for each memory
  Cache memory and how to use it most effectively in a program

2 Host-Device Connection
[Diagram: host (CPU) and host memory connected over PCIe to the device (GPU) and its global memory]
  Host to device: PCIe x16 4 GB/s; PCIe x16 Gen2 8 GB/s peak
  GPU memory bus: C2050 144 GB/s; GTX 280 141.7 GB/s; GDDR5 230 GB/s
  Host memory bus: DDR 400 3.2 GB/s
The memory bus is limited by memory and processor-memory connection bandwidth. HyperTransport and Intel's QuickPath currently reach 25.6 GB/s.
Note that transferring between host and GPU is much slower than between the device and its global memory, hence the need to minimize host-device transfers.
A GPU in a laptop such as a MacBook Pro may share the system memory.

3 GPU Memory Hierarchy
Global memory is off-chip on the GPU card. Even though global memory is an order of magnitude faster than CPU memory, it is still relatively slow and a bottleneck for performance.
The GPU is provided with faster on-chip memory, although data has to be transferred explicitly into shared memory.
Pointers created with cudaMalloc() point to global memory.
Two principal levels on-chip: shared memory and registers
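
4 Scope of global memory, shared memory, and registers
[Diagram: the host and host memory sit outside the device. On the device, a grid contains blocks and each block contains threads. Each thread has its own registers and local memory, each block has its own shared memory, and the whole grid shares global memory and constant memory.]
Constant memory: for storing global constants, see later.
There is also a read-only global memory called texture memory.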

5 Currently data can only be transferred from the host to global memory (and constant memory), not from the host directly to shared memory.
Constant memory is used for data that does not change (i.e. read-only by the GPU).
Shared memory is said to provide up to 15x the speed of global memory.
Registers and shared memory have similar speed, provided the shared-memory accesses read the same address or have no bank conflicts.

6 Lifetimes
  Global/constant memory: lifetime of application
  Shared memory: lifetime of a kernel
  Registers: lifetime of a kernel
Scope
  Global/constant memory: grid
  Shared memory: block
  Registers: thread

7 Declaring program variables for registers, shared memory and global memory

Memory     Declaration                              Scope    Lifetime
Registers  Automatic variables* other than arrays   Thread   Kernel
Local      Automatic array variables                Thread   Kernel
Shared     __shared__                               Block    Kernel
Global     __device__                               Grid     Application
Constant   __constant__                             Grid     Application

*Automatic variables are allocated automatically when entering the scope of the variable and de-allocated when leaving scope. In C, all variables declared within a block are "automatic" by default, see http://en.wikipedia.org/wiki/Automatic_variable

8 Global Memory
__device__
For data available to all threads in the device.
Declared outside function bodies.
Scope of grid and lifetime of application.

    #include <…>
    #define N 1000
    …
    __device__ int A[N];

    __global__ void kernel() {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        A[tid] = …
        …
    }

    int main() {
        …
    }
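
A minimal, complete sketch of the `__device__` pattern above. The kernel name `fillA`, the stored value, the launch of 4 blocks of 250 threads, and the helper `runDeviceArrayDemo` are assumptions made for illustration and are not from the slides. `cudaMemcpyFromSymbol` copies the `__device__` array back to the host so the result can be checked.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 1000

__device__ int A[N];                  // global memory: grid scope, application lifetime

__global__ void fillA() {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) A[tid] = 2 * tid;    // illustrative value only
}

// Hypothetical helper: returns true if every element holds the expected value.
bool runDeviceArrayDemo() {
    static int h_A[N];
    fillA<<<4, 250>>>();              // assumed launch: 4 * 250 = N threads
    if (cudaDeviceSynchronize() != cudaSuccess) return false;
    if (cudaMemcpyFromSymbol(h_A, A, sizeof(h_A)) != cudaSuccess) return false;
    for (int i = 0; i < N; i++)
        if (h_A[i] != 2 * i) return false;
    return true;
}

int main() {
    printf("%s\n", runDeviceArrayDemo() ? "ok" : "failed");
    return 0;
}
```

Because A has application lifetime, a second kernel launched later in the same program would still see the values written by the first.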

9 Issues with using global memory
  Long delays, slow
  Access congestion
  Cannot synchronize accesses
  Need to ensure no conflicts of accesses between threads

10 Shared Memory
Shared memory is on the GPU chip and very fast.
Separate data available to all threads in one block.
Declared inside function bodies.
Scope of block and lifetime of kernel call.
So each block would have its own array A[N].

    #include <…>
    #define N 1000
    …
    __global__ void kernel() {
        __shared__ int A[N];
        int tid = threadIdx.x;
        A[tid] = …
        …
    }

    int main() {
        …
    }

11 Transferring data to shared memory

    int A[N][N];    // host array, to be copied to the device with cudaMemcpy

    __global__ void myKernel(int *A_global) {
        __shared__ int A_sh[N][N];    // declare shared memory
        int row = …
        int col = …
        A_sh[row][col] = A_global[row + col*N];    // copy from global to shared
        …
    }

    int main() {
        …
        cudaMalloc((void**)&dev_A, size);    // allocate global memory
        cudaMemcpy(dev_A, A, size, cudaMemcpyHostToDevice);    // copy to global memory
        myKernel<<<…>>>(dev_A);
        …
    }
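
The host-to-global-to-shared path can be sketched end to end as follows. This is one possible completion under stated assumptions, not the author's code: N is fixed at 16 so a single 16 x 16 block covers the matrix, indexing is row-major, and the output array `out`, the symmetric-sum computation, and the helper `runSharedCopyDemo` are hypothetical additions that give the shared tile something to do.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 16    // assumed size: one 16 x 16 block covers the whole matrix

__global__ void myKernel(const int *A_global, int *out) {
    __shared__ int A_sh[N][N];                    // one tile per block
    int row = threadIdx.y;
    int col = threadIdx.x;
    A_sh[row][col] = A_global[row * N + col];     // copy from global to shared
    __syncthreads();                              // wait until the whole tile is loaded
    // Each thread reads an element written by a different thread.
    out[row * N + col] = A_sh[row][col] + A_sh[col][row];
}

// Hypothetical helper: returns true if the device result matches the host computation.
bool runSharedCopyDemo() {
    int A[N][N], result[N][N];
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            A[r][c] = r * N + c;

    size_t size = sizeof(A);
    int *dev_A, *dev_out;
    cudaMalloc((void**)&dev_A, size);                      // allocate global memory
    cudaMalloc((void**)&dev_out, size);
    cudaMemcpy(dev_A, A, size, cudaMemcpyHostToDevice);    // host to global memory

    dim3 block(N, N);
    myKernel<<<1, block>>>(dev_A, dev_out);
    cudaError_t err = cudaMemcpy(result, dev_out, size, cudaMemcpyDeviceToHost);
    cudaFree(dev_A);
    cudaFree(dev_out);
    if (err != cudaSuccess) return false;

    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            if (result[r][c] != A[r][c] + A[c][r]) return false;
    return true;
}

int main() {
    printf("%s\n", runSharedCopyDemo() ? "ok" : "failed");
    return 0;
}
```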

12 Issues with Shared Memory
Shared memory is not immediately synchronized after access. Usually it is the writes that matter.
Use __syncthreads() before you read data that has been altered by another thread.
Shared memory is very limited (Fermi has up to 48 KB per SM, shared by all blocks resident on that SM, NOT 48 KB for each block).
Hence you may have to divide your data into "chunks".
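
A small sketch of why the barrier matters: each thread writes one element of a shared array and then reads an element written by a different thread. The kernel name `reverseBlock`, the block size of 256, and the helper `runReverseDemo` are illustrative assumptions. Without the `__syncthreads()` call a thread could read a slot before its owner has written it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 256    // assumed: one block of N threads

__global__ void reverseBlock(int *d) {
    __shared__ int s[N];
    int t = threadIdx.x;
    s[t] = d[t];             // each thread writes its own slot
    __syncthreads();         // all writes finish before any thread reads
    d[t] = s[N - 1 - t];     // read a slot written by another thread
}

// Hypothetical helper: returns true if the array comes back reversed.
bool runReverseDemo() {
    int h[N];
    for (int i = 0; i < N; i++) h[i] = i;

    int *d;
    cudaMalloc((void**)&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    reverseBlock<<<1, N>>>(d);
    cudaError_t err = cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < N; i++)
        if (h[i] != N - 1 - i) return false;
    return true;
}

int main() {
    printf("%s\n", runReverseDemo() ? "ok" : "failed");
    return 0;
}
```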

13 Example uses of shared data
Where the data can be divided into independent parts:
  Image processing: the image can be divided into blocks and placed into shared memory for processing
  Block matrix multiplication: sub-matrices can be stored in shared memory
(Slides to follow on this)

14 Registers
The compiler will place variables declared in a kernel in registers when possible.
There is a limit to the number of registers: Fermi has 32768 32-bit registers per SM.
Registers are divided across "warps" (groups of 32 threads that operate in SIMT mode) and have the lifetime of the warps.

    __global__ void kernel() {
        int x, y, z;
        …
    }

15 Arrays declared within kernel (automatic array variables)

    __global__ void kernel() {
        int A[10];
        …
    }

Generally stored in global memory, but a private copy is made for each thread.*
Access can be as slow as global memory, except that it is cached, see later.
If the array is indexed with constant values, the compiler may use registers.
* Global "local" memory, see later

16 Constant Memory
__constant__
For data not altered by the device.
Although stored in global memory, it is cached and has fast access.
Declared outside function bodies.
Scope of grid and lifetime of application.
Size currently limited to 65536 bytes.

    #include <…>
    …
    __constant__ int n;

    __global__ void kernel() {
        …
    }

    int main() {
        int h_n = …;    // host-side value
        cudaMemcpyToSymbol(n, &h_n, sizeof(int));    // host code cannot assign to n directly
        …
    }
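
A minimal sketch of setting and using a `__constant__` variable. The host writes it with `cudaMemcpyToSymbol`, since host code cannot assign to a `__constant__` symbol directly. The kernel `scale`, the value 3, and the helper `runConstantDemo` are hypothetical and chosen only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__constant__ int n;    // constant memory: read-only on the device, cached

__global__ void scale(int *data, int len) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len) data[i] *= n;    // every thread reads the same constant
}

// Hypothetical helper: returns true if every element was multiplied by the constant.
bool runConstantDemo() {
    const int len = 512;
    int h[len];
    for (int i = 0; i < len; i++) h[i] = i;

    int h_n = 3;                                           // illustrative value
    if (cudaMemcpyToSymbol(n, &h_n, sizeof(int)) != cudaSuccess) return false;

    int *d;
    cudaMalloc((void**)&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    scale<<<(len + 127) / 128, 128>>>(d, len);
    cudaError_t err = cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < len; i++)
        if (h[i] != 3 * i) return false;
    return true;
}

int main() {
    printf("%s\n", runConstantDemo() ? "ok" : "failed");
    return 0;
}
```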

17 Local memory
Resides in the device memory space (global memory) and is slow, except that it is organized so that consecutive 32-bit words are accessed by consecutive thread IDs, giving coalesced accesses when possible.
For compute capability 2.x, it is cached in the on-chip L1 and L2 caches.
Used to hold arrays that are not indexed with constant values, and variables when there are no more registers available for them.

18 Cache memory
More recent GPUs have L1 and L2 cache memory, but apparently without cache coherence, so it is up to the programmer to ensure that.
  Make sure each thread accesses different locations
  Ideally arrange accesses to be in the same cache lines
Compute capability 1.3 Teslas do not have cache memory.
Compute capability 2.0 Fermis have L1/L2 caches.

19 Fermi Caches
[Diagram: each streaming multiprocessor (SM) has its own register file and L1 cache/shared memory; all SMs share the L2 cache]

20 Fermi Cache Sizes
L2
  Unified 384 KB L2 cache for all SMs
  384-bit memory bus from device memory to L2 cache
  Up to 160 GB/s bandwidth
  128-byte cache line (32 32-bit integers or floats, or 16 doubles)
L1
  Each SM has 16 KB or 48 KB of L1 cache (64 KB split 16/48 or 48/16 between L1 cache and shared memory)
No global cache coherency!

21 Poor Performance from Poor Data Layout

    __global__ void kernel(int *A) {
        int i = threadIdx.x + blockDim.x*blockIdx.x;
        A[1000*i] = …
    }

Very bad! Each thread accesses a location on a different cache line.
Fermi line size is 32 integers or floats.

22 Taking Advantage of Cache

    __global__ void kernel(int *A) {
        int i = threadIdx.x + blockDim.x*blockIdx.x;
        A[i] = …
    }

Good! Groups of 32 accesses by consecutive threads fall on the same line. The threads will be in the same warp.
Fermi line size is 32 integers or floats.
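
The two layouts can be compared with a rough timing sketch using CUDA events. Everything here is an assumption for illustration: the kernel names, a stride of 32 ints (one 128-byte line per thread) instead of the slide's 1000 to keep the allocation small, 2^18 threads, and the helper `timeWrite`. Timings vary by GPU, so no expected numbers are given.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void writeStrided(int *A, int stride) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    A[stride * i] = i;    // each thread touches a different cache line
}

__global__ void writeContiguous(int *A) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    A[i] = i;             // 32 consecutive threads share one cache line
}

// Hypothetical helper: returns elapsed milliseconds, or -1 on allocation failure.
float timeWrite(bool strided) {
    const int threads = 1 << 18, stride = 32;
    int *d_A;
    if (cudaMalloc((void**)&d_A, sizeof(int) * threads * stride) != cudaSuccess)
        return -1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    if (strided) writeStrided<<<threads / 256, 256>>>(d_A, stride);
    else         writeContiguous<<<threads / 256, 256>>>(d_A);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_A);
    return ms;
}

int main() {
    timeWrite(false);    // warm-up run so start-up cost is not measured
    printf("strided:    %f ms\n", timeWrite(true));
    printf("contiguous: %f ms\n", timeWrite(false));
    return 0;
}
```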

23 Warp
A "warp" in CUDA is a group of 32 threads that operate in SIMT mode.
A "half warp" (16 threads) actually executes simultaneously (current GPUs).
Using knowledge of warps and how the memory is laid out can improve code performance.

24 Memory Banks
[Diagram: device (GPU) connected to memory banks Memory 1, Memory 2, Memory 3, Memory 4, holding A[0], A[1], A[2], A[3]]
Consecutive locations are on successive memory banks.
The device can fetch A[0], A[1], A[2], A[3] … A[B-1] at the same time, where there are B banks.

25 Shared Memory Banks
Shared memory is divided into 16 or 32 banks of 32-bit width. Banks can be accessed simultaneously.
  Compute capability 1.x has 16 banks; accesses are processed per half warp
  Compute capability 2.x has 32 banks; accesses are processed per warp*
To achieve maximum bandwidth, threads in a half warp (a full warp for compute capability 2.x) should access different banks of shared memory.
Exception: all threads read the same location, which results in a broadcast operation.
* coit-grid06 (C2050, compute capability 2.0) has 32 banks
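
The bank mapping can be made concrete with a small host-side sketch. It assumes the usual mapping in which successive 32-bit words fall in successive banks, and it models thread t reading word t * stride. The functions `bankOf` and `maxConflictDegree` are hypothetical helpers, and the broadcast exception (all threads reading the same word) is not modelled.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Bank holding a given 32-bit word: successive words map to successive banks.
__host__ __device__ inline int bankOf(int wordIndex, int numBanks) {
    return wordIndex % numBanks;
}

// Hypothetical helper: thread t reads word t * strideWords.
// Returns the largest number of threads that hit the same bank.
int maxConflictDegree(int strideWords, int numBanks, int numThreads) {
    int hits[32] = {0};    // assumes numBanks <= 32
    int worst = 0;
    for (int t = 0; t < numThreads; t++) {
        int b = bankOf(t * strideWords, numBanks);
        hits[b]++;
        if (hits[b] > worst) worst = hits[b];
    }
    return worst;
}

int main() {
    // 32 banks, one warp of 32 threads (compute capability 2.x)
    printf("stride 1:  %d\n", maxConflictDegree(1, 32, 32));     // 1: no conflict
    printf("stride 2:  %d\n", maxConflictDegree(2, 32, 32));     // 2: two-way conflict
    printf("stride 32: %d\n", maxConflictDegree(32, 32, 32));    // 32: fully serialized
    return 0;
}
```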

26 Global memory banks
Global memory is also partitioned into banks, depending upon the version of the GPU.
  200 series and 10 series NVIDIA GPUs have 8 partitions, each 256 bytes wide
  C2050 has ??

27 Achieving best data access patterns
Requires a lot of thought; will consider in detail for specific problems.
Generally:
  Padding data to make data aligned
For matrix operations:
  Tiling
  Pre-transpose operations
  Padding: adding columns/rows
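
Tiling and padding are often combined in a shared-memory matrix transpose. The sketch below is a commonly used pattern, not code from these slides. It assumes a 32 x 32 tile (1024 threads per block, so compute capability 2.0 or later) and adds one padding column so that reading the tile by column touches 32 different banks. `transposeTiled` and `runTransposeDemo` are illustrative names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 32    // assumed tile size: needs 1024 threads per block

__global__ void transposeTiled(const float *in, float *out, int n) {
    __shared__ float tile[TILE][TILE + 1];    // +1 column of padding avoids bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];    // coalesced read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;    // swap block indices for the output tile
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write
}

// Hypothetical helper: returns true if out is the transpose of in.
bool runTransposeDemo() {
    const int n = 64;
    static float h_in[n * n], h_out[n * n];
    for (int i = 0; i < n * n; i++) h_in[i] = (float)i;

    size_t size = sizeof(h_in);
    float *d_in, *d_out;
    cudaMalloc((void**)&d_in, size);
    cudaMalloc((void**)&d_out, size);
    cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    transposeTiled<<<grid, block>>>(d_in, d_out, n);
    cudaError_t err = cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int r = 0; r < n; r++)
        for (int c = 0; c < n; c++)
            if (h_out[r * n + c] != h_in[c * n + r]) return false;
    return true;
}

int main() {
    printf("%s\n", runTransposeDemo() ? "ok" : "failed");
    return 0;
}
```

With an unpadded `tile[TILE][TILE]`, the column read `tile[threadIdx.x][threadIdx.y]` would place all 32 threads of a warp in the same bank.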

28 Memory Coalescing
Aligned memory accesses
Threads can read 4, 8, or 16 bytes at a time from global memory, but only if accesses are aligned. That is:
  A 4-byte read must start at address …xxxxx00
  An 8-byte read must start at address …xxxx000
  A 16-byte read must start at address …xxx0000
Then access is much faster (twice?)
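
A short sketch of 16-byte accesses using the built-in `float4` type. It relies on pointers returned by `cudaMalloc` being suitably aligned for any built-in vector type, and it checks that on the host before launching. The kernel `copyVec4` and the helpers `isAligned` and `runAlignedCopyDemo` are hypothetical names introduced for this example.

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

__global__ void copyVec4(const float4 *in, float4 *out, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) out[i] = in[i];    // one 16-byte read and one 16-byte write per thread
}

// Hypothetical helper: true if the address is a multiple of `bytes`.
bool isAligned(const void *p, size_t bytes) {
    return reinterpret_cast<uintptr_t>(p) % bytes == 0;
}

// Hypothetical helper: returns true if the buffers are aligned and the copy is correct.
bool runAlignedCopyDemo() {
    const int n = 1024, n4 = n / 4;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; i++) { h_in[i] = (float)i; h_out[i] = -1.0f; }

    float *d_in, *d_out;
    cudaMalloc((void**)&d_in, sizeof(h_in));
    cudaMalloc((void**)&d_out, sizeof(h_out));
    bool aligned = isAligned(d_in, 16) && isAligned(d_out, 16);

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    copyVec4<<<(n4 + 127) / 128, 128>>>(reinterpret_cast<const float4 *>(d_in),
                                        reinterpret_cast<float4 *>(d_out), n4);
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess || !aligned) return false;

    for (int i = 0; i < n; i++)
        if (h_out[i] != h_in[i]) return false;
    return true;
}

int main() {
    printf("%s\n", runAlignedCopyDemo() ? "ok" : "failed");
    return 0;
}
```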

29 Ideally try to arrange for threads to access different memory modules at the same time, and consecutive addresses.
A bad case would be:
  Thread 0 to access A[0], A[1], ... A[15]
  Thread 1 to access A[16], A[17], ... A[31]
  Thread 2 to access A[32], A[33], ... A[47]
  … etc.
A good case would be:
  Thread 0 to access A[0], A[16], A[32], ...
  Thread 1 to access A[1], A[17], A[33], ...
  Thread 2 to access A[2], A[18], A[34], ...
  … etc.
if there are 16 banks. Need to know that detail!
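
The "good case" corresponds to an interleaved loop in which thread t starts at element t and steps by the total number of threads, so at each step neighbouring threads touch neighbouring addresses. The sketch below generalizes the slide's stride of 16 to the thread count. The kernel `interleavedSum`, the launch of 4 blocks of 64 threads, and the helper `runInterleavedDemo` are assumptions for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void interleavedSum(const int *A, int *partial, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int numThreads = gridDim.x * blockDim.x;
    int sum = 0;
    // Thread t reads A[t], A[t + numThreads], A[t + 2*numThreads], ...
    for (int k = t; k < n; k += numThreads)
        sum += A[k];
    partial[t] = sum;
}

// Hypothetical helper: returns true if the partial sums add up to the host total.
bool runInterleavedDemo() {
    const int n = 1 << 16, blocks = 4, threads = 64, total = blocks * threads;
    static int h_A[n];
    int h_partial[total];
    long expected = 0;
    for (int i = 0; i < n; i++) { h_A[i] = i % 7; expected += h_A[i]; }

    int *d_A, *d_partial;
    cudaMalloc((void**)&d_A, sizeof(h_A));
    cudaMalloc((void**)&d_partial, sizeof(h_partial));
    cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);
    interleavedSum<<<blocks, threads>>>(d_A, d_partial, n);
    cudaError_t err = cudaMemcpy(h_partial, d_partial, sizeof(h_partial),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_partial);
    if (err != cudaSuccess) return false;

    long got = 0;
    for (int t = 0; t < total; t++) got += h_partial[t];
    return got == expected;
}

int main() {
    printf("%s\n", runInterleavedDemo() ? "ok" : "failed");
    return 0;
}
```

The "bad case" would instead give each thread a contiguous chunk, `for (k = t * chunk; k < (t + 1) * chunk; k++)`, so that at any one step the threads' addresses are a whole chunk apart.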

30 Wikipedia "CUDA", http://en.wikipedia.org/wiki/CUDA
coit-grid06: C2050, compute capability 2.0

31 Questions