
1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2011, GPUMemories.ppt
GPU Memories
These notes will introduce:
- The basic memory hierarchy in the NVIDIA GPU: global memory, shared memory, register file, constant memory
- How to declare variables for each memory
- Cache memory and how to make the most effective use of it in a program

2 Host-Device Connection
Host (CPU) and host memory connect to the device (GPU) and its global memory over PCIe: PCIe x16 4 GB/s, PCIe x16 Gen2 8 GB/s peak. (The original figure also lists memory-bus bandwidths for specific cards and memory types, e.g. GDDR5 at 230 GB/s.)
Memory bus limited by memory and processor-memory connection bandwidth; HyperTransport and Intel's QuickPath currently 25.6 GB/s.
Note that transferring between host and GPU is much slower than between device and global memory. Hence the need to minimize host-device transfers.
A GPU in a laptop (e.g. a MacBook Pro) may share the system memory.

3 GPU Memory Hierarchy
Global memory is off-chip on the GPU card. Even though global memory is an order of magnitude faster than CPU memory, it is still relatively slow and a bottleneck for performance.
The GPU is provided with faster on-chip memory, although data has to be transferred explicitly into shared memory.
Pointers created with cudaMalloc() point to global memory.
Two principal levels on-chip: shared memory and registers.
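As a minimal sketch of the usual pattern (the names dev_A, host_A and the kernel are illustrative, not from these notes): allocate global memory, copy the input to the device once, run the kernel, and copy the result back once, keeping host-device traffic to a minimum.

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void doubleElements(int *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= 2;                  // works on data already in global memory
}

int main() {
    const int N = 1000;
    size_t size = N * sizeof(int);
    int host_A[N];
    for (int i = 0; i < N; i++) host_A[i] = i;

    int *dev_A;                            // cudaMalloc() returns a pointer into global memory
    cudaMalloc((void**)&dev_A, size);
    cudaMemcpy(dev_A, host_A, size, cudaMemcpyHostToDevice);   // one host-to-device transfer

    doubleElements<<<(N + 255) / 256, 256>>>(dev_A, N);

    cudaMemcpy(host_A, dev_A, size, cudaMemcpyDeviceToHost);   // one device-to-host transfer
    cudaFree(dev_A);
    return 0;
}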

4 Scope of global memory, shared memory, and registers (diagram)
Diagram labels: grid, block, threads; registers and local memory per thread; shared memory per block; global memory and constant memory for the whole grid; host and host memory on the CPU side.
For storing global constants, see later. There is also a read-only global memory called texture memory.

5 Currently data can only be transferred from the host to global memory (and constant memory), not directly from the host to shared memory.
Constant memory is used for data that does not change (i.e. read-only by the GPU).
Shared memory is said to provide up to 15x the speed of global memory.
Shared memory is similar in speed to registers if all threads read the same address or there are no bank conflicts.

6 Lifetimes and Scope
Lifetime:
- Global/constant memory - lifetime of the application
- Shared memory - lifetime of a kernel
- Registers - lifetime of a kernel
Scope:
- Global/constant memory - grid
- Shared memory - block
- Registers - thread

7 Declaring program variables for registers, shared memory and global memory

Memory      Declaration                          Scope    Lifetime
Registers   Automatic variables* (not arrays)    Thread   Kernel
Local       Automatic array variables            Thread   Kernel
Shared      __shared__                           Block    Kernel
Global      __device__                           Grid     Application
Constant    __constant__                         Grid     Application

*Automatic variables are allocated automatically when entering the scope of the variable and de-allocated when leaving the scope. In C, all variables declared within a block are "automatic" by default.

8 Global Memory __device__
For data available to all threads in the device.
Declared outside function bodies.
Scope of grid and lifetime of application.

#include <stdio.h>
#define N 1000
...
__device__ int A[N];

__global__ void kernel() {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    A[tid] = ...
    ...
}

int main() {
    ...
}

9 Issues with using global memory
- Long delays, slow
- Access congestion
- Cannot synchronize accesses
- Need to ensure no conflicts of accesses between threads (one common approach, atomic operations, is sketched below)
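As an illustration (not from the original slides): when many threads must update the same global-memory location, CUDA's atomic operations such as atomicAdd() serialize the conflicting accesses safely, at some cost in speed. A minimal sketch with a hypothetical histogram kernel:

__global__ void histogram(const int *data, int n, unsigned int *bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Many threads may hit the same bin; atomicAdd resolves the conflict,
        // but the conflicting updates are serialized.
        atomicAdd(&bins[data[i] % 256], 1u);
    }
}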

10 Shared Memory
Shared memory is on the GPU chip and very fast.
Separate data available to all threads in one block.
Declared inside function bodies.
Scope of block and lifetime of kernel call.
So each block would have its own array A[N].

#include <stdio.h>
#define N 1000
...
__global__ void kernel() {
    __shared__ int A[N];
    int tid = threadIdx.x;
    A[tid] = ...
    ...
}

int main() {
    ...
}

11 Transferring data to shared memory

int A[N][N];                                // to be copied into the device from the host

__global__ void myKernel(int *A_global) {
    __shared__ int A_sh[n][n];              // declare shared memory (n x n tile)
    int row = ...
    int col = ...
    A_sh[row][col] = A_global[row + col*N]; // copy from global to shared
    ...
}

int main() {
    ...
    int *dev_A;                                          // device pointer
    cudaMalloc((void**)&dev_A, size);                    // allocate global memory
    cudaMemcpy(dev_A, A, size, cudaMemcpyHostToDevice);  // copy to global memory
    myKernel<<<B, T>>>(dev_A);                           // launch B blocks of T threads
    ...
}

12 Issues with Shared Memory
Shared memory is not immediately synchronized after access; usually it is the writes that matter. Use __syncthreads() before you read data that has been altered by other threads.
Shared memory is very limited (Fermi has up to 48 KB per SM, NOT per block). Hence you may have to divide your data into "chunks".
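A minimal sketch of the chunking pattern (the kernel and CHUNK size are illustrative, not from these notes), with __syncthreads() separating the writes from the reads; it is launched with n/CHUNK blocks of CHUNK threads and assumes n is a multiple of CHUNK.

#define CHUNK 256                               // must match the block size in this sketch

__global__ void reverseChunks(int *A, int n) {
    __shared__ int buf[CHUNK];
    int base = blockIdx.x * CHUNK;

    buf[threadIdx.x] = A[base + threadIdx.x];   // each thread writes one element
    __syncthreads();                            // wait until the whole chunk is loaded

    // Now safe to read elements written by other threads in the block.
    A[base + threadIdx.x] = buf[CHUNK - 1 - threadIdx.x];
}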

13 Example uses of shared data
Where the data can be divided into independent parts:
- Image processing: the image can be divided into blocks and placed into shared memory for processing.
- Block matrix multiplication: sub-matrices can be stored in shared memory. (Slides to follow on this.)
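As a preview only (a minimal sketch, not the version developed in the later slides), the core pattern of a shared-memory tiled matrix multiply for square N x N row-major matrices, assuming N is a multiple of a hypothetical TILE size and a (TILE, TILE) thread block:

#define TILE 16

__global__ void matMulTiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; t++) {
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();                         // tile fully loaded before use

        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                         // finished with this tile
    }
    C[row * N + col] = sum;
}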

14 Registers
The compiler will place variables declared in a kernel in registers when possible.
There is a limit to the number of registers: Fermi has 32,768 32-bit registers per SM.
Registers are divided across "warps" (groups of 32 threads that operate in SIMT mode) and have the lifetime of the warps.

__global__ void kernel() {
    int x, y, z;
    ...
}

15 Arrays declared within a kernel (automatic array variables)

__global__ void kernel() {
    int A[10];
    ...
}

Generally stored in global memory, but a private copy is made for each thread.*
Can be as slow to access as global memory, except that it is cached, see later.
If the array is indexed only with constant values, the compiler may use registers.

* Global "local" memory, see later.
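A small illustration of the difference the indexing pattern can make (both kernels are hypothetical, and whether the compiler actually keeps the array in registers is its decision):

__global__ void constantIndex(float *out) {
    float v[4];
    v[0] = 1.0f; v[1] = 2.0f; v[2] = 3.0f; v[3] = 4.0f;
    out[threadIdx.x] = v[0] + v[3];   // indices known at compile time:
                                      // the compiler can keep v in registers
}

__global__ void variableIndex(float *out, int i) {
    float v[4];
    v[0] = 1.0f; v[1] = 2.0f; v[2] = 3.0f; v[3] = 4.0f;
    out[threadIdx.x] = v[i];          // index not known at compile time:
                                      // v is likely spilled to (cached) local memory
}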

16 Constant Memory __constant__
For data not altered by the device.
Although stored in global memory, it is cached and has fast access.
Declared outside function bodies.
Scope of grid and lifetime of application.
Size currently limited to 65,536 bytes (64 KB).

#include <stdio.h>
...
__constant__ int n;

__global__ void kernel() {
    ...
}

int main() {
    int h_n = ...;
    cudaMemcpyToSymbol(n, &h_n, sizeof(n));   // host sets the constant via the runtime API
    ...
}

17 Local memory
Resides in device memory space (global memory) and is slow, except that it is organized so that consecutive 32-bit words are accessed by consecutive thread IDs, giving the best coalesced accesses when possible.
For compute capability 2.x, it is cached in the on-chip L1 and L2 caches.
Used to hold arrays that are not indexed with constant values, and variables when there are no more registers available for them.

18 Cache memory
More recent GPUs have L1 and L2 cache memory, but apparently without cache coherence, so it is up to the programmer to ensure that:
- each thread accesses different locations, and
- ideally, accesses are arranged to fall on the same cache lines.
Compute capability 1.3 Teslas do not have cache memory; compute capability 2.0 Fermis have L1/L2 caches.

19 Fermi Caches (diagram)
Diagram labels: streaming multiprocessors (SMs), each with a register file and an L1 cache/shared memory; a single L2 cache shared by all SMs.

20 Fermi Cache Sizes
L2:
- Unified 384 kB L2 cache for all SMs
- 384-bit memory bus from device memory to the L2 cache
- Up to 160 GB/s bandwidth
- 128-byte cache line (32 32-bit integers or floats, or 16 doubles)
L1:
- Each SM has 16 kB or 48 kB of L1 cache (64 kB split 16/48 or 48/16 between L1 cache and shared memory)
- No global cache coherency!
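The 16/48 split can be selected per kernel through the runtime API. A minimal sketch (the kernel name is hypothetical):

__global__ void myKernel(float *data) { /* ... */ }

int main() {
    // Ask for 48 kB L1 / 16 kB shared memory for this kernel;
    // cudaFuncCachePreferShared requests the opposite split.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
    myKernel<<<128, 256>>>(0);
    cudaDeviceSynchronize();
    return 0;
}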

21 Poor Performance from Poor Data Layout

__global__ void kernel(int *A) {
    int i = threadIdx.x + blockDim.x*blockIdx.x;
    A[1000*i] = ...
}

Very bad! Each thread accesses a location on a different cache line. Fermi line size is 32 integers or floats.

22 Taking Advantage of Cache

__global__ void kernel(int *A) {
    int i = threadIdx.x + blockDim.x*blockIdx.x;
    A[i] = ...
}

Good! Groups of 32 accesses by consecutive threads fall on the same cache line, and those threads will be in the same warp. Fermi line size is 32 integers or floats.

23 Warp
A "warp" in CUDA is a group of 32 threads that operate in SIMT mode.
A "half warp" (16 threads) actually executes simultaneously on current GPUs.
Using knowledge of warps and how the memory is laid out can improve code performance.
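For reference, a small sketch (not from these notes) of how a thread can compute which warp and lane it belongs to from its index within the block; device-side printf() requires compute capability 2.0 or later.

#include <stdio.h>

__global__ void whoAmI() {
    int warpId = threadIdx.x / 32;    // which warp within the block
    int lane   = threadIdx.x % 32;    // position within the warp
    if (lane == 0)
        printf("block %d, warp %d starts at thread %d\n",
               blockIdx.x, warpId, threadIdx.x);
}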

24 Memory Banks (diagram)
Consecutive locations are placed on successive memory banks: A[0] on bank 1, A[1] on bank 2, A[2] on bank 3, A[3] on bank 4, and so on.
The device can fetch A[0], A[1], A[2], A[3], ..., A[B-1] at the same time, where there are B banks.

25 Shared Memory Banks
Shared memory is divided into 16 or 32 banks of 32-bit width, which can be accessed simultaneously:
- Compute capability 1.x has 16 banks; accesses are processed per half warp.
- Compute capability 2.x has 32 banks; accesses are processed per warp.
To achieve maximum bandwidth, threads in a half warp should access different banks of shared memory.
Exception: all threads reading the same location results in a broadcast operation.
(coit-grid06's C2050, compute capability 2.0, has 32 banks.)
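A standard illustration of avoiding bank conflicts (a sketch, not from these notes, assuming a 32-bank compute capability 2.x device and a (TILE, TILE) thread block): in a shared-memory matrix transpose, padding the tile with one extra column keeps the column-wise reads in different banks.

#define TILE 32

__global__ void transpose(const float *in, float *out, int N) {
    // The +1 padding changes the row stride from 32 to 33 words, so the strided
    // reads below hit 32 different banks instead of a single bank 32 times.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * N + x];    // coalesced global read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;               // transposed block position
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * N + x] = tile[threadIdx.x][threadIdx.y];   // conflict-free thanks to padding
}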

26 Global memory banks
Global memory is also partitioned into banks, depending upon the version of the GPU.
200-series and 10-series NVIDIA GPUs have 8 partitions of 256 bytes wide.
C2050 has ??

27 Achieving the best data access patterns
Requires a lot of thought; considered in detail for specific problems later.
Generally:
- Padding data to make data aligned (a pitched-allocation sketch follows below)
For matrix operations:
- Tiling
- Pre-transpose operations
- Padding (adding columns/rows)
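One way the CUDA runtime helps with row padding (a minimal sketch; the array names are illustrative): cudaMallocPitch() allocates a 2-D array whose rows are padded to an aligned pitch, and cudaMemcpy2D() copies a host array into it.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main() {
    int width = 1000, height = 1000;        // in elements, not bytes
    float *devA;
    size_t pitch;                           // padded row length in bytes

    // Rows are padded so that each row starts on an aligned boundary.
    cudaMallocPitch((void**)&devA, &pitch, width * sizeof(float), height);
    printf("requested %zu bytes per row, got pitch %zu\n",
           width * sizeof(float), pitch);

    float *hostA = (float*)calloc(width * height, sizeof(float));
    cudaMemcpy2D(devA, pitch,                       // destination and its row pitch
                 hostA, width * sizeof(float),      // source and its row pitch
                 width * sizeof(float), height,     // width in bytes, number of rows
                 cudaMemcpyHostToDevice);

    // In a kernel, row r then starts at (float*)((char*)devA + r * pitch).
    cudaFree(devA);
    free(hostA);
    return 0;
}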

28 Memory Coalescing
Aligned memory accesses: threads can read 4, 8, or 16 bytes at a time from global memory, but only if the accesses are aligned. That is:
- A 4-byte read must start at address ...xxxxx00
- An 8-byte read must start at address ...xxxx000
- A 16-byte read must start at address ...xxx0000
Then the access is much faster (twice?).
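A small sketch of getting 16-byte reads (not from these notes): the built-in float4 type is 16-byte aligned, so each thread can load four floats in one transaction, provided the base pointer is 16-byte aligned (pointers returned by cudaMalloc() are).

__global__ void scale4(float4 *A, float factor, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {                      // n4 = number of float4 elements (n / 4)
        float4 v = A[i];               // one aligned 16-byte load
        v.x *= factor;  v.y *= factor;
        v.z *= factor;  v.w *= factor;
        A[i] = v;                      // one aligned 16-byte store
    }
}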

29 Ideally, try to arrange for threads to access different memory modules at the same time, and consecutive addresses.
A bad case would be:
- Thread 0 accesses A[0], A[1], ... A[15]
- Thread 1 accesses A[16], A[17], ... A[31]
- Thread 2 accesses A[32], A[33], ... A[47]
- ... etc.
A good case would be:
- Thread 0 accesses A[0], A[16], A[32], ...
- Thread 1 accesses A[1], A[17], A[33], ...
- Thread 2 accesses A[2], A[18], A[34], ...
- ... etc.
if there are 16 banks. Need to know that detail! In the good case, at each time step the threads together touch 16 consecutive addresses on 16 different banks.
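The two cases as code (a sketch with hypothetical kernels, assuming 16 threads and 16 banks):

#define NTHREADS 16                  // matches the 16 banks in the example above

__global__ void goodPattern(int *A, int n) {
    // Thread t touches A[t], A[t+16], A[t+32], ...: at each step the 16 threads
    // together access 16 consecutive addresses on 16 different banks.
    for (int j = threadIdx.x; j < n; j += NTHREADS)
        A[j] += 1;
}

__global__ void badPattern(int *A, int n) {
    // Thread t touches its own contiguous chunk A[t*chunk], A[t*chunk+1], ...:
    // at each step the threads access addresses a whole chunk apart.
    int chunk = n / NTHREADS;
    for (int j = 0; j < chunk; j++)
        A[threadIdx.x * chunk + j] += 1;
}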

30 Wikipedia "CUDA" page (compute capability table); coit-grid06's C2050.

Questions