Presentation transcript: "GPU Programming with CUDA" (© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign)

1 GPU Programming with CUDA
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign

2 GPU: A Massively Parallel Processor
A quiet revolution and potential build-up
–Calculation: 367 GFLOPS vs. 32 GFLOPS
–Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
–Until recently, programmed through the graphics API
–GPU in every PC and workstation – massive volume and potential impact
[Figure: GFLOPS over time for NVIDIA GPUs vs. CPUs. Legend: G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800]

3 What is GPGPU?
General-purpose computation using the GPU and graphics API in applications other than 3D graphics
–GPU accelerates the critical path of the application
Data-parallel algorithms leverage GPU attributes
–Large data arrays, streaming throughput
–Fine-grain SIMD parallelism
–Low-latency floating-point (FP) computation
Applications (see GPGPU.org)
–Game effects (FX), physics, image processing
–Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

4 Previous GPGPU Constraints
Dealing with the graphics API
–Working with the corner cases of the graphics API
Addressing modes
–Limited texture size/dimension
Shader capabilities
–Limited outputs
Instruction sets
–Lack of integer & bit ops
Communication limited
–Between pixels
–Scatter a[i] = p
[Figure: fragment-shader programming model. Input Registers, Fragment Program, Output Registers, Constants, Texture, Temp Registers; resources scoped per thread, per shader, per context; FB Memory]

5 CUDA
"Compute Unified Device Architecture"
General-purpose programming model
–User kicks off batches of threads on the GPU
–GPU = dedicated super-threaded, massively data-parallel co-processor
Targeted software stack
–Compute-oriented drivers, language, and tools
Driver for loading computation programs into the GPU
–Standalone driver, optimized for computation
–Interface designed for compute: graphics-free API
–Data sharing with OpenGL buffer objects
–Guaranteed maximum download & readback speeds
–Explicit GPU memory management

6 Parallel Computing on a GPU
8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications
–Available in laptops, desktops, and clusters (GeForce 8800, Tesla S870, Tesla D870)
GPU parallelism is doubling every year
Programming model scales transparently
Programmable in C with CUDA tools
Multithreaded SPMD model uses application data parallelism and thread parallelism

7 CUDA – C with no shader limitations!
Integrated host+device application C program
–Serial or modestly parallel parts in host C code
–Highly parallel parts in device SPMD kernel C code

Serial Code (host)
...
Parallel Kernel (device): KernelA<<<nBlk, nTid>>>(args);
Serial Code (host)
...
Parallel Kernel (device): KernelB<<<nBlk, nTid>>>(args);
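
A minimal sketch of this host+device pattern (the kernel names, launch sizes, and data array are illustrative placeholders, not from the deck):

    __global__ void KernelA(float *data) { /* highly parallel work */ }
    __global__ void KernelB(float *data) { /* more highly parallel work */ }

    int main(void) {
        float *d_data;
        cudaMalloc((void**)&d_data, 1024 * sizeof(float));   // device buffer used by the kernels

        // ... serial host code ...
        KernelA<<<256, 128>>>(d_data);   // parallel kernel on the device
        // ... more serial host code ...
        KernelB<<<256, 128>>>(d_data);   // another parallel kernel

        cudaFree(d_data);
        return 0;
    }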

8 CUDA Devices and Threads
A compute device
–Is a coprocessor to the CPU, or host
–Has its own DRAM (device memory)
–Runs many threads in parallel
–Is typically a GPU but can also be another type of parallel processing device
Data-parallel portions of an application are expressed as device kernels, which run on many threads
Differences between GPU and CPU threads
–GPU threads are extremely lightweight: very little creation overhead
–GPU needs 1000s of threads for full efficiency; a multi-core CPU needs only a few

9 G80 – Graphics Mode
The future of GPUs is programmable processing, so build the architecture around the processor.
[Figure: G80 graphics-mode block diagram. Host, Input Assembler, Vtx/Geom/Pixel Thread Issue, Setup/Rstr/ZCull, arrays of SPs with L1/TF, Thread Processor, L2 caches and FB (frame buffer) partitions]

10 G80 CUDA Mode – A Device Example
Processors execute computing threads
New operating mode/HW interface for computing

11 Extended C
Declspecs
–global, device, shared, local, constant
Keywords
–threadIdx, blockIdx
Intrinsics
–__syncthreads
Runtime API
–Memory, symbol, execution management
Function launch

    __device__ float filter[N];

    __global__ void convolve(float *image) {
        __shared__ float region[M];
        ...
        region[threadIdx.x] = image[i];
        __syncthreads();
        ...
        image[j] = result;
    }

    // Allocate GPU memory
    void *myimage;
    cudaMalloc(&myimage, bytes);

    // 100 blocks, 10 threads per block
    convolve<<<100, 10>>>(myimage);

12 [Figure: CUDA compilation flow. Integrated source (foo.cu), written in Extended C, goes through cudacc (EDG C/C++ frontend, Open64 Global Optimizer); the CPU host code (foo.cpp) is compiled by gcc / cl, while the GPU assembly (foo.s) is lowered by OCG to G80 SASS (foo.sass)]
Reference: Mark Murphy, "NVIDIA's Experience with Open64," www.capsl.udel.edu/conferences/open64/2008/Papers/101.doc

13 CUDA API Highlights: Easy and Lightweight
The API is an extension to the ANSI C programming language
–Low learning curve
The hardware is designed to enable a lightweight runtime and driver
–High performance

14 CUDA Thread Block
All threads in a block execute the same kernel program (SPMD)
Programmer declares block:
–Block size: 1 to 512 concurrent threads
–Block shape: 1D, 2D, or 3D
–Block dimensions in threads
Threads have thread id numbers within the block
–Thread program uses thread id to select work and address shared data
Threads in the same block share data and synchronize while doing their share of the work
Threads in different blocks cannot cooperate
–Each block can execute in any order relative to other blocks!
[Figure: a CUDA thread block with thread IDs 0, 1, 2, 3, ..., m running the same thread program. Courtesy: John Nickolls, NVIDIA]

15 Thread Blocks: Scalable Cooperation
Divide a monolithic thread array into multiple blocks
–Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization
–Threads in different blocks cannot cooperate
Each thread block, 0 through N - 1, runs the same per-thread code on its own slice of the data:

    ...
    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;
    ...
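
A sketch of how the threadID used in that snippet is typically derived from the built-in variables (the function and array names are assumptions for illustration):

    __global__ void apply_func(const float *input, float *output, int n) {
        // Global index: block offset plus the thread's offset within its block.
        int threadID = blockIdx.x * blockDim.x + threadIdx.x;
        if (threadID < n) {              // guard the last, possibly partial, block
            float x = input[threadID];
            float y = x * x;             // stands in for func(x)
            output[threadID] = y;
        }
    }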

16 Transparent Scalability
Hardware is free to assign blocks to any processor at any time
–A kernel scales across any number of parallel processors
Each block can execute in any order relative to other blocks.
[Figure: the same eight-block kernel grid mapped onto a small device that runs a few blocks at a time and a larger device that runs more blocks concurrently; time advances downward]

17 G80 Example: Executing Thread Blocks
Threads are assigned to Streaming Multiprocessors (SMs) at block granularity
–Up to 8 blocks per SM, as resources allow
–An SM in G80 can take up to 768 threads: could be 256 (threads/block) * 3 blocks, or 128 (threads/block) * 6 blocks, etc.
Threads run concurrently
–The SM maintains thread/block id #s
–The SM manages/schedules thread execution
[Figure: SM 0 and SM 1, each with SPs, shared memory, and an MT issue unit, holding several blocks of threads t0, t1, t2, ..., tm. Flexible resource allocation]

18 G80 Example: Thread Scheduling
Each block is executed as 32-thread warps
–An implementation decision, not part of the CUDA programming model
–Warps are the scheduling units in an SM
If 3 blocks are assigned to an SM and each block has 256 threads, how many warps are there in the SM?
–Each block is divided into 256/32 = 8 warps
–There are 8 * 3 = 24 warps
[Figure: a Streaming Multiprocessor with instruction L1, instruction fetch/dispatch, SPs and SFUs, and shared memory, holding warps (t0 t1 t2 ... t31) from Block 1 and Block 2]

19 G80 Example: Thread Scheduling
The SM implements zero-overhead warp scheduling
–At any time, only one of the warps is executed by the SM
–Warps whose next instruction has its operands ready for consumption are eligible for execution
–Eligible warps are selected for execution with a prioritized scheduling policy
–All threads in a warp execute the same instruction when selected

20 Block IDs and Thread IDs
Each thread uses IDs to decide what data to work on
–Block ID: 1D or 2D
–Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data
–Image processing
–Solving PDEs on volumes
–...
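
For instance, a 2D indexing sketch in the image-processing spirit of this slide (the kernel name, dimensions, and launch shape are illustrative assumptions):

    __global__ void scale_pixels(float *img, int width, int height, float s) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;   // 2D block ID and 2D thread ID
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        if (col < width && row < height)
            img[row * width + col] *= s;                   // row-major addressing
    }

    // Possible launch: a 2D grid of 2D blocks covering the image.
    // dim3 block(16, 16);
    // dim3 grid((width + 15) / 16, (height + 15) / 16);
    // scale_pixels<<<grid, block>>>(d_img, width, height, 2.0f);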

21 Terminology
Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads)
–The unit of parallelism in CUDA
Warp: a group of threads executed physically in parallel on G80
Block: a group of threads that are executed together and form the unit of resource assignment
Grid: a group of thread blocks that must all complete before the next kernel call of the program can take effect

22 Memories

23 CUDA Memory Model Overview
Global memory
–Main means of communicating R/W data between host and device
–Contents visible to all threads
–Long-latency access
We will focus on global memory for now
–Constant and texture memory will come later
[Figure: CUDA memory model. The host talks to per-grid global memory; each block has shared memory; each thread has registers]

24 CUDA Device Memory Allocation
cudaMalloc()
–Allocates an object in device global memory
–Requires two parameters: the address of a pointer to the allocated object, and the size of the allocated object
cudaFree()
–Frees an object from device global memory
–Takes the pointer to the freed object
[Figure: same memory-model diagram as the previous slide]

25 CUDA Device Memory Allocation (cont.)
Code example:
–Allocate a 64 * 64 single-precision float array
–Attach the allocated storage to Md
–"d" is often used to indicate a device data structure

    const int TILE_WIDTH = 64;
    float* Md;
    int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

    cudaMalloc((void**)&Md, size);
    cudaFree(Md);

26 CUDA Host-Device Data Transfer
cudaMemcpy()
–Memory data transfer
–Requires four parameters: pointer to destination, pointer to source, number of bytes copied, and type of transfer
Type of transfer is one of:
–Host to Host
–Host to Device
–Device to Host
–Device to Device
Asynchronous transfer
[Figure: same memory-model diagram as the previous slides]

27 CUDA Host-Device Data Transfer (cont.)
Code example:
–Transfer a 64 * 64 single-precision float array
–M is in host memory and Md is in device memory
–cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
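
Putting slides 24-27 together, a typical allocate/transfer/compute/free round trip looks roughly like this (M is assumed to be a host array of TILE_WIDTH * TILE_WIDTH floats; the kernel launch is only indicated):

    int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);
    float *Md;
    cudaMalloc((void**)&Md, size);                     // allocate device global memory
    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);   // host -> device
    // ... launch kernel(s) that read and write Md ...
    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);   // device -> host
    cudaFree(Md);                                      // release device global memory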

28 CUDA Function Declarations
                                       Executed on the:   Only callable from the:
    __device__ float DeviceFunc()      device             device
    __global__ void KernelFunc()       device             host
    __host__ float HostFunc()          host               host

__global__ defines a kernel function
–Must return void
__device__ and __host__ can be used together
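
Illustrative declarations for the three qualifiers in the table (the function names and bodies are made up for the example):

    __device__ float DeviceFunc(float x) {          // runs on the device, callable from device code
        return x * x;
    }

    __global__ void KernelFunc(float *data) {       // runs on the device, launched from the host
        data[threadIdx.x] = DeviceFunc(data[threadIdx.x]);
    }

    __host__ __device__ float Clamp01(float x) {    // __host__ and __device__ together: compiled for both
        return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
    }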

29 CUDA Function Declarations (cont.)
__device__ functions cannot have their address taken
For functions executed on the device:
–No recursion
–No static variable declarations inside the function
–No variable number of arguments

30 Calling a Kernel Function – Thread Creation
A kernel function must be called with an execution configuration:

    __global__ void KernelFunc(...);
    dim3 DimGrid(100, 50);        // 5000 thread blocks
    dim3 DimBlock(4, 8, 8);       // 256 threads per block
    size_t SharedMemBytes = 64;   // 64 bytes of shared memory
    KernelFunc<<<DimGrid, DimBlock, SharedMemBytes>>>(...);

Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking.
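
Because the launch is asynchronous, host code that needs the results must synchronize explicitly. A hedged sketch (cudaThreadSynchronize() is the call from the CUDA releases this deck targets; later toolkits renamed it cudaDeviceSynchronize()):

    KernelFunc<<<DimGrid, DimBlock, SharedMemBytes>>>(/* args */);
    cudaThreadSynchronize();   // block the host until the kernel has finished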

31 G80 Implementation of CUDA Memories
Each thread can:
–Read/write per-thread registers
–Read/write per-thread local memory
–Read/write per-block shared memory
–Read/write per-grid global memory
–Read-only per-grid constant memory
[Figure: memory-model diagram as before, with per-grid constant memory added alongside global memory]

32 CUDA Variable Type Qualifiers

    Variable declaration                        Memory     Scope    Lifetime
    __device__ __local__    int LocalVar;       local      thread   thread
    __device__ __shared__   int SharedVar;      shared     block    block
    __device__              int GlobalVar;      global     grid     application
    __device__ __constant__ int ConstantVar;    constant   grid     application

__device__ is optional when used with __local__, __shared__, or __constant__
Automatic variables without any qualifier reside in a register
–Except arrays, which reside in local memory
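
A sketch with one declaration per row of the table (all names are illustrative; the kernel assumes blockDim.x <= 256):

    __device__ __constant__ float coeffs[16];   // constant memory: grid scope, application lifetime
    __device__ float GlobalVar;                 // global memory:   grid scope, application lifetime

    __global__ void qualifier_demo(void) {
        __shared__ float tile[256];             // shared memory:   block scope, block lifetime
        int i = threadIdx.x;                    // unqualified scalar: lives in a register, per thread
        float local_buf[8] = {0.0f};            // unqualified array:  resides in local memory, per thread
        tile[i] = GlobalVar + coeffs[i % 16] + local_buf[i % 8];
    }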

33 Where to Declare Variables?
[Figure: decision chart. If the host can access the variable, declare it outside of any function as global or constant; otherwise declare it in the kernel as register (automatic), shared, or local]

34 Variable Type Restrictions
Pointers can only point to memory allocated or declared in global memory:
–Allocated in the host and passed to the kernel: __global__ void KernelFunc(float* ptr)
–Obtained as the address of a global variable: float* ptr = &GlobalVar;

35 A Common Programming Strategy
Global memory resides in device memory (DRAM), which is much slower to access than shared memory
So, a profitable way of performing computation on the device is to tile data to take advantage of fast shared memory:
–Partition data into subsets that fit into shared memory
–Handle each data subset with one thread block by:
  loading the subset from global memory to shared memory, using multiple threads to exploit memory-level parallelism;
  performing the computation on the subset from shared memory, where each thread can efficiently make multiple passes over any data element;
  copying results from shared memory back to global memory
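
A minimal tiling sketch following those steps, for a simple 1D smoothing pass (TILE, the kernel name, and the weights are assumptions, not from the deck; launch with blockDim.x == TILE):

    #define TILE 256

    __global__ void smooth(const float *in, float *out, int n) {
        __shared__ float tile[TILE];
        int gid = blockIdx.x * blockDim.x + threadIdx.x;

        // 1. Each thread loads one element of the block's subset into shared memory.
        tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
        __syncthreads();

        // 2. Compute from the fast shared copy; neighboring elements are reused by several threads.
        float left   = (threadIdx.x > 0)        ? tile[threadIdx.x - 1] : tile[threadIdx.x];
        float right  = (threadIdx.x < TILE - 1) ? tile[threadIdx.x + 1] : tile[threadIdx.x];
        float result = 0.25f * left + 0.5f * tile[threadIdx.x] + 0.25f * right;

        // 3. Copy results back to global memory.
        if (gid < n) out[gid] = result;
    }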

36 A Common Programming Strategy (cont.)
Constant memory also resides in device memory (DRAM), much slower to access than shared memory
–But... cached!
–Highly efficient access for read-only data
Carefully divide data according to access patterns
–Read-only: constant memory (very fast if in cache)
–Read/write, shared within a block: shared memory (very fast)
–Read/write within each thread: registers (very fast)
–Read/write inputs/results: global memory (very slow)
For texture memory usage, see the NVIDIA documentation.

37 GPU Atomic Integer Operations
Atomic operations on integers in global memory:
–Associative operations on signed/unsigned ints
–add, sub, min, max, ...
–and, or, xor
–increment, decrement
–exchange, compare-and-swap
Requires hardware with compute capability 1.1 and above.
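
For example, a global-memory histogram built on atomicAdd (requires compute capability 1.1+, as noted above; the names and bin handling are illustrative):

    __global__ void histogram(const unsigned int *values, int n,
                              unsigned int *bins, int num_bins) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[values[i] % num_bins], 1u);   // conflicting updates are serialized, not lost
    }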

38 SM Register File
Register File (RF)
–32 KB (8K entries) for each SM in G80
The TEX pipe can also read/write the RF
–2 SMs share 1 TEX
The Load/Store pipe can also read/write the RF
[Figure: SM datapath. I$ L1, multithreaded instruction buffer, register file (RF), C$ L1, shared memory, operand select, MAD and SFU units]

39 Programmer View of Register File
There are 8192 registers in each SM in G80
–This is an implementation decision, not part of CUDA
–Registers are dynamically partitioned across all blocks assigned to the SM
–Once assigned to a block, a register is NOT accessible by threads in other blocks
–Each thread in the same block can only access registers assigned to itself
[Figure: the register file partitioned across 4 blocks vs. 3 blocks]

40 Example
If each block has 16x16 threads and each thread uses 10 registers, how many threads can run on each SM?
–Each block requires 10 * 256 = 2560 registers
–8192 = 3 * 2560 + change
–So, three blocks can run on an SM as far as registers are concerned
How about if each thread increases its use of registers by 1?
–Each block now requires 11 * 256 = 2816 registers
–8192 < 2816 * 3
–Only two blocks can run on an SM: a 1/3 reduction of parallelism!

41 More on Dynamic Partitioning
Dynamic partitioning gives more flexibility to compilers/programmers
–One can run a smaller number of threads that require many registers each, or a large number of threads that require few registers each
–This allows for finer-grain threading than traditional CPU threading models
–The compiler can trade off between instruction-level parallelism and thread-level parallelism

42 ILP vs. TLP Example
Assume that a kernel has 256-thread blocks, 4 independent instructions for each global memory load in the thread program, and each thread uses 10 registers; global loads take 200 cycles
–3 blocks can run on each SM
If a compiler can use one more register to change the dependence pattern so that 8 independent instructions exist for each global memory load
–Only two blocks can run on each SM
–However, one only needs 200/(8*4) ≈ 7 warps to tolerate the memory latency
–Two blocks have 16 warps, so the performance can actually be higher!

43 Memory Coalescing
When accessing global memory, peak performance utilization occurs when all threads in a half-warp access contiguous memory locations.
[Figure: matrices Md and Nd of dimension WIDTH. The access pattern labeled "not coalesced" has each thread walking along a row of Md; the "coalesced" pattern has consecutive threads reading consecutive elements of Nd]
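
A sketch contrasting the two patterns (the kernels, Md, Nd, and the column/row sums are illustrative; width is the row length in elements of a square matrix):

    // Coalesced: at every step, the threads of a half-warp read consecutive addresses.
    __global__ void sum_columns(const float *Nd, float *out, int width) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;
        for (int row = 0; row < width; ++row)
            sum += Nd[row * width + col];    // thread k touches element ... + k
        out[col] = sum;
    }

    // Not coalesced: neighboring threads are width elements apart at every step.
    __global__ void sum_rows(const float *Md, float *out, int width) {
        int row = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;
        for (int col = 0; col < width; ++col)
            sum += Md[row * width + col];    // thread k starts at ... + k * width
        out[row] = sum;
    }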

44 Parallel Memory Architecture
In a parallel machine, many threads access memory
–Therefore, memory is divided into banks
–Essential to achieve high bandwidth
Each bank can service one address per cycle
–A memory can service as many simultaneous accesses as it has banks
Multiple simultaneous accesses to a bank result in a bank conflict
–Conflicting accesses are serialized
[Figure: shared memory divided into banks 0 through 15]

45 Bank Addressing Examples
No bank conflicts
–Linear addressing, stride == 1
No bank conflicts
–Random 1:1 permutation
[Figure: threads 0-15 mapped to banks 0-15, one thread per bank, in both cases]

46 Bank Addressing Examples
2-way bank conflicts
–Linear addressing, stride == 2
8-way bank conflicts
–Linear addressing, stride == 8
[Figure: with stride 2, threads 0-15 map onto only the even banks, two threads per bank; with stride 8, threads pile onto banks 0 and 8, eight threads per bank]

47 How Addresses Map to Banks on G80
Each bank has a bandwidth of 32 bits per clock cycle
Successive 32-bit words are assigned to successive banks
G80 has 16 banks
–So bank = address % 16
–Same as the size of a half-warp
No bank conflicts between different half-warps, only within a single half-warp

48 Shared Memory Bank Conflicts
Shared memory is as fast as registers if there are no bank conflicts
The fast case:
–If all threads of a half-warp access different banks, there is no bank conflict
–If all threads of a half-warp access the identical address, there is no bank conflict (broadcast)
The slow case:
–Bank conflict: multiple threads in the same half-warp access the same bank
–Must serialize the accesses
–Cost = max # of simultaneous accesses to a single bank

49 Linear Addressing
Given:

    __shared__ float shared[256];
    float foo = shared[baseIndex + s * threadIdx.x];

This is only bank-conflict-free if s shares no common factors with the number of banks
–16 on G80, so s must be odd
[Figure: bank mapping of threads 0-15 for s=1 and s=3, both conflict-free]
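
A small host-side helper, not from the deck, that applies the bank = address % 16 rule from slide 47 to show which bank each thread of a half-warp hits for a given stride s:

    #include <stdio.h>

    /* Banks touched by threads 0..15 reading shared[baseIndex + s * threadIdx.x],
       counting addresses in 32-bit words on a 16-bank, G80-style shared memory. */
    void print_banks(int baseIndex, int s) {
        for (int tid = 0; tid < 16; ++tid)
            printf("thread %2d -> bank %2d\n", tid, (baseIndex + s * tid) % 16);
    }

    /* print_banks(0, 1): 16 distinct banks, conflict-free.
       print_banks(0, 2): threads pair up on 8 banks, 2-way conflicts.
       print_banks(0, 3): odd stride, again conflict-free. */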

