
CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming


1 CIS 6930: Chip Multiprocessor: Parallel Architecture and Programming
Fall 2010. Jih-Kwon Peir, Computer & Information Science & Engineering, University of Florida

2 Supplement 3: Memory Hardware

3 CUDA Device Memory Space: Review
Each thread can: read/write per-thread registers; read/write per-thread local memory; read/write per-block shared memory; read/write per-grid global memory; read (only) per-grid constant memory; read (only) per-grid texture memory. [Figure: device memory hierarchy with per-thread registers and local memory, per-block shared memory, and per-grid global, constant, and texture memory, plus the host connected to the device.] Global, constant, and texture memory spaces are persistent across kernels called by the same application. The host can read/write global, constant, and texture memory using copy functions (e.g., cudaMemcpy).
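As a quick illustration of how these spaces appear in CUDA source code, here is a minimal sketch (kernel and variable names are illustrative, not from the slides): constant memory is declared with __constant__ and filled by the host, shared memory with __shared__ inside the kernel, global memory is allocated with cudaMalloc and copied with cudaMemcpy, and ordinary automatic variables live in per-thread registers.

```cuda
#include <cuda_runtime.h>

// Per-grid constant memory: read-only in kernels, written by the host.
__constant__ float coeff[16];

__global__ void scale(const float *in, float *out, int n)
{
    __shared__ float tile[256];                        // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;     // automatic vars -> registers

    if (i < n) tile[threadIdx.x] = in[i];              // global -> shared
    __syncthreads();
    if (i < n) out[i] = tile[threadIdx.x] * coeff[0];  // shared, constant -> global
}

int main()
{
    const int n = 1024;
    float h_in[n], h_coeff[16] = {2.0f};
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));             // per-grid global memory
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));  // host writes constant memory

    scale<<<n / 256, 256>>>(d_in, d_out, n);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```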

4 Parallel Memory Sharing
Local memory: per-thread. Private to each thread; holds automatic variables and register spills. Shared memory: per-block. Shared by the threads of the same block; used for inter-thread communication. Global memory: per-application. Shared by all threads; used for inter-grid communication. [Figure: per-thread local memory, per-block shared memory, and global memory shared by sequential grids in time (Grid 0, Grid 1, ...).]

5 Graphics pipeline

6 G80 – Graphics Mode The future of GPUs is programmable processing
So, build the architecture around the processor. [Figure: G80 graphics-mode block diagram with host, input assembler, vertex/geometry/pixel thread issue, setup/raster/ZCull, SP arrays with L1 and TF, thread processor, L2, and frame buffer (FB).]

7 G80 CUDA mode – A Device Example
Processors execute computing threads. A new operating mode/HW interface for computing. [Figure: G80 CUDA-mode block diagram with host, input assembler, thread execution manager, SP arrays with parallel data caches and texture units, load/store paths, and global memory.]

8 SM Memory Architecture
[Figure: two SMs, each with an MT issue unit, SPs, and shared memory, running threads t0..tm from several blocks; the SMs share a texture unit (TF: texture filter) with an L1, backed by L2 and memory. Courtesy: John Nickolls, NVIDIA.] Threads in a block share data and results in global memory and shared memory, and synchronize at barrier instructions. Per-block shared memory allocation keeps data close to the processor and minimizes trips to global memory. Shared memory is dynamically allocated to blocks and is one of the limiting resources.

9 SM Register File
The register file (RF) is 32 KB (8K 32-bit entries) per SM in G80. The TEX pipe can also read/write the RF (2 SMs share 1 TEX unit), and the load/store pipe can also read/write the RF. [Figure: SM pipeline with L1 instruction cache, multithreaded instruction buffer, constant cache, shared memory, register file, operand select, MAD and SFU units.]

10 Programmer View of Register File
There are 8192 registers in each SM in G80; this is an implementation decision, not part of CUDA. Registers are dynamically partitioned across all blocks assigned to the SM. Once assigned to a block, a register is NOT accessible by threads in other blocks (private). Each thread in the same block can only access registers assigned to itself (private). [Figure: the register file partitioned among 3 blocks vs. 4 blocks.]

11 Matrix Multiplication Example
If each block has 16x16 threads and each thread uses 10 registers, how many threads can run on each SM? Each block requires 10*256 = 2560 registers, and 3*2560 = 7680 < 8192, so three blocks can run on an SM as far as registers are concerned. What if each thread uses one more register? Each block now requires 11*256 = 2816 registers, and 3*2816 = 8448 > 8192, so only two blocks can run on an SM: a 1/3 reduction of parallelism! Again, this is the programmer's responsibility!
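The register arithmetic above can be packaged in a tiny host-side helper; this is just an illustrative sketch (the function and the 8192-register constant for G80 are ours, not a CUDA API):

```cuda
#include <stdio.h>

// Illustrative helper: how many blocks fit on one SM when only the register
// constraint is considered (G80: 8192 registers per SM).
static int blocks_per_sm_by_registers(int threads_per_block,
                                      int regs_per_thread,
                                      int regs_per_sm)
{
    int regs_per_block = threads_per_block * regs_per_thread;
    return regs_per_sm / regs_per_block;   // integer division = whole blocks
}

int main(void)
{
    printf("10 regs/thread: %d blocks per SM\n",
           blocks_per_sm_by_registers(16 * 16, 10, 8192));  // prints 3
    printf("11 regs/thread: %d blocks per SM\n",
           blocks_per_sm_by_registers(16 * 16, 11, 8192));  // prints 2
    return 0;
}
```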

12 More on Dynamic Partitioning
Dynamic partitioning gives more flexibility to compilers/programmers: one can run a smaller number of threads that each require many registers, or a larger number of threads that each require few registers. This allows finer-grained threading than traditional CPU threading models; the compiler can trade off between instruction-level parallelism and thread-level parallelism. Note that not only registers but also shared memory is partitioned this way, and both affect how thread blocks (and their warps) can be scheduled.

13 ILP vs. TLP Example
Assume a kernel has 256-thread blocks, 4 independent instructions follow each global memory load in the thread program, each thread uses 10 registers, and global loads take 200 cycles; then 3 blocks can run on each SM. If a compiler can use one more register to change the dependence pattern so that 8 independent instructions follow each global memory load, only two blocks can run on each SM (11*256 = 2816 registers per block). However, since each warp instruction occupies the SM for 4 cycles, only 200/(8*4) = 7 warps (rounded up) are now needed to tolerate the memory latency; two blocks provide 16 warps, so the performance can actually be higher! With 10 registers and 4 independent instructions, 200/(4*4) = 13 warps (rounded up) are needed, but memory accesses are issued twice as often.
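As a rough source-level sketch of "more independent instructions per load" (illustrative only; the actual instruction scheduling is done by the compiler and hardware), two independent accumulator chains let the loads and multiply-adds of one chain overlap the latency of the other:

```cuda
// Illustrative only: two independent accumulator chains per thread give the
// hardware more independent instructions between global loads (more ILP),
// so fewer resident warps are needed to hide the ~200-cycle load latency.
// 'partial' is assumed to have one entry per launched thread.
__global__ void partial_dot(const float *a, const float *b,
                            float *partial, int n)
{
    int i      = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    float sum0 = 0.0f, sum1 = 0.0f;            // independent chains
    int k = i;
    for (; k + stride < n; k += 2 * stride) {
        float x0 = a[k] * b[k];                // these loads can overlap
        float x1 = a[k + stride] * b[k + stride];
        sum0 += x0;                            // independent of sum1
        sum1 += x1;
    }
    if (k < n) sum0 += a[k] * b[k];            // leftover element, if any

    partial[i] = sum0 + sum1;                  // one partial sum per thread
}
```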

14 Memory Layout of a Matrix in C

15 Memory Coalescing
When accessing global memory, peak performance utilization occurs when all threads in a half-warp access consecutive memory locations. [Figure: access patterns over matrices Md and Nd (width WIDTH): when the simultaneous accesses of Thread 1, Thread 2, ... are WIDTH apart the pattern is not coalesced; when they fall on adjacent locations the pattern is coalesced.]

16 Memory Layout of a Matrix in C
[Figure: row-major layout of matrix M in memory: M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 ... M3,3. The access direction in the kernel code is such that in time period 1 threads T1..T4 access M0,0, M1,0, M2,0, M3,0, and in time period 2 they access M0,1, M1,1, M2,1, M3,1. Simultaneous accesses from the threads fall on adjacent memory locations: coalesced!]

17 Memory Layout of a Matrix in C
[Figure: the same row-major layout of M, but the access direction in the kernel code is along each thread's own row: in time period 1 threads T1..T4 access M0,0, M0,1, M0,2, M0,3, and in time period 2 they access M1,0, M1,1, M1,2, M1,3. Simultaneous accesses from the threads are a full row (WIDTH elements) apart: not coalesced.]
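The two figures correspond to the following kinds of kernel code (illustrative kernels, assuming a WIDTH x WIDTH row-major matrix and one thread per column or row): the first loop makes consecutive threads of a half-warp read consecutive addresses at every iteration (coalesced), while the second makes each thread stride through memory WIDTH elements at a time (not coalesced).

```cuda
#define WIDTH 1024

// Coalesced: at each iteration, thread tx of a half-warp reads
// M[y * WIDTH + tx], so simultaneous accesses land on adjacent addresses.
// (Launch with a total of WIDTH threads; M is WIDTH x WIDTH, row-major.)
__global__ void read_coalesced(const float *M, float *out)
{
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int y = 0; y < WIDTH; ++y)
        sum += M[y * WIDTH + tx];       // each thread walks down a column
    out[tx] = sum;
}

// Not coalesced: thread tx reads M[tx * WIDTH + x], so simultaneous accesses
// from a half-warp are WIDTH elements apart.
__global__ void read_uncoalesced(const float *M, float *out)
{
    int tx = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int x = 0; x < WIDTH; ++x)
        sum += M[tx * WIDTH + x];       // each thread walks along its own row
    out[tx] = sum;
}
```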

18 Use Shared Memory to Improve Coalescing
[Figure: original access pattern on Md and Nd (width WIDTH) vs. tiled access pattern. A coalesced move copies each tile into scratchpad (shared) memory; the multiplication is then performed with the scratchpad values.]

19 Using Shared Memory
Load data from global memory to shared memory. Synchronize all threads in the block. Use the data in shared memory for the parallel computation. Synchronize all threads in the block again, if needed, before the results held in shared memory are overwritten or written back. Write the results back to global memory.
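These steps are exactly the tiled matrix-multiply pattern; a condensed sketch is shown below (assuming the matrix width is a multiple of TILE_WIDTH; the kernel name is illustrative):

```cuda
#define TILE_WIDTH 16

__global__ void matmul_tiled(const float *Md, const float *Nd,
                             float *Pd, int width)
{
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_WIDTH + ty;
    int col = blockIdx.x * TILE_WIDTH + tx;
    float Pvalue = 0.0f;

    for (int m = 0; m < width / TILE_WIDTH; ++m) {
        // 1. Load one tile of Md and Nd from global to shared memory
        //    (coalesced: consecutive tx read consecutive addresses).
        Ms[ty][tx] = Md[row * width + (m * TILE_WIDTH + tx)];
        Ns[ty][tx] = Nd[(m * TILE_WIDTH + ty) * width + col];
        __syncthreads();                 // 2. wait until all loads are done

        // 3. Compute with the shared-memory tiles.
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Ms[ty][k] * Ns[k][tx];
        __syncthreads();                 // 4. before the tiles are overwritten
    }
    Pd[row * width + col] = Pvalue;      // 5. write the result to global memory
}
```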

20 Global Memory Access
The GPU device is capable of reading 4-, 8-, and 16-byte words from global memory in a single instruction. __align__(8) aligns data on an 8-byte boundary; float2 and float4 have built-in 8- and 16-byte alignment. Simultaneous memory accesses by the threads of a half-warp can be coalesced into a single memory transaction of 32, 64, or 128 bytes. The coalescing requirements are more rigid on compute capability 1.0 and 1.1: the half-warp must access 4-byte words in one 64-byte transaction, 8-byte words in one 128-byte transaction, or 16-byte words in two 128-byte transactions; all 16 words must lie in the same segment whose size equals the memory transaction size; and threads must access the words in sequence, i.e., the k-th thread must access the k-th word.
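A small sketch of the alignment point (struct and kernel names are illustrative): a 12-byte three-float struct is padded and aligned with __align__(16) so each thread's access can be a single 16-byte read, and float4 gives the same guarantee as a built-in type:

```cuda
// A 12-byte struct padded and aligned to 16 bytes so that a whole-struct
// access can be performed with a single 16-byte global memory read.
struct __align__(16) Point3 {
    float x, y, z;
};

__global__ void sum_points(const Point3 *pts, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        Point3 p = pts[i];             // one aligned 16-byte access
        out[i] = p.x + p.y + p.z;
    }
}

__global__ void sum_float4(const float4 *v, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 p = v[i];               // built-in type with 16-byte alignment
        out[i] = p.x + p.y + p.z + p.w;
    }
}
```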

21 Coalesced Global Memory Access
Left: coalesced memory accesses, resulting in one transaction. Right: coalesced memory accesses resulting in one transaction even though the warp is divergent (some threads do not participate).

22 Non-Coalesced Global Memory Access (capability 1.0, 1.1)
These access patterns all require 16 separate memory transactions!

23 Global Memory Access Coalescing with capability 1.2 or higher
Coalescing occurs when the words accessed by all threads lie in the same segment, whose size is: 32B if all threads access 1-byte words; 64B if all threads access 2-byte words; 128B if all threads access 4-byte or 8-byte words. Any pattern of addresses requested by the half-warp is allowed, including patterns where multiple threads access the same address; there is no in-sequence requirement. If the half-warp accesses n different segments, n memory transactions are issued, even when n > 1. Unused words in a transaction are wasted, so it is better to keep accesses aligned within smaller segments (smaller memory transaction sizes) when possible.

24 Coalesced Global Memory Access (capability 1.2 or higher)
Left: random coalesced memory accesses in one transaction. Center: misaligned coalesced memory accesses in one transaction. Right: misaligned coalesced memory accesses in two transactions.

25 Constants
Immediate address constants and indexed address constants. Constants are stored in DRAM and cached on chip (one constant cache per SM). A constant value can be broadcast to all threads in a warp; this is an extremely efficient way of accessing a value that is common to all threads in a block. [Figure: SM pipeline with L1 instruction cache, multithreaded instruction buffer, constant cache, shared memory, register file, operand select, MAD and SFU units.]
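A minimal sketch of constant memory in use (names are illustrative): the host fills the table with cudaMemcpyToSymbol, and because every thread of a warp reads the same filter element at the same time, the cached value is broadcast to the whole warp.

```cuda
#include <cuda_runtime.h>

__constant__ float filter[5];      // per-grid constant memory, cached per SM

__global__ void apply_filter(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= 2 && i < n - 2) {
        float v = 0.0f;
        for (int k = -2; k <= 2; ++k)
            v += filter[k + 2] * in[i + k];   // filter[k+2] is broadcast to the warp
        out[i] = v;
    }
}

// Host side: fill the constant table before launching the kernel.
void setup_filter(void)
{
    float h_filter[5] = {0.1f, 0.2f, 0.4f, 0.2f, 0.1f};
    cudaMemcpyToSymbol(filter, h_filter, sizeof(h_filter));
}
```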

26 Shared Memory
CUDA uses shared memory as shared storage visible to all threads in a block, with fast read and write access. Each SM has 16 KB of shared memory, organized as 16 banks of 32-bit words (i.e., word interleaved). Each bank has a bandwidth of 32 bits per 2 clock cycles. A shared memory request for a warp is split into two requests, one for the first half-warp and one for the second, so there are no bank conflicts between threads of the first and second halves. Shared memory is not used explicitly by pixel shader programs ("we dislike pixels talking to each other"). [Figure: SM pipeline with L1 instruction cache, multithreaded instruction buffer, constant cache, shared memory, register file, operand select, MAD and SFU units.]

27 Parallel Memory Architecture
In a parallel machine, many threads access memory; therefore, memory is divided into banks, which is essential to achieve high bandwidth. Each bank can service one address per cycle, so a memory can service as many simultaneous accesses as it has banks. Multiple simultaneous accesses to a bank result in a bank conflict, and conflicting accesses are serialized. [Figure: banks 0 through 15.]

28 Bank Addressing Examples
Left: no bank conflicts with linear addressing, stride == 1 (thread i maps to bank i). Right: no bank conflicts with a random 1:1 permutation of threads to banks. [Figure: threads 0..15 each mapped to a distinct bank 0..15.]

29 Bank Addressing Examples
Left: 2-way bank conflicts with linear addressing, stride == 2 (threads i and i+8 of a half-warp map to the same even-numbered bank). Right: 8-way bank conflicts with linear addressing, stride == 8 (the 16 threads map to only banks 0 and 8). [Figure: thread-to-bank mappings for the two strides.]

30 How addresses map to banks on G80
Each bank has a bandwidth of 32 bits per clock cycle. Successive 32-bit words are assigned to successive banks. G80 has 16 banks, so bank = (32-bit word address) % 16; this is the same as the size of a half-warp. Memory access scheduling is half-warp based: there are no bank conflicts between different half-warps, only within a single half-warp.

31 Shared memory bank conflicts
Shared memory is as fast as registers if there are no bank conflicts. The fast cases: if all threads of a half-warp access different banks, there is no bank conflict; if all threads of a half-warp access the identical address, there is no bank conflict (broadcast). The slow case: a bank conflict occurs when multiple threads in the same half-warp access the same bank; the accesses must be serialized, and the cost is the maximum number of simultaneous accesses to a single bank.

32 Linear Addressing
Given: __shared__ float shared[256]; float foo = shared[baseIndex + s * threadIdx.x]; This is bank-conflict-free only if s shares no common factors with the number of banks (16 on G80), so s must be odd. [Figure: thread-to-bank mappings for strides such as s = 3, which are conflict-free.] Note that conflict-free accesses do not need to be stride-1.
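A short sketch contrasting an odd (conflict-free) and an even (conflicting) stride into a shared array, assuming 16 banks as on G80 (kernel name is illustrative):

```cuda
// Illustrative only: contrasts a conflict-free (odd) stride with a
// conflicting (even) stride into a shared array, with 16 banks as on G80.
// Launch with 256 threads per block.
__global__ void stride_demo(float *out)
{
    __shared__ float shared[512];

    int tid = threadIdx.x;
    shared[tid]       = (float)tid;     // fill the buffer
    shared[tid + 256] = (float)tid;
    __syncthreads();

    // s = 1 (odd): thread i of a half-warp hits bank i % 16 -> no conflicts.
    float a = shared[tid];

    // s = 2 (even): threads i and i+8 of the same half-warp hit the same
    // bank -> 2-way bank conflicts, and the two accesses are serialized.
    float b = shared[(2 * tid) % 512];

    out[blockIdx.x * blockDim.x + tid] = a + b;
}
```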

33 Differences between Global and Shared Memory
Global memory: use coalesced reads (see also the programming guide). The global memory interface is likely 64 bytes wide, so it can coalesce 16 x 32-bit words from a half-warp; alternatively, read through a texture with good spatial locality within each warp. Shared memory: use the multiple (16) banks for independent accesses from a half-warp. There are many interesting discussions of these memory issues on the CUDA forum.

34 Registers, ILP and Instruction Mix

35 Programmer View of Register File
There are 8192 registers in each SM in G80; this is an implementation decision, not part of CUDA. Registers are dynamically partitioned across all blocks assigned to the SM. Once assigned to a block, a register is NOT accessible by threads in other blocks. Each thread in the same block can only access registers assigned to itself. [Figure: the register file partitioned among 3 blocks vs. 4 blocks.]

36 Matrix Multiplication Example
If each block has 16x16 threads and each thread uses 10 registers, how many threads can run on each SM? Each block requires 10*256 = 2560 registers, and 3*2560 = 7680 < 8192, so three blocks can run on an SM as far as registers are concerned. What if each thread uses one more register? Each block now requires 11*256 = 2816 registers, and 3*2816 = 8448 > 8192, so only two blocks can run on an SM: a 1/3 reduction of thread-level parallelism (TLP)!

37 More on Dynamic Partitioning
Dynamic partitioning of SM resources gives more flexibility to compilers/programmers: one can run a smaller number of threads that each require many registers, or a larger number of threads that each require few registers. This allows finer-grained threading than traditional CPU threading models; the compiler can trade off between instruction-level parallelism and thread-level parallelism.

38 ILP vs. TLP Example
Assume a kernel has 256-thread blocks, 4 independent instructions follow each global memory load in the thread program, each thread uses 10 registers, and global loads take 200 cycles; then 3 blocks can run on each SM. If a compiler can use one more register to change the dependence pattern so that 8 independent instructions follow each global memory load, only two blocks can run on each SM. However, only 200/(8*4) = 7 warps (rounded up) are now needed to tolerate the memory latency, and two blocks provide 16 warps, so the performance can actually be higher!

39 Resource Allocation Example
Focus on the area within the SPs first; the area of the core computation is what we want to maximize for performance. This shows the unpredictable nature of optimization: an optimization is applied that increases SP utilization and code efficiency per block, but at the cost of register usage, kicking one block off the SM. The result is an increase in per-thread performance but fewer threads, and lower overall performance in this case.

40 Prefetching
One could double-buffer the computation, getting a better instruction mix within each thread. This is classic software pipelining from ILP compilers.
Without prefetching:
Loop {
  Load current tile to shared memory
  syncthreads()
  Compute current tile
}
With prefetching:
Load first tile from global memory into registers
Loop {
  Deposit current tile to shared memory
  syncthreads()
  Load next tile from global memory into registers
  Compute current tile
}
Note that the global-to-shared-memory move is now separated into two parts: a move from global memory into a register, followed by a deposit from the register into shared memory. This increases the register requirement.
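Applied to the tiled matrix multiply, the prefetch version might look like the sketch below (illustrative; assumes the width is a multiple of TILE_WIDTH and uses the extra registers m_reg and n_reg for the prefetched tile):

```cuda
#define TILE_WIDTH 16

__global__ void matmul_prefetch(const float *Md, const float *Nd,
                                float *Pd, int width)
{
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE_WIDTH + ty;
    int col = blockIdx.x * TILE_WIDTH + tx;
    float Pvalue = 0.0f;

    // Prefetch the first tile from global memory into registers.
    float m_reg = Md[row * width + tx];
    float n_reg = Nd[ty * width + col];

    for (int m = 0; m < width / TILE_WIDTH; ++m) {
        // Deposit the prefetched tile from registers into shared memory.
        Ms[ty][tx] = m_reg;
        Ns[ty][tx] = n_reg;
        __syncthreads();

        // Prefetch the next tile while the current one is being used
        // (the extra registers are the cost of prefetching).
        if (m + 1 < width / TILE_WIDTH) {
            m_reg = Md[row * width + (m + 1) * TILE_WIDTH + tx];
            n_reg = Nd[((m + 1) * TILE_WIDTH + ty) * width + col];
        }

        // Compute with the current tile in shared memory.
        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Ms[ty][k] * Ns[k][tx];
        __syncthreads();
    }
    Pd[row * width + col] = Pvalue;
}
```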

41 Prefetch
[Figure: tiled multiplication of Md and Nd (width WIDTH) producing Pd, with block indices bx, by, thread indices tx, ty, tile size TILE_WIDTH, and the TILE_WIDTH x TILE_WIDTH sub-block Pdsub.] Deposit the blue tile from registers into shared memory. Syncthreads. Load the orange tile into registers. Compute with the blue tile. Deposit the orange tile into shared memory. ...

42 Instruction Mix Considerations
for (int k = 0; k < BLOCK_SIZE; ++k)
    Pvalue += Ms[ty][k] * Ns[k][tx];
There are very few mul/add operations between the branch instructions and address calculations. Loop unrolling can help:
Pvalue += Ms[ty][k] * Ns[k][tx] + ... + Ms[ty][k+15] * Ns[k+15][tx];
Beware that any local arrays used after unrolling will be dumped into local memory.
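The same effect can also be requested from the compiler with the #pragma unroll directive; a small self-contained sketch of a single-tile product (names are illustrative):

```cuda
#define BLOCK_SIZE 16

// Single-tile product: each 16x16 block multiplies one 16x16 tile of M by
// one 16x16 tile of N.  #pragma unroll removes the loop branch and makes
// the k indices compile-time constants, improving the instruction mix.
__global__ void tile_product(const float *M, const float *N, float *P)
{
    __shared__ float Ms[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Ns[BLOCK_SIZE][BLOCK_SIZE];

    int tx = threadIdx.x, ty = threadIdx.y;
    Ms[ty][tx] = M[ty * BLOCK_SIZE + tx];
    Ns[ty][tx] = N[ty * BLOCK_SIZE + tx];
    __syncthreads();

    float Pvalue = 0.0f;
    #pragma unroll
    for (int k = 0; k < BLOCK_SIZE; ++k)
        Pvalue += Ms[ty][k] * Ns[k][tx];

    P[ty * BLOCK_SIZE + tx] = Pvalue;
}
```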

43 Unrolling
Removal of branch instructions and address calculations. Memory bandwidth still isn't the issue here. Does unrolling use more registers?

44 Major G80 Performance Detractors
Long-latency operations: avoid stalls by executing other threads. Stalls and bubbles in the pipeline: barrier synchronization; branch divergence (will be covered next). Shared resource saturation: global memory bandwidth; local memory capacity.

