
Ceng 545 Performance Considerations

Memory Coalescing
High Priority: Ensure global memory accesses are coalesced whenever possible. Off-chip memory is accessed in chunks:
– Even if you read only a single word, a whole chunk is transferred.
– If you don't use the whole chunk, bandwidth is wasted.
Chunks are aligned to multiples of 32/64/128 bytes, so unaligned accesses cost more.

Global memory loads and stores by threads of a half warp (for devices of compute capability 1.x) or of a warp (for devices of compute capability 2.x) are coalesced by the device into as few as one transaction when certain access requirements are met.

To understand these access requirements, global memory should be viewed in terms of aligned segments of 16 and 32 words.

Coalescing algorithm
Find the memory segment that contains the address requested by the lowest-numbered active thread:
– 32B segment for 8-bit data
– 64B segment for 16-bit data
– 128B segment for 32-, 64- and 128-bit data
Find all other active threads whose requested address lies in the same segment.
Reduce the transaction size, if possible:
– If size == 128B and only the lower or upper half is used, reduce the transaction to 64B.
– If size == 64B and only the lower or upper half is used, reduce the transaction to 32B.
Carry out the transaction and mark the serviced threads as inactive.
Repeat until all threads in the half-warp are serviced.
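As a worked trace of this algorithm (a sketch; the kernel name and the starting offset are illustrative assumptions), consider a half warp reading 16 consecutive floats starting 96 bytes into a 256-byte-aligned array:

// Assumed for illustration: idata is 256-byte aligned and the half
// warp reads idata[24..39], i.e. bytes 96..159.
// Pass 1: thread 0 requests byte 96 -> 128B segment [0,128).
//   Threads 0-7 (bytes 96..127) fall in the same segment; only the
//   upper half is used, so the transaction is reduced 128B -> 64B,
//   and again only the upper half is used, so 64B -> 32B.
// Pass 2: thread 8 requests byte 128 -> 128B segment [128,256).
//   Threads 8-15 (bytes 128..159) fall in it; reduced to one 32B read.
// Result: two 32B transactions, versus one 64B transaction had the
// accesses been aligned.
__global__ void misalignedRead(float *odata, const float *idata)
{
    int tid = threadIdx.x;        // 0..15 within a half warp
    odata[tid] = idata[tid + 24];
}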

A Simple Access Pattern The first and simplest case: the k-th thread accesses the k-th 32-bit word in an aligned segment, and the half warp's requests coalesce into a single transaction (not all threads need to participate).
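A minimal sketch of this pattern (the kernel name coalescedCopy is illustrative): consecutive threads touch consecutive words, so each half warp maps onto a single aligned segment.

__global__ void coalescedCopy(float *odata, const float *idata)
{
    // Thread k reads and writes word k: one aligned segment per half
    // warp, hence one transaction.
    int xid = blockIdx.x * blockDim.x + threadIdx.x;
    odata[xid] = idata[xid];
}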

A Sequential but Misaligned Access Pattern If the addresses requested by the half warp all fall within a single 128-byte segment, a single 128-byte transaction is performed.

Otherwise, if the sequential addresses straddle a 128-byte segment boundary, one 64-byte transaction and one 32-byte transaction result.
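This effect is usually demonstrated with an offset-copy kernel along the following lines (a sketch; a small non-zero offset shifts every half warp off the segment boundaries):

__global__ void offsetCopy(float *odata, float *idata, int offset)
{
    // With offset == 0 the accesses are segment-aligned: one
    // transaction per half warp. With a small non-zero offset each
    // half warp straddles a segment boundary and needs two.
    int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;
    odata[xid] = idata[xid];
}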

Memory allocated through the runtime API, such as via cudaMalloc(), is guaranteed to be aligned to at least 256 bytes. Therefore, choosing sensible thread block sizes, such as multiples of 16, facilitates memory accesses by half warps that are aligned to segments. __align__(8) and __align__(16) can be used when defining structures to ensure alignment to segments.
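For instance (a sketch; the struct name is an assumption), forcing 16-byte alignment lets each 12-byte element be fetched with a single aligned access instead of spanning two segments:

// __align__(16) starts every element on a 16-byte boundary (the
// compiler pads the 12-byte payload to 16 bytes), so a thread's read
// of one element never straddles a segment boundary.
struct __align__(16) Float3Aligned
{
    float x, y, z;
};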

Strided Accesses
__global__ void strideCopy(float *odata, float *idata, int stride)
{
    // Neighboring threads access elements stride words apart.
    int xid = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    odata[xid] = idata[xid];
}

As the stride increases, the effective bandwidth decreases, until at the extreme 16 separate transactions are issued for the 16 threads in a half warp.

Memory Coalescing A structure of arrays is often better than an array of structures, as sketched below:
– Very clear win for regular, stride-1 access patterns
– Unpredictable or irregular access patterns are case-by-case
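A minimal sketch of the two layouts (Point and the kernel names are illustrative): with an array of structures, consecutive threads read words three elements apart; with a structure of arrays, they read consecutive words.

// Array of structures: thread k reads aos[k].x, a stride-3 pattern.
struct Point { float x, y, z; };

__global__ void readAoS(float *out, const Point *aos)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = aos[i].x;   // neighboring threads are 12 bytes apart
}

// Structure of arrays: thread k reads x[k], a stride-1, coalesced pattern.
__global__ void readSoA(float *out, const float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = x[i];       // neighboring threads are 4 bytes apart
}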

Shared Memory and Memory Banks Because it is on-chip, shared memory is much faster than local and global memory. In fact, uncached shared memory latency is roughly 100x lower than global memory latency, provided there are no bank conflicts between the threads.

To achieve high memory bandwidth for concurrent accesses, shared memory is divided into equally sized memory modules (banks) that can be accessed simultaneously. Medium Priority: Accesses to shared memory should be designed to avoid serializing requests due to bank conflicts.

Shared memory banks are organized such that successive 32-bit words are assigned to successive banks and each bank has a bandwidth of 32 bits per clock cycle.
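As an illustration (a sketch: TILE_DIM, a square input whose width is a multiple of the tile size, and the +1 padding idiom are assumptions), column reads from a shared-memory tile hit the same bank unless each row is padded by one word:

#define TILE_DIM 16

__global__ void transposeTile(float *odata, const float *idata, int width)
{
    // Without the +1 padding, the words of one tile column are 16 words
    // apart and map to the same bank, serializing the column reads below.
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = idata[y * width + x];  // coalesced load
    __syncthreads();

    x = blockIdx.y * TILE_DIM + threadIdx.x;                // swap block indices
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    odata[y * width + x] = tile[threadIdx.x][threadIdx.y];  // coalesced store
}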