Optimization on Kepler Zehuan Wang

Fundamental Optimization

Optimization Overview
- GPU architecture
- Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
- CPU-GPU interaction optimization
  - Overlapped execution using streams

GPU High Level View (figure): the GPU consists of streaming multiprocessors, each with its own shared memory (SMEM), in front of a common global memory; the GPU connects to the CPU and chipset over PCIe.

GK110 SM
- Control unit: 4 warp schedulers, 8 instruction dispatch units
- Execution units: 192 single-precision CUDA cores, 64 double-precision units, 32 SFUs, 32 LD/ST units
- Memory: 64K 32-bit registers; 64 KB L1 + shared memory; texture and constant caches

GPU and Programming Model (software to hardware mapping)
- Thread -> CUDA core: threads are executed by CUDA cores
- Thread block -> multiprocessor: thread blocks are executed on multiprocessors; thread blocks do not migrate; several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources
- Grid -> device: a kernel is launched as a grid of thread blocks; up to 16 kernels can execute on a device at one time
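
As an illustration of this mapping, a minimal sketch of a kernel launched as a grid of thread blocks (the kernel scale, the device pointer d_data, and the sizes are hypothetical names, not from the original deck):

__global__ void scale(float *data, float s, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per array element
    if (i < n) data[i] *= s;
}

// Host side: pick a block size, derive the grid size, launch
// int threadsPerBlock = 256;
// int blocksPerGrid   = (n + threadsPerBlock - 1) / threadsPerBlock;
// scale<<<blocksPerGrid, threadsPerBlock>>>(d_data, 2.0f, n);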

Warp
- A warp is a group of 32 consecutive threads in a block
- E.g. blockDim = 160: automatically divided by the GPU into 5 warps (threads 0~31, 32~63, 64~95, 96~127, 128~159)
- E.g. blockDim = 161: if blockDim is not a multiple of 32, the remaining threads occupy one more, partially filled warp (warp 5 holds only thread 160)
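
A minimal sketch (the kernel name warpInfo is hypothetical) of how this thread-to-warp mapping can be computed inside a kernel:

#include <cstdio>

__global__ void warpInfo()
{
    int warpsPerBlock = (blockDim.x + 31) / 32;   // ceil(blockDim.x / 32): 160 -> 5 warps, 161 -> 6 warps
    int warpId = threadIdx.x / 32;                // which warp of the block this thread belongs to
    int laneId = threadIdx.x % 32;                // position within the warp (0..31)
    if (laneId == 0 && blockIdx.x == 0)
        printf("blockDim = %d -> %d warps; this thread is lane 0 of warp %d\n",
               (int)blockDim.x, warpsPerBlock, warpId);
}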

Warp: SIMD Execution
- SIMD: Same Instruction, Multiple Data
- The threads in the same warp always execute the same instruction
- Instructions are issued to the execution units one warp at a time
(Figure: two warp schedulers, each issuing instructions from several warps, e.g. warp 8 instruction 11, warp 2 instruction 42, warp 14 instruction 95.)

Warp: Latency Hiding
- Latency is caused by dependencies between neighboring instructions in the same warp
- During the waiting time, instructions from other warps can be executed
- Context switching between warps is free
- A lot of warps can hide memory latency
(Same warp-scheduler figure as the previous slide.)

Kepler Memory Hierarchy
- Registers (spill to local memory)
- Caches: shared memory, L1 cache, L2 cache, constant cache, texture cache
- Global memory

Kepler/Fermi Memory Hierarchy (figure): each SM has its own registers, constant cache (C), L1/shared memory, and texture cache (TEX); all SMs share the L2 cache, which sits in front of global memory. Speed decreases and capacity grows from the per-SM resources down to global memory.

The Wrong Approach to Optimization
Trying all the optimization methods in the book: optimization is endless. Measure first and target the actual limiter (next slides).

General Optimization Strategy: Measurement
- Find out the limiting factor in kernel performance:
  - Memory bandwidth bound (memory optimization)
  - Instruction throughput bound (instruction optimization)
  - Latency bound (configuration optimization)
- Measure effective memory/instruction throughput with the NVIDIA Visual Profiler

Find Limiter (flowchart): compare the kernel's achieved memory throughput (GB/s) with the effective value; if they are comparable (~), the kernel is memory bound and needs memory optimization. Otherwise compare achieved instruction throughput (inst/s) with the effective value; if comparable (~), the kernel is instruction bound and needs instruction optimization. If both are far below (<<) the effective values, the kernel is latency bound and needs configuration optimization. Repeat until the limiter is resolved: done!

Optimization Overview
- GPU architecture
- Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
- CPU-GPU interaction optimization
  - Overlapped execution using streams

Memory Optimization
- Apply when the code is memory bound and the effective memory throughput is much lower than the peak
- Purpose: access only data that are absolutely necessary
- Major techniques:
  - Improve access patterns to reduce wasted transactions
  - Reduce redundant accesses: read-only cache, shared memory

Reduce Wasted Transactions
- Memory accesses are per warp
- Memory is accessed in discrete chunks
  - On Kepler, L1 is reserved for register spills and stack data only: global loads go directly to L2 (the line in L1 is invalidated), and on an L2 miss they go to DRAM
  - Memory is transported in segments of 32 B (same as for writes)
- If a warp cannot use all of the data in the segments it touches, the rest of the memory transaction is wasted

Kepler/Fermi Memory Hierarchy (same figure as before): per-SM registers, constant cache, L1/shared memory, and texture cache; shared L2 cache in front of global memory.

Reduce Wasted Transactions
Scenario: warp requests 32 aligned, consecutive 4-byte words
- Addresses fall within 4 segments
- No replays
- Warp needs 128 bytes; 128 bytes move across the bus on a miss
- Bus utilization: 100%

Reduce Wasted Transactions
Scenario: warp requests 32 aligned, permuted 4-byte words
- Addresses fall within 4 segments
- No replays
- Warp needs 128 bytes; 128 bytes move across the bus on a miss
- Bus utilization: 100%

Reduce Wasted Transactions
Scenario: warp requests 32 consecutive 4-byte words, offset from perfect alignment
- Addresses fall within at most 5 segments: 1 replay (2 transactions)
- Warp needs 128 bytes; at most 160 bytes move across the bus
- Bus utilization: at least 80% (some misaligned patterns still fall within 4 segments, giving 100% utilization)

Reduce Wasted Transactions
Scenario: all threads in a warp request the same 4-byte word
- Addresses fall within a single segment
- No replays
- Warp needs 4 bytes; 32 bytes move across the bus on a miss
- Bus utilization: 12.5%

Reduce Wasted Transactions
Scenario: warp requests 32 scattered 4-byte words
- Addresses fall within N segments: (N-1) replays (N transactions); could be lower if some segments can be combined into a single transaction
- Warp needs 128 bytes; N*32 bytes move across the bus on a miss
- Bus utilization: 128 / (N*32), i.e. 4x higher than with 128 B caching loads
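
To make these scenarios concrete, a sketch (hypothetical kernels, not from the original deck) contrasting a fully coalesced access pattern with a strided one that wastes most of each transaction:

__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads touch consecutive addresses
    if (i < n) out[i] = in[i];                       // a warp's 128 bytes fall in 4 segments: 100% bus utilization
}

__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;   // consecutive threads are stride*4 bytes apart
    if (i < n) out[i] = in[i];   // for large strides each thread hits its own 32 B segment: ~12.5% utilization
}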

Read-Only Cache
- An alternative to L1 when accessing DRAM
- Also known as the texture cache: all texture accesses use this cache; CC 3.5 and higher can also route global memory reads through it
- Should not be used if a kernel reads from and writes to the same addresses
- Compared to L1:
  - Generally better for scattered reads than L1
  - Caching is at 32 B granularity (L1, when caching, operates at 128 B granularity)
  - Does not require replays for multiple transactions (L1 does)
  - Higher latency than L1 reads; also tends to increase register use

Read-Only Cache
Annotate eligible kernel pointer parameters with const __restrict__; the compiler will then automatically map those loads to the read-only data cache path.

__global__ void saxpy(float x, float y,
                      const float * __restrict__ input,
                      float * output)
{
    size_t offset = threadIdx.x + (blockIdx.x * blockDim.x);
    // The compiler automatically uses the read-only (texture) path for "input"
    output[offset] = (input[offset] * x) + y;
}

Shared Memory
- Low latency: a few cycles
- High throughput
- Main uses:
  - Inter-thread communication within a block
  - User-managed cache to reduce redundant global memory accesses
  - Avoiding non-coalesced global memory access

Shared Memory Example: Matrix Multiplication
C = A x B; every thread corresponds to one entry in C. (Figure: matrices A, B, and C.)

Naive Kernel

__global__ void simpleMultiply(float* a, float* b, float* c, int N)
{
    int row = threadIdx.x + blockIdx.x*blockDim.x;
    int col = threadIdx.y + blockIdx.y*blockDim.y;
    float sum = 0.0f;
    for (int i = 0; i < N; i++) {
        sum += a[row*N+i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}

Every thread corresponds to one entry in C.

Blocked Matrix Multiplication
C = A x B. Data reuse in the blocked version: each tile of A and B is shared by all threads of a block. (Figure: matrices A, B, and C partitioned into tiles.)

Blocked and Cached Kernel

// Assumes blockDim = (TILE_DIM, TILE_DIM) and N is a multiple of TILE_DIM
__global__ void coalescedMultiply(double* a, double* b, double* c, int N)
{
    __shared__ double aTile[TILE_DIM][TILE_DIM];
    __shared__ double bTile[TILE_DIM][TILE_DIM];
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    double sum = 0.0;
    for (int k = 0; k < N; k += TILE_DIM) {
        aTile[threadIdx.y][threadIdx.x] = a[row*N + threadIdx.x + k];
        bTile[threadIdx.y][threadIdx.x] = b[(threadIdx.y + k)*N + col];
        __syncthreads();                  // wait until both tiles are fully loaded
        for (int i = 0; i < TILE_DIM; i++) {
            sum += aTile[threadIdx.y][i] * bTile[i][threadIdx.x];
        }
        __syncthreads();                  // keep the tiles intact until every thread is done with them
    }
    c[row*N + col] = sum;
}

Optimization Overview
- GPU architecture
- Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
- CPU-GPU interaction optimization
  - Overlapped execution using streams

Latency Optimization
- Apply when the code is latency bound: both the memory and instruction throughputs are far from the peak
- Latency hiding: switch to other warps while a thread waits for an operand that is not ready
- Purpose: have enough warps to hide latency
- Major technique: increase the number of active warps

Enough Blocks and a Good Block Size
- Number of blocks >> number of SMs; > 100 blocks to scale well to future devices
- Block size: minimum 64 threads; 128 or 256 generally works well, but use whatever is best for your app
- It depends on the problem: do experiments!

Occupancy & Active Warps
- Occupancy: ratio of active warps per SM to the maximum number of allowed warps (48 on Fermi, 64 on Kepler)
- Occupancy needs to be high enough to hide latency
- Occupancy is limited by resource usage

Dynamic Partitioning of SM Resources
- Shared memory is partitioned among blocks
- Registers are partitioned among threads: <= 255 registers per thread
- Thread block slots: <= 16 per SM
- Thread slots: <= 2048 per SM
- Any of these can be the limiting factor on how many threads can be resident at the same time on an SM
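
A worked example, using the 64K 32-bit registers per SM from the GK110 SM slide: a kernel that uses 64 registers per thread can keep at most 65536 / 64 = 1024 threads resident per SM, i.e. 32 warps out of the 64 allowed, so occupancy is capped at 50% even though the thread-slot limit alone would allow 2048 threads.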

Occupancy Calculator

Occupancy Optimizations
- Know the current occupancy:
  - Visual Profiler
  - --ptxas-options=-v: outputs resource usage info, which is the input to the Occupancy Calculator
- Adjust resource usage to increase occupancy:
  - Change the block size
  - Limit register usage: compiler option --maxrregcount=n (per file) or __launch_bounds__ (per kernel)
  - Allocate shared memory dynamically
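
A minimal sketch of the per-kernel option; the kernel name and the bound values are illustrative, not tuned recommendations:

__global__ void
__launch_bounds__(256, 4)   // at most 256 threads per block, at least 4 resident blocks per SM
myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // the compiler caps register use so 4 blocks of 256 threads can fit per SM
}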

Optimization Overview
- GPU architecture
- Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
- CPU-GPU interaction optimization
  - Overlapped execution using streams

Instruction Optimization
- Apply when you find the code is instruction bound
  - A compute-intensive algorithm can easily become memory bound if you are not careful enough
  - Typically, worry about instruction optimization after memory and execution-configuration optimizations
- Purpose: reduce the instruction count; use fewer instructions to get the same job done
- Major techniques:
  - Use high-throughput instructions
  - Reduce wasted instructions: branch divergence, etc.

Reduce Instruction Count
- Use float if precision allows; add "f" to floating-point literals (e.g. 1.0f), because the default is double
- Fast math functions: two types of runtime math library functions
  - func(): slower but higher accuracy (5 ulp or less)
  - __func(): fast but lower accuracy (see the programming guide for full details)
  - -use_fast_math: forces every func() to __func()
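
A small illustration (the kernel name is hypothetical) of single-precision literals and intrinsic math functions:

__global__ void fastMathExample(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float x = in[i] * 0.5f;           // 0.5f keeps the multiply in single precision (0.5 would promote to double)
        out[i] = __sinf(x) + __expf(x);   // intrinsics: faster but lower accuracy than sinf()/expf()
    }
}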

Control Flow
- Divergent branches: threads within a single warp take different paths, and the different execution paths within a warp are serialized
  - Example with divergence: if (threadIdx.x > 2) {...} else {...} (branch granularity < warp size)
- Different warps can execute different code with no impact on performance
- Avoid diverging within a warp
  - Example without divergence: if (threadIdx.x / WARP_SIZE > 2) {...} else {...} (branch granularity is a whole multiple of the warp size)
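
Both example conditions, put into a single illustrative kernel (the kernel name and output array are hypothetical); the comments note which branch diverges:

#define WARP_SIZE 32

__global__ void branchExample(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent in warp 0: lanes 0-2 and lanes 3-31 take different paths, which are serialized
    if (threadIdx.x > 2) out[i] = 1.0f;
    else                 out[i] = 2.0f;

    // Not divergent: every thread of a given warp evaluates this condition identically
    if (threadIdx.x / WARP_SIZE > 2) out[i] += 1.0f;
    else                             out[i] += 2.0f;
}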

Kernel Optimization Workflow (flowchart): compare the kernel's achieved memory throughput (GB/s) to the peak; if it is comparable (~), the kernel is memory bound and needs memory optimization. Otherwise compare achieved instruction throughput (inst/s) to the peak; if comparable (~), the kernel is instruction bound and needs instruction optimization. If both are far below (<<) the peak, the kernel is latency bound and needs configuration optimization. Repeat until no limiter remains: done!

Optimization Overview
- GPU architecture
- Kernel optimization
  - Memory optimization
  - Latency optimization
  - Instruction optimization
- CPU-GPU interaction optimization
  - Overlapped execution using streams

Minimizing CPU-GPU Data Transfer
- Host-device data transfer has much lower bandwidth than global memory access: 16 GB/s (PCIe x16 Gen3) vs. 250 GB/s and 3.95 Tinst/s (GK110)
- Minimize transfers:
  - Intermediate data can be allocated, operated on, and de-allocated directly on the GPU
  - Sometimes it is even better to recompute on the GPU
- Group transfers: one large transfer is much better than many small ones
- Overlap memory transfers with computation

Streams and Async API
- Default API:
  - Kernel launches are asynchronous with the CPU
  - Memcopies (D2H, H2D) block the CPU thread
  - CUDA calls are serialized by the driver
- Streams and async functions provide:
  - Memcopies (D2H, H2D) asynchronous with the CPU
  - The ability to concurrently execute a kernel and a memcopy
  - Concurrent kernel execution (Fermi and later)
- Stream = a sequence of operations that execute in issue order on the GPU
  - Operations from different streams can be interleaved
  - A kernel and a memcopy from different streams can be overlapped

Pinned (Non-Pageable) Memory
- Pinned memory enables memcopies that are asynchronous with both the CPU and the GPU
- Usage: cudaHostAlloc / cudaFreeHost instead of malloc / free
- Note: pinned memory is essentially removed from the host's pageable virtual memory, and cudaHostAlloc is typically very expensive
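
A minimal usage sketch; the buffer name h_data and the element count N are placeholders:

float *h_data = NULL;
size_t bytes = N * sizeof(float);

cudaHostAlloc((void**)&h_data, bytes, cudaHostAllocDefault);   // pinned host allocation
// ... fill h_data, issue cudaMemcpyAsync and kernel launches that use it ...
cudaFreeHost(h_data);                                          // release the pinned memory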

Overlap Kernel and Memory Copy
Requirements:
- D2H or H2D memcopy from pinned memory
- Device with compute capability ≥ 1.1 (G84 and later)
- Kernel and memcopy in different, non-0 streams

Code (the memcopy in stream1 and the kernel in stream2 are potentially overlapped):

cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

cudaMemcpyAsync(dst, src, size, dir, stream1);
kernel<<<grid, block, 0, stream2>>>(...);
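
As a follow-up sketch (reusing the stream1/stream2 handles above), the host eventually waits for both streams and releases them:

cudaStreamSynchronize(stream1);   // wait for the async copy to finish
cudaStreamSynchronize(stream2);   // wait for the kernel to finish
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);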