DirectCompute Performance on DX11 Hardware


DirectCompute Performance on DX11 Hardware
Nicolas Thibieroz, AMD
Cem Cebenoyan, NVIDIA

Why DirectCompute?
- Allow arbitrary programming of the GPU
  - General-purpose programming, post-process operations, etc.
- Not always a win against PS though
  - A well-balanced PS is unlikely to get beaten by CS
  - Better to target PS with heavy TEX or ALU bottlenecks
  - Use CS threads to divide the work and balance the shader out

Feeding the Machine
- GPUs are throughput-oriented processors
  - Latencies are covered with work
  - Need to provide enough work to gain efficiency
- Look for fine-grained parallelism in your problem
  - Trivial mappings work best: pixels on the screen, particles in a simulation

Feeding the Machine (2)
- It can still be advantageous to run a small computation on the GPU if it helps avoid a round trip to the host
  - Latency benefit
  - Example: massaging parameters for subsequent kernel launches or draw calls
- Combine with DispatchIndirect() to get more work done without CPU intervention

Scalar vs Vector
- NVIDIA GPUs are scalar
  - Explicit vectorization is unnecessary
  - It won't hurt in most cases, but there are exceptions
  - Map threads to scalar data elements
- AMD GPUs are vector
  - Vectorization is critical to performance
  - Avoid dependent scalar instructions
  - Use IHV tools to check ALU usage

CS5.0 >> CS4.0
- CS5.0 is just better than CS4.0
  - More of everything: threads, Thread Group Shared Memory, atomics, flexibility, etc.
- Will typically run faster if taking advantage of CS5.0 features
- Prefer CS5.0 over CS4.0 if D3D_FEATURE_LEVEL_11_0 is supported

Thread Group Declaration
- Declaring a suitable thread group size is essential to performance
  [numthreads(NUM_THREADS_X, NUM_THREADS_Y, 1)]
  void MyCSShader(...)
- Total thread group size should be at or above the hardware's wavefront size
  - The size varies depending on the GPU! AMD (ATI) HW uses wavefronts of up to 64 threads; NVIDIA HW uses warps of 32
- Avoid sizes below the wavefront size
  - numthreads(1,1,1) is a bad idea!
- Larger values will generally work well across a wide range of GPUs
  - Better scaling with lower-end GPUs
  (a minimal declaration sketch follows below)
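As a minimal sketch (the Output resource and the 2D image-processing use case are assumptions, not from the slides), a group size of 8x8 = 64 threads is a multiple of both the 64-wide AMD wavefront and the 32-wide NVIDIA warp:

  RWTexture2D<float4> Output;   // hypothetical UAV written by the kernel

  [numthreads(8, 8, 1)]         // 64 threads per group: a multiple of both 32 and 64
  void MyCSShader(uint3 DTid : SV_DispatchThreadID)
  {
      Output[DTid.xy] = float4(0, 0, 0, 0);
  }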

Thread Group Usage
- Try to divide work evenly among all threads in a group
- Dynamic Flow Control will create divergent workflows for threads
  - This means threads doing less work will sit idle while others are still busy

  [numthreads(groupthreads, 1, 1)]
  void CSMain(uint3 Gid : SV_GroupID, uint3 Gtid : SV_GroupThreadID)
  {
      ...
      if (Gtid.x == 0)
      {
          // Code here is only executed for one thread!
      }
  }

  (a sketch of spreading such per-group work across the whole group follows below)
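A hedged sketch of rebalancing such per-group work (the Output buffer and the clear task are invented for illustration): instead of serializing a loop on thread 0, every thread handles one element.

  RWStructuredBuffer<float> Output;   // hypothetical output buffer

  [numthreads(64, 1, 1)]
  void CSMain(uint3 Gid : SV_GroupID, uint3 Gtid : SV_GroupThreadID)
  {
      // Divergent version: one thread works while 63 sit idle
      // if (Gtid.x == 0)
      //     for (uint i = 0; i < 64; ++i)
      //         Output[Gid.x * 64 + i] = 0.0f;

      // Balanced version: every thread handles exactly one element
      Output[Gid.x * 64 + Gtid.x] = 0.0f;
  }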

Mixing Compute and Raster
- Reduce the number of transitions between Compute and Draw calls
  - Those transitions can be expensive!
- Batched:     Compute A, Compute B, Compute C, Draw X, Draw Y, Draw Z
- Interleaved: Compute A, Draw X, Compute B, Draw Y, Compute C, Draw Z
- Batched >> Interleaved

Unordered Access Views
- UAVs are not strictly a DirectCompute resource
  - Can be used with PS too
- Unordered Access supports scattered R/W
  - Scattered access = cache thrashing
  - Prefer grouped reads/writes (bursting), e.g. read/write float4 instead of float
    - NVIDIA's scalar arch will not benefit from this
  - Prefer contiguous writes to UAVs
  (a hedged grouped-write sketch follows below)
- Do not create a buffer or texture with the UAV flag if not required
  - It may require synchronization after render ops
  - Use D3D11_BIND_UNORDERED_ACCESS only if needed!
- Avoid using UAVs as a scratch pad! Better to use TGSM for this
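A minimal sketch of the grouped-write idea (ResultsVec is a hypothetical UAV): one float4 store per thread replaces four separate float stores, which helps vector (AMD) hardware in particular.

  RWStructuredBuffer<float4> ResultsVec;   // hypothetical UAV bound with D3D11_BIND_UNORDERED_ACCESS

  [numthreads(64, 1, 1)]
  void WriteCS(uint3 DTid : SV_DispatchThreadID)
  {
      // One contiguous 16-byte write instead of four scattered float writes
      ResultsVec[DTid.x] = float4(1, 2, 3, 4);
  }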

Buffer UAV with Counter
- Shader Model 5.0 supports a counter on Buffer UAVs
  - Not supported on textures
  - D3D11_BUFFER_UAV_FLAG_COUNTER flag in CreateUnorderedAccessView()
- Accessible via (a hedged usage sketch follows below):
  uint IncrementCounter();
  uint DecrementCounter();
- Faster than implementing a manual counter with a UINT32-sized R/W UAV
  - Avoids the need for an atomic operation on the UAV
- See the Linked Lists presentation for an example of this
- On NVIDIA HW, prefer Append buffers
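A hedged sketch of the counter in use (the Record layout and buffer name are hypothetical; the UAV is assumed to have been created with D3D11_BUFFER_UAV_FLAG_COUNTER):

  struct Record
  {
      float3 position;
      float  value;
  };
  RWStructuredBuffer<Record> Records;   // hypothetical UAV created with the counter flag

  [numthreads(64, 1, 1)]
  void EmitCS(uint3 DTid : SV_DispatchThreadID)
  {
      Record r;
      r.position = float3(DTid.x, DTid.y, DTid.z);
      r.value    = 1.0f;

      uint slot = Records.IncrementCounter();   // hidden counter: no manual atomic on the data needed
      Records[slot] = r;
  }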

Append/Consume Buffers
- Useful for serializing the output of a data-parallel kernel into an array (a hedged sketch follows below)
- Can be used in graphics too
  - E.g. deferred fragment processing
- Use with care, they can be costly
  - They introduce a serialization point in the API
  - Large record sizes can hide the cost of the append operation
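A minimal sketch of appending records from a kernel (the Fragment layout, buffer name, and keep/discard test are all assumptions): only surviving threads append, producing a compacted array.

  struct Fragment
  {
      uint  pixelCoord;
      float depth;
  };
  AppendStructuredBuffer<Fragment> DeferredFragments;   // hypothetical append UAV

  [numthreads(8, 8, 1)]
  void CollectCS(uint3 DTid : SV_DispatchThreadID)
  {
      Fragment f;
      f.pixelCoord = (DTid.y << 16) | DTid.x;
      f.depth      = 0.5f;

      // Arbitrary stand-in condition: only the threads that pass it emit a record
      if ((DTid.x & 1) == 0)
          DeferredFragments.Append(f);
  }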

Atomic Operations
- "An operation that cannot be interrupted by other threads until it has completed"
- Typically used with UAVs
- Atomic operations cost performance
  - Due to synchronization needs
  - Use them only when needed
  - Many problems can be recast as a more efficient parallel reduce or scan
- Atomic ops with feedback cost even more
  - E.g. Buf.InterlockedAdd(uAddress, 1, Previous);
  (a hedged comparison of both forms follows below)
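A hedged sketch contrasting the two forms (Counters is a hypothetical raw UAV holding two uint counters at byte offsets 0 and 4):

  RWByteAddressBuffer Counters;   // hypothetical UAV with uint counters at offsets 0 and 4

  [numthreads(64, 1, 1)]
  void AtomicCS(uint3 DTid : SV_DispatchThreadID)
  {
      // Fire-and-forget: no return value requested, cheaper
      Counters.InterlockedAdd(0, 1);

      // With feedback: the previous value is returned to the thread, which costs more
      uint previous;
      Counters.InterlockedAdd(4, 1, previous);
  }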

Thread Group Shared Memory
- Fast memory shared across threads within a group
  - Not shared across thread groups!
  groupshared float2 MyArray[16][32];
- Not persistent between Dispatch() calls
- Used to reduce computation
  - Share neighboring calculations by storing them in TGSM
  - E.g. post-processing texture instructions
  (a hedged post-process sketch follows below)
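A hedged sketch of the neighbor-sharing pattern for a 1D post-process pass (texture names, radius, and group size are assumptions): each texel is fetched from the texture once into TGSM and then reused by up to nine threads.

  Texture2D<float4>   Input;    // hypothetical source texture
  RWTexture2D<float4> Output;   // hypothetical destination

  #define THREADS 64
  #define RADIUS  4

  groupshared float4 Samples[THREADS + 2 * RADIUS];

  [numthreads(THREADS, 1, 1)]
  void BlurCS(uint3 Gtid : SV_GroupThreadID, uint3 DTid : SV_DispatchThreadID)
  {
      // Each thread loads its own texel; the first threads also load the apron
      // (out-of-range loads return 0 in D3D11)
      Samples[Gtid.x + RADIUS] = Input[DTid.xy];
      if (Gtid.x < RADIUS)
      {
          Samples[Gtid.x]                    = Input[uint2(DTid.x - RADIUS, DTid.y)];
          Samples[Gtid.x + THREADS + RADIUS] = Input[uint2(DTid.x + THREADS, DTid.y)];
      }
      GroupMemoryBarrierWithGroupSync();   // all TGSM writes visible before anyone reads

      float4 sum = 0;
      [unroll]
      for (int i = -RADIUS; i <= RADIUS; ++i)
          sum += Samples[Gtid.x + RADIUS + i];

      Output[DTid.xy] = sum / (2 * RADIUS + 1);
  }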

TGSM Performance (1)
- Access patterns matter!
- Limited number of I/O banks
  - 32 banks on both ATI and NVIDIA HW
- Bank conflicts will reduce performance

TGSM Performance (2)
- 32-bank example
  - Each address is 32 bits wide
  - Banks are arranged linearly with addresses:

    Address:  0   1   2   3  ...  31  32  33  34  35
    Bank:     0   1   2   3  ...  31   0   1   2   3

- TGSM addresses that are 32 DWORDs apart use the same bank
  - Accessing those addresses from multiple threads will create a bank conflict
- Declare TGSM 2D arrays as MyArray[Y][X], and increment X first, then Y
  - Essential if X is a multiple of 32!
- Padding arrays/structures to avoid bank conflicts can help
  - E.g. MyArray[16][33] instead of [16][32] (a small fragment follows below)
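A small hedged fragment of the padded declaration (the array name and size follow the slide's example; the surrounding kernel is omitted):

  // Unpadded: a column walk (fixed X, varying Y) hits the same bank every 32 DWORDs
  // groupshared float MyArray[16][32];

  // Padded: 33 columns shift each row by one bank, so a column walk avoids the conflict
  groupshared float MyArray[16][33];

  // Index as [Y][X] and let X vary fastest across neighboring threads
  // MyArray[Gtid.y][Gtid.x] = value;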

TGSM Performance (3)
- Reduce accesses whenever possible
  - E.g. pack data into uint instead of float4
  - But watch out for the increased ALU cost! (a hedged packing fragment follows below)
- Basically, try to read/write once per TGSM address
  - Copying to a temp array can help if it avoids duplicate accesses
- Unroll loops accessing shared memory
  - Helps the compiler hide latency
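A hedged fragment of the packing idea using the SM5.0 half-precision intrinsics (the buffer names and stand-in values are assumptions, and the precision loss from f32tof16 is assumed acceptable):

  RWBuffer<float> Result;             // hypothetical output so the work is observable
  groupshared uint PackedPairs[64];   // one uint holds two half-precision values

  [numthreads(64, 1, 1)]
  void PackCS(uint3 Gtid : SV_GroupThreadID)
  {
      float a = Gtid.x * 0.5f;    // stand-in values
      float b = Gtid.x * 0.25f;

      // One TGSM write instead of two (extra ALU spent on packing)
      PackedPairs[Gtid.x] = (f32tof16(a) << 16) | f32tof16(b);

      GroupMemoryBarrierWithGroupSync();

      // One TGSM read, then unpack
      uint pair = PackedPairs[Gtid.x];
      Result[Gtid.x] = f16tof32(pair >> 16) + f16tof32(pair & 0xFFFF);
  }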

Barriers
- Barriers add a synchronization point for all threads within a group
  - GroupMemoryBarrier()
  - GroupMemoryBarrierWithGroupSync()
- Too many barriers will affect performance
  - Especially true if work is not divided evenly among threads
- Watch out for algorithms using many barriers

Maximizing HW Occupancy
- A thread group cannot be split across multiple shader units
  - It is either fully in or fully out
  - Unlike pixel work, which can be arbitrarily fragmented
- Occupancy is affected by:
  - The declared thread group size
  - The declared TGSM size
  - The number of GPRs used
- These numbers affect the level of parallelism that can be achieved

Maximizing HW Occupancy (2)
- Example shader unit limits:
  - 8 thread groups max
  - 32 KB total shared memory
  - 1024 threads max
- A thread group of 128 threads requiring 24 KB of shared memory can only run 1 group per shader unit (128 of a possible 1024 threads): BAD
  (a worked version of these numbers follows below)
- Ask your IHVs for their GPU Computing documentation
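Working through those illustrative limits (plus a hypothetical 8 KB variant) shows which resource becomes the binding constraint:

  Per shader unit (example limits): 8 groups, 32 KB TGSM, 1024 threads

  128-thread group using 24 KB of TGSM:
    TGSM limit:   floor(32 / 24)    = 1 group  ->  128 threads resident (1/8 of the 1024 possible)

  128-thread group using 8 KB of TGSM (hypothetical):
    TGSM limit:   floor(32 / 8)     = 4 groups
    Thread limit: floor(1024 / 128) = 8 groups
    Group limit:                      8 groups
    -> min(4, 8, 8) = 4 groups = 512 threads resident; TGSM is the binding limit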

Maximizing HW Occupancy (3)
- Register pressure will also affect occupancy
  - You have little control over this
  - Rely on the drivers to do the right thing :)
- Tuning and experimentation are required to find the ideal balance
  - But this balance varies from HW to HW!
  - Store different presets for best performance across a variety of GPUs

Conclusion
- Thread group size declaration is essential to performance
- I/O can be a bottleneck
- TGSM tuning is important
- Minimize PS->CS->PS transitions
- HW occupancy is GPU-dependent
- The DirectX SDK DirectCompute samples do not necessarily use best practices at the moment!
  - E.g. HDRToneMapping, OIT11

Questions?
Nicolas.Thibieroz@amd.com
cem@nvidia.com