
Performance
Dominik Göddeke

2 Overview
Motivation and example PDE
– Poisson problem
– Discretization and data layouts
Five points of attack for GPGPU

3 Example PDE: The Poisson Problem

4 Discretization Grids
Equidistant grids
– Easy to implement
– One array holds all the values
– One array for the right-hand side
– No matrix required, just a stencil (see the sketch below)
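To make the stencil idea concrete, here is a minimal plain-C sketch of one Jacobi sweep on an equidistant grid; the function and array names and the 5-point Laplacian scaling are illustrative assumptions, not code from the slides:

    /* One Jacobi sweep on an N x N equidistant grid.
       x: current iterate, b: right-hand side, x_new: output.
       The 5-point Laplacian is applied as a stencil; no matrix is stored. */
    void jacobi_sweep(int N, double h, const double *x, const double *b, double *x_new)
    {
        double h2 = h * h;
        for (int j = 1; j < N - 1; ++j) {
            for (int i = 1; i < N - 1; ++i) {
                int c = j * N + i;                 /* linear index of grid point (i,j) */
                double Ax = (4.0 * x[c] - x[c-1] - x[c+1] - x[c-N] - x[c+N]) / h2;
                double inv_diag = h2 / 4.0;        /* inverse of the diagonal 4/h^2 */
                x_new[c] = x[c] + inv_diag * (b[c] - Ax);
            }
        }
    }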

5 Discretization Grids
Tensorproduct grids
– Reasonably easy to implement
– Banded matrix, each band represented as an individual array
[figure: matrix bands stored as vectors, laid out as 2D GPU arrays; image courtesy of Jens Krüger]

6 Discretization Grids
Generalized tensorproduct grids
– Generality vs. efficient data structures tailored for the GPU
– Global unstructured macro mesh, domain decomposition
– (An-)isotropic refinement into local tensorproduct meshes
Efficient compromise
– Hide anisotropies locally and exploit fast solvers on regular sub-problems: excellent numerical convergence
– Large problems become viable

7 Discretization Grids
Unstructured grids
– Bad performance for dynamic topology
– Compact row storage format or similar (see the sketch below)
– Challenging to implement: indirection arrays, feedback loop to the vertex stage
[image courtesy of Jens Krüger]
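For reference, a minimal compact row storage MatVec in C; the indirection through the column-index array is exactly what makes unstructured grids awkward on this generation of GPUs (names are illustrative):

    /* y = A*x in compact row storage.
       val: nonzero entries, col: column index of each entry,
       row_ptr: start of each row in val/col (row_ptr[n] = number of nonzeros). */
    void crs_matvec(int n, const double *val, const int *col,
                    const int *row_ptr, const double *x, double *y)
    {
        for (int i = 0; i < n; ++i) {
            double sum = 0.0;
            for (int k = row_ptr[i]; k < row_ptr[i+1]; ++k)
                sum += val[k] * x[col[k]];   /* indirect, data-dependent read */
            y[i] = sum;
        }
    }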

8 Discretization Grids
Adaptive grids
– Handle coherent grid topology changes
– Need a dynamic hash/tree structure and/or a page table on the GPU
– Actively being researched, see the Glift project
[figure: virtual domain mapped via a mipmap page table to physical memory; image courtesy of Aaron Lefohn]

9 Overview
Motivation and example PDE
Five points of attack for GPGPU
– Interpolation
– On-chip bandwidth
– Off-chip bandwidth
– Overhead
– Vectorization

10 General Performance Tuning
Traditional CPU cache-aware techniques
– Blocking, reordering, unrolling etc.
– Cannot be applied directly
No direct control of what actually happens
– Hardware details are under NDA, not public
– The driver recompiles the code and might apply space-filling curves or similar to arrange arrays in memory
– The driver knows the hardware details best, so let it work for you
Only a small cache, optimized for texture filtering
– Prefetching of small local neighborhoods
– Memory interface optimized for streaming, sequential access

11 Simplified GPU Overview
[pipeline diagram]
– Vertex Processor (VP): kernel changes index regions of input arrays
– Rasterizer: creates data streams from index regions
– Fragment Processor (FP): kernel changes each datum independently, reads additional input arrays
– Arrays are transferred between CPU memory and GPU memory

12 First Point of Attack
[pipeline diagram as on slide 11]
Take advantage of the interpolation hardware.

13 Interpolation
Recall how computation is triggered
– Some geometry that covers the output region is drawn (more precisely, a quad with four vertices)
– The index of each element in the output array is already interpolated automatically across the output region
– This can be leveraged for all values that vary linearly over some region of the input arrays as well
Example: Jacobi solver (cf. Session 1)
– Typical domain sizes: 256x256, 512x512, 1024x1024
– Interpolate the index math for the neighborhood lookup as well

14 Interpolation Example
Original fragment program: the neighbor offsets are calculated per fragment, i.e. 1024^2 times on a 1024x1024 grid:

    float jacobi (float2 center : WPOS,
                  uniform samplerRECT x,
                  uniform samplerRECT b,
                  uniform float one_over_h) : COLOR
    {
        // offsets of the four stencil neighbors, computed per fragment
        float2 left   = center - float2(1,0);
        float2 right  = center + float2(1,0);
        float2 bottom = center - float2(0,1);
        float2 top    = center + float2(0,1);

        float x_center = texRECT(x, center);
        float x_left   = texRECT(x, left);
        float x_right  = texRECT(x, right);
        float x_bottom = texRECT(x, bottom);
        float x_top    = texRECT(x, top);
        float rhs      = texRECT(b, center);

        float Ax = one_over_h * (4.0 * x_center - x_left - x_right - x_bottom - x_top);
        float inv_diag = one_over_h / 4.0;
        return x_center + inv_diag * (rhs - Ax);
    }

Extracting the offset calculation to the vertex processor means it is calculated only 4 times (once per vertex of the quad):

    void stencil (float4 position : POSITION,
                  out float4 center : HPOS,
                  out float2 left   : TEXCOORD0,
                  out float2 right  : TEXCOORD1,
                  out float2 bottom : TEXCOORD2,
                  out float2 top    : TEXCOORD3,
                  uniform float4x4 ModelViewMatrix)
    {
        center = mul(ModelViewMatrix, position);
        // offsets computed once per vertex, interpolated by the rasterizer
        left   = center.xy - float2(1,0);
        right  = center.xy + float2(1,0);
        bottom = center.xy - float2(0,1);
        top    = center.xy + float2(0,1);
    }

The fragment program then receives the neighbor coordinates as interpolated input variables:

    float jacobi (float2 center : WPOS,
                  uniform samplerRECT x,
                  uniform samplerRECT b,
                  in float2 left   : TEXCOORD0,
                  in float2 right  : TEXCOORD1,
                  in float2 bottom : TEXCOORD2,
                  in float2 top    : TEXCOORD3,
                  uniform float one_over_h) : COLOR
    {
        float x_center = texRECT(x, center);
        float x_left   = texRECT(x, left);
        float x_right  = texRECT(x, right);
        float x_bottom = texRECT(x, bottom);
        float x_top    = texRECT(x, top);
        float rhs      = texRECT(b, center);

        float Ax = one_over_h * (4.0 * x_center - x_left - x_right - x_bottom - x_top);
        float inv_diag = one_over_h / 4.0;
        return x_center + inv_diag * (rhs - Ax);
    }

15 Interpolation Summary
Powerful tool
– Applicable to everything that varies linearly over some region
– High-level view: separate computation from lookup stencils
Up to eight float4 interpolants available
– On current hardware
– Though using all 32 values might hurt in some applications
Squeeze data into float4s
– In this example, use 2 float4 instead of 4 float2

16 Second Point of Attack
[pipeline diagram as on slide 11]
Target: on-chip bandwidth.

17 Arithmetic Intensity
Analysis of banded MatVec y = Ax, preassembled
– Reads per component of y: 9 times into array x, once into each band
– Operations per component of y: 9 multiply-adds (18 ops)
Arithmetic intensity: operations per memory access (computation / bandwidth)
– Here: 18 reads, 18 ops, so 18/18 = 1 (see the sketch below)
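A plain-C sketch of the preassembled banded MatVec makes the counts visible; the band layout and names are illustrative assumptions:

    /* y = A*x for a matrix with 9 bands (3x3 stencil on a 2D grid).
       band[k] holds the k-th band, off[k] its column offset.
       Per component of y: 9 reads of x plus 9 band reads = 18 reads,
       and 9 multiply-adds = 18 ops, i.e. arithmetic intensity 18/18 = 1. */
    void banded_matvec(int n, int nbands, const double *band[],
                       const int off[], const double *x, double *y)
    {
        for (int i = 0; i < n; ++i) {
            double sum = 0.0;
            for (int k = 0; k < nbands; ++k) {
                int j = i + off[k];
                if (j >= 0 && j < n)
                    sum += band[k][i] * x[j];   /* one multiply-add, two reads */
            }
            y[i] = sum;
        }
    }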

18 Precompute vs. Recompute
Case 1: Application is compute-bound
– High arithmetic intensity
– Trade computation for memory access
– Precompute as many values as possible and read them in from additional input arrays
Try to maintain spatial coherence
– Otherwise, performance will degrade
Rule of thumb
– Roughly 7 basic arithmetic ops per memory access are needed to hide latency
– Do not precompute x^2 if you read in x anyway

19 Precompute vs. Recompute
Case 2: Application is bandwidth-bound
– Trade memory access for additional computation
Example: matrix assembly and matrix-vector multiplication
– On-the-fly: recompute all entries in each MatVec
– Lowest memory requirement
– Good for simple entries or if the matrix is seldom used

20 Precompute vs. Recompute
– Partial assembly: precompute only a few intermediate values
– Allows balancing computation against bandwidth requirements
– A good choice of precomputed results also keeps memory requirements low
– Full assembly: precompute all entries of A, read the entries during each MatVec
– Good if other computations hide the bandwidth problem; otherwise try partial assembly
A sketch contrasting on-the-fly recomputation with full assembly follows below.
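As a rough C contrast with the preassembled banded_matvec above, the on-the-fly variant recomputes every entry instead of reading it; coefficient() is a hypothetical stand-in for the application's analytic entry formula:

    /* Hypothetical entry formula; here simply stencil-like constant weights. */
    static double coefficient(int i, int k) { (void)i; return (k == 4) ? 8.0 : -1.0; }

    /* On-the-fly MatVec: no matrix storage, maximum recomputation.
       Per component of y: 9 reads of x only, plus the entry arithmetic. */
    void matvec_on_the_fly(int n, int nbands, const int off[],
                           const double *x, double *y)
    {
        for (int i = 0; i < n; ++i) {
            double sum = 0.0;
            for (int k = 0; k < nbands; ++k) {
                int j = i + off[k];
                if (j >= 0 && j < n)
                    sum += coefficient(i, k) * x[j];   /* computed, not read */
            }
            y[i] = sum;
        }
    }

Partial assembly sits between the two: store a few intermediate values per entry and finish the computation inside the kernel.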

21 Third Point of Attack
[pipeline diagram as on slide 11]
Target: off-chip bandwidth.

22 CPU - GPU Barrier
Transfer is a potential bottleneck
– Often less than 1 GB/s via the PCIe bus
– Readback to CPU memory always implies a global synchronization point (pipeline flush)
Easy case
– The application directly visualizes the results
– Only the initial data needs to be transferred to the GPU, in a preprocessing step
– No readback required
– Examples: interactive visualization and fluid solvers (cf. Session 1)

23 CPU - GPU Barrier
Co-processor style computing
– Readback to the host is required
– Don't want the host or the GPU idle: maximize throughput
Interleave computation with transfers (a sketch follows below)
– Apply some partitioning / domain decomposition
– Simultaneously: prepare and transfer the initial data for sub-problem i+1, compute on sub-problem i, and read back and postprocess the result of sub-problem i-1 (causes a pipeline flush, but cannot be avoided)
– Good if the input data is much larger than the output data
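A schematic C sketch of this software pipeline; prepare_and_upload, launch_compute and read_back are hypothetical placeholders for the application's transfer and kernel invocations, not a real API:

    /* Hypothetical per-sub-problem stages (stubs). */
    void prepare_and_upload(int i) { (void)i; /* CPU -> GPU transfer for sub-problem i  */ }
    void launch_compute(int i)     { (void)i; /* asynchronous GPU work on sub-problem i */ }
    void read_back(int i)          { (void)i; /* GPU -> CPU readback, sync point        */ }

    /* Pipeline over S sub-problems: while sub-problem t-1 computes,
       sub-problem t is uploaded and sub-problem t-2 is read back. */
    void pipelined_solve(int S)
    {
        for (int t = 0; t < S + 2; ++t) {
            if (t < S)               prepare_and_upload(t);
            if (t >= 1 && t - 1 < S) launch_compute(t - 1);
            if (t >= 2 && t - 2 < S) read_back(t - 2);
        }
    }

The overlap comes from the upload and kernel launch returning immediately, while only the readback blocks.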

24 Fourth Point of Attack
[pipeline diagram as on slide 11]
Target: overhead.

25 Playing it Big
Typical performance behaviour
– CPU: Opteron 252, highly optimized cache-aware code
– GPU: GeForce 7800 GTX, straightforward code, including transfers
– Kernels: saxpy, dot, MatVec
The CPU wins
– for small problems
– that fit in cache
The GPU wins
– for large problems
– where overhead and transfers can be hidden

26 Playing it Big
Nice analogy: memory hierarchies
– GPU memory is fast, comparable to in-cache access on CPUs
– Consider offloading to the GPU as manual prefetching
– Always choose the type of memory that is fastest for the given chunk of data
Lots of parallel threads in flight
– Need lots of data elements to compute on
– Otherwise, the PEs won't be saturated
Worst case and best case
– Worst: offload each small-N saxpy individually to the GPU
– Best: offload whole solvers for large N to the GPU (e.g. a full multigrid cycle)

27 Fifth Point of Attack
[pipeline diagram as on slide 11]
Target: vectorization (instruction-level parallelism).

28 Vectorization
GPUs are designed to process 4-tuples of data
– Same cost to compute on four float values as on one
– Take advantage of co-issuing over the four components
Swizzles
– Swizzling the components of a 4-tuple is free (no MOVs)
– Example: data = (1,2,3,4) yields data.zzyx = (3,3,2,1)
– Very useful for index math and for storing values in float4s
Problem
– Mapping the data into RGBA is a challenging task
– Very problem-specific, no rules of thumb

29 Conclusions
Be aware of potential bottlenecks
– Know the hardware capabilities
– Analyze arithmetic intensity
– Check memory access patterns
– Run existing benchmarks, e.g. GPUBench (Stanford)
– Minimize the number of pipeline stalls
Adapt algorithms
– Try to work around bottlenecks
– Reformulate the algorithm to exploit the hardware more efficiently