GPU Program Optimization Cliff Woolley University of Virginia / NVIDIA.

Overview
- Data Parallel Computing
- Computational Frequency
- Profiling and Load Balancing

Data Parallel Computing

Data Parallel Computing
- Instruction-Level Parallelism
- Data-Level Parallelism

A really naïve shader

frag2frame Smooth(vert2frag IN,
                  uniform samplerRECT Source   : texunit0,
                  uniform samplerRECT Operator : texunit1,
                  uniform samplerRECT Boundary : texunit2,
                  uniform float4 params)
{
    frag2frame OUT;
    float2 center = IN.TexCoord0.xy;
    float4 U = f4texRECT(Source, center);

    // Calculate Red-Black (odd-even) masks
    float2 intpart;
    float2 place = floor(1.0f - modf(round(center + float2(0.5f, 0.5f)) / 2.0f, intpart));
    float2 mask = float2((1.0f-place.x) * (1.0f-place.y), place.x * place.y);

    if (((mask.x + mask.y) && params.y) || (!(mask.x + mask.y) && !params.y))
    {
        float2 offset = float2(params.x*center.x - 0.5f*(params.x-1.0f),
                               params.x*center.y - 0.5f*(params.x-1.0f));
        float4 neighbor = float4(center.x - 1.0f, center.x + 1.0f,
                                 center.y - 1.0f, center.y + 1.0f);
        float central = -2.0f*(O.x + O.y);
        float poisson = ((params.x*params.x)*U.z +
                         (-O.x * f1texRECT(Source, float2(neighbor.x, center.y)) +
                          -O.x * f1texRECT(Source, float2(neighbor.y, center.y)) +
                          -O.y * f1texRECT(Source, float2(center.x, neighbor.z)) +
                          -O.z * f1texRECT(Source, float2(center.x, neighbor.w)))) / O.w;
        OUT.COL.x = poisson;
    }
    return OUT;
}

Instruction-Level Parallelism

float2 offset = float2(params.x*center.x - 0.5f*(params.x-1.0f),
                       params.x*center.y - 0.5f*(params.x-1.0f));
float4 neighbor = float4(center.x - 1.0f, center.x + 1.0f,
                         center.y - 1.0f, center.y + 1.0f);

Instruction-Level Parallelism

float2 offset = center.xy - 0.5f;
offset = offset * params.xx + 0.5f;  // MADR is cool too: one cycle, two flops
float4 neighbor = center.xxyy + float4(-1.0f, 1.0f, -1.0f, 1.0f);
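The vectorized form relies on the identity p*c - 0.5*(p - 1) = p*(c - 0.5) + 0.5, which turns each component into one subtract plus one multiply-add. A minimal sketch in plain Python (standing in for Cg vector math; the function names are hypothetical and not part of the original shader) that checks the two forms agree:

```python
# Sketch only: plain Python standing in for Cg vector math.
# Function names are hypothetical; they are not part of the original shader.

def offset_scalar(center, p):
    """Per-component form from the naive shader: p*c - 0.5*(p - 1)."""
    return tuple(p * c - 0.5 * (p - 1.0) for c in center)

def offset_vector(center, p):
    """Refactored form: a vector subtract followed by one multiply-add."""
    shifted = tuple(c - 0.5 for c in center)      # center.xy - 0.5
    return tuple(s * p + 0.5 for s in shifted)    # offset * params.xx + 0.5

def neighbor_vector(center):
    """center.xxyy + (-1, 1, -1, 1): one 4-wide add instead of four scalar adds."""
    x, y = center
    swizzled = (x, x, y, y)
    deltas = (-1.0, 1.0, -1.0, 1.0)
    return tuple(s + d for s, d in zip(swizzled, deltas))

if __name__ == "__main__":
    c = (10.5, 3.5)
    print(offset_scalar(c, 2.0), offset_vector(c, 2.0))
    print(neighbor_vector(c))
```

The point of the rewrite is not fewer values but fewer instructions: the same arithmetic is expressed as whole-vector operations the hardware can issue at once.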

Data-Level Parallelism
- Pack scalar data into RGBA in texture memory
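As a rough illustration of packing, the sketch below groups a flat array of scalars into four-channel texels so that one 4-wide operation handles four values. The layout (four consecutive scalars per texel) and all function names are assumptions; the slide does not say which layout was used.

```python
# Sketch only: one possible packing layout (four consecutive scalars per texel).
# Other layouts exist; the slide does not say which one was used.

def pack_rgba(values, pad=0.0):
    """Group a flat list of scalars into (r, g, b, a) texels, padding the tail."""
    padded = list(values) + [pad] * (-len(values) % 4)
    return [tuple(padded[i:i + 4]) for i in range(0, len(padded), 4)]

def scale_texels(texels, k):
    """Stand-in for a fragment program: one 4-wide multiply per texel."""
    return [tuple(k * ch for ch in texel) for texel in texels]

def unpack_rgba(texels, count):
    """Flatten texels back to scalars and drop the padding."""
    flat = [ch for texel in texels for ch in texel]
    return flat[:count]

if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    texels = pack_rgba(data)
    print(len(texels), unpack_rgba(scale_texels(texels, 2.0), len(data)))
```

Six scalars need only two "fragments" here instead of six, which is the whole appeal of packing.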

Computational Frequency

Think of your CPU program and your vertex and fragment programs as different levels of nested looping.

...
foreach tri in triangles {
    // run the vertex program on each vertex
    v1 = process_vertex(tri.vertex1);
    v2 = process_vertex(tri.vertex2);
    v3 = process_vertex(tri.vertex3);

    // assemble the vertices into a triangle
    assembledtriangle = setup_tri(v1, v2, v3);

    // rasterize the assembled triangle into [0..many] fragments
    fragments = rasterize(assembledtriangle);

    // run the fragment program on each fragment
    foreach frag in fragments {
        outbuffer[frag.position] = process_fragment(frag);
    }
}
...
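To make the difference in frequency concrete, here is a toy version of the loop nest in Python that counts how often each level runs. It is a sketch under loud assumptions: the "rasterizer" simply emits every pixel in a triangle's bounding box (so it overcounts compared with real coverage), and all names are illustrative.

```python
# Sketch only: a toy pipeline that counts how often each stage runs.
# The "rasterizer" just emits every pixel in the triangle's bounding box,
# which is an assumption made to keep the example short.
from collections import Counter

calls = Counter()

def process_vertex(v):
    calls["vertex"] += 1
    return v

def rasterize(v1, v2, v3):
    xs = [v[0] for v in (v1, v2, v3)]
    ys = [v[1] for v in (v1, v2, v3)]
    return [(x, y) for y in range(min(ys), max(ys)) for x in range(min(xs), max(xs))]

def process_fragment(frag):
    calls["fragment"] += 1
    return 1.0

def draw(triangles):
    calls["cpu"] += 1                        # outermost level: once per draw
    outbuffer = {}
    for tri in triangles:
        v1, v2, v3 = (process_vertex(v) for v in tri)
        for frag in rasterize(v1, v2, v3):
            outbuffer[frag] = process_fragment(frag)
    return outbuffer

# Two triangles forming a 64x64 quad.
quad = [((0, 0), (64, 0), (64, 64)), ((0, 0), (64, 64), (0, 64))]
draw(quad)
print(dict(calls))
```

Even in this toy, the fragment level runs thousands of times for every vertex-level call, so an instruction saved there is worth far more than one saved further out.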

Computational Frequency
- Branches
  - Avoid these, especially in the inner loop (i.e., the fragment program).

Computational Frequency
- Static branch resolution
  - write several variants of each fragment program to handle boundary cases
  - eliminates conditionals in the fragment program
  - equivalent to avoiding CPU inner-loop branching
(Figure: case 1, no boundaries; case 2, accounts for boundaries)
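A minimal sketch of static branch resolution, using a 1D three-point average in Python as a stand-in for the fragment program. The kernel names and the clamp-to-edge boundary rule are assumptions for illustration; the idea is only that the caller chooses a kernel per region instead of the kernel testing for the boundary on every element.

```python
# Sketch only: a 1D three-point average standing in for the fragment program.
# Kernel names and the clamp-to-edge boundary rule are assumptions.

def smooth_branchy(src, i):
    """Single kernel: tests for the boundary on every element."""
    left = src[i - 1] if i > 0 else src[i]
    right = src[i + 1] if i < len(src) - 1 else src[i]
    return (left + src[i] + right) / 3.0

def smooth_interior(src, i):
    """Case 1: no boundaries, so no conditionals."""
    return (src[i - 1] + src[i] + src[i + 1]) / 3.0

def smooth_boundary(src, i):
    """Case 2: accounts for boundaries by clamping neighbor indices."""
    last = len(src) - 1
    return (src[max(i - 1, 0)] + src[i] + src[min(i + 1, last)]) / 3.0

def run_static(src):
    """The 'CPU' picks the kernel per region, like drawing separate geometry."""
    n = len(src)
    out = [0.0] * n
    for i in range(1, n - 1):        # interior region: branch-free kernel
        out[i] = smooth_interior(src, i)
    for i in (0, n - 1):             # boundary region: boundary-aware kernel
        out[i] = smooth_boundary(src, i)
    return out

def run_branchy(src):
    return [smooth_branchy(src, i) for i in range(len(src))]

if __name__ == "__main__":
    data = [1.0, 4.0, 9.0, 16.0, 25.0]
    print(run_static(data) == run_branchy(data))
```

Both paths produce the same output; the static version just moves the decision out of the inner loop, where it runs once per region rather than once per element.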

Computational Frequency
- Dynamic branching
  - Dynamic branching on NV4x and G70 hardware is better than "branching" with NV3x
  - But still, there is a branch penalty
  - Good performance requires spatial coherence in branching
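One way to picture why spatial coherence matters is a simplified cost model: assume fragments are evaluated in square tiles and a tile pays for both sides of a branch whenever it contains both outcomes. This model and the 4x4 tile size are assumptions made up for illustration, not a description of any specific GPU. The Python sketch counts such "divergent" tiles for two branch patterns.

```python
# Sketch only: assumes the hardware evaluates fragments in square tiles and pays
# for both sides of a branch whenever a tile contains both outcomes.
# The 4x4 tile size is a made-up figure for illustration, not a hardware spec.

def divergent_tiles(mask, tile=4):
    """Count tiles in which the branch condition is not uniform."""
    height, width = len(mask), len(mask[0])
    count = 0
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            values = {mask[y][x]
                      for y in range(ty, min(ty + tile, height))
                      for x in range(tx, min(tx + tile, width))}
            if len(values) > 1:
                count += 1
    return count

N = 16
# Incoherent: the condition flips at every pixel.
checkerboard = [[(x + y) % 2 == 0 for x in range(N)] for y in range(N)]
# Coherent: the condition is constant over large contiguous regions.
left_half = [[x < N // 2 for x in range(N)] for y in range(N)]

if __name__ == "__main__":
    print(divergent_tiles(checkerboard), divergent_tiles(left_half))
```

Under this model, every tile diverges for the checkerboard pattern and none do for the half-and-half pattern, even though both take each branch for exactly half of the fragments.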

Computational Frequency
- Branches
  - Ian Buck will talk more about various branching techniques after lunch

Computational Frequency
- Precompute

Computational Frequency
- Precompute texture coordinates
  - Take advantage of under-utilized hardware
    - vertex processor
    - rasterizer
  - Reduce instruction count at the per-fragment level
  - Avoid lookups being treated as texture indirections

Computational Frequency
- Precompute texture coordinates

vert2frag smooth(app2vert IN,
                 uniform float4x4 xform : C0,
                 uniform float2 srcoffset,
                 uniform float size)
{
    vert2frag OUT;
    OUT.position = mul(xform, IN.position);
    OUT.center = IN.center;
    OUT.redblack = IN.center - srcoffset;
    OUT.operator = size*(OUT.redblack - 0.5f) + 0.5f;
    OUT.hneighbor = IN.center.xxyx + float4(-1.0f, 1.0f, 0.0f, 0.0f);
    OUT.vneighbor = IN.center.xyyy + float4(0.0f, -1.0f, 1.0f, 0.0f);
    return OUT;
}
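Moving the neighbor coordinates into the vertex program works because they are affine in the fragment position, so linear interpolation of the per-vertex values reproduces what each fragment would have computed. The Python sketch below shows this in 1D; the interpolation function stands in for the rasterizer and all names are hypothetical.

```python
# Sketch only: 1D linear interpolation standing in for the rasterizer.
# Names are hypothetical; real hardware interpolates in 2D across a triangle.

def lerp(a, b, t):
    return a + (b - a) * t

def per_fragment(centers):
    """Naive approach: every fragment computes its own left-neighbor coordinate."""
    return [c - 1.0 for c in centers]

def per_vertex(c_start, c_end, count):
    """Compute the neighbor coordinate at the two endpoints only, then interpolate."""
    n_start, n_end = c_start - 1.0, c_end - 1.0      # "vertex program": runs twice
    return [lerp(n_start, n_end, i / (count - 1)) for i in range(count)]

centers = [0.5 + i for i in range(9)]                # fragment centers 0.5 .. 8.5

if __name__ == "__main__":
    print(per_fragment(centers))
    print(per_vertex(centers[0], centers[-1], len(centers)))
```

The subtraction runs twice instead of nine times here; on a full-screen quad the same trade replaces one operation per fragment with one per vertex.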

Computational Frequency
- Precomputing other values
  - Same deal! Factor other computations out:
    - Anything that varies linearly across the geometry
    - Anything that has a complex value computed per-vertex
    - Anything that is uniform across the geometry

Computational Frequency
- Precomputing on the CPU
  - Use glMultiTexCoord4f() creatively
  - Extract as much uniformity from uniform parameters as you can
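As a small illustration of extracting uniformity, consider a term like params.x*params.x from the naive shader: its inputs never change during a pass, so it can be computed once on the CPU and passed in as the uniform itself. The Python sketch below uses hypothetical kernel functions, with h playing the role of params.x.

```python
# Sketch only: 'h' plays the role of params.x in the naive shader, and the
# kernel functions are hypothetical stand-ins for a fragment program.

def kernel_naive(u, h):
    """Recomputes h*h for every fragment even though h never changes."""
    return (h * h) * u

def kernel_hoisted(u, h_squared):
    """Receives the precomputed product as its uniform parameter."""
    return h_squared * u

def run_naive(values, h):
    return [kernel_naive(u, h) for u in values]

def run_hoisted(values, h):
    h_squared = h * h              # computed once on the CPU, outside the loop
    return [kernel_hoisted(u, h_squared) for u in values]

if __name__ == "__main__":
    print(run_naive([1.0, 2.0, 3.0], 0.5), run_hoisted([1.0, 2.0, 3.0], 0.5))
```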

Computational Frequency
- Precomputed lookup tables

// Calculate Red-Black (odd-even) masks
float2 intpart;
float2 place = floor(1.0f - modf(round(center + 0.5f) / 2.0f, intpart));
float2 mask = float2((1.0f-place.x) * (1.0f-place.y),
                     place.x * place.y);
if (((mask.x + mask.y) && params.y) ||
    (!(mask.x + mask.y) && !params.y))
{

Computational Frequency
- Precomputed lookup tables

half4 mask = f4texRECT(RedBlack, IN.redblack);
/*
 * mask.x and mask.w tell whether IN.center.x and IN.center.y
 * are both odd or both even, respectively. either of these two
 * conditions indicates that the fragment is red. params.x==1
 * selects red; params.y==1 selects black.
 */
if (dot(mask,params.xyyx)) {
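The lookup-table version can be sketched in Python as a 2x2 table of one-hot parity masks plus a dot product against the selector. The table contents and the meaning of the .y and .z channels are assumptions chosen to be consistent with the shader comment (.x is both odd, .w is both even); the actual RedBlack texture is not shown on the slide.

```python
# Sketch only: the 2x2 table contents and channel order are assumptions chosen
# to match the comment in the shader (x = both odd, w = both even).

def build_redblack_table():
    """table[y % 2][x % 2] -> one-hot (x, y, z, w) parity mask."""
    table = [[None, None], [None, None]]
    for py in (0, 1):
        for px in (0, 1):
            table[py][px] = (
                int(px == 1 and py == 1),   # .x: both odd   -> red
                int(px == 1 and py == 0),   # .y: mixed      -> black
                int(px == 0 and py == 1),   # .z: mixed      -> black
                int(px == 0 and py == 0),   # .w: both even  -> red
            )
    return table

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def is_selected(table, x, y, params):
    """params = (select_red, select_black); mirrors dot(mask, params.xyyx)."""
    mask = table[y % 2][x % 2]              # the "texture fetch"
    sel_red, sel_black = params
    return dot(mask, (sel_red, sel_black, sel_black, sel_red)) != 0

TABLE = build_redblack_table()

if __name__ == "__main__":
    print(is_selected(TABLE, 2, 4, (1, 0)), is_selected(TABLE, 2, 3, (1, 0)))
```

The table is built once, so the per-fragment work shrinks to one fetch and one dot product in place of the floor/modf/round arithmetic.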

Computational Frequency
- Precomputed lookup tables
  - Be careful with texture lookups: cache coherence is crucial
  - Use the smallest data types you can get away with to reduce bandwidth consumption
  - Use swizzles or writemasks on tex ops when possible
  - "Computation is cheap; memory accesses are not." ...if you're memory access limited.
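A back-of-the-envelope calculation shows what smaller data types buy. The Python sketch below totals the bytes requested by one full-screen pass for 32-bit versus 16-bit four-channel texels; the grid size and fetch count are illustrative numbers, not measurements, and cache hits are ignored.

```python
# Sketch only: back-of-the-envelope texture traffic for one full-screen pass.
# The grid size and fetch count are illustrative, not measurements.

BYTES_PER_CHANNEL = {"float": 4, "half": 2}   # 32-bit and 16-bit floating point

def bytes_fetched(width, height, fetches_per_fragment, channels, dtype):
    """Total bytes requested from texture memory, ignoring cache hits."""
    texel_bytes = channels * BYTES_PER_CHANNEL[dtype]
    return width * height * fetches_per_fragment * texel_bytes

full = bytes_fetched(1024, 1024, 5, 4, "float")   # five float4 fetches per fragment
small = bytes_fetched(1024, 1024, 5, 4, "half")   # the same fetches as half4

if __name__ == "__main__":
    print(full, small, full // small)
```

Halving the channel width halves the requested traffic, which only matters if the program is actually limited by memory access, as the slide cautions.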

Profiling and Load Balancing

- Software profiling
- GPU pipeline profiling
- GPU load balancing

Profiling and Load Balancing
- Run a standard software profiler!
  - Rational Quantify
  - Intel VTune
  - AMD CodeAnalyst

Profiling and Load Balancing
- GPU Pipeline Profiling
  - This is where it gets tricky.
  - Some tools exist to help you:
    - NVPerfKit (NVIDIA exhibitor tech talk tomorrow morning at 10am in room 404A)
    - NVPerfHUD
    - NVShaderPerf
    - Apple OpenGL Profiler

Profiling and Load Balancing
- GPU Load Balancing
  - This is a whole talk in and of itself
    - e.g., http://developer.nvidia.com/docs/IO/8343/Performance-Optimisation.pdf
  - Be sure to read the NVIDIA GPU Programming Guide
  - Sometimes you can get more hints from third parties than from the vendors themselves

Conclusions

Conclusions
- Get used to thinking in terms of parallel computation
- Understand how frequently each computation will run, and reduce that frequency wherever possible
- Track down bottlenecks in your application, and shift work to other parts of the system that are idle

Questions?

Acknowledgements
- Pat Brown at NVIDIA
- NVIDIA for having given me a job this summer
- Dave Luebke, my advisor
- GPGPU course presenters

See Also
- GPU Gems II, Chapter 35