GPU Computation Strategies & Tricks Ian Buck Stanford University.

Presentation transcript:

DirectX or OpenGL?

DirectX
  + Render to Texture
      SetRenderTarget()
      No float targets on NV3x
  + Write once run anywhere
  + DBMON
  – Short programs
      Only 96 instructions required
      ps_2_a compiler target allows long programs on NV3x
  – Readback is slow!
      ~50 MB/sec

OpenGL
  + 0 to N texture addressing
      GL_TEXTURE_RECTANGLE_EXT
  + Readback is fast
  – Render to Texture not finalized
      Pbuffer rendering can be slow
      SuperBuffers
      GL_EXT_render_target
  – Specialized float formats for ATI and NV
      No ARB standard for creating float Pbuffer
      ATI float2: Red and Alpha
      NV float2: Red and Green

ATI Radeon 9800XT or NVIDIA GeForce 5900 Ultra?

Instruction Timings
[Chart not preserved in this transcript.]

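Floating Point Precision

NVIDIA FP32
  – s23e8 (integers are exact up to 16,777,216; the first integer that cannot be represented is 16,777,217)
ATI 24-bit float
  – s16e7 (exact up to 131,072; first unrepresentable integer: 131,073)
NVIDIA FP16
  – s10e5 (exact up to 2,048; first unrepresentable integer: 2,049)

Bit fields: sign | exponent | mantissa
value = sign * 1.mantissa * 2^(exponent+bias)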
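The integer limits of the three formats can be checked with a few lines of CPU code. The sketch below is only an illustration: `quantize` is a hypothetical helper written for this note (not a vendor API), and it models mantissa precision only, ignoring exponent range, denormals, and any hardware-specific rounding.

```python
import math

def quantize(x, mantissa_bits):
    """Round x to a float with `mantissa_bits` stored mantissa bits
    (plus the implicit leading 1). Exponent range is ignored."""
    if x == 0:
        return 0.0
    m, e = math.frexp(x)               # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** (mantissa_bits + 1)   # stored bits plus the implicit leading 1
    return math.ldexp(round(m * scale) / scale, e)

# Stored mantissa widths taken from the slide (s23e8, s16e7, s10e5).
FORMATS = {
    "NVIDIA FP32 (s23e8)": 23,
    "ATI 24-bit (s16e7)": 16,
    "NVIDIA FP16 (s10e5)": 10,
}

for name, bits in FORMATS.items():
    limit = 2 ** (bits + 1)            # every integer up to here is exact
    ok = quantize(limit, bits) == limit
    ok_next = quantize(limit + 1, bits) == limit + 1
    print(f"{name}: {limit} exact={ok}, {limit + 1} exact={ok_next}")
```

For each format the first value is reported as exact and the next integer is not, which matches the three numbers on the slide.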

Floating Point Precision

Common Mistake
  – Pack large 1D array in 2D texture
  – Compute 1D address in shader
  – Convert 1D address into 2D

FP precision will leave unaddressable texels!
First integer address that cannot be represented:
  NVIDIA FP32: 16,777,217
  ATI 24-bit float: 131,073
  NVIDIA FP16: 2,049
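To see how many texels become unreachable, the following sketch emulates the mistake on the CPU. It assumes the 1D address is held in a float with a given mantissa width and then converted to (x, y) exactly; a real shader would also do the division and modulo in limited precision, so this is an optimistic model. `quantize` and `reachable_texels` are hypothetical helper names.

```python
import math

def quantize(x, mantissa_bits):
    """Round x to `mantissa_bits` stored mantissa bits (precision model only)."""
    if x == 0:
        return 0.0
    m, e = math.frexp(x)
    scale = 2 ** (mantissa_bits + 1)
    return math.ldexp(round(m * scale) / scale, e)

def reachable_texels(width, height, mantissa_bits):
    """Count texels a shader can address when the 1D index is stored
    in a low-precision float before being converted to 2D."""
    total = width * height
    hit = set()
    for i in range(total):
        addr = quantize(float(i), mantissa_bits)   # address as the shader holds it
        if addr < total:                           # rounding can push it off the end
            hit.add((int(addr) % width, int(addr) // width))
    return len(hit), total

hit, total = reachable_texels(64, 64, mantissa_bits=10)   # FP16-style address
print(f"{hit} of {total} texels addressable, {total - hit} unreachable")
```

With a 64x64 texture and a 10-bit mantissa, every odd address above 2,048 collapses onto a neighbour, so a quarter of the texture can never be read or written through the computed address.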

Multiple Outputs

Hardware supported multiple outputs
  – Not as fast as you think…

Num Outputs | Net Bandwidth (GB/sec)
[Four rows measured on the ATI 9800XT; the values are not preserved in this transcript.]

Multiple Outputs

Software solution
  – Let cgc or fxc do dead code elimination
  – Can be a good idea if shader is separable

    kernel void foo (float3 a<>, float3 b<>, …, out float3 x<>, out float3 y<>)

    kernel void foo1(float3 a<>, float3 b<>, …, out float3 x<>)
    kernel void foo2(float3 a<>, float3 b<>, …, out float3 y<>)

Scatter Techniques

Problem: a[i] = p
  – Indirect write
  – Can’t set the x,y of fragment in pixel shader
  – Also want to do a[i] += p

Scatter Techniques

Solution 1:
  – Sort & Search
      Shader outputs destination address and data
      Bitonic sort based on address
      Run binary search shader over destination buffer
        – Each fragment searches for source data

See “Sorting and Searching” course notes
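The three passes of the sort-and-search approach can be emulated on the CPU to show why it needs only gather-style reads. This is a sketch under stated assumptions: Python's `sorted()` stands in for the GPU bitonic sort, destination addresses are assumed unique, and `scatter_by_sort_and_search` is a hypothetical name.

```python
from bisect import bisect_left

def scatter_by_sort_and_search(writes, size, default=0.0):
    """Emulate a[i] = p using only a sort and per-element searches.
    writes: list of (address, value) pairs, as produced by the first pass."""
    # Pass 2: sort by destination address (stand-in for a bitonic sort).
    ordered = sorted(writes, key=lambda w: w[0])
    addresses = [addr for addr, _ in ordered]

    # Pass 3: each destination element binary-searches for a write aimed at it.
    out = []
    for dest in range(size):
        pos = bisect_left(addresses, dest)
        if pos < len(addresses) and addresses[pos] == dest:
            out.append(ordered[pos][1])
        else:
            out.append(default)        # nothing was scattered to this element
    return out

print(scatter_by_sort_and_search([(5, 2.5), (1, 7.0), (3, -1.0)], size=8))
# [0.0, 7.0, 0.0, -1.0, 0.0, 2.5, 0.0, 0.0]
```

Each output element does its own lookup, which is the role played by one fragment in the binary search shader.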

Scatter Techniques

Solution 2:
  – Render points
      Use vertex shader to set destination
      Or just read back the data and reissue

Scatter Techniques

Solution 3:
  – Vertex Textures
      Render data and address to texture
      Issue points, set point x,y in vertex shader using address texture
      Requires texld instruction in vertex program

Conditional Mask

How to efficiently implement if (a) then c=b

Kill instruction or LRP c, a, b, c
  – Executes all conditional code
Using early Z-kill
  – Set Zbuffer equal to conditional
  – Z test can prevent shader execution
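The LRP form of the conditional is plain predication: both sides are evaluated and the mask picks one. A minimal CPU sketch, assuming the usual definition LRP(a, b, c) = a*b + (1 - a)*c and a mask that holds only 0.0 or 1.0 (the function names are illustrative):

```python
def lrp(a, b, c):
    """Linear interpolation as computed by an LRP instruction: a*b + (1 - a)*c."""
    return a * b + (1.0 - a) * c

def masked_assign(mask, b, c):
    """Per-element 'if (a) then c = b', with no branch taken."""
    return [lrp(a, bi, ci) for a, bi, ci in zip(mask, b, c)]

mask = [1.0, 0.0, 1.0, 0.0]
b = [10.0, 20.0, 30.0, 40.0]
c = [1.0, 2.0, 3.0, 4.0]
print(masked_assign(mask, b, c))   # [10.0, 2.0, 30.0, 4.0]
```

The work to produce `b` is paid for every element, including those where the mask is 0, which is why the slide notes that all conditional code executes.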

Conditional Mask

Using early Z-kill
  – Z-kill operates at 4x4 block resolution
  – Good only if locality in conditional
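The locality requirement can be made concrete with a small model. The sketch below assumes that a 4x4 block is skipped only when every fragment in it fails the conditional; that is a simplification for illustration, not a description of any specific hardware. `skipped_fraction` is a hypothetical helper.

```python
def skipped_fraction(mask, block=4):
    """Fraction of fragments whose shader work is skipped when rejection
    happens only for blocks in which no fragment is active.
    mask[y][x] is True where the conditional holds (shader must run)."""
    h, w = len(mask), len(mask[0])
    skipped = 0
    for by in range(0, h, block):
        for bx in range(0, w, block):
            tile = [mask[y][x]
                    for y in range(by, min(by + block, h))
                    for x in range(bx, min(bx + block, w))]
            if not any(tile):
                skipped += len(tile)
    return skipped / (w * h)

N = 16
clustered = [[x < N // 2 for x in range(N)] for y in range(N)]        # left half active
checker = [[(x + y) % 2 == 0 for x in range(N)] for y in range(N)]    # alternating pixels
print(skipped_fraction(clustered), skipped_fraction(checker))         # 0.5 0.0
```

Both masks leave half of the fragments inactive, but only the clustered one lets the block-level test skip any work.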

Optimizing Execution

Two methods for GPGPU shader execution

Full-screen quad:

    glBegin(GL_QUADS);
    glVertex2f(left, bottom);
    glVertex2f(right, bottom);
    glVertex2f(right, top);
    glVertex2f(left, top);
    glEnd();

Single oversized triangle, with the viewport set to the output size:

    glViewport(0, 0, width, height);
    glBegin(GL_TRIANGLES);
    glVertex2f(0, 0);
    glVertex2f(width*2, 0);
    glVertex2f(0, height*2);
    glEnd();

    Faster: higher observed bandwidth
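The oversized triangle works because its hypotenuse passes through the far corner of the viewport, so every pixel center still falls inside it. A quick CPU check of that geometry, with hypothetical helper names and coordinates in pixels:

```python
def inside_triangle(px, py, width, height):
    """True if (px, py) lies in the triangle (0,0), (2*width,0), (0,2*height)."""
    return px >= 0 and py >= 0 and px / (2 * width) + py / (2 * height) <= 1

def covers_viewport(width, height):
    """Test every pixel center of a width x height viewport."""
    return all(inside_triangle(x + 0.5, y + 0.5, width, height)
               for y in range(height)
               for x in range(width))

print(covers_viewport(64, 48))   # True
```

Everything outside the viewport is clipped, so the triangle shades the same set of pixels as the quad.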

Performance Issues

Peak GFLOPS
[Chart not preserved in this transcript.]

Performance Issues

NV3x Register Penalty

The more registers used in a shader, the slower the shader executes
  – 3-4 R: x2 slower
  – 5-6 R: x3 slower
  – 7-8 R: x4 slower
  – 9-12 R: x6 slower
  – 13-16 R: x8 slower
  – 17-24 R: x12 slower
  – 25-32 R: x16 slower

Compiler / driver will try to minimize register usage.

General Rule: The more state in your program, the slower the execution

Performance Issues

Floating Point Texture Bandwidth

Observed Results:
  – GeForce 5900 Ultra
      Cache: [value missing in transcript] GB/sec
      Sequential: 4.40 GB/sec
      Random: 0.76 GB/sec
  – ATI 9800 XT (24-bit)
      Cache: 9.15 GB/sec
      Sequential: 5.55 GB/sec
      Random: 1.80 GB/sec

Big Penalty for Random Access!

Performance Issues

WinXP Float4 Download and Readback
  – NVIDIA
      1215 MB/sec texture download
      221 MB/sec glReadPixels rate
  – ATI
      926 MB/sec texture download
      180 MB/sec glReadPixels rate

Readback should be faster!
  680 MB/sec ATI Linux Readback
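These rates translate directly into per-frame cost. The sketch below estimates how long it takes to move one float4 texture at each measured rate. It assumes 4 bytes per channel and 1 MB = 10^6 bytes (the slide does not say which definition of MB was used), and the 512x512 size is an arbitrary example.

```python
def transfer_ms(width, height, rate_mb_per_s, channels=4, bytes_per_channel=4):
    """Milliseconds to move a width x height float texture at the given rate.
    Assumes 1 MB = 10**6 bytes."""
    size_bytes = width * height * channels * bytes_per_channel
    return size_bytes / (rate_mb_per_s * 1e6) * 1e3

# Rates in MB/sec, taken from the slide.
RATES = {
    "NVIDIA download": 1215,
    "NVIDIA glReadPixels": 221,
    "ATI download": 926,
    "ATI glReadPixels": 180,
    "ATI Linux readback": 680,
}

for name, rate in RATES.items():
    print(f"{name}: {transfer_ms(512, 512, rate):.1f} ms per 512x512 float4 texture")
```

The gap between the download and readback rows is the asymmetry the slide is pointing at.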