
GPU Computation Strategies & Tricks Ian Buck Stanford University.


1 GPU Computation Strategies & Tricks Ian Buck Stanford University

2 DirectX or OpenGL?
DirectX
+ Render to Texture
  SetRenderTarget()
  No float targets on NV3x
+ Write once, run anywhere
+ DBMON
– Short programs
  Only 96 instructions required
  ps_2_a compiler target allows long programs on NV3x
– Readback is slow! ~50 MB/sec
OpenGL
+ 0 to N texture addressing
  GL_TEXTURE_RECTANGLE_EXT
+ Readback is fast
– Render to Texture not finalized
  Pbuffer rendering can be slow
  SuperBuffers
  GL_EXT_render_target
– Specialized float formats for ATI and NV
  No ARB standard for creating float Pbuffer
  ATI float2: Red and Alpha
  NV float2: Red and Green

3 ATI Radeon 9800XT or NVIDIA GeForce 5900 Ultra? Instruction Timings

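4 Floating Point Precision
NVIDIA FP32
– s23e8 (integers are exact up to 16,777,216; the first integer that cannot be represented is 16,777,217)
ATI 24-bit float
– s16e7 (exact up to 131,072; first unrepresentable integer: 131,073)
NVIDIA FP16
– s10e5 (exact up to 2,048; first unrepresentable integer: 2,049)
Bit layout: sign, exponent, mantissa
Value: sign * 1.mantissa * 2^(exponent+bias)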
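The integer limits on this slide can be checked with a few lines of Python. This is a minimal sketch, not vendor documentation: the helper names `round_to_mantissa` and `first_unrepresentable` are hypothetical, and it assumes round-to-nearest-even and ignores exponent range.

```python
def round_to_mantissa(n: int, mantissa_bits: int) -> int:
    """Round a non-negative integer to the nearest value that a float with
    `mantissa_bits` stored mantissa bits can hold.

    Assumptions: round-to-nearest-even, unlimited exponent range.
    """
    significant = mantissa_bits + 1           # stored bits plus the implicit leading 1
    shift = n.bit_length() - significant      # number of low bits that do not fit
    if shift <= 0:
        return n                              # fits exactly
    kept = n >> shift
    dropped = n & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    if dropped > half or (dropped == half and (kept & 1)):
        kept += 1                             # round up; ties go to the even value
    return kept << shift


def first_unrepresentable(mantissa_bits: int) -> int:
    """Smallest positive integer that changes when rounded."""
    # Integers with at most mantissa_bits + 1 significant bits always fit,
    # so the search can start at 2^(mantissa_bits + 1).
    n = 1 << (mantissa_bits + 1)
    while round_to_mantissa(n, mantissa_bits) == n:
        n += 1
    return n


for name, bits in [("NVIDIA FP32 (s23e8)", 23),
                   ("ATI 24-bit (s16e7)", 16),
                   ("NVIDIA FP16 (s10e5)", 10)]:
    print(name, first_unrepresentable(bits))
```

Under these assumptions the three results are the numbers quoted on the slide, each equal to 2^(mantissa bits + 1) + 1.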

5 Floating Point Precision
Common Mistake
– Pack large 1D array in 2D texture
– Compute 1D address in shader
– Convert 1D address into 2D
FP precision will leave unaddressable texels!
First 1D address that cannot be represented:
  NVIDIA FP32: 16,777,217
  ATI 24-bit float: 131,073
  NVIDIA FP16: 2,049
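The mistake can be reproduced on the CPU. The sketch below is an illustration only: it uses Python's IEEE half-precision packing (`struct` format `"e"`, also s10e5) as a stand-in for shader FP16, rounds only the 1D index rather than every intermediate value, and the function names and the 64-texel texture width are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (s10e5), standing in for shader FP16."""
    return struct.unpack("e", struct.pack("e", x))[0]


def address_1d_to_2d(index: int, width: int):
    """Convert a 1D index to (x, y) after storing the index in FP16."""
    i = to_fp16(float(index))        # precision is lost here for large indices
    y = int(i // width)
    x = int(i - y * width)
    return x, y


def unaddressable_texels(count: int, width: int):
    """Indices in [0, count) whose texel no FP16-computed address reaches."""
    reached = {address_1d_to_2d(i, width) for i in range(count)}
    return [i for i in range(count) if (i % width, i // width) not in reached]


missing = unaddressable_texels(4096, 64)
print(len(missing), "texels are never addressed; the first is index", missing[0])
```

In this model every odd index from 2,049 upward collapses onto an even neighbour, so half of the texels above 2,048 are never touched.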

6 Multiple Outputs
Hardware supported multiple outputs
– Not as fast as you think…
ATI 9800XT
  Num Outputs   Net Bandwidth
  1             9.25 GB/sec
  2             2.91 GB/sec
  3             3.97 GB/sec
  4             2.76 GB/sec

7 Multiple Outputs
Software solution
– Let cgc or fxc do dead code elimination
– Can be a good idea if shader is separable
Original kernel:
  kernel void foo (float3 a<>, float3 b<>, …, out float3 x<>, out float3 y<>)
Split into one kernel per output:
  kernel void foo1(float3 a<>, float3 b<>, …, out float3 x<>)
  kernel void foo2(float3 a<>, float3 b<>, …, out float3 y<>)
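A CPU-side sketch of what "separable" means here, with the split done by hand rather than by cgc or fxc. The kernel bodies are invented for illustration; only the names `foo`, `foo1` and `foo2` come from the slide.

```python
def foo(a, b):
    """Two-output kernel: computes x and y in one pass."""
    x = tuple(ai + bi for ai, bi in zip(a, b))   # work needed only for x
    y = tuple(ai * bi for ai, bi in zip(a, b))   # work needed only for y
    return x, y


def foo1(a, b):
    """Single-output pass: the code that only feeds y has been removed."""
    return tuple(ai + bi for ai, bi in zip(a, b))


def foo2(a, b):
    """Single-output pass: the code that only feeds x has been removed."""
    return tuple(ai * bi for ai, bi in zip(a, b))


a, b = (1.0, 2.0, 3.0), (4.0, 5.0, 6.0)
print(foo(a, b) == (foo1(a, b), foo2(a, b)))
```

The split pays off when x and y share little computation; if they share most of it, each pass repeats the shared work.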

8 Scatter Techniques
Problem: a[i] = p
– Indirect write
– Can’t set the x,y of fragment in pixel shader
– Also want to do a[i] += p

9 Scatter Techniques
Solution 1:
– Sort & Search
  Shader outputs destination address and data
  Bitonic sort based on address
  Run binary search shader over destination buffer
  – Each fragment searches for source data
  See “Sorting and Searching” course notes
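The sort-and-search idea turns a scatter into a gather, which can be sketched on the CPU as follows. Assumptions: Python's `sorted()` stands in for the GPU bitonic sort, destination addresses are unique, and the function name and `default` fill value are hypothetical.

```python
from bisect import bisect_left


def scatter_sort_search(dest_size, writes, default=0.0):
    """Emulate a[i] = p as a gather.

    writes: (address, data) pairs, as produced by the first shader pass.
    """
    # Sort by destination address (stand-in for the bitonic sort pass).
    ordered = sorted(writes, key=lambda w: w[0])
    addresses = [addr for addr, _ in ordered]

    # One "fragment" per destination element: binary-search for its own
    # address in the sorted list and fetch the matching data if present.
    out = []
    for i in range(dest_size):
        k = bisect_left(addresses, i)
        if k < len(addresses) and addresses[k] == i:
            out.append(ordered[k][1])
        else:
            out.append(default)
    return out


print(scatter_sort_search(6, [(4, 1.5), (1, 2.5), (3, 3.5)]))
```

Each destination element does its own read, so no fragment ever needs to write to an address other than its own.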

10 Scatter Techniques
Solution 2:
– Render points
  Use vertex shader to set destination
  or just read back the data and reissue

11 Scatter Techniques
Solution 3:
– Vertex Textures
  Render data and address to texture
  Issue points, set point x,y in vertex shader using address texture
  Requires texld instruction in vertex program

12 Conditional Mask
How to efficiently implement if (a) then c=b
Kill instruction or LRP c, a, b, c
– Executes all conditional code
Using early Z-kill
– Set Zbuffer equal to conditional
– Z test can prevent shader execution
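The LRP form of the conditional is a linear interpolation used as a select. A minimal sketch, assuming the usual fragment-program semantics (result = a*b + (1-a)*c) and a mask `a` that is exactly 0.0 or 1.0; the Python function `lrp` is only a stand-in for the instruction.

```python
def lrp(a: float, b: float, c: float) -> float:
    """Linear interpolation: returns b when a == 1.0 and c when a == 0.0."""
    return a * b + (1.0 - a) * c


# "if (a) then c = b" written without a branch: both inputs are always
# evaluated, and the mask only selects which one survives.
c = 2.0
c = lrp(1.0, 5.0, c)   # condition true: c takes the value of b
print(c)
c = lrp(0.0, 9.0, c)   # condition false: c keeps its previous value
print(c)
```

This is why the slide notes that the approach executes all conditional code: the cost of computing b is paid for every fragment, whether or not the mask is set.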

13 Conditional Mask
Using early Z-kill
– Z-kill operates at 4x4 block resolution
– Good only if locality in conditional
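Why locality matters can be seen with a simplified model of block-level culling. This sketch assumes a tile is skipped only when every pixel in it fails the test, and that otherwise the whole tile is shaded; real hardware behaviour may differ, and the function name is hypothetical.

```python
def fragments_shaded(mask, block=4):
    """Count fragments that run the shader under block-granularity culling.

    mask: 2D list of bools, True where the conditional holds.
    """
    height, width = len(mask), len(mask[0])
    shaded = 0
    for by in range(0, height, block):
        for bx in range(0, width, block):
            tile = [mask[y][x]
                    for y in range(by, min(by + block, height))
                    for x in range(bx, min(bx + block, width))]
            if any(tile):              # one active pixel keeps the whole tile alive
                shaded += len(tile)
    return shaded


# Both 8x8 masks have 16 active pixels.
coherent = [[x < 4 and y < 4 for x in range(8)] for y in range(8)]
scattered = [[x % 2 == 0 and y % 2 == 0 for x in range(8)] for y in range(8)]
print(fragments_shaded(coherent), fragments_shaded(scattered))
```

With the same number of active pixels, the coherent mask lets three of the four tiles be skipped, while the scattered mask lets none be skipped.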

14 Optimizing Execution
Two methods for GPGPU shader execution
Quad:
    glBegin(GL_QUADS);
    glVertex2f(left, bottom);
    glVertex2f(right, bottom);
    glVertex2f(right, top);
    glVertex2f(left, top);
    glEnd();
Single oversized triangle, clipped by the viewport:
    glViewport(0, 0, width, height);
    glBegin(GL_TRIANGLES);
    glVertex2f(0, 0);
    glVertex2f(width*2, 0);
    glVertex2f(0, height*2);
    glEnd();
    Faster: higher observed bandwidth
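The triangle needs the factor of 2 to cover the whole viewport, which a small geometric check makes concrete. The function name and the `scale` parameter are hypothetical, and coordinates are assumed to map one-to-one onto viewport pixels.

```python
def triangle_covers_viewport(width, height, scale=2):
    """True if triangle (0,0), (scale*width,0), (0,scale*height) contains
    all four corners of the width x height viewport rectangle."""
    def inside(x, y):
        # Inside the right triangle when x >= 0, y >= 0 and the point is on
        # or below the hypotenuse x/(scale*width) + y/(scale*height) <= 1,
        # written here without division.
        return x >= 0 and y >= 0 and x * height + y * width <= scale * width * height

    corners = [(0, 0), (width, 0), (0, height), (width, height)]
    # The triangle is convex, so containing the corners means containing
    # the whole rectangle.
    return all(inside(x, y) for x, y in corners)


print(triangle_covers_viewport(640, 480))             # scale 2, as on the slide
print(triangle_covers_viewport(640, 480, scale=1.5))  # too small
```

With a scale of 2 the far corner (width, height) lies exactly on the hypotenuse, so 2 is the smallest factor that works for this vertex layout.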

15 Performance Issues Peak GFLOPS

16 Performance Issues
NV3x Register Penalty
The more registers used in a shader, the slower the shader executes
– 3-4 R: x2 slower
– 5-6 R: x3 slower
– 7-8 R: x4 slower
– 9-12 R: x6 slower
– 13-16 R: x8 slower
– 17-24 R: x12 slower
– 25-32 R: x16 slower
Compiler / driver will try to minimize register usage.
General Rule: The more state in your program, the slower the execution
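The penalty table is easy to turn into a lookup for back-of-the-envelope estimates. In this sketch the factors for 3 to 32 registers are the slide's; the entry for 1-2 registers (no penalty) is an assumption, since the slide does not list it, and the names are hypothetical.

```python
# (max registers in bucket, slowdown factor)
NV3X_REGISTER_PENALTY = [
    (2, 1),     # assumption: 1-2 registers run at full speed
    (4, 2),
    (6, 3),
    (8, 4),
    (12, 6),
    (16, 8),
    (24, 12),
    (32, 16),
]


def nv3x_slowdown(registers: int) -> int:
    """Slowdown factor for a shader using the given number of registers."""
    if registers < 1:
        raise ValueError("register count must be at least 1")
    for max_registers, factor in NV3X_REGISTER_PENALTY:
        if registers <= max_registers:
            return factor
    raise ValueError("the table only covers up to 32 registers")


for r in (2, 4, 9, 32):
    print(r, "registers ->", nv3x_slowdown(r), "x")
```

Reading the table this way shows the step at each bucket boundary: trimming a shader from 9 registers to 8 moves it from x6 to x4.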

17 Performance Issues
Floating Point Texture Bandwidth
Observed Results:
– GeForce 5900 Ultra
  Cache: 11.08 GB/sec
  Sequential: 4.40 GB/sec
  Random: 0.76 GB/sec
– ATI 9800 XT (24-bit)
  Cache: 9.15 GB/sec
  Sequential: 5.55 GB/sec
  Random: 1.80 GB/sec
Big Penalty for Random Access!
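To put a number on the random-access penalty, the observed figures can be compared directly. A rough sketch: the bandwidths are the slide's, 1 GB is assumed to mean 10^9 bytes, and the function names are hypothetical.

```python
BANDWIDTH_GB_PER_SEC = {
    "GeForce 5900 Ultra": {"cache": 11.08, "sequential": 4.40, "random": 0.76},
    "ATI 9800 XT": {"cache": 9.15, "sequential": 5.55, "random": 1.80},
}


def random_access_penalty(gpu: str) -> float:
    """How many times slower random fetches are than sequential ones."""
    rates = BANDWIDTH_GB_PER_SEC[gpu]
    return rates["sequential"] / rates["random"]


def fetch_time_ms(gpu: str, pattern: str, num_bytes: float) -> float:
    """Time to fetch num_bytes at the observed rate (1 GB = 10^9 bytes assumed)."""
    return num_bytes / (BANDWIDTH_GB_PER_SEC[gpu][pattern] * 1e9) * 1e3


for gpu in BANDWIDTH_GB_PER_SEC:
    print(gpu, "random vs sequential: x%.1f slower" % random_access_penalty(gpu))
```

By these figures random access costs a bit under 6x on the GeForce and a bit over 3x on the Radeon, relative to sequential reads.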

18 Performance Issues
WinXP Float4 Download and Readback
– NVIDIA
  1215 MB/sec texture download
  221 MB/sec glReadPixels rate
– ATI
  926 MB/sec texture download
  180 MB/sec glReadPixels rate
Readback should be faster!
  680 MB/sec ATI Linux Readback
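These rates translate into a per-frame cost for any algorithm that moves data through the CPU. The sketch below estimates one download plus one readback of a float4 texture; it assumes 16 bytes per texel (four 32-bit floats), 1 MB = 10^6 bytes, and no overlap between the two transfers. The function name is hypothetical.

```python
RATES_MB_PER_SEC = {
    "NVIDIA": {"download": 1215, "readback": 221},
    "ATI": {"download": 926, "readback": 180},
}


def round_trip_ms(vendor: str, width: int, height: int, bytes_per_texel: int = 16) -> float:
    """Milliseconds to download a texture and read it back once."""
    size = width * height * bytes_per_texel
    rates = RATES_MB_PER_SEC[vendor]
    seconds = size / (rates["download"] * 1e6) + size / (rates["readback"] * 1e6)
    return seconds * 1e3


for vendor in RATES_MB_PER_SEC:
    print(vendor, "1024x1024 float4 round trip: %.1f ms" % round_trip_ms(vendor, 1024, 1024))
```

Under these assumptions a single 1024x1024 round trip takes on the order of 100 ms, most of it in the readback, which is why keeping intermediate results on the GPU matters.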

