DirectX 9 & Radeon 9700 Performance Optimizations Richard Huddy

1 DirectX 9 & Radeon 9700 Performance Optimizations Richard Huddy

2 DirectX 9 and Radeon 9700 considerations Resources Sorting and Clearing Vertex Buffers and Index Buffers Render States How to draw primitives Vertex Data Vertex Shaders Pixel Shaders Textures Targets (both Z and color) Miscellaneous

3 General resource management Create your most important resources first (that’s targets, shaders, textures, VB’s, IB’s etc) “Most important” is “most frequently used” Never call Create in your main loop –So create the main colour and Z buffers before you do anything else… The “main buffer” is the one through which the largest number of pixels pass…

4 Sorting Sort roughly front to back –There’s a staggering amount of hardware devoted to making this highly efficient Sort by vertex shader …or… Sort by pixel shader, or sort by texture When you change VS or PS it’s good to go back to that shader as soon as possible… Short shaders are faster^2 when sorted
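
One way to combine these sorting suggestions is a compound sort key: group draws that share shaders and textures, then order each group front to back. The sketch below is only an illustration. The `DrawItem` record, its field names, and the priority order (vertex shader, then pixel shader, then texture, then depth) are assumptions, not something the slide prescribes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical per-draw record; none of these names come from the slides.
struct DrawItem {
    std::uint32_t vertexShaderId;
    std::uint32_t pixelShaderId;
    std::uint32_t textureId;
    float viewDepth;  // distance from the camera, smaller means nearer
};

// Group draws that share shaders and texture, then order each group
// front to back so early Z rejection gets a chance to work.
inline void sortDrawItems(std::vector<DrawItem>& items) {
    std::stable_sort(items.begin(), items.end(),
        [](const DrawItem& a, const DrawItem& b) {
            return std::tie(a.vertexShaderId, a.pixelShaderId, a.textureId, a.viewDepth)
                 < std::tie(b.vertexShaderId, b.pixelShaderId, b.textureId, b.viewDepth);
        });
}
```

An application that is mostly fill limited might prefer to put depth first and shader second; which order wins is something to measure.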

5 Clearing Ideally use Clear once per frame (not less) –Always clear the whole render target Don’t track dirty regions at all –Always clear colour, Z and stencil together unless you can just clear Z/stencil Most importantly don’t force us to preserve stencil Don’t use 2 triangles to clear… Using Clear() is the way to get all the fancy Z buffer hardware working for you

6 Vertex Buffers Use the standard DirectX8/9 VB handling algorithm with NOOVERWRITE etc Try to always use DISCARD at the start of the frame on dynamic VB’s Specify write-only whenever possible Use the default pool whenever possible Roughly 2 – 4 MB for best performance –This allows large batches –And gives the driver sufficient granularity
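
The NOOVERWRITE/DISCARD pattern mentioned here can be modelled without any Direct3D headers. The sketch below only tracks the write cursor and decides which lock mode to request. `LockMode`, `LockRequest` and `DynamicBufferCursor` are hypothetical stand-ins for the real lock flags and buffer object, and treating a buffer wrap as a second DISCARD point is an assumption about how the algorithm is usually applied.

```cpp
#include <cassert>
#include <cstddef>

// Stand-ins for the NOOVERWRITE and DISCARD lock flags named on the slide.
enum class LockMode { NoOverwrite, Discard };

struct LockRequest {
    std::size_t offset;  // byte offset to lock at
    LockMode mode;       // flag to pass with the lock
};

// Hypothetical helper that tracks the write cursor of one dynamic vertex buffer.
class DynamicBufferCursor {
public:
    explicit DynamicBufferCursor(std::size_t capacityBytes)
        : capacity_(capacityBytes), cursor_(0), freshFrame_(true) {}

    // Call once at the start of each frame so the first lock uses DISCARD.
    void beginFrame() { freshFrame_ = true; }

    // Decide where the next `bytes` go and which lock mode to ask for.
    LockRequest allocate(std::size_t bytes) {
        assert(bytes <= capacity_);
        const bool wouldOverflow = cursor_ + bytes > capacity_;
        if (freshFrame_ || wouldOverflow) {
            // Start again at the top of a fresh buffer.
            freshFrame_ = false;
            cursor_ = bytes;
            return {0, LockMode::Discard};
        }
        // Append after data the GPU may still be reading.
        const std::size_t offset = cursor_;
        cursor_ += bytes;
        return {offset, LockMode::NoOverwrite};
    }

private:
    std::size_t capacity_;
    std::size_t cursor_;
    bool freshFrame_;
};
```

In a real renderer the returned offset and mode would feed the vertex buffer's lock call, followed by a sequential write of the new vertices.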

7 Index Buffers Treat Index Buffers exactly as if they were vertex buffers – except that you always choose the smallest element possible –i.e. Use 32 bit indices only if you need to –Use 16 bit indices whenever you can All recent ATI hardware treats Index Buffers as ‘first class citizens’ –They don’t have to be copied about before the chip gets access –So keep them out of system memory
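
The "smallest element possible" rule can be written as a small check at mesh build time. This is a sketch under the assumption that every 16-bit value is a usable index, so a mesh with up to 65536 vertices fits; `IndexFormat`, `chooseIndexFormat` and `indexBufferBytes` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class IndexFormat { Index16, Index32 };

// 16-bit indices can address vertices 0..65535, i.e. up to 65536 vertices.
inline IndexFormat chooseIndexFormat(std::uint32_t vertexCount) {
    return vertexCount <= 65536u ? IndexFormat::Index16 : IndexFormat::Index32;
}

// Size of the index buffer for a given index count and format.
inline std::size_t indexBufferBytes(std::size_t indexCount, IndexFormat format) {
    return indexCount * (format == IndexFormat::Index16 ? 2u : 4u);
}
```

Halving the index size halves the bandwidth the chip spends fetching indices, which is why splitting a large mesh into chunks that fit 16-bit indices can be worthwhile.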

8 Updating Index and Vertex Buffers IBs and VBs which are optimally located need to be updated with sequential DWORD writes. AGP memory and LVM both benefit from this treatment…

9 Handling Render States Prefer minimal state blocks –‘minimal’ means you should weed out any redundant state changes where possible If 5% of state changes are redundant that’s OK If 50% are redundant then get it fixed! The expensive state changes: –Switching between VS and FF –Switching Vertex Shader –Changing Texture
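
A simple way to find out whether an application sits nearer the 5% or the 50% figure is to put a shadow cache in front of the state-setting calls. The class below is a hypothetical sketch; `StateCache` and its methods are not part of any API. It drops changes that repeat the current value and reports the redundant fraction.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical wrapper that remembers the last value set for each state.
class StateCache {
public:
    // Returns true when the change must be forwarded to the device,
    // false when it repeats the current value and can be dropped.
    bool set(std::uint32_t state, std::uint32_t value) {
        auto it = current_.find(state);
        if (it != current_.end() && it->second == value) {
            ++redundant_;
            return false;
        }
        current_[state] = value;
        ++forwarded_;
        return true;
    }

    // Share of all requested changes that turned out to be redundant.
    double redundantFraction() const {
        const std::uint64_t total = redundant_ + forwarded_;
        return total == 0 ? 0.0
                          : static_cast<double>(redundant_) / static_cast<double>(total);
    }

private:
    std::unordered_map<std::uint32_t, std::uint32_t> current_;
    std::uint64_t redundant_ = 0;
    std::uint64_t forwarded_ = 0;
};
```

The cache itself costs CPU time, so it may make more sense as a profiling aid than as a permanent layer once the redundant changes have been removed at the source.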

10 How to draw primitives DrawIndexedPrimitive( strip or list ) –Indexing is a big win on real world data –Long strips beat everything else –Use lists if you would have to add large numbers of degenerate polys to stick with strips (more than ~20% means use lists) –Make sure your VB’s and IB’s are in optimal memory for best performance –Give the card hundreds of polys per call Small batches kill performance
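
The ~20% rule of thumb can be applied when meshes are built. The sketch below assumes the percentage is measured against all triangles in the stitched strip (real plus degenerate); the slide does not say which denominator is meant, and `PrimitiveLayout` and `choosePrimitiveLayout` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>

enum class PrimitiveLayout { Strip, List };

// Pick lists once degenerate triangles exceed roughly 20% of the strip.
inline PrimitiveLayout choosePrimitiveLayout(std::size_t realTriangles,
                                             std::size_t degenerateTriangles) {
    const std::size_t total = realTriangles + degenerateTriangles;
    if (total == 0) {
        return PrimitiveLayout::Strip;
    }
    // degenerate / total > 0.20, written with integers to avoid rounding.
    return degenerateTriangles * 5 > total ? PrimitiveLayout::List
                                           : PrimitiveLayout::Strip;
}
```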

11 Vertex data Don’t scatter it around –Fewer streams give better cache behaviour Compress it if you can –16 bits or less per component –Even if it costs you 1 or 2 ops in the shader… Try to avoid spilling into AGP –Because AGP has high latency pow2 sizes help – 32 bytes is best –Work the cache on the GPU Avoid random access patterns where possible by reordering vertex data before the main loop… –That’s at app start up or at authoring time
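
As an illustration of "16 bits or less per component" and the 32-byte target, the sketch below packs unit-range values into signed 16-bit integers and lays out a vertex that fills exactly 32 bytes. The `PackedVertex` layout and the helper names are assumptions chosen for the example; a real layout depends on which vertex formats the hardware and shaders accept.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Map a float in [-1, 1] to a signed 16-bit integer. The shader would
// multiply by 1/32767 to undo this, which is the "1 or 2 ops" cost.
inline std::int16_t packSnorm16(float v) {
    if (v > 1.0f) v = 1.0f;
    if (v < -1.0f) v = -1.0f;
    return static_cast<std::int16_t>(std::lround(v * 32767.0f));
}

// CPU-side inverse, useful for checking the quantisation error.
inline float unpackSnorm16(std::int16_t p) {
    return static_cast<float>(p) / 32767.0f;
}

// Hypothetical layout: full-precision position, 16-bit everything else.
struct PackedVertex {
    float position[3];        // 12 bytes
    std::int16_t normal[4];   //  8 bytes (w unused)
    std::int16_t tangent[4];  //  8 bytes (w could carry handedness)
    std::int16_t uv[2];       //  4 bytes
};
static_assert(sizeof(PackedVertex) == 32, "vertex should fill exactly 32 bytes");
```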

12 Compiling and Linking shaders Do this all “up front” –It may not be obvious to you, but you have to actually use a shader to force its complete instantiation in DirectX 9 –So, if you’re not careful you may get linking happening in your main loop –And linking may be time consuming –Draw a little of everything before you start for real. Think of this as priming the caches…

13 Vertex shaders I Shorter shaders are faster – no surprises here… Avoid all unnecessary writes –This includes the output registers of the VS –So use the write masks aggressively –Pack constants as much as possible –Prefer locality of reference on constants too… Be aware of the expansion of macros but prefer them anyway if they match exactly what you want Pack your shader constant updates You should optimise the algorithm and leave the object-code optimisation to the driver/runtime

14 Vertex shaders II Branches and conditionals are fast so use them aggressively –That’s not like the CPU where branches are slow… –Longer shaders allow better batching Shorter shaders are also more cache friendly –i.e. it’s usually faster to switch to the previous shader than to any other –But the shorter your shaders are… –…the more of them fit into the cache.

15 Vertex shaders III API Change: –Now you don’t “mov” to the address register, you use “mova” –And this performs round to nearest, not floor –And now A0 is a 4D register: A0.x, A0.y, A0.z, A0.w
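
The switch from floor to round-to-nearest matters whenever a fractional value is used to index constants. The CPU-side sketch below models only the arithmetic. The function names are illustrative, and the handling of exact .5 ties (here `std::lround`, which rounds halves away from zero) is an assumption, since the slide does not say how ties are resolved.

```cpp
#include <cassert>
#include <cmath>

// Old behaviour described on the slide: the address register takes the floor.
inline int addressFromFloor(float v) {
    return static_cast<int>(std::floor(v));
}

// New behaviour with mova: round to nearest.
inline int addressFromNearest(float v) {
    return static_cast<int>(std::lround(v));
}
```

For example, a value of 2.7 selects element 2 under the old rule and element 3 under the new one, so shaders that relied on truncation may need an explicit adjustment when ported.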

16 Pixel shaders I API change to accommodate MET’s: –You now have to explicitly write to oC0, oC1, oC2 and oC3 to set the output colour –And the write has to be with a mov instruction –If you write to oC[n] you must write to all elements from oC[0] to oC[n-1] i.e. Writes must be contiguous starting at oC0 But the writes can happen in any order You can also write to oDepth to update the Z buffer but note that this kills the early Z cull… (this replaces ps1.3 texdepth)
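
The contiguity rule for colour outputs is easy to check in a shader-generation tool. The sketch below assumes the tool records written outputs as a bitmask, with bit i set when oC[i] is written; the mask representation and the function name are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bit i of `writtenMask` is set when the shader writes oC[i] (i = 0..3).
// Valid masks are runs of set bits starting at bit 0: 0x1, 0x3, 0x7, 0xF.
inline bool colourOutputsAreContiguous(std::uint32_t writtenMask) {
    if (writtenMask == 0 || writtenMask > 0xFu) {
        return false;  // nothing written, or an output beyond oC3
    }
    // A run of ones starting at bit 0 has the form 2^k - 1, so adding one
    // carries through every set bit and leaves no bit in common.
    return (writtenMask & (writtenMask + 1u)) == 0;
}
```

Because the order of the writes does not matter, a mask is enough; no record of instruction order is needed.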

17 Pixel shaders II Shorter is much faster –It’s much easier to be pixel limited than vertex limited –Short shaders are more cache friendly –Be aggressive with write masks –Think dual-issue (“+”) even though it’s gone from the API (so split colour and alpha out) Generally prefer to spend cycles on shader ops rather than using texture lookups –Because memory latency is the enemy here

18 Pixel shaders III Dual issue? –But that’s not in the 2.0 shader spec… –But remember that DX9 hardware like the Radeon 9700 has to run DirectX 8 apps very fast indeed –And that means it has dual issue hardware ready for you to use

19 Pixel shaders IV Example: Diffuse + specular lighting
Original (total: 8 instructions):
  …
  dp3 r0, r1, r0 // N.H
  dp3 r2, r1, r2 // N.L
  mul r2, r2, r3 // * color
  mul r2, r2, r4 // * texture
  mul r0.r, r0.r, r0.r // spec^2
  mul r0.r, r0.r, r0.r // spec^4
  mul r0.r, r0.r, r0.r // spec^8
  mad r0.rgb, r0.r, r5, r2
  …
Reordered with colour and alpha split out (optimized to 5 “DI” instructions):
  …
  dp3 r0, r1, r0 // N.H
  dp3 r2.r, r1, r2 // N.L
  mul r6.a, r0.r, r0.r // spec^2
  mul r2.rgb, r2.r, r3 // * color
  mul r6.a, r6.a, r6.a // spec^4
  mul r2.rgb, r2, r4 // * texture
  mul r6.a, r6.a, r6.a // spec^8
  mad r0.rgb, r6.a, r5, r2
  …

20 Pixel shaders V Texture instructions –Avoid TEXDEPTH to retain the early Z-reject –If you do choose to use TEXKILL then use it as early as possible. [But, the positioning of TEXKILL within texture loading code is unimportant] Register usage –Minimize total number of registers used –No problems with dependency

21 Vertex and Pixel shaders If you’re fed up with writing assembler, and don’t feel excited by the opportunity to code 256 VS ops and 96 PS ops then… …maybe you should consider HLSL? In most cases it is as good as hand written assembler And much faster to author… –Perfect for prototyping –And for release code where you use D3DX

22 Textures I API addition –SetSamplerState() –Handles the now-decoupled texture sampler setup. –You may now freely mix and match texture coordinates with texture samplers to fetch texels in arbitrary ways Texture coordinates are now just iterated floats Samplers handle clamp, wrap, bias and filter modes –You have 8 texture coordinates –And 16 texture samplers texld r11, t7, s15 (all register numbers are max)

23 Textures II Use compressed textures –Do you need a good compressor? Use smaller textures Use 16 bit textures in preference to 32 bit Use textures with few components –Use an L8 or A8 format if that’s what you want Pack textures together –e.g. If you’re using two 2D textures then consider using a single RGBA texture Texture performance is bandwidth limited

24 Textures III Filtering modes –Use trilinear filtering to improve texture cache coherency –Only use anisotropic or trilinear filtering when they make sense, as they are more expensive –Avoid using anisotropic filtering with bumpmapping –Avoid using trilinear anisotropic filtering unless the quality win justifies it –More costly filtering is more affordable with longer pixel shaders

25 Targets Always clear the whole of the target Present(): –WASSTILLDRAWING makes a comeback –Please use it! –Because using it properly will gain you CPU cycles - and that’s typically your scarcest resource

26 Depth Buffer I Never lock depth buffers Clearing depth buffers –Clear the whole surface –When stencil is present clear both depth and stencil simultaneously If possible disable depth buffering when alpha blending (i.e. drawing HUD’s) Use as few depth buffers as possible… –i.e. re-use them across multiple render targets

27 Depth Buffer II Efficiently use Hyper-Z –Render front to back –Make Znear, Zfar close to active depth range of the scene –The EQUAL and NOT EQUAL depth tests require exact compares which kill the early Z comparisons. Avoid them!

28 Occlusion query New to DirectX 9 –In GL you have HP_occlusion_query and NV_occlusion_query to avoid the need for locks Not free, but much cheaper than Lock() Supported on all ATI hardware since the Radeon 8500 CreateQuery(OCCLUSION, ppQuery) Issue(Begin/End) GetData() returns S_OK to signal completion - but please don’t spin waiting for the answer…
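
The "don't spin" advice amounts to polling once per frame and reusing the previous answer until the result arrives. The sketch below models that with a mock query, so it needs no Direct3D headers. `MockOcclusionQuery` and `VisibilityTracker` are hypothetical, and treating an object as visible until a result says otherwise is an assumed policy.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a GPU occlusion query whose result becomes
// available a few polls after it was issued.
struct MockOcclusionQuery {
    int pollsUntilReady;
    std::uint32_t visiblePixels;

    // Mirrors the done / not-done split of GetData(): true means the
    // result is available and has been written to outPixels.
    bool tryGetData(std::uint32_t& outPixels) {
        if (pollsUntilReady > 0) {
            --pollsUntilReady;
            return false;
        }
        outPixels = visiblePixels;
        return true;
    }
};

// Keeps the last known answer so the frame never waits on the GPU.
struct VisibilityTracker {
    bool lastKnownVisible = true;  // assume visible until a result says otherwise

    // Poll once per frame; when the result is not ready, reuse the old answer.
    bool update(MockOcclusionQuery& query) {
        std::uint32_t pixels = 0;
        if (query.tryGetData(pixels)) {
            lastKnownVisible = pixels > 0;
        }
        return lastKnownVisible;
    }
};
```

The cost of this policy is that visibility decisions lag by a frame or two, which is usually acceptable for conservative culling.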

29 AGP 8X Is fast at ~2GB per second But has high latency compared to LVM And is 10 times slower than LVM Radeon 9700 has up to 20GB per sec of bandwidth available when talking to LVM –(LVM = Local Video Memory)

30 User clip planes User clip planes are much more efficient than texkill because: 1. They insert a per-vertex test, rather than a per-pixel test, and vertices are typically fewer in number than pixels 2. It’s important always to kill data at the earliest stage possible in the pipeline Plus, clipping is essentially a geometric operation All hardware which supports ps1.4 supports user clip planes in hardware

31 Sky box. First or last? Draw it last because: –That’s a rough front to back sort –In this case you know that most sky pixels will fail the Z test. Draw it first because: –That way you don’t need any Z tests –In this case you know that most sky pixels would pass the Z test

32 So, here is our target: DX9 style mainstream graphics (per frame): –> 500K triangles –< 500 DrawIndexedPrimitive() calls –< 500 VertexBuffer switches –< 200 different textures –< 200 State change groups –Few calls to SetRenderTarget - aim for 0 to 4... –1 pass per poly is typical, but 2 is sometimes smart –Runs at monitor refresh rate –Which gives more than 40 million polys per second And everything goes through the programmable pipeline –No occurrences of Lock(0), DrawPrimitive(), DPUP()
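
The arithmetic behind these targets can be checked directly. The sketch below uses the slide's per-frame figures together with an assumed 85 Hz refresh rate, which the slide does not state. At 60 Hz the same frame budget gives 30 million triangles per second, so the "more than 40 million" figure implies a refresh rate of about 80 Hz or more, or a larger triangle count per frame.

```cpp
#include <cassert>

constexpr long long kTrianglesPerFrame = 500000;  // from the slide (lower bound)
constexpr long long kDrawCallsPerFrame = 500;     // from the slide (upper bound)
constexpr long long kAssumedRefreshHz  = 85;      // assumption, not from the slide

// Triangles drawn per second when every refresh gets a full frame.
constexpr long long trianglesPerSecond(long long perFrame, long long hz) {
    return perFrame * hz;
}

// Average batch size implied by the triangle and draw-call budgets.
constexpr long long trianglesPerCall(long long perFrame, long long calls) {
    return perFrame / calls;
}

static_assert(trianglesPerSecond(kTrianglesPerFrame, kAssumedRefreshHz) == 42500000,
              "500K triangles at 85 Hz is 42.5 million per second");
static_assert(trianglesPerCall(kTrianglesPerFrame, kDrawCallsPerFrame) == 1000,
              "the budgets imply about 1000 triangles per draw call");
```

The implied batch size of about 1000 triangles per call is consistent with the earlier advice to give the card hundreds of polys per call.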

33 Questions… ? Richard Huddy
