
1 CS248: Graphics Performance, Debugging and Optimisation Dave Oldcorn November 13th 2007

2 Graphics Performance and Optimisation 2 November 13th 2007 Your Guest Instructor Back in the mists of time, I wrote games… The last ten years have all been about 3D hardware. Since 2001 at ATI, joining forces with AMD last year. Optimisation specialist: linking the software and the hardware  Tweaking games  Understanding the hardware  Driver performance  Shader code optimisation –(I find assembly language fun)

3 Graphics Performance and Optimisation 3 November 13th 2007 Overview Three basic sections  GPU Architecture  Efficient OpenGL  Practical Optimisation And Debugging There’s a lot in here  Broad overview of all issues  I’ve prioritised the biggest issues and the ones most likely to help with Project 3  More details with respect to GPU architecture included as appendix

4 Graphics Performance and Optimisation 4 November 13th 2007 GPU Architecture

5 Graphics Performance and Optimisation 5 November 13th 2007 Graphics hardware architecture Parallel computation All about pipelines The OpenGL vertex pipeline shown right will be familiar…

6 Graphics Performance and Optimisation 6 November 13th 2007 Graphics hardware architecture Extend the top of the pipeline with some more implementation detail Ideally, every stage is working simultaneously Could also decompose to smaller blocks And eventually to individual hardware pipeline stages  As shown last week, the hardware implementation may be considerably more complex than a linear pipeline [Diagram: pipeline blocks – Application, API, Video Drivers, Command Buffers on the CPU side; Parser, Vertex Assembly, Vertex Operations, Primitive Assembly on the GPU side]

7 Graphics Performance and Optimisation 7 November 13th 2007 Draw commands Data enters the GPU pipeline via command buffers containing state and draw commands The draw command is a packet of primitives Occurs in the context of the current state  As set by glEnable, glBlendFunc, etc.  The full set of state is often referred to as a state vector  Driver translates API state into hardware state  State changes may be pipelined; different parts of the GPU pipeline may be operating with different state vectors (even to the level of per-vertex data such as glColor)

8 Graphics Performance and Optimisation 8 November 13th 2007 Pipeline performance The performance of a pipelined system is measured by throughput and latency  Can subdivide at any level from the full pipeline down to individual stages Throughput: the rate at which items enter and exit Latency: the time taken from entrance to exit  Latency is not typically a major issue for API users  It is a huge issue for GPU designers  Even GPU-local memory reads may be hundreds of cycles  Substantial percentage of both design effort and silicon is devoted to latency compensation  The system will generally run at full throughput until the latency compensation is exceeded

9 Graphics Performance and Optimisation 9 November 13th 2007 Pipeline throughput Given a particular state vector, each part of the pipeline has its own throughput The throughput of a system can be no higher than the slowest part: this is a bottleneck  More generally, if input is ready but output is not, it is a bottleneck

10 Graphics Performance and Optimisation 10 November 13th 2007 Pipeline bottlenecks Consider the system shown right  Stage 1 can run at 1 per clock and is 100% utilised  Stage 2 can only accept on every other clock; still 100% utilised  Stage 3 is therefore starved on half of the cycles it could be working; 50% utilised  Although stage 3 has the longest latency, it has no effect on the throughput of the system [Diagram: Stage 1 (throughput 1/clock, latency 5 cycles) feeds Stage 2 (throughput 1 per 2 clocks, latency 10 cycles) feeds Stage 3 (throughput 1/clock, latency 15 cycles); items enter and pass Stage 1 at 1 per clock, Stage 2 halves throughput to a result every alternate clock, and Stage 3 still produces only alternate-clock results despite its per-clock throughput]

11 Graphics Performance and Optimisation 11 November 13th 2007 Pipeline bottlenecks A key subtlety: for this to work as shown, there must be load balancing between stages 1 and 2 (probably a FIFO) Once the FIFO is full, the input buffer will exert backpressure on stage 1  Happens after equilibrium is reached This pipeline therefore runs at the speed of the slowest part as soon as the FIFO fills [Diagram: as before, but with an input buffer (FIFO) between Stage 1 and Stage 2; items enter at 1 per clock, pass Stage 1 at 1 per clock and eventually queue, Stage 2 halves throughput to a result every alternate clock, and Stage 3 still produces only alternate-clock results despite its per-clock throughput]

12 Graphics Performance and Optimisation 12 November 13th 2007 Variable throughput In general, throughput is data dependent  Example: clipping is a complex operation which often isn’t required  Example: texture fetch depends on the filtering chosen, which is data dependent Some pipeline stages require different rates at the input and the output  Example: back-face culling; primitive in, no primitive out  Example: rasterisation of primitives to fragments; few primitives in, many fragments out Buffering between stages takes up the slack

13 Graphics Performance and Optimisation 13 November 13th 2007 Pipeline bottlenecks A particular state vector will tend to have a characteristic set of bottlenecks  The input data does also have an effect Small changes to the state vector can make substantial changes to the bottleneck As a state change filters through the pipeline and for a short period afterwards, bottlenecks shift into the new equilibrium  For usual loads, where the render time is much larger than the pipeline depth, this time can be ignored Can be hard to determine bottlenecks if the states in the pipe are disparate  Smearing effect

14 Graphics Performance and Optimisation 14 November 13th 2007 Pipeline bottlenecks There may be multiple bottlenecks if the throughput is not constant at all parts of the pipeline  In general it is not constant GPU buffering absorbs changes in load  Measured in tens or hundreds of cycles at best  Whole pipeline is thousands of cycles The bottleneck could be outside the GPU  Application, driver, memory management… Bottleneck analysis is key to hardware performance  Not easy: bottlenecks are always present; separating expected and unexpected cases is the challenge

15 Graphics Performance and Optimisation 15 November 13th 2007 Flushes and synchronisation Some state cannot be pipelined; a flush occurs  Various localities of flush  For a whole-pipeline flush, the parser waits before allowing new data into the pipe  CPU can carry on building and queuing command buffers  Low cost ~ thousands of cycles (~5us?)  Some operations can require the CPU to wait for the GPU  Example: CPU wants to read memory the GPU is writing  This is a serialising event  Very expensive: wait for pipeline completion, flush all caches, and the restart time taken to build the next command buffer  You can force this with glFinish: please don’t!

16 Graphics Performance and Optimisation 16 November 13th 2007 Asynchronous system  The process of rendering a typical game image is massively asynchronous  The boxes (left) show possible asynchronous actors; the diagram (below) shows a possible timeline, where the shaded areas are the same frame  Input / Physics thread: runs continuously, using input and time (fixed or delta) to update the game world – typically including the scene graph; typical runtime 10-30ms  Render thread: runs continuously to convert the scene graph to rendering commands; generally cannot start until the input/physics thread has processed the whole frame  GPU renderer: runs on its command buffers  DAC: loops over the display at 60-100Hz; a command buffer operation changes the display at end of render, picked up at the start of the next frame (unless vsync is off) [Timeline diagram with rows: Input, Render, GPU, DAC]

17 Graphics Performance and Optimisation 17 November 13th 2007 Synchronisation GPUs aim to run just under two frames ahead  Block at SwapBuffers if there is another SwapBuffers in the pipe that has not yet been reached Reading any GPU memory on the CPU causes a sync  glReadPixels is one example; avoid it Writing to GPU memory generally does not  The GPU, driver and memory manager work together to do uploads without serialisation  No need to be unusually scared of glTexImage If you have to lock GPU memory, look for discard or write-only flags that will allow asynchronous access
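
A minimal sketch (not from the slides) of the "discard" idiom for a per-frame dynamic buffer: orphan the old storage, then take a write-only mapping. The function name, usage hint and buffer are illustrative; glBufferData and glMapBuffer are the GL 1.5 entry points.

#include <GL/gl.h>
#include <string.h>

void update_dynamic_vbo(GLuint dynamic_vbo, const void *data, GLsizeiptr size)
{
    glBindBuffer(GL_ARRAY_BUFFER, dynamic_vbo);

    /* "Orphan" the old storage: a NULL pointer tells the driver the previous
     * contents are dead, so it can hand back fresh memory instead of waiting
     * for the GPU to finish reading the old data. */
    glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STREAM_DRAW);

    /* Write-only mapping: the CPU promises never to read, so no download
     * (and no CPU/GPU serialisation) is required. */
    void *dst = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
    if (dst) {
        memcpy(dst, data, (size_t)size);
        glUnmapBuffer(GL_ARRAY_BUFFER);
    }
}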

18 Graphics Performance and Optimisation 18 November 13th 2007 Shaders Texture lookup operations are relatively expensive  Competition on GPU or system bus, cost of filtering, unpredictable  Some of this is only a latency issue – but latency is not important… –… until the buffering is exceeded –Latency more than doubles for dependent texture operations  Prefer ALU math to texture until the function is complex  Might replace very small textures with shader constants The shader – typically its texture operations – is likely to be the limiting factor on performance
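
A hypothetical GLSL fragment of the "ALU instead of texture" trade described above: a small 1D function texture encoding a specular-style falloff replaced by a couple of multiplies. The sampler, varying and curve are illustrative, not taken from any real shader.

// Old path: one more texture fetch competing for bandwidth and latency cover.
uniform sampler1D falloffTex;
varying float ndoth;            // assumed to be passed down by the vertex shader

void main()
{
    // float spec = texture1D(falloffTex, ndoth).r;   // lookup-table version

    // ALU version: x^4 as two multiplies, easily scheduled into gaps
    // between the shader's remaining texture fetches.
    float x = clamp(ndoth, 0.0, 1.0);
    float spec = x * x;
    spec = spec * spec;

    gl_FragColor = vec4(spec);
}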

19 Graphics Performance and Optimisation 19 November 13th 2007 Shaders Each shader is run at a particular frequency  Per-vertex, per-fragment now; per-primitive also exists; per-sample seems likely in the future  Can view constants calculated on the CPU as another frequency (per-draw packet)  Aim to do calculations at the lowest necessary frequency Issues to be aware of:  Data passed from vertex to fragment shader is interpolated linearly in the space of the primitive (i.e. with perspective correction) so can only use interpolators if this is appropriate (linear or nearly so); high tessellation can be a workaround  Excessive use of interpolators can itself be a bottleneck; up to two interpolators per texture fetch, as a ballpark figure
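
A sketch (hypothetical GLSL 1.x shaders) of pushing work to the lowest frequency: a simple depth-fog factor is linear in eye space, so it can be computed per vertex and interpolated instead of being recomputed per fragment, with the scale supplied at per-draw frequency as a uniform.

// --- vertex shader ---
varying float fogFactor;        // per-vertex frequency, interpolated to fragments
uniform float fogScale;         // per-draw frequency

void main()
{
    vec4 eyePos = gl_ModelViewMatrix * gl_Vertex;
    fogFactor   = clamp(-eyePos.z * fogScale, 0.0, 1.0);   // linear in eye depth
    gl_Position = gl_ProjectionMatrix * eyePos;
}

// --- fragment shader ---
varying float fogFactor;
uniform vec3 fogColour;

void main()
{
    vec3 lit = vec3(1.0);       // stand-in for the real per-fragment shading
    gl_FragColor = vec4(mix(lit, fogColour, fogFactor), 1.0);
}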

20 Graphics Performance and Optimisation 20 November 13th 2007 Shader constants Shader constants are a large part of the state vector  Updating hundreds on each draw call will not be free Prefer inline constants (known at compile time) to state vector constants  Gives the compiler and constant manager more information For the same reason, avoid parameterising for its own sake Don’t switch shader just to change a couple of constants

21 Graphics Performance and Optimisation 21 November 13th 2007 Efficient OpenGL

22 Graphics Performance and Optimisation 22 November 13th 2007 Efficient OpenGL This is a data processing issue What data does the GPU need to render a scene?  State data, texture data, vertices / primitives CPU-side performance can easily be dominated by inefficient management of this data Of them all, vertex data is the most problematic

Type of data         State        Vertex            Texture
Volume (per frame)   Low (~kB)    Med-high (~MB)    Very high (~GB)
Rate of change       Very high    Low-med           Very low

23 Graphics Performance and Optimisation 23 November 13th 2007 Efficient vertex data Application needs to feed mesh data in somehow GL provides two basic methods  glBegin/glEnd (known as ‘immediate mode’)  Vertex arrays Immediate mode is easy to use but has high overheads  Many tiny, unaligned copies  Non-‘v’ forms imply extra copies  Command stream is unpredictable and irregular

glBegin(GL_TRIANGLE_FAN);
glColor4f(1,1,1,1);
glVertex3f(0,0,0);   // position + colour
glVertex3f(0,1,0);   // position only
glColor4f(1,0,0,1);
glVertex3f(1,1,0);   // position + colour
glVertex3f(1,0,0);   // position only
glEnd();

24 Graphics Performance and Optimisation 24 November 13th 2007 Vertex arrays Vertex arrays are an alternative  The application probably has its data in arrays somewhere, so let GL read them en masse  glVertexPointer, glColorPointer, etc. specify the arrays  glDrawElements to issue a draw command; takes an index list  Primitives are drawn using the indices into the arrays as set up by the gl*Pointer commands

glVertexPointer(3, GL_FLOAT, 16, vertex_array);
glColorPointer(4, GL_UNSIGNED_BYTE, 0, color_array);
glEnableClientState(GL_VERTEX_ARRAY);
glEnableClientState(GL_COLOR_ARRAY);
glDrawElements(GL_TRIANGLES, 12, GL_UNSIGNED_INT, indices);

Easier for the driver and GPU to handle  State vector is instantiated at the glDrawElements command  The GPU can process all the primitives in a single draw packet

25 Graphics Performance and Optimisation 25 November 13th 2007 Vertex arrays Did you hear a but? The vertex data still belongs to the application  Until the glDrawElements call is entered, the GPU knows nothing of the data  After the call completes the app can change the data  Therefore, the driver must copy the data on every glDrawElements call  Even if the data never changes – the GL can’t know Wouldn’t it be great if we could avoid the copy?  We don’t supply textures on every call, just upload them to the GPU and let the driver manage them…

26 Graphics Performance and Optimisation 26 November 13th 2007 Buffer Objects This facility is provided with the Vertex Buffer Objects (VBO) extension  Allows the creation of buffer objects in GPU memory with access mediated by the driver  Data can be uploaded at any time with glBufferData –As with glTexImage, done through command buffer to avoid serialisation  glBindBuffer is analogous to glBindTexture, glBufferData to glTexImage

// During program initialisation
GLuint vbo;
glGenBuffers(1, &vbo);
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glBufferData(GL_ARRAY_BUFFER, 16*4*sizeof(GLfloat), vertex_array, GL_STATIC_DRAW);
...
// In render loop
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexPointer(3, GL_FLOAT, 16, 0);   // last argument is now an offset into the buffer
glEnableClientState(GL_VERTEX_ARRAY);
glDrawElements(GL_TRIANGLES, 12, GL_UNSIGNED_INT, indices);

27 Graphics Performance and Optimisation 27 November 13th 2007 Index data DrawElements only needs to send the indices Actually we can optimise that away too; Element Arrays allow buffer objects to contain index data (see the sketch below)  Index data is far smaller in volume, and tends to come in larger batches if state changes are minimised, so moving it into buffer objects can be over-optimisation Keep batches as large as possible  Keep state changes to a minimum  Primarily use triangle lists  Don't mess with locality of reference  Strips can be marginally more efficient
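
A short sketch of the element-array path, continuing the buffer-object example above; the buffer name is illustrative and indices is the same index array as before.

// Initialisation: indices also live in a buffer object
GLuint ibo;
glGenBuffers(1, &ibo);
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glBufferData(GL_ELEMENT_ARRAY_BUFFER, 12 * sizeof(GLuint), indices, GL_STATIC_DRAW);

// Render loop: the final argument becomes a byte offset into the bound
// element array buffer rather than a client-memory pointer.
glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
glDrawElements(GL_TRIANGLES, 12, GL_UNSIGNED_INT, (const GLvoid *)0);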

28 Graphics Performance and Optimisation 28 November 13th 2007 Display Lists These offer the driver opportunity for unlimited optimisation It’s hard for the driver to do  The list can contain literally any GL command Not recommended for games or other consumer apps Professional GL apps do make heavy use of display lists (and immediate mode)  The effort required to efficiently optimise these is one reason professional GL cards are more expensive

29 Graphics Performance and Optimisation 29 November 13th 2007 Visibility optimisations It’s far more efficient not to render something at all Try to avoid sending primitives that can’t be seen  Not in the view frustum  Obscured Send it, but have it rejected at some early point in the pipeline  Cull primitives before rasterisation  Reject fragments before shading

30 Graphics Performance and Optimisation 30 November 13th 2007 Bounds Bounding boxes or spheres to reject objects wholly outside the view frustum Optimal methods for using these were in lecture 11
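
A minimal sketch (types and names assumed) of the bounding-sphere test against the view frustum: six inward-facing planes with normalised normals, rejecting the object when its sphere lies entirely behind any one plane.

typedef struct { float a, b, c, d; } Plane;   /* plane: a*x + b*y + c*z + d >= 0 is inside */

int sphere_in_frustum(const Plane frustum[6],
                      float cx, float cy, float cz, float radius)
{
    for (int i = 0; i < 6; ++i) {
        float dist = frustum[i].a * cx + frustum[i].b * cy +
                     frustum[i].c * cz + frustum[i].d;
        if (dist < -radius)
            return 0;   /* wholly outside this plane: cull the object */
    }
    return 1;           /* potentially visible: submit it */
}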

31 Graphics Performance and Optimisation 31 November 13th 2007 Occlusion culling PVS (Potentially Visible Set) culling  For each location in the set of locations, store which other locations might be visible  Precalculate before render process starts If you are standing anywhere in A, you absolutely cannot see C and vice versa  View frustum checks cannot solve this part of the problem; consider the position of the observer shown  A frustum test is still useful; if the observer was standing in B looking the same way, bounds could cull C Very effective on room-based games; not so useful on outdoor games  Fewer large-scale occluders [Diagram: three adjoining rooms labelled A, B and C, with the observer's position and view direction marked]
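
A sketch of a precalculated PVS in use; the region count, bitmask layout and helper names are all illustrative. Each region stores a bitmask of the regions potentially visible from it, built offline.

#define MAX_REGIONS 64
extern unsigned long long pvs[MAX_REGIONS];   /* bit r set: region r may be visible */

void render_world(int viewer_region)
{
    for (int r = 0; r < MAX_REGIONS; ++r) {
        if (!((pvs[viewer_region] >> r) & 1ULL))
            continue;             /* e.g. standing in A: C is skipped outright */
        /* frustum-test region r's bounds here, then draw it */
    }
}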

32 Graphics Performance and Optimisation 32 November 13th 2007 Other visibility methods Portals – as discussed in lecture 11 BSP – Binary Space Partition – trees  Complex but efficient way to store large static worlds for fast frustum visibility calculations  Combine with PVS and portals; all need precalculation phase Abrash - Graphics Programming Black Book ch. 59-64, 70  Detailed information on these and other research he and John Carmack did on visibility while developing Quake  Still in use today on modern FPS games (with many enhancements!)

33 Graphics Performance and Optimisation 33 November 13th 2007 Model LOD If you need to render something, render less of it Demonstrated two weeks ago:  A model close to the camera requires many triangles  Carry reduced-detail models and select one on each render –Like mipmapping, the memory cost is not prohibitive. –Target triangle sizes near the GPU's high-efficiency ~100 pixel region Visualise with wireframe
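
A sketch (assumed structures and made-up switch distances) of selecting a reduced-detail model on each render from the object's distance to the camera.

typedef struct Mesh Mesh;        /* vertex/index buffers for one detail level */

typedef struct {
    Mesh *lod[3];                /* 0 = full detail, 2 = lowest detail */
    float lod_distance[2];       /* switch distances, e.g. { 20.0f, 60.0f } */
} Model;

Mesh *select_lod(const Model *m, float distance_to_camera)
{
    if (distance_to_camera < m->lod_distance[0]) return m->lod[0];
    if (distance_to_camera < m->lod_distance[1]) return m->lod[1];
    return m->lod[2];
}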

34 Graphics Performance and Optimisation 34 November 13th 2007 Model LOD Non-trivial implementation  Popping is a well-known issue; morphing or blending are common solutions  Must generate reduced-detail models  Can reuse vertex data, just change indices Terrain offers particular challenges  LOD systems essential for really large worlds  Terrain tiles must match between different LODs Can also solve sampling issues  As with undersampling textures, semi-random triangles can be picked; this occurs if triangles are smaller than 1 pixel

35 Graphics Performance and Optimisation 35 November 13th 2007 GPU primitive culling Degenerate primitives (example: triangles with two indices the same) will be culled at index fetch A primitive with all vertices outside the same clip plane will be culled Back-face culling is a simple optimisation and should be used for all closed opaque models Zero area triangles will be culled before rasterisation  This is rarely usefully exploitable Scissor rectangles cull large parts of primitives during rasterisation

36 Graphics Performance and Optimisation 36 November 13th 2007 GPU Z rejection The Z test can occur before shading  Reduces colour read/write load as well Some states inhibit early Z test  Write Z in the shader, obviously  Gate Z update in the shader (pixel kill / alpha test with Z write) –Alpha test sounds like an optimisation, but it only saves colour read/write; use it for visual effect not performance –Shader kill acts as a shader conditional Z unit can reject at hundreds of pixels per clock  Accept rate is lower (at the very least Z has to be written) but as fast or faster than any other post-rasteriser operation Stencil usually rejects at Z rates  Having a stencil op that does something implies a stencil write

37 Graphics Performance and Optimisation 37 November 13th 2007 Early Z rejection Draw opaque geometry in roughly front-to-back order  Do not work too hard to make this perfect, that's what the Z buffer was created for in the first place  Do not draw the sky first. Please!  This assumes you're bottlenecked in the shader Consider a Z pass  If the fragment shaders are very expensive  If at any point while rendering the colour buffer you need an algorithm that requires the Z buffer  Disable colour writes (glColorMask) or fill the colour buffer with something cheap but useful (example: ambient lighting)  Invariance issues should be rare nowadays (but be aware)
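
A sketch of a depth-only prepass followed by the shaded pass; draw_opaque_geometry(), bind_cheap_shaders() and bind_full_shaders() are assumed application helpers.

/* Pass 1: lay down Z with colour writes off and the cheapest possible shaders */
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_TRUE);
glDepthFunc(GL_LESS);
bind_cheap_shaders();
draw_opaque_geometry();

/* Pass 2: full shading; the LEQUAL test means only the finally visible fragment
 * passes, so the expensive fragment shader runs roughly once per pixel */
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
glDepthMask(GL_FALSE);           /* depth is already correct; no need to rewrite it */
glDepthFunc(GL_LEQUAL);
bind_full_shaders();
draw_opaque_geometry();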

38 Graphics Performance and Optimisation 38 November 13th 2007 Shader conditionals Can also reduce shader load Treat with care… Use mostly for high-coherency data  The conditional is unlikely to have per-pixel granularity  An if-then-else clause may have to execute both branches For low-coherency data, prefer conditional-move type operations Typically the shader compiler and optimiser can't know much about the likely coherency  So it guesses
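
A hypothetical GLSL fragment contrasting the two forms; the flag, helper functions and colours are illustrative.

varying float blendFlag;         // effectively 0.0 or 1.0, changing pixel to pixel

vec3 shade_cheap()     { return vec3(0.5); }             // stand-in shading paths
vec3 shade_expensive() { return vec3(0.2, 0.6, 0.9); }

void main()
{
    // Branch form: a good fit when blendFlag is constant over large screen areas.
    // if (blendFlag > 0.5) colour = shade_expensive(); else colour = shade_cheap();

    // Select (conditional move) form: both values are computed and one is chosen
    // per pixel; often the better trade when the flag has low coherency.
    vec3 colour = mix(shade_cheap(), shade_expensive(), blendFlag);
    gl_FragColor = vec4(colour, 1.0);
}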

39 Graphics Performance and Optimisation 39 November 13th 2007 Triangle sizes Larger triangles are more efficient than small ones Rules of thumb:  Over 1000 pixels is large  100 pixel triangles are considered typical and the GPU should be into the ballpark of its peak performance  Under 25 pixel triangles are small  Tiny triangles likely to cause granularity losses in the GPU Often the type of object and size of triangle are related  Example: world triangles tend to be larger than entities

40 Graphics Performance and Optimisation 40 November 13th 2007 Bump mapping Can trade off geometric complexity for more expensive fragment shading  Textures in general offer this capability –Light maps are an earlier example Having a normal map available in the fragment shader is useful for other reasons too  Per-pixel lighting is an obvious use Doom3 was an early pioneer:  Polygon counts are low compared with other games of the time  Bump mapping makes this hard to see except on silhouette edges

41 Graphics Performance and Optimisation 41 November 13th 2007 Practical Optimisation and Debugging

42 Graphics Performance and Optimisation 42 November 13th 2007 Optimising Applications Always profile; never assume Target optimisations  Better to get a small gain in something that takes half the time than a big gain in something that takes a couple of percent  Better to do easy things than hard things  “Low-hanging fruit”

43 Graphics Performance and Optimisation 43 November 13th 2007 Instrumentation for debugging Logging Visualisation: make particular rendering (more) visible Simple interfaces into the high-level parts of the program to make low-level testing easier  ‘God mode’  Skip to level N or subpart of the level –Saved games may seem to be an answer here, but minor changes during development usually break them  Metadata display Multiple monitors and remote debugging  Key for fullscreen applications  Useful to have ‘stable’ dev machine and separate debug target

44 Graphics Performance and Optimisation 44 November 13th 2007 Instrumentation for performance Feedback on what the performance actually is  A simple onscreen frames per second (FPS) and/or time-per-frame counter  Special benchmarking modes Modify the performance  Skip particular rendering passes  Add known extra load –Examples: new entities, particle system load, force postprocessing effects on

45 Graphics Performance and Optimisation 45 November 13th 2007 Real-world example: Doom3 engine  Heavily instrumented with developer console  accessed with ctrl-alt-  Most commands prefixed according to their functional unit –r_ commands are to the renderer, s_ the sound system, sv_ the server, g_ the client (game), etc.  Record demos; playback with playdemo or timedemo  Capture individual frames with demoshot for debugging or performance  Can also send console commands from the command line – essential for external tools  Many debugging commands –noclip to fly anywhere on a level –r_showshadows 1 displays the shadow volumes –g_showPVS 1 to show the PVS regions at work

46 Graphics Performance and Optimisation 46 November 13th 2007 More Doom3 convenience features  PAK files are just ZIP files  You can look at the ARB_fragment_program shaders Doom3 uses (glprogs/ directory in the first pakfile).  You can also modify them: real files (e.g. under the base/glprogs directory) override the pakfiles Human-readable configuration files TAB completion on the console  Long commands not a problem – plus you can find the command you want! Key bindings

47 Graphics Performance and Optimisation 47 November 13th 2007 Doom3 render: multipass process
1. Z pass: set the Z buffer for the frame
2. Lighting passes: for each light in the scene
   2A. Shadow pass: render shadow volumes into the stencil buffer
   2B. Interaction pass: accumulate the contribution from this light to the framebuffer
       - Cheap Phong algorithm (per-pixel lighting with interpolated E; Prey calculates E on a per-pixel basis for better specular)
       - Vertex/fragment shader pair
3. Effects rendering; mostly blended geometry for explosions, smoke, decals, etc.
4. One or more postprocessing phases for refraction and other screen-space effects

48 Graphics Performance and Optimisation 48 November 13th 2007 Doom3 benchmarking tools Each render pass can be disabled from the console  r_skipinteractions, r_shadows, r_skippostprocess  Benchmark each pass individually  Worth considering render time rather than just FPS; time is a linear quantity

Rendered         FPS     Frame time (ms)   Isolated pass   Pass time (ms)   Pass load
Everything       55.8    17.9
- postproc       58.5    17.1              Postproc        0.8              4%
- interactions   104.7   9.6               Interaction     7.5              42%
- shadows        174.5   5.7               Shadows         3.9              22%
                                           The rest        5.7              32%

49 Graphics Performance and Optimisation 49 November 13th 2007 Case study: Doom3 interaction shader The shader has 7 texture lookups  Texture limited on most GPUs  One of them was a simple function texture –Probably originally a point of customisation but unused  We tested gain by eliminating the lookup –replaced with a constant – note not 0 or 1, which might allow the optimiser to eliminate other code –Provided the expected ~15% gain for the pass  Replaced with a couple of scalar ALU instructions –Gain was still the same, as the scalar ALU scheduled into gaps in the existing shader Quake4 and later games all picked up the change

50 Graphics Performance and Optimisation 50 November 13th 2007 Instrumenting applications Be wary of profiling API calls  Asynchronous system; SwapBuffers is probably the only point of synchronisation  Can't easily measure hardware performance at a finer granularity than a frame  Don't try to profile the cost of rendering a mesh by timing DrawElements; it only measures the time taken to validate state and fill the command buffer  Which isn't to say that's never useful information
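
A sketch of the coarse whole-frame timing the asynchronous model allows: time from swap to swap rather than around individual draw calls. render_frame() and swap_buffers() stand in for the application's own functions; the clock here is POSIX clock_gettime.

#include <stdio.h>
#include <time.h>

extern void render_frame(void);
extern void swap_buffers(void);

static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

void frame_loop(void)
{
    double prev = now_seconds();
    for (;;) {
        render_frame();          /* mostly just queues GL commands */
        swap_buffers();          /* the one point that tracks GPU progress */

        double now = now_seconds();
        printf("frame: %.2f ms (%.1f fps)\n",
               (now - prev) * 1000.0, 1.0 / (now - prev));
        prev = now;
    }
}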

51 Graphics Performance and Optimisation 51 November 13th 2007 Instrumenting applications Don’t overprofile  QueryPerformanceCounter has a cost  Even RDTSC does Try to look at the high level and in broad terms first  30% physics, 20% walking scene graph, 30% in the driver, 20% waiting for end of frame  Rather than 15.26% inside DrawElements Aim to be GPU limited, then optimise GPU workload  Don’t waste time optimising CPU code if it’s waiting for the GPU  Iterate as the GPU workload becomes more optimal Try to avoid compromising readability for performance  Rarely necessary  Download the Quake 3 source to see how clear really fast code can be  The games industry is really, incredibly, bad at this.

52 Graphics Performance and Optimisation 52 November 13th 2007 Benchmark modes Timed runs on repeatable scenes  Two options –Fix the number and exact content of frames and time the run (could be one frame repeated N times) –Fix the run time, render frames as fast as possible, count the frames  Former is more repeatable; often essential if tools require multiple runs to accumulate data  Latter more convenient for benchmarkers and more realistic to how games behave in the real world  Cynical reason for benchmarks: applications get more attention from press (and hence driver developers)
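
A sketch of the first option (fixed frame count, time the run), reusing now_seconds() and swap_buffers() from the earlier timing sketch; play_recorded_frame() is an assumed demo-playback helper.

double benchmark_fixed_frames(int frame_count)
{
    double start = now_seconds();
    for (int i = 0; i < frame_count; ++i) {
        play_recorded_frame(i);
        swap_buffers();
    }
    glFinish();   /* one flush at the very end of a timed run is the exception
                     to the earlier "please don't glFinish" advice */
    double elapsed = now_seconds() - start;
    printf("%d frames in %.2f s -> %.1f fps average\n",
           frame_count, elapsed, frame_count / elapsed);
    return elapsed;
}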

53 Graphics Performance and Optimisation 53 November 13th 2007 CodeAnalyst CodeAnalyst is an AMD tool that allows non-intrusive profiling of the application’s CPU usage  A profiling session spawns the application under test –Make sure to avoid profiling startup and shutdown time  Can drill down to individual source lines in your code and show you the cost  Many examples on AMD’s web site using this  Useful for all CPU-limited applications

54 Graphics Performance and Optimisation 54 November 13th 2007 CodeAnalyst hints  A spike inside driver components may not be driver overhead  The driver is probably waiting on the GPU to meet the SwapBuffers limit  If there's not a large spike in the driver, it's probably the application that's the limit.  This is complicated by the fact that we may choose to block if the GPU is not yet ready, so time may move from the driver to being reported as 'system idle process', PID 0, or similar.  Vary the resolution and check how the traces change  If the relative time in the driver or system idle doesn't change, the application is not pixel limited.  Multicore systems make interpreting the results harder.  You might be best off switching a core off if you can

55 Graphics Performance and Optimisation 55 November 13th 2007 GPUPerfStudio Lets you look inside the GPU Hardware performance counters  3D busy is the most obvious and often important  Vertex / pixel load can also be seen The bad news: GL support is not in the currently downloadable version 1.1. Coming soon…

56 Graphics Performance and Optimisation 56 November 13th 2007 Shader Development AMD GPUShaderAnalyzer  Available to download from AMD web site  Will handle all GL shader types (GLSL, ARB_fp, ARB_vp)  Good development environment; no need to run your app to compile  Will show output code and statistics including estimated cycle counts for all AMD GPUs

57 Graphics Performance and Optimisation 57 November 13th 2007 Scalability Look to create consistent performance  Better to run at 30fps continuously than oscillate wildly between 15fps and 100fps.  Target worst-case scenes  You will need headroom to guarantee 60fps Is a particular gain useful?  A 4% speedup won’t help anyone play your game  Five 4% speedups would, though  Gains in a lesser component allow more use of that component

58 Graphics Performance and Optimisation 58 November 13th 2007 Scalability PC environment is a huge scalability challenge  Matrix of CPUs, GPUs and render resolutions is huge  Performance is in tension with image quality  Adjust quality to scale for GPU power and set higher loads –when CPU limited, more pixels probably have no cost  Adjust quality in profiling –Resolution (or clock) scaling to test if CPU or GPU limited Consoles have it easier: more fixed in every way  Still need headroom, just less of it  Now have resolution scaling issues - five TV resolutions in NTSC 480i, PAL 576i, 720p, 1080i/p  60Hz / 50Hz is a headache here

59 Graphics Performance and Optimisation 59 November 13th 2007 Caveats on optimisation Windowed mode  GPUs can behave differently in windowed mode to fullscreen mode  Windowed should still be your primary development mode unless you have remote debugging Front Buffer rendering  May be useful for debugging, but could have similar performance implications Avoid misusing benchmarks  Repeat runs – make sure everything’s ‘warm’.

60 Graphics Performance and Optimisation 60 November 13th 2007 Guidelines for Project 3 Concentrate on the scene graph first, GPU second, CPU cycle picking last  Look for algorithms that cull monsters, trees and rooms rather than triangles or pixels Work on model or texture data in the GPU, not CPU  Primarily, use the shader to do the work  Anywhere index data, primitive count and connectivity don’t change is a candidate  If you have to generate a texture consider using the GPU

61 Graphics Performance and Optimisation 61 November 13th 2007 Guidelines for Project 3 Short of time to write shaders  Write a few shaders that you use a lot Don’t try to do everything in this lecture  Many techniques won’t apply to your specific case  Even those that do often won’t matter  Profile-Guided Optimisation!

62 Graphics Performance and Optimisation 62 November 13th 2007 Headline performance items  Scene graph optimisations: visibility culling, model LOD  Don’t touch model data on the CPU unless the algorithm absolutely requires it  Use vertex arrays for complex mesh data (> 10 primitives); store static data in VBOs.  Use mipmaps for all static textures; avoid undersampling textures without mipmaps  Render roughly front to back; don’t kill yourself trying but give it a go for the largest geometry; draw the sky last!  Use compressed textures by default; only disable if artifacts appear  Disable unnecessary alpha testing; don’t do kills in shaders unless you have to  Move work from fragment to vertex shaders where possible  Prefer moderate math to texture lookups particularly if they increase the dependent fetch level

63 Graphics Performance and Optimisation 63 November 13th 2007 Further reading Abrash, Mike: The Graphics Programming Black Book  Even in 1997 the asm and register programming section was dated  Much of the Quake documentation isn’t –Clear explanation of BSP, PVS and some on portals.  Rest is still worth reading to show the mindset –Skip asm-specific bits, concentrate on thought process  Chapter 1 and chapter 70 are required reading  Stencil shadows; the Wikipedia page has many links

64 Graphics Performance and Optimisation 64 November 13th 2007 Samples and Tools  http://ati.amd.com/developer/  GPUPerfstudio, GPUShaderAnalyzer and the Compressonator  Tootle is also interesting; optimise meshes both for vertex cache and ‘internal’ front-to-backness  Many other samples, documents and tools  http://www.amd.com/codeanalyst

65 Graphics Performance and Optimisation 65 November 13th 2007 Questions If we have time…

66 Graphics Performance and Optimisation 66 November 13th 2007 Appendix Background information on more aspects of the GPU a.k.a. “The slides I knew I didn’t have time to go through”

67 Graphics Performance and Optimisation 67 November 13th 2007 Texture and rendertarget tiling Memory interface efficiency mostly determined by burst sizes  The more useful memory fetched in one go the better  Avoid fetching anything that isn't then used –This is why mipmapping is so important: minifying a texture without mipmaps implies fetching memory that isn't then used Rearranging memory into tiles increases locality of reference  64 bytes might contain 4x4 pixels instead of 16x1 pixels  Format is transparent to the application

68 Graphics Performance and Optimisation 68 November 13th 2007 Texture Compression GL_ARB_texture_compression The S3TC / DXTC / BC algorithm is a high quality method for typical image textures  Designed such that the artifacts introduced in lossy compression tend to be smoothed out by texture filtering  Function textures and unusual use textures may not meet acceptable quality  Rearranging components can help  Use high-quality compressors - The Compressonator Compression isn’t just about memory bandwidth  Reduces effective latency (one fetch brings in more useful texels)  Effectively increases texture cache size
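
A sketch of uploading a texture that was compressed offline (for example with The Compressonator) as DXT5/BC3; the function name, single-level upload and payload source are illustrative, and the format enum needs the EXT_texture_compression_s3tc extension.

#include <GL/gl.h>
#include <GL/glext.h>

void upload_dxt5_level0(GLuint tex, int width, int height, const void *payload)
{
    /* DXT5 stores 4x4 texel blocks of 16 bytes each */
    GLsizei size = ((width + 3) / 4) * ((height + 3) / 4) * 16;

    glBindTexture(GL_TEXTURE_2D, tex);
    glCompressedTexImage2D(GL_TEXTURE_2D, 0,
                           GL_COMPRESSED_RGBA_S3TC_DXT5_EXT,
                           width, height, 0, size, payload);
    /* repeat for the remaining mipmap levels; static textures should have them */
}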

69 Graphics Performance and Optimisation 69 November 13th 2007 Texture Filtering A bilinear-filtered sample is the common basic unit of work for a texture unit  Unlikely that point sampling is any faster than bilinear; can make this work for you in image processing shaders (rather than point sampling and doing some constant weighted sum)  Each additional bilinear sample for trilinear or anisotropic filtering is probably consuming additional time Smart algorithms ensure that only needed samples are taken  No need for trilinear if magnifying  No need for anisotropy if square-on  Example: walls tend to have less anisotropy than floors Gradient calculations may be dynamic  Necessary to handle dependent texture reads  Be wary with dependency; gradient can be unpredictable

70 Graphics Performance and Optimisation 70 November 13th 2007 Render to texture Useful for generating extra views or postprocessing  Example: mirror in driving game  Example: postprocessing for refraction glCopyTexImage copies the framebuffer to a texture  CPU-GPU serialisation is not implied; this can probably be queued into a command buffer Other methods exist such as pbuffers and framebuffer extensions  Can be slightly more efficient  Can return to a rendertarget after rendering on another  More complex; don’t use without good reason
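
A sketch of the glCopyTexImage path described above, for something like the driving-game mirror; render_mirror_view(), render_main_view_with() and the 512x512 size are assumptions.

/* Once, at startup */
GLuint mirror_tex;
glGenTextures(1, &mirror_tex);
glBindTexture(GL_TEXTURE_2D, mirror_tex);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);

/* Each frame */
render_mirror_view();                         /* draw the extra view first */
glBindTexture(GL_TEXTURE_2D, mirror_tex);
glCopyTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8,  /* queued like any other command: */
                 0, 0, 512, 512, 0);          /* no CPU/GPU serialisation implied */

glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
render_main_view_with(mirror_tex);            /* main view samples mirror_tex */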

71 Graphics Performance and Optimisation 71 November 13th 2007 Multisample antialiasing The key gain is to run the fragment shader at pixel frequency rather than sample frequency Also saves memory bandwidth; can compress Z and colour  The buffer may need to be resolved to an uncompressed buffer for display or if used as a texture  Triangle size may be worth extra consideration with MSAA; frame buffer and Z compression rate is likely to be in roughly inverse proportion to the number of visible edges in the scene

72 Graphics Performance and Optimisation 72 November 13th 2007 Caching Many caches inside the GPU Different to what you might be familiar with in a CPU  More about memory bursts and latency compensation than reuse  In general you do need to hit the memory –Example: texture mapping the whole framebuffer at 1:1; every pixel and texel will be touched exactly once  Therefore, be pessimistic: assume this  Choose to compensate memory latency with large buffers –Rather than using the cache to dodge the accesses In a few places short-term ‘reuse’ is critical  Bilinear filtering the most obvious case

73 Graphics Performance and Optimisation 73 November 13th 2007 Caching There can still be advantages to avoiding cycling  Used to be a big thing, particularly in the days of visible caching (software controlled rather than auto)  This caused the sorting policy of hard sort by material  Nowadays far less important, hence rough sort by depth  In some pathological circumstances sort by shader and depth (or a Z pass followed by sort by shader) might be more efficient

74 Graphics Performance and Optimisation 74 November 13th 2007 Disclaimer & Attribution DISCLAIMER The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION © 2007 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other names are for informational purposes only and may be trademarks of their respective owners.

