1. Deferred Lighting and Post Processing on PLAYSTATION®3
Matt Swoboda, PhyreEngine™ Team, Sony Computer Entertainment Europe (SCEE) R&D
2. Where Are We Now?
PS3 into its 3rd year
Many developers on their 2nd generation engines
Solved the basic problems
SPUs STILL underused, but it's improving
3. But...
GPU now the most common bottleneck
Usually limited by fragment operations
Many titles take > 1/3 of their time in post processing
Most developers want to do even more fragment work:
More / heavier post processing effects
Better lighting techniques / more lights / softer shadows
Longer shaders
Features ported from PC / other console hardware
The typical performance limitation on PS3 titles has moved from a CPU bottleneck to a GPU bottleneck, specifically fragment operations.
4. "We fixed the vertex bottleneck..."
Many possible solutions to improve geometry performance beyond just "optimising the shader":
LOD
Occlusion culling & visibility culling
Move large vertex operations to SPU, e.g. skinning
SPU triangle culling
There is a large range of techniques that can be applied to optimise vertex performance, often focusing on drawing only what you can see and spending the most time on the most important things. In addition, the SPU has successfully been used to perform vertex processing.
5. What About Pixels?
Fragment operations / post processing rarely optimised like geometry operations
Throw the whole operation at the GPU
Same operation done for every pixel
Spatial optimisation / branching considered too slow
SPU not considered: "too slow", "uses too much bandwidth"
Fragment operations are usually applied in a brute force manner. Techniques for spending the most time only where it makes the most difference, e.g. on edge cases, are rarely used because of the need for branching and potentially complex pre-passes. The SPU is rarely used for pixel processing because it is perceived as being too slow or needing too much bandwidth.
6. SPU Pixel Processing
Yes, the SPU is fast enough to process pixels
Won't beat the GPU in a brute force race
GPU specialises in rasterising triangles and sampling textures, with dedicated hardware
SPU is a general purpose processor
Use its flexibility to your advantage: choose different code branches and fast paths
The SPU is fast enough to perform some pixel processing tasks. Bandwidth is rarely an issue: every post process we have developed so far on SPU has been cycle limited, not bandwidth limited; the PS3 bus really is that fast. The GPU is very good at specific tasks, such as rasterisation and texture sampling, but it has strict limitations. The SPU is a general purpose processor: you can read and write data in any order you like, apply branches anywhere you like, and use fast paths to optimise the process.
7. Post Processing Effects on SPU: A Whirlwind Tour
A brief run-through of some post processes we have implemented on SPU, to contrast the differing approaches used.
8. What To Do on SPU
Options: offload whole processes from GPU to SPU, or use SPU and GPU together to do one process.
Use SPU to optimise GPU operations:
Depth buffer tile classification
DXT compress render targets on SPU
Use SPU to perform processes more suited to the SPU architecture:
Summed area table generation – much easier on SPU than GPU
Deferred lighting – SPU can pick fast paths per block of pixels
Use SPU to perform processes to offload the GPU:
Screen space ambient occlusion, volumetric lighting
Operations on SPU work in parallel with the GPU doing other work, minimising time on the critical path.
9. Depth of Field Pre-Process
High quality depth of field requires a long fragment shader:
Read depth samples and colour samples in a kernel / disc
Check depths against the centre pixel depth
Weight colours by the depth check results
Wasteful for most of the screen: all depth checks pass (out of focus) or all fail (in focus)
All fail == pass through the original buffer
All pass == use a pre-blurred buffer (separable Gaussian blur)
Categorise the screen for these cases on SPU
Depth of field is a very desirable but potentially slow post process. It works by performing a blur where each kernel weight is scaled by a function of the difference between the kernel sample depth and the pixel depth. As such the kernel cannot be separated, and the blur shader is therefore quite long and slow. This is in fact a waste of time for much of the screen: for most pixels, the depth differences are such that the weights are all 1 or all 0. If we can detect areas like this, we can run them through considerably shorter shaders and greatly reduce execution time.
10. Depth of Field Classification Results
Post process the depth buffer
Classify by min/max depth
Green: fully in focus
Blue: fully out of focus
Red: neither fully in nor out
We can perform this classification process on SPU. The process reads the depth buffer, processes it in tiles, classifies the results and outputs three lists of point sprites, one per classification type. The lists are then rendered on the GPU using different shaders to perform the effect. For more detail about this process please refer to my SCEE DevStation 2008 presentation on SPU post processing.
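The per-tile decision above can be sketched in plain C++. This is a minimal illustration, not PhyreEngine code: the `focusNear`/`focusFar` range and the three-way `TileClass` enum are assumptions standing in for whatever focus function the real shader evaluates.

```cpp
enum TileClass { InFocus, OutOfFocus, Mixed };

// Classify one tile of depth values against an assumed focus range:
// pixels with depth inside [focusNear, focusFar] are fully in focus.
TileClass classifyTile(const float* depths, int count,
                       float focusNear, float focusFar)
{
    float minZ = depths[0], maxZ = depths[0];
    for (int i = 1; i < count; ++i) {
        if (depths[i] < minZ) minZ = depths[i];
        if (depths[i] > maxZ) maxZ = depths[i];
    }
    if (minZ >= focusNear && maxZ <= focusFar) return InFocus;    // all weights 1
    if (maxZ < focusNear || minZ > focusFar)   return OutOfFocus; // all weights 0
    return Mixed;                                                 // full DOF shader
}
```

`InFocus` tiles pass through the original buffer, `OutOfFocus` tiles use the pre-blurred buffer, and only `Mixed` tiles pay for the long shader.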
11. Depth of Field Pre-Process Results
Pre-process only on SPU, blur operations on GPU
Goal: minimise overall frame time and latency
Large blur w.r.t. depth:
15 ms+ on GPU alone
1.5–2 ms on SPU + 3 ms on GPU
The result is that a version of the effect using classification techniques is massively faster than one without. The performance timings scale down if the original shader used is simpler.
12. Screen Tile Classification
Categorise the screen using the range of depth values within a tile
Powerful technique with many applications:
Full screen effect optimisation – DOF, SSAO...
Soft particles
Affecting lights
Occluder information
Soft particles: determine which particles need to be handled as "soft" by checking them against the depth min/max tiles. Only those intersecting the depth range need to be handled as "soft"; the rest can be handled as regular particles, or even skipped completely if they are totally obscured by the depth.
13. Screen Space Ambient Occlusion (SSAO)
Generate an ambient occlusion approximation using the depth buffer alone
Perform a large kernel-based series of depth comparisons and sum the results
Downsample output to ½ size for performance
Output normals for bilateral upsampling
For SSAO we perform the entire process on the SPU. The effect is generated and output to a texture which is sampled during rendering on the GPU.
14. SPU Screen Space Ambient Occlusion Results
GPU version: 10 ms+
SPU version: 6 ms on 2 SPUs
Used in the "Donkey Trader" PhyreEngine game template
The process can be kicked off after the depth pre-pass and parallelised with the shadow rendering pass on the GPU. The results are then available to be read during the main render pass.
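The kernel of depth comparisons at the heart of the SSAO pass can be sketched as below. This is a deliberately simplified scalar version: the sample offsets, the `bias` parameter and the flat occlusion count are illustrative assumptions; the real SPU implementation is vectorised and also outputs normals for the bilateral upsample.

```cpp
// Simplified SSAO occlusion term for one pixel of a flat depth buffer:
// count the kernel samples that lie in front of the centre depth
// (i.e. potential occluders) and attenuate accordingly.
float ssaoTerm(const float* depth, int width, int x, int y,
               const int (*offsets)[2], int numSamples, float bias)
{
    float centre = depth[y * width + x];
    int occluded = 0;
    for (int i = 0; i < numSamples; ++i) {
        float s = depth[(y + offsets[i][1]) * width + (x + offsets[i][0])];
        if (s + bias < centre)   // sample closer to camera: occluder
            ++occluded;
    }
    return 1.0f - (float)occluded / (float)numSamples;  // 1 = unoccluded
}
```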
16. Deferred Shading Overview
Rasterise geometry information to multiple "GBuffers" (geometry buffers):
colour, normal, depth, specular and material information
Apply lighting and shading as post processes
Multiple lights and spatial optimisations easy
Fewer shader combinations required
Has some negative points:
GBuffers consume memory and bandwidth
MSAA is problematic
Fixed BRDFs
18. Deferred Lighting on SPU
The SPU can handle the deferred lighting process
The GPU renders the geometry to GBuffers
SPU and GPU execute in parallel
Total time: max(geometry, lighting)
The SPU handles the deferred lighting process usually done on GPU. Parallelise the lighting process across multiple SPUs to improve performance. The GPU handles rasterisation processes, like rendering GBuffers, shadow maps, alpha passes, reflection geometry etc. The SPU may be slower than the GPU at light processing, but it is faster than doing it all in serial on the GPU. By moving such a large body of work off the GPU, we can greatly increase the overall frame rate when the GPU is the bottleneck.
19. Deferred Lighting on SPU: Implementation (1)
Process each pixel once
Work out which lights affect each pixel
Apply the N affecting lights in a loop
Process the screen in tiles
Use classification techniques per tile to optimise
To handle multiple lighting models, calculate all the different models and select based on light type. Branch per tile to optimise the set of light types used.
20. Deferred Lighting on SPU: Implementation (2)
Calculate affecting lights per tile:
Build a frustum around the tile using the min and max depth values in that tile
Perform a frustum check with each light's bounding volume
Compare the light direction with the tile's average normal value
Choose fast paths based on tile contents:
No lights affect the tile? Use a fast path
Check material values to see if any pixels are marked as lit
Comparing the light direction with the tile's average normal value can reject lights behind walls and so on.
21. Deferred Lighting on SPU: Implementation (3)
Choose whether to process MSAA per tile
If no sample pair values differ, light only one sample from the pair; otherwise light both samples separately
Typically quite few tiles need both MSAA samples lit
MSAA: why does this work? If there are no triangle edges or intersections, both samples rendered from the colour output will contain the same value.
[Image: tiles requiring MSAA]
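The per-tile MSAA decision reduces to one scan over the tile's sample pairs, for example (a sketch, with samples assumed packed as pairs of 32-bit colour values):

```cpp
#include <cstdint>

// For 2x MSAA, each pixel carries a pair of colour samples. If no pair
// in the tile differs, one lighting result per pixel suffices; only
// tiles containing triangle edges need per-sample lighting.
bool tileNeedsPerSampleLighting(const uint32_t* samples, int pixelCount)
{
    for (int i = 0; i < pixelCount; ++i)
        if (samples[2 * i] != samples[2 * i + 1])  // pair straddles an edge
            return true;
    return false;
}
```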
22. Deferred Lighting on SPU: Results
3 shadow casting lights, 100 point lights
2x MSAA, 720p; lighting performed per sample
Apply tone mapping on SPU – virtually free
Performance: > 60 fps, 3 SPUs for 11 ms each
No MSAA: 2 SPUs for 11 ms
The resulting performance is massively increased compared to a GPU-only equivalent. Tone mapping is virtually free on SPU: the accumulation is just a sum of all pixel luminance, easily rolled into the lighting calculations. The current frame's tone mapping is applied using the eye-adapted value from the previous frame. This maps much better to SPU than GPU. Why not roll colour correction and other post processing operations, e.g. a depth of field pre-process, into the full deferred render solution?
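The "virtually free" tone mapping can be sketched as follows. The luminance sum is accumulated as a side effect of the lighting loop, and the exposure applied comes from the previous frame's adapted value. The simple Reinhard-style L/(1+L) operator and all names here are illustrative assumptions, not PhyreEngine's actual curve.

```cpp
#include <cstddef>

// Running luminance accumulator, folded into the lighting loop.
struct ToneMapState { double lumSum = 0.0; size_t count = 0; };

// Tone map one pixel using the PREVIOUS frame's adapted exposure,
// while accumulating this frame's luminance for the next frame.
float toneMapPixel(float lum, float prevExposure, ToneMapState& s)
{
    s.lumSum += lum;               // essentially free: one add per pixel
    s.count  += 1;
    float l = lum * prevExposure;  // scale by last frame's adaptation
    return l / (1.0f + l);         // simple Reinhard-style operator
}

// After the frame: average luminance feeds next frame's eye adaptation.
float averageLuminance(const ToneMapState& s)
{
    return s.count ? (float)(s.lumSum / s.count) : 0.0f;
}
```

Using the previous frame's value removes the dependency between accumulation and application, which is what lets both live inside a single SPU lighting pass.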
23. Deferred Lighting on SPU: Issues
Potential latency: must keep the GPU busy while the SPU process is running
Render something else or add a frame of latency
Main memory requirements
Shadows require "random" texture access – not ideal for SPU
Can render shadows on GPU to a full screen buffer and use it on SPU
GBuffers must be in main memory to be read by the SPU, potentially requiring a lot of main memory. Find something else for the GPU to do that doesn't depend on SPU results: alpha geometry, reflections, shadow maps, effects? Otherwise, add a frame of latency.
24. Flavours of Deferred Lighting on SPU
Full deferred render on SPU: input all GBuffers, output the final composited result
Light pre-pass render on SPU: input normal and depth only; calculate the light result; sample it in a 2nd geometry pass
Light tile classification data output: SPU outputs information per tile about affecting lights; the lighting calculations are done on GPU
Different options exist depending on your limitations.
26. Volumetric Lighting
Also known as "god rays" or "light beams"
Simulates the effect of light illuminating dust particles in the air
Numerous fakes exist:
Artist-placed geometry
Artist-placed particles
Better: generate it using the shadow map
Works in the general case
27. Volumetric Lighting
Ray march through the shadow map:
Trace one ray per pixel in screen space
Sample the depth buffer to determine the end of the ray
Sample the shadow map at N points along the ray (N ~= 50)
Attenuate and sum the samples that passed
Blur and add noise
Potentially very slow: 50 texture samples times 1280x720 pixels? Downsample to ¼ width and height; the result is blurred anyway. There is a demo of this effect running on GPU in the NVIDIA SDK.
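The per-pixel ray march can be sketched in a few lines. This is a bare-bones scalar version under simplifying assumptions: `shadowTest` stands in for the real shadow map comparison, and per-step attenuation is reduced to a uniform weight.

```cpp
// March from the camera (depth 0) to the scene depth for this pixel,
// test each step against the shadow map, and accumulate the lit steps.
// The result is the fraction of lit "air" along the ray.
float volumetricRay(float endDepth, int numSteps,
                    bool (*shadowTest)(float depth))
{
    float sum = 0.0f;
    for (int i = 0; i < numSteps; ++i) {
        // sample at the midpoint of each step
        float t = endDepth * ((float)i + 0.5f) / (float)numSteps;
        if (shadowTest(t))          // step is lit: add in-scattering
            sum += 1.0f;
    }
    return sum / (float)numSteps;
}
```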
28. Volumetric Lighting
Effect is a bit too slow to be practical on GPU: ~5 ms
Do it on SPU instead:
Parallelises with the GPU easily
Result needed late in the render, at the compositing stage
Only needs depth and shadow map inputs
Problem: must randomly sample from the shadow map
Even after considerable downsampling the effect still takes over 5 ms on the GPU, too slow for our situation, so we decided to implement it on SPU instead. Unfortunately the effect requires random sampling of a shadow map texture, something which is difficult to map to the SPU.
29. Texture Sampling on SPU
"Random access" texture sampling is bad for SPU
It's bad for GPU too, but sometimes you just have to do it
GPU: fast access from texture cache; cache miss is slow; dedicated hardware handles lookups, filtering and wrapping
SPU: fast access from the "texture cache" (SPU local memory); slow access on cache miss (DMA from main memory); cache lookups are slow (no dedicated hardware); must manually handle filtering and wrapping (again, slow)
To work out the best way to do SPU texture sampling, consider the GPU. The GPU has a texture cache which stores a small portion of the texture in fast-to-access memory, with the rest in a slower, larger buffer elsewhere (main memory or VRAM). On the SPU, that texture cache is the SPU's local store. Accesses to it are fast, but if the data is not in cache it must be pulled in from main memory by DMA, which is slower. Also there is no dedicated hardware to manage the texture lookup; everything must be emulated in software.
30. Texture Sampling on SPU
Either: make the texture fit entirely in SPU local memory
Problem solved! Still inefficient: random accesses reduce register parallelism
Or: write a very good software cache
Locate potential cache misses early, long before you need the values
Avoid branches in the sampling code
If the texture can fit entirely in the SPU's local store, we can avoid the whole texture cache issue. If not, we have to write a software cache that can handle it. This software cache must be branchless for lookups, otherwise the performance of the calling code will be destroyed. This implies that cache misses must be caught and resolved early, so there are no DMAs in the main processing loop either.
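The shape of such a software cache can be sketched as a direct-mapped table: misses are detected and resolved in an early pass (where the real code would issue DMAs), so the inner-loop lookup is a branchless index computation plus a read. Sizes and names here are illustrative; the `memcpy` stands in for a DMA from main memory.

```cpp
#include <cstdint>
#include <cstring>

const int kEntries  = 32;     // cache entries (one tile each)
const int kTileSize = 1024;   // bytes per cached tile

struct TileCache {
    TileCache() { for (int i = 0; i < kEntries; ++i) tag[i] = -1; }
    int32_t tag[kEntries];            // which tile occupies each entry
    uint8_t data[kEntries][kTileSize];
};

// Early pass, before the main loop: pull in any tile not yet resident.
// On SPU this memcpy would be a DMA, kicked off long before use.
void resolveMiss(TileCache& c, int tile, const uint8_t* src)
{
    int entry = tile & (kEntries - 1);    // direct-mapped slot
    if (c.tag[entry] != tile) {
        std::memcpy(c.data[entry], src + tile * kTileSize, kTileSize);
        c.tag[entry] = tile;
    }
}

// Inner-loop lookup: the tile is guaranteed resident, so there is no
// branch and no DMA - just an index computation and a load.
inline uint8_t cacheRead(const TileCache& c, int tile, int offset)
{
    return c.data[tile & (kEntries - 1)][offset];
}
```

The direct-mapped scheme is what makes the lookup branch-free: the entry index is a pure function of the tile index, with no search over tags in the hot path.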
31. Volumetric Lighting on SPU
The volumetric light result will be blurred:
Don't need full shadow map accuracy
No filtering on texture samples needed
Downsample the shadow map from 1024x1024, 32 bit to 256x256, 16 bit
128 KB – fits in SPU local memory
Fast enough to sample on SPU
Fortunately for volumetric lighting, the whole shadow map can be made to fit in SPU local memory by downsampling it and reducing its precision.
32. Volumetric Lighting on SPU: Results
Takes ~11 ms on 1 SPU
The effect is fast and parallel enough to run on an SPU in the background while other work is done on the GPU.
33. Shadow Mapping on SPU (1)
Needs the full-size shadow map
1024x1024x32 bit == 4 MB: won't fit in SPU local memory
We'll have to write that "very good software cache", then
Pre-process the shadow map on SPU:
Calculate min and max depth for each tile
Store in a low resolution depth hierarchy map
Output the high resolution shadow map as cache tiles
Low resolution format: 64x64 for a 1k x 1k shadow map. Output the high resolution shadow map as a series of 16x16 tiles; they map to cache pages 1 KB in size.
34. Shadow Mapping on SPU (2)
Software cache with 32 entries; each entry is a shadow map tile
Branchless determination of the cache entry index for a tile index
Locate cache misses early:
While detiling depth data, work out the required shadow tiles
Pull in all cache-missed tiles
Sample the shadow map during lighting calculations:
All required shadow tiles are now definitely in cache, so the lookup is branchless
It's still quite slow: locating the tile in cache per pixel has a cost
35. Shadow Mapping on SPU (3)
Optimise via special cases to win back performance
Use the low resolution shadow tile map (always in SPU local memory):
If pixel shadow z > tile max Z: definitely in shadow
If pixel shadow z < tile min Z: definitely not in shadow
Check the low resolution map before triggering cache fetches
Classify whole screen tiles as in or out of shadow; no need to sample the high resolution shadow map at all for those tiles
The low resolution shadow min/max depth map greatly optimises the process by skipping high resolution shadow reads where the whole shadow tile is entirely in or out of shadow, and skipping shadow lookups entirely where the whole screen tile is in or out of shadow. By doing all this we achieve good performance, good enough to make it practical to sample shadow maps on the SPU. This screen tile information can also be output in a pre-process step, similar to the depth of field classification, and used for deferred shadowing on the GPU: only sample the shadow map on GPU for the edge cases where a tile falls on the border between in and out of shadow, and run the rest of the screen through a fast path. This can greatly optimise the performance of deferred shadowing.
[Image: tiles requiring high resolution shadow samples]
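The low resolution min/max test above reduces to two comparisons per pixel, resolving the common cases without touching the high resolution shadow map. A minimal sketch, with illustrative names and assuming a standard depth convention where larger z means further from the light:

```cpp
enum ShadowResult { FullyShadowed, FullyLit, NeedsHighRes };

// Test a pixel's depth in light space against a low resolution shadow
// tile's min/max occluder depths. Only the ambiguous middle case pays
// for a high resolution shadow map sample (and a possible cache fetch).
ShadowResult testShadowTile(float pixelZ, float tileMinZ, float tileMaxZ)
{
    if (pixelZ > tileMaxZ) return FullyShadowed;  // behind every occluder
    if (pixelZ < tileMinZ) return FullyLit;       // in front of them all
    return NeedsHighRes;                          // must sample full map
}
```

Applied per screen tile rather than per pixel, the same test yields the classification used for deferred shadowing on the GPU.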
37. Conclusion
New additions to your toolbox:
Tile-based classification techniques on SPU
Deferred lighting on SPU
Texture sampling on SPU
Rendering is no longer just a GPU problem
Use the general purpose nature of the SPU to your advantage
Rethink fragment processing optimisation strategies
Make the GPU work smarter, not harder
Key takeaways: reconsider your approach to fragment processing operations. Use tile-based classification on the SPU to optimise heavy fragment processes. Move 2D fragment processing operations, such as post process effects or deferred lighting, to the SPU. Texture sampling on the SPU is possible too.
38. Conclusion
Some titles are already using SPU post processing, e.g. Killzone 2
PhyreEngine™ is here to help
(If you're a registered PS3 developer) it's on DevNet now
Not just an engine: also a reference, with full source
Download it, learn from it, steal bits of the code
Check out the PhyreEngine™ SPU Post Processing Library
Much of the work you need for SPU post processing has already been done for you: download PhyreEngine and you'll find a complete engine with full source code which implements the effects in this presentation. It also provides the necessary GPU/SPU sync framework and many useful utilities to aid post processing, such as de-tiling of main memory render targets.