HOLY SMOKE! FASTER PARTICLE RENDERING USING DIRECTCOMPUTE
AMD AND MICROSOFT DEVELOPER DAY, JUNE 2014, STOCKHOLM
GARETH THOMAS, 2ND JUNE 2014


2 | FASTER PARTICLE RENDERING USING DIRECTCOMPUTE | AMD AND MICROSOFT GAME DEVELOPER DAY - JUNE 2 2014, STOCKHOLM
PLAN FOR TODAY
• Simulation Overview
• Collisions
• Sorting
• Tiled Rendering
• Conclusions

3 OVERVIEW: MOTIVATION
• Why use the GPU for simulation?
  ‒ Highly parallel workload
  ‒ Free your CPU to do other cool stuff
  ‒ Leverage compute
  ‒ Take advantage of the Local Data Share (LDS)
  ‒ Asynchronous compute on some platforms

4 OVERVIEW: HOW TO BUILD A GPU PARTICLE SYSTEM
• Emit
• Simulate
• Sort
• Render
  ‒ Rasterize billboards
  ‒ Tiled Rendering using DirectCompute

5 SIMULATION OVERVIEW: HOW THE SIMULATION FITS TOGETHER
• Emit Compute Shader: reads free indices from the Dead List and writes new particle data into the global Particle Array.
• Simulate Compute Shader: updates particles; adds alive ones to the Alive List and dead ones to the Dead List.
• Particle Array: persistent global array of particle data.
• Dead List: persistent list of free particle indices.
• Alive List: list of alive particle indices, rebuilt each frame by the Simulate Compute Shader.
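The way the three buffers interact can be sketched on the CPU. This is a minimal illustration, not the shader code from the talk: the `Particle` fields, the `ParticlePool` type and the rule that a slot is free when `remainingLife <= 0` are all assumptions made for the example.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical particle record; the slides do not list the real fields.
struct Particle {
    float remainingLife = 0.0f;   // <= 0 means the slot is free
};

struct ParticlePool {
    std::vector<Particle> particles;   // persistent particle array
    std::vector<uint32_t> deadList;    // persistent stack of free indices
    std::vector<uint32_t> aliveList;   // rebuilt by every simulate() call

    explicit ParticlePool(uint32_t capacity) : particles(capacity) {
        for (uint32_t i = 0; i < capacity; ++i) deadList.push_back(i);
    }

    // Emit: pop free indices from the dead list and initialise those slots.
    // Returns how many particles were actually emitted.
    uint32_t emit(uint32_t count, float lifespan) {
        uint32_t emitted = 0;
        while (emitted < count && !deadList.empty()) {
            uint32_t index = deadList.back();
            deadList.pop_back();
            particles[index].remainingLife = lifespan;
            ++emitted;
        }
        return emitted;
    }

    // Simulate: age every live particle, append survivors to the alive
    // list and return expired indices to the dead list.
    void simulate(float dt) {
        aliveList.clear();
        for (uint32_t i = 0; i < particles.size(); ++i) {
            Particle& p = particles[i];
            if (p.remainingLife <= 0.0f) continue;   // slot already free
            p.remainingLife -= dt;
            if (p.remainingLife > 0.0f) aliveList.push_back(i);
            else deadList.push_back(i);
        }
    }
};
```

On the GPU the two lists would be appended to with atomic counters rather than `push_back`, but the ownership of indices is the same: an index is always in exactly one of the dead list or the set of live slots.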

6 COLLISIONS: A GPU-BASED SOLUTION
• Can no longer use CPU-side physics engine for collisions
• Use depth buffer [Tchou11]
  ‒ Project particle into screen space and read depth buffer
  ‒ Project particle into view space
  ‒ Transform depth buffer value into view space and compare depths
• Generate collision response
  ‒ Use G-buffer normals
  ‒ Or take multiple depth samples to reconstruct the normal
[Diagram: view-space Z axis with particle positions P(n) and P(n+1) and the assumed surface thickness]
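The depth comparison and response described above might look roughly like this on the CPU. It is a sketch under stated assumptions: a D3D-style perspective projection with depth in [0, 1], view-space +Z pointing into the screen, and hypothetical helper names (`linearizeDepth`, `collidesWithDepth`, `bounce`) that do not come from the talk.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Convert a [0,1] depth-buffer value to view-space depth, assuming a
// standard D3D-style perspective projection with the given clip planes.
float linearizeDepth(float deviceDepth, float zNear, float zFar) {
    return (zNear * zFar) / (zFar - deviceDepth * (zFar - zNear));
}

// A particle collides when it is behind the surface seen in the depth
// buffer but still within 'thickness' of it. Anything deeper is assumed
// to be in free space behind the geometry.
bool collidesWithDepth(float particleViewZ, float sceneViewZ, float thickness) {
    return particleViewZ > sceneViewZ && particleViewZ < sceneViewZ + thickness;
}

// Simple collision response: reflect the velocity about the unit surface
// normal (for example a G-buffer normal) and damp it by 'restitution'.
Vec3 bounce(Vec3 velocity, Vec3 normal, float restitution) {
    float d = velocity.x * normal.x + velocity.y * normal.y + velocity.z * normal.z;
    return { (velocity.x - 2.0f * d * normal.x) * restitution,
             (velocity.y - 2.0f * d * normal.y) * restitution,
             (velocity.z - 2.0f * d * normal.z) * restitution };
}
```

For example, a particle falling straight down onto an upward-facing surface with a restitution of 0.5 leaves with half its speed, pointing straight up.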

7 COLLISIONS: PROBLEMS WITH USING THE DEPTH BUFFER
• Only collides against geometry in the depth buffer
• Particles would collide against depth buffer even if they are behind the geometry
  ‒ Use a thickness value to assume particles are in free space behind geometry
• Particles don’t collide when they are off screen
  ‒ Causes issues when particles that are at rest on the floor have gone off-screen and have now disappeared (fallen through world!)
  ‒ Put particles to sleep in the simulation once they have come to rest
  ‒ Use G-buffer to mark parts of the scene that particles can sleep on (static objects)
• Not Multi-GPU Friendly!
  ‒ Switch off depth buffer collisions in MGPU mode

8 BITONIC SORT
Array: 7 3 6 8 1 4 2 5
    for( subArraySize=2; subArraySize<=ArraySize; subArraySize*=2 )
    {
        for( compareDist=subArraySize/2; compareDist>0; compareDist/=2 )
        {
            // Begin: GPU part of the sort
            for each element n
                n = selectBitonic( n, n^compareDist );
            // End: GPU part of the sort
        }
    }

9-14 BITONIC SORT (PASSES 1 TO 6)
Each pass is one iteration of the inner loop shown on slide 8. The array is listed as it stands at the start of the pass.
    Pass   subArraySize   compareDist   Array
    1      2              1             7 3 6 8 1 4 2 5
    2      4              2             3 7 8 6 1 4 5 2
    3      4              1             3 6 8 7 5 4 1 2
    4      8              4             3 6 7 8 5 4 2 1
    5      8              2             3 4 2 1 5 6 7 8
    6      8              1             2 1 3 4 5 6 7 8
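For reference, the loop from the sort slides can be written as a plain CPU routine. This is a sketch, not the shader from the talk: `bitonicPass` and `bitonicSort` are illustrative names, the compare-and-swap stands in for `selectBitonic`, and the array size is assumed to be a power of two. On the GPU, each call to `bitonicPass` would correspond to one dispatch (or one barrier-separated step in LDS).

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One pass: every element is compared with the partner whose index differs
// in the 'compareDist' bit. Blocks of 'subArraySize' elements alternate
// between ascending and descending order.
void bitonicPass(std::vector<int>& data, std::size_t subArraySize,
                 std::size_t compareDist) {
    for (std::size_t i = 0; i < data.size(); ++i) {
        std::size_t partner = i ^ compareDist;
        if (partner <= i) continue;                    // handle each pair once
        bool ascending = (i & subArraySize) == 0;
        if ((data[i] > data[partner]) == ascending)
            std::swap(data[i], data[partner]);
    }
}

// Full sort, ascending. data.size() must be a power of two.
void bitonicSort(std::vector<int>& data) {
    for (std::size_t subArraySize = 2; subArraySize <= data.size();
         subArraySize *= 2) {
        for (std::size_t compareDist = subArraySize / 2; compareDist > 0;
             compareDist /= 2) {
            bitonicPass(data, subArraySize, compareDist);
        }
    }
}
```

Running the first pass (`subArraySize` 2, `compareDist` 1) on 7 3 6 8 1 4 2 5 gives 3 7 8 6 1 4 5 2, and the full sort takes six passes for eight elements.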

15 RENDERING
• Inputs: Sorted Alive List, Particle Pool
• Vertex Shader: read particle buffer.
• Geometry Shader: expand one point to four. Billboard in view space.
• Pixel Shader: texturing and tinting. Depth fade for soft particles.

16 RENDERING
• Inputs: Sorted Alive List, Particle Pool, Index Buffer
• Vertex Shader: read particle buffer and billboard in view space.
• Pixel Shader: texturing and tinting. Depth fade for soft particles.

17 RENDERING: RASTERIZATION – FOR OLD SCHOOL GPU PARTICLE SYSTEMS
• The alive particle count is only available on the GPU
  ‒ Use Indirect API
• DrawInstancedIndirect( GPU-args ) for Geometry Shader billboards
  ‒ D3DPT_POINTLIST with no VB, IB or IA
  ‒ VertexId = Particle index
  ‒ VertexCountPerInstance = NumParticles
  ‒ InstanceCount = 1
  ‒ Geometry Shader expands the point into four vertices and a two-triangle strip per billboard
• Or better still: DrawIndexedInstancedIndirect( GPU-args )
  ‒ D3DPT_TRIANGLELIST, use IB
  ‒ VertexId / 4 = Particle index
  ‒ VertexId % 4 = Billboard corner index
  ‒ IndexCountPerInstance = NumParticles * 6
  ‒ InstanceCount = 1
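The indexed variant relies on a static index buffer with six indices per billboard. A possible CPU-side setup is sketched below; the corner order (0, 1, 2, 2, 1, 3) and the struct and function names are assumptions for illustration, and the argument struct only mirrors the field order of the D3D11 indexed indirect arguments. In the real system the arguments are written on the GPU, since the particle count is not known on the CPU.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Six indices (two triangles) per particle over four corner vertices.
std::vector<uint32_t> buildBillboardIndices(uint32_t maxParticles) {
    static const uint32_t corners[6] = {0, 1, 2, 2, 1, 3};   // assumed winding
    std::vector<uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(maxParticles) * 6);
    for (uint32_t p = 0; p < maxParticles; ++p)
        for (uint32_t c : corners)
            indices.push_back(p * 4 + c);
    return indices;
}

// What the vertex shader derives from the vertex id.
struct BillboardVertex {
    uint32_t particleIndex;   // VertexId / 4
    uint32_t cornerIndex;     // VertexId % 4
};

BillboardVertex decodeVertexId(uint32_t vertexId) {
    return { vertexId / 4, vertexId % 4 };
}

// Hypothetical mirror of the indexed indirect draw arguments.
struct IndexedIndirectArgs {
    uint32_t indexCountPerInstance;
    uint32_t instanceCount;
    uint32_t startIndexLocation;
    int32_t  baseVertexLocation;
    uint32_t startInstanceLocation;
};

IndexedIndirectArgs makeIndirectArgs(uint32_t numParticles) {
    return { numParticles * 6, 1, 0, 0, 0 };
}
```

With this layout vertex id 7 decodes to particle 1, corner 3, and ten live particles need an index count of 60.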

18 RENDERING: PROBLEMS WITH RASTERIZATION
• Overdraw from large particles kills game performance!
  ‒ Get artists to throttle back on the VFX
• Optimizations
  ‒ Tightly fit polygons around texture [Persson09]
  ‒ Render to smaller buffer [Cantlay07]
  ‒ Sorting issues
  ‒ Loss of fidelity

19 TILED RENDERING: OVERVIEW
• Inspired by Forward+ [Harada12]
  ‒ Screen-space binning of particles instead of point lights!
• Use a 32x32 thread group to shade a 32x32 pixel tile in screen space
  ‒ Cull particles (just like Forward+)
  ‒ Sort particles
  ‒ Per pixel/thread
    ‒ Evaluate colour of each particle
    ‒ Blend together
  ‒ Composite back onto scene

20 TILED RENDERING
• Divide screen into tiles
• Build index lists of intersecting particles per tile
[Diagram: particles 1, 2 and 3 overlapping three tiles, giving the per-tile index lists [1], [1,2,3] and [2,3]]
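The binning step can be illustrated with a small CPU routine. This is only a sketch: it treats each particle as a screen-space circle and uses a conservative bounding-box test, whereas the talk culls against per-tile frusta in a compute shader. `ScreenParticle` and `binParticles` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical screen-space footprint of a particle, in pixels.
struct ScreenParticle { float x, y, radius; };

// Returns one index list per tile (row-major). A particle is added to every
// tile touched by the bounding box of its circle, which is conservative.
std::vector<std::vector<uint32_t>> binParticles(
        const std::vector<ScreenParticle>& particles,
        int screenWidth, int screenHeight, int tileSize) {
    const int tilesX = (screenWidth + tileSize - 1) / tileSize;
    const int tilesY = (screenHeight + tileSize - 1) / tileSize;
    std::vector<std::vector<uint32_t>> tiles(
        static_cast<std::size_t>(tilesX) * tilesY);

    for (uint32_t i = 0; i < particles.size(); ++i) {
        const ScreenParticle& p = particles[i];
        const float size = static_cast<float>(tileSize);
        int minX = std::max(0, static_cast<int>(std::floor((p.x - p.radius) / size)));
        int maxX = std::min(tilesX - 1, static_cast<int>(std::floor((p.x + p.radius) / size)));
        int minY = std::max(0, static_cast<int>(std::floor((p.y - p.radius) / size)));
        int maxY = std::min(tilesY - 1, static_cast<int>(std::floor((p.y + p.radius) / size)));
        // Off-screen particles produce an empty range and are skipped.
        for (int ty = minY; ty <= maxY; ++ty)
            for (int tx = minX; tx <= maxX; ++tx)
                tiles[static_cast<std::size_t>(ty) * tilesX + tx].push_back(i);
    }
    return tiles;
}
```

With three 32-pixel tiles in a row, one particle straddling the first two tiles and two particles straddling the last two reproduce the pattern from the diagram (zero-based here: [0], [0,1,2], [1,2]).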

21 TILED RENDERING
• View space asymmetric frustum generated per tile
• Use camera’s near plane
• Use camera’s far plane
• Or calculate far plane from depth buffer
[Diagram: per-tile frusta for Tile0, Tile1, Tile2 and Tile3]
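One way to build and test such a per-tile frustum is sketched below. The conventions are assumptions, not taken from the talk: view space has +Z pointing into the screen, the tile is given by its bounds in normalised device coordinates, and the camera is described by the tangents of its half field of view. The four side planes pass through the eye, so only their inward-facing unit normals are stored.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };

// Inward-facing unit normal of a plane through the view-space origin.
Vec3 makePlaneNormal(float x, float y, float z) {
    float len = std::sqrt(x * x + y * y + z * z);
    return { x / len, y / len, z / len };
}

struct TileFrustum {
    Vec3 sides[4];      // left, right, bottom, top
    float zNear, zFar;  // camera planes, or far plane from the depth buffer
};

// A view-space point (X, Y, Z) lands at NDC x = X / (Z * tanHalfFovX), so
// the left edge of the tile is the plane X - ndcMinX * tanHalfFovX * Z = 0.
TileFrustum makeTileFrustum(float ndcMinX, float ndcMaxX,
                            float ndcMinY, float ndcMaxY,
                            float tanHalfFovX, float tanHalfFovY,
                            float zNear, float zFar) {
    TileFrustum f;
    f.sides[0] = makePlaneNormal( 1.0f,  0.0f, -ndcMinX * tanHalfFovX);
    f.sides[1] = makePlaneNormal(-1.0f,  0.0f,  ndcMaxX * tanHalfFovX);
    f.sides[2] = makePlaneNormal( 0.0f,  1.0f, -ndcMinY * tanHalfFovY);
    f.sides[3] = makePlaneNormal( 0.0f, -1.0f,  ndcMaxY * tanHalfFovY);
    f.zNear = zNear;
    f.zFar = zFar;
    return f;
}

// Conservative sphere test: reject only when the sphere lies entirely
// outside one of the planes.
bool sphereIntersectsTile(const TileFrustum& f, Vec3 centre, float radius) {
    if (centre.z + radius < f.zNear || centre.z - radius > f.zFar) return false;
    for (const Vec3& n : f.sides) {
        float dist = n.x * centre.x + n.y * centre.y + n.z * centre.z;
        if (dist < -radius) return false;
    }
    return true;
}
```

The frustum is asymmetric because an off-centre tile has side planes that are not mirror images of each other about the view axis.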

22 TILED RENDERING

23 TILED RENDERING: THREAD GROUP VIEW
• numthreads[ 32, 32, 1 ]
• Culling 1024 particles in parallel
• Add to LDS index list
• Write out to memory
  ‒ Particle count
  ‒ Particle indices

24 TILED RENDERING: TILE COMPLEXITY

25 TILED RENDERING: PER TILE BITONIC SORT
• Cannot rely on a globally sorted list of particles
  ‒ Because 1024 particles get culled in parallel, they get added to the visible list in arbitrary order
• Need to sort particles per-tile
  ‒ This is a good thing!
  ‒ Only need to sort a subset of the global list
  ‒ Sort in a single pass in LDS, rather than in multiple passes through main memory

26 TILED RENDERING: EVALUATING TILE COLOUR
• numthreads[ 32, 32, 1 ]: 1 thread = 1 pixel in screen space
• Set accumulation colour to float4( 0, 0, 0, 0 )
• For each particle in tile (back to front)
  ‒ Evaluate particle contribution
    ‒ UV generation & radius check
    ‒ Texture lookup
    ‒ Normal generation and lighting
  ‒ Manually blend
    ‒ Colour = ( srcA x srcCol ) + ( invSrcA x destCol )
    ‒ Alpha = srcA + ( invSrcA x destA )
• Write result to screen size UAV

27 TILED RENDERING: EVALUATING TILE COLOUR – IMPROVED!!!
• numthreads[ 32, 32, 1 ]: 1 thread = 1 pixel in screen space
• Set accumulation colour to float4( 0, 0, 0, 0 )
• For each particle in tile (front to back)
  ‒ Evaluate particle contribution
    ‒ UV generation & radius check
    ‒ Texture lookup
    ‒ Normal generation and lighting
  ‒ Manually blend [Bavoil08]
    ‒ Colour = ( invDestA x srcA x srcCol ) + destCol
    ‒ Alpha = srcA + ( invSrcA x destA )
  ‒ if ( accumulation alpha > threshold ) accumulation alpha = 1 and bail
• Write result to screen size UAV

28 TILED RENDERING: COARSE CULLING
• Bin particles into 8x8 grid
• For each particle
  ‒ For each bin
    ‒ Test particle against bin
    ‒ Add particle if visible
• UAV0 for particle indices (size = 8 x 8 x maxparticles)
  ‒ Array split into 64 bins using offsets
• UAV1 for storing particle count per bin (size = 8 x 8)
  ‒ 1 element per bin
  ‒ Use InterlockedAdd() to bump bin’s counter
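The two-buffer layout can be mimicked on the CPU, with `std::atomic::fetch_add` standing in for `InterlockedAdd()`. Everything else here is an assumption for illustration: particles are screen-space circles in normalised [0, 1] coordinates, the bin test is a circle-against-rectangle check rather than a frustum test, and the loop is serial where the shader runs one thread per particle.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t kBinsX = 8;
constexpr uint32_t kBinsY = 8;

// Hypothetical particle footprint in normalised screen coordinates [0, 1].
struct ScreenParticle { float x, y, radius; };

struct CoarseBins {
    uint32_t maxParticles;
    std::vector<uint32_t> indices;               // "UAV0": 8 x 8 x maxParticles
    std::vector<std::atomic<uint32_t>> counts;   // "UAV1": one counter per bin

    explicit CoarseBins(uint32_t maxParticlesIn)
        : maxParticles(maxParticlesIn),
          indices(static_cast<std::size_t>(kBinsX) * kBinsY * maxParticlesIn),
          counts(kBinsX * kBinsY) {
        for (auto& c : counts) c.store(0);
    }

    // Reserve a slot with an atomic add, then write into the bin's
    // sub-range. Each particle is added to a bin at most once, so the
    // slot never exceeds maxParticles.
    void add(uint32_t bin, uint32_t particleIndex) {
        uint32_t slot = counts[bin].fetch_add(1);
        indices[static_cast<std::size_t>(bin) * maxParticles + slot] = particleIndex;
    }
};

// Circle against bin rectangle: clamp the centre to the rectangle and
// compare the remaining distance with the radius.
bool overlapsBin(const ScreenParticle& p, uint32_t bx, uint32_t by) {
    const float binW = 1.0f / kBinsX, binH = 1.0f / kBinsY;
    const float minX = bx * binW, minY = by * binH;
    const float cx = std::max(minX, std::min(p.x, minX + binW));
    const float cy = std::max(minY, std::min(p.y, minY + binH));
    const float dx = p.x - cx, dy = p.y - cy;
    return dx * dx + dy * dy <= p.radius * p.radius;
}

void coarseCull(const std::vector<ScreenParticle>& particles, CoarseBins& bins) {
    for (uint32_t i = 0; i < particles.size(); ++i)
        for (uint32_t by = 0; by < kBinsY; ++by)
            for (uint32_t bx = 0; bx < kBinsX; ++bx)
                if (overlapsBin(particles[i], bx, by))
                    bins.add(by * kBinsX + bx, i);
}
```

Splitting one large index array into fixed-size sub-ranges keeps the write address a simple function of the bin and the reserved slot, at the cost of sizing every bin for the worst case.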

29 TILED RENDERING: COMPUTE SHADER SETUP
    Compute shader             Thread group             Mapping
    Simulation                 numthreads[256, 1, 1]    1 thread per particle
    Coarse Culling             numthreads[256, 1, 1]    1 thread per particle
    Tile Culling and Sorting   numthreads[32, 32, 1]    1 thread per particle
    Tile Rendering             numthreads[32, 32, 1]    1 thread per pixel
[Diagram rows: Compute Shaders, LDS, Shader Output. Data labels: updated particle data; per-bin particle indices; per-bin frustum planes; per-tile particle indices and distances; per-tile sorted particle indices; particle data (position, radius, colour etc); screen space colour buffer]

30 PERFORMANCE RESULTS: Default View, ~35K particles
    Mode            Frame time (ms)*
    Rasterization   5.2
    Tiled           3.4

    Breakdown                  Frame time (ms)*
    Simulation                 0.50
    Coarse Culling             0.06
    Tile Culling and Sorting   0.37
    Tiled Rendering            1.86

* AMD Radeon R9 290X @ 1080p

31 PERFORMANCE RESULTS: In Smoke View, ~35K particles
    Mode            Frame time (ms)*
    Rasterization   27.3
    Tiled           6.2

* AMD Radeon R9 290X @ 1080p
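As a quick reading of the two result slides, the reported frame times work out as follows. The last line assumes the breakdown on the default-view slide refers to the tiled mode, which the slide does not state explicitly.

```latex
\begin{align*}
\text{Default view speed-up:}\quad & \frac{5.2\ \text{ms}}{3.4\ \text{ms}} \approx 1.53\times \\
\text{In-smoke view speed-up:}\quad & \frac{27.3\ \text{ms}}{6.2\ \text{ms}} \approx 4.40\times \\
\text{Default view breakdown total:}\quad & 0.50 + 0.06 + 0.37 + 1.86 = 2.79\ \text{ms}
\end{align*}
```

The gain is much larger inside the smoke, where overdraw is heaviest.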

32 CONCLUSIONS
• Depth buffer collisions
  ‒ Great bang-for-buck
  ‒ Not perfect!
• Bitonic sort
  ‒ Good fit for sorting on the GPU
• Tiled Rendering
  ‒ Faster than rasterization
  ‒ Great for combatting heavy overdraw
  ‒ More predictable behaviour
• Future work
  ‒ Add arbitrary geometry for OIT
  ‒ Volume tracing

33 QUESTIONS?
• Demo with full source coming soon
• http://developer.amd.com/tools/graphics-development/amd-radeon-sdk/

34 REFERENCES
• [Tchou11] Chris Tchou, “Halo Reach Effects Tech”, GDC 2011
• [Persson09] Emil Persson, http://www.humus.name/index.php?page=News&ID=266
• [Cantlay07] Iain Cantlay, “High-Speed, Off-Screen Particles”, GPU Gems 3, 2007
• [Harada12] Takahiro Harada et al., “Forward+: Bringing Deferred Lighting to the Next Level”, Short Papers, Eurographics 2012
• [Bavoil08] Louis Bavoil et al., “Order Independent Transparency with Dual Depth Peeling”, 2008

35 DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.

