Grass, Fur and all things hairy. Nicolas Thibieroz, Gaming Engineering Manager, AMD; Karl Hillesland, Senior Research Engineer, AMD.


1 Grass, Fur and all things hairy
Nicolas Thibieroz, Gaming Engineering Manager, AMD
Karl Hillesland, Senior Research Engineer, AMD

2 Next-gen Grass, Fur and Hair
- The time for next-gen quality is now
- Tomb Raider pioneered next-gen hair, even on PS4/XB1
- Users expect this level of quality for next-gen titles
- You need to start thinking about this
- This talk is about making high-quality fur, grass and hair run at real-time performance

3 TressFX applied to Grass, Fur and Hair
- Variations of the same technique can be used for all those applications
- In all cases the core principles of next-gen quality are still needed:
  - Compute simulations
  - Anti-aliasing
  - Transparency
  - Volumetric self-shadowing
  - A good lighting model

4 Forward Rendering Pipeline – a refresher
- Consists of three steps:
  - Hair simulation
  - Shade and store fragments into buffers
  - Fetch shaded fragments, sort and render

5 Per-Pixel Linked Lists
- Head UAV: each pixel location has a head pointer to a linked list in the PPLL UAV
- PPLL UAV: as new fragments are rendered, they are added to the next open location in the PPLL (using the UAV counter)
- A link is created to the fragment pointed to by the head pointer
- The head pointer then points to the new fragment
    // Retrieve current pixel count and increase counter
    uint uPixelCount = LinkedListUAV.IncrementCounter();
    uint uOldStartOffset;
    // Exchange indices in LinkedListHead texture corresponding to pixel location
    InterlockedExchange(LinkedListHeadUAV[address], uPixelCount, uOldStartOffset);
    // Append new element at the end of the Fragment and Link Buffer
    Element.uNext = uOldStartOffset;
    LinkedListUAV[uPixelCount] = Element;
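
The insertion steps on this slide can be mirrored on the CPU. The following is a minimal C++ sketch, not the talk's shader: `PixelLists`, `Node`, `kEndOfList` and the pool-exhaustion check are illustrative assumptions, with `std::atomic` operations standing in for `IncrementCounter()` and `InterlockedExchange()`.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint32_t kEndOfList = 0xFFFFFFFFu;  // assumed "empty list" marker

struct Node {        // illustrative node: 3 x 4 bytes
    uint32_t depth;
    uint32_t data;   // packed fragment payload (hypothetical)
    uint32_t next;   // index of the next node, or kEndOfList
};

struct PixelLists {
    std::vector<std::atomic<uint32_t>> head;  // one head pointer per pixel
    std::vector<Node> nodes;                  // shared node pool
    std::atomic<uint32_t> counter{0};         // stands in for the UAV counter

    PixelLists(std::size_t pixelCount, std::size_t poolSize)
        : head(pixelCount), nodes(poolSize) {
        for (auto& h : head) h.store(kEndOfList);
    }

    // Returns false when the pool is full, in which case the fragment is dropped.
    bool insert(std::size_t pixel, uint32_t depth, uint32_t data) {
        assert(pixel < head.size());
        const uint32_t slot = counter.fetch_add(1);           // next open location
        if (slot >= nodes.size()) return false;
        const uint32_t oldHead = head[pixel].exchange(slot);  // head now points to the new node
        nodes[slot] = Node{depth, data, oldHead};             // link to the previous head
        return true;
    }
};
```

New fragments are pushed at the front of each list, so a list comes out in reverse submission order and still needs sorting by depth at resolve time.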

6 Forward Rendering Pipeline – a refresher: Hair Simulation
[Diagram: input geometry (model space) and simulation parameters feed a compute shader (CS), which writes post-simulation geometry (world space) to a UAV]

7 Forward Rendering Pipeline – a refresher: Shade and Store Fragments into Buffers
[Diagram: VS (world space to homogeneous clip space; extrusion from line segments to non-indexed triangles) feeds PS (lighting, shadows, coverage); outputs are a null RT, stencil, the Head UAV and the PPLL UAV; each PPLL node holds depth, color, coverage and next]

8 Forward Rendering Pipeline – a refresher: Fetch Shaded Fragments, Sort and Render
[Diagram: a full-screen quad runs VS and PS with stencil; the PS reads the Head UAV and PPLL UAV, performs fragment sorting and manual blending, and writes to the render target]
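
The "fragment sorting and manual blending" step can be illustrated for a single pixel. This is a hedged C++ sketch under simplifying assumptions: one color channel, non-premultiplied alpha, and a standard "over" blend. The names `Fragment` and `resolvePixel` are hypothetical and do not come from the talk.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Fragment {
    float depth;  // larger = further from the camera (assumption)
    float color;  // single channel for brevity
    float alpha;  // coverage/opacity in [0, 1]
};

// Sorts the pixel's fragments back to front, then blends each one over the
// running result, starting from the background color.
float resolvePixel(std::vector<Fragment> frags, float background) {
    std::sort(frags.begin(), frags.end(),
              [](const Fragment& a, const Fragment& b) { return a.depth > b.depth; });
    float result = background;
    for (const Fragment& f : frags) {
        assert(f.alpha >= 0.0f && f.alpha <= 1.0f);
        result = f.color * f.alpha + result * (1.0f - f.alpha);
    }
    return result;
}
```

Sorting every fragment of a deep pixel is the expense that the "front k" approximation later in the talk aims to avoid.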

9 Forward Rendering Performance
- Main cost in forward rendering mode is in the shading part
  - All fragments are lit and shadowed before being stored
  - PPLL storing is typically not the bottleneck!
- Don't need maximum quality on all fragments
  - Tail fragments only need "good enough" quality
- Solution: use shader LOD

10 Forward vs Deferred Rendering Pipeline
- Forward rendering pipeline
  - Hair simulation
  - Full shading and store fragments into buffers
  - Fetch shaded fragments, sort and render
- Deferred rendering pipeline
  - Hair simulation
  - Store fragment properties into buffers
  - Fetch fragment properties, sort, shade and render
    - Full shading on K frontmost fragments
    - Tail fragments are shaded with a simpler light equation and shadowing algorithm

11 Deferred Rendering Pipeline: Hair Simulation – unchanged!
[Diagram: input geometry (model space) and simulation parameters feed a compute shader (CS), which writes post-simulation geometry (world space) to a UAV]

12 Deferred Rendering Pipeline – a refresher: Store Fragment Properties into Buffers
[Diagram: VS (world space to homogeneous clip space; index buffer, indexed triangle list) feeds PS (coverage); outputs are a null RT, stencil, the Head UAV and the PPLL UAV; each PPLL node holds depth, tangent, coverage and next]

13 Deferred Rendering Pipeline: Fetch Fragments, Sort, Shade and Render
- K frontmost fragments: full shading, sorting and manual blending
- Tail fragments: cheap shading, no sorting, and manual blending
[Diagram: a full-screen quad runs VS and PS with stencil; the PS reads the Head UAV and PPLL UAV, applies lighting and shadows, and writes to the render target]

14 Deferred Rendering Shading LOD Optimization
- Deferred approach allows a reduction in shading cost
- Shader LOD
  - Only sort and shade K frontmost fragments at high quality
  - Simple shading and out-of-order rendering on tail fragments
  - Single-tap shadowing on tail fragments
- Very little quality difference compared to full shading
- But much better performance!
    Technique                  Cost
    Out of order, no shading   1.31 ms
    Out of order, shading      2.80 ms
    Forward PPLL, shading      3.38 ms
    Deferred PPLL, shading     2.13 ms
- Fur model with ~130,000 fur strands
- Running on AMD Radeon 1080p
- Shading cost is ~1.5 ms
- PPLL cost is ~0.58 ms
- Fast!

15 Shading LOD
[Image: full quality shading forced on for all fragments]

16 Geometry Optimizations
- A great portion of time was spent in the GPU front-end
  - 920,000 line segments for fur model
- Expansion from line segments to triangles was done in GS and then VS with Draw()
  - Each segment would create a quad (two triangles) with 6 vertices
- Draw() method: line segments expanded to quads as a non-indexed triangle list
    Triangle list = { ( 0, 1, 2 ), ( 3, 4, 5 ), ( 6, 7, 8 ), ( 9, 10, 11 ), ( … ) };
- DrawIndexed() method: line segments expanded to quads sharing vertices
    Indexed triangle list = { ( 0, 1, 2 ), ( 2, 1, 3 ), ( 2, 3, 4 ), ( 4, 3, 5 ), ( … ) };
- Offline creation of index buffer plus DrawIndexed() maximizes post vertex cache use!
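
The indexed triangle list on this slide follows a regular pattern that is easy to generate offline. Below is a small C++ sketch that reproduces it for one strand; the function name `buildStrandIndices` and the convention of two expanded vertices per strand point (numbered 2i and 2i+1) are assumptions read off the slide's index pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds the index list for one strand whose line segments are expanded to
// quads. Strand point i is assumed to own expanded vertices 2i and 2i+1, so
// consecutive quads share an edge and its two vertices.
std::vector<uint32_t> buildStrandIndices(uint32_t segmentCount) {
    std::vector<uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(segmentCount) * 6);
    for (uint32_t s = 0; s < segmentCount; ++s) {
        const uint32_t v = 2 * s;
        // Two triangles per quad: (v, v+1, v+2) and (v+2, v+1, v+3).
        const uint32_t quad[6] = {v, v + 1, v + 2, v + 2, v + 1, v + 3};
        indices.insert(indices.end(), quad, quad + 6);
    }
    assert(indices.size() == static_cast<std::size_t>(segmentCount) * 6);
    return indices;
}
```

For two segments this yields (0, 1, 2), (2, 1, 3), (2, 3, 4), (4, 3, 5), matching the slide. The strand then references 2 × (segments + 1) unique vertices instead of 6 per segment, which is what lets the post-transform vertex cache help.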

17 Distance-based LOD System Optimization
- Input line segments have a random order
- Just render fewer (but thicker) fragments when far away!
- Needs shading adjustments to ensure smooth quality transitions
- Increase alpha threshold for fragment inclusion when far away
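
The slide does not give the LOD function itself, so the following C++ sketch is only one plausible shape for "fewer but thicker": a linear ramp of the rendered strand fraction between two distances, with thickness scaled by the inverse of that fraction to keep coverage roughly constant. `selectLod`, `StrandLod`, `nearDist`, `farDist` and `minFraction` are all hypothetical names and parameters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct StrandLod {
    uint32_t strandCount;   // how many strands to draw (a prefix of the buffer)
    float thicknessScale;   // multiplier applied to strand width
};

// Because strands are stored in random order, drawing only the first
// `strandCount` of them thins the model uniformly.
StrandLod selectLod(uint32_t totalStrands, float distance,
                    float nearDist, float farDist, float minFraction) {
    assert(farDist > nearDist);
    assert(minFraction > 0.0f && minFraction <= 1.0f);
    const float t = std::clamp((distance - nearDist) / (farDist - nearDist), 0.0f, 1.0f);
    const float fraction = 1.0f + t * (minFraction - 1.0f);  // 1 at near, minFraction at far
    const uint32_t count =
        static_cast<uint32_t>(static_cast<float>(totalStrands) * fraction);
    return StrandLod{count, 1.0f / fraction};
}
```

With 1000 strands, `nearDist = 10`, `farDist = 110` and `minFraction = 0.25`, a distant model would draw 250 strands at 4 times the width. The shading and alpha-threshold adjustments mentioned on the slide are not modelled here.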

18 Other Optimizations
- PPLL Head UAV uses a RWTexture2D instead of a Buffer
  - Results in more efficient caching for UAV accesses
- Avoid GPR indexing for sorting
  - Sorting K frontmost fragments required an array of General Purpose Registers with random indexing into it
  - Used an ALU-based indexing approach to improve performance
- TO DO: compute shader simulation optimizations
  - Currently a set of multiple compute shaders
  - Looking at combining some of these, optimizing shaders and output formats

19 Per-Pixel Linked Lists UAV Memory Considerations
- How much memory is needed?
  - Guesstimate for a given usage model
  - Max (hair pixels x average overdraw) fragments
- What happens when I run out?
  - Missing fragments
- What can be done about it?

20 k-Buffer in Memory

21 PP Linked-List (PPLL) and k-Buffer
- PP Linked-List (PPLL): node pool holding all fragments. How big?
- k-Buffer: fixed size array of k entries per pixel. Simple memory bound

22 The Front k
- Approximation to avoid massive sorting
  - Only sort the front k fragments per-pixel
  - Blend the rest out-of-order
- If deferring for shader LOD … also
  - Full quality shade on front k
  - Cheap shade on rest
- k is 4, 8, 16
[Image: depth complexity, 20 frags/pixel (ave); red = over 100]

23 The Front k
- Approximation to avoid massive sorting
  - Only sort the front k fragments per-pixel
  - Blend the rest out-of-order
- If deferring for shader LOD … also
  - Full quality shade on front k
  - Cheap shade on rest
- Can't know front k until all fragments processed
[Diagram: k-buffer and tail]

24 For Each Fragment in Each Pixel
[Diagram: a new fragment arrives at the k-buffer, which tracks the index of its furthest entry; a tail fragment is blended into the tail color]

25 If New Fragment in k
- If in k
  1. Swap with furthest
  2. Find new furthest
  3. Blend with tail
[Diagram: new fragment, k-buffer with index of furthest, tail fragment blended into tail color]

26 If not in k
- If not in k
  1. Blend with tail
[Diagram: new fragment, k-buffer with index of furthest, tail fragment blended into tail color]
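
The two cases on these slides (fragment lands in the front k, or does not) can be written as one update routine. This C++ sketch assumes the buffer already holds the first k fragments and uses a single-channel "over" blend for the tail; `KBuffer`, `Frag` and the blend formula are illustrative choices, not the talk's shader code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

struct Frag {
    float depth;  // larger = further away (assumption)
    float color;  // single channel for brevity
    float alpha;
};

template <std::size_t K>
struct KBuffer {
    static_assert(K > 0, "k-buffer needs at least one slot");
    std::array<Frag, K> front{};  // assumed already filled with the first K fragments
    std::size_t furthest = 0;     // index of the furthest entry in `front`
    float tail = 0.0f;            // tail color, blended out of order

    void findFurthest() {
        furthest = 0;
        for (std::size_t i = 1; i < K; ++i)
            if (front[i].depth > front[furthest].depth) furthest = i;
    }

    void add(Frag f) {
        assert(f.alpha >= 0.0f && f.alpha <= 1.0f);
        if (f.depth < front[furthest].depth) {
            std::swap(f, front[furthest]);  // 1. swap with furthest; f is now the evicted one
            findFurthest();                 // 2. find new furthest
        }
        // 3. (or the only step when not in k) blend the tail fragment
        tail = f.color * f.alpha + tail * (1.0f - f.alpha);
    }
};
```

After all fragments are processed, the K entries in `front` still need to be sorted and blended over `tail` to produce the final pixel.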

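27 From PPLL to k-Buffer
- PPLL
  - For each fragment in each pixel:
    - write frags to mem
  - For each pixel:
    - read fragment from mem
    - update k-buffer (reg)
    - blend tail fragment (reg)
    - sort and blend k-buffer (reg)
- k-Buffer in memory
  - For each fragment in each pixel:
    - update k-buffer (mem)
    - blend tail fragment (mem)
  - For each pixel:
    - read k-buffer from mem
    - sort and blend k-buffer (reg)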

28 k-Buffer
- Size: Screen Width x Screen Height x k entries
- 8 bytes each (depth and data)
- PPLL nodes were 12 bytes (depth, data, next)
- k = 4, 8, 16
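
The two memory bounds discussed in the talk reduce to simple products: screen size × k × 8 bytes for the k-buffer, and (hair pixels × average overdraw) × 12 bytes for the PPLL node pool. The C++ sketch below just evaluates them; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// k-buffer: a fixed k entries for every screen pixel.
uint64_t kBufferBytes(uint64_t width, uint64_t height, uint64_t k,
                      uint64_t bytesPerEntry = 8) {
    assert(k > 0 && bytesPerEntry > 0);
    return width * height * k * bytesPerEntry;
}

// PPLL node pool: one node per stored fragment, estimated as
// hair-covered pixels times average overdraw.
uint64_t ppllBytes(uint64_t hairPixels, uint64_t averageOverdraw,
                   uint64_t bytesPerNode = 12) {
    assert(bytesPerNode > 0);
    return hairPixels * averageOverdraw * bytesPerNode;
}
```

For example, `kBufferBytes(1920, 1080, 8)` gives 132,710,400 bytes, while `ppllBytes(500000, 20)` gives 120,000,000 bytes. The 20 fragments per pixel echoes the average quoted earlier in the talk; the 500,000 hair pixels is a made-up figure for illustration. The k-buffer figure is a hard bound, whereas the PPLL figure is only an estimate that can be exceeded at run time.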

29 PPLL: 2nd Pass
[Diagram: the k-buffer lives in registers; a new fragment is compared against the index of furthest, and the tail fragment is blended into the tail color]

30 k-Buffer in Memory: 1st Pass
[Diagram: the k-buffer lives in memory, alongside a mutex, index, … word; a new fragment is compared against the index of furthest, and the tail fragment is blended into the tail color by the blend unit]

31 Mutex/Count/Index Buffer
- Size: Screen Width x Screen Height, 32 bits per pixel
[Diagram: 32-bit word with fields Mutex Bit, Initialized Bit, Max Index (4 bits) and Count (remainder); the high bit is marked]
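
A 32-bit word with these four fields can be packed and unpacked with masks and shifts. The slide does not state the exact bit positions, so the layout in this C++ sketch is an assumption: mutex in bit 31, initialized in bit 30, max index in bits 26 to 29, and count in the low 26 bits. All constant and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout, from the high bit down:
// [31] mutex | [30] initialized | [29:26] max index | [25:0] count
constexpr uint32_t kMutexBit       = 1u << 31;
constexpr uint32_t kInitializedBit = 1u << 30;
constexpr uint32_t kMaxIndexShift  = 26;
constexpr uint32_t kMaxIndexMask   = 0xFu << kMaxIndexShift;
constexpr uint32_t kCountMask      = (1u << kMaxIndexShift) - 1u;

inline uint32_t getMaxIndex(uint32_t word) {
    return (word & kMaxIndexMask) >> kMaxIndexShift;
}

inline uint32_t getCount(uint32_t word) { return word & kCountMask; }

// Replaces the 4-bit max index, leaving the other fields untouched.
inline uint32_t setMaxIndex(uint32_t word, uint32_t index) {
    assert(index < 16u);  // 4 bits cover k up to 16
    return (word & ~kMaxIndexMask) | (index << kMaxIndexShift);
}
```

Keeping the count in the low bits means an atomic add of 1 increments it without disturbing the flags, and four index bits are enough for the k values of 4, 8 and 16 used in the talk.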

32 Spinlock Mutex [allow_uav_condition] for(; i
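
The shader loop on this slide is cut off in the transcript, so it is not reconstructed here. As a stand-in, this C++ sketch shows the general idea of a per-pixel spinlock built on a mutex bit with a bounded number of attempts. The bit position, the `maxAttempts` bound and the names `tryLock` and `unlock` are assumptions, and `std::atomic` replaces the shader's interlocked operations.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr uint32_t kMutexBit = 1u << 31;  // assumed position of the mutex bit

// Tries to take the per-pixel lock, giving up after maxAttempts tries.
bool tryLock(std::atomic<uint32_t>& word, int maxAttempts) {
    assert(maxAttempts > 0);
    for (int i = 0; i < maxAttempts; ++i) {
        const uint32_t old = word.fetch_or(kMutexBit);
        if ((old & kMutexBit) == 0) return true;  // bit was clear: we now own the lock
    }
    return false;  // still contended after maxAttempts
}

// Clears the mutex bit, leaving the other fields of the word intact.
void unlock(std::atomic<uint32_t>& word) { word.fetch_and(~kMutexBit); }
```

The number of attempts a pixel needs before acquiring the lock is the quantity visualised on the later "Contention" slide.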

33 Find New Max Depth
    uint new_max_depth = u_inDepth;
    [unroll] for(int t=0; t new_max_depth ) { new_max_depth = element_depth; new_max_id = t; }
- Generally more memory traffic than PPLL

34 Initialization: The first k
- Options
  - Clear k-buffer fullscreen (0,1)
  - Clear k-buffer stenciled, 3rd pass
  - Clear on first fragment
  - Count
[Diagram: 32-bit word with fields Mutex Bit, Initialized Bit, Max Index (4 bits) and Count (remainder); the high bit is marked]

35 The first k
    InterlockedAdd( tRWMutex[vScreenAddress], 1, oldCount);
    [allow_uav_condition] if(oldCount < KBUFFER_SIZE) {
        DATA(vScreenAddress,oldCount) = u_inData;
        DEPTH(vScreenAddress,oldCount) = u_inDepth;
        return uint2(u_outDepth,u_outData);
    }
[Diagram: 32-bit word with fields Mutex Bit, Initialized Bit, Max Index (4 bits) and Count (remainder); the high bit is marked]
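
The count-based initialization shown in the shader fragment can be mimicked on the CPU. In this hedged C++ sketch a per-pixel atomic counter hands out the first K slots without any locking; `PixelKBuffer` and its members are hypothetical names, and `fetch_add` stands in for `InterlockedAdd`.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

template <std::size_t K>
struct PixelKBuffer {
    std::atomic<uint32_t> count{0};   // fragments seen so far for this pixel
    std::array<uint32_t, K> depth{};
    std::array<uint32_t, K> data{};

    // Returns true if the fragment claimed one of the first K slots.
    // Otherwise the caller would fall through to the locked k-buffer update.
    bool storeIfAmongFirstK(uint32_t inDepth, uint32_t inData) {
        const uint32_t oldCount = count.fetch_add(1);
        if (oldCount < K) {
            depth[oldCount] = inDepth;  // each early fragment gets a unique slot
            data[oldCount] = inData;
            return true;
        }
        return false;
    }

    uint32_t depthAt(std::size_t slot) const {
        assert(slot < K);
        return depth[slot];
    }
};
```

Because every early fragment receives a distinct `oldCount`, the first K writes never collide, which is why no clear pass over the buffer is required in this variant.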

36 Models
- Models: 2k polygons; ~20k hairs; ~130k hairs
- Stats: M fragments; k pixels
- Shading: one point light & shadow; 2 shifted specular lobes

37 Depth Complexity
- Color key: Grey 1, Blue 8, Green 50, Red 100+

38 Contention
- Max attempts per pixel, k=4
- Color key: Dark Blue 1, Aqua <=4, Bright Aqua <=8

39 Performance
- Time ratio to out-of-order blending
    Forward PPLL        1.02 to 1.4
    Forward k-Buffer    1.2 to 1.4
    Deferred PPLL       0.7 to 0.9
    Deferred k-Buffer   0.9 to 1.6

40 K-Buffer in Memory
- Simple memory bound
- Can be less memory
- Usually slower
- Increased memory traffic

41 Simulation

42 Hair Simulation
- Length Constraint
- Local Constraint
- Global Constraint
- Model Transform
- Collision Shapes
- External Forces (wind, gravity, etc.)

43 Fur Simulation
- Length Constraint
- Local Constraint
- Global Constraint
- Model Transform
- Collision Shapes
- External Forces (wind, gravity, etc.)

44 Grass Simulation
- Length Constraint
- Local Constraint (1D)
- Global Constraint
- Model Transform
- Collision Shapes
- External Forces (wind, gravity, etc.)

45 Constraint Method (iterative)
- Used for length, local and global constraints
- Length is most difficult to converge, particularly under large movement
[Diagram: strand vertices p0 … p(n-1) joined by constraints C0 … C(n-2)]
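
An iterative length constraint of this kind can be sketched as a position-based relaxation along the strand. The C++ below is an illustration under stated assumptions, not the TressFX implementation: the root vertex is pinned, every segment shares one rest length, and corrections are split evenly between non-root vertices. `Vec3` and `enforceLength` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Repeatedly nudges neighbouring vertices so each segment approaches restLength.
// p[0] is treated as the pinned root and is never moved.
void enforceLength(std::vector<Vec3>& p, float restLength, int iterations) {
    assert(restLength > 0.0f);
    for (int it = 0; it < iterations; ++it) {
        for (std::size_t i = 0; i + 1 < p.size(); ++i) {
            const Vec3 d{p[i + 1].x - p[i].x, p[i + 1].y - p[i].y, p[i + 1].z - p[i].z};
            const float len = std::sqrt(d.x * d.x + d.y * d.y + d.z * d.z);
            if (len == 0.0f) continue;                    // degenerate segment, skip
            const float corr = (len - restLength) / len;  // fractional stretch
            const float w0 = (i == 0) ? 0.0f : 0.5f;      // root takes no correction
            const float w1 = 1.0f - w0;
            p[i].x += d.x * corr * w0;
            p[i].y += d.y * corr * w0;
            p[i].z += d.z * corr * w0;
            p[i + 1].x -= d.x * corr * w1;
            p[i + 1].y -= d.y * corr * w1;
            p[i + 1].z -= d.z * corr * w1;
        }
    }
}
```

Fixing one segment disturbs its neighbour, so a strand that has been stretched a lot needs many passes before every segment settles, which is the convergence difficulty the slide refers to.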

46 Tridiagonal Matrix Formulation
- Direct solve for length constraint
- Almost zero stretch
- Limited to smaller time steps (stability)
- Still cheap
  - Leverages matrix structure of strands
  - Two sweeps of strand
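
The "two sweeps" correspond to how a tridiagonal system is solved directly: one forward elimination pass and one back substitution pass. The C++ sketch below is the generic Thomas algorithm, shown only to illustrate why the solve is cheap; how the hair length constraints are assembled into the matrix is described in the cited paper and is not reproduced here. `solveTridiagonal` is a hypothetical name, and the sketch assumes no pivoting is needed (for example, a diagonally dominant matrix).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Solves a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for i = 0..n-1.
// a[0] and c[n-1] are ignored.
std::vector<double> solveTridiagonal(const std::vector<double>& a,
                                     const std::vector<double>& b,
                                     const std::vector<double>& c,
                                     std::vector<double> d) {
    const std::size_t n = b.size();
    assert(n > 0 && a.size() == n && c.size() == n && d.size() == n);
    std::vector<double> cp(n);

    // Sweep 1 (root to tip): forward elimination.
    cp[0] = c[0] / b[0];
    d[0] = d[0] / b[0];
    for (std::size_t i = 1; i < n; ++i) {
        const double m = b[i] - a[i] * cp[i - 1];
        cp[i] = c[i] / m;
        d[i] = (d[i] - a[i] * d[i - 1]) / m;
    }

    // Sweep 2 (tip to root): back substitution.
    for (std::size_t i = n - 1; i-- > 0;) {
        d[i] -= cp[i] * d[i + 1];
    }
    return d;  // d now holds the solution x
}
```

Both sweeps touch each vertex once, so the cost grows linearly with the number of vertices in the strand.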

47 Tridiagonal Matrix Formulation
- "Tridiagonal Matrix Formulation for Inextensible Hair Strand Simulation", VRIPHYS, 2013

48 Demos

49 Summary
- Next-gen look is possible now!
- Deferred Rendering for shading LOD is fastest
- k-buffer in memory is an option for memory-constrained situations
- High-quality grass and fur simulation with compute
- Upcoming TressFX 2 SDK sample update with fur scenario at development/amd-radeon-sdk/

50 Questions?

51 Extras

52 Isoline Tessellation for hair/fur? 1/2
- Isoline tessellation has two tess factors
  - First is line density (lines per invocation)
  - Second is line detail (segments per line)
- In theory provides easy LOD system
  - Variable line density and detail by increasing both tessellation factors based on distance
[Figure panels: Tess = (1,1), Tess = (2,1), Tess = (2,2), Tess = (2,3), Tess = (3,3)]

53 Isoline Tessellation for hair/fur? 2/2
- In practice isoline tessellation is not cost effective for this scenario
- Lines are always 1-pixel thick
  - Need GS to extrude them into triangles for smooth edges
    - Major impact on performance!
  - Alternative is to enable MSAA
    - Most engines are deferred so this causes a large performance impact
  - No extrusion for smoothing edges and no MSAA = poor quality!
- Bottom line: a pure Vertex Shader solution is faster
  - LOD benefit is easily done in VS (more on this later)
  - Curvature is rarely a problem (dependent on vertices/strands at authoring time)

54 AA, Self-shadowing and Transparency
[Image panels: Basic Rendering; Antialiasing + Self Shadowing; Antialiasing + Self Shadowing + Transparency]

