Grass, Fur and all things hairy
Nicolas Thibieroz, Gaming Engineering Manager, AMD
Karl Hillesland, Senior Research Engineer, AMD

Next-gen Grass, Fur and Hair: The time for next-gen quality is now. Tomb Raider pioneered next-gen hair, even on PS4/XB1. Users expect this level of quality for next-gen titles. You need to start thinking about this. This talk is about making high-quality fur, grass and hair run at real-time performance.

TressFX applied to Grass, Fur and Hair: Variations of the same technique can be used for all those applications. In all cases the core principles of next-gen quality are still needed: compute simulations, anti-aliasing, transparency, volumetric self-shadowing, and a good lighting model.

Forward Rendering Pipeline – a refresher: Consists of three steps: (1) hair simulation; (2) shade and store fragments into buffers; (3) fetch shaded fragments, sort and render.

Per-Pixel Linked Lists
Head UAV: each pixel location has a head pointer to a linked list in the PPLL UAV.
PPLL UAV: as new fragments are rendered, they are added to the next open location in the PPLL (using the UAV counter). A link is created to the fragment pointed to by the head pointer. The head pointer then points to the new fragment.
    // Retrieve current pixel count and increase counter
    uint uPixelCount = LinkedListUAV.IncrementCounter();
    uint uOldStartOffset;
    // Exchange indices in LinkedListHead texture corresponding to pixel location
    InterlockedExchange(LinkedListHeadUAV[address], uPixelCount, uOldStartOffset);
    // Append new element at the end of the Fragment and Link Buffer
    Element.uNext = uOldStartOffset;
    LinkedListUAV[uPixelCount] = Element;
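
The insertion logic above can be traced on the CPU. The following is a minimal Python sketch, not the shader itself: the class name `PPLL`, the `NULL_LINK` marker and the tuple layout of a node are assumptions made for illustration, and the atomic operations are replaced by plain sequential code.

```python
# CPU-side sketch of per-pixel linked list (PPLL) insertion.
# NULL_LINK, PPLL and the node layout are illustrative, not from the talk.
NULL_LINK = 0xFFFFFFFF  # assumed "end of list" marker


class PPLL:
    def __init__(self, width, height, max_nodes):
        self.width = width
        self.head = [NULL_LINK] * (width * height)  # stands in for the head UAV
        self.nodes = [None] * max_nodes             # stands in for the PPLL UAV
        self.counter = 0                            # stands in for the UAV counter

    def insert(self, x, y, depth, data):
        """Add a fragment; return False if the node pool is full."""
        if self.counter >= len(self.nodes):
            return False                 # out of memory: the fragment is dropped
        index = self.counter             # plays the role of IncrementCounter()
        self.counter += 1
        address = y * self.width + x
        old_head = self.head[address]    # plays the role of InterlockedExchange
        self.head[address] = index
        self.nodes[index] = (depth, data, old_head)  # new node links to old head
        return True

    def fragments(self, x, y):
        """Walk one pixel's list, newest fragment first."""
        link = self.head[y * self.width + x]
        out = []
        while link != NULL_LINK:
            depth, data, link = self.nodes[link]
            out.append((depth, data))
        return out
```

Walking a pixel's list returns fragments in reverse submission order, which is why a separate sort is needed before blending. When the pool is exhausted the sketch simply drops the fragment.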

Forward Rendering Pipeline – a refresher: Hair Simulation. Diagram: input geometry (model space) and simulation parameters feed a compute shader (CS), which writes post-simulation geometry (world space) to a UAV.

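Forward Rendering Pipeline – a refresher: Shade and Store Fragments into Buffers. Diagram: the VS takes world-space geometry to homogeneous clip space, with extrusion from line segments to non-indexed triangles. The PS computes coverage, lighting and shadows and, with a null RT and stencil bound, writes to the Head UAV and PPLL UAV. Each stored fragment holds depth, color, coverage and next.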

Forward Rendering Pipeline – a refresher: Fetch shaded fragments, sort and render. Diagram: a full-screen quad goes through the VS and PS. The PS reads the stencil, Head UAV and PPLL UAV, performs fragment sorting and manual blending, and writes to the render target.
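
As a rough illustration of "fragment sorting and manual blending", here is a small Python sketch that resolves one pixel. It assumes that a smaller depth means nearer to the camera, that coverage acts as alpha, and that a standard "over" blend is used. The function name and the tuple layout are made up for this example.

```python
def resolve_pixel(fragments, background):
    """Sort one pixel's fragments and blend them onto the background.

    fragments: list of (depth, (r, g, b), alpha); smaller depth is assumed nearer.
    """
    color = background
    # Farthest first, so nearer strands are composited over farther ones.
    for _, rgb, alpha in sorted(fragments, key=lambda f: f[0], reverse=True):
        color = tuple(alpha * s + (1.0 - alpha) * d for s, d in zip(rgb, color))
    return color
```

The result does not depend on the order in which fragments were stored, which is the point of sorting: the linked list hands them back in submission order, not depth order.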

Forward Rendering Performance: The main cost in forward rendering mode is in the shading part, because all fragments are lit and shadowed before being stored. PPLL storing is typically not the bottleneck! We don't need maximum quality on all fragments: tail fragments need only good enough quality. Solution: use shader LOD.

Forward vs Deferred Rendering Pipeline. Forward rendering pipeline: (1) hair simulation; (2) full shading and store fragments into buffers; (3) fetch shaded fragments, sort and render. Deferred rendering pipeline: (1) hair simulation; (2) store fragment properties into buffers; (3) fetch fragment properties, sort, shade and render. In the deferred pipeline, full shading is applied to the K frontmost fragments, while tail fragments are shaded with a simpler light equation and shadowing algorithm.

Deferred Rendering Pipeline: Hair Simulation – unchanged! Diagram: input geometry (model space) and simulation parameters feed a compute shader (CS), which writes post-simulation geometry (world space) to a UAV.

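Deferred Rendering Pipeline – a refresher: Store Fragment Properties into Buffers. Diagram: the VS takes world-space geometry to homogeneous clip space, drawing an indexed triangle list from an index buffer. The PS computes coverage and, with a null RT and stencil bound, writes to the Head UAV and PPLL UAV. Each stored fragment holds depth, tangent, coverage and next.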

Deferred Rendering Pipeline: Fetch fragments, sort, shade and render. Diagram: a full-screen quad goes through the VS and PS. The PS reads the stencil, Head UAV and PPLL UAV, evaluates lighting and shadows, and writes to the render target. K frontmost fragments: full shading, sorting and manual blending. Tail fragments: cheap shading, no sorting, and manual blending.

Deferred Rendering Shading LOD Optimization: The deferred approach allows a reduction in shading cost through shader LOD. Only the K frontmost fragments are sorted and shaded at high quality. Tail fragments get simple shading, out-of-order rendering and single-tap shadowing. There is very little quality difference compared to full shading, but much better performance!
Technique                   Cost
Out of order, no shading    1.31 ms
Out of order, shading       2.80 ms
Forward PPLL, shading       3.38 ms
Deferred PPLL, shading      2.13 ms
Fur model with ~130,000 fur strands, running on an AMD Radeon 7970 @ 1080p. Shading cost is ~1.5 ms. PPLL cost is ~0.58 ms. Fast!

Shading LOD. Image caption: full-quality shading forced on for all fragments.

Geometry Optimizations: A great portion of time was spent in the GPU front-end, with 920,000 line segments for the fur model. Expansion from line segments to triangles was done in the GS and then in the VS with Draw(). Each segment would create a quad (two triangles) with 6 vertices.
Draw() method: Triangle list = { ( 0, 1, 2 ), ( 3, 4, 5 ), ( 6, 7, 8 ), ( 9, 10, 11 ), ( … ) };
DrawIndexed() method: Indexed triangle list = { ( 0, 1, 2 ), ( 2, 1, 3 ), ( 2, 3, 4 ), ( 4, 3, 5 ), ( … ) };
[Diagrams: line segments and their expanded quads, with vertex numbering for each method.]
Offline creation of the index buffer plus DrawIndexed() maximizes post-vertex cache use!
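
The indexed pattern on this slide is regular enough to generate offline. The Python sketch below reproduces the (0, 1, 2), (2, 1, 3), (2, 3, 4), (4, 3, 5) sequence for one strand. The function name and the `base_vertex` parameter are assumptions for illustration, and the sketch assumes two extruded vertices per strand vertex.

```python
def strand_indices(num_segments, base_vertex=0):
    """Index list for one strand whose segments are expanded into quads.

    Assumes two extruded vertices per strand vertex, numbered consecutively
    from base_vertex, so neighbouring quads share an edge.
    """
    indices = []
    for s in range(num_segments):
        v = base_vertex + 2 * s
        # Two triangles per segment, matching the pattern on the slide.
        indices += [v, v + 1, v + 2, v + 2, v + 1, v + 3]
    return indices
```

Two segments reference only 6 distinct vertices here, against 12 with the non-indexed Draw() layout, which is where the post-vertex cache reuse comes from. Each strand needs its own `base_vertex` so that quads never bridge two strands.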

Distance-based LOD System Optimization: Input line segments have a random order, so just render fewer (but thicker) fragments when far away! Needs shading adjustments to ensure smooth quality transitions. Increase the alpha threshold for fragment inclusion when far away.
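
One possible way to turn "fewer but thicker" into numbers is sketched below in Python. Everything here is an assumption rather than the talk's actual scheme: the linear ramp between `lod_start` and `lod_end`, the `min_density` floor, and the choice to scale thickness by the inverse of density so that covered area stays roughly constant. Because the input segments are in random order, drawing only the first `strands_to_draw` strands would give an unbiased subset.

```python
def distance_lod(distance, lod_start, lod_end, min_density,
                 num_strands, base_thickness):
    """Return (strands_to_draw, thickness) for a model at the given distance.

    All parameters are hypothetical tuning values for this sketch.
    """
    # 0 at lod_start (full detail), 1 at lod_end (lowest detail), clamped.
    t = (distance - lod_start) / (lod_end - lod_start)
    t = min(max(t, 0.0), 1.0)
    density = 1.0 + t * (min_density - 1.0)
    strands_to_draw = max(1, int(num_strands * density))
    # Thicker strands compensate for the ones that are skipped.
    thickness = base_thickness / density
    return strands_to_draw, thickness
```

The sketch covers only strand count and thickness. The shading adjustments and the alpha threshold mentioned on the slide are not modelled.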

Other Optimizations: The PPLL Head UAV uses a RWTexture2D instead of a Buffer, which results in more efficient caching for UAV accesses. Avoid GPR indexing for sorting: sorting the K frontmost fragments required an array of General Purpose Registers with random indexing into it, so an ALU-based indexing approach was used to improve performance. To do: compute shader simulation optimizations. The simulation is currently a set of multiple compute shaders. We are looking at combining some of these and at optimizing shaders and output formats.

Per-Pixel Linked Lists UAV Memory Considerations: How much memory is needed? Guesstimate for a given usage model: max (hair pixels x average overdraw) fragments. What happens when I run out? Missing fragments. What can be done about it?

k-Buffer in Memory

PP Linked-List (PPLL): a node pool holding all fragments. How big? k-Buffer: a fixed-size array of k entries per pixel, which gives a simple memory bound.

The Front k: An approximation to avoid massive sorting. Only sort the front k fragments per pixel and blend the rest out of order. If deferring for shader LOD, also apply full-quality shading to the front k and cheap shading to the rest. Image: 20 frags/pixel on average (red = over 100). k is 4, 8, or 16.

The Front k: An approximation to avoid massive sorting. Only sort the front k fragments per pixel and blend the rest out of order. If deferring for shader LOD, also apply full-quality shading to the front k and cheap shading to the rest. Fragments are split between the k-buffer and the tail, but we can't know the front k until all fragments are processed.

k-Buffer: For each fragment in each pixel. Diagram: the k-buffer with the index of its furthest entry, the new fragment, the tail fragment, and the blend into the tail color.

If the new fragment is in k: (1) swap with furthest; (2) find new furthest; (3) blend with tail. Diagram: k-buffer, index of furthest, new fragment, tail fragment, blend, tail color.

If not in k: (1) blend with tail. Diagram: k-buffer, index of furthest, new fragment, tail fragment, blend, tail color.
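
The two cases above ("in k" and "not in k") can be put together in a short CPU-side Python sketch. Several details are assumptions of this sketch rather than statements from the talk: smaller depth means nearer, the tail is a single color blended with a plain "over" operator in arrival order, and the furthest entry is recomputed on every call, where the slides keep a running index of the furthest entry instead.

```python
def blend_over(dst, rgb, alpha):
    """Standard 'over' blend of one fragment onto dst."""
    return tuple(alpha * s + (1.0 - alpha) * d for s, d in zip(rgb, dst))


def kbuffer_add(kbuf, k, tail, frag):
    """Process one fragment (depth, rgb, alpha); return the updated tail color.

    kbuf holds at most k fragments and is modified in place.
    """
    if len(kbuf) < k:
        kbuf.append(frag)          # the first k fragments just fill the buffer
        return tail
    furthest = max(range(k), key=lambda i: kbuf[i][0])
    if frag[0] < kbuf[furthest][0]:
        # In k: swap with the furthest entry; the evicted one goes to the tail.
        kbuf[furthest], frag = frag, kbuf[furthest]
    # Not in k (or just evicted): blend with the tail, out of order.
    return blend_over(tail, frag[1], frag[2])
```

After all fragments of a pixel have been processed, `kbuf` holds the k nearest fragments, still unsorted, and `tail` holds everything else blended in arrival order. The k-buffer contents would then be sorted and blended on top of the tail.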

From PPLL to k-Buffer. For each pixel: write frags to mem. For each fragment in each pixel: read fragment from mem; update k-buffer (reg); blend tail fragment (reg). Read k-buffer from mem. Sort and blend k-buffer (reg). Update k-buffer (mem). Blend tail fragment (mem).

k-Buffer: Screen Width x Screen Height x k entries of 8 bytes each (depth and data). PPLL nodes were 12 bytes (depth, data, next). k = 4, 8, 16.
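
The two footprints can be compared with simple arithmetic. The Python sketch below uses only the entry sizes quoted on this slide. The function names are illustrative, and auxiliary buffers (the head pointers for the PPLL, the mutex/count words for the k-buffer) are left out of the estimate.

```python
def kbuffer_bytes(width, height, k, entry_bytes=8):
    """Fixed k-buffer size: one k-entry array per screen pixel."""
    return width * height * k * entry_bytes


def ppll_bytes(num_fragments, node_bytes=12):
    """Node pool size for a PPLL that must hold num_fragments nodes."""
    return num_fragments * node_bytes
```

For example, `kbuffer_bytes(1920, 1080, 4)` is 66,355,200 bytes and `kbuffer_bytes(1920, 1080, 8)` is 132,710,400 bytes, regardless of scene content. A PPLL pool sized for 3.5 million fragments, the upper figure quoted in the stats later in the talk, is `ppll_bytes(3_500_000)`, or 42,000,000 bytes, but it has to be sized for the worst case in advance.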

PPLL, 2nd Pass: the k-buffer lives in registers. Diagram: new fragment, index of furthest, blend, tail color, tail fragment.

k-Buffer in Memory, 1st Pass: the k-buffer lives in memory, together with the mutex, index, … Diagram: new fragment, index of furthest, blend unit, tail color, tail fragment.

Mutex/Count/Index Buffer: Screen Width x Screen Height entries of 32 bits each. Fields: Mutex Bit, Initialized Bit, Max Index (4 bits), Count (remainder). The diagram also marks the high bit.

Spinlock Mutex
    [allow_uav_condition]
    for(; i<MAX_LOOP_COUNT && !bStop; ++i)    // Paranoia: bounded number of attempts
    {
        uint oldID;
        // Try: attempt to take the per-pixel mutex
        InterlockedExchange( tRWMutex[vScreenAddress], RESERVED, oldID );
        if( (oldID & RESERVED) != RESERVED )
        {
            // … Do work
            DeviceMemoryBarrier();
            // Release: store the new max index and clear the mutex bit
            tRWMutex[vScreenAddress] = (new_max_id<<28) + INITED;
            bStop = true;
        } // end mutex check
    } // end spinlock loop

Find New Max Depth
    uint new_max_depth = u_inDepth;
    [unroll]
    for(int t=0; t<KBUFFER_SIZE; t++)
    {
        // Read each k-buffer entry back from memory
        uint element_depth = DEPTH( vScreenAddress, t );
        if( element_depth > new_max_depth )
        {
            new_max_depth = element_depth;
            new_max_id = t;
        }
    }
Generally more memory traffic than PPLL.

Initialization: The first k. Options: clear the k-buffer fullscreen (0,1); clear the k-buffer stenciled, in a 3rd pass; clear on first fragment; count. Fields: Mutex Bit, Initialized Bit, Max Index (4 bits), Count (remainder). The diagram also marks the high bit.

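The first k
    // Count the fragments seen at this pixel
    InterlockedAdd( tRWMutex[vScreenAddress], 1, oldCount );
    [allow_uav_condition]
    if( oldCount < KBUFFER_SIZE )
    {
        // The first k fragments go straight into their own k-buffer slot
        DATA( vScreenAddress, oldCount ) = u_inData;
        DEPTH( vScreenAddress, oldCount ) = u_inDepth;
        return uint2( u_outDepth, u_outData );
    }
Fields: Mutex Bit, Initialized Bit, Max Index (4 bits), Count (remainder). The diagram also marks the high bit.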

Models: 2k polygons, ~20k hairs, ~130k hairs. Stats: 2-3.5 M fragments, 200-300k pixels. Shading: one point light and shadow, 2 shifted specular lobes.

Depth Complexity: Grey = 1, Blue = 8, Green = 50, Red = 100+.

Contention: Max attempts per pixel, k=4. Dark Blue = 1, Aqua <= 4, Bright Aqua <= 8.

Performance: Time ratio to out-of-order blending.
Forward PPLL         1.02 to 1.4
Forward k-Buffer     1.2 to 1.4
Deferred PPLL        0.7 to 0.9
Deferred k-Buffer    0.9 to 1.6

K-Buffer in Memory: Simple memory bound. Can be less memory. Usually slower. Increased memory traffic.

Simulation

Hair Simulation: Length Constraint, Local Constraint, Global Constraint, Model Transform, Collision Shapes, External Forces (wind, gravity, etc.).

Fur Simulation: Length Constraint, Local Constraint, Global Constraint, Model Transform, Collision Shapes, External Forces (wind, gravity, etc.).

Grass Simulation: Length Constraint, Local Constraint (1D), Global Constraint, Model Transform, Collision Shapes, External Forces (wind, gravity, etc.).

Constraint Method (iterative): Used for length, local and global constraints. Length is the most difficult to converge, particularly under large movement. Diagram: strand vertices p_0 to p_{n-1} joined by constraints C_0 to C_{n-2}.
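
To make the convergence remark concrete, here is a minimal Python sketch of an iterative length constraint on a 2D strand. It is a generic position-correction loop, not the TressFX shader: the pinned root, the equal split of each correction between the two vertices, and the function names are all assumptions of this sketch.

```python
import math


def enforce_lengths(points, rest_length, iterations):
    """Iteratively pull each segment of a 2D strand back to rest_length.

    points[0] is the root and stays pinned (an assumption of this sketch).
    """
    pts = [list(p) for p in points]
    for _ in range(iterations):
        for i in range(len(pts) - 1):
            a, b = pts[i], pts[i + 1]
            dx, dy = b[0] - a[0], b[1] - a[1]
            dist = math.hypot(dx, dy)
            if dist == 0.0:
                continue  # degenerate segment: no direction to correct along
            corr = (dist - rest_length) / dist
            # The root takes no correction; other vertices share it equally.
            wa, wb = (0.0, 1.0) if i == 0 else (0.5, 0.5)
            a[0] += wa * corr * dx
            a[1] += wa * corr * dy
            b[0] -= wb * corr * dx
            b[1] -= wb * corr * dy
    return [tuple(p) for p in pts]


def segment_lengths(points):
    """Length of every segment along the strand."""
    return [math.dist(p, q) for p, q in zip(points, points[1:])]
```

With a strand stretched to twice its rest length, a single pass leaves the first segment still stretched, because fixing the second segment pulls the shared vertex away again. Many passes are needed before every segment settles.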

Tridiagonal Matrix Formulation: Direct solve for the length constraint. Almost zero stretch. Limited to smaller time steps (stability). Still cheap. Leverages the matrix structure of strands. Two sweeps of the strand.
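
The slide does not spell out the linear system, so the following Python sketch shows only the generic building block: a standard Thomas-algorithm solver for a tridiagonal system. Its forward elimination and back substitution are presumably what "two sweeps of the strand" refers to, but that reading, the function name and the argument layout are assumptions. The actual formulation is in the paper cited in the talk.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    Row i reads: lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i].
    lower[0] and upper[-1] are ignored. No pivoting is done, so the matrix
    is assumed to be well conditioned (for example diagonally dominant).
    """
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    # First sweep: forward elimination from root to tip.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Second sweep: back substitution from tip to root.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

The cost is linear in the number of unknowns, which matches the "still cheap" claim: each strand is a chain, so each constraint only couples to its two neighbours.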

Tridiagonal Matrix Formulation. Reference: Tridiagonal Matrix Formulation for Inextensible Hair Strand Simulation, VRIPHYS, 2013.

Summary: The next-gen look is possible now! Deferred rendering for shading LOD is fastest. k-buffer in memory is an option for memory-constrained situations. High-quality grass and fur simulation with compute. Upcoming TressFX 2 SDK sample update with fur scenario at http://developer.amd.com/tools-and-sdks/graphics-development/amd-radeon-sdk/

Questions?

Extras

Isoline Tessellation for hair/fur? 1/2: Isoline tessellation has two tess factors. The first is line density (lines per invocation). The second is line detail (segments per line). In theory this provides an easy LOD system: variable line density and detail by increasing both tessellation factors based on distance. Examples shown: Tess = (1,1), (2,1), (2,2), (2,3), (3,3).

Isoline Tessellation for hair/fur? 2/2: In practice isoline tessellation is not cost effective for this scenario. Lines are always 1 pixel thick, so a GS is needed to extrude them into triangles for smooth edges, with a major impact on performance! The alternative is to enable MSAA, but most engines are deferred, so this causes a large performance impact. No extrusion for smoothing edges and no MSAA = poor quality! Bottom line: a pure Vertex Shader solution is faster. The LOD benefit is easily done in the VS (more on this later). Curvature is rarely a problem (dependent on vertices/strands at authoring time).

AA, Self-shadowing and Transparency. Comparison images: Basic Rendering; Antialiasing + Self Shadowing; Antialiasing + Self Shadowing + Transparency.
