Grass, Fur and all things hairy
Nicolas Thibieroz, Gaming Engineering Manager, AMD
Karl Hillesland, Senior Research Engineer, AMD


Next-gen Grass, Fur and Hair: The time for next-gen quality is now. Tomb Raider pioneered next-gen hair, even on PS4/XB1. Users expect this level of quality for next-gen titles, so you need to start thinking about this. This talk is about making high-quality fur, grass and hair run at real-time performance.

TressFX applied to Grass, Fur and Hair: Variations of the same technique can be used for all those applications. In all cases the core principles of next-gen quality are still needed: compute simulations, anti-aliasing, transparency, volumetric self-shadowing, and a good lighting model.

Forward Rendering Pipeline – a refresher: Consists of three steps: (1) hair simulation; (2) shade and store fragments into buffers; (3) fetch shaded fragments, sort and render.

Per-Pixel Linked Lists
Head UAV: each pixel location has a head pointer to a linked list in the PPLL UAV.
PPLL UAV: as new fragments are rendered, they are added to the next open location in the PPLL (using the UAV counter). A link is created to the fragment pointed to by the head pointer. The head pointer then points to the new fragment.

// Retrieve current pixel count and increase counter
uint uPixelCount = LinkedListUAV.IncrementCounter();
uint uOldStartOffset;
// Exchange indices in LinkedListHead texture corresponding to pixel location
InterlockedExchange(LinkedListHeadUAV[address], uPixelCount, uOldStartOffset);
// Append new element at the end of the Fragment and Link Buffer
Element.uNext = uOldStartOffset;
LinkedListUAV[uPixelCount] = Element;
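The insertion steps above can be mirrored on the CPU to check the bookkeeping. What follows is a minimal sequential sketch, not the shader itself: the class name, the `NULL` end-of-list marker and the tuple node layout are assumptions made for illustration, and the atomic exchange of the GPU version is replaced by plain assignments.

```python
NULL = 0xFFFFFFFF  # assumed end-of-list marker (not named in the slides)


class PerPixelLinkedLists:
    """Sequential stand-in for the head UAV, the PPLL UAV and its counter."""

    def __init__(self, width, height, max_nodes):
        self.width = width
        self.head = [NULL] * (width * height)  # one head pointer per pixel
        self.nodes = [None] * max_nodes        # fixed-size node pool
        self.counter = 0                       # next open location in the pool

    def insert(self, x, y, depth, data):
        """Add a fragment; returns False when the node pool is exhausted."""
        if self.counter >= len(self.nodes):
            return False                       # fragment is lost
        index = self.counter                   # IncrementCounter()
        self.counter += 1
        address = y * self.width + x
        old_head = self.head[address]          # InterlockedExchange(...)
        self.head[address] = index
        self.nodes[index] = (depth, data, old_head)  # link to previous head
        return True

    def fragments(self, x, y):
        """Walk the list for one pixel, newest fragment first."""
        out = []
        index = self.head[y * self.width + x]
        while index != NULL:
            depth, data, index = self.nodes[index]
            out.append((depth, data))
        return out
```

Fragments come back in reverse submission order rather than depth order, which is why the later pass has to sort them. A full pool simply drops the fragment.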

Forward Rendering Pipeline – a refresher: Hair Simulation. [Diagram: Input Geometry and Simulation parameters -> CS -> Post-simulation geometry (UAV); space labels: Model space, World space.]

Forward Rendering Pipeline – a refresher: Shade and Store Fragments into Buffers. [Diagram: VS (world space to homogeneous clip space; extrusion from line segments to non-indexed triangles) -> PS (lighting, shadows, coverage) -> outputs: Null RT, Stencil, Head UAV, PPLL UAV. PPLL node fields: depth, color, coverage, next.]

Forward Rendering Pipeline – a refresher: Fetch Shaded Fragments, Sort and Render. [Diagram: Full Screen Quad -> VS -> PS (fragment sorting and manual blending); inputs: Stencil, Head UAV, PPLL UAV; output: Render target.]

Forward Rendering Performance: The main cost in forward rendering mode is in the shading part, since all fragments are lit and shadowed before being stored. PPLL storing is typically not the bottleneck! You don't need maximum quality on all fragments: tail fragments need only "good enough" quality. Solution: use shader LOD.

Forward vs Deferred Rendering Pipeline.
Forward rendering pipeline: (1) hair simulation; (2) full shading and store fragments into buffers; (3) fetch shaded fragments, sort and render.
Deferred rendering pipeline: (1) hair simulation; (2) store fragment properties into buffers; (3) fetch fragment properties, sort, shade and render. Full shading on the K frontmost fragments; tail fragments are shaded with a simpler light equation and shadowing algorithm.

Deferred Rendering Pipeline: Hair Simulation – unchanged! [Diagram: Input Geometry and Simulation parameters -> CS -> Post-simulation geometry (UAV); space labels: Model space, World space.]

Deferred Rendering Pipeline – a refresher: Store Fragment Properties into Buffers. [Diagram: Index Buffer (indexed triangle list) -> VS (world space to homogeneous clip space) -> PS (coverage) -> outputs: Null RT, Stencil, Head UAV, PPLL UAV. PPLL node fields: depth, tangent, coverage, next.]

Deferred Rendering Pipeline: Fetch Fragments, Sort, Shade and Render. [Diagram: Full Screen Quad -> VS -> PS (lighting, shadows); inputs: Stencil, Head UAV, PPLL UAV; output: Render target.] K frontmost fragments: full shading, sorting and manual blending. Tail fragments: cheap shading, no sorting and manual blending.

Deferred Rendering Shading LOD Optimization: The deferred approach allows a reduction in shading cost through shader LOD. Only sort and shade the K frontmost fragments at high quality. Simple shading and out-of-order rendering on tail fragments. Single-tap shadowing on tail fragments. Very little quality difference compared to full shading, but much better performance!

Technique                     Cost
Out of order, no shading      1.31 ms
Out of order, shading         2.80 ms
Forward PPLL, shading         3.38 ms
Deferred PPLL, shading        2.13 ms

Fur model with ~130,000 fur strands. Running on AMD Radeon 1080p. Shading cost is ~1.5 ms. PPLL cost is ~0.58 ms. Fast!

Shading LOD. [Comparison images: full-quality shading forced on for all fragments; shading LOD.]

Geometry Optimizations: A great portion of time was spent in the GPU front-end: 920,000 line segments for the fur model. Expansion from line segments to triangles was done in GS and then VS with Draw(). Each segment would create a quad (two triangles) with 6 vertices.
Draw() method: Triangle list = { ( 0, 1, 2 ), ( 3, 4, 5 ), ( 6, 7, 8 ), ( 9, 10, 11 ), ( … ) };
DrawIndexed() method: Indexed triangle list = { ( 0, 1, 2 ), ( 2, 1, 3 ), ( 2, 3, 4 ), ( 4, 3, 5 ), ( … ) };
[Diagram: line segments and expanded quads with the vertex numbering used by each method.]
Offline creation of an index buffer plus DrawIndexed() maximizes post-vertex cache use!
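The two index patterns on this slide can be generated mechanically. The sketch below is an illustration only: the function names are hypothetical, and it assumes the segments of one strand are consecutive so that neighbouring quads share a pair of vertices.

```python
def draw_triangle_list(num_segments):
    """Non-indexed expansion: 6 fresh vertices (2 triangles) per segment."""
    tris = []
    for s in range(num_segments):
        base = 6 * s
        tris.append((base, base + 1, base + 2))
        tris.append((base + 3, base + 4, base + 5))
    return tris


def draw_indexed_triangle_list(num_segments):
    """Indexed expansion: each quad reuses the two vertices of its neighbour."""
    tris = []
    for s in range(num_segments):
        v = 2 * s
        tris.append((v, v + 1, v + 2))
        tris.append((v + 2, v + 1, v + 3))
    return tris


def unique_vertices(tris):
    """Number of distinct vertices referenced by a triangle list."""
    return len({i for tri in tris for i in tri})
```

For two consecutive segments the non-indexed list references 12 vertices while the indexed list references 6, which is where the post-vertex cache reuse comes from.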

Distance-based LOD System Optimization: Input line segments have a random order. Just render fewer (but thicker) fragments when far away! Needs shading adjustments to ensure smooth quality transitions. Increase the alpha threshold for fragment inclusion when far away.
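One way to picture "fewer but thicker" is a distance ramp that shrinks the rendered count and widens what remains. The sketch below is purely illustrative: the linear ramp, the `near`/`far`/`min_fraction` parameters and the inverse-fraction width rule are all assumptions, not values or formulas from the talk. It relies only on the stated fact that the input order is random, so drawing the first N elements gives an even subset.

```python
def strand_lod(distance, near, far, min_fraction, total_strands, base_width):
    """Return (count to render, width) for a given camera distance.

    All parameters are hypothetical tuning knobs for this sketch.
    """
    # 0 at or before `near`, 1 at or beyond `far`
    t = (distance - near) / (far - near)
    t = min(max(t, 0.0), 1.0)
    # Fraction of the randomly ordered input that is still rendered
    fraction = 1.0 - t * (1.0 - min_fraction)
    count = max(1, int(total_strands * fraction))
    # Widen what remains so overall coverage stays roughly constant
    width = base_width / fraction
    return count, width
```

With `near=10`, `far=100` and `min_fraction=0.25`, a model of 130,000 strands drops to 32,500 strands at four times the width at the far distance.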

Other Optimizations: The PPLL Head UAV uses a RWTexture2D instead of a Buffer, which results in more efficient caching for UAV accesses. Avoid GPR indexing for sorting: sorting the K frontmost fragments required an array of General Purpose Registers with random indexing into it, so an ALU-based indexing approach was used to improve performance. To do: compute shader simulation optimizations. The simulation is currently a set of multiple compute shaders; we are looking at combining some of these and optimizing shaders and output formats.

Per-Pixel Linked Lists UAV Memory Considerations: How much memory is needed? Guesstimate for a given usage model: max(hair pixels x average overdraw) fragments. What happens when I run out? Missing fragments. What can be done about it?

k-Buffer in Memory

PPLL vs. k-Buffer. Per-Pixel Linked List (PPLL): a node pool holding all fragments. How big? k-Buffer: a fixed-size array of k entries per pixel. Simple memory bound.

The Front k: An approximation to avoid massive sorting. Only sort the front k fragments per pixel and blend the rest out-of-order. If deferring for shader LOD, also: full-quality shade on the front k, cheap shade on the rest. [Depth-complexity image: 20 frags/pixel on average; red = over 100.] k is 4, 8 or 16.

The Front k: k-Buffer and Tail. You can't know the front k until all fragments have been processed.

k-Buffer Update: For each fragment in each pixel. [Diagram: New Fragment, k-Buffer with the index of the furthest entry, Tail Fragment, Blend, Tail Color.]

If the new fragment is in the front k: 1. Swap with furthest. 2. Find new furthest. 3. Blend with tail.

If the new fragment is not in the front k: 1. Blend with tail.
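The update rule on these three slides can be written as a small sequential routine. This is a sketch under stated assumptions: fragments are `(depth, alpha)` pairs, the function name is hypothetical, and the tail is reduced to a single transmittance value (the product of `1 - alpha`) rather than a full tail color.

```python
def split_front_k(fragments, k):
    """Keep the k nearest fragments; fold every other fragment into the tail.

    fragments: iterable of (depth, alpha) pairs in arbitrary order.
    Returns (front fragments sorted by depth, tail transmittance).
    """
    assert k >= 1
    kbuf = []                  # up to k entries, unsorted
    furthest = -1              # index of the furthest entry in kbuf
    tail_transmittance = 1.0   # product of (1 - alpha) over tail fragments

    for frag in fragments:
        if len(kbuf) < k:
            # The first k fragments simply fill the buffer
            kbuf.append(frag)
            if furthest < 0 or frag[0] > kbuf[furthest][0]:
                furthest = len(kbuf) - 1
            continue
        if frag[0] < kbuf[furthest][0]:
            # In the front k: 1. swap with furthest
            frag, kbuf[furthest] = kbuf[furthest], frag
            # 2. find new furthest
            furthest = max(range(k), key=lambda i: kbuf[i][0])
        # 3. blend with tail (the only step when not in the front k)
        tail_transmittance *= 1.0 - frag[1]

    return sorted(kbuf), tail_transmittance
```

Tracking only transmittance keeps the tail result independent of submission order. Blending a tail color out of order is not order-independent in general, which is the approximation the front-k scheme accepts.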

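From PPLL to k-Buffer.
PPLL: write frags to mem. Then, for each pixel: for each fragment, read the fragment from mem, update the k-buffer (reg) and blend the tail fragment (reg); then sort and blend the k-buffer (reg).
k-Buffer in memory: for each fragment in each pixel, update the k-buffer (mem) and blend the tail fragment (mem). Then, for each pixel: read the k-buffer from mem, then sort and blend the k-buffer (reg).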

k-Buffer Size: screen width x screen height x k entries, 8 bytes each (depth and data). PPLL nodes were 12 bytes (depth, data, next). k = 4, 8, 16.
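The two memory bounds can be compared with a little arithmetic. In this sketch the entry sizes (8 and 12 bytes) come from the slide, while the function names, the 1920x1080 resolution and the figure of 500,000 hair pixels are hypothetical inputs chosen only to make the numbers concrete. The per-pixel head pointer and mutex buffers are left out.

```python
KBUFFER_ENTRY_BYTES = 8   # depth and data
PPLL_NODE_BYTES = 12      # depth, data, next


def kbuffer_bytes(width, height, k):
    """Fixed bound: every pixel owns k entries, whether used or not."""
    return width * height * k * KBUFFER_ENTRY_BYTES


def ppll_bytes(hair_pixels, average_overdraw):
    """Guesstimate: one node per fragment that lands on a hair pixel."""
    return hair_pixels * average_overdraw * PPLL_NODE_BYTES


# 1080p with k = 8
print(kbuffer_bytes(1920, 1080, 8))   # 132710400
# Hypothetical 500,000 hair pixels at 20 fragments per pixel
print(ppll_bytes(500_000, 20))        # 120000000
```

The k-buffer figure depends only on resolution and k, while the PPLL figure grows with how much of the screen the hair covers and how deep it is.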

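PPLL, 2nd Pass: the k-buffer lives in registers. [Diagram: New Fragment, k-Buffer (registers) with the index of the furthest entry, Tail Fragment, Blend, Tail Color.]

k-Buffer in Memory, 1st Pass: the k-buffer lives in memory. [Diagram: New Fragment, k-Buffer (memory) with the index of the furthest entry, mutex/index word, Tail Fragment, Blend Unit, Tail Color.]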

Mutex/Count/Index Buffer: one 32-bit value per pixel (screen width x screen height). Fields: Mutex bit, Initialized bit, Max index (4 bits), Count (remainder). [Diagram marks the high bit.]

Spinlock Mutex

[allow_uav_condition]
for( ; i < MAX_LOOP_COUNT && !bStop; ++i )   // Paranoia: bounded number of attempts
{
    uint oldID;
    // Try: attempt to take the lock
    InterlockedExchange( tRWMutex[vScreenAddress], RESERVED, oldID );
    if( (oldID & RESERVED) != RESERVED )
    {
        [[ … Do work ]]
        DeviceMemoryBarrier();
        // Release: store the new max index and clear the lock
        tRWMutex[vScreenAddress] = (new_max_id << 28) + INITED;
        bStop = true;
    } // end mutex check
} // end spinlock loop

Find New Max Depth

uint new_max_depth = u_inDepth;
[unroll]
for( int t = 0; t < KBUFFER_SIZE; t++ )
{
    uint element_depth = DEPTH( vScreenAddress, t );
    if( element_depth > new_max_depth )
    {
        new_max_depth = element_depth;
        new_max_id = t;
    }
}

Generally more memory traffic than PPLL.

Initialization: The First k. Options: clear k-buffer fullscreen (0,1); clear k-buffer stenciled, 3rd pass; clear on first fragment; count. [Diagram: mutex word fields: Mutex bit, Initialized bit, Max index (4 bits), Count (remainder).]

The First k

InterlockedAdd( tRWMutex[vScreenAddress], 1, oldCount );
[allow_uav_condition]
if( oldCount < KBUFFER_SIZE )
{
    DATA( vScreenAddress, oldCount ) = u_inData;
    DEPTH( vScreenAddress, oldCount ) = u_inDepth;
    return uint2( u_outDepth, u_outData );
}

[Diagram: mutex word fields: Mutex bit, Initialized bit, Max index (4 bits), Count (remainder).]

Models: 2k polygons, ~20k hairs, ~130k hairs. Stats: M fragments, k pixels. Shading: one point light & shadow, 2 shifted specular lobes.

Depth Complexity (color key): grey = 1, blue = 8, green = 50, red = 100+.

Contention: max attempts per pixel, k=4. Color key: dark blue = 1, aqua <= 4, bright aqua <= 8.

Performance: time ratio to out-of-order blending. Forward PPLL: 1.02 to 1.4. Forward k-Buffer: 1.2 to 1.4. Deferred PPLL: 0.7 to 0.9. Deferred k-Buffer: 0.9 to 1.6.

k-Buffer in Memory: Simple memory bound. Can be less memory. Usually slower. Increased memory traffic.

Simulation

Hair Simulation: length constraint, local constraint, global constraint, model transform, collision shapes, external forces (wind, gravity, etc.).

Fur Simulation: length constraint, local constraint, global constraint, model transform, collision shapes, external forces (wind, gravity, etc.).

Grass Simulation: length constraint, local constraint (1D), global constraint, model transform, collision shapes, external forces (wind, gravity, etc.).

Constraint Method (iterative): Used for length, local and global constraints. Length is the most difficult to converge, particularly under large movement. [Diagram: strand vertices p0 … pn-1 joined by length constraints C0, C1, … Cn-2.]
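To see why the length constraint is slow to converge, an iterative solver can be tried on a stretched strand. The sketch below uses a generic position-based projection with a pinned root. It is not taken from the TressFX source, and the function names and the equal-split correction are assumptions made for illustration.

```python
import math


def enforce_length_constraints(points, rest_lengths, iterations=10):
    """Iteratively project each segment back to its rest length.

    points: list of [x, y, z]; points[0] is the pinned root.
    rest_lengths: rest length of segment i (between points i and i+1).
    """
    pts = [list(p) for p in points]
    for _ in range(iterations):
        for i, rest in enumerate(rest_lengths):
            a, b = pts[i], pts[i + 1]
            delta = [b[j] - a[j] for j in range(3)]
            dist = math.sqrt(sum(d * d for d in delta))
            if dist == 0.0:
                continue
            stretch = (dist - rest) / dist
            if i == 0:
                # Root is pinned: only the child vertex moves
                for j in range(3):
                    b[j] -= delta[j] * stretch
            else:
                # Split the correction between both vertices
                for j in range(3):
                    a[j] += 0.5 * delta[j] * stretch
                    b[j] -= 0.5 * delta[j] * stretch
    return pts


def max_stretch(points, rest_lengths):
    """Largest absolute deviation from rest length over all segments."""
    worst = 0.0
    for i, rest in enumerate(rest_lengths):
        dist = math.dist(points[i], points[i + 1])
        worst = max(worst, abs(dist - rest))
    return worst
```

On a straight strand stretched to twice its rest length, a couple of iterations leave a visible error and many more are needed before it becomes negligible, because fixing one segment disturbs its neighbours.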

Tridiagonal Matrix Formulation: Direct solve for the length constraint. Almost zero stretch. Limited to smaller time steps (stability). Still cheap: leverages the matrix structure of strands, with two sweeps of the strand.
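The "two sweeps" describe how a tridiagonal system is solved directly: one forward elimination pass and one back-substitution pass. The sketch below is the standard Thomas algorithm for a generic tridiagonal system. It does not build the hair-specific matrix from the cited paper, and the argument convention is an assumption of this sketch.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system in two sweeps (Thomas algorithm).

    Row i reads: lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i].
    lower[0] and upper[-1] are ignored. No pivoting is performed, so the
    matrix is assumed to be well conditioned (e.g. diagonally dominant).
    """
    n = len(diag)
    c = [0.0] * n  # modified upper diagonal
    d = [0.0] * n  # modified right-hand side

    # Sweep 1: forward elimination along the strand
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Sweep 2: back substitution in the opposite direction
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

Both sweeps visit each unknown once, so the cost grows linearly with the number of vertices in a strand.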

Tridiagonal Matrix Formulation. Reference: "Tridiagonal Matrix Formulation for Inextensible Hair Strand Simulation", VRIPHYS, 2013.

Demos

Summary: Next-gen look is possible now! Deferred rendering for shading LOD is fastest. k-buffer in memory is an option for memory-constrained situations. High-quality grass and fur simulation with compute. Upcoming TressFX 2 SDK sample update with fur scenario at development/amd-radeon-sdk/

Questions?

Extras

Isoline Tessellation for hair/fur? (1/2): Isoline tessellation has two tess factors: the first is line density (lines per invocation), the second is line detail (segments per line). In theory this provides an easy LOD system: variable line density and detail by increasing both tessellation factors based on distance. [Figure panels: Tess = (1,1), Tess = (2,1), Tess = (2,2), Tess = (2,3), Tess = (3,3).]

Isoline Tessellation for hair/fur? (2/2): In practice isoline tessellation is not cost effective for this scenario. Lines are always 1 pixel thick, so a GS is needed to extrude them into triangles for smooth edges: major impact on performance! The alternative is to enable MSAA, but most engines are deferred, so this causes a large performance impact. No extrusion for smoothing edges and no MSAA = poor quality! Bottom line: a pure Vertex Shader solution is faster. The LOD benefit is easily achieved in the VS (more on this later). Curvature is rarely a problem (dependent on vertices/strands at authoring time).

AA, Self-shadowing and Transparency. [Comparison images: basic rendering; antialiasing + self-shadowing; antialiasing + self-shadowing + transparency.]