Solving Some Common Problems in a Modern Deferred Rendering Engine

2 Solving Some Common Problems in a Modern Deferred Rendering Engine
Jose Luis Sanchez Bonet, Tomasz Stachowiak
Good news everyone! I'm Tom, this is Jose, and we're going to talk about deferred rendering. The focus is on current generation consoles, but the presented techniques can be used on just about any platform, so we hope anyone can benefit from them.

3 Deferred rendering – pros and cons
Pros ( some ):
- Very scalable
- No shader permutation explosion
- G-Buffer useful in other techniques: SSAO, SRAA, decals, …
Cons ( some ):
- Difficult to use multiple shading models
- Does not handle translucent geometry ( some variants do, but may be impractical )

Deferred rendering has been very popular lately due to its scalability, and because it plays nicely with other techniques, which can reuse the G-Buffer. At the same time, it doesn't come without downsides. We are going to cover two of them in this presentation, and propose the custom solutions we've developed for our upcoming console title. The two problems are: handling many shading models, and rendering translucent geometry. I'm going to cover the former in the first half of the presentation, and then Jose will talk about translucency.

4 Reflectance models The BRDF defines the look of a surface
Bidirectional Reflectance Distribution Function:
L_o = L_e + ∫_Ω L_i · f_r · cos θ · dω
- Typically games use just one ( Blinn-Phong ): simple, but inaccurate
- Very important in physically based rendering
- Want more: Oren-Nayar, Kajiya-Kay, Penner, Cook-Torrance, …

In graphics rendering, we use simple mathematical formulas to approximate the look of some classes of surfaces. The most commonly used model, or Bidirectional Reflectance Distribution Function, is Blinn-Phong, which works reasonably well as an approximation of some dielectrics. It is used due to its simplicity, but for the same reason, it cannot reproduce the look of many surfaces accurately. You might want to render your plastics with Blinn, skin with Eric Penner's pre-integrated model, hair with Kajiya-Kay or Marschner, brushed metal with anisotropic Ward, and so on. The visual properties of these surfaces are vastly different, and cannot be covered with just a single, simple mathematical model.

5 BRDFs vs. rendering
Forward rendering:
- Material shader directly evaluates the BRDF ( trivial )
Deferred rendering:
- Light shaders decoupled from materials ( no obvious solution )
( Diagram: Material → G-Buffer → Light; BRDF → ??? )

So how do we render with multiple shading models? If you use forward rendering, this is trivial. Because the BRDF is combined with the material in the same shader, it just works. However, in deferred rendering, we need to evaluate the reflectance model in the light shader, and these don't bear any connection to the material shaders that the BRDFs are associated with.

6 BRDFs vs. deferred – branching?
Read shading model ID in the lighting shader, branch:
- Might be the way to go on next-gen
- Expensive on current consoles: tax for branches never taken
- Don't want to pay it for every light

Platform | 1 BRDF  | 2 BRDFs | 3 BRDFs
360      | 1.85 ms | 2.1 ms  | 2.35 ms
PS3      | 1.9 ms  | 2.48 ms | 2.8 ms
( Three different BRDFs, only one used; the branch always yields the first one )

One approach would be to branch in the light shader. That is, the solid pass emits an identifier of the BRDF into the G-Buffer. The light shader reads it and branches upon its value. This solution might be viable on next-gen hardware, but it doesn't fare well on current consoles. In a small test case we did with a single full-screen light, branching brought the rendering cost from 1.85 to 2.1 milliseconds for just a single extra shading model. This is the tax you pay for not even taking the branch. That is, our test case is synthetic, and only the first BRDF is ever used. And it gets much worse on the PS3, which doesn't even have control flow instructions.

7 BRDFs vs. deferred – LUTs?
Pre-calculate BRDF look-up tables:
- Might be shippable enough; see S.T.A.L.K.E.R.
- Limited control over parameters ( roughness, anisotropy, etc. )
- BRDFs highly dimensional: isotropic with roughness control → 3D LUT

One could also tabulate the BRDF data, and sample it using a combination of an ID as well as some geometric parameters, such as N dot L and N dot H. One such approach was used successfully in the game S.T.A.L.K.E.R., so it might be enough for your title as well. The trouble is, BRDFs are highly dimensional functions, so tabulation might be difficult; for example, the data for an isotropic BRDF parameterized by surface roughness is already at least a 3-dimensional function. /* See Michael Ashikhmin's "Distribution-Based BRDFs". */

8 BRDFs vs. deferred – our approach
One default BRDF:
- Others a relatively rare case
- Shading model ID in stencil
- Multi-pass light rendering: mask out parts of the scene in each pass

We decided to use a single reflectance model for most of our scene geometry, and then special-case rendering in rare instances, such as skin and hair. The core of the idea is pretty simple: when rendering the solid pass, we store the ID of the shading model in the stencil buffer. Then in the lighting pass, we draw light geometry once for each BRDF, using the ID as a mask.

9 Multi-pass – tax avoidance
For each light:
- Find all affected BRDFs
- Render the light volume once for each model
Analogous to multi-pass forward rendering!
Store bounding volumes of objects with non-standard BRDFs; intersect with light volumes.

Implemented like this, the idea would be inefficient: we would be multiplying the number of draw calls and shader switches by the number of supported BRDFs. However, when rendering a light, we can detect which BRDFs it can potentially use, and skip any extra processing. If you think about it, this is a very similar idea to multi-pass forward rendering. Here's a scene with two objects, each of which uses a different shading model, and two lights influencing them. The light on the left is interesting, in that it will affect just one object, hence only one BRDF. Therefore it doesn't need to run the multi-BRDF code path at all. To accomplish this optimization, we store the bounding boxes of all objects which use non-standard shading models. During light rendering, we intersect light volumes with these bounds, and conservatively find a list of all BRDFs which a light may potentially touch. /* We could in theory detect which BRDFs a light may affect and only use dynamic branching there, but then we either always pay a high cost, or we would need to create lots of shader permutations, for example "shading model A and B, A with B and C, A with C, B with C, et cetera." For this reason we are just going to use multi-pass rendering. */
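The gathering step described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the names are ours, and we assume point lights represented as bounding spheres tested against the stored AABBs.

```cpp
#include <algorithm>
#include <cstdint>
#include <set>
#include <vector>

// Bounding data stored for every object with a non-standard shading model.
struct Aabb { float min[3]; float max[3]; };
struct BrdfVolume { Aabb bounds; uint32_t brdfId; };

// Standard sphere-vs-AABB test: squared distance from the sphere center
// to the closest point on the box, compared against the squared radius.
bool sphereIntersectsAabb(const float c[3], float radius, const Aabb& box) {
    float d2 = 0.0f;
    for (int i = 0; i < 3; ++i) {
        float v = std::clamp(c[i], box.min[i], box.max[i]) - c[i];
        d2 += v * v;
    }
    return d2 <= radius * radius;
}

// Conservatively collect the IDs of all non-standard BRDFs a light may touch.
std::set<uint32_t> gatherAffectedBrdfs(const float lightCenter[3], float lightRadius,
                                       const std::vector<BrdfVolume>& volumes) {
    std::set<uint32_t> ids;
    for (const auto& v : volumes)
        if (sphereIntersectsAabb(lightCenter, lightRadius, v.bounds))
            ids.insert(v.brdfId);
    return ids;
}
```

If the returned set is empty, the light renders with the single-BRDF fast path only; otherwise it gets one extra pass per collected ID.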

10 Making it practical
- Needs to work with depth culling of lights
- Hierarchical stencil on 360 and PS3

Now, there are two more bits to the algorithm, needed to make it practical. Firstly, it needs to work with the commonly used stencil and depth-based light culling trick. Secondly, it must play well with the hierarchical stencil buffer. Let's start with a quick reminder of depth culling for lights. Consider a surface rendered into the G-Buffer, and three lights. The left one is completely in front of the surface, so cannot influence it. The right one is behind the surface, so cannot influence it either. Only the middle one contributes to lighting, because its volume intersects the surface in the G-Buffer.

11 Depth culling of lights
Assuming viewer is outside the light volume:
- Start with stencil = 0
- Render back faces of light volume: increment stencil; no color output
- Render front faces: only where stencil = 0; write color
- Render back faces: clear stencil; no color output

So how do we accomplish that using stencil testing? Let's consider the case when the viewer is outside of the light's volume. The stencil is initially clear.

12 Depth culling of lights
Assuming viewer is outside the light volume:
- Start with stencil = 0
- Render back faces of light volume: increment stencil; no color output
- Render front faces: only where stencil = 0; write color
- Render back faces: clear stencil; no color output

We start by writing the value of one into the stencil by using the back faces of the light volume. This results in the stencil being set where the light is completely in front of the surface. Therefore we only want to render where the stencil is zero …

13 Depth culling of lights
Assuming viewer is outside the light volume:
- Start with stencil = 0
- Render back faces of light volume: increment stencil; no color output
- Render front faces: only where stencil = 0; write color
- Render back faces: clear stencil; no color output

… and we do so using the front faces with stencil testing enabled. Note that this is a vanilla version of the algorithm, and you may be using an optimized one.

14 Culling with BRDFs
- Pack the culling bit and BRDF ID together
- Use masks to read / affect required parts
Assuming 8 supported BRDFs:
Bit 0: culling bit | Bits 1-3: BRDF ID | Bits 4-7: unused

Extending this idea to selectively render multiple shading models, we need to pack both the culling bit and the shading model identifier in the stencil buffer. Because stencil testing supports read and write masks, we can act upon and affect portions of the stencil value. Here's a sample layout assuming a maximum of eight supported BRDFs. Note that the BRDF bits can be placed at any offset in the byte.

culling_mask = 0x01
brdf_mask = 0x0E
brdf_shift = 1
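As a small sketch of this layout, the masks on the slide can be wrapped in a pair of helpers. The function names are hypothetical, but the bit arithmetic matches the culling_mask / brdf_mask / brdf_shift values above.

```cpp
#include <cstdint>

// Layout from the slide: bit 0 = culling bit, bits 1..3 = BRDF ID,
// bits 4..7 unused. Three ID bits give up to eight shading models.
constexpr uint8_t kCullingMask = 0x01;
constexpr uint8_t kBrdfMask    = 0x0E;
constexpr uint8_t kBrdfShift   = 1;

// Stencil reference value used for the passes of one particular BRDF.
constexpr uint8_t brdfStencilRef(uint8_t brdfId) {
    return static_cast<uint8_t>((brdfId << kBrdfShift) & kBrdfMask);
}

// Recover the BRDF ID from a stencil value, ignoring all other bits.
constexpr uint8_t brdfIdFromStencil(uint8_t stencil) {
    return static_cast<uint8_t>((stencil & kBrdfMask) >> kBrdfShift);
}
```

Because the reference value never sets bit 0, it can be combined freely with the depth-culling trick that owns the culling bit.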

15 Light volume rendering passes
- Render back faces: increment stencil; no color output
- For each BRDF:
  - Render front faces: hi-stencil ref = brdf_id << brdf_shift; no color out; write hi-stencil
  - Render front faces using the light shader: test hi-stencil
- Zero stencil; no color output

OK, let's get down to the actual rendering passes. First of all, we will be using the hierarchical stencil buffer, so that the GPU may reject entire rasterization tiles. This is where the bulk of our time savings actually comes from, as the regular stencil test happens after you've already paid the pixel shading cost. We start the same as with just depth-based culling: we draw back faces of the light volume with the stencil operator set to increment. Once again, this marks areas we don't want to render to. At this point, we have determined the list of BRDFs the light can potentially influence. For each of them, we first create a hi-stencil mask, then we render the volume again with the actual shader. Creating the mask is fairly cheap, so even though we render twice, we typically save time by hi-stencil culling the expensive shader. Finally, the last step restores the affected stencil area, so that the next light can render.

16 Handling miscellaneous data in stencil
- Stencil value may contain extra data used in earlier / later rendering passes
- Need to ignore it somehow
- Stencil read mask? Doesn't work with the 360's hi-stencil
Layout: Bit 0: culling bit | Bits 1-3: BRDF ID | Bits 4-7: garbage
For each BRDF:
  Stencil reference = brdf_id << brdf_shift
  Stencil read mask = ~garbage_mask ???

We have been assuming that the stencil values are clear of any unrelated data. Yet in practice, they will carry multiple meanings, and rendering engines will have their own 'magic' stencil encodings. /* One example would be using a single bit of stencil to mask out dynamic objects from being affected by deferred decals. */ Unfortunately, such extra bits turn out to be garbage from the point of view of the proposed algorithm, and we cannot simply ignore them with read masks, at least not on the Xbox 360.

17 Stencil operation
Pipeline: Read → apply read mask → comparison ( <, <=, >, ==, … ) → operator ( ++, --, = 0, … ) → write ( masked ) → result

Let's take a look at the stencil operation to figure out why. The GPU first reads the original value and applies a user-specified mask to it. This value is then compared with a reference constant using one of several predicates, such as Greater, Less, Equal, et cetera. Based on the result of this comparison, as well as the depth test, an operator may be applied to the stencil value, such as incrementing or zeroing it. Finally, the resulting value is written back into the stencil buffer, through a write mask.
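The pipeline just described can be modeled in software. This is a simplified sketch for illustration only (Equal predicate, a handful of operators, depth test omitted), not how any GPU actually implements it.

```cpp
#include <cstdint>

// A tiny software model of the stencil pipeline: read, mask, compare,
// apply an operator on pass, then write back through a write mask.
enum class StencilOp { Keep, Incr, Zero };

uint8_t stencilPipeline(uint8_t stored, uint8_t readMask, uint8_t ref,
                        StencilOp passOp, uint8_t writeMask) {
    // Comparison stage, using the Equal predicate on masked values.
    bool pass = (stored & readMask) == (ref & readMask);
    uint8_t result = stored;
    if (pass) {
        switch (passOp) {
            case StencilOp::Keep: break;
            case StencilOp::Incr: result = static_cast<uint8_t>(stored + 1); break;
            case StencilOp::Zero: result = 0; break;
        }
    }
    // Write stage: only bits enabled in the write mask are updated.
    return static_cast<uint8_t>((stored & ~writeMask) | (result & writeMask));
}
```

The write-mask behavior in the last line is the lever the garbage-cleaning trick on a later slide relies on.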

18 Hierarchical stencil operation
- PS3: the hi-stencil test taps the read side of the pipeline, with its own mask and comparison ( <, <=, >, ==, … )
- 360: the hi-stencil value is taken from the final value written back to the stencil buffer, with only a ==, != comparison against a reference

How does the hi-stencil integrate with this pipeline? On the PS3, we get to specify a mask and a comparison function for the hi-stencil test, very much like in the regular one. This means that we can ignore any bits we don't like. The 360, however, takes its hi-stencil value from the completely opposite end of the pipe: from the final value written back to the stencil buffer. Furthermore, we may only specify a trivial equality or inequality predicate against a reference value.

19 Spanner in the works
- Render back faces: increment stencil; no color output
- For each BRDF:
  - Render front faces: hi-stencil ref = brdf_id << brdf_shift; no color out; write hi-stencil
  - Render front faces using the light shader: test hi-stencil
- Zero stencil; no color output
Breaks if stencil contains garbage we can't mask out!

Unfortunately, this throws a spanner in our hi-stencil mask creation. Since the 360 can only create its mask from the full value, any garbage bits will cause the corresponding tiles to be culled.

20 Handling stencil garbage
- Can't do it in a non-destructive manner
  - Take off and nuke the entire site from orbit; it's the only way to be sure
- Extra cleaning pass? Don't want to pay for it!
- Do it as we go:
  D3DRS_STENCILPASS ← D3DSTENCILOP_ZERO
  D3DRS_STENCILWRITEMASK ← (BYTE)~brdf_mask
- Save your stencil if you need it ( sorry for calling it garbage :`( )
  - We were already restoring it later on the 360
  - Don't need to destroy it on the PS3; use a read mask!

Well, if we can't ignore the extra bits, I say we nuke them from orbit. The easiest way would be to have a separate pass which cleans the stencil buffer, removing any garbage bits. On the other hand, we don't want to add any more fixed-cost steps into our rendering, especially at the end of the current hardware generation, when everyone is battling for the last microseconds. Fortunately, we can clear out the garbage bits as we go. When creating the hi-stencil mask, we set the regular stencil operator to do so, while skipping over the ID of the shading model. Now, I've been calling these "garbage bits", but you may have good reasons for extra information in your stencil buffer. Chances are that on the 360 you restore them at a later point anyway, due to limited EDRAM resources. On the PS3 we don't need to clobber the bits at all, due to its more flexible hi-stencil buffer creation process.
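Per pixel, the two render states above amount to the following. This is a hypothetical sketch of the masked Zero write, assuming the brdf_mask = 0x0E layout from the earlier slide.

```cpp
#include <cstdint>

// Assumed layout from the earlier slide: bits 1..3 hold the BRDF ID.
constexpr uint8_t kBrdfMask = 0x0E;

// D3DSTENCILOP_ZERO through D3DRS_STENCILWRITEMASK = ~brdf_mask:
// every bit covered by the write mask receives 0, while the BRDF ID
// bits (outside the write mask) are left untouched.
constexpr uint8_t clearGarbage(uint8_t stored) {
    const uint8_t writeMask = static_cast<uint8_t>(~kBrdfMask);
    return static_cast<uint8_t>((stored & ~writeMask) | (0x00 & writeMask));
}
```

Note that the culling bit (bit 0) is inside the write mask, so it gets cleared along with the garbage, which is exactly what the mask-creation pass wants.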

21 Performance

Platform          | 1 BRDF  | 2 BRDFs | 3 BRDFs
360 ( branching ) | 1.85 ms | 2.1 ms  | 2.35 ms
360 ( stencil )   | 1.85 ms | 1.99 ms | 2.13 ms
PS3 ( branching ) | 1.9 ms  | 2.48 ms | 2.8 ms
PS3 ( stencil )   | 1.9 ms  | –       | 2.31 ms

For each BRDF:
Platform | Initial setup | Mask    | Render | Cleanup
360      | 0.03 ms       | 0.1 ms  | –      | 0.022 ms
PS3      | >= 0.06 ms    | 0.14 ms | –      | –

How's performance then? Let's recall the figures from one of the first slides. With the dynamic branching approach, we had to pay a pretty hefty tax, especially on the PS3. How does the proposed algorithm stack up against that? We still pay a slight tax, but only for the lights which render with multiple shading models, and only for the models we actually use. This is especially important if we support many shading models, but each light affects very few on average. Then we end up paying a considerably smaller cost for the extra shading models.

22 Multi-pass light rendering – final notes
- No change in single-BRDF rendering: use your madly optimized routines
- No need for a 'default' shading model; it's just our use case
  - As long as you efficiently find influenced BRDFs
- Flush your hi-stencil:
  FlushHiZStencil( D3DFHZS_ASYNCHRONOUS );
  cellGcmSetClearZcullSurface( _, CELL_GCM_FALSE, CELL_GCM_TRUE );
- Tiny lights? Try branching instead.
  - Performance figures are only from huge lights!
  - With tiny lights, hi-stencil juggling becomes inefficient

That's pretty much the whole algorithm. I'd just like to emphasize a few extra points. First of all, nothing is changed for single-BRDF rendering! If you conservatively figure out that a light only influences geometry with a single reflectance model, you can reuse your old light rendering code! Secondly, you don't really need to have a 'default' shading model for the whole level; as long as you can quickly classify which BRDFs a light can potentially influence, you're golden. Next, remember to flush your hi-stencil when changing the reference value or the comparison function, otherwise you'll get false culling. Finally, we've only given performance figures for lights taking up a significant portion of the screen. When a light is small and rendered with multiple BRDFs, the cost will be dominated by hi-stencil juggling. It might be worthwhile to use dynamic branching in the light shader below a certain size threshold. Okay, that's all for me; now Jose is going to tell you about lighting translucent geometry!

23 Lighting alpha objects in deferred rendering engines
Classic solutions:
- Forward rendering
- CPU based, one light probe per object
Our solution:
- GPU based, more than one light probe
- Calculate a lightmap for each object each frame
- Used for objects and particle systems
- Fits perfectly into a deferred rendering pipeline

Classic solutions. Forward rendering is the best quality solution, as it calculates lighting for every pixel. Its problems: it is too expensive, especially if a lot of alpha layers are used; there is a shader permutation explosion if you want to support a lot of light types and combinations; and it is completely different from deferred rendering, so we would need to support two pipelines. We could use Forward+, but we are aiming at the 360 and PS3. The other classic solution is calculated on the CPU: one light probe ( intensity, SH, etc. ) per object. With only one light probe per object, every part of the object gets the same light configuration, which causes a lot of issues with big objects, and it is not easy to support shadow-map-casting lights. Our solution is GPU based, with more than one light probe per object, giving quality between the two classic solutions. It is just a lightmap for every object, updated every frame, with lighting calculated in object space. It can be used for objects and particle systems, and it fits perfectly into a deferred engine pipeline.

24 Our solution for alpha objects
Object space map: every pixel stores the local space position on the object's surface.

For each alpha object we create a distribution of light probes on the surface. Artists define a UV channel with an unwrapped version of the object ( like lightmaps ); during export we create a texture we call the object space map, whose size depends on the surface area of the object. Every pixel in the object space map represents a local space position on the surface of the object. Image attribution: Zephyris at en.wikipedia.

25 Our solution for alpha objects
For each object:
- Use baked positions as light probes
- Transform object space map into world space
- Render lights, reusing deferred shading code
- Accumulate into lightmap
- Render object in alpha pass using lightmap

We convert every probe from the object space map to world space using the world matrix of the object. To render lights, we run a pass with a shader very similar to the one in deferred rendering: the input is a texture with world space light probe positions ( calculated from the object space map ), and the output is a lightmap with the light that the probes receive. It can reuse a lot of functions from the deferred rendering code, like shadow mapping. Finally, we render the object in the alpha pass using the lightmap, accessing it through the same UV channel as the object space map. Image attribution: Zephyris at en.wikipedia.
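The per-frame probe transform can be sketched like this. It is a minimal illustration with our own types and names: each texel of the object space map holds a local position, which is multiplied by the object's world matrix before the lighting pass.

```cpp
#include <array>
#include <vector>

using Vec3 = std::array<float, 3>;
using Mat4 = std::array<float, 16>;  // row-major 4x4 matrix

// Transform a point by the upper 3x4 part of a world matrix (w assumed 1).
Vec3 transformPoint(const Mat4& m, const Vec3& p) {
    Vec3 r;
    for (int i = 0; i < 3; ++i)
        r[i] = m[i*4+0]*p[0] + m[i*4+1]*p[1] + m[i*4+2]*p[2] + m[i*4+3];
    return r;
}

// Convert every baked probe position from object space to world space,
// as done once per frame per alpha-lit object.
std::vector<Vec3> objectMapToWorld(const Mat4& world,
                                   const std::vector<Vec3>& objectMap) {
    std::vector<Vec3> out;
    out.reserve(objectMap.size());
    for (const Vec3& p : objectMap)
        out.push_back(transformPoint(world, p));
    return out;
}
```

In the real pipeline this runs on the GPU over a texture of positions rather than a CPU-side vector, but the math per probe is the same.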

26 Our solution for particle systems
Camera oriented quad, fitted around and centered on the particle system.

For each particle system we need a set of light probes distributed around it. As the particles are camera oriented, we use a camera oriented quad fitted around and centered on the particle system. It is not a perfect representation, but it is fast and simple, and it works in practice. If the particle system intersects the camera frustum, we can fit our quad to just the visible part, improving the quality when the particle system fills the screen.

27 Our solution for particle systems
For each particle system:
- Allocate a texture quad and fill it with interpolated positions as light probes
- Render lights and accumulate into lightmap
- Render particles in alpha pass, converting from clip space to lightmap coordinates

To recover the lighting information we just use a 2D matrix that converts from clip space coordinates ( our quad is screen space oriented ) to lightmap texture space.
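A minimal sketch of that clip-space-to-lightmap remap, under the assumption that the quad covers an axis-aligned rectangle in clip space; the slide's 2D matrix would encode the same scale and offset. Names are illustrative.

```cpp
#include <array>

// Extents of the particle system's quad in clip space.
struct ClipRect { float minX, minY, maxX, maxY; };

// Remap a clip space position inside the quad to lightmap UV.
// V is flipped because clip space Y points up while texture V
// conventionally points down.
std::array<float, 2> clipToLightmapUv(const ClipRect& quad,
                                      float clipX, float clipY) {
    float u = (clipX - quad.minX) / (quad.maxX - quad.minX);
    float v = 1.0f - (clipY - quad.minY) / (quad.maxY - quad.minY);
    return {u, v};
}
```

When the quad's region lives inside a packed atlas, a further scale and bias into the allocated rectangle would be folded into the same transform.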

28 Implementation details
For performance reasons we pack all position maps into a single texture. Every entity that needs alpha lighting allocates and uses a region inside the texture.

The two solutions have a lot in common. For performance reasons we pack all the world space position maps into one single texture, so we can calculate the lighting of all the objects at the same time. There are two GPU textures: the input is a world space position texture, similar to the G-Buffer in deferred rendering, and the output is the accumulated light. Every object that needs lighting calculated allocates a region inside the textures and fills it with the positions of its light probes. The size of the region can depend on the screen space size of the object, to improve performance and scalability. To improve performance further, we check on the CPU every light against every object, so we only apply the light shader to the regions that are inside the light. ( Figure: world space position texture and light maps )
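The CPU-side cull could look roughly like this. It is an illustrative sketch with assumptions of ours: sphere bounds for both lights and objects, and a rectangle per object inside the packed texture.

```cpp
#include <vector>

// Rectangle an object owns inside the packed position / lightmap textures.
struct Region { int x, y, w, h; };

struct AlphaObject {
    Region region;     // where its probes live in the atlas
    float center[3];   // world space bounding sphere
    float radius;
};

// Conservative sphere-vs-sphere overlap between a light and an object.
bool lightTouches(const float lc[3], float lr, const AlphaObject& o) {
    float d2 = 0.0f;
    for (int i = 0; i < 3; ++i) {
        float d = lc[i] - o.center[i];
        d2 += d * d;
    }
    float r = lr + o.radius;
    return d2 <= r * r;
}

// Collect only the atlas regions a given light actually needs to shade.
std::vector<Region> regionsToLight(const float lc[3], float lr,
                                   const std::vector<AlphaObject>& objects) {
    std::vector<Region> out;
    for (const auto& o : objects)
        if (lightTouches(lc, lr, o))
            out.push_back(o.region);
    return out;
}
```

The light shader is then run with scissor or quad geometry restricted to the returned rectangles, rather than over the whole atlas.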

29 Integration with deferred rendering
Deferred rendering engine:
- Fill G-Buffer ( solid pass )
- Render lights
- Render alpha

30 Our solution
- Fill world space light probes position map
- Fill G-Buffer ( solid pass )
- Render lights
- Render lights using the world space light probes map as input, calculating the alpha lightmap
- Render alpha using the alpha lightmap

We added two extra steps to our deferred engine.

31 Improvements
- Calculate a second texture with light direction information
- Other parameterizations for particle systems: dust ( one pixel per mote ), ribbons ( a line of pixels )
- 3D volume slices for particle systems: allocate a region for every slice; adds depth to the lighting solution

Having light direction information will allow bump mapping, occlusion and scattering effects.

32 3D volume slices
( Figure: slice 0 map … slice n map; for each slice we allocate one region )
For performance reasons, we can disable 3D volume slices when the particle system is far from the camera.

33 Demo
Thanks to Howard Rayner, our technical artist and VFX magician, for preparing these demos!

34 Demo

35 WE ARE HIRING!

36 Questions?
Jose Luis Sanchez Bonet
Tomasz Stachowiak ( twitter: h3r2tic )

