Accelerating Real-Time Shading with Reverse Reprojection Caching


1 Accelerating Real-Time Shading with Reverse Reprojection Caching
Diego Nehab¹  Pedro V. Sander²  Jason Lawrence³  Natalya Tatarchuk⁴  John R. Isidoro⁴
¹Princeton University  ²Hong Kong University of Science and Technology  ³University of Virginia  ⁴Advanced Micro Devices, Inc.
My name is Diego Nehab, and I will present a technique for accelerating real-time pixel shading on the GPU with the help of a strategy that we call "Reverse Reprojection Caching". This work is in collaboration with Pedro Sander, Jason Lawrence, Natalya Tatarchuk, and John Isidoro.

2 Motivation
Consider a typical game scenario, where an animated ninja performs a kick loop. His leg is moving forward.

3 Motivation
Between two frames in the animation, most surface points remain visible, although they may move; those are marked in green. The red points, on the other hand, were previously occluded. Ideally, we should be able to reuse the shading computations, at least for the green pixels.

4 Previous work
Dedicated hardware: Address Recalculation Pipeline [Regan and Pose 1994], Talisman [Torborg and Kajiya 1996]. Image-based rendering: Image Warping [McMillan and Bishop 1995], Post-Rendering 3D Warping [Mark et al. 1997].
There is an immense amount of work on this subject; here we briefly mention a selection. Certain hardware architectures reuse shaded objects from one frame to the next, and several image-based rendering algorithms can also be seen as exploiting spatio-temporal coherence. The class of algorithms most closely related to our method are those that try to bring interactivity to expensive renderers.

5 Previous work
Interactivity for expensive renderers: Frameless Rendering [Bishop et al. 1994], Render Cache [Walter et al. 1999], Holodeck/Tapestry [Simmons et al. 1999/2000], Corrective Texturing [Stamminger et al. 2000], Shading Cache [Tole et al. 2002].
These vary mostly in how they store and reuse samples across frames. At the simplest level, frameless rendering leaves samples where they are and updates only a subset at a time. At the other extreme, the Shading Cache stores samples on a replica of the scene, which is rendered interactively.

6 Our approach
Explore coherence in real-time rendering.
We want to exploit coherence in generic real-time rendering applications, running on standard graphics hardware. Pixel shaders are now responsible for a substantial fraction of the computational budget, and in unified shader architectures, reducing the pixel load is beneficial for the entire pipeline. The idea is to replace a standard pixel shader with a modified version.
[Flow diagram: the Recompute path of the modified shader]

7 Our approach
Explore coherence in real-time rendering.
Instead of recomputing everything from scratch in each frame, the modified shader performs a cache lookup into the previous frame. If the information for the surface point we are shading is available, we load and reuse it; otherwise, we recompute from scratch. Either way, we have a value to use, and we go ahead and update the cache with it.
[Flow diagram: Lookup → Hit? → Load/Reuse or Recompute → Update]
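To make this control flow concrete, here is a minimal sketch in plain Python rather than shader code; the function names and dict-based caches are illustrative stand-ins, not the actual implementation.

```python
def shade_from_scratch(x, y):
    # Stand-in for the expensive original pixel shader (hypothetical).
    return ((x * 31 + y * 17) % 256) / 255.0

def shade_pixel(x, y, reproject, prev_cache, new_cache):
    # Look up into the previous frame; reuse on a hit, recompute on a miss.
    prev_xy = reproject(x, y)              # reprojected cache address (slide 14)
    if prev_xy in prev_cache:              # hit test; stands in for the depth
        value = prev_cache[prev_xy]        #   comparison of slide 16 (cheap path)
    else:
        value = shade_from_scratch(x, y)   # miss: recompute (expensive path)
    new_cache[(x, y)] = value              # update the cache either way
    return value
```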

8 Requirements
The load/reuse path must be cheaper. The cache hit ratio must be high. Lookup/update must be efficient.
If we are to gain anything from this, the load/reuse operation must be cheaper than recomputing, which means the computation must be expensive, either because of ALU cycles or due to extensive I/O. We must also have a high cache hit ratio. Finally, the cache management must be very lightweight.
[Flow diagram: Lookup → Hit? → Load/Reuse or Recompute → Update]

9 First insight
Cache only visible surface fragments: use a screen-space buffer to store the cache; memory is output sensitive; keep everything in GPU memory; leverage hardware Z-buffering for eviction.
To do this we use a variety of tricks. One is that we only cache visible surface fragments, which allows us to use a simple screen-space buffer to store cache entries. The memory requirements are output sensitive, there is no bus traffic between the GPU and the CPU, and we can use the Z-buffer to trivially evict entries.

10 Cache hit ratio
Here are three example animation sequences. The Parthenon is just a fly-through, the heroine is simply running past the camera, and the ninja is doing ninja things.

11 Cache hit ratio results
Here we track the cache hit ratios: during the animation sequences, we counted the fraction of pixels that remain visible from one frame to the next. The thing to notice is that, in general, the hit rates are above 90%, and this is with our simple eviction policy. There is no need for anything fancier.

12 Second insight
Use reverse mapping: recompute scene geometry at each frame and leverage hardware filtering for the lookup [Walter et al. 1999].
The second insight is that we use reverse mapping to look up into the previous frame. This seems obvious, but it only makes sense in our application domain. Instead of sending samples from the current frame into the future in a scattering operation, we bring samples from the past into the current frame, gathering them. We can do that because we completely recompute the geometry at each frame, and we can use filtered lookups so we don't have to deal with the holes that show up with forward mapping.

13 Third insight
There is no need to reproject at the pixel level; the hard work is performed at the vertex level.
The final insight is that we don't really have to perform the reprojection computations at the pixel level. We can do most of the work at the vertex level.

14 Address calculation
[Diagram: the same triangle in clip space at time t and at time t+1]
Here is how this works. Let's say that at time t we have a triangle in clip space, which projects somewhere on the screen. At time t+1, the triangle has moved and projects somewhere else. We are now shading a fragment and need to find it in the previous frame. What we do is have our vertex shader output the old vertex coordinates as texture coordinates alongside the new vertices. Perspective-correct interpolation then produces the coordinates of the old surface point, and at the pixel we simply divide by w and we are done.
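A small numpy sketch of this address calculation for a single point follows; mvp_prev (the previous frame's model-view-projection matrix) and the [0, 1] texture remapping are assumed conventions, and real APIs differ in depth range and v-axis orientation.

```python
import numpy as np

def previous_frame_texcoord(obj_pos, mvp_prev):
    # Reproject one object-space point into the previous frame.
    # In the actual technique, mvp_prev @ position is evaluated per-vertex
    # and interpolated perspective-correctly by the rasterizer; only the
    # final divide by w runs per-pixel. Here both steps are shown together.
    clip = mvp_prev @ np.append(obj_pos, 1.0)  # previous-frame clip coordinates
    ndc = clip[:3] / clip[3]                   # the per-pixel divide by w
    uv = 0.5 * ndc[:2] + 0.5                   # NDC [-1, 1] -> texture [0, 1]
    return uv, ndc[2]                          # cache address plus expected depth
```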

15 Hit or miss?
[Diagram: the triangle at time t and at time t+1]
Of course, there is the problem of occlusion, as we saw with the ninja. A surface point that is visible at time t+1 might have been occluded at time t, and if we blindly look up and reuse, we will mix values from different surfaces. Let's look at this from the side.

16 Hit or miss?
Load the cached depth and compare it with the expected depth.
When we compute the old 3D coordinates of a surface point, we also obtain its depth. We can then compare that expected depth with the depth stored in the cache, and use the result to decide whether we had a hit or a miss.
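In code, the hit test is a single comparison; the tolerance below is an illustrative assumption, not a value from the talk.

```python
def is_cache_hit(cached_depth, expected_depth, eps=1e-3):
    # Accept the cached value only if the depth stored in the cache matches
    # the depth the reprojected point should have; eps absorbs interpolation
    # and precision error (the threshold value is illustrative).
    return abs(cached_depth - expected_depth) < eps
```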

17 Third insight
There is no need to reproject at the pixel level; the hard work is performed at the vertex level. Pass the old vertex coordinates as texture coordinates, leverage perspective-correct interpolation, and perform one single final division within the pixel shader.
Just to recap: the bulk of the computation is at the vertex level, where we compute the old vertex coordinates. The rest is just perspective-correct interpolation by the hardware, plus a single final division at the pixel level and a check that the depth matches.

18 What to cache?
Slowly varying, expensive computations: procedural albedo. Values required in multiple passes: color in depth of field or motion blur. Samples within a sampling process: amortized shadow map tests.
Good candidates for caching are slowly varying, expensive computations, such as procedural albedos. We don't have to cache the final color; we can cache any partial result we might want to reuse. The advantages of caching become even more visible in multipass algorithms that might reuse the same value several times. We can also amortize an integration process across consecutive frames; we will discuss a shadow mapping application of this.

19 Refreshing the cache
Cached entries become stale with time: view-dependent effects, repeated resampling. Implicit (multipass algorithms): flush the entire cache each time step. Random updates: refresh a random fraction of pixels. Amortized update: combine the cache with new values at each frame.
Of course, cache entries become stale with time, so we can't just keep reusing them ad infinitum. This may be due to view-dependent effects, time-varying parameters, the repeated resampling, and so on. In certain applications the refresh is implicit: multipass algorithms can flush the entire cache from one frame to the next. Otherwise, we can randomly select a small fraction of pixels and enforce cache misses on them. Finally, we can compute running sums by combining cache entries with newly computed values.
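A minimal sketch of the random-refresh policy, assuming a hypothetical recompute callback for the full shading path:

```python
import random

def load_or_refresh(cached_value, recompute, refresh_fraction=0.25):
    # Enforce a miss on a random fraction of pixels each frame so that,
    # on average, every entry is re-shaded within a few frames. recompute
    # is a hypothetical callback for the full shading path; the 1/4
    # fraction matches the dragon example on slide 25.
    if cached_value is None or random.random() < refresh_fraction:
        return recompute()   # enforced (or genuine) cache miss
    return cached_value      # reuse the cached value
```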

20 Motion blur
Here is an example of a multipass implementation of motion blur using the accumulation buffer. At 60 fps, we can use 3 passes.
[Image: 3 passes, 60 fps, brute force]

21 Reuse albedo in multipass
For each time step: fully compute the albedo in the first pass; for each remaining pass, look up into the first pass and try to reuse it.
To use our caching scheme, we fully compute the albedo on the first pass. Then, for each subsequent pass, we look up into the first pass and reuse the albedo. On a cache miss, we simply recompute the pixel from scratch.
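A sketch of this per-frame loop, under the simplifying assumptions that pixels can be iterated directly and that the albedo, per-pass shading, and hit test are exposed as hypothetical callbacks:

```python
def motion_blur_frame(pixels, num_passes, albedo_fn, shade_fn, hit_fn):
    # Sketch of albedo reuse across accumulation-buffer passes.
    # albedo_fn, shade_fn and hit_fn are hypothetical stand-ins: the
    # expensive procedural albedo, the cheap per-pass shading that
    # consumes it, and the reprojection hit test of slide 16.
    albedo = {p: albedo_fn(p) for p in pixels}        # first pass: full compute
    acc = {p: shade_fn(p, albedo[p]) for p in pixels}
    for _ in range(num_passes - 1):                   # remaining passes
        for p in pixels:
            a = albedo[p] if hit_fn(p) else albedo_fn(p)  # lookup, else recompute
            acc[p] += shade_fn(p, a)
    return {p: acc[p] / num_passes for p in pixels}   # accumulation average
```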

22 Motion blur
[Image: 3 passes, 60 fps, brute force]

23 Motion blur
Using the same computational budget, we can double the number of passes; to achieve the same quality, brute force would take twice as long.
[Image: 6 passes, 60 fps, cached vs. 30 fps, brute force]

24 Motion blur
At which point, we can more than double the number of passes again if we are using the cache.
[Image: 14 passes, 30 fps, cached]

25 Randomly distributed refresh
[Images: 1/4th of pixels updated per frame; error plot]
In this example, the dragon model is rendered with strong view-dependent effects on top of an expensive diffuse shader. At each frame we select a random fraction of pixels and enforce cache misses. The error is spread randomly over the model and is not distracting to the user.

26 Amortized super-sampling
The cache is updated by a recursive filter rule. Lambda controls the variance reduction... but also the lifespan.
Yet another way to refresh the cache is what we call amortized super-sampling. This is useful when we have to sample a function within the pixel shader: we compute a new sample each frame and blend it with a running sum stored in the cache. The effect is that we reduce the variance of the estimator, at the cost of increasing the reaction time of the system.

27 Trade-offs
[Plot: variance reduction and lifespan as functions of lambda]
A value of 0.6 gives us 1/4 of the original variance, with only 7 frames of influence.
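These numbers can be checked directly. Assuming the filter weights the cached value by lambda (the talk does not spell out the convention), the steady-state variance factor works out to (1 - λ)/(1 + λ):

```python
def amortized_update(cached, new_sample, lam=0.6):
    # Recursive filter rule. Convention assumed here: lam weights the
    # cached running estimate and (1 - lam) weights the fresh sample.
    return (1.0 - lam) * new_sample + lam * cached

# Steady-state variance factor of this filter is (1 - lam) / (1 + lam):
lam = 0.6
print((1 - lam) / (1 + lam))  # 0.25, i.e. 1/4 of one sample's variance
# A sample's influence decays as lam ** k, dropping below ~3% in 7 frames:
print(lam ** 7)               # ~0.028, the "7 frames of influence" lifespan
```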

28 Variance reduction at work
Percentage-closer filtering uses many samples in a shadow map test to avoid aliasing due to low-resolution shadow maps. 16 tests give good quality but may be too expensive; 4 taps give good performance but are not visually pleasing.
[Images: 16-tap PCF vs. 4-tap PCF]

29 Reusing shadow map tests
At each frame, perform new shadow tests, read the running sum from the cache, blend the two values, then update the cache and display the results.
We keep a running sum of shadow map tests: at each frame, we perform a few new tests and combine them with the running sum in the cache.
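A sketch of that blend, reusing the recursive filter convention assumed on the previous slides:

```python
def amortized_pcf(cached_estimate, new_tests, lam=0.6):
    # Blend this frame's few shadow-map test results (e.g. a list of 4
    # binary in-shadow results) with the running estimate from the cache,
    # using the recursive filter from the amortized super-sampling slides.
    # lam = 0.6 is carried over from slide 27 as an assumption.
    fresh = sum(new_tests) / len(new_tests)            # this frame's 4-tap PCF
    return (1.0 - lam) * fresh + lam * cached_estimate
```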

30 Variance reduction at work
So now, for almost the same price as performing only 4 tests per frame, we achieve quality similar to 16 tests per frame.
[Images: 16-tap PCF vs. 4-tap PCF]

31 Variance reduction at work
[Images: 16-tap PCF vs. 4-tap amortized]

32 Conclusions
Shading every frame anew is wasteful: we can reuse some of the shading computations from previous frames, and reverse reprojection caching lets us do so in real-time rendering applications. Less work per frame = faster rendering.

33 Future work
Track surface points and select shader level of detail based on screen-space speed. Change the refresh rate per pixel based on the rate of change of the cached value. Use code analysis to automatically select appropriate values to cache.

