An Optimized Diffusion Depth Of Field Solver (DDOF)


2 An Optimized Diffusion Depth Of Field Solver (DDOF)
Holger Gruen, AMD. 28th February 2011. AMD's Favorite Effects.

3 Agenda
Motivation.
Recap of a high-level explanation of DDOF.
Recap of earlier DDOF solvers.
A vanilla Cyclic Reduction (CR) DDOF solver.
A DX11-optimized CR solver for DDOF.
Results.

4 Motivation
The solver presented at GDC 2010 [RS2010] has some weaknesses.
It is a great implementation, but its memory requirements and runtime are too high for many game developers.
We are looking for a faster and more memory-efficient solver.

5 Diffusion DOF recap 1
DDOF is an enhanced way of blurring a picture that takes an arbitrary CoC (circle of confusion) at each pixel into account.
It interprets the input image as a heat distribution.
It uses the CoC at a pixel to derive a per-pixel heat conductivity.

6 Diffusion DOF recap 2
Blurring is done by time-stepping a differential equation that models the diffusion of heat.
The ADI (alternating direction implicit) method is used to arrive at a separable solution for stepping.
This requires solving a tri-diagonal linear system for each row and then for each column of the input.

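7 DDOF Tri-diagonal system
[Figure: the tri-diagonal system solved per row/col. The matrix coefficients are derived from the CoC at each pixel of an input row/col, the right-hand side is the row/col of the input image, and the unknown vector is the resulting blurred row/col.]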
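
For readers without the figure, the system has the usual tri-diagonal shape below. The symbols a_i, b_i, c_i (the "abc" coefficients), x_i (input row/col) and y_i (blurred row/col) follow the naming used on the later slides; the index range 1..n is an assumption made for illustration. The second line is one common implicit heat-diffusion discretization with per-pixel conductivities beta_i, stated here as an assumption about how abc relates to the CoC rather than something taken from the slides.

```latex
\begin{pmatrix}
b_1 & c_1 &        &         &         \\
a_2 & b_2 & c_2    &         &         \\
    & \ddots & \ddots & \ddots &       \\
    &     & a_{n-1} & b_{n-1} & c_{n-1} \\
    &     &         & a_n     & b_n
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_{n-1} \\ y_n \end{pmatrix}
=
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_{n-1} \\ x_n \end{pmatrix}

% Assumed discretization (beta_0 = beta_n = 0):
a_i = -\beta_{i-1}, \qquad
b_i = 1 + \beta_{i-1} + \beta_i, \qquad
c_i = -\beta_i
```

With this choice every row of the matrix sums to 1, so a constant input row stays constant after blurring.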

8 Solver recap 1
The GDC2010 solver [RS2010] is a 'hybrid' solver.
It performs three PCR (parallel cyclic reduction) steps upfront.
It then runs the serial 'sweep' algorithm to solve the small resulting systems.
See [ZCO2010] for details on other hybrid solvers.
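
The serial sweep mentioned above is sketched below in Python as a CPU reference, not as the shader code from [RS2010]. It is assumed here that "sweep" refers to the classic Thomas algorithm; the function name `thomas_solve` and the list-based layout are hypothetical. The `c_mod` and `x_mod` arrays are the modified coefficients that a GPU version would have to keep in a scratch-pad.

```python
def thomas_solve(a, b, c, x):
    """Solve a[i]*y[i-1] + b[i]*y[i] + c[i]*y[i+1] = x[i] for y.

    a[0] and c[-1] lie outside the matrix and are never used in the result.
    """
    n = len(b)
    c_mod = [0.0] * n  # modified upper-diagonal coefficients (scratch data)
    x_mod = [0.0] * n  # modified right-hand side (scratch data)

    # Forward sweep: eliminate the lower diagonal, strictly serial in i.
    c_mod[0] = c[0] / b[0]
    x_mod[0] = x[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * c_mod[i - 1]
        c_mod[i] = c[i] / denom
        x_mod[i] = (x[i] - a[i] * x_mod[i - 1]) / denom

    # Backward sweep: substitute from the last unknown to the first.
    y = [0.0] * n
    y[-1] = x_mod[-1]
    for i in range(n - 2, -1, -1):
        y[i] = x_mod[i] - c_mod[i] * y[i + 1]
    return y
```

Both loops carry a dependency from one index to the next, which is why the sweep only maps well to a GPU when many small systems are solved side by side.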

9 Solver recap 2
The GDC2010 solver [RS2010] has drawbacks.
It uses a large UAV as a read/write scratch-pad to store the modified coefficients of the sweep algorithm; GPUs without a read/write cache will suffer.
At high resolutions, three PCR steps still produce tri-diagonal systems of substantial size.
This means the serial sweep algorithm is run on a 'big' system.

10 Solver recap 3
Cyclic Reduction (CR) solver.
Used by [Kass2006] in the original DDOF paper.
Runs in two phases: a reduction phase and a backward substitution phase.
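
The two phases can be sketched on the CPU as follows. This is a minimal recursive Python illustration of textbook cyclic reduction, not the pixel-shader implementation from the talk; the function name `cyclic_reduction_solve` and the choice to keep the odd-indexed unknowns are assumptions made for the example.

```python
def cyclic_reduction_solve(a, b, c, x):
    """Solve a[i]*y[i-1] + b[i]*y[i] + c[i]*y[i+1] = x[i] by cyclic reduction."""
    n = len(b)
    a = [0.0] + list(a[1:])   # a[0] lies outside the matrix
    c = list(c[:-1]) + [0.0]  # c[n-1] lies outside the matrix
    if n == 1:
        return [x[0] / b[0]]  # end of the reduction phase: size 1

    # Reduction phase: keep odd-indexed unknowns, eliminate their even neighbours.
    ra, rb, rc, rx = [], [], [], []
    for i in range(1, n, 2):
        k1 = a[i] / b[i - 1]
        new_a = -a[i - 1] * k1
        new_b = b[i] - c[i - 1] * k1
        new_c = 0.0
        new_x = x[i] - x[i - 1] * k1
        if i + 1 < n:
            k2 = c[i] / b[i + 1]
            new_b -= a[i + 1] * k2
            new_c = -c[i + 1] * k2
            new_x -= x[i + 1] * k2
        ra.append(new_a)
        rb.append(new_b)
        rc.append(new_c)
        rx.append(new_x)

    # Solve the half-sized system (recursion stands in for the pass chain).
    y = [0.0] * n
    y[1::2] = cyclic_reduction_solve(ra, rb, rc, rx)

    # Backward substitution phase: recover the eliminated even unknowns.
    for i in range(0, n, 2):
        left = a[i] * y[i - 1] if i > 0 else 0.0
        right = c[i] * y[i + 1] if i + 1 < n else 0.0
        y[i] = (x[i] - left - right) / b[i]
    return y
```

Each level of the recursion corresponds to one reduce pass on the way down and one substitute pass on the way up; within a level, every index can be processed independently.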

11 Solver recap 4
According to [ZCO2010], the CR solver has the lowest computational complexity of all solvers.
It suffers from a lack of parallelism, though: at the end of the reduction phase and at the start of the backward substitution phase.

12 Passes of a Vanilla CR Solver
[Diagram: Pass 1 constructs abc from the CoC. Reduce passes repeatedly shrink the input image X and abc until the system has size 1. The first y is solved there, and substitute passes then expand the solution back up to the blurred image Y.]

13 Vanilla Solver Results
Higher performance than reported in [Bavoil2010] (~6 ms vs. ~8 ms at 1600x1200).
The memory footprint is prohibitively high: >200 MB at 1600x1200.
The lack of parallelism still needs to be tackled; an answer is given in [ZCO2010].

14 Vanilla CR Solver
[Diagram: the same passes as on slide 12. Annotation on reducing down to size 1 and solving for the first y there: this is what kills parallelism.]

15 Keeping the parallelism high
[Diagram: Pass 1 constructs abc from the CoC; reduce passes shrink the input image X and abc; substitute passes produce the blurred image Y.]
Stop the reduction at a reasonable size.
Solve for Y at that resolution to have a big enough parallel workload (e.g. using PCR, see [ZCO2010]).
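
A CPU sketch of this idea in Python: cyclic reduction runs only until the system is no larger than a chosen size, and the remaining system is solved with parallel cyclic reduction (PCR), in which every equation is updated in every step. The names `pcr_solve`, `hybrid_cr_solve` and the `stop_size` parameter are hypothetical, and the value of a "reasonable size" is not given on the slide, so the default below is only a placeholder.

```python
def pcr_solve(a, b, c, x):
    """Parallel cyclic reduction: all n equations are reduced in each step."""
    n = len(b)
    a, b, c, x = list(a), list(b), list(c), list(x)
    a[0] = 0.0
    c[-1] = 0.0
    stride = 1
    while stride < n:
        na, nb, nc, nx = [0.0] * n, list(b), [0.0] * n, list(x)
        for i in range(n):  # iterations are independent: one GPU thread each
            lo, hi = i - stride, i + stride
            if lo >= 0:
                k = a[i] / b[lo]
                na[i] = -a[lo] * k
                nb[i] -= c[lo] * k
                nx[i] -= x[lo] * k
            if hi < n:
                k = c[i] / b[hi]
                nc[i] = -c[hi] * k
                nb[i] -= a[hi] * k
                nx[i] -= x[hi] * k
        a, b, c, x = na, nb, nc, nx
        stride *= 2
    return [x[i] / b[i] for i in range(n)]  # all couplings are gone


def hybrid_cr_solve(a, b, c, x, stop_size=4):
    """Cyclic reduction that hands over to PCR once n <= stop_size."""
    n = len(b)
    if n <= stop_size:
        return pcr_solve(a, b, c, x)
    a = [0.0] + list(a[1:])
    c = list(c[:-1]) + [0.0]

    ra, rb, rc, rx = [], [], [], []
    for i in range(1, n, 2):  # reduce: keep the odd-indexed unknowns
        k1 = a[i] / b[i - 1]
        new_a, new_b = -a[i - 1] * k1, b[i] - c[i - 1] * k1
        new_c, new_x = 0.0, x[i] - x[i - 1] * k1
        if i + 1 < n:
            k2 = c[i] / b[i + 1]
            new_b -= a[i + 1] * k2
            new_c = -c[i + 1] * k2
            new_x -= x[i + 1] * k2
        ra.append(new_a)
        rb.append(new_b)
        rc.append(new_c)
        rx.append(new_x)

    y = [0.0] * n
    y[1::2] = hybrid_cr_solve(ra, rb, rc, rx, stop_size)
    for i in range(0, n, 2):  # substitute: recover the even-indexed unknowns
        left = a[i] * y[i - 1] if i > 0 else 0.0
        right = c[i] * y[i + 1] if i + 1 < n else 0.0
        y[i] = (x[i] - left - right) / b[i]
    return y
```

A larger `stop_size` trades extra arithmetic in the PCR stage for fewer passes that run with only a handful of active elements.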

16 Memory Optimizations 1
[Diagram: as on slide 15. The reduction stops at a reasonable size and Y is solved at that resolution.]

17 Memory Optimizations 1
[Diagram: as on slide 16, with surface formats annotated. The X, abc and Y chains all use rgba32f surfaces.]

18 Memory Optimizations 1
[Diagram: the X and Y chains now use rgba16f surfaces; the abc chain stays at rgba32f.]
This saves a significant amount of memory.
We found no artifacts when going from rgba32f to rgba16f.

19 Memory Optimizations 2
[Diagram: formats as on slide 18 (X and Y chains rgba16f, abc chain rgba32f).]
This again saves a significant amount of memory, as this is the biggest surface used by the solver.

20 Memory Optimizations 2
[Diagram: the abc chain has one rgba32f surface and one reduce pass fewer; the X and Y chains stay rgba16f.]
Skip the abc construction pass and compute abc on the fly during the first reduction pass.

21 Intermediate Results 1600x1200
Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in megabytes
GDC2010 hybrid solver on GTX480 | ~8.5 | 8.00 [Bavoil2010] | ~117 (guesstimate)
Standard solver (already skips high-res abc construction) | 3.66 | 3.33 | ~132

22 Memory Optimizations 3
[Diagram: as on slide 20.]
Yet again this saves a significant amount of memory!
Skip the abc construction pass and compute abc during the first reduction pass.

23 Memory Optimizations 3
[Diagram: the X chain (rgba16f) starts with a reduce4 pass and the Y chain (rgba16f) ends with a substitute4 pass.]
Stop at a reasonable size and solve for Y at that resolution.
Skip the abc construction pass and compute abc during the first reduction pass.
Reduce 4-to-1 in a special first reduction pass.
Substitute 1-to-4 in a special substitution pass.

24 Intermediate Results 1600x1200
Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in megabytes
GDC2010 hybrid solver on GTX480 | ~8.5 | 8.00 [Bavoil2010] | ~117 (guesstimate)
Standard solver (already skips high-res abc construction) | 3.66 | 3.33 | ~132
4-to-1 reduction | 2.87 | 3.32 | ~73

25 DX11 Memory Optimizations 1
[Diagram and annotations as on slide 23: a reduce4 first pass, abc computed during the first reduction pass, rgba16f X and Y chains, and a substitute4 last pass.]

26 DX11 Memory Optimizations 1
Pack abc and X into one rgba_uint surface.
[Diagram and annotations as on slide 23: a reduce4 first pass, abc computed during the first reduction pass, rgba16f X and Y chains, and a substitute4 last pass.]

27 Using SM5 for data packing
[Diagram: the rgba16f X texel and the rgba32f abc texel are packed into the uint channels of one texel.]
The x and y channels of X are packed into one uint:
f32tof16(X.x) + (f32tof16(X.y) << 16)
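
The same bit layout can be checked on the CPU. The Python sketch below emulates the HLSL intrinsics `f32tof16` and `f16tof32` with the standard `struct` module; `pack_xy` and `unpack_xy` are hypothetical helper names, and `struct` rounds to nearest even, which may not match the rounding of a given GPU exactly.

```python
import struct


def f32tof16(value):
    """Bit pattern of value converted to IEEE half precision, as an int."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


def f16tof32(bits):
    """Float value of the half-precision bit pattern in the low 16 bits."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]


def pack_xy(x, y):
    """X.x goes into the low 16 bits, X.y into the high 16 bits of one uint."""
    return (f32tof16(x) + (f32tof16(y) << 16)) & 0xFFFFFFFF


def unpack_xy(packed):
    """Inverse of pack_xy; returns (x, y) at half precision."""
    return f16tof32(packed & 0xFFFF), f16tof32(packed >> 16)


packed = pack_xy(1.0, 0.5)
print(hex(packed))        # prints 0x38003c00
print(unpack_xy(packed))  # prints (1.0, 0.5)
```

Values outside the half-precision range raise `OverflowError` in `struct.pack`, whereas a shader would typically clamp or produce infinity, so inputs are assumed to be in range.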

28 Using SM5 for data packing
[Diagram: the rgba16f X texel and the rgba32f abc texel are packed into the uint channels of one texel.]
The lower 6 bits of the z channel of X (as a half) are packed together with the higher 26 bits of the x channel of abc:
(asuint(abc.x) & 0xFFFFFFC0) | (f32tof16(X.z) & 0x3F)
This steals the 6 lowest mantissa bits of abc.x to store some bits of X.z.

29 Using SM5 for data packing
[Diagram: the rgba16f X texel and the rgba32f abc texel are packed into the uint channels of one texel.]
The central 6 bits of the z channel of X (as a half) are packed together with the higher 26 bits of the y channel of abc:
(asuint(abc.y) & 0xFFFFFFC0) | ((f32tof16(X.z) >> 6) & 0x3F)
This steals the 6 lowest mantissa bits of abc.y to store some bits of X.z.

30 SM5 Memory Optimizations 1
[Diagram: the rgba16f X texel and the rgba32f abc texel are packed into the uint channels of one texel.]
The remaining higher bits (bits 12 to 15) of the z channel of X (as a half) are packed together with the higher 26 bits of the z channel of abc:
(asuint(abc.z) & 0xFFFFFFC0) | ((f32tof16(X.z) >> 12) & 0x3F)
This steals the 6 lowest mantissa bits of abc.z to store some bits of X.z.
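
Taken together, the three expressions spread the 16 bits of the half-precision X.z over the low mantissa bits of abc.x, abc.y and abc.z. The Python sketch below mirrors them on the CPU and adds a matching unpack step; `pack_z` and `unpack_z` are hypothetical names, and the unpack logic is an assumption, since the slides only show the packing side.

```python
import struct


def asuint(value):
    """Bit pattern of value as a 32-bit float (like HLSL asuint)."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def asfloat(bits):
    """Float value of a 32-bit pattern (like HLSL asfloat)."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def f32tof16(value):
    return struct.unpack("<H", struct.pack("<e", value))[0]


def f16tof32(bits):
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]


def pack_z(abc, z):
    """Store bits 0-5, 6-11 and 12-15 of half(z) in abc.x, abc.y and abc.z."""
    half = f32tof16(z)
    return tuple(
        (asuint(abc[k]) & 0xFFFFFFC0) | ((half >> (6 * k)) & 0x3F)
        for k in range(3)
    )


def unpack_z(packed):
    """Return (abc, z): abc with its 6 low mantissa bits zeroed, z as half."""
    half = 0
    for k in range(3):
        half |= (packed[k] & 0x3F) << (6 * k)
    abc = tuple(asfloat(p & 0xFFFFFFC0) for p in packed)
    return abc, f16tof32(half)


abc_in = (-0.3, 1.7, -0.4)
abc_out, z_out = unpack_z(pack_z(abc_in, 0.25))
print(z_out)  # prints 0.25
```

Clearing 6 of the 23 mantissa bits bounds the relative error in each abc coefficient by 2^-17, which is small next to the half precision already used for X and Y.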

31 Sample Screenshot

32 Abs(Packed-Unpacked) x 255.0f

33 DX11 Memory Optimizations 2
The solver does a horizontal and a vertical pass.
The chain of lower-resolution RTs needs to exist twice: once as the horizontal reduction/substitution chain and once as the vertical reduction/substitution chain.
How can DX11 help?

34 DX11 Memory Optimizations 2
UAVs allow us to reuse data of the horizontal chain for the vertical chain.
A proof-of-concept implementation shows that this works nicely but impacts the runtime significantly (~40% lower fps).
We stayed with RTs, as memory use was already quite low.
Use this only if you are really concerned about memory.

35 Final Results 1600x1200
Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in megabytes
GDC2010 hybrid solver on GTX480 | ~8.5 | 8.00 [Bavoil2010] | ~117 (guesstimate)
Standard solver (already skips high-res abc construction) | 3.66 | 3.33 | ~132
4-to-1 reduction | 2.87 | 3.32 | ~73
4-to-1 reduction + SM5 packing | 2.75 | 3.14 | ~58

36 Future Work
Look into compute shader (CS) acceleration of the solver: the 4-to-1 reduction pass and the 1-to-4 substitution pass.
Look into using heat diffusion for other effects, e.g. motion blur.

37 Conclusion
The optimized CR solver is fast and memory-efficient.
It is used in Dragon Age 2.
4A Games is considering its use for new projects.
A detailed description is in 'Game Engine Gems 2'.
Mail me if you want access to the sources.

38 References
[Kass2006] "Interactive depth of field using simulated diffusion on a GPU", Michael Kass, Pixar Animation Studios, Pixar technical memo #06-01
[ZCO2010] "Fast Tridiagonal Solvers on the GPU", Y. Zhang, J. Cohen, J. D. Owens, PPoPP 2010
[RS2010] "DX11 Effects in Metro 2033: The Last Refuge", A. Rege, O. Shishkovtsov, GDC 2010
[Bavoil2010] "Modern Real-Time Rendering Techniques", L. Bavoil, FGO2010

39 Backup

40 Results 1920x1200
Solver | Time in ms (HD5870) | Time in ms (GTX480) | Memory in megabytes
Standard solver (already skips high-res abc construction) | 4.31 | 4.03 | ~158
4-to-1 reduction | 3.36 | 4.02 | ~88
4-to-1 reduction + SM5 packing | 3.23 | 3.79 | ~70

