Published by Clarissa claudia Pryce. Modified over 2 years ago.

1

2
An Optimized Diffusion Depth Of Field Solver (DDOF)
Holger Gruen – AMD
28th February 2011, AMD's Favorite Effects

3
Agenda
Motivation
Recap of a high-level explanation of DDOF
Recap of earlier DDOF solvers
A vanilla Cyclic Reduction (CR) DDOF solver
A DX11 optimized CR solver for DDOF
Results

4
Motivation
The solver presented at GDC 2010 [RS2010] has some weaknesses
It is a great implementation, but its memory requirements and runtime are too high for many game developers
We are looking for a faster and more memory-efficient solver

5
Diffusion DOF recap 1
DDOF is an enhanced way of blurring a picture that takes an arbitrary CoC (Circle of Confusion) at each pixel into account
It interprets the input image as a heat distribution
It uses the CoC at a pixel to derive a per-pixel heat conductivity

6
Diffusion DOF recap 2
Blurring is done by time stepping a differential equation that models the diffusion of heat
The ADI (alternating direction implicit) method is used to arrive at a separable solution for stepping
This requires solving a tri-diagonal linear system for each row and then each column of the input

7
DDOF Tri-diagonal system
Figure: the tri-diagonal system, with these labels
– abc (matrix coefficients): derived from CoC at each pixel of an input row/col
– X (right-hand side): row/col of input image
– Y (unknowns): resulting blurred row/col
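
The figure itself did not survive, so here is a sketch of the general shape such a system takes. The symbols a_i, b_i, c_i, x_i and y_i are assumed names that follow the slide labels (abc, X, Y); n is the number of pixels in the row or column. The slides do not spell out how abc is computed from the CoC, so that is left open here.

```latex
\begin{pmatrix}
b_1 & c_1    &        &         &         \\
a_2 & b_2    & c_2    &         &         \\
    & a_3    & b_3    & \ddots  &         \\
    &        & \ddots & \ddots  & c_{n-1} \\
    &        &        & a_n     & b_n
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{pmatrix}
\qquad\Longleftrightarrow\qquad
a_i\,y_{i-1} + b_i\,y_i + c_i\,y_{i+1} = x_i ,\quad a_1 = c_n = 0 .
```

Each unknown is coupled only to its two neighbours, which is what the sweep, CR and PCR solvers discussed on the following slides exploit.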

8
Solver recap 1
The GDC2010 solver [RS2010] is a hybrid solver
– Performs three PCR (parallel cyclic reduction) steps upfront
– Performs the serial sweep algorithm to solve the small resulting systems
– Check [ZCO2010] for details on other hybrid solvers
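
As a reference point for the serial part, here is a minimal Python sketch of a sweep solver (often called the Thomas algorithm). It is not the [RS2010] shader code. The function name `thomas_solve` and the list-based layout are assumptions made for illustration.

```python
def thomas_solve(a, b, c, x):
    """Solve a[i]*y[i-1] + b[i]*y[i] + c[i]*y[i+1] = x[i] serially.

    a[0] and c[-1] are ignored (treated as zero). Hypothetical helper,
    not taken from the slides.
    """
    n = len(b)
    c_mod = [0.0] * n  # modified upper-diagonal coefficients
    x_mod = [0.0] * n  # modified right-hand side
    c_mod[0] = c[0] / b[0]
    x_mod[0] = x[0] / b[0]
    # Forward sweep: eliminate the lower diagonal, one row after another.
    for i in range(1, n):
        denom = b[i] - a[i] * c_mod[i - 1]
        c_mod[i] = c[i] / denom
        x_mod[i] = (x[i] - a[i] * x_mod[i - 1]) / denom
    # Backward sweep: substitute starting from the last unknown.
    y = [0.0] * n
    y[-1] = x_mod[-1]
    for i in range(n - 2, -1, -1):
        y[i] = x_mod[i] - c_mod[i] * y[i + 1]
    return y
```

Both loops depend on the previous iteration, so one system cannot be split across threads. The `c_mod` and `x_mod` arrays are the kind of modified coefficients that have to be stored between the two sweeps.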

9
Solver recap 2
The GDC2010 solver [RS2010] has drawbacks
– It uses a large UAV as a RW scratch-pad to store the modified coefficients of the sweep algorithm; GPUs without a RW cache will suffer
– For high resolutions, three PCR steps still produce tri-diagonal systems of substantial size, which means a serial (sweep) algorithm is run on a big system

10
Solver recap 3
Cyclic Reduction (CR) solver
– Used by [Kass2006] in the original DDOF paper
– Runs in two phases
  1. reduction phase
  2. backward substitution phase
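
The two phases can be sketched in a few lines of Python. This is a CPU illustration under stated assumptions, not the solver from the talk: the function name is hypothetical, the odd-indexed equations are the ones kept, and each recursion level stands in for one reduction pass and one substitution pass on the GPU.

```python
def cyclic_reduction_solve(a, b, c, x):
    """Solve a[i]*y[i-1] + b[i]*y[i] + c[i]*y[i+1] = x[i] by cyclic reduction.

    a[0] and c[-1] are ignored. Hypothetical helper for illustration only.
    """
    n = len(b)
    if n == 1:
        return [x[0] / b[0]]  # "stop at size 1, solve for the first y"

    # Reduction phase: fold each even neighbour into the odd equation i,
    # which halves the system. All i are independent of each other.
    ra, rb, rc, rx = [], [], [], []
    for i in range(1, n, 2):
        has_right = i + 1 < n
        k1 = a[i] / b[i - 1]
        k2 = c[i] / b[i + 1] if has_right else 0.0
        ra.append(-a[i - 1] * k1 if i > 1 else 0.0)
        rc.append(-c[i + 1] * k2 if i + 2 < n else 0.0)
        rb.append(b[i] - c[i - 1] * k1 - (a[i + 1] * k2 if has_right else 0.0))
        rx.append(x[i] - x[i - 1] * k1 - (x[i + 1] * k2 if has_right else 0.0))

    y_odd = cyclic_reduction_solve(ra, rb, rc, rx)

    # Backward substitution phase: the even unknowns follow from their
    # already-solved odd neighbours.
    y = [0.0] * n
    y[1::2] = y_odd
    for i in range(0, n, 2):
        left = a[i] * y[i - 1] if i > 0 else 0.0
        right = c[i] * y[i + 1] if i + 1 < n else 0.0
        y[i] = (x[i] - left - right) / b[i]
    return y
```

Within one level every equation can be processed independently, but the levels themselves have to run one after another.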

11
Solver recap 4
According to [ZCO2010]:
– The CR solver has the lowest computational complexity of all solvers
– It suffers from a lack of parallelism, though
  – at the end of the reduction phase
  – at the start of the backward substitution phase
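
To put rough numbers on the parallelism problem, the following Python sketch lists the system size after each reduction step. It assumes each step halves the system and rounds down; the helper name and the 1600x1200 example are illustrative only.

```python
def cr_level_sizes(n):
    """Per-row system size after each 2-to-1 reduction step, down to size 1."""
    sizes = [n]
    while sizes[-1] > 1:
        sizes.append(sizes[-1] // 2)
    return sizes


if __name__ == "__main__":
    height = 1200  # number of rows solved side by side in a horizontal pass
    for size in cr_level_sizes(1600):
        print(f"row system size {size:5d} -> {size * height:8d} independent equations")
```

For a row of 1600 pixels this gives 1600, 800, 400, 200, 100, 50, 25, 12, 6, 3, 1. The last few levels offer very little independent work per row, and the substitution phase walks the same levels in reverse.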

12
Passes of a Vanilla CR Solver
Diagram: input image X → pass 1: construct abc from CoC → reduce … → stop at size 1 → solve for the first y → substitute … → Y (blurred image)

13
Vanilla Solver Results
Higher performance than reported in [Bavoil2010] (~6 ms vs. ~8 ms at 1600x1200)
Memory footprint prohibitively high
– >200 MB at 1600x1200
The lack-of-parallelism problem still needs an answer
– An answer is given in [ZCO2010]

14
Vanilla CR Solver
Diagram: input image X → pass 1: construct abc from CoC → reduce … → stop at size 1 → solve for the first y → substitute … → Y (blurred image)
Annotation on the "stop at size 1" and "solve for the first y" steps: this is what kills parallelism

15
Keeping the parallelism high
Diagram: input image X → pass 1: construct abc from CoC → reduce … → stop at a reasonable size → solve for Y at that resolution → substitute … → Y (blurred image)
Solving for Y at that resolution keeps the parallel workload big enough (e.g. using PCR, see [ZCO2010])
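
The slide names PCR as one option for the solve at the stopping resolution. Below is a minimal Python sketch of parallel cyclic reduction. The function name is an assumption, and the inner loop over `i` stands in for what would be one thread per equation on a GPU.

```python
def pcr_solve(a, b, c, x):
    """Solve a[i]*y[i-1] + b[i]*y[i] + c[i]*y[i+1] = x[i] by PCR.

    a[0] and c[-1] are ignored. Hypothetical helper for illustration only.
    """
    n = len(b)
    a = [0.0] + list(a[1:])
    c = list(c[:-1]) + [0.0]
    b, x = list(b), list(x)
    stride = 1
    while stride < n:
        na, nb, nc, nx = [0.0] * n, [0.0] * n, [0.0] * n, [0.0] * n
        # Every equation is updated in every step, so the amount of
        # parallel work stays at n instead of halving.
        for i in range(n):
            lo, hi = i - stride, i + stride
            k1 = a[i] / b[lo] if lo >= 0 else 0.0
            k2 = c[i] / b[hi] if hi < n else 0.0
            na[i] = -a[lo] * k1 if lo >= 0 else 0.0
            nc[i] = -c[hi] * k2 if hi < n else 0.0
            nb[i] = b[i] - (c[lo] * k1 if lo >= 0 else 0.0) \
                         - (a[hi] * k2 if hi < n else 0.0)
            nx[i] = x[i] - (x[lo] * k1 if lo >= 0 else 0.0) \
                         - (x[hi] * k2 if hi < n else 0.0)
        a, b, c, x = na, nb, nc, nx
        stride *= 2
    # All couplings now reach outside the system, so each equation is y = x / b.
    return [x[i] / b[i] for i in range(n)]
```

PCR does more arithmetic than CR, which is why it makes sense only on the already-reduced system: CR shrinks the problem cheaply, and PCR finishes it without the serial tail.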

16
Memory Optimizations 1
Diagram: input image X → pass 1: construct abc from CoC → reduce … → stop at a reasonable size → solve for Y at that resolution → substitute … → Y (blurred image)

17
Memory Optimizations 1
Diagram: the same passes, with every surface (X, abc, the reduced levels and the substitution results) stored as rgba32f

18
Memory Optimizations 1
Diagram: the same passes, with X, its reduced levels and the substitution results stored as rgba16f; abc and its reduced levels stay rgba32f
This saves a significant amount of memory
– We found no artifacts when going from rgba32f to rgba16f
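
As a rough guide to what the format change buys per surface, here is a small Python calculation. It covers a single full-resolution surface only and does not attempt to reproduce the totals quoted in the results slides. The helper name is an assumption, and 1 MB is taken as 2**20 bytes.

```python
BYTES_PER_TEXEL = {"rgba32f": 16, "rgba16f": 8}  # 4 channels x 4 or 2 bytes


def surface_megabytes(width, height, fmt):
    """Size of one width x height surface in MB (1 MB = 2**20 bytes here)."""
    return width * height * BYTES_PER_TEXEL[fmt] / 2**20


if __name__ == "__main__":
    for fmt in ("rgba32f", "rgba16f"):
        print(fmt, round(surface_megabytes(1600, 1200, fmt), 1), "MB")
```

At 1600x1200 one rgba32f surface is about 29.3 MB and one rgba16f surface about 14.6 MB, so every surface moved to half precision halves its footprint.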

19
Memory Optimizations 2
Diagram: the same passes and formats (X and Y chain rgba16f, abc chain rgba32f), with the full-resolution abc surface as the next target
Removing it again saves a significant amount of memory, as this is the biggest surface used by the solver

20
Memory Optimizations 2
Diagram: the same passes, now without a full-resolution abc surface (X and Y chain rgba16f, reduced abc levels rgba32f)
Skip the abc construction pass and compute abc on the fly during the first reduction pass

21
Intermediate Results 1600x1200
Solver | Time in ms (HD5870 / GTX480) | Memory in Megabytes
GDC2010 hybrid solver on GTX480 | ~ [Bavoil2010] | ~117 (guesstimate)
Standard Solver (already skips high-res abc construction) | | ~132

22
Memory Optimizations 3
Diagram: the same passes, without a full-resolution abc surface (X and Y chain rgba16f, reduced abc levels rgba32f)
Skip the abc construction pass and compute abc during the first reduction pass
Yet again this saves a significant amount of memory!

23
Memory Optimizations 3
Diagram: X (rgba16f) → reduce4 (abc computed on the fly) → … → stop at a reasonable size → solve for Y at that resolution → substitute … → substitute4 → Y (rgba16f)
Skip the abc construction pass and compute abc during the first reduction pass
Reduce 4-to-1 in a special first reduction pass
Substitute 1-to-4 in a special substitution pass
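
One plausible reading of the 4-to-1 pass is two ordinary 2-to-1 reduction steps fused into one, so that the half-size system is never written to memory. The Python sketch below illustrates that reading only. The function names are assumptions, and in Python the intermediate system does exist briefly, whereas a shader would keep it in registers.

```python
def reduce_once(a, b, c, x):
    """One 2-to-1 cyclic reduction step that keeps the odd-indexed equations."""
    n = len(b)
    ra, rb, rc, rx = [], [], [], []
    for i in range(1, n, 2):
        has_right = i + 1 < n
        k1 = a[i] / b[i - 1]
        k2 = c[i] / b[i + 1] if has_right else 0.0
        ra.append(-a[i - 1] * k1 if i > 1 else 0.0)
        rc.append(-c[i + 1] * k2 if i + 2 < n else 0.0)
        rb.append(b[i] - c[i - 1] * k1 - (a[i + 1] * k2 if has_right else 0.0))
        rx.append(x[i] - x[i - 1] * k1 - (x[i + 1] * k2 if has_right else 0.0))
    return ra, rb, rc, rx


def reduce4(a, b, c, x):
    """Fused 4-to-1 step: two 2-to-1 steps back to back.

    The result couples only the unknowns y[3], y[7], y[11], ...
    """
    return reduce_once(*reduce_once(a, b, c, x))
```

Under this reading the largest reduced level in the chain is a quarter of the input size rather than half. The matching 1-to-4 substitution pass would then recover the skipped unknowns from the quarter-size solution.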

24
Intermediate Results 1600x1200
Solver | Time in ms (HD5870 / GTX480) | Memory in Megabytes
GDC2010 hybrid solver on GTX480 | ~ [Bavoil2010] | ~117 (guesstimate)
Standard Solver (already skips high-res abc construction) | | ~132
4-to-1 Reduction | | ~73

25
DX11 Memory Optimizations 1
Diagram: X (rgba16f) → reduce4 (abc computed on the fly) → … → stop at a reasonable size → solve for Y at that resolution → substitute … → substitute4 → Y (rgba16f)
Skip the abc construction pass and compute abc during the first reduction pass
Reduce 4-to-1 in a special first reduction pass
Substitute 1-to-4 in a special substitution pass

26
DX11 Memory Optimizations 1
Diagram: X (rgba16f) → reduce4 (abc computed on the fly) → … → stop at a reasonable size → solve for Y at that resolution → substitute … → substitute4 → Y (rgba16f)
Skip the abc construction pass and compute abc during the first reduction pass
Reduce 4-to-1 in a special first reduction pass
Substitute 1-to-4 in a special substitution pass
Pack abc and X into one rgba_uint surface

27
Using SM5 for data packing
Diagram: X (rgba16f) and abc (rgba32f) are packed into uint channels
The x and y channels of X are packed into one uint:
(f32tof16(X.x) + (f32tof16(X.y) << 16))

28
Using SM5 for data packing
Diagram: X (rgba16f) and abc (rgba32f) are packed into one uint that holds the lower 6 bits of the X.z channel and the upper 26 bits of the abc.x channel
(asuint(abc.x) & 0xFFFFFFC0) | (f32tof16(X.z) & 0x3F)
Steal the 6 lowest mantissa bits of abc.x to store some bits of X.z

29
Using SM5 for data packing
Diagram: X (rgba16f) and abc (rgba32f) are packed into one uint that holds the central 6 bits of the X.z channel and the upper 26 bits of the abc.y channel
(asuint(abc.y) & 0xFFFFFFC0) | ((f32tof16(X.z) >> 6) & 0x3F)
Steal the 6 lowest mantissa bits of abc.y to store some bits of X.z

30
SM5 Memory Optimizations 1
Diagram: X (rgba16f) and abc (rgba32f) are packed into one uint that holds the remaining upper 4 bits of the X.z channel and the upper 26 bits of the abc.z channel
(asuint(abc.z) & 0xFFFFFFC0) | ((f32tof16(X.z) >> 12) & 0x3F)
Steal the 6 lowest mantissa bits of abc.z to store some bits of X.z
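
The four packing expressions can be checked on the CPU with a small Python emulation. This is a sketch, not the shader: `f32tof16`, `f16tof32`, `asuint` and `asfloat` are stand-ins built on the `struct` module, whose half-float rounding may differ from the GPU intrinsic. The `unpack` function and the assignment of the four values to output channels are assumptions, since the slides show only the packing side. The masks follow the expressions on the slides: 0xFFFFFFC0 keeps the upper 26 bits and 0x3F selects 6 bits.

```python
import struct


def f32tof16(v):
    """Bit pattern of v converted to IEEE half precision."""
    return struct.unpack("<H", struct.pack("<e", v))[0]


def f16tof32(bits):
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]


def asuint(v):
    """Bit pattern of v as a 32-bit float."""
    return struct.unpack("<I", struct.pack("<f", v))[0]


def asfloat(bits):
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


KEEP26 = 0xFFFFFFC0  # upper 26 bits of a float32
LOW6 = 0x3F          # the 6 stolen mantissa bits


def pack(X, abc):
    """Pack X (three half-precision values) and abc (three floats) into 4 uints."""
    hx, hy, hz = (f32tof16(v) for v in X)
    return (
        hx | (hy << 16),  # same as the '+' on the slide: the bits do not overlap
        (asuint(abc[0]) & KEEP26) | (hz & LOW6),
        (asuint(abc[1]) & KEEP26) | ((hz >> 6) & LOW6),
        (asuint(abc[2]) & KEEP26) | ((hz >> 12) & LOW6),
    )


def unpack(p):
    """Inverse of pack (assumed layout): returns (X, abc)."""
    hz = (p[1] & LOW6) | ((p[2] & LOW6) << 6) | ((p[3] & LOW6) << 12)
    X = (f16tof32(p[0] & 0xFFFF), f16tof32(p[0] >> 16), f16tof32(hz))
    abc = tuple(asfloat(v & KEEP26) for v in p[1:])
    return X, abc
```

X survives the round trip at full half precision, while each abc component loses its 6 lowest mantissa bits, which corresponds to a relative error below 2^-17.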

31
Sample Screenshot

32
Abs(Packed-Unpacked) x 255.0f

33
DX11 Memory Optimizations 2
The solver does a horizontal and a vertical pass
The chain of lower-res RTs needs to be there twice
– Horizontal reduction/substitution chain
– Vertical reduction/substitution chain
How can DX11 help?

34
DX11 Memory Optimizations 2
UAVs allow us to reuse data of the horizontal chain for the vertical chain
A proof-of-concept implementation shows that this works nicely but impacts the runtime significantly
– ~40% lower fps
We stayed with RTs, as memory was already quite low
Use UAVs only if you are really concerned about memory

35
Final Results 1600x1200
Solver | Time in ms (HD5870 / GTX480) | Memory in Megabytes
GDC2010 hybrid solver on GTX480 | ~ [Bavoil2010] | ~117 (guesstimate)
Standard Solver (already skips high-res abc construction) | | ~132
4-to-1 Reduction | | ~73
4-to-1 Reduction + SM5 Packing | | ~58

36
Future Work
Look into CS acceleration of the solver
– 4-to-1 reduction pass
– 1-to-4 substitution pass
Look into using heat diffusion for other effects
– e.g. motion blur

37
Conclusion
The optimized CR solver is fast and memory-efficient
– Used in Dragon Age 2
– 4A Games is considering its use for new projects
– Detailed description in Game Engine Gems 2
Mail me if you want access to the …

38
References
[Kass2006] M. Kass, "Interactive depth of field using simulated diffusion on a GPU", Pixar Animation Studios, Pixar technical memo #06-01
[ZCO2010] Y. Zhang, J. Cohen, J. D. Owens, "Fast Tridiagonal Solvers on the GPU", PPoPP 2010
[RS2010] A. Rege, O. Shishkovtsov, "DX11 Effects in Metro 2033: The Last Refuge", GDC 2010
[Bavoil2010] L. Bavoil, "Modern Real-Time Rendering Techniques", FGO

39
Backup

40
Results 1920x…
Solver | Time in ms (HD5870 / GTX480) | Memory in Megabytes
Standard Solver (already skips high-res abc construction) | | ~158
4-to-1 Reduction | | ~88
4-to-1 Reduction + SM5 Packing | | ~70
