An Optimized Diffusion Depth Of Field Solver (DDOF)

Presentation transcript:

An Optimized Diffusion Depth Of Field Solver (DDOF) Holger Gruen – AMD 28th February 2011 AMD‘s Favorite Effects

Agenda: motivation; a high-level recap of DDOF; a recap of earlier DDOF solvers; a vanilla Cyclic Reduction (CR) DDOF solver; a DX11-optimized CR solver for DDOF; results.

Motivation: The solver presented at GDC 2010 [RS2010] has some weaknesses. It is a great implementation, but its memory requirements and runtime are too high for many game developers. We were looking for a faster and more memory-efficient solver.

Diffusion DOF recap 1: DDOF is an enhanced way of blurring a picture that takes an arbitrary CoC (circle of confusion) at each pixel into account. It interprets the input image as a heat distribution and uses the CoC at a pixel to derive a per-pixel heat conductivity.

Diffusion DOF recap 2: Blurring is done by time-stepping a differential equation that models the diffusion of heat. The ADI (alternating direction implicit) method is used to arrive at a separable solution for stepping. A tridiagonal linear system has to be solved for each row and then for each column of the input.

DDOF tridiagonal system [diagram]: the tridiagonal matrix is derived from the CoC at each pixel of an input row/column, the right-hand side is that row/column of the input image, and the unknown vector is the resulting blurred row/column.
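Written out, the system on this slide has the following generic shape. This is a sketch in the slide's own naming (abc for the three diagonals, X for the input, Y for the result); the index convention and the boundary condition a_1 = c_n = 0 are assumptions, and how a, b, c follow from the CoC is described in [Kass2006], not here.

```latex
a_i\,y_{i-1} + b_i\,y_i + c_i\,y_{i+1} = x_i,
\qquad i = 1,\dots,n, \qquad a_1 = c_n = 0

\begin{pmatrix}
b_1 & c_1 &        &         \\
a_2 & b_2 & c_2    &         \\
    & \ddots & \ddots & \ddots \\
    &     & a_n    & b_n
\end{pmatrix}
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
```

One such system is solved per image row in the horizontal pass and per image column in the vertical pass.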

Solver recap 1: The GDC 2010 solver [RS2010] is a 'hybrid' solver. It performs three PCR (parallel cyclic reduction) steps up front and then runs a serial 'sweep' algorithm to solve the small resulting systems. See [ZCO2010] for details on other hybrid solvers.

Solver recap 2: The GDC 2010 solver [RS2010] has drawbacks. It uses a large UAV as a read/write scratch pad to store the modified coefficients of the sweep algorithm, so GPUs without a read/write cache will suffer. At high resolutions, three PCR steps still leave tridiagonal systems of substantial size, which means the serial sweep algorithm runs on a 'big' system.
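The serial sweep and its modified coefficients can be sketched as follows. This assumes that 'sweep' refers to the standard Thomas algorithm (forward elimination followed by back substitution), which the slides do not spell out. The names `thomas_solve`, `cp` and `dp` are illustrative; `cp` and `dp` play the role of the per-element scratch data mentioned above.

```python
def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system serially.

    a, b, c are the lower, main and upper diagonals (a[0] and c[-1] are
    ignored as couplings to non-existent unknowns), d is the right-hand side.
    """
    n = len(d)
    cp = [0.0] * n  # modified upper diagonal (scratch storage)
    dp = [0.0] * n  # modified right-hand side (scratch storage)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    # Forward sweep: each step depends on the previous one, so it is serial.
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    # Backward sweep: also serial, in the opposite direction.
    y = [0.0] * n
    y[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        y[i] = dp[i] - cp[i] * y[i + 1]
    return y


if __name__ == "__main__":
    a = [0.0, -1.0, -1.0, -1.0]
    b = [3.0, 3.0, 3.0, 3.0]
    c = [-1.0, -1.0, -1.0, 0.0]
    d = [1.0, 2.0, 3.0, 9.0]
    print(thomas_solve(a, b, c, d))
```

Both loops carry a dependency from one element to the next, which is why the length of the system handed to the sweep matters so much on a GPU.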

Solver recap 3: The Cyclic Reduction (CR) solver was used by [Kass2006] in the original DDOF paper. It runs in two phases: a reduction phase and a backward substitution phase.
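A minimal CPU sketch of the two CR phases, assuming a diagonally dominant system stored as three Python lists. It keeps the odd-indexed equations at each level and eliminates the even-indexed unknowns; the function name `cyclic_reduction` and the recursive structure are illustrative and are not taken from the presented shaders.

```python
def cyclic_reduction(a, b, c, d):
    """Solve a tridiagonal system with cyclic reduction.

    a, b, c are the lower, main and upper diagonals (a[0] == c[-1] == 0),
    d is the right-hand side.
    """
    n = len(d)
    if n == 1:
        return [d[0] / b[0]]

    # Reduction phase: fold equations i-1 and i+1 into each odd equation i.
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        k1 = a[i] / b[i - 1]
        if i + 1 < n:
            k2 = c[i] / b[i + 1]
            a_next, c_next, d_next = a[i + 1], c[i + 1], d[i + 1]
        else:
            k2, a_next, c_next, d_next = 0.0, 0.0, 0.0, 0.0
        ra.append(-a[i - 1] * k1)
        rb.append(b[i] - c[i - 1] * k1 - a_next * k2)
        rc.append(-c_next * k2)
        rd.append(d[i] - d[i - 1] * k1 - d_next * k2)

    # Solve the half-size system for the odd-indexed unknowns.
    odd = cyclic_reduction(ra, rb, rc, rd)

    # Backward substitution phase: recover the even-indexed unknowns.
    y = [0.0] * n
    y[1::2] = odd
    for i in range(0, n, 2):
        left = a[i] * y[i - 1] if i > 0 else 0.0
        right = c[i] * y[i + 1] if i + 1 < n else 0.0
        y[i] = (d[i] - left - right) / b[i]
    return y


if __name__ == "__main__":
    n = 7
    a = [0.0] + [-1.0] * (n - 1)
    b = [3.0] * n
    c = [-1.0] * (n - 1) + [0.0]
    d = [1.0] * n
    print(cyclic_reduction(a, b, c, d))
```

Within one level, every iteration of the reduction loop and of the substitution loop is independent of the others, so each level maps to one parallel pass. The number of independent iterations halves per level, though, which is the loss of parallelism near the smallest level.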

Solver recap 4: According to [ZCO2010], the CR solver has the lowest computational complexity of all solvers. It suffers from a lack of parallelism, though, at the end of the reduction phase and at the start of the backward substitution phase.

Passes of a Vanilla CR Solver [diagram]: Pass 1 constructs abc from the CoC. Reduce passes then shrink the input image X and abc step by step, stopping at size 1, where the first y is solved for. Substitute passes work back up to the blurred image Y.

Vanilla Solver Results: Higher performance than reported in [Bavoil2010] (~6 ms vs. ~8 ms at 1600x1200). The memory footprint is prohibitively high (>200 MB at 1600x1200). The lack of parallelism still needs to be tackled; an answer is given in [ZCO2010].

Vanilla CR Solver [diagram]: Same passes as on the previous diagram. Reducing all the way down to size 1 and solving for the first y there is what kills parallelism.

Keeping the parallelism high [diagram]: Stop reducing at a reasonable size and solve for Y at that resolution, so that the parallel workload stays big enough (e.g. using PCR, see [ZCO2010]). The substitute passes then work back up to the blurred image Y as before.
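The idea of stopping early can be sketched on the CPU as follows: reduce with CR until the system is at most `stop_size` long, solve that remaining system with PCR, then substitute back up. This is an illustration under assumptions, not the presented implementation; `stop_size`, `reduce_once`, `substitute_once`, `pcr_solve` and `hybrid_solve` are hypothetical names, and the system is assumed to be diagonally dominant with a[0] == c[-1] == 0.

```python
def reduce_once(a, b, c, d):
    """One CR reduction level: keep odd equations, eliminate even unknowns."""
    n = len(d)
    ra, rb, rc, rd = [], [], [], []
    for i in range(1, n, 2):
        k1 = a[i] / b[i - 1]
        if i + 1 < n:
            k2 = c[i] / b[i + 1]
            a_next, c_next, d_next = a[i + 1], c[i + 1], d[i + 1]
        else:
            k2, a_next, c_next, d_next = 0.0, 0.0, 0.0, 0.0
        ra.append(-a[i - 1] * k1)
        rb.append(b[i] - c[i - 1] * k1 - a_next * k2)
        rc.append(-c_next * k2)
        rd.append(d[i] - d[i - 1] * k1 - d_next * k2)
    return ra, rb, rc, rd


def substitute_once(a, b, c, d, odd):
    """One CR substitution level: recover even unknowns from the odd ones."""
    n = len(d)
    y = [0.0] * n
    y[1::2] = odd
    for i in range(0, n, 2):
        left = a[i] * y[i - 1] if i > 0 else 0.0
        right = c[i] * y[i + 1] if i + 1 < n else 0.0
        y[i] = (d[i] - left - right) / b[i]
    return y


def pcr_solve(a, b, c, d):
    """Parallel cyclic reduction: every equation is updated at every step."""
    n = len(d)
    a, b, c, d = list(a), list(b), list(c), list(d)
    stride = 1
    while stride < n:
        na, nb, nc, nd = a[:], b[:], c[:], d[:]
        for i in range(n):  # all n iterations are independent of each other
            lo, hi = i - stride, i + stride
            k1 = a[i] / b[lo] if lo >= 0 else 0.0
            k2 = c[i] / b[hi] if hi < n else 0.0
            na[i] = -a[lo] * k1 if lo >= 0 else 0.0
            nc[i] = -c[hi] * k2 if hi < n else 0.0
            nb[i] = (b[i]
                     - (c[lo] * k1 if lo >= 0 else 0.0)
                     - (a[hi] * k2 if hi < n else 0.0))
            nd[i] = (d[i]
                     - (d[lo] * k1 if lo >= 0 else 0.0)
                     - (d[hi] * k2 if hi < n else 0.0))
        a, b, c, d = na, nb, nc, nd
        stride *= 2
    return [d[i] / b[i] for i in range(n)]


def hybrid_solve(a, b, c, d, stop_size=8):
    """CR reduction down to stop_size, PCR at that size, CR substitution up."""
    if len(d) <= stop_size:
        return pcr_solve(a, b, c, d)
    reduced = reduce_once(a, b, c, d)
    odd = hybrid_solve(*reduced, stop_size=stop_size)
    return substitute_once(a, b, c, d, odd)
```

PCR does more arithmetic than CR on the same system, but it keeps every equation busy at every step, which is the trade made here for the small systems at the bottom of the chain.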

Memory Optimizations 1 [diagram]: The same pass chain, with reduction stopping at a reasonable size and Y solved at that resolution.

Memory Optimizations 1 [diagram]: Surface formats in the pass chain. The X, abc and Y surfaces are all rgba32f.

Memory Optimizations 1 [diagram]: The X and Y surfaces switch from rgba32f to rgba16f, while the abc surfaces stay rgba32f. This saves a significant amount of memory. We found no artifacts when going from rgba32f to rgba16f.
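As a rough sense of scale, the per-surface cost of the two formats can be computed directly. This is back-of-the-envelope arithmetic only: it assumes 16 bytes per texel for rgba32f and 8 for rgba16f, and a chain whose width halves at each reduction level. It does not try to reproduce the totals quoted in the slides, and the helper names are hypothetical.

```python
BYTES_PER_TEXEL = {"rgba32f": 16, "rgba16f": 8}


def surface_bytes(width, height, fmt):
    """Size in bytes of one surface of the given format."""
    return width * height * BYTES_PER_TEXEL[fmt]


def chain_bytes(width, height, fmt, levels):
    """Total bytes of a chain whose width halves (rounding up) per level."""
    total = 0
    for _ in range(levels):
        total += surface_bytes(width, height, fmt)
        width = (width + 1) // 2
    return total


if __name__ == "__main__":
    for fmt in ("rgba32f", "rgba16f"):
        mb = surface_bytes(1600, 1200, fmt) / 1e6
        print(f"{fmt}: {mb:.2f} MB per full-resolution surface")
```

At 1600x1200 a full-resolution rgba32f surface is 30.72 MB and the rgba16f version is half of that, so halving the format of X and Y removes a large share of the footprint at every level of the chain.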

Memory Optimizations 2 [diagram]: Same chain and formats (X and Y in rgba16f, abc in rgba32f). Dropping the biggest surface used by the solver again saves a significant amount of memory.

Memory Optimizations 2 [diagram]: Skip the abc construction pass and compute abc on the fly during the first reduction pass.

Intermediate Results (1600x1200)

Solver                                                      HD5870 (ms)   GTX480 (ms)         Memory (MB)
GDC2010 hybrid solver on GTX480                             ~8.5          8.00 [Bavoil2010]   ~117 (guesstimate)
Standard Solver (already skips high res abc construction)   3.66          3.33                ~132

Memory Optimizations 3 [diagram]: Starting from the chain that skips the abc construction pass and computes abc during the first reduction pass, more surfaces can be removed from the chain. Yet again this saves a significant amount of memory.

Memory Optimizations 3 [diagram]: Reduce 4-to-1 in a special first reduction pass (reduce4) and substitute 1-to-4 in a special substitution pass (substitute4). As before, the abc construction pass is skipped and abc is computed during the first reduction pass; reduction stops at a reasonable size and Y is solved at that resolution.

Intermediate Results (1600x1200)

Solver                                                      HD5870 (ms)   GTX480 (ms)         Memory (MB)
GDC2010 hybrid solver on GTX480                             ~8.5          8.00 [Bavoil2010]   ~117 (guesstimate)
Standard Solver (already skips high res abc construction)   3.66          3.33                ~132
4-to-1 Reduction                                            2.87          3.32                ~73

DX11 Memory Optimizations 1 [diagram]: The starting point is the 4-to-1 chain: X and Y in rgba16f, abc computed during the special first reduction pass (reduce4), and a special 1-to-4 substitution pass (substitute4).

DX11 Memory Optimizations 1 [diagram]: Pack abc and X into one rgba_uint surface. The rest of the 4-to-1 chain is unchanged.

Using SM5 for data packing: The x and y channels of X (rgba16f) are packed into the first uint: (f32tof16(X.x) + (f32tof16(X.y) << 16)). The remaining three uints hold abc (rgba32f).

Using SM5 for data packing: The second uint holds the higher 26 bits of the abc x channel and the lowest 6 bits of the X z channel: (asuint(abc.x) & 0xFFFFFFC0) | (f32tof16(X.z) & 0x3F). This steals the 6 lowest mantissa bits of abc.x to store some bits of X.z.

Using SM5 for data packing: The third uint holds the higher 26 bits of the abc y channel and the central 6 bits of the X z channel: (asuint(abc.y) & 0xFFFFFFC0) | ((f32tof16(X.z) >> 6) & 0x3F). This steals the 6 lowest mantissa bits of abc.y to store some bits of X.z.

SM5 Memory Optimizations 1: The fourth uint holds the higher 26 bits of the abc z channel and the remaining highest bits (bits 12 to 15) of the X z channel: (asuint(abc.z) & 0xFFFFFFC0) | ((f32tof16(X.z) >> 12) & 0x3F). This steals the 6 lowest mantissa bits of abc.z to store some bits of X.z.
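The packing expressions from these slides can be checked on the CPU with a small emulation. This is a sketch under assumptions: `f32tof16`, `f16tof32`, `asuint` and `asfloat` here are Python stand-ins built on `struct` for the HLSL intrinsics of the same names (rounding may differ from a given GPU), and `unpack` is an assumed inverse that the slides do not show.

```python
import struct


def f32tof16(x):
    """Bit pattern of x converted to an IEEE half, as a 16-bit integer."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def f16tof32(h):
    """Value of the IEEE half stored in the low 16 bits of h."""
    return struct.unpack("<e", struct.pack("<H", h & 0xFFFF))[0]


def asuint(x):
    """Bit pattern of x as a 32-bit float, as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def asfloat(u):
    """Value of the 32-bit float whose bit pattern is u."""
    return struct.unpack("<f", struct.pack("<I", u & 0xFFFFFFFF))[0]


def pack(X, abc):
    """Pack X = (x, y, z) and abc = (a, b, c) into four 32-bit uints."""
    xz = f32tof16(X[2])
    return (
        f32tof16(X[0]) + (f32tof16(X[1]) << 16),
        (asuint(abc[0]) & 0xFFFFFFC0) | (xz & 0x3F),
        (asuint(abc[1]) & 0xFFFFFFC0) | ((xz >> 6) & 0x3F),
        (asuint(abc[2]) & 0xFFFFFFC0) | ((xz >> 12) & 0x3F),
    )


def unpack(p):
    """Assumed inverse of pack: returns (X, abc)."""
    xz = (p[1] & 0x3F) | ((p[2] & 0x3F) << 6) | ((p[3] & 0x3F) << 12)
    X = (f16tof32(p[0] & 0xFFFF), f16tof32(p[0] >> 16), f16tof32(xz))
    # The stolen mantissa bits are cleared, so abc loses a little precision.
    abc = tuple(asfloat(w & 0xFFFFFFC0) for w in p[1:])
    return X, abc


if __name__ == "__main__":
    packed = pack((0.25, 0.5, 0.75), (-0.3, 1.6, -0.3))
    print([hex(w) for w in packed])
    print(unpack(packed))
```

X survives the round trip at half precision, while each abc channel keeps 17 of its 23 stored mantissa bits, which corresponds to a relative error below roughly 1e-5.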

Sample Screenshot [image]

Abs(Packed-Unpacked) x 255.0f [difference image]

DX11 Memory Optimizations 2: The solver does a horizontal and a vertical pass, so the chain of lower-resolution RTs needs to exist twice: once as the horizontal reduction/substitution chain and once as the vertical reduction/substitution chain. How can DX11 help?

DX11 Memory Optimizations 2: UAVs allow us to reuse the data of the horizontal chain for the vertical chain. A proof-of-concept implementation shows that this works nicely but impacts the runtime significantly (~40% lower fps). We stayed with RTs, as memory use was already quite low. Use UAVs only if you are really concerned about memory.

Final Results (1600x1200)

Solver                                                      HD5870 (ms)   GTX480 (ms)         Memory (MB)
GDC2010 hybrid solver on GTX480                             ~8.5          8.00 [Bavoil2010]   ~117 (guesstimate)
Standard Solver (already skips high res abc construction)   3.66          3.33                ~132
4-to-1 Reduction                                            2.87          3.32                ~73
4-to-1 Reduction + SM5 Packing                              2.75          3.14                ~58

Future Work: Look into CS acceleration of the solver (4-to-1 reduction pass, 1-to-4 substitution pass). Look into using heat diffusion for other effects, e.g. motion blur.

Conclusion: The optimized CR solver is fast and memory-efficient. It is used in Dragon Age 2, and 4A Games is considering its use for new projects. A detailed description is in 'Game Engine Gems 2'. Mail me (holger.gruen@amd.com) if you want access to the sources.

References:
[Kass2006] "Interactive depth of field using simulated diffusion on a GPU", Michael Kass, Pixar Animation Studios, Pixar technical memo #06-01
[ZCO2010] "Fast Tridiagonal Solvers on the GPU", Y. Zhang, J. Cohen, J. D. Owens, PPoPP 2010
[RS2010] "DX11 Effects in Metro 2033: The Last Refuge", A. Rege, O. Shishkovtsov, GDC 2010
[Bavoil2010] "Modern Real-Time Rendering Techniques", L. Bavoil, FGO2010

Backup

Results (1920x1200)

Solver                                                      HD5870 (ms)   GTX480 (ms)   Memory (MB)
Standard Solver (already skips high res abc construction)   4.31          4.03          ~158
4-to-1 Reduction                                            3.36          4.02          ~88
4-to-1 Reduction + SM5 Packing                              3.23          3.79          ~70