Published by Meaghan Tift. Modified over 10 years ago.
2
DirectCompute Accelerated Separable Filtering
28th February 2011, AMD's Favorite Effects
3
Separable Filters
Much faster than executing a box filter
Classically performed by the Pixel Shader
Consists of a horizontal and vertical pass
Source image over-sampling increases with kernel size
  - Shader is usually TEX instruction limited
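The two-pass idea can be checked on the CPU. The following is a minimal Python sketch, not shader code: plain lists stand in for render targets, edges are clamped, and all function names are made up for illustration. It shows that a horizontal pass followed by a vertical pass reproduces a full 2D box filter while fetching far fewer texels per pixel.

```python
def clamp(i, n):
    """Clamp an index to [0, n - 1], mimicking clamp-to-edge addressing."""
    return min(max(i, 0), n - 1)

def box_blur_2d(img, radius):
    """Reference 2D box filter: (2r + 1)^2 fetches per pixel."""
    h, w = len(img), len(img[0])
    d = 2 * radius + 1
    return [[sum(img[clamp(y + dy, h)][clamp(x + dx, w)]
                 for dy in range(-radius, radius + 1)
                 for dx in range(-radius, radius + 1)) / (d * d)
             for x in range(w)] for y in range(h)]

def blur_rows(img, radius):
    """One 1D pass along each row: 2r + 1 fetches per pixel."""
    w = len(img[0])
    d = 2 * radius + 1
    return [[sum(row[clamp(x + dx, w)] for dx in range(-radius, radius + 1)) / d
             for x in range(w)] for row in img]

def transpose(img):
    return [list(col) for col in zip(*img)]

def box_blur_separable(img, radius):
    """Horizontal pass into an intermediate image, then a vertical pass."""
    intermediate = blur_rows(img, radius)
    return transpose(blur_rows(transpose(intermediate), radius))

def taps_per_pixel(radius):
    """Fetch counts per pixel as (full 2D filter, two separable passes)."""
    d = 2 * radius + 1
    return d * d, 2 * d
```

For a kernel radius of 3 the fetch count drops from 49 to 14 per pixel, and the gap widens as the kernel grows.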
4
Separable? Who Cares
In many cases developers use this technique even though the filter may not actually be separable
  - Results are often still acceptable
  - Much faster than performing a real box filter
  - Accelerates many bilateral cases
5
Typical Pipeline Steps
Source RT -> Horizontal Pass -> Intermediate RT -> Vertical Pass -> Destination RT
6
Use Bilinear HW filtering?
Bilinear filter HW can halve the number of ALU and TEX instructions
  - Just need to compute the correct sampling offsets
Not possible with more advanced filters
  - Usually because weighting is a dynamic operation
  - Think about bilateral cases...
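One common way to compute those sampling offsets is to merge each pair of adjacent taps into a single fetch placed at their weight-weighted position. The sketch below is a Python model of that arithmetic, assuming fixed positive weights and taps exactly one texel apart; `merge_taps` and `sample_linear` are illustrative names, and `sample_linear` only imitates what the filtering hardware would do along one axis.

```python
def merge_taps(offsets, weights):
    """Merge adjacent tap pairs into (offset, weight) bilinear fetches.

    Assumes offsets are consecutive integers and weights are positive.
    A leftover tap (odd tap count) is kept as a plain fetch.
    """
    merged = []
    for i in range(0, len(weights) - 1, 2):
        w = weights[i] + weights[i + 1]
        o = (offsets[i] * weights[i] + offsets[i + 1] * weights[i + 1]) / w
        merged.append((o, w))
    if len(weights) % 2:
        merged.append((float(offsets[-1]), weights[-1]))
    return merged

def sample_linear(row, x):
    """Linear interpolation between the two nearest texels, clamped."""
    x = min(max(x, 0.0), len(row) - 1.0)
    x0 = int(x)
    x1 = min(x0 + 1, len(row) - 1)
    t = x - x0
    return row[x0] * (1.0 - t) + row[x1] * t

def filter_point(row, center, offsets, weights):
    """Reference result: one point fetch per tap."""
    return sum(w * sample_linear(row, center + o) for o, w in zip(offsets, weights))

def filter_merged(row, center, offsets, weights):
    """Same result using the merged fetches."""
    return sum(w * sample_linear(row, center + o) for o, w in merge_taps(offsets, weights))
```

This only works because the weights are known ahead of time. When a weight depends on the fetched data, as in the bilateral case, the pair cannot be collapsed into one filtered fetch.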
7
Where to start with DirectCompute
Is the Pixel Shader version TEX or ALU limited?
  - You need to know what to optimize for!
  - Use IHV tools to establish this
Achieving peak performance is not easy, so write a highly configurable kernel
  - Will allow you to easily experiment and fine tune
8
Thread Group Shared Memory (TGSM)
TGSM can be used to reduce TEX ops
TGSM can also be used to cache results
  - Thus saving ALU ops too
Load a sensible run length
  - Base this on HW wavefront/warp size (AMD = 64, NVIDIA = 32)
  - Choose a good common factor (multiples of 64)
9
Kernel #1
Redundant compute threads
128 threads load 128 texels
128 - (Kernel Radius * 2) threads compute results
10
Avoid Redundant Threads
Should ensure that all threads in a group have useful work to do, wherever possible
Redundant threads will not be reassigned work from another group
The Kernel #1 layout involves a lot of redundancy for a large kernel diameter
11
Kernel #2
No redundant compute threads
128 threads load 128 texels
Kernel Radius * 2 threads load 1 extra texel each
128 threads compute results
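The Kernel #2 layout can be modelled on the CPU to confirm the indexing. This is a hedged Python sketch rather than HLSL: the inner loops stand in for the threads of one group, the `shared` list stands in for TGSM, a box filter is used for simplicity, and every name is an assumption for illustration.

```python
def clamp(i, n):
    return min(max(i, 0), n - 1)

def filter_direct(row, radius):
    """Reference 1D box filter with clamped edges."""
    n, d = len(row), 2 * radius + 1
    return [sum(row[clamp(x + dx, n)] for dx in range(-radius, radius + 1)) / d
            for x in range(n)]

def filter_kernel2(row, radius, group_size=128):
    """Model of Kernel #2: every thread loads one texel and computes one result;
    the first 2 * radius threads also load one extra texel each."""
    assert 2 * radius <= group_size
    n, d = len(row), 2 * radius + 1
    out = [0.0] * n
    for base in range(0, n, group_size):            # one iteration per thread group
        shared = [0.0] * (group_size + 2 * radius)  # stand-in for TGSM
        for t in range(group_size):                 # load phase
            shared[t] = row[clamp(base - radius + t, n)]
            if t < 2 * radius:                      # extra apron texel
                shared[group_size + t] = row[clamp(base - radius + group_size + t, n)]
        # a group-wide barrier would sit here in a real compute shader
        for t in range(group_size):                 # compute phase: no idle threads
            x = base + t
            if x < n:
                out[x] = sum(shared[t:t + d]) / d
    return out

def kernel1_compute_threads(group_size, radius):
    """Kernel #1 for comparison: apron threads load a texel but compute nothing."""
    return group_size - 2 * radius
```

With a radius of 16, Kernel #1 would leave 32 of its 128 threads idle during the compute phase, while Kernel #2 keeps all 128 busy at the cost of 32 extra loads.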
12
Multiple Pixels per Thread
Allows for natural vectorization
  - 4 works well on AMD HW
  - Doesn't hurt performance on scalar HW
Possible to cache TGSM reads in General Purpose Registers (GPRs)
  - Quartering TGSM reads - absolute winner!!
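The saving from caching TGSM reads in registers can be counted directly. The Python sketch below is a model under stated assumptions: `CountingTGSM` is a made-up wrapper that counts reads, a local list plays the role of the GPRs, and a box filter stands in for the real kernel.

```python
class CountingTGSM:
    """List wrapper that counts element reads (illustrative only)."""
    def __init__(self, data):
        self.data = list(data)
        self.reads = 0

    def __getitem__(self, i):
        self.reads += 1
        return self.data[i]

def compute_naive(tgsm, first, radius, pixels=4):
    """Each output pixel re-reads its whole window from TGSM."""
    d = 2 * radius + 1
    return [sum(tgsm[first + p + k] for k in range(d)) / d for p in range(pixels)]

def compute_cached(tgsm, first, radius, pixels=4):
    """Read the shared window once into 'registers', then reuse it for all pixels."""
    d = 2 * radius + 1
    regs = [tgsm[first + k] for k in range(pixels + 2 * radius)]
    return [sum(regs[p:p + d]) / d for p in range(pixels)]
```

For 4 pixels per thread the naive version issues 4 * (2 * radius + 1) reads, while the cached version issues 4 + 2 * radius, so the ratio approaches 4 as the radius grows.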
13
Kernel #3
Compute threads not a multiple of 64
32 threads load 128 texels
Kernel Radius * 2 threads load 1 extra texel each
32 threads compute 128 results
14
Multiple Lines per Thread Group
Process multiple lines per thread group
  - Better than one long line
  - 2 or 4 works well
Improved texture cache efficiency
Compute threads back to a multiple of 64
15
Kernel #4
64 threads load 256 texels
Kernel Radius * 4 threads load 1 extra texel each
64 threads compute 256 results
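The thread counts quoted for Kernel #3 and Kernel #4 follow from a little arithmetic, sketched here in Python. The reading that Kernel #4 processes 2 lines of 128 texels with 4 pixels per thread is inferred from the numbers on the slides, not stated by them, and `group_config` is a hypothetical helper.

```python
WAVEFRONT_AMD = 64  # wavefront size quoted in the slides
WARP_NVIDIA = 32    # warp size quoted in the slides

def group_config(lines, texels_per_line, pixels_per_thread, radius):
    """Derive thread-group numbers for a configurable filter kernel."""
    assert texels_per_line % pixels_per_thread == 0
    threads = lines * texels_per_line // pixels_per_thread
    return {
        "threads": threads,
        "results": lines * texels_per_line,
        "extra_loads": lines * 2 * radius,  # one apron on each side of every line
        "fills_amd_wavefronts": threads % WAVEFRONT_AMD == 0,
    }

# Kernel #3: one line, 4 pixels per thread
kernel3 = group_config(lines=1, texels_per_line=128, pixels_per_thread=4, radius=8)
# Kernel #4: assumed to be two such lines per group
kernel4 = group_config(lines=2, texels_per_line=128, pixels_per_thread=4, radius=8)
```

Keeping these values as parameters is one way to get the "highly configurable kernel" the deck recommends, since line count and pixels per thread can then be tuned per GPU.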
16
Kernel Diameter
Kernel diameter needs to be > 7 to see a DirectCompute win
  - Otherwise the overhead cancels out the advantage
The larger the kernel diameter the greater the win
17
Use Packing in TGSM
Use packing to reduce storage space required in TGSM
  - Only have 32k per SIMD
Reduces reads/writes from TGSM
Often a uint is sufficient for color filtering
Use SM5.0 instructions f32tof16(), f16tof32()
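The packing idea can be tried out in Python. The sketch below uses the standard `struct` module's half-float format as a stand-in for the HLSL intrinsics, so rounding and out-of-range behaviour may differ from the GPU. The Python function names mirror the intrinsics for readability only, and `pack_rgba8` is an assumed layout for the "one uint per color" case.

```python
import struct

def f32tof16(x):
    """Bit pattern of x as a 16-bit half float (Python stand-in, not the HLSL intrinsic)."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def f16tof32(bits):
    """Decode the low 16 bits as a half float."""
    return struct.unpack("<e", struct.pack("<H", bits & 0xFFFF))[0]

def pack_two_halves(a, b):
    """Store two values in one 32-bit word: a in the low half, b in the high half."""
    return f32tof16(a) | (f32tof16(b) << 16)

def unpack_two_halves(u):
    return f16tof32(u), f16tof32(u >> 16)

def pack_rgba8(r, g, b, a):
    """Quantize four [0, 1] channels to 8 bits each and pack them into one uint."""
    def q(c):
        return int(min(max(c, 0.0), 1.0) * 255.0 + 0.5)
    return q(r) | (q(g) << 8) | (q(b) << 16) | (q(a) << 24)

def unpack_rgba8(u):
    return tuple(((u >> shift) & 0xFF) / 255.0 for shift in (0, 8, 16, 24))
```

An RGBA value stored as four 32-bit floats takes 16 bytes of TGSM. Two packed-half words take 8 bytes and a single 8-bit-per-channel uint takes 4, which also cuts the number of TGSM reads and writes.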
18
High Definition Ambient Occlusion
Depth + Normals -> HDAO buffer
Original Scene * HDAO buffer = Final Scene
19
Perform at Half Resolution
HDAO at full resolution is expensive
Running at half resolution captures more occlusion
  - and is obviously much faster
Problem: Artifacts are introduced when combined with the full resolution scene
20
Bilateral Dilate & Blur
HDAO buffer doesn't match with scene
A bilateral dilate & blur fixes the issue
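The slides do not spell out the bilateral weighting, so the following Python sketch is only one plausible form of the blur half: a 1D pass in which a tap contributes only when its depth is close to the depth of the center pixel. The depth test, the `depth_threshold` parameter and the function name are all assumptions, and the dilate step is not modelled.

```python
def bilateral_blur_1d(values, depths, radius, depth_threshold):
    """Depth-aware 1D blur: taps across a depth discontinuity get zero weight.

    Hypothetical weighting, not the deck's actual shader.
    """
    n = len(values)
    out = []
    for x in range(n):
        total = 0.0
        weight_sum = 0.0
        for dx in range(-radius, radius + 1):
            sx = min(max(x + dx, 0), n - 1)
            # The weight depends on fetched data, so it cannot be precomputed
            w = 1.0 if abs(depths[sx] - depths[x]) <= depth_threshold else 0.0
            total += w * values[sx]
            weight_sum += w
        out.append(total / weight_sum)  # center tap always passes, so weight_sum >= 1
    return out
```

Because each weight is decided per tap at run time, this kind of filter cannot use the bilinear tap-merging trick, which is part of why it benefits from the TGSM approach instead.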
21
New Pipeline...
Diagram labels: ½ Res, Horizontal Pass, Intermediate UAV, Vertical Pass, Dilated & Blurred, Bilinear Upsample
Still much faster than performing at full res!
22
Pixel Shader vs DirectCompute
*Tested on a range of AMD and NVIDIA DX11 HW, DirectCompute is between ~2.53x and ~3.17x faster than the Pixel Shader
23
Depth of Field
Many techniques exist to solve this problem
A common technique is to figure out how blurry a pixel should be
  - Often called the Circle of Confusion (CoC)
A Gaussian blur weighted by CoC is a pretty efficient way to implement this effect
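How the CoC enters the Gaussian is not specified in the slides. As a rough illustration only, the Python sketch below assumes a per-pixel CoC in [0, 1] that scales the Gaussian sigma of a 1D pass, so in-focus pixels (CoC = 0) pass through untouched. The sigma mapping and all names are hypothetical.

```python
import math

def dof_blur_1d(colors, coc, max_radius):
    """1D Gaussian pass whose width is driven by the center pixel's CoC.

    Assumed mapping: sigma = coc * max_radius / 2. Not the deck's shader.
    """
    n = len(colors)
    out = []
    for x in range(n):
        sigma = coc[x] * max_radius / 2.0
        if sigma <= 0.0:
            out.append(colors[x])  # fully in focus: no blur
            continue
        total = 0.0
        weight_sum = 0.0
        for dx in range(-max_radius, max_radius + 1):
            sx = min(max(x + dx, 0), n - 1)
            w = math.exp(-(dx * dx) / (2.0 * sigma * sigma))
            total += w * colors[sx]
            weight_sum += w
        out.append(total / weight_sum)
    return out
```

Running this once along rows and once along columns would mirror the horizontal and vertical passes of the pipeline, though with data-dependent weights the result is only an approximation of a true 2D filter.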
24
The Pipeline...
Diagram labels: CoC, Horizontal Pass, Intermediate UAV, Vertical Pass
25
Shogun 2: DoF OFF
26
Shogun 2: DoF ON
27
Pixel Shader vs DirectCompute
*Tested on a range of AMD and NVIDIA DX11 HW, DirectCompute is between ~1.48x and ~1.86x faster than the Pixel Shader
28
Summary
DirectCompute greatly accelerates larger kernel diameter filters
Allows for filtering at full resolution
For access to source code:
  - HDAO11: jon.story@amd.com
  - DoF11: nicolas.thibieroz@amd.com
29
Questions?
takahiro.harada@amd.com
holger.gruen@amd.com
jon.story@amd.com
Please fill in the feedback forms!