D3D12 A NEW MEANING FOR EFFICIENCY AND PERFORMANCE
DAVE OLDCORN, AMD; STEPHAN HODES, AMD; MAX MCMULLEN, MICROSOFT; DAN BAKER, OXIDE
5TH MARCH 2015

1 D3D12 A NEW MEANING FOR EFFICIENCY AND PERFORMANCE
DAVE OLDCORN, AMD; STEPHAN HODES, AMD; MAX MCMULLEN, MICROSOFT; DAN BAKER, OXIDE
5TH MARCH 2015

2 D3D11 to D3D12

3 | D3D12 A NEW MEANING FOR EFFICIENCY AND PERFORMANCE | GDC | MARCH 5TH
WHAT HASN’T CHANGED
• D3D12 is primarily a software change
• Hardware programming model is still the same
  ‒ Few new rendering features

4 WHAT HAS CHANGED
• The software model has changed a lot
• Not just in the API, but also in the underlying philosophy
  ‒ Closer to the hardware
  ‒ Give more control to the application

5 APPLICATION IS ARBITER OF CORRECT RENDERING
• Trades off safety for power
  ‒ If D3D11 is Javascript, D3D12 is C++
• Large areas of undefined
  ‒ ... where behaviour will change with future GPUs
• Use the debug layer
• Stay away from the corners, don’t take risks
  ‒ Expect “morality guides”
  ‒ ... once we know what people keep doing wrong

6 BROAD STROKE CHANGES
D3D11 | D3D12
Sequential API | Queues, Command Lists
Small state blocks | State object for pipeline
Resource binding: individual objects | Resource binding: tables
Automatic synchronisation, driver tracks resource state | Manual synchronisation, app must avoid overwrites
Implicit memory management by OS & driver | Explicit memory management by application

7 New in D3D12

8 COMMAND LISTS
• Each command list is executed strictly sequentially
• Command lists can call out to second-level command lists (“bundles”)
  ‒ Some restrictions on bundles
  ‒ Replaying bundles is OK
• Top level command lists can be replayed too
  ‒ But not until the previous submit has retired
• Size them right
  ‒ 100s of draws for direct lists; 10+ draws for a bundle

9 COMMAND LISTS ENABLE CPU SIDE THREADING
• Command lists can be built on arbitrary threads
  ‒ And very quickly too
• Submit is thread-safe
  ‒ Submit in batches
• Consider task oriented engines
  ‒ Divide rendering into tasks
  ‒ Run CPU tasks to build command lists
  ‒ Use dependencies to order GPU submission
  ‒ Also helps with resource barriers
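The task-oriented pattern on this slide can be sketched without touching the real API. The following is a minimal model, not D3D12 code: `RecordedList`, `recordInParallel` and `submitBatch` are hypothetical names, a "command list" is just a vector of strings, and one `std::thread` per task stands in for an engine's job system. It shows each task recording into its own list with no locking, followed by one batched submit whose list order fixes the GPU order.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a recorded command list: a list of draw names.
struct RecordedList {
    std::vector<std::string> draws;
};

// Each task records into its own list, so recording needs no locks.
inline std::vector<RecordedList> recordInParallel(std::size_t taskCount,
                                                  std::size_t drawsPerTask) {
    std::vector<RecordedList> lists(taskCount);
    std::vector<std::thread> workers;
    workers.reserve(taskCount);
    for (std::size_t t = 0; t < taskCount; ++t) {
        workers.emplace_back([&lists, t, drawsPerTask] {
            for (std::size_t d = 0; d < drawsPerTask; ++d) {
                lists[t].draws.push_back("task" + std::to_string(t) +
                                         "_draw" + std::to_string(d));
            }
        });
    }
    for (std::thread& w : workers) w.join();  // all recording done before submit
    return lists;
}

// One batched "submit": lists are consumed in index order, so the order of
// the batch (not the order in which threads finished) defines GPU order.
inline std::vector<std::string> submitBatch(const std::vector<RecordedList>& lists) {
    std::vector<std::string> gpuOrder;
    for (const RecordedList& list : lists) {
        gpuOrder.insert(gpuOrder.end(), list.draws.begin(), list.draws.end());
    }
    return gpuOrder;
}
```

Calling `submitBatch(recordInParallel(4, 100))` yields 400 draws in task order regardless of how the threads were scheduled, which is the property the dependency-ordered submission relies on.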

10 ALLOCATOR AND LIST MEMORY MANAGEMENT
• Lists / Allocators manage memory
  ‒ Hang on to their resources when reset
  ‒ Must be destroyed to fully release memory
  ‒ Reuse lists / allocators on ‘similar’ data
  ‒ Destroy if data is very dissimilar
  ‒ Don’t use pool of lists / allocators for all possible uses
[Figure: list / allocator memory usage across resets. Labels: initial 100 draws; reset; same 100 draws (guaranteed no new allocations); 200 draws; different 100 draws; 5 draws]
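The retention behaviour described here can be illustrated with a toy high-water-mark model. This is a sketch under stated assumptions, not how any driver is implemented: `AllocatorModel` is a hypothetical class, and it assumes every draw costs the same number of bytes, which a real "different 100 draws" workload would not guarantee.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of a list / allocator that keeps its memory across Reset.
// Assumption: every draw consumes the same number of bytes.
class AllocatorModel {
public:
    explicit AllocatorModel(std::size_t bytesPerDraw) : bytesPerDraw_(bytesPerDraw) {}

    // Records `draws` draws. Returns true only if the backing store had to
    // grow beyond its previous high-water mark.
    bool record(std::size_t draws) {
        used_ += draws * bytesPerDraw_;
        if (used_ <= reserved_) return false;  // fits in memory already held
        reserved_ = used_;
        return true;
    }

    // Reset rewinds the write position but keeps the reserved memory.
    void reset() { used_ = 0; }

    std::size_t reservedBytes() const { return reserved_; }

private:
    std::size_t bytesPerDraw_;
    std::size_t used_ = 0;
    std::size_t reserved_ = 0;
};
```

In this model a list that once recorded 200 draws still holds 200 draws' worth of memory while recording 5, which is why the slide advises against one shared pool for all possible uses.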

11 PIPELINE STATE OBJECT (PSO)
• Collates most D3D11 renderstates
• Compiled into hardware registers at Create time
  ‒ Can easily be tens of ms, so use asynchronous threads
• All state set onto command buffer in one go
• Keep adjacent PSOs similar
• Use sensible defaults for don’t care fields
Example: Rasterizer state
  INT DepthBias;
  FLOAT DepthBiasClamp;
  FLOAT SlopeScaledDepthBias;
  BOOL DepthClipEnable;
None of this matters if depth test is off
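One way to apply the "sensible defaults" advice is to canonicalise a description before creating or caching a PSO, so that two states differing only in don't-care fields compare equal. The sketch below uses a hypothetical `RasterDepthState` struct that mirrors the four fields named on the slide plus a depth-test flag; it is not the real D3D12 descriptor, and the chosen default values are an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>

// Hypothetical subset of a pipeline description (not the D3D12 struct).
struct RasterDepthState {
    bool depthTestEnable;
    std::int32_t depthBias;
    float depthBiasClamp;
    float slopeScaledDepthBias;
    bool depthClipEnable;
};

// Following the slide: with depth test off, none of the four fields matter,
// so force them to fixed defaults (the defaults here are an assumption).
inline RasterDepthState canonicalise(RasterDepthState s) {
    if (!s.depthTestEnable) {
        s.depthBias = 0;
        s.depthBiasClamp = 0.0f;
        s.slopeScaledDepthBias = 0.0f;
        s.depthClipEnable = true;
    }
    return s;
}

// Field-wise equality, usable as the key comparison of a PSO cache.
inline bool sameState(const RasterDepthState& a, const RasterDepthState& b) {
    return std::tie(a.depthTestEnable, a.depthBias, a.depthBiasClamp,
                    a.slopeScaledDepthBias, a.depthClipEnable) ==
           std::tie(b.depthTestEnable, b.depthBias, b.depthBiasClamp,
                    b.slopeScaledDepthBias, b.depthClipEnable);
}
```

With canonicalisation in front of a cache lookup, states that differ only in irrelevant fields map to one compiled object instead of triggering another multi-millisecond create.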

12 RESOURCE BINDING IN D3D11
• D3D11: Bind individual resources
  ‒ Addressing is by shader stage, type and slot
  ‒ [This is a huge space! 6*4*128 is 3072 slots, and each slot is 4+ DWORDs. Very inefficient]
  ‒ Changes to resources propagate automatically
    ‒ Renaming
    ‒ Synchronisation
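The slot arithmetic in the bracketed note can be written out as compile-time constants. The figures 6, 4, 128 and "4+ DWORDs" come from the slide; the constant names are illustrative, and the byte total is a lower bound because the slide gives the per-slot size only as a minimum.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kShaderStages = 6;    // from the slide
constexpr std::size_t kResourceTypes = 4;   // from the slide
constexpr std::size_t kSlotsPerType = 128;  // from the slide
constexpr std::size_t kMinDwordsPerSlot = 4;
constexpr std::size_t kBytesPerDword = 4;

constexpr std::size_t kTotalSlots = kShaderStages * kResourceTypes * kSlotsPerType;
constexpr std::size_t kMinBindingBytes = kTotalSlots * kMinDwordsPerSlot * kBytesPerDword;

static_assert(kTotalSlots == 3072, "6 * 4 * 128 slots");
static_assert(kMinBindingBytes == 49152, "at least 48 KiB of binding state");
```

So the full D3D11-style binding space is at least 48 KiB of state, most of which any given draw never uses.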

13 D3D12 RESOURCE BINDING 1
• Table driven
• Shared across all shader stages
• Two-level table
  ‒ Root Signature describes a top-level layout
    ‒ Pointers to descriptor tables
    ‒ Direct pointers to constant buffers
    ‒ Inline constants
• Changing which table is pointed to is cheap
  ‒ It’s just writing a pointer; no synchronisation cost
• Changing contents of table is harder
  ‒ Can’t change table in flight on the hardware; no automatic renaming
[Figure: a root signature holding table pointers, root constant buffer views and 32-bit constants; each table pointer references a descriptor table of CB, SR and UA views]

14 D3D12 RESOURCE BINDING 2
• Tables should be grouped by frequency of change
  ‒ Per-draw, per-material, per-light, per-frame
  ‒ Hint update frequency to driver by placing most frequent changes early in root signature

15 D3D12 RESOURCE BINDING TIPS
• Don’t overload root signature size
  ‒ CBVs and constants in root signature should probably be changing every draw call
  ‒ Bulk constant data should be in CBs not root constants
• Use static tables where possible
  ‒ Associate with object and prebuild
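A small budget calculator makes the "don't overload root signature size" tip concrete. The costs used below (1 DWORD per descriptor table, 2 DWORDs per root descriptor, 1 DWORD per 32-bit constant, 64 DWORDs maximum) are the limits given in Microsoft's shipped D3D12 documentation rather than in this talk; the types `RootParam`, `Kind` and `Frequency` are hypothetical, and ordering by frequency follows the previous slide's hint.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class Kind { DescriptorTable, RootDescriptor, Constants32 };
enum class Frequency { PerDraw = 0, PerMaterial = 1, PerLight = 2, PerFrame = 3 };

// Hypothetical description of one root parameter.
struct RootParam {
    Kind kind;
    Frequency frequency;
    unsigned constantCount;  // only meaningful for Kind::Constants32
};

constexpr unsigned kMaxRootSignatureDwords = 64;

// Cost of one parameter in DWORDs, per the documented root signature limits.
inline unsigned costInDwords(const RootParam& p) {
    switch (p.kind) {
        case Kind::DescriptorTable: return 1;
        case Kind::RootDescriptor:  return 2;
        case Kind::Constants32:     return p.constantCount;
    }
    return 0;
}

inline unsigned totalCost(const std::vector<RootParam>& params) {
    unsigned total = 0;
    for (const RootParam& p : params) total += costInDwords(p);
    return total;
}

// Places the most frequently changing parameters first, keeping the
// relative order of parameters that share a frequency.
inline void orderByFrequency(std::vector<RootParam>& params) {
    std::stable_sort(params.begin(), params.end(),
                     [](const RootParam& a, const RootParam& b) {
                         return static_cast<int>(a.frequency) <
                                static_cast<int>(b.frequency);
                     });
}
```

A layout with two tables, one root CBV and four root constants costs 8 DWORDs, well inside the budget; moving bulk constants into the root signature is what exhausts it.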

16 D3D12 RESOURCE SYNCHRONISATION
• No automatic synchronisation
• Must insert barriers between usage
• Three functions of barrier
  ‒ Format conversion
    ‒ e.g. antialiasing resolve or depth decompression
  ‒ Synchronisation
    ‒ Ensuring correct order of execution; e.g. compute use of a render output could start before colour buffer is finished working on the data, due to pipelining
  ‒ Visibility
    ‒ Typically cache flushes, if unit A and unit B do not share the same visibility of the data
• Barrier specifies previous and next usage and driver inserts appropriate work
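Because the application now owns correctness, a debug-time tracker that checks each barrier's "previous usage" against what the resource is actually in can catch mistakes early. The following is a minimal sketch with hypothetical names (`UsageTracker`, `Usage`, integer resource ids); it models only the bookkeeping, not the format conversion, ordering or cache work a real barrier triggers.

```cpp
#include <cassert>
#include <unordered_map>

enum class Usage { RenderTarget, ShaderRead, UnorderedAccess, CopyDest };

// Debug-side model: remembers the current usage of each resource id.
class UsageTracker {
public:
    void create(int resource, Usage initial) { current_[resource] = initial; }

    // A barrier names the previous and the next usage. Returns false if the
    // declared previous usage does not match what is tracked.
    bool transition(int resource, Usage before, Usage after) {
        auto it = current_.find(resource);
        if (it == current_.end() || it->second != before) return false;
        it->second = after;
        return true;
    }

    // True if the resource is currently in the usage a command needs.
    bool canUse(int resource, Usage wanted) const {
        auto it = current_.find(resource);
        return it != current_.end() && it->second == wanted;
    }

private:
    std::unordered_map<int, Usage> current_;
};
```

A render target that is sampled without a preceding `transition(..., Usage::RenderTarget, Usage::ShaderRead)` fails `canUse`, which is the class of error the slide warns about.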

17 BARRIER TIPS
• Group barriers into same Barrier call
  ‒ Will take the worst case of all, rather than potentially incurring multiple sequential barriers
• Set minimal barriers
• Barriers must be correct
  ‒ Will be a gigantic headache for IHVs if not
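Grouping can be handled by a small batching helper that collects transitions and issues them in one call. This sketch is a model only: `BarrierBatch` and `Transition` are hypothetical, and `flush` merely counts calls where a real renderer would pass the whole array to a single barrier call on the command list. Dropping transitions whose before and after usage are equal is one simple reading of "set minimal barriers".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class State { RenderTarget, ShaderRead, CopyDest };

struct Transition {
    int resource;
    State before;
    State after;
};

// Collects transitions and issues them together instead of one at a time.
class BarrierBatch {
public:
    void add(const Transition& t) {
        if (t.before == t.after) return;  // redundant: nothing to transition
        pending_.push_back(t);
    }

    // Issues every pending transition in a single call and returns how many
    // transitions that call contained. An empty batch issues no call.
    std::size_t flush() {
        const std::size_t count = pending_.size();
        if (count > 0) ++calls_;
        pending_.clear();
        return count;
    }

    std::size_t callsIssued() const { return calls_; }

private:
    std::vector<Transition> pending_;
    std::size_t calls_ = 0;
};
```

Three queued transitions, one of them redundant, become one call carrying two barriers rather than three sequential stalls.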

18 PROFILING
• D3D11 was reasonably predictable in profiling
  ‒ Limited set of accessible bottlenecks
  ‒ Usually fairly obvious which one you’re hitting
• D3D12 environment adds new factors
  ‒ API features: flexible resource binding, concurrency
  ‒ Hardware limits that were pretty much impossible to bump against in D3D11
  ‒ Even PCIe® and system memory bus
• Different hardware much more likely to have divergent behaviour
  ‒ Test on a wide range of hardware

19 Concurrency in D3D12

20 QUEUES
• Graphics, compute and copy queues
• Each is a superset of the next: graphics includes compute, compute includes copy
• Must specify executing queue type at record time
[Figure: nested regions labelled Graphics, Compute, Copy]

21 MULTIPLE QUEUES
• Multiple queues of the same type supported
  ‒ Within queue: work is ordered
  ‒ Between separate queues work can be arbitrarily reordered
• Use Fences to define work order
[Figure: Graphics Queue 1 holds Shadowmap L0, Lighting L0; Graphics Queue 2 holds Shadowmap L1, Lighting L1; the graphics engine executes Shadowmap L0, Shadowmap L1, Lighting L1, Lighting L0]
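The ordering rules can be demonstrated with a deterministic, single-threaded simulation. Everything here is a model with hypothetical names: `Op`, `Queue` and `simulate` are not D3D12 types, there is one shared fence, and a round-robin scheduler stands in for hardware that may interleave queues arbitrarily. It shows that work stays ordered within a queue and that a Wait on a fence value holds a queue back until another queue signals it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One queued operation: run some work, signal the fence, or wait on it.
struct Op {
    enum Type { Execute, Signal, Wait };
    Type type;
    std::string name;     // used by Execute
    std::uint64_t value;  // used by Signal and Wait
};
using Queue = std::vector<Op>;

// Round-robin over the queues, one op per queue per pass. A queue whose
// next op is a Wait stays blocked until the fence reaches the value.
inline std::vector<std::string> simulate(const std::vector<Queue>& queues) {
    std::vector<std::size_t> next(queues.size(), 0);
    std::uint64_t fence = 0;
    std::vector<std::string> executed;
    bool progressed = true;
    while (progressed) {
        progressed = false;
        for (std::size_t q = 0; q < queues.size(); ++q) {
            if (next[q] >= queues[q].size()) continue;         // queue drained
            const Op& op = queues[q][next[q]];
            if (op.type == Op::Wait && fence < op.value) continue;  // blocked
            if (op.type == Op::Execute) executed.push_back(op.name);
            if (op.type == Op::Signal) fence = std::max(fence, op.value);
            ++next[q];
            progressed = true;
        }
    }
    return executed;
}
```

With a lighting queue of `{Wait 1, Execute "Lighting L0"}` and a shadow queue of `{Execute "Shadowmap L0", Signal 1}`, the shadow map always runs first even though the lighting queue is visited first; remove the Wait and the scheduler runs the lighting pass first.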

22 GAME ENGINE WORKFLOW
[Figure: frame workflow. Stage labels: Physics, Shadowmap Rendering, G-buffer Rendering, Lighting & Shading, Solid Post Processing, Post Processing, UI Rendering, Present. Annotations: TressFX, Particle, Multiple cascades, Point/Spotlights, Prepare (e.g. generate Min/Max Mips), e.g. Particle Rendering, Transparent Obj Rendering, Heap Defragmentation, Streaming, Dynamic Data Update]

23 CONCURRENCY
• Graphics, compute and / or copy may run in parallel
  ‒ Profile to verify
  ‒ Very familiar to console programmers
[Figure: timeline across Graphics Engine, Compute Engine and Copy Engine. Labels: Defragmentation, Streaming, Dynamic Data Update, Physics, Shadowmaps, G-buffer, TileDeferred, AA/AO, Transparent, Tonemap, UI, Prepare SM]

24 DEMO TIME!
• Example of gains from async compute:
  ‒ Interleaving 2 frames
• Sample code will be available
• Sample based on DX11 work by Jason Stewart & Gareth Thomas
[Figure: interleaved frames. Labels: G-buffer Rendering 1, Lighting & Shading 1, G-buffer Rendering 2, Lighting & Shading 2, G-buffer Rendering 3, Lighting & Shading 0]

27 PARALLELISE UNALIKE WORKLOADS
• Engines may compete for resources
  ‒ Bus bandwidth
  ‒ Shader core, texture fetch for compute / graphics
  ‒ GPRs, Caches…
• The less similar the workload, the faster each runs
Bus dominated:
  ‒ Shadow mapping
  ‒ ROP heavy workloads
  ‒ Many G buffer operations
  ‒ DMA operations (texture upload, heap defrag)
Shader throughput:
  ‒ Deferred lighting (usually)
  ‒ Many postprocessing effects
  ‒ Most compute tasks (texture compression, physics, simulations)
Geometry dominated:
  ‒ Rendering highly detailed models

28 EXPLOITING CONCURRENCY
• Profile!
• Can align execution across queues with fences
  ‒ Fences have a significant cost
  ‒ Don’t overdo this; “a few” per frame at most
[Figure: three schedules of Shadow map, Deferred Lighting, Animate Particles and Stream Texture across queues, the better-aligned ones marked “Win!” and “Big Win!”]

29 BARRIERS AND MULTIPLE QUEUES
• Barrier must be inserted on last queue to write resource
  ‒ Primarily this is for any required format conversion
• Fences contain implicit acquire / release barriers
  ‒ One of the reasons they have a high cost

30 Resource Management in D3D12
Max McMullen, Microsoft

31 DIRECT3D 12 RESOURCE CREATION OVERVIEW
• Direct3D 11 has a simple model, create and use
• Works great given the simplicity of the abstraction
• A few problems for today’s titles
  ‒ Unpredictable performance differences due to driver workarounds
  ‒ No high performance reuse of memory in a given frame
  ‒ Tiled Resources added on to the original abstraction

32 DIRECT3D 11
[Figure: API and DDI layers. Each resource (Buffer, Texture3D, Texture2D) has its own GPU VA range and its own physical pages]

33 DIRECT3D 12 RESOURCE HEAPS
• Direct3D 12 separates allocation of GPU physical pages and GPU virtual addresses from resources
• Applications can better amortize the cost of physical page allocation
  ‒ Reuse memory for temporaries
  ‒ Repurpose memory when the scene no longer requires it

34 DIRECT3D 12 RESOURCE HEAPS
[Figure: API and DDI layers. A Resource Heap owns the GPU VA range and physical pages; Buffer, Texture3D and Texture2D resources are placed inside it]

35 RESOURCE HEAP PROPERTIES
Memory Pool:
  ‒ L0: Closest to CPU
  ‒ L1: Closest to GPU (Discrete GPU only)
CPU Page Properties:
  ‒ Not Accessible (L0 & L1)
  ‒ Write Combine (L0 Only)
  ‒ Write Back (L0 Only)
Alignment:
  ‒ 64 KB (Default)
  ‒ 1 MB (Enable MSAA)

36 SIMPLIFIED HEAP TYPES
 | DEFAULT | UPLOAD | READBACK
Memory Pool | L1 (Discrete), L0 (Integrated) | L0 | L0
CPU Properties | No CPU access | Write Combine, Write Back* | Write Back
Usage | Frequent GPU Read/Write; Max GPU Bandwidth | CPU Write Once, GPU Read Once; Max CPU Write Bandwidth | GPU Write Once, CPU Read; Max CPU Read Bandwidth

37 DIRECT3D 12 RESOURCE CREATION APIS
• Three types of resource create
  ‒ Committed
  ‒ Placed
  ‒ Reserved
• Each has a different pattern of GPU VA and Physical Page usage to enable different scenarios

38 DIRECT3D 12 RESOURCE CREATION APIS
[Figure: GPU VA and physical page layout for Committed, Placed and Reserved resources, showing Texture3D, Buffer and Texture2D resources and their Resource Heaps]

39 EFFICIENT HEAP USAGE
• Prefer default heaps populated by upload heaps
  ‒ Build a ring buffer out of one or more committed upload buffer resources, and leave each buffer perpetually mapped for CPU access
  ‒ Sequentially write data into each buffer with the CPU, aligning offsets as needed
  ‒ Instruct the GPU to signal an increasing fence value at the end of each frame
  ‒ Do not overwrite the data in the upload heap until the fence value indicates the GPU has finished reading the data
• Reuse upload heaps for dynamic data sent to GPU throughout rendering
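The four ring-buffer steps can be sketched as offset bookkeeping alone. This is a model under stated assumptions: `UploadRing` and `kNoSpace` are hypothetical names, the class hands out offsets into an imagined persistently mapped buffer rather than touching GPU memory, fence values are plain integers the caller reports as completed, and the capacity must be a multiple of every alignment requested.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

constexpr std::uint64_t kNoSpace = static_cast<std::uint64_t>(-1);

// Sub-allocates offsets from a ring of `capacity` bytes. head_ and tail_ are
// ever-increasing absolute positions; the offset returned is position % capacity.
class UploadRing {
public:
    explicit UploadRing(std::uint64_t capacity) : capacity_(capacity) {}

    // Returns an aligned offset, or kNoSpace if the GPU may still be
    // reading the bytes this allocation would overwrite.
    std::uint64_t allocate(std::uint64_t size, std::uint64_t alignment) {
        assert(size > 0 && size <= capacity_);
        assert(alignment > 0 && capacity_ % alignment == 0);
        std::uint64_t start = (head_ + alignment - 1) / alignment * alignment;
        if (start % capacity_ + size > capacity_) {
            start = (start / capacity_ + 1) * capacity_;  // skip to ring start
        }
        if (start + size - tail_ > capacity_) return kNoSpace;
        head_ = start + size;
        return start % capacity_;
    }

    // Called at the end of a frame, paired with the fence value the GPU
    // will signal once it has consumed everything written so far.
    void endFrame(std::uint64_t fenceValue) {
        inFlight_.push_back(Frame{fenceValue, head_});
    }

    // Called with the fence value the GPU has reached; frees older frames.
    void retire(std::uint64_t completedFenceValue) {
        while (!inFlight_.empty() && inFlight_.front().fence <= completedFenceValue) {
            tail_ = inFlight_.front().head;
            inFlight_.pop_front();
        }
    }

private:
    struct Frame {
        std::uint64_t fence;
        std::uint64_t head;
    };
    std::uint64_t capacity_;
    std::uint64_t head_ = 0;
    std::uint64_t tail_ = 0;
    std::deque<Frame> inFlight_;
};
```

In a 256-byte ring, two 100-byte allocations succeed, a third is refused until `retire` is told the frame's fence has completed, and the next allocation then wraps to offset 0. Padding skipped at the wrap counts as used space, which keeps the overlap check a single subtraction.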

40 PHYSICAL MEMORY REUSE
Both reserved and placed resources must follow the same rules as Direct3D 11 tiled resources:
• An aliasing barrier must be queued when physical memory is reused with a new resource
• The application must initialize the resource memory with either a Clear or Copy operation when first using or re-using physical memory with a render target or depth stencil resource

41 Efficient Memory Use in D3D12
Dan Baker, Co-Founder of Oxide Games

42 D3D12 MEMORY CONTROL
• D3D11: much guesswork in driver/API about where data went and how it was referenced
• Dynamic Map of constant buffers made it difficult to stream huge quantities of data efficiently
• D3D12 provides explicit control over memory mapping
  ‒ Can create one large buffer per frame and stage all data
  ‒ No specific need for a constant buffer; it becomes an application construct if desired

43 HIGH THROUGHPUT RENDERING
• To take advantage of the draw call throughput, rendering must be hooked into game logic
• For each unit, turret, missile trail, the CPU calculates information like position or color
• This data must be uploaded to the GPU as quickly as possible

44 FAST DATA STREAMING TO GPU
[Figure: data path. Labels: CPU, L1 Data Cache, L2/L3 Cache, CPU Memory, GPU Memory, GPU]

45 STREAMING THE DATA
• GPU memory is not write-cached, do not read
• Should always write whole cache-lines out
• _mm_stream_si128
  ‒ Writes cache-line at a time
  ‒ Will bypass L2 and L3 Cache
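A copy loop built on `_mm_stream_si128` might look like the sketch below. Some assumptions to flag: `streamCopy` and `kCacheLine` are illustrative names, a 64-byte cache line is assumed, and the destination here is ordinary aligned memory standing in for a mapped upload buffer. Each intrinsic call stores 16 bytes, so the loop issues four per iteration to cover one full line, which is how the "whole cache-lines" advice is met in practice. On targets without SSE2 the sketch falls back to `memcpy`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define STREAM_COPY_HAS_SSE2 1
#endif

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes

// Copies `bytes` (a multiple of 64) to a 64-byte-aligned destination using
// non-temporal stores, so the written data does not pollute the CPU caches.
// The destination is written only, never read back.
inline void streamCopy(void* dst, const void* src, std::size_t bytes) {
    assert(bytes % kCacheLine == 0);
    assert(reinterpret_cast<std::uintptr_t>(dst) % kCacheLine == 0);
#ifdef STREAM_COPY_HAS_SSE2
    __m128i* d = static_cast<__m128i*>(dst);
    const __m128i* s = static_cast<const __m128i*>(src);
    const std::size_t vectors = bytes / sizeof(__m128i);
    for (std::size_t i = 0; i < vectors; i += 4) {
        // Load one 64-byte line (source may be unaligned) ...
        const __m128i a = _mm_loadu_si128(s + i + 0);
        const __m128i b = _mm_loadu_si128(s + i + 1);
        const __m128i c = _mm_loadu_si128(s + i + 2);
        const __m128i e = _mm_loadu_si128(s + i + 3);
        // ... and write it out as four back-to-back 16-byte streaming stores.
        _mm_stream_si128(d + i + 0, a);
        _mm_stream_si128(d + i + 1, b);
        _mm_stream_si128(d + i + 2, c);
        _mm_stream_si128(d + i + 3, e);
    }
    _mm_sfence();  // make the streamed data globally visible before returning
#else
    std::memcpy(dst, src, bytes);  // portable fallback without streaming stores
#endif
}
```

Staging per-frame data through a function like this keeps the write-once traffic out of L2 and L3, leaving those caches for the simulation work running on the same cores.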

46 REAL-WORLD D3D12 EXAMPLE
• Ashes of the Singularity: new mega RTS from Oxide and Stardock
• Player may have thousands of units
• Every turret, bullet and missile simulated by engine
• On a heavy frame, Ashes uploads mb/s of data to GPU, 60fps = 3 GB/s
  ‒ ~20% of system bandwidth on DDR3
  ‒ If stored in CPU memory with GPU fetch, would be doubled

47 WHAT A FRAME LOOKS LIKE IN ASHES
[Figure: job timeline across Cores 1 to 5 for the current frame and the next frame. Each core runs Sim Jobs and D3D12 CMD Jobs, with AI Jobs and Game Jobs on some cores, followed by a D3D12 Present Job; data flows to GPU Memory]

48 D3D12 DEMO
• Demo of Ashes of the Singularity

49 Questions
We are hiring! Contact:

50 DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2015 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.

