
1 D3D12 A new meaning for efficiency and performance
Dave Oldcorn (AMD), Stephan Hodes (AMD), Max McMullen (Microsoft), Dan Baker (Oxide). 5th March 2015

2 D3D11 to D3D12
We assume most people here are more familiar with D3D11 than with D3D12

3 What hasn't changed
D3D12 is primarily a software change [obviously; all D3D11.2 hardware is supported]
Hardware programming model is still the same [VS, DS, HS, PS, rasteriser, depth, colour etc.]
Few new rendering features

4 What has changed
The software model has changed a lot, not just in the API but also in the underlying philosophy
Closer to the hardware
Gives more control to the application

5 Application is arbiter of correct rendering
Trades off safety for power: if D3D11 is JavaScript, D3D12 is C++
Large areas of undefined behaviour ... where behaviour will change with future GPUs
Use the debug layer
Stay away from the corners, don't take risks
Expect "morality guides" ... once we know what people keep doing wrong

6 Broad stroke changes D3D11 -> D3D12
D3D11: Sequential API -> D3D12: Queues, command lists
D3D11: Small state blocks -> D3D12: State object for pipeline
D3D11: Resource binding of individual objects -> D3D12: Resource binding via tables
D3D11: Automatic synchronisation, driver tracks resource state -> D3D12: Manual synchronisation, app must avoid overwrites
D3D11: Implicit memory management by OS & driver -> D3D12: Explicit memory management by application

7 New in D3D12

8 Command lists
Each command list is executed strictly sequentially [can view this as the existing sequential model with manual flushing, if that's all that's required]
Command lists can call out to second-level command lists ("bundles")
Some restrictions on bundles [particularly with what data is inherited from the enclosing command list]
Replaying bundles is OK
Top-level command lists can be replayed too, but not until the previous submit has retired
Size them right: 100s of draws for direct lists; 10+ draws for a bundle
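A minimal sketch of recording and submitting a direct command list, assuming a Microsoft::WRL::ComPtr-style setup; the device, queue and RecordScenePass names are placeholders and error handling is omitted:

```cpp
ComPtr<ID3D12CommandAllocator>    allocator;
ComPtr<ID3D12GraphicsCommandList> cmdList;
device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT, IID_PPV_ARGS(&allocator));
device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT, allocator.Get(),
                          nullptr /* initial PSO */, IID_PPV_ARGS(&cmdList));

// Record a few hundred draws, then close; execution within the list is strictly sequential.
RecordScenePass(cmdList.Get());          // placeholder app function: sets state, issues draws
cmdList->Close();

// Submission is explicit; batching several closed lists into one call is cheaper.
ID3D12CommandList* lists[] = { cmdList.Get() };
queue->ExecuteCommandLists(_countof(lists), lists);
```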

9 Command lists enable CPU-side threading
Command lists can be built on arbitrary threads, and very quickly too
Submit is thread-safe; submit in batches
Consider task-oriented engines
Divide rendering into tasks [with clear CPU-side and GPU-side dependencies]
Run CPU tasks to build command lists
Use dependencies to order GPU submission; this also helps with resource barriers

10 Allocator and list memory management
Lists / allocators manage memory
They hang on to their resources when reset; they must be destroyed to fully release memory
Reuse lists / allocators on 'similar' data; destroy them if the data is very dissimilar
Don't use a pool of lists / allocators for all possible uses
[Diagram: list / allocator memory usage after an initial 100 draws and a Reset; recording the same 100 draws again is guaranteed to cause no new allocations, with 5 draws, a different 100 draws and 200 draws shown for comparison]
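A minimal sketch of the reuse pattern above; the allocator keeps its memory across Reset, so replaying a similar workload causes no new allocations. The WaitForFrameToRetire and RecordSimilarWork helpers are hypothetical placeholders:

```cpp
WaitForFrameToRetire();                      // the GPU must be finished with this allocator's lists

allocator->Reset();                          // reclaims the memory for reuse; does NOT free it
cmdList->Reset(allocator.Get(), nullptr);    // reopen the list against the same allocator

RecordSimilarWork(cmdList.Get());            // roughly the same draw count as last time: no growth
cmdList->Close();

// If this allocator is later needed for a very different workload, release it
// (destroy the COM object) rather than keeping an oversized pool around.
```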

11 Pipeline State Object (PSO)
Collates most D3D11 render states
Compiled into hardware registers at Create time; this can easily be tens of ms, so use asynchronous threads [no driver threading!]
All state is set onto the command buffer in one go
Keep adjacent PSOs similar
Use sensible defaults for don't-care fields
Example: rasterizer state – INT DepthBias; FLOAT DepthBiasClamp; FLOAT SlopeScaledDepthBias; BOOL DepthClipEnable; none of this matters if the depth test is off
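A minimal sketch of filling a PSO description with default-initialised don't-care fields, assuming the d3dx12.h helpers, a ComPtr device, an existing root signature and pre-compiled shader blobs (vsBlob, psBlob); all of those names are placeholders, not part of the slides:

```cpp
D3D12_INPUT_ELEMENT_DESC inputElements[] = {
    { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 0,
      D3D12_INPUT_CLASSIFICATION_PER_VERTEX_DATA, 0 },
};

D3D12_GRAPHICS_PIPELINE_STATE_DESC desc = {};
desc.pRootSignature        = rootSignature.Get();
desc.VS                    = { vsBlob->GetBufferPointer(), vsBlob->GetBufferSize() };
desc.PS                    = { psBlob->GetBufferPointer(), psBlob->GetBufferSize() };
desc.RasterizerState       = CD3DX12_RASTERIZER_DESC(D3D12_DEFAULT);   // DepthBias 0, DepthClipEnable TRUE, etc.
desc.BlendState            = CD3DX12_BLEND_DESC(D3D12_DEFAULT);
desc.DepthStencilState     = CD3DX12_DEPTH_STENCIL_DESC(D3D12_DEFAULT);
desc.SampleMask            = UINT_MAX;
desc.PrimitiveTopologyType = D3D12_PRIMITIVE_TOPOLOGY_TYPE_TRIANGLE;
desc.InputLayout           = { inputElements, _countof(inputElements) };
desc.NumRenderTargets      = 1;
desc.RTVFormats[0]         = DXGI_FORMAT_R8G8B8A8_UNORM;
desc.DSVFormat             = DXGI_FORMAT_D32_FLOAT;
desc.SampleDesc            = { 1, 0 };

// This call can take tens of milliseconds: invoke it from a loading / worker
// thread well before the PSO is first needed, not mid-frame.
ComPtr<ID3D12PipelineState> pso;
device->CreateGraphicsPipelineState(&desc, IID_PPV_ARGS(&pso));
```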

12 Resource binding in D3D11
D3D11: bind individual resources
Addressing is by shader stage, type and slot [this is a huge space: 6*4*128 is 3072 slots, and each slot is 4+ DWORDs; very inefficient]
Changes to resources propagate automatically (renaming, synchronisation)

13 D3D12 Resource Binding 1
Table driven, shared across all shader stages
Two-level table: the Root Signature describes a top-level layout with pointers to descriptor tables, direct pointers to constant buffers, and inline constants
Changing which table is pointed to is cheap: it's just writing a pointer, with no synchronisation cost
Changing the contents of a table is harder: a table can't be changed while in flight on the hardware; there is no automatic renaming
[Diagram: a root signature holding table pointers, a root constant buffer view and 32-bit constants, with the table pointers referencing descriptor tables of CB / SR / UA views]
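A minimal sketch of such a two-level layout, assuming the d3dx12.h helper classes; the parameter ordering (most frequently changed entries first) and register assignments are illustrative only:

```cpp
CD3DX12_DESCRIPTOR_RANGE range;
range.Init(D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 4, 0);    // t0..t3 gathered in one descriptor table

CD3DX12_ROOT_PARAMETER params[3];
params[0].InitAsConstants(4, 0);                       // 32-bit root constants -> b0 (changes every draw)
params[1].InitAsConstantBufferView(1);                 // root CBV -> b1 (direct pointer to a constant buffer)
params[2].InitAsDescriptorTable(1, &range);            // table pointer -> descriptor table of SRVs

CD3DX12_ROOT_SIGNATURE_DESC desc(_countof(params), params, 0, nullptr,
    D3D12_ROOT_SIGNATURE_FLAG_ALLOW_INPUT_ASSEMBLER_INPUT_LAYOUT);

ComPtr<ID3DBlob> blob, error;
D3D12SerializeRootSignature(&desc, D3D_ROOT_SIGNATURE_VERSION_1, &blob, &error);
ComPtr<ID3D12RootSignature> rootSignature;
device->CreateRootSignature(0, blob->GetBufferPointer(), blob->GetBufferSize(),
                            IID_PPV_ARGS(&rootSignature));

// Switching which descriptor table a table pointer refers to is just writing a pointer:
commandList->SetGraphicsRootDescriptorTable(2, srvTableGpuHandle);
```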

14 D3D12 Resource Binding 2
Tables should be grouped by frequency of change: per-draw, per-material, per-light, per-frame
Hint update frequency to the driver by placing the most frequent changes early in the root signature

15 D3D12 Resource Binding tips
Don't overload root signature size
CBVs and constants in the root signature should probably be changing every draw call [otherwise, it may well be cheaper to do instancing yourself than have the driver do it all the time]
Bulk constant data should be in CBs, not root constants [matrices, etc.]
Use static tables where possible; associate them with the object and prebuild [at object / world load time]

16 D3D12 Resource Synchronisation
No automatic synchronisation: the application must insert barriers between usages
Three functions of a barrier:
Format conversion – e.g. antialiasing resolve or depth decompression
Synchronisation – ensuring correct order of execution; e.g. compute use of a render output could start before the colour buffer has finished working on the data, due to pipelining
Visibility – typically cache flushes, if unit A and unit B do not share the same visibility of the data
A barrier specifies the previous and next usage, and the driver inserts the appropriate work
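A minimal sketch of a transition barrier, assuming the CD3DX12_RESOURCE_BARRIER helper from d3dx12.h and a placeholder colour buffer that was just rendered and will next be sampled in a shader:

```cpp
D3D12_RESOURCE_BARRIER barrier = CD3DX12_RESOURCE_BARRIER::Transition(
    colourBuffer,
    D3D12_RESOURCE_STATE_RENDER_TARGET,            // previous usage
    D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE);   // next usage
commandList->ResourceBarrier(1, &barrier);
// From this before/after pair the driver derives any decompression,
// synchronisation and cache flushes that are needed.
```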

17 Barrier tips
Group barriers into the same ResourceBarrier call: the driver takes the worst case of them all, rather than potentially incurring multiple sequential barriers
Set minimal barriers
Barriers must be correct; it will be a gigantic headache for IHVs if not
Avoid GENERIC_READ [except when necessary, and particularly if the barrier is being inserted every frame]
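A minimal sketch of batching several transitions into a single call; the resource names are placeholders:

```cpp
D3D12_RESOURCE_BARRIER barriers[2] = {
    CD3DX12_RESOURCE_BARRIER::Transition(shadowMap,
        D3D12_RESOURCE_STATE_DEPTH_WRITE,
        D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE),
    CD3DX12_RESOURCE_BARRIER::Transition(gbufferAlbedo,
        D3D12_RESOURCE_STATE_RENDER_TARGET,
        D3D12_RESOURCE_STATE_PIXEL_SHADER_RESOURCE),
};
// One call: the driver can resolve both together instead of serialising
// two separate pipeline drains.
commandList->ResourceBarrier(_countof(barriers), barriers);
```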

18 D3D11 was reasonably predictable in profiling
Limited set of accessible bottlenecks [application, driver thread, tessellation, shader core, ROP covers most real-world cases]
Usually fairly obvious which one you're hitting [if you know your app and GPUView]
The D3D12 environment adds new factors
API features: flexible resource binding, concurrency
Hardware limits that were pretty much impossible to bump against in D3D11, even PCIe® and the system memory bus [there should be tools to help with this]
Different hardware is much more likely to have divergent behaviour [even from the same IHV]
Test on a wide range of hardware [listen to IHV devrel; accept that we may have slightly different answers]

19 Concurrency in D3D12

20 Graphics, compute and copy queues Each is a superset
Must specify executing queue type at record time Compute Copy [The recorded buffer to be sent to hardware may need to be quite different]

21 Multiple queues of the same type supported
Within a queue, work is ordered
Between separate queues, work can be arbitrarily reordered [this is a way to tell the driver "I don't care about the execution order"]
Use fences to define work order
[Diagram: Graphics Queue 1 submits Shadowmap L0 then Lighting L0, Graphics Queue 2 submits Shadowmap L1 then Lighting L1; the graphics engine may execute them interleaved, e.g. Shadowmap L0, Shadowmap L1, Lighting L1, Lighting L0]
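A minimal sketch of ordering work between two queues with a fence; the queues and the pre-recorded command lists are placeholders:

```cpp
ComPtr<ID3D12Fence> fence;
device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

ID3D12CommandList* shadow[]   = { shadowCmdList };
ID3D12CommandList* lighting[] = { lightingCmdList };

// Queue 1 produces the shadow map, then signals the fence.
queue1->ExecuteCommandLists(_countof(shadow), shadow);
queue1->Signal(fence.Get(), 1);

// Queue 2 waits on the GPU (no CPU stall) before consuming the shadow map.
queue2->Wait(fence.Get(), 1);
queue2->ExecuteCommandLists(_countof(lighting), lighting);
```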

22 Game Engine Workflow
[Diagram: a typical frame – heap defragmentation, streaming, dynamic data update, physics, shadowmap rendering (multiple cascades), prepare (e.g. generate min/max mips), G-buffer rendering (solid), lighting & shading (point/spotlights), TressFX, particles (e.g. particle rendering), transparent object rendering, post processing, UI rendering, present; light blue marks areas of work which we showed can be successfully ported to compute]

23 Concurrency
Graphics, compute and/or copy may run in parallel; profile to verify
Very familiar to console programmers
[Diagram: Graphics engine – shadowmaps, G-buffer, transparent, UI; Compute engine – physics, prepare SM, tile-deferred lighting, AA/AO, tonemap; Copy engine – dynamic data update, streaming, defragmentation]

24 Demo time!
Example of gains from async compute: interleaving 2 frames
Sample code will be available; the sample is based on DX11 work by Jason Stewart & Gareth Thomas
[Diagram: G-buffer rendering of frames 1, 2, 3 on the graphics queue overlaps lighting & shading of frames 0, 1, 2 on the compute queue]

25

26

27 Parallelise unalike workloads
Engines may compete for resources: bus bandwidth; shader core and texture fetch for compute / graphics; GPRs, caches…
The less similar the workloads, the faster each runs
Bus dominated: DMA operations (texture upload, heap defrag)
Shader throughput: deferred lighting (usually), many postprocessing effects, most compute tasks (texture compression, physics, simulations)
Geometry dominated: shadow mapping, rendering highly detailed models
ROP heavy workloads: many G-buffer operations
[G-buffer (bus) and deferred lighting (ALU) are somewhat different, so we'd expect some nice gain]

28 Exploiting concurrency
[Diagram: timelines scheduling Stream Texture, Animate Particles, Shadow map and Deferred Lighting across queues – overlapping them is a win, packing them tighter a big win, offset against KMD / scheduler overhead + GPU sync]
Profile!
Can align execution across queues with fences
Fences have a significant cost: don't overdo this; "a few" per frame at most

29 Barriers and multiple queues
The barrier must be inserted on the last queue to write the resource; primarily this is for any required format conversion
Fences contain implicit acquire / release barriers, one of the reasons they have a high cost [global visibility of all data implies a lot of cache flushing]

30 Resource Management in D3D12
Max McMullen, Microsoft

31 Direct3D 12 Resource Creation Overview
Direct3D 11 has a simple model: create and use; it works great given the simplicity of the abstraction
A few problems for today's titles:
Unpredictable performance differences due to driver workarounds
No high-performance reuse of memory within a given frame
Tiled Resources added on to the original abstraction
Lots of fine-grained allocations are more difficult to manage (CPU overhead)

32 Direct3D 11 resource model
[Diagram: at the API level, each Buffer, Texture3D and Texture2D owns its own GPU VA and physical pages at the DDI level – one allocation per resource]

33 Direct3D 12 Resource Heaps
Direct3D 12 separates allocation of GPU physical pages and GPU virtual addresses from resources
Applications can better amortize the cost of physical page allocation
Reuse memory for temporaries
Repurpose memory when the scene no longer requires it

34 Direct3D 12 Resource Heaps
[Diagram: at the API level, Buffer, Texture3D and Texture2D resources are placed within resource heaps; at the DDI level each heap owns the GPU VA and physical pages]

35 Resource Heap Properties
Memory pool: L0 – closest to CPU; L1 – closest to GPU (discrete GPU only)
CPU page properties: Not Accessible (L0 & L1); Write Combine (L0 only); Write Back (L0 only)
Alignment: 64 KB (default); 1 MB (enables MSAA)
Notes: L1 is equivalent to the "local" memory residency budget; L0 is equivalent to the "non-local" budget. Heaps are both a physical memory allocation and a virtual address range spanning the entire heap. They are optimized for creation on background threads, localizing the creation expense to the background thread instead of the threads recording or executing command lists. APIs are available to help applications size heaps for a given configuration of resources. A custom heap type exists to control the memory pool and CPU page properties; together with the adapter architecture caps, this allows applications to optimize for UMA GPU characteristics, where the previous rules of thumb aren't so strong. For example, applications can save power on mobile devices by optimizing away unnecessary GPU copies.

36 Simplified Heap Types
DEFAULT – memory pool: L1 (discrete) / L0 (integrated); CPU properties: no CPU access; usage: frequent GPU read/write, max GPU bandwidth
UPLOAD – memory pool: L0; CPU properties: write combine (write back on cache-coherent UMA); usage: CPU write once, GPU read once, max CPU write bandwidth
READBACK – memory pool: L0; CPU properties: write back; usage: GPU write once, CPU read, max CPU read bandwidth
Notes: These rules haven't really changed since D3D10. Default heaps experience the maximum GPU bandwidth available and are good for repeated GPU read & write operations. Upload heaps experience the maximum CPU bandwidth for CPU writes; a good rule of thumb is to use them for CPU write-once, GPU read-once scenarios, like uploading texture data. However, GPU read-once is the most conservative recommendation: discrete GPUs have caches to avoid re-reading from system memory across the PCIe bus on every access, so smaller buffers that hold indexed vertices and constants can reside here, but some usage scenarios will be better off locating the data in a default heap before extensive GPU usage. Readback heaps experience the maximum CPU bandwidth for CPU reads; a good rule of thumb is to use them for marshalling data back to the CPU, and only as the destination of GPU copy operations.
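A minimal sketch of an UPLOAD heap buffer used for CPU-write-once / GPU-read-once data, assuming the d3dx12.h helpers; srcData and the sizes are placeholders:

```cpp
const UINT64 size = 64 * 1024;
CD3DX12_HEAP_PROPERTIES heapProps(D3D12_HEAP_TYPE_UPLOAD);       // L0, write-combine
CD3DX12_RESOURCE_DESC   bufDesc = CD3DX12_RESOURCE_DESC::Buffer(size);

ComPtr<ID3D12Resource> uploadBuffer;
device->CreateCommittedResource(&heapProps, D3D12_HEAP_FLAG_NONE, &bufDesc,
    D3D12_RESOURCE_STATE_GENERIC_READ, nullptr, IID_PPV_ARGS(&uploadBuffer));

// Upload buffers can stay persistently mapped; write sequentially, never read back.
void* cpuPtr = nullptr;
CD3DX12_RANGE noRead(0, 0);                                       // we will not read from this resource
uploadBuffer->Map(0, &noRead, &cpuPtr);
memcpy(cpuPtr, srcData, size);
```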

37 Direct3D 12 Resource Creation APIs
Three types of resource create: committed, placed, reserved
Each has a different pattern of GPU VA and physical page usage to enable different scenarios

38 Direct3D 12 Resource Creation APIs
[Diagram: a committed resource creates its own implicit heap (GPU VA plus physical pages); a placed resource is created at an offset within an existing resource heap; a reserved resource owns only GPU VA, with physical pages mapped in from heaps later]
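A minimal sketch of a placed resource: one explicit heap, with a resource created at an offset inside it (d3dx12.h helpers assumed; sizes, formats and names are placeholders):

```cpp
CD3DX12_HEAP_DESC heapDesc(
    16 * 1024 * 1024,                                    // 16 MB of physical pages + VA
    D3D12_HEAP_TYPE_DEFAULT, 0,
    D3D12_HEAP_FLAG_ALLOW_ONLY_RT_DS_TEXTURES);          // heap restricted to RT/DS textures

ComPtr<ID3D12Heap> heap;
device->CreateHeap(&heapDesc, IID_PPV_ARGS(&heap));

// Place a render target at offset 0 of the heap; another temporary resource can
// later reuse the same range, provided an aliasing barrier is issued (see slide 40).
CD3DX12_RESOURCE_DESC rtDesc = CD3DX12_RESOURCE_DESC::Tex2D(
    DXGI_FORMAT_R16G16B16A16_FLOAT, 1920, 1080, 1, 1, 1, 0,
    D3D12_RESOURCE_FLAG_ALLOW_RENDER_TARGET);

ComPtr<ID3D12Resource> rt;
device->CreatePlacedResource(heap.Get(), 0, &rtDesc,
    D3D12_RESOURCE_STATE_RENDER_TARGET, nullptr, IID_PPV_ARGS(&rt));
```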

39 Efficient Heap Usage
Prefer default heaps populated by upload heaps
Build a ring buffer out of one or more committed upload buffer resources, and leave each buffer perpetually mapped for CPU access
Sequentially write data into each buffer with the CPU, aligning offsets as needed
Instruct the GPU to signal an increasing fence value at the end of each frame
Do not overwrite the data in the upload heap until the fence value indicates the GPU has finished reading it
Reuse upload heaps for dynamic data sent to the GPU throughout rendering
Notes (initializing and copying data to heaps): CopyTextureRegion requires the texture data in buffer resources to be located at an offset aligned to 512; constant buffer descriptor usage requires the data in buffer resources to be located at an offset aligned to 256; etc.
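A minimal sketch of the ring-buffer pattern described above, built over a persistently mapped upload buffer; the frame count, member names and alignment handling are illustrative assumptions, not part of the slides:

```cpp
struct FrameRing {
    static const UINT kFrames = 3;
    ID3D12Fence* fence;                       // signalled by the queue at end of frame
    UINT64       frameFence[kFrames] = {};
    UINT64       nextFence = 1;
    BYTE*        base;                        // persistently mapped upload buffer
    UINT64       offset = 0, capacity = 0;

    // Reserve space for dynamic data; the caller copies into the returned pointer.
    BYTE* Allocate(UINT64 size, UINT64 alignment) {
        offset = (offset + alignment - 1) & ~(alignment - 1);  // e.g. 256 for CBVs, 512 for CopyTextureRegion
        if (offset + size > capacity) offset = 0;              // wrap to the start of the ring
        BYTE* p = base + offset;
        offset += size;
        return p;
    }

    void EndFrame(ID3D12CommandQueue* queue, UINT frameIndex) {
        frameFence[frameIndex] = nextFence;
        queue->Signal(fence, nextFence++);                     // mark GPU progress for this frame
    }

    void BeginFrame(UINT frameIndex) {
        // Don't overwrite this frame slot's data until the GPU has finished reading it.
        if (fence->GetCompletedValue() < frameFence[frameIndex]) {
            HANDLE e = CreateEvent(nullptr, FALSE, FALSE, nullptr);
            fence->SetEventOnCompletion(frameFence[frameIndex], e);
            WaitForSingleObject(e, INFINITE);
            CloseHandle(e);
        }
    }
};
```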

40 Physical Memory Reuse
Both reserved and placed resources must follow the same rules as Direct3D 11 tiled resources:
An aliasing barrier must be queued when physical memory is reused with a new resource
The application must initialize the resource memory with either a Clear or Copy operation when first using or reusing physical memory with a render target or depth stencil resource
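A minimal sketch of reusing heap memory for a new resource; the two resources are assumed to be placed over the same range of a heap (as in the placed-resource sketch after slide 38), and the names are placeholders:

```cpp
D3D12_RESOURCE_BARRIER alias =
    CD3DX12_RESOURCE_BARRIER::Aliasing(oldTemporary, newTemporary);
commandList->ResourceBarrier(1, &alias);

// A render target or depth stencil must then be initialised (Clear or Copy)
// before any other use of the reused memory:
const float clearColour[4] = { 0, 0, 0, 0 };
commandList->ClearRenderTargetView(newTemporaryRtv, clearColour, 0, nullptr);
```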

41 Efficient Memory Use in D3D12
Dan Baker, Co-Founder of Oxide Games

42 D3D12 memory control
D3D11: much guesswork in the driver/API about where data went and how it was referenced; with ConstantBuffer dynamic map it is difficult to stream huge quantities of data efficiently
D3D12 provides explicit control over memory mapping
Can create one large buffer per frame and stage all data there
No specific need for a constant buffer – it becomes an application construct if desired

43 High-throughput rendering
To take advantage of the higher draw call counts, rendering must be hooked into game logic
For each unit, turret and missile trail, the CPU calculates information like position or color
This data must be uploaded to the GPU as quickly as possible

44 Fast data streaming to the GPU
[Diagram: data flow between the CPU L1 data cache, the L2/L3 caches, CPU memory and GPU memory when streaming data to the GPU]

45 Streaming the data
GPU memory is not write-cached; do not read from it
Always write whole cache lines out
Use _mm_stream_si128: it writes a cache line at a time and bypasses the L2 and L3 caches
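A minimal sketch of filling write-combined (upload heap) memory with non-temporal stores; src, dst and size are placeholders, and both pointers are assumed 16-byte aligned with size a multiple of 64 bytes (a whole cache line):

```cpp
#include <emmintrin.h>   // SSE2: _mm_stream_si128

void StreamCopy(void* dst, const void* src, size_t size)
{
    const __m128i* s = reinterpret_cast<const __m128i*>(src);
    __m128i*       d = reinterpret_cast<__m128i*>(dst);

    for (size_t i = 0; i < size / 16; i += 4) {
        // Four 16-byte stores = one 64-byte cache line, written without
        // polluting the CPU L2/L3 caches. Never read from dst: the memory is
        // not write-cached, so reads would be extremely slow.
        _mm_stream_si128(d + i + 0, _mm_load_si128(s + i + 0));
        _mm_stream_si128(d + i + 1, _mm_load_si128(s + i + 1));
        _mm_stream_si128(d + i + 2, _mm_load_si128(s + i + 2));
        _mm_stream_si128(d + i + 3, _mm_load_si128(s + i + 3));
    }
    _mm_sfence();        // make the streaming stores visible before the GPU reads them
}
```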

46 Real-world D3D12 example
Ashes of the Singularity – new mega RTS from Oxide and Stardock
The player may have thousands of units; every turret, bullet and missile is simulated by the engine
On a heavy frame, Ashes uploads roughly 50 MB of data to the GPU; at 60 fps that is 3 GB/s, around 20% of system bandwidth on DDR3
If the data were stored in CPU memory with GPU fetch, that bandwidth would be doubled

47 What a frame looks like in Ashes
[Diagram: five CPU cores running the current and next frame; each core interleaves Sim jobs, D3D12 CMD jobs and game/AI jobs, with core 5 also running the D3D12 Present job, all feeding GPU memory]

48 D3D12 Demo
Demo of Ashes of the Singularity

49 Questions
We are hiring! Contact: Nicolas.Thibieroz@amd.com

50 Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION © 2015 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.

