Improving Performance in Your Game


2 DirectX 12: Improving Performance in Your Game
Bennett Sorbo, Program Manager, Direct3D, Windows Graphics

3 Agenda
Overview; Improving GPU efficiency; Reducing CPU overhead; Summary / Next Steps

4 Overview
DirectX 12 provides a single API for low-level access to a variety of GPU hardware. It enables games to leverage higher-level knowledge to achieve great performance gains. Today, we’ll discuss best practices for specific DirectX 12 features to achieve these gains in your game.

5 Increasing GPU efficiency

6 GPU Efficiency
Three key areas for GPU-side gains: explicit resource transitions, parallel GPU execution, and GPU-generated workloads.

7 GPU Efficiency: Explicit resource transitions
Modern GPUs require resources to be in different ‘states’ for different use cases, and something has to know when transitions between those states need to occur. In DirectX 12, the app is responsible for identifying when these transitions need to occur. Making these transitions explicit makes it clear when operations are expensive...

8 GPU Efficiency: Explicit resource transitions (cont’d)
... but also gives games the opportunity to eliminate unnecessary transitions. There are two key opportunities. First, UAV synchronization is now exposed as an explicit resource barrier. Previously, the driver would ensure all writes to a UAV were in order of dispatch by inserting “Wait for Idle” commands after each dispatch:
Dispatch, WaitForIdle, Dispatch, WaitForIdle, Dispatch, WaitForIdle, Dispatch

9 GPU Efficiency: Explicit resource transitions (cont’d)
If the app has high-level knowledge that dispatches can run out of order, the WaitForIdle commands can be removed. More importantly, dispatches can then run in parallel to achieve higher GPU occupancy. This is particularly beneficial for large numbers of dispatches with low thread counts.
[Diagram: Dispatch / WaitForIdle command stream]
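
The saving described on this slide can be sketched with a small CPU-side model. This is only an illustration under stated assumptions: `Command`, `DispatchInfo`, `RecordConservative`, `RecordExplicit` and `CountBarriers` are hypothetical names, not Direct3D types, and the real API expresses the synchronization point as a resource barrier of type `D3D12_RESOURCE_BARRIER_TYPE_UAV`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU-side model of a compute command stream; not D3D12 types.
enum class Command { Dispatch, UavBarrier };

struct DispatchInfo {
    // True when this dispatch must observe UAV writes made by earlier dispatches.
    // Only the application has this knowledge.
    bool dependsOnEarlierUavWrites;
};

// Conservative policy: synchronize between every pair of consecutive
// dispatches, as the slides describe for the older driver-managed model.
inline std::vector<Command> RecordConservative(const std::vector<DispatchInfo>& dispatches) {
    std::vector<Command> stream;
    for (std::size_t i = 0; i < dispatches.size(); ++i) {
        if (i > 0) stream.push_back(Command::UavBarrier);
        stream.push_back(Command::Dispatch);
    }
    return stream;
}

// Explicit policy: emit a barrier only in front of a dispatch that actually
// depends on earlier UAV writes. Independent dispatches stay back to back,
// which is what lets the GPU overlap them.
inline std::vector<Command> RecordExplicit(const std::vector<DispatchInfo>& dispatches) {
    std::vector<Command> stream;
    for (std::size_t i = 0; i < dispatches.size(); ++i) {
        if (i > 0 && dispatches[i].dependsOnEarlierUavWrites) {
            stream.push_back(Command::UavBarrier);
        }
        stream.push_back(Command::Dispatch);
    }
    return stream;
}

inline std::size_t CountBarriers(const std::vector<Command>& stream) {
    std::size_t count = 0;
    for (Command c : stream) {
        if (c == Command::UavBarrier) ++count;
    }
    return count;
}
```

For four dispatches where only the last one consumes earlier results, the conservative stream carries three synchronization points and the explicit stream carries one.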

10 GPU Efficiency: Explicit resource transitions (cont’d)
Second, the ResourceBarrier API allows the application to perform transitions over a period of time. The app specifies the starting and destination states at the ‘begin’ and ‘end’ ResourceBarrier calls, and promises not to use the resource while it is in transition. The driver can use this information to eliminate redundant pipeline stalls and cache flushes.

11 GPU Efficiency: Explicit resource transitions (cont’d)
Example rendering scenario (before)
API calls: Draw call that renders to Tex1; Resource Barrier (Tex1) Render Target -> SRV; SetDescriptorHeap; Bind Tex1 as SRV, sample in Draw call.
Hardware commands: Driver emits ‘WaitForIdle’ command; Driver emits ‘WaitForIdle’ command.
Example rendering scenario (after)
API calls: Draw call that renders to Tex1; Resource Barrier (Tex1) Render Target -> SRV BEGIN; SetDescriptorHeap; Resource Barrier (Tex1) Render Target -> SRV END; Bind Tex1 as SRV, sample in Draw call.
Hardware commands: Driver emits ‘WaitForIdle’ command.
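
The contract behind the BEGIN/END pair in this scenario can be modeled on the CPU. The sketch below is an assumption-laden illustration, not driver or debug-layer code: `ResourceState` and `TrackedResource` are hypothetical names. In the shipped API the two halves are issued as `ResourceBarrier` calls flagged `D3D12_RESOURCE_BARRIER_FLAG_BEGIN_ONLY` and `D3D12_RESOURCE_BARRIER_FLAG_END_ONLY`, each naming the same before and after states.

```cpp
#include <cassert>

// Hypothetical stand-ins for two resource states; not the d3d12.h enums.
enum class ResourceState { RenderTarget, ShaderResource };

// Tracks one resource and enforces the split-barrier promise:
// the resource must not be used between 'begin' and 'end'.
class TrackedResource {
public:
    explicit TrackedResource(ResourceState initial)
        : state_(initial), pending_(initial) {}

    // 'Begin' half: the resource must currently be in `before` and idle.
    bool BeginTransition(ResourceState before, ResourceState after) {
        if (inTransition_ || state_ != before) return false;
        inTransition_ = true;
        pending_ = after;
        return true;
    }

    // 'End' half: must repeat the same before/after pair as the 'begin' half.
    bool EndTransition(ResourceState before, ResourceState after) {
        if (!inTransition_ || state_ != before || pending_ != after) return false;
        state_ = after;
        inTransition_ = false;
        return true;
    }

    // A resource in transition is usable in neither the old nor the new state.
    bool CanUseAs(ResourceState s) const { return !inTransition_ && state_ == s; }

private:
    ResourceState state_;
    ResourceState pending_;
    bool inTransition_ = false;
};
```

Unrelated work, such as the SetDescriptorHeap call in the scenario, goes between the two halves, giving the driver a window in which to perform the transition without a dedicated stall.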

12 GPU Efficiency: Parallel GPU execution
Modern hardware can run multiple workloads in parallel on multiple ‘engines’. DirectX 12 allows games to target engines explicitly, because the developer knows best which operations can happen in parallel and what the dependencies are. Three engine types are exposed in DirectX 12: 3D, Compute, and Copy. It is up to the app to know and manage dependencies between queues.

13 GPU Efficiency: Parallel GPU execution (cont’d)
The copy engine type is great for getting data around without blocking or interrupting the main 3D engine. Two notable use cases: texture streaming and ‘lazy’ CPU readback. Especially great if going across PCI-E. Demo.

14 GPU Efficiency: Parallel GPU execution
< GPUView comparison between serial/parallel execution >

15 GPU Efficiency: Parallel GPU execution (cont’d)
Really excited about compute engine scenarios as well. Two notable use cases: long-running, low-priority compute work, and tightly interleaved 3D/Compute work within a frame. The gain comes from running different types of workloads that stress different parts of the GPU. Canonical example: compute-heavy dispatches during shadow map generation.

16 GPU Efficiency: GPU-generated workloads
ID3D11Asynchronous -> ID3D12QueryHeap. Query Heaps generalize query functionality: output can be stored into any buffer on the GPU or in system memory.
ID3D12CommandList::ResolveQueryData(ID3D12QueryHeap *pQueryHeap, D3D12_QUERY_TYPE Type, UINT StartElement, UINT ElementCount, ID3D12Resource *pDestinationBuffer, UINT64 AlignedDestinationBufferOffset)
Two key performance opportunities: binary occlusion, and batched query ‘resolve’ operations.

17 GPU Efficiency: GPU-generated workloads (cont’d)
Predication has also been generalized.
ID3D12CommandList::SetPredication(ID3D12Resource *pBuffer, UINT64 AlignedBufferOffset, D3D12_PREDICATION_OP Operation)
Predicate on a general buffer (query-derived, CPU-populated, or GPU-populated), which enables new rendering scenarios.

18 GPU Efficiency: GPU-generated workloads (cont’d)
ExecuteIndirect: a powerful new API for executing GPU-generated Draw/Dispatch workloads. Broad hardware compatibility. Can vary the following between invocations: vertex/index buffers, root constants, and inline SRV/UAV/CBV descriptors. Enables new scenarios and dramatic efficiency improvements.

19 GPU Efficiency: GPU-generated workloads (cont’d)
Demo. ExecuteIndirect is always going to be very efficient, and there are two ways to maximize that efficiency. First, set a proper ‘max count’, or just use a CPU-side count. Second, group ExecuteIndirect calls together, and ideally put space between generation and consumption of the arguments.
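
The ‘max count’ advice can be made concrete with a CPU-side stand-in for what a GPU culling pass would write. This is a sketch under assumptions: `DrawArguments` mirrors the four-`UINT` layout of `D3D12_DRAW_ARGUMENTS` in the shipped headers, while `IndirectBuffers`, `BuildVisibleDraws` and `EffectiveCommandCount` are hypothetical helpers. The clamping rule reflects the documented behavior of the shipped `ExecuteIndirect`: when a count buffer is supplied, the number of commands executed is the smaller of the count buffer value and `MaxCommandCount`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirrors the layout of D3D12_DRAW_ARGUMENTS (four 32-bit unsigned values).
struct DrawArguments {
    std::uint32_t vertexCountPerInstance;
    std::uint32_t instanceCount;
    std::uint32_t startVertexLocation;
    std::uint32_t startInstanceLocation;
};
static_assert(sizeof(DrawArguments) == 16, "argument records must be tightly packed");

// Hypothetical pair of buffers a GPU pass would fill: arguments plus a count.
struct IndirectBuffers {
    std::vector<DrawArguments> args;
    std::uint32_t count = 0;
};

// CPU stand-in for a GPU culling pass: compact the visible draws to the front
// of the argument buffer and record how many survived.
inline IndirectBuffers BuildVisibleDraws(const std::vector<DrawArguments>& all,
                                         const std::vector<bool>& visible) {
    assert(all.size() == visible.size());
    IndirectBuffers out;
    for (std::size_t i = 0; i < all.size(); ++i) {
        if (visible[i]) out.args.push_back(all[i]);
    }
    out.count = static_cast<std::uint32_t>(out.args.size());
    return out;
}

// With a count buffer, the executed count is min(count, maxCommandCount);
// without one (the CPU-count case), exactly maxCommandCount commands run.
inline std::uint32_t EffectiveCommandCount(std::uint32_t maxCommandCount,
                                           const std::uint32_t* countBuffer) {
    return countBuffer ? std::min(*countBuffer, maxCommandCount) : maxCommandCount;
}
```

A tight `maxCommandCount` bounds the work the GPU front end has to consider, and passing no count buffer corresponds to the "just use CPU count" option on the slide.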

20 Reducing CPU Overhead

21 CPU Overhead
Many improvements just for showing up: no high-frequency ref-counting, no hazard tracking, no state shadowing. Three other opportunities to take advantage of: resource binding, multi-threading, and memory allocation.

22 CPU Overhead: Resource Binding
What’s new: Descriptor Heap access and Root Signatures.
Descriptor Heap: actual GPU memory that contains resource access metadata.
Root Signature: binding parameters that can be passed to a shader invocation. It can contain a location in a descriptor heap, ‘inline’ descriptors, and actual constant data.

23 CPU Overhead: Resource Binding (cont’d)
Descriptor Heap best practices. Do: keep your descriptor heap as static as possible. Avoid: frequently changing descriptor heaps.
Root Signature best practices. Do: keep your root signature small. Do: take advantage of inline descriptors/data. Avoid: binding unnecessary pipeline stages.
This is an area where you can move the needle on CPU performance, so take advantage of the new flexibility here.
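
One way to keep an eye on root signature size is to add up what each parameter costs. The sketch below assumes the limits documented for the shipped API: a 64-DWORD maximum, with a descriptor table costing 1 DWORD, each 32-bit root constant 1 DWORD, and each inline (root) descriptor 2 DWORDs. `RootParamKind`, `RootParam` and the helper functions are hypothetical and do not come from d3d12.h.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical mirror of the three root parameter kinds; not the d3d12.h types.
enum class RootParamKind { DescriptorTable, RootConstants, RootDescriptor };

struct RootParam {
    RootParamKind kind;
    std::uint32_t num32BitValues;  // only meaningful for RootConstants
};

// Documented root signature budget in the shipped API, in 32-bit DWORDs.
constexpr std::uint32_t kMaxRootSignatureDwords = 64;

inline std::uint32_t RootParamCostDwords(const RootParam& p) {
    switch (p.kind) {
        case RootParamKind::DescriptorTable: return 1;                 // heap location
        case RootParamKind::RootConstants:   return p.num32BitValues;  // inline data
        case RootParamKind::RootDescriptor:  return 2;                 // inline descriptor
    }
    return 0;
}

inline std::uint32_t RootSignatureCostDwords(const std::vector<RootParam>& params) {
    std::uint32_t total = 0;
    for (const RootParam& p : params) total += RootParamCostDwords(p);
    return total;
}

inline bool FitsInRootSignature(const std::vector<RootParam>& params) {
    return RootSignatureCostDwords(params) <= kMaxRootSignatureDwords;
}
```

A layout with one descriptor table, one inline CBV descriptor, and four root constants costs 7 DWORDs, which leaves plenty of headroom and follows the "keep it small" advice.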

24 CPU Overhead: Multi-threading
In DirectX 11, the driver created a background thread outside the app’s control. In DirectX 12, multi-threading is app-controlled and a first-class citizen via ID3D12CommandList. It is not just command lists: you can also create PSOs and buffers/textures on background threads. Recommendation: if your workload is serial, create your own background submission thread.

25 CPU Overhead: Resource allocation
In DirectX 11, the driver managed versioning and sub-allocation behind the app’s back. DirectX 12 provides tools such as fences and resource placement to put apps in charge, along with persistently-mapped resources. Recommendations: use an appropriate number of fences, and expire resources based on engine knowledge.
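
The two recommendations can be combined into a small deferred-release queue: tag each retired resource with the fence value of the last submission that used it, then free everything whose fence value has been reached. This is a minimal sketch, assuming one shared fence whose values increase per submission. `DeferredReleaseQueue` and the integer resource ids are hypothetical, and in a real application `completedValue` would come from `ID3D12Fence::GetCompletedValue`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical helper: releases resources only once the GPU has passed the
// fence value associated with their last use.
class DeferredReleaseQueue {
public:
    // Record that `resourceId` was last used by work that signals `fenceValue`.
    // Many resources can share one fence value, so one fence per submission
    // (not per resource) is enough.
    void Retire(int resourceId, std::uint64_t fenceValue) {
        assert(pending_.empty() || pending_.back().fenceValue <= fenceValue);
        pending_.push_back({resourceId, fenceValue});
    }

    // Return every resource whose fence value the GPU has already reached.
    std::vector<int> Collect(std::uint64_t completedValue) {
        std::vector<int> freed;
        while (!pending_.empty() && pending_.front().fenceValue <= completedValue) {
            freed.push_back(pending_.front().resourceId);
            pending_.pop_front();
        }
        return freed;
    }

    std::size_t PendingCount() const { return pending_.size(); }

private:
    struct Entry {
        int resourceId;
        std::uint64_t fenceValue;
    };
    std::deque<Entry> pending_;  // ordered by fence value, oldest first
};
```

Because entries are kept in fence order, `Collect` only ever inspects the front of the queue, so the per-frame cost stays proportional to the number of resources actually freed.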

26 Ashes of the Singularity case study
Dan Baker Graphics Architect, Oxide Games

27 Resource Binding in Nitrous
Nitrous was designed from the start to map to hardware binding models.
Three key engine design points: textures are pre-grouped in the descriptor heap; bindings are shared across shader stages, so there are fewer bind calls; the engine is built around Static Samplers.
Findings: it is easy to stay within one descriptor heap per frame; it is important to avoid redundant state sets; optional usage of Root CBVs can provide a win.
Result: resource binding overhead is a fraction of what it is on D3D11.

28 Resource Management in Nitrous
Nitrous also benefits from more explicit resource management. There are two classes of resources: formally tracked, persistent resources, and temporary, frame-specific resources. Frame-specific resources are linearly allocated out of a heap with no resource tracking, which gives minimal overhead.
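
Linear allocation of frame-specific resources can be sketched as a bump allocator that hands out offsets into a heap and is reset once per frame. This is an illustrative guess at the general technique, not Oxide's implementation: `LinearFrameAllocator` is a hypothetical name, and the offsets stand in for positions at which placed resources would be created.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bump allocator over a fixed-size heap. It hands out byte
// offsets only and keeps no per-allocation bookkeeping.
class LinearFrameAllocator {
public:
    static constexpr std::uint64_t kInvalidOffset = ~std::uint64_t{0};

    explicit LinearFrameAllocator(std::uint64_t capacityBytes)
        : capacity_(capacityBytes) {}

    // Returns the aligned offset of the new allocation, or kInvalidOffset if
    // the heap is exhausted. `alignment` must be a power of two.
    std::uint64_t Allocate(std::uint64_t sizeBytes, std::uint64_t alignment) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::uint64_t aligned = (head_ + alignment - 1) & ~(alignment - 1);
        if (aligned > capacity_ || sizeBytes > capacity_ - aligned) {
            return kInvalidOffset;
        }
        head_ = aligned + sizeBytes;
        return aligned;
    }

    // Frees everything at once. Only safe after the GPU has finished the
    // frame that used these allocations (for example, after a fence wait).
    void Reset() { head_ = 0; }

    std::uint64_t Used() const { return head_; }

private:
    std::uint64_t capacity_;
    std::uint64_t head_ = 0;
};
```

Allocation is a few integer operations and release is a single store, which is where the "minimal overhead" comes from. The trade-off is that individual allocations cannot be freed early.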

29 Demo

30 Conclusion
There are many opportunities with DirectX 12 to achieve dramatic performance improvements in your game. Get started today! Enroll in the Early Access program to receive the latest SDK, DirectX 12 drivers, documentation, … Check out Channel9 for previous DirectX 12 talks. Q/A


32 Backup
< Would need to explain ‘residency’, how this worked in DX11 >
< WDDM2 residency management provides flexibility/performance. >
< Don’t need to track resource usage/frame if memory usage isn’t a concern: keep it all resident. >
