DirectX 12: Improving Performance in Your Game
Bennett Sorbo, Program Manager
Direct3D, Windows Graphics

Agenda
- Overview
- Improving GPU efficiency
- Reducing CPU overhead
- Summary / Next Steps

Overview
- DirectX 12 provides a single API for low-level access to a variety of GPU hardware.
- It enables games to leverage higher-level knowledge to achieve great performance gains.
- Today, we’ll discuss best practices for specific DirectX 12 features to achieve these gains in your game.

GPU Efficiency
Three key areas for GPU-side gains:
- Explicit resource transitions
- Parallel GPU execution
- GPU-generated workloads

GPU Efficiency: Explicit resource transitions
- Modern GPUs require resources to be in different ‘states’ for different use cases, and something has to know when transitions between those states need to occur.
- In DirectX 12, the app is responsible for identifying when these transitions need to occur.
- Making these transitions explicit makes it clear when operations are expensive...

GPU Efficiency: Explicit resource transitions (cont’d)
- ...but also gives games the opportunity to eliminate unnecessary transitions. Two key opportunities:
- First, UAV synchronization is now exposed as an explicit resource barrier.
- Previously, the driver would ensure all writes to a UAV were in order of dispatch by inserting “Wait for Idle” commands after each dispatch:
  Dispatch -> WaitForIdle -> Dispatch -> WaitForIdle -> Dispatch -> WaitForIdle -> Dispatch

GPU Efficiency: Explicit resource transitions (cont’d)
- If the app has high-level knowledge that dispatches can run out of order, the WaitForIdle commands can be removed.
- More importantly, the dispatches can then run in parallel to achieve higher GPU occupancy.
- This is particularly beneficial for large numbers of dispatches with low thread counts.
(Diagram: Dispatch / WaitForIdle command sequence)

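One way to picture the decision the app now owns is the sketch below. It is not from the talk: `DispatchDesc`, `PlanUavBarriers`, and the integer UAV ids are hypothetical names, and it works at whole-resource granularity. It marks a barrier only before a dispatch that conflicts with UAV work issued since the previous barrier. In a real renderer each `true` would become a UAV resource barrier recorded on the command list.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical description of one compute dispatch: the UAVs it reads
// and writes, identified by arbitrary integer ids.
struct DispatchDesc {
    std::set<int> reads;
    std::set<int> writes;
};

// For each dispatch, decides whether a UAV barrier must precede it.
// A barrier is planned only when the dispatch reads or writes a UAV that
// was written since the last barrier, or writes one that was read since
// the last barrier. Independent dispatches get no barrier and may overlap.
std::vector<bool> PlanUavBarriers(const std::vector<DispatchDesc>& dispatches) {
    std::vector<bool> barrierBefore(dispatches.size(), false);
    std::set<int> pendingReads;
    std::set<int> pendingWrites;
    for (std::size_t i = 0; i < dispatches.size(); ++i) {
        const DispatchDesc& d = dispatches[i];
        bool hazard = false;
        for (int r : d.reads) {
            hazard = hazard || pendingWrites.count(r) > 0;
        }
        for (int w : d.writes) {
            hazard = hazard || pendingWrites.count(w) > 0 || pendingReads.count(w) > 0;
        }
        if (hazard) {
            // The barrier synchronizes everything issued so far.
            barrierBefore[i] = true;
            pendingReads.clear();
            pendingWrites.clear();
        }
        pendingReads.insert(d.reads.begin(), d.reads.end());
        pendingWrites.insert(d.writes.begin(), d.writes.end());
    }
    return barrierBefore;
}
```

This planner is conservative: an app that knows two dispatches write disjoint regions of the same UAV could skip even more barriers than it does.
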
GPU Efficiency: Explicit resource transitions (cont’d)
- Second, the ResourceBarrier API allows the application to perform transitions over a period of time.
- The app specifies the starting and destination states at ‘begin’ and ‘end’ ResourceBarrier calls, and promises not to use the resource while it is in transition.
- The driver can use this information to eliminate redundant pipeline stalls and cache flushes.

GPU Efficiency: Explicit resource transitions (cont’d)
Example rendering scenario (before)
API calls:
- Draw call that renders to Tex1
- Resource Barrier (Tex1): Render Target -> SRV
- …
- SetDescriptorHeap
- Bind Tex1 as SRV, sample in Draw call
Hardware commands:
- Driver emits ‘WaitForIdle’ command
- Driver emits ‘WaitForIdle’ command
Example rendering scenario (after)
API calls:
- Draw call that renders to Tex1
- Resource Barrier (Tex1): Render Target -> SRV, BEGIN
- …
- SetDescriptorHeap
- Resource Barrier (Tex1): Render Target -> SRV, END
- Bind Tex1 as SRV, sample in Draw call
Hardware commands:
- Driver emits ‘WaitForIdle’ command

GPU Efficiency: Parallel GPU execution
- Modern hardware has the ability to run multiple workloads in parallel on multiple ‘engines’.
- DirectX 12 allows games to target engines explicitly. The developer knows best which operations can happen in parallel and what the dependencies are.
- Three engine types are exposed in DirectX 12: 3D, Compute, Copy.
- It is up to the app to know and manage dependencies between queues.

GPU Efficiency: Parallel GPU execution (cont’d)
- The copy engine type is great for getting data around without blocking or interrupting the main 3D engine.
- Two notable use cases:
  - Texture streaming
  - ‘Lazy’ CPU readback
- Especially great if going across PCI-E
- Demo

GPU Efficiency: Parallel GPU execution (cont’d)
- We’re really excited about compute engine scenarios as well.
- Two notable use cases:
  - Long-running, low-priority compute work
  - Tightly interleaved 3D/Compute work within a frame
- The gain comes from running different types of workloads that stress different parts of the GPU.
- Canonical example: compute-heavy dispatches during shadow map generation.

GPU Efficiency: GPU-generated workloads
- ID3D11Asynchronous -> ID3D12QueryHeap
- Query Heaps generalize query functionality: output can be stored into any buffer on the GPU or in system memory.
    ID3D12CommandList::ResolveQueryData(
        ID3D12QueryHeap *pQueryHeap,
        D3D12_QUERY_TYPE Type,
        UINT StartElement,
        UINT ElementCount,
        ID3D12Resource *pDestinationBuffer,
        UINT64 AlignedDestinationBufferOffset )
- Two key performance opportunities:
  - Binary occlusion
  - Batched query ‘resolve’ operations

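Batching resolves comes down to handing ResolveQueryData the largest contiguous (StartElement, ElementCount) ranges you can. As a rough sketch, not taken from the talk, the helper below groups the query slots used in a frame into such ranges. `QueryRange` and `CoalesceQueries` are hypothetical names, and the input is assumed to be sorted and free of duplicates.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical (start, count) pair, shaped like the StartElement and
// ElementCount arguments of a single resolve call.
struct QueryRange {
    std::uint32_t start;
    std::uint32_t count;
};

// Groups sorted, unique query indices into contiguous ranges so that each
// range can be resolved with one call instead of one call per query.
std::vector<QueryRange> CoalesceQueries(const std::vector<std::uint32_t>& sortedIndices) {
    std::vector<QueryRange> ranges;
    for (std::uint32_t index : sortedIndices) {
        // Precondition: indices are strictly increasing.
        assert(ranges.empty() || index >= ranges.back().start + ranges.back().count);
        if (!ranges.empty() && ranges.back().start + ranges.back().count == index) {
            ++ranges.back().count;          // extends the current run
        } else {
            ranges.push_back({index, 1});   // starts a new run
        }
    }
    return ranges;
}
```

Allocating query slots linearly each frame keeps the used indices contiguous, so the whole frame tends to collapse into a single range.
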
GPU Efficiency: GPU-generated workloads (cont’d)
- Predication has also been generalized:
    ID3D12CommandList::SetPredication(
        ID3D12Resource *pBuffer,
        UINT64 AlignedBufferOffset,
        D3D12_PREDICATION_OP Operation )
- Predicate on a general buffer (query-derived, CPU-populated, or GPU-populated), which enables new rendering scenarios.

GPU Efficiency: GPU-generated workloads (cont’d)
- ExecuteIndirect: a powerful new API for executing GPU-generated Draw/Dispatch workloads
- Broad hardware compatibility
- Can vary the following between invocations:
  - Vertex/Index buffers
  - Root constants
  - Inline SRV/UAV/CBV descriptors
- Enables new scenarios and dramatic efficiency improvements

GPU Efficiency: GPU-generated workloads (cont’d)
- Demo
- ExecuteIndirect is always going to be very efficient. Two ways to maximize that efficiency:
  - Set a proper ‘max count’, or just use a CPU-side count.
  - Group ExecuteIndirect calls together, and ideally put space between the generation and the consumption of their arguments.

CPU Overhead
- Many improvements just for showing up:
  - No high-frequency ref-counting
  - No hazard tracking
  - No state shadowing
- Three other opportunities to take advantage of:
  - Resource binding
  - Multi-threading
  - Memory allocation

CPU Overhead: Resource Binding
- What’s new:
  - Descriptor Heap access
  - Root Signatures
- Descriptor Heap: actual GPU memory that contains resource access metadata
- Root Signature: binding parameters that can be passed to a shader invocation. Can contain:
  - Location in descriptor heap
  - ‘Inline’ descriptors
  - Actual constant data

CPU Overhead: Resource Binding (cont’d)
- Descriptor Heap best practices
  - Do: keep your descriptor heap as static as possible.
  - Avoid: frequently changing descriptor heaps.
- Root Signature best practices
  - Do: keep your root signature small.
  - Do: take advantage of inline descriptors/data.
  - Avoid: binding unnecessary pipeline stages.
- This is an area where you can move the needle on CPU performance: take advantage of the new flexibility here.

CPU Overhead: Multi-threading
- In DirectX 11, the driver created a background thread outside the app’s control.
- In DirectX 12, multi-threading is app-controlled and a first-class citizen via ID3D12CommandList.
- Not just command lists: you can create PSOs and buffers/textures on background threads.
- Recommendation: serial workload? Create your own background submission thread.

CPU Overhead: Resource allocation
- In DirectX 11, the driver managed versioning and sub-allocation behind the app’s back.
- DirectX 12 provides tools like fences, resource placement, and persistently mapped resources to put apps in charge.
- Recommendations:
  - Use an appropriate number of fences.
  - Expire resources based on engine knowledge.

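A minimal sketch of both recommendations, assuming one fence value is signaled per frame: resources retired during a frame are tagged with that frame's value and freed once the GPU has passed it. `DeferredReleaseQueue` and its methods are hypothetical names, not part of DirectX 12 or the talk. In a real renderer the completed value would come from ID3D12Fence::GetCompletedValue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical helper: many resources share one per-frame fence value
// instead of each resource owning a fence.
class DeferredReleaseQueue {
public:
    // Schedules `release` to run once the GPU has reached `fenceValue`.
    // Values must be non-decreasing, matching how a queue signals them.
    void Retire(std::uint64_t fenceValue, std::function<void()> release) {
        assert(pending_.empty() || pending_.back().first <= fenceValue);
        pending_.emplace_back(fenceValue, std::move(release));
    }

    // Runs every release whose fence value the GPU has completed.
    // Returns how many resources were released.
    std::size_t Collect(std::uint64_t completedValue) {
        std::size_t released = 0;
        while (!pending_.empty() && pending_.front().first <= completedValue) {
            pending_.front().second();
            pending_.pop_front();
            ++released;
        }
        return released;
    }

    std::size_t PendingCount() const { return pending_.size(); }

private:
    std::deque<std::pair<std::uint64_t, std::function<void()>>> pending_;
};
```

Calling Collect once per frame keeps fence polling to a single read, no matter how many resources are waiting to expire.
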
Ashes of the Singularity case study
Dan Baker, Graphics Architect, Oxide Games

Resource Binding in Nitrous
- Nitrous was designed from the start to map to hardware binding models.
- Three key engine design points:
  - Textures pre-grouped in descriptor heap
  - Bindings shared across shader stages: fewer bind calls
  - Built around Static Samplers
- Findings:
  - Easy to stay within one descriptor heap per frame
  - Important to avoid redundant state sets
  - Optional usage of Root CBVs can provide a win
- Result: resource binding overhead is a fraction of what it is on D3D11.

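The talk does not show how Nitrous filters redundant state sets. One common approach, sketched here under that caveat, is a small CPU-side cache that remembers what is bound in each root parameter slot and skips a bind when nothing changed. `RootBindingCache`, `ShouldBind`, and the slot count of 16 are all assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical filter for redundant root-parameter binds.
class RootBindingCache {
public:
    static constexpr std::size_t kSlots = 16;  // assumed upper bound on root parameters

    // Returns true when the caller should actually issue the bind,
    // i.e. the slot is unset or holds a different GPU address.
    bool ShouldBind(std::size_t slot, std::uint64_t gpuAddress) {
        assert(slot < kSlots);
        if (valid_[slot] && bound_[slot] == gpuAddress) {
            return false;  // redundant: same value already bound
        }
        valid_[slot] = true;
        bound_[slot] = gpuAddress;
        return true;
    }

    // Forgets all bindings. Intended for the start of a command list or a
    // root signature change, when earlier bindings no longer apply.
    void Invalidate() { valid_.fill(false); }

private:
    std::array<std::uint64_t, kSlots> bound_{};
    std::array<bool, kSlots> valid_{};
};
```

The cache costs a compare per bind on the CPU, so it pays off mainly when consecutive draws share most of their bindings.
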
Resource Management in Nitrous
- Nitrous also benefits from more explicit resource management.
- Two classes of resources:
  - Formally tracked, persistent resources
  - Temporary, frame-specific resources
- Frame-specific resources are linearly allocated out of a heap, with no resource tracking: minimal overhead.

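Linear allocation of frame-specific data can be as small as the sketch below. This is an illustration, not Nitrous code: `FrameLinearAllocator` is a hypothetical name, and it hands out byte offsets that a real renderer would apply to a persistently mapped buffer or heap.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical linear ("bump") allocator for frame-specific data.
// It tracks a single head offset and nothing per allocation.
class FrameLinearAllocator {
public:
    static constexpr std::uint64_t kInvalidOffset = UINT64_MAX;

    explicit FrameLinearAllocator(std::uint64_t capacity) : capacity_(capacity) {}

    // Returns an offset aligned to `alignment` (a power of two),
    // or kInvalidOffset when the request does not fit.
    std::uint64_t Allocate(std::uint64_t size, std::uint64_t alignment) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::uint64_t aligned = (head_ + alignment - 1) & ~(alignment - 1);
        if (aligned > capacity_ || size > capacity_ - aligned) {
            return kInvalidOffset;
        }
        head_ = aligned + size;
        return aligned;
    }

    // Frees everything at once. Only safe after the GPU has finished
    // the frame that consumed these allocations.
    void Reset() { head_ = 0; }

    std::uint64_t Used() const { return head_; }

private:
    std::uint64_t capacity_;
    std::uint64_t head_ = 0;
};
```

Keeping one allocator per in-flight frame and resetting it when that frame's fence completes gives the "no resource tracking" behavior described above.
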
Conclusion
- There are many opportunities with DirectX 12 to achieve dramatic performance improvements in your game.
- Get started today!
  - Enroll in the Early Access program to receive the latest SDK, DirectX 12 drivers, documentation, …
  - Check out Channel9 for previous DirectX 12 talks.
- Q/A

Backup
- < Would need to explain ‘residency’, how this worked in DX11 >
- < WDDM2 residency management provides flexibility/performance. >
- < Don’t need to track resource usage/frame if memory usage isn’t a concern – keep it all resident. >