Advanced Graphics and Performance


2 Advanced Graphics and Performance
DirectX 12: Advanced Graphics and Performance
Max McMullen, Direct3D Development Lead, Microsoft

3 It’s been a busy year…
API is largely complete, with working drivers
Over 50% of gamers have DirectX 12 hardware
Massive industry support: Early Access, Engines, Titles
1 yr free upgrade to Windows 10 from Windows 7, 8.x
And now…
*Based on Steam survey

We announced DX12 last year at GDC, and progress since then has been amazing. The API is almost complete, with drivers from all of our PC partners. We said over 50% of hardware would support DX12; we are there now and should be at 2/3 by Holiday 2015. The early access program has 400+ members from 100+ studios, with Q&A and feedback posted daily. Content and tools are appearing: UE4, Unity, 3DMark Farandole, and the first games, Fable Legends and Ashes of the Singularity. Free upgrade from Win7/Win8! DXRedist for the OS. But this talk is for developers: let’s talk about the perf wins on the CPU, and even on GPU-bound workloads.

4 Agenda
Refresh on Direct3D 12
New Feature Levels
Unity on Direct3D 12
CPU & GPU Performance Improvements
Fable – 11 versus 12

Three parts to the talk: a refresh on API concepts, Unity on 12, and new API features for greater control and performance.

5 Direct3D 12 API
Reduce CPU overhead
Increase scalability across multiple CPU cores
Greater developer control
Console level API efficiency and performance
Superset of D3D 11 rendering functionality

Reduce CPU overhead, increase scalability, and give developers greater control over memory and synchronization. This brings console API efficiency and performance to PC, tablet, and phone, as well as Xbox, all with the same rendering capabilities as D3D 11.

6 CPU Overhead and Multithread Improvements
Pipeline state objects
Explicit resource binding management
Flexible pipeline parameterization
Explicit CPU/GPU synchronization
Command Reuse

These are the main concepts that deliver most of the wins.

7 Pipeline State Objects
[Diagram labels: D3D Vertex Shader, D3D Rasterizer, D3D Pixel Shader, D3D Blend State; HW State 1, HW State 2, HW State 3; Pipeline State Object containing Vertex Shader, HS/DS/…, Pixel Shader, Blend State]

Single object representing shader instructions on most GPU architectures; a memcpy of set commands on modern GPUs.

8 Explicit Resource Binding Management
Descriptor { Type, Format, Mip Count, pData }
Descriptor Heap
Descriptor Table: Start Index, Size
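The relationship between descriptors, a descriptor heap, and a descriptor table can be modelled in a few lines of plain C++. This is only a sketch of the idea on the slide, not the Direct3D 12 API: the type names, field encodings, and the `resolve` helper are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a descriptor, using the fields named on the slide.
struct Descriptor {
    std::uint32_t type = 0;         // e.g. SRV, UAV, CBV (encoding is illustrative)
    std::uint32_t format = 0;       // element or texel format
    std::uint32_t mipCount = 0;     // number of mip levels
    const void*   pData = nullptr;  // address of the underlying resource memory
};

// A descriptor heap is modelled as a flat array of descriptors.
struct DescriptorHeap {
    std::vector<Descriptor> slots;
    explicit DescriptorHeap(std::size_t capacity) : slots(capacity) {}
};

// A descriptor table is only a window into the heap: start index plus size.
struct DescriptorTable {
    std::size_t startIndex;
    std::size_t size;
};

// Resolve entry `i` of a table to the heap descriptor it refers to.
inline const Descriptor& resolve(const DescriptorHeap& heap,
                                 const DescriptorTable& table,
                                 std::size_t i) {
    assert(i < table.size);
    assert(table.startIndex + table.size <= heap.slots.size());
    return heap.slots[table.startIndex + i];
}
```

Because a table is just an offset and a length, pointing a draw at a different set of resources means changing two integers rather than rewriting the descriptors themselves.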

9 Resource Binding Tiers
Columns: Tier 1, Tier 2, Tier 3
Max Descriptor Heap CBV/SRV/UAVs: 2^20, 2^20+
Max CBVs per stage: 14, full heap
Max SRVs per stage: 128
Max UAVs in all stages: 8, 64
Max Samplers per stage: 16
Max SRV Descriptor Tables: 5, no limit

Tier 1: original 11 hardware
Tier 2: next gen, exceeding 11 hardware
Tier 3: latest, most flexible binding model

10 Binding Tiers in the D3D 12 Market
Based on Feb ‘15 Steam Survey

11 Explicit Resource Binding: Hazard Resolution
Resource hazards
  Render Target to/from Texture
  Copy Source to/from Copy Destination
  Tiled Resource Aliasing
  etc…
ResourceBarrier API to resolve hazards
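One way an application might keep track of these hazards is to remember the last known state of each resource and record a transition only when the required state differs. The sketch below is a plain C++ model under that assumption; `ResourceState`, `Transition`, and `BarrierTracker` are hypothetical names, and a real engine would translate each recorded transition into a ResourceBarrier call.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Illustrative subset of resource states, taken from the hazards on the slide.
enum class ResourceState { RenderTarget, ShaderResource, CopySource, CopyDest };

// One recorded state change for one resource.
struct Transition {
    int resourceId;
    ResourceState before;
    ResourceState after;
};

class BarrierTracker {
public:
    // Register a resource together with the state it starts in.
    void declare(int resourceId, ResourceState initial) {
        states_[resourceId] = initial;
    }

    // Ask for `needed` before the next use. A transition is recorded only if
    // the tracked state differs; returns true when one was recorded.
    bool require(int resourceId, ResourceState needed) {
        auto it = states_.find(resourceId);
        assert(it != states_.end());  // resource must have been declared
        if (it->second == needed) return false;
        barriers_.push_back(Transition{resourceId, it->second, needed});
        it->second = needed;
        return true;
    }

    const std::vector<Transition>& barriers() const { return barriers_; }

private:
    std::unordered_map<int, ResourceState> states_;
    std::vector<Transition> barriers_;
};
```

Rendering into a texture and then sampling it would show up here as a single RenderTarget to ShaderResource transition; repeated uses in the same state record nothing.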

12 Flexible Pipeline Parameterization
Two parts: Root Signature and Root Arguments
Contains constants, descriptors, and descriptor tables
Leverage hardware specific registers and pipelined renaming paths for highest frequency parameters
Remove indirection from a constant descriptor index to an explicit descriptor
No need to reset entire set of bindings for a few high-frequency descriptor changes
Like a function call for the PSO
Pass Values and Pointers

13 Explicit CPU/GPU synchronization
Application is responsible for managing CPU & GPU race conditions
Synchronization primitive is a fence
Application chooses granularity of synchronization
  One increment per frame is well amortized
  Increment per command list submission possible
Examples
  Reading CPU mapped memory after copy back from the GPU
  Deleting GPU accessible resources
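The "deleting GPU accessible resources" example can be sketched with a monotonically increasing fence value and a queue of pending deletions. This is a single-threaded C++ model, not the Direct3D 12 fence API: `FenceModel`, `DeferredDeleter`, and the `gpuReaches` call that stands in for the GPU finishing work are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical fence model: the "GPU" raises `completed` as it finishes
// submitted work; the CPU only reads it.
struct FenceModel {
    std::uint64_t completed = 0;     // last value the "GPU" has reached
    std::uint64_t lastSignaled = 0;  // last value the CPU has queued

    // CPU: queue a signal after the work submitted so far (e.g. once per frame).
    std::uint64_t signalNext() { return ++lastSignaled; }

    // "GPU": finish all work up to and including `value`.
    void gpuReaches(std::uint64_t value) {
        assert(value <= lastSignaled);
        if (value > completed) completed = value;
    }
};

// Deferred deletion: a resource may be destroyed only after the fence value
// recorded at its last use has been reached.
class DeferredDeleter {
public:
    void retire(std::uint64_t fenceValue, std::function<void()> destroy) {
        // Fence values are expected to arrive in non-decreasing order.
        assert(pending_.empty() || pending_.back().first <= fenceValue);
        pending_.emplace_back(fenceValue, std::move(destroy));
    }

    // Run every deleter whose fence value is complete; returns how many ran.
    int collect(const FenceModel& fence) {
        int freed = 0;
        while (!pending_.empty() && pending_.front().first <= fence.completed) {
            pending_.front().second();
            pending_.pop_front();
            ++freed;
        }
        return freed;
    }

private:
    std::deque<std::pair<std::uint64_t, std::function<void()>>> pending_;
};
```

With one `signalNext` per frame, the cost of synchronization is a single counter increment and one comparison per retired resource, which is the amortization the slide refers to.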

14 New Feature Levels Direct3D 12

15 New Rendering Features
Conservative Rasterization
ROVs
Typed UAV Loads
Tiled Resources Tier 3: Volumes
PS Specified Stencil Ref

Conservative Rasterization: occlusion culling, hit testing, curve rendering
ROVs: OIT
Typed UAV Loads: general power, normal interpolation
Volumes: sparse volumes, global illumination

16 New Feature Levels
Feature Level 12.0
  Resource Binding Tier 2
  Tiled Resources Tier 2: Texture2D
  Typed UAV Tier 1
Feature Level 12.1
  Conservative Rasterization Tier 1
  ROVs

12.0: Bindless era, large worlds
12.1: Two powerful pipeline enhancements that work together

17 Unity on Direct3D 12
Kasper Engelstoft, Unity Graphics Engineer

18 Direct3D 12 in Unity
Porting experience
Case study: multithreaded shadow rendering
What’s next for D3D 12 in Unity?

Thank you Max, now it is time to talk about Direct3D 12 in Unity. We have worked closely with Microsoft to bring D3D12 support to Unity. We started porting in September, and today I’ll first talk about our progress. Then I will cover the gains we got by moving our shadow map rendering away from the main thread and onto worker threads. Finally, I will look a bit ahead and talk about our future plans for D3D12.

19 D3D 12 porting experience
Started porting in September with SDK1
After 2 weeks, we had something rendering
In October, SDK2 API changes hit...
Mid-January 95% of our tests were passing
Then SDK3 hit...

We started in September with a Windows 10 build, SDK 1 from March 2014, and early IHV drivers. It took two weeks to learn the API and get something rendering in Unity. Luckily our entire shader pipeline just worked with shader model 5.0, which saved us a lot of time because we were already generating working bytecode. In October, SDK version 2 hit, and it took two weeks to fix the fallout and refactor code. From October to December we spent the time getting the tests in our internal graphics test framework green. Most of that time went to driver issues. We would probably have moved faster if we had had a functional graphics debugger available. By mid-January, 95% of our tests were passing, and then SDK3 hit. The new heaps concept destroyed performance, because every resource would create a heap! Our dynamic VBO implementation was completely rewritten for performance: we now allocate memory in 1 MB chunks, and the dynamic VBO grabs from that buffer using a view. This is where we are today.
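The "allocate in 1 MB chunks and hand out views" approach from the notes can be sketched as a linear sub-allocator. This is an illustration in plain C++, not Unity's implementation: `BufferView`, `ChunkAllocator`, and the default 256-byte alignment are assumptions, and each chunk only stands in for a GPU buffer that a real engine would create.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical view into a chunk: which chunk, and the byte range inside it.
struct BufferView {
    std::size_t chunkIndex;
    std::size_t offset;
    std::size_t size;
};

// Linear sub-allocator over fixed-size chunks (1 MB, as in the talk).
// Requests are carved out of the current chunk; a new chunk is started only
// when the current one cannot fit the request.
class ChunkAllocator {
public:
    static constexpr std::size_t kChunkSize = 1024 * 1024;

    // `alignment` must be a power of two; `size` must fit in one chunk.
    BufferView allocate(std::size_t size, std::size_t alignment = 256) {
        assert(size > 0 && size <= kChunkSize);
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        std::size_t aligned = (cursor_ + alignment - 1) & ~(alignment - 1);
        if (chunkCount_ == 0 || aligned + size > kChunkSize) {
            ++chunkCount_;  // a real engine would create a GPU buffer here
            aligned = 0;
        }
        cursor_ = aligned + size;
        return BufferView{chunkCount_ - 1, aligned, size};
    }

    std::size_t chunkCount() const { return chunkCount_; }

private:
    std::size_t chunkCount_ = 0;  // number of chunks created so far
    std::size_t cursor_ = 0;      // first free byte in the current chunk
};
```

Many small dynamic vertex buffers then share one underlying allocation instead of each creating its own heap, which is the performance problem the notes describe.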

20 D3D 12 optimization case study
Multi-threading shadow map rendering
Move work away from main thread
Generate D3D cmd lists for each of the shadow maps on their own worker threads
Cmd lists executed in parallel with the main scene cmd list building

We chose the multithreading of shadow maps as our first case study because it is something that can be applied to your existing engine, and you can get a nice performance boost from multithreading and using D3D12. The main goal is to allow the rendering command lists to be created on multiple threads, moving work away from the main thread. D3D12 is forcing us to make changes that will give us nice performance boosts later. Common (Unity) code has to generate intermediate commands that will later be executed by a worker device. On D3D12, however, we can skip the intermediate command list and generate D3D12 command lists directly, thereby removing the intermediate step. Note: the main thread still builds the cmd lists to render the main scene and executes them all at flip time. A better way to think about it is: the commands are built asynchronously, but they have to be executed in order. So they’re queued and then executed like this:

    foreach(asynccmd) { execute }; maincmd->execute();
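The "built asynchronously, executed in order" pattern from the notes can be sketched with standard C++ threads. This is a toy model rather than Unity or Direct3D 12 code: a command list is just a vector of strings, and `recordShadowMap` and `buildAndExecuteFrame` are hypothetical helpers. It assumes a toolchain with thread support enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a recorded command list: a list of command names.
using CommandList = std::vector<std::string>;

// Record one shadow map's commands; in the talk this runs on a worker thread.
inline CommandList recordShadowMap(int lightIndex) {
    return {"SetRenderTarget(shadowMap" + std::to_string(lightIndex) + ")",
            "DrawCasters(light" + std::to_string(lightIndex) + ")"};
}

// Build shadow command lists on workers while the calling thread builds the
// main scene list, then "execute" everything in a fixed order:
// all shadow lists first, the main scene last.
inline std::vector<std::string> buildAndExecuteFrame(int lightCount) {
    assert(lightCount >= 0);
    std::vector<CommandList> shadowLists(static_cast<std::size_t>(lightCount));
    std::vector<std::thread> workers;
    for (int i = 0; i < lightCount; ++i) {
        // Each worker writes only to its own slot, so no locking is needed.
        workers.emplace_back([&shadowLists, i] {
            shadowLists[static_cast<std::size_t>(i)] = recordShadowMap(i);
        });
    }

    // Main thread work overlaps with the workers.
    CommandList mainScene = {"SetRenderTarget(backBuffer)", "DrawScene()"};

    // Wait only at submission time.
    for (std::thread& worker : workers) worker.join();

    std::vector<std::string> executed;
    for (const CommandList& list : shadowLists)
        executed.insert(executed.end(), list.begin(), list.end());
    executed.insert(executed.end(), mainScene.begin(), mainScene.end());
    return executed;
}
```

Recording order across threads is unspecified, but the submission order is fixed by the loop at the end, so the result is the same every frame.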

21 Why shadow maps?
Rendered before the main scene
Simple render loop
Extracting receivers & casters is quite CPU intensive
The shadow jobs don’t require waiting until ID3D12CommandList needs to be executed

Shadow maps are an ideal candidate when it comes to multithreading the rendering pipeline. They are rendered before the main scene, which means their commands can be generated in parallel with the main scene. Rendering shadow maps is pretty much a) SetRenderTarget(shadowMap); b) render everything, with fewer state changes than normal scene rendering. The extraction of casters & receivers is a very good thing to *not* do on the main thread: it can run in parallel with the whole rendering and only needs the wait at the end of the frame. On present, execute the shadow cmd chains, then the main scene.

22 Before
This test scene consists of 7000 objects that are lit with 3 shadow-casting directional lights. As you can see from the profiler capture, the RenderShadowMaps samples are inlined on the main thread with the rest of the scene. The CPU time spent rendering one frame is 23.38 ms (Camera.Render sample). Before, we only had one function: RenderShadowMaps. In that function, however, there are a few things that can run only on the main thread, and those have to be kept there. So we first had to split the code into what can be moved out of the main thread and what needed to stay on it.

23 After
By moving each of the shadow maps to its own thread, the total frame time has dropped to 13.75 ms, a win of roughly 10 ms. To achieve this, the RenderShadowMaps function has been split into PrepareShadowMaps and RenderShadowMaps.

PrepareShadowMaps (runs on the main thread): responsible for running the main-thread part of the shadow map rendering. It performs the following tasks:
  Calculates memory requirements for the shadow map temp render buffer
  Spawns a job for calculating bounds and shadow cascade parameters
  Performs culling against the cascades
  Returns a partially filled header (shadow matrices and cascade info) that will be used later by the RenderShadowMaps job

RenderShadowMaps (runs on the main thread):
  Creates the shadow map render texture
  Launches the async ShadowMapJob, which creates a native D3D12 ID3D12CommandList that gets executed at PresentFrame time

The ShadowMapJobs are waited for (if needed) and executed at PresentFrame time.

24 Future D3D 12 work
Prerecorded command bundles
  One bundle per material pass
  Bundles for standard operations
  Mipmap generation
Use shader model 5.1 features

Moving on from where we are now, we first aim to be feature complete, then start optimizing. Instead of executing all the code for setting state every frame, state changes can be prerecorded into a bundle. Draw calls are very cheap when reissuing command bundles, so we should be able to issue many more. We can generate these bundles for standard operations. We will also investigate deferred rendering with shader model 5.1. It can be improved because of the new resource binding concept, where we can bind 1000s of textures and the shader can select which ones to use.

25 CPU & GPU Performance Improvements
Direct3D 12

26 Shader Cache
Redundant compilation from IL to hardware specific instructions
Optimize startup and level load times, reduce glitches
[Chart: CPU Usage (%) over Time (s) across start-up, level load, menu, and play, annotated "Heavy shader compilation during start-up" and "during level-load"]

Huge stalls during app load and level load. Glitches during gameplay. PSOs will make things worse: deeper optimization analysis.

27 Shader Cache
Frames typically have 200 to 400 Pipeline State Objects
Long traces typically have 300 to 1000 Pipeline State Objects
Cache operates on fully compiled PSOs, not individual shader stages
Serialization and deserialization under developer control
Full application control over what to serialize and how to serialize
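An application-side cache keyed on the whole pipeline state, rather than on individual shader stages, might look roughly like the following. This is a plain C++ sketch with hypothetical names (`PsoCache`, `CompileFn`, `preload`); the string key and byte-vector blob stand in for a real PSO description and its compiled form, and no Direct3D 12 calls are involved.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical cache of fully compiled pipeline states, keyed by a string
// that describes the whole pipeline (all shader stages plus fixed state).
class PsoCache {
public:
    using Blob = std::vector<std::uint8_t>;
    using CompileFn = std::function<Blob(const std::string&)>;

    explicit PsoCache(CompileFn compile) : compile_(std::move(compile)) {
        assert(compile_);  // a compile callback is required
    }

    // Return the compiled blob for `psoKey`, compiling only on a miss.
    const Blob& get(const std::string& psoKey) {
        auto it = blobs_.find(psoKey);
        if (it == blobs_.end()) {
            it = blobs_.emplace(psoKey, compile_(psoKey)).first;
        }
        return it->second;
    }

    // Insert a blob the application deserialized itself (e.g. at level load),
    // so that a later get() does not trigger compilation.
    void preload(const std::string& psoKey, Blob blob) {
        blobs_[psoKey] = std::move(blob);
    }

    std::size_t size() const { return blobs_.size(); }

private:
    CompileFn compile_;
    std::unordered_map<std::string, Blob> blobs_;
};
```

What gets preloaded, and from where, is left entirely to the caller, which mirrors the slide's point that serialization stays under developer control.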

28 ExecuteIndirect
Replacement for DrawIndirect and DispatchIndirect
Can perform multiple draws with a single API call
Number of draws can be controlled by CPU or GPU
Can even change bindings between draw calls
Works on all 12 hardware from FL 11.0 and up

Much like everything else in DirectX, we’ve abstracted the nuances of all the hardware and enabled this feature on every 12 GPU.

29 ExecuteIndirect Command Signature
Operations performed by ExecuteIndirect described by a command signature
Describes the layout of the argument buffer and the set of commands
Operations include:
  Set vertex or index buffer
  Change root constants
  Set root resource views (SRV, UAV, CBV)
  Draw, DrawIndexed, or Dispatch
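To make "layout of the argument buffer" concrete, the sketch below lays out one per-draw record as a root constant followed by indexed-draw arguments. `DrawIndexedArgs` mirrors the five 32-bit fields of `D3D12_DRAW_INDEXED_ARGUMENTS`; everything else (`IndirectCommand`, the `textureIndex` constant, `buildArgumentBuffer`) is a hypothetical example of what a command signature could describe, written in plain C++ without the Direct3D headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Same field order and widths as D3D12_DRAW_INDEXED_ARGUMENTS.
struct DrawIndexedArgs {
    std::uint32_t indexCountPerInstance;
    std::uint32_t instanceCount;
    std::uint32_t startIndexLocation;
    std::int32_t  baseVertexLocation;
    std::uint32_t startInstanceLocation;
};

// Hypothetical per-draw record: one root constant, then the draw arguments.
struct IndirectCommand {
    std::uint32_t   textureIndex;  // consumed by a "change root constants" operation
    DrawIndexedArgs draw;          // consumed by the DrawIndexed operation
};

static_assert(sizeof(DrawIndexedArgs) == 20, "five 32-bit fields");
static_assert(sizeof(IndirectCommand) == 24, "byte stride of one command");

// Fill an argument buffer on the CPU: one record per draw, each selecting
// its own texture and its own index range.
inline std::vector<IndirectCommand> buildArgumentBuffer(std::uint32_t drawCount,
                                                        std::uint32_t indicesPerMesh) {
    assert(indicesPerMesh > 0);
    std::vector<IndirectCommand> commands;
    commands.reserve(drawCount);
    for (std::uint32_t i = 0; i < drawCount; ++i) {
        IndirectCommand cmd{i, {indicesPerMesh, 1u, i * indicesPerMesh, 0, 0u}};
        commands.push_back(cmd);
    }
    return commands;
}
```

The stride of one record (24 bytes here) is what lets a single call walk the buffer and issue every draw; the same buffer could equally be written by a compute shader instead of the CPU.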

30 ExecuteIndirect versus Draw Loop
Draw loop:

    for (UINT drawIdx = drawStart; drawIdx < drawEnd; ++drawIdx)
    {
        // Set bindings
        cmdLst->SetGraphicsRootConstantBufferView(RT_CBV, constantsPointer);
        constantsPointer += sizeof(DrawConstantBuffer);
        auto textureSRV = textureStartSRV.MakeOffsetted(staticData->textureIndex, handleIncrementSize);
        cmdLst->SetGraphicsRootDescriptorTable(RT_SRV, textureSRV);

        // Draw
        cmdLst->DrawIndexedInstanced(dynamicData->indexCount, 1, dynamicData->indexStart, staticData->vertexStart, 0);
    }

ExecuteIndirect:

    mCmdLst->SetGraphicsRootDescriptorTable(RT_SRV, mTextureStart);
    mCmdLst->ExecuteIndirect(mCommandSignature, settings.numAsteroids, frame->mIndirectArgBuffer->Heap(), 0, nullptr, 0);

31 ExecuteIndirect Demo Intel’s Asteroids Demo Updated

32 ExecuteIndirect Demo
11, 12, 12 Bindless, 12 ExecuteIndirect
CPU 39.19 ms
GPU 34.81 ms 12.85 ms 11.86 ms 10.59 ms
Total CPU time

33 Flexible Predication and Queries
Predicates & Queries are now an explicit resource creation on GPU accessible heaps
Rendering operations can be predicated based on arbitrary computation performed by the CPU or GPU
Resolve operation transforms hardware specific query representation into standardized buffer contents
Apps that have lots of occlusion queries per frame will see improved performance due to bulk resolves

34 Multiengine
[Diagram labels: 3D, Compute, Copy]
Expose multiple parallel queues as explicit API objects
Queue Types: 3D, Compute, Copy
Prioritized queues enable new scenarios
  High priority, latency sensitive workloads
  Low priority background tasks
Extract all the parallelism out of the hardware that’s available

35 Multiengine
[Diagram: 3D Queue and Copy Queue timelines. Labels: Stream textures, Render, Compute, Wait Fence 1, Signal Fence]

36 Multiengine
[Diagram: 3D Queue and Copy Queue timelines. Labels: Stream textures, Render, Compute, Wait Fence 1, Signal Fence]

37 Multiengine Demo Compute and Copy Scenario Test

38 UAV Barriers
In D3D11 all UAV accesses in 1 Draw/Dispatch must complete before any UAV accesses in a subsequent Draw/Dispatch
This results in idle GPU shader cores for small Draw/Dispatch
In D3D12 UAV accesses in multiple Draw/Dispatch are truly unordered; applications must use an explicit barrier to enforce ordering
D3D12 – putting the “U” back in UAV

39 UAV Barriers
[Diagram: Direct3D 11 versus Direct3D 12 timelines. Labels: Draw+UAV, Wait for Idle, Dispatch]

40 UAV Barrier – Fable A/B Demo

41 Fable: 11 versus 12

42 Summary
Dramatically reduced CPU overhead
Great multithreaded scalability
Expose new GPU capabilities
Increase GPU performance
Greater developer control

43 Resources – Previous Talks
IDF 2014: ON_ID=1315
GDC 2014/Build 2014:

44 Resources
Check our booths and quick start challenge at the Expo
Join early access:
Upcoming GDC 2015 Talks:
  DirectX Tools: problems-with-your-game-using-directx-tools-presented-by-microsoft
  Direct3D 12 Power & Performance: game-on-directx12-presented-by-microsoft
  And several talks by hardware partners…

