Direct3D12 and the future of graphics APIs


1 Direct3D12 and the future of graphics APIs
Dave Oldcorn, Direct3D12 Technical Lead, AMD

2 The Problem

3 The problem
- Mismatch between existing Direct3D and hardware capabilities
- Lots of CPU cores, but only one stream of data
- State communicated in small chunks
- "Hidden" work: hard to predict from any one given call what the overhead might be
- Implicit memory management
- Hardware evolving away from classical register programming

4 API landscape
- The gap between the PC 'raw' 3D APIs and the hardware has opened up
- Very high-level APIs are now ubiquitous; easy to access even for casual developers, with plenty of choice
- Where the PC APIs sit is a middle ground
[Diagram: APIs ranked by capability, ease of use and distance from the 3D engine - game engines (Frostbite, Unity, Unreal, CryEngine, BlitzTech) and Flash / Silverlight at the top; D3D9, OpenGL, D3D11, D3D7/8 in the middle; Metal (register-level access) and console APIs at the bottom, with the gap marked "Opportunity"]

5 What are the consequences? What are the solutions?

6 Sequential API
- In a sequential API, the state contributing to a given draw comes from arbitrary previous points in time
- Some states must be reconciled on the CPU ("delayed validation")
- All contributing state needs to be visible
- The GPU isn't like this: it uses command buffers
- State must be saved and restored at the start and end of each buffer
[Diagram: a stream of API input (Set VS, Set RT state, Set PS, Set Blend, Set VS CB, Set PS CB, interleaved draws) feeding the state contributing to a single draw: VB, PS CB, VS CB, blend state, PS, RT state]
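The delayed-validation problem can be sketched with a toy model (not the real D3D11 API; all names here are invented for illustration): each Set* call mutates hidden device state, and only at Draw() does the runtime see the full combination, so the cost of checking it is paid per draw on the CPU.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Toy sequential, stateful device. State set at arbitrary earlier
// times contributes to every later draw.
struct SequentialDevice {
    std::string pixelShader, vertexShader, blendState;
    int drawsValidated = 0;

    void SetPS(std::string ps) { pixelShader = std::move(ps); }
    void SetVS(std::string vs) { vertexShader = std::move(vs); }
    void SetBlend(std::string b) { blendState = std::move(b); }

    // "Delayed validation": the whole state vector must be
    // reconciled here, on every draw, on the CPU.
    void Draw() {
        assert(!pixelShader.empty() && !vertexShader.empty());
        ++drawsValidated;
    }
};
```

Note that the Draw() call cannot know in advance which of the earlier Set* calls still matter, which is exactly why per-call overhead is hard to predict.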

7 Threading a sequential API
- Simple producer / consumer model: the application render thread feeds an application driver thread
- Extra latency; buffering has a cost
- More threading would mean dividing tasks at a finer grain
- Bottlenecked on the application or driver thread; difficult to extract parallelism (Amdahl's Law)
[Diagram: application simulation and prebuild threads 0-1 feed the application render thread, then the driver thread and runtime/driver, then the GPU execution queue (queued buffers 0-2)]
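The Amdahl's Law bottleneck mentioned above is easy to quantify: with a serial fraction s of the work, speedup on n cores is 1 / (s + (1 - s) / n). A single driver thread that serialises half of the frame caps speedup at 2x no matter how many cores are available (the numbers below are illustrative, not measurements).

```cpp
#include <cassert>

// Amdahl's Law: maximum speedup given a serial fraction and a core count.
double amdahlSpeedup(double serialFraction, int cores) {
    return 1.0 / (serialFraction + (1.0 - serialFraction) / cores);
}
```

For example, with 50% of the frame serialised in a driver thread, eight cores deliver less than a 2x speedup, while fully parallel work would scale to 8x.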

8 Command buffer API
- GPUs only listen to command buffers, so let the app build them
- Command lists, at the API level
- Solves the sequential API's CPU issues
[Diagram: application threads 0 and 1 each build a command buffer, which the runtime/driver passes to the GPU execution queue (queued buffers 0-1)]
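The one-command-list-per-thread model can be sketched as follows. This is a toy recorder, not the real API (actual recording goes through ID3D12GraphicsCommandList); the point is that each thread owns its buffer outright, so recording needs no shared state or locks.

```cpp
#include <cassert>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Toy command list: a private buffer of recorded commands.
struct CommandList {
    std::vector<std::string> commands;
    void Record(std::string cmd) { commands.push_back(std::move(cmd)); }
};

// Build N command lists in parallel, one thread per list, then
// return them for in-order submission.
std::vector<CommandList> BuildInParallel(int lists, int drawsPerList) {
    std::vector<CommandList> cls(lists);  // sized up front: no reallocation races
    std::vector<std::thread> workers;
    for (int i = 0; i < lists; ++i)
        workers.emplace_back([&cls, i, drawsPerList] {
            for (int d = 0; d < drawsPerList; ++d)
                cls[i].Record("draw");
        });
    for (auto& w : workers) w.join();
    return cls;
}
```

Because the threads never touch each other's buffers, this scales with core count in a way the sequential producer/consumer model cannot.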

9 Better scheduling
- The app has much more control over scheduling work, both CPU-side and GPU-side
- Threads don't really share much resource
- Many more options for streaming assets
- D3D11: command-buffer building threads tend to interfere with the driver thread
- D3D12: command-buffer building threads are more independent; GPU load is still added, but only after queuing
[Diagram: D3D11 vs D3D12 timelines of create threads, build threads, the driver thread, render work and GPU execution]

10 Pipeline objects
- Pipeline objects get rid of JIT compilation and enable LTCG for GPUs
- Decouple interface and implementation
- We're aware this is a hairpin bend for many graphics engines to negotiate: many engines don't think in terms of predicting state up front
- The benefits are worth it
[Diagram: simplified dataflow through the pipeline - index processing, VS, primitive generation, rasteriser, PS, rendertarget output]
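A minimal sketch of the idea, with invented names (the real structure is D3D12_GRAPHICS_PIPELINE_STATE_DESC, which is far larger): all shader stages and fixed-function state are declared together, so every expensive step (validation, shader linking) can run once at create time rather than being JIT-compiled when a draw is issued.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Toy pipeline description: the full state combination, up front.
struct PipelineDesc {
    std::string vs, ps, blend, rasterizer;
};

// Immutable after creation; binding it later is one cheap call.
struct PipelineState {
    PipelineDesc desc;
    bool compiled;
};

std::shared_ptr<PipelineState> CreatePipelineState(PipelineDesc d) {
    // All the heavy lifting (validation, whole-pipeline link,
    // LTCG-style optimisation) happens here, not at draw time.
    return std::make_shared<PipelineState>(PipelineState{std::move(d), true});
}
```

The engine-side cost is that state must be predicted up front; the payoff is that no state combination is ever compiled mid-frame.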

11 Render object binding mismatch
- Hardware uses tables in video memory, BUT it is still programmed like a register solution
- So one bind becomes:
  1. Allocate a new chunk of video memory
  2. Create a new copy of the entire table
  3. Update the one entry
  4. Write the register with the new table base address
[Diagram: on-chip root table (one per stage) holding pointers to GPU-memory SRD tables for textures (SR) and constant buffers (CB), each entry pointing to (and carrying the parameters of) a resource]

12 Descriptor tables
- Several tables of each type of resource; easy to divide up by frequency of update
- Tables can be of arbitrary size and dynamically indexed, providing bindless textures
- Changing a pointer in the root table is cheap; updating a descriptor in a table is not so cheap
- Some dynamic descriptors are a requirement, but avoid them in general
[Diagram: on-chip root table entries (SR.T[0..3], UAV, Samp, CB.T[0..1]) pointing to GPU-memory descriptor tables, e.g. textures table 0 (SR.T[0][0..2]) and constant-buffer table 1 (CB.T[1][0..1])]
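The cheap/expensive asymmetry can be modelled with a toy root table (names invented; the real objects are root signatures and descriptor heaps): the root table holds pointers into descriptor tables in GPU memory, so rebinding a whole group of resources is a single pointer write, whereas editing an individual descriptor means touching table memory the GPU may still be reading.

```cpp
#include <cassert>
#include <vector>

// Toy descriptor: in hardware this would be an SRD describing a resource.
struct Descriptor { int resourceId; };
using DescriptorTable = std::vector<Descriptor>;

// Toy on-chip root table: a small array of pointers to tables.
struct RootTable {
    std::vector<const DescriptorTable*> slots;

    // Cheap: repointing a slot at a different, pre-built table is
    // one write - no table copy, no new video memory allocation.
    void BindTable(int slot, const DescriptorTable* table) {
        slots[slot] = table;
    }
};
```

Dividing descriptors into per-frequency tables means the common case (switching material, switching object) is always the cheap pointer swap.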

13 Key innovations

Innovation             | CPU-side win                                                | GPU-side win
Command buffers        | Build on many threads; control of scheduling; lower latency | Simplified state tracking
Pipeline state objects | Link at create time; no JIT shader compiles                 | Efficient batched updates; cheaper state updates; enables LTCG
Bind objects in groups | Cheap to change group                                       | Fits hardware paradigm
Move work to Create    | Predictability                                              | Enables optimisations

14 Key innovations (continued)

Innovation                 | CPU-side win                            | GPU-side win
Explicit synchronisation   | Efficiency                              | Required for bindless textures; less overhead
Explicit memory management | Predictability; application flexibility | Zero copy; control over placement
Do less                    | Predictability; efficiency              | Enables aggressive scheduling

Across all of the above: FEWER BUGS

15 NEW PROBLEMS (And tips to solve them)

16 New visible limits
- More draws in does not automatically mean more triangles out
- You will not see full rendering rates with triangles averaging one pixel each
- Wireframe mode should look different to filled rendering

17 New visible limits
- Feeding the GPU much more efficiently means exploring interesting new limits that weren't visible before
- 10k of anything per frame is ~1µs per thing; GPU pipeline depth is likely to be 1-10µs (1k-10k cycles)
- Specific limit: context registers
- The root shader table is NOT in the context
- Compute doesn't bottleneck on context
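The "10k of anything is ~1µs per thing" budget is simple arithmetic worth making explicit: a 60 Hz frame is roughly 16.7 ms, so 10,000 submissions leave about 1.7 µs each, the same order as GPU pipeline depth, which is why per-item overhead suddenly becomes visible.

```cpp
#include <cassert>

// Per-item time budget: frame length divided by item count,
// converted from milliseconds to microseconds.
double microsecondsPerItem(double frameMs, int itemsPerFrame) {
    return frameMs * 1000.0 / itemsPerFrame;
}
```

At 30 Hz (33.3 ms frames) the budget doubles, which is one reason draw-call counts that are fine at 30 fps can become a bottleneck at 60.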

18 Application in charge
- The application is the arbiter of correct rendering; this is a serious responsibility
- The benefits of D3D12 aren't readily available without this condition
- Applications must be warning-free on the debug layer
- Different opportunities for driver intervention
- Consider controlling risk by avoiding riskier techniques

19 Application in charge
- No driver thread in play, so the app can target much lower latency
- BUT this implies the app has to be ready with new GPU work
- D3D11: no dead GPU time after the first frame (but extra latency), because the driver buffers a present
- Without a buffered present, dead time on the GPU is revealed
[Diagram: timelines of app render frames 1-3, the driver, and the GPU, showing how a driver-buffered present hides GPU dead time, while unbuffered presents expose it]

20 Use command buffers sparingly
- Multiple applications may be running on the system
- Each API command list maps to a single hardware command buffer
- Starting / ending a command list has an overhead: it writes full 3D state, and may flush caches or idle the GPU
- We think a good rule of thumb will be to target around [n] command buffers per frame
- Use the multiple-submission API where possible
[Diagram: application 0's queue (CB0-CB2) and application 1's queue (CB0) interleaved as the GPU executes CB0, CB1, CB0, CB2]
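The multiple-submission advice can be illustrated with a toy queue (the real call is ID3D12CommandQueue::ExecuteCommandLists, which takes an array of lists; the cost model here is invented for illustration): handing the queue several command lists in one call pays the fixed submission overhead once instead of once per list.

```cpp
#include <cassert>
#include <vector>

// Toy GPU queue that counts submissions separately from the
// command lists they carry.
struct Queue {
    int submits = 0;
    int listsExecuted = 0;

    // One call, many lists: the per-submission overhead is paid once.
    void Submit(const std::vector<int>& commandLists) {
        ++submits;
        listsExecuted += static_cast<int>(commandLists.size());
    }
};
```

Submitting {0, 1, 2} in one call executes three lists for one submission's overhead, whereas three separate Submit calls would pay it three times.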

21 Round-up

22 All-new
- There's a learning curve here for all of us
- In the main it's a shallow one, compared at least to the general problem of multithreaded rendering; multithreading is always hard
- A simpler design means fewer bugs and more predictable performance

23 What AMD plan to deliver
- A release driver for the Direct3D12 launch
- Continuous engagement, with Microsoft and with ISVs
- Bring your opinions to us and to Microsoft

24 QUESTIONS

