3 The problem
- Mismatch between existing Direct3D and hardware capabilities
  - Lots of CPU cores, but only one stream of data
  - State communication in small chunks
  - “Hidden” work
    - Hard to predict from any one given call what the overhead might be
  - Implicit memory management
  - Hardware evolving away from classical register programming
4 API landscape
- Gap between PC ‘raw’ 3D APIs and the hardware has opened up
- Very high level APIs now ubiquitous; easy to access even for casual developers, plenty of choice
- Where the PC APIs are is a middle ground
[Diagram: APIs placed along an axis labelled "Capability, ease of use, distance from 3D engine". Labels: Application; Game Engines (Frostbite, Unity, Unreal, CryEngine, BlitzTech); Flash / Silverlight; D3D7/8, D3D9, D3D11, OpenGL; Opportunity; Console APIs; Metal (register level access).]
5 What are the consequences? What are the solutions?
6 Sequential API
- Sequential API: state for a given draw comes from an arbitrary previous time
- Some states must be reconciled on the CPU (“delayed validation”)
- All contributing state needs to be visible
- GPU isn’t like this; it uses command buffers
- Must save and restore state at start and end
[Diagram: "API input" call list, in the order shown on the slide: ..., Draw, Set PS CB, Draw x 5, Set VS CB, Draw x 3, Set Blend, Set PS, Set RT state, Set VS VB, (more, earlier). "State contributing to draw": PS CB, VS CB, Blend state, PS, RT state, Draw.]
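The point about state arriving from an arbitrary previous time can be sketched in a few lines. This is a minimal simulation, not Direct3D code: `SequentialContext`, `PipelineState` and the string values are hypothetical names invented for illustration, and the call order assumes the slide's call list reads from most recent to earliest.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified pipeline state. Field names mirror the slide's
// labels and are not a real Direct3D structure.
struct PipelineState {
    std::string vsVertexBuffer;
    std::string renderTargetState;
    std::string pixelShader;
    std::string blendState;
    std::string vsConstants;
    std::string psConstants;
};

// What a draw really depends on: a full snapshot of the state at draw time.
struct ResolvedDraw {
    PipelineState state;
    int count;
};

// Toy sequential API: setters mutate hidden current state, draws consume it.
class SequentialContext {
public:
    void setVsVertexBuffer(std::string v)    { current_.vsVertexBuffer = std::move(v); }
    void setRenderTargetState(std::string v) { current_.renderTargetState = std::move(v); }
    void setPixelShader(std::string v)       { current_.pixelShader = std::move(v); }
    void setBlendState(std::string v)        { current_.blendState = std::move(v); }
    void setVsConstants(std::string v)       { current_.vsConstants = std::move(v); }
    void setPsConstants(std::string v)       { current_.psConstants = std::move(v); }

    // The state is only reconciled here, at draw time ("delayed validation").
    void draw(int count = 1) {
        assert(count > 0);
        draws_.push_back(ResolvedDraw{current_, count});
    }

    const std::vector<ResolvedDraw>& draws() const { return draws_; }

private:
    PipelineState current_;
    std::vector<ResolvedDraw> draws_;
};

// Replays the slide's call list, earliest call first (assumed ordering).
SequentialContext replaySlideCalls() {
    SequentialContext ctx;
    ctx.setVsVertexBuffer("vb0");
    ctx.setRenderTargetState("rt0");
    ctx.setPixelShader("ps0");
    ctx.setBlendState("blend0");
    ctx.draw(3);
    ctx.setVsConstants("vscb0");
    ctx.draw(5);
    ctx.setPsConstants("pscb0");
    ctx.draw();
    return ctx;
}
```

In this sketch the final draw still depends on the vertex buffer set nine calls earlier, which is why a sequential API has to keep all contributing state visible and why it is awkward to split across threads.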
7 Threading a sequential API
- Sequential API threading
  - Simple producer / consumer model
  - Extra latency
  - Buffering has a cost
- More threading would mean dividing tasks on finer grain
  - Bottlenecked on application or driver thread
  - Difficult to extract parallelism (Amdahl’s Law)
[Diagram: Application simulation, ..., Prebuild Thread 0, Prebuild Thread 1 and Application Render Thread (Application); Driver Thread (Runtime / Driver); GPU Execution Queue holding Queued Buffer 0, Queued Buffer 1, Queued Buffer 2.]
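The Amdahl's Law reference can be made concrete with the standard formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of threads. The function name and the example fractions below are illustrative assumptions, not figures from the talk.

```cpp
#include <cassert>

// Amdahl's Law: upper bound on speedup when only `parallelFraction` of the
// work can be spread over `threads` threads. Names are illustrative.
double amdahlSpeedup(double parallelFraction, double threads) {
    assert(parallelFraction >= 0.0 && parallelFraction <= 1.0);
    assert(threads >= 1.0);
    const double serialFraction = 1.0 - parallelFraction;
    return 1.0 / (serialFraction + parallelFraction / threads);
}
```

If, say, half of the frame's CPU cost sits on a single application or driver thread, `amdahlSpeedup(0.5, 4)` is 1.6 and no thread count pushes the result to 2, which is the bottleneck the slide describes.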
8 Command buffer API
- GPUs only listen to command buffers
- Let the app build them
- Command Lists, at the API level
- Solves sequential API CPU issues
[Diagram: Application simulation, ..., Thread 0 and Thread 1 each running "Build Cmd Buffer" (Application); Runtime / Driver; GPU Execution Queue holding Queued Buffer 0, Queued Buffer 1.]
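A rough sketch of the idea, using plain `std::thread` rather than any Direct3D call: each thread records into its own list with no shared state, and the lists are queued afterwards in a fixed order. `CommandList`, `buildCommandList`, `buildFrame` and `submit` are hypothetical names, and commands are modelled as strings purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for an API-level command list.
using CommandList = std::vector<std::string>;

// Records one list. It touches no shared state, so it can run on any thread.
CommandList buildCommandList(int listIndex, int drawCount) {
    assert(drawCount >= 0);
    CommandList list;
    list.push_back("begin list " + std::to_string(listIndex));
    for (int i = 0; i < drawCount; ++i) {
        list.push_back("draw " + std::to_string(listIndex) + "." + std::to_string(i));
    }
    return list;
}

// One worker thread per command list; each writes only to its own slot.
std::vector<CommandList> buildFrame(int threadCount, int drawsPerList) {
    assert(threadCount > 0);
    std::vector<CommandList> lists(static_cast<std::size_t>(threadCount));
    std::vector<std::thread> workers;
    for (int t = 0; t < threadCount; ++t) {
        workers.emplace_back([&lists, t, drawsPerList] {
            lists[static_cast<std::size_t>(t)] = buildCommandList(t, drawsPerList);
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return lists;
}

// Submission order is chosen by the app, independent of build order.
std::vector<std::string> submit(const std::vector<CommandList>& lists) {
    std::vector<std::string> queue;
    for (const CommandList& list : lists) {
        queue.insert(queue.end(), list.begin(), list.end());
    }
    return queue;
}
```

Because every list starts from its own known state, the build step needs no locks; only the final submission is ordered.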
9 Better scheduling
- App has much more control over scheduling work
  - Both CPU side and GPU
- Threads don’t really share much resource
- Many more options for streaming assets
[Diagram: thread timelines. D3D11 (Driver thread, Create thread): CB building threads tend to interfere. D3D12 (Create thread, Build threads): CB building threads more independent; GPU load still added but only after queuing. Legend: Render work, Create work, GPU executes.]
10 Pipeline objects
- Pipeline objects get rid of JIT and enable LTCG for GPUs
- Decouple interface and implementation
- We’re aware that this is a hairpin bend for many graphics engines to negotiate.
  - Many engines don’t think in terms of predicting state up front
  - The benefits are worth it
[Diagram: "Simplified dataflow through pipeline": Index Process, VS, ?, Primitive Generation, ?, Rasteriser, PS, ?, Rendertarget Output.]
11 Render object binding mismatch
- Hardware uses tables in video memory
- BUT still programmed like a register solution
- So one bind becomes:
  - Allocate a new chunk of video memory
  - Create a new copy of the entire table
  - Update the one entry
  - Write the register with the new table base address
[Diagram: On-chip root table (1 per stage) with entries SR "Pointer to table (here, textures)" and CB "Pointer to table (constant buffers)"; GPU Memory SRD table with entries "Pointer to (+ params of) resource"; GPU Memory resource.]
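The four steps behind a single bind can be modelled on the CPU to show where the cost comes from. This is a sketch under assumptions: `RegisterStyleBinder`, `Descriptor` and `Table` are hypothetical names, heap allocations stand in for video memory, and older tables are kept alive because earlier draws may still reference them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical descriptor: just an address here.
struct Descriptor { std::uint64_t resourceAddress; };
using Table = std::vector<Descriptor>;

// Models a register-style bind on top of hardware that really reads tables.
class RegisterStyleBinder {
public:
    explicit RegisterStyleBinder(std::size_t slots) {
        tables_.push_back(std::make_unique<Table>(slots));
        base_ = tables_.back().get();
    }

    void bind(std::size_t slot, Descriptor d) {
        assert(slot < base_->size());
        // Steps 1 and 2: allocate a new chunk and copy the entire table.
        std::unique_ptr<Table> copy = std::make_unique<Table>(*base_);
        ++allocations_;
        descriptorsCopied_ += copy->size();
        // Step 3: update the one entry that actually changed.
        (*copy)[slot] = d;
        // Step 4: point the "register" at the new table base address.
        tables_.push_back(std::move(copy));
        base_ = tables_.back().get();
    }

    const Table* base() const { return base_; }
    std::size_t allocations() const { return allocations_; }
    std::size_t descriptorsCopied() const { return descriptorsCopied_; }

private:
    std::vector<std::unique_ptr<Table>> tables_;  // older tables may be in flight
    const Table* base_ = nullptr;
    std::size_t allocations_ = 0;
    std::size_t descriptorsCopied_ = 0;
};
```

With a 16-entry table, changing one texture costs one allocation and 16 descriptor copies in this model, so the work scales with table size rather than with the number of entries changed.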
12 Descriptor Tables
- Several tables of each type of resource
  - Easy to divide up by frequency
- Tables can be of arbitrary size; dynamically indexed to provide bindless textures
- Changing a pointer in the root table is cheap
- Updating a descriptor in a table is not so cheap
  - Some dynamic descriptors are a requirement but avoid in general.
[Diagram: On-chip root table with entries labelled SR.T (x7), UAV, Samp and CB.T (x4); callouts "Pointer to table (textures table 0)" and "Pointer to table (constbuf table 1)" lead to SRD tables in GPU Memory.]
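A small model of the cheap path: tables are built once, and switching between them is a single pointer write in the root table, while the shader side indexes into whichever table is current. `RootTable`, `setTextureTable` and `fetchTexture` are hypothetical names for illustration only and do not correspond to Direct3D 12 entry points.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical descriptor and table types.
struct Descriptor { std::uint64_t resourceAddress; };
using DescriptorTable = std::vector<Descriptor>;

// Models the on-chip root table: a few pointers to tables in GPU memory.
struct RootTable {
    const DescriptorTable* textures;
    const DescriptorTable* constantBuffers;
};

// Cheap operation: one pointer write, no table copy, no allocation.
void setTextureTable(RootTable& root, const DescriptorTable& table) {
    root.textures = &table;
}

// Shader-side view: dynamic indexing into a table of arbitrary size,
// which is what makes bindless-style texture access possible.
std::uint64_t fetchTexture(const RootTable& root, std::size_t index) {
    assert(root.textures != nullptr);
    assert(index < root.textures->size());
    return (*root.textures)[index].resourceAddress;
}
```

Grouping descriptors by update frequency (for example one table per frame and one per material, an assumed split) means most state changes reduce to `setTextureTable`-style pointer swaps.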
13 Key innovations
(Slide table with columns: Innovation, CPU-side win, GPU-side win. Wins are listed per innovation in the order they appear on the slide.)
- Command buffers: Build on many threads; Control of scheduling; Lower latency; Simplified state tracking
- Pipeline state objects: Link at create time; No JIT shader compiles; Efficient batched updates; Cheaper state updates; Enables LTCG
- Bind objects in groups: Cheap to change group; Fits hardware paradigm
- Move work to Create: Predictability; Enables optimisations
16 New visible limits
- More draws in does not automatically mean more triangles out
- You will not see full rendering rates with triangles averaging 1 pixel each.
  - Wireframe mode should look different to filled rendering
17 New visible limits
- Feeding the GPU much more efficiently means exploring interesting new limits that weren’t visible before
- 10k/frame of anything is ~1µs per thing.
  - GPU pipeline depth is likely to be 1-10µs (1k-10k cycles).
- Specific limit: context registers
  - Root shader table is NOT in the context
  - Compute doesn’t bottleneck on context
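The arithmetic behind the per-item figure is a simple division of frame time by item count. The frame rates and the 1 GHz clock used below are assumptions chosen to reproduce the slide's round numbers; the talk does not state them.

```cpp
#include <cassert>

// Time budget per item, in microseconds, for a given frame rate and count.
double microsecondsPerItem(double framesPerSecond, double itemsPerFrame) {
    assert(framesPerSecond > 0.0 && itemsPerFrame > 0.0);
    return 1.0e6 / (framesPerSecond * itemsPerFrame);
}

// Converts a duration to GPU clock cycles at an assumed clock rate in GHz.
double cyclesFor(double microseconds, double clockGHz) {
    assert(microseconds >= 0.0 && clockGHz > 0.0);
    return microseconds * clockGHz * 1000.0;
}
```

At 10,000 items per frame the budget is about 1.7µs per item at 60 frames per second and exactly 1µs at 100, so the slide's "~1µs" is an order-of-magnitude figure. At an assumed 1 GHz, 1-10µs corresponds to 1k-10k cycles, which puts the per-item budget in the same range as the pipeline depth.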
18 Application in charge
- Application is arbiter of correct rendering
  - This is a serious responsibility
  - The benefits of D3D12 aren’t readily available without this condition
- Applications must be warning-free on the debug layer
- Different opportunities for driver intervention
- Consider controlling risk by avoiding riskier techniques
19 Application in charge
- No driver thread in play
- App can target much lower latency
- BUT implies app has to be ready with new GPU work
[Diagram: frame timelines for App Render (Frame 1, Frame 2, Frame 3), Driver (F1, F2, F3) and GPU (F1, F3, with Dead Time). Callouts: "First work sent to driver"; "Driver buffers Present; no future dead time"; "D3D11: No dead GPU time after 1st frame (but extra latency)"; "No buffered present reveals dead time on GPU".]
20 Use command buffers sparingly
- Each API command list maps to a single hardware command buffer
- Starting / ending a command list has an overhead
  - Writes full 3D state, may flush caches or idle GPU
- We think a good rule of thumb will be to target around command buffers/frame
- Use the multiple submission API where possible
[Diagram: "Multiple applications running on system". Application 0 queue: CB0, CB1, CB2. Application 1 queue: CB0. GPU executes: CB0, CB1, CB0, CB2.]
22 All-new
- There’s a learning curve here for all of us
- In the main it’s a shallow one
  - Compared at least to the general problem of multithreaded rendering
  - Multithread is always hard.
- Simpler design means fewer bugs and more predictable performance
23 What AMD plan to deliver
- Release driver for Direct3D12 launch
- Continuous engagement
  - With Microsoft
  - With ISVs
- Bring your opinions to us and to Microsoft.