NITROUS AND MANTLE: Combining efficient engine design with a modern API Dan Baker, Partner, Oxide Games.


1 NITROUS AND MANTLE: Combining efficient engine design with a modern API Dan Baker, Partner, Oxide Games

2 2| Nitrous and Mantle | 19 March 2014 PRE-REQUISITE MOTIVATIONAL SLIDE MODERN APIS ARE STARTING TO FEEL RATHER DATED BUT HOW MUCH BETTER CAN WE BE?

3 PRE-REQUISITE MOTIVATIONAL SLIDE TURNS OUT… A WHOLE LOT FASTER

4 HONEY, DOES THIS DRESS MAKE ME LOOK FAT? …

5 STATE OF THE ART TODAY: WHAT'S GOING ON?
Lots of little things add up
2 major problems require rearchitecture
–Functional threading model throws a wrench into task-based systems
–Implicit hazard tracking and synchronization: the API tries to hide the async nature of the GPU
Lots of little things: memory model, binding model, etc.
Analysis of features like instancing indicates that they are unreliable and tend to speed up only the fastest frames; correlation between batches and driver perf is casual
Can't retrofit old APIs

6 DIVING INTO NITROUS
Nitrous = Oxide's custom engine
Specifically designed for high throughput
Core neutral. Main thread acts only as a lightweight sequencer
All work divided up into small jobs, which are in the microsecond range
Can produce lots of jobs, in the 10,000+ range per frame

7 STAR SWARM
Nitrous Engine demo
Free to download, experiment
Proof of concept for modern API design
Represents 2 AI opponents, thus application CPU load is realistic
10,000 units possible
100,000+ batches possible

8 SECRETS BEHIND STAR SWARM
Much of what is required for high performance isn't specific to Mantle
Star Swarm originally not based on Mantle
If engine is structured in certain ways, Mantle support is straightforward and intuitive. Maybe even fun.
Work done to restructure engine will have benefits outside of Mantle support

9 ADDING NITROUS TO THE ENGINE
Rendering broken into jobs which generate autonomous command buffers
CPU to GPU data streamlined – constants, texture updates go into GPU frame memory
Shader bindings standardized
Shaders, state, bundled into blocks
Resources grouped into sets
Graphics commands streamlined, restricted bind points
Stateless command format
Expensive state transitioned rarely
Much attention paid to cache usage, lockless data structures
All hazards detangled, all buffers considered non-persistent

10 MULTI-CORE CPU BASICS
Be Wary, There Is A Lot Of Very Bad Advice In The Wild
–Spawning threads to handle tasks
–Relying on the OS preemptive scheduler, heavyweight OS synchronization primitives
–Functional threading in general
Your Survival Guide
–OK: Multi-thread read of same location
–OK: Multi-thread write to different locations
–OK: Multi-thread write to same location in stamp mode
–CAUTION: Atomic instructions
–STOP: Multi-thread read/write to same location
–STOP: Multi-thread write to same CACHE line
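
The last rule of the survival guide (no multi-thread writes to the same cache line) is usually met by padding per-thread data out to a full line. The following is a minimal sketch of that idea, not code from Nitrous: the 64-byte line size is an assumption that holds on most x86 parts, and `PaddedCounter` and `OnDifferentCacheLines` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines (typical for x86, not guaranteed elsewhere).
constexpr std::size_t kCacheLine = 64;

// One slot per worker thread. alignas pads the struct so that two
// neighbouring slots in an array can never share a cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "each slot fills one line");

// Debug helper: true when the two addresses fall into different lines.
inline bool OnDifferentCacheLines(const void* a, const void* b) {
    const auto lineA = reinterpret_cast<std::uintptr_t>(a) / kCacheLine;
    const auto lineB = reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
    return lineA != lineB;
}
```

With an array of `PaddedCounter`, worker `i` can write `counters[i].value` freely; two plain `uint64_t` values placed side by side would normally sit in the same line and serialize the writers.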

11 NITROUS AND MANTLE
Nitrous is NOT built around Mantle
Reverse is more true: Mantle adapts well to Nitrous's internal concepts
The concepts are what make the engine fast
Results are astounding, driver time reduced up to 50x
Mantle is the harbinger of future API design, not just in graphics

12 TASK BASED SYSTEM
Idea is that the workload is a constructed graph of much smaller nuggets
Many advantages
–Scales well, 32+ cores
–Easy to balance workload
–More power efficient: more, slower cores are just as good. Already seeing CPUs dynamically slowing clock speed
–If enough similar work items are queued, can execute the same code on cores. Cache hit rate much higher
–End up generating a larger number of command buffers to prevent thread serialization
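
One common way to feed many small jobs to any number of cores is a shared list that workers claim from with a single atomic counter. The sketch below illustrates that pattern only; it is not the Nitrous scheduler, it ignores the dependency graph the slide mentions, and `JobList` and `DrainJobs` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical list of small, independent jobs for one frame.
struct JobList {
    std::vector<std::function<void()>> jobs;
    std::atomic<std::size_t> next{0};  // index of the next unclaimed job
};

// Every worker (one per core) calls this. A worker claims one job at a
// time until none are left, so a faster core simply runs more jobs and the
// load balances itself. Returns how many jobs this caller executed.
inline std::size_t DrainJobs(JobList& list) {
    std::size_t executed = 0;
    for (;;) {
        // One atomic per claimed job; the jobs themselves share no state.
        const std::size_t i = list.next.fetch_add(1, std::memory_order_relaxed);
        if (i >= list.jobs.size()) break;
        list.jobs[i]();
        ++executed;
    }
    return executed;
}
```

The job list must be fully built before any worker starts draining it. The main thread then only has to wake the workers and wait, which matches its role as a lightweight sequencer.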

13 HOW NITROUS GENERATES COMMANDS
[Diagram labels: Core 0, Core 1, Core 2; Cmd Buffer 0, Cmd Buffer 1, Cmd Buffer 2, Cmd Buffer 3; Frame Data; the command buffers shown again in the order 2, 0, 1, 3]

14 NITROUS COMMAND FORMATS
In reality, diagram is oversimplified
Nitrous has its own internal command format
–Small, efficient commands
–Stateless, each command contains references to all needed state
–Inheritance unneeded
–Separates internal graphics system from any particular API
Being stateless, can be generated completely out of order
Entire frame is queued up in internal command format
Frame is translated to GPU commands via Mantle
–Nitrous command buffers are translated into Mantle command buffers in one section
Gets more optimal use out of instruction cache and data cache
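
To illustrate why a stateless format allows out-of-order generation, here is a small sketch in which each command names all the state it needs and carries a sort key. The field layout, the `sortKey` idea and the `BuildFrame` helper are assumptions made for this example; the slides do not show the actual Nitrous command layout.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stateless command: it references every piece of state it
// needs, so it never depends on the command recorded before it.
struct DrawCommand {
    std::uint32_t sortKey;         // desired position within the frame
    std::uint32_t shaderBlock;     // handle, not a pointer to mutable state
    std::uint32_t resourceSet;
    std::uint32_t constantOffset;  // byte offset into the frame memory
    std::uint32_t indexCount;
};

// Merge command lists that jobs produced in arbitrary order into one frame
// and put them into submission order. Translation into API commands would
// then walk the result in a single pass.
inline std::vector<DrawCommand> BuildFrame(
        const std::vector<std::vector<DrawCommand>>& perJobLists) {
    std::vector<DrawCommand> frame;
    for (const auto& list : perJobLists)
        frame.insert(frame.end(), list.begin(), list.end());
    std::stable_sort(frame.begin(), frame.end(),
                     [](const DrawCommand& a, const DrawCommand& b) {
                         return a.sortKey < b.sortKey;
                     });
    return frame;
}
```

Because no command inherits state from its predecessor, reordering the list cannot change what any single command draws.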

15 BUILDING AROUND ASYNCHRONICITY: HOW NITROUS THINKS OF A FRAME
Entire app should be exposed to the concept of asynchronicity
The concept of a frame:
–A set of commands which will be executed on the GPU
–A set of data which will be read by the GPU
–This concept is fundamental in Nitrous, regardless of API
[Diagram labels: Frame CMD, Frame Data, Persistent Textures, Big Transfer Buffers, Resource Sets]

16 CREATING A FRAME, USING FRAME DATA
Create 2 copies of our frame data
One will be read by the GPU, while the other is being written to by the CPU
Must use a fence to make sure the CPU doesn't get ahead
More complex situations could be explored
Frame data includes
–Constant data
–Small texture updates
[Diagram labels: Even Frame, Odd Frame, GPU, CPU]
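
The even/odd scheme can be modelled without any graphics API. In this sketch `FakeGpu` is a hypothetical stand-in for the fence (it just records the last frame the GPU finished), and `FrameRing::TryBegin` reports whether the CPU may start writing a frame or has run ahead and must wait. All names are invented for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a GPU fence: the last frame the GPU finished.
struct FakeGpu {
    std::int64_t lastCompletedFrame = -1;
};

constexpr std::size_t kBuffers = 2;  // even frame / odd frame

struct FrameRing {
    // Which frame currently owns each copy of the frame data (-1 = none).
    std::array<std::int64_t, kBuffers> frameInBuffer{{-1, -1}};

    // Copy of the frame data that the CPU writes for this frame.
    static std::size_t BufferFor(std::uint64_t frame) {
        return static_cast<std::size_t>(frame % kBuffers);
    }

    // The CPU may reuse a copy only after the GPU has finished the frame
    // that last used it. Returns false when the CPU would get ahead.
    bool TryBegin(std::uint64_t frame, const FakeGpu& gpu) {
        const std::size_t b = BufferFor(frame);
        if (frameInBuffer[b] > gpu.lastCompletedFrame) return false;
        frameInBuffer[b] = static_cast<std::int64_t>(frame);
        return true;
    }
};
```

With two copies the CPU can be at most one frame ahead of the GPU: frames 0 and 1 start immediately, while frame 2 has to wait until frame 0 has completed.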

17 STARTING OUR FRAME

    g_fTotalFrameTime = System::Time::TimeAsSeconds(System::Time::PeekTime()) - g_StartTotalFrameTime;
    g_uFrameBuffer = (g_uFrame % TransferMemory::BUFFERS);

    if(g_bProtectMemory)
        ProtectMemory(g_CmdTransferMemory.PerFrameMemory[g_uFrameBuffer].pData,
                      g_CmdTransferMemory.Capacity, PO_READWRITE);

    gr = grWaitForFences(g_Device, 1, &g_FrameFences[g_uFrameBuffer], true,
                         MAX_FRAME_WAIT_TIME_SECONDS);
    if(gr == GR_TIMEOUT) // Throw the frame out
        return false;

    g_StartTotalFrameTime = System::Time::TimeAsSeconds(System::Time::PeekTime());

    //Reset our allocation and map the current GPU memory
    HeapAndChunk HeapChunk = g_GPUTransferMemory.PerFrameMemory[g_uFrameBuffer].Memory;
    MarkMemoryChunk(HeapChunk);
    GR_GPU_MEMORY DynamicMemory = g_MemoryHeap.MemoryChunk[HeapChunk.Heap][HeapChunk.SubChunk].Memory;
    gr = grMapMemory(DynamicMemory, 0,
                     (GR_VOID**) &g_GPUTransferMemory.PerFrameMemory[g_uFrameBuffer].pData);

    g_CmdTransferMemory.PerFrameMemory[g_uFrameBuffer].Offset = 0;
    g_GPUTransferMemory.PerFrameMemory[g_uFrameBuffer].Offset = 0;
    g_GPUTransferMemory.pCurrentMantleMemoryBase = g_GPUTransferMemory.PerFrameMemory[g_uFrameBuffer].pData;
    g_GPUTransferMemory.pCurrentMantleMemoryEnd = g_GPUTransferMemory.PerFrameMemory[g_uFrameBuffer].pData + g_GPUTransferMemory.Capacity;
    g_GPUTransferMemory.CurrentMantleMemory = DynamicMemory;
    g_GPUTransferMemory.TempHeapMemory = 0;
    return true;

18 HINT
Use the memory heap that has the highest cpuWritePerfRating
In Debug, rather than copying directly to GPU memory, allocate CPU memory
–Or use pinned Mantle memory
Then, use the OS call VirtualProtect with PAGE_NOACCESS for any data that affects the frame, while the frame is being accessed by the GPU, or could be being translated by the CPU
If any part of the system inadvertently writes to the memory, it will throw an exception

19 SOME EXTRA STUFF WE WILL NEED
Because we track hazards, we will want a few more buffers
A delete queue – objects are not deleted, but placed in the delete queue
–One queue per frame; once that frame is complete, items will be deleted
A state transition queue
–Used only when a resource is created, to transition it to the desired initial state
An Unordered Command Queue
–Gets flushed before the main frame's command queue
–Useful for preparing resources for first time use (e.g. initialization)
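
The per-frame delete queue can be sketched as a small ring of callback lists. This is only an illustration of the idea described above: the class name `DeleteQueues`, the use of `std::function`, and the convention of indexing the ring with `frame % FramesInFlight` are assumptions, not details taken from Nitrous.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical deferred-deletion helper: one queue per in-flight frame.
template <std::size_t FramesInFlight>
class DeleteQueues {
    static_assert(FramesInFlight > 0, "need at least one queue");
public:
    // Instead of destroying an object right away, remember how to destroy
    // it once the GPU can no longer be using it.
    void Defer(std::size_t frame, std::function<void()> destroy) {
        queues_[frame % FramesInFlight].push_back(std::move(destroy));
    }

    // Call after the fence for `frame` has signalled. Everything queued
    // during that frame is destroyed now. Returns the number of items.
    std::size_t Flush(std::size_t frame) {
        auto& queue = queues_[frame % FramesInFlight];
        const std::size_t count = queue.size();
        for (auto& destroy : queue) destroy();
        queue.clear();
        return count;
    }

private:
    std::array<std::vector<std::function<void()>>, FramesInFlight> queues_;
};
```

A caller would flush the queue for a buffer index right after waiting on that buffer's fence, before any new deletions for the upcoming frame are queued into it.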

20 INTERNAL COMMAND FORMAT
Nitrous has its own internal command format
Persistent state:
–Resource Sets
–Shader Blocks
–Various pipeline state
Frame state, primary construct is a batch set
–Contains primitives, batches and shader sets
–Batches, which reference primitives and shader sets
–Constant references are made into our frame memory
Each one of these has a different, natural change frequency

21 NITROUS MEMORY POOLS
Resources used together, created together
Multiple resource sets are often pooled
Simplifies memory management, fewer than 1000 total allocations
[Diagram: an "Orange Team Units Memory" pool holding the resource sets FIGHTER 1, CAR. REAR, CAR. FOR and CARRIER MAIN, each with slots (0) Albedo, (1) Material Mask, (2) Ambient Occlusion, (3) Normal Map, (4) Weathering Map]

22 NITROUS MEMORY POOLS
GPU resource allocation a little tricky – we don't know ahead of time how big something might be
2 step process: first calculate size of resource, then allocate pool based on that size
Does not map 1:1 to Mantle memory allocations
Instead, pool is created with a default page size
When a new resource is added, either it is placed inside the current allocation, or, if the resource is bigger than the page size, a new allocation is created that fits the resource
A memory pool in Nitrous = a list of allocations in Mantle
If able to size ahead of time, only 1 allocation
[Diagram labels: Unit Textures, Diffuse, Specular, Mask, AO, Normal, Mantle Alloc]
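
The page-based pool can be sketched as a list of allocations with a bump offset in the newest one. This is a simplified model under stated assumptions: `MemoryPool`, `PoolAllocation` and `PoolSlot` are hypothetical names, no real GPU memory is allocated, and the case the slide leaves open (a resource smaller than a page that no longer fits the current allocation) is handled here by opening a fresh page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct PoolAllocation {  // stands in for one API-level memory allocation
    std::uint64_t size = 0;
    std::uint64_t used = 0;
};

struct PoolSlot {        // where a resource ended up
    std::size_t allocation;
    std::uint64_t offset;
};

class MemoryPool {
public:
    explicit MemoryPool(std::uint64_t pageSize) : pageSize_(pageSize) {
        assert(pageSize > 0);
    }

    // `alignment` must be a power of two.
    PoolSlot Add(std::uint64_t size, std::uint64_t alignment) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        if (!allocations_.empty()) {
            PoolAllocation& current = allocations_.back();
            const std::uint64_t offset =
                (current.used + alignment - 1) & ~(alignment - 1);
            if (offset + size <= current.size) {  // fits the current allocation
                current.used = offset + size;
                return {allocations_.size() - 1, offset};
            }
        }
        // Does not fit: open a new allocation of one page, or of exactly the
        // resource size when the resource is bigger than a page.
        PoolAllocation fresh;
        fresh.size = size > pageSize_ ? size : pageSize_;
        fresh.used = size;
        allocations_.push_back(fresh);
        return {allocations_.size() - 1, 0};
    }

    std::size_t AllocationCount() const { return allocations_.size(); }

private:
    std::uint64_t pageSize_;
    std::vector<PoolAllocation> allocations_;
};
```

If the total size is computed ahead of time and passed as the page size, every resource lands in the first allocation, which corresponds to the "only 1 allocation" case.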

23 CREATING A RESOURCE

    //Setup our resource descriptions
    ResourceDesc RedLUTDesc, BlueLUTDesc, GreenLUTDesc;
    RedLUTDesc.Init(OX_FORMAT_R32_FLOAT, 1, 1, RT_TEXTURE_1D, v3ui(1024,1,1), pfRedData, 0);
    BlueLUTDesc.Init(OX_FORMAT_R32_FLOAT, 1, 1, RT_TEXTURE_1D, v3ui(1024,1,1), pfBlueData, 0);
    GreenLUTDesc.Init(OX_FORMAT_R32_FLOAT, 1, 1, RT_TEXTURE_1D, v3ui(1024,1,1), pfGreenData, 0);

    //Calculate Size
    uint64 uSize = 0;
    uSize += GetResourceMemorySize(RedLUTDesc, Graphics::RF_SHADERRESOURCE, Graphics::HT_GPU);
    uSize += GetResourceMemorySize(BlueLUTDesc, Graphics::RF_SHADERRESOURCE, Graphics::HT_GPU);
    uSize += GetResourceMemorySize(GreenLUTDesc, Graphics::RF_SHADERRESOURCE, Graphics::HT_GPU);

    //Create a GPU heap, sized to what we want
    g_GPUMemory = AllocateHeap(uSize, Graphics::HT_GPU);

    //Create the Resources
    ColorLuts.RedLUT.Resource = CreateResource(RedLUTDesc, RF_SHADERRESOURCE, RSTATE_DEFAULT, g_GPUMemory);
    ColorLuts.BlueLUT.Resource = CreateResource(BlueLUTDesc, RF_SHADERRESOURCE, RSTATE_DEFAULT, g_GPUMemory);
    ColorLuts.GreenLUT.Resource = CreateResource(GreenLUTDesc, RF_SHADERRESOURCE, RSTATE_DEFAULT, g_GPUMemory);

24 SOME EXTRA MANAGEMENT REQUIRED
Creating a resource is slightly more involved
When a resource creation call occurs, check whether the target is a GPU heap
If so, there is no way to directly map memory and upload the resource, so
–1) Allocate, or recycle, a CPU visible heap object
–2) Create resource and map into this heap
–3) Create resource on the GPU in the specified heap (it will be uninitialized)
–4) Issue a copy command in our Unordered Command Queue
–5) Place temp resource in a deletion queue
For any resource, we allow a default state to be specified
–At beginning of frame, before we execute main commands, issue any queued state transitions to place resources from default state into desired state

25 RESOURCE SETS
In the real world, textures are grouped
Nitrous has 5 bind points
–2 for batch
–2 for shader
–1 for primitive
VB is just a resource set
Nitrous does not allow binding of individual textures
Clearly, maps 1:1 to a descriptor
[Diagram: resource set "Space Fighter 1" with slots (0) Albedo, (1) Material Mask, (2) Ambient Occlusion, (3) Normal Map, (4) Weathering Map]

26 VERTEX BUFFERS
Nitrous does not use Vertex Buffers
Instead, Resource Set acts as VB, but with more programmatic control
Vastly simplifies engine side management
–VBs can be saved as DDS files
–Do not require a huge amount of loading code for slightly different vertex formats
–Can fold displacement maps and other geometry modifiers into Primitive Resource Set
Have not seen strong evidence on any hardware that this causes a performance issue

27 CONSTANT BUFFERS
Nitrous does not have a concept of constant buffers
Instead, all constant data is thrown out every frame
–When we render an object, CPU will generate the constants needed for that frame
–Grab a piece of the Frame Memory and write to it
Constant bindings are just references into our frame memory
But… be careful! CPU is writing straight to GPU memory. Do NOT read it back!
Evidence suggests no performance advantage of persisting constants across frames; regenerating every frame is amply fast. 100k+ batches not a problem
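
"Grab a piece of the Frame Memory and write to it" is essentially a bump allocator that is reset every frame. The sketch below models it with ordinary CPU memory standing in for the mapped GPU memory; `FrameMemory`, its `Write` method and the 16-byte default alignment are assumptions made for the example.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

class FrameMemory {
public:
    static constexpr std::size_t npos = static_cast<std::size_t>(-1);

    explicit FrameMemory(std::size_t capacity) : storage_(capacity) {}

    // Start of frame: last frame's constants are simply forgotten.
    void Reset() { offset_.store(0, std::memory_order_relaxed); }

    // Copies `size` bytes of constants into the frame memory and returns
    // their offset, which serves as the constant binding. Returns npos when
    // the frame memory is full. Use one alignment value for all writes so
    // that every returned offset stays aligned.
    std::size_t Write(const void* data, std::size_t size,
                      std::size_t alignment = 16) {
        assert(alignment > 0);
        const std::size_t padded = (size + alignment - 1) / alignment * alignment;
        // A single fetch_add claims the range, so jobs on different cores
        // always receive disjoint pieces.
        const std::size_t offset =
            offset_.fetch_add(padded, std::memory_order_relaxed);
        if (offset + padded > storage_.size()) return npos;
        std::memcpy(storage_.data() + offset, data, size);  // write only, never read back
        return offset;
    }

private:
    std::vector<unsigned char> storage_;  // stands in for mapped GPU memory
    std::atomic<std::size_t> offset_{0};
};
```

The class deliberately has no read accessor, mirroring the warning above about reading back memory that the CPU writes straight to the GPU.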

28 A BATCH IN NITROUS CONSISTS OF 4 PARTS
[Diagram: a Batch Set containing Prim 0 to Prim 2, Shader 0 and Shader 1, and Batch 0 to Batch 4. Part labels: Primitive (IB, Resources, Tri info); Shader (Resources (2), Constants (2), Shader Block); Batch (Primitive, Shader, Resources (2), Constants (2)); Batch Set (Batches, Primitives, Shaders, RTs, Blend State)]

29 DESCRIPTOR TABLE LAYOUT FOR NITROUS
Descriptor 0
–*Batch Resource Set 0
–*Batch Resource Set 1
–Batch Constants 1
–Batch Constants 2
–*Shader Resource Set 0
–*Shader Resource Set 1
–Shader Constants 0
–Shader Constants 1
–*UAV
–*Samplers (only 1 global bank)
Descriptor 1
–*Primitive VB
Dynamic Const
–Batch Constants 0

30 DESCRIPTOR BINDING STRATEGY
Remember: descriptors are just structures in GPU memory, so need to double buffer as well
Create 1 giant descriptor table, start update at beginning of frame
Recognize that we have a resource bind vector of only 9 items
Each bind vector can be built into a descriptor table, but don't need a unique one
Check to see if this bind vector has been built before (during this frame), e.g. resident in a small cache; if so, just reference it
If not, build a new descriptor table, and place in cache
Dynamic constants, batch constant 0, use grCmdBindDynamicMemoryView
–Usually, this will change every call (e.g. some part of the batch is changing or else it's the same batch)
Using grCmdBindDynamicMemoryView, for 100k batches, about 5-10k descriptors actually need to get built per frame
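
The "has this bind vector been built this frame?" check can be sketched as a hash map keyed by the 9-entry bind vector. This is a single-threaded illustration under stated assumptions: resources are identified by hypothetical 32-bit handles, `DescriptorCache` is an invented name, the hash is an FNV-1a style mix, and "building" a table is reduced to handing out the next slot index.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// The 9 resource bindings of a batch, identified by hypothetical handles.
using BindVector = std::array<std::uint32_t, 9>;

struct BindVectorHash {
    std::size_t operator()(const BindVector& v) const {
        std::uint64_t h = 1469598103934665603ull;  // FNV offset basis
        for (std::uint32_t x : v) {
            h ^= x;
            h *= 1099511628211ull;                 // FNV prime
        }
        return static_cast<std::size_t>(h);
    }
};

class DescriptorCache {
public:
    // Descriptor memory is double buffered, so the cache starts empty
    // at the beginning of every frame.
    void BeginFrame() {
        slots_.clear();
        built_ = 0;
    }

    // Returns the slot of the descriptor table for `binds`, building a new
    // one only the first time this bind vector is seen during the frame.
    std::uint32_t Get(const BindVector& binds) {
        const auto found = slots_.find(binds);
        if (found != slots_.end()) return found->second;
        const std::uint32_t slot = built_++;  // a real engine writes descriptors here
        slots_.emplace(binds, slot);
        return slot;
    }

    std::uint32_t BuiltThisFrame() const { return built_; }

private:
    std::unordered_map<BindVector, std::uint32_t, BindVectorHash> slots_;
    std::uint32_t built_ = 0;
};
```

Keeping the per-call dynamic constants out of the key, as the slide does by binding them through a dynamic memory view, is what lets many batches share one cached table.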

31 TRACKING RESOURCE USAGE
App's responsibility to track what resources get used
Simple strategy: stamp a frame number on each memory pool anytime it is bound
Traverse the complete resource list; anything which matches the current frame must be resident
Quick as long as we keep # of heaps reasonable
Important: frame # should be padded into a cache line to avoid serialization

Heap description        Last Frame Used
UI Textures intro       2401
UI Textures in Game     17204
Orange Faction Units    17204
Purple Faction Units    17204
Weapon effects          16392
Post Process RTs        17204
Terrain Heightmap       17204
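
The stamping strategy fits in a few lines. In this sketch each pool carries its own cache-line-sized stamp, binding a pool stores the current frame number, and the end-of-frame pass collects every pool whose stamp matches. `PoolStamp`, `StampPool` and `ResidentPools` are hypothetical names, and the 64-byte line size is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One stamp per memory pool, padded to an assumed 64-byte cache line so
// that stamping different pools from different cores does not serialize.
struct alignas(64) PoolStamp {
    std::atomic<std::uint64_t> lastFrameUsed{0};
};

// Called whenever a pool is bound. All threads write the same value for a
// given frame (stamp mode), so a relaxed store is sufficient.
inline void StampPool(PoolStamp& pool, std::uint64_t frame) {
    pool.lastFrameUsed.store(frame, std::memory_order_relaxed);
}

// End of frame: every pool stamped with the current frame must be resident.
inline std::vector<std::size_t> ResidentPools(
        const std::vector<PoolStamp>& pools, std::uint64_t frame) {
    std::vector<std::size_t> resident;
    for (std::size_t i = 0; i < pools.size(); ++i)
        if (pools[i].lastFrameUsed.load(std::memory_order_relaxed) == frame)
            resident.push_back(i);
    return resident;
}
```

With the table above and a current frame of 17204, the intro UI textures (2401) and the weapon effects (16392) would be left out of the list.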

32 DEALING WITH STATE TRANSITIONS
Most important, difficult part of Mantle
Must understand anytime a resource is getting used in a different way
–Read After Write
–Write After Write
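
As a rough illustration of what "understanding" a change of use involves, the sketch below remembers the last access to each resource and classifies the next one. It is a teaching model, not how Nitrous or Mantle tracks state: `HazardTracker` is a hypothetical name, accesses are reduced to plain reads and writes, and the write-after-read case is included for completeness even though the slide lists only the other two.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

enum class Access { Read, Write };
enum class Hazard { None, ReadAfterWrite, WriteAfterWrite, WriteAfterRead };

class HazardTracker {
public:
    // Records that `resource` is about to be used with access `now` and
    // reports which hazard, if any, separates it from the previous use.
    // A real engine would issue a state transition at this point.
    Hazard Touch(std::uint32_t resource, Access now) {
        Hazard hazard = Hazard::None;
        const auto previous = last_.find(resource);
        if (previous != last_.end()) {
            if (previous->second == Access::Write)
                hazard = (now == Access::Read) ? Hazard::ReadAfterWrite
                                               : Hazard::WriteAfterWrite;
            else if (now == Access::Write)
                hazard = Hazard::WriteAfterRead;
        }
        last_[resource] = now;
        return hazard;
    }

private:
    std::unordered_map<std::uint32_t, Access> last_;
};
```

For example, rendering into a target and then sampling it as a texture is reported as `ReadAfterWrite`, the point at which an explicit transition is needed.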

33 SHADER BLOCKS
Shader Blocks
–Group of shaders with identical resources
–Key point: all shader stages grouped together
–All resources are bound to all stages
–For Mantle, need to add some extra data: Can we blend? What back buffer formats might be used? What z buffer formats might be used?
–Create a matrix of pipeline objects based on specified modes
The right pipeline object is selected based on current RT state
RTs and blend state already chunked, no extra state changes introduced

    ShaderGroup SimpleShader
    {
        ResourceSetPrimitive = VertexData;
        ConstantSetDynamic[0] = DynamicData;
        ResourceSetBatch[1] = UserTS;
        ConstantSetShader[0] = Globals;
        RenderTargetFormats = R16G16B16A16_FLOAT, R11G11B10_FLOAT;
        BlendStates = BlendOff;
        DepthTargetFormats = D32_FLOAT;
        Methods
        {
            main:
                CodeBlocks = SimpleShaders;
                VertexShader = SimpleVSShader;
                PixelShader = SimplePSShader;
            zprime:
                CodeBlocks = SimpleShaders;
                VertexShader = SimpleVSShader;
                PixelShader = BlankSimplePSShader;
        }
    }

34 CREATING SHADER BLOCKS IN MANTLE
Translate HLSL byte code to Mantle IC
–All done at compile time, have a Mantle specific executable
Creating a mapping table
–Batch has 5 bind points
–Shader has 4 bind points
–Batch Set has 1 bind point
–Primitive has 1 bind point
–Global Samplers have 1 bind point
Set up our IC so all pipeline objects use exactly the same top level descriptor

35 WHAT ABOUT THAT PRESENT?
Unlike other APIs, we do not need to, and should not, block on the present on the main thread
Instead we spawn a job, which we block against on the next present

    void PresentJob()
    {
        …
        result = grQueueSubmit(g_UniversalQueue, g_cCommandBuffers, g_CommandBuffers,
                               cMemRefs, MemRef, g_FrameFences[g_uSubmittingFrameBuffer]);

        uint32 PresentFlags = 0;
        if(g_bVSync)
            PresentFlags = GR_WSI_WIN_PRESENT_FLIP_DONOTWAIT;

        // instruct the GPU to present the backbuffer in the application's window
        GR_WSI_WIN_PRESENT_INFO presentInfo = { g_hWnd,
                                                g_MasterResourceList.Images[DR_BACKBUFFER],
                                                GR_WSI_WIN_PRESENT_MODE_BLT,
                                                0,
                                                PresentFlags };
        result = grWsiWinQueuePresent(g_UniversalQueue, &presentInfo);
        SignalProcessAndPresentDone(pInfo);
    }

36 WHAT OUR FRAME SUBMISSION LOOKS LIKE
1) Block on last frame's present job (i.e. NOT the fence, the actual job we spawned)
2) Process any pending resource transitions from newly created resources
3) Generate all pending unordered commands, by generating into 1 or more cmd buffers
4) Send signals to the issuers of unordered commands, to notify them the commands are submitted
5) Begin translation of Nitrous cmds into Mantle cmds – usually jobs across all cores
6) Flush the deletion queues for this frame (likely a few frames old at this point)
7) Any item in our master deletion queue, add to the now empty deletion queue for this frame
8) Handle memory readbacks
9) Spawn Present job

37 FUTURE WORK
Now have explicit control over multi GPU
Can write better MGPU solutions, like split screen, which will not increase latency
–We just got rid of a bunch of latency, don't want to add it back!
Asymmetric GPU use situations are doable, e.g. using integrated graphics in tandem with a discrete GPU

38 RESULTS
Star Swarm surprised both Oxide and AMD
–We were not expecting to see cases where application was % faster, still room for optimizations
–Right now, we are clearly GPU bound; will release an update soon that increases CPU utilization a little bit to optimize GPU, expecting 10-20% more performance out of Mantle on high end GPUs
Driver overhead very consistent, well correlated to number of calls made
About 2 man months of work
–For an alpha API, likely 1 month if final version
Especially telling on slower CPUs, surprising number of cases with high end GPUs with old CPUs
Try for yourself: Star Swarm is free to download on Steam!

