Presentation on theme: "Inside Xbox One Martin Fuller Xbox Advanced Technology Group AMD AND MICROSOFT GAME DEVELOPER DAY - June 2 2014, STOCKHOLM."— Presentation transcript:
1 Inside Xbox One Martin Fuller Xbox Advanced Technology Group AMD AND MICROSOFT GAME DEVELOPER DAY - June , STOCKHOLM
2 NDA This is a non-NDA event That means there is a limit to how much I can say, go easy!
3 CPU AMD Jaguar (x64) - 8-cores arranged in 2x clusters of 4 cores each 1.75 GHzDual issueOut of order executionSpeculative executionStore-to-load forwardingSSE4.2 and AVX(Dot product!)16 x 256-bit wide floating point registersHardware pre-fetch
4 Memory 8 GiB of DDR3 at 68 GiB/s 48-bit virtual address space Low latencyNot enough bandwidth to touch all of memory a frame, RAM as a super fast cache48-bit virtual address space256 terabytesTricky to fragment!Synced between CPU and GPU4 MiB of L2 cache2 MiB per clusterMOESI protocol for cache coherency16-way set associativePer core, up to eight cache requests in flight at once
5 CPU – RecommendationsStore to load forwarding saves the dreaded LHS stallBut not spilling out registers is even betterThe branch predictor is not a crystal ballBranchless tricks learnt in Xbox 360 era can still applyHardware data pre-fetch is awesomeOnly works with arraysAvoid aliasing load/stores on 2KiB alignmentsThis causes a false positive that delays load executionGo wide with SSE and leverage all coresNo brainer
6 GPU AMD GCN 768-SPU 853 MHz 32 MiB of ESRAM at 109 GiB/s 4 Move Engines3 hardware display planesResolution independentFrame rate independentExact sRGB this time!(oh, and its free)Hardware video encode and decodeHDMI 1.4a in and out
7 Move Engines More than just DMA copy Memory set Texture swizzle JPEG decompressLZ compress and decompress
8 ESRAM 32MiB of general purpose RAM Not like EDRAM on Xbox 360 109 GiB/sSometimes faster in practice!Zero contentionNot shared with CPU, SRA’s or video outESRAM makes everything betterRender targetsTexturesGeometryCompute tasks
9 ESRAM – Sometimes faster in practice? ESRAM can handle concurrent read/writes:Increasing effective bandwidth above 109 GiB/sOperations that can take advantage of this:Read modify write operationsDepth buffer / HTILE updateAlpha blendingOh, and concurrently DMA’ing resources in/out of ESRAM while also renderingHow much effective bandwidth can titles achieve?The current record holder achieved 141 GiB/s from ESRAM (this is a post processing pass in a real title)Of course all titles combine ESRAM’s >= 109 GiB/s with DRAM’s 68 GiB/s
10 ESRAM – The Four Stages of Adoption Statically allocate a small number of render targets in ESRAMAlias the same memory for re-use laterPartial residencyPut the top strip of render targets (sky) in DRAM, the rest in ESRAMAsynchronously DMA resources in/out of ESRAMLaunch titles were at 1 - 22nd wave of titles are now starting to tackle points 3 and/or 43rd+ wave will get really good at this!
11 ESRAM – Memory Maps! It’s like 8 bit days all over again! (Sort of)Plan the asynchronous movesMove resources in/out asynchronous while also renderingNew memory map at each stage of the render pipelineDon’t forget, swizzle textures on DMA
12 Then use async compute! Maxing out the GPU Are you bandwidth limited? Have you maxed out the fixed function hardware?Do you have spare compute resource?Then use async compute!Titles have barely scratched the surface yet:Watch this space!
13 The usual GPU recommendations Use ESRAMFirst for depth / stencilThen colour targetsThen everything elseSort by state / shader / use hardware instancing(Batch batch batch!)Always swizzle texturesBe wary of using too many general purpose registersKeep an eye on occupancy in PIX, we normally recommend >= 4Avoid reading DRAM via the CPU-coherent busThere is no hardware integer divideExcellent information in Emil’s talk
14 Graphics APIDX11 was designed for the desktop (a long time ago, 2008!)Abstracts a variety of different GPU architecturesManages VRAM residency for youOver subscribing VRAM is a serious performance pitfallHandles hazardsDevelopers can handle these at a higher level => less costXbox One will run vanilla DX11 PC codeEasy portExtensions available for low level access
15 Graphics API DX11.X (Xbox specific, not the DX12 API) Some DX12 features available right now on Xbox:Turn off hazard trackingSimple fence APIDeferred contexts re-implementedNew resource descriptor modelDraw bundles(Xbox specific, not the DX12 API)
16 DRAM - ContentionThe CPU cannot saturate DRAM bandwidth on its own, the GPU can!Significant performance degradation from DRAM contentionFancy CPU features don’t help if memory starved10. Use ESRAM as much as possible20. Leave DRAM for the CPU and DMA30. goto 10;
17 DRAM – Love your bandwidth Hardware data cache pre-fetch units are awesomeManual pre-fetch is near pointless once hardware pre-fetch is spinningWasting bandwidth if only operating on small arraysWrite combined memory pages and SSE streaming store instructions by-pass the cacheNo load - halves the bandwidth consumed by the CPUPack your data!Expanding / compressing data is cheap (CPU & GPU)F16C (half <-> float) CPU instructionsStore to load forwarding avoids LHS stallsSwizzle your texturesMove engines can swizzle on copyConsider applying Data Orientated Design where possible
18 Audio Custom audio hardware Nuff said Very fast Lots of features Kinda cool!Nuff said
19 3x Operating Systems ERA SRA Hypervisor Exclusive Resource Allocation Only one active at a timeCustom OS(Games!)SRAShared Resource AllocationWin8 core(Apps)HypervisorSRA and ERA use different virtual address space
20 PLM (Program Lifetime Management) ERA can be in one of several statesFull screenFull resources (even with snapped app up)Constrained (Windowed)Slightly less CPU and GPU resourceNo inputSame amount of memorySuspendedZero CPU and GPU resourceLimited time to save after receiving a suspend message
22 Streaming install 6x Bluray = ~26 MiB/s Too long to wait… bored now… To install a 50 GiB Bluray at ~26 MiB/s = ~33 minutesToo long to wait… bored now…Game must start after an initial payload has been installed.When running title can hint as to what to install next.No direct access to Bluray.Could be digital downloadIt’s obvious but I’ll say it anyway – compress you assets!
23 The Cloud Cloud compute: Live services: Secure! Developer’s code is hosted and executed in Windows AzureGame code execution automatically scales based upon usageLive services:Stats, analytics, matchmaking & storage.Secure!
24 Challenges Is your code 64-bit compliant? Can you scale to 6 cores? Adopt new DX11.X API extensionsManage your own resource hazardsMake sure you use ESRAM effectivelyPackage content for streaming installGame design considerationsQuick save on ERA terminationKinect, SmartglassCloud services
25 Thank You! – Questions?(That I’m allowed to answer)
Your consent to our cookies if you continue to use this website.