Inside Xbox One Martin Fuller Xbox Advanced Technology Group AMD AND MICROSOFT GAME DEVELOPER DAY - June 2 2014, STOCKHOLM.

Presentation transcript:

Inside Xbox One Martin Fuller Xbox Advanced Technology Group AMD AND MICROSOFT GAME DEVELOPER DAY - June 2 2014, STOCKHOLM

NDA: This is a non-NDA event. That means there is a limit to how much I can say, so go easy!

CPU: AMD Jaguar (x64). 8 cores arranged in 2 clusters of 4 cores each. 1.75 GHz. Dual issue. Out-of-order execution. Speculative execution. Store-to-load forwarding. SSE4.2 and AVX (dot product!). 16 x 256-bit wide floating point registers. Hardware pre-fetch.

Memory: 8 GiB of DDR3 at 68 GiB/s. Low latency. Not enough bandwidth to touch all of memory in a frame: treat RAM as a super fast cache. 48-bit virtual address space: 256 TiB, tricky to fragment! Synced between CPU and GPU. 4 MiB of L2 cache: 2 MiB per cluster; MOESI protocol for cache coherency; 16-way set associative; per core, up to eight cache requests in flight at once.
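To put "not enough bandwidth to touch all of memory in a frame" into numbers, here is a rough Python sketch. It assumes the quoted 68 GiB/s peak is fully achievable and uses 30 and 60 fps as example targets (the frame rates are assumptions, not figures from the slide). The helper names are hypothetical.

```python
GIB = 1024 ** 3
TIB = 1024 ** 4

def bytes_per_frame(bandwidth_gib_s, fps):
    """Bytes that can be moved in one frame at the given peak bandwidth."""
    return bandwidth_gib_s * GIB / fps

def fraction_of_memory_per_frame(bandwidth_gib_s, fps, memory_gib):
    """Share of total memory that could be touched once per frame, at best."""
    return bytes_per_frame(bandwidth_gib_s, fps) / (memory_gib * GIB)

def address_space_tib(bits):
    """Size in TiB of a virtual address space with the given number of address bits."""
    return 2 ** bits // TIB

if __name__ == "__main__":
    for fps in (30, 60):  # assumed frame-rate targets
        share = fraction_of_memory_per_frame(68, fps, 8)
        print(f"{fps} fps: {share:.1%} of 8 GiB per frame at best")
    print(f"48-bit address space: {address_space_tib(48)} TiB")
```

Even under these optimistic assumptions, only roughly 28% of the 8 GiB can be touched per frame at 30 fps and roughly 14% at 60 fps, which is why the slide suggests thinking of RAM as a cache.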

CPU – Recommendations: Store-to-load forwarding saves the dreaded LHS (load-hit-store) stall, but not spilling out registers is even better. The branch predictor is not a crystal ball: branchless tricks learnt in the Xbox 360 era can still apply. Hardware data pre-fetch is awesome, but it only works with arrays. Avoid aliasing loads/stores on 2 KiB alignments: this causes a false positive that delays load execution. Go wide with SSE and leverage all cores: a no-brainer.
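As an illustration of the kind of "branchless trick" the slide refers to, here is a minimal sketch of a mask-based select on 32-bit unsigned values. Python is used only to show the arithmetic. Any real benefit comes from compiled code, where compilers may already emit conditional moves on their own. The function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def select_branchless(cond, a, b):
    """Return a if cond else b for 32-bit unsigned values, using a bit mask."""
    mask = -int(bool(cond)) & MASK32  # all ones when cond is true, zero otherwise
    return (a & mask) | (b & ~mask & MASK32)

def min_branchless(a, b):
    """Minimum of two 32-bit unsigned values without an if/else on the data."""
    return select_branchless(a < b, a, b)

def max_branchless(a, b):
    """Maximum of two 32-bit unsigned values without an if/else on the data."""
    return select_branchless(a > b, a, b)

if __name__ == "__main__":
    print(min_branchless(3, 7), max_branchless(3, 7))
```

The comparison result is turned into a mask and both candidates are always computed, so there is no data-dependent branch for the predictor to get wrong.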

GPU: AMD GCN, 768 SPUs, 853 MHz. 32 MiB of ESRAM at 109 GiB/s. 4 Move Engines. 3 hardware display planes: resolution independent, frame rate independent. Exact sRGB this time! (Oh, and it's free.) Hardware video encode and decode. HDMI 1.4a in and out.

Move Engines: More than just DMA copy. Memory set. Texture swizzle. JPEG decompress. LZ compress and decompress.

ESRAM: 32 MiB of general purpose RAM, not like EDRAM on Xbox 360. 109 GiB/s, sometimes faster in practice! Zero contention: not shared with CPU, SRAs or video out. ESRAM makes everything better: render targets, textures, geometry, compute tasks.

ESRAM – Sometimes faster in practice? ESRAM can handle concurrent reads/writes, increasing effective bandwidth above 109 GiB/s. Operations that can take advantage of this: read-modify-write operations; depth buffer / HTILE update; alpha blending. Oh, and concurrently DMA'ing resources in/out of ESRAM while also rendering. How much effective bandwidth can titles achieve? The current record holder achieved 141 GiB/s from ESRAM (this is a post-processing pass in a real title). Of course, all titles combine ESRAM's >= 109 GiB/s with DRAM's 68 GiB/s.
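The figures on this slide can be compared with a little arithmetic. The sketch below is only a back-of-the-envelope calculation. The combined figure assumes ESRAM and DRAM both run at their quoted rates at the same time, which the slide does not claim. The function names are hypothetical.

```python
ESRAM_PEAK_GIB_S = 109  # quoted ESRAM bandwidth
DRAM_PEAK_GIB_S = 68    # quoted DRAM bandwidth

def concurrency_gain(measured_gib_s, quoted_gib_s=ESRAM_PEAK_GIB_S):
    """Ratio of measured effective bandwidth to the quoted figure."""
    return measured_gib_s / quoted_gib_s

def combined_gib_s(esram_gib_s=ESRAM_PEAK_GIB_S, dram_gib_s=DRAM_PEAK_GIB_S):
    """Naive sum of ESRAM and DRAM bandwidth (assumes no interaction between them)."""
    return esram_gib_s + dram_gib_s

if __name__ == "__main__":
    print(f"record pass vs quoted ESRAM rate: {concurrency_gain(141):.2f}x")
    print(f"naive combined ESRAM + DRAM: {combined_gib_s()} GiB/s")
```

The 141 GiB/s record works out to roughly 1.29 times the quoted ESRAM rate.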

ESRAM – The Four Stages of Adoption: 1. Statically allocate a small number of render targets in ESRAM. 2. Alias the same memory for re-use later. 3. Partial residency: put the top strip of render targets (sky) in DRAM, the rest in ESRAM. 4. Asynchronously DMA resources in/out of ESRAM. Launch titles were at stages 1 - 2. The 2nd wave of titles are now starting to tackle stages 3 and/or 4. The 3rd+ wave will get really good at this!

ESRAM – Memory Maps! It's like the 8-bit days all over again! (Sort of.) Plan the asynchronous moves: move resources in/out asynchronously while also rendering. New memory map at each stage of the render pipeline. Don't forget: swizzle textures on DMA.
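One way to picture a per-stage memory map with aliasing is a small offline planner. The sketch below is a hypothetical first-fit allocator, not anything from the Xbox One SDK. Each resource has a size and a lifetime (first and last render pass, an assumed way of describing pipeline stages). Resources whose lifetimes do not overlap may share the same ESRAM range, and anything that does not fit is left in DRAM.

```python
ESRAM_BYTES = 32 * 1024 * 1024
MIB = 1024 * 1024

def plan_esram(resources, capacity=ESRAM_BYTES):
    """Assign ESRAM offsets to resources.

    resources: list of (name, size_bytes, first_pass, last_pass).
    Returns {name: offset}, with None for resources that stay in DRAM.
    """
    placed = []  # (offset, size, first_pass, last_pass)
    plan = {}
    # Place the largest resources first; ties keep their input order.
    for name, size, first, last in sorted(resources, key=lambda r: -r[1]):
        # Only resources alive during the same passes can conflict.
        live = sorted((o, s) for o, s, f, l in placed
                      if not (l < first or f > last))
        offset = 0
        for o, s in live:
            if offset + size <= o:
                break  # fits in the gap before this block
            offset = max(offset, o + s)
        if offset + size <= capacity:
            placed.append((offset, size, first, last))
            plan[name] = offset
        else:
            plan[name] = None
    return plan

if __name__ == "__main__":
    # Sizes and pass numbers are made up for illustration.
    frame = [
        ("depth", 8 * MIB, 0, 2),
        ("gbuffer", 16 * MIB, 0, 1),
        ("hdr", 16 * MIB, 2, 3),
        ("shadow", 8 * MIB, 0, 0),
    ]
    for name, offset in plan_esram(frame).items():
        where = "DRAM" if offset is None else f"ESRAM @ {offset // MIB} MiB"
        print(f"{name}: {where}")
```

In this made-up frame the HDR target reuses the range the G-buffer occupied in earlier passes, which corresponds to stage 2 (aliasing) of the adoption list. A real plan would also have to schedule the DMA moves between stages.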

Maxing out the GPU: Are you bandwidth limited? Have you maxed out the fixed function hardware? Do you have spare compute resource? Then use async compute! Titles have barely scratched the surface yet: watch this space!

The usual GPU recommendations: Use ESRAM: first for depth / stencil, then colour targets, then everything else. Sort by state / shader / use hardware instancing (batch, batch, batch!). Always swizzle textures. Be wary of using too many general purpose registers: keep an eye on occupancy in PIX, we normally recommend >= 4. Avoid reading DRAM via the CPU-coherent bus. There is no hardware integer divide. Excellent information in Emil's talk.
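The "sort by state / shader / use hardware instancing" advice can be sketched on the CPU side as follows. This is a minimal, API-agnostic illustration: the tuple layout and function names are assumptions, and a real renderer would submit each batch as one instanced draw through its graphics API.

```python
from itertools import groupby

def batch_draws(draws):
    """Group draws that share shader, state and mesh into instanced batches.

    draws: list of (shader, state, mesh, transform) tuples.
    Returns a list of (shader, state, mesh, [transforms]).
    """
    key = lambda d: d[:3]
    ordered = sorted(draws, key=key)  # stable sort keeps submission order per batch
    return [(shader, state, mesh, [d[3] for d in group])
            for (shader, state, mesh), group in groupby(ordered, key=key)]

def count_shader_switches(sequence):
    """Count how often the shader (first tuple element) changes in a sequence."""
    switches = 0
    previous = None
    for item in sequence:
        if item[0] != previous:
            switches += 1
            previous = item[0]
    return switches

if __name__ == "__main__":
    # Scene contents are made up for illustration.
    draws = [
        ("lit", "opaque", "rock", 0),
        ("unlit", "alpha", "leaf", 1),
        ("lit", "opaque", "rock", 2),
        ("lit", "opaque", "tree", 3),
        ("unlit", "alpha", "leaf", 4),
    ]
    batches = batch_draws(draws)
    print(len(draws), "draws ->", len(batches), "batches")
    print("shader switches:", count_shader_switches(draws),
          "->", count_shader_switches(batches))
```

Sorting first means draws that share a shader end up adjacent, and identical shader/state/mesh triples collapse into a single batch with a list of per-instance transforms.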

Graphics API: DX11 was designed for the desktop (a long time ago, 2008!). It abstracts a variety of different GPU architectures. It manages VRAM residency for you: over-subscribing VRAM is a serious performance pitfall. It handles hazards: developers can handle these at a higher level => less cost. Xbox One will run vanilla DX11 PC code: easy port. Extensions available for low level access.

Graphics API DX11.X: Some DX12 features available right now on Xbox: turn off hazard tracking; simple fence API; deferred contexts re-implemented; new resource descriptor model; draw bundles (Xbox specific, not the DX12 API).

DRAM – Contention: The CPU cannot saturate DRAM bandwidth on its own, but the GPU can! Significant performance degradation from DRAM contention. Fancy CPU features don't help if memory starved. 10. Use ESRAM as much as possible. 20. Leave DRAM for the CPU and DMA. 30. goto 10;

DRAM – Love your bandwidth: Hardware data cache pre-fetch units are awesome: manual pre-fetch is near pointless once hardware pre-fetch is spinning, and wastes bandwidth if only operating on small arrays. Write-combined memory pages and SSE streaming store instructions bypass the cache: no load, which halves the bandwidth consumed by the CPU. Pack your data! Expanding / compressing data is cheap (CPU & GPU): F16C (half <-> float) CPU instructions. Store-to-load forwarding avoids LHS stalls. Swizzle your textures: move engines can swizzle on copy. Consider applying Data-Oriented Design where possible.
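The "pack your data" point can be illustrated with half-precision floats. The sketch below uses Python's standard `struct` module (format code `e` is IEEE 754 half precision) purely to show the size saving. On the console the conversion would be done by the F16C instructions the slide mentions, not by code like this. The function names are hypothetical.

```python
import struct

def pack_half(values):
    """Pack floats as little-endian 16-bit half-precision values."""
    return struct.pack(f"<{len(values)}e", *values)

def pack_float(values):
    """Pack floats as little-endian 32-bit single-precision values."""
    return struct.pack(f"<{len(values)}f", *values)

def unpack_half(data):
    """Expand packed half-precision values back to Python floats."""
    return list(struct.unpack(f"<{len(data) // 2}e", data))

if __name__ == "__main__":
    # Example values chosen to be exactly representable in half precision.
    normals = [0.0, 0.5, -1.0, 0.25]
    print(len(pack_float(normals)), "bytes as float32")
    print(len(pack_half(normals)), "bytes as float16")
    print(unpack_half(pack_half(normals)))
```

Half precision halves the bytes moved per value at the cost of range and precision, so it suits data such as normals or colours better than world-space positions.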

Audio: Custom audio hardware. Very fast. Lots of features. Kinda cool! Nuff said.

3x Operating Systems: ERA (Exclusive Resource Allocation): only one active at a time; custom OS (games!). SRA (Shared Resource Allocation): Win8 core (apps). Hypervisor. SRA and ERA use different virtual address spaces.

PLM (Program Lifetime Management): ERA can be in one of several states. Full screen: full resources (even with a snapped app up). Constrained (windowed): slightly less CPU and GPU resource; no input; same amount of memory. Suspended: zero CPU and GPU resource; limited time to save after receiving a suspend message.

Kinect 2.0: Hardware: higher resolution colour and depth; better ranges; new: infrared!; microphone array; no tilt motor. Software: improved skeletal tracking; improved biometrics.

Streaming install: 6x Blu-ray = ~26 MiB/s. Installing a 50 GiB Blu-ray at ~26 MiB/s takes ~33 minutes. Too long to wait… bored now… The game must start after an initial payload has been installed. When running, the title can hint as to what to install next. No direct access to the Blu-ray: it could be a digital download. It's obvious but I'll say it anyway: compress your assets!
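The install-time figure on this slide is straightforward arithmetic, sketched below. The compression ratio parameter and the 4 GiB initial payload are illustrative assumptions, not figures from the talk. The sketch also assumes decompression never becomes the bottleneck.

```python
def install_minutes(size_gib, rate_mib_s, compression_ratio=1.0):
    """Minutes needed to read size_gib of content at rate_mib_s.

    compression_ratio is uncompressed size / compressed size (1.0 = uncompressed).
    """
    compressed_mib = size_gib * 1024 / compression_ratio
    return compressed_mib / rate_mib_s / 60

if __name__ == "__main__":
    print(f"full 50 GiB disc: {install_minutes(50, 26):.1f} min")
    print(f"same content at 2:1 compression: {install_minutes(50, 26, 2.0):.1f} min")
    print(f"assumed 4 GiB initial payload: {install_minutes(4, 26):.1f} min")
```

Under these assumptions, a small initial payload gets the player into the game in a few minutes instead of about half an hour, and every gain in compression ratio shortens both waits proportionally.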

The Cloud: Cloud compute: developer's code is hosted and executed in Windows Azure; game code execution automatically scales based upon usage. Secure! Live services: stats, analytics, matchmaking & storage. Secure!

Challenges: Is your code 64-bit compliant? Can you scale to 6 cores? Adopt new DX11.X API extensions: manage your own resource hazards. Make sure you use ESRAM effectively. Package content for streaming install. Game design considerations: quick save on ERA termination; Kinect, SmartGlass; cloud services.

Thank You! – Questions? (That I’m allowed to answer)