“The Slow Game Of Life” Allan Murphy, Senior Software Development Engineer, XNA Developer Connection, Microsoft. Understanding Performance: Consumer Hardware And Performance Coding

Hello So… who exactly am I? And what am I doing here? Firstly, hands up who… has heavily optimized an application; hasn't, and doesn't care; is actually here and alive; is hungover and mainly hoping for the answers for the group assignment.

Hello Duncan let me speak today because… career spent on performance hardware; experience with a variety of consoles; have managed teams building game engines; still have those photos of Duncan (with Doug, the West Highland Terrier). I did my degree at Strathclyde: computer architecture, low level programming.

Will Optimize For Money

Previous Experience Did database analysis, hated it. Worked in telecoms, hated it. Moved to 3 person game company… "Until I could find a proper job". It's not all about me (except this bit): Strathclyde, The Game Of Life assignment. Left Strathclyde; immediately paid enormous fortune; didn't wear a suit, worked in games; bought first Ferrari 3 months after Uni; had more than 1 girlfriend.

Previous Experience 2 years PC engine development: 2D 640x480 bitmap graphics; C, C++, 80x86 (486, Pentium). 3 years at Sony: PS1 3rd party support and game dev; C, C++, MIPS R3000. 3 years at game developer in Glasgow: PS1 engine development; C, C++, MIPS R3000.

Previous Experience 6 years owning own developer: PS1, PS2, GC, Xbox 1, PC development; C, C++, MIPS R4400, VU assembly, HLSL. 2 years at Eurocom: PS3, 360, PC; C, C++, PowerPC, SPU assembly. 2 years at Microsoft: Xbox 360, some Windows; C, C++, PowerPC, HLSL.

Previous Experience Fair amount of optimization experience. Part of XDC group at Microsoft: 3rd party developer support group. Visited 60+ game developers: performance reviews, consultancy, sample code, bespoke coding.

Previous Experience “All this will go away soon”: 1992, multiplying by 320 in x86 assembler. Surely it should, because… processor power increasing; processor cost reducing; compilers getting better.

Console Hardware

Console hardware is about… Maximum performance …for minimum cost Often CPUs are… Cut down production processors Have bespoke processing hardware added Eg vector processing units Attached to cheap memory and peripherals Consoles are sold at a loss

80x86 PC (circa mid-90s) [Block diagram: Pentium Pro 200Mhz with 8Kb L1 and FPU + MMX; 512Kb L2 cache; main memory; graphics card with VRAM attached over AGP, out to monitor. Somewhat abstracted.]

PS1 [Block diagram: MIPS R3000 at ~33Mhz with I$ and D$, plus GTE and MDEC; 2Mb main memory; GPU with 1Mb VRAM out to telly.]

Xbox 1 [Block diagram: Pentium III 733Mhz with L1 cache and FPU + MMX + SSE; 128Kb L2 cache; 64Mb UMA main memory; nVidia NV2A out to telly.]

PS2 [Block diagram: EE (MIPS R5900 at ~300Mhz) with I$, D$, S-Pad and FPU + MMX; VIF0/VU0 and VIF1/VU1 with local memories, feeding the GIF; 32Mb main memory; GS with 4Mb VRAM out to telly.]

Xbox 360 [Block diagram: three PowerPC cores, each with L1 cache and FPU + VMX, sharing a 1Mb L2 cache; 512Mb UMA main memory; ATI Xenos out to telly.]

PS3 [Block diagram: Cell with PPE (L1 and shared L2 cache) and SPEs, each with local store (LS) and a DMAC; 256Mb main memory; nVidia RSX with 256Mb VRAM out to telly.]

The Sad Truth About CPU Design In which programmers have to do the hard work again

This Is What You Want [Diagram: a ridiculously fast CPU, connected by a very wide, very fast bus, to very BIG, very fast main memory.]

CPUs Not Getting Faster… [Diagram: Core 0, Core 1 and Core 2 all attached to main memory, with a question mark over how they share it.]

Fast Memory is Expensive… [Diagram: Core 0, Core 1 and Core 2 sharing a cache in front of main memory.]

This Is What You Get… [Diagram: Core 0, Core 1 and Core 2, each with its own L1 cache, NCU (NCU 0, NCU 1, NCU 2), store queue, load queue and store-gather hardware, all sharing an L2 cache with RC machines, in front of main memory.]

Multicore Strategy Multicore is the future of performance Scenario forced on unwilling game developers Not necessarily a happy marriage Game systems often highly… temporally connected, intertwined Game devs often from single thread background Some tasks easy to parallelize: rendering, physics, effects, animation

Multicore Strategy Single threaded: on Xbox360 and PS3, this is a bad plan. Two main threads: game logic update; renderer submission. Two main threads + fixed tasks: as above, plus fixed tasks in parallel, eg streaming, effects, audio.

Multicore Strategy Truly multi-threaded Usually a main game logic thread Main tasks sliced into independent pieces Rendering, physics, collision, effects… Scheduler controls task execution Tasks execute when preconditions met Scheduler runs task on any available unit Real trick is… Balancing scheduling Making sure tasks truly independent
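
To make the scheduler idea concrete, here is a minimal sketch in modern C++ (all names invented, and the ready queue left unguarded for brevity; an illustration of the precondition-counting idea, not any particular engine's API):

#include <atomic>
#include <functional>
#include <vector>

struct Task {
    std::function<void()> run;             // the work itself
    std::vector<Task*>    dependents;      // tasks waiting on this one
    std::atomic<int>      pendingDeps{0};  // unmet preconditions
};

class Scheduler {
public:
    void AddDependency(Task& before, Task& after) {
        before.dependents.push_back(&after);
        after.pendingDeps.fetch_add(1);
    }
    void Submit(Task& t) {
        if (t.pendingDeps.load() == 0)
            mReady.push_back(&t);          // no preconditions: ready now
    }
    // Each worker thread calls this in a loop. A real scheduler guards
    // mReady with a lock or a lock-free queue; omitted for clarity.
    bool RunOne() {
        if (mReady.empty()) return false;
        Task* t = mReady.back();
        mReady.pop_back();
        t->run();
        for (Task* d : t->dependents)              // release dependents
            if (d->pendingDeps.fetch_sub(1) == 1)  // last precondition met
                mReady.push_back(d);
        return true;
    }
private:
    std::vector<Task*> mReady;
};

Each completed task decrements the precondition count of its dependents; whichever unit observes a count reach zero pushes that task onto the ready queue, which is exactly the "tasks execute when preconditions met, on any available unit" behaviour described above.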

Multicore Strategy Problems Very hard to debug a task system… …especially at sub-millisecond resolution Balancing tasks and scheduler can be hard Slicing data and tasks into pieces is tricky Many race conditions very hard to find… …never mind debug Side effects in code not always obvious

Game Engine Concerns

Game Engine Coding Main concerns: Speed Feature set Memory usage Disc space for assets But most importantly… Speed Because this dictates game content Slow means fewer features

Game Engine Coding Speed measured in… Frames per second Or equivalently ms per frame 33.33ms in a frame at 30fps Game must perform update in this time Update all of the game’s systems Set up and submit all rendering for frame Do all of the drawing for previous frame

Game Engine Coding Critical choices for engine design Algorithms Sorting, searching, pruning calculations Rendering policy Data structuring How you bend the above around hardware Consoles have hardware acceleration… …for certain tasks …for certain data …for certain data layouts

Game Engine Coding Example: VMX instructions on Xbox360 SIMD instructions, operating on vectors Vector can be 8, 16, 32 bit values 32 bit can be float or int Multiply, add, shift, pack, unpack Great! But… No divide, sqrt, individual bit operations Only aligned loading Loading individual pieces to build a vector is expensive Possible to lose improvement easily
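
For illustration, a minimal sketch of the 4-wide style this enables, using the same intrinsics the deck's later examples use (__vector4, __vsubfp, __vmsum4fp; the aligned-array declaration syntax is an assumption about the 360 toolchain):

// Sketch: difference and 4-way dot product in two VMX instructions.
// Loads must be 16-byte aligned; building vectors one float at a time
// (as in the unoptimized example later) throws the win away.
__declspec(align(16)) float a[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
__declspec(align(16)) float b[4] = { 5.0f, 6.0f, 7.0f, 8.0f };

__vector4 va   = *(__vector4*)a;           // single aligned vector load
__vector4 vb   = *(__vector4*)b;
__vector4 diff = __vsubfp(va, vb);         // four subtracts at once
__vector4 dot  = __vmsum4fp(diff, diff);   // 4-way dot, splatted to all lanes
const float distSqr = dot.x;               // moving this back to a float
                                           // register has its own cost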

The 360 Core Remember, cheap hardware Cut down PowerPC core Missing out of order execution hardware Missing store forwarding hardware Ie, this is an in-order processor Attached to slow memory Means loading data is painful Which in turn makes data layout critical

360 Core Very commonly occurring penalties: Load Hit Store L2 cache miss Expensive instructions Branch mispredict

Load-Hit-Store (LHS) What is it? Storing to a memory location… …then loading from it very shortly after What causes LHS? Type casts, changing register set, aliasing Passing by value, or by reference Why is it a problem? On PC, bullet usually dodged by… Instruction re-ordering Store forwarding hardware
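
A concrete illustration (a sketch, not from the deck; the function name is invented): with no store-forwarding hardware, a float-to-int conversion has to round-trip through memory, because there is no direct move between the float and integer register files.

// Classic LHS on an in-order PowerPC core. The conversion result lives
// in a float register (fctiwz); to get it into an integer register the
// compiler stores it to the stack and immediately loads it back, and
// that load stalls until the store completes.
int WorldToGridCell(float worldX)
{
    return (int)worldX;   // store to stack, then load: load-hit-store
}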

L2 Miss What is it? Loading from a location not already in cache Why is it a problem? Costs ~610 cycles to load a cache line You can do a lot of work in 610 cycles What can we do about it? Hot/cold split Reduce in-memory data size Use cache coherent structures
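
As a sketch of the hot/cold split (field names invented): keep the data touched every frame packed together in one small struct, and push rarely-used data behind a pointer so it stops polluting cache lines.

// Hot/cold split sketch. Before the split, one struct mixed per-frame
// state with rarely-read debug data, so most of every fetched cache
// line was wasted bytes.
struct EnemyCold {              // rarely touched: lives elsewhere
    char  debugName[64];
    float spawnTime;
};

struct EnemyHot {               // touched every frame: packed tight
    float      x, y, z;
    float      health;
    EnemyCold* cold;            // one pointer instead of ~70 cold bytes
};
// ~6 EnemyHot entries now fit per 128-byte cache line instead of 1,
// so a linear update sweep takes far fewer L2 misses.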

Expensive Instructions What is it? Certain instructions not pipelined No other instructions issued ‘til they complete Stalls both hardware threads: high latency and low throughput What can we do about it? Know when those instructions are generated Avoid or code round those situations But only in critical places
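
For example (a hedged sketch, hypothetical function): fdiv is in this non-pipelined class, and a divide by a loop-invariant value can be hoisted into a single reciprocal multiply.

// Sketch: one fdiv up front replaces N fdivs in the hot loop; the
// per-element multiplies pipeline normally. (Small precision
// differences versus a true divide are usually fine in game code.)
void NormalizeWeights(float* weights, int count, float total)
{
    const float invTotal = 1.0f / total;   // the only divide
    for (int i = 0; i < count; ++i)
        weights[i] *= invTotal;
}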

Branch Mispredicts What is it? Mispredicting a branch causes… …CPU to discard instructions it predicted it needed …23-24 cycle delay as correct instructions fetched Why is this a problem? Misprediction penalty can… …dominate total time in tight loops …waste time fetching unneeded instructions
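
One standard cure, which the optimization example later applies: make the code branch-free so there is nothing to mispredict. A tiny sketch using the deck's own __fsel intrinsic, which selects entirely within the float pipeline:

// Branch-free clamp: __fsel(a, b, c) yields b when a >= 0, else c.
// No branch, no mispredict, and no fcmp pipeline flush.
float ClampToZero(float v)
{
    return (float)__fsel(v, v, 0.0f);   // v >= 0 ? v : 0
}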

PIX for Xbox 360

PIX Performance Investigator for Xbox For analysing various kinds of performance Rendering, file system, CPU For CPU… Several different mechanisms Stochastic sampling High level timers and counters Instruction trace

CPU Instruction Trace What is an instruction trace? CPU core set to single step mode Tools record instructions and load/store addrs 400x slower than normal execution Trace (and code) affected by: Compiler output – un-optimized / optimized Some statistics are simulated Eg cache statistics assumes Cache starts empty No other threads run and evict data

CPU Instruction Trace Instruction trace contains 5 tabs: Summary tab Top Issues tab Memory Accesses tab Source tab Functions tab

CPU Instruction Trace Summary tab Instructions executed statistics I-cache statistics D-cache statistics Very useful: cache line usage % TLB statistics Very useful: 4Kb and 64Kb page usage Very useful: TLB miss rate exceeding 1024 Instruction type histogram

Summary Tab Cache line efficiency – try for 35% minimum Executed instructions – gives notion of possible maximum speed

Top Issues Tab Major CPU penalties, by cycle cost order Includes link to: Address of instruction where penalty occurs Function in source view L2 miss and LHS normally dominate Other common penalties: Branch mispredict fcmp Expensive instructions (fdiv et al)

Top Issues Tab Cache misses Displays % of data used before eviction Load-hit-stores Displays store instruction addr, last data addr Source / destination register types Expensive instructions Location of instruction Branch mispredictions Conditional or branch target mispredict

Memory Accesses Tab Shows all memory accesses by… Page type, address, and cache line For each cache line shows… Symbol that touched the cache line most Right click gives all symbols touching the line

Source Tab Annotated source and assembly Columns show ‘penalty’ counts With hot links to more details [Screenshot: clicking a load-hit-store count brings up a dialog showing all store instructions that the load hit]

Functions Tab Per-function values of six counters: Instruction counts L2 misses, LHS, fcmp, L1 D & I cache misses All available as inclusive and exclusive Exclusive – for this function only Inclusive – this function and everything it calls

Optimization Example

Optimization Zen Perspective is king 90% of time spent in 10% of code Optimization is expensive, slow, error prone Improvement to execution speed trades against… Generality Maintainability Understandability Speed of development

Optimization Zen Ground rules for optimization Have CPU budgets in place Budget planning assists good performance Measure twice, cut once Optimize in an iterative pruning fashion Remove easiest to tackle & worst culprits first Re-evaluate timing and metrics Stop as soon as budget achieved Be sure to diagnose performance issues correctly

Optimization Example

class BaseParticle
{
public:
    …
    virtual Vector& Position()         { return mPosition; }
    virtual Vector& PreviousPosition() { return mPreviousPosition; }
    float& Intensity() { return mIntensity; }
    float& Lifetime()  { return mLifetime; }
    bool&  Active()    { return mActive; }
    …
private:
    …
    float  mIntensity;
    float  mLifetime;
    bool   mActive;
    Vector mPosition;
    Vector mPreviousPosition;
    …
};

Optimization Example

// Boring old vector class
class Vector
{
    …
public:
    float x, y, z, w;
};

// Boring old generic linked list class
template <class T>
class ListNode
{
public:
    ListNode(T* contents) : mNext(NULL), mContents(contents) {}
    void SetNext(ListNode* node) { mNext = node; }
    ListNode* NextNode() { return mNext; }
    T* Contents() { return mContents; }
private:
    ListNode* mNext;
    T* mContents;
};

Optimization Example

// Run through list and update each active particle
for (ListNode<BaseParticle>* node = gParticles; node != NULL; node = node->NextNode())
    if (node->Contents()->Active())
    {
        Vector vel;
        vel.x = node->Contents()->Position().x - node->Contents()->PreviousPosition().x;
        vel.y = node->Contents()->Position().y - node->Contents()->PreviousPosition().y;
        vel.z = node->Contents()->Position().z - node->Contents()->PreviousPosition().z;
        const float length = __fsqrts((vel.x*vel.x) + (vel.y*vel.y) + (vel.z*vel.z));
        if (length > cLimitLength)
        {
            float newIntensity = cMaxIntensity - node->Contents()->Lifetime();
            if (newIntensity < 0.0f)
                newIntensity = 0.0f;
            node->Contents()->Intensity() = newIntensity;
        }
        else
            node->Contents()->Intensity() = 0.0f;
    }

Optimization Example

// Replacement for straight C vector work
// Build 360 friendly __vector4s
__vector4 position, prevPosition;
position.x = node->Contents()->Position().x;
position.y = node->Contents()->Position().y;
position.z = node->Contents()->Position().z;
position.w = 0.0f;                  // w must be initialised before __vmsum4fp
prevPosition.x = node->Contents()->PreviousPosition().x;
prevPosition.y = node->Contents()->PreviousPosition().y;
prevPosition.z = node->Contents()->PreviousPosition().z;
prevPosition.w = 0.0f;

// Use VMX to do the calculations
__vector4 velocity    = __vsubfp(position, prevPosition);
__vector4 velocitySqr = __vmsum4fp(velocity, velocity);

// Grab the length result from the vector
const float length = __fsqrts(velocitySqr.x);

Measure First PIX Summary: 704k instructions executed; 40% L2 cache line usage. Top penalties: L2 cache 3m cycles; bctr 1.14m cycles; 696k cycles; 2x 490k cycles. Some 20.9m cycles of penalty overall. Takes 7.528ms.

Improving Original Example 1) Avoid branch mispredict #1 Ditch the zealous use of virtual Call functions just once Gives 1.13x speedup 2) Improve L2 use #1 Refactoring list to contiguous array Hot/cold split Using bitfield for active flag Gives 3.59x speedup
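
A sketch of what step 2's storage might look like (hypothetical layout; the deck's actual final layout clearly differs, e.g. the code shown later also stores a per-particle velocity and tail intensity): particles move from pointer-chased list nodes into flat, hot-only arrays, with active flags packed one bit each.

// Hypothetical contiguous, hot/cold-split particle storage with
// bitfield active flags. Updating is now a linear, cache-friendly
// sweep instead of a pointer chase through scattered list nodes.
const int cParticleCount = 4096;

struct ParticleHot {
    float x, y, z;      // position
    float px, py, pz;   // previous position
};

ParticleHot  gParticles[cParticleCount];
float        gParticleLifetime[cParticleCount];
unsigned int gParticleActiveBits[cParticleCount / 32];  // 1 bit per particle

inline bool IsActive(int i)
{
    return (gParticleActiveBits[i >> 5] >> (i & 31)) & 1u;
}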

Improving Original Example 4) Remove expensive instructions Ditch __fsqrts and compare with squares Gives 4.05x speedup 5) Avoid fcmp pipeline flush Insert __fsel() to select tail length Gives 4.44x speedup Insert 2nd fsel Now only branch on active flag remains Gives 5.0x speedup

Improving Original Example 7) Use VMX Use __vsubfp and __vmsum3fp for vector math Gives 5.28x speedup 8) Avoid branching too often Unroll the loop 4x Sticks at 5.28x speedup

Improving Original Example 9) Avoid branch mispredict #2 Read vector4 of tail intensities Build a __vector4 mask from active flags __vsel tail lengths from existing and new Write updated vector4 of tail intensities back Gives 6.01x speedup 10) Improve L2 access #2 Add __dcbt on particle array Gives 16.01x speedup

Improving Original Example 11) Improve L2 use #3 Move to short coordinates Now loading ¼ the data for positions Gives 21.23x speedup 12) Avoid unnecessary work We are now writing tail lengths for every particle Wait, we don't care about inactive particles Epiphany - don't check active flag at all Gives 23.2x speedup

Improving Original Example 13) Improve L2 use #4 Remaining L2 misses on output array __dcbt that too Tweak __dcbt offsets and pre-load Gives 39.01x speedup Check it's correct!

for (int loop = 0; loop < cParticleCount; loop += 4)
{
    __dcbt(768, &gParticles[loop]);
    __dcbt(768, &gParticleLifetime[loop]);
    __vector4 lifetimes    = *(__vector4*)&gParticleLifetime[loop];
    __vector4 newIntensity = __vsubfp(maxLifetime, lifetimes);
    const __vector4 velocity0 = gParticles[loop].Velocity();
    __vector4 lengthSqr0 = __vmsum3fp(velocity0, velocity0);
    // …calculate remaining lengths and concatenate into one __vector4, lengths
    lengths = __vsubfp(lengths, cLimitLengthSqrV);
    __vector4 lengthMask = __vcmpgtfp(lengths, zero);
    newIntensity = __vmaxfp(newIntensity, zero);
    __vector4 result = __vsel(zero, newIntensity, lengthMask);
    *(__vector4*)&gParticleTailIntensity[loop] = result;
}

Improving Original Example PIX Summary: 259k instructions executed; 99.4% L2 usage. Top penalties: ERAT Data 14k cycles; 1 LHS via 4kb aliasing; no mispredict penalties. 71k cycles of penalty overall. Takes 0.193ms.

Summary

Thanks for listening Hopefully you gathered something about: Cheap consumer hardware Multicore strategies What game engine programmers worry about How games are profiled and optimized

Q&A

© 2008 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.

Dawson’s Creek Figures Clock rate = 3.2 GHz = 3,200,000,000 cycles per second 60 fps = 53,333,333 cycles per frame 30 fps = 106,666,666 cycles per frame Dawson’s Law: average 0.2 IPC in a game title Therefore… at 60 fps, you can do 10,666,666 instructions ~= 10M at 30 fps, you can do 21,333,333 instructions ~= 21M Or put another way… how bad is a 1M-cycle penalty? It’s approx 200K instructions of quality execution going missing. 1M cycles is 1/50th (2%) of a frame at 60 fps, or 1/100th (1%) of a frame at 30 fps. 1M cycles is ~0.31 ms.