
1 “The Slow Game Of Life” Allan Murphy Senior Software Development Engineer XNA Developer Connection Microsoft Understanding Performance Consumer Hardware And Performance Coding

2 Hello So… Who exactly am I? And what am I doing here? Firstly, hands up who… Has heavily optimized an application Hasn’t, and doesn’t care Is actually here and alive Is hungover and mainly hoping for the answers for the group assignment

3 Hello Duncan let me speak today because… Career spent on performance hardware Experience with a variety of consoles Have managed teams building game engines Still have those photos of Duncan With Doug, the West Highland Terrier I did my degree at Strathclyde Computer architecture Low level programming

4 Will Optimize For Money

5 Previous Experience Did database analysis, hated it Worked in telecoms, hated it Moved to 3 person game company …“Until I could find a proper job” It’s not all about me Except this bit Strathclyde The Game Of Life assignment Left Strathclyde Immediately paid enormous fortune Didn’t wear a suit, worked in games Bought first Ferrari 3 months after Uni Had more than 1 girlfriend

6 Previous Experience 2 years PC engine development 2D 640x480 bitmap graphics C, C++, 80x86 (486, Pentium) 3 years at Sony 3 years PS1 3rd party support and game dev C, C++, MIPS R3000 2 years at game developer in Glasgow PS1 engine development C, C++, MIPS R3000

7 Previous Experience 6 years owning own developer PS1, PS2, GC, Xbox 1, PC development C, C++, MIPS R4400, VU assembly, HLSL 2 years at Eurocom PS3, 360, PC C, C++, PowerPC, SPU assembly 2 years at Microsoft Xbox 360, some Windows C, C++, PowerPC, HLSL

8 Previous Experience Fair amount of optimization experience Part of XDC group at Microsoft 3rd party developer support group Visited 60+ game developers Performance reviews Consultancy Sample code Bespoke coding

9 Previous Experience “All this will go away soon” 1992 Multiplying by 320 in x86 assembler Surely it should, because… Processor power increasing Processor cost reducing Compilers getting better
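
For a flavour of what that hand-optimization looked like: 320 = 256 + 64, so the multiply becomes two shifts and an add. A small illustrative sketch (mine, not from the talk; RowOffset is a hypothetical helper):

// Illustrative sketch: byte offset of row y in a 320-pixel-wide framebuffer,
// without a multiply. y*320 == (y<<8) + (y<<6).
static inline unsigned RowOffset(unsigned y)
{
    return (y << 8) + (y << 6);   // two shifts and an add instead of a mul
}
// The era's x86 assembler version was the same idea:
//     mov  ebx, eax    ; eax = y
//     shl  eax, 8      ; y * 256
//     shl  ebx, 6      ; y * 64
//     add  eax, ebx    ; y * 320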

10 Console Hardware

11 Console hardware is about… Maximum performance …for minimum cost Often CPUs are… Cut down production processors Have bespoke processing hardware added Eg vector processing units Attached to cheap memory and peripherals Consoles are sold at a loss

12 80x86 PC (circa mid-90s) — block diagram (somewhat abstracted): Pentium Pro at 200MHz with FPU + MMX, 8KB L1 and 512KB L2 cache; main memory; graphics card with VRAM on the AGP bus, out to the monitor.

13 PS1 — block diagram: MIPS R3000 at 33.868MHz with I$/D$, GTE and MDEC; 2MB main memory; GPU with 1MB VRAM, out to the telly.

14 Xbox 1 — block diagram: Pentium III at 733MHz with FPU + MMX + SSE, L1 and 128KB L2 cache; 64MB UMA main memory shared with the nVidia NV2A, out to the telly.

15 PS2 — block diagram: Emotion Engine (EE) with a MIPS R4400 core at 294MHz, FPU, I$/D$ and scratchpad (S-Pad); VU0 (via VIF0) and VU1 (via VIF1), each with its own memory, feeding the GIF; 32MB main memory; GS with 4MB VRAM, out to the telly.

16 Xbox 360 — block diagram: three PowerPC cores, each with L1 and FPU + VMX, sharing a 1MB L2 cache; 512MB UMA main memory shared with the ATI Xenos, out to the telly.

17 PS3 — block diagram: Cell with a PPE (L1 and L2 cache) and SPEs, each with a local store (LS) and DMAC; 256MB main memory; nVidia RSX with 256MB VRAM, out to the telly.

18 The Sad Truth About CPU Design In which programmers have to do the hard work again

19 This Is What You Want — block diagram: a ridiculously fast CPU, connected by a very wide, very fast bus to very big, very fast main memory.

20 CPUs Not Getting Faster… — block diagram: Core 0, Core 1 and Core 2, all needing to reach main memory — but how?

21 Fast Memory is Expensive… — block diagram: Core 0, Core 1 and Core 2 share a single cache sitting between them and main memory.

22 This Is What You Get… — block diagram: Core 0, Core 1 and Core 2, each with its own L1, NCU, store queue, load queue and store-gather hardware; a shared L2 cache with its RC machines; then main memory.

23 Multicore Strategy Multicore is future of performance Scenario forced on unwilling game developers Not necessarily a happy marriage Game systems often highly… Temporally connected Intertwined Game devs often from single thread background Some tasks easy to parallelize Rendering, physics, effects, animation

24 Multicore Strategy Single threaded On Xbox360 and PS3, this is a bad plan Two main threads Game logic update Renderer submission Two main threads + fixed tasks As above plus… …fixed tasks in parallel … eg streaming, effects, audio
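
A minimal sketch of the two-main-threads split (my illustration, not the talk's code; GameLoop, UpdateGameLogic and SubmitToGPU are hypothetical names): the update thread builds frame N's render commands while the render thread submits frame N-1's, from a double-buffered command list. A real engine would keep a persistent render thread and sync on events rather than spawning and joining per frame.

// Sketch: game logic update on one thread, render submission on another.
#include <functional>
#include <thread>
#include <vector>

struct RenderCommand { int sortKey; /* … */ };

void UpdateGameLogic(std::vector<RenderCommand>& out) { out.push_back({0}); }     // stub
void SubmitToGPU(const std::vector<RenderCommand>&)   { /* issue draw calls */ }  // stub

void GameLoop()
{
    std::vector<RenderCommand> commands[2];                                  // double buffer
    for (unsigned frame = 0; ; ++frame)
    {
        std::vector<RenderCommand>& build        = commands[frame & 1];       // this frame
        const std::vector<RenderCommand>& submit = commands[(frame + 1) & 1]; // last frame

        std::thread renderer(SubmitToGPU, std::cref(submit));   // draw previous frame
        build.clear();
        UpdateGameLogic(build);                                  // update in parallel
        renderer.join();                                         // meet at frame end
    }
}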

25 Multicore Strategy Truly multi-threaded Usually a main game logic thread Main tasks sliced into independent pieces Rendering, physics, collision, effects… Scheduler controls task execution Tasks execute when preconditions met Scheduler runs task on any available unit Real trick is… Balancing scheduling Making sure tasks truly independent
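
To make the precondition idea concrete, here is a minimal task-scheduler sketch (my own illustration, not code from the talk; Task and Scheduler are hypothetical): each task carries a count of unfinished dependencies, and any available worker runs tasks whose count has dropped to zero.

// Minimal task-scheduler sketch. A task becomes runnable when all of its
// preconditions have completed; on completion it releases its dependents.
#include <atomic>
#include <functional>
#include <mutex>
#include <queue>
#include <vector>

struct Task
{
    std::function<void()> work;
    std::atomic<int>      unfinishedDeps { 0 };   // preconditions not yet met
    std::vector<Task*>    dependents;             // tasks waiting on this one
};

class Scheduler
{
public:
    void Submit(Task* t)
    {
        // Tasks with outstanding preconditions are released later by their
        // dependencies; only ready tasks go on the queue.
        if (t->unfinishedDeps.load() == 0)
        {
            std::lock_guard<std::mutex> lock(mMutex);
            mReady.push(t);
        }
    }

    // Called by every worker thread / available unit.
    void RunWorker()
    {
        for (;;)
        {
            Task* t = PopReady();
            if (!t)
                break;                      // nothing runnable; real code would wait
            t->work();                      // execute the task
            for (Task* d : t->dependents)   // release anything waiting on us
                if (d->unfinishedDeps.fetch_sub(1) == 1)
                    Submit(d);
        }
    }

private:
    Task* PopReady()
    {
        std::lock_guard<std::mutex> lock(mMutex);
        if (mReady.empty()) return nullptr;
        Task* t = mReady.front();
        mReady.pop();
        return t;
    }

    std::mutex        mMutex;
    std::queue<Task*> mReady;
};

The balancing problem the slide mentions shows up immediately: if the slices are uneven or the dependency graph is too serial, workers simply sit idle.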

26 Multicore Strategy Problems Very hard to debug a task system… …especially at sub millisecond resolution Balancing tasks and scheduler can be hard Slicing data and tasks into pieces tricky Many conditions very hard to find… …never mind debug Side effects in code not always obvious

27 Game Engine Concerns

28 Game Engine Coding Main concerns: Speed Feature set Memory usage Disc space for assets But most importantly… Speed Because this dictates game content Slow means fewer features

29 Game Engine Coding Speed measured in… Frames per second Or equivalently ms per frame 33.33ms in a frame at 30fps Game must perform update in this time Update all of the game’s systems Set up and submit all rendering for frame Do all of the drawing for previous frame

30 Game Engine Coding Critical choices for engine design Algorithms Sorting, searching, pruning calculations Rendering policy Data structuring How you bend the above around hardware Consoles have hardware acceleration… …for certain tasks …for certain data …for certain data layouts

31 Game Engine Coding Example: VMX instructions on Xbox360 SIMD instructions, operating on vectors Vector can be 8, 16, 32 bit values 32 bit can be float or int Multiply, add, shift, pack, unpack Great! But… No divide, sqrt, individual bit operations Only aligned loading Loading individual pieces to build expensive Possible to lose improvement easily
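
As a flavour of the upside, a small sketch using the __vector4 type and the __vsubfp/__vmsum4fp intrinsics that appear later in this deck (DistanceSquared is a hypothetical helper; both inputs are assumed to be properly aligned __vector4s, since only aligned loads are supported):

// Sketch: squared distance between two points already held in __vector4s.
// Building the __vector4s piecewise from unaligned floats can cost back the win.
float DistanceSquared(const __vector4& a, const __vector4& b)
{
    const __vector4 d    = __vsubfp(a, b);     // component-wise subtract
    const __vector4 dSqr = __vmsum4fp(d, d);   // 4-way dot product
    return dSqr.x;                             // note: no VMX divide or sqrt
}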

32 The 360 Core Remember, cheap hardware Cut down PowerPC core Missing out of order execution hardware Missing store forwarding hardware Ie, this is an in-order processor Attached to slow memory Means loading data is painful Which in turn makes data layout critical

33 360 Core Very commonly occurring penalties: Load Hit Store L2 cache miss Expensive instructions Branch mispredict

34 Load-Hit-Store (LHS) What is it? Storing to a memory location… …then loading from it very shortly after What causes LHS? Type casts, changing register set, aliasing Passing by value, or by reference Why is it a problem? On PC, the bullet is usually dodged by… Instruction re-ordering Store forwarding hardware …neither of which the 360 core has, so the load stalls until the store completes
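
A typical way to trip it (my illustration, not the talk's; the function names are hypothetical): a float-to-int conversion has to bounce through memory on this core, and passing small values by reference forces a store followed almost immediately by a load.

// Sketch of classic load-hit-store patterns on the in-order PowerPC core.
int QuantizeLHS(float intensity)
{
    // float->int conversion goes via memory: store the float result, then
    // reload it into an integer register - the load hits the pending store.
    return (int)(intensity * 255.0f);
}

float ScaleByRef(float& x, float s)  { x *= s; return x; }  // reference can force store/reload
float ScaleByValue(float x, float s) { return x * s; }      // value stays in registers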

35 L2 Miss What is it? Loading from a location not already in cache Why is it a problem? Costs ~610 cycles to load a cache line You can do a lot of work in 610 cycles What can we do about it? Hot/cold split Reduce in-memory data size Use cache coherent structures
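
An illustration of the hot/cold split (my sketch, not the talk's code; the types and arrays are hypothetical): keep the few fields the per-frame loop touches in a small contiguous array, and move everything else to a parallel "cold" structure so each 128-byte cache line carries more useful data.

// Sketch: hot/cold split plus contiguous storage.
struct ParticleHot            // touched every frame
{
    float x, y, z;
    float intensity;
};

struct ParticleCold           // touched rarely (spawn, death, debug)
{
    float lifetime;
    unsigned spawnFrame;
    const char* debugName;
};

ParticleHot  gParticleHot[4096];   // sequential access, cache friendly
ParticleCold gParticleCold[4096];  // parallel array, only read when needed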

36 Expensive Instructions What is it? Certain instructions not pipelined No other instructions issued ‘til they complete Stalls both hardware threads high latency and low throughput What can we do about it? Know when those instructions are generated Avoid or code round those situations But only in critical places
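
For example (an illustrative sketch, not from the talk): a divide or square root in an inner loop can usually be hoisted out or removed algebraically so the non-pipelined instruction never runs per element.

// Sketch: coding round expensive, non-pipelined instructions.
void Rescale(float* v, int count, float scale)
{
    const float invScale = 1.0f / scale;   // one divide, outside the loop
    for (int i = 0; i < count; ++i)
        v[i] *= invScale;                  // cheap pipelined multiplies inside
}

// No sqrt needed just to compare lengths - compare the squares instead
// (exactly what the optimization example later in this deck does):
bool LongerThan(float dx, float dy, float dz, float limit)
{
    return (dx*dx + dy*dy + dz*dz) > (limit * limit);
}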

37 Branch Mispredicts What is it? Mispredicting a branch causes… …CPU to discard instructions it predicted it needed …23-24 cycle delay as correct instructions fetched Why is this a problem? Misprediction penalty can… …dominate total time in tight loops …waste time fetching unneeded instructions
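
One common mitigation, used later in this deck, is to replace an unpredictable floating-point branch with a select such as the compiler's __fsel intrinsic. A small sketch (my example; the function names are hypothetical):

// Branchy version: in a tight loop, each mispredict costs ~23-24 cycles.
float ClampToZeroBranchy(float x)
{
    if (x < 0.0f)
        return 0.0f;
    return x;
}

// Branchless version: __fsel(a, b, c) yields b if a >= 0.0, else c,
// so there is no branch to mispredict.
float ClampToZeroBranchless(float x)
{
    return (float)__fsel(x, x, 0.0f);
}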

38 PIX for Xbox 360

39 PIX Performance Investigator for Xbox For analysing various kinds of performance Rendering, file system, CPU For CPU… Several different mechanisms Stochastic sampling High level timers and counters Instruction trace

40 CPU Instruction Trace What is an instruction trace? CPU core set to single step mode Tools record instructions and load/store addrs 400x slower than normal execution Trace (and code) affected by: Compiler output – un-optimized / optimized Some statistics are simulated Eg cache statistics assumes Cache starts empty No other threads run and evict data

41 CPU Instruction Trace Instruction trace contains 5 tabs: Summary tab Top Issues tab Memory Accesses tab Source tab Functions tab

42 CPU Instruction Trace Summary tab Instructions executed statistics I-cache statistics D-cache statistics Very useful: cache line usage % TLB statistics Very useful: 4Kb and 64Kb page usage Very useful: TLB miss rate exceeding 1024 Instruction type histogram

43 Summary Tab Cache line efficiency – try for 35% minimum Executed instructions – gives notion of possible maximum speed

44 Top Issues Tab Major CPU penalties, by cycle cost order Includes link to: Address of instruction where penalty occurs Function in source view L2 miss and LHS normally dominate Other common penalties: Branch mispredict fcmp Expensive instructions (fdiv et al)

45 Top Issue Tab Cache misses Displays % of data used before eviction Load-hit-stores Displays store instruction addr, last data addr Source / destination register types Expensive instructions Location of instruction Branch mispredictions Conditional or branch target mispredict

46 Memory Accesses Tab Shows all memory accesses by… Page type, address, and cache line For each cache lines shows… Symbol that touched the cache line most Right click gives all symbols touching the line

47 Source Tab Annotated source and assembly Columns show ‘penalty’ counts With hot links to more details Clicking the load-hit-store count brings up a dialog showing all the store instructions that the load hit

48 Functions Tab Per-function values of six counters: Instruction counts L2 misses, LHS, fcmp, L1 D & I cache misses All available as inclusive and exclusive Exclusive – for this function only Inclusive – this function and everything it calls

49 Optimization Example

50 Optimization Zen Perspective is king 90% of time spent in 10% of code Optimization is expensive, slow, error prone Improvement to execution speed is traded against: Generality Maintainability Understandability Speed of development

51 Optimization Zen Ground rules for optimization Have CPU budgets in place Budget planning assists good performance Measure twice, cut once Optimize in an iterative pruning fashion Remove easiest to tackle & worst culprits first Re-evaluate timing and metrics Stop as soon as budget achieved Be sure to diagnose performance issues correctly

52 Optimization Example

class BaseParticle
{
public:
    …
    virtual Vector& Position()         { return mPosition; }
    virtual Vector& PreviousPosition() { return mPreviousPosition; }
    float& Intensity()                 { return mIntensity; }
    float& Lifetime()                  { return mLifetime; }
    bool&  Active()                    { return mActive; }
    …
private:
    …
    float  mIntensity;
    float  mLifetime;
    bool   mActive;
    Vector mPosition;
    Vector mPreviousPosition;
    …
};

53 Optimization Example

// Boring old vector class
class Vector
{
    …
public:
    float x, y, z, w;
};

// Boring old generic linked list class
template <class T>
class ListNode
{
public:
    ListNode(T* contents) : mNext(NULL), mContents(contents) {}
    void SetNext(ListNode* node) { mNext = node; }
    ListNode* NextNode()         { return mNext; }
    T* Contents()                { return mContents; }
private:
    ListNode* mNext;
    T* mContents;
};

54 Optimization Example

// Run through list and update each active particle
for (ListNode<BaseParticle>* node = gParticles; node != NULL; node = node->NextNode())
    if (node->Contents()->Active())
    {
        Vector vel;
        vel.x = node->Contents()->Position().x - node->Contents()->PreviousPosition().x;
        vel.y = node->Contents()->Position().y - node->Contents()->PreviousPosition().y;
        vel.z = node->Contents()->Position().z - node->Contents()->PreviousPosition().z;

        const float length = __fsqrts((vel.x*vel.x) + (vel.y*vel.y) + (vel.z*vel.z));
        if (length > cLimitLength)
        {
            float newIntensity = cMaxIntensity - node->Contents()->Lifetime();
            if (newIntensity < 0.0f)
                newIntensity = 0.0f;
            node->Contents()->Intensity() = newIntensity;
        }
        else
            node->Contents()->Intensity() = 0.0f;
    }

55 Optimization Example

// Replacement for straight C vector work
// Build 360 friendly __vector4s
__vector4 position, prevPosition;
position.x     = node->Contents()->Position().x;
position.y     = node->Contents()->Position().y;
position.z     = node->Contents()->Position().z;
prevPosition.x = node->Contents()->PreviousPosition().x;
prevPosition.y = node->Contents()->PreviousPosition().y;
prevPosition.z = node->Contents()->PreviousPosition().z;

// Use VMX to do the calculations
__vector4 velocity    = __vsubfp(position, prevPosition);
__vector4 velocitySqr = __vmsum4fp(velocity, velocity);

// Grab the length result from the vector
const float length = __fsqrts(velocitySqr.x);

56 Measure First PIX Summary 704k instructions executed 40% L2 cache line usage Top penalties L2 cache miss @ 3m cycles bctr mispredicts @ 1.14m cycles __fsqrt @ 696k cycles 2x fcmp @ 490k cycles Some 20.9m cycles of penalty overall Takes 7.528ms

57 Improving Original Example 1) Avoid branch mispredict #1 Ditch the zealous use of virtual Call functions just once Gives 1.13x speedup 2) Improve L2 use #1 Refactoring list to contiguous array Hot/cold split Using bitfield for active flag Gives 3.59x speedup

58 Improving Original Example 4) Remove expensive instructions Ditch __fsqrts and compare with squares Gives 4.05x speedup 5) Avoid fcmp pipeline flush Insert __fsel() to select tail length Gives 4.44x speedup Insert 2nd fsel Now only branch on active flag remains Gives 5.0x speedup

59 Improving Original Example 7) Use VMX Use __vsubfp and __vmsum3fp for vector math Gives 5.28x speedup 8) Avoid branching too often Unroll the loop 4x Sticks at 5.28x speedup

60 Improving Original Example 9) Avoid branch mispredict #2 Read vector4 of tail intensities Build a __vector4 mask from active flags __vsel tail lengths from existing and new Write updated vector4 of tail intensities back Gives 6.01x speedup 10) Improve L2 access #2 Add __dcbt on particle array Gives 16.01x speedup

61 Improving Original Example 11) Improve L2 use #3 Move to short coordinates Now loading ¼ the data for positions Gives 21.23x speedup 12) Avoid unnecessary work We are now writing tail lengths for every particle Wait, we don’t care about inactive particles Epiphany - don’t check active flag at all Gives 23.2x speedup
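
A sketch of what "short coordinates" might look like (my illustration; PackedPosition, UnpackPosition and cPositionScale are assumptions, not the talk's actual layout): each component becomes a 16-bit integer instead of a 32-bit float, so the position data pulled through the cache per particle shrinks accordingly.

// Sketch: quantized 16-bit particle coordinates.
struct PackedPosition
{
    short x, y, z;            // 2 bytes per component instead of 4
};

inline void UnpackPosition(const PackedPosition& p, float& x, float& y, float& z)
{
    const float cPositionScale = 1.0f / 16.0f;   // assumed fixed-point step
    x = p.x * cPositionScale;
    y = p.y * cPositionScale;
    z = p.z * cPositionScale;
}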

62 Improving Original Example 13) Improve L2 use #4 Remaining L2 misses on output array __dcbt that too Tweak __dcbt offsets and pre-load Gives 39.01x speedup Check it's correct!

63

for (int loop = 0; loop < cParticleCount; loop += 4)
{
    __dcbt(768, &gParticles[loop]);
    __dcbt(768, &gParticleLifetime[loop]);

    __vector4 lifetimes    = *(__vector4 *)&gParticleLifetime[loop];
    __vector4 newIntensity = __vsubfp(maxLifetime, lifetimes);

    const __vector4 velocity0  = gParticles[loop].Velocity();
    __vector4       lengthSqr0 = __vmsum3fp(velocity0, velocity0);
    // …calculate remaining lengths and concatenate into one __vector4, lengths

    lengths                = __vsubfp(lengths, cLimitLengthSqrV);
    __vector4 lengthMask   = __vcmpgtfp(lengths, zero);
    newIntensity           = __vmaxfp(newIntensity, zero);

    __vector4 result = __vsel(zero, newIntensity, lengthMask);
    *(__vector4 *)&gParticleTailIntensity[loop] = result;
}

64 Improving Original Example PIX Summary 259k instructions executed 99.4% L2 usage Top penalties ERAT Data Miss @ 14k cycles 1 LHS via 4kb aliasing No mispredict penalties 71k cycles of penalty overall Takes 0.193ms

65 Summary

66 Thanks for listening Hopefully you gathered something about: Cheap consumer hardware Multicore strategies What game engine programmers worry about How games are profiled and optimized

67 Q&A

68 © 2008 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary. http://www.xna.com

69 Dawson’s Creek Figures Clock rate = 3.2 GHz = 3,200,000,000 cycles per second 60 fps = 53,333,333 cycles per frame 30 fps = 106,666,666 cycles per frame Dawson’s Law: average 0.2 IPC in a game title Therefore… at 60 fps, you can do 10,666,666 instructions ~= 10M at 30 fps, you can do 21,333,333 instructions ~= 21M Or put another way… how bad is a 1M-cycle penalty? It’s approx 200K instructions of quality execution going missing. 1M cycles is about 1/50th (2%) of a frame at 60 fps, or 1/100th (1%) of a frame at 30 fps. 1M cycles is ~0.32 ms.

