August 14-15 2006 Xbox 360 CPU Performance Update (Gamefest 2006 edition) Bruce Dawson Software Design Engineer Game Technology Group Microsoft.

August 14-15 2006 Talk Purpose: Give you the tools to make CPU code faster! Topics: better understanding of the CPU, compiler limitations, compiler tricks, update on tools, assembly language. New material only.

August 14-15 2006 Important Things Not Covered. Xfest February 2006: Effective Profiler Use on Xbox 360; Efficient C++ on Xbox 360; Trace Analysis and Memory Optimization. Gamefest June 2005: CPU Performance Bottlenecks and Solutions. Xfest February 2005: Xenon CPU Pipelines; Intro to PPC and VMX 128. Xbox 360 XDK: CPU Pipeline Animator.

August 14-15 2006 System Block Diagram (figure). Labeled components: Core0, Core1, Core2; L1; 1 MB L2; CPU: 3.2 GHz; memory controller; 10 MB EDRAM; 512 MB RAM; Southbridge; 3D core; GPU: 500 MHz; DVD; HDD port; A/V output; Ethernet; MU ports; controller ports.

August 14-15 2006 CPU Block Diagram: 3 cores, 6 threads, and a 1 MB L2 cache on one chip. In-order instruction execution. Instruction latencies from 2 to 14 cycles. Pipeline flushes due to load-hit-stores, mispredicted branches, float compares, etc. Significant memory latency: ~610 cycles.

August 14-15 2006 Two stall points: IQ end (for int, load, and branch) and VQ end (for float and VMX). Other issues include microcoded instructions and flushes. Aligned pairs dispatch, each pair from one thread. Execution pipelines accept one instruction per clock. Not physically accurate: see the XDK for precise layout and timings. Pipeline latency is normally the pipeline length, with exceptions for load/store to integer, integer to load/store, and float/VMX compares.

August 14-15 2006 CD Bonus Extra Track: Least understood. Trace warning: "<4-byte write-combined write at inst 82312034, estimated cost from 532 occurrences is 10640". Writing to uncacheable write-combined memory is only efficient if the rules are followed: all writes should be four bytes or greater, naturally aligned, and in order, with no gaps or repeats. Otherwise each write may be a separate front-side-bus (FSB) transaction!

August 14-15 2006 CD Bonus Code: Slow code (detected by trace analysis)
    short* pWriteCombined =...
    pWriteCombined[0] = data0;
    pWriteCombined[1] = data1;
    pWriteCombined[2] = data2;
Combining pairs of 16-bit values into 32-bit values before writing is much faster.
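The slide above recommends combining pairs of 16-bit values into 32-bit values but does not show how. The following is a minimal sketch of one way to do it, not code from the talk. The names `PackPairBigEndian` and `WritePacked` are hypothetical, and the packing order assumes a big-endian target such as the Xbox 360 CPU, where the value destined for the lower address occupies the upper 16 bits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Pack two 16-bit values into one 32-bit word so that a single aligned
// 32-bit store replaces two 16-bit stores.
// Assumption: big-endian layout, so `first` (lower address) goes in the
// upper 16 bits of the word.
inline std::uint32_t PackPairBigEndian(std::uint16_t first, std::uint16_t second)
{
    return (static_cast<std::uint32_t>(first) << 16) | second;
}

// Hypothetical helper: writes `count` 16-bit values (count must be even)
// to `dest` as in-order, naturally aligned 32-bit stores with no gaps.
inline void WritePacked(volatile std::uint32_t* dest,
                        const std::uint16_t* src,
                        std::size_t count)
{
    assert(count % 2 == 0);
    for (std::size_t i = 0; i < count; i += 2)
        dest[i / 2] = PackPairBigEndian(src[i], src[i + 1]);
}
```

An odd trailing value would still need its own handling (for example, padding it to a full 32-bit word), which this sketch leaves out.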

August 14-15 2006 CD Bonus Code: Slow code (not detected by trace analysis, yet)
    DWORD* pWriteCombined =...
    pWriteCombined[0] = data0;
    pWriteCombined[2] = data2;
    pWriteCombined[1] = data1;
Writing in order is essential.

August 14-15 2006 CD Bonus Code: Fast code
    extern "C" void _ReadWriteBarrier();
    #pragma intrinsic(_ReadWriteBarrier)
    DWORD* pWriteCombined =...
    pWriteCombined[0] = data0;
    _ReadWriteBarrier();
    pWriteCombined[1] = data1;
    _ReadWriteBarrier();
    pWriteCombined[2] = data2;

August 14-15 2006 Finding CPU Problems/Hotspots: PIX system monitor; PIX timing capture; XbPerfview (/callcap or /fastcap); sampling profiler!?; trace analysis; custom timing code.

August 14-15 2006 Trace Analysis Update: Trace Recording records every instruction executed and every address referenced. Trace Analysis has a UI! Multiple reports are faster (one analysis pass for all reports). Source view with integrated top-issues and per-line disassembly. Coming soon: links from top-issues and memory access map to source view, etc.

August 14-15 2006 Timing: mftb. mftb is the fastest and most precise way to measure CPU execution time. Used by XbPerfView and QueryPerformanceCounter. Increments every 64 CPU cycles. ~44 cycle cost to read. Accessible in one instruction (__mftb intrinsic).

August 14-15 2006 Timing: mftb Frequency. mftb and QueryPerformanceCounter are also tempting for game timing, but mftb does not run at exactly 50.0 MHz. It actually runs at about 49.875 MHz, and varies between machines from about 49.85 to 49.90 MHz. It is always exactly 64 CPU cycles per tick. GetTickCount is more accurate over the long term.
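To make the arithmetic on this slide concrete, here is a small sketch that converts tick counts to cycles and seconds. It is an illustration only: the constant and function names are hypothetical, and the frequencies are the approximate figures quoted on the slides, not measured values.

```cpp
#include <cassert>
#include <cstdint>

// Figures taken from the slides; the frequencies are approximations.
const double kCpuHz = 3.2e9;                  // nominal CPU clock
const std::uint64_t kCyclesPerTick = 64;      // mftb increments every 64 CPU cycles
const double kNominalTickHz = kCpuHz / 64.0;  // 50.0 MHz if the clock were exact
const double kTypicalTickHz = 49.875e6;       // "about 49.875 MHz" per the slides

// Convert an mftb tick delta to CPU cycles (always exactly 64 per tick).
inline std::uint64_t TicksToCycles(std::uint64_t ticks)
{
    return ticks * kCyclesPerTick;
}

// Convert an mftb tick delta to seconds for an assumed tick frequency.
inline double TicksToSeconds(std::uint64_t ticks, double tickHz)
{
    return static_cast<double>(ticks) / tickHz;
}
```

Dividing by the nominal 50.0 MHz on a machine that really ticks at about 49.875 MHz under-reports elapsed time by about 0.25 percent, roughly 9 seconds per hour, which is why the tick count suits short profiling intervals better than long-running game clocks.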

August 14-15 2006 Timing: mftb Errors. mftb is occasionally slightly off*. Every 256 billion CPU cycles (85 seconds) the value is wrong for 4 CPU cycles. Solutions:
Ignore the problem.
Detect and fix the problem:
    int64 time = __mftb();
    if( 0 == (DWORD)time )
        time = __mftb();
Use QueryPerformanceCounter.
Use __mftb32 (max time 85 seconds):
    // Timing with __mftb32()
    DWORD start = __mftb32();
    DoStuff();
    DWORD elapsed = __mftb32() - start;
* Slightly off means about 4 billion (exactly 2^32) too small.
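The detect-and-fix pattern above can be exercised off-target by injecting the counter read. The sketch below is an assumption-laden illustration, not XDK code: `ReadTimeBaseChecked`, `ElapsedTicks32`, and `FakeGlitchyCounter` are hypothetical names, and the fake models the glitch as the low 32 bits having wrapped to zero before the high 32 bits were incremented, which is one reading of the "exactly 2^32 too small" footnote.

```cpp
#include <cassert>
#include <cstdint>

// Re-read the time base if the low 32 bits are zero, the only moment at
// which the slides say the value can be 2^32 too small.
// `ReadFn` stands in for the real counter read (__mftb on Xbox 360).
template <typename ReadFn>
std::uint64_t ReadTimeBaseChecked(ReadFn&& read)
{
    std::uint64_t time = read();
    if (static_cast<std::uint32_t>(time) == 0)
        time = read();  // second read lands after the bad window
    return time;
}

// 32-bit elapsed time: unsigned subtraction wraps correctly as long as the
// interval is shorter than 2^32 ticks (about 85 seconds per the slides).
inline std::uint32_t ElapsedTicks32(std::uint32_t start, std::uint32_t end)
{
    return end - start;
}

// Hypothetical fake counter: the first read returns a glitched value
// (low word zero, high word not yet incremented), later reads are correct.
struct FakeGlitchyCounter
{
    int calls = 0;
    std::uint64_t operator()()
    {
        ++calls;
        return calls == 1 ? 0x0000000000000000ull
                          : 0x0000000100000001ull;
    }
};
```

Passing the reader as a parameter keeps the retry logic testable on a development PC; on the console the same function could be called with a small wrapper around the intrinsic.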

August 14-15 2006 Timing: __mftb Alignment. When Xbox 360 launched, mftb on separate cores was not aligned: one core could be 10-20 ticks (640-1280 cycles) ahead. With the spring 2006 update, mftb should be synchronized between cores to cycle accuracy.

August 14-15 2006 Detecting Sub-optimal Code. The key is to find code that could run better. Trace recording top-issues analysis: points out many common problems; makes everyone an expert. Assembly code inspection/search: look for sign/zero extend instructions; look for code that expands to "too many" instructions; look for excessive frsp, fmr (and mr, vor); look for bad scheduling; look for references to r1, the stack pointer. Compiler-generated temporaries: usually unwanted; often avoidable; often lead to load-hit-stores, or just wasted instructions; go on the stack, so they use r1. Sign/zero extension occurs when using short and byte local variables.

August 14-15 2006 Assembly Inspection Options. Set a Visual Studio breakpoint and go to disassembly mode (toggle with Ctrl+F11). Record a trace, go to the Source tab, and expand source lines. COD files: under C/C++, Output Files, set Assembler Output to "Assembly, Machine Code and Source" (/FAcs); you don’t need to link or run the game.

August 14-15 2006 Poor Code in COD File
    ; Begin code for function: ?IntegralFloat@@YAMM@Z
    ; 5 : // Truncate a float to an integral value,
    ; 6 : // still as a float, using casts.
    ; 7 : return (float)(int)input;
    00000 3961fff0 addi r11,r1,-16
    00004 fc00081e fctiwz fr0,fr1
    00008 7c005fae stfiwx fr0,r0,r11
    0000c e961fff2 lwa r11,-10h(r1)
    00010 f961fff0 std r11,-10h(r1)
    00014 c801fff0 lfd fr0,-10h(r1)
    00018 fc00069c fcfid fr0,fr0
    0001c fc200018 frsp fr1,fr0
    00020 4e800020 blr
Note all the references to r1: avoid them if possible.

August 14-15 2006 Faster Code in COD File
    ; Begin code for function: ?IntegralFloatFast@@YAMM@Z
    ; 12 : // Truncate a float to an integral value,
    ; 13 : // still as a float, using intrinsics.
    ; 14 : return (float)__fcfid(__fctidz(input));
    00000 fc000e5e fctidz fr0,fr1
    00004 fc00069c fcfid fr0,fr0
    00008 fc200018 frsp fr1,fr0
    0000c 4e800020 blr
Fewer instructions, no stack traffic. See ppcintrinsics.h for explanations of __fctidz and __fcfid, or see Optimization Case Studies from the summer 2005 Gamefest.

August 14-15 2006 Poor Code in PIX Source Tab

August 14-15 2006 Expanding Code: extsh instructions are doing no useful work.

August 14-15 2006 Faster Code in PIX Source Tab: Using int instead of short saves time and code space. Save char/short for arrays and large structs.
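The code from these PIX screenshots is not in the transcript, so the following is an illustrative stand-in rather than the talk's example. It contrasts a short local with an int local while keeping the data itself in a compact array of short; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Narrow local: each update of `sum` may need re-extending to 16 bits
// (extsh on PowerPC), which is wasted work for a value held in a register.
inline int SumWithShortLocal(const short* values, std::size_t count)
{
    short sum = 0;
    for (std::size_t i = 0; i < count; ++i)
        sum = static_cast<short>(sum + values[i]);
    return sum;
}

// Register-width local: storage stays compact (array of short), but the
// running arithmetic is done in int.
inline int SumWithIntLocal(const short* values, std::size_t count)
{
    int sum = 0;
    for (std::size_t i = 0; i < count; ++i)
        sum += values[i];
    return sum;
}
```

The two versions agree only while the running total fits in 16 bits, so switching a local from short to int is a behavior change wherever the code relied on the narrower range.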

August 14-15 2006 Compiler Quirks to Beware Of: __restrict and inline interacting badly. The compiler may do poor scheduling when unrolling read/modify/write of one array. bool, the forgotten 8-bit type. Functions returning bool.

August 14-15 2006 Restrict and Inline
    void TestCalcs(__vector4* __restrict input1,
                   __vector4* __restrict input2,
                   __vector4* __restrict result,
                   const int count)
    {
        for(int j=0; j<count; j+=4) {
            result[j+0] = input1[j+0] + input2[j+0];
            result[j+1] = input1[j+1] + input2[j+1];
            result[j+2] = input1[j+2] + input2[j+2];
            result[j+3] = input1[j+3] + input2[j+3];
        }
    }

August 14-15 2006 This Code Depends on Context. TestCalcs is inlined. Parent function pointers are not marked __restrict. Inlining merges parameters. Merging is conservative, so __restrict is lost.
The Parent Trap:
    void Parent(__vector4* input1, __vector4* input2,
                __vector4* result, const int count)
    {
        TestCalcs(input1, input2, result, count);
    }
Avoiding The Parent Trap:
    void Parent(__vector4* input1A, __vector4* input2A,
                __vector4* resultA, const int count)
    {
        __vector4* __restrict input1 = input1A;
        __vector4* __restrict input2 = input2A;
        __vector4* __restrict result = resultA;
        TestCalcs(input1, input2, result, count);
    }

August 14-15 2006 Solutions to __restrict/inline: Be wary of __forceinline. Better results (in this case) with __declspec(noinline). Even better results by marking the variables as __restrict in the parent. Be wary of increased inlining making the problem return.

August 14-15 2006 Array Update: Good Scheduling
    void IncrementFast(float* __restrict data,
                       float* __restrict input,
                       int count, float addend)
    {
        // Assume count is a multiple of 2
        for (int i = 0; i < count; i += 2)
        {
            data[i] = input[i] + addend;
            data[i+1] = input[i+1] + addend;
        }
    }
Pointers are marked __restrict, so the compiler can do great scheduling:
    lfs fr0,0(r9)
    lfsx fr13,r8,r11
    fadds fr0,fr0,fr1
    fadds fr13,fr13,fr1
    stfs fr0,-4(r11)
    stfs fr13,0(r11)

August 14-15 2006 Array Update: Bad Scheduling
    void IncrementData(float* __restrict data,
                       int count, float addend)
    {
        // Assume count is a multiple of 2
        for (int i = 0; i < count; i += 2)
        {
            data[i] += addend;
            data[i+1] += addend;
        }
    }
Updates to data[] can’t overlap, so the compiler should do great scheduling, but instead each load, add, and store runs back to back through the same register:
    lfs fr0,0(r11)
    fadds fr0,fr0,fr1
    stfs fr0,0(r11)
    lfs fr0,0(r9)
    fadds fr0,fr1,fr0
    stfs fr0,0(r9)
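The talk does not give a fix for this case. One variation worth trying, offered here purely as an assumption and not as the author's recommendation, is to spell out the independent loads with explicit temporaries so that both loads precede both stores in the source. `IncrementDataTemps` is a hypothetical name, and whether the generated code actually improves has to be checked in the disassembly. The `__restrict` spelling is accepted by MSVC, GCC, and Clang.

```cpp
#include <cassert>

// Hypothetical variant of IncrementData from the slide: both loads are
// written before either store, spelling out the independence by hand.
// Assumes count is a multiple of 2, as in the original.
inline void IncrementDataTemps(float* __restrict data, int count, float addend)
{
    assert(count % 2 == 0);
    for (int i = 0; i < count; i += 2)
    {
        const float a = data[i];      // first independent load
        const float b = data[i + 1];  // second independent load
        data[i] = a + addend;
        data[i + 1] = b + addend;
    }
}
```

The result is identical to the original loop; only the order in which the work is expressed changes, so comparing the two in a COD file or trace is the real test.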

August 14-15 2006 bool: the Forgotten 8-Bit Type. Our compiler has difficulty with bool. It likes to add extra instructions and do other unfortunate things. Someday this may change, but for now, functions that return bool can be the worst. Sorry.

August 14-15 2006 bool Interacting with STL. This is the canonical form for STL iteration:
    typedef std::vector<int> IntVec;
    IntVec::iterator end = testVector.end();
    for (IntVec::iterator p = testVector.begin(); p != end; ++p) {
        result += *p;
    }
The loop test calls bool operator!=(iter, iter), and generates inefficient code.

August 14-15 2006 Ideal Code for STL Loop
    $LL20@IteratorIt
    lwz r9,0(r11)
    addi r11,r11,4
    add r3,r9,r3
    cmplw cr6,r11,r10
    bne cr6,$LL20@IteratorIt
Callout: the compare and branch implement p != end.

August 14-15 2006 Actual Code for STL Loop
    $LL20@IteratorIt
    lwz r10,0(r11)
    addi r11,r11,4
    add r3,r10,r3
    subf r10,r11,r9
    cntlzw r10,r10
    rlwinm r10,r10,27,31,31
    cntlzw r10,r10
    rlwinm r10,r10,27,31,31
    cmplwi cr6,r10,0
    bne cr6,$LL20@IteratorIt
Callout: everything from subf to bne implements p != end. Code is poor even when inlined!

August 14-15 2006 bool Function Solutions: Avoid functions that return bool, even inline functions. Prefer doing comparisons in the main function, but this can be stymied when using STL iterators with overloaded operator!=. Since iterator comparisons use bool functions, consider using array-style indexing for vectors.
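As a minimal sketch of the array-style indexing suggested above (the function name `SumByIndex` is hypothetical, and the element type is assumed to be int as in the earlier iterator loop), the loop test becomes a plain integer comparison in the calling function instead of a call to a bool-returning operator!=:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum a vector with array-style indexing. The size is read once, and the
// loop condition compares two integers directly.
inline int SumByIndex(const std::vector<int>& values)
{
    int result = 0;
    const std::size_t count = values.size();
    for (std::size_t i = 0; i < count; ++i)
        result += values[i];
    return result;
}
```

Whether this produces the ideal loop on a given compiler version is something to confirm in the generated assembly rather than assume.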

August 14-15 2006 Improving Code Generation: Use __restrict (have we mentioned that before?), but only use it when it is true. Inlining is good. Avoid having tiny leaf functions, even just error handlers, that can't be inlined. Avoid functions that return bool/BOOL/boolean. LTCG, PGO. Know what code is generated.

August 14-15 2006 Compiler Improvements: Updated compiler coming Real Soon Now. Improved VMX code generation (less spilling to stack). __declspec( passinreg ). Bug fixes and miscellaneous other improvements. See Sublime C++ for Games for more information. What about when the compiler isn't good enough?

August 14-15 2006 Simple Assembly Language
    int __declspec( naked ) SimpleAddAssem( int x, int y )
    {
        asm {
            // x is in r3
            // y is in r4
            add r3, r3, r4
            // The return value is in r3
            blr // Don’t forget the explicit ‘blr’
        }
    }
Expands to this code, always:
    add r3,r3,r4
    blr

August 14-15 2006 The Perils of Not Being Naked
    int SimpleAddAssem( int x, int y )
    {
        asm {
            // x is in r3
            // y is in r4
            add r3, r3, r4
            // The return value is in r3
            // Don’t put an explicit ‘blr’
        }
    }
Expands to this code in release:
    stw r3,x$(r1)
    stw r4,y$(r1)
    add r3,r3,r4
    blr
Expands to this code in /callcap:
    mflr r12
    stw r12,-8(r1)
    stwu r1,-60h(r1)
    stw r3,x$(r1)
    stw r4,y$(r1)
    mr r13,r13
    add r3,r3,r4
    mr r14,r14
    addi r1,r1,96
    lwz r12,-8(r1)
    mtlr r12
    blr
Special ‘mr’ instructions change to /callcap functions when profiling, and trash your registers!

August 14-15 2006 Assembly Language Guidelines. Avoid when possible: the compiler can schedule code very well; intrinsics give you access to special instructions; the compiler can call functions faster and easier; high-level code can be rearranged and updated much faster. Assembly may make sense for small, critical routines: know the pipelines; use __declspec( naked ); compare to C/C++ performance.

August 14-15 2006 Summary: The compiler is your friend: learn to work with it. The compiler sometimes generates bad code, so watch for it and work around it. Know your tools. Use assembly language when necessary.

August 14-15 2006 References
Unsigned Developers:
    http://arstechnica.com/articles/paedia/cpu/xbox360-2.ars
Learning the Xbox 360 CPU:
    https://xds.xbox.com/xbox360/link.aspx?page=xdksoftware/whitepapers/xbox_360_cpu_overview.doc
    https://xds.xbox.com/xbox360/link.aspx?page=xdksoftware/whitepapers/xbox_360_cpu_pipelines.doc
Xbox 360 Optimization Guides:
    https://xds.xbox.com/xbox360/nav.aspx?page=xdksoftware/training/XfestFeb2006.htm#IDA4JUP
    https://xds.xbox.com/xbox360/nav.aspx?page=xdksoftware/training/xds_Gamefest_05Jun_Training.htm#2.4

© 2006 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.
DirectX Developer Center: http://msdn.microsoft.com/directx
Game Development MSDN Forums: http://forums.microsoft.com/msdn
Xbox 360 Central: http://xds.xbox.com/
XNA Web site: http://www.microsoft.com/xna
