Common C++ Performance Mistakes in Games


1 Common C++ Performance Mistakes in Games
Pete Isensee Xbox Advanced Technology Group

2 About the Data ATG reviews code to find bottlenecks and make perf recommendations 50 titles per year 96% use C++ 1 in 3 use “advanced” features like templates or generics You would recognize most of the titles. One week doing the review. Reviewed over 100 titles over the life of Xbox Culled through all our reports to find common themes and specific problems

3 Why This Talk Is Important
The majority of Xbox games are CPU bound The CPU bottleneck is often a language or C++ library issue These issues are not usually specific to the platform Not talking about hardware-specific optimizations (graphics, audio, math) Algorithmic optimizations; Process less data (e.g. culling), Processing data less often (e.g. every other frame), Multithreading

4 Format Definition of the problem Examples Recommendation For reference
A frame is 17 or 33 ms (60fps / 30fps) Bottlenecks given in ms per frame To be consistent, bottlenecks in ms rather than frame percent or CPU percent The reason is: % of frame is variable depending on frame rate Code modified to protect the guilty OK, let’s get to the first problem

5 Issue: STL Game using std::list Adding ~20,000 objects every frame
Rebuilding the list every frame Time spent: 6.5 ms/frame! ~156K overhead (2 pointers per node) Objects spread all over the heap If you’ve been to any of my GDC presentations in the past or if you know me personally, me saying anything bad about STL is sacrilege. The fact is, it’s often a bottleneck. Let me back up my claim All the visible objects in the world 20K * sizeof(void*) * 2 (forward and back pointers) Poor cache coherency

6 std::set and map Many games use set/map as sorted lists
Inserts are slow (log(N)) Memory overhead: 3 ptrs + color Worst case in game: 3.8 ms/frame So easy to use – and that’s their biggest drawback On VS7.1 memory overhead is at least 14 bytes per node

7 std::vector Hundreds of push_back()s per frame
VS7.1 expands vector by 50% Question: How many reallocations for 100 push_back()s? Answer: 13! (1,2,3,4,5,7,10,14,20,29,43,64,95) Another example vector: no memory overhead; better cache coherency, automatically grows, and not very often. Dinkumware7.1 algorithm: cap+=cap/2; if(cap didn’t increase) cap++;
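The count of 13 can be checked by simulating the growth rule quoted in the notes. This is only a sketch of that rule (cap += cap/2, plus one if the capacity did not increase), not the Dinkumware source; the function name `ReallocationPoints` is made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// Simulates the quoted growth rule and returns the element counts at which
// a reallocation is triggered while appending `pushes` elements one by one.
std::vector<std::size_t> ReallocationPoints(std::size_t pushes)
{
    std::vector<std::size_t> points;
    std::size_t cap = 0;
    for (std::size_t size = 1; size <= pushes; ++size) {
        if (size > cap) {                  // no room left: grow
            std::size_t grown = cap + cap / 2;
            if (grown == cap)              // capacities 0 and 1 would not grow
                ++grown;
            cap = grown;
            points.push_back(size);        // this push_back reallocated
        }
    }
    return points;
}
```

Calling `ReallocationPoints(100)` yields 13 entries, starting at 1 and ending at 95, which matches the list on the slide. Other standard libraries use different growth factors, so the count is specific to this rule.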

8 Clearly, the STL is Evil

9 Use the Right Tool for the Job
The STL is powerful, but it’s not free Filling any container is expensive Be aware of container overhead Be aware of heap fragmentation and cache coherency Prefer vector, vector::reserve() Can you use a power saw to trim your toenails? Yes. Is it a good idea? Is it perfectly suited to the task at hand? No. Filling any container every frame is not a good use of your game engine. Just because you have STL containers doesn’t mean they’re always the best solution. push_back of 100 ints is 5X faster with reserve()
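A small sketch of why reserve() helps: once the storage is requested up front, push_back() has no reason to reallocate until the reserved size is exceeded. The helper name `CountCapacityChanges` is hypothetical, and the count without reserve() depends on the library's growth factor.

```cpp
#include <cstddef>
#include <vector>

// Counts how many times the capacity changes while appending `count` ints.
// If `reserveFirst` is true, the storage is requested before the loop.
std::size_t CountCapacityChanges(std::size_t count, bool reserveFirst)
{
    std::vector<int> v;
    if (reserveFirst)
        v.reserve(count);

    std::size_t changes = 0;
    std::size_t lastCap = v.capacity();
    for (std::size_t i = 0; i < count; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != lastCap) {     // a reallocation happened
            ++changes;
            lastCap = v.capacity();
        }
    }
    return changes;
}
```

With `reserveFirst` set, the function returns 0 for any count; without it, the result is however many times the library chose to grow.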

10 The STL is Evil, Sometimes
The STL doesn’t solve every problem The STL solves some problems poorly Sometimes good old C-arrays are the perfect container Mike Abrash puts it well: “The best optimizer is between your ears” Or maybe it does solve every problem, just not well Use your head

11 Issue: NIH Syndrome Example: Custom binary tree
Sorted list of transparent objects Badly unbalanced 1 ms/frame to add only 400 items Example: Custom dynamic array class Poorer performance than std::vector Fewer features “Not Invented Here” 1ms: creating the tree, doing 70,000 comparisons instead of around 3500

12 Optimizations that Aren’t
void appMemcpy( void* d, const void* s, size_t b ) { // lots of assembly code here ... } appMemcpy( pDest, pSrc, 100 ); // bottleneck appMemcpy was slower than memcpy for anything under 64K

13 Invent Only What You Need
std::set/map more efficient than the custom tree by 10X Tested and proven Still high overhead An even better solution Unsorted vector or array Sort once 20X improvement Don’t waste time Take advantage of things like the STL
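The "unsorted vector, sort once" approach might look like the following sketch. The `Transparent` record, its fields, and the far-to-near ordering are assumptions made for illustration; the talk does not show the game's actual types.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical record for a transparent object; only the sort key matters.
struct Transparent {
    float depth;
    int   id;
};

// Assumed ordering: farthest first, so blending runs back to front.
inline bool FartherFirst(const Transparent& a, const Transparent& b)
{
    return a.depth > b.depth;
}

// Objects are appended in any order during the frame, then sorted once,
// instead of keeping a tree or sorted list balanced on every insert.
void SortForDrawing(std::vector<Transparent>& objects)
{
    std::sort(objects.begin(), objects.end(), FartherFirst);
}
```

The vector can be reserved once and reused, so the per-frame cost is the appends plus a single sort.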

14 Profile Run your profiler
Rinse. Repeat. Prove the improvement. Don’t rewrite the C runtime or STL just because you can. There are more interesting places to spend your time. memcpy is an intrinsic; can be expanded inline, particularly if the size of the copy is constant

15 Issue: Tool Knowledge If you’re a programmer, you use C/C++ every day
C++ is complex CRT and STL libraries are complex The complexities matter Sometimes they really matter

16 vector::clear Game reused global vector in frame loop
clear() called every frame to empty the vector C++ Standard clear() erases all elements (size() goes to 0) No mention of what happens to vector capacity On VS7.1/Dinkumware, frees the memory Every frame reallocated memory Game reusing global vector in its frame loop; filling it every frame with interesting data STL library issue What the Standard doesn’t say is just as important as what it says. On VS7.1, it also sets the capacity to zero. The complexities matter.

17 Zero-Initialization Costing 3.5 ms/frame
struct Array { int x[1000]; }; struct Container { Array arr; Container() : arr() { } }; Container x; // bottleneck Costing 3.5 ms/frame Removing : arr() speeds this by 20X This stumped developers on my team
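To make the difference concrete, here is a sketch of the two constructors side by side. The names `ZeroedContainer`, `RawContainer`, and `AllZero` are invented for this example; the `Array` struct is the one from the slide.

```cpp
#include <cstddef>

struct Array { int x[1000]; };

// arr() value-initializes the member, so all 1000 ints are set to zero
// every time a ZeroedContainer is constructed.
struct ZeroedContainer {
    Array arr;
    ZeroedContainer() : arr() { }
};

// No initializer: the ints are left uninitialized and construction does no
// work. Reading them before writing them is undefined behavior.
struct RawContainer {
    Array arr;
    RawContainer() { }
};

// Returns true if every element of the array is zero.
bool AllZero(const Array& a)
{
    for (std::size_t i = 0; i < 1000; ++i)
        if (a.x[i] != 0)
            return false;
    return true;
}
```

If the caller overwrites the array anyway, the zeroing in `ZeroedContainer` is pure overhead; if the caller relies on zeros, it is required. The point is to choose deliberately.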

18 Know Thine Holy Standard
Use resize(0) to reduce container size without affecting capacity T() means zero-initialize PODs. Don’t use T() unless you mean it. Get a copy of the C++ Standard. Really. search on 14882 Only $18 for the PDF We are engineers. In any other engineering profession, could you get away with not having a copy of engineering standards manuals at your desk? It’s interesting. It’s relevant. Mention recent updates to standard. Other references: Stroustrup, Meyers Effective[X], Sutter Exceptional C++, Josuttis
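A sketch of the resize(0) recommendation for a vector that is reused every frame. The global `g_visible` and the function `BeginFrame` are hypothetical names. Many current standard libraries also keep the capacity on clear(), but that is worth verifying against the library in use.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-frame scratch list, reused across frames.
std::vector<int> g_visible;

// Empties the list for the next frame. Shrinking with resize() does not
// reallocate, so the capacity built up in earlier frames is kept.
void BeginFrame()
{
    g_visible.resize(0);
}
```

After the first few frames the vector reaches its working size and the frame loop stops allocating for it.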

19 Issue: C Runtime
void BuildScore( char* s, int n ) { if( n > 0 ) sprintf( s, "%d", n ); else sprintf( s, "" ); } n was often zero sprintf was a hotspot HUD

20 qsort Sorting is important in games
qsort is not an ideal sorting function No type safety Comparison function call overhead No opportunity for compiler inlining There are faster options
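The three complaints can be seen by sorting the same data both ways. This sketch compares qsort with std::sort, the replacement the talk names later; the wrapper function names are made up for illustration.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// qsort comparator: receives untyped pointers and is called through a
// function pointer for every comparison.
int CompareInts(const void* a, const void* b)
{
    const int lhs = *static_cast<const int*>(a);
    const int rhs = *static_cast<const int*>(b);
    return (lhs > rhs) - (lhs < rhs);   // avoids overflow of lhs - rhs
}

void SortWithQsort(std::vector<int>& v)
{
    if (!v.empty())
        std::qsort(&v[0], v.size(), sizeof(int), CompareInts);
}

// std::sort knows the element type at compile time, so the comparison on
// ints can be inlined into the sort loop.
void SortWithStdSort(std::vector<int>& v)
{
    std::sort(v.begin(), v.end());
}
```

Both produce the same ordering; the difference is in type checking and in what the compiler can inline.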

21 Clearly, the CRT is Evil It gets back to “use the right tool for the job”

22 Understand Your Options
itoa() can replace sprintf( s, "%d", n ) *s = '\0' can replace sprintf( s, "" ) std::sort can replace qsort Type safe Comparison can be inlined Other sorting options can be even faster: partial_sort, partition itoa is 2-4X faster The assignment is 10X faster sort of 1000 ints is 20% faster than qsort

23 Issue: Function Calls 50,000-100,000 calls/frame is normal
At 60Hz, Xbox has 12.2M cycles/frame Function call/return averages 20 cycles A game calling 61,000 functions/frame spends 10% CPU (1.7 ms/frame) in function call overhead 733 MHz / 60 Hz = 12.2M Function call and return costs vary depending on branch prediction and other factors 12.2M / 20 cycles * 10% = 61,000 C++, by nature, is function intensive; beware the hidden cost
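The arithmetic on this slide can be written out as a few helper functions. The constants are the slide's own figures (733 MHz, 60 frames per second, 20 cycles per call); the function names are invented for this sketch.

```cpp
// Frame budget arithmetic, using the slide's stated assumptions.
const double kClockHz       = 733.0e6;  // CPU clock
const double kFramesPerSec  = 60.0;
const double kCyclesPerCall = 20.0;     // average call plus return

// Cycles available in one frame.
double CyclesPerFrame()
{
    return kClockHz / kFramesPerSec;
}

// Fraction of the frame's cycles spent on call overhead.
double CallOverheadFraction(double callsPerFrame)
{
    return callsPerFrame * kCyclesPerCall / CyclesPerFrame();
}

// The same overhead expressed in milliseconds per frame.
double CallOverheadMs(double callsPerFrame)
{
    return CallOverheadFraction(callsPerFrame) * 1000.0 / kFramesPerSec;
}
```

For 61,000 calls this gives roughly 10% of the frame, or about 1.7 ms. Plugging in 340,000 calls gives a little over 9 ms.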

24 Extreme Function-ality
120,000 functions/frame 140,000 functions/frame 130,000 calls to a single function/frame (ColumnVec<3,float>::operator[]) And the winner: 340,000 calls per frame! 9 ms/frame of call overhead primarily iterations through things in the world

25 Beware Elegance Elegance → levels of indirection → more functions → perf impact Use algorithmic solutions first One pass through the world Better object rejection Do AI/physics/networking less often than once/frame algorithmic: one pass through the world, better object rejection, doing AI/physics/networking less often than once/frame

26 Inline Judiciously Remember: inline is a suggestion
Try “inline any suitable” compiler option 15 to 20 fps 68,000 calls down to 47,000 Try __forceinline or similar keyword Adding to 5 funcs shaved 1.5 ms/frame Don’t over-inline Even small common functions are not always inlined by your compiler Warning: overuse of __forceinline can reduce your performance because it makes your exe larger

27 Issue: for loops // Example 1: Copy indices to push buffer
for( DWORD i = 0; i < dwIndexCnt; ++i ) *pPushBuffer++ = arrIndices[ i ]; // Example 2: Initialize vector array for( DWORD i = 0; i < dwMax; ++i ) mVectorArr[i] = XGVECTOR4(0,0,0,0); // Example 3: Process items in world for( itr i = c.begin(); i < c.end(); ++i ) Process( *i ); All from games, they compile and do the correct thing, but they’re all bottlenecks. Copy indices: Copy each DWORD element of the index array to the push buffer memory block Init vector: construct a temporary XGVECTOR4 object, copy it to the vector array, repeat. And that’s exactly what the compiler will generate Process items: For each item in the container, call a function on that item; end() called every time

28 Watch Out For For Never copy/clear a POD with a for loop
std::algorithms are optimized; use them memcpy( pPushBuffer, arrIndices, dwIndexCnt * sizeof(DWORD) ); memset( mVectorArr, 0, dwMax * sizeof(XGVECTOR4) ); for_each( c.begin(), c.end(), Process ); POD = plain old data item Never ever copy a byte, a short, or a long using a for loop. Memcpy (or std::copy) is optimized to do this. for_each and friends in the algorithm header are as fast or faster than using for loops. They optimize away repeated calls to end() and optimize traversals of particular containers. std::algorithms are implemented using optimized comp. sci. techniques that you just aren’t likely to beat. Side advantage: less code memcpy 4X memset 16X for_each 1.02X Side note: talk about std::copy (memmove) and std::fill (memset for char*)
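The side note about std::copy and std::fill can be sketched as type-safe versions of the same three loops. `Vec4` is a hypothetical stand-in for XGVECTOR4, and the other names are likewise made up. Whether the library lowers these calls to memmove or memset depends on the implementation and the element type.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for XGVECTOR4: four floats, plain old data.
struct Vec4 { float x, y, z, w; };

// Example 1: copy indices into the push buffer. Returns the position just
// past the last element written.
unsigned* CopyIndices(const unsigned* indices, std::size_t count,
                      unsigned* pushBuffer)
{
    return std::copy(indices, indices + count, pushBuffer);
}

// Example 2: reset an array of vectors to zero from one prebuilt value.
void ClearVectors(Vec4* vectors, std::size_t count)
{
    const Vec4 zero = { 0.0f, 0.0f, 0.0f, 0.0f };
    std::fill(vectors, vectors + count, zero);
}

// Example 3: visit every item; end() is evaluated once, at the call.
int g_processed = 0;                       // hypothetical side effect
void Process(int item) { g_processed += item; }

void ProcessAll(const std::vector<int>& c)
{
    std::for_each(c.begin(), c.end(), Process);
}
```

Unlike memcpy and memset, these keep working if the element type later stops being plain old data.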

29 Issue: Exception Handling
Most games never throw Most games never catch Yet, most games enable EH EH adds code to do stack unwinding A little bit of overhead to a lot of code 10% size increase is common 2 ms/frame in worst case At least game code itself doesn’t often throw. Lower level libraries and new might throw. Even if a game did catch an exception, what’s it going to do? If you run out of memory, is there any point in letting the player know you’re about to crash? If a STL vector index is out of range is there anything your shipped title can do about it? STL throws: invalid arg, overflow, out of range, length error (container exceeds max_size), and bad alloc Code overhead for: 1) code that is never called and 2) pollutes the cache Why add code you will never use?

30 Disable Exception Handling
Don’t throw or catch exceptions Turn off the C++ EH compiler option For Dinkumware STL Define “_HAS_EXCEPTIONS=0” Write empty _Throw and _Raise_handler; see stdthrow.cpp and raisehan.cpp in crt folder Add #pragma warning(disable: 4530) In a game, fatal errors can just call a global error handler. A throw statement is much more expensive than a return statement; don’t use as an alternate return mechanism. On a console, an exception generally means one of two things Your game is buggy (and EH won’t help) The hardware is faulty (and EH won’t help) It’s better to crash in the exceptional cases and save 2ms a frame in the % general case Code: _CRTIMP2 void __cdecl std::_Throw(const std::exception&) {} typedef void (*_Prhand)(const std::exception&); _CRTIMP2 _Prhand std::_Raise_handler = 0;

31 Issue: Strings Programmers love strings Love hurts
~7000 calls to stricmp in frame loop 1.5 ms/frame Binary search of a string table 2 ms/frame

32 Avoid strings String comparisons don’t belong in the frame loop
Put strings in a table and compare indices At least optimize the comparison Compare pointers only Prefer strcmp to stricmp strcmp is about 20% faster than stricmp
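One way to "put strings in a table and compare indices" is sketched here. The `StringTable` class and its methods are hypothetical; the idea is that all string comparisons happen at load time and the frame loop only compares integers.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical string table: each distinct string gets a small integer id.
class StringTable {
public:
    // Returns the id for `text`, adding it if it is not already present.
    // Meant for load time: it searches linearly and may allocate.
    std::size_t Intern(const std::string& text)
    {
        for (std::size_t i = 0; i < strings_.size(); ++i)
            if (strings_[i] == text)
                return i;
        strings_.push_back(text);
        return strings_.size() - 1;
    }

    // Recovers the text for an id, e.g. for debugging or display.
    const std::string& Lookup(std::size_t id) const
    {
        return strings_[id];
    }

private:
    std::vector<std::string> strings_;
};
```

Objects store the id returned by `Intern`, so a per-frame check such as "is this the player?" becomes a single integer comparison.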

33 Issue: Memory Allocation
Memory overhead Xbox granularity/overhead is 16/16 bytes Overhead alone is often 1+ MB Too many allocations Games commonly do thousands of allocations per frame Cost: 1-5 ms/frame You knew that I would be talking about memory! 64MB total on Xbox The cost isn’t deterministic

34 Hidden Allocations push_back(), insert() and friends typically allocate memory String constructors allocate Init-style calls often allocate Temporary objects, particularly string constants that convert to string objects

35 Minimize Per-Frame Allocations
Use memory-friendly data structures, e.g. arrays, vectors Reserve memory in advance Use custom allocators Pool same-size allocations in a single block of memory to avoid overhead Use the explicit keyword to avoid hidden temporaries Avoid strings vector/array: fewer allocations, no overhead, more cache friendly Reserve: not just vector.reserve() Allocator: pool same size allocations in a single block of memory, saves memory because less overhead
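Pooling same-size allocations in a single block could look like this minimal sketch. `FixedPool` is a hypothetical class: it makes one allocation up front, hands out pointers to preconstructed objects, and is neither thread safe nor protected against freeing the same pointer twice.

```cpp
#include <cstddef>
#include <vector>

// Minimal fixed-size pool: all objects live in one contiguous block, and
// Allocate/Free only push and pop pointers on a free list.
template <typename T>
class FixedPool {
public:
    explicit FixedPool(std::size_t capacity)
        : slots_(capacity)
    {
        free_.reserve(capacity);           // Free() never needs to grow it
        for (std::size_t i = capacity; i > 0; --i)
            free_.push_back(&slots_[i - 1]);
    }

    // Returns a slot, or a null pointer when the pool is exhausted.
    T* Allocate()
    {
        if (free_.empty())
            return nullptr;
        T* p = free_.back();
        free_.pop_back();
        return p;
    }

    // Returns a slot previously obtained from Allocate() to the pool.
    void Free(T* p) { free_.push_back(p); }

    std::size_t Available() const { return free_.size(); }

private:
    FixedPool(const FixedPool&);             // not copyable: free_ points
    FixedPool& operator=(const FixedPool&);  // into this object's slots_

    std::vector<T>  slots_;
    std::vector<T*> free_;
};
```

The constructor is marked explicit, in line with the advice above about hidden temporaries, so a bare integer cannot silently convert into a pool.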

36 Other Tidbits Compiler settings: experiment dynamic_cast: just say no
Constructors: performance killers Unused static array space: track this Loop unrolling: huge wins, sometimes Suspicious comments: watch out “Immensely slow matrix multiplication” Dynamic cast: 1 ms/frame worst case RE constructors: passing objects by value Mention: virtual functions

37 Wrap Up Use the Right Tool for the Job The STL is Evil, Sometimes
Invent Only What You Need Profile Know Thine Holy Standard Understand Your Options Beware Elegance Inline Judiciously Watch Out For For Disable Exception Handling Avoid Strings Minimize Per-frame Allocations

38 Call to Action: Evolve! Pass the rubber chicken
Share your C++ performance mistakes with your team Mentor junior programmers So they only make new mistakes Don’t stop learning You can never know enough C++ Your mission/task/challenge You’ve learned from the mistakes of others, don’t forget to learn from your own mistakes and share that knowledge

39 Questions Fill out your feedback forms