1
© 2009 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary. http://www.xna.com
2
Out of Order Making In-order Processors Play Nicely Allan Murphy XNA Developer Connection, Microsoft
3
Optimization Example

class BaseParticle {
public:
    …
    virtual Vector& Position()         { return mPosition; }
    virtual Vector& PreviousPosition() { return mPreviousPosition; }
    float& Intensity() { return mIntensity; }
    bool&  Active()    { return mActive; }
    float& Lifetime()  { return mLifetime; }
    …
private:
    …
    float  mIntensity;
    float  mLifetime;
    bool   mActive;
    Vector mPosition;
    Vector mPreviousPosition;
    …
};
4
Optimization Example

// Boring old vector class
class Vector {
    …
public:
    float x, y, z, w;
};

// Boring old generic linked list class
template <class T> class ListNode {
public:
    ListNode(T* contents) : mNext(NULL), mContents(contents) {}
    void SetNext(ListNode* node) { mNext = node; }
    ListNode* NextNode() { return mNext; }
    T* Contents() { return mContents; }
private:
    ListNode* mNext;
    T* mContents;
};
5
Optimization Example

// Run through list and update each active particle
for (ListNode<BaseParticle>* node = gParticles; node != NULL; node = node->NextNode())
    if (node->Contents()->Active()) {
        Vector vel;
        vel.x = node->Contents()->Position().x - node->Contents()->PreviousPosition().x;
        vel.y = node->Contents()->Position().y - node->Contents()->PreviousPosition().y;
        vel.z = node->Contents()->Position().z - node->Contents()->PreviousPosition().z;
        const float length = __fsqrts((vel.x*vel.x) + (vel.y*vel.y) + (vel.z*vel.z));
        if (length > cLimitLength) {
            float newIntensity = cMaxIntensity - node->Contents()->Lifetime();
            if (newIntensity < 0.0f)
                newIntensity = 0.0f;
            node->Contents()->Intensity() = newIntensity;
        }
        else
            node->Contents()->Intensity() = 0.0f;
    }
6
Optimization Example

// Replacement for straight C vector work
// Build 360 friendly __vector4s
__vector4 position, prevPosition;
position.x = node->Contents()->Position().x;
position.y = node->Contents()->Position().y;
position.z = node->Contents()->Position().z;
prevPosition.x = node->Contents()->PreviousPosition().x;
prevPosition.y = node->Contents()->PreviousPosition().y;
prevPosition.z = node->Contents()->PreviousPosition().z;

// Use VMX to do the calculations
__vector4 velocity = __vsubfp(position, prevPosition);
__vector4 velocitySqr = __vmsum4fp(velocity, velocity);

// Grab the length result from the vector
const float length = __fsqrts(velocitySqr.x);
7
Job done, right? Thank you for listening
8
Optimization Example

Hold on. If we time it…
  It's actually slower than the straight C version
And if we check the results…
  It's also wrong!
Incorrect code is a very special case of optimization
Unfortunately, this does happen in practice
9
Important Caveat

Today we're talking about optimization
But the techniques discussed are orthogonal to…
  …good algorithm choice
  …good multithreading system implementation
It's like Mr Knuth said
These techniques typically build code which is…
  …very non-general
  …very difficult to maintain or understand
  …possibly completely platform specific
10
But My Code Is Really Quick On PC…?

A common assumption:
  It's quick on PC
  360 & PS3 have 3.2GHz clock speeds
  Should be good on console! Right?
Alas, the 360 cores and the PS3 PPU have…
  No instruction reordering hardware
  No store forwarding hardware
  Smaller caches and slower memory
  No L3 cache
11
The 4 Horsemen of the In-Order Apocalypse

What goes wrong?
  Load-hit-store (LHS)
  L2 miss
  Expensive, non-pipelined instructions
  Branch mispredict penalty
12
Load-Hit-Store (LHS)

What is it?
  Storing to a memory location…
  …then loading from it very shortly after
What causes LHS?
  Casts, changing register set, aliasing
Why is it a problem?
  On PC, the bullet is usually dodged by…
    Instruction re-ordering
    Store forwarding hardware
  In-order consoles have neither, so the load stalls until the store completes
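As a hedged illustration (not from the deck): the classic LHS shape is a value crossing register sets. On the 360 core and PS3 PPU, a float-to-int cast stores the float result to the stack and immediately reloads it into an integer register. The function names here are ours.

```cpp
#include <cstdint>

// A float-to-int conversion: on in-order PowerPC this stores the float
// result to memory and immediately reloads it as an integer -- an LHS.
int32_t FloatToIntViaMemory(float f)
{
    return static_cast<int32_t>(f);
}

// Mitigation: keep hot loops in a single register domain. Here the
// intensity stays a float end to end, so no cross-domain store/load
// is ever generated (the ternary maps to fsel-style selection).
float FadeIntensity(float intensity, float fade)
{
    const float faded = intensity - fade;
    return faded > 0.0f ? faded : 0.0f;
}
```

The principle: the cast itself is cheap; the stall comes from the store-then-load round trip, so restructuring to avoid the domain crossing removes the stall entirely.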
13
L2 Miss

What is it?
  Loading from a location not already in cache
Why is it a problem?
  Costs ~610 cycles to load a cache line
  You can do a lot of work in 610 cycles
What can we do about it?
  Hot/cold split
  Reduce in-memory data size
  Use cache-coherent structures
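A hedged sketch of a hot/cold split; the struct and field names are ours, not the deck's. Fields read every update live in a compact "hot" record, so a 128-byte 360 cache line carries eight of them, while rarely-touched fields sit in a parallel "cold" array under the same index.

```cpp
#include <cstddef>
#include <vector>

// Hot data: touched every frame, packed to exactly 16 bytes.
struct ParticleHot  { float x, y, z, intensity; };
// Cold data: rarely read, kept out of the hot loop's cache lines.
struct ParticleCold { float lifetime; unsigned spawnFrame; };

struct ParticleArrays
{
    std::vector<ParticleHot>  hot;
    std::vector<ParticleCold> cold;   // cold[i] belongs to hot[i]

    float TotalIntensity() const
    {
        // The hot loop streams only hot data: fewer L2 misses.
        float total = 0.0f;
        for (std::size_t i = 0; i < hot.size(); ++i)
            total += hot[i].intensity;
        return total;
    }
};
```

The same indexing serves both arrays, so the split costs nothing structurally; it only changes what shares a cache line.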
14
Expensive Instructions

What are they?
  Certain instructions are not pipelined
  No other instructions issue 'til they complete
  Stalls both hardware threads: high latency and low throughput
What can we do about it?
  Know when those instructions are generated
  Avoid or code round those situations
  But only in critical places
15
Branch Mispredicts

What is it?
  Mispredicting a branch causes…
  …the CPU to discard instructions it predicted it needed
  …a 23–24 cycle delay while the correct instructions are fetched
Why is this a problem?
  The misprediction penalty can…
  …dominate total time in tight loops
  …waste time fetching unneeded instructions
16
Branch Mispredicts

What can we do about it?
  Know how the compiler implements branches
    for, do, while, if
    Function pointers, switches, virtual calls
  Reduce the total branch count for the task
    Use test-and-set style instructions
    Refactor calculations to remove branches
    Unroll
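A scalar example of the test-and-set style the deck recommends (our own sketch, not slide code): an integer min with no compare-and-branch, so there is nothing for the branch predictor to get wrong.

```cpp
#include <cstdint>

// Branchless integer min. The mask is all-ones when a < b and all-zeros
// otherwise, so the select is pure bit arithmetic with no branch.
int32_t BranchlessMin(int32_t a, int32_t b)
{
    const int32_t mask = -static_cast<int32_t>(a < b);  // 0 or -1 (all ones)
    return (a & mask) | (b & ~mask);
}
```

The VMX equivalents (vcmp* to build the mask, vsel to pick) apply the same idea across whole registers at once.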
17
Who Are Our Friends?

Profiling, profiling, profiling
360 tools
  PIX CPU instruction trace
  LibPMCPB counters
  XbPerfView sampling capture
Other platforms
  SN Tuner, VTune
Thinking laterally
18
General Improvements

inline
  Make sure your function fits the profile
Pass and return in registers
  __declspec(passinreg)
__restrict
  Releases the compiler from being ultra careful about aliasing
const
  Doesn't affect code gen
  But does affect your brain
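A minimal sketch of what __restrict buys (our example; GCC, Clang and MSVC all accept the __restrict spelling): with both pointers promised non-aliasing, the compiler need not re-read inputs after every store and can schedule and unroll freely.

```cpp
// With __restrict on both pointers, the compiler may assume no store to
// `out` can change what `in` points at, so loads stay in registers.
void ScaleArray(float* __restrict out, const float* __restrict in,
                int count, float scale)
{
    for (int i = 0; i < count; ++i)
        out[i] = in[i] * scale;
}
```

Without the qualifier, the compiler must assume `out` might overlap `in` and conservatively reload after each store.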
19
General Improvements

Compiler options
  Inline everything possible
  Prefer speed over size
Platform specifics for 360
  /Ou – removes the integer divide-by-zero trap
  /Oc – runs a second code scheduling pass
Don't write inline asm
20
General Improvements

Reduce parameter counts
  Reduces function prologue and epilogue
  Reduces stack access
  Reduces LHS
Prefer 32, 64 and 128 bit variables
Isolate constants – or constant sets
Look to specialise, not generalise
  Avoid virtual where feasible
  An unnecessary virtual means an indirect branch
21
Know Your Cache Architecture

Cache size
  360: 1MB L2, 32KB L1
Cache line size
  360: 128 bytes; x86: typically 64 bytes
Pre-fetch mechanism
  360: dcbt, dcbz128
Cross-core sharing policy
  360: L2 shared, L1 per core
22
Know Pipeline & LHS Conditions

LHS caused by:
  Pointer aliasing
  Register set swaps / casting
Be aware of non-pipelined instructions
  fsqrt, fdiv, integer mul, integer div, sraw
Be aware of pipeline flush issues
  Especially fcmp
23
Knowing Your Instruction Set

360 specifics: VMX
  Slow instructions
  Regularly useful instructions
    fsel, vsel, vcmp*, vrlimi
PS3
  AltiVec & the world of the SPEs
PC
  SSE, SSE2, SSE3, SSE4, SSE4.1 and friends
24
What Went Wrong With The Example?

Correctness
  Always cross-compare during development
We guessed at one performance issue
  SIMD vs straight float
Giving SIMD 'some road'
  Branch behaviour is exactly the same
  Adding SIMD adds an LHS
  Memory access and L2 usage are unchanged
25
Image Analysis
26
Image Analysis Example

Classification via Gaussian Mixture Model
  For each pixel in a 320x240 array…
    Evaluate 'cost' via up to 20 Gaussian models
    Return the lowest cost found for the pixel
    Submit cost to a graph structure for min-cut
Profiling shows:
  86% of time in the pixel cost function
  No surprises there
  1,536,000 Gaussian model applies
27
Image Analysis Example

float GMM::Cost(unsigned char r, unsigned char g, unsigned char b, size_t k)
{
    Component& component = mComponent[k];
    SampleType x(r, g, b);
    x -= component.Mean();
    FloatVector fx((float)x[0], (float)x[1], (float)x[2]);
    return component.EofLog() +
           0.5f * fx.Dot(component.CovInv().Multiply(fx));
}

float GMM::BestCost(unsigned char r, unsigned char g, unsigned char b)
{
    float bestCost = Cost(r, g, b, 0);
    for (size_t k = 1; k < nK; k++)
    {
        float cost = Cost(r, g, b, k);
        if (cost < bestCost)
            bestCost = cost;
    }
    return bestCost;
}
28
Image Analysis Example

What things look suspect?
  L2 miss on component load
  Passing individual r, g, b elements
  Building two separate vectors
  Casting int to float
  Vector maths
  Branching may be an issue in BestCost()
    The loop
    The conditional inside the loop
Confirm with PIX on 360
29
Image Analysis Example

Pass 1
  Don't even touch platform specifics
  Pass a single int, not 3 unsigned chars
  Mark up all consts
  Build the sample value once in the caller
  Add __forceinline
  Check correctness
Doesn't help a lot – gives about 1.1x
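Pass 1's "pass a single int, not 3 unsigned chars" might look like this; the 0xRRGGBB packing order and helper names are our assumptions, not stated on the slide.

```cpp
#include <cstdint>

// Packing r,g,b into one 32-bit word cuts the parameter count (fewer
// prologue/epilogue moves, less stack traffic) and unpacks with cheap
// shifts at the callee.
uint32_t PackRGB(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint32_t(r) << 16) | (uint32_t(g) << 8) | uint32_t(b);
}

void UnpackRGB(uint32_t rgb, uint8_t& r, uint8_t& g, uint8_t& b)
{
    r = uint8_t(rgb >> 16);
    g = uint8_t(rgb >> 8);
    b = uint8_t(rgb);
}
```

Fewer parameters also helps the "pass and return in register" advice earlier in the deck: one word fits a single integer register.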
30
Image Analysis Example

Pass 2
  Turn the Cost function innards to VMX
  Return the cost as a __vector4 to avoid an LHS
  Remove the if from the loop in BestCost by…
    Keeping bestCost as a __vector4
    Using vcmpgefp to make a comparison mask
    Using vsel to pick the lowest value
Speedup of 1.7x
  Constructing the __vector4s on the fly is expensive
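A portable scalar stand-in for what the vcmpgefp + vsel pair does per lane (our sketch, not the VMX intrinsics themselves): build an all-ones/all-zeros mask from the comparison, then select bitwise.

```cpp
#include <cstdint>
#include <cstring>

// Per lane: mask = (best >= cost) ? all-ones : zero, then pick cost bits
// where the mask is set and best bits elsewhere -- i.e. the minimum,
// with no branch. Mirrors bestCost = __vsel(bestCost, cost, mask).
void CmpGeSelectMin(const float best[4], const float cost[4], float out[4])
{
    for (int lane = 0; lane < 4; ++lane)
    {
        const uint32_t mask = (best[lane] >= cost[lane]) ? 0xFFFFFFFFu : 0u;
        uint32_t b, c;
        std::memcpy(&b, &best[lane], 4);
        std::memcpy(&c, &cost[lane], 4);
        const uint32_t r = (c & mask) | (b & ~mask);
        std::memcpy(&out[lane], &r, 4);
    }
}
```

On the 360 the whole thing is two VMX instructions covering four lanes, which is why the conditional disappears from BestCost's loop.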
31
Image Analysis Example

Pass 3
  Build the colour as a __vector4 in the calling function
  Build a static __vector4 containing {0.5f, 0.5f, 0.5f, 0.5f}
    Load it once in the calling function
  Mark all __vector4s as __declspec(passinreg)
  Build a __vector4 version of Component
  All calculations done as __vector4
More like it – speedup of 5.2x
32
Image Analysis Example

Pass 4
  Go all the way out to the per-pixel calling code
  Load a __vector4 at a time from the source array
  Do 4 pixel costs at once
    __vcmpgefp/__vsel works exactly the same
  Return a __vector4 with 4 costs
  Write to the results array as a single __vector4
Gives a speedup of 19.54x
33
Image Analysis Example

__declspec(passinreg) __vector4 CMOGs::BestCost(__declspec(passinreg) __vector4 colour) const
{
    __vector4 half = gHalf;
    const size_t nK = m_componentCount;
    assert(nK != 0);
    __vector4 bestCost = Cost(colour, half, 0);
    for (size_t k = 1; k < nK; k++)
    {
        const __vector4 cost = Cost(colour, half, k);
        const __vector4 mask = __vcmpgefp(bestCost, cost);
        bestCost = __vsel(bestCost, cost, mask);
    }
    return bestCost;
}
34
Image Analysis Example

const Component& comp = m_vComponent[k];
const __vector4 vEofLog = comp.GetVEofLog();
colour0 = __vsubfp(colour0, comp.GetVMean());
…
const __vector4 row0 = comp.GetVCovInv(0);
const __vector4 row1 = comp.GetVCovInv(1);
const __vector4 row2 = comp.GetVCovInv(2);
x = __vspltw(colour0, 0);
y = __vspltw(colour0, 1);
z = __vspltw(colour0, 2);
mulResult = __vmulfp(row0, x);
mulResult = __vmaddfp(row1, y, mulResult);
mulResult = __vmaddfp(row2, z, mulResult);
vdp2 = __vmsum3fp(mulResult, colour0);
vdp2 = __vmaddfp(vdp2, half, vEofLog); // half is a __vector4 parameter
result = vdp2;
…
35
Image Analysis Example

Hold on, this is image analysis. Shouldn't it be on the GPU?
Maybe, maybe not:
  Per pixel we manipulate a dynamic tree structure
  Excluding the tree structure…
    The CPU can run close to GPU speed
    But the syncing and memory throughput overhead isn't worth it
36
Movie Compression
37
Movie Compression Optimization

Timing results
  Freeware movie compressor on 360
  76.3% of instructions spent in InterError()
    Calculating error between macroblocks
Majority of time in 8x8 macroblock functions
  Pick up source and target intensity macroblocks
  For each pixel, calculate the abs difference
  Sum differences along rows
  Return the sum of diffs
    Or early-out when the sum exceeds a threshold
38
Movie Compression Optimization

int ThresholdSum(unsigned char* ptr1, unsigned char* ptr2,
                 int stride2, int stride1, int thres)
{
    int32 sad = 0;
    for (int i = 8; i; i--)
    {
        sad += DSP_OP_ABS_DIFF(ptr1[0], ptr2[0]);
        sad += DSP_OP_ABS_DIFF(ptr1[1], ptr2[1]);
        sad += DSP_OP_ABS_DIFF(ptr1[2], ptr2[2]);
        sad += DSP_OP_ABS_DIFF(ptr1[3], ptr2[3]);
        sad += DSP_OP_ABS_DIFF(ptr1[4], ptr2[4]);
        sad += DSP_OP_ABS_DIFF(ptr1[5], ptr2[5]);
        sad += DSP_OP_ABS_DIFF(ptr1[6], ptr2[6]);
        sad += DSP_OP_ABS_DIFF(ptr1[7], ptr2[7]);
        if (sad > thres)
            return sad;
        ptr1 += stride1;
        ptr2 += stride2;
    }
    return sad;
}
39
Movie Compression Optimization

Look at our worst enemies
  L2
    8x8 byte blocks, seems tight
  LHS
    It's all integer, so we should be LHS free
  Expensive instructions?
    No, just byte maths
  Branching
    Should get the prediction right 7 out of 8 times
40
Movie Compression Optimization

The maths
  Element-by-element abs and average ops on bytes
  Done row by row, exit when the sum exceeds the threshold
  Perfect for VMX!
Awesome speedup of… 0%
Huh? Why?
  Summing a row doesn't suit VMX
  The branch penalty is still there
  We have to do unaligned loads into VMX registers
41
Movie Compression Optimization

Let's think again
  Look at the higher level picture
  Error is calculated for 4 blocks at a time by the caller
  Rows in blocks (0,1) and (2,3) are contiguous
    Pick up two blocks at a time in VMX registers
  Thresholding is by row
    But there is no reason not to do it by column
    Means we can sum columns in 7 instructions
  Use __restrict on the block pointers
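The key observation above is that the total SAD is identical whether you accumulate by rows or by columns, and column sums are exactly what packed VMX adds produce. A plain C sketch of that equivalence (the real version uses __vaddshs trees over whole registers):

```cpp
#include <cstdlib>

// Sum of absolute differences over an 8x8 block, row-major accumulation.
int SadByRows(const unsigned char* a, const unsigned char* b)
{
    int sad = 0;
    for (int row = 0; row < 8; ++row)
        for (int col = 0; col < 8; ++col)
            sad += std::abs(int(a[row * 8 + col]) - int(b[row * 8 + col]));
    return sad;
}

// Same total, accumulated column by column -- the SIMD-friendly order.
int SadByColumns(const unsigned char* a, const unsigned char* b)
{
    int sad = 0;
    for (int col = 0; col < 8; ++col)
        for (int row = 0; row < 8; ++row)
            sad += std::abs(int(a[row * 8 + col]) - int(b[row * 8 + col]));
    return sad;
}
```

Since addition is commutative, thresholding against the running column total is just as valid as thresholding per row; it only changes when the early-out can fire.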
42
Movie Compression Optimization

[Diagram: blocks (0,1) and (2,3) picked up two at a time, their rows spread across VMX registers 0–7]
43
Movie Compression Optimization

Data layout & alignment
  Rows in 2 blocks are contiguous in memory
  Source block always 16 byte aligned
  Dest block only guaranteed to be byte aligned
Unrolling
  We can unroll the 8 iteration loop
  We have plenty of VMX registers available
Return value
  Return a __vector4 to avoid the LHS of writing to an int
44
Movie Compression Optimization

Miscellaneous
  Prebuild the threshold word once
  Remove the stride word parameters
    Constant values in this application
    Proved with empirical research (and an assert)
  Vector parameters and return value in registers
  Push the vector error results out to the caller
    All the caller's calculations in VMX – drops the LHS
45
Movie Compression Optimization

__vector4 __declspec(passinreg) twoblock_sad8x8__xbox
    (const unsigned char* __restrict ptr1, const unsigned char* __restrict ptr2)
{
    __vector4 zero = __vzero();
    __vector4 row1_0 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row1_1 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row1_2 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row1_3 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row1_4 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row1_5 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row1_6 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row1_7 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row2_0 = *(__vector4*)ptr2; ptr2 += cStride2;
    __vector4 row2_1 = *(__vector4*)ptr2; ptr2 += cStride2;
    __vector4 row2_2 = *(__vector4*)ptr2; ptr2 += cStride2;
    __vector4 row2_3 = *(__vector4*)ptr2; ptr2 += cStride2;
    __vector4 row2_4 = *(__vector4*)ptr2; ptr2 += cStride2;
    __vector4 row2_5 = *(__vector4*)ptr2; ptr2 += cStride2;
    __vector4 row2_6 = *(__vector4*)ptr2; ptr2 += cStride2;
    __vector4 row2_7 = *(__vector4*)ptr2; ptr2 += cStride2;
    // Per-row byte-wise absolute difference: |a-b| = max(a,b) - min(a,b)
    row1_0 = __vsubsbs(__vmaxub(row1_0,row2_0), __vminub(row1_0,row2_0));
    row1_1 = __vsubsbs(__vmaxub(row1_1,row2_1), __vminub(row1_1,row2_1));
    row1_2 = __vsubsbs(__vmaxub(row1_2,row2_2), __vminub(row1_2,row2_2));
    row1_3 = __vsubsbs(__vmaxub(row1_3,row2_3), __vminub(row1_3,row2_3));
    row1_4 = __vsubsbs(__vmaxub(row1_4,row2_4), __vminub(row1_4,row2_4));
    row1_5 = __vsubsbs(__vmaxub(row1_5,row2_5), __vminub(row1_5,row2_5));
    row1_6 = __vsubsbs(__vmaxub(row1_6,row2_6), __vminub(row1_6,row2_6));
    row1_7 = __vsubsbs(__vmaxub(row1_7,row2_7), __vminub(row1_7,row2_7));
    // Widen bytes to 16-bit lanes (low and high halves of each row)
    row2_0 = __vmrglb(zero,row1_0); row1_0 = __vmrghb(zero,row1_0);
    row2_1 = __vmrglb(zero,row1_1); row1_1 = __vmrghb(zero,row1_1);
    row2_2 = __vmrglb(zero,row1_2); row1_2 = __vmrghb(zero,row1_2);
    row2_3 = __vmrglb(zero,row1_3); row1_3 = __vmrghb(zero,row1_3);
    row2_4 = __vmrglb(zero,row1_4); row1_4 = __vmrghb(zero,row1_4);
    row2_5 = __vmrglb(zero,row1_5); row1_5 = __vmrghb(zero,row1_5);
    row2_6 = __vmrglb(zero,row1_6); row1_6 = __vmrghb(zero,row1_6);
    row2_7 = __vmrglb(zero,row1_7); row1_7 = __vmrghb(zero,row1_7);
    // Saturated-add trees: sum the columns across the 8 rows
    row1_0 = __vaddshs(row1_0,row1_1); row1_2 = __vaddshs(row1_2,row1_3);
    row1_4 = __vaddshs(row1_4,row1_5); row1_6 = __vaddshs(row1_6,row1_7);
    row1_0 = __vaddshs(row1_0,row1_2); row1_4 = __vaddshs(row1_4,row1_6);
    row1_0 = __vaddshs(row1_0,row1_4);
    row2_0 = __vaddshs(row2_0,row2_1); row2_2 = __vaddshs(row2_2,row2_3);
    row2_4 = __vaddshs(row2_4,row2_5); row2_6 = __vaddshs(row2_6,row2_7);
    row2_0 = __vaddshs(row2_0,row2_2); row2_4 = __vaddshs(row2_4,row2_6);
    row2_0 = __vaddshs(row2_0,row2_4);
    // Shift-and-add to reduce the column sums to per-block totals
    row1_1 = __vsldoi(row1_0,row2_0,2);
    row1_2 = __vsldoi(row1_0,row2_0,4);
    row1_3 = __vsldoi(row1_0,row2_0,6);
    row1_4 = __vsldoi(row1_0,row2_0,8);
    row1_5 = __vsldoi(row1_0,row2_0,10);
    row1_6 = __vsldoi(row1_0,row2_0,12);
    row1_7 = __vsldoi(row1_0,row2_0,14);
    row1_0 = __vrlimi(row1_0,row2_0,0x1,0);
    row2_0 = __vsldoi(row2_0,zero,2);
    row1_1 = __vrlimi(row1_1,row2_0,0x1,0);
    row1_0 = __vaddshs(row1_0,row1_1); // add 4 rows to the next row
    row1_2 = __vaddshs(row1_2,row1_3);
    row1_4 = __vaddshs(row1_4,row1_5);
    row1_6 = __vaddshs(row1_6,row1_7);
    row1_0 = __vaddshs(row1_0,row1_2); row1_4 = __vaddshs(row1_4,row1_6);
    row1_0 = __vaddshs(row1_0,row1_4);
    row1_0 = __vpermwi(row1_0,VPERMWI_CONST(0,3,0,0));
    row1_0 = __vmrghh(zero,row1_0);
    row1_0 = __vpermwi(row1_0,VPERMWI_CONST(0,2,0,0));
    return row1_0;
}

Unpleasant
46
Movie Compression Optimization

Results
  Un-thresholded macroblock compare
    2.86 times quicker than the existing C
    Not bad, but our code is doing 2 blocks at once, too
    So effectively 5.72 times quicker
  Thresholded macroblock compare
    4.12 times quicker
Optimizations to just the block compares…
  …reduced movie compression time by 22%
  …in the worst case, saved 40 seconds from the compress time
47
Do We Get Improvements In Reverse? Do we see improvements on PC? Image analysis Movie compression
48
Summary Interlude

Profiling, profiling, profiling
Know your enemy
Explore data alignment and layout
Give SIMD plenty of room to work
Don't ignore simple code structure changes
Specialise, not generalise
49
Original Example
50
Improving Original Example

PIX Summary
  704k instructions executed
  40% L2 usage
Top penalties
  L2 cache miss @ 3m cycles
  bctr mispredicts @ 1.14m cycles
  __fsqrt @ 696k cycles
  2x fcmp @ 490k cycles
Some 20.9m cycles of penalty overall
Takes 7.528ms
51
Improving Original Example

1) Avoid branch mispredict #1
  Ditch the zealous use of virtual
  Call functions just once
  Gives 1.13x speedup
2) Improve L2 use #1
  Refactor the list to a contiguous array
  Hot/cold split
  Use a bitfield for the active flag
  Gives 3.59x speedup
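A hedged sketch of that pass (the type and method names are ours): the ListNode chain becomes a flat array, so loads are sequential and prefetch-friendly, and the per-particle bool becomes one bit in a word, so 32 active flags arrive per 4-byte load.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Structure-of-arrays particle store: contiguous, index-addressed data
// instead of a pointer-chasing linked list; active flags packed as bits.
struct Particles
{
    std::vector<float>    intensity;   // hot data, contiguous
    std::vector<uint32_t> activeBits;  // bit (i & 31) of word (i >> 5)

    bool IsActive(std::size_t i) const
    {
        return (activeBits[i >> 5] >> (i & 31)) & 1u;
    }
    void SetActive(std::size_t i, bool on)
    {
        const uint32_t bit = 1u << (i & 31);
        if (on) activeBits[i >> 5] |= bit;
        else    activeBits[i >> 5] &= ~bit;
    }
};
```

The bitfield also shrinks the in-memory footprint, which is the other half of the L2 advice: less data per particle means more particles per cache line.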
52
Improving Original Example

4) Remove expensive instructions
  Ditch __fsqrts and compare with squares
  Gives 4.05x speedup
5) Avoid branch mispredict #1
  Insert __fsel() to select the tail length
  Gives 4.44x speedup
  Insert a 2nd __fsel
    Now only the loop and active-flag branches remain
    Gives 5.0x speedup
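Step 4's trick in isolation: squaring is monotonic on non-negative values, so the velocity-length test can compare squared quantities and skip __fsqrts entirely. A minimal sketch, assuming the limit is non-negative as in the example:

```cpp
// (length > limit) is equivalent to (length*length > limit*limit)
// when both are non-negative, so the square root never needs computing.
bool ExceedsLimit(float lengthSquared, float limit)
{
    return lengthSquared > limit * limit;
}
```

Since vmsum-style instructions already produce the squared length, this removes one non-pipelined instruction per particle at zero cost.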
53
Improving Original Example

7) Use VMX
  Use __vsubfp and __vmsum3fp for the vector maths
  Gives 5.28x speedup
8) Avoid branch mispredict #2
  Unroll the loop 4x
  Sticks at 5.28x speedup
54
Improving Original Example

9) Avoid branch mispredict #3
  Build a __vector4 mask from the active flags
  __vsel tail lengths from existing and new
  Write a single __vector4 result
  Now only the loop branch remains
  Gives 6.01x speedup
10) Improve L2 use #2
  Add __dcbt on the position array
  Gives 16.01x speedup
55
Improving Original Example

11) Improve L2 use #3
  Move to short coordinates
  Now loading ¼ the data for positions
  Gives 21.23x speedup
12) Avoid branch mispredict #4
  We are now writing tail lengths for every particle
  Wait, we don't care about inactive particles
  Epiphany – don't check the active flag at all
  Gives 23.21x speedup
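A sketch of what "move to short coordinates" could mean: positions quantised to int16_t fixed point, so each component is 2 bytes instead of 4 and a cache line carries more positions. The 8.8 scale factor (256) and the helper names are our assumptions; the deck doesn't state a format.

```cpp
#include <cstdint>

// 8.8 fixed point: one sign/integer byte, one fractional byte per value.
const float cCoordScale = 256.0f;

int16_t CompressCoord(float v) { return static_cast<int16_t>(v * cCoordScale); }
float   ExpandCoord(int16_t v) { return static_cast<float>(v) / cCoordScale; }
```

The trade is precision and range for bandwidth; for particle positions bounded to a known world extent, that trade is usually safe.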
56
Improving Original Example

13) Improve L2 use #4
  Remaining L2 misses are on the output array
  __dcbt that too
  Tweak __dcbt offsets and pre-load
  Gives a 39.01x speedup
57
Improving Original Example

PIX Summary
  259k instructions executed
  99.4% L2 usage
Top penalties
  ERAT data miss @ 14k cycles
  1 LHS via 4KB aliasing
  No mispredict penalties
71k cycles of penalty overall
Takes 0.193ms
58
Improving Original Example

Caveat
  Slightly trivial code example
  Not all techniques possible in 'real life'
  But the principles always apply
dcbz128 mystery?
  We write the entire array
  Should be able to save L2 loads by pre-zeroing
  But results showed a slowdown
59
Thanks For Listening Any questions?
60