
1 © 2009 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary. http://www.xna.com

2 Out of Order: Making In-order Processors Play Nicely
Allan Murphy, XNA Developer Connection, Microsoft

3 Optimization Example

class BaseParticle {
public:
    ...
    virtual Vector& Position()         { return mPosition; }
    virtual Vector& PreviousPosition() { return mPreviousPosition; }
    float& Intensity() { return mIntensity; }
    bool&  Active()    { return mActive; }
    float& Lifetime()  { return mLifetime; }
    ...
private:
    ...
    float  mIntensity;
    float  mLifetime;
    bool   mActive;
    Vector mPosition;
    Vector mPreviousPosition;
    ...
};

4 Optimization Example

// Boring old vector class
class Vector {
    ...
public:
    float x, y, z, w;
};

// Boring old generic linked list class
template <class T> class ListNode {
public:
    ListNode(T* contents) : mNext(NULL), mContents(contents) {}
    void SetNext(ListNode* node) { mNext = node; }
    ListNode* NextNode() { return mNext; }
    T* Contents() { return mContents; }
private:
    ListNode* mNext;
    T* mContents;
};

5 Optimization Example

// Run through list and update each active particle
for (ListNode<BaseParticle>* node = gParticles; node != NULL; node = node->NextNode())
    if (node->Contents()->Active()) {
        Vector vel;
        vel.x = node->Contents()->Position().x - node->Contents()->PreviousPosition().x;
        vel.y = node->Contents()->Position().y - node->Contents()->PreviousPosition().y;
        vel.z = node->Contents()->Position().z - node->Contents()->PreviousPosition().z;
        const float length = __fsqrts((vel.x*vel.x) + (vel.y*vel.y) + (vel.z*vel.z));
        if (length > cLimitLength) {
            float newIntensity = cMaxIntensity - node->Contents()->Lifetime();
            if (newIntensity < 0.0f)
                newIntensity = 0.0f;
            node->Contents()->Intensity() = newIntensity;
        }
        else
            node->Contents()->Intensity() = 0.0f;
    }

6 Optimization Example

// Replacement for straight C vector work
// Build 360 friendly __vector4s
__vector4 position, prevPosition;
position.x = node->Contents()->Position().x;
position.y = node->Contents()->Position().y;
position.z = node->Contents()->Position().z;
prevPosition.x = node->Contents()->PreviousPosition().x;
prevPosition.y = node->Contents()->PreviousPosition().y;
prevPosition.z = node->Contents()->PreviousPosition().z;
// Use VMX to do the calculations
__vector4 velocity    = __vsubfp(position, prevPosition);
__vector4 velocitySqr = __vmsum4fp(velocity, velocity);
// Grab the length result from the vector
const float length = __fsqrts(velocitySqr.x);

7 Job done, right?

Thank you for listening

8 Optimization Example

Hold on. If we time it…
It's actually slower than the straight C version
And if we check the results…
It's also wrong!
Incorrect code is only an 'optimization' in a very special sense
Unfortunately, this does happen in practice

9 Important Caveat

Today we're talking about optimization
But the techniques discussed are orthogonal to…
…good algorithm choice
…good multithreading system implementation
It's like Mr Knuth said: premature optimization is the root of all evil
These techniques typically produce code which is…
…very non-general
…very difficult to maintain or understand
…possibly completely platform specific

10 But My Code Is Really Quick On PC…?

A common assumption:
It's quick on PC
360 & PS3 have 3.2GHz clock speed
Should be good on console, right?
Alas, the 360 cores and the PS3 PPU have…
…no instruction reordering hardware
…no store forwarding hardware
…smaller caches and slower memory
…no L3 cache

11 The 4 Horsemen of In-Order Apocalypse

What goes wrong?
LHS
L2 miss
Expensive, non-pipelined instructions
Branch mispredict penalty

12 Load-Hit-Store (LHS)

What is it?
Storing to a memory location…
…then loading from it very shortly after
What causes LHS?
Casts, changing register set, aliasing
Why is it a problem?
On PC, the bullet is usually dodged by…
…instruction re-ordering
…store forwarding hardware
On these in-order cores, the load simply stalls until the store completes
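
For illustration, a minimal hedged sketch (not from the original deck) of the classic cast-induced LHS:

// 'count' is computed in an integer register; the (float) cast stores it
// to the stack and immediately loads it into a float register. With no
// store forwarding hardware, the load stalls until the store completes.
float HalfOf(int count)
{
    return (float)count * 0.5f;   // store GPR to stack, then load stack to FPR
}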

13 L2 Miss

What is it?
Loading from a location not already in the cache
Why is it a problem?
Costs ~610 cycles to load a cache line
You can do a lot of work in 610 cycles
What can we do about it?
Hot/cold split (see the sketch below)
Reduce in-memory data size
Use cache coherent structures
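
A hedged sketch of a hot/cold split applied to the particle example (struct and array names are illustrative): data touched every update is packed together, so each 128-byte cache line carries more useful payload.

struct ParticleHot            // read/written every frame
{
    Vector position;
    Vector previousPosition;
    float  intensity;
    float  lifetime;
};

struct ParticleCold           // touched rarely: spawn parameters, debug data
{
    // ...
};

ParticleHot  gHotParticles[cMaxParticles];    // contiguous, prefetch friendly
ParticleCold gColdParticles[cMaxParticles];   // out of the hot loop's way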

14 Expensive Instructions

What are they?
Certain instructions are not pipelined
No other instructions issue until they complete
Stalls both hardware threads: high latency and low throughput
What can we do about it?
Know when those instructions are generated
Avoid or code round those situations
But only in critical places

15 Branch Mispredicts

What is it?
Mispredicting a branch causes…
…the CPU to discard instructions it predicted it needed
…a 23-24 cycle delay while the correct instructions are fetched
Why is this a problem?
The misprediction penalty can…
…dominate total time in tight loops
…waste time fetching unneeded instructions

16 Branch Mispredicts

What can we do about it?
Know how the compiler implements branches
for, do, while, if
Function pointers, switches, virtual calls
Reduce the total branch count for the task
Use test-and-set style instructions (see the sketch below)
Refactor calculations to remove branches
Unroll
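
As a minimal sketch of the test-and-set style (using the 360's __fsel intrinsic, which the deck applies later in improvement 5): the clamp from the particle example can lose its branch entirely.

// Branchy version, mispredict-prone in a tight loop:
//     if (newIntensity < 0.0f) newIntensity = 0.0f;
// Branch-free version: __fsel(a, b, c) returns b if a >= 0.0f, else c
newIntensity = __fsel(newIntensity, newIntensity, 0.0f);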

17 Who Are Our Friends?

Profiling, profiling, profiling
360 tools:
PIX CPU instruction trace
LibPMCPB counters
XbPerfView sampling capture
Other platforms:
SN Tuner, VTune
Thinking laterally

18 General Improvements

inline
Make sure your function fits the profile
Pass and return in register: __declspec(passinreg)
__restrict
Releases the compiler from being ultra careful about aliasing (see the sketch below)
const
Doesn't affect code gen
But does affect your brain
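
A minimal hedged sketch of what __restrict buys (the function is illustrative only): with the promise that dst and src never alias, the compiler can hoist loads ahead of stores instead of being ultra careful.

void Scale(float* __restrict dst, const float* __restrict src, int n, float s)
{
    // Without __restrict, the compiler must assume a store to dst[i]
    // might change src, and re-load src[i] on every iteration.
    for (int i = 0; i < n; ++i)
        dst[i] = src[i] * s;
}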

19 General Improvements

Compiler options:
Inline all possible
Prefer speed over size
Platform specifics, 360:
/Ou removes the integer divide-by-zero trap
/Oc runs a second code scheduling pass
Don't write inline asm

20 General Improvements

Reduce parameter count:
Reduces function prologue and epilogue
Reduces stack access
Reduces LHS
Prefer 32, 64 and 128 bit variables
Isolate constants, or constant sets
Look to specialise, not generalise
Avoid virtual if feasible:
Unnecessary virtual means an indirected branch

21 Know Your Cache Architecture

Cache size
360: 1MB L2, 32KB L1
Cache line size
360: 128 bytes; x86: typically 64 bytes
Pre-fetch mechanism (see the sketch below)
360: dcbt, dcbz128
Cross-core sharing policy
360: L2 shared, L1 per core
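
A hedged sketch of software prefetch with __dcbt, in the spirit of improvement 10) later in this deck (the look-ahead distance is illustrative and needs tuning per workload):

const int cLookAheadBytes = 4 * 128;            // a few 128-byte lines ahead, tune empirically
for (int i = 0; i < count; ++i)
{
    __dcbt(cLookAheadBytes, &positions[i]);     // hint: start pulling a future line into cache
    UpdateParticle(positions[i]);               // hypothetical per-element work
}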

22 Know Pipeline & LHS Conditions

LHS caused by:
Pointer aliasing
Register set swap / casting
Be aware of non-pipelined instructions:
fsqrt, fdiv, int mul, int div, sraw
Be aware of pipeline flush issues
Especially fcmp
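
One standard dodge for the non-pipelined fsqrt, used later as improvement 4): compare squared lengths instead of lengths (a sketch against the particle example, valid because both sides are non-negative).

// Before: a non-pipelined sqrt just to make a comparison
//     const float length = __fsqrts(lenSqr);
//     if (length > cLimitLength) { ... }
// After: the same test, no sqrt
if (lenSqr > cLimitLength * cLimitLength) { /* ... */ }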

23 Knowing Your Instruction Set

360 specifics: VMX
Slow instructions
Regularly useful instructions:
fsel, vsel, vcmp*, vrlimi
PS3: Altivec & the world of the SPEs
PC: SSE, SSE2, SSE3, SSE4, SSE4.1 and friends

24 What Went Wrong With The Example?

Correctness
Always cross-compare results during development
We guessed at one performance issue: SIMD vs straight float
Giving SIMD 'some road':
Branch behaviour is exactly the same
Adding SIMD adds an LHS
Memory access and L2 usage are unchanged

25 Image Analysis

26 Image Analysis Example

Classification via Gaussian Mixture Model
For each pixel in a 320x240 array…
…evaluate 'cost' via up to 20 Gaussian models
…return the lowest cost found for the pixel
…submit the cost to a graph structure for min-cut
Profiling shows:
86% of time in the pixel cost function
No surprises there
1,536,000 Gaussian model applications

27 Image Analysis Example

float GMM::Cost(unsigned char r, unsigned char g, unsigned char b, size_t k)
{
    Component& component = mComponent[k];
    SampleType x(r, g, b);
    x -= component.Mean();
    FloatVector fx((float)x[0], (float)x[1], (float)x[2]);
    return component.EofLog() +
           0.5f * fx.Dot(component.CovInv().Multiply(fx));
}

float GMM::BestCost(unsigned char r, unsigned char g, unsigned char b)
{
    float bestCost = Cost(r, g, b, 0);
    for (size_t k = 1; k < nK; k++)
    {
        float cost = Cost(r, g, b, k);
        if (cost < bestCost)
            bestCost = cost;
    }
    return bestCost;
}
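
For reference, Cost() evaluates the standard per-component Gaussian cost, assuming EofLog folds in the precomputed log-normalisation term: cost(x) = EofLog + ½ (x − μ)ᵀ Σ⁻¹ (x − μ), where μ is component.Mean() and Σ⁻¹ is component.CovInv(). BestCost() then takes the minimum over the nK components.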

28 Image Analysis Example

What things look suspect?
L2 miss on the component load
Passing individual r, g, b elements
Building two separate vectors
Casting int to float
Vector maths
Branching may be an issue in BestCost():
The loop
The conditional inside the loop
Confirm with PIX on 360

29 Image Analysis Example

Pass 1
Don't even touch platform specifics:
Pass a single int, not 3 unsigned chars
Mark up all consts
Build the sample value once in the caller
Add __forceinline
Check correctness
Doesn't help a lot: gives about 1.1x

30 Image Analysis Example

Pass 2
Turn the Cost function innards to VMX
Return the cost as a __vector4 to avoid an LHS
Remove the if from the loop in BestCost by…
…keeping bestCost as a __vector4
…using __vcmpgefp to make a comparison mask
…using __vsel to pick the lowest value
Speedup of 1.7x
Constructing the __vector4s on the fly is expensive

31 Image Analysis Example

Pass 3
Build the colour as a __vector4 in the calling function
Build a static __vector4 containing {0.5f, 0.5f, 0.5f, 0.5f}
Load it once in the calling function
Mark all __vector4s as __declspec(passinreg)
Build a __vector4 version of Component
All calculations done as __vector4
More like it: speedup of 5.2x

32 Image Analysis Example

Pass 4
Go all the way out to the per-pixel calling code
Load a __vector4 at a time from the source array
Do 4 pixel costs at once
__vcmpgefp/__vsel works exactly the same
Return a __vector4 with 4 costs
Write to the results array as a single __vector4
Gives a speedup of 19.54x

33 Image Analysis Example

__declspec(passinreg) __vector4 CMOGs::BestCost(__declspec(passinreg) __vector4 colour) const
{
    __vector4 half = gHalf;
    const size_t nK = m_componentCount;
    assert(nK != 0);
    __vector4 bestCost = Cost(colour, half, 0);
    for (size_t k = 1; k < nK; k++)
    {
        const __vector4 cost = Cost(colour, half, k);
        const __vector4 mask = __vcmpgefp(bestCost, cost);
        bestCost = __vsel(bestCost, cost, mask);
    }
    return bestCost;
}

34 Image Analysis Example

const Component& comp = m_vComponent[k];
const __vector4 vEofLog = comp.GetVEofLog();
colour0 = __vsubfp(colour0, comp.GetVMean());
...
const __vector4 row0 = comp.GetVCovInv(0);
const __vector4 row1 = comp.GetVCovInv(1);
const __vector4 row2 = comp.GetVCovInv(2);
x = __vspltw(colour0, 0);
y = __vspltw(colour0, 1);
z = __vspltw(colour0, 2);
mulResult = __vmulfp(row0, x);
mulResult = __vmaddfp(row1, y, mulResult);
mulResult = __vmaddfp(row2, z, mulResult);
vdp2 = __vmsum3fp(mulResult, colour0);   // dot with the mean-relative colour
vdp2 = __vmaddfp(vdp2, half, vEofLog);   // half is the __vector4 parameter
result = vdp2;
...

35 Image Analysis Example

Hold on, this is image analysis. Shouldn't it be on the GPU?
Maybe, maybe not:
Per pixel we manipulate a dynamic tree structure
Excluding the tree structure…
…the CPU can run close to GPU speed
…but the syncing and memory throughput overhead isn't worth it

36 Movie Compression

37 Movie Compression Optimization

Timing results:
Freeware movie compressor on 360
76.3% of instructions spent in InterError()
Calculating error between macroblocks
Majority of time in 8x8 macroblock functions:
Picking up source and target intensity macroblocks
For each pixel, calculating the absolute difference
Summing differences along rows
Returning the sum of diffs
Or early-out when the sum exceeds a threshold

38 Movie Compression Optimization

int ThresholdSum(unsigned char* ptr1, unsigned char* ptr2,
                 int stride2, int stride1, int thres)
{
    int32 sad = 0;
    for (int i = 8; i; i--)
    {
        sad += DSP_OP_ABS_DIFF(ptr1[0], ptr2[0]);
        sad += DSP_OP_ABS_DIFF(ptr1[1], ptr2[1]);
        sad += DSP_OP_ABS_DIFF(ptr1[2], ptr2[2]);
        sad += DSP_OP_ABS_DIFF(ptr1[3], ptr2[3]);
        sad += DSP_OP_ABS_DIFF(ptr1[4], ptr2[4]);
        sad += DSP_OP_ABS_DIFF(ptr1[5], ptr2[5]);
        sad += DSP_OP_ABS_DIFF(ptr1[6], ptr2[6]);
        sad += DSP_OP_ABS_DIFF(ptr1[7], ptr2[7]);
        if (sad > thres)
            return sad;
        ptr1 += stride1;
        ptr2 += stride2;
    }
    return sad;
}

39 Movie Compression Optimization

Look at our worst enemies:
L2: 8x8 byte blocks, seems tight
LHS: it's all integer, so we should be LHS-free
Expensive instructions? No, just byte maths
Branching: should get the prediction right 7 out of 8 times

40 Movie Compression Optimization

Maths:
Element-by-element abs and average ops on bytes
Done row by row, exit when the sum exceeds the threshold
Perfect for VMX!
Awesome speedup of… 0%
Huh? Why?
Summing a row doesn't suit VMX
The branch penalty is still there
We have to do unaligned loads into VMX registers

41 Movie Compression Optimization

Let's think again
Look at the higher-level picture:
Error is calculated for 4 blocks at a time by the caller
Rows in blocks (0,1) and (2,3) are contiguous
Pick up two blocks at a time in VMX registers
Thresholding is by row
But there is no reason not to do it by column
That means we can sum columns in 7 instructions (see the sketch below)
Use __restrict on the block pointers
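
A minimal sketch of that 7-instruction column sum, lifted from the full routine on slide 45: with eight rows of widened differences in r0..r7, a three-level add tree leaves every column's total in r0.

r0 = __vaddshs(r0, r1);   r2 = __vaddshs(r2, r3);   // level 1: 8 rows to 4
r4 = __vaddshs(r4, r5);   r6 = __vaddshs(r6, r7);
r0 = __vaddshs(r0, r2);   r4 = __vaddshs(r4, r6);   // level 2: 4 to 2
r0 = __vaddshs(r0, r4);                             // level 3: 2 to 1, 7 adds total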

42 Movie Compression Optimization

[Diagram: the rows of blocks 0/1 and 2/3 loaded pairwise into VMX registers 0-7]

43 Movie Compression Optimization

Data layout & alignment:
Rows in 2 blocks are contiguous in memory
The source block is always 16-byte aligned
The dest block is only guaranteed to be byte aligned
Unrolling:
We can unroll the 8-iteration loop
We have plenty of VMX registers available
Return value:
Return a __vector4 to avoid the LHS of writing to an int

44 Movie Compression Optimization

Miscellaneous:
Prebuild the threshold word once
Remove the stride word parameters
Constant values in this application only
Proved with empirical research (and an assert)
Vector parameters and return in registers
Pushed vector error results out to the caller
All the callers' calculations in VMX, dropping LHS

45 Movie Compression Optimization

__vector4 __declspec(passinreg) twoblock_sad8x8__xbox(
    const unsigned char* __restrict ptr1,
    const unsigned char* __restrict ptr2)
{
    __vector4 zero = __vzero();
    // Load 8 rows of the two contiguous source blocks...
    __vector4 row1_0 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row1_1 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row1_2 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row1_3 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row1_4 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row1_5 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row1_6 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row1_7 = *(__vector4*)ptr1; ptr1 += cStride1;
    // ...and 8 rows of the two target blocks
    __vector4 row2_0 = *(__vector4*)ptr2; ptr2 += cStride2;
    __vector4 row2_1 = *(__vector4*)ptr2; ptr2 += cStride2;
    __vector4 row2_2 = *(__vector4*)ptr2; ptr2 += cStride2;
    __vector4 row2_3 = *(__vector4*)ptr2; ptr2 += cStride2;
    __vector4 row2_4 = *(__vector4*)ptr2; ptr2 += cStride2;
    __vector4 row2_5 = *(__vector4*)ptr2; ptr2 += cStride2;
    __vector4 row2_6 = *(__vector4*)ptr2; ptr2 += cStride2;
    __vector4 row2_7 = *(__vector4*)ptr2; ptr2 += cStride2;
    // Per-byte absolute difference: max minus min, saturated
    row1_0 = __vsubsbs(__vmaxub(row1_0,row2_0),__vminub(row1_0,row2_0));
    row1_1 = __vsubsbs(__vmaxub(row1_1,row2_1),__vminub(row1_1,row2_1));
    row1_2 = __vsubsbs(__vmaxub(row1_2,row2_2),__vminub(row1_2,row2_2));
    row1_3 = __vsubsbs(__vmaxub(row1_3,row2_3),__vminub(row1_3,row2_3));
    row1_4 = __vsubsbs(__vmaxub(row1_4,row2_4),__vminub(row1_4,row2_4));
    row1_5 = __vsubsbs(__vmaxub(row1_5,row2_5),__vminub(row1_5,row2_5));
    row1_6 = __vsubsbs(__vmaxub(row1_6,row2_6),__vminub(row1_6,row2_6));
    row1_7 = __vsubsbs(__vmaxub(row1_7,row2_7),__vminub(row1_7,row2_7));
    // Widen the byte differences to shorts by merging with zero
    row2_0 = __vmrglb(zero,row1_0); row1_0 = __vmrghb(zero,row1_0);
    row2_1 = __vmrglb(zero,row1_1); row1_1 = __vmrghb(zero,row1_1);
    row2_2 = __vmrglb(zero,row1_2); row1_2 = __vmrghb(zero,row1_2);
    row2_3 = __vmrglb(zero,row1_3); row1_3 = __vmrghb(zero,row1_3);
    row2_4 = __vmrglb(zero,row1_4); row1_4 = __vmrghb(zero,row1_4);
    row2_5 = __vmrglb(zero,row1_5); row1_5 = __vmrghb(zero,row1_5);
    row2_6 = __vmrglb(zero,row1_6); row1_6 = __vmrghb(zero,row1_6);
    row2_7 = __vmrglb(zero,row1_7); row1_7 = __vmrghb(zero,row1_7);
    // Column sums: a 7-add reduction tree for each half
    row1_0 = __vaddshs(row1_0,row1_1);
    row1_2 = __vaddshs(row1_2,row1_3);
    row1_4 = __vaddshs(row1_4,row1_5);
    row1_6 = __vaddshs(row1_6,row1_7);
    row1_0 = __vaddshs(row1_0,row1_2);
    row1_4 = __vaddshs(row1_4,row1_6);
    row1_0 = __vaddshs(row1_0,row1_4);
    row2_0 = __vaddshs(row2_0,row2_1);
    row2_2 = __vaddshs(row2_2,row2_3);
    row2_4 = __vaddshs(row2_4,row2_5);
    row2_6 = __vaddshs(row2_6,row2_7);
    row2_0 = __vaddshs(row2_0,row2_2);
    row2_4 = __vaddshs(row2_4,row2_6);
    row2_0 = __vaddshs(row2_0,row2_4);
    // Horizontal reduction of the column sums via shifted copies
    row1_1 = __vsldoi(row1_0,row2_0,2);
    row1_2 = __vsldoi(row1_0,row2_0,4);
    row1_3 = __vsldoi(row1_0,row2_0,6);
    row1_4 = __vsldoi(row1_0,row2_0,8);
    row1_5 = __vsldoi(row1_0,row2_0,10);
    row1_6 = __vsldoi(row1_0,row2_0,12);
    row1_7 = __vsldoi(row1_0,row2_0,14);
    row1_0 = __vrlimi(row1_0,row2_0,0x1,0);
    row2_0 = __vsldoi(row2_0,zero,2);
    row1_1 = __vrlimi(row1_1,row2_0,0x1,0);
    row1_0 = __vaddshs(row1_0,row1_1);   // add 4 rows to the next row
    row1_2 = __vaddshs(row1_2,row1_3);
    row1_4 = __vaddshs(row1_4,row1_5);
    row1_6 = __vaddshs(row1_6,row1_7);
    row1_0 = __vaddshs(row1_0,row1_2);
    row1_4 = __vaddshs(row1_4,row1_6);
    row1_0 = __vaddshs(row1_0,row1_4);
    // Shuffle the two blocks' totals into place and widen to words
    row1_0 = __vpermwi(row1_0,VPERMWI_CONST(0,3,0,0));
    row1_0 = __vmrghh(zero,row1_0);
    row1_0 = __vpermwi(row1_0,VPERMWI_CONST(0,2,0,0));
    return row1_0;
}

Unpleasant

46 Movie Compression Optimization

Results:
Un-thresholded macroblock compare
2.86 times quicker than the existing C
Not bad, but our code is doing 2 blocks at once, too
So actually 5.72 times quicker
Thresholded macroblock compare
4.12 times quicker
Optimizations to just the block compares…
…reduced movie compression time by 22%
…in the worst case, saved 40 seconds from compress time

47 Do We Get Improvements In Reverse?

Do we see improvements on PC?
Image analysis
Movie compression

48 Summary Interlude

Profiling, profiling, profiling
Know your enemy
Explore data alignment and layout
Give SIMD plenty of room to work
Don't ignore simple code structure changes
Specialise, not generalise

49 Original Example

50 Improving Original Example

PIX summary:
704k instructions executed
40% L2 usage
Top penalties:
L2 cache miss @ 3m cycles
bctr mispredicts @ 1.14m cycles
__fsqrt @ 696k cycles
2x fcmp @ 490k cycles
Some 20.9m cycles of penalty overall
Takes 7.528ms

51 Improving Original Example

1) Avoid branch mispredict #1
Ditch the zealous use of virtual
Call functions just once
Gives 1.13x speedup
2) Improve L2 use #1
Refactor the list to a contiguous array
Hot/cold split
Use a bitfield for the active flag (see the sketch below)
Gives 3.59x speedup
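
A hedged sketch of the bitfield active flag (helper name hypothetical): 32 particles' flags now arrive in one 4-byte load instead of 32 scattered bools.

inline bool IsActive(const unsigned int* activeBits, int i)
{
    return ((activeBits[i >> 5] >> (i & 31)) & 1) != 0;
}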

52 Improving Original Example

4) Remove expensive instructions
Ditch __fsqrts and compare with squares
Gives 4.05x speedup
5) Avoid branch mispredict #2
Insert __fsel() to select the tail length
Gives 4.44x speedup
Insert a 2nd __fsel
Now only the loop and active-flag branches remain
Gives 5.0x speedup

53 Improving Original Example

7) Use VMX
Use __vsubfp and __vmsum3fp for the vector maths
Gives 5.28x speedup
8) Avoid branch mispredict #3
Unroll the loop 4x
Sticks at 5.28x speedup

54 Improving Original Example

9) Avoid branch mispredict #4
Build a __vector4 mask from the active flags
__vsel tail lengths from existing and new
Write a single __vector4 result (see the sketch below)
Now only the loop branch remains
Gives 6.01x speedup
10) Improve L2 access #2
Add __dcbt on the position array
Gives 16.01x speedup
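
A minimal hedged sketch of improvement 9) (the helper names are hypothetical): every particle's slot is written, but masked-out lanes keep their old value, so no data-dependent branch is issued.

__vector4 mask       = MakeActiveMask(i);            // hypothetical: all-ones per active lane
__vector4 newLengths = ComputeTailLengths(i);        // hypothetical VMX tail length calculation
__vector4 oldLengths = *(__vector4*)&tailLengths[i];
// __vsel picks newLengths where mask bits are set, oldLengths elsewhere
*(__vector4*)&tailLengths[i] = __vsel(oldLengths, newLengths, mask);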

55 Improving Original Example

11) Improve L2 use #3
Move to short coordinates
Now loading ¼ the data for positions
Gives 21.23x speedup
12) Avoid branch mispredict #5
We are now writing tail lengths for every particle
Wait, we don't care about inactive particles
Epiphany: don't check the active flag at all
Gives 23.21x speedup

56 Improving Original Example

13) Improve L2 use #4
The remaining L2 misses are on the output array
__dcbt that too
Tweak the __dcbt offsets and pre-load
Gives 39.01x speedup

57 Improving Original Example

PIX summary:
259k instructions executed
99.4% L2 usage
Top penalties:
ERAT data miss @ 14k cycles
1 LHS via 4KB aliasing
No mispredict penalties
71k cycles of penalty overall
Takes 0.193ms

58 Improving Original Example

Caveat:
Slightly trivial code example
Not all techniques are possible in 'real life'
But the principles always apply
The dcbz128 mystery?
We write the entire output array
Should be able to save L2 loads by pre-zeroing with dcbz128
But results showed a slowdown

59 Thanks For Listening

Any questions?

60 © 2009 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary. http://www.xna.com

