
© 2009 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.

Out of Order: Making In-order Processors Play Nicely
Allan Murphy, XNA Developer Connection, Microsoft

Optimization Example

class BaseParticle
{
public:
    …
    virtual Vector& Position()         { return mPosition; }
    virtual Vector& PreviousPosition() { return mPreviousPosition; }
    float& Intensity() { return mIntensity; }
    bool&  Active()    { return mActive; }
    float& Lifetime()  { return mLifetime; }
    …
private:
    …
    float  mIntensity;
    float  mLifetime;
    bool   mActive;
    Vector mPosition;
    Vector mPreviousPosition;
    …
};

Optimization Example

// Boring old vector class
class Vector
{
    …
public:
    float x, y, z, w;
};

// Boring old generic linked list class
template <typename T>
class ListNode
{
public:
    ListNode(T* contents) : mNext(NULL), mContents(contents) {}
    void SetNext(ListNode* node) { mNext = node; }
    ListNode* NextNode() { return mNext; }
    T* Contents()        { return mContents; }
private:
    ListNode* mNext;
    T*        mContents;
};

Optimization Example

// Run through list and update each active particle
for (ListNode<BaseParticle>* node = gParticles; node != NULL; node = node->NextNode())
    if (node->Contents()->Active())
    {
        Vector vel;
        vel.x = node->Contents()->Position().x - node->Contents()->PreviousPosition().x;
        vel.y = node->Contents()->Position().y - node->Contents()->PreviousPosition().y;
        vel.z = node->Contents()->Position().z - node->Contents()->PreviousPosition().z;

        const float length = __fsqrts((vel.x*vel.x) + (vel.y*vel.y) + (vel.z*vel.z));

        if (length > cLimitLength)
        {
            float newIntensity = cMaxIntensity - node->Contents()->Lifetime();
            if (newIntensity < 0.0f)
                newIntensity = 0.0f;
            node->Contents()->Intensity() = newIntensity;
        }
        else
            node->Contents()->Intensity() = 0.0f;
    }

Optimization Example

// Replacement for straight C vector work
// Build 360-friendly __vector4s
__vector4 position, prevPosition;
position.x = node->Contents()->Position().x;
position.y = node->Contents()->Position().y;
position.z = node->Contents()->Position().z;
prevPosition.x = node->Contents()->PreviousPosition().x;
prevPosition.y = node->Contents()->PreviousPosition().y;
prevPosition.z = node->Contents()->PreviousPosition().z;

// Use VMX to do the calculations
__vector4 velocity    = __vsubfp(position, prevPosition);
__vector4 velocitySqr = __vmsum4fp(velocity, velocity);

// Grab the length result from the vector
const float length = __fsqrts(velocitySqr.x);

Job done, right? Thank you for listening

Optimization Example

Hold on. If we time it…
It's actually slower than the straight C version
And if we check the results…
It's also wrong!
Incorrect results are a rather special case of "optimization"
Unfortunately, this does happen in practice

Important Caveat

Today we're talking about optimization
But the techniques discussed are orthogonal to…
…good algorithm choice
…good multithreading system implementation
It's like Mr. Knuth said
These techniques typically produce code which is…
…very non-general
…very difficult to maintain or understand
…possibly completely platform specific

But My Code Is Really Quick On PC…?

A common assumption:
It's quick on PC
360 & PS3 have 3.2GHz clock speeds
Should be good on console! Right?
Alas, the 360 cores and the PS3 PPU have…
No instruction reordering hardware
No store forwarding hardware
Smaller caches and slower memory
No L3 cache

The 4 Horsemen of the In-Order Apocalypse

What goes wrong?
Load-hit-store (LHS)
L2 miss
Expensive, non-pipelined instructions
Branch mispredict penalty

Load-Hit-Store (LHS)

What is it?
Storing to a memory location…
…then loading from it very shortly after
What causes LHS?
Casts, changing register set, aliasing
Why is it a problem?
On PC, the bullet is usually dodged by…
…instruction re-ordering
…store forwarding hardware
The in-order cores have neither, so the load stalls until the stored data is visible again
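A minimal sketch (not from the talk) of the kind of pattern that trips an LHS on these cores: the int lives in a general-purpose register, the cast typically forces it out through memory, and the float unit immediately loads the same location back.

    // Hypothetical example: int-to-float conversion goes via memory -
    // a store from a GPR followed by an immediate FPR reload, i.e. a
    // load-hit-store stall.
    float Scale(int count, float step)
    {
        return (float)count * step;     // store int, load float: LHS
    }

    // Keeping the value in the float domain end-to-end avoids the round trip.
    float ScaleNoLHS(float count, float step)
    {
        return count * step;            // stays in floating-point registers
    }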

L2 Miss

What is it?
Loading from a location not already in the cache
Why is it a problem?
Costs ~610 cycles to load a cache line
You can do a lot of work in 610 cycles
What can we do about it?
Hot/cold split
Reduce in-memory data size
Use cache-coherent structures
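As an illustration only (the field grouping is an assumption, not the talk's), a hot/cold split of the earlier BaseParticle keeps the per-frame fields together, so each 128-byte line the update loop pulls in is mostly useful data:

    // Hot data: touched every frame, stored contiguously so the update loop
    // streams through cache lines full of useful bytes.
    struct ParticleHot
    {
        Vector position;
        Vector previousPosition;
        float  intensity;
        float  lifetime;
    };

    // Cold data: touched rarely (spawn-time setup, editor/debug info), kept
    // elsewhere so it never pollutes the hot cache lines.
    struct ParticleCold
    {
        // spawn parameters, debug names, ...
    };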

Expensive Instructions

What are they?
Certain instructions are not pipelined
No other instructions are issued until they complete
Stalls both hardware threads: high latency and low throughput
What can we do about it?
Know when those instructions are generated
Avoid or code around those situations (see the squared-length sketch below)
But only in critical places
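One pattern the talk returns to later (step 4 of the original example): when the result only feeds a comparison, the non-pipelined square root can be dropped entirely. A minimal sketch:

    // Instead of:
    //     const float length = __fsqrts(vel.x*vel.x + vel.y*vel.y + vel.z*vel.z);
    //     if (length > cLimitLength) ...
    // compare squared values (valid because cLimitLength is non-negative):
    const float lengthSqr = vel.x*vel.x + vel.y*vel.y + vel.z*vel.z;
    if (lengthSqr > cLimitLength * cLimitLength)
    {
        // same decision, no non-pipelined __fsqrts in the loop
    }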

Branch Mispredicts

What is it?
Mispredicting a branch causes…
…the CPU to discard instructions it predicted it needed
…a 23-24 cycle delay while the correct instructions are fetched
Why is this a problem?
Misprediction penalty can…
…dominate total time in tight loops
…waste time fetching unneeded instructions

Branch Mispredicts

What can we do about it?
Know how the compiler implements branches
for, do, while, if
Function pointers, switches, virtual calls
Reduce the total branch count for the task
Use test-and-set style instructions
Refactor calculations to remove branches (a branch-free select sketch follows below)
Unroll
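For example, the clamp in the original particle loop can be written with the 360's float-select intrinsic instead of an if. A sketch, assuming __fsel(a, b, c) yields b when a >= 0 and c otherwise:

    // Branch-free clamp-to-zero of the new intensity value.
    float newIntensity = cMaxIntensity - node->Contents()->Lifetime();
    newIntensity = __fsel(newIntensity, newIntensity, 0.0f);   // max(newIntensity, 0.0f)
    node->Contents()->Intensity() = newIntensity;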

Who Are Our Friends?

Profiling, profiling, profiling
360 tools
PIX CPU instruction trace
LibPMCPB counters
XbPerfView sampling capture
Other platforms
SN Tuner, VTune
Thinking laterally

General Improvements

inline
Make sure your function fits the inlining profile
Pass and return in registers
__declspec(passinreg)
__restrict
Releases the compiler from being ultra-careful about aliasing
const
Doesn't affect code gen
But does affect your brain
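A small sketch (not from the talk) of why __restrict helps: promising the compiler the pointers don't alias lets it keep values in registers instead of re-loading after every store.

    // With __restrict the compiler may keep a[i] and b[i] in registers across
    // the store to dst[i]; without it, it must assume dst might overlap a or b
    // and re-load them - extra loads and potential load-hit-stores.
    void AddArrays(float* __restrict dst,
                   const float* __restrict a,
                   const float* __restrict b,
                   unsigned count)
    {
        for (unsigned i = 0; i < count; ++i)
            dst[i] = a[i] + b[i];
    }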

General Improvements

Compiler options
Inline all possible
Prefer speed over size
Platform specifics (360)
/Ou - removes the integer divide-by-zero trap
/Oc - runs a second code scheduling pass
Don't write inline asm

General Improvements

Reduce parameter counts
Reduce function prologue and epilogue work
Reduce stack access
Reduce LHS
Prefer 32, 64 and 128 bit variables
Isolate constants - or constant sets
Look to specialise, not generalise
Avoid virtual where feasible
Unnecessary virtual means an indirect branch

Know Your Cache Architecture

Cache size
360: 1 MB L2, 32 KB L1
Cache line size
360: 128 bytes; x86: typically 64 bytes
Prefetch mechanism
360: dcbt, dcbz128
Cross-core sharing policy
360: L2 shared, L1 per core

Know Pipeline & LHS Conditions

LHS is caused by:
Pointer aliasing
Register set swaps / casting
Be aware of non-pipelined instructions
fsqrt, fdiv, integer mul, integer div, sraw
Be aware of pipeline flush issues
Especially fcmp

Knowing Your Instruction Set

360 specifics: VMX
Slow instructions
Regularly useful instructions
fsel, vsel, vcmp*, vrlimi
PS3
Altivec and the world of the SPEs
PC
SSE, SSE2, SSE3, SSE4, SSE4.1 and friends

What Went Wrong With The Example?

Correctness
Always cross-compare against a reference during development
We guessed at one performance issue: SIMD vs straight float
Giving SIMD 'some road'
Branch behaviour is exactly the same
Adding SIMD adds an LHS
Memory access and L2 usage are unchanged
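The new LHS comes from how the __vector4s were built in the replacement code: each component is written through memory as a scalar float, and __vsubfp then loads the whole 16-byte vector straight back. Roughly:

    // Sketch of the problem, mirroring the earlier replacement code.
    __vector4 position;
    position.x = node->Contents()->Position().x;          // scalar float stores...
    position.y = node->Contents()->Position().y;
    position.z = node->Contents()->Position().z;
    // ...followed immediately by a vector load of the same memory:
    __vector4 velocity = __vsubfp(position, prevPosition); // load hits the pending stores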

Image Analysis

Image Analysis Example

Classification via Gaussian Mixture Model
For each pixel in a 320x240 array…
Evaluate 'cost' via up to 20 Gaussian models
Return the lowest cost found for the pixel
Submit the cost to a graph structure for min-cut
Profiling shows:
86% of time in the pixel cost function
No surprises there: 1,536,000 Gaussian model applications

Image Analysis Example

float GMM::Cost(unsigned char r, unsigned char g, unsigned char b, size_t k)
{
    Component& component = mComponent[k];
    SampleType x(r, g, b);
    x -= component.Mean();
    FloatVector fx((float)x[0], (float)x[1], (float)x[2]);
    return component.EofLog() +
           0.5f * fx.Dot(component.CovInv().Multiply(fx));
}

float GMM::BestCost(unsigned char r, unsigned char g, unsigned char b)
{
    float bestCost = Cost(r, g, b, 0);
    for (size_t k = 1; k < nK; k++)
    {
        float cost = Cost(r, g, b, k);
        if (cost < bestCost)
            bestCost = cost;
    }
    return bestCost;
}
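For reference (my reading of the code above, not something the talk states), the per-component cost is the usual Gaussian negative log-likelihood up to constant terms:

    Cost(x, k) = EofLog_k + 0.5 * (x - Mean_k)^T * CovInv_k * (x - Mean_k)

where EofLog_k presumably folds in the log-determinant and mixture-weight terms.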

Image Analysis Example

What things look suspect?
L2 miss on the component load
Passing individual r, g, b elements
Building two separate vectors
Casting int to float
Vector maths
Branching may be an issue in BestCost()
The loop
The conditional inside the loop
Confirm with PIX on 360

Image Analysis Example

Pass 1
Don't even touch platform specifics
Pass a single int, not 3 unsigned chars
Mark up all consts
Build the sample value once in the caller
Add __forceinline
Check correctness
Doesn't help a lot - gives about 1.1x
(A sketch of the reworked interface follows below)
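A hedged sketch of the Pass 1 shape - the names, packing order and exact signatures are assumptions, not the talk's code:

    // BestCost takes one packed colour instead of three chars, builds the
    // sample once, and the now-const, force-inlined Cost just reuses it.
    __forceinline float GMM::Cost(const SampleType& sample, size_t k) const
    {
        const Component& component = mComponent[k];
        SampleType x = sample;
        x -= component.Mean();
        const FloatVector fx((float)x[0], (float)x[1], (float)x[2]);
        return component.EofLog() + 0.5f * fx.Dot(component.CovInv().Multiply(fx));
    }

    float GMM::BestCost(unsigned int packedRGB) const
    {
        const SampleType sample((packedRGB >> 16) & 0xFF,   // r - packing order assumed
                                (packedRGB >> 8)  & 0xFF,   // g
                                 packedRGB        & 0xFF);  // b
        float bestCost = Cost(sample, 0);
        for (size_t k = 1; k < nK; k++)
        {
            const float cost = Cost(sample, k);
            if (cost < bestCost)
                bestCost = cost;
        }
        return bestCost;
    }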

Image Analysis Example

Pass 2
Turn the Cost function innards into VMX
Return the cost as a __vector4 to avoid an LHS
Remove the if from the loop in BestCost by…
Keeping bestCost as a __vector4
Using vcmpgefp to make a comparison mask
Using vsel to pick the lowest value
Speedup of 1.7x
Constructing the __vector4s on the fly is expensive

Image Analysis Example

Pass 3
Build the colour as a __vector4 in the calling function
Build a static __vector4 containing {0.5f, 0.5f, 0.5f, 0.5f}
Load it once in the calling function
Mark all __vector4 parameters as __declspec(passinreg)
Build a __vector4 version of Component
All calculations done as __vector4
More like it - speedup of 5.2x

Image Analysis Example

Pass 4
Go all the way out to the per-pixel calling code
Load a __vector4 at a time from the source array
Do 4 pixel costs at once
__vcmpgefp / __vsel works exactly the same
Return a __vector4 with 4 costs
Write to the results array as a single __vector4
Gives a speedup of 19.54x

Image Analysis Example

__declspec(passinreg) __vector4 CMOGs::BestCost(__declspec(passinreg) __vector4 colours) const
{
    __vector4 half = gHalf;
    const size_t nK = m_componentCount;
    assert(nK != 0);
    __vector4 bestCost = Cost(colours, half, 0);
    for (size_t k = 1; k < nK; k++)
    {
        const __vector4 cost = Cost(colours, half, k);
        const __vector4 mask = __vcmpgefp(bestCost, cost);
        bestCost = __vsel(bestCost, cost, mask);
    }
    return bestCost;
}

Image Analysis Example

const Component& comp = m_vComponent[k];
const __vector4 vEofLog = comp.GetVEofLog();
colour0 = __vsubfp(colour0, comp.GetVMean());
…
const __vector4 row0 = comp.GetVCovInv(0);
const __vector4 row1 = comp.GetVCovInv(1);
const __vector4 row2 = comp.GetVCovInv(2);
x = __vspltw(colour0, 0);
y = __vspltw(colour0, 1);
z = __vspltw(colour0, 2);
mulResult = __vmulfp(row0, x);
mulResult = __vmaddfp(row1, y, mulResult);
mulResult = __vmaddfp(row2, z, mulResult);
vdp2 = __vmsum3fp(mulResult, input);
vdp2 = __vmaddfp(vdp2, half, vEofLog);  // half is the __vector4 parameter
result = vdp2;
…

Image Analysis Example

Hold on, this is image analysis. Shouldn't it be on the GPU?
Maybe, maybe not:
Per pixel, we manipulate a dynamic tree structure
Excluding the tree structure…
…the CPU can run close to GPU speed
…but the syncing and memory throughput overhead isn't worth it

Movie Compression

Movie Compression Optimization

Timing results from a freeware movie compressor
A large percentage of instructions is spent in InterError()
Calculating the error between macroblocks
The majority of time is in the 8x8 macroblock functions
Picking up the source and target intensity macroblocks
For each pixel, calculating the absolute difference
Summing the differences along rows
Returning the sum of differences
Or early-out when the sum exceeds a threshold

Movie Compression Optimization

int ThresholdSum(unsigned char* ptr1, unsigned char* ptr2,
                 int stride2, int stride1, int thres)
{
    int32 sad = 0;
    for (int i = 8; i; i--)
    {
        sad += DSP_OP_ABS_DIFF(ptr1[0], ptr2[0]);
        sad += DSP_OP_ABS_DIFF(ptr1[1], ptr2[1]);
        sad += DSP_OP_ABS_DIFF(ptr1[2], ptr2[2]);
        sad += DSP_OP_ABS_DIFF(ptr1[3], ptr2[3]);
        sad += DSP_OP_ABS_DIFF(ptr1[4], ptr2[4]);
        sad += DSP_OP_ABS_DIFF(ptr1[5], ptr2[5]);
        sad += DSP_OP_ABS_DIFF(ptr1[6], ptr2[6]);
        sad += DSP_OP_ABS_DIFF(ptr1[7], ptr2[7]);
        if (sad > thres)
            return sad;
        ptr1 += stride1;
        ptr2 += stride2;
    }
    return sad;
}

Movie Compression Optimization

Look at our worst enemies
L2
8x8 byte blocks - seems tight
LHS
It's all integer, so we should be LHS-free
Expensive instructions?
No, just byte maths
Branching
Should get the prediction right 7 times out of 8

Movie Compression Optimization

Maths
Element-by-element abs and average ops on bytes
Done row by row, exit when the sum goes over the threshold
Perfect for VMX!
Awesome speedup of… 0%
Huh? Why?
Summing a row doesn't suit VMX
The branch penalty is still there
We have to do unaligned loads into VMX registers

Movie Compression Optimization

Let's think again
Look at the higher-level picture
The error is calculated for 4 blocks at a time by the caller
Rows in blocks (0,1) and (2,3) are contiguous
Pick up two blocks at a time in VMX registers
Thresholding is by row
But there is no reason not to do it by column
That means we can sum columns in 7 instructions
Use __restrict on the block pointers

Movie Compression Optimization

[Diagram: block rows laid out across VMX registers 0-7]

Movie Compression Optimization

Data layout & alignment
Rows in the 2 blocks are contiguous in memory
The source block is always 16-byte aligned
The destination block is only guaranteed to be byte-aligned
Unrolling
We can unroll the 8-iteration loop
We have plenty of VMX registers available
Return value
Return a __vector4 to avoid an LHS when writing to an int

Movie Compression Optimization

Miscellaneous
Pre-build the threshold word once
Remove the stride word parameters
They are constant values in this application only
Proved with empirical research (and an assert)
Vector parameters and return value in registers
Pushed vector error results out to the caller
All the caller's calculations are in VMX - drops the LHS

Movie Compression Optimization

__vector4 __declspec(passinreg) twoblock_sad8x8__xbox(const unsigned char* __restrict ptr1,
                                                      const unsigned char* __restrict ptr2)
{
    __vector4 zero = __vzero();

    // Load 8 rows from each source; each 16-byte row spans two adjacent 8x8 blocks
    __vector4 row1_0 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row1_1 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row1_2 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row1_3 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row1_4 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row1_5 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row1_6 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row1_7 = *(__vector4*)ptr1; ptr1 += cStride1;
    __vector4 row2_0 = *(__vector4*)ptr2; ptr2 += cStride2;
    __vector4 row2_1 = *(__vector4*)ptr2; ptr2 += cStride2;
    __vector4 row2_2 = *(__vector4*)ptr2; ptr2 += cStride2;
    __vector4 row2_3 = *(__vector4*)ptr2; ptr2 += cStride2;
    __vector4 row2_4 = *(__vector4*)ptr2; ptr2 += cStride2;
    __vector4 row2_5 = *(__vector4*)ptr2; ptr2 += cStride2;
    __vector4 row2_6 = *(__vector4*)ptr2; ptr2 += cStride2;
    __vector4 row2_7 = *(__vector4*)ptr2; ptr2 += cStride2;

    // Per-byte difference via max/min
    row1_0 = __vsubsbs(__vmaxub(row1_0,row2_0),__vminub(row1_0,row2_0));
    row1_1 = __vsubsbs(__vmaxub(row1_1,row2_1),__vminub(row1_1,row2_1));
    row1_2 = __vsubsbs(__vmaxub(row1_2,row2_2),__vminub(row1_2,row2_2));
    row1_3 = __vsubsbs(__vmaxub(row1_3,row2_3),__vminub(row1_3,row2_3));
    row1_4 = __vsubsbs(__vmaxub(row1_4,row2_4),__vminub(row1_4,row2_4));
    row1_5 = __vsubsbs(__vmaxub(row1_5,row2_5),__vminub(row1_5,row2_5));
    row1_6 = __vsubsbs(__vmaxub(row1_6,row2_6),__vminub(row1_6,row2_6));
    row1_7 = __vsubsbs(__vmaxub(row1_7,row2_7),__vminub(row1_7,row2_7));

    // Unpack the byte differences to 16-bit halves
    row2_0 = __vmrglb(zero,row1_0); row1_0 = __vmrghb(zero,row1_0);
    row2_1 = __vmrglb(zero,row1_1); row1_1 = __vmrghb(zero,row1_1);
    row2_2 = __vmrglb(zero,row1_2); row1_2 = __vmrghb(zero,row1_2);
    row2_3 = __vmrglb(zero,row1_3); row1_3 = __vmrghb(zero,row1_3);
    row2_4 = __vmrglb(zero,row1_4); row1_4 = __vmrghb(zero,row1_4);
    row2_5 = __vmrglb(zero,row1_5); row1_5 = __vmrghb(zero,row1_5);
    row2_6 = __vmrglb(zero,row1_6); row1_6 = __vmrghb(zero,row1_6);
    row2_7 = __vmrglb(zero,row1_7); row1_7 = __vmrghb(zero,row1_7);

    // Sum the 8 rows of each half
    row1_0 = __vaddshs(row1_0,row1_1);
    row1_2 = __vaddshs(row1_2,row1_3);
    row1_4 = __vaddshs(row1_4,row1_5);
    row1_6 = __vaddshs(row1_6,row1_7);
    row1_0 = __vaddshs(row1_0,row1_2);
    row1_4 = __vaddshs(row1_4,row1_6);
    row1_0 = __vaddshs(row1_0,row1_4);
    row2_0 = __vaddshs(row2_0,row2_1);
    row2_2 = __vaddshs(row2_2,row2_3);
    row2_4 = __vaddshs(row2_4,row2_5);
    row2_6 = __vaddshs(row2_6,row2_7);
    row2_0 = __vaddshs(row2_0,row2_2);
    row2_4 = __vaddshs(row2_4,row2_6);
    row2_0 = __vaddshs(row2_0,row2_4);

    row1_1 = __vsldoi(row1_0,row2_0,2);
    row1_2 = __vsldoi(row1_0,row2_0,4);
    row1_3 = __vsldoi(row1_0,row2_0,6);
    row1_4 = __vsldoi(row1_0,row2_0,8);
    row1_5 = __vsldoi(row1_0,row2_0,10);
    row1_6 = __vsldoi(row1_0,row2_0,12);
    row1_7 = __vsldoi(row1_0,row2_0,14);
    row1_0 = __vrlimi(row1_0,row2_0,0x1,0);
    row2_0 = __vsldoi(row2_0,zero,2);
    row1_1 = __vrlimi(row1_1,row2_0,0x1,0);
    row1_0 = __vaddshs(row1_0,row1_1);  // add 4 rows to the next row
    row1_2 = __vaddshs(row1_2,row1_3);
    row1_4 = __vaddshs(row1_4,row1_5);
    row1_6 = __vaddshs(row1_6,row1_7);
    row1_0 = __vaddshs(row1_0,row1_2);
    row1_4 = __vaddshs(row1_4,row1_6);
    row1_0 = __vaddshs(row1_0,row1_4);
    row1_0 = __vpermwi(row1_0,VPERMWI_CONST(0,3,0,0));
    row1_0 = __vmrghh(zero,row1_0);
    row1_0 = __vpermwi(row1_0,VPERMWI_CONST(0,2,0,0));
    return row1_0;
}

Unpleasant

Movie Compression Optimization

Results
Un-thresholded macroblock compare
2.86 times quicker than the existing C
Not bad, but our code is doing 2 blocks at once, too
So actually, 5.72 times quicker
Thresholded macroblock compare
4.12 times quicker
Optimizations to just the block compares…
…reduced movie compression time by 22%
…in the worst case, saved 40 seconds from the compress time

Do We Get Improvements In Reverse?

Do we see improvements on PC?
Image analysis
Movie compression

Summary Interlude

Profiling, profiling, profiling
Know your enemy
Explore data alignment and layout
Give SIMD plenty of room to work
Don't ignore simple code structure changes
Specialise, not generalise

Original Example

Improving Original Example

PIX Summary
704k instructions executed
40% L2 usage
Top penalties:
L2 cache: 3m cycles
bctr: 1.14m cycles
696k cycles
2x 490k cycles
Some 20.9m cycles of penalty overall
Takes 7.528ms

Improving Original Example

1) Avoid branch mispredicts #1
Ditch the zealous use of virtual
Call accessor functions just once
Gives a 1.13x speedup
2) Improve L2 use #1
Refactor the list into a contiguous array
Hot/cold split
Use a bitfield for the active flag
Gives a 3.59x speedup
(A sketch of the refactored layout follows below)
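A hedged sketch of what that refactor might look like - names, sizes and exact grouping are assumptions, not the talk's code:

    // Contiguous, hot/cold-split particle storage (sketch only).
    struct ParticleSystem
    {
        static const unsigned cMaxParticles = 4096;          // size is an assumption

        // Hot: streamed linearly by the update loop.
        Vector   position[cMaxParticles];
        Vector   previousPosition[cMaxParticles];
        float    lifetime[cMaxParticles];
        float    intensity[cMaxParticles];                   // output

        // Active flags packed 32 per word instead of one bool per particle.
        unsigned activeBits[(cMaxParticles + 31) / 32];
    };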

Improving Original Example

4) Remove expensive instructions
Ditch __fsqrts and compare with squared lengths
Gives a 4.05x speedup
5) Avoid branch mispredicts #1
Insert __fsel() to select the tail length
Gives a 4.44x speedup
Insert a 2nd __fsel
Now only the loop and active-flag branches remain
Gives a 5.0x speedup

Improving Original Example

7) Use VMX
Use __vsubfp and __vmsum3fp for the vector maths
Gives a 5.28x speedup
8) Avoid branch mispredicts #2
Unroll the loop 4x
Sticks at a 5.28x speedup

Improving Original Example

9) Avoid branch mispredicts #3
Build a __vector4 mask from the active flags
__vsel tail lengths from the existing and new values
Write a single __vector4 result
Now only the loop branch remains
Gives a 6.01x speedup
10) Improve L2 access #2
Add __dcbt on the position array (a prefetch sketch follows below)
Gives a 16.01x speedup
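A sketch of the kind of prefetching meant here, assuming the 360 intrinsic's (offset, base) form; the distance is a tuning guess, not the talk's value:

    // Prefetch the position data a few cache lines ahead of where the loop
    // is currently reading (cPrefetchAhead is hypothetical; tune with PIX).
    const int cPrefetchAhead = 4 * 128;   // 4 cache lines of 128 bytes
    for (unsigned i = 0; i < particleCount; ++i)
    {
        __dcbt(cPrefetchAhead, &position[i]);   // warm a future cache line
        // ... update particle i ...
    }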

Improving Original Example

11) Improve L2 use #3
Move to short coordinates
Now loading ¼ of the data for positions
Gives a 21.23x speedup
12) Avoid branch mispredicts #4
We are now writing tail lengths for every particle
Wait - we don't care about inactive particles
Epiphany: don't check the active flag at all
Gives a 23.21x speedup

Improving Original Example

13) Improve L2 use #4
The remaining L2 misses are on the output array
__dcbt that too
Tweak the __dcbt offsets and pre-load
Gives a 39.01x speedup

Improving Original Example

PIX Summary
259k instructions executed
99.4% L2 usage
Top penalties:
ERAT data: 14k cycles
1 LHS via 4 KB aliasing
No mispredict penalties
71k cycles of penalty overall
Takes 0.193ms

Improving Original Example

Caveat
A slightly trivial code example
Not all techniques are possible in 'real life'
But the principles always apply
The dcbz128 mystery?
We write the entire output array
We should be able to save L2 loads by pre-zeroing with dcbz128
But the results showed a slowdown

Thanks For Listening Any questions?

© 2009 Microsoft Corporation. All rights reserved. This presentation is for informational purposes only. Microsoft makes no warranties, express or implied, in this summary.