Optimization Issues in SSE/AVX-compatible functions on PowerPC
Ian McIntosh
November 5, 2014
www.ibm.com © 2006-2008 IBM Corporation

Abstract

As an experiment, some SSE/AVX-compatible functions were implemented on PowerPC to see whether they would allow easier porting and more SIMD parallelism in ported programs. Trying to maximize their performance led to finding missed compiler optimization opportunities, a few compiler bugs in rarely executed code, and some changes in programming techniques. Another result was discovering which aspects of little-endian SSE/AVX SIMD are hard to emulate efficiently on big-endian PowerPC VMX (AltiVec) / VSX SIMD.

Background

• Auto-SIMDization is quite successful in some compilers but not in others.
• Also, SIMD instructions include operations, like saturated add, that are not available otherwise.
• As a result, many programs use vendor-specific SIMD intrinsic or built-in functions to improve their performance.
• That severely impedes portability.
• SIMD functions are unlikely to be standardized soon, if ever.
• One potential solution is to emulate one vendor's functions by using another vendor's.
• A small experimental prototype of that was tried.
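To make the saturated add point concrete, here is a minimal scalar sketch of what one lane of a signed 16-bit saturating add has to do. The function names are hypothetical and chosen only for illustration. SIMD hardware does the clamping for all lanes in one instruction (for example `_mm_adds_epi16` on SSE2 or `vec_adds` on VMX), whereas plain C needs a widening add plus two comparisons per element.

```c
#include <stdint.h>

/* Scalar model of one lane of a signed 16-bit saturating add.
   (Hypothetical helper, not part of any vendor API.) */
static int16_t sat_add_i16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;   /* cannot overflow in 32 bits */
    if (sum > INT16_MAX) return INT16_MAX;   /* clamp instead of wrapping */
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* Apply the lane model to 8 lanes, the width of one 128-bit vector. */
static void sat_add_8x16(const int16_t a[8], const int16_t b[8], int16_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = sat_add_i16(a[i], b[i]);
}
```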

Function Types Investigated

• A small set of 8-, 16-, 32- and 64-bit integer SIMD operations, and a few single- and double-precision floating-point operations, were tried.

Approach Taken

• SSE/AVX function prototypes were joined to brand new bodies using PowerPC AltiVec built-in functions; e.g.,

    /* Add 4 32-bit ints */
    __m128i _mm_add_epi32 (__m128i left, __m128i right)
    {
      return (__m128i) vec_add ((vector signed int) left,
                                (vector signed int) right);
    }

Another Example

    /* Unpack 8+8 8-bit chars from high halves and interleave */
    __m128i _mm_unpackhi_epi8 (__m128i left, __m128i right)
    {
      static const vector unsigned char permute_selector = {
    #ifdef __LITTLE_ENDIAN__
        0x07, 0x17, 0x06, 0x16, 0x05, 0x15, 0x04, 0x14,
        0x03, 0x13, 0x02, 0x12, 0x01, 0x11, 0x00, 0x10
    #elif __BIG_ENDIAN__
        0x10, 0x00, 0x11, 0x01, 0x12, 0x02, 0x13, 0x03,
        0x14, 0x04, 0x15, 0x05, 0x16, 0x06, 0x17, 0x07
    #endif
      };
      return vec_perm (left, right, permute_selector);
    }

Is that correct? In big endian, should “hi” mean the left or right end? Are the permute control vector initializers right?
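One way to approach the questions above is to compare an emulation against a scalar reference model of the SSE semantics. The sketch below is an assumption-laden illustration, not the author's test code: `ref_unpackhi_epi8` is a hypothetical name, and byte index 0 is taken to be the least significant byte of the 128-bit value, as in the SSE definition, where the "high half" is bytes 8 to 15.

```c
#include <stdint.h>

/* Hypothetical scalar reference for the SSE unpack-high-bytes operation.
   Byte index 0 is the least significant byte of the 128-bit value;
   the high halves of a and b (bytes 8..15) are interleaved, a first. */
static void ref_unpackhi_epi8(const uint8_t a[16], const uint8_t b[16],
                              uint8_t dst[16])
{
    for (int i = 0; i < 8; i++) {
        dst[2 * i]     = a[8 + i];
        dst[2 * i + 1] = b[8 + i];
    }
}
```

An emulated `_mm_unpackhi_epi8` could be checked on each endianness by storing its result to memory in the SSE byte order and comparing it with this model.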

Compiler Optimizations

• The xlc generated code for every function was examined. (gcc will be checked too.)
• If it didn't look perfect, compiler optimizer defects or work items were or will be opened.
• Some of those were or will be very easy to fix. Some are hard.

Other Difficulties

• A few compiler bugs were found, in both xlc and gcc. This work exercises some rarely used functionality.
• Defects were or will be reported.
• None have been fixed yet, but workarounds for all were found.

Programming Technique Changes

Obvious but wrong code:

    /* Shift 4 32-bit ints left logical immediate */
    __m128i _mm_slli_epi32 (__m128i v, unsigned int count)
    {
      return (__m128i) vec_sl ((vector unsigned int) v,
                               (vector unsigned int) vec_splats (count));
    }

Corrected code:

    /* Shift 4 32-bit ints left logical immediate */
    __m128i _mm_slli_epi32 (__m128i v, unsigned int count)
    {
      return (__m128i) vec_and (
        vec_sl ((vector unsigned int) v,
                (vector unsigned int) vec_splats (count)),
        (vector unsigned int) vec_cmplt (vec_splats (count),
                                         vec_splats (32u)));
    }

Why the change? PowerPC shifts by count % element_size. SSE shifts by the full count, giving zero when count >= element_size. So shift, compare the count against the element size to get a mask of all ones or all zeros, and AND the shifted result with that mask.
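The difference in shift-count semantics can be checked with a scalar model of a single 32-bit lane. This is a minimal sketch with hypothetical function names: it models the behaviour described above rather than calling the real built-ins, so it runs on any platform.

```c
#include <stdint.h>

/* Model of one PowerPC vector shift lane: the count is taken modulo 32. */
static uint32_t ppc_lane_sl(uint32_t x, uint32_t count)
{
    return x << (count % 32u);
}

/* Model of one SSE shift lane: the result is zero when count >= 32. */
static uint32_t sse_lane_slli(uint32_t x, uint32_t count)
{
    return (count >= 32u) ? 0u : (x << count);
}

/* Model of the corrected code: shift, build an all-ones or all-zeros
   mask from the comparison count < 32, then AND the two. */
static uint32_t emulated_lane_slli(uint32_t x, uint32_t count)
{
    uint32_t mask = (count < 32u) ? 0xFFFFFFFFu : 0u;
    return ppc_lane_sl(x, count) & mask;
}
```

With a count of 33, the PowerPC-style lane shifts by 1 while the SSE-style lane yields zero; the masked version agrees with the SSE model for every count.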

Programming Technique Changes

The corrected code is fine when not inlined, but these functions should always be inlined. As written, it always executes a shift, 2 splats, a compare, and an AND.

Faster code when inlined:

    /* Shift 4 32-bit ints left logical immediate */
    __m128i _mm_slli_epi32 (__m128i v, unsigned int count)
    {
      if ((unsigned long) count >= 32)
      {
        /* SSE semantics: shifting by the element size or more gives zero */
        return (__m128i) vec_splats (0);
      }
      else if (count == 0)
      {
        return v;
      }
      else
      {
        return (__m128i) vec_sl ((vector signed int) v,
                                 (vector unsigned int) vec_splats ((int) count));
      }
    }

When inlined, the ifs are normally evaluated at compile time because the count is usually a constant, so only one clause is executed: either a splat, or nothing, or a splat and a shift. Use a different mindset to get faster code!

Potential Performance Improvements

• Obvious performance issues include:
    - Some functions need to transfer data from GPRs to Vector Registers or vice versa. Doing that via stores and loads can be very slow. The POWER8 has transfer instructions, so some functions should have #ifdefs testing the CPU model.
    - There are no instructions corresponding to some parts of some SSE/AVX operations. The POWER8 does add some very useful instructions like vector bit permute, so some functions should have #ifdefs testing the CPU model to use them.
    - Many functions need a permute, and first load a permute control vector from memory. (This does not hurt performance much if inlined.)

AVX 256-bit Handling

• AVX 256 functions are meant to operate on 256-bit vector registers, but PowerPC vector registers are 128 bits.
• Simulating things like add or subtract normally just means using two 128-bit instructions, running in parallel on two vector pipelines.
• In a “shuffle” (permute), though, each byte of the result may come from any byte of four 128-bit input registers. Since vector permute can only deal with 2 inputs, the general case needs 3 permutes for the first half of the result and 3 more for the second half. Worse, since in general the permute control information may not be known at compile time, the 6 permute control vectors may need to be calculated at execution time. Working that out would be challenging. Compiler optimization could help by eliminating unnecessary permutes.
• AVX 512 would be even more complicated.
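The "3 permutes per half" count can be illustrated with a scalar model. This is only a sketch under stated assumptions: `vec16`, `perm2` and `perm4_half` are hypothetical names, `perm2` models a two-input byte permute that uses the low 5 bits of each selector byte, and byte numbering is simply array order rather than any particular endianness. Two permutes gather candidate bytes from each pair of inputs, and a third chooses between the two candidates.

```c
#include <stdint.h>

/* Hypothetical stand-in for a 128-bit vector register. */
typedef struct { uint8_t b[16]; } vec16;

/* Model of a two-input byte permute: selector bit 4 picks the input,
   bits 0-3 pick the byte within it. Higher selector bits are ignored. */
static vec16 perm2(vec16 x, vec16 y, vec16 sel)
{
    vec16 r;
    for (int i = 0; i < 16; i++) {
        uint8_t s = sel.b[i] & 0x1F;
        r.b[i] = (s < 16) ? x.b[s] : y.b[s - 16];
    }
    return r;
}

/* One 16-byte half of a general shuffle over four inputs.
   idx[i] in 0..63 names the source byte for result byte i.
   The control vectors are computed at run time, as the general case needs. */
static vec16 perm4_half(const vec16 in[4], const uint8_t idx[16])
{
    vec16 low_sel, pick_sel;
    for (int i = 0; i < 16; i++) {
        low_sel.b[i]  = idx[i] & 0x1F;                 /* byte within a pair */
        pick_sel.b[i] = (uint8_t)(i + ((idx[i] & 0x20) ? 16 : 0)); /* which pair */
    }
    vec16 from01 = perm2(in[0], in[1], low_sel);   /* permute 1 */
    vec16 from23 = perm2(in[2], in[3], low_sel);   /* permute 2 */
    return perm2(from01, from23, pick_sel);        /* permute 3 */
}
```

In this simplified model the first two permutes happen to share one control vector; a full 256-bit result would call `perm4_half` twice, once per half, for six permutes in total.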

Performance

• Detailed performance information isn't available yet.
• Overall it seems to be very competitive.
• Very common operations like vector add or compare are more than competitive, running faster than the competition (in some cases taking fewer cycles, with a faster clock rate), with two operations done in parallel.
• Some particularly awkward operations are up to ~10x slower.
• Improving compiler optimization and using #ifdefs to generate model-specific code should improve both the worst cases and the average.
• Some, like “shuffle” (permute), might be doomed to being slow?
• For many functions AVX 512 needs four 128-bit instructions. Two would start immediately and two more one cycle later, so performance would still be good.

Summary

• SSE functions are mostly competitive, and most are easy to implement, but...
• Getting both big and little endian working correctly can be hard. Some endian issues, including which element is upper and which is lower, and how to declare little-endian permute control vectors, are surprisingly easy to get wrong.
• AVX 256 is a little harder than SSE, and AVX 512 more so. MMX is hard to load and store efficiently, but is fortunately obsolete.
• Some functions are hard to write or hard to make fast.
• Some instructions would be useful but do not exist.
• Overall it's a very promising approach to portability and improving programmer productivity.
• Experiments can sometimes be frustrating but are also fun!
