Optimization Issues in SSE/AVX-compatible functions on PowerPC. Ian McIntosh, November 5, 2014. www.ibm.com, © 2006-2008 IBM Corporation.


Optimization Issues in SSE/AVX-compatible functions on PowerPC. Ian McIntosh, November 5, 2014, CASCON 2014.

Abstract: Optimization Issues in SSE/AVX-compatible functions on PowerPC
As an experiment, some SSE/AVX-compatible functions were implemented on PowerPC to see whether they would allow easier porting and more SIMD parallelism in ported programs. Trying to maximize their performance uncovered missed compiler optimization opportunities and a few compiler bugs in rarely executed code, and led to some changes in programming techniques. It also showed which aspects of little-endian SSE/AVX SIMD are hard to emulate efficiently on big-endian PowerPC VMX (Altivec) / VSX SIMD.

Background
- Auto-SIMDization is quite successful in some compilers but not in others.
- Also, SIMD instructions include operations, like saturated add, that are not available otherwise.
- As a result, many programs use vendor-specific SIMD intrinsics or built-in functions to improve their performance.
- That severely impedes portability.
- SIMD functions are unlikely to be standardized soon, if ever.
- One potential solution is to emulate one vendor's functions by using another vendor's.
- A small experimental prototype of that was tried.
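To make the saturated-add point concrete, here is a minimal scalar sketch of what one 16-bit lane of such an operation computes. The helper names `sat_add_i16` and `check_sat_add` are hypothetical and belong to neither SSE nor Altivec; the point is only that plain C needs widening and explicit clamping for what a SIMD instruction does per lane.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one lane of a signed 16-bit saturating add.
   Hypothetical helper, not part of any vendor's intrinsic set. */
static int16_t sat_add_i16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;  /* widen so the sum cannot overflow */
    if (sum > INT16_MAX) return INT16_MAX;  /* clamp high instead of wrapping */
    if (sum < INT16_MIN) return INT16_MIN;  /* clamp low instead of wrapping */
    return (int16_t)sum;
}

/* Small usage check. */
static void check_sat_add(void)
{
    assert(sat_add_i16(30000, 10000) == INT16_MAX);
    assert(sat_add_i16(-30000, -10000) == INT16_MIN);
    assert(sat_add_i16(100, 23) == 123);
}
```

A scalar model like this can also serve as a reference when testing an emulated SIMD wrapper lane by lane.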

Function Types Investigated
- A small set of 8-, 16-, 32- and 64-bit integer SIMD operations and a few single- and double-precision floating-point operations were tried.

Approach Taken
- SSE/AVX function prototypes were joined to brand-new bodies using PowerPC Altivec built-in functions; e.g.,

    /* Add 4 32-bit ints */
    __m128i _mm_add_epi32 (__m128i left, __m128i right)
    {
        return (__m128i) vec_add ((vector signed int) left,
                                  (vector signed int) right);
    }

Another Example

    /* Unpack 8-bit chars from high halves and interleave */
    __m128i _mm_unpackhi_epi8 (__m128i left, __m128i right)
    {
        static const vector unsigned char permute_selector = {
    #ifdef __LITTLE_ENDIAN__
            0x07, 0x17, 0x06, 0x16, 0x05, 0x15, 0x04, 0x14,
            0x03, 0x13, 0x02, 0x12, 0x01, 0x11, 0x00, 0x10
    #elif __BIG_ENDIAN__
            0x10, 0x00, 0x11, 0x01, 0x12, 0x02, 0x13, 0x03,
            0x14, 0x04, 0x15, 0x05, 0x16, 0x06, 0x17, 0x07
    #endif
        };
        return vec_perm (left, right, permute_selector);
    }

Is that correct? In big endian, should “hi” mean the left or right end? Are the permute control vector initializers right?
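One way to approach questions like these is to compare any permute-based implementation against a scalar reference. The sketch below assumes the x86 convention that element i of a vector is the byte at memory offset i, under which the SSE operation interleaves bytes 8 to 15 of the two inputs, starting with the first input. The names `unpackhi_epi8_ref` and `check_unpackhi_ref` are hypothetical; this is a test aid under that assumption, not a statement about which selector above is right.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar reference for the byte interleave of the high halves.
   Assumption: index i means the byte at memory offset i, as on x86. */
static void unpackhi_epi8_ref(const uint8_t left[16],
                              const uint8_t right[16],
                              uint8_t out[16])
{
    for (int i = 0; i < 8; i++) {
        out[2 * i]     = left[8 + i];   /* even result bytes come from left */
        out[2 * i + 1] = right[8 + i];  /* odd result bytes come from right */
    }
}

/* Small usage check with recognizable input values. */
static void check_unpackhi_ref(void)
{
    uint8_t left[16], right[16], out[16];
    for (int i = 0; i < 16; i++) {
        left[i]  = (uint8_t)i;          /* 0..15 */
        right[i] = (uint8_t)(100 + i);  /* 100..115 */
    }
    unpackhi_epi8_ref(left, right, out);
    assert(out[0] == 8 && out[1] == 108);
    assert(out[14] == 15 && out[15] == 115);
}
```

Storing the vector result of a candidate implementation to memory and comparing it byte by byte with `out`, on both big- and little-endian builds, would show which selector matches.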

Compiler Optimizations
- The xlc-generated code for every function was examined. (gcc will be checked too.)
- If it didn't look perfect, compiler optimizer defects or work items were or will be opened.
- Some of those were or will be very easy to fix. Some are hard.

Other Difficulties
- A few compiler bugs were found, in both xlc and gcc. This work exercises some rarely used functionality.
- Defects were or will be reported.
- None have been fixed yet, but workarounds for all were found.

Programming Technique Changes

Obvious but wrong code:

    /* Shift 4 32-bit ints left logical immediate */
    __m128i _mm_slli_epi32 (__m128i v, unsigned int count)
    {
        return (__m128i) vec_sl ((vector unsigned int) v,
                                 (vector unsigned int) vec_splats (count));
    }

Corrected code:

    /* Shift 4 32-bit ints left logical immediate */
    __m128i _mm_slli_epi32 (__m128i v, unsigned int count)
    {
        return (__m128i) vec_and (
            vec_sl ((vector unsigned int) v,
                    (vector unsigned int) vec_splats (count)),
            (vector unsigned int) vec_cmplt (vec_splats (count),
                                             vec_splats (32u)));
    }

Why the change? PowerPC shifts by count % element_size. SSE shifts by count, giving zero when count >= element_size. So shift, compare the count against the element size to get all ones or all zeros, and AND the shifted value with that mask.
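The difference between the two shift semantics can be modeled on a single 32-bit lane in plain C. This is a sketch only; `shl_modulo`, `shl_sse`, `shl_masked` and `check_shift_models` are hypothetical names introduced here, and the mask stands in for the all-ones or all-zeros compare result.

```c
#include <assert.h>
#include <stdint.h>

/* PowerPC-style lane shift: the count is taken modulo the element size. */
static uint32_t shl_modulo(uint32_t x, unsigned count)
{
    return x << (count % 32u);
}

/* SSE-style lane shift: counts of 32 or more give zero. */
static uint32_t shl_sse(uint32_t x, unsigned count)
{
    return (count < 32u) ? (x << count) : 0u;
}

/* The masked fix: shift modulo 32, then AND with a compare mask. */
static uint32_t shl_masked(uint32_t x, unsigned count)
{
    uint32_t mask = (count < 32u) ? 0xFFFFFFFFu : 0u;  /* models the compare */
    return shl_modulo(x, count) & mask;
}

/* The masked version should agree with the SSE model for every count. */
static void check_shift_models(void)
{
    for (unsigned count = 0; count < 100; count++) {
        assert(shl_masked(0x80000001u, count) == shl_sse(0x80000001u, count));
    }
    assert(shl_modulo(1u, 33) == 2u);  /* 33 % 32 == 1, so the shift wraps */
    assert(shl_sse(1u, 33) == 0u);     /* SSE semantics give zero instead */
}
```

The count 33 is the kind of case where the unmasked version silently returns a nonzero value.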

Programming Technique Changes

The corrected code is good if not inlined, but it should always be inlined. It always executes a shift, 2 splats, a compare, and an AND.

Faster code when inlined:

    /* Shift 4 32-bit ints left logical immediate */
    __m128i _mm_slli_epi32 (__m128i v, unsigned int count)
    {
        if ((unsigned long) count >= 32) {
            /* SSE semantics: an oversized count gives all zeros */
            return (__m128i) vec_splats (0);
        } else if (count == 0) {
            return v;
        } else {
            return (__m128i) vec_sl ((vector signed int) v,
                                     (vector unsigned int) vec_splats ((int) count));
        }
    }

When inlined, the ifs are normally evaluated at compile time, so only one clause is executed: either a splat, or nothing, or a splat and a shift. Use a different mindset to get faster code!

Potential Performance Improvements
- Obvious performance issues include:
  - Some functions need to transfer data from GPRs to Vector Registers or vice versa. Doing that via stores and loads can be very slow. The Power8 has transfer instructions, so some functions should have #ifdefs testing the CPU model.
  - There are no instructions corresponding to some parts of some SSE/AVX operations. The Power8 does add some very useful instructions, like vector bit permute, so some functions should have #ifdefs testing the CPU model to use them.
  - Many functions need a permute, and first load a permute control vector from memory. (This does not hurt performance much if inlined.)
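A model-specific #ifdef of the kind described above might look like the following sketch. It assumes the compiler predefines `_ARCH_PWR8` and `__VSX__` when targeting Power8, and that `vec_extract` is available from `altivec.h`; whether the compiler actually lowers it to a direct register transfer rather than a store and load is not guaranteed here. The names `v2u64`, `make_v2u64`, `first_lane` and `check_first_lane` are hypothetical, and on other targets a plain struct stands in for the vector so the sketch still compiles.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(_ARCH_PWR8) && defined(__VSX__)
#include <altivec.h>
/* Power8 path: a real vector type; the compiler may use a direct move. */
typedef __vector unsigned long long v2u64;

static inline v2u64 make_v2u64(uint64_t e0, uint64_t e1)
{
    v2u64 v = { e0, e1 };
    return v;
}

static inline uint64_t first_lane(v2u64 v)
{
    return vec_extract(v, 0);  /* element 0 moved to a scalar register */
}
#else
/* Generic path: a struct stands in for the vector and goes through memory. */
typedef struct { uint64_t lane[2]; } v2u64;

static inline v2u64 make_v2u64(uint64_t e0, uint64_t e1)
{
    v2u64 v = { { e0, e1 } };
    return v;
}

static inline uint64_t first_lane(v2u64 v)
{
    uint64_t x;
    memcpy(&x, &v.lane[0], sizeof x);  /* models the store-and-load route */
    return x;
}
#endif

/* Small usage check that holds on either path. */
static void check_first_lane(void)
{
    v2u64 v = make_v2u64(5u, 9u);
    assert(first_lane(v) == 5u);
}
```

Keeping the model test inside one small helper means the callers stay identical across CPU models, and only the helper body changes.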

AVX 256-bit Handling
- AVX 256 functions are meant to operate on 256-bit, not 128-bit, vector registers.
- Simulating things like add or subtract normally just means using two 128-bit instructions, running in parallel on two vector pipelines.
- In a “shuffle” (permute), though, each byte of the result may come from any byte of four 128-bit input registers. Since vector permute can only deal with 2 inputs, the general case needs 3 permutes for the first half of the result and 3 more for the second half. Worse, since in general the permute control information may not be known at compile time, the 6 permute control vectors may need to be calculated at execution time. Working that out would be challenging. Compiler optimization could help by eliminating unnecessary permutes.
- AVX 512 would be even more complicated.
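The three-permute scheme for one 128-bit half of the result can be sketched with a scalar model. This assumes the inputs amount to four 128-bit registers (64 source bytes) and that bytes are numbered in memory order, ignoring endianness; `reg128`, `perm2` and `gather16_from4` are hypothetical names, and `perm2` only models a two-input byte permute rather than calling a real one.

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint8_t b[16]; } reg128;  /* models one 128-bit register */

/* Model of a two-input byte permute: control byte k in 0..31 selects
   byte k of the 32-byte concatenation lo:hi. */
static reg128 perm2(reg128 lo, reg128 hi, reg128 ctrl)
{
    reg128 out;
    for (int i = 0; i < 16; i++) {
        unsigned k = ctrl.b[i] & 0x1Fu;
        out.b[i] = (k < 16) ? lo.b[k] : hi.b[k - 16];
    }
    return out;
}

/* Build 16 result bytes from any of 64 source bytes with three permutes.
   index[i] in 0..63 names the source byte wanted in result byte i. */
static reg128 gather16_from4(const reg128 src[4], const uint8_t index[16])
{
    reg128 c01, c23, csel;
    for (int i = 0; i < 16; i++) {        /* control vectors built at run time */
        unsigned k = index[i] & 0x3Fu;
        c01.b[i]  = (uint8_t)(k & 0x1Fu); /* meaningful only when k < 32 */
        c23.b[i]  = (uint8_t)(k & 0x1Fu); /* meaningful only when k >= 32 */
        csel.b[i] = (uint8_t)((k < 32) ? i : 16 + i);  /* pick which partial */
    }
    reg128 p01 = perm2(src[0], src[1], c01);  /* candidates from registers 0, 1 */
    reg128 p23 = perm2(src[2], src[3], c23);  /* candidates from registers 2, 3 */
    return perm2(p01, p23, csel);             /* final per-byte selection */
}
```

Calling `gather16_from4` twice, once per half of the result, gives the six permutes mentioned above. When the indices are compile-time constants that touch only one or two source registers, some of these permutes become unnecessary, which is where compiler optimization could help.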

Performance
- Detailed performance information isn't available yet.
- Overall it seems to be very competitive.
- Very common operations like vector add or compare are more than competitive, running faster than the competition (in some cases taking fewer cycles, with a faster clock rate), and with two operations done in parallel.
- Some particularly awkward operations are up to ~10x slower.
- Improving compiler optimization and using #ifdefs to generate model-specific code should improve both the worst cases and the average.
- Some, like “shuffle” (permute), might be doomed to being slow.
- For many functions AVX512 needs four 128-bit instructions. Two would start immediately and two more one cycle later, so performance would still be good.

Summary
- SSE functions are mostly competitive, and most are easy to implement, but...
- Getting both big and little endian working correctly can be hard. Some endian issues, including which element is upper and which is lower, and how to declare little-endian permute control vectors, are surprisingly easy to get wrong.
- AVX 256 is a little harder than SSE, and AVX 512 more so. MMX is hard to load and store efficiently, but fortunately it is obsolete.
- Some functions are hard to write or hard to make fast.
- Some instructions would be useful but do not exist.
- Overall it's a very promising approach to portability and to improving programmer productivity.
- Experiments can sometimes be frustrating but are also fun!

Discussion and Questions?