Presentation on theme: "1 Agenda AVX overview Proposed AVX ABI changes −For IA-32 −For x86-64 AVX and vectorizer infrastructure. Ongoing projects by Intel gcc team: −Stack alignment."— Presentation transcript:
1 Agenda AVX overview Proposed AVX ABI changes −For IA-32 −For x86-64 AVX and vectorizer infrastructure. Ongoing projects by Intel gcc team: −Stack alignment. >IA32 relative performance numbers for SPEC CPU 2K and >DWARF2 change for stack alignment. −Status of gcc AVX branch. >Limit AVX vectorizer to 128bit.
2 Intel® Advanced Vector Extensions (Intel® AVX) 2X Vector Width A 256-bit vector extension to SSE Intel® AVX extends all 16 XMM registers to 256bits Intel® AVX works on either −The whole 256-bits −The lower 128-bits (like existing SSE instructions) >A drop-in replacement for all existing scalar/128-bit SSE instructions >The upper part of the register is zeroed out >No alignment fault on ld-op arithmetic operations 256 bits (2010) YMM0 XMM0 128 bits (1999)
3 Intel® Advanced Vector Extensions (Intel® AVX) – New Encoding System Nearly all SSE FP instructions “promoted” to 256-bits −VADDPSYMM1, YMM2, [m256] Nearly all (*) SSE instructions encode-able in new format −VADDPSXMM1, XMM2, [m128] −VMULSSXMM1, XMM2, [m32] −VPUNPCKHQDQXMM1, XMM2, [m128] 128-bit and scalar promoted instructions have full inter-operability with 256-bit operations (*) instructions referencing MMX registers are NOT promoted to Intel AVX
4 Key Intel® Advanced Vector Extensions (Intel® AVX) Features Wider Vectors −Increased from 128 bit to 256 bit KEY FEATURES BENEFITS Up to 2x peak FLOPs (floating point operations per second) output with good power efficiency Intel® AVX is a general purpose architecture, expected to supplant SSE in all applications used today Enhanced Data Rearrangement −Use the new 256 bit primitives to broadcast, mask loads and permute data Organize, access and pull only necessary data more quickly and efficiently Three and four Operands, Non Destructive Syntax −Designed for efficiency and future extensibility Fewer register copies, better register use for both vector and scalar code Flexible unaligned memory access support More opportunities to fuse load and compute operations Extensible new opcode (VEX) Code size reduction
5 AVX related changes Assembler support in binutils − Under –msse2avx, sse instructions will map to AVX instructions New data type __m256 − Natural alignment is 32 bytes − Requires stack alignment greater than guaranteed by IA- 32 and x86-64 ABI Intrinsics for AVX instructions − Under –mavx sse intrinsics will be mapped AVX instructions Automatic code generation − Vectorizer work ongoing
6 Proposed AVX ABI changes __m256 *p; a reference to *p will generate a 32-byte aligned 32-byte load − In particular, __m256 variables on stack also need to be 32-byte aligned − Has implications on aligning stack (talked yesterday) as well as parameter passing (today)
7 Parameter passing ABIs (Linux-32) Almost everything on stack −__m128 variables in xmm0-2, all caller save − __m128 after first 3 parameters passed on stack > gcc assumes stack aligned at 16 bytes > Inserts padding as desired Proposed change − __m128/__m256 variables in xmm0-2/ymm0-2, all caller save − Note overlap between xmm and lower halves of ymm − After first 3 parameters, __m128/__m256 passed on stack > Requires stack aligned to 32 bytes (Joey Ye’s talk yesterday) > Insert padding as desired Varargs dealt later
8 Parameter passing ABIs (Linux-64) __m128 variables in xmm0-7, all caller save > __m128 after first 8 parameters passed on stack ABI guarantees stack aligned to 16 bytes Inserts padding as desired Proposed change − __m128/__m256 variables in xmm0-7/ymm0-7, all caller save − Note overlap between xmm and lower halves of ymm − After first 8 parameters, __m128/__m256 passed on stack > Requires stack aligned to 32 bytes (Joey Ye’s talk yesterday) > Insert padding as desired Varargs dealt later
9 Varargs (Linux-32) Current − Everything (named, unnamed, including __m128s) passed on stack − gcc assumes stack aligned at 16 bytes − Inserts padding as desired Proposed change − Everything (named, unnamed, including __m128/__m256) passed on stack − Stack aligned to 32 bytes if __m256 is passed on stack − Insert padding as desired
10 Varargs (Linux-64) Current − No change from non-varargs case except >Register al contains number of xmm registers used as parameters ABI pages 50 (rax) and footnote 14 (al) may/may not be in contradiction − (Callee) has register save area whose layout is defined by the ABI > rdi, rsi, rdx, rcx, r8, r9 (integer registers for parameters) followed by xmm0-15 Why xmm0-15 instead of xmm0-7 as only xmm0-7 can be for parameters? Proposed change options − Aligned stack to 32 bytes if __m256 parameters are passed on stack − Register al contains number of xmm/ymm registers used as parameters − For __m256 parameters > Option 1: All __m256 parameters (named, unnamed) on stack > Option 2: Only unnamed __m256 parameters on stack
11 Unprototyped functions (Linux-64) Current − Same as prototyped + al defined − Works even if function is non varargs or varargs Proposed change (assume __m256 as parameter) − Same as prototyped + al defined(?) − Option 1: (all ___m256 on stack): Does not work if function is a varargs function − Option 2: (unnamed __m256 on stack): Does not work if __m256 parameter is among unnamed For unprototyped functions caller must treat as prototyped with al defined for performance − If we want unprototyped functions to work when they are really vararg functions you must extend register save area >Unknown performance penalty for all vararg functions On Linux-32 (similar abi for __m128 to option 1), unprototyped functions do not work when really vararg functions
12 AVX and Vectorizer AVX −128bit INT and 256bit FP vector operations >Can use 256bit FP vector AND, ANDN, OR, XOR to emulate 256bit INT. −256bit vector 256bit vector −128bit vector 256bit vector Vectorizer −Doesn’t support vector conversion of different vector sizes. −Doesn’t support different vector sizes based on operations.
13 AVX Branch Status Implemented: −AVX code generation. -mavx generates pure AVX instructions without legacy SSE instructions. −AVX intrinsics. −AVX vectorizer is limited to 128bit. To do: −Variable argument −Verify unwind and debug. −AVX specific tests. >Runtime intrinsic tests. >Variable arguments. >Unwind with 256bit vector. −256bit Vectorizer support.
14 Stack Alignment Branch Collect stack alignment info in middle-end. Use DW_OP_operation to describe call frame with stack alignment. −Need to handle DRAP properly without changing CFA. Implemented x86 target hooks for stack alignment. Added ~70 C/C++ runtime tectcases for stack alignment. On 45nm Core 2 Duo in 32bit, compared against gcc 4.4 revision at –O2, stack alignment introduced 0% regression on SPEC CPU 2006 INT/FP, 0.3% regression on SPEC CPU 2K INT and 0.6% regressions on SPEC CPU 2K FP. Updated gdb prologue analyzer to recognize the x86 prologues with stack alignment.
15 Float128 on x86 Gcc uses SSE/SSE2 to implement float128. Gcc only supports float128 on ix −Update ia32 psABI. >Alignment. 16byte >Parameter (varargs) passing. On stack, aligned at 16byte. −Check TARGET_SSE/TARGET_SSE2/TARGET_SSE_MATH instead of TARGET_64BIT. There is no run-time support for float128. −I/O. −String to __float128 function −Math library. IEEE 754R −Existing float128 API implementations. −Implement float128 with DFP in a separate run-time library.