# GCC Tutorial – The compilation flow of the auto-vectorizer

Dorit Nuzman Haifa IBM Labs 2nd HiPEAC GCC Tutorial Ghent, Belgium, January 2007

What is vectorization

Original serial loop:

    for (i = 0; i < N; i++) {
      a[i] = a[i] + b[i];
    }

Loop in vector notation (VF = 4):

    for (i = 0; i < N; i += VF) {
      a[i:i+VF-1] = a[i:i+VF-1] + b[i:i+VF-1];
    }

[Figure: data in memory (a b c d e f g h i j k l m n o p) is packed into vector registers VR1 to VR5. Four scalar operations OP(a), OP(b), OP(c), OP(d) become a single vector operation VOP(a, b, c, d).]

The context of this work is the DSP domain. One characteristic of DSP applications is the abundant parallelism in the computations they perform: the same instruction is often executed many times, each time on different data. Modern DSP architectures often have special hardware that executes the same instruction simultaneously on multiple data elements (SIMD → SIMpD). Usually the operands must first be packed into vector registers; the vector instruction then takes those registers as operands and performs the operation on all (here 4) data elements. The process of transforming groups of scalar instructions into vector ones is called vectorization.

- Data elements are packed into vectors.
- Vector length → Vectorization Factor (VF).

loop analyses and optimizations

[Figure: the GCC compilation flow. The C, C++, Fortran and Ada front-ends produce parse trees; the middle-end works on GIMPLE trees and runs the loop analyses and optimizations (data-dependence, scalar-evolution, number of iters, invariant motion, iv-canon/optimize, linear transform, unswitching, if-conversion, unrolling, vectorization); the back-end ports (mips, i386, rs6000) go from RTL through the machine description to assembly.]

Questions the vectorizer has to answer:

- loop form ok? any data-deps? scalar-cycles? aliasing? access-patterns?
- vector size? supportable? alignment? data shuffle? cost?

Why study the vectorizer?

- middle-end & back-end aspects
- performance impact potential
- there's a lot to do…

Talk Layout

- What is vectorization
- Back-end aspects
  - Machine-description and operation tables
  - Querying target-support in vectorizer
  - Enabling vectorization for a new port
- Tree-level aspects
  - Adding a tree-optimization pass
  - Vectorization analyses and transformation
  - Detailed example: Reduction
  - Adding a new idiom
  - Compilation flow example
  - Two advanced cases
- Using the vectorizer
  - Programming and tuning hints

GCC Backend – machine-description files and operation tables

A GCC "port": target-specific files in gcc/gcc/config/`<myport>`/ (for example: i386, ia64, rs6000, spu…)

- target-specific compiler options: `<target>.opt`
  - command-line options of GCC specific to the target
  - for example: -maltivec, -msse2, -mtune=power4, -minsert-sched-nops=
- target-specific definitions: `<target>.h`
  - basic parameters and features
- target-specific support functions: `<target>.c`
  - target predicates, code generation functions, target variants
- machine description: `<target>.md`
  - definition of RTL instructions and their translations to assembly
  - content of machine description determines which features (operations, modes) are available

Example definitions from a `<target>.h`:

    #define POINTER_SIZE (TARGET_32BIT ? 32 : 64)
    #define BYTES_BIG_ENDIAN 1
    #define FIXED_REGISTERS \
      {0, 1, FIXED_R2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, FIXED_R13, 0, 0, \
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
       ….
    #define CALL_USED_REGISTERS \
      {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, FIXED_R13, 0, 0, \
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
       ...

machine-description file

RTL operations: rtl.def

    DEF_RTL_EXPR(SMIN, "smin", "ee", RTX_COMM_ARITH)
    DEF_RTL_EXPR(SMAX, "smax", "ee", RTX_COMM_ARITH)
    DEF_RTL_EXPR(UMIN, "umin", "ee", RTX_COMM_ARITH)
    DEF_RTL_EXPR(UMAX, "umax", "ee", RTX_COMM_ARITH)

machine-description file: alpha/alpha.md

    (define_insn "sminqi3"
      [(set (match_operand:QI 0 "register_operand" "=r")
            (smin:QI (match_operand:QI 1 "reg_or_0_operand" "%rJ")
                     (match_operand:QI 2 "reg_or_8bit_operand" "rI")))]
      "TARGET_MAX"
      "minsb8 %r1,%2,%0"
      [(set_attr "type" "mvi")])

    (define_insn "sminv8qi3"
      [(set (match_operand:V8QI 0 "register_operand" "=r")
            (smin:V8QI (match_operand:V8QI 1 "reg_or_0_operand" "rW")
                       (match_operand:V8QI 2 "reg_or_0_operand" "rW")))]
      "minsb8 %r1,%r2,%0"

Files involved:

- gcc/gcc: rtl.def
- gcc/gcc/config/`<port>`: `<target>.opt`, `<target>.h`, `<target>.c`, `<target>.md`

machine-description file: what the alpha/alpha.md patterns above show

- machine-modes: qi, hi, si, di, sf, df
- vector machine-modes:
  - alpha: v8qi, v4hi
  - altivec: v16qi, v8hi, v4si
- scalar and vector operations differ only in operand modes
- each pattern also carries constraints, conditions, attributes and the assembly template

RTL operations: rtl.def

    DEF_RTL_EXPR(IF_THEN_ELSE, "if_then_else", "eee", RTX_TERNARY)
    DEF_RTL_EXPR(GT, "gt", "ee", RTX_COMPARE)
    DEF_RTL_EXPR(MINUS, "minus", "ee", RTX_BIN_ARITH)

rs6000/rs6000.md

    (define_expand "sminsi3"
      [(set (match_dup 3)
            (if_then_else:SI (gt:SI (match_operand:SI 1 "gpc_reg_operand" "")
                                    (match_operand:SI 2 "reg_or_short_operand" ""))
                             (const_int 0)
                             (minus:SI (match_dup 2) (match_dup 1))))
       (set (match_operand:SI 0 "gpc_reg_operand" "")
            (minus:SI (match_dup 2) (match_dup 3)))]
      "TARGET_POWER || TARGET_ISEL"
      "
    {
      if (TARGET_ISEL)
        {
          operands[2] = force_reg (SImode, operands[2]);
          rs6000_emit_minmax (operands[0], SMIN, operands[1], operands[2]);
          DONE;
        }
      operands[3] = gen_reg_rtx (SImode);
    }")

rs6000/rs6000.c: rs6000_emit_minmax

When the same pattern applies to multiple modes: use mode macros to generate an entire family of patterns

rs6000/altivec.md

    (define_insn "sminv4sf3"
      [(set (match_operand:V4SF 0 "register_operand" "=v")
            (smin:V4SF (match_operand:V4SF 1 "register_operand" "v")
                       (match_operand:V4SF 2 "register_operand" "v")))]
      "TARGET_ALTIVEC"
      "vminfp %0,%1,%2"
      [(set_attr "type" "veccmp")])

    ;; Vec int modes
    (define_mode_macro VI [V4SI V8HI V16QI])

    (define_insn "smin<mode>3"
      [(set (match_operand:VI 0 "register_operand" "=v")
            (smin:VI (match_operand:VI 1 "register_operand" "v")
                     (match_operand:VI 2 "register_operand" "v")))]
      "TARGET_ALTIVEC"
      "vmins<VI_char> %0,%1,%2"
      [(set_attr "type" "vecsimple")])

GCC Backend – machine-description files and operation tables

optabs.c, optabs.h:

- tables of RTL operations sharing common semantics, but differing in operand size and/or structure
- no type information available anymore

[Table: optab entries per machine mode, with columns qi, hi, si, v4si, v2si. The smin_optab row shows 700, 701, CODE_FOR_nothing, 753; the umin_optab row shows 702, 703, 754. The cell alignment is lost in this transcript.]

build/gcc/insn-emit.c

    rtx
    gen_sminv4si3 (rtx operand0 ATTRIBUTE_UNUSED,
                   rtx operand1 ATTRIBUTE_UNUSED,
                   rtx operand2 ATTRIBUTE_UNUSED)
    {
      return gen_rtx_SET (VOIDmode, operand0,
                          gen_rtx_SMIN (V4SImode, operand1, operand2));
    }

build/gcc/insn-output.c

    {
      "sminv4si3",
      { "vminsw %0,%1,%2", 0, 0 },
      (insn_gen_fn) gen_sminv4si3,
      &operand_data[1427],
      3, 0, 1, 1
    }


Querying the backend for target support in the vectorizer

Statement being vectorized:

    min_27 = MIN_EXPR <tmp_26, min_50>;

Query issued by the vectorizer:

    optab = optab_for_tree_code (code, vectype);
    vec_mode = TYPE_MODE (vectype);
    icode = (int) optab->handlers[(int) vec_mode].insn_code;
    if (icode == CODE_FOR_nothing)
      {
        if (vect_print_dump_info (REPORT_DETAILS))
          fprintf (vect_dump, "operation not supported by target.");
        return false;
      }

[Figure: for a "vector int" vectype the query selects smin_optab and mode v2si in the optab table. Table columns: qi, hi, si, v8qi, v4hi, v2si; smin_optab row: 700, 701, CODE_FOR_nothing, 752, 753; umin_optab row: 702, 703, 754, 755.]
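The lookup above can be pictured as indexing a per-operation table by machine mode. The following is a minimal sketch of that idea in plain C, not GCC code: the `toy_` names, the mode list and the table entries are all hypothetical and chosen only for illustration.

```c
/* Hypothetical, simplified stand-ins for GCC's machine modes and optabs. */
enum toy_mode {
    TOY_QI, TOY_HI, TOY_SI, TOY_V8QI, TOY_V4HI, TOY_V2SI, TOY_NUM_MODES
};

/* Plays the role of CODE_FOR_nothing: no instruction pattern exists. */
enum { TOY_CODE_FOR_NOTHING = -1 };

struct toy_optab {
    const char *name;
    int insn_code[TOY_NUM_MODES];   /* one handler slot per machine mode */
};

/* Illustrative entries only; a real table is generated from the .md file. */
static const struct toy_optab toy_smin_optab = {
    "smin",
    { 700, 701, TOY_CODE_FOR_NOTHING, 752, 753, TOY_CODE_FOR_NOTHING }
};

/* Mirrors the vectorizer check: the operation is supported for a mode
   only if the table holds a real instruction code for it. */
int toy_target_supports(const struct toy_optab *op, enum toy_mode mode)
{
    return op->insn_code[mode] != TOY_CODE_FOR_NOTHING;
}
```

With these made-up entries, `toy_target_supports(&toy_smin_optab, TOY_V4HI)` is true while the same query for `TOY_SI` is false, which corresponds to the "operation not supported by target" path in the snippet.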


Enabling vectorization for a new port

Basic features:

`<target>.md`

- distinction between scalar and vector ops: operand modes
- availability of vector ops: deduced from MD file

`<target>.h`: specify supported vector length in bytes:

    #define UNITS_PER_SIMD_WORD 16

`<target>-modes.def`: specify supported vector modes:

    /* Vector modes. */
    VECTOR_MODES (INT, 8);        /* V8QI V4HI V2SI */
    VECTOR_MODES (INT, 16);       /* V16QI V8HI V4SI V2DI */
    VECTOR_MODE (INT, DI, 1);
    VECTOR_MODES (FLOAT, 8);      /* V4HF V2SF */
    VECTOR_MODES (FLOAT, 16);     /* V8HF V4SF V2DF */

Enabling vectorization for a new port

Advanced features: special idioms

- generic vector operations: look over list of idioms in optabs.h
- specialized vector operations: look over target.h

optabs.h

    #define reduc_smax_optab (optab_table[OTI_reduc_smax])
    #define reduc_umax_optab (optab_table[OTI_reduc_umax])
    #define reduc_smin_optab (optab_table[OTI_reduc_smin])
    #define reduc_umin_optab (optab_table[OTI_reduc_umin])
    #define reduc_splus_optab (optab_table[OTI_reduc_splus])
    #define reduc_uplus_optab (optab_table[OTI_reduc_uplus])
    #define ssum_widen_optab (optab_table[OTI_ssum_widen])
    #define usum_widen_optab (optab_table[OTI_usum_widen])
    #define sdot_prod_optab (optab_table[OTI_sdot_prod])
    #define udot_prod_optab (optab_table[OTI_udot_prod])
    #define vec_set_optab (optab_table[OTI_vec_set])
    #define vec_extract_optab (optab_table[OTI_vec_extract])
    #define vec_extract_even_optab (optab_table[OTI_vec_extract_even])
    #define vec_extract_odd_optab (optab_table[OTI_vec_extract_odd])
    #define vec_interleave_high_optab (optab_table[OTI_vec_interleave_high])
    #define vec_interleave_low_optab (optab_table[OTI_vec_interleave_low])
    #define vec_init_optab (optab_table[OTI_vec_init])
    #define vec_shl_optab (optab_table[OTI_vec_shl])
    #define vec_shr_optab (optab_table[OTI_vec_shr])
    #define vec_realign_load_optab (optab_table[OTI_vec_realign_load])
    #define vec_widen_umult_hi_optab (optab_table[OTI_vec_widen_umult_hi])
    #define vec_widen_umult_lo_optab (optab_table[OTI_vec_widen_umult_lo])
    #define vec_widen_smult_hi_optab (optab_table[OTI_vec_widen_smult_hi])
    #define vec_widen_smult_lo_optab (optab_table[OTI_vec_widen_smult_lo])
    #define vec_unpacks_hi_optab (optab_table[OTI_vec_unpacks_hi])
    #define vec_unpacku_hi_optab (optab_table[OTI_vec_unpacku_hi])
    #define vec_unpacks_lo_optab (optab_table[OTI_vec_unpacks_lo])
    #define vec_unpacku_lo_optab (optab_table[OTI_vec_unpacku_lo])
    #define vec_pack_mod_optab (optab_table[OTI_vec_pack_mod])
    #define vec_pack_ssat_optab (optab_table[OTI_vec_pack_ssat])
    #define vec_pack_usat_optab (optab_table[OTI_vec_pack_usat])

target.h

    /* Functions relating to vectorization. */
    struct vectorize
    {
      tree (* builtin_mask_for_load) (void);
      tree (* builtin_vectorized_function) (unsigned, tree);
      tree (* builtin_mul_widen_even) (tree);
      tree (* builtin_mul_widen_odd) (tree);
    } vectorize;


Enabling vectorization for a new port

Enable the vectorizer testcases:

- testcases are in gcc/gcc/testsuite/gcc.dg/vect
- additional target-specific testcases, for example testsuite/gcc.target/i386/vect1.c
- vect.exp: add logic to decide whether to compile/run and with which target-specific options
- add where relevant in testsuite/lib/target-supports.exp

vect.exp (excerpt; "…" marks parts missing from the transcript):

    if [istarget "powerpc*-*-*"] {
        …
    } elseif { [istarget "spu-*-*"] } {
        set dg-do-what-default run
    } elseif { [istarget "i?86-*-*"] || [istarget "x86_64-*-*"] } {
        lappend DEFAULT_VECTCFLAGS "-msse2"
    } elseif { [istarget "mipsisa64*-*-*"]
               && [check_effective_target_mpaired_single] } {
        lappend DEFAULT_VECTCFLAGS "-mpaired-single"
    } elseif [istarget "sparc*-*-*"] {
        …
    } elseif [istarget "alpha*-*-*"] {
        lappend DEFAULT_VECTCFLAGS "-mmax"
        if [check_alpha_max_hw_available] {
            …
        } else {
            set dg-do-what-default compile
        }
    } elseif [istarget "ia64-*-*"] {
        …
    } …
    return

Enabling vectorization for a new port

testsuite/lib/target-supports.exp: add the new target where relevant in the effective-target procedures:

- check_effective_target_vect_int
- check_effective_target_vect_shift
- check_effective_target_vect_long
- check_effective_target_vect_float
- check_effective_target_vect_double
- check_effective_target_vect_no_int_max
- check_effective_target_vect_no_int_add
- check_effective_target_vect_sdot_hi
- check_effective_target_vect_udot_hi
- check_effective_target_vect_sdot_si
- check_effective_target_vect_udot_si
- ….

For example:

    proc check_effective_target_vect_double { } {
        global et_vect_double_saved

        if [info exists et_vect_double_saved] {
            verbose "using cached result" 2
        } else {
            set et_vect_double_saved 0
            if { [istarget i?86-*-*]
                 || [istarget x86_64-*-*]
                 || [istarget spu-*-*] } {
                set et_vect_double_saved 1
            }
        }
        return $et_vect_double_saved
    }


A tree-level pass

- New C files in gcc/gcc: tree-vectorizer.c, tree-vect-analyze.c, tree-vect-transform.c, tree-vect-patterns.c, tree-vectorizer.h
- tree-flow.h: prototype for pass function: unsigned vectorize_loops (void);
- gcc/Makefile.in entries
- The pass is invoked for each function

tree-vectorizer.c

    unsigned
    vectorize_loops (void)
    {
      unsigned int i;
      unsigned int num_vectorized_loops = 0;
      unsigned int vect_loops_num;
      loop_iterator li;
      struct loop *loop;

      vect_loops_num = number_of_loops ();

      /* Analyze each loop; transform only those found vectorizable.  */
      FOR_EACH_LOOP (li, loop, LI_ONLY_OLD)
        {
          loop_vec_info loop_vinfo;

          vect_loop_location = find_loop_location (loop);
          loop_vinfo = vect_analyze_loop (loop);
          loop->aux = loop_vinfo;

          if (!loop_vinfo || !LOOP_VINFO_VECTORIZABLE_P (loop_vinfo))
            continue;

          vect_transform_loop (loop_vinfo);
          num_vectorized_loops++;
        }

      if (vect_print_dump_info (REPORT_VECTORIZED_LOOPS))
        fprintf (vect_dump, "vectorized %u loops in function.\n",
                 num_vectorized_loops);
      /* … remainder of the function not shown on the slide … */
    }

A tree-level pass

- add the pass to the pass hierarchy in passes.c
- in tree-pass.h: prototype for pass structure: extern struct tree_opt_pass pass_vectorize;
- pass-structure definition in tree-ssa-loop.c

passes.c

    NEXT_PASS (pass_split_crit_edges);
    NEXT_PASS (pass_pre);
    NEXT_PASS (pass_may_alias);
    NEXT_PASS (pass_sink_code);
    NEXT_PASS (pass_tree_loop);
    NEXT_PASS (pass_cse_reciprocals);
    NEXT_PASS (pass_reassoc);
    NEXT_PASS (pass_vrp);
    NEXT_PASS (pass_dominator);

    p = &pass_tree_loop.sub;
    NEXT_PASS (pass_tree_loop_init);
    NEXT_PASS (pass_copy_prop);
    NEXT_PASS (pass_lim);
    NEXT_PASS (pass_tree_unswitch);
    NEXT_PASS (pass_scev_cprop);
    NEXT_PASS (pass_empty_loop);
    NEXT_PASS (pass_record_bounds);
    NEXT_PASS (pass_linear_transform);
    NEXT_PASS (pass_iv_canon);
    NEXT_PASS (pass_if_conversion);
    NEXT_PASS (pass_vectorize);
    NEXT_PASS (pass_complete_unroll);
    NEXT_PASS (pass_loop_prefetch);
    NEXT_PASS (pass_iv_optimize);
    NEXT_PASS (pass_tree_loop_done);
    *p = NULL;

    p = &pass_vectorize.sub;
    NEXT_PASS (pass_lower_vector_ssa);
    NEXT_PASS (pass_dce_loop);

A tree-level pass

Gate and execute functions:

    static bool
    gate_tree_vectorize (void)
    {
      return flag_tree_vectorize && current_loops;
    }

    static unsigned int
    tree_vectorize (void)
    {
      return vectorize_loops ();
    }

common.opt: add command line option:

    ftree-vectorize
    Common Report Var(flag_tree_vectorize)
    Enable loop vectorization on trees

Pass structure definition:

    struct tree_opt_pass pass_vectorize =
    {
      "vect",                               /* name */
      gate_tree_vectorize,                  /* gate */
      tree_vectorize,                       /* execute */
      NULL,                                 /* sub */
      NULL,                                 /* next */
      0,                                    /* static_pass_number */
      TV_TREE_VECTORIZATION,                /* tv_id */
      PROP_cfg | PROP_ssa,                  /* properties_required */
      0,                                    /* properties_provided */
      0,                                    /* properties_destroyed */
      TODO_verify_loops,                    /* todo_flags_start */
      TODO_dump_func | TODO_update_ssa,     /* todo_flags_finish */
      0                                     /* letter */
    };

timevar.def: variable used for timing and for identification in timing reports:

    DEFTIMEVAR (TV_TREE_VECTORIZATION , "tree vectorization")

A tree-level pass

Files touched:

- gcc/gcc: rtl.def, target.h, optabs.h, [tree-vect*.c], tree-flow.h, Makefile.in, [tree-ssa-loop.c], timevar.def, common.opt, invoke.texi
- gcc/gcc/config/`<port>`: `<target>.opt`, `<target>.h`, `<target>.c`, `<target>.md`

invoke.texi: document the pass for the GCC manual:

    @item -ftree-vectorize
    Perform loop vectorization on trees.

    @item -fdump-tree-vect
    Dump each function after applying vectorization of loops.
    The file name is made by appending .vect to the source file name.

Invoking the vectorizer:

    gcc -O2 -ftree-vectorize example.c
    gcc -O2 -ftree-vectorize -maltivec example.c
    gcc -O2 -ftree-vectorize -msse2 example.c
    gcc -O2 -ftree-vectorize -maltivec -fdump-tree-vect example.c
    gcc -O2 -ftree-vectorize -maltivec -fdump-tree-vect-details example.c
    gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=2 example.c
    gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=7 -fdump-tree-vect example.c

Example: vectorizer dump reports

Autocorrelation kernel:

    int main1 (short *in, int off, short scale, int n)
    {
      int i, sum = 0;
      for (i = 0; i < n; i++) {
        sum += ((int) in[i] * (int) in[i+off]) >> scale;
      }
      return sum;
    }

Speedups:

- powerpc970: 5-6x
- Cell SPU: 4-5x

Vectorizer report:

    vect]$ gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=5 vect-widen-mult-sum.c
    vect-widen-mult-sum.c:16: note: Vectorizing an unaligned access.
    vect-widen-mult-sum.c:16: note: LOOP VECTORIZED.
    vect-widen-mult-sum.c:12: note: vectorized 1 loops in function.


Auto-vectorization Skeleton

tree-vect-analyze.c (pseudocode):

    vect_analyze_loop (loop)
    {
      if (!1_analyze_loop_form (loop)) FAIL
      if (!2_analyze_data_refs (loop)) FAIL
      if (!3_analyze_scalar_dependence_cycles (loop)) FAIL
      if (!4_pattern_recog (loop)) FAIL
      if (!5_analyze_data_alignment (loop)) FAIL
      if (!6_determine_VF (loop)) FAIL
      if (!7_analyze_data_dependence_distances (loop)) FAIL
      if (!8_analyze_memory_access_patterns (loop)) FAIL
      if (!9_analyze_all_operations_supported (loop)) FAIL
      SUCCEED
    }

tree-vect-transform.c (pseudocode), run if the analysis SUCCEEDs:

    vect_transform_loop (loop)
    {
      FOR_ALL_STMTS_IN_LOOP (loop, stmt)
        replace_OP_by_VOP (stmt);
      decrease_loop_bound_by_factor_VF (loop);
    }

- Statements are transformed one by one, top-down.
- Transforming a statement may add code in the loop prolog/epilog (reduction) or code that handles misalignment.

Auto-Vectorization Transformation

Original serial loop:

    for (i = 0; i < N; i++) {
      a[i] = a[i] + b[i];
    }

Loop in vector notation:

    for (i = 0; i < N; i += VF) {
      a[i:i+VF-1] = a[i:i+VF-1] + b[i:i+VF-1];
    }

Loop in vector notation, with the bound adjusted and an epilog loop:

    /* vectorized loop */
    for (i = 0; i < (N - N%VF); i += VF) {
      a[i:i+VF-1] = a[i:i+VF-1] + b[i:i+VF-1];
    }
    /* epilog loop */
    for ( ; i < N; i++) {
      a[i] = a[i] + b[i];
    }

Vectorization steps:

- Modify loop bound
  - strip-mine
  - create epilog loop
- Replace scalar statements with vector statements
  - Simultaneously operate on different elements
  - Prepare data in small vectors (pack data elements)
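The strip-mined form with an epilog loop can be written out in plain C to check that it computes the same result as the serial loop. This is a minimal sketch, not compiler output: the function names are made up for illustration, and the inner `lane` loop merely stands in for one vector statement on `a[i:i+VF-1]`.

```c
#include <stddef.h>

#define VF 4  /* vectorization factor, as on the slides */

/* Original serial loop. */
void add_serial(float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] + b[i];
}

/* Strip-mined loop: the main loop covers n - n % VF elements in groups
   of VF; the scalar epilog loop handles the remaining n % VF elements. */
void add_strip_mined(float *a, const float *b, size_t n)
{
    size_t i = 0;
    size_t main_bound = n - n % VF;

    for (; i < main_bound; i += VF) {
        /* Stands in for: a[i:i+VF-1] = a[i:i+VF-1] + b[i:i+VF-1] */
        for (size_t lane = 0; lane < VF; lane++)
            a[i + lane] = a[i + lane] + b[i + lane];
    }

    /* Epilog loop. */
    for (; i < n; i++)
        a[i] = a[i] + b[i];
}
```

For `n = 10` and `VF = 4`, the main loop runs twice (elements 0 to 7) and the epilog handles elements 8 and 9.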

Vectorization on SSA-ed GIMPLE trees

Source loop:

    int i;
    float a[N], b[N];
    for (i = 0; i < 16; i++)
      a[i] = a[i] * b[i];

Scalar statements:

    float T.1, T.2, T.3;
    loop:
      if (i >= 16) break;
      S1: T.1 = a[i];
      S2: T.2 = b[i];
      S3: T.3 = T.1 * T.2;
      S4: a[i] = T.3;
      S5: i = i + 1;
      goto loop;

"Unroll by VF and replace" (VF = 4), step 1: unrolled by VF:

    loop:
      if (i >= 16) break;
      T.11 = a[i];   T.12 = a[i+1]; T.13 = a[i+2]; T.14 = a[i+3];
      T.21 = b[i];   T.22 = b[i+1]; T.23 = b[i+2]; T.24 = b[i+3];
      T.31 = T.11 * T.21;
      T.32 = T.12 * T.22;
      T.33 = T.13 * T.23;
      T.34 = T.14 * T.24;
      a[i] = T.31;   a[i+1] = T.32; a[i+2] = T.33; a[i+3] = T.34;
      i = i + 4;
      goto loop;

Step 2: each group of VF scalar statements replaced by one vector statement:

    v4sf VT.1, VT.2, VT.3;
    v4sf *VPa = (v4sf *)a, *VPb = (v4sf *)b;
    int indx;
    loop:
      if (indx >= 4) break;
      VT.1 = VPa[indx];
      VT.2 = VPb[indx];
      VT.3 = VT.1 * VT.2;
      VPa[indx] = VT.3;
      indx = indx + 1;
      goto loop;
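The `v4sf` form can be tried out directly with the vector extension that GCC (and Clang) provide through the `vector_size` attribute. The sketch below is hand-written for illustration and is not what the vectorizer emits; the function names are hypothetical, and `memcpy` is used for the loads and stores so that no alignment of `a` and `b` is assumed.

```c
#include <string.h>

/* Four packed floats (16 bytes), via the GCC/Clang vector extension. */
typedef float v4sf __attribute__((vector_size(16)));

/* Scalar reference loop. */
void mul_scalar(float *a, const float *b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = a[i] * b[i];
}

/* Hand-vectorized loop, VF = 4. Assumes n is a multiple of 4. */
void mul_v4sf(float *a, const float *b, int n)
{
    for (int indx = 0; indx < n / 4; indx++) {
        v4sf vt1, vt2, vt3;

        memcpy(&vt1, a + 4 * indx, sizeof vt1);   /* VT.1 = VPa[indx] */
        memcpy(&vt2, b + 4 * indx, sizeof vt2);   /* VT.2 = VPb[indx] */
        vt3 = vt1 * vt2;                          /* one vector multiply */
        memcpy(a + 4 * indx, &vt3, sizeof vt3);   /* VPa[indx] = VT.3 */
    }
}
```

For the 16-element arrays of the slide, `mul_v4sf(a, b, 16)` performs four vector iterations where the scalar loop performs sixteen.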


Vectorizer analyses and transformation: Reduction

Analysis:

- Detect scalar dependence cycles (cross-iteration dependences)
- Identify scalar cycles that are reduction/induction

Source loop:

    s = 0;
    for (i = 0; i < N; i++) {
      s += a[i] * b[i];
    }

SSA form (s_1 is the reduction, i_1 is the induction):

    loop:
      s_1 = phi (0, s_2)
      i_1 = phi (0, i_2)
      xa_1 = a[i_1]
      xb_1 = b[i_1]
      tmp_1 = xa_1 * xb_1
      s_2 = s_1 + tmp_1
      i_2 = i_1 + 1
      if (i_2 < N) goto loop

[Figure: numeric illustration of tmp_1 being accumulated across iterations.]

tree-vect-analyze.c

The phi nodes of the example loop are classified by their def type (reduction, induction, or unknown):

    s_1 = phi (0, s_2)
    i_1 = phi (0, i_2)
    xa_1 = a[i_1]
    xb_1 = b[i_1]
    tmp_1 = xa_1 * xb_1
    s_2 = s_1 + tmp_1
    i_2 = i_1 + 1

The classification is done in vect_analyze_scalar_cycles:

    static void
    vect_analyze_scalar_cycles (loop_vec_info loop_vinfo)
    {
      tree phi;
      struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
      basic_block bb = loop->header;

      if (vect_print_dump_info (REPORT_DETAILS))
        fprintf (vect_dump, "=== vect_analyze_scalar_cycles ===");

      for (phi = phi_nodes (bb); phi; phi = PHI_CHAIN (phi))
        {
          stmt_vec_info stmt_vinfo = vinfo_for_stmt (phi);
          tree def = PHI_RESULT (phi);

          /* Skip virtual phis.  */
          if (!is_gimple_reg (SSA_NAME_VAR (def)))
            continue;

          STMT_VINFO_DEF_TYPE (stmt_vinfo) = vect_unknown_def_type;

          /* First try: is this an induction?  */
          tree access_fn = analyze_scalar_evolution (loop, def);
          if (!access_fn)
            continue;

          if (vect_is_simple_iv_evolution (loop->num, access_fn))
            {
              STMT_VINFO_DEF_TYPE (stmt_vinfo) = vect_induction_def;
              continue;
            }

          /* Second try: is this a reduction?  */
          tree rstmt = vect_is_simple_reduction (loop, phi);
          if (rstmt)
            {
              STMT_VINFO_DEF_TYPE (stmt_vinfo) =
                STMT_VINFO_DEF_TYPE (vinfo_for_stmt (rstmt)) =
                  vect_reduction_def;
            }
          else if (vect_print_dump_info (REPORT_DETAILS))
            fprintf (vect_dump, "Unknown def-use cycle pattern.");
        } /* End for loop */

      return;
    }

Snippet from vect_is_simple_reduction (tree-vectorizer.c):

    s_1 = phi (0, s_2)
    i_1 = phi (0, i_2)
    xa_1 = a[i_1]
    xb_1 = b[i_1]
    tmp_1 = xa_1 * xb_1
    s_2 = s_1 + tmp_1
    i_2 = i_1 + 1

The reduction statement is found by following the phi argument that arrives over the loop latch edge:

    edge latch_e = loop_latch_edge (loop);
    tree loop_arg = PHI_ARG_DEF_FROM_EDGE (phi, latch_e);
    tree def_stmt = SSA_NAME_DEF_STMT (loop_arg);
    tree operation = GIMPLE_STMT_OPERAND (def_stmt, 1);
    enum tree_code code = TREE_CODE (operation);

    if (!commutative_tree_code (code) || !associative_tree_code (code))
      {
        if (vect_print_dump_info (REPORT_DETAILS))
          {
            fprintf (vect_dump, "reduction: not commutative/associative: ");
            print_generic_expr (vect_dump, operation, TDF_SLIM);
          }
        return NULL_TREE;
      }

    if (SCALAR_FLOAT_TYPE_P (type) && !flag_unsafe_math_optimizations)
      {
        fprintf (vect_dump, "reduction: unsafe fp math optimization: ");
        …
      }
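The floating-point check exists because a vectorized reduction adds the elements in a different order than the serial loop, and floating-point addition is not associative. A minimal sketch of the effect, assuming IEEE single precision and no fast-math options; the function names are made up for illustration:

```c
/* Serial order of evaluation: (x + y) + z. */
float sum_left_to_right(float x, float y, float z)
{
    return (x + y) + z;
}

/* Reassociated order, as a reordered reduction might compute it. */
float sum_reassociated(float x, float y, float z)
{
    return x + (y + z);
}
```

With `x = 1e20f`, `y = -1e20f`, `z = 1.0f`, the first function yields 1 while the second yields 0, because `1.0f` is absorbed when it is added to `-1e20f`. Reordering like this is only permitted when the user explicitly allows unsafe math optimizations.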

Vectorizer analyses and transformation: Reduction

Scalar loop:

    loop:
      s_1 = phi (0, s_2)
      i_1 = phi (0, i_2)
      xa_1 = a[i_1]
      xb_1 = b[i_1]
      tmp_1 = xa_1 * xb_1
      s_2 = s_1 + tmp_1
      i_2 = i_1 + 1
      if (i_2 < N) goto loop

Vectorized loop:

    loop:
      vs_1 = phi (vs_0, vs_2)
      i_1 = phi (0, i_2)
      vxa_1 = vpa[i_1]
      vxb_1 = vpb[i_1]
      vtmp_1 = vxa_1 * vxb_1
      vs_2 = vs_1 + vtmp_1
      i_2 = i_1 + 1
      if (i_2 < N/VF) goto loop

tree-vect-transform.c: creating the vector statement:

    vec_dest = vect_create_destination_var (scalar_dest, vectype);
    expr = build2 (code, vectype, loop_vec_def0, reduc_def);
    new_stmt = build2 (GIMPLE_MODIFY_STMT, void_type_node, vec_dest, expr);
    new_temp = make_ssa_name (vec_dest, new_stmt);
    GIMPLE_STMT_OPERAND (new_stmt, 0) = new_temp;
    bsi_insert_before (bsi, vec_stmt, BSI_SAME_STMT);

Vectorizer analyses and transformation: Reduction

Source loop:

    s = 0;
    for (i = 0; i < N; i++) {
      s += a[i] * b[i];
    }
    printf ("sum = %f\n", s);

Vectorized loop, accumulating VF partial sums s1,s2,s3,s4 in vs:

    vs_0
    loop:
      vs_1 = phi (vs_0, vs_2)
      i_1 = phi (0, i_2)
      vxa_1 = vpa[i_1]
      vxb_1 = vpb[i_1]
      vtmp_1 = vxa_1 * vxb_1
      vs_2 = vs_1 + vtmp_1
      i_2 = i_1 + 1
      if (i_2 < N/VF) goto loop

[Figure: each iteration adds vtmp_1 into the vector accumulator, giving vs_2 = (4, 6, 8, 10) in the example. The epilog then sums across the vector, either with a scalar epilog or with whole-vector shifts: (4, 6, 8, 10) → (12, 16) → 28 = s.]
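The partial-sum scheme can be mimicked in scalar C to see how the epilog recovers the final value. This is an illustrative sketch with hypothetical function names; the array `vs` plays the role of the vector accumulator, and the result matches the serial loop exactly only when reordering the additions does not change rounding (as with the small integer values used here).

```c
#define VF 4  /* vectorization factor */

/* Serial reduction. */
float dot_serial(const float *a, const float *b, int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Reduction with VF partial sums, summed across in an epilog. */
float dot_partial_sums(const float *a, const float *b, int n)
{
    float vs[VF] = { 0.0f, 0.0f, 0.0f, 0.0f };   /* vs_0 */
    int i = 0;

    /* Vector loop: lane k accumulates elements k, k+VF, k+2*VF, ... */
    for (; i + VF <= n; i += VF)
        for (int lane = 0; lane < VF; lane++)
            vs[lane] += a[i + lane] * b[i + lane];

    /* Reduction epilog: sum across the lanes of the accumulator. */
    float s = 0.0f;
    for (int lane = 0; lane < VF; lane++)
        s += vs[lane];

    /* Scalar epilog loop for the n % VF leftover elements. */
    for (; i < n; i++)
        s += a[i] * b[i];

    return s;
}
```

With `a = {0, 1, ..., 7}` and `b` all ones, the accumulator ends as (4, 6, 8, 10) and the epilog produces 28, matching the numbers in the figure.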


Adding new idioms

tree.def: define the tree-code:

    /* Reduction operations.
       Operations that take a vector of elements and "reduce" it to a scalar
       result (e.g. summing the elements of the vector, finding the minimum over
       the vector elements, etc).
       Operand 0 is a vector; the first element in the vector has the result.
       Operand 1 is a vector.  */
    DEFTREECODE (REDUC_PLUS_EXPR, "reduc_plus_expr", tcc_unary, 1)

Other places that need to know about the new tree-code:

- tree-pretty-print.c: dump_generic_node, op_prio, op_symbol
- tree-inline.c: estimate_num_insns_1 ()
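A `.def` file such as tree.def is a list of macro invocations that gets expanded several times with different macro definitions, so one `DEFTREECODE` line contributes an enum value, a name string and an operand count. The sketch below shows that general "X-macro" technique in a single file; every `toy_`/`TOY_` name is hypothetical, and the real GCC machinery differs in detail.

```c
/* Hypothetical list standing in for the entries of a .def file. */
#define TOY_TREE_CODES(X)                          \
    X(TOY_PLUS_EXPR,       "plus_expr",       2)   \
    X(TOY_MIN_EXPR,        "min_expr",        2)   \
    X(TOY_REDUC_PLUS_EXPR, "reduc_plus_expr", 1)

/* Expansion 1: the enumeration of codes. */
#define TOY_AS_ENUM(sym, name, nops) sym,
enum toy_tree_code { TOY_TREE_CODES(TOY_AS_ENUM) TOY_NUM_TREE_CODES };
#undef TOY_AS_ENUM

/* Expansion 2: the printable names, indexed by code. */
#define TOY_AS_NAME(sym, name, nops) name,
static const char *const toy_code_name[] = { TOY_TREE_CODES(TOY_AS_NAME) };
#undef TOY_AS_NAME

/* Expansion 3: the number of operands, indexed by code. */
#define TOY_AS_LENGTH(sym, name, nops) nops,
static const int toy_code_length[] = { TOY_TREE_CODES(TOY_AS_LENGTH) };
#undef TOY_AS_LENGTH

const char *toy_tree_code_name(enum toy_tree_code code)
{
    return toy_code_name[code];
}

int toy_tree_code_length(enum toy_tree_code code)
{
    return toy_code_length[code];
}
```

Adding a new idiom then amounts to adding one line to the list; the enum and both tables pick it up automatically, which is why a new tree-code needs only one entry in tree.def plus the handful of switch statements listed above.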

Adding new idioms

optabs.h: add a new operator table (optab) index to enum optab_index:

    /* Reduction operations on a vector operand.  */
    OTI_reduc_splus,
    OTI_reduc_uplus,

optabs.h: define matching shortcuts:

    #define reduc_splus_optab (optab_table[OTI_reduc_splus])
    #define reduc_uplus_optab (optab_table[OTI_reduc_uplus])

Adding new idioms

optabs.c: add selection of appropriate optab in the dispatch function optab_for_tree_code():

    case REDUC_PLUS_EXPR:
      return TYPE_UNSIGNED (type) ? reduc_uplus_optab : reduc_splus_optab;

optabs.c: initialize the new optabs in init_optabs():

    reduc_splus_optab = init_optab (UNKNOWN);
    reduc_uplus_optab = init_optab (UNKNOWN);

Adding new idioms

Files touched:

- gcc/gcc: rtl.def, target.h, optabs.h, tree.def, tree-pretty-print.c, tree-inline.c, optabs.c, genopinit.c, expr.c
- gcc/gcc/config/`<port>`: `<target>.opt`, `<target>.h`, `<target>.c`, `<target>.md`

genopinit.c: fill in the optabs:

    "reduc_splus_optab->handlers[$A].insn_code = CODE_FOR_$(reduc_splus_$a$)",
    "reduc_uplus_optab->handlers[$A].insn_code = CODE_FOR_$(reduc_uplus_$a$)",

[Table: optab entries per machine mode (columns qi, hi, si, v8qi, v4hi, v2si) for the new rows reduc_splus_optab and reduc_uplus_optab, initially CODE_FOR_nothing.]

Adding new idioms

Files touched: tree.def, tree-pretty-print.c, tree-inline.c, optabs.h, optabs.c, genopinit.c, expr.c, `<target>.md`

expr.c: tree-to-rtl expansion:

    case REDUC_PLUS_EXPR:
      {
        op0 = expand_normal (TREE_OPERAND (exp, 0));
        this_optab = optab_for_tree_code (code, type);
        temp = expand_unop (mode, this_optab, op0, target, unsignedp);
        gcc_assert (temp);
        return temp;
      }

`<target>.md`: RTL instruction definition:

    (define_expand "reduc_splus_<mode>"
      [(set (match_operand:VIshort 0 "register_operand" "=v")
            (unspec:VIshort [(match_operand:VIshort 1 "register_operand" "v")]
                            UNSPEC_REDUC_PLUS))]
      "TARGET_ALTIVEC"
      "
    {
      rtx vzero = gen_reg_rtx (V4SImode);
      rtx vtmp1 = gen_reg_rtx (V4SImode);

      emit_insn (gen_altivec_vspltisw (vzero, const0_rtx));
      emit_insn (gen_altivec_vsum4s<VI_char>s (vtmp1, operands[1], vzero));
      emit_insn (gen_altivec_vsumsws_nomode (operands[0], vtmp1, vzero));
      DONE;
    }")

[Figure: the compilation flow diagram (front-end parse trees, middle-end GIMPLE trees with vectorization, back-end RTL, machine description, assembly; mips, i386 and rs6000 ports), with "expand" marking the step from GIMPLE trees to RTL.]

Talk Layout
What is vectorization
Back-end aspects
Machine-description and operation tables
Querying target-support in vectorizer
Enabling vectorization for a new port
Tree-level aspects
Adding a tree-optimization pass
Vectorization analyses and transformation
Detailed example: Reduction
Adding a new idiom
Compilation flow example
Two advanced cases
Using the vectorizer
Programming and tuning hints

Compilation Flow Example
gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=4 vect-reduc-min.c
  vect-reduc-min.c:14: note: not vectorized: unsupported use in stmt.
  vect-reduc-min.c:9: note: vectorized 0 loops in function.
gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=7 vect-reduc-min.c
  vect-reduc-min.c:14: note: === vect_analyze_scalar_cycles ===
  vect-reduc-min.c:14: note: Analyze phi: min_6 = PHI <min_3(6), 1.0e+1(2)>
  vect-reduc-min.c:14: note: reduction: not commutative/associative: min_6 > min_7 ? min_7 : min_6
  vect-reduc-min.c:14: note: Unknown def-use cycle pattern
  vect-reduc-min.c:14: note: Unsupported pattern.
  vect-reduc-min.c:14: note: unexpected pattern.
gcc -O2 -ftree-vectorize -maltivec vect-reduc-min.c -ftree-vectorizer-verbose=4 -ffast-math
  vect-reduc-min.c:14: note: LOOP VECTORIZED.
  vect-reduc-min.c:9: note: vectorized 1 loops in function.
vect-reduc-min.c:
  #define N 16
  int main1 ()
  {
    int i;
    float c[N] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
    float min = 10;
    for (i = 0; i < N; i++) {
      min = min > c[i] ? c[i] : min;
    }
    /* check results: */
    if (min != 0)
      abort ();
    return 0;
  }

-fdump-tree-all -da
vect-min.c.004t.gimple:
  c = C.3;
  min = 1.0e+1;
  i = 0;
  goto <D2425>;
  <D2424>:;
  i.4 = i;
  D.2429 = c[i.4];
  min = MIN_EXPR <D.2429, min>;
  i = i + 1;
  <D2425>:;
  if (i <= 15) { goto <D2424>; } else goto <D2426>;
  <D2426>:;
  if (min != 0.0) abort ();
  D.2430 = 0;
  return D.2430;
vect-min.c.081t.ifcvt:
  main1 () {
  unsigned int ivtmp.31;
  int pretmp.25;
  float min;
  float c[16];
  int i;
  float D.2429;
  static float C.3[16] = {…};
  <bb 2>:
  c = C.3;
  # ivtmp.31_2 = PHI <ivtmp.31_3(4), 16(2)>
  # min_15 = PHI <min_7(4), 1.0e+1(2)>
  # i_14 = PHI <i_8(4), 0(2)>
  <L0>:;
  D.2429_6 = c[i_14];
  min_7 = MIN_EXPR <D.2429_6, min_15>;
  i_8 = i_14 + 1;
  ivtmp.31_3 = ivtmp.31_2 - 1;
  if (ivtmp.31_3 != 0) goto <L8>; else goto <L2>;
  <L8>:; goto <bb 3> (<L0>);
  # min_1 = PHI <min_7(3)>
  <L2>:;
  if (min_1 != 0.0) goto <L3>; else goto <L4>;
  <L3>:; abort ();
  <L4>:; return 0;
  }

vect-min.c.082t.vect
  <bb 2>:
  c = C.3;
  vect_pc.32_5 = (__vector float *) &c;
  vect_cst_.40_21 = { 1.0e+1, 1.0e+1, 1.0e+1, 1.0e+1 };
  # ivtmp.43_28 = PHI <ivtmp.43_29(4), 0(2)>
  # vect_var.39_19 = PHI <vect_var.39_20, vect_cst.40_21>
  # ivtmp.37_16 = PHI <ivtmp.37_17(4), vect_pc.32_5(2)>
  # ivtmp.31_2 = PHI <ivtmp.31_3(4), 16(2)>
  # min_15 = PHI <min_7(4), 1.0e+1(2)>
  # i_14 = PHI <i_8(4), 0(2)>
  <L0>:;
  vect_var_.38_18 = *ivtmp.37_16;
  D.2429_6 = c[i_14];
  vect_var.39_20 = MIN_EXPR <vect_var.38_18, vect_var.39_19>;
  min_7 = MIN_EXPR <D.2429_6, min_15>;
  i_8 = i_14 + 1;
  ivtmp.31_3 = ivtmp.31_2 - 1;
  ivtmp.37_17 = ivtmp.37_ B;
  ivtmp.43_29 = ivtmp.43_28 + 1;
  if (ivtmp.43_29 < 4) goto <L8>; else goto <L2>;
  <L8>:; goto <bb 3> (<L0>);
Continued:
  # vect_var_.39_22 = PHI <vect_var_.39_20(3)>
  # min_1 = PHI <min_7(3)>
  <L2>:;
  vect_var_.42_23 = vect_var_.39_22 v>> 64;
  vect_var.42_24 = MIN_EXPR <vect_var.42_23, vect_var.39_22>;
  vect_var_.42_25 = vect_var_.42_24 v>> 32;
  vect_var_.42_26 = MIN_EXPR <vect_var_.42_25, vect_var_.42_24>;
  vect_var_.41_27 = BIT_FIELD_REF <vect_var_.42_26, 32, 96>;
  if (vect_var_.41_27 != 0.0) goto <L3>; else goto <L4>;
  <L3>:; abort ();
  <L4>:; return 0;
  }

vect-min.c.095t.dse2
  c = C.3;
  vect_pc.36_4 = (__vector float *) &c;
  vect_var_.38_6 = *vect_pc.36_4;
  vect_var_.39_1 = MIN_EXPR <vect_var_.38_6, { 1.0e+1, 1.0e+1, 1.0e+1, 1.0e+1 }>;
  ivtmp.37_14 = vect_pc.36_4 + 16B;
  vect_var_.38_32 = *ivtmp.37_14;
  vect_var_.39_33 = MIN_EXPR <vect_var_.39_1, vect_var_.38_32>;
  ivtmp.37_34 = ivtmp.37_ B;
  vect_var_.38_39 = *ivtmp.37_34;
  vect_var_.39_40 = MIN_EXPR <vect_var_.39_33, vect_var_.38_39>;
  ivtmp.37_41 = ivtmp.37_ B;
  vect_var_.38_18 = *ivtmp.37_41;
  vect_var_.39_20 = MIN_EXPR <vect_var_.38_18, vect_var_.39_40>;
  vect_var_.42_23 = vect_var_.39_20 v>> 64;
  vect_var_.42_24 = MIN_EXPR <vect_var_.39_20, vect_var_.42_23>;
  vect_var_.42_25 = vect_var_.42_24 v>> 32;
  vect_var_.42_26 = MIN_EXPR <vect_var_.42_25, vect_var_.42_24>;
  vect_var_.41_27 = BIT_FIELD_REF <vect_var_.42_26, 32, 96>;
  if (vect_var_.41_27 != 0.0) goto <L3>; else goto <L4>;
  <L3>:; abort ();
  <L4>:; return 0;
  }
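
The dumps show two phases: a vector loop that keeps VF partial minima, and an epilogue that combines them with whole-vector shifts (v>> 64, then v>> 32) followed by extracting one element. The sketch below models that shape in scalar C. It is an illustration under assumptions, not what GCC emits: the helper names are hypothetical, the lanes are plain array elements, and the result is taken from the last lane to mirror the BIT_FIELD_REF at bit offset 96.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4  /* four 32-bit floats per 128-bit vector */

static float minf(float a, float b) { return a < b ? a : b; }

/* Model of a whole-vector shift by `elems` elements: lane i receives
   lane i - elems, and the vacated low lanes are zero-filled. */
static void shift_lanes_up(const float *v, size_t elems, float *out) {
    for (size_t i = 0; i < VF; i++)
        out[i] = (i >= elems) ? v[i - elems] : 0.0f;
}

/* Minimum of c[0..n) and init, computed the way the vectorized loop does. */
static float vector_style_min(const float *c, size_t n, float init) {
    assert(n % VF == 0);  /* sketch: no scalar tail loop */
    float acc[VF];
    for (size_t lane = 0; lane < VF; lane++)
        acc[lane] = init;

    /* Vector loop: each lane tracks the minimum of every VF-th element. */
    for (size_t i = 0; i < n; i += VF)
        for (size_t lane = 0; lane < VF; lane++)
            acc[lane] = minf(c[i + lane], acc[lane]);

    /* Epilogue: log2(VF) shift-and-min steps leave the result in the last lane. */
    for (size_t shift = VF / 2; shift >= 1; shift /= 2) {
        float shifted[VF];
        shift_lanes_up(acc, shift, shifted);
        for (size_t lane = 0; lane < VF; lane++)
            acc[lane] = minf(shifted[lane], acc[lane]);
    }
    return acc[VF - 1];
}
```

The comparisons happen in a different order than in the serial loop, which is the reordering that -ffast-math permits for float reductions. Only the last lane is meaningful after the epilogue; the zero-filled lanes never reach it.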

vect-min.c.138r.life2
  (insn:HI (set (reg:V4SF 138) (mem/u/c/i:V4SF (reg/f:SI 139) [2 S16 A128])) 632 {altivec_lvx_v4sf} ))
  (insn:HI (set (reg:V4SF 141) (mem:V4SF (plus:SI (reg/f:SI 113 sfp) (const_int 16 [0x10])) [2 S16 A128])) 632 {altivec_lvx_v4sf} (nil) (nil))
  (insn:HI (set (reg:V4SF 126 [ vect_var_.39 ]) (smin:V4SF (reg:V4SF 138) (reg:V4SF 141))) 706 {sminv4sf3}))
  (insn:HI (set (reg/f:SI 127 [ ivtmp.37 ]) (plus:SI (reg/f:SI 134) (const_int 16 [0x10]))) 79 {*addsi3_internal1} (nil) (nil))
  (insn:HI (set (reg:V4SF 142) (mem:V4SF (plus:SI (reg/f:SI 134) (const_int 16 [0x10])) [2 S16 A128])) 632 {altivec_lvx_v4sf} (nil) (nil)))
  (insn:HI (set (reg:V4SF 121 [ vect_var_.50 ]) (smin:V4SF (reg:V4SF 126 [ vect_var_.39 ]) (reg:V4SF 142))) 706 {sminv4sf3} (nil))))
  (insn:HI (set (reg:V4SF 143) (mem:V4SF (plus:SI (reg/f:SI 127 [ ivtmp.37 ]) (const_int 16 [0x10])) [2 S16 A128])) 632 {altivec_lvx_v4sf} (nil)) (nil))
  (insn:HI (set (reg:V4SF 119 [ vect_var_.53 ]) (smin:V4SF (reg:V4SF 121 [ vect_var_.50 ]) (reg:V4SF 143))) 706 {sminv4sf3} (nil))))
vect-min.c.153r.sched2
  (insn:TI (set (reg:V4SF 77 0 [138]) (mem/u/c/i:V4SF (reg/f:SI 9 9 [139]) [2 S16 A128])) 632 {altivec_lvx_v4sf} (nil) (nil))))
  (insn (set (reg:SI 9 9) (plus:SI (reg/f:SI 1 1) (const_int 16 [0x10]))) 79 {*addsi3_internal1} (nil) (nil))
  (insn:TI (set (reg:V4SF 78 1 [141]) (mem:V4SF (reg:SI 9 9) [2 S16 A128])){altivec_lvx_v4sf} ))
  (insn (set (reg:SI 9 9) (plus:SI (reg/f:SI [orig:127 ivtmp.37 ] [127])
  (insn (set (reg:SI 29 29) (const_int 32 [0x20]))) 79 {*addsi3_internal1} (nil) (nil))
  (insn:TI (set (reg:V4SF 77 0[orig:126 vect_var.39] [126]) (smin:V4SF (reg:V4SF 77 0 [138]) (reg:V4SF 78 1 [141]))) 706 {sminv4sf3} (nil) (nil)))
  (insn (set (reg:V4SF 78 1 [143]) (mem:V4SF (reg:SI 9 9) [2 S16 A128])){altivec_lvx_v4sf}
  (insn (set (reg:V4SF [144]) (mem:V4SF (reg:SI 29 29) [2 S16 A128])) {altivec_lvx_v4sf}))

vect-min.s
main1:
  stwu 1,-128(1)
  lis 4,.LANCHOR0@ha
  mflr 0
  la
  li 5,64
  stw 29,116(1)
  stw 0,132(1)
  addi 29,1,16
  mr 3,29
  bl memcpy
  addi 9,29,16
  addi 29,29,16
  lvx 13,0,9
  lis
  la
  lvx 0,0,9
  addi 9,1,16
  lvx 1,0,9
  addi 29,29,32
  vminfp 0,0,1
  lvx 12,0,29
  addi 9,1,108
  vminfp 0,0,13
  vminfp 0,0,1
  vminfp 0,0,12
  vsldoi 13,0,0,8
  vsldoi 1,0,0,12
  vminfp 1,1,0
  stvewx 1,0,9
  lis
  lfs 13,108(1)
  lfs
  fcmpu 7,13,0
  bne- 7,.L7
  lwz 0,132(1)
  lwz 29,116(1)
  li 3,0
  addi 1,1,128
  mtlr 0
  blr

Using the Vectorizer – Programming Hints
Don’t unroll the loop by hand:
  instead of: for (i=0; i<N; i+=4){ a[i] = x; a[i+1] = x; a[i+2] = x; a[i+3] = x; }
  write: for (i=0; i<N; i++) a[i] = x;
Use countable loops, with no side-effects:
  No function-calls in the loop (distribute into a separate loop);
  No ‘break’/‘continue’.
Avoid aliasing problems:
  Use __restrict__ qualified pointers, e.g.: foo (float * __restrict__ p, float * __restrict__ q)
Keep the memory access-pattern simple:
  Don’t use an array of structures, e.g.: struct {int f1; int f2;} a[N]; for (i=0; i<N; i++) a[i].f1 = x;
  Use separate arrays instead: int af1[N], af2[N]; for (i=0; i<N; i++) af1[i] = x;
  Use a constant increment, i.e., don’t use the following: for (i=0; i<N; i+=incr) a[i] = x;
Alignment:
  Use alignment attributes;
  If the loop has more than a single misaligned store, distribute the stores into separate loops (currently the vectorizer peels the loop to align a misaligned store).
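
As an illustration of these hints taken together, here is a minimal sketch of a loop written the way the slide recommends: countable, unit stride, no calls or early exits, __restrict__ qualified pointers, and arrays declared with an alignment attribute. The function name scale_copy and the arrays src and dst are hypothetical; the attribute syntax is the GCC extension.

```c
#include <assert.h>
#include <stddef.h>

#define N 16

/* GCC extension: request 16-byte alignment so vector loads and stores
   from the start of each array are aligned. */
static float src[N] __attribute__((aligned(16)));
static float dst[N] __attribute__((aligned(16)));

/* __restrict__ promises the compiler that p and q never overlap, so it
   does not have to assume a store to p[i] may change a later q[j]. */
static void scale_copy(float *__restrict__ p, const float *__restrict__ q,
                       size_t n, float x) {
    assert(p != q);  /* cheap sanity check of the no-alias promise */
    /* Countable loop, constant increment of 1, one store per iteration,
       no function calls, no break or continue. */
    for (size_t i = 0; i < n; i++)
        p[i] = q[i] * x;
}
```

A call such as scale_copy(dst, src, N, 2.0f) gives the vectorizer everything it needs to know at compile time except the trip count, which is still countable.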

Using the Vectorizer – Tuning Hints
-ffast-math if operating on floats in a reduction computation (to allow the vectorizer to change the order of the computation):
  float *b, *c, diff, min, max;
  for (i = 0; i < N; i++) {
    diff += (b[i] - c[i]);
    max = max < c[i] ? c[i] : max;
    min = min > c[i] ? c[i] : min;
  }
-fwrapv if operating on signed subword integers (to avoid casts to int that currently confuse the vectorizer):
  signed char *b, *c, diff;
  for (i = 0; i < N; i++) { diff += (signed char)(b[i] - c[i]); }
--param min-vect-loop-bound=[X] if you have loops with a short trip-count
-fno-vect-loop-version if worried about code size. Loop versioning turns
  for (i=0; i<N; i++){ p[i] = q[i]; }
into two versions of the loop body, selected at run time:
  if (q is aligned) {
    x = q[i]; // q is aligned
    p[i] = x;
  } else {
    x = q[i]; // q’s alignment unknown
    p[i] = x;
  }
-funroll-loops -fvariable-expansion-in-unroller --param max-variable-expansions-in-unroller=[X] for improved scheduling of summation (breaking the accumulation into X+1 accumulators to increase ILP).
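
The last hint, breaking one accumulation into X+1 accumulators, can be written out by hand to show the effect the unroller is after. This is a sketch assuming X = 3 (four accumulators); the function names are hypothetical, and integers are used so that the reordered sum is exactly equal to the serial one.

```c
#include <assert.h>
#include <stddef.h>

/* Serial summation: every addition depends on the previous one. */
static long sum_single(const int *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hand-expanded summation with four independent accumulators. The four
   additions in the loop body have no dependence on each other, so they
   can be scheduled in parallel. */
static long sum_expanded(const int *a, size_t n) {
    assert(a != NULL || n == 0);
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)  /* leftover iterations */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

For floating-point data the same transformation changes the order of the additions, which is why it belongs with the -ffast-math hint on this slide.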

Summit papers ftp://gcc.gnu.org/pub/gcc/summit/2004/Autovectorization.pdf General Summit papers Happy Hacking!

The End

for (i = 0; i < n; i++) { sum += ((int) in[i] * (int) in[i+off]) >> scale; }

Non-consecutive access patterns
A[i], i={0,5,10,15,…}; access_fn(i) = (0,+,5)
Data in memory: a b c d e f g h i j k l m n o p
[Figure: the elements a, f, k, p, five apart in memory, are gathered into vector register VR5, and a single vector operation VOP( a, f, k, p ) replaces OP(a), OP(f), OP(k), OP(p).]

Basic unpacking and packing operations for strided access
Use two pairs of inverse operations widely supported on SIMD platforms:
  extract_even, extract_odd
  interleave_high, interleave_low
Use them recursively to support strided accesses with power-of-2 strides
Support several data types
Compared to most general “permute”
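
To make the "pairs of inverse operations" concrete, here is a scalar model of the four operations on vectors of VF = 4 ints. It is a sketch under assumptions: the lane numbering (which half counts as "high") is chosen for illustration and differs between targets, and same_vector is a hypothetical helper for comparing results.

```c
#include <assert.h>
#include <string.h>

#define VF 4  /* elements per vector; assumed even */

/* {a0, a2, b0, b2}: the even-indexed elements of a, then of b. */
static void extract_even(const int *a, const int *b, int *out) {
    assert(out != a && out != b);
    for (int i = 0; i < VF / 2; i++) {
        out[i] = a[2 * i];
        out[VF / 2 + i] = b[2 * i];
    }
}

/* {a1, a3, b1, b3}: the odd-indexed elements of a, then of b. */
static void extract_odd(const int *a, const int *b, int *out) {
    assert(out != a && out != b);
    for (int i = 0; i < VF / 2; i++) {
        out[i] = a[2 * i + 1];
        out[VF / 2 + i] = b[2 * i + 1];
    }
}

/* {a0, b0, a1, b1}: interleave the first halves of a and b. */
static void interleave_high(const int *a, const int *b, int *out) {
    assert(out != a && out != b);
    for (int i = 0; i < VF / 2; i++) {
        out[2 * i] = a[i];
        out[2 * i + 1] = b[i];
    }
}

/* {a2, b2, a3, b3}: interleave the second halves of a and b. */
static void interleave_low(const int *a, const int *b, int *out) {
    assert(out != a && out != b);
    for (int i = 0; i < VF / 2; i++) {
        out[2 * i] = a[VF / 2 + i];
        out[2 * i + 1] = b[VF / 2 + i];
    }
}

/* Hypothetical helper: element-wise equality of two vectors. */
static int same_vector(const int *x, const int *y) {
    return memcmp(x, y, VF * sizeof *x) == 0;
}
```

With e = extract_even(a, b) and o = extract_odd(a, b), interleave_high(e, o) gives back a and interleave_low(e, o) gives back b. Extracts therefore serve strided loads and interleaves serve strided stores.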

δ=8, VF=4
S1: a = x[8*i]
S2: b = x[8*i+1]
S3: c = x[8*i+2]
S4: d = x[8*i+3]
S5: e = x[8*i+4]
S6: f = x[8*i+5]
S7: g = x[8*i+6]
S8: h = x[8*i+7]
S9: y[2*i] = k = f (a,…,h)
S10: y[2*i+1] = l = g (a,…,h)
load δ*VF elements
generate δ*log δ extracts (odd/even)
[Figure: the 32 loaded elements, regrouped by successive extract even/odd stages until each vector holds the elements of one of a … h; k and l are then stored to y.]
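
The count of δ*log δ extracts can be checked with a small scalar model of the recursion for δ=8 and VF=4. This is a sketch, not the vectorizer's code: the names level, deinterleave and extract_count are hypothetical, and the order in which the eight result vectors come out (bit-reversed by offset) is a property of this particular arrangement.

```c
#include <assert.h>
#include <string.h>

#define VF 4      /* elements per vector */
#define STRIDE 8  /* the stride, δ on the slide */

static int extract_count;  /* counts extract operations issued */

/* Even-indexed elements of a, then of b. */
static void extract_even(const int *a, const int *b, int *out) {
    for (int i = 0; i < VF / 2; i++) {
        out[i] = a[2 * i];
        out[VF / 2 + i] = b[2 * i];
    }
    extract_count++;
}

/* Odd-indexed elements of a, then of b. */
static void extract_odd(const int *a, const int *b, int *out) {
    for (int i = 0; i < VF / 2; i++) {
        out[i] = a[2 * i + 1];
        out[VF / 2 + i] = b[2 * i + 1];
    }
    extract_count++;
}

/* One stage over n vectors: pair up neighbours, put the even results in
   the first half and the odd results in the second half. */
static void level(int v[][VF], int n) {
    assert(n >= 2 && n <= STRIDE && n % 2 == 0);
    int tmp[STRIDE][VF];
    for (int k = 0; k < n / 2; k++) {
        extract_even(v[2 * k], v[2 * k + 1], tmp[k]);
        extract_odd(v[2 * k], v[2 * k + 1], tmp[n / 2 + k]);
    }
    memcpy(v, tmp, (size_t)n * sizeof tmp[0]);
}

/* log2(STRIDE) stages; each stage halves the group size. */
static void deinterleave(int v[STRIDE][VF]) {
    for (int n = STRIDE; n >= 2; n /= 2)
        for (int base = 0; base < STRIDE; base += n)
            level(v + base, n);
}
```

Starting from eight vectors holding x[0..31], three stages of eight extracts each (24 = 8 * log2 8) leave x[0], x[8], x[16], x[24] in one vector, and likewise for every other offset.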

Interleaving with gaps (δ=8)
S1: a = x[8*i]
S5: e = x[8*i+4]
S9: y[2*i] = k = f (a,e)
S10: y[2*i+1] = l = g (a,e)
load δ*VF elements
generate δ*log δ extracts (odd/even)
[Figure: the same extract even/odd stages as on the previous slide, with only the vectors for a and e used.]

Strided Accesses (Interleaved Data)
Very common in real world computations:
  Complex data
  rgba images (alpha blend)
  multi-channel audio streams (down mix)
Viterbi decoder: 5x improvement on entire benchmark
PLDI 2006

Mixed data types
  short b[N];
  int a[N];
  for (i=0; i<N; i++)
    a[i] = (int) b[i];
Unpack
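
The "Unpack" step can be modelled in scalar C: one vector of eight shorts yields two vectors of four ints. This is a sketch with hypothetical names (unpack_first_half, unpack_second_half, widen), assuming 128-bit vectors; real targets provide the two halves as separate unpack instructions.

```c
#include <assert.h>
#include <stddef.h>

#define SHORTS_PER_VEC 8  /* 128-bit vector of 16-bit elements */
#define INTS_PER_VEC 4    /* 128-bit vector of 32-bit elements */

/* Widen the first four shorts of a vector into one int vector. */
static void unpack_first_half(const short *in, int *out) {
    for (int i = 0; i < INTS_PER_VEC; i++)
        out[i] = (int)in[i];
}

/* Widen the last four shorts of a vector into one int vector. */
static void unpack_second_half(const short *in, int *out) {
    for (int i = 0; i < INTS_PER_VEC; i++)
        out[i] = (int)in[INTS_PER_VEC + i];
}

/* a[i] = (int) b[i], processed one short vector (two int vectors) at a time. */
static void widen(const short *b, int *a, size_t n) {
    assert(n % SHORTS_PER_VEC == 0);  /* sketch: no scalar tail loop */
    for (size_t i = 0; i < n; i += SHORTS_PER_VEC) {
        unpack_first_half(b + i, a + i);
        unpack_second_half(b + i, a + i + INTS_PER_VEC);
    }
}
```

One vector load of shorts feeds two vector stores of ints, which is why the loop has to handle two different numbers of elements per vector.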

Multiple Data-Types & Type Conversions
Scalar code (units = elements per vector: 4 for int, 16 for char):
  S1: x_int = memref
  S2: z_int = x_int + 1
  S3: y_char = (char) z_int
  ….
VF = 16: “unroll” by VF/units
  VS1.0: vx0 = memref0
  VS1.1: vx1 = memref1
  VS1.2: vx2 = memref2
  VS1.3: vx3 = memref3
  V1 = {1, 1, 1, 1}
  VS2.0: vz0 = vx0 + v1
  VS2.1: vz1 = vx1 + v1
  VS2.2: vz2 = vx2 + v1
  VS2.3: vz3 = vx3 + v1
  VS3.0: vy0 = vpack (vz0, vz1)
  VS3.1: vy1 = vpack (vz2, vz3)
  VS3: vy = vpack (vy0, vy1)
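
The vector statements on this slide can be traced with a scalar model: four int vectors are incremented, packed pairwise into two short vectors, and packed again into one char vector. This is a sketch under assumptions: the function names are hypothetical, the reading of S3 as a narrowing conversion of z_int is inferred from the slide, and the inputs are assumed small enough that narrowing loses nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Narrow two vectors of 4 ints into one vector of 8 shorts. */
static void pack_int_to_short(const int *a, const int *b, short *out) {
    for (int i = 0; i < 4; i++) {
        out[i] = (short)a[i];
        out[4 + i] = (short)b[i];
    }
}

/* Narrow two vectors of 8 shorts into one vector of 16 chars. */
static void pack_short_to_char(const short *a, const short *b, signed char *out) {
    for (int i = 0; i < 8; i++) {
        out[i] = (signed char)a[i];
        out[8 + i] = (signed char)b[i];
    }
}

/* One vectorized iteration with VF = 16: y[i] = (char)(x[i] + 1). */
static void convert16(const int *x, signed char *y) {
    assert(x != NULL && y != NULL);
    int vz[4][4];    /* vz0..vz3 on the slide */
    short vy[2][8];  /* vy0, vy1 on the slide */
    for (int u = 0; u < 4; u++)      /* VS1.u loads, VS2.u adds */
        for (int i = 0; i < 4; i++)
            vz[u][i] = x[4 * u + i] + 1;
    pack_int_to_short(vz[0], vz[1], vy[0]);  /* VS3.0 */
    pack_int_to_short(vz[2], vz[3], vy[1]);  /* VS3.1 */
    pack_short_to_char(vy[0], vy[1], y);     /* VS3   */
}
```

The int statements are replicated VF/units = 16/4 = 4 times so that a single char vector can be produced per iteration.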

Multiple Data-Types & Type Conversions
Very common in multimedia computations:
  Video: unsigned chars and shorts
  Audio: signed shorts and ints
  Filters, autocorrelation, dot product, alpha-blending…
Autocorrelation: 6x improvement on benchmark
  for (i = 0; i < n; i++) {
    acc += ((int) short_in1[i] * (int) short_in2[i+lag]) >> Scale;
  }