
1 IBM Labs in Haifa. GCC Tutorial – The compilation flow of the auto-vectorizer. Dorit Nuzman, Haifa IBM Labs. 2nd HiPEAC GCC Tutorial, Ghent, Belgium, January 2007

2 What is vectorization
• Data elements packed into vectors
• Vector length / Vectorization Factor (VF): VF = 4
• Data in memory: a b c d e f g h i j k l m n o p
• Scalar operations OP(a), OP(b), OP(c), OP(d) versus one vector operation VOP( a, b, c, d ) on vector registers (VR1 holds a b c d)
• original serial loop: for(i=0; i
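To make the vectorization factor concrete, here is a minimal sketch in plain C of a loop processed VF elements at a time. The element-wise addition and the function names are assumptions chosen for illustration (the loop on the slide is cut off in the transcript); the inner `j` loop stands in for a single vector operation VOP.

```c
#include <stddef.h>

#define VF 4  /* vectorization factor: elements handled per vector operation */

/* Scalar reference: one element per iteration. */
void add_scalar(const int *b, const int *c, int *a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}

/* Strip-mined form: the main loop handles VF elements per iteration
   (the inner loop models one vector operation); the epilogue loop
   handles the n % VF leftover elements. */
void add_strip_mined(const int *b, const int *c, int *a, size_t n)
{
    size_t i = 0;
    for (; i + VF <= n; i += VF) {
        for (size_t j = 0; j < VF; j++)
            a[i + j] = b[i + j] + c[i + j];
    }
    for (; i < n; i++)
        a[i] = b[i] + c[i];
}
```

Both functions compute the same result for any `n`; only the iteration structure differs.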

3 GCC Passes
[Diagram: front-ends (C, C++, Fortran, Ada, …) produce parse trees; the middle-end works on GIMPLE trees; the back-end works on RTL and, using the machine description of each port (rs6000, i386, mips, …), emits assembly.]
• loop analyses and optimizations: data-dependence, scalar-evolution, number of iters, invariant motion, iv-canon/optimize, linear transform, unswitching, if-conversion, unrolling, vectorization
• vectorization asks: loop form ok? any data-deps? scalar-cycles? aliasing? access-patterns?
• original serial loop: for(i=0; i

4 Talk Layout
• What is vectorization
• Back-end aspects
• Machine-description and operation tables
• Querying target-support in vectorizer
• Enabling vectorization for a new port
• Tree-level aspects
• Adding a tree-optimization pass
• Vectorization analyses and transformation
• Detailed example: Reduction
• Adding a new idiom
• Compilation flow example
• Two advanced cases
• Using the vectorizer
• Programming and tuning hints

5 GCC Backend - machine-description files and operation tables
A GCC "port": target-specific files
• gcc/gcc/config/ / - for example: i386, ia64, rs6000, spu…
• target-specific compiler options: .opt
    - command-line options of GCC specific to the target
    - for example: -maltivec, -msse2, -mtune=power4, -minsert-sched-nops=
• target-specific definitions: .h
    - basic parameters and features
    - for example:
        #define POINTER_SIZE (TARGET_32BIT ? 32 : 64)
        #define BYTES_BIG_ENDIAN 1
        #define FIXED_REGISTERS \
          {0, 1, FIXED_R2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, FIXED_R13, 0, 0, \
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
           ….
        #define CALL_USED_REGISTERS \
          {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, FIXED_R13, 0, 0, \
           0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, \
           ...
• target-specific support functions: .c
    - target predicates, code generation functions, target variants
• machine description: .md
    - definition of RTL instructions and their translations to assembly
    - content of machine description determines which features (operations, modes) are available

6 machine-description file
Files: gcc/gcc: rtl.def; gcc/gcc/config/ : .opt .h .c .md
alpha/alpha.md:
    (define_insn "sminqi3"
      [(set (match_operand:QI 0 "register_operand" "=r")
            (smin:QI (match_operand:QI 1 "reg_or_0_operand" "%rJ")
                     (match_operand:QI 2 "reg_or_8bit_operand" "rI")))]
      "TARGET_MAX"
      "minsb8 %r1,%2,%0"
      [(set_attr "type" "mvi")])

    (define_insn "sminv8qi3"
      [(set (match_operand:V8QI 0 "register_operand" "=r")
            (smin:V8QI (match_operand:V8QI 1 "reg_or_0_operand" "rW")
                       (match_operand:V8QI 2 "reg_or_0_operand" "rW")))]
      "TARGET_MAX"
      "minsb8 %r1,%r2,%0"
      [(set_attr "type" "mvi")])
RTL operations: rtl.def
    DEF_RTL_EXPR(SMIN, "smin", "ee", RTX_COMM_ARITH)
    DEF_RTL_EXPR(SMAX, "smax", "ee", RTX_COMM_ARITH)
    DEF_RTL_EXPR(UMIN, "umin", "ee", RTX_COMM_ARITH)
    DEF_RTL_EXPR(UMAX, "umax", "ee", RTX_COMM_ARITH)

7 machine-description file: the alpha/alpha.md patterns "sminqi3" and "sminv8qi3" and the rtl.def entries for SMIN, SMAX, UMIN and UMAX, annotated
- machine-modes: qi, hi, si, di, sf, df
- vector machine-modes: alpha: v8qi, v4hi; altivec: v16qi, v8hi, v4si
- constraints
- conditions
- attributes
- assembly
- scalar and vector operations differ only in operand modes

8 rs6000/rs6000.md
    (define_expand "sminsi3"
      [(set (match_dup 3)
            (if_then_else:SI (gt:SI (match_operand:SI 1 "gpc_reg_operand" "")
                                    (match_operand:SI 2 "reg_or_short_operand" ""))
                             (const_int 0)
                             (minus:SI (match_dup 2) (match_dup 1))))
       (set (match_operand:SI 0 "gpc_reg_operand" "")
            (minus:SI (match_dup 2) (match_dup 3)))]
      "TARGET_POWER || TARGET_ISEL"
      "{
         if (TARGET_ISEL)
           {
             operands[2] = force_reg (SImode, operands[2]);
             rs6000_emit_minmax (operands[0], SMIN, operands[1], operands[2]);
             DONE;
           }
         operands[3] = gen_reg_rtx (SImode);
       }")
rs6000/rs6000.c
RTL operations: rtl.def
    DEF_RTL_EXPR(IF_THEN_ELSE, "if_then_else", "eee", RTX_TERNARY)
    DEF_RTL_EXPR(GT, "gt", "ee", RTX_COMPARE)
    DEF_RTL_EXPR(MINUS, "minus", "ee", RTX_BIN_ARITH)

9 When the same pattern applies to multiple modes: use mode macros to generate an entire family of patterns
rs6000/altivec.md:
    ;; Vec int modes
    (define_mode_macro VI [V4SI V8HI V16QI])

    (define_insn "smin 3"
      [(set (match_operand:VI 0 "register_operand" "=v")
            (smin:VI (match_operand:VI 1 "register_operand" "v")
                     (match_operand:VI 2 "register_operand" "v")))]
      "TARGET_ALTIVEC"
      "vmins %0,%1,%2"
      [(set_attr "type" "vecsimple")])

    (define_insn "sminv4sf3"
      [(set (match_operand:V4SF 0 "register_operand" "=v")
            (smin:V4SF (match_operand:V4SF 1 "register_operand" "v")
                       (match_operand:V4SF 2 "register_operand" "v")))]
      "TARGET_ALTIVEC"
      "vminfp %0,%1,%2"
      [(set_attr "type" "veccmp")])

10 GCC Backend - machine-description files and operation tables
optabs.c,h
- tables of RTL operations sharing common semantics, but differing in operand size and/or structure
- no type information available anymore

    optab/type   qi    hi    si                 v4si   v2si               …
    smin_optab   700   701   CODE_FOR_nothing   753    CODE_FOR_nothing   …
    umin_optab   702   703   CODE_FOR_nothing   754    CODE_FOR_nothing   …

build/gcc/insn-emit.c:
    rtx
    gen_sminv4si3 (rtx operand0 ATTRIBUTE_UNUSED,
                   rtx operand1 ATTRIBUTE_UNUSED,
                   rtx operand2 ATTRIBUTE_UNUSED)
    {
      return gen_rtx_SET (VOIDmode, operand0,
                          gen_rtx_SMIN (V4SImode, operand1, operand2));
    }
build/gcc/insn-output.c:
    {
      "sminv4si3",
      { "vminsw %0,%1,%2", 0, 0 },
      (insn_gen_fn) gen_sminv4si3,
      &operand_data[1427],
      3, 0, 1, 1
    }

12 Querying the backend for target support in the vectorizer
Statement being vectorized:
    min_27 = MIN_EXPR ;
Query (diagram labels: vector int, v2si, smin_optab):
    optab = optab_for_tree_code (code, vectype);
    vec_mode = TYPE_MODE (vectype);
    icode = (int) optab->handlers[(int) vec_mode].insn_code;
    if (icode == CODE_FOR_nothing)
      {
        if (vect_print_dump_info (REPORT_DETAILS))
          fprintf (vect_dump, "operation not supported by target.");
        return false;
      }
Operation table (columns: qi, hi, si, v8qi, v4hi, v2si; cells as they survive in the transcript):
    smin_optab: 700, 701, CODE_FOR_nothing, CODE_FOR_nothing
    umin_optab: 702, 703, CODE_FOR_nothing, CODE_FOR_nothing
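The lookup above amounts to indexing a per-operation table by machine mode and checking for a "no instruction" sentinel. The following is a toy model of that idea only, not GCC code: the `toy_` names, the mode enum and the table values are all assumptions made up for illustration.

```c
/* Toy stand-ins for machine modes (hypothetical, not GCC's enum). */
enum toy_mode { TOY_QI, TOY_HI, TOY_SI, TOY_V8QI, TOY_V4HI, TOY_V2SI, TOY_NUM_MODES };

#define TOY_CODE_FOR_NOTHING (-1)  /* sentinel: the target has no such insn */

/* One row of an operation table: an instruction code per mode. */
struct toy_optab {
    int insn_code[TOY_NUM_MODES];
};

/* Illustrative row: scalar QI/HI and vector V8QI are supported, the rest are not. */
static const struct toy_optab toy_smin_optab = {
    { 700, 701, TOY_CODE_FOR_NOTHING, 753, TOY_CODE_FOR_NOTHING, TOY_CODE_FOR_NOTHING }
};

/* Returns 1 if the table has an instruction for this mode, 0 otherwise. */
int toy_target_supports(const struct toy_optab *op, enum toy_mode mode)
{
    return op->insn_code[mode] != TOY_CODE_FOR_NOTHING;
}
```

A caller would give up on vectorizing a statement when `toy_target_supports` returns 0 for the vector mode it needs, which mirrors the `CODE_FOR_nothing` check on the slide.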

14 Enabling vectorization for a new port
Basic features:
• distinction between scalar and vector ops: operand modes
• availability of vector ops: deduced from MD file
• .h - specify supported vector length in bytes:
    #define UNITS_PER_SIMD_WORD 16
• -modes.def - specify supported vector modes:
    /* Vector modes. */
    VECTOR_MODES (INT, 8);        /* V8QI V4HI V2SI */
    VECTOR_MODES (INT, 16);       /* V16QI V8HI V4SI V2DI */
    VECTOR_MODE (INT, DI, 1);
    VECTOR_MODES (FLOAT, 8);      /* V4HF V2SF */
    VECTOR_MODES (FLOAT, 16);     /* V8HF V4SF V2DF */

15 Enabling vectorization for a new port
Advanced features:
• Special idioms:
• generic vector operations: look over list of idioms in optabs.h
    #define reduc_smax_optab (optab_table[OTI_reduc_smax])
    #define reduc_umax_optab (optab_table[OTI_reduc_umax])
    #define reduc_smin_optab (optab_table[OTI_reduc_smin])
    #define reduc_umin_optab (optab_table[OTI_reduc_umin])
    #define reduc_splus_optab (optab_table[OTI_reduc_splus])
    #define reduc_uplus_optab (optab_table[OTI_reduc_uplus])
    #define ssum_widen_optab (optab_table[OTI_ssum_widen])
    #define usum_widen_optab (optab_table[OTI_usum_widen])
    #define sdot_prod_optab (optab_table[OTI_sdot_prod])
    #define udot_prod_optab (optab_table[OTI_udot_prod])
    #define vec_set_optab (optab_table[OTI_vec_set])
    #define vec_extract_optab (optab_table[OTI_vec_extract])
    #define vec_extract_even_optab (optab_table[OTI_vec_extract_even])
    #define vec_extract_odd_optab (optab_table[OTI_vec_extract_odd])
    #define vec_interleave_high_optab (optab_table[OTI_vec_interleave_high])
    #define vec_interleave_low_optab (optab_table[OTI_vec_interleave_low])
    #define vec_init_optab (optab_table[OTI_vec_init])
    #define vec_shl_optab (optab_table[OTI_vec_shl])
    #define vec_shr_optab (optab_table[OTI_vec_shr])
    #define vec_realign_load_optab (optab_table[OTI_vec_realign_load])
    #define vec_widen_umult_hi_optab (optab_table[OTI_vec_widen_umult_hi])
    #define vec_widen_umult_lo_optab (optab_table[OTI_vec_widen_umult_lo])
    #define vec_widen_smult_hi_optab (optab_table[OTI_vec_widen_smult_hi])
    #define vec_widen_smult_lo_optab (optab_table[OTI_vec_widen_smult_lo])
    #define vec_unpacks_hi_optab (optab_table[OTI_vec_unpacks_hi])
    #define vec_unpacku_hi_optab (optab_table[OTI_vec_unpacku_hi])
    #define vec_unpacks_lo_optab (optab_table[OTI_vec_unpacks_lo])
    #define vec_unpacku_lo_optab (optab_table[OTI_vec_unpacku_lo])
    #define vec_pack_mod_optab (optab_table[OTI_vec_pack_mod])
    #define vec_pack_ssat_optab (optab_table[OTI_vec_pack_ssat])
    #define vec_pack_usat_optab (optab_table[OTI_vec_pack_usat])
• specialized vector operations: look over target.h
    /* Functions relating to vectorization. */
    struct vectorize
    {
      tree (* builtin_mask_for_load) (void);
      tree (* builtin_vectorized_function) (unsigned, tree);
      tree (* builtin_mul_widen_even) (tree);
      tree (* builtin_mul_widen_odd) (tree);
    } vectorize;

17 Enabling vectorization for a new port
Enable the vectorizer testcases
• testcases are in gcc/gcc/testsuite/gcc.dg/vect
• additional target-specific testcases: testsuite/gcc.target/i386/vect1.c
• vect.exp: add logic to decide whether to compile/run and with which target-specific options
    if [istarget "powerpc*-*-*"] {
        …
        }
    } elseif { [istarget "spu-*-*"] } {
        set dg-do-what-default run
    } elseif { [istarget "i?86-*-*"] || [istarget "x86_64-*-*"] } {
        lappend DEFAULT_VECTCFLAGS "-msse2"
        set dg-do-what-default run
    } elseif { [istarget "mipsisa64*-*-*"]
               && [check_effective_target_mpaired_single] } {
        lappend DEFAULT_VECTCFLAGS "-mpaired-single"
        set dg-do-what-default run
    } elseif [istarget "sparc*-*-*"] {
        …
    } elseif [istarget "alpha*-*-*"] {
        lappend DEFAULT_VECTCFLAGS "-mmax"
        if [check_alpha_max_hw_available] {
            set dg-do-what-default run
        } else {
            set dg-do-what-default compile
        }
    } elseif [istarget "ia64-*-*"] {
        set dg-do-what-default run
    } else {
        return
• Add where relevant in: testsuite/lib/target-supports.exp

18 Enabling vectorization for a new port
Enable the vectorizer testcases: add where relevant in testsuite/lib/target-supports.exp:
    proc check_effective_target_vect_int
         check_effective_target_vect_shift
         check_effective_target_vect_long
    proc check_effective_target_vect_float
    proc check_effective_target_vect_double { } {
        global et_vect_double_saved
        if [info exists et_vect_double_saved] {
            verbose "using cached result" 2
        } else {
            set et_vect_double_saved 0
            if { [istarget i?86-*-*]
                 || [istarget x86_64-*-*]
                 || [istarget spu-*-*] } {
                set et_vect_double_saved 1
            }
        }
        return $et_vect_double_saved
    }
         check_effective_target_vect_no_int_max
         check_effective_target_vect_no_int_add
         check_effective_target_vect_sdot_hi
         check_effective_target_vect_udot_hi
         check_effective_target_vect_sdot_si
         check_effective_target_vect_udot_si
         ….

20 A tree-level pass
• New C files in gcc/gcc:
    - tree-vectorizer.c
    - tree-vect-analyze.c
    - tree-vect-transform.c
    - tree-vect-patterns.c
    - tree-vectorizer.h
• tree-flow.h - prototype for pass function:
    unsigned vectorize_loops (void);
• gcc/Makefile.in entries
• The pass is invoked for each function:
    unsigned
    vectorize_loops (void)
    {
      unsigned int i;
      unsigned int num_vectorized_loops = 0;
      unsigned int vect_loops_num;
      loop_iterator li;
      struct loop *loop;
      …
      vect_loops_num = number_of_loops ();
      FOR_EACH_LOOP (li, loop, LI_ONLY_OLD)
        {
          loop_vec_info loop_vinfo;

          vect_loop_location = find_loop_location (loop);
          loop_vinfo = vect_analyze_loop (loop);
          loop->aux = loop_vinfo;
          if (!loop_vinfo || !LOOP_VINFO_VECTORIZABLE_P (loop_vinfo))
            continue;
          vect_transform_loop (loop_vinfo);
          num_vectorized_loops++;
        }
      if (vect_print_dump_info (REPORT_VECTORIZED_LOOPS))
        fprintf (vect_dump, "vectorized %u loops in function.\n",
                 num_vectorized_loops);
      …
    }

21 A tree-level pass
• add the pass to the pass hierarchy in passes.c:
    …
    NEXT_PASS (pass_split_crit_edges);
    NEXT_PASS (pass_pre);
    NEXT_PASS (pass_may_alias);
    NEXT_PASS (pass_sink_code);
    NEXT_PASS (pass_tree_loop);
    NEXT_PASS (pass_cse_reciprocals);
    NEXT_PASS (pass_reassoc);
    NEXT_PASS (pass_vrp);
    NEXT_PASS (pass_dominator);

    p = &pass_tree_loop.sub;
    NEXT_PASS (pass_tree_loop_init);
    NEXT_PASS (pass_copy_prop);
    NEXT_PASS (pass_lim);
    NEXT_PASS (pass_tree_unswitch);
    NEXT_PASS (pass_scev_cprop);
    NEXT_PASS (pass_empty_loop);
    NEXT_PASS (pass_record_bounds);
    NEXT_PASS (pass_linear_transform);
    NEXT_PASS (pass_iv_canon);
    NEXT_PASS (pass_if_conversion);
    NEXT_PASS (pass_vectorize);
    NEXT_PASS (pass_complete_unroll);
    NEXT_PASS (pass_loop_prefetch);
    NEXT_PASS (pass_iv_optimize);
    NEXT_PASS (pass_tree_loop_done);
    *p = NULL;

    p = &pass_vectorize.sub;
    NEXT_PASS (pass_lower_vector_ssa);
    NEXT_PASS (pass_dce_loop);
    *p = NULL;
• in tree-pass.h - prototype for pass structure:
    extern struct tree_opt_pass pass_vectorize;
• pass-structure definition in tree-ssa-loop.c

22 A tree-level pass
pass structure definition:
    static unsigned int
    tree_vectorize (void)
    {
      return vectorize_loops ();
    }

    static bool
    gate_tree_vectorize (void)
    {
      return flag_tree_vectorize && current_loops;
    }

    struct tree_opt_pass pass_vectorize =
    {
      "vect",                             /* name */
      gate_tree_vectorize,                /* gate */
      tree_vectorize,                     /* execute */
      NULL,                               /* sub */
      NULL,                               /* next */
      0,                                  /* static_pass_number */
      TV_TREE_VECTORIZATION,              /* tv_id */
      PROP_cfg | PROP_ssa,                /* properties_required */
      0,                                  /* properties_provided */
      0,                                  /* properties_destroyed */
      TODO_verify_loops,                  /* todo_flags_start */
      TODO_dump_func | TODO_update_ssa,   /* todo_flags_finish */
      0                                   /* letter */
    };
timevar.def: variable used for timing and for identification in timing reports:
    DEFTIMEVAR (TV_TREE_VECTORIZATION, "tree vectorization")
common.opt: add command line option:
    ftree-vectorize
    Common Report Var(flag_tree_vectorize)
    Enable loop vectorization on trees
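The gate/execute pairing in the pass structure can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not the GCC pass manager: every `toy_` name is hypothetical, and the flag variable stands in for the option variable set by `-ftree-vectorize`.

```c
#include <stddef.h>

/* Toy pass descriptor: a name, an optional gate, and the work function. */
struct toy_pass {
    const char *name;
    int (*gate)(void);          /* nonzero means the pass should run */
    unsigned (*execute)(void);  /* does the work */
};

static int toy_flag_tree_vectorize = 0;  /* models the command-line flag */
static unsigned toy_runs = 0;            /* counts executions, for demonstration */

static int toy_gate_vectorize(void) { return toy_flag_tree_vectorize; }
static unsigned toy_execute_vectorize(void) { toy_runs++; return 0; }

static const struct toy_pass toy_pass_vectorize = {
    "vect", toy_gate_vectorize, toy_execute_vectorize
};

/* Runs the pass only if it has no gate or its gate returns nonzero.
   Returns 1 if the pass was executed, 0 if it was skipped. */
int toy_run_pass(const struct toy_pass *p)
{
    if (p->gate != NULL && !p->gate())
        return 0;
    p->execute();
    return 1;
}
```

With the flag cleared the pass is skipped at no cost beyond the gate call; setting the flag makes the same driver call run it.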

23 A tree-level pass
• invoke.texi: document the pass in the GCC manual:
    -ftree-vectorize
        Perform loop vectorization on trees.
    -fdump-tree-vect
        Dump each function after applying vectorization of loops. The file name is made by appending .vect to the source file name.
• Example invocations:
    gcc -O2 -ftree-vectorize example.c
    gcc -O2 -ftree-vectorize -maltivec example.c
    gcc -O2 -ftree-vectorize -msse2 example.c
    gcc -O2 -ftree-vectorize -maltivec -fdump-tree-vect example.c
    gcc -O2 -ftree-vectorize -maltivec -fdump-tree-vect-details example.c
    gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=2 example.c
    gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=7 -fdump-tree-vect example.c
• Files touched so far:
    gcc/gcc: rtl.def, target.h, optabs.h
    gcc/gcc/config/ : .opt .h .c .md
    1. [tree-vect*.c]
    2. tree-flow.h
    3. Makefile.in
    4. [tree-ssa-loop.c]
    5. timevar.def
    6. common.opt
    7. invoke.texi

24 Example: vectorizer dump reports
• autocorrelation:
    int main1 (short *in, int off, short scale, int n)
    {
      int i, sum = 0;
      for (i = 0; i < n; i++) {
        sum += ((int) in[i] * (int) in[i+off]) >> scale;
      }
      return sum;
    }
• Speedups:
    - powerpc970: 5-6x
    - Cell SPU: 4-5x
• Vectorizer report:
    $ gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=5 vect-widen-mult-sum.c
    vect-widen-mult-sum.c:16: note: Vectorizing an unaligned access.
    vect-widen-mult-sum.c:16: note: LOOP VECTORIZED.
    vect-widen-mult-sum.c:12: note: vectorized 1 loops in function.

26 Auto-vectorization Skeleton
tree-vect-analyze.c:
    vect_analyze_loop (loop)
    {
      if (!1_analyze_loop_form (loop)) FAIL
      if (!2_analyze_data_refs (loop)) FAIL
      if (!3_analyze_scalar_dependence_cycles (loop)) FAIL
      if (!4_pattern_recog (loop)) FAIL
      if (!5_analyze_data_alignment (loop)) FAIL
      if (!6_determine_VF (loop)) FAIL
      if (!7_analyze_data_dependence_distances (loop)) FAIL
      if (!8_analyze_memory_access_patterns (loop)) FAIL
      if (!9_analyze_all_operations_supported (loop)) FAIL
      SUCCEED
    }
tree-vect-transform.c (if SUCCEED):
    vect_transform_loop (loop)
    {
      FOR_ALL_STMTS_IN_LOOP (loop, stmt)
        replace_OP_by_VOP (stmt);
      decrease_loop_bound_by_factor_VF (loop);
    }

27 IBM Labs in Haifa 27 Auto-Vectorization Transformation  original serial loop: for(i=0; i

28 Vectorization on SSA-ed GIMPLE trees
• VF = 4
• "unroll by VF and replace"
Source loop:
    int i;
    float a[N], b[N];
    for (i=0; i < 16; i++)
      a[i] = a[i] * b[i];
Scalar GIMPLE:
    float T.1, T.2, T.3;
    loop:
      if (i >= 16) break;
      S1: T.1 = a[i];
      S2: T.2 = b[i];
      S3: T.3 = T.1 * T.2;
      S4: a[i] = T.3;
      S5: i = i + 1;
      goto loop;
Unrolled by VF:
    loop:
      if (i >= 16) break;
      T.11 = a[i];   T.12 = a[i+1]; T.13 = a[i+2]; T.14 = a[i+3];
      T.21 = b[i];   T.22 = b[i+1]; T.23 = b[i+2]; T.24 = b[i+3];
      T.31 = T.11 * T.21; T.32 = T.12 * T.22;
      T.33 = T.13 * T.23; T.34 = T.14 * T.24;
      a[i] = T.31;   a[i+1] = T.32; a[i+2] = T.33; a[i+3] = T.34;
      i = i + 4;
      goto loop;
Vectorized (each group of four scalar statements replaced by one vector statement):
    v4sf VT.1, VT.2, VT.3;
    v4sf *VPa = (v4sf *)a, *VPb = (v4sf *)b;
    int indx = 0;
    loop:
      if (indx >= 4) break;
      VT.1 = VPa[indx];
      VT.2 = VPb[indx];
      VT.3 = VT.1 * VT.2;
      VPa[indx] = VT.3;
      indx = indx + 1;
      goto loop;
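The vectorized form on this slide can be written by hand with the GCC/Clang `vector_size` extension. This is a sketch for illustration, not vectorizer output; the function names are made up, and `memcpy` is used for the vector loads and stores so the code makes no alignment assumptions about `a` and `b`.

```c
#include <string.h>

#define N 16

/* GCC/Clang vector extension: four floats packed in one 16-byte vector. */
typedef float v4sf __attribute__((vector_size(16)));

/* Scalar reference loop from the slide: a[i] = a[i] * b[i]. */
void mul_scalar(float *a, const float *b)
{
    for (int i = 0; i < N; i++)
        a[i] = a[i] * b[i];
}

/* Hand-vectorized form with VF = 4: N / 4 iterations, each doing one
   vector load per array, one vector multiply, and one vector store. */
void mul_vector(float *a, const float *b)
{
    for (int indx = 0; indx < N / 4; indx++) {
        v4sf va, vb;
        memcpy(&va, a + 4 * indx, sizeof va);
        memcpy(&vb, b + 4 * indx, sizeof vb);
        va = va * vb;                      /* element-wise multiply */
        memcpy(a + 4 * indx, &va, sizeof va);
    }
}
```

Because N is a multiple of 4 here, no epilogue loop is needed; a general trip count would need one.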

30 IBM Labs in Haifa 30 Vectorizer analyses and transformation: Reduction s = 0; for (i=0; i

31 tree-vect-analyze.c
Loop being analyzed (def-type annotations on the slide: unknown, reduc):
    s_1 = phi (0, s_2)
    i_1 = phi (0, i_2)
    xa_1 = a[i_1]
    xb_1 = b[i_1]
    tmp_1 = xa * xb
    s_2 = s_1 + tmp_1
    i_2 = i_1 + 1
Analysis:
    static void
    vect_analyze_scalar_cycles (loop_vec_info loop_vinfo)
    {
      tree phi;
      struct loop *loop = LOOP_VINFO_LOOP (loop_vinfo);
      basic_block bb = loop->header;

      if (vect_print_dump_info (REPORT_DETAILS))
        fprintf (vect_dump, "=== vect_analyze_scalar_cycles ===");

      for (phi = phi_nodes (bb); phi; phi = PHI_CHAIN (phi))
        {
          stmt_vec_info stmt_vinfo = vinfo_for_stmt (phi);
          tree def = PHI_RESULT (phi);

          if (!is_gimple_reg (SSA_NAME_VAR (def)))
            continue;

          STMT_VINFO_DEF_TYPE (stmt_vinfo) = vect_unknown_def_type;

          tree access_fn = analyze_scalar_evolution (loop, def);
          if (!access_fn)
            continue;

          if (vect_is_simple_iv_evolution (loop->num, access_fn))
            {
              STMT_VINFO_DEF_TYPE (stmt_vinfo) = vect_induction_def;
              continue;
            }

          tree rstmt = vect_is_simple_reduction (loop, phi);
          if (rstmt)
            {
              STMT_VINFO_DEF_TYPE (stmt_vinfo) =
                STMT_VINFO_DEF_TYPE (vinfo_for_stmt (rstmt)) = vect_reduction_def;
            }
          else if (vect_print_dump_info (REPORT_DETAILS))
            fprintf (vect_dump, "Unknown def-use cycle pattern.");
        } /* End for loop */
      return;
    }

32 Snippet from vect_is_simple_reduction: tree-vectorizer.c
Loop being analyzed:
    s_1 = phi (0, s_2)
    i_1 = phi (0, i_2)
    xa_1 = a[i_1]
    xb_1 = b[i_1]
    tmp_1 = xa * xb
    s_2 = s_1 + tmp_1
    i_2 = i_1 + 1
Snippet:
    edge latch_e = loop_latch_edge (loop);
    tree loop_arg = PHI_ARG_DEF_FROM_EDGE (phi, latch_e);
    tree def_stmt = SSA_NAME_DEF_STMT (loop_arg);
    tree operation = GIMPLE_STMT_OPERAND (def_stmt, 1);
    enum tree_code code = TREE_CODE (operation);
    …
    if (!commutative_tree_code (code) || !associative_tree_code (code))
      {
        if (vect_print_dump_info (REPORT_DETAILS))
          {
            fprintf (vect_dump, "reduction: not commutative/associative: ");
            print_generic_expr (vect_dump, operation, TDF_SLIM);
          }
        return NULL_TREE;
      }
    if (SCALAR_FLOAT_TYPE_P (type) && !flag_unsafe_math_optimizations)
      {
        if (vect_print_dump_info (REPORT_DETAILS))
          {
            fprintf (vect_dump, "reduction: unsafe fp math optimization: ");
            print_generic_expr (vect_dump, operation, TDF_SLIM);
          }
        return NULL_TREE;
      }
    …
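The floating-point check exists because vectorizing a reduction reorders the additions, and floating-point addition is not associative. A small stand-alone demonstration follows; the function names and input values are chosen for illustration, and `volatile` is used only to force every partial sum to be rounded to `float`.

```c
/* Serial order: ((((0 + x0) + x1) + x2) + x3). */
float sum_left_to_right(const float *x, int n)
{
    volatile float s = 0.0f;
    for (int i = 0; i < n; i++)
        s = s + x[i];
    return s;
}

/* Order used by a 2-lane vectorized reduction (n must be even):
   lane 0 accumulates the even elements, lane 1 the odd ones,
   and the two lanes are combined after the loop. */
float sum_two_lanes(const float *x, int n)
{
    volatile float s0 = 0.0f, s1 = 0.0f;
    for (int i = 0; i < n; i += 2) {
        s0 = s0 + x[i];
        s1 = s1 + x[i + 1];
    }
    return s0 + s1;
}
```

For the input `{1e30f, 1.0f, -1e30f, 1.0f}` the serial order loses the first `1.0f` when it is added to `1e30f` and returns 1, while the two-lane order cancels the large values first and returns 2. This is why the reordering is only done when unsafe math optimizations are allowed.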

33 Vectorizer analyses and transformation: Reduction
Scalar loop:
    loop:
      s_1 = phi (0, s_2)
      i_1 = phi (0, i_2)
      xa_1 = a[i_1]
      xb_1 = b[i_1]
      tmp_1 = xa * xb
      s_2 = s_1 + tmp_1
      i_2 = i_1 + 1
      if (i_2 < N) goto loop
Transformation:
    loop:
      vs_1 = phi (vs_0, vs_2)
      i_1 = phi (0, i_2)
      vxa_1 = vpa[i_1]
      vxb_1 = vpb[i_1]
      vtmp_1 = vxa * vxb
      vs_2 = vs_1 + vtmp_1
      i_2 = i_1 + 1
      if (i_2 < N/VF) goto loop
tree-vect-transform.c:
• vec_dest = vect_create_destination_var (scalar_dest, vectype);
• expr = build2 (code, vectype, loop_vec_def0, reduc_def);
• new_stmt = build2 (GIMPLE_MODIFY_STMT, void_type_node, vec_dest, expr);
• new_temp = make_ssa_name (vec_dest, new_stmt);
• GIMPLE_STMT_OPERAND (new_stmt, 0) = new_temp;
• bsi_insert_before (bsi, vec_stmt, BSI_SAME_STMT);
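In plain C, the transformed reduction keeps VF partial sums inside the loop and combines them once the loop is finished. The sketch below uses integers so the reordering is exact; the function names are hypothetical, and the trip count is assumed to be a multiple of VF so that no scalar epilogue loop is needed.

```c
#define VF 4

/* Scalar reduction: s += a[i] * b[i]. */
int dot_scalar(const int *a, const int *b, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];
    return s;
}

/* Vectorized shape: vs[] models the vector accumulator vs_1/vs_2.
   The loop runs n / VF times; each iteration updates all VF lanes.
   Assumes n is a multiple of VF. */
int dot_vectorized(const int *a, const int *b, int n)
{
    int vs[VF] = { 0, 0, 0, 0 };           /* vs_0: identity for + in each lane */
    for (int i = 0; i < n / VF; i++) {
        for (int lane = 0; lane < VF; lane++)
            vs[lane] += a[i * VF + lane] * b[i * VF + lane];
    }
    int s = 0;                             /* reduce the lanes to one scalar */
    for (int lane = 0; lane < VF; lane++)
        s += vs[lane];
    return s;
}
```

The final lane-combining step is the part a target can implement with a single reduction instruction when it provides one.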

34 IBM Labs in Haifa Vectorizer analyses and transformation: Reduction s = 0; for (i=0; i

36 Adding new idioms
• tree.def: define the tree-code:
    /* Reduction operations.
       Operations that take a vector of elements and "reduce" it to a scalar
       result (e.g. summing the elements of the vector, finding the minimum over
       the vector elements, etc).
       Operand 0 is a vector; the first element in the vector has the result.
       Operand 1 is a vector. */
    DEFTREECODE (REDUC_PLUS_EXPR, "reduc_plus_expr", tcc_unary, 1)
• tree-pretty-print.c: dump_generic_node, op_prio, op_symbol
• tree-inline.c: estimate_num_insns_1 ()

37 Adding new idioms
• optabs.h: add a new operator table (optab) index to enum optab_index:
    /* Reduction operations on a vector operand. */
    OTI_reduc_splus,
    OTI_reduc_uplus,
• optabs.h: define matching shortcuts:
    #define reduc_splus_optab (optab_table[OTI_reduc_splus])
    #define reduc_uplus_optab (optab_table[OTI_reduc_uplus])

38 Adding new idioms
• optabs.c: add selection of appropriate optab in the dispatch function optab_for_tree_code():
    case REDUC_PLUS_EXPR:
      return TYPE_UNSIGNED (type) ? reduc_uplus_optab : reduc_splus_optab;
• optabs.c: initialize the new optabs in init_optabs():
    reduc_splus_optab = init_optab (UNKNOWN);
    reduc_uplus_optab = init_optab (UNKNOWN);

39 Adding new idioms
• genopinit.c: fill in the optabs:
    "reduc_splus_optab->handlers[$A].insn_code = CODE_FOR_$(reduc_splus_$a$)",
    "reduc_uplus_optab->handlers[$A].insn_code = CODE_FOR_$(reduc_uplus_$a$)",
Operation table (columns: qi, hi, si, v8qi, v4hi, v2si): the reduc_splus_optab and reduc_uplus_optab rows hold CODE_FOR_nothing.
Files: gcc/gcc: rtl.def, target.h, optabs.h; gcc/gcc/config/ : .opt .h .c .md
    1. tree.def
    2. tree-pretty-print.c
    3. tree-inline.c
    4. optabs.h
    5. optabs.c
    6. genopinit.c
    7. expr.c
    8. .md

40 Adding new idioms
• expr.c: tree-to-rtl expansion (expand):
    case REDUC_PLUS_EXPR:
      {
        op0 = expand_normal (TREE_OPERAND (exp, 0));
        this_optab = optab_for_tree_code (code, type);
        temp = expand_unop (mode, this_optab, op0, target, unsignedp);
        gcc_assert (temp);
        return temp;
      }
• .md: RTL instruction definition:
    (define_expand "reduc_splus_ "
      [(set (match_operand:VIshort 0 "register_operand" "=v")
            (unspec:VIshort [(match_operand:VIshort 1 "register_operand" "v")]
                            UNSPEC_REDUC_PLUS))]
      "TARGET_ALTIVEC"
      "{
         rtx vzero = gen_reg_rtx (V4SImode);
         rtx vtmp1 = gen_reg_rtx (V4SImode);
         emit_insn (gen_altivec_vspltisw (vzero, const0_rtx));
         emit_insn (gen_altivec_vsum4s s (vtmp1, operands[1], vzero));
         emit_insn (gen_altivec_vsumsws_nomode (operands[0], vtmp1, vzero));
         DONE;
       }")
Files: 1. tree.def 2. tree-pretty-print.c 3. tree-inline.c 4. optabs.h 5. optabs.c 6. genopinit.c 7. expr.c 8. .md

42 Compilation Flow Example
vect-reduc-min.c:
    #define N 16
    int main1 ()
    {
      int i;
      float c[N] = {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
      float min = 10;
      for (i = 0; i < N; i++) {
        min = min > c[i] ? c[i] : min;
      }
      /* check results: */
      if (min != 0)
        abort ();
      return 0;
    }
• gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=4 vect-reduc-min.c
    vect-reduc-min.c:14: note: not vectorized: unsupported use in stmt.
    vect-reduc-min.c:9: note: vectorized 0 loops in function.
• gcc -O2 -ftree-vectorize -maltivec -ftree-vectorizer-verbose=7 vect-reduc-min.c
    …
    vect-reduc-min.c:14: note: === vect_analyze_scalar_cycles ===
    vect-reduc-min.c:14: note: Analyze phi: min_6 = PHI
    vect-reduc-min.c:14: note: reduction: not commutative/associative: min_6 > min_7 ? min_7 : min_6
    vect-reduc-min.c:14: note: Unknown def-use cycle pattern
    …
    vect-reduc-min.c:14: note: Unsupported pattern.
    vect-reduc-min.c:14: note: not vectorized: unsupported use in stmt.
    vect-reduc-min.c:14: note: unexpected pattern.
    vect-reduc-min.c:9: note: vectorized 0 loops in function.
• gcc -O2 -ftree-vectorize -maltivec vect-reduc-min.c -ftree-vectorizer-verbose=4 -ffast-math
    vect-reduc-min.c:14: note: LOOP VECTORIZED.
    vect-reduc-min.c:9: note: vectorized 1 loops in function.

43 IBM Labs in Haifa 43

vect-min.c.081t.ifcvt

    main1 ()
    {
      unsigned int ivtmp.31;
      int pretmp.25;
      float min;
      float c[16];
      int i;
      float D.2429;
      static float C.3[16] = {…};

      :
      c = C.3;

      # ivtmp.31_2 = PHI
      # min_15 = PHI
      # i_14 = PHI
      :;
      D.2429_6 = c[i_14];
      min_7 = MIN_EXPR ;
      i_8 = i_14 + 1;
      ivtmp.31_3 = ivtmp.31_2 - 1;
      if (ivtmp.31_3 != 0) goto ; else goto ;

      :;
      goto ( );

      # min_1 = PHI
      :;
      if (min_1 != 0.0) goto ; else goto ;

      :;
      abort ();

      :;
      return 0;
    }

vect-min.c.004t.gimple

    c = C.3;
    min = 1.0e+1;
    i = 0;
    goto ;
    :;
    i.4 = i;
    D.2429 = c[i.4];
    min = MIN_EXPR ;
    i = i + 1;
    :;
    if (i <= 15) { goto ; } else { goto ; }
    :;
    if (min != 0.0) { abort (); } else { }
    D.2430 = 0;
    return D.2430;

Dump options: -fdump-tree-all -da

44 IBM Labs in Haifa 44

vect-min.c.082t.vect

      :
      c = C.3;
      vect_pc.32_5 = (__vector float *) &c;
      vect_cst_.40_21 = { 1.0e+1, 1.0e+1, 1.0e+1, 1.0e+1 };

      # ivtmp.43_28 = PHI
      # vect_var.39_19 = PHI
      # ivtmp.37_16 = PHI
      # ivtmp.31_2 = PHI
      # min_15 = PHI
      # i_14 = PHI
      :;
      vect_var_.38_18 = *ivtmp.37_16;
      D.2429_6 = c[i_14];
      vect_var.39_20 = MIN_EXPR ;
      min_7 = MIN_EXPR ;
      i_8 = i_14 + 1;
      ivtmp.31_3 = ivtmp.31_2 - 1;
      ivtmp.37_17 = ivtmp.37_ B;
      ivtmp.43_29 = ivtmp.43_28 + 1;
      if (ivtmp.43_29 ; else goto ;

      :;
      goto ( );

Continued:

      # vect_var_.39_22 = PHI
      # min_1 = PHI
      :;
      vect_var_.42_23 = vect_var_.39_22 v>> 64;
      vect_var.42_24 = MIN_EXPR ;
      vect_var_.42_25 = vect_var_.42_24 v>> 32;
      vect_var_.42_26 = MIN_EXPR ;
      vect_var_.41_27 = BIT_FIELD_REF ;
      if (vect_var_.41_27 != 0.0) goto ; else goto ;

      :;
      abort ();

      :;
      return 0;
    }

45 IBM Labs in Haifa 45

vect-min.c.095t.dse2

      c = C.3;
      vect_pc.36_4 = (__vector float *) &c;
      vect_var_.38_6 = *vect_pc.36_4;
      vect_var_.39_1 = MIN_EXPR ;
      ivtmp.37_14 = vect_pc.36_4 + 16B;
      vect_var_.38_32 = *ivtmp.37_14;
      vect_var_.39_33 = MIN_EXPR ;
      ivtmp.37_34 = ivtmp.37_ B;
      vect_var_.38_39 = *ivtmp.37_34;
      vect_var_.39_40 = MIN_EXPR ;
      ivtmp.37_41 = ivtmp.37_ B;
      vect_var_.38_18 = *ivtmp.37_41;
      vect_var_.39_20 = MIN_EXPR ;
      vect_var_.42_23 = vect_var_.39_20 v>> 64;
      vect_var_.42_24 = MIN_EXPR ;
      vect_var_.42_25 = vect_var_.42_24 v>> 32;
      vect_var_.42_26 = MIN_EXPR ;
      vect_var_.41_27 = BIT_FIELD_REF ;
      if (vect_var_.41_27 != 0.0) goto ; else goto ;

      :;
      abort ();

      :;
      return 0;
    }
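The shape of the vectorized code in the dumps above can be emulated in scalar C. The following is only a sketch under stated assumptions: VF = 4 floats per vector, a trip count that is a multiple of VF, and hypothetical names (vmin, min_reduce_vf4) that do not appear in GCC. It keeps four partial minima in a vector accumulator and reduces them after the loop in two "shift and min" steps, mirroring the v>> 64 and v>> 32 statements. Because the elements are combined in a different order than in the serial loop, the slide's compiler output shows this transformation being accepted only with -ffast-math.

```c
#include <stddef.h>

#define VF 4  /* assumed vectorization factor: four floats per 128-bit vector */

/* Scalar emulation of one vector MIN_EXPR: lane-wise minimum.
   dst may alias a or b because each lane is computed independently. */
static void vmin(float dst[VF], const float a[VF], const float b[VF])
{
    for (int l = 0; l < VF; l++)
        dst[l] = a[l] < b[l] ? a[l] : b[l];
}

/* Hypothetical helper: minimum of c[0..n-1] and init, computed in the
   same shape as the vectorized loop. n must be a multiple of VF. */
float min_reduce_vf4(const float *c, size_t n, float init)
{
    /* Like vect_cst_ in the dump: the initial value splatted to all lanes. */
    float acc[VF] = { init, init, init, init };

    for (size_t i = 0; i < n; i += VF)
        vmin(acc, acc, &c[i]);          /* vector load + vector MIN_EXPR */

    /* Epilogue: combine the four partial minima in two steps.
       The real code shifts the whole vector by 64 and then 32 bits;
       this sketch swaps halves and then neighbours instead, so every
       lane ends up holding the final result. */
    float sh[VF];
    sh[0] = acc[2]; sh[1] = acc[3]; sh[2] = acc[0]; sh[3] = acc[1];
    vmin(acc, acc, sh);
    sh[0] = acc[1]; sh[1] = acc[0]; sh[2] = acc[3]; sh[3] = acc[2];
    vmin(acc, acc, sh);

    return acc[0];                      /* like the final BIT_FIELD_REF */
}
```

Calling min_reduce_vf4 on the array c from vect-reduc-min.c with init = 10 gives the same result as the serial loop, since the minimum of that data does not depend on evaluation order.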

46 IBM Labs in Haifa 46

vect-min.c.138r.life2

    (insn:HI (set (reg:V4SF 138) (mem/u/c/i:V4SF (reg/f:SI 139) [2 S16 A128])) 632 {altivec_lvx_v4sf} ))
    (insn:HI (set (reg:V4SF 141) (mem:V4SF (plus:SI (reg/f:SI 113 sfp) (const_int 16 [0x10])) [2 S16 A128])) 632 {altivec_lvx_v4sf} (nil) (nil))
    (insn:HI (set (reg:V4SF 126 [ vect_var_.39 ]) (smin:V4SF (reg:V4SF 138) (reg:V4SF 141))) 706 {sminv4sf3}))
    (insn:HI (set (reg/f:SI 127 [ ivtmp.37 ]) (plus:SI (reg/f:SI 134) (const_int 16 [0x10]))) 79 {*addsi3_internal1} (nil) (nil))
    (insn:HI (set (reg:V4SF 142) (mem:V4SF (plus:SI (reg/f:SI 134) (const_int 16 [0x10])) [2 S16 A128])) 632 {altivec_lvx_v4sf} (nil) (nil)))
    (insn:HI (set (reg:V4SF 121 [ vect_var_.50 ]) (smin:V4SF (reg:V4SF 126 [ vect_var_.39 ]) (reg:V4SF 142))) 706 {sminv4sf3} (nil))))
    (insn:HI (set (reg:V4SF 143) (mem:V4SF (plus:SI (reg/f:SI 127 [ ivtmp.37 ]) (const_int 16 [0x10])) [2 S16 A128])) 632 {altivec_lvx_v4sf} (nil)) (nil))
    (insn:HI (set (reg:V4SF 119 [ vect_var_.53 ]) (smin:V4SF (reg:V4SF 121 [ vect_var_.50 ]) (reg:V4SF 143))) 706 {sminv4sf3} (nil))))

vect-min.c.153r.sched2

    (insn:TI (set (reg:V4SF 77 0 [138]) (mem/u/c/i:V4SF (reg/f:SI 9 9 [139]) [2 S16 A128])) 632 {altivec_lvx_v4sf} (nil) (nil))))
    (insn (set (reg:SI 9 9) (plus:SI (reg/f:SI 1 1) (const_int 16 [0x10]))) 79 {*addsi3_internal1} (nil) (nil))
    (insn:TI (set (reg:V4SF 78 1 [141]) (mem:V4SF (reg:SI 9 9) [2 S16 A128])){altivec_lvx_v4sf} ))
    (insn (set (reg:SI 9 9) (plus:SI (reg/f:SI [orig:127 ivtmp.37 ] [127]) (const_int 16 [0x10]))) 79 {*addsi3_internal1} (nil) (nil))
    (insn (set (reg:SI 29 29) (plus:SI (reg/f:SI [orig:127 ivtmp.37 ] [127]) (const_int 32 [0x20]))) 79 {*addsi3_internal1} (nil) (nil))
    (insn:TI (set (reg:V4SF 77 0[orig:126 vect_var.39] [126]) (smin:V4SF (reg:V4SF 77 0 [138]) (reg:V4SF 78 1 [141]))) 706 {sminv4sf3} (nil) (nil)))
    (insn (set (reg:V4SF 78 1 [143]) (mem:V4SF (reg:SI 9 9) [2 S16 A128])){altivec_lvx_v4sf}
    (insn (set (reg:V4SF [144]) (mem:V4SF (reg:SI 29 29) [2 S16 A128])) {altivec_lvx_v4sf}))

47 IBM Labs in Haifa 47

vect-min.s

    main1:
        stwu 1,-128(1)
        lis
        mflr 0
        la
        li 5,64
        stw 29,116(1)
        stw 0,132(1)
        addi 29,1,16
        mr 3,29
        bl memcpy
        addi 9,29,16
        addi 29,29,16
        lvx 13,0,9
        lis
        la
        lvx 0,0,9
        addi 9,1,16
        lvx 1,0,9
        addi 9,29,16
        addi 29,29,32
        vminfp 0,0,1
        lvx 1,0,9
        lvx 12,0,29
        addi 9,1,108
        vminfp 0,0,13
        vminfp 0,0,1
        vminfp 0,0,12
        vsldoi 13,0,0,8
        vminfp 0,0,13
        vsldoi 1,0,0,12
        vminfp 1,1,0
        stvewx 1,0,9
        lis
        lfs 13,108(1)
        lfs
        fcmpu 7,13,0
        bne- 7,.L7
        lwz 0,132(1)
        lwz 29,116(1)
        li 3,0
        addi 1,1,128
        mtlr 0
        blr


49 IBM Labs in Haifa 49 Using the Vectorizer – Programming Hints
• Don’t unroll the loop for (i=0; i

50 IBM Labs in Haifa 50
• -ffast-math: if operating on floats in a reduction computation (to allow the vectorizer to change the order of the computation)

    float *b, *c, diff, min, max;
    for (i = 0; i < N; i++) { diff += (b[i] - c[i]); }
    for (i = 0; i < N; i++) { max = max < c[i] ? c[i] : max; }
    for (i = 0; i < N; i++) { min = min > c[i] ? c[i] : min; }

• -fwrapv: if operating on signed subword integers (to avoid casts to int that currently confuse the vectorizer)

    signed char *b, *c, diff;
    for (i = 0; i < N; i++) { diff += (signed char)(b[i] - c[i]); }

• --param min-vect-loop-bound=[X]: if you have loops with a short trip count
• -fno-tree-vect-loop-version: if worried about code size
• -funroll-loops -fvariable-expansion-in-unroller --param max-variable-expansions-in-unroller=[X]: for improved scheduling of summation (breaking the accumulation into X+1 accumulators to increase ILP)

    for (i=0; i
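The effect of variable expansion on a summation can be illustrated by hand. The sketch below is not what the unroller emits; it is a hand-written equivalent for X = 1 (two accumulators), with hypothetical function names. The serial version forms one long dependence chain of additions, while the expanded version has two independent chains that the scheduler can overlap. For floats the two versions add the elements in a different order, so their results may differ in the last bits on general data.

```c
#include <stddef.h>

/* Serial summation: every addition depends on the previous one. */
float sum_serial(const float *a, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hand-expanded summation with two accumulators (X = 1 extra).
   The two addition chains are independent of each other. */
float sum_two_acc(const float *a, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)              /* odd trip count: one leftover element */
        s0 += a[i];

    return s0 + s1;         /* combine the partial sums after the loop */
}
```

On small integer-valued inputs both functions return the same value, because every intermediate sum is exactly representable.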

51 IBM Labs in Haifa 51 More information
• Vectorizer:
  • Summit papers
    - ftp://gcc.gnu.org/pub/gcc/summit/2004/Autovectorization.pdf
    - http://www.gccsummit.org/2006/2006-GCC-Summit-Proceedings.pdf
• General
  • Summit papers
Happy Hacking!

52 IBM Labs in Haifa 52 The End

53 IBM Labs in Haifa 53 for (i = 0; i < n; i++) { sum += ((int) in[i] * (int) in[i+off]) >> scale; }


55 IBM Labs in Haifa 55 Non-consecutive access patterns
[Figure: data in memory (a b c d e f g h i j k l m n o p) is loaded into vector registers VR1–VR4; the strided elements a, f, k, p are gathered into VR5 so that OP(a), OP(f), OP(k), OP(p) become a single vector operation VOP( a, f, k, p ).]
A[i], i={0,5,10,15,…}; access_fn(i) = (0,+,5)

56 IBM Labs in Haifa 56 Basic unpacking and packing operations for strided access
• Use two pairs of inverse operations widely supported on SIMD platforms:
  • extract_even, extract_odd
  • interleave_high, interleave_low
• Use them recursively to support strided accesses with power-of-2 strides
• Support several data types
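The two pairs of operations can be sketched in scalar C. This is an illustration only: it assumes VF = 4 int elements per vector, treats two input vectors as one concatenated sequence, and assumes that "high" means the first half of each input (the real meaning depends on the target's element ordering). The function names follow the slide, but the signatures are hypothetical.

```c
#define VF 4  /* assumed number of elements per vector */

/* Even-indexed elements of the concatenation a|b. dst must not alias a or b. */
void extract_even(int dst[VF], const int a[VF], const int b[VF])
{
    for (int i = 0; i < VF / 2; i++) {
        dst[i] = a[2 * i];
        dst[VF / 2 + i] = b[2 * i];
    }
}

/* Odd-indexed elements of the concatenation a|b. dst must not alias a or b. */
void extract_odd(int dst[VF], const int a[VF], const int b[VF])
{
    for (int i = 0; i < VF / 2; i++) {
        dst[i] = a[2 * i + 1];
        dst[VF / 2 + i] = b[2 * i + 1];
    }
}

/* Interleave the first halves of a and b: a0 b0 a1 b1. */
void interleave_high(int dst[VF], const int a[VF], const int b[VF])
{
    for (int i = 0; i < VF / 2; i++) {
        dst[2 * i] = a[i];
        dst[2 * i + 1] = b[i];
    }
}

/* Interleave the second halves of a and b: a2 b2 a3 b3. */
void interleave_low(int dst[VF], const int a[VF], const int b[VF])
{
    for (int i = 0; i < VF / 2; i++) {
        dst[2 * i] = a[VF / 2 + i];
        dst[2 * i + 1] = b[VF / 2 + i];
    }
}
```

Under these assumptions the pairs are inverses: extracting the even and odd elements of two vectors and then interleaving the results (high, then low) reproduces the original two vectors. Loads with stride 2 use the extract pair; stores with stride 2 use the interleave pair.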

57 IBM Labs in Haifa
    S1:  a = x [8*i]
    S2:  b = x [8*i+1]
    S3:  c = x [8*i+2]
    S4:  d = x [8*i+3]
    S5:  e = x [8*i+4]
    S6:  f = x [8*i+5]
    S7:  g = x [8*i+6]
    S8:  h = x [8*i+7]
    S9:  y [2*i] = k = f (a,…,h)
    S10: y [2*i+1] = l = g (a,…,h)
δ=8, VF=4
• load δ*VF elements
• generate δ*log δ extracts (odd/even)
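The counts on this slide can be checked with a small scalar emulation. The sketch below assumes δ = 8 and VF = 4 as on the slide; the names and the particular pairing of vectors in each round are illustrative choices, not GCC's implementation. It starts from δ vectors holding δ*VF consecutive elements and applies log2(δ) rounds of extract_even/extract_odd. Each round performs δ extracts, giving δ*log2(δ) = 24 in total, and afterwards vector k holds the elements x[k], x[k+8], x[k+16], x[k+24].

```c
#include <string.h>

#define VF 4       /* elements per vector, as on the slide */
#define STRIDE 8   /* delta; must be a power of two */

/* One extract operation on the concatenation a|b:
   odd == 0 selects even-indexed elements, odd == 1 odd-indexed ones. */
static void extract(int dst[VF], const int a[VF], const int b[VF], int odd)
{
    for (int i = 0; i < VF / 2; i++) {
        dst[i] = a[2 * i + odd];
        dst[VF / 2 + i] = b[2 * i + odd];
    }
}

/* De-interleave STRIDE vectors in place and return the number of
   extract operations performed. */
int deinterleave(int v[STRIDE][VF])
{
    int tmp[STRIDE][VF];
    int extracts = 0;

    /* log2(STRIDE) rounds: round takes the values 1, 2, 4. */
    for (int round = 1; round < STRIDE; round *= 2) {
        for (int j = 0; j < STRIDE / 2; j++) {
            /* Evens go to the first half of the vector set, odds to the second. */
            extract(tmp[j], v[2 * j], v[2 * j + 1], 0);
            extract(tmp[STRIDE / 2 + j], v[2 * j], v[2 * j + 1], 1);
            extracts += 2;
        }
        memcpy(v, tmp, sizeof tmp);
    }
    return extracts;
}
```

Each round moves the lowest bit of an element's position to the top, so after three rounds the vector index is the element index modulo 8, which is the grouping the statements S1 to S8 need.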

58 IBM Labs in Haifa
    S1:  a = x [8*i]
    S5:  e = x [8*i+4]
    S9:  y [2*i] = k = f (a,e)
    S10: y [2*i+1] = l = g (a,e)
δ=8
• load δ*VF elements
• generate δ*log δ extracts (odd/even)
• Interleaving with gaps

59 IBM Labs in Haifa 59 Strided Accesses (Interleaved Data)
• Very common in real world computations
  • Complex data
  • rgba images (alpha blend)
  • multi-channel audio streams (down mix)
• Viterbi decoder: 5x improvement on entire benchmark
• PLDI 2006

60 IBM Labs in Haifa 60 Mixed data types
• short b[N]; int a[N]; for (i=0; i

61 IBM Labs in Haifa 61 Multiple Data-Types & Type Conversions
Scalar statements:
    S1: x_int = memref
    S2: z_int = x_int + 1
    S3: y_char = memref
    ….
    (char) z_int
Vector statements ("unroll" by VF/units):
    V1 = {1, 1, 1, 1}
    VS1.0: vx0 = memref0
    VS1.1: vx1 = memref1
    VS1.2: vx2 = memref2
    VS1.3: vx3 = memref3
    VS2.0: vz0 = vx0 + v1
    VS2.1: vz1 = vx1 + v1
    VS2.2: vz2 = vx2 + v1
    VS2.3: vz3 = vx3 + v1
    VS3.0: vy0 = vpack (vz0, vz1)
    VS3.1: vy1 = vpack (vz2, vz3)
    VS3:   vy = vpack (vy0, vy1)
    VS3:   vy = memref
    VF =
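The statement numbering above can be followed in a scalar emulation. The sketch below makes several assumptions that the slide does not state explicitly: 128-bit vectors, VF = 16 (set by the char type), S3 being the narrowing conversion y_char = (char) z_int, unsigned element types so that truncation is well defined in C, and a vpack that keeps the low bits of each element. All function names are hypothetical. The int statements are "unrolled" four times (VS1.0 to VS2.3), and three pack operations narrow four int vectors into one char vector.

```c
#include <stdint.h>

/* Truncating pack of two 4 x 32-bit vectors into one 8 x 16-bit vector. */
static void pack_u32_to_u16(uint16_t dst[8], const uint32_t a[4], const uint32_t b[4])
{
    for (int i = 0; i < 4; i++) {
        dst[i] = (uint16_t)a[i];
        dst[4 + i] = (uint16_t)b[i];
    }
}

/* Truncating pack of two 8 x 16-bit vectors into one 16 x 8-bit vector. */
static void pack_u16_to_u8(uint8_t dst[16], const uint16_t a[8], const uint16_t b[8])
{
    for (int i = 0; i < 8; i++) {
        dst[i] = (uint8_t)a[i];
        dst[8 + i] = (uint8_t)b[i];
    }
}

/* One vector iteration (16 elements) of: y[i] = (char)(x[i] + 1). */
void add1_and_narrow(uint8_t y[16], const uint32_t x[16])
{
    uint32_t vz[4][4];

    /* VS1.u and VS2.u: four int vectors are loaded and incremented. */
    for (int u = 0; u < 4; u++)
        for (int l = 0; l < 4; l++)
            vz[u][l] = x[4 * u + l] + 1u;

    /* VS3.0, VS3.1, VS3: narrow in two levels, int to short to char. */
    uint16_t vy0[8], vy1[8];
    pack_u32_to_u16(vy0, vz[0], vz[1]);
    pack_u32_to_u16(vy1, vz[2], vz[3]);
    pack_u16_to_u8(y, vy0, vy1);
}
```

The element order is preserved through both pack levels, so y[i] corresponds to x[i] for all 16 elements of the iteration.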

62 IBM Labs in Haifa 62 Multiple Data-Types & Type Conversions
• Very common in multimedia computations
  • Video: unsigned chars → shorts
  • Audio: signed shorts → ints
  • Filters, autocorrelation, dot product, alpha-blending…
• Autocorrelation: 6x improvement on benchmark

    for (i = 0; i < n; i++) {
      acc += ((int) short_in1[i] * (int) short_in2[i+lag]) >> Scale;
    }


