Auto-Vectorization of Interleaved Data for SIMD

Presentation transcript:

1 Auto-Vectorization of Interleaved Data for SIMD
Dorit Nuzman, Ira Rosen, Ayal Zaks
IBM Haifa Research Lab – HiPEAC member, Israel
{dorit, ira, …}
Speaker notes: Used this slide as an opportunity to mention that this is a collaboration between different companies/vendors… Not IBM confidential, but not GPL either… Those who stayed over from yesterday are welcome to stay.

2 Main Message
Most SIMD targets support access to packed data in memory (“SIMpD”), but there are important applications which access non-consecutive data.
We show how a classic compiler loop-based auto-SIMDizing optimization was augmented to support accesses to strided, interleaved data.
This can serve as a first step to combine traditional loop-based vectorization with (if-converted) basic-block vectorization (“SLP”).
Speaker notes: The first two points constitute the problem; emphasize that we are not trying to devise new targets (as we and others did, e.g. eLite’s SIMdD), nor to change the application (e.g. data-layout transformations). We focus on optimizing a range of existing applications on a range of existing targets – hence the next point. Point 3 constitutes the solution, for which we used the existing, multi-targeted GCC compiler. Point 4 constitutes an extension / future work.
PLDI 2006

3 SIMD: Single Instruction Multiple Data
[Figure: consecutive data elements a, b, c, d are packed from memory into vector register VR1; the scalar operations OP(a), OP(b), OP(c), OP(d) are replaced by a single vector operation VOP(a, b, c, d). Data in memory: a b c d e f g h i j k l m n o p]
Speaker notes: The context of this work is the DSP domain. One of the characteristics of applications in this domain is the abundant parallelism present in the computations they perform: the same instruction is often executed many times, each time on different data. Modern DSP architectures often have special hardware that allows executing the same instruction simultaneously on multiple data elements. Usually the operands must be packed in advance into vector registers; the vector instruction then takes such a register as operand and performs the operation on all (four) data elements at once (SIMD → SIMpD, “packed” data). The process of transforming groups of scalar instructions into vector ones is called vectorization.
PLDI 2006

4 SIMD: Single Instruction Multiple Data
[Figure: the same animation, one step further – the elements a, b, c, d have been packed from memory into VR1, ready for the vector operation. Data in memory: a b c d e f g h i j k l m n o p]
PLDI 2006

5 SIMD: Single Instruction Multiple Data
[Figure: vectorizing for a SIMpD architecture with non-consecutive data – the elements a, f, k, p (stride 4) must be gathered from memory into VR5 before VOP(a, f, k, p) can replace OP(a), OP(f), OP(k), OP(p). Data in memory: a b c d e f g h i j k l m n o p]
PLDI 2006
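The distinction the last three slides draw can be written as two tiny C loops – a minimal sketch with illustrative names and a placeholder operation OP, not code from the original talk:

  #define OP(x) ((x) + 1)   /* placeholder scalar operation */

  /* Unit-stride (packed) access: in[i], in[i+1], ... are consecutive in
     memory and can be loaded directly into one vector register. */
  void consecutive(int *out, const int *in, int n) {
      for (int i = 0; i < n; i++)
          out[i] = OP(in[i]);
  }

  /* Strided access (stride 4): the needed elements are 4 apart, so a
     SIMpD target must load consecutive vectors and reorder the wanted
     elements into one register before the vector operation can run. */
  void strided(int *out, const int *in, int n) {
      for (int i = 0; i < n; i++)
          out[i] = OP(in[4 * i]);
  }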

6 SIMpD: Single Instruction Multiple packed Data
[Figure: on a SIMpD (packed-only) target, the strided elements a, f, k, p are first loaded as consecutive vectors (VR1…VR4) and then reordered, through a reorder buffer, into VR5 before VOP(a, f, k, p) replaces OP(a), OP(f), OP(k), OP(p). Data in memory: a b c d e f g h i j k l m n o p]
  mask ← …
  loop:
    (VR1, …, VR4) ← vload (mem)
    VR5 ← pack (VR1, …, VR4), mask
    VOP(VR5)
PLDI 2006

7 Application accessing non-consecutive data – Viterbi decoder (before)
[Figure: data-flow graph of the Viterbi butterfly – a stride-1 input and two stride-2 inputs feed add/subtract operations and shifts (<< 1 and << 1|1), followed by max-with-selection (max, sel), producing a stride-4 output.]
Speaker notes: Butterfly; perform max-with-index, recording where each max came from. Applications may contain unrolled code, e.g. if the programmer knows the loop bound is even, or if different entries have different names – fields of a struct.
PLDI 2006

8 Application accessing non-consecutive data – Viterbi decoder (after)
[Figure: the same Viterbi butterfly data-flow graph, shown after vectorization of the strided (stride-1, stride-2, and stride-4) accesses; operators and strides are as on the previous slide.]
PLDI 2006

9 Application accessing non-consecutive data – Audio downmix (before)
[Figure: data-flow graph of the audio-downmix kernel – stride-4 inputs are each shifted right by 1 and added pairwise, producing stride-2 outputs.]
PLDI 2006
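A minimal C sketch of the kind of loop the downmix figure depicts – interleaved 4-channel samples (stride 4) halved and added pairwise into interleaved stereo samples (stride 2); the names and the exact channel pairing are assumptions, not taken from the benchmark source:

  /* in[]  holds 4 interleaved input channels:  in[4*i + 0..3]
     out[] holds 2 interleaved output channels: out[2*i + 0..1] */
  void downmix(short *out, const short *in, int n) {
      for (int i = 0; i < n; i++) {
          out[2*i + 0] = (in[4*i + 0] >> 1) + (in[4*i + 1] >> 1);
          out[2*i + 1] = (in[4*i + 2] >> 1) + (in[4*i + 3] >> 1);
      }
  }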

10 Application accessing non-consecutive data – Audio downmix (after)
[Figure: the same audio-downmix data-flow graph, shown after vectorization of the strided (stride-4 and stride-2) accesses.]
PLDI 2006

11 Basic unpacking and packing operations for strided access
Use two pairs of inverse operations, widely supported on SIMD platforms:
  extract_even, extract_odd – de-interleave two vectors (used after strided loads)
  interleave_high, interleave_low – interleave two vectors (used before strided stores)
Use them recursively to support strided accesses with power-of-2 strides.
They are supported for several data types, in contrast to the most general “permute”.
PLDI 2006
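A scalar C model of the two extract primitives and of their recursive use – a sketch for illustration only (VF and the vector-as-array representation are assumptions; real targets implement these as single instructions or permutes):

  #define VF 4   /* illustrative vector length */

  /* extract_even: the even-indexed elements of the concatenation a|b. */
  void extract_even(const int a[VF], const int b[VF], int dst[VF]) {
      for (int i = 0; i < VF / 2; i++) {
          dst[i]          = a[2 * i];
          dst[VF / 2 + i] = b[2 * i];
      }
  }

  /* extract_odd: the odd-indexed elements of the concatenation a|b. */
  void extract_odd(const int a[VF], const int b[VF], int dst[VF]) {
      for (int i = 0; i < VF / 2; i++) {
          dst[i]          = a[2 * i + 1];
          dst[VF / 2 + i] = b[2 * i + 1];
      }
  }

Applying the pair once to two loaded vectors separates stride-2 data; applying it again to the results separates stride-4 data (log2 d = 2 levels). For example, with four loaded vectors v0…v3 holding elements x0…x15, extract_even(extract_even(v0, v1), extract_even(v2, v3)) yields x0, x4, x8, x12 – one of the four interleaved streams. interleave_high/interleave_low are the inverse pair and are used symmetrically for strided stores.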

12 Classic loop-based auto-vectorization
vect_analyze_loop (loop) {
  if (!1_analyze_counted_single_bb_loop (loop))    FAIL
  if (!2_determine_VF (loop))                      FAIL
  if (!3_analyze_memory_access_patterns (loop))    FAIL
  if (!4_analyze_scalar_dependence_cycles (loop))  FAIL
  if (!5_analyze_data_dependence_distances (loop)) FAIL
  if (!6_analyze_consecutive_data_accesses (loop)) FAIL
  if (!7_analyze_data_alignment (loop))            FAIL
  if (!8_analyze_vops_exist_forall_ops (loop))     FAIL
  SUCCEED
}

vect_transform_loop (loop) {
  FOR_ALL_STMTS_IN_LOOP (loop, stmt)
    replace_OP_by_VOP (stmt);
  decrease_loop_bound_by_factor_VF (loop);
}

Speaker notes: Transform statement by statement, top-down. When transforming a statement, code may be added in the loop prolog/epilog (e.g. for reductions), or to handle misalignment.
PLDI 2006
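For reference, a minimal C loop of the kind this analysis accepts: counted, a single basic block, unit-stride consecutive accesses, and no dependence cycles (names are illustrative; alignment and the availability of vector ops remain target-dependent checks):

  void add_arrays(int *a, const int *b, const int *c) {
      for (int i = 0; i < 256; i++)
          a[i] = b[i] + c[i];
  }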

13 Vectorizing non-unit-stride accesses
One VOP accessing data with stride d requires loading d·VF elements.
Several otherwise unrelated VOPs can share these loaded elements:
  if they all share the same stride d,
  if they all start close to each other,
  up to d VOPs; if fewer, there are ‘gaps’.
Recognize this spatial-reuse potential to eliminate redundant load and extract operations.
Better to make the decision earlier than later – without such elimination,
  vectorizing the loop may be non-beneficial (for loads),
  vectorizing the loop may be prohibited (for stores).
Speaker notes: If the stride is greater than VF, we can load only VF·VF elements, but we always load consecutive intervals to facilitate reuse. Having the same data type for all group members can simplify things.
PLDI 2006
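A minimal C sketch of the spatial reuse being described – two otherwise unrelated statements that both access x with stride 2 and start at adjacent addresses, so one pair of vector loads plus one extract_even/extract_odd pair can feed both (names are illustrative):

  void split_streams(int *re, int *im, const int *x, int k, int n) {
      for (int i = 0; i < n; i++) {
          re[i] = x[2 * i]     * k;   /* stride 2, offset 0 */
          im[i] = x[2 * i + 1] * k;   /* stride 2, offset 1 */
      }
  }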

14 Augmenting the vectorizer: step 1/3 – build spatial groups
5_analyze_data_dependence_distances already traversed all pairs of loads/stores to analyze their dependence distance:

  if (cross_iteration_dependence_distance <= (VF-1)*stride)
    if (read,write) or (write,read) or (write,write)
      ok = dep_resolve();
  endif

Augment this traversal to look for spatial reuse between pairs of independent loads and stores, building spatial groups:

  if ok and (intra_iteration_address_distance < stride*u)
    if (read,read) or (write,write)
      ok = analyze_and_build_spatial_groups();

PLDI 2006

15 Augmenting the vectorizer: step 2/3 – check spatial groups
6_analyze_consecutive_data_accesses already traversed each individual load/store to analyze its access pattern.
Augment this traversal by:
  allowing non-consecutive accesses,
  building singleton groups for strided, ungrouped loads/stores,
  checking for gaps and for the profitability of spatial groups.
PLDI 2006
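A minimal illustration of a group with gaps (names are illustrative, not from the paper’s benchmarks): the loop below touches only two of four interleaved streams, so the group has stride d = 4 but uses only offsets 0 and 2; the vector loads still bring in all d·VF elements, and the profitability check must account for the unused ones:

  void two_of_four(int *s0, int *s2, const int *x, int n) {
      for (int i = 0; i < n; i++) {
          s0[i] = x[4 * i];       /* stream 0 */
          s2[i] = x[4 * i + 2];   /* stream 2; offsets 1 and 3 are gaps */
      }
  }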

16 Augmenting the vectorizer: step 3/3 – transformation
vect_transform_stmt generates vector code per scalar OP. Augment this by considering:
  If OP is a load/store in the first position of a spatial group:
    generate d loads/stores,
    handle their alignment according to the starting address,
    generate d·log d extracts/interleaves.
  If OP belongs to a spatial group, connect it to the appropriate extract/interleave according to its position.
Unused extracts/interleaves are discarded by subsequent dead-code elimination (DCE).
PLDI 2006
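As an illustration, a sketch – in the same informal pseudocode as slide 12, not actual GCC output – of the code generated for a stride-2 load group (d = 2) like the re/im example shown after slide 13: two vector loads and 2·log2 2 = 2 extracts per vector iteration.

  loop:
    V0 ← vload (&x[2*i])          ; d = 2 loads, emitted at the group’s first member
    V1 ← vload (&x[2*i + VF])
    VE ← extract_even (V0, V1)    ; feeds the statement using x[2*i]
    VO ← extract_odd  (V0, V1)    ; feeds the statement using x[2*i + 1]
    VRE ← VE * Vk
    VIM ← VO * Vk
    vstore (&re[i], VRE)
    vstore (&im[i], VIM)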

17 Performance – qualitative: VF/(1 + log d)
[Table: predicted improvement factor for VF = 4, 8, 16 at various strides d – e.g. ≈1.3 / 2.6 / 5.3 for d = 4, ≈0.8 / 1.6 / 3.2 for d = 16, and ≈0.6 / 1.2 / 2.4 for d = 32.]
Vectorized code has d loads/stores and d·log d extracts/interleaves.
Scalar code has d·VF loads/stores.
The improvement factor in the number of load/store/extract/interleave operations is therefore VF/(1 + log d).
PLDI 2006
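Worked example (following the counts on the slide above): for VF = 16 and d = 4, the scalar loop issues d·VF = 64 loads per 16 iterations, while the vectorized loop issues d = 4 vector loads plus d·log2 d = 8 extracts, giving 64 / 12 ≈ 5.3 = VF/(1 + log2 d).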

18 Performance – empirically (on PowerPC 970 with Altivec)
A stride of 2 always provides speedups.
Strides of 8 and 16 suffer from increased code size, which turns off loop unrolling.
A stride of 32 suffers from high register pressure (d + 1).
If non-permute (arithmetic) operations exist – speedups for all strides when VF ≥ 8.
PLDI 2006

19 Performance – stride of 8 with gaps
The position of the gaps affects the number of extracts (interleaves) needed.
Improvement is observed even for a single strided access (VF = 16, with arithmetic operations).
PLDI 2006

20 Performance – kernels
Four groups: VF = 4, 8, 16, and 16-with-gaps.
Each kernel name is prefixed with its stride.
Slowdown only when the kernel does memory operations alone, at VF = 4, d = 8.
PLDI 2006

21 Future direction – towards loop-aware SLP
When building spatial groups, we consider distinct operations accessing adjacent/close addresses; this is the first step of building SLP chains.
SLP looks for VF fully interleaved accesses, without gaps; it may require prior loop unrolling.
The next step is to consider the operations that use a spatial group of loads – if they are isomorphic, try to postpone the extracts.
This is analogous to handling alignment using the zero-shift, lazy-shift, and eager-shift policies.
PLDI 2006

22 Conclusions
Existing SIMD targets supporting SIMpD can provide improved performance for important applications with power-of-2 strides – don’t be afraid of d > 2.
Existing compiler loop-based auto-vectorization can be augmented efficiently to handle such strided accesses.
This can serve as a first step toward combining traditional loop-based vectorization with (if-converted) basic-block vectorization (“SLP”).
This area of work is fertile; consider the details (d, gaps, positions, VF, non-memory ops) for it not to be futile!
PLDI 2006

23 Questions? PLDI 2006

