Published by Christopher Ferguson. Modified over 2 years ago.

1
PLDI 2006
Auto-Vectorization of Interleaved Data for SIMD
Dorit Nuzman, Ira Rosen, Ayal Zaks
IBM Haifa Research Lab – HiPEAC member, Israel
{dorit, ira,

2
Main Message
1. Most SIMD targets support access to packed data in memory (SIMpD), but there are important applications which access non-consecutive data.
2. We show how a classic compiler loop-based auto-SIMDizing optimization was augmented to support accesses to strided, interleaved data.
3. This can serve as a first step to combine traditional loop-based vectorization with (if-converted) basic-block vectorization (SLP).

3
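SIMD: Single Instruction Multiple Data; SIMpD: Single Instruction Multiple packed Data
[Figure: Vectorization. Data in memory: a b c d e f g h i j k l m n o p. The scalar operations OP(a), OP(b), OP(c), OP(d) become one vector operation VOP(a, b, c, d) on a vector register (VR1) holding a, b, c, d. Labels: Vector Registers, Vector Operation.]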

4
SIMD: Single Instruction Multiple Data; SIMpD: Single Instruction Multiple packed Data
[Figure: the same vectorization diagram as on slide 3, with a, b, c, d packed in VR1 and processed by VOP(a, b, c, d).]

5
Vectorizing for a SIMpD Architecture
[Figure: Data in memory: a b c d e f g h i j k l m n o p. The scalar operations OP(a), OP(f), OP(k), OP(p) access non-consecutive elements. VR1 to VR4 hold the packed data a b c d, e f g h, i j k l, m n o p; the vector operation VOP(a, f, k, p) needs a, f, k, p packed into VR5.]

6
SIMpD: Single Instruction Multiple packed Data
[Figure: the diagram of slide 5 with a reorder buffer between memory and the vector registers; a reorder operation with a mask selects a, f, k, p from VR1 to VR4 into VR5.]
loop:
  (VR1,…,VR4) <- vload (mem)
  VR5 <- pack (VR1,…,VR4), mask
  VOP(VR5)
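The load, pack and operate sequence on this slide can be sketched in Python. This is only an illustration: the helper names `vload`, `pack` and `vop`, the vector length of 4, and the use of upper-casing as a stand-in for the vector operation are all assumptions, not part of the slides.

```python
VF = 4  # assumed number of elements per vector register

def vload(mem, start):
    """Load VF consecutive (packed) elements starting at index `start`."""
    return mem[start:start + VF]

def pack(regs, mask):
    """Select elements from the concatenated registers by index mask."""
    flat = [x for reg in regs for x in reg]
    return [flat[i] for i in mask]

def vop(reg):
    """Stand-in vector operation: one operation applied to every element."""
    return [x.upper() for x in reg]

mem = list("abcdefghijklmnop")
regs = [vload(mem, i * VF) for i in range(4)]  # VR1..VR4 hold packed data
mask = [0, 5, 10, 15]                          # picks a, f, k, p
vr5 = pack(regs, mask)
result = vop(vr5)
```

The four packed loads are the only memory accesses; the non-consecutive elements are brought together in registers by the reorder step.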

7
Application accessing non-consecutive data – Viterbi decoder (before)
[Figure: dataflow graph with the operations -, +, max, << 1, << 1|1 and sel, on data accessed with Stride 1, Stride 2 and Stride 4.]

8
Application accessing non-consecutive data – Viterbi decoder (after)
[Figure: dataflow graph with the operations -, +, max, << 1, << 1|1 and sel, on data accessed with Stride 1, Stride 2 and Stride 4.]

9
Application accessing non-consecutive data – Audio downmix (before)
[Figure: dataflow graph with the operations + and >> 1, on data accessed with Stride 2 and Stride 4.]

10
Application accessing non-consecutive data – Audio downmix (after)
[Figure: dataflow graph with the operations + and >> 1, on data accessed with Stride 2 and Stride 4.]

11
Basic unpacking and packing operations for strided access
- Use two pairs of inverse operations widely supported on SIMD platforms:
  - extract_even, extract_odd
  - interleave_high, interleave_low
- Use them recursively to support strided accesses with power-of-2 strides
- Support several data types
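A small Python model may help make the four operations concrete. The semantics below are an assumption for illustration (vectors are modelled as lists, and "high" is taken to mean the first half of each input); real targets differ in naming and element order.

```python
def extract_even(v1, v2):
    """Even-indexed elements of the concatenation of v1 and v2."""
    return (v1 + v2)[0::2]

def extract_odd(v1, v2):
    """Odd-indexed elements of the concatenation of v1 and v2."""
    return (v1 + v2)[1::2]

def interleave_high(v1, v2):
    """Interleave the first halves of v1 and v2 (assumed meaning of 'high')."""
    half = len(v1) // 2
    out = []
    for a, b in zip(v1[:half], v2[:half]):
        out += [a, b]
    return out

def interleave_low(v1, v2):
    """Interleave the second halves of v1 and v2 (assumed meaning of 'low')."""
    half = len(v1) // 2
    out = []
    for a, b in zip(v1[half:], v2[half:]):
        out += [a, b]
    return out

# Recursive use for stride 4: two levels of extract_even over four vectors.
mem = list(range(16))
v0, v1, v2, v3 = mem[0:4], mem[4:8], mem[8:12], mem[12:16]
stride2_a = extract_even(v0, v1)                # elements 0, 2, 4, 6
stride2_b = extract_even(v2, v3)                # elements 8, 10, 12, 14
stride4 = extract_even(stride2_a, stride2_b)    # elements 0, 4, 8, 12
```

Under these definitions the interleave pair undoes the extract pair, which is what makes one pair usable for strided loads and the other for strided stores.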

12
Classic loop-based auto-vectorization

vect_analyze_loop (loop) {
  if (!1_analyze_counted_single_bb_loop (loop)) FAIL
  if (!2_determine_VF (loop)) FAIL
  if (!3_analyze_memory_access_patterns (loop)) FAIL
  if (!4_analyze_scalar_dependence_cycles (loop)) FAIL
  if (!5_analyze_data_dependence_distances (loop)) FAIL
  if (!6_analyze_consecutive_data_accesses (loop)) FAIL
  if (!7_analyze_data_alignment (loop)) FAIL
  if (!8_analyze_vops_exist_forall_ops (loop)) FAIL
  SUCCEED
}

vect_transform_loop (loop) {
  FOR_ALL_STMTS_IN_LOOP(loop, stmt)
    replace_OP_by_VOP (stmt);
  decrease_loop_bound_by_factor_VF (loop);
}
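As a minimal sketch of what the transform step amounts to for a unit-stride loop, the Python below replaces a scalar add by a vector add and divides the trip count by VF. The function names and the choice of VF = 4 are illustrative assumptions; the sketch also assumes the trip count is a multiple of VF, so no epilogue loop is shown.

```python
VF = 4  # assumed vectorization factor

def scalar_loop(a, b):
    """Original loop: one scalar OP per iteration."""
    out = [0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out

def vadd(va, vb):
    """Stand-in for the vector operation (VOP) replacing the scalar add."""
    return [x + y for x, y in zip(va, vb)]

def vectorized_loop(a, b):
    """Transformed loop: OP replaced by VOP, loop bound decreased by VF."""
    n = len(a)
    assert n % VF == 0, "sketch assumes the trip count is a multiple of VF"
    out = [0] * n
    for j in range(n // VF):
        lo = j * VF
        out[lo:lo + VF] = vadd(a[lo:lo + VF], b[lo:lo + VF])
    return out
```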

13
Vectorizing non-unit-stride access
- One VOP accessing data with stride d requires loading d*VF elements
- Several otherwise unrelated VOPs can share these loaded elements
  - if they all share the same stride d
  - if they all start close to each other
  - up to d VOPs; if fewer, there are gaps
- Recognize this spatial reuse potential to eliminate redundant load and extract operations
- Better to make the decision earlier rather than later; without such elimination
  - vectorizing the loop may not be beneficial (for loads)
  - vectorizing the loop may be prohibited (for stores)
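The saving from sharing can be illustrated with a toy load counter. The sketch assumes stride d = 2, VF = 4 and interleaved (re, im) pairs in memory; the `Memory` class and both functions are hypothetical helpers, not taken from the slides.

```python
class Memory:
    """Toy memory that counts vector loads."""
    def __init__(self, data, vf):
        self.data, self.vf, self.loads = data, vf, 0

    def vload(self, start):
        self.loads += 1
        return self.data[start:start + self.vf]

def unshared(mem):
    """Each strided access loads its own d*VF elements (d = 2 vector loads)."""
    re = (mem.vload(0) + mem.vload(4))[0::2]
    im = (mem.vload(0) + mem.vload(4))[1::2]
    return re, im

def shared(mem):
    """Both accesses reuse the same d vector loads (one spatial group)."""
    v0, v1 = mem.vload(0), mem.vload(4)
    return (v0 + v1)[0::2], (v0 + v1)[1::2]

data = list(range(8))  # re0, im0, re1, im1, ...
m1, m2 = Memory(data, 4), Memory(data, 4)
r1, r2 = unshared(m1), shared(m2)
```

Both versions produce the same two vectors, but the shared version issues half as many loads in this example.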

14
Augmenting the vectorizer: step 1/3 – build spatial groups
5_analyze_data_dependence_distances already traversed all pairs of loads/stores to analyze their dependence distance:

  if (cross_iteration_dependence_distance <= (VF-1)*stride)
    if (read,write) or (write,read) or (write,write)
      ok = dep_resolve();
  endif

Augment this traversal to look for spatial reuse between pairs of independent loads and stores, building spatial groups:

  if ok and (intra_iteration_address_distance < stride*u)
    if (read,read) or (write,write)
      ok = analyze_and_build_spatial_groups();
  endif
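The grouping idea can be sketched as follows. This is a simplified illustration under stated assumptions: accesses are described by a hypothetical `Access` record, distances are measured in elements rather than bytes, and two accesses are grouped when they have the same array, kind and stride and their offsets differ by less than the stride. Dependence checking is left out.

```python
from collections import namedtuple

# Hypothetical description of one memory access: base[stride*i + offset]
Access = namedtuple("Access", "name base stride offset kind")

def build_spatial_groups(accesses):
    """Group reads with reads and writes with writes when they share a base
    and stride and start within one stride of the group's first member."""
    groups = []
    ordered = sorted(accesses, key=lambda a: (a.base, a.kind, a.stride, a.offset))
    for acc in ordered:
        for group in groups:
            first = group[0]
            same_shape = (first.base, first.kind, first.stride) == \
                         (acc.base, acc.kind, acc.stride)
            if same_shape and 0 <= acc.offset - first.offset < acc.stride:
                group.append(acc)
                break
        else:
            groups.append([acc])  # starts a new (possibly singleton) group
    return groups

accesses = [
    Access("b", "in", 2, 1, "read"),
    Access("a", "in", 2, 0, "read"),
    Access("c", "out", 4, 0, "write"),
    Access("e", "out", 4, 2, "write"),
]
groups = build_spatial_groups(accesses)
names = [[acc.name for acc in g] for g in groups]
```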

15
Augmenting the vectorizer: step 2/3 – check spatial groups
6_analyze_consecutive_data_accesses already traversed each individual load/store to analyze its access pattern. Augment this traversal by:
- Allowing non-consecutive accesses
- Building singleton groups for strided ungrouped loads/stores
- Checking for gaps and profitability of spatial groups
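A gap check for one spatial group might look like the sketch below. The function names are hypothetical, and the rule that a store group with gaps is rejected is an assumption made for this illustration (the slides only say that vectorization may be prohibited for stores); no profitability model is attempted.

```python
def find_gaps(offsets, stride):
    """Positions within one stride, relative to the first member,
    that no access in the group touches."""
    first = min(offsets)
    used = {o - first for o in offsets}
    return [p for p in range(stride) if p not in used]

def group_is_supported(offsets, stride, kind):
    """Assumed policy for this sketch: loads tolerate gaps, stores do not,
    because a full-vector store would overwrite the untouched elements."""
    gaps = find_gaps(offsets, stride)
    if kind == "write" and gaps:
        return False
    return True
```

For example, a group touching offsets 0 and 2 with stride 4 has gaps at positions 1 and 3.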

16
Augmenting the vectorizer: step 3/3 – transformation
vect_transform_stmt generates vector code per scalar OP. Augment this by considering:
- If OP is a load/store in the first position of a spatial group
  - generate d loads/stores
  - handle their alignment according to the starting address
  - generate d log d extracts/interleaves
- If OP belongs to a spatial group, connect it to the appropriate extract/interleave according to its position
- Unused extracts/interleaves are discarded by subsequent DCE
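For the load side, the d loads and d log d extracts can be sketched as a network of log2(d) levels, each pairing up the current vectors. The layout of the network and the name `load_group` are assumptions for illustration; alignment handling is omitted and d is assumed to be a power of 2.

```python
def extract_even(v1, v2):
    return (v1 + v2)[0::2]

def extract_odd(v1, v2):
    return (v1 + v2)[1::2]

def load_group(mem, start, d, vf):
    """De-interleave d streams of vf elements each.
    Returns (vectors, number of loads, number of extracts);
    vectors[k] holds the elements at position k of the group."""
    vecs = [mem[start + i * vf:start + (i + 1) * vf] for i in range(d)]
    loads, extracts = d, 0
    levels = d.bit_length() - 1  # log2(d) for a power-of-2 d
    for _ in range(levels):
        pairs = [(vecs[2 * i], vecs[2 * i + 1]) for i in range(d // 2)]
        vecs = [extract_even(a, b) for a, b in pairs] + \
               [extract_odd(a, b) for a, b in pairs]
        extracts += d  # d/2 extract_even plus d/2 extract_odd per level
    return vecs, loads, extracts

vecs, loads, extracts = load_group(list(range(16)), 0, d=4, vf=4)
```

Each scalar load in the group would then be connected to `vecs[k]` for its position k; results that no statement uses are left for dead-code elimination.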

17
Performance – qualitative: VF/(1 + log d)
- Vectorized code has d loads/stores and (d log d) extracts/interleaves
- Scalar code has d*VF loads/stores
- Performance improvement factor in the number of load/store/extract/interleave operations is VF/(1 + log d)
[Table: improvement factor by stride; column headings d, VF=4, VF=8, VF=]
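The factor follows from dividing the scalar count d*VF by the vector count d + d log d = d(1 + log d). The short Python sketch below evaluates it for a few strides, taking log as log base 2; the particular VF values chosen here (4, 8, 16) are an assumption for illustration.

```python
from math import log2

def improvement(vf, d):
    """Ratio of scalar memory operations (d*vf) to vector loads plus
    extracts (d + d*log2(d)), which simplifies to vf / (1 + log2(d))."""
    return vf / (1 + log2(d))

table = {d: [improvement(vf, d) for vf in (4, 8, 16)]
         for d in (2, 4, 8, 16, 32)}
```

With VF = 4 and d = 8 the factor is exactly 1, meaning the vectorized memory operations alone bring no reduction in operation count.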

18
Performance – empirically (on PowerPC 970 with Altivec)
- Stride of 2 always provides speedups
- Strides of 8, 16 suffer from increased code size, which turns off loop unrolling
- Stride of 32 suffers from high register pressure (d+1)
- If non-permute operations exist – speedups for all strides if VFm8

19
Performance – stride of 8 with gaps
- Position of gaps affects the number of extracts (interleaves) needed
- Improvement is observed even for a single strided access (VF=16 with arithmetic operations)
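Why the position of the gaps matters can be seen by counting which extracts survive dead-code elimination. The sketch below assumes a particular extract network (log2(d) levels, where output k of a level is extract_even of inputs 2k and 2k+1 for k < d/2, and extract_odd of the corresponding pair otherwise); the actual network generated by the vectorizer may differ, and `live_extracts` is a hypothetical helper.

```python
def live_extracts(d, used_positions):
    """Count extract operations still needed when only `used_positions`
    (out of 0..d-1) of a stride-d load group are consumed."""
    levels = d.bit_length() - 1      # log2(d) for a power-of-2 d
    live = set(used_positions)       # live outputs of the last level
    count = 0
    for _ in range(levels):
        count += len(live)           # one extract per live output
        prev = set()
        for k in live:
            i = k % (d // 2)         # pair feeding output k
            prev.update((2 * i, 2 * i + 1))
        live = prev
    return count

full = live_extracts(8, range(8))    # no gaps
single = live_extracts(8, {0})       # a single strided access
adjacent = live_extracts(8, {0, 1})  # two used positions, next to each other
apart = live_extracts(8, {0, 4})     # two used positions, half a stride apart
```

In this model, two used positions cost a different number of extracts depending on where they sit, because positions 0 and 4 share their inputs at every level while positions 0 and 1 do not.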

20
Performance – kernels
- 4 groups: VF=4, 8, 16, 16-with-gaps
- Strides prefix each kernel
- Slowdown when doing only memory operations at VF=4, d=8

21
Future direction – towards loop-aware SLP
- When building spatial groups, we consider distinct operations accessing adjacent/close addresses; this is the first step of building SLP chains
- SLP looks for VF fully interleaved accesses, without gaps; may require earlier loop unrolling
- Next step is to consider the operations that use a spatial group of loads – if they're isomorphic, try to postpone the extracts
- Analogous to handling alignment using zero-shift, lazy-shift, eager-shift policies

22
Conclusions
1. Existing SIMD targets supporting SIMpD can provide improved performance for important power-of-2 strided applications – don't be afraid of d > 2.
2. Existing compiler loop-based auto-vectorization can be augmented efficiently to handle such strided accesses.
3. This can serve as a first step combining traditional loop-based vectorization with (if-converted) basic-block vectorization (SLP).
4. This area of work is fertile; consider details (d, gaps, positions, VF, non-mem ops) for it not to be futile!

23
Questions?
