Optimizing Data Permutations for SIMD Devices
Gang Ren, Peng Wu (1), David Padua
University of Illinois at Urbana-Champaign
(1) IBM T.J. Watson Research Center


2 PLDI 06 SIMD Is Everywhere
[Figure: SIMD architecture, with memory, a register file, and an ALU operating on four lanes in parallel]

3 PLDI 06 SIMD Compilation
Explore Data Parallelism: Vectorization, Instruction Packing, If Conversion, ……
Generating Efficient SIMD Code: Data Permutation Optimization, Idiom Recognition, Execution Mapping, Type Promotion Elimination, ……

    int a[16],b[16],c[16];
    for(i=0; i<16; i++) c[i] = a[i] + b[i];

    float a[16],b[16],c[16];
    c[0:15] = a[0:15] + b[0:15];

    float a[16], b[16], c[16];
    ...
    vr1 = vload(a); vr2 = vload(b); vr3 = vadd(vr1, vr2);
    ...
    vr1 = vec_load(a); vr2 = vec_load(b); vr3 = vec_add(vr1, vr2);
    ...

4 PLDI 06 Strict SIMD Architecture (1)
- Most SIMD devices support memory accesses only on contiguous and aligned memory sections.

    ... = ...a[0:3:1]...;
    vr1 = vec_load(a);

[Figure: memory holds a0 ... a7; a single vector load brings (a0 a1 a2 a3) into one register]

5 PLDI 06 Strict SIMD Architecture (2)
- Additional permutation instructions are needed for non-contiguous and/or misaligned memory references.
- Strict SIMD devices: all data reorganization must be accomplished with permutation instructions.

    ... = ...a[0:6:2]...;
    vr1 = vec_load(a); vr2 = vec_load(a+4); vr4 = vperm(vr1, vr2, …);

[Figure: two loads bring (a0 a1 a2 a3) and (a4 a5 a6 a7) into registers; a vperm then selects (a0 a2 a4 a6)]
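The load-load-permute sequence above can be emulated in plain C to make the data movement concrete. This is a minimal sketch, not real intrinsic code: the `vec4` type and the `emu_load` and `emu_vperm` helpers are hypothetical stand-ins for a 4-lane register, a contiguous load, and a two-input permute, and the selector `{0, 2, 4, 6}` is an assumed encoding of the pattern drawn on the slide.

```c
#include <stddef.h>

/* Hypothetical 4-lane vector register, emulated in plain C. */
typedef struct { float lane[4]; } vec4;

/* Emulates a contiguous load of four floats (alignment is not checked). */
vec4 emu_load(const float *p) {
    vec4 v;
    for (size_t i = 0; i < 4; i++) v.lane[i] = p[i];
    return v;
}

/* Emulates a two-input permute: sel[i] in 0..7 picks a lane from the
   concatenation of x (lanes 0..3) and y (lanes 4..7). */
vec4 emu_vperm(vec4 x, vec4 y, const int sel[4]) {
    vec4 r;
    for (size_t i = 0; i < 4; i++)
        r.lane[i] = sel[i] < 4 ? x.lane[sel[i]] : y.lane[sel[i] - 4];
    return r;
}

/* Gathers a[0:6:2] = {a[0], a[2], a[4], a[6]} using two contiguous
   loads followed by one permute. */
vec4 load_stride2(const float *a) {
    static const int even[4] = {0, 2, 4, 6};
    vec4 vr1 = emu_load(a);
    vec4 vr2 = emu_load(a + 4);
    return emu_vperm(vr1, vr2, even);
}
```

The point of the sketch is the cost model: one stride-2 reference costs two loads plus one permute, which is the overhead the rest of the talk tries to reduce.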

6 PLDI 06 Overview of the Optimization Framework
Phases: Normalization, Optimization, Code Generation

Input (array notation):
    float a[16],b[16],c[16];
    c[0:15] = a[0:15] + b[0:15];
    c[0:15] = a[0:31:2] + b[0:15];

Output (SIMD code):
    float a[16], b[16], c[16];
    ...
    vr1 = vload(a); vr2 = vload(b); vr3 = vadd(vr1, vr2);
    ...
    vr1 = vec_load(a); vr2 = vec_load(a+4); vr3 = vperm(vr1,vr2,…);
    vr4 = vec_load(b);
    ...

7 PLDI 06 Example: An 8-point FFT Program

(0) Original program:
    t0[0:6:2] = x[0:3] + x[4:7];
    t0[1:7:2] = x[0:3] - x[4:7];
    t1[0:7] = T8[0:7] * t0[0:7];
    for (i = 0; i < 2; i++) {
        t2[0:2:2] = t1[i:i+2:2] + t1[i+4:i+6:2];
        t2[1:3:2] = t1[i:i+2:2] - t1[i+4:i+6:2];
        t3[0:3] = T4[0:3] * t2[0:3];
        y[i+0:i+2:2] = t3[0:1] + t3[2:3];
        y[i+4:i+6:2] = t3[0:1] - t3[2:3];
    }

(1) Normalized program, with every data reorganization written as an explicit Permute:
    v1[0:3] = x[0:3] + x[4:7];
    v1[4:7] = x[0:3] - x[4:7];
    t0[0:7] = Permute(v1[0:7], P1);
    t1[0:7] = T8[0:7] * t0[0:7];
    v2[0:7] = Permute(t1[0:7], P2);
    u1[0:7] = Permute(v2[0:7], P3);
    u2[0:3] = u1[0:3] + u1[4:7];
    u2[4:7] = u1[0:3] - u1[4:7];
    v3[0:7] = Permute(u2[0:7], P4);
    t2[0:7] = Permute(v3[0:7], P5);
    t3[0:7] = T4_2[0:7] * t2[0:7];
    u3[0:7] = Permute(t3[0:7], P6);
    u4[0:3] = u3[0:3] + u3[4:7];
    u4[4:7] = u3[0:3] - u3[4:7];
    v4[0:7] = Permute(u4[0:7], P7);
    y[0:7] = Permute(v4[0:7], P8);

(2) Optimized program:
    v1[0:3] = x[0:3] + x[4:7];
    v1[4:7] = x[0:3] - x[4:7];
    t1[0:7] = T8[0:7] * v1[0:7];
    u1[0:7] = Permute(t1[0:7], Q1);
    u2[0:3] = u1[0:3] + u1[4:7];
    u2[4:7] = u1[0:3] - u1[4:7];
    t3[0:7] = T4_2[0:7] * u2[0:7];
    u3[0:7] = Permute(t3[0:7], Q2);
    y[0:3] = u3[0:3] + u3[4:7];
    y[4:7] = u3[0:3] - u3[4:7];

(3) Generating native permutation instructions from Permute operations

8 PLDI 06 Overview of the Optimization Framework
Phases: Normalization, Optimization, Code Generation (same input and output code as on slide 6)
Normalization: use a generic Permute to represent
- Non-unit strides
- Misalignment
- Other reorganizations

9 PLDI 06 Data Permutations on Vectors
- Permute(X_n, P_n): X_n is a vector and P_n is a permutation matrix
- Use Permute to represent all data reorganizations explicitly

    b[0:3] = Permute(a[0:3], …);

[Figure: a[0:3] = (a0 a1 a2 a3) is permuted into b[0:3] = (a2 a1 a0 a3)]

Two stride-2 accesses on the right-hand side:
    ... = a[0:6:2] + a[1:7:2];
become one Permute followed by contiguous accesses:
    t[0:7] = Permute(a[0:7], …);
    ... = t[0:3] + t[4:7];
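A small C sketch may help fix the semantics of Permute. It assumes an index-vector encoding in place of the permutation matrix P_n (output element i takes input element `idx[i]`); that encoding, the function names, and the concrete pattern `{0, 2, 4, 6, 1, 3, 5, 7}` are illustrative choices, not notation from the slides.

```c
#include <stddef.h>

/* out[i] = in[idx[i]]; idx is assumed to be a permutation of 0..n-1.
   The index vector stands in for the permutation matrix P_n. */
void permute(float *out, const float *in, const int *idx, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = in[idx[i]];
}

/* Computes r[0:3] = a[0:6:2] + a[1:7:2] with one Permute that moves the
   even elements to the first half and the odd elements to the second,
   so that both operands of the addition become contiguous. */
void sum_even_odd(float r[4], const float a[8]) {
    static const int even_then_odd[8] = {0, 2, 4, 6, 1, 3, 5, 7};
    float t[8];
    permute(t, a, even_then_odd, 8);
    for (size_t i = 0; i < 4; i++) r[i] = t[i] + t[i + 4];
}
```

With this encoding, the first example on the slide corresponds to `idx = {2, 1, 0, 3}`, which turns (a0 a1 a2 a3) into (a2 a1 a0 a3).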

10 PLDI 06 Overview of the Optimization Framework
Phases: Normalization, Optimization, Code Generation (same input and output code as on slide 6)
Optimization: minimize Permute operations in a basic block
- Based on two rules of Permute
- An NP-complete problem
- Propagation-based algorithm

11 PLDI 06 Two Important Rules on Permutations
- Composition Rule: two nested permutations can be replaced by a single one
    Permute(Permute(a[0:3:1], …), …) = Permute(a[0:3:1], …)
  [Figure: (a0 a1 a2 a3) is permuted to (a1 a0 a3 a2), then to (a3 a0 a1 a2)]
- Distributive Rule: the same permutation applied to both operands can be applied once to the result
    Permute(a[0:3:1], …) + Permute(b[0:3:1], …) = Permute(a[0:3:1] + b[0:3:1], …)
  [Figure: both sides produce (a1+b1 a0+b0 a3+b3 a2+b2)]
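Both rules can be checked on small arrays. The sketch below again assumes an index-vector encoding of permutations (`out[i] = in[idx[i]]`), under which the composed pattern is `r[i] = p[q[i]]`; the function names are hypothetical and the code only illustrates the identities, not the authors' implementation.

```c
#include <stddef.h>

/* out[i] = in[idx[i]]; idx is assumed to be a permutation of 0..n-1. */
void permute(float *out, const float *in, const int *idx, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = in[idx[i]];
}

/* Composition rule: applying p and then q equals applying r once,
   where r[i] = p[q[i]]. */
void compose(int *r, const int *p, const int *q, size_t n) {
    for (size_t i = 0; i < n; i++) r[i] = p[q[i]];
}

/* Distributive rule, left side: Permute(a, p) + Permute(b, p),
   which needs two permutations. */
void add_permuted(float *out, const float *a, const float *b,
                  const int *p, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = a[p[i]] + b[p[i]];
}

/* Distributive rule, right side: Permute(a + b, p), which needs one.
   Assumes n <= 16 so the scratch buffer fits on the stack. */
void permute_sum(float *out, const float *a, const float *b,
                 const int *p, size_t n) {
    float s[16];
    for (size_t i = 0; i < n; i++) s[i] = a[i] + b[i];
    permute(out, s, p, n);
}
```

For the figure on this slide, `p = {1, 0, 3, 2}` followed by `q = {2, 1, 0, 3}` composes to `{3, 0, 1, 2}`, matching the final order (a3 a0 a1 a2).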

12 PLDI 06 Propagation-Based Optimization Algorithm
- Overview: propagating permutation to permutation
  - Step 1: Pick an unvisited permutation statement
  - Step 2: Propagate the permutation from the definition to the uses
  - Step 3: If a use is a permutation, go to (a); otherwise go to (b)
    a. Merge it with the propagated permutation pattern. Go to Step 1
    b. Propagate the permutation from the right-hand side to the left-hand side. Go to Step 2

The slide animates the algorithm on the FFT example, from the 16-statement normalized program to the 10-statement optimized program shown on slide 7. The intermediate frames change the following statements.

P1 is propagated through the multiplication, which now reads v1 directly and uses an adjusted table T8’; the following permutation becomes P2’:
    t0[0:7] = Permute(v1[0:7], P1);
    t1[0:7] = T8’[0:7] * v1[0:7];
    v2[0:7] = Permute(t1[0:7], P2’);

P2’ is merged with P3, so u1 is computed from t1 with a single permutation P3’:
    v2[0:7] = Permute(t1[0:7], P2’);
    u1[0:7] = Permute(t1[0:7], P3’);

The same steps are repeated for the second half of the program:
    t3[0:7] = T4_2’[0:7] * u2[0:7];
    u3[0:7] = Permute(t3[0:7], P6’);
    y[0:3] = u3[0:3] + u3[4:7];
    y[4:7] = u3[0:3] - u3[4:7];
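Step 3(b), applied to an element-wise multiplication by a constant table, can be sketched as follows. The slides show a table T8 turning into T8’ but do not say how it is derived; the sketch assumes T8’ is T8 pre-permuted by the inverse of the propagated pattern, with permutations encoded as index vectors (`out[i] = in[idx[i]]`). All function names are hypothetical.

```c
#include <stddef.h>

/* Builds tp such that tp[p[i]] = t[i], i.e. the constant table
   pre-permuted by the inverse of p (assumed derivation of the
   primed table). */
void prepermute_table(float *tp, const float *t, const int *p, size_t n) {
    for (size_t i = 0; i < n; i++) tp[p[i]] = t[i];
}

/* Before propagation: t0 = Permute(v, p); out = t * t0. */
void mul_after_permute(float *out, const float *t, const float *v,
                       const int *p, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = t[i] * v[p[i]];
}

/* After propagation: t1 = tp * v; out = Permute(t1, p).
   Assumes n <= 16 so the scratch buffer fits on the stack. */
void permute_after_mul(float *out, const float *tp, const float *v,
                       const int *p, size_t n) {
    float t1[16];
    for (size_t i = 0; i < n; i++) t1[i] = tp[i] * v[i];
    for (size_t i = 0; i < n; i++) out[i] = t1[p[i]];
}
```

Because the table is a compile-time constant, pre-permuting it costs nothing at run time, and the permutation that now sits after the multiplication becomes a candidate for merging with the next one under the composition rule.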

13 PLDI 06 Propagating Permutations to Partial Uses

    b[0:7] = Permute(a[0:7], …);
    c[0:3] = b[0:3] + b[4:7];

b[0:3] and b[4:7] are two partial uses of b[0:7].

A permutation that can be partitioned is split and propagated to each partial use:
    b[0:3] = Permute(a[0:3], …);
    b[4:7] = Permute(a[4:7], …);
    c[0:3] = b[0:3] + b[4:7];

Not all permutations can be partitioned and propagated to partial uses.
[The slide labels the permutation patterns P, Q, and R.]

Improvements over partial use boundary:
- Permutation decomposition
  - Register-wise decomposition
  - Shuffle instruction decomposition
- Permutation reshaping
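One simple way to decide whether a permutation can be split across partial uses is to check that it never moves data between blocks. The sketch below assumes the index-vector encoding (`out[i] = in[idx[i]]`) and a hypothetical helper name; it covers only this direct case, not the decomposition or reshaping improvements listed above.

```c
#include <stddef.h>

/* Returns 1 if idx maps every group of `block` consecutive elements
   onto itself, so the permutation splits into independent per-block
   permutations; returns 0 otherwise. */
int is_partitionable(const int *idx, size_t n, size_t block) {
    for (size_t i = 0; i < n; i++)
        if ((size_t)idx[i] / block != i / block) return 0;
    return 1;
}
```

For example, `{1, 0, 3, 2, 5, 4, 7, 6}` with `block = 4` is partitionable into two 4-element permutations, while `{0, 5, 2, 7, 4, 1, 6, 3}` is not, because elements cross the boundary between the two halves.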

14 PLDI 06 Optimization: Permutation Reshaping
- For permutations used in commutative operations

    b[0:7] = Permute(a[0:7], …);
    c[0:3] = b[0:3] + b[4:7];

[Figure, before reshaping: a[0:7] is permuted into the halves (a0 a5 a2 a7) and (a4 a1 a6 a3), giving c[0:3] = (a0+a4 a5+a1 a2+a6 a7+a3)]
[Figure, after reshaping: the halves stay (a0 a1 a2 a3) and (a4 a5 a6 a7), giving c[0:3] = (a0+a4 a1+a5 a2+a6 a3+a7)]
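The reshaping in the figure relies on the fact that swapping the two operands of a commutative operation in one lane does not change the result. The C sketch below uses a deliberately simple, hypothetical heuristic (put the smaller source index of each lane pair in the first half) to show the effect; the slides do not describe how the authors choose the reshaped pattern, so this is only an illustration under the index-vector encoding (`out[i] = in[idx[i]]`).

```c
#include <stddef.h>

/* c[i] = b[i] + b[i + 4], where b = Permute(a, idx) over 8 elements. */
void add_halves(float c[4], const float a[8], const int idx[8]) {
    for (size_t i = 0; i < 4; i++) c[i] = a[idx[i]] + a[idx[i + 4]];
}

/* Hypothetical reshaping heuristic: lanes i and i + 4 feed the same
   commutative addition, so their source indices may be exchanged.
   Here the smaller index always goes to the first half. */
void reshape_pairs(int out[8], const int idx[8]) {
    for (size_t i = 0; i < 4; i++) {
        int lo = idx[i], hi = idx[i + 4];
        out[i]     = lo < hi ? lo : hi;
        out[i + 4] = lo < hi ? hi : lo;
    }
}
```

Applied to the slide's pattern `{0, 5, 2, 7, 4, 1, 6, 3}`, this yields the identity `{0, 1, 2, 3, 4, 5, 6, 7}`, so the Permute disappears while `c[0:3]` keeps the same values.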

15 PLDI 06 Overview of the Optimization Framework
Phases: Normalization, Optimization, Code Generation (same input and output code as on slide 6)
Code Generation:
- "Strip-mine" Permute to vperm instructions
- Map vperm to native permutation instructions

16 PLDI 06 Generating Permutation Instructions (1)

    a[0:15] = Permute(b[0:15], …);

[Figure: b[0:15] occupies four registers (0 1 2 3) (4 5 6 7) (8 9 10 11) (12 13 14 15); a[0:15] needs (0 4 8 12) (1 5 9 13) (2 6 10 14) (3 7 11 15). Each target register is built by a chain of vperm instructions through partial results such as (0 4 * *) and (0 4 8 *), where * marks an empty slot.]

17 PLDI 06 Generating Permutation Instructions (2)

    a[0:15] = Permute(b[0:15], …);

Two steps:
- Maximize empty slots when generating vperm instructions
- Fill empty slots with data elements that go to the same target

[Figure: same source and target registers as in (1). First-level vperm results such as (0 4 * *) and (8 12 * *) have their empty slots filled, giving (0 4 1 5), (8 12 9 13), (2 6 3 7), and (10 14 11 15); second-level vperm instructions, for example with pattern <2,3,6,7>, combine pairs of these into the target registers.]
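The two-level scheme in the figure can be replayed with an emulated two-input permute. The sketch below is an illustration in plain C, not generated compiler output: `vec4` and `emu_vperm` are hypothetical stand-ins for a 4-lane register and a vperm instruction, and the first-level selectors `{0, 4, 1, 5}` and `{2, 6, 3, 7}` are inferred from the intermediate registers shown on the slide. Only `<2,3,6,7>` appears on the slide as a pattern.

```c
#include <stddef.h>

/* Hypothetical 4-lane vector register, emulated in plain C. */
typedef struct { float lane[4]; } vec4;

/* Emulates a two-input permute: sel[i] in 0..7 picks a lane from the
   concatenation of x (lanes 0..3) and y (lanes 4..7). */
vec4 emu_vperm(vec4 x, vec4 y, const int sel[4]) {
    vec4 r;
    for (size_t i = 0; i < 4; i++)
        r.lane[i] = sel[i] < 4 ? x.lane[sel[i]] : y.lane[sel[i] - 4];
    return r;
}

/* Transposes a 4x4 block held in four registers with eight permutes. */
void transpose4x4(vec4 a[4], const vec4 b[4]) {
    static const int lo_pairs[4]      = {0, 4, 1, 5};
    static const int hi_pairs[4]      = {2, 6, 3, 7};
    static const int first_halves[4]  = {0, 1, 4, 5};
    static const int second_halves[4] = {2, 3, 6, 7};

    /* Level 1: every slot is filled with data needed by some target. */
    vec4 t0 = emu_vperm(b[0], b[1], lo_pairs);   /* (0 4 1 5)     */
    vec4 t1 = emu_vperm(b[2], b[3], lo_pairs);   /* (8 12 9 13)   */
    vec4 t2 = emu_vperm(b[0], b[1], hi_pairs);   /* (2 6 3 7)     */
    vec4 t3 = emu_vperm(b[2], b[3], hi_pairs);   /* (10 14 11 15) */

    /* Level 2: each pair of intermediates yields two target registers. */
    a[0] = emu_vperm(t0, t1, first_halves);      /* (0 4 8 12)    */
    a[1] = emu_vperm(t0, t1, second_halves);     /* (1 5 9 13)    */
    a[2] = emu_vperm(t2, t3, first_halves);      /* (2 6 10 14)   */
    a[3] = emu_vperm(t2, t3, second_halves);     /* (3 7 11 15)   */
}
```

Filling the otherwise empty slots lets each intermediate register serve two targets, which is why this version needs fewer permutes than building every target register through its own chain.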

18 PLDI 06 Experiment Setups
- Two SIMD devices: VMX (AltiVec) & SSE2
- Tested applications
  - Group I: Applications with relatively simple permutation patterns
    - C-Saxpy: Complex version of saxpy (y = alpha*x + y)
    - R-Color, C-Dot, R-FIR, …
  - Group II: Applications with complicated permutation patterns
    - FFT: Fast Fourier transform programs generated by the SPIRAL system
    - WHT: Walsh-Hadamard transform routines generated by the SPIRAL system
    - Bitonic sorting: One of the fastest sorting networks
  - Group III: Reorganization-only applications
    - Matrix transpose
    - Bit-reversal reordering

                    VMX (AltiVec)         SSE2
Processor           1.8 GHz PowerPC G5    2.0 GHz Pentium 4
Main Memory         2048 MB               1024 MB
Operating System    Mac OS v10.3          Linux v2.4
Compiler            xlc v6.0              icc v9.0
Compiler Options    -O3 -qaltivec         -fast (-O3)

19 PLDI 06 Static Evaluation: # of Permutation Inst.

                     VMX                       SSE2
Program     Size     Base   Opt   Reduced      Base   Opt   Reduced
fft.4       16       96     24    75.0%        96     24    75.0%
fft.5       32       208    48    76.9%        208    48    76.9%
fft.6       64       352    96    72.7%        352    96    72.7%
wht.4       16       48     12    75.0%        48     12    75.0%
wht.5       32       96     24    75.0%        96     24    75.0%
wht.6       64       192    48    75.0%        192    48    75.0%
bitonic.4   16       52     34    34.6%        56     34    39.3%
bitonic.5   32       136    92    32.3%        144    92    36.1%
bitonic.6   64       336    232   31.0%        352    232   34.1%

20 PLDI 06 Run-time Performance of FFT & Bitonic Sorting

21 PLDI 06 Overall Speedups

22 PLDI 06 Related Work
- Optimizing permutation instructions introduced by misalignment
  - A. Eichenberger, P. Wu, K. O'Brien, Vectorization for SIMD architectures with alignment constraints, PLDI ’04
  - P. Wu, A. Eichenberger, A. Wang, Efficient SIMD Code Generation for Runtime Alignment and Length Conversion, CGO ’05
- Efficient permutation instruction generation
  - A. Kudriavtsev, P. Kogge, Generation of permutations for SIMD processors, LCTES ’05
  - M. Narayanan, K. Yelick, Generating permutation instructions from a high-level description, MSP ’04
  - D. Nuzman, I. Rosen, A. Zaks, Auto-vectorization of interleaved data for SIMD, PLDI ’06
- Similar idea, different applications
  - A. Solar-Lezama, R. Rabbah, R. Bodik, K. Ebcioglu, Programming by sketching for bit-streaming programs, PLDI ’05
  - S. Chatterjee, J. Gilbert, R. Schreiber, S. Teng, Automatic array alignment in data-parallel programs, POPL ’93
  - G. Hwang, J. K. Lee, D. Ju, An array operation synthesis scheme to optimize FORTRAN 90 programs, PPOPP ’95

23 PLDI 06 Conclusion
- Reducing the overhead introduced by permutation instructions is a performance-critical problem for SIMD compilation
- A unified framework is proposed to optimize data permutations
  - Putting all forms of data permutations into a unified representation
  - Propagating permutations across statements and merging them together
  - Generating efficient permutation instructions natively supported by devices
- Experiments were conducted on different applications
  - Up to 77% of permutation instructions are eliminated
  - Average performance improves by 48% on VMX and 68% on SSE2
  - Near-peak overall speedups are achieved on some applications

Thank You! June 2006
