Presentation is loading. Please wait.

Presentation is loading. Please wait.

The Power of Belady ’ s Algorithm in Register Allocation for Long Basic Blocks Jia Guo, María Jesús Garzarán and David Padua jiaguo, garzaran,

Similar presentations


Presentation on theme: "The Power of Belady ’ s Algorithm in Register Allocation for Long Basic Blocks Jia Guo, María Jesús Garzarán and David Padua jiaguo, garzaran,"— Presentation transcript:

1 The Power of Belady ’ s Algorithm in Register Allocation for Long Basic Blocks Jia Guo, María Jesús Garzarán and David Padua jiaguo, garzaran, padua @ uiuc.edu University of Illinois at Urbana Champaign

2 LCPC20032 Motivation Long basic blocks  Compiler optimizations  Library generators  SPIRAL, ATLAS Speedup obtained from unrolling FFTs The effectiveness of register allocation on long basic blocks

3 LCPC20033 Contributions Apply Belady ’ s MIN algorithm to long basic blocks Compared with MIPSPro, Belady ’ s MIN algorithm performs  10% faster on Matrix Multiplication code  12% faster on FFT code of size 32  33% faster on FFT code of size 64

4 LCPC20034 Outline Belady ’ s MIN algorithm On FFT code On Matrix Multiplication code Conclusions

5 LCPC20035 Belady ’ s MIN Algorithm Chooses the farthest next use Guarantees the minimum number of reloads but not the stores. 1 c = a + b 2e = a + d 3f = a - b 4h = d + c 3 FP registers load R1, a load R2, b add R3, R1, R2 store R3, c load R3, d … …… … RegR1R2R3 Varabc Next use234 Statusclean dirty spill d 4 clean

6 LCPC20036 Belady’s algorithm A Simple Compiler Long Basic Blocks Parsing Register Allocation MIPS assembly code Target code generation SPIRAL after optimizations ATLAS after optimizations

7 LCPC20037 Outline Belady ’ s MIN algorithm On FFT code On Matrix Multiplication code Conclusions

8 LCPC20038 SPIRAL and FFT Code Make use of formula transformations and intelligent search to generate optimized DSP libraries Search for the best degree of unrolling  Small size (2 to 64)  Straight line code  FFT64: 1400+ stmts  large size (128+)  use small size results as components. FrFr FsFs

9 LCPC20039 Interesting Patterns in FFT Codes … y32 = y2 - y18 y33 = y3 - y19 y34 = y2 + y18 y35 = y3 + y19 y36 = y10 - y26 y37 = y11 - y27 y38 = y10 + y26 y39 = y11 + y27 y40 = y34 - y38 y41 = y35 - y39 y42 = y34 + y38 y43 = y35 + y39 y44 = y32 - y37 y45 = y33 + y36 … One define Two uses Close uses Simplify: One-define, one-use program It can be proved that Belady ’ s MIN algorithm generates the minimum number of reloads and stores!

10 LCPC200310 Performance Evaluation: FFT Speedup: FFT 32: 12% FFT 64: 33% no spill spills Performance of the best formula for FFT 4-64 Performance of all the formulas for FFT 64

11 LCPC200311 Outline Belady ’ s MIN algorithm On FFT code On Matrix Multiplication code Conclusions

12 LCPC200312 ATLAS and Matrix Multiplication An empirical optimizer searching for best parameters (degree of unrolling, etc.) Study: innermost loop body  When KU=64 NU:MU=2:2 to 8:8 300-4000 LOC A MU KU B NU KU Tile Size C for (int j =0; j<TileSize; j+=NU) for (int i=0; i<TileSize; i+=MU) load C 1..8 into registers for (int k=0; k<TileSize; k+=KU) load A k, 1..4 into registers load B 1..2, k into registers C 1 += A k, 1 * B 1, k C 2 += A k, 1 * B 2, k … C 8 += A k, 4 * B 2,k Repeat * for k+1, k+KU-1 store C back to memory *

13 LCPC200313 Performance Evaluation for MM no spill spills Spills for NU:MU = 4:8 MIPSPro: 438 MIN: 892

14 LCPC200314 Explanation Keep spilling c elements  More stores Long dependency chain 1 c 0 += a0 * b0 2 c 1 += a1 * b0 3 c 2 += a2 * b0 4 c 3 += a3 * b0 5 c 64 += a0 * b64 6 c 65 += a1 * b64 7 c 66 += a2 * b64 8 c 67 += a3 * b64 9 c 0 += a64* b1 10 c 1 += a65* b1 11 c 2 += a66* b1 12 c 3 += a67* b1 13 c 64 += a64* b65 14 c 65 += a65* b65 15 c 66 += a66 * b65 16 c 67 += a67 * b65 Original load R0, a0 load R1, b0 load R2, c0 1 madd R2, R2, R0, R1 load R3, a1 load R4, c1 2 madd R4, R4, R3, R1 load R5, a2 store R4, c1 load R4, c2 3 madd R4, R4, R5, R1 store R4, c2 load R4, a3 store R2, c0 load R2, c3 4 madd R2, R2, R4, R1 … …… …

15 LCPC200315 Solution Spill a, b, c elements  Less stores Break dependency chain Use the instruction scheduling from MIPSPro c 0 += a0 * b0 c 0 += a64* b1 c 1 += a65* b1 c 1 += a1 * b0 c 2 += a2 * b0 c 3 += a3 * b0 c 2 += a66* b1 c 3 += a67* b1 c 66 += a2 * b64 c 64 += a0 * b64 c 66 += a67 * b65 c 64 += a64* b65 c 65 += a1 * b64 c 67 += a3 * b64 c 67 += a68 * b65 c 65 += a65* b65 Scheduled by MIPSPro

16 LCPC200316 Performance Evaluation for MM 10% better when mu:nu are larger than 4:6 no spill spills Spills for NU:MU = 4:8 MINSched: 356 MIPSPro: 438 MIN: 892

17 LCPC200317 Outline Belady ’ s MIN algorithm On FFT code On Matrix Multiplication code Conclusions

18 LCPC200318 Conclusions Belady ’ s MIN algorithm  Generates minimum number of reloads and stores for one-define and one-use problem  Performs better than the state of the art compilers like MIPSPro and GCC  Speedup: 12% (FFT32), 33%(FFT64), 10%(MM) Further benchmarks to be tested

19

20 LCPC200320 Code Size after Register Allocation

21 LCPC200321 Speedup from Fully Unrolling


Download ppt "The Power of Belady ’ s Algorithm in Register Allocation for Long Basic Blocks Jia Guo, María Jesús Garzarán and David Padua jiaguo, garzaran,"

Similar presentations


Ads by Google