Presentation is loading. Please wait.

Presentation is loading. Please wait.

Compiler techniques for exposing ILP

Similar presentations


Presentation on theme: "Compiler techniques for exposing ILP"— Presentation transcript:

1 Compiler techniques for exposing ILP

2 Instruction Level Parallelism
Potential overlap among instructions Few possibilities in a basic block Blocks are small (6-7 instructions) Instructions are dependent Goal: Exploit ILP across multiple basic blocks Iterations of a loop for (i = 1000; i > 0; i=i-1) x[i] = x[i] + s;

3 Basic Scheduling for (i = 1000; i > 0; i=i-1) x[i] = x[i] + s;
Sequential MIPS Assembly Code Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 SUBI R1, R1, #8 BNEZ R1, Loop for (i = 1000; i > 0; i=i-1) x[i] = x[i] + s; Pipelined execution: Loop: LD F0, 0(R1) stall ADDD F4, F0, F stall stall SD 0(R1), F SUBI R1, R1, # stall BNEZ R1, Loop stall Scheduled pipelined execution: Loop: LD F0, 0(R1) SUBI R1, R1, # ADDD F4, F0, F stall BNEZ R1, Loop SD 8(R1), F

4 Loop Unrolling Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4
SUBI R1, R1, #8 BEQZ R1, Exit LD F6, 0(R1) ADDD F8, F6, F2 SD 0(R1), F8 LD F10, 0(R1) ADDD F12, F10, F2 SD 0(R1), F12 LD F14, 0(R1) ADDD F16, F14, F2 SD 0(R1), F16 BNEZ R1, Loop Exit: Pros: Larger basic block More scope for scheduling and eliminating dependencies Cons: Increases code size Comment: Often a precursor step for other optimizations

5 Loop Transformations Instruction independency is the key requirement for the transformations Example Determine that is legal to move SD after SUBI and BNEZ Determine that unrolling is useful (iterations are independent) Use different registers to avoid unnecessary constrains Eliminate extra tests and branches Determine that LD and SD can be interchanged Schedule the code, preserving the semantics of the code

6 1. Eliminating Name Dependences
Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 LD F0, -8(R1) SD -8(R1), F4 LD F0, -16(R1) SD -16(R1), F4 LD F0, -24(R1) SD -24(R1), F4 SUBI R1, R1, #32 BNEZ R1, Loop Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 LD F6, -8(R1) ADDD F8, F6, F2 SD -8(R1), F8 LD F10, -16(R1) ADDD F12, F10, F2 SD -16(R1), F12 LD F14, -24(R1) ADDD F16, F14, F2 SD -24(R1), F16 SUBI R1, R1, #32 BNEZ R1, Loop Register Renaming

7 2. Eliminating Control Dependences
Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 SUBI R1, R1, #8 BEQZ R1, Exit LD F6, 0(R1) ADDD F8, F6, F2 SD 0(R1), F8 LD F10, 0(R1) ADDD F12, F10, F2 SD 0(R1), F12 LD F14, 0(R1) ADDD F16, F14, F2 SD 0(R1), F16 BNEZ R1, Loop Exit: Intermediate BEQZ are never taken Eliminate!

8 3. Eliminating Data Dependences
Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 SUBI R1, R1, #8 LD F6, 0(R1) ADDD F8, F6, F2 SD 0(R1), F8 LD F10, 0(R1) ADDD F12, F10, F2 SD 0(R1), F12 LD F14, 0(R1) ADDD F16, F14, F2 SD 0(R1), F16 BNEZ R1, Loop Data dependencies SUBI, LD, SD Force sequential execution of iterations Compiler removes this dependency by: Computing intermediate R1 values Eliminating intermediate SUBI Changing final SUBI Data flow analysis Can do on Registers Cannot do easily on memory locations 100(R1) = 20(R2)

9 4. Alleviating Data Dependencies
Unrolled loop: Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 LD F6, -8(R1) ADDD F8, F6, F2 SD -8(R1), F8 LD F10, -16(R1) ADDD F12, F10, F2 SD -16(R1), F12 LD F14, -24(R1) ADDD F16, F14, F2 SD -24(R1), F16 SUBI R1, R1, #32 BNEZ R1, Loop Scheduled Unrolled loop: Loop: LD F0, 0(R1) LD F6, -8(R1) LD F10, -16(R1) LD F14, -24(R1) ADDD F4, F0, F2 ADDD F8, F6, F2 ADDD F12, F10, F2 ADDD F16, F14, F2 SD 0(R1), F4 SD -8(R1), F8 SUBI R1, R1, #32 SD 16(R1), F12 BNEZ R1, Loop SD 8(R1), F16

10 Some General Comments Dependences are a property of programs
Actual hazards are a property of the pipeline Techniques to avoid dependence limitations Maintain dependences but avoid hazards Code scheduling hardware software Eliminate dependences by code transformations Complex Compiler-based

11 Loop-level Parallelism
Primary focus of dependence analysis Determine all dependences and find cycles for (i=1; i<=100; i=i+1) { x[i] = y[i] + z[i]; w[i] = x[i] + v[i]; } for (i=1; i<=100; i=i+1) { x[i+1] = x[i] + z[i]; } x[1] = x[1] + y[1]; for (i=1; i<=99; i=i+1) { y[i+1] = w[i] + z[i]; x[i+1] = x[i +1] + y[i +1]; } y[101] = w[100] + z[100]; for (i=1; i<=100; i=i+1) { x[i] = x[i] + y[i]; y[i+1] = w[i] + z[i]; }

12 Dependence Analysis Algorithms
Assume array indexes are affine (ai + b) GCD test: For two affine array indexes ai+b and ci+d: if a loop-carried dependence exists, then GCD (c,a) must divide (d-b) x[8*i ] = x[4*i + 2] +3 (2-0)/GCD(8,4) General graph cycle determination is NP a, b, c, and d may not be known at compile time

13 Software Pipelining Start-up Finish-up Software pipelined iteration
Iteration Iteration Iteration Iteration 3 Software pipelined iteration

14 Example Iteration i Iteration i+1 Iteration i+2 LD F0, 0(R1)
ADDD F4, F0, F2 SD 0(R1), F4 LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 Loop: LD F0, 0(R1) ADDD F4, F0, F2 SD 0(R1), F4 SUBI R1, R1, #8 BNEZ R1, Loop Loop: SD 16(R1), F4 ADDD F4, F0, F2 LD F0, 0(R1) SUBI R1, R1, #8 BNEZ R1, Loop

15 Trace (global-code) Scheduling
Find ILP across conditional branches Two-step process Trace selection Find a trace (sequence of basic blocks) Use loop unrolling to generate long traces Use static branch prediction for other conditional branches Trace compaction Squeeze the trace into a small number of wide instructions Preserve data and control dependences

16 Trace Selection LW R4, 0(R1) LW R5, 0(R2) ADD R4, R4, R5 SW 0(R1), R4
A[I] = A[I] + B[I] LW R4, 0(R1) LW R5, 0(R2) ADD R4, R4, R5 SW 0(R1), R4 BNEZ R4, else SW 0(R2), . . . J join Else: X Join: SW 0(R3), . . . T F A[I] = 0? X B[I] = C[I] =

17 Summary of Compiler Techniques
Try to avoid dependence stalls Loop unrolling Reduce loop overhead Software pipelining Reduce single body dependence stalls Trace scheduling Reduce impact of other branches Compilers use a mix of three All techniques depend on prediction accuracy

18 Food for thought: Analyze this
Analyze this for different values of X and Y To evaluate different branch prediction schemes For compiler scheduling purposes add r1, r0, 1000 #  all numbers in decimal add r2, r0, a # Base address of array a loop: andi r10, r1, X beqz r10, even lw r11, 0(r2) addi r11, r11, 1 sw 0(r2), r11 even: addi r2, r2, 4 subi r1, r1, Y bnez r1, loop


Download ppt "Compiler techniques for exposing ILP"

Similar presentations


Ads by Google