Published by Lee Gibbard. Modified about 1 year ago.

1
Compiler techniques for exposing ILP

2
Instruction Level Parallelism
Potential overlap among instructions
Few possibilities in a basic block
– Blocks are small (6-7 instructions)
– Instructions are dependent
Goal: Exploit ILP across multiple basic blocks
– Iterations of a loop
    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;

3
Basic Scheduling
    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;
Sequential MIPS Assembly Code
    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
Pipelined execution (right column is the cycle number):
    Loop: LD    F0, 0(R1)       1
          stall                 2
          ADDD  F4, F0, F2      3
          stall                 4
          stall                 5
          SD    0(R1), F4       6
          SUBI  R1, R1, #8      7
          stall                 8
          BNEZ  R1, Loop        9
          stall                10
Scheduled pipelined execution:
    Loop: LD    F0, 0(R1)       1
          SUBI  R1, R1, #8      2
          ADDD  F4, F0, F2      3
          stall                 4
          BNEZ  R1, Loop        5
          SD    8(R1), F4       6
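
The cycle counts above can be reproduced with a small model. The following is a minimal sketch in C, not the deck's own method: it assumes a single-issue, in-order pipeline with one branch delay slot, and the stall values in `stalls_between` are assumptions chosen to match the stall pattern on the slide. The `struct instr` encoding, the integer register ids, and the function names are all hypothetical.

```c
#include <stddef.h>

#define MAX_INSTRS 64

/* Instruction classes for a toy in-order, single-issue pipeline model. */
enum op_class { OP_LOAD, OP_FP_ALU, OP_STORE, OP_INT_ALU, OP_BRANCH };

/* One instruction: its class, the register it writes (-1 if none), and up
 * to two registers it reads (-1 for an unused slot). Register ids are
 * arbitrary integers (hypothetical encoding). */
struct instr {
    enum op_class cls;
    int dest;
    int src[2];
};

/* ASSUMED stall cycles between a producer and a dependent consumer.
 * The slide does not list latencies; these reproduce its stall pattern. */
static int stalls_between(enum op_class producer, enum op_class consumer)
{
    if (producer == OP_FP_ALU && consumer == OP_STORE)   return 2;
    if (producer == OP_FP_ALU)                           return 3;
    if (producer == OP_LOAD && consumer == OP_STORE)     return 0;
    if (producer == OP_LOAD)                             return 1;
    if (producer == OP_INT_ALU && consumer == OP_BRANCH) return 1;
    return 0;
}

/* Returns the number of cycles one loop iteration occupies, or -1 if the
 * sequence is empty or too long. Only the total is modeled: a stall that
 * real hardware would place before a branch may be counted after it. */
int cycles_per_iteration(const struct instr *code, size_t n)
{
    int issue[MAX_INSTRS];

    if (n == 0 || n > MAX_INSTRS)
        return -1;

    for (size_t i = 0; i < n; i++) {
        int earliest = (i == 0) ? 1 : issue[i - 1] + 1;

        for (int s = 0; s < 2; s++) {
            int reg = code[i].src[s];
            if (reg < 0)
                continue;
            /* Find the most recent earlier writer of this register. */
            for (size_t j = i; j-- > 0; ) {
                if (code[j].dest == reg) {
                    int ready = issue[j] + 1
                              + stalls_between(code[j].cls, code[i].cls);
                    if (ready > earliest)
                        earliest = ready;
                    break;
                }
            }
        }
        issue[i] = earliest;
    }

    /* A branch at the end leaves its delay slot empty: one more cycle. */
    return issue[n - 1] + (code[n - 1].cls == OP_BRANCH ? 1 : 0);
}
```

Encoding the unscheduled loop (LD, ADDD, SD, SUBI, BNEZ) should give 10 cycles under these assumptions, and the scheduled order (LD, SUBI, ADDD, BNEZ, SD) should give 6.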

4
Loop Unrolling
    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          BEQZ  R1, Exit
          LD    F6, 0(R1)
          ADDD  F8, F6, F2
          SD    0(R1), F8
          SUBI  R1, R1, #8
          BEQZ  R1, Exit
          LD    F10, 0(R1)
          ADDD  F12, F10, F2
          SD    0(R1), F12
          SUBI  R1, R1, #8
          BEQZ  R1, Exit
          LD    F14, 0(R1)
          ADDD  F16, F14, F2
          SD    0(R1), F16
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
    Exit:
Pros:
– Larger basic block
– More scope for scheduling and eliminating dependencies
Cons:
– Increases code size
Comment: Often a precursor step for other optimizations
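
At the source level, the same idea can be sketched in C. This is an illustrative sketch only: the function names are hypothetical, the loop counts up over a 0-based array rather than down from 1000 as on the slide, and a cleanup loop is added so that the trip count need not be a multiple of 4.

```c
#include <stddef.h>

/* Rolled version: one element per iteration, one branch per element. */
void add_scalar(double *x, size_t n, double s)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Unrolled by 4: four independent element updates per loop branch.
 * The trailing loop handles the 0-3 elements left over when n is not
 * a multiple of 4. */
void add_scalar_unrolled4(double *x, size_t n, double s)
{
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        x[i]     = x[i]     + s;
        x[i + 1] = x[i + 1] + s;
        x[i + 2] = x[i + 2] + s;
        x[i + 3] = x[i + 3] + s;
    }
    for (; i < n; i++)
        x[i] = x[i] + s;
}
```

Both functions compute the same result; the unrolled body simply gives the scheduler a larger basic block to work with, at the cost of more code.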

5
Loop Transformations
Instruction independence is the key requirement for the transformations
Example
– Determine that it is legal to move SD after SUBI and BNEZ
– Determine that unrolling is useful (iterations are independent)
– Use different registers to avoid unnecessary constraints
– Eliminate extra tests and branches
– Determine that LD and SD can be interchanged
– Schedule the code, preserving the semantics of the code

6
1. Eliminating Name Dependences
    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          LD    F0, -8(R1)
          ADDD  F4, F0, F2
          SD    -8(R1), F4
          LD    F0, -16(R1)
          ADDD  F4, F0, F2
          SD    -16(R1), F4
          LD    F0, -24(R1)
          ADDD  F4, F0, F2
          SD    -24(R1), F4
          SUBI  R1, R1, #32
          BNEZ  R1, Loop
After register renaming:
    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          LD    F6, -8(R1)
          ADDD  F8, F6, F2
          SD    -8(R1), F8
          LD    F10, -16(R1)
          ADDD  F12, F10, F2
          SD    -16(R1), F12
          LD    F14, -24(R1)
          ADDD  F16, F14, F2
          SD    -24(R1), F16
          SUBI  R1, R1, #32
          BNEZ  R1, Loop

7
2. Eliminating Control Dependences
    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          BEQZ  R1, Exit
          LD    F6, 0(R1)
          ADDD  F8, F6, F2
          SD    0(R1), F8
          SUBI  R1, R1, #8
          BEQZ  R1, Exit
          LD    F10, 0(R1)
          ADDD  F12, F10, F2
          SD    0(R1), F12
          SUBI  R1, R1, #8
          BEQZ  R1, Exit
          LD    F14, 0(R1)
          ADDD  F16, F14, F2
          SD    0(R1), F16
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
    Exit:
The intermediate BEQZ branches are never taken (the trip count, 1000, is a multiple of 4): eliminate them!

8
3. Eliminating Data Dependences
    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          LD    F6, 0(R1)
          ADDD  F8, F6, F2
          SD    0(R1), F8
          SUBI  R1, R1, #8
          LD    F10, 0(R1)
          ADDD  F12, F10, F2
          SD    0(R1), F12
          SUBI  R1, R1, #8
          LD    F14, 0(R1)
          ADDD  F16, F14, F2
          SD    0(R1), F16
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
Data dependences among SUBI, LD, and SD force sequential execution of the iterations
Compiler removes this dependence by:
– Computing intermediate R1 values
– Eliminating intermediate SUBI
– Changing final SUBI
Data flow analysis
– Can do on registers
– Cannot do easily on memory locations: is 100(R1) = 20(R2)?
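
Why memory locations are harder than registers can be seen in a small C sketch. It is illustrative only and the function names are hypothetical: the two functions differ solely in whether the load is moved above the store, which is safe only if the compiler can prove that the two pointers never refer to the same location.

```c
/* Original order: store through p, then load through q. */
double store_then_load(double *p, const double *q)
{
    *p = 1.0;
    return *q;
}

/* Reordered: the load through q is hoisted above the store through p.
 * This matches store_then_load only when p and q do not alias. */
double load_then_store(double *p, const double *q)
{
    double v = *q;
    *p = 1.0;
    return v;
}
```

With distinct variables both functions return the old value of `*q`. With `p == q` the first returns 1.0 and the second returns the old value, so the reordering changes the result.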

9
4. Alleviating Data Dependencies
Unrolled loop:
    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          LD    F6, -8(R1)
          ADDD  F8, F6, F2
          SD    -8(R1), F8
          LD    F10, -16(R1)
          ADDD  F12, F10, F2
          SD    -16(R1), F12
          LD    F14, -24(R1)
          ADDD  F16, F14, F2
          SD    -24(R1), F16
          SUBI  R1, R1, #32
          BNEZ  R1, Loop
Scheduled unrolled loop:
    Loop: LD    F0, 0(R1)
          LD    F6, -8(R1)
          LD    F10, -16(R1)
          LD    F14, -24(R1)
          ADDD  F4, F0, F2
          ADDD  F8, F6, F2
          ADDD  F12, F10, F2
          ADDD  F16, F14, F2
          SD    0(R1), F4
          SD    -8(R1), F8
          SUBI  R1, R1, #32
          SD    16(R1), F12
          BNEZ  R1, Loop
          SD    8(R1), F16

10
Some General Comments
Dependences are a property of programs
Actual hazards are a property of the pipeline
Techniques to avoid dependence limitations
– Maintain dependences but avoid hazards
    Code scheduling
    – hardware
    – software
– Eliminate dependences by code transformations
    Complex
    Compiler-based

11
Loop-level Parallelism
Primary focus of dependence analysis
Determine all dependences and find cycles
    for (i=1; i<=100; i=i+1) {
        x[i+1] = x[i] + z[i];
    }

    for (i=1; i<=100; i=i+1) {
        x[i] = y[i] + z[i];
        w[i] = x[i] + v[i];
    }

    for (i=1; i<=100; i=i+1) {
        x[i] = x[i] + y[i];
        y[i+1] = w[i] + z[i];
    }

    x[1] = x[1] + y[1];
    for (i=1; i<=99; i=i+1) {
        y[i+1] = w[i] + z[i];
        x[i+1] = x[i+1] + y[i+1];
    }
    y[101] = w[100] + z[100];
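
The last two loops on this slide compute the same values. The third has a loop-carried dependence on y but no cycle, and the fourth restructures it so that the dependence falls within a single iteration. A minimal C sketch to check the equivalence is shown here; the function names and the `TRIP_COUNT` macro are hypothetical, and the arrays are assumed to hold at least `TRIP_COUNT + 2` elements with index 0 unused.

```c
#define TRIP_COUNT 100  /* loop bound used on the slide */

/* Third loop on the slide: y[i+1] written in iteration i is read as
 * y[i] in iteration i+1 (loop-carried, but not part of a cycle). */
void loop_original(double *x, double *y, const double *w, const double *z)
{
    for (int i = 1; i <= TRIP_COUNT; i = i + 1) {
        x[i] = x[i] + y[i];
        y[i + 1] = w[i] + z[i];
    }
}

/* Fourth loop on the slide: the first x update and the last y update are
 * peeled off, so each iteration now reads only the y value it just wrote. */
void loop_transformed(double *x, double *y, const double *w, const double *z)
{
    x[1] = x[1] + y[1];
    for (int i = 1; i <= TRIP_COUNT - 1; i = i + 1) {
        y[i + 1] = w[i] + z[i];
        x[i + 1] = x[i + 1] + y[i + 1];
    }
    y[TRIP_COUNT + 1] = w[TRIP_COUNT] + z[TRIP_COUNT];
}
```

Running both functions on identical copies of the input arrays should leave x and y element-for-element equal, since each element is computed from the same operands in the same order.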

12
Dependence Analysis Algorithms
Assume array indexes are affine (a*i + b)
– GCD test: for two affine array indexes a*i + b and c*i + d, if a loop-carried dependence exists, then GCD(c, a) must divide (d - b)
    x[8*i] = x[4*i + 2] + 3
    (2 - 0) / GCD(8, 4) = 2/4 is not an integer, so no dependence exists
Determining whether a dependence exists is, in general, NP-complete
a, b, c, and d may not be known at compile time
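
The GCD test can be sketched in a few lines of C. The function names are hypothetical, and the test is conservative: a result of 0 proves that no dependence exists, while a result of 1 only means that one may exist, since loop bounds are ignored.

```c
#include <stdlib.h>

/* Greatest common divisor of |m| and |n| (Euclid's algorithm). */
static int gcd(int m, int n)
{
    m = abs(m);
    n = abs(n);
    while (n != 0) {
        int t = m % n;
        m = n;
        n = t;
    }
    return m;
}

/* GCD test for a write to x[a*i + b] and a read of x[c*i + d].
 * Returns 0 if a loop-carried dependence is impossible, 1 if it cannot
 * be ruled out. */
int gcd_test_may_depend(int a, int b, int c, int d)
{
    int g = gcd(c, a);

    if (g == 0)               /* both indexes are constants */
        return d == b;
    return (d - b) % g == 0;  /* GCD(c, a) must divide (d - b) */
}
```

For the slide's example, `gcd_test_may_depend(8, 0, 4, 2)` returns 0 because GCD(4, 8) = 4 does not divide 2.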

13
Software Pipelining
[Figure: Iteration 0, Iteration 1, Iteration 2, and Iteration 3 overlapped in time, with the start-up code, the software-pipelined iteration, and the finish-up code marked]

14
Example
Original loop:
    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
Unrolled iterations:
    Iteration i:
          LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
    Iteration i+1:
          LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
    Iteration i+2:
          LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
Software-pipelined loop (SD from iteration i, ADDD from iteration i+1, LD from iteration i+2):
    Loop: SD    16(R1), F4
          ADDD  F4, F0, F2
          LD    F0, 0(R1)
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
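
The same restructuring can be written at the source level. This is a hedged sketch in C rather than the deck's code: the function name is hypothetical, the loop counts up over a 0-based array instead of down through R1, and the scalars `loaded` and `summed` stand in for F0 and F4. Each kernel iteration stores the result of iteration i, adds for iteration i+1, and loads for iteration i+2, with explicit start-up and finish-up code around it.

```c
#include <stddef.h>

void add_scalar_software_pipelined(double *x, size_t n, double s)
{
    double loaded;   /* value loaded for a later iteration (role of F0) */
    double summed;   /* sum waiting to be stored (role of F4) */

    /* Too short to fill the pipeline: fall back to the plain loop. */
    if (n < 3) {
        for (size_t i = 0; i < n; i++)
            x[i] = x[i] + s;
        return;
    }

    /* Start-up: load and add for iteration 0, load for iteration 1. */
    summed = x[0] + s;
    loaded = x[1];

    /* Kernel: one store, one add, one load, each from a different
     * original iteration. */
    for (size_t i = 0; i + 2 < n; i++) {
        x[i] = summed;          /* store for iteration i   */
        summed = loaded + s;    /* add for iteration i+1   */
        loaded = x[i + 2];      /* load for iteration i+2  */
    }

    /* Finish-up: drain the two iterations still in flight. */
    x[n - 2] = summed;
    x[n - 1] = loaded + s;
}
```

Within the kernel, no statement depends on a value produced in the same pass, which is what lets the corresponding assembly loop run without dependence stalls.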

15
Trace (global-code) Scheduling
Find ILP across conditional branches
Two-step process
– Trace selection
    Find a trace (sequence of basic blocks)
    Use loop unrolling to generate long traces
    Use static branch prediction for other conditional branches
– Trace compaction
    Squeeze the trace into a small number of wide instructions
    Preserve data and control dependences

16
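Trace Selection
[Flowchart: A[I] = A[I] + B[I], then the test A[I] = 0?; the T path executes B[I] = ..., the F path executes X; both paths join at C[I] = ...]
          LW    R4, 0(R1)
          LW    R5, 0(R2)
          ADD   R4, R4, R5
          SW    0(R1), R4
          BNEZ  R4, Else
          ....
          SW    0(R2), ...
          J     Join
    Else: ....
          X
    Join: ....
          SW    0(R3), ...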

17
Summary of Compiler Techniques
Try to avoid dependence stalls
Loop unrolling
– Reduce loop overhead
Software pipelining
– Reduce single body dependence stalls
Trace scheduling
– Reduce impact of other branches
Compilers use a mix of all three
All techniques depend on prediction accuracy

18
Food for thought: Analyze this
Analyze this for different values of X and Y
– To evaluate different branch prediction schemes
– For compiler scheduling purposes
          add   r1, r0, 1000    # all numbers in decimal
          add   r2, r0, a       # Base address of array a
    loop: andi  r10, r1, X
          beqz  r10, even
          lw    r11, 0(r2)
          addi  r11, r11, 1
          sw    0(r2), r11
    even: addi  r2, r2, 4
          subi  r1, r1, Y
          bnez  r1, loop
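
One way to start on this exercise is to tabulate how often the `beqz` branch is taken for a given X and Y. The sketch below is in C and models only the values of r1 and r10; the struct, the function name, and the `max_iters` safety cap are hypothetical additions. The cap exists because r1 never reaches exactly zero when Y does not divide 1000.

```c
/* Branch statistics for one run of the loop. */
struct branch_stats {
    long iterations;   /* times the loop body executed              */
    long beqz_taken;   /* times "beqz r10, even" was taken           */
};

/* Simulates r1 and r10 only. Stops when r1 reaches 0, as bnez would,
 * or after max_iters iterations (assumed safety cap, not in the
 * original code). */
struct branch_stats simulate_loop(int x_mask, int y_step, long max_iters)
{
    struct branch_stats st = {0, 0};
    long long r1 = 1000;

    do {
        long long r10 = r1 & x_mask;   /* andi r10, r1, X      */
        if (r10 == 0)                  /* beqz r10, even       */
            st.beqz_taken++;
        r1 -= y_step;                  /* subi r1, r1, Y       */
        st.iterations++;
    } while (r1 != 0 && st.iterations < max_iters);

    return st;
}
```

With X = 1 and Y = 1, r1 runs from 1000 down to 1 and the branch alternates between taken and not taken, so it is taken in 500 of 1000 iterations. With X = 1 and Y = 2, r1 stays even and the branch is taken in all 500 iterations.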
