Eliminating Stalls Using Compiler Support


1 Eliminating Stalls Using Compiler Support

2 Instruction Level Parallelism
gcc: 17% control transfer
–5 instructions + 1 branch
–Reordering among 5 instructions may not uncover enough instruction level parallelism to eliminate all stalls
–To eliminate the remaining stalls we must look beyond a single block and find more instruction level parallelism
Loop level parallelism is one opportunity
Illustrate the above using DLX with floating point as an example

3 FP Loop: Where are the Hazards?

    Loop: LD   F0,0(R1)    ;F0=vector element
          ADDD F4,F0,F2    ;add scalar in F2
          SD   0(R1),F4    ;store result
          SUBI R1,R1,8     ;decrement pointer 8B (DW)
          BNEZ R1,Loop     ;branch R1!=zero
          NOP              ;delayed branch slot

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0
    Integer op                     Integer op                 0

4 FP Loop Hazards
Where are the stalls?

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0
    Integer op                     Integer op                 0

    Loop: LD   F0,0(R1)    ;F0=vector element
          ADDD F4,F0,F2    ;add scalar in F2
          SD   0(R1),F4    ;store result
          SUBI R1,R1,8     ;decrement pointer 8B (DW)
          BNEZ R1,Loop     ;branch R1!=zero
          NOP              ;delayed branch slot

5 FP Loop Showing Stalls
Rewrite code to minimize stalls?

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1

     1 Loop: LD   F0,0(R1)    ;F0=vector element
     2       stall
     3       ADDD F4,F0,F2    ;add scalar in F2
     4       stall
     5       stall
     6       SD   0(R1),F4    ;store result
     7       SUBI R1,R1,8     ;decrement pointer 8B (DW)
     8       BNEZ R1,Loop     ;branch R1!=zero
     9       stall            ;delayed branch slot
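The stall positions on this slide can be reproduced with a small cycle counter. The following is a minimal sketch, assuming a simple in-order, single-issue model in which an instruction issues no sooner than latency + 1 cycles after the producer of any of its source registers. The instruction encoding, the kind names (`load`, `fp_alu`, `store`, `int`, `nop`), and the function name `issue_cycles` are illustrative assumptions, not part of the slides, and the branch is treated as an integer op as the slide's timing implies.

```python
# Latency table from the slide: (producer kind, consumer kind) -> stall cycles.
LATENCY = {
    ("fp_alu", "fp_alu"): 3,
    ("fp_alu", "store"): 2,
    ("load", "fp_alu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
}

def issue_cycles(program):
    """Return the issue cycle of each instruction (assumed simplified model).

    Each instruction is (kind, dest, sources). It issues at least one cycle
    after its predecessor and at least latency + 1 cycles after the most
    recent writer of each source register.
    """
    producer = {}   # register -> (issue cycle, kind) of its latest writer
    cycles = []
    cycle = 0
    for kind, dest, sources in program:
        earliest = cycle + 1
        for reg in sources:
            if reg in producer:
                p_cycle, p_kind = producer[reg]
                lat = LATENCY.get((p_kind, kind), 0)
                earliest = max(earliest, p_cycle + lat + 1)
        cycle = earliest
        cycles.append(cycle)
        if dest is not None:
            producer[dest] = (cycle, kind)
    return cycles

# The loop body from the slide; the delay slot is modeled as a NOP.
loop = [
    ("load",   "F0", ["R1"]),        # LD   F0,0(R1)
    ("fp_alu", "F4", ["F0", "F2"]),  # ADDD F4,F0,F2
    ("store",  None, ["R1", "F4"]),  # SD   0(R1),F4
    ("int",    "R1", ["R1"]),        # SUBI R1,R1,8
    ("int",    None, ["R1"]),        # BNEZ R1,Loop
    ("nop",    None, []),            # delayed branch slot
]

cycles = issue_cycles(loop)
print(cycles)   # [1, 3, 6, 7, 8, 9]
```

The gaps in the issue cycles (cycle 2, and cycles 4 and 5) are the stalls shown on the slide, and the last entry gives the 9 cycles per iteration.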

6 Revised FP Loop Minimizing Stalls
Unroll loop 4 times to make code faster?

    Instruction producing result   Instruction using result   Latency in clock cycles
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1

     1 Loop: LD   F0,0(R1)
     2       stall
     3       ADDD F4,F0,F2
     4       SUBI R1,R1,8
     5       BNEZ R1,Loop     ;delayed branch
     6       SD   8(R1),F4    ;altered when moved past SUBI

7 Unroll Loop Four Times
Rewrite loop to minimize stalls?

     1 Loop: LD   F0,0(R1)
     2       ADDD F4,F0,F2
     3       SD   0(R1),F4       ;drop SUBI & BNEZ
     4       LD   F6,-8(R1)
     5       ADDD F8,F6,F2
     6       SD   -8(R1),F8      ;drop SUBI & BNEZ
     7       LD   F10,-16(R1)
     8       ADDD F12,F10,F2
     9       SD   -16(R1),F12    ;drop SUBI & BNEZ
    10       LD   F14,-24(R1)
    11       ADDD F16,F14,F2
    12       SD   -24(R1),F16
    13       SUBI R1,R1,#32      ;alter to 4*8
    14       BNEZ R1,Loop
    15       NOP

15 + 4 x (1+2) = 27 clock cycles, or 6.8 per iteration
Assumes the number of iterations is a multiple of 4 (R1 is initially a multiple of 32)
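At source level, the same transformation looks roughly like the sketch below. It is an illustration only: the function names `add_scalar_rolled` and `add_scalar_unrolled4` are hypothetical, and Python is used simply to show that the unrolled body computes the same result while executing one decrement and one loop test per four elements.

```python
def add_scalar_rolled(x, s):
    """x[i] = x[i] + s, walking down from the last element, one per iteration."""
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        i -= 1
    return x

def add_scalar_unrolled4(x, s):
    """Same computation with the loop body replicated four times.

    Assumes len(x) is a multiple of 4, mirroring the slide's assumption.
    """
    assert len(x) % 4 == 0
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s          # copy 1: offset 0
        x[i - 1] = x[i - 1] + s  # copy 2: offset -8 in the DLX code
        x[i - 2] = x[i - 2] + s  # copy 3: offset -16
        x[i - 3] = x[i - 3] + s  # copy 4: offset -24
        i -= 4                   # one decrement and one branch per 4 elements
    return x

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(add_scalar_unrolled4(list(data), 0.5) == add_scalar_rolled(list(data), 0.5))  # True
```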

8 Unrolled Loop That Minimizes Stalls
What assumptions were made when the code was moved?
–OK to move store past SUBI even though it changes the register
–OK to move loads before stores: do we still get the right data?
–When is it safe for the compiler to make such changes?

     1 Loop: LD   F0,0(R1)
     2       LD   F6,-8(R1)
     3       LD   F10,-16(R1)
     4       LD   F14,-24(R1)
     5       ADDD F4,F0,F2
     6       ADDD F8,F6,F2
     7       ADDD F12,F10,F2
     8       ADDD F16,F14,F2
     9       SD   0(R1),F4
    10       SD   -8(R1),F8
    11       SD   -16(R1),F12
    12       SUBI R1,R1,#32
    13       BNEZ R1,Loop
    14       SD   8(R1),F16      ; 8-32 = -24

14 clock cycles, or 3.5 per iteration
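The first assumption, moving a store below SUBI, is safe only if the displacement is adjusted to compensate for the changed register. A minimal sketch of that adjustment follows; the helper name `displacement_after_move` and the starting value of R1 are assumptions chosen for illustration.

```python
def displacement_after_move(disp, decrement):
    """Displacement to use when an access at disp(R1) is moved below
    `SUBI R1,R1,decrement`. R1 is smaller by `decrement` at the new
    position, so the displacement must grow by the same amount."""
    return disp + decrement

# Last store of the unrolled loop: originally SD -24(R1),F16, moved below
# SUBI R1,R1,#32 into the branch delay slot.
new_disp = displacement_after_move(-24, 32)
print(new_disp)   # 8, i.e. SD 8(R1),F16

# Check that both forms touch the same address for an arbitrary R1.
r1 = 64                           # assumed starting value, for illustration
addr_before = r1 + (-24)          # address used before the move
addr_after = (r1 - 32) + new_disp # address used after SUBI has executed
print(addr_before == addr_after)  # True
```

The same rule gives the `SD 8(R1),F4` of the single-iteration schedule, where the original displacement is 0 and the decrement is 8.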

9 Loop Unrolling in VLIW

10 Software Pipelining
Observation: if the iterations of a loop are independent, then we can get ILP by taking instructions from different iterations
Software pipelining: reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (≈ Tomasulo in SW)
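As a rough source-level illustration of this idea, the sketch below software-pipelines the add-scalar loop by hand. It is not taken from the slides: the function name `add_scalar_pipelined` and the three-stage split into load, add, and store are assumptions. Each pass through the kernel stores the result of iteration i, adds for iteration i+1, and loads for iteration i+2.

```python
def add_scalar_pipelined(x, s):
    """Software-pipelined form of: for each i, out[i] = x[i] + s.

    Assumes len(x) >= 2 so that the prologue and epilogue are well defined.
    """
    n = len(x)
    assert n >= 2
    out = list(x)
    # Prologue: start iterations 0 and 1.
    loaded = x[0]               # load  for iteration 0
    summed = loaded + s         # add   for iteration 0
    loaded = x[1]               # load  for iteration 1
    # Kernel: steady state, mixing three original iterations.
    for i in range(n - 2):
        out[i] = summed         # store for iteration i
        summed = loaded + s     # add   for iteration i+1
        loaded = x[i + 2]       # load  for iteration i+2
    # Epilogue: drain the last two iterations.
    out[n - 2] = summed         # store for iteration n-2
    summed = loaded + s         # add   for iteration n-1
    out[n - 1] = summed         # store for iteration n-1
    return out

print(add_scalar_pipelined([1, 2, 3, 4, 5], 10))   # [11, 12, 13, 14, 15]
```

Within one kernel pass, no operation depends on a value produced in that same pass, which is what lets a scheduler overlap them without stalls.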

11 SW Pipelining Example

12 Compile-time Analysis Compiler analysis is performed to detect data dependences. Further analysis is performed to identify stalls (must have knowledge of the HW). Unroll loop and reorder code to eliminate stalls.

13 Compiler Perspective on Data Dependences
Flow dependence (RAW hazard for HW)
–Instruction j writes a register or memory location that instruction i reads from, and instruction j is executed first.
Anti-dependence (WAR hazard for HW)
–Instruction j writes a register or memory location that instruction i reads from, and instruction i is executed first.
Output dependence (WAW hazard for HW)
–Instruction i and instruction j write the same register or memory location; the ordering between the instructions must be preserved.
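These three definitions reduce to set intersections over what each instruction reads and writes. The following is a minimal sketch under an assumed encoding, where each instruction is a pair of sets `(writes, reads)`; the function name `classify_dependences` is hypothetical.

```python
def classify_dependences(first, second):
    """Return the data dependences between two instructions, where `first`
    executes earlier. Each instruction is (writes, reads): two sets of
    register or memory-location names (an assumed encoding)."""
    first_writes, first_reads = first
    second_writes, second_reads = second
    found = set()
    if first_writes & second_reads:
        found.add("flow (RAW)")     # earlier write, later read
    if first_reads & second_writes:
        found.add("anti (WAR)")     # earlier read, later write
    if first_writes & second_writes:
        found.add("output (WAW)")   # both write the same location
    return found

ld   = ({"F0"}, {"R1"})         # LD   F0,0(R1)
addd = ({"F4"}, {"F0", "F2"})   # ADDD F4,F0,F2
subi = ({"R1"}, {"R1"})         # SUBI R1,R1,8

print(classify_dependences(ld, addd))   # {'flow (RAW)'}
print(classify_dependences(ld, subi))   # {'anti (WAR)'}
print(classify_dependences(ld, ld))     # {'output (WAW)'}
```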

14 Dependency Analysis
Easy to determine for registers
–By looking at fixed register names, dependences can be easily found
For memory, in some cases it is easy but in general it can be hard
–From the same iteration: 0(R1) != -8(R1) != -16(R1) != -24(R1)
–From different loop iterations: 20(R6) != 20(R6) if R6 has changed
–Is 100(R4) = 20(R6)? If the references are to two different arrays there is no dependence. But in general this is hard to determine.
Unroll the loop if instructions from different iterations are not dependent upon each other.
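The easy and hard memory cases can be expressed as a conservative three-way test. The sketch below is an illustration under stated assumptions: references are encoded as `(offset, base_register)` pairs, each access is 8 bytes wide (a double), and the function name `may_alias` and its `base_unchanged` flag are hypothetical.

```python
def may_alias(ref_a, ref_b, base_unchanged=True, size=8):
    """Conservative alias test for two references written as (offset, base).

    Returns "no", "yes", or "unknown". Only references through the same,
    unmodified base register can be decided from the offsets alone.
    """
    off_a, base_a = ref_a
    off_b, base_b = ref_b
    if base_a != base_b or not base_unchanged:
        return "unknown"   # needs deeper analysis, e.g. proving distinct arrays
    # Same base value: the accesses overlap only if the offsets are
    # closer together than the access size.
    return "no" if abs(off_a - off_b) >= size else "yes"

print(may_alias((0, "R1"), (-8, "R1")))                          # no
print(may_alias((20, "R6"), (20, "R6")))                         # yes
print(may_alias((20, "R6"), (20, "R6"), base_unchanged=False))   # unknown
print(may_alias((100, "R4"), (20, "R6")))                        # unknown
```

Answering "unknown" forces the compiler to keep the original order, which is why the general case limits how much reordering is possible.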

15 Dependence Analysis
The final kind of dependence is called control dependence
Example:
    if p1 {S1;}
    if p2 {S2;}
S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
Strict enforcement of control dependences limits parallelism
–Unrolling eliminated conditional branches to overcome this limitation.

16 Summary Instruction Level Parallelism can be uncovered by the compiler. Loops are an important source of instruction level parallelism. Dependency analysis is key to uncovering instruction level parallelism.

