
CIS 662 – Computer Architecture – Fall 2004 - Class 16 – 11/09/04: Compiler Techniques for ILP


Slide 1: Compiler Techniques for ILP
• So far we have explored dynamic hardware techniques for ILP exploitation:
  - BTB and branch prediction
  - Dynamic scheduling
  - Scoreboard
  - Tomasulo’s algorithm
  - Speculation
  - Multiple issue
• How can compilers help?

Slide 2: Loop Unrolling
• Let’s look at the code:

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;

          ADD    R2, R0, R0      ; R2 = 0, the loop ends when R1 reaches it
    Loop: L.D    F0, 0(R1)       ; F0 = x[i]
          ADD.D  F4, F0, F2      ; F4 = x[i] + s (s is held in F2)
          S.D    F4, 0(R1)       ; x[i] = F4
          DADDUI R1, R1, #-8     ; move to the previous 8-byte element
          BNE    R1, R2, Loop    ; repeat while R1 != R2

Slide 3: Scheduling On A Simple 5 Stage MIPS

    Loop: L.D    F0, 0(R1)
          stall                  ; wait for F0 value to propagate
          ADD.D  F4, F0, F2
          stall (2 cycles)       ; wait for FP add to be completed
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          stall                  ; wait for R1 value to propagate
          BNE    R1, R2, Loop
          stall                  ; one cycle, branch penalty

• 10 cycles per iteration

Slide 4: We Could Rearrange The Instructions
• Interleave the instructions that cause stalls with some independent instructions
• Best we can achieve is 6 cycles

    Loop: L.D    F0, 0(R1)
          DADDUI R1, R1, #-8
          ADD.D  F4, F0, F2
          stall                  ; wait for FP add to be completed
          BNE    R1, R2, Loop
          S.D    F4, 8(R1)       ; displacement is 8 because R1 was already decremented

Slide 5: Loop Unrolling
• Get more useful instructions into the loop and reduce overhead
• Step 1: Put several iterations together (assume the branch is taken)

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop
          L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop
          L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop
          L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop

Slide 6: Loop Unrolling
• Step 2: Take out control instructions, adjust offsets

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          L.D    F0, -8(R1)
          ADD.D  F4, F0, F2
          S.D    F4, -8(R1)
          L.D    F0, -16(R1)
          ADD.D  F4, F0, F2
          S.D    F4, -16(R1)
          L.D    F0, -24(R1)
          ADD.D  F4, F0, F2
          S.D    F4, -24(R1)
          DADDUI R1, R1, #-32
          BNE    R1, R2, Loop

Slide 7: Loop Unrolling
• Step 3: Rename registers

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          L.D    F6, -8(R1)
          ADD.D  F8, F6, F2
          S.D    F8, -8(R1)
          L.D    F10, -16(R1)
          ADD.D  F12, F10, F2
          S.D    F12, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F16, F14, F2
          S.D    F16, -24(R1)
          DADDUI R1, R1, #-32
          BNE    R1, R2, Loop

Slide 8: Loop Unrolling
• The unrolled, renamed loop still has stalls due to RAW dependencies, in the same pattern as the original loop:

    Loop: L.D    F0, 0(R1)
          stall                  ; wait for F0 value to propagate
          ADD.D  F4, F0, F2
          stall                  ; wait for FP add to be completed
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          stall                  ; wait for R1 value to propagate
          BNE    R1, R2, Loop
          stall                  ; one cycle, branch penalty

• 28 cycles = 7 per iteration

Slide 9: Loop Unrolling
• Step 4: Interleave iterations

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          DADDUI R1, R1, #-32
          S.D    F12, 16(R1)
          BNE    R1, R2, Loop
          S.D    F16, 8(R1)

• 14 cycles = 3.5 per iteration
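The four unrolling steps can also be sketched at the source level. This is a minimal C illustration, not the compiler's actual output: the function names are hypothetical, the array is indexed from 0 rather than from 1000 downward, and a cleanup loop is added because, unlike the assembly above, the element count is not assumed to be a multiple of 4.

```c
#include <stddef.h>

/* Reference loop: x[i] = x[i] + s for every element. */
void add_scalar(double *x, size_t n, double s)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + s;
}

/* Same computation unrolled by 4 (hypothetical helper).
 * Separate temporaries play the role of the renamed registers,
 * and the loads, adds and stores are grouped as in Step 4. */
void add_scalar_unrolled4(double *x, size_t n, double s)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* all loads first */
        double t0 = x[i], t1 = x[i + 1], t2 = x[i + 2], t3 = x[i + 3];
        /* then all adds */
        t0 += s; t1 += s; t2 += s; t3 += s;
        /* then all stores */
        x[i] = t0; x[i + 1] = t1; x[i + 2] = t2; x[i + 3] = t3;
    }
    /* cleanup loop for the 0 to 3 leftover elements */
    for (; i < n; i++)
        x[i] = x[i] + s;
}
```

One loop-control update and one branch now serve four elements, which is where the reduction in overhead comes from.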

Slide 10: Loop Unrolling + Multiple Issue
• Let’s unroll the loop 5 times and mark integer and FP operations

    Loop: L.D    F0, 0(R1)       ; int
          ADD.D  F4, F0, F2      ; FP
          S.D    F4, 0(R1)       ; int
          L.D    F6, -8(R1)      ; int
          ADD.D  F8, F6, F2      ; FP
          S.D    F8, -8(R1)      ; int
          L.D    F10, -16(R1)    ; int
          ADD.D  F12, F10, F2    ; FP
          S.D    F12, -16(R1)    ; int
          L.D    F14, -24(R1)    ; int
          ADD.D  F16, F14, F2    ; FP
          S.D    F16, -24(R1)    ; int
          L.D    F18, -32(R1)    ; int
          ADD.D  F20, F18, F2    ; FP
          S.D    F20, -32(R1)    ; int
          DADDUI R1, R1, #-40    ; int
          BNE    R1, R2, Loop    ; int

Slide 11: Loop Unrolling + Multiple Issue
• Move all loads first, then the ADD.Ds, then the S.Ds

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          L.D    F18, -32(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          ADD.D  F20, F18, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          S.D    F12, -16(R1)
          S.D    F16, -24(R1)
          S.D    F20, -32(R1)
          DADDUI R1, R1, #-40
          BNE    R1, R2, Loop

Slide 12: Loop Unrolling + Multiple Issue
• Rearrange instructions to handle the delay for DADDUI and BNE

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          L.D    F18, -32(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          ADD.D  F20, F18, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          S.D    F12, -16(R1)
          DADDUI R1, R1, #-40
          S.D    F16, -24(R1)
          BNE    R1, R2, Loop
          S.D    F20, -32(R1)

Slide 13: Loop Unrolling + Multiple Issue
• Fix immediate displacement values

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          L.D    F18, -32(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          ADD.D  F20, F18, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          S.D    F12, -16(R1)
          DADDUI R1, R1, #-40
          S.D    F16, 16(R1)
          BNE    R1, R2, Loop
          S.D    F20, 8(R1)

Slide 14: Loop Unrolling + Multiple Issue
• Now imagine we can issue 2 instructions per cycle, one integer and one FP

    Cycle  Integer instruction            FP instruction
    1      Loop: L.D    F0, 0(R1)
    2            L.D    F6, -8(R1)
    3            L.D    F10, -16(R1)      ADD.D  F4, F0, F2
    4            L.D    F14, -24(R1)      ADD.D  F8, F6, F2
    5            L.D    F18, -32(R1)      ADD.D  F12, F10, F2
    6            S.D    F4, 0(R1)         ADD.D  F16, F14, F2
    7            S.D    F8, -8(R1)        ADD.D  F20, F18, F2
    8            S.D    F12, -16(R1)
    9            DADDUI R1, R1, #-40
    10           S.D    F16, 16(R1)
    11           BNE    R1, R2, Loop
    12           S.D    F20, 8(R1)

• 12 cycles = 2.4 per iteration

Slide 15: Static Branch Prediction
• Analyze the code, figure out which outcome of a branch is likely
  - Always predict taken
  - Predict backward branches as taken, forward as not taken
  - Predict based on the profile of previous runs
• Static branch prediction can help us schedule delayed branch slots
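The three static schemes listed above can be written as simple predicates. This is an illustrative sketch only: the function names are hypothetical, branch direction is inferred by comparing the target address with the branch address, and the profile is reduced to two counters from a previous run.

```c
#include <stdbool.h>
#include <stdint.h>

/* Scheme 1: always predict taken. */
bool predict_always_taken(void)
{
    return true;
}

/* Scheme 2: backward branches (target at or before the branch, as in a
 * loop-closing branch) are predicted taken; forward branches not taken. */
bool predict_backward_taken(uint64_t branch_pc, uint64_t target_pc)
{
    return target_pc <= branch_pc;
}

/* Scheme 3: predict the direction seen more often in a profiling run.
 * Ties are resolved as taken, which is an arbitrary choice here. */
bool predict_from_profile(unsigned taken_count, unsigned not_taken_count)
{
    return taken_count >= not_taken_count;
}
```

For the loop used throughout these slides, the BNE jumps back to Loop, so the second scheme predicts it taken.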

Slide 16: Static Multiple Issue: VLIW
• Hardware checking for dependencies in issue packets may be expensive and complex
• The compiler can examine instructions and decide which ones can be scheduled in parallel: it groups instructions into instruction packets (VLIW)
• Hardware can then be simplified
• The processor has multiple functional units and each field of the VLIW is assigned to one unit
• For example, a VLIW could contain 5 fields: one has to contain an ALU instruction or a branch, two have to contain FP instructions and two have to be memory references

Slide 17: Example
• Assume the VLIW contains 5 fields: ALU instruction or branch, two FP instructions and two memory references
• Ignore the branch delay slot

    Loop: L.D    F0, 0(R1)       ; memory reference
          stall                  ; wait for F0 value to propagate
          ADD.D  F4, F0, F2      ; FP instruction
          stall                  ; wait for FP add to be completed
          S.D    F4, 0(R1)       ; memory reference
          DADDUI R1, R1, #-8     ; ALU instruction
          stall                  ; wait for R1 value to propagate
          BNE    R1, R2, Loop    ; branch

Slides 18-26: Example
• Unroll seven times and rearrange
• Fill the ALU/branch, FP and memory fields of the VLIW cycle by cycle

    Loop: L.D    F0, 0(R1)
          L.D    F6, -8(R1)
          L.D    F10, -16(R1)
          L.D    F14, -24(R1)
          L.D    F18, -32(R1)
          L.D    F22, -40(R1)
          L.D    F26, -48(R1)
          ADD.D  F4, F0, F2
          ADD.D  F8, F6, F2
          ADD.D  F12, F10, F2
          ADD.D  F16, F14, F2
          ADD.D  F20, F18, F2
          ADD.D  F24, F22, F2
          ADD.D  F28, F26, F2
          S.D    F4, 0(R1)
          S.D    F8, -8(R1)
          S.D    F12, -16(R1)
          S.D    F16, -24(R1)
          S.D    F20, 24(R1)     ; issues after DADDUI in the schedule: -32 + 56 = 24
          DADDUI R1, R1, #-56
          S.D    F24, 16(R1)
          BNE    R1, R2, Loop
          S.D    F28, 8(R1)

• Overall 9 cycles for 7 iterations: 1.29 per iteration
• But the VLIW was always half-full
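The per-iteration figures quoted on the slides are total cycles divided by the number of original iterations in the loop body. A small sketch of that arithmetic follows. The function names are hypothetical, and the slot-utilization figure is computed here from the listing (7 loads, 7 adds, 7 stores, DADDUI and BNE, so 23 operations in 9 words of 5 fields); it is not a number stated on the slides.

```c
/* Average cycles per original loop iteration for a body that covers
 * unroll_factor iterations and takes total_cycles to execute. */
double cycles_per_iteration(int total_cycles, int unroll_factor)
{
    return (double)total_cycles / unroll_factor;
}

/* Fraction of VLIW fields that hold a real operation. */
double slot_utilization(int operations, int cycles, int fields_per_word)
{
    return (double)operations / (cycles * fields_per_word);
}
```

With these, 28 cycles for 4 iterations gives 7, 14 for 4 gives 3.5, 12 for 5 gives 2.4, and 9 for 7 gives about 1.29. The utilization of 23 / 45 is close to one half, which agrees with the remark that the VLIW was half-full.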

Slide 27: Detecting and Enhancing Loop Level Parallelism
• Determine whether data in later iterations depends on data in earlier iterations: a loop-carried dependence
• Easier to detect at the source code level than at the machine code level

    for (i=1; i<=100; i=i+1) {
        A[i+1] = A[i] + C[i];    /* S1 */
        B[i+1] = B[i] + A[i+1];  /* S2 */
    }

• S1 calculates a value A[i+1] which will be used in the next iteration of S1
• S2 calculates a value B[i+1] which will be used in the next iteration of S2
  - This is a loop-carried dependence and prevents parallelism
• S1 calculates a value A[i+1] which will be used in the current iteration of S2
  - This is a dependence within the loop

Slide 28: Detecting and Enhancing Loop Level Parallelism

    for (i=1; i<=100; i=i+1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }

• S1 calculates a value A[i] which is not used in the future
• S2 calculates a value B[i+1] which will be used in the next iteration of S1
  - This is a loop-carried dependence, but S1 depends on S2, not on itself, and S2 does not depend on S1
• This loop can be made parallel if we transform it so that there is no loop-carried dependence

    A[1] = A[1] + B[1];
    for (i=1; i<=99; i=i+1) {
        B[i+1] = C[i] + D[i];          /* S2 */
        A[i+1] = A[i+1] + B[i+1];      /* S1 */
    }
    B[101] = C[100] + D[100];
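A quick way to convince yourself that the transformation preserves the results is to run both versions on the same data. The sketch below does that; the function names are hypothetical and the arrays are sized N + 2 so that the 1-based indices 1 to 101 from the slide can be used unchanged.

```c
enum { N = 100 };

/* Original loop: S1 in iteration i+1 reads the B[i+1] that S2 wrote in
 * iteration i, which is the loop-carried dependence. */
void original_loop(double A[N + 2], double B[N + 2],
                   const double C[N + 2], const double D[N + 2])
{
    for (int i = 1; i <= N; i++) {
        A[i] = A[i] + B[i];        /* S1 */
        B[i + 1] = C[i] + D[i];    /* S2 */
    }
}

/* Transformed loop: the first S1 and the last S2 are peeled off, so each
 * iteration only uses the B[i+1] it has just computed itself. */
void transformed_loop(double A[N + 2], double B[N + 2],
                      const double C[N + 2], const double D[N + 2])
{
    A[1] = A[1] + B[1];
    for (int i = 1; i <= N - 1; i++) {
        B[i + 1] = C[i] + D[i];            /* S2 */
        A[i + 1] = A[i + 1] + B[i + 1];    /* S1 */
    }
    B[N + 1] = C[N] + D[N];
}
```

After the transformation the only dependence left is from S2 to S1 inside one iteration, so different iterations can run in parallel.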

Slide 29: Detecting and Enhancing Loop Level Parallelism
• Recursion creates a loop-carried dependence
• But sometimes it may be parallelizable if the distance between dependent elements is >1

    for (i=1; i<=100; i=i+1) {
        A[i] = A[i-1] + B[i];    /* distance 1 */
    }

    for (i=6; i<=100; i=i+1) {
        A[i] = A[i-5] + B[i];    /* distance 5 */
    }
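With a dependence distance of 5, element i depends only on element i-5, so the recurrence splits into five chains that never touch each other. The sketch below makes that explicit. It uses 0-based indexing and treats A[0] to A[4] as seed values, both of which are assumptions for the illustration, and the function names are hypothetical.

```c
#include <stddef.h>

/* Sequential form of A[i] = A[i-5] + B[i]; A[0..4] are seed values. */
void recurrence_sequential(double *A, const double *B, size_t n)
{
    for (size_t i = 5; i < n; i++)
        A[i] = A[i - 5] + B[i];
}

/* The same recurrence written as 5 independent chains. Each chain is
 * still sequential, but the 5 chains could run in parallel. */
void recurrence_by_chains(double *A, const double *B, size_t n)
{
    for (size_t chain = 0; chain < 5; chain++)
        for (size_t i = chain + 5; i < n; i += 5)
            A[i] = A[i - 5] + B[i];
}
```

With distance 1 there is only a single chain, which is why the first loop on the slide cannot be split this way.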

Slide 30: Detecting and Enhancing Loop Level Parallelism
• Find all dependencies in the following loop (there are 5) and eliminate as many as you can:

    for (i=1; i<=100; i=i+1) {
        Y[i] = X[i] / c;    /* S1 */
        X[i] = X[i] + c;    /* S2 */
        Z[i] = Y[i] + c;    /* S3 */
        Y[i] = c - Y[i];    /* S4 */
    }

• Solution on page 325
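One way to approach the exercise is renaming: the two true dependences (S1 to S3 and S1 to S4 through Y[i]) have to stay, but the antidependences (S1 to S2 on X[i], S3 to S4 on Y[i]) and the output dependence (S1 to S4 on Y[i]) go away once each value gets its own name. The sketch below is an illustration of that idea; T and X1 are hypothetical new arrays, the indexing is 0-based, and the function names are made up for the example.

```c
#include <stddef.h>

/* Loop as given on the slide (0-based here). */
void loop_original(double *X, double *Y, double *Z, double c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        Y[i] = X[i] / c;    /* S1 */
        X[i] = X[i] + c;    /* S2 */
        Z[i] = Y[i] + c;    /* S3 */
        Y[i] = c - Y[i];    /* S4 */
    }
}

/* Renamed version: T holds the intermediate quotient and X1 the updated
 * X, so only the true dependences on T remain. */
void loop_renamed(const double *X, double *X1, double *T,
                  double *Y, double *Z, double c, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        T[i] = X[i] / c;     /* S1, Y renamed to T */
        X1[i] = X[i] + c;    /* S2, X renamed to X1 */
        Z[i] = T[i] + c;     /* S3 */
        Y[i] = c - T[i];     /* S4 */
    }
}
```

Code after the loop then has to read X1 wherever it would have read the updated X.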

Slide 31: Code Transformation
• Eliminating dependent computations
• Copy propagation

    DADDUI R1, R2, #4
    DADDUI R1, R1, #4
  becomes
    DADDUI R1, R2, #8

• Tree height reduction

    ADD R1, R2, R3
    ADD R4, R1, R6
    ADD R8, R4, R7
  becomes
    ADD R1, R2, R3      ; the first two ADDs can be done in parallel
    ADD R4, R6, R7
    ADD R8, R1, R4

    sum = sum + x                               /* suppose this is in a loop and we unroll it 5 times */
    sum = sum + x1 + x2 + x3 + x4 + x5          /* must be done sequentially */
    sum = (sum + x1) + (x2 + x3) + (x4 + x5)    /* the parenthesized sums can be done in parallel */
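The sum example can be sketched in C as follows. Integers are used so that both orderings give exactly the same result. With floating-point values, regrouping the additions can change the rounding, which is why a compiler normally needs permission to do this on FP code. The function names are hypothetical.

```c
/* Sequential form: every addition waits for the previous one,
 * so the dependence chain is 5 additions long. */
long sum_sequential(long sum, const long x[5])
{
    for (int k = 0; k < 5; k++)
        sum = sum + x[k];
    return sum;
}

/* Tree form: the three partial sums are independent of each other,
 * so the dependence chain is only 3 additions long. */
long sum_tree(long sum, const long x[5])
{
    long a = sum + x[0];     /* (sum + x1) */
    long b = x[1] + x[2];    /* (x2 + x3)  */
    long c = x[3] + x[4];    /* (x4 + x5)  */
    return (a + b) + c;
}
```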

Slide 32: Software Pipelining
• Combining instructions from different loop iterations to separate dependent instructions within an iteration

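Slide 33: Software Pipelining
• Apply the software pipelining technique to the following loop:

    Loop: L.D    F0, 0(R1)
          ADD.D  F4, F0, F2
          S.D    F4, 0(R1)
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop

• Consider three consecutive iterations, working on the elements at R1+16, R1+8 and R1:

    R1+16:  L.D F0, 0(R1)    ADD.D F4, F0, F2    S.D F4, 0(R1)
    R1+8:   L.D F0, 0(R1)    ADD.D F4, F0, F2    S.D F4, 0(R1)
    R1:     L.D F0, 0(R1)    ADD.D F4, F0, F2    S.D F4, 0(R1)

• Take the S.D from the first, the ADD.D from the second and the L.D from the third:

    Loop: S.D    F4, 16(R1)      ; store for the element at R1+16
          ADD.D  F4, F0, F2      ; add for the element at R1+8
          L.D    F0, 0(R1)       ; load for the element at R1
          DADDUI R1, R1, #-8
          BNE    R1, R2, Loop

• Startup code is needed before the loop and cleanup code after it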
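The same restructuring can be sketched in C, with the startup and cleanup code written out. This is an illustration under stated assumptions rather than the slide's exact code: the array is traversed upward from index 0, the variables loaded and summed stand in for F0 and F4, and the function name is hypothetical.

```c
#include <stddef.h>

/* Software-pipelined version of x[i] = x[i] + s.
 * Each pass of the kernel stores the result for element i, adds for
 * element i+1 and loads element i+2. */
void add_scalar_swp(double *x, size_t n, double s)
{
    if (n < 2) {                      /* too short to pipeline */
        for (size_t i = 0; i < n; i++)
            x[i] = x[i] + s;
        return;
    }

    /* Startup code: fill the pipeline. */
    double loaded = x[0];             /* load for element 0 */
    double summed = loaded + s;       /* add for element 0  */
    loaded = x[1];                    /* load for element 1 */

    /* Kernel: one store, one add and one load from three different
     * iterations, none of which depends on another within the pass. */
    for (size_t i = 0; i + 2 < n; i++) {
        x[i] = summed;                /* store for element i  */
        summed = loaded + s;          /* add for element i+1  */
        loaded = x[i + 2];            /* load for element i+2 */
    }

    /* Cleanup code: drain the pipeline. */
    x[n - 2] = summed;                /* store for element n-2 */
    x[n - 1] = loaded + s;            /* add and store for element n-1 */
}
```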

Slide 34: Software Pipelining vs. Loop Unrolling
• Loop unrolling eliminates loop maintenance overhead, exposing parallelism between iterations
  - Creates larger code
• Software pipelining enables some loop iterations to run at top speed by eliminating RAW hazards that create latencies within an iteration
  - Requires more complex transformations

Slide 35: Homework #8
• Due Tuesday, November 16 by the end of the class
• Submit either in class (paper) or by e-mail (PS or PDF only), or bring the paper copy to my office
• Do exercises 4.2, 4.6, 4.9 (skip parts d. and e.), 4.11

