
Slide 1: COMP 206: Computer Architecture and Implementation
Montek Singh
Wed., Oct. 29, 2003
Topic: Software Approaches for ILP (Compiler Techniques), contd.

Slide 2: Outline
  - Motivation
  - Compiler scheduling
      - Loop unrolling
      - Software pipelining
      - Static branch prediction
      - VLIW
  - Reading: HP3, Sections 4.1-4.5

Slide 3: Review: Instruction-Level Parallelism (ILP)
  - Pipelining is most effective when there is parallelism among instructions
      - Instructions u and v are parallel if neither P(u,v) nor P(v,u) holds
  - Problem: parallelism within a basic block is limited
      - A branch frequency of 15% implies only about 6 instructions per basic block
      - These instructions are likely to depend on each other
      - So we need to look beyond basic blocks
  - Solution: exploit loop-level parallelism, i.e., parallelism across loop iterations
      - To convert loop-level parallelism into ILP, the loop must be "unrolled":
          - dynamically, by the hardware
          - statically, by the compiler
          - or by using vector instructions (the same operation is applied to all the vector elements)

Slide 4: Motivating Example for Loop Unrolling

    for (i = 1000; i > 0; i--)
        x[i] = x[i] + s;

Assumptions:
  - Scalar s is in register F2
  - Array x starts at memory address 0
  - 1-cycle branch delay
  - No structural hazards

    LOOP:  L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     0(R1), F4
           DADDUI  R1, R1, -8
           BNEZ    R1, LOOP
           NOP

10 cycles per iteration
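
Where the 10 cycles come from (a sketch; the stall counts assume the usual HP3 latencies of one stall between a load and a dependent FP ALU op, two stalls between an FP ALU op and a dependent store, one stall between an integer op and a dependent branch, plus the 1-cycle branch delay):

    cycle 1    LOOP:  L.D     F0, 0(R1)
    cycle 2           (stall: F0 not yet available)
    cycle 3           ADD.D   F4, F0, F2
    cycles 4-5        (stalls: F4 not yet available to the store)
    cycle 6           S.D     0(R1), F4
    cycle 7           DADDUI  R1, R1, -8
    cycle 8           (stall: R1 not yet available to the branch)
    cycle 9           BNEZ    R1, LOOP
    cycle 10          NOP     (branch delay slot)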

Slide 5: How Far Can We Get With Scheduling?

Original:

    LOOP:  L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     0(R1), F4
           DADDUI  R1, R1, -8
           BNEZ    R1, LOOP
           NOP

Scheduled:

    LOOP:  L.D     F0, 0(R1)
           DADDUI  R1, R1, -8
           ADD.D   F4, F0, F2
           NOP
           BNEZ    R1, LOOP
           S.D     8(R1), F4

6 cycles per iteration

Note the change in the S.D instruction, from 0(R1) to 8(R1); this is a non-trivial change!

Slide 6: Observations on Scheduled Code
  - 3 out of 5 instructions involve FP work
  - The other two constitute loop overhead
  - Could we improve performance by unrolling the loop? (A source-level sketch follows below.)
      - Assume the number of loop iterations is a multiple of 4, and unroll the loop body four times
      - In real life, we must also handle loop counts that are not multiples of 4
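
For intuition, this is roughly what unrolling by four looks like at the source level (an illustrative sketch only; the lecture performs the transformation directly on the assembly, and the loop bound of 1000 is conveniently a multiple of 4):

    for (i = 1000; i > 0; i -= 4) {
        x[i]   = x[i]   + s;   /* original iteration i   */
        x[i-1] = x[i-1] + s;   /* original iteration i-1 */
        x[i-2] = x[i-2] + s;   /* original iteration i-2 */
        x[i-3] = x[i-3] + s;   /* original iteration i-3 */
    }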

Slide 7: Unrolling: Take 1
  - Even though we have gotten rid of the control dependences, we still have data dependences through R1
  - We could remove those data dependences by observing that R1 is decremented by 8 each time:
      - Adjust the address specifiers
      - Delete the first three DADDUIs
      - Change the constant in the fourth DADDUI to -32
  - These are non-trivial inferences for a compiler to make

    LOOP:  L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     0(R1), F4
           DADDUI  R1, R1, -8
           L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     0(R1), F4
           DADDUI  R1, R1, -8
           L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     0(R1), F4
           DADDUI  R1, R1, -8
           L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     0(R1), F4
           DADDUI  R1, R1, -8
           BNEZ    R1, LOOP
           NOP

Slide 8: Unrolling: Take 2
  - Performance is now limited by the WAR dependences on F0
  - These are name dependences:
      - The instructions are not in a producer-consumer relation
      - They are simply using the same registers, but they don't have to
      - We can use different registers in different loop iterations, subject to availability
  - Let's rename registers

    LOOP:  L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     0(R1), F4
           L.D     F0, -8(R1)
           ADD.D   F4, F0, F2
           S.D     -8(R1), F4
           L.D     F0, -16(R1)
           ADD.D   F4, F0, F2
           S.D     -16(R1), F4
           L.D     F0, -24(R1)
           ADD.D   F4, F0, F2
           S.D     -24(R1), F4
           DADDUI  R1, R1, -32
           BNEZ    R1, LOOP
           NOP

Slide 9: Unrolling: Take 3

    LOOP:  L.D     F0, 0(R1)
           ADD.D   F4, F0, F2
           S.D     0(R1), F4
           L.D     F6, -8(R1)
           ADD.D   F8, F6, F2
           S.D     -8(R1), F8
           L.D     F10, -16(R1)
           ADD.D   F12, F10, F2
           S.D     -16(R1), F12
           L.D     F14, -24(R1)
           ADD.D   F16, F14, F2
           S.D     -24(R1), F16
           DADDUI  R1, R1, -32
           BNEZ    R1, LOOP
           NOP

  - Time for execution of 4 iterations:
      - 14 instruction cycles
      - 4 L.D -> ADD.D stalls
      - 8 ADD.D -> S.D stalls
      - 1 DADDUI -> BNEZ stall
      - 1 branch delay stall (NOP)
  - 28 cycles for 4 iterations, or 7 cycles per iteration
  - Slower than the scheduled version of the original loop, which needed 6 cycles per iteration
  - Let's schedule the unrolled loop

Slide 10: Unrolling: Take 4

    LOOP:  L.D     F0, 0(R1)
           L.D     F6, -8(R1)
           L.D     F10, -16(R1)
           L.D     F14, -24(R1)
           ADD.D   F4, F0, F2
           ADD.D   F8, F6, F2
           ADD.D   F12, F10, F2
           ADD.D   F16, F14, F2
           S.D     0(R1), F4
           S.D     -8(R1), F8
           DADDUI  R1, R1, -32
           S.D     16(R1), F12
           BNEZ    R1, LOOP
           S.D     8(R1), F16

  - This code runs without stalls:
      - 14 cycles for 4 iterations
      - 3.5 cycles per iteration
      - Loop control overhead is paid only once every four iterations
  - Note that the original loop had three FP instructions that were not independent
  - Loop unrolling exposed independent instructions from multiple loop iterations
  - By unrolling further, we can approach the asymptotic rate of 3 cycles per iteration, subject to the availability of registers

Slide 11: What Did The Compiler Have To Do?
  - Determine that it was legal to move the S.D after the DADDUI and BNEZ, and find the amount by which to adjust the S.D offset
  - Determine that loop unrolling would be useful, by discovering that the loop iterations are independent
  - Rename registers to avoid name dependences
  - Eliminate the extra tests and branches and adjust the loop control
  - Determine that the L.D's and S.D's can be interchanged, by determining that (since R1 is not being updated in between) the address specifiers 0(R1), -8(R1), -16(R1), -24(R1) all refer to different memory locations
  - Schedule the code, preserving dependences

Slide 12: Limits to Gain from Loop Unrolling
  - The benefit of reducing loop overhead tapers off
      - The amount of overhead amortized diminishes with successive unrolls
  - Code size limitations
      - For larger loops, code size growth is a concern, especially for embedded processors with limited memory
      - The instruction cache miss rate increases
  - Architectural/compiler limitations
      - Register pressure: many registers are needed to exploit ILP, which is especially challenging in multiple-issue architectures

Slide 13: Dependences in the Loop Context
  - Three kinds of dependences:
      - Data dependence
      - Name dependence
      - Control dependence
  - In the context of loop-level parallelism, a data dependence can be:
      - Loop-independent
      - Loop-carried
  - Data dependences act as a limit on how much ILP can be exploited in a compiled program
      - The compiler tries to identify and eliminate dependences
      - The hardware tries to prevent dependences from becoming stalls

Slide 14: Data and Name Dependences
  - Instruction v is data-dependent on instruction u if u produces a result that v consumes
  - Instruction v is anti-dependent on instruction u if u precedes v and v writes a register or memory location that u reads
  - Instruction v is output-dependent on instruction u if u precedes v and v writes a register or memory location that u writes
  - Relationship to hazards (a small example follows below):
      - A data dependence that cannot be removed by renaming corresponds to a RAW hazard
      - An anti-dependence corresponds to a WAR hazard
      - An output dependence corresponds to a WAW hazard
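
A minimal C-level illustration of the three kinds (the variable names are made up for this sketch; in hardware, the same relations arise between instructions on registers or memory locations):

    a = b + c;    /* u:  writes a                                            */
    d = a * 2;    /* v1: reads a  -> data (true) dependence on u  (RAW)      */
    b = d - 1;    /* v2: writes b, which u reads -> anti-dependence (WAR)    */
    a = d + 7;    /* v3: writes a, which u writes -> output dependence (WAW) */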

Slide 15: Control Dependences
  - A control dependence determines the ordering of an instruction with respect to a branch instruction, so that the non-branch instruction is executed only when it should be

    if (p1) { s1; }
    if (p2) { s2; }

  - Control dependence constrains code motion:
      - An instruction that is control-dependent on a branch cannot be moved before the branch in a way that removes its execution from the branch's control
      - An instruction that is not control-dependent on a branch cannot be moved after the branch in a way that places its execution under the branch's control
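
Making the slide's two-if example concrete (a sketch; s1 and s2 stand for arbitrary statements):

    t = x + y;          /* not control-dependent on p1            */
    if (p1) { s1; }     /* s1 is control-dependent on p1          */
    if (p2) { s2; }     /* s2 is control-dependent on p2, not p1  */

    /* The two constraints, instantiated:
       - s1 may not be hoisted above the if (p1) test, or it would execute
         even when p1 is false;
       - t = x + y may not be sunk into the if (p1) body, or it would execute
         only when p1 is true. */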

Slide 16: Data Dependence in Loop Iterations

Loop 1, expanded across two iterations (the dependence through A[] is loop-carried: each iteration reads the A[] value produced by the previous one):

    A[u+1] = A[u]   + C[u];
    B[u+1] = B[u]   + A[u+1];
    A[u+2] = A[u+1] + C[u+1];
    B[u+2] = B[u+1] + A[u+2];

Loop 2, expanded across iterations (the only loop-carried dependence is through B[]: B[u+1] is written by one iteration and read by the next):

    A[u]   = A[u]   + B[u];
    B[u+1] = C[u]   + D[u];
    A[u+1] = A[u+1] + B[u+1];
    B[u+2] = C[u+1] + D[u+1];
    A[u+2] = A[u+2] + B[u+2];

The same statements of Loop 2, viewed with the iteration window shifted by one:

    B[u+1] = C[u]   + D[u];
    A[u+1] = A[u+1] + B[u+1];
    B[u+2] = C[u+1] + D[u+1];
    A[u+2] = A[u+2] + B[u+2];
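
For reference, the loops from which these expanded iterations come look roughly like this (a sketch; the iteration bounds are assumed, and i plays the role of u above):

    /* Loop 1: B[i+1] = B[i] + A[i+1] uses the A[i+1] produced in the same
       iteration (loop-independent), but A[i+1] = A[i] + C[i] in iteration i
       feeds A[i+2] = A[i+1] + C[i+1] in iteration i+1 (loop-carried), so the
       iterations cannot run in parallel. */
    for (i = 1; i <= 100; i++) {
        A[i+1] = A[i] + C[i];        /* S1 */
        B[i+1] = B[i] + A[i+1];      /* S2 */
    }

    /* Loop 2: the only loop-carried dependence is B[i+1], written by S2 of
       iteration i and read by S1 of iteration i+1; the chain is not circular,
       so the loop can be transformed (next slide). */
    for (i = 1; i <= 100; i++) {
        A[i] = A[i] + B[i];          /* S1 */
        B[i+1] = C[i] + D[i];        /* S2 */
    }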

Slide 17: Loop Transformation
  - Sometimes a loop-carried dependence does not prevent loop parallelization
      - Example: the second loop of the previous slide
  - In other cases, a loop-carried dependence does prohibit loop parallelization
      - Example: the first loop of the previous slide

Loop body of the second loop:

    A[u]   = A[u] + B[u];
    B[u+1] = C[u] + D[u];

Expanded across iterations:

    A[u]   = A[u]   + B[u];
    B[u+1] = C[u]   + D[u];
    A[u+1] = A[u+1] + B[u+1];
    B[u+2] = C[u+1] + D[u+1];
    A[u+2] = A[u+2] + B[u+2];
    B[u+3] = C[u+2] + D[u+2];
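
The slide does not show the transformed code; one standard way to do it is to peel the first A[] statement and the last B[] statement, after which each remaining iteration is independent (a sketch, with the bounds assumed to be 1..100):

    A[1] = A[1] + B[1];
    for (i = 1; i <= 99; i++) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];   /* uses the B[i+1] produced just above */
    }
    B[101] = C[100] + D[100];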

Slide 18: Software Pipelining
  - Observation: if the iterations of a loop are independent, then we can get ILP by taking instructions from different iterations
  - Software pipelining: reorganize the loop so that each iteration of the new loop is made from instructions chosen from different iterations of the original loop

(Figure: original iterations i0..i4 overlapped in time; one software-pipelined iteration takes one instruction from each.)

Slide 19: Software Pipelining Example

Before: unrolled 3 times (scheduling assumptions as on slide 4):

     1  L.D     F0, 0(R1)
     2  ADD.D   F4, F0, F2
     3  S.D     0(R1), F4
     4  L.D     F0, -8(R1)
     5  ADD.D   F4, F0, F2
     6  S.D     -8(R1), F4
     7  L.D     F0, -16(R1)
     8  ADD.D   F4, F0, F2
     9  S.D     -16(R1), F4
    10  DADDUI  R1, R1, -24
    11  BNEZ    R1, LOOP

After: software pipelined:

        L.D     F0, 0(R1)        ; start-up code
        ADD.D   F4, F0, F2       ; start-up code
        L.D     F0, -8(R1)       ; start-up code
     1  S.D     0(R1), F4        ; stores M[i]
     2  ADD.D   F4, F0, F2       ; adds to M[i-1]
     3  L.D     F0, -16(R1)      ; loads M[i-2]
     4  DADDUI  R1, R1, -8
     5  BNEZ    R1, LOOP
        S.D     0(R1), F4        ; wind-down code
        ADD.D   F4, F0, F2       ; wind-down code
        S.D     -8(R1), F4       ; wind-down code

(Figure: pipeline diagram (IF ID EX Mem WB) showing that in the steady-state body the S.D reads F4, the ADD.D writes F4 and reads F0, and the L.D writes F0, so each instruction in the body belongs to a different original iteration.)

Slide 20: Software Pipelining: Concept

    Loop:  L_i
           E_i
           S_i
           B Loop

Executed sequence: L_1 E_1 S_1, L_2 E_2 S_2, L_3 E_3 S_3, ..., L_n E_n S_n

  - Notation: Load (L), Execute (E), Store (S)
  - Iterations are independent
  - In the normal sequence, E_i depends on L_i, and S_i depends on E_i, leading to pipeline stalls
  - Software pipelining attempts to reduce these delays by inserting other instructions between such dependent pairs, "hiding" the delay
      - The "other" instructions are L and S instructions from other loop iterations
  - It does this without consuming extra code space or registers
      - Performance is usually not as high as that of loop unrolling
  - How can we permute L, E, S to achieve this? (See the sketch below and the next slide.)

Reference: "A Study of Scalar Compilation Techniques for Pipelined Supercomputers", S. Weiss and J. E. Smith, ISCA 1987, pages 105-109.
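
A C-level sketch of the resulting structure, using hypothetical helpers load(i), compute(i), and store(i) to stand for L_i, E_i, and S_i (illustration only, assuming n >= 2):

    /* prologue: start the pipeline with the leading pieces of the
       first two iterations */
    load(1);
    compute(1);
    load(2);

    /* kernel: each pass mixes three different iterations, so no
       instruction immediately follows the one it depends on */
    for (i = 1; i <= n - 2; i++) {
        store(i);         /* finish iteration i      */
        compute(i + 1);   /* middle of iteration i+1 */
        load(i + 2);      /* start iteration i+2     */
    }

    /* epilogue: drain the pipeline */
    store(n - 1);
    compute(n);
    store(n);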

Slide 21: An Abstract View of Software Pipelining

Original loop:

    Loop:  L_i; E_i; S_i; B Loop

Pipelined variants that maintain the original load/store order (each S_i still precedes L_{i+1}):

    (a)  L_1
         Loop:  E_i; S_i; L_{i+1}; B Loop
         E_n; S_n

    (b)  J Entry
         Loop:  S_{i-1}
         Entry: L_i; E_i; B Loop
         S_n

Pipelined variants that change the original load/store order (a load is moved ahead of an earlier iteration's store):

    (c)  L_1; J Entry
         Loop:  S_{i-1}
         Entry: E_i; L_{i+1}; B Loop
         S_{n-1}; E_n; S_n

    (d)  L_1
         Loop:  E_i; L_{i+1}; S_i; B Loop
         E_n; S_n

    (e)  L_1; J Entry
         Loop:  L_i; S_{i-1}
         Entry: E_i; B Loop
         S_n

Slide 22: Other Compiler Techniques: Static Branch Prediction
  - Example prediction policies:
      - predict always taken
      - predict never taken
      - predict forward branches never taken, backward branches always taken
  - In the code below, a stall is needed after the LD:
      - If the branch is almost always taken, and R7 is not needed on the fall-through path, move DADDU R7, R8, R9 to right after the LD
      - If the branch is almost never taken, and R4 is not needed on the taken path, move the OR instruction to right after the LD

        LD      R1, 0(R2)
        DSUBU   R1, R1, R3
        BEQZ    R1, L
        OR      R4, R5, R6
        DADDU   R10, R4, R3
    L:  DADDU   R7, R8, R9

Slide 23: Very Long Instruction Word (VLIW)
  - VLIW: the compiler schedules multiple instructions per issue
      - The long instruction word has room for many operations
      - By definition, all the operations the compiler puts in the long instruction word can execute in parallel
      - E.g., 2 integer operations, 2 FP operations, 2 memory references, 1 branch
          - At 16 to 24 bits per field, the word is 7*16 = 112 bits to 7*24 = 168 bits wide
      - Needs a very sophisticated compiling technique, one that schedules across several branches
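
A rough picture of such an instruction word, written as a C struct (purely illustrative; the field names and the 24-bit slot width are assumptions matching the slide's example of seven operation slots):

    /* One VLIW instruction word with seven operation slots.
       All seven operations issue in the same cycle; the compiler
       guarantees they are mutually independent. */
    struct vliw_word {
        unsigned int mem_ref1  : 24;   /* memory reference 1  */
        unsigned int mem_ref2  : 24;   /* memory reference 2  */
        unsigned int fp_op1    : 24;   /* FP operation 1      */
        unsigned int fp_op2    : 24;   /* FP operation 2      */
        unsigned int int_op1   : 24;   /* integer operation 1 */
        unsigned int int_op2   : 24;   /* integer operation 2 */
        unsigned int branch_op : 24;   /* branch              */
    };  /* 7 x 24 = 168 bits of operation encoding */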

Slide 24: Loop Unrolling in VLIW

    Memory ref 1      Memory ref 2      FP operation 1     FP operation 2     Int. op / branch   Clock
    LD F0,0(R1)       LD F6,-8(R1)                                                               1
    LD F10,-16(R1)    LD F14,-24(R1)                                                             2
    LD F18,-32(R1)    LD F22,-40(R1)    ADDD F4,F0,F2      ADDD F8,F6,F2                         3
    LD F26,-48(R1)                      ADDD F12,F10,F2    ADDD F16,F14,F2                       4
                                        ADDD F20,F18,F2    ADDD F24,F22,F2                       5
    SD 0(R1),F4       SD -8(R1),F8      ADDD F28,F26,F2                                          6
    SD -16(R1),F12    SD -24(R1),F16                                                             7
    SD -32(R1),F20    SD -40(R1),F24                                          SUBI R1,R1,#48     8
    SD -0(R1),F28                                                             BNEZ R1,LOOP       9

  - Unrolled 7 times to avoid delays
  - 7 results in 9 clocks, or 1.3 clocks per iteration (down from 6)
  - Need more registers in VLIW

Slide 25: Trace Scheduling (briefly)
  - Exploits parallelism across IF branches, as opposed to LOOP branches
  - Two steps:
      - Trace selection: find a likely sequence of basic blocks (a trace) that forms a (statically predicted) long sequence of straight-line code
      - Trace compaction: squeeze the trace into a few VLIW instructions
  - Needs bookkeeping code in case the prediction is wrong

Slide 26: Trace Scheduling

