COMP381 by M. Hamdi 1 Loop Level Parallelism Instruction Level Parallelism: Loop Level Parallelism.

COMP381 by M. Hamdi 1 Loop Level Parallelism Instruction Level Parallelism: Loop Level Parallelism

COMP381 by M. Hamdi 2 Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining increases performance by overlapping the execution of independent instructions. The CPI of a real-life pipeline is given by: Pipeline CPI = Ideal Pipeline CPI + Structural Stalls + RAW Stalls + WAR Stalls + WAW Stalls + Control Stalls A basic instruction block is a straight-line code sequence with no branches in, except at the entry point, and no branches out except at the exit point of the sequence. The amount of parallelism in a basic block is limited by instruction dependence present and size of the basic block. In a typical integer code, dynamic branch frequency is about 15% (average basic block size of 7 instructions).

COMP381 by M. Hamdi 3 Increasing Instruction-Level Parallelism A common way to increase parallelism among instructions is to exploit parallelism among iterations of a loop –(i.e Loop Level Parallelism, LLP). This is accomplished by unrolling the loop either statically by the compiler, or dynamically by hardware, which increases the size of the basic block present. In this loop every iteration can overlap with any other iteration. Overlap within each iteration is minimal. for (i=1; i<=1000; i=i+1;) x[i] = x[i] + y[i];

COMP381 by M. Hamdi 4 MIPS Loop Unrolling Example For the loop: for (i=1; i<=1000; i++) x(i) = x(i) + s; ; The straightforward MIPS assembly code is given by: Loop: L.D F0, 0 (R1) ;F0=array element ADD.D F4, F0, F2 ;add scalar in F2 S.D F4, 0(R1) ;store result SUBI R1, R1, # 8 ;decrement pointer 8 bytes BNE R1,Loop ; branch R1!=zero R1 is initially the address of the element with highest address.

COMP381 by M. Hamdi 5 FP Loop: Where are the Hazards? Loop:LDF0,0(R1);F0=vector element ADDD F4,F0,F2;add scalar from F2 SD0(R1),F4;store result SUBIR1,R1,8;decrement pointer 8B (DW) BNEZR1,Loop;branch R1!=zero InstructionInstructionLatency in producing resultusing result clock cycles FP ALU opAnother FP ALU op3 FP ALU opStore double2 Load doubleFP ALU op1 Load doubleStore double0 Integer opInteger op0 Where are the stalls?

COMP381 by M. Hamdi 6 FP Loop Showing Stalls 9 clock cycles per loop iteration. Rewrite code to minimize stalls? 1 Loop:LD F0,0(R1);F0=vector element 2stall;stall one cycle between LD and FP ALU op 3ADDDF4,F0,F2;add scalar in F2 4stall;stall two cycles between FP ALU op and SD 5stall 6 SD0(R1),F4;store result 7 SUBIR1,R1,8;decrement pointer 8B (DW) 8 BNEZR1,Loop;branch R1!=zero 9 stall;delayed branch slot

COMP381 by M. Hamdi 7 Revised FP Loop Minimizing Stalls Code now takes 6 clock cycles per loop iteration Speedup = 9/6 = 1.5 Unroll loop 4 times to make faster? 1 Loop:LDF0,0(R1) 2stall 3ADDDF4,F0,F2 4SUBIR1,R1,8 5BNEZR1,Loop;delayed branch 6 SD8(R1),F4;altered when move past SUBI Swap BNEZ and SD by changing address of SD

COMP381 by M. Hamdi 8 Unroll Loop Four Times (straightforward way) Rewrite loop to minimize stalls? 1 Loop:LDF0,0(R1) 2ADDDF4,F0,F2 3SD0(R1),F4 ;drop SUBI & BNEZ 4LDF6,-8(R1) 5ADDDF8,F6,F2 6SD-8(R1),F8 ;drop SUBI & BNEZ 7LDF10,-16(R1) 8ADDDF12,F10,F2 9SD-16(R1),F12 ;drop SUBI & BNEZ 10LDF14,-24(R1) 11ADDDF16,F14,F2 12SD-24(R1),F16 13SUBIR1,R1,#32;alter to 4*8 14BNEZR1,LOOP 15stall 15 + 4 x (2 + 1)= 27 clock cycles, or 6.8 cycles per iteration (2 stalls for SD and 1 stall for LD)

COMP381 by M. Hamdi 9 Unrolled Loop That Minimizes Stalls What assumptions made when moved code? –OK to move store past SUBI even though SUBI changes register –OK to move loads before stores: get right data? –When is it safe for compiler to make such changes? 1 Loop:LDF0,0(R1) 2LDF6,-8(R1) 3LDF10,-16(R1) 4LDF14,-24(R1) 5ADDDF4,F0,F2 6ADDDF8,F6,F2 7ADDDF12,F10,F2 8ADDDF16,F14,F2 9SD0(R1),F4 10 10SD-8(R1),F8 11 11SD-16(R1),F8 13SUBIR1,R1,#32 12BNEZR1,LOOP 14SD8(R1),F16;8-32 = -24 14 clock cycles or 3.5 clock cycles per iteration

COMP381 by M. Hamdi 10 Summary of Loop Unrolling Example Determine that it was legal to move the SD after the SUBI and BNEZ, and find the amount to adjust the SD offset. Determine that unrolling the loop would be useful by finding that the loop iterations are independent, except for the loop maintenance code. Use extra registers to avoid unnecessary constraints that would be forced by using the same registers for different computations.

COMP381 by M. Hamdi 11 Summary of Loop Unrolling Example Eliminate the extra tests and branches and adjust the loop maintenance code. Determine that the loads and stores in the unrolled loop can be interchanged by observing that the loads and stores from different iterations are independent. This requires analyzing the memory addresses and finding that they do not refer to the same address. Schedule the code, preserving any dependences needed to yield the same result as the original code.

COMP381 by M. Hamdi 12 Loop-Level Parallelism (LLP) Analysis Loop-Level Parallelism (LLP) analysis focuses on whether data accesses in later iterations of a loop are data dependent on data values produced in earlier iterations. e.g. in for (i=1; i<=1000; i++) x[i] = x[i] + s; the computation in each iteration is independent of the previous iterations and the loop is thus parallel. The use of X[i] twice is within a single iteration.  Thus loop iterations are parallel (or independent from each other). Loop-carried Dependence: A data dependence between different loop iterations (data produced in earlier iteration used in a later one) – limits parallelism. Instruction level parallelism (ILP) analysis, on the other hand, is usually done when instructions are generated by the compiler.

COMP381 by M. Hamdi 13 LLP Analysis Example 1 In the loop: for (i=1; i<=100; i=i+1) { A[i+1] = A[i] + C[i]; /* S1 */ B[i+1] = B[i] + A[i+1];} /* S2 */ } (Where A, B, C are distinct non-overlapping arrays) –S2 uses the value A[i+1], computed by S1 in the same iteration. This data dependence is within the same iteration (not a loop-carried dependence).  does not prevent loop iteration parallelism. –S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1] read in iteration i+1 (loop-carried dependence, prevents parallelism). The same applies for S2 for B[i] and B[i+1]  These two dependences are loop-carried spanning more than one iteration preventing loop parallelism.

COMP381 by M. Hamdi 14 LLP Analysis Example 2 In the loop: for (i=1; i<=100; i=i+1) { A[i] = A[i] + B[i]; /* S1 */ B[i+1] = C[i] + D[i]; /* S2 */ } –S1 uses the value B[i] computed by S2 in the previous iteration (loop-carried dependence) –This dependence is not circular: S1 depends on S2 but S2 does not depend on S1. –Can be made parallel by replacing the code with the following: A[1] = A[1] + B[1]; for (i=1; i<=99; i=i+1) { B[i+1] = C[i] + D[i]; A[i+1] = A[i+1] + B[i+1]; } B[101] = C[100] + D[100]; Loop Start-up code Loop Completion code

COMP381 by M. Hamdi 15 LLP Analysis Example 2 Original Loop: A[100] = A[100] + B[100]; B[101] = C[100] + D[100]; A[1] = A[1] + B[1]; B[2] = C[1] + D[1]; A[2] = A[2] + B[2]; B[3] = C[2] + D[2]; A[99] = A[99] + B[99]; B[100] = C[99] + D[99]; A[100] = A[100] + B[100]; B[101] = C[100] + D[100]; A[1] = A[1] + B[1]; B[2] = C[1] + D[1]; A[2] = A[2] + B[2]; B[3] = C[2] + D[2]; A[99] = A[99] + B[99]; B[100] = C[99] + D[99]; for (i=1; i<=100; i=i+1) { A[i] = A[i] + B[i]; /* S1 */ B[i+1] = C[i] + D[i]; /* S2 */ } A[1] = A[1] + B[1]; for (i=1; i<=99; i=i+1) { B[i+1] = C[i] + D[i]; A[i+1] = A[i+1] + B[i+1]; } B[101] = C[100] + D[100]; Modified Parallel Loop: Iteration 1 Iteration 2 Iteration 100Iteration 99 Loop-carried Dependence Loop Start-up code Loop Completion code Iteration 1 Iteration 98 Iteration 99 Not Loop Carried Dependence.....

COMP381 by M. Hamdi 1 Loop Level Parallelism Instruction Level Parallelism: Loop Level Parallelism.

Similar presentations

Presentation on theme: "COMP381 by M. Hamdi 1 Loop Level Parallelism Instruction Level Parallelism: Loop Level Parallelism."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

COMP381 by M. Hamdi 1 Loop Level Parallelism Instruction Level Parallelism: Loop Level Parallelism.

Similar presentations

Presentation on theme: "COMP381 by M. Hamdi 1 Loop Level Parallelism Instruction Level Parallelism: Loop Level Parallelism."— Presentation transcript:

Similar presentations

About project

Feedback