
Slide 1: ILP (Recap)

Slide 2: Instruction-Level Parallelism

Basic block (BB) ILP is quite small:
- A BB is a straight-line code sequence with no branches in except at the entry and no branches out except at the exit.
- The average dynamic branch frequency is 15% to 25%, so only 4 to 7 instructions execute between a pair of branches.
- Moreover, the instructions within a BB are likely to depend on each other.

To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks. The simplest approach is loop-level parallelism: exploiting parallelism among the iterations of a loop.

Slide 3: Loop Unrolling Example — Key to Increasing ILP

For the loop:

for (i = 1; i <= 1000; i++)
    x[i] = x[i] + s;

the straightforward MIPS assembly code is:

Loop:  L.D    F0, 0(R1)     ; F0 = x[i]
       ADD.D  F4, F0, F2    ; add the scalar s (held in F2)
       S.D    F4, 0(R1)     ; store the result back to x[i]
       SUBI   R1, R1, #8    ; decrement pointer by 8 bytes (one double)
       BNEZ   R1, Loop      ; branch if R1 != 0

Assumed FP latencies:

Instruction producing result | Instruction using result | Latency (clock cycles)
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1
Load double                  | Store double             | 0
Integer op                   | Integer op               | 0

Slide 4: Loop Showing Stalls and Code Re-arrangement

Original schedule:

1  Loop: L.D    F0, 0(R1)
2        stall               ; load -> FP ALU: 1 stall
3        ADD.D  F4, F0, F2
4        stall               ; FP ALU -> store: 2 stalls
5        stall
6        S.D    F4, 0(R1)
7        SUBI   R1, R1, #8
8        BNEZ   R1, Loop
9        stall               ; branch delay slot

9 clock cycles per loop iteration.

Rearranged code, with S.D moved into the branch delay slot (its offset becomes 8(R1) because SUBI now executes first):

1  Loop: L.D    F0, 0(R1)
2        stall
3        ADD.D  F4, F0, F2
4        SUBI   R1, R1, #8
5        BNEZ   R1, Loop
6        S.D    F4, 8(R1)

The code now takes 6 clock cycles per loop iteration: speedup = 9/6 = 1.5. The cycle count cannot be reduced further because the loop body is small and the loop overhead (SUBI R1, R1, #8 and BNEZ R1, Loop) remains in every iteration.

Slide 5: Basic Loop Unrolling Concept

(Figure: a loop of 4n iterations is transformed into n iterations, each performing the work of 4 original iterations.)

Slide 6: Unroll Loop Four Times to Expose More ILP and Reduce Loop Overhead

Unrolled, unscheduled:

1   Loop: L.D    F0, 0(R1)
2         ADD.D  F4, F0, F2
3         S.D    F4, 0(R1)      ; drop SUBI & BNEZ
4         L.D    F6, -8(R1)
5         ADD.D  F8, F6, F2
6         S.D    F8, -8(R1)     ; drop SUBI & BNEZ
7         L.D    F10, -16(R1)
8         ADD.D  F12, F10, F2
9         S.D    F12, -16(R1)   ; drop SUBI & BNEZ
10        L.D    F14, -24(R1)
11        ADD.D  F16, F14, F2
12        S.D    F16, -24(R1)
13        SUBI   R1, R1, #32
14        BNEZ   R1, Loop
15        stall

15 + 4 × (2 + 1) = 27 clock cycles, or 6.8 cycles per iteration (2 stalls after each ADD.D and 1 stall after each L.D).

Unrolled and scheduled:

1   Loop: L.D    F0, 0(R1)
2         L.D    F6, -8(R1)
3         L.D    F10, -16(R1)
4         L.D    F14, -24(R1)
5         ADD.D  F4, F0, F2
6         ADD.D  F8, F6, F2
7         ADD.D  F12, F10, F2
8         ADD.D  F16, F14, F2
9         S.D    F4, 0(R1)
10        S.D    F8, -8(R1)
11        S.D    F12, -16(R1)
12        SUBI   R1, R1, #32
13        BNEZ   R1, Loop
14        S.D    F16, 8(R1)     ; in the branch delay slot; 8 = 32 - 24

14 clock cycles, or 3.5 clock cycles per iteration.

To do this, the compiler (or hardware) must be able to:
- determine data dependences,
- re-arrange code, and
- rename registers.

Slide 7: Loop-Level Parallelism (LLP) Analysis

LLP analysis focuses on whether data accesses in later iterations of a loop depend on data values produced in earlier iterations. For example, in

for (i = 1; i <= 1000; i++)
    x[i] = x[i] + s;

the computation in each iteration is independent of previous iterations, so the loop is parallel; the two uses of x[i] are within a single iteration, and the loop iterations are independent of each other.

Loop-carried dependence: a data dependence between different loop iterations, where data produced in an earlier iteration is used in a later one. Such dependences limit parallelism.

Instruction-level parallelism (ILP) analysis, by contrast, is usually done on the instructions generated by the compiler.

Slide 8: LLP Analysis Example 1

In the loop (where A, B, C are distinct, non-overlapping arrays):

for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}

- S2 uses the value A[i+1] computed by S1 in the same iteration. This dependence is within a single iteration (not loop-carried) and does not prevent loop parallelism.
- S1 uses a value computed by S1 in an earlier iteration: iteration i computes A[i+1], which is read in iteration i+1. This is a loop-carried dependence and prevents parallelism. The same applies to S2 for B[i] and B[i+1].

These two dependences are loop-carried, spanning more than one iteration, and they prevent loop parallelism.

Slide 9: LLP Analysis Example 2

In the loop:

for (i = 1; i <= 100; i = i + 1) {
    A[i]   = A[i] + B[i];    /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

- S1 uses the value B[i] computed by S2 in the previous iteration: a loop-carried dependence.
- This dependence is not circular: S1 depends on S2, but S2 does not depend on S1.
- The loop can therefore be made parallel by rewriting it as:

A[1] = A[1] + B[1];                 /* loop start-up code */
for (i = 1; i <= 99; i = i + 1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];           /* loop completion code */

Slide 10: LLP Analysis Example 2 (continued)

Original loop:

for (i = 1; i <= 100; i = i + 1) {
    A[i]   = A[i] + B[i];    /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

Unrolled view, showing the loop-carried dependence (the B[i+1] written by S2 in iteration i is read by S1 in iteration i+1):

Iteration 1:   A[1]   = A[1]   + B[1];    B[2]   = C[1]  + D[1];
Iteration 2:   A[2]   = A[2]   + B[2];    B[3]   = C[2]  + D[2];
...
Iteration 99:  A[99]  = A[99]  + B[99];   B[100] = C[99] + D[99];
Iteration 100: A[100] = A[100] + B[100];  B[101] = C[100] + D[100];

Modified parallel loop (the remaining dependence is within an iteration, not loop-carried):

A[1] = A[1] + B[1];                 /* loop start-up code */
for (i = 1; i <= 99; i = i + 1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];           /* loop completion code */
