Loop-Level Parallelism


1 Loop-Level Parallelism
Analysis at the source level
Dependences across iterations

    for (i=1000; i>0; i=i-1)
        x[i] = x[i] + s;          /* independent iterations: no loop-carried dependence */

    for (i=1; i<=100; i=i+1) {
        x[i+1] = x[i] + z[i];     /* loop-carried dependence */
        y[i+1] = y[i] + x[i+1];
    }

2 Loop-Carried Dependences
    for (i=1; i<=100; i=i+1) {
        x[i] = x[i] + y[i];       /* uses y[i] produced by the previous iteration */
        y[i+1] = w[i] + z[i];
    }

Non-circular dependences can be eliminated by transforming the loop:

    x[1] = x[1] + y[1];
    for (i=1; i<=99; i=i+1) {
        y[i+1] = w[i] + z[i];
        x[i+1] = x[i+1] + y[i+1];
    }
    y[101] = w[100] + z[100];

3 Compiler support for ILP
Dependence analysis
    Finding dependences is important for:
        Good scheduling of code
        Determining loop-level parallelism
        Eliminating name dependences
    Complexity
        Simple for scalar variable references
        Complex for pointers and array references (see the sketch below)
Software pipelining
Trace scheduling
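
To illustrate the complexity for pointers, a short C sketch (an illustrative addition, not from the original slides): with plain arrays the compiler can often prove that two accesses are disjoint, but with pointer parameters it must conservatively assume they may alias.

    /* The compiler cannot tell whether p and q overlap, so it must
       assume a possible dependence between q[i] and p[i] and schedule
       conservatively. C99 'restrict' lets the programmer assert that
       the two pointers do not alias. */
    void scale(int *p, int *q, int n) {
        int i;
        for (i = 0; i < n; i++)
            p[i] = q[i] + 1;    /* q may point into p */
    }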

4 Loop-level Parallelism
Primary focus of dependence analysis:
determine all dependences and find cycles

    for (i=1; i<=100; i=i+1) {
        x[i] = y[i] + z[i];       /* no loop-carried dependence */
        w[i] = x[i] + v[i];
    }

    for (i=1; i<=100; i=i+1) {
        x[i+1] = x[i] + z[i];     /* loop-carried, recurrent, circular dependence */
    }

    for (i=1; i<=100; i=i+1) {
        x[i] = x[i] + y[i];       /* loop-carried but non-circular */
        y[i+1] = w[i] + z[i];
    }

The non-circular case transformed to remove the loop-carried dependence:

    x[1] = x[1] + y[1];
    for (i=1; i<=99; i=i+1) {
        y[i+1] = w[i] + z[i];
        x[i+1] = x[i+1] + y[i+1];
    }
    y[101] = w[100] + z[100];

5 Dependence Analysis Algorithms
Assume array indexes are affine: a*i + b
GCD test: for a store to x[a*i + b] and a load of x[c*i + d],
if a loop-carried dependence exists, then GCD(a,c) must divide (d - b)

    x[8*i] = x[4*i + 2] + 3;

Here a=8, b=0, c=4, d=2: GCD(8,4) = 4 does not divide (2 - 0) = 2,
so no loop-carried dependence is possible (see the sketch below)
Exact dependence testing in the general case is NP-complete
a, b, c, and d may not be known at compile time
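
A minimal C sketch of the GCD test (the helper names are mine, and it assumes nonzero coefficients). Note the test is conservative: if the GCD divides (d - b), a dependence is only possible, not certain.

    #include <stdio.h>

    /* Euclid's algorithm. */
    static int gcd(int a, int b) {
        while (b != 0) { int t = b; b = a % b; a = t; }
        return a;
    }

    /* GCD test for a store to x[a*i + b] and a load of x[c*i + d]:
       returns 0 when a loop-carried dependence is impossible. */
    static int gcd_test(int a, int b, int c, int d) {
        return (d - b) % gcd(a, c) == 0;
    }

    int main(void) {
        /* x[8*i] = x[4*i + 2] + 3: gcd(8,4) = 4 does not divide 2 */
        printf("%d\n", gcd_test(8, 0, 4, 2));   /* prints 0 */
        return 0;
    }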

6 Example
Name dependences can be eliminated by renaming:

    for (i=1; i<=100; i=i+1) {
        Y[i] = X[i] / c;      /* output dependence with line 4, anti-dependence with line 2 */
        X[i] = X[i] + c;
        Z[i] = Y[i] + c;
        Y[i] = c - Y[i];
    }

    for (i=1; i<=100; i=i+1) {
        T[i] = X[i] / c;      /* Y renamed to T, X renamed to X1 */
        X1[i] = X[i] + c;
        Z[i] = T[i] + c;
        Y[i] = c - T[i];
    }

7 Software Pipelining
[Figure: iterations 0 through 3 overlap in time; one instruction from
each original iteration is drawn together as a single software-pipelined
iteration, preceded by start-up code and followed by finish-up code]
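
A hypothetical C-level rendering of the same idea, using the earlier x[i] = x[i] + s loop (the split into load/add/store stages mirrors the assembly example on the next slide; the variable names and the n >= 3 assumption are mine):

    /* Software-pipelined form of: for (i = 0; i < n; i++) x[i] = x[i] + s;
       Each kernel iteration mixes stages from three different iterations. */
    void add_scalar_pipelined(double *x, double s, int n) {
        double t_load, t_add;
        int i;
        t_load = x[0];                  /* start-up: load for iteration 0 */
        t_add  = t_load + s;            /* start-up: add  for iteration 0 */
        t_load = x[1];                  /* start-up: load for iteration 1 */
        for (i = 0; i < n - 2; i++) {   /* kernel */
            x[i]   = t_add;             /* store for iteration i     */
            t_add  = t_load + s;        /* add   for iteration i + 1 */
            t_load = x[i + 2];          /* load  for iteration i + 2 */
        }
        x[n - 2] = t_add;               /* finish-up */
        x[n - 1] = t_load + s;
    }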

8 Example
Unrolled view (one row per iteration; the pipelined loop body takes the
SD from iteration i, the ADDD from iteration i+1, the LD from iteration i+2):

    Iteration i:     LD F0, 0(R1);  ADDD F4, F0, F2;  SD 0(R1), F4
    Iteration i+1:   LD F0, 0(R1);  ADDD F4, F0, F2;  SD 0(R1), F4
    Iteration i+2:   LD F0, 0(R1);  ADDD F4, F0, F2;  SD 0(R1), F4

Original loop:

    Loop:   LD   F0, 0(R1)
            ADDD F4, F0, F2
            SD   0(R1), F4
            SUBI R1, R1, #8
            BNEZ R1, Loop

Software-pipelined loop:

    Loop:   SD   16(R1), F4     ; store for iteration i
            ADDD F4, F0, F2     ; add for iteration i+1
            LD   F0, 0(R1)      ; load for iteration i+2
            SUBI R1, R1, #8
            BNEZ R1, Loop

9 Trace Scheduling
Find ILP across conditional branches: a two-step process
Trace selection
    Find a trace (a sequence of basic blocks)
    Use loop unrolling to generate long traces
    Use static branch prediction for other conditional branches
Trace compaction
    Squeeze the trace into a small number of wide instructions
    Preserve data and control dependences

10 Trace Selection
Source fragment and control flow:

    A[i] = A[i] + B[i];
    if (A[i] == 0)
        B[i] = ...;     /* predicted path, on the trace */
    else
        X;              /* off-trace path */
    C[i] = ...;

Selected trace in assembly (the else path X is left off the trace):

            LW   R4, 0(R1)      ; load A[i]
            LW   R5, 0(R2)      ; load B[i]
            ADD  R4, R4, R5
            SW   0(R1), R4      ; A[i] = A[i] + B[i]
            BNEZ R4, else
            SW   0(R2), . . .   ; B[i] = ...
            J    join
    else:   X
    join:   SW   0(R3), . . .   ; C[i] = ...

11 Summary of Compiler Techniques
Try to avoid dependence stalls
Loop unrolling
    Reduces loop overhead (see the sketch below)
Software pipelining
    Reduces dependence stalls within a single loop body
Trace scheduling
    Reduces the impact of other branches
Compilers use a mix of all three
All techniques depend on prediction accuracy
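
Loop unrolling is listed above but never shown; a minimal sketch that unrolls the slide-1 loop by a factor of four (it relies on the trip count, 1000, being a multiple of 4):

    /* Original: for (i = 1000; i > 0; i = i - 1) x[i] = x[i] + s;
       Unrolled by 4: one quarter of the branch/decrement overhead,
       and more independent operations for the scheduler. */
    void add_scalar_unrolled(double *x, double s) {
        int i;
        for (i = 1000; i > 0; i = i - 4) {
            x[i]     = x[i]     + s;
            x[i - 1] = x[i - 1] + s;
            x[i - 2] = x[i - 2] + s;
            x[i - 3] = x[i - 3] + s;
        }
    }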

12 Hardware-based ILP Techniques
Limitation of static techniques
    Limited ability to predict branch behavior at compile time
Hardware-based schemes
    Conditional or predicated instructions
        Extend the ISA
    Speculation
        Static: hardware support for compiler speculation
        Dynamic: use branch prediction to guide the speculation process

13 Predicated Instructions
Condition evaluation is part of the instruction execution
    If true, the instruction executes normally
    If false, the instruction is replaced by a no-op
Conditional move example: if (A == 0) S = T;

    With a branch:              With a conditional move:
        BNEZ  R1, L                 CMOVZ R2, R3, R1
        MOV   R2, R3
    L:  ...
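
In C terms, the conditional move turns the control dependence into a data dependence; a small sketch of the equivalence (function names are illustrative):

    /* Semantics of CMOVZ R2, R3, R1: if (R1 == 0) R2 = R3.
       Both forms compute the same result; a compiler can emit a
       conditional move for the branchless one. */
    int with_branch(int s, int t, int a) { if (a == 0) s = t; return s; }
    int branchless(int s, int t, int a)  { return (a == 0) ? t : s; }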

14 Limitations of Predicated Instructions
Annulled conditional instructions still take execution time
Useful only when the condition can be evaluated early
Data dependences may prevent separating the conditional instruction
from the branch it replaces
Limited help when control flow involves more than a simple alternative
sequence (moving an instruction across multiple branches)
Conditional instructions may be more expensive than their
unconditional counterparts

15 Compiler Speculation
Conditional instructions can be used for limited speculative computation
If the compiler can predict branches accurately, it can use speculation for:
    Improving scheduling (eliminating stalls)
    Increasing the issue rate (IPC)
Challenge: maintaining exception behavior
    Distinguishing resumable from terminating exceptions
Methods for aggressive speculation:
    Hardware-software cooperation
    Poison bits
    Renaming

16 Hardware-Software Cooperation
Hardware and operating system handle exceptions:
    Resumable exceptions are processed as usual
    Terminating exceptions return an undefined value to speculative
    instructions instead of faulting
Correct programs never fail (incorrect ones may silently get a wrong
value instead of an exception)

Original code for if (A == 0) A = B; else A = A + 4;

        LW   R1, 0(R3)      ; load A
        BNEZ R1, L1
        LW   R1, 0(R2)      ; then: A = B
        J    L2
    L1: ADDI R1, R1, #4     ; else: A = A + 4
    L2: SW   0(R3), R1      ; store A

Speculated code: the load of B is hoisted above the branch

        LW   R1, 0(R3)      ; load A
        LW   R14, 0(R2)     ; speculative load of B
        BEQZ R1, L3
        ADDI R14, R1, #4    ; else: A = A + 4
    L3: SW   0(R3), R14     ; store A

Stores cannot be speculated (only registers can be renamed)

17 Poison Bits
Less change to the exception behavior:
incorrect programs still cause exceptions when speculation is used
Poison bits: one for each register and one for each instruction
    Destination-register bit: set ON when a speculative instruction
    produces a terminating exception
    Instruction bit: set ON for speculative instructions
No poison bits for memory => stores are not speculative
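
A minimal C sketch of how the mechanism might be modeled (the register-file layout and the fault stand-in are assumptions, not from the slides): a speculative load that faults poisons its destination instead of raising the exception, and a later non-speculative use of that register raises the deferred exception.

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        int64_t value[32];
        bool    poison[32];   /* one poison bit per register */
    } RegFile;

    /* Speculative load: defer a terminating exception by poisoning
       the destination (NULL stands in for a faulting access). */
    void spec_load(RegFile *rf, int rd, const int64_t *addr) {
        if (addr == NULL) { rf->poison[rd] = true; return; }
        rf->value[rd]  = *addr;
        rf->poison[rd] = false;
    }

    /* Non-speculative use: reading a poisoned register raises the
       exception that was deferred at the speculative load. */
    int64_t use(const RegFile *rf, int rs) {
        if (rf->poison[rs]) {
            /* raise the deferred terminating exception here */
        }
        return rf->value[rs];
    }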

18 Renaming
Renaming and buffering in hardware (as in Tomasulo's algorithm)
Boosted instructions (moved across a branch)
    Execute before the controlling branch
    Commit or abort once the branch is resolved
    Other boosted instructions can use the temporary results

        LW   R1, 0(R3)      ; load A
        LW   R1, 0(R2)      ; boosted load of B: hardware renames this R1,
                            ; so BEQZ still tests the original value of A
        BEQZ R1, L3
        ADDI R1, R1, #4
    L3: SW   0(R3), R1

19 Dynamic Speculation
Dataflow execution:
    Dynamic branch prediction
    Speculation
    Dynamic scheduling
Advantages
    Memory references can be disambiguated at run time
    Hardware-based branch prediction is better than static prediction
    Precise exception model, even for speculated instructions
    No need for compensation or bookkeeping code
    Portability across hardware platforms
Disadvantage: complex hardware

20 Implementation of Dynamic Speculation
Extend Tomasulo's algorithm
    Execute and bypass results out of order
    Commit in order
Reorder buffer
    Provides additional virtual registers (can serve as operands)
    Stores speculative results (before commit)
    Integrates the function of the store and load buffers
    Fields: instruction type, destination, and value
    Easy to undo speculated instructions on mispredicted branches
    or exceptions => precise exceptions
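
A sketch of one reorder-buffer entry holding the three fields named above (the type enum and the ready flag are illustrative assumptions):

    #include <stdbool.h>
    #include <stdint.h>

    typedef enum { ROB_ALU, ROB_LOAD, ROB_STORE, ROB_BRANCH } RobType;

    /* One reorder-buffer entry. Results wait here after out-of-order
       execution and update the register file (or memory, for stores)
       only when the entry commits in order; on a mispredicted branch
       or exception, younger entries are simply discarded. */
    typedef struct {
        RobType type;     /* instruction type                      */
        int     dest;     /* destination register or store address */
        int64_t value;    /* speculative result                    */
        bool    ready;    /* result produced, awaiting commit      */
    } RobEntry;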

