CPE 631 Lecture 14: Exploiting ILP with SW Approaches (2)


1 CPE 631 Lecture 14: Exploiting ILP with SW Approaches (2)
Aleksandar Milenković, Electrical and Computer Engineering, University of Alabama in Huntsville

2 Outline
Basic Pipeline Scheduling and Loop Unrolling
Multiple Issue: Superscalar, VLIW
Software Pipelining

3 ILP: Concepts and Challenges
The potential to overlap the execution of unrelated instructions is called Instruction-Level Parallelism (ILP). Here we are particularly interested in techniques that increase the amount of parallelism exploited among instructions by
- reducing the impact of data and control hazards, and
- increasing the processor's ability to exploit parallelism.
The CPI of a pipelined machine is the sum of the base CPI and all contributions from stalls:
Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
By reducing each term on the right-hand side we minimize CPI and thus increase instruction throughput.
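The CPI equation is a plain sum, so each stall category contributes independently. A minimal sketch (the function name and the example numbers are illustrative, not from the lecture):

```c
#include <assert.h>

/* Pipeline CPI = ideal CPI + per-instruction stall contributions.
   Each argument is the average number of stall cycles per instruction
   caused by that hazard class. */
double pipeline_cpi(double ideal, double structural, double raw,
                    double war, double waw, double control) {
    return ideal + structural + raw + war + waw + control;
}
```

For example, an ideal CPI of 1.0 with 0.5 RAW stall cycles per instruction gives a CPI of 1.5; instruction throughput (IPC) is simply 1/CPI.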

4 Basic Pipeline Scheduling: Example
Consider a simple loop that adds a scalar value s to an array x:

for (i = 1; i <= 1000; i++)
    x[i] = x[i] + s;

This loop is parallel, since the body of each iteration is independent. Throughout this lecture we assume the FP latencies below; the first column is the originating instruction type, the second the consuming instruction type, and the last the number of intervening clock cycles needed to avoid a stall.

Instruction producing result | Instruction using result | Latency (clock cycles)
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1
Load double                  | Store double             | 0
Integer op                   | Integer op               | 0

The first step is to translate the code to DLX assembly. We assume R1 points to the last element of the array, x[999], F2 contains the scalar s, and, for simplicity, that the element with the lowest address is at address 0. The straightforward DLX code, not scheduled for the pipeline, looks like this:

Loop: LD   F0, 0(R1)   ; F0 = array element
      ADDD F4, F0, F2  ; add scalar in F2
      SD   0(R1), F4   ; store result
      SUBI R1, R1, #8  ; decrement pointer
      BNEZ R1, Loop    ; branch if R1 != 0

5 Executing the FP Loop
 1. Loop: LD   F0, 0(R1)
 2.       stall
 3.       ADDD F4, F0, F2
 4.       stall
 5.       stall
 6.       SD   0(R1), F4
 7.       SUBI R1, R1, #8
 8.       stall
 9.       BNEZ R1, Loop
10.       stall
With the latencies assumed above, we need one stall after the LD (its destination is a source of the ADDD), two stalls after the ADDD, one stall after the SUBI (the BNEZ uses R1), and one for the delayed branch. The code therefore takes 10 clock cycles per iteration, 5 of them stalls (50%). Can we rewrite the code to minimize stalls?

6 Revised FP Loop to Minimize Stalls
1. Loop: LD   F0, 0(R1)
2.       SUBI R1, R1, #8
3.       ADDD F4, F0, F2
4.       stall
5.       BNEZ R1, Loop   ; delayed branch
6.       SD   8(R1), F4  ; altered and interchanged with SUBI
The SUBI is moved up to follow the LD, eliminating the load stall, and the SD is moved into the delayed-branch slot, leaving only one stall after the ADDD. To schedule the delayed branch, the compiler had to determine that it could swap the SUBI and SD by changing the address to which the SD stores: 0(R1) is replaced with 8(R1). This is not a trivial observation; most compilers would see that the SD depends on the SUBI and would refuse to interchange them, but a smarter compiler can figure out the relationship and perform the interchange.
Overall we need 6 clock cycles per iteration (1 stall), but only 3 instructions do the actual work on the array element (LD, ADDD, SD); the remaining 3 cycles are loop overhead (SUBI, BNEZ) and a stall. To eliminate them we need more operations in the loop relative to the overhead instructions, so we unroll the loop 4 times to improve the potential for instruction scheduling.

7 Unrolled Loop
Loop unrolling simply replicates the loop body multiple times and adjusts the loop termination code (here we assume the number of iterations is a multiple of 4). We drop the unnecessary SUBI and BNEZ instructions, eliminating 3 branches and 3 SUBIs:
Loop: LD   F0, 0(R1)
      ADDD F4, F0, F2   ; 1-cycle stall after the LD
      SD   0(R1), F4    ; 2-cycle stall after the ADDD; drop SUBI & BNEZ
      LD   F0, -8(R1)
      ADDD F4, F0, F2
      SD   -8(R1), F4   ; drop SUBI & BNEZ
      LD   F0, -16(R1)
      ADDD F4, F0, F2
      SD   -16(R1), F4  ; drop SUBI & BNEZ
      LD   F0, -24(R1)
      ADDD F4, F0, F2
      SD   -24(R1), F4
      SUBI R1, R1, #32
      BNEZ R1, Loop
Here we reuse the same registers for all iterations, which prevents effective scheduling: without scheduling, each operation is followed by a dependent one. This loop runs 28 clock cycles per pass, 14 of them stalls (each LD stalls 1 cycle, each ADDD 2, the SUBI 1, the BNEZ 1), plus 14 instruction issue cycles. That is 28/4 = 7 clock cycles per array element, even slower than the scheduled version! We must rewrite the loop to minimize stalls.
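The 28-cycle count can be reproduced from the per-instruction stalls in the latency table (a quick arithmetic check; the function name is illustrative):

```c
#include <assert.h>

/* 4 unrolled iterations: each LD stalls 1 cycle (its result feeds the
   ADDD), each ADDD stalls 2 (its result feeds the SD), the SUBI stalls
   1 (its result feeds the BNEZ), and the branch adds 1 delay-slot stall.
   The 14 instructions themselves take 14 issue cycles. */
int unrolled_loop_cycles(void) {
    int issue  = 4 * 3 + 2;              /* 4 x (LD, ADDD, SD) + SUBI + BNEZ */
    int stalls = 4 * 1 + 4 * 2 + 1 + 1;  /* LD + ADDD + SUBI + branch stalls */
    return issue + stalls;
}
```

Dividing by the 4 elements processed per pass gives the 7 cycles per element quoted on the slide.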

8 Where Are the Name Dependencies?
1  Loop: LD   F0, 0(R1)
2        ADDD F4, F0, F2
3        SD   0(R1), F4    ; drop SUBI & BNEZ
4        LD   F0, -8(R1)
5        ADDD F4, F0, F2
6        SD   -8(R1), F4   ; drop SUBI & BNEZ
7        LD   F0, -16(R1)
8        ADDD F4, F0, F2
9        SD   -16(R1), F4  ; drop SUBI & BNEZ
10       LD   F0, -24(R1)
11       ADDD F4, F0, F2
12       SD   -24(R1), F4
13       SUBI R1, R1, #32  ; alter to 4*8
14       BNEZ R1, Loop
15       NOP
Each LD of F0 has an output dependence on the next LD of F0, and each ADDD that reads F0 has an antidependence on the following LD of F0 (and likewise for F4 among the ADDDs and SDs). How can we remove them?

9 Where Are the Name Dependencies?
1  Loop: L.D   F0, 0(R1)
2        ADD.D F4, F0, F2
3        S.D   0(R1), F4    ; drop SUBI & BNEZ
4        L.D   F6, -8(R1)
5        ADD.D F8, F6, F2
6        S.D   -8(R1), F8   ; drop SUBI & BNEZ
7        L.D   F10, -16(R1)
8        ADD.D F12, F10, F2
9        S.D   -16(R1), F12 ; drop SUBI & BNEZ
10       L.D   F14, -24(R1)
11       ADD.D F16, F14, F2
12       S.D   -24(R1), F16
13       SUBI  R1, R1, #32  ; alter to 4*8
14       BNEZ  R1, Loop
15       NOP
This is the original "register renaming": using a different register for each iteration removes the name dependencies.

10 Unrolled Loop That Minimizes Stalls
Loop: LD   F0, 0(R1)
      LD   F6, -8(R1)
      LD   F10, -16(R1)
      LD   F14, -24(R1)
      ADDD F4, F0, F2
      ADDD F8, F6, F2
      ADDD F12, F10, F2
      ADDD F16, F14, F2
      SD   0(R1), F4
      SD   -8(R1), F8
      SUBI R1, R1, #32
      SD   16(R1), F12  ; was -16(R1), adjusted for the moved SUBI
      BNEZ R1, Loop
      SD   8(R1), F16   ; was -24(R1), in the delayed-branch slot
This loop runs 14 cycles (no stalls) per pass, or 14/4 = 3.5 cycles per array element! Assumptions that make this possible:
- move the LDs before the SDs
- move an SD after the SUBI and BNEZ, adjusting its offset
- use different registers for each iteration, to avoid unnecessary constraints forced by reusing the same registers for different computations (note the increased register count; recall the discussion of GPRs in the ISA lecture: more is better)
When is it safe for the compiler to make such changes? The key requirement underlying all of these transformations is an understanding of how one instruction depends on another and how instructions can be changed or reordered given those dependences, while preserving any dependence needed to yield the same result as the original code. The next slides define the main ideas and the restrictions that must be maintained.

11 Steps the Compiler Performed to Unroll
1. Determine that it is OK to move the SD after the SUBI and BNEZ, and find the amount by which to adjust the SD offset.
2. Determine that unrolling the loop will be useful by finding that the loop iterations are independent.
3. Rename registers to avoid name dependencies.
4. Eliminate the extra test and branch instructions and adjust the loop termination and iteration code.
5. Determine that the loads and stores in the unrolled loop can be interchanged by observing that loads and stores from different iterations are independent; this requires analyzing memory addresses and finding that they do not refer to the same address.
6. Schedule the code, preserving any dependences needed to yield the same result as the original code.
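At the source level, the transformation the compiler performs corresponds roughly to the following (a sketch; the function name is illustrative, and, as on the slides, the element count is assumed to be a multiple of 4):

```c
/* Add scalar s to every element of x, unrolled 4x.
   Assumes n is a multiple of 4, mirroring the lecture's assumption. */
void add_scalar_unrolled(double *x, int n, double s) {
    for (int i = 0; i < n; i += 4) {
        x[i]     += s;   /* iteration 1: LD, ADDD, SD        */
        x[i + 1] += s;   /* iterations 2-4 are independent,  */
        x[i + 2] += s;   /* so their LD/ADDD/SD triples can  */
        x[i + 3] += s;   /* be interleaved by the scheduler  */
    }                    /* one SUBI/BNEZ per 4 elements     */
}
```

The loop-maintenance cost (index update and branch) is now amortized over four elements, which is exactly what eliminates three of every four SUBI/BNEZ pairs in the assembly version.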

12 Decrease Ideal Pipeline CPI: Multiple Issue
Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
To decrease the ideal pipeline CPI, issue multiple instructions per cycle:
- Superscalar: statically scheduled (compiler techniques) or dynamically scheduled (Tomasulo's algorithm)
- VLIW (Very Long Instruction Word): parallelism is explicitly indicated by the instruction
- EPIC (Explicitly Parallel Instruction Computers)

13 Superscalar MIPS
Superscalar MIPS: 2 instructions per cycle, 1 FP and 1 of anything else.
- Fetch 64 bits per clock cycle: the integer instruction on the left, the FP instruction on the right
- Can only issue the 2nd instruction if the 1st instruction issues
- Need more ports on the FP registers to execute an FP load and an FP op as a pair
[Pipeline diagram: paired integer and FP instructions proceed through IF, ID, EX, WB each cycle; FP operations extend the EX cycle.]
In a typical superscalar processor, the hardware might issue from one to eight instructions per clock cycle. Usually these instructions must be independent and have to satisfy some constraints (e.g., no more than one memory reference issued per clock). Superscalar DLX: two instructions issued per clock (one Load/Branch/Store/ALU instruction, and the other can be any FP operation).

14 Loop Unrolling in Superscalar
Cycle     Integer instruction     FP instruction
 1  Loop: LD   F0, 0(R1)
 2        LD   F6, -8(R1)
 3        LD   F10, -16(R1)       ADDD F4, F0, F2
 4        LD   F14, -24(R1)       ADDD F8, F6, F2
 5        LD   F18, -32(R1)       ADDD F12, F10, F2
 6        SD   0(R1), F4          ADDD F16, F14, F2
 7        SD   -8(R1), F8         ADDD F20, F18, F2
 8        SD   -16(R1), F12
 9        SUBI R1, R1, #40
10        SD   16(R1), F16
11        BNEZ R1, Loop
12        SD   8(R1), F20
Unrolled 5 times to avoid delays. This loop runs 12 cycles (no stalls) per pass, or 12/5 = 2.4 cycles per array element.

15 Multiple Issue Processors
So far we have considered techniques that eliminate data and control stalls to approach an ideal CPI of 1. To improve performance further we would like to decrease the CPI below 1, which cannot be done if we issue only one instruction per clock cycle. Multiple-issue processors allow us to issue several instructions per clock cycle. There are two variations:
- Superscalar: a varying number of instructions per cycle (1 to 8), scheduled either statically by the compiler or dynamically by hardware (scoreboarding or Tomasulo's algorithm). Examples: IBM PowerPC, Sun UltraSPARC, DEC Alpha, HP 8000.
- (Very) Long Instruction Word ((V)LIW): a fixed number of instructions (4 to 16) formatted either as one large instruction or as a fixed instruction packet; operations are packed into wide templates and are inherently statically scheduled by the compiler. Examples: the Transmeta Crusoe VLIW processor; Intel Architecture-64 (IA-64), a 64-bit address architecture in the style Intel calls "Explicitly Parallel Instruction Computer" (EPIC).
The anticipated success of multiple issue led to the use of Instructions Per Clock cycle (IPC) rather than CPI.

16 The VLIW Approach
- VLIWs use multiple independent functional units
- VLIWs package the multiple operations into one very long instruction
- The compiler is responsible for choosing the instructions to be issued simultaneously
[Pipeline diagram: each long instruction proceeds through IF and ID, then its operations execute in parallel functional units before write-back.]

17 Loop Unrolling in VLIW
Cycle  Mem ref 1         Mem ref 2         FP op 1            FP op 2            Int/branch
1      LD F2, 0(R1)      LD F6, -8(R1)
2      LD F10, -16(R1)   LD F14, -24(R1)
3      LD F18, -32(R1)   LD F22, -40(R1)   ADDD F4, F0, F2    ADDD F8, F0, F6
4      LD F26, -48(R1)                     ADDD F12, F0, F10  ADDD F16, F0, F14
5                                          ADDD F20, F0, F18  ADDD F24, F0, F22
6      SD 0(R1), F4      SD -8(R1), F8     ADDD F28, F0, F26
7      SD -16(R1), F12   SD -24(R1), F16                                         SUBI R1, R1, #56
8      SD 24(R1), F20    SD 16(R1), F24                                          BNEZ R1, Loop
9      SD 8(R1), F28
Unrolled 7 times to avoid delays: 7 iterations complete in 9 clocks, or about 1.3 clocks per array element (1.8x the superscalar version). Average: 2.5 operations per clock, 50% efficiency. Note: VLIW needs more registers (15 vs. 6 in the superscalar version).
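Collecting the cycle counts quoted so far gives the per-element cost of each schedule; a quick check of the arithmetic (the helper is illustrative):

```c
#include <assert.h>
#include <math.h>

/* Cycles per array element = cycles per loop pass / elements per pass.
   Slide figures: unscheduled 10/1, scheduled 6/1, unrolled 28/4,
   unrolled+scheduled 14/4, superscalar 12/5, VLIW 9/7. */
double per_element(int cycles, int elems) {
    return (double)cycles / (double)elems;
}
```

The progression 10 → 6 → 7 → 3.5 → 2.4 → ~1.3 cycles per element also shows why the naively unrolled loop (7) is a step backward from the scheduled one (6) until renaming and scheduling are applied.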

18 Multiple Issue Challenges
While the integer/FP split is simple for the hardware, we get a CPI of 0.5 only for programs with exactly 50% FP operations and no hazards. If more instructions issue at the same time, decode and issue become harder: even a 2-scalar machine must examine 2 opcodes and 6 register specifiers and decide whether 1 or 2 instructions can issue.
VLIW trades instruction space for simple decoding:
- The long instruction word has room for many operations
- By definition, all the operations the compiler puts in the long instruction word are independent, so they can execute in parallel
- E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch, at 16 to 24 bits per field: 7*16 = 112 bits to 7*24 = 168 bits wide
- Needs a compiling technique that schedules across several branches

19 When Is It Safe to Unroll a Loop?
Example: where are the data dependencies? (A, B, C are distinct and nonoverlapping.)

for (i = 0; i < 100; i = i+1) {
    A[i+1] = A[i] + C[i];    /* S1 */
    B[i+1] = B[i] + A[i+1];  /* S2 */
}

1. S2 uses the value A[i+1] computed by S1 in the same iteration.
2. S1 uses a value computed by S1 in an earlier iteration, since iteration i computes A[i+1], which is read in iteration i+1. The same is true of S2 for B[i] and B[i+1].
This is a "loop-carried dependence": a dependence between iterations. In our prior example, each iteration was independent.

20 Does a Loop-Carried Dependence Mean There Is No Parallelism?
Consider:

for (i = 0; i < 8; i = i+1) {
    A = A + C[i];  /* S1 */
}

Could compute:

"Cycle 1": temp0 = C[0] + C[1]; temp1 = C[2] + C[3];
           temp2 = C[4] + C[5]; temp3 = C[6] + C[7];
"Cycle 2": temp4 = temp0 + temp1; temp5 = temp2 + temp3;
"Cycle 3": A = temp4 + temp5;

This relies on the associative nature of "+" (note that floating-point addition is only approximately associative, so the reordered sum may round differently).
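The three "cycles" above look like this in C (a sketch; valid only when the floating-point rounding differences from reordering the additions are acceptable):

```c
/* Tree reduction of 8 elements: 3 dependent steps instead of 7.
   Each line within a "cycle" is independent of its neighbors, so a
   machine with enough adders can execute each cycle in parallel. */
double tree_sum8(const double C[8]) {
    /* cycle 1: four independent partial sums */
    double t0 = C[0] + C[1], t1 = C[2] + C[3];
    double t2 = C[4] + C[5], t3 = C[6] + C[7];
    /* cycle 2: two independent sums */
    double t4 = t0 + t1, t5 = t2 + t3;
    /* cycle 3: the final result */
    return t4 + t5;
}
```

The serial loop has a chain of 7 dependent adds; the tree shortens the critical path to log2(8) = 3 adds.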

21 Another Example
Loop-carried dependences? To overlap iteration execution:

for (i = 1; i <= 100; i = i+1) {
    A[i] = A[i] + B[i];      /* S1 */
    B[i+1] = C[i] + D[i];    /* S2 */
}

S2 writes B[i+1], which S1 reads as B[i] in the next iteration; since the dependence is not circular, the loop can be transformed by peeling the first A update and the last B update:

A[1] = A[1] + B[1];
for (i = 1; i <= 99; i = i+1) {
    B[i+1] = C[i] + D[i];
    A[i+1] = A[i+1] + B[i+1];
}
B[101] = C[100] + D[100];
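Both forms of the loop compute the same values; a sketch that checks this directly (the bounds are written so that the peeled statements match, i.e. the original loop must reach i = 100 to produce B[101]; function names are illustrative):

```c
enum { N = 102 };  /* indices 1..101 used; index 0 unused */

/* Original loop: S2 writes B[i+1], which S1 reads as B[i] in the
   next iteration -- a loop-carried, but not circular, dependence. */
void original(double A[N], double B[N], const double C[N], const double D[N]) {
    for (int i = 1; i <= 100; i++) {
        A[i]   = A[i] + B[i];    /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }
}

/* Transformed loop: the dependence is now within a single iteration,
   so successive iterations can be overlapped. */
void transformed(double A[N], double B[N], const double C[N], const double D[N]) {
    A[1] = A[1] + B[1];
    for (int i = 1; i <= 99; i++) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];
}
```

Running both on identical inputs yields identical A and B arrays, confirming that the transformation preserves the original semantics.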

22 Another Possibility: Software Pipelining
Observation: if iterations from loops are independent, then we can get more ILP by taking instructions from different iterations.
Software pipelining reorganizes loops so that each iteration is made from instructions chosen from different iterations of the original loop (roughly, Tomasulo's algorithm in software).

23 Software Pipelining Example
Before: unrolled 3 times
 1 LD   F0, 0(R1)
 2 ADDD F4, F0, F2
 3 SD   0(R1), F4
 4 LD   F6, -8(R1)
 5 ADDD F8, F6, F2
 6 SD   -8(R1), F8
 7 LD   F10, -16(R1)
 8 ADDD F12, F10, F2
 9 SD   -16(R1), F12
10 SUBI R1, R1, #24
11 BNEZ R1, Loop

After: software pipelined
1 SD   0(R1), F4    ; stores into M[i]
2 ADDD F4, F0, F2   ; adds to M[i-1]
3 LD   F0, -16(R1)  ; loads M[i-2]
4 SUBI R1, R1, #8
5 BNEZ R1, Loop

5 cycles per iteration, with the overlapped operations drawn from three consecutive original iterations. Symbolic loop unrolling:
- maximizes the result-use distance
- takes less code space than unrolling
- fills and drains the pipeline only once per loop, vs. once per unrolled iteration in loop unrolling
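A C-level sketch of the same idea for x[i] = x[i] + s (my restructuring, not from the slides; it walks from high to low index like the assembly and needs a prologue and epilogue to fill and drain the software pipeline):

```c
/* Software-pipelined x[i] = x[i] + s: each steady-state iteration mixes
   a store for iteration i, an add for iteration i-1, and a load for
   iteration i-2, so no result is used in the step that produces it. */
void add_scalar_swp(double *x, int n, double s) {
    if (n < 3) {                      /* too short to pipeline */
        for (int i = 0; i < n; i++) x[i] += s;
        return;
    }
    double v_add  = x[n-1] + s;       /* prologue: add for iteration n-1  */
    double v_load = x[n-2];           /* prologue: load for iteration n-2 */
    for (int i = n - 1; i >= 2; i--) {
        x[i]   = v_add;               /* store for iteration i            */
        v_add  = v_load + s;          /* add for iteration i-1            */
        v_load = x[i-2];              /* load for iteration i-2           */
    }
    x[1] = v_add;                     /* epilogue: drain the pipeline     */
    x[0] = v_load + s;
}
```

The pipeline is filled once before the loop and drained once after it, in contrast to loop unrolling, where every unrolled body starts and finishes its own load-add-store chain.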

24 Things to Remember
Pipeline CPI = Ideal pipeline CPI + Structural stalls + RAW stalls + WAR stalls + WAW stalls + Control stalls
- Loop unrolling to minimize stalls
- Multiple issue to minimize CPI: superscalar processors, VLIW architectures

25 Statically Scheduled Superscalar
E.g., a four-issue static superscalar: 4 instructions make one issue packet. The fetch unit examines each instruction in the packet in program order; an instruction cannot be issued if it would cause a structural or data hazard, either due to an instruction earlier in the issue packet or due to an instruction already in execution. The processor can therefore issue from 0 to 4 instructions per clock cycle.
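The in-order issue check can be modeled with a toy packet scan (a simplified model of my own: it checks only RAW hazards inside the packet, whereas real hardware also checks structural hazards and dependences on instructions already in flight; all names are illustrative):

```c
typedef struct { int dst, src1, src2; } Instr;  /* register ids, -1 = none */

/* Does pkt[i] read a register written by an earlier instr in the packet? */
static int raw_in_packet(const Instr pkt[], int i) {
    for (int j = 0; j < i; j++)
        if (pkt[j].dst >= 0 &&
            (pkt[i].src1 == pkt[j].dst || pkt[i].src2 == pkt[j].dst))
            return 1;
    return 0;
}

/* Examine the packet in program order; issue stops at the first hazard.
   With checks against in-flight instructions added, anywhere from 0 to
   4 instructions would issue per clock. */
int issue_packet(const Instr pkt[], int n) {
    int issued = 0;
    for (int i = 0; i < n; i++) {
        if (raw_in_packet(pkt, i)) break;
        issued++;
    }
    return issued;
}
```

For example, a packet whose second instruction consumes the first instruction's result issues only one instruction that cycle, while four independent instructions all issue together.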

26 Multiple Issue with Dynamic Scheduling
[Figure: Tomasulo organization — an FP op queue fed from the instruction unit; 6 load buffers and 3 store buffers between memory and the FP registers; reservation stations (Add1-Add3, Mult1-Mult2) feeding the FP adders and FP multipliers; results broadcast on the CDB.]
Issue: 2 instructions per clock cycle.

27 Multiple Issue with Dynamic Scheduling
Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    0(R1), F4
      DADDIU R1, R1, #-8
      BNE    R1, R2, Loop
Assumptions:
- One FP and one integer operation can be issued per clock
- Resources: one integer ALU (integer ops and effective-address calculation), a separate pipelined FP unit for each operation type, branch-prediction hardware, 1 CDB
- 2 clock cycles for loads, 3 clock cycles for FP add
- Branches issue alone; branch prediction is perfect

28 Multiple Issue with Dynamic Scheduling
[Table: cycle-by-cycle timing (issue, begin execute, memory access, CDB write, commit) of three loop iterations. The first iteration's L.D issues in cycle 1, executes in 2, accesses memory in 3, and writes the CDB in 4. Key orderings: the ADD.D waits for the L.D's result, the S.D waits for the ADD.D, the DADDIU waits for the integer ALU, the BNE waits for the DADDIU, and each following iteration's L.D waits for the BNE.]

29 Multiple Issue with Dynamic Scheduling: Resource Usage
[Table: per-cycle usage of the integer ALU, FP ALU, data cache, and CDB. The single integer ALU alternates between each iteration's address calculations (L.D, S.D) and its DADDIU; the FP ALU starts each iteration's ADD.D only after its L.D completes (cycles 5, 10, 15); the single CDB serializes write-back, so iterations start only every 5 cycles (L.D in cycles 2, 7, 12).]

30 Multiple Issue with Dynamic Scheduling: Improvements
The DADDIU waits for the ALU used by the S.D's address calculation. Improvements:
- Add one ALU dedicated to effective-address calculation
- Use 2 CDBs
Exercise: draw the table for this dual-issue version of Tomasulo's pipeline.

31 Multiple Issue with Dynamic Scheduling
[Table: timing of three iterations with the dedicated address adder and 2 CDBs. The DADDIU now executes earlier, no longer waiting behind the S.D's address calculation; the ADD.D still waits for the L.D, the S.D for the ADD.D, and the BNE for the DADDIU, but each iteration completes sooner than in the single-ALU, single-CDB configuration.]

32 Multiple Issue with Dynamic Scheduling: Resource Usage
[Table: per-cycle usage of the integer ALU, address adder, FP ALU, data cache, and the two CDBs. With address calculations offloaded to the dedicated adder, the integer ALU handles only the DADDIUs, the two CDBs remove the write-back bottleneck, and successive iterations start closer together than with a single ALU and CDB.]

