Advanced Computer Architectures Laboratory on DLX Pipelining Vittorio Zaccaria.

Vittorio Zaccaria – Laboratory of Architectures

DLX
- Load/store architecture
  - Registers are faster than memory
  - The compiler can do deeper optimization
- 16-bit offsets and immediates
- 32-bit integer registers
- 64-bit floating-point registers
- Fixed operation encoding: the addressing mode is contained in the operation code
  - Fits in one word
  - Faster decoding

DLX (cont.)
- 32 general-purpose registers
- 32-bit instructions

DLX Pipeline

Pipeline Visualization

Hazards
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
- Structural hazards: the hardware cannot support this combination of instructions.
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline.
- Control hazards: pipelining of branches and other instructions that change the PC.
The common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline.

Structural Hazards

Data Hazards

Control Hazards

An example program:

        .data
dati_a: .word 1,2,3,4,5,6,7,8
dati_b: .word 2,3,4,5,6,7,7,9
        .text
        .global main
        addi  r3,r0,0
loop:   lw    r4,dati_a(r3)
        lw    r5,dati_b(r3)
        sub   r5,r5,r4
        addi  r3,r3,4
        bnez  r5,loop
exit:

1st Exercise: Draw the pipeline chart
Indicate:
- Data hazards between WB stages and ID stages.
- Control hazards between the EX stage and the IF stage.

Hazard Identification

2nd Exercise: Hazard Resolution
- Software solution
  - NOP insertion
- Hardware solutions
  - Bubble/stall generation
  - Register forwarding
- Software optimizations
  - Code rescheduling

NOP insertion

        addi  r3,r0,0
        nop
loop:   lw    r4,dati_a(r3)
        lw    r5,dati_b(r3)
        nop
        sub   r5,r5,r4
        addi  r3,r3,4
        nop
        bnez  r5,loop
        nop

NOP dynamic execution (first and second loop iterations)
Each loop iteration is composed of 5 instructions and 4 NOPs.

Performance Indexes
- CPI = average number of clock cycles per instruction.
- Clock cycles = number of instructions + number of stalls/NOPs + 4, where 4 is the number of additional cycles the last instruction needs to complete in the five-stage pipeline.
- CPI = [clock cycles] / [number of instructions]

Performance evaluation of NOPs

Actual CPI:
$\text{CPI} = \frac{\text{Instructions} + \text{NOPs} + 4}{\text{Instructions}} = \frac{13 + 4}{7} \approx 2.43$

MIPS, with a clock frequency of 200 MHz:
$\text{MIPS} = \frac{\text{frequency}}{\text{CPI} \times 10^{6}} \approx 82.35$
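
The trace formulas on this slide can be checked numerically. The following is a minimal Python sketch; the function names `trace_cpi` and `mips` are illustrative and do not come from the slides, and the default of 4 drain cycles assumes the five-stage pipeline used throughout the lab.

```python
def trace_cpi(instructions, penalty_cycles, drain_cycles=4):
    """CPI of a finite trace: (instructions + NOPs/stalls + drain) / instructions."""
    return (instructions + penalty_cycles + drain_cycles) / instructions


def mips(clock_hz, cpi):
    """Millions of instructions per second for a given clock frequency and CPI."""
    return clock_hz / (cpi * 1e6)


# First pass over the NOP-padded loop: 7 instructions and 6 NOPs at 200 MHz.
cpi = trace_cpi(instructions=7, penalty_cycles=6)
print(f"CPI  = {cpi:.2f}")
print(f"MIPS = {mips(200e6, cpi):.2f}")
```

Calling `trace_cpi(11, 10)` gives the two-iteration trace of the manual exercise in the same way.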

NOPs Manual Exercise
Execute the loop manually for two iterations (finishing on the NOP after the 2nd bnez) and calculate CPI and MIPS.
10 minutes

Results
CPI = (21+4)/11 ≈ 2.27
MIPS ≈ 88

Asymptotic loop performance
Consider an intermediate iteration of the loop. Count the instructions plus NOPs in that iteration and divide by the number of effective instructions to obtain the asymptotic CPI.
10 minutes

Performance evaluation of NOPs (asymptotic)

Asymptotic loop CPI over n iterations:
$\text{CPI} = \frac{(\text{Instructions} + \text{NOPs}) \cdot n + 4}{\text{Instructions} \cdot n} = \frac{9n + 4}{5n} \approx 1.8$ for large n

MIPS, with a clock frequency of 200 MHz:
$\text{MIPS} = \frac{\text{frequency}}{\text{CPI} \times 10^{6}} \approx 111$
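
To see why the 4 drain cycles stop mattering, the same expression can be evaluated for a growing number of iterations. This is a small Python sketch; `loop_cpi` and its parameter names are illustrative, and the per-iteration counts (5 instructions, 4 NOPs) are the ones given on the slides.

```python
def loop_cpi(n, instr_per_iter=5, nops_per_iter=4, drain_cycles=4):
    """CPI after n iterations: ((instr + NOPs) * n + drain) / (instr * n)."""
    return ((instr_per_iter + nops_per_iter) * n + drain_cycles) / (instr_per_iter * n)


for n in (1, 10, 100, 1000):
    print(f"n = {n:4d}  CPI = {loop_cpi(n):.4f}")

# As n grows, the drain term vanishes and CPI approaches (5 + 4) / 5.
asymptotic_cpi = (5 + 4) / 5
print(f"asymptotic CPI = {asymptotic_cpi:.1f}, MIPS at 200 MHz = {200 / asymptotic_cpi:.2f}")
```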

Bubbles
- Bubbles are NOPs inserted by the hardware.
- A branch instruction causes the generation of a NOP.
- During a stall, the following instructions are held back.
- The preceding instructions continue to execute.

Bubbles Example

Performance evaluation of bubbles

Actual CPI:
$\text{CPI} = \frac{\text{Instructions} + \text{Bubbles/aborts} + 4}{\text{Instructions}} = \frac{7 + 6 + 4}{7} \approx 2.43$

MIPS, with a clock frequency of 200 MHz:
$\text{MIPS} = \frac{\text{frequency}}{\text{CPI} \times 10^{6}} \approx 82.35$

Verify on the simulator
- File -> load code... -> pipe1.s -> select -> load -> yes
- Configuration -> disable forwarding
- Open the clock cycle diagram
- Execute -> single cycle (until the 1st load of the 2nd iteration has been executed)

Result

Manual Exercise
- Predict what happens in an intermediate iteration of the loop.
- Calculate the asymptotic CPI and MIPS.
10 minutes

Let's simulate it
Simulate the program until the 4th iteration of the loop.

Solutions
After the 1st iteration, we note the same behavior in every iteration:
- 5 instructions
- 1 NOP
- 3 stalls
so the asymptotic values are: CPI = 1.8, MIPS = 111.11

Result Forwarding

Forwarding Example

Simulation of 2 iterations of the loop
- Configuration -> enable forwarding
- Open the clock cycle diagram
- File -> Reset DLX
- Execute -> single cycle, up to the WB of the 2nd bnez

Simulation results

Manual Exercise
- Calculate CPI and MIPS for the 2 iterations.
- Calculate the asymptotic CPI and MIPS.
15 minutes

Results
2 iterations:
- 11 instructions
- 1 NOP
- 2 stalls
- 4 cycles to flush the pipe
=> CPI = 18/11 ≈ 1.64
=> MIPS ≈ 122

Asymptotic Results
5 instructions, 1 NOP, 1 stall
CPI = (7n+4)/(5n) ≈ 1.4
MIPS = 142.86

Speedup
Speedup of A with respect to B:
$\text{Speedup}(A, B) = \frac{\text{Exec. Time}_B}{\text{Exec. Time}_A}$

Calculate the asymptotic speedup
- Speedup(NOPs,Bubbles)
- Speedup(Forwarding,NOPs)
- Speedup(Forwarding,Bubbles)
5 minutes

Asymptotic speedup results
- Speedup(NOPs,Bubbles) = 1
- Speedup(Forwarding,NOPs) = 1.29
- Speedup(Forwarding,Bubbles) = 1.29
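
These ratios follow directly from the asymptotic CPIs found earlier (1.8 for NOPs, 1.8 for bubbles, 1.4 for forwarding). The Python sketch below assumes that the clock frequency and the number of effective instructions are the same in all three cases, so execution time is proportional to CPI; the `speedup` helper is illustrative.

```python
def speedup(cpi_a, cpi_b):
    """Speedup of A w.r.t. B = exec. time B / exec. time A.

    With the same effective instruction count and clock, execution time
    is proportional to CPI, so the ratio of CPIs is enough.
    """
    return cpi_b / cpi_a


# Asymptotic CPIs taken from the previous slides.
asymptotic_cpi = {"NOPs": 1.8, "Bubbles": 1.8, "Forwarding": 1.4}

for a, b in [("NOPs", "Bubbles"), ("Forwarding", "NOPs"), ("Forwarding", "Bubbles")]:
    print(f"Speedup({a},{b}) = {speedup(asymptotic_cpi[a], asymptotic_cpi[b]):.2f}")
```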

Scheduling
Optimizations change the order of operations to minimize stalls/bubbles (forwarding enabled).

Original order:
        lw    r3,0(r2)
        add   r3,r3,r7
        lw    r4,0(r2)
        add   r4,r4,r8
        add   r4,r4,r3
CPI = (5+2+4)/5

Rescheduled order:
        lw    r3,0(r2)
        lw    r4,0(r2)
        add   r3,r3,r7
        add   r4,r4,r8
        add   r4,r4,r3
CPI = (5+4)/5
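
The two stalls in the original order are load-use hazards: with forwarding, an instruction that reads a register loaded by the instruction immediately before it must wait one cycle. A minimal Python sketch of that check follows; the tuple encoding of instructions and the `load_use_stalls` function are assumptions made for illustration, not part of the lab material.

```python
def load_use_stalls(program):
    """Count stalls where an instruction reads the register loaded by the
    instruction immediately before it (1 stall each, forwarding enabled)."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_srcs = curr
        if prev_op == "lw" and prev_dest in curr_srcs:
            stalls += 1
    return stalls


# Each instruction is (opcode, destination, sources).
original = [
    ("lw", "r3", ["r2"]),
    ("add", "r3", ["r3", "r7"]),
    ("lw", "r4", ["r2"]),
    ("add", "r4", ["r4", "r8"]),
    ("add", "r4", ["r4", "r3"]),
]
rescheduled = [original[0], original[2], original[1], original[3], original[4]]

for name, prog in (("original", original), ("rescheduled", rescheduled)):
    s = load_use_stalls(prog)
    cpi = (len(prog) + s + 4) / len(prog)
    print(f"{name}: {s} stalls, CPI = ({len(prog)}+{s}+4)/{len(prog)} = {cpi}")
```

Moving the second load up puts one independent instruction between each load and its consumer, which is exactly the distance forwarding needs.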

1st Exercise

        addi  r1,r0,1
        seq   r2,r1,r1
        add   r3,r3,r3
Loop:   lw    r4,0(r3)
        sub   r3,r3,r4
        bnez  r1,Loop

Manual Exercises
- Draw the conflicts between operations until the end of the 3rd execution of the loop (last instruction: bnez). No forwarding is possible.
- Insert bubbles/aborts in the right places to resolve the hazards.
- Calculate the CPI and throughput of the trace.
- Calculate the asymptotic CPI of the loop.
20 minutes

Hazard Diagram

Bubbles/Stall insertion

CPIs
Trace CPI = (24+4)/12 ≈ 2.33
Asymptotic CPI = (6n+4)/(3n) ≈ 2

Manual Exercises
Suppose now that forwarding is possible.
- Draw the new pipeline execution diagram (until the execution of the 3rd bnez) and indicate when the hardware must generate stalls.
- Calculate CPI and MIPS.
- Calculate the asymptotic CPI and MIPS.
20 minutes

Pipeline Diagram

Results
CPI = 21/12 = 1.75
Asymptotic CPI = ((4+1)n+4)/(3n) ≈ 5/3 ≈ 1.67

2nd Exercise

loop:   lw    r2,dati_a(r4)
        lw    r3,dati_b(r5)
        add   r1,r2,r3
        sw    dati_a(r6),r1
        addi  r4,r4,4
        addi  r5,r5,4
        addi  r6,r6,4
        j     loop

1st part
Assume that no forwarding is possible.
- Insert bubbles/aborts in the right places to resolve the hazards.
- Calculate the asymptotic CPI of the loop.
- Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
20 minutes

Results
8 instructions, 1 NOP, 4 stalls => CPI = 13/8

Results
No forwarding and no scheduling: asymptotic CPI = 13/8

A Possible Re-Scheduling

loop:   lw    r2,dati_a(r4)
        lw    r3,dati_b(r5)
        addi  r4,r4,4
        addi  r5,r5,4
        add   r1,r2,r3
        sw    dati_a(r6),r1
        addi  r6,r6,4
        j     loop

Idea: increase the distance of the add from the last lw.

Re-Scheduling results
The scheduled code decreases the CPI to 11/8.

2nd part
Now assume that forwarding is possible.
- Insert the needed bubbles/aborts in the right places to resolve the hazards.
- Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
- Calculate the asymptotic CPI of the two loops.
- Calculate the speedup between the original code (without forwarding) and the final rescheduled code with forwarding.
10 minutes

Forwarding Results
With forwarding but without rescheduling, we obtain CPI = 10/8.

Re-scheduling
We use the same rescheduled code. With forwarding and rescheduling, we obtain CPI = 9/8.

Speedup Results
The total requested speedup is:
$\frac{\text{CPI}[\text{unscheduled, unforwarded}]}{\text{CPI}[\text{scheduled, forwarded}]} = \frac{13}{9}$
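
The four configurations of this exercise can be tabulated side by side. The sketch below uses exact fractions and takes the cycles per iteration (13, 11, 10, 9) from the results slides; the dictionary keys and variable names are illustrative.

```python
from fractions import Fraction

INSTRUCTIONS_PER_ITERATION = 8

# Cycles per loop iteration (instructions + NOP + stalls), as reported on the slides.
cycles_per_iteration = {
    "no forwarding, original order": 13,
    "no forwarding, rescheduled": 11,
    "forwarding, original order": 10,
    "forwarding, rescheduled": 9,
}

cpi = {
    name: Fraction(cycles, INSTRUCTIONS_PER_ITERATION)
    for name, cycles in cycles_per_iteration.items()
}
baseline = cpi["no forwarding, original order"]

for name, value in cpi.items():
    gain = baseline / value
    print(f"{name:32s} CPI = {float(value):.3f}  speedup vs. baseline = {gain} (~{float(gain):.2f})")
```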

3rd Exercise

loop:   lw    r2,dati_a(r1)
        addi  r2,r2,4
        lw    r3,dati_b(r1)
        addi  r3,r3,4
        lw    r4,dati_a(r1)
        addi  r4,r4,4
        add   r2,r2,r3
        add   r2,r2,r4
        sw    dati_a(r1),r2
        addi  r1,r1,4
        bnez  r1,loop

1st part
Assume that no forwarding is possible.
- Insert bubbles/aborts in the right places to resolve the hazards.
- Calculate the asymptotic CPI of the loop.
- Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
20 minutes

Bubbles insertion
11 instructions, 1 NOP, 12 stalls => CPI = 24/11

Rescheduled code

loop:   lw    r2,dati_a(r1)
        lw    r3,dati_b(r1)
        lw    r4,dati_a(r1)
        addi  r2,r2,4
        addi  r3,r3,4
        addi  r4,r4,4
        add   r2,r2,r3
        add   r2,r2,r4
        sw    dati_a(r1),r2
        addi  r1,r1,4
        bnez  r1,loop

Idea: perform the computations after all the data has been loaded.

Scheduled code results
11 instructions, 1 NOP, 7 stalls => CPI = 19/11

2nd part
Now assume that forwarding is possible.
- Insert the needed bubbles/aborts in the right places to resolve the hazards.
- Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
- Calculate the asymptotic CPI of the loop.
- Calculate the speedup between the original code (without forwarding) and the final rescheduled code with forwarding.
10 minutes

Bubbles insertion
11 instructions, 1 NOP, 4 stalls => CPI = 16/11

Rescheduling Results
11 instructions, 1 NOP, 1 stall => CPI = 13/11
Requested speedup = 24/13
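
The stall counts of this exercise (12, 7, 4 and 1) can be reproduced with a simple issue-distance model. The Python sketch below is an illustration under stated assumptions, not the simulator's algorithm: without forwarding a consumer's ID stage may coincide with the producer's WB stage but not precede it; with forwarding ALU results reach the next instruction directly, load data costs one extra cycle, and a branch reads its register one stage earlier (in ID). The tuple encoding and function names are hypothetical.

```python
def min_issue_distance(producer_kind, consumer_kind, forwarding):
    """Minimum issue-slot distance between a producer and a dependent consumer."""
    if not forwarding:
        return 3  # consumer's ID must not precede the producer's WB
    distance = 2 if producer_kind == "load" else 1  # load data is ready after MEM
    if consumer_kind == "branch":
        distance += 1  # the branch reads its register in ID, one stage before EX
    return distance


def stalls_per_iteration(body, forwarding, iterations=3):
    """Data-hazard stalls in the last of several back-to-back loop iterations."""
    last_write = {}  # register -> (issue cycle, kind) of its latest producer
    cycle = 0
    stalls = 0
    for _ in range(iterations):
        stalls = 0
        for kind, dest, sources in body:
            issue = cycle
            for reg in sources:
                if reg in last_write:
                    produced_at, producer_kind = last_write[reg]
                    needed = min_issue_distance(producer_kind, kind, forwarding)
                    issue = max(issue, produced_at + needed)
            stalls += issue - cycle
            if dest is not None:
                last_write[dest] = (issue, kind)
            cycle = issue + 1
            if kind == "branch":
                cycle += 1  # one NOP/aborted slot after the branch
    return stalls


# Each instruction is (kind, destination, sources), following the 3rd exercise.
original = [
    ("load", "r2", ["r1"]), ("alu", "r2", ["r2"]),
    ("load", "r3", ["r1"]), ("alu", "r3", ["r3"]),
    ("load", "r4", ["r1"]), ("alu", "r4", ["r4"]),
    ("alu", "r2", ["r2", "r3"]), ("alu", "r2", ["r2", "r4"]),
    ("store", None, ["r1", "r2"]),
    ("alu", "r1", ["r1"]),
    ("branch", None, ["r1"]),
]
rescheduled = [original[i] for i in (0, 2, 4, 1, 3, 5, 6, 7, 8, 9, 10)]

for forwarding in (False, True):
    for name, body in (("original", original), ("rescheduled", rescheduled)):
        s = stalls_per_iteration(body, forwarding)
        print(f"forwarding={forwarding!s:5} {name:11s} stalls={s:2d} CPI={11 + 1 + s}/11")
```

The model ignores structural hazards and memory dependences, so it is only meant as a cross-check for hand-drawn pipeline diagrams of integer loops like this one.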

Floating-Point Pipeline Hazards
DLX FPU Pipeline

DLX FPU Pipeline
- Latency of a functional unit (FU): the number of cycles that must intervene between an instruction that produces a value through the FU and an instruction that uses this value (one less than the number of execution cycles of the FU).
- Initiation interval of the FU: the time that must elapse between issuing two operations to the same FU.
- A stall in one pipeline does not mean a stall in the entire processor.

FPU Latencies and Initiation Intervals (WinDLX default latencies)

FU                        Latency   Initiation interval
Integer ALU               0         1
FP add                    1         1
FP and integer multiply   4         1
FP and integer divide     18        19 [structural hazards!]
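
Read together with the definitions of latency and initiation interval, the table translates directly into stall counts. The Python sketch below assumes full forwarding and counts only the waiting cycles implied by those two definitions; the dictionary keys and the two helper functions are illustrative names.

```python
# (latency, initiation interval) per functional unit, from the table above.
FUNCTIONAL_UNITS = {
    "int_alu": (0, 1),
    "fp_add": (1, 1),
    "multiply": (4, 1),
    "divide": (18, 19),
}


def raw_stalls(producer_unit, slots_between=0):
    """Stall cycles for a consumer issued `slots_between` independent
    instructions after its producer (0 means back to back)."""
    latency, _ = FUNCTIONAL_UNITS[producer_unit]
    return max(0, latency - slots_between)


def structural_stalls(unit, slots_between=0):
    """Stall cycles before a second operation can be issued to the same unit."""
    _, interval = FUNCTIONAL_UNITS[unit]
    return max(0, interval - 1 - slots_between)


print(raw_stalls("multiply"))       # consumer right after a multiply
print(raw_stalls("multiply", 2))    # two independent instructions in between
print(structural_stalls("divide"))  # two divides issued back to back
```

Only the divide unit has an initiation interval greater than 1, which is why it is the only row marked as a source of structural hazards.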

Problems with FPUs
- Divide instructions can cause structural hazards and need to be stalled in the ID stage.
- More than one instruction may need to write the register file in the same cycle.
- WAW hazards are possible because WB can be reached out of order.
- RAW hazards are more frequent due to the longer latency of operations.

Long Stalls even with Full Forwarding

Register file structural hazard solution
Structural hazards on the register file. Solution: stall one of the instructions before it enters the MEM stage.

FPU WAW Hazards
subd finishes before multd: there is a WAW conflict, i.e., if we don't stall subd, multd will overwrite its result.

        ld    f6,dati_a(r2)
        ld    f2,dati_b(r3)
        multd f6,f2,f4
        subd  f6,f2,f2
        addd  f6,f8,f2
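
The out-of-order write-back behind this WAW conflict can be made explicit by comparing WB cycles. The Python sketch below is a simplification under stated assumptions: EX takes latency + 1 cycles (using the WinDLX default latencies for FP add and multiply), every instruction passes through IF, ID, EX, MEM and WB, and the issue cycles are illustrative because RAW stalls are ignored. All names are hypothetical.

```python
# Assumed number of EX cycles per operation: latency + 1, using the WinDLX
# default latencies (FP add 1, FP multiply 4).
EX_CYCLES = {"multd": 5, "subd": 2, "addd": 2}


def write_back_cycle(issue_cycle, op):
    """Cycle in which an instruction fetched at `issue_cycle` reaches WB
    (IF, ID, then EX_CYCLES[op] cycles of EX, then MEM, then WB)."""
    return issue_cycle + 2 + EX_CYCLES[op] + 1


def waw_hazards(program):
    """Pairs (earlier, later) that would write the same register out of order.
    `program` is a list of (issue_cycle, op, destination)."""
    hazards = []
    for i, (t1, op1, d1) in enumerate(program):
        for t2, op2, d2 in program[i + 1:]:
            if d1 == d2 and write_back_cycle(t2, op2) <= write_back_cycle(t1, op1):
                hazards.append((op1, op2))
    return hazards


# Issue cycles are illustrative: one instruction per cycle, RAW stalls ignored.
program = [(0, "multd", "f6"), (1, "subd", "f6"), (2, "addd", "f6")]
print(waw_hazards(program))
```

Under these assumptions multd reaches WB in cycle 8 while subd would reach it in cycle 6, so the later instruction has to be held back to keep the writes to f6 in program order.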

Exercise: execute only one iteration of this loop:

loop:   ld    f0,dati_a(r2)
        ld    f4,dati_b(r3)
        multd f0,f0,f4
        addd  f2,f0,f2
        addi  r2,r2,8
        addi  r3,r3,8
        sub   r5,r4,r2
        bnez  r5,loop

How many cycles elapse between the IF of the 1st ld and the WB of the 1st bnez?

Results
CPI of the trace = 19 cycles / 8 instructions = 19/8
