Throughput = #instructions per unit time (seconds/cycles etc.)

Unit time is determined by the units used for the denominator:
  cycles -> instructions/cycle; seconds -> instructions/second
Throughput of an unpipelined machine = 1 / time per instruction
  Time per instruction = pipeline depth * time to execute a single stage
  The time to execute a single stage can therefore be rewritten as:
    Time to execute a single stage = time per instruction on unpipelined machine / pipeline depth
Throughput of a pipelined machine = 1 / time to execute a single stage (assuming all stages take the same time)
Deriving the throughput equation for a pipelined machine:
  Throughput = pipeline depth / time per instruction on unpipelined machine
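The throughput relations above can be checked numerically. The following is a minimal sketch, assuming all stages take the same time and ignoring pipeline register overhead; the function names and the 10 ns / 5-stage figures are hypothetical and chosen only for illustration.

```python
def unpipelined_throughput(time_per_instruction: float) -> float:
    """Instructions per unit time when one instruction runs start to finish."""
    return 1.0 / time_per_instruction


def pipelined_throughput(time_per_instruction: float, depth: int) -> float:
    """Instructions per unit time once the pipeline is full.

    Assumes equal-length stages, so one stage takes
    time_per_instruction / depth.
    """
    stage_time = time_per_instruction / depth
    return 1.0 / stage_time


# Hypothetical machine: 10 ns per instruction unpipelined, 5 stages.
T_UNPIPELINED_NS = 10.0
DEPTH = 5

base = unpipelined_throughput(T_UNPIPELINED_NS)
piped = pipelined_throughput(T_UNPIPELINED_NS, DEPTH)
print(f"unpipelined: {base} instr/ns, pipelined: {piped} instr/ns")
print(f"ratio: {piped / base}")  # equals the pipeline depth in this ideal model
```

In this idealized model the ratio of the two throughputs is exactly the pipeline depth, which is the same statement as Throughput = pipeline depth / time per instruction on the unpipelined machine.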

Physics of Clock Skew
Basically caused because the clock edge reaches different parts of the chip at different times.
Capacitance charge/discharge rates
  All wires, leads, transistors, etc. have capacitance.
  The longer the wire, the larger the capacitance.
  For a fixed current I, the rate of change of V is inversely proportional to C.
  The time to charge/discharge adds to the delay.
  This was the dominant problem at older integration densities.
  For a fixed C, the rate of change of V is proportional to I, so repeaters are used to drive current and to handle fan-out problems.
  The problem with this approach is that power requirements go up, and power dissipation becomes a problem.
Speed-of-light propagation delays
  These dominate at current integration densities, because capacitances are now much lower.
  But clock rates are now much faster, so even small delays consume a large part of the clock cycle.
Current-day research -> asynchronous chip designs
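Both delay sources can be put into rough numbers. This is only a back-of-the-envelope sketch: it uses the ideal capacitor relation I = C * dV/dt and the vacuum speed of light as an upper bound on signal speed, and the component values (1 pF, 1 V, 1 mA, 2 cm, 3 GHz) are hypothetical, not taken from the slides.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # vacuum value; real on-chip signals are slower


def charge_time(capacitance_f: float, delta_v: float, current_a: float) -> float:
    """Time to change a node's voltage by delta_v at constant current.

    From I = C * dV/dt, so t = C * delta_v / I.
    """
    return capacitance_f * delta_v / current_a


def light_delay_fraction(distance_m: float, clock_hz: float) -> float:
    """Fraction of one clock cycle spent crossing distance_m at light speed."""
    delay_s = distance_m / SPEED_OF_LIGHT_M_PER_S
    return delay_s * clock_hz


# Hypothetical values for illustration only.
t_slow = charge_time(1e-12, 1.0, 1e-3)   # 1 pF, 1 V swing, 1 mA drive
t_fast = charge_time(1e-12, 1.0, 2e-3)   # doubling the current halves the time
fraction = light_delay_fraction(0.02, 3e9)  # 2 cm path, 3 GHz clock

print(f"charge time at 1 mA: {t_slow:.3e} s, at 2 mA: {t_fast:.3e} s")
print(f"fraction of a 3 GHz cycle used by a 2 cm trip: {fraction:.2f}")
```

The first function shows why driving more current shortens the delay (at the cost of power), and the second shows why a fixed propagation delay matters more as the clock rate rises.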

Return to Pipelining: It's Not That Easy for Computers
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
  Structural hazards: the hardware cannot support this combination of instructions (a single person to fold and put clothes away).
  Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (missing sock).
  Control hazards: pipelining of branches and other instructions that change the PC.
The common solution is to stall the pipeline until the hazard is resolved, inserting one or more "bubbles" in the pipeline.

Speedup = average instruction time unpipelined / average instruction time pipelined
Remember that average instruction time = CPI * clock cycle, and that the ideal CPI for a pipelined machine is 1.
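The speedup ratio can be evaluated directly from CPI and clock cycle. The sketch below assumes that stalls simply add to the ideal pipelined CPI of 1; the function names and the sample numbers (a 10 ns unpipelined cycle, a 2 ns pipelined cycle, 0.2 stall cycles per instruction) are hypothetical.

```python
def average_instruction_time(cpi: float, clock_cycle: float) -> float:
    """Average instruction time = CPI * clock cycle."""
    return cpi * clock_cycle


def pipelined_cpi(stall_cycles_per_instruction: float) -> float:
    """Ideal pipelined CPI is 1; every stall cycle adds to it."""
    return 1.0 + stall_cycles_per_instruction


def speedup(cpi_unpipelined: float, cycle_unpipelined: float,
            cpi_pipelined: float, cycle_pipelined: float) -> float:
    """Average instruction time unpipelined / average instruction time pipelined."""
    return (average_instruction_time(cpi_unpipelined, cycle_unpipelined)
            / average_instruction_time(cpi_pipelined, cycle_pipelined))


# Hypothetical numbers: 5-stage pipeline, cycle time shrinks from 10 ns to 2 ns.
ideal = speedup(1.0, 10.0, pipelined_cpi(0.0), 2.0)
with_stalls = speedup(1.0, 10.0, pipelined_cpi(0.2), 2.0)
print(f"ideal speedup: {ideal:.2f}, with 0.2 stalls/instr: {with_stalls:.2f}")
```

With no stalls the speedup equals the ratio of the clock cycles; any stall cycles raise the pipelined CPI above 1 and reduce the speedup accordingly.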

Structural Hazards
Overlapped execution of instructions requires:
  Pipelining of functional units
  Duplication of resources
Structural hazard: occurs when the pipeline cannot accommodate some combination of instructions.
Consequences:
  Stall
  Increase of CPI from its ideal value (1)

Pipelining of Functional Units
[Figure: an FP multiply unit with stages M1 to M5 beside the IF, ID, EX, MEM, WB pipeline, drawn three ways: fully pipelined, partially pipelined, and not pipelined.]

To Pipeline or Not to Pipeline
Elements to consider:
  Effects of pipelining and duplicating units
    Increased costs
    Higher latency (pipeline register overhead)
  Frequency of the structural hazard
Example: unpipelined FP multiply unit in DLX
  Latency: 5 cycles
  Impact on the mdljdp2 program?
    Frequency of FP instructions: 14%
    Depends on the distribution of FP multiplies
      Best case: uniform distribution
      Worst case: clustered, back-to-back multiplies
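The best and worst cases can be estimated with a small model. This is a sketch under stated assumptions, not the slide author's calculation: it reads the 14% figure as the frequency of FP multiplies, assumes the unpipelined unit is busy for its full 5-cycle latency, and assumes no other stalls. The function name is hypothetical.

```python
def cpi_with_unpipelined_unit(frequency: float, latency: int, clustered: bool) -> float:
    """Estimate CPI when one functional unit is not pipelined.

    clustered=True  : worst case, every use follows another use back to back,
                      so each one waits latency - 1 cycles for the unit.
    clustered=False : best case, uses are spread evenly, one every
                      1 / frequency instructions; a use stalls only if that
                      spacing is shorter than the unit's latency.
    """
    if clustered:
        stall_per_use = float(latency - 1)
    else:
        spacing = 1.0 / frequency
        stall_per_use = max(0.0, latency - spacing)
    return 1.0 + frequency * stall_per_use


# Assumption: 14% of instructions are FP multiplies, unit latency is 5 cycles.
FREQUENCY = 0.14
LATENCY = 5

best = cpi_with_unpipelined_unit(FREQUENCY, LATENCY, clustered=False)
worst = cpi_with_unpipelined_unit(FREQUENCY, LATENCY, clustered=True)
print(f"best case CPI: {best:.2f}, worst case CPI: {worst:.2f}")
```

Under these assumptions an evenly spread multiply arrives about every 7 instructions, which is longer than the 5-cycle latency, so the best case sees no structural stalls, while the fully clustered case pays 4 stall cycles per multiply.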

Resource Duplication
[Figure: pipeline diagram of Load, Inst 1, Inst 2, and Inst 3 passing through the M (memory), Reg, and ALU stages, with a stall inserted ahead of Inst 3.]

Three Generic Data Hazards
InstrI followed by InstrJ
Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it.

InstrI followed by InstrJ
Write After Read (WAR): InstrJ tries to write an operand before InstrI reads it, so InstrI gets the wrong operand.
Can't happen in the MIPS 5-stage pipeline because:
  All instructions take 5 stages, and
  Reads are always in stage 2, and
  Writes are always in stage 5

InstrI followed by InstrJ
Write After Write (WAW): InstrJ tries to write an operand before InstrI writes it, which leaves the wrong result (InstrI's, not InstrJ's).
Can't happen in the DLX 5-stage pipeline because:
  All instructions take 5 stages, and
  Writes are always in stage 5
We will see WAR and WAW in later, more complicated pipes.

Examples in more complicated pipelines
WAW (write after write):
LW  R1, 0(R2)    IF ID EX M1 M2 WB
ADD R1, R2, R       IF ID EX WB
WAR (write after read):
SW  0(R1), R     IF ID EX M1 M2 WB
ADD R2, R3, R       IF ID EX WB
This is a problem if register writes happen during the first half of the cycle and reads during the second half.

Data Hazards
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the sequence below, in which every instruction after the ADD reads R1.]
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR  R8, R1, R9
XOR R10, R1, R11

Forwarding
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the sequence below, illustrating forwarding.]
ADD R1, R2, R3
SUB R4, R1, R5
AND R6, R1, R7
OR  R8, R1, R9
XOR R10, R1, R11

Stalls in Spite of Forwarding
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the sequence below, in which the instructions after the load read R1.]
LW  R1, 0(R2)
SUB R4, R1, R5
AND R6, R1, R7
OR  R8, R1, R9

Pipeline Interlocks
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the load sequence below, with the interlock delaying the instructions after the LW.]
LW  R1, 0(R2)    IF    ID    EX    MEM   WB
SUB R4, R1, R5         IF    ID    stall EX    MEM   WB
AND R6, R1, R7               IF    stall ID    EX    MEM   WB
OR  R8, R1, R9                     stall IF    ID    EX    MEM   WB
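The interlock timing can be reproduced with a small scheduling model. This is a sketch under stated assumptions rather than a description of real DLX hardware: it assumes a 5-stage in-order pipeline with forwarding, where a value loaded from memory is available to a dependent EX stage only in the cycle after the load's MEM stage, and where a stalled instruction holds its stage so that later instructions wait behind it. The Instr type and function names are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple

STAGES = ("IF", "ID", "EX", "MEM", "WB")


class Instr(NamedTuple):
    text: str
    is_load: bool
    dest: Optional[str]
    srcs: Tuple[str, ...]


def stage_entry_cycles(program: List[Instr]) -> List[List[int]]:
    """Cycle in which each instruction enters IF, ID, EX, MEM, WB."""
    times: List[List[int]] = []
    for k, ins in enumerate(program):
        if k == 0:
            t = [1, 2, 3, 4, 5]
        else:
            prev, prev_ins = times[-1], program[k - 1]
            t = [0] * 5
            t[0] = prev[1]                   # fetch once the previous instr leaves IF
            t[1] = max(t[0] + 1, prev[2])    # decode once the previous instr leaves ID
            t[2] = max(t[1] + 1, prev[3])    # execute once the previous instr leaves EX
            if prev_ins.is_load and prev_ins.dest in ins.srcs:
                t[2] = max(t[2], prev[3] + 1)  # load-use: wait until after the load's MEM
            t[3] = t[2] + 1
            t[4] = t[3] + 1
        times.append(t)
    return times


def timing_rows(program: List[Instr]) -> List[Tuple[int, List[str]]]:
    """For each instruction: (first cycle shown, stage or 'stall' per cycle)."""
    times = stage_entry_cycles(program)
    rows = []
    for k, t in enumerate(times):
        first = 1 if k == 0 else times[k - 1][0] + 1
        entered = dict(zip(t, STAGES))
        rows.append((first, [entered.get(c, "stall") for c in range(first, t[4] + 1)]))
    return rows


program = [
    Instr("LW  R1, 0(R2)", True, "R1", ("R2",)),
    Instr("SUB R4, R1, R5", False, "R4", ("R1", "R5")),
    Instr("AND R6, R1, R7", False, "R6", ("R1", "R7")),
    Instr("OR  R8, R1, R9", False, "R8", ("R1", "R9")),
]
rows = timing_rows(program)
for ins, (first, cells) in zip(program, rows):
    print(f"{ins.text:<17}" + " " * 6 * (first - 1) + "".join(f"{c:<6}" for c in cells))
```

For the four-instruction sequence this prints the same staggered rows as the table above: one stall after ID for the SUB, one after IF for the AND, and one before IF for the OR.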

Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code:
  LW  Rb,b
  LW  Rc,c
  ADD Ra,Rb,Rc
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  SUB Rd,Re,Rf
  SW  d,Rd
Fast code:
  LW  Rb,b
  LW  Rc,c
  LW  Re,e
  ADD Ra,Rb,Rc
  LW  Rf,f
  SW  a,Ra
  SUB Rd,Re,Rf
  SW  d,Rd
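The difference between the two orderings can be counted mechanically. The sketch below assumes a 5-stage pipeline with forwarding in which the only stall is the one-cycle load-use interlock (an instruction that reads a register loaded by the instruction immediately before it). The Instr type and function names are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    op: str
    dest: Optional[str]      # register written, if any
    srcs: Tuple[str, ...]    # registers read


def load_use_stalls(code: List[Instr]) -> int:
    """Count instructions that read the result of the load just before them."""
    stalls = 0
    for prev, cur in zip(code, code[1:]):
        if prev.op == "LW" and prev.dest in cur.srcs:
            stalls += 1
    return stalls


def total_cycles(code: List[Instr], depth: int = 5) -> int:
    """Cycles to drain the pipeline: n + (depth - 1) plus one per stall."""
    return len(code) + (depth - 1) + load_use_stalls(code)


LW_B, LW_C = Instr("LW", "Rb", ()), Instr("LW", "Rc", ())
LW_E, LW_F = Instr("LW", "Re", ()), Instr("LW", "Rf", ())
ADD = Instr("ADD", "Ra", ("Rb", "Rc"))
SUB = Instr("SUB", "Rd", ("Re", "Rf"))
SW_A, SW_D = Instr("SW", None, ("Ra",)), Instr("SW", None, ("Rd",))

slow = [LW_B, LW_C, ADD, SW_A, LW_E, LW_F, SUB, SW_D]
fast = [LW_B, LW_C, LW_E, ADD, LW_F, SW_A, SUB, SW_D]

print("slow:", load_use_stalls(slow), "stalls,", total_cycles(slow), "cycles")  # 2 stalls, 14 cycles
print("fast:", load_use_stalls(fast), "stalls,", total_cycles(fast), "cycles")  # 0 stalls, 12 cycles
```

In the slow ordering the ADD and the SUB each directly follow the load of one of their operands; the fast ordering moves an independent load into each of those slots, so under this model both stalls disappear.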

Effect of Software Scheduling
Slow code:
  LW  Rb,b       IF ID EX MEM WB
  LW  Rc,c       IF ID EX MEM WB
  ADD Ra,Rb,Rc   IF ID EX MEM WB
  SW  a,Ra       IF ID EX MEM WB
  LW  Re,e       IF ID EX MEM WB
  LW  Rf,f       IF ID EX MEM WB
  SUB Rd,Re,Rf   IF ID EX MEM WB
  SW  d,Rd       IF ID EX MEM WB
Fast code:
  LW  Rb,b       IF ID EX MEM WB
  LW  Rc,c       IF ID EX MEM WB
  LW  Re,e       IF ID EX MEM WB
  ADD Ra,Rb,Rc   IF ID EX MEM WB
  LW  Rf,f       IF ID EX MEM WB
  SW  a,Ra       IF ID EX MEM WB
  SUB Rd,Re,Rf   IF ID EX MEM WB
  SW  d,Rd       IF ID EX MEM WB

Compiler Scheduling
Eliminates load interlocks
Demands more registers
Simple scheduling
  Basic block (sequential segment of code)
  Good for simple pipelines
Percentage of loads that result in a stall
  FP: 13%
  Int: 25%
