Presentation is loading. Please wait.

Presentation is loading. Please wait.

EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining.

Similar presentations


Presentation on theme: "EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining."— Presentation transcript:

1 EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining

2 EEL5708 Acknowledgements All the lecture slides were adopted from the slides of David Patterson (1998, 2001) and David E. Culler (2001), Copyright 1998-2002, University of California Berkeley

3 EEL5708 Pipelining

4 EEL5708 Sequential Laundry Sequential laundry takes 6 hours for 4 loads If they learned pipelining, how long would laundry take? ABCD 304020304020304020304020 6 PM 789 10 11 Midnight TaskOrderTaskOrder Time

5 EEL5708 Pipelined Laundry Start work ASAP Pipelined laundry takes 3.5 hours for 4 loads ABCD 6 PM 789 10 11 Midnight TaskOrderTaskOrder Time 3040 20

6 EEL5708 Pipelining Lessons Pipelining doesn’t help latency of single task, it helps throughput of entire workload Pipeline rate limited by slowest pipeline stage Multiple tasks operating simultaneously Potential speedup = Number pipe stages Unbalanced lengths of pipe stages reduces speedup Time to “fill” pipeline and time to “drain” it reduces speedup ABCD 6 PM 789 TaskOrderTaskOrder Time 3040 20

7 EEL5708 Fast, Pipelined Instruction Interpretation Instruction Register Operand Registers Instruction Address Result Registers Next Instruction Instruction Fetch Decode & Operand Fetch Execute Store Results NI IF D E W NI IF D E W NI IF D E W NI IF D E W NI IF D E W Time Registers or Mem

8 EEL5708 Instruction Pipelining Execute billions of instructions, so throughput is what matters –except when? What is desirable in instruction sets for pipelining? –Variable length instructions vs. all instructions same length? –Memory operands part of any operation vs. memory operands only in loads or stores? –Register operand many places in instruction format vs. registers located in same place?

9 EEL5708 Example: MIPS (Note register location) Op 312601516202125 Rs1Rd immediate Op 3126025 Op 312601516202125 Rs1Rs2 target RdOpx Register-Register 56 1011 Register-Immediate Op 312601516202125 Rs1Rs2/Opx immediate Branch Jump / Call

10 EEL5708 5 Steps of MIPS Datapath Memory Access Write Back Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc ALU MUX Memory Reg File MUX Data Memory MUX Sign Extend 4 Adder Zero? Next SEQ PC Address Next PC WB Data Inst RD RS1 RS2 Immediate

11 EEL5708 5 Steps of MIPS Datapath Memory Access Write Back Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc ALU Memory Reg File MUX Data Memory MUX Sign Extend Zero? IF/ID ID/EX MEM/WB EX/MEM 4 Adder Next SEQ PC RD WB Data Data stationary control – local decode for each instruction phase / pipeline stage Next PC Address RS1 RS2 Imm MUX

12 EEL5708 Visualizing Pipelining Figure 3.3, Page 133, CA:AQA 2e I n s t r. O r d e r Time (clock cycles) Reg ALU DMemIfetch Reg ALU DMemIfetch Reg ALU DMemIfetch Reg ALU DMemIfetch Reg Cycle 1Cycle 2Cycle 3Cycle 4Cycle 6Cycle 7Cycle 5

13 EEL5708 Its Not That Easy for Computers Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle –Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away) –Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock) –Control hazards: Pipelining of branches & other instructions that change the PC –Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline

14 EEL5708 Speed Up Equation for Pipelining CPI pipelined = Ideal CPI + Pipeline stall clock cycles per instr Speedup = Ideal CPI x Pipeline depth Clock Cycle unpipelined Ideal CPI + Pipeline stall CPI Clock Cycle pipelined Speedup = Pipeline depth Clock Cycle unpipelined 1 + Pipeline stall CPI Clock Cycle pipelined x x

15 EEL5708 Structural Hazard Example: Dual-port vs. Single-port Machine A: Dual ported memory Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate Ideal CPI = 1 for both Loads are 40% of instructions executed SpeedUp A = Pipeline Depth/(1 + 0) x (clock unpipe /clock pipe ) = Pipeline Depth SpeedUp B = Pipeline Depth/(1 + 0.4 x 1) x (clock unpipe /(clock unpipe / 1.05) = (Pipeline Depth/1.4) x 1.05 = 0.75 x Pipeline Depth SpeedUp A / SpeedUp B = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33 Machine A is 1.33 times faster

16 EEL5708 Three Generic Data Hazards Instr I followed by Instr J Read After Write (RAW) Instr J tries to read operand before Instr I writes it

17 EEL5708 Three Generic Data Hazards Instr I followed by Instr J Write After Read (WAR) Instr J tries to write operand before Instr I reads i –Gets wrong operand Can’t happen in our 5 stage pipeline because: – All instructions take 5 stages, and – Reads are always in stage 2, and – Writes are always in stage 5

18 EEL5708 Three Generic Data Hazards Instr I followed by Instr J Write After Write (WAW) Instr J tries to write operand before Instr I writes it – Leaves wrong result ( Instr I not Instr J ) Can’t happen in our 5 stage pipeline because: – All instructions take 5 stages, and – Writes are always in stage 5

19 EEL5708 Try producing fast code for a = b + c; d = e – f; assuming a, b, c, d,e, and f in memory. Slow code: LW Rb,b LW Rc,c ADD Ra,Rb,Rc SW a,Ra LW Re,e LW Rf,f SUB Rd,Re,Rf SWd,Rd Software Scheduling to Avoid Load Hazards Fast code: LW Rb,b LW Rc,c LW Re,e ADD Ra,Rb,Rc LW Rf,f SW a,Ra SUB Rd,Re,Rf SWd,Rd

20 EEL5708 Control Hazard on Branches Three Stage Stall

21 EEL5708 Branch Stall Impact If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9! Two part solution: –Determine branch taken or not sooner, AND –Compute taken branch address earlier Branch tests if register = 0 or <> 0 Solution: –Move Zero test to ID/RF stage –Adder to calculate new PC in ID/RF stage –1 clock cycle penalty for branch versus 3

22 EEL5708 Four Branch Hazard Alternatives #1: Stall until branch direction is clear #2: Predict Branch Not Taken –Execute successor instructions in sequence –“Squash” instructions in pipeline if branch actually taken –Advantage of late pipeline state update –47% branches not taken on average –PC+4 already calculated, so use it to get next instruction #3: Predict Branch Taken –53% branches taken on average –But haven’t calculated branch target address »still incurs 1 cycle branch penalty »Other machines: branch target known before outcome

23 EEL5708 Four Branch Hazard Alternatives #4: Delayed Branch –Define branch to take place AFTER a following instruction branch instruction sequential successor 1 sequential successor 2........ sequential successor n branch target if taken –1 slot delay allows proper decision and branch target address in 5 stage pipeline Branch delay of length n

24 EEL5708 Delayed Branch Where to get instructions to fill branch delay slot? –Before branch instruction –From the target address: only valuable when branch taken –From fall through: only valuable when branch not taken –Cancelling branches allow more slots to be filled Compiler effectiveness for single branch delay slot: –Fills about 60% of branch delay slots –About 80% of instructions executed in branch delay slots useful in computation –About 50% (60% x 80%) of slots usefully filled Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

25 EEL5708 Evaluating Branch Alternatives SchedulingBranchCPIspeedup v.speedup v. scheme penaltyunpipelinedstall Stall pipeline31.423.51.0 Predict taken11.144.41.26 Predict not taken11.094.51.29 Delayed branch0.51.074.61.31 Conditional & Unconditional = 14%, 65% change PC

26 EEL5708 Pipelining Summary Just overlap tasks, and easy if tasks are independent Speed Up / Pipeline Depth; if ideal CPI is 1, then: Hazards limit performance on computers: –Structural: need more HW resources –Data (RAW,WAR,WAW): need forwarding, compiler scheduling –Control: delayed branch, prediction Speedup = Pipeline Depth 1 + Pipeline stall CPI X Clock Cycle Unpipelined Clock Cycle Pipelined


Download ppt "EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining."

Similar presentations


Ads by Google