Lecture 15: Pipelining: Branching & Complications


1 Lecture 15: Pipelining: Branching & Complications
Michael B. Greenwald, Computer Architecture, CIS 501, Fall 1999

2 Administration
- Solutions to midterm up on web page.
- HW #4 posted on web page.

3 Visualizing Pipelining (Figure 3.3, Page 133)
[Figure: pipeline diagram. x-axis: time (clock cycles); y-axis: instruction order. Each instruction passes through Instruction Memory, Register Bank, ALU, Data Memory, Register Bank.]
- Instruction Level Parallelism (ILP)
- Which resources are active during each clock cycle?

4 It's Not That Easy for Computers
- Decreasing cycle time increases demands on components.
- Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
  - Structural hazards: HW cannot support this combination of instructions.
  - Data hazards: instruction depends on the result of a prior instruction still in the pipeline.
  - Control hazards: pipelining of branches & other instructions.
- Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline.

5 Visualizing Stalls
[Figure: pipeline diagram. x-axis: time (clock cycles); y-axis: instruction order (Load, Instr 1, Instr 2, stall, Instr 3).]
- Simplest way to view a stall is as an “empty instruction”. This delays the start of the next real instruction.

6 Visualizing Stalls
[Figure: pipeline diagram. x-axis: time (clock cycles); y-axis: instruction order (Load, Instr 1, Instr 2, stall, Instr 3).]
- Simplest way to view a stall is as an “empty instruction”. This delays the start of the next real instruction.
- Point is to detect resource conflicts.
- Inaccurate when the stall isn’t introduced at the beginning of the pipeline.

7 Visualizing Stalls
[Figure: pipeline diagram. x-axis: time (clock cycles); y-axis: instruction order (Load, Instr 1, Instr 2, stall, Instr 3).]
- Alternative is a bubble propagating through all later instructions (stop all earlier stages).

8 Visualizing Stalls
[Figure: pipeline diagram. x-axis: time (clock cycles); y-axis: instruction order, for the sequence below. Annotation: pipeline interlock.]
    lw  r1, 0(r2)
    sub r4, r1, r6
    and r6, r1, r7
    or  r8, r1, r9

9 Visualizing Stalls
What if multiple cycle stall?
[Figure: pipeline timing table over clock cycles CC1 to CC11 for a pipeline with stages IF, ID1, ID2, EX, MEM, WB; stall cycles in one instruction delay the instructions that follow it.]

10 Branch Stall Impact
- Recall: if CPI = 1, 30% of instructions are branches, and each branch stalls 3 cycles, then new CPI = 1.9!
- Key point: the penalty to throughput applies only to the instruction that first caused the stall, even though all subsequent instructions are stalled equally.
- We only care about throughput: instruction completion time relative to the previous instruction.
- If we are interested in average instruction time (latency), then the penalty applies to all subsequent instructions, and we must know at what stage the stall begins to calculate how many subsequent instructions it affects.
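
The CPI figure on this slide follows from adding the average stall cycles per instruction to the base CPI. A minimal sketch (the function name `cpi_with_stalls` is made up for illustration):

```python
def cpi_with_stalls(base_cpi, event_frequency, stall_cycles):
    """Effective CPI when a fraction of instructions each add `stall_cycles`."""
    return base_cpi + event_frequency * stall_cycles

# Numbers from the slide: CPI = 1, 30% branches, 3 stall cycles per branch.
new_cpi = cpi_with_stalls(1.0, 0.30, 3)
print(f"new CPI = {new_cpi:.1f}")
```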

11 Speed Up Equation for Pipelining
Basic speedup equation:
$$\text{Speedup from pipelining} = \frac{\text{Avg Instr Time}_{\text{unpipelined}}}{\text{Avg Instr Time}_{\text{pipelined}}} = \frac{\text{CPI}_{\text{unpipelined}} \times \text{Clock Cycle}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}} \times \text{Clock Cycle}_{\text{pipelined}}} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
Two ways of looking at this:
- For a given clock cycle, pipelining decreases CPI.
- Treat CPI as 1 for both machines; pipelining decreases clock cycle time.
Yields 2 different speedup equations, depending on the definition of cycle. Always be consistent within a problem!

12 Speed Up Equation for Pipelining
Basic speedup equation:
$$\text{Speedup from pipelining} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
Two ways of looking at this:
- For a given clock cycle, pipelining decreases CPI.
- Treat CPI as 1 for both machines; pipelining decreases clock cycle time.
In what sense is CPI reduced?
- Each cycle, pipelineDepth instructions are active, so if each instruction takes n cycles, CPI = n/pipelineDepth.
- CPI counts cycles per instruction completion.
[Figure: unpipelined vs. pipelined execution of instructions through stages F, D, X, M, W over cycles 1 to 5.]
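
To see why CPI counts cycles per instruction completion, one can count total cycles for a long instruction stream. The sketch below assumes an ideal pipeline (no stalls, one cycle per stage); the function names are illustrative only.

```python
def pipelined_cycles(num_instructions, depth):
    # The first instruction needs `depth` cycles to drain through the pipe;
    # every later instruction completes one cycle after its predecessor.
    return depth + (num_instructions - 1)

def unpipelined_cycles(num_instructions, depth):
    # Each instruction occupies the whole datapath for `depth` cycles.
    return num_instructions * depth

n, depth = 1000, 5
cpi_pipelined = pipelined_cycles(n, depth) / n
cpi_unpipelined = unpipelined_cycles(n, depth) / n
print(cpi_unpipelined, cpi_pipelined)
```

For a long stream the pipelined CPI approaches n/pipelineDepth = 5/5 = 1, even though each individual instruction still spends 5 cycles in flight.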

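13 Speed Up Equation for Pipelining: Impact of Stalls
Basic speedup equation:
$$\text{Speedup from pipelining} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
Two ways of looking at this:
- For a given clock cycle, pipelining decreases CPI.
- Treat CPI as 1 for both machines; pipelining decreases clock cycle time.
Taking the first view:
$$\text{CPI}_{\text{unpipelined}} = \text{Ideal CPI} \times \text{Pipeline depth}$$
$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr}$$
$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
Annotations: Simplification: all instructions have equal length. Cycles per stage: may as well be 1.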

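14 Speed Up Equation for Pipelining: Impact of Stalls
Basic speedup equation:
$$\text{Speedup from pipelining} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
Two ways of looking at this:
- For a given clock cycle, pipelining decreases CPI.
- Treat CPI as 1 for both machines; pipelining decreases clock cycle time.
$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
With Ideal CPI = 1:
$$\text{Speedup} = \frac{1 \times \text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
Annotations: “Assume balance”; “Clock Cycle unpipelined + penalty”.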
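
As a rough numerical check of the first view (same clock, pipelining lowers CPI), the sketch below evaluates the speedup formula. The function name, the default arguments, and the choice of a 5-stage pipeline are assumptions for illustration; the stall CPI of 0.9 reuses the 30% x 3 cycles branch example.

```python
def speedup_same_clock(depth, stall_cpi, ideal_cpi=1.0,
                       cycle_unpipelined=1.0, cycle_pipelined=1.0):
    """Speedup = (ideal CPI * depth) / (ideal CPI + stall CPI) * clock ratio."""
    cpi_ratio = (ideal_cpi * depth) / (ideal_cpi + stall_cpi)
    return cpi_ratio * (cycle_unpipelined / cycle_pipelined)

ideal = speedup_same_clock(depth=5, stall_cpi=0.0)         # no stalls: speedup equals depth
with_branches = speedup_same_clock(depth=5, stall_cpi=0.9)  # 30% branches x 3 stall cycles
print(ideal, round(with_branches, 2))
```

With balanced stages and equal clocks, stalls are the only thing that keeps the speedup below the pipeline depth.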

15 Speed Up Equation for Pipelining
Basic speedup equation:
$$\text{Speedup from pipelining} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
Two ways of looking at this:
- For a given clock cycle, pipelining decreases CPI.
- Treat CPI as 1 for both machines; pipelining decreases clock cycle time.
In what sense is CPI 1 for both machines?
- Amount of time to complete an instruction.
- Pipelining’s main effect is to decrease clock cycle time.
[Figure: unpipelined vs. pipelined execution of instructions through stages F, D, X, M, W over cycles 1 to 5.]

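16 Speed Up Equation for Pipelining: Impact of Stalls
Basic speedup equation:
$$\text{Speedup from pipelining} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
Two ways of looking at this:
- For a given clock cycle, pipelining decreases CPI.
- Treat CPI as 1 for both machines; pipelining decreases clock cycle time.
Taking the second view:
$$\text{Clock Cycle}_{\text{pipelined}} = \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Pipeline depth}} + \text{penalty}$$
$$\text{Speedup} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\dfrac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Pipeline depth}} + \text{penalty}}$$
$$\text{Speedup} = \frac{1}{1 + \text{Stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\dfrac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Pipeline depth}} + \text{penalty}}$$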
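
The second view (CPI treated as 1 for both machines, pipelining shortens the clock) can be checked the same way. All numbers below are hypothetical: a 10 ns unpipelined cycle, 5 stages, and a 0.5 ns per-stage penalty (for example latch overhead); the function name is illustrative.

```python
def speedup_cpi_one(cycle_unpipelined, depth, penalty, stall_cpi):
    """Speedup = 1 / (1 + stall CPI) * cycle_unpipelined / (cycle_unpipelined / depth + penalty)."""
    cycle_pipelined = cycle_unpipelined / depth + penalty
    return (1.0 / (1.0 + stall_cpi)) * (cycle_unpipelined / cycle_pipelined)

no_stalls = speedup_cpi_one(10.0, 5, 0.5, stall_cpi=0.0)    # limited only by the penalty
with_stalls = speedup_cpi_one(10.0, 5, 0.5, stall_cpi=0.9)  # penalty plus branch stalls
print(no_stalls, round(with_stalls, 2))
```

Even without stalls the speedup stays below the pipeline depth, because the penalty term keeps the pipelined cycle from shrinking by the full factor of 5.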

17 It's Not That Easy for Computers
- Decreasing cycle time increases demands on components.
- Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
  - Structural hazards: HW cannot support this combination of instructions.
  - Data hazards: instruction depends on the result of a prior instruction still in the pipeline.
  - Control hazards: pipelining of branches & other instructions.
- Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline.

18 Control Hazards
- When a branch is executed it may (or may not!) change the PC to something other than current PC+4.
- Simple solution: after detecting a branch instruction, stall the pipeline until the target address is computed.
- This introduces 3 cycles of stalls.
- Different implementation than a data stall, since the IF cycle must be repeated.

19 Pipelined DLX Datapath (Figure 3.4, page 137)
[Figure: pipelined DLX datapath. Stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc.; Memory Access; Write Back. Pipeline registers separate the stages. Annotations: “Need Result”, “Latch for later”.]
- Data stationary control: local decode for each instruction phase / pipeline stage.

20 Four Branch Hazard Alternatives
- #1: Stall until branch direction is clear
- #2: Predict Branch Not Taken
  - Execute successor instructions in sequence
  - “Flush” instructions in pipeline if branch actually taken
  - Advantage of late pipeline state update, since no need to “undo”
  - 47% of DLX branches not taken on average
  - PC+4 already calculated, so use it to get next instruction
- #3: Predict Branch Taken
  - 53% of DLX branches taken on average
  - But haven’t calculated branch target address in DLX
  - DLX still incurs 1 cycle branch penalty

21 Four Branch Hazard Alternatives
- #4: Delayed Branch
  - Define branch to take place AFTER a following instruction:
      branch instruction
      sequential successor_1
      sequential successor_2
      ...
      sequential successor_n      (branch delay of length n)
      branch target if taken
  - 1 slot delay allows proper decision and branch target address in 5 stage pipeline
  - DLX uses this

22 Delayed Branch
Where to get instructions to fill branch delay slot?
- Before branch instruction
- From the target address: only valuable when branch taken
- From fall through: only valuable when branch not taken
- Cancelling branches allow more slots to be filled, because the slot instruction becomes a no-op if the prediction is wrong.
Compiler effectiveness for single branch delay slot:
- Fills about 60% of branch delay slots
- About 80% of instructions executed in branch delay slots are useful in computation
- About 50% (60% x 80%) of slots usefully filled
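
The “about 50%” figure is the product of the two compiler statistics. A quick check (variable names are illustrative):

```python
fill_rate = 0.60     # fraction of delay slots the compiler fills (from the slide)
useful_rate = 0.80   # fraction of filled slots doing useful work (from the slide)

usefully_filled = fill_rate * useful_rate
wasted = 1.0 - usefully_filled   # slots left empty or filled with useless work
print(f"usefully filled: {usefully_filled:.0%}, wasted: {wasted:.0%}")
```

Under these numbers roughly half of all delay slots still cost a cycle, which is one way to read a delayed branch as having an average penalty of about half a cycle.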

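23 Evaluating Branch Alternatives
[Table: columns are Scheduling scheme, Branch penalty, CPI, Speedup v. unpipelined, Speedup v. stall; rows are Stall pipeline, Predict taken, Predict not taken, Delayed branch. Numeric entries are not preserved in this transcript.]
Conditional & unconditional branches = 14% of instructions; 65% of them change the PC.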
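
The columns of this comparison can be recomputed from the branch frequency. The sketch below is an illustration, not a reconstruction of the slide's numbers: the 14% and 65% figures come from this slide, the 3-cycle stall and the 1-cycle predict-taken penalty come from earlier slides, and the remaining penalties, the 5-stage depth, and all names are assumptions.

```python
BRANCH_FREQUENCY = 0.14   # from the slide: conditional + unconditional branches
TAKEN_FRACTION = 0.65     # from the slide: fraction of branches that change the PC
PIPELINE_DEPTH = 5        # assumed 5-stage DLX pipeline

# scheme -> (penalty in cycles, fraction of branches that pay the penalty)
SCHEMES = {
    "Stall pipeline":    (3.0, 1.0),             # 3-cycle stall on every branch
    "Predict taken":     (1.0, 1.0),             # 1-cycle penalty on every branch
    "Predict not taken": (1.0, TAKEN_FRACTION),  # assumed: 1 cycle, only when taken
    "Delayed branch":    (0.5, 1.0),             # assumed: about half the slots wasted
}

def branch_cpi(penalty, paying_fraction, ideal_cpi=1.0):
    """CPI = ideal CPI + branch frequency * fraction paying * penalty."""
    return ideal_cpi + BRANCH_FREQUENCY * paying_fraction * penalty

rows = {}
for name, (penalty, fraction) in SCHEMES.items():
    rows[name] = {"penalty": penalty, "cpi": branch_cpi(penalty, fraction)}

stall_cpi = rows["Stall pipeline"]["cpi"]
for row in rows.values():
    # Assumes equal clocks and an unpipelined CPI equal to the pipeline depth.
    row["speedup_vs_unpipelined"] = PIPELINE_DEPTH / row["cpi"]
    row["speedup_vs_stall"] = stall_cpi / row["cpi"]

for name, row in rows.items():
    print(f"{name:18s} {row['penalty']:4.1f} {row['cpi']:5.2f} "
          f"{row['speedup_vs_unpipelined']:5.2f} {row['speedup_vs_stall']:5.2f}")
```

Changing the assumed penalties shows how sensitive the ranking of schemes is to the cost of a wrong guess.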

24 Compiler “Static” Prediction of Taken/Untaken Branches
- Improves strategy for placing instructions in delay slot
- Two strategies:
  - Backward branch predict taken, forward branch not taken
  - Profile-based prediction: record branch behavior, predict branch based on prior run (common case both directions)
[Figure: chart comparing prediction strategies; legend: Taken backwards / Not taken forwards; Always taken.]
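
The first strategy needs only the sign of the branch displacement. A minimal sketch, with a made-up trace of (branch PC, target PC, actual outcome) tuples purely for illustration:

```python
def predict_taken(branch_pc, target_pc):
    """Backward taken, forward not taken: loops usually branch backward."""
    return target_pc < branch_pc

# Hypothetical trace: (branch_pc, target_pc, actually_taken)
trace = [
    (0x40, 0x20, True),    # loop back-edge, taken
    (0x40, 0x20, True),    # loop back-edge, taken
    (0x40, 0x20, False),   # loop exit, mispredicted
    (0x60, 0x80, False),   # forward branch, not taken
    (0x60, 0x80, True),    # forward branch, taken, mispredicted
]

correct = sum(predict_taken(pc, target) == taken for pc, target, taken in trace)
accuracy = correct / len(trace)
print(f"{correct}/{len(trace)} correct")
```

A profile-based predictor would replace `predict_taken` with a per-branch lookup of the majority outcome recorded in a prior run.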

25 Evaluating Static Branch Prediction
- Misprediction rate ignores the frequency of branches.
- “Instructions between mispredicted branches” is a better metric, since it weights by frequency.
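
To make the difference between the two metrics concrete, the sketch below compares two hypothetical programs; the branch frequencies and misprediction rates are invented for illustration.

```python
def instructions_between_mispredictions(branch_frequency, misprediction_rate):
    """Average number of instructions executed per mispredicted branch."""
    return 1.0 / (branch_frequency * misprediction_rate)

# Program A: branch-heavy, lower misprediction rate.
a = instructions_between_mispredictions(branch_frequency=0.20, misprediction_rate=0.10)
# Program B: few branches, higher misprediction rate.
b = instructions_between_mispredictions(branch_frequency=0.05, misprediction_rate=0.20)
print(round(a), round(b))
```

Program B has the worse misprediction rate but runs about twice as long between mispredictions, so it suffers less from them.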

26 Interrupts
- Synchronous vs. asynchronous
- User requested vs. coerced
- Maskable vs. unmaskable
- Within vs. between instructions
- Resume vs. terminate
External interrupts are easier to implement (asynchronous, maskable, between instructions), because they can be handled after completion of all instructions in the pipe. They are harder to make perform well (coerced), because they are not predictable (like a branch, with no support).

27 Pipelining Complications
Interrupts: 5 instructions executing in 5 stage pipeline
- How to stop the pipeline?
- How to restart the pipeline?
- Who caused the interrupt?

Stage   Problem interrupts occurring
IF      Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID      Undefined or illegal opcode
EX      Arithmetic interrupt
MEM     Page fault on data fetch; misaligned memory access; memory-protection violation

28 Pipelining Complications
- “Precise exceptions”: all instructions before the exception complete before the exception is taken, and all subsequent instructions can be restarted from scratch.
  - Performance penalty (> 10x on some machines)
  - Consequently: 2 modes (precise for debugging)
- Flushing pipeline so that the instruction is restartable
  - Delayed branch complicates this: can’t just save one PC, since the pipeline is not sequentially ordered. May need to save a PC per pipeline stage (of delayed branch) + 1.
  - Must retrieve source operands
    - DST = SRC
    - SRC may be overwritten by other instructions.

29 Pipelining Complications
Simultaneous exceptions in more than one pipeline stage, e.g.,
- Load with data page fault in MEM stage
- Add with instruction page fault in IF stage
- Add fault will happen BEFORE load fault
Solution #1
- Interrupt status vector per instruction
- Defer check until last stage, kill state update if exception
Solution #2
- Interrupt ASAP
- Restart everything that is incomplete
Another advantage for state update late in pipeline!

31 Implementing Precise Exceptions
Straightforward for DLX integer pipeline:
- Interrupt Status Vector (ISV) is a special register, stored in the pipeline latch.
- At the time of any interrupt/exception, all writes are disabled.
- The actual check for the interrupt and the trap to the interrupt handler are deferred until WB.
- The ISV is checked, and interrupts are posted in instruction order (unpipelined order).
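
The effect of deferring the check to WB can be sketched with the load/add example from the simultaneous-exceptions slide. This is a toy model under stated assumptions: one instruction issued per cycle, no stalls, and a dictionary encoding and function names invented for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def detection_cycle(issue_cycle, stage):
    # An instruction issued in `issue_cycle` occupies STAGES[i] in cycle issue_cycle + i.
    return issue_cycle + STAGES.index(stage)

def first_by_time(faults):
    """Interrupt ASAP: the fault detected in the earliest cycle is seen first."""
    return min(faults, key=lambda f: detection_cycle(f["issue"], f["stage"]))["name"]

def first_by_isv(faults):
    """ISV checked at WB: instructions reach WB in program order, so the
    earliest-issued faulting instruction traps first."""
    return min(faults, key=lambda f: f["issue"])["name"]

faults = [
    {"name": "load", "issue": 1, "stage": "MEM"},  # data page fault, detected in cycle 4
    {"name": "add",  "issue": 2, "stage": "IF"},   # instruction page fault, detected in cycle 2
]
print(first_by_time(faults), first_by_isv(faults))
```

Carrying the status vector down the pipe turns the out-of-order detection (add before load) back into program order (load before add).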

32 Pipelining Complications
Complex Addressing Modes and Instructions
- Address modes: autoincrement causes register change during instruction execution
  - Interrupts? Need to restore register state
  - Adds WAR and WAW hazards, since writes are no longer in the last stage
- Memory-memory move instructions
  - Must be able to handle multiple page faults
  - Long-lived instructions: partial state save on interrupt
- Condition codes: set by previous instruction (what if interrupt in between?)

33 Pipelining Complications: When not all instructions take the same time...
- Floating point: long execution time
- Impractical to require that FP operations complete in one clock cycle: slow clock
- So we have at least one stage of the pipeline that will complete with different latencies for different instructions.

