Chapter 3: Pipelining Soonchunhyang University, School of Computer Science, Sang-Jung Lee Adapted from

Copyright 1998 UCB
Review today, not so fast in future

Review, #1 Classifying Instruction Set Architectures
Accumulator (1 register): 1 address
Stack: 0 address
General Purpose Register: 2 address, 3 address
Load/Store: 3 address
General Purpose Registers Dominate
Data Addressing modes that are important: Displacement, Immediate, Register Indirect
  Displacement size should be 12 to 16 bits
  Immediate size should be 8 to 16 bits

Review, #2 Operations in the Instruction Set :
Data Movement, Arithmetic, Shift, Logical, Control, subroutine linkage, interrupt, synchronization, string, graphics (MMX)
Methods of Testing Condition
  condition codes
  condition register
  compare and branch

Review, #3 DLX Architecture
Simple load-store architecture
DLX registers
  32 32-bit GPRs named R0, R1, ..., R31
  32 32-bit FPRs named F0, F1, ..., F31 (usable as 16 64-bit even-odd pairs named F0, F2, ..., F30)
Byte addressable in big-endian with 32-bit address

What Is Pipelining Laundry Example
Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 30 minutes
“Folder” takes 30 minutes
“Stasher” takes 30 minutes to put clothes into drawers

Sequential Laundry
[Figure: timeline from 6 PM to 2 AM; loads A, B, C, D in task order, 30 minutes per stage]
Sequential laundry takes 8 hours for 4 loads
If they learned pipelining, how long would laundry take?

Pipelined Laundry
[Figure: timeline from 6 PM to 2 AM; loads A, B, C, D in task order, overlapped, 30 minutes per stage]
Pipelined laundry takes 3.5 hours for 4 loads!

Pipelining Lessons
[Figure: timeline from 6 PM to 9 PM; loads A, B, C, D in task order, overlapped, 30 minutes per stage]
Pipelining doesn’t help latency of single task, it helps throughput of entire workload
Multiple tasks operating simultaneously using different resources
Potential speedup = Number pipe stages
Pipeline rate limited by slowest pipeline stage
Unbalanced lengths of pipe stages reduces speedup
Time to “fill” pipeline and time to “drain” it reduces speedup
Stall for Dependences
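The laundry timings can be checked with a few lines of arithmetic. This is a minimal sketch, assuming four equal 30-minute stages (wash, dry, fold, stash) and no stalls; the function names are made up for illustration.

```python
def sequential_minutes(loads, stages, stage_minutes):
    # Each load runs through every stage before the next load starts.
    return loads * stages * stage_minutes


def pipelined_minutes(loads, stages, stage_minutes):
    # The first load needs all stages ("fill"); every later load
    # finishes one stage time after the load in front of it.
    return (stages + loads - 1) * stage_minutes


seq = sequential_minutes(loads=4, stages=4, stage_minutes=30)
pipe = pipelined_minutes(loads=4, stages=4, stage_minutes=30)
print(f"sequential: {seq / 60} hours")
print(f"pipelined:  {pipe / 60} hours")
print(f"speedup:    {seq / pipe:.2f}")
```

The speedup stays well below the number of stages here because only four loads pass through the pipeline, so the fill and drain time is a large share of the total.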

A Simple Implementation of DLX
Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
IF       ID       EX       MEM      WB
IF: Instruction Fetch Cycle
ID: Decode/Register Fetch Cycle
EX: Execution/Effective Address Cycle
MEM: Memory Access/Branch Completion Cycle
WB: Write-Back Cycle
As shown here, each of these five steps takes one clock cycle to complete. In pipeline terminology, each step is referred to as one stage of the pipeline.

DLX Pipeline Stage
Instruction fetch cycle (IF)
  IR <- Mem[PC]
  NPC <- PC + 4
Instruction decode/register fetch cycle (ID)
  A <- Regs[IR6..10], B <- Regs[IR11..15]
  Imm <- ((IR16)^16 ## IR16..31)
Execution/effective address cycle (EX)
  ALUOutput <- A + Imm (Memory reference)
  ALUOutput <- A op B (Register-Register ALU)
  ALUOutput <- A op Imm (Register-Immediate ALU)
  ALUOutput <- NPC + Imm, Cond <- (A op 0) (Branch)

DLX Pipeline Stage
Memory access/branch completion cycle (MEM)
  LMD <- Mem[ALUOutput] (Memory access - load)
  Mem[ALUOutput] <- B (Memory access - store)
  if (Cond) PC <- ALUOutput else PC <- NPC (Branch)
Write-back cycle (WB)
  Regs[IR16..20] <- ALUOutput (Register-Register ALU)
  Regs[IR11..15] <- ALUOutput (Register-Immediate ALU)
  Regs[IR11..15] <- LMD (Load)
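The immediate computed in the ID step is a sign extension: the sign bit of the 16-bit immediate field is replicated 16 times and concatenated with the field itself. A minimal sketch follows, assuming DLX bit numbering where bit 0 is the most significant bit, so the immediate field is the low-order 16 bits of the instruction word; the function name and the sample instruction words are hypothetical.

```python
def sign_extend_imm16(ir: int) -> int:
    """Return the 32-bit sign-extended immediate of a 32-bit instruction word."""
    low16 = ir & 0xFFFF              # immediate field (low-order 16 bits)
    sign = (low16 >> 15) & 1         # sign bit of the immediate
    upper = 0xFFFF if sign else 0x0000   # sign bit replicated 16 times
    return (upper << 16) | low16     # concatenate: replicated sign ## immediate


# Illustrative instruction words with immediates +4 and -4.
print(hex(sign_extend_imm16(0x20010004)))
print(hex(sign_extend_imm16(0x2001FFFC)))
```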

5 Steps of DLX Datapath Figure 3.1, Page 130
[Figure: datapath divided into Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labeled registers: IR, LMD]

DLX Execution times
Branch - four cycles
All others - five cycles
Assume branch frequency = 12%: overall CPI = 4.88
ALU instructions can be reduced to four cycles by skipping the MEM cycle
Assume ALU frequency = 44%: CPI = 4.44
Improvement = 4.88/4.44 = 1.1
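The CPI figures are weighted averages over the instruction mix. A short sketch of that arithmetic, using only the frequencies and cycle counts stated on the slide (the helper name is hypothetical):

```python
def average_cpi(mix):
    """mix: list of (frequency, cycles) pairs whose frequencies sum to 1."""
    return sum(freq * cycles for freq, cycles in mix)


# Branches take 4 cycles, everything else 5.
base = average_cpi([(0.12, 4), (0.88, 5)])

# ALU instructions (44%) skip MEM and also take 4 cycles.
improved = average_cpi([(0.12, 4), (0.44, 4), (0.44, 5)])

print(f"base CPI:     {base:.2f}")
print(f"improved CPI: {improved:.2f}")
print(f"improvement:  {base / improved:.1f}")
```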

Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: timing diagrams against Clk, Cycle 1 to Cycle 10]
  Single Cycle Implementation: Load, then Store with the end of its cycle marked Waste
  Multiple Cycle Implementation: Load (IF ID EX MEM WB), Store (IF ID EX MEM), R-type (IF ...)
  Pipeline Implementation: Load (IF ID EX MEM WB), Store (IF ID EX MEM WB), R-type (IF ID EX MEM WB), overlapped
Here are the timing diagrams showing the differences between the single cycle, multiple cycle, and pipeline implementations. For example, in the pipeline implementation, we can finish executing the Load, Store, and R-type instruction sequence in seven cycles. In the multiple clock cycle implementation, however, we cannot start executing the store until Cycle 6 because we must wait for the load instruction to complete. Similarly, we cannot start the execution of the R-type instruction until the store instruction has completed its execution in Cycle 9. In the Single Cycle implementation, the cycle time is set to accommodate the longest instruction, the Load instruction. Consequently, the cycle time for the Single Cycle implementation can be five times longer than that of the multiple cycle implementation. But maybe more importantly, since the cycle time has to be long enough for the load instruction, it is too long for the store instruction, so the last part of the cycle here is wasted.

Why Pipeline?
Suppose we execute 100 instructions
Single Cycle Machine
  45 ns/cycle x 1 CPI x 100 inst = 4500 ns
Multicycle Machine
  10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
Ideal pipelined machine
  10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
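The three totals can be reproduced directly. This sketch uses only the cycle times and CPI values given on the slide; the function name and dictionary keys are made up for illustration.

```python
def execution_times_ns(n_instructions):
    """Total execution time in ns for the three machines on the slide."""
    return {
        # One long cycle per instruction.
        "single_cycle": 45 * 1 * n_instructions,
        # Short cycles, but 4.6 of them per instruction on average.
        "multicycle": 10 * 4.6 * n_instructions,
        # One instruction completes per cycle, plus 4 cycles to drain.
        "ideal_pipelined": 10 * (1 * n_instructions + 4),
    }


for machine, ns in execution_times_ns(100).items():
    print(f"{machine:>16}: {ns:.0f} ns")
```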

The Basic Pipeline for DLX
[Figure: instructions in program flow (vertical axis) against time (horizontal axis); each instruction passes through IF ID EX MEM WB, overlapped with its neighbors]
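The staggered layout of such a diagram is easy to generate. The following is a minimal sketch, assuming an ideal pipeline with no stalls in which instruction i enters IF one cycle after instruction i-1; the function name and row labels are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(n_instructions):
    """Return one row per instruction; row[c] is the stage occupied in cycle c+1."""
    total_cycles = n_instructions + len(STAGES) - 1
    rows = []
    for i in range(n_instructions):
        cells = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            cells[i + s] = stage   # instruction i is in stage s during cycle i+s+1
        rows.append(cells)
    return rows


for i, row in enumerate(pipeline_diagram(5)):
    print(f"instr i+{i}: " + " ".join(f"{cell:<3}" for cell in row).rstrip())
```

With n instructions and 5 stages the diagram spans n + 4 cycles, which is the "fill and drain" cost on top of one instruction per cycle.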

Visualizing Pipelining Figure 3.3, Page 133
[Figure: instruction order (vertical axis) against time in clock cycles (horizontal axis)]

Pipelined DLX Datapath Figure 3.4, page 137
[Figure: pipelined datapath with stages Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc.; Memory Access; Write Back]
Data stationary control
  local decode for each instruction phase / pipeline stage

Basic performance issues: Example
Unpipelined CPU: 10-ns clock cycle
Four cycles for ALU operations and branches
Five cycles for memory operations
Frequency of ALU operations, branches, and memory operations = 40%, 20% and 40%
Average instruction execution time = Clock * Average CPI = 10ns*((.4+.2)*4 + (.4*5)) = 10ns*4.4 = 44ns

Basic performance issues: Example
Pipelined CPU: 11-ns clock cycle (to accommodate slowest stage)
No pipeline conflicts
Average instruction execution time = 11ns
Speedup = Time unpipelined / Time pipelined = 44ns/11ns = 4
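The same example worked as code: a sketch that assumes, as the slide does, that the ideal pipelined machine completes one instruction per 11-ns clock. The dictionary keys and function name are illustrative only.

```python
def average_instruction_time_ns(clock_ns, mix):
    """mix maps an instruction class to (frequency, cycles)."""
    average_cpi = sum(freq * cycles for freq, cycles in mix.values())
    return clock_ns * average_cpi


unpipelined = average_instruction_time_ns(
    clock_ns=10,
    mix={"alu": (0.40, 4), "branch": (0.20, 4), "memory": (0.40, 5)},
)
pipelined = 11 * 1   # 11-ns clock, CPI of 1 with no pipeline conflicts

print(f"unpipelined: {unpipelined:.0f} ns")
print(f"pipelined:   {pipelined} ns")
print(f"speedup:     {unpipelined / pipelined:.1f}")
```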

The Major Hurdle of Pipelining - Pipeline Hazards
Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
  Structural hazards: HW cannot support this combination of instructions
  Data hazards: Instruction depends on result of prior instruction still in the pipeline
  Control hazards: Pipelining of branches & other instructions that change the PC
Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline

One Memory Port/Structural Hazards Figure 3.6, Page 142
[Figure: time in clock cycles (horizontal axis) against instruction order (vertical axis): Load, Instr 1, Instr 2, Instr 3, Instr 4]

One Memory Port/Structural Hazards Figure 3.7, Page 143
[Figure: time in clock cycles (horizontal axis) against instruction order (vertical axis): Load, Instr 1, Instr 2, stall, Instr 3]
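Why the stall lands just before Instr 3 can be seen by lining up the stages. This is a rough sketch, assuming a single memory port shared by IF and MEM, one instruction issued per cycle, and no other stalls; the function name and the boolean encoding of the instruction stream are made up for illustration.

```python
def memory_port_conflicts(is_mem_op):
    """is_mem_op[k] is True if instruction k (0-based) is a load or store.

    Instruction k is fetched in cycle k+1 and reaches MEM in cycle k+4.
    Returns (cycle, mem_instr_index, fetch_instr_index) for every cycle in
    which a data access and an instruction fetch need the port together.
    """
    conflicts = []
    for k, uses_memory in enumerate(is_mem_op):
        if not uses_memory:
            continue
        mem_cycle = k + 4
        fetch_index = mem_cycle - 1      # instruction whose IF falls in that cycle
        if fetch_index < len(is_mem_op):
            conflicts.append((mem_cycle, k, fetch_index))
    return conflicts


# Load followed by Instr 1..4, none of which access data memory.
print(memory_port_conflicts([True, False, False, False, False]))
```

The single conflict reported is between the Load (index 0) in MEM and Instr 3 (index 3) in IF during cycle 4, which is where the figure inserts the stall.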

Data Hazard on R1 Figure 3.9, page 147
[Figure: time in clock cycles (horizontal axis), stages IF, ID/RF, EX, MEM, WB, against instruction order (vertical axis)]
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or r8,r1,r9
  xor r10,r1,r11

Three Generic Data Hazards
InstrI followed by InstrJ
Read After Write (RAW)
  InstrJ tries to read operand before InstrI writes it

InstrI followed by InstrJ
Write After Read (WAR)
  InstrJ tries to write operand before InstrI reads it
  Gets wrong operand
  Can’t happen in DLX 5 stage pipeline because:
    All instructions take 5 stages, and
    Reads are always in stage 2, and
    Writes are always in stage 5

InstrI followed by InstrJ
Write After Write (WAW)
  InstrJ tries to write operand before InstrI writes it
  Leaves wrong result (InstrI not InstrJ)
  Can’t happen in DLX 5 stage pipeline because:
    All instructions take 5 stages, and
    Writes are always in stage 5
Will see WAR and WAW in later more complicated pipes
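The three categories can be told apart from register names alone. The sketch below reports which dependences exist between two instructions; whether a dependence becomes a real hazard depends on the pipeline, and as noted above only RAW can in the 5-stage DLX pipeline. The tuple encoding and function name are assumptions for illustration.

```python
def classify_dependences(instr_i, instr_j):
    """Each instruction is (dest_register_or_None, [source_registers]).

    instr_i comes before instr_j in program order.
    """
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    found = []
    if dest_i is not None and dest_i in srcs_j:
        found.append("RAW")   # J reads what I writes
    if dest_j is not None and dest_j in srcs_i:
        found.append("WAR")   # J writes what I reads
    if dest_i is not None and dest_i == dest_j:
        found.append("WAW")   # both write the same register
    return found


add = ("r1", ["r2", "r3"])    # add r1,r2,r3
sub = ("r4", ["r1", "r3"])    # sub r4,r1,r3
print(classify_dependences(add, sub))
```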

Forwarding to Avoid Data Hazard Figure 3.10, Page 149
[Figure: time in clock cycles (horizontal axis) against instruction order (vertical axis)]
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or r8,r1,r9
  xor r10,r1,r11

HW Change for Forwarding Figure 3.20, Page 161

Data Hazard Even with Forwarding Figure 3.12, Page 153
[Figure: time in clock cycles (horizontal axis) against instruction order (vertical axis)]
  lw r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or r8,r1,r9
MIPS actually didn’t interlock: Microprocessor without Interlocked Pipeline Stages

Data Hazard Even with Forwarding Figure 3.13, Page 154
[Figure: time in clock cycles (horizontal axis) against instruction order (vertical axis)]
  lw r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or r8,r1,r9

Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f in memory.
Slow code:
  LW  Rb,b
  LW  Rc,c
  ADD Ra,Rb,Rc
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  SUB Rd,Re,Rf
  SW  d,Rd
Fast code:
  LW  Rb,b
  LW  Rc,c
  LW  Re,e
  ADD Ra,Rb,Rc
  LW  Rf,f
  SW  a,Ra
  SUB Rd,Re,Rf
  SW  d,Rd
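The benefit of the reordering can be counted. The sketch below assumes full forwarding, so the only remaining stall is a load whose result is used by the very next instruction (one stall cycle each); the tuple encoding and function name are hypothetical.

```python
def count_load_use_stalls(program):
    """program: list of (opcode, dest_register_or_None, [source_registers])."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        opcode, dest, _ = prev
        if opcode == "LW" and dest in cur[2]:
            stalls += 1   # loaded value is not ready for the next instruction's EX
    return stalls


slow = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
    ("SW", None, ["Ra"]), ("LW", "Re", []), ("LW", "Rf", []),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]
fast = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []), ("SW", None, ["Ra"]),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]
print("slow code stalls:", count_load_use_stalls(slow))
print("fast code stalls:", count_load_use_stalls(fast))
```

Under these assumptions the slow sequence stalls twice (ADD after LW Rc, SUB after LW Rf), and the scheduled sequence does not stall.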

Control Hazard on Branches Three Stage Stall
[Figure: pipeline diagram of a Branch followed by successor instructions, with a three-cycle stall after the branch; axes: Time (horizontal), Program Flow (vertical)]

Branch Stall Impact
If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!
Two part solution:
  Determine branch taken earlier, AND
  Compute taken branch address earlier
DLX branch tests if register = 0 or ≠ 0
DLX Solution:
  Move Zero test to ID/RF stage
  Adder to calculate new PC in ID/RF stage
  1 clock cycle penalty for branch versus 3
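The CPI impact follows from adding the stall cycles contributed by branches to the base CPI. A small sketch, using the slide's figures of a base CPI of 1 and 30% branches (the function name is hypothetical):

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, stall_cycles):
    # Every branch adds stall_cycles bubbles on top of the base CPI.
    return base_cpi + branch_fraction * stall_cycles


three_cycle = cpi_with_branch_stalls(1.0, 0.30, 3)
one_cycle = cpi_with_branch_stalls(1.0, 0.30, 1)
print(f"3-cycle branch stall: CPI = {three_cycle:.1f}")
print(f"1-cycle branch stall: CPI = {one_cycle:.1f}")
```

Moving the zero test and the target adder into ID/RF therefore cuts the CPI from 1.9 to 1.3 for this branch frequency.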

Control Hazard on Branches One Stage Stall
[Figure: pipeline diagram of a Branch followed by successor instructions, with a one-cycle stall after the branch; axes: Time (horizontal), Program Flow (vertical)]

Pipelined DLX Datapath Figure 3.22, page 163
[Figure: pipelined datapath with stages Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc.; Memory Access; Write Back]
This is the correct 1 cycle latency implementation!
Does the MIPS test affect the clock? (Add forwarding logic too!)

Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
  Execute successor instructions in sequence
  “Squash” instructions in pipeline if branch actually taken
  47% DLX branches not taken on average
  PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
  53% DLX branches taken on average
  But haven’t calculated branch target address in DLX
    DLX still incurs 1 cycle branch penalty
    Other machines: branch target known before outcome

Four Branch Hazard Alternatives
#4: Delayed Branch
  Define branch to take place AFTER a following instruction
    branch instruction
    sequential successor_1
    sequential successor_2
    ...
    sequential successor_n
    branch target if taken
  (successor_1 through successor_n: branch delay of length n)
  1 slot delay allows proper decision and branch target address in 5 stage pipeline
  DLX uses this

Delayed Branch
Where to get instructions to fill branch delay slot? (P.169 Fig. 3.28)
  Before branch instruction
  From the target address: only valuable when branch taken
  From fall through: only valuable when branch not taken
  Cancelling branches allow more slots to be filled
Compiler effectiveness for single branch delay slot:
  Fills about 60% of branch delay slots
  About 80% of instructions executed in branch delay slots useful in computation
  About 50% (60% x 80%) of slots usefully filled
Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

Evaluating Branch Alternatives
Scheduling scheme    Branch penalty    CPI    Speedup v. unpipelined    Speedup v. stall
Stall pipeline
Predict taken
Predict not taken
Delayed branch

Summary, #1
Pipelining is a technique for increasing the performance of a CPU by overlapping the execution of instructions.
Pipelining doesn’t help latency of single task, it helps throughput of entire workload
Potential speedup = Number pipe stages

Summary, #2 A Simple Implementation of DLX
IF: Instruction Fetch Cycle
ID: Decode/Register Fetch Cycle
EX: Execution/Effective Address Cycle
MEM: Memory Access/Branch Completion Cycle
WB: Write-Back Cycle
IF ID EX MEM WB

Summary, #3 Limits to pipelining: Hazards
Common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline
Structural hazards
  HW cannot support this combination of instructions
Data hazards
  Instruction depends on result of prior instruction still in the pipeline
  RAW, WAR, WAW
  Forwarding to Avoid Data Hazard (RAW)
Control hazards
  Pipelining of branches & other instructions that change the PC
  Predict Branch
  Delayed Branch
