Peng Liu liupeng@zju.edu.cn Lecture 9 Pipeline Peng Liu liupeng@zju.edu.cn.

Peng Liu liupeng@zju.edu.cn
Lecture 9 Pipeline Peng Liu

Pipelining Review What makes it easy
all instructions are the same length just a few instruction formats memory operands appear only in loads and stores Hazards limit performance Structural: need more HW resources Data: need forwarding, compiler scheduling Data hazards must be handled carefully MIPS32 instruction set architecture made pipeline visible (delayed branch, delayed load) Data hazards occur when the pipeline changes the order of read/write accesses to operands so that the order differs from the order seen by sequentially at executing instructions on an unpipelined machine.

MIPS Pipeline Data / Control Paths A (fast)
1 PCSrc ID/EX EX EX/MEM Control MEM IF/ID Add MEM/WB Branch Add WB 4 RegWrite Shift left 2 Instruction Memory Read Addr 1 Data Memory Register File Read Data 1 Read Addr 2 MemtoReg Read Address ALUSrc PC Read Data Address 1 Write Addr ALU Read Data 2 Write Data How many bits wide is each pipeline register? PC – 32 bits IF/ID – 64 bits ID/EX – x = 147 EX/MEM – x3 + 5 = 107 MEM/WB – x2 + 5 = 71 Write Data 1 ALU cntrl MemWrite MemRead Sign Extend 16 32 ALUOp 1 RegDst

MIPS Pipeline Data / Control Paths (debug)
1 PCSrc ID/EX EX/MEM MEM/WB Control Instr EX Instr MEM WB Instr IF/ID Control Control Add Branch Add 4 RegWrite Shift left 2 Instruction Memory Read Addr 1 Data Memory Register File Read Data 1 Read Addr 2 MemtoReg Read Address ALUSrc PC Read Data Address 1 Write Addr ALU Read Data 2 How many bits wide is each pipeline register? PC – 32 bits IF/ID – 64 bits ID/EX – x = 147 EX/MEM – x3 + 5 = 107 MEM/WB – x2 + 5 = 71 Write Data Write Data 1 ALU cntrl MemWrite MemRead Sign Extend 16 32 ALUOp 1 RegDst

MIPS Pipeline Control (pipelined debug)
1 PCSrc ID/EX EX/MEM MEM/WB Instr Instr Instr EX MEM IF/ID WB Control Control Control Add Branch Add 4 RegWrite Shift left 2 Instruction Memory Read Addr 1 Data Memory Register File Read Data 1 Read Addr 2 MemtoReg Read Address ALUSrc PC Read Data Address 1 Write Addr ALU Read Data 2 How many bits wide is each pipeline register? PC – 32 bits IF/ID – 64 bits ID/EX – x = 147 EX/MEM – x3 + 5 = 107 MEM/WB – x2 + 5 = 71 Write Data Write Data 1 ALU cntrl MemWrite MemRead Sign Extend 16 32 ALUOp 1 RegDst

Control Hazards When the flow of instruction addresses is not what the pipeline expects; incurred by change of flow instructions Conditional branches (beq, bne) Unconditional branches (j) Possible solutions Stall Move decision point earlier in the pipeline Predict Delay decision (requires compiler support) Control hazards occur less frequently than data hazards; there is nothing as effective against control hazards as forwarding is for data hazards

The Last Type of Pipeline Hazards: Control Hazards
Just like a data hazard on the program counter (PC) Cannot fetch the next instruction if I don’t know its address (PC) Only worse: we use the PC at the very beginning of the pipeline Makes forwarding very difficult Calculating the next PC for MIPS: Most instructions: PC + 4 Can be calculated in IF stage and forward to next instructions Branches Must first compare registers and calculate PC+4+offset Must fetch, decode, and execute instruciton Jump Must calculate new PC based on offset Must fetch and decode instruction

Datapath Branch and Jump Hardware
1 Shift left 2 Jump PC+4[31-28] 1 Branch PCSrc Shift left 2 Add ID/EX Read Address Instruction Memory Add PC 4 Write Data Read Addr 1 Read Addr 2 Write Addr Register File Data 1 Data 2 16 32 ALU 1 Data IF/ID Sign Extend EX/MEM MEM/WB Control cntrl Forward Unit For lecture

Jumps Incur One Stall j stall lw and
Jumps not decoded until ID, so one stall is needed ALU IM Reg DM j I n s t r. O r d e stall lw Fortunately, jumps are very infrequent – less that 2% of the SPECint instructions and x% of the SPECfp ones. and Fortunately, jumps are very infrequent – only 2% of the SPECint instruction mix

Review: Branches Incur Three Stalls
ALU IM Reg DM beq I n s t r. O r d e Can fix branch hazard by waiting – stall – but affects throughput stall stall stall lw ALU IM Reg DM and Another “solution” is to put in enough extra hardware so that we can test registers, calculate the branch address, and update the PC during the second stage of the pipeline. That would reduce the number of stalls to only one. A third approach is to prediction to handle branches, e.g., always predict that branches will be untaken. When right, the pipeline proceeds at full speed. When wrong, have to stall (and make sure nothing completes – changes machine state – that shouldn’t have). Will talk about these options in more detail in next,next lecture.

Moving Branch Decisions Earlier in Pipe
Move the branch decision hardware back to the EX stage Reduces the number of stall cycles to two Adds an and gate and a 2x1 mux to the EX timing path Add hardware to compute the branch target address and evaluate the branch decision to the ID stage Reduces the number of stall cycles to one (like with jumps) Computing branch target address can be done in parallel with RegFile read (done for all instructions – only used when needed) Comparing the registers can’t be done until after RegFile read, so comparing and updating the PC adds a comparator, an and gate, and a 3x1 mux to the ID timing path Need forwarding hardware in ID stage For longer pipelines, decision points are later in the pipeline, incurring more stalls, so we need a better solution Want a small branch penalty. Need more forwarding and hazard detection hardware for second option (one stall implementation) since a branch depended on a result still in the pipeline (that is one of the source operands for the comparison logic) must be forwarded from the EX/MEM or MEM/WB pipeline latches.

Early Branch Forwarding Issues
Bypass of source operands from the EX/MEM if (IDcontrol.Branch and (EX/MEM.RegisterRd != 0) and (EX/MEM.RegisterRd = IF/ID.RegisterRs)) ForwardC = 1 and (EX/MEM.RegisterRd = IF/ID.RegisterRt)) ForwardD = 1 Forwards the result from the second previous instr. to either input of the Compare MEM/WB “forwarding” is taken care of by the normal RegFile write before read operation If the instruction immediately before the branch produces one of the branch compare source operands, then a stall will be required since the EX stage ALU operation is occurring at the same time as the ID stage branch compare operation

Supporting ID Stage Branches
PCSrc Branch 1 Hazard Unit ID/EX IF.Flush EX/MEM 1 IF/ID Control Add MEM/WB 4 Shift left 2 Add Compare Read Addr 1 Instruction Memory Data Memory RegFile Read Addr 2 Read Address PC Read Data 1 Read Data 1 Write Addr ALU Address ReadData 2 1 Write Data Write Data ALU cntrl 16 Sign Extend Book claims that you have to forward from the MEM/WB pipeline latch, but with RegFile write before read, I don’t think that is the case!! 32 Forward Unit 1 Forward Unit

Branch Prediction Resolve branch hazards by assuming a given outcome and proceeding without waiting to see the actual branch outcome Predict not taken – always predict branches will not be taken, continue to fetch from the sequential instruction stream, only when branch is taken does the pipeline stall If taken, flush instructions in the pipeline after the branch in IF, ID, and EX if branch logic in MEM – three stalls in IF if branch logic in ID – one stall ensure that those flushed instructions haven’t changed machine state– automatic in the MIPS pipeline since machine state changing operations are at the tail end of the pipeline (MemWrite or RegWrite) restart the pipeline at the branch destination

Flushing with Misprediction (Not Taken)
ALU IM Reg DM I n s t r. O r d e 4 beq $1,$2,2 flush ALU IM Reg DM 8 sub $4,$1,$5 16 and $6,$1,$7 20 or r8,$1,$9 ALU IM Reg DM For lecture To flush the IF stage instruction, add a IF.Flush control line that zeros the instruction field of the IF/ID pipeline register (transforming it into a noop)

Branch Prediction, con’t
Resolve branch hazards by statically assuming a given outcome and proceeding Predict taken – always predict branches will be taken Predict taken always incurs a stall (if branch destination hardware has been moved to the ID stage) As the branch penalty increases (for deeper pipelines), a simple static prediction scheme will hurt performance With more hardware, possible to try to predict branch behavior dynamically during program execution Dynamic branch prediction – predict branches at run-time using run-time information Predict taken always incurs one stall at least – assuming the branch destination address hardware has been moved up to the ID stage. So predict not taken is easier since sequential instruction address can be computed in the IF stage.

Dynamic Branch Prediction
A branch prediction buffer (aka branch history table (BHT)) in the IF stage, addressed by the lower bits of the PC, contains a bit that tells whether the branch was taken the last time it was execute Bit may predict incorrectly (may be from a different branch with the same low order PC bits, or may be a wrong prediction for this branch) but the doesn’t affect correctness, just performance If the prediction is wrong, flush the incorrect instructions in pipeline, restart the pipeline with the right instructions, and invert the prediction bit The BHT predicts when a branch is taken, but does not tell where its taken to! A branch target buffer (BTB) in the IF stage can cache the branch target address (or !even! the branch target instruction) so that a stall can be avoided 4096 entry table programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12% 4096 about as good as infinite table, but 4096 is a lot of hardware Ask class why would want to store branch target instruction in the BTB

1-bit Prediction Accuracy
1-bit predictor in loop is incorrect twice when not taken Assume predict_bit = 0 to start (indicating branch not taken) and loop control is at the bottom of the loop code First time through the loop, the predictor mispredicts the branch since the branch is taken back to the top of the loop; invert prediction bit (predict_bit = 1) As long as branch is taken (looping), prediction is correct Exiting the loop, the predictor again mispredicts the branch since this time the branch is not taken falling out of the loop; invert prediction bit (predict_bit = 0) Loop: 1st loop instr 2nd loop instr . last loop instr bne $1,$2,Loop fall out instr For 10 times through the loop we have a 80% prediction accuracy for a branch that is taken 90% of the time

2-bit Predictors right 9 times wrong on loop fall out 1 1
A 2-bit scheme can give 90% accuracy since a prediction must be wrong twice before the prediction bit is changed right 9 times Loop: 1st loop instr 2nd loop instr . last loop instr bne $1,$2,Loop fall out instr wrong on loop fall out Taken Not taken 1 Predict Taken 1 Predict Taken Taken right on 1st iteration Not taken Taken Not taken For lecture Predict Not Taken Predict Not Taken Taken Not taken

Delayed Decision First, move the branch decision hardware and target address calculation to the ID pipeline stage A delayed branch always executes the next sequential instruction – the branch takes effect after that next instruction MIPS software moves an instruction to immediately after the branch that is not affected by the branch (a safe instruction) thereby hiding the branch delay As processor go to deeper pipelines and multiple issue, the branch delay grows and need more than one delay slot. Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches Growth in available transistors has made dynamic approaches relatively cheaper No processor uses delayed branches of more than 1 cycle. For longer branch delays, hardware-base branch prediction is used.

Scheduling Branch Delay Slots
A. From before branch B. From branch target C. From fall through add $1,$2,$3 if $2=0 then add $1,$2,$3 if $1=0 then sub $4,$5,$6 delay slot delay slot add $1,$2,$3 if $1=0 then delay slot sub $4,$5,$6 becomes becomes becomes if $2=0 then add $1,$2,$3 add $1,$2,$3 if $1=0 then sub $4,$5,$6 add $1,$2,$3 if $1=0 then sub $4,$5,$6 Limitations on delayed-branch scheduling come from 1) restrictions on the instructions that can be moved/copied into the delay slot and 2) limited ability to predict at compile time whether a branch is likely to be taken or not. In B and C, the use of $1 prevents the add instruction from being moved to the delay slot In B the sub may need to be copied because it could be reached by another path. B is preferred when the branch is taken with high probability (such as loop branches A is the best choice, fills delay slot & reduces instruction count (IC) In B, the sub instruction may need to be copied, increasing IC In B and C, must be okay to execute sub when branch fails

Compiler Optimizations & Data Hazards
Compilers rearrange code to try to avoid load-use stalls Also known as: filling the load delay slot Idea: find an independent instruction to space apart load & use When successful, no stall is introduced dynamically Independent instruction from before or after the load-use No data dependence to the load-use instructions No control dependence to the load-use instructions No exception issues When can’t fill the slot Leave it empty, hardware will introduce the stall In original MIPS (no HW stalls), a nop was needed in the slot

Code Scheduling to Avoid Stalls
Recorder code to avoid use of load result in the next instruction C code for A = B + E; C = B + F;

Control Hazard Solutions
Stall on branches Wait until you know the answer (branch CPI becomes 3) Control inserted in ID stage of the pipeline Predict not-taken Assume branch is not taken and continue with PC + 4 If branch is actually not taken, all is good Otherwise, nullify misfetched instructions and restart from PC+4+offset Predict taken Assume branch is taken More complex, cause you also need to predict PC+4+offset Still need to nullify instructions if assumption is wrong Most machines do some type of predictions (taken, not-taken)

Exceptions and Interrupts
Unexpected events requiring changes in flow of control Different ISAs use the terms differently Exception Arises within the CPU E.g., undefined opcode, overflow, syscall, … Interrupt E.g., from an external I/O controller (network card) Dealing with them without sacrificing performance is hard

Handling Exceptions Something bad happens to an instruction
Need to stop the machine Fix up the problem (with privileged code like the operating system) Start the machine up again MIPS: exceptions managed by a System Control Coprocessor (CP0) Save PC of offending (or interrupted) instruction In MIPS: Exception Program Counter (EPC) Save indication of the problem In MIPS: Cause register We’ll assume 1-bit for the examples 0 for undefined opcode, 1 for overflow Jump to handler code at hardwired address (e.g., 8000_0180)

An Alternate Mechanism
Vectored Interrupts Handler address determined by the cause Avoids the software overhead of selecting the write code to run based on cause register Example: Undefined opcode: C000_0000 Overflow: C000_0020 C000_0040 Instructions either Deal with the interrupt, or Jump to real handler

Handler Actions Read cause, and transfer to relevant handler
A software switch statement May be necessary even if vectored interrupts are available Determine action required If restartable Take corrective action Use EPC to return to program Otherwise Terminate program (e.g. segfault, …) Report error using EPC, cause, …

Precise Exceptions Definition: precise exceptions
All previous instruction had completed The faulting instruction was not started None of the next instructions were started No changes to the architecture state (registers, memory) Why are precise exceptions desirable by OS developers? With a single cycle machine, precise exceptions are easy Why?

Exceptions in a Pipeline
Another form of control hazard Consider overflow on add in EX stage Add $1, $2, $1 Prevent $1 from being clobbered by add Complete previous instructions Nullify subsequent instructions Set Cause and EPC register values Transfer control to handler Similar to mispredicted branch Uses much of the same hardware

Nullifying Instructions
Nullify = turn an instruction into a nop ( or pipeline bubble) Reset its RegWrite and MemWrite signals Does not affect the sate of the system Resets its branch and jump signals Does not cause unexpected flow control Mark that is should not raise any exceptions of its own Let if flow down the pipeline

Dealing with Multiple Exceptions
Pipelining overlaps multiple instructions Could have multiple exceptions at once Simple approach: deal with exception from earliest instruction Flush subsequent instructions Necessary for “precise” exceptions In more complex pipelines Multiple instructions issued per cycle Out-of-order completion Maintaining precise exceptions is difficult!

Imprecise Exceptions Actually available in several processors
Just stop pipeline and save pipeline state Including exception causes Let the handler work out Which instruction(s) had exceptions Which to complete or flush May require “manual” completion Simplifies hardware, but more complex handler software Not feasible for complex multiple-issue out-of-order pipelines

Beyond the 5-sstage Pipeline
Convert transistors to performance Use transistors to Exploit parallelism Or create it (speculate) Processor generations Simple machine Reuse hardware Pipelined Separate hardware for each stage Super-scalar Multiple port mems, function units Out-of-order Mega-ports, complex scheduling Speculation Multi-core processors Each design has more logic to accomplish same task (but faster)

Acknowledgements These slides contain material from courses: UCB CS152
Stanford EE108B

Peng Liu liupeng@zju.edu.cn Lecture 9 Pipeline Peng Liu liupeng@zju.edu.cn.

Similar presentations

Presentation on theme: "Peng Liu liupeng@zju.edu.cn Lecture 9 Pipeline Peng Liu liupeng@zju.edu.cn."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Peng Liu liupeng@zju.edu.cn Lecture 9 Pipeline Peng Liu liupeng@zju.edu.cn.

Similar presentations

Presentation on theme: "Peng Liu liupeng@zju.edu.cn Lecture 9 Pipeline Peng Liu liupeng@zju.edu.cn."— Presentation transcript:

Similar presentations

About project

Feedback