Peng Liu liupeng@zju.edu.cn Lecture 7 Pipelining Peng Liu liupeng@zju.edu.cn.

Peng Liu liupeng@zju.edu.cn
Lecture 7 Pipelining Peng Liu

Review: The Single Cycle Processor

Review: Given Datapath,RTL -> Control
Instruction<31:0> Inst Memory <21:25> <21:25> <16:20> <11:15> <0:15> Adr Op Fun Rt Rs Rd Imm16 Control PCSrc RegWr RegDst ExtOp ALUSrc ALUctr MemWr MemtoReg Zero DATA PATH

Review: The Concept of Local Decoding
That is, instead of asking the Main Control to generates the ALUctr signals directly (see the diagram with the ALU), the main cotrol will generate a set of signals called ALUop. For all I and J type instructions, ALUop will tell the ALU Control exactly what the ALU needs to do (Add, Subtract, ...) . But whenever the Main Control sees a R-type instructions, it simply throws its hands up and say: “Wow, I don’t know what the ALU has to do but I know it is a R-type instruction” and let the Local Control Block, ALU Control to take care of the rest. Notice that this save us one column from the table we had on the last slide. But let’s be honest, if one column is the ONLY thing we save, we probably will not do it. But when you have to design for the entire MIPS instruction set, this column will used for ALL R-type instructions, which is more than just Add and Subtract I showed you here. Another advantage of this table over the last one, besides being smaller, is that we can uniquely identify each column by looking at the Op field only. Therefore, as I will show you later, the Main Control ONLY needs to look at the Opcode field. How many bits do we need for ALUop? func ALU Control (Local) ALUctr op Main Control 6 3 ALUop 6 N ALU

Review: The Encoding of ALUop
Main Control op 6 ALU (Local) func N ALUop ALUctr 3 In this exercise, ALUop has to be 2 bits wide to represent: (1) “R-type” instructions “I-type” instructions that require the ALU to perform: (2) Or, (3) Add, and (4) Subtract To implement the more of MIPS ISA, ALUop has to be 3 bits to represent (4 bits in book to include NOR): (2) Or, (3) Add, (4) Subtract, and (5) And (Example: andi) Well the answer is 2 because we only need to represent 4 things: “R-type,” the Or operation, the Add operation, and the Subtract operation. If you are implementing the entire MIPS instruction set, then ALUop has to be 3 bits wide because we will need to represent 5 things: R-type, Or, Add, Subtract, and AND. Here I show you the bit assignment I made for the 3-bit ALUop. With this bit assignment in mind, let’s figure out what the local control ALU Control has to do. R-type ori lw sw beq jump ALUop (Symbolic) “R-type” Or Add Subtract xxx ALUop<2:0> 1 00 0 10 0 00 0 01

Review: The Decoding of the “func” Field
Main Control op 6 ALU (Local) func N ALUop ALUctr 3 R-type ori lw sw beq jump ALUop (Symbolic) “R-type” Or Add Subtract xxx ALUop<2:0> 1 00 0 10 0 00 0 01 op rs rt rd shamt funct 6 11 16 21 26 31 R-type funct<5:0> Instruction Operation add subtract and or set-on-less-than ALUctr<2:0> ALU Operation 000 001 010 110 111 And Or Add Subtract Set-on-less-than ALUctr ALU What this table and diagram implies is that if the ALU Control receives ALUop = 100, it has to decode the instruction’s “func” field to figure out what the ALU needs to do. Based on the MIPS encoding in Appendix A of your text book, we know we have a Add instruction if the func field is If the func field is , we know we have a subtract operation and so on. Notice that the bit 5 and bit 4 of this field is the same for all these operations so as far as the ALU control is concerned, these bits are don’t care. Now recall from your ALU homework, the ALUctr signals has the following meaning (point to the table): 000 means add, 001 means subtract, ... etc. Based on these three tables (point to the last row of the top table and then the two other tables) and the fact that bit 5 and bit 4 of the “func” field are don’t care, we can derive the following truth table for ALUctr.

Drawback of This Single Cycle Processor
Long cycle time: Cycle time must be long enough for the load instruction: PC’s Clock -to-Q + Instruction Memory Access Time + Register File Access Time + ALU Delay (address calculation) + Data Memory Access Time + Register File Setup Time + Clock Skew Cycle time for load is much longer than needed for all other instructions More specifically, the cycle time must be long enough for the load instruction which has the following components: Clock to Q time of the PC, .... Having a long cycle time is a big problem but not the the only problem. Another problem of this single cycle implementation is that this cycle time, which is long enough for the load instruction, is too long for all other instructions.

Single Cycle Processor
Advantages Single cycle per instruction makes logic and clock simple Disadvantages Inefficient utilization of memory and functional units since different instructions take different lengths of time ALU only computes values a small amount of the time Cycle time is the worst case path -> long cycle times Load instruction Best possible CPI is 1 A single-cycle implementation thus violates our key design principle making the common case fast.

Single Cycle Processor Performance
Functional unit delay Memory: 200ps ALU and adders: 200ps Register file: 100ps CPU clock cycle = 800 ps = 0.8ns(1.25GHz)

Variable Clock Single Cycle Processor Performance
Instruction Mix 45%ALU 25%loads 10%stores 15%branches 5%jumps CPU clock cycle = 0.6x45%+ 0.8x25% + 0.7x10% +0.5x15% +0.2x5%= ns(1.6GHz)

Increasing Parallelism
Problem: Each functional unit used once per cycle Most of the time it is sitting waiting for its turn Well it is calculating all the time, but it is waiting for valid data There is no parallelism in this arrangement Making instructions take more cycles makes machine faster! Each instruction takes roughly the same time While the CPI is much worse, the clock freq is much higher Overlap execution of multiple instructions at the same time Different instructions will be active at the same time This is called “Pipelining” Increases the parallelism going on in the machine We will look at a 5-stage pipeline Modern machines have order 15 cycles/instruction

Pipelined MIPS Processor
Start the next instruction while still working on the current one improves throughput or bandwidth - total amount of work done in a given time (average instructions per second or per clock) instruction latency is not reduced (time from the start of an instruction to its completion) pipeline clock cycle (pipeline stage time) is limited by the slowest stage for some instructions, some stages are wasted cycles Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 lw IFetch Dec Exec Mem WB IFetch Dec Exec Mem WB sw Latency = execution time (delay or response time) – the total time from start to finish of ONE instruction For processors one important measure is THROUGHPUT (or the execution bandwidth) – the total amount of work done in a given amount of time For memories one important measure is BANDWIDTH – the amount of information communicated across an interconnect (e.g., bus) per unit time; the number of operations performed per second (the WIDTH of the operation and the RATE of the operation) IFetch Dec Exec Mem WB R-type

Pipeline Pipelining doesn’t help latency of single task, it helps throughput of entire workload Multiple tasks operating simultaneously Potential speedup = number pipe stages Pipeline rate limited by slowest pipeline stage

Pipelining the MIPS ISA
What makes it easy all instructions are the same length (32 bits) easier to fetch in 1st stage and decode in 2nd stage few instruction formats (three) with symmetry across formats can begin reading register file in 2nd stage memory operations can occur only in loads and stores can use the execute stage to calculate memory addresses each MIPS instruction writes at most one result and does so near the end of the pipeline

An Ideal Pipeline All objects go through the same stages
1 2 3 4 All objects go through the same stages No sharing of resources between any two stages Propagation delay through all pipeline stages is equal The scheduling of an object entering the pipeline is not affected by the objects in other stages These conditions generally hold for industrial assembly lines, but instructions depend on each other! CS252 S05 15

Single Cycle, Multiple Cycle, vs. Pipeline
Single Cycle Implementation: Cycle 1 Cycle 2 Clk Load Store Waste Multiple Cycle Implementation: Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10 Clk lw sw R-type IFetch Dec Exec Mem WB IFetch Dec Exec Mem IFetch Here are the timing diagrams showing the differences between the single cycle, multiple cycle, and pipeline implementations. For example, in the pipeline implementation, we can finish executing the Load, Store, and R-type instruction sequence in seven cycles. Pipeline Implementation: “wasted” cycles lw IFetch Dec Exec Mem WB sw IFetch Dec Exec Mem WB R-type IFetch Dec Exec Mem WB

Multiple Cycle vs. Pipeline, Bandwidth vs. Latency
Multiple Cycle Implementation: Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10 Clk lw sw R-type IFetch Dec Exec Mem WB IFetch Dec Exec Mem IFetch Pipeline Implementation: lw IFetch Dec Exec Mem WB sw IFetch Dec Exec Mem WB R-type IFetch Dec Exec Mem WB Here are the timing diagrams showing the differences between the single cycle, multiple cycles, and pipeline implementations. For example, in the pipeline implementation, we can finish executing the Load, Store, and R-type instruction sequence in seven cycles. Latency per lw = 5 clock cycles for both Bandwidth of lw is 1 per clock clock (IPC) for pipeline vs. 1/5 IPC for multicycle Pipelining improves instruction bandwidth, not instruction latency

Graphically Representing MIPS Pipeline
Can help with answering questions like: How many cycles does it take to execute this code? What is the ALU doing during cycle 4? Is there a hazard, why does it occur, and how can it be fixed? ALU IM Reg DM

Why Pipeline? For Throughput!
Time (clock cycles) ALU IM Reg DM Inst 0 Once the pipeline is full, one instruction is completed every cycle I n s t r. O r d e ALU IM Reg DM Inst 1 ALU IM Reg DM Inst 2 ALU IM Reg DM Inst 3 ALU IM Reg DM Inst 4 Time to fill the pipeline

The Five Stages of Load Instruction
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 lw IFetch Dec Exec Mem WB IFetch: Instruction Fetch and Update PC Dec: Registers Fetch and Instruction Decode Exec: Execute R-type; calculate memory address Mem: Read/write the data from/to the Data Memory WB: Write the result data into the register file As shown here, each of these five steps will take one clock cycle to complete.

Pipelining Load Load instruction takes 5 stages
Five independent functional units work on each stage Each functional unit used only once Another load can start as soon as 1st finishes IF stage Each load still takes 5 cycles to complete The throughput, however, is much higher

Functional Units Are Busy
Pipelining now keeps all the functional units busy Fetch a new instruction each cycle Fetch register every cycle Use the ALU almost every cycle Use the Data Memory many cycles Instructions still take 10ns to complete But start a new instruction every 2ns Look like CPI is 1 Pipeline Timing Diagram

Pipeline Datapath

Load Datapath: Stage 1

Pipeline Control Need to control functional units
But they are from working on different instructions! Not a problem Just pipeline the control signals along with the data Make sure they line up Using labeling conventions often helps Instruction_rf –means this instructions is in RF Every time it gets flopped, changes pipestage Make sure right signal go to the right places

Control Signals Use a main control unit to generate signals during RF/ID stage Control signals for EX ExtOp, ALUSrc, … used 1 cycle later Control signal for Mem MemWr, Branch used 2 cycles later Control signals for WB MemtoReg, MemWr used 3 cycles later

Implementing Control

Putting it All Together

Pipeline Performance Assume time for stage is
100ps for register read or write 200ps for other stages Compare pipelined datapath with single-cycle datapath

Pipeline Performance Program

Pipeline Speedup If all stages are balanced
i.e., all take the same time Time between instructions pipelined = Time between instructions nonpipelined Number of stages If not balanced，speedup is less Speedup due to increased throughput Latency (time for each instruction) does not decrease

Pipelined Datapath write -back phase fetch execute decode & Reg-fetch
IR PC Add we rs1 rs2 rd1 addr we rdata ws addr wd ALU rd2 GPRs rdata Data Memory Inst. Memory Imm Ext wdata write -back phase fetch execute decode & Reg-fetch memory Clock period can be reduced by dividing the execution of an instruction into multiple cycles tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably) However, CPI will increase unless instructions are pipelined CS252 S05 36

MIPS Pipeline Datapath Modifications
What do we need to add/modify in our MIPS datapath? registers between pipeline stages to isolate them IF:IFetch ID:Dec EX:Execute MEM: MemAccess WB: WriteBack 1 Add Add 4 Shift left 2 Instruction Memory Read Addr 1 Data Memory Register File Read Data 1 Read Addr 2 Read Address IFetch/Dec PC Read Data Dec/Exec Exec/Mem Address 1 Note two exceptions to right-to-left flow WB that writes the result back into the register file in the middle of the datapath Selection of the next value of the PC, one input comes from the calculated branch address from the MEM stage Only later instructions in the pipeline can be influenced by these two REVERSE data movements. The first one (WB to ID) leads to data hazards. The second one (MEM to IF) leads to control hazards. All instructions must update some states in the processor – the register file, the memory, or the PC – so separate pipeline registers are redundant to the state that is updated (not needed). PC can be thought of as a pipeline register: the one that feeds the IF stage of the pipeline. Unlike all of the other pipeline registers, the PC is part of the visible architecture states – its content must be saved when an exception occurs (the contents of the other pipe registers are discarded). Write Addr ALU Read Data 2 Mem/WB Write Data Write Data 1 Sign Extend 16 32 System Clock

Technology Assumptions
A small amount of very fast memory (caches) backed up by a large, slower memory Fast ALU (at least for integers) Multiported Register files (slower!) Thus, the following timing assumption is reasonable tIM tRF tALU tDM  tRW A 5-stage pipeline will be the focus of our detailed design - some commercial designs have over 30 pipeline stages to do an integer add! CS252 S05 38

5-Stage Pipelined Execution
Write -Back (WB) I-Fetch (IF) Execute (EX) Decode, Reg. Fetch (ID) Memory (MA) addr wdata rdata Data Memory we ALU Imm Ext 0x4 Add Inst. rd1 GPRs rs1 rs2 ws wd rd2 IR PC time t0 t1 t2 t3 t4 t5 t6 t instruction1 IF1 ID1 EX1 MA1 WB1 instruction2 IF2 ID2 EX2 MA2 WB2 instruction3 IF3 ID3 EX3 MA3 WB3 instruction4 IF4 ID4 EX4 MA4 WB4 instruction IF5 ID5 EX5 MA5 WB5 CS252 S05 39

5-Stage Pipelined Execution Resource Usage Diagram
Write -Back (WB) I-Fetch (IF) Execute (EX) Decode, Reg. Fetch (ID) Memory (MA) addr wdata rdata Data Memory we ALU Imm Ext 0x4 Add Inst. rd1 GPRs rs1 rs2 ws wd rd2 IR PC time t0 t1 t2 t3 t4 t5 t6 t IF I1 I2 I3 I4 I5 ID I1 I2 I3 I4 I5 EX I1 I2 I3 I4 I5 MA I1 I2 I3 I4 I5 WB I1 I2 I3 I4 I5 Resources CS252 S05 40

Pipelined Execution: ALU Instructions
PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add IR Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we wdata rdata Data IR 31 Not quite correct! We need an Instruction Reg (IR) for each stage CS252 S05 41

Pipelined MIPS Datapath without jumps
F D E M W IR WBSrc MemWrite 31 OpSel 0x4 Add RegDst RegWrite rd1 GPRs rs1 rs2 ws wd rd2 we PC A addr inst Inst Memory wdata addr rdata we IR Y ALU B Data Memory R ExtSel BSrc Imm Ext MD1 MD2 Control Points Need to Be Connected CS252 S05 42

Pipelining What makes it hard
structural hazards: what if we had only one memory? control hazards: what about branches? data hazards: what if an instruction’s input operands depend on the output of a previous instruction?

Instructions Interact With Each Other in Pipeline
An instruction in the pipeline may need a resource being used by another instruction in the pipeline  structural hazard An instruction may depend on something produced by an earlier instruction Dependence may be for a data value  data hazard Dependence may be for the next instruction’s address  control hazard (branches, exceptions) CS252 S05 44

Structural Hazards Resource conflict
Occurs when two instructions try to use same hardware Often arise when some functional units are not fully pipelined Simple examples: MIPS pipeline with a single unified memory Load/store requires data access Instruction fetch would have to stall for that cycle Also used for units that are not fully pipelined (mult, div)

Resolving Structural Hazards
Structural hazards occurs when two instructions need same hardware resource at the same time Can resolve in hardware by stalling newer instruction till older instruction finished with resource A structural hazard can always be avoided by adding more hardware to design E.g., if two instructions both need a port to memory at the same time, could avoid hazard by adding second port to memory Our 5-stage pipe has no structural hazards by design Thanks to MIPS ISA, which was designed for pipelining

Data Dependencies Data dependencies for instruction j following instruction I Read after write (RAW) (true dependence) instruction j tries to read before instruction I tries to write it Write after write (WAW) (output dependence) instruction j tries to write an operand before I writes its value Write after read (WAR) (anti dependence) instruction j tries to write a destination before it is read by I No such thing as a read after read (RAR) hazard since there is never a problem reading twice

Dependency Examples True dependency (RAW hazard) addu $t0, $t1, $t2
subu $t3, $t4, $t0 Output dependency (WAW hazard) subu $t0, $t4, $t5 Anti dependency (WAR hazard) subu $t1, $t4, $t5

Data Hazards r1 is stale. Oops! r4  r1 … r1 … ... r1 r0 + 10
IR 31 PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we wdata rdata Data ... r1 r0 + 10 r4 r1 + 17 r1 is stale. Oops! CS252 S05 49

Resolving Data Hazards (1)
Strategy 1: Wait for the result to be available by freezing earlier pipeline stages  interlocks CS252 S05 50

Feedback to Resolve Hazards
FB1 FB2 FB3 FB4 stage 1 2 3 4 Later stages provide dependence information to earlier stages which can stall (or kill) instructions Real designs will seldom provide full feedback nor will they be able to stop on a dime. Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur) CS252 S05 51

Interlocks to Resolve Data Hazards
Stall Condition IR 31 PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we wdata rdata Data nop ... r1 r0 + 10 r4 r1 + 17 CS252 S05 52

Stalled Stages and Pipeline Bubbles
time t0 t1 t2 t3 t4 t5 t6 t (I1) r1 (r0) + 10 IF1 ID1 EX1 MA1 WB1 (I2) r4 (r1) IF2 ID2 ID2 ID2 ID2 EX2 MA2 WB2 (I3) IF3 IF3 IF3 IF3 ID3 EX3 MA3 WB3 (I4) IF4 ID4 EX4 MA4 WB4 (I5) IF5 ID5 EX5 MA5 WB5 stalled stages time t0 t1 t2 t3 t4 t5 t6 t IF I1 I2 I3 I3 I3 I3 I4 I5 ID I1 I2 I2 I2 I2 I3 I4 I5 EX I1 nop nop nop I2 I3 I4 I5 MA I1 nop nop nop I2 I3 I4 I5 WB I1 nop nop nop I2 I3 I4 I5 Resource Usage nop  pipeline bubble CS252 S05 53

Interlock Control Logic
Cstall ws rs rt ? stall IR 31 PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we wdata rdata Data nop Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions. CS252 S05 54

Interlock Control Logic ignoring jumps & branches
IR PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we wdata rdata Data 31 nop stall Cstall rs rt ? we ws we Cdest re1 re2 Cre Cdest Should we always stall if the rs field matches some rd? not every instruction writes a register we not every instruction reads a register re CS252 S05 55

Source & Destination Registers
R-type: op rs rt rd func I-type: op rs rt immediate16 J-type: op immediate26 source(s) destination ALU rd  (rs) func (rt) rs, rt rd ALUi rt  (rs) op imm rs rt LW rt M [(rs) + imm] rs rt SW M [(rs) + imm]  (rt) rs, rt BZ cond (rs) true: PC  (PC) + imm rs false: PC  (PC) + 4 rs J PC  (PC) + imm JAL r31  (PC), PC  (PC) + imm 31 JR PC  (rs) rs JALR r31  (PC), PC  (rs) rs 31 CS252 S05 56

Deriving the Stall Signal
Cdest ws = Case opcode ALU rd ALUi, LW rt JAL, JALR R31 we = Case opcode ALU, ALUi, LW (ws  0) JAL, JALR on ... off Cre re1 = Case opcode ALU, ALUi, on off re2 = Case opcode LW, SW, BZ, JR, JALR J, JAL ALU, SW ... Cstall stall = ((rsD =wsE).weE + (rsD =wsM).weM + (rsD =wsW).weW) . re1D + ((rtD =wsE).weE + (rtD =wsM).weM + (rtD =wsW).weW) . re2D This is not the full story ! CS252 S05 57

Hazards Due to Loads & Stores
IR 31 PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we wdata rdata Data nop Stall Condition What if (r1)+7 = (r3)+5 ? ... M[(r1)+7]  (r2) r4  M[(r3)+5] Is there any possible data hazard in this instruction sequence? CS252 S05 58

(r1)+7 = (r3)+5  data hazard
Load & Store Hazards ... M[(r1)+7]  (r2) r4  M[(r3)+5] (r1)+7 = (r3)+5  data hazard However, the hazard is avoided because our memory system completes writes in one cycle ! Load/Store hazards are sometimes resolved in the pipeline and sometimes in the memory system itself. More on this later in the course. CS252 S05 59

Strategy 2: Route data as soon as possible after it is calculated to the earlier pipeline stage  bypass CS252 S05 60

Bypassing Each stall or kill introduces a bubble in the pipeline
time t0 t1 t2 t3 t4 t5 t6 t (I1) r1 r IF1 ID1 EX1 MA1 WB1 (I2) r4 r IF2 ID2 ID2 ID2 ID2 EX2 MA2 WB2 (I3) IF3 IF3 IF3 IF3 ID3 EX3 MA3 (I4) stalled stages IF4 ID4 EX4 (I5) IF5 ID5 Each stall or kill introduces a bubble in the pipeline  CPI > 1 A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input time t0 t1 t2 t3 t4 t5 t6 t (I1) r1 r IF1 ID1 EX1 MA1 WB1 (I2) r4  r IF2 ID2 EX2 MA2 WB2 (I3) IF3 ID3 EX3 MA3 WB3 (I4) IF4 ID4 EX4 MA4 WB4 (I5) IF5 ID5 EX5 MA5 WB5 CS252 S05 61

Adding a Bypass ... (I1) r1 r0 + 10 (I2) r4 r1 + 17 r4  r1...
PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we wdata rdata Data 31 nop stall D E M W ... (I1) r1 r0 + 10 (I2) r4 r1 + 17 r4  r1... r1 ... ASrc When does this bypass help? r1 M[r0 + 10] r4 r1 + 17 JAL 500 r4 r yes no no CS252 S05 62

The Bypass Signal Deriving it from the Stall Signal
stall = ( ((rsD =wsE).weE + (rsD =wsM).weM + (rsD =wsW).weW).re1D +((rtD =wsE).weE + (rtD =wsM).weM + (rtD =wsW).weW).re2D ) ws = Case opcode ALU rd ALUi, LW rt JAL, JALR R31 we = Case opcode ALU, ALUi, LW (ws  0) JAL, JALR on off ASrc = (rsD=wsE).weE.re1D Is this correct? No because only ALU and ALUi instructions can benefit from this bypass We can’t bypass on memory or JAL* instructions. Split weE into two components: we-bypass, we-stall CS252 S05 63

Bypass and Stall Signals
Split weE into two components: we-bypass, we-stall we-bypassE = Case opcodeE ALU, ALUi  (ws  0) ... off we-stallE = Case opcodeE LW  (ws  0) JAL, JALR on ... off ASrc = (rsD =wsE).we-bypassE . re1D stall = ((rsD =wsE).we-stallE + (rsD=wsM).weM + (rsD=wsW).weW). re1D +((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW). re2D CS252 S05 64

Fully Bypassed Datapath
ASrc IR PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add ALU Imm Ext rd1 GPRs rs1 rs2 ws wd rd2 we wdata rdata Data 31 nop stall D E M W PC for JAL, ... BSrc Is there still a need for the stall signal ? stall = (rsD=wsE). (opcodeE=LWE).(wsE0 ).re1D + (rtD=wsE). (opcodeE=LWE).(wsE0 ).re2D CS252 S05 65

Strategy 3: Speculate on the dependence. Two cases: Guessed correctly  do nothing Guessed incorrectly  kill and restart …. We’ll later see examples of this approach in more complex processors. CS252 S05 66

Acknowledgements These slides contain material from courses: UCB CS152
Stanford EE108B

Peng Liu liupeng@zju.edu.cn Lecture 7 Pipelining Peng Liu liupeng@zju.edu.cn.

Similar presentations

Presentation on theme: "Peng Liu liupeng@zju.edu.cn Lecture 7 Pipelining Peng Liu liupeng@zju.edu.cn."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Peng Liu liupeng@zju.edu.cn Lecture 7 Pipelining Peng Liu liupeng@zju.edu.cn.

Similar presentations

Presentation on theme: "Peng Liu liupeng@zju.edu.cn Lecture 7 Pipelining Peng Liu liupeng@zju.edu.cn."— Presentation transcript:

Similar presentations

About project

Feedback