Computer Organization & Design

Weidong Wang (王维东), Institute of Information and Communication Network Engineering (ICAN), College of Information Science & Electronic Engineering, Zhejiang University

Course Information
Instructor: Weidong WANG. Office Hours: TBD, Yuquan Campus, Xindian (High-Tech) Building 306.
TAs: Binbin CHEN (陈彬彬) and Jiayun CHEN (陈佳云). Office Hours: Wednesday & Saturday 14:00-16:30, Xindian (High-Tech) Building 308. (Contact by text message or email is also fine.) WeChat group: “2017计组群”

Lecture 7 Introduction to Pipelining

Review: Procedure Activation Record (Frame)
Each procedure creates an activation record on the stack

Review: Calling Convention Steps

Review: Timing Analysis and Logic Delay
[Figure: a register (an array of flip-flops) feeding combinational logic (CL)]
If T > worst-case delay through CL, does this ensure correct operation?

Review: Flip-Flop delays eat into “time budget”
Combinational Logic ALU “time budget”

Review: Critical Timing Issues
• Flops work great as long as the input is stable when the clock rises
– The required stable intervals are called the setup and hold windows
• Clock skew can cause some nasty problems
– Hold time violations
• Cycle Time = longest Prop Delay + Setup + Clock Skew

Review: Instruction Execution
Execution of an instruction involves
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. memory operation (optional)
5. write back
and the computation of the address of the next instruction
CS252 S05

Review: Putting Datapath All Together

Putting it All Together: 1 Cycle Datapath
[Figure: single-cycle datapath with PC and next-address adders, Inst Memory, a register file of 32 32-bit registers (ports Ra, Rb, Rw; buses busA, busB, busW), Extender, ALU, and Data Memory; control signals PCSrc, RegDst, RegWr, ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg; status signal Zero]
Here is the single cycle datapath we just built. If you push into the Instruction Fetch Unit, you will see the last slide showing the PC, the next address logic, and the Instruction Memory. The figure shows how we get the Rt, Rs, Rd, and Imm16 fields out of the 32-bit instruction word. The Rt, Rs, and Rd fields go to the register file as register specifiers, while the Imm16 field goes to the Extender, where it is either zero-extended or sign-extended to 32 bits. The signals ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg, RegDst, RegWr, Branch, and Jump are control signals. I will show you how to generate them on Friday.

Review: Harvard-Style Datapath for MIPS
[Figure: Harvard-style MIPS datapath with PC, Inst. Memory, GPRs (rs1, rs2, rd1, rd2, ws, wd, we), Imm Ext, ALU, ALU Control, and Data Memory; control signals PCSrc (br / rind / jabs / pc+4), RegWrite, RegDst, BSrc, ExtSel, OpSel, WBSrc, MemWrite; status signal zero?]

Review: Single-Cycle Hardwired Control Harvard architecture
We will assume the clock period is sufficiently long for all of the following steps to be “completed”:
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. data fetch if required
5. register write-back setup time
⇒ tC > tIFetch + tRFetch + tALU + tDMem + tRWB
At the rising edge of the following clock, the PC, the register file and the memory are updated

Review: Multilevel Decoding

Review: Putting Datapath&Control All Together

Given Datapath: RTL -> Control
[Figure: the 32-bit instruction from Inst Memory is split into the fields Op, Fun, Rt, Rs, Rd, and Imm16; the Control block takes Op and Fun and drives PCSrc, RegWr, RegDst, ExtOp, ALUSrc, ALUctr, MemWr, and MemtoReg into the DATA PATH, which returns Zero]

The Single Cycle Datapath during Or Immediate
Instruction format: op | rs | rt | immediate (field boundaries at bits 31, 26, 21, 16)
R[rt] <= R[rs] or ZeroExt[Imm16]
Control settings: PCSrc <= +4, RegDst <= 0, RegWr <= 1, ExtOp <= 0, ALUSrc <= 1, ALUctr <= Or, MemWr <= 0, MemtoReg <= 0
[Figure: single-cycle datapath annotated with the control settings above]
Now let’s look at the control signal settings for the Or immediate instruction. The OR immediate instruction ORs the contents of the register specified by the Rs field with the zero-extended immediate field and writes the result to the register specified by Rt. This is how it works in the datapath. The Rs field is fed to the Ra address port to cause the contents of register Rs to be placed on busA. The other operand for the ALU comes from the immediate field. To do this, the controller needs to set ExtOp to 0 to instruct the extender to perform a zero-extend operation. Furthermore, ALUSrc must be set to 1 so that the MUX blocks off busB from the register file and sends the zero-extended version of the immediate field to the ALU. Of course, ALUctr has to be set to OR so the ALU can perform an OR operation. The rest of the control signals (MemWr, MemtoReg, Branch, and Jump) are the same as for the Add and Subtract instructions. One big difference is the RegDst signal. In this case, the destination register is specified by the instruction’s Rt field, NOT the Rd field, because there is no Rd field here. Consequently, RegDst must be set to 0 to place Rt onto the Register File’s Rw address port. Finally, to accomplish the register write, RegWr must be set to 1.

The Concept of Local Decoding
[Figure: Main Control takes the 6-bit op field and produces the N-bit ALUop; ALU Control (Local) takes ALUop and the 6-bit func field and produces the 3-bit ALUctr for the ALU]
That is, instead of asking the Main Control to generate the ALUctr signals directly (see the diagram with the ALU), the Main Control will generate a set of signals called ALUop. For all I and J type instructions, ALUop tells the ALU Control exactly what the ALU needs to do (Add, Subtract, ...). But whenever the Main Control sees an R-type instruction, it simply throws its hands up and says: “Wow, I don’t know what the ALU has to do, but I know it is an R-type instruction” and lets the local control block, ALU Control, take care of the rest. Notice that this saves us one column from the table we had on the last slide. But let’s be honest: if one column were the ONLY thing we saved, we probably would not do it. When you have to design for the entire MIPS instruction set, however, this column will be used for ALL R-type instructions, which is more than just the Add and Subtract shown here. Another advantage of this table over the last one, besides being smaller, is that we can uniquely identify each column by looking at the Op field only. Therefore, as I will show you later, the Main Control ONLY needs to look at the Opcode field.
How many bits do we need for ALUop?

The Encoding of ALUop
[Figure: Main Control (op, 6 bits) produces the N-bit ALUop; ALU Control (Local) combines ALUop with func to produce the 3-bit ALUctr]
In this exercise, ALUop has to be 2 bits wide to represent:
(1) “R-type” instructions
“I-type” instructions that require the ALU to perform:
(2) Or, (3) Add, and (4) Subtract
To implement more of the MIPS ISA, ALUop has to be 3 bits wide (4 bits in the book, to include NOR) to represent:
(1) “R-type” instructions
“I-type” instructions that require the ALU to perform:
(2) Or, (3) Add, (4) Subtract, and (5) And (Example: andi)
So the answer is 2, because we only need to represent 4 things: “R-type,” the Or operation, the Add operation, and the Subtract operation. If you are implementing the entire MIPS instruction set, then ALUop has to be 3 bits wide because we need to represent 5 things: R-type, Or, Add, Subtract, and And. Here is the bit assignment I made for the 3-bit ALUop. With this bit assignment in mind, let’s figure out what the local control, ALU Control, has to do.
                   R-type     ori    lw     sw     beq        jump
ALUop (Symbolic)   “R-type”   Or     Add    Add    Subtract   xxx
ALUop<2:0>         1 00       0 10   0 00   0 00   0 01       xxx

The Decoding of the “func” Field
[Figure: Main Control (op, 6 bits) produces the N-bit ALUop; ALU Control (Local) combines ALUop with func to produce the 3-bit ALUctr]
                   R-type     ori    lw     sw     beq        jump
ALUop (Symbolic)   “R-type”   Or     Add    Add    Subtract   xxx
ALUop<2:0>         1 00       0 10   0 00   0 00   0 01       xxx
R-type format: op | rs | rt | rd | shamt | funct (field boundaries at bits 31, 26, 21, 16, 11, 6)
funct<5:0> selects the instruction operation: add, subtract, and, or, set-on-less-than
ALUctr<2:0>   ALU Operation
000           And
001           Or
010           Add
110           Subtract
111           Set-on-less-than
What this table and diagram imply is that if the ALU Control receives ALUop = 100, it has to decode the instruction’s “func” field to figure out what the ALU needs to do. The MIPS encoding in Appendix A (or Fig 3.18, page 153 of 2/e) of your textbook gives the func value for each operation: one value identifies an Add instruction, another a subtract operation, and so on. Notice that bit 5 and bit 4 of this field are the same for all these operations, so as far as the ALU Control is concerned, these bits are don’t cares. Now recall from your ALU homework that the ALUctr signals have the meanings listed in the table above: 000 means And, 001 means Or, and so on. Based on these three tables and the fact that bit 5 and bit 4 of the “func” field are don’t cares, we can derive the truth table for ALUctr.
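The local decoding described above can be sketched in a few lines of Python. This is only an illustration, not the course's reference design: the ALUop and ALUctr encodings are taken from the slide's tables, while the funct codes are the standard MIPS encodings and are an assumption here because the slide text does not list them. The function name `alu_control` is hypothetical.

```python
# ALUop encodings from the slide's table; "R-type" means "decode funct locally".
ALUOP_RTYPE, ALUOP_OR, ALUOP_ADD, ALUOP_SUB = 0b100, 0b010, 0b000, 0b001

# ALUctr encodings from the slide's ALU-operation table.
ALUCTR = {"and": 0b000, "or": 0b001, "add": 0b010, "sub": 0b110, "slt": 0b111}

# Assumption: standard MIPS funct codes (not listed in the slide text).
FUNCT = {0b100000: "add", 0b100010: "sub", 0b100100: "and",
         0b100101: "or", 0b101010: "slt"}


def alu_control(aluop: int, funct: int) -> int:
    """Return the 3-bit ALUctr for a given ALUop and funct field."""
    if aluop == ALUOP_RTYPE:
        # Bits 5 and 4 are the same for all five operations, so only
        # funct<3:0> is compared (they are treated as don't cares).
        low = funct & 0b1111
        for code, name in FUNCT.items():
            if code & 0b1111 == low:
                return ALUCTR[name]
        raise ValueError(f"unsupported funct {funct:06b}")
    # I-type instructions: the Main Control already named the operation.
    return {ALUOP_OR: ALUCTR["or"],
            ALUOP_ADD: ALUCTR["add"],
            ALUOP_SUB: ALUCTR["sub"]}[aluop]
```

For example, `alu_control(0b100, 0b100010)` decodes an R-type subtract, while `alu_control(0b000, funct)` returns Add for lw and sw regardless of `funct`.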

Drawback of This Single Cycle Processor
• Long cycle time: the cycle time must be long enough for the load instruction:
PC’s Clock-to-Q + Instruction Memory Access Time + Register File Access Time + ALU Delay (address calculation) + Data Memory Access Time + Register File Setup Time + Clock Skew
• Cycle time for load is much longer than needed for all other instructions
The last slide illustrates one of the biggest disadvantages of the single cycle implementation: it has a long cycle time. More specifically, the cycle time must be long enough for the load instruction, which has the components listed above, starting with the Clock-to-Q time of the PC. Having a long cycle time is a big problem, but it is not the only one. Another problem is that this cycle time, which is long enough for the load instruction, is too long for all other instructions. We will show you why this is bad and what we can do about it in the next few lectures. That’s all for today.

What is Single Cycle Control?
• Combinational Logic (Only Gates, No Flip Flops)
• Just specify logic functions!
[Figure: Instr Mem supplies rs, rt, rd, imm to the RegFile (rs1, rs2, ws, wd, rd1, rd2, WE), Ext, and ALU; the control logic takes the instruction and Equal as inputs and drives PCSrc, RegDest, RegWr, MemToReg, ExtOp, ALUsrc, ALUctr, and MemWr]

Single Cycle Processor
Advantages
• Single cycle per instruction makes logic and clock simple
Disadvantages
• Inefficient utilization of memory and functional units, since different instructions take different lengths of time
– ALU only computes values a small amount of the time
• Cycle time is the worst case path -> long cycle times
– Load instruction
• Best possible CPI is 1

Single Cycle Processor Performance
Functional unit delay
• Memory: 200ps
• ALU and adders: 200ps
• Register file: 100ps
CPU clock cycle = 800 ps = 0.8 ns (1.25GHz)

Variable Clock Single Cycle Processor Performance
Instruction Mix
• 45% ALU
• 25% loads
• 10% stores
• 15% branches
• 5% jumps
CPU clock cycle = 0.6 x 45% + 0.8 x 25% + 0.7 x 10% + 0.5 x 15% + 0.2 x 5% = 0.625 ns (1.6GHz)
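As a quick check of the arithmetic above, the weighted average can be computed directly. The sketch below uses the mix and the per-class times from the slide, expressed in picoseconds to keep the arithmetic exact; the names `MIX` and `average_cycle_ps` are illustrative only.

```python
# (instruction class, share of the mix in percent, time in ps), from the slide
MIX = [("ALU", 45, 600), ("load", 25, 800), ("store", 10, 700),
       ("branch", 15, 500), ("jump", 5, 200)]


def average_cycle_ps(mix):
    """Weighted average instruction time for a variable-clock machine."""
    assert sum(share for _, share, _ in mix) == 100
    return sum(share * time for _, share, time in mix) / 100


avg = average_cycle_ps(MIX)            # average time with a variable clock
fixed = max(time for _, _, time in MIX)  # fixed clock must fit the slowest class
print(avg, fixed / avg)
```

With these numbers the average is 625 ps against a fixed 800 ps cycle, so an ideal variable clock would be about 1.28 times faster than the fixed single-cycle clock.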

Increasing Parallelism
• Problem: each functional unit is used once per cycle
– Most of the time it is sitting waiting for its turn
– Well, it is calculating all the time, but it is waiting for valid data
– There is no parallelism in this arrangement
• Making instructions take more cycles makes the machine faster!
– Each instruction takes roughly the same time
– While the CPI is much worse, the clock frequency is much higher
• Overlap execution of multiple instructions at the same time
– Different instructions will be active at the same time
• This is called “Pipelining”
– Increases the parallelism going on in the machine
• We will look at a 5 stage pipeline
– Modern machines have on the order of 15 cycles/instruction

Pipelining: It’s Natural and You Do It All the Time
• Laundry Example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Folding bench” takes 20 minutes

Sequential Laundry
Sequential laundry takes 6 hours for 4 loads

Pipelined Laundry: Start work ASAP
Pipelined laundry takes 3.5 hours for 4 loads
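The 6-hour and 3.5-hour figures can be reproduced with a small scheduling sketch. It assumes (this is not stated on the slide) that a load may wait between machines while the previous load is still using the next one; the function names are illustrative.

```python
def sequential_minutes(stages, loads):
    """Each load finishes all stages before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(stages, loads):
    """Each load starts a stage as soon as it and the machine are both ready."""
    free_at = [0] * len(stages)   # time at which each machine becomes free
    done = 0
    for _ in range(loads):
        ready = 0                 # time this load is ready for its next stage
        for i, duration in enumerate(stages):
            start = max(ready, free_at[i])
            ready = start + duration
            free_at[i] = ready
        done = ready
    return done


STAGES = [30, 40, 20]             # washer, dryer, folding bench (minutes)
print(sequential_minutes(STAGES, 4), pipelined_minutes(STAGES, 4))
```

For four loads this gives 360 minutes sequentially and 210 minutes pipelined. After the first load, one load finishes every 40 minutes, which is the dryer time: the slowest stage sets the rate.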

Pipelining Lessons
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Multiple tasks operate simultaneously
• Potential speedup = number of pipe stages
• Pipeline rate is limited by the slowest pipeline stage
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup

The Five Stages of Load Instruction
lw: Cycle 1 IFetch, Cycle 2 Dec, Cycle 3 Exec, Cycle 4 Mem, Cycle 5 WB
• IFetch: Instruction Fetch and Update PC
• Dec: Registers Fetch and Instruction Decode
• Exec: Execute R-type; calculate memory address
• Mem: Read/write the data from/to the Data Memory
• WB: Write the result data into the register file
Make memory access take one cycle, since that is the largest factor
As shown here, each of these five steps will take one clock cycle to complete.

An Ideal Pipeline
[Figure: four pipeline stages, stage 1 through stage 4]
• All objects go through the same stages
• No sharing of resources between any two stages
• Propagation delay through all pipeline stages is equal
• The scheduling of an object entering the pipeline is not affected by the objects in other stages
These conditions generally hold for industrial assembly lines. But can an instruction pipeline satisfy the last condition?

Pipelined MIPS
To pipeline MIPS:
• First build MIPS without pipelining with CPI=1
• Next, add pipeline registers to reduce cycle time while maintaining CPI=1

Review: Unpipelined Datapath for MIPS
[Figure: unpipelined Harvard-style MIPS datapath with PC, Inst. Memory, GPRs (rs1, rs2, rd1, rd2, ws, wd, we), Imm Ext, ALU, ALU Control, and Data Memory; control signals PCSrc (br / rind / jabs / pc+4), RegWrite, RegDst, BSrc, ExtSel, OpSel, WBSrc, MemWrite; status signal zero?]

Review: Hardwired Control Table
Opcode     ExtSel   BSrc   OpSel   MemW   RegW   WBSrc   RegDst   PCSrc
ALU        *        Reg    Func    no     yes    ALU     rd       pc+4
ALUi       sExt16   Imm    Op      no     yes    ALU     rt       pc+4
ALUiu      uExt16   Imm    Op      no     yes    ALU     rt       pc+4
LW         sExt16   Imm    +       no     yes    Mem     rt       pc+4
SW         sExt16   Imm    +       yes    no     *       *        pc+4
BEQZ z=0   sExt16   *      0?      no     no     *       *        br
BEQZ z=1   sExt16   *      0?      no     no     *       *        pc+4
J          *        *      *       no     no     *       *        jabs
JAL        *        *      *       no     yes    PC      R31      jabs
JR         *        *      *       no     no     *       *        rind
JALR       *        *      *       no     yes    PC      R31      rind
BSrc = Reg / Imm
WBSrc = ALU / Mem / PC
RegDst = rt / rd / R31
PCSrc = pc+4 / br / rind / jabs

Pipelined MIPS Processor
Start the next instruction while still working on the current one
• improves throughput or bandwidth: total amount of work done in a given time (average instructions per second or per clock)
• instruction latency is not reduced (time from the start of an instruction to its completion)
• pipeline clock cycle (pipeline stage time) is limited by the slowest stage
• for some instructions, some stages are wasted cycles
[Figure: lw, sw, and an R-type instruction each pass through IFetch, Dec, Exec, Mem, WB, starting one cycle apart, over Cycle 1 to Cycle 8]
Latency = execution time (delay or response time): the total time from start to finish of ONE instruction.
For processors, one important measure is THROUGHPUT (or the execution bandwidth): the total amount of work done in a given amount of time.
For memories, one important measure is BANDWIDTH: the amount of information communicated across an interconnect (e.g., bus) per unit time; the number of operations performed per second (the WIDTH of the operation and the RATE of the operation).

Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: timing diagrams for three implementations. Single Cycle Implementation: Load takes Cycle 1 and Store takes Cycle 2, with part of the Store cycle marked Waste. Multiple Cycle Implementation: over Cycle 1 to Cycle 10, lw takes IFetch, Dec, Exec, Mem, WB, then sw takes IFetch, Dec, Exec, Mem, then the R-type starts its IFetch. Pipeline Implementation: lw, sw, and R-type each take IFetch, Dec, Exec, Mem, WB, starting one cycle apart, with some stages marked as “wasted” cycles]
Here are the timing diagrams showing the differences between the single cycle, multiple cycle, and pipeline implementations. For example, in the pipeline implementation, we can finish executing the Load, Store, and R-type instruction sequence in seven cycles.

Multiple Cycle v. Pipeline, Bandwidth v. Latency
[Figure: Multiple Cycle Implementation over Cycle 1 to Cycle 10: lw takes IFetch, Dec, Exec, Mem, WB, then sw takes IFetch, Dec, Exec, Mem, then the R-type starts its IFetch. Pipeline Implementation: lw, sw, and R-type each take IFetch, Dec, Exec, Mem, WB, starting one cycle apart]
• Latency per lw = 5 clock cycles for both
• Bandwidth of lw is 1 instruction per clock (IPC) for pipeline vs. 1/5 IPC for multicycle
• Pipelining improves instruction bandwidth, not instruction latency
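The cycle counts in these diagrams follow from two simple formulas, sketched below. The lw and sw cycle counts are read off the multiple cycle diagram; the 4-cycle R-type is an assumption (the diagram is cut off after its IFetch), and the function names are illustrative.

```python
STAGES = 5  # IFetch, Dec, Exec, Mem, WB

# Cycles per instruction in the multiple cycle implementation.
# lw and sw are taken from the diagram; R-type = 4 is an assumption.
MULTICYCLE = {"lw": 5, "sw": 4, "R-type": 4}


def multicycle_cycles(program):
    """Instructions run back to back, each taking its own cycle count."""
    return sum(MULTICYCLE[op] for op in program)


def pipelined_cycles(program):
    """Fill the pipeline once, then finish one instruction per cycle."""
    return STAGES + len(program) - 1


program = ["lw", "sw", "R-type"]
print(multicycle_cycles(program), pipelined_cycles(program))
```

For the Load, Store, R-type sequence this gives 13 cycles for the multiple cycle machine (under the stated assumption) and 7 cycles for the pipeline, while a single lw still takes 5 cycles in both.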

Pipelining the MIPS ISA
What makes it easy
• all instructions are the same length (32 bits)
– easier to fetch in 1st stage and decode in 2nd stage
• few instruction formats (three) with symmetry across formats
– can begin reading register file in 2nd stage
• memory operations can occur only in loads and stores
– can use the execute stage to calculate memory addresses
• each MIPS instruction writes at most one result and does so near the end of the pipeline
What makes it hard
• structural hazards: what if we had only one memory?
• control hazards: what about branches?
• data hazards: what if an instruction’s input operands depend on the output of a previous instruction?

MIPS Pipeline Datapath Modifications
What do we need to add/modify in our MIPS datapath?
• registers between pipeline stages to isolate them
[Figure: five-stage datapath (IF: IFetch, ID: Dec, EX: Execute, MEM: MemAccess, WB: WriteBack) with PC, Instruction Memory, Register File, Sign Extend, Shift left 2, two adders, ALU, and Data Memory, separated by the pipeline registers IFetch/Dec, Dec/Exec, Exec/Mem, and Mem/WB, all driven by the System Clock]
Note two exceptions to the left-to-right flow:
• WB, which writes the result back into the register file in the middle of the datapath
• Selection of the next value of the PC, where one input comes from the calculated branch address from the MEM stage
Only later instructions in the pipeline can be influenced by these two REVERSE data movements. The first one (WB to ID) leads to data hazards. The second one (MEM to IF) leads to control hazards.
All instructions must update some state in the processor (the register file, the memory, or the PC), so a separate pipeline register for the state that is updated would be redundant and is not needed.
The PC can be thought of as a pipeline register: the one that feeds the IF stage of the pipeline. Unlike all of the other pipeline registers, the PC is part of the visible architecture state; its content must be saved when an exception occurs (the contents of the other pipe registers are discarded).

How to divide the datapath into stages
Suppose memory is significantly slower than other stages. Since the slowest stage determines the clock, it may be possible to combine some stages without any loss of performance.

Pipelined Datapath
[Figure: datapath with PC, Add, Inst. Memory, IR, GPRs, Imm Ext, ALU, and Data Memory, divided into five phases: fetch, decode & Reg-fetch, execute, memory, write-back]
Clock period can be reduced by dividing the execution of an instruction into multiple cycles
tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably)
However, CPI will increase unless instructions are pipelined

Graphically Representing MIPS Pipeline
[Figure: one instruction drawn as the sequence IM, Reg, ALU, DM, Reg]
Can help with answering questions like:
• how many cycles does it take to execute this code?
• what is the ALU doing during cycle 4?
• is there a hazard, why does it occur, and how can it be fixed?

Why Pipeline? For Throughput !
[Figure: Inst 0 through Inst 4 in instruction order, each drawn as IM, Reg, ALU, DM, Reg and shifted one clock cycle later than the previous one; the first cycles are labeled “Time to fill the pipeline”]
Once the pipeline is full, one instruction is completed every cycle

Technology Assumptions
• A small amount of very fast memory (caches) backed up by a large, slower memory
• Fast ALU (at least for integers)
• Multiported Register files (slower!)
Thus, the following timing assumption is reasonable:
tIM ≈ tRF ≈ tALU ≈ tDM ≈ tRW
A 5-stage pipeline will be the focus of our detailed design; some commercial designs have over 30 pipeline stages to do an integer add!

5-Stage Pipelined Execution
[Figure: 5-stage pipelined datapath with the stages I-Fetch (IF), Decode, Reg. Fetch (ID), Execute (EX), Memory (MA), and Write-Back (WB)]
time           t0   t1   t2   t3   t4   t5   t6   t7   ....
instruction1   IF1  ID1  EX1  MA1  WB1
instruction2        IF2  ID2  EX2  MA2  WB2
instruction3             IF3  ID3  EX3  MA3  WB3
instruction4                  IF4  ID4  EX4  MA4  WB4
instruction5                       IF5  ID5  EX5  MA5  WB5

5-Stage Pipelined Execution Resource Usage Diagram
[Figure: 5-stage pipelined datapath with the stages I-Fetch (IF), Decode, Reg. Fetch (ID), Execute (EX), Memory (MA), and Write-Back (WB)]
Resources (In = the n-th instruction)
time   t0   t1   t2   t3   t4   t5   t6   t7   ....
IF     I1   I2   I3   I4   I5
ID          I1   I2   I3   I4   I5
EX               I1   I2   I3   I4   I5
MA                    I1   I2   I3   I4   I5
WB                         I1   I2   I3   I4   I5

Pipeline Datapath

Load Datapath: Stage 1

Pipelined Execution: ALU Instructions
[Figure: pipelined datapath with PC, Inst Memory, IR, GPRs, Imm Ext, the pipeline registers A, B, Y, R, MD1, MD2, ALU, and Data Memory]
Not quite correct! We need an Instruction Reg (IR) for each stage

Pipelined MIPS Datapath without jumps
[Figure: pipelined datapath with the stages IF, ID, EX, MA, WB and an IR in each stage; control points ExtSel, BSrc, OpSel, MemWrite, WBSrc, RegDst, and RegWrite]
Control Points Need to Be Connected

Pipeline Control
• Need to control functional units
– But they are working on different instructions!
• Not a problem
– Just pipeline the control signals along with the data
– Make sure they line up
• Using labeling conventions often helps
– Instruction_rf means this instruction is in RegisterFile
– Every time it gets flopped, it changes pipestage
• Make sure the right signals go to the right places

Control Signals
Use a Main Control unit to generate signals during the RF/ID stage
– Control signals for EX
• (ExtOp, ALUSrc, …) used 1 cycle later
– Control signals for Mem
• (MemWr, Branch) used 2 cycles later
– Control signals for WB
• (MemtoReg, RegWr) used 3 cycles later

Implementing Control

Putting it All Together

Pipeline Performance
Assume time for stages is
– 100ps for register read or write
– 200ps for other stages
• Compare pipelined datapath with single-cycle datapath
Instr      Instr fetch   Register read   ALU op   Memory access   Register write   Total time
lw         200ps         100ps           200ps    200ps           100ps            800ps
sw         200ps         100ps           200ps    200ps                            700ps
R-format   200ps         100ps           200ps                    100ps            600ps
beq        200ps         100ps           200ps                                     500ps
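The totals in this comparison can be rebuilt from the two stage times given above. The sketch below assumes the usual set of steps per instruction class (for example, that sw skips the register write and beq skips both memory access and register write); the dictionary and function names are illustrative.

```python
# Stage times from the slide: 100 ps for register read/write, 200 ps otherwise.
STAGE_PS = {"IF": 200, "RegRead": 100, "ALU": 200, "Mem": 200, "RegWrite": 100}

# Assumption: which steps each instruction class actually uses.
USES = {
    "lw":       ["IF", "RegRead", "ALU", "Mem", "RegWrite"],
    "sw":       ["IF", "RegRead", "ALU", "Mem"],
    "R-format": ["IF", "RegRead", "ALU", "RegWrite"],
    "beq":      ["IF", "RegRead", "ALU"],
}


def total_ps(instr):
    """Time one instruction needs when executed on its own."""
    return sum(STAGE_PS[s] for s in USES[instr])


# Single-cycle clock must fit the slowest instruction;
# pipelined clock must fit the slowest stage.
single_cycle_clock = max(total_ps(i) for i in USES)
pipelined_clock = max(STAGE_PS.values())
print(single_cycle_clock, pipelined_clock, single_cycle_clock / pipelined_clock)
```

This yields an 800 ps single-cycle clock against a 200 ps pipelined clock, a ratio of 4 rather than 5, because the 100 ps register stages are padded out to the 200 ps stage time.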

Maximum Speedup by Pipelining

Pipelining and ISA Design
MIPS ISA designed for pipelining
– All instructions are 32 bits
• Easier to fetch and decode in one cycle
• c.f. x86: 1- to 17-byte instructions
– Few and regular instruction formats
• Can decode and read registers in one step
– Load/store addressing
• Can calculate address in 3rd stage, access memory in 4th stage
– Alignment of memory operands
• Memory access takes only one cycle

Instructions interact with each other in pipeline
• An instruction in the pipeline may need a resource being used by another instruction in the pipeline → structural hazard
• An instruction may depend on something produced by an earlier instruction
– Dependence may be for a data value → data hazard
– Dependence may be for the next instruction’s address → control hazard (branches, exceptions)

Resolving Structural Hazards
• A structural hazard occurs when two instructions need the same hardware resource at the same time
– Can resolve in hardware by stalling the newer instruction until the older instruction is finished with the resource
• A structural hazard can always be avoided by adding more hardware to the design
– E.g., if two instructions both need a port to memory at the same time, could avoid the hazard by adding a second port to memory
• Our 5-stage pipe has no structural hazards by design
– Thanks to the MIPS ISA, which was designed for pipelining

Data Hazards
[Figure: pipelined datapath with r1 ← r0 + 10 in the execute stage and r4 ← r1 + 17 in the decode stage, reading r1 from the GPRs]
...
r1 ← r0 + 10
r4 ← r1 + 17
...
r1 is stale. Oops!

Resolving Data Hazards (1)
Strategy 1: Wait for the result to be available by freezing earlier pipeline stages → interlocks

Feedback to Resolve Hazards
[Figure: four pipeline stages, stage 1 through stage 4, with feedback paths FB1 to FB4 from later stages to earlier stages]
Later stages provide dependence information to earlier stages, which can stall (or kill) instructions.
Real designs will seldom provide full feedback, nor will they be able to stop on a dime.
Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur).

Interlocks to resolve Data Hazards
[Figure: pipelined datapath with a Stall Condition block that holds the PC and the decode-stage IR and inserts a nop into the execute stage]
...
r1 ← r0 + 10
r4 ← r1 + 17
...

Stalled Stages and Pipeline Bubbles
time                    t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← (r0) + 10     IF1  ID1  EX1  MA1  WB1
(I2) r4 ← (r1) + 17          IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                              IF3  IF3  IF3  IF3  ID3  EX3  MA3  WB3
(I4)                                                  IF4  ID4  EX4  MA4  WB4
(I5)                                                       IF5  ID5  EX5  MA5  WB5
The repeated ID2 and IF3 entries are the stalled stages.
Resource Usage
time   t0   t1   t2   t3   t4   t5   t6   t7   ....
IF     I1   I2   I3   I3   I3   I3   I4   I5
ID          I1   I2   I2   I2   I2   I3   I4   I5
EX               I1   nop  nop  nop  I2   I3   I4   I5
MA                    I1   nop  nop  nop  I2   I3   I4   I5
WB                         I1   nop  nop  nop  I2   I3   I4   I5
nop → pipeline bubble

Interlock Control Logic
[Figure: pipelined datapath with a Cstall block that takes rs and rt from the decode stage and ws from the later stages, and produces the stall signal, which inserts a nop]
Compare the source registers of the instruction in the decode stage with the destination registers of the uncommitted instructions.

Interlock Control Logic ignoring jumps & branches
[Figure: pipelined datapath with Cstall taking rs, rt, re1, and re2 from the decode stage and ws and we from each later stage; Cdest blocks produce ws and we, and a Cre block produces re1 and re2]
Should we always stall if the rs field matches some rd?
• not every instruction writes a register → we
• not every instruction reads a register → re

Source & Destination Registers
R-type: op | rs | rt | rd | func
I-type: op | rs | rt | immediate16
J-type: op | immediate26
                                               source(s)   destination
ALU    rd ← (rs) func (rt)                     rs, rt      rd
ALUi   rt ← (rs) op imm                        rs          rt
LW     rt ← M[(rs) + imm]                      rs          rt
SW     M[(rs) + imm] ← (rt)                    rs, rt
BZ     cond (rs) true:  PC ← (PC) + imm        rs
                 false: PC ← (PC) + 4          rs
J      PC ← (PC) + imm
JAL    r31 ← (PC), PC ← (PC) + imm                         31
JR     PC ← (rs)                               rs
JALR   r31 ← (PC), PC ← (rs)                   rs          31

Deriving the Stall Signal
Cdest
ws = Case opcode
  ALU → rd
  ALUi, LW → rt
  JAL, JALR → R31
we = Case opcode
  ALU, ALUi, LW → (ws ≠ 0)
  JAL, JALR → on
  ... → off
Cre
re1 = Case opcode
  ALU, ALUi, LW, SW, BZ, JR, JALR → on
  J, JAL → off
re2 = Case opcode
  ALU, SW → on
  ... → off
Cstall
stall = ((rsD = wsE).weE + (rsD = wsM).weM + (rsD = wsW).weW).re1D
      + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW).re2D
This is not the full story!
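The stall equation can be written out as a small Python model. This is only a sketch of the equations on this slide (no bypassing, and, as the slide says, not the full story); the `Instr` class and the use of `None` for an empty stage are assumptions made for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    opcode: str   # one of ALU, ALUi, LW, SW, BZ, J, JAL, JR, JALR
    rs: int = 0
    rt: int = 0
    rd: int = 0


def ws(i):
    """Destination register (Cdest)."""
    if i.opcode == "ALU":
        return i.rd
    if i.opcode in ("ALUi", "LW"):
        return i.rt
    if i.opcode in ("JAL", "JALR"):
        return 31
    return 0


def we(i):
    """Write enable (Cdest); an empty stage (None) never writes."""
    if i is None:
        return False
    if i.opcode in ("ALU", "ALUi", "LW"):
        return ws(i) != 0
    return i.opcode in ("JAL", "JALR")


def re1(i):
    """Does the instruction read rs? (Cre)"""
    return i.opcode in ("ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR")


def re2(i):
    """Does the instruction read rt? (Cre)"""
    return i.opcode in ("ALU", "SW")


def stall(d, e, m, w):
    """Cstall: d is in decode; e, m, w are the older instructions."""
    writers = [i for i in (e, m, w) if we(i)]
    hit_rs = any(d.rs == ws(i) for i in writers)
    hit_rt = any(d.rt == ws(i) for i in writers)
    return (hit_rs and re1(d)) or (hit_rt and re2(d))
```

With `i1 = Instr("ALUi", rs=0, rt=1)` for r1 ← r0 + 10 and `i2 = Instr("ALUi", rs=1, rt=4)` for r4 ← r1 + 17, `stall(i2, i1, None, None)` is true, and it stays true while `i1` moves through the M and W positions.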

Hazards due to Loads & Stores
[Figure: pipelined datapath with the Stall Condition block and nop insertion]
...
M[(r1)+7] ← (r2)
r4 ← M[(r3)+5]
...
Is there any possible data hazard in this instruction sequence?
What if (r1)+7 = (r3)+5 ?

Load & Store Hazards
...
M[(r1)+7] ← (r2)
r4 ← M[(r3)+5]
...
(r1)+7 = (r3)+5 → data hazard
However, the hazard is avoided because our memory system completes writes in one cycle!
Load/Store hazards are sometimes resolved in the pipeline and sometimes in the memory system itself. More on this later in the course.

Strategy 2: Route data as soon as possible after it is calculated to the earlier pipeline stage → bypass

Bypassing
time                  t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← r0 + 10     IF1  ID1  EX1  MA1  WB1
(I2) r4 ← r1 + 17          IF2  ID2  ID2  ID2  ID2  EX2  MA2  WB2
(I3)                            IF3  IF3  IF3  IF3  ID3  EX3  MA3
(I4)                                                IF4  ID4  EX4
(I5)                                                     IF5  ID5
The repeated ID2 and IF3 entries are the stalled stages.
Each stall or kill introduces a bubble in the pipeline → CPI > 1
A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input
time                  t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← r0 + 10     IF1  ID1  EX1  MA1  WB1
(I2) r4 ← r1 + 17          IF2  ID2  EX2  MA2  WB2
(I3)                            IF3  ID3  EX3  MA3  WB3
(I4)                                 IF4  ID4  EX4  MA4  WB4
(I5)                                      IF5  ID5  EX5  MA5  WB5

Adding a Bypass
[Figure: pipelined datapath (stages D, E, M, W) with a bypass path from the ALU output back to the ALU’s A input, selected by the ASrc mux; r4 ← r1... is in the decode stage and r1 ← ... is in the execute stage]
...
(I1) r1 ← r0 + 10
(I2) r4 ← r1 + 17
...
When does this bypass help?
• r1 ← r0 + 10 followed by r4 ← r1 + 17: yes
• r1 ← M[r0 + 10] followed by r4 ← r1 + 17: no
• JAL 500 followed by r4 ← r...: no

The Bypass Signal Deriving it from the Stall Signal
stall = ( ((rsD = wsE).weE + (rsD = wsM).weM + (rsD = wsW).weW).re1D
        + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW).re2D )
ws = Case opcode
  ALU → rd
  ALUi, LW → rt
  JAL, JALR → R31
we = Case opcode
  ALU, ALUi, LW → (ws ≠ 0)
  JAL, JALR → on
  ... → off
ASrc = (rsD = wsE).weE.re1D
Is this correct? No, because only ALU and ALUi instructions can benefit from this bypass. We can’t bypass on memory or JAL* instructions.
Split weE into two components: we-bypass, we-stall

Bypass and Stall Signals
Split weE into two components: we-bypass, we-stall
we-bypassE = Case opcodeE
  ALU, ALUi → (ws ≠ 0)
  ... → off
we-stallE = Case opcodeE
  LW → (ws ≠ 0)
  JAL, JALR → on
  ... → off
ASrc = (rsD = wsE).we-bypassE.re1D
stall = ((rsD = wsE).we-stallE + (rsD = wsM).weM + (rsD = wsW).weW).re1D
      + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW).re2D

Fully Bypassed Datapath
[Figure: fully bypassed datapath (stages D, E, M, W) with ASrc and BSrc muxes feeding the ALU inputs from later stages, including the PC for JAL, ...]
Is there still a need for the stall signal?
stall = (rsD = wsE).(opcodeE = LWE).(wsE ≠ 0).re1D
      + (rtD = wsE).(opcodeE = LWE).(wsE ≠ 0).re2D
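With full bypassing, the equation above says that only a load in the execute stage can still force a stall. A minimal sketch of that check follows; the `Instr` tuple, the two read sets, and the function name `load_use_stall` are assumptions for the illustration, with the read sets following the re1 and re2 definitions given earlier.

```python
from typing import NamedTuple, Optional


class Instr(NamedTuple):
    opcode: str   # one of ALU, ALUi, LW, SW, BZ, J, JAL, JR, JALR
    rs: int = 0
    rt: int = 0
    rd: int = 0


READS_RS = {"ALU", "ALUi", "LW", "SW", "BZ", "JR", "JALR"}   # re1 = on
READS_RT = {"ALU", "SW"}                                      # re2 = on


def load_use_stall(d: Instr, e: Optional[Instr]) -> bool:
    """Stall only if the execute-stage instruction is a load whose
    destination (rt, nonzero) is read by the decode-stage instruction."""
    if e is None or e.opcode != "LW" or e.rt == 0:
        return False
    uses_rs = d.rs == e.rt and d.opcode in READS_RS
    uses_rt = d.rt == e.rt and d.opcode in READS_RT
    return uses_rs or uses_rt
```

For r1 ← M[r0 + 10] followed by r4 ← r1 + 17 the check is true, so one bubble is still needed; for r1 ← r0 + 10 followed by r4 ← r1 + 17 it is false, because the bypass supplies r1.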

Strategy 3: Speculate on the dependence. Two cases:
• Guessed correctly → do nothing
• Guessed incorrectly → kill and restart
We’ll later see examples of this approach in more complex processors.

Next Time: Control Hazards
• Branches/Jumps
• Exceptions/Interrupts

Control Hazards
What do we need to calculate next PC?
• For Jumps: Opcode, offset and PC
• For Jump Register: Opcode and Register value
• For Conditional Branches: Opcode, PC, Register (for condition), and offset
• For all other instructions: Opcode and PC (have to know it’s not one of the above!)

Opcode Decoding Bubble (assuming no branch delay slots for now)
time                    t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) r1 ← (r0) + 10     IF1  ID1  EX1  MA1  WB1
(I2) r3 ← (r2) + 17          IF2  IF2  ID2  EX2  MA2  WB2
(I3)                                   IF3  IF3  ID3  EX3  MA3  WB3
(I4)                                             IF4  IF4  ID4  EX4  MA4  WB4
Resource Usage
time   t0   t1   t2   t3   t4   t5   t6   t7   ....
IF     I1   nop  I2   nop  I3   nop  I4
ID          I1   nop  I2   nop  I3   nop  I4
EX               I1   nop  I2   nop  I3   nop  I4
MA                    I1   nop  I2   nop  I3   nop  I4
WB                         I1   nop  I2   nop  I3   nop  I4
nop → pipeline bubble

Speculate next address is PC+4
[Figure: fetch stage with PC (holding 104), Add 0x4, Inst Memory, and IR; the Jump? check on the decoded instruction drives PCSrc (pc+4 / jabs / rind / br); stall and nop insertion shown]
I1 096 ADD
I2 100 J 304
I3 104 ADD (kill)
I4 304 ADD
A jump instruction kills (not stalls) the following instruction. How?

Pipelining Jumps
To kill a fetched instruction: insert a mux before IR
[Figure: fetch stage with PCSrc (pc+4 / jabs / rind / br), the Jump? check on the decode-stage IR, and the IRSrcD mux selecting between the fetched instruction and nop; shown once with I2 in decode and PC = 104, and once with nop in decode and PC = 304]
IRSrcD = Case opcodeD
  J, JAL → nop
  ... → IM
Any interaction between stall and jump? Kill takes precedence over stall.
I1 096 ADD
I2 100 J 304
I3 104 ADD (kill)
I4 304 ADD

Jump Pipeline Diagrams
time              t0   t1   t2   t3   t4   t5   t6   t7   ....
(I1) 096: ADD     IF1  ID1  EX1  MA1  WB1
(I2) 100: J 304        IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD               IF3  nop  nop  nop  nop
(I4) 304: ADD                    IF4  ID4  EX4  MA4  WB4
Resource Usage
time   t0   t1   t2   t3   t4   t5   t6   t7   ....
IF     I1   I2   I3   I4   I5
ID          I1   I2   nop  I4   I5
EX               I1   I2   nop  I4   I5
MA                    I1   I2   nop  I4   I5
WB                         I1   I2   nop  I4   I5
nop → pipeline bubble

Pipelining Conditional Branches
[Figure: pipeline datapath with the PCSrc mux (pc+4 / jabs / rind / br), the IRSrcD mux, the stall signal, a BEQZ? check, and the ALU zero? output in the execute stage]
I1  096  ADD
I2  100  BEQZ r1 +200
I3  104  ADD
...
I4  304  ADD
Branch condition is not known until the execute stage. What action should be taken in the decode stage?

[Figure: the same datapath, now with an IRSrcE mux (nop / IR) feeding the execute-stage IR in addition to the IRSrcD mux]
If the branch is taken:
- kill the two following instructions
- the instruction at the decode stage is not valid => stall signal is not valid
I1  096  ADD
I2  100  BEQZ r1 +200
I3  104  ADD
...
I4  304  ADD

New Stall Signal
stall = ( ((rsD = wsE).weE + (rsD = wsM).weM + (rsD = wsW).weW).re1D
        + ((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW).re2D )
        . !((opcodeE = BEQZ).z + (opcodeE = BNEZ).!z)
Don’t stall if the branch is taken. Why? The instruction at the decode stage is invalid.
A stall only delays the pipeline; note the difference between stall and the control of the IR mux (IR_MUX)!
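The stall equation can be read as a small Boolean function. The sketch below is one way to write it; the function name `stall_signal` and the argument layout are hypothetical, while the signal names (rsD, wsE, weE, re1D, z, and so on) follow the equation.

```python
def stall_signal(rsD, rtD, re1D, re2D,
                 wsE, weE, wsM, weM, wsW, weW,
                 opcodeE, z):
    """Return True when the decode stage must stall (sketch of the equation)."""
    # Source register rs matches a pending register write in EX, MA or WB.
    rs_hazard = ((rsD == wsE and weE) or
                 (rsD == wsM and weM) or
                 (rsD == wsW and weW)) and re1D
    # Same comparison for source register rt.
    rt_hazard = ((rtD == wsE and weE) or
                 (rtD == wsM and weM) or
                 (rtD == wsW and weW)) and re2D
    # A taken branch in the execute stage makes the decode-stage
    # instruction invalid, so its hazards must not stall the pipeline.
    branch_taken = ((opcodeE == "BEQZ" and z) or
                    (opcodeE == "BNEZ" and not z))
    return bool((rs_hazard or rt_hazard) and not branch_taken)
```

With a taken BEQZ in the execute stage the function returns False even when a register match exists, which is the "don't stall if the branch is taken" case.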

Control Equations for PC and IR Muxes
PCSrc = Case opcodeE
    BEQZ.z, BNEZ.!z => br
    ...             => Case opcodeD
                           J, JAL   => jabs
                           JR, JALR => rind
                           ...      => pc+4
Give priority to the older instruction, i.e., execute-stage instruction over decode-stage instruction.
IRSrcD = Case opcodeE
    BEQZ.z, BNEZ.!z => nop
    ...             => Case opcodeD
                           J, JAL, JR, JALR => nop
                           ...              => IM
IRSrcE = Case opcodeE
    BEQZ.z, BNEZ.!z => nop
    ...             => stall.nop + !stall.IRD
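The three case expressions translate directly into nested conditionals. The following is a minimal sketch, assuming opcodes are plain strings; the function names `pc_src`, `ir_src_d`, `ir_src_e`, and `is_taken_branch` are hypothetical, and the returned strings mirror the mux input labels in the equations.

```python
def is_taken_branch(opcodeE, z):
    """True when the execute-stage instruction is a taken branch."""
    return (opcodeE == "BEQZ" and z) or (opcodeE == "BNEZ" and not z)

def pc_src(opcodeE, z, opcodeD):
    """Select the next-PC source; the older (execute) instruction wins."""
    if is_taken_branch(opcodeE, z):
        return "br"
    if opcodeD in ("J", "JAL"):
        return "jabs"
    if opcodeD in ("JR", "JALR"):
        return "rind"
    return "pc+4"

def ir_src_d(opcodeE, z, opcodeD):
    """Select what enters the decode-stage IR: a nop kills the fetched instruction."""
    if is_taken_branch(opcodeE, z):
        return "nop"
    if opcodeD in ("J", "JAL", "JR", "JALR"):
        return "nop"
    return "IM"

def ir_src_e(opcodeE, z, stall):
    """Select what enters the execute-stage IR."""
    if is_taken_branch(opcodeE, z):
        return "nop"
    return "nop" if stall else "IRD"
```

Checking the execute-stage opcode first in every function is what implements the priority rule: a taken branch overrides whatever jump happens to sit in the decode stage.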

Branch Pipeline Diagrams (resolved in execute stage)
time                 t0   t1   t2   t3   t4   t5   t6   t7 ...
(I1) 096: ADD        IF1  ID1  EX1  MA1  WB1
(I2) 100: BEQZ +200       IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD                  IF3  ID3  nop  nop  nop
(I4) 108:                           IF4  nop  nop  nop  nop
(I5) 304: ADD                            IF5  ID5  EX5  MA5  WB5
Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7 ...
IF    I1   I2   I3   I4   I5
ID         I1   I2   I3   nop  I5
EX              I1   I2   nop  nop  I5
MA                   I1   I2   nop  nop  I5
WB                        I1   I2   nop  nop  I5
nop => pipeline bubble

Reducing Branch Penalty (resolve in decode stage)
One pipeline bubble can be removed if an extra comparator is used in the Decode stage, but this might lengthen the cycle time.
[Figure: decode stage with a zero detect on the register file (GPRs) output rd1, feeding the PCSrc mux (pc+4 / jabs / rind / br)]
Pipeline diagram now same as for jumps.

Branch Delay Slots (expose control hazard to software)
Change the ISA semantics so that the instruction that follows a jump or branch is always executed. This gives the compiler the flexibility to put a useful instruction where normally a pipeline bubble would have resulted.
I1  096  ADD
I2  100  BEQZ r1 +200
I3  104  ADD    <- delay slot instruction, executed regardless of branch outcome
I4  304  ADD
Other techniques include more advanced branch prediction, which can dramatically reduce the branch penalty... to come later.

Branch Pipeline Diagrams (branch delay slot)
time                 t0   t1   t2   t3   t4   t5   t6   t7 ...
(I1) 096: ADD        IF1  ID1  EX1  MA1  WB1
(I2) 100: BEQZ +200       IF2  ID2  EX2  MA2  WB2
(I3) 104: ADD                  IF3  ID3  EX3  MA3  WB3
(I4) 304: ADD                       IF4  ID4  EX4  MA4  WB4
Resource Usage
time  t0   t1   t2   t3   t4   t5   t6   t7 ...
IF    I1   I2   I3   I4
ID         I1   I2   I3   I4
EX              I1   I2   I3   I4
MA                   I1   I2   I3   I4
WB                        I1   I2   I3   I4

Why an Instruction may not be dispatched every cycle (CPI>1)
Full bypassing may be too expensive to implement
- typically all frequently used paths are provided
- some infrequently used bypass paths may increase cycle time and counteract the benefit of reducing CPI
Loads have two-cycle latency
- Instruction after load cannot use load result
- MIPS-I ISA defined load delay slots, a software-visible pipeline hazard (compiler schedules independent instruction or inserts NOP to avoid hazard). MIPS: “Microprocessor without Interlocked Pipeline Stages”
- Removed in MIPS-II (pipeline interlocks added in hardware)
Conditional branches may cause bubbles
- kill following instruction(s) if no delay slots

Iron Law with Software-Visible NOPs
Time/Program = (Instructions/Program) * (Cycles/Instruction) * (Time/Cycle)
If software has to insert NOP instructions for hazard avoidance, instructions/program increases and average cycles/instruction decreases - doing nothing fast is easy!
But performance (time/program) is worse or the same as if hardware instead uses interlocks to avoid the hazard.
Hardware-generated interlocks (bubbles) don’t change instructions/program, but only add to cycles/instruction.
Hardware interlocks don’t take space in instruction cache.
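A small worked example makes the trade-off concrete. All numbers below (1000 useful instructions, 250 single-cycle hazards, a 2 ns cycle) are hypothetical, chosen only so the arithmetic is exact; the function name `iron_law` is also an assumption of this sketch.

```python
def iron_law(instructions, cpi, cycle_time_ns):
    """Time/Program = Instructions/Program * Cycles/Instruction * Time/Cycle."""
    return instructions * cpi * cycle_time_ns

USEFUL = 1000    # hypothetical number of useful instructions
HAZARDS = 250    # hypothetical hazards, each costing one cycle
CYCLE_NS = 2.0   # hypothetical cycle time in nanoseconds

# Software-visible NOPs: the program grows, every instruction takes one cycle.
sw_instr = USEFUL + HAZARDS
sw_cpi = 1.0
sw_time = iron_law(sw_instr, sw_cpi, CYCLE_NS)

# Hardware interlocks: same program, bubbles show up as extra cycles.
hw_instr = USEFUL
hw_cpi = (USEFUL + HAZARDS) / USEFUL
hw_time = iron_law(hw_instr, hw_cpi, CYCLE_NS)

print(f"NOPs:       {sw_instr} instr, CPI {sw_cpi}, {sw_time} ns")
print(f"Interlocks: {hw_instr} instr, CPI {hw_cpi}, {hw_time} ns")
```

The NOP version reports the lower CPI, yet both versions take the same time, which is why CPI alone is a misleading metric here. The sketch does not model the extra instruction-cache space the NOPs occupy.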

HomeWork
Readings: Read CDROM-D2; Read Chapter; get familiar with SPIM. Do exercises 4.10, 4.11, 4.12. Computer Organization and Design (COD) (Fifth Edition). HW3 (

Reading Material: Detail of Control Signals
For a simple single-cycle processor

Meaning of the Control Signals
MemWr: 1 => write memory
MemtoReg: 0 => ALU; 1 => Mem
RegDst: 0 => “rt”; 1 => “rd”
RegWr: 1 => write register
ExtOp: “zero”, “sign”
ALUsrc: 0 => regB; 1 => immed
ALUctr: “add”, “sub”, “or”
[Figure: single-cycle datapath with the register file (Rw, Ra, Rb, busA, busB, busW), Extender, ALU, and Data Memory, annotated with the control signals above]

Two Equivalent Ways to Specify Control
(Rotate about the 45-degree axis)

Setting PC Source Control Signal
PCSrc: 0 => PC <= PC + 4; 1 => PC <= PC + 4 + {SignExt(Im16), 2'b00}
Later in lecture: higher-level connection between mux and branch cond
[Figure: instruction fetch unit with PC, Inst Memory, a +4 Adder, a second Adder fed by PC Ext of imm16, and the PCSrc mux]
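The two PCSrc cases can be sketched in a few lines. This is a minimal model, assuming 32-bit addresses and the PC + 4 base used in the control-signal summary later in the deck; the names `sign_extend16` and `next_pc_single_cycle` are hypothetical.

```python
def sign_extend16(imm16):
    """Interpret a 16-bit field as a signed two's-complement integer."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def next_pc_single_cycle(pc, imm16, pc_src):
    """Model the PCSrc mux: 0 selects PC+4, 1 selects the branch target."""
    seq = (pc + 4) & 0xFFFFFFFF
    if pc_src == 0:
        return seq
    # {SignExt(imm16), 2'b00}: sign-extend, then append two zero bits,
    # which is the same as multiplying the word offset by 4.
    return (seq + (sign_extend16(imm16) << 2)) & 0xFFFFFFFF
```

For example, with `pc = 0x100` and `imm16 = 0xFFFF` (that is, -1), selecting the branch input yields `0x100` again: one word back from PC + 4.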

Meaning of the Control Signals
MemWr: 1 => write memory
MemtoReg: 0 => ALU; 1 => Mem
RegDst: 0 => “rt”; 1 => “rd”
RegWr: 1 => write register
ExtOp: 0 => “zero”; 1 => “sign”
ALUsrc: 0 => regB; 1 => immed
ALUctr: “add”, “sub”, “or”
[Figure: single-cycle datapath with the register file (Rw, Ra, Rb, busA, busB, busW), Extender, ALU, and Data Memory, annotated with the control signals above]

Specify ALU Source mux Control
ALUsrc: 0 => reg as ALU B input; 1 => immediate as ALU B input
Quiz: which ALUSrc settings are right for AddU, SubU, ORI, LW, and BEQ? The last answer choice is “None of the above”.
[Figure: register file, Extender, and the ALUSrc mux on the ALU B input]

Specify Immediate Extender Op Control
ExtOp: 0 => “zero extend immediate”; 1 => “sign extend imm.”
[Figure: imm16 passing through the Extender (16 -> 32 bits) to the ALUSrc mux]
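The Extender's behavior for the two ExtOp values can be written as a short function. This is a sketch with a hypothetical name, `extend`, that returns the 32-bit pattern as an unsigned Python integer.

```python
def extend(imm16, ext_op):
    """Extend a 16-bit immediate to 32 bits.

    ext_op == 0: zero extend (upper 16 bits are 0).
    ext_op == 1: sign extend (upper 16 bits copy bit 15).
    """
    imm16 &= 0xFFFF
    if ext_op == 1 and (imm16 & 0x8000):
        return imm16 | 0xFFFF0000
    return imm16
```

The two modes differ only when bit 15 of the immediate is set, so `extend(0x7FFF, 0)` and `extend(0x7FFF, 1)` return the same value.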

Specify Register Write Control
RegWr: 1 => write register
[Figure: register file with the RegWr write enable, RegDst mux, Extender, and ALUSrc mux]

Specify Register Destination Control
RegDst: 0 => “rt”; 1 => “rd”
R-type format: op (31:26), rs (25:21), rt (20:16), rd (15:11), shamt (10:6), funct (5:0)
I-type format: op (31:26), rs (25:21), rt (20:16), immediate (15:0)
[Figure: RegDst mux selecting Rd or Rt as the register file write address Rw]

Specify the Memory Write Control Signal
MemWr: 1 => write memory
[Figure: single-cycle datapath with the Data Memory write enable (WrEn) driven by MemWr]

Specify Memory To Register File Mux Control
MemtoReg: 0 => ALU; 1 => Mem
[Figure: single-cycle datapath with the MemtoReg mux selecting the ALU result or the Data Memory output for busW]

Specify the ALU Control Signals
ALUctr: 0 => “add”, 1 => “sub”, 2 => “or”
[Figure: single-cycle datapath with ALUctr driving the ALU, which produces the result and Zero]

The Add Instruction add rd, rs, rt
R-type format: op (31:26), rs (25:21), rt (20:16), rd (15:11), shamt (10:6), funct (5:0)
mem[PC]                   Fetch the instruction from memory
R[rd] <= R[rs] + R[rt]    The actual operation
PC <= PC + 4              Calculate the next instruction’s address
OK, let’s get on with today’s lecture by looking at the simple add instruction. In terms of Register Transfer Language, this is what the Add instruction needs to do. First, you need to fetch the instruction from memory. Then you perform the actual add operation. More specifically: (a) you add the contents of the registers specified by the Rs and Rt fields of the instruction; (b) then you write the result to the register specified by the Rd field. And finally, you need to update the program counter to point to the next instruction. Now, let’s take a detailed look at the datapath during the various phases of this instruction.

Instruction Fetch Unit at the Beginning of Add
Fetch the instruction from Instruction memory: Instruction <= mem[PC]
This is the same for all instructions.
[Figure: instruction fetch unit with PC, Inst Memory, a +4 Adder, a second Adder fed by PC Ext of imm16, and the PCSrc mux]

Instruction Fetch Unit at the End of Branch
I-type format: op (31:26), rs (25:21), rt (20:16), immediate (15:0)
if (Zero == 1) PC = PC + 4 + {SignExt[imm16], 2'b00}; else PC = PC + 4
What is the encoding of PCSrc? Direct MUX select? Or Branch / not branch? Let’s choose the second option.
[Figure: instruction fetch unit with Zero and PCSrc controlling the mux between the +4 Adder and the branch-target Adder]
Let’s look at the interesting case where the branch condition Zero is true (Zero = 1). Well, if Zero is not asserted, we will have our boring case where PC + 4 is selected. Anyway, with Branch = 1 and Zero = 1, the output of the second adder will be selected. That is, we will add the sequential address, that is the output of the first adder, to the sign-extended version of the immediate field, to form the branch target address (output of the second adder). With the control signal Jump set to zero, this branch target address will be written into the Program Counter register (PC) at the end of the clock cycle.

The Single Cycle Datapath during Load
I-type format: op (31:26), rs (25:21), rt (20:16), immediate (15:0)
R[rt] <= Data Memory[R[rs] + SignExt[imm16]]
Control settings: PCSrc <= +4, RegDst = 0, RegWr <= 1, ExtOp <= 1, ALUSrc = 1, ALUctr <= Add, MemWr = 0, MemtoReg <= 1
[Figure: single-cycle datapath with Instruction Fetch Unit, register file, Extender, ALU, and Data Memory, annotated with the settings above]
Let’s continue our lecture with the load instruction. What does the load instruction do? It first adds the contents of the register specified by the Rs field to the sign-extended version of the immediate field to form the memory address. Then it uses this memory address to access the memory and writes the data back to the register specified by the Rt field of the instruction. Here is how the datapath works: first the Rs field is fed to the Register File’s Ra address port to place the register onto bus A. Then the ExtOp signal is set to 1 so that the immediate field is sign extended, and we place this value (output of the Extender) onto the ALU input by setting ALUSrc to 1. The ALU then adds (ALUctr = add) the two together to form the memory address, which is then placed onto the Data Memory’s address port. In order to place the Data Memory’s output bus onto the Register File’s input bus (busW), the control needs to set MemtoReg to 1. Similar to the OR immediate instruction I showed you earlier, the destination register here is specified by the Rt field. Therefore RegDst must be set to 0. Finally, RegWr must be set to 1 to complete the register write operation. Well, it should be obvious to you guys by now that we need to set Branch and Jump to 0 to make sure the Instruction Fetch Unit updates the Program Counter correctly.

The Single Cycle Datapath during Store
I-type format: op (31:26), rs (25:21), rt (20:16), immediate (15:0)
Data Memory[R[rs] + SignExt[imm16]] <= R[rt]
Control settings to fill in: PCSrc <= ?, RegDst <= ?, RegWr <= ?, ExtOp <= ?, ALUSrc <= ?, ALUctr <= ?, MemWr <= ?, MemtoReg <= ?
[Figure: single-cycle datapath with Instruction Fetch Unit, register file, Extender, ALU, and Data Memory, with the control signal values left blank]

The Single Cycle Datapath during Store
I-type format: op (31:26), rs (25:21), rt (20:16), immediate (15:0)
Data Memory[R[rs] + SignExt[imm16]] <= R[rt]
Control settings: PCSrc <= +4, RegDst <= x, RegWr <= 0, ExtOp <= 1, ALUSrc <= 1, ALUctr <= Add, MemWr <= 1, MemtoReg <= x
[Figure: single-cycle datapath with Instruction Fetch Unit, register file, Extender, ALU, and Data Memory, annotated with the settings above]
The store instruction performs the inverse function of the load. Instead of loading data from memory, the store instruction sends the contents of the register specified by Rt to data memory. Similar to the load instruction, the store instruction needs to read the contents of register Rs (points to the Ra port) and add it to the sign-extended version of the immediate field (Imm16, ExtOp = 1, ALUSrc = 1) to form the data memory address (ALUctr = add). However, unlike the load instruction where busB is not used, the store instruction will use busB to send the data to the Data Memory. Consequently, the Rt field of the instruction has to be fed to the Rb port of the register file. In order to write the Data Memory properly, the MemWr signal has to be set to 1. Notice that the store instruction does not update the register file. Therefore, RegWr must be set to zero, and consequently the control signals RegDst and MemtoReg are don’t cares. And once again we need to set the control signals Branch and Jump to zero to ensure proper Program Counter updating. Well, by now, you are probably tired of this boring stuff where Branch and Jump are zero, so let’s look at something different: the branch instruction.

The Single Cycle Datapath during Branch
I-type format: op (31:26), rs (25:21), rt (20:16), immediate (15:0)
if (R[rs] - R[rt] == 0) Zero <= 1; else Zero <= 0
Control settings: PCSrc <= “Br”, RegDst <= x, RegWr <= 0, ExtOp <= x, ALUSrc <= 0, ALUctr <= Sub, MemWr <= 0, MemtoReg <= x
[Figure: single-cycle datapath with Instruction Fetch Unit, register file, Extender, ALU, and Data Memory, annotated with the settings above]
So how does the branch instruction work? As far as the main datapath is concerned, it needs to calculate the branch condition. That is, it subtracts the register specified in the Rt field from the register specified in the Rs field and sets the condition Zero accordingly. In order to place the register values on busA and busB, we need to feed the Rs and Rt fields of the instruction to the Ra and Rb ports of the register file and set ALUSrc to 0. Then we have to instruct the ALU to perform the subtract (ALUctr = sub) operation and set the Zero bit accordingly. The Zero bit is sent to the Instruction Fetch Unit. I will show you the internals of the Instruction Fetch Unit in a second. But before we leave this slide, I want you to notice that ExtOp, MemtoReg, and RegDst are don’t cares, but RegWr and MemWr have to be ZERO to prevent any write from occurring. And finally, the controller needs to set the Branch signal to 1 so the Instruction Fetch Unit knows what to do. So now let’s take a look at the Instruction Fetch Unit.

Step 4: Given Datapath: RTL -> Control
[Figure: Instruction<31:0> from Inst Memory is split into the Op, Fun, Rs, Rt, Rd, and Imm16 fields; the Control block takes Op and Fun and drives PCSrc, RegWr, RegDst, ExtOp, ALUSrc, ALUctr, MemWr, and MemtoReg into the DATA PATH, which returns Zero]

A Summary of Control Signals
inst    Register Transfer
ADD     R[rd] <= R[rs] + R[rt]; PC <= PC + 4
        ALUsrc = RegB, ALUctr = “add”, RegDst = rd, RegWr, PCSrc = “+4”
SUB     R[rd] <= R[rs] - R[rt]; PC <= PC + 4
        ALUsrc = RegB, ALUctr = “sub”, RegDst = rd, RegWr, PCSrc = “+4”
ORi     R[rt] <= R[rs] | zero_ext(Imm16); PC <= PC + 4
        ALUsrc = Im, Extop = “Z”, ALUctr = “or”, RegDst = rt, RegWr, PCSrc = “+4”
LOAD    R[rt] <= MEM[R[rs] + sign_ext(Imm16)]; PC <= PC + 4
        ALUsrc = Im, Extop = “Sn”, ALUctr = “add”, MemtoReg, RegDst = rt, RegWr, PCSrc = “+4”
STORE   MEM[R[rs] + sign_ext(Imm16)] <= R[rt]; PC <= PC + 4
        ALUsrc = Im, Extop = “Sn”, ALUctr = “add”, MemWr, PCSrc = “+4”
BEQ     if (R[rs] == R[rt]) then PC <= PC + 4 + {sign_ext(Imm16), 2'b00} else PC <= PC + 4
        PCSrc = “Br”, ALUctr = “sub”
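The register transfers in this summary are compact enough to run directly. The sketch below executes one instruction at a time on a register list and a memory dictionary. The decoded tuple format `(name, rs, rt, rd, imm16)` and the names `step`, `sign_ext`, and `zero_ext` are hypothetical. The store writes R[rt], as on the store datapath slide, and register 0 is not hardwired to zero in this simplified model.

```python
MASK = 0xFFFFFFFF  # keep every value within 32 bits

def sign_ext(imm16):
    """Signed value of a 16-bit immediate."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def zero_ext(imm16):
    """Unsigned value of a 16-bit immediate."""
    return imm16 & 0xFFFF

def step(inst, R, MEM, pc):
    """Apply one instruction's register transfers; return the next PC.

    inst is a decoded tuple (name, rs, rt, rd, imm16).
    R is a list of 32 register values; MEM maps byte addresses to words.
    """
    name, rs, rt, rd, imm16 = inst
    next_pc = (pc + 4) & MASK
    if name == "ADD":
        R[rd] = (R[rs] + R[rt]) & MASK
    elif name == "SUB":
        R[rd] = (R[rs] - R[rt]) & MASK
    elif name == "ORI":
        R[rt] = R[rs] | zero_ext(imm16)
    elif name == "LOAD":
        R[rt] = MEM.get((R[rs] + sign_ext(imm16)) & MASK, 0)
    elif name == "STORE":
        MEM[(R[rs] + sign_ext(imm16)) & MASK] = R[rt]
    elif name == "BEQ":
        if R[rs] == R[rt]:
            # Branch target: PC + 4 + {sign_ext(imm16), 2'b00}
            next_pc = (next_pc + (sign_ext(imm16) << 2)) & MASK
    else:
        raise ValueError(f"unsupported instruction: {name}")
    return next_pc
```

A typical use is to start with `R = [0] * 32`, `MEM = {}`, and `pc = 0`, then call `pc = step(inst, R, MEM, pc)` for each instruction in turn.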

A Summary of the Control Signals
See Appendix A for the op and func encodings; for ori, lw, sw, beq, and jump the func field is “We Don’t Care :-)”.
Signal        add       sub       ori   lw    sw    beq       jump
RegDst        1         1         0     0     x     x         x
ALUSrc        0         0         1     1     1     0         x
MemtoReg      0         0         0     1     x     x         x
RegWrite      1         1         1     1     0     0         0
MemWrite      0         0         0     0     1     0         0
PCSrc         0         0         0     0     0     1
ExtOp         x         x         0     1     1     x         x
ALUctr<2:0>   Add       Subtract  Or    Add   Add   Subtract  xxx
Instruction formats:
R-type (add, sub): op (31:26), rs (25:21), rt (20:16), rd (15:11), shamt (10:6), funct (5:0)
I-type (ori, lw, sw, beq): op (31:26), rs (25:21), rt (20:16), immediate (15:0)
J-type (jump): op (31:26), target address (25:0)
Here is a table summarizing the control signal settings for the seven (add, sub, ...) instructions we have looked at. Instead of showing you the exact bit values for the ALU control (ALUctr), I have used the symbolic values here. The first two columns are unique in the sense that they are R-type instructions, and in order to uniquely identify them, we need to look at BOTH the op field and the func field. Ori, lw, sw, and branch on Zero are I-type instructions and Jump is J-type. They can all be uniquely identified by looking at the opcode field alone. Now let’s take a more careful look at the first two columns. Notice that they are identical except for the last row. So we can combine these two columns here if we can “delay” the generation of the ALUctr signals. This leads us to something called “local decoding.”
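One way to hold such a control summary in code is a lookup table keyed by instruction. The sketch below covers add, sub, ori, lw, sw, and beq, with values taken from the per-instruction datapath slides and the register-transfer summary; `None` stands for a don't-care. The names `CONTROL`, `X`, and `control_signals` are hypothetical, and jump is left out because its PCSrc setting is not given in this section.

```python
X = None  # don't care

CONTROL = {
    "add": dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1, MemWrite=0,
                PCSrc=0, ExtOp=X, ALUctr="add"),
    "sub": dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1, MemWrite=0,
                PCSrc=0, ExtOp=X, ALUctr="sub"),
    "ori": dict(RegDst=0, ALUSrc=1, MemtoReg=0, RegWrite=1, MemWrite=0,
                PCSrc=0, ExtOp=0, ALUctr="or"),
    "lw":  dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1, MemWrite=0,
                PCSrc=0, ExtOp=1, ALUctr="add"),
    "sw":  dict(RegDst=X, ALUSrc=1, MemtoReg=X, RegWrite=0, MemWrite=1,
                PCSrc=0, ExtOp=1, ALUctr="add"),
    # PCSrc=1 means "Br": the fetch unit still checks Zero before branching.
    "beq": dict(RegDst=X, ALUSrc=0, MemtoReg=X, RegWrite=0, MemWrite=0,
                PCSrc=1, ExtOp=X, ALUctr="sub"),
}

def control_signals(op):
    """Return the control settings for one instruction mnemonic."""
    return CONTROL[op]
```

In this table add and sub differ only in ALUctr, which is the observation behind deriving ALUctr separately from the func field.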

HomeWork
Readings: Read Chapter 4.5-4.12; get familiar with SPIM.
Exercises: do exercises 4.10, 4.11, 4.12 in Computer Organization and Design (COD) (Fifth Edition).

Acknowledgements
These slides contain material from courses: UCB CS152, Stanford EE108B, and also MIT course 6.823.
