Designing a Single Cycle Datapath

Designing a Single Cycle Datapath
or The Do-It-Yourself CPU Kit
Start: X:40
Reading 4.4. HW due Monday.
Peer Instruction Lecture Materials for Computer Architecture by Dr. Leo Porter is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

The Big Picture: Where are We Now?
The Five Classic Components of a Computer: Processor (Control and Datapath), Memory, Input, and Output.
Today’s Topic: Datapath Design, then Control Design.
Before we go any further, let’s step back for a second and take a look at the big picture. All computers consist of five components: (1) input and (2) output devices, (3) the memory system, and the (4) control and (5) datapath of the processor. Today’s lecture covers the datapath design. In Friday’s lecture, we will show you how to design the processor’s control unit. +1 = 5 min. (X:45)

The Big Picture: The Performance Perspective
Processor design (datapath and control) will determine:
  Clock cycle time
  Clock cycles per instruction
Starting today: single cycle processor (execute an entire instruction in one clock cycle)
  Advantage: one clock cycle per instruction
  Disadvantage: long cycle time
ET = Insts * CPI * Cycle Time
This slide shows how the next two lectures fit into the overall performance picture. Recall from one of your earlier lectures that the performance of a machine is determined by 3 factors: (a) instruction count, (b) clock cycle time, and (c) clock cycles per instruction. Instruction count is controlled by the instruction set architecture and the compiler design, so the computer engineer has very little control over it. What you as a computer engineer can control, while you are designing a processor, are the clock cycle time and the clock cycles per instruction. More specifically, in the next two lectures, you will be designing a single cycle processor, which by definition takes one clock cycle to execute every instruction. The disadvantage of this single cycle processor design is that it has a long cycle time. +2 = 7 min. (X:47)
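The ET formula on this slide can be tried out with a few lines of Python. This is only a sketch: the function name and the numbers are made up for illustration and do not come from the lecture.

```python
def execution_time_ns(instruction_count, cpi, cycle_time_ns):
    """ET = Insts * CPI * Cycle Time, with the cycle time given in nanoseconds."""
    return instruction_count * cpi * cycle_time_ns


# Hypothetical workload: one million instructions on a single cycle
# processor (CPI = 1) with an assumed 8 ns clock cycle.
single_cycle_et = execution_time_ns(1_000_000, 1.0, 8.0)
print(single_cycle_et)  # execution time in nanoseconds
```

With CPI fixed at 1, the only lever left in this formula for a single cycle design is the cycle time, which is why the long cycle time matters.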

The Processor: Datapath & Control
We're ready to look at an implementation of MIPS, simplified to contain only:
  memory-reference instructions: lw, sw
  arithmetic-logical instructions: add, sub, and, or, slt
  control flow instructions: beq
Generic implementation:
  use the program counter (PC) to supply the instruction address
  get the instruction from memory
  read registers
  use the instruction to decide exactly what to do
Let’s look at some regularity in our instructions.

Review: Two Types of Logic Components
State element: C = f(A, B, state), with inputs A and B plus a clock (clk).
Combinational logic: C = f(A, B), with inputs A and B only.

Clocking Methodology
[Timing diagram: Clk waveform with Setup and Hold windows around each clock edge; data is Don’t Care outside those windows.]
Remember, we will be using a clocking methodology where all storage elements are clocked by the same clock edge. Consequently, our cycle time will be the sum of: (a) the clock-to-Q time of the input registers, (b) the longest delay path through the combinational logic block, (c) the setup time of the output register, and (d) the clock skew. In order to avoid a hold time violation, you have to make sure this inequality is fulfilled. +2 = 18 min. (X:58)
DRAW CT
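The cycle time sum described above is easy to check numerically. The sketch below uses hypothetical delays in picoseconds. The hold check is an assumption: the slide's inequality is not reproduced in the text, so the code uses the usual form, clock-to-Q plus the shortest logic delay must exceed hold time plus clock skew.

```python
def min_cycle_time(clk_to_q, longest_logic_delay, setup, skew):
    """Cycle time = clock-to-Q + longest combinational path + setup + skew."""
    return clk_to_q + longest_logic_delay + setup + skew


def hold_time_ok(clk_to_q, shortest_logic_delay, hold, skew):
    """Assumed hold constraint: new data must not arrive before hold + skew."""
    return clk_to_q + shortest_logic_delay > hold + skew


# Hypothetical delays, all in picoseconds.
print(min_cycle_time(clk_to_q=50, longest_logic_delay=900, setup=30, skew=20))
print(hold_time_ok(clk_to_q=50, shortest_logic_delay=10, hold=20, skew=20))
```

Note that the cycle time depends on the longest path while the hold check depends on the shortest one, so the two constraints are set by different parts of the logic.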

Which is correct about the ALU and memory in MIPS?
Isomorphic
A. The ALU always performs an operation before accessing data memory
B. The ALU sometimes performs an operation before accessing data memory
C. Data memory is always accessed before performing an ALU operation
D. Data memory is sometimes accessed before performing an ALU operation
E. None of the above
Coin flip

Which is correct about the ALU and the register file in MIPS?
Isomorphic
A. The ALU always performs an operation before accessing the register file
B. The ALU sometimes performs an operation before accessing the register file
C. The register file is always accessed before performing an ALU operation
D. The register file is sometimes accessed before performing an ALU operation
E. None of the above

So what does this tell us?
Draw the register file before ALU before memory

Register Transfer Language (RTL)
is a mechanism for describing the movement and manipulation of data between storage elements:
  R[3] <- R[5] + R[7]
  PC <- PC R[5]
  R[rd] <- R[rs] + R[rt]
  R[rt] <- Mem[R[rs] + immed]
We’ll be using this from time to time. It’s just a shorthand for what is going on in hardware; we’ll use it in a second.
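One way to get a feel for RTL is to read each transfer as an assignment on a small software model. The sketch below assumes a Python list for the register file and a dictionary for memory; both are illustrative stand-ins, not part of the lecture, and the register numbers and values are arbitrary.

```python
R = [0] * 32   # register file model: R[0] .. R[31]
Mem = {}       # memory model: address -> word (hypothetical, sparse)

R[5], R[7] = 10, 32
R[3] = R[5] + R[7]            # R[3] <- R[5] + R[7]

rs, rt, rd, immed = 5, 7, 9, 8
R[rd] = R[rs] + R[rt]         # R[rd] <- R[rs] + R[rt]

Mem[R[rs] + immed] = 99       # preload the word that the load will read
R[rt] = Mem[R[rs] + immed]    # R[rt] <- Mem[R[rs] + immed]

print(R[3], R[rd], R[rt])
```

Each arrow in RTL corresponds to a value being captured by a storage element; the right-hand side is what the combinational logic computes.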

Review: The MIPS Instruction Formats
All MIPS instructions are 32 bits long. The three instruction formats:
  R-type: op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | rd (15-11, 5 bits) | shamt (10-6, 5 bits) | funct (5-0, 6 bits)
  I-type: op (bits 31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | immediate (15-0, 16 bits)
  J-type: op (bits 31-26, 6 bits) | target address (25-0, 26 bits)
One of the most important things you need to know before you start designing a processor is what the instructions look like. Or in more technical terms, you need to know the instruction format. One good thing about the MIPS instruction set is that it is very simple. First of all, all MIPS instructions are 32 bits long and there are only three instruction formats: (a) R-type, (b) I-type, and (c) J-type. The different fields of the R-type instructions are: (a) op specifies the operation of the instruction. (b) rs, rt, and rd are the source and destination register specifiers. (c) shamt specifies the amount you need to shift for the shift instructions. (d) funct selects the variant of the operation specified in the “op” field. For the I-type instruction, bits 0 to 15 are used as an immediate field. I will show you how this immediate field is used differently by different instructions. Finally, for the J-type instruction, bits 0 to 25 become the target address of the jump. +3 = 10 min. (X:50)
MIPS is simple: only 3 formats, and they have some common features. Let’s look more closely at the few instructions we are focusing on today.
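Because the fields sit at fixed bit positions, pulling them out of a 32-bit instruction word takes only shifts and masks. The sketch below is illustrative: the function name and the dictionary layout are assumptions, and every field is extracted regardless of format, which is how the hardware wires behave as well.

```python
def decode(instr):
    """Split a 32-bit MIPS instruction word into all candidate fields."""
    return {
        "op": (instr >> 26) & 0x3F,     # bits 31-26
        "rs": (instr >> 21) & 0x1F,     # bits 25-21
        "rt": (instr >> 16) & 0x1F,     # bits 20-16
        "rd": (instr >> 11) & 0x1F,     # bits 15-11 (R-type)
        "shamt": (instr >> 6) & 0x1F,   # bits 10-6  (R-type)
        "funct": instr & 0x3F,          # bits 5-0   (R-type)
        "imm16": instr & 0xFFFF,        # bits 15-0  (I-type)
        "target": instr & 0x3FFFFFF,    # bits 25-0  (J-type)
    }


# add with rd = 8, rs = 9, rt = 10: op = 0, funct = 0x20
add_word = (0 << 26) | (9 << 21) | (10 << 16) | (8 << 11) | (0 << 6) | 0x20
print(hex(add_word), decode(add_word))
```

The control logic later decides which of these fields are meaningful for a given opcode; the extraction itself costs nothing but wires.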

The MIPS Subset
R-type: add rd, rs, rt (also sub, and, or, slt)
  Format: op (31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | rd (15-11, 5 bits) | shamt (10-6, 5 bits) | funct (5-0, 6 bits)
  PC = PC + 4
  R[rd] = R[rs] OP R[rt]
LOAD and STORE: lw rt, rs, imm16 and sw rt, rs, imm16
  Format: op (31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | immediate (15-0, 16 bits)
  PC = PC + 4
  R[rt] = Mem[R[rs] + SE(imm)] OR Mem[R[rs] + SE(imm)] = R[rt]
BRANCH: beq rs, rt, imm16
  Format: op (31-26, 6 bits) | rs (25-21, 5 bits) | rt (20-16, 5 bits) | displacement (15-0, 16 bits)
  ZERO = (R[rs] - R[rt] == 0)
  PC = if (ZERO) PC + 4 + (SE(imm) << 2) else PC + 4
In today’s lecture, I will show you how to implement the following subset of MIPS instructions: add, subtract, or immediate, load, store, branch, and the jump instruction. The Add and Subtract instructions use the R format. The Op field together with the Funct field specifies all the different kinds of add and subtract instructions. Rs and Rt specify the source registers, and the Rd field specifies the destination register. The Or immediate instruction uses the I format. It only uses one source register, Rs. The other operand comes from the immediate field. The Rt field is used to specify the destination register. Both the load and store instructions use the I format, and both add Rs and the immediate field together to form the memory address. The difference is that the load instruction will load the data from memory into Rt, while the store instruction will store the data in Rt into memory. The branch on equal instruction also uses the I format. Here Rs and Rt are used to specify the registers we need to compare. If these two registers are equal, we will branch to a location specified by the immediate field. Finally, the jump instruction uses the J format and always causes the program to jump to a memory location specified in the address field. I know I went over this rather quickly and you may have missed something. But don’t worry, this is just an overview. You will keep seeing these (point to the format) all day today. +3 = 13 min. (X:53)
BEFORE GOING ON… quick reminder…
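The register transfers for this subset can be exercised with a small software model. The sketch below is not the lecture's datapath: the `step` function, the state dictionary, and the choice to return 0 for memory that was never written are all assumptions made for illustration, and register 0 is not forced to zero.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm16):
    """SE(imm): read a 16-bit field as a signed two's-complement value."""
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def step(state, name, rs, rt, rd=0, imm16=0):
    """Apply one instruction's RTL to state = {"pc": int, "R": list, "Mem": dict}."""
    R, Mem, pc = state["R"], state["Mem"], state["pc"]
    next_pc = (pc + 4) & MASK32                      # PC = PC + 4
    if name == "add":
        R[rd] = (R[rs] + R[rt]) & MASK32             # R[rd] = R[rs] + R[rt]
    elif name == "sub":
        R[rd] = (R[rs] - R[rt]) & MASK32             # R[rd] = R[rs] - R[rt]
    elif name == "lw":
        address = (R[rs] + sign_extend16(imm16)) & MASK32
        R[rt] = Mem.get(address, 0)                  # R[rt] = Mem[R[rs] + SE(imm)]
    elif name == "sw":
        address = (R[rs] + sign_extend16(imm16)) & MASK32
        Mem[address] = R[rt]                         # Mem[R[rs] + SE(imm)] = R[rt]
    elif name == "beq":
        zero = ((R[rs] - R[rt]) & MASK32) == 0       # ZERO = (R[rs] - R[rt] == 0)
        if zero:
            next_pc = (pc + 4 + (sign_extend16(imm16) << 2)) & MASK32
    else:
        raise ValueError(f"instruction not in this sketch: {name}")
    state["pc"] = next_pc


state = {"pc": 0, "R": [0] * 32, "Mem": {}}
state["R"][1] = state["R"][2] = 5
step(state, "add", rs=1, rt=2, rd=3)
step(state, "sw", rs=0, rt=3, imm16=8)
step(state, "lw", rs=0, rt=4, imm16=8)
step(state, "beq", rs=1, rt=2, imm16=0xFFFF)   # displacement of -1 word
print(state["pc"], state["R"][3], state["R"][4])
```

Every instruction computes PC + 4, and only a taken beq replaces it, which mirrors the way the RTL above lists PC = PC + 4 for each instruction class.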

Storage Element: Register
Register: similar to the D flip-flop, except with an N-bit input and output and a Write Enable input.
  Inputs: Data In (N bits), Write Enable, Clk
  Output: Data Out (N bits)
  Write Enable = 0: Data Out will not change
  Write Enable = 1: Data Out will become Data In (on the clock edge)
As far as storage elements are concerned, we will need an N-bit register that is similar to the D flip-flop I showed you in class. The significant difference here is that the register will have a Write Enable input. That is, the content of the register will NOT be updated if Write Enable is zero. The content is updated at the clock tick ONLY if the Write Enable signal is set to 1. +1 = 31 min. (Y:11)
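The write enable behavior can be modeled in a few lines. This is a behavioral sketch only; the class and method names are hypothetical, and a single method call stands in for one clock edge.

```python
class Register:
    """N-bit register whose contents change only on a clock edge with write enable set."""

    def __init__(self, width=32):
        self.mask = (1 << width) - 1
        self.data_out = 0

    def clock_edge(self, data_in, write_enable):
        """Model one clock tick; return the value visible on Data Out afterwards."""
        if write_enable:
            self.data_out = data_in & self.mask
        return self.data_out


reg = Register(width=8)
print(reg.clock_edge(0x1FF, write_enable=0))  # enable is 0: contents stay as they were
print(reg.clock_edge(0x1FF, write_enable=1))  # enable is 1: Data In captured, truncated to 8 bits
```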

Which of these describes our register file?
A. Two 32-bit outputs, 3 5-bit inputs, clk input, 1-bit control input
B. Two 32-bit outputs, 3 32-bit inputs, clk input, 1-bit control input
C. Two 32-bit outputs, 2 5-bit inputs, 1 32-bit input, clk input, 1-bit control input
D. Two 32-bit outputs, 2 32-bit inputs, 1 32-bit input, clk input, 1-bit control input
E. None of the above

Register File
32 32-bit registers.
Inputs: RR1 (5 bits), RR2 (5 bits), WR (5 bits), Write Data (32 bits), RegWrite, Clk.
Outputs: Read Data 1 (32 bits) and a second 32-bit read output.
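A behavioral sketch of this register file may help when wiring it into a datapath. The class and method names are hypothetical. Reads are modeled as combinational and writes as happening on a clock edge when RegWrite is 1; the MIPS convention that register 0 always reads as zero is not modeled here.

```python
class RegisterFile:
    """32 registers of 32 bits: two read ports, one clocked write port."""

    def __init__(self):
        self.regs = [0] * 32

    def read(self, rr1, rr2):
        """Combinational read: return (Read Data 1, Read Data 2)."""
        return self.regs[rr1 & 0x1F], self.regs[rr2 & 0x1F]

    def clock_edge(self, wr, write_data, reg_write):
        """On the clock edge, write Write Data into register WR if RegWrite is 1."""
        if reg_write:
            self.regs[wr & 0x1F] = write_data & 0xFFFFFFFF


rf = RegisterFile()
rf.clock_edge(wr=8, write_data=123, reg_write=1)
print(rf.read(8, 9))
```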

Which of these describes our memory (for now)?
A. One 32-bit output, 1 5-bit input, 1 32-bit input, clk input, 1-bit control input, 1-bit control input
B. One 32-bit output, 2 5-bit inputs, clk input, 1-bit control input, 1-bit control input
C. One 32-bit output, 2 32-bit inputs, clk input, 2 1-bit control inputs
D. One 32-bit output, 1 32-bit input, clk input, 2 1-bit control inputs
E. None of the above

Memory
Data ports: Address and Write Data (inputs), Read Data (output); the diagram labels the buses as 32 bits wide.
Control and clock: MemWrite, MemRead, Clk.
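The memory block can be sketched the same way as the other state elements. Everything here is an illustrative assumption: the class name, the dictionary used as storage, returning 0 for locations that were never written, and returning None when MemRead is 0 to mark the output as a don't-care.

```python
class DataMemory:
    """Memory with one address port, one write-data port, and one read-data port."""

    MASK32 = 0xFFFFFFFF

    def __init__(self):
        self.words = {}  # address -> 32-bit word

    def read(self, address, mem_read):
        """Return Read Data when MemRead is 1; otherwise the output is a don't-care."""
        if not mem_read:
            return None
        return self.words.get(address & self.MASK32, 0)

    def clock_edge(self, address, write_data, mem_write):
        """On the clock edge, store Write Data at Address if MemWrite is 1."""
        if mem_write:
            self.words[address & self.MASK32] = write_data & self.MASK32


mem = DataMemory()
mem.clock_edge(address=0x100, write_data=0xDEADBEEF, mem_write=1)
print(hex(mem.read(0x100, mem_read=1)))
```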

Can we lay out a high-level design to do this?
Draw as much as you can, implementing one instruction at a time, and get the students involved. You’ll want to do something like this for your lab.

Putting it All Together: A Single Cycle Datapath
We have everything except control signals (later).
So here is the single cycle datapath we just built. If you push into the Instruction Fetch Unit, you will see the last slide showing the PC, the next address logic, and the Instruction Memory. Here I have shown how we can get the Rt, Rs, Rd, and Imm16 fields out of the 32-bit instruction word. The Rt, Rs, and Rd fields go to the register file as register specifiers, while the Imm16 field goes to the Extender, where it is either zero- or sign-extended to 32 bits. The signals ExtOp, ALUSrc, ALUctr, MemWr, MemtoReg, RegDst, RegWr, Branch, and Jump are control signals, and I will show you how to generate them on Friday. +2 = 80 min. (Z:00)

Active Single-Cycle Datapath
Ignoring control, which instruction does this active datapath represent?
A. R-type
B. lw
C. sw
D. beq
E. None of the above

Key Points
CPU is just a collection of state and combinational logic.
We just designed a very rich processor, at least in terms of functionality.
ET = IC * CPI * Cycle Time: where does the single-cycle machine fit in?

Control Logic for the Single-Cycle CPU
Who’s in charge here?

Putting it All Together: A Single Cycle Datapath
We have everything except control signals.
Here is the single-cycle datapath we derived in the previous lecture. With this we have the ability to move the data around to all the right places to accomplish all of the instructions we want to execute. How do we decide where to move what? From the instructions. That means we will need to derive all of these signals from the instruction, more specifically from the opcode and function fields. We’re going to connect all these signals to a central place and control them from there, based on opcode/funct.

Okay, then, what about those Control Signals?
Point out we’ve just hooked these up.

Select the true statement for MIPS
Peer instruction question asking if decode can happen in parallel with register read.
A. Registers can be read in parallel with control signal generation
B. Instruction Read can be done in parallel with control signal generation
C. Registers can be written in parallel with control signal generation
D. The main ALU can execute in parallel with control signal generation
E. None of the above

Okay, then, what about those Control Signals?
We see here that all of the control signals are generated from one of two places: the opcode bits and the function code bits. Notice that most signals are generated just from the opcode. Instructions that share the same opcode and differ only in function code (most R-type) look identical to the processor except for the specific operation performed by the ALU. In short: control bits come from the opcode and sometimes the function code bits, and R-type instructions are the same except for the ALU operation. Start here.

ALU control bits
Recall: 5-function ALU.
Take your time here, this isn’t obvious. These are the 3-bit input signals which cause the processor to do what you want. The point here is that the main control logic also has some input into the ALU control logic, based on the opcode, because sometimes the opcode tells us completely what operation to perform, and sometimes it is the function field.

Full ALU
Consolidate to 3 wires, since Binvert and CIn are always the same.
What signals (Binvert, CIn, Oper) accomplish: and? or? add? sub? beq? slt?
For slt, the result comes from the sign bit (adder output from bit 31).
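A behavioral sketch of this ALU can make the three control wires concrete. The encoding below is an assumption taken from the common textbook design, not something stated on the slide: bit 2 drives both Binvert and CIn, and bits 1-0 select and (00), or (01), the adder output (10), or slt (11). Overflow is ignored, so slt simply uses the sign bit of the subtraction.

```python
MASK32 = 0xFFFFFFFF


def alu(ctrl, a, b):
    """Return (result, zero) for a 3-bit control value under the assumed encoding."""
    binvert = (ctrl >> 2) & 1           # also used as the carry-in
    oper = ctrl & 0b11
    a &= MASK32
    b_in = (~b & MASK32) if binvert else (b & MASK32)
    total = (a + b_in + binvert) & MASK32   # a + b, or a - b when binvert = 1
    if oper == 0b00:
        result = a & b_in
    elif oper == 0b01:
        result = a | b_in
    elif oper == 0b10:
        result = total
    else:
        result = (total >> 31) & 1      # slt: sign bit of a - b
    return result, int(result == 0)


print(alu(0b010, 7, 5))   # add
print(alu(0b110, 5, 5))   # sub, as used by beq: zero flag set when operands match
print(alu(0b111, 3, 9))   # slt
```

Under this encoding beq needs no operation of its own: it reuses subtraction and looks only at the zero output.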

ALU control bits
Recall: 5-function ALU, controlled based on the opcode (bits 31-26) and function code (bits 5-0) from the instruction.
The ALU doesn’t need to know all opcodes. We will summarize the opcode with ALUOp (2 bits):
  00 - lw, sw
  01 - beq
  10 - R-format
[Diagram: op (6 bits) feeds Main Control, which sends ALUop (2 bits) to the ALU control block; that block also takes func and produces ALUctr (3 bits).]

Generating ALU control
[ALU Control Logic truth table]
This is essentially a truth table. So they should understand what this table is saying, but also that since we can write this down as a truth table, we can design the logic for it.

Generating individual ALU signals
Candidate expressions:
  A: (Op1)(!Op0)(F0+F3)
  B: !Op1 + !F2
  C: Op0 + Op1F1
Select which expression drives each of ALUctr2, ALUctr1, and ALUctr0 (choices A to D, or E: None of the above).
This, then, is the logic:
  ALUctr2 = Op0 + Op1F1
  ALUctr1 = !Op1 + !F2
  ALUctr0 = Op1 !Op0 (F0+F3)
Notice that two function bits were dropped for being completely redundant (because of our small set of instructions).
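The three expressions can be checked by evaluating them for every instruction in the subset. The sketch below assumes that Op1 and Op0 are the two ALUOp bits (00 for lw/sw, 01 for beq, 10 for R-format) and that F3..F0 are the low function code bits. The expected outputs use the common textbook ALU encoding (and 000, or 001, add 010, sub 110, slt 111), which is an assumption rather than something stated in these notes.

```python
def alu_control(alu_op, funct):
    """Compute the 3-bit ALUctr value from 2-bit ALUOp and the function code."""
    op1, op0 = (alu_op >> 1) & 1, alu_op & 1
    f0, f1 = funct & 1, (funct >> 1) & 1
    f2, f3 = (funct >> 2) & 1, (funct >> 3) & 1
    ctr2 = op0 | (op1 & f1)                 # Op0 + Op1 F1
    ctr1 = (1 - op1) | (1 - f2)             # !Op1 + !F2
    ctr0 = op1 & (1 - op0) & (f0 | f3)      # Op1 !Op0 (F0 + F3)
    return (ctr2 << 2) | (ctr1 << 1) | ctr0


# MIPS function codes for the R-type instructions in the subset.
FUNCT = {"add": 0x20, "sub": 0x22, "and": 0x24, "or": 0x25, "slt": 0x2A}

for name, funct in FUNCT.items():
    print(name, format(alu_control(0b10, funct), "03b"))
print("lw/sw", format(alu_control(0b00, 0), "03b"))
print("beq", format(alu_control(0b01, 0), "03b"))
```

For lw, sw, and beq the function bits are really part of the immediate, so the logic has to give the same answer whatever they contain; the ALUOp terms in each expression take care of that.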

add instruction control signals?
ISOMORPHIC
[Answer table with columns Select, RegDst, MemToReg, ALUOp; entries as shown: A: X 00, B: 1, C: 10, D, E: None of the above.]
Both a bit hard (for different reasons) - Heads

sw instruction control signals?
ISOMORPHIC
[Answer table with columns Select, ALUSrc, RegDst, ALUOp; entries as shown: A: 00, B: 1 X, C: 10, D, E: None of the above.]

beq Control
Ultimately we can generate the control signals for all instructions. Branches are a bit trickier, so let’s do this together.

Control Truth Table
So then we have another truth table. Again, since we can derive a truth table, we must be able to design the logic for it.

Control
Simple combinational logic (truth tables). And here it is.
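Since the main control is just a truth table indexed by opcode, it can be written down as a lookup. The values below are an assumption: they follow the standard textbook table for this instruction subset rather than the table on the slide, with None standing in for a don't-care (X). The opcodes are the usual MIPS ones (0 for R-format, 0x23 for lw, 0x2B for sw, 0x04 for beq).

```python
X = None  # don't-care

# opcode -> control signals (assumed values, textbook convention)
CONTROL = {
    0x00: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
               MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),   # R-format
    0x23: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
               MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),   # lw
    0x2B: dict(RegDst=X, ALUSrc=1, MemtoReg=X, RegWrite=0,
               MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),   # sw
    0x04: dict(RegDst=X, ALUSrc=0, MemtoReg=X, RegWrite=0,
               MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),   # beq
}


def main_control(opcode):
    """Look up the control signals for an opcode in this four-row table."""
    return CONTROL[opcode]


print(main_control(0x23))
```

In hardware the same table becomes a small block of combinational logic with the six opcode bits as inputs; the don't-care entries give the logic minimizer room to simplify.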

Which wire, if always ZERO, would break add?
ISOMORPHIC
Choices (wires labeled on the datapath figure): A, B, C, D
Equal (HW) - heads
Correct Answer - D

Which wire, if always ONE, would break lw?
ISOMORPHIC
Choices (wires labeled on the datapath figure): A, B, C, D
Correct Answer - B
Primary Purpose: Helping students understand the single-cycle processor.
Concept: Understanding how control lines dictate the behavior of the processor.
Expected mistakes: If they don’t understand how lw functions, they may not understand the control signals.
Post Discussion: Draw the active datapath of both instructions and mark control signals.

Add new instructions
Potentially requires modifying the datapath.
Potentially requires adding more control wires, which would impact our previous truth table.

Do we need to modify our single-cycle design to do jr?
ISOMORPHIC
Select Best Answer
A. Yes, we need both new control and datapath.
B. Yes, we need just datapath.
C. No, but we should for better performance.
D. No, just changing control signals is fine.
E. Single cycle can’t do jump register.

Single-Cycle CPU Summary
Easy, particularly the control.
Which instruction takes the longest? By how much? Why is that a problem?
ET = IC * CPI * CT
What else can we do?
When does a multi-cycle implementation make sense?
  e.g., 70% of instructions take 75 ns, 30% take 200 ns?
  suppose 20% overhead for extra latches
Real machines have much more variable instruction latencies than this.
The main points I want them to get: CPI always equals 1.0, but the cycle time is going to be long (and determined by the longest instruction). Performance is therefore a function of the longest instruction in the ISA. In a multicycle implementation, where some instructions take 3 cycles, some 4, some 5, performance is a function of the average length instead of the longest. Calculate an estimated speedup for the example here: 200 vs. (200*.3 + 75*.7)*1.2 = (60 + 52.5)*1.2 = 135.
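The speedup estimate in this example can be reproduced directly. The sketch below uses the numbers from the slide (70% of instructions at 75 ns, 30% at 200 ns, 20% latch overhead); the function name and argument layout are made up for illustration.

```python
def multi_cycle_average_ns(mix, overhead):
    """Average time per instruction: weighted latency, scaled by latch overhead.

    mix is a list of (fraction_of_instructions, latency_ns) pairs.
    """
    weighted = sum(fraction * latency for fraction, latency in mix)
    return weighted * (1.0 + overhead)


single_cycle_ns = 200.0   # every instruction pays for the longest one
multi_cycle_ns = multi_cycle_average_ns([(0.70, 75.0), (0.30, 200.0)], overhead=0.20)
speedup = single_cycle_ns / multi_cycle_ns

print(round(multi_cycle_ns, 1), round(speedup, 2))
```

The average comes to about 135 ns per instruction against 200 ns for the single-cycle design, a speedup of roughly 1.48 under these assumptions.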
