Processor Design & Implementation

Review: MIPS (RISC) Design Principles
• Simplicity favors regularity
  - fixed-size instructions
  - small number of instruction formats
  - opcode always the first 6 bits
• Smaller is faster
  - limited instruction set
  - limited number of registers in the register file
  - limited number of addressing modes
• Make the common case fast
  - arithmetic operands come from the register file (load-store machine)
  - instructions may contain immediate operands
• Good design demands good compromises
  - three instruction formats

Sequential vs Combinational Circuits
• Combinational logic circuits: the output is a function of the present value of the inputs only. When the inputs change, the information about the previous inputs is lost, so the circuit is memoryless. E.g., multiplexors.
• Sequential logic circuits: the outputs also depend on past inputs, so the circuit has memory. These are basically combinational circuits with the additional properties of storage (to remember past inputs) and feedback. E.g., flip-flops and latches.

RS Latches
An RS latch is a memory element with
- 2 inputs: Reset (R) and Set (S)
- 2 outputs: Q and its complement Q'
Note: if the inputs don't change, the outputs are held indefinitely.
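The hold behaviour can be tried out in a few lines of Python. This is only a sketch: it assumes the latch is built from two cross-coupled NOR gates (the slides do not say which gates are used), and the names `nor` and `rs_latch_step` are made up for illustration.

```python
def nor(a: int, b: int) -> int:
    """Two-input NOR gate on 0/1 values."""
    return 0 if (a or b) else 1


def rs_latch_step(r: int, s: int, q: int, q_bar: int) -> tuple:
    """Let an (assumed) cross-coupled NOR latch settle for inputs r and s.

    (q, q_bar) is the state before the inputs are applied; the gates are
    re-evaluated until the outputs stop changing.
    """
    for _ in range(4):  # a few passes are enough for the valid input combinations
        new_q = nor(r, q_bar)
        new_q_bar = nor(s, new_q)
        if (new_q, new_q_bar) == (q, q_bar):
            break
        q, q_bar = new_q, new_q_bar
    return q, q_bar


state = (0, 1)                       # start with Q = 0
state = rs_latch_step(0, 1, *state)  # Set   -> Q = 1
after_set = state
state = rs_latch_step(0, 0, *state)  # Hold  -> Q stays 1
after_hold = state
state = rs_latch_step(1, 0, *state)  # Reset -> Q = 0
after_reset = state
```

With R = S = 0 the state is simply returned unchanged, which is the "held indefinitely" property noted above.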

RS Latch State Transition Diagram

Clocks and Synchronous Circuits
• Asynchronous operation: the output state of an RS latch changes directly in response to changes in its inputs.
• Virtually all sequential circuits currently employ synchronous operation: the output of a sequential circuit is constrained to change only at a time specified by a global enabling signal.
• This signal is generally known as the system clock.

Transparent D Latches
• Modify the RS latch so that its output state is only permitted to change when a valid enable signal (the system clock) is present.
• Add a pair of AND gates in cascade with the R and S inputs, controlled by an additional input known as the enable (EN) input.

Master-Slave Flip Flops
• Sequential circuits are easy to design if outputs change only on the rising (positive-going) or falling (negative-going) edge of a clock (i.e., enable) signal.
• This can be done by combining two transparent D latches in a master-slave configuration.
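A behavioural sketch of the idea, under the assumption that the master latch is transparent while the clock is 1 and the slave while it is 0 (so the output changes on the falling edge; the opposite polarity works the same way). The class names are illustrative only.

```python
class DLatch:
    """Transparent latch: Q follows D while enable is 1, holds otherwise."""

    def __init__(self) -> None:
        self.q = 0

    def update(self, d: int, enable: int) -> int:
        if enable:
            self.q = d
        return self.q


class MasterSlaveFlipFlop:
    """Two D latches in series, enabled on opposite clock levels (assumed polarity)."""

    def __init__(self) -> None:
        self.master = DLatch()
        self.slave = DLatch()

    def update(self, d: int, clk: int) -> int:
        m = self.master.update(d, clk)        # master is open while clk = 1
        return self.slave.update(m, 1 - clk)  # slave is open while clk = 0


ff = MasterSlaveFlipFlop()
trace = [
    ff.update(1, 1),  # clk high: master captures 1, output still 0
    ff.update(1, 0),  # falling edge: slave passes the 1 to the output
    ff.update(0, 0),  # clk low: master is closed, a change on D is ignored
    ff.update(0, 1),  # clk high: master captures 0, output still 1
    ff.update(0, 0),  # falling edge: output becomes 0
]
```

Because the two latches are never transparent at the same time, a change on D can reach the output only once per clock cycle.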

The Processor: Datapath & Control
• Our implementation of the MIPS is simplified to
  - memory-reference instructions: lw, sw
  - arithmetic-logical instructions: add, sub, and, or, slt
  - control flow instructions: beq, j
• Generic implementation
  - fetch: use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC: PC = PC + 4)
  - decode: decode the instruction (and read registers)
  - execute: execute the instruction
• All instructions (except j) use the ALU after reading the registers. How?
  - memory-reference instructions use the ALU to compute addresses
  - arithmetic instructions use the ALU to do the required arithmetic
  - control flow instructions use the ALU to compute branch conditions

Aside: Clocking Methodologies
• The clocking methodology defines when data in a state element is valid and stable relative to the clock.
  - State element: a memory element such as a register (here: instruction memory, data memory, registers)
  - Edge-triggered: all state changes occur on a clock edge
• Typical execution within one clock cycle: read the contents of state elements -> send the values through combinational logic -> write the results to one or more state elements
  [Figure: State element 1 -> Combinational logic -> State element 2, both state elements driven by the clock]
• With edge-triggered state elements, there is no worry about feedback within a single clock cycle. It is a single-sided clock constraint: we only have to make sure the clock cycle is long enough for the signals to settle.
• This assumes state elements are written on every clock cycle. If not, an explicit write control signal is needed, and the write occurs only when both the write control is asserted and the clock edge occurs.
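The read, compute, write pattern can be mimicked with a tiny Python model. It is a sketch only: the `Register` class and its `clock_edge` method are invented names, and the clock is modelled as an explicit method call rather than a signal.

```python
class Register:
    """Edge-triggered state element with an explicit write control (illustrative)."""

    def __init__(self, value: int = 0) -> None:
        self.value = value

    def read(self) -> int:
        # Reading is combinational: it never changes the stored state.
        return self.value

    def clock_edge(self, data_in: int, write_enable: bool = True) -> None:
        # State changes only here, and only if the write control is asserted.
        if write_enable:
            self.value = data_in


# One clock cycle: read -> combinational logic -> write on the edge.
pc = Register(0)
next_pc = pc.read() + 4   # combinational logic sees the old value
pc.clock_edge(next_pc)    # the new value becomes visible in the next cycle

# A state element that is not written every cycle needs the write control.
reg = Register(7)
reg.clock_edge(9, write_enable=False)
```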

Fetching Instructions
• Fetching instructions involves
  - reading the instruction from the Instruction Memory
  - updating the PC value to be the address of the next (sequential) instruction: PC = PC + 4
  [Figure: the clocked PC feeds the Read Address input of the Instruction Memory and an adder that adds 4]
• The PC is updated every clock cycle, so it does not need an explicit write control signal, just a clock signal.
• Reading from the Instruction Memory is a combinational activity, so it does not need an explicit read control signal.

Decoding Instructions
• Decoding instructions involves
  - sending the fetched instruction's opcode and function field bits to the control unit
  - reading two values from the Register File (the Register File addresses are contained in the instruction)
  [Figure: the Instruction drives the Control Unit and the Register File inputs Read Addr 1, Read Addr 2, Write Addr and Write Data; the Register File outputs are Read Data 1 and Read Data 2]

Executing R Format Operations
• R format operations (add, sub, slt, and, or)
  - perform the operation (op and funct) on the values in rs and rt
  - store the result back into the Register File (into location rd)
  R-type:  op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]
  [Figure: the Instruction addresses the Register File (Read Addr 1, Read Addr 2, Write Addr); Read Data 1 and Read Data 2 feed the ALU, which has overflow and zero outputs and is steered by the ALU control (3-bit code); the ALU result returns to Write Data under RegWrite]
• Since writes to the Register File are edge-triggered, we can legally read and write the same register within a clock cycle: the read gets the value written in an earlier clock cycle, while the value being written becomes available to a read in a subsequent cycle.
• Note that the Register File is not written every cycle (e.g., sw), so we need an explicit write control signal for the Register File.
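Splitting an instruction word into the R-type fields is plain shifting and masking. The sketch below uses the standard MIPS field positions; the function name `decode_r_type` and the sample word (the usual encoding of `add $t0, $s1, $s2`) are chosen here for illustration and do not come from the slides.

```python
def decode_r_type(word: int) -> dict:
    """Split a 32-bit MIPS instruction word into the R-type fields."""
    return {
        "op":    (word >> 26) & 0x3F,  # bits 31-26
        "rs":    (word >> 21) & 0x1F,  # bits 25-21
        "rt":    (word >> 16) & 0x1F,  # bits 20-16
        "rd":    (word >> 11) & 0x1F,  # bits 15-11
        "shamt": (word >> 6) & 0x1F,   # bits 10-6
        "funct": word & 0x3F,          # bits 5-0
    }


# add $t0, $s1, $s2  ($t0 = register 8, $s1 = 17, $s2 = 18)
fields = decode_r_type(0x02324020)
```

Here `fields["rs"]` and `fields["rt"]` would go to Read Addr 1 and Read Addr 2, and `fields["rd"]` to Write Addr.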

Executing Load and Store Operations
• Load and store operations involve
  - computing the memory address by adding the base register (read from the Register File during decode) to the 16-bit sign-extended offset field in the instruction, e.g., sw $s3, 4($t5)
  - store: the value (read from the Register File during decode) is written to the Data Memory
  - load: the value, read from the Data Memory, is written to the Register File
  [Figure: R-type datapath extended with a Sign Extend unit (16 -> 32 bits) feeding the ALU and a Data Memory (Address, Write Data, Read Data) controlled by MemRead and MemWrite; in the example the base register is $t5 and the offset is 4]
• Note that there are separate read and write controls to the memory, only one of which may be asserted on any given clock cycle.
• The memory unit needs a read signal since, unlike the Register File, reading the value of an invalid address can cause problems, as we will see later. (Standard memory chips actually have a write enable signal that is used for writes.)
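The address calculation is small enough to spell out. A minimal sketch, with made-up helper names and an arbitrary example base address:

```python
def sign_extend16(value: int) -> int:
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(base: int, offset16: int) -> int:
    """Base register plus sign-extended 16-bit offset, wrapped to 32 bits."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF


base = 0x10008000                              # hypothetical contents of $t5
addr_pos = effective_address(base, 4)          # offset field  4, as in sw $s3, 4($t5)
addr_neg = effective_address(base, 0xFFFC)     # offset field -4
```

Sign extension is what lets the same 16-bit field reach addresses both above and below the base register.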

Branch Addressing
• Branch instructions specify: opcode, two registers, target address
• Most branch targets are near the branch, forward or backward
  Format:  op (6 bits) | rs (5 bits) | rt (5 bits) | constant or address (16 bits)
• PC-relative addressing
  - Target address = PC + offset × 4
  - PC already incremented by 4 by this time

Executing Branch Operations
• Branch operations involve
  - comparing the operands read from the Register File during decode for equality (zero ALU output)
  - computing the branch target address by adding the updated PC to the 16-bit sign-extended offset field in the instruction, shifted left by 2
• Why << 2? Instructions are word-aligned, so the low two address bits are always 00 (compare the jump address: partition#(4) + addr(26) + 00(2))
  [Figure: the PC + 4 adder output and the sign-extended (16 -> 32), shifted-left-by-2 offset feed a second adder that produces the branch target address; the ALU compares Read Data 1 and Read Data 2, and its zero output goes to the branch control logic]
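As a sketch of the target calculation (the function name and the example addresses are illustrative assumptions):

```python
def branch_target(pc: int, offset16: int) -> int:
    """Branch target = (PC + 4) + (sign-extended 16-bit offset << 2)."""
    offset = offset16 & 0xFFFF
    if offset & 0x8000:            # negative offset: sign-extend
        offset -= 0x10000
    return (pc + 4 + (offset << 2)) & 0xFFFFFFFF


forward = branch_target(0x00400000, 3)        # skip 3 instructions ahead
backward = branch_target(0x00400010, 0xFFFF)  # offset -1: branch to itself
```

The offset counts instructions relative to the instruction after the branch, which is why an offset of -1 lands on the branch itself.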

Chapter 2 — Instructions: Language of the Computer — 34
Jump Addressing
• Jump (j and jal) targets could be anywhere in the text segment
  - encode the full address in the instruction
  Format:  op (6 bits) | address (26 bits)
• (Pseudo)direct jump addressing
  - Target address = PC[31-28] : (address × 4)

Executing Jump Operations
• The jump operation involves
  - replacing the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits
  [Figure: the low 26 bits of the Instruction pass through Shift left 2 to give 28 bits, which are combined with the upper 4 bits of PC + 4 to form the jump address]
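In code, the concatenation looks like this. The function name and the example instruction word (opcode 2 for `j`, aimed at address 0x00400020) are assumptions made for the sketch.

```python
def jump_target(pc: int, instruction: int) -> int:
    """Pseudo-direct target: upper 4 bits of PC + 4, then the 26-bit field << 2."""
    address26 = instruction & 0x03FFFFFF   # Instr[25-0]
    upper4 = (pc + 4) & 0xF0000000         # (PC + 4)[31-28]
    return upper4 | (address26 << 2)


# j 0x00400020 encodes the word address 0x00400020 >> 2 in the low 26 bits.
j_word = (2 << 26) | (0x00400020 >> 2)
target = jump_target(0x00400000, j_word)
```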

Creating a Single Datapath from the Parts
• Assemble the datapath segments and add control lines and multiplexors as needed
• Single cycle design: fetch, decode and execute each instruction in one clock cycle
  - no datapath resource can be used more than once per instruction, so some must be duplicated (e.g., separate Instruction Memory and Data Memory, several adders)
  - multiplexors are needed at the input of shared elements, with control lines to do the selection
  - write signals control writing to the Register File and Data Memory
• Cycle time is determined by the length of the longest path

Multiplexors
2-input, 1-bit selector device (2x1 MUX)
• Here is the definition of the "function" we wish to implement, given as a truth table over S, A, B and output:
  - when S = 0, A is "selected" for output
  - when S = 1, B is "selected" for output
• What is the Boolean expression for a 2x1 MUX?  Output = S • B + S' • A
• How do you implement this using gates?
  [Figure: detailed gate-level description next to the higher-level abstraction, a MUX symbol with data inputs A and B, control signal S and a single output]
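The truth table follows directly from the selection rule and can be generated rather than written out by hand. A short sketch (the name `mux2` is arbitrary):

```python
from itertools import product


def mux2(s: int, a: int, b: int) -> int:
    """2x1 multiplexor in gate form: (S AND B) OR (NOT S AND A)."""
    return (s & b) | ((1 - s) & a)


# All eight rows of the truth table, as (S, A, B, output).
truth_table = [(s, a, b, mux2(s, a, b)) for s, a, b in product((0, 1), repeat=3)]

for row in truth_table:
    print(*row)
```

Each row agrees with the rule "output = A when S = 0, output = B when S = 1", which is a quick way to check the Boolean expression.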

Fetch, R, and Memory Access Portions
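[Figure: combined datapath with the PC, Instruction Memory and PC + 4 adder, the Register File, Sign Extend (16 -> 32), the ALU (ovf and zero outputs, ALU control) and the Data Memory. An ALUSrc multiplexor sits at the second ALU input and a MemtoReg multiplexor at the Register File write data input; RegWrite, MemRead and MemWrite control the state elements.]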

Adding the Control
• Selecting the operations to perform (ALU, Register File and Memory read/write)
• Controlling the flow of data (multiplexor inputs)
  R-type:  op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]
  I-type:  op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]
  J-type:  op [31-26] | target address [25-0]
• Observations
  - the op field is always in bits 31-26
  - the addresses of the registers to be read are always specified by the rs field (bits 25-21) and the rt field (bits 20-16); for lw and sw, rs is the base register
  - the address of the register to be written is in one of two places: in rt (bits 20-16) for lw, in rd (bits 15-11) for R-type instructions
  - the offset for beq, lw, and sw is always in bits 15-0

Control
• The control unit is responsible for setting all the control signals so that each instruction is executed properly.
  - The control unit's input is the 32-bit instruction word.
  - The outputs are values for the blue control signals in the datapath.
• Most of the signals can be generated from the instruction opcode alone, and not the entire 32-bit word.
• To illustrate the relevant control signals, we will show the route that is taken through the datapath by R-type, lw, sw and beq instructions.
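Since the opcode alone decides most signals, the main control unit can be pictured as a lookup table. The values below are the settings commonly given in textbooks for this single-cycle datapath, not values recovered from these slides, so treat them as an assumption; `X` marks a don't-care and `main_control` is an invented name.

```python
X = None  # don't care

# opcode -> control signal settings (assumed standard single-cycle values)
CONTROL = {
    0b000000: dict(name="R-type", RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                   MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),
    0b100011: dict(name="lw", RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                   MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),
    0b101011: dict(name="sw", RegDst=X, ALUSrc=1, MemtoReg=X, RegWrite=0,
                   MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),
    0b000100: dict(name="beq", RegDst=X, ALUSrc=0, MemtoReg=X, RegWrite=0,
                   MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),
}


def main_control(instruction: int) -> dict:
    """Look up the control signals from the opcode field, Instr[31-26]."""
    return CONTROL[(instruction >> 26) & 0x3F]


r_signals = main_control(0x02324020)   # add $t0, $s1, $s2
lw_signals = main_control(0x8D110020)  # lw $s1, 32($t0)
```

Only RegWrite and MemWrite change state, so those are the signals that must never be left as don't-cares.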

ALU Control Unit
[Figure: complete single-cycle datapath. The main Control Unit is driven by Instr[31-26] and produces RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc and RegWrite; PCSrc selects between PC + 4 and the branch target; Instr[25-21] and Instr[20-16] address the Register File reads; Instr[15-0] goes to Sign Extend (16 -> 32); Instr[5-0] goes to the ALU control.]

Bit I/O for ALU Control Unit
[Figure: the same datapath annotated with bit widths: the ALU control block takes the 2-bit ALUOp and Instr[5-0] and produces a 3-bit ALU control code]
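The second-level decoding done by the ALU control block can be sketched as a small function. The 3-bit operation codes and the ALUOp meanings used here are the common textbook assignment and are an assumption, as is the function name.

```python
# Assumed 3-bit ALU control codes
ALU_AND, ALU_OR, ALU_ADD, ALU_SUB, ALU_SLT = 0b000, 0b001, 0b010, 0b110, 0b111

# MIPS funct field values for the R-type instructions covered here
FUNCT_TO_ALU = {
    0x20: ALU_ADD,  # add
    0x22: ALU_SUB,  # sub
    0x24: ALU_AND,  # and
    0x25: ALU_OR,   # or
    0x2A: ALU_SLT,  # slt
}


def alu_control(alu_op: int, funct: int) -> int:
    """Combine the 2-bit ALUOp with Instr[5-0] into a 3-bit ALU control code."""
    if alu_op == 0b00:   # lw / sw: add base and offset
        return ALU_ADD
    if alu_op == 0b01:   # beq: subtract and test the zero output
        return ALU_SUB
    return FUNCT_TO_ALU[funct & 0x3F]  # R-type: the funct field decides


code_for_lw = alu_control(0b00, 0)
code_for_beq = alu_control(0b01, 0)
code_for_slt = alu_control(0b10, 0x2A)
```

Splitting control into two levels keeps the main control unit small: it only needs the opcode, while the funct field is looked at only for R-type instructions.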

R-type Instruction
R-type:  op [31-26] | rs [25-21] | rt [20-16] | rd [15-11] | shamt [10-6] | funct [5-0]

R-type Dataflow
[Figure: single-cycle datapath with the R-type path highlighted: Instr[25-21] (rs) and Instr[20-16] (rt) address the Register File reads, the ALU operates on Read Data 1 and Read Data 2 under the ALU control (from ALUOp and Instr[5-0]), and the result is written back to the register selected by rd]

R-type - Control Lines
[Table: control line values for R-type instructions]

Load Word Instruction Data/Control Flow
[Figure: single-cycle datapath with the load path highlighted for lw $s1, 32($t0): the base register $t0 is read from the Register File, the offset 32 is sign-extended and fed to the ALU, the ALU result addresses the Data Memory, and Read Data is written back to $s1]

lw - Control Lines
[Table: control line values for lw]

Branching
I-type:  op [31-26] | rs [25-21] | rt [20-16] | address offset [15-0]

Branch Instruction Data/Control Flow
[Figure: single-cycle datapath with the branch path highlighted: the two registers addressed by Instr[25-21] and Instr[20-16] are compared in the ALU, the zero output and the Branch signal determine PCSrc, and the branch target is PC + 4 plus the sign-extended Instr[15-0] shifted left by 2]

Main Control Lines
[Table: main control line values per instruction type]

J-type:  op [31-26] | target address [25-0]

Adding the Jump Operation
[Figure: single-cycle datapath extended for jumps: Instr[25-0] (26 bits) is shifted left by 2 to give 28 bits and combined with PC+4[31-28] to form the 32-bit jump address; an additional multiplexor controlled by the new Jump signal selects it as the next PC]

Instruction Times (Critical Paths)
What is the clock cycle time, assuming negligible delays for muxes, control unit, sign extend, PC access, shift left 2, wires, and setup and hold times, except for:
- Instruction Memory (I Mem) and Data Memory (D Mem): 200 ps
- ALU and adders: 200 ps
- Register File access (reads or writes): 100 ps

Instr.   I Mem   Reg Rd   ALU Op   D Mem   Reg Wr   Total (ps)
R-type   200     100      200              100      600
load     200     100      200      200     100      800
store    200     100      200      200              700
beq      200     100      200                       500
jump     200                                        200

Note that the PC is updated during the I Mem read.
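The totals are just sums of the unit delays along each instruction's path, so they are easy to recompute. In this sketch the delays are the ones stated above; which units each instruction type passes through is written out explicitly as an assumption about the datapath.

```python
DELAY_PS = {"I Mem": 200, "Reg Rd": 100, "ALU Op": 200, "D Mem": 200, "Reg Wr": 100}

# Functional units on the critical path of each instruction type (assumed)
UNITS_USED = {
    "R-type": ["I Mem", "Reg Rd", "ALU Op", "Reg Wr"],
    "load":   ["I Mem", "Reg Rd", "ALU Op", "D Mem", "Reg Wr"],
    "store":  ["I Mem", "Reg Rd", "ALU Op", "D Mem"],
    "beq":    ["I Mem", "Reg Rd", "ALU Op"],
    "jump":   ["I Mem"],
}

totals = {
    instr: sum(DELAY_PS[unit] for unit in units)
    for instr, units in UNITS_USED.items()
}

# A single-cycle design must use the slowest instruction as its cycle time.
cycle_time_ps = max(totals.values())
```

The load is the only instruction that uses all five units, so it sets the clock cycle for every instruction.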

Single Cycle Disadvantages & Advantages
• Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction
  - especially problematic for more complex instructions like floating point multiply
  [Figure: clock waveform with lw filling Cycle 1 (800 ps) and sw followed by wasted time in Cycle 2]
• May be wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared during a clock cycle
• But it is simple and easy to understand
• In the single cycle implementation, the cycle time is set to accommodate the longest instruction, the load instruction. Since the cycle time has to be long enough for the load instruction, it is too long for the store instruction, so the last part of that cycle is wasted.

How Can We Make It Faster?
• Start fetching and executing the next instruction before the current one has completed
  - Pipelining: (all?) modern processors are pipelined for performance
  - Remember the performance equation: CPU time = CPI * CC * IC
  - Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages
  - A five stage pipeline is nearly five times faster because the clock cycle (CC) is nearly five times shorter
  - In reality the time per instruction in a pipelined processor is longer than the minimum possible because 1) the pipeline stages may not be perfectly balanced and 2) pipelining involves some overhead (like pipeline stage isolation registers)
• Fetch (and execute) more than one instruction at a time
  - Superscalar processing: stay tuned
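A back-of-the-envelope calculation shows both the ideal case and the effect of unbalanced stages. The numbers are assumptions for illustration: an 800 ps single-cycle clock (the load critical path), five stages, and a 200 ps pipelined clock set by the slowest stage; pipeline register overhead and hazards are ignored.

```python
def single_cycle_time(n_instr: int, cycle_ps: int) -> int:
    """CPU time with CPI = 1 and one long clock cycle per instruction."""
    return n_instr * cycle_ps


def pipelined_time(n_instr: int, stages: int, stage_ps: int) -> int:
    """Ideal pipeline: fill the pipe once, then finish one instruction per cycle."""
    return (n_instr + stages - 1) * stage_ps


n = 1_000_000
speedup = single_cycle_time(n, 800) / pipelined_time(n, 5, 200)

# With only three instructions the fill time dominates.
short_run = single_cycle_time(3, 800) / pipelined_time(3, 5, 200)
```

With these numbers the speedup approaches 4 rather than 5: the 100 ps register stages are stretched to the 200 ps clock, which is the "stages not perfectly balanced" effect.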

Pipelining Analogy
• Pipelined laundry: overlapping execution
  - parallelism improves performance
• Four loads:
  - Sequential = 8 hrs
  - Pipelined = 3.5 hrs
  - Speedup = 8/3.5 = 2.3

Pipeline Control
• IF Stage: read Instr Memory (always asserted) and write PC (on System Clock)
• ID Stage: no optional control signals to set
• Control signals set in the remaining stages:
  - EX Stage: RegDst, ALUOp1, ALUOp0, ALUSrc
  - MEM Stage: Brch, MemRead, MemWrite
  - WB Stage: RegWrite, MemtoReg
  [Table: values of these signals for R-type, lw, sw and beq]

• The output from the ALU of the 1st instruction has to be forwarded to Instr#2 and Instr#3
• Since only one output (from the ALU) can be forwarded, it is sent to Instr#2
• Instr#3's ALU must wait until the next stage to get the output of the 1st instruction's ALU
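One common way to express the forwarding decision for a single ALU input is sketched below. The slides do not spell out the conditions, so this follows the usual textbook formulation; the pipeline register names EX/MEM and MEM/WB, the function name and its parameters are all assumptions.

```python
def forward_select(src_reg: int,
                   ex_mem_rd: int, ex_mem_reg_write: bool,
                   mem_wb_rd: int, mem_wb_reg_write: bool) -> str:
    """Pick the source for one ALU operand: 'EX/MEM', 'MEM/WB' or 'REGFILE'."""
    # The most recent producer (one instruction ahead) takes priority.
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"
    # Otherwise use the producer two instructions ahead.
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"
    # No dependence: use the value read from the Register File.
    return "REGFILE"


T0, T1 = 8, 9  # register numbers of $t0 and $t1

one_away = forward_select(T0, ex_mem_rd=T0, ex_mem_reg_write=True,
                          mem_wb_rd=T1, mem_wb_reg_write=True)
two_away = forward_select(T1, ex_mem_rd=T0, ex_mem_reg_write=True,
                          mem_wb_rd=T1, mem_wb_reg_write=True)
no_hazard = forward_select(T1, ex_mem_rd=T0, ex_mem_reg_write=True,
                           mem_wb_rd=0, mem_wb_reg_write=False)
```

The check against register 0 is there because $zero is never really written, so a "result" destined for it must not be forwarded.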

[Figure: pipeline diagram over the stages IF, ID, EX, MEM, WB]

Why 13 cycles?
[Table: cycles per instruction for Instr 1, 2 (stall), 3, 4, 5, 6, 7]

Forwarding Unit

Forwarding Unit Inputs

Forwarding Example
[Figure: pipeline diagram of dependent add, add, or and sub instructions using $t0 and $t1; if the consumer is one instruction away, the value can be forwarded at the MEM stage]

Hazard Unit
