Processor Design & Implementation


Review: MIPS (RISC) Design Principles. Simplicity favors regularity: fixed-size instructions; a small number of instruction formats; opcode always in the first 6 bits. Smaller is faster: limited instruction set; limited number of registers in the register file; limited number of addressing modes. Make the common case fast: arithmetic operands come from the register file (load-store machine); instructions may contain immediate operands. Good design demands good compromises: three instruction formats.

Sequential vs Combinational Circuits. Combinational logic: the circuit's output is a function of the present value of the inputs only. When the inputs change, the information about the previous inputs is lost (memoryless), e.g., multiplexors. Sequential logic: the circuit's outputs also depend on past inputs (has memory), e.g., flip flops and latches. Sequential circuits are basically combinational circuits with the additional properties of storage (to remember past inputs) and feedback.

RS Latches. An RS latch is a memory element with 2 inputs, Reset (R) and Set (S), and 2 outputs, Q and its complement Q'. Note: if the inputs don't change, the outputs are held indefinitely.
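The hold behaviour of an RS latch can be sketched in a few lines of Python. This is a minimal illustration that assumes the common cross-coupled NOR implementation, which the slide itself does not specify; the names `nor` and `rs_latch` are made up for this sketch.

```python
def nor(a: int, b: int) -> int:
    """Two-input NOR gate on single bits."""
    return 0 if (a or b) else 1


def rs_latch(r: int, s: int, q: int, q_bar: int) -> tuple[int, int]:
    """Settle a cross-coupled NOR latch (assumed implementation).

    q and q_bar are the current outputs; the gates are re-evaluated
    until the outputs stop changing. R = S = 1 is not a valid input.
    """
    for _ in range(4):  # a few passes are enough for valid inputs
        new_q = nor(r, q_bar)
        new_q_bar = nor(s, new_q)
        if (new_q, new_q_bar) == (q, q_bar):
            break
        q, q_bar = new_q, new_q_bar
    return q, q_bar


# Set, then hold, then reset, then hold again.
state = rs_latch(r=0, s=1, q=0, q_bar=1)   # set
state = rs_latch(0, 0, *state)             # hold: outputs unchanged
```

With R = S = 0 the function returns the state it was given, which is the "held indefinitely" property noted on the slide.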

RS Latch State Transition Diagram

Clocks and Synchronous Circuits. Asynchronous operation: the output state of an RS latch changes directly in response to changes in its inputs. Virtually all sequential circuits currently employ synchronous operation: the output of a sequential circuit is constrained to change only at a time specified by a global enabling signal, generally known as the system clock.

Transparent D Latches. Modify the RS latch so that its output state is only permitted to change when a valid enable signal (system clock) is present. To do this, add a pair of AND gates in cascade with the R and S inputs, controlled by an additional input known as the enable (EN) input.

Master-Slave Flip Flops. Sequential circuits are easy to design if outputs change only on the rising (positive trending) or falling (negative trending) edges of a clock (i.e., enable) signal. This can be done by combining two transparent D latches in a master-slave configuration.
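A behavioural sketch of the master-slave idea, assuming the master latch is transparent while the clock is low and the slave while it is high (which gives a rising-edge flip flop). The class names are hypothetical and the model ignores gate delays.

```python
class DLatch:
    """Transparent D latch: output follows d while enabled, else holds."""

    def __init__(self) -> None:
        self.q = 0

    def update(self, d: int, en: int) -> int:
        if en:
            self.q = d
        return self.q


class MasterSlaveFlipFlop:
    """Two D latches enabled on opposite clock levels (assumed polarity)."""

    def __init__(self) -> None:
        self.master = DLatch()
        self.slave = DLatch()

    def step(self, d: int, clk: int) -> int:
        # Master samples d while clk is low; slave copies master while clk is high.
        m = self.master.update(d, en=0 if clk else 1)
        return self.slave.update(m, en=clk)


ff = MasterSlaveFlipFlop()
ff.step(d=1, clk=0)        # master captures 1, output still 0
q = ff.step(d=1, clk=1)    # rising edge: output becomes 1
```

Changing `d` while the clock is high has no effect on the output, because the master latch is closed during that phase.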

The Processor: Datapath & Control. Our implementation of MIPS is simplified to a subset: memory-reference instructions (lw, sw), arithmetic-logical instructions (add, sub, and, or, slt), and control flow instructions (beq, j). Generic implementation: use the program counter (PC) to supply the instruction address and fetch the instruction from memory (and update the PC, PC = PC + 4); decode the instruction (and read registers); execute the instruction. All instructions (except j) use the ALU after reading the registers. How? Memory-reference instructions use the ALU to compute addresses, arithmetic instructions use the ALU to do the required arithmetic, and control flow instructions use the ALU to compute branch conditions.

Aside: Clocking Methodologies. The clocking methodology defines when data in a state element is valid and stable relative to the clock. State elements are memory elements such as the instruction memory, data memory, and registers. Edge-triggered: all state changes occur on a clock edge. Typical execution within one clock cycle: read the contents of state elements -> send the values through combinational logic -> write the results to one or more state elements. With edge-triggered state elements there is no worry about feedback within a single clock cycle; it is a single-sided clock constraint (we only have to make sure the clock cycle is long enough). This assumes state elements are written on every clock cycle; if not, an explicit write control signal is needed, and the write occurs only when both the write control is asserted and the clock edge occurs. [Figure: State element 1 -> Combinational logic -> State element 2, both state elements driven by the same clock, over one clock cycle.]

Fetching Instructions. Fetching instructions involves reading the instruction from the Instruction Memory and updating the PC value to be the address of the next (sequential) instruction, PC = PC + 4. The PC is updated every clock cycle, so it does not need an explicit write control signal, just a clock signal. Reading from the Instruction Memory is a combinational activity, so it doesn't need an explicit read control signal. [Figure: the PC drives the Read Address of the Instruction Memory and an adder that adds 4.]
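The fetch step can be sketched as a pure function. This assumes a word-addressed Python list stands in for the Instruction Memory; `fetch` and `WORD_BYTES` are illustrative names, and the two instruction words are sample encodings.

```python
WORD_BYTES = 4  # MIPS instructions are 32 bits wide


def fetch(instruction_memory: list[int], pc: int) -> tuple[int, int]:
    """Return the instruction at byte address pc and the next sequential PC."""
    instruction = instruction_memory[pc // WORD_BYTES]  # combinational read
    next_pc = (pc + WORD_BYTES) & 0xFFFFFFFF            # PC = PC + 4, 32-bit wrap
    return instruction, next_pc


memory = [0x02324020, 0x8D110020]  # add $t0,$s1,$s2 ; lw $s1,32($t0)
instr, pc = fetch(memory, 0)
```

There is no write-enable argument for the PC, mirroring the point that it is updated on every clock cycle.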

Decoding Instructions. Decoding instructions involves sending the fetched instruction's opcode and function field bits to the control unit and reading two values from the Register File. The Register File addresses are contained in the instruction. [Figure: the instruction feeds the Control Unit and the Read Addr 1, Read Addr 2, and Write Addr ports of the Register File, which outputs Read Data 1 and Read Data 2.]

Executing R Format Operations. R format operations (add, sub, slt, and, or) perform the operation (given by op and funct) on the values in rs and rt and store the result back into the Register File (into location rd). R-type format: op (bits 31-26), rs (25-21), rt (20-16), rd (15-11), shamt (10-6), funct (5-0). Since writes to the register file are edge-triggered, we can legally read and write the same register within a clock cycle: the read gets the value written in an earlier clock cycle, while the value written becomes available to a read in a subsequent cycle. Note that the Register File is not written every cycle (e.g., sw), so we need an explicit write control signal (RegWrite) for the Register File. [Figure: the Register File read outputs feed the ALU, which is steered by a 3-bit ALU control code and produces the result plus overflow and zero outputs; the result returns to the Register File Write Data port.]
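As a rough sketch of R-format execution, the snippet below splits a 32-bit word into its fields and applies the operation selected by funct. The funct values are the standard MIPS encodings; the function names and the list-based register file are assumptions made for illustration.

```python
# Standard MIPS funct codes for the five R-format operations on the slide.
FUNCT = {0x20: "add", 0x22: "sub", 0x24: "and", 0x25: "or", 0x2A: "slt"}


def to_signed32(x: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement integer."""
    x &= 0xFFFFFFFF
    return x - (1 << 32) if x & 0x80000000 else x


def decode_r_type(instr: int) -> dict[str, int]:
    """Split an instruction word into the R-type fields."""
    return {
        "op": (instr >> 26) & 0x3F,
        "rs": (instr >> 21) & 0x1F,
        "rt": (instr >> 16) & 0x1F,
        "rd": (instr >> 11) & 0x1F,
        "shamt": (instr >> 6) & 0x1F,
        "funct": instr & 0x3F,
    }


def execute_r_type(instr: int, regs: list[int]) -> None:
    """Read rs and rt, run the ALU operation, write the result to rd."""
    f = decode_r_type(instr)
    a, b = regs[f["rs"]], regs[f["rt"]]
    operation = FUNCT[f["funct"]]
    if operation == "add":
        result = a + b
    elif operation == "sub":
        result = a - b
    elif operation == "and":
        result = a & b
    elif operation == "or":
        result = a | b
    else:  # slt: signed comparison
        result = 1 if to_signed32(a) < to_signed32(b) else 0
    if f["rd"] != 0:  # register $zero is hard-wired to 0
        regs[f["rd"]] = result & 0xFFFFFFFF


regs = [0] * 32
regs[17], regs[18] = 5, 7          # $s1 = 5, $s2 = 7
execute_r_type(0x02324020, regs)   # add $t0, $s1, $s2
```

Overflow detection is left out of this sketch; the datapath on the slide exposes it as a separate ALU output.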

Executing Load and Store Operations. Load and store operations involve computing the memory address by adding the base register (read from the Register File during decode) to the 16-bit sign-extended offset field in the instruction, e.g., sw $s3, 4($t5) adds 4 to the contents of $t5. For a store, the value (read from the Register File during decode) is written to the Data Memory; for a load, the value read from the Data Memory is written to the Register File. Note that there are separate read and write controls to the memory (MemRead and MemWrite), only one of which may be asserted on any given clock cycle. The memory unit needs a read signal since, unlike the register file, reading the value of an invalid address can cause problems, as we will see later. (Standard memory chips actually have a write enable signal that is used for writes.) [Figure: the 16-bit offset passes through Sign Extend to 32 bits and is added to the base register by the ALU; the ALU result is the Data Memory address.]
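The address calculation for lw and sw can be sketched as follows. The helper names are illustrative, and the base address used in the usage line is an arbitrary sample value.

```python
def sign_extend16(value: int) -> int:
    """Sign-extend a 16-bit field to a Python integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def effective_address(base: int, offset16: int) -> int:
    """Base register plus sign-extended offset, wrapped to 32 bits."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF


# sw $s3, 4($t5) with $t5 holding 0x1000 (sample value)
address = effective_address(0x1000, 4)
```

A negative offset such as 0xFFFC (that is, -4) yields an address below the base, which is why the offset must be sign-extended rather than zero-extended.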

Branch Addressing. Branch instructions specify an opcode, two registers, and a target address. Most branch targets are near the branch, forward or backward. Format: op (6 bits), rs (5 bits), rt (5 bits), constant or address (16 bits). PC-relative addressing: Target address = PC + offset × 4, where the PC has already been incremented by 4 by this time.

Executing Branch Operations. Branch operations involve comparing the operands read from the Register File during decode for equality (zero ALU output) and computing the branch target address by adding the updated PC to the 16-bit sign-extended offset field in the instruction, shifted left by 2. Why << 2? The offset counts words and instructions are 4 bytes long, so the two low-order bits of an instruction address are always 00 (compare the jump address: partition#(4) + addr(26) + 00(2)). [Figure: one adder computes PC + 4; a second adder adds the sign-extended, shifted offset to produce the branch target address; the ALU zero output goes to the branch control logic.]
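A small sketch of the beq next-PC calculation under the same assumptions as the slide (the PC is incremented by 4 before the offset is added). Function names are hypothetical.

```python
def sign_extend16(value: int) -> int:
    """Sign-extend a 16-bit field to a Python integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc: int, offset16: int) -> int:
    """(PC + 4) + (sign-extended offset << 2), wrapped to 32 bits."""
    return (pc + 4 + (sign_extend16(offset16) << 2)) & 0xFFFFFFFF


def next_pc_beq(pc: int, rs_value: int, rt_value: int, offset16: int) -> int:
    """Take the branch only when the ALU subtraction yields zero."""
    zero = ((rs_value - rt_value) & 0xFFFFFFFF) == 0
    return branch_target(pc, offset16) if zero else (pc + 4) & 0xFFFFFFFF


taken = next_pc_beq(0x00400000, 5, 5, 3)      # operands equal
not_taken = next_pc_beq(0x00400000, 5, 6, 3)  # operands differ
```

An offset of -1 (0xFFFF) branches back to the beq instruction itself, since the offset is relative to PC + 4.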

Jump Addressing (from Chapter 2, Instructions: Language of the Computer). Jump (j and jal) targets could be anywhere in the text segment, so the full address is encoded in the instruction. Format: op (6 bits), address (26 bits). (Pseudo)direct jump addressing: Target address = PC[31..28] : (address × 4).

Executing Jump Operations. The jump operation involves replacing the lower 28 bits of the PC with the lower 26 bits of the fetched instruction shifted left by 2 bits. [Figure: Instr[25-0] (26 bits) passes through Shift left 2 to give 28 bits, which are combined with the upper bits of the PC to form the jump address.]
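The jump target can be sketched as a bit concatenation. This version takes the upper four bits from PC + 4, as in the later "Adding the Jump Operation" datapath; the function name and sample addresses are illustrative.

```python
def jump_target(pc: int, instr: int) -> int:
    """Concatenate the upper 4 bits of PC + 4 with the 26-bit field shifted left by 2."""
    address26 = instr & 0x03FFFFFF        # Instr[25-0]
    upper4 = (pc + 4) & 0xF0000000        # (PC + 4)[31-28]
    return upper4 | (address26 << 2)      # low two bits are always 00


# j with a 26-bit address field of 0x0100004, fetched from PC 0x00400000
target = jump_target(0x00400000, 0x08100004)
```

Because only 28 bits come from the instruction, a jump cannot leave the 256 MB region selected by the upper four PC bits.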

Creating a Single Datapath from the Parts. Assemble the datapath segments and add control lines and multiplexors as needed. Single cycle design: fetch, decode and execute each instruction in one clock cycle. No datapath resource can be used more than once per instruction, so some must be duplicated (e.g., separate Instruction Memory and Data Memory, several adders). Multiplexors are needed at the input of shared elements, with control lines to do the selection. Write signals control writing to the Register File and Data Memory. Cycle time is determined by the length of the longest path.

Multiplexors. A 2-input, 1-bit selector device (2x1 MUX). Here is a truth-table definition of the function we wish to implement: when S = 0, A is selected for output; when S = 1, B is selected for output.

2x1 MUX (Multiplexor). What is the Boolean expression for a 2x1 MUX? Output = S • B + S' • A, where S' is the complement of S. How do you implement this using gates? [Figure: a detailed gate-level description next to the higher-level abstraction, a block with data inputs A and B, control signal S, and one output.]
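The Boolean expression can be checked against the selection rule with a short sketch; `mux2` is an illustrative name and all signals are single bits.

```python
from itertools import product


def mux2(s: int, a: int, b: int) -> int:
    """2x1 multiplexor on single bits: Output = S.B + S'.A"""
    not_s = s ^ 1
    return (s & b) | (not_s & a)


# Enumerate the full truth table: (S, A, B, output)
truth_table = [(s, a, b, mux2(s, a, b)) for s, a, b in product((0, 1), repeat=3)]
```

Each row of `truth_table` agrees with "A when S = 0, B when S = 1", which corresponds to one inverter, two AND gates, and one OR gate.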

Fetch, R, and Memory Access Portions [Figure: combined datapath for instruction fetch, R-type, and memory-access instructions, with multiplexors controlled by ALUSrc and MemtoReg and the RegWrite, MemRead, and MemWrite control signals.]

Adding the Control. Control involves selecting the operations to perform (ALU, Register File and Memory read/write) and controlling the flow of data (multiplexor inputs). Instruction formats: R-type: op (bits 31-26), rs (25-21), rt (20-16), rd (15-11), shamt (10-6), funct (5-0). I-type: op (31-26), rs (25-21), rt (20-16), address offset (15-0). J-type: op (31-26), target address (25-0). Observations: the op field is always in bits 31-26; the addresses of the registers to be read are always specified by the rs field (bits 25-21) and rt field (bits 20-16); for lw and sw, rs is the base register; the address of the register to be written is in one of two places, rt (bits 20-16) for lw or rd (bits 15-11) for R-type instructions; the offset for beq, lw, and sw is always in bits 15-0.
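The observation about the write register can be sketched as a selection between rt and rd, which is the job of the RegDst multiplexor. The opcode values are the standard MIPS encodings; the function names are assumptions for this sketch.

```python
OP_RTYPE = 0x00  # standard MIPS opcode for R-type instructions
OP_LW = 0x23     # standard MIPS opcode for lw


def fields(instr: int) -> dict[str, int]:
    """Extract the fields that sit at fixed bit positions."""
    return {
        "op": (instr >> 26) & 0x3F,   # bits 31-26
        "rs": (instr >> 21) & 0x1F,   # bits 25-21
        "rt": (instr >> 16) & 0x1F,   # bits 20-16
        "rd": (instr >> 11) & 0x1F,   # bits 15-11
        "offset": instr & 0xFFFF,     # bits 15-0
    }


def write_register(instr: int):
    """Return the register number to be written, or None if none is written."""
    f = fields(instr)
    if f["op"] == OP_LW:
        return f["rt"]
    if f["op"] == OP_RTYPE:
        return f["rd"]
    return None  # sw, beq, and j do not write the Register File


dest_lw = write_register(0x8D110020)   # lw $s1, 32($t0)
dest_add = write_register(0x02324020)  # add $t0, $s1, $s2
```

Because the read-register fields never move, the Register File can be read before the control unit has finished decoding the opcode.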

Control. The control unit is responsible for setting all the control signals so that each instruction is executed properly. The control unit's input is the 32-bit instruction word; its outputs are values for the blue control signals in the datapath. Most of the signals can be generated from the instruction opcode alone, and not the entire 32-bit word. To illustrate the relevant control signals, we will show the route that is taken through the datapath by R-type, lw, sw and beq instructions.
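A main control unit that depends only on the opcode can be sketched as a lookup table. The signal values below follow the usual textbook assignment for this single-cycle datapath and are an assumption here, since the slide tables did not survive in this transcript; `None` marks a don't-care.

```python
# Opcode -> control signals (assumed textbook values; None = don't care).
CONTROL = {
    0x00: dict(RegDst=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
               MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),       # R-type
    0x23: dict(RegDst=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
               MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),       # lw
    0x2B: dict(RegDst=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
               MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),       # sw
    0x04: dict(RegDst=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
               MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),       # beq
}


def main_control(instr: int) -> dict:
    """Look up the control signals using only Instr[31-26]."""
    return CONTROL[(instr >> 26) & 0x3F]


signals = main_control(0x8D110020)  # lw $s1, 32($t0)
```

Only the R-type entry needs the funct field as well, which is why the ALU control is a separate block fed by ALUOp and Instr[5-0].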

ALU Control Unit [Figure: single-cycle datapath with the main Control Unit (input Instr[31-26]; outputs RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, RegWrite) and the ALU control block (inputs ALUOp and Instr[5-0]).]

Bit I/O for ALU Control Unit [Figure: the same datapath annotated with signal widths: 2-bit ALUOp and 6-bit Instr[5-0] into the ALU control block, 3-bit ALU control code out.]

R-type Instruction. R-type format: op (bits 31-26), rs (25-21), rt (20-16), rd (15-11), shamt (10-6), funct (5-0).

R-type Dataflow [Figure: single-cycle datapath with the paths used by an R-type instruction marked: Instr[25-21] (rs) and Instr[20-16] (rt) select the registers read, and the ALU result is written to Instr[15-11] (rd).]

R type - Control Lines [Table: control line settings for an R-type instruction.]

Load Word Instruction Data/Control Flow [Figure: single-cycle datapath with the paths used by lw $s1, 32($t0) marked: the base register $t0 is read from the Register File, the offset 32 from Instr[15-0] is sign-extended, the ALU result addresses the Data Memory, and the data read is written to $s1.]

lw - Control Lines [Table: control line settings for lw.]

Branching. I-type format: op (bits 31-26), rs (25-21), rt (20-16), address offset (15-0).

Branch Instruction Data/Control Flow [Figure: single-cycle datapath with the paths used by a branch instruction marked.]

Main Control Lines [Table: settings of the main control lines.]

J-type format: op (bits 31-26), target address (bits 25-0).

Adding the Jump Operation [Figure: the single-cycle datapath extended with a Jump control signal and an extra multiplexor; Instr[25-0] (26 bits) is shifted left by 2 to 28 bits and combined with PC+4[31-28] to form the 32-bit jump address.]

Instruction Times (Critical Paths). What is the clock cycle time, assuming negligible delays for muxes, control unit, sign extend, PC access, shift left 2, wires, setup and hold times, except: Instruction Memory (IM) and Data Memory (DM) (200 ps), ALU and adders (200 ps), Register File access (reads or writes) (100 ps)? All times below are in ps. Note that the PC is updated during the I Mem read.

Instr.   I Mem  Reg Rd  ALU Op  D Mem  Reg Wr  Total
R-type   200    100     200            100     600
load     200    100     200     200    100     800
store    200    100     200     200            700
beq      200    100     200                    500
jump     200                                   200
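The totals follow from summing the unit delays along each instruction's path, and the single-cycle clock period is the largest total. A minimal sketch using the delays stated above; the dictionary and function names are illustrative.

```python
# Unit delays in picoseconds, as given on the slide.
DELAY_PS = {"IM": 200, "RegRead": 100, "ALU": 200, "DM": 200, "RegWrite": 100}

# Units on each instruction class's critical path.
PATHS = {
    "R-type": ["IM", "RegRead", "ALU", "RegWrite"],
    "load": ["IM", "RegRead", "ALU", "DM", "RegWrite"],
    "store": ["IM", "RegRead", "ALU", "DM"],
    "beq": ["IM", "RegRead", "ALU"],
    "jump": ["IM"],
}


def critical_path_ps(instruction_class: str) -> int:
    """Sum the delays of the units an instruction class passes through."""
    return sum(DELAY_PS[unit] for unit in PATHS[instruction_class])


def clock_cycle_ps() -> int:
    """The single-cycle clock must cover the slowest instruction."""
    return max(critical_path_ps(name) for name in PATHS)


totals = {name: critical_path_ps(name) for name in PATHS}
```

The load instruction sets the cycle time because it is the only class that uses all five units.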

Single Cycle Disadvantages & Advantages. Uses the clock cycle inefficiently: the clock cycle must be timed to accommodate the slowest instruction, which is especially problematic for more complex instructions like floating point multiply. May be wasteful of area, since some functional units (e.g., adders) must be duplicated because they cannot be shared during a clock cycle. But it is simple and easy to understand. In the single-cycle implementation, the cycle time is set to accommodate the longest instruction, the load instruction (800 ps). Since the cycle time has to be long enough for the load, it is too long for the store (700 ps), so the last part of the store's cycle is wasted. [Figure: clock waveform with lw filling Cycle 1 and sw followed by wasted time in Cycle 2.]

How Can We Make It Faster? Start fetching and executing the next instruction before the current one has completed. Pipelining: (all?) modern processors are pipelined for performance. Remember the performance equation: CPU time = CPI * CC * IC. Under ideal conditions and with a large number of instructions, the speedup from pipelining is approximately equal to the number of pipe stages; a five-stage pipeline is nearly five times faster because the CC is nearly five times shorter. In reality, the time per instruction in a pipelined processor is longer than the minimum possible because 1) the pipeline stages may not be perfectly balanced and 2) pipelining involves some overhead (like pipeline stage isolation registers). Alternatively, fetch (and execute) more than one instruction at a time: superscalar processing (stay tuned).
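The performance equation and the effect of unbalanced stages can be sketched numerically. The stage delays reuse the 200/100/200/200/100 ps unit delays from the critical-path slide; mapping them onto five pipeline stages, and ignoring hazards and register overhead, are assumptions of this sketch.

```python
def cpu_time_ps(instruction_count: int, cpi: float, clock_cycle_ps: float) -> float:
    """CPU time = IC * CPI * CC."""
    return instruction_count * cpi * clock_cycle_ps


def pipelined_time_ps(instruction_count: int, stage_delays_ps: list[int]) -> int:
    """Ideal pipeline: the clock equals the slowest stage, one instruction completes per cycle."""
    cycle = max(stage_delays_ps)
    cycles = len(stage_delays_ps) + instruction_count - 1  # fill, then one per cycle
    return cycles * cycle


def speedup(instruction_count: int, stage_delays_ps: list[int]) -> float:
    """Single-cycle time (CPI = 1, CC = sum of stages) over pipelined time."""
    single = cpu_time_ps(instruction_count, 1, sum(stage_delays_ps))
    return single / pipelined_time_ps(instruction_count, stage_delays_ps)


unbalanced = speedup(1_000_000, [200, 100, 200, 200, 100])  # approaches 4
balanced = speedup(1_000_000, [200, 200, 200, 200, 200])    # approaches 5
```

With the unbalanced stages the speedup approaches 800/200 = 4 rather than 5, which illustrates the first reason given on the slide.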

Pipelining Analogy. Pipelined laundry: overlapping execution. Parallelism improves performance. Four loads: sequential = 8 hrs, pipelined = 3.5 hrs, speedup = 8/3.5 = 2.3.
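The laundry numbers can be reproduced if each load passes through four half-hour steps. That breakdown is an assumption (the slide gives only the totals), and the function names are illustrative.

```python
def sequential_hours(loads: int, stage_hours: list[float]) -> float:
    """Each load finishes completely before the next one starts."""
    return loads * sum(stage_hours)


def pipelined_hours(loads: int, stage_hours: list[float]) -> float:
    """The first load takes the full time; each later load finishes one stage time later."""
    return sum(stage_hours) + (loads - 1) * max(stage_hours)


stages = [0.5, 0.5, 0.5, 0.5]  # assumed: four steps of half an hour each
seq = sequential_hours(4, stages)
pipe = pipelined_hours(4, stages)
ratio = seq / pipe
```

As the number of loads grows, the ratio approaches the number of stages, because the start-up time of the first load matters less and less.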

Pipeline Control. IF stage: read Instr Memory (always asserted) and write PC (on System Clock). ID stage: no optional control signals to set. EX stage control signals: RegDst, ALUOp1, ALUOp0, ALUSrc. MEM stage control signals: Branch, MemRead, MemWrite. WB stage control signals: RegWrite, MemtoReg. [Table: values of these signals for R-type, lw, sw, and beq; X marks a don't-care.]


The output from the ALU of the 1st instruction has to be forwarded to instructions #2 and #3. Since only one output (from the ALU) can be forwarded, it is sent to instruction #2; instruction #3's ALU must wait until the next stage to get the output of the 1st instruction's ALU.

[Figure: pipeline stages IF, ID, EX, MEM, WB.]

Why?

Why 13 cycles? [Figure: pipeline timing diagram over cycles 1 to 13 for instructions 1 to 7, with a stall after instruction 2.]

Forwarding Unit

Forwarding Unit Inputs

Forwarding Example [Figure: forwarding paths for registers $t0 and $t1.]

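[Figure: forwarding example with the instructions add $t0, add $t0, or $t1, and a sub that reads $t0, $t0.] If the producing instruction is one instruction away, its result can be forwarded at the MEM stage.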

Hazard Unit