ECM534 Advanced Computer Architecture Lecture 5. MIPS Processor Design


Single-cycle MIPS #1
Prof. Taeweon Suh
Computer Science Education
Korea University

Introduction
- Microarchitecture is the lower-level structure that actually executes instructions
- A single architecture can have multiple implementations
- Single-cycle
  - Each instruction is executed in a single cycle
  - It suffers from a long critical-path delay, which limits the clock frequency
- Multi-cycle
  - Each instruction is broken up into a series of shorter steps
  - Different instructions use different numbers of steps, so simpler instructions complete faster than more complex ones
- Pipeline (5-stage)
  - Each instruction is broken up into a series of steps
  - All instructions use the same number of steps
  - Multiple instructions (up to 5) are executed simultaneously

Revisiting Performance
- CPU Time = # insts × CPI × clock cycle time (T) = # insts × CPI / f
- Performance depends on
  - Algorithm: affects the instruction count
  - Programming language: affects the instruction count and CPI
  - Compiler: affects the instruction count and CPI
  - Instruction set architecture: affects the instruction count, CPI, and T (f)
  - Microarchitecture (hardware implementation): affects CPI and T (f)
  - Semiconductor technology: affects T (f)
- The challenge in designing a microarchitecture is to satisfy the constraints of cost, power, and performance
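As a worked example of the CPU time formula, assume a hypothetical program of 10^9 instructions running on a single-cycle design (CPI = 1) clocked at 50 MHz. These numbers are illustrative assumptions, not measurements from the lecture.

```latex
\begin{align*}
T &= \frac{1}{f} = \frac{1}{50 \times 10^{6}\,\text{Hz}} = 20\,\text{ns} \\
\text{CPU Time} &= \#\text{insts} \times \text{CPI} \times T
  = 10^{9} \times 1 \times 20\,\text{ns} = 20\,\text{s}
\end{align*}
```

With the same instruction count and CPI, doubling f to 100 MHz would halve the CPU time to 10 s.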

Revisiting Logic Design Basics
- Combinational logic
  - Output is determined directly by the current input
- Sequential logic
  - Output is determined not only by the current input but also by internal state (i.e., previous inputs)
  - Sequential logic needs state elements to store information
  - Flip-flops and latches are used to store the state information; however, avoid using latches in digital design
[Figure: combinational building blocks: AND gate (A, B, Y), adder (A, B, Y), multiplexer (I0, I1, select S, Y), and ALU (A, B, function F, Y)]

Revisiting State Element
- Registers (implemented with flip-flops) store data in a circuit
  - The clock signal determines when the stored value is updated
    - Rising-edge triggered: update when the clock changes from 0 to 1
    - Falling-edge triggered: update when the clock changes from 1 to 0
  - The data input determines the value (0 or 1) that the output is updated to
- Register with write control
  - Updates on a clock edge only when the write control input is 1
[Figure: D flip-flop with inputs D and Clk and output Q; register with write control with inputs D, Write, and Clk and output Q]
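The register with write control can be sketched in Verilog as follows. This is a minimal sketch, not code from the lecture: the module name reg_we, the port names, and the testbench are assumptions made for illustration.

```verilog
// Hypothetical module: rising-edge-triggered register with write control.
module reg_we #(parameter WIDTH = 32)
               (input                  clk,
                input                  write,
                input      [WIDTH-1:0] d,
                output reg [WIDTH-1:0] q);
  // q changes only on a rising clock edge, and only when write is 1
  always @(posedge clk)
    if (write) q <= d;
endmodule

// Small self-checking testbench (also an assumption, for illustration only).
module reg_we_tb;
  reg         clk;
  reg         write;
  reg  [31:0] d;
  wire [31:0] q;

  reg_we dut (.clk(clk), .write(write), .d(d), .q(q));

  // clock with a period of 10 time units
  initial begin
    clk = 0;
    forever #5 clk = ~clk;
  end

  initial begin
    write = 1; d = 32'hAAAA_5555;
    @(posedge clk); #1;                 // first rising edge loads d
    if (q !== 32'hAAAA_5555) $display("FAIL: write=1 did not load d");
    write = 0; d = 32'h1234_5678;
    @(posedge clk); #1;                 // second rising edge must not load
    if (q !== 32'hAAAA_5555) $display("FAIL: write=0 changed q");
    $display("test finished");
    $finish;
  end
endmodule
```

The checks sample q one time unit after each rising edge so that they do not race with the register update.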

Clocking Methodology
- Virtually all digital systems are synchronous to a clock
- Combinational logic sits between state elements (flip-flops)
  - Its inputs come from state elements
  - Its outputs go to the next state elements
- Combinational logic must produce its intended data within a clock cycle
- The longest delay determines the clock period (and thus the frequency)
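The relation between the longest delay and the clock period can be written as a timing constraint. The symbols are assumed names, not taken from the slides: t_pcq is the flip-flop clock-to-Q delay, t_pd is the longest combinational delay between two state elements, and t_setup is the flip-flop setup time. Clock skew is ignored in this sketch.

```latex
\begin{align*}
T_{\text{clk}} &\ge t_{pcq} + t_{pd} + t_{\text{setup}} \\
f_{\max} &= \frac{1}{t_{pcq} + t_{pd} + t_{\text{setup}}}
\end{align*}
```

For example, with hypothetical delays of 0.5 ns, 8.5 ns, and 1 ns, the minimum clock period would be 10 ns, which corresponds to a maximum frequency of 100 MHz.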

Overview
- We are going to design a MIPS CPU that can execute the machine code we have discussed so far
- To make the design easier to understand, we simplify the CPU and its system structure
[Figure: a real PC system (CPU, North Bridge, South Bridge, Main Memory (DDR), FSB (Front-Side Bus), DMI (Direct Media I/F)) compared with the simplified system (MIPS CPU connected to a memory holding instructions and data through an address bus and a data bus)]

Our MIPS Model
- Our MIPS CPU model has separate connections to memory for instruction fetch and data access
  - This structure is actually more realistic, as we will see when we study caches
- We use both structural and behavioral modeling with Verilog-HDL
  - Behavioral modeling describes what a module does
    - For example, the lowest-level modules (such as the ALU and the register file) are designed with behavioral modeling
  - Structural modeling builds a module from simpler modules via instantiation
    - For example, the top module (such as mips.v) is designed with structural modeling
[Figure: MIPS CPU connected to the instruction/data memory by two pairs of address and data buses, one for instruction fetch and one for data access]
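The difference between the two modeling styles can be illustrated with a small adder. This is a sketch under assumptions: the modules full_adder and adder2 are hypothetical and are not part of the lecture's design files.

```verilog
// Behavioral style: states what the module computes.
// The 2-bit left-hand side captures both the sum bit and the carry.
module full_adder(input  a, b, cin,
                  output sum, cout);
  assign {cout, sum} = a + b + cin;
endmodule

// Structural style: builds a 2-bit adder purely by
// instantiating and wiring simpler modules.
module adder2(input  [1:0] a, b,
              input        cin,
              output [1:0] sum,
              output       cout);
  wire c0;  // carry from bit 0 into bit 1

  full_adder fa0 (.a(a[0]), .b(b[0]), .cin(cin), .sum(sum[0]), .cout(c0));
  full_adder fa1 (.a(a[1]), .b(b[1]), .cin(c0),  .sum(sum[1]), .cout(cout));
endmodule
```

The same split applies to the processor: leaf modules describe behavior, while the upper levels mostly contain instantiations and wires.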

Overview
- A microarchitecture is composed of a datapath and control
  - The datapath operates on words of data
    - Datapath elements operate on or hold data within a processor
    - In a MIPS implementation, datapath elements include the register file, ALU, muxes, and memory
  - Control tells the datapath how to execute instructions
    - The control unit receives the current instruction from the datapath and tells the datapath how to execute that instruction
    - Specifically, the control unit produces mux select, register enable, ALU control, and memory write signals to control the operation of the datapath
- Our MIPS implementation is simplified by designing only
  - Data processing instructions: add, sub, and, or, slt
  - Memory access instructions: lw, sw
  - Branch instructions: beq, j
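A control unit for this instruction subset could look roughly like the decoder below. It is a sketch, not the lecture's control unit: the module name maindec_sketch is hypothetical, the signal names are borrowed from the schematics in this lecture, and ALU control and destination-register selection are left out. The opcode values are the standard MIPS encodings.

```verilog
// Hypothetical main decoder: maps the opcode to datapath control signals.
module maindec_sketch(input      [5:0] opcode,
                      output reg       RegWrite, ALUSrc, MemWrite,
                                       MemtoReg, branch, jump);
  always @(*) begin
    // default: change no state and keep sequential execution
    {RegWrite, ALUSrc, MemWrite, MemtoReg, branch, jump} = 6'b000000;
    case (opcode)
      6'b000000: RegWrite = 1'b1;              // R-type: add, sub, and, or, slt
      6'b100011: begin                         // lw
                   RegWrite = 1'b1;            //   write the loaded word
                   ALUSrc   = 1'b1;            //   ALU adds the immediate
                   MemtoReg = 1'b1;            //   register data comes from memory
                 end
      6'b101011: begin                         // sw
                   ALUSrc   = 1'b1;            //   ALU adds the immediate
                   MemWrite = 1'b1;            //   store to memory
                 end
      6'b000100: branch = 1'b1;                // beq
      6'b000010: jump   = 1'b1;                // j
      default:   ;                             // unsupported opcode: defaults apply
    endcase
  end
endmodule
```

Assigning defaults before the case statement keeps every output defined on every path, so the block describes combinational logic only.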

Overview of Our Design
[Figure: block diagram of the design hierarchy with MIPS_System_tb.v (testbench), MIPS_System.v, mips.v (fetch with pc, decoding, register file, ALU, memory access), and ram2port_inst_data.v (code and data in your program); signals shown: clock, reset, Address, Instruction, DataOut, DataIn]

Instruction Execution in CPU
Generic steps of instruction execution in a CPU:
- Fetch
  - Uses the program counter (PC) to supply the instruction address and fetches the instruction from memory
- Decoding
  - Decodes the instruction and reads operands
  - Extract opcode: determine what operation should be done
  - Extract operands: register numbers or immediate from the fetched instruction
- Execution
  - Use the ALU to calculate (depending on instruction class)
    - Arithmetic or logical result
    - Memory address for load/store
    - Branch target address
  - Access memory for load/store
- Next fetch
  - PC <- target address or PC + 4
[Figure: MIPS CPU and instruction/data memory connected by address and data buses; the CPU fetches with the PC (PC = PC + 4), decodes, and executes]

Instruction Fetch
- What is the PC on reset?
  - MIPS initializes the PC to 0xBFC0_0000
  - For the sake of simplicity, let's initialize the PC to 0x0000_0000 in our design
[Figure: instruction fetch in the MIPS CPU: the PC, a 32-bit register (flip-flops) with reset and clock, supplies the 32-bit address to memory, which returns the instruction; an adder increments the PC by 4 for the next instruction]

Instruction Fetch Verilog Model

mips.v:

    module mips(input         clk,
                input         reset,
                output [31:0] pc,
                input  [31:0] instr);

      wire [31:0] pcnext;

      // instantiate pc
      pcreg mips_pc (.clk    (clk),
                     .reset  (reset),
                     .pc     (pc),
                     .pcnext (pcnext));

      // instantiate adder
      adder pcadd4 (.a (pc),
                    .b (32'b100),
                    .y (pcnext));
    endmodule

    module pcreg (input             clk,
                  input             reset,
                  output reg [31:0] pc,
                  input      [31:0] pcnext);

      always @(posedge clk, posedge reset)
      begin
        if (reset) pc <= 32'h00000000;
        else       pc <= pcnext;
      end
    endmodule

    module adder(input  [31:0] a,
                 input  [31:0] b,
                 output [31:0] y);

      assign y = a + b;
    endmodule

[Figure: pcreg (reset, clock) outputs pc; the adder adds 4 to pc to produce pcnext]

Memory
- As studied in Computer Logic Design, memory is classified into RAM (Random Access Memory) and ROM (Read-Only Memory)
  - RAM is classified into DRAM (Dynamic RAM) and SRAM (Static RAM)
  - DDR is a kind of DRAM
    - DDR is short for DDR (Double Data Rate) SDRAM (Synchronous DRAM)
    - DDR is used as main memory in modern computers
- We use a Cyclone-II (Altera FPGA)-specific memory model because we port our design to the Cyclone-II FPGA

Generic Memory Model in Verilog

    module mem(input         clk, MemWrite,
               input  [7:2]  Address,
               input  [31:0] WriteData,
               output [31:0] ReadData);

      reg [31:0] RAM[63:0];   // 64 words of 32 bits

      // Memory Initialization
      initial
      begin
        $readmemh("memfile.dat", RAM);
      end

      // Memory Read
      assign ReadData = RAM[Address[7:2]];

      // Memory Write
      always @(posedge clk)
        if (MemWrite) RAM[Address[7:2]] <= WriteData;
    endmodule

[Figure: memory of 64 words with a 6-bit Address, 32-bit WriteData[31:0] and ReadData[31:0], and MemWrite]

Compiled binary file memfile.dat (one 32-bit word per line):

    20020005
    2003000c
    2067fff7
    00e22025
    00642824
    00a42820
    10a7000a
    0064202a
    10800001
    20050000
    00e2202a
    00853820
    00e23822
    ac670044
    8c020050
    08000011
    20020001
    ac020054

Simple MIPS Test Code
[Figure: MIPS test code and the result of assembling it]

Our Memory
- As mentioned, we use a Cyclone-II (Altera FPGA)-specific memory model because we port our design to the Cyclone-II FPGA
- Prof. Suh has created a memory model using MegaWizard in Quartus-II
  - Initializing the memory requires a special format called mif
  - Prof. Suh wrote a Perl script to generate the mif-format file
    - Check out the Makefile
- For synthesis and simulation, just copy insts_data.mif to the MIPS_System_Syn and MIPS_System_Sim directories

Instruction Decoding
- Instruction decoding separates the fetched instruction into fields according to the instruction type (R, I, or J)
- The opcode and funct fields determine which operation the instruction performs
  - Control logic should be designed to supply control signals to datapath elements (such as the ALU and the register file)
- Operands
  - Register numbers in the instruction are sent to the register file
  - The immediate field is either sign-extended or zero-extended, depending on the instruction
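Separating an instruction into fields amounts to slicing the 32-bit word. The sketch below uses the standard MIPS field positions; the module name decode_fields, its port names, and the testbench are assumptions for illustration and are not part of the lecture's design.

```verilog
// Hypothetical helper: slices a MIPS instruction into its fields.
module decode_fields(input  [31:0] instr,
                     output [5:0]  opcode,
                     output [4:0]  rs, rt, rd, sa,
                     output [5:0]  funct,
                     output [15:0] imm,
                     output [25:0] target);
  assign opcode = instr[31:26];   // all types
  assign rs     = instr[25:21];   // R- and I-type
  assign rt     = instr[20:16];   // R- and I-type
  assign rd     = instr[15:11];   // R-type
  assign sa     = instr[10:6];    // R-type shift amount
  assign funct  = instr[5:0];     // R-type
  assign imm    = instr[15:0];    // I-type
  assign target = instr[25:0];    // J-type
endmodule

module decode_fields_tb;
  reg  [31:0] instr;
  wire [5:0]  opcode, funct;
  wire [4:0]  rs, rt, rd, sa;
  wire [15:0] imm;
  wire [25:0] target;

  decode_fields dut (.instr(instr), .opcode(opcode), .rs(rs), .rt(rt),
                     .rd(rd), .sa(sa), .funct(funct), .imm(imm),
                     .target(target));

  initial begin
    instr = 32'h20020005;   // first word of memfile.dat: addi $2, $0, 5
    #1;
    if (opcode !== 6'b001000 || rs !== 5'd0 || rt !== 5'd2 || imm !== 16'h0005)
      $display("FAIL: unexpected field values");
    $display("opcode=%b rs=%0d rt=%0d imm=%h", opcode, rs, rt, imm);
    $finish;
  end
endmodule
```

All fields are extracted in parallel regardless of the instruction type; the control logic decides which of them are actually used.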

Schematic with Instruction Decoding
[Figure: MIPS CPU core with the fetch logic (PC, Add 4, memory, reset, clock), the control unit (inputs opcode and funct; outputs sign_ext and RegWrite), the register file (ra1[4:0], ra2[4:0], wa[4:0], wd, RegWrite, 32-bit outputs rd1 and rd2, registers R0 to R31), and the sign/zero extension unit (16-bit imm in, 32 bits out, controlled by sign_ext)]

Register File in Verilog

    module regfile(input         clk,
                   input         RegWrite,
                   input  [4:0]  ra1, ra2, wa,
                   input  [31:0] wd,
                   output [31:0] rd1, rd2);

      reg [31:0] rf[31:0];

      // three ported register file
      // read two ports combinationally
      // write third port on rising edge of clock
      // register 0 hardwired to 0
      always @(posedge clk)
        if (RegWrite) rf[wa] <= wd;

      assign rd1 = (ra1 != 0) ? rf[ra1] : 0;
      assign rd2 = (ra2 != 0) ? rf[ra2] : 0;
    endmodule

[Figure: register file with registers R0 to R31 (32 bits each), 5-bit ports ra1[4:0], ra2[4:0], and wa, 32-bit ports rd1, rd2, and wd, and RegWrite]

Sign & Zero Extension in Verilog

    module sign_zero_ext(input             sign_ext,
                         input      [15:0] a,
                         output reg [31:0] y);

      // combinational logic: blocking assignments
      always @(*)
      begin
        if (sign_ext) y = {{16{a[15]}}, a};
        else          y = {{16{1'b0}}, a};
      end
    endmodule

Questions:
- Why is y declared as reg?
- Is it going to be synthesized as registers?
- Is this logic combinational or sequential?
[Figure: sign or zero extension unit: 16-bit input a[15:0] (= imm), 32-bit output y[31:0], control input sign_ext]
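One way to approach the questions on this slide is to write the same function with a continuous assignment. In the always-block version, y must be declared reg only because it is assigned inside an always block; since always @(*) assigns y on every path, a synthesis tool is expected to infer combinational logic and no flip-flops. The module name sign_zero_ext_assign and the testbench below are assumptions for illustration.

```verilog
// Hypothetical equivalent of sign_zero_ext using a continuous assignment.
// The upper 16 bits are a[15] when sign_ext is 1 and 0 otherwise.
module sign_zero_ext_assign(input         sign_ext,
                            input  [15:0] a,
                            output [31:0] y);
  assign y = {{16{sign_ext & a[15]}}, a};
endmodule

module sign_zero_ext_assign_tb;
  reg         sign_ext;
  reg  [15:0] a;
  wire [31:0] y;

  sign_zero_ext_assign dut (.sign_ext(sign_ext), .a(a), .y(y));

  initial begin
    a = 16'hfff7;                 // immediate of 2067fff7 in memfile.dat (-9)
    sign_ext = 1; #1;
    if (y !== 32'hfffffff7) $display("FAIL: sign extension");
    sign_ext = 0; #1;
    if (y !== 32'h0000fff7) $display("FAIL: zero extension");
    $display("test finished");
    $finish;
  end
endmodule
```

Both forms describe the same hardware: 16 AND gates (or an equivalent mux) feeding the upper half of y, with no storage.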

Instruction Execution #1
- Execution of the arithmetic and logical instructions
  - R-type arithmetic and logical instructions
    - Examples: add, sub, and, or ...
    - 2 source operands from the register file
    - Format: opcode | rs | rt | rd | sa | funct
    - add $t0, $s1, $s2   ($t0 is the destination register)
  - I-type arithmetic and logical instructions
    - Examples: addi, andi, ori ...
    - 1 source operand from the register file
    - 1 source operand from the immediate field
    - Format: opcode | rs | rt | immediate
    - addi $t0, $s3, -12
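A behavioral ALU covering the data processing instructions of this lecture might look like the sketch below. The module name alu_sketch, the port names, and in particular the 3-bit alucont encoding are assumptions chosen for illustration; the lecture's actual ALU and its control encoding may differ.

```verilog
// Hypothetical behavioral ALU for add, sub, and, or, slt.
module alu_sketch(input      [31:0] a, b,
                  input      [2:0]  alucont,   // assumed encoding, see below
                  output reg [31:0] result,
                  output            zero);
  always @(*)
    case (alucont)
      3'b000:  result = a & b;                                  // and
      3'b001:  result = a | b;                                  // or
      3'b010:  result = a + b;                                  // add
      3'b110:  result = a - b;                                  // sub
      3'b111:  result = ($signed(a) < $signed(b)) ? 32'd1
                                                  : 32'd0;      // slt (signed)
      default: result = 32'b0;                                  // unused codes
    endcase

  // zero flag, used later to decide whether beq is taken
  assign zero = (result == 32'b0);
endmodule
```

For I-type instructions the second operand b would come from the extended immediate instead of the register file, which is what the ALUSrc mux selects.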

Schematic with Instruction Execution #1
[Figure: MIPS CPU core extended with the ALU and a mux controlled by ALUSrc; the control unit (inputs opcode and funct) produces ALUSrc and RegWrite; the register file (ra1[4:0], ra2[4:0], wa[4:0], wd, rd1, rd2, RegWrite, registers R0 to R31), the sign/zero extension of the 16-bit imm to 32 bits, and the fetch logic (PC, Add 4, memory, reset, clock) are also shown]

How to Design Mux in Verilog?

    module mux2 (input  [31:0] d0,
                 input  [31:0] d1,
                 input         s,
                 output [31:0] y);

      assign y = s ? d1 : d0;
    endmodule

OR

    module mux2 (input      [31:0] d0,
                 input      [31:0] d1,
                 input             s,
                 output reg [31:0] y);

      // combinational logic: blocking assignments
      always @(*)
      begin
        if (s) y = d1;
        else   y = d0;
      end
    endmodule

Design it with a parameter so that the module can be instantiated for a mux of any width in your design:

    module mux2 #(parameter WIDTH = 8)
                 (input  [WIDTH-1:0] d0, d1,
                  input              s,
                  output [WIDTH-1:0] y);

      assign y = s ? d1 : d0;
    endmodule

    module datapath(………);
      wire [31:0] writedata, signimm;
      wire [31:0] srcb;
      wire        alusrc;

      // Instantiation
      mux2 #(32) srcbmux(.d0 (writedata),
                         .d1 (signimm),
                         .s  (alusrc),
                         .y  (srcb));
    endmodule

Instruction Execution #2
- Execution of the memory access instructions
  - lw, sw instructions
  - Format: opcode | rs | rt | immediate
  - lw $t0, 24($s3)   // $t0 <= [$s3 + 24]
  - sw $t2, 8($s3)    // [$s3 + 8] <= $t2
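The address calculation and the write-back selection for lw and sw can be sketched as below. The module name mem_access_sketch and its port names are hypothetical, and the sketch assumes that the offset is sign-extended and that a mux controlled by MemtoReg chooses between the memory data and the ALU result.

```verilog
// Hypothetical sketch of the load/store address and write-back paths.
module mem_access_sketch(input  [31:0] rs_val,     // base register value
                         input  [15:0] imm,        // offset from the instruction
                         input  [31:0] readdata,   // word read from memory
                         input         MemtoReg,   // 1 for lw
                         output [31:0] addr,       // memory address
                         output [31:0] wd);        // register file write data
  // sign-extend the 16-bit offset to 32 bits
  wire [31:0] signimm = {{16{imm[15]}}, imm};

  // effective address = base + offset (the ALU performs this addition)
  assign addr = rs_val + signimm;

  // lw writes the memory data; other instructions write the ALU result
  assign wd = MemtoReg ? readdata : addr;
endmodule
```

For lw $t0, 24($s3), rs_val would hold the value of $s3 and imm would be 24; for sw, the same address is used while MemWrite enables the store.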

Schematic with Instruction Execution #2
[Figure: MIPS CPU core extended for memory access: the control unit (inputs opcode and funct) produces MemWrite, MemtoReg, ALUSrc, and RegWrite; the data memory has Address, WriteData, ReadData, and MemWrite ports; a mux controlled by MemtoReg is added; the register file, ALU, ALUSrc mux, sign/zero extension, and fetch logic are also shown]
lw $t0, 24($s3)   // $t0 <= [$s3 + 24]
sw $t2, 8($s3)    // [$s3 + 8] <= $t2

Instruction Execution #3
- Execution of the branch and jump instructions
  - beq, bne, j, jal, jr instructions
  - beq format: opcode | rs | rt | immediate
    - beq $s0, $s1, Lbl   // go to Lbl if $s0 = $s1
    - Destination = (PC + 4) + (imm << 2), where imm is sign-extended to 32 bits
  - j format: opcode | jump target
    - j target   // jump
    - Destination = {(PC+4)[31:28], jump target, 2'b00}
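The two destination formulas can be sketched as next-PC logic. The module name nextpc_sketch, its port names, and the testbench are assumptions for illustration; the sketch also assumes that PCSrc is the AND of the branch control signal and the ALU zero flag. The test values are taken from memfile.dat: the beq word 10a7000a sits at address 0x18, and the j word 08000011 has jump target 0x11.

```verilog
// Hypothetical next-PC logic for beq and j.
module nextpc_sketch(input  [31:0] pc,
                     input  [15:0] imm,
                     input  [25:0] target,
                     input         branch, zero, jump,
                     output [31:0] pcnext);
  wire [31:0] pcplus4  = pc + 32'd4;
  wire [31:0] signimm  = {{16{imm[15]}}, imm};
  // (PC + 4) + (imm << 2)
  wire [31:0] pcbranch = pcplus4 + {signimm[29:0], 2'b00};
  // {(PC+4)[31:28], jump target, 2'b00}
  wire [31:0] pcjump   = {pcplus4[31:28], target, 2'b00};
  wire        PCSrc    = branch & zero;   // beq taken only if operands are equal

  assign pcnext = jump ? pcjump : (PCSrc ? pcbranch : pcplus4);
endmodule

module nextpc_sketch_tb;
  reg  [31:0] pc;
  reg  [15:0] imm;
  reg  [25:0] target;
  reg         branch, zero, jump;
  wire [31:0] pcnext;

  nextpc_sketch dut (.pc(pc), .imm(imm), .target(target), .branch(branch),
                     .zero(zero), .jump(jump), .pcnext(pcnext));

  initial begin
    pc = 32'h18; imm = 16'h000a; target = 26'h11;
    branch = 0; zero = 0; jump = 0; #1;
    if (pcnext !== 32'h1c) $display("FAIL: sequential");
    branch = 1; zero = 1; #1;                 // taken beq: 0x1c + (0xa << 2)
    if (pcnext !== 32'h44) $display("FAIL: branch");
    branch = 0; zero = 0; jump = 1; #1;       // j: {0x0, 0x11, 2'b00}
    if (pcnext !== 32'h44) $display("FAIL: jump");
    $display("test finished");
    $finish;
  end
endmodule
```

In this sketch the jump mux comes last, so a jump takes priority over a branch; the two control signals are never asserted together for the supported instructions.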

Schematic with Instruction Execution #3 (beq)
[Figure: MIPS CPU core extended for beq: the control unit (inputs opcode and funct) produces branch; the ALU zero output and branch determine PCSrc; the extended 16-bit imm is shifted left by 2 and added to PC + 4; a mux controlled by PCSrc selects the next PC; the data memory, register file, ALU, ALUSrc and MemtoReg muxes, and fetch logic are also shown]
Destination = (PC + 4) + (imm << 2)

Schematic with Instruction Execution #3 (j)
[Figure: MIPS CPU core extended for j: the control unit (inputs opcode and funct) produces jump and branch; the 26-bit imm is shifted left by 2 to 28 bits and concatenated with PC[31:28]; a mux controlled by jump selects the next PC after the PCSrc mux; the data memory, register file, ALU, ALUSrc and MemtoReg muxes, branch adder, and fetch logic are also shown]
Destination = {(PC+4)[31:28], jump target, 2'b00}

Demo
- Synthesis with Quartus-II
- Simulation with ModelSim

Backup Slides

Why HDL?
- In the old days (until the early 1990s), hardware engineers drew schematics of digital logic based on Boolean equations, FSMs, and so on
- But drawing schematics becomes virtually impossible as hardware complexity increases
  - Example: the number of transistors in a Core 2 Duo is roughly 300 million
  - Assuming the gate count is based on 2-input NAND gates (each composed of 4 transistors), do you want to draw 75 million gates by hand? Absolutely NOT!

Why HDL?
- Hardware description language (HDL)
  - Allows the designer to specify a logic function using a language
    - So the hardware designer only needs to specify the target functionality (such as Boolean equations and FSMs) in the language
  - Then a computer-aided design (CAD) tool produces the optimized digital circuit with logic gates
    - Nowadays, most commercial designs are built using HDLs

HDL-based design:

    module example(input  a, b, c,
                   output y);
      assign y = ~a & ~b & ~c |
                  a & ~b & ~c |
                  a & ~b &  c;
    endmodule

[Figure: the CAD tool turns the HDL-based design into optimized gates]

HDLs
- Two leading HDLs: Verilog-HDL and VHDL
- Verilog-HDL
  - Developed in 1984 by Gateway Design Automation
  - Became an IEEE standard (1364) in 1995
  - We are going to use Verilog-HDL in this class
    - The book on the right is a good reference (but you are not required to purchase it)
- VHDL
  - Developed in 1981 by the Department of Defense
  - Became an IEEE standard (1076) in 1987
- IEEE (Institute of Electrical and Electronics Engineers) is a professional society responsible for many computing standards, including WiFi (802.11) and Ethernet (802.3)

HDL to (Logic) Gates
- There are 3 steps to design hardware with an HDL
  - Hardware design with HDL
    - Describe your hardware with HDL
    - When describing circuits using an HDL, it's critical to think of the hardware the code should produce
  - Simulation
    - Once you design your hardware with HDL, you need to verify that the design is implemented correctly
    - Input values are applied to your design with HDL
    - Outputs are checked for correctness
    - Millions of dollars are saved by debugging in simulation instead of hardware
  - Synthesis
    - Transforms HDL code into a netlist describing the hardware
    - A netlist is a text file describing a list of logic gates and the wires connecting them
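The simulation step can be illustrated with a small self-checking testbench for the example module from the "Why HDL?" slide. The testbench example_tb is an assumption for illustration; it applies all 8 input combinations and compares the output against the simplified equation y = ~b & (a | ~c), which follows from the original expression by Boolean algebra.

```verilog
// Design under test, as given on the "Why HDL?" slide.
module example(input  a, b, c,
               output y);
  assign y = ~a & ~b & ~c |
              a & ~b & ~c |
              a & ~b &  c;
endmodule

// Hypothetical self-checking testbench.
module example_tb;
  reg     a, b, c;
  wire    y;
  integer i, errors;

  example dut (.a(a), .b(b), .c(c), .y(y));

  initial begin
    errors = 0;
    for (i = 0; i < 8; i = i + 1) begin
      {a, b, c} = i;        // apply the 3 low bits of i as inputs
      #10;                  // let the outputs settle
      if (y !== (~b & (a | ~c))) begin
        $display("FAIL: a=%b b=%b c=%b y=%b", a, b, c, y);
        errors = errors + 1;
      end
    end
    $display("simulation finished with %0d error(s)", errors);
    $finish;
  end
endmodule
```

Checking the outputs inside the testbench, rather than inspecting waveforms by hand, is what makes it practical to rerun the test after every design change.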

CAD tools for Simulation
- There are renowned CAD companies that provide HDL simulators
  - Cadence: www.cadence.com
  - Synopsys: www.synopsys.com
  - Mentor Graphics: www.mentorgraphics.com
- We are going to use ModelSim Altera Starter Edition for simulation
  - http://www.altera.com/products/software/quartus-ii/modelsim/qts-modelsim-index.html

CAD tools for Synthesis
- The same companies (Cadence, Synopsys, and Mentor Graphics) provide synthesis tools, too
  - They are extremely expensive to purchase, though
- We are going to use a synthesis tool from Altera
  - Altera Quartus-II Web Edition (free)
    - Synthesis, place & route, and download to FPGA
  - http://www.altera.com/products/software/quartus-ii/web-edition/qts-we-index.html

MIPS CPU with imem and Testbench

    module mips_tb();
      reg clk;
      reg reset;

      // instantiate device to be tested
      mips_cpu_mem imips_cpu_mem(clk, reset);

      // initialize test
      initial
      begin
        reset <= 1; # 32; reset <= 0;
      end

      // generate clock to sequence tests
      initial
      begin
        clk <= 0;
        forever #10 clk <= ~clk;
      end
    endmodule

    module mips_cpu_mem(input clk, reset);
      wire [31:0] pc, instr;

      // instantiate processor and memories
      mips_cpu imips_cpu (clk, reset, pc, instr);
      imem imips_imem (pc[7:2], instr);
    endmodule