Advanced Computer Architecture 5MD00 / 5Z033 MIPS Design data path and control Henk Corporaal TUEindhoven 2007
H.Corporaal ACA 5MD002 Topics n Building a datapath u support a subset of the MIPS-I instruction-set n A single cycle processor datapath u all instruction actions in one (long) cycle n A multi-cycle processor datapath u each instructions takes multiple (shorter) cycles n Exception support n For details see book (ch 5):
H.Corporaal ACA 5MD003 Datapath and Control DatapathControl Registers & Memories Multiplexors Buses ALUs FSM or Micro- programming
H.Corporaal ACA 5MD004 n Simplified MIPS implementation to contain only: memory-reference instructions: lw, sw arithmetic-logical instructions: add, sub, and, or, slt control flow instructions: beq, j n Generic Implementation: u use the program counter (PC) to supply instruction address u get the instruction from memory u read registers u use the instruction to decide exactly what to do n All instructions use the ALU after reading the registers Why? F memory-reference? F arithmetic? F control flow? The Processor: Datapath & Control
H.Corporaal ACA 5MD005 n Abstract / Simplified View: n Two types of functional units: u elements that operate on data values (combinational) u elements that contain state (sequential) More Implementation Details
H.Corporaal ACA 5MD006 Our Implementation n An edge triggered methodology n Typical execution: u read contents of some state elements, u send values through some combinational logic, u write results to one or more state elements Clock cycle State element 1 Combinational logic State element 2
H.Corporaal ACA 5MD007 D flip-flop n Output changes only on the clock edge QQ _ Q Q _ Q D latch D C D latch DD C C
H.Corporaal ACA 5MD008 n 3-ported: one write, two read ports Register File Read reg. #1 Read reg.#2 Write reg.# Read data 1 Read data 2 Write data
H.Corporaal ACA 5MD009 Register file: read ports Register file built using D flip-flops
H.Corporaal ACA 5MD0010 Register file: write port n Note: we still use the real clock to determine when to write
H.Corporaal ACA 5MD0011 Simple Implementation n Include the functional units we need for each instruction ALU control RegWrite Registers Write register Read data 1 Read data 2 Read register 1 Read register 2 Write data ALU result ALU Data Data Register numbers a. Registersb. ALU Zero
H.Corporaal ACA 5MD0012 Building the Datapath n Use multiplexors to stitch them together PC Instruction memory Read address Instruction Add ALU result M u x Registers Write register Write data Read data 1 Read data 2 Read register 1 Read register 2 Shift left 2 4 M u x ALU operation 3 RegWrite MemRead MemWrite PCSrc ALUSrc MemtoReg ALU result Zero ALU Data memory Address Write data Read data M u x Sign extend Add
H.Corporaal ACA 5MD0013 n All of the logic is combinational n We wait for everything to settle down, and the right thing to be done u ALU might not produce “right answer” right away u we use write signals along with clock to determine when to write n Cycle time determined by length of the longest path Our Simple Control Structure We are ignoring some details like setup and hold times ! Clock cycle State element 1 Combinational logic State element 2
H.Corporaal ACA 5MD0014 Control n Selecting the operations to perform (ALU, read/write, etc.) n Controlling the flow of data (multiplexor inputs) n Information comes from the 32 bits of the instruction Example: add $8, $17, $18 Instruction Format: op rs rt rd shamt funct n ALU's operation based on instruction type and function code
H.Corporaal ACA 5MD0015 Control: 2 level implementation instruction register ALUop ALUcontrol Opcode Funct bit Control 1 Control 2 ALU 00: lw, sw 01: beq 10: add, sub, and, or, slt 000: and 001: or 010: add 110: sub 111: set on less than
H.Corporaal ACA 5MD0016 Datapath with Control
H.Corporaal ACA 5MD0017 What should the ALU do with this instruction example: lw $1, 100($2) op rsrt 16 bit offset ALU control input 000 AND 001OR 010add 110subtract 111set-on-less-than n Why is the code for subtract 110 and not 011? ALU Control1
H.Corporaal ACA 5MD0018 n Must describe hardware to compute 3-bit ALU control input u given instruction type 00 = lw, sw 01 = beq, 10 = arithmetic u function code for arithmetic n Describe it using a truth table (can turn into gates): ALU Operation class, computed from instruction type ALU Control1 outputs intputs
H.Corporaal ACA 5MD0019 ALU Control1 n Simple combinational logic (truth tables)
H.Corporaal ACA 5MD0020 Deriving Control2 signals 9 control (output) signals Determine these control signals directly from the opcodes: R-format: 0 lw: 35 sw: 43 beq: 4 Input 6-bits
H.Corporaal ACA 5MD0021 Control 2 n PLA example implementation
H.Corporaal ACA 5MD0022 Single Cycle Implementation n Calculate cycle time assuming negligible delays except: u memory (2ns), ALU and adders (2ns), register file access (1ns)
H.Corporaal ACA 5MD0023 Single Cycle Implementation n Memory (2ns), ALU & adders (2ns), reg. file access (1ns) n Fixed length clock: longest instruction is the ‘lw’ which requires 8 ns n Variable clock length (not realistic, just as exercise): u R-instr: 6 ns u Load: 8 ns u Store: 7 ns u Branch: 5 ns u Jump: 2 ns n Average depends on instruction mix (see pg 374)
H.Corporaal ACA 5MD0024 Where we are headed n Single Cycle Problems: u what if we had a more complicated instruction like floating point? u wasteful of area: NO Sharing of Hardware resources n One Solution: u use a “smaller” cycle time u have different instructions take different numbers of cycles u a “multicycle” datapath: IR MDR
H.Corporaal ACA 5MD0025 n We will be reusing functional units u ALU used to compute address and to increment PC u Memory used for instruction and data n Add registers after every major functional unit n Our control signals will not be determined solely by instruction u e.g., what should the ALU do for a “subtract” instruction? n We’ll use a finite state machine (FSM) or microcode for control Multicycle Approach
H.Corporaal ACA 5MD0026 n Finite state machines: u a set of states and u next state function (determined by current state and the input) u output function (determined by current state and possibly input) u We’ll use a Moore machine (output based only on current state) Review: finite state machines Next-state function Current state Clock Output function Next state Outputs Inputs
H.Corporaal ACA 5MD0027 n Break up the instructions into steps, each step takes a cycle u balance the amount of work to be done u restrict each cycle to use only one major functional unit n At the end of a cycle u store values for use in later cycles (easiest thing to do) u introduce additional “internal” registers n Notice: we distinguish u processor state: programmer visible registers u internal state: programmer invisible registers (like IR, MDR, A, B, and ALUout) Multicycle Approach
H.Corporaal ACA 5MD0028 Multicycle Approach Shift left 2 PC Memory MemData Write data M u x 0 1 Registers Write register Write data Read data 1 Read data 2 Read register 1 Read register 2 M u x 0 1 M u x Instruction [15–0] Sign extend 3216 Instruction [25–21] Instruction [20–16] Instruction [15–0] Instruction register 1 M u x M u x ALU result ALU Zero Memory data register Instruction [15–11] A B ALUOut 0 1 Address
H.Corporaal ACA 5MD0029 Multicycle Approach n Note that previous picture does not include: u branch support u jump support u Control lines and logic n Tclock > max (ALU delay, Memory access, Regfile access) n See book for complete picture
H.Corporaal ACA 5MD0030 n Instruction Fetch n Instruction Decode and Register Fetch n Execution, Memory Address Computation, or Branch Completion n Memory Access or R-type instruction completion n Write-back step Five Execution Steps INSTRUCTIONS TAKE FROM CYCLES!
H.Corporaal ACA 5MD0031 n Use PC to get instruction and put it in the Instruction Register n Increment the PC by 4 and put the result back in the PC Can be described succinctly using RTL "Register-Transfer Language" IR = Memory[PC]; PC = PC + 4; n Can we figure out the values of the control signals? n What is the advantage of updating the PC now? Step 1: Instruction Fetch
H.Corporaal ACA 5MD0032 n Read registers rs and rt in case we need them n Compute the branch address in case the instruction is a branch n Previous two actions are done optimistically!! RTL: A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC+(sign-extend(IR[15-0])<< 2); n We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic) Step 2: Instruction Decode and Register Fetch
H.Corporaal ACA 5MD0033 n ALU is performing one of four functions, based on instruction type Memory Reference: ALUOut = A + sign-extend(IR[15-0]); R-type: ALUOut = A op B; Branch: if (A==B) PC = ALUOut; n Jump: PC = PC[31-28] || (IR[25-0]<<2) Step 3 (instruction dependent)
H.Corporaal ACA 5MD0034 Loads and stores access memory MDR = Memory[ALUOut]; or Memory[ALUOut] = B; R-type instructions finish Reg[IR[15-11]] = ALUOut; The write actually takes place at the end of the cycle on the edge Step 4 (R-type or memory-access)
H.Corporaal ACA 5MD0035 Memory read completion step Reg[IR[20-16]]= MDR; What about all the other instructions? Write-back step
H.Corporaal ACA 5MD0036 Steps taken to execute any instruction class Summary execution steps
H.Corporaal ACA 5MD0037 How many cycles will it take to execute this code? lw $t2, 0($t3) lw $t3, 4($t3) beq $t2, $t3, L1 #assume not taken add $t5, $t2, $t3 sw $t5, 8($t3) L1:... n What is going on during the 8th cycle of execution? In what cycle does the actual addition of $t2 and $t3 takes place? Simple Questions
H.Corporaal ACA 5MD0038 n Value of control signals is dependent upon: u what instruction is being executed u which step is being performed n Use the information we have accumulated to specify a finite state machine (FSM) u specify the finite state machine graphically, or u use microprogramming n Implementation can be derived from specification Implementing the Control
H.Corporaal ACA 5MD0039 FSM: high level view Start/reset Instruction fetch, decode and register fetch Memory access instructions R-type instructions Branch instruction Jump instruction
n How many state bits will we need? Graphical Specification of FSM PCWrite PCSource = 10 ALUSrcA = 1 ALUSrcB = 00 ALUOp = 01 PCWriteCond PCSource = 01 ALUSrcA =1 ALUSrcB = 00 ALUOp= 10 RegDst = 1 RegWrite MemtoReg = 0 MemWrite IorD = 1 MemRead IorD = 1 ALUSrcA = 1 ALUSrcB = 10 ALUOp = 00 RegDst=0 RegWrite MemtoReg=1 ALUSrcA = 0 ALUSrcB = 11 ALUOp = 00 MemRead ALUSrcA = 0 IorD = 0 IRWrite ALUSrcB = 01 ALUOp = 00 PCWrite PCSource = 00 Instruction fetch Instruction decode/ register fetch Jump completion Branch completionExecution Memory address computation Memory access Memory access R-type completion Write-back step ( O p = ' L W ' ) o r ( O p = ' S W ' ) ( O p = R - t y p e ) ( O p = ' B E Q ' ) ( O p = ' J ' ) ( O p = ' S W ' ) ( O p = ' L W ' ) Start
H.Corporaal ACA 5MD0041 Implementation: Finite State Machine for Control
H.Corporaal ACA 5MD0042 PLA Implemen- tation n If I picked a horizontal or vertical line could you explain it ? n What type of FSM is used? Mealy or Moore? (see fig C.14) next state current state datapath control opcode
H.Corporaal ACA 5MD0043 n ROM = "Read Only Memory" u values of memory locations are fixed ahead of time n A ROM can be used to implement a truth table u if the address is m-bits, we can address 2 m entries in the ROM u our outputs are the bits of data that the address points to ROM Implementation m is the "heigth", and n is the "width" m bits n bits ROM addressdata
H.Corporaal ACA 5MD0044 n How many inputs are there? 6 bits for opcode, 4 bits for state = 10 address lines (i.e., 2 10 = 1024 different addresses) n How many outputs are there? 16 datapath-control outputs, 4 state bits = 20 outputs n ROM is 2 10 x 20 = 20K bits (very large and a rather unusual size) n Rather wasteful, since for lots of the entries, the outputs are the same — i.e., opcode is often ignored ROM Implementation
H.Corporaal ACA 5MD0045 ROM Implementation Cheaper implementation: n Exploit the fact that the FSM is a Moore machine ==> u Control outputs only depend on current state and not on other incoming control signals ! u Next state depends on all inputs n Break up the table into two parts — 4 state bits tell you the 16 outputs, 2 4 x 16 bits of ROM — 10 bits tell you the 4 next state bits, 2 10 x 4 bits of ROM — Total number of bits: 4.3K bits of ROM
H.Corporaal ACA 5MD0046 n PLA is much smaller u can share product terms (ROM has an entry (=address) for every product term u only need entries that produce an active output u can take into account don't cares Size of PLA: (#inputs #product-terms) + (#outputs #product-terms) u For this example: (10x17)+(20x17) = 460 PLA cells n PLA cells usually slightly bigger than the size of a ROM cell ROM vs PLA
H.Corporaal ACA 5MD0047 Exceptions n Unexpected events n External: interrupt u e.g. I/O request n Internal: exception u e.g. Overflow, Undefined instruction opcode, Software trap, Page fault n How to handle exception? u Jump to general entry point (record exception type in status register) u Jump to vectored entry point u Address of faulting instruction has to be recorded !
H.Corporaal ACA 5MD0048 Exceptions Changes needed: see fig / 5.49 / 5.50 n Extend PC input mux with extra entry with fixed address: “C000000hex” n Add EPC register containing old PC (we’ll use the ALU to decrement PC with 4) u extra input ALU src2 needed with fixed value 4 n Cause register (one bit in our case) containing: u 0: undefined instruction u 1: ALU overflow n Add 2 states to FSM u undefined instr. state #10 u overflow state #11
H.Corporaal ACA 5MD0049 Exceptions 2 New states: Legend: IntCause =0/1type of exception CauseWritewrite Cause register ALUSrcA = 0select PC ALUSrcB = 01select constant 4 ALUOp = 01subtract operation EPCWritewrite EPC register with current PC PCWritewrite PC with exception address PCSource =11select exception address: C000000hex Legend: IntCause =0/1type of exception CauseWritewrite Cause register ALUSrcA = 0select PC ALUSrcB = 01select constant 4 ALUOp = 01subtract operation EPCWritewrite EPC register with current PC PCWritewrite PC with exception address PCSource =11select exception address: C000000hex IntCause =0 CauseWrite ALUSrcA = 0 ALUSrcB = 01 ALUOp = 01 EPCWrite PCWrite PCSource =11 IntCause =1 CauseWrite ALUSrcA = 0 ALUSrcB = 01 ALUOp = 01 EPCWrite PCWrite PCSource =11 #10 undefined instruction#11 overflow To state 0 (begin of next instruction)