1 CS-447– Computer Architecture Lecture 12 Multiple Cycle Datapath September 24, 2008Nael Abu-GhazalehGreet class
2 Implementation vs. Performance Performance of a processor is determined byInstruction count of a programCPIClock cycle time (clock rate)The compiler & the ISA determine the instruction count.The implementation of the processor determines the CPI and the clock cycle time.
3 Possible Execution Steps of Any Instructions Instruction FetchInstruction Decode and Register FetchExecution of the Memory Reference InstructionExecution of Arithmetic-Logical operationsBranch InstructionJump Instruction
4 Instruction Processing Five steps:Instruction fetch (IF)Instruction decode and operand fetch (ID)ALU/execute (EX)Memory (not required) (MEM)Write-back (WB)WBIFEXIDMEM
8 What’s Wrong with Single Cycle? All instructions run at the speed of the slowest instruction.Adding a long instruction can hurt performanceWhat if you wanted to include multiply?You cannot reuse any parts of the processorWe have 3 different adders to calculate PC+4, PC+4+offset and the ALUNo profit in making the common case fastSince every instruction runs at the slowest instruction speedThis is particularly important for loads as we will see later
10 Computing Execution Time Assume: 100 instructions executed25% of instructions are loads,10% of instructions are stores,45% of instructions are adds, and20% of instructions are branches.Single-cycle execution:100 * 8ns = 800 nsOptimal execution:25*8ns + 10*7ns + 45*6ns + 20*5ns = 640 ns
11 Single Cycle Problems A sequence of instructions: LW (IF, ID, EX, MEM, WB)SW (IF, ID, EX, MEM)etcSingle Cycle Implementation:Cycle 1Cycle 2ClkLoadStoreWastewhat if we had a more complicated instruction like floating point?wasteful of area
12 Multiple Cycle Solution use a “smaller” cycle timehave different instructions take different numbers of cyclesa “multicycle” datapath:
13 Multicycle Approach We will be reusing functional units ALU used to compute address and to increment PCMemory used for instruction and dataWe’ll use a finite state machine for control
14 The Five Stages of an Instruction Cycle 1Cycle 2Cycle 3Cycle 4Cycle 5IFIDExMemWBIF: Instruction Fetch and Update PCID: Instruction Decode and Registers FetchEx: Execute R-type; calculate memory addressMem: Read/write the data from/to the Data MemoryWB: Write the result data into the register fileAs shown here, each of these five steps will take one clock cycle to complete.
15 Multicycle Implementation Break up the instructions into steps, each step takes a cyclebalance the amount of work to be donerestrict each cycle to use only one major functional unitAt the end of a cyclestore values for use in later cycles (easiest thing to do)introduce additional “internal” registers
16 The Five Stages of Load Instruction Cycle 1Cycle 2Cycle 3Cycle 4Cycle 5lwIFIDExMemWBIF: Instruction Fetch and Update PCID: Instruction Decode and Registers FetchEx: Execute R-type; calculate memory addressMem: Read/write the data from/to the Data MemoryWB: Write the result data into the register fileAs shown here, each of these five steps will take one clock cycle to complete.
17 Multiple Cycle Implementation Break the instruction execution into Clock CyclesDifferent instructions require a different number of clock cyclesClock cycle is limited by the slowest stageInstruction latency is not reduced (time from the start of an instruction to its completion)Cycle 1Cycle 2Cycle 3Cycle 4Cycle 5Cycle 6Cycle 7Cycle 8lwIFetchDecExecMemWBLatency = execution time (delay or response time) – the total time from start to finish of ONE instructionFor processors one important measure is THROUGHPUT (or the execution bandwidth) – the total amount of work done in a given amount of timeFor memories one important measure is BANDWIDTH – the amount of information communicated across an interconnect (e.g., bus) per unit time; the number of operations performed per second (the WIDTH of the operation and the RATE of the operation)IFetchDecExecMemWBsw
18 Single Cycle vs. Multiple Cycle Single Cycle Implementation:Cycle 1Cycle 2ClkLoadStoreWasteMultiple Cycle Implementation:Here are the timing diagrams showing the differences between the single cycle, multiple cycle, and pipeline implementations.For example, in the pipeline implementation, we can finish executing the Load, Store, and R-type instruction sequence in seven cycles.Cycle 1Cycle 2Cycle 3Cycle 4Cycle 5Cycle 6Cycle 7Cycle 8Cycle 9Cycle 10ClklwswR-typeIFetchDecExecMemWBIFetchDecExecMemIFetch
19 Multicycle Implementation Break up the instructions into steps, each step takes a cyclebalance the amount of work to be donerestrict each cycle to use only one major functional unitAt the end of a cyclestore values for use in later cycles (easiest thing to do)introduce additional “internal” registers
20 Instructions from ISA perspective Consider each instruction from perspective of ISA.Example:The add instruction changes a register.Register specified by bits 15:11 of instruction.Instruction specified by the PC.New value is the sum (“op”) of two registers.Registers specified by bits 25:21 and 20:16 of the instructionReg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]In order to accomplish this we must break up the instruction. (kind of like introducing variables when programming)
21 Breaking down an instruction ISA definition of arithmetic: Reg[Memory[PC][15:11]] <= Reg[Memory[PC][25:21]] op Reg[Memory[PC][20:16]]Could break down to:IR <= Memory[PC]A <= Reg[IR[25:21]]B <= Reg[IR[20:16]]ALUOut <= A op BReg[IR[20:16]] <= ALUOutWe forgot an important part of the definition of arithmetic!PC <= PC + 4
22 Idea behind multicycle approach We define each instruction from the ISA perspective (do this!)Break it down into steps following our rule that data flows through at most one major functional unit (e.g., balance work across steps)Introduce new registers as needed (e.g, A, B, ALUOut, MDR, etc.)Finally try and pack as much work into each step (avoid unnecessary cycles) while also trying to share steps where possible (minimizes control, helps to simplify solution)
23 Five Execution Steps Instruction Fetch Instruction Decode and Register FetchExecution, Memory Address Computation, or Branch CompletionMemory Access or R-type instruction completionWrite-back step INSTRUCTIONS TAKE FROM CYCLES!
24 Step 1: Instruction Fetch Use PC to get instruction and put it in the Instruction Register.Increment the PC by 4 and put the result back in the PC.Can be described succinctly using RTL "Register-Transfer Language" IR <= Memory[PC]; PC <= PC + 4; Can we figure out the values of the control signals? What is the advantage of updating the PC now?
25 Step 2: Instruction Decode and Register Fetch Read registers rs and rt in case we need themCompute the branch address in case the instruction is a branchRTL: A <= Reg[IR[25:21]]; B <= Reg[IR[20:16]]; ALUOut <= PC + (sign-extend(IR[15:0]) << 2);We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic)
26 Step 3 (instruction dependent) ALU is performing one of three functions, based on instruction typeMemory Reference: ALUOut <= A + sign-extend(IR[15:0]);R-type: ALUOut <= A op B;Branch: if (A==B) PC <= ALUOut;
27 Step 4 (R-type or memory-access) Loads and stores access memory MDR <= Memory[ALUOut]; or Memory[ALUOut] <= B;R-type instructions finish Reg[IR[15:11]] <= ALUOut;
31 Review: finite state machines a set of states andnext state function (determined by current state and the input)output function (determined by current state and possibly input)We’ll use a Moore machine (output based only on current state)
32 Implementing the Control Value of control signals is dependent upon:what instruction is being executedwhich step is being performedUse the information we’ve accumulated to specify a finite state machinespecify the finite state machine graphically, oruse microprogrammingImplementation can be derived from specification