3. The Processor: Datapath and Control
Two types of functional units:
  elements that operate on data values (combinational)
  elements that contain state (state elements)
4. Single Cycle Implementation
Typical execution:
  read contents of some state elements,
  send values through some combinational logic,
  write results to one or more state elements
Using a clock signal for synchronization
Edge-triggered methodology
5. A portion of the datapath used for fetching instructions
10. Control
Selecting the operations to perform (ALU, read/write, etc.)
Controlling the flow of data (multiplexor inputs)
Information comes from the 32 bits of the instruction
Example: lw $1, 100($2)
Value of control signals is dependent upon:
  what instruction is being executed
  which step is being performed
12. Single Cycle Implementation
Calculate cycle time assuming negligible delays except:
  memory (2 ns), ALU and adders (2 ns), register file access (1 ns)
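
As a rough sketch of the calculation this slide asks for, the snippet below sums the unit delays along an assumed critical path for each instruction class and takes the maximum. Only the three delays come from the slide; the path table and the names `UNIT_DELAY_NS` and `CRITICAL_PATHS` are illustrative assumptions.

```python
# Unit delays from the slide, in nanoseconds; everything else is treated as free.
UNIT_DELAY_NS = {
    "instr_mem": 2,
    "data_mem": 2,
    "alu": 2,
    "reg_read": 1,
    "reg_write": 1,
}

# Assumed critical path per instruction class (not given on the slide).
CRITICAL_PATHS = {
    "R-type": ["instr_mem", "reg_read", "alu", "reg_write"],
    "load": ["instr_mem", "reg_read", "alu", "data_mem", "reg_write"],
    "store": ["instr_mem", "reg_read", "alu", "data_mem"],
    "branch": ["instr_mem", "reg_read", "alu"],
    "jump": ["instr_mem"],
}


def instruction_delay_ns(kind):
    """Sum the unit delays along the assumed path for one instruction class."""
    return sum(UNIT_DELAY_NS[unit] for unit in CRITICAL_PATHS[kind])


def single_cycle_time_ns():
    """The single-cycle clock must be long enough for the slowest class."""
    return max(instruction_delay_ns(kind) for kind in CRITICAL_PATHS)


if __name__ == "__main__":
    for kind in CRITICAL_PATHS:
        print(kind, instruction_delay_ns(kind), "ns")
    print("single-cycle clock:", single_cycle_time_ns(), "ns")
```

Under these assumed paths the load is the slowest instruction, so it sets the clock period for every instruction.
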
13. Multicycle Approach
Single Cycle Problems:
  what if we had a more complicated instruction like floating point?
  wasteful of area
One Solution:
  use a “smaller” cycle time
  have different instructions take different numbers of cycles
  a “multicycle” datapath
14. Multicycle Approach
Break up the instructions into steps, each step takes a cycle
  balance the amount of work to be done
  restrict each cycle to use only one major functional unit
At the end of a cycle
  store values for use in later cycles (easiest thing to do)
  introduce additional “internal” registers
16. Five Execution Steps
  Instruction Fetch
  Instruction Decode and Register Fetch
  Execution, Memory Address Computation, or Branch Completion
  Memory Access or R-type Instruction Completion
  Write-back Step
INSTRUCTIONS TAKE FROM 3 TO 5 CYCLES!
17. Five Execution Steps
Step 1, Instruction fetch (all instructions):
  IR = Mem[PC]
  PC = PC + 4
Step 2, Instruction decode / register fetch (all instructions):
  A = Reg[IR[25-21]]
  B = Reg[IR[20-16]]
  ALUOut = PC + (sign-extend(IR[15-0]) << 2)
Step 3, Execution, address computation, branch/jump completion:
  R-type: ALUOut = A op B
  Memory reference: ALUOut = A + sign-extend(IR[15-0])
  Branch: if (A == B) then PC = ALUOut
  Jump: PC = PC[31-28] || (IR[25-0] << 2)
Step 4, Memory access or R-type completion:
  R-type: Reg[IR[15-11]] = ALUOut
  Load: MDR = Mem[ALUOut]
  Store: Mem[ALUOut] = B
Step 5, Memory read completion:
  Load: Reg[IR[20-16]] = MDR
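
The IR[high-low] bit-field notation in these steps can be sketched in a few lines of Python. This is only an illustration: the helper names (`bits`, `sign_extend16`, `decode_fields`) are made up for this example, and the sample word is the standard MIPS encoding of `lw $1, 100($2)` (opcode 35).

```python
def bits(word, high, low):
    """Return word[high..low] inclusive, matching the IR[high-low] notation."""
    width = high - low + 1
    return (word >> low) & ((1 << width) - 1)


def sign_extend16(value):
    """Interpret a 16-bit field as a signed two's-complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def decode_fields(ir):
    """Extract every field the five steps refer to from a 32-bit instruction."""
    return {
        "opcode": bits(ir, 31, 26),
        "rs": bits(ir, 25, 21),          # index used for A = Reg[IR[25-21]]
        "rt": bits(ir, 20, 16),          # index used for B = Reg[IR[20-16]]
        "rd": bits(ir, 15, 11),          # R-type destination
        "imm": sign_extend16(bits(ir, 15, 0)),
        "jump_target": bits(ir, 25, 0),
    }


# lw $1, 100($2): opcode 35, rs = 2, rt = 1, immediate = 100
LW_EXAMPLE = (35 << 26) | (2 << 21) | (1 << 16) | 100

if __name__ == "__main__":
    print(hex(LW_EXAMPLE), decode_fields(LW_EXAMPLE))
```

For the load, step 3 would compute ALUOut = Reg[2] + 100 and step 5 would write the loaded value into register 1.
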
19. Pipelining is Natural!
Pipelining provides a method for executing multiple instructions at the same time.
Laundry Example
  Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
  Washer takes 30 minutes
  Dryer takes 40 minutes
  “Folder” takes 20 minutes
20. Sequential Laundry
[Figure: task order A, B, C, D against time from 6 PM to midnight; each load takes 30, 40, and 20 minutes in turn]
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?
21. Pipelined Laundry: Start work ASAP
[Figure: task order A, B, C, D against time from 6 PM; stage times 30, 40, and 20 minutes, with loads overlapped]
Pipelined laundry takes 3.5 hours for 4 loads
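
The 6-hour and 3.5-hour figures can be checked with a small simulation. This is a minimal sketch that assumes a finished load can wait between machines (for example in a basket), so a machine is never blocked by the one after it; the function names are illustrative.

```python
STAGE_MINUTES = [30, 40, 20]  # washer, dryer, "folder"


def sequential_minutes(loads, stages=STAGE_MINUTES):
    """Each load finishes all stages before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(loads, stages=STAGE_MINUTES):
    """Each load enters a stage as soon as it and the stage are both free."""
    stage_free_at = [0] * len(stages)  # when each machine next becomes idle
    for _ in range(loads):
        ready = 0  # when this load is ready for its next stage
        for s, duration in enumerate(stages):
            start = max(ready, stage_free_at[s])
            stage_free_at[s] = start + duration
            ready = stage_free_at[s]
    return stage_free_at[-1]


if __name__ == "__main__":
    print(sequential_minutes(4) / 60, "hours sequential")
    print(pipelined_minutes(4) / 60, "hours pipelined")
```

With four loads the dryer, the slowest stage, is busy back to back, which is why the total is 30 + 4 x 40 + 20 minutes.
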
22. Pipelining Lessons
[Figure: pipelined laundry timeline, tasks A to D, 6 PM to 9 PM, stage times 30, 40, and 20 minutes]
Pipelining doesn’t help latency of single task, it helps throughput of entire workload
Pipeline rate limited by slowest pipeline stage
Multiple tasks operating simultaneously using different resources
Potential speedup = number of pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to “fill” pipeline and time to “drain” it reduce speedup
Stall for dependences
23. Basic Datapath
What do we need to add to actually split the datapath into stages?
25. The Five Stages of the Load Instruction
[Figure: Load occupies Ifetch, Reg/Dec, Exec, Mem, Wr in cycles 1 to 5]
Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
Reg/Dec: Register Fetch and Instruction Decode
Exec: Calculate the memory address
Mem: Read the data from the Data Memory
Wr: Write the data back to the register file
As shown here, each of these five steps takes one clock cycle to complete. In pipeline terminology, each step is referred to as one stage of the pipeline.
26. Pipelined Execution
[Figure: six instructions in program flow against time, each passing through IFetch, Dcd, Exec, Mem, WB and each starting one cycle after the previous one]
On a processor, multiple instructions are in various stages at the same time.
Assume each instruction takes five cycles
27. Single Cycle, Multiple Cycle, vs. Pipeline
Here are the timing diagrams showing the differences between the single cycle, multiple cycle, and pipeline implementations. For example, in the pipeline implementation, we can finish executing the Load, Store, and R-type instruction sequence in seven cycles. In the multiple clock cycle implementation, however, we cannot start executing the store until Cycle 6 because we must wait for the load instruction to complete. Similarly, we cannot start the execution of the R-type instruction until the store instruction has completed its execution in Cycle 9. In the Single Cycle implementation, the cycle time is set to accommodate the longest instruction, the Load instruction. Consequently, the cycle time for the Single Cycle implementation can be five times longer than that of the multiple cycle implementation. Perhaps more importantly, since the cycle time has to be long enough for the load instruction, it is too long for the store instruction, so the last part of the store's cycle is wasted.
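
The cycle numbers quoted on this slide can be reproduced with a short sketch. It assumes the multicycle design spends 5 cycles on a load and 4 on a store or R-type instruction, and that the pipeline is ideal (five stages, no stalls); the names are illustrative.

```python
# Assumed cycles per instruction in the multicycle design.
MULTICYCLE_CYCLES = {"load": 5, "store": 4, "r-type": 4}
PIPELINE_STAGES = 5


def multicycle_schedule(program):
    """Return (start_cycle, end_cycle) per instruction, cycles numbered from 1."""
    schedule = []
    next_start = 1
    for kind in program:
        end = next_start + MULTICYCLE_CYCLES[kind] - 1
        schedule.append((next_start, end))
        next_start = end + 1  # the next instruction waits for this one
    return schedule


def pipelined_total_cycles(program):
    """Ideal pipeline: one instruction enters per cycle, no stalls."""
    if not program:
        return 0
    return len(program) + PIPELINE_STAGES - 1


if __name__ == "__main__":
    program = ["load", "store", "r-type"]
    print(multicycle_schedule(program))
    print(pipelined_total_cycles(program))
```

For the Load, Store, R-type sequence the store occupies cycles 6 to 9 in the multicycle design, while the ideal pipeline finishes all three instructions in seven cycles.
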
28. Graphically Representing Pipelines
Can help with answering questions like:
  How many cycles does it take to execute this code?
  What is the ALU doing during cycle 4?
  Are two instructions trying to use the same resource at the same time?
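
The same questions can be answered programmatically for an ideal pipeline. The sketch below assumes five stages, one instruction issued per cycle, and no stalls; the stage names follow the earlier slides and the helper names are hypothetical.

```python
STAGES = ["IFetch", "Dcd", "Exec", "Mem", "WB"]


def stage_of(instr_index, cycle):
    """Stage held by instruction instr_index (0-based) in cycle (1-based), or None."""
    position = cycle - 1 - instr_index
    if 0 <= position < len(STAGES):
        return STAGES[position]
    return None


def total_cycles(num_instructions):
    """How many cycles does it take to execute this code?"""
    if num_instructions == 0:
        return 0
    return num_instructions + len(STAGES) - 1


def instruction_in_stage(stage, cycle, num_instructions):
    """Which instruction (if any) is using a given stage during a given cycle?"""
    for i in range(num_instructions):
        if stage_of(i, cycle) == stage:
            return i
    return None


def render(num_instructions):
    """Draw one row per instruction and one column per clock cycle."""
    rows = []
    for i in range(num_instructions):
        cells = [
            (stage_of(i, c) or "").ljust(7)
            for c in range(1, total_cycles(num_instructions) + 1)
        ]
        rows.append(f"I{i}: " + "".join(cells).rstrip())
    return "\n".join(rows)


if __name__ == "__main__":
    print(render(3))
    print("Exec in cycle 4 is used by instruction",
          instruction_in_stage("Exec", 4, 3))
```

Because each instruction is offset by one cycle, no two instructions ever occupy the same stage in this ideal model; resource conflicts appear only when two different stages need the same hardware.
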
31. Observations
Use separate instruction and data memories
  To reduce memory conflicts
  The memory system must deliver 5 times the bandwidth if the pipelined processor has a clock cycle equal to that of the unpipelined processor.
Register file is used in two stages
  One for reading in ID and one for writing in WB
  2 reads and 1 write in every clock cycle
Maintaining PC
  IF stage: preparing the PC for the next instruction
  ID stage: computing the branch target address
  Problem: a branch does not change the PC until the ID stage
32. Why Pipeline?
Suppose
  100 instructions are executed
  The single cycle machine has a cycle time of 45 ns
  The multicycle and pipeline machines have cycle times of 10 ns
  The multicycle machine has a CPI of 4.6
Single Cycle Machine
  45 ns/cycle x 1 CPI x 100 inst = 4500 ns
Multicycle Machine
  10 ns/cycle x 4.6 CPI x 100 inst = 4600 ns
Ideal pipelined machine
  10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
Ideal pipelined vs. single cycle speedup
  4500 ns / 1040 ns = 4.33
What has not yet been considered?
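
The arithmetic on this slide can be written out as a short sketch. The default parameter values are the ones given on the slide; the function names are illustrative, and the five-stage pipeline depth is inferred from the 4-cycle drain.

```python
def single_cycle_ns(instructions, cycle_ns=45):
    """One long cycle per instruction (CPI = 1)."""
    return cycle_ns * instructions


def multicycle_ns(instructions, cycle_ns=10, cpi=4.6):
    """Short cycles, but several of them per instruction."""
    return cycle_ns * cpi * instructions


def ideal_pipeline_ns(instructions, cycle_ns=10, stages=5):
    """One instruction completes per cycle, plus (stages - 1) cycles to drain."""
    return cycle_ns * (instructions + stages - 1)


if __name__ == "__main__":
    n = 100
    single = single_cycle_ns(n)
    multi = multicycle_ns(n)
    piped = ideal_pipeline_ns(n)
    print(f"single cycle: {single} ns")
    print(f"multicycle:   {multi:.0f} ns")
    print(f"pipelined:    {piped} ns")
    print(f"speedup over single cycle: {single / piped:.2f}")
```

The model ignores hazards and stalls entirely, which is one answer to the question the slide closes with.
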
33. Compare Performance
Compare: single-cycle, multicycle, and pipelined control using SPECint2000
Single-cycle: memory access = 200 ps, ALU = 100 ps, register file read and write = 50 ps
  Clock cycle = 200 + 50 + 100 + 200 + 50 = 600 ps
Multicycle: 25% loads, 10% stores, 11% branches, 2% jumps, 52% ALU
  CPI = 4.12, clock cycle = 200 ps (longest functional unit)
Pipelined
  Loads: 1 clock cycle when there is no load-use dependence, 2 when there is; average 1.5 per load
  Stores and ALU instructions take 1 clock cycle
  Branches: 1 when predicted correctly and 2 when not; average 1.25
  Jumps: 2
  CPI = 1.5 x 25% + 1 x 10% + 1 x 52% + 1.25 x 11% + 2 x 2% = 1.17
Average instruction time: single-cycle = 600 ps, multicycle = 4.12 x 200 = 824 ps, pipelined = 1.17 x 200 = 234 ps
Memory access at 200 ps is the bottleneck. How to improve?
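
The CPI figures here are weighted averages over the instruction mix, which the sketch below recomputes. The pipelined per-class cycle counts are the slide's; the multicycle counts (load 5, store 4, ALU 4, branch 3, jump 3) are an assumption taken from the five execution steps, and the names are illustrative.

```python
MIX = {"load": 0.25, "store": 0.10, "branch": 0.11, "jump": 0.02, "alu": 0.52}

# Assumed cycles per class in the multicycle design.
MULTICYCLE_CYCLES = {"load": 5, "store": 4, "branch": 3, "jump": 3, "alu": 4}

# Average cycles per class in the pipelined design, as stated on the slide.
PIPELINED_CYCLES = {"load": 1.5, "store": 1, "branch": 1.25, "jump": 2, "alu": 1}

CLOCK_PS = 200  # multicycle and pipelined clock (longest functional unit)
SINGLE_CYCLE_PS = 200 + 50 + 100 + 200 + 50  # the full load path


def average_cpi(cycles_per_class):
    """Weight each class's cycle count by its share of the instruction mix."""
    return sum(MIX[kind] * cycles_per_class[kind] for kind in MIX)


if __name__ == "__main__":
    multi_cpi = average_cpi(MULTICYCLE_CYCLES)
    pipe_cpi = average_cpi(PIPELINED_CYCLES)
    print(f"single-cycle: {SINGLE_CYCLE_PS} ps per instruction")
    print(f"multicycle:   CPI {multi_cpi:.2f}, {multi_cpi * CLOCK_PS:.0f} ps")
    print(f"pipelined:    CPI {pipe_cpi:.2f}, {pipe_cpi * CLOCK_PS:.1f} ps")
```

The unrounded pipelined CPI is 1.1725, so the 234 ps on the slide comes from rounding the CPI to 1.17 before multiplying.
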
34. Can pipelining get us into trouble?
Yes: Pipeline Hazards
  structural hazards: attempt to use the same resource two different ways at the same time
    E.g., two instructions try to read the same memory at the same time
  data hazards: attempt to use item before it is ready
    instruction depends on result of prior instruction still in the pipeline
      add r1, r2, r3
      sub r4, r2, r1
  control hazards: attempt to make a decision before condition is evaluated
    branch instructions
      beq r1, loop
Can always resolve hazards by waiting
  pipeline control must detect the hazard
  take action (or delay action) to resolve hazards
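
Detecting the data hazard in the add/sub pair comes down to comparing register names. The following is a minimal sketch that assumes a three-operand assembly syntax with the destination register written first; `parse` and `has_raw_hazard` are hypothetical helpers, not part of any real tool.

```python
def parse(instruction):
    """Split 'op dest, src1, src2' into (op, dest, [sources])."""
    op, operands = instruction.split(maxsplit=1)
    registers = [name.strip() for name in operands.split(",")]
    return op, registers[0], registers[1:]


def has_raw_hazard(earlier, later):
    """True if `later` reads the register that `earlier` writes (read after write)."""
    _, dest, _ = parse(earlier)
    _, _, sources = parse(later)
    return dest in sources


if __name__ == "__main__":
    print(has_raw_hazard("add r1, r2, r3", "sub r4, r2, r1"))
    print(has_raw_hazard("add r1, r2, r3", "sub r4, r2, r5"))
```

Here `sub` reads `r1` before `add` has written it back, which is exactly the dependence the pipeline control has to detect.
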
35. Single Memory is a Structural Hazard
[Figure: Load followed by Instr 1 to Instr 4 in instruction order against time in clock cycles, each passing through Mem, Reg, ALU, Mem, Reg; right half highlighted means read, left half write]
Detection is easy in this case!
36. Structural Hazards Limit Performance
Example: if 1.3 memory accesses per instruction and only one memory access per cycle then
  average CPI is at least 1.3
  otherwise resource is more than 100% utilized
Solution 1: Use separate instruction and data memories
Solution 2: Allow memory to read and write more than one word per cycle
Solution 3: Stall
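
The CPI bound in this example follows from dividing demand by capacity. Below is a small sketch under the assumption that memory can serve a fixed number of accesses per cycle (`accesses_per_cycle`, a made-up parameter) and that an ideal pipeline never goes below one cycle per instruction.

```python
def min_cpi(memory_accesses_per_instruction, accesses_per_cycle=1):
    """Lower bound on CPI imposed by memory bandwidth.

    Memory cannot be more than 100% utilized, so each instruction needs at
    least (accesses / capacity) cycles; an ideal pipeline needs at least 1.
    """
    return max(1.0, memory_accesses_per_instruction / accesses_per_cycle)


if __name__ == "__main__":
    print(min_cpi(1.3))                        # single shared memory
    print(min_cpi(1.3, accesses_per_cycle=2))  # e.g. two accesses per cycle
```

Raising the number of accesses served per cycle, whether through separate memories or a wider one, removes this particular bound.
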
37. Control Hazard Solutions
Stall: wait until decision is clear
  It's possible to move up the decision to the 2nd stage by adding hardware to check registers as they are being read
  Impact: 2 clock cycles per branch instruction => slow
[Figure: pipeline diagram for Add, Beq, Load in instruction order against time in clock cycles]
38. Control Hazard Solutions
Predict: guess one direction then back up if wrong
  Predict not taken
  Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right 50% of the time)
  More dynamic scheme: history of 1 branch (right about 90% of the time)
[Figure: pipeline diagram for Add, Beq, Load in instruction order against time in clock cycles]
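
The impact of prediction accuracy on branch cost is a simple expected value. The sketch below uses the slide's costs (1 cycle when right, 2 when wrong) and its two accuracy figures; the function name is illustrative.

```python
def cycles_per_branch(prediction_accuracy, hit_cycles=1, miss_cycles=2):
    """Expected clock cycles per branch for a given prediction accuracy (0 to 1)."""
    miss_rate = 1 - prediction_accuracy
    return prediction_accuracy * hit_cycles + miss_rate * miss_cycles


if __name__ == "__main__":
    for label, accuracy in [("predict not taken", 0.5), ("1-branch history", 0.9)]:
        print(f"{label}: {cycles_per_branch(accuracy):.2f} cycles per branch")
```

Always stalling corresponds to an accuracy of zero in this model, which gives the 2 cycles per branch of the stall solution.
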
39. Control Hazard Solutions
Redefine branch behavior (takes place after next instruction): “delayed branch”
  Impact: 1 clock cycle per branch instruction if an instruction can be found to put in the “slot” (about 50% of the time)
  Launching more instructions per clock cycle makes the delayed branch less useful
[Figure: pipeline diagram for Add, Beq, Misc, Load in instruction order against time in clock cycles]
41. Data Hazard: Forwarding
Use temporary results, don’t wait for them to be written
  register file forwarding to handle read/write to same register
  ALU forwarding
42. Can’t always forward
Load word can still cause a hazard:
  an instruction tries to read a register following a load instruction that writes to the same register.
Thus, we need a hazard detection unit to “stall” the instruction that follows the load
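
The check a hazard detection unit performs for this case can be sketched as follows. This is a simplified software model, not the hardware itself: it assumes forwarding resolves every other data hazard, so one bubble after a load is enough, and the `Instr` record and function names are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    op: str
    dest: Optional[str]          # register written, if any
    sources: Tuple[str, ...]     # registers read


NOP = Instr("nop", None, ())


def needs_load_use_stall(in_ex, in_id):
    """True when the instruction ahead is a load whose destination register
    is read by the instruction immediately behind it."""
    return in_ex.op == "lw" and in_ex.dest is not None and in_ex.dest in in_id.sources


def insert_stalls(program):
    """Return the program with one bubble after each load that is used immediately."""
    scheduled = []
    for instr in program:
        if scheduled and needs_load_use_stall(scheduled[-1], instr):
            scheduled.append(NOP)
        scheduled.append(instr)
    return scheduled


if __name__ == "__main__":
    program = [
        Instr("lw", "r1", ("r2",)),
        Instr("add", "r3", ("r1", "r4")),   # uses r1 right after the load
        Instr("sub", "r5", ("r3", "r4")),   # forwarding covers this dependence
    ]
    for instr in insert_stalls(program):
        print(instr.op, instr.dest, instr.sources)
```

In hardware the bubble is created by holding the dependent instruction in its current stage for a cycle rather than by inserting an explicit instruction.
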
43. Stalling
We can stall the pipeline by keeping an instruction in the same stage
44. Designing Instruction Set for Pipelining
MIPS instructions are all the same length.
  This makes it much easier to fetch instructions in the first stage and to decode them in the second stage.
  In IA-32, instructions vary from 1 to 17 bytes, so pipelining is considerably more challenging.
  All recent IA-32 architectures translate instructions into micro-operations, which are pipelined
MIPS has only a few instruction formats, with the source register fields in the same place
  This symmetry means the second stage can read the register file at the same time that the hardware is decoding the type of instruction
  Without this symmetry we would need to split stage 2
45. Designing Instruction Set for Pipelining
Memory operands appear only in loads and stores in MIPS
  We can use the execute stage to calculate the memory address and then access memory in the following stage
  If we could operate on operands in memory, stages 3 and 4 would expand to an address stage, a memory stage, and then an execute stage
Operands must be aligned in memory
  No single data transfer instruction requires two data memory accesses; the transfer can be done in a single pipeline stage
46. Summary: Pipelining
What makes it easy?
  all instructions are the same length
  just a few instruction formats
  memory operands appear only in loads and stores
What makes it hard?
  structural hazards: suppose we had only one memory
  control hazards: need to worry about branch instructions
  data hazards: an instruction depends on a previous instruction
We’ll talk about modern processors and what really makes it hard:
  exception handling
  trying to improve performance with out-of-order execution, etc.
47. Summary
Pipelining is a fundamental concept
  multiple steps using distinct resources
Utilize capabilities of the datapath by pipelined instruction processing
  start next instruction while working on the current one
  limited by length of longest stage (plus fill/flush)
  detect and resolve hazards