Presentation on theme: "Out-of-Order Execution & Register Renaming"— Presentation transcript:
1 Out-of-Order Execution & Register Renaming
Asanovic/Devadas, Spring 2002, 6.823
Krste Asanovic, Laboratory for Computer Science, Massachusetts Institute of Technology
2 Scoreboard for In-order Issue
Busy[unit#]: a bit-vector to indicate each unit's availability (unit = Int, Add, Mult, Div). These bits are hardwired to the FUs.
WP[reg#]: a bit-vector to record the registers for which writes are pending.
Issue checks the instruction (opcode dest src1 src2) against the scoreboard (Busy & WP) to dispatch:
FU available? not Busy[FU#]
RAW? WP[src1] or WP[src2]
WAR? cannot arise
WAW? WP[dest]
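The issue check on this slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the class name `Scoreboard` and its methods are hypothetical, one unit of each kind is modeled, and write-back frees the unit and clears the pending write in the same step.

```python
from dataclasses import dataclass, field


@dataclass
class Scoreboard:
    # Busy[unit#]: one bit per functional unit (hypothetical single unit of each kind)
    busy: dict = field(default_factory=lambda: {"Int": False, "Add": False,
                                                "Mult": False, "Div": False})
    # WP[reg#]: registers with a write pending
    wp: set = field(default_factory=set)

    def can_issue(self, fu, dest, src1, src2):
        """Return (ok, reason) for an instruction 'opcode dest src1 src2'."""
        if self.busy[fu]:
            return False, "structural"      # FU not available
        if src1 in self.wp or src2 in self.wp:
            return False, "RAW"             # a source is still being produced
        if dest in self.wp:
            return False, "WAW"             # an older write to dest is pending
        return True, "ok"                   # WAR cannot arise with in-order issue

    def issue(self, fu, dest):
        self.busy[fu] = True
        self.wp.add(dest)

    def write_back(self, fu, dest):
        self.busy[fu] = False
        self.wp.discard(dest)
```

For example, after `issue("Mult", "F6")`, an instruction that reads F6 is held back with reason "RAW" until `write_back("Mult", "F6")` is called.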
3 Out-of-Order Dispatch
[Figure: pipeline with IF, ID, and Issue stages feeding the ALU, Mem, Fadd, and Fmul units, followed by WB]
• Issue stage buffer holds multiple instructions waiting to issue.
• Decode adds next instruction to buffer if there is space and the instruction does not cause a WAR or WAW hazard.
• Any instruction in buffer whose RAW hazards are satisfied can be dispatched (for now, at most one dispatch per cycle). On a write back (WB), new instructions may get enabled.
4 Out-of-Order Issue: an example
#  instruction           latency
1  LD    F2, 34(R2)      1
2  LD    F4, 45(R3)      long
3  MULTD F6, F4, F2      3
4  SUBD  F8, F2, F2      1
5  DIVD  F4, F2, F8      4
6  ADDD  F10, F6, F4     1
[Figure: dependence graph over instructions 1 to 6]
In-order: 1 (2,1)
Out-of-order: 1 (2,1)
Out-of-order did not allow any significant improvement!
5 How many Instructions can be in the pipeline
Which features of an ISA limit the number of instructions in the pipeline?
Which features of a program limit the number of
6 Overcoming the Lack of Register Names
Number of registers in an ISA limits the number of partially executed instructions in complex pipelines.
Floating Point pipelines often cannot be kept filled with a small number of registers.
IBM 360 had only 4 Floating Point Registers.
Can a microarchitecture use more registers than specified by the ISA without loss of ISA compatibility?
Robert Tomasulo of IBM suggested an ingenious solution in 1967 based on on-the-fly register renaming.
7 Instruction-Level Parallelism with Renaming
#  instruction           latency
1  LD    F2, 34(R2)      1
2  LD    F4, 45(R3)      long
3  MULTD F6, F4, F2      3
4  SUBD  F8, F2, F2      1
5  DIVD  F4’, F2, F8     4
6  ADDD  F10, F6, F4’    1
[Figure: dependence graph over instructions 1 to 6, with one edge crossed out (X)]
In-order: 1 (2,1) (5,3)
Out-of-order: 1 (2,1) (3,5) 3 6 6
Any antidependence can be eliminated by renaming.
renaming ⇒ additional storage
Can it be done in hardware? Yes!
8 Register Renaming
[Figure: pipeline with IF, ID, and ROB stages feeding the ALU, Mem, Fadd, and Fmul units, followed by WB]
• Decode does register renaming and adds instructions to the issue stage reorder buffer (ROB).
⇒ renaming makes WAR or WAW hazards impossible
• Any instruction in ROB whose RAW hazards have been satisfied can be dispatched.
⇒ Out-of-order or dataflow execution
9 Renaming & Out-of-order Issue: An example
[Figure: renaming table with entries F1 to F8, each holding a presence bit p and either data or a tag ti; reorder buffer with columns Ins#, use, exec, op, p1, src1, p2, src2 and entries t1, t2, ...]
1 LD F2, 34(R2)
2 LD F4, 45(R3)
3 MULTD F6, F4, F2
4 SUBD F8, F2, F2
5 DIVD F4, F2, F8
6 ADDD F10, F6, F4
• When are names in sources replaced by data?
• When can a name be reused?
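As a rough illustration of what decode-time renaming does to this six-instruction sequence, the sketch below gives every destination a fresh tag and rewrites each source to the tag of its most recent producer. The function name `rename`, the tuple encoding, and the tag names t1, t2, ... are assumptions made for the example; loads are simplified to a single base-register source, and presence bits and data capture are not modeled.

```python
from itertools import count


def rename(program):
    """program: list of (op, dest, src1, src2) using architectural names.

    Returns the same instructions with each destination replaced by a
    fresh tag and each source replaced by its latest producer's tag.
    """
    table = {}                                   # architectural register -> latest tag
    tags = (f"t{i}" for i in count(1))
    renamed = []
    for op, dest, src1, src2 in program:
        # Sources are looked up before the destination mapping is updated.
        s1 = table.get(src1, src1)
        s2 = table.get(src2, src2)
        tag = next(tags)
        table[dest] = tag
        renamed.append((op, tag, s1, s2))
    return renamed


program = [
    ("LD",    "F2",  "R2", None),
    ("LD",    "F4",  "R3", None),
    ("MULTD", "F6",  "F4", "F2"),
    ("SUBD",  "F8",  "F2", "F2"),
    ("DIVD",  "F4",  "F2", "F8"),
    ("ADDD",  "F10", "F6", "F4"),
]
renamed = rename(program)
for ins in renamed:
    print(ins)
```

In the result, MULTD reads t2 (the first F4) while DIVD writes t5 (the second F4), so the WAR hazard between instructions 3 and 5 no longer exists, and ADDD picks up t5 rather than t2.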
10 Data-Driven Execution
[Figure: renaming table & reg file feeding a reorder buffer (columns Ins#, use, exec, op, p1, src1, p2, src2; entries t1, t2, ..., tn); the Load Unit, Store Unit, and FUs broadcast < t, result >]
Replacing the tag by its value is an expensive operation.
• Instruction template (i.e., tag t) is allocated by the Decode stage, which also stores the tag in the reg file
• When an instruction completes, its tag is deallocated
11 Simplifying Allocation/Deallocation
[Figure: reorder buffer with columns Ins#, use, exec, op, p1, src1, p2, src2 and entries t1 to tn; ptr2 points to the next entry to deallocate, ptr1 to the next available entry]
Instruction buffer is managed circularly
• When an instruction completes, its “use” bit is marked free
• ptr2 is incremented only if the “use” bit is marked free
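A minimal sketch of this circular management follows, assuming a fixed-size buffer and an explicit occupancy counter to tell full from empty (the slide does not say how that is done). The class name `CircularBuffer` and its methods are hypothetical.

```python
class CircularBuffer:
    def __init__(self, size):
        self.use = [False] * size   # "use" bit per slot
        self.ptr1 = 0               # next available slot
        self.ptr2 = 0               # next slot to deallocate
        self.count = 0              # slots currently between ptr2 and ptr1

    def allocate(self):
        """Claim the slot at ptr1, or return None if the buffer is full."""
        if self.count == len(self.use):
            return None             # decode would have to stall
        slot = self.ptr1
        self.use[slot] = True
        self.ptr1 = (self.ptr1 + 1) % len(self.use)
        self.count += 1
        return slot

    def complete(self, slot):
        """Mark a slot free; completion may happen out of order."""
        self.use[slot] = False
        # ptr2 advances only while the slot it points at is marked free.
        while self.count > 0 and not self.use[self.ptr2]:
            self.ptr2 = (self.ptr2 + 1) % len(self.use)
            self.count -= 1
```

Note that a slot freed in the middle of the buffer cannot be reused until every older slot has also been freed, which is the price of the simple two-pointer scheme.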
12 IBM 360/91 Floating Point Unit
R. M. Tomasulo, 1967
[Figure: load buffers (from memory) and the Floating Point Reg file, each entry holding p and data; instruction templates are distributed by functional unit to the Adder and Multiplier, each entry holding p and data for both operands; store buffers (to memory); all results are broadcast as < t, result >]
Common bus ensures that data is made available immediately to all the instructions waiting for it.
13 Effectiveness?
Renaming and out-of-order execution were first implemented in 1969 in the IBM 360/91 but did not show up in subsequent models until the mid-nineties.
Why?
Reasons:
1. Exceptions not precise!
2. Effective on a very small class of programs
One more problem needed to be solved
14 Precise Interrupts
It must appear as if an interrupt is taken between two instructions (say I_i and I_{i+1})
• the effect of all instructions up to and including I_i is totally complete
• no effect of any instruction after I_i has taken place
The interrupt handler either aborts the program or restarts it at I_{i+1}.
15 Effect on Interrupts: Out-of-order Completion
1 LD f6, f6, f4
2 LD f2, 45(r3)
3 MULTD f0, f2, f4
4 SUBD f8, f6, f2
5 DIVD f10, f0, f6
6 ADDD f6, f8, f2
Out-of-order comp
restore f10
Consider interrupts
Precise interrupts are difficult to implement at high speed: we want to start execution of later instructions before exception checks have finished on earlier instructions.
16 Exception Handling (In-Order Five-Stage Pipeline)
[Figure: five-stage pipeline (PC, Inst. Mem, Decode, E, Data Mem, W) with the commit point at the M stage; exception sources are PC Address Exceptions, Illegal Opcode, Overflow, and Data Address Exceptions; exception flags (Exc D, Exc E, Exc M) and PCs (PC D, PC E, PC M) travel down the pipeline to the Cause and EPC registers; Kill F Stage, Kill D Stage, Kill E Stage, and Kill Writeback signals; Select Handler PC; Asynchronous Interrupts enter at the commit point]
• Hold exception flags in pipeline until commit point (M stage)
• Exceptions in earlier pipe stages override later exceptions
• Inject external interrupts at commit point (override others)
• If exception at commit: update Cause and EPC registers, kill all stages, inject handler PC into fetch stage
17 Phases of Instruction Execution
[Figure: PC, I-cache, Fetch Buffer, Issue Buffer, Func. Units, Result Buffer, Arch. State, in pipeline order]
Fetch: Instruction bits retrieved from cache.
Decode: Instructions placed in appropriate issue stage buffer (sometimes called “issue” or “dispatch”).
Execute: Instructions and operands sent to execution units (sometimes called “issue” or “dispatch”). When execution completes, all results and exception flags are available.
Commit: Instruction irrevocably updates architectural state (sometimes called “graduation” or “completion”).
18 In-Order Commit for Precise Exceptions
• Instructions fetched and decoded into instruction reorder buffer in-order
• Execution is out-of-order (⇒ out-of-order completion)
• Commit (write-back to architectural state, regfile+memory) is in-order
Temporary storage needed to hold results before commit (shadow registers and store buffers)
[Figure: Fetch and Decode (in-order), Reorder Buffer, Execute (out-of-order), Commit (in-order); on "Exception?" at commit, Kill signals go to the earlier stages and the handler PC is injected into Fetch]
19 Extensions for Precise Exceptions
[Figure: instruction reorder buffer with columns Inst#, use, exec, op, p1, src1, p2, src2, pd, dest, data, cause; ptr2 points to the next entry to commit, ptr1 to the next available entry]
• add <pd, dest, data, cause> fields in the instruction template
• commit instructions to reg file and memory in program order ⇒ buffers can be maintained circularly
• on exception, clear reorder buffer by resetting ptr1=ptr2 (stores must wait for commit before updating memory)
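The commit rule above can be sketched as follows. This is a simplified model, not the lecture's design: the reorder buffer is a Python `deque` in program order, each entry is a dict with hypothetical keys `done`, `dest`, `data`, and `cause`, and only register writes are modeled (no store buffer).

```python
from collections import deque


def commit_in_order(rob, regfile):
    """Commit finished entries from the head of the reorder buffer.

    Stops at the first unfinished entry. If the head entry carries an
    exception cause, the whole buffer is cleared (the analogue of
    resetting ptr1 = ptr2) and the cause is returned.
    """
    while rob and rob[0]["done"]:
        head = rob[0]
        if head["cause"] is not None:
            cause = head["cause"]
            rob.clear()                 # discard all uncommitted work
            return cause
        regfile[head["dest"]] = head["data"]   # architectural state updated here only
        rob.popleft()
    return None
```

Because results reach `regfile` only at the head of the buffer, a younger instruction that finished early leaves no trace if an older one raises an exception.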
20 Rollback and Renaming
[Figure: Register File (now holds only committed state); reorder buffer with entries t1, t2, ..., tn; Load Unit, Store Unit, and FUs broadcast < t, result >; Commit writes from the reorder buffer to the Register File]
Register file does not contain renaming tags any more.
How does the decode stage find the tag of a source register?
21 Renaming Table
[Figure: Rename Table with entries r1, r2, ... each holding a tag ti and a valid bit vi, alongside the Register File; reorder buffer with entries t1, t2, ..., tn; Load Unit, Store Unit, and FUs broadcast < t, result >; Commit]
Renaming table is like a cache to speed up register name lookup (rename tag + valid bit per entry). It needs to be cleared after each exception taken. When else are valid bits cleared?
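One way to picture the table as a cache is the sketch below. The class and method names are hypothetical, and the commit rule shown (clear the valid bit only when the committing instruction's tag is still the one recorded) is offered as one plausible answer to the slide's question, not as the lecture's stated answer.

```python
class RenameTable:
    def __init__(self):
        self.entries = {}            # register -> tag; present only while valid

    def set(self, reg, tag):
        """Decode records the tag of the newest in-flight writer of reg."""
        self.entries[reg] = tag

    def lookup(self, reg):
        """Return ('tag', t) on a valid entry, else fall back to the register file."""
        if reg in self.entries:
            return ("tag", self.entries[reg])
        return ("regfile", reg)

    def on_commit(self, reg, tag):
        # Assumed rule: invalidate only if no younger writer has replaced the tag.
        if self.entries.get(reg) == tag:
            del self.entries[reg]

    def on_exception(self):
        self.entries.clear()         # all in-flight work is discarded
```

With two in-flight writers of F4 tagged t2 and t5, committing t2 leaves the entry pointing at t5; committing t5 then clears it, and later readers go to the register file.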
22 Effect of Control Transfer on Pipelined Execution
Control transfer instructions require insertion of bubbles in the pipeline.
The number of bubbles depends upon the number of cycles it takes
• to determine the next instruction address, and
• to fetch the next instruction
23 Branch Penalty
[Figure: PC, I-cache, Fetch Buffer, Issue Buffer, Func. Units, Result Buffer, Arch. State, with the Fetch, Decode, Execute, and Commit phases marked; the branch penalty spans from "Next fetch started" at the PC to "Branch executed" at the Execute phase]
24 Branch Penalties in Modern Pipelines
UltraSPARC-III instruction fetch pipeline stages (in-order issue, 4-way superscalar, 750MHz, 2000)
Fetch stages:
A  PC Generation/Mux
P  Instruction Fetch Stage 1
F  Instruction Fetch Stage 2
Decode stages:
B  Branch Address Calc/Begin Decode
I  Complete Decode
J  Steer Instructions to Functional units
R  Register File Read
E  Integer Execute
Remainder of execute pipeline (+ another 6 stages)
Branch penalty: Cycles?________ Instructions?_________
25 Average Run-Length between Branches
Average dynamic instruction mix from SPEC92:
            SPECint92   SPECfp92
ALU         %           %
FPU Add                 %
FPU Mult                %
load        %           %
store       %           %
branch      %           %
other       %           %
SPECint92: compress, eqntott, espresso, gcc, li
SPECfp92: doduc, ear, hydro2d, mdljdp2, su2cor
What is the average run length between branches?
26 Reducing Control Transfer Penalties
Software solution
• loop unrolling: increases the run length
• instruction scheduling: compute the branch condition as early as possible (limited)
Hardware solution
• delay slots: replace pipeline bubbles with useful work (requires software cooperation)
• branch prediction & speculative execution of instructions beyond the branch
27 Branch Prediction
Motivation: branch penalties limit performance of deeply pipelined processors
Modern branch predictors have high accuracy (>95%) and can reduce branch penalties significantly
Required hardware support:
Prediction structures: branch history tables, branch target buffers, etc.
Mispredict recovery mechanisms:
• In-order machines: kill instructions following branch in pipeline
• Out-of-order machines: shadow registers and memory buffers for each speculated branch
28 DLX Branches and Jumps
Instruction   Taken known?       Target known?
BEQZ/BNEZ     After Reg. Fetch   After Inst. Fetch
J             Always Taken       After Inst. Fetch
JR            Always Taken       After Reg. Fetch
Must know (or guess) both target address and whether taken to execute branch/jump.