Outline
- Basic Compiler Techniques for Exposing ILP
- Static Branch Prediction
- Static Multiple Issue: The VLIW Approach
- Hardware Support for Exposing More Parallelism at Compiler Time
- Hardware versus Software Solutions
Basic Pipeline Scheduling and Loop Unrolling
- To keep a pipeline full, parallelism among instructions must be exploited by finding sequences of unrelated instructions that can be overlapped in the pipeline.
- To avoid a pipeline stall, a dependent instruction must be separated from its source instruction.
- A compiler's ability to perform this scheduling depends on both:
  - the amount of ILP available in the program
  - the latencies of the functional units in the pipeline
Basic Pipeline Scheduling and Loop Unrolling (contd.)
- Idea: find sequences of unrelated instructions (no hazards) that can be overlapped in the pipeline to exploit ILP.
- To avoid a stall, a dependent instruction must be separated from its source instruction by a distance in clock cycles equal to the latency of the source instruction.
- [Table: latencies of FP operations used]
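The separation rule can be written as a one-line calculation. The following is a minimal sketch, assuming a single-issue, in-order pipeline that issues one instruction per cycle; the function name `stall_cycles` and its parameters are illustrative and do not come from the slides.

```c
#include <assert.h>

/* Stall cycles seen by a dependent instruction.
 * latency:        intervening cycles the producer needs before its
 *                 result can be used without stalling
 * instrs_between: independent instructions the scheduler placed between
 *                 the producer and the consumer
 * Assumption: single-issue, in-order pipeline, one instruction per cycle. */
int stall_cycles(int latency, int instrs_between)
{
    assert(latency >= 0 && instrs_between >= 0);
    int stalls = latency - instrs_between;
    return stalls > 0 ? stalls : 0;
}
```

With a latency of 2, a consumer placed directly after its producer stalls for two cycles, while two or more independent instructions in between hide the latency completely.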
Consider adding a scalar s to a vector:
    for (i = 1000; i > 0; i = i - 1)
        x[i] = x[i] + s;
In MIPS assembly:
    Loop: L.D    F0,0(R1)     ;F0 = vector element
          ADD.D  F4,F0,F2     ;add scalar from F2
          S.D    F4,0(R1)     ;store result
          DADDUI R1,R1,#-8    ;decrement pointer by 8 bytes (one double word)
          BNE    R1,R2,Loop   ;branch if R1 != R2
Assume R2 is precomputed so that 8(R2) is the last element to operate on.
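Scheduled Loop
                                   Clock cycle issued
    Loop: L.D    F0,0(R1)          1
          DADDUI R1,R1,#-8         2
          ADD.D  F4,F0,F2          3
          stall                    4
          BNE    R1,R2,Loop        5
          S.D    F4,8(R1)          6
- The latency between ADD.D and S.D is 2.
- Overhead: DADDUI, the stall, and BNE.
- At minimum 6 cycles are necessary to execute this sequence. Why? L.D, ADD.D, and S.D form a dependent chain: ADD.D issues in cycle 3 and S.D needs two intervening cycles after it, so S.D cannot issue before cycle 6.
- Moving S.D below DADDUI is not trivial, because R1 has been modified.
  - Common view: the S.D of iteration i depends on the DADDUI of iteration i-1 and therefore cannot be moved.
  - Smarter view: DADDUI adds an immediate, so a solution exists: adjust the S.D offset from 0(R1) to 8(R1).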
Basic Pipeline Scheduling and Unrolling
- To eliminate the 3 overhead clock cycles, we need to get more operations within the loop relative to the number of overhead instructions.
- A simple scheme for increasing the number of instructions relative to the branch and overhead instructions is loop unrolling.
- Unrolling simply replicates the loop body multiple times, adjusting the loop termination code.
- Loop unrolling can also be used to improve scheduling: because it eliminates the branches between the replicated bodies, it allows instructions from different iterations to be scheduled together.
Loop Unrolling: Make the Body Fat
- Three of the six clock cycles per iteration are overhead.
- Get more operations within the loop relative to the number of overhead instructions.
- Loop unrolling: replicate the loop body multiple times and adjust the loop termination code.
- Basic idea:
  - Take n loop bodies and concatenate them into one basic block.
  - Adjust the new termination code.
  - Say n is 4: then modify the R1 pointer in the example by 4 times what it was before.
  - Savings: instead of 4 BNEs and 4 DADDUIs, just one of each, i.e. 75% fewer loop-overhead instructions.
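At the source level, unrolling by 4 looks roughly like the sketch below. It assumes the trip count is a multiple of 4 (true for the 1000-element example) and keeps the 1-based indexing of the original loop, so the array needs n + 1 elements. The function names and the temporaries `t0` to `t3` are illustrative; a compiler performs the equivalent transformation on the assembly code.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: one add per iteration plus one index update and one branch. */
void add_scalar(double *x, size_t n, double s)
{
    for (size_t i = n; i > 0; i--)
        x[i] = x[i] + s;
}

/* Unrolled by 4: four adds per iteration, still one index update and one
 * branch. Assumption: n is a multiple of 4, so no cleanup loop is needed. */
void add_scalar_unrolled4(double *x, size_t n, double s)
{
    assert(n % 4 == 0);
    for (size_t i = n; i > 0; i -= 4) {
        /* Separate temporaries play the role of distinct registers, so the
         * four copies do not constrain each other. */
        double t0 = x[i]     + s;
        double t1 = x[i - 1] + s;
        double t2 = x[i - 2] + s;
        double t3 = x[i - 3] + s;
        x[i]     = t0;
        x[i - 1] = t1;
        x[i - 2] = t2;
        x[i - 3] = t3;
    }
}
```

Grouping the four loads and adds ahead of the four stores mirrors what the scheduler does after unrolling: independent operations from different iterations are placed between a producer and its consumer.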
Summary of Loop Unrolling and Scheduling
- We will look at a variety of hardware and software techniques that allow us to take advantage of instruction-level parallelism to fully utilize the potential of the functional units in a processor.
- The key to most of these techniques is to know when and how the ordering among instructions may be changed.
- This process must be performed in a methodical fashion, either by a compiler or by hardware.
To obtain the final unrolled code we had to make the following decisions and transformations:
- Determine that it is legal to move the instructions, and adjust the offset.
- Determine that unrolling the loop would be useful by finding that the loop iterations are independent.
- Use different registers to avoid unnecessary constraints.
- Eliminate the extra tests and branches.
- Determine that the loads and stores in the unrolled loop can be interchanged.
- Schedule the code, preserving any dependences needed.
Key requirement: an understanding of how one instruction depends on another and how the instructions can be changed or reordered given the dependences.
Limitations on the Gains of Loop Unrolling
- The amount of loop overhead amortized shrinks with each additional unroll.
  - Unrolled 4 times: 2 out of 14 clock cycles are overhead, i.e. 0.5 clock cycles per original iteration.
  - Unrolled 8 times: 0.25 clock cycles per original iteration.
- Growth in code size
  - Large code size is not good for embedded computers.
  - Large code size may increase the cache miss rate.
- Register pressure: a potential shortfall in registers created by aggressive unrolling and scheduling.
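The diminishing return can be checked with a little arithmetic. This sketch assumes each unrolled body carries exactly two overhead instructions (one DADDUI and one BNE) and, for the second function, that the schedule is stall-free with three working instructions (load, add, store) per copy, which matches the 14-cycle figure for unrolling by 4. The function names are illustrative.

```c
#include <assert.h>

/* Overhead cycles charged to each original iteration when the body is
 * unrolled k times: one DADDUI plus one BNE shared by k copies. */
double overhead_per_iteration(int k)
{
    assert(k > 0);
    return 2.0 / k;
}

/* Cycles per array element under the stall-free assumption:
 * 3 working instructions per copy plus 2 shared overhead instructions.
 * This does not hold for the non-unrolled loop, which still has a stall. */
double cycles_per_element(int k)
{
    assert(k > 0);
    return (3.0 * k + 2.0) / k;
}
```

Going from 4 to 8 copies saves only a quarter of a cycle per element, while code size and register demand keep growing with each copy.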
Static Branch Prediction
- Static branch predictors are sometimes used in processors where the expectation is that branch behavior is highly predictable at compile time.
- Static prediction can also be used to assist dynamic predictors.
Static Branch Prediction: Using Compiler Technology
- How can branches be predicted statically? To perform some optimizations, branches must be predicted when the program is compiled.
- There are several methods to statically predict branch behavior:
  - Predict every branch as taken. This scheme has an average misprediction rate equal to the untaken branch frequency.
  - Predict backward-going branches as taken and forward-going branches as not taken. For some programs and compilation systems, the frequency of forward taken branches may be significantly less than 50%, and this scheme will then do better than just predicting all branches as taken.
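The two heuristics can be compared on a recorded branch trace. The sketch below is illustrative only: the `branch_event` record and both function names are assumptions made for this example, and the direction of a branch is taken from the sign of its target offset.

```c
#include <assert.h>
#include <stddef.h>

/* One executed branch. Hypothetical record format for this sketch. */
struct branch_event {
    long target_offset; /* negative: backward branch, positive: forward */
    int  taken;         /* nonzero if the branch was taken at run time  */
};

/* Predict every branch as taken: each untaken branch is a misprediction. */
size_t mispredicts_always_taken(const struct branch_event *ev, size_t n)
{
    size_t miss = 0;
    for (size_t i = 0; i < n; i++)
        if (!ev[i].taken)
            miss++;
    return miss;
}

/* Predict backward branches taken and forward branches not taken. */
size_t mispredicts_backward_taken(const struct branch_event *ev, size_t n)
{
    size_t miss = 0;
    for (size_t i = 0; i < n; i++) {
        int predict_taken = ev[i].target_offset < 0;
        int was_taken = ev[i].taken != 0;
        if (predict_taken != was_taken)
            miss++;
    }
    return miss;
}
```

The direction-based scheme wins whenever forward branches are taken less than half the time, which is the condition stated above.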
Static Branch Prediction: Using Compiler Technology
- Profile-based predictor: use profile information collected from earlier runs.
  - Simplest is the Basket bit idea.
  - Easily extends to use more bits.
  - Definite win for some regular applications.
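One plausible reading of the simplest profile-based scheme is a single prediction bit per branch, set to the direction the branch took more often in the profiling run. The sketch below illustrates that reading only; the `branch_profile` record and the function names are assumptions for this example.

```c
#include <assert.h>

/* Counts collected for one static branch during a profiling run.
 * Hypothetical record format for this sketch. */
struct branch_profile {
    unsigned long taken;
    unsigned long not_taken;
};

/* One bit per branch: 1 = predict taken, 0 = predict not taken.
 * Ties are resolved as "not taken" here, an arbitrary choice. */
int profile_bit(struct branch_profile p)
{
    return p.taken > p.not_taken;
}

/* Mispredictions if a later run behaves exactly like the profiled run:
 * the count of the less frequent direction. */
unsigned long profile_mispredicts(struct branch_profile p)
{
    return profile_bit(p) ? p.not_taken : p.taken;
}
```

Using more bits would let the compiler also record how strongly biased a branch is, not just its preferred direction.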
Static Branch Prediction: Using Compiler Technology
- Useful for:
  - Scheduling instructions when the branch delays are exposed by the architecture (either delayed or canceling branches)
  - Assisting dynamic predictors
  - Determining which code paths are more frequent, a key step in code scheduling
Overview
- The compiler does most of the work of finding and scheduling instructions for parallel execution.
- Superscalar processors decide on the fly how many instructions to issue.
- A statically scheduled superscalar must check for:
  - any dependences between instructions in the issue packet
  - any dependences on instructions already in the pipeline
- A statically scheduled superscalar requires significant compiler assistance to achieve good performance.
- In contrast, a dynamically scheduled superscalar requires less compiler assistance, but has significant hardware costs.
Overview (contd.)
- An alternative to the superscalar approach is to rely on compiler technology to:
  - minimize the potential hazard stalls
  - actually format the instructions in a potential issue packet so that the HW need not check explicitly for dependences
- The compiler either ensures that dependences within the issue packet cannot be present, or indicates when a dependence may occur.
- This approach offers the potential advantage of simpler hardware while still exhibiting good performance through extensive compiler technology.
- This architectural approach was named VLIW (Very Long Instruction Word).
Basic VLIW
- A VLIW uses multiple, independent functional units.
- A VLIW packages multiple independent operations into one very long instruction.
- The burden of choosing and packaging independent operations falls on the compiler.
- The hardware that a superscalar needs to make these issue decisions is unneeded.
- This advantage increases as the maximum issue rate grows.
- Here we consider a VLIW processor whose instructions contain 5 operations: 1 integer (or branch), 2 FP, and 2 memory references.
- The mix depends on the available FUs and the frequency of each kind of operation.
Basic VLIW (contd.)
- VLIW depends on enough parallelism to keep the FUs busy.
- This parallelism is uncovered by loop unrolling and then scheduling the code within the single larger loop body.
- If the unrolling generates straight-line code, local scheduling techniques, which operate on a single basic block, can be used.
- If finding and exploiting the parallelism requires scheduling code across branches, a more complex global scheduling algorithm must be used.
VLIW Problems: Technical
- Increase in code size
  - Ambitious loop unrolling
  - Whenever instructions are not full, the unused FUs translate to wasted bits in the instruction encoding.
  - An instruction may need to be left completely empty if no operation can be scheduled.
  - Mitigation: clever encoding, or compress/decompress.
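The cost of unused slots can be made concrete by counting them. This is a minimal sketch assuming the 5-operation instruction format described earlier (1 integer or branch, 2 FP, 2 memory); the `vliw_word` layout and the function names are illustrative and do not describe any real encoding.

```c
#include <assert.h>
#include <stddef.h>

enum { VLIW_SLOTS = 5 }; /* 1 integer/branch, 2 FP, 2 memory (assumed order) */

/* One very long instruction: a flag per slot, 1 = operation, 0 = empty. */
struct vliw_word {
    int used[VLIW_SLOTS];
};

/* Number of empty slots across n instructions. */
size_t wasted_slots(const struct vliw_word *code, size_t n)
{
    size_t wasted = 0;
    for (size_t i = 0; i < n; i++)
        for (int s = 0; s < VLIW_SLOTS; s++)
            if (!code[i].used[s])
                wasted++;
    return wasted;
}

/* Fraction of slots that hold a real operation. */
double slot_utilization(const struct vliw_word *code, size_t n)
{
    assert(n > 0);
    size_t total = n * VLIW_SLOTS;
    return (double)(total - wasted_slots(code, n)) / (double)total;
}
```

A completely empty instruction still occupies a full word in an uncompressed encoding, which is why compressed formats are attractive.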
VLIW Problems: Logistical
- Synchronous vs. independent FUs
  - Early VLIW: all FUs must be kept synchronized. A stall in any FU pipeline may cause the entire processor to stall.
  - Recent VLIW: FUs operate more independently. The compiler is used to avoid hazards at issue time, and hardware checks allow for unsynchronized execution once instructions are issued.
VLIW Problems: Logistical
- Binary code compatibility
  - A code sequence makes use of both the instruction set definition and the detailed pipeline structure (FUs and latencies).
  - Migrating between successive implementations, or between different implementations, requires recompilation.
- Solutions
  - Object-code translation or emulation
  - Temper the strictness of the approach so that binary compatibility is still feasible.
Advantages of Superscalar over VLIW
- Old code still runs.
  - Like those tools you have that came as binaries.
  - The HW detects whether an instruction pair is a legal dual-issue pair; if not, the instructions are run sequentially.
- Little impact on code density
  - No need to fill all of the "can't issue here" slots with NOPs.
- Compiler issues are very similar.
  - Instruction scheduling still needs to be done anyway.
  - The dynamic issue hardware is there, so the compiler does not have to be too conservative.
4.4 Hardware Support for Exposing More Parallelism at Compiler Time
Hardware Support for Exposing More Parallelism at Compiler Time
- When the behavior of branches is not well known, compiler techniques alone may not be able to uncover much ILP.
- In such cases, the control dependences may severely limit the amount of parallelism that can be exploited.
- Potential dependences between memory reference instructions could prevent code movement.
- Several techniques can help overcome such limitations.
Hardware Support for Exposing More Parallelism at Compiler Time (contd.)
- An extension of the instruction set to include conditional or predicated instructions. Such instructions can be used to:
  - eliminate branches
  - convert a control dependence into a data dependence
  - potentially improve performance
- Hardware support that enhances the ability of the compiler to speculatively move code over branches, while still preserving the exception behavior.
- Hardware speculation schemes that support the reordering of loads and stores.
Conditional or Predicated Instructions
- The concept behind conditional instructions is quite simple: an instruction refers to a condition, which is evaluated as part of the execution.
- Many new architectures include some form of conditional instructions.
- The most common example of such an instruction is the conditional move, which moves a value from one register to another if the condition is true.
Conditional or Predicated Instructions (contd.)
- Other variants: conditional loads and stores.
- Alpha, MIPS, SPARC, PowerPC, and P6 all have simple conditional moves.
- The effect is to eliminate simple branches: a control dependence is changed into a data dependence.
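The effect of a conditional move can be imitated in C. The sketch below contrasts a branching form of `if (a == 0) s = t;` with a branch-free form in which the result depends on the condition only through data. The function names are illustrative, and whether a compiler actually emits a conditional-move instruction for either form depends on the compiler and the target.

```c
#include <assert.h>

/* Branching form: the assignment is control dependent on the test. */
unsigned select_branch(unsigned a, unsigned s, unsigned t)
{
    if (a == 0)
        s = t;
    return s;
}

/* Branch-free form: the condition becomes a data value (an all-ones or
 * all-zeros mask) that selects between t and s, as a conditional move does. */
unsigned select_mask(unsigned a, unsigned s, unsigned t)
{
    unsigned mask = 0u - (unsigned)(a == 0); /* 0xFFFF... if a == 0, else 0 */
    return (t & mask) | (s & ~mask);
}

/* Sanity check that both forms agree for one input triple. */
void check_select(unsigned a, unsigned s, unsigned t)
{
    assert(select_branch(a, s, t) == select_mask(a, s, t));
}
```

In the branch-free form both candidate values are always available, which is cheap for register operands but is exactly what becomes costly or unsafe once the operands involve memory accesses that may fault.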
Conditional Instruction Limitations
- Precise exceptions
  - If an exception happens prior to the conditional evaluation, it must be carried through the pipe.
  - This is simple for register accesses, but consider a memory protection violation or a page fault.
- Long conditional sequences: an if-then with a big then body
  - If the task to be done is complex, it is better to evaluate the condition once.
- Conditional instructions are most useful when the condition can be evaluated early.
  - If determining the condition involves data dependences, they help less.
Conditional Instruction Limitations (contd.)
- Wasted resources
  - Conditional instructions consume real resources.
  - This tends to work well in the superscalar case: in our simple 2-way model, the other issue resource would be wasted anyway even without a conditional instruction.
- Cycle-time or CPI issues
  - Conditional instructions are more complex.
  - The danger is that they may consume more cycles or require a longer cycle time.
  - Note that their utility is mainly to correct short control flows, so their use may not be the common case.
  - Things had better not slow down for the real common case in order to support the uncommon case.
Compiler Speculation with HW Support
- As we saw earlier, many programs have branches that can be accurately predicted at compile time, either:
  - from the program structure, or
  - by using a profile.
- In such cases, the compiler may want to speculate, either:
  - to improve the scheduling, or
  - to increase the issue rate.
- Predicated instructions provide one method to speculate, but they are really more useful when control dependences can be completely eliminated by if-conversion.
Compiler Speculation with HW Support (contd.)
- In many cases, we would like to move speculated instructions:
  - Do conditional things in advance of the branch (and before the condition is evaluated).
  - Nullify them if the branch goes the wrong way.
  - This also implies the need to nullify exception behavior.
- Limits
  - Exceptions cannot be allowed to cause any destructive activity.
To Speculate Ambitiously...
- To speculate ambitiously requires 3 capabilities:
  - Ability of the compiler to find instructions that can be speculatively moved without affecting the program data flow
  - Ability of the HW to ignore exceptions in speculated instructions until we know that such exceptions should really occur
  - Ability of the HW to speculatively interchange loads and stores, or stores and stores, which may have address conflicts
HW Support for Preserving Exception Behavior
- Goal: make sure that a mispredicted speculative instruction (SI) cannot cause an exception.
- Four methods support speculation without introducing erroneous exceptions:
  1. The HW and OS cooperatively ignore exceptions for speculative instructions.
  2. Speculative instructions that never raise exceptions are used, and checks are introduced to determine when an exception should occur.
  3. A set of status bits called poison bits is attached to the result registers written by speculative instructions when those instructions cause exceptions. The poison bits cause a fault when a normal instruction attempts to use the register.
  4. A mechanism indicates that an instruction is speculative, and the HW buffers the instruction's result until it is certain that the instruction is no longer speculative.
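The poison-bit method can be modeled in a few lines of C. This is a toy software model, not a description of any real hardware: the `preg` type and function names are hypothetical, a null pointer stands in for a faulting address, and the propagation of poison through a speculative add is an assumption that goes beyond what is stated above.

```c
#include <assert.h>
#include <stddef.h>

/* A register with an attached poison bit. Hypothetical model. */
struct preg {
    long value;
    int  poison; /* 1 = a speculative instruction faulted producing this */
};

/* Speculative load: a fault (modeled as a null address) does not raise an
 * exception; it only marks the destination register as poisoned. */
void spec_load(struct preg *dst, const long *addr)
{
    if (addr == NULL) {
        dst->value = 0;
        dst->poison = 1;
    } else {
        dst->value = *addr;
        dst->poison = 0;
    }
}

/* Speculative add: assumed here to propagate poison from its sources. */
void spec_add(struct preg *dst, const struct preg *a, const struct preg *b)
{
    int poisoned = a->poison || b->poison;
    dst->value = a->value + b->value;
    dst->poison = poisoned;
}

/* Normal (non-speculative) use: returns 0 and writes *out on success,
 * returns -1 to model the fault raised when the register is poisoned. */
int normal_use(const struct preg *src, long *out)
{
    if (src->poison)
        return -1;
    *out = src->value;
    return 0;
}
```

The fault is deferred until a non-speculative instruction consumes the value, so a speculative result that is never used never causes an exception.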
Exception Types
- Exceptions that indicate a program error and normally cause termination
  - Example: memory protection violation
  - Should not be handled for speculative instructions when the prediction was wrong.
  - Cannot be taken until we know the instruction is no longer speculative.
- Exceptions that are handled, after which the program normally resumes
  - Example: page fault
  - Can be handled for speculative instructions just as if they were normal instructions.
  - Only have a negative performance effect when the prediction was wrong.
HW-SW Cooperation for Speculation
- Return an undefined value for any terminating exception.
- The program is allowed to continue, but will almost certainly generate incorrect results.
  - If the excepting instruction is not speculative, the program is in error.
  - If the excepting instruction is speculative, the program is correct and the speculative result will simply be unused (no harm).
- This never causes a correct program to fail, no matter how much speculation is done.
- An incorrect program, which formerly might have received a terminating exception, will get an incorrect result.
  - This is acceptable if the compiler can also generate a normal version of the program (one that does not speculate and does receive the terminating exception).
4.5 HW Versus SW Speculation Mechanisms
- To speculate extensively, we must be able to disambiguate memory references.
- HW speculation works better when control flow is unpredictable, and when HW branch prediction is superior to SW branch prediction done at compile time.
- HW speculation maintains a completely precise exception model for speculative instructions.
- HW speculation does not require the compensation or bookkeeping code needed by ambitious SW speculation.
HW Versus SW Speculation Mechanisms (contd.)
- HW speculation with dynamic scheduling does not require different code sequences to achieve good performance for different implementations of an architecture.
- HW speculation requires complex, additional HW resources.
- Some designers have tried to combine the dynamic and compiler-based approaches to achieve the best of each.