# Exploiting ILP with Software Approaches


Outline

- Basic Compiler Techniques for Exposing ILP
- Static Branch Prediction
- Static Multiple Issue: The VLIW Approach
- Hardware Support for Exposing More Parallelism at Compile Time
- Hardware versus Software Solutions

4.1 Basic Compiler Techniques for Exposing ILP

Basic Pipeline Scheduling and Loop Unrolling
To keep a pipeline full, parallelism among instructions must be exploited by finding sequences of unrelated instructions that can be overlapped in the pipeline. To avoid a pipeline stall, a dependent instruction must be separated from its source instruction. A compiler's ability to perform this scheduling depends both on the amount of ILP available in the program and on the latencies of the functional units in the pipeline.

Basic Pipeline Scheduling and Loop Unrolling (contd.)

- Idea: find sequences of unrelated (hazard-free) instructions that can be overlapped in the pipeline to exploit ILP.
- To avoid a stall, a dependent instruction must be separated from its source instruction by a distance in clock cycles equal to the latency of the source instruction.
- The latencies of the FP operations are assumed as given.

Consider adding a scalar s to a vector:

```c
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;
```

```asm
Loop: L.D    F0,0(R1)     ;F0 = vector element
      ADD.D  F4,F0,F2     ;add scalar in F2
      S.D    F4,0(R1)     ;store result
      DADDUI R1,R1,#-8    ;decrement pointer 8 bytes (DW)
      BNE    R1,R2,Loop   ;branch if R1 != R2
```

Assume R2 is precomputed, so that 8(R2) is the last element to operate on.

Unscheduled Loop

```asm
                           Clock cycle issued
Loop: L.D    F0,0(R1)      1
      stall                2
      ADD.D  F4,F0,F2      3
      stall                4
      stall                5
      S.D    F4,0(R1)      6
      DADDUI R1,R1,#-8     7
      stall                8
      BNE    R1,R2,Loop    9
      stall                10
```

The unscheduled code takes 10 clock cycles per iteration.

Scheduled Loop

```asm
                           Clock cycle issued
Loop: L.D    F0,0(R1)      1
      DADDUI R1,R1,#-8     2
      ADD.D  F4,F0,F2      3
      stall                4
      BNE    R1,R2,Loop    5
      S.D    F4,8(R1)      6
```

- The latency between ADD.D and S.D is 2 cycles, so one stall remains.
- A minimum of 6 cycles is necessary to execute this sequence; 3 of them (DADDUI, BNE, and the stall) are loop overhead.
- Moving S.D past DADDUI is not trivial, because R1 has been modified:
  - Common view: S.D depends on the preceding DADDUI and therefore cannot be moved.
  - Smarter view: DADDUI adds a known immediate (-8), so a solution exists: adjust the store offset to 8(R1) and move the store after the branch.

Basic Pipeline Scheduling and Unrolling

- To eliminate the 3 overhead clock cycles, we need to get more operations within the loop relative to the number of overhead instructions.
- A simple scheme for increasing the number of instructions relative to the branch and overhead instructions is loop unrolling: simply replicate the loop body multiple times, adjusting the loop termination code.
- Loop unrolling can also be used to improve scheduling. Because it eliminates the branch, it allows instructions from different iterations to be scheduled together.

Loop Unrolling – Make the Body Fat

- Three of the six instructions are overhead; we want more operations within the loop relative to the number of overhead instructions.
- Loop unrolling: replicate the loop body multiple times and adjust the loop termination code.
- Basic idea:
  - Take n loop bodies and concatenate them into one basic block.
  - Adjust the new termination code.
  - If n is 4, modify the R1 pointer update in the example to 4x what it was before.
- Savings: 4 BNEs + 4 DADDUIs become just one of each, a 75% reduction in overhead instructions.

Summary of Loop Unrolling and Scheduling
We will look at a variety of hardware and software techniques that allow us to take advantage of instruction-level parallelism to fully utilize the potential of the functional units in a processor. The key to most of these techniques is knowing when and how the ordering among instructions may be changed. This process must be performed in a methodical fashion, either by a compiler or by hardware.

Decisions and Transformations for Unrolling
To obtain the final unrolled code we had to make the following decisions and transformations:

1. Determine that it is legal to move the instructions and adjust the offsets.
2. Determine that unrolling the loop would be useful by finding that the loop iterations are independent.
3. Use different registers to avoid unnecessary constraints.
4. Eliminate the extra tests and branches.
5. Determine that the loads and stores in the unrolled loop can be interchanged.
6. Schedule the code, preserving any dependences needed.

Key requirement: an understanding of how one instruction depends on another and how the instructions can be changed or reordered given the dependences.

Limitations on the Gains of Loop Unrolling

- The loop overhead is amortized over each unrolled copy:
  - Unrolled 4 times: 2 of 14 clock cycles are overhead → 0.5 overhead cycles per iteration.
  - Unrolled 8 times → 0.25 overhead cycles per iteration.
- Growth in code size:
  - Large code size is bad for embedded computers.
  - Large code size may increase the cache miss rate.
- Potential shortfall in registers created by aggressive unrolling and scheduling (register pressure).

4.2 Static Branch Prediction

Static Branch Prediction
Static branch predictors are sometimes used in processors where the expectation is that branch behavior is highly predictable at compile time. Static prediction can also be used to assist dynamic predictors.

Static Branch Prediction: Using Compiler Technology
How do we statically predict branches? To perform some optimizations, we need to predict branch behavior statically when we compile the program. There are several methods:

- Predict every branch as taken. This scheme has an average misprediction rate equal to the untaken branch frequency.
- Predict backward-going branches as taken and forward-going branches as not taken. For some programs and compilation systems, the frequency of taken forward branches may be significantly less than 50%, so this scheme does better than just predicting all branches taken.

Static Branch Prediction: Using Compiler Technology

- Profile-based predictor: use profile information collected from earlier runs.
- The simplest scheme records a single prediction bit per branch.
- It easily extends to use more bits.
- It is a definite win for some regular applications.

Static Branch Prediction: Using Compiler Technology
Static prediction is useful for:

- Scheduling instructions when the branch delays are exposed by the architecture (either delayed or canceling branches)
- Assisting dynamic predictors
- Determining which code paths are more frequent, a key step in code scheduling

4.3 Static Multiple Issue: VLIW

Overview

- With static multiple issue, the compiler does most of the work of finding and scheduling instructions for parallel execution.
- Superscalar processors decide on the fly how many instructions to issue.
- A statically scheduled superscalar must check for any dependences between instructions in the issue packet, and between those instructions and any instruction already in the pipeline.
- A statically scheduled superscalar requires significant compiler assistance to achieve good performance. In contrast, a dynamically scheduled superscalar requires less compiler assistance but significant hardware costs.

Overview (contd.)

- An alternative to the superscalar approach is to rely on compiler technology to:
  - minimize the potential hazard stalls, and
  - format the instructions in a potential issue packet so that the hardware need not check explicitly for dependences.
- The compiler ensures that dependences within the issue packet are not present, or indicates when a dependence may occur.
- Compiler technology offers the potential advantage of simpler hardware while still exhibiting good performance.
- This architectural approach was named VLIW (Very Long Instruction Word).

Basic VLIW

- A VLIW uses multiple, independent functional units.
- A VLIW packages multiple independent operations into one very long instruction.
- The burden of choosing and packaging independent operations falls on the compiler; the hardware that makes issue decisions in a superscalar is unneeded.
- This advantage increases as the maximum issue rate grows.
- Here we consider a VLIW processor whose instructions contain 5 operations: 1 integer operation (or branch), 2 FP operations, and 2 memory references.
- The mix depends on the available FUs and the frequency of each operation.

Basic VLIW (contd.)

- A VLIW depends on there being enough parallelism to keep the FUs busy.
- This parallelism is uncovered by loop unrolling and then code scheduling within the single larger loop body.
- If the unrolling generates straight-line code, local scheduling techniques, which operate on a single basic block, can be used.
- If finding and exploiting the parallelism requires scheduling code across branches, more complex global scheduling algorithms must be used.

VLIW Problems – Technical

- Increase in code size:
  - ambitious loop unrolling;
  - whenever instruction words are not full, the unused FU slots translate to wasted bits in the instruction encoding;
  - an instruction may need to be left completely empty if no operation can be scheduled.
- Mitigations: clever encoding, or compress/decompress schemes.

VLIW Problems – Logistical

- Synchronous vs. independent FUs:
  - Early VLIWs: all FUs must be kept synchronized; a stall in any FU pipeline may cause the entire processor to stall.
  - Recent VLIWs: FUs operate more independently; the compiler avoids hazards at issue time, and hardware checks allow unsynchronized execution once instructions are issued.

VLIW Problems – Logistical

- Binary code compatibility:
  - The code sequence depends on both the instruction set definition and the detailed pipeline structure (FUs and latencies).
  - Migrating between successive implementations, or between implementations with different issue widths, requires recompilation.
- Solutions:
  - object-code translation or emulation;
  - temper the strictness of the approach so that binary compatibility is still feasible.

- Old codes still run, like those tools you have that came only as binaries.
- Hardware detects whether an instruction pair is a legal dual-issue pair; if not, the two instructions run sequentially.
- Little impact on code density: there is no need to fill all the "can't issue here" slots with NOPs.
- Compiler issues are very similar: instruction scheduling is still needed anyway, but because the dynamic issue hardware is there, the compiler does not have to be too conservative.

4.4 Hardware Support for Exposing More Parallelism at Compile Time

Hardware Support for Exposing More Parallelism at Compile Time

- When the behavior of branches is not well known, compiler techniques alone may not be able to uncover much ILP. In such cases, the control dependences may severely limit the amount of parallelism that can be exploited.
- Potential dependences between memory reference instructions can prevent code movement.
- Several techniques can help overcome these limitations.

(contd.)

- An extension of the instruction set to include conditional or predicated instructions. Such instructions can be used to eliminate branches, converting a control dependence into a data dependence and potentially improving performance.
- Hardware support that enhances the ability of the compiler to speculatively move code over branches while still preserving exception behavior.
- Hardware speculation schemes that support the reordering of loads and stores.

Conditional or Predicated Instructions
The concept behind conditional instructions is quite simple: an instruction refers to a condition, which is evaluated as part of the instruction's execution. Many new architectures include some form of conditional instructions. The most common example is the conditional move, which moves a value from one register to another if the condition is true.

Conditional or Predicated Instructions (contd.)

- Other variants: conditional loads and stores.
- Alpha, MIPS, SPARC, PowerPC, and the P6 all have simple conditional moves.
- The effect is to eliminate simple branches, changing a control dependence into a data dependence.

Conditional Instruction Limitations

- Precise exceptions: if an exception happens prior to condition evaluation, it must be carried through the pipe. This is simple for register accesses, but consider a memory protection violation or a page fault.
- Long conditional sequences: for an if-then with a big then-body, if the work to be done is complex, it is better to evaluate the condition once and branch.
- Conditional instructions are most useful when the condition can be evaluated early; if there is a data dependence in determining the condition, they help less.

Conditional Instruction Limitations (contd.)

- Wasted resources: conditional instructions consume real execution resources.
  - This tends to work out in the superscalar case: in our simple 2-way model, even without a conditional instruction, the other slot's resources would be wasted anyway.
- Cycle-time or CPI issues: conditional instructions are more complex, and the danger is that they consume more cycles or lengthen the cycle time.
  - Their utility is mainly to fix short control flows, which may not be the common case.
  - Things had better not slow down for the real common case just to support the uncommon case.

Compiler Speculation with HW Support

- As we saw earlier, many programs have branches that can be accurately predicted at compile time, either from the program structure or by using a profile.
- In such cases, the compiler may want to speculate, either to improve the scheduling or to increase the issue rate.
- Predicated instructions provide one method to speculate, but they are really most useful when control dependences can be completely eliminated by if-conversion.

Compiler Speculation with HW Support (contd.)

- In many cases, we would like to move speculated instructions: do conditional work in advance of the branch (and before the condition evaluation), and nullify it if the branch goes the wrong way.
- This also implies the need to nullify exception behavior.
- Limit: exceptions from speculated instructions cannot be allowed to cause any destructive activity.

To Speculate Ambitiously…
Speculating ambitiously requires three capabilities:

- the ability of the compiler to find instructions that can be speculatively moved without affecting the program data flow;
- the ability of the hardware to ignore exceptions in speculated instructions until we know that such exceptions should really occur;
- the ability of the hardware to speculatively interchange loads and stores, or stores and stores, which may have address conflicts.

HW Support for Preserving Exception Behavior
How do we make sure that a mispredicted speculative instruction (SI) cannot cause an exception? Four methods support speculation without introducing erroneous exceptions:

1. The hardware and OS cooperatively ignore exceptions for speculative instructions.
2. Speculative instructions that never raise exceptions are used, and checks are introduced to determine when an exception should occur.
3. A set of status bits, called poison bits, is attached to the result registers written by speculative instructions that cause exceptions; the poison bits cause a fault when a normal instruction attempts to use the register.
4. A mechanism indicates that an instruction is speculative, and the hardware buffers the instruction's result until it is certain that the instruction is no longer speculative.

Exception Types

- Exceptions that indicate a program error and normally cause termination (e.g., a memory protection violation):
  - should not be handled for a speculative instruction that was mispredicted;
  - cannot be taken until we know the instruction is no longer speculative.
- Exceptions that are handled and execution normally resumed (e.g., a page fault):
  - can be handled for speculative instructions just as for normal instructions;
  - have only a negative performance effect when the speculation was mispredicted.

HW-SW Cooperation for Speculation

- Return an undefined value for any terminating exception: the program is allowed to continue, but will almost certainly generate incorrect results.
  - If the excepting instruction is not speculative → the program is in error.
  - If the excepting instruction is speculative → the program is correct, and the speculative result will simply be unused (no harm).
- This approach never causes a correct program to fail, no matter how much speculation is done.
- An incorrect program, which formerly might have received a terminating exception, will instead get an incorrect result. This is acceptable if the compiler can also generate a normal version of the program, which does not speculate and would receive the terminating exception.

4.5 HW Versus SW Speculation Mechanisms
- To speculate extensively, we must be able to disambiguate memory references.
- HW speculation works better when control flow is unpredictable and when hardware branch prediction is superior to software branch prediction done at compile time.
- HW speculation maintains a completely precise exception model even for speculative instructions.
- HW speculation does not require the compensation or bookkeeping code needed by ambitious software speculation.

HW Versus SW Speculation Mechanisms (contd.)

- HW speculation with dynamic scheduling does not require different code sequences to achieve good performance on different implementations of an architecture.
- HW speculation requires complex and additional hardware resources.
- Some designers have tried to combine the dynamic and compiler-based approaches to achieve the best of each.