Exploiting Instruction-Level Parallelism with Software Approaches


1 Exploiting Instruction-Level Parallelism with Software Approaches
Chapter 4

2 Overview
Basic Compiler Techniques
- Pipeline scheduling
- Loop unrolling
Static Branch Prediction
Static Multiple Issue: VLIW
Advanced Compiler Support for Exposing ILP
- Detecting loop-level parallelism
- Software pipelining (symbolic loop unrolling)
- Global code scheduling
Hardware Support for Exposing More Parallelism
- Conditional or predicated instructions
- Compiler speculation with hardware support
- Hardware vs. software speculation mechanisms
Intel IA-64 ISA

3 Review of Multi-Issue Taxonomy

Common name | Issue structure | Hazard detection | Scheduling | Distinguishing characteristic | Examples
Superscalar (static) | dynamic | hardware | static | in-order execution | Sun UltraSPARC II/III
Superscalar (dynamic) | dynamic | hardware | dynamic | some out-of-order execution | IBM Power2
Superscalar (speculative) | dynamic | hardware | dynamic with speculation | out-of-order execution with speculation | Pentium III/4, MIPS R10K, Alpha 21264, HP PA 8500, IBM RS64 III
VLIW/LIW | static | software | static | no hazards between issue packets | Trimedia, i860
EPIC | mostly static | mostly software | mostly static | explicit dependencies marked by compiler | Itanium (one IA-64 implementation)

4 Quote about IA-64 Architecture
“One of the surprises about IA-64 is that we hear no claims of high frequency, despite claims that an EPIC processor is less complex than a superscalar processor. It’s hard to know why this is so, but one can speculate that the overall complexity involved in focusing on CPI, as IA-64 does, makes it hard to get high megahertz.” - M. Hopkins, 2000

5 Basic Pipeline Scheduling
To keep the pipeline full:
- Find sequences of unrelated instructions that can be overlapped
- Separate dependent instructions by at least the latency of the source instruction
Compiler success depends on:
- Amount of ILP available
- Latencies of the functional units

6 Assumptions for Examples
Standard 5-stage integer pipeline plus a floating-point pipeline
Branches have a delay of 1 cycle
Integer load latency of 1 cycle; integer ALU latency of 0
Functional units fully pipelined or replicated, so there are no structural hazards
Latencies between dependent FP instructions:

Instruction producing result | Instruction using result | Latency in clock cycles
FP ALU operation | Another FP ALU operation | 3
FP ALU operation | Store double | 2
Load double | FP ALU operation | 1

7 Loop Example
Add a scalar to an array:

for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;

Iterations of the loop are parallel, with no dependencies between iterations.

8 Straightforward Conversion
R1 holds the address of the highest array element
F2 holds the scalar
R2 is pre-computed so that 8(R2) is the last element

loop: L.D    F0, 0(R1)     ;F0 = array element
      ADD.D  F4, F0, F2    ;add scalar in F2
      S.D    F4, 0(R1)     ;store result
      DADDUI R1, R1, #-8   ;decrement pointer (DW)
      BNE    R1, R2, loop  ;branch if R1 != R2

9 Program in MIPS Pipeline

loop: L.D    F0, 0(R1)     ; cycle 1
      stall                ; cycle 2  (1-cycle load latency)
      ADD.D  F4, F0, F2    ; cycle 3
      stall                ; cycle 4  (2-cycle FP ALU -> store latency)
      stall                ; cycle 5
      S.D    F4, 0(R1)     ; cycle 6
      DADDUI R1, R1, #-8   ; cycle 7
      stall                ; cycle 8  (1-cycle latency to branch test)
      BNE    R1, R2, loop  ; cycle 9
      stall                ; cycle 10 (branch delay slot)

10 Scheduled Program in MIPS Pipeline

OLD:
loop: L.D    F0, 0(R1)     ; cycle 1
      stall                ; cycle 2
      ADD.D  F4, F0, F2    ; cycle 3
      stall                ; cycle 4
      stall                ; cycle 5
      S.D    F4, 0(R1)     ; cycle 6
      DADDUI R1, R1, #-8   ; cycle 7
      stall                ; cycle 8
      BNE    R1, R2, loop  ; cycle 9
      stall                ; cycle 10

NEW:
loop: L.D    F0, 0(R1)     ; cycle 1
      DADDUI R1, R1, #-8   ; cycle 2
      ADD.D  F4, F0, F2    ; cycle 3
      stall                ; cycle 4  (2-cycle FP ALU -> store latency)
      BNE    R1, R2, loop  ; cycle 5
      S.D    F4, 8(R1)     ; cycle 6  (delay slot; offset adjusted)

11 Compiler Tasks

loop: L.D    F0, 0(R1)     ; cycle 1
      DADDUI R1, R1, #-8   ; cycle 2
      ADD.D  F4, F0, F2    ; cycle 3
      stall                ; cycle 4
      BNE    R1, R2, loop  ; cycle 5
      S.D    F4, 8(R1)     ; cycle 6

OK to reorder DADDUI and ADD.D
OK to reorder S.D and BNE
OK to reorder DADDUI and S.D, but requires changing the offset 0(R1) -> 8(R1)
This last one is difficult because of the anti-dependence through R1:
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8

12 Loop Overhead

loop: L.D    F0, 0(R1)     ; cycle 1
      DADDUI R1, R1, #-8   ; cycle 2
      ADD.D  F4, F0, F2    ; cycle 3
      stall                ; cycle 4
      BNE    R1, R2, loop  ; cycle 5
      S.D    F4, 8(R1)     ; cycle 6

6 cycles is the minimum, due to dependencies and pipeline latencies
The actual work of the loop is just 3 instructions: L.D, ADD.D, S.D
The other instructions are loop overhead: DADDUI, BNE

13 Loop Unrolling
Eliminate some of the overhead by unrolling the loop (fully or partially).
- Need to adjust the loop termination code
- Allows more parallel instructions in a row
- Allows more flexibility in reordering
- Usually requires register renaming
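The unrolling transformation can be sketched in C on the running example (a sketch with an illustrative function name, assuming the trip count is a multiple of 4, as the slides do):

```c
#include <stddef.h>

/* Unrolled-by-4 version of: for (i = n; i > 0; i--) x[i-1] += s;
   Assumes n is a multiple of 4. */
void add_scalar_unrolled(double *x, size_t n, double s)
{
    for (size_t i = n; i > 0; i -= 4) {
        x[i-1] += s;   /* four independent copies of the body;   */
        x[i-2] += s;   /* a scheduler can interleave them freely */
        x[i-3] += s;   /* because no copy depends on another     */
        x[i-4] += s;
    }
}
```

One loop-termination test and one induction-variable update now serve four body copies, which is exactly the overhead reduction the slide describes.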

14 Unrolled Version

loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDUI R1, R1, #-32
      BNE    R1, R2, loop

Assume the number of iterations is a multiple of 4
Decrement R1 by 32 for these 4 iterations
More registers are required to avoid unnecessary dependencies
Eliminates 3 DADDUI and 3 BNE instructions
Without scheduling, each operation still causes stalls when pipelined:
      L.D    F0, 0(R1)
      stall
      ADD.D  F4, F0, F2
      stall
      stall
      S.D    F4, 0(R1)

15 Scheduled Unrolled Version

Before scheduling:
loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      DADDUI R1, R1, #-32
      BNE    R1, R2, loop

After scheduling (move dependent instructions apart, fill the delay slot):
loop: L.D    F0, 0(R1)
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F4, F0, F2
      ADD.D  F8, F6, F2
      ADD.D  F12, F10, F2
      ADD.D  F16, F14, F2
      S.D    F4, 0(R1)
      S.D    F8, -8(R1)
      DADDUI R1, R1, #-32
      S.D    F12, 16(R1)   ; 16-32 = -16
      BNE    R1, R2, loop
      S.D    F16, 8(R1)    ; 8-32 = -24

Unscheduled: 28 clock cycles for 4 iterations = 7 cycles/iteration
Scheduled: 14 clock cycles for 4 iterations = 3.5 cycles/iteration

16 Loop Unrolling in General
For a loop with n iterations, unrolled k times (n might not be a multiple of k):
- One rolled copy of the body executes n mod k iterations
- An unrolled copy containing k copies of the body then executes n/k iterations
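This strip-mining scheme can be sketched in C (illustrative function name; k = 4):

```c
#include <stddef.h>

/* Strip-mined unrolling: run the n % k leftover iterations rolled,
   then the remaining n/k groups with the body unrolled k times (k = 4). */
void add_scalar_any_n(double *x, size_t n, double s)
{
    size_t i = 0;
    size_t rem = n % 4;          /* n mod k iterations, rolled */
    for (; i < rem; i++)
        x[i] += s;
    for (; i < n; i += 4) {      /* n/k iterations, unrolled */
        x[i]   += s;
        x[i+1] += s;
        x[i+2] += s;
        x[i+3] += s;
    }
}
```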

17 Summary of Example

Version | Clock cycles per iteration | Code size (instructions)
Unscheduled | 10 | 5
Scheduled | 6 | 5
Unrolled (4) | 7 | 14
Unrolled (4) and Scheduled | 3.5 | 14

18 Compiler Tasks for Unrolled Scheduled Version
- Determine that unrolling is useful because the loop iterations are independent
- Eliminate the extra test and branch instructions and adjust the iteration code
- Use different registers to avoid name hazards
- OK to move S.D after DADDUI and BNE if the S.D offset is adjusted
- OK to move L.D and S.D instructions in the unrolled code (requires analyzing memory addresses)
- Keep all the real dependencies, but reorder to avoid stalls

19 Limits to Loop Unrolling
- Eventually the gain from removing loop overhead diminishes: the remaining overhead is already amortized over many body copies
- Code size limitations: a concern for embedded applications; larger code can increase cache misses
- Compiler limitations: a shortfall in registers when the number of live values exceeds the number of available registers

20 Unrolled Scheduled Loop in Pipeline
Example: statically scheduled superscalar MIPS processor
- 2 instructions per clock cycle (dual-issue):
  - 1 load/store/branch/integer ALU operation (includes FP load/store)
  - 1 FP operation (add, mult/div)
Pipeline stages: Fetch, Decode, Integer (and FP LD/ST), FP ALU, Mem, Write

21 Pipeline Schedule for 5-Iteration Unrolled Version

Integer instruction | FP instruction | Clock cycle
L.D F0, 0(R1) | | 1
L.D F6, -8(R1) | | 2
L.D F10, -16(R1) | ADD.D F4, F0, F2 | 3
L.D F14, -24(R1) | ADD.D F8, F6, F2 | 4
L.D F18, -32(R1) | ADD.D F12, F10, F2 | 5
S.D F4, 0(R1) | ADD.D F16, F14, F2 | 6
S.D F8, -8(R1) | ADD.D F20, F18, F2 | 7
S.D F12, -16(R1) | | 8
DADDUI R1, R1, #-40 | | 9
S.D F16, 16(R1) | | 10   ; 16-40 = -24
BNE R1, R2, loop | | 11
S.D F20, 8(R1) | | 12    ; 8-40 = -32

loop: 12 cycles for 5 iterations = 2.4 cycles/iteration

22 Summary of Example

Version | Clock cycles per iteration | Code size (instructions)
Unscheduled | 10 | 5
Scheduled | 6 | 5
Unrolled (4) | 7 | 14
Unrolled (4) and Scheduled | 3.5 | 14
Unrolled (5) and Scheduled in Multi-Issue Pipe | 2.4 | 17

23 Overview
Basic Compiler Techniques
- Pipeline scheduling
- Loop unrolling
Static Branch Prediction
Static Multiple Issue: VLIW
Advanced Compiler Support for Exposing ILP
- Detecting loop-level parallelism
- Software pipelining (symbolic loop unrolling)
- Global code scheduling
Hardware Support for Exposing More Parallelism
- Conditional or predicated instructions
- Compiler speculation with hardware support
- Hardware vs. software speculation mechanisms
Intel IA-64 ISA

24 Detecting Parallelism
Loop-level parallelism
- Analyzed at the source level; requires recognition of array references, loops, and indices
- Loop-carried dependence: a dependence of one loop iteration on a previous iteration

for (k=1; k<=100; k=k+1) {
    A[k+1] = A[k] + B[k];
}

25 Examples of Loop Parallelism
A loop is parallel if it can be written without a cycle in its dependencies.

for (k=1; k<=100; k=k+1) {
    A[k+1] = A[k] + C[k];    /* loop-carried dependence in a single statement: a cycle */
    B[k+1] = B[k] + A[k+1];  /* B also forms a cycle; the within-iteration use of A[k+1] does not */
}

for (k=1; k<=100; k=k+1) {
    A[k] = A[k] + B[k];      /* loop-carried dependence on B, but no cycle */
    B[k+1] = C[k] + D[k];    /* (B[k+1] does not have B[k] as a source)   */
}

The second loop can be modified to make it parallel.

26 Transformation
The two statements can be interchanged:
- The first iteration of the first statement is computed outside the loop, so the loop computes A[k+1]
- The last iteration of the second statement must also be computed outside the loop

Original (dependence within an iteration):
for (k=1; k<=100; k=k+1) {
    A[k] = A[k] + B[k];
    B[k+1] = C[k] + D[k];
}

Transformed to expose the parallelism:
A[1] = A[1] + B[1];
for (k=1; k<=99; k=k+1) {
    B[k+1] = C[k] + D[k];
    A[k+1] = A[k+1] + B[k+1];
}
B[101] = C[100] + D[100];

27 Recurrences

for (i=2; i<=100; i=i+1) {
    Y[i] = Y[i-n] + Y[i];
}

Y[i] depends on itself, but uses the value of an earlier iteration.
n is the "dependence distance"; most often n = 1.
The larger n is, the more parallelism is available.
Some architectures have special support for recurrences.
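In C, a recurrence with dependence distance n looks like this (a sketch; the function name is illustrative). Any n consecutive iterations are mutually independent, which is where the extra parallelism comes from:

```c
/* Recurrence with dependence distance n: iteration i depends on
   iteration i-n, so up to n iterations can run in parallel. */
void recurrence(double *y, int len, int n)
{
    for (int i = n; i < len; i++)
        y[i] = y[i - n] + y[i];
}
```

With n = 1 this is a prefix sum, the fully serial case; larger n relaxes the chain.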

28 Finding Dependencies
Important for:
- Efficient scheduling
- Determining which loops to unroll
- Eliminating name dependencies
What makes finding dependencies difficult:
- Arrays and pointers in C or C++
- Pass-by-reference parameter passing in FORTRAN

29 Dependencies in Arrays
An array index i is affine if it has the form a×i + b (for a one-dimensional array).
An index of a multi-dimensional array is affine if the index in each dimension is affine.
Common example of a non-affine index: x[y[i]] (indirect array addressing).
For two affine indices a×i + b and c×i + d, a dependence is possible only if GCD(a,c) divides (d−b) evenly.
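The GCD test is easy to sketch in C (illustrative function names; this checks only the necessary divisibility condition, not the loop bounds):

```c
/* GCD dependence test for affine indices a*i + b (one access) and
   c*i + d (the other): a dependence is possible only when
   gcd(a, c) divides (d - b) evenly. */
int gcd(int a, int c)
{
    a = a < 0 ? -a : a;
    c = c < 0 ? -c : c;
    while (c != 0) { int t = a % c; a = c; c = t; }
    return a;
}

int gcd_test_may_depend(int a, int b, int c, int d)
{
    int g = gcd(a, c);
    return g != 0 && (d - b) % g == 0;  /* 1 = dependence possible */
}
```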

30 Example

for (k=1; k<=100; k=k+1) {
    X[2k+3] = X[2k] * 5.0;
}

k:    1  2  3  4  5  6  7  ... 100
2k+3: 5  7  9  11 13 15 17 ... 203
2k:   2  4  6  8  10 12 14 ... 200

GCD test: a=2, b=3, c=2, d=0
Dependence is possible only if GCD(a,c) divides (d−b) evenly.
GCD(2,2) = 2 and d−b = −3; since 2 does not divide −3, there is no dependence.

31 GCD Test Limitations
- The GCD test does not take the limits of the indices into account:
  - If the GCD test shows NO dependence, then there is no dependence
  - If it shows a possible dependence, the dependence might still not occur (it may fall outside the bounds of the indices)
- In general, determining whether a dependence exists is NP-complete
- There are exact tests for restricted situations

32 Dependency Classification
Different dependence types are handled differently:
- Anti-dependences and output dependences: rename
- True dependences: try to reorder so dependent instructions are separated by the full latency

33 Example: Find the Dependencies

for (i=1; i<=100; i=i+1) {
    Y[i] = X[i]/c;     /* true dependence: Y[i] is used below (not loop-carried)      */
    X[i] = X[i] + c;   /* anti-dependence on X[i] with the statement above            */
    Z[i] = Y[i] + c;   /* true dependence on Y[i]; anti-dependence with the line below */
    Y[i] = c - Y[i];   /* output dependence on Y[i] with the first statement          */
}

34 Example: Eliminate the Output Dependence
Rename Y -> T (this also eliminates the second anti-dependence):

for (i=1; i<=100; i=i+1) {
    T[i] = X[i]/c;     /* true dependence (not loop-carried) */
    X[i] = X[i] + c;   /* anti-dependence on X[i] remains    */
    Z[i] = T[i] + c;   /* true dependence on T[i]            */
    Y[i] = c - T[i];
}

35 Example: Eliminate the Anti-Dependence
Rename X -> S:

for (i=1; i<=100; i=i+1) {
    T[i] = X[i]/c;
    S[i] = X[i] + c;
    Z[i] = T[i] + c;
    Y[i] = c - T[i];
}

The final result is a parallel loop that can be unrolled; only true, non-loop-carried dependences remain.

36 Software Pipelining
Interleaves instructions from different iterations of a loop without unrolling:
- Each iteration of the new loop is made from instructions drawn from different iterations of the original loop
- The software counterpart to Tomasulo's algorithm
- Start-up and finish-up code is required
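The idea can be sketched in C (illustrative names; start-up and wind-down code included). Each trip through the steady-state loop stores result i, adds for iteration i+1, and loads for iteration i+2, mixing three original iterations:

```c
/* Software-pipelined version of: for (i = 0; i < n; i++) x[i] += s; */
void add_scalar_swp(double *x, int n, double s)
{
    if (n < 2) { for (int i = 0; i < n; i++) x[i] += s; return; }

    double loaded = x[0];          /* start-up: load for iteration 0 */
    double sum = loaded + s;       /* start-up: add for iteration 0  */
    loaded = x[1];                 /* start-up: load for iteration 1 */

    for (int i = 0; i < n - 2; i++) {
        x[i] = sum;                /* store for iteration i     */
        sum = loaded + s;          /* add for iteration i+1     */
        loaded = x[i + 2];         /* load for iteration i+2    */
    }
    x[n-2] = sum;                  /* wind-down: finish n-2     */
    x[n-1] = loaded + s;           /* wind-down: finish n-1     */
}
```

Inside the loop, the store, add, and load are mutually independent, so a scheduler can overlap them without unrolling.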

37 Software Pipelining

38 Software Pipelining

Loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Symbolic unrolling selects one instruction from each of three consecutive iterations (the store from the oldest, the add from the middle, the load from the newest):

New loop:
Loop: S.D    F4, 16(R1)   ; store to M[i]
      ADD.D  F4, F0, F2   ; add for M[i-1]
      L.D    F0, 0(R1)    ; load M[i-2]
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

39 Software Pipelining

New loop:
Loop: S.D    F4, 16(R1)   ; store to M[i]
      ADD.D  F4, F0, F2   ; add for M[i-1]
      L.D    F0, 0(R1)    ; load M[i-2]
      DADDUI R1, R1, #-8
      BNE    R1, R2, Loop

Result: 1 cycle per instruction, 1 loop iteration per 5 cycles, and less code space than unrolling.

Rescheduled to separate dependent instructions (eliminating the RAW stall) and fill the delay slot:
Loop: S.D    F4, 16(R1)   ; store to M[i]
      DADDUI R1, R1, #-8
      ADD.D  F4, F0, F2   ; add for M[i-1]
      BNE    R1, R2, Loop
      L.D    F0, 8(R1)    ; load M[i-2] (offset adjusted for DADDUI)

40 Start-up and Wind-down Code
(figure: iterations covered by the software-pipelined loop, with the start-up and wind-down portions outside the steady-state loop)

41 Software Pipelining Benefits
Compared with loop unrolling (from the figure):
- For unrolling, the "overhead" is the portions of each unrolled iteration that are not optimally overlapped
- For software pipelining, the "overhead" is the branch/counter instructions that are not easy to overlap, paid mainly in the start-up and wind-down code

42 Global Code Scheduling
Loop unrolling and software pipelining improve ILP when loop bodies are straight-line code (no branches).
Control flow (branches) within loops makes both more complex and requires moving instructions across branches.
Global code scheduling: moving instructions across branches.

43 Global Code Scheduling
Goal: compact a code fragment with internal control structure into the shortest possible sequence.
- Must preserve data and control dependencies
- Data dependencies force a partial order on the instructions
- Control dependencies dictate instructions across which code cannot be moved
Finding the shortest possible sequence requires identifying the critical path: the longest sequence of dependent instructions.

44 Global Code Motion
Moving code across branches changes how frequently that code executes.
Need to determine the relative frequency of the different paths.

45 Global Code Motion
Questions for global code motion (e.g., can we move the B[i] or C[i] assignments?):
- Is it beneficial, given the execution frequencies?
- What is the cost of moving (are there empty slots)?
- What is the effect on the critical path?
- Which is the better move, B[i] or C[i]?
- What is the cost of compensation code?

46 Simplifications of Global Code Scheduling
Trace scheduling:
- Find the most frequent path (trace selection)
- Unroll the loop to create the trace, then schedule it efficiently (trace compaction)

47 Trace Scheduling Example
Trace exits and re-entries are very complex and require much bookkeeping.

48 Trace Scheduling
Advantages:
- Eliminates some hard decisions in global code scheduling
- Good for code such as scientific programs with intensive loops and predictable behavior
Disadvantages:
- Significant overhead in compensation code when the trace must be exited

49 Superblocks
Similar to trace scheduling, but a superblock has only ONE entry point.
When the trace is exited, a duplicated copy of the remaining loop code is used (tail duplication).

50 Chapter 4 Overview
Basic Compiler Techniques
- Pipeline scheduling
- Loop unrolling
Static Branch Prediction
Static Multiple Issue: VLIW
Advanced Compiler Support for Exposing ILP
- Detecting loop-level parallelism
- Software pipelining (symbolic loop unrolling)
- Global code scheduling
Hardware Support for Exposing More Parallelism
- Conditional or predicated instructions
- Compiler speculation with hardware support
- Hardware vs. software speculation mechanisms
Intel IA-64 ISA

51 Review
Techniques so far: loop unrolling, software pipelining, trace scheduling, global code scheduling.
Problems:
- Unpredictable branches: these techniques increase parallelism only when branch behavior is known
- Dependencies between memory references
- Moving code across branches is a very difficult problem

52 Hardware Options
Instruction set changes:
- Conditional instructions, e.g. conditional move:
      CMOVZ R1, R2, R3     ; R1 <- R2 if R3 == 0
- Predicated instructions, e.g. predicated load:
      LWC R1, 9(R2), R3    ; R1 <- M[R2+9] if R3 != 0

53 Conditional Moves
Can be used to eliminate some branches:

if (A == 0) { S = T; }

Let A, S, T be assigned to R1, R2, R3.

Without a conditional move:
      BNEZ R1, L
      ADDU R2, R3, R0
L:

With a conditional move:
      CMOVZ R2, R3, R1

The control dependence is converted to a data dependence.

54 Conditional Moves
Useful for conversions such as absolute value:

A = abs(B)
if (B < 0) { A = -B; } else { A = B; }

Can be implemented with two conditional moves.
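In C the same conversion is a conditional select, which a compiler can map onto a conditional-move instruction (a sketch; at the source level a single select suffices):

```c
/* Branchless abs: initialize A = B, then conditionally overwrite
   with -B.  The ?: maps naturally onto a CMOV-style instruction. */
long cond_abs(long b)
{
    long a = b;             /* A = B              */
    long neg = -b;
    a = (b < 0) ? neg : a;  /* cmov: A = -B if B < 0 */
    return a;
}
```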

55 Conditional Moves
- Useful for short sequences
- Not efficient for branches that guard large blocks of code
- The simplest form of predicated instruction

56 Predication
Execution of an instruction is controlled by a predicate:
- When the predicate is false, the instruction becomes a no-op
- Full predication: all instructions can be predicated
- Full predication allows conversion of large blocks of code that are branch dependent

57 Predication and Multi-Issue
Predicated instructions can be used to improve scheduling:
- Can fill delay slots
- Some branches can be eliminated
- Eliminates some control dependencies
- Reduces the overhead of global code scheduling

58 Limits of Predicated Instructions
- Annulled instructions still take processor resources
- Predicated instructions are not efficient for multiple branches
- Implementing conditional/predicated instructions has some hardware cost

59 Examples
Support conditional moves: MIPS, Alpha, PowerPC, SPARC, Intel x86
Support full predication: IA-64

60 Compiler Speculation
To speculate ambitiously, the compiler must have:
1. The ability to find instructions that can be speculatively moved without affecting program data flow
2. The ability to ignore exceptions in speculated instructions until it is certain they should occur
3. The ability to speculatively interchange loads and stores that may have address conflicts
The last two require hardware support.

61 Preserving Exception Behavior
The result of a mis-predicted speculated instruction will not be used in the final computation, so it should not cause an exception. Four approaches:
1. HW and OS cooperatively ignore exceptions for speculated instructions
2. Use speculative instructions that never raise exceptions, with "check" instructions to determine when exceptions should occur
3. Attach "poison" status bits to result registers written by speculated instructions that cause exceptions; a fault occurs when a normal instruction uses the result
4. Hardware buffers speculative instruction results until the instruction is no longer speculative

62 Exception Categories
- Resuming exceptions: handled, then execution resumes normally (page fault, I/O request, etc.)
- Terminating exceptions: indicate a program error (overflow, memory protection fault, etc.)

63 Approach 1
HW and OS cooperatively ignore exceptions for speculated instructions: handle resuming exceptions normally, but ignore terminating exceptions.
- Resuming exceptions: handling an exception for a speculative instruction costs performance, but never causes incorrect program behavior
- Terminating exceptions: HW and OS return an undefined value for any exception that would normally terminate the program. Since termination SHOULD occur when such exceptions arise, an incorrect program no longer behaves as it should; correct programs are unaffected.
Used for "fast mode" in some processors.

64 Approach 2
Same as Approach 1, but add instructions to "check" for terminating exceptions:
- Both correct and incorrect programs execute without error
- Requires additional checking instructions

Example (if (A == 0) A = B; else A = A + 4):
      LD     R1, 0(R3)    ; load A
      sLD    R14, 0(R2)   ; speculative load of B, no termination
      BNEZ   R1, L1       ; test A
      SPECCK 0(R2)        ; check for a speculation exception
      J      L2           ; skip else
L1:   DADDI  R14, R1, #4  ; else clause
L2:   SD     R14, 0(R3)   ; store A

65 Approach 3
Track exceptions as they occur, postponing terminating exceptions:
- Requires a "poison" bit for each register and a speculation bit for each instruction
- A terminating exception causes the poison bit of the result register to be set
- Speculative instructions that use a poisoned result pass the poison bit on to their result
- A non-speculative instruction that uses a poisoned result causes termination
- Stores cannot be speculative, since memory locations cannot have poison bits
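The poison-bit mechanism can be modeled in miniature (a C sketch; the types and function names are invented for illustration, not an actual ISA interface):

```c
#include <stdbool.h>

/* Each "register" carries a poison flag.  A speculative op that would
   fault sets poison instead of trapping; speculative consumers
   propagate it; a non-speculative consumer of a poisoned value traps. */
typedef struct { long val; bool poison; } reg_t;

reg_t spec_load(const long *addr)      /* speculative load */
{
    if (addr == 0)                     /* would-be terminating fault */
        return (reg_t){0, true};       /* defer it: set poison bit   */
    return (reg_t){*addr, false};
}

reg_t spec_add(reg_t a, long imm)      /* speculative use propagates */
{
    return (reg_t){a.val + imm, a.poison};
}

bool commit(reg_t r, long *out)        /* non-speculative use */
{
    if (r.poison) return false;        /* raise the deferred exception */
    *out = r.val;
    return true;
}
```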

66 Approach 3 Example

      LD    R1, 0(R3)    ; load A
      sLD   R14, 0(R2)   ; speculative load; sets R14's poison bit on exception
      BEQZ  R1, L1       ; test A
      DADDI R14, R1, #4  ; else clause
L1:   SD    R14, 0(R3)   ; store A -- faults if the poison bit is set

Both correct and incorrect programs execute without error.

67 Approach 4
Hardware buffers speculative instruction results until the instruction is no longer speculative (similar to a reorder buffer with dynamic scheduling).
Compiler:
- Marks instructions as speculative
- Indicates the number of branches an instruction spans and the branch assumptions (taken/not taken)
- Marks the original location of each speculative instruction with a "sentinel"
- Is responsible for register renaming
Hardware:
- Places instructions in a reorder buffer when issued and forces them to commit in order (without dynamic scheduling)
- Commits values when the instruction is no longer speculative (the sentinel location is reached, or the branch it depends on is resolved)
- Handles exceptions when values are committed

68 Hardware Support for Memory Reference Speculation
- Moving loads above stores is important for reducing the critical paths of code segments
- The compiler cannot always be certain that reordering a load/store pair is correct (memory address dependencies)
- Add a special check instruction, left at the original location of the load:
  - Hardware saves the address of the speculative load
  - If a subsequent store (before the special instruction) uses that address, the speculation fails; otherwise it succeeds
  - On failure, the load must be re-executed at the check point
  - If other speculative instructions were executed using the loaded value, they must be re-executed too
- Failed speculation is therefore expensive
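The check mechanism can be modeled with a one-entry table in the spirit of IA-64's ALAT (a C sketch; all names are illustrative):

```c
/* One-entry ALAT model: an advanced load records its address; an
   intervening store to the same address invalidates the entry; the
   check at the load's original position re-executes the load on a
   conflict. */
typedef struct { const int *addr; int valid; } alat_t;

int adv_load(alat_t *a, const int *p)   /* load moved above a store */
{
    a->addr = p;
    a->valid = 1;
    return *p;
}

void checked_store(alat_t *a, int *p, int v)  /* store checks the table */
{
    if (a->valid && a->addr == p) a->valid = 0;
    *p = v;
}

int check_load(alat_t *a, const int *p, int spec_val)  /* the check */
{
    return a->valid ? spec_val : *p;    /* reload only on a conflict */
}
```

In the conflict case the check reloads; in the common no-conflict case the speculatively loaded value is used directly, which is where the critical-path win comes from.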

69 Hardware vs Software Speculation
Disambiguation of memory references:
- Software: hard to do at compile time if the program uses pointers
- Hardware: dynamic disambiguation is possible, supporting reordering of loads and stores in Tomasulo's approach
- Hardware support for speculative memory references can help the compiler, but the overhead of recovery is high

70 Hardware vs Software Speculation
When control flow is unpredictable, hardware speculation works better than compiler speculation:
- Integer programs tend to have unpredictable control flow
- A static (compiler) predictor misses about 16% on SPECint; hardware predictors miss under 10%
- Even statically scheduled processors normally include dynamic branch predictors

71 Hardware vs Software Speculation
Hardware-based speculation maintains completely precise exception model Software-based approaches have added special hardware support to allow this as well.

72 Hardware vs Software Speculation
Hardware-based speculation does not require compensation or bookkeeping code; ambitious software-based approaches require it.

73 Hardware vs Software Speculation
Compiler-based approaches can see a longer code sequence Better code scheduling can result

74 Hardware vs Software Speculation
Hardware-based speculation requires complex additional hardware resources, while compiler-based speculation requires complex software. For hardware that supports compiler speculation, tradeoffs must be made between hardware cost and the amount and usefulness of the simplifications it enables.

75 Intel IA-64 Architecture and the Itanium Implementation
Instruction set architecture:
- Instruction format
- Examples of explicit parallelism support
- Predication and speculation support
Itanium implementation:
- Functional units and instruction issue
- Performance

76 The IA-64 Instruction Set Architecture
Register model:
- 128 General-Purpose Registers (64-bit; actually 65 bits including the NaT bit)
- 128 Floating-Point Registers (82-bit)
- 64 Predicate Registers (1-bit)
- 8 Branch Registers (64-bit): hold branch addresses for indirect branches
- Other registers for system control, memory mapping, performance counters, and communication with the OS

77 Integer Registers
128 General-Purpose Registers (64-bit):
- R0-R31 are always accessible
- R32-R127 are implemented as a register stack: each procedure is allocated a set of registers
- CFM (Current Frame Marker) points to the register set of the current procedure

78 Instruction Format
VLIW approach:
- Implicit parallelism among operations in an instruction
- Fixed formatting of the operation fields
More flexible than most VLIW architectures:
- Depends on the compiler to detect ILP
- The compiler schedules operations into parallel slots

79 Instruction Groups
A sequence of consecutive instructions with no register dependencies (there may be some memory dependencies).
Boundaries between groups are indicated with a "stop".

80 Instruction Bundles
128 bits of encoded instructions. Each bundle contains:
- A 5-bit template field specifying what type of execution unit each instruction requires
- 3 instructions, each 41 bits
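Decoding this layout can be sketched in C, modeling the 128-bit bundle as two 64-bit halves (an illustration of the bit layout just described, not actual Itanium decode logic):

```c
#include <stdint.h>

/* A bundle: 5-bit template in the low bits, then three 41-bit
   instruction slots.  lo holds bits 0-63, hi holds bits 64-127. */
typedef struct { uint64_t lo, hi; } bundle_t;

static uint64_t bits(bundle_t b, unsigned pos, unsigned len)
{
    /* extract len (< 64) bits starting at bit pos of the 128-bit value */
    uint64_t v = (pos < 64) ? (b.lo >> pos) : 0;
    if (pos < 64 && pos + len > 64) v |= b.hi << (64 - pos);
    if (pos >= 64) v = b.hi >> (pos - 64);
    return v & ((UINT64_C(1) << len) - 1);
}

uint64_t template_of(bundle_t b) { return bits(b, 0, 5); }
uint64_t slot_of(bundle_t b, int s) { return bits(b, 5 + 41 * s, 41); }
```

Note that slot 1 straddles the 64-bit boundary (bits 46-86), which is why the extractor must merge both halves.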

81 Execution Slots

Execution unit slot | Instruction type | Description | Example instructions
I-unit | A, I | Integer ALU; non-integer ALU | add, sub, and, or, ...; integer and multimedia shifts, bit tests, ...
M-unit | M | Memory access | loads/stores, integer/FP
F-unit | F | Floating point | floating-point instructions
B-unit | B | Branches | conditional branches
L+X | L+X | Extended | extended immediates, stops, nops

82 Templates (partial listing)

Template | Slot 0 | Slot 1 | Slot 2
0-3   | M | I | I
4-5   | M | L | X
8-11  | M | M | I
12-13 | M | F | I
14-15 | M | M | F
16-17 | M | I | B
18-19 | M | B | B
22-23 | B | B | B
24-25 | M | M | B
28-29 | M | F | B

83 Exercise
Loop example using the MIPS form of the instructions:

loop: L.D    F0, 0(R1)     ;F0 = array element
      ADD.D  F4, F0, F2    ;add scalar in F2
      S.D    F4, 0(R1)     ;store result
      DADDUI R1, R1, #-8   ;decrement pointer (DW)
      BNE    R1, R2, loop  ;branch if R1 != R2

Let's see if we can unroll this loop and map it to IA-64 bundles.

84 Unrolled Loop (7 times)

loop: L.D    F0, 0(R1)
      ADD.D  F4, F0, F2
      S.D    F4, 0(R1)
      L.D    F6, -8(R1)
      ADD.D  F8, F6, F2
      S.D    F8, -8(R1)
      L.D    F10, -16(R1)
      ADD.D  F12, F10, F2
      S.D    F12, -16(R1)
      L.D    F14, -24(R1)
      ADD.D  F16, F14, F2
      S.D    F16, -24(R1)
      L.D    F18, -32(R1)
      ADD.D  F20, F18, F2
      S.D    F20, -32(R1)
      L.D    F22, -40(R1)
      ADD.D  F24, F22, F2
      S.D    F24, -40(R1)
      L.D    F26, -48(R1)
      ADD.D  F28, F26, F2
      S.D    F28, -48(R1)
      DADDUI R1, R1, #-56
      BNE    R1, R2, loop

85 Unrolled Loop (Scheduled)

loop: L.D    F0, 0(R1)     ; type M
      L.D    F6, -8(R1)
      L.D    F10, -16(R1)
      L.D    F14, -24(R1)
      L.D    F18, -32(R1)
      L.D    F22, -40(R1)
      L.D    F26, -48(R1)
      ADD.D  F4, F0, F2    ; type F
      ADD.D  F8, F6, F2
      ADD.D  F12, F10, F2
      ADD.D  F16, F14, F2
      ADD.D  F20, F18, F2
      ADD.D  F24, F22, F2
      ADD.D  F28, F26, F2
      S.D    F4, 0(R1)     ; type M
      S.D    F8, -8(R1)
      S.D    F12, -16(R1)
      S.D    F16, -24(R1)
      S.D    F20, -32(R1)
      DADDUI R1, R1, #-56  ; type I
      S.D    F24, 16(R1)   ; 16-56 = -40
      BNE    R1, R2, loop
      S.D    F28, 8(R1)    ; 8-56 = -48

Latencies:
Instruction producing result | Instruction using result | Latency in clock cycles
FP ALU operation | Another FP ALU operation | 3
FP ALU operation | Store double | 2
Load double | FP ALU operation | 1

86 Find Possible Stops in Code
Using the same scheduled code as the previous slide (loads of type M, adds of type F, stores of type M, DADDUI of type I), identify where stops must separate dependent instructions into different instruction groups.

87 Scheduled Code

Template | Slot 0 | Slot 1 | Slot 2 | Cycle
9: M M I | L.D F0, 0(R1) | L.D F6, -8(R1) | | 1
14: M M F | L.D F10, -16(R1) | L.D F14, -24(R1) | ADD.D F4, F0, F2 | 3
(table continues for the remaining bundles)

88 Scheduled Code - Faster

Template | Slot 0 | Slot 1 | Slot 2 | Cycle
8: M M I | L.D F0, 0(R1) | L.D F6, -8(R1) | | 1
9: M M F | L.D F10, -16(R1) | L.D F14, -24(R1) | | 2
14: M M F | L.D F18, -32(R1) | L.D F22, -40(R1) | ADD.D F4, F0, F2 | 3
(table continues for the remaining bundles)

89 IA-64 Instruction Format
Each instruction is 41 bits:
- 4-bit major opcode (together with the 5-bit bundle template, this determines the major operation)
- 31 bits of operand and modifier fields
- 6-bit predicate register specifier (selecting one of the 64 1-bit predicate registers)

90 Type-A Instructions
Type A has 8 instruction formats. Examples (fields as listed on the slide):
- Add, sub, and, or: 9 extra opcode bits, 3 GPRs
- Shift left and add: 7 extra opcode bits, plus a 2-bit shift count
- ALU immediates / add immediate: 2 GPRs, 14- or 22-bit immediates
- Compare / compare immediate: 4 extra opcode bits, 2 predicate register destinations

91 Type-I Instructions
Type I has 29 instruction formats. Examples:
- Shift R/L variable: 9 extra opcode bits, 3 GPRs; also used by multimedia instructions
- Test bit: 6 extra opcode bits, a 6-bit field specifier, 2 predicate register destinations
- Move to BR: a 9-bit branch-predict field and a branch register specifier

92 Type-M Instructions
Type M has 46 instruction formats. Examples:
- Integer/FP load and store, line prefetch: 10 extra opcode bits, 2 GPRs/FPRs; speculative and non-speculative forms
- Integer/FP load and store, line prefetch with post-increment by immediate: 9 extra opcode bits, 8 immediate bits
- Integer/FP load prefetch with register post-increment: 3 GPRs/FPRs
- Integer/FP speculation check: 21 immediate bits in two fields

93 Type-B Instructions
Type B has 9 instruction formats. Examples:
- PC-relative branch, counted branch: 7 extra opcode bits, 21 immediate bits
- PC-relative call: 4 extra opcode bits, 1 branch register

94 Type-F Instructions
Type F has 15 instruction formats. Examples:
- FP arithmetic: 2 extra opcode bits, 4 FPRs
- FP compare: 2 extra opcode bits, two 6-bit predicate register destinations

95 Type-L+X Instructions
Type L+X has 4 instruction formats. Example:
- Move immediate long: 2 extra opcode bits, 1 GPR, 64 immediate bits; takes 2 instruction slots
This summary does not include all of the multimedia instructions.

96 Predication
- Almost all instructions can be predicated: the predicate register is specified in the last 6 bits of the instruction
- Predicate registers are set with compare or test instructions
- A compare offers 10 possible comparison tests and writes two predicate registers as destinations: either the result and its complement, or a logical function that combines the test with earlier predicates
- Multiple tests can be done with one instruction

97 Speculation Support
Control speculation support:
- Deferred exceptions for speculated instructions (the equivalent of poison bits)
Memory reference speculation:
- Support for speculation of load instructions

98 Deferred Exceptions
Support to indicate an exception on a speculative instruction:
- GPRs have NaT (Not a Thing) bits (making the registers 65 bits long)
- FPRs use NaTVal (Not a Thing Value): a significand of 0 with an out-of-range exponent
- NaTs and NaTVals are propagated by speculative instructions that don't reference memory
- FP instructions use status registers to record exceptions for this purpose

99 Resolution of Deferred Exceptions
- If a non-speculative instruction receives a NaT or NaTVal as a source operand, a terminating exception is generated
- If a chk.s instruction detects a NaT or NaTVal, it branches to a routine designed to recover from the speculative operation

100 Memory Reference Support
Advanced loads: speculative loads moved above a store.
- An advanced load creates an entry in the ALAT table: the register destination of the load and the address of the accessed memory location
- When a store executes, its address is compared to the active ALAT entries; on a match, the ALAT entry is marked invalid
- When the instruction USING the speculative load value executes, the ALAT is checked:
  - ld.c: the check used if only the load is speculative; it just reloads the value
  - chk.a: the check used if other speculative code used the loaded value; it specifies the address of a "fix-up" routine that re-executes the code sequence

101 Itanium Processor
- First implementation of IA-64 (2001), 800 MHz clock
- Multiple issue: up to 6 issues per clock cycle, including up to 3 branches and 2 memory references
- 3-level cache hierarchy: L1 split data/instruction caches; L2 unified, on-chip; L3 unified, off-chip (but in the same package)

102 Functional Units
- All functional units are pipelined
- Bypassing (forwarding) paths are implemented; a bypass between units has a 1-cycle delay
- Units: 2 I-units, 2 M-units, 3 B-units, 2 F-units

Instruction | Latency
Integer load | 1
Floating-point load | 9
Correctly predicted taken branch | 0-3
Mispredicted branch | 9
Integer ALU operation | 0
FP arithmetic | 4

103 Itanium Multi-Issue
- Instruction issue window of 2 bundles at a time (template + 3 instructions each), so up to 6 instructions issue at once
- NOPs and predicated instructions with false predicates are not issued
- If one or more instructions cannot be issued because a functional unit is unavailable, the bundle can be split

104 Itanium Pipeline (10 Stages)
- Front end (IPG, Fetch, Rotate): prefetches up to 32 bytes per clock into a buffer holding up to 8 bundles; branch prediction via a multi-level adaptive predictor
- Instruction delivery (EXP, REN): distributes up to 6 instructions to the 9 functional units; implements register renaming
- Operand delivery (WLD, REG): accesses the register file, performs register bypassing, and checks predicate dependencies with a register scoreboard
- Execution (EXE, DET, WRB): executes instructions in the ALU and load-store units; detects exceptions and posts NaTs; writes back

105 Features in Common with Dynamically Scheduled Pipelines
- Branch prediction
- Register renaming
- Scoreboarding (as in Tomasulo's algorithm)
- Deep pipeline: many stages before EX
- Stages after execution to handle exceptions

106 Itanium Performance: Integer Benchmarks
- Itanium shows the best performance only on the mcf benchmark
- Geometric means: Itanium delivers 60% of the Pentium 4's performance and 68% of the Alpha 21264's
- Adjusted for clock speed, Itanium delivers 85% of the Alpha 21264's performance

107 Itanium Performance: FP Benchmarks
- Itanium has the best performance on 8 of 16 benchmarks
- Geometric means: Itanium delivers 108% of the Pentium 4's performance and 120% of the Alpha 21264's
- The art benchmark has a large effect on these means

108 Conclusions
- Multi-issue processors achieve high performance only with a large investment in silicon area and hardware complexity
- There is no clear "winner" between hardware and software approaches to ILP in general
- Software helps with conditional instructions and speculative load support
- Hardware helps with scoreboard-style scheduling, dynamic branch prediction, and local checking of speculated-load correctness

