
Chapter 2: ILP and Its Exploitation


1 Chapter 2: ILP and Its Exploitation
Review of the simple static pipeline
ILP overview
Dynamic branch prediction
Dynamic scheduling, out-of-order execution
Multiple issue (superscalar)
Hardware-based speculation
ILP limitations
Intel P6 microarchitecture

2 Advanced Processor Pipelining
Focus: exploiting Instruction-Level Parallelism (ILP)
Definition: executing multiple instructions (within a single program thread) simultaneously
Note that even ordinary pipelining exploits some ILP (it overlaps the execution of multiple instructions).
Focus of this chapter:
Increasing ILP further by allowing out-of-order execution, with or without speculation
Using multiple-issue datapaths to initiate multiple instructions simultaneously for further improvement
Microarchitectures that do this are called superscalar. Examples: PowerPC, Pentium, etc.

3 Pipeline Performance
Ideal pipeline CPI = 1 is the minimum number of cycles per instruction issued, if no stalls occur.
May be <1 in superscalar machines. E.g., ideal CPI = 1/3 in a 3-way superscalar (often stated as IPC = 3).
Real pipeline CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control hazard stalls
Maximize performance by using various techniques to eliminate stalls and to reduce the ideal CPI.
Note: real pipeline CPI still needs to account for cache misses (discussed later).
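As a quick arithmetic sketch of the CPI formula above (in Python; the stall frequencies below are invented for illustration, not taken from the slides):

```python
# Real CPI = ideal CPI + stalls per instruction from each hazard class.
# The stall frequencies are made-up illustration values, not measurements.
ideal_cpi = 1 / 3          # 3-way superscalar: ideal CPI = 1/3 (IPC = 3)
structural_stalls = 0.05   # structural stalls per instruction (assumed)
data_stalls = 0.10         # data hazard stalls per instruction (assumed)
control_stalls = 0.08      # control hazard stalls per instruction (assumed)

real_cpi = ideal_cpi + structural_stalls + data_stalls + control_stalls
ipc = 1 / real_cpi
print(f"real CPI = {real_cpi:.3f}, IPC = {ipc:.2f}")
```

Even modest stall rates cut the achieved IPC well below the 3-wide issue rate, which is why the rest of the chapter is about eliminating stalls.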

4 Advanced Pipelining Techniques
Technique -> Reduces
Loop unrolling -> Control stalls
Basic pipeline scheduling / forwarding -> RAW stalls
Dynamic scheduling with scoreboarding -> RAW stalls
Dynamic scheduling with register renaming -> WAR & WAW stalls
Dynamic branch prediction -> Control stalls
Issuing multiple instructions per cycle -> Ideal pipeline CPI
Compiler dependence analysis -> Ideal CPI & data stalls
Software pipelining & trace scheduling -> Ideal CPI & data stalls
Hardware speculation -> All data & control stalls
Dynamic memory disambiguation -> RAW stalls involving memory

5 Instruction-Level Parallelism (ILP)
Basic-block ILP is quite small
BB: a straight-line code sequence with no branches in or out, except at the entry and the exit
Average dynamic branch frequency of 15% to 25% => only 4 to 7 instructions execute between two branches
Plus, the instructions in a BB are likely to depend on each other
To obtain significant performance enhancements, we must exploit ILP across multiple basic blocks
Simplest case: loop-level parallelism, exploiting parallelism among the iterations of a loop. E.g.,
for (i=1; i<=1000; i=i+1)
    x[i] = x[i] + y[i];

6 Dependences
A dependence is a way in which one instruction can depend on (be impacted by) another for scheduling purposes. Three major dependence types:
Data (true) dependence: RAW
Name dependence: WAR, WAW
Control dependence: branch, jump, etc.
A dependency (or dependence) is a particular instance of one instruction depending on another. Dependent instructions cannot be effectively (as opposed to just syntactically) fully parallelized or reordered.

7 Data Dependence
Recursive definition: instruction B is data dependent on instruction A iff:
B uses a data result produced by instruction A, or
There is another instruction C such that B is data dependent on C, and C is data dependent on A.
A data dependence is the potential for a RAW hazard.
Loop: LD   F0,0(R1)
      ADDD F4,F0,F2
      SD   0(R1),F4
      SUBI R1,R1,#8
      BNEZ R1,Loop
(Figure: data-dependence arrows in the loop example, e.g., LD -> ADDD on F0, ADDD -> SD on F4, SUBI -> BNEZ on R1)

8 Name Dependence Occurs when two instructions both access the same data storage location due to reuse the storage (Also called storage dependence, at least one of the accesses must be a write.) Two sub-types (for inst. B after inst. A): Antidependence: A reads, then B writes. Potential for a WAR hazard. Output dependence: A writes, then B writes. Potential for a WAW hazard. Note: Name dependencies can be avoided by changing instructions to use different locations (rather than reusing a location).

9 WAR, WAW Examples
WAR hazard: InstrJ writes an operand before InstrI reads it:
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
WAW hazard: InstrJ writes an operand before InstrI writes it:
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7

10 Control Dependence
Occurs when the execution of an instruction depends on a conditional branch instruction.
Program control flow must be respected for correct execution. However, only two things must really be preserved:
Data flow (how a given result is produced)
Exception behavior (exceptions must be handled in order)
Example (for exception behavior):
DADDU R2, R3, R4
BEQZ  R2, L1
LW    R1, 0(R2) ; must not move before BEQZ
L1:             ; moving it could change exception behavior and clobber R1

11 Control Dependence – Another Example
Example (for data flow):
DADDU R2, R3, R4
BEQZ  R5, L1
DSUBU R2, R6, R7
L1: OR R8, R2, R9
OR depends on both DADDU and DSUBU. Maintaining the data dependences alone is not enough:
the control flow decides where the correct R2 comes from (DADDU or DSUBU).

12 Relaxing Control Dependence
Only two things must really be preserved:
Data flow (how a given result is produced)
Exception behavior
Some techniques permit removing the control dependence from instruction execution by instead conditionally ignoring instruction results:
Speculation (betting on branches, e.g., to fill delay slots): make instructions unconditional if no harm is done
Speculative multiple-path execution: take both paths, invalidate the results of one later
Conditional / predicated instructions (used in IA-64)
Note: instructions reordered around a branch must have no harmful side effects.

13 Loop Unrolling
This code adds a scalar to a vector:
for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;
Assume the following latencies for all examples; ignore the delayed branch in these examples.
Instruction producing result -> Instruction using result: stalls (in cycles)
FP ALU op -> Another FP ALU op: 3
FP ALU op -> Store double: 2
Load double -> FP ALU op: 1
Load double -> Store double: 0
Integer op -> Integer op: 0

14 MIPS Code
First translate into MIPS code (to simplify, assume 8 is the lowest address):
Loop: L.D    F0,0(R1)  ;F0=vector element
      ADD.D  F4,F0,F2  ;add scalar from F2
      S.D    0(R1),F4  ;store result
      DADDUI R1,R1,-8  ;decrement pointer 8B (DW)
      BNEZ   R1,Loop   ;branch if R1!=zero

15 Execution Cycles without Inst Scheduling
1 Loop: L.D    F0,0(R1)  ;F0=vector element
2       stall
3       ADD.D  F4,F0,F2  ;add scalar in F2
4       stall
5       stall
6       S.D    0(R1),F4  ;store result
7       DADDUI R1,R1,-8  ;decrement pointer 8B (DW)
8       stall            ;assumes can't forward to branch
9       BNEZ   R1,Loop   ;branch if R1!=zero
(Stalls used: FP ALU op -> another FP ALU op: 3; FP ALU op -> store double: 2; load double -> FP ALU op: 1)
9 clock cycles. Can we rewrite the code to minimize stalls?

16 Apply Instruction Scheduling
1 Loop: L.D    F0,0(R1)
2       DADDUI R1,R1,-8
3       ADD.D  F4,F0,F2
4       stall
5       stall
6       S.D    8(R1),F4  ;altered offset
7       BNEZ   R1,Loop
7 clock cycles, but just 3 are for the actual work (L.D, ADD.D, S.D) and 4 are loop overhead. Can we go faster?

17 Unroll Loop Four Times
1  Loop: L.D    F0,0(R1)
3        ADD.D  F4,F0,F2
6        S.D    0(R1),F4    ;no DSUBUI/BNEZ
7        L.D    F6,-8(R1)
9        ADD.D  F8,F6,F2
12       S.D    -8(R1),F8   ;drop loop test
13       L.D    F10,-16(R1)
15       ADD.D  F12,F10,F2
18       S.D    -16(R1),F12 ;drop loop test
19       L.D    F14,-24(R1)
21       ADD.D  F16,F14,F2
24       S.D    -24(R1),F16
25       DADDUI R1,R1,#-32  ;alter to 4*8
27       BNEZ   R1,LOOP
27 clock cycles, or 6.75 per iteration (assumes the iteration count in R1 is a multiple of 4)

18 Unrolled and Rescheduled Loop
1  Loop: L.D    F0,0(R1)
2        L.D    F6,-8(R1)
3        L.D    F10,-16(R1)
4        L.D    F14,-24(R1)
5        ADD.D  F4,F0,F2
6        ADD.D  F8,F6,F2
7        ADD.D  F12,F10,F2
8        ADD.D  F16,F14,F2
9        S.D    0(R1),F4
10       S.D    -8(R1),F8
11       S.D    -16(R1),F12
12       DSUBUI R1,R1,#32
13       S.D    8(R1),F16   ;8-32 = -24
14       BNEZ   R1,LOOP
14 clock cycles, or 3.5 per iteration
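The progression on slides 15 through 18 can be checked with a little arithmetic (cycle counts taken directly from the slides):

```python
# Cycles per vector element for each version of the loop on these slides.
versions = {
    "unscheduled":           9 / 1,   # 9 cycles for 1 element
    "scheduled":             7 / 1,   # 7 cycles for 1 element
    "unrolled x4":          27 / 4,   # 27 cycles for 4 elements
    "unrolled + scheduled": 14 / 4,   # 14 cycles for 4 elements
}
for name, cycles_per_elem in versions.items():
    print(f"{name:22s}: {cycles_per_elem:.2f} cycles/element")
```

Unrolling alone removes loop overhead and branch stalls; combining it with scheduling also hides the load and FP latencies, for a 2.6x improvement over the naive loop.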

19 Loop Unrolling Decisions
Requires understanding how one instruction depends on another and how the instructions can be changed or reordered given those dependences:
Determine that unrolling is useful by finding that the loop iterations are independent (except for the loop-maintenance code)
Use different registers to avoid unnecessary constraints that would be forced by reusing the same registers for different computations
Eliminate the extra test and branch instructions, and adjust the loop termination and iteration code
Determine that the loads and stores in the unrolled loop can be interchanged, by observing that loads and stores from different iterations are independent; this transformation requires analyzing the memory addresses and finding that they do not refer to the same address
Schedule the code, preserving any dependences needed to yield the same result as the original code

20 Three Unrolling Considerations
Decrease in the amount of overhead amortized with each extra unrolling (Amdahl's Law)
Growth in code size: for larger loops, the concern is that it increases the instruction-cache miss rate
Register pressure: a potential shortfall in registers created by aggressive unrolling and scheduling; if it is not possible to allocate all live values to registers, the code may lose some or all of its advantage
Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction

21 Static Branch Prediction
Delayed branch: to reorder code around branches, we need to predict each branch statically at compile time
Predict every branch as taken: average misprediction rate = untaken frequency = 34% on SPEC
More accurate static prediction: profile-based

22 Dynamic Branch Prediction
Why does prediction work?
The underlying algorithm has regularities
The data being operated on has regularities
The instruction sequence has redundancies that are artifacts of the way humans/compilers think about problems
Is dynamic branch prediction better than static branch prediction?
It seems to be: there are a small number of important branches in programs that have dynamic behavior

23 Dynamic Branch Prediction
As the amount of ILP exploited increases (CPI decreases), the impact of control stalls increases:
Branches come more often
An n-cycle delay postpones more instructions
Dynamic hardware branch prediction:
"Learns" which branches are taken, or not
Makes the right guess (most of the time) about whether a branch is taken
The delay depends on whether the prediction is correct, and on whether the branch is taken

24 Branch-Prediction Buffers (BPB)
Also called a "branch history table"
The low-order n bits of the branch address index a table of branch-history data used for prediction
May have "collisions" between distant branches; associative tables are also possible
Each entry stores k bits of information about the history of that branch; common values of k: 1, 2, and larger
The entry is used to predict what the branch will do; the actual behavior of the branch then updates the entry
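A minimal sketch of the indexing and of how collisions arise (the table size and the assumption that word-aligned PCs drop their two zero bits are illustrative choices, not from the slides):

```python
# Branch-history-table indexing sketch: use the low-order bits of the PC.
# Assumption: instructions are word-aligned, so the 2 low bits are dropped first.
TABLE_BITS = 12                          # 4096-entry table

def bpb_index(pc, table_bits=TABLE_BITS):
    return (pc >> 2) & ((1 << table_bits) - 1)

# Two branches far apart in memory can collide on the same entry ("aliasing"):
b1 = 0x00401000
b2 = b1 + (1 << (TABLE_BITS + 2))        # differs only above the index bits
print(bpb_index(b1) == bpb_index(b2))    # the two branches share one entry
```

This is why growing the table reduces collisions but, past a few thousand entries, most branches already have a private entry and accuracy stops improving.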

25 1-bit Branch Prediction
The entry for a branch has only two states:
Bit = 1: "The last time this branch was encountered, it was taken. I predict it will be taken next time."
Bit = 0: "The last time this branch was encountered, it was not taken. I predict it will not be taken next time."
Makes 2 mistakes each time a loop is encountered: on the first and last iterations.
May always mispredict in pathological cases!
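The two-mistakes-per-loop behavior is easy to reproduce with a tiny simulation (a sketch; the loop length and initial state are arbitrary choices):

```python
def run_1bit(outcomes, state=0):
    """1-bit predictor: the state is simply the last outcome (1 = taken)."""
    misses = 0
    for taken in outcomes:
        if (state == 1) != taken:
            misses += 1
        state = int(taken)          # remember only the most recent outcome
    return misses

# A 4-iteration loop branch executed twice: taken x3, then not-taken (exit).
outcomes = ([True] * 3 + [False]) * 2
print(run_1bit(outcomes))  # 2 misses per loop execution: entry and exit
```

Each pass through the loop mispredicts twice: once on the first iteration (the stored bit still says "not taken" from the previous exit) and once on the exit branch.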

26 2-bit Branch Prediction
4 states, based on the most recent two branch outcomes
Only 1 misprediction per loop execution (on the last iteration), after the first time the loop is reached
What about an n-bit predictor?
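The standard 4-state scheme is a saturating counter; a minimal sketch (counter encoding 0-3 with 2-3 predicting taken is the usual convention):

```python
def run_2bit(outcomes, state=3):
    """2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken."""
    misses = 0
    for taken in outcomes:
        if (state >= 2) != taken:
            misses += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses

# The same doubly-executed 4-iteration loop branch as on the previous slide:
outcomes = ([True] * 3 + [False]) * 2
print(run_2bit(outcomes))  # only the loop-exit branch mispredicts: 2 total
```

A single not-taken exit only moves the counter from "strongly taken" to "weakly taken", so re-entering the loop is still predicted correctly: 1 miss per execution instead of 2.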

27 State Transition Diagram

28 Implementing Branch Histories
Two options: a separate "cache" (prediction table) accessed during the IF stage, or extra bits in the instruction cache
Problem with this approach in MIPS: after fetch, we don't know whether the instruction is really a branch (until decoding), and we don't know the target address
In MIPS, by the time you know these things (in ID), you already know whether the branch is really taken, so no time has been saved!
Branch-target buffers can fix this problem (later)...

29 Misprediction Rate for 2-bit BPB

30 Branch-Prediction Performance
Contribution to the cycle count depends on:
Branch frequency & misprediction frequency
Frequencies of taken/not-taken, predicted/mispredicted
Delays of taken/not-taken, predicted/mispredicted
How to reduce the misprediction frequency?
Increase the buffer size to avoid collisions: has little effect beyond ~4,096 entries
Increase prediction accuracy: increase the number of bits per entry (little effect beyond 2), or use a different prediction scheme (correlated predictors, tournament predictors)

31 Exhaustive Search for Optimal 2-bit Predictor
• There are 2^20 possible state machines for 2-bit predictors
• Some machines are uninteresting; pruning them out reduces the number of state machines to 5248
• For each benchmark, determine the prediction accuracy of all the predictor state machines
• Optimal 2-bit predictor accuracy for each application (study by IBM):
spice2g %
doduc 94.3%
gcc 89.1%
espresso 89.1%
li 87.1%
eqntott 87.9%

32 Correlated Prediction - Example
Code fragment from eqntott:
if (aa==2) aa=0;    /* if1 */
if (bb==2) bb=0;    /* if2 */
if (aa!=bb) { … };  /* false if both if1 and if2 conditions held */
MIPS code (aa=R1, bb=R2):
    SUBUI R3,R1,#2  ; (aa-2)
    BNEZ  R3,L1     ; branch b1 (aa!=2)
    ADD   R1,R0,R0  ; aa=0
L1: SUBUI R3,R2,#2  ; (bb-2)
    BNEZ  R3,L2     ; branch b2 (bb!=2)
    ADD   R2,R0,R0  ; bb=0
L2: SUBU  R3,R1,R2  ; (aa-bb)
    BEQZ  R3,L3 …   ; branch b3 (aa==bb)

33 Even Simpler Example
C code:
if (d==0) d=1;
if (d==1) ...
MIPS code (d=R1):
    BNEZ  R1,L1     ; b1: d!=0
    ADDI  R1,R0,#1  ; d=1
L1: SUBUI R3,R1,#1  ; (d-1)
    BNEZ  R3,L2     ; b2: d!=1 (and others)
Is there any correlation between b1 and b2?

34 Using a 1-bit Predictor
Suppose the value of d alternates between 2 and 0, and the code repeats many times:
all the branches are mispredicted!! (with initial predictions NT, NT)
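This worst case can be verified by simulating the two branches directly (a sketch; b1 and b2 get independent 1-bit predictors, initialized to not-taken as the slide states):

```python
def alternating_d_misses(repetitions):
    """Run the b1/b2 branch pair with d alternating 2, 0, 2, 0, ...
    b1 and b2 are independent 1-bit predictors, both starting at not-taken."""
    b1 = b2 = 0
    misses = total = 0
    for i in range(repetitions):
        d = 2 if i % 2 == 0 else 0
        t1 = (d != 0)                         # b1: BNEZ R1,L1
        misses += int((b1 == 1) != t1)
        b1 = int(t1)
        total += 1
        if not t1:
            d = 1                             # fall-through executes d=1
        t2 = (d != 1)                         # b2: BNEZ R3,L2
        misses += int((b2 == 1) != t2)
        b2 = int(t2)
        total += 1
    return misses, total

print(alternating_d_misses(6))  # every single branch is mispredicted
```

Each predictor always stores the previous, opposite outcome, so both branches miss every time. Knowing the outcome of b1 would predict b2 perfectly, which motivates correlating predictors.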

35 Correlating Predictors
Keep different predictions for the current branch depending on whether the previously executed branch was taken or not.
Notation: _ / _ (a separate prediction pair per entry):
the first is the prediction used if the last branch was NOT taken,
the second is the prediction used if the last branch was TAKEN.
(On the slide, the prediction actually used, based on the last outcome, is shown in bold.)

36 (m,n) correlated predictors
Uses the behavior of the most recent m branches encountered to select one of 2^m different branch predictors for the next branch.
Each of these predictors records n bits of history information for any given branch.
On the previous slide we saw a (1,1) predictor.
Easy to implement:
Behavior of the last m branches: an m-bit shift register
Branch-prediction buffer: accessed with the low-order bits of the branch address, concatenated with the shift register
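The shift-register scheme can be sketched as follows (table size, PC values, and the midpoint counter initialization are illustrative assumptions). The demo replays the alternating-d branch pattern from slide 34, which a (1,1) predictor handles after two cold misses:

```python
class CorrelatingPredictor:
    """(m, n) predictor sketch: an m-bit global-history shift register selects
    one of 2^m n-bit saturating counters in each table entry."""
    def __init__(self, m, n, entries=1024):
        self.m, self.n, self.hist = m, n, 0
        init = (1 << n) // 2                  # counters start at the midpoint
        self.table = [[init] * (1 << m) for _ in range(entries)]

    def predict(self, pc):
        return self.table[pc % len(self.table)][self.hist] >= (1 << self.n) // 2

    def update(self, pc, taken):
        row = self.table[pc % len(self.table)]
        c = row[self.hist]
        row[self.hist] = min(c + 1, (1 << self.n) - 1) if taken else max(c - 1, 0)
        # Shift the newest outcome into the m-bit global history register:
        self.hist = ((self.hist << 1) | int(taken)) & ((1 << self.m) - 1)

# Alternating-d pattern from slide 34: (b1 taken, b2 taken, b1 not, b2 not), ...
p = CorrelatingPredictor(1, 1, entries=16)
misses = 0
for _ in range(4):
    for pc, taken in [(100, True), (200, True), (100, False), (200, False)]:
        misses += int(p.predict(pc) != taken)
        p.update(pc, taken)
print(misses)  # only the first two cold predictions miss
```

Once each branch has trained a separate counter for each history value, the previous branch's outcome selects the right counter and every prediction is correct.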

37 (2,2) Correlated Predictor
(Correlation based on the last 2 branch outcomes)

38 Correlated Predictors Better

39 Tournament Predictors
Three predictors: a global correlated predictor, a local 2-bit predictor, and a selector (the "tournament")
The tournament selector determines which predictor (global or local) is used for each prediction
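A common way to build the selector is another 2-bit saturating counter that migrates toward whichever component predictor has been more accurate; a sketch (the chooser encoding and the constant stand-in predictors are assumptions for the demo, not a real design):

```python
class ConstPredictor:
    """Stand-in component predictor for the demo: always predicts one way."""
    def __init__(self, value):
        self.value = value
    def predict(self, pc):
        return self.value
    def update(self, pc, taken):
        pass

class Tournament:
    """2-bit chooser sketch (assumed encoding: 0-1 favor local, 2-3 global)."""
    def __init__(self, global_pred, local_pred):
        self.g, self.l, self.sel = global_pred, local_pred, 2
    def predict(self, pc):
        return self.g.predict(pc) if self.sel >= 2 else self.l.predict(pc)
    def update(self, pc, taken):
        gp, lp = self.g.predict(pc), self.l.predict(pc)
        if gp != lp:  # move the chooser toward whichever predictor was right
            self.sel = min(self.sel + 1, 3) if gp == taken else max(self.sel - 1, 0)
        self.g.update(pc, taken)
        self.l.update(pc, taken)

# Global component is always wrong here, local is always right:
t = Tournament(ConstPredictor(False), ConstPredictor(True))
for _ in range(4):
    t.update(0x40, taken=True)
print(t.predict(0x40))  # True: the chooser migrated to the local predictor
```

The chooser only moves when the two components disagree, so a branch that the local predictor handles better gradually gets its predictions from the local side.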

40 Performance Tournament Predictor

41 Branch-Target Buffers (BTB)
How can we know the address of the next instruction as soon as the current instruction is fetched?
Normally, an extra (ID) cycle is needed to:
Determine that the fetched instruction is a branch
Determine whether the branch is taken
Compute the target address PC+offset
Branch prediction alone doesn't help here: we need the next PC
What if, instead, the next instruction address could be fetched at the same time that the current instruction is fetched? => BTB
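Functionally, a BTB is a PC-to-target map consulted during IF; a minimal sketch (a real BTB is a small set-associative hardware table with tags and a prediction policy, a dict stands in here, and the remove-on-not-taken policy is one simple choice):

```python
class BTB:
    """Branch-target-buffer sketch: maps a branch's PC to its predicted
    target, so the next fetch address is known during IF."""
    def __init__(self):
        self.targets = {}                 # pc -> predicted target address

    def next_pc(self, pc):
        # Hit: fetch from the predicted target; miss: fetch sequentially.
        return self.targets.get(pc, pc + 4)

    def update(self, pc, taken, target):
        if taken:
            self.targets[pc] = target     # enter / refresh a taken branch
        else:
            self.targets.pop(pc, None)    # drop a branch that fell through

btb = BTB()
print(hex(btb.next_pc(0x40)))  # 0x44: not in the BTB, sequential fetch
btb.update(0x40, taken=True, target=0x100)
print(hex(btb.next_pc(0x40)))  # 0x100: hit, fetch the predicted target
```

Because only taken branches are entered, a non-branch instruction never hits in the BTB, which is what lets the fetch stage redirect without decoding the instruction first.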

42 BTB Schematic

43 Handling Instruction with BTB
(Flowchart: on a BTB hit, fetch from the predicted target.) This flowchart is based on a 1-bit predictor.

44 Branch Penalties, Branch Folding
If the instruction is not in the BTB and the branch is not taken (a case not shown), the penalty is 0.
Variation: store the target instructions themselves, instead of their addresses, in the BTB
Saves on fetch time
Permits branch folding: zero-cycle branches! The destination instruction is substituted for the branch in the pipeline.

45 Return Address Predictor
Predicting register-indirect branches, e.g., indirect function calls, switch statements, and procedure returns
A CPU-internal return-address stack predicts return addresses
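The return-address stack mirrors the call/return nesting of the program; a sketch (the depth and the discard-oldest overflow policy are assumptions, and real designs vary):

```python
class ReturnAddressStack:
    """Sketch of a CPU-internal return-address stack."""
    def __init__(self, depth=8):
        self.stack, self.depth = [], depth

    def on_call(self, return_pc):
        """Push the return address when a call instruction is fetched."""
        if len(self.stack) == self.depth:
            del self.stack[0]             # overflow: discard the oldest entry
        self.stack.append(return_pc)

    def predict_return(self):
        """Pop the predicted target when a return instruction is fetched."""
        return self.stack.pop() if self.stack else None

ras = ReturnAddressStack()
ras.on_call(0x104)   # main calls f; the return address is 0x104
ras.on_call(0x204)   # f calls g; the return address is 0x204
print(hex(ras.predict_return()))  # 0x204: returns unwind in LIFO order
print(hex(ras.predict_return()))  # 0x104
```

A BTB predicts poorly for returns because the same return instruction jumps to a different caller each time; the stack predicts perfectly as long as the call depth stays within its capacity.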

46 Dynamic Branch Prediction Summary
Prediction has become an important part of execution
Branch history table: 2 bits suffice for loop accuracy
Correlation: recently executed branches are correlated with the next branch
Either different branches (GA) or different executions of the same branch (PA)
Tournament predictors take this insight to the next level by using multiple predictors, usually one based on global information and one based on local information, and combining them with a selector
In 2006, tournament predictors using about 30K bits were in processors like the Power5 and Pentium 4
Branch-target buffer: includes the branch address & the prediction

