Dynamic Hardware Prediction


1 Dynamic Hardware Prediction
Importance of control dependences
- Branches and jumps are frequent
- Limiting factor as ILP increases (Amdahl's law)
Schemes to attack control dependences
- Static
  - Basic (stall the pipeline)
  - Predict-not-taken and predict-taken
  - Delayed branch and canceling branch
- Dynamic predictors
Effectiveness of dynamic prediction schemes
- Accuracy
- Cost

2 Basic Branch Prediction Buffers
a.k.a. Branch History Table (BHT) - a small direct-mapped cache of T/NT bits, indexed by the low-order bits of the branch instruction's PC
[Figure: the PC indexes the BHT; a T entry predicts taken (fetch the branch target), an NT entry predicts not taken (fetch PC + 4)]

3 N-bit Branch Prediction Buffers
Use an n-bit saturating counter
- Only the loop exit causes a misprediction
- A 2-bit predictor is almost as good as any general n-bit predictor
[Figure: 2-bit predictor state diagram - states 11 and 10 predict taken, states 01 and 00 predict not taken; a taken branch moves the counter up, a not-taken branch moves it down]
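The state diagram can be simulated directly. The sketch below, with illustrative names (`predict`, `update`, `loop_misses` are not from the slides), shows why only the loop exit mispredicts once the counter has warmed up:

```c
#include <assert.h>

/* 2-bit saturating counter: states 0..3; states 2 and 3 predict taken. */
int predict(int state) { return state >= 2; }

int update(int state, int taken) {
    if (taken) return state < 3 ? state + 1 : 3;  /* saturate at 11 */
    else       return state > 0 ? state - 1 : 0;  /* saturate at 00 */
}

/* Mispredictions for a loop branch taken n-1 times, then not taken once. */
int loop_misses(int state, int n) {
    int miss = 0;
    for (int i = 0; i < n; i++) {
        int taken = (i < n - 1);
        if (predict(state) != taken) miss++;
        state = update(state, taken);
    }
    return miss;
}
```

Starting from the strongly-taken state, a 1000-iteration loop mispredicts only its exit; the exit then only knocks the counter down to "weakly taken", so the next execution of the loop still starts with a correct prediction.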

4 Correlating Predictors
a.k.a. Two-level Predictors - use recent behavior of other (previous) branches
[Figure: as in the basic BHT, the PC indexes the table, but a 1-bit global branch history (NT/T, storing the behavior of the previous branch) selects which of two predictions per entry to use]

5 Example
BNEZ R1, L1    ; branch b1 (d != 0?)
ADDI R1, R0, #1
L1: SUBUI R3, R1, #1
BNEZ R3, L2    ; branch b2
L2: . . .

Basic one-bit predictor (d alternates between 2 and 0):
d=?  b1 pred  b1 action  new b1 pred  b2 pred  b2 action  new b2 pred
2    NT       T          T            NT       T          T
0    T        NT         NT           T        NT         NT
Every branch is mispredicted!

One-bit predictor with one-bit correlation (prediction if last branch NT / prediction if last branch T):
d=?  b1 pred  b1 action  new b1 pred  b2 pred  b2 action  new b2 pred
2    NT/NT    T          T/NT         NT/NT    T          NT/T
0    T/NT     NT         T/NT         NT/T     NT         NT/T
2    T/NT     T          T/NT         NT/T     T          NT/T
0    T/NT     NT         T/NT         NT/T     NT         NT/T
Only the first execution of each branch is mispredicted.
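The two tables can be replayed in software. This sketch (names are illustrative, not from the slides) drives both predictors with the outcomes of b1 and b2 when d alternates 2, 0, 2, 0:

```c
#include <assert.h>

/* Count mispredictions over 4 iterations of the slide's code fragment.  */
/* correlated == 0: plain 1-bit predictor per branch;                    */
/* correlated == 1: (1,1) predictor - two 1-bit entries per branch,      */
/* selected by the outcome of the previous branch.                       */
int count_mispredictions(int correlated) {
    int outcomes[8];                     /* b1, b2 interleaved */
    for (int i = 0; i < 4; i++) {
        int d = (i % 2 == 0) ? 2 : 0;
        int b1 = (d != 0);               /* BNEZ R1, L1 */
        if (!b1) d = 1;                  /* ADDI executed on fall-through */
        int b2 = (d - 1 != 0);           /* BNEZ R3, L2 */
        outcomes[2 * i]     = b1;
        outcomes[2 * i + 1] = b2;
    }
    int pred[2][2] = {{0, 0}, {0, 0}};   /* init NT */
    int last = 0, misses = 0;
    for (int i = 0; i < 8; i++) {
        int b = i % 2;                   /* which branch */
        int sel = correlated ? last : 0; /* history selects the entry */
        if (pred[b][sel] != outcomes[i]) misses++;
        pred[b][sel] = outcomes[i];
        last = outcomes[i];
    }
    return misses;
}
```

The plain 1-bit predictor misses all 8 branch executions; the correlated one misses only the first execution of each branch, matching the tables above.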

6 (m, n) Predictors
Use behavior of the last m branches
- 2^m n-bit predictors for each branch
Simple implementation
- Use an m-bit shift register to record the behavior of the last m branches
[Figure: the PC and the m-bit global branch history together index the (m, n) branch-prediction buffer to select an n-bit predictor]
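The indexing described above can be sketched as bit manipulation; the helper names and the choice of concatenating PC bits with the history are illustrative assumptions, not from the slides:

```c
#include <assert.h>

/* Index an (m, n) predictor table: concatenate low-order PC bits with   */
/* the m-bit global history to select one of the 2^m counters per branch. */
unsigned mn_index(unsigned pc, unsigned ghr, int m, int pc_bits) {
    unsigned pc_part = pc & ((1u << pc_bits) - 1);
    unsigned hist    = ghr & ((1u << m) - 1);
    return (pc_part << m) | hist;
}

/* After each branch resolves, shift its outcome into the m-bit global   */
/* history shift register.                                               */
unsigned ghr_update(unsigned ghr, int taken, int m) {
    return ((ghr << 1) | (taken ? 1u : 0u)) & ((1u << m) - 1);
}
```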

7 Size of the Buffers
Number of bits in an (m, n) predictor:
2^m x n x number of entries in the table
Example - assume 8K bits in the BHT
- (0,1): 8K entries
- (0,2): 4K entries
- (2,2): 1K entries
- (12,2): 1 entry!
  - Does not use the branch address
  - Relies only on the global branch history
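The arithmetic behind the examples is a one-liner (the function name is illustrative):

```c
#include <assert.h>

/* Entries that fit in a BHT with total_bits bits for an (m, n)        */
/* predictor: each entry holds 2^m n-bit counters, so                  */
/* entries = total_bits / (2^m * n).                                   */
int bht_entries(int total_bits, int m, int n) {
    return total_bits / ((1 << m) * n);
}
```

With an 8K-bit budget this reproduces the slide: 8K, 4K, and 1K entries, and a single entry for (12,2), which is why that configuration cannot distinguish branch addresses at all.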

8 Performance of 2-bit Predictors

9 Branch-Target Buffers
Further reduce control stalls (hopefully to 0)
- Store the predicted target address in the buffer
- Access the buffer during IF
[Figure: the PC is looked up in the buffer; on a tag match (YES) the instruction is a branch and the predicted address plus its T/NT bit are sent out; on a miss (NO) the instruction is assumed not to be a branch]

10 Prediction with BTB
IF: send the PC to memory and the BTB
- Entry found in the BTB?
  - NO: the instruction is not predicted to be a branch; fetch proceeds normally
  - YES: send out the predicted address
ID:
- If no entry was found and the instruction is a taken branch: update the BTB with the branch address and target
- If an entry was found: taken branch?
  - YES: prediction correct, continue
  - NO: kill the fetched instruction, restart fetch at the other target, and delete the entry from the BTB
EX: execution continues normally
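The lookup-and-resolve policy above can be sketched as a tiny direct-mapped buffer. This is a simplified illustration (names, the table size, and returning 0 for a miss are all assumptions, not from the slides):

```c
#include <assert.h>
#define BTB_SIZE 16

typedef struct { unsigned pc; unsigned target; int valid; } BTBEntry;
static BTBEntry btb[BTB_SIZE];   /* zero-initialized: all entries invalid */

/* IF stage: returns the predicted target, or 0 on a buffer miss. */
unsigned btb_lookup(unsigned pc) {
    BTBEntry *e = &btb[pc % BTB_SIZE];
    return (e->valid && e->pc == pc) ? e->target : 0;
}

/* Branch resolution: enter taken branches into the buffer; delete the   */
/* entry for a branch that was in the buffer but fell through.           */
void btb_resolve(unsigned pc, int taken, unsigned target) {
    BTBEntry *e = &btb[pc % BTB_SIZE];
    if (taken) { e->pc = pc; e->target = target; e->valid = 1; }
    else if (e->valid && e->pc == pc) e->valid = 0;
}
```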

11 Target Instruction Buffers
Store target instructions instead of addresses
Advantages
- BTB access can take longer than the time between IFs, and the BTB can be larger
- Branch folding
  - Zero-cycle unconditional branches: replace the branch with its target instruction

12 Performance Issues
Limitations of branch prediction schemes
- Prediction accuracy (80% - 95%) depends on:
  - Type of program
  - Size of buffer
- Penalty of misprediction
Fetch from both directions to reduce the penalty
- The memory system should:
  - Be dual-ported, or
  - Have an interleaved cache, or
  - Fetch from one path and then from the other

13 Software approaches to exploiting ILP
Chapter 4

14 Instruction Level Parallelism
Potential overlap among instructions
Few possibilities within a basic block
- Blocks are small (6-7 instructions)
- Instructions are dependent
Goal: exploit ILP across multiple basic blocks, e.g. iterations of a loop
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

15 Basic Pipeline Scheduling
Find sequences of unrelated instructions
- Compiler's ability to schedule
- Amount of ILP available in the program
- Latencies of the functional units
Latency assumptions for the examples
- Standard MIPS integer pipeline
- No structural hazards (fully pipelined or duplicated units)
- Latencies of FP operations:
  Instruction producing result   Instruction using result   Latency
  FP ALU op                      FP ALU op                  3
  FP ALU op                      SD                         2
  LD                             FP ALU op                  1

16 Basic Scheduling
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

Sequential MIPS assembly code:
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BNEZ  R1, Loop

Pipelined execution (10 cycles per iteration):
Loop: LD    F0, 0(R1)
      stall
      ADDD  F4, F0, F2
      stall
      stall
      SD    0(R1), F4
      SUBI  R1, R1, #8
      stall
      BNEZ  R1, Loop
      stall

Scheduled pipelined execution (6 cycles per iteration):
Loop: LD    F0, 0(R1)
      SUBI  R1, R1, #8
      ADDD  F4, F0, F2
      stall
      BNEZ  R1, Loop
      SD    8(R1), F4    ; in the branch-delay slot

17 Loop Unrolling
Unrolled loop (four copies):
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      LD    F6, -8(R1)
      ADDD  F8, F6, F2
      SD    -8(R1), F8
      LD    F10, -16(R1)
      ADDD  F12, F10, F2
      SD    -16(R1), F12
      LD    F14, -24(R1)
      ADDD  F16, F14, F2
      SD    -24(R1), F16
      SUBI  R1, R1, #32
      BNEZ  R1, Loop

Scheduled unrolled loop:
Loop: LD    F0, 0(R1)
      LD    F6, -8(R1)
      LD    F10, -16(R1)
      LD    F14, -24(R1)
      ADDD  F4, F0, F2
      ADDD  F8, F6, F2
      ADDD  F12, F10, F2
      ADDD  F16, F14, F2
      SD    0(R1), F4
      SD    -8(R1), F8
      SUBI  R1, R1, #32
      SD    16(R1), F12
      BNEZ  R1, Loop
      SD    8(R1), F16   ; in the branch-delay slot
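The unrolled MIPS loop corresponds to a four-way unrolled C loop. A minimal sketch, assuming (as the slide does) that the trip count is a multiple of 4; the function name is illustrative and the array is 1-based like the slide's C code:

```c
#include <assert.h>

/* Four copies of x[i] = x[i] + s per iteration, stepping i down by 4. */
void add_s_unrolled(double *x, int n, double s) {
    for (int i = n; i > 0; i -= 4) {
        x[i]     += s;
        x[i - 1] += s;
        x[i - 2] += s;
        x[i - 3] += s;
    }
}
```

Unrolling amortizes the loop overhead (SUBI, BNEZ) over four bodies and, as the scheduled version shows, exposes enough independent instructions to hide every load and FP latency.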

18 Loop Transformations
Instruction independence is the key requirement for the transformations
Example
- Determine that it is legal to move SD after SUBI and BNEZ
- Determine that unrolling is useful (iterations are independent)
- Use different registers to avoid unnecessary constraints
- Eliminate extra tests and branches
- Determine that LD and SD can be interchanged
- Schedule the code, preserving the semantics of the code

19 Dependences
If instructions are independent
- They are parallel
- They can be reordered
Types of dependences
- Data
- Name
- Control

20 Data Dependences
Instruction j is data dependent on instruction i if:
- i produces a result used by j, or
- j is data dependent on k and k is data dependent on i

Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BNEZ  R1, Loop

Dependences
- Indicate a potential hazard (one or more RAW)
- Determine the order of results
- Set an upper bound on ILP

21 Techniques to Increase ILP
Dependences are a property of programs
Actual hazards are a property of the pipeline
Techniques to avoid dependence limitations
- Maintain dependences but avoid hazards: code scheduling
  - hardware
  - software
- Eliminate dependences by code transformations
  - Complex
  - Compiler-based

22 Example: Dependence Elimination
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      LD    F6, 0(R1)
      ADDD  F8, F6, F2
      SD    0(R1), F8
      SUBI  R1, R1, #8
      LD    F10, 0(R1)
      ADDD  F12, F10, F2
      SD    0(R1), F12
      SUBI  R1, R1, #8
      LD    F14, 0(R1)
      ADDD  F16, F14, F2
      SD    0(R1), F16
      SUBI  R1, R1, #8
      BNEZ  R1, Loop

Data dependences through SUBI, LD, and SD
- Force sequential execution of iterations
- The compiler removes this dependence by:
  - Computing intermediate R1 values (folding them into the LD/SD offsets)
  - Eliminating the intermediate SUBIs
  - Changing the final SUBI
Data flow analysis
- Can be done on registers
- Cannot be done easily on memory locations: does 100(R1) refer to the same location as 20(R2)?

23 Name Dependences
Two instructions use the same register or memory location, but there is no flow of data between them
- Antidependence: corresponds to a WAR hazard
- Output dependence: corresponds to a WAW hazard
To eliminate the dependence: change the name!
- Register renaming (easy)
- Static or dynamic

24 Example: Name Dependences
Before renaming:
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      LD    F0, -8(R1)
      ADDD  F4, F0, F2
      SD    -8(R1), F4
      LD    F0, -16(R1)
      ADDD  F4, F0, F2
      SD    -16(R1), F4
      LD    F0, -24(R1)
      ADDD  F4, F0, F2
      SD    -24(R1), F4
      SUBI  R1, R1, #32
      BNEZ  R1, Loop

After register renaming:
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      LD    F6, -8(R1)
      ADDD  F8, F6, F2
      SD    -8(R1), F8
      LD    F10, -16(R1)
      ADDD  F12, F10, F2
      SD    -16(R1), F12
      LD    F14, -24(R1)
      ADDD  F16, F14, F2
      SD    -24(R1), F16
      SUBI  R1, R1, #32
      BNEZ  R1, Loop

25 Example: Control Dependences
Loop: LD    F0, 0(R1)
      ADDD  F4, F0, F2
      SD    0(R1), F4
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F6, 0(R1)
      ADDD  F8, F6, F2
      SD    0(R1), F8
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F10, 0(R1)
      ADDD  F12, F10, F2
      SD    0(R1), F12
      SUBI  R1, R1, #8
      BEQZ  R1, Exit
      LD    F14, 0(R1)
      ADDD  F16, F14, F2
      SD    0(R1), F16
      SUBI  R1, R1, #8
      BNEZ  R1, Loop
Exit:

If the trip count is a multiple of 4, the intermediate BEQZ branches are never taken - eliminate them!

26 Dealing with control stalls
Properties of program correctness must be preserved when handling control dependences:
- Exception behavior
- Data flow
Static techniques that alleviate control stalls:
- Delayed branch scheduling reduces stalls
- Loop unrolling can reduce the number of control dependences
- Conditional execution or speculation

27 Loop-Level Parallelism
Analysis at the source level: look for dependences across iterations

No loop-carried dependence (iterations are independent):
for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

Loop-carried dependence:
for (i = 1; i <= 100; i = i + 1) {
    x[i+1] = x[i] + z[i];   /* loop-carried dependence */
    y[i+1] = y[i] + x[i+1];
}
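The first loop's iterations can run in any order precisely because each x[i] depends only on its own old value. A quick check of that claim (illustrative helper names, 1-based arrays as in the slide's C code):

```c
#include <assert.h>

/* The slide's order: i counts down. */
void add_s_down(double *x, int n, double s) {
    for (int i = n; i > 0; i = i - 1) x[i] = x[i] + s;
}

/* Reversed order: i counts up. Same result - no loop-carried dependence. */
void add_s_up(double *x, int n, double s) {
    for (int i = 1; i <= n; i = i + 1) x[i] = x[i] + s;
}
```

In the second loop this reordering would be illegal: x[i+1] reads the value written in the previous iteration, so the iterations are chained sequentially. That is exactly what "loop-carried dependence" means.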

