Presentation is loading. Please wait.

Presentation is loading. Please wait.

EE 382N Guest Lecture Wish Branches

Similar presentations


Presentation on theme: "EE 382N Guest Lecture Wish Branches"— Presentation transcript:

1 EE 382N Guest Lecture Wish Branches
Hyesoon Kim HPS Research Group The University of Texas at Austin

2 Lecture Outline Predicated execution Wish branches 2D-profiling

3 Motivation Branch predictors are still not perfect.
Deeper pipeline and larger instruction window increase the branch misprediction penalty. Predicated execution can eliminate branch misprediction by converting control-dependency to data dependency. However, predicated code has overhead.

4 Predicated Execution C B D A B C D A if (cond) { b = 0; } else {
(normal branch code) C B D A T N p1 = (cond) branch p1, TARGET mov b, 1 jmp JOIN TARGET: mov b, 0 (predicated code) B C D A if (cond) { b = 0; } else { b = 1; A p1 = (cond) (!p1) mov b, 1 (p1) mov b, 0 B C D add x, b, 1 Convert control flow dependency to data dependency Pro: Eliminate hard-to-predict branches Cons: (1) Fetch blocks B and C all the time (2) Wait until p1 is resolved

5 The Overhead of Predicated Execution
-2% 16% 13% non-predicated A p1 = (cond) (0) mov b,1 (1) mov b,0 p1 = (cond) (!p1) mov b, 1 (p1) mov b, 0 B C D add x, b, 1 (Predicated code) If all overhead is ideally eliminated, predicated execution would provide 16% improvement in average execution time

6 The Problem Due to the predication overhead, predicated execution sometimes reduces performance Branch misprediction characteristics are dependent on run-time behavior: input set, control-flow path and phase behavior The compiler cannot accurately estimate the run-time behavior of branches

7 Predicated Code Performance vs. Branch Misprediction Rate
Predicated code performs better C B D A T N B C D A run-time (input B) profile-time (input A) X Normal branch code performs better Converting a branch to predicated code could hurt performance if run-time misprediction rate is lower than profile-time misprediction rate Execution time(normal branch code) = exec_T * P(T) + exec_N * P(N) misp_penalty * P(misprediction) Execution time of predicated code = exec_pred

8 Lecture Outline Predicated execution Wish branches 2D-profiling

9 Wish Branches [Kim et al. Micro-38]
A new type of control flow instruction types: wish jump/join and wish loop The compiler generates code (with wish branches) that can be executed either as predicated code or non-predicated code (normal branch code) The hardware decides to execute predicated code or normal branch code at run-time based on the confidence of branch prediction Easy to predict: normal branch code Hard to predict: predicated code

10 Wish Jump/Join A wish jump C B D A nop B C D A B wish join Taken
High Confidence Low Confidence A wish jump C B D A T N mov b, 1 jmp JOIN TARGET: mov b,0 normal branch code nop B C D A p1 = (cond) (!p1) mov b,1 (p1) mov b,0 predicated code B wish join Taken Not-Taken C D A p1=(cond) wish.jump p1 TARGET p1 = (cond) branch p1, TARGET B nop (!p1) mov b,1 wish.join !p1 Join (1) mov b,1 wish.join (1) Join C TARGET: (1) mov b,0 TARGET: (p1) mov b,0 D JOIN: wish jump/join code

11 Wish Loop H X X High Confidence Low Confidence Y Y T T N N H X X (1) Y
do { a++; i++; } while (i<N); X T N N High Confidence Low Confidence Y Y H mov p1, 1 LOOP: (p1) add a, a, 1 (p1) add i, i, 1 (p1) p1 = (cond) wish. loop p1, LOOP EXIT: X X LOOP: add a, a, 1 add i, i, 1 p1 = (i<N) branch p1, LOOP EXIT: (1) Y Y normal backward branch code wish loop code

12 Mispredicted Case 1: Early-Exit
H Correct execution: H X1 X2 X3 Y T T N X T Early-exit: (Low confidence) Flush pipeline N H X1 X2 Y T N Y X3 Y N Compared to normal branch code: predicate data dependency and one extra instruction (-)

13 Mispredicted Case 2: Late-Exit
H Correct execution: H X1 X2 X3 Y T T N X T nop nop Late-exit: (Low confidence) N H X1 X2 X3 X4 X5 Y T T T T N Y Compared to normal branch code: pro: reduce flush penalty (+++) cons: predicate data dependency and one extra instruction (-)

14 Mispredicted Cases3: No-Exit
H Correct execution: H X1 X2 X3 Y T T N nop nop Late-exit: X T H X1 X2 X3 X4 X5 Y N T T T T N Flush pipeline Y No-exit: H X1 X2 X3 X4 X5 X6 T T T T T Y No-Exit: predicate data dependency and one extra instruction (-)

15 Questions? Why not all branches?
What kind of branches should be converted to wish branches (jump/join)? Why not all branches? What kind of branches should be converted to wish loops?

16 Advantages/Disadvantages of Wish Branches
Advantages compared to predicated execution Reduce the overhead of predication Increase the benefits of predicated code by allowing the compiler to generate more aggressively-predicated code Provide a mechanism to exploit predication to reduce the branch misprediction penalty for backward branches (Wish loops) Make predicated code less dependent on machine configuration (e.g. branch predictor)

17 Advantages/Disadvantages of Wish Branches
Disadvantages compared to predicated execution Extra branch instructions use machine resources Extra branch instructions increase the contention for branch predictor table entries May constrain the compiler’s scope for code optimizations

18 Wish Branch Support ISA Support Compiler Support Hardware Support
predicated execution, wish branch instruction Compiler Support Wish branch generation algorithms The compiler needs to decide which branches are predicated, which are converted to wish branches, and which stay as normal branches Hardware Support Instruction decode logic Predicate dependency elimination module Confidence estimator Front-end and branch misprediction detection/recovery module

19 OPCODE btype wtype target offset p
ISA Support Using existing hint bits (IA-64, x86, PowerPC) Hint bits can be ignored. A wish branch can be treated as a normal branch. OPCODE btype wtype target offset p btye: branch type (0:normal branch 1:wish branch) wtype: wish branch type (0:jump 1:loop 2:join) p: predicate register identifier

20 Wish Branch Support ISA Support Compiler Support Hardware Support
predicated execution, wish branch instruction Compiler Support Wish branch generation algorithms The compiler needs to decide which branches are predicated, which are converted to wish branches, and which stay as normal branches Hardware Support Instruction decode logic Predicate dependency elimination module Confidence estimator Front-end and branch misprediction detection/recovery module

21 Compiler Support edge/value profiling select candidates cost-benefit analysis predicate selected blocks branch elimination region formation wish jump conversion if-conversion wish join insertion if-conversion loop opt (swp, unrolling) global inst. sched wish loop conversion loop opt register allocation modified local inst. sched new existing Major phase ordering with wish branch generation in code generation [ORC]

22 Wish Branch Generation Algorithm
wish jump/join candidates: all branch which are suitable for if-conversion The number of instructions in the fall-through block > N (N=5) : wish jump and join are inserted All other branches converted to predicated code A loop branch is converted into a wish loop: when the loop body has fewer than L instructions (L=30)

23 Wish Branch Support ISA Support Compiler Support Hardware Support
predicated execution, wish branch instruction Compiler Support Wish branch generation algorithms The compiler needs to decide which branches are predicated, which are converted to wish branches, and which stay as normal branches Hardware Support Instruction decode logic Predicate dependency elimination module Front-end and branch misprediction detection/recovery module Confidence estimator

24 Hardware Support Instruction Fetch/decode logic
Decoder: decode wish branches BTB: mark wish branches Wish branch state machine hardware Wish loop stays as low-confidence mode until the loop exits Predicate dependency elimination module High-confidence mode: predicate values are predicted Branch misprediction detection/recovery module No flush if wish branch is mispredicted during low-confidence mode Confidence estimator

25 JRS Confidence Estimator
Estimate how much confidence the processor has in a branch prediction Trained with branch misprediction information n bit Counters m bits PC + 2^m entries > th? High Confidence Low Confidence Global BHR Assigning Confidence to Conditional Branch Predictions [Jacobsen et al. Micro-29]

26 Experimental Infrastructure
Source Code IA-64 Binary IA-64 Trace µops IA-64 Compiler (ORC) Trace generation module Micro-op Translator Micro-op Simulator IA-64 provides full support for predication Convert IA-64 traces to micro-ops to simulate an out-of-order superscalar processor model

27 Simulation Methodology
Nine SPEC 2000 integer benchmarks Baseline Processor Configuration Front End Large and accurate branch predictor (64KB hybrid branch predictor: gshare + local) Minimum 30-cycle branch misprediction penalty 64KB, 2-cycle latency I-cache Execution Core 8-wide out-of-order processor 512-entry instruction window Confidence Estimator 1KB tagged 16-bit history JRS confidence estimator (Jacobsen et al. MICRO-29)

28 Performance Improvement
-4% 14% 2.02 8% 24% non-predicated 16% over conditional branch prediction (w/o mcf) 11% over selective-predication (w/o mcf) 7 % over aggressive predication (w/o mcf) 14% over conditional branch prediction and 13% over selective-predication and 16% over aggressive-predication 12% over conditional branch prediction 11% over selective-predication 13 % over aggressive predication SELECTIVE-PREDICATION: branches are selectively predicated using compile-time cost-benefit analysis AGGRESSIVE-PREDICATION: all branches that are suitable for if-conversion are predicated

29 Wish Branch: Conclusion
New control flow instructions: wish branches (jump/join/loop) Wish branches improve performance by dividing the work of predication between the compiler and the microarchitecture Compiler: analyzes the control-flow graph and generates code Microarchitecture: makes run-time decision to use predication Wish branches provide significant performance benefits 16% compared to conditional branch prediction 13% compared to selectively predicated code Wish branches can make predicated execution more viable and effective in high performance processors By enabling adaptive and aggressive predicated execution

30 Lecture Outline Predicated execution Wish branches 2D-profiling

31 2D-profiling Goal: Identify input-dependent branches by using a single input set for profiling If We Know a Branch is Input-Dependent May not convert it to predicated code. May convert it to a wish branch. May not perform other compiler optimizations or may perform them less aggressively. Hot-path/trace/superblock-based optimizations [Fisher’81, Pettis’90, Hwu’93, Merten’99]

32 Key Insight of 2D-profiling
Phase behavior in prediction accuracy is a good indicator of input dependence phase 2 input-dependent phase 3 phase 1 input-independent

33 Traditional Profiling
brA time brB pr. Acc MEAN pr.Acc(brA) pr. Acc MEAN pr.Acc(brB) MEAN pr.Acc(brA)  MEAN pr.Acc(brB) behavior of brA  behavior of brB

34 2D-profiling MEAN pr.Acc(brA) brA STD pr.Acc(brA) time brB
MEAN pr.Acc(brB) STD pr.Acc(brB) MEAN pr.Acc(brA)  MEAN pr.Acc(brB) STD pr.Acc(brA) ≠ STD pr.Acc(brB) behavior of brA ≠ behavior of brB A: input-dependent br, B: input-independent br

35 2D-profiling Mechanism
The profiler collects branch prediction accuracy information for every static branch over time slice size = M instructions Slice 1 Slice 2 Slice N time mean Pr.Acc(brA,s1) mean Pr.Acc(brA,s2) ... mean Pr.Acc(brA,sN) mean Pr.Acc(brB,s1) mean Pr.Acc(brB,s2) ... mean Pr.Acc(brB,sN) . . . PAM:50% brA Calculate MEAN (brA, brB, …), Standard deviation (brA, brB, …), PAM:Points Above Mean (brA, brB, …) mean brA brB PAM:0% mean brB

36 2D-profiling: Conclusion & Future Work
2D-profiling is a new profiling technique to find input-dependent characteristics by using a single input data set for profiling 2D-profiling uses time-varying information instead of just average data Phase behavior in prediction accuracy in a profile run  input-dependent Future Work: Better predicated code/wish branch generation algorithms Detecting other input-dependent program characteristics


Download ppt "EE 382N Guest Lecture Wish Branches"

Similar presentations


Ads by Google