EE 382N Guest Lecture Wish Branches

Hyesoon Kim
HPS Research Group
The University of Texas at Austin

Lecture Outline
- Predicated execution
- Wish branches
- 2D-profiling

Motivation
- Branch predictors are still not perfect.
- Deeper pipelines and larger instruction windows increase the branch misprediction penalty.
- Predicated execution can eliminate branch mispredictions by converting control dependencies to data dependencies.
- However, predicated code has overhead.

Predicated Execution

Source code:
    if (cond) { b = 0; } else { b = 1; }

Normal branch code (basic blocks A, B, C, D; the branch at the end of A is either taken (T) or not taken (N)):
    A:          p1 = (cond)
                branch p1, TARGET
    B:          mov b, 1
                jmp JOIN
    C:  TARGET: mov b, 0
    D:  JOIN:   add x, b, 1

Predicated code:
    A:          p1 = (cond)
    B:          (!p1) mov b, 1
    C:          (p1) mov b, 0
    D:          add x, b, 1

- Predication converts a control-flow dependency into a data dependency.
- Pro: eliminates hard-to-predict branches.
- Cons: (1) blocks B and C are fetched all the time; (2) dependent instructions wait until p1 is resolved.

The Overhead of Predicated Execution

[Figure: performance chart; labels: non-predicated, -2%, 16%, 13%]

Predicated code:
    A:  p1 = (cond)
    B:  (!p1) mov b, 1
    C:  (p1) mov b, 0
    D:  add x, b, 1

Same code with the predicate values already known:
    A:  p1 = (cond)
    B:  (0) mov b, 1
    C:  (1) mov b, 0

- If all overhead were ideally eliminated, predicated execution would provide a 16% improvement in average execution time.

The Problem
- Due to the predication overhead, predicated execution sometimes reduces performance.
- Branch misprediction characteristics depend on run-time behavior: input set, control-flow path, and phase behavior.
- The compiler cannot accurately estimate the run-time behavior of branches.

Predicated Code Performance vs. Branch Misprediction Rate

[Figure: regions labeled "Predicated code performs better" and "Normal branch code performs better", separated at point X; markers for profile-time (input A) and run-time (input B)]

- Converting a branch to predicated code could hurt performance if the run-time misprediction rate is lower than the profile-time misprediction rate.
- Execution time of normal branch code = exec_T * P(T) + exec_N * P(N) + misp_penalty * P(misprediction)
- Execution time of predicated code = exec_pred
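The two execution-time formulas on this slide can be turned into a small cost model. The sketch below is only an illustration: the function names and every number in the example are assumptions, not values from the lecture. It shows how a branch that looks worth predicating at profile time can become a loss when the run-time misprediction rate is lower.

```python
def normal_branch_time(exec_t, exec_n, p_taken, misp_penalty, p_misp):
    """Expected time of normal branch code, following the slide's formula."""
    p_not_taken = 1.0 - p_taken
    return exec_t * p_taken + exec_n * p_not_taken + misp_penalty * p_misp


def should_predicate(exec_t, exec_n, p_taken, misp_penalty, p_misp, exec_pred):
    """True when predicated code (exec_pred) is expected to be faster."""
    return exec_pred < normal_branch_time(exec_t, exec_n, p_taken,
                                          misp_penalty, p_misp)


def break_even_misprediction_rate(exec_t, exec_n, p_taken, misp_penalty, exec_pred):
    """Misprediction rate at which both code versions take the same time."""
    base = exec_t * p_taken + exec_n * (1.0 - p_taken)
    return (exec_pred - base) / misp_penalty


# Hypothetical numbers: 3-cycle paths, 5-cycle predicated block, 30-cycle penalty.
costs = dict(exec_t=3, exec_n=3, p_taken=0.5, misp_penalty=30, exec_pred=5)
profile_time_choice = should_predicate(p_misp=0.10, **costs)  # input A
run_time_choice = should_predicate(p_misp=0.02, **costs)      # input B
crossover = break_even_misprediction_rate(**costs)
```

With these made-up costs the crossover sits between the two misprediction rates, so the profile-time decision to predicate is the wrong one for the run-time input.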

Wish Branches [Kim et al. Micro-38]
- A new type of control flow instruction.
- 3 types: wish jump, wish join, and wish loop.
- The compiler generates code (with wish branches) that can be executed either as predicated code or as non-predicated code (normal branch code).
- The hardware decides at run-time whether to execute predicated code or normal branch code, based on the confidence of the branch prediction.
- Easy to predict: normal branch code.
- Hard to predict: predicated code.

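Wish Jump/Join

Normal branch code:
    A:          p1 = (cond)
                branch p1, TARGET
    B:          mov b, 1
                jmp JOIN
    C:  TARGET: mov b, 0
    D:  JOIN:

Predicated code:
    A:          p1 = (cond)
    B:          (!p1) mov b, 1
    C:          (p1) mov b, 0
    D:

Wish jump/join code:
    A:          p1 = (cond)
                wish.jump p1, TARGET
    B:          (!p1) mov b, 1
                wish.join !p1, JOIN
    C:  TARGET: (p1) mov b, 0
    D:  JOIN:

Wish jump/join code as executed in high-confidence mode (predicates treated as true):
    B:          (1) mov b, 1
                wish.join (1), JOIN
    C:  TARGET: (1) mov b, 0

- High confidence: the wish jump is predicted taken or not-taken like a normal branch, and the wish join is taken, so only B or only C is fetched.
- Low confidence: the wish jump and wish join act as nops, so B and C are both fetched and executed as predicated code.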
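The mode-dependent behavior of the wish jump/join hammock can be modeled in a few lines. This is a simplified sketch under stated assumptions: the function name and return shape are hypothetical, and pipeline timing is reduced to a single "flushed" flag. It models which blocks the front end fetches first and whether a misprediction forces a flush.

```python
def run_wish_hammock(cond, predicted_taken, high_confidence):
    """Model the A/B/C/D hammock built from a wish jump and a wish join.

    Returns (blocks fetched first, final value of b, whether a flush occurred).
    """
    p1 = bool(cond)
    if high_confidence:
        # The wish jump behaves like a normal branch: follow the prediction.
        fetched = ["A", "C", "D"] if predicted_taken else ["A", "B", "D"]
        flushed = predicted_taken != p1
        b = 0 if p1 else 1  # value produced on the correct path after recovery
        return fetched, b, flushed
    # Low confidence: wish jump and wish join fall through, so both B and C
    # are fetched and their instructions execute under predicates.
    b = None
    if not p1:
        b = 1  # block B: (!p1) mov b, 1
    if p1:
        b = 0  # block C: (p1) mov b, 0
    return ["A", "B", "C", "D"], b, False


# A wrong prediction costs a flush only in high-confidence mode.
high = run_wish_hammock(cond=True, predicted_taken=False, high_confidence=True)
low = run_wish_hammock(cond=True, predicted_taken=False, high_confidence=False)
```

The design point is that the same binary serves both modes; only the front end's treatment of the two wish instructions changes.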

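Wish Loop

Source code:
    do { a++; i++; } while (i<N);

Normal backward branch code (H: loop header, X: loop body, Y: loop exit):
    H:
    X:  LOOP:   add a, a, 1
                add i, i, 1
                p1 = (i<N)
                branch p1, LOOP
    Y:  EXIT:

Wish loop code:
    H:          mov p1, 1
    X:  LOOP:   (p1) add a, a, 1
                (p1) add i, i, 1
                (p1) p1 = (i<N)
                wish.loop p1, LOOP
    Y:  EXIT:

- High confidence: the wish loop is predicted taken (T) or not-taken (N) like a normal backward branch, and the loop body executes with its predicates treated as true, shown as (1).
- Low confidence: the loop body executes as predicated code.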
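A short simulation makes the key property of the wish loop code visible: once p1 becomes false, any extra iterations the front end fetches have no architectural effect. This is a minimal sketch that mirrors the wish loop listing instruction by instruction; the function names and the `fetched_iterations` parameter are illustrative assumptions.

```python
def wish_loop(a, i, n, fetched_iterations):
    """Execute the predicated loop body for a given number of fetched iterations."""
    p1 = True  # H: mov p1, 1
    for _ in range(fetched_iterations):
        if p1:
            a += 1  # (p1) add a, a, 1
        if p1:
            i += 1  # (p1) add i, i, 1
        if p1:
            p1 = i < n  # (p1) p1 = (i<N)
    return a, i, p1


def reference_loop(a, i, n):
    """The original do-while loop, used as the correct result."""
    while True:
        a += 1
        i += 1
        if not (i < n):
            break
    return a, i


exact = wish_loop(0, 0, 3, fetched_iterations=3)
over_fetched = wish_loop(0, 0, 3, fetched_iterations=5)  # two extra iterations
under_fetched = wish_loop(0, 0, 3, fetched_iterations=2)  # p1 still true
```

Fetching too many iterations yields the same state as the reference loop, while fetching too few leaves p1 true, meaning the loop has not actually finished.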

Mispredicted Case 1: Early-Exit
- Correct execution: H X1 X2 X3 Y (loop branch outcomes T T N)
- Early-exit (low confidence): H X1 X2 Y … (predicted T N); the pipeline is flushed, then X3 Y is fetched.
- Compared to normal branch code: predicate data dependency and one extra instruction (-)

Mispredicted Case 2: Late-Exit
- Correct execution: H X1 X2 X3 Y (loop branch outcomes T T N)
- Late-exit (low confidence): H X1 X2 X3 X4 X5 Y … (predicted T T T T N); X4 and X5 execute as nops.
- Compared to normal branch code:
  - pro: reduces the flush penalty (+++)
  - cons: predicate data dependency and one extra instruction (-)

Mispredicted Case 3: No-Exit
- Correct execution: H X1 X2 X3 Y (loop branch outcomes T T N)
- Late-exit: H X1 X2 X3 X4 X5 Y … (predicted T T T T N); X4 and X5 execute as nops.
- No-exit: H X1 X2 X3 X4 X5 X6 … (predicted T T T T T); the pipeline is flushed, then Y is fetched.
- No-exit compared to normal branch code: predicate data dependency and one extra instruction (-)
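The three misprediction cases for a wish loop in low-confidence mode can be summarized as a small classifier. This sketch is an interpretation of the slide diagrams rather than the hardware's actual logic: the function name, its parameters, and in particular the `exited_before_resolve` flag (whether the front end had already left the loop when the misprediction was detected) are assumptions.

```python
def classify_wish_loop(actual_iterations, fetched_iterations,
                       exited_before_resolve=True):
    """Return (case name, whether a pipeline flush is needed)."""
    if fetched_iterations == actual_iterations:
        return "correct", False
    if fetched_iterations < actual_iterations:
        # The front end left the loop too soon: missing iterations must be fetched.
        return "early-exit", True
    if exited_before_resolve:
        # Extra iterations were fetched but execute as nops: no flush needed.
        return "late-exit", False
    # The front end is still fetching the loop body: it must be redirected.
    return "no-exit", True


# The slide examples: the loop should run 3 iterations (X1 X2 X3).
early = classify_wish_loop(3, 2)
late = classify_wish_loop(3, 5)
no_exit = classify_wish_loop(3, 6, exited_before_resolve=False)
```

Only the late-exit case avoids the flush, which is where the wish loop gains over a normal backward branch.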

Questions?
- What kind of branches should be converted to wish branches (jump/join)? Why not all branches?
- What kind of branches should be converted to wish loops? Why not all branches?

Advantages/Disadvantages of Wish Branches

Advantages compared to predicated execution:
- Reduce the overhead of predication.
- Increase the benefits of predicated code by allowing the compiler to generate more aggressively predicated code.
- Provide a mechanism to exploit predication to reduce the branch misprediction penalty for backward branches (wish loops).
- Make predicated code less dependent on machine configuration (e.g. branch predictor).

Advantages/Disadvantages of Wish Branches

Disadvantages compared to predicated execution:
- Extra branch instructions use machine resources.
- Extra branch instructions increase the contention for branch predictor table entries.
- May constrain the compiler’s scope for code optimizations.

Wish Branch Support
- ISA support
  - Predicated execution, wish branch instruction
- Compiler support
  - Wish branch generation algorithms
  - The compiler needs to decide which branches are predicated, which are converted to wish branches, and which stay as normal branches.
- Hardware support
  - Instruction decode logic
  - Predicate dependency elimination module
  - Confidence estimator
  - Front-end and branch misprediction detection/recovery module

ISA Support
- Wish branches can be encoded using existing hint bits (IA-64, x86, PowerPC).
- Hint bits can be ignored: a wish branch can be treated as a normal branch.
- Instruction format: OPCODE | btype | wtype | target offset | p
  - btype: branch type (0: normal branch, 1: wish branch)
  - wtype: wish branch type (0: jump, 1: loop, 2: join)
  - p: predicate register identifier
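The field meanings on this slide can be illustrated with a small decoder. The slide does not give bit positions or field widths, so this sketch works on already-extracted field values; the function name and the `supports_wish` flag (modeling hardware that ignores the hint bits) are assumptions made for illustration.

```python
WTYPE_NAMES = {0: "jump", 1: "loop", 2: "join"}  # encoding from the slide


def decode_branch(btype, wtype, p, supports_wish=True):
    """Interpret the btype/wtype/p fields of a branch instruction.

    Returns (kind of branch, predicate register identifier).
    """
    if btype == 0 or not supports_wish:
        # Hint bits can be ignored: the instruction is then a normal branch.
        return "normal branch", p
    if wtype not in WTYPE_NAMES:
        raise ValueError(f"unknown wish branch type: {wtype}")
    return "wish " + WTYPE_NAMES[wtype], p


wish_loop_insn = decode_branch(btype=1, wtype=1, p=1)
legacy_view = decode_branch(btype=1, wtype=1, p=1, supports_wish=False)
```

The second call shows the backward-compatibility argument: a processor without wish branch support can run the same binary correctly by treating every wish branch as a normal branch.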

Compiler Support

[Figure: major phase ordering with wish branch generation in code generation [ORC]. Boxes in the flowchart, in the order they appear: edge/value profiling; region formation; if-conversion (select candidates, cost-benefit analysis, predicate selected blocks, branch elimination, wish jump conversion, wish join insertion); loop opt (swp, unrolling), with wish loop conversion; global inst. sched; register allocation; local inst. sched. Legend: new, modified, existing.]

Wish Branch Generation Algorithm
- Wish jump/join candidates: all branches that are suitable for if-conversion.
  - If the number of instructions in the fall-through block is greater than N (N=5), a wish jump and a wish join are inserted.
  - All other candidate branches are converted to predicated code.
- A loop branch is converted into a wish loop when the loop body has fewer than L instructions (L=30).
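The selection rules on this slide amount to two threshold tests. The sketch below encodes them directly with the slide's values N=5 and L=30; the function names and the returned labels are hypothetical, and treating a loop body of L or more instructions as staying a normal branch is an assumption, since the slide only states when a loop is converted.

```python
def classify_forward_branch(suitable_for_if_conversion,
                            fallthrough_instructions, n=5):
    """Decide how a forward branch is compiled."""
    if not suitable_for_if_conversion:
        return "normal branch"
    if fallthrough_instructions > n:
        return "wish jump/join"
    return "predicated code"


def classify_loop_branch(loop_body_instructions, l=30):
    """Decide how a backward (loop) branch is compiled."""
    if loop_body_instructions < l:
        return "wish loop"
    return "normal branch"  # assumption: larger loops are left unchanged


decisions = [
    classify_forward_branch(True, 8),   # long fall-through block
    classify_forward_branch(True, 3),   # short fall-through block
    classify_forward_branch(False, 8),  # not an if-conversion candidate
    classify_loop_branch(12),
]
```

Short hammocks are cheap to predicate unconditionally, so only the longer ones get the extra wish jump and wish join instructions.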

Hardware Support
- Instruction fetch/decode logic
  - Decoder: decode wish branches
  - BTB: mark wish branches
- Wish branch state machine hardware
  - A wish loop stays in low-confidence mode until the loop exits.
- Predicate dependency elimination module
  - High-confidence mode: predicate values are predicted.
- Branch misprediction detection/recovery module
  - No flush if a wish branch is mispredicted during low-confidence mode.
- Confidence estimator

JRS Confidence Estimator
- Estimates how much confidence the processor has in a branch prediction.
- Trained with branch misprediction information.
- [Figure: the PC and the global BHR (m bits) are combined to index a table of 2^m n-bit counters; a counter value above the threshold (> th?) means high confidence, otherwise low confidence]
- Source: Assigning Confidence to Conditional Branch Predictions [Jacobsen et al. Micro-29]
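The structure in the diagram can be sketched as a small table of saturating counters. This is a simplified model, not the configuration used in the lecture: it assumes the PC and global BHR are combined by XOR, that counters are incremented on a correct prediction and reset on a misprediction, and it omits the tags mentioned in the simulated configuration. The class name, parameter names, and default sizes are all illustrative.

```python
class JRSConfidenceEstimator:
    """Table of n-bit counters indexed by PC XOR global branch history."""

    def __init__(self, index_bits=10, counter_bits=4, threshold=14):
        self.mask = (1 << index_bits) - 1
        self.max_count = (1 << counter_bits) - 1
        self.threshold = threshold
        self.table = [0] * (1 << index_bits)  # 2^m entries
        self.bhr = 0  # global branch history register, m bits

    def _index(self, pc):
        return (pc ^ self.bhr) & self.mask

    def is_high_confidence(self, pc):
        return self.table[self._index(pc)] > self.threshold

    def update(self, pc, prediction_correct, taken):
        """Train with the outcome of a resolved branch."""
        idx = self._index(pc)
        if prediction_correct:
            self.table[idx] = min(self.table[idx] + 1, self.max_count)
        else:
            self.table[idx] = 0  # a misprediction resets the counter
        self.bhr = ((self.bhr << 1) | int(taken)) & self.mask


# Tiny configuration so the history saturates quickly in this example.
est = JRSConfidenceEstimator(index_bits=4, counter_bits=4, threshold=3)
for _ in range(8):
    est.update(pc=0x40, prediction_correct=True, taken=True)
confident_after_streak = est.is_high_confidence(0x40)
est.update(pc=0x40, prediction_correct=False, taken=True)
confident_after_miss = est.is_high_confidence(0x40)
```

A wish branch front end would consult `is_high_confidence` at fetch time to pick between normal-branch mode and predicated mode.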

Experimental Infrastructure
- Tool flow: Source Code -> IA-64 Compiler (ORC) -> IA-64 Binary -> Trace generation module -> IA-64 Trace -> Micro-op Translator -> µops -> Micro-op Simulator
- IA-64 provides full support for predication.
- IA-64 traces are converted to micro-ops to simulate an out-of-order superscalar processor model.

Simulation Methodology
- Nine SPEC 2000 integer benchmarks
- Baseline processor configuration
  - Front end
    - Large and accurate branch predictor (64KB hybrid branch predictor: gshare + local)
    - Minimum 30-cycle branch misprediction penalty
    - 64KB, 2-cycle latency I-cache
  - Execution core
    - 8-wide out-of-order processor
    - 512-entry instruction window
  - Confidence estimator
    - 1KB tagged 16-bit history JRS confidence estimator (Jacobsen et al. MICRO-29)

Performance Improvement

[Figure: performance chart; labels: non-predicated, -4%, 14%, 2.02, 8%, 24%]

Improvement callouts shown on the slide:
- 16% over conditional branch prediction, 11% over selective-predication, 7% over aggressive predication (w/o mcf)
- 14% over conditional branch prediction, 13% over selective-predication, 16% over aggressive-predication
- 12% over conditional branch prediction, 11% over selective-predication, 13% over aggressive predication

Definitions:
- SELECTIVE-PREDICATION: branches are selectively predicated using compile-time cost-benefit analysis.
- AGGRESSIVE-PREDICATION: all branches that are suitable for if-conversion are predicated.

Wish Branch: Conclusion
- New control flow instructions: wish branches (jump/join/loop).
- Wish branches improve performance by dividing the work of predication between the compiler and the microarchitecture.
  - Compiler: analyzes the control-flow graph and generates code.
  - Microarchitecture: makes the run-time decision to use predication.
- Wish branches provide significant performance benefits:
  - 16% compared to conditional branch prediction
  - 13% compared to selectively predicated code
- Wish branches can make predicated execution more viable and effective in high performance processors by enabling adaptive and aggressive predicated execution.

2D-profiling
- Goal: identify input-dependent branches by using a single input set for profiling.
- If we know a branch is input-dependent, the compiler:
  - may not convert it to predicated code;
  - may convert it to a wish branch;
  - may not perform other compiler optimizations, or may perform them less aggressively, e.g. hot-path/trace/superblock-based optimizations [Fisher’81, Pettis’90, Hwu’93, Merten’99].

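Key Insight of 2D-profiling
- Phase behavior in prediction accuracy is a good indicator of input dependence.
- [Figure: prediction accuracy over time; the input-dependent branch shows distinct phases (phase 1, phase 2, phase 3), while the input-independent branch does not]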

Traditional Profiling
- [Figure: prediction accuracy (pr.Acc) over time for brA and brB, each with its MEAN pr.Acc marked]
- MEAN pr.Acc(brA) ≈ MEAN pr.Acc(brB)
- Conclusion drawn from the means alone: behavior of brA ≈ behavior of brB

2D-profiling
- [Figure: prediction accuracy (pr.Acc) over time for brA and brB, each with its MEAN and STD marked]
- MEAN pr.Acc(brA) ≈ MEAN pr.Acc(brB)
- STD pr.Acc(brA) ≠ STD pr.Acc(brB)
- Conclusion: behavior of brA ≠ behavior of brB
- A: input-dependent branch, B: input-independent branch

2D-profiling Mechanism
- The profiler collects branch prediction accuracy information for every static branch over time.
- Execution is divided into slices (slice size = M instructions): Slice 1, Slice 2, …, Slice N.
- Per-slice data for each branch:
  - brA: mean Pr.Acc(brA,s1), mean Pr.Acc(brA,s2), ..., mean Pr.Acc(brA,sN)
  - brB: mean Pr.Acc(brB,s1), mean Pr.Acc(brB,s2), ..., mean Pr.Acc(brB,sN)
- From the per-slice data, calculate for each branch (brA, brB, …): MEAN, standard deviation, and PAM (Points Above Mean).
- [Figure: per-slice accuracy of brA and brB plotted against their means; brA has PAM: 50%, brB has PAM: 0%]
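The three statistics named on this slide are straightforward to compute from per-slice accuracy data. The sketch below is a minimal illustration: the function name, the dictionary layout, and the sample accuracies are made up, and it uses the population standard deviation, which the slide does not specify.

```python
from statistics import mean, pstdev


def profile_2d(slice_accuracies):
    """Compute MEAN, standard deviation, and PAM for every branch.

    slice_accuracies maps a branch name to its prediction accuracy per slice.
    """
    stats = {}
    for branch, acc in slice_accuracies.items():
        m = mean(acc)
        points_above_mean = sum(1 for x in acc if x > m)
        stats[branch] = {
            "mean": m,
            "std": pstdev(acc),
            "pam": points_above_mean / len(acc),
        }
    return stats


# Hypothetical data: both branches average 75% accuracy over four slices.
stats = profile_2d({
    "brA": [0.95, 0.55, 0.95, 0.55],  # accuracy changes from slice to slice
    "brB": [0.75, 0.75, 0.75, 0.75],  # accuracy is flat
})
```

Both branches have the same mean, so traditional profiling cannot tell them apart, but brA's nonzero standard deviation and PAM of 50% expose its phase behavior.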

2D-profiling: Conclusion & Future Work
- 2D-profiling is a new profiling technique that finds input-dependent characteristics by using a single input data set for profiling.
- 2D-profiling uses time-varying information instead of just average data.
- Phase behavior in prediction accuracy during a profile run indicates that a branch is input-dependent.
- Future work:
  - Better predicated code/wish branch generation algorithms
  - Detecting other input-dependent program characteristics