Hyesoon Kim Onur Mutlu Jared Stark* Yale N. Patt

Presentation transcript:

Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution
Hyesoon Kim, Onur Mutlu, Jared Stark*, Yale N. Patt
The University of Texas at Austin, Electrical and Computer Engineering
*Oregon Microarchitecture Lab, Intel Corporation

Talk Outline
- Problem
- Wish Branches
- Experimental Methodology
- Results
- Conclusion

Predicated Execution

Source code:
    if (cond) { b = 0; } else { b = 1; }

Normal branch code (block A branches to C when taken (T) and falls through to B when not taken (N)):
    A:  p1 = (cond)
        branch p1, TARGET
    B:  mov b, 1
        jmp JOIN
    C:  TARGET: mov b, 0
    D:  JOIN: add x, b, 1

Predicated code:
    A:  p1 = (cond)
    B:  (!p1) mov b, 1
    C:  (p1) mov b, 0
    D:  add x, b, 1

Predication converts a control-flow dependency into a data dependency.
Pro: eliminates hard-to-predict branches.
Cons: (1) blocks B and C are fetched all the time; (2) execution must wait until p1 is resolved.
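The if-conversion on this slide can be sketched in Python. This is only an illustration, not the authors' code: the function names are hypothetical, and the conditional expressions stand in for predicated `mov` instructions.

```python
def with_branch(cond: bool) -> int:
    """Normal branch code: only one of blocks B and C executes."""
    if cond:
        b = 0          # block C (TARGET): mov b, 0
    else:
        b = 1          # block B: mov b, 1
    return b + 1       # block D: add x, b, 1


def with_predication(cond: bool) -> int:
    """Predicated code: both blocks are fetched; p1 gates each write."""
    p1 = cond          # block A: p1 = (cond)
    b = None
    b = 1 if not p1 else b   # (!p1) mov b, 1 : takes effect only if p1 is false
    b = 0 if p1 else b       # (p1)  mov b, 0 : takes effect only if p1 is true
    return b + 1       # add x, b, 1 now depends on p1 through b
```

Both functions return the same value for either value of `cond`; the difference is that the predicated version has no branch to mispredict but always processes both `mov` instructions.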

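The Overhead of Predicated Execution

[Figure: chart comparing against non-predicated code; annotated values: -2%, 16%, 13%]

Predicated code:
    A:  p1 = (cond)
    B:  (!p1) mov b, 1
    C:  (p1) mov b, 0
    D:  add x, b, 1

The same code shown with constant predicates (0) and (1) in place of (!p1) and (p1):
    p1 = (cond)
    (0) mov b, 1
    (1) mov b, 0

If all overhead were ideally eliminated, predicated execution would provide a 16% improvement in average execution time.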

The Problem
- Due to the predication overhead, predicated execution sometimes reduces performance.
- Branch misprediction characteristics depend on run-time behavior: input set, control-flow path, and phase behavior.
- The compiler cannot accurately estimate the run-time behavior of branches.
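A rough cost model shows why the choice depends on run-time behavior. This is a sketch under stated assumptions, not a model from the talk: the function names, the cycle counts, and the single `dependency_delay` term are all hypothetical.

```python
def branch_cost(exec_taken, exec_not_taken, p_taken, misp_rate, misp_penalty):
    """Expected cycles for normal branch code (hypothetical model)."""
    expected_exec = p_taken * exec_taken + (1 - p_taken) * exec_not_taken
    return expected_exec + misp_rate * misp_penalty


def predicated_cost(exec_taken, exec_not_taken, dependency_delay):
    """Cycles for predicated code: both paths are fetched, plus a predicate wait."""
    return exec_taken + exec_not_taken + dependency_delay


def should_predicate(exec_taken, exec_not_taken, p_taken,
                     misp_rate, misp_penalty, dependency_delay):
    """True when predication is cheaper under this simplified model."""
    return (predicated_cost(exec_taken, exec_not_taken, dependency_delay)
            < branch_cost(exec_taken, exec_not_taken, p_taken,
                          misp_rate, misp_penalty))
```

With two-cycle paths, a 30-cycle penalty, and a one-cycle predicate delay, the model favors the branch at a 1% misprediction rate and predication at a 20% rate. The misprediction rate is exactly the input the compiler cannot know in advance.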

Wish Branches
- A new type of control-flow instruction, in 3 types: wish jump, wish join, and wish loop.
- The compiler generates code (with wish branches) that can be executed either as predicated code or as non-predicated code (normal branch code).
- The hardware decides at run-time whether to execute predicated code or normal branch code, based on the confidence of the branch prediction.
- Easy to predict: normal branch code.
- Hard to predict: predicated code.

Wish Jump/Join

Normal branch code:
    A:  p1 = (cond)
        branch p1, TARGET
    B:  mov b, 1
        jmp JOIN
    C:  TARGET: mov b, 0
    D:  JOIN:

Predicated code:
    A:  p1 = (cond)
    B:  (!p1) mov b, 1
    C:  (p1) mov b, 0
    D:

Wish jump/join code:
    A:  p1 = (cond)
        wish.jump p1 TARGET
    B:  (!p1) mov b, 1
        wish.join !p1 JOIN
    C:  TARGET: (p1) mov b, 0
    D:  JOIN:

High confidence: the wish jump and wish join are predicted like normal branches (taken or not-taken), and the predicates on the fetched path are treated as true:
    B:  (1) mov b, 1
        wish.join (1) JOIN
    C:  TARGET: (1) mov b, 0

Low confidence: the wish jump and wish join are not taken (they behave like nops), so blocks B and C both execute as predicated code.
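The two modes can be compared with a small behavioral sketch. It is an illustration only, assuming a hypothetical function `run_wish_jump_join` that models which blocks are fetched and whether a pipeline flush is needed; it does not model timing.

```python
def run_wish_jump_join(cond, predicted_taken, high_confidence):
    """Return (b, fetched_blocks, flushed) for one pass through blocks A-D."""
    if high_confidence:
        # Behaves like a normal branch: fetch only the predicted path.
        path = ["A", "C", "D"] if predicted_taken else ["A", "B", "D"]
        flushed = predicted_taken != cond
        if flushed:
            # Misprediction: wrong-path work is discarded, correct path refetched.
            path = ["A", "C", "D"] if cond else ["A", "B", "D"]
    else:
        # Low confidence: wish jump and wish join are not taken, so both
        # B and C are fetched and the predicates select the result.
        path = ["A", "B", "C", "D"]
        flushed = False

    b = None
    for block in path:
        if block == "B" and not cond:   # (!p1) mov b, 1
            b = 1
        if block == "C" and cond:       # (p1) mov b, 0
            b = 0
    return b, path, flushed
```

In this sketch a wrong prediction costs a flush only in high-confidence mode; in low-confidence mode the cost is instead the extra fetched block and the predicate dependency.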

Wish Loop

Source code:
    do { a++; i++; } while (i<N);

Normal backward branch code:
    H:
    X:  LOOP: add a, a, 1
              add i, i, 1
              p1 = (i<N)
              branch p1, LOOP
    Y:  EXIT:

Wish loop code:
    H:  mov p1, 1
    X:  LOOP: (p1) add a, a, 1
              (p1) add i, i, 1
              (p1) p1 = (cond)
              wish.loop p1, LOOP
    Y:  EXIT:

[Figure: control-flow graphs H, X, Y for high-confidence and low-confidence modes, with taken (T) and not-taken (N) edges]
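The effect of the predicated loop body can be sketched as follows. This is a hypothetical model in which `fetched_iterations` stands for how many times the front end fetched the loop body, and `(cond)` is taken to be `i < n`.

```python
def run_wish_loop(a, i, n, fetched_iterations):
    """Execute the wish loop body in predicated form for a fixed fetch count."""
    p1 = True                      # mov p1, 1
    for _ in range(fetched_iterations):
        if p1:
            a += 1                 # (p1) add a, a, 1
        if p1:
            i += 1                 # (p1) add i, i, 1
        if p1:
            p1 = i < n             # (p1) p1 = (cond)
    return a, i, p1
```

For example, `run_wish_loop(0, 0, 3, 5)` gives the same `a` and `i` as `run_wish_loop(0, 0, 3, 3)`: once `p1` becomes false, the extra iterations change nothing. A result with `p1` still true means the loop was left before it finished.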

Mispredicted Case 1: Early-Exit
Correct execution: H X1 X2 X3 Y (loop branch outcomes T T N)
Early-exit (low confidence): H X1 X2 Y ... (predicted T N); the pipeline is flushed and X3 Y are fetched.
Compared to normal branch code: predicate data dependency and one extra instruction (-)

Mispredicted Case 2: Late-Exit
Correct execution: H X1 X2 X3 Y (loop branch outcomes T T N)
Late-exit (low confidence): H X1 X2 X3 X4 X5 Y ... (predicted T T T T N); the extra iterations X4 and X5 become nops.
Compared to normal branch code:
- Pro: reduced flush penalty (+++)
- Cons: predicate data dependency and one extra instruction (-)

Mispredicted Case 3: No-Exit
Correct execution: H X1 X2 X3 Y (loop branch outcomes T T N)
No-exit (low confidence): H X1 X2 X3 X4 X5 X6 ... (predicted T T T T T T); the pipeline is flushed and Y is fetched.
Compared to normal branch code: predicate data dependency and one extra instruction (-)
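The three misprediction cases can be told apart by comparing how many iterations the front end fetched with how many the loop really runs. The following is a minimal sketch with hypothetical names; `fetched_iterations=None` is an assumed encoding for "the front end had not left the loop when the exit branch was resolved".

```python
def classify_wish_loop(actual_iterations, fetched_iterations):
    """Return (case, needs_flush) for a wish loop in low-confidence mode."""
    if fetched_iterations is None:
        # Front end is still fetching loop iterations: it must be redirected.
        return "no-exit", True
    if fetched_iterations == actual_iterations:
        return "correct", False
    if fetched_iterations < actual_iterations:
        # Left the loop too soon: the missing iterations must be refetched.
        return "early-exit", True
    # Fetched too many iterations: the extras execute as nops, no flush.
    return "late-exit", False
```

Only the late-exit case avoids the flush, which is where the wish loop gains over a normal backward branch.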

Advantages/Disadvantages of Wish Branches
Advantages compared to predicated execution:
- Reduce the overhead of predication.
- Increase the benefits of predicated code by allowing the compiler to generate more aggressively predicated code.
- Provide a mechanism to exploit predication to reduce the branch misprediction penalty for backward branches (wish loops).
- Make predicated code less dependent on machine configuration (e.g., the branch predictor).

Advantages/Disadvantages of Wish Branches
Disadvantages compared to predicated execution:
- Extra branch instructions use machine resources.
- Extra branch instructions increase the contention for branch predictor table entries.
- Wish branches may constrain the compiler's scope for code optimizations.

Wish Branch Support
- ISA support: predicated execution and wish branch instructions.
- Compiler support: wish branch generation algorithms. The compiler needs to decide which branches are predicated, which are converted to wish branches, and which stay as normal branches.
- Hardware support: a confidence estimator, plus the front-end and the branch misprediction detection/recovery module.

Experimental Infrastructure
Tool flow: Source Code -> IA-64 Compiler (ORC) -> IA-64 Binary -> Trace generation module -> IA-64 Trace -> Micro-op Translator -> µops -> Micro-op Simulator
- IA-64 provides full support for predication.
- IA-64 traces are converted to micro-ops to simulate an out-of-order superscalar processor model.

Simulation Methodology
Benchmarks: nine SPEC 2000 integer benchmarks.
Baseline processor configuration:
- Front end: large and accurate branch predictor (64KB hybrid branch predictor: gshare + local); minimum 30-cycle branch misprediction penalty; 64KB, 2-cycle latency I-cache.
- Execution core: 8-wide out-of-order processor; 512-entry instruction window.
- Confidence estimator: 1KB tagged, 16-bit history JRS confidence estimator (Jacobsen et al., MICRO-29).
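A confidence estimator of this general kind can be sketched with resetting counters. This is a simplified illustration, not the configuration simulated in the talk: tags are omitted, and the class name, table size, counter width, threshold, and index hash are all assumptions.

```python
class ConfidenceEstimator:
    """Resetting-counter confidence table indexed by PC XOR branch history."""

    def __init__(self, entries=1024, history_bits=16, counter_max=15, threshold=15):
        self.entries = entries
        self.history_mask = (1 << history_bits) - 1
        self.counter_max = counter_max
        self.threshold = threshold
        self.table = [0] * entries
        self.history = 0              # global branch outcome history

    def _index(self, pc):
        return (pc ^ self.history) % self.entries

    def is_high_confidence(self, pc):
        """High confidence after enough consecutive correct predictions."""
        return self.table[self._index(pc)] >= self.threshold

    def update(self, pc, taken, prediction_correct):
        idx = self._index(pc)
        if prediction_correct:
            self.table[idx] = min(self.table[idx] + 1, self.counter_max)
        else:
            self.table[idx] = 0       # a misprediction resets the counter
        self.history = ((self.history << 1) | int(taken)) & self.history_mask
```

A wish branch would run as normal branch code when `is_high_confidence` returns true and as predicated code otherwise. The threshold trades off how often the hardware falls back to predication against how many mispredictions slip through in high-confidence mode.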

Performance Improvement

[Figure: chart comparing against non-predicated code; annotated values: -4%, 14%, 2.02, 8%, 24%]

Improvements called out on the slide, in the order they appear:
- 16% over conditional branch prediction, 11% over selective-predication, 7% over aggressive-predication (all w/o mcf)
- 14% over conditional branch prediction, 13% over selective-predication, 16% over aggressive-predication
- 12% over conditional branch prediction, 11% over selective-predication, 13% over aggressive-predication

SELECTIVE-PREDICATION: branches are selectively predicated using compile-time cost-benefit analysis.
AGGRESSIVE-PREDICATION: all branches that are suitable for if-conversion are predicated.

Conclusion
- New control-flow instructions: wish branches (jump/join/loop).
- Wish branches improve performance by dividing the work of predication between the compiler and the microarchitecture. Compiler: analyzes the control-flow graph and generates code. Microarchitecture: makes the run-time decision to use predication.
- Wish branches provide significant performance benefits: 16% compared to conditional branch prediction, and 13% compared to selectively predicated code.
- Wish branches can make predicated execution more viable and effective in high-performance processors by enabling adaptive and aggressive predicated execution.