Diverge-Merge Processor (DMP)
Hyesoon Kim, José A. Joao, Onur Mutlu*, Yale N. Patt
HPS Research Group, The University of Texas at Austin
*Microsoft Research



2 Outline
 Predicated Execution
 Diverge-Merge Processor (DMP)
 Implementation of DMP
 Experimental Evaluation
 Conclusion

3 Predicated Execution
Convert a control-flow dependence into a data dependence.

Source code:
  if (cond) { b = 0; } else { b = 1; }

Normal branch code (CFG: A branches to B or C, merging at D):
  p1 = (cond)
  branch p1, TARGET
  mov b, 1
  jmp JOIN
TARGET:
  mov b, 0

Predicated code:
  p1 = (cond)
  (!p1) mov b, 1
  (p1)  mov b, 0
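The transformation on this slide can be modeled in a few lines of Python (a behavioral sketch, not ISA-level code): in the predicated version both arms execute and the predicate selects the surviving value, so the control dependence becomes a data dependence.

```python
# Toy model of if-conversion for: if (cond) { b = 0; } else { b = 1; }

def branch_version(cond):
    # control dependence: which mov executes depends on the branch outcome
    if cond:
        b = 0
    else:
        b = 1
    return b

def predicated_version(cond):
    # data dependence: both movs "execute"; the predicate picks the result
    p1 = bool(cond)
    b_not_taken = 1    # (!p1) mov b, 1
    b_taken = 0        # (p1)  mov b, 0
    return b_taken if p1 else b_not_taken

assert all(branch_version(c) == predicated_version(c) for c in (True, False))
```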

4 Benefit of Predicated Execution
 Predicated execution can be high performance and energy-efficient.
[Figure: pipeline stages (Fetch, Decode, Rename, Schedule, Register Read, Execute) under branch prediction vs. predicated execution. With branch prediction, a misprediction on the branch in A forces a pipeline flush; with predication, both paths B and C are fetched and no flush occurs.]

5 Limitations/Problems of Predication
 ISA: requires predicate registers and predicated instructions. Dynamic-hammock predication [Klauser '98] avoids ISA changes, but it is applicable only to simple hammocks.
 Adaptivity: static predication cannot adapt to run-time branch behavior, which changes with input set, phase, and control-flow path. Wish branches [Kim '05] address this.
 Complex CFG: a large subset of control-flow graphs cannot be converted to predicated code: function calls, loops, regions with many instructions, and complex CFGs. Hyperblocks [Mahlke '92] cannot adapt to frequently-executed paths dynamically.

6 Outline
 Predicated Execution
 Diverge-Merge Processor (DMP)
 Implementation of DMP
 Experimental Evaluation
 Conclusion

7 Diverge-Merge Processor (DMP)
 DMP can dynamically predicate complex branches (in addition to simple hammocks).
 The compiler identifies:
  - Diverge branches
  - Control-flow merge (CFM) points
 The microarchitecture decides when and what to predicate dynamically.

8 Dynamic Predication
Klauser et al. [PACT '98]: dynamic-hammock predication. When the branch at A has low confidence, both paths (B and C) are fetched and predicated, and a select-µop (a φ-node, as in SSA) merges the results at the join point H.

Branch code:
  p1 = (cond)
  branch p1, TARGET
  mov R1, 1
  jmp JOIN
TARGET:
  mov R1, 0
JOIN:
  add R5, R1, 1

Dynamically predicated (renamed):
  (mov R1, 1)  →  PR10 = 1
  (mov R1, 0)  →  PR11 = 0
  select-µop: PR12 = (cond) ? PR11 : PR10
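A toy model of the select-µop at the join point (physical register numbers follow the slide; the Python itself is only illustrative):

```python
# The select-µop behaves like an SSA φ-node: it picks between the physical
# registers written by the two predicated arms.

def select_uop(pred, pr_if_true, pr_if_false):
    # PR12 = (cond) ? PR11 : PR10
    return pr_if_true if pred else pr_if_false

phys = {10: 1, 11: 0}     # PR10 = 1 (mov R1,1), PR11 = 0 (mov R1,0)
cond = True
phys[12] = select_uop(cond, phys[11], phys[10])
r5 = phys[12] + 1         # JOIN: add R5, R1, 1 now reads PR12
```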

9 Diverge-Merge Processor
[Figure: CFG with blocks A to H. A is the diverge branch and H is the CFM point; the frequently executed paths from A (through B and C) re-merge at H, where select-µops are inserted. The remaining blocks (D, E, F, G) lie on paths that are not frequently executed.]

10 Diverge-Merge Processor
[Figure: the same CFG; only the blocks on the frequently executed paths between the diverge branch and the CFM point are fetched and dynamically predicated.]

11 Control-Flow Graphs
[Table: CFG types (simple hammock, nested hammock, frequently-hammock, loop, non-merging) vs. which mechanisms can handle each: DMP, dynamic-hammock predication, software predication, wish branches, dual-path execution.]

12 Dual-path Execution vs. DMP
[Figure: on a low-confidence branch at A, dual-path execution fetches C-D-E-F and B-D-E-F down both paths, duplicating everything after the merge; DMP fetches only B and C and re-merges at the CFM point, so D-E-F are fetched once.]

13 Control-Flow Graphs
[Table: the same comparison; dual-path execution handles frequently-hammocks only "sometimes".]

14 Distribution of Mispredicted Branches
 66% of mispredicted branches can be dynamically predicated in DMP.


16 Outline
 Predicated Execution
 Diverge-Merge Processor (DMP)
 Implementation of DMP
 Experimental Evaluation
 Conclusion

17 Fetch Mechanism
[Figure: when the diverge branch at A has low confidence, the processor fetches in round-robin fashion from both the predicted path and the other path until the CFM point H is reached.]

18 Dynamic Predication
Example (blocks A, B, C merging at H; architectural state initially R1→PR11, R2→PR12, R3→PR13):

  branch r0, C          →  branch pr10, C            p1 = pr10
  add r1 ← r3, #1       →  (p1)  add pr21 ← pr13, #1
  add r1 ← r2, #-1      →  (!p1) add pr31 ← pr12, #-1
                           select-µop pr41 = p1 ? pr21 : pr31
  add r4 ← r1, r3       →  add pr24 ← pr41, pr13

Entering dynamic predication forks the RAT (RAT1/RAT2), RAS, and GHR; each path renames R1 in its own RAT copy (PR21 vs. PR31), and the select-µop allocates PR41 for the merged value read at H.
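The renaming on this slide can be sketched as follows; the helper names and the physical-register numbering are assumptions for illustration, not the actual hardware interface:

```python
# Sketch: fork the register alias table (RAT) when dynamic predication
# starts; at the CFM point, generate a select-µop for every architectural
# register the two paths renamed differently.

def fork_rat(rat):
    # each path renames into its own copy of the mapping
    return dict(rat), dict(rat)

def merge_at_cfm(rat1, rat2, alloc):
    merged = {}
    for reg in rat1:
        if rat1[reg] == rat2[reg]:
            merged[reg] = rat1[reg]                      # untouched on both paths
        else:
            merged[reg] = alloc(reg, rat1[reg], rat2[reg])  # needs a select-µop
    return merged

select_uops = []

def alloc(reg, pr_true, pr_false):
    new_pr = 40 + len(select_uops)                   # fresh physical register id
    select_uops.append((new_pr, pr_true, pr_false))  # new = p1 ? pr_true : pr_false
    return new_pr

rat = {"R1": 11, "R2": 12, "R3": 13}                 # initial mapping, as on the slide
rat1, rat2 = fork_rat(rat)
rat1["R1"] = 21   # (p1)  add pr21 <- pr13, #1   writes R1
rat2["R1"] = 31   # (!p1) add pr31 <- pr12, #-1  writes R1
merged = merge_at_cfm(rat1, rat2, alloc)             # one select-µop, for R1 only
```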

19 DMP Support
 ISA support: mark diverge branches and CFM points.
 Compiler support [CGO'07]: the compiler identifies diverge branches and the corresponding CFM points.
 Hardware support:
  - Confidence estimator
  - Fetch mechanisms
  - Load/store processing
  - Instruction retirement
  - Dynamic predication

20 Hardware Complexity Analysis
[Table: mechanisms (software predication, wish branches, dual-path, multipath, dynamic hammock, DMP) vs. required hardware: select-µop generation, rename support, front-end changes, flush/no-flush checks, predicate registers, confidence estimator, store-load forwarding.]

21 Outline
 Predicated Execution
 Diverge-Merge Processor (DMP)
 Implementation of DMP
 Experimental Evaluation
 Conclusion

22 Simulation Methodology
 12 SPEC 2000 INT and 5 SPEC 95 INT benchmarks; different input sets for profiling and evaluation.
 Alpha ISA execution-driven simulator.
 Baseline processor configuration:
  - 64 KB perceptron predictor / O-GEHL (paper)
  - Minimum 30-cycle branch misprediction penalty
  - 8-wide, 512-entry instruction window
  - 2 KB, 12-bit-history enhanced JRS confidence estimator
 Less aggressive processor (paper).
 Power model using Wattch.
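A hedged sketch of a JRS-style confidence estimator like the baseline above: a table of resetting miss counters indexed by a hash of the branch PC and global history, where a long run of correct predictions means high confidence. The table size, threshold, counter width, and hash here are illustrative assumptions, not the evaluated configuration.

```python
# Sketch of a JRS (resetting-counter) branch confidence estimator.

class JRSConfidence:
    def __init__(self, entries=1024, threshold=15, max_count=15):
        self.table = [0] * entries      # saturating counters, reset on a miss
        self.entries = entries
        self.threshold = threshold
        self.max_count = max_count

    def _index(self, pc, ghr):
        # illustrative hash: XOR of PC and global history register
        return (pc ^ ghr) % self.entries

    def is_high_confidence(self, pc, ghr):
        return self.table[self._index(pc, ghr)] >= self.threshold

    def update(self, pc, ghr, prediction_correct):
        i = self._index(pc, ghr)
        if prediction_correct:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = 0           # resetting counter: any miss clears it

est = JRSConfidence()
for _ in range(15):                     # a run of correct predictions...
    est.update(0x400, 0b1010, True)
# ...makes the branch high confidence; DMP predicates only low-confidence
# diverge branches, so this branch would NOT trigger dynamic predication.
```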

23 Different CFG types

24 Performance Improvement

25 Energy Consumption

26 Outline
 Predicated Execution
 Diverge-Merge Processor (DMP)
 Implementation of DMP
 Experimental Evaluation
 Conclusion

27 Conclusion
 DMP introduces the concept of frequently-hammocks and dynamically predicates complex CFGs.
 DMP overcomes the three major limitations of software predication: ISA support, adaptivity, and complex CFGs.
 DMP reduces branch mispredictions energy-efficiently: 19% performance improvement with 9% less energy.
 DMP divides the work between the compiler and the microarchitecture: the compiler analyzes the control-flow graphs; the microarchitecture decides when and what to predicate dynamically.

Thank You!!

Questions?

30 Handling Mispredictions
[Figure: the diverge branch at A resolves as mispredicted while blocks B (p1) and C (!p1) are being dynamically predicated. Instructions on the wrong predicated path become NOPs through their predicates, so no flush is needed for them; only instructions fetched beyond the CFM point H on the wrong path (block D, e.g. add pr34 ← pr31, pr13) must be flushed.]

31 Loop Branches
 Exit condition: the loop branch is predicated when it is predicted to exit the loop.
 Benefit: reduced pipeline flushes when the predicated loop iterates more times than it should. Instructions in the extra iterations become NOPs, and instructions after the loop exit can still be executed.
 Negative effects: increased execution delay of loop-carried dependencies, and the overhead of select-µops.

32 Loop Branches
 Predicate each loop iteration separately (the loop branch is predicted to exit the loop).

Original loop:
A:
  add r1 ← r1, #1
  r0 = (cond1)
  branch A, r0
B:
  add r7 ← r1, #10

Dynamically predicated (one predicate per iteration):
  branch A, pr10                p1 = pr10
A:
  (p1) add pr21 ← pr11, #1
  (p1) pr20 = (cond1)
  (p1) branch A, pr20           p2 = pr20
  select-µop pr22 = p1 ? pr21 : pr11
  select-µop pr23 = p1 ? pr20 : pr10
A:
  (p2) add pr31 ← pr22, #1
  (p2) pr30 = (cond1)
  (p2) branch A, pr30
  select-µop pr32 = p2 ? pr31 : pr22
  select-µop pr33 = p2 ? pr30 : pr23
B:
  add pr7 ← pr32, #10
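The benefit described on the previous slide, extra iterations becoming NOPs, can be modeled behaviorally in a few lines of Python; this is a sketch of the outcome, not of the renaming hardware:

```python
# Behavioral sketch: each fetched iteration carries its own predicate, so
# if the processor fetches more iterations than the loop actually executes,
# the surplus iterations' register writes are squashed (NOPs) and the
# architectural result is still correct, with no pipeline flush.

def run_predicated_loop(actual_iterations, fetched_iterations):
    r1 = 0
    for i in range(fetched_iterations):
        p = i < actual_iterations   # predicate for this iteration
        if p:                       # (p) add r1 <- r1, #1
            r1 += 1                 # guarded write: NOP when p is false
    return r1

# fetching 2 extra iterations does not corrupt r1
assert run_predicated_loop(3, 5) == 3
```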

33 Enhanced Mechanisms
 Multiple CFM points: the hardware chooses one CFM point for each instance of dynamic predication.
 Exit optimizations:
  - Counter policy: what if one path does not reach the CFM point? Exit when the number of fetched instructions exceeds a threshold.
  - Yield policy: what if another low-confidence diverge branch is encountered in dynamic predication mode? The later low-confidence branch is more likely to be mispredicted, so yield to it.
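The two exit policies above can be sketched as a simple decision function; the threshold value and the function shape are assumptions for illustration:

```python
# Sketch of the Counter and Yield exit policies for dynamic predication.

FETCH_THRESHOLD = 64   # illustrative value, not the paper's configuration

def should_exit(fetched_since_entry, reached_cfm, younger_low_conf_diverge):
    if not reached_cfm and fetched_since_entry > FETCH_THRESHOLD:
        return "counter-exit"   # Counter policy: a path is not merging; give up
    if younger_low_conf_diverge:
        return "yield"          # Yield policy: re-enter at the younger branch,
                                # which is the more likely misprediction
    return None                 # stay in dynamic predication mode
```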

34 Detailed DMP Support
 32 predicate register ids
 Fetch mechanism:
  - High-performance I-cache
  - Fetch two cache lines
  - Predict 3 branches
  - Fetch stops at the first taken branch

35 Diverge and Merge?

36 Useful Dynamic Predication Mode

37 Perfect Branch Prediction

38 Maximum Power

39 Branch Predictor Effects

40 Confidence Estimator Effects

41 Results in Less Aggressive Processors

42 DMP vs. Perfect Conditional BP

43 Enhanced DMP Mechanisms

44 DMP vs. Other Mechanisms

45 Comparisons with Predication/Wish Branches
[Chart: performance relative to non-predicated code.]

46 Reduction in Pipeline Flushes
 Average overhead per entry:
  - Dynamic-hammock: 4 instructions/entry
  - Dual-path: 150 instructions/entry
  - Multipath: 200 instructions/entry
  - DMP: 20 instructions/entry

47 Handling Nested Diverge Branches
 Basic DMP: ignore other low-confidence diverge branches.
 Enhanced DMP: exit dynamic predication mode and re-enter from the younger low-confidence branch on the predicted path (Yield policy).
[Figure: CFG with a nested diverge branch between the outer diverge branch and the CFM point H.]

48 Compiler Support [CGO'07]
 The compiler analyzes the control flow and the profile data:
  - Step 1: Identify diverge-branch candidates and CFM points.
  - Step 2: Select diverge branches based on (1) the number of instructions between a branch and the CFM point and (2) the probability of merging at the CFM point, using heuristics or a cost-benefit model.
  - Step 3: Mark the selected branches/CFM points.
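Step 1's CFM-point identification can be approximated with a post-dominator computation: a candidate CFM point is a block where the branch's taken and not-taken paths are guaranteed to re-merge. The sketch below is a standard iterative dataflow computation over an illustrative hammock CFG (block names assumed); the real compiler pass also uses profile data, per Step 2.

```python
# Sketch: compute post-dominator sets by fixpoint iteration, then check
# that the hammock's join block post-dominates the diverge branch block.

def postdominators(cfg, exit_block):
    blocks = set(cfg)
    pdom = {b: blocks for b in cfg}     # start with "everything post-dominates"
    pdom[exit_block] = {exit_block}
    changed = True
    while changed:
        changed = False
        for b in cfg:
            if b == exit_block:
                continue
            # a block is post-dominated by itself plus whatever post-dominates
            # ALL of its successors
            new = set.intersection(*(pdom[s] for s in cfg[b])) | {b}
            if new != pdom[b]:
                pdom[b], changed = new, True
    return pdom

# hammock: A branches to B or C, both fall through to H, then to exit X
cfg = {"A": ["B", "C"], "B": ["H"], "C": ["H"], "H": ["X"], "X": []}
pdom = postdominators(cfg, "X")
# H is in pdom["A"], so both paths from A re-merge at H: a CFM candidate.
```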

49 Future Research
 Hardware support:
  - Better confidence estimators
  - Efficient hardware mechanisms to detect diverge branches and CFM points (increases hardware complexity, but eliminates the need for ISA/compiler support)
 Compiler support:
  - Better compiler algorithms [CGO'07]

50 Power Measurement Configurations
 100 nm technology
 Baseline processor: 4 GHz
 Less aggressive processor: 1.5 GHz
 CC3 clock-gating model in Wattch: unused units dissipate only 10% of their maximum power
 DMP additions: one more RAT/RAS/GHR, select-µop generation module, additional fields in the BTB, predicate registers, CFM registers, load-store forwarding, instruction retirement

51 Fetched wrong-path instructions per entry into dynamic-predication/dual-path mode

52 Fetched/Executed Instructions

53 ISA Support
 Example of diverge-branch and CFM markers in the branch encoding:
  OPCODE | type | TARGET | CFM-relative address
  - 00: normal branch
  - 10: diverge forward branch
  - 11: diverge loop branch
 CFM address = CFM-relative address + PC
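A toy encoder/decoder for the marker scheme above. Only the 2-bit type codes and the CFM = CFM-relative address + PC rule come from the slide; the field widths and packing order are assumptions for illustration.

```python
# Sketch: pack/unpack a branch word as [2-bit type | 16-bit target |
# 16-bit CFM-relative address] (widths assumed, not the real Alpha format).

BRANCH_TYPES = {0b00: "normal", 0b10: "diverge-forward", 0b11: "diverge-loop"}

def encode_branch(type_bits, target, cfm_rel=0):
    return (type_bits << 32) | ((target & 0xFFFF) << 16) | (cfm_rel & 0xFFFF)

def decode_branch(word, pc):
    btype = BRANCH_TYPES[(word >> 32) & 0b11]
    target = (word >> 16) & 0xFFFF
    cfm = None
    if btype != "normal":
        cfm = ((word & 0xFFFF) + pc) & 0xFFFF   # CFM = CFM-relative + PC
    return btype, target, cfm

word = encode_branch(0b10, 0x0120, 0x0040)
assert decode_branch(word, pc=0x0100) == ("diverge-forward", 0x0120, 0x0140)
```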

54 Entering Dynamic Predication Mode
 Entry condition: a diverge branch has low confidence.
 The front-end:
  - Stores the address of the CFM point in the CFM register.
  - Forks the RAS, GHR, and RAT.
  - Allocates a predicate register.
 Fetch mechanism: round-robin fetch from the two paths; the processor follows the branch predictor until it reaches the corresponding CFM point.

55 Exiting Dynamic Predication Mode
 Exit conditions:
  - Both paths of a diverge branch have reached the corresponding CFM point.
  - The diverge branch is resolved.
 Select-µop mechanism: similar to a φ-node in SSA; merges register values from the two paths.
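The entry and exit conditions from this slide and the previous one can be sketched as a small state machine; the class and method names are assumptions for illustration, not the hardware interface:

```python
# Sketch: enter dynamic predication on a low-confidence diverge branch;
# exit when both paths reach the CFM point or the branch resolves.

class DynPredMode:
    def __init__(self):
        self.active = False
        self.cfm = None
        self.reached = set()

    def on_diverge_branch(self, cfm_addr, low_confidence):
        if low_confidence and not self.active:
            self.active = True          # fork RAS/GHR/RAT at this point
            self.cfm = cfm_addr
            self.reached = set()

    def on_fetch(self, path_id, pc):
        if self.active and pc == self.cfm:
            self.reached.add(path_id)
            if self.reached == {0, 1}:  # both paths merged: emit select-µops
                self.active = False

    def on_branch_resolved(self):
        self.active = False             # resolution also ends the mode

mode = DynPredMode()
mode.on_diverge_branch(cfm_addr=0x48, low_confidence=True)
mode.on_fetch(0, 0x48)   # path 0 reaches the CFM point
mode.on_fetch(1, 0x48)   # path 1 reaches it too: exit the mode
```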

56 Multipath Execution
 Instructions after the control-flow merge point are fetched multiple times: a waste of resources and energy.
[Figure: two nested low-confidence branches spawn four paths; the blocks after the merge point (H and I) are fetched on every path.]

57 Modeling Software Predication
 Mark regions using a binary instrumentation tool.
 All simple and nested hammocks can be predicated.
 All instructions between a branch and the control-flow merge point are fetched.
 All nested branches are predicated.