CML Branch Penalty Reduction by Software Branch Hinting Jing Lu Yooseong Kim, Aviral Shrivastava, and Chuan Huang Compiler Microarchitecture Lab Arizona.

Presentation transcript:

Slide 1: Branch Penalty Reduction by Software Branch Hinting
Jing Lu, Yooseong Kim, Aviral Shrivastava, and Chuan Huang
Compiler Microarchitecture Lab (CML), Arizona State University, USA

Slide 2: Summary
CML web page: aviral.lab.asu.edu
- A branch predictor is needed for high performance, but it consumes too much power.
- As power efficiency becomes the key design metric, there is a push to remove the branch predictor.
- Possible solution: software branch hinting.
- Contributions of this paper:
  1. Develop a model of branch hinting for the compiler.
  2. Propose the first solution to the problem of "where to place branch hints":
     - 3 basic methods
     - Combined heuristic
- Reduces branch penalty by 20% on average, compared to SPU GCC -O3.
- Avg. performance improvement ~ 7%.

Slide 3: Branch Prediction
- Improves performance in pipelined processors
  1. Increasing branch misprediction penalty
     - Pipelines are becoming longer
     - Branch penalty ~ cycles in modern processors
  2. Improves ILP
     - Speculative, out-of-order (OOO) execution can reorder instructions
     - Without branch prediction, instructions can only be reordered inside a basic block (BB)
     - Every 5th to 8th instruction is a branch
- Trend of increasing complexity of hardware branch predictors
  - BTB size: Alpha EV6 - 36kbit BTB, EV Kbit
  - Branch prediction complexity: Alpha EV6 - hierarchical tournament, EV8 - e-gskew and bimodal

Slide 4: Times Are a-Changing
- Chips already dissipate more power than can be cooled efficiently
  - Cap on power and power density
  - Cannot improve performance without improving power efficiency
- Multi-core era
  - Cores are becoming simpler
    - Simpler cores are more power-efficient
    - Power efficiency of system = power efficiency of core
  - Performance scaling by number of cores
- Simple, power-efficient cores
  - No speculation
  - In-order execution
  - Branch predictor???

Slide 5: Can We Get Rid of the Branch Predictor?
- Needed for performance
- Consumes too much power
  - 10% of on-chip power dissipation [1]
- IBM Cell processor
  - Extremely power-efficient: 5 Gops/W
  - Compare to Intel Core 2 Duo: 0.2 Gops/W
  - No branch prediction
    - NOT taken
- Branch penalty on Cell SPUs can be high for some embedded applications:

  Benchmark        Branch penalty
  cnt              59%
  Insert_sort      31%
  Janne_complex    63%
  ns               51%
  select           36%

[Figure labels: Runtime, Power]

[1] D. Parikh et al., Power Issues Related to Branch Prediction. In Proc. of HPCA, 2002.

Slide 6: Software Branch Hinting
- Branch hint instruction: hbr
  - Tells the hardware that the branch instruction at a given branch address jumps to a given target address (for example, hbrr L14,L4 says the branch at L14 jumps to L4)
  - Inserted by the compiler or programmer
  - Negligible power consumption
- Some branch targets are easily known
  - Unconditional branches
  - Loop branches

    L3:  shli    $13,$11,2
         selb    $6,$6,$15,$8
         rotqby  $2,$12,$7
         hbrr    L14,L4
         ai      $6,$6,1
         cgti    $3,$6,2
         a       $5,$9,$2
         lnop
         selb    $10,$5,$10,$8
    L14: brz     $3,L4
         ai      $11,$11,1
         ceqi    $18,$11,3

  Benchmark        Branch penalty without hint    Branch penalty with GCC hint
  cnt              59%                            29%
  Insert_sort      31%                            19%
  Janne_complex    63%                            58%
  ns               51%                            28%
  select           36%                            32%

Slide 7: Contributions of This Work
- Modeling the branch hinting mechanism
  - How does branch hinting work?
  - How can we build a performance model of branch hinting for the compiler to use?

Slide 8: Branch and Hint Separation
- Experiment on Cell SPU hardware:
  - Separate the hint and the branch by nop instructions
  - Execution time measured using the SPU decrementer

         hbrr    L14,L4
         shli    $13,$11,2
         selb    $6,$6,$15,$8
         rotqby  $2,$12,$7
         ai      $6,$6,1
         cgti    $3,$6,2
         a       $5,$9,$2
         selb    $10,$5,$10,$8
         lnop
         …                        (18 nop instructions)
         lnop
    L14: brz     $3,L4
         ai      $11,$11,1
         ceqi    $18,$11,3

[Figure: Penalty when hint is correct]

Slide 9: Mechanism of Software Branch Hinting
[Figure: block diagram with instruction memory, inline prefetch buffer, PC, IR, a hint target buffer holding (branch address, target address) entries, and a comparator; BH marks the branch hint and BR the branch.]

Slide 10: 3 Key Parameters of Software Branch Hinting
[Figure: the same block diagram (instruction memory, inline prefetch buffer, PC, IR, hint target buffer with (branch address, target address) entries, comparator), annotated with the three parameters: d cycles to register a hint, s entries in the hint target buffer, and f cycles to fetch the target instructions.]

Slide 11: Parameters of Branch Hinting
- d: How many cycles does it take to register a hint?
  - If the separation is less than d, the hint is not active
  - For Cell, d = 8
- s: Size of the Branch Target Buffer
  - How many hints can be effective at a time?
  - For Cell, s = 1
- f: Cycles to load instructions from memory into the hint target buffer
  - If the separation is more than d + f, there is no penalty
  - For Cell, f = 11; therefore penalty = 0 if separation > 18
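A minimal sketch of how the parameters d and f could translate into a per-branch penalty. The transcript does not give the exact function, so the shape below is an assumption: the 18-cycle penalty for an inactive hint and the 10-cycle penalty at separation 8 are taken from the NOP Padding slide, and the linear ramp between them is inferred from those two points. The names `MISS_PENALTY`, `D_CYCLES`, and `hinted_taken_penalty` are hypothetical.

```python
# Illustrative model only; the linear ramp is an assumption, not the paper's formula.
D_CYCLES = 8        # d: cycles needed to register a hint (Cell, from the slide)
F_CYCLES = 11       # f: cycles to load target instructions (Cell, from the slide)
MISS_PENALTY = 18   # cycles lost when a taken branch has no active hint (NOP Padding slide)


def hinted_taken_penalty(separation: int) -> int:
    """Penalty in cycles for a taken branch whose hint is `separation` away."""
    if separation < D_CYCLES:
        # The hint has not been registered yet, so it has no effect.
        return MISS_PENALTY
    # Assumed: each extra unit of separation hides one more cycle of the fetch,
    # reaching zero once separation exceeds 18 (just under d + f).
    return max(0, MISS_PENALTY - separation)


for sep in (4, 8, 17, 19):
    print(sep, hinted_taken_penalty(sep))
```

Under these assumptions the function returns the full penalty below d, shrinks as the hint moves further from the branch, and is zero once the separation exceeds 18.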

Slide 12: Branch Penalty Model for the Compiler
- Model the penalty of a branch as a function of separation, taken probability, and the number of times the branch is executed

Slide 13: Branch Penalty Model for the Compiler
- Model the penalty of a branch as a function of separation, taken probability, and the number of times the branch is executed
  - p = branch probability (the other path has probability 1 - p)
  - l = separation between branch and hint
  - n = number of times the branch is executed
[Figure: hbrr L14, L4 placed l instructions ahead of L14: brz $3, L4; the branch has successor blocks L4 and L15, with edge probabilities p and 1 - p.]
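The three inputs named above (separation l, taken probability p, execution count n) can be combined into an expected penalty. The sketch below is one plausible form, not the paper's model: it assumes that an unhinted branch pays the full penalty only when taken, that a hinted branch pays the full penalty when it is not taken (the hint was wrong), and that a correct hint pays a separation-dependent penalty. All function and constant names are hypothetical, and the 18-cycle figure comes from the NOP Padding slide.

```python
MISS_PENALTY = 18   # assumed cycles lost on a mispredicted path
D_CYCLES = 8        # d for Cell, from the slides


def hinted_taken_penalty(l):
    """Assumed penalty of a correctly hinted taken branch at separation l."""
    if l < D_CYCLES:
        return MISS_PENALTY          # hint not yet active
    return max(0, MISS_PENALTY - l)  # assumed linear ramp down to zero


def expected_penalty(n, p, l=None):
    """Expected total penalty cycles for one branch.

    n: number of times the branch executes
    p: probability that the branch is taken
    l: separation between hint and branch, or None if the branch is not hinted
    """
    if l is None:
        # No hint: the fall-through path is fetched, so only taken branches pay.
        return n * p * MISS_PENALTY
    # Hinted as taken: a correct hint pays the ramped penalty, a wrong one pays in full.
    return n * (p * hinted_taken_penalty(l) + (1 - p) * MISS_PENALTY)


print(expected_penalty(100, 0.9))         # unhinted
print(expected_penalty(100, 0.9, l=19))   # hinted with ample separation
print(expected_penalty(100, 0.2, l=19))   # hinting a rarely taken branch
```

With these assumptions, hinting pays off only when the branch is taken often enough; for a rarely taken branch the hinted cost exceeds the unhinted one, which is why the taken probability has to be part of the model.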

Slide 14: Contributions of This Work
1. Modeling the branch hinting mechanism
   - How does branch hinting work?
   - How can we build a performance model of branch hinting for the compiler to use?
2. Branch hint placement
   - 3 basic branch hint placement methods
     - NOP padding
     - Hint pipelining
     - Loop restructuring

Slide 15: Related Work
- Predication [Muchnick 97]
  - Extra hardware overhead and power consumption
- Loop unrolling [Muchnick 97]
  - Increases code size
- Energy-efficient branch prediction on Cell SPUs [Briejer 10]
  - Involves a hardware branch predictor
- Software branch hinting
  - Static branch probability analysis [Ball 93], [Wu 94]
  - Static branch hint placement [SPU GCC, this work]

Slide 16: Branch Hint Placement Problem
- Input:
  - Control flow graph
  - For each branch:
    - Taken probability
    - Execution count
- Output:
  - Where to insert each hint?
  - Which branches to hint?
- Objective:
  - Minimize total branch penalty
[Figure: control flow graph with two branches, L14: brz $3, L4 and L16: brz $3, L5, annotated with taken probabilities p1, p2 (fall-through 1 - p1, 1 - p2) and execution counts n1, n2. Hints hbrr L14, L4 and hbrr L16, L5 are shown with separations d=10 and d=2; the second is marked "Too small!".]

Slide 17: SPU GCC Branch Hint Placement
GCC compiler in the IBM Cell BE SDK:
- Hints the most important branches
- Hints only one of two closely placed branches
- Hints only the innermost loop in nested loops
[Figure: nested loop over blocks L1 to L4 with inner-loop branch b3: brnz $4, L3 and outer-loop branch b4: brnz $5, L2, and hints hbrr b3, L3 and hbrr b4, L2; the separation for the second hint is marked "Separation too small".]

Slide 18: Branch Hint Reduction Methods
- Three basic techniques:
  - NOP padding
    - Finds the number of NOP instructions needed between a branch and its hint to maximize profit
  - Hint pipelining
    - Enables hinting branches that are very close to each other
  - Loop restructuring
    - Hints nested loops

Slide 19: NOP Padding
- Insert nop and lnop instructions to artificially increase separation
- Case (a):
  - Separation = 4
  - Branch penalty = 18 cycles
- Case (b):
  - Separation = 8
  - Branch penalty = 10 cycles
  - Profit = 8 cycles
[Figure: Benefit of NOP padding. (a) hbrr followed closely by br, separation = 4. (b) nop and lnop instructions inserted between hbrr and br, separation = 8.]
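The profit calculation on this slide can be sketched as a small function. This is an illustration under assumptions, not the paper's analysis: the penalty function is a linear ramp inferred from the two data points on the slide (separation 4 gives 18 cycles, separation 8 gives 10 cycles), and `cycles_per_nop` is a hypothetical parameter for the cost of executing the padding, which the slide's own example does not charge for. All names are hypothetical.

```python
MISS_PENALTY = 18   # cycles, from case (a) on the slide
D_CYCLES = 8        # d for Cell, from the slides


def hinted_taken_penalty(separation):
    """Assumed penalty of a correctly hinted taken branch."""
    if separation < D_CYCLES:
        return MISS_PENALTY                    # hint not active yet
    return max(0, MISS_PENALTY - separation)   # assumed linear ramp


def padding_profit(separation, nops, cycles_per_nop=0.0):
    """Cycles saved by inserting `nops` padding instructions before the branch."""
    saved = hinted_taken_penalty(separation) - hinted_taken_penalty(separation + nops)
    return saved - nops * cycles_per_nop       # subtract the assumed padding cost


def best_padding(separation, cycles_per_nop, max_nops=18):
    """Padding count with the highest profit (0 means padding does not pay off)."""
    return max(range(max_nops + 1),
               key=lambda k: padding_profit(separation, k, cycles_per_nop))


print(padding_profit(4, 4))    # the slide's example: penalty drops from 18 to 10
print(best_padding(4, 0.5))    # with a hypothetical cost of half a cycle per nop
```

The first call reproduces the slide's profit of 8 cycles. Once the padding has a cost, the best amount of padding becomes a trade-off, which is the profitability question the method has to answer.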

Slide 20: Hint Pipelining
- Hoist the hint for b2 above b1 to increase separation
  - Cannot hint b1
- Place the hint for branch b2 less than eight instructions ahead of branch b1
- Case (a):
  - Penalty_b1 = 18 cycles, Penalty_b2 = 0 cycles
  - Branch penalty = 18 cycles
- Case (b):
  - Penalty_b1 = 7 cycles, Penalty_b2 = 1 cycle
  - Branch penalty = 8 cycles
  - Overhead: 1 hint instruction
  - Profit = 18 - (8 + 1) = 9 cycles
[Figure: blocks L1 and L2 containing b1: brz $3, L4 and b2: br L3, with l1 = 10 and l2 = 10. (a) Only hbrr b2, L3 is placed. (b) Both hbrr b1, L2 and hbrr b2, L3 are placed; the figure labels the hint for b2 as 7 instructions ahead of b1 and shows l1 + l2 = 17.]

Slide 21: Loop Restructuring
- Branch penalty from loops accumulates
- Observation: only the innermost loop can be hinted
- Change the structure of the loop
[Figure: before restructuring, a nested loop over blocks L1 to L5 with inner-loop branch b3: brnz $4, L3, outer-loop branch b4: brnz $5, L2, and hints hbrr b3, L3 and hbrr b4, L2; the space for the second hint is marked "Separation too small". After restructuring, added branches (labeled b1: br L2, b2: br L3, and brz $5, L5 in the figure) give increased space for the hint. Regions are labeled "Inner loop body", "Outer loop body", and "Space for hint".]

Slide 22: Contributions of This Work
1. Modeling the branch hinting mechanism
   - How does branch hinting work?
   - Performance model of branch hinting for the compiler
2. Branch hint placement
   - 3 basic branch hint placement methods
     - NOP padding
     - Hint pipelining
     - Loop restructuring
   - Profitability analysis for each method
3. Heuristic to apply these techniques to a given application
   - Prudently apply each method, with profitability analysis in each step
   - Please see the paper for details

Slide 23: Experimental Setup
- Baseline of comparison is the GCC compiler
  - Included in the IBM Cell BE SDK
  - Benchmarks compiled with the -O3 optimization level
- Benchmarks from Multimedia Loops and WCET benchmarks
  - Split into "low" and "high" groups according to percentage of branch penalty
- Performance measured using the IBM SystemSim simulator
  - Cycle-accurate
  - Provides statistics:
    - Total execution cycles
    - Number of branch penalty cycles
    - nop cycles
- Measurements are done only on user code
  - Library functions are not changed
- Branch probability and cyclic frequencies obtained by static analysis
  - Also implemented in GCC
[Figure labels: Multimedia Loops, WCET Benchmarks]

Slide 24: Average 20% Branch Penalty Reduction
- Reduces branch penalty by 19.2% more than GCC on average
- The increased NOP cycles are counted as part of the branch penalty
- More effective for deeply nested loops
[Figure: branch penalty reduction per benchmark for the "high" and "low" groups; deeply nested loops are highlighted, with a maximum reduction of 35%.]

Slide 25: Average 10% Speedup
- Peak speedup of 18%
- "High" group more susceptible to branch penalty reduction
- Involves profitability analysis
[Figure: speedup per benchmark for the "high" and "low" groups.]

Slide 26: Summary
- A branch predictor is needed for high performance, but it consumes too much power.
- As power efficiency becomes the key design metric, there is a push to remove the branch predictor.
- Possible solution: software branch hinting.
- Contributions of this paper:
  1. Develop a model of branch hinting for the compiler.
  2. Propose the first solution to the problem of "where to place branch hints":
     - 3 basic methods
     - Combined heuristic
- Reduces branch penalty by 20% on average, compared to SPU GCC -O3.
- Avg. performance improvement ~ 7%.