Code Size Efficiency in Global Scheduling for ILP Processors. TINKER Research Group, Department of Electrical & Computer Engineering, North Carolina State University.


Code Size Efficiency in Global Scheduling for ILP Processors
Huiyang Zhou, Tom Conte
TINKER Research Group
Department of Electrical & Computer Engineering
North Carolina State University

2 Outline
Introduction
Quantitative measure of code size efficiency
Best code size efficiency for a given code size limit
Optimal code size efficiency for a program
Summary
Future work

3 Introduction
Instruction-level parallelism (ILP) vs. static code size
–Region-enlarging optimizations usually enhance ILP
  Cyclic scheduling: loop unrolling, loop peeling, etc.
  Acyclic scheduling: tail duplication, recovery code, etc.
I-cache and ITLB performance vs. static code size
–Larger code usually means a larger I-cache footprint
Trade-off between the conflicting effects of code size increase
–Especially in acyclic global scheduling

4 Background of Treegion Scheduling
Treegion scheduling
–An acyclic scheduling technique
–Two phases:
  Treegion formation
  Treegion-based instruction scheduling: Tree Traversal Scheduling (TTS) (HPCA-4, LCPC'01)
Treegion
–The basic scheduling unit
–A single-entry / multiple-exit nonlinear region whose CFG forms a tree (i.e., no merge points and no back-edges within a treegion)
[Figure: CFG of BB1 through BB6 partitioned into Tree1 and Tree2]

5 Background of Treegion Scheduling
Treegion examples
Natural treegion: a treegion formed without tail duplication (i.e., no code size increase during natural treegion formation)
[Figure: CFG of BB1 through BB6 before and after tail duplication; BB4, BB5, and BB6 are duplicated as BB4', BB5', and BB6', extending Tree1 into Tree1']

6 Code Size Effects in Treegion Scheduling
Tail duplication increases code size
General operation combining reduces code size
[Figure: after tail duplication, the copy R7=R3+R4 in BB4' is redundant with R1=R3+R4 in BB1; combining removes it, and R9=R7*4 becomes R9=R1*4]

7 Quantitative Measure of Code Size Efficiency
ILP vs. static code size
Havanki's heuristic: a treegion formation heuristic proposed earlier [HPCA-4]

8 Code Size Efficiency for Any Code-Size-Related Optimization
Use the ratio of the IPC change over the code size change as an indication of code size efficiency:
–Average code size efficiency
–Instantaneous code size efficiency
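In symbols (this notation is mine, not from the slides), the two measures could be written, for static code size S and configurations A0, A1, ... along the optimization curve, as:

```latex
\eta_{\mathrm{avg}}(A_0 \rightarrow A_i) \;=\;
  \frac{\mathrm{IPC}(A_i)-\mathrm{IPC}(A_0)}{S(A_i)-S(A_0)},
\qquad
\eta_{\mathrm{inst}} \;=\; \frac{d\,\mathrm{IPC}}{dS}
```

Here A0 would be the starting point (e.g., the natural-treegion code), so the average efficiency is the slope of the chord from A0, while the instantaneous efficiency is the local slope of the IPC vs. code size curve.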

9 Average and Instantaneous Code Size Efficiency
[Figure: static IPC vs. code size curve through points A0, A1, A2, A3, A4]

10 Estimate Static IPC Before Scheduling
Use the expected execution time to calculate the static IPC for a multi-path region [formula lost in transcription]
Now, IPC changes can be calculated as the execution time saved by the optimization
[Example figure: tree1 and tree2 combined into Tree1']
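One plausible reading of the lost formula, with all symbols assumed by me rather than taken from the paper: for a multi-path region with per-path probabilities p_i, per-path schedule lengths t_i, and N_ops static operations,

```latex
T_{\mathrm{exp}} \;=\; \sum_{i \in \mathrm{paths}} p_i \, t_i,
\qquad
\mathrm{IPC}_{\mathrm{static}} \;=\; \frac{N_{\mathrm{ops}}}{T_{\mathrm{exp}}}
```

Under this reading, an optimization's IPC change follows directly from the reduction in the expected execution time T_exp, which matches the slide's claim.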

11 Optimal Code Size Efficiency for a Given Code Size Limit
With a fixed code size budget, try to maximize the static IPC, i.e., maximize the average code size efficiency
[Figure: static IPC vs. code size, from the natural-treegion point up to the size limit]

12 Optimal Tail Duplication Under a Code Size Constraint
1. Calculate the instantaneous code size efficiency for all possible tail duplication candidates in the program scope.
2. Find the candidate with the best code size efficiency.
3. If the selected candidate satisfies the code size constraint, perform the tail duplication and update the code size efficiencies of the candidates affected by the duplication.
4. Repeat steps 2-3 until the code size limit is reached.
[Figure: IPC vs. relative code size, up to the limit]
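The selection loop above is a greedy knapsack-style heuristic. A minimal sketch, assuming candidates carry precomputed ΔIPC and Δsize estimates (the candidate names, fields, and numbers below are illustrative, not from the paper):

```python
def select_duplications(candidates, size_limit):
    """Greedily pick tail-duplication candidates by instantaneous
    code size efficiency (delta IPC / delta code size) under a
    total code-size budget.

    candidates: list of dicts with 'name', 'delta_ipc', 'delta_size'.
    size_limit: maximum total code size increase allowed.
    """
    remaining = list(candidates)
    chosen, used = [], 0
    while remaining:
        # Step 2: pick the candidate with the best efficiency.
        best = max(remaining, key=lambda c: c['delta_ipc'] / c['delta_size'])
        remaining.remove(best)
        # Step 3: apply it only if it fits the code size constraint.
        if used + best['delta_size'] <= size_limit:
            chosen.append(best['name'])
            used += best['delta_size']
            # A real implementation would re-estimate the efficiencies
            # of candidates affected by this duplication here.
    return chosen, used

cands = [
    {'name': 'bb4', 'delta_ipc': 0.30, 'delta_size': 10},
    {'name': 'bb7', 'delta_ipc': 0.05, 'delta_size': 40},
    {'name': 'bb9', 'delta_ipc': 0.12, 'delta_size': 5},
]
print(select_duplications(cands, size_limit=20))
```

With these made-up numbers, bb4 and bb9 fit the 20-unit budget in efficiency order while bb7 does not; the efficiency updates after each duplication (step 3 of the slide) are the part this sketch elides.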

13 Processor Model Specification
Execution: dispatch/issue/retire bandwidth: 8; universal function units: 8; operation latency: ALU, ST, BR: 1 cycle; LD and floating-point (FP) add/subtract: 2 cycles
I-cache: compressed (zero-nop), two banks, each bank 2-way 16KB; line size: 16 operations at 4 bytes each; miss latency: 12 cycles
D-cache: size/associativity/replacement: 64KB/4-way/LRU; line size: 32 bytes; miss penalty: 14 cycles
Branch predictor: G-share style multiway branch prediction [20]; branch prediction table: 2^14 entries; branch target buffer: 2^14 entries/8-way/LRU; branch misprediction penalty: 10 cycles

14 Results: ILP vs. Code Size
[Chart: results for code size limits of 0%, 2%, 5%, 30%, and 80%]

15 Results: ILP vs. Code Size (cont.)
[Chart: results for code size limits of 0%, 2%, 5%, 30%, and 80%]
Reason: only a very small part of the program is frequently executed.

16 Optimal Code Size Efficiency
Definition: the point where 'diminishing returns' start
Finding the optimal code size efficiency:
[Figure: IPC vs. relative code size curve with point A, line l, and point A']

17 Finding the Optimal Code Size Efficiency
Threshold on the first derivative of the IPC vs. code size curve, which is simply a threshold on the instantaneous code size efficiency!
K is the slope of line l
[Figure: IPC vs. relative code size, with line l of slope K and bounding slopes K1 and K2 at point A (or A')]

18 Finding the Optimal Code Size Efficiency (cont.)
Meaning of K1 and K2:
–K1 and K2 are the slopes of the lines l1 and l2
–The range (K1 - K2) determines the robustness of the threshold scheme
–Point B: threshold set to K1; point C: threshold set to K2
[Figure: IPC vs. relative code size curve with points A, B, C and lines l1, l2]

19 Algorithm for Finding the Optimal Code Size Efficiency
1. Set the threshold k anywhere between tan(π/6) and tan(π/12).
2. Calculate the instantaneous code size efficiency for all possible tail duplication candidates in the program scope.
3. If there is a candidate whose instantaneous code size efficiency is above the threshold, duplicate the candidate and update the efficiency of affected candidates; repeat until there are no more such candidates.
When the expected execution time is used, the threshold scheme becomes (derivation details in ref [21]) [formula lost in transcription]
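The threshold variant can be sketched the same way as the budget-limited loop, stopping at the curve's 'knee' instead of at a size limit. Again, the candidate records and numbers are illustrative, not from the paper:

```python
import math

def duplicate_until_knee(candidates, k=math.tan(math.pi / 8)):
    """Apply tail duplication while some candidate's instantaneous
    code size efficiency (delta IPC / delta code size) exceeds the
    slope threshold k; the slide picks k between tan(pi/12) and
    tan(pi/6), so tan(pi/8) is one value inside that range."""
    applied = []
    remaining = list(candidates)
    while remaining:
        best = max(remaining, key=lambda c: c['delta_ipc'] / c['delta_size'])
        if best['delta_ipc'] / best['delta_size'] <= k:
            break  # past the 'knee': diminishing returns from here on
        remaining.remove(best)
        applied.append(best['name'])
        # A full implementation would now re-estimate the efficiencies
        # of candidates affected by this duplication (step 3).
    return applied

cands = [
    {'name': 'hot_path',  'delta_ipc': 0.90, 'delta_size': 1.0},
    {'name': 'warm_path', 'delta_ipc': 0.50, 'delta_size': 1.0},
    {'name': 'cold_path', 'delta_ipc': 0.20, 'delta_size': 1.0},
]
print(duplicate_until_knee(cands))
```

With tan(π/8) ≈ 0.414 as the threshold, the two paths with efficiency above it are duplicated and the cold path is left alone, which is the 'stop at diminishing returns' behavior the slide describes.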

20 Results for Optimal Code Size Efficiency
Varying the threshold from tan(π/12) to tan(π/6), the threshold scheme finds the optimal efficiency accurately
Example: m88ksim
[Chart: results for code size increases of 0%, 2%, 5%, 10%, and 20%]

21 I-Cache Impacts of the Code Size Increase
Code size impacts and locality impacts (ref [3])

22 I-Cache Impacts of the Code Size Increase (cont.)
The optimal-efficiency results yield a denser schedule

23 I-Cache Impacts of the Code Size Increase (cont.)
The combined impact

24 Processor Performance
On average, a significant speedup in dynamic IPC (17% over natural treegion) at the cost of a 2% code size increase.

25 Conclusions
Quantitative measure of code size efficiency: the ratio of IPC change over code size increase
Best code size efficiency for a given code size limit
–Results: significant but varying impact on IPC
Optimal efficiency: a simple yet robust threshold scheme to find the 'knee' of the curve
–Results: improved I-cache performance (4%), significant speedup (17%), moderate static code size increase (2%)
Future work
–Combine with other optimizations, e.g., loop unrolling

26 Contact Information
Huiyang Zhou
Tom Conte
TINKER Research Group
North Carolina State University