Compiler Optimization-Space Exploration. Adrian Pop, IDA/PELAB. Paper authors: Spyridon Triantafyllis, Manish Vachharajani, Neil Vachharajani, David I. August.



Compiler Optimization-Space Exploration
Adrian Pop, IDA/PELAB
Authors: Spyridon Triantafyllis, Manish Vachharajani, Neil Vachharajani, David I. August

Outline (November 11)

- Introduction
- The Problem: Predictive Heuristics and A Priori Evaluation
- Some Solutions: Iterative Compilation and A Posteriori Evaluation
- Our Solution: Optimization-Space Exploration
- Evaluation
- Conclusion

Introduction

- Processors
  - become more complex
  - incorporate additional computational resources
- Consequence: compilers
  - become more complex
  - use aggressive optimizations
  - have to use predictive heuristics to decide where and to what extent optimizations should be applied

The Problem: Predictive Heuristics

- Predictive heuristics
  - try to determine a priori the benefit of a given optimization
  - are tuned to give the highest average performance
- The result
  - significant performance gains are left unrealized!

Some Solutions: Iterative Compilation

- Iterative compilation
  - optimizes the program in many different ways
  - chooses a posteriori the best code version
- Pitfalls of current schemes
  - prohibitive compilation times!
  - limited to specific architectures (embedded systems)
  - limited to specific optimizations

Our Solution: Optimization-Space Exploration

- OSE compiler (practical iterative compilation)
  - explores the space of optimization configurations through multiple compilations
  - uses the experience of the compiler writer to prune the number of configurations that must be explored
  - uses a performance estimator, so code versions need not be evaluated by execution
  - selects a custom configuration for each code segment
  - selects the next optimization configuration by examining the characteristics of the previous configurations

OSE over many configurations

OSE – Limiting the Search Space

- Optimization space
  - derived from a set of optimization parameters
- Optimization parameters
  - Optimization level
  - High-Level Optimization (HLO) level
  - Micro-architecture type
  - Coalesce adjacent loads and stores
  - HLO phase order
  - Loop unroll limit
  - Update dependencies after unrolling
  - Perform software pipelining

OSE – Limiting the Search Space

- Optimization parameters (continued)
  - Heuristic to disable software pipelining
  - Allow control speculation during software pipelining
  - Software-pipeline outer loops
  - Enable if-conversion heuristic for software pipelining
  - Software-pipeline loops with early exits
  - Enable if-conversion
  - Enable non-standard predication
  - Enable pre-scheduling
  - Scheduler ready criterion

OSE – Limiting the Search Space

- Compiler construction-time pruning
  - limits the total number of configurations that will be considered at compile time
  - constructs a set S with at most N configurations
  - S is chosen by measuring the impact on a representative set of code segments C:
    1. Start with S' = the default configuration plus the configurations that each change one parameter from its default value.
    2. Run C compiled with S' on real hardware and retain in S' only the valuable configurations.
    3. Form S'' from combinations of the configurations in S'; repeat step 2 for S'' and retain only the best N configurations.
    4. Repeat step 3 until no new configurations can be generated or the speedup no longer improves.
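The construction-time pruning loop above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `measure(cfg, seg)` is a hypothetical speedup oracle standing in for a run on real hardware, configurations are modeled as tuples of non-default parameter settings, and `threshold` decides which configurations count as "valuable".

```python
from itertools import combinations

def prune_configurations(candidates, segments, measure, n, threshold=1.0):
    """Construction-time pruning sketch.  `measure(cfg, seg)` is a
    hypothetical speedup oracle (1.0 = same speed as the default
    configuration); configurations are tuples of non-default settings."""

    def score(cfg):
        # Average measured speedup of a configuration over all code segments.
        return sum(measure(cfg, seg) for seg in segments) / len(segments)

    # Step 2: keep only the valuable single-change configurations.
    survivors = [cfg for cfg in candidates if score(cfg) > threshold]
    best = sorted(survivors, key=score, reverse=True)[:n]
    prev_speedup = max((score(cfg) for cfg in best), default=0.0)

    while True:
        # Step 3: combine surviving configurations pairwise into new ones.
        combined = [tuple(sorted(set(a) | set(b)))
                    for a, b in combinations(best, 2)]
        pool = list({*best, *combined})
        new_best = sorted(pool, key=score, reverse=True)[:n]
        new_speedup = max((score(cfg) for cfg in new_best), default=0.0)
        # Step 4: stop when nothing new appears or the speedup stops improving.
        if new_speedup <= prev_speedup or set(new_best) == set(best):
            return new_best
        best, prev_speedup = new_best, new_speedup
```

With a toy oracle in which each useful parameter adds 10% speedup, the pruning discovers and keeps the combined configuration.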

OSE – Limiting the Search Space

- Characterizing configuration correlations
  - build an optimization configuration tree
  - critical configurations = configurations at the same tree level
  1. Construct O, the set of the m most important configurations in S over all code segments in C.
  2. Make each oi in O a successor of the root node.
  3. For each configuration oi in O:
  4.   Construct Ci = {cj : argmax_k(p_j,k) = i}, for k = 1…m.
  5.   Repeat steps 3 and 4 to find the successors of oi, limiting the code segments to Ci and the configurations to S\O.
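The tree construction in steps 1-5 can be sketched recursively. This is an illustrative reading of the slide, with `perf(seg, cfg)` as a hypothetical measurement oracle playing the role of the p_j,k values, and "m most important" simplified to best total performance:

```python
def build_config_tree(configs, segments, perf, m):
    """Recursive sketch of the configuration-tree construction.
    `perf(seg, cfg)` is a hypothetical performance oracle (higher is
    better).  Returns the tree as nested dicts: each key is a
    configuration, each value the subtree of its successors."""
    if not configs or not segments or m <= 0:
        return {}
    # Step 1: pick the m most important configurations -- here, those with
    # the best total performance across the segments under consideration.
    ranked = sorted(configs,
                    key=lambda cfg: sum(perf(seg, cfg) for seg in segments),
                    reverse=True)
    chosen, rest = ranked[:m], ranked[m:]
    tree = {}
    for cfg in chosen:
        # Step 4: Ci = the segments on which cfg is the argmax among `chosen`.
        ci = [seg for seg in segments
              if max(chosen, key=lambda c: perf(seg, c)) == cfg]
        # Step 5: successors come from the remaining configurations,
        # restricted to the segments that cfg wins on.
        tree[cfg] = build_config_tree(rest, ci, perf, m)
    return tree
```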

OSE – Limiting the Search Space

- Compile-time search
  - perform a breadth-first search of the optimization configuration tree
  - choose the configuration that yields the best estimated performance
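A sketch of this pruned breadth-first search, assuming the tree maps each configuration to the subtree of its successors and `estimate(cfg)` is a hypothetical stand-in for the compile-time performance estimator:

```python
from collections import deque

def ose_compile_time_search(tree, estimate):
    """Pruned breadth-first search sketch over the configuration tree.
    `tree` maps each configuration to the subtree of its successors ({}
    at a leaf); `estimate(cfg)` returns estimated performance (higher is
    better).  At each tree level every configuration is evaluated, but
    only the successors of the level's winner are explored further."""
    best_cfg, best_score = None, float('-inf')
    frontier = deque(tree.items())
    while frontier:
        level_best_score, level_subtree = float('-inf'), {}
        for _ in range(len(frontier)):
            cfg, subtree = frontier.popleft()
            score = estimate(cfg)
            if score > best_score:
                best_cfg, best_score = cfg, score
            if score > level_best_score:
                level_best_score, level_subtree = score, subtree
        # Descend only into the best configuration found on this level.
        frontier.extend(level_subtree.items())
    return best_cfg
```

Note that this search is deliberately incomplete: a high-scoring configuration hidden under a weak parent is never reached, which is the price paid for bounded compile time.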

OSE – Limiting the Search Space

- Limit the application of OSE to hot code segments
  - hot code segments are identified through profiling or through hardware performance counters during a program run

Evaluation

- OSE compiler algorithm:
  1. Profile the code.
  2. For each function: compile to the high-level IR, then optimize using HLO.
  3. For each function:
     - If the function is hot, perform OSE on the second HLO pass and code generation, and emit the function using the best configuration found.
     - If the function is not hot, use the standard configuration.
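The driver loop above can be sketched as follows. Every callable here is a hypothetical stand-in: `profile` returns per-function execution weights, `is_hot` thresholds a weight, and `ose_search` returns the best configuration found by exploration; actual compilation and code emission are elided.

```python
def ose_compile(functions, profile, is_hot, ose_search, standard_cfg):
    """Sketch of the OSE driver loop from the slide.  Hot functions get a
    per-function optimization-space exploration; cold functions are
    compiled with the standard configuration."""
    weights = profile(functions)
    chosen = {}
    for fn in functions:
        if is_hot(weights[fn]):
            # Hot function: explore the optimization space, keep the winner.
            chosen[fn] = ose_search(fn)
        else:
            # Cold function: the standard configuration is good enough.
            chosen[fn] = standard_cfg
    return chosen
```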

Compile-time Performance Estimation

- Model based on:
  - ideal cycle count, T
  - data cache performance, Lambda, L
  - instruction cache performance, I
  - branch misprediction, B
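The slide lists the estimator's components without the exact formula. A simple additive combination of the four components might look like the sketch below; the penalty parameters are illustrative assumptions, not values from the paper.

```python
def estimated_cycles(ideal_cycles, loads, avg_load_latency,
                     icache_misses, icache_penalty,
                     mispredicts, mispredict_penalty):
    """Hedged additive sketch of the estimation model's components:
    ideal cycle count T, data-cache stalls (L loads times an average
    load latency, Lambda), instruction-cache stalls I, and
    branch-misprediction stalls B."""
    data_cache_stalls = loads * avg_load_latency
    icache_stalls = icache_misses * icache_penalty
    branch_stalls = mispredicts * mispredict_penalty
    return ideal_cycles + data_cache_stalls + icache_stalls + branch_stalls
```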

Results

Conclusions