Efficient Program Compilation through Machine Learning Techniques
Gennady Pekhimenko, IBM Canada
Angela Demke Brown, University of Toronto

Motivation
My program goes through the compiler at -O2, which runs a handful of optimizations (DCE, peephole, unroll, inline) and produces an executable in a few seconds. But what do we do if the executable is slow? We replace -O2 with -O5, which applies on the order of 100 optimizations (unroll, inline, peephole, DCE, and many more) and produces a new, fast executable, but compilation now takes 1-10 minutes.

Motivation (2)
Compiling our operating system at -O2 takes 1 hour and the resulting executable is too slow. Recompiling at -O5 takes 20 hours to produce the new executable. We do not have that much time. Why did this happen?

Basic Idea
With on the order of 100 optimizations available, do we need all of them for every function? Probably not. Compiler writers can typically solve this problem, but how?
1. Build a description of every function.
2. Classify functions based on that description.
3. Apply only certain optimizations to each class.
Machine learning is well suited to solving this kind of problem.

Overview
- Motivation
- System Overview
- Experiments and Results
- Related Work
- Conclusions
- Future Work

Initial Experiment
(chart) 3x difference on average

Initial Experiment (2)

Our System
Offline:
1. Prepare: extract features (the classification parameters), modify heuristic values, choose transformations, find hot methods.
2. Gather training data: compile and measure run time.
3. Learn: a logistic regression classifier produces the best feature settings.
Online:
4. Deploy: the TPO/XL compiler sets heuristic values from the learned model.

Data Preparation
Three key elements:
- Feature extraction: total number of instructions, loop nest level, number and percentage of loads, stores, and branches, loop characteristics, float and integer counts and percentages.
- Heuristic value modification: the existing XL compiler is missing this functionality, so an extension was made to the existing Heuristic Context approach.
- Target set of transformations: Unroll, Wandwaving, If-conversion, Unswitching, CSE, Index Splitting, ...

Gather Training Data
Try to cut transformations backwards, from last to first (e.g. Wandwaving, then Unroll, then Late Inlining). If the run time is not worse than before, the transformation can be skipped; otherwise we keep it. We do this for every hot function of every test. The main benefit is linear complexity: one recompile per candidate transformation.
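The backward pruning pass above can be sketched as follows. This is an illustrative reconstruction, not the IBM tool's actual code; the `compile_and_time` callback (compile with a given transformation list, return measured run time) is a hypothetical stand-in for the real build-and-measure harness.

```python
def prune_transformations(transformations, compile_and_time, tolerance=1.0):
    """Walk the transformation list from last to first; drop a
    transformation whenever the program is no slower without it.
    Linear cost: one recompile per candidate transformation."""
    kept = list(transformations)
    current = compile_and_time(kept)
    for t in reversed(transformations):
        trial = [x for x in kept if x != t]
        trial_time = compile_and_time(trial)
        if trial_time <= current * tolerance:
            # Skipping t did not hurt run time: it can be dropped.
            kept, current = trial, trial_time
    return kept
```

A `tolerance` slightly above 1.0 would allow "not worse" to absorb measurement noise; the slide's exact acceptance criterion is not specified, so that knob is an assumption.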

Learn with Logistic Regression
Input: function descriptions and the best heuristic values found for them.
Classifier: logistic regression (alternatives include neural networks and genetic programming).
Output: .hpredict files, consumed by the compiler together with the heuristic values.
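As a minimal sketch of the learning step, the snippet below trains a plain-Python logistic regression to map a function's feature vector (here, a made-up pair: loop nest level and load fraction) to an on/off decision for one transformation. The features, data, and function names are illustrative assumptions, not the paper's actual training setup.

```python
import math

def train_logistic(X, y, lr=0.5, epochs=500):
    """Stochastic-gradient logistic regression on feature vectors X
    with binary labels y (1 = enable the transformation)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # sigmoid
            g = p - yi                       # gradient of log-loss
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """1 if the model says the transformation should be applied."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0
```

In the real system one such model (or one multi-class model) would be trained per tuned heuristic, and its parameters serialized into the .hpredict files the slide mentions.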

Deployment
Online phase, for every function:
1. Calculate the feature vector.
2. Compute the prediction.
3. Use this prediction as the heuristic context.
The overhead is negligible.
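The online steps above amount to a cheap per-function lookup. The sketch below assumes hypothetical names (`extract_features`, a dict of per-transformation models); the real compiler does this inside TPO/XL, not in Python.

```python
def deploy(function_ir, extract_features, models):
    """Build the heuristic context for one function: compute its
    feature vector once, then ask each trained model whether its
    transformation should run."""
    x = extract_features(function_ir)  # cheap: a handful of counters
    return {opt: model(x) for opt, model in models.items()}
```

Because feature extraction is just counting instructions and loop properties, and each prediction is a dot product, the per-function overhead stays negligible, as the slide claims.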

Overview
- Motivation
- System Overview
- Experiments and Results
- Related Work
- Conclusions
- Future Work

Experiments
Benchmarks: SPEC2000, plus others from IBM customers.
Platform: IBM server, 4 x POWER5 at 1.9 GHz, 32 GB RAM, running AIX 5.3.

Results: compilation time
(chart) 2x average speedup

Results: execution time

New benchmarks: compilation time

New benchmarks: execution time
(chart) 4% speedup

Overview
- Motivation
- System Overview
- Experiments and Results
- Related Work
- Conclusions
- Future Work

Related Work
- Iterative compilation: Pan and Eigenmann; Agakov et al.
- Single heuristic tuning: Calder et al.; Stephenson et al.
- Multiple heuristic tuning: Cavazos et al.; MILEPOST GCC

Conclusions and Future Work
Conclusion: 2x average decrease in compile time.
Future work:
- Execution time improvement at the -O5 level.
- Performance counters for better method descriptions.
Other benefits: the Heuristic Context infrastructure; bug finding.

Thank you: Raul Silvera, Arie Tal, Greg Steffan, Mathew Zaleski.
Questions?