Investigating Adaptive Compilation using the MIPSpro Compiler
Keith D. Cooper    Todd Waterman
Department of Computer Science, Rice University, Houston, TX, USA

2 Motivation
Despite astonishing increases in processor performance, certain applications still require a heroic compiler effort
  - Scientific applications: weather, earthquake, and nuclear physics simulations
High-quality compilation is difficult
  - The solutions to many problems are NP-complete
  - Many decisions that impact performance must be made
  - The correct choice can depend on the target machine, the source program, and the input data
  - Exhaustively determining the correct choices is impractical
Typical compilers use a single preset sequence of decisions
How do we determine the correct sequence for each context?

3 Adaptive Compilation
An adaptive compiler experimentally explores the decision space
  - It uses a process of feedback-driven iterative refinement
  - The program is compiled repeatedly, each time with a different sequence of optimization decisions
  - Performance is evaluated using either execution or estimation
  - Performance results are used to determine future sequences
The sequence of compiler decisions is customized so that it always provides a high level of performance
  - The compiler easily accounts for different input programs, target machines, and input data
Can current compilers be used for adaptive compilation?
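The slide above describes a feedback loop; the C sketch below makes that loop concrete. It is an illustration only, not the authors' system: compile_with(), run_and_time(), and next_sequence() are hypothetical helpers standing in for "apply one decision sequence", "measure the resulting executable", and "use the feedback to choose the next sequence".

```c
/*
 * Minimal sketch of feedback-driven iterative refinement (illustration only).
 * All three extern helpers are hypothetical placeholders.
 */
#include <float.h>

typedef struct Sequence Sequence;            /* one sequence of compiler decisions */

extern int       compile_with(const Sequence *seq);   /* build one variant          */
extern double    run_and_time(void);                  /* execute and measure it     */
extern Sequence *next_sequence(const Sequence *best, double best_time);

const Sequence *adapt(const Sequence *seed, int trials)
{
    const Sequence *best = seed, *cur = seed;
    double best_time = DBL_MAX;

    for (int i = 0; i < trials; i++) {
        if (compile_with(cur) == 0) {            /* compile with this decision sequence */
            double t = run_and_time();           /* evaluate by actually running it     */
            if (t < best_time) { best_time = t; best = cur; }
        }
        cur = next_sequence(best, best_time);    /* feedback steers the next sequence   */
    }
    return best;                                 /* customized sequence for this context */
}
```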

4 Experimental Setup
We searched for a compiler with certain properties
  - Produces high-quality executables
  - Performs high-level optimizations
  - Provides command-line flags that control optimization
Selected the MIPSpro compiler
  - Initial experiments showed that changing blocking sizes could improve running times
Loop Blocking
  - A memory-hierarchy transformation that reorders array accesses to improve spatial and temporal locality
  - Has a major impact on array-based codes
  - These include DGEMM, a general matrix multiply routine
  - Allows comparison with ATLAS
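To show what the transformation looks like, here is an illustrative blocked (tiled) matrix multiply in C. It is a sketch of the general technique rather than MIPSpro's output; N, BS, and the routine name are placeholders, and the result array is assumed to start at zero.

```c
/*
 * Illustrative loop blocking (tiling) of a simple matrix multiply.  The loop
 * nest is rewritten so that BS x BS tiles of A, B, and C are reused while
 * they are still resident in cache.  Placeholder sizes; C assumed zeroed.
 */
#define N  1000
#define BS 64                                   /* the block size being tuned */

static inline int min(int a, int b) { return a < b ? a : b; }

void dgemm_blocked(double C[N][N], const double A[N][N], const double B[N][N])
{
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                /* multiply one BS x BS block: high temporal and spatial reuse */
                for (int i = ii; i < min(ii + BS, N); i++)
                    for (int k = kk; k < min(kk + BS, N); k++) {
                        double a = A[i][k];
                        for (int j = jj; j < min(jj + BS, N); j++)
                            C[i][j] += a * B[k][j];
                    }
}
```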

5 ATLAS
Automatically Tuned Linear Algebra Software
Its goal is to achieve hand-coded performance for linear algebra kernels without a programmer modifying the code for each processor
  - The kernel is modified and parameterized once by a programmer
  - When ATLAS is installed on a machine, experiments are run to determine the proper parameters for the kernel
ATLAS saves human time at the expense of additional machine time
Adaptive compilation aims to take this tradeoff one step further

6 Adjusting Blocking Size
We compare three versions of DGEMM
  - Compiled with MIPSpro, with the block size specified explicitly and varied
  - Built by ATLAS
  - Compiled with MIPSpro using its built-in blocking heuristic
Test machine: SGI MIPS R10000
  - 195 MHz processor
  - 256 MB memory
  - 32 KB L1 data cache
  - 1 MB unified L2 cache

7 DGEMM running time for 500 x 500 arrays

8 DGEMM running time for 1000 x 1000 arrays

9 DGEMM running time for 1500 x 1500 arrays

10 DGEMM running times for square matrices

11 Relative DGEMM running times

12 L1 Cache Misses for DGEMM

13 L2 Cache Misses for DGEMM

14 Adjusting Blocking Size
The performance of MIPSpro with its built-in blocking heuristic drops off substantially when the array size reaches 900 x 900
  - Far more L1 cache misses
  - Fewer L2 cache misses
  - The heuristic uses a rectangular blocking size that grows as the total array size increases
MIPSpro with adaptively chosen blocking sizes delivers performance close to ATLAS's
  - It remains close as the array size increases
  - It incurs fewer L1 and L2 cache misses than ATLAS
Similar results were observed for non-square matrices as well

15 Determining Blocking Size
Exhaustively searching for blocking sizes is expensive
An intelligent exploration can find very good blocking sizes while examining only a few candidates
Our approach:
  - Determine the result for block size 50
  - Sample higher and lower block sizes in increments of ten until the results are more than 10% away from the best found
  - Examine all of the block sizes within five of the best found in the previous step
This approach always found the best block size in our experiments
Quicker approaches are possible at the expense of finding less ideal block sizes
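A minimal C sketch of this search strategy follows. time_with_block_size() is a hypothetical helper that recompiles DGEMM with the given block size and returns its running time; the starting point, step sizes, 10% cutoff, and within-five refinement mirror the slide, while the upper bound of 300 is an added safety cap the slide does not specify.

```c
/* Sketch of the two-phase block-size search described above (illustration only). */
extern double time_with_block_size(int bs);    /* recompile, run, and time DGEMM */

int search_block_size(void)
{
    int    best_bs = 50;
    double best_t  = time_with_block_size(50);

    /* Coarse phase: step by 10 upward, then downward, stopping in each
     * direction once a sample is more than 10% slower than the best so far. */
    for (int dir = +1; dir >= -1; dir -= 2) {
        for (int bs = 50 + 10 * dir; bs >= 10 && bs <= 300; bs += 10 * dir) {
            double t = time_with_block_size(bs);
            if (t < best_t) { best_t = t; best_bs = bs; }
            if (t > 1.10 * best_t) break;
        }
    }

    /* Fine phase: examine every block size within five of the coarse best. */
    for (int bs = best_bs - 5; bs <= best_bs + 5; bs++) {
        if (bs <= 0 || bs == best_bs) continue;
        double t = time_with_block_size(bs);
        if (t < best_t) { best_t = t; best_bs = bs; }
    }
    return best_bs;
}
```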

16 Search time required

17 Making Adaptive Compilation General
Making adaptive compilation general will require changing how compilers work
Adaptive compilation is limited by the decisions the compiler exposes
  - If the MIPSpro compiler only allowed blocking to be turned on and off, our experiments would not have been possible
The interface between the adaptive system and the compiler needs to allow complex communication
  - Which transformations are applied
  - Granularity
  - Optimization scope
  - Detailed parameter settings
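As one purely hypothetical illustration of the kind of interface the slide calls for, the C sketch below defines a decision record an adaptive system might hand to a compiler. It is not an existing MIPSpro interface; every name in it is invented for the example.

```c
/* Hypothetical decision record: which transformation runs, at what scope, and
 * with what parameters.  Illustration of the slide's point only. */
typedef enum { SCOPE_LOOP, SCOPE_PROCEDURE, SCOPE_PROGRAM } Scope;

typedef struct {
    const char *transformation;   /* e.g. "blocking", "unrolling", "inlining"   */
    int         enabled;          /* is the transformation applied at all?      */
    Scope       scope;            /* granularity at which the decision applies  */
    const char *target;           /* loop or procedure name; NULL = everywhere  */
    int         params[4];        /* detailed settings, e.g. block dimensions   */
} Decision;
```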

18 Conclusions
Adaptively selecting the appropriate blocking size for DGEMM provides performance close to ATLAS
  - The standard compiler's performance drops off for larger array sizes
  - Only a small portion of the possible block sizes needs to be examined
Making adaptive compilation a successful technique for a wide variety of applications will require changes to the design of compilers

19 Extra slides begin here.

20 DGEMM running times for varying M

21 DGEMM running times for varying N

22 DGEMM running times for varying K