Using Cache Models and Empirical Search in Automatic Tuning of Applications
Apan Qasem, Ken Kennedy, John Mellor-Crummey
Rice University, Houston, TX


Outline
Overview of Framework
–Fine-grain control of transformations
–Loop-level feedback
–Search strategy
Experimental Results
Empirical Tuning of Loop Fusion
Future Work

A Framework for Autotuning
[Diagram] Labeled components: F77 source; analytic modeler; LoopTool (annotated source in, transformed source out); vendor compiler (producing the binary); hpcrun, bloop, and hpcview (profile information and program structure); and a parameterized search engine with random, direct search, and simulated annealing strategies. Data flowing between them: search space, architectural specs, search state, initial parameters, next-iteration parameters, and feedback. Target platforms: Itanium2, Alpha, SGI, Pentium 4.

Fine-Grain Control of Transformations
Not enough control over transformations in most commercial compilers
Example: sweep3d
–68 loops in the sweep routine
–Using the same unroll factor for all 4 loops results in a 2% speedup
–Using 4 different unroll factors for the 4 loops results in an 8% speedup

Fine-Grain Control of Transformations
LoopTool: a source-to-source transformer
–Performs transformations such as loop fusion, tiling, unroll-and-jam
–Applies transformations from source-code annotations
–Provides loop-level control of transformations

cdir$ unroll 4
      do j = 1, N
cdir$ block 16
         do i = 1, M
cdir$ block 16
            do k = 1, L
               S1(k, i, j)
            enddo
         enddo
      enddo
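For concreteness, the code below is a hand sketch (not actual LoopTool output) of what these annotations request: the j loop unroll-and-jammed by 4 and the i and k loops tiled by 16, with the tile-control loops hoisted outermost as one plausible arrangement. The statement a(k,i,j) = a(k,i,j) + b(k,i) is a made-up stand-in for the generic statement S1(k,i,j), and the sketch assumes n is a multiple of 4 and m, l are multiples of 16; a real transformer would also emit remainder (cleanup) loops.

      subroutine sweep_sketch(a, b, l, m, n)
        implicit none
        integer :: l, m, n, i, j, k, ii, kk
        real :: a(l, m, n), b(l, m)
        do ii = 1, m, 16                    ! i tile-control loop
          do kk = 1, l, 16                  ! k tile-control loop
            do j = 1, n, 4                  ! j unrolled by 4 ...
              do i = ii, ii + 15            ! i intra-tile loop
                do k = kk, kk + 15          ! ... with the copies jammed here
                  a(k, i, j)     = a(k, i, j)     + b(k, i)
                  a(k, i, j + 1) = a(k, i, j + 1) + b(k, i)
                  a(k, i, j + 2) = a(k, i, j + 2) + b(k, i)
                  a(k, i, j + 3) = a(k, i, j + 3) + b(k, i)
                enddo
              enddo
            enddo
          enddo
        enddo
      end subroutine sweep_sketch

In this sketch the payoff of the unroll-and-jam shows up in the body: b(k, i) is loaded once and reused across four j iterations, while the 16x16 tiles keep the touched portions of a and b small enough to stay resident in cache.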

Feedback
Feedback beyond whole-program execution time
–Need feedback at loop-level granularity
 Can potentially optimize independent code regions concurrently
–Need other performance metrics: cache misses, TLB misses, pipeline stalls

Feedback
HPCToolkit
–hpcrun: profiles executions using statistical sampling of hardware performance counters
–bloop: retrieves loop structure from binaries
–hpcview: correlates program-structure information with sample-based performance profiles

Search Space
Optimization search space is very large and complex
–Example: mm
 2-level blocking (range 1-100)
 1 unrolled loop (range 1-20)
 100 x 100 x 20 = 200,000 search points
–Characteristics of the search space are hard to predict
–Need a search technique that can explore the search space efficiently

Search Strategy
Pattern-based direct search method
–The search is derivative-free
 Works well for a non-continuous search space
–Provides approximate solutions at each stage of the calculation
 Can stop the search at any point when constrained by tuning time
–Provides flexibility
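As an illustration of the idea (not the framework's actual implementation), the sketch below runs a pattern-based direct search over three integer parameters, say two tile sizes and an unroll factor. The function run_time is a hypothetical stand-in for rebuilding and timing the program with the trial parameters; here it is a dummy objective so the sketch runs on its own.

      program direct_search_sketch
        implicit none
        integer, parameter :: np = 3              ! e.g. two tile sizes and an unroll factor
        integer :: x(np), trial(np), lo(np), hi(np)
        integer :: step, d, sgn
        real :: best, t
        logical :: improved

        lo = (/ 1, 1, 1 /)
        hi = (/ 100, 100, 20 /)
        x  = (/ 50, 50, 10 /)                     ! starting point
        best = run_time(x)
        step = 16                                 ! initial pattern size

        do while (step >= 1)
          improved = .false.
          do d = 1, np                            ! probe each coordinate direction
            do sgn = -1, 1, 2
              trial = x
              trial(d) = min(max(x(d) + sgn*step, lo(d)), hi(d))
              t = run_time(trial)
              if (t < best) then                  ! accept any improving move
                best = t
                x = trial
                improved = .true.
              end if
            end do
          end do
          if (.not. improved) step = step / 2     ! no improvement: contract the pattern
        end do
        print *, 'best parameters:', x, ' cost:', best

      contains
        real function run_time(p)                 ! hypothetical: apply parameters, rebuild, run, time
          integer, intent(in) :: p(np)
          run_time = real(sum((p - (/ 32, 16, 4 /))**2))   ! dummy stand-in objective
        end function run_time
      end program direct_search_sketch

Because each move needs only a handful of new evaluations, such a search visits a tiny fraction of the 200,000-point space from the previous slide, and it can be cut off after any sweep when the tuning-time budget runs out.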

Performance Improvement [chart]

Performance Improvement Comparison [chart]

Tuning Time Comparison [chart]

Empirical Tuning of Loop Fusion
Important transformation when looking at whole applications
Poses new problems
–Multi-loop reuse analysis
–Interaction with tiling
–What parameters to tune? Trying all combinations is perhaps too much
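To make the transformation concrete, here is a small, made-up two-loop example (not taken from the talk). Fusing the loops lets a(i) and b(i) be reused while they are still in cache or registers, but it also enlarges the loop body, which is exactly where the register-pressure and instruction-cache concerns on the next slide come from.

      subroutine fusion_example(a, b, c, d, n)
        implicit none
        integer :: n, i
        real :: a(n), b(n), c(n), d(n)

        ! Unfused: a and b are each traversed twice; for large n, a(i) and
        ! b(i) have left the cache before the second loop reads them again.
        do i = 1, n
          b(i) = a(i) + c(i)
        enddo
        do i = 1, n
          d(i) = a(i) * b(i)
        enddo

        ! Fused: one traversal of each array; a(i) and b(i) are reused
        ! immediately, at the cost of a larger loop body.
        do i = 1, n
          b(i) = a(i) + c(i)
          d(i) = a(i) * b(i)
        enddo
      end subroutine fusion_example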

Profitability Model for Fusion
Hierarchical reuse analysis to estimate cost at different memory levels
A probabilistic model to account for conflict misses
–Effective cache size
Resource constraints
–Register pressure
–Instruction cache capacity
Integrate the model into a pair-wise greedy fusion algorithm
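To give the "probabilistic model" bullet some shape, here is one simple illustration of how an effective cache size can be defined; the direct-mapped, uniform-mapping assumptions are mine, and this is not necessarily the exact formulation used in the talk. If the $W$ distinct lines touched between two uses of a line each map independently and uniformly to one of the $S$ sets of a direct-mapped cache, the reused line survives with probability

\[ P_{\mathrm{hit}}(W) \approx \left(1 - \frac{1}{S}\right)^{W} \]

and the effective cache size can be taken as the largest footprint that still hits with probability at least a chosen threshold $\alpha$:

\[ C_{\mathrm{eff}} = \max \left\{ W \;:\; \left(1 - \frac{1}{S}\right)^{W} \ge \alpha \right\} \]

Under this view, the fusion model asks whether the combined working set of two candidate loops fits within $C_{\mathrm{eff}}$ rather than within the nominal capacity.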

Tuning Parameters for Fusion
Use search to tune the following parameters
–Effective cache size
–Register pressure
–Instruction cache capacity
Preliminary experiments
–Works reasonably well

Speedup from Loop Fusion [chart]

Summary
Autotuning framework able to improve performance on a range of architectures
Direct search able to find good tile sizes and unroll factors by exploring only a small fraction of the search space
Combining static models with search works well for loop fusion

Future Work
Refine cache models to capture interactions between transformations
Consider data allocation strategies for tuning

Extra Slides Begin Here

Platforms

Processor               L1     L2     L3     Compiler
Alpha (667 MHz)         8K     8M     -      Compaq Fortran V
Itanium 2               16K    256K   1.5M   Intel Fortran Compiler 7.1
SGI Origin (195 MHz)    32K    1M     -      MIPSpro 7.3
Pentium 4 (2 GHz)       8K     512K   -      Intel Fortran Compiler 7.1

Performance Improvement for Different Architectures [chart]