Performance Tools for Empirical Autotuning
Allen D. Malony, Nick Chaimov, Kevin Huck, Scott Biersdorff, Sameer Shende
University of Oregon
DOE SciDAC CScADS, August 13-14, 2012


Outline
 Motivation
 Performance engineering and autotuning
 Performance tool integration with autotuning process
 TAU performance system overview
 Performance database (TAUdb)
 Framework for empirical-based performance tuning
 Integration with CHiLL and Active Harmony
 Integration with Orio (preliminary)
 Conclusions and future directions

Parallel Performance Engineering
 Scalable, optimized applications deliver HPC promise
 Optimization through performance engineering process
 Understand performance complexity and inefficiencies
 Tune application to run optimally on high-end machines
 How to make the process more effective and productive?
 What is the nature of the performance problem solving?
 What is the performance technology to be applied?
 Performance tool efforts have been focused on performance observation, analysis, problem diagnosis
 Application development and optimization productivity
 Programmability, reusability, portability, robustness
 Performance technology part of larger programming system

Parallel Performance Engineering Process
 Traditionally an empirically-based approach: observation, experimentation, diagnosis, tuning
 Performance technology developed for each level:
   Performance Observation: instrumentation, measurement, analysis, visualization
   Performance Experimentation: experiment management, performance storage
   Performance Diagnosis: data mining, models, expert systems
 [Diagram: the four levels connected in a cycle, with characterization, hypotheses, and properties flowing between observation, experimentation, diagnosis, and tuning]

Parallel Performance Diagnosis
 [Diagram only]

"Extreme" (Progressive) Performance Engineering
 Increased performance complexity forces the engineering process to be more intelligent and automated
   Automate performance data analysis / mining / learning
   Automated performance problem identification
 Performance engineering tools and practice must incorporate a performance knowledge discovery process
 Model-oriented knowledge
   Computational semantics of the application
   Symbolic models for algorithms
   Performance models for system architectures / components
 Application developers can be more directly involved in the performance engineering process

"Extreme" Performance Engineering
 Empirical performance data evaluated with respect to performance expectations at various levels of abstraction

Autotuning is a Performance Engineering Process
 Autotuning methodology incorporates aspects common to "traditional" application performance engineering
   Empirical performance observation
   Experiment-oriented
 Autotuning embodies progressive engineering techniques
   Automated experimentation and performance testing
   Guided optimization by (intelligent) search space exploration
   Model-based (domain-specific) computational semantics
 Autotuning is a different style of performance engineering from the performance-diagnosis approach
 There are shared objectives for performance technology and opportunities for tool integration

TAU Performance System®
 Parallel performance framework and toolkit
 Supports all HPC platforms, compilers, and runtime systems
 Provides portable instrumentation, measurement, and analysis

TAU Components
 Instrumentation (a minimal source-instrumentation sketch follows this list)
   Fortran, C, C++, UPC, Chapel, Python, Java
   Source, compiler, library wrapping, binary rewriting
   Automatic instrumentation
 Measurement
   MPI, OpenSHMEM, ARMCI, PGAS
   Pthreads, OpenMP, other thread models
   GPU, CUDA, OpenCL, OpenACC
   Performance data (timing, counters) and metadata
   Parallel profiling and tracing
 Analysis
   Performance database technology (TAUdb, formerly PerfDMF)
   Parallel profile analysis (ParaProf)
   Performance data mining / machine learning (PerfExplorer)
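To ground the instrumentation component, a minimal sketch of TAU's manual source instrumentation in C. TAU_PROFILE_TIMER, TAU_PROFILE_START, TAU_PROFILE_STOP, TAU_PROFILE_INIT, and TAU_PROFILE_SET_NODE are TAU's documented C macros; the routine itself is invented for illustration.

    #include <TAU.h>

    /* Hypothetical routine timed with TAU's C macro API. */
    void compute(int n) {
        TAU_PROFILE_TIMER(t, "compute", "void (int)", TAU_USER);
        TAU_PROFILE_START(t);
        volatile double x = 0.0;
        for (int i = 0; i < n; i++)   /* stand-in for real work */
            x += (double)i;
        TAU_PROFILE_STOP(t);
    }

    int main(int argc, char **argv) {
        TAU_PROFILE_INIT(argc, argv);
        TAU_PROFILE_SET_NODE(0);      /* single-process example */
        compute(1000000);
        return 0;
    }

In practice the same timers are usually inserted by TAU's automatic instrumentation (compiler wrappers or source rewriting) rather than by hand.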

TAU Performance Database – TAUdb
 Started in 2004 (Huck et al., ICPP 2005)
   Performance Data Management Framework (PerfDMF)
 Database schema and Java API
   Profile parsing
   Database queries
   Conversion utilities (parallel profiles from other tools)
 Provides DB support for TAU profile analysis tools
   ParaProf, PerfExplorer, Eclipse PTP
 Used as regression testing database for TAU
 Used as performance regression database
 Ported to several DBMSs
   PostgreSQL, MySQL, H2, Derby, Oracle, DB2

TAUdb Database Schema
 Parallel performance profiles
   Timer and counter measurements with 5 dimensions
   Physical location: process / thread
   Static code location: function / loop / block / line
   Dynamic location: current callpath and context (parameters)
   Time context: iteration / snapshot / phase
   Metric: time, HW counters, derived values
 Measurement metadata
   Properties of the experiment
   Anything from name:value pairs to nested, structured data
   Single value for whole experiment or full context (tuple of thread, timer, iteration, timestamp)
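As a purely illustrative picture of those five dimensions, a C struct for the logical key of one measurement; the field names are invented and do not reflect TAUdb's actual table layout.

    /* Illustrative only: the logical key of one TAUdb-style measurement.
       Field names are invented, not TAUdb's actual schema. */
    typedef struct {
        int         process, thread;   /* physical location */
        const char *code_location;     /* function / loop / block / line */
        const char *callpath;          /* dynamic context, incl. parameters */
        int         iteration;         /* time context: snapshot / phase */
        const char *metric;            /* time, HW counter, derived value */
        double      value;             /* the measurement itself */
    } profile_measurement;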

TAUdb Programming APIs
 Java
   Original API
   Basis for in-house analysis tool support
   Command-line tools for batch loading into the database
   Parses 15+ profile formats: TAU, gprof, Cube, HPCT, mpiP, DynaProf, PerfSuite, ...
   Supports Java embedded databases (H2, Derby)
 C programming interface under development
   PostgreSQL support first, others as requested
   Query prototype developed
   Plan full-featured API: query, insert, and update
   Evaluating SQLite support

TAUdb Tool Support
 ParaProf
   Parallel profile viewer / analyzer
   2D and 3D+ visualizations
   Single-experiment analysis
 PerfExplorer
   Data mining framework
   Clustering, correlation
   Multi-experiment analysis
   Scripting engine
   Expert system

PerfExplorer Architecture
 [Architecture diagram: PerfExplorer components layered over the DBMS (TAUdb)]

TAU Integration with CHiLL and Active Harmony
 Major goals:
   Integrate TAU with existing autotuning frameworks
   Use TAU to gather performance data for autotuning/specialization
   Store performance data, tagged with metadata about the execution environment and input, in a centralized database
   Use machine learning and data mining techniques to increase the level of automation of autotuning and specialization
 Using TAU in two ways:
   Using multi-parameter-based profiling support to generate separate profiles based on function parameters (or outlined code)
   Using TAU metrics stored in PerfDMF/TAUdb as performance measures in optimization

Components
 ROSE outliner
   ROSE is a compiler with built-in support for source-to-source transformations
   The ROSE outliner, given a reference to an AST node, extracts the AST node into its own function or file (see the sketch below)
 CHiLL
   Provides a domain-specific language for specifying transformations on loops
 Active Harmony
   Searches the space of parameters to transformation recipes
 TAU
   Performance instrumentation and measurement
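A schematic before/after view of what outlining does to a loop nest. The OUT__1__... name mimics the style of ROSE's generated identifiers; the outliner's actual output differs.

    /* Before outlining: the loop lives inside the original routine. */
    void smooth(double *u, int n) {
        for (int i = 1; i < n - 1; i++)
            u[i] = 0.5 * (u[i-1] + u[i+1]);
    }

    /* After outlining (schematic): the loop becomes its own function,
       which CHiLL can transform and TAU can time in isolation.
       The name mimics ROSE's style; actual generated code differs. */
    static void OUT__1__smooth(double *u, int n) {
        for (int i = 1; i < n - 1; i++)
            u[i] = 0.5 * (u[i-1] + u[i+1]);
    }

    void smooth_outlined(double *u, int n) {
        OUT__1__smooth(u, n);
    }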

Multi-Parameter Profiling
 Added multi-parameter-based profiling in TAU to support specialization
 User can select which parameters are of interest using a selective instrumentation file
 Consider a matrix multiply function
   We can generate profiles based on the dimensions of the matrices encountered during execution: for void matmult(float **c, float **a, float **b, int L, int M, int N), parameterize using L, M, and N

Using Parameterized Profiling in TAU
The selective instrumentation file restricts instrumentation to matmult and names the parameters of interest; a plain C sketch of the routine follows. (END_INCLUDE_LIST is supplied here from TAU's documented file format; the slide transcript omits it.)

    BEGIN_INCLUDE_LIST
    matmult
    END_INCLUDE_LIST

    BEGIN_INSTRUMENT_SECTION
    loops file="foo.c" routine="matrix#"
    param file="foo.c" routine="matmult" param="L" param="M" param="N"
    END_INSTRUMENT_SECTION

The routine's prototype: int matmult(float **, float **, float **, int, int, int), in C.
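For reference, a minimal C version of the parameterized routine (the real kernels in Nek5000 differ). With the selective instrumentation above, TAU keeps a separate profile for each (L, M, N) combination seen at runtime.

    /* Minimal sketch of the routine being parameterized: c (LxN) =
       a (LxM) * b (MxN). Each distinct (L, M, N) gets its own profile. */
    void matmult(float **c, float **a, float **b, int L, int M, int N) {
        for (int i = 0; i < L; i++)
            for (int j = 0; j < N; j++) {
                float sum = 0.0f;
                for (int k = 0; k < M; k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
    }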

Parameterized Profiling / Autotuning with TAUdb
 [Workflow diagram]

Autotuning with TAUdb Methodology
 Each time the program executes a code variant, we store metadata in the performance database indicating how the variant was produced:
   Source function
   Name of CHiLL recipe
   Parameters to CHiLL recipe
 The database also contains metadata on the parameters the variant was called with, and on the execution environment:
   OS name, version, release, native architecture
   CPU vendor, ID, clock speed, cache sizes, # cores
   Memory size
   Any metadata specified by the end user (see the sketch below)
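User-specified metadata can be attached from the application itself. TAU_METADATA is TAU's documented C macro for recording name:value pairs; the keys and values here are invented for illustration.

    #include <TAU.h>

    /* Attach user-defined name:value metadata to this run's profile.
       TAU records OS, CPU, and memory metadata automatically; the
       keys and values below are invented for illustration. */
    void record_variant_metadata(void) {
        TAU_METADATA("code_variant", "matmult_tiled_16x16");
        TAU_METADATA("chill_recipe", "mm_tile_unroll.script");
    }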

Machine Learning
 Given all these data stored in TAUdb...
   OS name, OS release, CPU type, CPU MHz, CPU cores
   param, param
   Chillscript
   Metric
 ...we can build a decision tree which selects the best-performing code variant given information available at run time

Decision Tree
 PerfExplorer already has an interface to Weka
 Use Weka to generate decision trees based upon the data in the performance database

Wrapper Generation
 Use a ROSE-based tool to generate a wrapper function
   Carries out the decisions in the decision tree and executes the best code variant (a hypothetical sketch follows)
 The decision tree code generation tool takes a Weka-generated decision tree and a set of decision functions as input
   If using custom metadata, the user needs to provide a custom decision function
   Decision functions for metadata automatically collected by TAU are provided
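A hypothetical sketch of what such a generated wrapper might look like. The variant names, split points, and structure are all invented; the real wrapper is emitted by the ROSE-based tool from the Weka tree, and the variant bodies are generated separately (e.g., by CHiLL).

    /* Invented sketch of a generated wrapper: dispatch to the code
       variant the decision tree predicts to be fastest. Variant
       bodies are produced elsewhere (e.g., CHiLL-transformed code). */
    void matmult_general(float **c, float **a, float **b, int L, int M, int N);
    void matmult_small(float **c, float **a, float **b, int L, int M, int N);

    void matmult(float **c, float **a, float **b, int L, int M, int N) {
        /* Encoding of a learned tree; split points invented. */
        if (L <= 16 && M <= 16 && N <= 16)
            matmult_small(c, a, b, L, M, N);   /* specialized for small inputs */
        else
            matmult_general(c, a, b, L, M, N);
    }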

Example: Sparse MM Specialization with CHiLL
 Previous study: CHiLL and Active Harmony were used to specialize matrix multiply in Nek5000 (a fluid dynamics solver) for small matrices based on input size
 Limitations: histogram of input sizes generated manually; code to evaluate input data and select specialized variant generated manually
 We can automate these processes with parameterized profiling and machine learning over the collected data
 Replicated small-matrix specialization study using TAU and TAUdb

Introduction to Orio
 Orio is an annotation-based empirical performance tuning framework (a schematic annotation follows below)
 Source code annotations allow Orio to generate a set of low-level performance optimizations
 After each optimization (or transformation) is applied, the kernel is run
 The set of optimizations is searched for the best transformations to apply to a given kernel
 First effort to integrate Orio with TAU was to collect performance data about each experiment that Orio runs
   Move performance data from Orio into TAUdb
   Orio could read from TAUdb in the future
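For flavor, a schematic Orio-style annotation on a small C kernel. The Loop/Unroll module and clause names follow Orio publications, but treat the exact syntax as illustrative rather than authoritative.

    /* Schematic Orio-style annotation (module and clause names follow
       Orio publications; exact syntax varies by version). Orio rewrites
       the annotated loop, e.g. unrolling it by the chosen factor. */
    void axpy(int n, double *y, double a, const double *x) {
        int i;
        /*@ begin Loop(transform Unroll(ufactor=4)) @*/
        for (i = 0; i < n; i++)
            y[i] += a * x[i];
        /*@ end @*/
    }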

TAU's GPU Measurement Library
 Focused on Orio's CUDA kernel transformations
 TAU uses NVIDIA's CUPTI interface to gather information about the GPU execution
   Memory transfers
   Kernels (runtime performance, performance attributes)
   GPU counters
 Using the CUPTI interface does not require any recompiling or re-linking of the application (a minimal illustration of the mechanism follows)
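Not TAU's internal code, but a minimal sketch of the CUPTI callback mechanism it builds on. cuptiSubscribe and cuptiEnableDomain are CUPTI's documented entry points; the hooks attach to CUDA runtime API calls without modifying the application.

    #include <cupti.h>
    #include <stdio.h>

    /* Minimal sketch of the CUPTI callback mechanism TAU builds on
       (not TAU's internal code). The callback fires on entry/exit of
       CUDA runtime API calls in the unmodified application. */
    static void CUPTIAPI on_runtime_api(void *userdata,
                                        CUpti_CallbackDomain domain,
                                        CUpti_CallbackId cbid,
                                        const void *cbdata) {
        const CUpti_CallbackData *cb = (const CUpti_CallbackData *)cbdata;
        if (cb->callbackSite == CUPTI_API_ENTER)
            printf("entering %s\n", cb->functionName);
    }

    void install_gpu_hooks(void) {
        CUpti_SubscriberHandle sub;
        cuptiSubscribe(&sub, (CUpti_CallbackFunc)on_runtime_api, NULL);
        cuptiEnableDomain(1, sub, CUPTI_CB_DOMAIN_RUNTIME_API);
    }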

Orio and TAU Integration
 [Integration workflow diagram]

PerfExplorer and Weka
 PerfExplorer is TAU's performance data mining tool
 PerfExplorer features:
   Database management (both local and remote databases)
   Easy generation of charts for common performance studies
   Clustering and correlation analysis, in addition to custom scripted analysis
 The Weka machine learning component can be useful when analyzing autotuned data
   Correlation of tuning transforms with execution time

Orio Tuning of Vector Multiplication
 Orio tuning of a simple 3D vector multiplication
 2,048 experiments fed into TAUdb
 Use TAU PerfExplorer with Weka to do component analysis
 [Correlation chart: threads per block, # of blocks, preferred L1 size, unroll factor, and CFLAG show small correlation with kernel execution time; warps per SM is better correlated with runtime]

Number of Threads per Kernel
 GPU occupancy (# warps) increases with larger # threads
 Greater occupancy improves memory latency hiding, resulting in faster execution time
 [Plot: kernel execution time (us) and GPU occupancy (warps) versus number of threads]

Autotuning Measurements in MPI Applications
 How is the execution context of an MPI application captured and used for autotuning experiments?
   Is this a problem?
   Avoid running the entire MPI application for one variant
 What about integrating autotuning into the MPI application itself?
   Each process runs code variants in vivo (see the sketch below)
   Each variant is measured separately and stored in the profile
 How are code variants generated and selected during the run?
   Code transformations, compiler options, runtime parameters, ...
 What assumptions are there about the MPI application?
   Functional consistency: does it continue to work properly?
   Execution regularity: is the context stable between iterations?
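A minimal sketch of the in vivo idea under those assumptions. The variant bodies and round-robin schedule are invented, and plain MPI_Wtime stands in for TAU's measurement layer.

    #include <mpi.h>
    #include <stdio.h>

    /* In vivo variant measurement: each process times a different code
       variant on each iteration of its main loop, so no full run is
       spent on a single variant. Assumes iteration-to-iteration
       regularity; variant bodies are placeholders. */
    static void variant_a(void) { /* baseline kernel */ }
    static void variant_b(void) { /* transformed kernel (e.g., by CHiLL) */ }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        void (*variants[2])(void) = { variant_a, variant_b };
        double elapsed[2] = { 0.0, 0.0 };

        for (int iter = 0; iter < 100; iter++) {
            int v = iter % 2;               /* round-robin selection */
            double t0 = MPI_Wtime();
            variants[v]();
            elapsed[v] += MPI_Wtime() - t0; /* kept separate per variant */
        }
        printf("variant a: %g s, variant b: %g s\n", elapsed[0], elapsed[1]);
        MPI_Finalize();
        return 0;
    }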

Autotuning Measurements in MPI Applications
 [Diagram: per-process variant measurement within the running application]

Conclusions
 Autotuning IS a performance engineering process
   It is complementary to performance engineering for empirical-based performance diagnosis and optimization
 There are opportunities to integrate application parallel performance tools with autotuning methodologies
   Performance experiment instrumentation and measurement
   Performance data/metadata, analysis, data mining
 Knowledge engineering is key (at all levels)
   Performance + parallel computation + system
   Represented in a form the tools can reason about
 Bridge application performance characterization methodology with autotuning methodology
   Integration will help to explain performance

Future Work
 DOE SUPER SciDAC project
   Integration of TAU with autotuning frameworks
   CHiLL, Active Harmony, Orio
   Apply tools for end-to-end application performance
 Build performance databases
   Enable exploration and understanding of search spaces
   Enable association of multi-dimensional data/metadata
   Relate performance across compilers, platforms, ...
   Feedback of semantic information to explain performance
 Explore data mining and machine learning techniques
   Discovery of relationships, factors, latent variables, ...
   Create performance models and feedback to autotuning
   Learn optimal algorithm parameters for specific scenarios
 Bridge between model-based and experimental-based
   Create knowledge-based integrated performance system

Support Acknowledgements
 Department of Energy (DOE)
   Office of Science
   ASC/NNSA
 Department of Defense (DoD)
   HPC Modernization Office (HPCMO)
 NSF Software Development for Cyberinfrastructure (SDCI)
 Research Centre Juelich
 Argonne National Laboratory
 Technical University Dresden
 ParaTools, Inc.
 NVIDIA