1 Performance Tools for Empirical Autotuning
Allen D. Malony, Nick Chaimov, Kevin Huck, Scott Biersdorff, Sameer Shende
{malony,nchaimov,khuck,scott,sameer}@cs.uoregon.edu
University of Oregon
DOE SciDAC CScADS, August 13-14, 2012

2 Outline
- Motivation
- Performance engineering and autotuning
- Performance tool integration with the autotuning process
- TAU performance system overview
- Performance database (TAUdb)
- Framework for empirical-based performance tuning
- Integration with CHiLL and Active Harmony
- Integration with Orio (preliminary)
- Conclusions and future directions

3 Parallel Performance Engineering
- Scalable, optimized applications deliver the HPC promise
- Optimization through a performance engineering process
  - Understand performance complexity and inefficiencies
  - Tune the application to run optimally on high-end machines
- How to make the process more effective and productive?
  - What is the nature of the performance problem solving?
  - What is the performance technology to be applied?
- Performance tool efforts have been focused on performance observation, analysis, and problem diagnosis
- Application development and optimization productivity
  - Programmability, reusability, portability, robustness
- Performance technology is part of a larger programming system

4 Parallel Performance Engineering Process
- Traditionally an empirically-based approach: observation, experimentation, diagnosis, tuning
- Performance technology developed for each level
  [Diagram: performance observation, experimentation, diagnosis, and tuning levels linked by characterization, hypotheses, and properties; supporting technology includes instrumentation, measurement, analysis, and visualization; experiment management and performance storage; data mining, models, and expert systems]

5 Parallel Performance Diagnosis

6 "Extreme" (Progressive) Performance Engineering
- Increased performance complexity forces the engineering process to be more intelligent and automated
  - Automate performance data analysis / mining / learning
  - Automated performance problem identification
- Performance engineering tools and practice must incorporate a performance knowledge discovery process
- Model-oriented knowledge
  - Computational semantics of the application
  - Symbolic models for algorithms
  - Performance models for system architectures / components
- Application developers can be more directly involved in the performance engineering process

7 "Extreme" Performance Engineering
- Empirical performance data evaluated with respect to performance expectations at various levels of abstraction

8 Autotuning is a Performance Engineering Process
- Autotuning methodology incorporates aspects common to "traditional" application performance engineering
  - Empirical performance observation
  - Experiment-oriented
- Autotuning embodies progressive engineering techniques
  - Automated experimentation and performance testing
  - Guided optimization by (intelligent) search space exploration
  - Model-based (domain-specific) computational semantics
- Autotuning is a different style of performance engineering from the traditional performance-diagnosis approach
- There are shared objectives for performance technology and opportunities for tool integration

9 TAU Performance System® (http://tau.uoregon.edu)
- Parallel performance framework and toolkit
- Supports all HPC platforms, compilers, and runtime systems
- Provides portable instrumentation, measurement, and analysis

10 TAU Components
- Instrumentation
  - Fortran, C, C++, UPC, Chapel, Python, Java
  - Source, compiler, library wrapping, binary rewriting
  - Automatic instrumentation
- Measurement
  - MPI, OpenSHMEM, ARMCI, PGAS
  - Pthreads, OpenMP, other thread models
  - GPU, CUDA, OpenCL, OpenACC
  - Performance data (timing, counters) and metadata
  - Parallel profiling and tracing
- Analysis
  - Performance database technology (TAUdb, formerly PerfDMF)
  - Parallel profile analysis (ParaProf)
  - Performance data mining / machine learning (PerfExplorer)

11 TAU Performance Database – TAUdb
- Started in 2004 (Huck et al., ICPP 2005)
  - Performance Data Management Framework (PerfDMF)
- Database schema and Java API
  - Profile parsing
  - Database queries
  - Conversion utilities (parallel profiles from other tools)
- Provides DB support for TAU profile analysis tools
  - ParaProf, PerfExplorer, Eclipse PTP
- Used as a regression testing and performance regression database for TAU
- Ported to several DBMSs
  - PostgreSQL, MySQL, H2, Derby, Oracle, DB2

12 TAUdb Database Schema
- Parallel performance profiles: timer and counter measurements with 5 dimensions
  - Physical location: process / thread
  - Static code location: function / loop / block / line
  - Dynamic location: current callpath and context (parameters)
  - Time context: iteration / snapshot / phase
  - Metric: time, HW counters, derived values
- Measurement metadata
  - Properties of the experiment
  - Anything from name:value pairs to nested, structured data
  - Single value for the whole experiment or for a full context (tuple of thread, timer, iteration, timestamp)

13 TAUdb Programming APIs
- Java
  - Original API
  - Basis for in-house analysis tool support
  - Command-line tools for batch loading into the database
  - Parses 15+ profile formats: TAU, gprof, Cube, HPCT, mpiP, DynaProf, PerfSuite, ...
  - Supports Java embedded databases (H2, Derby)
- C programming interface under development
  - PostgreSQL support first, others as requested
  - Query prototype developed
  - Planned full-featured API: query, insert, and update
  - Evaluating SQLite support

14 TAUdb Tool Support
- ParaProf
  - Parallel profile viewer / analyzer
  - 2D and 3D+ visualizations
  - Single-experiment analysis
- PerfExplorer
  - Data mining framework
  - Clustering, correlation
  - Multi-experiment analysis
  - Scripting engine
  - Expert system

15 PerfExplorer Architecture
[Diagram: PerfExplorer architecture built on the DBMS (TAUdb)]

16 TAU Integration with CHiLL and Active Harmony
- Major goals:
  - Integrate TAU with existing autotuning frameworks
  - Use TAU to gather performance data for autotuning / specialization
  - Store performance data tagged with metadata about the execution environment and input in a centralized database
  - Use machine learning and data mining techniques to increase the level of automation of autotuning and specialization
- Using TAU in two ways:
  - Using multi-parameter-based profiling support to generate separate profiles based on function parameters (or outlined code)
  - Using TAU metrics stored in PerfDMF/TAUdb as performance measures in optimization

17 Components
- ROSE outliner
  - ROSE is a compiler with built-in support for source-to-source transformations
  - The ROSE outliner, given a reference to an AST node, extracts that node into its own function or file
- CHiLL
  - Provides a domain-specific language for specifying transformations on loops (see the recipe sketch below)
- Active Harmony
  - Searches the space of parameters to transformation recipes
- TAU
  - Performance instrumentation and measurement
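To give a flavor of CHiLL's loop-transformation language, here is a schematic recipe sketch; the file name, procedure name, command names, and exact syntax are illustrative only (they vary across CHiLL versions), and TI / UF are the kind of parameters Active Harmony would search over:

    # Schematic CHiLL-style recipe (illustrative only, not exact syntax)
    source: matmult.c          # hypothetical source file
    procedure: matmult         # hypothetical routine to transform
    loop: 0                    # which loop nest in the routine

    TI = 64                    # tile size -- search parameter for Active Harmony
    UF = 4                     # unroll factor -- search parameter for Active Harmony

    tile(0, 2, TI)             # tile level 2 of loop nest 0
    unroll(0, 3, UF)           # unroll level 3 of loop nest 0 by UF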

18 Multi-Parameter Profiling
- Added multi-parameter-based profiling in TAU to support specialization
- The user can select which parameters are of interest using a selective instrumentation file
- Consider a matrix multiply function
  - We can generate profiles based on the dimensions of the matrices encountered during execution: for void matmult(float **c, float **a, float **b, int L, int M, int N), parameterize using L, M, and N

19 Using Parameterized Profiling in TAU
- Selective instrumentation file for the matmult example (a sketch of the routine itself follows below):

    BEGIN_INCLUDE_LIST
    matmult
    END_INCLUDE_LIST

    BEGIN_INSTRUMENT_SECTION
    loops file="foo.c" routine="matrix#"
    param file="foo.c" routine="matmult" param="L" param="M" param="N"
    END_INSTRUMENT_SECTION

- Target C routine: int matmult(float **, float **, float **, int, int, int)
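For concreteness, a minimal sketch of the routine being parameterized; only the signature comes from the slides, and the body is a plain (untuned) implementation added for illustration. With the param directive above, TAU keeps a separate profile for each distinct (L, M, N) triple observed at run time:

    /* foo.c -- only the signature appears on the slide; the body is illustrative. */
    void matmult(float **c, float **a, float **b, int L, int M, int N)
    {
        /* C = A * B, where A is L x M, B is M x N, and C is L x N. */
        for (int i = 0; i < L; i++) {
            for (int j = 0; j < N; j++) {
                float sum = 0.0f;
                for (int k = 0; k < M; k++)
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
        }
    }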

20 Parameterized Profiling / Autotuning with TAUdb

21 Autotuning with TAUdb: Methodology
- Each time the program executes a code variant, we store metadata in the performance database indicating how the variant was produced:
  - Source function
  - Name of the CHiLL recipe
  - Parameters to the CHiLL recipe
- The database also contains metadata on the parameters the function was called with and on the execution environment:
  - OS name, version, release, native architecture
  - CPU vendor, ID, clock speed, cache sizes, # cores
  - Memory size
  - Any metadata specified by the end user (see the sketch below)
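User-specified metadata can be recorded from the application itself. A minimal sketch using TAU's TAU_METADATA call is below; the key names and the helper function are hypothetical, and the exact call form should be checked against the TAU API documentation:

    #include <stdio.h>
    #include <TAU.h>

    /* Attach user-defined name:value metadata to the current profile.
       The keys below are illustrative examples only. */
    void record_experiment_metadata(const char *recipe_name, int tile_size)
    {
        char buf[32];

        TAU_METADATA("CHiLL recipe", recipe_name);         /* which recipe produced this variant */

        snprintf(buf, sizeof(buf), "%d", tile_size);
        TAU_METADATA("Tile size", buf);                    /* recipe parameter, stored as a string */

        TAU_METADATA("Input deck", "small_matrices.cfg");  /* hypothetical user-level context */
    }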

22 Machine Learning
- Given all these data stored in TAUdb...
  - OS name, OS release, CPU type, CPU MHz, CPU cores
  - Parameter values (param, param)
  - CHiLL script
  - Metric
- ...we can build a decision tree that selects the best-performing code variant given information available at run time

23 Decision Tree
- PerfExplorer already has an interface to Weka
- Use Weka to generate decision trees based upon the data in the performance database

24 Wrapper Generation
- Use a ROSE-based tool to generate a wrapper function
  - Carries out the decisions in the decision tree and executes the best code variant (see the sketch below)
- The decision tree code generation tool takes a Weka-generated decision tree and a set of decision functions as input
  - If using custom metadata, the user needs to provide a custom decision function
  - Decision functions for the metadata automatically collected by TAU are provided
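A hand-written sketch of the kind of wrapper such a tool might emit is shown below; all variant names, decision functions, and thresholds are hypothetical, since a real wrapper would be generated from the Weka tree and the TAU-collected metadata:

    /* Hypothetical generated wrapper: encode the decision tree as nested
       conditionals and dispatch to the best-performing code variant. */

    /* Code variants produced by CHiLL (names are illustrative). */
    void matmult_original(float **c, float **a, float **b, int L, int M, int N);
    void matmult_small_unroll4(float **c, float **a, float **b, int L, int M, int N);
    void matmult_tiled64(float **c, float **a, float **b, int L, int M, int N);

    /* Decision function over collected hardware metadata (hypothetical). */
    long cpu_cache_size_kb(void);

    void matmult(float **c, float **a, float **b, int L, int M, int N)
    {
        if (L <= 16 && M <= 16 && N <= 16) {
            matmult_small_unroll4(c, a, b, L, M, N);   /* specialized for tiny matrices */
        } else if (cpu_cache_size_kb() >= 256) {
            matmult_tiled64(c, a, b, L, M, N);         /* tiled variant for larger caches */
        } else {
            matmult_original(c, a, b, L, M, N);        /* fall back to the original code */
        }
    }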

25 Example: Sparse MM Specialization with CHiLL
- Previous study: CHiLL and Active Harmony were used to specialize matrix multiply in Nek5000 (a fluid dynamics solver) for small matrices based on input size
- Limitations: the histogram of input sizes was generated manually, and the code that evaluates input data and selects the specialized variant was written manually
- We can automate these steps with parameterized profiling and machine learning over the collected data
- Replicated the small-matrix specialization study using TAU and TAUdb

26 Introduction to Orio
- Orio is an annotation-based empirical performance tuning framework
- Source code annotations allow Orio to generate a set of low-level performance optimizations (see the annotation sketch below)
- After each optimization (or transformation) is applied, the kernel is run
- The set of optimizations is searched for the best transformations to apply to a given kernel
- First effort to integrate Orio with TAU: collect performance data about each experiment that Orio runs
  - Move performance data from Orio into TAUdb
  - Orio could read from TAUdb in the future
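To show the flavor of Orio's annotations, here is a schematic sketch for unrolling a simple loop; the directive and parameter names are paraphrased from Orio's documented annotation style and may not match a given Orio release exactly:

    /* Schematic Orio-style annotations (illustrative, not exact syntax). */
    void axpy(int n, double *y, double a, double *x)
    {
        int i;

        /*@ begin PerfTuning (
              def performance_params { param UF[] = range(1, 9); }
              def input_params       { param N[]  = [100000]; }
            ) @*/

        /*@ begin Loop ( transform Unroll(ufactor=UF) ) @*/
        for (i = 0; i < n; i++)
            y[i] += a * x[i];
        /*@ end @*/

        /*@ end @*/
    }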

27 TAU's GPU Measurement Library
- Focused on Orio's CUDA kernel transformations
- TAU uses NVIDIA's CUPTI interface to gather information about GPU execution
  - Memory transfers
  - Kernels (runtime performance, performance attributes)
  - GPU counters
- Using the CUPTI interface does not require any recompiling or re-linking of the application

28 Orio and TAU Integration

29 PerfExplorer and Weka
- PerfExplorer is TAU's performance data mining tool
- PerfExplorer features:
  - Database management (both local and remote databases)
  - Easy generation of charts for common performance studies
  - Clustering and correlation analysis, in addition to custom scripted analysis
- The Weka machine learning component can be useful when analyzing autotuned data
  - Correlation of tuning transforms with execution time

30 Orio Tuning of Vector Multiplication
- Orio tuning of a simple 3D vector multiplication
- 2,048 experiments fed into TAUdb
- Used TAU PerfExplorer with Weka to do component analysis
  [Chart: correlation of tuning parameters (threads per block, # of blocks, preferred L1 size, unroll factor, CFLAGS, warps per SM) with kernel execution time; some parameters show small correlation with runtime, others are better correlated with runtime]

31 Number of Threads per Kernel
- GPU occupancy (# warps) increases with a larger # of threads
- Greater occupancy improves memory latency hiding, resulting in faster execution time
  [Chart: kernel execution time (us) and GPU occupancy (warps) vs. number of threads]

32 Autotuning Measurements in MPI Applications
- How is the execution context of an MPI application captured and used for autotuning experiments?
  - Is this a problem?
  - Avoid running the entire MPI application for one variant
- How about integrating autotuning into the MPI application?
  - Each process runs code variants in vivo
  - Each variant is measured separately and stored in the profile (see the sketch below)
- How are code variants generated and selected during the run?
  - Code transformations, compiler options, runtime parameters, ...
- What assumptions are there about the MPI application?
  - Functional consistency: does it continue to work properly?
  - Execution regularity: is the context stable between iterations?
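A minimal sketch of in-vivo variant measurement inside an application's iteration loop is given below, assuming TAU's TAU_START / TAU_STOP timer calls; the variant functions and the round-robin selection policy are hypothetical:

    #include <TAU.h>

    #define NUM_VARIANTS 3

    /* Hypothetical code variants of the same kernel (e.g., produced by CHiLL). */
    void kernel_variant0(double *u, int n);
    void kernel_variant1(double *u, int n);
    void kernel_variant2(double *u, int n);

    static void (*variants[NUM_VARIANTS])(double *, int) =
        { kernel_variant0, kernel_variant1, kernel_variant2 };
    static const char *variant_names[NUM_VARIANTS] =
        { "kernel_variant0", "kernel_variant1", "kernel_variant2" };

    /* Called from each MPI process's main iteration loop: rotate through the
       variants so each is timed separately, in vivo, by its own TAU timer.
       Assumes the execution context is stable enough across iterations for a
       fair comparison. */
    void run_iteration(double *u, int n, int iteration)
    {
        int v = iteration % NUM_VARIANTS;      /* simple round-robin selection */

        TAU_START(variant_names[v]);           /* one TAU timer per variant */
        variants[v](u, n);
        TAU_STOP(variant_names[v]);
    }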

33 Autotuning Measurements in MPI Applications

34 Conclusions
- Autotuning IS a performance engineering process
  - It is complementary with performance engineering for empirical-based performance diagnosis and optimization
- There are opportunities to integrate application parallel performance tools with autotuning methodologies
  - Performance experiment instrumentation and measurement
  - Performance data/metadata, analysis, data mining
- Knowledge engineering is key (at all levels)
  - Performance + parallel computation + system
  - Represented in a form the tools can reason about
- Bridge between application performance characterization methodology and autotuning methodology
  - Integration will help to explain performance

35 Future Work
- DOE SUPER SciDAC project
- Integration of TAU with autotuning frameworks
  - CHiLL, Active Harmony, Orio
  - Apply tools for end-to-end application performance
- Build performance databases
  - Enable exploration and understanding of search spaces
  - Enable association of multi-dimensional data/metadata
  - Relate performance across compilers, platforms, ...
  - Feed back semantic information to explain performance
- Explore data mining and machine learning techniques
  - Discovery of relationships, factors, latent variables, ...
  - Create performance models and feed them back to autotuning
  - Learn optimal algorithm parameters for specific scenarios
- Bridge between model-based and experiment-based approaches
- Create a knowledge-based integrated performance system

36 Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science
  - ASC/NNSA
- Department of Defense (DoD)
  - HPC Modernization Office (HPCMO)
- NSF Software Development for Cyberinfrastructure (SDCI)
- Research Centre Juelich
- Argonne National Laboratory
- Technical University Dresden
- ParaTools, Inc.
- NVIDIA

