Optimization of Instrumentation in Parallel Performance Evaluation Tools
Sameer Shende, Allen D. Malony, Alan Morris
University of Oregon

Slide 1: Optimization of Instrumentation in Parallel Performance Evaluation Tools
Sameer Shende, Allen D. Malony, Alan Morris
University of Oregon
PARA’06: MS8: Tools for Parallel Performance Analysis, 2:40pm – 3pm, Mon 6/19/06

Slide 2: Outline
- Overview of features
- Instrumentation
- Measurement (Profiling, Tracing)
- Analysis tools
- Tools and techniques for optimizing instrumentation
- Conclusions

Slide 3: TAU Performance System
- Tuning and Analysis Utilities (14+ year project effort)
- Performance system framework for HPC systems
  - Integrated, scalable, portable, flexible, and parallel
- Integrated toolkit for performance problem solving
  - Automatic instrumentation
  - Highly configurable measurement system with support for many flavors of profiling and tracing
  - Portable analysis and visualization tools
  - Performance data management and data mining

Slide 4: TAU Performance System Architecture
[Architecture diagram; visible label: event selection]

Slide 5: TAU Performance System Architecture
[Architecture diagram]

Slide 6: Program Database Toolkit (PDT)
[Diagram: application or library source is read by the C/C++ parser and the Fortran (F77/90/95) parser, producing IL that the C/C++ and Fortran IL analyzers turn into program database files. DUCTAPE gives tools access to those files: PDBhtml (program documentation), SILOON (application component glue), CHASM (C++ / F90/95 interoperability), and TAU_instr (automatic source instrumentation).]

Slide 7: ParaProf – Manager Window
[Screenshot; callouts: performance database, derived performance metrics]

Slide 8: ParaProf – Full Profile (Miranda)
[Screenshot; callout: 8K processors!]

Slide 9: ParaProf – Statistics Table (Uintah)
[Screenshot]

Slide 10: ParaProf – 3D Full Profile (Miranda)
[Screenshot; callout: 16k processors]

Slide 11: ParaProf – 3D Scatterplot (Miranda)
- Each point is a “thread” of execution
- Relation between four routines shown at once

Slide 12: TAU Instrumentation Approach
- Support for standard program events
  - Routines
  - Classes and templates
  - Statement-level blocks
- Support for user-defined events
  - Begin/End events (“user-defined timers”)
  - Atomic events (e.g., size of memory allocated/freed)
- Support definition of “semantic” entities for mapping
- Support for event groups
- Instrumentation optimization (eliminate instrumentation in lightweight routines)

Slide 13: Sampling vs Measured Profiling
- Sampling
  - At each sample, the PC or call stack is examined
  - Program performance is estimated from the samples taken in each code region
  - Fixed overhead, which depends on the inter-sample interval
  - Typically used in gprof, prof, and other system profilers
- Measured Profiling
  - Instrumentation calls are inserted at code regions
    - Entry/exit from routines, outer loops, “events”
  - Accurate measurements; compensation for timer overheads is possible
  - Accuracy decreases as instrumentation becomes finer grained
    - Coarse-grained instrumentation is more accurate
  - Overhead of instrumentation depends on event frequency
  - Optimize instrumentation to capture the necessary detail and eliminate instrumentation in frequently executing lightweight routines
  - Used in TAU
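The dependence of measured-profiling overhead on event frequency can be illustrated with a small back-of-the-envelope model. This is a sketch, not TAU code: the fixed per-call cost, the routine names, and the call counts are hypothetical values chosen only for illustration.

```python
# Hypothetical model: every instrumented call pays a fixed timer cost at entry
# and exit, so the total overhead grows linearly with the number of calls.

def instrumentation_overhead(routines, per_call_cost_usec):
    """Return {name: (overhead_usec, overhead_fraction)} for each routine.

    `routines` maps a routine name to (numcalls, exclusive_usec), where
    exclusive_usec is the routine's own time without instrumentation.
    """
    result = {}
    for name, (numcalls, exclusive_usec) in routines.items():
        overhead = numcalls * per_call_cost_usec
        fraction = overhead / (exclusive_usec + overhead)
        result[name] = (overhead, fraction)
    return result


# Assumed numbers, for illustration only.
profile = {
    "solve": (10, 5_000_000.0),             # coarse grained: few calls, long running
    "interchange": (2_000_000, 400_000.0),  # lightweight: many calls, tiny body
}
report = instrumentation_overhead(profile, per_call_cost_usec=1.0)
for name, (overhead, fraction) in report.items():
    print(f"{name}: {overhead:.0f} usec overhead ({fraction:.1%} of measured time)")
```

In this model the lightweight routine's measured time is dominated by instrumentation, while the coarse-grained routine is barely perturbed, which is why frequently executing lightweight routines are the first candidates for exclusion.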

Slide 14: TAU Instrumentation
- Flexible instrumentation mechanisms at multiple levels
  - Source code
    - manual (TAU API, TAU Component API)
    - automatic
      - C, C++, F77/90/95 (Program Database Toolkit (PDT))
      - OpenMP (directive rewriting (Opari), POMP spec)
  - Object code
    - pre-instrumented libraries (e.g., MPI using PMPI)
    - statically-linked and dynamically-linked
  - Executable code
    - dynamic instrumentation (pre-execution) (DynInstAPI)
    - virtual machine instrumentation (e.g., Java using JVMPI)
  - Runtime Linking (LD_PRELOAD)

Slide 15: PAPI [UTK]
- Performance Application Programming Interface
  - The purpose of the PAPI project is to design, standardize and implement a portable and efficient API to access the hardware performance monitor counters found on most modern microprocessors.
- Parallel Tools Consortium project
- University of Tennessee, Knoxville

Slide 16: KOJAK
- KOJAK Toolkit [ICL, UTK and FZJ, Germany]
  - Epilog tracing library
  - Opari OpenMP re-writing tool
  - Expert automatic bottleneck detection trace analyzer
  - CUBE performance data browser

Slide 17: Automatic Instrumentation
- We now provide compiler wrapper scripts
  - Simply replace mpxlf90 with tau_f90.sh
  - Automatically instruments Fortran source code and links with the TAU MPI wrapper libraries
  - Use tau_cc.sh and tau_cxx.sh for C/C++

Before:

CXX = mpCC
F90 = mpxlf90_r
CFLAGS =
LIBS = -lm
OBJS = f1.o f2.o f3.o … fn.o

app: $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)

.cpp.o:
	$(CXX) $(CFLAGS) -c $<

After:

CXX = tau_cxx.sh
F90 = tau_f90.sh
CFLAGS =
LIBS = -lm
OBJS = f1.o f2.o f3.o … fn.o

app: $(OBJS)
	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)

.cpp.o:
	$(CXX) $(CFLAGS) -c $<

Slide 18: AutoInstrumentation using TAU_COMPILER
- $(TAU_COMPILER) stub Makefile variable in release
- Invokes PDT parser, TAU instrumentor, and compiler through the tau_compiler.sh shell script
- Requires minimal changes to the application Makefile
  - Compilation rules are not changed
  - User sets the TAU_MAKEFILE and TAU_OPTIONS environment variables
  - User renames the compilers
    - F90=xlf90 to F90=tau_f90.sh
- Passes options from the TAU stub Makefile to the four compilation stages
- Uses the original compilation command if an error occurs

Slide 19: TAU_COMPILER Options
Optional parameters for $(TAU_COMPILER): [tau_compiler.sh -help]
  -optVerbose           Turn on verbose debugging messages
  -optPdtDir=""         PDT architecture directory. Typically $(PDTDIR)/$(PDTARCHDIR)
  -optPdtF95Opts=""     Options for Fortran parser in PDT (f95parse)
  -optPdtCOpts=""       Options for C parser in PDT (cparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
  -optPdtCxxOpts=""     Options for C++ parser in PDT (cxxparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
  -optPdtF90Parser=""   Specify a different Fortran parser, e.g., f90parse instead of f95parse
  -optPdtUser=""        Optional arguments for parsing source code
  -optPDBFile=""        Specify [merged] PDB file. Skips parsing phase.
  -optTauInstr=""       Specify location of tau_instrumentor. Typically $(TAUROOT)/$(CONFIG_ARCH)/bin/tau_instrumentor
  -optTauSelectFile=""  Specify selective instrumentation file for tau_instrumentor
  -optTau=""            Specify options for tau_instrumentor
  -optCompile=""        Options passed to the compiler. Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
  -optLinking=""        Options passed to the linker. Typically $(TAU_MPI_FLIBS) $(TAU_LIBS) $(TAU_CXXLIBS)
  -optNoMpi             Removes -l*mpi* libraries during linking (default)
  -optKeepFiles         Does not remove intermediate .pdb and .inst.* files

e.g.,
% setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optVerbose -optPdtCOpts="-I/home -DFOO"'
% setenv TAU_MAKEFILE /usr/local/tau/ia64/lib/Makefile.tau-icpc-mpi-pdt
% tau_cxx.sh matrix.cpp -o matrix -lm
% tau_f90.sh foo.o bar.o -o app -lm

Slide 20: Optimization of Instrumentation Overhead
- Group routines into profile groups, with runtime selection of profiling groups
- Instrument sections of code selectively
- Exclude or include list of routines fed to the instrumentor, controlled manually or automatically
- Rule-based control of instrumentation
- Generate a selective instrumentation file by examining performance data from a previous run

Slide 21: tau_reduce: Rule-Based Overhead Analysis
- Analyze the performance data to find events whose measurements carry high (relative) overhead
- Create a select list for excluding those events
- Rule grammar (used in the tau_reduce tool)
    [GroupName:] Field Operator Number
  - GroupName indicates that the rule applies to events in that group
  - Field is an event metric attribute (from profile statistics)
    - numcalls, numsubs, percent, usec, cumusec, count [PAPI], totalcount, stdev, usecs/call, counts/call
  - Operator is one of >, <, or =
  - Number is any number
  - Compound rules are possible using & between simple rules
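The rule grammar above is small enough to sketch as a parser and evaluator. The following is an illustrative Python sketch, not the tau_reduce implementation: the event dictionaries, the sample group names, and the choice to attach a group prefix to the term it precedes are assumptions. Terms joined by & must all hold, and separate rule lines are treated as alternatives.

```python
import re

# Field names taken from the rule grammar in the slide.
FIELDS = {"numcalls", "numsubs", "percent", "usec", "cumusec", "count",
          "totalcount", "stdev", "usecs/call", "counts/call"}

_TERM = re.compile(
    r"^\s*(?:(?P<group>\w+)\s*:)?\s*(?P<field>[a-z/]+)\s*"
    r"(?P<op>[<>=])\s*(?P<num>\S+)\s*$")


def parse_rule(line):
    """Parse one rule line into a list of (group, field, op, number) terms."""
    terms = []
    for part in line.split("&"):
        m = _TERM.match(part)
        if m is None or m.group("field") not in FIELDS:
            raise ValueError(f"cannot parse rule term: {part!r}")
        # float() also accepts scientific notation such as 1e3.
        terms.append((m.group("group"), m.group("field"),
                      m.group("op"), float(m.group("num"))))
    return terms


def term_matches(term, event):
    group, field, op, number = term
    if group is not None and group not in event["groups"]:
        return False
    value = event[field]
    if op == ">":
        return value > number
    if op == "<":
        return value < number
    return value == number


def excluded(event, rules):
    """True if any rule line matches; all terms within a line must hold."""
    return any(all(term_matches(t, event) for t in terms) for terms in rules)


rules = [parse_rule("TAU_USER:usec < 1000"),
         parse_rule("usec < 1e3 & numcalls = 1")]

# Hypothetical profile statistics for three events.
events = [
    {"name": "interchange", "groups": {"TAU_USER"}, "usec": 250.0, "numcalls": 40000},
    {"name": "init", "groups": {"TAU_DEFAULT"}, "usec": 800.0, "numcalls": 1},
    {"name": "solve", "groups": {"TAU_DEFAULT"}, "usec": 5.0e6, "numcalls": 10},
]
exclude_list = [e["name"] for e in events if excluded(e, rules)]
print(exclude_list)
```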

Slide 22: Optimizing Instrumentation Overhead: Examples

# Exclude all events that are members of TAU_USER
# and use less than 1000 microseconds
TAU_USER:usec < 1000

# Exclude all events that have less than 1000
# microseconds and are called only once
usec < 1000 & numcalls = 1

# Exclude all events that have less than 1000 usecs per
# call OR have a (total inclusive) percent less than 5
usecs/call < 1000
percent < 5

- Scientific notation can be used
  - usec>1000 & numcalls> & usecs/call 25

Slide 23: TAU_REDUCE
- Reads profile files and rules
- Creates a selective instrumentation file
  - Specifies which routines should be excluded from instrumentation
[Diagram: rules and profile are inputs to tau_reduce, which outputs the selective instrumentation file]

Slide 24: Instrumentation Specification

% tau_instrumentor
Usage : tau_instrumentor [-o ] [-noinline] [-g groupname] [-i headerfile] [-c|-c++|-fortran] [-f ]
For selective instrumentation, use -f option
% tau_instrumentor foo.pdb foo.cpp -o foo.inst.cpp -f selective.dat
% cat selective.dat
# Selective instrumentation: Specify an exclude/include list of routines/files.
BEGIN_EXCLUDE_LIST
void quicksort(int *, int, int)
void sort_5elements(int *)
void interchange(int *, int *)
END_EXCLUDE_LIST
BEGIN_FILE_INCLUDE_LIST
Main.cpp
Foo?.c
*.C
END_FILE_INCLUDE_LIST
# Instruments routines in Main.cpp, Foo?.c and *.C files only
# Use BEGIN_[FILE]_INCLUDE_LIST with END_[FILE]_INCLUDE_LIST
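A selective instrumentation file like selective.dat is simple to generate from a list of routine signatures. The sketch below is illustrative and not part of TAU: the function names are hypothetical, and matching file patterns with Python's fnmatchcase assumes that ? and * behave like case-sensitive shell wildcards.

```python
from fnmatch import fnmatchcase


def render_selective_file(exclude_routines, include_files=()):
    """Return the text of a selective instrumentation file."""
    lines = ["# Selective instrumentation file (generated)",
             "BEGIN_EXCLUDE_LIST"]
    lines.extend(exclude_routines)
    lines.append("END_EXCLUDE_LIST")
    if include_files:
        lines.append("BEGIN_FILE_INCLUDE_LIST")
        lines.extend(include_files)
        lines.append("END_FILE_INCLUDE_LIST")
    return "\n".join(lines) + "\n"


def file_is_included(filename, include_patterns):
    """Assumed semantics: a file is instrumented if any pattern matches it."""
    return any(fnmatchcase(filename, p) for p in include_patterns)


patterns = ["Main.cpp", "Foo?.c", "*.C"]
text = render_selective_file(
    ["void quicksort(int *, int, int)",
     "void sort_5elements(int *)",
     "void interchange(int *, int *)"],
    include_files=patterns,
)
print(text)
for name in ["Main.cpp", "Foo1.c", "Foo12.c", "util.C", "util.c"]:
    print(name, file_is_included(name, patterns))
```

Under these assumptions Foo1.c matches Foo?.c but Foo12.c does not, because ? stands for exactly one character.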

Slide 25: Optimization of Instrumentation Overhead (contd.)
- Runtime throttling of events based on a rule
  - Numcalls > ThresholdA and TimePerCall < ThresholdB
- setenv TAU_THROTTLE 1
- setenv TAU_THROTTLE_NUMCALLS
- setenv TAU_THROTTLE_PERCALL
- Default values:
  - = calls
  - = 10 microseconds per call
- Once an event meets these conditions, its subsequent calls are disabled at runtime and the event is put in a TAU_DISABLE group
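The throttling rule can be sketched as a per-event counter that switches itself off. This is a minimal illustration of the rule as stated on the slide, not TAU's implementation: the class name and the threshold values used in the example are hypothetical.

```python
class ThrottledEvent:
    """Tracks one event and disables it when the throttle rule is met."""

    def __init__(self, name, numcalls_threshold, percall_threshold_usec):
        self.name = name
        self.numcalls_threshold = numcalls_threshold
        self.percall_threshold_usec = percall_threshold_usec
        self.numcalls = 0
        self.total_usec = 0.0
        self.disabled = False

    def record(self, elapsed_usec):
        """Record one completed call; return False if the event is disabled."""
        if self.disabled:
            return False
        self.numcalls += 1
        self.total_usec += elapsed_usec
        time_per_call = self.total_usec / self.numcalls
        # Rule from the slide: Numcalls > ThresholdA and TimePerCall < ThresholdB
        if (self.numcalls > self.numcalls_threshold
                and time_per_call < self.percall_threshold_usec):
            self.disabled = True  # later calls are no longer measured
        return True


# Hypothetical thresholds: more than 1000 calls at under 10 usec per call.
light = ThrottledEvent("interchange", 1000, 10.0)
heavy = ThrottledEvent("solve", 1000, 10.0)
for _ in range(5000):
    light.record(2.0)   # short, frequently called routine
    heavy.record(50.0)  # each call is long enough to keep measuring
print(light.numcalls, light.disabled)
print(heavy.numcalls, heavy.disabled)
```

In this sketch the lightweight event stops being measured on the first call after it crosses both thresholds, so its overhead is bounded regardless of how often it is called afterwards.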

Slide 26: EPILOG Tracing Optimization
- TAU and Epilog Tracing Package
  - TAU can generate Epilog trace files
    - configure -epilog= -TRACE …
  - Epilog uses its own MPI wrapper library
  - Events are analyzed by Expert to detect performance bottlenecks automatically
  - Output is a CUBE profile file with callpath information
  - CUBE output is read by the CUBE GUI and TAU’s ParaProf profile browser
- Expert discards all events that do not call an MPI routine directly or indirectly
  - Optimization opportunity for instrumentation

Slide 27: Runtime Instrumentation Control
- When TAU is configured with the -MPITRACE configuration option (without EPILOG support)
  - TAU stores events and wallclock time in a buffer
  - Defers writing the buffer to disk until an MPI call takes place
  - Events directly in the callstack are enabled and written to disk
  - Other events are discarded
- TAU traces are converted to Epilog traces (tau2elg)
- Expert has a minimal set of events
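The deferred-write behavior described for -MPITRACE can be sketched with a call stack whose entry records are written only once an MPI call occurs beneath them. This is an illustrative model under assumptions, not TAU's buffer implementation: the class, its method names, and the record layout are hypothetical.

```python
class DeferredTrace:
    """Keeps trace records only for routines that lead to an MPI call."""

    def __init__(self):
        self.stack = []    # [name, enter_time, written] for each active routine
        self.written = []  # records that would be written to disk

    def enter(self, name, time):
        # Entry is buffered on the stack, not written yet.
        self.stack.append([name, time, False])

    def exit(self, name, time):
        top_name, _, was_written = self.stack.pop()
        if top_name != name:
            raise RuntimeError("unbalanced enter/exit")
        if was_written:
            self.written.append(("exit", name, time))
        # Otherwise no MPI call happened below this routine: discard it.

    def mpi_call(self, name, start, end):
        # An MPI call enables every routine currently on the call stack.
        for frame in self.stack:
            if not frame[2]:
                self.written.append(("enter", frame[0], frame[1]))
                frame[2] = True
        self.written.append(("enter", name, start))
        self.written.append(("exit", name, end))


trace = DeferredTrace()
trace.enter("main", 0)
trace.enter("compute", 1)   # never reaches MPI, so it is discarded
trace.exit("compute", 2)
trace.enter("exchange", 3)
trace.mpi_call("MPI_Send", 4, 5)
trace.exit("exchange", 6)
trace.exit("main", 7)
for record in trace.written:
    print(record)
```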

Slide 28: Callpath Profiling Based Selective Instrumentation
- TAU is configured with -PROFILECALLPATH
- Env. variable TAU_CALLPATH_DEPTH set to a large value
  - Callpaths rooted at “main”
- TAU profiles analyzed to produce an “include list”
  - list of routines that should be instrumented (tauinc.sh) [F. Wolf]
  - Events that call an MPI routine directly/indirectly
- TAU generates EPILOG traces
- Expert analyzes EPILOG traces to produce CUBE profiles
- ParaProf and CUBE browsers read CUBE files
- PerfDMF performance database stores bottleneck results
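Deriving an include list from callpath profiles amounts to keeping every routine that appears above an MPI routine on some callpath. The sketch below is not tauinc.sh: it assumes callpath names are routine names joined by " => ", that MPI routines can be recognized by an MPI_ prefix, and all routine names in the example are hypothetical.

```python
def include_list_from_callpaths(callpaths,
                                is_mpi=lambda name: name.startswith("MPI_")):
    """Return the sorted routines that call an MPI routine directly or indirectly."""
    include = set()
    for path in callpaths:
        routines = [r.strip() for r in path.split("=>")]
        mpi_positions = [i for i, r in enumerate(routines) if is_mpi(r)]
        if mpi_positions:
            # Everything above the deepest MPI routine leads to an MPI call.
            include.update(routines[:mpi_positions[-1]])
    # MPI routines themselves are left out of the source include list.
    return sorted(r for r in include if not is_mpi(r))


# Hypothetical callpaths rooted at main.
callpaths = [
    "main => setup => read_input",
    "main => solve => exchange => MPI_Sendrecv",
    "main => solve => compute_flux",
    "main => MPI_Finalize",
]
include = include_list_from_callpaths(callpaths)
print("BEGIN_INCLUDE_LIST")
for routine in include:
    print(routine)
print("END_INCLUDE_LIST")
```

Routines such as setup and compute_flux never reach MPI on any callpath here, so they are left uninstrumented.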

Slide 29: Conclusions
- Optimization of instrumentation is critical for balancing the volume of performance data generated
- Several techniques for reducing the amount of instrumentation

Slide 30: Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science contracts
  - University of Utah ASC Level 1 sub-contract
  - LLNL ASC/NNSA Level 3 contract
  - LLNL ParaTools/GWT contract
- NSF
  - High-End Computing Grant
- T.U. Dresden, GWT
  - Dr. Wolfgang Nagel and Holger Brunst
- Research Centre Juelich
  - Dr. Bernd Mohr, Dr. Felix Wolf
- Los Alamos National Laboratory contracts