TAU: Performance Technology for Productive, High Performance Computing


TAU: Performance Technology for Productive, High Performance Computing Cray, Seattle, 10am Jan 13, 2010 Sameer Shende Director, Performance Research Laboratory University of Oregon, Eugene, OR sameer@cs.uoregon.edu http://tau.uoregon.edu

Acknowledgements: University of Oregon
Dr. Allen D. Malony, Professor, CIS Dept, and Director, NeuroInformatics Center
Alan Morris, Senior software engineer
Wyatt Spear, Software engineer
Scott Biersdorff, Software engineer
Dr. Robert Yelle, Research faculty
Suzanne Millstein, Ph.D. student
Ivan Pulleyn, Systems administrator

Outline
Introduction to TAU
Instrumentation
Measurement
Analysis
Examples of TAU usage
Future work/collaboration

What is TAU?
TAU is a performance evaluation tool: a parallel profiling and tracing toolkit
Profiling shows you how much (total) time was spent in each routine
Tracing shows you when events take place in each process along a timeline
Profiling and tracing can measure time as well as hardware performance counters from your CPU
TAU can automatically instrument your source code (routines, loops, I/O, memory, phases, etc.)
It supports C++, C, Chapel, UPC, Fortran, Python and Java
TAU runs on all HPC platforms and it is free (BSD-style license)
TAU has instrumentation, measurement and analysis tools
To use TAU, you set a couple of environment variables and substitute the name of the compiler with a TAU shell script

TAU Performance System®
Integrated toolkit for performance problem solving
Instrumentation, measurement, analysis, visualization
Portable performance profiling and tracing facility
Performance data management and data mining
Based on a direct performance measurement approach
Open source, available on all HPC platforms
http://tau.uoregon.edu
[TAU architecture diagram]

Performance Evaluation
Profiling: presents summary statistics of performance metrics
  number of times a routine was invoked
  exclusive and inclusive time/HPM counts spent executing it
  number of instrumented child routines invoked, etc.
  structure of invocations (calltrees/callgraphs)
  memory and message communication sizes are also tracked
Tracing: presents when and where events took place along a global timeline
  timestamped log of events
  message communication events (sends/receives) are tracked, showing when and where messages were sent
  generates a large volume of performance data, which leads to more perturbation of the program

TAU Performance Profiling
Performance with respect to nested event regions
Program execution event stack (begin/end events)
Profiling measures inclusive and exclusive data
  exclusive measurements cover the region-only performance
  inclusive measurements include nested "child" regions
Supports multiple profiling types: flat, callpath, and phase profiling
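The inclusive/exclusive distinction above can be sketched from a toy begin/end event stack. This is an illustration only, with made-up timestamps; TAU's actual measurement layer is implemented in C/C++:

```python
# Toy sketch: derive inclusive and exclusive time per region from a
# stream of (timestamp, kind, region) begin/end events, the way a
# profiler's event stack works.

def profile(events):
    """events: list of (timestamp, 'begin'|'end', region_name)."""
    stack = []   # entries: [region, begin_ts, time_spent_in_children]
    incl, excl = {}, {}
    for ts, kind, region in events:
        if kind == "begin":
            stack.append([region, ts, 0.0])
        else:
            name, begin_ts, child = stack.pop()
            assert name == region, "unbalanced begin/end events"
            inclusive = ts - begin_ts
            incl[name] = incl.get(name, 0.0) + inclusive
            excl[name] = excl.get(name, 0.0) + (inclusive - child)
            if stack:
                stack[-1][2] += inclusive   # credit this time to the parent
    return incl, excl

events = [
    (0, "begin", "main"),
    (2, "begin", "solve"),
    (7, "end",   "solve"),
    (10, "end",  "main"),
]
incl, excl = profile(events)
print(incl)  # main spent 10 inclusive, solve 5
print(excl)  # main spent 5 exclusive (10 minus 5 in solve)
```

Exclusive time for "main" is its inclusive time minus the time attributed to its nested child region, matching the definition on the slide.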

TAU Parallel Performance System Goals
Portable (open source) parallel performance system
  across computer system architectures and operating systems
  across different programming languages and compilers
Multi-level, multi-language performance instrumentation
Flexible and configurable performance measurement
Support for multiple parallel programming paradigms
  multi-threading, message passing, mixed-mode, hybrid, object-oriented (generic), component-based
Support for performance mapping
Integration of leading performance technology
Scalable (very large) parallel performance analysis

TAU Performance System Components
[Architecture diagram: program analysis (PDT), parallel profile analysis (ParaProf), performance data management (PerfDMF), performance data mining (PerfExplorer), performance monitoring (TAUoverSupermon)]

TAU Performance System Architecture

TAU Performance System Architecture

Program Database Toolkit (PDT)
[Diagram: the C/C++ and Fortran (F77/90/95) parsers emit intermediate-language (IL) files for the application/library; the C/C++ and Fortran IL analyzers produce program database files, which feed automatic source instrumentation (tau_instrumentor, via DUCTAPE), program documentation (PDBhtml), application component glue (SILOON), and C++/F90/95 interoperability (CHASM)]

Automatic Source-Level Instrumentation in TAU

Building Bridges to Other Tools

TAU Instrumentation Approach
Support for standard program events
  routines, classes and templates
  statement-level blocks
  begin/end events (interval events)
Support for user-defined events
  begin/end events specified by the user
  atomic events (e.g., size of memory allocated/freed)
  selection of event statistics
Support for defining "semantic" entities for mapping
Support for event groups (aggregation, selection)
Instrumentation optimization
  eliminates instrumentation in lightweight routines

Interval, Atomic and Context Events in TAU
[Screenshot: examples of an interval event, a context event and an atomic event]

TAU Measurement Mechanisms
Parallel profiling
  function-level, block-level, statement-level
  supports user-defined events and mapping events
  flat, callgraph/callpath, and phase profiling
  memory profiling (headroom, malloc/leaks)
  tracking of I/O (wrappers; read/write/print calls)
  parallel profiles written at the end of execution
  parallel profile snapshots can be taken during execution
Tracing
  all profile-level events plus inter-process communication
  inclusion of multiple counter data in traced events

Types of Parallel Performance Profiling
Flat profiles
  metric (e.g., time) spent in an event (callgraph nodes)
  exclusive/inclusive time, number of calls, child calls
Callpath profiles (calldepth profiles)
  time spent along a calling path (edges in the callgraph)
  event names like "main => f1 => f2 => MPI_Send"
  depth is controlled by the TAU_CALLPATH_DEPTH environment variable
Phase profiles
  flat profiles under a phase (nested phases are allowed)
  default "main" phase
  supports static or dynamic (e.g., per-iteration) phases
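The callpath event naming above ("main => f1 => f2 => MPI_Send", truncated at TAU_CALLPATH_DEPTH) can be mimicked in a few lines. This is a conceptual sketch, not TAU's implementation:

```python
def callpath_key(stack, depth=2):
    """Build a callpath event name from a call stack, keeping only the
    innermost `depth` frames, in the 'a => b => c' style TAU uses.
    depth=1 degenerates to a flat-profile name; depth<=0 yields none."""
    if depth <= 0:
        return ""
    return " => ".join(stack[-depth:])

stack = ["main", "f1", "f2", "MPI_Send"]
print(callpath_key(stack, 2))    # parent => child, the default depth
print(callpath_key(stack, 100))  # a large depth keeps the full path
```

With the default depth of 2, every measurement is attributed to a parent-child pair; a large depth (as in the callpath usage scenario later in this deck) reconstructs the full calling structure.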

Performance Evaluation Alternatives
[Chart ordering the alternatives by the volume of performance data they generate: flat profile, depthlimit profile, phase profile, callpath/callgraph profile, parameter profile, trace; each alternative can be collected with one metric/counter or multiple counters]

Parallel Profile Visualization: ParaProf (AORSA)

Comparing Effects of Multi-Core Processors
AORSA2D magnetized plasma simulation, Cray XT3 (4K cores)
Blue is single node, red is dual core

Comparing FLOPS (AORSA2D, Cray XT3)
Blue is dual core, red is single node; Cray XT3 (4K cores)
Data generated by Richard Barrett, ORNL

ParaProf – Scalable Histogram View (Miranda)
8K processors and 16K processors

ParaProf – 3D Scatterplot (Miranda)
Each point is a "thread" of execution
A total of four metrics shown in relation
ParaProf's visualization library uses JOGL

Visualizing Hybrid Problems (S3D, XT3+XT4)
S3D combustion simulation (DOE SciDAC PERI)
ORNL Jaguar, Cray XT3/XT4, 6400 cores

Zoom View of Hybrid Execution (S3D, XT3+XT4)
The gap represents XT3 nodes: MPI_Wait takes less time there, other routines take more time

Visualizing Hybrid Execution (S3D, XT3+XT4)
Process metadata is used to map performance to machine type
Memory speed accounts for the performance difference (6400 cores)

S3D Run on XT4 Only
Better balance across nodes, more performance uniformity

ParaProf – Profile Snapshots (Flash)
Profile snapshots are parallel profiles recorded at runtime
Used to highlight profile changes during execution (initialization, checkpointing, finalization)
Speaker notes: This slide shows a four-processor Flash run and how the snapshots differ. Before the main loop begins, a snapshot marks the performance data up to that point; it shows up as "Initialization". At the end of each loop iteration a snapshot is written, and at the end of execution a final snapshot is written, labeled "Finalization". The four main spikes are almost certainly the checkpointing phases, which involve a lot of I/O. The big purple band on top is "Other"; the display is limited to the top 20 functions plus "Other".

Filtered Profile Snapshots (Flash)
Only show main loop iterations
Speaker notes: Same view as the previous slide, but with initialization and finalization filtered away to better see the loop snapshots. The red at the bottom of the four spikes is MPI_Barrier, which is interesting.

Profile Snapshots with Breakdown (Flash)
Breakdown as a percentage, and as a percentage over time

Profile Snapshot Replay (Flash)
All windows update dynamically: when you drag the slider, the data in every window is refreshed.

Snapshot Dynamics of Event Relations (Flash)
Follow the progression of the various displays through time; the 3D scatter plot is shown below at T = 0s and T = 11s
Speaker notes: When you drag the slider, the data in all windows updates. Here we see the progression of the 3D scatter plot over time: the clusters start out spread apart, then collect as time goes by.

TAU: Usage Scenarios

Using TAU: A Brief Introduction
Each configuration of TAU produces a unique stub makefile with configuration-specific parameters (e.g., MPI, pthread, PGI compiler, etc.)
To instrument source code using PDT, choose an appropriate TAU stub makefile in <arch>/lib:
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-mpi-pdt-pgi
% setenv TAU_OPTIONS '-optVerbose …'   (see tau_compiler.sh)
Then use tau_f90.sh, tau_cxx.sh or tau_cc.sh as the Fortran, C++ or C compiler:
% ftn foo.f90
changes to
% tau_f90.sh foo.f90
Execute the application and analyze the performance data:
% pprof      (for a text-based profile display)
% paraprof   (for the GUI)

TAU Measurement Configuration – Examples
% cd /usr/common/acts/TAU/tau_latest/craycnl/lib; ls Makefile.*
Makefile.tau-pdt-pgi
Makefile.tau-mpi-pdt-pgi
Makefile.tau-papi-mpi-pdt-pgi
Makefile.tau-pthread-pdt-pgi
Makefile.tau-pthread-mpi-pdt-pgi
Makefile.tau-openmp-opari-pdt-pgi
Makefile.tau-openmp-opari-mpi-pdt-pgi
Makefile.tau-papi-openmp-opari-mpi-pdt-pgi
…
For an MPI+F90 application, you may want to start with Makefile.tau-mpi-pdt-pgi, which supports MPI instrumentation and PDT for automatic source instrumentation:
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-mpi-pdt-pgi

Automatic Instrumentation
TAU provides compiler wrapper scripts: simply replace ftn with tau_f90.sh to automatically instrument Fortran source code and link with the TAU MPI wrapper libraries. Use tau_cc.sh and tau_cxx.sh for C/C++.
Before:
  F90 = ftn
  CXX = CC
  CFLAGS =
  LIBS = -lm
  OBJS = f1.o f2.o f3.o … fn.o
  app: $(OBJS)
          $(F90) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
  .f90.o:
          $(F90) $(CFLAGS) -c $<
After:
  F90 = tau_f90.sh
  CXX = tau_cxx.sh
  CFLAGS =
  LIBS = -lm
  OBJS = f1.o f2.o f3.o … fn.o
  app: $(OBJS)
          $(F90) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
  .f90.o:
          $(F90) $(CFLAGS) -c $<

TAU_COMPILER Command-Line Options
See <taudir>/<arch>/bin/tau_compiler.sh --help
Compilation:
% ftn -c foo.f90
changes to
% gfparse foo.f90 $(OPT1)
% tau_instrumentor foo.pdb foo.f90 -o foo.inst.f90 $(OPT2)
% ftn -c foo.inst.f90 -o foo.o $(OPT3)
Linking:
% ftn foo.o bar.o -o app
changes to
% ftn foo.o bar.o -o app $(OPT4)
The default values of options OPT[1-4] may be overridden by the user:
% setenv TAU_OPTIONS '...'
% make F90=tau_f90.sh

Compile-Time Environment Variables
Optional parameters for TAU_OPTIONS [tau_compiler.sh --help]:
-optVerbose            Turn on verbose debugging messages
-optCompInst           Use compiler-based instrumentation
-optDetectMemoryLeaks  Track memory allocations/deallocations to detect leaks
-optKeepFiles          Do not remove intermediate .pdb and .inst.* files
-optPreProcess         Preprocess Fortran sources before instrumentation
-optTauSelectFile=""   Specify a selective instrumentation file for tau_instrumentor
-optLinking=""         Options passed to the linker; typically $(TAU_MPI_FLIBS) $(TAU_LIBS) $(TAU_CXXLIBS)
-optCompile=""         Options passed to the compiler; typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
-optPdtF95Opts=""      Add options for the Fortran parser in PDT (f95parse/gfparse)
-optPdtF95Reset=""     Reset options for the Fortran parser in PDT (f95parse/gfparse)
-optPdtCOpts=""        Options for the C parser in PDT (cparse); typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
-optPdtCxxOpts=""      Options for the C++ parser in PDT (cxxparse); typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
…

Runtime Environment Variables
TAU_TRACE – setting to 1 turns on tracing
TAU_CALLPATH – setting to 1 turns on callpath profiling
TAU_TRACK_HEAP or TAU_TRACK_HEADROOM – setting to 1 turns on tracking of heap memory/headroom at routine entry and exit using context events (e.g., Heap at Entry: main=>foo=>bar)
TAU_CALLPATH_DEPTH (default 2) – specifies the depth of the callpath; 0 generates no callpath or routine information, 1 generates a flat profile where context events carry just parent information (e.g., Heap Entry: foo)
TAU_SYNCHRONIZE_CLOCKS (default 1) – synchronize clocks across nodes to correct timestamps in traces
TAU_COMM_MATRIX – setting to 1 generates a communication matrix display using context events
TAU_THROTTLE – setting to 0 turns off throttling; enabled by default to remove instrumentation from lightweight routines that are called frequently
TAU_THROTTLE_NUMCALLS (default 100000) – the number of calls before testing for throttling
TAU_THROTTLE_PERCALL (default 10) – value in microseconds; a routine is throttled if it is called over 100000 times and takes less than 10 usec of inclusive time per call
TAU_COMPENSATE – setting to 1 enables runtime compensation of instrumentation overhead
TAU_PROFILE_FORMAT (default Profile) – setting to "merged" generates a single file; "snapshot" generates XML format
TAU_METRICS (default TIME) – setting to a comma-separated list generates other metrics (e.g., TIME:linuxtimers:PAPI_FP_OPS:PAPI_NATIVE_<event>)
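The throttling rule described above (instrumentation is disabled for a routine once it has run more than TAU_THROTTLE_NUMCALLS times while averaging under TAU_THROTTLE_PERCALL microseconds of inclusive time per call) amounts to a simple test. A sketch using the documented defaults, not TAU's actual code:

```python
def should_throttle(numcalls, incl_usec,
                    throttle_numcalls=100_000, throttle_percall=10):
    """Mimic TAU's default throttling rule: throttle a routine once it
    has been called more than `throttle_numcalls` times while averaging
    less than `throttle_percall` microseconds of inclusive time per call."""
    return (numcalls > throttle_numcalls and
            incl_usec / numcalls < throttle_percall)

print(should_throttle(200_000, 1_000_000))  # 5 usec/call on average -> throttle
print(should_throttle(200_000, 5_000_000))  # 25 usec/call -> keep instrumenting
```

Routines that fail the test keep their instrumentation, so expensive routines are never throttled no matter how often they are called.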

Overriding Default Options: TAU_OPTIONS
% cat Makefile
  F90 = tau_f90.sh
  OBJS = f1.o f2.o f3.o …
  LIBS = -Lappdir -lapplib1 -lapplib2 …
  app: $(OBJS)
          $(F90) $(OBJS) -o app $(LIBS)
  .f90.o:
          $(F90) -c $<
% setenv TAU_OPTIONS '-optVerbose -optTauSelectFile=select.tau'
% setenv TAU_MAKEFILE <taudir>/craycnl/lib/Makefile.tau-mpi-pdt-pgi

Usage Scenarios: Routine-Level Profile
Goal: what routines account for the most time? How much?
Flat profile with wallclock time:

Solution: Generating a Flat Profile with MPI
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-mpi-pdt-pgi
% set path=(/usr/common/acts/TAU/tau_latest/x86_64/bin $path)
or
% module load tau
% make F90=tau_f90.sh
% tau_f90.sh matmult.f90 -o matmult
(or edit the Makefile and change F90=tau_f90.sh)
% qsub run.job
% paraprof
To view the data locally on a workstation:
% paraprof --pack app.ppk
Move the app.ppk file to your desktop.
% paraprof app.ppk

Usage Scenarios: Loop-Level Instrumentation
Goal: what loops account for the most time? How much?
Flat profile with wallclock time and loop instrumentation:

Solution: Generating a Loop-Level Profile
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-mpi-pdt-pgi
% setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optVerbose'
% cat select.tau
BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION
% module load tau
% make F90=tau_f90.sh
(or edit the Makefile and change F90=tau_f90.sh)
% qsub run.job
% paraprof --pack app.ppk
Move the app.ppk file to your desktop.
% paraprof app.ppk
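The select.tau above instruments all loops ("#" matches every routine). The same file format also supports exclude lists for routines and files; a hypothetical fuller example (the routine signature below is made up for illustration, syntax per tau_instrumentor):

```
BEGIN_FILE_EXCLUDE_LIST
*.so
END_FILE_EXCLUDE_LIST

BEGIN_EXCLUDE_LIST
void quicksort(int *, int, int)
END_EXCLUDE_LIST

BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION
```

Excluding tiny, frequently called routines up front reduces instrumentation overhead without relying on runtime throttling.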

Usage Scenarios: MFLOPS in Loops
Goal: what MFLOPS am I getting in all loops?
Flat profile with PAPI_FP_INS/OPS and TIME, with loop instrumentation:

ParaProf: Mflops Sorted by Exclusive Time low mflops?

Generate a PAPI Profile with Two or More Counters
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-papi-mpi-pdt-pgi
% setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optVerbose'
% cat select.tau
BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION
% make F90=tau_f90.sh
(or edit the Makefile and change F90=tau_f90.sh)
% setenv TAU_METRICS TIME:PAPI_FP_INS:PAPI_L1_DCM
% qsub run.job
% paraprof --pack app.ppk
Move the app.ppk file to your desktop.
% paraprof app.ppk
In ParaProf, choose Options -> Show Derived Panel -> Arg 1 = PAPI_FP_INS, Arg 2 = GET_TIME_OF_DAY, Operation = Divide -> Apply, then select the new derived metric.
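The derived metric built in that panel is simply a ratio: dividing PAPI_FP_INS counts by GET_TIME_OF_DAY time (which TAU records in microseconds) yields floating-point instructions per microsecond, i.e. MFLOPS. A toy sketch with hypothetical numbers:

```python
def mflops(fp_ins, time_usec):
    """Derived metric as in ParaProf's Divide operation:
    FP instructions per microsecond == millions of FP instructions
    per second (MFLOPS)."""
    return fp_ins / time_usec

# e.g. 2.5e9 FP instructions over 5 seconds (5e6 usec)
print(mflops(2.5e9, 5e6))  # 500.0 MFLOPS
```

The same division applied per loop event, as in the previous slide, highlights loops with anomalously low MFLOPS.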

Usage Scenarios: Compiler-based Instrumentation Goal: Easily generate routine level performance data using the compiler instead of PDT for parsing the source code

Use Compiler-Based Instrumentation
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-mpi-pdt-pgi
% setenv TAU_OPTIONS '-optCompInst -optVerbose'
% module load tau
% make F90=tau_f90.sh
(or edit the Makefile and change F90=tau_f90.sh)
% qsub run.job
% paraprof --pack app.ppk
Move the app.ppk file to your desktop.
% paraprof app.ppk

Profiling Chapel Applications Using compiler-based instrumentation in TAU to profile Chapel applications

Chapel: Compiler-Based Instrumentation
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/x86_64/lib/Makefile.tau-papi-pthread-pdt
% setenv TAU_OPTIONS '-optCompInst -optVerbose'
% setenv CHPL_MAKE_COMPILER tau
% make
% cat $CHPL_HOME/make/compiler/Makefile.tau
CC=tau_cc.sh
CXX=tau_cxx.sh
…
% qsub run.job
% paraprof --pack app.ppk
Move the app.ppk file to your desktop.
% paraprof app.ppk

Profiling UPC Applications Atomic Events for UPC

UPC: Compiler-Based Instrumentation
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/x86_64/lib/Makefile.tau-mpi-upc
% setenv TAU_OPTIONS '-optCompInst -optVerbose'
% make
…
% qsub run.job
% paraprof --pack app.ppk
Move the app.ppk file to your desktop.
% paraprof app.ppk

Usage Scenarios: Generating a Callpath Profile
Goal: to reveal the calling structure of the program
Callpath profile for a given callpath depth:

Callpath Profile Generates program callgraph

Generate a Callpath Profile
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-mpi-pdt-pgi
% set path=(/usr/common/acts/TAU/tau_latest/craycnl/bin $path)
% make F90=tau_f90.sh
(or edit the Makefile and change F90=tau_f90.sh)
% setenv TAU_CALLPATH 1
% setenv TAU_CALLPATH_DEPTH 100
% qsub run.job
% paraprof --pack app.ppk
Move the app.ppk file to your desktop.
% paraprof app.ppk
(Windows -> Thread -> Call Graph)

Usage Scenario: Detect Memory Leaks

Detect Memory Leaks
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-mpi-pdt-pgi
% setenv TAU_OPTIONS '-optDetectMemoryLeaks -optVerbose'
% module load tau
% make F90=tau_f90.sh
(or edit the Makefile and change F90=tau_f90.sh)
% setenv TAU_CALLPATH_DEPTH 100
% qsub run.job
% paraprof --pack app.ppk
Move the app.ppk file to your desktop.
% paraprof app.ppk
(Windows -> Thread -> Context Event Window -> select thread -> expand tree)
(Windows -> Thread -> User Event Bar Chart -> right-click LEAK -> Show User Event Bar Chart)

Usage Scenarios: Mixed Python+F90+C+pyMPI Goal: Generate multi-level instrumentation for Python+MPI+…

Generate a Multi-Language Profile with Python
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-python-mpi-pdt-pgi
% set path=(/usr/common/acts/TAU/tau_latest/craycnl/bin $path)
% setenv TAU_OPTIONS '-optShared -optVerbose …'
(Python needs the shared-object-based TAU library)
% make F90=tau_f90.sh CXX=tau_cxx.sh CC=tau_cc.sh
(build pyMPI with TAU)
% cat wrapper.py
import tau
def OurMain():
    import App
tau.run('OurMain()')
Uninstrumented:
% aprun -a xt -n 4 <dir>/pyMPI-2.5b0/bin/pyMPI ./App.py
Instrumented:
% setenv PYTHONPATH <taudir>/craycnl/lib/bindings-python-mpi-pdt-pgi
(same options string as TAU_MAKEFILE)
% setenv LD_LIBRARY_PATH <taudir>/craycnl/lib/bindings-python-mpi-pdt-pgi\:$LD_LIBRARY_PATH
% aprun -a xt -n 4 <dir>/pyMPI-2.5b0-TAU/bin/pyMPI ./wrapper.py
(instrumented pyMPI with wrapper.py)

Usage Scenarios: Generating a Trace File
Goal: what happens in my code at a given time? When?
Event trace visualized in Vampir [TUD] / Jumpshot [ANL]

VNG Process Timeline with PAPI Counters

Vampir Counter Timeline Showing I/O BW

Generate a Trace File
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-mpi-pdt-pgi
% set path=(/usr/common/acts/TAU/tau_latest/craycnl/bin $path)
% make F90=tau_f90.sh
(or edit the Makefile and change F90=tau_f90.sh)
% setenv TAU_TRACE 1
% qsub run.job
% tau_treemerge.pl
(merges binary traces to create the tau.trc and tau.edf files)
For Jumpshot:
% tau2slog2 tau.trc tau.edf -o app.slog2
% jumpshot app.slog2
Or for Vampir:
% tau2otf tau.trc tau.edf app.otf -n 4 -z
(4 streams, compressed output trace)
% vampir app.otf
(or the vng client with a vngd server)

Usage Scenarios: Evaluate Scalability
Goal: how does my application scale? What bottlenecks arise at what CPU counts?
Load profiles into the PerfDMF database and examine them with PerfExplorer

Usage Scenarios: Evaluate Scalability

Performance Regression Testing

Evaluate Scalability using PerfExplorer Charts
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-mpi-pdt-pgi
% set path=(/usr/common/acts/TAU/tau_latest/craycnl/bin $path)
% make F90=tau_f90.sh
(or edit the Makefile and change F90=tau_f90.sh)
% qsub run1p.job
% paraprof --pack 1p.ppk
% qsub run2p.job
% paraprof --pack 2p.ppk
… and so on. On your client:
% perfdmf_configure
(choose derby, blank user/passwd, yes to save passwd, defaults)
% perfexplorer_configure
(yes to load schema, defaults)
% paraprof
(load each trial: DB -> Add Trial -> Type (Paraprof Packed Profile) -> OK)
% perfexplorer
(Charts -> Speedup)

Communication Matrix Display
Goal: what is the volume of inter-process communication? Along which calling path?

Generate a Communication Matrix Display
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-mpi-pdt-pgi
% module load tau
% make F90=tau_f90.sh
(or edit the Makefile and change F90=tau_f90.sh)
% setenv TAU_COMM_MATRIX 1
% qsub run.job
(with the environment variables set)
% paraprof
(Windows -> Communication Matrix)

PGI Compiler for GPUs
Accelerator programming support
  Fortran and C
  directive-based programming
  loop parallelization for acceleration on GPUs
PGI 9.0 for x64-based Linux (preview release)
  CUDA target
  synchronous accelerator operations
  profile interface support

TAU with PGI Accelerator Compiler
Compiler-based instrumentation for PGI compilers
  tracks runtime system events as seen by the host processor
  wrapped runtime library
Shows source information associated with events
  routine name
  file name and source line number for the kernel
  variable names in memory upload and download operations
  grid sizes
Any configuration of TAU with PGI supports tracking of accelerator operations
  tested with the PGI 8.0.3, 8.0.5 and 8.0.6 compilers
  qualification and testing with PGI 9.0-4 and 10.x complete

Measuring Performance of PGI Accelerator Code

Binary Rewriting: DyninstAPI [U.Wisc] and TAU

HMPP-TAU Event Instrumentation/Measurement
[Diagram: the user application and HMPP runtime are measured by TAU (user events, HMPP events, codelet events), while the HMPP CUDA codelet is measured by TAUcuda (CUDA stream events, waiting information)]

HMPP-TAU Compilation Workflow
[Diagram: the TAU compiler/instrumenter produces a TAU-instrumented HMPP-annotated application; the HMPP compiler's CUDA generator plus the TAUcuda instrumenter produce TAUcuda-instrumented CUDA codelets; compiling with the CUDA and generic compilers and linking against the TAU/TAUcuda and HMPP runtime libraries yields a TAU-instrumented HMPP application executable with a TAUcuda-instrumented CUDA codelet library]

HMPP Workbench with TAUcuda
[Timeline screenshot labels: host process, compute kernel, transfer kernel]

NAMD with CUDA
NAMD is a molecular dynamics application (Charm++)
NAMD has been accelerated with CUDA
TAU is integrated in Charm++
Apply TAUcuda to NAMD
Four processes, with one Tesla GPU each

NAMD with CUDA (4 processes) GPU kernel

Scaling NAMD with CUDA good GPU performance

Scaling NAMD with CUDA: Jumpshot Timeline

Scaling NAMD with CUDA Data transfer

Conclusions
Heterogeneous parallel computing will challenge parallel performance technology
  must deal with diversity in hardware and software
  must deal with richer parallelism and concurrency
Performance tools should support the parallel execution and computation models
Understanding of "performance" interactions between integrated components
  control and data interactions
  might not be able to see full parallel (concurrent) detail
Need to support multiple performance perspectives
  layers of performance abstraction

Discussions
TAU represents a mature technology for performance instrumentation, measurement and analysis
We would like to collaborate with the Cray language and compiler teams to improve support for TAU on Cray systems
Near-term goals
  Chapel runtime support
  support for compiler-based instrumentation for Cray compilers on XT systems
Long-term goals
  explore hybrid execution models (XT5h) and other new systems
  integrate and ship TAU with the Cray tool chain

Support Acknowledgements
Department of Energy (DOE)
  Office of Science
  MICS, Argonne National Lab
  ASC/NNSA, University of Utah (ASC/NNSA Level 1)
  ASC/NNSA, LLNL
Department of Defense (DoD)
  HPC Modernization Office (HPCMO)
NSF SDCI
Research Centre Juelich
LBL, ORNL, ANL, LANL, PNNL, LLNL
TU Dresden
ParaTools, Inc.