TAU Evaluation Report
Adam Leko, Hung-Hsun Su
UPC Group, HCS Research Laboratory, University of Florida
Color encoding key: Blue: Information, Red: Negative note, Green: Positive note

2 Basic Information
- Name: Tuning and Analysis Utilities (TAU)
- Developer: University of Oregon
- Current version:
  - TAU
  - Program Database Toolkit
- Website:
- Contact:
  - Sameer Shende:

3 TAU Overview
- Performance tool suite that offers profiling and tracing of programs
  - Available instrumentation methods: source (manual), source (automatic), binary (DynInst)
  - Supported languages: C, C++, Fortran, Python, Java, SHMEM (TurboSHMEM and Cray SHMEM), OpenMP, MPI, Charm++
  - Hardware counter support
- Relies on existing toolkits and libraries for some functionality
  - PDToolkit and OPARI for automatic source instrumentation
  - DynInst for runtime binary instrumentation
  - PCL and PAPI for hardware counter information
  - libvtf3, slog2sdk, and EPILOG for exporting trace files

4 TAU Architecture

5 Configuring & Installing TAU TAU relies on several existing toolkits for efficient usage, but some of these toolkits are time-consuming to install  PDToolkit, PAPI, etc Users must choose between modes at compile time using./configure script  Profiling via -PROFILE, tracing via -TRACE  TAU must also be notified about the location of supported languages and compilers -mpilib=/path/to/mpi/lib -dyninst=/path/to/dyninst -pdt=/path/to/pdt Other supported languages/libraries handled in a similar manner This results in a very flexible installation process  Users can easily install different configurations of TAU in their home directory  However, several configuration options are mutually exclusive, such as Profiling and tracing Using PAPI counters vs. gettimeofday or TSC counters Profiling w/callpaths vs. profiling with extra statistics  Unfortunately, mutually exclusive nature of things proves to be annoying Would be nice if TAU supported (for instance) tracing and profiling without compiling & installing twice! Luckily, software compiles quickly on modern machines, so this is not fatal However, TAU relies on several environment variables, which makes switching between installations cumbersome

6 The Many Faces of TAU
- Two main methods of operation: profiling and tracing
- Profile mode
  - Reports aggregate time spent in each function per node/thread
  - Several profile recording options:
    - Report min/max/std. dev. of times using the -TRACESTATS configure option
    - Attempt to compensate for profiling overhead (-COMPENSATE)
    - Record memory stats while profiling (-PROFILEMEMORY, -PROFILEHEADROOM)
    - Stop profiling after a certain function depth (-DEPTHLIMIT)
    - Record call trees in the profile (-PROFILECALLPATH)
    - Record the phase of the program in profiles (-PROFILEPHASE; requires manual instrumentation of phases)
  - If the instrumented code uses the TAU_INIT macros, arguments can also be passed to the compiled, instrumented program to restrict what is recorded at runtime, e.g. --profile main+func2 (see the sketch below)
  - Metrics that can be recorded: wall clock time (via gettimeofday or several hardware-specific timers) or hardware counter metrics (via PAPI or PCL)
  - Data is visualized using pprof (text-based) or paraprof (Java-based GUI)
  - Profile data can be exported to KOJAK's Cube viewer
  - Profile data can be imported from Vampir VTF traces
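A sketch of the runtime restriction mentioned above, assuming a program built with the TAU_INIT macros (the binary name and function list are hypothetical):

```sh
# Record profile data only for main and func2; all other
# instrumented functions are ignored at runtime.
mpirun -np 4 ./camel --profile main+func2
```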

7 The Many Faces of TAU (2)
- Trace mode
  - Records timestamps for function entry/exit points
    - Or for arbitrary code section points via manual instrumentation
  - Also records messages sent/received for MPI programs
  - No trace visualizer of its own, but can export to several formats (see the sketch below):
    - ALOG: Upshot/nupshot
    - Paraver's trace format
    - SLOG-2: Jumpshot
    - VTF: Vampir/Intel Trace Analyzer 5
    - SDDF: format used by Pablo/SvPablo
    - EPILOG: KOJAK's trace format
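Export goes through the tau_merge/tau_convert utilities mentioned later in this report. A rough sketch, with file names illustrative and exact flags dependent on the TAU version:

```sh
# Merge the per-node trace files from an instrumented run...
tau_merge tautrace.*.trc app.trc

# ...then convert the merged trace plus its event description
# file for an external viewer, e.g. ALOG for Upshot.
tau_convert -alog app.trc tau.edf app.alog
```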

8 TAU Instrumentation: Profile Mode
- Source-level instrumentation
  - tau_instrument (which requires PDToolkit) is used to produce instrumented source code for C, C++, and Fortran files
  - For OpenMP code, TAU can use OPARI (from KOJAK)
  - Users may insert instrumentation by hand using TAU's simple API (TAU_PROFILE_START, TAU_PROFILE_STOP; see the sketch below)
  - When compiling, users must use stub Makefiles that define compilation macros like CFLAGS, LDFLAGS, etc.
    - This can greatly complicate the compile & link cycle, especially if fully automatic source instrumentation is desired
  - Selective instrumentation is supported through a flag to tau_instrument
    - Give it a file listing which functions to include in or exclude from instrumentation
    - tau_reduce can be used in conjunction with existing profiles to exclude functions matching certain criteria, like numcalls > & usecs/call < 2
- Binary-level instrumentation
  - Based on DynInst; considered "experimental" according to the documentation
  - Use the tau_run wrapper script with an instrumentation file in the same format as the selective instrumentation file
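A minimal sketch of the manual API route, assuming a C program; the timer name is arbitrary and TAU_DEFAULT is the stock profile group (check the TAU manual for the exact macro set in a given version):

```c
#include <TAU.h>  /* include/library paths come from the stub Makefile */

int main(int argc, char **argv)
{
    /* Register a timer for the code region of interest. */
    TAU_PROFILE_TIMER(t, "compute loop", "", TAU_DEFAULT);
    TAU_PROFILE_INIT(argc, argv);  /* enables --profile runtime selection */
    TAU_PROFILE_SET_NODE(0);       /* the MPI wrapper library normally sets this */

    TAU_PROFILE_START(t);
    /* ... region to be measured ... */
    TAU_PROFILE_STOP(t);
    return 0;
}
```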

9 TAU Instrumentation: Trace Mode
- Source-level instrumentation
  - Same procedure as in profile mode
- Binary instrumentation
  - Can link against the MPI wrapper library (only re-linking is necessary; see the sketch below)
  - Runtime instrumentation for trace mode is not supported using DynInst
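A sketch of the re-link step: the actual library names are mangled with the configuration options, so they should be taken from the stub Makefile's variables rather than the hypothetical names shown here:

```sh
# Object files are unchanged; only the final link differs.
# -lTauMpi/-ltau are placeholders for the configured library names.
mpicc camel.o -o camel -L/opt/tau/lib -lTauMpi -ltau
```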

10 Source Instrumentation Process

11 Instrumentation Test Suite: Problems
- Problem with using selective instrumentation + MPI wrapper library + PAPI metrics
  - Only instrumenting main in CAMEL caused several floating point instructions to be attributed to MPI_Send and MPI_Recv instead of main
  - For timing and overhead measurements, we used wall clock time with the low-overhead -LINUXTIMERS option
- Some code had to be modified before feeding it through PDToolkit's cparse
  - cparse uses the Edison Design Group parser, which is stricter about some things than other compilers
  - ANSI C/standard Fortran code poses no problems, though
- The NAS NPB LU benchmark (NPB v3.1-MPI) would not run with the TAU libraries
  - Segfaults ("signal 11") when using either LAM or MPICH with only the MPI wrapper libraries (profiling & tracing)
  - A modified, updated version (3.2) of LU comes with TAU
    - Had problems compiling and running this as well
  - Gave TAU the benefit of the doubt for the rest of the evaluations
    - Guessed at what a TAU profile would have told us had it been working with LU for the bottleneck tests
    - LU timing overheads omitted from the overhead measurements

12 Instrumentation Overhead: Notes
- Performed automatic instrumentation of CAMEL using tau_instrument
  - As with KOJAK, program execution time became several orders of magnitude slower
  - This is likely due to the use of very small functions that normally get inlined by the compiler
  - For the profile measurements on the following slides, only main was instrumented
    - Under this scenario, profiling and tracing overhead was almost nonexistent (<1%)
- Instrumentation points chosen for overhead measurements
  - Profiling
    - CAMEL: all MPI calls, main enter + exit
    - PPerfMark suite: all MPI calls, all function calls
    - Used the -PROFILECALLPATH configuration option
      - Using other profile flavors (without call paths, with extra stats) made a negligible difference in overall profile overhead
  - Tracing
    - CAMEL: all MPI calls
    - PPerfMark suite: all MPI calls
  - Similar to what we have done for other tools
- Benchmarks marked with * had high variability in runtimes

13 Instrumentation Overhead: Notes (2)
- Used LAM for all measurements
  - Some benchmarks with high overhead (small-messages, wrong-way, ping-pong) had slightly smaller overhead using MPICH
    - Small messages: 54.2% vs. %
    - Wrong way: 24.5% vs. %
    - Ping-pong: 51.5% vs. %
  - Probably due to LAM running faster (especially on small-messages) and execution time being limited by the I/O time for writing the trace file
    - Same I/O time, smaller execution time -> higher % overhead (see the worked example below)
- In general, overhead for profiling and tracing is extremely low except for a few cases
  - High profile overhead: programs with small functions that get called a lot (small-messages, wrong-way, ping-pong, CAMEL with everything instrumented)
  - High trace overhead: programs whose large traces are generated very quickly (small-messages, wrong-way, ping-pong)
- tau_reduce provides a nice way to help reduce instrumentation overhead, although an initial profile must be gathered first
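An illustrative calculation of this effect (the numbers are made up, not measured): if writing the trace costs a fixed 2 s of I/O, a 4 s LAM run shows 2/4 = 50% overhead, while a 6 s MPICH run of the same benchmark shows only 2/6 ≈ 33%, even though the absolute cost is identical.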

14 Instrumentation Overhead: Profiles

15 Instrumentation Overhead: Traces

16 Visualizations: pprof
- Gives a text-based dump of profile files, similar to gprof/prof output
- Example (partial) output:

```
...
USER EVENTS Profile: NODE 7, CONTEXT 0, THREAD 0
NumSamples  MaxValue  MinValue  MeanValue  Std. Dev.  Event Name
...         ...       ...       ...        ...        Message size received from all nodes
...         ...       ...       ...        ...        Message size sent to all nodes

FUNCTION SUMMARY (total):
%Time  Exclusive  Inclusive   #Call  #Subrs  Inclusive  Name
       msec       total msec                 usec/call
...    ...        ...         ...    ...     ...        main()
...    ...        ...         ...    ...     ...        MPI_Init()
...    ...        ...         ...    ...     ...        main() => MPI_Init()
...    ...        ...         ...    ...     ...        MPI_Recv()
...    ...        ...         ...    ...     ...        main() => MPI_Recv()
...    ...        ...         ...    ...     ...        MPI_Barrier()
...
```
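A sketch of how this output is produced in practice (the binary name is hypothetical): an instrumented run writes one profile.&lt;node&gt;.&lt;context&gt;.&lt;thread&gt; file per thread into the working directory, and pprof/paraprof read whatever profiles they find there.

```sh
mpirun -np 8 ./camel   # instrumented run writes profile.0.0.0 ... profile.7.0.0
pprof                  # text dump (as above) of the profiles in this directory
paraprof &             # same data in the Java GUI
```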

17 Visualizations: paraprof
- paraprof provides visual representations of the same data given by pprof
- It used to be a Tcl/Tk application known as "racy"
  - racy has been deprecated, but is still included with TAU for historical reasons
- Java application with three main views:
  - Main profile view
  - Histograms (next slides)
  - Three-dimensional visualization (next slides)
- Main profile view
  - A "function ledger" maps colors to function names
  - Overall time for each function is displayed as a stacked bar chart
  - Clicking on a function brings up a detailed function view
- No line-level source code correlation
  - This information can be inferred indirectly if call paths are used

[Slide screenshots: main profile view, function ledger, function details view]

18 Visualizations: paraprof (2)
- paraprof can also show a histogram view for each function of the main profile view
- These simply show the histogram of aggregate time for a function across all threads
  - For example, one such histogram showed that most nodes spent around 75.8 seconds (the midpoint between min and max) in MPI_Barrier

19 Visualizations: paraprof (3)
- paraprof can also display three-dimensional views of profile data
- The bar and triangle-mesh plots have fixed axes:
  - Time spent in each function (height)
  - Which function (width)
  - Which node (depth)
- The scatter plot lets you pick the axes
- Plots support transparency, rotation, and highlighting a particular function or node
- Surprisingly responsive for a Java application!

20 Bottleneck Identification Test Suite
- Testing metric: what did pprof/paraprof tell us from wall clock time profiles?
  - Since there is no built-in trace visualizer, we ignored what could be done with other trace tools
- Program correctness was not affected by instrumentation
  - Except for our version of LU
- CAMEL: PASSED
  - Showed work evenly distributed among nodes
  - When full tracing was used, easily showed which functions take the most wall clock time
- LU: FAILED
  - Could not run; got segfaults using MPICH or LAM
  - Even if it had worked, it would be very difficult or impossible to garner communication patterns from profile views
- Big messages: PASSED
  - Profile showed most of the application time dominated by the MPI calls to send and receive
- Diffuse procedure: TOSS-UP
  - Profile showed most time taken by MPI_Barrier calls
  - However, the profile also showed the bottleneck procedure (which is dispersed across all nodes) taking up a negligible amount of overall time
  - A trace view is really needed to see the diffuse behavior of the program
- Hot procedure: PASSED
  - Profile clearly showed that one function is responsible for most of the execution time

21 Bottleneck Identification Test Suite (2)
- Intensive server: PASSED
  - Profile showed most time spent in MPI_Recv for all nodes except the first node
  - Profile also illustrated that most time for the first node was spent in waste_time
- Ping-pong: PASSED
  - Easy to see from the profile that most time is spent in MPI_Send and MPI_Recv
  - pprof and paraprof also showed a large number of MPI calls
- Random barrier: TOSS-UP
  - Profile showed most time being spent in MPI_Barrier
  - However, the random nature of the barrier is not shown by a profile
  - A trace view is necessary to see the random-barrier behavior
- Small messages: PASSED
  - Profile showed one process spending most time in MPI_Send and the other process in MPI_Recv
- System time: FAILED
  - No built-in way to separate wall clock time into system time vs. user time
  - PAPI metrics can't record system time vs. user time either
- Wrong order: FAILED
  - Impossible to see communication behavior without a trace

22 TAU General Comments
- Good things
  - Supports profiling & tracing
  - Very portable
  - Wide range of software support
    - Several programming models & libraries supported
  - Visualization tools seem very stable
  - Good support for exporting data to other tools
- Things that could use improvement
  - Dependence on other software for basic functionality (instrumentation via PDToolkit or DynInst) makes installation difficult
  - Source code correlation could be better
    - Only at the function or function-call level (with call paths)
  - Export is nice, but many things are easier to do directly in other tools
    - For example, mpicc -mpilog yields a trace for Jumpshot instead of cparse, tau_instrument, wrapper Makefiles, …
    - TAU does add automatic instrumentation for profiling functions, which is an added benefit
  - Three-dimensional visualizations are nice, but the Cube viewer from KOJAK is easier to use and displays data in a very concise manner
    - Function-name text is also hard to read in the three-dimensional views
  - Some interoperability features (export to SLOG-2 and ALOG) did not work well in the version we tested
- TAU could potentially serve as a base for our UPC and SHMEM performance tool

23 TAU: Adding UPC & SHMEM SHMEM  Not much extra work needed  Have already created weak binding patches for GPSHMEM & created a wrapper library that calls the appropriate TAU functions UPC  If we have source code instrumentation, then just put in TAU* instrumentation calls in the appropriate places  If we do binary instrumentation, we’ll probably have to make major modifications to DynInst  In any case, once the UPC instrumentation problem is solved, adding support for UPC into TAU will not be too hard However, how to instrument UPC programs while retaining low overhead? Also, how to extend TAU to support more advanced analyses? Support for profiles and traces a nice bonus

24 Evaluation (1)
- Available metrics: 4/5
  - Supports recording execution time (broken down into call trees)
  - Supports several methods of gathering profile data
  - Supports all PAPI metrics for profiles
- Cost: 5/5
  - Free!
- Documentation quality: 3.5/5
  - User's manual is very good, but out of date
    - For example, three-dimensional visualizations are not covered in the manual
- Extensibility: 4/5
  - Open source, uses documented APIs
  - Can add support for new languages using source instrumentation
- Filtering and aggregation: 2.5/5
  - Filtering & aggregation available through the profile view
  - No advanced filters or custom aggregation methods built in for traces

25 Evaluation (2)
- Hardware support: 5/5
  - Many platforms supported: 64-bit Linux (Opteron, Itanium, Alpha, SPARC); IBM SP2 (AIX); IBM BlueGene/L; AlphaServer (Tru64); SPARC-based clusters (Solaris); SGI (IRIX 6.x) systems, including Indy, Power Challenge, Onyx, Onyx2, and the Origin 200/2000/3000 series; NEC SX-5; Cray X1 and T3E; Apple OS X; HP RISC systems (HP-UX)
- Heterogeneity support: 0/5 (not supported)
- Installation: 2.5/5
  - As simple as ./configure with options, then make install
  - However, dependence on other software for source or binary instrumentation makes installation time-consuming
- Interoperability: 5/5
  - Profile files use a simple ASCII format; trace files use a documented binary format
  - Can export to Vampir, Jumpshot/Upshot (ALOG & SLOG-2), CUBE, SDDF, and Paraver
- Learning curve: 2.5/5
  - Learning how to use the different Makefile wrappers and command-line programs takes a while
  - After a short period, instrumentation & tool usage is relatively easy

26 Evaluation (3)
- Manual overhead: 4/5
  - Automatic instrumentation of MPI calls on all platforms
  - Automatic instrumentation of all functions or a selected group of functions
  - Call path support gives almost the same information as instrumenting call sites
  - MPI and OpenMP instrumentation support
- Measurement accuracy: 5/5
  - CAMEL overhead < 1% for profiling and tracing when a few functions were instrumented
  - Overall, accuracy is pretty good except for a few cases
- Multiple executions: 3/5
  - Can relate profile metrics between runs in paraprof
  - Can store performance data in a DBMS (PerfDB)
    - PerfDB seems to be in a preliminary state, though
- Multiple analyses & views: 4/5
  - Both profiling and tracing are supported (although there is no built-in trace viewer)
  - Profile view has stacked bar charts, "regular" views, three-dimensional views, and histograms

27 Evaluation (4)
- Performance bottleneck identification: 3.5/5
  - No automatic bottleneck identification
  - Profile viewer is helpful for identifying the methods that take the most time
  - Lack of a built-in trace viewer makes identification of some bottlenecks impossible, but trace export means TAU could be combined with several other viewers to cover just about anything
- Profiling/tracing support: 4/5
  - Tracing & profiling supported
  - Default trace file size is reasonable but not the most compact
- Response time: 3/5
  - Loading profiles after a run is almost instantaneous using the paraprof viewer
  - Exporting traces to other tools is time-consuming (have to run tau_merge, tau_convert, etc.; a few extra disk I/Os)
- Software support: 5/5
  - Supports OpenMP, MPI, and several other programming models
  - A wide range of compilers are supported
  - Can support linking against any library, but does not instrument library functions
- Source code correlation: 2/5
  - Supported down to the function and function call site level (when collecting call paths is enabled)
- Searching: 0/5 (not supported)

28 Evaluation (5)
- System stability: 3/5
  - Software is generally stable
  - Bugs encountered:
    - Segfaults on the instrumented version of our LU code
    - SLOG-2 export seems to give Jumpshot-4 some trouble (several "unsupported event" messages on a few exported traces)
    - Exporting to ALOG format puts stray ": %d" lines in the ALOG file
- Technical support: 5/5
  - Good response from our contact (Sameer); most e-mails were answered within 48 hours with useful information