Sameer Shende, Allen D. Malony, and Alan Morris {sameer, malony, Department of Computer and Information Science NeuroInformatics.

Slides:



Advertisements
Similar presentations
Machine Learning-based Autotuning with TAU and Active Harmony Nicholas Chaimov University of Oregon Paradyn Week 2013 April 29, 2013.
Advertisements

K T A U Kernel Tuning and Analysis Utilities Department of Computer and Information Science Performance Research Laboratory University of Oregon.
Dynamic performance measurement control Dynamic event grouping Multiple configurable counters Selective instrumentation Application-Level Performance Access.
Workload Characterization using the TAU Performance System Sameer Shende, Allen D. Malony, Alan Morris University of Oregon {sameer,
Sameer Shende, Allen D. Malony, and Alan Morris {sameer, malony, Steven Parker, and J. Davison de St. Germain {sparker,
Robert Bell, Allen D. Malony, Sameer Shende Department of Computer and Information Science Computational Science.
Scalability Study of S3D using TAU Sameer Shende
Sameer Shende Department of Computer and Information Science Neuro Informatics Center University of Oregon Tool Interoperability.
Profiling S3D on Cray XT3 using TAU Sameer Shende
TAU: Tuning and Analysis Utilities. TAU Performance System Framework  Tuning and Analysis Utilities  Performance system framework for scalable parallel.
The TAU Performance Technology for Complex Parallel Systems (Performance Analysis Bring Your Own Code Workshop, NRL Washington D.C.) Sameer Shende, Allen.
Sameer Shende, Allen D. Malony, and Alan Morris {sameer, malony, Department of Computer and Information Science NeuroInformatics.
Nick Trebon, Alan Morris, Jaideep Ray, Sameer Shende, Allen Malony {ntrebon, amorris, Department of.
TAU Performance System
On the Integration and Use of OpenMP Performance Tools in the SPEC OMP2001 Benchmarks Bernd Mohr 1, Allen D. Malony 2, Rudi Eigenmann 3 1 Forschungszentrum.
Case Study: PETSc ex19  Non-linear solver (snes)  2-D driven cavity code  uses velocity-velocity formulation  finite difference discretization on a.
Allen D. Malony, Sameer Shende Department of Computer and Information Science Computational Science Institute University.
The TAU Performance System: Advances in Performance Mapping Sameer Shende University of Oregon.
Performance and Memory Evaluation using TAU Sameer Shende, Allen D. Malony, Alan Morris University of Oregon {sameer, malony, Peter.
Workshop on Performance Tools for Petascale Computing 9:30 – 10:30am, Tuesday, July 17, 2007, Snowbird, UT Sameer S. Shende
TAU Performance System Alan Morris, Sameer Shende, Allen D. Malony University of Oregon {amorris, sameer,
Performance Tools BOF, SC’07 5:30pm – 7pm, Tuesday, A9 Sameer S. Shende Performance Research Laboratory University.
Tools for Performance Discovery and Optimization Sameer Shende, Allen D. Malony, Alan Morris, Kevin Huck University of Oregon {sameer, malony, amorris,
Performance Instrumentation and Measurement for Terascale Systems Jack Dongarra, Shirley Moore, Philip Mucci University of Tennessee Sameer Shende, and.
June 2, 2003ICCS Performance Instrumentation and Measurement for Terascale Systems Jack Dongarra, Shirley Moore, Philip Mucci University of Tennessee.
Allen D. Malony Department of Computer and Information Science Performance Research Laboratory NeuroInformatics Center University.
Workshop on Performance Tools for Petascale Computing 9:30 – 10:30am, Tuesday, July 17, 2007, Snowbird, UT Sameer S. Shende
Performance Evaluation of S3D using TAU Sameer Shende
TAU Performance Toolkit (WOMPAT 2004 OpenMP Lab) Sameer Shende, Allen D. Malony University of Oregon {sameer,
Scalability Study of S3D using TAU Sameer Shende
Optimization of Instrumentation in Parallel Performance Evaluation Tools Sameer Shende, Allen D. Malony, Alan Morris University of Oregon {sameer,
Performance and Memory Evaluation using the TAU Performance System Sameer Shende, Allen D. Malony, Alan Morris University of Oregon {sameer, malony,
Kai Li, Allen D. Malony, Robert Bell, Sameer Shende Department of Computer and Information Science Computational.
The TAU Performance System Sameer Shende, Allen D. Malony, Robert Bell University of Oregon.
Sameer Shende, Allen D. Malony Computer & Information Science Department Computational Science Institute University of Oregon.
SC’01 Tutorial Nov. 7, 2001 TAU Performance System Framework  Tuning and Analysis Utilities  Performance system framework for scalable parallel and distributed.
Allen D. Malony Performance Research Laboratory (PRL) Neuroinformatics Center (NIC) Department.
Using TAU Performance Technology in ESMF Sameer Shende, Nancy Collins University of Oregon, UCAR
Paradyn Week – April 14, 2004 – Madison, WI DPOMP: A DPCL Based Infrastructure for Performance Monitoring of OpenMP Applications Bernd Mohr Forschungszentrum.
Score-P – A Joint Performance Measurement Run-Time Infrastructure for Periscope, Scalasca, TAU, and Vampir Alexandru Calotoiu German Research School for.
Using TAU on SiCortex Alan Morris, Aroon Nataraj Sameer Shende, Allen D. Malony University of Oregon {amorris, anataraj, sameer,
Profile Analysis with ParaProf Sameer Shende Performance Reseaerch Lab, University of Oregon
Dynamic performance measurement control Dynamic event grouping Multiple configurable counters Selective instrumentation Application-Level Performance Access.
PerfExplorer Component for Performance Data Analysis Kevin Huck – University of Oregon Boyana Norris – Argonne National Lab Li Li – Argonne National Lab.
Allen D. Malony, Sameer S. Shende, Alan Morris, Robert Bell, Kevin Huck, Nick Trebon, Suravee Suthikulpanit, Kai Li, Li Li
Preparatory Research on Performance Tools for HPC HCS Research Laboratory University of Florida November 21, 2003.
Allen D. Malony Performance Research Laboratory (PRL) Neuroinformatics Center (NIC) Department.
Tool Visualizations, Metrics, and Profiled Entities Overview [Brief Version] Adam Leko HCS Research Laboratory University of Florida.
1 SciDAC High-End Computer System Performance: Science and Engineering Jack Dongarra Innovative Computing Laboratory University of Tennesseehttp://
Allen D. Malony, Sameer S. Shende, Robert Bell Kai Li, Li Li, Kevin Huck Department of Computer.
Simplifying the Usage of Performance Evaluation Tools: Experiences with TAU and DyninstAPI Paradyn/Condor Week 2010, Rm 221, Fluno Center, U. of Wisconsin,
21 Sep UPC Performance Analysis Tool: Status and Plans Professor Alan D. George, Principal Investigator Mr. Hung-Hsun Su, Sr. Research Assistant.
Allen D. Malony Department of Computer and Information Science Performance Research Laboratory University.
Allen D. Malony Department of Computer and Information Science Performance Research Laboratory.
TAU Performance System ® TAU is a profiling and tracing toolkit that supports programs written in C, C++, Fortran, Java, Python,
TAU Performance System (ACTS Workshop LBL) Sameer Shende, Allen D. Malony University of Oregon {sameer,
Introduction to the TAU Performance System®
Performance Technology for Scalable Parallel Systems
TAU integration with Score-P
Tutorial Outline – Part 1
Allen D. Malony, Sameer Shende
TAU Parallel Performance System
Advanced TAU Commander
Performance Technology for Complex Parallel and Distributed Systems
A configurable binary instrumenter
TAU Parallel Performance System
TAU: A Framework for Parallel Performance Analysis
TAU Performance System (ACTS Workshop LBL) Sameer Shende, Allen D
Outline Introduction Motivation for performance mapping SEAA model
Parallel Program Analysis Framework for the DOE ACTS Toolkit
Presentation transcript:

Sameer Shende, Allen D. Malony, and Alan Morris {sameer, malony, Department of Computer and Information Science NeuroInformatics Center University of Oregon TAU Performance Tools

2 Acknowledgements  Pete Beckman, ANL  Suravee Suthikulpanit, U. Oregon  Aroon Nataraj, U. Oregon  Katherine Riley, ANL

3 Outline  Overview of features  Instrumentation  Measurement  Analysis tools  Linux kernel profiling with TAU

4 TAU Performance System Framework  Tuning and Analysis Utilities  Performance system framework for scalable parallel and distributed high- performance computing  Targets a general complex system computation model  nodes / contexts / threads  Multi-level: system / software / parallelism  Measurement and analysis abstraction  Integrated toolkit for performance instrumentation, measurement, analysis, and visualization  Portable, configurable performance profiling/tracing facility  Open software approach  University of Oregon, LANL, FZJ Germany 

5 TAU Performance System Architecture Jumpshot Paraver paraprof

6 TAU Instrumentation Approach  Support for standard program events  Routines  Classes and templates  Statement-level blocks  Support for user-defined events  Begin/End events (“user-defined timers”)  Atomic events (e.g., size of memory allocated/freed)  Selection of event statistics  Support definition of “semantic” entities for mapping  Support for event groups  Instrumentation optimization (eliminate instrumentation in lightweight routines)

7 TAU Instrumentation  Flexible instrumentation mechanisms at multiple levels  Source code  manual (TAU API, TAU Component API)  automatic C, C++, F77/90/95 (Program Database Toolkit (PDT)) OpenMP (directive rewriting (Opari), POMP spec)  Object code  pre-instrumented libraries (e.g., MPI using PMPI)  statically-linked and dynamically-linked  Executable code  dynamic instrumentation (pre-execution) (DynInstAPI)  virtual machine instrumentation (e.g., Java using JVMPI)  Proxy Components

8 Using TAU – A tutorial  Configuration  Instrumentation  Manual  MPI – Wrapper interposition library  PDT- Source rewriting for C,C++, F77/90/95  OpenMP – Directive rewriting  Component based instrumentation – Proxy components  Binary Instrumentation  DyninstAPI – Runtime instrumentation/Rewriting binary  Java – Runtime instrumentation  Python – Runtime instrumentation  Measurement  Performance Analysis

9 TAU Measurement System Configuration  configure [OPTIONS]  {-c++=, -cc= } Specify C++ and C compilers  {-pthread, -sproc}Use pthread or SGI sproc threads  -openmpUse OpenMP threads  -jdk= Specify Java instrumentation (JDK)  -opari= Specify location of Opari OpenMP tool  -papi= Specify location of PAPI  -pdt= Specify location of PDT  -dyninst= Specify location of DynInst Package  -mpi[inc/lib]= Specify MPI library instrumentation  -shmem[inc/lib]= Specify PSHMEM library instrumentation  -python[inc/lib]= Specify Python instrumentation  -epilog= Specify location of EPILOG  -vtf= Specify location of VTF3 trace package  -arch= Specify architecture explicitly (bgl,ibm64,ibm64linux…)

10 TAU Measurement System Configuration  configure [OPTIONS]  -TRACEGenerate binary TAU traces  -PROFILE (default) Generate profiles (summary)  -PROFILECALLPATHGenerate call path profiles  -PROFILEPHASEGenerate phase based profiles  -PROFILEMEMORYTrack heap memory for each routine  -MULTIPLECOUNTERSUse hardware counters + time  -COMPENSATECompensate timer overhead  -CPUTIMEUse usertime+system time  -PAPIWALLCLOCKUse PAPI’s wallclock time  -PAPIVIRTUALUse PAPI’s process virtual time  -SGITIMERSUse fast IRIX timers  -LINUXTIMERSUse fast x86 Linux timers

11 TAU Measurement Configuration – Examples ./configure –arch=bgl –mpi –pdt=/usr/pdtoolkit pdt_c++=xlC  Use IBM BlueGene/L arch, XL compilers, MPI and PDT  Builds /bgl/bin/tau_instrumentor (executes on the front-end) and /bgl/lib/Makefile.tau-mpi-pdt stub ./configure –TRACE –PROFILE –arch=bgl –mpi  Enable both TAU profiling and tracing ./configure -c++=xlC_r -cc=xlc_r -mpi –pdt=/home/pdtoolkit –TRACE –vtf=/usr/vtf  Use IBM’s xlC_r and xlc_r compilers with VTF3, PDT, MPI packages and multiple counters for measurements on the ppc64 front-end node  Typically configure multiple measurement libraries

12 TAU Performance Framework Interfaces  PDT [U. Oregon, LANL, FZJ] for instrumentation of C++, C99, F95 source code  PAPI [UTK] & PCL[FZJ] for accessing hardware performance counters data  DyninstAPI [U. Maryland, U. Wisconsin] for runtime instrumentation  KOJAK [FZJ, UTK]  Epilog trace generation library  CUBE callgraph visualizer  Opari OpenMP directive rewriting tool  Vampir/Intel® Trace Analyzer [Pallas/Intel]  VTF3 trace generation library for Vampir [TU Dresden] (available from TAU website)  Paraver trace visualizer [CEPBA]  Jumpshot-4 trace visualizer [MPICH, ANL]  JVMPI from JDK for Java program instrumentation [Sun]  Paraprof profile browser/PerfDMF database supports:  TAU format  Gprof [GNU]  HPM Toolkit [IBM]  MpiP [ORNL, LLNL]  Dynaprof [UTK]  PSRun [NCSA]  PerfDMF database can use Oracle, MySQL or PostgreSQL (IBM DB2 support planned)

13 Memory Profiling in TAU  Configuration option –PROFILEMEMORY  Records global heap memory utilization for each function  Takes one sample at beginning of each function and associates the sample with function name  Independent of instrumentation/measurement options selected  No need to insert macros/calls in the source code  User defined atomic events appear in profiles/traces

14 Memory Profiling in TAU Flash2 code profile on IBM BlueGene/L [MPI rank 0]

15 Memory Profiling in TAU  Instrumentation based observation of global heap memory (not per function)  call TAU_TRACK_MEMORY()  Triggers one sample every 10 secs  call TAU_TRACK_MEMORY_HERE()  Triggers sample at a specific location in source code  call TAU_SET_INTERRUPT_INTERVAL(seconds)  To set inter-interrupt interval for sampling  call TAU_DISABLE_TRACKING_MEMORY()  To turn off recording memory utilization  call TAU_ENABLE_TRACKING_MEMORY()  To re-enable tracking memory utilization

16 Profile Measurement – Three Flavors  Flat profiles  Time (or counts) spent in each routine (nodes in callgraph).  Exclusive/inclusive time, no. of calls, child calls  E.g,: MPI_Send, foo, …  Callpath Profiles  Flat profiles, plus  Sequence of actions that led to poor performance  Time spent along a calling path (edges in callgraph)  E.g., “main=> f1 => f2 => MPI_Send” shows the time spent in MPI_Send when called by f2, when f2 is called by f1, when it is called by main. Depth of this callpath = 4 (TAU_CALLPATH_DEPTH environment variable)  Phase based profiles  Flat profiles, plus  Flat profiles under a phase (nested phases are allowed)  Default “main” phase has all phases and routines invoked outside phases  Supports static or dynamic (per-iteration) phases  E.g., “IO => MPI_Send” is time spent in MPI_Send in IO phase

17 TAU Timers and Phases  Static timer  Shows time spent in all invocations of a routine (foo)  E.g., “foo()” 100 secs, 100 calls  Dynamic timer  Shows time spent in each invocation of a routine  E.g., “foo() 3” 4.5 secs, “foo 10” 2 secs (invocations 3 and 10 respectively)  Static phase  Shows time spent in all routines called (directly/indirectly) by a given routine (foo)  E.g., “foo() => MPI_Send()” 100 secs, 10 calls shows that a total of 100 secs were spent in MPI_Send() when it was called by foo.  Dynamic phase  Shows time spent in all routines called by a given invocation of a routine.  E.g., “foo() 4 => MPI_Send()” 12 secs, shows that 12 secs were spent in MPI_Send when it was called by the 4 th invocation of foo.

18 Flat Profile – Pprof Profile Browser  Intel Linux cluster  F90 + MPICH  Profile - Node - Context - Thread  Events - code - MPI

19 Flat Profile – TAU’s Paraprof Profile Browser

20 Callpath Profile

21 Callpath Profile - parent/node/child view

22 Callpath Profiling

23 Phase Profile – Dynamic Phases In 51 st iteration, time spent in MPI_Waitall was secs Total time spent in MPI_Waitall was secs across all 92 iterations

24 Using TAU  Install TAU % configure ; make clean install  Instrument application  TAU Profiling API  Typically modify application makefile  include TAU’s stub makefile, modify variables  Set environment variables  directory where profiles/traces are to be stored  name of merged trace file, retain intermediate trace files, etc.  Execute application % mpirun –np a.out;  Analyze performance data  paraprof, vampir/traceanalyzer, pprof, paraver …

25 AutoInstrumentation using TAU_COMPILER  $(TAU_COMPILER) stub Makefile variable in release  Invokes PDT parser, TAU instrumentor, compiler through tau_compiler.sh shell script  Requires minimal changes to application Makefile  Compilation rules are not changed  User adds $(TAU_COMPILER) before compiler name  F90=mpxlf90 Changes to F90= $(TAU_COMPILER) mpxlf90  Passes options from TAU stub Makefile to the four compilation stages  Uses original compilation command if an error occurs

26 TAU_COMPILER – Improving Integration in Makefiles OLD include /usr/tau-2.14/include/Makefile CXX = mpCC F90 = mpxlf90_r PDTPARSE = $(PDTDIR)/ $(PDTARCHDIR)/bin/cxxparse TAUINSTR = $(TAUROOT)/$(CONFIG_ARCH)/ bin/tau_instrumentor CFLAGS = $(TAU_DEFS) $(TAU_INCLUDE) LIBS = $(TAU_MPI_LIBS) $(TAU_LIBS) -lm OBJS = f1.o f2.o f3.o … fn.o app: $(OBJS) $(CXX) $(LDFLAGS) $(OBJS) -o $(LIBS).cpp.o: $(PDTPARSE) $< $(TAUINSTR) $*.pdb $< -o $*.i.cpp –f select.dat $(CC) $(CFLAGS) -c $*.i.cpp NEW include /usr/tau-2.14/include/Makefile CXX = $(TAU_COMPILER) mpCC F90 = $(TAU_COMPILER) mpxlf90_r CFLAGS = LIBS = -lm OBJS = f1.o f2.o f3.o … fn.o app: $(OBJS) $(CXX) $(LDFLAGS) $(OBJS) -o $(LIBS).cpp.o: $(CC) $(CFLAGS) -c $<

27 TAU_COMPILER Options  Optional parameters for $(TAU_COMPILER):  -optVerboseTurn on verbose debugging messages  -optPdtDir="" PDT architecture directory. Typically $(PDTDIR)/$(PDTARCHDIR)  -optPdtF95Opts="" Options for Fortran parser in PDT (f95parse)  -optPdtCOpts="" Options for C parser in PDT (cparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)  -optPdtCxxOpts="" Options for C++ parser in PDT (cxxparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)  -optPdtF90Parser="" Specify a different Fortran parser. For e.g., f90parse instead of f95parse  -optPdtUser="" Optional arguments for parsing source code  -optPDBFile="" Specify [merged] PDB file. Skips parsing phase.  -optTauInstr="" Specify location of tau_instrumentor. Typically $(TAUROOT)/$(CONFIG_ARCH)/bin/tau_instrumentor  -optTauSelectFile="" Specify selective instrumentation file for tau_instrumentor  -optTau="" Specify options for tau_instrumentor  -optCompile="" Options passed to the compiler. Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)  -optLinking="" Options passed to the linker. Typically $(TAU_MPI_FLIBS) $(TAU_LIBS) $(TAU_CXXLIBS)  -optNoMpi Removes -l*mpi* libraries during linking (default)  -optKeepFiles Does not remove intermediate.pdb and.inst.* files e.g., OPT=-optTauSelectFile=select.tau –optPDBFile=merged.pdb F90 = $(TAU_COMPILER) $(OPT) blrts_xlf90

28 Program Database Toolkit (PDT)  Program code analysis framework  develop source-based tools  High-level interface to source code information  Integrated toolkit for source code parsing, database creation, and database query  Commercial grade front-end parsers  Portable IL analyzer, database format, and access API  Open software approach for tool development  Multiple source languages  Implement automatic performance instrumentation tools  tau_instrumentor

29 Program Database Toolkit Component source/ Library C / C++ parser Fortran 77/90/95 parser C / C++ IL analyzer Fortran 77/90/95 IL analyzer Program Database Files IL DUCTAPE tau_pg SILOON CHASM TAU_instr Proxy Component Application component glue C++ / F90 interoperability Automatic source instrumentation

30 TAU Tracing Enhancements  Configure TAU with -TRACE –vtf=dir option % configure –TRACE –vtf= … Generates tau_merge, tau2vtf tools in /ppc64/bin dir % configure -arch=bgl -TRACE -pdt= -pdt_c++=xlC -mpi Generates library in /bgl/lib directory  Execute application % mpirun -partition Pgeneral2 -np 16 -cwd `pwd` -exe `pwd`/  Merge and convert trace files to VTF3 format % tau_merge *.trc app.trc % tau2vtf app.trc tau.edf app.vpt.gz % traceanalyzer foo.vpt.gz

31 Intel ® Traceanalyzer (Vampir) Global Timeline

32 Visualizing TAU Traces with Counters/Samples

33 Visualizing TAU Traces with Counters/Samples

34 ParaProf TAU Performance Data Management Framework Performance analysis programs PerfDMF Java API... JDBC PostgreSQL Oracle MySQL Database Profile meta-data Raw performance data Hpmtoolkit Psrun Dynaprof mpiP Gprof … … C API…

35 Paraprof Manager – Performance Database

36 Paraprof Scalable Histogram View

37 MPI_Barrier Histogram over 16K cpus of BG/L

38 Paraprof Profile Browser

39 Paraprof – Full Callgraph View

40 Paraprof – Highlight Callpaths

41 Paraprof – Callgraph View (Zoom In +/Out -)

42 Paraprof – Callgraph View (Zoom In +/Out -)

43 Paraprof - Function Data Window

44 KOJAK’s CUBE [UTK, FZJ] Browser

45 Linux Kernel Profiling using TAU  Identifying points in kernel source for instrumentation  Developing TAU’s kernel profiling API  Kernel compiled with TAU instrumentation  Maintains per process performance data for each kernel routine  Performance data accessible via /proc filesystem  Instrumented application maintains data in userspace  Performance data from application and kernel merged at program termination

46 Kernel Profiling Issues for IBM BlueGene/L  I/O node kernel - Linux kernel approach  Compute node kernel:  No daemon processes  Single address space  Single performance database & callstack across user/kernel  Keeps track of one process only (optimization)  Instrumented compute node kernel

47 TAU Performance System Status (v )  Computing platforms (selected)  IBM BGL, AIX, pSeries Linux, SGI Origin, Cray RedStorm, T3E / SV-1 / X1, HP (Compaq) SC (Tru64), Sun, Hitachi SR8000, NEC SX-5/6, Linux clusters (IA- 32/64, Alpha, PPC, PA-RISC, Power, Opteron), Apple (G4/5, OS X), Windows,…  Programming languages  C, C++, Fortran 77/90/95, HPF, Java, OpenMP, Python  Thread libraries  pthreads, SGI sproc, Java,Windows, OpenMP  Compilers (selected)  IBM, Intel, Intel KAI, PGI, GNU, Fujitsu, Sun, NAG, Microsoft, SGI, Cray, HP, NEC, Absoft, Lahey

48 Support Acknowledgements  Department of Energy (DOE)  Office of Science contracts  University of Utah DOE ASCI Level 1 sub-contract  DOE ASC/NNSA Level 3 contract  NSF Software and Tools for High-End Computing Grant  Research Centre Juelich  John von Neumann Institute for Computing  Dr. Bernd Mohr  Los Alamos National Laboratory