Using TAU on SiCortex
Alan Morris, Aroon Nataraj, Sameer Shende, Allen D. Malony
University of Oregon
{amorris, anataraj, sameer,

Slide 2: Outline
- What is TAU?
- Instrumentation
- Measurement
- Invoking on SiCortex: tauex
- Demo
  - sweep3D
- Visualizing performance results

Slide 3: TAU Performance System
- Tuning and Analysis Utilities (13+ year project effort)
- Performance measurement framework for HPC systems
  - Portable, scalable, flexible, and parallel
- Integrated toolkit for performance problem solving
  - Automatic instrumentation
  - Highly configurable measurement system with support for many flavors of profiling and tracing
  - Portable analysis and visualization tools
  - Performance data management and data mining

Slide 4: TAU Instrumentation Approach
- Support for standard program events
  - Routines
  - Classes and templates
  - Finer-grain: loop-level
- Support for user-defined events
  - Begin/End events ("user-defined timers")
  - Atomic events (e.g., size of memory allocated/freed)
- Support for event groups
  - Selective examination of performance data
  - Runtime disabling of groups
- Instrumentation optimization
  - Selective instrumentation of events (instrument only what is needed)
  - tau_reduce: generates a selective instrumentation file
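The idea behind instrumentation optimization can be sketched in a few lines of Python. This is only an illustration, not how tau_reduce is implemented: the routine names, the sample numbers, and the two thresholds (more than 1,000,000 calls and under 2 microseconds per call) are assumptions chosen for the example. The output uses the BEGIN_EXCLUDE_LIST / END_EXCLUDE_LIST layout of a TAU selective instrumentation file.

```python
from dataclasses import dataclass


@dataclass
class RoutineStats:
    name: str              # routine name as it appears in the profile (hypothetical here)
    calls: int             # number of calls observed
    exclusive_usec: float  # total exclusive time in microseconds


def build_exclude_list(stats, min_calls=1_000_000, max_usec_per_call=2.0):
    """Pick routines that are called very often but do little work per call.

    Both thresholds are illustrative assumptions, not values taken from the slides.
    """
    excluded = []
    for s in stats:
        if s.calls == 0:
            continue  # never called, nothing to measure
        usec_per_call = s.exclusive_usec / s.calls
        if s.calls > min_calls and usec_per_call < max_usec_per_call:
            excluded.append(s.name)
    return excluded


def render_select_file(excluded):
    """Render the excluded routines as a selective instrumentation file."""
    lines = ["BEGIN_EXCLUDE_LIST", *excluded, "END_EXCLUDE_LIST"]
    return "\n".join(lines) + "\n"


# Made-up profile data for three routines.
sample = [
    RoutineStats("tiny_helper", 5_000_000, 4_000_000.0),  # 0.8 usec/call
    RoutineStats("solve", 100, 9_000_000.0),              # few calls, heavy
    RoutineStats("MPI_Send", 2_000_000, 30_000_000.0),    # 15 usec/call
]
select_text = render_select_file(build_exclude_list(sample))
print(select_text)
```

Only tiny_helper is excluded here: instrumenting it would add overhead to millions of very short calls, while the other two routines are either rarely called or expensive enough per call that the measurement cost is small in comparison.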

Slide 5: TAU Instrumentation
- Flexible instrumentation mechanisms at multiple levels
  - Source code
    - manual (TAU API, CCA TAU Component API)
    - automatic
      - C, C++, F77/90/95 (Program Database Toolkit (PDT))
      - OpenMP (directive rewriting (Opari), POMP spec)
  - Library level
    - pre-instrumented libraries (e.g., MPI using PMPI)
    - statically-linked and dynamically-linked
  - Executable code
    - dynamic instrumentation (pre-execution) (DynInstAPI)
    - virtual machine instrumentation (e.g., Java using JVMPI)
  - Runtime linking (LD_PRELOAD)

Slide 6: Automatic Instrumentation
- We provide compiler wrapper scripts
  - Simply replace mpif90 with tauf90
  - Automatically instruments Fortran source code and links with the TAU MPI wrapper libraries
  - Use taucc and taucxx for C/C++

Before:

    CXX = mpicxx
    F90 = mpif90
    CFLAGS =
    LIBS = -lm
    OBJS = f1.o f2.o f3.o … fn.o

    app: $(OBJS)
    	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
    .cpp.o:
    	$(CXX) $(CFLAGS) -c $<

After:

    CXX = taucxx
    F90 = tauf90
    CFLAGS =
    LIBS = -lm
    OBJS = f1.o f2.o f3.o … fn.o

    app: $(OBJS)
    	$(CXX) $(LDFLAGS) $(OBJS) -o $@ $(LIBS)
    .cpp.o:
    	$(CXX) $(CFLAGS) -c $<

Slide 7: Measurement Options
- Flat profiles
  - Time (or counts) spent in each routine (nodes in callgraph)
  - Exclusive/inclusive time, number of calls, child calls
  - Support for hardware counters (PAPI), multiple counters
- Callpath profiles
  - Flat profiles, plus
  - Time spent along a calling path (edges in callgraph)
  - E.g., "main => f1 => f2 => MPI_Send" shows the time spent in MPI_Send when called by f2, when f2 is called by f1, when f1 is called by main
  - Configurable callpath depth limit (TAU_CALLPATH_DEPTH environment variable)
- Tracing
  - VAMPIRTRACE
  - TAU trace format; converters: tau2slog2, tau2vtf, tau2otf
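The difference between flat and callpath profiles is easier to see on a toy example. The Python sketch below is not TAU code: it replays a hand-written list of enter/exit events (timestamps in arbitrary units, all values made up) and accumulates inclusive time, exclusive time, and call counts per routine, plus inclusive time per calling path. The depth parameter only mimics the idea of a callpath depth limit by keeping the last few routines of each path; that this matches the exact behaviour of TAU_CALLPATH_DEPTH is an assumption.

```python
from collections import defaultdict


def profile(events, depth=None):
    """Build flat and callpath profiles from (time, kind, routine) events.

    kind is "enter" or "exit". depth, if given, must be a positive integer
    and truncates each calling path to its last `depth` routines.
    Recursive routines are not handled specially in this sketch.
    """
    stack = []  # each frame: [routine, start_time, time_spent_in_children]
    flat = defaultdict(lambda: {"calls": 0, "inclusive": 0, "exclusive": 0})
    callpath = defaultdict(int)  # "a => b => c" -> inclusive time

    for t, kind, name in events:
        if kind == "enter":
            stack.append([name, t, 0])
            continue
        top_name, start, child_time = stack.pop()
        assert top_name == name, "events must be properly nested"
        inclusive = t - start
        entry = flat[name]
        entry["calls"] += 1
        entry["inclusive"] += inclusive
        entry["exclusive"] += inclusive - child_time
        if stack:
            stack[-1][2] += inclusive  # charge this call to the parent's child time
        path = [frame[0] for frame in stack] + [name]
        if depth is not None:
            path = path[-depth:]
        callpath[" => ".join(path)] += inclusive

    return dict(flat), dict(callpath)


# main calls f1 -> f2 -> MPI_Send, then calls MPI_Send directly.
events = [
    (0, "enter", "main"),
    (1, "enter", "f1"),
    (2, "enter", "f2"),
    (3, "enter", "MPI_Send"),
    (7, "exit", "MPI_Send"),
    (8, "exit", "f2"),
    (9, "exit", "f1"),
    (10, "enter", "MPI_Send"),
    (12, "exit", "MPI_Send"),
    (15, "exit", "main"),
]

flat, paths = profile(events)
for routine, row in flat.items():
    print(routine, row)
for path, inclusive in paths.items():
    print(path, inclusive)
```

In the flat profile, MPI_Send appears once with two calls and six time units in total. The callpath profile splits the same time across "main => f1 => f2 => MPI_Send" (four units) and "main => MPI_Send" (two units), which is the extra information the edges of the callgraph carry.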

Slide 8: Running Applications with TAU on SiCortex
- New tool, tauex

    Usage: tauex [options] [--]
    Options:
      -d: Enable debugging output, use repeatedly for more output.
      -h: Print this message.
      -i: Print information about the host machine.
      -s: Dump the shell environment variables and exit.
      -U: User mode counts
      -K: Kernel mode counts
      -S: Supervisor mode counts
      -I: Interrupt mode counts
      -l: List events
      -L : Describe event
      -a: Count all native events (implies -m)
      -m: Multiple runs (enough runs of exe to gather all events)
      -e : Specify PAPI preset or native event
      -T : specify TAU option
      -v: Debug/Verbose mode
      -XrunTAU- : specify TAU library directly

Slide 9: Demo
- Application: Sweep3D
- Build standard sweep3d application (un-instrumented)
- Run standard (un-instrumented) sweep3d using tauex
  - Provides MPI-only profiles
- Build sweep3d with automatic TAU instrumentor
- Run instrumented sweep3d with tauex
  - Provides application-level and MPI events

Slide 10: Using tauex
- Use with an uninstrumented executable for MPI profiling and tracing

    $ mpif90 ring.f90 -o ring
    $ srun -pscx -n4 tauex ./ring
    $ cd ring.tau.1669/MULTI__P_WALL_CLOCK_TIME
    $ pprof
    …
    FUNCTION SUMMARY (mean):
    %Time  Exclusive msec  Inclusive total msec  #Call  #Subrs  Inclusive usec/call  Name
    [numeric columns not preserved in the transcript; routines listed:]
      TAU application
      MPI_Init()
      MPI_Finalize()
      MPI_Barrier()
      MPI_Recv()
      MPI_Bcast()
      MPI_Send()
      MPI_Comm_size()
      MPI_Comm_rank()

Slide 11: Using TAU compiler wrappers

    $ tauf90 ring.f90 -o ring
    # verbose output shows each step

    Debug: Parsing with PDT Parser
    Executing> /usr/share/PDT/mips/bin/f95parse ring.f90 -I/home/amorris/usr/include

    Debug: Instrumenting with TAU
    Executing> /usr/bin/tau_instrumentor ring.pdb ring.f90 -o ring.inst.f90

    Debug: Compiling (Individually) with Instrumented Code
    Executing> pathf95 -mabi=64 -I. -c ring.inst.f90 -o ring.o

    Debug: Linking (Together) object files
    Executing> pathf95 ring.o -mabi=64 -lTAU -Wl,-rpath -lpfm -lpapi -lpfm -lpthread -L/usr/lib/gcc/mips64el-gentoo-linux-gnu/4.1.2/ -lstdc++ -lgcc_s -lscmpi -o ring

    Debug: cleaning inst file
    Executing> /bin/rm -f ring.inst.f90

    Debug: cleaning PDB file
    Executing> /bin/rm -f ring.pdb

Slide 12: Using tauex
- Use with instrumented executables for application+MPI profiling and tracing

    $ tau_f90.sh ring.f90 -o ring
    $ srun -pscx -n4 tauex -e CPU_CYCLES ./ring
    $ cd ring.tau.1674/MULTI__CPU_CYCLES
    $ pprof
    FUNCTION SUMMARY (mean):
    %Time  Exclusive counts  Inclusive total counts  #Call  #Subrs  Count/Call  Name
    [numeric columns not preserved in the transcript; routines listed:]
      MAIN
      MPI_Init()
      FUNC
      MPI_Barrier()
      MPI_Recv()
      MPI_Finalize()
      MPI_Bcast()
      MPI_Send()
      MPI_Comm_size()
      MPI_Comm_rank()

Slide 13: Using tauex
- Other typical usage scenarios

    # floating point instruction counts and time (compute derived
    # FLOPS in ParaProf)
    $ tauex -e P_WALL_CLOCK_TIME -e PAPI_FP_INS

    # Generate callpath profiles
    $ tauex -e PAPI_FP_INS -T callpath

    # Generate OTF traces for Vampir
    $ tauex -e PAPI_FP_INS -T vampirtrace

    # Generate Epilog traces for Kojak
    $ tauex -T epilog

Slide 14: Using TAU with FLASH on SiCortex
- To use TAU with FLASH on SiCortex platforms, simply specify -tau= in setup

    # On Full Disclosure:
    $ ./setup Sedov -2d -auto -site=sicortex -objdir=tau -tau=/home/amorris/usr/share/TAU/64/Makefile.tau-multiplecounters-pathcc-mpi-papi-pdt

    # Build
    $ cd tau ; make

    # Run (using tauex)
    # This will generate callpath profiles with time and floating point instruction metrics
    $ srun -pscx -n16 tauex -T callpath -e P_WALL_CLOCK_TIME -e PAPI_FP_INS ./flash3

Slide 15: Using ParaProf
- Not yet available on SiCortex mips nodes (requires Java)
- For now, tar up the .tau. directory and copy it to another machine
- ParaProf can be run on Linux, Windows, or Mac
- If you don't have TAU/paraprof, but have Java Web Start, visit to run it

    scx-m23-n6> tar czf flash3.tau.1578.tar.gz flash3.tau.1578
    scx-m32-n6> scp flash3.tau.1578.tar.gz somewhere-else:
    scx-m32-n6> ssh somewhere-else
    se> tar -xzf flash3.tau.1578.tar.gz
    se> paraprof flash3.tau.1578

Slide 16: ParaProf - Full Profile (FLASH)
[Figure callouts: MPI_Barrier, IO routines]

Slide 17: ParaProf - Statistics Table (Uintah)

Slide 18: ParaProf - Callgraph View (MFIX)

Slide 19: ParaProf - Histogram View (Miranda)
[Figure panels: 8k processors, 16k processors]
- Scalable 2D displays

Slide 20: ParaProf - 3D Full Profile (Miranda)
[Figure: 16k processors]

Slide 21: ParaProf - 3D Scatterplot (Miranda)
- Each point is a "thread" of execution
- Relation between four routines shown at once

Slide 22: Tracing (Vampir) - Uintah
- Trace analysis provides in-depth understanding of temporal event and message passing relationships
- Traces can even store hardware counters

Slide 23: VNG Timeline Display (Miranda on BGL)

Slide 24: Thank You
- TAU should soon be part of the SiCortex standard install
- Check out: