TAU: Performance Technology for Productive, High Performance Computing

1 TAU: Performance Technology for Productive, High Performance Computing
Cray, Seattle, 10am, Jan 13, 2010
Sameer Shende
Director, Performance Research Laboratory
University of Oregon, Eugene, OR

2 Acknowledgements: University of Oregon
Dr. Allen D. Malony, Professor, CIS Dept., and Director, NeuroInformatics Center
Alan Morris, Senior software engineer
Wyatt Spear, Software engineer
Scott Biersdorff, Software engineer
Dr. Robert Yelle, Research faculty
Suzanne Millstein, Ph.D. student
Ivan Pulleyn, Systems administrator

3 Outline
Introduction to TAU
Instrumentation
Measurement
Analysis
Examples of TAU usage
Future work/collaboration

4 What is TAU?
TAU is a performance evaluation tool
It is a parallel profiling and tracing toolkit
Profiling shows you how much (total) time was spent in each routine
Tracing shows you when the events take place in each process along a timeline
Profiling and tracing can measure time as well as hardware performance counters from your CPU
TAU can automatically instrument your source code (routines, loops, I/O, memory, phases, etc.)
It supports C++, C, Chapel, UPC, Fortran, Python and Java
TAU runs on all HPC platforms and it is free (BSD-style license)
TAU has instrumentation, measurement and analysis tools
To use TAU, you need to set a couple of environment variables and substitute the name of the compiler with a TAU shell script
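As a sketch of how small that change is, assuming an illustrative (not real) TAU install path; the echo stands in for the command a Makefile would now run:

```shell
# Workflow sketch: the only changes are one environment variable and the
# compiler name. The TAU path below is illustrative, not a real install.
export TAU_MAKEFILE=/path/to/tau/craycnl/lib/Makefile.tau-mpi-pdt-pgi
F90=tau_f90.sh                 # instead of: F90=ftn
echo "$F90 app.f90 -o app"     # the command a Makefile would now run
```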

5 TAU Performance System®
Integrated toolkit for performance problem solving: instrumentation, measurement, analysis, visualization
Portable performance profiling and tracing facility
Performance data management and data mining
Based on direct performance measurement approach
Open source; available on all HPC platforms

6 Performance Evaluation
Profiling
Presents summary statistics of performance metrics:
number of times a routine was invoked
exclusive and inclusive time/hardware counter values spent executing it
number of instrumented child routines invoked, etc.
structure of invocations (calltrees/callgraphs)
memory and message communication sizes also tracked
Tracing
Presents when and where events took place along a global timeline:
timestamped log of events
message communication events (sends/receives) are tracked, showing when and where messages were sent
large volume of performance data generated, which leads to more perturbation in the program

7 TAU Performance Profiling
Performance with respect to nested event regions
Program execution event stack (begin/end events)
Profiling measures inclusive and exclusive data:
exclusive measurements cover the performance of the region only
inclusive measurements include nested “child” regions
Supports multiple profiling types: flat, callpath, and phase profiling
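A toy illustration of the inclusive/exclusive distinction, with made-up times: if a routine runs for 10 s total (inclusive) and spends 7 s inside an instrumented child, its exclusive time is 3 s.

```shell
# Exclusive time = inclusive time minus time spent in instrumented
# children. All numbers here are made up for illustration.
awk 'BEGIN { incl = 10; child = 7; printf "exclusive=%ds\n", incl - child }'
```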

8 TAU Parallel Performance System Goals
Portable (open source) parallel performance system:
across computer system architectures and operating systems
across different programming languages and compilers
Multi-level, multi-language performance instrumentation
Flexible and configurable performance measurement
Support for multiple parallel programming paradigms:
multi-threading, message passing, mixed-mode, hybrid, object oriented (generic), component-based
Support for performance mapping
Integration of leading performance technology
Scalable (very large) parallel performance analysis

9 TAU Performance System Components
Program analysis: PDT
Parallel profile analysis: ParaProf
Performance data mining: PerfDMF, PerfExplorer
Performance monitoring: TAUoverSupermon

10 TAU Performance System Architecture

11 TAU Performance System Architecture

12 Program Database Toolkit (PDT)
(PDT architecture diagram)
Input: application / library sources
Parsers: C/C++ parser, Fortran parser (F77/90/95)
IL analyzers: C/C++ IL analyzer, Fortran IL analyzer
Output: program database files, accessed via the DUCTAPE library
Automatic source instrumentation: tau_instrumentor
Other consumers: PDBhtml (program documentation), SILOON (application component glue), CHASM (C++/F90/95 interoperability)

13 Automatic Source-Level Instrumentation in TAU

14 Building Bridges to Other Tools

15 TAU Instrumentation Approach
Support for standard program events:
routines, classes and templates
statement-level blocks
begin/end events (interval events)
Support for user-defined events:
begin/end events specified by the user
atomic events (e.g., size of memory allocated/freed)
selection of event statistics
Support for definition of “semantic” entities for mapping
Support for event groups (aggregation, selection)
Instrumentation optimization: eliminate instrumentation in lightweight routines

16 Interval, Atomic and Context Events in TAU
Interval Event Context Event Atomic Event

17 TAU Measurement Mechanisms
Parallel profiling
function-level, block-level, statement-level
supports user-defined events and mapping events
support for flat, callgraph/callpath, phase profiling
support for memory profiling (headroom, malloc/leaks)
support for tracking I/O (wrappers, read/write/print calls)
parallel profiles written at end of execution
parallel profile snapshots can be taken during execution
Tracing
all profile-level events + inter-process communication
inclusion of multiple counter data in traced events

18 Types of Parallel Performance Profiling
Flat profiles
metric (e.g., time) spent in an event (callgraph nodes)
exclusive/inclusive, # of calls, child calls
Callpath profiles (calldepth profiles)
time spent along a calling path (edges in callgraph)
“main => f1 => f2 => MPI_Send” (event name)
depth controlled by the TAU_CALLPATH_DEPTH environment variable
Phase profiles
flat profiles under a phase (nested phases are allowed)
default “main” phase
supports static or dynamic (e.g., per-iteration) phases
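To make the depth parameter concrete, here is a small sketch (not TAU code) that truncates the example callpath above to a given number of segments, the way TAU_CALLPATH_DEPTH bounds recorded paths:

```shell
# Truncate a callpath to a fixed number of segments (illustrative only;
# TAU does this internally based on TAU_CALLPATH_DEPTH).
path="main => f1 => f2 => MPI_Send"
depth=2
echo "$path" | awk -v d="$depth" -F' => ' \
  '{ out = $1; for (i = 2; i <= d && i <= NF; i++) out = out " => " $i; print out }'
```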

19 Performance Evaluation Alternatives
Alternatives: flat profile, depthlimit profile, callpath/callgraph profile, phase profile, parameter profile, trace
Each alternative can use one metric/counter or multiple counters; they differ in the volume of performance data generated

20 Parallel Profile Visualization: ParaProf (AORSA)

21 Comparing Effects of Multi-Core Processors
AORSA2D: magnetized plasma simulation
Blue is single node; red is dual core
Cray XT3 (4K cores)

22 Comparing FLOPS (AORSA2D, Cray XT3)
Blue is dual core; red is single node
Cray XT3 (4K cores)
Data generated by Richard Barrett, ORNL

23 ParaProf – Scalable Histogram View (Miranda)
8k processors 16k processors

24 ParaProf – 3D Scatterplot (Miranda)
Each point is a “thread” of execution
A total of four metrics shown in relation
ParaProf’s visualization library: JOGL

25 Visualizing Hybrid Problems (S3D, XT3+XT4)
S3D combustion simulation (DOE SciDAC PERI)
ORNL Jaguar, Cray XT3/XT4, 6400 cores

26 Zoom View of Hybrid Execution (S3D, XT3+XT4)
Gap represents XT3 nodes
MPI_Wait takes less time; other routines take more time

27 Visualizing Hybrid Execution (S3D, XT3+XT4)
Process metadata is used to map performance to machine type
Memory speed accounts for the performance difference
6400 cores

28 S3D Run on XT4 Only
Better balance across nodes
More performance uniformity

29 ParaProf – Profile Snapshots (Flash)
Profile snapshots are parallel profiles recorded at runtime
Used to highlight profile changes during execution
Phases shown: Initialization, Checkpointing, Finalization
Speaker notes: This slide shows a four-processor Flash run, showing how different snapshots differ. Before the main loop begins, I write a snapshot to mark the performance data up to that point; it shows up as “Initialization”. At the end of each loop, a snapshot is written. At the end of execution, a final snapshot is written, labeled “Finalization”. I’m 99% sure the four main spikes are the checkpointing phases involving a lot of I/O. The big purple region on top is “Other”; the display is limited to the top 20 functions + “other”.

30 Filtered Profile Snapshots (Flash)
Only show main loop iterations
Speaker notes: Same view as the previous slide, but I’ve filtered away initialization and finalization to better see the loop snapshots. The red at the bottom of the four spikes is MPI_Barrier, which is interesting.

31 Profile Snapshots with Breakdown (Flash)
Breakdown as a percentage
Breakdown as a percentage over time

32 Profile Snapshot Replay (Flash)
When you drag the slider, the data in all windows dynamically updates.

33 Snapshot Dynamics of Event Relations (Flash)
Follow the progression of various displays through time
3D scatter plot shown below: T = 0s, T = 11s
Speaker notes: Here we see the progression of the 3D scatterplot over time. The clusters start out more spread out, but then collect as time goes by.

34 TAU: Usage Scenarios

35 Using TAU: A brief Introduction
Each configuration of TAU produces a unique stub makefile with configuration-specific parameters (e.g., MPI, pthread, PGI compiler, etc.)
To instrument source code using PDT, choose an appropriate TAU stub makefile in <arch>/lib:
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-mpi-pdt-pgi
% setenv TAU_OPTIONS '-optVerbose …' (see tau_compiler.sh)
and use tau_f90.sh, tau_cxx.sh or tau_cc.sh as the Fortran, C++ or C compiler:
% ftn foo.f90 changes to
% tau_f90.sh foo.f90
Execute the application and analyze performance data:
% pprof (for text-based profile display)
% paraprof (for GUI)

36 TAU Measurement Configuration – Examples
% cd /usr/common/acts/TAU/tau_latest/craycnl/lib; ls Makefile.*
Makefile.tau-pdt-pgi
Makefile.tau-mpi-pdt-pgi
Makefile.tau-papi-mpi-pdt-pgi
Makefile.tau-pthread-pdt-pgi
Makefile.tau-pthread-mpi-pdt-pgi
Makefile.tau-openmp-opari-pdt-pgi
Makefile.tau-openmp-opari-mpi-pdt-pgi
Makefile.tau-papi-openmp-opari-mpi-pdt-pgi
For an MPI+F90 application, you may want to start with Makefile.tau-mpi-pdt-pgi, which supports MPI instrumentation and PDT for automatic source instrumentation:
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-mpi-pdt-pgi

37 Automatic Instrumentation
TAU provides compiler wrapper scripts: simply replace ftn with tau_f90.sh, which automatically instruments Fortran source code and links with the TAU MPI wrapper libraries. Use tau_cc.sh and tau_cxx.sh for C/C++.
Before:
F90 = ftn
CXX = CC
CFLAGS =
LIBS = -lm
OBJS = f1.o f2.o f3.o … fn.o
app: $(OBJS)
	$(F90) $(LDFLAGS) $(OBJS) -o app $(LIBS)
.f90.o:
	$(F90) $(CFLAGS) -c $<
After:
F90 = tau_f90.sh
CXX = tau_cxx.sh
CFLAGS =
LIBS = -lm
OBJS = f1.o f2.o f3.o … fn.o
app: $(OBJS)
	$(F90) $(LDFLAGS) $(OBJS) -o app $(LIBS)
.f90.o:
	$(F90) $(CFLAGS) -c $<

38 TAU_COMPILER Commandline Options
See <taudir>/<arch>/bin/tau_compiler.sh -help
Compilation:
% ftn -c foo.f90 changes to
% gfparse foo.f90 $(OPT1)
% tau_instrumentor foo.pdb foo.f90 -o foo.inst.f90 $(OPT2)
% ftn -c foo.inst.f90 -o foo.o $(OPT3)
Linking:
% ftn foo.o bar.o -o app changes to
% ftn foo.o bar.o -o app $(OPT4)
where the default values of options OPT[1-4] may be overridden by the user:
% setenv TAU_OPTIONS '...'
% make F90=tau_f90.sh

39 Compile-Time Environment Variables
Optional parameters for TAU_OPTIONS [tau_compiler.sh -help]:
-optVerbose: Turn on verbose debugging messages
-optCompInst: Use compiler-based instrumentation
-optDetectMemoryLeaks: Turn on debugging of memory allocations/de-allocations to track leaks
-optKeepFiles: Do not remove intermediate .pdb and .inst.* files
-optPreProcess: Preprocess Fortran sources before instrumentation
-optTauSelectFile="": Specify selective instrumentation file for tau_instrumentor
-optLinking="": Options passed to the linker. Typically $(TAU_MPI_FLIBS) $(TAU_LIBS) $(TAU_CXXLIBS)
-optCompile="": Options passed to the compiler. Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
-optPdtF95Opts="": Add options for the Fortran parser in PDT (f95parse/gfparse)
-optPdtF95Reset="": Reset options for the Fortran parser in PDT (f95parse/gfparse)
-optPdtCOpts="": Options for the C parser in PDT (cparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
-optPdtCxxOpts="": Options for the C++ parser in PDT (cxxparse). Typically $(TAU_MPI_INCLUDE) $(TAU_INCLUDE) $(TAU_DEFS)
...
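Several of these flags are commonly combined into one TAU_OPTIONS string. A sketch in sh syntax (the slides use csh's setenv); the file name select.tau is just a conventional choice:

```shell
# Example TAU_OPTIONS combining flags from the list above.
export TAU_OPTIONS='-optVerbose -optPreProcess -optTauSelectFile=select.tau'
echo "$TAU_OPTIONS"
```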

40 Runtime Environment Variables
TAU_TRACE: Setting to 1 turns on tracing
TAU_CALLPATH: Setting to 1 turns on callpath profiling
TAU_TRACK_HEAP or TAU_TRACK_HEADROOM: Setting to 1 turns on tracking heap memory/headroom at routine entry and exit using context events (e.g., Heap at Entry: main=>foo=>bar)
TAU_CALLPATH_DEPTH (default: 2): Specifies depth of the callpath. Setting to 0 generates no callpath or routine information; setting to 1 generates a flat profile, and context events have just parent information (e.g., Heap Entry: foo)
TAU_SYNCHRONIZE_CLOCKS (default: 1): Synchronize clocks across nodes to correct timestamps in traces
TAU_COMM_MATRIX: Setting to 1 generates a communication matrix display using context events
TAU_THROTTLE: Setting to 0 turns off throttling. Enabled by default to remove instrumentation in lightweight routines that are called frequently
TAU_THROTTLE_NUMCALLS (default: 100000): Specifies the number of calls before testing for throttling
TAU_THROTTLE_PERCALL (default: 10): Specifies a value in microseconds. A routine is throttled if it is called over TAU_THROTTLE_NUMCALLS times and takes less than 10 microseconds of inclusive time per call
TAU_COMPENSATE: Setting to 1 enables runtime compensation of instrumentation overhead
TAU_PROFILE_FORMAT (default: Profile): Setting to “merged” generates a single file; “snapshot” generates XML format
TAU_METRICS (default: TIME): Setting to a comma-separated list generates other metrics (e.g., TIME:linuxtimers:PAPI_FP_OPS:PAPI_NATIVE_<event>)
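These variables are typically exported in the batch script before launching the instrumented binary. A sketch; the launch line is illustrative and site-specific, hence commented out:

```shell
#!/bin/sh
# Hypothetical batch-script fragment: set TAU runtime options, then launch.
export TAU_CALLPATH=1            # callpath profiling on
export TAU_CALLPATH_DEPTH=10     # record call chains up to 10 levels
export TAU_METRICS=TIME:PAPI_FP_INS
echo "TAU_METRICS=$TAU_METRICS"  # sanity check in the job log
# aprun -n 64 ./matmult          # site-specific launch command
```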

41 Overriding Default Options: TAU_OPTIONS
% cat Makefile
F90 = tau_f90.sh
OBJS = f1.o f2.o f3.o …
LIBS = -Lappdir -lapplib1 -lapplib2 …
app: $(OBJS)
	$(F90) $(OBJS) -o app $(LIBS)
.f90.o:
	$(F90) -c $<
% setenv TAU_OPTIONS '-optVerbose -optTauSelectFile=select.tau'
% setenv TAU_MAKEFILE <taudir>/craycnl/lib/Makefile.tau-mpi-pdt-pgi

42 Usage Scenarios: Routine Level Profile
Goal: What routines account for the most time? How much? Flat profile with wallclock time:

43 Solution: Generating a flat profile with MPI
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-mpi-pdt-pgi
% set path=(/usr/common/acts/TAU/tau_latest/x86_64/bin $path)
or
% module load tau
% make F90=tau_f90.sh
% tau_f90.sh matmult.f90 -o matmult
(or edit the Makefile and change F90=tau_f90.sh)
% qsub run.job
% paraprof
To view the data locally on a workstation:
% paraprof --pack app.ppk
Move the app.ppk file to your desktop.
% paraprof app.ppk

44 Usage Scenarios: Loop Level Instrumentation
Goal: What loops account for the most time? How much? Flat profile with wallclock time with loop instrumentation:

45 Solution: Generating a loop level profile
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-mpi-pdt-pgi
% setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optVerbose'
% cat select.tau
BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION
% module load tau
% make F90=tau_f90.sh
(or edit the Makefile and change F90=tau_f90.sh)
% qsub run.job
% paraprof --pack app.ppk
Move the app.ppk file to your desktop.
% paraprof app.ppk

46 Usage Scenarios: MFlops in Loops
Goal: What MFlops am I getting in all loops? Flat profile with PAPI_FP_INS/OPS and TIME with loop instrumentation:

47 ParaProf: Mflops Sorted by Exclusive Time
low mflops?

48 Generate a PAPI profile with 2 or more counters
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-papi-mpi-pdt-pgi
% setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optVerbose'
% cat select.tau
BEGIN_INSTRUMENT_SECTION
loops routine="#"
END_INSTRUMENT_SECTION
% make F90=tau_f90.sh
(or edit the Makefile and change F90=tau_f90.sh)
% setenv TAU_METRICS TIME:PAPI_FP_INS:PAPI_L1_DCM
% qsub run.job
% paraprof --pack app.ppk
Move the app.ppk file to your desktop.
% paraprof app.ppk
In ParaProf, choose Options -> Show Derived Panel -> Arg 1 = PAPI_FP_INS, Arg 2 = GET_TIME_OF_DAY, Operation = Divide -> Apply, then choose the new derived metric.
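The derived panel divides the two metrics per event. The same arithmetic by hand, with invented counts for one loop; since the time is in microseconds, the ratio is already in MFLOPS:

```shell
# MFLOPS = PAPI_FP_INS / time-in-microseconds.
# The counts below are made-up example values for one loop.
fp_ins=480000000   # exclusive PAPI_FP_INS for the loop
usec=2000000       # exclusive GET_TIME_OF_DAY in microseconds
awk -v f="$fp_ins" -v u="$usec" 'BEGIN { printf "%.0f MFLOPS\n", f / u }'
```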

49 Usage Scenarios: Compiler-based Instrumentation
Goal: Easily generate routine level performance data using the compiler instead of PDT for parsing the source code

50 Use Compiler-Based Instrumentation
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-mpi-pdt-pgi
% setenv TAU_OPTIONS '-optCompInst -optVerbose'
% module load tau
% make F90=tau_f90.sh
(or edit the Makefile and change F90=tau_f90.sh)
% qsub run.job
% paraprof --pack app.ppk
Move the app.ppk file to your desktop.
% paraprof app.ppk

51 Profiling Chapel Applications
Using compiler-based instrumentation in TAU to profile Chapel applications

52 Chapel: Compiler-Based Instrumentation
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/x86_/lib/Makefile.tau-papi-pthread-pdt
% setenv TAU_OPTIONS '-optCompInst -optVerbose'
% setenv CHPL_MAKE_COMPILER tau
% make
% cat $CHPL_HOME/make/compiler/Makefile.tau
CC=tau_cc.sh
CXX=tau_cxx.sh
% qsub run.job
% paraprof --pack app.ppk
Move the app.ppk file to your desktop.
% paraprof app.ppk

53 Profiling UPC Applications
Atomic Events for UPC

54 UPC: Compiler-Based Instrumentation
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/x86_/lib/Makefile.tau-mpi-upc
% setenv TAU_OPTIONS '-optCompInst -optVerbose'
% make
% qsub run.job
% paraprof --pack app.ppk
Move the app.ppk file to your desktop.
% paraprof app.ppk

55 Usage Scenarios: Generating Callpath Profile
Goal: To reveal the calling structure of the program Callpath profile for a given callpath depth:

56 Callpath Profile Generates program callgraph

57 Generate a Callpath Profile
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-mpi-pdt-pgi
% set path=(/usr/common/acts/TAU/tau_latest/craycnl/bin $path)
% make F90=tau_f90.sh
(or edit the Makefile and change F90=tau_f90.sh)
% setenv TAU_CALLPATH 1
% setenv TAU_CALLPATH_DEPTH 100
% qsub run.job
% paraprof --pack app.ppk
Move the app.ppk file to your desktop.
% paraprof app.ppk
(Windows -> Thread -> Call Graph)

58 Usage Scenario: Detect Memory Leaks

59 Detect Memory Leaks
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-mpi-pdt-pgi
% setenv TAU_OPTIONS '-optDetectMemoryLeaks -optVerbose'
% module load tau
% make F90=tau_f90.sh
(or edit the Makefile and change F90=tau_f90.sh)
% setenv TAU_CALLPATH_DEPTH 100
% qsub run.job
% paraprof --pack app.ppk
Move the app.ppk file to your desktop.
% paraprof app.ppk
(Windows -> Thread -> Context Event Window -> select thread -> expand tree)
(Windows -> Thread -> User Event Bar Chart -> right click LEAK -> Show User Event Bar Chart)

60 Usage Scenarios: Mixed Python+F90+C+pyMPI
Goal: Generate multi-level instrumentation for Python+MPI+…

61 Generate a Multi-Language Profile with Python
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-python-mpi-pdt-pgi
% set path=(/usr/common/acts/TAU/tau_latest/craycnl/bin $path)
% setenv TAU_OPTIONS '-optShared -optVerbose …'
(Python needs a shared-object-based TAU library)
% make F90=tau_f90.sh CXX=tau_cxx.sh CC=tau_cc.sh (build pyMPI with TAU)
% cat wrapper.py
import tau
def OurMain():
    import App
tau.run('OurMain()')
Uninstrumented:
% aprun -a xt -n 4 <dir>/pyMPI-2.5b0/bin/pyMPI ./App.py
Instrumented:
% setenv PYTHONPATH <taudir>/craycnl/lib/bindings-python-mpi-pdt-pgi
(same options string as TAU_MAKEFILE)
% setenv LD_LIBRARY_PATH <taudir>/craycnl/lib/bindings-python-mpi-pdt-pgi\:$LD_LIBRARY_PATH
% aprun -a xt -n 4 <dir>/pyMPI-2.5b0-TAU/bin/pyMPI ./wrapper.py
(instrumented pyMPI with wrapper.py)

62 Usage Scenarios: Generating a Trace File
Goal: What happens in my code at a given time? When?
Event trace visualized in Vampir [TUD] / Jumpshot [ANL]

63 VNG Process Timeline with PAPI Counters

64 Vampir Counter Timeline Showing I/O BW

65 Generate a Trace File
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-mpi-pdt-pgi
% set path=(/usr/common/acts/TAU/tau_latest/craycnl/bin $path)
% make F90=tau_f90.sh
(or edit the Makefile and change F90=tau_f90.sh)
% setenv TAU_TRACE 1
% qsub run.job
% tau_treemerge.pl
(merges binary traces to create tau.trc and tau.edf files)
JUMPSHOT:
% tau2slog2 tau.trc tau.edf -o app.slog2
% jumpshot app.slog2
or VAMPIR:
% tau2otf tau.trc tau.edf app.otf -n 4 -z
(4 streams, compressed output trace)
% vampir app.otf
(or vng client with vngd server)

66 Usage Scenarios: Evaluate Scalability
Goal: How does my application scale? What bottlenecks arise at which CPU counts?
Load profiles into the PerfDMF database and examine them with PerfExplorer

67 Usage Scenarios: Evaluate Scalability

68 Performance Regression Testing

69 Evaluate Scalability using PerfExplorer Charts
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-mpi-pdt-pgi
% set path=(/usr/common/acts/TAU/tau_latest/craycnl/bin $path)
% make F90=tau_f90.sh
(or edit the Makefile and change F90=tau_f90.sh)
% qsub run1p.job
% paraprof --pack 1p.ppk
% qsub run2p.job
% paraprof --pack 2p.ppk
… and so on.
On your client:
% perfdmf_configure
(choose derby, blank user/passwd, yes to save passwd, defaults)
% perfexplorer_configure
(yes to load schema, defaults)
% paraprof
(load each trial: DB -> Add Trial -> Type (Paraprof Packed Profile) -> OK)
% perfexplorer
(Charts -> Speedup)
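The Charts -> Speedup view plots the ratio of the baseline trial's time to each larger trial's time. The same calculation by hand, with invented trial timings:

```shell
# Speedup(p) = T(1) / T(p); the trial times below are made-up seconds
# for runs on 1, 2 and 4 processors.
awk 'BEGIN { t1 = 100; t2 = 52; t4 = 27;
             printf "speedup(2)=%.2f speedup(4)=%.2f\n", t1/t2, t1/t4 }'
```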

70 Communication Matrix Display
Goal: What is the volume of inter-process communication? Along which calling path?

71 Communication Matrix Display
Goal: What is the volume of inter-process communication? Along which calling path?

72 Generate a Communication Matrix
% setenv TAU_MAKEFILE /usr/common/acts/TAU/tau_latest/craycnl/lib/Makefile.tau-mpi-pdt-pgi
% module load tau
% make F90=tau_f90.sh
(or edit the Makefile and change F90=tau_f90.sh)
% setenv TAU_COMM_MATRIX 1
% qsub run.job
(setting the environment variables)
% paraprof
(Windows -> Communication Matrix)

73 PGI Compiler for GPUs
Accelerator programming support:
Fortran and C
directive-based programming
loop parallelization for acceleration on GPUs
PGI 9.0 for x64-based Linux (preview release)
Compiled program:
CUDA target
synchronous accelerator operations
profile interface support

74 TAU with PGI Accelerator Compiler
Compiler-based instrumentation for PGI compilers
Track runtime system events as seen by the host processor
Wrapped runtime library
Show source information associated with events:
routine name
file name, source line number for kernel
variable names in memory upload and download operations
grid sizes
Any configuration of TAU with PGI supports tracking of accelerator operations
Tested with PGI 8.0.3 and 8.0.5 compilers
Qualification and testing with PGI 9.0-4 and 10.x complete

75 Measuring Performance of PGI Accelerator Code

76 Binary Rewriting: DyninstAPI [U.Wisc] and TAU

77 HMPP-TAU Event Instrumentation/Measurement
(Diagram) The user application runs over the HMPP runtime, which dispatches an HMPP CUDA codelet to CUDA. TAU measures user events, HMPP events and codelet events; TAUcuda measures CUDA stream events and waiting information.

78 HMPP-TAU Compilation Workflow
(Workflow diagram) The HMPP-annotated application goes through the TAU compiler and TAU instrumenter to produce a TAU-instrumented HMPP-annotated application; the HMPP compiler's CUDA generator output goes through the TAUcuda instrumenter to produce TAUcuda-instrumented CUDA codelets. The CUDA and generic compilers then build these, linking against the TAU/TAUcuda library and the HMPP runtime library, yielding a TAU-instrumented HMPP application executable and a TAUcuda-instrumented CUDA codelet library.

79 HMPP Workbench with TAUcuda
Host process Compute kernel Transfer kernel

80 NAMD with CUDA
NAMD is a molecular dynamics application (Charm++)
NAMD has been accelerated with CUDA
TAU integrated in Charm++
Apply TAUcuda to NAMD
Four processes, with one Tesla GPU for each

81 NAMD with CUDA (4 processes)
GPU kernel

82 Scaling NAMD with CUDA
Good GPU performance

83 Scaling NAMD with CUDA: Jumpshot Timeline

84 Scaling NAMD with CUDA
Data transfer

85 Conclusions
Heterogeneous parallel computing will challenge parallel performance technology:
must deal with diversity in hardware and software
must deal with richer parallelism and concurrency
Performance tools should support parallel execution and computation models
Understanding of “performance” interactions between integrated components: control and data interactions
Might not be able to see full parallel (concurrent) detail
Need to support multiple performance perspectives: layers of performance abstraction

86 Discussions
TAU represents a mature technology for performance instrumentation, measurement and analysis
We would like to collaborate with the Cray language and compiler teams to improve the support for TAU on Cray systems
Near-term goals:
Chapel runtime support
support for compiler-based instrumentation for Cray compilers on XT systems
Long-term goals:
explore hybrid execution models (XT5h) and other new systems
integrate and ship TAU with the Cray tool chain

87 Support Acknowledgements
Department of Energy (DOE), Office of Science
MICS, Argonne National Lab
ASC/NNSA, University of Utah ASC/NNSA Level 1
ASC/NNSA, LLNL
Department of Defense (DoD), HPC Modernization Office (HPCMO)
NSF SDCI
Research Centre Juelich
LBL, ORNL, ANL, LANL, PNNL, LLNL
TU Dresden
ParaTools, Inc.

