Grouping Performance Data in TAU


1 Grouping Performance Data in TAU
- Profile groups
  - A group of related routines forms a profile group
  - Statically defined: TAU_DEFAULT, TAU_USER[1-5], TAU_MESSAGE, TAU_IO, ...
  - Dynamically defined
    - Group name based on a string, e.g. "adlib", "particles"
    - Runtime lookup in a map to get a unique group identifier
    - tau_instrumentor file.pdb file.cpp -o file.i.cpp -g "particles"
      assigns all routines in file.cpp to group "particles"
  - Ability to change group names at runtime
  - Instrumentation control based on profile groups
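The "runtime lookup in a map" for dynamically defined groups can be pictured with a small sketch. This is not TAU's implementation: the GroupRegistry class, the GroupId type and the one-bit-per-group encoding are assumptions made only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical registry, not TAU's implementation: each new group name
// receives the next free bit of a 64-bit mask as its identifier.
using GroupId = std::uint64_t;

class GroupRegistry {
public:
    // Return the identifier for `name`, allocating one on first use.
    GroupId lookup(const std::string& name) {
        auto it = ids_.find(name);
        if (it != ids_.end()) return it->second;
        assert(ids_.size() < 64 && "sketch supports at most 64 groups");
        GroupId id = GroupId{1} << ids_.size();
        ids_.emplace(name, id);
        return id;
    }

    // Number of distinct group names seen so far.
    std::size_t size() const { return ids_.size(); }

private:
    std::unordered_map<std::string, GroupId> ids_;
};
```

Looking up "particles" twice returns the same identifier, while "adlib" and "particles" receive different ones, which is the property that per-group instrumentation control relies on.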

2 TAU Instrumentation Control API
- Enabling profile groups
    TAU_ENABLE_INSTRUMENTATION();          // global control
    TAU_ENABLE_GROUP(TAU_GROUP);           // statically defined
    TAU_ENABLE_GROUP_NAME("group name");   // dynamic
    TAU_ENABLE_ALL_GROUPS();               // for all groups
- Disabling profile groups
    TAU_DISABLE_INSTRUMENTATION();
    TAU_DISABLE_GROUP(TAU_GROUP);
    TAU_DISABLE_GROUP_NAME("group name");
    TAU_DISABLE_ALL_GROUPS();
- Obtaining a profile group identifier
    TAU_GET_PROFILE_GROUP("group name");
- Runtime switching of profile groups
    TAU_PROFILE_SET_GROUP(TAU_GROUP);
    TAU_PROFILE_SET_GROUP_NAME("group name");
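One way to think about these calls is as operations on a global switch plus a per-group mask. The sketch below models that idea only; InstrumentationControl and its member names are hypothetical and are not part of the TAU API, and TAU's internal representation may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of group-based instrumentation control; the names
// below are illustrative and are not part of the TAU API.
using GroupMask = std::uint64_t;

struct InstrumentationControl {
    bool enabled = true;               // global switch
    GroupMask active = ~GroupMask{0};  // per-group switches, all on initially

    void enable_group(GroupMask g)  { active |= g; }
    void disable_group(GroupMask g) { active &= ~g; }
    void enable_all()               { active = ~GroupMask{0}; }
    void disable_all()              { active = 0; }

    // A timer that belongs to group `g` records data only if this is true.
    bool should_measure(GroupMask g) const {
        return enabled && (active & g) != 0;
    }
};
```

In this model, disabling one group leaves timers in every other group untouched, and the global switch overrides all per-group settings.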

3 Disabling Dynamic Profile Group -- Example
    int main(int argc, char **argv)
    {
      /* Invoke program with --profile field+particles */
      TAU_INIT(&argc, &argv);
      …
    }
    void foo(void)
    {
      TAU_PROFILE("void foo(void)", " ", TAU_DEFAULT);
      Field f;
      TAU_DISABLE_GROUP_NAME("field");
      // other routines in the "field" dynamic group are affected
      for (int i = 0; i < N; i++)
        f.applyrules(i);
    }

4 TAU Pre-execution Control
- Dynamic groups defined at file scope
- Group names and group associations may be modified at runtime
- Controlling groups at pre-execution time using the --profile option
    % tau_instrumentor app.pdb app.cpp -o app.i.cpp -g "particles"
    % mpirun -np 4 application --profile particles+field+mesh+io
  - Enables instrumentation for TAU_DEFAULT and the particles, field, mesh and io groups
- Examples:
  - POOMA v1 (LANL): static groups used
  - VTF (ASAP Caltech): dynamic execution instrumentation control by a Python-based controller
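The argument to --profile is a list of group names joined by "+". A minimal sketch of splitting such a list is shown below; split_profile_groups is a hypothetical helper written for illustration and is not taken from TAU's sources.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not TAU code: split "a+b+c" into {"a", "b", "c"},
// skipping empty names such as the one between the two '+' in "a++b".
std::vector<std::string> split_profile_groups(const std::string& spec) {
    std::vector<std::string> groups;
    std::string::size_type start = 0;
    while (start <= spec.size()) {
        std::string::size_type end = spec.find('+', start);
        if (end == std::string::npos) end = spec.size();
        if (end > start) groups.push_back(spec.substr(start, end - start));
        start = end + 1;
    }
    return groups;
}
```

Each resulting name could then be looked up to obtain its group identifier and enabled before the application starts running.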

5 Applications of TAU
- POOMA
- PETSc
- SAMRAI

6 Performance Mapping in TAU: Motivation
- Complexity
  - Layered software
  - Multi-level instrumentation
  - Entities not directly in source
- Mapping
  - User-level abstractions

7 Hypothetical Mapping Example
- Particles distributed on surfaces of a cube
(Figure labels: Engine, Work packets)

8 Hypothetical Mapping Example Source
    Particle* P[MAX]; /* Array of particles */
    int GenerateParticles()
    {
      /* distribute particles over all faces of the cube */
      for (int face = 0, last = 0; face < 6; face++) {
        /* particles on this face */
        int particles_on_this_face = num(face);
        /* indices for this face start where the previous face ended */
        for (int i = last; i < last + particles_on_this_face; i++) {
          /* particle properties are a function of face */
          P[i] = ... f(face);
          ...
        }
        last += particles_on_this_face;
      }
    }

9 Hypothetical Mapping Example (continued)
- How much time is spent processing face i particles?
- What is the distribution of performance among faces?
    int ProcessParticle(Particle *p)
    {
      /* perform some computation on p */
    }
    int main()
    {
      GenerateParticles(); /* create a list of particles */
      for (int i = 0; i < N; i++) /* iterates over the list */
        ProcessParticle(P[i]);
    }

10 No Performance Mapping versus Mapping
- Typical performance tools report performance with respect to routines
- They do not provide support for mapping
- Performance tools with SEAA mapping can observe performance with respect to the scientist's programming and problem abstractions
(Figure panels: without mapping, with mapping)

11 Semantic Entities/Attributes/Associations
- New dynamic mapping scheme: SEAA
- Entities defined at any level of abstraction
- Attribute entity with semantic information
- Entity-to-entity associations
- Two association types:
  - Embedded: extends the data structure of the associated object to store the performance measurement entity
  - External: creates an external look-up table using the address of the object as the key to locate the performance measurement entity
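The difference between the two association types can be sketched in a few lines. All names here (Timer, ParticleEmbedded, Particle, ExternalMap) are hypothetical and only illustrate the idea; they are not TAU's mapping API.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical performance measurement entity; a real timer would also
// accumulate elapsed time.
struct Timer {
    std::string name;
    long calls = 0;
};

// Embedded association: the object's data structure is extended with a
// pointer to its performance measurement entity.
struct ParticleEmbedded {
    int face = 0;
    Timer* timer = nullptr;
};

// External association: the object type is left unchanged and the
// object's address is the key into a separate look-up table.
struct Particle {
    int face = 0;
};

class ExternalMap {
public:
    void associate(const void* object, Timer* timer) { table_[object] = timer; }

    // Returns nullptr when no timer has been associated with `object`.
    Timer* find(const void* object) const {
        auto it = table_.find(object);
        return it == table_.end() ? nullptr : it->second;
    }

private:
    std::unordered_map<const void*, Timer*> table_;
};
```

The embedded form avoids a look-up on every measurement but requires changing the object's type; the external form works with types that cannot be modified, at the cost of a hash look-up per access.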

12 Mapping in POOMA II
- POOMA [LANL] is a C++ framework for computational physics
- Provides high-level abstractions:
  - Fields (Arrays), Particles, FFT, etc.
- Encapsulates details of parallelism and data distribution
- Uses custom computation kernels for efficient expression evaluation [PETE]
- Uses vertical execution of array statements to re-use cache [SMARTS]

13 POOMA II Array Example
- Multi-dimensional array statements
    A=B+C+D;

14 POOMA, PETE and SMARTS

15 Using Synchronous Timers

16 Form of Expression Templates in POOMA

17 Mapping Problem
- One-to-many upward mapping
- Traditional methods of mapping (amortization/aggregation) lack resolution and accuracy!
    template <class LHS, class RHS, class Op, class EvalTag>
    void ExpressionKernel<LHS,RHS,Op,EvalTag>::run() { /* iterate execution */ }
    A=1.0;
    B=2.0;
    …
    A=B+C+D;
    C=E-A+2.0*D;
    ...

18 POOMA II Mappings
- Each work packet belongs to an ExpressionKernel object
- Each statement's form is associated with a timer in the constructor of ExpressionKernel
- The ExpressionKernel class is extended with an embedded timer
- Timing calls at entry and exit of the run() method start and stop the per-object timer
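The scheme on this slide can be approximated as follows. This is a simplified sketch under stated assumptions: the ExpressionKernel below is a non-template stand-in keyed by the statement text, and StatementTimer and timer_for are hypothetical names; POOMA's actual kernel class and TAU's timers are more involved.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Hypothetical per-statement timer, standing in for a TAU timer.
struct StatementTimer {
    long calls = 0;
    std::chrono::steady_clock::duration total{};
};

// One timer per distinct statement text, shared by all of its work packets.
inline StatementTimer& timer_for(const std::string& statement) {
    static std::map<std::string, StatementTimer> timers;
    return timers[statement];
}

// Simplified, non-template stand-in for POOMA's ExpressionKernel.
class ExpressionKernel {
public:
    // The association with the statement's timer is made in the constructor.
    explicit ExpressionKernel(const std::string& statement)
        : timer_(&timer_for(statement)) {}

    void run() {
        auto start = std::chrono::steady_clock::now();  // timing call at entry
        /* iterate execution */
        timer_->total += std::chrono::steady_clock::now() - start;  // at exit
        ++timer_->calls;
    }

private:
    StatementTimer* timer_;  // embedded association
};
```

Because every kernel object created for "A=B+C+D;" points at the same timer, time spent in the shared run() routine is attributed to the originating statement rather than being lumped under one routine name.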

19 Results of TAU Mappings
- Per-statement profile!

20 POOMA Traces
- Helps bridge the semantic gap!

21 PETSc (ANL)
- Portable, Extensible Toolkit for Scientific Computation
- Suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations
- Uses MPI for inter-process communication
- Instrumentation
  - PDT for C/C++ source instrumentation
  - MPI wrapper library layer instrumentation
- Example:
  - Solves a set of linear equations (Ax=b) in parallel (SLES)

22 PETSc Linear Equation Solver Profile

23 PETSc Traces

24 PETSc Calltree and Communication Matrix

25 SAMRAI (LLNL)
- Structured Adaptive Mesh Refinement Application Infrastructure
- Instrumentation for TAU:
  - PDT-based C++ instrumentation
  - MPI wrapper interposition library based instrumentation
  - SAMRAI timers mapped to TAU timers
    - TAU's Mapping API
    - Embedded association

26 Mapping in TAU
- Embedded association vs. external association
(Figure labels: SAMRAI Timer, Performance Data, Hash Table, TAU Timer)

27 SAMRAI Euler Gas Dynamics Application (2D)
- Adaptive Mesh Refinement (AMR) application
- A single overarching algorithm object drives the time integration and adaptive gridding processes
- Discrete Euler equations are solved on each patch in the AMR hierarchy

28 Euler Profile (ComputeFluxesOnPatch)

29 Euler Profile (Summary)

30 Euler Profile (Inclusive Time)

31 Euler Profile (Contribution of Flux computation)

32 Euler Traces

33 Euler CallTree (ComputeFluxesOnPatch)

34 Hands-on session
- On mcurie.nersc.gov, copy files from /usr/local/pkg/acts/tau/tau2/tau-2.9/training
- See the README file
- Set the correct path, e.g.
    % set path=($path /usr/local/pkg/acts/tau/tau2/tau2.9/t3e/bin)
- Examine the Makefile
- Type "make" in each directory; then execute the program
- Type "racy" or "vampir"
- Type a project name, e.g. "matrix.pmf", and click OK to see the performance data

35 Examples
The training directory contains example programs that illustrate the use of TAU instrumentation and measurement options.
- instrument - A simple C++ example that shows how TAU's API can be used for manually instrumenting a C++ program. It highlights instrumentation for templates and user-defined events.
- threads - A simple multi-threaded program that shows how the main function of a thread is instrumented. Performance data is generated for each thread of execution. Configure with -pthread.
- cthreads - Same as threads above, but for a C program. An instrumented C program may be compiled with a C compiler, but needs to be linked with a C++ linker. Configure with -pthread.
- pi - An MPI program that calculates the value of pi and e. It highlights the use of TAU's MPI wrapper library. TAU needs to be configured with -mpiinc= and -mpilib=. Run using mpirun -np cpi.
- papi - A matrix multiply example that shows how to use TAU statement-level timers for comparing the performance of two algorithms for matrix multiplication. When used with PAPI or PCL, this can highlight the cache behaviors of these algorithms. To use this, TAU should be configured with -papi= or -pcl= and the user should set the PAPI_EVENT or PCL_EVENT environment variable, respectively.

36 Examples - (cont.)
- papithreads - Same as papi, but uses threads to highlight how hardware performance counters may be used in a multi-threaded application. When it is used with PAPI, TAU should be configured with -papi= -pthread.
- autoinstrument - Shows the use of the Program Database Toolkit (PDT) for automating the insertion of TAU macros in the source code. It requires configuring TAU with the -pdt= option. The Makefile is modified to illustrate the use of a source-to-source translator (tau_instrumentor).
- NPB2.3 - The NAS Parallel Benchmark 2.3 [from NASA Ames]. It shows how to use TAU's MPI wrapper with a manually instrumented Fortran program. LU and SP are the two benchmarks. LU is instrumented completely, while only parts of the SP program are instrumented to contrast the coverage of routines. In both cases MPI-level instrumentation is complete. TAU needs to be configured with -mpiinc= and -mpilib= to use this.

