
1 Sameer Shende, Allen D. Malony and Alan Morris {sameer, malony, amorris}@cs.uoregon.edu Department of Computer and Information Science Performance Research Laboratory, NeuroInformatics Center University of Oregon Analysis Infrastructure for CQoS using TAU

2 2 Acknowledgement  Jaideep Ray, SNL  Lois McInnes, ANL  David Bernholdt, ORNL  Boyana Norris, ANL  Robert Yelle, U. Oregon

3 3 Outline  Motivation: CQoS  Instrumentation  Measurement  Analysis tools

4 4 CQoS in GAMESS  Robert Yelle, PRL, U. Oregon, ryelle@uoregon.edu  Calculate the energy of the thiophene molecule using different algorithms:
FINAL U-B3LYP ENERGY IS -552.9083139587 AFTER 21 ITERATIONS
FINAL U-BLYP ENERGY IS -552.9861184848 AFTER 22 ITERATIONS
FINAL UHF ENERGY IS -551.3483315053 AFTER 11 ITERATIONS
FINAL U-SVWN ENERGY IS -550.2734639639 AFTER 22 ITERATIONS

5 5 TAU Performance System Framework  Tuning and Analysis Utilities  Performance system framework for scalable parallel and distributed high-performance computing  Targets a general complex system computation model  nodes / contexts / threads  Multi-level: system / software / parallelism  Measurement and analysis abstraction  Integrated toolkit for performance instrumentation, measurement, analysis, and visualization  Portable, configurable performance profiling/tracing facility  Open software approach  University of Oregon, LANL, FZJ Germany  http://www.cs.uoregon.edu/research/paracomp/tau

6 6 TAU Performance System Architecture event selection

7 7 Performance Evaluation Alternatives  [Figure: flat profile, depthlimit profile, parameter profile, callpath/callgraph profile, phase profile, and trace, arranged along an axis labeled "Volume of performance data"]  Each alternative has either one metric/counter or multiple counters

8 8 Enhancements in TAU to support CQoS  Instrumentation  Runtime MPI wrapper interposition for CCA framework instrumentation  Automatic proxy component creation for classic and SIDL components  PDT v3.10 (coming, beta released) supports EDG v3.8 for better C/C++ parsing support (GNU extensions, BOOST, ASM statements)  Profile Measurement  Parameter based profiling to capture application data  Context events to capture the callpath together with user event information  Support for memory profiling and memory leak detection  Timestamped profile snapshots (coming)  Analysis  Extensions to PerfDMF to support model storage  Application specific metadata  ParaProf extensions to display profile snapshots, parameter based profiles  PerfExplorer data mining framework  Web based access to performance database via a TAU portal  Ability to store images, share data, metadata

9 9 TAU’s CCA Performance Component: Core API  Measurement port and interfaces  Timer  set name/type/group  start/stop  Phase  set name/type/group  start/stop  Control  enable/disable groups  Query  get timer names, get metric names, get user-defined event names  get timer data, get user-defined event data, dump data to disk  Event  set name, trigger event  Context Event (callpath of routines + user event information)  set name, trigger event  MemoryTracker and MemoryHeadroomTracker  enable interrupt tracking, track memory/headroom here, set interrupt interval  enable/disable tracking memory/headroom

10 10 TAU’s CCA Interfaces  Performance evaluation using Performance component  Uses underlying TAU library for measurement  Timer, Phase, Event/ContextEvent, Control, Query, MemoryTracker/MemoryHeadroomTracker interfaces  Lightweight instrumentation option  Performance modeling using MasterMind component  Tracks per-invocation performance data  Associates performance data with application data  Method arguments logged with performance data  Callpath information  Helps us build performance models  Updated performance component 1.7.2 released Jan. ’07

11 11 Phase Interface
interface Timer {
  /* Start/stop the Timer */
  void start();
  void stop();
  /* Set/get the Timer name */
  void setName(in string name);
  string getName();
  /* Set/get Timer type information (e.g., signature of the routine) */
  void setType(in string name);
  string getType();
  /* Set/get the group name associated with the Timer */
  void setGroupName(in string name);
  string getGroupName();
  /* Set/get the group id associated with the Timer */
  void setGroupId(in long group);
  long getGroupId();
}
interface Measurement extends gov.cca.Port {
  /* Create a Timer */
  Timer createTimer();
  Timer createTimerWithName(in string name);
  Timer createTimerWithNameType(in string name, in string type);
  Timer createTimerWithNameTypeGroup(in string name, in string type, in string group);
  ...
}
interface Phase {
  /* Start/stop the Phase */
  void start();
  void stop();
  /* Set/get the Phase name */
  void setName(in string name);
  string getName();
  /* Set/get Phase type information (e.g., signature of the routine) */
  void setType(in string name);
  string getType();
  /* Set/get the group name associated with the Phase */
  void setGroupName(in string name);
  string getGroupName();
  /* Set/get the group id associated with the Phase */
  void setGroupId(in long group);
  long getGroupId();
}
interface Measurement extends gov.cca.Port {
  /* Create a Phase */
  Phase createPhase();
  Phase createPhaseWithName(in string name);
  Phase createPhaseWithNameType(in string name, in string type);
  Phase createPhaseWithNameTypeGroup(in string name, in string type, in string group);
  ...
}
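To make the Timer interface concrete, here is a minimal C++ sketch of an object with the same start/stop and name/type/group methods. It is only an illustration: the class name SketchTimer, the use of std::chrono, and the calls()/totalMicroseconds() accessors are assumptions of this sketch, not TAU's implementation or the generated SIDL bindings.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <utility>

// Hypothetical stand-in for the Timer interface on this slide.
// It is not TAU's implementation; it only mirrors the method names.
class SketchTimer {
    using Clock = std::chrono::steady_clock;

public:
    SketchTimer(std::string name, std::string type, std::string group)
        : name_(std::move(name)), type_(std::move(type)), group_(std::move(group)) {}

    // Start/stop the timer; each start/stop pair counts as one call.
    void start() {
        assert(!running_);
        running_ = true;
        begin_ = Clock::now();
    }
    void stop() {
        assert(running_);
        total_ += Clock::now() - begin_;
        running_ = false;
        ++calls_;
    }

    // Set/get name, type (e.g. routine signature) and group name.
    void setName(const std::string& name) { name_ = name; }
    const std::string& getName() const { return name_; }
    void setType(const std::string& type) { type_ = type; }
    const std::string& getType() const { return type_; }
    void setGroupName(const std::string& group) { group_ = group; }
    const std::string& getGroupName() const { return group_; }

    // Accessors added for this sketch only (not part of the slide's interface).
    long calls() const { return calls_; }
    double totalMicroseconds() const {
        return std::chrono::duration<double, std::micro>(total_).count();
    }

private:
    std::string name_, type_, group_;
    Clock::time_point begin_{};
    Clock::duration total_{0};
    long calls_ = 0;
    bool running_ = false;
};
```

A caller would construct one SketchTimer per routine, bracket the routine body with start() and stop(), and read the accumulated time afterwards.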

12 12 Measurement Proxy Component  Interpose a proxy component for each port  Inside the proxy  Make calls to Performance component for each invocation  [Diagram: Driver, IntegratorProxy, MidpointIntegrator, and Performance components connected through IntegratorPort and MeasurementPort]
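The interposition idea can be sketched in plain C++ without the CCA framework. The proxy below provides the same port as the real integrator, forwards each call, and reports the elapsed time to a measurement object. The class names reuse labels from the slide, but MeasurementSketch, the integrand 4/(1+x^2), and the timing code are assumptions made for illustration, not the code that tau_pg generates.

```cpp
#include <chrono>
#include <map>
#include <string>

// Port interface shared by the real component and its proxy.
struct IntegratorPort {
    virtual ~IntegratorPort() = default;
    virtual double integrate(double lowBound, double upBound, int count) = 0;
};

// Real component: midpoint rule on a hypothetical integrand 4 / (1 + x^2).
struct MidpointIntegrator : IntegratorPort {
    double integrate(double lowBound, double upBound, int count) override {
        const double h = (upBound - lowBound) / count;
        double sum = 0.0;
        for (int i = 0; i < count; ++i) {
            const double x = lowBound + (i + 0.5) * h;
            sum += 4.0 / (1.0 + x * x);
        }
        return sum * h;
    }
};

// Hypothetical stand-in for the Performance component's measurement port.
struct MeasurementSketch {
    struct Entry { long calls = 0; double microseconds = 0.0; };
    std::map<std::string, Entry> timers;
    void record(const std::string& name, double microseconds) {
        Entry& e = timers[name];
        ++e.calls;
        e.microseconds += microseconds;
    }
};

// Proxy: same port as the target, but every invocation is timed and recorded.
class IntegratorProxy : public IntegratorPort {
public:
    IntegratorProxy(IntegratorPort& target, MeasurementSketch& measurement)
        : target_(target), measurement_(measurement) {}

    double integrate(double lowBound, double upBound, int count) override {
        const auto t0 = std::chrono::steady_clock::now();
        const double result = target_.integrate(lowBound, upBound, count);
        const auto t1 = std::chrono::steady_clock::now();
        measurement_.record(
            "integ_proxy::integrate(double, double, int)",
            std::chrono::duration<double, std::micro>(t1 - t0).count());
        return result;
    }

private:
    IntegratorPort& target_;
    MeasurementSketch& measurement_;
};
```

The driver keeps calling IntegratorPort::integrate as before; only the wiring changes, so neither the caller nor the callee needs source modifications.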

13 13 MasterMind Component  Idea: Create a performance model for the component by tracking performance per invocation  Uses Monitor Port  Outputs:  Times per invocation, e.g. the sample output at the end of this slide  Component call path  Regular performance data (uses performance component)
# integ_proxy::integrate(double, double, int)
# MPI_TIME  Time  count  lowBound  upBound
72420   336   10000  0  1
407     449   1000   0  1
364     540   100    0  1
64838   844   10000  0  1
381     945   1000   0  1
332     1027  100    0  1
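One way such per-invocation records could feed a performance model is to group them by a method argument and average the measured value. The sketch below does this for the six sample rows, keeping the fields in the order of the header line. Which column carries the per-invocation cost is not stated on the slide, so averaging the first column is an assumption of this sketch, as are the names InvocationRecord, sampleLog, and meanFirstColumnByCount.

```cpp
#include <map>
#include <vector>

// One logged invocation; fields follow the header order on the slide.
struct InvocationRecord {
    double mpiTime;   // first column (assumed here to be the value to model)
    double time;      // second column
    int count;        // method argument: number of evaluations
    double lowBound;  // method argument
    double upBound;   // method argument
};

// The six sample rows shown on the slide.
std::vector<InvocationRecord> sampleLog() {
    return {
        {72420, 336, 10000, 0, 1},
        {407, 449, 1000, 0, 1},
        {364, 540, 100, 0, 1},
        {64838, 844, 10000, 0, 1},
        {381, 945, 1000, 0, 1},
        {332, 1027, 100, 0, 1},
    };
}

// Average the first column for each distinct value of the count argument.
std::map<int, double> meanFirstColumnByCount(const std::vector<InvocationRecord>& log) {
    std::map<int, double> sum;
    std::map<int, int> n;
    for (const InvocationRecord& r : log) {
        sum[r.count] += r.mpiTime;
        ++n[r.count];
    }
    for (auto& entry : sum) {
        entry.second /= n[entry.first];
    }
    return sum;
}
```

The resulting table of argument value against mean measured value is the simplest kind of model; a fitted curve over count would be a natural next step.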

14 14 Monitor Proxy Component  Same idea (from the user’s point of view)  [Diagram: Driver, Integrator Monitor Proxy, MidpointIntegrator, MasterMind, and Performance components connected through IntegratorPort, MonitorPort, and MeasurementPort]

15 15 Tools Included with MasterMind Component  Tree pruner  Input:  Callgraph generated by MasterMind component  User specified rules  Output:  Pruned callgraph with insignificant nodes removed  Performance modeling library – brute force  Tries all possible permutations of component instances  Input: performance model of each component  Selects optimal component assembly for the ensemble  Optimizer  Swaps one component instance with another
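The brute-force selection step can be sketched as an exhaustive walk over every combination of interchangeable implementations, keeping the one with the lowest modeled cost. This is a minimal illustration under stated assumptions: the function name bestAssembly, the representation of an assembly as one index per slot, and the cost callback are all hypothetical and do not describe the actual performance modeling library.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// optionsPerSlot[i] is the number of interchangeable implementations for slot i.
// modeledCost maps one choice per slot to a predicted cost (the performance model).
// Returns the choice vector with the lowest modeled cost.
std::vector<std::size_t> bestAssembly(
    const std::vector<std::size_t>& optionsPerSlot,
    const std::function<double(const std::vector<std::size_t>&)>& modeledCost) {
    for (std::size_t n : optionsPerSlot) {
        assert(n > 0);  // every slot needs at least one implementation
    }
    std::vector<std::size_t> pick(optionsPerSlot.size(), 0);
    std::vector<std::size_t> best = pick;
    double bestCost = std::numeric_limits<double>::infinity();

    for (;;) {
        const double cost = modeledCost(pick);
        if (cost < bestCost) {
            bestCost = cost;
            best = pick;
        }
        // Advance like an odometer: bump slot 0, carry into later slots.
        std::size_t slot = 0;
        while (slot < pick.size() && ++pick[slot] == optionsPerSlot[slot]) {
            pick[slot] = 0;
            ++slot;
        }
        if (slot == pick.size()) {
            break;  // every combination has been evaluated
        }
    }
    return best;
}
```

Passing the model as a callback lets the cost depend on interactions between components, which is the case where exhaustive search is worth its exponential cost; with purely additive costs, picking the cheapest option per slot would be enough.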

16 16 TAU’s Proxy Generator for SIDL/Classic CCA  Generate regular measurement proxy or monitor (MasterMind) proxy  Arguments / Options:
-c  Full name of the component
-t  Type of component
-p  Name of port to generate proxy for
-d  Name of pdb file created from cxxparse
-h  Header file for this port
-n  Name of the proxy component (default: base of component name + Proxy)
-o  Name of output file (default: proxy.cc)
-f  Use pre-generated selective instrumentation file
-x  Namespace tag
-m  Generate MasterMind component proxy

17 17 TAU’s Proxy Generator for Classic C++ Interface
Creating PDB files: cxxparse -I -D
Merging PDB files: pdbmerge -o merged.pdb file1.pdb file2.pdb …
Invoking tau_pg (example): tau_pg -c integrators::ccaports::Integrator -t integrators.ccaports.Integrator -p IntegratorPort -d ParallelIntegrator_CCA.pdb -o Proxy.cc -h ports/Integrator_CCA.h -f select.dat

18 18 What’s Going On Here?  [Diagram: application components connected to a Performance component; labels: "Alternative implementations of performance component", "TAU API", "other API …", "runtime TAU performance data"]

19 19 Multi-Level Instrumentation  Inter-Component  Proxy components created automatically  Proxy interposed between caller and callee  Intra-Component  PDT based source instrumentation  Compiler scripts  mpif90 => tau_f90.sh  mpicxx => tau_cxx.sh  mpicc => tau_cc.sh  Framework level MPI instrumentation  Shared library MPI based CCAFFEINE framework  LD_PRELOAD based interposition of MPI wrapper  mpirun -np 4 ./ccafe-batch  mpirun -np 4 tau_load.sh ./ccafe-batch

20 20 MasterMind Component  Idea: Create a performance model for the component by tracking performance per invocation  Uses Monitor Port  Outputs:  Times per invocation, e.g. the sample output at the end of this slide  Component call path  Regular performance data (uses performance component)
# integ_proxy::integrate(double, double, int)
# MPI_TIME  Time  count  lowBound  upBound
72420   336   10000  0  1
407     449   1000   0  1
364     540   100    0  1
64838   844   10000  0  1
381     945   1000   0  1
332     1027  100    0  1

21 21 Parameter Based Profiling for CQoS  Idea: partition performance data for individual functions based on runtime parameters  Enable by configuring with -PROFILEPARAM  TAU call: TAU_PROFILE_PARAM1L(value, "name")  Simple example:
void foo(long input) {
  TAU_PROFILE("foo", "", TAU_DEFAULT);
  TAU_PROFILE_PARAM1L(input, "input");
  ...
}

22 22 Parameter Based Profiling  5 seconds spent in function “foo” becomes one profile entry per distinct parameter value, each named in the form “foo [ <name> = <value> ]”  2 seconds for one value  1 second for another value  …  Demonstrated in MPI wrapper library  Allows for partitioning of time spent in MPI routines based on parameters (message size, message tag, destination node)  Can be extrapolated to infer specifics about the MPI subsystem and system as a whole
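The partitioning can be illustrated with a small self-contained C++ sketch that keeps a flat total per function and, next to it, one total per parameter value. Nothing here is TAU code: ParamProfile, paramKey, and the exact label layout are assumptions of this sketch, and the sample values passed in a usage example would be made up.

```cpp
#include <map>
#include <sstream>
#include <string>

// Build a label for one (function, parameter, value) combination.
// The layout is an assumption of this sketch, not a guaranteed TAU format.
std::string paramKey(const std::string& func, const std::string& name, long value) {
    std::ostringstream key;
    key << func << " [ <" << name << "> = <" << value << "> ]";
    return key.str();
}

// Accumulates time both per function (flat) and per parameter value (split).
class ParamProfile {
public:
    void record(const std::string& func, const std::string& name, long value,
                double seconds) {
        flat_[func] += seconds;
        split_[paramKey(func, name, value)] += seconds;
    }

    // Total time in the function, regardless of parameter value.
    double flat(const std::string& func) const { return lookup(flat_, func); }

    // Time attributed to one specific parameter value.
    double split(const std::string& key) const { return lookup(split_, key); }

private:
    static double lookup(const std::map<std::string, double>& table,
                         const std::string& key) {
        const auto it = table.find(key);
        return it == table.end() ? 0.0 : it->second;
    }

    std::map<std::string, double> flat_;
    std::map<std::string, double> split_;
};
```

The split entries always sum to the flat entry, so the parameter view refines the ordinary profile rather than replacing it.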

23 23 Workload Characterization  Simple example: send/receive messages whose size doubles each step, from one int up to 32 MB
#include <mpi.h>

int buffer[8*1024*1024];

/* Run with exactly two ranks: rank 0 sends, rank 1 receives. */
int main(int argc, char **argv) {
  int rank, size, i, j;
  MPI_Init(&argc, &argv);
  MPI_Comm_size( MPI_COMM_WORLD, &size );
  MPI_Comm_rank( MPI_COMM_WORLD, &rank );
  for (i=0;i<1000;i++)
    for (j=1;j<=8*1024*1024;j*=2) {
      if (rank == 0) {
        MPI_Send(buffer,j,MPI_INT,1,42,MPI_COMM_WORLD);
      } else {
        MPI_Status status;
        MPI_Recv(buffer,j,MPI_INT,0,42,MPI_COMM_WORLD,&status);
      }
    }
  MPI_Finalize();
}
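As a quick check on what this workload exercises, the inner loop's message sizes can be enumerated without MPI. The helper below is a hypothetical sketch (messageSizesInBytes is not part of the example program) and takes the element size as an explicit argument; 4 bytes per int is an assumption that matches the 32 MB figure on the slide.

```cpp
#include <cstddef>
#include <vector>

// Enumerate the message sizes produced by a loop of the form
//   for (j = 1; j <= maxElements; j *= 2)
// where each message carries j elements of bytesPerElement bytes.
std::vector<std::size_t> messageSizesInBytes(std::size_t maxElements,
                                             std::size_t bytesPerElement) {
    std::vector<std::size_t> sizes;
    for (std::size_t j = 1; j <= maxElements; j *= 2) {
        sizes.push_back(j * bytesPerElement);
    }
    return sizes;
}
```

Calling messageSizesInBytes(8*1024*1024, 4) yields 24 distinct sizes, from 4 bytes up to 33554432 bytes (32 MB), so a parameter-based profile of MPI_Send keyed on message size would contain 24 entries for this program.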

24 24 Workload Characterization  Use tau_load.sh to instrument MPI routines (SGI Altix)
% icc mpi.c -lmpi
% mpirun -np 2 tau_load.sh -XrunTAU-icpc-mpi-pdt.so a.out
[Figures: SGI MPI (SGI Altix); Intel MPI (SGI Altix)]

25 25 Workload Characterization  Two different message sizes (~3.3MB and ~4K)

26 26 Parameter Based Profiling: SIDL Interface
package Performance version 1.7.2 {
  interface Timer {
    /* Start/stop the Timer */
    void start();
    void stop();
    /* Set Profile Parameter */
    void setParam1L(in long value, in string name);
    ...
  }

27 27 PerfDMF: Performance Data Mgmt. Framework

28 28 TAU Portal

29 29 TAU Portal https://tau.nic.uoregon.edu

30 30 TAU Portal: Application Specific Metadata Storage

31 31 Performance Data Mining (PerfExplorer)  Performance knowledge discovery framework  Data mining analysis applied to parallel performance data  comparative, clustering, correlation, dimension reduction, …  Use the existing TAU infrastructure  TAU performance profiles, PerfDMF  Client-server based system architecture  Technology integration  Java API and toolkit for portability  PerfDMF  R-project/Omegahat, Octave/Matlab statistical analysis  WEKA data mining package  JFreeChart for visualization, vector output (EPS, SVG)

32 32 Performance Data Mining (PerfExplorer)

33 33 PerfExplorer - Interface Select analysis

34 34 PerfExplorer - Relative Efficiency Plots

35 35 PerfExplorer - Relative Efficiency by Routine

36 36 PerfExplorer - Relative Speedup

37 37 PerfExplorer - Timesteps Per Second

38 38 Summary  Create component version of GAMESS, identify interfaces  Work with GAMESS and other application teams to apply TAU for inter- and intra-component instrumentation  Gather requirements for swapping components  Generate proxy components for applications, gather performance data, store results in performance data  Cross-experiment application performance characterization  Develop prototype for CQoS  http://www.cs.uoregon.edu/research/paracomp/tau/cca

39 39 Support Acknowledgements  Department of Energy (DOE)  Office of Science contracts  University of Utah DOE ASCI Level 1 sub-contract  DOE ASC/NNSA Level 3 contract  LLNL, LANL, ANL contracts  NSF Software and Tools for High-End Computing Grant  Research Centre Juelich  John von Neumann Institute for Computing  Dr. Bernd Mohr

