1 Analysis Infrastructure for CQoS using TAU
Sameer Shende, Allen D. Malony and Alan Morris {sameer, malony,
Department of Computer and Information Science, Performance Research Laboratory, NeuroInformatics Center, University of Oregon

2 Acknowledgement
- Jaideep Ray, SNL
- Lois McInnes, ANL
- David Bernholdt, ORNL
- Boyana Norris, ANL
- Robert Yelle, U. Oregon

3 Outline
- Motivation: CQoS
- Instrumentation
- Measurement
- Analysis tools

4 CQoS in GAMESS
- Robert Yelle, PRL, U. Oregon
- Calculate the energy of the thiophene molecule using different algorithms
    FINAL U-B3LYP ENERGY IS AFTER 21 ITERATIONS
    FINAL U-BLYP ENERGY IS AFTER 22 ITERATIONS
    FINAL UHF ENERGY IS AFTER 11 ITERATIONS
    FINAL U-SVWN ENERGY IS AFTER 22 ITERATIONS

5 TAU Performance System Framework
- Tuning and Analysis Utilities
- Performance system framework for scalable parallel and distributed high-performance computing
- Targets a general complex system computation model
  - nodes / contexts / threads
  - Multi-level: system / software / parallelism
  - Measurement and analysis abstraction
- Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
  - Portable, configurable performance profiling/tracing facility
  - Open software approach
- University of Oregon, LANL, FZJ Germany

6 TAU Performance System Architecture event selection

7 Performance Evaluation Alternatives
Figure: alternatives arranged along an axis labeled "Volume of performance data": flat profile, depthlimit profile, parameter profile, callpath/callgraph profile, phase profile, trace.
Each alternative has:
- one metric/counter
- multiple counters

8 Enhancements in TAU to support CQoS
- Instrumentation
  - Runtime MPI wrapper interposition for CCA framework instrumentation
  - Automatic proxy component creation for classic and SIDL components
  - PDT v3.10 (coming, beta released) supports EDG v3.8 for better C/C++ parsing support (GNU extensions, BOOST, ASM statements)
- Profile Measurement
  - Parameter based profiling to capture application data
  - Context Events to capture callpath with user
  - Support for memory profiling and memory leak detection
  - Timestamped profile snapshots (coming)
- Analysis
  - Extensions to PerfDMF to support model storage
    - Application specific metadata
  - ParaProf extensions to display profile snapshots, parameter based profiles
  - PerfExplorer data mining framework
  - Web based access to performance database via a TAU portal
    - Ability to store images, share data, metadata

9 TAU's CCA Performance Component: Core API
- Measurement port and interfaces
  - Timer
    - set name/type/group
    - start/stop
  - Phase
    - set name/type/group
    - start/stop
  - Control
    - enable/disable groups
  - Query
    - get timer names, get metric names, get user-defined event names
    - get timer data, get user-defined event data, dump data to disk
  - Event
    - set name, trigger event
  - Context Event (callpath of routines + user event information)
    - set name, trigger event
  - MemoryTracker and MemoryHeadroomTracker
    - enable interrupt tracking, track memory/headroom here, set interrupt interval
    - enable/disable tracking memory/headroom

10 TAU's CCA Interfaces
- Performance evaluation using Performance component
  - Uses underlying TAU library for measurement
  - Timer, Phase, Event/ContextEvent, Control, Query, MemoryTracker/MemoryHeadroomTracker interfaces
  - Lightweight instrumentation option
- Performance modeling using Mastermind component
  - Tracks per-invocation performance data
  - Associates performance data with application data
    - Method arguments logged with performance data
    - Callpath information
  - Helps us build performance models
- Updated performance component released Jan. '07

11 Phase Interface

    interface Timer {
      /* Start/stop the Timer */
      void start();
      void stop();
      /* Set/get the Timer name */
      void setName(in string name);
      string getName();
      /* Set/get Timer type information (e.g., signature of the routine) */
      void setType(in string name);
      string getType();
      /* Set/get the group name associated with the Timer */
      void setGroupName(in string name);
      string getGroupName();
      /* Set/get the group id associated with the Timer */
      void setGroupId(in long group);
      long getGroupId();
    }

    interface Measurement extends gov.cca.Port {
      /* Create a Timer */
      Timer createTimer();
      Timer createTimerWithName(in string name);
      Timer createTimerWithNameType(in string name, in string type);
      Timer createTimerWithNameTypeGroup(in string name, in string type, in string group);
    }

    interface Phase {
      /* Start/stop the Phase */
      void start();
      void stop();
      /* Set/get the Phase name */
      void setName(in string name);
      string getName();
      /* Set/get Phase type information (e.g., signature of the routine) */
      void setType(in string name);
      string getType();
      /* Set/get the group name associated with the Phase */
      void setGroupName(in string name);
      string getGroupName();
      /* Set/get the group id associated with the Phase */
      void setGroupId(in long group);
      long getGroupId();
    }

    interface Measurement extends gov.cca.Port {
      /* Create a Phase */
      Phase createPhase();
      Phase createPhaseWithName(in string name);
      Phase createPhaseWithNameType(in string name, in string type);
      Phase createPhaseWithNameTypeGroup(in string name, in string type, in string group);
    }
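
To make the Timer interface on this slide more concrete, here is a minimal C++ sketch of a class that offers the same start/stop, name, type, and group accessors. It is not TAU's implementation: the class name `SketchTimer`, the use of `std::chrono::steady_clock`, and the `calls()` and `inclusiveSeconds()` accessors are assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <string>
#include <utility>

// Hypothetical stand-in for a component class implementing the SIDL Timer interface.
class SketchTimer {
public:
    explicit SketchTimer(std::string name = "", std::string type = "",
                         std::string group = "")
        : name_(std::move(name)), type_(std::move(type)), group_(std::move(group)) {}

    // Start/stop the timer; unbalanced calls are treated as usage errors.
    void start() {
        assert(!running_ && "start() called while the timer is running");
        begin_ = Clock::now();
        running_ = true;
    }
    void stop() {
        assert(running_ && "stop() called without a matching start()");
        total_ += Clock::now() - begin_;
        running_ = false;
        ++calls_;
    }

    // Set/get name, type (e.g., routine signature), group name, and group id.
    void setName(const std::string& name) { name_ = name; }
    std::string getName() const { return name_; }
    void setType(const std::string& type) { type_ = type; }
    std::string getType() const { return type_; }
    void setGroupName(const std::string& group) { group_ = group; }
    std::string getGroupName() const { return group_; }
    void setGroupId(long id) { groupId_ = id; }
    long getGroupId() const { return groupId_; }

    // Accessors added for this sketch only (not part of the SIDL interface).
    long calls() const { return calls_; }
    double inclusiveSeconds() const {
        return std::chrono::duration<double>(total_).count();
    }

private:
    using Clock = std::chrono::steady_clock;
    std::string name_, type_, group_;
    long groupId_ = 0;
    Clock::time_point begin_{};
    Clock::duration total_{};  // accumulated time over all start/stop pairs
    bool running_ = false;
    long calls_ = 0;
};
```

A caller would construct one `SketchTimer` per routine, wrap each invocation in `start()`/`stop()`, and read the accumulated time afterwards; a Phase could be modeled the same way.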

12 Measurement Proxy Component
- Interpose a proxy component for each port
- Inside the proxy
  - Make calls to Performance component for each invocation
Figure labels: Driver (Go, IntegratorPort); IntegratorProxy component (IntegratorPort uses, IntegratorPort provides, MeasurementPort); MidpointIntegrator (IntegratorPort); Performance (MeasurementPort)
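
The interposition idea on this slide can be sketched in plain C++ without the CCA framework. In the sketch below the proxy implements the same port as the real component, forwards each call, and reports the elapsed time to a measurement object. All class names, the `integrate` signature, and the `MeasurementSketch` type are assumptions for illustration; a generated proxy would call the Performance component's MeasurementPort instead.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <string>

// Port shared by the real component and its proxy (hypothetical signature).
struct IntegratorPort {
    virtual ~IntegratorPort() = default;
    virtual double integrate(double lowBound, double upBound, int count) = 0;
};

// The "real" component: midpoint-rule integration of f(x) = x * x.
class MidpointIntegrator : public IntegratorPort {
public:
    double integrate(double lowBound, double upBound, int count) override {
        assert(count > 0);
        const double h = (upBound - lowBound) / count;
        double sum = 0.0;
        for (int i = 0; i < count; ++i) {
            const double x = lowBound + (i + 0.5) * h;
            sum += x * x;
        }
        return sum * h;
    }
};

// Stand-in for the measurement port: accumulates calls and time per timer name.
struct MeasurementSketch {
    struct Stats { long calls = 0; double seconds = 0.0; };
    std::map<std::string, Stats> timers;
    void record(const std::string& name, double seconds) {
        Stats& s = timers[name];
        ++s.calls;
        s.seconds += seconds;
    }
};

// Proxy: provides IntegratorPort to the caller, uses the callee's IntegratorPort,
// and measures every invocation that passes through it.
class IntegratorProxy : public IntegratorPort {
public:
    IntegratorProxy(IntegratorPort& target, MeasurementSketch& measurement)
        : target_(target), measurement_(measurement) {}

    double integrate(double lowBound, double upBound, int count) override {
        const auto begin = std::chrono::steady_clock::now();
        const double result = target_.integrate(lowBound, upBound, count);
        const std::chrono::duration<double> elapsed =
            std::chrono::steady_clock::now() - begin;
        measurement_.record("integrate(double, double, int)", elapsed.count());
        return result;
    }

private:
    IntegratorPort& target_;
    MeasurementSketch& measurement_;
};
```

Because the driver only sees an `IntegratorPort`, it can be connected to either the proxy or the real component without source changes, which is what lets the proxy be generated and wired in automatically.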

13 MasterMind Component
- Idea: Create a performance model for the component by tracking performance per invocation
- Uses Monitor Port
- Outputs:
  - Times per invocation, e.g.
      # integ_proxy::integrate(double, double, int)
      # MPI_TIME Time count lowBound upBound
  - Component call path
  - Regular performance data (uses performance component)
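
A per-invocation log of the kind described on this slide can be sketched as follows. The column order follows the header shown on the slide (MPI_TIME, Time, count, lowBound, upBound); the type names `InvocationRecord` and `MethodLogSketch` and the text layout of each row are assumptions, not the MasterMind component's actual format.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// One row per invocation: measured times plus the method arguments.
struct InvocationRecord {
    double mpiTime;   // time spent in MPI during the call
    double time;      // total time of the call
    int count;        // argument: number of intervals
    double lowBound;  // argument: lower integration bound
    double upBound;   // argument: upper integration bound
};

// Hypothetical per-method log kept by a monitor for one routine.
class MethodLogSketch {
public:
    explicit MethodLogSketch(std::string signature)
        : signature_(std::move(signature)) {}

    void log(const InvocationRecord& r) {
        assert(r.time >= 0.0 && r.mpiTime >= 0.0);
        records_.push_back(r);
    }

    std::size_t size() const { return records_.size(); }

    // Two '#' header lines, then one whitespace-separated row per invocation.
    std::string dump() const {
        std::ostringstream out;
        out << "# " << signature_ << "\n";
        out << "# MPI_TIME Time count lowBound upBound\n";
        for (const InvocationRecord& r : records_) {
            out << r.mpiTime << ' ' << r.time << ' ' << r.count << ' '
                << r.lowBound << ' ' << r.upBound << "\n";
        }
        return out.str();
    }

private:
    std::string signature_;
    std::vector<InvocationRecord> records_;
};
```

Keeping the arguments next to the timings is what makes it possible to fit a model of time as a function of the inputs later on.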

14 Monitor Proxy Component
- Same idea (from the user's point of view)
Figure labels: Driver (Go, IntegratorPort); Integrator Monitor Proxy (IntegratorPort uses, IntegratorPort provides, MonitorPort); MidpointIntegrator (IntegratorPort); MasterMind (MonitorPort, MeasurementPort); Performance (MeasurementPort)

15 Tools Included with MasterMind Component
- Tree pruner
  - Input:
    - Callgraph generated by Mastermind component
    - User specified rules
  - Output:
    - Pruned callgraph with insignificant nodes removed
- Performance modeling library - brute force
  - Tries all possible permutations of component instances
  - Input: performance model of each component
  - Selects optimal component assembly for the ensemble
- Optimizer
  - Swaps one component instance with another
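
The brute-force selection described on this slide can be sketched as an exhaustive search over every combination of candidate instances, one per slot in the assembly. This is only an illustration under stated assumptions: the names `selectBruteForce`, `Choice`, `CostModel`, and `Assembly` are hypothetical, and the performance model is reduced to a single callable that predicts the run time of a complete assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// choice[i] is the index of the candidate instance picked for slot i.
using Choice = std::vector<std::size_t>;
// Hypothetical performance model: predicted seconds for a complete assembly.
using CostModel = std::function<double(const Choice&)>;

struct Assembly {
    Choice choice;
    double predictedSeconds;
};

// Evaluate every combination of candidates and keep the cheapest one.
// candidatesPerSlot[i] is the number of interchangeable instances for slot i.
Assembly selectBruteForce(const Choice& candidatesPerSlot,
                          const CostModel& predictSeconds) {
    for (std::size_t n : candidatesPerSlot) {
        assert(n > 0 && "every slot needs at least one candidate");
    }
    Choice index(candidatesPerSlot.size(), 0);
    Assembly best{index, std::numeric_limits<double>::infinity()};
    while (true) {
        const double predicted = predictSeconds(index);
        if (predicted < best.predictedSeconds) {
            best = Assembly{index, predicted};
        }
        // Advance a mixed-radix counter to the next combination.
        std::size_t pos = 0;
        while (pos < index.size() && ++index[pos] == candidatesPerSlot[pos]) {
            index[pos] = 0;
            ++pos;
        }
        if (pos == index.size()) {
            break;  // every combination has been visited
        }
    }
    return best;
}
```

The number of evaluations is the product of the candidate counts, so this approach only suits small assemblies; the Optimizer's single-instance swaps explore far fewer combinations per step.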

16 TAU's Proxy Generator for SIDL/Classic CCA
- Generate regular measurement proxy or monitor (MasterMind) proxy
- Arguments and options:
    -c  Full name of the component
    -t  Type of component
    -p  Name of port to generate proxy for
    -d  Name of pdb file created from cxxparse
    -h  Header file for this port
    -n  Name of the proxy component (default: base of component name + Proxy)
    -o  Name of output file (default: proxy.cc)
    -f  Use pre-generated selective instrumentation file
    -x  Namespace tag
    -m  Generate MasterMind component proxy

17 TAU's Proxy Generator for Classic C++ Interface
- Creating PDB Files:
    cxxparse -I -D
- Merging PDB Files:
    pdbmerge -o merged.pdb file1.pdb file2.pdb …
- Invoking tau_pg (example)
    tau_pg -c integrators::ccaports::Integrator -t integrators.ccaports.Integrator -p IntegratorPort -d ParallelIntegrator_CCA.pdb -o Proxy.cc -h ports/Integrator_CCA.h -f select.dat

18 What's Going On Here?
Figure labels: application components; performance component (TAU API); alternative implementations of performance component (TAU API, other API, …); runtime TAU performance data

19 Multi-Level Instrumentation
- Inter-Component
  - Proxy components created automatically
  - Proxy interposed between caller and callee
- Intra-Component
  - PDT based source instrumentation
  - Compiler scripts
    - mpif90 => tau_f90.sh
    - mpicxx => tau_cxx.sh
    - mpicc => tau_cc.sh
- Framework level MPI instrumentation
  - Shared library MPI based CCAFFEINE framework
  - LD_PRELOAD based interposition of MPI wrapper
    - mpirun -np 4 ./ccafe-batch
    - mpirun -np 4 tau_load.sh ./ccafe-batch

20 MasterMind Component
- Idea: Create a performance model for the component by tracking performance per invocation
- Uses Monitor Port
- Outputs:
  - Times per invocation, e.g.
      # integ_proxy::integrate(double, double, int)
      # MPI_TIME Time count lowBound upBound
  - Component call path
  - Regular performance data (uses performance component)

21 Parameter Based Profiling for CQoS
- Idea: partition performance data for individual functions based on runtime parameters
- Enable by configuring with -PROFILEPARAM
- TAU call: TAU_PROFILE_PARAM1L(value, "name")
- Simple example:
    void foo(long input) {
      TAU_PROFILE("foo", "", TAU_DEFAULT);
      TAU_PROFILE_PARAM1L(input, "input");
      ...
    }

22 Parameter Based Profiling
- 5 seconds spent in function "foo" becomes
  - 2 seconds for "foo [ = ]"
  - 1 second for "foo [ = ]"
  - …
- Demonstrated in MPI wrapper library
  - Allows for partitioning of time spent in MPI routines based on parameters (message size, message tag, destination node)
  - Can be extrapolated to infer specifics about the MPI subsystem and system as a whole
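
The partitioning described on this slide can be illustrated with a small bookkeeping sketch: the time charged to a function is kept both as a flat total and per (function, parameter name, parameter value) key. This is not how TAU stores profiles internally; the class `ParamProfileSketch`, its methods, and the parameter values used in the usage note are assumptions for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>

// Hypothetical profile store that partitions time by a runtime parameter value.
class ParamProfileSketch {
public:
    // Charge `seconds` to `func`, both in total and for this parameter value.
    void record(const std::string& func, const std::string& paramName,
                long value, double seconds) {
        assert(seconds >= 0.0);
        flat_[func] += seconds;
        byParam_[std::make_tuple(func, paramName, value)] += seconds;
    }

    // Total time for the function, regardless of parameter values.
    double flatSeconds(const std::string& func) const {
        const auto it = flat_.find(func);
        return it == flat_.end() ? 0.0 : it->second;
    }

    // Time for the function restricted to one parameter value.
    double paramSeconds(const std::string& func, const std::string& paramName,
                        long value) const {
        const auto it = byParam_.find(std::make_tuple(func, paramName, value));
        return it == byParam_.end() ? 0.0 : it->second;
    }

private:
    using Key = std::tuple<std::string, std::string, long>;
    std::map<std::string, double> flat_;
    std::map<Key, double> byParam_;
};
```

For example, recording 2 seconds for `foo` with `input` equal to 25, 1 second with `input` equal to 5, and 2 seconds with `input` equal to 1 leaves the flat total at 5 seconds while each parameter value keeps its own share. In the MPI wrapper case the parameter would be the message size, tag, or destination.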

23 Workload Characterization
- Simple example, send/receive message sizes that double each iteration (0-32MB)

    #include <mpi.h>

    int buffer[8*1024*1024];

    int main(int argc, char **argv) {
      int rank, size, i, j;
      MPI_Init(&argc, &argv);
      MPI_Comm_size( MPI_COMM_WORLD, &size );
      MPI_Comm_rank( MPI_COMM_WORLD, &rank );
      /* Rank 0 sends and the other rank receives messages of j ints */
      for (i=0;i<1000;i++)
        for (j=1;j<=8*1024*1024;j*=2) {
          if (rank == 0) {
            MPI_Send(buffer,j,MPI_INT,1,42,MPI_COMM_WORLD);
          } else {
            MPI_Status status;
            MPI_Recv(buffer,j,MPI_INT,0,42,MPI_COMM_WORLD,&status);
          }
        }
      MPI_Finalize();
      return 0;
    }
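
The inner loop in this example doubles the element count `j` from 1 up to 8*1024*1024, so the set of message sizes that a parameter-based profile will see can be worked out ahead of time. The helper below is a sketch for that arithmetic only; the function name `messageSizesBytes` is hypothetical, and the 32 MB upper bound assumes a 4-byte `int`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Message sizes in bytes produced by a loop of the form
//   for (j = 1; j <= maxCount; j *= 2) send j elements of elemBytes each.
std::vector<std::size_t> messageSizesBytes(
        std::size_t maxCount = 8u * 1024u * 1024u,
        std::size_t elemBytes = sizeof(int)) {
    assert(maxCount >= 1 && elemBytes >= 1);
    std::vector<std::size_t> sizes;
    for (std::size_t j = 1; j <= maxCount; j *= 2) {
        sizes.push_back(j * elemBytes);
    }
    return sizes;
}
```

With the default arguments and a 4-byte `int`, the loop yields 24 distinct sizes from 4 bytes to 32 MB, and the outer loop repeats each of them 1000 times.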

24 Workload Characterization
- Use tau_load.sh to instrument MPI routines (SGI Altix)
    % icc mpi.c -lmpi
    % mpirun -np 2 tau_load.sh -XrunTAU-icpc-mpi-pdt.so a.out
Figure panels: SGI MPI (SGI Altix); Intel MPI (SGI Altix)

25 Workload Characterization  Two different message sizes (~3.3MB and ~4K)

26 Parameter Based Profiling: SIDL Interface

    package Performance version {
      interface Timer {
        /* Start/stop the Timer */
        void start();
        void stop();
        /* Set Profile Parameter */
        void setParam1L(in long value, in string name);
        ...
      }

27 PerfDMF: Performance Data Mgmt. Framework

28 TAU Portal

29 TAU Portal

30 TAU Portal: Application Specific Metadata Storage

31 Performance Data Mining (PerfExplorer)
- Performance knowledge discovery framework
  - Data mining analysis applied to parallel performance data
    - comparative, clustering, correlation, dimension reduction, …
  - Use the existing TAU infrastructure
    - TAU performance profiles, PerfDMF
  - Client-server based system architecture
- Technology integration
  - Java API and toolkit for portability
  - PerfDMF
  - R-project/Omegahat, Octave/Matlab statistical analysis
  - WEKA data mining package
  - JFreeChart for visualization, vector output (EPS, SVG)

32 Performance Data Mining (PerfExplorer)

33 PerfExplorer - Interface Select analysis

34 PerfExplorer - Relative Efficiency Plots

35 PerfExplorer - Relative Efficiency by Routine

36 PerfExplorer - Relative Speedup

37 PerfExplorer - Timesteps Per Second

38 Summary
- Create component version of GAMESS, identify interfaces
- Work with GAMESS and other application teams to apply TAU for inter- and intra-component instrumentation
- Gather requirements for swapping components
- Generate proxy components for applications, gather performance data, store results in performance database
- Cross-experiment application performance characterization
- Develop prototype for CQoS

39 Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science contracts
  - University of Utah DOE ASCI Level 1 sub-contract
  - DOE ASC/NNSA Level 3 contract
  - LLNL, LANL, ANL contracts
- NSF Software and Tools for High-End Computing Grant
- Research Centre Juelich
  - John von Neumann Institute for Computing
  - Dr. Bernd Mohr