
Slide 1: TAU Parallel Performance System
Allen D. Malony, Sameer S. Shende, Alan Morris, Robert Bell, Kevin Huck, Nick Trebon, Suravee Suthikulpanit, Kai Li, Li Li
{malony,sameer,amorris,khuck,ntrebon,suravee}@cs.uoregon.edu
Department of Computer and Information Science, Performance Research Laboratory, University of Oregon
DOE NNSA/ASC Presentation, SC2004

Slide 2: Outline
- Motivation
- TAU architecture and toolkit
- Instrumentation
- Measurement
- Analysis
- Example DOE ASC applications
- TAU status
- Conclusion

Slide 3: Problem Domain
- ASC defines leading-edge parallel systems and software
- Large-scale systems and heterogeneous platforms
- Multi-model, multi-module simulation
- Complex, multi-layered software integration
- Multi-language programming
- Mixed-model and hybrid parallelism
- Complexity challenges performance analysis tools
- System diversity requires portable tools
- Need for cross-language support
- Support different parallel computation models
- Operate at scale

Slide 4: Research Motivation
- Tools for parallel performance problem solving
- Empirical-based performance optimization process
- Performance technology concerns
[Figure: performance optimization cycle of performance observation, experimentation, diagnosis, and tuning, linked by characterization, properties, and hypotheses, and supported by performance technology for instrumentation, measurement, analysis, and visualization, plus experiment management and a performance database]

Slide 5: TAU Performance System
- Tuning and Analysis Utilities (12+ year project effort)
- Performance system framework for HPC systems
- Integrated, scalable, flexible, and parallel
- Targets a general complex system computation model
- Entities: nodes / contexts / threads
- Multi-level: system / software / parallelism
- Measurement and analysis abstraction
- Integrated toolkit for performance instrumentation, measurement, analysis, and visualization
- Portable performance profiling and tracing facility
- Open software approach with technology integration
- University of Oregon, Research Center Jülich, LANL

Slide 6: TAU Performance System Objectives
- Multi-level performance instrumentation
- Multi-language automatic source instrumentation
- Flexible and configurable performance measurement
- Widely-ported parallel performance profiling system
- Computer system architectures and operating systems
- Different programming languages and compilers
- Support for multiple parallel programming paradigms
- Multi-threading, message passing, mixed-mode, hybrid
- Enable performance mapping across semantic levels
- Support for object-oriented and generic programming
- Integration in complex software systems and applications

Slide 7: General Complex System Computation Model
- Node: physically distinct shared memory machine
- Message-passing node interconnection network
- Context: distinct virtual memory space within node
- Thread: execution threads (user/system) in context (see the mapping sketch below)
[Figure: physical view of SMP nodes with memory joined by an interconnection network for inter-node message communication, alongside the model view of nodes, contexts, and threads]
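A minimal sketch (not from the slides) of how this node/context/thread model maps onto TAU's manual instrumentation in an MPI code: each MPI process is typically assigned a TAU node id via TAU_PROFILE_SET_NODE, and any threads created within the process are tracked as TAU threads. The surrounding main() is illustrative only.

    #include <mpi.h>
    #include <TAU.h>

    int main(int argc, char** argv) {
      TAU_PROFILE("main()", "int (int, char**)", TAU_DEFAULT);
      MPI_Init(&argc, &argv);
      int rank = 0;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      TAU_PROFILE_SET_NODE(rank);   // map this MPI process to TAU node 'rank'
      // ... per-node work; threads spawned here appear as TAU threads ...
      MPI_Finalize();
      return 0;
    }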

Slide 8: TAU Performance System Architecture

Slide 9: TAU Performance System Architecture

Slide 10: TAU Instrumentation Approach
- Support for standard program events
- Routines and statement-level blocks
- Classes and templates
- Based on begin/end (paired) event semantics (see the sketch below)
- Support for user-defined events
- “User-defined timers” (begin/end semantics)
- Atomic events (standard and user-defined)
- Selection of event statistics
- Support definition of “semantic” entities for mapping
- Support for event groups
- Instrumentation optimization
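A hedged sketch of the event types listed above, using TAU's C/C++ instrumentation macros: a routine-level begin/end event, a user-defined timer around a block, and an atomic event that records a value. The routine and event names here are illustrative only.

    #include <TAU.h>

    void solve_step(double* field, int n) {
      TAU_PROFILE("solve_step()", "void (double*, int)", TAU_USER);           // routine begin/end event
      TAU_REGISTER_EVENT(bytes_moved, "Bytes moved per step");                // atomic (user-defined) event

      TAU_PROFILE_TIMER(loop_timer, "solve_step: update loop", "", TAU_USER); // user-defined timer
      TAU_PROFILE_START(loop_timer);
      for (int i = 0; i < n; ++i) {
        field[i] *= 0.5;                                                      // stand-in computation
      }
      TAU_PROFILE_STOP(loop_timer);

      TAU_EVENT(bytes_moved, n * sizeof(double));                             // record one sample value
    }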

Slide 11: TAU Instrumentation Mechanisms
- Flexible instrumentation mechanisms at multiple levels
- Source code
- manual
- automatic: C, C++, F77/90/95 (Program Database Toolkit (PDT)); OpenMP (directive rewriting (Opari), POMP spec)
- Object code
- pre-instrumented libraries (e.g., MPI using PMPI) (see the wrapper sketch below)
- statically-linked and dynamically-linked
- Executable code
- dynamic instrumentation (pre-execution) (DyninstAPI)
- virtual machine instrumentation (e.g., Java using JVMPI)
- TAU_COMPILER to automate the instrumentation process
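To illustrate the PMPI-based object-code approach named above, a wrapper library exports the MPI entry point, records the event, and forwards to the real implementation through the name-shifted PMPI routine. This is a simplified sketch of the interposition technique, not TAU's actual wrapper source.

    #include <mpi.h>
    #include <TAU.h>

    extern "C" int MPI_Send(const void* buf, int count, MPI_Datatype datatype,
                            int dest, int tag, MPI_Comm comm) {
      TAU_PROFILE_TIMER(t, "MPI_Send()", "", TAU_MESSAGE);       // event for this MPI routine
      TAU_PROFILE_START(t);
      int rc = PMPI_Send(buf, count, datatype, dest, tag, comm); // forward to the real MPI library
      TAU_PROFILE_STOP(t);
      return rc;
    }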

Slide 12: Multi-Level Instrumentation and Mapping
- Multiple interfaces
- Simultaneously active
- Information sharing between interfaces
- Selective instrumentation
- Within/between levels
- Associate performance data with high-level semantic abstractions
[Figure: instrumentation levels spanning user-level abstractions and the problem domain, source code (preprocessor, compiler), object code and libraries (linker), executable, OS, VM, and the runtime image, each contributing performance data for the run]

Slide 13: TAU Source Instrumentation
- Automatic source instrumentation (tau_instrumentor)
- Routine entry/exit and class method entry/exit
- Block entry/exit and statement level (to be added)
- Uses an instrumentation specification file (example below)
- Include/exclude list for events and files
- Uses command line options for group selection
- Instrumentation event selection (tau_select)
- Automatic generation of instrumentation specification file
- Instrumentation language to describe event constraints
- Event identity and location
- Event performance properties (e.g., overhead analysis)
- Create tau_select scripts for performance experiments
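For reference, the include/exclude selection above is typically expressed in a small text file handed to tau_instrumentor. The directive keywords below follow the general shape of TAU's selective instrumentation file format; the routine and file names, and the use of '#' and '*' as wildcards, are given as assumptions rather than a definitive specification.

    BEGIN_EXCLUDE_LIST
    void quicksort(int *, int, int)
    double interpolate#
    END_EXCLUDE_LIST

    BEGIN_FILE_EXCLUDE_LIST
    external_solver.cpp
    *_generated.cpp
    END_FILE_EXCLUDE_LIST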

Slide 14: Program Database Toolkit (PDT)
- Program code analysis framework
- Used to develop source-based tools
- High-level interface to source code information
- Integrated toolkit for source code parsing, database creation, and database query
- Commercial-grade front-end parsers
- Portable IL analyzer, database format, and access API
- Open software approach for tool development
- Multiple source languages
- Implement automatic performance instrumentation tools
- tau_instrumentor

Slide 15: Program Database Toolkit (PDT)
[Figure: PDT toolchain. An application or library is parsed by C/C++ and Fortran (F77/90/95) front ends; the C/C++ and Fortran IL analyzers produce program database (PDB) files, which the DUCTAPE library exposes to tools such as PDBhtml (program documentation), SILOON (application component glue), CHASM (C++ / F90/95 interoperability), and TAU_instr (automatic source instrumentation)]

Slide 16: PDT Status
- Cleanscape Flint parser fully integrated for F90/95
- Flint parser is very robust
- Produces PDB records for TAU instrumentation
- Linux x86, HP Tru64, IBM AIX
- Tested on SAGE, POP, ESMF, PET benchmarking codes
- C++ and Fortran statement-level information
- for/while loops, declarations, initialization, assignment, …
- PDB records defined for most constructs
- PDT applications
- CHASM: C++ / Fortran 90/95 interoperability
- CCA: proxy generation, component instrumentation

Slide 17: TAU Performance Measurement
- TAU supports profiling and tracing measurement
- Robust timing and hardware performance support
- Online profile access and sampling
- Extension of TAU measurement for multiple counters
- User-defined TAU counters and system-level metrics
- Integration with trace measurement
- Support for memory and callpath profiling
- Fully portable parallel performance tracing solution
- Hierarchical trace merging and trace translation
- Component software monitoring
- Online performance profile overhead compensation

Slide 18: TAU Measurement Mechanisms
- Performance data sources
- High-resolution timer library (real-time / virtual clocks)
- General software counter library (user-defined events)
- Hardware performance counters
- PCL (Performance Counter Library) (ZAM, Germany)
- PAPI (Performance API) (UTK, Ptools Consortium)
- Consistent, abstract, portable API
- Organization
- Node, context, thread levels
- Profile groups for collective events (runtime selective)
- Performance data mapping between software levels

Slide 19: TAU Measurement Mechanisms (continued)
- Parallel profiling
- Function-level, block-level, statement-level
- Supports user-defined events
- TAU parallel profile data stored during execution
- Hardware counter values
- Support for multiple counters
- Support for callgraph and callpath profiling
- Tracing
- All profile-level events
- Inter-process communication events
- Inclusion of counter data in traced events

Slide 20: Performance Analysis and Visualization
- Analysis of parallel profile and trace measurement
- Parallel profile analysis
- ParaProf: parallel profile analysis and presentation
- ParaVis: parallel performance data visualization (proto)
- Profile generation from trace data
- Performance data management framework (PerfDMF)
- Parallel trace analysis
- Translation to VTF (V3.0) and EPILOG formats
- Integration with VNG (Technical University of Dresden)
- Online parallel analysis and visualization
- Integration with CUBE browser (UTK, FZJ)

Slide 21: ParaProf Framework Architecture
- Portable, extensible, and scalable tool for profile analysis
- Tries to offer “best of breed” capabilities to analysts
- Built as a profile analysis framework for extensibility

Slide 22: TAU PerfDMF Architecture

Slide 23: Selected Applications of TAU
- Center for Simulation of Accidental Fires and Explosions
- University of Utah, ASCI ASAP Center, C-SAFE
- Uintah Computational Framework (UCF) (C++)
- Center for Simulation of Dynamic Response of Materials
- California Institute of Technology, ASCI ASAP Center
- Virtual Test Facility (VTF) (Python, Fortran 90)
- Earth Systems Modeling Framework (ESMF)
- NSF, NOAA, DOE, NASA, …
- Instrumentation for ESMF framework and applications
- C, C++, and Fortran 95 code modules
- MPI wrapper library for MPI calls

Slide 24: Selected Applications of TAU (continued)
- Lawrence Livermore National Lab
- Hydrodynamics (Miranda)
- Radiation diffusion (KULL)
- C++ automatic instrumentation, callpath profiling
- Sandia National Lab
- DOE CCTTSS SciDAC project
- Common Component Architecture (CCA) integration
- Combustion code (C++, Fortran 90, GrACE, MPI)
- Los Alamos National Lab
- Monte Carlo transport (MCNP) (Susan Post)
- ASCI Q validation and scaling
- SAIC's Adaptive Grid Eulerian (SAGE) (Jack Horner)
- Fortran 90 automatic instrumentation test case

Slide 25: Component-Based Scientific Applications
- How can the performance analysis and tuning process be supported consistently with the application development methodology?
- Common Component Architecture (CCA) applications
- Performance tools should integrate with the software
- Design a performance observation component
- Measurement port and measurement interfaces
- Build support for application component instrumentation
- Interpose a proxy component for each port (see the sketch below)
- Inside the proxy, track caller/callee invocations and timings
- Automate the process of proxy component creation
- Using PDT for static analysis of components
- Include support for selective instrumentation
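The proxy interposition described above can be pictured with a hand-written stand-in (in practice the proxies are generated automatically from PDT's static analysis). 'IntegratorPort' is a hypothetical CCA-style port interface invented for this sketch; the proxy implements the same interface, times each invocation, and forwards it to the callee component.

    #include <TAU.h>

    struct IntegratorPort {                                  // hypothetical port interface
      virtual double integrate(double lo, double hi) = 0;
      virtual ~IntegratorPort() {}
    };

    class IntegratorPortProxy : public IntegratorPort {      // interposed between caller and callee
    public:
      explicit IntegratorPortProxy(IntegratorPort* callee) : callee_(callee) {}
      double integrate(double lo, double hi) {
        TAU_PROFILE_TIMER(t, "IntegratorPort::integrate()", "", TAU_USER);
        TAU_PROFILE_START(t);                                // time the port invocation
        double result = callee_->integrate(lo, hi);
        TAU_PROFILE_STOP(t);
        return result;
      }
    private:
      IntegratorPort* callee_;                               // the real component behind the port
    };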

Slide 26: Flame Reaction-Diffusion (Sandia, J. Ray)
[Figure: application components in the CCAFFEINE framework]

Slide 27: Earth Systems Modeling Framework
- Coupled modeling with modular software framework
- Instrumentation for ESMF framework and applications
- PDT automatic instrumentation
- Fortran 95 code modules
- C / C++ code modules
- MPI wrapper library for MPI calls
- ESMF component instrumentation (using CCA)
- CCA measurement port manual instrumentation
- Proxy generation using PDT and runtime interposition
- Significant use of callpath profiling by ESMF team

Slide 28: TAU's ParaProf Profile Browser (ESMF Data)
[Figure: callpath profile and global profile views]

Slide 29: CUBE Browser (UTK, FZJ) (ESMF Data)
[Figure: metric, call tree, and location panes showing TAU callpath profile data converted to CUBE form]

Slide 30: TAU Traces with Hardware Counters (ESMF)

Slide 31: TAU Traces with User-Defined Counters

Slide 32: Uintah Computational Framework (UCF)
- University of Utah, Center for Simulation of Accidental Fires and Explosions (C-SAFE), DOE ASCI Center
- UCF analysis
- Scheduling
- MPI library
- Components
- Performance mapping
- Use for online and offline visualization
- ParaVis tools
[Figure: UCF performance visualization for a 500-process run]

Slide 33: Scatterplot Displays (UCF, 500 processes)
- Each point's coordinates are determined by three values: MPI_Reduce, MPI_Recv, MPI_Waitsome
- Min/max value range
- Effective for cluster analysis
[Figure: relation between MPI_Recv and MPI_Waitsome]

Slide 34: Online Uintah Performance Profiling
- Demonstration of profile sampling capability (see the sketch below)
- Multiple profile samples
- Each profile taken at a major iteration (~60 seconds)
- Colliding elastic disks
- C-SAFE application
- Test material point method (MPM) code
- Executed on 512 processors of ASCI Blue Pacific at LLNL
- Example
- 3D bargraph visualization
- MPI execution time
- Performance mapping
- Multiple time steps
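A hedged sketch of the per-iteration profile sampling described above, assuming TAU's profile-dump call (TAU_DB_DUMP) writes the current profile to disk so an external analysis or visualization tool can pick up a fresh snapshot at each major iteration. The loop structure and routine names are illustrative.

    #include <TAU.h>

    void advance_one_iteration();   // hypothetical application time step

    void run(int num_iterations) {
      TAU_PROFILE("run()", "void (int)", TAU_USER);
      for (int i = 0; i < num_iterations; ++i) {
        advance_one_iteration();
        TAU_DB_DUMP();              // write a profile snapshot for online analysis
      }
    }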

Slide 35: Online Uintah Performance Profiling

Slide 36: Miranda Performance Analysis
- Miranda is a research hydrodynamics code at LLNL
- Fortran 95, MPI
- Mostly synchronous
- MPI_ALLTOALL on √Np x,y communicators
- Some MPI reductions and broadcasts for statistics
- Good communications scaling
- ACL and MCR Linux clusters
- Up to 1728 CPUs
- Fixed workload per CPU
- Ported to BlueGene/L

Slide 37: Profiling of Miranda on BG/L (Miller, LLNL)
- Profile code performance (automatic instrumentation)
- Scaling studies (problem size, number of processors)
- Run on 8K and 16K processors just two weeks ago!
[Figure: profiles at 128, 512, and 1024 nodes]

Slide 38: Fine-Grained Profiling via Tracing on Miranda
- Use TAU to generate VTF3 traces for Vampir analysis
- Combines MPI calls with HW counter information
- Detailed code behavior to focus optimization efforts

Slide 39: Memory Usage Analysis in Miranda on BG/L
- BG/L will have limited memory per node (512 MB)
- Miranda uses TAU to profile memory usage (see the sketch below)
- Streamlines code
- Squeezes larger problems onto the machine
- TAU's footprint is small
- Approximately 60 bytes per event per thread
[Figure: max heap memory (KB) used for a 128^3 problem on 16 processors of ASC Frost at LLNL]
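A minimal sketch of the kind of memory instrumentation involved, assuming TAU's memory-tracking macros (TAU_TRACK_MEMORY_HERE records heap usage as an event at the point of the call). The routine names are illustrative, not taken from Miranda.

    #include <TAU.h>

    void allocate_fields(int nx, int ny, int nz);   // hypothetical allocation routine

    void setup_problem(int nx, int ny, int nz) {
      TAU_PROFILE("setup_problem()", "void (int, int, int)", TAU_USER);
      TAU_TRACK_MEMORY_HERE();                      // record heap usage before the large allocations
      allocate_fields(nx, ny, nz);
      TAU_TRACK_MEMORY_HERE();                      // and again after, to bound their footprint
    }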

Slide 40: TAU Performance System Status
- Computing platforms (selected): IBM SP / pSeries, SGI Origin 2K/3K, Cray T3E / SV-1 / X1, HP (Compaq) SC (Tru64), Sun, Hitachi SR8000, NEC SX-5/6, Linux (IA-32/64, Alpha, PPC, PA-RISC, Power, Opteron), Apple (G4/5, OS X), Windows, BG/L
- Programming languages: C, C++, Fortran 77/90/95, HPF, Java, OpenMP, Python
- Thread libraries: pthreads, SGI sproc, Java, Windows, OpenMP
- Compilers (selected): Intel (KCC, KAP/Pro), PGI, GNU, Fujitsu, Sun, Microsoft, SGI, Cray, IBM (xlc, xlf), HP, NEC, Absoft

Slide 41: Concluding Remarks
- Complex ASC parallel systems and software pose challenging performance analysis problems that require robust methodologies and tools
- To build more sophisticated performance tools, existing proven performance technology must be utilized
- Performance tools must be integrated with software and systems models and technology
- Performance-engineered software should function consistently and coherently
- The TAU performance system offers robust performance technology that can be broadly integrated in next-generation scalable software and systems

Slide 42: Acknowledgements
- Department of Energy (DOE)
- MICS office
- “Performance Technology for Tera-class Parallel Computer Systems: Evolution of the TAU Performance System”
- “Performance Analysis of Parallel Component Software”
- NNSA/ASC
- University of Utah DOE ASCI Level 1 sub-contract
- ASCI Level 3 project (LANL, LLNL, SNL)
- Research Centre Juelich
- John von Neumann Institute for Computing
- Dr. Bernd Mohr
- Los Alamos National Laboratory

