Integration and Synthesis for Automated Performance Tuning: TAU Performance System®

1 Integration and Synthesis for Automated Performance Tuning: TAU Performance System®
- Performance problem solving framework for HPC
- Integrated, scalable, flexible, portable
- Target all parallel programming paradigms
- Integrated performance toolkit (open source)
- Multi-level performance instrumentation
- Source, compiler, library, binary
- Flexible and configurable performance measurement
- Shared-memory multithreading
- Distributed memory message passing
- Heterogeneous: accelerators (GPUs), coprocessors (Xeon Phi)
- I/O frameworks and runtime systems
- Widely-ported performance profiling / tracing system
- Measurement system integration with Score-P
- Performance data management and data mining
- Incorporates several other technologies in its implementation
- TAU is available on all DOE HPC systems
- TAU is being incorporated in several DOE exascale projects: Mona, Argo, XPRESS, Vancouver
http://tau.uoregon.edu

2 TAU History
- 1992-1995: DARPA pC++ (Gannon, Malony, Mohr). TAU (Tools Are Us) is born. [parallel profiling, tracing, performance extrapolation]
- 1995-1998: Shende Ph.D. (performance mapping, instrumentation). TAU v1.0. [multiple languages, source analysis, automatic instrumentation]
- 1998-2001: Significant effort in Fortran analysis and instrumentation, work with Mohr on OpenMP, Kojak tracing integration, focus on automated performance analysis. [performance diagnosis, source analysis, instrumentation]
- 2002-2005: Focus on profiling analysis, measurement scalability, and perturbation compensation. [analysis, scalability, perturbation analysis, applications]
- 2005-2007: More emphasis on tool integration, usability, and data presentation. TAU v2.0 released. [performance visualization, binary instrumentation, integration, performance diagnosis and modeling]
- 2008-2011: Add performance database support, data mining, and rule-based analysis. Develop measurement/analysis for heterogeneous systems. Core measurement infrastructure integration (Score-P). [database, data mining, expert system, heterogeneous measurement, infrastructure integration]
- 2012-present: Focus on exascale systems. Improve scalability, heterogeneous support, runtime system integration, dynamic adaptation. Apply to petascale / exascale applications. [scale, autotuning, user-level]
[Figure labels: Observability, Diagnosis, Complexity, Exascale]

3 Advances in TAU Technologies
[Figure labels: TAU + Scalasca, Score-P, TAUdb, PerfExplorer, ParaProf]

4 Integrated Profiling
- Phase-based profiling
- Integrated hybrid profiling
- Merge sampling-based and probe-based measurement
- Unified context
- Charm++ integration
- Projections (tracing)
- TAU (profiling)
[Figure labels: Madness, FLASH, Intrepid, Ranger, NAMD]

5 Automation and Knowledge Engineering
- PerfExplorer
- Multi-experiment data mining
- Programmable, extensible framework to support workflow automation
- Rule-based inferences for expert system analysis
- S3D scalability analysis (Jaguar, Intrepid, 12K cores)
[Figure note: r = 1 implies direct correlation]
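
The figure note on this slide refers to a correlation coefficient r. As a minimal sketch, assuming r is the usual Pearson coefficient (the slide does not say which statistic PerfExplorer computes), the function below correlates one event's time with core count across several experiments. The function name `pearson_r` and all data values are hypothetical and are not taken from the S3D study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical multi-experiment data: one entry per run.
cores = [1500, 3000, 6000, 12000]
event_seconds = [2.0, 4.0, 8.0, 16.0]  # made-up time of one event per run

r = pearson_r(cores, event_seconds)
print(f"r = {r:.3f}")  # a value near 1 means the event time grows with core count
```

A value of r close to 1 for an event would flag it as a candidate scalability bottleneck; a rule-based layer could then act on that flag.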

6 Observing Performance Dynamics
- Parallel profiling is lightweight, but lacks temporal measurement
- Tracing can generate large trace files
- Capture performance dynamics with profile snapshots
[Figure labels: Information, Overhead, Traces, Profile Snapshots, Profiles; Initialization, Checkpointing, Finalization; Flash 3.0 INTRFC]
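
The snapshot idea can be sketched as follows: a profile is a set of cumulative per-event totals, a snapshot is a copy of those totals at some point in the run, and subtracting consecutive snapshots recovers per-interval behavior without a full trace. This is an illustrative sketch only, not TAU's snapshot format or API; the function name `interval_profiles` and the numbers are hypothetical, and the event names are borrowed from the figure labels.

```python
def interval_profiles(snapshots):
    """Convert cumulative profile snapshots into per-interval deltas.

    Each snapshot maps an event name to its cumulative time so far.
    Events missing from an earlier snapshot are treated as 0.0.
    """
    deltas = []
    previous = {}
    for snap in snapshots:
        delta = {
            event: value - previous.get(event, 0.0)
            for event, value in snap.items()
        }
        deltas.append(delta)
        previous = snap
    return deltas


# Hypothetical cumulative snapshots taken at three points in a run.
snapshots = [
    {"Initialization": 5.0},
    {"Initialization": 5.0, "INTRFC": 3.0},
    {"Initialization": 5.0, "INTRFC": 9.0, "Checkpointing": 2.0},
]

for index, delta in enumerate(interval_profiles(snapshots)):
    print(index, delta)
```

Storage grows with the number of snapshots rather than the number of events executed, which is why this sits between profiles and traces in the information/overhead trade-off.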

7 Parallel Performance Monitoring
- Parallel performance state is globally distributed
- Logically part of application’s global data space
- Offline: outputs data at execution end for post-mortem analysis
- Online: access to performance state for analysis
- Definition: Monitoring
- Online access to parallel performance (data) state
- May or may not involve runtime analysis
- Couple with monitoring infrastructures
- TAUoverSupermon
- Supermon monitor from Los Alamos National Laboratory
- TAUoverMRNET
- Multicast Reduction Network (MRNet) infrastructure from Wisconsin
- TAUg
- MPI-based infrastructure to provide global view of TAU profile data
- TAUmon
- Transport-neutral (SuperMon, MRNet, MPI)
- Develop online analysis methods
- Aggregation, statistics, …
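
As a rough illustration of the "aggregation, statistics" step, the sketch below reduces per-rank profile values to per-event summary statistics. It runs in a single process on a plain list of dictionaries; in a real monitor the reduction would be carried out over a transport such as MPI or MRNet. The function name `aggregate_profiles` and the sample data are hypothetical and do not reflect the TAUmon interface.

```python
import statistics


def aggregate_profiles(rank_profiles):
    """Reduce per-rank profiles to per-event min, max, mean, and stddev.

    rank_profiles is a list with one dict per rank, mapping event name
    to a measured value. An event is summarized over the ranks that
    reported it.
    """
    values_by_event = {}
    for profile in rank_profiles:
        for event, value in profile.items():
            values_by_event.setdefault(event, []).append(value)

    summary = {}
    for event, values in values_by_event.items():
        summary[event] = {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "stddev": statistics.pstdev(values),  # population std deviation
        }
    return summary


# Hypothetical per-rank exclusive times (seconds) for four ranks.
rank_profiles = [
    {"MPI_Wait": 1.0, "compute": 9.0},
    {"MPI_Wait": 2.0, "compute": 8.0},
    {"MPI_Wait": 3.0, "compute": 7.0},
    {"MPI_Wait": 4.0, "compute": 6.0},
]

for event, stats in aggregate_profiles(rank_profiles).items():
    print(event, stats)
```

A large gap between `max` and `mean` for an event is one simple online signal of load imbalance.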

8 Integrated Heterogeneous Support in TAU
- Integrated GPU support
- Enable host-GPU measurement
- CUDA, OpenCL, OpenACC
- Accelerator compiler integration
- Utilize PAPI CUDA and CUPTI
- Provide both heterogeneous profiling and tracing support
- Contextualization of asynchronous kernel invocation
- Static analysis of GPU kernel
- GPU kernel sampling
- Full support for Intel Xeon Phi
[Figure label: GTC]

9 Cross-Platform OpenMP Performance Analysis
- How do you support a common performance analysis methodology for OpenMP across systems and compilers?
- Evaluate four methods for OpenMP measurement:
- POMP
- ORA w/OpenUH
- ORA w/GOMP
- OMPT w/Intel
… all integrated into TAU
- Compare performance visibility, semantic context, measurement overhead, and integration in analysis
ORA: OpenMP Runtime API
GOMP: GCC OpenMP implementation
OMPT: OpenMP Tools interface

10 TAU and Autotuning in SUPER
[Figure labels: TAUdb, CUDA, OpenCL, CHiLL + AH, Orio, ROSE, Geant4, MPAS-O, CESM, PerfExplorer, XGC1]

11 TAU Application Highlights – IRMHD
- INCITE Implicit Radiative and Magneto Hydro Dynamical Solver (IRMHD)
- Understand solar winds and coronal heating
- Argonne: T. Williams
- University of New Hampshire: J. Perez, B. Chandran
- University of Oregon: S. Shende
- TAU performance optimizations
- Eliminated MPI overheads
- Improved communication overlap
- Reduced barriers and load imbalance
- Used more efficient libraries
- Over-subscribed nodes
- 528.18 core hours to 70.85 core hours!
- “Groundbreaking Astrophysics Accelerated,” HPC Source, pp. 9-12, February 2013.
[Figure captions: Mira topology visualization of MPI performance; Mira load imbalance on 32K MPI ranks]
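
The core-hour figures quoted above translate into a reduction factor and a percentage saving with simple arithmetic. The snippet below only restates the two numbers from the slide; the variable names are illustrative.

```python
# Core hours reported on the slide, before and after optimization.
before_core_hours = 528.18
after_core_hours = 70.85

# How many times fewer core hours the optimized run needs.
reduction_factor = before_core_hours / after_core_hours

# Fraction of the original cost that was eliminated, as a percentage.
percent_saved = (1.0 - after_core_hours / before_core_hours) * 100.0

print(f"reduction factor: {reduction_factor:.2f}x")
print(f"core hours saved: {percent_saved:.1f}%")
```

This works out to roughly 7.5 times fewer core hours, or a saving of a little under 87%.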

12 TAU Application Highlights – OpenMC
- OpenMP Monte Carlo particle transport
- Argonne: Andrew Siegel
- University of Oregon: D. Ozog, A. Malony
- Compare event-based and history-based algorithms for exploiting SIMD simulations
- Port OpenMC code to Xeon Phi cluster
- TAU performance analysis informed optimizations to OpenMC to achieve the highest known single-node calculation rate on a standard benchmark
- 17,000 particles/second
- Done by load balancing between CPU + Xeon Phi processors
- Achieved 95% distributed efficiency when using 512 concurrent Xeon Phi devices
- D. Ozog, A. Malony, A. Siegel, “Full-Core PWR Transport Simulations on Xeon Phi Clusters,” Joint International Conference on Mathematics and Computation (M&C), Supercomputing in Nuclear Applications (SNA) and the Monte Carlo (MC) Method (ANS MC 2015), April 19–23, 2015.
[Figure captions: Strong scaling for 10M particles; CPU vs. Xeon Phi performance]
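
The slide does not define "distributed efficiency". One common reading, assumed here, is the measured aggregate rate divided by the ideal rate of N devices each running at the single-device rate. The sketch below uses that assumed definition with made-up rates; only the device count of 512 comes from the slide, and `parallel_efficiency` is a hypothetical helper name.

```python
def parallel_efficiency(aggregate_rate, single_device_rate, num_devices):
    """Measured aggregate rate as a fraction of ideal linear scaling."""
    if single_device_rate <= 0 or num_devices <= 0:
        raise ValueError("rate and device count must be positive")
    ideal_rate = single_device_rate * num_devices
    return aggregate_rate / ideal_rate


# Hypothetical rates in particles/second; not measurements from the paper.
single_device_rate = 1000.0
num_devices = 512                 # device count quoted on the slide
aggregate_rate = 486400.0         # made-up measured rate across all devices

efficiency = parallel_efficiency(aggregate_rate, single_device_rate, num_devices)
print(f"efficiency: {efficiency:.0%}")
```

Under this definition, an efficiency of 95% means the full cluster delivers 95% of what perfect linear scaling from one device would predict.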

13 TAU Application Highlights – GTC
- Gyrokinetic Toroidal Code (GTC)
- 3D PIC code to study microturbulence in magnetically confined fusion plasmas
- GTC is highly parallelizable, but different factors can affect its performance
- TAU was used for scaling studies on Jaguar/Titan
- PIC codes can be affected by imbalances arising during execution because of particle movement
- TAU used to capture performance metrics with respect to application iterations
- Identified cache problems as the reason for declining performance
- GTC implementation on GPUs
- TAU provides integrated support for measurement, analysis, and visualization of heterogeneous performance
- MPI + OpenMP + CUDA
[Figure labels: Thread Idle, CPU Waiting, Chargei Kernel, OpenMP Loop; GTC, 240 threads profiled]

14 Data/Vis ECP Questions (UO, TAU)
- Do you release your software as open source?
- TAU* is released under a BSD-style license
- TAU is available for download without a fee
- Do you have DOE/NNSA users of your software?
- TAU is used daily by DOE/NNSA users across all of the labs for performance analysis, engineering, and tuning of HPC applications running on the leadership facilities
- Have facilities, vendors, or ISVs picked up your software?
- TAU is installed on many of the HPC systems at the DOE national labs
- TAU has been downloaded by many vendors
- TAU is available as 3rd party software
- TAU is part of the Cray Linux Environment
- ParaTools, Inc. is an HPC company that contributes, supports, adds value, and trains users in TAU technology
- TAU is part of Intel-led OpenHPC community software
- IBM is sponsoring Oregon to port TAU to the OpenPOWER platform
* The TAU Performance System® is a federally registered trademark owned by the State of Oregon acting by and through the State Board of Higher Education on behalf of the University of Oregon. TAU Performance System® will be abbreviated TAU for these slides.
http://www.cray.com/products/computing/xc-series?tab=software

15 Data/Vis ECP Questions (UO, TAU)
- What is the support model for your software?
- Professional support and training is available through ParaTools, Inc.
- Are there any applications in particular that the outcomes of your project are targeting?
- TAU will be able to be used as a performance measurement and analysis component for MONA on any scientific workflow involving DOE applications
- The specific ties to identified requirements of the applications, other software components?
- TAU can work with applications running on any DOE platform architecture, hardware, and software components

16 Data/Vis ECP Questions (UO, TAU)
- Your plan for ensuring that the developed software technologies be mature enough to be part of the software stack on exascale systems expected to be selected in 2019 and installed in 2023?
- The University of Oregon and ParaTools, Inc. are working closely with DOE labs and key HPC vendors, systems integrators, component manufacturers (processors, memory, accelerators, network), and software developers to ensure that the TAU technologies are integrated and available on systems from now until exascale systems are available
- It will be important for exascale to provide performance awareness at all levels of the hardware and software so that performance issues can be identified and resolved at runtime
- We propose to contribute to ECP by:
- Building advanced performance tools for heterogeneous exascale architectures, technologies, and ultimately ECP-specified platforms
- Developing scalable performance introspection and in-situ performance analytics that can provide whole systems observation
- Creating online, dynamic performance tuning, adaptation, and control
- Integrating performance techniques and tools in exascale program development environments, autotuning systems, and runtime systems

17 Data/Vis ECP Questions (UO, TAU)
- What do you feel are the key challenges posed and opportunities offered by exascale systems for your specific area?
- ECP systems are anticipated to be heterogeneous, combining high-performance manycore processors with accelerator devices, hierarchical memory, networking, and I/O infrastructure
- It will be important for exascale to provide performance and power usage awareness at all levels of the hardware and software so that performance and energy issues can be identified and resolved at runtime
- Increased performance complexity in hardware and software requiring more support for automated performance
- Higher degree of system variability and execution dynamics requiring more dynamic runtime observation and adaptation
- Multi-objective optimization requiring greater runtime and system awareness

18 Data/Vis ECP Questions (UO, TAU)
- What is the R&D that you would like to carry out within the ECP?
- Our R&D activities will focus on 4 key ECP concerns:
- Building advanced performance tools for heterogeneous exascale architectures, technologies, and ultimately ECP-specified platforms
- Developing scalable performance introspection and in-situ performance analytics that can provide whole systems observation
- Creating online, dynamic performance tuning, adaptation, and control
- Integrating performance techniques and tools in exascale program development environments, autotuning systems, and runtime systems
- Advancing static and dynamic analysis of parallel programming languages (Fortran2015, C++14, OpenMP, CUDA, OpenACC)
- Greater performance introspection of OpenMP using the OMPT interface and MPI using the MPI_T interface
- Creating intuitive visualizations for performance and energy

19 Data/Vis ECP Questions (UO, TAU)
- What research remains for your project’s outcomes to benefit key DOE applications?
- Remaining research on TAU will provide a significant benefit to DOE applications by increasing the ability to understand their performance on the next generation of processors and accelerators
- TAU is being integrated with multiple runtime systems, including HPX, OCR, OpenMPI, MVAPICH, MPC, and ADIOS
- How would the proposed activities build on the research you have been carrying out with ASCR Research funding?
- TAU works successfully with scientific workflows utilizing ADIOS and EVPath
- Proposed activities for ECP will build on our research work by expanding the scope of TAU’s powerful performance measurement to include runtime system and system resource data
- We will build on this research to create in-situ performance awareness and analytics that can incorporate TAU’s performance characterization in online performance optimization and decision control
- Our breadth of application coverage will allow us to target any scientific workflow scenario

