Allen D. Malony Performance Research Laboratory Department of Computer and Information Science.

Hybrid Performance Analysis in the TAU Performance System
Allen D. Malony
malony@cs.uoregon.edu
http://www.cs.uoregon.edu/research/tau
Performance Research Laboratory
Department of Computer and Information Science
University of Oregon

39th Speedup Workshop 2010: Hybrid Performance Analysis with the TAU Performance System

Outline
- Hybrid parallel programming and performance analysis
- TAU performance system
  - Instrumentation
  - Measurement
  - Analysis tools
- MPI support
- OpenMP support
- Hybrid support
- Conclusions and future work

SMP/Multi-core Clusters & Hybrid Programming
- Clusters of SMPs (with multi-core) motivate hybrid (mixed-mode) parallel programming
- Cluster of SMPs
  - multiple processors per cluster node
  - multi-core processors
- Heterogeneous cluster
  - cluster of SMPs
  - heterogeneous many-core devices per node
[Figure: SMP nodes, each with processors (P) sharing memory (M), connected by an interconnection network]

Hybrid Parallel Programming and Tools
- Multi-programming methods for hybrid execution
  - Explicit / implicit
  - Distributed memory message passing: MPI
  - Shared memory multi-threading: pthreads, OpenMP, OpenCL, ...
  - Implicit: UPC, CAF, GA
- What about tools?
  - performance
  - debugging
  - difficult to integrate and often non-portable
[Photo: The Netherlands Bicycle Band, Zurich Police Band Festival, 2010]

Research and Tools
- A. Malony, B. Mohr, S. Shende, F. Wolf, "Towards a Performance Tool Interface for OpenMP: An Approach Based on Directive Rewriting," EWOMP 2001.
- B. Mohr, A. Malony, S. Shende, F. Wolf, "Design and Prototype of a Performance Tool Interface for OpenMP," The Journal of Supercomputing, 23:105–128, 2002.
- J. Cownie, J. DelSignore, B. de Supinski, K. Warren, "DMPL: An OpenMP DLL Debugging Interface," WOMPAT 2003, LNCS 2716:137–146, Springer, Heidelberg, 2003.
- K. Fuerlinger, M. Gerndt, "ompP: A Profiling Tool for OpenMP," IWOMP 2005/IWOMP 2006, LNCS 4315:15–23, Springer, Heidelberg, 2008.
- A. Morris, A. Malony, S. Shende, "Supporting Nested OpenMP Parallelism in the TAU Performance System," IWOMP 2005/IWOMP 2006, LNCS 4315:279–288, Springer, Heidelberg, 2008.
- V. Bui, O. Hernandez, B. Chapman, R. Kufrin, P. Gopalkrishnan, D. Tafti, "Towards an Implementation of the OpenMP Collector API," ParCo 2007.
- M. Itzkowitz, O. Mazurov, N. Copty, Y. Lin, "White Paper: An OpenMP Runtime API for Profiling," Technical Report, Sun Microsystems, Inc., 2007.
- OpenMP Architecture Review Board, "OpenMP Application Program Interface, Version 3.0," 2008, http://www.openmp.org/mp-documents/spec30.pdf.
- R. Nathan, N. Tallent, J. Mellor-Crummey, "Effective Performance Measurement and Analysis of Multithreaded Applications," PPoPP 2009, ACM, New York, 2009.
- Y. Lin and O. Mazurov, "Providing Observability for OpenMP 3.0 Applications," IWOMP 2009, LNCS 5568:104–117, Springer-Verlag, 2009.
- Tools: TAU, Scalasca, Paraver, ompP, HPCToolkit, OpenUH, Sun, Intel, Cray, ...

TAU Performance System®
- Tuning and Analysis Utilities (18+ year project)
- Performance problem solving framework for HPC
  - Integrated, scalable, flexible, portable
  - Target all parallel programming / execution paradigms
- Integrated performance toolkit
  - Instrumentation, measurement, analysis, visualization
  - Widely-ported performance profiling / tracing system
  - Performance data management and data mining
  - Open source (BSD-style license)
- Broad application use (NSF, DOE, DOD, ...)
http://tau.uoregon.edu

General Target Computation Model in TAU
- Node: physically distinct shared memory machine
  - Message passing node interconnection network
- Context: distinct virtual memory space within node
- Thread: execution threads (user/system) in context
[Figure: physical view (SMP nodes with memory, linked by an interconnection network carrying inter-node message communication) alongside the model view (node, context as VM space, threads)]

TAU Performance System Components
[Figure: TAU architecture. Labels: Program Analysis (PDT); Parallel Profile Analysis (PerfDMF, ParaProf); Performance Data Mining (PerfExplorer); Performance Monitoring (TAUoverSupermon)]

TAU Instrumentation / Measurement

TAU Analysis

ParaProf Profile Analysis Framework

TAU Instrumentation Approach
- Based on direct performance observation
  - Direct instrumentation of program (system) code (probes)
  - Instrumentation invokes performance measurement
  - Event measurement: performance data, meta-data, context
- Support for standard program events
  - Routines, classes and templates
  - Statement-level blocks and loops
  - Begin/End events (interval events)
- Support for user-defined events
  - Begin/End events specified by user
  - Atomic events (e.g., size of memory allocated/freed)
  - Flexible selection of event statistics
- Provides static events and dynamic events
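The difference between interval events and atomic events can be illustrated with a minimal sketch. This is not TAU's API: the types and functions below (`IntervalEvent`, `AtomicEvent`, `interval_begin`, `atomic_trigger`, and so on) are hypothetical names chosen for illustration, and `clock()` stands in for whatever timer a real measurement library would use.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical interval event: accumulates time between begin and end. */
typedef struct {
    const char *name;
    long calls;            /* number of completed begin/end pairs */
    double inclusive_sec;  /* accumulated processor time */
    clock_t started_at;
} IntervalEvent;

/* Hypothetical atomic event: keeps statistics over triggered values. */
typedef struct {
    const char *name;
    long samples;
    double min, max, sum;
} AtomicEvent;

static void interval_begin(IntervalEvent *e) { e->started_at = clock(); }

static void interval_end(IntervalEvent *e) {
    e->inclusive_sec += (double)(clock() - e->started_at) / CLOCKS_PER_SEC;
    e->calls++;
}

/* Record one value, e.g. the size of a memory allocation. */
static void atomic_trigger(AtomicEvent *e, double value) {
    if (e->samples == 0 || value < e->min) e->min = value;
    if (e->samples == 0 || value > e->max) e->max = value;
    e->sum += value;
    e->samples++;
}

static double atomic_mean(const AtomicEvent *e) {
    return e->samples ? e->sum / (double)e->samples : 0.0;
}

/* An "instrumented" routine: the probes bracket the body (interval event)
 * and record the requested size at the point of interest (atomic event). */
static void instrumented_work(IntervalEvent *timer, AtomicEvent *alloc_size,
                              size_t bytes) {
    interval_begin(timer);
    atomic_trigger(alloc_size, (double)bytes);
    volatile double x = 0.0;
    for (size_t i = 0; i < bytes; i++) x += (double)i;  /* stand-in workload */
    interval_end(timer);
}
```

Calling `instrumented_work` twice with sizes 100 and 300 leaves the interval event with two calls and the atomic event with a minimum of 100, a maximum of 300, and a mean of 200.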

Automatic Source Instrumentation
[Figure: application source -> TAU source analyzer -> parsed program -> tau_instrumentor (guided by an instrumentation specification file) -> instrumented source]

MPI Instrumentation and Measurement
- Uses standard MPI Profiling Interface (PMPI)
  - Provides name-shifted interface (weak bindings)
    - MPI_Send -> PMPI_Send
- Interpose TAU's MPI wrapper library
  - -lmpi replaced by -lTauMpi -lpmpi -lmpi
  - No change to the source code!
  - Just re-link the application to generate performance data
- No re-compilation or re-linking!
  - Preloading of TAU MPI library
  - Uses LD_PRELOAD for Linux
- TAU captures profiles and traces of communication
  - Includes message sizes, source/destination, ...
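The name-shifting idea can be sketched without an MPI installation. In the mock below, `pdemo_send` plays the role of `PMPI_Send` (the library's real entry point) and `demo_send` plays the role of the interposed `MPI_Send` wrapper; both names, the simplified three-argument signature, and the `SendStats` record are assumptions made for illustration, not TAU or MPI code.

```c
#include <stddef.h>

/* Hypothetical stand-in for the library's real routine (role of PMPI_Send). */
static int pdemo_send(const void *buf, int count, int dest) {
    (void)buf; (void)count; (void)dest;
    return 0;  /* pretend the message was delivered */
}

/* Statistics the wrapper collects for this routine. */
typedef struct {
    long calls;     /* number of sends observed */
    long bytes;     /* total payload, treating count as bytes */
    int last_dest;  /* destination of the most recent send */
} SendStats;

static SendStats demo_send_stats;

/* Wrapper carrying the public name (role of MPI_Send): record the call,
 * then forward to the name-shifted routine. The application calls
 * demo_send() and needs no source change to be measured. */
int demo_send(const void *buf, int count, int dest) {
    demo_send_stats.calls++;
    demo_send_stats.bytes += count;
    demo_send_stats.last_dest = dest;
    return pdemo_send(buf, count, dest);
}
```

With a real MPI library, a wrapper of this shape would define `MPI_Send` with the standard signature and forward to `PMPI_Send`; linking the wrapper library ahead of the MPI library (or preloading it) is what makes the interposition take effect.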

PFLOTRAN Profile (Exclusive, 16,380 cores)
[Figure: TAU ParaProf profile analyzer; labeled events: MPI_Allreduce, MPI_Waitany, KSPSolve, oursnesjacobian]

ParaProf 3D Full Profile (Exclusive)
[Figure: 3D profile; labeled event: MPI_Allreduce]

ParaProf 3D Full Profile (minus MPI_Allreduce)

OpenMP Instrumentation with POMP / OPARI
- POMP: Profiling interface for OpenMP
  - POMP-1 specification (FZJ, UO) (EWOMP '01)
  - POMP-2 (FZJ, UO, Pallas, Intel) (EWOMP '02)
  - Measurement tool implements the POMP library
- OPARI: OpenMP Pragma And Region Instrumentor
  - Source rewriter to insert POMP calls around OpenMP constructs and API functions
- Supports
  - C and C++, Fortran77 and Fortran90, OpenMP 2.x
  - Scalasca and TAU POMP implementations
- Preserves source code information (#line, file)

OpenMP Event Model
- What events are necessary to observe performance?
  - An OpenMP thread executes on behalf of an OpenMP task inside an OpenMP parallel region
  - Tools need knowledge of this context to relate performance information to the OpenMP execution model
- OpenMP constructs and directives/pragmas
  - Enter/Exit around OpenMP construct plus Begin/End around associated body
- OpenMP API calls
  - Enter/Exit events around omp_set_*_lock() functions
- User functions and regions

OpenMP Directive Instrumentation (POMP)
- Insert calls to pomp_NAME_TYPE(d) at appropriate places around directives
  - NAME: name of the OpenMP construct
  - TYPE:
    - fork, join: mark change in parallelism grade
    - enter, exit: flag entering/exiting OpenMP construct
    - begin, end: mark start/end of construct bodies
  - d: context descriptor
- Observation of implicit barrier at DO, SECTIONS, WORKSHARE, SINGLE constructs
  - Add NOWAIT to construct
  - Make barrier explicit

!$OMP PARALLEL DO Instrumentation

Original:

    !$OMP PARALLEL DO clauses...
          do loop
    !$OMP END PARALLEL DO

Instrumented:

          call pomp_parallel_fork(d)
    !$OMP PARALLEL other-clauses...
          call pomp_parallel_begin(d)
          call pomp_do_enter(d)
    !$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses
          do loop
    !$OMP END DO NOWAIT
          call pomp_barrier_enter(d)
    !$OMP BARRIER
          call pomp_barrier_exit(d)
          call pomp_do_exit(d)
          call pomp_parallel_end(d)
    !$OMP END PARALLEL
          call pomp_parallel_join(d)
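The same rewriting pattern can be sketched in C. The `pomp_*` names follow the slide's naming scheme, but the bodies here are hypothetical counting stubs (a real POMP library would record timestamps or trace events), and `DemoRegion` is an illustrative stand-in for the context descriptor. The sketch assumes only that the combined `parallel for` is split, the worksharing loop gets `nowait`, and the implicit barrier is made explicit so it can be bracketed by probes.

```c
/* Hypothetical descriptor holding one counter per POMP event type. */
typedef struct {
    const char *name;
    long fork, join, begin, end, enter, exit_, barrier_enter, barrier_exit;
} DemoRegion;

/* Thread-safe increment; the pragma is ignored when OpenMP is disabled. */
static void count_event(long *counter) {
    #pragma omp atomic
    (*counter)++;
}

/* Counting stubs named after the calls shown on the slide. */
static void pomp_parallel_fork(DemoRegion *d)  { count_event(&d->fork); }
static void pomp_parallel_join(DemoRegion *d)  { count_event(&d->join); }
static void pomp_parallel_begin(DemoRegion *d) { count_event(&d->begin); }
static void pomp_parallel_end(DemoRegion *d)   { count_event(&d->end); }
static void pomp_do_enter(DemoRegion *d)       { count_event(&d->enter); }
static void pomp_do_exit(DemoRegion *d)        { count_event(&d->exit_); }
static void pomp_barrier_enter(DemoRegion *d)  { count_event(&d->barrier_enter); }
static void pomp_barrier_exit(DemoRegion *d)   { count_event(&d->barrier_exit); }

/* Instrumented form of:
 *     #pragma omp parallel for
 *     for (i = 0; i < n; i++) out[i] = 2.0 * in[i];
 */
void scale_instrumented(DemoRegion *d, const double *in, double *out, int n) {
    pomp_parallel_fork(d);              /* master thread, before the team forms */
    #pragma omp parallel
    {
        pomp_parallel_begin(d);         /* every thread in the team */
        pomp_do_enter(d);
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            out[i] = 2.0 * in[i];
        pomp_barrier_enter(d);          /* time spent here is barrier wait */
        #pragma omp barrier
        pomp_barrier_exit(d);
        pomp_do_exit(d);
        pomp_parallel_end(d);
    }
    pomp_parallel_join(d);              /* master thread, after the team joins */
}
```

After one call, the fork and join counters are 1, while the begin, end, enter, exit, and barrier counters each equal the team size. Built without OpenMP support, the pragmas are ignored and every counter is 1.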

OpenMP API Instrumentation
- Transform
  - omp_#_lock() -> pomp_#_lock()
  - omp_#_nest_lock() -> pomp_#_nest_lock()
  - [ # = init | destroy | set | unset | test ]
- POMP version
  - Calls omp version internally
  - Can do performance measurement before and after call

POMP Context Descriptors
- Describe execution contexts through context descriptor

      typedef struct ompregdescr {
        char name[];                   /* construct */
        char sub_name[];               /* region name */
        int num_sections;
        char filename[];               /* src filename */
        int begin_line1, begin_lineN;  /* begin line # */
        int end_line1, end_lineN;      /* end line # */
        WORD data[4];                  /* perf. data */
        struct ompregdescr* next;
      } OMPRegDescr;

- Generate context descriptors in global static memory:

      OMPRegDescr rd42675 = { "critical", "phase1", 0, "foo.c", 5, 5, 13, 13 };

- Pass address to POMP functions
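The descriptor on the slide is schematic (for example, several unsized `char name[]` members cannot appear in one C struct). A compilable variant might look like the sketch below, which assumes string pointers for the text fields and `long` in place of `WORD`. The helper functions `demo_region_enter` and `demo_region_label` are hypothetical and only show how a tool could use the address it is passed.

```c
#include <stdio.h>

typedef struct ompregdescr {
    const char *name;              /* construct */
    const char *sub_name;          /* region name */
    int num_sections;
    const char *filename;          /* source file name */
    int begin_line1, begin_lineN;  /* begin line numbers */
    int end_line1, end_lineN;      /* end line numbers */
    long data[4];                  /* performance data (WORD on the slide; long assumed) */
    struct ompregdescr *next;
} OMPRegDescr;

/* One descriptor per instrumented construct, in global static memory. */
static OMPRegDescr rd42675 = {
    "critical", "phase1", 0, "foo.c", 5, 5, 13, 13, {0, 0, 0, 0}, NULL
};

/* Hypothetical probe: the tool receives the descriptor's address and can
 * keep per-region data in it; here data[0] counts visits. */
static void demo_region_enter(OMPRegDescr *d) { d->data[0]++; }

/* Format a readable label such as "critical(phase1) foo.c:5-13".
 * Returns the label length, as snprintf does. */
static int demo_region_label(const OMPRegDescr *d, char *buf, size_t len) {
    return snprintf(buf, len, "%s(%s) %s:%d-%d",
                    d->name, d->sub_name, d->filename,
                    d->begin_line1, d->end_lineN);
}
```

Because the descriptor is static, the tool can attach measurements to it across calls without any lookup by name.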

Support for OpenMP Nested Parallelism in TAU
- Hierarchical nature of nested parallelism poses problems
  - Use OMP_NESTED or omp_set_nested() to enable
  - Performance measurement requires knowledge of thread context to correctly attribute data to thread's execution
- How to determine thread nesting level?
  - omp_get_thread_num() cannot uniquely identify a thread: it returns only the logical thread ID within the team
  - Nesting context is not available to the tool interface
- Static analysis and instrumentation (à la Opari) no help
  - Requires runtime solution to identify thread and nesting
  - Lack of performance tools interface specification (v2.x)

Portable Approach in TAU (IWOMP 2006)
- Need to find a new scheme for thread identification
- Leverage OpenMP directive #pragma threadprivate()
  - Create persistent data for each thread in parallel region
  - Values do not persist between parallel regions
  - Memory locations of threadprivate variables are unique
- Single threadprivate variable is initialized in TAU library
  - Used to register threads not seen before during execution
  - Region thread IDs are unique, but may change between regions
  - Lose nesting depth and team identifier
A. Morris, A. Malony, and S. Shende, “Supporting Nested OpenMP Parallelism in the TAU Performance System,” IWOMP, 2006.
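One way to read the scheme above is that the address of a threadprivate variable serves as the thread's identity. The sketch below follows that reading; it is an assumption about the approach, not TAU's source, and all `demo_` names and the fixed table size are hypothetical. A thread is registered the first time its threadprivate address is seen, and later lookups return the same ID regardless of the value stored in the variable.

```c
#define DEMO_MAX_THREADS 256

/* One copy of this variable exists per OpenMP thread; only its address
 * matters, not its value. */
static char demo_tp_anchor;
#pragma omp threadprivate(demo_tp_anchor)

/* Registry of threadprivate addresses seen so far. */
static const void *demo_seen[DEMO_MAX_THREADS];
static int demo_seen_count = 0;

/* Return a stable ID for the calling thread, registering it on first use.
 * Returns -1 if the registry is full. */
int demo_thread_id(void) {
    const void *key = &demo_tp_anchor;
    int id = -1;
    #pragma omp critical(demo_thread_registry)
    {
        for (int i = 0; i < demo_seen_count; i++) {
            if (demo_seen[i] == key) { id = i; break; }
        }
        if (id < 0 && demo_seen_count < DEMO_MAX_THREADS) {
            id = demo_seen_count;
            demo_seen[demo_seen_count++] = key;
        }
    }
    return id;
}

/* Touch the registry from every thread of a parallel region and report
 * how many distinct threads have been registered so far. */
int demo_count_threads(void) {
    #pragma omp parallel
    {
        (void)demo_thread_id();
    }
    return demo_seen_count;
}
```

As the slide notes, an ID obtained this way says nothing about nesting depth or team membership; it only distinguishes threads. Built without OpenMP, the pragmas are ignored and a single thread is registered.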

TFS Case Study (IWOMP 2006)
- TFS computational fluid dynamics code (RWTH Aachen)
- Parallelized using ParaWise to generate:
  - intra-block parallelization over a single block dimension
  - inter-block parallelization over blocks
  - multi-level (hybrid) with nested OpenMP (intra, inter)
- Instrumentation approach
  - Opari for OpenMP constructs and regions
  - PDT for source-level information about routine names and their entry and exit locations
- ".TAU application" event
  - Started at thread execution and stopped at execution end
  - Shows roughly how much time each thread spends idle

TFS Multi-level Profile (Exclusive: Mean, Flat)
[Figure: profile of a 90 second run; annotations: idle, 3 secs, 22 secs]

TFS Multi-level Profile (Exclusive)

TFS Callpath Profile (Exclusive, Thread 0)

TFS Callgraph Profile (Exclusive, Thread 0)

Callgraph for Each TFS Thread
[Figure: per-thread callgraphs; labeled: Thread 0 (main)]

Hybrid Parallel Application Case Studies
- 2D Stommel model of ocean circulation
  - Jacobi iteration, 5-point stencil
  - Timothy Kaiser (San Diego Supercomputer Center)
- GTC
  - Particle-in-cell simulation of fusion turbulence
  - 128 cores: 32 MPI processes x 4 OpenMP threads
  - Phases assigned to iterations
- Performance instrumentation
  - OpenMP with Opari
  - MPI with PMPI

OpenMP + MPI Ocean Modeling (Profile)

    % configure -papi=../packages/papi -openmp -c++=pgCC -cc=pgcc -mpiinc=../packages/mpich/include -mpilib=../packages/mpich/libo

[Figure: profile display; annotations: FP instructions, integrated OpenMP + MPI events]

OpenMP + MPI Ocean Modeling (Trace)
[Figure: trace display; annotations: integrated OpenMP + MPI events, thread message pairing]

GTC Full Profile (32x4, Exclusive Time)
- Overall visual impression of parallel performance
- See hybrid structure and interesting behavior

GTC Full Profile (32x4, Exclusive Time)
- Stacked view shows per-event comparison
- More clearly highlights OpenMP/MPI differences

GTC MPI-only Performance (128 cores)

GTC OpenMP-only Performance (128 cores)
[Figure: 3D profile; height: FP count, color: time]

GTC Phase Profiling (32x4)
[Figure: per-phase profile; annotations: increasing phase execution time, decreasing flops rate, declining cache performance]

GTC Trace with Jumpshot (Argonne) (128 cores)
- Full trace visualization with collapsed process view
- Communication between computation phases highlighted

GTC Process Trace Visualization (128 cores)
- Zoomed in to show phase structure
- Still viewing collapsed process

GTC OpenMP Thread Visualization (128 cores)
- Zoomed view showing OpenMP thread performance
- CumulativeExclusionRatio in Jumpshot

NAS Multi-zone Hybrid Benchmarks
- BT-MZ and SP-MZ in NPB
- Two levels of hybrid parallelism are exploited
  - OpenMP is applied to fine-grained intra-zone
  - MPI is used for coarse-grained inter-zone
- Load balancing is based on a bin-packing algorithm
  - Multiple zones are clustered into zone groups
  - computational workload is evenly distributed over them
  - zones are sorted by size and bin-packed into zone groups
- Each zone group is then assigned to an MPI process
  - exchanging boundary data within each time step requires MPI many-to-many communication
- Hybrid version is part of the standard NPB distribution
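The load-balancing step described above can be sketched as a greedy bin-packing heuristic: sort zones by decreasing size, then place each zone into the currently least-loaded group. This is an illustrative sketch only. The `Zone` type and `pack_zones` function are hypothetical names, and the actual NPB multi-zone code may differ in its details.

```c
#include <stdlib.h>

typedef struct {
    int id;     /* original zone index */
    long size;  /* workload estimate, e.g. number of grid points */
} Zone;

/* Order zones by decreasing size; break ties by increasing id so the
 * result does not depend on qsort's handling of equal elements. */
static int by_size_desc(const void *a, const void *b) {
    const Zone *x = a, *y = b;
    if (x->size != y->size)
        return (x->size < y->size) - (x->size > y->size);
    return (x->id > y->id) - (x->id < y->id);
}

/* Assign zones to ngroups zone groups (one group per MPI process).
 * On return, group_of[zone id] holds the group index and load[g] the
 * total size placed in group g. Returns the largest group load.
 * Note: the zones array is reordered in place. */
long pack_zones(Zone *zones, int nzones, int ngroups,
                int *group_of, long *load) {
    qsort(zones, (size_t)nzones, sizeof zones[0], by_size_desc);
    for (int g = 0; g < ngroups; g++) load[g] = 0;

    for (int z = 0; z < nzones; z++) {
        int best = 0;  /* least-loaded group; lowest index wins ties */
        for (int g = 1; g < ngroups; g++)
            if (load[g] < load[best]) best = g;
        load[best] += zones[z].size;
        group_of[zones[z].id] = best;
    }

    long max_load = 0;
    for (int g = 0; g < ngroups; g++)
        if (load[g] > max_load) max_load = load[g];
    return max_load;
}
```

For zone sizes 4, 8, 6, 5, 7 and two groups, this heuristic produces loads of 17 and 13. The residual imbalance is the kind of effect that shows up as idle time in the hybrid traces.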

BT-MZ and SP-MZ Traces
[Figure: two trace panels, BT-MZ and SP-MZ]

Conclusion and Future Work
- TAU supports hybrid parallel performance analysis
  - Measurement of MPI+OpenMP programs at scale
- "Portable" OpenMP instrumentation with Opari
  - Support for nested parallelism
  - Need better integration with OpenMP compilers / RTS
- Need to build better support for OpenMP 3.0
  - Task model: events and context
  - Leverage performance tools interface in OpenMP 3.0
  - Extend with instrumentation and measurement
- Incorporate event-based sampling (TAUebs)

Support Acknowledgements
- Department of Energy (DOE)
  - Office of Science
  - ASC/NNSA
- Department of Defense (DoD)
  - HPC Modernization Office (HPCMO)
- NSF Software Development for Cyberinfrastructure (SDCI)
- Research Centre Juelich
- Argonne National Laboratory
- Technical University Dresden
- ParaTools, Inc.

Hybrid Parallel Computation (Opus / HPF)
- Hybrid, hierarchical programming and execution model
  - Multi-threaded SMP and inter-node message passing
  - Integrated task and data parallelism
- Opus / HPF environment (University of Vienna)
  - Combined data (HPF) and task (Opus) parallelism
  - HPF compiler produces Fortran 90 modules
  - Processes interoperate using Opus runtime system
    - producer / consumer model
    - MPI and pthreads
- Performance influence at multiple software levels
- Performance analysis oriented to programming model

TAU Tracing of Opus / HPF Application
[Figure: trace display; annotations: multiple producers, multiple consumers]

Opus / HPF Execution Trace
- 4 nodes, 28 processes
- Process grouping in Vampir visualization

Hybrid Parallel Computation (Java + MPI)
- Multi-language applications and hybrid execution
  - Java, C, C++, Fortran
  - Java threads and MPI
- mpiJava (Syracuse, JavaGrande)
  - Java wrapper package with JNI C bindings to MPI routines
- Integrate cross-language, cross-system performance technology
  - JVMPI and TAU profiler agent
  - MPI profiling interface: link-time interposition (wrapper) library
  - Cross execution mode uniformity and consistency
    - invoke JVMPI control routines to control Java threads
    - access thread information and expose to MPI interface
- “Performance Tools for Parallel Java Environments,” Java Workshop, ICS 2000, May 2000.

TAU Java Instrumentation Architecture
[Figure: Java program, TAU package, and mpiJava package; JVMPI thread API and event notification; JNI; MPI profiling interface with TAU wrapper over the native MPI library; TAU; profile DB]

Parallel Java Game of Life (Profile)
- mpiJava testcase
- 4 nodes, 28 threads
[Figure: merged Java and MPI event profiles for Node 0, Node 1, Node 2; annotation: Thread 4 executes all MPI routines]

Parallel Java Game of Life (Trace)
- Integrated event tracing
- Merged trace visualization
  - Node process grouping
  - Thread message pairing
  - Vampir display
  - Multi-level event grouping
