Allen D. Malony Computer & Information Science Department

Performance Tools Interface for OpenMP
A presentation to the OpenMP Futures Committee
Allen D. Malony
Computer & Information Science Department
Computational Science Institute
University of Oregon

Outline
  Goals for OpenMP performance tools interface
  Performance state and event model
  Fork-join execution states and events
  Performance measurement model
  Event generation (callback) interface
  Proposal based on directive transformation
  Sample transformations
  Comments and other additions
  Describing execution context
  General issues
  Experience using TAU with OpenMP/MPI
January 29, 2019

Goals for an OMP Performance Tools Interface
Goal 1: Expose OpenMP events and execution states to a performance measurement system
  What are the OpenMP events / states of interest?
  What is the nature (mechanism) of the interface?
Goal 2: Make the performance measurement interface portable
  “Standardize” on interface mechanism
  Define interface semantics and information
Goal 3: Support source-level and compiler-level implementation of interface
  Source transformation and compiler transformation

Performance State and Event Model
Based on performance model for (nested) fork-join parallelism, multi-threaded work-sharing, and thread-based synchronization
Define with respect to multi-level state view
  Level 1: serial and parallel states (with nesting)
  Level 2: work-sharing states (per team thread)
  Level 3: synchronization states (per team thread)
  Level 4: runtime system (thread) states
Events reflect state transitions
  State enter / exit (begin / end)
  State graph with event edges
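
The multi-level state view can be pictured as a per-thread stack of states, where each enter event pushes a state and each exit event pops it. The following is only a minimal sketch under that assumption; the type and function names (`perf_state`, `thread_states`, `state_enter`, `state_exit`) are hypothetical and do not come from the slides.

```c
#include <assert.h>

/* Hypothetical identifiers: the slides define the four levels, not these names. */
typedef enum {
    ST_SERIAL,     /* level 1: serial state */
    ST_PARALLEL,   /* level 1: parallel state (may nest) */
    ST_WORKSHARE,  /* level 2: work-sharing state */
    ST_SYNC,       /* level 3: synchronization state */
    ST_RUNTIME     /* level 4: runtime system state */
} perf_state;

#define MAX_DEPTH 32

/* One instance per thread; stack[depth - 1] is the current state. */
typedef struct {
    perf_state stack[MAX_DEPTH];
    int depth;
} thread_states;

/* Every thread starts in the serial state. */
static void states_init(thread_states *t)
{
    t->stack[0] = ST_SERIAL;
    t->depth = 1;
}

/* Enter (begin) event: push the new state. Returns -1 if the stack is full. */
static int state_enter(thread_states *t, perf_state s)
{
    if (t->depth >= MAX_DEPTH)
        return -1;
    t->stack[t->depth++] = s;
    return 0;
}

/* Exit (end) event: pop, but only if it matches the current state. */
static int state_exit(thread_states *t, perf_state s)
{
    if (t->depth <= 1 || t->stack[t->depth - 1] != s)
        return -1;
    t->depth--;
    return 0;
}

static perf_state state_current(const thread_states *t)
{
    assert(t->depth > 0);
    return t->stack[t->depth - 1];
}
```

Rejecting a mismatched exit is a design choice of this sketch: it makes unbalanced instrumentation visible instead of silently corrupting the state view.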

Fork-Join Execution States and Events
Columns: master, slave
Parallel region operation
  master starts serial execution         X      S
  parallel region begins                 X
  slaves started                         X
  team begins parallel execution         X X    P
  team threads hit barrier               X X
  slaves end; master exits barrier       X X
  master resumes serial execution        X      S

Performance Measurement Model
Serial performance
  Detect serial transition points
  Standard events and statistics within serial regions
  Time spent in serial execution
  Locations of serial execution in program
Parallel performance
  Detect parallel transition points
  Time spent in parallel execution
  Region perspective and work-sharing perspective
  Performance profiles kept per region
  More complex parallel states of execution
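
As an illustration of "performance profiles kept per region", the sketch below accumulates time and visit counts per region from enter/exit timestamps. It is a single-threaded sketch only; the names `region_profile`, `region_enter`, and `region_exit` are hypothetical, and a real tool would keep one such table per thread and read timestamps from a timer rather than take them as arguments.

```c
#include <assert.h>

#define MAX_REGIONS 64

/* Hypothetical per-region profile record. */
typedef struct {
    double total_time;  /* accumulated time spent in the region */
    long   count;       /* number of completed visits */
    double enter_time;  /* timestamp of the pending enter event */
    int    active;      /* nonzero between enter and exit */
} region_profile;

static region_profile profiles[MAX_REGIONS];

/* Transition into a region: remember when it happened. */
static void region_enter(int id, double now)
{
    assert(id >= 0 && id < MAX_REGIONS);
    profiles[id].enter_time = now;
    profiles[id].active = 1;
}

/* Transition out of a region: add the elapsed time to its profile. */
static void region_exit(int id, double now)
{
    assert(id >= 0 && id < MAX_REGIONS);
    if (!profiles[id].active)
        return;  /* ignore an exit without a matching enter */
    profiles[id].total_time += now - profiles[id].enter_time;
    profiles[id].count++;
    profiles[id].active = 0;
}
```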

Event Generation (Callback) Interface
Generic event callback function (pseudo format)
  omperf(eventID, contextID[, data])
    Single callback routine
    Must define events (not necessarily standardize)
    Place burden on callback routine to interpret eventID
  omperf_{begin/end}(eventID, contextID[, data])
Directive-specific callback functions (pseudo format)
  omperf_{directive}_{begin/end/…}(contextID[, data])
    Standardize function names
What about execution context data?
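
The two callback styles can coexist: directive-specific functions can be thin wrappers over the generic begin/end callbacks. The sketch below assumes integer event and context IDs and a `void *` data argument; the slides give only a pseudo format, so these C signatures and the `omperf_event` enum are assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical event IDs: the slides say events must be defined,
   not necessarily standardized. */
typedef enum { EV_PARALLEL, EV_DO, EV_BARRIER, EV_COUNT } omperf_event;

static long begin_counts[EV_COUNT];
static long end_counts[EV_COUNT];

/* Generic form: the callback itself has to interpret eventID. */
void omperf_begin(int eventID, int contextID, void *data)
{
    (void)contextID;
    (void)data;
    assert(eventID >= 0 && eventID < EV_COUNT);
    begin_counts[eventID]++;
}

void omperf_end(int eventID, int contextID, void *data)
{
    (void)contextID;
    (void)data;
    assert(eventID >= 0 && eventID < EV_COUNT);
    end_counts[eventID]++;
}

/* Directive-specific form: the event is encoded in the function name. */
void omperf_parallel_begin(int contextID) { omperf_begin(EV_PARALLEL, contextID, 0); }
void omperf_parallel_end(int contextID)   { omperf_end(EV_PARALLEL, contextID, 0); }
void omperf_barrier_begin(int contextID)  { omperf_begin(EV_BARRIER, contextID, 0); }
void omperf_barrier_end(int contextID)    { omperf_end(EV_BARRIER, contextID, 0); }
```

With this layering, a measurement library only has to implement the generic pair, while instrumented code can call the standardized directive-specific names.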

Instrumentation Alternatives
Source-level instrumentation
  Manual instrumentation
  TAU performance measurement
  Directive transformation
Compiler instrumentation
  Could allow more efficient implementation
  JOMP (EPCC), Java Instrumentation Suite (Barcelona)
Runtime system instrumentation
  Use to see RTL events, in addition to OMP events
  GuideView (KAI/Intel)
Dynamic instrumentation

Proposal Based on Directive Transformation
Consider source-level approach
  For each OMP directive, generate an “instrumented” version which calls the performance event API.
  What is the event model for each directive?
Issues
  OMP RTL execution behavior is not fully exposed
  May not be able to generate equivalent form
  Possible conflicts with directive optimization
  May be less efficient
  Hard to access RTL events and information
Sample transformations (B. Mohr, KFA)

Example: parallel regions, work-sharing (do)
Parallel region (parallel)
  #omp parallel          ->  omperf_parallel_fork(regionID)
                             #omp parallel
                             omperf_parallel_begin(regionID)
  #omp end parallel      ->  omperf_parallel_end(regionID)
                             omperf_barrier_begin(regionID)
                             #omp barrier
                             omperf_barrier_end(regionID)
                             #omp end parallel
                             omperf_parallel_join(regionID)
  #omp is just pseudo notation
  IDs vs. context descriptor (see below)
Work-sharing (do/for)
  #omp do                ->  omperf_do_begin(loopID)
                             #omp do
  #omp end do nowait     ->  #omp end do nowait
                             omperf_do_end(loopID)
  #omp end do            ->  #omp end do nowait
                             omperf_do_end(loopID)
                             omperf_barrier_begin(loopID)
                             #omp barrier
                             omperf_barrier_end(loopID)
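
To make the transformations above concrete, here is a hand-instrumented C loop that follows the same event order for a parallel region containing a work-sharing loop with an implicit barrier. It is a sketch, not output of any actual tool: the callback bodies are counting stubs, the integer IDs are made up, and the function name `scale_instrumented` is hypothetical. Without OpenMP enabled the pragmas are ignored and the code runs serially.

```c
#include <assert.h>

/* Stub callbacks (assumed signatures): a real measurement library would
   record events. Counters touched inside the parallel region are updated
   atomically so the counts stay exact when OpenMP is enabled. */
static int n_fork, n_join, n_begin, n_end;

static void omperf_parallel_fork(int id) { (void)id; n_fork++; }
static void omperf_parallel_join(int id) { (void)id; n_join++; }
static void omperf_parallel_begin(int id)
{
    (void)id;
    #pragma omp atomic
    n_begin++;
}
static void omperf_parallel_end(int id)
{
    (void)id;
    #pragma omp atomic
    n_end++;
}
static void omperf_do_begin(int id)      { (void)id; }
static void omperf_do_end(int id)        { (void)id; }
static void omperf_barrier_begin(int id) { (void)id; }
static void omperf_barrier_end(int id)   { (void)id; }

/* Original code: "#pragma omp parallel" around "#pragma omp for". */
void scale_instrumented(const double *in, double *out, int n)
{
    const int regionID = 1, loopID = 2;  /* hypothetical IDs */
    assert(n >= 0);

    omperf_parallel_fork(regionID);
    #pragma omp parallel
    {
        omperf_parallel_begin(regionID);

        /* "omp for" with its implicit barrier made explicit */
        omperf_do_begin(loopID);
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            out[i] = 2.0 * in[i];
        omperf_do_end(loopID);
        omperf_barrier_begin(loopID);
        #pragma omp barrier
        omperf_barrier_end(loopID);

        /* end of the parallel region, in the order shown on the slide */
        omperf_parallel_end(regionID);
        omperf_barrier_begin(regionID);
        #pragma omp barrier
        omperf_barrier_end(regionID);
    }
    omperf_parallel_join(regionID);
}
```

Replacing the implicit barrier with `nowait` plus an explicit barrier is what lets the tool time the barrier separately from the loop body.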

Example: work-sharing (sections)
Work-sharing (sections)
  #omp sections                       ->  omperf_sections_begin(sectionsID)
                                          #omp sections
  #omp section (first section only)   ->  #omp section
                                          omperf_section_begin(sectionID)
  #omp section (other sections only)  ->  omperf_section_end(prevsectionID)
                                          #omp section
                                          omperf_section_begin(sectionID)
  #omp end sections nowait            ->  omperf_section_end(lastsectionID)
                                          #omp end sections nowait
                                          omperf_sections_end(sectionsID)
  #omp end sections                   ->  omperf_section_end(lastsectionID)
                                          #omp end sections nowait
                                          omperf_barrier_begin(sectionsID)
                                          #omp barrier
                                          omperf_barrier_end(sectionsID)
                                          omperf_sections_end(sectionsID)

Example: work-sharing (single, master)
Work-sharing (single)
  #omp single              ->  omperf_single_enter(singleID)
                               #omp single
                               omperf_single_begin(singleID)
  #omp end single nowait   ->  omperf_single_end(singleID)
                               #omp end single nowait
                               omperf_single_exit(singleID)
  #omp end single          ->  omperf_single_end(singleID)
                               #omp end single nowait
                               omperf_barrier_begin(singleID)
                               #omp barrier
                               omperf_barrier_end(singleID)
                               omperf_single_exit(singleID)
Work-sharing (master)
  #omp master              ->  #omp master
                               omperf_master_begin(regionID)
  #omp end master          ->  omperf_master_end(regionID)
                               #omp end master

Example: synchronization (critical, atomic, lock)
Mutual exclusion (critical section)
  #omp critical            ->  omperf_critical_enter(criticalID)
                               #omp critical
                               omperf_critical_begin(criticalID)
  #omp end critical        ->  omperf_critical_end(criticalID)
                               #omp end critical
                               omperf_critical_exit(criticalID)
Mutual exclusion (atomic)
  #omp atomic              ->  omperf_atomic_begin(atomicID)
                               #omp atomic
                               atomic-expr-stmt
                               omperf_atomic_end(atomicID)
Mutual exclusion (lock routines)
  omp_set_lock(lockID)     ->  omperf_lock_set(lockID)
                               omp_set_lock(lockID)
                               omperf_lock_acquire(lockID)
  omp_unset_lock(lockID)   ->  omp_unset_lock(lockID)
                               omperf_lock_unset(lockID)
  omp_test_lock(lockID)    ->  …
Overhead issues here

Comments
Appropriate transformations for short-cut directives
  #omp parallel do
  #omp parallel sections
Performance initialization and termination routines
  omperf_init()
  omperf_finalize()
User-defined naming to use in context description
  New attribute? New directive? Runtime function?
RTL events and information
  How to get thread information efficiently?
  How to get thread-specific context data?
Supports portability and source-based analysis tools

Other Additions
Support for user-defined events
  !$omp perf event ...
  #pragma omp perf event …
  Place at arbitrary points in program
  Translated by compiler into corresponding omperf()
Measurement control
  !$omp perf on/off
  #pragma omp perf on/off
  Place at “consistent” points in program
  Translated by compiler into omperf_on/off()
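
One plausible way for a measurement library to honor the on/off control is a global flag that the event callback checks before recording anything. The sketch below assumes that design; the C signatures of `omperf`, `omperf_on`, and `omperf_off` are guesses based on the pseudo format in the slides, and the counter stands in for real measurement.

```c
#include <assert.h>

static int  omperf_enabled = 1;  /* measurement starts switched on */
static long events_recorded;     /* stand-in for real profiling or tracing */

/* What "#pragma omp perf on" and "#pragma omp perf off" would be
   translated into (assumed signatures). */
void omperf_on(void)  { omperf_enabled = 1; }
void omperf_off(void) { omperf_enabled = 0; }

/* What "#pragma omp perf event" would be translated into. */
void omperf(int eventID, int contextID)
{
    (void)contextID;
    assert(eventID >= 0);
    if (!omperf_enabled)
        return;  /* measurement is switched off: drop the event */
    events_recorded++;
}
```

A single global flag only works if it is toggled at points that all threads agree on, which may be what the slide means by “consistent” points in the program.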

Describing Execution Context (B. Mohr)
Describe different contexts through context descriptor
  struct region_descr {
    char name[];                /* region name */
    char filename[];            /* source file name */
    int begin_lineno;           /* begin line # */
    int end_lineno;             /* end line # */
    WORD data[4];               /* unspecified data */
    struct region_descr* next;
  };
Generate context descriptors in global static memory:
  struct region_descr rd42675 = { "r1", "foo.c", 5, 13 };
Table of context descriptors
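
The descriptor on the slide is pseudocode (C does not allow several unsized `char[]` members in one struct). A compilable variant might look like the sketch below, which assumes `const char *` for the two strings and `unsigned long` for `WORD`, since the slides leave `WORD` undefined. The second descriptor `rd2`, the `region_table` head pointer, and `find_region` are hypothetical additions used only to show a linked table of descriptors.

```c
#include <assert.h>
#include <string.h>

typedef unsigned long WORD;  /* assumption: not defined in the slides */

struct region_descr {
    const char *name;            /* region name */
    const char *filename;        /* source file name */
    int begin_lineno;            /* begin line # */
    int end_lineno;              /* end line # */
    WORD data[4];                /* unspecified data */
    struct region_descr *next;   /* next descriptor in the table */
};

/* Descriptors live in global static memory, initialized at compile time. */
static struct region_descr rd2     = { "r2", "foo.c", 20, 31, {0}, 0 };    /* hypothetical */
static struct region_descr rd42675 = { "r1", "foo.c", 5, 13, {0}, &rd2 };

/* Table of context descriptors, here a simple linked list. */
static struct region_descr *region_table = &rd42675;

/* Look a descriptor up by region name; returns 0 if there is none. */
const struct region_descr *find_region(const char *name)
{
    assert(name != NULL);
    for (const struct region_descr *r = region_table; r != NULL; r = r->next)
        if (strcmp(r->name, name) == 0)
            return r;
    return 0;
}
```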

Describing Execution Context (continued)
Pass descriptor address (or ID) to performance callback
Advantages:
  Full context information available, including source reference
  But minimal runtime overhead
    just one argument needs to be passed
    implementation doesn’t need to dynamically allocate memory for performance data!!
    context data initialization at compile time
  Context data is kept together with executable
    avoids problems of locating (the right) separate context description file at runtime
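
The "no dynamic allocation" point can be illustrated by letting the callback store its measurements directly in the descriptor's `data` slots. This is a sketch under several assumptions: the struct uses `const char *` and an `unsigned long` `WORD` to make it compilable, the callbacks take the descriptor address as their only argument, and the use of `data[0]` and `data[1]` as begin/end counters is invented for the example. It is not thread-safe as written.

```c
#include <assert.h>

typedef unsigned long WORD;  /* assumption: not defined in the slides */

struct region_descr {
    const char *name;
    const char *filename;
    int begin_lineno;
    int end_lineno;
    WORD data[4];                /* reused here as measurement storage */
    struct region_descr *next;
};

/* Generated at compile time, kept together with the executable. */
static struct region_descr rd42675 = { "r1", "foo.c", 5, 13, {0}, 0 };

/* Only one argument is passed; source context and storage come with it.
   Assumed slot usage: data[0] counts begins, data[1] counts ends. */
void omperf_parallel_begin(struct region_descr *rd)
{
    assert(rd != 0);
    rd->data[0]++;
}

void omperf_parallel_end(struct region_descr *rd)
{
    assert(rd != 0);
    rd->data[1]++;
}
```

Instrumented code would then call, for example, `omperf_parallel_begin(&rd42675)` at the start of region "r1".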

General Issues
Portable performance measurement interface
  OMP event-oriented (directives and RTL operation)
  Generic “standardized” performance event interface
  Not specific to any particular measurement library
  Cross-language support
Performance measurement library approach
  Profiling and tracing
  No built-in (non-portable) measurement
Overheads vs. perturbation
  Iteration measurement overhead can be serious
Dynamic instrumentation – is it possible?

TAU Architecture
[Figure: TAU architecture diagram; the only label that survives extraction is "Dynamic"]

Hybrid Parallel Computation (OpenMP + MPI)
Portable hybrid parallel programming
  OpenMP for shared memory parallel programming
    Fork-join model
    Loop level parallelism
  MPI for cross-box message-based parallelism
OpenMP performance measurement
  Interface to OpenMP runtime system (RTS events)
  Compiler support and integration
2D Stommel model of ocean circulation
  Jacobi iteration, 5-point stencil
  Timothy Kaiser (San Diego Supercomputer Center)

OpenMP + MPI Ocean Modeling (Trace)
Thread message pairing
Integrated OpenMP + MPI events

OpenMP + MPI Ocean Modeling (HW Profile)
% configure -papi=../packages/papi -openmp -c++=pgCC -cc=pgcc -mpiinc=../packages/mpich/include -mpilib=../packages/mpich/libo
Integrated OpenMP + MPI events
FP instructions
