Performance Tools Interface for OpenMP
A presentation to the OpenMP Futures Committee
Allen D. Malony (malony@cs.uoregon.edu)
Computer & Information Science Department, Computational Science Institute, University of Oregon

Outline
- Goals for an OpenMP performance tools interface
- Performance state and event model
- Fork-join execution states and events
- Performance measurement model
- Event generation (callback) interface
- Proposal based on directive transformation
- Sample transformations
- Comments and other additions
- Describing execution context
- General issues
- Experience using TAU with OpenMP/MPI
January 29, 2019

Goals for an OMP Performance Tools Interface
- Goal 1: Expose OpenMP events and execution states to a performance measurement system
  - What are the OpenMP events / states of interest?
  - What is the nature (mechanism) of the interface?
- Goal 2: Make the performance measurement interface portable
  - "Standardize" on an interface mechanism
  - Define interface semantics and information
- Goal 3: Support source-level and compiler-level implementations of the interface
  - Source transformation and compiler transformation

Performance State and Event Model
- Based on a performance model for (nested) fork-join parallelism, multi-threaded work sharing, and thread-based synchronization
- Defined with respect to a multi-level state view:
  - Level 1: serial and parallel states (with nesting)
  - Level 2: work-sharing states (per team thread)
  - Level 3: synchronization states (per team thread)
  - Level 4: runtime system (thread) states
- Events reflect state transitions
  - State enter / exit (begin / end)
  - State graph with event edges

Fork-Join Execution States and Events

Parallel region operation:

  Event                               master  slave  state
  master starts serial execution        X             S
  parallel region begins                X
  slaves started                        X
  team begins parallel execution        X      X      P
  team threads hit barrier              X      X
  slaves end; master exits barrier      X      X
  master resumes serial execution       X             S

Performance Measurement Model
- Serial performance
  - Detect serial transition points
  - Standard events and statistics within serial regions
  - Time spent in serial execution
  - Locations of serial execution in the program
- Parallel performance
  - Detect parallel transition points
  - Time spent in parallel execution
  - Region perspective and work-sharing perspective
  - Performance profiles kept per region
  - More complex parallel states of execution

Event Generation (Callback) Interface
- Generic event callback functions (pseudo format):
    omperf(eventID, contextID[, data])
    omperf_{begin/end}(eventID, contextID[, data])
  - Single callback routine
  - Must define events (not necessarily standardize them)
  - Places the burden on the callback routine to interpret eventID
- Directive-specific callback functions (pseudo format):
    omperf_{directive}_{begin/end/...}(contextID[, data])
  - Standardized function names
- What about execution context data?

Instrumentation Alternatives
- Source-level instrumentation
  - Manual instrumentation (TAU performance measurement)
  - Directive transformation
- Compiler instrumentation
  - Could allow a more efficient implementation
  - JOMP (EPCC), Java Instrumentation Suite (Barcelona)
- Runtime system instrumentation
  - Used to see RTL events, in addition to OMP events
  - GuideView (KAI/Intel)
- Dynamic instrumentation

Proposal Based on Directive Transformation
- Consider a source-level approach
  - For each OMP directive, generate an "instrumented" version that calls the performance event API
  - What is the event model for each directive?
- Issues
  - OMP RTL execution behavior is not fully exposed
  - May not be able to generate an equivalent form
  - Possible conflicts with directive optimization
  - May be less efficient
  - Hard to access RTL events and information
- Sample transformations (B. Mohr, KFA)

Example: parallel regions, work-sharing (do)

Parallel region (parallel):
  #omp parallel        =>  omperf_parallel_fork(regionID)
                           #omp parallel
                           omperf_parallel_begin(regionID)
  #omp end parallel    =>  omperf_parallel_end(regionID)
                           omperf_barrier_begin(regionID)
                           #omp barrier
                           omperf_barrier_end(regionID)
                           #omp end parallel
                           omperf_parallel_join(regionID)

(#omp is just pseudo notation; IDs vs. context descriptor, see below)

Work-sharing (do/for):
  #omp do              =>  omperf_do_begin(loopID)
                           #omp do
  #omp end do nowait   =>  #omp end do nowait
                           omperf_do_end(loopID)
  #omp end do          =>  #omp end do nowait
                           omperf_do_end(loopID)
                           omperf_barrier_begin(loopID)
                           #omp barrier
                           omperf_barrier_end(loopID)

Example: work-sharing (sections)

Work-sharing (sections):
  #omp sections               =>  omperf_sections_begin(sectionsID)
                                  #omp sections
  #omp section (first only)   =>  #omp section
                                  omperf_section_begin(sectionID)
  #omp section (others only)  =>  omperf_section_end(prevsectionID)
                                  #omp section
                                  omperf_section_begin(sectionID)
  #omp end sections nowait    =>  omperf_section_end(lastsectionID)
                                  #omp end sections nowait
                                  omperf_sections_end(sectionsID)
  #omp end sections           =>  omperf_section_end(lastsectionID)
                                  #omp end sections nowait
                                  omperf_barrier_begin(sectionsID)
                                  #omp barrier
                                  omperf_barrier_end(sectionsID)
                                  omperf_sections_end(sectionsID)

Example: work-sharing (single, master)

Work-sharing (single):
  #omp single             =>  omperf_single_enter(singleID)
                              #omp single
                              omperf_single_begin(singleID)
  #omp end single nowait  =>  omperf_single_end(singleID)
                              #omp end single nowait
                              omperf_single_exit(singleID)
  #omp end single         =>  omperf_single_end(singleID)
                              #omp end single nowait
                              omperf_barrier_begin(singleID)
                              #omp barrier
                              omperf_barrier_end(singleID)
                              omperf_single_exit(singleID)

Work-sharing (master):
  #omp master       =>  #omp master
                        omperf_master_begin(regionID)
  #omp end master   =>  omperf_master_end(regionID)
                        #omp end master

Example: synchronization (critical, atomic, lock)

Mutual exclusion (critical section):
  #omp critical       =>  omperf_critical_enter(criticalID)
                          #omp critical
                          omperf_critical_begin(criticalID)
  #omp end critical   =>  omperf_critical_end(criticalID)
                          #omp end critical
                          omperf_critical_exit(criticalID)

Mutual exclusion (atomic):
  #omp atomic   =>  omperf_atomic_begin(atomicID)
                    #omp atomic
                    atomic-expr-stmt
                    omperf_atomic_end(atomicID)

Mutual exclusion (lock routines):
  omp_set_lock(lockID)    =>  omperf_lock_set(lockID)
                              omp_set_lock(lockID)
                              omperf_lock_acquire(lockID)
  omp_unset_lock(lockID)  =>  omp_unset_lock(lockID)
                              omperf_lock_unset(lockID)
  omp_test_lock(lockID)   =>  …

(Overhead issues here)

Comments
- Appropriate transformations for short-cut directives
  - #omp parallel do, #omp parallel sections
- Performance initialization and termination routines
  - omperf_init(), omperf_finalize()
- User-defined naming to use in the context description
  - New attribute? New directive? Runtime function?
- RTL events and information
  - How to get thread information efficiently?
  - How to get thread-specific context data?
- Supports portability and source-based analysis tools

Other Additions
- Support for user-defined events
  - !$omp perf event ... / #pragma omp perf event ...
  - Place at arbitrary points in the program
  - Translated by the compiler into a corresponding omperf() call
- Measurement control
  - !$omp perf on/off / #pragma omp perf on/off
  - Place at "consistent" points in the program
  - Translated by the compiler into omperf_on() / omperf_off()

Describing Execution Context (B. Mohr)
- Describe different contexts through a context descriptor:

    struct region_descr {
      char name[];                 /* region name */
      char filename[];             /* source file name */
      int begin_lineno;            /* begin line # */
      int end_lineno;              /* end line # */
      WORD data[4];                /* unspecified data */
      struct region_descr* next;
    };

- Generate context descriptors in global static memory:

    struct region_descr rd42675 = { "r1", "foo.c", 5, 13 };

- Table of context descriptors

Describing Execution Context (continued)
- Pass the descriptor address (or ID) to the performance callback
- Advantages: full context information is available, including the source reference, with minimal runtime overhead
  - Just one argument needs to be passed
  - The implementation doesn't need to dynamically allocate memory for performance data: context data is initialized at compile time
  - Context data is kept together with the executable, which avoids the problem of locating the (right) separate context description file at runtime

General Issues
- Portable performance measurement interface
  - OMP event-oriented (directives and RTL operation)
  - Generic "standardized" performance event interface
  - Not specific to any particular measurement library
  - Cross-language support
- Performance measurement library approach
  - Profiling and tracing
  - No built-in (non-portable) measurement
- Overheads vs. perturbation
  - Per-iteration measurement overhead can be serious
- Dynamic instrumentation: is it possible?

TAU Architecture
[architecture diagram, including dynamic instrumentation support]

Hybrid Parallel Computation (OpenMP + MPI)
- Portable hybrid parallel programming
  - OpenMP for shared-memory parallel programming (fork-join model, loop-level parallelism)
  - MPI for cross-box message-based parallelism
- OpenMP performance measurement
  - Interface to the OpenMP runtime system (RTS events)
  - Compiler support and integration
- 2D Stommel model of ocean circulation
  - Jacobi iteration, 5-point stencil
  - Timothy Kaiser (San Diego Supercomputing Center)

OpenMP + MPI Ocean Modeling (Trace)
- Thread message pairing
- Integrated OpenMP + MPI events

OpenMP + MPI Ocean Modeling (HW Profile)

  % configure -papi=../packages/papi -openmp -c++=pgCC -cc=pgcc
      -mpiinc=../packages/mpich/include -mpilib=../packages/mpich/libo

- Integrated OpenMP + MPI events
- FP instructions