Measuring Synchronisation and Scheduling Overheads in OpenMP
J. Mark Bull, EPCC, University of Edinburgh, UK


Overview
• Motivation
• Experimental method
• Results and analysis
  – Synchronisation
  – Loop scheduling
• Conclusions and future work

Motivation
• Compare OpenMP implementations on different systems.
• Highlight inefficiencies.
• Investigate performance implications of semantically equivalent directives.
• Allow estimation of synchronisation/scheduling overheads in applications.

Experimental method
• The basic idea is to compare the same code executed with and without directives.
• The overhead is computed as the (mean) difference in execution time.
• e.g. for the DO directive, compare (a fuller sketch is given after this slide):

    !$OMP PARALLEL
    do j=1,innerreps
    !$OMP DO
       do i=1,numthreads
          call delay(dlength)
       end do
    end do
    !$OMP END PARALLEL

  to

    do j=1,innerreps
       call delay(dlength)
    end do
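A minimal, self-contained sketch of how this comparison might be assembled is given below. This is an illustrative reconstruction rather than the benchmark's actual source: the use of omp_get_wtime, the repetition count, the delay length and the delay routine itself are all assumptions.

    ! Sketch (assumption, not the original benchmark code): measure the DO
    ! overhead as the difference between the directive version and a plain
    ! reference loop, divided by the number of repetitions.
    program do_overhead
       use omp_lib
       implicit none
       integer, parameter :: innerreps = 1000, dlength = 100
       integer :: i, j, numthreads
       double precision :: t0, t1, t2, t3

       numthreads = omp_get_max_threads()

       ! Directive version: each pass of the j loop executes one DO construct;
       ! the numthreads delay calls are shared out, so the extra wall-clock
       ! time per pass is the DO overhead.
       t0 = omp_get_wtime()
    !$OMP PARALLEL PRIVATE(i,j)
       do j = 1, innerreps
    !$OMP DO
          do i = 1, numthreads
             call delay(dlength)
          end do
    !$OMP END DO
       end do
    !$OMP END PARALLEL
       t1 = omp_get_wtime()

       ! Reference version: the same work per pass as each thread does above,
       ! with no directives at all.
       t2 = omp_get_wtime()
       do j = 1, innerreps
          call delay(dlength)
       end do
       t3 = omp_get_wtime()

       print *, 'DO overhead per construct (s): ', &
                ((t1 - t0) - (t3 - t2)) / innerreps
    end program do_overhead

    subroutine delay(n)
       ! Assumed stand-in for the benchmark's delay routine: a short busy loop.
       integer, intent(in) :: n
       integer :: k
       double precision :: a
       a = 0.0d0
       do k = 1, n
          a = a + dble(k)
       end do
       if (a < 0.0d0) print *, a   ! keeps the loop from being optimised away
    end subroutine delay

In the actual benchmarks this measurement would then be repeated many times, feeding the statistics described on the Timing slide.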

Experimental method (cont.)
• A similar technique can be used for the PARALLEL (with and without a REDUCTION clause), PARALLEL DO, BARRIER and SINGLE directives.
• For mutual exclusion (CRITICAL, ATOMIC, lock/unlock), use a similar method, comparing

    !$OMP PARALLEL
    do j=1,innerreps/nthreads
    !$OMP CRITICAL
       call delay(dlength)
    !$OMP END CRITICAL
    end do
    !$OMP END PARALLEL

  to the same reference time (a lock-based variant is sketched after this slide).
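For the lock/unlock case listed above, the same pattern might look like the sketch below. This is again an assumption rather than the benchmark source; it uses the standard OpenMP lock routines and must be linked with the delay routine from the previous sketch.

    ! Sketch: the mutual-exclusion measurement using explicit locks instead of
    ! CRITICAL. innerreps is divided by the number of threads so that the total
    ! number of delay calls matches the serial reference loop.
    program lock_overhead
       use omp_lib
       implicit none
       integer, parameter :: innerreps = 1000, dlength = 100
       integer :: j, nthreads
       integer(kind=omp_lock_kind) :: lck
       double precision :: t0, t1

       nthreads = omp_get_max_threads()
       call omp_init_lock(lck)

       t0 = omp_get_wtime()
    !$OMP PARALLEL PRIVATE(j) SHARED(lck)
       do j = 1, innerreps/nthreads
          call omp_set_lock(lck)
          call delay(dlength)            ! assumed delay routine, as before
          call omp_unset_lock(lck)
       end do
    !$OMP END PARALLEL
       t1 = omp_get_wtime()

       call omp_destroy_lock(lck)
       print *, 'time for lock/unlock version (s): ', t1 - t0
    end program lock_overhead

Comparing this time with the reference loop, as before, gives the per-acquisition overhead of the lock routines.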

Experimental method (cont.)
• The same method as for the DO directive can be used to investigate loop scheduling overheads (the timed region is sketched after this slide).
• For the loop scheduling options, the overhead depends on
  – number of threads
  – number of iterations per thread
  – execution time of the loop body
  – chunk size
• This is a large parameter space, so fix the first three and vary the chunk size:
  – 4 threads
  – 1024 iterations per thread
  – 100 clock cycles to execute the loop body
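Concretely, only the SCHEDULE clause of the timed loop changes between runs. A hedged sketch of the timed region is shown below; the surrounding program and the names innerreps, numthreads, dlength and delay are carried over from the earlier DO sketch, and csize is the chunk size being varied.

    ! Sketch: loop-scheduling variant of the timed region from the DO sketch.
    ! Only the schedule kind and the chunk size csize are varied; the slide's
    ! fixed parameters are 4 threads, 1024 iterations per thread and a
    ! ~100-cycle loop body.
    !$OMP PARALLEL PRIVATE(i,j)
       do j = 1, innerreps
    !$OMP DO SCHEDULE(DYNAMIC,csize)      ! likewise STATIC,csize and GUIDED,csize
          do i = 1, 1024*numthreads       ! 1024 iterations per thread
             call delay(dlength)
          end do
    !$OMP END DO
       end do
    !$OMP END PARALLEL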

Timing
• Care is needed with timing routines:
  – differences of 32-bit floating point second counts (e.g. etime) lose too much precision.
  – microsecond accuracy is needed (Fortran 90 system_clock isn’t good enough on some systems).
• For statistical stability, repeat each measurement 50 times per run, and for 20 runs of the executable.
  – significant variation is observed between runs which is absent within a given run.
• Reject runs with large standard deviations, or with large numbers of outliers (a sketch of such a check follows this slide).
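A sketch of the kind of per-run check this implies is shown below. The 50-measurement count comes from the slide; the spread and outlier thresholds are purely illustrative assumptions.

    ! Sketch: per-run statistics over the 50 repeated measurements. Runs with a
    ! large spread or many outliers would be rejected; the thresholds below are
    ! illustrative, not the benchmark's own.
    program run_check
       implicit none
       integer, parameter :: nmeas = 50
       double precision :: t(nmeas), mean, sd
       integer :: noutliers

       call random_number(t)              ! placeholder for the measured times

       mean = sum(t) / nmeas
       sd   = sqrt(sum((t - mean)**2) / (nmeas - 1))
       noutliers = count(abs(t - mean) > 3.0d0*sd)

       if (sd > 0.1d0*mean .or. noutliers > 2) then
          print *, 'reject run: sd =', sd, ' outliers =', noutliers
       else
          print *, 'mean =', mean, ' std dev =', sd
       end if
    end program run_check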

Systems tested
The benchmark codes have been run on:
• Sun HPC 3500, eight 400 MHz UltraSPARC II processors, KAI guidef90 preprocessor, Solaris f90 compiler.
• SGI Origin 2000, MIPS R10000 processors, MIPSpro f90 compiler (access to 8 processors only).
• Compaq Alpha server, four 525 MHz EV5/6 processors, Digital f90 compiler.

Results charts: Sun HPC 3500, SGI Origin 2000 and Compaq Alpha server (three slides of charts for each system; the charts themselves are not reproduced in this transcript).

Observations
• The PARALLEL directive uses 2 barriers – is this strictly necessary?
  – PARALLEL DO costs twice as much as DO (both forms are sketched after this slide).
• The REDUCTION clause scales badly – should a fan-in method be used?
• SINGLE should not cost more than BARRIER.
• Mutual exclusion scales badly on the Origin 2000.
• The CRITICAL directive is very expensive on the Compaq.
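The two forms behind the PARALLEL DO observation, written out for clarity (a sketch only; n and work are placeholders, and the enclosing declarations are omitted):

    ! Sketch: two semantically equivalent ways of parallelising the same loop.
    ! The measurements suggest the combined form costs about twice as much as
    ! the bare DO inside an existing parallel region, consistent with PARALLEL
    ! itself using two barriers.

    ! Combined construct: pays the PARALLEL cost on every use.
    !$OMP PARALLEL DO
       do i = 1, n
          call work(i)                    ! placeholder loop body
       end do
    !$OMP END PARALLEL DO

    ! Bare DO inside an existing parallel region: only the DO/barrier cost.
    !$OMP PARALLEL
    !$OMP DO
       do i = 1, n
          call work(i)
       end do
    !$OMP END DO
    !$OMP END PARALLEL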

Observations (cont.)
• Small chunk sizes are very expensive
  – the compiler should generate code statically for a block-cyclic schedule.
• DYNAMIC is much more expensive than STATIC, especially on the Origin 2000.
• On the Origin 2000 and the Compaq, block cyclic is more expensive than block, even with one chunk per thread (the two schedules are sketched after this slide).
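The last point compares the two schedules sketched below, which assign identical iterations when the chunk size equals the per-thread iteration count (numbers taken from the slide's fixed parameters; the loops are assumed to sit inside a parallel region as in the earlier sketches):

    ! Sketch: "block" vs "block-cyclic" with one chunk per thread. With 4
    ! threads and 1024 iterations per thread, both schedules give each thread
    ! the same contiguous block of iterations, yet the chunked form was
    ! measured to cost more on the Origin 2000 and the Compaq.
    !$OMP DO SCHEDULE(STATIC)             ! block
       do i = 1, 4*1024
          call delay(dlength)
       end do
    !$OMP END DO

    !$OMP DO SCHEDULE(STATIC,1024)        ! block-cyclic, chunk = iterations/thread
       do i = 1, 4*1024
          call delay(dlength)
       end do
    !$OMP END DO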

Conclusions and future work
• A set of benchmarks to measure synchronisation and scheduling costs in OpenMP.
• The results show significant differences between systems.
• They highlight some potential areas for optimisation.
• We would like to run on more (and larger) systems.