Introduction to OpenMP
For a more detailed tutorial, see: https://computing.llnl.gov/tutorials/openMP/

Concepts
- An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared-memory parallelism
- Directive-based programming
  - declare properties of language structures (sections of code, loops)
  - scope variables
- A few service routines
  - get information
- Compiler options
- Environment variables

OpenMP Programming Model
- Directive
  - C/C++: #pragma omp directive [clause list]
  - Fortran: C$OMP construct [clause [clause] ...]
- The program executes serially until it encounters a parallel directive
  - #pragma omp parallel [clause list]
  - followed by a /* structured block of code */
- The clause list is used to specify conditions (a combined sketch follows below)
  - Conditional parallelism: if (cond)
  - Degree of concurrency: num_threads(int)
  - Data handling: private(vlist), firstprivate(vlist), shared(vlist)
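
A hedged sketch of how these clauses combine on one parallel directive; the names n, scale, and local are illustrative and not from the slides:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        int n = 500;
        double scale = 2.0, local = 0.0;

        /* if():           parallelize only when n is large enough              */
        /* num_threads():  request a team of 4 threads                          */
        /* firstprivate(): each thread starts with the caller's value of scale  */
        /* private():      each thread gets its own (uninitialized) copy of local */
        #pragma omp parallel if (n > 100) num_threads(4) firstprivate(scale) private(local)
        {
            local = scale * omp_get_thread_num();
            printf("thread %d: local = %f\n", omp_get_thread_num(), local);
        }
        return 0;
    }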

OpenMP Programming Model
- Fork-join parallelism: the master thread spawns a team of threads as needed; at the end of a parallel region the threads join back and the master continues with the code.

Typical OpenMP Use
- Generally used to parallelize loops
  - Find the most time-consuming loops
  - Split the iterations up between threads

Serial version:

    void main()
    {
        double Res[1000];
        for (int i = 0; i < 1000; i++) {
            do_huge_comp(Res[i]);
        }
    }

Parallel version:

    void main()
    {
        double Res[1000];
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++) {
            do_huge_comp(Res[i]);
        }
    }

Thread Interaction
- OpenMP operates using shared memory
  - Threads communicate via shared variables
- Unintended sharing can lead to race conditions
  - output changes depending on how the threads are scheduled
- Control race conditions using synchronization
  - synchronization is expensive
  - change the way data is stored to minimize the need for synchronization (see the sketch below)
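
A minimal sketch of such a race, with illustrative variable names: every thread updates the shared variable sum concurrently, and protecting the update with a critical section restores a deterministic result (at the cost of serializing the updates):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        int sum = 0;                        /* shared by all threads            */
        #pragma omp parallel for
        for (int i = 0; i < 100000; i++) {
            /* Racy: two threads may read the same old value of sum.           */
            /* sum = sum + 1;                                                   */
            /* One fix: serialize the update (correct, but expensive).         */
            #pragma omp critical
            sum = sum + 1;
        }
        printf("sum = %d\n", sum);          /* always 100000 with the critical  */
        return 0;
    }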

Syntax Format
- Compiler directives
  - C/C++:   #pragma omp construct [clause [clause] ...]
  - Fortran: C$OMP construct [clause [clause] ...]
             !$OMP construct [clause [clause] ...]
             *$OMP construct [clause [clause] ...]
- Since we use directives, no changes need to be made to a program for a compiler that doesn't support OpenMP

Using OpenMP
- Some compilers can insert directives automatically with an option
  - -qsmp=auto (IBM xlc)
  - some loops may speed up, some may slow down
- A compiler option is required when you use directives
  - icc -openmp (Linux, Intel compiler)
  - gcc -fopenmp (gcc versions >= 4.2)
- Scoping variables is the hard part!
  - shared variables vs. thread-private variables

OpenMP Example

    #include <stdio.h>
    #include <omp.h>

    int main()
    {
        int nthreads, tid;

        /* Fork a team of threads, giving them their own copies of variables */
        #pragma omp parallel private(tid)
        {
            /* Obtain and print the thread id */
            tid = omp_get_thread_num();
            printf("Hello World from thread = %d\n", tid);

            /* Only the master thread does this */
            if (tid == 0) {
                nthreads = omp_get_num_threads();
                printf("Number of threads = %d\n", nthreads);
            }
        }  /* All threads join the master thread and terminate */
    }

    openmp]$ gcc-4.2 -fopenmp hello.c -o hello
    openmp]$ ./hello
    Hello World from thread = 0
    Hello World from thread = 1
    Number of threads = 2
    openmp]$

OpenMP Directives
- Five categories:
  - Parallel regions
  - Worksharing
  - Data environment
  - Synchronization
  - Runtime functions / environment variables
- Basically the same between C/C++ and Fortran

Parallel Regions
- Create threads with omp parallel
- Threads share A (default behavior)
- The master thread creates the threads
- Threads all start at the same time, then synchronize at a barrier at the end before continuing with the code.

    double A[1000];
    omp_set_num_threads(4);
    #pragma omp parallel
    {
        int ID = omp_get_thread_num();
        dosomething(ID, A);
    }

Parallel Regions
Syntax:

    #pragma omp parallel [clause ...] newline
        structured_block

Clauses:
- if (scalar_expression)
- private (list)
- shared (list)
- default (shared | none)
- firstprivate (list)
- reduction (operator: list)
- copyin (list)
- num_threads (integer-expression)

Parallel Regions
The number of threads in a parallel region is determined by the following factors, in order of precedence:
1. Evaluation of the IF clause
2. Setting of the NUM_THREADS clause
3. Use of the omp_set_num_threads() library function
4. Setting of the OMP_NUM_THREADS environment variable
5. Implementation default - usually the number of CPUs on a node, though it could be dynamic.

Threads are numbered from 0 (master thread) to N-1 (see the sketch below).
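
A hedged sketch of how some of these settings interact; the team sizes given in the comments assume the call or clause takes effect and nothing higher in the precedence list overrides it:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        omp_set_num_threads(8);            /* library call: request 8 threads      */

        #pragma omp parallel
        {
            /* team size 8, from the library call above (precedence rule 3)        */
        }

        #pragma omp parallel num_threads(2)
        {
            /* the NUM_THREADS clause (rule 2) wins over the library call: size 2  */
            if (omp_get_thread_num() == 0)
                printf("team size = %d\n", omp_get_num_threads());
        }
        return 0;
    }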

Sections Construct
- The sections construct gives a different structured block to each thread
- By default there is a barrier at the end. Use the nowait clause to turn it off.

    #pragma omp parallel
    #pragma omp sections
    {
        X_calculation();
    #pragma omp section
        y_calculation();
    #pragma omp section
        z_calculation();
    }

Work-sharing Constructs
- The for construct splits up loop iterations among threads
- By default, there is a barrier at the end of the "omp for". Use the "nowait" clause to turn off the barrier.

    #pragma omp parallel
    #pragma omp for
    for (I = 0; I < N; I++) {
        NEAT_STUFF(I);
    }

Short-hand Notation
- Can combine the parallel and work-sharing constructs
- There is also a "parallel sections" construct

    #pragma omp parallel for
    for (I = 0; I < N; I++) {
        NEAT_STUFF(I);
    }

A Rule
In order to be made parallel, a loop must have canonical "shape":

    for (index = start; index OP end; INCR)

where OP is one of:  <   <=   >=   >
and INCR is one of:

    index++;   ++index;
    index--;   --index;
    index += inc;   index -= inc;
    index = index + inc;   index = inc + index;   index = index - inc;

An Example

    #pragma omp parallel for private(j)
    for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
        for (j = 0; j < n; j++)
            a[i][j] = MIN(a[i][j], a[i][k] + tmp[j]);

- By definition, private variable values are undefined at loop entry and exit
- To change this behavior, you can use the firstprivate(var) and lastprivate(var) clauses (a lastprivate sketch follows below)

    x[0] = complex_function();
    #pragma omp parallel for private(j) firstprivate(x)
    for (i = 0; i < n; i++) {
        for (j = 0; j < m; j++)
            x[j] = g(i, x[j-1]);
        answer[i] = x[j] - x[i];
    }
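
The slide mentions lastprivate but only demonstrates firstprivate. A minimal sketch of lastprivate, with an illustrative loop body: the value of x from the sequentially last iteration is copied back to the shared variable after the loop.

    #include <stdio.h>

    int main(void)
    {
        int i, x = 0;
        #pragma omp parallel for lastprivate(x)
        for (i = 0; i < 100; i++) {
            x = i * i;             /* each thread writes its own private copy    */
        }
        /* x now holds the value from the last iteration (i == 99): 9801          */
        printf("x = %d\n", x);
        return 0;
    }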

Scheduling Iterations
- The schedule clause affects how loop iterations are mapped onto threads
- schedule(static [,chunk])
  - Deal out blocks of iterations of size "chunk" to each thread.
- schedule(dynamic [,chunk])
  - Each thread grabs "chunk" iterations off a queue until all iterations have been handled.
- schedule(guided [,chunk])
  - Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size "chunk" as the calculation proceeds.
- schedule(runtime)
  - Schedule and chunk size are taken from the OMP_SCHEDULE environment variable.

An Example

    #pragma omp parallel for private(j) schedule(static, 2)
    for (i = 0; i < n; i++)
        for (j = 0; j < m; j++)
            x[j] = g(i, x[j-1]);

You can adjust the chunk size to address load-balancing issues, etc. (a schedule(runtime) sketch follows below).
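
With schedule(runtime) the schedule can be retuned without recompiling. A hedged, self-contained sketch; the loop body and variable names are illustrative:

    #include <stdio.h>

    int main(void)
    {
        int n = 1000;
        double sum = 0.0;

        /* Pick the schedule at run time without recompiling, e.g.:
             export OMP_SCHEDULE="dynamic,4"
             ./a.out                                                             */
        #pragma omp parallel for schedule(runtime) reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += (i + 0.5) / n;          /* illustrative work per iteration    */

        printf("sum = %f\n", sum);
        return 0;
    }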

Synchronization Directives
- BARRIER
  - inside PARALLEL, all threads synchronize (a short sketch follows below)
- CRITICAL (lock) / END CRITICAL (lock)
  - a section that can be executed by only one thread at a time
  - lock is an optional name used to distinguish several critical constructs from each other
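
The critical directive is demonstrated on the next slide; here is a minimal barrier sketch with illustrative printf bodies. All "before" lines are guaranteed to print before any "after" line, because no thread passes the barrier until every thread has reached it.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel
        {
            int id = omp_get_thread_num();
            printf("thread %d: before the barrier\n", id);
            #pragma omp barrier            /* no thread proceeds until all arrive */
            printf("thread %d: after the barrier\n", id);
        }
        return 0;
    }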

An Example

    double area, pi, x;
    int i, n;

    area = 0.0;
    #pragma omp parallel for private(x)
    for (i = 0; i < n; i++) {
        x = (i + 0.5)/n;
        #pragma omp critical
        area += 4.0/(1.0 + x*x);
    }
    pi = area / n;

Reductions
- Sometimes you want each thread to calculate part of a value, then collapse all the partial results into a single value
- This is done with the reduction clause

    area = 0.0;
    #pragma omp parallel for private(x) reduction(+:area)
    for (i = 0; i < n; i++) {
        x = (i + 0.5)/n;
        area += 4.0/(1.0 + x*x);
    }
    pi = area / n;

Another Example

    /* A Monte Carlo algorithm for calculating pi */
    int count;            /* points inside the unit quarter circle */
    unsigned short xi[3]; /* random number seed */
    int i;                /* loop index */
    int samples;          /* number of points to generate */
    double x, y;          /* coordinates of points */
    double pi;            /* estimate of pi */

    xi[0] = 1;            /* these statements set up the random seed */
    xi[1] = 1;
    xi[2] = 0;
    count = 0;
    for (i = 0; i < samples; i++) {
        x = erand48(xi);
        y = erand48(xi);
        if (x*x + y*y <= 1.0) count++;
    }
    pi = 4.0 * count / samples;
    printf("Estimate of pi: %7.5f\n", pi);

OpenMP issues:
- Each thread needs a different random number seed
- count is shared; we need the aggregate

OpenMP Version

    /* A Monte Carlo algorithm for calculating pi */
    int count;            /* points inside the unit quarter circle */
    unsigned short xi[3]; /* random number seed */
    int i;                /* loop index */
    int samples;          /* number of points to generate */
    double x, y;          /* coordinates of points */
    double pi;            /* estimate of pi */

    omp_set_num_threads(omp_get_num_procs());
    xi[0] = 1; xi[1] = 1; xi[2] = omp_get_thread_num();
    count = 0;
    #pragma omp parallel for firstprivate(xi) private(x,y) reduction(+:count)
    for (i = 0; i < samples; i++) {
        x = erand48(xi);
        y = erand48(xi);
        if (x*x + y*y <= 1.0) count++;
    }
    pi = 4.0 * count / samples;
    printf("Estimate of pi: %7.5f\n", pi);

What is wrong with this?

Corrected Version

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    int main(int argc, char *argv[])
    {
        /* A Monte Carlo algorithm for calculating pi */
        int count;        /* points inside the unit quarter circle */
        int i;            /* loop index */
        int samples;      /* number of points to generate */
        double x, y;      /* coordinates of points */
        double pi;        /* estimate of pi */

        samples = atoi(argv[1]);
        #pragma omp parallel
        {
            unsigned short xi[3];   /* random number seed */
            xi[0] = 1;              /* these statements set up the random seed */
            xi[1] = 1;
            xi[2] = omp_get_thread_num();
            count = 0;
            printf("I am thread %d\n", xi[2]);
            #pragma omp for firstprivate(xi) private(x,y) reduction(+:count)
            for (i = 0; i < samples; i++) {
                x = erand48(xi);
                y = erand48(xi);
                if (x*x + y*y <= 1.0) count++;
            }
        }
        pi = 4.0 * (double)count / (double)samples;
        printf("Count = %d, Samples = %d, Estimate of pi: %7.5f\n",
               count, samples, pi);
    }

    openmp]$ time ./montecarlopi.single
    I am thread 0
    Count = , Samples = , Estimate of pi:
    real    0m4.628s
    user    0m4.568s
    sys     0m0.030s
    openmp]$ time ./montecarlopi
    I am thread 1
    I am thread 0
    Count = , Samples = , Estimate of pi:
    real    0m2.480s
    user    0m4.620s
    sys     0m0.030s
    openmp]$

An Alternate Version

    ...
    #pragma omp parallel private(xi, t, x, y, local_count)
    {
        xi[0] = 1; xi[1] = 1;
        xi[2] = tid = omp_get_thread_num();
        t = omp_get_num_threads();
        local_count = 0;
        for (i = tid; i < samples; i += t) {
            x = erand48(xi);
            y = erand48(xi);
            if (x*x + y*y <= 1.0) local_count++;
        }
        #pragma omp critical
        count += local_count;
    }
    pi = 4.0 * count / samples;
    printf("Estimate of pi: %7.5f\n", pi);

    openmp]$ time ./montecarlopi
    I am thread 0
    I am thread 1
    1: local_count is
    0: local_count is
    Count = , Samples = , Estimate of pi:
    real    0m5.053s
    user    0m9.697s
    sys     0m0.053s
    openmp]$

Problems!

Corrected Version

    /* A Monte Carlo algorithm for calculating pi */
    int count;            /* points inside the unit quarter circle */
    int local_count;      /* points inside the unit quarter circle */
    int t, tid;
    int i;                /* loop index */
    int samples;          /* number of points to generate */
    double x, y;          /* coordinates of points */
    double pi;            /* estimate of pi */

    samples = atoi(argv[1]);
    #pragma omp parallel private(i,t,x,y,local_count) reduction(+:count)
    {
        unsigned short xi[3];  /* random number seed */
        xi[0] = 1;             /* these statements set up the random seed */
        xi[1] = 1;
        xi[2] = tid = omp_get_thread_num();
        t = omp_get_num_threads();
        count = 0;
        //printf("I am thread %d\n", xi[2]);
        for (i = tid; i < samples; i += t) {
            x = erand48(xi);
            y = erand48(xi);
            if (x*x + y*y <= 1.0) count++;
        }
        //printf("%d: count is %d\n", tid, count);
    }
    pi = 4.0 * (double)count / (double)samples;
    printf("Count = %d, Samples = %d, Estimate of pi: %7.5f\n",
           count, samples, pi);

Corrections:
- i should be private!
- the random number seed array (xi) should also be private
- use reduction

    openmp]$ time ./montecarlopi
    Count = , Samples = , Estimate of pi:
    real    0m2.321s
    user    0m4.559s
    sys     0m0.014s

Serial Directives
- MASTER / END MASTER
  - executed by the master thread only
- DO SERIAL / END DO SERIAL
  - the loop immediately following should not be parallelized
  - useful with -qsmp=omp:auto
- SINGLE
  - only one thread executes the block (a short sketch follows below)
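
A minimal C sketch contrasting master and single, with illustrative printf bodies: single has an implicit barrier at its end, master does not.

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        #pragma omp parallel
        {
            #pragma omp master
            printf("master: only thread 0 runs this, no barrier afterwards\n");

            #pragma omp single
            printf("single: exactly one thread (not necessarily 0) runs this; "
                   "the others wait at the implicit barrier\n");
        }
        return 0;
    }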

Example Serial Execution

    /* A Monte Carlo algorithm for calculating pi */
    ...
    omp_set_num_threads(omp_get_num_procs());
    xi[0] = 1; xi[1] = 1; xi[2] = omp_get_thread_num();
    count = 0;
    #pragma omp parallel for firstprivate(xi) private(x,y) reduction(+:count)
    for (i = 0; i < samples; i++) {
        x = erand48(xi);
        y = erand48(xi);
        if (x*x + y*y <= 1.0) count++;
        #pragma omp single
        {
            printf("Loop Iteration: %d\n", i);
        }
    }
    pi = 4.0 * count / samples;
    printf("Estimate of pi: %7.5f\n", pi);

Conditional Execution
- The overhead of fork/join is high
- If a loop is small, you don't want to parallelize it
- But you may not know how big it is until runtime
- Conditional clause for parallel execution: if (expression)

    area = 0.0;
    #pragma omp parallel for private(x) reduction(+:area) if (n > 5000)
    for (i = 0; i < n; i++) {
        x = (i + 0.5)/n;
        area += 4.0/(1.0 + x*x);
    }
    pi = area / n;

Scope Rules
- Shared-memory programming model
  - most variables are shared by default
- Global variables are shared
- But not everything is shared (a sketch of these defaults follows below)
  - loop index variables
  - stack variables in functions called from a parallel region
  - a variable that is set and then used inside the for-loop is PRIVATE
  - an array whose subscript is constant w.r.t. the PARALLEL for-loop and that is set and then used within the for-loop is PRIVATE
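
A minimal sketch of the standard defaults, with illustrative names: the global and the array are shared, the loop index is private automatically, and a variable declared inside the loop body is private to each thread.

    #include <stdio.h>

    double global_scale = 2.0;               /* global variable: shared            */

    int main(void)
    {
        double a[1000];                       /* shared by default                  */

        #pragma omp parallel for
        for (int i = 0; i < 1000; i++) {      /* loop index i: private              */
            double tmp = global_scale * i;    /* declared inside the loop: private  */
            a[i] = tmp;
        }
        printf("a[10] = %f\n", a[10]);
        return 0;
    }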

Scope Clauses
- The for/DO directive has extra clauses; the most important are:
  - PRIVATE (variable list)
  - REDUCTION (op: variable list)
    - op is sum, min, max
    - variable is a scalar; XLF allows arrays

Scope Clauses (2)
- PARALLEL, PARALLEL for/DO, and PARALLEL SECTIONS also have:
  - DEFAULT (variable list)
    - scope determined by rules
  - SHARED (variable list)
  - IF (scalar logical expression)
- Directives are like a programming-language extension, not a compiler option

      integer i, j, n
      real*8 a(n,n), b(n)

      read (1) b
!$OMP PARALLEL DO
!$OMP&  PRIVATE (i,j) SHARED (a,b,n)
      do j = 1, n
         do i = 1, n
            a(i,j) = sqrt(1.d0 + b(j)*i)
         end do
      end do
!$OMP END PARALLEL DO

Matrix Multiply

!$OMP PARALLEL DO PRIVATE(i,j,k)
      do j = 1, n
         do i = 1, n
            do k = 1, n
               c(i,j) = c(i,j) + a(i,k) * b(k,j)
            end do
         end do
      end do

Analysis
- The outer loop is parallel: columns of c
- Not optimal for cache use
- Can put more directives on each loop
- Then the granularity might be too fine

OMP Functions (usage sketch below)
- int omp_get_num_procs()
- int omp_get_num_threads()
- int omp_get_thread_num()
- void omp_set_num_threads(int)
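
A small hedged sketch tying the four routines together; the requested thread count is illustrative, and omp.h must be included:

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        printf("processors available: %d\n", omp_get_num_procs());

        omp_set_num_threads(3);               /* request a team of 3 threads */
        #pragma omp parallel
        {
            printf("thread %d of %d\n",
                   omp_get_thread_num(),      /* my id within the team       */
                   omp_get_num_threads());    /* actual team size            */
        }
        return 0;
    }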

OpenMP Environment Variables
- OpenMP parallelism may be controlled via environment variables (a shell example follows below)
  - OMP_NUM_THREADS
    - sets the number of threads in parallel sections
  - OMP_DYNAMIC
    - when TRUE, allows the number of threads to be adjusted dynamically at runtime
  - OMP_NESTED
    - when TRUE, enables nested parallelism
  - OMP_SCHEDULE
    - controls the scheduling assignment
    - Example: export OMP_SCHEDULE="static,4"
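
A short shell sketch in the style of the earlier transcripts; the hello program is the one from the OpenMP Example slide, the values are illustrative, and the exact interleaving of the output lines will vary from run to run:

    openmp]$ export OMP_NUM_THREADS=4
    openmp]$ export OMP_SCHEDULE="static,4"
    openmp]$ ./hello
    Hello World from thread = 0
    ...
    Number of threads = 4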

Fortran Parallel Directives
- PARALLEL / END PARALLEL
- PARALLEL SECTIONS / SECTION / SECTION / END PARALLEL SECTIONS
- DO / END DO
  - work-sharing directive for the DO loop immediately following
- PARALLEL DO / END PARALLEL DO
  - combined parallel section and work-sharing

Pthread Translation