Presentation on theme: "Jemmy Hu Programming multithreaded code with OpenMP" — Presentation transcript:

1 Jemmy Hu Programming multithreaded code with OpenMP
SHARCNET HPC Consultant University of Waterloo May 31, 2016 /work/jemmyhu/ss2016/openmp/

2 Contents
- Parallel Programming Concepts
- OpenMP on SHARCNET
- OpenMP
  - Getting Started
  - OpenMP Directives: Parallel Regions, Worksharing Constructs, Data Environment, Synchronization, Runtime functions/environment variables
- Case Studies
- Recent Updates (SIMD, Task, …)
- References

3 Parallel Computer Memory Architectures Shared Memory (SMP solution)
Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as global address space. Multiple processors can operate independently but share the same memory resources. Changes in a memory location effected by one processor are visible to all other processors. It could be a single machine (shared memory machine, workstation, desktop), or a shared memory node in a cluster.

4 Parallel Computer Memory Architectures
Hybrid Distributed-Shared Memory (Cluster solution) employs both shared and distributed memory architectures. The shared memory component is usually a cache coherent SMP machine. Processors on a given SMP can address that machine's memory as global. The distributed memory component is the networking of multiple SMPs. SMPs know only about their own memory - not the memory on another SMP. Therefore, network communications are required to move data from one SMP to another. Current trends seem to indicate that this type of memory architecture will continue to prevail and increase at the high end of computing for the foreseeable future.

5 Parallel Computing: What is it?
Parallel computing is when a program uses concurrency to either:
- decrease the runtime for the solution to a problem, or
- increase the size of the problem that can be solved.
It gives you more performance to throw at your problems.
Parallel programming is not generally trivial; 3 aspects:
- specifying parallel execution
- communicating between multiple procs/threads
- synchronization
Tools for automated parallelism are either highly specialized or absent.
Many issues need to be considered, many of which don't have an analog in serial computing:
- data vs. task parallelism
- problem structure
- parallel granularity

6

7 Distributed vs. Shared memory model • Distributed memory systems
– For processors to share data, the programmer must explicitly arrange for communication -“Message Passing” – Message passing libraries: MPI (“Message Passing Interface”) PVM (“Parallel Virtual Machine”) • Shared memory systems – Compiler directives (OpenMP) – “Thread” based programming (pthread, …)

8 OpenMP Concepts: What is it?
An Application Program Interface (API) that may be used to explicitly direct multi-threaded, shared memory parallelism Using compiler directives, library routines and environment variables to automatically generate threaded (or multi-process) code that can run in a concurrent or parallel environment. Portable: - The API is specified for C/C++ and Fortran - Multiple platforms have been implemented including most Unix platforms and Windows NT Standardized: Jointly defined and endorsed by a group of major computer hardware and software vendors What does OpenMP stand for? Open Specifications for Multi Processing

9

10 OpenMP: Benefits
Standardization: provides a standard among a variety of shared memory architectures/platforms.
Lean and mean: establishes a simple and limited set of directives for programming shared memory machines; significant parallelism can be implemented by using just 3 or 4 directives.
Ease of use: provides the capability to incrementally parallelize a serial program, unlike message-passing libraries which typically require an all-or-nothing approach.
Portability: supports Fortran (77, 90, and 95), C, and C++; public forum for API and membership.

11

12 OpenMP History
OpenMP continues to evolve - new constructs and features are added with each release. Initially, the API specifications were released separately for C and Fortran. Since 2005, they have been released together. The list below chronicles the OpenMP API release history:
Oct 1997 - Fortran 1.0
Oct 1998 - C/C++ 1.0
Nov 1999 - Fortran 1.1
Nov 2000 - Fortran 2.0
Mar 2002 - C/C++ 2.0
May 2005 - OpenMP 2.5
May 2008 - OpenMP 3.0
Jul 2011 - OpenMP 3.1
Jul 2013 - OpenMP 4.0
Nov 2015 - OpenMP 4.5

13 OpenMP: compiler
Compilers and their OpenMP flags:
Intel (icc, ifort): -openmp
GNU (gcc, g++, gfortran): -fopenmp
Pathscale (pathcc, pathf90): -openmp
PGI (pgcc, pgf77, pgf90): -mp

Compile with the OpenMP flag, e.g.:
icc -openmp -o hello_openmp hello_openmp.c
gcc -fopenmp -o hello_openmp hello_openmp.c
ifort -openmp -o hello_openmp hello_openmp.f
gfortran -fopenmp -o hello_openmp hello_openmp.f

14 OpenMP on SHARCNET SHARCNET systems https://www.sharcnet.ca/my/systems
All systems allow for SMP-based parallel programming (i.e., OpenMP) applications, but the size (number of threads) per job differs from cluster to cluster (it depends on the number of cores per node on the cluster).

15 Size of OpenMP Jobs on specific systems
System - CPU type - cores per node (max OMP_NUM_THREADS when using a full node):
orca - Opteron - 16 or 24
saw - Xeon - 8
redfin - 24
brown -
Jobs that use fewer than the maximum threads available on a node are usually preferable.

16 OpenMP: compile and run at SHARCNET
Compiler flags:
Intel (icc, ifort): -openmp
GNU (gcc, gfortran, >4.1): -fopenmp
SHARCNET compile wrappers (default compiler is Intel): cc, CC, f77, f90
e.g., f90 -openmp -o hello_openmp hello_openmp.f
Run OpenMP jobs in the threaded queue. Submit an OpenMP job on a cluster with 4-cpu nodes (the size of threaded jobs varies on different systems as discussed on the previous page):
sqsub -q threaded -n 4 -r 1.0h -o hello_openmp.log ./hello_openmp

17 OpenMP: simplest example
Hands-on 1 (get first experience)
Login to saw, then ssh to one of saw's dev nodes (saw-dev1 to saw-dev6), e.g.: ssh saw-dev2
Copy my 'openmp' directory to your /work directory:
cd /work/yourid
cp -r /work/jemmyhu/ss2016/openmp .
Check that the copy is done: ls
Type: export OMP_NUM_THREADS=4

18 [jemmyhu@saw332:~/work] cd ss2016/openmp
cd hellow

hello-seq.f90:
program hello
write(*,*) "Hello, world!"
end program

f90 -o hello-seq hello-seq.f90
./hello-seq
Hello, world!

Add the two 'red' lines below to the code, save it as 'hello-par1.f90', compile and run:
!$omp parallel
write(*,*) "Hello, world!"
!$omp end parallel

f90 -o hello-par1-seq hello-par1.f90
./hello-par1-seq
Without the -openmp flag the compiler ignores the OpenMP directives (parallel region concept).

19 OpenMP: simplest example
hello-par1.f90:
program hello
!$omp parallel
write(*,*) "Hello, world!"
!$omp end parallel
end program

f90 -openmp -o hello-par1 hello-par1.f90
./hello-par1
Hello, world!

20 OpenMP: simplest example
Now, add another two 'red' lines to hello-par1.f90, save it as 'hello-par2.f90', compile and run:

program hello
write(*,*) "before"
!$omp parallel
write(*,*) "Hello, parallel world!"
!$omp end parallel
write(*,*) "after"
end program

f90 -openmp -o hello-par2 hello-par2.f90
./hello-par2
before
Hello, parallel world!
……
after

21 OpenMP: simplest example
sqsub -q threaded -n 4 -r 1.0h -o hello-par2.log ./hello-par2 WARNING: no memory requirement defined; assuming 2GB submitted as jobid sqjobs jobid queue state ncpus nodes time command test Q s ./hello-par2 2688 CPUs total, 2474 busy; 435 jobs running; 1 suspended, 1890 queued. 325 nodes allocated; 11 drain/offline, 336 total. Job <3910> is submitted to queue <threaded>. Before Hello, from thread after

22 OpenMP: simplest example
Now, add the red part to hello-par2.f90, save it as 'hello-par3.f90', compile and run:

program hello
use omp_lib
write(*,*) "before"
!$omp parallel
write(*,*) "Hello, from thread ", omp_get_thread_num()
!$omp end parallel
write(*,*) "after"
end program

before
Hello, from thread 1
Hello, from thread 0
Hello, from thread 2
Hello, from thread 3
after

Example of using the OpenMP API to retrieve a thread's id. (Hands-on 1 end)

23 OpenMP: hello_omp.f90
program hello90
use omp_lib
integer :: id, nthreads
!$omp parallel private(id)
id = omp_get_thread_num()
write (*,*) 'Hello World from thread', id
!$omp barrier
if ( id .eq. 0 ) then
   nthreads = omp_get_num_threads()
   write (*,*) 'There are', nthreads, 'threads'
end if
!$omp end parallel
end program

Hands-on. Compile: f90 -openmp -o hello_omp hello_omp.f90
Run: ./hello_omp

24 OpenMP: hello_omp.c Data types: private vs. shared
#include <stdio.h>
#include <omp.h>

int main (int argc, char *argv[]) {
  int id, nthreads;
  #pragma omp parallel private(id)
  {
    id = omp_get_thread_num();
    printf("Hello World from thread %d\n", id);
    #pragma omp barrier
    if ( id == 0 ) {
      nthreads = omp_get_num_threads();
      printf("There are %d threads\n", nthreads);
    }
  }
  return 0;
}

Callouts on the slide: header file, parallel region directive, runtime library routines, synchronization, data types (private vs. shared).

25 OpenMP code structure: C/C++ syntax
#include <omp.h>

main () {
  int var1, var2, var3;
  Serial code
  . . .
  Beginning of parallel section. Fork a team of threads. Specify variable scoping.
  #pragma omp parallel private(var1, var2) shared(var3)
  {
    Parallel section executed by all threads
    . . .
    All threads join master thread and disband
  }
  Resume serial code
  . . .
}

General form: #pragma omp directive-name [clause, ...] newline
Example: #pragma omp parallel private(var1, var2) shared(var3)

26 OpenMP code structure: Fortran
PROGRAM HELLO
INTEGER VAR1, VAR2, VAR3
Serial code
. . .
Beginning of parallel section. Fork a team of threads. Specify variable scoping.
!$OMP PARALLEL PRIVATE(VAR1, VAR2) SHARED(VAR3)
Parallel section executed by all threads
. . .
All threads join master thread and disband
!$OMP END PARALLEL
Resume serial code
. . .
END

General form: sentinel directive-name [clause ...]
Example: !$OMP PARALLEL PRIVATE(VAR1, VAR2) SHARED(VAR3)

27 OpenMP Components: directives, runtime library routines, environment variables

28 OpenMP: Fork-Join Model
OpenMP uses the fork-join model of parallel execution:
FORK: the master thread creates a team of parallel threads.
The statements in the program that are enclosed by the parallel region construct are then executed in parallel among the various team threads.
JOIN: when the team threads complete the statements in the parallel region construct, they synchronize and terminate, leaving only the master thread.

29 Shared Memory Model

30 OpenMP Directives

31 Basic Directive Formats
Fortran: directives come in pairs
!$OMP directive [clause, …]
[ structured block of code ]
!$OMP end directive

C/C++: case sensitive
#pragma omp directive [clause, …] newline
[ structured block of code ]

32

33 PARALLEL Region Construct: Summary
A parallel region is a block of code that will be executed by multiple threads; it creates a team of threads. A parallel region must be a structured block. It may contain any of the following clauses:

Fortran:
!$OMP PARALLEL [clause ...]
  IF (scalar_logical_expression)
  PRIVATE (list)
  SHARED (list)
  DEFAULT (PRIVATE | SHARED | NONE)
  FIRSTPRIVATE (list)
  REDUCTION (operator: list)
  COPYIN (list)
block
!$OMP END PARALLEL

C/C++:
#pragma omp parallel [clause ...] newline
  if (scalar_expression)
  private (list)
  shared (list)
  default (shared | none)
  firstprivate (list)
  reduction (operator: list)
  copyin (list)
structured_block

34 OpenMP: Parallel Regions

35 Parallel Region review
Every thread executes all code enclosed in the parallel region. OpenMP library routines are used to obtain thread identifiers and the total number of threads.

#include <stdio.h>
#include <omp.h>

int main (int argc, char *argv[]) {
  int id, nthreads;
  #pragma omp parallel private(id)
  {
    id = omp_get_thread_num();
    printf("Hello World from thread %d\n", id);
    #pragma omp barrier
    if ( id == 0 ) {
      nthreads = omp_get_num_threads();
      printf("There are %d threads\n", nthreads);
    }
  }
  return 0;
}

36 Example: Matrix-Vector Multiplication
A[n,n] x B[n] = C[n]
(The slide illustrates this with a small numeric matrix and vector, not reproduced here.)

for (i=0; i < SIZE; i++) {
  for (j=0; j < SIZE; j++)
    c[i] += (A[i][j] * b[j]);
}

Question: Can we simply add one parallel directive?

#pragma omp parallel
for (i=0; i < SIZE; i++) {
  for (j=0; j < SIZE; j++)
    c[i] += (A[i][j] * b[j]);
}

37 Matrix-Vector Multiplication: Parallel Region
/* Create a team of threads and scope variables */
#pragma omp parallel shared(A,b,c,total) private(tid,i,j,istart,iend)
{
  tid = omp_get_thread_num();
  nid = omp_get_num_threads();
  istart = tid*SIZE/nid;
  iend = (tid+1)*SIZE/nid;
  for (i=istart; i < iend; i++) {
    for (j=0; j < SIZE; j++)
      c[i] += (A[i][j] * b[j]);
    /* Update and display of running total must be serialized */
    #pragma omp critical
    {
      total = total + c[i];
      printf(" thread %d did row %d\t c[%d]=%.2f\t",tid,i,i,c[i]);
      printf("Running total= %.2f\n",total);
    }
  } /* end of parallel i loop */
} /* end of parallel construct */

38 OpenMP: Work-sharing constructs:
A work-sharing construct divides the execution of the enclosed code region among the members of the team that encounter it. Work-sharing constructs do not launch new threads. There is no implied barrier upon entry to a work-sharing construct; however, there is an implied barrier at the end of a work-sharing construct.

39 A motivating example

40 Types of Work-Sharing Constructs:
DO / for - shares iterations of a loop across the team. Represents a type of "data parallelism".
SECTIONS - breaks work into separate, discrete sections. Each section is executed by a thread. Can be used to implement a type of "functional parallelism".
SINGLE - serializes a section of code.

41 Work-sharing constructs: Loop construct
The DO / for directive specifies that the iterations of the loop immediately following it must be executed in parallel by the team. This assumes a parallel region has already been initiated, otherwise it executes in serial on a single processor.

#pragma omp parallel
#pragma omp for
for (I=0; I<N; I++) {
  NEAT_STUFF(I);
}

!$omp parallel
!$omp do
do-loop
!$omp end do
!$omp end parallel

42 Matrix-Vector Multiplication: Parallel Loop
/* Create a team of threads and scope variables */
#pragma omp parallel shared(A,b,c,total) private(tid,i)
{
  tid = omp_get_thread_num();
  /* Loop work-sharing construct - distribute rows of matrix */
  #pragma omp for private(j)
  for (i=0; i < SIZE; i++) {
    for (j=0; j < SIZE; j++)
      c[i] += (A[i][j] * b[j]);
    /* Update and display of running total must be serialized */
    #pragma omp critical
    {
      total = total + c[i];
      printf(" thread %d did row %d\t c[%d]=%.2f\t",tid,i,i,c[i]);
      printf("Running total= %.2f\n",total);
    }
  } /* end of parallel i loop */
} /* end of parallel construct */

43 Summary: Parallel Loop vs. Parallel Region
Parallel loop:
!$OMP PARALLEL DO PRIVATE(i)
!$OMP& SHARED(a,b,n)
do i=1,n
   a(i)=a(i)+b(i)
enddo
!$OMP END PARALLEL DO

Parallel region:
!$OMP PARALLEL PRIVATE(start, end, i)
!$OMP& SHARED(a,b)
num_thrds = omp_get_num_threads()
thrd_id = omp_get_thread_num()
start = n*thrd_id/num_thrds + 1
end = n*(thrd_id+1)/num_thrds
do i = start, end
   a(i)=a(i)+b(i)
enddo
!$OMP END PARALLEL

The parallel-region version normally gives better performance than the loop-based code, but it is more difficult to implement: less thread synchronization, fewer cache misses, more compiler optimizations.

44 Hands-on 2 (read, compile, and run)
/work/yourID/ss2016/openmp/MM matrix-vector-seq.c matrix-vector-parregion.c matrix-vector-par.c ser_mm_hu omp_mm_hu.c

45 The schedule clause

46 schedule(static) • Iterations are divided evenly among threads
c$omp do shared(x) private(i)
c$omp& schedule(static)
do i = 1, 1000
   x(i) = a
enddo

47 schedule(static,chunk)
• Divides the workload into chunk-sized parcels.
• If there are N threads, each thread does every Nth chunk of work.

c$omp do shared(x) private(i)
c$omp& schedule(static,1000)
do i = 1, 12000
   … work …
enddo

48 schedule(dynamic,chunk)
• Divides the workload into chunk-sized parcels.
• As a thread finishes one chunk, it grabs the next available chunk.
• Default value for chunk is 1.
• More overhead, but potentially better load balancing.

c$omp do shared(x) private(i)
c$omp& schedule(dynamic,1000)
do i = 1, 10000
   … work …
end do
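The same three schedule kinds can be written in C as well. A minimal sketch: the loop length and chunk sizes are illustrative, and work(i) is a hypothetical per-iteration routine, not part of the course code.

#include <omp.h>

/* hypothetical per-iteration task; stands in for real work */
static void work(int i) { (void)i; }

void schedule_examples(int n)
{
    int i;

    /* static: iterations split into (roughly) equal blocks, one per thread */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < n; i++)
        work(i);

    /* static with chunk: fixed-size chunks handed out round-robin */
    #pragma omp parallel for schedule(static, 1000)
    for (i = 0; i < n; i++)
        work(i);

    /* dynamic: each thread grabs the next available chunk when it finishes one */
    #pragma omp parallel for schedule(dynamic, 1000)
    for (i = 0; i < n; i++)
        work(i);
}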

49 Re-examine omp_mm_hu.c (hands-on)
1) Change chunk = 10; to chunk = 5; save, recompile and run, and see the difference.
2) Change back to chunk = 10, and change the lines (4 of them)
#pragma omp for schedule (static, chunk)
to
#pragma omp for schedule (dynamic, chunk)
Save as omp_mm_hu_dynamic.c, compile and run, and see the difference.

50 SECTIONS Directive: Functional/Task parallelism
Purpose: The SECTIONS directive is a non-iterative work-sharing construct. It specifies that the enclosed section(s) of code are to be divided among the threads in the team. Independent SECTION directives are nested within a SECTIONS directive. Each SECTION is executed once by a thread in the team. Different sections may be executed by different threads. It is possible for a thread to execute more than one section if it is quick enough and the implementation permits it.
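The following slides give a Fortran example; as a complementary sketch (not from the original slides), this is the C/C++ form of the same construct, with hypothetical task routines funcA and funcB.

#include <omp.h>

/* hypothetical independent tasks */
static void funcA(void) { }
static void funcB(void) { }

void run_sections(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        funcA();        /* executed once, by some thread in the team */

        #pragma omp section
        funcB();        /* may run concurrently with funcA on another thread */
    }
    /* implied barrier at the end of the sections construct */
}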

51 Examples: 3-loops
Serial code with three independent tasks (three do loops), each operating on a different array using different loop counters and temporary scalar variables.

program compute
implicit none
integer, parameter :: NX =
integer, parameter :: NY =
integer, parameter :: NZ =
real :: x(NX)
real :: y(NY)
real :: z(NZ)
integer :: i, j, k
real :: ri, rj, rk
write(*,*) "start"
do i = 1, NX
   ri = real(i)
   x(i) = atan(ri)/ri
end do
do j = 1, NY
   rj = real(j)
   y(j) = cos(rj)/rj
end do
do k = 1, NZ
   rk = real(k)
   z(k) = log10(rk)/rk
end do
write(*,*) "end"
end program

52 Examples: 3-loops with sections
Instead of hard-coding the work distribution, we can use OpenMP's task-sharing directives (sections) to achieve the same goal.

program compute
……
write(*,*) "start"
!$omp parallel
!$omp sections
!$omp section
do i = 1, NX
   ri = real(i)
   x(i) = atan(ri)/ri
end do
!$omp section
do j = 1, NY
   rj = real(j)
   y(j) = cos(rj)/rj
end do
!$omp section
do k = 1, NZ
   rk = real(k)
   z(k) = log10(rk)/rk
end do
!$omp end sections
!$omp end parallel
write(*,*) "end"
end program

53 OpenMP Work Sharing Constructs - single
c$omp parallel private(i) shared(a)
c$omp do
do i = 1, n
   … work on a(i) …
enddo
c$omp single
   … process result of do …
c$omp end single
… more work …
c$omp end parallel

• Ensures that a code block is executed by only one thread in a parallel region.
• The thread that reaches the single directive first is the one that executes the single block.
• Useful when dealing with sections of code that are not thread safe (such as I/O).
• Unless nowait is specified, all noninvolved threads wait at the end of the single block.
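For comparison with the Fortran fragment above, a minimal C sketch of the single construct; the printf stands in for any non-thread-safe work and is illustrative only.

#include <stdio.h>
#include <omp.h>

void single_example(void)
{
    #pragma omp parallel
    {
        /* ... work done by all threads ... */

        #pragma omp single
        {
            /* executed by exactly one thread, e.g. non-thread-safe I/O */
            printf("intermediate result written by one thread\n");
        }   /* implicit barrier here unless 'nowait' is added */

        /* ... more work by all threads ... */
    }
}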

54 Examples: single-1
PROGRAM single_1
write(*,*) 'start'
!$OMP PARALLEL DEFAULT(NONE), private(i)
!$OMP DO
do i=1,5
   write(*,*) i
enddo
!$OMP END DO
!$OMP SINGLE
write(*,*) 'begin single directive'
write(*,*) 'hello',i
!$OMP END SINGLE
!$OMP END PARALLEL
write(*,*) 'end'
END

[jemmyhu@wha780 single]$ ./single-1
start 1 4 5 2 3
begin single directive
hello 1 hello 2 hello 3 hello 4 hello 5
end

55 Examples: single-2
PROGRAM single_2
INTEGER NTHREADS, TID, TID2, OMP_GET_NUM_THREADS, OMP_GET_THREAD_NUM
write(*,*) "Start"
!$OMP PARALLEL PRIVATE(TID, i)
!$OMP DO
do i=1,8
   TID = OMP_GET_THREAD_NUM()
   write(*,*) "thread: ", TID, 'i = ', i
enddo
!$OMP END DO
!$OMP SINGLE
write(*,*) "SINGLE - begin"
TID2 = OMP_GET_THREAD_NUM()
PRINT *, 'This is from thread = ', TID2
write(*,*) 'hello',i
!$OMP END SINGLE
!$OMP END PARALLEL
write(*,*) "End "
END

single]$ ./single-2
Start
thread: 0 i = 1
thread: 1 i = 5
thread: 1 i = 6
thread: 1 i = 7
thread: 1 i = 8
thread: 0 i = 2
thread: 0 i = 3
thread: 0 i = 4
SINGLE - begin
This is from thread = 0
hello 1 hello 2 hello 3 hello 4 hello 5 hello 6 hello 7 hello 8
End

56 Try yourself (hands-on)
Codes in …/ss2016/openmp/others/ read the code, compile, and run 3loops-seq.f90 3loops-region.f90 3loops-ompsec.f90 single-3.f90

57 Data Scope Clauses
SHARED (list)
PRIVATE (list)
FIRSTPRIVATE (list)
LASTPRIVATE (list)
DEFAULT (list)
THREADPRIVATE (list)
COPYIN (list)
REDUCTION (operator | intrinsic : list)

58 Data Scope Example (shared vs private)
/work/userid/ss2016/openmp/data_scope
scope.f90

program scope
implicit none
integer :: myid, myid2
write(*,*) "before"
!$omp parallel private(myid2)
   myid = omp_get_thread_num()    ! updates the shared copy
   myid2 = omp_get_thread_num()   ! updates the private copy
   write(*,*) "myid myid2 : ", myid, myid2
!$omp end parallel
write(*,*) "after"
end program

59 [jemmyhu@saw-login1:~] ./scope
before
myid myid2 :
myid myid2 :
myid myid2 :
myid myid2 :
myid myid2 :
myid myid2 :
myid myid2 :
myid myid2 :
after

In scope-2.f90, both myid and myid2 are defined as private, which gives:
./scope-2
before
myid myid2 :
myid myid2 :
myid myid2 :
myid myid2 :
myid myid2 :
myid myid2 :
myid myid2 :
myid myid2 :
after

60 Changing default scoping rules: C vs Fortran
Fortran: default(shared | private | none); index variables are private.
C/C++: default(shared | none) only - there is no default(private), because many standard C libraries are implemented using macros that reference global variables. The serial loop index variable is shared, and the C for construct is so general that it is difficult for the compiler to figure out which variables should be privatized.
default(none) helps catch scoping errors (see the sketch below).
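A short C sketch of how default(none) catches scoping errors; the routine and variable names are illustrative. With default(none), every variable referenced in the region must be scoped explicitly, so a forgotten shared/private clause becomes a compile-time error rather than a silent race.

#include <omp.h>

void scale(double *a, int n, double factor)
{
    int i;

    /* default(none): the compiler rejects this region unless every
       variable used (a, n, factor, i) is explicitly scoped */
    #pragma omp parallel for default(none) shared(a, n, factor) private(i)
    for (i = 0; i < n; i++)
        a[i] *= factor;
}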

61 reduction(operator|intrinsic:var1[,var2])
• Allows safe global calculation or comparison.
• A private copy of each listed variable is created and initialized depending on the operator or intrinsic (e.g., 0 for +).
• Partial sums and local mins are determined by the threads in parallel.
• Partial sums are added together from one thread at a time to get the global sum.
• Local mins are compared from one thread at a time to get gmin.

c$omp do shared(x) private(i)
c$omp& reduction(+:sum)
do i = 1, N
   sum = sum + x(i)
end do

c$omp do shared(x) private(i)
c$omp& reduction(min:gmin)
do i = 1,N
   gmin = min(gmin,x(i))
end do
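The same pattern in C, as a hedged sketch (array and routine names are illustrative): each thread gets a private copy of sum initialized to 0 and a private copy of gmin initialized to the largest double, and the copies are combined after the loop. Note that min reductions in C require OpenMP 3.1 or later.

#include <float.h>
#include <omp.h>

double sum_and_min(const double *x, int n, double *gmin_out)
{
    int i;
    double sum = 0.0;
    double gmin = DBL_MAX;

    /* '+' copies start at 0, 'min' copies start at the largest double */
    #pragma omp parallel for reduction(+:sum) reduction(min:gmin)
    for (i = 0; i < n; i++) {
        sum = sum + x[i];
        if (x[i] < gmin) gmin = x[i];
    }

    *gmin_out = gmin;   /* combined results, valid after the implicit barrier */
    return sum;
}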

62 reduction.f90:
PROGRAM REDUCTION
USE omp_lib
IMPLICIT NONE
INTEGER tnumber
INTEGER I,J,K
I=1
J=1
K=1
PRINT *, "Before Par Region: I=",I," J=", J," K=",K
PRINT *, ""
!$OMP PARALLEL PRIVATE(tnumber) REDUCTION(+:I) REDUCTION(*:J) REDUCTION(MAX:K)
tnumber=OMP_GET_THREAD_NUM()
I = tnumber
J = tnumber
K = tnumber
PRINT *, "Thread ",tnumber, " I=",I," J=", J," K=",K
!$OMP END PARALLEL
print *, "Operator * MAX"
PRINT *, "After Par Region: I=",I," J=", J," K=",K
END PROGRAM REDUCTION

./reduction
Before Par Region: I= J= K=
Thread I= J= K=
Thread I= J= K=
Thread I= J= K=
Thread I= J= K=
Operator * MAX
After Par Region: I= J= K=

63 Scope clauses that can appear on a parallel construct
shared and private explicitly scope specific variables firstprivate and lastprivate perform initialization and finalization of privatized variables default changes the default rules used when variables are not explicitly scoped reduction explicitly identifies reduction variables

64 Default scoping rules in C
void caller(int a[ ], int n)
{
   int i, j, m = 3;
   #pragma omp parallel for
   for (i = 0; i < n; i++) {
      int k = m;
      for (j = 1; j <= 5; j++) {
         callee(&a[i], &k, j);
      }
   }
}

extern int c;
void callee(int *x, int *y, int z)
{
   int ii;
   static int cnt;
   cnt++;
   for (ii = 0; ii < z; ii++) {
      *x = *y + c;
   }
}

Variable - Scope - Is use safe? - Reason for scope
a - shared - yes - declared outside parallel construct
n - shared - yes - declared outside parallel construct
i - private - yes - parallel loop index variable
j - shared - no - loop index variable, but not in Fortran
m - shared - yes - declared outside parallel construct
k - private - yes - auto variable declared inside parallel construct
x - private - yes - value parameter
*x - shared - yes - actual parameter is a, which is shared
y - private - yes - value parameter
*y - shared - yes - actual parameter is k, which is shared
z - private - yes - value parameter
c - shared - yes - declared as extern
ii - private - yes - local stack variable of called subroutine
cnt - shared - no - declared as static

65 Default scoping rules in Fortran
subroutine caller(a, n)
   integer n, a(n), i, j, m
   m = 3
!$omp parallel do
   do i = 1, N
      do j = 1, 5
         call callee(a(j), m, j)
      end do
   end do
end

subroutine callee(x, y, z)
   common /com/ c
   integer x, y, z, c, ii, cnt
   save cnt
   cnt = cnt + 1
   do ii = 1, z
      x = y + c
   end do
end

Variable - Scope - Is use safe? - Reason for scope
a - shared - yes - declared outside parallel construct
n - shared - yes - declared outside parallel construct
i - private - yes - parallel loop index variable
j - private - yes - Fortran sequential loop index variable
m - shared - yes - declared outside parallel construct
x - shared - yes - actual parameter is a, which is shared
y - shared - yes - actual parameter is m, which is shared
z - private - yes - actual parameter is j, which is private
c - shared - yes - in a common block
ii - private - yes - local stack variable of called subroutine
cnt - shared - no - local variable of called subroutine with save attribute

66 OpenMP: Synchronization
OpenMP has the following constructs to support synchronization:
– atomic
– critical section
– barrier
– flush
– ordered
– single
– master

67
program sharing_omp1
implicit none
integer, parameter :: N =
integer(selected_int_kind(17)) :: x(N)
integer(selected_int_kind(17)) :: total
integer :: i
!$omp parallel
!$omp do
do i = 1, N
   x(i) = i
end do
!$omp end do
total = 0
!$omp do
do i = 1, N
   total = total + x(i)
end do
!$omp end do
!$omp end parallel
write(*,*) "total = ", total
end program

! Parallel code with OpenMP do directives.
! The result varies from run to run
! due to unsynchronized updates of the shared variable 'total'.
./sharing-atomic-par1
total =
total =

68
program sharing_omp2
use omp_lib
implicit none
integer, parameter :: N =
integer(selected_int_kind(17)) :: x(N)
integer(selected_int_kind(17)) :: total
integer :: i
!$omp parallel
!$omp do
do i = 1, N
   x(i) = i
end do
!$omp end do
total = 0
!$omp do
do i = 1, N
   !$omp atomic
   total = total + x(i)
end do
!$omp end do
!$omp end parallel
write(*,*) "total = ", total
end program

! Parallel code with OpenMP do directives,
! synchronized with the atomic directive,
! which gives the correct answer, but costs more.
./sharing-atomic-par2
total =

69 Synchronization categories
Mutual Exclusion Synchronization: critical, atomic
Event Synchronization: barrier, ordered, master
Custom Synchronization: flush (lock – runtime library)

70 Atomic vs. Critical
program sharing_par1
use omp_lib
implicit none
integer, parameter :: N =
integer(selected_int_kind(17)) :: x(N)
integer(selected_int_kind(17)) :: total
integer :: i
!$omp parallel
!$omp do
do i = 1, N
   x(i) = i
end do
!$omp end do
total = 0
!$omp do
do i = 1, N
   !$omp atomic
   total = total + x(i)
end do
!$omp end do
!$omp end parallel
write(*,*) "total = ", total
end program
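The slide title contrasts atomic with critical but only the atomic version is shown; a minimal C sketch of the difference (variable names illustrative): atomic protects a single memory update and can often map to a hardware instruction, while critical protects an arbitrary block of code at a higher cost.

#include <stdio.h>
#include <omp.h>

void atomic_vs_critical(const double *x, int n)
{
    int i;
    double total = 0.0;
    double total2 = 0.0;

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        /* atomic: serializes just this one update of 'total' */
        #pragma omp atomic
        total += x[i];

        /* critical: serializes a whole block; needed when the protected
           work is more than a single simple update                     */
        #pragma omp critical
        {
            total2 += x[i];
            if (total2 > 1.0e6)
                printf("running total passed 1e6 at i = %d\n", i);
        }
    }
}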

71 Barriers are used to synchronize the execution of multiple threads within a parallel region, not within a work-sharing construct. They ensure that a piece of work has been completed before moving on to the next phase.

!$omp parallel private(index)
index = generate_next_index()
do while (index .ne. 0)
   call add_index (index)
   index = generate_next_index()
enddo
! Wait for all the indices to be generated
!$omp barrier
index = get_next_index()
do while (index .ne. 0)
   call process_index (index)
   index = get_next_index()
enddo
!$omp end parallel

72 OpenMP: Library routines
Lock routines
– omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock()
Runtime environment routines:
– Modify/check the number of threads
  – omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()
– Turn on/off nesting and dynamic mode
  – omp_set_nested(), omp_set_dynamic(), omp_get_nested(), omp_get_dynamic()
– Are we in a parallel region?
  – omp_in_parallel()
– How many processors in the system?
  – omp_get_num_procs()
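The lock routines listed above are not demonstrated elsewhere in these slides; a minimal sketch of the usual init/set/unset/destroy pattern, with a shared counter as an illustrative example.

#include <omp.h>

int update_counter(int nsteps)
{
    int counter = 0;          /* shared by all threads in the region */
    omp_lock_t lock;

    omp_init_lock(&lock);

    #pragma omp parallel
    {
        int i;
        for (i = 0; i < nsteps; i++) {
            omp_set_lock(&lock);     /* block until the lock is acquired */
            counter++;               /* protected update of shared data  */
            omp_unset_lock(&lock);   /* release so other threads proceed */
        }
    }

    omp_destroy_lock(&lock);
    return counter;   /* nsteps multiplied by the number of threads in the team */
}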

73 OpenMP: Environment Variables
Control how "omp for schedule(RUNTIME)" loop iterations are scheduled:
– OMP_SCHEDULE "schedule[, chunk_size]"
Set the default number of threads to use:
– OMP_NUM_THREADS int_literal
Can the program use a different number of threads in each parallel region?
– OMP_DYNAMIC TRUE || FALSE
Will nested parallel regions create new teams of threads, or will they be serialized?
– OMP_NESTED TRUE || FALSE

Examples: /work/userid/ss2015/openmp/data_scope
openmp_lib_env.f90, openmp_lib_env-2.f90, openmp_lib_env-3.f90

74 Dynamic threading

75 OpenMP excution model (nested parallel)

76 Case Studies: Example – pi; Partial Differential Equation – 2D

77

78

79

80

81

82 Parallel Loop, reduction clause
#include <stdio.h>
#include <omp.h>
#define NUM_STEPS

int main(int argc, char *argv[ ]) {
  int i, nthreads;
  double x, pi;
  double sum = 0.0;
  double step = 1.0/(double) NUM_STEPS;
  /* do computation -- using all available threads */
  #pragma omp parallel
  {
    #pragma omp master
    nthreads = omp_get_num_threads();
    #pragma omp for private(x) reduction(+:sum) schedule(runtime)
    for (i=0; i < NUM_STEPS; ++i) {
      x = (i+0.5)*step;
      sum = sum + 4.0/(1.0+x*x);
    }
  }
  pi = step * sum;
  printf("PI = %f\n",pi);
  return 0;
}

83 Partial Differential Equation – 2D
Laplace Equation
Helmholtz Equation
Heat Equation
Wave Equation

84

85

86

87

88

89

90 Implementation: Compute, Compare, Update

91 Serial Code
program lpcache
integer imax,jmax,im1,im2,jm1,jm2,it,itmax
parameter (imax=12,jmax=12)
parameter (im1=imax-1,im2=imax-2,jm1=jmax-1,jm2=jmax-2)
parameter (itmax=10)
real*8 u(imax,jmax),du(imax,jmax),umax,dumax,tol
parameter (umax=10.0,tol=1.0e-6)
! Initialize
do j=1,jmax
   do i=1,imax-1
      u(i,j)=0.0
      du(i,j)=0.0
   enddo
   u(imax,j)=umax
enddo

92 ! Main computation loop
do it=1,itmax
   dumax=0.0
   do j=2,jm1
      do i=2,im1
         du(i,j)=0.25*(u(i-1,j)+u(i+1,j)+u(i,j-1)+u(i,j+1))-u(i,j)
         dumax=max(dumax,abs(du(i,j)))
      enddo
   enddo
   do j=2,jm1
      do i=2,im1
         u(i,j)=u(i,j)+du(i,j)
      enddo
   enddo
   write (*,*) it,dumax
enddo
stop
end

93 Shared Memory Parallelization

94 OpenMP Code
program lpomp
use omp_lib
integer imax,jmax,im1,im2,jm1,jm2,it,itmax
parameter (imax=12,jmax=12)
parameter (im1=imax-1,im2=imax-2,jm1=jmax-1,jm2=jmax-2)
parameter (itmax=10)
real*8 u(imax,jmax),du(imax,jmax),umax,dumax,tol
parameter (umax=10.0,tol=1.0e-6)
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j)
!$OMP DO
do j=1,jmax
   do i=1,imax-1
      u(i,j)=0.0
      du(i,j)=0.0
   enddo
   u(imax,j)=umax
enddo
!$OMP END DO
!$OMP END PARALLEL

95
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(i,j)
do it=1,itmax
   dumax=0.0
!$OMP DO REDUCTION (max:dumax)
   do j=2,jm1
      do i=2,im1
         du(i,j)=0.25*(u(i-1,j)+u(i+1,j)+u(i,j-1)+u(i,j+1))-u(i,j)
         dumax=max(dumax,abs(du(i,j)))
      enddo
   enddo
!$OMP END DO
!$OMP DO
   do j=2,jm1
      do i=2,im1
         u(i,j)=u(i,j)+du(i,j)
      enddo
   enddo
!$OMP END DO
enddo
!$OMP END PARALLEL
end

96 Jacobi method: 2D Helmholtz equation
The Helmholtz equation (∇² − α)u = f is written in centred finite-difference form as A u = f, where A contains the discrete Laplace operator and α and f are known. If u(n) is the n-th iteration of the solution, we get the residual vector r(n) = A u(n) − f, and we take the (n+1)-th iteration of u such that the new residual is zero.

97 Jacobi Solver – Serial code with 2 loop nests
k = 1;
while (k <= maxit && error > tol) {
   error = 0.0;
   /* copy new solution into old */
   for (j=0; j<m; j++)
      for (i=0; i<n; i++)
         uold[i + m*j] = u[i + m*j];
   /* compute stencil, residual and update */
   for (j=1; j<m-1; j++)
      for (i=1; i<n-1; i++) {
         resid = ( ax * (uold[i-1 + m*j] + uold[i+1 + m*j])
                 + ay * (uold[i + m*(j-1)] + uold[i + m*(j+1)])
                 + b * uold[i + m*j] - f[i + m*j] ) / b;
         /* update solution */
         u[i + m*j] = uold[i + m*j] - omega * resid;
         /* accumulate residual error */
         error = error + resid*resid;
      }
   /* error check */
   k++;
   error = sqrt(error) / (n*m);
} /* while */

98 Jacobi Solver – Parallel code with 2 parallel regions
k = 1;
while (k <= maxit && error > tol) {
   error = 0.0;
   /* copy new solution into old */
   #pragma omp parallel for private(i)
   for (j=0; j<m; j++)
      for (i=0; i<n; i++)
         uold[i + m*j] = u[i + m*j];
   /* compute stencil, residual and update */
   #pragma omp parallel for reduction(+:error) private(i,resid)
   for (j=1; j<m-1; j++)
      for (i=1; i<n-1; i++) {
         resid = ( ax * (uold[i-1 + m*j] + uold[i+1 + m*j])
                 + ay * (uold[i + m*(j-1)] + uold[i + m*(j+1)])
                 + b * uold[i + m*j] - f[i + m*j] ) / b;
         /* update solution */
         u[i + m*j] = uold[i + m*j] - omega * resid;
         /* accumulate residual error */
         error = error + resid*resid;
      }
   /* error check */
   k++;
   error = sqrt(error) / (n*m);
} /* while */

99 Jacobi Solver – Parallel code with 2 parallel loops in 1 PR
k = 1;
while (k <= maxit && error > tol) {
   error = 0.0;
   #pragma omp parallel private(resid, i)
   {
      #pragma omp for
      for (j=0; j<m; j++)
         for (i=0; i<n; i++)
            uold[i + m*j] = u[i + m*j];
      /* compute stencil, residual and update */
      #pragma omp for reduction(+:error)
      for (j=1; j<m-1; j++)
         for (i=1; i<n-1; i++) {
            resid = ( ax * (uold[i-1 + m*j] + uold[i+1 + m*j])
                    + ay * (uold[i + m*(j-1)] + uold[i + m*(j+1)])
                    + b * uold[i + m*j] - f[i + m*j] ) / b;
            /* update solution */
            u[i + m*j] = uold[i + m*j] - omega * resid;
            /* accumulate residual error */
            error = error + resid*resid;
         }
   } /* end parallel */
   k++;
   error = sqrt(error) / (n*m);
} /* while */

100 Recent Updates: Task Parallelism in OpenMP
• Definition: A task parallel computation is one in which parallelism is applied by performing distinct computations - or tasks - at the same time. Since the number of tasks is fixed, the parallelism is not scalable.
• OpenMP support for task parallelism:
  - Parallel sections: different threads execute different code.
  - Tasks (NEW): tasks are created and executed at separate times.
• Common use of task parallelism = producer/consumer:
  - A producer creates work that is then processed by the consumer.
  - You can think of this as a form of pipelining, like in an assembly line.
  - The "producer" writes data to a FIFO queue (or similar) to be accessed by the "consumer".
Simple case: the OpenMP sections directive.

101 Task Parallelism in OpenMP
The constructs so far worked well for many cases, but OpenMP had to see everything:
- loops with a known length at run time
- a finite number of parallel sections
- ....
This didn't work well with certain common problems, linked lists and recursive algorithms being typical cases (see the sketch below).
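A minimal sketch of the linked-list case mentioned above, assuming a hypothetical node type and a process() routine: one thread walks the list and creates a task per node, and the rest of the team executes the tasks as they are generated.

#include <omp.h>

typedef struct node {
    struct node *next;
    int value;
} node_t;

/* hypothetical per-node work */
static void process(node_t *p) { (void)p; }

void traverse(node_t *head)
{
    #pragma omp parallel
    {
        #pragma omp single            /* one thread generates the tasks */
        {
            node_t *p;
            for (p = head; p != NULL; p = p->next) {
                #pragma omp task firstprivate(p)   /* each node becomes a task */
                process(p);
            }
        }   /* implicit barrier: all tasks finish before the region ends */
    }
}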

102 Task Parallelism in OpenMP
Developer: use a pragma to specify where the tasks are (the assumption is that all tasks can be executed independently).
OpenMP runtime system:
• When a thread encounters a task construct, a new task is generated.
• The moment of execution of the task is up to the runtime system.
• Execution can either be immediate or delayed.
• Completion of a task can be enforced through task synchronization.
The tasking construct: #pragma omp task defines a task.

103 Example: recursive algorithm
Program for Fibonacci numbers The Fibonacci numbers are the numbers in the following integer sequence. 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, …….. In mathematical terms, the sequence Fn of Fibonacci numbers is defined by the recurrence relation Fn = Fn-1 + Fn-2 with seed values F0 = 0 and F1 = 1.
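The slides that followed showed the tasked version of this program as images, which are not reproduced in the transcript; the sketch below is the standard task-based formulation of the same recurrence, not necessarily the exact code from the slides. Each recursive call becomes a task, and taskwait makes a parent wait for its two child tasks before combining their results (a practical version would add a cutoff below which fib runs serially).

#include <stdio.h>
#include <omp.h>

long fib(int n)
{
    long x, y;
    if (n < 2) return n;

    #pragma omp task shared(x)   /* child task computes fib(n-1) */
    x = fib(n - 1);

    #pragma omp task shared(y)   /* child task computes fib(n-2) */
    y = fib(n - 2);

    #pragma omp taskwait         /* wait for both children before combining */
    return x + y;
}

int main(void)
{
    long result;

    #pragma omp parallel
    {
        #pragma omp single       /* one thread starts the recursion */
        result = fib(30);
    }
    printf("fib(30) = %ld\n", result);
    return 0;
}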

104

105

106

107

108

109

110

111 SIMD constructs
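The SIMD slides in this section are image-only in the transcript; as a hedged illustration of the construct introduced in OpenMP 4.0, a minimal C sketch of a vectorized (and optionally threaded) loop:

#include <omp.h>

/* omp simd (OpenMP 4.0+) asks the compiler to vectorize the loop;
   'parallel for simd' combines thread-level and SIMD parallelism. */
void saxpy(int n, float a, const float *x, float *y)
{
    int i;

    #pragma omp parallel for simd
    for (i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}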

112

113

114

115

116

117

118

119 http://portais.fieb.org

120 OpenMP 4.5

121 References
1. OpenMP specifications for C/C++ and Fortran,
2. OpenMP tutorial:
3. Parallel Programming in OpenMP by Rohit Chandra, Morgan Kaufmann Publishers, ISBN
4.

