Lecture 6. OpenMP
Prof. Taeweon Suh
Computer Science Education, Korea University
COM503 Parallel Computer Architecture & Programming

Clauses

The OpenMP directives support a number of clauses, optional additions that control the behavior of the construct:
- shared()
- private()
- firstprivate()
- lastprivate()
- default()
- nowait
- schedule()

#pragma omp parallel [clause[[,] clause]...]
    structured block

Shared Clause

The shared clause is used to specify the data shared among the threads.
- In many cases, variables are shared by default in OpenMP.
An important implication is that multiple threads might attempt to simultaneously update the same memory location, or that one thread might try to read from a location that another thread is updating.
- Special care has to be taken to ensure that accesses to shared data are ordered.
- OpenMP places the synchronization responsibility on the user.
- OpenMP provides several synchronization constructs.

#pragma omp parallel for shared(a)
for (i=0; i<n; i++)
    a[i] += i;    // all threads can read from and write to array a

Data Race Condition

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 128

int main(int argc, char *argv[])
{
    int Xshared;

    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel
    {
        int Xlocal = omp_get_thread_num();
        Xshared = omp_get_thread_num();    // data race: all threads write Xshared
        if (Xlocal != Xshared)
            printf("Xlocal is %d; Xshared %d\n", Xlocal, Xshared);
    }
    return 0;
}

Data Race Condition

double compute(int n)
{
    int i;
    double h, x, sum;

    sum = 0.0;
    h = 1.0 / (double) n;    // restored initialization (h was left uninitialized on the slide)
    #pragma omp parallel for reduction(+:sum) shared(h)
    for (i=0; i <= n; i++) {
        x = h * ((double)i - 0.5);    // data race: x is shared, not private
        sum += (1.0 / (1.0 + x*x));
    }
    return(sum);
}

Private Clause

Each variable in the private list is replicated in each thread, so each thread has exclusive access to a local copy of this variable.
- Changes made to the data by one thread are not visible to other threads.
By default, the iteration variable of a parallel for is given the private attribute.
- However, it is recommended that the programmer not rely on the OpenMP default rules.

#pragma omp parallel for shared(a) private(i)
for (i=0; i<n; i++)
    a[i] += i;    // all threads can read from and write to array a

OpenMP Program in Memory

Each thread has its own stack for storing its private data.
Shared data are passed as arguments and referenced by their address in each thread.
threadprivate data can be stored either on the heap or on the local stack, depending on the implementation.

Loop Iteration Variable

int i, j;
#pragma omp parallel for
for (i=0; i <= n; i++) {
    for (j=0; j <= m; j++) {
        a[i][j] = compute(i, j);
    }
}

In C, the index variable of a parallel for is private, but this does not extend to the index variables at deeper nesting levels:
- Loop variable i is private by default, but this is not the case for j (j is shared by default).
- This results in undefined runtime behavior. A minimal fix is sketched below.
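A minimal fix, for illustration (assuming the same a, n, m, and compute as above): scope j explicitly as private, or, on OpenMP 3.0 and later, collapse the loop nest so that both index variables are privatized automatically.

int i, j;
#pragma omp parallel for private(j)    // j must be privatized explicitly
for (i=0; i <= n; i++)
    for (j=0; j <= m; j++)
        a[i][j] = compute(i, j);

// Alternatively, collapse the two loops into one iteration space;
// the iteration variables of all collapsed loops are then private:
#pragma omp parallel for collapse(2)
for (i=0; i <= n; i++)
    for (j=0; j <= m; j++)
        a[i][j] = compute(i, j);

Declaring j inside the loop body (int j;) has the same effect, since block-local variables are private.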

Example

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define M 1024
#define N 1024*1024

int main()
{
    int i, j;
    int sum, local_sum;
    int *a;

    a = (int *) malloc(N*sizeof(int));
    sum = 0;
    for (i=0; i<N; i++)
        a[i] = i;

    #pragma omp parallel private(local_sum) num_threads(4)
    {
        local_sum = 0;
        //#pragma omp for private(j)
        #pragma omp for
        for (i=0; i<M; i++)
            for (j=0; j<N; j++)    // j is shared here unless privatized
                local_sum += a[j];
        #pragma omp critical
        {
            sum += local_sum;
        }
    }
    printf("\n sum is %d\n", sum);
    free(a);
}

Private Clause

The values of private variables are undefined upon entry to and exit from the specific construct.

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4

int main(int argc, char *argv[])
{
    int n = 8;

    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel private(n)
    {
        // n is undefined here, not 8
        printf("n is %d; Thread %d\n", n, omp_get_thread_num());
    }
    return 0;
}

Firstprivate Clause

Private data is undefined on entry to the construct where it is specified.
- This can be a problem if you want to pre-initialize private variables.
OpenMP provides the firstprivate clause for such cases.
- Variables declared firstprivate are private variables.
- But they are pre-initialized with the value the variable had in the serial part of the code, just before the parallel construct.

Firstprivate Example

#include <stdio.h>
#include <omp.h>
#define n 8

int main(void)
{
    int offset, i;
    int a[n];

    for (i=0; i<n; i++)
        a[i] = 1;
    for (i=0; i<n; i++)
        printf("Before: a[%d] = %d\n", i, a[i]);

    offset = 10;
    #pragma omp parallel for firstprivate(offset)
    for (i=0; i<n; i++) {
        if (i == 0 || i == n-1)
            offset = offset + 2;
        a[i] = a[i] + offset;
    }

    for (i=0; i<n; i++)
        printf("After: a[%d] = %d\n", i, a[i]);
}

Lastprivate Clause

The lastprivate(list) clause ensures that the last value of a listed data object is accessible after the corresponding construct has completed execution.
In a parallel program, what does 'last' mean?
- In the case of a work-sharing loop, the object gets the value from the loop iteration that would be last in a sequential execution.
- If the lastprivate clause is used on a sections construct, the object gets the value that it has at the end of the lexically last section construct.

Lastprivate Example

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4

int main(int argc, char *argv[])
{
    int n = 8;
    int i, a;

    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for private(i) lastprivate(a)
    for (i=0; i<n; i++) {
        a = i + 1;
        printf("Thread %d has a value of a = %d for i = %d\n",
               omp_get_thread_num(), a, i);
    }
    printf("Value of a after parallel for: a = %d\n", a);
}

Lastprivate Clause

A performance penalty is likely to be associated with the use of lastprivate, because the OpenMP library needs to keep track of which thread executes the last iteration.
In fact, all this clause really does is provide some extra convenience; the same functionality can be implemented with an additional shared variable and some simple logic:

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4

int main(int argc, char *argv[])
{
    int n = 8;
    int i, a, a_shared;

    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for private(i, a) shared(a_shared)
    for (i=0; i<n; i++) {
        a = i + 1;
        printf("Thread %d has a value of a = %d for i = %d\n",
               omp_get_thread_num(), a, i);
        if (i == n-1)
            a_shared = a;    // capture the sequentially-last value by hand
    }
    printf("Value of a after parallel for: a = %d\n", a_shared);
}

Default Clause

The default clause is used to give variables a default data-sharing attribute.
- For example, default(shared) assigns the shared attribute to all variables referenced in the construct.
- In C/C++, the syntax is default(none) or default(shared); default(private) is not supported in C/C++.
This clause is most often used to define the data-sharing attribute of the majority of the variables in a parallel region.
- Only the exceptions need to be explicitly listed, for example:
  #pragma omp parallel default(shared) private(a,b,c)
  (note that default is a clause of the parallel construct, not of the for construct)

nowait Clause

In the work-sharing constructs, there is an implicit barrier at the end. The nowait clause overrides that feature of OpenMP.
- If it is added to a construct, the barrier at the end of the associated construct is suppressed.
- The nowait clause allows the programmer to fine-tune a program's performance.
In the following example, when a thread is finished with the work associated with the parallelized for loop, it continues and no longer waits for the other threads to finish as well:

#pragma omp for nowait
for (i=0; i<n; i++) {
    ...
}

Data Race Condition

#pragma omp parallel
{
    #pragma omp for nowait
    for (i=0; i<n; i++)
        b[i] = a[i] + a[i-1];

    #pragma omp for nowait
    for (i=0; i<n; i++)
        z[i] = sqrt(b[i]);    // may read b[i] before the first loop has written it
}

If n is not a multiple of the number of threads, even static scheduling can introduce a race condition.
OpenMP 2.5 spec:
- There are several algorithms for distributing the remaining iterations.
- There is no guarantee that the same algorithm is used for different loops.

Nested Parallelism

#pragma omp parallel shared(n,a,b)
{
    #pragma omp for
    for (i=0; i<n; i++) {
        a[i] = i + 1;
        #pragma omp parallel for    // a new, nested parallel region
        for (j=0; j<n; j++)
            b[i][j] = a[i];
    }
} /*-- End of parallel region --*/

Work-sharing directives may be nested only by providing a new parallel region, as above; a bare #pragma omp for nested directly inside the outer work-sharing loop would not be allowed.
To enable nested parallelism:
- Set the OMP_NESTED environment variable to TRUE (the default is FALSE), or
- Call the runtime library routine omp_set_nested(true or false).

schedule Clause

The schedule clause is supported on the loop construct only.
- It is used to control the manner in which loop iterations are distributed over the threads.
- Syntax: schedule(kind [, chunk_size])
There are 4 kinds of schedule (a small usage sketch follows):
- static
- dynamic
- guided
- runtime
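A small usage sketch (the loop body work(i) is a placeholder of ours):

// Hand out chunks of 4 iterations on demand:
#pragma omp parallel for schedule(dynamic, 4)
for (i = 0; i < n; i++)
    work(i);

// Defer the decision to run time (see the runtime kind below);
// the schedule is then taken from the OMP_SCHEDULE environment variable:
#pragma omp parallel for schedule(runtime)
for (i = 0; i < n; i++)
    work(i);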

static

It is the default on many OpenMP implementations.
Iterations are divided into chunks of size chunk_size.
- When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size.
- The last chunk to be assigned may have a smaller number of iterations.
The chunks are assigned to the threads statically, in order of thread number.

dynamic

The iterations are assigned to threads as the threads request them.
Each thread executes a chunk of iterations (controlled through the chunk_size parameter), then requests another chunk until there are no more chunks to work on.
- The last chunk may have fewer iterations than chunk_size.
When no chunk_size is specified, it defaults to 1.

guided

Similar to the dynamic schedule.
The difference between dynamic and guided is that the size of the chunk (of iterations) decreases over time.
- The rationale behind this scheme is that initially larger chunks are desirable because they reduce overhead.
- Load balancing is often more of an issue toward the end of the computation; the system then uses relatively small chunks to fill in the gaps in the schedule.

guided

For a chunk_size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads, decreasing to 1.
For a chunk_size of k (k>1), the size of each chunk is determined in the same way, with the restriction that the chunks do not contain fewer than k iterations (with a possible exception for the last chunk to be assigned).
When no chunk_size is specified, it defaults to 1.
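For concreteness, one plausible trace of this rule (exact chunk sizes are implementation-dependent): with 100 iterations, 4 threads, and chunk_size 1, computing each chunk as ceil(remaining/4) gives 25, 19, 14, 11, 8, 6, 5, 3, 3, 2, 1, 1, 1, 1, which sums to 100.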

runtime

It is not always easy to select the appropriate schedule and value for chunk_size up front.
- The choice may depend not only on the code in the loop, but also on the specific problem size and the number of threads used.
The runtime clause is convenient:
- The decision regarding the scheduling kind is made at run time.
- Instead of making a compile-time decision, the OMP_SCHEDULE environment variable can be used to choose the schedule and (optional) chunk_size at run time, e.g.:
  export OMP_SCHEDULE="GUIDED,4"
  export OMP_SCHEDULE="DYNAMIC"

Graphical Illustration of Schedules

(figure: how the static, dynamic, and guided kinds distribute loop iterations over threads)

Synchronization Constructs

OpenMP provides constructs that help to organize accesses to shared data by multiple threads.
- These can be used when the implicit barrier provided with the work-sharing constructs does not suffice to specify the required interactions, or would be inefficient.
Synchronization constructs:
- barrier
- ordered
- critical
- atomic
- lock
- master

barrier Construct

Many OpenMP constructs imply a barrier.
- The compiler automatically inserts a barrier at the end of the construct.
- All threads wait there until all of the work associated with the construct has been completed.
- Thus, it is often unnecessary for the programmer to explicitly add a barrier to the code.
In case one is required, OpenMP provides a barrier construct:
- #pragma omp barrier
The most common use of a barrier is to avoid a race condition.
- Inserting a barrier between the writes to and reads from a shared variable guarantees that the accesses are appropriately ordered.

barrier Example

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define NUM_THREADS 4

int main(int argc, char *argv[])
{
    int TID;

    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel private(TID)
    {
        TID = omp_get_thread_num();
        if (TID < omp_get_num_threads()/2)
            system("sleep 3");        // delay half of the team
        printf("Before: %d\n", omp_get_thread_num());
        system("date");
        #pragma omp barrier
        printf("After: %d\n", omp_get_thread_num());
        system("date");
    }
    return 0;
}

Illegal Use of barrier

#pragma omp parallel
{
    if (omp_get_thread_num() == 0) {
        .......
        #pragma omp barrier
    } else {
        .......
        #pragma omp barrier
    }
}

Each barrier region must be encountered by all threads in a team or by none at all.
Otherwise, the program deadlocks: the two barriers above are distinct barrier regions, and each is reached by only part of the team.

barrier Implementation

A straightforward way to implement a barrier is to have a shared counter that is initialized to the number of threads in a team.
When a thread reaches the barrier, it decrements the counter atomically and waits until the counter reaches 0 (a sketch of this idea follows the example below).

#pragma omp parallel private(TID)
{
    TID = omp_get_thread_num();
    printf("Before: %d\n", omp_get_thread_num());
    system("date");
    #pragma omp barrier
    printf("After: %d\n", omp_get_thread_num());
    system("date");
}
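A minimal sketch of that counting-barrier idea (our own illustration, not actual runtime code; counting_barrier is a hypothetical helper). This simple version is single-use and spins on the CPU; real implementations are reusable and block more efficiently.

#include <stdio.h>
#include <omp.h>

int counter;    // shared arrival counter, set to the team size before use

void counting_barrier(void)
{
    int c;

    #pragma omp atomic
    counter--;                  // announce arrival atomically

    do {                        // spin until every thread has arrived
        #pragma omp atomic read
        c = counter;            // atomic read (OpenMP 3.1 and later)
    } while (c > 0);
}

int main(void)
{
    counter = 4;                // must match the team size
    #pragma omp parallel num_threads(4)
    {
        printf("Before: %d\n", omp_get_thread_num());
        counting_barrier();
        printf("After: %d\n", omp_get_thread_num());
    }
    return 0;
}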

ordered Construct

The ordered construct allows a structured block within a parallel loop to be executed in sequential order.
- #pragma omp ordered
- An ordered clause has to be added to the loop directive in whose region this construct appears; it informs the compiler that the construct occurs.
- It is, for example, used to enforce an ordering on the printing of data computed by different threads.
Note that the ordered clause and construct come with a performance penalty.
- The OpenMP implementation needs to perform additional book-keeping tasks to keep track of the order in which threads should execute the corresponding region.
- Moreover, if threads finish out of order, there may be an additional performance penalty because some threads might have to wait.

ordered Example

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4

int main(int argc, char *argv[])
{
    int n = 9;
    int i, TID, a[10];

    omp_set_num_threads(NUM_THREADS);
    /* the "ordered" clause on the loop directive: */
    #pragma omp parallel for default(none) ordered schedule(runtime) \
            private(i, TID) shared(n, a)
    for (i=0; i<n; i++) {
        TID = omp_get_thread_num();
        printf("Thread %d updates a[%d]\n", TID, i);
        a[i] = i;
        /* the "ordered" construct inside the loop: */
        #pragma omp ordered
        {
            printf("Thread %d prints value of a[%d] = %d\n", TID, i, a[i]);
        }
    }
    return 0;
}

critical Construct

The critical construct provides a means to ensure that multiple threads do not attempt to update the same shared data simultaneously.
- #pragma omp critical [(name)]
- The associated code is referred to as a critical region (or a critical section).
- An optional name can be given to a critical construct.

Critical Example

#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int n = 9;
    int i, TID;
    int sum, sumlocal;

    sum = 0;
    #pragma omp parallel private(i, TID, sumlocal) shared(n, sum)
    {
        TID = omp_get_thread_num();
        sumlocal = 0;
        #pragma omp for
        for (i=0; i<n; i++)
            sumlocal += i;
        #pragma omp critical
        {
            sum += sumlocal;
            printf("TID=%d: sumlocal = %d, sum = %d\n", TID, sumlocal, sum);
        }
    }
    printf("Value of sum after parallel region: %d\n", sum);
}

atomic Construct

Similar to the critical construct, but it applies only to the (single) assignment statement that immediately follows it.
Examples:
  #pragma omp atomic
  ic += 1;

  #pragma omp atomic
  ic += bigfunc();
- The atomic construct does not prevent multiple threads from executing the function bigfunc() at the same time. It is only the update to the memory location of the variable ic that occurs atomically.

Locks

The OpenMP API provides a set of low-level, general-purpose locking runtime library routines.
- These routines provide greater flexibility for synchronization than the critical or atomic constructs do.
- Syntax: void omp_func_lock(omp_lock_t *lck)
  func: init, destroy, set, unset, test (omp_test_lock returns int rather than void)
These routines operate on special-purpose lock variables, which should be accessed via the lock routines only.
- 2 types of locks: simple locks and nestable locks.
- Simple lock variables are declared with omp_lock_t.
- Nestable lock variables are declared with omp_nest_lock_t.

Locks

The general procedure for using locks:
- Define the (simple or nested) lock variables.
- Initialize the lock: omp_init_lock()
- Set the lock: omp_set_lock(), omp_test_lock()
  omp_test_lock() checks whether the lock is actually available before attempting to set it.
- Unset the lock: omp_unset_lock()
- Remove the lock: omp_destroy_lock()
Special care has to be taken when the programmer synchronizes the actions of threads using these routines.
- If these routines are used improperly, a number of programming errors are possible (for example, the code may deadlock).

lock Example

#include <stdio.h>
#include <omp.h>

omp_lock_t my_lock;

int main()
{
    omp_init_lock(&my_lock);

    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();
        int i;

        for (i = 0; i < 2; ++i) {
            omp_set_lock(&my_lock);
            printf("Thread %d - starting locked region\n", tid);
            printf("Thread %d - ending locked region\n", tid);
            omp_unset_lock(&my_lock);
        }
    }

    omp_destroy_lock(&my_lock);
}

lock Example

#include <stdio.h>
#include <omp.h>

omp_lock_t simple_lock;

int main()
{
    omp_init_lock(&simple_lock);

    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();

        while (!omp_test_lock(&simple_lock))
            printf("Thread %d - failed to acquire simple_lock\n", tid);
        printf("Thread %d - acquired simple_lock\n", tid);

        printf("Thread %d - released simple_lock\n", tid);
        omp_unset_lock(&simple_lock);
    }

    omp_destroy_lock(&simple_lock);
}

master Construct

The master construct defines a block of code that is guaranteed to be executed by the master thread only.

#pragma omp master
    structured block

- It is similar to the single construct, but it does not have an implied barrier on entry or exit.
- The lack of a barrier may lead to problems.
- If the master construct is used to initialize data, for example, care should be taken that this initialization is completed before the other threads in the team use the data.
- The typical solution is either to rely on an implied barrier further down the execution stream or to use an explicit barrier construct.

Incorrect Use of master

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4

int main(int argc, char *argv[])
{
    int Xinit, Xlocal;

    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel private(Xlocal)
    {
        #pragma omp master
        {
            Xinit = 10;
        }
        // no barrier here: other threads may read Xinit before it is set
        Xlocal = Xinit;
        printf("TID %d, Xlocal = %d\n", omp_get_thread_num(), Xlocal);
    }
    return 0;
}

if Clause

The if clause is supported on the parallel construct only, where it is used to specify conditional execution.
Since some overhead is inevitably incurred with the creation and termination of a parallel region, it is sometimes useful to test whether there is enough work in the region to warrant its parallelization.
Syntax: if (scalar-logical-expression)
- If the logical expression evaluates to true, the parallel region is executed by a team of threads.
- If it evaluates to false, the region is executed by a single thread only.
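A small sketch (the threshold and names are ours): parallelize the loop only when the trip count is large enough to amortize the fork/join overhead.

// Runs in parallel only if n exceeds the threshold; otherwise one thread executes it:
#pragma omp parallel for if (n > 100000)
for (i = 0; i < n; i++)
    a[i] = a[i] + 1;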

reduction Clause

In the critical example earlier, we used the critical construct to parallelize the summation operation.
There is a much easier way: OpenMP provides the reduction clause for specifying some forms of recurrence calculations, so that they can be performed in parallel without code modification.
- The programmer must identify the operations and the variables that will hold the results; the rest of the work is then left to the compiler.
- The result variable is shared, and it is not necessary to specify it explicitly as shared.
- Syntax: reduction(operator: list)

Reduction Example

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4

int main(int argc, char *argv[])
{
    int n = 10;
    int i;
    int sum;

    omp_set_num_threads(NUM_THREADS);
    sum = 0;
    #pragma omp parallel for private(i) shared(n) reduction(+:sum)
    for (i=0; i<=n; i++)
        sum += i;
    printf("Value of sum after parallel region: %d\n", sum);
}

Supported Operators for reduction

Operator              Initial Value
+                     0
*                     1
-                     0
& (bitwise AND)       ~0
| (bitwise OR)        0
^ (bitwise XOR)       0
&& (logical AND)      1
|| (logical OR)       0

Reduction Example

#include <stdio.h>
#include <omp.h>

int main(int argc, char *argv[])
{
    int n = 8;
    int i;
    float multout, f[n];

    multout = 1.0;
    for (i=0; i<n; i++)    // i<n (the slide's i<=n would write past the end of f)
        f[i] = ((float) i) * 0.1;
    #pragma omp parallel for private(i) shared(n) reduction(*:multout)
    for (i=0; i<n; i++)
        multout *= f[i];
    printf("Multiplication output = %.16f\n", multout);
}

reduction Clause

The order in which thread-specific values are combined is unspecified.
- For floating-point operations, there may therefore be numerical differences between the results of a sequential and a parallel run, or even between two parallel runs using the same number of threads.
- This is a result of the limited precision with which computers represent floating-point numbers: results may vary slightly, depending on the order in which operations are performed.
- It is not a cause for concern if the values are all of roughly the same magnitude.
Keep this in mind when using the reduction clause.
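A concrete illustration of this order dependence (our own example): in single-precision arithmetic,

(1.0e8f + 1.0f) - 1.0e8f    /* evaluates to 0.0f: 1.0f is lost when added to 1.0e8f */
(1.0e8f - 1.0e8f) + 1.0f    /* evaluates to 1.0f */

so two reduction orders over the same three values can give different results.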

flush Directive

The OpenMP standard specifies that all modifications are written back to main memory, and are thus available to all threads, at synchronization points in the program.
Between these synchronization points, threads are permitted to keep new values of shared variables in their local memory rather than in the global shared memory.
Sometimes updated values of shared variables must become visible to other threads in between synchronization points. For this, the OpenMP API provides the flush directive:
- #pragma omp flush [(list)]
- The flush operation applies to all variables specified in the list.
- If no list is provided, it applies to all thread-visible shared data.

flush Directive

If the flush operation is invoked by a thread that has updated the variables, their new values are flushed to memory and are therefore accessible to all other threads.
If the operation is invoked by a thread that has not updated a value, it ensures that any local copies of the data are replaced by the latest value from main memory.
Implicit flush operations with no list occur at the following locations:
- All explicit and implicit barriers (e.g., at the end of a parallel region or work-sharing construct)
- Entry to and exit from critical regions
- Entry to and exit from lock routines

flush Example

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4

int main(int argc, char *argv[])
{
    int new_data, local_data, TID, signal;

    omp_set_num_threads(NUM_THREADS);
    signal = 0;
    new_data = 0;
    #pragma omp parallel default(none) shared(signal, new_data) private(TID, local_data)
    {
        TID = omp_get_thread_num();
        if (TID == 3) {
            new_data = 10;
            #pragma omp flush(new_data)    // make new_data visible before raising the signal
            signal = 1;
            #pragma omp flush(signal)
        } else {
            while (signal == 0) {
                #pragma omp flush(signal)  // re-read signal from memory each iteration
            }
            #pragma omp flush(new_data)
            local_data = new_data;
            printf("Thread %d has %d\n", TID, local_data);
        }
    }
    return 0;
}

threadprivate Directive

Global data is shared by default.
In some situations, we may need or prefer to have private data that persists throughout the computation.
- This is where the threadprivate directive comes in handy:
  #pragma omp threadprivate (list)
The effect of the threadprivate directive is that the named global-lifetime objects are replicated, so that each thread has its own copy.
- Each thread gets a private or local copy of the specified global variables.

Serial Code

#include <stdio.h>
#include <stdlib.h>

int calculate_sum(int length);

int *pglobal;

int main()
{
    int i, j, sum, n = 5;
    int length[n];

    for (i=0; i<n; i++)
        length[i] = 10 * (i+1);

    for (i=0; i<n; i++) {
        if ( (pglobal = (int *) malloc(length[i]*sizeof(int))) != NULL ) {
            for (j=0; j<length[i]; j++)
                pglobal[j] = j+1;
            sum = calculate_sum(length[i]);
            printf("Value of sum for i = %d is %8d\n", i, sum);
            free(pglobal);
        }
    }
    return(0);
}

int calculate_sum(int length)
{
    int sum = 0;
    int j;

    for (j=0; j<length; j++)
        sum += pglobal[j];
    return(sum);
}

threadprivate Example

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define TRUE 1
#define FALSE 0

int calculate_sum(int length);

int *pglobal;
#pragma omp threadprivate(pglobal)    // each thread gets its own pglobal

int main()
{
    int i, j, sum, TID, n = 5;
    int length[n];

    for (i=0; i<n; i++)
        length[i] = 10 * (i+1);

    #pragma omp parallel for shared(n,length) private(TID,i,j,sum)
    for (i=0; i<n; i++) {
        TID = omp_get_thread_num();
        if ( (pglobal = (int *) malloc(length[i]*sizeof(int))) != NULL ) {
            for (j=0; j<length[i]; j++)
                pglobal[j] = j+1;
            sum = calculate_sum(length[i]);
            printf("TID %d: value of sum for i = %d is %8d\n", TID, i, sum);
            free(pglobal);
        }
    } /*-- End of parallel for --*/
    return(0);
}

int calculate_sum(int length)
{
    int sum = 0;
    int j;

    for (j=0; j<length; j++)
        sum += pglobal[j];
    return(sum);
}

copyin Clause

The threadprivate variables are private variables that are made private to each thread.
- Each thread has its own set of these variables.
- Just as with regular private data, the initial values are undefined.
The copyin clause provides a means to copy the value of the master thread's threadprivate variables to the corresponding threadprivate variables of the other threads.

copyin Example

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4

int x;
#pragma omp threadprivate(x)

int main(int argc, char *argv[])
{
    int tid;

    x = 33;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel private(tid) copyin(x)
    {
        tid = omp_get_thread_num();
        printf("Thread %d, x = %d\n", tid, x);
    }
    return 0;
}

copyprivate Clause

The copyprivate clause is supported on the single directive only.
- #pragma omp single copyprivate(a, b, c)
It provides a mechanism for broadcasting the value of a private variable from one thread to the other threads in the team.
The typical use is to have one thread initialize private data that is subsequently used by the other threads as well.
After the single construct has ended, but before the threads have left the associated barrier, the values of the variables specified in the associated list are copied to the other threads.

copyprivate Example

#include <stdio.h>
#include <omp.h>

float x, y, fGlobal = 1.0;
#pragma omp threadprivate(x, y)

float get_float()
{
    fGlobal += 0.1;
    return fGlobal;
}

int main()
{
    float a, b;

    #pragma omp parallel private(a, b)    // copyprivate variables must be private (or threadprivate)
    {
        #pragma omp single copyprivate(a, b, x, y)
        {
            a = get_float();
            b = get_float();
            x = get_float();
            y = get_float();
        }
        printf("Value = %f, thread = %d\n", a, omp_get_thread_num());
        printf("Value = %f, thread = %d\n", b, omp_get_thread_num());
        printf("Value = %f, thread = %d\n", x, omp_get_thread_num());
        printf("Value = %f, thread = %d\n", y, omp_get_thread_num());
    }
    return 0;
}

OpenMP Environment Variables

The OpenMP standard provides several means by which the programmer can interact with the execution environment:
- OMP_NUM_THREADS
  export OMP_NUM_THREADS=16
- OMP_DYNAMIC: controls dynamic adjustment of the number of threads that will be used to execute future parallel regions
  export OMP_DYNAMIC="TRUE"
- OMP_NESTED
  export OMP_NESTED="TRUE"
- OMP_SCHEDULE
  export OMP_SCHEDULE="GUIDED,4"
  export OMP_SCHEDULE="DYNAMIC"

Backup


Firstprivate Clause Example

Assume that each thread in a parallel region needs access to a thread-specific section of a vector.
- Access starts at a certain (nonzero) offset.

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4
#define vlen 10

int main(int argc, char *argv[])
{
    int n = 2;
    int i;
    int TID, indx;
    int a[vlen];

    omp_set_num_threads(NUM_THREADS);
    for (i=0; i<vlen; i++)
        a[i] = -1;

    indx = 2;
    #pragma omp parallel firstprivate(indx) private(i, TID) shared(n, a)
    {
        TID = omp_get_thread_num();
        indx += n*TID;          // each thread starts at its own offset
        for (i=indx; i<indx+n; i++)
            a[i] = TID + 1;
    }

    printf("After the parallel region:\n");
    for (i=0; i<vlen; i++)
        printf("a[%d] = %d\n", i, a[i]);
}

Alternative Solution

If the variable indx is not updated any further, the following simpler and better solution is preferred:

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4
#define vlen 10

int main(int argc, char *argv[])
{
    int n = 2;
    int i, offset;
    int TID, indx;
    int a[vlen];

    omp_set_num_threads(NUM_THREADS);
    for (i=0; i<vlen; i++)
        a[i] = -1;

    offset = 2;
    #pragma omp parallel private(i, TID, indx) shared(n, offset, a)
    {
        TID = omp_get_thread_num();
        indx = offset + n*TID;  // compute the thread-specific offset locally
        for (i=indx; i<indx+n; i++)
            a[i] = TID + 1;
    }

    printf("After the parallel region:\n");
    for (i=0; i<vlen; i++)
        printf("a[%d] = %d\n", i, a[i]);
}