Programming with Shared Memory Introduction to OpenMP


1 Programming with Shared Memory Introduction to OpenMP
Part 1 ITCS4145/5145, Parallel Programming B. Wilkinson Feb 11, slides 8b-1.ppt

2 OpenMP Thread-based shared memory programming model.
An accepted standard, developed in the late 1990s by a group of industry specialists. Higher-level than using thread APIs such as Pthreads. Write programs in C/C++ (or Fortran!) and insert OpenMP compiler directives to specify parallelism. OpenMP also has a few supporting library routines and environment variables. Several compilers can compile OpenMP programs, including recent Linux C compilers (gcc).
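With gcc, OpenMP support is enabled with the -fopenmp flag (the file names here are illustrative):

$ gcc -fopenmp hello.c -o hello
$ ./hello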

3 OpenMP thread model

Initially a single thread, the master thread, executes. A parallel directive creates a team of threads, and the subsequent block of code is executed by the multiple threads in parallel. The exact number of threads is determined in one of several ways (see later). Other directives within a parallel construct specify parallel for loops and different blocks of code for threads. Code outside a parallel region is executed by the master thread only.

[Diagram: the master thread runs alone until each parallel region, where multiple threads execute; synchronization at the end of each parallel region returns control to the master thread only.]

4 OpenMP Parallel Directive
The C “pragma” directive mechanism instructs the compiler to use OpenMP features. All OpenMP directives start with #pragma omp:

#pragma omp parallel
structured_block

structured_block is a single statement, or a compound statement created with { ... }, with a single entry point and a single exit point. The parallel directive creates multiple threads, each one executing the specified structured_block. Implicit barrier at end of construct.

5 Number of threads in a team
Established in one of three ways, in the order given below (or system dependent if none of these is used):

1. A num_threads clause after the parallel directive, e.g.
   #pragma omp parallel num_threads(5)

2. The omp_set_num_threads() library routine being previously called, e.g.
   omp_set_num_threads(6);

3. The environment variable OMP_NUM_THREADS being defined, e.g.
   $ export OMP_NUM_THREADS=8
   $ ./hello

The number of threads available can be altered dynamically to achieve the best use of system resources.
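A minimal sketch combining the first two methods (the printf is illustrative; the num_threads clause takes precedence over the earlier call):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(6);               /* request 6 threads for later parallel regions */

    #pragma omp parallel num_threads(5)   /* clause overrides the call: this team has 5 threads */
    {
        printf("Team size: %d\n", omp_get_num_threads());
    }
    return 0;
}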

6 Finding number of threads and thread ID during program execution
omp_get_num_threads() – returns the total number of threads in the team.
omp_get_thread_num() – returns the thread number (ID), an integer from 0 to omp_get_num_threads() - 1, where thread 0 is the master thread.
The names of these two functions are similar and easy to confuse.

7 Hello world example

VERY IMPORTANT: the opening brace must be on a new line (tabs, spaces ok).

#pragma omp parallel
{
    printf("Hello World from thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());
}

Output with 8 threads:

Hello World from thread 0 of 8
Hello World from thread 4 of 8
Hello World from thread 3 of 8
Hello World from thread 2 of 8
Hello World from thread 7 of 8
Hello World from thread 1 of 8
Hello World from thread 6 of 8
Hello World from thread 5 of 8
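A complete, compilable version of this example (a minimal sketch; the two headers are what the fragment needs):

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        printf("Hello World from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}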

8 Global “shared” variables/data
Any variable declared outside a parallel construct is accessible by all threads unless otherwise specified:

int main(int argc, char *argv[]) {
    int x;          // accessible by all threads
    #pragma omp parallel
    {
        ...         // each thread sees the same x
    }
}

9 Private variables Separate copies of variables for each thread.
Can be declared within each parallel region, but OpenMP provides the private clause:

int tid;
#pragma omp parallel private(tid)
{
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);
}

Each thread has a local variable tid. There is also a shared clause available for shared variables.

10 Another example of shared and private data
int main(int argc, char *argv[]) {
    int x;
    int tid;
    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        if (tid == 0) x = 42;
        printf("Thread %d, x = %d\n", tid, x);
    }
}

x is shared by all threads. tid is private – each thread has its own copy. Variables declared outside the parallel construct are shared unless otherwise specified.

11 Output

$ ./data
Thread 3, x = 0
Thread 2, x = 0
Thread 1, x = 0
Thread 0, x = 42
Thread 4, x = 42
Thread 5, x = 42
Thread 6, x = 42
Thread 7, x = 42

tid has a separate value for each thread. Why does x change? Because x is shared: threads that print before thread 0 assigns x = 42 see the earlier value (0 here), while threads that print afterwards see 42 – a data race.

12 Another Example Shared versus Private
int a[100];
int tid, n;        /* declarations for the private variables */

#pragma omp parallel private(tid, n)
{
    tid = omp_get_thread_num();
    n = omp_get_num_threads();
    a[tid] = 10*n;
}

OR, with an optional explicit shared clause:

#pragma omp parallel private(tid, n) shared(a)
...

a[] is shared; tid and n are private.

13 Variations of private variables
private clause – creates private copies of variables for each thread.
firstprivate clause – as private clause, but initializes each copy to the value the variable had immediately prior to the parallel construct.
lastprivate clause – as private, but “the value of each lastprivate variable from the sequentially last iteration of the associated loop, or the lexically last section directive, is assigned to the variable’s original object.”
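A minimal sketch of the firstprivate and lastprivate semantics on a loop (the variable names are illustrative):

int x = 10;
#pragma omp parallel for firstprivate(x) lastprivate(x)
for (int i = 0; i < 100; i++) {
    /* firstprivate: each thread's private x starts at 10 */
    x = i;
}
/* lastprivate: x now holds the value from the sequentially
   last iteration, i.e. x == 99 */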

14 Specifying work inside a parallel region
Work-Sharing. Four constructs in this classification:

sections (with section)
for
single
master

In all cases there is an implicit barrier at the end of the construct unless a nowait clause is included, which overrides the barrier. Note: these constructs do not start a new team of threads. That is done by an enclosing parallel construct.

15 Sections

The construct:

#pragma omp parallel           /* enclosing parallel directive */
{
    #pragma omp sections
    {
        #pragma omp section    /* blocks executed by available threads */
            structured_block
        #pragma omp section
            structured_block
    }
}

causes the structured blocks to be shared among the threads in the team. The first section directive is optional.

16 Example

#pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
{
    tid = omp_get_thread_num();
    #pragma omp sections nowait
    {
        #pragma omp section    /* one thread does this */
        {
            printf("Thread %d doing section 1\n", tid);
            for (i = 0; i < N; i++) {
                c[i] = a[i] + b[i];
                printf("Thread %d: c[%d]= %f\n", tid, i, c[i]);
            }
        }
        #pragma omp section    /* another thread does this */
        {
            printf("Thread %d doing section 2\n", tid);
            for (i = 0; i < N; i++) {
                d[i] = a[i] * b[i];
                printf("Thread %d: d[%d]= %f\n", tid, i, d[i]);
            }
        }
    } /* end of sections */
} /* end of parallel section */

17 Another sections example
#pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
{
    tid = omp_get_thread_num();
    #pragma omp sections nowait    /* threads do not wait after finishing a section */
    {
        #pragma omp section        /* one thread does this */
        {
            printf("Thread %d doing section 1\n", tid);
            for (i = 0; i < N; i++) {
                c[i] = a[i] + b[i];
                printf("Thread %d: c[%d]=%f\n", tid, i, c[i]);
            }
        }

18 Sections example continued
        #pragma omp section        /* another thread does this */
        {
            printf("Thread %d doing section 2\n", tid);
            for (i = 0; i < N; i++) {
                d[i] = a[i] * b[i];
                printf("Thread %d: d[%d]= %f\n", tid, i, d[i]);
            }
        }
    } /* end of sections */
    printf("Thread %d done\n", tid);
} /* end of parallel section */

19 Output

Thread 0 doing section 1
Thread 0: c[0]=
Thread 0: c[1]=
Thread 0: c[2]=
Thread 0: c[3]=
Thread 0: c[4]=
Thread 3 done
Thread 2 done
Thread 1 doing section 2
Thread 1: d[0]=
Thread 1: d[1]=
Thread 1: d[2]=
Thread 1: d[3]=
Thread 0 done
Thread 1: d[4]=
Thread 1 done

Threads do not wait (i.e. no barrier).

20 Output with the nowait clause removed

Thread 0 doing section 1
Thread 0: c[0]=
Thread 0: c[1]=
Thread 0: c[2]=
Thread 0: c[3]=
Thread 0: c[4]=
Thread 3 doing section 2
Thread 3: d[0]=
Thread 3: d[1]=
Thread 3: d[2]=
Thread 3: d[3]=
Thread 3: d[4]=
Thread 3 done
Thread 1 done
Thread 2 done
Thread 0 done

Barrier here: with the nowait removed, there is a barrier at the end of the sections construct. Threads wait until they are all done with the sections.

21 Combining parallel and section constructs
If a parallel directive is followed by a single sections directive, they can be combined into:

#pragma omp parallel sections
{
    #pragma omp section
        structured_block
}

with similar effect. (However, a nowait clause is not allowed.)
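A minimal sketch of the combined form, computing two independent results concurrently (the arrays and N follow the earlier example slides):

#pragma omp parallel sections shared(a,b,c,d) private(i)
{
    #pragma omp section
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];     /* one thread computes the sums */
    #pragma omp section
    for (i = 0; i < N; i++)
        d[i] = a[i] * b[i];     /* another thread computes the products */
}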

22 Parallel For Loop

#pragma omp parallel           /* enclosing parallel region */
{
    #pragma omp for            /* must have a new line here */
    for (i = 0; i < n; i++) {
        ...                    /* for loop body */
    }
}

causes the iterations of the for loop to be shared among the threads in the team and executed in parallel (given sufficient computing resources). Must be a “for” loop of a simple C form such as for (i = 0; i < n; i++), where the lower bound and upper bound are constants. Note: each iteration of the for loop must be independent of the other iterations executed in parallel. (Later in the course: how we determine this formally.)

23 Example

#pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
{
    tid = omp_get_thread_num();
    if (tid == 0) {                /* executed by one thread */
        nthreads = omp_get_num_threads();
        printf("Number of threads = %d\n", nthreads);
    }
    printf("Thread %d starting...\n", tid);
    #pragma omp for                /* the for loop */
    for (i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
        printf("Thread %d: c[%d]= %f\n", tid, i, c[i]);
    }                              /* without "nowait", threads wait after finishing the loop */
} /* end of parallel section */

24 Scheduling a Parallel For
By default, a parallel for is scheduled by mapping blocks (or chunks) of iterations to the available threads (static mapping):

Thread 1 starting...
Thread 1: i = 2, c[1] =
Thread 1: i = 3, c[1] =
Thread 2 starting...
Thread 2: i = 4, c[2] =
Thread 3 starting...
Number of threads = 4
Thread 0 starting...
Thread 0: i = 0, c[0] =
Thread 0: i = 1, c[0] =

Default chunk size. In this example, mapping =
Barrier here.

25 Combined parallel and for constructs
If a parallel directive is followed by a single for directive, they can be combined into:

#pragma omp parallel for
<for loop>

which declares the parallel region and the parallel for together. Example:

#pragma omp parallel for shared(a,b,c,nthreads) private(i,tid)
for (i = 0; i < N; i++) {
    tid = omp_get_thread_num();    /* tid is private; set it inside the loop body */
    c[i] = a[i] + b[i];
    printf("Thread %d: c[%d]= %f\n", tid, i, c[i]);
}

26 Loop Scheduling and Partitioning
OpenMP offers scheduling clauses to add to the for construct:

1. Static
#pragma omp parallel for schedule (static, chunk_size)
Partitions loop iterations into equal-sized chunks specified by chunk_size. Chunks are assigned to threads in round-robin fashion.

2. Dynamic
#pragma omp parallel for schedule (dynamic, chunk_size)
Uses an internal work queue. A chunk-sized block of the loop is assigned to each thread as it becomes available.

27 3. Guided
#pragma omp parallel for schedule (guided, chunk_size)
Similar to dynamic, but the chunk size starts large and gets smaller to reduce the time threads have to go to the work queue:

chunk size = (number of iterations remaining) / (2 * number of threads)

4. Runtime
#pragma omp parallel for schedule (runtime)
Uses the OMP_SCHEDULE environment variable to specify which of static, dynamic or guided should be used.
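A minimal sketch of a schedule clause on a loop with uneven iteration costs (work() is a hypothetical function standing in for the loop body):

#pragma omp parallel for schedule(dynamic, 4)
for (i = 0; i < N; i++) {
    work(i);    /* iterations of varying cost; chunks of 4 are handed out on demand */
}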

28 Question
Guided scheduling is similar to dynamic except that the chunk sizes start large and get smaller. What is the advantage of using Guided versus Static?
Answer: Guided improves load balance – threads that finish their chunks early pick up more of the remaining work.

29 Reduction

A reduction applies a commutative operation across a collection of values, creating a single value (similar to MPI_Reduce):

sum = 0;
#pragma omp parallel for reduction(+:sum)   /* operation: +, variable: sum */
for (k = 0; k < 100; k++) {
    sum = sum + funct(k);
}

A private copy of sum is created for each thread by the compiler. Each private copy is added to sum at the end. This eliminates the need for critical sections here.
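A complete, compilable sketch of this reduction (funct() here is a stand-in defined only for illustration):

#include <stdio.h>
#include <omp.h>

int funct(int k) { return k * k; }       /* illustrative work function */

int main(void) {
    int k, sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (k = 0; k < 100; k++) {
        sum = sum + funct(k);            /* each thread accumulates into its private copy */
    }
    printf("sum = %d\n", sum);           /* private copies combined: prints sum = 328350 */
    return 0;
}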

30 Single

The directive:

#pragma omp parallel
{
    #pragma omp single      /* must have a new line here */
    structured_block
}

causes the structured block to be executed by one thread only (no guarantee of which one). Implied synchronization barrier at the end of the structured_block unless a nowait clause is specified.

31 Single Example

#pragma omp parallel private(tid)
{
    tid = omp_get_thread_num();
    printf("Thread %d starting...\n", tid);
    #pragma omp single
    {
        printf("Thread %d doing work\n", tid);
        ...
    } /* end of single */
    printf("Thread %d done\n", tid);
} /* end of parallel section */

32 Single Results

Thread 0 starting...
Thread 0 doing work
Thread 3 starting...
Thread 2 starting...
Thread 1 starting...
Thread 0 done
Thread 1 done
Thread 2 done
Thread 3 done

Only one thread executes the section. Barrier here: “nowait” was NOT specified, so the threads wait for the one thread to finish.

33 Master

The master directive:

#pragma omp parallel
{
    #pragma omp master
    structured_block        /* only one thread (the master) executes this */
}

causes only the master thread to execute the structured block. Differs from the others in the work-sharing group in that there is no implied barrier at the end of the construct (nor at the beginning), and "nowait" cannot be specified. Other threads encountering the master directive will ignore it and the associated structured block, and will move on. Threads will NOT wait after this block.

34 Master Example

#pragma omp parallel private(tid)
{
    tid = omp_get_thread_num();
    printf("Thread %d starting...\n", tid);
    #pragma omp master
    {
        printf("Thread %d doing work\n", tid);
        ...
    } /* end of master */
    printf("Thread %d done\n", tid);
} /* end of parallel section */

35 Is there any difference between these two approaches:
Master directive:

#pragma omp parallel
{
    ...
    #pragma omp master
    structured_block
}

Using an if statement:

#pragma omp parallel private(tid)
{
    ...
    tid = omp_get_thread_num();
    if (tid == 0)
        structured_block
}

36 Critical Sections

The critical directive allows only one thread to execute the associated structured block at a time:

#pragma omp parallel
{
    ...
    #pragma omp critical (name)
    structured_block
}

A thread waits until no other thread is executing the same critical section (one with the same name). The name is optional; all critical sections with no name map to one undefined name.

37 Critical Example

#pragma omp parallel private(tid)
{
    tid = omp_get_thread_num();
    printf("Thread %d starting...\n", tid);
    #pragma omp critical (myCS)
    {
        printf("Thread %d in critical section\n", tid);
        sleep(1);
        printf("Thread %d finishing critical section\n", tid);
    } /* end of critical */
    printf("Thread %d done\n", tid);
} /* end of parallel section */

38 Critical Results

Thread 0 starting...
Thread 0 in critical section
Thread 3 starting...
Thread 1 starting...
Thread 2 starting...
Thread 0 finishing critical section
Thread 0 done
Thread 3 in critical section
Thread 3 finishing critical section
Thread 3 done
Thread 2 in critical section
Thread 2 finishing critical section
Thread 2 done
Thread 1 in critical section
Thread 1 finishing critical section
Thread 1 done

39 Atomic

The atomic directive implements a critical section efficiently when the critical section simply updates a variable (adds one, subtracts one, or does some other simple arithmetic operation as defined by expression_statement):

#pragma omp parallel
{
    ...
    #pragma omp atomic
    expression_statement
}

Must be a simple statement of the form:
x = expression
x += expression
x -= expression
x++;
x--;
...
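A minimal sketch of an atomic update (counter is an illustrative shared variable):

int counter = 0;
#pragma omp parallel
{
    #pragma omp atomic
    counter++;                          /* increments happen atomically; no lost updates */
}
printf("counter = %d\n", counter);      /* equals the number of threads in the team */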

40 Barrier

#pragma omp parallel
{
    ...
    #pragma omp barrier
    ...
}

When a thread reaches the barrier, it waits until all threads have reached the barrier, and then they all proceed together. There are restrictions on the placement of the barrier directive in a program. In particular, all threads must be able to reach the barrier (i.e. be careful about placing a barrier inside an if statement where some threads may not execute it).

41 Barrier Example

#pragma omp parallel private(tid)
{
    tid = omp_get_thread_num();
    printf("Thread %d starting...\n", tid);
    #pragma omp single nowait        /* no barrier at end of "single" block */
    {
        printf("Thread %d busy doing work ... \n", tid);
        sleep(10);
    }
    printf("Thread %d reached barrier\n", tid);   /* threads do NOT wait before this */
    #pragma omp barrier                           /* threads wait here */
    printf("Thread %d done\n", tid);
} /* end of parallel section */

42 Barrier Results

Thread 3 starting...
Thread 0 starting...
Thread 0 reached barrier
Thread 2 starting...
Thread 2 reached barrier
Thread 1 starting...
Thread 1 reached barrier
Thread 3 busy doing work ...
(Thread 3 sleeping for 10 seconds – a 10 second delay)
Thread 3 reached barrier
Thread 3 done
Thread 0 done
Thread 2 done
Thread 1 done

43 Flush

A synchronization point which causes a thread to have a “consistent” view of certain (or all) shared variables in memory: all current read and write operations on those variables are allowed to complete and values are written back to memory, but any memory operations in code after the flush are not started. Format:

#pragma omp flush (variable_list)

Applies only to the thread executing the flush, not to all threads in the team. A flush occurs automatically at the entry and exit of parallel and critical directives, and at the exit of for, sections, and single (if a nowait clause is not present).
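A minimal sketch of flush used for a producer/consumer handshake between two threads, following the classic pattern from the OpenMP specification examples (data and flag are illustrative shared variables):

int data = 0, flag = 0;
#pragma omp parallel num_threads(2)
{
    if (omp_get_thread_num() == 0) {    /* producer */
        data = 42;
        #pragma omp flush(data)         /* make data visible before raising the flag */
        flag = 1;
        #pragma omp flush(flag)
    } else {                            /* consumer */
        do {
            #pragma omp flush(flag)     /* re-read flag from memory each iteration */
        } while (flag == 0);
        #pragma omp flush(data)         /* get a fresh view of data */
        printf("data = %d\n", data);    /* prints data = 42 */
    }
}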

44 More information

45 Questions

