Parallel Computing with OpenMP


1 Parallel Computing with OpenMP
Dan Negrut Simulation-Based Engineering Lab Wisconsin Applied Computing Center Department of Mechanical Engineering Department of Electrical and Computer Engineering University of Wisconsin-Madison © Dan Negrut, 2012 UW-Madison Milano December 10-14, 2012

2 Acknowledgements A large number of the slides used for discussing OpenMP are from Intel's library of presentations promoting OpenMP. Slides are used herein with permission. Credit is given where due with the marker IOMPP, which stands for "Intel OpenMP Presentation".

3 Data vs. Task Parallelism
Data parallelism: You have a large number of data elements, and each data element (or possibly a subset of elements) needs to be processed to produce a result. When this processing can be done in parallel, we have data parallelism. Example: adding two long arrays of doubles to produce yet another array of doubles. Task parallelism: You have a collection of tasks that need to be completed. If these tasks can be performed in parallel, you are faced with a task-parallel job. Examples: reading the newspaper, drinking coffee, and scratching your back; or the breathing of your lungs, the beating of your heart, liver function, the control of swallowing, etc.
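To make the data-parallel example above concrete, here is a minimal sketch (not from the slides; the array names and the function wrapper are illustrative) of adding two long arrays of doubles with OpenMP. The same pattern reappears later on the "omp for" slides.

#include <stdlib.h>

/* Minimal sketch: element-wise addition of two long arrays of doubles.
   Each iteration is independent of all the others, which is exactly what
   makes this a data-parallel job. Names (a, b, c, n) are illustrative. */
void add_arrays(const double *a, const double *b, double *c, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];   /* no iteration reads data written by another */
}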

4 Objectives Understand OpenMP at the level where you can
Implement data parallelism; implement task parallelism. The goal is not to provide a full overview of OpenMP. Script: The objective of the course is to prepare students and faculty to use OpenMP to parallelize C, C++, and Fortran applications using either task or data parallelism. The course is expected to take 2 hours for the lecture, plus up to about an hour of additional time for the labs or demos.

5 Work Plan What is OpenMP? Parallel regions Work sharing Data environment Synchronization Advanced topics
Script: The agenda for the course is shown above. We will be covering the rudiments of what OpenMP is, how its clauses are constructed, how it can be controlled with environment variables, etc. We will also look at what parallel regions are and how they are comprised of structured blocks of code. Then we will spend the most time discussing work sharing and the three principal ways work sharing is accomplished in OpenMP: omp for, omp parallel sections, and omp tasks. We will also spend some time to understand the data environment and the ramifications, in several code examples, of different variables being either private or shared. We will look at several key synchronization constructs such as critical sections and the single, master, and atomic clauses. Then we will look at Intel's suggested placement of material within a university curriculum and lay out what topics we think are important to discuss. Finally we will wrap up with some optional material that covers the API, a Monte Carlo pi example, and other helpful clauses. [IOMPP]→

6 OpenMP: Target Hardware
CUDA: targets parallelism on the GPU. MPI: targets parallelism on a cluster (distributed computing). Note that an MPI implementation can transparently handle an SMP architecture, such as a workstation with two hex-core CPUs that draw on a good amount of shared memory. OpenMP: targets parallelism on SMP architectures. Handy when you have a machine that has 64 cores and a large amount of shared memory, say 128 GB.

7 OpenMP: What’s Reasonable to Expect
If you have 64 cores available to you, it is *highly* unlikely that you will get a speedup of more than 64 (superlinear). Recall the trick that helped the GPU hide latency: overcommitting the SPs and hiding memory access latency with warp execution. This mechanism of hiding latency by overcommitment does not *explicitly* exist for parallel computing under OpenMP beyond what's offered by HTT (hyper-threading).

8 OpenMP: What Is It? Portable, shared-memory threading API
Fortran, C, and C++. Multi-vendor support for both Linux and Windows. Standardizes task- and loop-level parallelism. Supports coarse-grained parallelism. Combines serial and parallel code in a single source. Standardizes ~20 years of compiler-directed threading experience. The current spec is OpenMP 4.0, more than 300 pages. Script: What is OpenMP? OpenMP is a portable (OpenMP codes can be moved between Linux and Windows, for example), shared-memory threading API that standardizes task- and loop-level parallelism. Because OpenMP clauses have both lexical and dynamic extent, it is possible to support broad, multi-file, coarse-grained parallelism. Often, the best technique is to parallelize at the coarsest grain possible, frequently parallelizing tasks or loops from within the main driver itself, as this gives the most bang for the buck (the most computation for the necessary threading overhead costs). Another key benefit is that OpenMP allows a developer to parallelize an application incrementally. Since OpenMP is primarily a pragma- or directive-based approach, we can easily combine serial and parallel code in a single source. By simply compiling with or without the /openmp compiler flag we can turn OpenMP on or off: code compiled without the /openmp flag simply ignores the OpenMP pragmas, which gives simple access back to the original serial application. OpenMP also standardizes about 20 years of compiler-directed threading experience. For more information, or to review the latest OpenMP spec, go to the OpenMP website. [IOMPP]→

9 pthreads: An OpenMP Precursor
Before there was OpenMP, a common approach to supporting parallel programming was the use of pthreads. "pthread": POSIX thread. POSIX: Portable Operating System Interface [for Unix]. pthreads were originally available under Unix and Linux; Windows ports are also available, some as open-source projects. Parallel programming with pthreads is relatively cumbersome, prone to mistakes, and hard to maintain/scale/expand. It was not envisioned as a mechanism for writing scientific computing software.

10 pthreads: Example

int main(int argc, char *argv[]) {
    parm *arg;
    pthread_t *threads;
    pthread_attr_t pthread_custom_attr;
    int n = atoi(argv[1]);

    threads = (pthread_t *) malloc(n * sizeof(*threads));
    pthread_attr_init(&pthread_custom_attr);
    barrier_init(&barrier1);                          /* setup barrier */
    finals = (double *) malloc(n * sizeof(double));   /* allocate space for final result */
    arg = (parm *) malloc(sizeof(parm) * n);

    for (int i = 0; i < n; i++) {                     /* spawn threads */
        arg[i].id = i;
        arg[i].noproc = n;
        pthread_create(&threads[i], &pthread_custom_attr, cpi, (void *)(arg + i));
    }
    for (int i = 0; i < n; i++)                       /* synchronize the completion of each thread */
        pthread_join(threads[i], NULL);

    free(arg);
    return 0;
}

11
#include <stdio.h>
#include <math.h>
#include <time.h>
#include <sys/types.h>
#include <pthread.h>
#include <sys/time.h>

#define SOLARIS 1
#define ORIGIN  2
#define OS SOLARIS

typedef struct {
    int id;
    int noproc;
    int dim;
} parm;

typedef struct {
    int cur_count;
    pthread_mutex_t barrier_mutex;
    pthread_cond_t barrier_cond;
} barrier_t;

/* barrier_init() must run before spawning the threads */
void barrier_init(barrier_t *mybarrier) {
    pthread_mutexattr_t attr;
#if (OS == ORIGIN)
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutexattr_setprioceiling(&attr, 0);
    pthread_mutex_init(&(mybarrier->barrier_mutex), &attr);
#elif (OS == SOLARIS)
    pthread_mutex_init(&(mybarrier->barrier_mutex), NULL);
#else
#error "undefined OS"
#endif
    pthread_cond_init(&(mybarrier->barrier_cond), NULL);
    mybarrier->cur_count = 0;
}

void barrier(int numproc, barrier_t *mybarrier) {
    pthread_mutex_lock(&(mybarrier->barrier_mutex));
    mybarrier->cur_count++;
    if (mybarrier->cur_count != numproc) {
        pthread_cond_wait(&(mybarrier->barrier_cond), &(mybarrier->barrier_mutex));
    } else {
        mybarrier->cur_count = 0;
        pthread_cond_broadcast(&(mybarrier->barrier_cond));
    }
    pthread_mutex_unlock(&(mybarrier->barrier_mutex));
}

/* cpi(): each thread computes a partial sum of the pi integral.
   barrier1, finals, rootn, and f() are globals/helpers defined elsewhere in the original program. */
void *cpi(void *arg) {
    parm *p = (parm *) arg;
    int myid = p->id;
    int numprocs = p->noproc;
    double PI25DT = 3.141592653589793238462643;   /* pi to 25 digits */
    double mypi, pi, h, sum, x;
    double startwtime, endwtime;

    if (myid == 0)
        startwtime = clock();
    barrier(numprocs, &barrier1);

    if (rootn == 0) {
        finals[myid] = 0;
    } else {
        h = 1.0 / (double) rootn;
        sum = 0.0;
        for (int i = myid + 1; i <= rootn; i += numprocs) {
            x = h * ((double) i - 0.5);
            sum += f(x);
        }
        mypi = h * sum;
        finals[myid] = mypi;
    }

    if (myid == 0) {
        pi = 0.0;
        for (int i = 0; i < numprocs; i++)
            pi += finals[i];
        endwtime = clock();
        printf("pi is approx %.16f, Error is %.16f\n", pi, fabs(pi - PI25DT));
        printf("wall clock time = %f\n", (endwtime - startwtime) / CLOCKS_PER_SEC);
    }
    return NULL;
}

12 pthreads: leaving them behind…
Looking at the previous example (which is not the best-written piece of code; it was lifted from the web…): The code displays platform dependency (not portable). The code is cryptic, low level, and hard to read (not simple). It requires busy work: forking and joining threads, etc., which burdens the developer. It probably gets in the way of the compiler as well: there is a rather low chance that the compiler will be able to optimize the implementation. A higher-level approach to SMP parallel computing for *scientific applications* was in order.
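For contrast, here is a minimal sketch (not from the slides) of the same pi-by-integration computation written with OpenMP. The reduction clause it uses is covered later in the course, and the midpoint-rule formulation and the names (n, h, x, sum) are assumptions; the point is how much of the pthreads boilerplate disappears.

#include <stdio.h>

int main() {
    const int n = 100000000;            /* number of subintervals (illustrative) */
    const double h = 1.0 / (double) n;
    double sum = 0.0;

    /* Integrate 4/(1+x^2) over [0,1] with the midpoint rule; every iteration is
       independent, and the reduction clause combines the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        double x = h * ((double) i + 0.5);
        sum += 4.0 / (1.0 + x * x);
    }

    printf("pi is approx %.16f\n", h * sum);
    return 0;
}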

13 OpenMP Programming Model
Master thread spawns a team of threads as needed Managed transparently on your behalf It still relies on thread fork/join methodology to implement parallelism The developer is spared the details Parallelism is added incrementally: that is, the sequential program evolves into a parallel program Parallel Regions Master Thread Script: OpenMP uses a fork join methodology to implement parallelism. A master thread, shown in red, begins executing code. At an OMP PARALLEL directive, the master thread forks other threads to do work. At a specified point in the code called a barrier, the master thread will wait for all the child threads to finish work before proceeding. Because we can have multiple parallel regions each created as above, we see that parallelism can be added incrementally. We can work on one region of code and get it running well as a parallel region and we can simply turn off openmp in other more problematic parallel regions, until we get the desired program behaviour. That’s the high level overview of OpenMP, now we will look at some of the details to get us started. More Background info: In case someone asks if the threads are really created and destroyed between parallel regions – here is some background info: The Intel OpenMP implementation maintains a thread pool to minimize overhead from thread creation. Worker threads are reused from one parallel region to the next. In serial regions, the worker threads either sleep, spin, or spin for a user-specified amount of time before going to sleep. This behavior is controlled by the KMP_BLOCKTIME environment variable or the KMP_SET_BLOCKTIME function, which are described in the compiler documentation. Worker thread behavior in the serial regions can affect application performance. On shared systems, for example, spinning threads consume CPU resources that could be used by other applications. However, there is overhead associated with waking sleeping threads. [IOMPP]→

14 OpenMP: Library Support
Runtime environment routines: Modify/check/get info about the number of threads: omp_[set|get]_num_threads(), omp_get_thread_num() (tells you which thread you are), omp_get_max_threads(). Are we in a parallel region? omp_in_parallel(). How many processors are in the system? omp_get_num_procs(). Explicit locks: omp_[set|unset]_lock(). And several more... Script: More detail on the API is given on the next foil. [IOMPP]→
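As a quick illustration of these runtime routines, here is a minimal sketch (not from the slides) that queries the environment from inside and outside a parallel region; the printed values will of course depend on the machine.

#include <stdio.h>
#include <omp.h>

int main() {
    printf("processors available: %d\n", omp_get_num_procs());
    printf("in parallel? %d\n", omp_in_parallel());      /* 0: this is serial code */

    omp_set_num_threads(4);                              /* request 4 threads for the next region */
    #pragma omp parallel
    {
        /* each thread reports its id and the team size */
        printf("thread %d of %d, in parallel? %d\n",
               omp_get_thread_num(), omp_get_num_threads(), omp_in_parallel());
    }
    return 0;
}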

15 A Few Syntax Details to Get Started
Most of the constructs in OpenMP are compiler directives or pragmas. For C and C++, the pragmas take the form: #pragma omp construct [clause [clause]…] For Fortran, the directives take one of the forms: C$OMP construct [clause [clause]…], !$OMP construct [clause [clause]…], *$OMP construct [clause [clause]…] Header file or Fortran 90 module: #include "omp.h" or use omp_lib. Script: Most of the constructs we will be looking at in OpenMP are compiler directives or pragmas. The C/C++ and Fortran versions of these directives are shown here. For C and C++, the pragmas take the form #pragma omp construct [clause [clause]…], where the clauses are optional modifiers. Be sure to include "omp.h" if you intend to use any routines from the OpenMP library. Now let's look at some environment variables that control OpenMP behavior. Note to Fortran users: in fixed-form source the sentinels (C$OMP, !$OMP, *$OMP) must start in column 1, and column 6 must be blank unless the line is a continuation of the previous line. [IOMPP]→

16 Why Compiler Directive and/or Pragmas?
One of OpenMP's design principles was that the same code, with no modifications, should run either on a one-core machine or on a multi-core machine. Therefore, all the compiler directives are "hidden" behind comments and/or pragmas. These hidden directives are picked up by the compiler only if you instruct it to compile in OpenMP mode. Example: in Visual Studio you have to turn on the /openmp flag in order to compile OpenMP code. You also need to indicate that you want to use the OpenMP API by including the right header: #include <omp.h> [Screenshot: enabling the /openmp flag in the Visual Studio project properties (Step 1: go to the project properties; Step 2: select /openmp).]
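One consequence of this design is worth showing with a short sketch (not from the slides): a conforming compiler defines the _OPENMP macro when the OpenMP flag is on, so a program can keep a serial fallback in the same source.

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>            /* only pulled in when compiling with /openmp or -fopenmp */
#endif

int main() {
#ifdef _OPENMP
    printf("OpenMP enabled, max threads = %d\n", omp_get_max_threads());
#else
    printf("Compiled without OpenMP; running serially.\n");
#endif
    return 0;
}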

17 OpenMP, Compiling Using the Command Line
Method depends on the compiler.
G++:                     $ g++ -o integrate_omp integrate_omp.c -fopenmp
ICC:                     $ icc -o integrate_omp integrate_omp.c -openmp
Microsoft Visual Studio: $ cl /openmp integrate_omp.c

18 Enabling OpenMP with CMake
With the template:
# Minimum version of CMake required.
cmake_minimum_required(VERSION 2.8)
# Set the name of your project
project(ME964-omp)
# Include macros from the SBEL utils library
include(SBELUtils.cmake)
# Example OpenMP program
enable_openmp_support()
add_executable(integrate_omp integrate_omp.cpp)

Without the template, the following replaces include(SBELUtils.cmake) and enable_openmp_support() above:
find_package(OpenMP REQUIRED)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} ${OpenMP_C_FLAGS}")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} ${OpenMP_CXX_FLAGS}")

19 OpenMP Odds and Ends… Controlling the number of threads at runtime
The default number of threads that a program uses when it runs is the number of online processors on the machine.
For the C Shell:    setenv OMP_NUM_THREADS number
For the Bash Shell: export OMP_NUM_THREADS=number
(Use this on Euler.)
Timing:
#include <omp.h>
stime = omp_get_wtime();
longfunction();
etime = omp_get_wtime();
total = etime - stime;
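Pulling these pieces together, here is a minimal sketch (not from the slides) that times a parallel loop with omp_get_wtime() and sets the thread count from inside the program with omp_set_num_threads(), which overrides OMP_NUM_THREADS; the dummy workload is an assumption.

#include <stdio.h>
#include <omp.h>

int main() {
    omp_set_num_threads(4);                 /* overrides OMP_NUM_THREADS for subsequent regions */

    double stime = omp_get_wtime();
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < 100000000L; i++)   /* dummy workload, illustrative only */
        sum += 1.0 / (double)(i + 1);
    double etime = omp_get_wtime();

    printf("sum = %f, wall time = %f s on %d threads\n",
           sum, etime - stime, omp_get_max_threads());
    return 0;
}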

20 Work Plan What is OpenMP? Parallel regions Work sharing Data environment Synchronization Advanced topics
Script: Now we will discuss parallel regions. [IOMPP]→

21 Parallel Region & Structured Blocks (C/C++)
Most OpenMP constructs apply to structured blocks. A structured block is a block with one point of entry at the top and one point of exit at the bottom. The only "branches" allowed are exit() function calls in C/C++.

A structured block:
#pragma omp parallel
{
    int id = omp_get_thread_num();
more:
    res[id] = do_big_job(id);
    if (not_conv(res[id])) goto more;
}
printf("All done\n");

Not a structured block:
if (go_now()) goto more;
#pragma omp parallel
{
    int id = omp_get_thread_num();
more:
    res[id] = do_big_job(id);
    if (conv(res[id])) goto done;
    goto more;
}
done:
if (!really_done()) goto more;

Script: A parallel region is created using the #pragma omp parallel construct. A master thread creates a pool of worker threads once it crosses this pragma. The parallel region extends from the opening curly brace "{" to the matching closing curly brace "}". There is an implicit barrier at the closing brace; that is the point at which the worker threads complete execution and either go to sleep, spin, or otherwise idle. Parallel constructs form the foundation of OpenMP parallel execution: each time an executing thread enters a parallel region, it creates a team of threads and becomes master of that team, which allows parallel execution to take place within that construct by the threads in that team. A parallel region consists of a structured block of code. The first example is a good, structured block: there is a single point of entry at the top of the block and one exit at the bottom, and no branches out of the block. Question to the class: can someone spot some reasons why the second block is unstructured? Here are a couple: it has multiple entrances (one from the top and one from the "goto more" statement, which jumps into the block at the label "more:"), and it has multiple exits (one at the bottom of the block and one from the "goto done" statement). [IOMPP]→

22 Example: Hello World on my Machine
#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        int myId = omp_get_thread_num();
        int nThreads = omp_get_num_threads();
        printf("Hello World. I'm thread %d out of %d.\n", myId, nThreads);
        for (int i = 0; i < 2; i++)
            printf("Iter:%d\n", i);
    }
    printf("GoodBye World\n");
    return 0;
}

Script: Take about 5 minutes to build and run the hello world lab. In this example we print "hello world" from several threads. Run the code several times. Do you see any issues with the code? Do you always get the expected results? Does anyone in the class see weird behavior in the sequence of the words that are printed out? Since printf is a function of state and can only print one thing to one screen at a time, some students are likely to see "race conditions" where one printf in a thread is writing over the results of another printf in another thread. Here's my machine: a 12-core machine with two Intel Xeon X5650 Westmere 2.66 GHz six-core processors (12 MB L3 cache). [IOMPP]→

23 OpenMP: Important Remark
One of the key tenets of OpenMP is the assumption of data independence across parallel jobs. Specifically, when distributing work among parallel threads it is assumed that there is no data dependency. Since you place the omp parallel directive around some code, it is your responsibility to make sure that data dependency is ruled out. Compilers are not smart enough, and sometimes it is outright impossible, to rule out data dependency between what might look like independent parallel jobs.
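A short sketch (not from the slides) of what this warning means in practice: the first loop below is safe to place under omp parallel for, while the second carries an iteration-to-iteration dependency and would silently produce wrong results if parallelized as-is.

#include <stdio.h>
#define N 1000

int main() {
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Safe: every iteration touches only its own element, so there is no data dependency. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    /* NOT safe to parallelize as written: iteration i reads c[i-1], which another
       thread may not have computed yet. The compiler would happily accept a
       "#pragma omp parallel for" here; ruling out the dependency is your job. */
    for (int i = 1; i < N; i++)
        c[i] = c[i - 1] + a[i];

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}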

24 Work Plan What is OpenMP? Parallel regions Work sharing Data environment Synchronization Advanced topics
Script: Let's look at work sharing.

25 Work Sharing
Work sharing is the general term used in OpenMP to describe the distribution of work across threads. There are three categories of work sharing in OpenMP: the "omp for" construct, the "omp sections" construct, and the "omp task" construct. Each of them automatically divides work among threads.
Script: Work sharing is the general term used in OpenMP to describe distribution of work across threads. There are three primary categories of work sharing in OpenMP. The omp for construct automatically divides a for loop's work up and distributes it across threads. The omp sections directive distributes work among threads bound to a defined parallel region; this construct is good for function-level parallelism where the tasks or functions are well defined and known at compile time. The omp task pragma can be used to explicitly define a task. Now we will look more closely at the omp for construct. [IOMPP]→

26 “omp for” construct

// assume N=12
#pragma omp parallel
#pragma omp for
for (i = 1; i < N+1; i++)
    c[i] = a[i] + b[i];

[Figure: the 12 iterations i = 1, …, 12 are divided among the threads of the team; an implicit barrier follows the omp for.]

Threads are assigned an independent set of iterations. Threads must wait at the end of the work-sharing construct.

Script: The omp for directive instructs the compiler to distribute loop iterations within the team of threads that encounters this work-sharing construct. In this example the omp for construct resides inside a parallel region where a team of threads has been created; for the sake of argument, let's say that 3 threads are in the thread team. The omp for construct takes the number of loop iterations, N, divides N by the number of threads in the team to arrive at a unit of work for each thread, and assigns a different set of iterations to each thread. So let's say that N has the value 12: then each of the 3 threads gets to work on 4 loop iterations, so one thread is assigned iterations 1-4, the next thread 5-8, etc. At the end of the parallel region the threads rejoin and a single thread exits the parallel region. Effectively, omp for cuts the execution time down significantly; for this loop the execution time would be on the order of 33% of the serial time, a reduction of 66%. To use the omp for directive, it must be placed immediately prior to the loop in question, as in the code above. Next we'll see how to combine OpenMP constructs.

Background (see the OpenMP spec, or publib.boulder.ibm.com/infocenter, for the full definitions and usage): #pragma omp for [clause], where clause is any of the following: private(list) declares the variables in list private to each thread; firstprivate(list) does the same but initializes each private copy from the original variable; lastprivate(list) does the same but copies the value assigned in the sequentially last iteration back to the original variable; reduction(operator:list) creates a private copy of each scalar variable in list and, at the end of the block, combines the private copies with the given operator into the original shared variable; ordered must be specified if an ordered construct is present within the dynamic extent of the omp for directive; nowait removes the implied barrier at the end of the for directive; and schedule(type[,chunk]) specifies how iterations of the loop are divided among the available threads. The schedule types are: static (iterations are divided into chunks of size ceiling(number_of_iterations/number_of_threads), one chunk per thread; with a chunk size n, chunks of size n are assigned to threads round-robin, also known as block-cyclic scheduling), dynamic (chunks are assigned to threads on a first-come, first-served basis as threads become available, with chunk size 1 unless n is given), guided (chunks start at roughly number_of_iterations/number_of_threads and get progressively smaller, down to a minimum chunk size of 1, or n if given), auto (the mapping of iterations to threads is delegated to the compiler and runtime system), and runtime (the schedule and chunk size are taken at run time from the OMP_SCHEDULE environment variable). The loop must have the canonical form for(init_expr; exit_cond; incr_expr) with a signed-integer iteration variable that is not modified within the loop body; the iteration variable is implicitly made private for the duration of the loop, and its value after the loop is indeterminate unless it is listed as lastprivate. Program sections using the omp for pragma must produce a correct result regardless of which thread executes a particular iteration, and correctness must not rely on a particular scheduling algorithm. An implicit barrier exists at the end of the for loop unless the nowait clause is specified. Restrictions: the for loop must be a structured block and must not be terminated by a break statement; the values of the loop control expressions must be the same for all iterations; an omp for directive can accept only one schedule clause; and the value of the chunk size must be the same for all threads of a parallel region. [IOMPP]→

27 Combining Constructs These two code segments are equivalent
#pragma omp parallel
{
    #pragma omp for
    for (int i = 0; i < MAX; i++) {
        res[i] = huge();
    }
}

#pragma omp parallel for
for (int i = 0; i < MAX; i++) {
    res[i] = huge();
}

Script: In these equivalent code snippets we see that OpenMP constructs can be combined into a single statement. Here the #pragma omp parallel and the nested #pragma omp for of the first example are combined into the single, tidier #pragma omp parallel for construct of the second example. Most often in this course we will use the more abbreviated combined version. There can be occasions, however, when it is useful to separate the constructs, such as when the parallel region has other work to do in addition to the for, maybe an additional omp sections directive. Now we'll turn to lab activity 2. Background side note (not shown, but useful): it is not just omp for that can be merged with omp parallel; it is also allowed to merge omp sections with omp parallel, giving #pragma omp parallel sections. [IOMPP]→

28 The Private Clause Reproduces the variable for each task
Variables are un-initialized; a C++ object is default constructed. The value of the variable external to the parallel region is undefined. By declaring a variable as private, each thread gets its own private copy of that variable: the value that thread 1 stores in x is different from the value that thread 2 stores in its variable x.

void* work(float* c, int N) {
    float x, y;
    int i;
    #pragma omp parallel for private(x,y)
    for (i = 0; i < N; i++) {
        x = a[i];        /* a and b are arrays defined elsewhere */
        y = b[i];
        c[i] = x + y;
    }
    return NULL;
}

Script: We've just about wrapped up our coverage of the omp parallel for construct. Before we move on to the lab, however, we need to introduce one more concept: private variables. For reasons that we will go into later in this module, it is important to be able to make some variables have copies that are private to each thread. The omp private clause accomplishes this. Declaring a variable to be private means that each thread has its own copy of that variable that can be modified without affecting any other thread's value of its similarly named variable. In the example above, each thread has private copies of the variables x and y. This means that thread 1, thread 2, thread 3, etc. all have a variable named x and a variable named y, but thread 1's variable x can contain a different value than thread 2's variable x. This allows each thread to proceed without affecting the computation of the other threads. We will talk more about private variables later in the context of race conditions. Background: the for-loop iteration variable is private by default. [IOMPP]→
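Since the bullet above notes that private copies start out uninitialized, here is a small sketch (not from the slides) contrasting private with firstprivate, which does copy in the value the variable had before the region; the variable names and thread count are illustrative.

#include <stdio.h>
#include <omp.h>

int main() {
    int x = 42;   /* values before the region */
    int y = 42;

    #pragma omp parallel private(x) firstprivate(y) num_threads(2)
    {
        /* private: x is a fresh, uninitialized copy in each thread, so assign before use */
        x = 10 * omp_get_thread_num();
        /* firstprivate: y starts at 42 in every thread, copied from the outer y */
        printf("thread %d sees x=%d, y=%d\n", omp_get_thread_num(), x, y);
    }
    /* the per-thread copies of x and y disappear at the end of the region */
    return 0;
}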

29 Example: Parallel Mandelbrot
Objective: create a parallel version of Mandelbrot by using OpenMP work-sharing clauses to parallelize the computation. Script: In this exercise we will use the combined #pragma omp parallel for construct to parallelize a specific loop in the Mandelbrot application. Then we will use wall-clock time to compare the performance of the parallel version against the serial version. [IOMPP]→

30 Example: Parallel Mandelbrot [The Important Function; Includes material from IOMPP]
int Mandelbrot(float z_r[][JMAX], float z_i[][JMAX], float z_color[][JMAX], char gAxis) {
    float xinc = (float)XDELTA / (IMAX - 1);
    float yinc = (float)YDELTA / (JMAX - 1);
    int i, j;
    #pragma omp parallel for private(i,j) schedule(static,8)
    for (i = 0; i < IMAX; i++) {
        for (j = 0; j < JMAX; j++) {
            z_r[i][j] = (float) -1.0 * XDELTA / 2.0 + xinc * i;
            z_i[i][j] = (float)  1.0 * YDELTA / 2.0 - yinc * j;
            switch (gAxis) {
                case 'V':
                    z_color[i][j] = CalcMandelbrot(z_r[i][j], z_i[i][j]) / 1.0001;
                    break;
                case 'H':
                    z_color[i][j] = CalcMandelbrot(z_i[i][j], z_r[i][j]) / 1.0001;
                    break;
                default:
                    break;
            }
        }
    }
    return 1;
}

31 The schedule Clause The schedule clause affects how loop iterations are mapped onto threads schedule(static [,chunk]) Blocks of iterations of size “chunk” assigned to each thread Round robin distribution Low overhead, may cause load imbalance schedule(dynamic[,chunk]) Threads grab “chunk” iterations When done with iterations, thread requests next set Higher threading overhead, can reduce load imbalance schedule(guided[,chunk]) Dynamic schedule starting with large block Size of the blocks shrink; no smaller than “chunk” Script: The omp for loop can be modified with a schedule clause that effects how the loop iterations are mapped onto threads. This mapping can dramatically improve performance by eliminating load imbalances or in reducing threading overhead. We are going to dissect the schedule clause and look at several options we can use in the schedule clause and how openmp loop behavior changes for each option. The first schedule clause we are going to examine is the schedule(static) clause. Lets advance to the first sub animation on the slide. For the sake of argument, we are going to assume that the loops in question have N iterations and lets assume we have 4 threads in the thread pool – just to make the concepts a little more tangible. To talk about this scheduling we first need to define what a chunk is. A chunk is a contiguous range of iterations – so iterations 0 through 99, for example would be considered a chuck of iterations. What schedule( static) does is break the for loop in chunks of iterations. Each thread in the team gets one chunk. If we have N total iterations, schedule(static) assigns a chunk of N/(number of threads) iterations to each thread for execution. schedule(static, chunk) – Lets assume that chunk is of 8. Then schedule (static,8) would interleave the allocation of chunks of size 8 to threads. That means that thread 1 gets 8 iterations, then thread 2 gets another 8 etc. The chunks of 8 are doled out to the threads in round robin fashion to whatever threads a free for execution. Increasing chunk size reduces overhead and may increase cache hit rate. Descreasing chunk size allows for finer balancing of workloads. Next animation schedule(dynamic) – takes more overhead to accomplish but it effectively assigns the threads one-at-a-time dynamically. This is great for loop iterations where iteration 1 of a loop takes vastly different computation that iteration N for example. Using dynamic scheduling can greatly improve threading load balance in such cases. Threading load balance is ideal state where all threads have an equal amount of work to do and can all finish their work in roughly an equal amount of time. schedule(dynamic, chunk) – similar to dynamic but assigns the threads assigns the threads chunk-at-a-time dynamically rather than one-at-a-time dynamically next animation schedule(guided) – is a specific cases of dynamic scheduling where the computation is known to take less time for early iterations and significantly longer for later iterations. For example, in finding prime numbers using a sieve. The larger primes take more time to test than the smaller primes – so if we are testing in a large loop the smaller iteration compute quickly – but the later iterations take a long time – schedule guided may be a great choice for this situation. schedule(guided, chunk) - dynamic allocation of chunks to tasks using guided self-scheduling heuristic. 
Initial chunks are bigger, later chunks are smaller - minimum chunk size is “chunk” Lets look at Recommend uses Use Static scheduling for predictable and similar work per iteration Use Dynamic scheduling for unpredictable, highly variable work per iteration Use Guided scheduling as a special case of dynamic to reduce scheduling overhead when the computation gets progressively more time consuming Use Auto scheduling to select by Compiler or runtime environment variables Lets look at an example now – see next foil Background: See Dr Michael Quinns Book - Parallel Programming in C with MPI and OpenMP Here are some quick notes regarding each of the options we see on the screen schedule(static) - block allocation of N/threads contiguous iterations to each thread schedule(static, S) - interleaved allocation of chunks of size S to threads schedule(dynamic) - dynamic one-at-a-time allocation of iterations to threads schedule(dynamic, S) - dynamic allocation of S iterations at a time to threads schedule(guided)- guided self-scheduling - minimum chunk size is 1 schedule(guided, S) - dynamic allocation of chunks to tasks using guided self-scheduling heuristic. Initial chunks are bigger, later chunks are smaller - minimum chunk size is S Note: When schedule(static, chunk_size) is specified, iterations are divided into chunks of size chunk_size, and the chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number. When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread. Note that the size of the chunks is unspecified in this case. A compliant implementation of static schedule must ensure that the same assignment of logical iteration numbers to threads will be used in two loop regions if the following conditions are satisfied: 1) both loop regions have the same number of loop iterations, 2) both loop regions have the same value of chunk_size specified, or both loop regions have no chunk_size specified, and 3) both loop regions bind to the same parallel region. A data dependence between the same logical iterations in two such loops is guaranteed to be satisfied allowing safe use of the nowait clause (see Section A.9 on page 170 for examples). dynamic When schedule(dynamic, chunk_size) is specified, the iterations are distributed to threads in the team in chunks as the threads request them. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be distributed. Each chunk contains chunk_size iterations, except for the last chunk to be distributed, which may have fewer iterations. When no chunk_size is specified, it defaults to 1. guided When schedule(guided, chunk_size) is specified, the iterations are assigned to threads in the team in chunks as the executing threads request them. Each thread executes a chunk of iterations, then requests another chunk, until no chunks remain to be assigned. For a chunk_size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads in the team, decreasing to 1. For a chunk_size with value k (greater than 1), the size of each chunk is determined in the same way, with the restriction that the chunks do not contain fewer than k iterations (except for the last chunk to be assigned, which may have fewer than k iterations). 
auto When schedule(auto) is specified, the decision regarding scheduling is delegated to the compiler and/or runtime system. The programmer gives the implementation the freedom to choose any possible mapping of iterations to threads in the team. runtime When schedule(runtime) is specified, the decision regarding scheduling is deferred until run time, and the schedule and chunk size are taken from the run-sched-var ICV. If the ICV is set to auto, the schedule is implementation defined. Credit: IOMPP
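As a small illustration of the difference these schedule kinds can make, here is a sketch (not from the slides) of a triangular workload in which iteration cost grows with i; the inner workload, the chunk size, and the timing harness are assumptions, but this is the classic case where schedule(dynamic) or schedule(guided) beats the default static split.

#include <stdio.h>
#include <omp.h>

/* deliberately imbalanced: iteration i does roughly i units of work */
static double work(int i) {
    double s = 0.0;
    for (int k = 0; k < i; k++)
        s += 1.0 / (k + 1.0);
    return s;
}

int main() {
    const int n = 20000;
    double total = 0.0;

    double t0 = omp_get_wtime();
    /* try swapping in schedule(static), schedule(dynamic,64), schedule(guided) and compare times */
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += work(i);
    double t1 = omp_get_wtime();

    printf("total = %f, elapsed = %f s\n", total, t1 - t0);
    return 0;
}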

32 schedule Clause Example
#pragma omp parallel for schedule (static, 8) for( int i = start; i <= end; i += 2 ) { if ( TestForPrime(i) ) gPrimesFound++; } Iterations are divided into chunks of 8 If start = 3, then first chunk is i={3,5,7,9,11,13,15,17} Credit: IOMPP

33 Work Plan What is OpenMP? Parallel regions Work sharing – Parallel Sections Data environment Synchronization Advanced topics
Script: Now a very brief look at function-level parallelism, also called task decomposition. We'll also look at omp parallel sections. [IOMPP]→

34 Function Level Parallelism
a = alice();
b = bob();
s = boss(a, b);
c = cy();
printf("%6.2f\n", bigboss(s,c));

[Figure: dependence graph; alice and bob feed boss, while boss and cy feed bigboss.]

alice, bob, and cy can be computed in parallel.

Script: We will now look at ways to take advantage of task decomposition, also known as function-level parallelism. In this example we have the functions alice, bob, boss, cy, and bigboss. There are various dependencies among these functions, as seen in the directed graph to the right: boss can't complete until the alice and bob functions are complete, and bigboss can't complete until boss and cy are complete. How do we parallelize such a function or task dependency? We'll see one approach in a couple of slides. For now, let's identify the functions that can be computed independently (that is, in parallel): alice, bob, and cy can all be computed at the same time, as there are no dependencies among them. Let's see if we can put this to good advantage later. Background: get more info in Michael Quinn's excellent book, Parallel Programming in C with MPI and OpenMP. [IOMPP]→

35 omp sections #pragma omp sections Must be inside a parallel region
There is an “s” here #pragma omp sections Must be inside a parallel region Precedes a code block containing N sub-blocks of code that may be executed concurrently by N threads Encompasses each omp section #pragma omp section Precedes each sub-block of code within the encompassing block described above Enclosed program segments are distributed for parallel execution among available threads Script: The omp sections directive distributes work among threads bound to a defined parallel region. The omp sections construct (note the “s” in sections) indicates that there will be two or more omp section constructs ahead that can be executed in parallel. The omp sections construct must either be inside a parallel region or must be part of a combined omp parallel sections construct. When program execution reaches a omp sections directive, program segments defined by the following omp section directive are distributed for parallel execution among available threads. The parallelism comes from executing each omp section in parallel Lets look at the previous example to see how to apply this to our boss bigboss example Background: Get more info in Michael Quinn’s excellent book -Parallel Programming in C with MPI and OpenMP see the openmp spec at or visit publib.boulder.ibm.com/infocenter at the url below to get at the definition and usage Parameters clause is any of the following: private (list) Declares the scope of the data variables in list to be private to each thread. Data variables in list are separated by commas. firstprivate (list) Declares the scope of the data variables in list to be private to each thread. Each new private object is initialized as if there was an implied declaration within the statement block. Data variables in list are separated by commas. lastprivate (list) Declares the scope of the data variables in list to be private to each thread. The final value of each variable in list, if assigned, will be the value assigned to that variable in the last section. Variables not assigned a value will have an indeterminate value. Data variables in list are separated by commas. reduction (operator: list) Performs a reduction on all scalar variables in list using the specified operator. Reduction variables in list are separated by commas. A private copy of each variable in list is created for each thread. At the end of the statement block, the final values of all private copies of the reduction variable are combined in a manner appropriate to the operator, and the result is placed back into the original value of the shared reduction variable. Variables specified in the reduction clause: must be of a type appropriate to the operator. must be shared in the enclosing context. must not be const-qualified. must not have pointer type. nowait Use this clause to avoid the implied barrier at the end of the sections directive. This is useful if you have multiple independent work-sharing sections within a given parallel region. Only one nowait clause can appear on a given sections directive. Usage The omp section directive is optional for the first program code segment inside the omp sections directive. Following segments must be preceded by an omp section directive. All omp section directives must appear within the lexical construct of the program source code segment associated with the omp sections directive. When program execution reaches a omp sections directive, program segments defined by the following omp section directive are distributed for parallel execution among available threads. 
A barrier is implicitly defined at the end of the larger program region associated with the omp sections directive unless the nowait clause is specified. There is no “s” here [IOMPP]→

36 Functional Level Parallelism Using omp sections
double a, b, c, s;

#pragma omp parallel sections
{
    #pragma omp section
    a = alice();
    #pragma omp section
    b = bob();
    #pragma omp section
    c = cy();
}

s = boss(a, b);
printf("%6.2f\n", bigboss(s,c));

[Figure: the same dependence graph as on the previous slide.]

Script: Here we have enclosed the omp section constructs in the omp parallel sections construct and placed the code of interest inside its code block. We added an omp section construct in front of each of the tasks that can be executed in parallel, namely alice, bob, and cy. When all of the threads executing the parallel sections reach the implicit barrier (the right curly brace at the end of the construct), the master thread continues on, executing the boss function and later the bigboss function. Another possible approach that Quinn points out computes alice and bob together first, and then handles cy, boss, and bigboss:

#pragma omp parallel sections
{
    #pragma omp section   /* optional for the first section */
    a = alice();
    #pragma omp section
    b = bob();
}
c = cy();
s = boss(a, b);
printf("%6.2f\n", bigboss(s,c));

Background: get more info in Michael Quinn's excellent book, Parallel Programming in C with MPI and OpenMP. [IOMPP]→

37 Advantage of Parallel Sections
Independent sections of code can execute concurrently, reducing execution time.

#pragma omp parallel sections
{
    #pragma omp section
    phase1();
    #pragma omp section
    phase2();
    #pragma omp section
    phase3();
}

[Figure: execution flow over time, serial vs. parallel; in the parallel flow the three phases overlap.]

Script: In this example, phase1, phase2, and phase3 represent completely independent tasks. Sections are distributed among the threads in the parallel team. Each section is executed only once and each thread may execute zero or more sections. It's not possible to determine whether or not a section will be executed before another; therefore, the output of one section should not serve as the input to another concurrent section. Notice the overall parallelism achievable in the serial/parallel flow diagram. Now we will begin our exploration of omp tasks. [IOMPP]→

38 sections, Example

#include <stdio.h>
#include <omp.h>

int main() {
    printf("Start with 2 procs\n");
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        {
            printf("Start work 1\n");
            double startTime = omp_get_wtime();
            while ((omp_get_wtime() - startTime) < 2.0);
            printf("Finish work 1\n");
        }
        #pragma omp section
        {
            printf("Start work 2\n");
            /* timed busy-wait similar to work 1 (duration not legible on the slide) */
            printf("Finish work 2\n");
        }
        #pragma omp section
        {
            printf("Start work 3\n");
            /* timed busy-wait similar to work 1 (duration not legible on the slide) */
            printf("Finish work 3\n");
        }
    }
    return 0;
}

39 sections, Example: 2 threads

40 sections, Example

#include <stdio.h>
#include <omp.h>

int main() {
    printf("Start with 4 procs\n");
    #pragma omp parallel sections num_threads(4)
    {
        #pragma omp section
        {
            printf("Start work 1\n");
            double startTime = omp_get_wtime();
            while ((omp_get_wtime() - startTime) < 2.0);
            printf("Finish work 1\n");
        }
        #pragma omp section
        {
            printf("Start work 2\n");
            double startTime = omp_get_wtime();
            while ((omp_get_wtime() - startTime) < 6.0);
            printf("Finish work 2\n");
        }
        #pragma omp section
        {
            printf("Start work 3\n");
            /* timed busy-wait similar to work 1 (duration not legible on the slide) */
            printf("Finish work 3\n");
        }
    }
    return 0;
}

41 sections, Example: 4 threads

42 Work Plan What is OpenMP? Parallel regions Work sharing – Tasks Data environment Synchronization Advanced topics
Script: Next stop: omp tasks.

43 OpenMP Tasks Task: the most important feature added in the 3.0 version of OpenMP. Allows parallelization of irregular problems: unbounded loops, recursive algorithms, producer/consumer patterns. Script: Tasks are a powerful addition to OpenMP as of the OpenMP 3.0 spec. Tasks allow parallelization of irregular problems that were impossible or very difficult to parallelize in OpenMP before 3.0. Now it is possible to parallelize unbounded loops (such as while loops), recursive algorithms, and producer/consumer patterns. Let's explore what tasks actually are. [IOMPP]→
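To make "recursive algorithms" concrete, here is the classic Fibonacci sketch with tasks (not from the slides; the cut-off value is an assumption, the taskwait directive is covered a few slides later, and the naive recursion is for illustration only, not an efficient way to compute Fibonacci numbers).

#include <stdio.h>
#include <omp.h>

long fib(int n) {
    if (n < 2) return n;
    if (n < 20)                        /* cut-off: below this, recursing serially is cheaper */
        return fib(n - 1) + fib(n - 2);

    long x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);                    /* child task */
    #pragma omp task shared(y)
    y = fib(n - 2);                    /* child task */
    #pragma omp taskwait               /* wait for both children before combining */
    return x + y;
}

int main() {
    long result;
    #pragma omp parallel
    {
        #pragma omp single             /* one thread starts the recursion; tasks spread the work */
        result = fib(30);
    }
    printf("fib(30) = %ld\n", result);
    return 0;
}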

44 Tasks: What Are They? Serial Parallel
Tasks are independent units of work A thread is assigned to perform a task Tasks might be executed immediately or might be deferred The runtime system decides which of the above Tasks are composed of code to execute data environment internal control variables (ICV) Script: First of all, Tasks are independent units of work that get threads assigned to them in order to do some calculation. The assigned threads might start executing immediately or their execution might be deferred depending on decision made by the OS & runtime. Tasks are composed of three components: Code to execute – the literal code in your program enclosed by the task directive 2) A data environment – the shared & private data manipulated by the task 3) Internal control variables – thread scheduling and environment variable type controls. A task is a specific instance of executable code and its data environment, generated when a thread encounters a task construct or a parallel construct. Background: New concept in OpenMP 3.0: explicit task - We have simply added a way to create a task explicitly for a team of threads to execute. Key Concept: All parallel execution is done in the context of a parallel region - Thread encountering parallel construct packages up a set of N implicit tasks, one per thread. - Team of N threads is created - Each thread begins execution of a separate implicit task immediately New concept: explicit task – OpenMP has simply added a way to create a task explicitly for the team to execute. Every part of an OpenMP program is part of one task or another! Time Serial Parallel [IOMPP]→

45 Tasks: What Are They? [More specifics…]
Code to execute: the literal code in your program enclosed by the task directive. Data environment: the shared & private data manipulated by the task. Internal control variables: thread scheduling and environment variable type controls. A task is a specific instance of executable code and its data environment, generated when a thread encounters a task construct. Two activities: packaging and execution. A thread packages new instances of a task (code and data); some thread in the team executes the task at some later time.

46
#include <omp.h>
#include <list>
#include <iostream>
#include <math.h>

using namespace std;
typedef list<double> LISTDBL;

void doSomething(LISTDBL::iterator& itrtr) { *itrtr *= 2.; }

int main() {
    LISTDBL test;            // default constructor
    LISTDBL::iterator it;

    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 8; ++j)
            test.insert(test.end(), pow(10.0, i + 1) + j);

    for (it = test.begin(); it != test.end(); it++) cout << *it << endl;

    it = test.begin();
    #pragma omp parallel num_threads(8)
    {
        #pragma omp single firstprivate(it)   // one thread walks the list; firstprivate so its copy starts at test.begin()
        {
            while (it != test.end()) {
                #pragma omp task              // each task captures the current iterator value
                doSomething(it);
                it++;
            }
        }
    }

    for (it = test.begin(); it != test.end(); it++) cout << *it << endl;
    return 0;
}

Script: In this example we are looking at pointer chasing in a linked list. We create a parallel region using the #pragma omp parallel construct; 8 threads are created once the master thread crosses into the parallel region. Let's also assume, for the sake of the discussion, that the linked list contains ~1000 nodes. We immediately limit the number of threads that operate the while loop: we only want one while loop running. Without the single construct we would have 8 identical copies of the while loop all trying to process work and getting in each other's way. The omp task construct copies the code, data, and internal control variables to a new task (call it task01) and gives that task a thread from the team to execute the task's instance of the code and data. Since the omp task construct is called from within the while loop, and since the while loop is going to traverse all 1000 nodes, at some point 1000 tasks will be generated. It is unlikely that all 1000 tasks will be generated at the same time. Since we only have 8 threads to service the 1000 tasks, and the master thread is busy controlling the while loop, we effectively have 7 threads to do actual work. Let's say that the master thread keeps generating tasks and the 7 worker threads can't consume these tasks quickly enough. Then, eventually, the master thread may "task switch": it may suspend the work of controlling the while loop and creating tasks, and the runtime may decide that the master should begin servicing the tasks just like the rest of the threads. When the task pool drains enough due to the extra help, the master thread may task switch back to executing the while loop and begin generating new tasks once more. This process is at the heart of OpenMP tasks. We'll see some animations to demonstrate this in the next few foils.

47 Compile like: $ g++ -o testOMP.exe testOMP.cpp -fopenmp

48 Task Construct – Explicit Task View
A team of threads is created at the omp parallel construct. A single thread is chosen to execute the while loop; call this thread "L". Thread L operates the while loop, creates tasks, and fetches next pointers. Each time L crosses the omp task construct it generates a new task, which gets a thread assigned to it; each task runs in its own thread. All tasks complete at the barrier at the end of the parallel region's construct. Each task has its own stack space that will be destroyed when the task is completed.

#pragma omp parallel
{
    #pragma omp single
    {                                        // block 1
        node *p = head_of_list;
        while (p) {                          // block 2
            #pragma omp task firstprivate(p) // each task captures the current value of p
            process(p);
            p = p->next;                     // block 3
        }
    }
}

Script: This foil is one way to look at pointer chasing in a linked list, where the list must be traversed and each node in the linked list has to be processed by the function "process". Here we see the overview of the flow of this code snippet: a team of threads is created at the omp parallel construct; a single thread is chosen to execute the while loop (call this thread "L"); thread L operates the while loop, creates tasks, and fetches next pointers; each time L crosses the omp task construct it generates a new task and has a thread assigned to it; each task runs in its own thread; and all tasks complete at the barrier at the end of the parallel region. The next foil will give more insight into the parallelism advantage of this approach. [IOMPP]→

49 Why are tasks useful?
Tasks have the potential to parallelize irregular patterns and recursive function calls.

#pragma omp parallel
{
    #pragma omp single
    {                                        // block 1
        node *p = head_of_list;
        while (p) {                          // block 2
            #pragma omp task firstprivate(p)
            process(p);
            p = p->next;                     // block 3
        }
    }
}

[Figure: timeline comparing single-threaded execution (block 1, then block 2/task 1, block 3, block 2/task 2, block 3, block 2/task 3 in sequence) with execution on four threads (Thr1-Thr4), where the tasks overlap, some threads sit idle part of the time, and the overlap yields the indicated time saved.]

Script: Here is another look at the same example, but emphasizing the potential performance payoff of the sort of pipelined approach we get from using tasks. First, observe that in single-threaded mode all tasks are done sequentially: block 1 is calculated (node p is assigned to head), then block 2 (pointer p is processed by process(p)), then block 3 (reads the next pointer in the linked list), then blocks 2 and 3 repeat, etc. Now consider the same code executed in parallel. First, the master thread crosses the omp parallel construct and creates a team of threads. Next, one of those threads is chosen to execute the while loop; let's call this thread L. Thread L encounters an omp task construct at block 2, which copies the code and data for process(p) to a new task; we'll call it task 1. Thread L increments the pointer p (grabbing a new node from the list) and loops to the top of the while loop. Then thread L again encounters the omp task construct at block 2, creating task 2, and on the next trip task 3. So thread L's job is simply to assign work to threads and traverse the linked list. The parallelism comes from the fact that thread L does not have to wait for the results of any task before generating a new task. If the system has sufficient resources (enough cores, registers, memory, etc.), then task 1, task 2, and task 3 can all be computed in parallel. So, roughly speaking, the execution time will be about the duration of the longest executing task (task 2 in this case) plus some extra administration time for thread L. The time saved can be significant compared to the serial execution of the same code. Obviously, the more parallel resources are supplied, the better the parallelism, up until the point that the longest serial task begins to dominate the total execution time. Now it's time for the lab activity. [IOMPP]→

50 Tasks: Synchronization Issues
Setup: Assume Task B specifically relies on completion of Task A You need to be in a position to guarantee completion of Task A before invoking the execution of Task B Tasks are guaranteed to be complete at thread or task barriers: At the directive: #pragma omp barrier At the directive: #pragma omp taskwait Script: Now we are going to quickly explore when tasks are guaranteed to be complete. For doing computations with dependent tasks, where Task B specifically relies on completion of task A, it is sometimes necessary for developers to know explicitly when or where a task can be guaranteed to be completed. This foil addresses this concern. Tasks are guaranteed to be complete at the following three locations: At thread or task barriers – such as the end of a parallel region, the end of a single region, the end of a parallel “for” region – we’ll talk more aout implicit thread or task barriers later At the directive: #pragma omp barrier At the directive: #pragma omp taskwait In the case of #pragma omp taskwait Encountering task suspends at the point of the directive until all its children (all the encountering task’s children tasks created within the encountering task up to this point) are complete. Only direct children - not descendants! Similarly a Thread barrier (implicit or explicit) includes an implicit taskwait. This pretty much wraps up the discussion of tasks – now we are going to look at data environment topics such as data scoping. [IOMPP]→
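As a concrete illustration of the taskwait guarantee, here is a small sketch; taskA() and taskB() are hypothetical placeholders for the dependent work described above. The encountering task suspends at the taskwait until its direct children have completed, so Task B is only created once Task A is guaranteed done.

#include <stdio.h>
#include <omp.h>

// hypothetical work: taskB() relies on results produced by taskA()
void taskA(void) { printf("Task A done by thread %d\n", omp_get_thread_num()); }
void taskB(void) { printf("Task B done by thread %d\n", omp_get_thread_num()); }

int main(void) {
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task
        taskA();              // child task A
        #pragma omp taskwait  // encountering task suspends until its direct children finish
        #pragma omp task
        taskB();              // created only after A is guaranteed complete
    }   // implicit barriers at the end of single/parallel: any remaining tasks finish here
    return 0;
}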

51 Task Completion Example
#pragma omp parallel
{
   #pragma omp task
   foo();                 // multiple foo tasks created here – one for each thread

   #pragma omp barrier    // all foo tasks guaranteed to be completed here

   #pragma omp single
   bar();                 // one bar task created here;
                          // bar task guaranteed to be completed at the end of the single region
}

Script: Let's take a look at an example that demonstrates where tasks are guaranteed to be complete. In this example, the master thread crosses the parallel construct and a team of N threads is created. Each thread in the team is assigned a task – in this case each thread gets assigned a "foo" task – so that there are now N foo tasks created. The "exit" of the omp barrier construct is where we are guaranteed that all N of the foo tasks are complete. Next a single thread crosses the omp task construct and a single task is created to execute the "bar" function. The bar function is guaranteed to be complete at the exit of the single construct's code block – the right curly brace that signifies the end of the single region. Now let's move on to the next item in the agenda. [IOMPP]→

52 Work Plan What is OpenMP? Advanced topics Parallel regions
Work sharing Data scoping Synchronization Advanced topics Script: Next up is Data environment – including data scoping, shared & private variables and quite a few quick examples to show the difference

53 Data Scoping – What’s shared
OpenMP runs on SMP architectures → data lives in shared-memory Nomenclature: OpenMP shared variable - variable that can be read/written by multiple threads Shared clause can be used to make items explicitly shared Global variables are shared by default among tasks Other examples of variables being shared among threads File scope variables Namespace scope variables Variables with const-qualified type having no mutable member Static variables which are declared in a scope inside the construct Script: Data scoping – what’s shared? OpenMP uses a shared memory programming model rather than a message passing model. As such, the way that threads communicate results to the master thread or to each other is through shared memory and thru shared variables. Consider two threads each computing a partial sum. A master thread is waiting to add the partial sums to get a grand total. If each thread ONLY has its own local copy of data and can ONLY manipulate its own data – then there would be no way to communicate a partial sum to the master thread. So shared variables play a very important role on a shared memory system. A shared variable is a variable whose name provides access to a the same block of storage for each task region. What variables are considered shared by default? Global variables, variables with file scope, variables with namespace scope, static variables are all shared by default. A variable can be made shared explicitly by adding the shared(list,…) clause to any omp construct. For Fortran junkies, Common blocks, Save variables, and module variables are all shared by default. So now that we know what’s shared – lets look at what’s private. Background: predetermined data-sharing attributes Variables with automatic storage duration that are declared in a scope inside the construct are private. • Variables with heap allocated storage are shared. • Static data members are shared. • Variables with const-qualified type having no mutable member are shared. C/C++ • Static variables which are declared in a scope inside the construct are shared. [IOMPP]→
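To make the sharing rules concrete, here is a minimal sketch (the variable names are illustrative): the file-scope variable is shared by default, and the shared clause states the intent explicitly for the local one.

#include <stdio.h>
#include <omp.h>

int global_hits = 0;                  // file-scope variable: shared by default

int main(void) {
    double total = 0.0;               // made explicitly shared below
    #pragma omp parallel shared(total)
    {
        #pragma omp atomic
        global_hits++;                // every thread updates the same storage
        #pragma omp atomic
        total += 1.0;                 // likewise for the explicitly shared variable
    }
    printf("threads that ran: %d, total = %f\n", global_hits, total);
    return 0;
}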

54 Data Scoping – What’s Private
Not everything is shared...
Examples of implicitly determined PRIVATE variables:
Stack (local) variables in functions called from parallel regions
Automatic variables within a statement block
Loop iteration variables
Implicitly determined private variables within tasks are treated as firstprivate
firstprivate: specifies that each thread (or task) gets its own instance of a variable, initialized with the value the original variable had just before the construct was encountered

Script: While shared variables may be essential to have – they also have a serious drawback that we will explore in some future foils. Shared variables open up the possibility of data races or race conditions. A race condition is a situation in which multiple threads of execution are all updating a shared variable in an unsynchronized fashion causing indeterminate results – which means that the same program running the same data may arrive at different answers in multiple trials of the program execution. This is a fancy way of saying that you could write a program that adds 1 plus 3 and sometimes gets 4, other times gets 1, and other times gets 3. One way to combat data races is by making use of copies of data in private variables for every thread. In order to use private variables, we ought to know a little about them. First of all, it is important to know that some variables are implicitly considered to be private variables. Other variables can be made private by explicitly declaring them to be so using the private clause. Some examples of implicitly determined private variables are: stack variables in functions called from parallel regions, automatic variables within a statement block, loop iteration variables. Second, all variables that are implicitly determined to be private within a task will be treated as though they are firstprivate – meaning that they are all given an initial value from their associated original variable. Let's look at some examples to see what this means.

Background: Variables appearing in threadprivate directives are threadprivate. Variables with automatic storage duration that are declared in a scope inside the construct are private. Variables with heap-allocated storage are shared. Static data members are shared. The loop iteration variable(s) in the associated for-loop(s) of a for or parallel for construct is (are) private. Variables with const-qualified type having no mutable member are shared (C/C++). Static variables which are declared in a scope inside the construct are shared.

private (list): declares the scope of the data variables in list to be private to each thread. Data variables in list are separated by commas.

Predetermined data-sharing attributes: For each private variable referenced in the structured block, a new version of the original variable (of the same type and size) is created in memory for each task that contains code associated with the directive. References to a private variable in the structured block refer to the current task's private version of the original variable. A private variable in a task region that eventually generates an inner nested parallel region is permitted to be made shared by implicit tasks in the inner parallel region. A private variable in a task region can be shared by an explicit task region generated during its execution.
However, it is the programmer’s responsibility to ensure through synchronization that the lifetime of the variable does not end before completion of the explicit task region sharing it. Any other access by one task to the private variables of another task results in unspecified behavior. [IOMPP]→
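The difference between private and firstprivate in a short sketch (purely illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int x = 42;

    #pragma omp parallel num_threads(2) private(x)
    {
        // x here is a fresh, *uninitialized* per-thread copy; it must be assigned before use
        x = omp_get_thread_num();
        printf("private copy set to %d\n", x);
    }

    #pragma omp parallel num_threads(2) firstprivate(x)
    {
        // each thread's copy starts with the original value, 42
        printf("firstprivate copy starts at %d\n", x);
    }

    printf("original x is still %d\n", x);   // the per-thread copies never touch it
    return 0;
}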

55 Data Scoping – The Golden Rule
When in doubt, explicitly indicate who’s what

56 When in doubt, explicitly indicate who’s what
#pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
{
   tid = omp_get_thread_num();
   if (tid == 0) {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
   }
   printf("Thread %d starting...\n", tid);

   #pragma omp sections nowait
   {
      #pragma omp section
      {
         printf("Thread %d doing section 1\n", tid);
         for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
            printf("Thread %d: c[%d]= %f\n", tid, i, c[i]);
         }
      }

      #pragma omp section
      {
         printf("Thread %d doing section 2\n", tid);
         for (i = 0; i < N; i++) {
            d[i] = a[i] * b[i];
            printf("Thread %d: d[%d]= %f\n", tid, i, d[i]);
         }
      }
   } /* end of sections */

   printf("Thread %d done.\n", tid);
} /* end of parallel section */

When in doubt, explicitly indicate who's what

57 A Data Environment Example
// Assumed to be in another translation unit
extern float A[10];
void Work(int *index) {
   float temp[10];
   static int count;
   <...>
}

float A[10];
int main() {
   int index[10];
   #pragma omp parallel
   {
      Work(index);
   }
   printf("%d\n", index[1]);
}

Which variables are shared and which variables are private?
A, index, and count are shared by all threads, but temp is local to (private to) each thread

Script: Let me ask the class – from what you learned already, tell me which variables are shared and which are private. A[] is shared – it is a global variable. index is also an array – we pass the pointer to the first element into Work; all threads share this array. count is also shared – why? Because it is declared to be static – so all tasks share that same value. temp is private – this is because it is created inside Work. Each task has code and data for Work(), and temp is a local variable within the function Work(). Let's look at some more examples.

Includes material from IOMPP

58 Data Scoping Issue: fib Example
int fib(int n) {
   int x, y;
   if (n < 2) return n;

   #pragma omp task
   x = fib(n-1);
   #pragma omp task
   y = fib(n-2);

   #pragma omp taskwait
   return x + y;
}

n is private in both tasks
x is a private variable
y is a private variable

Script: We will assume that the parallel region exists outside of fib, and that fib and the tasks inside it are in the dynamic extent of a parallel region. n is firstprivate in both tasks – reason: a stack variable in a function called from a parallel region is implicitly determined to be private, which means that within both task directives it will then be treated as firstprivate. Do you see any issues here? 1st animation What about x and y? They are definitely private (or firstprivate) within the tasks – BUT we want to use their values OUTSIDE the tasks. We need to share the values of x and y somehow. The problem with assigning a value to x (which by default is private) and y (which by default is private) is that the values are needed outside the task constructs – after the taskwait – and the private copies are not defined there. So to have any meaning – we have to provide a mechanism to communicate the value of these variables to the statement after the taskwait – we have several strategies available to make this work, as we shall see on the following foils. This is very important here.

What's wrong here? Values of the private variables are not available outside of the tasks

Credit: IOMPP

59 Data Scoping Issue: fib Example
int fib(int n) {
   int x, y;
   if (n < 2) return n;

   #pragma omp task
   {
      x = fib(n-1);
   }
   y = fib(n-2);

   #pragma omp taskwait
   return x + y;
}

x is a private variable
y is a private variable

Script: We will assume that the parallel region exists outside of fib, and that fib and the tasks inside it are in the dynamic extent of a parallel region. n is firstprivate in both tasks – reason: a stack variable in a function called from a parallel region is implicitly determined to be private, which means that within both task directives it will then be treated as firstprivate. Do you see any issues here? 1st animation What about x and y? They are definitely private (or firstprivate) within the tasks – BUT we want to use their values OUTSIDE the task. We need to share the values of x and y somehow. The problem with assigning a value to x (which by default is private) and y (which by default is private) is that the values are needed outside the task construct – after the taskwait – and the private copies are not defined there. So to have any meaning – we have to provide a mechanism to communicate the value of these variables to the statement after the taskwait – we have several strategies available to make this work, as we shall see on the following foils.

Values of the private variables are not available outside of the tasks

Credit: IOMPP

60 Data Scoping Issue: fib Example
int fib(int n) {
   int x, y;
   if (n < 2) return n;

   #pragma omp task shared(x)
   x = fib(n-1);
   #pragma omp task shared(y)
   y = fib(n-2);

   #pragma omp taskwait
   return x + y;
}

n is private in both tasks
x & y are now shared – we need both values to compute the sum
The values of the x & y variables will be available outside each task construct – after the taskwait

Script: Good solution. In this case, we shared the values of x & y so that the values will be available outside each task construct – after the taskwait.

Credit: IOMPP
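For completeness, a compilable driver for this version is sketched below. The enclosing parallel/single region is the setup the script assumes, and the input value 20 is an arbitrary choice for illustration.

#include <stdio.h>
#include <omp.h>

int fib(int n) {
    int x, y;
    if (n < 2) return n;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait          // both children done before x + y is read
    return x + y;
}

int main(void) {
    int result;
    #pragma omp parallel
    #pragma omp single            // one thread seeds the recursion; tasks spread the work
    result = fib(20);
    printf("fib(20) = %d\n", result);
    return 0;
}

In practice one would add a cutoff (for example, falling back to a serial fib below some value of n) to avoid creating an enormous number of tiny tasks.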

61 Work Plan What is OpenMP? Advanced topics Parallel regions
Work sharing Data environment Synchronization Advanced topics Script: On to synchronization Credit: IOMPP

62 Implicit Barriers Several OpenMP constructs have implicit barriers
parallel – necessary barrier – cannot be removed
for – optional barrier
single – optional barrier

Unnecessary barriers hurt performance and can be removed with the nowait clause
The nowait clause is applicable to:
for clause
single clause

Script: Several OpenMP* constructs have implicit barriers: parallel – necessary barrier – cannot be removed; for – optional barrier; single – optional barrier. Unnecessary barriers hurt performance and can be removed with the nowait clause. The nowait clause is applicable to the for clause and the single clause.

Credit: IOMPP

63 Nowait Clause

#pragma omp for nowait
for (...) { ... };

#pragma omp single nowait
{ [...] }

Use when threads unnecessarily wait between independent computations

#pragma omp for schedule(dynamic,1) nowait
for (int i = 0; i < n; i++)
   a[i] = bigFunc1(i);

#pragma omp for schedule(dynamic,1)
for (int j = 0; j < m; j++)
   b[j] = bigFunc2(j);

Script: The nowait clause can be used to remove unnecessary waits between independent calculations. In the example above – we see how to use the nowait clause in conjunction with an omp for loop (upper left example). In the upper right we see how to use the clause with the pragma single. In the center of the foil we see a more complicated example. In this example – we have two for loops executing back to back. Without the nowait clause – all the threads in the upper for loop would have to complete and meet at the implied barrier at the end of the first loop – BEFORE any threads could begin execution in the second loop. BUT with the nowait clause, some threads that complete the upper loop early can actually go on through and begin calculation in the lower second loop – WITHOUT having to wait for the rest of the threads in the upper loop. We've examined the nowait clause – now let's look at the explicit barrier construct.

Credit: IOMPP

64 Barrier Construct Explicit barrier synchronization
Each thread waits until all threads arrive

#pragma omp parallel shared(A, B, C)
{
   DoSomeWork(A, B);     // Processed A into B
   #pragma omp barrier
   DoSomeWork(B, C);     // Processed B into C
}

Script: The barrier construct is just a way for a user to explicitly place a barrier wherever he wishes in a parallel region. Just as it is for implicit barriers, the barrier construct forces all threads to wait until all threads arrive. In the example above – all threads in the parallel region participate in DoSomeWork(A,B); as they complete DoSomeWork(A,B) they are forced to wait until all threads have completed their work – then, ONCE ALL THREADS have met up at the omp barrier, all threads are allowed to do the second round of work in DoSomeWork(B,C). This is one way to guarantee that NO threads are altering values of B in the upper section while other threads are consuming B in the lower section. That wraps up barriers – let's look at the atomic construct.

Credit: IOMPP

65 Atomic Construct Applies only to simple update of memory location
Special case of a critical section, to be discussed shortly
Atomic introduces less overhead than critical

index[0] = 2; index[1] = 3; index[2] = 4; index[3] = 0;
index[4] = 5; index[5] = 5; index[6] = 5; index[7] = 1;

#pragma omp parallel for shared(x, y, index, n)
for (i = 0; i < n; i++) {
   #pragma omp atomic
   x[index[i]] += work1(i);
   y[i] += work2(i);
}

Script: The atomic construct can be considered to be a special case of a critical section with some limitations. The limitation is that it only applies to simple updates of memory locations. Since index[i] can be the same for different i values, the update to x must be protected. Use of a critical section would serialize updates to x. Atomic protects individual elements of the x array, so that if multiple, concurrent instances of index[i] are different, updates can still be done in parallel. That nearly concludes the class – we just need to take some time to talk about what kinds of courses or curriculums would be best served by this material and also to quickly fly by the advanced topics to show what is left to the reader.

Credit: IOMPP
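A self-contained sketch of the same idea follows; work1/work2 and the array sizes are illustrative assumptions. Atomic protects each x[index[i]] update individually, so updates to distinct elements still proceed in parallel.

#include <stdio.h>
#include <omp.h>

#define N 8

float work1(int i) { return 1.0f * i; }   // hypothetical per-element work
float work2(int i) { return 2.0f * i; }

int main(void) {
    int   index[N] = {2, 3, 4, 0, 5, 5, 5, 1};   // note the repeated index 5
    float x[6] = {0};                            // several i values map to the same x element
    float y[N] = {0};

    #pragma omp parallel for shared(x, y, index)
    for (int i = 0; i < N; i++) {
        #pragma omp atomic
        x[index[i]] += work1(i);   // may collide when index[i] repeats -> needs protection
        y[i] += work2(i);          // each i owns its own y[i] -> no protection needed
    }

    for (int i = 0; i < 6; i++)
        printf("x[%d] = %f\n", i, x[i]);
    return 0;
}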

66 Example: Dot Product What is Wrong?
float dot_prod(float* a, float* b, int N)
{
   float sum = 0.0;
   #pragma omp parallel for shared(sum)
   for (int i = 0; i < N; i++) {
      sum += a[i] * b[i];
   }
   return sum;
}

What is Wrong?

Script: Here we see a simple-minded dot product. We have used a parallel for construct with a shared clause. So what is the problem? Answer: multiple threads modify sum with no protection – this is a race condition! Let's talk about race conditions some more on the next slide.

Credit: IOMPP

67 Race Condition A race condition is nondeterministic behavior produced when two or more threads access a shared variable at the same time For example, suppose both Thread A and Thread B are executing the statement area += 4.0 / (1.0 + x*x); Script: A race condition is nondeterministic behavior caused by the times at which two or more threads access a shared variable Lets look at the code snippet below area = area + 4/(1+x*x) If variable “area” is private, then the individual subtotals will be lost when the loop is exited. But if variable “area” is shared, then we could run into a race condition. So – we have a quandary. On the one hand I want area to be shared because I want to combine partial sums from thread A & thread B – on the other hand – I don’t want a data race? The next few slides will illustrate the problem. Credit: IOMPP

68 Two Possible Scenarios
Scenario 1 (no interleaving problem):
Value of area: 11.667
Thread A adds its contribution (+3.765) → area = 15.432
Thread B then adds its contribution → area = 18.995 (correct total)

Scenario 2 (data race):
Value of area: 11.667
Thread A and Thread B both read area = 11.667
Thread A writes back 11.667 + 3.765 = 15.432
Thread B then writes back its own result on top → area = 15.230; Thread A's update is lost

Order of thread execution causes nondeterministic behavior in a data race

Script: If thread A references "area", adds to its value, and assigns the new value to "area" before thread B references "area", then everything is okay. However, if thread B accesses the old value of "area" before thread A writes back the updated value, then variable "area" ends up with the wrong total. The value added by thread A is lost. We see that in a data race condition, the order of execution of each thread can change the resulting calculations.

Credit: IOMPP

69 Protect Shared Data
The critical construct protects access to shared, modifiable data
The critical section allows only one thread to enter it at a given time

float dot_prod(float* a, float* b, int N)
{
   float sum = 0.0;
   #pragma omp parallel for shared(sum)
   for (int i = 0; i < N; i++) {
      #pragma omp critical
      sum += a[i] * b[i];
   }
   return sum;
}

Script: To resolve this issue – let's consider using a pragma omp critical – also called a critical section. We'll go over the critical section in more detail on the next slide, but for now let's get a feel for how it works. The critical section allows only one thread to enter it at a given time. A critical section "protects" the next immediate statement or code block – in this case – sum += a[i] * b[i];. Whichever thread, A or B, gets to the critical section first, that thread is guaranteed exclusive access to that protected code region (called a critical region). Once the thread leaves the critical section, the other thread is allowed to enter. This ability of the critical section to force threads to "take turns" is what prevents the race condition. Let's look a little closer at the anatomy of an OpenMP* critical construct.

Credit: IOMPP

70 OpenMP Critical Construct
#pragma omp critical [(lock_name)]
Defines a critical region on a structured block

float RES;
#pragma omp parallel
{
   #pragma omp for
   for (int i = 0; i < niters; i++) {
      float B = big_job(i);
      #pragma omp critical (RES_lock)
      consum(B, RES);
   }
}

Threads wait their turn – only one at a time calls consum(), thereby protecting RES from race conditions
Naming the critical construct RES_lock is optional but highly recommended

Script: The OpenMP* critical construct simply defines a critical region on a structured code block. Threads wait their turn – only one at a time calls consum(), thereby protecting RES from race conditions. Naming the critical construct RES_lock is optional. With named critical regions, a thread waits at the start of a critical region identified by a given name until no other thread in the program is executing a critical region with that same name. Critical sections not specifically named by an omp critical directive invocation are mapped to the same unspecified name. Now let's talk about another kind of synchronization called a reduction.

Good Practice – Name all critical sections

Includes material from IOMPP
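A sketch of why naming matters (the update functions and lock names are made up for illustration): threads serialize only against critical regions that carry the same name, so two unrelated protected updates do not block each other.

#include <omp.h>

double res_a = 0.0, res_b = 0.0;

// hypothetical independent accumulations
void update_a(double v) { res_a += v; }
void update_b(double v) { res_b += v; }

void accumulate(int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp critical (A_lock)   // serializes only against other A_lock regions
        update_a(1.0 * i);

        #pragma omp critical (B_lock)   // a different name: never waits on A_lock
        update_b(2.0 * i);
    }
}

int main(void) {
    accumulate(1000);
    return 0;
}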

71 OpenMP Reduction Clause
reduction (op : list) The variables in “list” must be shared in the enclosing parallel region Inside parallel or work-sharing construct: A PRIVATE copy of each list variable is created and initialized depending on the “op” These copies are updated locally by threads At end of construct, local copies are combined through “op” into a single value and combined with the value in the original SHARED variable Script: OpenMP* Reduction Clause is clause is used to combine an array of values into a single combined scalar value based on the “op” operation passed in the parameter list. The reduction clause is used for very common math operations on large quantities of data – called a reduction. A reduction “reduces” the data in a list or array down to a representative single value – a scalar value. For example, If I want to compute the sum of a list of numbers, I can “reduce” the list to the “sum” of the list. We’ll see a code example on the next foil Before jumping to the next foil lets take care of some business. The variables in “list” must be shared in the enclosing parallel region The reduction clause must be inside parallel or work-sharing construct: The way it works internally is that: A PRIVATE copy of each list variable is created and initialized depending on the “op” These copies are updated locally by threads At end of construct, local copies are combined through “op” into a single value and combined with the value in the original SHARED variable Now lets look at the example on the next foil Credit: IOMPP
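Before the next foil's example, here is a rough hand-written equivalent of what a reduction(+:sum) clause does internally – a sketch only; an actual OpenMP runtime may combine the per-thread copies differently.

#include <omp.h>

double reduce_sum(const double *a, int n) {
    double sum = 0.0;                  // the original shared variable
    #pragma omp parallel
    {
        double local = 0.0;            // per-thread private copy, initialized for "+"
        #pragma omp for
        for (int i = 0; i < n; i++)
            local += a[i];             // each thread accumulates into its own copy

        #pragma omp critical
        sum += local;                  // copies combined back into the shared variable
    }
    return sum;
}

With the clause, the same result is obtained by writing the loop exactly as shown on the next foil.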

72 Reduction Example Local copy of sum for each thread
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < N; i++) {
   sum += a[i] * b[i];
}

Local copy of sum for each thread
All local copies of sum added together and stored in "global" variable

Script: In this example, we are computing the sum of the product of two vectors. This is a reduction operation because I am taking an array or list or vector full of numbers and boiling the information down to a scalar. To use the reduction clause – I note that the basic reduction operation is an addition and that the reduction variable is sum. I add the following pragma: #pragma omp parallel for reduction(+:sum). The reduction will now apply a reduction to the variable sum based on the + operation. Internally – each thread sort of has its own copy of sum to add up partial sums. After all partial sums have been computed, all the local copies of sum are added together and stored in a "global" variable called sum accessible to the other threads but without a race condition. Following are some of the valid math operations that reductions are designed for.

Credit: IOMPP

73 OpenMP Reduction Example: Numerical Integration
We approximate π as the integral from 0 to 1 of f(x) = 4.0/(1+x²), i.e. ∫₀¹ 4.0/(1+x²) dx = π, by summing rectangles of width "step" under the curve on the interval [0, 1].

static long num_steps = 100000;
double step, pi;

void main() {
   int i;
   double x, sum = 0.0;

   step = 1.0/(double) num_steps;
   for (i = 0; i < num_steps; i++) {
      x = (i + 0.5)*step;
      sum = sum + 4.0/(1.0 + x*x);
   }
   pi = step * sum;
   printf("Pi = %f\n", pi);
}

Script: Next we will examine a numerical integration example, where the goal is to compute the value of Pi. We'll do this using the code for an integration. Animation 1 For the math geeks out there – in this example we are looking at the code to compute the value of Pi. It basically computes the integral from 0 to 1 of the function 4/(1+x^2). Some of you may remember from calculus that an antiderivative of this function is 4 * arctangent(x), which evaluated on the range 0–1 yields a value of Pi. Animation 2 To approximate the area under the curve – which approximates the integral – we will have small areas (num_steps of them) that we add up. Each area will be "step" wide. The height of the function will be approximated by 4/(1+x*x). Animation 3 The area will just be the sum of all these small areas. For the rest of us – we will trust that the math is right and just dig into the code. Here we have a loop that runs from 0 to num_steps. Inside the loop – we calculate the approximate location on the x axis of the small area: x = (i+0.5)*step. We calculate the height of the function at this value of x: 4/(1+x*x). Then we add the small area we just calculated to a running total of all the area we have computed up to this point. When the loop is done – we print the value of pi. So let's ask the big question – are there any race conditions in play here? Which variables should be marked private? Which variables should be marked shared? That is the subject of our next lab.

Credit: IOMPP

74 OpenMP Reduction Example: Numerical Integration
#include <stdio.h>
#include <stdlib.h>
#include "omp.h"

int main(int argc, char* argv[]) {
   int num_steps = atoi(argv[1]);
   double step = 1.0/(double)num_steps;
   double sum = 0.0;

   #pragma omp parallel for reduction(+:sum)
   for (int i = 0; i < num_steps; i++) {
      double x = (i + 0.5)*step;
      sum += 4.0/(1.0 + x*x);
   }

   double my_pi = sum*step;
   printf("Value of integral is: %f\n", my_pi);
   return 0;
}

75 OpenMP Reduction Example: Output
CodeBits]$ g++ testOMP.cpp -o me964.exe CodeBits]$ ./me964.exe Value of integral is:

76 C/C++ Reduction Operations
A range of associative operands can be used with reduction
Initial values are the ones that make sense mathematically

Operand   Initial Value
+         0
*         1
-         0
^         0

Operand   Initial Value
&         ~0
|         0
&&        1
||        0

Script: Above is a table of the C/C++ reduction operations. A range of associative operands can be used with reduction, including +, *, -, ^ (bitwise XOR), bitwise AND, bitwise OR, logical AND, and logical OR. Initial values are the ones that make sense mathematically – for the sum example we would want to start our sum naturally at 0; for a multiplication – we would want to start a product at 1. Let's look at a more complicated example.

Credit: IOMPP

77 Exercise: Variable Scoping Aspects
Consider parallelizing the following code

#include <omp.h>
#include <math.h>
#include <stdio.h>

void callee(int *x, int *y, int z) {
   int ii;
   static int cv = 0;
   cv++;
   for (ii = 1; ii < z; ii++) {
      *x = *x + *y + z;
   }
   printf("Value of counter: %d\n", cv);
}

void caller(int *a, int n) {
   int i, j, m = 3;
   for (i = 0; i < n; i++) {
      int k = m;
      for (j = 1; j <= 5; j++) {
         callee(&a[i], &k, j);
      }
   }
}

int main() {
   const int n = 20;
   int a[n];
   for (int i = 0; i < n; i++)
      a[i] = i;

   // this is the part that needs to be parallelized
   caller(a, n);

   for (int i = 0; i < n; i++)
      printf("a[%d]=%d\n", i, a[i]);

   return 0;
}

78 Program Output Looks good
The value of the counter increases each time you hit the "callee" subroutine
If you run the executable 20 times, you get the same results 20 times

79 First Attempt to Parallelize
Example obviously contrived, but it helps to understand the scope of the different variables

Var   Scope     Comment
a     shared    Declared outside parallel construct
n     shared    Declared outside parallel construct
i     private   Parallel loop index
j     shared    Declared outside parallel construct
m     shared    Constant decl. outside parallel construct
k     private   Automatic variable/parallel region
x     private   Passed by value
*x    shared    (actually a)
y     private   Passed by value
*y    private   (actually k)
z     private   Passed by value (actually j)
ii    private   Local stack variable in called function
cv    shared    Declared static (like global)

void callee(int *x, int *y, int z) {
   int ii;
   static int cv = 0;
   cv++;
   for (ii = 1; ii < z; ii++) {
      *x = *x + *y + z;
   }
   printf("Value of counter: %d\n", cv);
}

void caller(int *a, int n) {
   int i, j, m = 3;
   #pragma omp parallel for
   for (i = 0; i < n; i++) {
      int k = m;
      for (j = 1; j <= 5; j++) {
         callee(&a[i], &k, j);
      }
   }
}

80 Program Output, First Attempt to Parallelize
Looks bad…
The values in array "a" are all over the map
The value of the counter "cv" changes chaotically within "callee"
The function "callee" gets hit a random number of times (should be hit 100 times). Example:
$ parallelGood.exe | grep "Value of counter" | wc -l
$ 70
If you run the executable 20 times, you get different results
One of the problems is that "j" is shared

81 Second Attempt to Parallelize
Declare the inner loop variable "j" as a private variable within the parallel loop

void callee(int *x, int *y, int z) {
   int ii;
   static int cv = 0;
   cv++;
   for (ii = 1; ii < z; ii++) {
      *x = *x + *y + z;
   }
   printf("Value of counter: %d\n", cv);
}

void caller(int *a, int n) {
   int i, j, m = 3;
   #pragma omp parallel for private(j)
   for (i = 0; i < n; i++) {
      int k = m;
      for (j = 1; j <= 5; j++) {
         callee(&a[i], &k, j);
      }
   }
}

82 Program Output, Second Attempt to Parallelize
Looks better
The values in array "a" are correct
The value of the counter "cv" changes strangely within the "callee" subroutine
The function "callee" gets hit 100 times:
$ parallelGood.exe | grep "Value of counter" | wc -l
$ 100
If you run the executable 20 times, you get good results for "a", but the static variable will continue to behave strangely (it's shared)
Conclusion: be careful when you work with static or some other globally shared variable in parallel programming
In general, dealing with such variables is bad programming practice

83 Slightly Better Solution…
Declare the inner loop index "j" only inside the parallel segment
After all, it's only used there
You get rid of the "private" attribute, putting fewer constraints on the code and increasing the opportunity for code optimization at compile time

void callee(int *x, int *y, int z) {
   int ii;
   static int cv = 0;
   cv++;
   for (ii = 1; ii < z; ii++) {
      *x = *x + *y + z;
   }
   printf("Value of counter: %d\n", cv);
}

void caller(int *a, int n) {
   int i, m = 3;
   #pragma omp parallel for
   for (i = 0; i < n; i++) {
      int k = m;
      for (int j = 1; j <= 5; j++) {   // used here, so declare it here (common sense…)
         callee(&a[i], &k, j);
      }
   }
}

84 Program Output, Parallelized Code
Looks good
The values in array "a" are correct
The value of the counter "cv" changes strangely within the "callee" subroutine
The function "callee" gets hit 100 times:
$ parallelGood.exe | grep "Value of counter" | wc -l
$ 100
If you run the executable 20 times, you get good results for "a", but the static variable will continue to behave strangely (it's shared)
What surprised me: the value of the counter was indeed 100
In other words, although shared, no trashing of this variable…
Still don't understand why this is the case (it's surprising it works like this…)
In fact cv++ remains a data race – a count of 100 only means that no two increments happened to collide in these runs; it is not guaranteed to come out that way

85 OpenMP API

86 OpenMP: The 30,000 Feet Perspective

87 The OpenMP API
The Application Programmer Interface (API) is a combination of:
Directives – Example: #pragma omp task
Runtime library routines – Example: int omp_get_thread_num(void)
Environment variables – Example: setenv OMP_SCHEDULE "guided, 4"

[R. Harman-Baker]→
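A tiny sketch showing the three pieces working together (the program itself is illustrative):

#include <stdio.h>
#include <omp.h>

// Directive + runtime routine; the thread count can be set from the
// environment, e.g.:  OMP_NUM_THREADS=4 ./a.out
int main(void) {
    #pragma omp parallel                       // directive
    {
        int tid = omp_get_thread_num();        // runtime library routine
        printf("hello from thread %d of %d\n", tid, omp_get_num_threads());
    }
    return 0;
}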

88 The OpenMP API Directives
Directives (or Pragmas) used to
Express/Define parallelism (flow control) – Example: #pragma omp parallel for
Specify data sharing among threads (communication) – Example: #pragma omp parallel for private(x,y)
Synchronization (coordination or interaction) – Example: #pragma omp barrier

[R. Harman-Baker]→

89 OpenMP 3.1: Summary of Run-Time Library OpenMP Routines
1. omp_set_num_threads 2. omp_get_num_threads 3. omp_get_max_threads 4. omp_get_thread_num 5. omp_get_thread_limit 6. omp_get_num_procs 7. omp_in_parallel 8. omp_set_dynamic 9. omp_get_dynamic 10. omp_set_nested 11. omp_get_nested 12. omp_set_schedule 13. omp_get_schedule 14. omp_set_max_active_levels 15. omp_get_max_active_levels 16. omp_get_level 17. omp_get_ancestor_thread_num 18. omp_get_team_size 19. omp_get_active_level 20. omp_init_lock 21. omp_destroy_lock 22. omp_set_lock 23. omp_unset_lock 24. omp_test_lock 25. omp_init_nest_lock 26. omp_destroy_nest_lock 27. omp_set_nest_lock 28. omp_unset_nest_lock 29. omp_test_nest_lock 30. omp_get_wtime 31. omp_get_wtick
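As a quick illustration, a sketch that uses two of these routines (omp_get_wtime and omp_get_wtick) to time a parallel loop; the loop body is just placeholder work:

#include <stdio.h>
#include <omp.h>

int main(void) {
    double start = omp_get_wtime();            // wall-clock time, in seconds

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 100000000; i++)        // placeholder work
        sum += 1.0 / (i + 1.0);

    double elapsed = omp_get_wtime() - start;
    printf("sum = %f, elapsed = %f s (timer resolution = %g s)\n",
           sum, elapsed, omp_get_wtick());
    return 0;
}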

90

#include <stdio.h>
#include <omp.h>

int main() {
   omp_set_num_threads(4);
   printf_s("First call: %d\n", omp_get_num_threads());

   #pragma omp parallel
   #pragma omp master
   {
      printf_s("Second call: %d\n", omp_get_num_threads());
   }

   printf_s("Third call: %d\n", omp_get_num_threads());

   #pragma omp parallel num_threads(3)
   printf_s("Fourth call: %d\n", omp_get_num_threads());

   printf_s("Last call: %d\n", omp_get_num_threads());
   return 0;
}

91 OpenMP: Environment Variables
OMP_SCHEDULE – Example: setenv OMP_SCHEDULE "guided, 4"
OMP_NUM_THREADS – Sets the maximum number of threads to use during execution. Example: setenv OMP_NUM_THREADS 8
OMP_DYNAMIC – Enables or disables dynamic adjustment of the number of threads available for execution of parallel regions. Valid values are TRUE or FALSE. Example: setenv OMP_DYNAMIC TRUE
OMP_NESTED – Enables or disables nested parallelism. Valid values are TRUE or FALSE. Example: setenv OMP_NESTED TRUE

92 OpenMP: Environment Variables [recent ones, added as of 3.0]
OMP_STACKSIZE – Controls the size [in KB by default] of the stack for created (non-master) threads
OMP_WAIT_POLICY – Provides a hint to an OpenMP implementation about the desired behavior of waiting threads
OMP_MAX_ACTIVE_LEVELS – Controls the maximum number of nested active parallel regions. The value of this environment variable must be a non-negative integer. Example: setenv OMP_MAX_ACTIVE_LEVELS 2
OMP_THREAD_LIMIT – Sets the maximum number of OpenMP threads to use for the whole OpenMP program. Example: setenv OMP_THREAD_LIMIT 8

93 OpenMP Concluding Remarks & Wrap-up

94 OpenMP Summary Shared memory, thread-based parallelism
Explicit parallelism (relies on you specifying parallel regions)
Fork/join model
Industry-standard shared memory programming model
First version released in 1997
OpenMP Architecture Review Board (ARB) determines updates to the standard
The final specification of Version 3.1 released in July of 2011 (minor update)

[R. Harman-Baker]→

95 OpenMP, Summary OpenMP provides small yet versatile programming model
This model serves as the inspiration for the OpenACC effort to standardize approaches that can factor in the presence of a GPU accelerator
Not at all intrusive, very straightforward to parallelize existing code
Good efficiency gains are achieved by using parallel regions in an existing code
Work-sharing constructs (for, sections, task) enable parallelization of computationally intensive portions of a program

96 Attractive Features of OpenMP
Parallelize small parts of application, one at a time (beginning with most time-critical parts)
Can implement complex algorithms
Code size grows only modestly
Expression of parallelism flows clearly, code is easy to read
Single source code for OpenMP and non-OpenMP
Non-OpenMP compilers simply ignore OMP directives

[R. Harman-Baker]→

97 OpenMP, Some Caveats
It seems that the vendors lag behind when it comes to support of the latest OpenMP specifications
Intel probably is most up to speed, although I haven't used their compilers
OpenMP threads are heavy
Very good for handling parallel tasks
Not particularly remarkable at handling fine-grain data parallelism (vector architectures excel here)

