
1 Mohsan Jameel, Department of Computing, NUST School of Electrical Engineering and Computer Science

2 Outline I. Introduction to OpenMP II. OpenMP Programming Model III. OpenMP Directives IV. OpenMP Clauses V. Run-Time Library Routines VI. Environment Variables VII. Summary

3 What is OpenMP An application program interface (API) used to explicitly direct multi-threaded, shared-memory parallelism. It consists of: compiler directives, run-time routines, and environment variables. The specification is maintained by the OpenMP Architecture Review Board (http://www.openmp.org). Version 3.0 was released in May 2008.

4 What OpenMP is Not Not automatic parallelization: the user explicitly specifies parallel execution, and the compiler does not ignore user directives even when they are wrong. Not just loop-level parallelism: it provides functionality to enable coarse-grained parallelism. Not meant for distributed-memory parallel systems. Not necessarily implemented identically by all vendors. Not guaranteed to make the most efficient use of shared memory.

5 History of OpenMP In the early 90's, vendors of shared-memory machines supplied similar, directive-based Fortran programming extensions: the user would augment a serial Fortran program with directives specifying which loops were to be parallelized. The first attempt at a standard was the draft for ANSI X3H5 in 1994. It was never adopted, largely due to waning interest as distributed-memory machines became popular. The OpenMP standardization effort started in the spring of 1997, taking over where ANSI X3H5 had left off, as newer shared-memory machine architectures started to become prevalent.

6 Goals of OpenMP Standardization: provide a standard among a variety of shared-memory architectures/platforms. Lean and mean: establish a simple and limited set of directives for programming shared-memory machines. Ease of use: provide the capability to incrementally parallelize a serial program, and to implement both coarse-grain and fine-grain parallelism. Portability: support Fortran (77, 90, and 95), C, and C++.

7 Outline I. Introduction to OpenMP II. OpenMP Programming Model III. OpenMP Directives IV. OpenMP Clauses V. Run-Time Library Routines VI. Environment Variables VII. Summary

8 OpenMP Programming Model Thread-based parallelism; explicit parallelism; compiler-directive based; dynamic threads; nested parallelism support; task parallelism support (OpenMP specification 3.0)

9 Shared Memory Model

10 Execution Model (fork-join figure: the master thread has ID=0; worker threads have ID=1,2,3,…,N-1)

11 Terminology An OpenMP team = master + workers. A parallel region is a block of code executed by all threads simultaneously. The master thread always has thread ID = 0. Thread adjustment is done before entering a parallel region. An "if" clause can be used with the parallel construct; in case the condition evaluates to FALSE, the parallel region is skipped and the code runs serially. Work-sharing constructs are responsible for dividing work among the threads in a parallel region.
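
A minimal sketch of the "if" clause (the variable name and the threshold 100 are assumptions for illustration): with n = 50 the condition is false, so the region executes serially on a single thread.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int n = 50;   /* problem size; parallelism only pays off for large n */

    /* The region runs in parallel only when the condition is TRUE;
       otherwise the enclosed code executes on a team of one thread. */
    #pragma omp parallel if (n > 100)
    {
        printf("team size = %d\n", omp_get_num_threads());
    }
    return 0;
}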

12 Example OpenMP Code Structure

13 Components of OpenMP

14 I. Introduction to OpenMP II. OpenMP Programming Model III. OpenMP Directives IV. OpenMP Clauses V. Run-Time Library Routines VI. Environment Variables VII. Summary

15 Go to helloworld.c
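
The helloworld.c file itself is not reproduced in the transcript; below is a minimal sketch of what it plausibly contains, written to be consistent with the output shown on the next slide (treat the exact contents as an assumption).

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Fork a team of threads; each thread executes the block below. */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();   /* this thread's ID */
        printf("Hello world from thread = %d\n", tid);
        if (tid == 0)                     /* master thread only */
            printf("Number of threads = %d\n", omp_get_num_threads());
    }   /* join: the team disbands here */
    return 0;
}

Compile with an OpenMP-aware compiler, e.g. gcc -fopenmp helloworld.c.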

16 Parallel Region Example (Fortran form of the directive pair; in C/C++ the equivalent is #pragma omp parallel { ... })
!$OMP PARALLEL
write (*,*) "Hello"
!$OMP END PARALLEL
Sample output from a helloworld.c run with three threads (thread 0, thread 1, thread 2); the lines may interleave nondeterministically:
Hello world from thread = 0
Number of threads = 3
Hello world from thread = 1
Hello world from thread = 2

17 OpenMP Directives

18 OpenMP Scoping Static extent: the code textually enclosed between the beginning and end of a structured block; the static extent does not span other routines. Orphaned directive: an OpenMP directive that appears outside the static extent of a parallel region, e.g. in a separately defined routine. Dynamic extent: includes both the static extent and any orphaned directives.
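
A minimal sketch of an orphaned directive (the function and array names are assumptions): the "omp for" below sits in a routine outside the static extent of the parallel region that eventually calls it, so it belongs to that region's dynamic extent.

#include <omp.h>

#define N 100

/* Orphaned directive: this "omp for" appears outside any parallel
   region; it binds to whichever parallel region is active when the
   function is called. */
void scale(double *a, double factor) {
    int i;
    #pragma omp for
    for (i = 0; i < N; i++)
        a[i] *= factor;
}

int main(void) {
    double a[N];
    int i;
    for (i = 0; i < N; i++) a[i] = i;

    #pragma omp parallel
    {
        scale(a, 2.0);   /* the orphaned loop is shared among the team */
    }
    return 0;
}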

19 OpenMP Parallel Regions A block of code that will be executed by multiple threads. Properties: fork-join model; the number of threads won't change inside a parallel region; SPMD execution within the region; the enclosed block of code must be structured, with no branching into or out of the block. Format: #pragma omp parallel clause1 clause2 …

20 OpenMP Threads How many threads? Determined by: use of the omp_set_num_threads() library function; the setting of the OMP_NUM_THREADS environment variable; or the implementation default. Dynamic threads: by default, the same number of threads is used to execute each parallel region. Two methods for enabling dynamic threads: use of the omp_set_dynamic() library function, or setting of the OMP_DYNAMIC environment variable.
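
A minimal sketch of the three ways to set the thread count (the value 4 is an arbitrary assumption):

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* Method 1: library call (takes precedence over the environment). */
    omp_set_num_threads(4);
    /* Method 2: environment variable, e.g.  export OMP_NUM_THREADS=4  */
    /* Method 3: implementation default, often one thread per core.    */

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)   /* master thread reports */
            printf("Number of threads = %d\n", omp_get_num_threads());
    }
    return 0;
}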

21 OpenMP Work-sharing Constructs: for (data parallelism), sections (functional parallelism), single (serialize a section)

22 Example: Count 3s in an array Let's assume we have an array of N integers, and we want to find how many 3s are in the array. We need a for loop, an if statement, and a count variable. Let's look at the serial and parallel versions.

23 Serial: Count 3s in an array
int i, count = 0, n = 100;
int array[n];
// initialize array
for (i = 0; i < n; i++) {
    if (array[i] == 3) count++;
}

24 Work-sharing construct: "for loop" The "for loop" work-sharing construct can be thought of as the data-parallelism construct.

25 Parallelize, 1st attempt: Count 3s in an array
int i, count = 0, n = 100;
int array[n];
// initialize array
#pragma omp parallel for default(none) shared(n,array,count) private(i)
for (i = 0; i < n; i++) {
    if (array[i] == 3) count++;   // data race: all threads update count
}

26 Work-sharing construct: example of "for loop"
#pragma omp parallel for default(none) shared(n,a,b,c) private(i)
for (i = 0; i < n; i++) {
    c[i] = a[i] + b[i];
}

27 Work-sharing construct: "sections" The "sections" work-sharing construct can be thought of as the functional-parallelism construct.

28 Parallelize, 2nd attempt: Count 3s in an array Say we also want to count 4s in the same array; now we have two different functions, i.e. count 3s and count 4s.
int i, count3 = 0, count4 = 0, n = 100;
int array[n];
// initialize array
#pragma omp parallel sections default(none) shared(n,array,count3,count4) private(i)
{
    #pragma omp section
    for (i = 0; i < n; i++) {
        if (array[i] == 3) count3++;
    }
    #pragma omp section
    for (i = 0; i < n; i++) {
        if (array[i] == 4) count4++;
    }
}
No data race condition in this example. WHY? (Each counter is updated by only one thread, since each section is executed by a single thread.)

29 Work-sharing construct: Example 1 of "sections"
#pragma omp parallel sections default(none) shared(a,b,c,d,e,n) private(i)
{
    #pragma omp section
    {
        printf("Thread %d executes 1st loop\n", omp_get_thread_num());
        for (i = 0; i < n; i++) a[i] = 3 * b[i];
    }
    #pragma omp section
    {
        printf("Thread %d executes 2nd loop\n", omp_get_thread_num());
        for (i = 0; i < n; i++) e[i] = 2 * c[i] + d[i];
    }
} /* end of parallel sections */
final_sum = sum(a,n) + sum(e,n);
printf("FINAL_SUM is %d\n", final_sum);

30 Work-sharing construct: Example 2 of "sections" (1/2)

31 Work-sharing construct: Example 2 of "sections" (2/2)

32 Work-sharing construct: example of "single" Within a parallel region, a "single" block specifies that the block is executed by only one thread in the team. Let's look at an example, sketched below.
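
A minimal sketch of "single" (the array, its size, and the scale variable are assumptions): one thread performs the initialization, and the implicit barrier at the end of the single block guarantees all threads see the result before the loop.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int i, n = 8, scale = 0;
    int a[8];

    #pragma omp parallel shared(a, n, scale) private(i)
    {
        /* Executed by exactly one thread; the others wait at the
           implicit barrier at the end of the single block. */
        #pragma omp single
        {
            scale = 2;
            printf("single executed by thread %d\n", omp_get_thread_num());
        }

        #pragma omp for   /* all threads see scale == 2 here */
        for (i = 0; i < n; i++)
            a[i] = scale * i;
    }
    return 0;
}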

33 I. Introduction to OpenMP II. OpenMP Programming Model III. OpenMP Directives IV. OpenMP Clauses V. Run-Time Library Routines VI. Environment Variables VII. Summary

34 OpenMP Clauses: Data Sharing (1/2) shared(list): the shared clause specifies which data is shared among threads; all threads can read and write a shared variable. By default, all variables are shared. private(list): private variables are local to each thread. A typical example of a private variable is a loop counter, since each thread has its own loop counter, initialized at the entry point.

35 OpenMP Clauses: Data Sharing (2/2) A private variable is defined between the entry and exit points of the parallel region, and has no scope outside it. The firstprivate and lastprivate clauses are used to extend the scope of a variable beyond the parallel region. firstprivate: all variables in the list are initialized with the value the original object had before entering the parallel region. lastprivate: the thread that executes the last iteration or section updates the value of the original object.

36 Example: firstprivate and lastprivate
int main(){
    int i, n = 10, C, B, A = 10;   /* n was undeclared on the slide; 10 is an arbitrary value */
    /*--- Start of parallel region ---*/
    #pragma omp parallel for default(none) shared(n) firstprivate(A) lastprivate(B) private(i)
    for (i = 0; i < n; i++) {
        …
        B = i + A;   /* each thread starts with its own copy of A == 10 */
        …
    }
    /*--- End of parallel region ---*/
    C = B;   /* B holds the value from the last iteration (i == n-1) */
}

37 OpenMP Clause: nowait The nowait clause is used to avoid the implicit synchronization (barrier) at the end of a work-sharing directive.
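
A minimal sketch of nowait (the array names are assumptions): threads that finish the first loop move straight on to the second instead of waiting at the barrier, which is safe here because the two loops touch independent arrays.

#define N 1000

int main(void) {
    double a[N], b[N];
    int i;

    #pragma omp parallel shared(a, b) private(i)
    {
        #pragma omp for nowait   /* no barrier at the end of this loop */
        for (i = 0; i < N; i++)
            a[i] = i * 1.0;

        #pragma omp for          /* independent work: safe to start early */
        for (i = 0; i < N; i++)
            b[i] = i * 2.0;
    }
    return 0;
}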

38 OpenMP Clause: schedule The schedule clause is supported on the loop construct only, and controls the manner in which loop iterations are distributed over the threads. Syntax: schedule(kind[,chunk_size]). Kinds: static[,chunk]: distribute iterations in blocks of size "chunk" over the threads in round-robin fashion. dynamic[,chunk]: fixed portions of work, whose size is controlled by the value of chunk; when a thread finishes its portion it starts on the next one. guided[,chunk]: same as dynamic, but the size of the portion of work decreases exponentially. runtime: the iteration scheduling scheme is set at run time through the environment variable OMP_SCHEDULE.

39 The Experiment with the schedule clause
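
The experiment itself is shown as a figure in the slides; below is a sketch of the kind of code behind it (the chunk size and loop body are assumptions). Swapping "static, 2" for "dynamic, 2" or "guided" shows how the iteration-to-thread mapping changes.

#include <stdio.h>
#include <omp.h>

#define N 16

int main(void) {
    int i;
    /* Print which thread executes each iteration under the chosen schedule. */
    #pragma omp parallel for schedule(static, 2)
    for (i = 0; i < N; i++)
        printf("iteration %2d -> thread %d\n", i, omp_get_thread_num());
    return 0;
}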

40 OpenMP Critical Construct Example: summation of a vector, with a race condition.
int main(){
    int i, sum = 0, n = 5;
    int a[5] = {1,2,3,4,5};
    /*--- Start of parallel region ---*/
    #pragma omp parallel for default(none) shared(sum,a,n) private(i)
    for (i = 0; i < n; i++) {
        sum += a[i];   /* race condition: concurrent updates to sum */
    }
    /*--- End of parallel region ---*/
    printf("sum of vector a = %d", sum);
}

41 OpenMP Critical Construct
int main(){
    int i, sum = 0, local_sum, n = 5;
    int a[5] = {1,2,3,4,5};
    /*--- Start of parallel region ---*/
    #pragma omp parallel default(none) shared(sum,a,n) private(local_sum,i)
    {
        local_sum = 0;   /* private variables start uninitialized */
        #pragma omp for
        for (i = 0; i < n; i++) {
            local_sum += a[i];
        }
        #pragma omp critical
        {
            sum += local_sum;   /* one thread at a time */
        }
    } /*--- End of parallel region ---*/
    printf("sum of vector a = %d", sum);
}

42 Parallelize, 3rd attempt: Count 3s in an array
int i, count = 0, local_count, n = 100;
int array[n];
// initialize array
#pragma omp parallel default(none) shared(n,array,count) private(i,local_count)
{
    local_count = 0;
    #pragma omp for   /* not "parallel for": we are already inside a parallel region */
    for (i = 0; i < n; i++) {
        if (array[i] == 3) local_count++;
    }
    #pragma omp critical
    {
        count += local_count;
    }
} /*--- End of parallel region ---*/

43 OpenMP Clause: reduction OpenMP provides a reduction clause, used with the for-loop and sections directives. The reduction variable must be shared among threads; the race condition is avoided implicitly.
int main(){
    int i, sum = 0, n = 5;
    int a[5] = {1,2,3,4,5};
    /*--- Start of parallel region ---*/
    #pragma omp parallel for default(none) shared(a,n) private(i) \
            reduction(+:sum)
    for (i = 0; i < n; i++) {
        sum += a[i];
    }
    /*--- End of parallel region ---*/
    printf("sum of vector a = %d", sum);
}

44 Parallelize, 4th attempt: Count 3s in an array
int i, count = 0, n = 100;
int array[n];
// initialize array
#pragma omp parallel for default(none) shared(n,array) private(i) \
        reduction(+:count)
for (i = 0; i < n; i++) {
    if (array[i] == 3) count++;
}
/*--- End of parallel region ---*/

45 Tasking in OpenMP

46 Tasking in OpenMP In OpenMP 3.0, the concept of tasks was added to the OpenMP execution model. The task model is useful in cases where the number of parallel pieces and the work involved in each piece vary and/or are unknown. Before the inclusion of the task model, OpenMP was not well suited to unstructured problems. Tasks are often set up within a single construct, in a manager-worker model.

47 Task Parallelism Approach (1/2) Threads line up as workers, go through the queue of work to be done, and do a task. Threads do not wait as in loop parallelism; rather, they go back to the queue and do more tasks. Each task is executed serially by the worker thread that encounters it in the queue. Load balancing occurs because short and long tasks are picked up as threads become available.

48 Task Parallelism Approach (2/2)

49 Example: Task parallelism
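
The example slide is an image in the transcript; below is a minimal sketch of the manager-worker pattern with tasks. The linked-list traversal is a common illustration of tasking, not necessarily the one on the slide, and all names are assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

typedef struct node {
    int value;
    struct node *next;
} node;

void process(node *p) {
    printf("node %d processed by thread %d\n", p->value, omp_get_thread_num());
}

int main(void) {
    /* Build a small list: 0 -> 1 -> 2 -> 3 -> 4 */
    node *head = NULL;
    for (int v = 4; v >= 0; v--) {
        node *p = malloc(sizeof(node));
        p->value = v; p->next = head; head = p;
    }

    #pragma omp parallel
    {
        /* One thread (the manager) creates the tasks ... */
        #pragma omp single
        for (node *p = head; p != NULL; p = p->next) {
            #pragma omp task firstprivate(p)   /* ... the team executes them */
            process(p);
        }
    }   /* implicit barrier: all tasks complete here */
    return 0;
}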

50 Best Practices Optimize barrier use. Avoid the ordered construct. Avoid large critical regions. Maximize the extent of parallel regions. Avoid repeatedly entering and leaving parallel regions. Address poor load balance.

51 I. Introduction to OpenMP II. OpenMP Programming Model III. OpenMP Directives IV. OpenMP Clauses V. Run-Time Library Routines VI. Environment Variables VII. Summary

52 List of Run-Time Library Routines Run-time library routines are declared in the omp.h header file:
void omp_set_num_threads(int num);
int omp_get_num_threads();
int omp_get_max_threads();
int omp_get_thread_num();
int omp_get_thread_limit();
int omp_get_num_procs();
double omp_get_wtime();
int omp_in_parallel();   // returns 0 for false, non-zero for true
…and a few more.
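
A minimal usage sketch of a few of these routines, timing a dummy reduction loop (the workload is an assumption):

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("procs available : %d\n", omp_get_num_procs());
    printf("max threads     : %d\n", omp_get_max_threads());
    printf("in parallel?    : %d\n", omp_in_parallel());   /* 0 here: serial code */

    double t0 = omp_get_wtime();   /* wall-clock timer */
    double sum = 0.0;
    int i;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < 10000000; i++)
        sum += i * 0.5;
    double t1 = omp_get_wtime();

    printf("sum = %.1f, elapsed = %.3f s\n", sum, t1 - t0);
    return 0;
}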

53 More Run-Time Library Routines These routines are new with OpenMP 3.0.

54 I. Introduction to OpenMP II. OpenMP Programming Model III. OpenMP Directives IV. OpenMP Clauses V. Run-Time Library Routines VI. Environment Variables VII. Summary

55 Environment Variables OMP_NUM_THREADS OMP_DYNAMIC OMP_THREAD_LIMIT OMP_STACKSIZE
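
Typical shell usage for these variables (a sketch; the values are arbitrary assumptions):

export OMP_NUM_THREADS=4     # size of the thread team
export OMP_DYNAMIC=TRUE      # allow the runtime to adjust the team size
export OMP_THREAD_LIMIT=8    # upper bound on threads (OpenMP 3.0)
export OMP_STACKSIZE=16M     # per-thread stack size (OpenMP 3.0)
./a.out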

56 I. Introduction to OpenMP II. OpenMP Programming Model III. OpenMP Directives IV. OpenMP Clauses V. Run-Time Library Routines VI. Environment Variables VII. Summary

57 Summary OpenMP provides a small yet powerful programming model. Compilers with OpenMP support are widely available. OpenMP is a directive-based, shared-memory programming model. The OpenMP API is a general-purpose parallel programming API, with emphasis on the ability to parallelize existing programs. Scalable parallel programs can be written using parallel regions. Work-sharing constructs enable efficient parallelization of the computationally intensive portions of a program.

58 Thank You and Exercise Session

