OpenMP

Shared Memory Architecture

[Diagram: several processors connected through a common bus to a shared memory]

OpenMP

Portable programming of shared memory systems; a quasi-standard maintained by the OpenMP Forum (1997-2009). An API for Fortran and C/C++ consisting of:
directives
runtime routines
environment variables
www.openmp.org

Example Program Execution

#include <omp.h>
#include <stdio.h>
int main() {
  #pragma omp parallel
  {
    printf("Hello world\n");
  }
}

Compilation:
> icc -O3 -openmp openmp.c

Execution:
> export OMP_NUM_THREADS=2
> a.out
Hello world
Hello world
> export OMP_NUM_THREADS=3
> a.out
Hello world
Hello world
Hello world

Execution Model

#pragma omp parallel
{
  printf("Hello world %d\n", omp_get_thread_num());
}

[Diagram: the master thread T0 reaches the parallel directive, a thread team T0/T1/T2 is created, each thread prints, the team is destroyed at the closing brace, and T0 continues alone]

Fork/Join Execution Model

An OpenMP program starts as a single thread (the master thread). Additional threads (the team) are created when the master hits a parallel region. When all threads have finished the parallel region, the extra threads are returned to the runtime or operating system. A team consists of a fixed set of threads executing the parallel region redundantly. All threads in the team are synchronized at the end of a parallel region via a barrier. The master continues after the parallel region.

Work Sharing in a Parallel Region

main() {
  int a[100], n = 100;
  #pragma omp parallel
  {
    #pragma omp for
    for (int i = 1; i < n; i++)
      a[i] = i;
    ...
  }
}

Shared and Private Data

Shared data are accessible by all threads; a reference such as a[5] to a shared array accesses the same address in all threads. Private data are accessible only by a single thread; each thread has its own copy. The default is shared.

Private Clause for a Parallel Loop

main() {
  int a[100], t, n = 100;
  #pragma omp parallel
  {
    #pragma omp for private(t)
    for (int i = 1; i < n; i++) {
      t = f(i);
      a[i] = t;
    }
  }
}

Example: Private Data

i = 3;
#pragma omp parallel private(i)
{
  i = 17;
}
printf("Value of i=%d\n", i);

[Diagram: the shared i holds 3 before and after the region; inside the region each thread's private copy i1, i2, i3 holds 17]

Example

main() {
  int iam, nthreads;
  #pragma omp parallel private(iam, nthreads)
  {
    iam = omp_get_thread_num();
    nthreads = omp_get_num_threads();
    printf("ThreadID %d, out of %d threads\n", iam, nthreads);
    if (iam == 0)   // different control flow
      printf("Here is the Master Thread.\n");
    else
      printf("Here is another thread.\n");
  }
}

Private Data

A new copy is created for each thread. One thread may reuse the global shared copy. The private copies are destroyed after the parallel region; the value of the shared copy is then undefined.

Example: Shared Data

i = 77;
#pragma omp parallel shared(i)
{
  i = omp_get_thread_num();
}
printf("Value of i=%d\n", i);

[Diagram: i holds 77 before the region; in the parallel region all threads write their thread numbers 0, 1, 2 to the shared i; afterwards i holds whichever value was written last]

Syntax of Directives and Pragmas

Fortran:
!$OMP directive name [parameters]

!$OMP PARALLEL DEFAULT(SHARED)
   write(*,*) 'Hello world'
!$OMP END PARALLEL

C/C++:
#pragma omp directive name [parameters]

int main() {
  #pragma omp parallel default(shared)
  {
    printf("hello world\n");
  }
}

Directives

Directives can have continuation lines.

Fortran:
!$OMP directive name first_part &
!$OMP continuation_part

C:
#pragma omp parallel private(i) \
                     private(j)

Parallel Region

#pragma omp parallel [parameters]
{
  parallel region
}

The statements enclosed lexically within a region define the lexical extent of the region. The dynamic extent further includes the routines called from within the construct.

Lexical and Dynamic Extent

main() {
  int a[100];
  #pragma omp parallel
  {
    ...
  }
}

sub(int a[]) {
  #pragma omp for
  for (int i = 1; i < n; i++)
    a[i] = i;
}

Local variables of a subroutine called in a parallel region are by default private.

Work-Sharing Constructs

Work-sharing constructs distribute the specified work to all threads within the current team.

Types:
parallel loop
parallel section
master region
single region
general work-sharing construct (Fortran only)

Parallel Loop

#pragma omp for [parameters]
for ...

The iterations of the loop are distributed to the threads. The scheduling of loop iterations is determined by one of the scheduling strategies static, dynamic, guided, and runtime. There is no synchronization at the beginning. All threads of the team synchronize at an implicit barrier unless the nowait parameter is specified. The loop variable is private by default and must not be modified in the loop body. The expressions in the for statement are very restricted. A sketch of nowait follows below.
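
A minimal sketch of the nowait parameter (f and g are placeholder functions, the array sizes are illustrative): because the second loop never reads a[], the barrier between the two loops can safely be dropped.

#include <omp.h>

double a[1000], b[1000];
double f(int i) { return i * 0.5; }   /* placeholder */
double g(int i) { return i * 2.0; }   /* placeholder */

int main() {
  #pragma omp parallel
  {
    #pragma omp for nowait
    for (int i = 0; i < 1000; i++)
      a[i] = f(i);               /* threads do not wait at the end of this loop */
    #pragma omp for
    for (int i = 0; i < 1000; i++)
      b[i] = g(i);               /* independent of a[], so nowait above is safe */
  }
  return 0;
}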

Scheduling Strategies

schedule(type [, size])

Scheduling types:
static: Chunks of the specified size are assigned in a round-robin fashion to the threads.
dynamic: The iterations are broken into chunks of the specified size. When a thread finishes the execution of a chunk, the next chunk is assigned to that thread.
guided: Similar to dynamic, but the size of the chunks decreases exponentially. The size parameter specifies the smallest chunk; the initial chunk size is implementation dependent. Guided scheduling gives good load balancing with comparatively little overhead.
runtime: The scheduling type and the chunk size are determined via environment variables.
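
A small sketch (sizes and names are illustrative, not from the original slides) that makes a schedule visible by recording which thread executed each iteration; swapping the schedule clause shows how the assignment changes.

#include <omp.h>
#include <stdio.h>

int main() {
  int owner[16];
  #pragma omp parallel for schedule(static, 4)
  for (int i = 0; i < 16; i++)
    owner[i] = omp_get_thread_num();   /* record the executing thread */
  for (int i = 0; i < 16; i++)
    printf("iteration %2d -> thread %d\n", i, owner[i]);
  return 0;
}

With schedule(static, 4) and four threads, iterations 0-3 go to thread 0, 4-7 to thread 1, and so on.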

Example: Dynamic Scheduling

main() {
  int i, a[1000];
  #pragma omp parallel
  {
    #pragma omp for schedule(dynamic, 4)
    for (i = 0; i < 1000; i++)
      a[i] = omp_get_thread_num();

    #pragma omp for schedule(guided)
    for (i = 0; i < 1000; i++)
      a[i] = omp_get_thread_num();
  }
}

Reductions

reduction(operator: list)

This clause performs a reduction on the variables that appear in list, with the operator operator. Variables must be shared scalars. operator is one of: +, *, -, &, ^, |, &&, ||.

A reduction variable may only appear in statements of the following forms:
x = x operator expr
x binop= expr
x++, ++x, x--, --x
where binop is +, *, -, &, ^, or |. (& is bitwise AND, && logical AND, ^ bitwise exclusive or, | bitwise or.)

Example: Reduction

#pragma omp parallel for reduction(+: a)
for (i = 0; i < n; i++) {
  a = a + b[i];
}

Classification of Variables

private(var-list): Variables in var-list are private.
shared(var-list): Variables in var-list are shared.
default(private | shared | none): Sets the default for all variables in this region.
firstprivate(var-list): Variables are private and are initialized with the value of the shared copy before the region.
lastprivate(var-list): Variables are private, and the value from the thread executing the last iteration of a parallel loop in sequential order is copied to the variable outside of the region.

Scoping Variables with the Private Clause

int i, j;
i = 1;
j = 2;
#pragma omp parallel private(i) firstprivate(j)
{
  i = 3;
  j = j + 2;
  printf("%d %d\n", i, j);
}

The values of the shared copies of i and j are undefined on exit from the parallel region. The private copies of j are initialized to 2 in the parallel region.

Parallel Section

#pragma omp sections [parameters]
{
  [#pragma omp section]
  block
  [#pragma omp section
  block]
}

Each section of a parallel sections construct is executed once by one thread of the team. Threads that have finished their section wait at the implicit barrier at the end of the sections construct. With lastprivate, the value from the lexically last section is used.

Example: Parallel Section

main() {
  int i, a[1000], b[1000];
  #pragma omp parallel private(i)
  {
    #pragma omp sections
    {
      #pragma omp section
      for (i = 0; i < 1000; i++)
        a[i] = 100;
      #pragma omp section
      for (i = 0; i < 1000; i++)
        b[i] = 200;
    }
  }
}

OMP Workshare (Fortran only)

!$OMP WORKSHARE [parameters]
  block
!$OMP END WORKSHARE [NOWAIT]

The WORKSHARE directive divides the work of executing the enclosed code into separate units of work and distributes the units amongst the threads. An implementation of the WORKSHARE directive must insert any synchronization that is required to maintain standard Fortran semantics. There is an implicit barrier at the end of the workshare region.

Sharing Work in a Fortran 90 Array Statement

A(1:N) = B(2:N+1) + C(1:N)

Each evaluation of the array expression for an individual index is a unit of work. The assignment to an individual array element is also a unit of work.

Master / Single Region

#pragma omp master
  block

#pragma omp single [parameters]
  block

A master or single region enforces that only a single thread executes the enclosed code within a parallel region.

Common: no synchronization at the beginning of the region.

Differences: The master region is executed by the master thread, while the single region can be executed by any thread. The master region is skipped by the other threads, while all threads are synchronized at the end of a single region. The copyprivate parameter (single only): variables must be private in the enclosing parallel region; the value from the executing thread is copied to all other threads. A sketch contrasting the two follows below.
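
A minimal sketch (output wording is illustrative) contrasting single and master in one parallel region:

#include <omp.h>
#include <stdio.h>

int main() {
  #pragma omp parallel
  {
    #pragma omp single
    printf("setup by thread %d\n", omp_get_thread_num());
    /* single: any one thread runs this; the others wait at its implicit barrier */

    #pragma omp master
    printf("progress report by the master\n");
    /* master: only thread 0 runs this; no barrier, the others continue immediately */
  }
  return 0;
}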

Combined Work-Sharing and Parallel Constructs

#pragma omp parallel for
#pragma omp parallel sections
!$OMP PARALLEL WORKSHARE

Barrier

#pragma omp barrier

The barrier synchronizes all the threads in a team. When encountered, each thread waits until all of the other threads in that team have reached this point.
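
A minimal sketch of a two-phase computation (the partial array and the phase computations are illustrative): the barrier guarantees that every phase-1 result is visible before any thread starts phase 2.

#include <omp.h>

double partial[64];   /* one slot per thread; 64 is an arbitrary upper bound */

int main() {
  #pragma omp parallel num_threads(4)
  {
    int iam = omp_get_thread_num();
    int nt = omp_get_num_threads();
    partial[iam] = iam * 1.0;                   /* phase 1: produce my result */
    #pragma omp barrier                         /* all phase-1 writes are done */
    double neighbor = partial[(iam + 1) % nt];  /* phase 2: read another thread's result */
    (void)neighbor;                             /* silence unused-variable warning */
  }
  return 0;
}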

Critical Section

#pragma omp critical [(name)]
{
  ...
}

Mutual exclusion: a critical section is a block of code that can be executed by only one thread at a time.

Critical section name: a thread waits at the beginning of a critical section until no other thread is executing a critical section with the same name. All unnamed critical directives map to the same name. Critical section names are global entities of the program; if a name conflicts with any other entity, the behavior of the program is unspecified.

Example: Critical Section

main() {
  int i, ia = 0, ib = 0, itotal = 0;
  for (i = 0; i < N; i++) {
    a[i] = i;
    b[i] = N - i;
  }
  #pragma omp parallel private(i)
  {
    #pragma omp sections
    {
      #pragma omp section
      {
        for (i = 0; i < N; i++)
          ia = ia + a[i];
        #pragma omp critical (c1)
        itotal = itotal + ia;
      }
      #pragma omp section
      {
        for (i = 0; i < N; i++)
          ib = ib + b[i];
        #pragma omp critical (c1)
        itotal = itotal + ib;
      }
    }
  }
}

Atomic Statements

#pragma omp atomic
  expression-stmt

The atomic directive ensures that a specific memory location is updated atomically. The statement has to have one of the following forms:
x binop= expr
x++ or ++x
x-- or --x
where x is an lvalue expression with scalar type and expr does not reference the object designated by x. All parallel assignments to the location must be protected with the atomic directive; unprotected accesses are free of race conditions only if every thread writes the same value. The compiler is free to replace all atomic statements with critical directives that have the same unique name.
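
A hedged sketch of a typical use, assuming a histogram update (data, bucket, and the sizes are illustrative): the atomic directive protects each increment of a shared counter.

#include <omp.h>

#define N 100000
int data[N];                            /* illustrative input */
int hist[10];

int bucket(int v) { return v % 10; }    /* placeholder classification */

int main() {
  #pragma omp parallel for
  for (int i = 0; i < N; i++) {
    #pragma omp atomic
    hist[bucket(data[i])]++;            /* atomic update of one memory location */
  }
  return 0;
}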

Translation of Atomic

Only the load and store of x are protected.

#pragma omp atomic
x += expr

can be rewritten as

xtmp = expr
!$OMP CRITICAL (name)
x = x + xtmp
!$OMP END CRITICAL (name)

Simple Locks

A lock can be held by only one thread at a time. A lock is represented by a lock variable of type omp_lock_t. The thread that obtained a simple lock cannot set it again.

Operations:
omp_init_lock(&lockvar): initialize a lock
omp_destroy_lock(&lockvar): destroy a lock
omp_set_lock(&lockvar): set lock
omp_unset_lock(&lockvar): free lock
logicalvar = omp_test_lock(&lockvar): check lock and possibly set it; returns true if the lock was set by the executing thread

Example: Simple Lock

#include <omp.h>
int id;
omp_lock_t lock;
omp_init_lock(&lock);
#pragma omp parallel shared(lock) private(id)
{
  id = omp_get_thread_num();
  omp_set_lock(&lock);            // only a single thread writes at a time
  printf("My Thread num is: %d\n", id);
  omp_unset_lock(&lock);
  while (!omp_test_lock(&lock))
    other_work(id);               // lock not obtained
  real_work(id);                  // lock obtained
  omp_unset_lock(&lock);          // lock freed
}
omp_destroy_lock(&lock);

Nestable Locks

Unlike simple locks, nestable locks can be set multiple times by a single thread. Each set operation increments a lock counter; each unset operation decrements it. If the lock counter is 0 after an unset operation, the lock can be set by another thread.
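
A minimal sketch (the update functions are illustrative) using the nestable-lock API (omp_nest_lock_t and the omp_*_nest_lock routines): update_pair can call update while already holding the lock.

#include <omp.h>

omp_nest_lock_t nlock;
int counter;

void update(int x) {
  omp_set_nest_lock(&nlock);    /* may be called with the lock already held */
  counter += x;
  omp_unset_nest_lock(&nlock);
}

void update_pair(int x, int y) {
  omp_set_nest_lock(&nlock);    /* lock counter = 1 */
  update(x);                    /* counter rises to 2, back to 1 on return */
  update(y);
  omp_unset_nest_lock(&nlock);  /* lock counter = 0: lock is released */
}

int main() {
  omp_init_nest_lock(&nlock);
  #pragma omp parallel
  update_pair(1, 2);
  omp_destroy_nest_lock(&nlock);
  return 0;
}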

Ordered Construct

#pragma omp for ordered
for (...) {
  ...
  #pragma omp ordered
  {
    ...
  }
  ...
}

The construct must be within the dynamic extent of an omp for construct with an ordered clause. Ordered constructs are executed strictly in the order in which they would be executed in a sequential execution of the loop.

Example with Ordered Clause

#pragma omp for ordered
for (...) {
  S1
  #pragma omp ordered
  { S2 }
  S3
}

Flush

#pragma omp flush [(list)]

The flush directive synchronizes the copies in registers or cache of the executing thread with main memory. It synchronizes those variables in the given list or, if no list is specified, all shared variables accessible in the region. It does not update implicit copies at other threads. Loads and stores executed before the flush in program order have to be finished; loads and stores following the flush in program order are not allowed to be executed before the flush. A flush is executed implicitly for some constructs, e.g. at the begin and end of a parallel region and at the end of work-sharing constructs.

Example: Flush

#define MAXTHREAD 100
int iam, neigh, isync[MAXTHREAD];
isync[0] = 1;               // isync[1..MAXTHREAD-1] = 0
#pragma omp parallel private(iam, neigh)
{
  iam = omp_get_thread_num() + 1;
  neigh = iam - 1;
  // wait for neighbor
  while (isync[neigh] == 0) {
    #pragma omp flush(isync)
  }
  // do my work
  work();
  isync[iam] = 1;           // I am done
  #pragma omp flush(isync)  // make the update visible to the waiting neighbor
}

Lastprivate Example

k = 0;
#pragma omp parallel
{
  #pragma omp for lastprivate(k)
  for (i = 0; i < 100; i++) {
    a[i] = b[i] + b[i+1];
    k = 2 * i;
  }
}
// The value of k is 198

Copyprivate Example

#pragma omp parallel private(x)
{
  #pragma omp single copyprivate(x)
  getValue(x);
  useValue(x);
}

The copyprivate clause exists only for single regions. Variables must be private in the enclosing parallel region; the value from the executing thread is copied to all other threads. After one thread has executed the single region, the value it read into x is available to all threads in their private copies.

Other Copyprivate Example

float read_next() {
  float *tmp;
  float return_val;
  #pragma omp single copyprivate(tmp)
  {
    tmp = (float *) malloc(sizeof(float));
  }
  #pragma omp master
  get_float(tmp);
  #pragma omp barrier
  return_val = *tmp;
  #pragma omp barrier
  #pragma omp single nowait
  free(tmp);
  return return_val;
}

In this example, the value has to be read by the master thread. Therefore, the value itself cannot be broadcast with copyprivate into the private copies; instead a temporary location is allocated and its address is broadcast. Note: there is an implicit barrier at the end of single, but no implicit barrier at the end of master or at the beginning of single.

Runtime Routines for Threads (1)

Set the number of threads for parallel regions: omp_set_num_threads(count)
Query the maximum number of threads for team creation: numthreads = omp_get_max_threads()
Query the number of threads in the current team: numthreads = omp_get_num_threads()
Query own thread number (0..n-1): iam = omp_get_thread_num()
Query the number of processors: numprocs = omp_get_num_procs()

Runtime Routines for Threads (2)

Query state: logicalvar = omp_in_parallel()
Allow the runtime system to determine the number of threads for team creation: omp_set_dynamic(logicalexpr)
Query whether the runtime system can determine the number of threads: logicalvar = omp_get_dynamic()
Allow nesting of parallel regions: omp_set_nested(logicalexpr)
Query nesting of parallel regions: logicalvar = omp_get_nested()
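
A small sketch exercising several of these routines (output wording is illustrative):

#include <omp.h>
#include <stdio.h>

int main() {
  printf("procs=%d  max_threads=%d  in_parallel=%d\n",
         omp_get_num_procs(), omp_get_max_threads(), omp_in_parallel());
  omp_set_num_threads(4);
  #pragma omp parallel
  {
    if (omp_get_thread_num() == 0)
      printf("team of %d threads, in_parallel=%d\n",
             omp_get_num_threads(), omp_in_parallel());
  }
  return 0;
}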

Environment Variables

OMP_NUM_THREADS=4: number of threads in the team of a parallel region
OMP_SCHEDULE="dynamic" or OMP_SCHEDULE="GUIDED,4": selects the scheduling strategy to be applied at runtime
OMP_DYNAMIC=TRUE: allows the runtime system to determine the number of threads
OMP_NESTED=TRUE: allows nesting of parallel regions
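
For example (values are illustrative), a loop declared with schedule(runtime) picks up its strategy from the environment:

> export OMP_NUM_THREADS=8
> export OMP_SCHEDULE="guided,4"
> export OMP_DYNAMIC=TRUE
> a.out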

OpenMP 3.0

Draft 3.0 for public comment: October 21, 2007. Current standard: May 2008.

Explicit Tasking

#pragma omp parallel
{
  #pragma omp single
  {
    for (elem = l->first; elem; elem = elem->next)
      #pragma omp task
      process(elem);
  }
  // all tasks are complete by this point
}

Explicit creation of tasks: each loop iteration creates a task.
Task scheduling: tasks can be executed by any thread in the team.
Barrier: all tasks created in the parallel region have to be finished.

Tasks

#pragma omp task [clause list]
{
  ...
}

Clauses:
if (scalar-expression): if it evaluates to FALSE, execution of the new task starts immediately, and the suspended task may not be resumed until the new task is finished.
untied: the task is not tied to the thread starting its execution; it might be rescheduled to another thread.
default(shared|none), private, firstprivate, shared: if no default clause is present, the implicit data-sharing attribute is firstprivate.

Binding: the binding thread set of the task region is the current team. A task region binds to the innermost enclosing parallel region.

Example: Tree Traversal

struct node {
  struct node *left;
  struct node *right;
};

void traverse(struct node *p) {
  if (p->left)
    #pragma omp task   // p is firstprivate by default
    traverse(p->left);
  if (p->right)
    traverse(p->right);
  process(p);
}

Task Wait

#pragma omp taskwait

Waits for the completion of child tasks, i.e. the tasks generated since the beginning of the current task.
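
A hedged sketch of taskwait in a recursive computation (the classic Fibonacci pattern; not from the original slides): the parent task must wait for both child tasks before combining their results.

#include <omp.h>
#include <stdio.h>

int fib(int n) {
  int x, y;
  if (n < 2) return n;
  #pragma omp task shared(x)
  x = fib(n - 1);          /* child task 1 */
  #pragma omp task shared(y)
  y = fib(n - 2);          /* child task 2 */
  #pragma omp taskwait     /* wait for both children before using x and y */
  return x + y;
}

int main() {
  int result;
  #pragma omp parallel
  {
    #pragma omp single     /* one thread starts the recursion */
    result = fib(10);
  }
  printf("fib(10) = %d\n", result);
  return 0;
}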

Collapsing of Loops

#pragma omp parallel for collapse(2)
for (i = 0; i < n; i++)
  for (j = 0; j < n; j++)
    for (k = 0; k < n; k++) {
      .....
    }

Handles multi-dimensional perfectly nested loops. The larger iteration space is ordered according to sequential execution, and the schedule clause applies to the new iteration space.

Guaranteed Scheduling

!$omp do schedule(static)
do i = 1, n
   a(i) = ....
end do
!$omp end do nowait
!$omp do schedule(static)
do i = 1, n
   .... = a(i)
end do

The same work distribution is guaranteed if both loops have the same number of iterations, use schedule static with the same chunk size, and bind to the same parallel region; the nowait is then safe.

Scheduling Strategy auto for Parallel Loops

New scheduling strategy auto: it is up to the compiler to determine the scheduling.

Nested Parallelism

Before 3.0, there is only a single, global copy of the setting that specifies the number of threads in a team: omp_set_num_threads() can be called only outside of parallel regions, the setting is applied for nested parallelism, and all teams have the same size (except where a num_threads clause is given on a parallel region).

OpenMP 3.0 supports individual copies: there is one copy per task (explicit or implicit tasks created by parallel or work-sharing regions), so teams might have different sizes, and the executing thread encountering a parallel directive can control the size of the created team. A sketch follows below.
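
A minimal sketch of nested teams (team sizes are illustrative): each of the two outer threads creates its own inner team of three.

#include <omp.h>
#include <stdio.h>

int main() {
  omp_set_nested(1);                      /* allow nested parallel regions */
  #pragma omp parallel num_threads(2)
  {
    int outer = omp_get_thread_num();
    #pragma omp parallel num_threads(3)   /* each outer thread forks an inner team */
    printf("outer thread %d, inner thread %d\n", outer, omp_get_thread_num());
  }
  return 0;
}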

No NUMA Support

There is no NUMA support. A MIGRATE_NEXT_TOUCH directive was discussed but is not in the standard; it remains a research topic.

Summary

OpenMP is a quasi-standard for shared memory programming, based on the fork/join model. It provides parallel regions and work-sharing constructs, declaration of private or shared variables, reduction variables, scheduling strategies, and synchronization via barriers, critical sections, atomic statements, locks, and nestable locks.