Introduction to OpenMP

Shared-Memory Systems

[Diagram: several processors, each with its own bus interface, connected by a processor/memory bus through a memory controller to a single shared memory.]

All processors can access all of the shared memory.

OpenMP

OpenMP uses compiler directives (similar to Paraguin) to parallelize a program. The programmer inserts #pragma statements into the sequential program to tell the compiler how to parallelize it. This is a higher level of abstraction than pthreads or Java threads. OpenMP was standardized in the late 1990s, and gcc supports it.

Getting Started

To begin

Syntax:

    #pragma omp parallel
    structured_block

omp indicates that the pragma is an OpenMP pragma (other compilers will ignore it).
parallel indicates the directive ("parallel" marks the start of a parallel region).
structured_block is either a single statement (such as a for loop) or a block of statements.

Code outside a parallel region is executed by the master thread only. A parallel region marks a section of code that is executed by all threads. At the end of a parallel region, all threads synchronize as if there were a barrier.

[Diagram: sequential code (master thread only), then a parallel region (multiple threads) ending in a synchronization point, followed by another parallel region and synchronization.]

Hello World

    #include <stdio.h>
    #include <omp.h>

    int main (int argc, char *argv[])
    {
        #pragma omp parallel
        {
            printf("Hello World from thread = %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }

Very important: the opening brace of the parallel block must be on a new line (it cannot appear on the same line as the #pragma).

Compiling and Output

    $ gcc -fopenmp hello.c -o hello
    $ ./hello
    Hello world from thread 2 of 4
    Hello world from thread 0 of 4
    Hello world from thread 3 of 4
    Hello world from thread 1 of 4
    $

The -fopenmp flag tells gcc to interpret OpenMP directives.

Execution

omp_get_thread_num() – gets the current thread's number.
omp_get_num_threads() – gets the total number of threads.

The names of these two functions are similar and easy to confuse.

Execution

There are three ways to indicate how many threads you want:

1. Use num_threads within the directive, e.g.
       #pragma omp parallel num_threads(5)
2. Use the omp_set_num_threads function, e.g.
       omp_set_num_threads(6);
3. Use the OMP_NUM_THREADS environment variable, e.g.
       $ export OMP_NUM_THREADS=8
       $ ./hello
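
The following minimal sketch (not from the slides) exercises all three mechanisms in one program; the thread counts 5, 6 and 8 simply echo the examples above. The num_threads clause takes precedence over omp_set_num_threads(), which in turn takes precedence over OMP_NUM_THREADS.

    #include <stdio.h>
    #include <omp.h>

    int main (void)
    {
        /* 1. num_threads clause: applies only to this parallel region */
        #pragma omp parallel num_threads(5)
        {
            #pragma omp single
            printf("num_threads clause: %d threads\n", omp_get_num_threads());
        }

        /* 2. omp_set_num_threads(): applies to subsequent parallel regions */
        omp_set_num_threads(6);
        #pragma omp parallel
        {
            #pragma omp single
            printf("omp_set_num_threads: %d threads\n", omp_get_num_threads());
        }

        /* 3. OMP_NUM_THREADS environment variable: used when neither of the
              above is present, e.g.  $ export OMP_NUM_THREADS=8  */
        return 0;
    }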

Shared versus Private Data

Shared versus Private Data

    int main (int argc, char *argv[])
    {
        int x;
        int tid;

        #pragma omp parallel private(tid)
        {
            tid = omp_get_thread_num();
            if (tid == 0) x = 42;
            printf ("Thread %d, x = %d\n", tid, x);
        }
    }

x is shared by all threads; tid is private – each thread has its own copy. Variables declared outside the parallel construct are shared unless otherwise specified.

Shared versus Private Data

    $ ./data
    Thread 3, x = 0
    Thread 2, x = 0
    Thread 1, x = 0
    Thread 0, x = 42
    Thread 4, x = 42
    Thread 5, x = 42
    Thread 6, x = 42
    Thread 7, x = 42

tid has a separate value for each thread; x has the same value for each thread (well… almost).

Another Example: Shared versus Private

    #pragma omp parallel private(tid, n)
    {
        tid = omp_get_thread_num();
        n = omp_get_num_threads();
        a[tid] = 10*n;
    }

or, with the (optional) shared clause made explicit:

    #pragma omp parallel private(tid, n) shared(a)
    ...

a[ ] is shared; tid and n are private.

Private Variables

private clause – creates private copies of variables for each thread.
firstprivate clause – as private, but initializes each copy to the value the variable had immediately prior to the parallel construct.
lastprivate clause – as private, but "the value of each lastprivate variable from the sequentially last iteration of the associated loop, or the lexically last section directive, is assigned to the variable's original object."
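
As a minimal sketch (not from the slides) of how firstprivate and lastprivate behave; the variable names x and last and the size N are made up for illustration:

    #include <stdio.h>
    #include <omp.h>

    #define N 8

    int main (void)
    {
        int x = 10;     /* value copied into each thread's private x by firstprivate */
        int last = -1;  /* will receive the value from the sequentially last iteration */

        #pragma omp parallel for firstprivate(x) lastprivate(last)
        for (int i = 0; i < N; i++) {
            last = x + i;   /* each thread starts with its own x initialized to 10 */
        }

        /* lastprivate copies the value from iteration i == N-1 back out,
           so last == 10 + 7 == 17; the original x is unchanged (still 10) */
        printf("x = %d, last = %d\n", x, last);
        return 0;
    }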

Work Sharing Constructs

Specifying Work Inside a Parallel Region

There are four constructs:

section – each section is executed by a different thread.
for – each iteration is executed by a (potentially) different thread.
single – executed by a single thread (sequential).
master – executed by the master thread only (sequential).

There is a barrier after each construct (except master) unless a nowait clause is given. These constructs must be used within a parallel region; a sketch combining them appears below.
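
A minimal sketch (not from the slides) showing the four constructs together in one parallel region, with an explicit barrier compensating for a nowait; the array names a and b and the size N are assumed for illustration:

    #include <stdio.h>
    #include <omp.h>

    #define N 8

    int main (void)
    {
        double a[N], b[N];

        #pragma omp parallel
        {
            #pragma omp single          /* exactly one thread; implied barrier after */
            printf("single executed by thread %d\n", omp_get_thread_num());

            #pragma omp sections        /* each section taken by a different thread */
            {
                #pragma omp section
                printf("section 1 executed by thread %d\n", omp_get_thread_num());
                #pragma omp section
                printf("section 2 executed by thread %d\n", omp_get_thread_num());
            }

            #pragma omp for nowait      /* iterations shared out; nowait removes the barrier */
            for (int i = 0; i < N; i++)
                a[i] = 2.0 * i;

            #pragma omp master          /* master thread only; no implied barrier */
            printf("master is thread %d\n", omp_get_thread_num());

            #pragma omp barrier         /* needed before reading a[], because of the nowait */

            #pragma omp for
            for (int i = 0; i < N; i++)
                b[i] = a[i] + 1.0;
        }

        printf("b[%d] = %f\n", N - 1, b[N - 1]);
        return 0;
    }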

Sections Syntax

    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            structured_block
            ...
        }
    }

The enclosing parallel region is required; the sections are executed by the available threads.

Sections Example

    #pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
    {
        tid = omp_get_thread_num();

        #pragma omp sections nowait
        {
            #pragma omp section
            {
                printf("Thread %d doing section 1\n", tid);
                for (i = 0; i < N; i++) {
                    c[i] = a[i] + b[i];
                    printf("Thread %d: c[%d] = %f\n", tid, i, c[i]);
                }
            }

Because of the nowait clause, threads do not wait after finishing their section. One thread executes this section; the second section is shown on the next slide.

Sections Example (continued)

            #pragma omp section
            {
                printf("Thread %d doing section 2\n", tid);
                for (i = 0; i < N; i++) {
                    d[i] = a[i] * b[i];
                    printf("Thread %d: d[%d] = %f\n", tid, i, d[i]);
                }
            }
        } /* end of sections */

        printf ("Thread %d done\n", tid);
    } /* end of parallel section */

Another thread executes this section.

Sections Output

    Thread 0 doing section 1
    Thread 0: c[0]= 5.000000
    Thread 0: c[1]= 7.000000
    Thread 0: c[2]= 9.000000
    Thread 0: c[3]= 11.000000
    Thread 0: c[4]= 13.000000
    Thread 3 done
    Thread 2 done
    Thread 1 doing section 2
    Thread 1: d[0]= 0.000000
    Thread 1: d[1]= 6.000000
    Thread 1: d[2]= 14.000000
    Thread 1: d[3]= 24.000000
    Thread 0 done
    Thread 1: d[4]= 36.000000
    Thread 1 done

Threads do not wait after finishing their section (i.e. no barrier).

Sections Output

    Thread 0 doing section 1
    Thread 0: c[0]= 5.000000
    Thread 0: c[1]= 7.000000
    Thread 0: c[2]= 9.000000
    Thread 0: c[3]= 11.000000
    Thread 0: c[4]= 13.000000
    Thread 3 doing section 2
    Thread 3: d[0]= 0.000000
    Thread 3: d[1]= 6.000000
    Thread 3: d[2]= 14.000000
    Thread 3: d[3]= 24.000000
    Thread 3: d[4]= 36.000000
    Thread 3 done
    Thread 1 done
    Thread 2 done
    Thread 0 done

If we remove the nowait, then there is a barrier at the end of the sections construct: threads wait until they are all done with their sections.

Parallel For Syntax

    #pragma omp parallel
    {
        #pragma omp for
        for (i = 0; i < N; i++) {
            ...
        }
    }

The enclosing parallel region is required. Different iterations will be executed by the available threads. The loop must be a simple C for loop, where the lower bound and upper bound are constants.

Parallel For Example

    #pragma omp parallel shared(a,b,c,nthreads) private(i,tid)
    {
        tid = omp_get_thread_num();
        if (tid == 0) {
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
        printf("Thread %d starting...\n", tid);

        #pragma omp for
        for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
            printf("Thread %d: i = %d, c[%d] = %f\n", tid, i, i, c[i]);
        }
    } /* end of parallel section */

Without "nowait", threads wait after finishing the loop.

Parallel For Output

    Thread 1 starting...
    Thread 1: i = 2, c[2] = 9.000000
    Thread 1: i = 3, c[3] = 11.000000
    Thread 2 starting...
    Thread 2: i = 4, c[4] = 13.000000
    Thread 3 starting...
    Number of threads = 4
    Thread 0 starting...
    Thread 0: i = 0, c[0] = 5.000000
    Thread 0: i = 1, c[1] = 7.000000

Iterations of the loop are mapped to threads in contiguous blocks: here thread 0 gets i = 0,1, thread 1 gets i = 2,3, thread 2 gets i = 4, and thread 3 gets none. There is a barrier at the end of the loop.

Combining Directives

If a parallel region consists of only one parallel for or parallel sections construct, the two directives can be combined:

    #pragma omp parallel sections
    #pragma omp parallel for

Combining Directives Example

    #pragma omp parallel for shared(a,b,c,nthreads) private(i,tid)
    for (i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

This declares a parallel region and a parallel for in a single directive.

Scheduling a Parallel For

By default, a parallel for is scheduled by mapping blocks (or chunks) of iterations to the available threads (static mapping):

    Thread 1 starting...
    Thread 1: i = 2, c[2] = 9.000000
    Thread 1: i = 3, c[3] = 11.000000
    Thread 2 starting...
    Thread 2: i = 4, c[4] = 13.000000
    Thread 3 starting...
    Number of threads = 4
    Thread 0 starting...
    Thread 0: i = 0, c[0] = 5.000000
    Thread 0: i = 1, c[1] = 7.000000

The default chunk size here is two iterations per thread; there is a barrier at the end of the loop.

Scheduling a Parallel For

Static – partitions loop iterations into equal-sized chunks specified by chunk_size. Chunks are assigned to threads in round-robin fashion.

    #pragma omp parallel for schedule (static, chunk_size)

Dynamic – uses an internal work queue. A chunk-sized block of iterations is assigned to each thread as it becomes available.

    #pragma omp parallel for schedule (dynamic, chunk_size)

Scheduling a Parallel For

Guided – similar to dynamic, but the chunk size starts large and gets smaller to reduce the time threads have to go to the work queue.

    #pragma omp parallel for schedule (guided)

Runtime – uses the OMP_SCHEDULE environment variable to specify which of static, dynamic or guided should be used.

    #pragma omp parallel for schedule (runtime)
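
A minimal sketch (not from the slides) of the schedule clauses in use; the array name work, the size N and the chunk sizes are made up for illustration. With schedule(runtime), the choice comes from OMP_SCHEDULE, e.g. export OMP_SCHEDULE="guided,4".

    #include <stdio.h>
    #include <omp.h>

    #define N 16

    int main (void)
    {
        double work[N];

        /* static, chunk size 2: blocks of 2 iterations assigned round robin */
        #pragma omp parallel for schedule(static, 2)
        for (int i = 0; i < N; i++)
            work[i] = 0.5 * i;

        /* dynamic, chunk size 4: threads take 4 iterations at a time from a work queue */
        #pragma omp parallel for schedule(dynamic, 4)
        for (int i = 0; i < N; i++)
            work[i] += 1.0;

        /* runtime: schedule chosen at run time via the OMP_SCHEDULE environment variable */
        #pragma omp parallel for schedule(runtime)
        for (int i = 0; i < N; i++)
            work[i] *= 2.0;

        printf("work[%d] = %f\n", N - 1, work[N - 1]);
        return 0;
    }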

Question Guided scheduling is similar to Static except that the chunk sizes start large and get smaller. What is the advantage of using Guided versus Static? Answer: Guided improves load balance

Reduction

A reduction applies a commutative operator to an aggregate of values, creating a single value (similar to MPI_Reduce):

    sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (k = 0; k < 100; k++) {
        sum = sum + funct(k);
    }

The reduction clause names the operation (+) and the variable (sum). A private copy of sum is created for each thread by the compiler, and each private copy is added to sum at the end. This eliminates the need for a critical section here.

Single

    #pragma omp parallel
    {
        ...
        #pragma omp single
        structured_block
    }

Only one thread executes this section; there is no guarantee of which one.

Single Example

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf ("Thread %d starting...\n", tid);

        #pragma omp single
        {
            printf("Thread %d doing work\n", tid);
            ...
        } /* end of single */

        printf ("Thread %d done\n", tid);
    } /* end of parallel section */

Single Results

    Thread 0 starting...
    Thread 0 doing work
    Thread 3 starting...
    Thread 2 starting...
    ...
    Thread 1 starting...
    Thread 0 done
    Thread 1 done
    Thread 2 done
    Thread 3 done

Only one thread executes the single section. "nowait" was NOT specified, so there is a barrier: the other threads wait for that one thread to finish.

Master

    #pragma omp parallel
    {
        ...
        #pragma omp master
        structured_block
    }

Only one thread (the master) executes this section. You cannot specify "nowait" here: there is no barrier after this block, so threads will NOT wait.

Master Example

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf ("Thread %d starting...\n", tid);

        #pragma omp master
        {
            printf("Thread %d doing work\n", tid);
            ...
        } /* end of master */

        printf ("Thread %d done\n", tid);
    } /* end of parallel section */

Is there any difference between these two approaches?

Master directive:

    #pragma omp parallel
    {
        ...
        #pragma omp master
        structured_block
    }

Using an if statement:

    #pragma omp parallel private(tid)
    {
        ...
        tid = omp_get_thread_num();
        if (tid == 0)
            structured_block
    }

Synchronization

Critical Section

A critical section implies mutual exclusion: only one thread is allowed to enter the critical section at a time.

    #pragma omp parallel
    {
        ...
        #pragma omp critical (name)
        structured_block
    }

The name is optional.

Critical Example

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf ("Thread %d starting...\n", tid);

        #pragma omp critical (myCS)
        {
            printf("Thread %d in critical section\n", tid);
            sleep (1);
            printf("Thread %d finishing critical section\n", tid);
        } /* end of critical */

        printf ("Thread %d done\n", tid);
    } /* end of parallel section */

Critical Results

    Thread 0 starting...
    Thread 0 in critical section
    Thread 3 starting...
    Thread 1 starting...
    Thread 2 starting...
    Thread 0 finishing critical section
    Thread 0 done
    Thread 3 in critical section
    Thread 3 finishing critical section
    Thread 3 done
    Thread 2 in critical section
    Thread 2 finishing critical section
    Thread 2 done
    Thread 1 in critical section
    Thread 1 finishing critical section
    Thread 1 done

There is a 1-second delay between each thread entering and finishing the critical section.

Atomic

If the critical section is a simple update of a variable, then atomic is more efficient. It ensures mutual exclusion for a single statement.

    #pragma omp parallel
    {
        ...
        #pragma omp atomic
        expression_statement
    }

The statement must be a simple one of the form:

    x = expression     (with the write clause in newer OpenMP versions)
    x += expression
    x -= expression
    ...
    x++;
    x--;
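
A minimal sketch (not from the slides) contrasting atomic with a critical section for a shared counter; the counter names and iteration count are made up for illustration:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main (void)
    {
        long count_atomic = 0, count_critical = 0;

        #pragma omp parallel for
        for (long i = 0; i < N; i++) {
            #pragma omp atomic            /* protects only this single update */
            count_atomic += 1;

            #pragma omp critical          /* also correct, but usually more expensive */
            count_critical += 1;
        }

        printf("atomic = %ld, critical = %ld (both should be %d)\n",
               count_atomic, count_critical, N);
        return 0;
    }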

Barrier

Threads wait at a barrier until all threads have reached the same barrier. All threads must be able to reach the barrier (i.e. be careful about placing a barrier inside an if statement where some threads may not execute it).

    #pragma omp parallel
    {
        ...
        #pragma omp barrier
    }

Barrier Example

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf ("Thread %d starting...\n", tid);

        #pragma omp single nowait
        {
            printf("Thread %d busy doing work ...\n", tid);
            sleep(10);
        }

        printf("Thread %d reached barrier\n", tid);

        #pragma omp barrier

        printf ("Thread %d done\n", tid);
    } /* end of parallel section */

Because of nowait there is no barrier at the end of the single block; threads wait at the explicit barrier, not at the single.

Barrier Results

    Thread 3 starting...
    Thread 0 starting...
    Thread 0 reached barrier
    Thread 2 starting...
    Thread 2 reached barrier
    Thread 1 starting...
    Thread 1 reached barrier
    Thread 3 busy doing work ...
    Thread 3 reached barrier
    Thread 3 done
    Thread 0 done
    Thread 2 done
    Thread 1 done

Thread 3 sleeps for 10 seconds inside the single block, so there is a 10-second delay before the last four lines appear.

Flush

A synchronization point which causes threads to have a "consistent" view of certain (or all) shared variables in memory. All pending read and write operations on the listed variables are allowed to complete and their values are written back to memory, but memory operations in code after the flush are not started until the flush completes.

Format:

    #pragma omp flush (variable_list)

Flush

The flush applies only to the thread executing it, not to all threads in the team (so not all threads have to execute the flush). A flush occurs automatically at the entry and exit of parallel and critical directives, and at the exit of for, sections, and single (if a nowait clause is not present).
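
A minimal sketch (not from the slides) of the classic producer/consumer handoff using flush; it assumes the program runs with at least two threads so the two sections execute concurrently:

    #include <stdio.h>
    #include <omp.h>

    int main (void)
    {
        int data = 0, flag = 0;

        #pragma omp parallel sections shared(data, flag)
        {
            #pragma omp section            /* producer */
            {
                data = 42;
                #pragma omp flush(data)    /* make data visible before the flag */
                flag = 1;
                #pragma omp flush(flag)
            }

            #pragma omp section            /* consumer */
            {
                int ready = 0;
                while (!ready) {           /* spin until the producer sets the flag */
                    #pragma omp flush(flag)
                    ready = flag;
                }
                #pragma omp flush(data)    /* re-read data written by the producer */
                printf("consumer read data = %d\n", data);
            }
        }
        return 0;
    }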

More information http://openmp.org/wp/

Questions