Lecture 8: OpenMP

Parallel Programming Models
Parallel programming models:
- Data parallelism / Task parallelism
- Explicit parallelism / Implicit parallelism
- Shared memory / Distributed memory
Other programming paradigms:
- Object-oriented
- Functional and logic

Parallel Programming Models
Shared memory
- The programmer's task is to specify the activities of a set of processes that communicate by reading and writing shared memory.
- Advantage: the programmer need not be concerned with data-distribution issues.
- Disadvantage: achieving good performance may be difficult on computers that lack hardware support for shared memory, and race conditions tend to arise more easily.
Distributed memory
- Processes have only local memory and must use some other mechanism (e.g., message passing or remote procedure call) to exchange information.
- Advantage: programmers have explicit control over data distribution and communication.

Shared vs Distributed Memory
[Figure: shared memory — processors (P) connected through a memory bus to a single shared memory (M); distributed memory — processors (P), each with its own local memory (M), connected by a network.]

Parallel Programming Models
Parallel programming tools:
- Parallel Virtual Machine (PVM): distributed memory, explicit parallelism
- Message-Passing Interface (MPI): distributed memory, explicit parallelism
- Pthreads: shared memory, explicit parallelism
- OpenMP: shared memory, explicit parallelism
- High-Performance Fortran (HPF): implicit parallelism
- Parallelizing compilers: implicit parallelism

Parallel Programming Models
Shared memory model
- Used on shared-memory MIMD architectures.
- A program consists of many independent threads.
- Concurrently executing threads all share a single, common address space.
- Threads can exchange information by reading and writing memory using normal variable-assignment operations.

Parallel Programming Models
Memory coherence problem
- The system must ensure that the latest value of a variable updated in one thread is used when that same variable is accessed in another thread.
- Hardware support and compiler support are required (cache-coherency protocol).
[Figure: two threads sharing a single variable X.]

Parallel Programming Models
Distributed Shared Memory (DSM) systems
- Implement the shared memory model on distributed-memory MIMD architectures.
- Concurrently executing threads all share a single, common address space.
- Threads can exchange information by reading and writing memory using normal variable-assignment operations.
- A message-passing layer is used as the means for communicating updated values throughout the system.

Parallel Programming Models
Synchronization operations in the shared memory model:
- Monitors
- Locks
- Critical sections
- Condition variables
- Semaphores
- Barriers
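As a concrete illustration of one of these primitives (a hypothetical sketch, not part of the original slides), a lock can protect a shared counter so that only one thread performs the read-modify-write at a time; here using POSIX threads:

#include <pthread.h>
#include <stdio.h>

static int counter = 0;                                   /* shared variable */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* protects counter */

static void *worker(void *arg)
{
    pthread_mutex_lock(&lock);     /* acquire the lock: mutual exclusion */
    counter = counter + 1;         /* the read-modify-write is now race-free */
    pthread_mutex_unlock(&lock);   /* release the lock */
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    int i;
    for (i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %d\n", counter);   /* always 4 */
    return 0;
}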

OpenMP

OpenMP
- Shared-memory programming model
- Thread-based parallelism
- Fork/join model
- Compiler-directive based
- No support for parallel I/O
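Because the parallelism is expressed through compiler directives, the same source usually still builds as a serial program when OpenMP is disabled. A minimal sketch (assuming GCC's -fopenmp flag; other compilers use a different switch):

/* hello_omp.c -- compile with:  gcc -fopenmp hello_omp.c -o hello_omp
 * Without -fopenmp the pragma is ignored and the program runs serially. */
#include <stdio.h>
#ifdef _OPENMP              /* defined by the compiler when OpenMP is enabled */
#include <omp.h>
#endif

int main(void)
{
    #pragma omp parallel
    {
#ifdef _OPENMP
        printf("OpenMP enabled, thread %d\n", omp_get_thread_num());
#else
        printf("Compiled without OpenMP: a single thread runs this\n");
#endif
    }
    return 0;
}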

OpenMP Master thread executes the sequential sections. Master thread forks additional threads. At the end of the parallel code, the created threads die and the control returns to the master thread (join). Master Threads Fork Join Fork Join

OpenMP General Code Structure

#include <omp.h>

main ()
{
    int var1, var2, var3;

    /* Serial code ... */

    /* Fork a team of threads; specify variable scoping */
    #pragma omp parallel private(var1, var2) shared(var3)
    {
        /* Parallel section executed by all threads ... */

        /* All threads join the master thread and disband */
    }

    /* Resume serial code ... */
}

OpenMP General Code Structure

#include <stdio.h>
#include <omp.h>

main ()
{
    int nthreads, tid;

    #pragma omp parallel private(tid)   /* Fork a team of threads */
    {
        tid = omp_get_thread_num();
        printf("Hello World from thread = %d\n", tid);

        if (tid == 0) {                 /* master thread */
            nthreads = omp_get_num_threads();
            printf("Number of threads = %d\n", nthreads);
        }
    }   /* All threads join master thread and terminate */
}

OpenMP parallel Directive

The execution of the code block after the parallel pragma is replicated among the threads.

#include <omp.h>

main ()
{
    struct job_struct job_ptr;      /* shared: one job queue */
    struct task_struct *task_ptr;   /* private: each thread's current task */

    #pragma omp parallel private(task_ptr)
    {
        task_ptr = get_next_task(&job_ptr);
        while (task_ptr != NULL) {
            complete_task(task_ptr);
            task_ptr = get_next_task(&job_ptr);
        }
    }
}

[Figure: job_ptr is a shared variable seen by the master thread and thread 1; each thread has its own task_ptr.]

OpenMP parallel for Directive

#include <omp.h>

main ()
{
    int i;
    float b[5];

    #pragma omp parallel for
    for (i=0; i < 5; i++)
        b[i] = i;
}

[Figure: the array b is shared; the master thread (0) and thread 1 each have their own copy of the loop index i.]

In a parallel for, variables are shared by default, with the exception that the loop index variable is private.

OpenMP
- Execution context: the address space containing all of the variables the thread may access.
- Shared variable: has the same address in the execution context of every thread.
- Private variable: has a different address in the execution context of every thread.
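A small sketch of the difference (a hypothetical example, not from the slides): every thread sees the same address for a shared variable but gets its own copy, at its own address, of a private one.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int shared_var = 0;    /* one copy, same address in every thread */
    int private_var = 0;   /* each thread gets its own copy inside the region */

    #pragma omp parallel private(private_var) shared(shared_var) num_threads(4)
    {
        private_var = omp_get_thread_num();   /* no race: per-thread storage */
        printf("thread %d: &shared_var=%p  &private_var=%p\n",
               omp_get_thread_num(), (void *)&shared_var, (void *)&private_var);
    }
    return 0;
}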

OpenMP private Clause

private (list)

Declares the variables in its list to be private to each thread.

#include <omp.h>

main ()
{
    int i, j, n;
    float a[10][10];

    n = 10;

    #pragma omp parallel for private(j)
    for (i=0; i < n; i++)
        for (j=0; j < n; j++)
            a[i][j] = a[i][j] + i;
}

OpenMP critical Directive

Directs the compiler to enforce mutual exclusion among the threads executing the block of code.

#include <omp.h>

main ()
{
    int x;
    x = 0;

    #pragma omp parallel shared(x)
    {
        #pragma omp critical
        x = x + 1;
    }  /* end of parallel section */
}

OpenMP reduction Clause

reduction (operator : variable)

#include <omp.h>

main ()
{
    int i, n;
    float a, x, p;

    n = 100;
    a = 0.0;

    #pragma omp parallel for \
            private(x) \
            reduction(+:a)
    for (i=0; i < n; i++) {
        x = i/10.0;
        a += x*x;
    }
    p = a/n;
}

OpenMP reduction Operators
+   addition
-   subtraction
*   multiplication
&   bitwise and
|   bitwise or
^   bitwise exclusive or
&&  logical and
||  logical or
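The same clause works with any operator in this table. As an illustration (a sketch, not from the original slides), the logical-and operator can test whether every element of an array satisfies a condition:

#include <stdio.h>

int main(void)
{
    int i, all_positive = 1;    /* 1 is the identity element for && */
    double v[8] = {1.0, 2.5, 0.5, 3.0, 4.0, 1.5, 2.0, 0.1};

    /* each thread accumulates a private partial result, combined with && */
    #pragma omp parallel for reduction(&&:all_positive)
    for (i = 0; i < 8; i++)
        all_positive = all_positive && (v[i] > 0.0);

    printf("all positive: %d\n", all_positive);
    return 0;
}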

OpenMP
Loop scheduling: controls how the iterations of a loop are allocated to threads.
Static schedule: all iterations are allocated to threads before any loop iterations execute.
- Low overhead
- Possibly high load imbalance
Dynamic schedule: only some of the iterations are allocated to threads at the beginning of the loop's execution; threads that complete their iterations are then eligible to get additional work.
- Higher overhead
- Reduces load imbalance

OpenMP schedule Clause

schedule (type [, chunk])

- type: static, dynamic, etc.
- chunk: number of contiguous iterations assigned to each thread.
Increasing the chunk size can reduce overhead and increase the cache hit rate.

OpenMP schedule Clause

#include <omp.h>

main ()
{
    int i, n;
    float a[10];

    n = 10;

    #pragma omp parallel for \
            private(i) \
            schedule(static,5)
    for (i=0; i < n; i++)
        a[i] = a[i] + i;
}

OpenMP schedule Clause

#include <omp.h>
#define CHUNKSIZE 100
#define N 1000

main ()
{
    int i, chunk;
    float a[N], b[N], c[N];

    for (i=0; i < N; i++)
        a[i] = b[i] = i * 1.0;
    chunk = CHUNKSIZE;

    #pragma omp parallel shared(a,b,c,chunk) private(i)
    {
        /* iterations will be distributed dynamically in chunk-sized pieces */
        #pragma omp for schedule(dynamic,chunk) nowait
        for (i=0; i < N; i++)
            c[i] = a[i] + b[i];
    }  /* end of parallel section */
}

OpenMP nowait Clause

Tells the compiler to omit the barrier synchronization at the end of the work-shared for loop.

#include <omp.h>
#define CHUNKSIZE 100
#define N 1000

main ()
{
    int i, chunk;
    float a[N], b[N], c[N];

    for (i=0; i < N; i++)
        a[i] = b[i] = i * 1.0;
    chunk = CHUNKSIZE;

    #pragma omp parallel shared(a,b,c,chunk) private(i)
    {
        #pragma omp for schedule(dynamic,chunk) nowait
        for (i=0; i < N; i++)
            c[i] = a[i] + b[i];
    }
}

OpenMP for Directive

Specifies that the iterations of the loop immediately following it must be executed in parallel by the team.

Sequential version:

for (i=0; i < n; i++) {
    low = a[i];
    high = b[i];
    if (low > high) {
        break;
    }
    for (j=low; j < high; j++)
        c[j] = (c[j] - a[i]) / b[i];
}

Parallel version (only the inner loop is work-shared):

#pragma omp parallel private(i,j)
for (i=0; i < n; i++) {
    low = a[i];
    high = b[i];
    if (low > high) {
        break;
    }
    #pragma omp for
    for (j=low; j < high; j++)
        c[j] = (c[j] - a[i]) / b[i];
}

OpenMP single Directive

Specifies that the enclosed code is to be executed by only one thread in the team. Threads in the team that do not execute the single directive wait at the end of the enclosed code block.

Sequential version:

for (i=0; i < n; i++) {
    low = a[i];
    high = b[i];
    if (low > high) {
        break;
    }
    for (j=low; j < high; j++)
        c[j] = (c[j] - a[i]) / b[i];
}

Parallel version (only one thread prints the message before every thread breaks out):

#pragma omp parallel private(i,j)
for (i=0; i < n; i++) {
    low = a[i];
    high = b[i];
    if (low > high) {
        #pragma omp single
        printf("Exiting during iteration %d\n", i);
        break;
    }
    #pragma omp for
    for (j=low; j < high; j++)
        c[j] = (c[j] - a[i]) / b[i];
}

OpenMP THREADPRIVATE Directive

The THREADPRIVATE directive makes global file-scope variables (C/C++) local and persistent to a thread across the execution of multiple parallel regions.

#include <stdio.h>
#include <omp.h>

int a, b, i, tid;
float x;
#pragma omp threadprivate(a, x)

main ()
{
    omp_set_dynamic(0);   /* Explicitly turn off dynamic thread adjustment */

    #pragma omp parallel private(b,tid)
    {
        tid = omp_get_thread_num();
        a = tid;
        b = tid;
        x = 1.1 * tid + 1.0;
        printf("Thread %d: a,b,x= %d %d %f\n", tid, a, b, x);
    }  /* end of parallel section */

    printf("Master thread doing serial work here\n");

    #pragma omp parallel private(tid)
    {
        tid = omp_get_thread_num();
        printf("Thread %d: a,b,x= %d %d %f\n", tid, a, b, x);
    }  /* end of parallel section */
}

Output (a and x are threadprivate, so each thread's values persist into the second parallel region):
Thread 0: a,b,x=
Thread 2: a,b,x=
Thread 3: a,b,x=
Thread 1: a,b,x=
Master thread doing serial work here
Thread 0: a,b,x=
Thread 3: a,b,x=
Thread 1: a,b,x=
Thread 2: a,b,x=

OpenMP parallel sections Directive – Functional Parallelism

Specifies that the enclosed section(s) of code are to be divided among the threads in the team and evaluated concurrently.

#include <stdio.h>
#include <omp.h>

main ()
{
    ...
    #pragma omp parallel sections
    {
        #pragma omp section   /* thread 1 */
        v = alpha();
        #pragma omp section   /* thread 2 */
        w = beta();
        #pragma omp section   /* thread 3 */
        y = delta();
    }  /* end of parallel section */

    x = gamma(v,w);
    printf("%f\n", epsilon(x,y));
}

OpenMP parallel sections Directive – Functional Parallelism

Another solution:

main ()
{
    ...
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section   /* thread 1 */
            v = alpha();
            #pragma omp section   /* thread 2 */
            w = beta();
        }
        #pragma omp sections
        {
            #pragma omp section   /* thread 3 */
            x = gamma(v,w);
            #pragma omp section   /* thread 4 */
            y = delta();
        }
    }  /* end of parallel section */

    printf("%f\n", epsilon(x,y));
}

OpenMP section Directive

#include <omp.h>
#define N 1000

main ()
{
    int i;
    float a[N], b[N], c[N], d[N];

    for (i=0; i < N; i++) {
        a[i] = i * 1.5;
        b[i] = i;
    }

    #pragma omp parallel shared(a,b,c,d) private(i)
    {
        #pragma omp sections nowait
        {
            #pragma omp section
            for (i=0; i < N; i++)
                c[i] = a[i] + b[i];

            #pragma omp section
            for (i=0; i < N; i++)
                d[i] = a[i] * b[i];
        }  /* end of sections */
    }  /* end of parallel section */
}

OpenMP
Synchronization constructs:
- master directive: specifies a region that is to be executed only by the master thread of the team.
- critical directive: specifies a region of code that must be executed by only one thread at a time.
- barrier directive: synchronizes all threads in the team.
- atomic directive: specifies that a specific memory location must be updated atomically (a mini critical section).

OpenMP

barrier Directive:
#pragma omp barrier

atomic Directive (applies to the single statement that follows):
#pragma omp atomic
    <statement>;
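A minimal sketch (an assumed example, not from the slides) combining these constructs: each thread atomically updates a shared counter, the barrier guarantees all updates are complete and visible, and the master thread then reports the total.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int hits = 0;

    #pragma omp parallel shared(hits)
    {
        #pragma omp atomic      /* applies to the single update that follows */
        hits += 1;

        #pragma omp barrier     /* wait until every thread has updated hits */

        #pragma omp master      /* only the master thread reports the total */
        printf("threads counted: %d\n", hits);
    }
    return 0;
}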

OpenMP Run-time Library Routines
- omp_set_num_threads(int): sets the number of threads that will be used in the next parallel region.
- omp_get_num_threads(void): returns the number of threads that are currently executing the parallel region.
- omp_get_thread_num(void): returns the thread number (ID) of the calling thread.
- omp_get_num_procs(void): returns the number of processors.
- omp_in_parallel(void): used to determine whether the section of code that is executing is parallel or not.
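A short sketch (an assumed example) exercising these routines; note the standard name of the thread-ID routine is omp_get_thread_num():

#include <stdio.h>
#include <omp.h>

int main(void)
{
    printf("processors available: %d\n", omp_get_num_procs());
    printf("in parallel region?   %d\n", omp_in_parallel());   /* 0: serial here */

    omp_set_num_threads(4);             /* request 4 threads for the next region */

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0)  /* master thread */
            printf("team size: %d, in parallel region? %d\n",
                   omp_get_num_threads(), omp_in_parallel());
    }
    return 0;
}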

Parallel Programming Models
Example: Pi calculation

π = ∫₀¹ f(x) dx = ∫₀¹ 4/(1+x²) dx ≈ w ∑ f(xᵢ)

where f(x) = 4/(1+x²), n = 10 intervals, w = 1/n, and xᵢ = w(i − 0.5).

[Figure: f(x) over [0,1], approximated by n rectangles of width w centered at the points xᵢ.]

Parallel Programming Models
Sequential Code

#include <stdio.h>
#define f(x) 4.0/(1.0+x*x)

main ()
{
    int n, i;
    float w, x, sum, pi;

    printf("n?\n");
    scanf("%d", &n);

    w = 1.0/n;
    sum = 0.0;
    for (i=1; i<=n; i++) {
        x = w*(i-0.5);
        sum += f(x);
    }
    pi = w*sum;
    printf("%f\n", pi);
}

OpenMP

#include <stdio.h>
#include <omp.h>
#define f(x) 4.0/(1.0+x*x)
#define NUM_THREADS 4

main ()
{
    float sum, w, pi, x;
    int i, n;

    printf("n?\n");
    scanf("%d", &n);

    sum = 0.0;
    w = 1.0/n;
    omp_set_num_threads(NUM_THREADS);

    #pragma omp parallel for private(x)
    for (i=0; i<n; i++) {
        x = (i+0.5)*w;
        #pragma omp critical
        sum += f(x);
    }

    pi = sum*w;
    printf("pi=%f\n", pi);
}
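Because the critical section serializes every update of sum, a common refinement (a sketch, not part of the original slides) is to use the reduction clause introduced earlier: each thread accumulates a private partial sum, and the partial sums are combined with + at the end of the loop.

#include <stdio.h>
#include <omp.h>
#define f(x) 4.0/(1.0+(x)*(x))
#define NUM_THREADS 4

int main(void)
{
    int i, n = 100000;           /* number of intervals (assumed value) */
    double w = 1.0/n, x, sum = 0.0, pi;

    omp_set_num_threads(NUM_THREADS);

    /* sum is a reduction variable: no critical section is needed */
    #pragma omp parallel for private(x) reduction(+:sum)
    for (i = 0; i < n; i++) {
        x = (i + 0.5) * w;
        sum += f(x);
    }

    pi = sum * w;
    printf("pi = %f\n", pi);
    return 0;
}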