CPE779: Shared Memory and OpenMP Based on slides by Laxmikant V. Kale and David Padua of the University of Illinois

Introduction
- Certain categories of programs
  - Use arrays and loops
  - Natural to do the loops in parallel
- Such programs should be easy to express in parallel
- OpenMP: shared memory programming model
- Some people think:
  - It is easy to convert sequential programs to parallel via OpenMP
  - It works well on shared memory hardware
- Others think: it is hard to get good performance
- Probably both are right to some degree

OpenMP
- Meant for C, C++ and FORTRAN
- Standardizes 15 years of symmetric multiprocessing (SMP) practice
- A set of compiler directives and library routines for parallel programmers

Programming Model
- Fork-Join Parallelism: single flow of control
  - Master thread spawns slave threads as needed
  - Sequential sections should be small
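
To make the fork-join picture concrete, here is a minimal C sketch (an added illustration, not part of the original slides; compile with an OpenMP flag such as -fopenmp) in which the master thread forks a team at the parallel region and joins it at the closing brace:

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("master thread: sequential section\n");      /* single flow of control */

    #pragma omp parallel                                 /* fork: a team of threads is created */
    {
        int id = omp_get_thread_num();
        printf("hello from thread %d of %d\n", id, omp_get_num_threads());
    }                                                    /* join: implicit barrier, slave threads retire */

    printf("master thread: sequential section again\n");
    return 0;
}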

Thread Interactions: Race conditions
- Threads communicate by sharing variables
- Unintended sharing can lead to race conditions
- Race condition: when the program's outcome changes as the threads are scheduled differently
- To control race conditions: use synchronization to protect data conflicts
- Synchronization is expensive, so: change how data is stored to minimize the need for synchronization
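
A minimal C sketch of an unintended race on a shared counter, and one way to protect the conflicting update with synchronization (this example is an addition, not from the original slides):

#include <stdio.h>

int main(void) {
    int count = 0;

    /* Race: all threads read and write the shared variable count without
       synchronization, so increments can be lost depending on scheduling. */
    #pragma omp parallel for
    for (int i = 0; i < 100000; i++)
        count++;
    printf("racy count         = %d\n", count);

    count = 0;
    /* Fix: protect the data conflict (atomic is a lightweight form of synchronization) */
    #pragma omp parallel for
    for (int i = 0; i < 100000; i++) {
        #pragma omp atomic
        count++;
    }
    printf("synchronized count = %d\n", count);
    return 0;
}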

General Rules
- Most of the constructs are compiler directives
- For C and C++:
    #pragma omp construct [clause [clause]…]
- For FORTRAN:
    C$OMP construct [clause [clause]…]
    !$OMP construct [clause [clause]…]
    *$OMP construct [clause [clause]…]
- Directives are ignored by non-OpenMP compilers
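
Because the directives are plain pragmas, a compiler without OpenMP support simply skips them and produces a sequential program. A small C sketch (an added illustration; the _OPENMP macro is defined by OpenMP-aware compilers):

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>              /* needed only for the runtime library routines */
#endif

int main(void) {
    /* a non-OpenMP compiler ignores this pragma and runs the loop serially */
    #pragma omp parallel for
    for (int i = 0; i < 4; i++)
        printf("iteration %d\n", i);

#ifdef _OPENMP
    printf("compiled with OpenMP support\n");
#endif
    return 0;
}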

PARALLEL DO/FOR: Syntax
- Specifies that the iterations of the immediately following do loop (Fortran) or for loop (C/C++) be executed in parallel.

#pragma omp parallel for [clause [clause...]]
for (index = first; test_expr; increment_expr) {
    body of the loop
}
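
For comparison with the Fortran saxpy on the next slide, a C version of the same parallel loop might look like this sketch (an added example, not part of the original slides):

/* z = a*x + y in single precision; loop iterations are shared among the threads */
void saxpy(float *z, float a, const float *x, const float *y, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        z[i] = a * x[i] + y[i];
}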

Single precision a*x + y (saxpy†)

subroutine saxpy (z, a, x, y, n)
integer i, n
real z(n), a, x(n), y(n)
!$omp parallel do
do i = 1, n
   z(i) = a * x(i) + y(i)
enddo
return
end

†SAXPY (Scalar Alpha X Plus Y) is one of the functions in the Basic Linear Algebra Subprograms (BLAS) package, and is a common operation in computations with vector processors.

PARALLEL DO: Execution
In time order:
- Master thread executes the serial portion of code
- Master thread enters the saxpy routine
- Master thread encounters the parallel do directive and creates slave threads (how many?)
- Master and slave threads divide the iterations of the parallel do loop and execute them concurrently (how?)
- Implicit synchronization: wait for all threads to finish their allocation of iterations
- Master thread resumes execution after the do loop; slave threads disappear

Number of threads
- Use an environment variable:
    setenv OMP_NUM_THREADS 8    (on Unix machines)
- Use the omp_set_num_threads() function:

subroutine saxpy (z, a, x, y, n)
integer i, n
real z(n), a, x(n), y(n)
!$ call omp_set_num_threads(4)
!$omp parallel do
do i = 1, n
   z(i) = a * x(i) + y(i)
enddo
return
end
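
The same two mechanisms in C might look like the following sketch (an added illustration; the environment variable is read by the OpenMP runtime when the program starts):

/* In the shell:  export OMP_NUM_THREADS=8  (sh/bash)  or  setenv OMP_NUM_THREADS 8  (csh) */
#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(4);        /* request 4 threads for later parallel regions */

    #pragma omp parallel
    {
        #pragma omp single         /* only one thread of the team prints the team size */
        printf("running with %d threads\n", omp_get_num_threads());
    }
    return 0;
}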

Loop Scheduling: User Control
- The manner in which iterations of a parallel loop are assigned to threads is called the loop's schedule.
- Unless specified otherwise, the default schedule is used, which assigns iterations to threads as evenly as possible (good enough for saxpy).
- Alternative user-specified schedules are possible - more on this later.

Communication between threads
- Six clause types allow the programmer to specify how data is shared between threads executing a parallel do (data scope clauses):
  - private(list)
  - shared(list)
  - default(private | shared | none)
  - reduction(intrinsic operator : list)
  - firstprivate(list)
  - lastprivate(list)

Default data sharing
- Most variables are shared by default
- Global variables are shared among threads
- Exception: loop index variables are private by default
- Stack variables in function calls from parallel regions are thread-private
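
A small C sketch (an added illustration) of these defaults: the array pointer and bound are shared, the loop index of the parallel loop is private automatically, and locals of a function called from the region live on each thread's own stack:

double scale(double v) {
    double tmp = 2.0 * v;       /* stack variable in a called function: thread-private */
    return tmp;
}

void fill(double *a, int n) {   /* a and n are shared by default */
    #pragma omp parallel for    /* i is the loop index, so it is private by default */
    for (int i = 0; i < n; i++)
        a[i] = scale((double)i);
}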

Example

double x, y;
int i, j, m, n, maxiter;
int depth[301][201];    /* sized to allow the 1-based indexing used below */
extern int mandel_val();

n = 300; m = 200; maxiter = 200;
#pragma omp parallel for private(j, x, y)
for (i = 1; i <= m; i++)
    for (j = 1; j <= n; j++) {
        x = i / (double) m;
        y = j / (double) n;
        depth[j][i] = mandel_val(x, y, maxiter);
    }

Private clause: Details
- Each thread has its own copy of all variables declared private.
- Private variables (implicit or explicit) are uninitialized when a thread starts (with some exceptions).
- The value of a private variable is unavailable to the master thread after a parallel loop terminates.
- firstprivate and lastprivate are solutions.

Example
- What's wrong with this example?

      program wrong
      IS = 0
C$OMP PARALLEL DO PRIVATE(IS)
      DO 800 J=1,1000
        …
        … = IS
        …
800   CONTINUE
      print *, IS

Firstprivate and lastprivate
- firstprivate (list): initializes each thread's copy of a private variable to the value of the master thread's copy, for all variables in list.
- lastprivate (list): writes back to the master's copy the value contained in the private copy belonging to the thread that executed the sequentially last iteration of the loop, for all variables in list.
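
As an added C sketch (not in the original slides) of both clauses: each thread's private copy of offset starts from the master's value, and the value assigned to last_i in the sequentially last iteration is copied back to the master:

#include <stdio.h>

int main(void) {
    int offset = 100;   /* firstprivate: copied into each thread's private copy */
    int last_i = -1;    /* lastprivate: value from the last iteration is written back */
    int a[8];

    #pragma omp parallel for firstprivate(offset) lastprivate(last_i)
    for (int i = 0; i < 8; i++) {
        a[i] = offset + i;
        last_i = i;
    }

    printf("last_i = %d, a[7] = %d\n", last_i, a[7]);   /* prints last_i = 7, a[7] = 107 */
    return 0;
}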

Corrected Example

      program right
      IS = 0
C$OMP PARALLEL DO FIRSTPRIVATE(IS) LASTPRIVATE(IS)
      DO 800 J=1,1000
        …
        … = IS
        …
800   CONTINUE
      print *, IS

Another example

      real x(n, n), c(n, n), y(n), z(n)
      x(1, 1) = …
      x(2, 1) = …
!$omp parallel do firstprivate(x) lastprivate(i, x)
      do i = 1, n
        x(1, 2) = c(i, 1) * x(1, 1)
        x(2, 2) = c(i, 2) * x(2, 1) ** 2
        y(i) = x(2, 2) + x(1, 2)
        z(i) = x(2, 2) - x(1, 2)
      enddo
      y(i-1) = x(1, 2) + x(2, 2)

Reduction Example

subroutine sum (values, n, s)
integer n
real s, values(n)
s = 0
!$omp parallel do reduction (+:s)
do i = 1, n
   s = s + values(i)
enddo
return
end

- Performs a reduction on the variable "s": a private copy of s is created for each thread.
- Iterations of the parallel loop will be distributed in equal-sized blocks to each thread in the team.
- At the end of the parallel loop construct, all threads add their values of "s" to update the master thread's global copy.
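
A C counterpart of the same reduction (an added sketch, not from the original slides):

#include <stdio.h>

/* each thread accumulates into its own private copy of s; the copies are
   combined into the shared s when the parallel loop finishes */
float sum(const float *values, int n) {
    float s = 0.0f;
    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++)
        s += values[i];
    return s;
}

int main(void) {
    float v[1000];
    for (int i = 0; i < 1000; i++) v[i] = 1.0f;
    printf("sum = %g\n", sum(v, 1000));   /* prints sum = 1000 */
    return 0;
}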

Assigning iterations to threads
- Motivation: to balance the work per thread when the work per iteration is inherently unbalanced.

!$omp parallel do private(xkind)
do i = 1, n
   xkind = f(i)
   if (xkind .lt. 10) then
      call smallwork(x(i))
   else
      call bigwork(x(i))
   endif
enddo

Schedule clause and general form
- schedule (type[, chunk])
  - type = static, dynamic, guided or runtime
  - optional chunk = scalar integer value
- static: iterations are divided as evenly as possible among all threads - simple static
- static, chunk: iterations are divided into chunks of size chunk; chunks are then assigned in round-robin fashion to threads - interleaved
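
A short C sketch (an added example, with an assumed chunk size of 4 and 16 iterations) that records which thread ran each iteration under the two static forms:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int owner[16];

    /* simple static: iterations divided as evenly as possible, one block per thread */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < 16; i++)
        owner[i] = omp_get_thread_num();
    for (int i = 0; i < 16; i++)
        printf("static   : iteration %2d ran on thread %d\n", i, owner[i]);

    /* static,4: chunks of 4 iterations assigned round-robin (interleaved) */
    #pragma omp parallel for schedule(static, 4)
    for (int i = 0; i < 16; i++)
        owner[i] = omp_get_thread_num();
    for (int i = 0; i < 16; i++)
        printf("static,4 : iteration %2d ran on thread %d\n", i, owner[i]);
    return 0;
}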

Schedule clause: dynamic or guided scheduling
- dynamic, chunk: iterations are divided into chunks of size chunk (1 if unspecified) and are assigned to threads dynamically after an initial round-robin assignment - simple dynamic
- guided, chunk: the chunk size decreases exponentially from an implementation-dependent value down to chunk (1 if unspecified); chunks are assigned dynamically - guided self-scheduling
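
A C sketch (an added example) applying both schedules to the unbalanced-work pattern from the earlier slide; smallwork and bigwork here are just assumed placeholders for cheap and expensive iterations:

#include <stdio.h>

/* placeholder work routines: bigwork is deliberately much heavier than smallwork */
static double smallwork(int i) { return 0.5 * i; }
static double bigwork(int i) {
    double s = 0.0;
    for (int k = 0; k < 100000; k++) s += 1e-6 * k * i;
    return s;
}

int main(void) {
    int n = 64;
    double total = 0.0;

    /* dynamic,2: idle threads grab the next chunk of 2 iterations as they finish */
    #pragma omp parallel for schedule(dynamic, 2) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += (i % 10 < 5) ? smallwork(i) : bigwork(i);

    /* guided: chunks start large and shrink toward 1, handed out dynamically */
    #pragma omp parallel for schedule(guided) reduction(+:total)
    for (int i = 0; i < n; i++)
        total += (i % 10 < 5) ? smallwork(i) : bigwork(i);

    printf("total = %g\n", total);
    return 0;
}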