Work Replication with Parallel Region

#pragma omp parallel
{
  for (j = 0; j < 10; j++)
    printf("Hello\n");
}

On 5 threads we get 50 printouts of "Hello", since each thread executes all 10 iterations concurrently with the other threads.

#pragma omp parallel for
for (j = 0; j < 10; j++)
  printf("Hello\n");

Regardless of the number of threads we get 10 printouts of "Hello", since the loop iterations are divided among and executed in parallel by the team of threads.
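A compilable sketch of the contrast above (assuming GCC/Clang with -fopenmp; the call to omp_set_num_threads(5) is only there to reproduce the 5-thread case):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(5);          /* illustration only: request 5 threads */

    /* work replication: every thread runs the whole loop -> 50 lines */
    #pragma omp parallel
    {
        for (int j = 0; j < 10; j++)
            printf("Hello (replicated)\n");
    }

    /* work sharing: the 10 iterations are divided among the threads -> 10 lines */
    #pragma omp parallel for
    for (int j = 0; j < 10; j++)
        printf("Hello (shared)\n");

    return 0;
}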

NOWAIT clause: C

#pragma omp parallel
{
  #pragma omp for nowait
  for (j = 1; j < n; j++)
    b[j] = (a[j] + a[j-1]) / 2.0;

  #pragma omp for
  for (j = 1; j < n; j++)
    c[j] = d[j] / e[j];
}

The nowait clause removes the implied barrier at the end of the first loop, so threads may start the second loop without waiting; this is safe here because the second loop does not use the results of the first.

Parallel Sections

So far we have divided the work of one task among threads. Parallel sections allow us to assign different tasks to different threads.
- We need to make sure that none of the later tasks depends on the results of the earlier ones.
This is helpful where it is difficult or impossible to speed up the individual tasks by executing them in parallel. The code for the entire sequence of tasks, or sections, begins with a sections directive and ends with an end sections directive. The beginning of each section is marked by a section directive, which is optional for the very first section.

Fortran sections directive

!$omp parallel sections [clause...]
  [!$omp section]
    code for 1st section
  !$omp section
    code for 2nd section
  !$omp section
    code for 3rd section
  ...
!$omp end parallel sections

C/C++ sections directive

#pragma omp parallel sections [clause...]
{
  [#pragma omp section]
    code for 1st section
  #pragma omp section
    code for 2nd section
  #pragma omp section
    code for 3rd section
  ...
}

clause can be private, firstprivate, lastprivate, or reduction.
In Fortran the NOWAIT clause goes at the end:
!$omp end sections [nowait]
In C/C++ NOWAIT is given on the sections pragma itself:
#pragma omp sections nowait
Each section is executed once, and each thread executes zero or more sections. A thread may execute more than one section if there are more sections than threads. It is not possible to determine whether one section will be executed before another, or whether two sections will be executed by the same thread.
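A compilable C sketch of parallel sections with two independent tasks (the function names task_a and task_b are just placeholders):

#include <stdio.h>
#include <omp.h>

static void task_a(void) { printf("task A on thread %d\n", omp_get_thread_num()); }
static void task_b(void) { printf("task B on thread %d\n", omp_get_thread_num()); }

int main(void) {
    #pragma omp parallel sections
    {
        #pragma omp section        /* optional for the very first section */
        task_a();

        #pragma omp section        /* must not depend on task_a's results */
        task_b();
    }
    return 0;
}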

Assigning work to a single thread

Within a parallel region a block of code may be executed just once, by any one of the threads in the team.
- There is an implicit barrier at the end of single (unless a nowait clause is supplied).
- clause can be private or firstprivate.

Fortran:
!$omp single [clause...]
  block of code to be executed by just one thread
!$omp end single [nowait]

C/C++:
#pragma omp single [clause, ... nowait]
  block of code to be executed by just one thread

single for I/O

A common use of single is for reading in shared input variables or writing output within a parallel region. I/O may not be easy to parallelize.
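A C sketch of the I/O use case, assuming one shared parameter read from stdin (the variable name nsteps and the fallback value are illustrative); the implicit barrier at the end of single ensures every thread sees the value before using it:

#include <stdio.h>
#include <omp.h>

int main(void) {
    int nsteps = 0;   /* shared input variable */

    #pragma omp parallel
    {
        /* exactly one thread reads the input; the others wait at the
           implicit barrier at the end of the single construct */
        #pragma omp single
        {
            if (scanf("%d", &nsteps) != 1)
                nsteps = 100;   /* fallback value, for illustration only */
        }

        /* all threads can now safely use nsteps */
        printf("thread %d sees nsteps = %d\n",
               omp_get_thread_num(), nsteps);
    }
    return 0;
}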

omp_get_thread_num, omp_get_num_threads

Remember that OpenMP uses the fork/join model of parallelization. Thread teams are only created within a parallel construct (parallel do/for, parallel). omp_get_thread_num and omp_get_num_threads therefore only return useful team information inside a parallel construct, where threads have been forked.
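For example, a minimal C sketch:

#include <stdio.h>
#include <omp.h>

int main(void) {
    /* outside a parallel region there is only one thread:
       this prints "thread 0 of 1" */
    printf("serial part: thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());

    #pragma omp parallel
    {
        /* inside the region each member of the forked team reports
           its own id and the team size */
        printf("parallel part: thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}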

Synchronization

Critical: for any block of code
Barrier: where all threads join
Other synchronization directives:
- master
- ordered
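A small C sketch of critical and barrier (the shared counter is illustrative):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int count = 0;

    #pragma omp parallel
    {
        /* critical: only one thread at a time executes this block */
        #pragma omp critical
        count = count + 1;

        /* barrier: no thread continues until all have incremented count */
        #pragma omp barrier

        /* one thread reports the final value */
        #pragma omp single
        printf("threads counted: %d\n", count);
    }
    return 0;
}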

Synchronization: master directive

The master directive identifies a structured block of code that is executed by the master thread of the team. There is no implicit barrier at the end of the master directive.

Fortran:
!$omp master
  code block
!$omp end master

C/C++:
#pragma omp master
  code block

master example

!$omp (or #pragma omp) parallel
  !$omp (or #pragma omp) do (or for)
    loop i = 1 : n
      calculation
    end loop
  !$omp (or #pragma omp) master
    print result (reduction) from the loop above
  !$omp end master
  more computation
end parallel region
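A minimal C sketch of this pattern, assuming an array a and a reduction variable sum (both names are illustrative); the implicit barrier at the end of the for construct guarantees sum is complete before the master thread prints it:

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void) {
    double a[N], sum = 0.0;

    #pragma omp parallel
    {
        /* work-shared loop with a reduction ("calculation") */
        #pragma omp for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = (double)i;
            sum += a[i];
        }
        /* implicit barrier here, so sum is final */

        /* only the master thread prints the result */
        #pragma omp master
        printf("sum = %f\n", sum);

        /* more computation by all threads could follow here */
    }
    return 0;
}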

Synchronization: ordered directive

The structured block following an ordered directive is executed in the order in which iterations would be executed in a sequential loop.

Fortran:
!$omp ordered
  code block
!$omp end ordered

C/C++:
#pragma omp ordered
  code block

ordered example

parallel loop (with parallel do/for and the ordered clause)
  loop i = 1 : n
    a[i] = ...calculation...
    !$omp (or #pragma omp) ordered
      print a[i]
    !$omp end ordered
  end loop
end parallel loop
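A C sketch of the example above (the calculation a[i] = 2.0 * i is just a stand-in); note that the loop directive must carry the ordered clause:

#include <stdio.h>
#include <omp.h>

#define N 16

int main(void) {
    double a[N];

    #pragma omp parallel for ordered
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * i;              /* parallel part of the iteration */

        /* this block runs in sequential iteration order */
        #pragma omp ordered
        printf("a[%d] = %f\n", i, a[i]);
    }
    return 0;
}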

OpenMP Performance

Each processor has its own cache in a shared memory machine
Data locality in caches and loop scheduling
False sharing

Data locality in caches and loop scheduling

loop j = 0 : n
  loop k = 0 : n
    a[j][k] = k + 1 + a[j][k]

loop j = 0 : n
  loop k = 0 : n
    a[j][k] = 1. / a[j][k]

Assume each processor's cache can hold its local part of the matrix. After the first loop each processor's cache will hold some of the data (which data depends on the cache lines it touched). In the next loop it may or may not get to operate on those same data, depending on the scheduling. Static scheduling may provide better cache performance than dynamic scheduling.
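A minimal C sketch of the idea, assuming the j loops are the ones parallelized: with schedule(static) each thread gets the same block of rows in both loops, so data brought into its cache by the first loop can be reused in the second.

#include <stdio.h>
#include <omp.h>

#define N 512
static double a[N][N];

int main(void) {
    /* first loop nest: each thread works on a fixed block of rows */
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            a[j][k] = k + 1 + a[j][k];

    /* second loop nest: with the same static schedule each thread
       revisits the same rows, which may still be in its cache */
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            a[j][k] = 1.0 / a[j][k];

    printf("a[0][0] = %f\n", a[0][0]);
    return 0;
}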

False sharing

If different processors update stride-one elements of an array, this can cause poor cache performance: the cache line has to be invalidated all the time among all the processors.

Parallel loop with schedule(static,1):
loop j = 1 : n
  a[j] = a[j] + j

Proc 1 updates a[1], proc 2 updates a[2], etc. The cache line holding these elements needs to be invalidated for each processor, and this leads to bad performance.
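A C sketch of the pattern and one common remedy (the array size is arbitrary; the assumption is a 64-byte cache line holding 8 doubles):

#include <stdio.h>
#include <omp.h>

#define N 100000
static double a[N];

int main(void) {
    /* prone to false sharing: with a chunk size of 1, neighbouring
       elements written by different threads share a cache line */
    #pragma omp parallel for schedule(static, 1)
    for (int j = 0; j < N; j++)
        a[j] = a[j] + j;

    /* large contiguous chunks per thread (default static schedule):
       each thread writes mostly its own cache lines, so little false sharing */
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < N; j++)
        a[j] = a[j] + j;

    printf("a[N-1] = %f\n", a[N-1]);
    return 0;
}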

Look up in the OpenMP standard: threadprivate

!$omp threadprivate (/cb1/, /cb2/)
#pragma omp threadprivate(list)

cb1 and cb2 are common blocks in Fortran; list is a list of named file-scope or namespace-scope variables in C.

threadprivate makes the named common blocks private to a thread but global within the thread. Likewise, it makes the named file-scope or namespace-scope variables (list) private to a thread but visible throughout the thread.
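A minimal C sketch of threadprivate (the variable name counter is illustrative): each thread gets its own copy of the file-scope variable, which persists between parallel regions provided the team size does not change and dynamic thread adjustment is off.

#include <stdio.h>
#include <omp.h>

/* file-scope variable: one private copy per thread */
static int counter = 0;
#pragma omp threadprivate(counter)

int main(void) {
    #pragma omp parallel
    {
        counter = omp_get_thread_num();   /* each thread sets its own copy */
    }

    #pragma omp parallel
    {
        /* with the same team, each thread still sees the value it stored */
        printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
    }
    return 0;
}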

Look up in the OpenMP standard: atomic

The atomic directive ensures that a specific memory location is updated atomically. It can perform better than critical because it can map to dedicated hardware instructions.

C:
#pragma omp parallel for
for (i = 1; i < n; i++) {
  #pragma omp atomic
  a[index[i]] = a[index[i]] + 1;
}

Fortran:
!$omp parallel do
do i = 1, n
  !$omp atomic
  y(index(i)) = y(index(i)) + c
end do