Uses some of the slides for chapters 7 and 9 accompanying “Introduction to Parallel Computing”, Addison Wesley, 2003.


Uses some of the slides for chapters 7 and 9 accompanying “Introduction to Parallel Computing”, Addison Wesley, 2003, and some slides from OpenMP Usage / Orna Agmon Ben-Yehuda.

OpenMP
- API for shared-memory parallel applications in C, C++ and Fortran
- High-level directives for thread manipulation
- OpenMP directives provide support for:
  - Concurrency
  - Synchronization
  - Data handling

OpenMP
- Based on the #pragma compiler directive
  - Pragmas are usually not recommended (not portable between compilers)
  - But OpenMP is considered a standard
- Hides the dirty details of multithreaded programming
  - Lets the compiler do the dirty work for you
  - Injects its code into your program using the #pragma directive

Reminder – Quicksort
1. Choose one of the elements to be the pivot.
2. Rearrange the list so that all smaller elements are before the pivot, and all larger ones are after it.
3. After rearrangement, the pivot is in its final location!
4. Recursively rearrange the smaller and larger elements.
Divide-and-conquer algorithm – suitable for parallel execution (why?).

Serial Quicksort code

    void quicksort(int* values, int from, int to) {
        if (from < to) {
            int p = values[from];
            int b = from + 1, t = to;
            while (b <= t) {
                while (t >= b && values[t] >= p) t--;
                while (b <= t && values[b] <= p) b++;
                if (b <= t) swap(values[b++], values[t--]);
            }
            swap(values[from], values[t]);
            quicksort(values, from, t);
            quicksort(values, t + 1, to);
        }
    }

(Figure: partitioning stages (a)–(e); the indices b and t scan toward each other, and the pivot ends at its final position t.)

Naïve parallel Quicksort – Try 1

    void quicksort(int* values, int from, int to) {
        if (from < to) {
            // … choose pivot and do swaps …
            #pragma omp parallel sections
            {
                #pragma omp section
                quicksort(values, from, t);
                #pragma omp section
                quicksort(values, t + 1, to);
            }
        }
    }

- Sections are performed in parallel.
- Without omp_set_nested(1) no threads will be created recursively.
- Is this efficient?
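A compilable sketch of Try 1, with the pivot step the slide elides filled in from the serial version (the partition helper and function names here are mine, not the slide's). Without -fopenmp the pragmas are ignored and the code simply runs serially:

```c
#include <assert.h>

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Partition a[from..to] around a[from]; returns the pivot's final index.
   Same scanning scheme as the serial slide code. */
static int partition(int *a, int from, int to) {
    int p = a[from], b = from + 1, t = to;
    while (b <= t) {
        while (t >= b && a[t] >= p) t--;
        while (b <= t && a[b] <= p) b++;
        if (b <= t) swap_int(&a[b++], &a[t--]);
    }
    swap_int(&a[from], &a[t]);
    return t;
}

void quicksort_try1(int *a, int from, int to) {
    if (from >= to) return;
    int t = partition(a, from, to);
    /* Each half becomes a section; without omp_set_nested(1) the
       recursive calls will not spawn further threads. */
    #pragma omp parallel sections
    {
        #pragma omp section
        quicksort_try1(a, from, t - 1);
        #pragma omp section
        quicksort_try1(a, t + 1, to);
    }
}
```

Even with nesting enabled this is inefficient: a new team of threads is forked at every level of the recursion, which is exactly the problem the next tries address.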

Naïve parallel Quicksort – Try 2

    void quicksort(int* values, int from, int to) {
        …
        bool create_thread;
        static int max_threads = omp_get_max_threads();
        static int threads_num = 1;
        if (from < to) {
            // … choose pivot and do swaps …
            #pragma omp atomic
            create_thread = (++threads_num <= max_threads);
            if (create_thread) {
                #pragma omp parallel sections
                {
                    #pragma omp section
                    quicksort(values, from, t);
                    #pragma omp section
                    quicksort(values, t + 1, to);
                }
                #pragma omp atomic
                threads_num--;
            } else {
                #pragma omp atomic
                threads_num--;
                quicksort(values, from, t);
                quicksort(values, t + 1, to);
            }
        }
    }

- Does this compile? What's the atomic action?

Naïve parallel Quicksort – Try 3

    void quicksort(int* values, int from, int to) {
        …
        if (from < to) {
            // … choose pivot and do swaps …
            bool create_thread = change_threads_num(1);
            if (create_thread) {
                #pragma omp parallel sections
                {
                    #pragma omp section
                    quicksort(values, from, t);
                    #pragma omp section
                    quicksort(values, t + 1, to);
                }
                change_threads_num(-1);
            } else {
                quicksort(values, from, t);
                quicksort(values, t + 1, to);
            }
        }
    }

- Should we use a barrier here?

Naïve parallel Quicksort – Try 3

    bool change_threads_num(int change) {
        static int threads_num = 1;
        static int max_threads = omp_get_max_threads();
        bool do_change;
        #pragma omp critical
        {
            threads_num += change;
            if (threads_num <= max_threads)
                return true;
            else {
                threads_num -= change;
                return false;
            }
        }
    }

- Can't return from a structured block!

Breaking Out
- Breaking out of parallel loops (or any other structured block) is not allowed and will not compile.
- A continue can sometimes be used instead of break.

Doesn't compile:

    #pragma omp parallel
    #pragma omp for
    for (i = 0; i < 256; i++) {
        bool b = some_work();
        if (b) break;
    }

Works:

    bool b = false;
    #pragma omp parallel
    #pragma omp for
    for (i = 0; i < 256; i++) {
        if (b) continue;
        if (some_work()) b = true;
    }
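A runnable illustration of the continue-as-break pattern (the search task and names are mine, standing in for some_work):

```c
#include <assert.h>
#include <stdbool.h>

/* Search for `target` in a[0..n-1]; once found, the remaining iterations
   fall through via `continue`, since `break` would not compile in an
   OpenMP for loop. */
bool contains(const int *a, int n, int target) {
    bool found = false;
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        if (found) continue;   /* skip work; cannot break */
        if (a[i] == target) found = true;
    }
    return found;
}
```

Note that with real threads the unsynchronized found flag is itself a (benign here, since it is only ever set to true) data race; the flush slides later discuss when such a write becomes visible to other threads.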

Breaking Out
- Be careful when changing the flow prevents threads from reaching barriers.

    #pragma omp parallel
    #pragma omp for
    for (i = 0; i < 256; i++) {
        bool b = some_work();
        if (b) break;
        if (i % 16 == 0)
            #pragma omp barrier
    }

    bool b = false;
    #pragma omp parallel
    #pragma omp for
    for (i = 0; i < 256; i++) {
        if (b) continue;
        if (some_work()) b = true;
        if (i % 16 == 0)
            #pragma omp barrier
    }

In the continue version, a thread that skips its iteration body also skips the barrier, while other threads may wait at it.

Breaking Out
- Calling exit() or throwing an exception from a nested function will lead to undefined behavior.
- If exit data is important, collect status in a variable.
- What if a few threads reach an error state?
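A sketch of "collect status in a variable" when several threads may hit an error; the names and the first-error-wins policy are my choices, not the slide's:

```c
#include <assert.h>

/* Process n items; instead of exit()/throw inside the parallel loop,
   record the first nonzero status and let every iteration run to the
   loop's implicit barrier. */
int process_all(const int *items, int n) {
    int status = 0;
    int i;
    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        if (status != 0) continue;        /* error seen: stop real work */
        int s = (items[i] < 0) ? -1 : 0;  /* a negative item is an error */
        if (s != 0) {
            #pragma omp critical
            { if (status == 0) status = s; }  /* keep the first error */
        }
    }
    return status;
}
```

The critical section answers the "few threads reach error state" question: only the first reported status survives, the rest are dropped deterministically.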

Naïve parallel Quicksort – Try 3.1

    bool change_threads_num(int change) {
        static int threads_num = 1;
        static int max_threads = omp_get_max_threads();
        bool do_change;
        #pragma omp critical
        {
            do_change = (threads_num + change) <= max_threads;
            if (do_change)
                threads_num += change;
        }
        return do_change;
    }

- Is this always better than non-atomic/unsynchronized actions?
- Should threads_num be volatile?

Naïve parallel Quicksort – Try 3.2

    bool change_threads_num(int change) {
        static int threads_num = 1;
        static int max_threads = omp_get_max_threads();
        bool do_change = change < 0 ||
                         (threads_num + change) <= max_threads;
        if (do_change) {
            #pragma omp atomic
            threads_num += change;
        }
        return do_change;
    }

- What goes wrong if the thread count is inaccurate?
- What value of threads_num will be read in the do_change test? Can it be stored in a register?

Moving data between threads
- #pragma omp flush [(list)]
- The OpenMP flush (list) directive can be used for specific variables:
  1. Write variable on thread A
  2. Flush variable on thread A
  3. Flush variable on thread B
  4. Read variable on thread B
- Variables are flushed in the order of the list.
- More about the OpenMP memory model later in this course.
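The four steps map onto code like this (a sketch; shared_value and the two helper functions are mine). With two threads, producer runs on thread A and consumer on thread B; run serially the pair degenerates to a plain write followed by a read:

```c
#include <assert.h>

static int shared_value;

void producer(int v) {
    shared_value = v;                /* 1. write variable on thread A */
    #pragma omp flush(shared_value)  /* 2. flush variable on thread A */
}

int consumer(void) {
    #pragma omp flush(shared_value)  /* 3. flush variable on thread B */
    return shared_value;             /* 4. read variable on thread B */
}
```

Note that flush by itself imposes no ordering between the threads: something else (e.g. a flushed flag, or a barrier) must ensure step 2 actually happens before step 3.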

Naïve parallel Quicksort – Try 3.3

    bool change_threads_num(int change) {
        static int threads_num = 1;
        static int max_threads = omp_get_max_threads();
        #pragma omp flush (threads_num)
        bool do_change = change < 0 ||
                         (threads_num + change) <= max_threads;
        if (do_change) {
            #pragma omp atomic
            threads_num += change;
        }
        return do_change;
    }

Lessons from naïve Quicksort
- Control the number of threads when creating them recursively.
- There are no compare-and-swap operations in OpenMP.
- Critical sections are relatively expensive.
- An implicit flush occurs at barriers, at entry to and exit from critical sections, and at atomic operations (for the accessed address only).
- An implicit barrier takes place at the exit of the parallel construct and of the work-sharing constructs (for, sections, single).
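Since OpenMP as presented here offers no compare-and-swap, the thread-count reservation can alternatively be done lock-free with C11 stdatomic, outside OpenMP proper; this is my alternative sketch, not the slides' approach:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_int threads_num = 1;

/* Try to reserve one more thread, up to max_threads, without a critical
   section: the CAS loop retries if another thread raced us. */
bool try_acquire_thread(int max_threads) {
    int cur = atomic_load(&threads_num);
    while (cur + 1 <= max_threads) {
        if (atomic_compare_exchange_weak(&threads_num, &cur, cur + 1))
            return true;   /* reservation succeeded */
        /* on failure, cur now holds the fresh value; loop re-checks */
    }
    return false;
}

void release_thread(void) {
    atomic_fetch_sub(&threads_num, 1);
}
```

Unlike the critical-section version, a failed reservation costs only a few retries of the CAS loop rather than serializing every caller.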

Efficient Quicksort
- The naïve implementation performs the partitioning serially.
- We would like to parallelize the partitioning.
- In each step:
  1. Pivot selection
  2. Local rearrangement
  3. Global rearrangement
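The global rearrangement step needs each thread to know where its locally rearranged "smaller than pivot" block lands in the shared array; an exclusive prefix sum over the per-thread counts gives those offsets. A minimal serial sketch (the function and names are mine, standing in for the s_prefix_sum used in the full code below):

```c
#include <assert.h>

/* counts[i]:  how many elements below the pivot thread i holds locally.
   offsets[i]: index in the global array where thread i copies its block. */
void exclusive_prefix_sum(const int *counts, int *offsets, int p) {
    int sum = 0;
    for (int i = 0; i < p; i++) {
        offsets[i] = sum;    /* total owned by threads 0..i-1 */
        sum += counts[i];
    }
}
```

The larger-than-pivot blocks get their offsets the same way, counting from the other end of the section.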

Efficient Quicksort (figure)

Efficient global rearrangement (figure)

Code for the brave…

    void p_quicksort(int* A, int n) {
        struct proc *d;
        int i, p, s, ss, p_loc, done = 0;
        int *temp, *next, *cur = A;
        next = temp = malloc(sizeof(int) * n);
        #pragma omp parallel shared(next, done, d, p) private(i, s, ss, p_loc)
        {
            int c = omp_get_thread_num();
            p = omp_get_num_threads();
            #pragma omp single
            d = malloc(sizeof(struct proc) * p);
            d[c].start = c * n / p;
            d[c].end = (c + 1) * n / p - 1;
            d[c].sec = d[c].done = 0;
            do {
                s = d[c].start;
                if (c == d[c].sec && !d[c].done) {
                    int p_loc = quick_select(cur + s, d[c].end - s + 1);
                    swap(cur[s], cur[s + p_loc]);  /* Set pivot at start */
                }
                ss = d[d[c].sec].start;  /* pivot location */
                #pragma omp barrier
                /* Local rearrangement */
                for (i = d[c].start; i <= d[c].end; i++)
                    if (cur[i] <= cur[ss]) {
                        swap(cur[s], cur[i]);
                        s++;
                    }

- The parallel block is executed by all threads in parallel.
- Use the thread ID and the number of threads to divide the work.
- Some work must be done only once (single; an implicit barrier follows).
- The explicit barrier waits for all threads to arrive before moving on.

More code for the brave…

                /* Prefix sum */
                d[c].s_i = max(s - d[c].start, 0);
                d[c].l_i = max(d[c].end - s + 1, 0);
                #pragma omp barrier
                if (c == d[c].sec)
                    d[c].p_loc = s_prefix_sum(d, d[c].sec, p) + d[c].start - 1;
                /* Global rearrangement */
                #pragma omp barrier
                p_loc = d[c].p_loc = d[d[c].sec].p_loc;
                memcpy(next + ss + d[c].s_i, cur + d[c].start,
                       sizeof(int) * (s - d[c].start));
                memcpy(next + p_loc + d[c].l_i + 1, cur + s,
                       sizeof(int) * (d[c].end - s + 1));
                #pragma omp barrier
                /* Put pivot in place */
                if (d[c].sec == c && !d[c].done) {
                    swap(next[ss], next[p_loc]);
                    cur[p_loc] = next[p_loc];  /* the pivot is in its final location */
                }
                #pragma omp barrier
                #pragma omp single
                {
                    done = p_rebalance(d, p);
                    swap(cur, next);
                }
            } while (!done);
            /* One more iteration sequentially */
            seq_quicksort(cur + d[c].start, 0, d[c].end - d[c].start + 1);
            #pragma omp barrier
            if (A != cur)
                memcpy(A + n * c / p, cur + n * c / p,
                       sizeof(int) * (n * (c + 1) / p - n * c / p));
        }
        free(d);
        free(temp);
    }

- Exiting the parallel block implies a barrier.

Efficient Quicksort – last bits

    int p_rebalance(struct proc* d, int p) {
        int c, new_sec, sec, done = 1;
        new_sec = 1;
        for (c = 0; c < p; c++) {
            d[c].sec = sec = new_sec ? c : max(sec, d[c].sec);
            new_sec = 0;
            if (c+1 < p && d[c].p_loc >= d[c].start && d[c].p_loc <= d[c].end) {
                d[c+1].start = d[c].p_loc + 1;
                d[c].end = d[c].p_loc - 1;
                new_sec = 1;
            }
            if (c+2 < p && d[c].p_loc >= d[c+1].start && d[c].p_loc <= d[c+1].end &&
                d[c+2].end - d[c].p_loc > d[c].p_loc - d[c].start) {
                d[c+1].start = d[c].p_loc + 1;
                d[c].end = d[c].p_loc - 1;
                new_sec = 1;
            }
            if (d[c].end < d[c].start) {
                d[c].sec = c;
                new_sec = 1;
                d[c].done = 1;
            } else if (d[c].end > d[sec].start && d[c].sec != c)
                done = 0;
        }
        return done;
    }

- This is the serial implementation; it could also be done in parallel.

Lessons from efficient Quicksort
- Sometimes we need to rethink parallel algorithms and make them different from the serial ones.
- Algorithms should be chosen by actual performance, not only by complexity.
- Pivot selection has higher importance in the efficient Quicksort.

Backup

Moving data between threads
- Barriers may be overkill (overhead…).
- The OpenMP flush (list) directive can be used for specific variables:
  1. Write variable on thread A
  2. Flush variable on thread A
  3. Flush variable on thread B
  4. Read variable on thread B
- Variables are flushed in the order of the list.
- More about the OpenMP memory model later in this course.