Spin Locks Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.


Spin Locks Companion slides for The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit

Art of Multiprocessor Programming2 What Should you do if you can’t get a lock? Keep trying –“spin” or “busy-wait” –Good if delays are short Give up the processor –Good if delays are long –Always good on uniprocessor

Art of Multiprocessor Programming3 What Should you do if you can’t get a lock? Keep trying –“spin” or “busy-wait” –Good if delays are short Give up the processor –Good if delays are long –Always good on uniprocessor our focus

Art of Multiprocessor Programming4 Basic Spin-Lock CS Resets lock upon exit spin lock critical section...

Art of Multiprocessor Programming5 Basic Spin-Lock CS Resets lock upon exit spin lock critical section... …lock introduces sequential bottleneck Seq Bottleneck ⇒ no parallelism

Art of Multiprocessor Programming6 Review: Test-and-Set Boolean value Test-and-set (TAS) –Swap true with current value –Return value tells if prior value was true or false Can reset just by writing false TAS aka “getAndSet”

Art of Multiprocessor Programming7 Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) { boolean prior = value; value = newValue; return prior; }}

Art of Multiprocessor Programming8 Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) { boolean prior = value; value = newValue; return prior; }} Package java.util.concurrent.atomic

Art of Multiprocessor Programming9 Review: Test-and-Set public class AtomicBoolean { boolean value; public synchronized boolean getAndSet(boolean newValue) { boolean prior = value; value = newValue; return prior; }} Swap old and new values

Art of Multiprocessor Programming10 Review: Test-and-Set AtomicBoolean lock = new AtomicBoolean(false) … boolean prior = lock.getAndSet(true)

Art of Multiprocessor Programming11 Review: Test-and-Set AtomicBoolean lock = new AtomicBoolean(false) … boolean prior = lock.getAndSet(true) Swapping in true is called “test-and-set” or TAS

Art of Multiprocessor Programming12 Test-and-Set Locks Locking –Lock is free: value is false –Lock is taken: value is true Acquire lock by calling TAS –If result is false, you win –If result is true, you lose Release lock by writing false

Art of Multiprocessor Programming13 Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }}

Art of Multiprocessor Programming14 Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }} Lock state is AtomicBoolean

Art of Multiprocessor Programming15 Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }} Keep trying until lock acquired

Art of Multiprocessor Programming16 Test-and-set Lock class TASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (state.getAndSet(true)) {} } void unlock() { state.set(false); }} Release lock by resetting state to false

Art of Multiprocessor Programming17 Performance Experiment –n threads –Increment shared counter 1 million times How long should it take? How long does it take?

Art of Multiprocessor Programming18 Graph [graph: time vs. number of threads, ideal curve] No speedup because of sequential bottleneck

Art of Multiprocessor Programming19 Mystery #1 [graph: time vs. number of threads, TAS lock vs. ideal] What is going on?

Art of Multiprocessor Programming20 Test-and-Test-and-Set Locks Lurking stage –Wait until lock “looks” free –Spin while read returns true (lock taken) Pouncing state –As soon as lock “looks” available –Read returns false (lock free) –Call TAS to acquire lock –If TAS loses, back to lurking

Art of Multiprocessor Programming21 Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; }}}

Art of Multiprocessor Programming22 Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; }}} Wait until lock looks free

Art of Multiprocessor Programming23 Test-and-test-and-set Lock class TTASlock { AtomicBoolean state = new AtomicBoolean(false); void lock() { while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; }}} Then try to acquire it

Art of Multiprocessor Programming24 Mystery #2 [graph: time vs. number of threads, TAS lock vs. TTAS lock vs. ideal]

Art of Multiprocessor Programming25 Opinion Our memory abstraction is broken TAS & TTAS methods –Are provably the same (in our model) –Except they aren’t (in field tests) Need a more detailed model …

Art of Multiprocessor Programming26 Bus-Based Architectures Bus cache memory cache

Art of Multiprocessor Programming27 Bus-Based Architectures Bus cache memory cache Random access memory (10s of cycles)

Art of Multiprocessor Programming28 Bus-Based Architectures cache memory cache Shared Bus Broadcast medium One broadcaster at a time Processors and memory all “snoop” Bus

Art of Multiprocessor Programming29 Bus-Based Architectures Bus cache memory cache Per-Processor Caches Small Fast: 1 or 2 cycles Address & state information

Art of Multiprocessor Programming30 Jargon Watch Cache hit –“I found what I wanted in my cache” –Good Thing™

Art of Multiprocessor Programming31 Jargon Watch Cache hit –“I found what I wanted in my cache” –Good Thing™ Cache miss –“I had to shlep all the way to memory for that data” –Bad Thing™

Art of Multiprocessor Programming32 Cave Canem This model is still a simplification –But not in any essential way –Illustrates basic principles Will discuss complexities later

Art of Multiprocessor Programming33 Bus Processor Issues Load Request cache memory cache data

Art of Multiprocessor Programming34 Bus Processor Issues Load Request Bus cache memory cache data Gimme data

Art of Multiprocessor Programming35 cache Bus Memory Responds Bus memory cache data Got your data right here data

Art of Multiprocessor Programming36 Bus Processor Issues Load Request memory cache data Gimme data

Art of Multiprocessor Programming37 Bus Processor Issues Load Request Bus memory cache data Gimme data

Art of Multiprocessor Programming38 Bus Processor Issues Load Request Bus memory cache data I got data

Art of Multiprocessor Programming39 Bus Other Processor Responds memory cache data I got data data Bus

Art of Multiprocessor Programming40 Bus Other Processor Responds memory cache data Bus

Art of Multiprocessor Programming41 Modify Cached Data Bus data memory cache data

Art of Multiprocessor Programming42 Modify Cached Data Bus data memory cache data

Art of Multiprocessor Programming43 memory Bus data Modify Cached Data cache data

Art of Multiprocessor Programming44 memory Bus data Modify Cached Data cache What’s up with the other copies? data

Art of Multiprocessor Programming45 Cache Coherence We have lots of copies of data –Original copy in memory –Cached copies at processors Some processor modifies its own copy –What do we do with the others? –How to avoid confusion?

Art of Multiprocessor Programming46 Write-Back Caches Accumulate changes in cache Write back when needed –Need the cache for something else –Another processor wants it On first modification –Invalidate other entries –Requires non-trivial protocol …

Art of Multiprocessor Programming47 Write-Back Caches Cache entry has three states –Invalid: contains raw seething bits –Valid: I can read but I can’t write –Dirty: Data has been modified Intercept other load requests Write back to memory before using cache

Art of Multiprocessor Programming48 Bus Invalidate memory cache data

Art of Multiprocessor Programming49 Bus Invalidate Bus memory cache data Mine, all mine!

Art of Multiprocessor Programming50 cache Bus Invalidate memory cache data Other caches lose read permission

Art of Multiprocessor Programming51 cache Bus Invalidate memory cache data Other caches lose read permission This cache acquires write permission

Art of Multiprocessor Programming52 cache Bus Invalidate memory cache data Memory provides data only if not present in any cache, so no need to change it now (expensive)

Art of Multiprocessor Programming53 cache Bus Another Processor Asks for Data memory cache data Bus

Art of Multiprocessor Programming54 cache data Bus Owner Responds memory cache data Bus Here it is!

Art of Multiprocessor Programming55 Bus End of the Day … memory cache data Reading OK, no writing data

Art of Multiprocessor Programming56 Simple TASLock TAS invalidates cache lines Spinners –Miss in cache –Go to bus Thread wants to release lock –delayed behind spinners

Art of Multiprocessor Programming57 Test-and-test-and-set Wait until lock “looks” free –Spin on local cache –No bus use while lock busy Problem: when lock is released –Invalidation storm …

Art of Multiprocessor Programming58 Local Spinning while Lock is Busy Bus memory busy

Art of Multiprocessor Programming59 Bus On Release memory free invalid free

Art of Multiprocessor Programming60 On Release Bus memory free invalid free miss Everyone misses, rereads

Art of Multiprocessor Programming61 On Release Bus memory free invalid free TAS(…) Everyone tries TAS

Art of Multiprocessor Programming62 Problems Everyone misses –Reads satisfied sequentially Everyone does TAS –Invalidates others’ caches

Art of Multiprocessor Programming63 Mystery Explained [graph: time vs. number of threads, TAS lock vs. TTAS lock vs. ideal] Better than TAS, but still not as good as ideal

Art of Multiprocessor Programming64 Solution: Introduce Delay [diagram: after each failed attempt the thread spins, then waits a growing delay (d, r1·d, r2·d, …) before retrying] If the lock looks free but I fail to get it, there must be lots of contention. Better to back off than to collide again.

Art of Multiprocessor Programming65 Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}}

Art of Multiprocessor Programming66 Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Fix minimum delay

Art of Multiprocessor Programming67 Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Wait until lock looks free

Art of Multiprocessor Programming68 Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} If we win, return

Art of Multiprocessor Programming69 Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Back off for random duration

Art of Multiprocessor Programming70 Exponential Backoff Lock public class Backoff implements Lock { public void lock() { int delay = MIN_DELAY; while (true) { while (state.get()) {} if (!state.getAndSet(true)) return; sleep(random() % delay); if (delay < MAX_DELAY) delay = 2 * delay; }}} Double max delay, within reason

Art of Multiprocessor Programming71 Spin-Waiting Overhead [graph: time vs. number of threads, TTAS lock vs. backoff lock]

BUILT-IN FUNCTIONS FOR ATOMIC MEMORY ACCESS 72

Built-in functions for atomic memory access type __sync_fetch_and_add (type *ptr, type value,...) type __sync_fetch_and_sub (type *ptr, type value,...) type __sync_fetch_and_or (type *ptr, type value,...) type __sync_fetch_and_and (type *ptr, type value,...) type __sync_fetch_and_xor (type *ptr, type value,...) type __sync_fetch_and_nand (type *ptr, type value,...) { tmp = *ptr; *ptr op= value; return tmp; } { tmp = *ptr; *ptr = ~tmp & value; return tmp; } 73
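Not part of the original slides: a minimal usage sketch of the fetch-and-op family, assuming GCC or Clang with the legacy __sync builtins and POSIX threads (compile with -pthread). Two threads increment a shared counter; because each increment is atomic, the final value is exact.

#include <pthread.h>
#include <stdio.h>

static int counter = 0;                      /* shared by both threads */

static void *worker(void *arg)
{
    for (int i = 0; i < 1000000; i++)
        __sync_fetch_and_add(&counter, 1);   /* atomic read-modify-write; old value ignored */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %d\n", counter);       /* always 2000000 */
    return 0;
}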

Built-in functions for atomic memory access type __sync_add_and_fetch (type *ptr, type value,...) type __sync_sub_and_fetch (type *ptr, type value,...) type __sync_or_and_fetch (type *ptr, type value,...) type __sync_and_and_fetch (type *ptr, type value,...) type __sync_xor_and_fetch (type *ptr, type value,...) type __sync_nand_and_fetch (type *ptr, type value,...) { *ptr op= value; return *ptr; } { *ptr = ~*ptr & value; return *ptr; } // nand 74
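The only difference from the previous slide's family is which value is returned: fetch_and_op returns the value before the operation, op_and_fetch the value after. A small sketch (added, not from the slides):

#include <stdio.h>

int main(void)
{
    int x = 5;
    int before = __sync_fetch_and_add(&x, 1);   /* returns 5, x becomes 6 */
    int after  = __sync_add_and_fetch(&x, 1);   /* x becomes 7, returns 7 */
    printf("before = %d, after = %d, x = %d\n", before, after, x);
    return 0;
}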

Built-in functions for atomic memory access bool __sync_bool_compare_and_swap (type *ptr, type oldval, type newval,...) type __sync_val_compare_and_swap (type *ptr, type oldval, type newval,...) type __sync_lock_test_and_set (type *ptr, type value,...) 75
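An aside, not on the original slides: __sync_lock_test_and_set and its companion __sync_lock_release are enough to rebuild the TAS/TTAS spin lock from the first half of the deck in C. A rough sketch:

/* 0 = free, 1 = taken */
typedef struct { volatile int state; } taslock_t;

static void tas_lock(taslock_t *l)
{
    /* __sync_lock_test_and_set stores 1 and returns the previous value
       (with acquire semantics); spin locally while the lock looks taken */
    while (__sync_lock_test_and_set(&l->state, 1))
        while (l->state)
            ;                                /* test-and-test-and-set: read-only spin */
}

static void tas_unlock(taslock_t *l)
{
    __sync_lock_release(&l->state);          /* writes 0 with release semantics */
}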

OPENMP 76

OpenMP (Shared Memory, Thread Based Parallelism) OpenMP is a shared-memory application programming interface (API) whose features are based on prior efforts to facilitate shared-memory parallel programming. (Compiler Directive Based) OpenMP provides directives, library functions, and environment variables to create and control the execution of parallel programs. (Explicit Parallelism) OpenMP’s directives let the user tell the compiler which instructions to execute in parallel and how to distribute them among the threads that will run the code. 77

OpenMP example int main() { for (i = 0; i < 10; i++) { a [i] = i * 0.5; b [i] = i * 2.0; } sum = 0; for (i = 1; i <= 10; i++ ) { sum += a[i]*b[i]; } printf ("sum = %f", sum); } 78

OpenMP example int main() { for (i = 0; i < 10; i++) { a [i] = i * 0.5; b [i] = i * 2.0; } sum = 0; for (i = 1; i <= 10; i++ ) { sum += a[i]*b[i]; } printf ("sum = %f", sum); } int main() { for (i = 0; i < 10; i++) { a [i] = i * 0.5; b [i] = i * 2.0; } sum = 0; #pragma omp parallel private(t) shared(sum, a, b) { t = 0; #pragma omp for for (i = 1; i <= 10; i++ ) { t += a[i]*b[i]; } #pragma omp critical (update_sum) sum += t; } printf ("sum = %f \n", sum); } 79

PARALLEL Region Construct A parallel region is a block of code that will be executed by multiple threads. This is the fundamental OpenMP parallel construct. Starting from the beginning of this parallel region, the code is duplicated and all threads will execute that code. There is an implied barrier at the end of a parallel section. Only the master thread continues execution past this point. How Many Threads? –Setting of the NUM_THREADS clause –Use of the omp_set_num_threads() library function –Setting of the OMP_NUM_THREADS environment variable –Implementation default - usually the number of CPUs on a node A parallel region must be a structured block that does not span multiple routines or code files. It is illegal to branch into or out of a parallel region. 80
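A small added sketch (assuming gcc -fopenmp) of the precedence among those mechanisms: the num_threads clause beats omp_set_num_threads(), which in turn beats the OMP_NUM_THREADS environment variable and the implementation default.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    omp_set_num_threads(4);                  /* overrides OMP_NUM_THREADS */

    #pragma omp parallel
    printf("team of %d threads\n", omp_get_num_threads());   /* 4 */

    #pragma omp parallel num_threads(2)      /* clause overrides the library call */
    printf("team of %d threads\n", omp_get_num_threads());   /* 2 */

    return 0;
}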

PARALLEL Region Construct #pragma omp parallel [clause...] newline if (scalar_expression) private (list) shared (list) default (shared | none) firstprivate (list) num_threads (integer-expression) structured_block 81

PARALLEL Region Construct #include <omp.h> #include <stdio.h> void main () { int nthreads, tid; /* Fork a team of threads with each thread having a private tid variable */ #pragma omp parallel private(tid) { /* Obtain and print thread id */ tid = omp_get_thread_num(); printf("Hello World from thread = %d\n", tid); /* Only master thread does this */ if (tid == 0) { nthreads = omp_get_num_threads(); printf("Number of threads = %d\n", nthreads); } /* All threads join master thread and terminate */ } } 82

Work-Sharing Constructs for Directive The for directive specifies that the iterations of the loop immediately following it must be executed in parallel by the team. This assumes a parallel region has already been initiated, otherwise it executes in serial on a single processor. 83

Work-Sharing Constructs for Directive #pragma omp for [clause...] newline schedule (type [,chunk]) ordered private (list) firstprivate (list) lastprivate (list) shared (list) nowait for_loop 84

Work-Sharing Constructs for Directive #include <omp.h> #define CHUNKSIZE 100 #define N 1000 void main () { int i, chunk; float a[N], b[N], c[N]; /* Some initializations */ for (i=0; i < N; i++) a[i] = b[i] = i * 1.0; chunk = CHUNKSIZE; #pragma omp parallel shared(a,b,c,chunk) private(i) { #pragma omp for schedule(dynamic,chunk) nowait for (i=0; i < N; i++) c[i] = a[i] + b[i]; } /* end of parallel section */ } 85

Work-Sharing Constructs SECTIONS Directive The SECTIONS directive is a non- iterative work-sharing construct. It specifies that the enclosed section(s) of code are to be divided among the threads in the team. 86

Work-Sharing Constructs SECTIONS Directive #pragma omp sections [clause...] newline private (list) firstprivate (list) lastprivate (list) nowait { #pragma omp section newline structured_block } 87

Work-Sharing Constructs SECTIONS Directive #include <omp.h> #define N 1000 void main () { int i; float a[N], b[N], c[N], d[N]; /* Some initializations */ for (i=0; i < N; i++) { a[i] = i * 1.5; b[i] = i; } #pragma omp parallel shared(a,b,c,d) private(i) { #pragma omp sections nowait { #pragma omp section for (i=0; i < N; i++) c[i] = a[i] + b[i]; #pragma omp section for (i=0; i < N; i++) d[i] = a[i] * b[i]; } /* end of sections */ } /* end of parallel section */ } 88

Work-Sharing Constructs SINGLE Directive The SINGLE directive specifies that the enclosed code is to be executed by only one thread in the team. May be useful when dealing with sections of code that are not thread safe (such as I/O) #pragma omp single [clause...] newline private (list) firstprivate (list) nowait structured_block 89
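An added sketch (not from the slides) of SINGLE guarding a non-thread-safe step; the implicit barrier at the end of the single block keeps the other threads from running ahead:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int n = 0;

    #pragma omp parallel shared(n)
    {
        #pragma omp single
        {
            printf("thread %d does the setup\n", omp_get_thread_num());
            n = 1000;                        /* e.g. read a size from a file */
        }                                    /* implicit barrier: n is now set for everyone */

        printf("thread %d sees n = %d\n", omp_get_thread_num(), n);
    }
    return 0;
}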

Data Scope Attribute Clauses - PRIVATE The PRIVATE clause declares variables in its list to be private to each thread. A new object of the same type is declared once for each thread in the team All references to the original object are replaced with references to the new object Variables declared PRIVATE should be assumed to be uninitialized for each thread 90
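An added sketch of the PRIVATE clause in action; tmp inside the region is a fresh, uninitialized copy per thread, so it must be assigned before it is read:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int tmp = 42;                            /* original object */

    #pragma omp parallel private(tmp)
    {
        tmp = omp_get_thread_num();          /* assign before use: the copy starts uninitialized */
        printf("thread %d has tmp = %d\n", omp_get_thread_num(), tmp);
    }
    /* do not rely on tmp carrying a value out of the region */
    return 0;
}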

Data Scope Attribute Clauses - SHARED The SHARED clause declares variables in its list to be shared among all threads in the team. A shared variable exists in only one memory location and all threads can read or write to that address It is the programmer's responsibility to ensure that multiple threads properly access SHARED variables (such as via CRITICAL sections) 91

Data Scope Attribute Clauses - FIRSTPRIVATE The FIRSTPRIVATE clause combines the behavior of the PRIVATE clause with automatic initialization of the variables in its list. Listed variables are initialized according to the value of their original objects prior to entry into the parallel or work-sharing construct. 92
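An added sketch of FIRSTPRIVATE: each thread's copy starts with the value the original variable had when the region was entered:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int base = 10;

    #pragma omp parallel firstprivate(base)
    {
        base += omp_get_thread_num();        /* private copy, initialized to 10 */
        printf("thread %d: base = %d\n", omp_get_thread_num(), base);
    }
    return 0;
}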

Data Scope Attribute Clauses - LASTPRIVATE The LASTPRIVATE clause combines the behavior of the PRIVATE clause with a copy from the last loop iteration or section to the original variable object. The value copied back into the original variable object is obtained from the last (sequentially) iteration or section of the enclosing construct. For example, the team member which executes the final iteration for a DO section, or the team member which does the last SECTION of a SECTIONS context performs the copy with its own values 93
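An added sketch of LASTPRIVATE on a loop: whichever thread executes the sequentially last iteration (i == N-1) copies its x back to the original variable:

#include <omp.h>
#include <stdio.h>
#define N 100

int main(void)
{
    int i, x = -1;
    int a[N];

    #pragma omp parallel for lastprivate(x)
    for (i = 0; i < N; i++) {
        x = i * i;
        a[i] = x;
    }

    printf("x = %d\n", x);                   /* value from i == N-1, i.e. 9801 */
    return 0;
}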

Synchronization Constructs CRITICAL Directive The CRITICAL directive specifies a region of code that must be executed by only one thread at a time. If a thread is currently executing inside a CRITICAL region and another thread reaches that CRITICAL region and attempts to execute it, it will block until the first thread exits that CRITICAL region. 94

Synchronization Constructs CRITICAL Directive #include <omp.h> void main() { int x; x = 0; #pragma omp parallel shared(x) { #pragma omp critical x = x + 1; } /* end of parallel section */ } 95

Synchronization Constructs MASTER Directive –The MASTER directive specifies a region that is to be executed only by the master thread of the team. All other threads on the team skip this section of code BARRIER Directive –The BARRIER directive synchronizes all threads in the team. –When a BARRIER directive is reached, a thread will wait at that point until all other threads have reached that barrier. All threads then resume executing in parallel the code that follows the barrier. 96
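An added sketch combining the two: the master thread initializes shared data, and the explicit barrier (MASTER has no implied one) keeps the other threads from reading it too early:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int n = 0;

    #pragma omp parallel shared(n)
    {
        #pragma omp master
        n = 1000;                            /* only thread 0 runs this */

        #pragma omp barrier                  /* everyone waits until n is set */

        printf("thread %d sees n = %d\n", omp_get_thread_num(), n);
    }
    return 0;
}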

Synchronization Constructs ATOMIC Directive –The ATOMIC directive specifies that a specific memory location must be updated atomically, rather than letting multiple threads attempt to write to it. In essence, this directive provides a mini-CRITICAL section. –The directive applies only to a single, immediately following statement 97
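An added sketch of ATOMIC protecting exactly one update; for a simple read-modify-write it is typically cheaper than a full CRITICAL section:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int hits = 0;

    #pragma omp parallel
    {
        for (int i = 0; i < 100000; i++) {
            #pragma omp atomic
            hits++;                          /* the increment itself is atomic */
        }
    }

    printf("hits = %d\n", hits);             /* number of threads * 100000 */
    return 0;
}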

Find an error int main() { for (i = 0; i < n; i++) { a [i] = i * 0.5; b [i] = i * 2.0; } sum = 0; t = 0; #pragma omp parallel shared(sum, a, b, n) { #pragma omp for private(t) for (i = 1; i <= n; i++ ) { t = t + a[i]*b[i]; } #pragma omp critical (update_sum) sum += t; } printf ("sum = %f \n", sum); } 98

Find an error for (i=0; i<n-1; i++) a[i] = a[i] + b[i]; for (i=0; i<n-1; i++) a[i] = a[i+1] + b[i]; 99

Find an error #pragma omp parallel { int Xlocal = omp_get_thread_num(); Xshared = omp_get_thread_num(); printf("Xlocal = %d Xshared = %d\n",Xlocal,Xshared); } int i, j; #pragma omp parallel for for (i=0; i<n; i++) for (j=0; j<m; j++) { a[i][j] = compute(i,j); } 100

Find an error void compute(int n) { int i; double h, x, sum; h = 1.0/(double) n; sum = 0.0; #pragma omp for reduction(+:sum) shared(h) for (i=1; i <= n; i++) { x = h * ((double)i - 0.5); sum += (1.0 / (1.0 + x*x)); } pi = h * sum; } 101

Find an error void main () { #pragma omp parallel for private(i,a,b) for (i=0; i<n; i++) { b++; a = b+i; } /*-- End of parallel for --*/ c = a + b; } 102

Find an error int icount; void lib_func() { icount++; do_lib_work(); } main () { #pragma omp parallel { lib_func(); } 103

Find an error work1() { #pragma omp barrier } work2() { } main() { #pragma omp parallel sections { #pragma omp section work1(); #pragma omp section work2(); } 104

Art of Multiprocessor Programming105 This work is licensed under a Creative Commons Attribution-ShareAlike 2.5 License. You are free: –to Share — to copy, distribute and transmit the work –to Remix — to adapt the work Under the following conditions: –Attribution. You must attribute the work to “The Art of Multiprocessor Programming” (but not in any way that suggests that the authors endorse you or your use of the work). –Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license. For any reuse or distribution, you must make clear to others the license terms of this work. The best way to do this is with a link to the license web page. –Any of the above conditions can be waived if you get permission from the copyright holder. Nothing in this license impairs or restricts the author's moral rights.