
Introduction to OpenMP (Originally for CS 838, Fall 2005) University of Wisconsin-Madison Slides are derived from online references of Lawrence Livermore National Laboratory, National Energy Research Scientific Computing Center, University of Minnesota, OpenMP.org

2 (C) 2006 CS Architecture Seminar
Introduction to OpenMP
What is OpenMP?
– Open specification for Multi-Processing
– “Standard” API for defining multi-threaded shared-memory programs
– – Talks, examples, forums, etc.
High-level API
– Preprocessor (compiler) directives ( ~ 80% )
– Library calls ( ~ 19% )
– Environment variables ( ~ 1% )

3 (C) 2006 CS Architecture Seminar
A Programmer’s View of OpenMP
OpenMP is a portable, threaded, shared-memory programming specification with “light” syntax
– Exact behavior depends on the OpenMP implementation!
– Requires compiler support (C or Fortran)
OpenMP will:
– Allow a programmer to separate a program into serial regions and parallel regions, rather than T concurrently-executing threads
– Hide stack management
– Provide synchronization constructs
OpenMP will not:
– Parallelize (or detect!) dependencies
– Guarantee speedup
– Provide freedom from data races

4 (C) 2006 CS Architecture Seminar
Outline
Introduction
– Motivating example
– Parallel programming is hard
OpenMP Programming Model
– Easier than PThreads
Microbenchmark Performance Comparison
– vs. PThreads
Discussion
– specOMP

5 (C) 2006 CS Architecture Seminar
Current Parallel Programming
1. Start with a parallel algorithm
2. Implement, keeping in mind:
   – Data races
   – Synchronization
   – Threading syntax
3. Test & Debug
4. Debug
5. Debug

6 (C) 2006 CS Architecture Seminar
Motivation – Threading Library

#include <stdio.h>
#include <pthread.h>

void* SayHello(void *foo) {
    printf( "Hello, world!\n" );
    return NULL;
}

int main() {
    pthread_attr_t attr;
    pthread_t threads[16];
    int tn;
    pthread_attr_init(&attr);
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);
    for(tn=0; tn<16; tn++) {
        pthread_create(&threads[tn], &attr, SayHello, NULL);
    }
    for(tn=0; tn<16; tn++) {
        pthread_join(threads[tn], NULL);
    }
    return 0;
}

7 (C) 2006 CS Architecture Seminar
Motivation
Thread libraries are hard to use
– P-Threads/Solaris threads have many library calls for initialization, synchronization, thread creation, condition variables, etc.
– Programmer must code with multiple threads in mind
Synchronization between threads introduces a new dimension of program correctness

8 (C) 2006 CS Architecture Seminar
Motivation
Wouldn’t it be nice to write serial programs and somehow parallelize them “automatically”?
– OpenMP can parallelize many serial programs with relatively few annotations that specify parallelism and independence
– OpenMP is a small API that hides cumbersome threading calls with simpler directives

9 (C) 2006 CS Architecture Seminar
Better Parallel Programming
1. Start with some algorithm
   – Embarrassing parallelism is helpful, but not necessary
2. Implement serially, ignoring:
   – Data races
   – Synchronization
   – Threading syntax
3. Test and Debug
4. Automatically (magically?) parallelize
   – Expect linear speedup

10 (C) 2006 CS Architecture Seminar
Motivation – OpenMP

int main() {
    // Do this part in parallel
    printf( "Hello, World!\n" );
    return 0;
}

11 (C) 2006 CS Architecture Seminar
Motivation – OpenMP

#include <stdio.h>
#include <omp.h>

int main() {
    omp_set_num_threads(16);

    // Do this part in parallel
    #pragma omp parallel
    {
        printf( "Hello, World!\n" );
    }
    return 0;
}

12 (C) 2006 CS Architecture Seminar
OpenMP Parallel Programming
1. Start with a parallelizable algorithm
   – Embarrassing parallelism is good, loop-level parallelism is necessary
2. Implement serially, mostly ignoring:
   – Data races
   – Synchronization
   – Threading syntax
3. Test and Debug
4. Annotate the code with parallelization (and synchronization) directives
   – Hope for linear speedup
5. Test and Debug

13 (C) 2006 CS Architecture Seminar
Programming Model – Threading
Serial regions by default, annotate to create parallel regions
– Generic parallel regions
– Parallelized loops
– Sectioned parallel regions
Thread-like Fork/Join model
– Arbitrary number of logical thread creation/destruction events

14 (C) 2006 CS Architecture Seminar
Programming Model – Threading

int main() {
    // serial region
    printf("Hello…");

    // parallel region (fork)
    #pragma omp parallel
    {
        printf("World");
    }
    // (join)

    // serial again
    printf("!");
}

Output (4 threads): Hello…WorldWorldWorldWorld!

15 (C) 2006 CS Architecture Seminar
Programming Model – Nested Threading
Fork/Join can be nested
– Nesting complication handled “automagically” at compile-time
– Independent of the number of threads actually running

16 (C) 2006 CS Architecture Seminar
Programming Model – Thread Identification
Master Thread
– Thread with ID=0
– Only thread that exists in sequential regions
– Depending on implementation, may have special purpose inside parallel regions
– Some special directives affect only the master thread (like master)

17 (C) 2006 CS Architecture Seminar
Programming Model – Data/Control Parallelism
Data parallelism
– Threads perform similar functions, guided by thread identifier
Control parallelism
– Threads perform differing functions
» One thread for I/O, one for computation, etc.

18 (C) 2006 CS Architecture Seminar
Programming Model – Concurrent Loops
OpenMP easily parallelizes loops
– No data dependencies between iterations!
Preprocessor calculates loop bounds for each thread directly from serial source

#pragma omp parallel for
for( i=0; i < 25; i++ ) {
    printf("Foo");
}

19 (C) 2006 CS Architecture Seminar
Programming Model – Loop Scheduling
schedule clause determines how loop iterations are divided among the thread team
– static([chunk]) divides iterations statically between threads
» Each thread receives [chunk] iterations, rounding as necessary to account for all iterations
» Default [chunk] is ceil( # iterations / # threads )
– dynamic([chunk]) allocates [chunk] iterations per thread, allocating an additional [chunk] iterations when a thread finishes
» Forms a logical work queue, consisting of all loop iterations
» Default [chunk] is 1
– guided([chunk]) allocates dynamically, but [chunk] is exponentially reduced with each allocation

20 (C) 2006 CS Architecture Seminar
Programming Model – Loop Scheduling

#pragma omp parallel for \
    schedule(static)
for( i=0; i<16; i++ ) {
    doIteration(i);
}

// Static Scheduling
int chunk = 16/T;
int base = tid * chunk;
int bound = (tid+1)*chunk;
for( i=base; i<bound; i++ ) {
    doIteration(i);
}
Barrier();

21 (C) 2006 CS Architecture Seminar
Programming Model – Loop Scheduling

#pragma omp parallel for \
    schedule(dynamic)
for( i=0; i<16; i++ ) {
    doIteration(i);
}

// Dynamic Scheduling
int current_i;
while( workLeftToDo() ) {
    current_i = getNextIter();
    doIteration(current_i);
}
Barrier();

22 (C) 2006 CS Architecture Seminar
Programming Model – Data Sharing
Parallel programs often employ two types of data
– Shared data, visible to all threads, similarly named
– Private data, visible to a single thread (often stack-allocated)
OpenMP:
– shared variables are shared
– private variables are private
PThreads:
– Global-scoped variables are shared
– Stack-allocated variables are private

// PThreads version
// shared, globals
int bigdata[1024];

void* foo(void* bar) {
    // private, stack
    int tid;

    /* Calculation goes here */
}

// OpenMP version
int bigdata[1024];

void* foo(void* bar) {
    int tid;

    #pragma omp parallel \
        shared ( bigdata ) \
        private ( tid )
    {
        /* Calc. here */
    }
}

23 (C) 2006 CS Architecture Seminar
Programming Model – Synchronization
OpenMP Synchronization
– OpenMP Critical Sections
» Named or unnamed
» No explicit locks

#pragma omp critical
{
    /* Critical code here */
}

– Barrier directives

#pragma omp barrier

– Explicit Lock functions
» When all else fails – may require flush directive

omp_set_lock( lock l );
/* Code goes here */
omp_unset_lock( lock l );

– Single-thread regions within parallel regions
» master, single directives

#pragma omp single
{
    /* Only executed once */
}

24 (C) 2006 CS Architecture Seminar
Programming Model – Summary
Threaded, shared-memory execution model
– Serial regions and parallel regions
– Parallelized loops with customizable scheduling
Concurrency expressed with preprocessor directives
– Thread creation, destruction mostly hidden
– Often expressed after writing a serial version, through annotation

25 (C) 2006 CS Architecture Seminar
Outline
Introduction
– Motivating example
– Parallel programming is hard
OpenMP Programming Model
– Easier than PThreads
Microbenchmark Performance Comparison
– vs. PThreads
Discussion
– specOMP

26 (C) 2006 CS Architecture Seminar
Performance Concerns
Is the overhead of OpenMP too high?
– How do the scheduling and synchronization options affect performance?
– How does autogenerated code compare to hand-written code?
Can OpenMP scale?
– 4 threads? 16? More?
What should OpenMP be compared against?
– PThreads?
– MPI?

27 (C) 2006 CS Architecture Seminar
Performance Comparison: OMP vs. PThreads
PThreads
– Shared-memory, portable threading implementation
– Explicit thread creation, destruction (pthread_create)
– Explicit stack management
– Synch: locks, condition variables
Microbenchmarks implemented in OpenMP, PThreads
– Explore OpenMP loop scheduling policies
– Comparison vs. tuned PThreads implementation

28 (C) 2006 CS Architecture Seminar
Methodology
Microbenchmarks implemented in OpenMP and PThreads, compiled with similar optimizations, same compiler (Sun Studio)
Execution times measured on a 16-processor Sun Enterprise 6000 (cabernet.cs.wisc.edu), 2GB RAM, 1MB L2 cache
Parameters varied:
– Number of processors (threads)
– Working set size
– OpenMP loop scheduling policy

29 (C) 2006 CS Architecture Seminar
Microbenchmark: Ocean
Conceptually similar to SPLASH-2’s ocean
Simulates ocean temperature gradients via successive approximation
– Operates on a 2D grid of floating-point values
“Embarrassingly” parallel
– Each thread operates in a rectangular region
– Inter-thread communication occurs only on region boundaries
– Very little synchronization (barrier-only)
Easy to write in OpenMP!

30 (C) 2006 CS Architecture Seminar
Microbenchmark: Ocean

for( t=0; t < t_steps; t++) {

    #pragma omp parallel for \
        shared(ocean,x_dim,y_dim) private(x,y)
    for( x=0; x < x_dim; x++) {
        for( y=0; y < y_dim; y++) {
            ocean[x][y] = /* avg of neighbors */;
        }
    }
    // Implicit Barrier Synchronization

    temp_ocean = ocean;
    ocean = other_ocean;
    other_ocean = temp_ocean;
}

31 (C) 2006 CS Architecture Seminar
Microbenchmark: Ocean
OpenMP:
– ocean_dynamic – Traverses entire ocean, row-by-row, assigning row iterations to threads with dynamic scheduling.
– ocean_static – Traverses entire ocean, row-by-row, assigning row iterations to threads with static scheduling.
– ocean_squares – Each thread traverses a square-shaped section of the ocean. Loop-level scheduling not used; loop bounds for each thread are determined explicitly.
PThreads:
– ocean_pthreads – Each thread traverses a square-shaped section of the ocean. Loop bounds for each thread are determined explicitly.

32 (C) 2006 CS Architecture Seminar Microbenchmark: Ocean [results figure omitted]

33 (C) 2006 CS Architecture Seminar Microbenchmark: Ocean [results figure omitted]

34 (C) 2006 CS Architecture Seminar
Microbenchmark: GeneticTSP
Genetic heuristic-search algorithm for approximating a solution to the traveling salesperson problem
Operates on a population of possible TSP paths
– Forms new paths by combining known, good paths (crossover)
– Occasionally introduces new random elements (mutation)
Variables:
– Np – Population size, determines search space and working set size
– Ng – Number of generations, controls effort spent refining solutions
– rC – Rate of crossover, determines how many new solutions are produced and evaluated in a generation
– rM – Rate of mutation, determines how often new (random) solutions are introduced

35 (C) 2006 CS Architecture Seminar
Microbenchmark: GeneticTSP

while( current_gen < Ng ) {
    Breed rC*Np new solutions:
        Select two parents
        Perform crossover()
        Mutate() with probability rM
        Evaluate() new solution
    Identify least-fit rC*Np solutions:
        Remove unfit solutions from population
    current_gen++
}
return the most fit solution found

The outer loop has a data dependence between iterations, as the population is not loop-invariant.
New solutions can be generated in parallel, but crossover(), mutate(), and evaluate() have varying runtimes.
Threads can find least-fit population members in parallel, but only one thread should actually delete solutions.

36 (C) 2006 CS Architecture Seminar
Microbenchmark: GeneticTSP
OpenMP:
– dynamic_tsp – Parallelizes both breeding loop and survival loop with OpenMP’s dynamic scheduling
– static_tsp – Parallelizes both breeding loop and survival loop with OpenMP’s static scheduling
– tuned_tsp – Attempt to tune scheduling. Uses guided (exponential allocation) scheduling on the breeding loop, static predicated scheduling on the survival loop.
PThreads:
– pthreads_tsp – Divides iterations of breeding loop evenly among threads, conditionally executes survival loop in parallel

37 (C) 2006 CS Architecture Seminar Microbenchmark: GeneticTSP [results figure omitted]

38 (C) 2006 CS Architecture Seminar
Evaluation
OpenMP scales to 16-processor systems
– Was overhead too high?
» In some cases, yes
– Did compiler-generated code compare to hand-written code?
» Yes!
– How did the loop scheduling options affect performance?
» dynamic or guided scheduling helps loops with variable iteration runtimes
» static or predicated scheduling more appropriate for shorter loops
Is OpenMP the right tool to parallelize scientific applications?

39 (C) 2006 CS Architecture Seminar
SpecOMP (2001)
Parallel form of SPEC FP 2000 using OpenMP, larger working sets
– Aslot et al., Workshop on OpenMP Apps. and Tools (2001)
Many of CFP2000 were “straightforward” to parallelize:
– ammp: 16 calls to OpenMP API, 13 #pragmas, converted linked lists to vector lists
– applu: 50 directives, mostly parallel or do
– fma3d: 127 lines of OpenMP directives (60k lines total)
– mgrid: automatic translation to OpenMP
– swim: 8 loops parallelized

40 (C) 2006 CS Architecture Seminar
SpecOMP

Benchmark | Lines of Code | Parallel Cov. | # Par. Sections
ammp      | 13,500 (C)    | 99.2 %        | 7
applu     | 4,…           | … %           | 22
apsi      | 7,…           | … %           | 24
art       | 1,300 (C)     | 99.5 %        | 3
facerec   | 2,…           | … %           | 2
fma3d     | 60,…          | … %           | 30
gafort    | 1,…           | … %           | 6
galgel    | 15,…          | … %           | 29
equake    | 1,500 (C)     | 98.4 %        | 11
mgrid     | …             | … %           | 12
swim      | …             | … %           | 8
wupwise   | 2,…           | … %           | 6

41 (C) 2006 CS Architecture Seminar
SpecOMP – Scalability
Aslot et al. Execution times on a “generic” 350 MHz machine. [figure omitted]

42 (C) 2006 CS Architecture Seminar
Limitations
OpenMP requires compiler support
– Sun Studio compiler
– Intel VTune
– Polaris/OpenMP (Purdue)
OpenMP does not parallelize dependencies
– Often does not detect dependencies
– Nasty race conditions still exist!
OpenMP is not guaranteed to divide work optimally among threads
– Programmer-tweakable with scheduling clauses
– Still lots of rope available

43 (C) 2006 CS Architecture Seminar
Limitations
Doesn’t totally hide the concept of volatile data
– From a high level, use of OMP’s locks can seem like consistency violations if the flush directive is forgotten
Workload applicability
– Easy to parallelize “scientific” applications
– How might one create an OpenMP web server? Database?
Adoption hurdle
– Search for “OpenMP”:
» 3 results (out of 72,000)

44 (C) 2006 CS Architecture Seminar
Summary
OpenMP is a compiler-based technique to create concurrent code from (mostly) serial code
OpenMP can enable (easy) parallelization of loop-based code
– Lightweight syntactic language extensions
OpenMP performs comparably to manually-coded threading
– Scalable
– Portable
Not a silver bullet for all applications

45 (C) 2006 CS Architecture Seminar
More Information
– OpenMP official site
– A handy OpenMP tutorial
– Another OpenMP tutorial and reference

46 (C) 2006 CS Architecture Seminar Backup Slides Syntax, etc

47 (C) 2006 CS Architecture Seminar
Consistency Violation?

#pragma omp parallel for \
    shared(x) private(i)
for( i=0; i<100; i++ ) {
    #pragma omp atomic
    x++;
}
printf("%i",x);

#pragma omp parallel for \
    shared(x) private(i)
for( i=0; i<100; i++ ) {
    omp_set_lock(my_lock);
    x++;
    omp_unset_lock(my_lock);
}
#pragma omp flush
printf("%i",x);

48 (C) 2006 CS Architecture Seminar
OpenMP Syntax
General syntax for OpenMP directives:

#pragma omp directive [clause…] CR

Directive specifies the type of OpenMP operation
– Parallelization
– Synchronization
– Etc.
Clauses (optional) modify semantics of the directive

49 (C) 2006 CS Architecture Seminar
OpenMP Syntax
PARALLEL syntax:

#pragma omp parallel [clause…] CR
structured_block

Ex:
#pragma omp parallel
{
    printf("Hello!\n");
} // implicit barrier

Output (T=4):
Hello!
Hello!
Hello!
Hello!

50 (C) 2006 CS Architecture Seminar
OpenMP Syntax
DO/for syntax (DO – Fortran, for – C):

#pragma omp for [clause…] CR
for_loop

Ex:
#pragma omp parallel
{
    #pragma omp for private(i) shared(x) \
        schedule(static,x/N)
    for(i=0;i<x;i++) printf("Hello!\n");
} // implicit barrier

Note: Must reside inside a parallel section

51 (C) 2006 CS Architecture Seminar
OpenMP Syntax
More on Clauses
– private() – A variable in the private list is private to each thread
– shared() – Variables in the shared list are visible to all threads
» Implies no synchronization, or even consistency!
– schedule() – Determines how iterations will be divided among threads
» schedule(static, C) – Each thread will be given C iterations
  (usually T*C = number of total iterations)
» schedule(dynamic) – Each thread will be given additional iterations as-needed
  (often less efficient than a well-considered static allocation)
– nowait – Removes implicit barrier from end of block

52 (C) 2006 CS Architecture Seminar
OpenMP Syntax
PARALLEL FOR (combines parallel and for):

#pragma omp parallel for [clause…] CR
for_loop

Ex:
#pragma omp parallel for shared(x) \
    private(i) \
    schedule(dynamic)
for(i=0;i<x;i++) {
    printf("Hello!\n");
}

53 (C) 2006 CS Architecture Seminar
Example: AddMatrix
Files: (Makefile)
– addmatrix.c // omp-parallelized
– matrixmain.c // non-omp
– printmatrix.c // non-omp

54 (C) 2006 CS Architecture Seminar
OpenMP Syntax
ATOMIC syntax:

#pragma omp atomic CR
simple_statement

Ex:
#pragma omp parallel shared(x)
{
    #pragma omp atomic
    x++;
} // implicit barrier

55 (C) 2006 CS Architecture Seminar
OpenMP Syntax
CRITICAL syntax:

#pragma omp critical CR
structured_block

Ex:
#pragma omp parallel shared(x)
{
    #pragma omp critical
    {
        // only one thread in here
    }
} // implicit barrier

56 (C) 2006 CS Architecture Seminar
OpenMP Syntax
ATOMIC vs. CRITICAL
Use ATOMIC for “simple statements”
– Can have lower overhead than CRITICAL if HW atomics are leveraged (implementation dependent)
Use CRITICAL for larger expressions
– May involve an unseen implicit lock

57 (C) 2006 CS Architecture Seminar
OpenMP Syntax
MASTER – only Thread 0 executes a block

#pragma omp master CR
structured_block

SINGLE – only one thread executes a block

#pragma omp single CR
structured_block

No implied synchronization on entry (master implies no barrier at all; single has an implicit barrier at its end)

58 (C) 2006 CS Architecture Seminar
OpenMP Syntax
BARRIER:

#pragma omp barrier CR

Locks
– Locks are provided through omp.h library calls
– omp_init_lock()
– omp_destroy_lock()
– omp_test_lock()
– omp_set_lock()
– omp_unset_lock()

59 (C) 2006 CS Architecture Seminar
OpenMP Syntax
FLUSH:

#pragma omp flush CR

Guarantees that threads’ views of memory are consistent
Why? Recall OpenMP directives…
– Code is generated by directives at compile-time
» Variables are not always declared as volatile
» Using variables from registers instead of memory can seem like a consistency violation
– Synchronization often has an implicit flush
» ATOMIC, CRITICAL

60 (C) 2006 CS Architecture Seminar
OpenMP Syntax
Functions
– omp_set_num_threads()
– omp_get_num_threads()
– omp_get_max_threads()
– omp_get_num_procs()
– omp_get_thread_num()
– omp_set_dynamic()
– omp_[init destroy test set unset]_lock()

61 (C) 2006 CS Architecture Seminar Microbenchmark: Ocean [diagram omitted: ocean grid divided into square regions among Thread 0, Thread 1, Thread 2, Thread 3]

62–66 (C) 2006 CS Architecture Seminar Microbenchmark: Ocean [backup results figures omitted]