Presentation transcript: Open[M]ulti[P]rocessing (OpenMP)

1 Open[M]ulti[P]rocessing
– Pthreads: programmer explicitly defines thread behavior; openMP: compiler and runtime system define thread behavior
– Pthreads: library independent of the compiler; openMP: library requires compiler support
– Pthreads: low-level, with maximum flexibility; openMP: higher-level, with less flexibility
– Pthreads: application parallelized all at once; openMP: programmer can incrementally parallelize an application
– Pthreads: difficult to program high-performance parallel applications; openMP: much simpler to program high-performance parallel applications
– Pthreads: explicit fork/join or detached-thread model; openMP: implicit fork/join model

2 Creating openMP Programs
In C, openMP makes use of the preprocessor.
General syntax: #pragma omp directive [clause [clause] ...]
Note: to extend a directive across multiple lines, end each line with the '\' character.
Error checking: adapt the code if compiler support is lacking:

#ifdef _OPENMP            // Only include the header if it exists
#include <omp.h>
#endif

#ifdef _OPENMP            // Get number of threads and rank if openMP exists
    int rank = omp_get_thread_num();
    int threads = omp_get_num_threads();
#else                     // Default to one thread, rank zero, without openMP support
    int rank = 0;
    int threads = 1;
#endif
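OpenMP support must also be enabled when compiling; with GCC, for example, that is the -fopenmp flag (which also defines the _OPENMP macro tested above):

gcc -fopenmp program.c -o program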

3 openMP Parallel Regions
[Figure: openMP parallel regions, from "An Introduction Into OpenMP", Copyright © 2005 Sun Microsystems]

4 Parallel Directive Overview
– The compiler, on encountering the omp parallel pragma, creates multiple threads.
– All threads execute the specified structured block of code.
– A structured block is either a single statement or a compound statement created with {...}, with a single entry point and a single exit point.
– There is an implicit barrier at the end of the construct.
– Within the block, we can declare variables to be
    Private: local to a particular thread
    Shared: visible to all threads
Initiate a parallel structured block:

#pragma omp parallel
{
    /* code here */
}

5 Example

int x, threads;
#pragma omp parallel private(x, threads)
{
    x = omp_get_thread_num();
    threads = omp_get_num_threads();
    a[x] += threads;
}

– omp_get_num_threads() returns the number of active threads
– omp_get_thread_num() returns the rank (counting from 0)
– x and threads are private variables, local to each thread
– Array a[] is a global shared array
– Note: a[x] in this example is not a critical section (a[x] is a unique location for each thread)

6 Matrix Times Vector
[Figure: matrix-times-vector example, from "An Introduction Into OpenMP", Copyright © 2005 Sun Microsystems]
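The slide itself is a figure; a minimal sketch of the computation it illustrates, assuming a row-major m×n matrix A, input vector x, and output vector y, might look like this. The outer loop parallelizes cleanly because each thread writes a distinct y[i]:

void mat_vec(int m, int n, const double A[], const double x[], double y[]) {
    int i, j;
    #pragma omp parallel for private(j)    // each thread owns distinct rows i
    for (i = 0; i < m; i++) {
        double sum = 0.0;                  // private by virtue of block scope
        for (j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];    // row-major indexing
        y[i] = sum;
    }
}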

7 Trapezoidal Rule
Calculate the approximate integral by summing up a set of adjacent trapezoids, each with the same width.
– Single-trapezoid approximation: (b − a) · (f(a) + f(b)) / 2
– Area of one trapezoid: h · (f(x_i) + f(x_{i+1})) / 2
– A closer approximation, with n trapezoids of width h = (x_n − x_0)/n:
    h · [ f(x_0)/2 + f(x_1) + f(x_2) + ··· + f(x_{n-1}) + f(x_n)/2 ]
Sequential algorithm (given a, b, and n):

w = (b - a) / n;                       // Width of one trapezoid
integral = 0;
for (i = 1; i <= n - 1; i++)           // Evaluate at each interior point
    integral += f(a + i*w);
integral = w * (integral + f(a)/2 + f(b)/2);   // The approximate result

8 Parallel Trapezoidal Integration

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

double f(double x);                        // the integrand, defined elsewhere

void TrapezoidIntegration(double a, double b, int n, double* global_result) {
    int rank = omp_get_thread_num(), threads = omp_get_num_threads();
    double w = (b - a) / n;                // width of one trapezoid
    int myN = n / threads;                 // trapezoids handled by this thread
    double myA = a + rank * myN * w;       // this thread's left endpoint
    double myB = myA + myN * w;            // this thread's right endpoint
    double myResult = 0.0;
    int i;
    for (i = 1; i <= myN - 1; i++)
        myResult += f(myA + i*w);
    #pragma omp critical                   // Mutual exclusion on the shared sum
    *global_result += w * (myResult + f(myA)/2 + f(myB)/2);
}

int main(int argc, char* argv[]) {
    double result = 0.0;
    int threads = strtol(argv[1], NULL, 10);
    double a = atof(argv[2]), b = atof(argv[3]);
    int n = strtol(argv[4], NULL, 10);     // assumption: n was never read in the original; take it as a fourth argument
    #pragma omp parallel num_threads(threads)
    TrapezoidIntegration(a, b, n, &result);
    printf("%d trapezoids, %f to %f, Integral = %.14e\n", n, a, b, result);
    return 0;
}

9 Global Reduction

double TrapezoidIntegration(double a, double b, int n) {
    int rank = omp_get_thread_num(), threads = omp_get_num_threads();
    double w = (b - a) / n;
    int myN = n / threads;
    double myA = a + rank * myN * w;
    double myB = myA + myN * w;
    double myResult = 0.0;
    int i;
    for (i = 1; i <= myN - 1; i++)
        myResult += f(myA + i*w);
    return w * (myResult + f(myA)/2 + f(myB)/2);   // no critical section: the reduction combines per-thread results
}

int main(int argc, char* argv[]) {
    double result = 0.0;
    int threads = strtol(argv[1], NULL, 10);
    double a = atof(argv[2]), b = atof(argv[3]);
    int n = strtol(argv[4], NULL, 10);     // assumption, as on the previous slide
    #pragma omp parallel num_threads(threads) reduction(+: result)
    result += TrapezoidIntegration(a, b, n);
    printf("%d trapezoids, %f to %f, Integral = %.14e\n", n, a, b, result);
    return 0;
}

10 Parallel for loop
– The parallel directive creates a team of threads to execute a specified block of code in parallel.
– To make optimal use of system resources, the number of threads in a team is determined dynamically.
– By default, an implicit barrier follows the loop: all threads wait at the end of the construct before proceeding.
– Team size is determined by the first of the following that applies (highest precedence first; see the sketch after this list):
    1. The num_threads clause on the parallel directive, which sets the team size for that particular directive
    2. The omp_set_num_threads() library routine
    3. The OMP_NUM_THREADS environment variable
– Corresponds to the forall construct. Syntax:

#pragma omp parallel for
for (i = 0; i < n; i++) {
    /* loop body */
}
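A small sketch of the three mechanisms together (the thread counts here are arbitrary choices for illustration); when all three are present, the clause wins:

#include <omp.h>
#include <stdio.h>

int main(void) {
    omp_set_num_threads(4);               // overrides OMP_NUM_THREADS
    #pragma omp parallel num_threads(2)   // overrides omp_set_num_threads()
    printf("rank %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    return 0;
}
/* Run as e.g. OMP_NUM_THREADS=8 ./a.out — exactly two lines print,
   because the num_threads clause has the highest precedence. */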

11 Illustration of parallel for loops
[Figure: execution of parallel for loops, from "An Introduction Into OpenMP", Copyright © 2005 Sun Microsystems]

12 Data Dependencies
The compiler rejects loops that don't follow openMP rules:
– The number of iterations must be known in advance.
– The loop control expressions cannot be floats or doubles and cannot change during execution of the loop.
– The index can only be changed by the increment part of the for statement.

int Linear_search(int key, int A[], int n) {
    int i;
    #pragma omp parallel for num_threads(thread_count)
    for (i = 0; i < n; i++)
        if (A[i] == key)
            return i;    // Compiler error: invalid exit from OpenMP structured block
    return -1;
}
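A sketch of one legal alternative: instead of returning from inside the loop, each thread records a match in a shared result under a critical section, so the structured block has no early exit (thread_count is assumed defined, as on the slide):

int Linear_search(int key, int A[], int n) {
    int result = -1;     // shared across the team
    int i;
    #pragma omp parallel for num_threads(thread_count)
    for (i = 0; i < n; i++) {
        if (A[i] == key) {
            #pragma omp critical
            if (result == -1 || i < result)   // keep the smallest matching index
                result = i;
        }
    }
    return result;
}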

13 Data Dependencies
Compiles, but results are inconsistent. Fibonacci example: f_0 = f_1 = 1; f_i = f_{i-1} + f_{i-2}

fibo[0] = fibo[1] = 1;
#pragma omp parallel for num_threads(threads)
for (i = 2; i < n; i++)
    fibo[i] = fibo[i-1] + fibo[i-2];

Possible outcomes using two threads (with n = 10):
1 1 2 3 5 8 13 21 34 55    // correct
1 1 2 3 5 8 0 0 0 0        // incorrect, e.g., if thread 1 read fibo[4] and fibo[5] before thread 0 wrote them
Conclusions:
– Dependencies within a single iteration will work correctly.
– openMP compilers don't reject a parallel for directive whose iterations have dependences.
– Avoid attempting to parallelize loops with cross-iteration dependencies, where one iteration depends upon the computations of another.

14 More par/for examples
Trapezoidal integration:

h = (b - a) / n;
integral = (f(a) + f(b)) / 2.0;
#pragma omp parallel for \
        num_threads(threads) \
        reduction(+: integral)
for (i = 1; i <= n - 1; i++)
    integral += f(a + i*h);
integral = h * integral;

Calculation of π:

double sum = 0.0, factor;
int k;
#pragma omp parallel for \
        num_threads(threads) \
        reduction(+: sum) \
        private(factor)
for (k = 0; k < n; k++) {
    factor = (k % 2 == 0) ? 1.0 : -1.0;   // alternating series: pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...
    sum += factor / (2*k + 1);
}
double pi_approx = 4.0 * sum;

15 Odd/Even Sort

// Note that for, unlike parallel, does not fork threads; it uses those already available
// Spawning new threads is an inefficient operation, so it is used sparingly
#pragma omp parallel num_threads(threads) \
        default(none) shared(a, n) private(i, tmp, phase)
for (phase = 0; phase < n; phase++) {
    if (phase % 2 == 0) {
        #pragma omp for
        for (i = 1; i < n; i += 2) {
            if (a[i-1] > a[i]) {
                tmp = a[i-1]; a[i-1] = a[i]; a[i] = tmp;
            }
        }
    } else {
        #pragma omp for
        for (i = 1; i < n-1; i += 2) {
            if (a[i] > a[i+1]) {
                tmp = a[i+1]; a[i+1] = a[i]; a[i] = tmp;
            }
        }
    }
}

Note: the default(none) clause forces the programmer to specify the scope of all variables.
Note: there is an implicit barrier after each inner omp for construct, i.e., after every phase.

16 Scheduling of Threads
Clause: schedule(type, chunk)
– Static: iterations are assigned to threads before the loop is executed. The system assigns chunks of iterations to threads in a round-robin fashion. For eight iterations, 0, 1, ..., 7, and two threads:
    schedule(static, 1) assigns 0, 2, 4, 6 to thread 0 and 1, 3, 5, 7 to thread 1
    schedule(static, 4) assigns 0, 1, 2, 3 to thread 0 and 4, 5, 6, 7 to thread 1
– Dynamic or guided: iterations are assigned to threads while the loop is executing. After a thread completes its current set of iterations, it requests more. Guided initially assigns large chunks, which decrease down to chunk size as threads request more work; dynamic always uses the chunk size.
– auto: the compiler and/or the run-time system determines the schedule.
– runtime: the schedule is determined at run time (taken from the OMP_SCHEDULE environment variable).

17 Scheduling Example

#pragma omp parallel for num_threads(threads) \
        reduction(+: sum) schedule(static, 1)
for (i = 0; i <= n; i++)
    sum += f(i);

Iterations are assigned before the loop executes, round-robin, one iteration at a time to each thread.
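For comparison, a sketch of dynamic scheduling; irregular_work() is a hypothetical function whose cost varies per iteration, which is exactly where dynamic scheduling pays off:

#pragma omp parallel for num_threads(threads) schedule(dynamic, 4)
for (i = 0; i < n; i++)
    result[i] = irregular_work(i);   // hypothetical: iterations have uneven cost,
                                     // so idle threads grab the next chunk of 4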

18 Sections

// Allocate sections among available threads
#pragma omp sections
{
    // The first section directive is implied and optional
    #pragma omp section
    { /* structured_block */ }

    // Each section can have its own individual code
    #pragma omp section
    { /* structured_block */ }
    ...
}

– The structured blocks are divided among the threads of the team; each section is executed once, by one thread.
– The sections directive does not create new thread teams.
– Notes: sections can be nested; different independent code blocks run simultaneously in sections.
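A minimal sketch, assuming two hypothetical, independent helper functions; with a team of two, the sections can execute concurrently:

#pragma omp parallel sections num_threads(2)
{
    #pragma omp section
    sort_lower_half(a, n / 2);               // hypothetical helper

    #pragma omp section
    sort_upper_half(a + n / 2, n - n / 2);   // hypothetical helper, independent of the first
}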

19 OMP: Sequential within Parallel Blocks
Single: the block is executed by exactly one of the threads
– Syntax: #pragma omp single { /* code */ }
– Note: there is an implied barrier at the end of the construct unless nowait appears on the pragma line
Master: the block is executed by the master thread
– Syntax: #pragma omp master { /* code */ }
– Note: there is no implied barrier in this construct
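A minimal sketch of the difference; the setup and work functions are hypothetical placeholders:

#pragma omp parallel
{
    #pragma omp single
    load_shared_table();     // hypothetical: done once, by whichever thread arrives first;
                             // the implied barrier makes every thread wait for it

    #pragma omp master
    printf("progress report from the master thread\n");   // no barrier: others don't wait

    process_chunk(omp_get_thread_num());   // hypothetical per-thread work
}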

20 Critical Sections / Synchronization
Critical sections: #pragma omp critical(name) { /* code */ }
– A critical section is keyed by its name.
– A thread reaching the critical directive blocks until no other thread is executing a critical section with the same name.
– The name is optional; if not specified, a global default is used.
Barrier: #pragma omp barrier
– Threads wait until all threads reach the barrier; then they all proceed together.
– Caution: all threads must be able to reach the barrier.
Atomic expression: #pragma omp atomic
– A critical section that updates a variable by executing a single simple expression.
Flush: #pragma omp flush(variable_list)
– The executing thread gets a consistent view of the shared variables.
– Pending read and write operations on the variables complete and values are written back to memory; new memory operations in the code after the flush are not started until it completes, creating a "memory fence".
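A minimal sketch of atomic for a shared counter; atomic handles exactly this kind of single-statement update, typically more cheaply than a full critical section:

int hits = 0;
#pragma omp parallel num_threads(4)
{
    #pragma omp atomic    // protects just this one update
    hits++;
}
/* equivalent but heavier:
   #pragma omp critical
   hits++;
*/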

