
– 1 – Basic Machine-Independent Performance Optimizations

Topics:
- Load balancing (review, already discussed), in the context of OpenMP notation
- Performance optimization by code restructuring, in the context of OpenMP notation

– 2 – OpenMP Implementation Overview

An OpenMP implementation consists of a compiler plus a runtime library, unlike Pthreads, which is purely a library.
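A minimal sketch of the two halves working together: the pragma is processed by the compiler, while the omp_* calls (standard OpenMP API routines) come from the runtime library.

#include <stdio.h>
#include <omp.h>

int main( void )
{
  /* compiler half: the pragma is translated into thread-spawning code */
  #pragma omp parallel
  {
    /* library half: runtime routines report each thread's identity */
    printf( "thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads() );
  }
  return 0;
}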

– 3 – OpenMP Example Usage (1 of 2)

[Diagram: annotated source → OpenMP compiler → sequential program or parallel program, selected by a compiler switch]

– 4 – OpenMP Example Usage (2 of 2)

If you compile with the sequential switch, the pragmas are ignored. If you compile with the parallel switch, the pragmas are read and cause translation into a parallel program. Ideally, a single source serves for both the sequential and the parallel program (a big maintenance plus).
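For example, with GCC (the switch name is compiler-specific; this is just one common toolchain):

gcc code.c -o seq_prog            # no OpenMP switch: pragmas ignored, sequential program
gcc -fopenmp code.c -o par_prog   # OpenMP switch: pragmas translated, parallel program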

– 5 – OpenMP Directives

- Parallelization directives: parallel for
- Data environment directives: shared, private, threadprivate, reduction, etc.

– 6 – OpenMP Notation: Parallel For

#pragma omp parallel for

- A number of threads are spawned at entry.
- Each thread is assigned a set of iterations of the loop and executes that code (e.g., a block or cyclic assignment of iterations to threads).
- Each thread waits at the end.
- Very similar to fork/join synchronization.

– 7 – API Semantics

- The master thread executes the sequential code.
- Master and slaves execute the parallel code.
- Note: very similar to the fork/join semantics of the Pthreads create/join primitives.

– 8 – Scheduling of Iterations

Scheduling: assigning iterations to threads. OpenMP allows various scheduling strategies, such as block, cyclic, etc.

– 9 – Scheduling of Iterations: Specification

#pragma omp parallel for schedule( <type> )

<type> can be one of:
- block (the default)
- cyclic

(In standard OpenMP syntax these correspond to schedule(static) and schedule(static,1); the slides use the generic names.)
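A self-contained sketch of the two strategies in standard OpenMP spelling (scale is a hypothetical function introduced here for illustration):

void scale( int n, double *x )
{
  int i;

  /* block assignment: each thread gets one contiguous chunk of iterations */
  #pragma omp parallel for schedule( static )
  for( i=0; i<n; i++ )
    x[i] *= 2.0;

  /* cyclic assignment: iterations are dealt out round-robin, one at a time */
  #pragma omp parallel for schedule( static, 1 )
  for( i=0; i<n; i++ )
    x[i] *= 2.0;
}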

– 10 – Example

Multiplication of two matrices, C = A x B, where A is upper-triangular (all elements below the diagonal are 0).

[Figure: the matrix A, with zeros below the diagonal]

– 11 – Sequential Matrix Multiply

With A upper-triangular, the multiply becomes:

for( i=0; i<n; i++ )
  for( j=0; j<n; j++ ) {
    c[i][j] = 0.0;
    for( k=i; k<n; k++ )
      c[i][j] += a[i][k]*b[k][j];
  }

Load imbalance with a block distribution: row i costs n-i inner-product steps, so the thread that gets the first block of rows does far more work than the thread that gets the last.

– 12 – OpenMP Matrix Multiply

#pragma omp parallel for schedule( static, 1 ) /* cyclic assignment of rows */
for( i=0; i<n; i++ )
  for( j=0; j<n; j++ ) {
    c[i][j] = 0.0;
    for( k=i; k<n; k++ )
      c[i][j] += a[i][k]*b[k][j];
  }

The cyclic schedule balances the triangular work: each thread receives a mix of expensive (small i) and cheap (large i) rows.

– 13 – Code Restructuring Optimizations

- Private variables
- Loop reordering
- Loop peeling

– 14 – General Idea

Parallelism is limited by dependences. Restructure the code to eliminate or reduce them. The compiler is usually not able to do this, so it is good to know how to do it by hand.

– 15 – Example 1: Dependence on a Scalar

for( i=0; i<n; i++ ) {
  tmp = a[i];
  a[i] = b[i];
  b[i] = tmp;
}

Loop-carried dependence on tmp; easily fixed by privatizing tmp.

– 16 – Fix: Scalar Privatization

void f( int from, int to )
{
  int tmp; /* local allocation on each thread's stack */
  for( i=from; i<to; i++ ) {
    tmp = a[i];
    a[i] = b[i];
    b[i] = tmp;
  }
}

Removes the dependence on tmp.

– 17 – Fix: Scalar Privatization in OpenMP

int tmp;
#pragma omp parallel for private( tmp )
for( i=0; i<n; i++ ) {
  tmp = a[i];
  a[i] = b[i];
  b[i] = tmp;
}

Removes the dependence on tmp: each thread gets its own copy.
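An equivalent fix worth knowing, though not shown in the slides: in C, a variable declared inside the loop body has block scope and is therefore automatically private to each thread, so no clause is needed (swap_arrays is a hypothetical wrapper added here so the fragment is self-contained):

void swap_arrays( int n, int *a, int *b )
{
  int i;
  #pragma omp parallel for
  for( i=0; i<n; i++ ) {
    int tmp = a[i]; /* block-scope: each thread has its own tmp */
    a[i] = b[i];
    b[i] = tmp;
  }
}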

– 18 – Example 3: Induction Variable

for( i=0, index=0; i<n; i++ ) {
  index += i;
  a[i] = b[index];
}

Dependence on index, but its value can be computed from the loop variable: after the update in iteration i, index = 0 + 1 + ... + i = i*(i+1)/2.

– 19 – Fix: Induction Variable Elimination

#pragma omp parallel for
for( i=0; i<n; i++ ) {
  a[i] = b[i*(i+1)/2];
}

The dependence is removed by computing the induction variable directly from i.
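A quick sanity check of the closed form (a throwaway sketch, not from the slides; check_closed_form is a hypothetical helper):

#include <assert.h>

void check_closed_form( int n )
{
  int i, index = 0;
  for( i=0; i<n; i++ ) {
    index += i;                   /* the original induction update */
    assert( index == i*(i+1)/2 ); /* the formula used in the fix */
  }
}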

– 20 – Example 4: Induction Variable

for( i=0, index=0; i<n; i++ ) {
  index += f(i);
  b[i] = g(a[index]);
}

Dependence on index, but this time there is no closed-form formula for its value.

– 21 – Fix: Loop Splitting

/* first loop: stays sequential, it carries the dependence */
for( sum=0, i=0; i<n; i++ ) {
  sum += f(i);
  index[i] = sum;
}
#pragma omp parallel for
for( i=0; i<n; i++ ) {
  b[i] = g(a[index[i]]);
}

Loop splitting removes the dependence from the second loop, which does the expensive work: the running sums of f(i) are precomputed into the array index[].

– 22 – Example 5

for( k=0; k<n; k++ )
  for( i=0; i<n; i++ )
    for( j=0; j<n; j++ )
      a[i][j] += b[i][k] + c[k][j];

The dependence on a[i][j] prevents parallelizing the k-loop. No dependences are carried by the i- and j-loops.

– 23 – Example 5 Parallelization

for( k=0; k<n; k++ )
  #pragma omp parallel for
  for( i=0; i<n; i++ )
    for( j=0; j<n; j++ )
      a[i][j] += b[i][k] + c[k][j];

This works, but enters a parallel region n times; we can do better by reordering the loops.

– 24 – Optimization: Loop Reordering

#pragma omp parallel for
for( i=0; i<n; i++ )
  for( j=0; j<n; j++ )
    for( k=0; k<n; k++ )
      a[i][j] += b[i][k] + c[k][j];

Larger parallel pieces of work, and the parallel region is entered only once.

– 25 – Example 6

#pragma omp parallel for
for( i=0; i<n; i++ )
  a[i] = b[i];
#pragma omp parallel for
for( i=0; i<n; i++ )
  c[i] = b[i]*b[i]; /* squaring spelled out: ^ is XOR in C, not exponentiation */

Make the two parallel loops into one.

– 26 – Optimization: Loop Fusion

#pragma omp parallel for
for( i=0; i<n; i++ ) {
  a[i] = b[i];
  c[i] = b[i]*b[i];
}

Reduces loop startup overhead: one parallel region instead of two.

– 27 – Example 7: While Loops

while( *a ) {
  process( a );
  a++;
}

The number of loop iterations is unknown, so the loop cannot be handed to parallel for directly.

– 28 – Special Case of Loop Splitting

/* first, count the iterations sequentially, using the same
   termination test as the original while loop */
for( count=0, p=a; *p; count++, p++ )
  ;
/* then parallelize over the now-known iteration count */
#pragma omp parallel for
for( i=0; i<count; i++ )
  process( &a[i] );

Count the number of loop iterations, then parallelize the loop.

– 29 – Example 8

for( i=0, wrap=n; i<n; i++ ) {
  b[i] = a[i] + a[wrap];
  wrap = i;
}

Dependence on wrap, but only the first iteration actually needs it: for i >= 1, wrap is simply i-1.

– 30 – Loop Peeling

b[0] = a[0] + a[n];
#pragma omp parallel for
for( i=1; i<n; i++ ) {
  b[i] = a[i] + a[i-1];
}

– 31 – Example 10

for( i=0; i<n; i++ )
  a[i+m] = a[i] + b[i];

There is a loop-carried dependence if m < n: iteration i writes a[i+m], which iteration i+m later reads.

– 32 – Another Case of Loop Peeling

if( m >= n ) { /* no overlap between the locations read and written */
  #pragma omp parallel for
  for( i=0; i<n; i++ )
    a[i+m] = a[i] + b[i];
}
else {
  … cannot be parallelized
}

– 33 – Summary

Reorganize code so that:
- dependences are removed or reduced,
- large pieces of parallel work emerge,
- loop bounds become known,
- …

The code can become messy, and there is a point of diminishing returns.

