Presentation is loading. Please wait.

Presentation is loading. Please wait.

Parallel Programming – OpenMP, Scan, Work Complexity, and Step Complexity David Monismith CS599 Based upon notes from GPU Gems 3, Chapter 39 - http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html.

Similar presentations


Presentation on theme: "Parallel Programming – OpenMP, Scan, Work Complexity, and Step Complexity David Monismith CS599 Based upon notes from GPU Gems 3, Chapter 39 - http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html."— Presentation transcript:

1 Parallel Programming – OpenMP, Scan, Work Complexity, and Step Complexity
David Monismith CS599 Based upon notes from GPU Gems 3, Chapter

2 Work Complexity In terms of data how many or what order of operations (Big-Oh) are required to perform a particular algorithm Example – summation Given an array of n elements, it takes O(n) operations to find the sum of those elements on a single core processor. We want parallel algorithms to be work efficient – to take the same amount of work (operations) as the serial version. This means that a parallelized version of sum should take O(n) operations (additions) to finish.

3 Reduction Work Complexity
Recall – we have studied reduction In class we have seen that a sum reduction takes O(n) work to run to completion. The sum reduction takes the same order of work (operations) to complete as the serial version. Therefore, we say a sum reduction is work-efficient.

4 Prefix Sum (Scan) The prefix sum or scan operation is useful for computing cumulative sums. Example: Given the following array, compute the prefix sum. | 7 | 2 | 3 | 1 | 9 | 2 |

5 Uses for Scan Sorting Lexical Analysis String Comparison
Polynomial Evaluation Stream Compaction Building Histograms and Data Structures

6 Exclusive Prefix Sum (Exclusive Scan)
Example: Given the following array, compute the prefix sum. | 7 | 2 | 3 | 1 | 9 | 2 | | 0 | 7 | 9 | 12| 13| 22|

7 Inclusive Prefix Sum (Inclusive Scan)
Example: Given the following array, compute the prefix sum. | 7 | 2 | 3 | 1 | 9 | 2 | | 7 | 9 | 12| 13| 22| 24|

8 Naïve Prefix Sum (Exclusive Scan) Algorithm
int arr[N] = {7, 2, 3, 1, 9, 2}; int prefixSum[N]; int i, j; prefixSum[0] = 0; for(i = 1; i < N; i++) prefixSum[i]=prefixSum[i-1]+arr[j-1];

9 Naïve Parallel Inclusive Scan (Hillis and Steele 1986)
| 7 | 2 | 3 | 1 | 9 | 2 | | \ | \ | \ | \ | \ | V V V V V V | 7 | 9 | 5 | 4 | 10| 11| \ \ | \ | \ | | \ \| \| \| | \ \ \ \ | \ |\ |\ |\ | \ | \ | \ | \ | \| \| \| \| V V V V | 7 | 9 | 12| 13| 15| 15|

10 Parallel Scan Example Continued
| 7 | 9 | 12| 13| 15| 15| \ \ | | \ \ | | \ \ | | \ \ | | \ \ | | \ \ | | \ \ | | \ \ | | \ \ | | \ \| | \ \ | \ |\ | \ | \ | \| \| V V | 7 | 9 | 12| 13| 22| 24|

11 Naïve Parallel Scan const int N = 6; int log_of_N = log2(8);
int arr[N] = {7, 2, 3, 1, 9, 2}; int scanSum[2][N]; int i, j; int in = 1, out = 0; memcpy(scanSum[out], arr, N*sizeof(int)); memset(scanSum[in], 0, int two_i = 0, two_i_1 = 0;

12 Naïve Parallel Scan for(i = 1; i <= log_of_N; i++) { two_i_1 = 1 << (i-1); two_i = 1 << i; out = 1 - out; in = 1 - out; for(j = 0; j < N; j++) if(j >= two_i_1) scanSum[out][j] = scanSum[in][j] + scanSum[in][j - two_i_1]; else scanSum[out][j] = scanSum[in][j]; }

13 Adding OpenMP Parallelism
for(i = 1; i <= log_of_N; i++) { two_i_1 = 1 << (i-1); two_i = 1 << i; out = 1 - out; in = 1 - out; #pragma omp parallel for private(j) shared(scanSum, in, out) for(j = 0; j < N; j++) if(j >= two_i_1) scanSum[out][j] = scanSum[in][j] + scanSum[in][j - two_i_1]; else scanSum[out][j] = scanSum[in][j]; }

14 Naïve Parallel Scan Analysis
Not work efficient Performs O(n lg n) additions Note that the serial version performs O(n) additions Requires O(lg n) steps This means we can complete a scan in O(lg n) time if we have O(n lg n) processors

15 Work Efficient Parallel Scan (Blelloch 1990)
Goal – build a scan algorithm that avoids the extra O(lg n) work Solution – make use of balanced binary trees Build a balanced binary tree on the input data and sweep the tree to compute the prefix sum Consists of two parts Up Sweep Traverse tree from leaves to root computing partial sums during the traversal The root node (last element in the array) will hold the sum of all elements Down Sweep Traverse from root to leaves using the partial sums to compute the cumulative sums in place. For the exclusive prefix sum, we replace the root with zero

16 | 7 | 9 | 3 | 13| 9 | 11| 8 | 36| ^ /| / | / | / | / | | 7 | 9 | 3 | 13| 9 | 11| 8 | 23| ^ ^ /| /| / | / | / | / | | 7 | 9 | 3 | 4 | 9 | 11| 8 | 12| ^ ^ ^ ^ / | / | / | / | / | / | / | / | | 7 | 2 | 3 | 1 | 9 | 2 | 8 | 4 | Blelloch Up Sweep

17 Blelloch Algorithm – Up Sweep
for(i = 0; i <= log_of_N_1; i++) { two_i_p1 = 1 << (i+1); two_i = 1 << i; for(j = 0; j < N; j+=two_i_p1) scanSum[j + two_i_p1 - 1] = scanSum[j + two_i - 1] + scanSum[j + two_i_p1 - 1]; }

18 | 7 | 9 | 3 | 13| 9 | 11| 8 | 36| zero| V | 7 | 9 | 3 | 13| 9 | 11| 8 | 0 | \ / | \ / | / \ | zero / \ | V V | 7 | 9 | 3 | 0 | 9 | 11| 8 | 13| \ / | \ / | / \ | / \ | V V V V | 7 | 0 | 3 | 9 | 9 | 13| 8 | 24| \ /| \ /| \ /| \ /| / \| / \| / \| / \| V V V V V V V V | 0 | 7 | 9 | 12| 13| 22| 24| 32| Blelloch Down Sweep

19 Blelloch Algorithm – Down Sweep
scanSum[N-1] = 0; for(i = log_of_N_1; i >= 0; i--) { two_i_p1 = 1 << (i+1); two_i = 1 << i; for(j = 0; j < N; j+=two_i_p1) { long t = scanSum[j + two_i - 1]; scanSum[j + two_i - 1] = scanSum[j + two_i_p1 - 1]; scanSum[j + two_i_p1 - 1] = t + }

20 Analysis This algorithm does complete the summation in O(n) operations, giving the appearance of efficiency. Actually, the algorithm requires 2*(n-1) additions and n-1 swaps to run to completion. For large arrays, this algorithm will indeed out-perform the Hillis and Steele Algorithm.


Download ppt "Parallel Programming – OpenMP, Scan, Work Complexity, and Step Complexity David Monismith CS599 Based upon notes from GPU Gems 3, Chapter 39 - http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html."

Similar presentations


Ads by Google