# Starting Parallel Algorithm Design David Monismith Based on notes from Introduction to Parallel Programming 2nd Edition by Grama, Gupta, Karypis, and Kumar


Starting Parallel Algorithm Design

David Monismith

Based on notes from Introduction to Parallel Programming 2nd Edition by Grama, Gupta, Karypis, and Kumar

## Decomposition

- Decomposition - dividing a computation into parts that may be executed in parallel
- Tasks - programmer-defined units of computation into which the main computation is subdivided
- Task-dependency graphs - abstraction used to express dependencies between tasks and their relative order of execution

## Granularity

- Granularity - number and size of tasks into which a computation is divided
- Fine grained - computation divided into many small tasks
- Coarse grained - computation divided into few large tasks
- Degree of concurrency - maximum number of tasks that can be executed in parallel in a program at any time
- Average degree of concurrency can be more useful because it provides a better indication of performance over the whole run

## Example

Matrix-Vector Multiplication

- Figure will be drawn upon the board
- Generally considered fine grained if parallelizing based upon each dot product
- Could be considered coarse grained if using a dual-core processor and each task computes half of the dot products
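The fine-grained view above can be sketched in C as follows. This is a minimal sketch, assuming a dense matrix stored in row-major order; the function name `matvec` and its parameters are hypothetical and not taken from the slides. Each outer-loop iteration is one independent dot product, so OpenMP may hand rows to threads in any order. Compiled without OpenMP support, the pragma is ignored and the loop simply runs serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: y = A * x, where A is an n-by-m matrix stored
 * in row-major order, x has m entries, and y has n entries. */
void matvec(const double *A, const double *x, double *y, int n, int m)
{
    assert(n >= 0 && m >= 0);

    /* Each iteration of the outer loop is one independent dot product,
     * i.e. one fine-grained task. */
    #pragma omp parallel for
    for (int row = 0; row < n; row++) {
        double dot = 0.0;
        for (int col = 0; col < m; col++)
            dot += A[(size_t)row * m + col] * x[col];
        y[row] = dot;
    }
}
```

Run with two threads, the default static schedule typically gives each thread about half of the rows, which corresponds to the coarse-grained case described above.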

## Task Graphs

- Critical path - longest directed path between a pair of start and finish nodes in the task graph
- Critical path length - sum of the weights of the nodes along a critical path
- Weight of a node is the size of the task or the amount of work associated with the task
- Aside from these factors, the interaction between tasks running on different processors may cost additional runtime
- An example of a task dependency graph will be drawn in class to aid in the understanding of these concepts
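The critical path length can be computed with a single pass over the task graph. The sketch below is one possible way to do it, not the method from the slides: it assumes tasks are numbered in topological order and uses a hypothetical adjacency-matrix representation (`dep`, `MAX_TASKS`, and both function names are made up for illustration). It also computes the average degree of concurrency as total work divided by critical path length, the definition used in the Grama et al. text.

```c
#include <assert.h>

#define MAX_TASKS 16  /* hypothetical upper bound on graph size */

/* dep[u][v] != 0 means task v depends on task u. Tasks are assumed to
 * be numbered in topological order, so every edge goes from a lower
 * index to a higher index. weight[v] is the work of task v. */
int critical_path_length(int n, const int weight[],
                         int dep[][MAX_TASKS])
{
    int finish[MAX_TASKS];  /* earliest finish time of each task */
    int longest = 0;

    assert(n > 0 && n <= MAX_TASKS);
    for (int v = 0; v < n; v++) {
        int start = 0;      /* v starts once all predecessors finish */
        for (int u = 0; u < v; u++)
            if (dep[u][v] && finish[u] > start)
                start = finish[u];
        finish[v] = start + weight[v];
        if (finish[v] > longest)
            longest = finish[v];
    }
    return longest;
}

/* Average degree of concurrency = total work / critical path length. */
double average_concurrency(int n, const int weight[],
                           int dep[][MAX_TASKS])
{
    int total = 0;
    for (int v = 0; v < n; v++)
        total += weight[v];
    return (double)total / critical_path_length(n, weight, dep);
}
```

For example, take four independent tasks of weight 10, a task of weight 6 that depends on the first two, a task of weight 9 that depends on the last two, and a final task of weight 8 that depends on both. The critical path length is 10 + 9 + 8 = 27, the total work is 63, and the average degree of concurrency is 63 / 27, or about 2.33.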

## Processes and Threads vs. Processors

- Mapping - mechanism by which tasks are assigned to processes and/or threads for execution
- Threads and processes are logical units that perform tasks
- Processors physically perform the computations
- This distinction matters because a computation may have multiple stages, for example internode communication vs. shared-memory communication
- Drawing a task dependency or task interaction graph may help us understand how tasks interact with one another and will aid in the development of a parallel algorithm

## Decomposition Techniques

- Embarrassingly parallel
- Recursive decomposition
- Data decomposition
- Exploratory decomposition
- Speculative decomposition

## Embarrassingly Parallel Tasks

- Some tasks lend themselves to direct parallelization
- Such tasks are said to be embarrassingly parallel and can be directly mapped to processes or threads
- A subset of these tasks represents the map pattern
- The map pattern represents a function that can be “replicated and applied to all elements in a collection” (source: https://software.intel.com/en-us/blogs/2009/06/10/parallel-patterns-3-map)
- Map operations occur in independent loop iterations

## Embarrassingly Parallel (Map)

Performing array (or matrix) addition is a straightforward example that is easily parallelized. The serial version follows:

    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];

Three OpenMP parallel versions follow on the next slides.

## OpenMP First Try

We could parallelize the loop on the last slide directly as follows:

    #pragma omp parallel private(i) shared(A,B,C)
    {
        int nthreads = omp_get_num_threads();
        int chunk = N / nthreads;
        int start = omp_get_thread_num() * chunk;
        /* The last thread also takes the leftover elements
           when N is not a multiple of nthreads. */
        int end = (omp_get_thread_num() == nthreads - 1) ? N : start + chunk;
        for (i = start; i < end; i++)
            C[i] = A[i] + B[i];
    }

- Notice that i is declared private because it is not shared between threads; each thread gets its own copy of i
- Arrays A, B, and C are declared shared because they are shared between threads
- omp_get_thread_num() and omp_get_num_threads() are declared in omp.h

## OpenMP for Directive

It is preferred to let OpenMP parallelize loops directly using the for directive, as follows:

    #pragma omp parallel private(i) shared(A,B,C)
    {
        #pragma omp for
        for (i = 0; i < N; i++)
            C[i] = A[i] + B[i];
    }

Notice that the loop can be written in a serial fashion; its iterations are automatically partitioned and assigned to threads.

## Shortened OpenMP for

When parallelizing a single for loop, the parallel and for directives may be combined:

    #pragma omp parallel for private(i) \
                             shared(A,B,C)
    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];

## Recursive Decomposition

- Used to expose concurrency in problems that can be solved with divide-and-conquer
- Such a problem is solved by dividing it into independent sub-problems
- A special type of this decomposition is the reduction pattern, wherein elements of a collection are combined with a binary associative operator (e.g. +, *, min, max) (source: https://software.intel.com/en-us/blogs/2009/07/23/parallel-pattern-7-reduce)

## Example

To find a minimum serially, given an array A of size N, use the following algorithm:

    min = A[0];
    for (i = 1; i < N; i++)
        if (A[i] < min)
            min = A[i];

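## Example

Decomposing this task recursively for parallelism gives the following solution, where i is the first index of the subarray and n is its length:

    int findMinRec(int A[], int i, int n)
    {
        if (n == 1)
            return A[i];
        else {
            /* The two halves are independent sub-problems. */
            int lmin = findMinRec(A, i, n/2);
            int rmin = findMinRec(A, i + n/2, n - n/2);
            return (lmin < rmin) ? lmin : rmin;
        }
    }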
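The recursion above is serial as written; the two halves only run concurrently if something schedules them in parallel. One way to do that, offered here as a sketch rather than as the approach from the slides, is with OpenMP tasks. The names `min_serial`, `min_tasks`, `find_min_parallel`, and the `CUTOFF` value are hypothetical. The cutoff keeps tasks from becoming too fine grained. Without OpenMP support the pragmas are ignored and the code runs as an ordinary recursion.

```c
#include <assert.h>

/* Hypothetical tuning value: subarrays of this size or smaller are
 * scanned serially instead of being split into more tasks. */
#define CUTOFF 1024

static int min_serial(const int A[], int i, int n)
{
    int m = A[i];
    for (int k = i + 1; k < i + n; k++)
        if (A[k] < m)
            m = A[k];
    return m;
}

/* Minimum of the n elements starting at index i. */
static int min_tasks(const int A[], int i, int n)
{
    if (n <= CUTOFF)
        return min_serial(A, i, n);

    int lmin, rmin;
    /* The left half becomes a task that another thread may run. */
    #pragma omp task shared(lmin)
    lmin = min_tasks(A, i, n / 2);
    /* The current thread handles the right half itself. */
    rmin = min_tasks(A, i + n / 2, n - n / 2);
    /* Wait for the left half before combining the two results. */
    #pragma omp taskwait
    return (lmin < rmin) ? lmin : rmin;
}

int find_min_parallel(const int A[], int n)
{
    int result = 0;
    assert(n > 0);
    /* One thread starts the recursion; the team executes the tasks. */
    #pragma omp parallel
    #pragma omp single
    result = min_tasks(A, 0, n);
    return result;
}
```

Calling `find_min_parallel(A, N)` returns the same value as the serial loop; only the order in which the halves are evaluated differs.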

## OpenMP Implementation

    for (i = 0; i < N; i++)
        A[i] = rand() % 100;
    small = A[0];
    /* The min reduction operator requires OpenMP 3.1 or later. */
    #pragma omp parallel for reduction(min:small)
    for (i = 0; i < N; i++) {
        if (A[i] < small)
            small = A[i];
    }

## OpenMP Sum Reduction

    for (i = 0; i < N; i++)
        A[i] = i + 1;
    sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < N; i++)
        sum += A[i];
    printf("The sum is %d\n", sum);
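Conceptually, a sum reduction gives each worker a private partial result and combines the partial results at the end. The sketch below writes that idea out by hand; it is an illustration of the pattern, not a description of how any OpenMP runtime implements the reduction clause. `NUM_CHUNKS` and `chunked_sum` are hypothetical names.

```c
#include <assert.h>

#define NUM_CHUNKS 4  /* hypothetical number of partial sums */

long chunked_sum(const int A[], int n)
{
    long partial[NUM_CHUNKS] = {0};
    assert(n >= 0);

    /* Each chunk accumulates into its own slot, so no two iterations
     * of this loop write to the same variable. */
    #pragma omp parallel for
    for (int c = 0; c < NUM_CHUNKS; c++) {
        int start = (int)((long)n * c / NUM_CHUNKS);
        int end = (int)((long)n * (c + 1) / NUM_CHUNKS);
        for (int i = start; i < end; i++)
            partial[c] += A[i];
    }

    /* Combine the partial sums serially with the same operator. */
    long sum = 0;
    for (int c = 0; c < NUM_CHUNKS; c++)
        sum += partial[c];
    return sum;
}
```

The combining step works in any grouping because + is associative, which is the property the reduction pattern relies on.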

## Data Decomposition

- Commonly used on algorithms that operate on large data structures
- Involves two steps
  - Data is partitioned
  - The data partitioning is used to induce a partitioning of the computations into tasks
- Operations on different data partitions are typically similar or are chosen from a small set of operations

## Partitioning

- Partitioning output data
  - Each output is computed independently of the others as a function of the input
  - Example: matrix multiplication can be partitioned into submatrices
- Partitioning input data
  - A task is created for each partition of the input data
  - Example: finding a minimum or maximum
- Partitioning input and output
  - Combination of the two cases above
- Partitioning intermediate data
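Output partitioning for matrix multiplication can be sketched as follows. This is a minimal sketch under stated assumptions: square n-by-n matrices in row-major order, and a fixed 2-by-2 split of the output matrix C into four submatrices. The names `multiply_block` and `matmul_output_partitioned` are hypothetical. Each task computes one block of C and writes nowhere else, so the four tasks are independent.

```c
#include <assert.h>

/* Compute rows [r0, r1) and columns [c0, c1) of C = A * B.
 * All matrices are n-by-n in row-major order. */
static void multiply_block(const double *A, const double *B, double *C,
                           int n, int r0, int r1, int c0, int c1)
{
    for (int i = r0; i < r1; i++) {
        for (int j = c0; j < c1; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}

void matmul_output_partitioned(const double *A, const double *B,
                               double *C, int n)
{
    assert(n > 0);
    /* Boundaries that split rows and columns into two ranges each. */
    int bounds[3] = {0, n / 2, n};

    /* Four independent tasks, one per output submatrix. */
    #pragma omp parallel for collapse(2)
    for (int bi = 0; bi < 2; bi++)
        for (int bj = 0; bj < 2; bj++)
            multiply_block(A, B, C, n,
                           bounds[bi], bounds[bi + 1],
                           bounds[bj], bounds[bj + 1]);
}
```

Every task reads whole rows of A and whole columns of B, so the input is shared while the output is partitioned. A finer split of C would raise the degree of concurrency at the cost of more, smaller tasks.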

## Next Time

- More decompositions
  - Exploratory decomposition
  - Speculative decomposition
- Tasks and interactions
- Load balancing
- Handling overhead
- Parallel algorithm models
