Parallel Programming Patterns


1 Parallel Programming Patterns
Moreno Marzolla, Dip. di Informatica—Scienza e Ingegneria (DISI), Università di Bologna

2 Parallel Programming Patterns
Copyright © 2013, 2017 Moreno Marzolla, Università di Bologna, Italy. This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). To view a copy of this license, visit the license page or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

3 Parallel Programming Patterns
What is a pattern? A design pattern is "a general solution to a recurring engineering problem". A design pattern is not a ready-made solution to a given problem... ...rather, it is a description of how a certain kind of problem can be solved.

4 Architectural patterns
The term "architectural pattern" was first used by the architect Christopher Alexander to denote common design decisions that architects and engineers have used to realize buildings and constructions in general. Christopher Alexander (1936–), A Pattern Language: Towns, Buildings, Construction.

5 Parallel Programming Patterns
Example: building a bridge across a river. You do not "invent" a brand new type of bridge each time; instead, you adapt an already existing type of bridge.

6 Parallel Programming Patterns
Example (figure)

7 Parallel Programming Patterns
Example (figure)

8 Parallel Programming Patterns
Example (figure)

9 Parallel Programming Patterns
Embarrassingly Parallel
Partition
Master-Worker
Stencil
Reduce
Scan

10 Parallel programming patterns: Embarrassingly parallel

11 Embarrassingly Parallel
Applies when the computation can be decomposed into independent tasks that require little or no communication. Examples: vector sum, Mandelbrot set, 3D rendering, brute-force password cracking, ... (figure: processors 0-2 each compute one slice of c[] = a[] + b[])

12 Parallel programming patterns: Partition

13 Parallel Programming Patterns
Partition The input data space (in short, domain) is split into disjoint regions called partitions. Each processor operates on one partition. This pattern is particularly useful when the application exhibits locality of reference, i.e., when processors can refer to their own partition only and need little or no communication with other processors.

14 Parallel Programming Patterns
Example: matrix-vector product Ax = b. Matrix A[][] is partitioned into P horizontal blocks. Each processor operates on one block of A[][] and on a full copy of x[], and computes a portion of the result b[]. (figure: cores 0-3 each own one block of rows of A[][])

15 Parallel Programming Patterns
Partition Types of partition:
Regular: the domain is split into partitions of roughly the same size and shape. E.g., matrix-vector product.
Irregular: partitions do not necessarily have the same size or shape. E.g., heat transfer on irregular solids.
Size of partitions (granularity):
Fine-grained: a large number of small partitions.
Coarse-grained: a few large partitions.

16 Parallel Programming Patterns
1-D Partitioning: Block and Cyclic (figure: rows of the domain assigned to cores 0-3 under each scheme)

17 Parallel Programming Patterns
2-D Block Partitioning: (Block, *), (*, Block), (Block, Block) (figure: assignment of the domain to cores 0-3 under each scheme)

18 2-D Cyclic Partitioning
(figure)

19 2-D Cyclic Partitioning
Cyclic-cyclic (figure)

20 Irregular partitioning example
A lake surface is approximated with a triangular mesh. Colors indicate the mapping of mesh elements to processors. Source:

21 Fine grained vs Coarse grained partitioning
Fine grained vs coarse grained partitioning (figure: computation vs communication time per task)
Fine-grained partitioning: better load balancing, especially if combined with the master-worker pattern (see later). However, if granularity is too fine, the computation / communication ratio might become too low (communication dominates computation).
Coarse-grained partitioning: in general improves the computation / communication ratio; however, it might cause load imbalance.
The "optimal" granularity is sometimes problem-dependent; in other cases the user must choose which granularity to use.

22 Example: Mandelbrot set
The Mandelbrot set is the set of points c on the complex plane such that the sequence z_n(c), defined as

    z_0(c) = 0
    z_n(c) = z_{n-1}(c)^2 + c    for n > 0

does not diverge when n → +∞.

23 Mandelbrot set in color
If the modulus of z_n(c) does not exceed 2 after maxit iterations, the pixel is black (the point is assumed to be part of the Mandelbrot set). Otherwise, the color depends on the number of iterations required for the modulus of z_n(c) to become > 2.

24 Parallel Programming Patterns
Pseudocode Embarrassingly parallel structure: the color of each pixel can be computed independently from other pixels.

    maxit = 1000
    for each pixel (x0, y0) {
        x = 0
        y = 0
        it = 0
        while ( x*x + y*y < 2*2 AND it < maxit ) {
            xtemp = x*x - y*y + x0
            y = 2*x*y + y0
            x = xtemp
            it = it + 1
        }
        plot(x0, y0, it)   /* plot the pixel (x0, y0), not (x, y) */
    }

Source:

25 Parallel Programming Patterns
Mandelbrot set A regular partitioning can result in uneven load distribution. Black pixels require maxit iterations; the others require fewer. The computation time of each partition is roughly proportional to the number of black pixels it contains. The central partition shown in the figure has more black pixels than the other two, and therefore will require more time.

Load balancing Ideally, each processor should perform the same amount of "work". For example, if the tasks synchronize at the end of the computation, the execution time will be that of the slowest task. (figure: tasks 0-3 with busy and idle periods, ending at a barrier synchronization)

27 Parallel Programming Patterns
Load balancing howto The workload is balanced if each processor performs more or less the same amount of work. Ways to achieve load balancing: use a finer partitioning (but beware of the possible communication overhead if the tasks need to communicate); use dynamic task allocation (master-worker paradigm).

28 Master-worker paradigm (process farm, work pool)
Apply a fine-grained partitioning: number of tasks >> number of cores. The master assigns a task to the first available worker. (figure: the master distributes a bag of tasks of possibly different duration to workers 0 to P-1)

29 Example omp-mandelbrot.c
Coarse-grained partitioning:
    OMP_SCHEDULE="static" ./omp-mandelbrot
Cyclic, fine-grained partitioning (64 rows per block):
    OMP_SCHEDULE="static,64" ./omp-mandelbrot
Dynamic, fine-grained partitioning (64 rows per block):
    OMP_SCHEDULE="dynamic,64" ./omp-mandelbrot
Dynamic, fine-grained partitioning (1 row per block):
    OMP_SCHEDULE="dynamic" ./omp-mandelbrot

30 Scheduling comparison
(figures: coarse-grained decomposition among P0-P3; cyclic task assignment with block size = 64; dynamic (master-worker) scheduling with block size = 64, one possible assignment of rows to P0-P3)

31 Parallel programming patterns: Stencil

32 Parallel Programming Patterns
Stencils Stencil computations involve a grid whose values are updated according to a fixed pattern called a stencil. Example: the Gaussian smoothing of an image updates the color of each pixel with the weighted average of the previous colors of the 5 × 5 neighborhood, with weights:

    1  4  7  4  1
    4 16 28 16  4
    7 28 41 28  7
    4 16 28 16  4
    1  4  7  4  1

33 Parallel Programming Patterns
2D Stencils (figures): 5-point 2-axis 2D stencil; 9-point 2-axis 2D stencil; 9-point 1-plane 2D stencil

34 Parallel Programming Patterns
3D Stencils (figures): 13-point 3-axis 3D stencil; 7-point 3-axis 3D stencil

35 Parallel Programming Patterns
3D Stencils (figure): 72-point 3-plane 3D stencil

36 Parallel Programming Patterns
2D Stencils 2D stencil computations usually employ two grids to keep the current and next values. Values are read from the current grid; new values are written to the next grid; the current and next grids are exchanged at the end of each phase.

37 Parallel Programming Patterns
Ghost Cells How do we handle cells on the border of the domain? We might assume that cells outside the border have some fixed, application-dependent value, or we may assume periodic boundary conditions, where sides are "glued" together to form a torus. In either case, we may extend the domain with ghost cells, so that cells on the border do not require any special treatment. Ghost cells allow a uniform computation: there is no need to explicitly check whether we are on the border. The number of ghost cells depends on the structure of the stencil. (figure: domain surrounded by ghost cells)

38 Periodic boundary conditions: How to fill ghost cells
(figure)

39 2D Stencil Example: Game of Life
2D cyclic domain; each cell has two possible states: 0 = dead, 1 = alive. The state of a cell at time t + 1 depends on: the state of that cell at time t; the number of alive cells at time t among the 8 neighbors. Rules:
Alive cell with fewer than two alive neighbors → dies
Alive cell with two or three alive neighbors → lives
Alive cell with more than three alive neighbors → dies
Dead cell with three alive neighbors → lives

40 Parallel Programming Patterns
Example: Game of Life See game-of-life.c Parallel Programming Patterns

41 Parallelizing stencil computations
Computing the next grid from the current one has embarrassingly parallel structure. However, domain partitioning on distributed-memory architectures requires special care.

    "Initialize current grid"
    while (!terminated) {
        "Compute next grid"        /* embarrassingly parallel */
        "Exchange boundaries"
    }

42 Parallel Programming Patterns
Ghost cells Partitions are again augmented with ghost cells (halo). They contain a copy of "logically" adjacent cells. The width of the halo depends on the shape of the stencil. (figure: two partitions with their halos)

43 Example: 2D partitioning with 5P stencil, periodic boundary
(slides 43-47: successive animation steps of the ghost-cell exchange)

48 Example: 2D partitioning with 9P stencil
(slides 48-49: animation steps of the ghost-cell exchange)

50 Example: 1D (Block, *) partitioning with 5P stencil, periodic boundary
(slides 50-53: animation steps of the ghost-cell exchange)

54 Parallel Programming Patterns
Parallelizing 2D stencil computations on distributed-memory architectures. Let us consider a 2D domain of size N × N subject to a 5P 2D stencil, on a distributed-memory machine with P = 4 processors. Compare the following types of decomposition...
(Block, *): the first N/P rows are assigned to the first processor, the next N/P rows to the second processor, and so on.
(Block, Block): the domain is decomposed into four square subdomains.
...assuming the following boundary conditions: periodic; non-periodic.
Goal: minimize the number of ghost cells that must be exchanged among processors.

55 Choosing a decomposition
(Block, *) vs (Block, Block) (figure: assignment of the domain to P0-P3 under each decomposition)

56 Choosing a decomposition
(Block, *), periodic boundary conditions: 8N ghost cells are exchanged. The ghost cells at the left and right sides are not exchanged across processors, so they do not contribute to the total message size. (figure: P0-P3 row blocks of an N × N domain)

57 Choosing a decomposition
(Block, *), non-periodic boundary conditions: 6N ghost cells are exchanged. (figure: P0-P3 row blocks of an N × N domain)

58 Choosing a decomposition
(Block, Block), periodic boundary conditions: 8N ghost cells are exchanged; each subdomain is N/2 × N/2. (figure: P0-P3 square subdomains)

59 Choosing a decomposition
(Block, Block), non-periodic boundary conditions: 4N ghost cells are exchanged; each subdomain is N/2 × N/2. (figure: P0-P3 square subdomains)

60 Parallel Programming Patterns
Recap: number of ghost cells exchanged

                  (Block, *)   (Block, Block)
    Periodic          8N            8N
    Non-periodic      6N            4N

61 1D Stencil Example: Rule 30 Cellular Automaton
The state at time t + 1 depends on the state of the red cells (the cell itself and its two neighbors) at time t. (figure: Rule 30 cellular automaton at times t, t+1, t+2)

62 Parallel Programming Patterns
Example Rule 30 cellular automaton (figure: initial configuration; configuration at time 1; configuration at time 2)

63 Rule 30 cellular automaton
(figures: Conus textile shell; Rule 30 CA pattern)

64 Parallel Programming Patterns
1D Cellular Automata On distributed-memory architectures, care must be taken to properly handle cells on the border. Again, we use ghost cells to augment each sub-domain. (figure: current and next arrays split across processors P0-P2)

65 Parallel Programming Patterns
Example Rule 30 cellular automaton (figure: the domain split across processors 0-2)

66 Parallel Programming Patterns
Note In the Rule 30 example, using one ghost cell per side it is possible to compute one step of the CA. After that, it is necessary to fill the ghost cells with the new values from the neighbors. If we use two ghost cells per side, we can compute two steps of the CA before communicating.

67 Parallel Programming Patterns
Example Rule 30 cellular automaton (figure: the domain split across processors 0-2)

68 Parallel Programming Patterns
Why? Using more ghost cells: fewer communication operations, but each communication involves more data; overall, the number of bytes exchanged remains more or less the same. Data transfers of large blocks are usually handled more efficiently than small blocks.

69 Parallel programming patterns: Reduce

70 Parallel Programming Patterns
Reduce A reduction is the application of an associative binary operator (e.g., sum, product, min, max, ...) to the elements of an array [x0, x1, ..., xn-1]:
sum-reduce( [x0, x1, ..., xn-1] ) = x0 + x1 + ... + xn-1
min-reduce( [x0, x1, ..., xn-1] ) = min { x0, x1, ..., xn-1 }
A reduction can be realized in O(log2 n) parallel steps.

71 Parallel Programming Patterns
Example: sum 1 3 -2 4 7 11 -8 2 1 -5 16 4 2 -5 2 1 Parallel Programming Patterns

72 Parallel Programming Patterns
Example: sum 1 3 -2 4 7 11 -8 2 1 -5 16 4 2 -5 2 1 2 -2 14 8 9 6 -6 3 Parallel Programming Patterns

73 Parallel Programming Patterns
Example: sum 1 3 -2 4 7 11 -8 2 1 -5 16 4 2 -5 2 1 2 -2 14 8 9 6 -6 3 11 4 8 11 Parallel Programming Patterns

74 Parallel Programming Patterns
Example: sum 1 3 -2 4 7 11 -8 2 1 -5 16 4 2 -5 2 1 2 -2 14 8 9 6 -6 3 11 4 8 11 19 15 Parallel Programming Patterns

75 Parallel Programming Patterns
Example: sum 1 3 -2 4 7 11 -8 2 1 -5 16 4 2 -5 2 1 2 -2 14 8 9 6 -6 3 11 4 8 11 19 15 34 Parallel Programming Patterns

76 Parallel Programming Patterns
Example: sum (figure: the reduction tree of the previous slides)

    int d, i;
    /* compute the largest power of two < n */
    for (d = 1; 2*d < n; d *= 2)
        ;
    /* do the reduction */
    for ( ; d > 0; d /= 2) {
        for (i = 0; i < d; i++) {       /* the inner loop is parallel */
            if (i + d < n)
                x[i] += x[i+d];
        }
    }
    return x[0];

Question: does this code fragment also work when n is not a power of two? Answer: yes. See reduction.c

77 Parallel Programming Patterns
Work efficiency How many sums are computed by the parallel reduction algorithm?
n/2 sums at the first level
n/4 sums at the second level
n/2^j sums at the j-th level
1 sum at the (log2 n)-th level
Total: the summation for j = 1 to log2 n of n/2^j = n - 1, i.e. O(n) sums.
The tree-structured reduction algorithm is work-efficient, which means that it performs the same amount of "work" as the optimal serial algorithm.

78 Parallel programming patterns: Scan

79 Parallel Programming Patterns
Scan (Prefix Sum) A scan computes all prefixes of an array [x0, x1, ..., xn-1] using a given associative binary operator (e.g., sum, product, min, max, ...):
[y0, y1, ..., yn-1] = inclusive-scan( [x0, x1, ..., xn-1] )
where
y0 = x0
y1 = x0 + x1
y2 = x0 + x1 + x2
...
yn-1 = x0 + x1 + ... + xn-1

80 Parallel Programming Patterns
Scan (Prefix Sum) A scan computes all prefixes of an array [x0, x1, ..., xn-1] using a given associative binary operator (e.g., sum, product, min, max, ...):
[y0, y1, ..., yn-1] = exclusive-scan( [x0, x1, ..., xn-1] )
where
y0 = 0   (the neutral element of the binary operator: zero for sum, 1 for product, ...)
y1 = x0
y2 = x0 + x1
...
yn-1 = x0 + x1 + ... + xn-2

81 Exclusive scan: Up-sweep
Exclusive scan: Up-sweep (figure: at each level d, adjacent partial sums are combined bottom-up into ∑x[k..k+2d-1])

    for (d = 1; d < n/2; d *= 2) {
        for (k = 0; k < n; k += 2*d) {      /* inner loop is parallel */
            x[k+2*d-1] = x[k+d-1] + x[k+2*d-1];
        }
    }

O(n) additions

82 Exclusive scan: Down-sweep
Exclusive scan: Down-sweep (figure: the last element is set to zero, then partial sums are propagated top-down, producing the exclusive prefix sums ∑x[0..i-1])

    x[n-1] = 0;
    for ( ; d > 0; d >>= 1 ) {              /* d keeps its value from the up-sweep */
        for (k = 0; k < n; k += 2*d) {      /* inner loop is parallel */
            float t = x[k+d-1];
            x[k+d-1] = x[k+2*d-1];
            x[k+2*d-1] = t + x[k+2*d-1];
        }
    }

O(n) additions. See prefix-sum.c

83 Parallel Programming Patterns
Example: Line of Sight n peaks of heights h[0], ..., h[n-1]; the distance between consecutive peaks is one. Which peaks are visible from peak 0? (figure: peaks h[0]..h[7], some marked visible, some not visible)

84 Parallel Programming Patterns
Line of sight Source: Guy E. Blelloch, Prefix Sums and Their Applications Parallel Programming Patterns

85 Parallel Programming Patterns
Line of sight (figure: peaks h[0]..h[7])


94 Parallel Programming Patterns
Serial algorithm For each i = 0, 1, ..., n-1, let a[i] be the slope of the line connecting peak 0 to peak i:
a[0] ← -∞
a[i] ← arctan( ( h[i] - h[0] ) / i ), if i > 0
amax[0] ← -∞
amax[i] ← max { a[0], a[1], ..., a[i-1] }, if i > 0
If a[i] ≥ amax[i] then peak i is visible; otherwise peak i is not visible.

95 Parallel Programming Patterns
Serial algorithm

    bool[0..n-1] Line-of-sight( double h[0..n-1] )
        bool v[0..n-1]
        double a[0..n-1], amax[0..n-1]
        a[0] ← -∞
        for i ← 1 to n-1 do
            a[i] ← arctan( ( h[i] - h[0] ) / i )
        endfor
        amax[0] ← -∞
        for i ← 1 to n-1 do
            amax[i] ← max{ a[i-1], amax[i-1] }
        endfor
        for i ← 0 to n-1 do
            v[i] ← ( a[i] ≥ amax[i] )
        endfor
        return v

96 Parallel Programming Patterns
Serial algorithm

    bool[0..n-1] Line-of-sight( double h[0..n-1] )
        bool v[0..n-1]
        double a[0..n-1], amax[0..n-1]
        a[0] ← -∞
        for i ← 1 to n-1 do                      // embarrassingly parallel
            a[i] ← arctan( ( h[i] - h[0] ) / i )
        endfor
        amax[0] ← -∞
        for i ← 1 to n-1 do
            amax[i] ← max{ a[i-1], amax[i-1] }
        endfor
        for i ← 0 to n-1 do                      // embarrassingly parallel
            v[i] ← ( a[i] ≥ amax[i] )
        endfor
        return v

97 Parallel Programming Patterns
Parallel algorithm

    bool[0..n-1] Parallel-line-of-sight( double h[0..n-1] )
        bool v[0..n-1]
        double a[0..n-1], amax[0..n-1]
        a[0] ← -∞
        for i ← 1 to n-1 do in parallel
            a[i] ← arctan( ( h[i] - h[0] ) / i )
        endfor
        amax ← max-exclusive-scan( a )
        for i ← 0 to n-1 do in parallel
            v[i] ← ( a[i] ≥ amax[i] )
        endfor
        return v

98 Parallel Programming Patterns
Conclusions A parallel programming pattern defines: a partitioning of the input data; a communication structure among parallel tasks. Parallel programming patterns can help to define efficient algorithms. Many problems can be solved using one or more known patterns.

