1
Parallel Programming Patterns
Moreno Marzolla Dip. di Informatica—Scienza e Ingegneria (DISI) Università di Bologna
2
Parallel Programming Patterns
Copyright © 2013, 2017 Moreno Marzolla, Università di Bologna, Italy. This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
3
Parallel Programming Patterns
What is a pattern? A design pattern is “a general solution to a recurring engineering problem” A design pattern is not a ready-made solution to a given problem... ...rather, it is a description of how a certain kind of problem can be solved Parallel Programming Patterns
4
Architectural patterns
The term “architectural pattern” was first used by the architect Christopher Alexander to denote common design decisions that architects and engineers have used to realize buildings and constructions in general. Christopher Alexander (1936-), A Pattern Language: Towns, Buildings, Construction
5
Parallel Programming Patterns
Example Building a bridge across a river You do not “invent” a brand new type of bridge each time Instead, you adapt an already existing type of bridge Parallel Programming Patterns
6
Parallel Programming Patterns
Example Parallel Programming Patterns
7
Parallel Programming Patterns
Example Parallel Programming Patterns
8
Parallel Programming Patterns
Example Parallel Programming Patterns
9
Parallel Programming Patterns
Embarrassingly Parallel Partition Master-Worker Stencil Reduce Scan Parallel Programming Patterns
10
Parallel programming patterns: Embarrassingly parallel
11
Embarrassingly Parallel
Applies when the computation can be decomposed into independent tasks that require little or no communication. Examples: vector sum, Mandelbrot set, 3D rendering, brute force password cracking, ... (figure: the element-wise vector sum c[] = a[] + b[] split among Processors 0, 1 and 2)
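As a concrete illustration (this sketch is not part of the original slides), the vector sum maps directly onto a single parallel loop, since each c[i] depends only on a[i] and b[i]; the function and variable names are placeholders.

/* Sketch: embarrassingly parallel vector sum, c[i] = a[i] + b[i] */
void vec_sum(const double *a, const double *b, double *c, int n)
{
    /* every iteration is independent, so no communication is needed */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}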
12
Parallel programming patterns: Partition
13
Parallel Programming Patterns
Partition The input data space (in short, domain) is split into disjoint regions called partitions. Each processor operates on one partition. This pattern is particularly useful when the application exhibits locality of reference, i.e., when processors can refer to their own partition only and need little or no communication with other processors.
14
Parallel Programming Patterns
Example: matrix-vector product Ax = b. Matrix A[][] is partitioned into P horizontal blocks. Each processor operates on one block of A[][] and on a full copy of x[], and computes a portion of the result b[]. (figure: row blocks of A[][] assigned to Cores 0-3, each multiplied by the full x[] to produce a block of b[])
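A minimal sketch (mine, not from the slides) of this row-block decomposition in shared memory: the rows of A are processed in contiguous blocks, and every thread reads the whole vector x. With the default static schedule, OpenMP assigns contiguous chunks of iterations to threads, which matches the row-block partition.

/* Sketch: b = A*x with the rows of A (row-major, n x n) partitioned into blocks. */
void mat_vec(const double *A, const double *x, double *b, int n)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {       /* each thread gets a block of rows */
        double s = 0.0;
        for (int j = 0; j < n; j++)
            s += A[i*n + j] * x[j];     /* every thread reads the full x[] */
        b[i] = s;
    }
}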
15
Parallel Programming Patterns
Partition Types of partition: Regular: the domain is split into partitions of roughly the same size and shape, e.g., matrix-vector product. Irregular: partitions do not necessarily have the same size or shape, e.g., heat transfer on irregular solids. Size of partitions (granularity): Fine-grained: a large number of small partitions. Coarse-grained: a few large partitions.
16
Parallel Programming Patterns
1-D Partitioning: Block vs Cyclic (figure: the elements of an array assigned to Cores 0-3, either in contiguous blocks or in round-robin/cyclic fashion)
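To pin down the two mappings, here is a small sketch (not from the slides) of the index-to-owner function that a block and a cyclic 1-D distribution of n elements over P cores would use, assuming for simplicity that n is a multiple of P.

/* Sketch: which core owns element i under 1-D block vs cyclic partitioning
   (n elements, P cores, n assumed to be a multiple of P). */
int owner_block(int i, int n, int P)  { return i / (n / P); }
int owner_cyclic(int i, int n, int P) { (void)n; return i % P; }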
17
Parallel Programming Patterns
2-D Block Partitioning: (Block, *), (*, Block), (Block, Block) (figure: the same matrix distributed to Cores 0-3 by row blocks, by column blocks, and by 2 x 2 square blocks)
18
2-D Cyclic Partitioning
Parallel Programming Patterns
19
2-D Cyclic Partitioning
Cyclic-cyclic Parallel Programming Patterns
20
Irregular partitioning example
A lake surface is approximated with a triangular mesh. Colors indicate the mapping of mesh elements to processors. Source:
21
Fine grained vs Coarse grained partitioning
Fine-grained partitioning: better load balancing, especially if combined with the master-worker pattern (see later); however, if granularity is too fine, the computation / communication ratio might become too low (communication dominates over computation). Coarse-grained partitioning: in general it improves the computation / communication ratio; however, it might cause load imbalance. The "optimal" granularity is sometimes problem-dependent; in other cases the user must choose which granularity to use. (figure: computation vs communication time under the two granularities)
22
Example: Mandelbrot set
The Mandelbrot set is the set of points c on the complex plane such that the sequence zn(c), defined as

    zn(c) = 0                  if n = 0
    zn(c) = zn-1(c)² + c       if n > 0

does not diverge when n → +∞.
23
Mandelbrot set in color
If the modulus of zn(c) does not exceed 2 after nmax iterations, the pixel is black (the point is assumed to be part of the Mandelbrot set) Otherwise, the color depends on the number of iterations required for the modulus of zn(c) to become > 2 Parallel Programming Patterns
24
Parallel Programming Patterns
Pseudocode Embarrassingly parallel structure: the color of each pixel can be computed independently from other pixels.

maxit = 1000
for each pixel (x0, y0) {
    x = 0
    y = 0
    it = 0
    while ( x*x + y*y < 2*2 AND it < maxit ) {
        xtemp = x*x - y*y + x0
        y = 2*x*y + y0
        x = xtemp
        it = it + 1
    }
    plot(x0, y0, it)
}

Source:
25
Parallel Programming Patterns
Mandelbrot set A regular partitioning can result in uneven load distribution. Black pixels require maxit iterations, the others require fewer iterations. The computation time of each partition is roughly proportional to the number of black pixels it contains. The central partition shown in the figure has more black pixels than the other two partitions, and therefore will require more time.
26
Load balancing Ideally, each processor should perform the same amount of “work”. For example, if the tasks synchronize at the end of the computation, the execution time will be that of the slowest task. (figure: Tasks 0-3, with busy and idle periods, meeting at a barrier synchronization)
27
Parallel Programming Patterns
Load balancing howto The workload is balanced if each processor performs more or less the same amount of work Ways to achieve load balancing: Use a finer partitioning ...but beware of the possible communication overhead if the tasks need to communicate Use dynamic task allocation (master-worker paradigm) Parallel Programming Patterns
28
Master-worker paradigm (process farm, work pool)
Apply a fine-grained partitioning: number of tasks >> number of cores. The master assigns a task to the first available worker. (figure: a master distributes a bag of tasks of possibly different duration to Workers 0 .. P-1)
29
Example omp-mandelbrot.c
Coarse-grained partitioning:
OMP_SCHEDULE="static" ./omp-mandelbrot
Cyclic, fine-grained partitioning (64 rows per block):
OMP_SCHEDULE="static,64" ./omp-mandelbrot
Dynamic, fine-grained partitioning (64 rows per block):
OMP_SCHEDULE="dynamic,64" ./omp-mandelbrot
Dynamic, fine-grained partitioning (1 row per block):
OMP_SCHEDULE="dynamic" ./omp-mandelbrot
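These OMP_SCHEDULE settings only take effect if the parallel loop uses schedule(runtime); the fragment below is only a guess at how omp-mandelbrot.c might be structured (the actual source may differ), with one image row per loop iteration and two hypothetical helper functions.

/* Sketch: the scheduling policy is selected at run time via OMP_SCHEDULE. */
#pragma omp parallel for schedule(runtime)
for (int y = 0; y < ysize; y++) {
    for (int x = 0; x < xsize; x++) {
        /* iterations() and draw_pixel() are hypothetical helpers */
        draw_pixel(x, y, iterations(x, y));
    }
}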
30
(figures: the image rows assigned to P0-P3 under (a) a coarse-grained block decomposition, (b) a cyclic task assignment with block size 64, and (c) a possible dynamic master-worker assignment with block size 64)
31
Parallel programming patterns: Stencil
32
Parallel Programming Patterns
Stencils Stencil computations involve a grid whose values are updated according to a fixed pattern called stencil. Example: the Gaussian smoothing of an image updates the color of each pixel with the weighted average of the previous colors of the 5 × 5 neighborhood. (figure: the 5 × 5 matrix of weights)
33
Parallel Programming Patterns
2D Stencils 5-point 2-axis 2D stencil 9-point 2-axis 2D stencil 9-point 1-plane 2D stencil Parallel Programming Patterns
34
Parallel Programming Patterns
3D Stencils 13-point 3-axis 3D stencil 7-point 3-axis 3D stencil Parallel Programming Patterns
35
Parallel Programming Patterns
3D Stencils 72-point 3-plane 3D stencil Parallel Programming Patterns
36
Parallel Programming Patterns
2D Stencils 2D stencil computations usually employ two grids to keep the current and next values. Values are read from the current grid; new values are written to the next grid. The current and next grids are swapped at the end of each phase.
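A minimal sketch (not from the slides) of this two-grid organization for a 5-point stencil: values are read from cur, written to next, and the two pointers are swapped after each phase. The grid size N, the number of steps nsteps and the averaging weight 0.2 are placeholders.

/* Sketch: nsteps phases of a 5-point stencil on an N x N grid (row-major),
   reading from cur, writing to next, then swapping the two pointers.
   Note: after an odd number of steps the result is in the buffer the
   caller passed as next. */
void stencil_iterate(double *cur, double *next, int N, int nsteps)
{
    for (int step = 0; step < nsteps; step++) {
        for (int i = 1; i < N-1; i++) {
            for (int j = 1; j < N-1; j++) {
                next[i*N + j] = 0.2 * ( cur[i*N + j] +
                                        cur[(i-1)*N + j] + cur[(i+1)*N + j] +
                                        cur[i*N + (j-1)] + cur[i*N + (j+1)] );
            }
        }
        double *tmp = cur; cur = next; next = tmp;   /* swap current and next */
    }
}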
37
Parallel Programming Patterns
Ghost Cells How do we handle cells on the border of the domain? We might assume that cells outside the border have some fixed, application-dependent value, or we may assume periodic boundary conditions, where sides are “glued” together to form a torus. In either case, we may extend the domain with ghost cells, so that cells on the border do not require any special treatment. Ghost cells allow a uniform computation: there is no need to explicitly check whether we are on the border. The number of ghost cells depends on the structure of the stencil. (figure: a domain surrounded by a frame of ghost cells)
38
Periodic boundary conditions: How to fill ghost cells
Parallel Programming Patterns
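As a small illustration (not in the slides) of how ghost cells are filled when the boundary is periodic, consider a 1-D domain of n cells stored in v[1..n] with one ghost cell on each side; this layout is an assumption made for the sketch.

/* Sketch: v[1..n] are the real cells, v[0] and v[n+1] are ghost cells.
   With periodic boundaries the domain wraps around like a ring. */
void fill_ghost_cells(double *v, int n)
{
    v[0]   = v[n];   /* left ghost cell  <- rightmost real cell */
    v[n+1] = v[1];   /* right ghost cell <- leftmost real cell  */
}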
39
2D Stencil Example: Game of Life
2D cyclic domain, each cell has two possible states 0 = dead 1 = alive The state of a cell at time t + 1 depends on the state of that cell at time t the number of alive cells at time t among the 8 neighbors Rules: Alive cell with less than 2 alive neighbors → dies Alive cell with two or three alive neighbors → lives Alive cell with more than three alive neighbors → dies Dead cell with three alive neighbors → lives Parallel Programming Patterns
40
Parallel Programming Patterns
Example: Game of Life See game-of-life.c Parallel Programming Patterns
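The fragment below is a sketch of the update rule only (it is not the actual game-of-life.c): cur and next are two grids extended with one layer of ghost cells, so the 8 neighbors can be read without bounds checks.

/* Sketch: one Game of Life step on an (N+2) x (N+2) grid with ghost cells;
   cells hold 0 (dead) or 1 (alive). */
void life_step(int N, unsigned char cur[N+2][N+2], unsigned char next[N+2][N+2])
{
    for (int i = 1; i <= N; i++) {
        for (int j = 1; j <= N; j++) {
            int alive = 0;
            for (int di = -1; di <= 1; di++)            /* count alive neighbors */
                for (int dj = -1; dj <= 1; dj++)
                    if (di != 0 || dj != 0)
                        alive += cur[i+di][j+dj];
            if (cur[i][j])
                next[i][j] = (alive == 2 || alive == 3); /* survives with 2 or 3 */
            else
                next[i][j] = (alive == 3);               /* birth with exactly 3 */
        }
    }
}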
41
Parallelizing stencil computations
Computing the next grid from the current one has an embarrassingly parallel structure. However, domain partitioning on distributed-memory architectures requires special care.

"Initialize current grid"
while (!terminated) {
    "Compute next grid"        /* embarrassingly parallel */
    "Exchange boundaries"
}
42
Parallel Programming Patterns
Ghost cells Partitions are again augmented with ghost cells (halo). They contain a copy of “logically” adjacent cells. The width of the halo depends on the shape of the stencil. (figure: two adjacent partitions, each surrounded by its halo)
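On a distributed-memory machine the halo must be refreshed explicitly after each phase. The following is only a sketch, assuming MPI, a 1-D row-block decomposition of an N-column grid into local_n rows per process, and non-periodic boundaries; grid, rank and size are placeholders.

/* Sketch: exchange boundary rows with the neighboring ranks so that they
   end up in the ghost rows (row 0 and row local_n+1). */
int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

/* send my first real row to the rank above; receive the row coming from
   below into my bottom ghost row */
MPI_Sendrecv(&grid[1*N],           N, MPI_DOUBLE, up,   0,
             &grid[(local_n+1)*N], N, MPI_DOUBLE, down, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
/* send my last real row to the rank below; receive the row coming from
   above into my top ghost row */
MPI_Sendrecv(&grid[local_n*N],     N, MPI_DOUBLE, down, 1,
             &grid[0],             N, MPI_DOUBLE, up,   1,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);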
43
Example: 2D partitioning with 5P stencil Periodic boundary
Parallel Programming Patterns
44
Example: 2D partitioning with 5P stencil Periodic boundary
Parallel Programming Patterns
45
Example: 2D partitioning with 5P stencil Periodic boundary
Parallel Programming Patterns
46
Example: 2D partitioning with 5P stencil Periodic boundary
Parallel Programming Patterns
47
Example: 2D partitioning with 5P stencil Periodic boundary
Parallel Programming Patterns
48
Example: 2D partitioning with 9P stencil
Parallel Programming Patterns
49
Example: 2D partitioning with 9P stencil
Parallel Programming Patterns
50
Example: 1D (Block, *) partitioning with 5P stencil Periodic boundary
Parallel Programming Patterns
51
Example: 1D (Block, *) partitioning with 5P stencil Periodic boundary
Parallel Programming Patterns
52
Example: 1D (Block, *) partitioning with 5P stencil Periodic boundary
Parallel Programming Patterns
53
Example: 1D (Block, *) partitioning with 5P stencil Periodic boundary
Parallel Programming Patterns
54
Parallel Programming Patterns
Parallelizing 2D stencil computations on distributed-memory architectures. Let us consider a 2D domain of size N × N subject to a 5P-2D stencil. We have a distributed-memory machine with P = 4 processors. Compare the following types of decomposition: (Block, *): the first N/P rows are assigned to the first processor, the next N/P rows to the second processor, and so on; (Block, Block): the domain is decomposed into four square subdomains. Assume the following boundary conditions: periodic or non-periodic. Goal: minimize the number of ghost cells that must be exchanged among processors.
55
Choosing a decomposition
(figure: (Block, *) assigns horizontal slabs to P0-P3; (Block, Block) assigns 2 x 2 square blocks to P0-P3)
56
Choosing a decomposition
(Block, *), periodic boundary conditions: 8 N ghost cells are exchanged in total. The ghost cells at the left and right sides are not exchanged across processors, so they do not contribute to the total message size.
57
Choosing a decomposition
(Block, *), non-periodic boundary conditions: 6 N ghost cells are exchanged (only the three internal row boundaries are exchanged, in both directions).
58
Choosing a decomposition
(Block, Block), periodic boundary conditions: 8 N ghost cells are exchanged (each N/2 × N/2 block exchanges four sides of N/2 cells each).
59
Choosing a decomposition
(Block, Block), non-periodic boundary conditions: 4 N ghost cells are exchanged (each block exchanges only its two internal sides of N/2 cells each).
60
Parallel Programming Patterns
Recap: total number of ghost cells exchanged

                 (Block, *)    (Block, Block)
Periodic            8 N             8 N
Non-periodic        6 N             4 N
61
1D Stencil Example: Rule 30 Cellular Automaton
The state of a cell at time t + 1 depends on the states of the red (highlighted) cells at time t: the cell itself and its left and right neighbors. (figure: the Rule 30 cellular automaton evolving at times t, t+1, t+2)
62
Parallel Programming Patterns
Example Rule 30 cellular automaton Initial configuration Configuration at time 1 Configuration at time 2 Parallel Programming Patterns
63
Rule 30 cellular automaton
Conus textile shell Rule 30 CA Parallel Programming Patterns
64
Parallel Programming Patterns
1D Cellular Automata On distributed-memory architectures, care must be taken to properly handle cells on the border Again, we use ghost cells to augment each sub-domain P0 P1 P2 Cur Next Parallel Programming Patterns
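For concreteness, here is a sketch (not the course code) of one local Rule 30 step on a sub-domain whose single ghost cell per side has already been filled; Rule 30 maps the triple (left, center, right) to left XOR (center OR right).

/* Sketch: one Rule 30 step on a local block; cur[0] and cur[local_n+1] are
   ghost cells, cur[1..local_n] are the cells owned by this processor. */
void rule30_step(const int *cur, int *next, int local_n)
{
    for (int i = 1; i <= local_n; i++) {
        int left = cur[i-1], center = cur[i], right = cur[i+1];
        next[i] = left ^ (center | right);   /* Rule 30: left XOR (center OR right) */
    }
}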
65
Parallel Programming Patterns
Example Rule 30 cellular automaton Processor 0 Processor 1 Processor 2 Parallel Programming Patterns
66
Parallel Programming Patterns
Note In the Rule 30 example, using one ghost cell per side it is possible to compute one step of the CA; after that, it is necessary to refill the ghost cells with the new values from the neighbors. If we use two ghost cells per side, we can compute two steps of the CA before communicating again.
67
Parallel Programming Patterns
Example Rule 30 cellular automaton Processor 0 Processor 1 Processor 2 Parallel Programming Patterns
68
Parallel Programming Patterns
Why? Using more ghost cells means fewer communication operations, but each communication involves more data; overall, the number of bytes exchanged remains more or less the same. Data transfers of large blocks are usually handled more efficiently than transfers of many small blocks.
69
Parallel programming patterns: Reduce
70
Parallel Programming Patterns
Reduce A reduction is the application of an associative binary operator (e.g., sum, product, min, max...) to the elements of an array [x0, x1, … xn-1] sum-reduce( [x0, x1, … xn-1] ) = x0+ x1+ … + xn-1 min-reduce( [x0, x1, … xn-1] ) = min { x0, x1, … xn-1} … A reduction can be realized in O(log2 n) parallel steps Parallel Programming Patterns
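In shared-memory programs a reduction is often expressed through the runtime rather than coded by hand; as an illustrative sketch (not from the slides), OpenMP provides a reduction clause for exactly this purpose.

/* Sketch: sum-reduction of x[0..n-1] using OpenMP's reduction clause. */
double sum_reduce(const double *x, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += x[i];
    }
    return sum;
}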
71
Parallel Programming Patterns
Example: sum 1 3 -2 4 7 11 -8 2 1 -5 16 4 2 -5 2 1 Parallel Programming Patterns
72
Parallel Programming Patterns
Example: sum 1 3 -2 4 7 11 -8 2 1 -5 16 4 2 -5 2 1 2 -2 14 8 9 6 -6 3 Parallel Programming Patterns
73
Parallel Programming Patterns
Example: sum 1 3 -2 4 7 11 -8 2 1 -5 16 4 2 -5 2 1 2 -2 14 8 9 6 -6 3 11 4 8 11 Parallel Programming Patterns
74
Parallel Programming Patterns
Example: sum 1 3 -2 4 7 11 -8 2 1 -5 16 4 2 -5 2 1 2 -2 14 8 9 6 -6 3 11 4 8 11 19 15 Parallel Programming Patterns
75
Parallel Programming Patterns
Example: sum 1 3 -2 4 7 11 -8 2 1 -5 16 4 2 -5 2 1 2 -2 14 8 9 6 -6 3 11 4 8 11 19 15 34 Parallel Programming Patterns
76
Parallel Programming Patterns
Example: sum (figure: the reduction tree for the 16-element example above)

int d, i;
/* compute largest power of two < n */
for (d = 1; 2*d < n; d *= 2)
    ;
/* do reduction */
for ( ; d > 0; d /= 2) {
    for (i = 0; i < d; i++) {
        if (i + d < n) x[i] += x[i+d];
    }
}
return x[0];

Question: does the code fragment shown here also work when n is not a power of two? Answer: yes.

See reduction.c
77
Parallel Programming Patterns
Work efficiency How many sums are computed by the parallel reduction algorithm? n / 2 sums at the first level, n / 4 sums at the second level, ..., n / 2^j sums at the j-th level, 1 sum at the (log2 n)-th level. We must compute the summation of n / 2^j for j = 1 ... log2 n, which gives O(n) sums in total. The tree-structured reduction algorithm is therefore work-efficient, which means that it performs the same amount of “work” as the optimal serial algorithm.
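Working out that summation (a geometric series; assume n is a power of two) gives the O(n) bound explicitly:

\[
\sum_{j=1}^{\log_2 n} \frac{n}{2^j}
  \;=\; n \sum_{j=1}^{\log_2 n} \frac{1}{2^j}
  \;=\; n\left(1 - \frac{1}{n}\right)
  \;=\; n - 1 \;=\; O(n)
\]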
78
Parallel programming patterns: Scan
79
Parallel Programming Patterns
Scan (Prefix Sum) A scan computes all prefixes of an array [x0, x1, … xn-1] using a given associative binary operator (e.g., sum, product, min, max... ) [y0, y1, … yn - 1] = inclusive-scan( [x0, x1, … xn - 1] ) where y0 = x0 y1 = x0 + x1 y2 = x0 + x1 + x2 … yn - 1= x0 + x1 + … + xn - 1 Parallel Programming Patterns
80
Parallel Programming Patterns
Scan (Prefix Sum) A scan computes all prefixes of an array [x0, x1, … xn-1] using a given associative binary operator (e.g., sum, product, min, max... ) [y0, y1, … yn - 1] = exclusive-scan( [x0, x1, … xn - 1] ) where y0 = 0 y1 = x0 y2 = x0 + x1 … yn - 1= x0 + x1 + … + xn - 2 this is the neutral element of the binary operator (zero for sum, 1 for product, ...) Parallel Programming Patterns
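As a serial baseline for the parallel algorithm shown next, an exclusive scan is a single linear pass; this sketch is mine, not from the slides.

/* Sketch: serial exclusive sum-scan: out[0] = 0, out[i] = x[0] + ... + x[i-1]. */
void exclusive_scan(const double *x, double *out, int n)
{
    double acc = 0.0;              /* neutral element of the + operator */
    for (int i = 0; i < n; i++) {
        out[i] = acc;
        acc += x[i];
    }
}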
81
Exclusive scan: Up-sweep
(figure: the up-sweep phase on an 8-element array; the pairwise partial sums ∑x[0..1], ∑x[2..3], ∑x[4..5], ∑x[6..7] are built first, then ∑x[0..3] and ∑x[4..7], and finally ∑x[0..7] in the last position)

for ( d=1; d<n/2; d *= 2 ) {
    for ( k=0; k<n; k += 2*d ) {
        x[k+2*d-1] = x[k+d-1] + x[k+2*d-1];
    }
}

O(n) additions
82
Exclusive scan: Down-sweep
(figure: the down-sweep phase on the same 8-element array; x[7] is first set to zero, then partial sums are swapped and combined level by level until the array holds the exclusive scan [0, x[0], ∑x[0..1], ..., ∑x[0..6]])

x[n-1] = 0;
for ( ; d > 0; d >>= 1 ) {
    for ( k=0; k<n; k += 2*d ) {
        float t = x[k+d-1];
        x[k+d-1] = x[k+2*d-1];
        x[k+2*d-1] = t + x[k+2*d-1];
    }
}

O(n) additions

See prefix-sum.c
83
Parallel Programming Patterns
Example: Line of Sight. n peaks of heights h[0], ... h[n - 1]; the distance between consecutive peaks is one. Which peaks are visible from peak 0? (figure: a profile of peaks h[0] ... h[7], some visible from peak 0 and some not)
84
Parallel Programming Patterns
Line of sight Source: Guy E. Blelloch, Prefix Sums and Their Applications Parallel Programming Patterns
85
Parallel Programming Patterns
Line of sight (figures: step-by-step animation of the line-of-sight computation on the peaks h[0] ... h[7])
94
Parallel Programming Patterns
Serial algorithm For each i = 0, 1, ... n - 1, let a[i] be the slope of the line connecting peak 0 to peak i: a[0] ← -∞; a[i] ← arctan( ( h[i] - h[0] ) / i ) if i > 0. Let amax[0] ← -∞ and amax[i] ← max { a[0], a[1], ... a[i - 1] } if i > 0. If a[i] ≥ amax[i] then peak i is visible, otherwise peak i is not visible.
95
Parallel Programming Patterns
Serial algorithm

bool[0..n-1] Line-of-sight( double h[0..n-1] )
    bool v[0..n-1]
    double a[0..n-1], amax[0..n-1]
    a[0] ← -∞
    for i ← 1 to n-1 do
        a[i] ← arctan( ( h[i] – h[0] ) / i )
    endfor
    amax[0] ← -∞
    for i ← 1 to n-1 do
        amax[i] ← max{ a[i-1], amax[i-1] }
    endfor
    for i ← 0 to n-1 do
        v[i] ← ( a[i] ≥ amax[i] )
    endfor
    return v
96
Parallel Programming Patterns
Serial algorithm

bool[0..n-1] Line-of-sight( double h[0..n-1] )
    bool v[0..n-1]
    double a[0..n-1], amax[0..n-1]
    a[0] ← -∞
    for i ← 1 to n-1 do                      /* embarrassingly parallel */
        a[i] ← arctan( ( h[i] – h[0] ) / i )
    endfor
    amax[0] ← -∞
    for i ← 1 to n-1 do                      /* sequential dependency: a max-scan */
        amax[i] ← max{ a[i-1], amax[i-1] }
    endfor
    for i ← 0 to n-1 do                      /* embarrassingly parallel */
        v[i] ← ( a[i] ≥ amax[i] )
    endfor
    return v
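A direct C transcription of the serial algorithm (a sketch under the conventions above; it folds the running max into the loop and uses -INFINITY from math.h):

#include <math.h>
#include <stdbool.h>

/* Sketch: v[i] is set to true iff peak i is visible from peak 0. */
void line_of_sight(const double *h, bool *v, int n)
{
    double amax = -INFINITY;                 /* max slope among peaks 0..i-1 */
    v[0] = true;                             /* peak 0 sees itself */
    for (int i = 1; i < n; i++) {
        double a = atan((h[i] - h[0]) / i);  /* slope from peak 0 to peak i */
        v[i] = (a >= amax);
        if (a > amax) amax = a;
    }
}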
97
Parallel Programming Patterns
Parallel algorithm

bool[0..n-1] Parallel-line-of-sight( double h[0..n-1] )
    bool v[0..n-1]
    double a[0..n-1], amax[0..n-1]
    a[0] ← -∞
    for i ← 1 to n-1 do in parallel
        a[i] ← arctan( ( h[i] – h[0] ) / i )
    endfor
    amax ← max-exclusive-scan( a )
    for i ← 0 to n-1 do in parallel
        v[i] ← ( a[i] ≥ amax[i] )
    endfor
    return v
98
Parallel Programming Patterns
Conclusions A parallel programming pattern defines: a partitioning of the input data; a communication structure among parallel tasks. Parallel programming patterns can help to define efficient algorithms. Many problems can be solved using one or more known patterns.