Parallel Programming Patterns


1 Parallel Programming Patterns
Moreno Marzolla, Dip. di Informatica—Scienza e Ingegneria (DISI), Università di Bologna

2 Parallel Programming Patterns
Copyright © 2013, 2017 Moreno Marzolla, Università di Bologna, Italy. This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). To view a copy of this license, visit the license page or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

3 Parallel Programming Patterns
What is a pattern? A design pattern is "a general solution to a recurring engineering problem". A design pattern is not a ready-made solution to a given problem... ...rather, it is a description of how a certain kind of problem can be solved.

4 Architectural patterns
The term "architectural pattern" was first used by the architect Christopher Alexander to denote common design decisions that architects and engineers have used to realize buildings and constructions in general. Christopher Alexander (1936–), A Pattern Language: Towns, Buildings, Construction.

5 Parallel Programming Patterns
Example: building a bridge across a river. You do not "invent" a brand new type of bridge each time; instead, you adapt an already existing type of bridge.

6 Parallel Programming Patterns
Example (figure)

7 Parallel Programming Patterns
Example (figure)

8 Parallel Programming Patterns
Example (figure)

9 Parallel Programming Patterns
Embarrassingly Parallel
Partition
Master-Worker
Stencil
Reduce
Scan

10 Parallel programming patterns: Embarrassingly parallel

11 Embarrassingly Parallel
Applies when the computation can be decomposed into independent tasks that require little or no communication. Examples: vector sum, Mandelbrot set, 3D rendering, brute-force password cracking, ... (figure: processors 0-2 each compute one slice of c[] = a[] + b[])

12 Parallel programming patterns: Partition

13 Parallel Programming Patterns
Partition The input data space (in short, domain) is split into disjoint regions called partitions. Each processor operates on one partition. This pattern is particularly useful when the application exhibits locality of reference, i.e., when processors can refer to their own partition only and need little or no communication with other processors.

14 Parallel Programming Patterns
Example: matrix-vector product Ax = b. Matrix A[][] is partitioned into P horizontal blocks. Each processor operates on one block of A[][] and on a full copy of x[], and computes a portion of the result b[]. (figure: cores 0-3 each own one block of rows of A[][])

15 Parallel Programming Patterns
Partition Types of partition:
Regular: the domain is split into partitions of roughly the same size and shape. E.g., matrix-vector product.
Irregular: partitions do not necessarily have the same size or shape. E.g., heat transfer on irregular solids.
Size of partitions (granularity):
Fine-grained: a large number of small partitions.
Coarse-grained: a few large partitions.

16 Parallel Programming Patterns
1-D Partitioning: Block and Cyclic (figure: rows of the domain assigned to cores 0-3 under each scheme)

17 Parallel Programming Patterns
2-D Block Partitioning: (Block, *), (*, Block), (Block, Block) (figure: assignment of the domain to cores 0-3 under each scheme)

18 2-D Cyclic Partitioning
(figure)

19 2-D Cyclic Partitioning
Cyclic-cyclic (figure)

20 Irregular partitioning example
A lake surface is approximated with a triangular mesh. Colors indicate the mapping of mesh elements to processors. Source:

21 Fine grained vs Coarse grained partitioning
Fine grained vs coarse grained partitioning (figure: computation vs communication time per task)
Fine-grained partitioning: better load balancing, especially if combined with the master-worker pattern (see later). However, if granularity is too fine, the computation / communication ratio might become too low (communication dominates computation).
Coarse-grained partitioning: in general improves the computation / communication ratio; however, it might cause load imbalance.
The "optimal" granularity is sometimes problem-dependent; in other cases the user must choose which granularity to use.

22 Example: Mandelbrot set
The Mandelbrot set is the set of points c on the complex plane such that the sequence z_n(c), defined as

    z_0(c) = 0
    z_n(c) = z_{n-1}(c)^2 + c    for n > 0

does not diverge when n → +∞.

23 Mandelbrot set in color
If the modulus of z_n(c) does not exceed 2 after maxit iterations, the pixel is black (the point is assumed to be part of the Mandelbrot set). Otherwise, the color depends on the number of iterations required for the modulus of z_n(c) to become > 2.

24 Parallel Programming Patterns
Pseudocode Embarrassingly parallel structure: the color of each pixel can be computed independently from other pixels.

    maxit = 1000
    for each pixel (x0, y0) {
        x = 0
        y = 0
        it = 0
        while ( x*x + y*y < 2*2 AND it < maxit ) {
            xtemp = x*x - y*y + x0
            y = 2*x*y + y0
            x = xtemp
            it = it + 1
        }
        plot(x0, y0, it)   /* plot the pixel (x0, y0), not (x, y) */
    }

Source:

25 Parallel Programming Patterns
Mandelbrot set A regular partitioning can result in uneven load distribution. Black pixels require maxit iterations; the others require fewer. The computation time of each partition is roughly proportional to the number of black pixels it contains. The central partition shown in the figure has more black pixels than the other two, and therefore will require more time.

Load balancing Ideally, each processor should perform the same amount of "work". For example, if the tasks synchronize at the end of the computation, the execution time will be that of the slowest task. (figure: tasks 0-3 with busy and idle periods, ending at a barrier synchronization)

27 Parallel Programming Patterns
Load balancing howto The workload is balanced if each processor performs more or less the same amount of work. Ways to achieve load balancing: use a finer partitioning (but beware of the possible communication overhead if the tasks need to communicate); use dynamic task allocation (master-worker paradigm).

28 Master-worker paradigm (process farm, work pool)
Apply a fine-grained partitioning: number of tasks >> number of cores. The master assigns a task to the first available worker. (figure: the master distributes a bag of tasks of possibly different duration to workers 0 to P-1)

29 Example omp-mandelbrot.c
Coarse-grained partitioning:
    OMP_SCHEDULE="static" ./omp-mandelbrot
Cyclic, fine-grained partitioning (64 rows per block):
    OMP_SCHEDULE="static,64" ./omp-mandelbrot
Dynamic, fine-grained partitioning (64 rows per block):
    OMP_SCHEDULE="dynamic,64" ./omp-mandelbrot
Dynamic, fine-grained partitioning (1 row per block):
    OMP_SCHEDULE="dynamic" ./omp-mandelbrot

30 Scheduling comparison
(figures: coarse-grained decomposition among P0-P3; cyclic task assignment with block size = 64; dynamic (master-worker) scheduling with block size = 64, one possible assignment of rows to P0-P3)

31 Parallel programming patterns: Stencil

32 Parallel Programming Patterns
Stencils Stencil computations involve a grid whose values are updated according to a fixed pattern called a stencil. Example: the Gaussian smoothing of an image updates the color of each pixel with the weighted average of the previous colors of the 5 × 5 neighborhood, with weights:

    1  4  7  4  1
    4 16 28 16  4
    7 28 41 28  7
    4 16 28 16  4
    1  4  7  4  1

33 Parallel Programming Patterns
2D Stencils (figures): 5-point 2-axis 2D stencil; 9-point 2-axis 2D stencil; 9-point 1-plane 2D stencil

34 Parallel Programming Patterns
3D Stencils (figures): 13-point 3-axis 3D stencil; 7-point 3-axis 3D stencil

35 Parallel Programming Patterns
3D Stencils (figure): 72-point 3-plane 3D stencil

36 Parallel Programming Patterns
2D Stencils 2D stencil computations usually employ two grids to keep the current and next values. Values are read from the current grid; new values are written to the next grid; the current and next grids are exchanged at the end of each phase.

37 Parallel Programming Patterns
Ghost Cells How do we handle cells on the border of the domain? We might assume that cells outside the border have some fixed, application-dependent value, or we may assume periodic boundary conditions, where sides are "glued" together to form a torus. In either case, we may extend the domain with ghost cells, so that cells on the border do not require any special treatment. Ghost cells allow a uniform computation: there is no need to explicitly check whether we are on the border. The number of ghost cells depends on the structure of the stencil. (figure: domain surrounded by ghost cells)

38 Periodic boundary conditions: How to fill ghost cells
(figure)

39 2D Stencil Example: Game of Life
2D cyclic domain; each cell has two possible states: 0 = dead, 1 = alive. The state of a cell at time t + 1 depends on: the state of that cell at time t; the number of alive cells at time t among the 8 neighbors. Rules:
Alive cell with fewer than two alive neighbors → dies
Alive cell with two or three alive neighbors → lives
Alive cell with more than three alive neighbors → dies
Dead cell with three alive neighbors → lives

40 Parallel Programming Patterns
Example: Game of Life See game-of-life.c Parallel Programming Patterns

41 Parallelizing stencil computations
Computing the next grid from the current one has embarrassingly parallel structure. However, domain partitioning on distributed-memory architectures requires special care.

    "Initialize current grid"
    while (!terminated) {
        "Compute next grid"        /* embarrassingly parallel */
        "Exchange boundaries"
    }

42 Parallel Programming Patterns
Ghost cells Partitions are again augmented with ghost cells (halo). They contain a copy of "logically" adjacent cells. The width of the halo depends on the shape of the stencil. (figure: two partitions with their halos)

43 Example: 2D partitioning with 5P stencil, periodic boundary
(slides 43-47: successive animation steps of the ghost-cell exchange)

48 Example: 2D partitioning with 9P stencil
(slides 48-49: animation steps of the ghost-cell exchange)

50 Example: 1D (Block, *) partitioning with 5P stencil, periodic boundary
(slides 50-53: animation steps of the ghost-cell exchange)

54 Parallel Programming Patterns
Parallelizing 2D stencil computations on distributed-memory architectures. Let us consider a 2D domain of size N × N subject to a 5P 2D stencil, on a distributed-memory machine with P = 4 processors. Compare the following types of decomposition...
(Block, *): the first N/P rows are assigned to the first processor, the next N/P rows to the second processor, and so on.
(Block, Block): the domain is decomposed into four square subdomains.
...assuming the following boundary conditions: periodic; non-periodic.
Goal: minimize the number of ghost cells that must be exchanged among processors.

55 Choosing a decomposition
(Block, *) vs (Block, Block) (figure: assignment of the domain to P0-P3 under each decomposition)

56 Choosing a decomposition
(Block, *), periodic boundary conditions: 8N ghost cells are exchanged. The ghost cells at the left and right sides are not exchanged across processors, so they do not contribute to the total message size. (figure: P0-P3 row blocks of an N × N domain)

57 Choosing a decomposition
(Block, *), non-periodic boundary conditions: 6N ghost cells are exchanged. (figure: P0-P3 row blocks of an N × N domain)

58 Choosing a decomposition
(Block, Block), periodic boundary conditions: 8N ghost cells are exchanged; each subdomain is N/2 × N/2. (figure: P0-P3 square subdomains)

59 Choosing a decomposition
(Block, Block), non-periodic boundary conditions: 4N ghost cells are exchanged; each subdomain is N/2 × N/2. (figure: P0-P3 square subdomains)

60 Parallel Programming Patterns
Recap: number of ghost cells exchanged

                  (Block, *)   (Block, Block)
    Periodic          8N            8N
    Non-periodic      6N            4N

61 1D Stencil Example: Rule 30 Cellular Automaton
The state at time t + 1 depends on the state of the red cells (the cell itself and its two neighbors) at time t. (figure: Rule 30 cellular automaton at times t, t+1, t+2)

62 Parallel Programming Patterns
Example Rule 30 cellular automaton (figure: initial configuration; configuration at time 1; configuration at time 2)

63 Rule 30 cellular automaton
(figures: Conus textile shell; Rule 30 CA pattern)

64 Parallel Programming Patterns
1D Cellular Automata On distributed-memory architectures, care must be taken to properly handle cells on the border. Again, we use ghost cells to augment each sub-domain. (figure: current and next arrays split across processors P0-P2)

65 Parallel Programming Patterns
Example Rule 30 cellular automaton (figure: the domain split across processors 0-2)

66 Parallel Programming Patterns
Note In the Rule 30 example, using one ghost cell per side it is possible to compute one step of the CA. After that, it is necessary to fill the ghost cells with the new values from the neighbors. If we use two ghost cells per side, we can compute two steps of the CA before communicating.

67 Parallel Programming Patterns
Example Rule 30 cellular automaton (figure: the domain split across processors 0-2)

68 Parallel Programming Patterns
Why? Using more ghost cells: fewer communication operations, but each communication involves more data; overall, the number of bytes exchanged remains more or less the same. Data transfers of large blocks are usually handled more efficiently than small blocks.

69 Parallel programming patterns: Reduce

70 Parallel Programming Patterns
Reduce A reduction is the application of an associative binary operator (e.g., sum, product, min, max, ...) to the elements of an array [x0, x1, ..., xn-1]:
sum-reduce( [x0, x1, ..., xn-1] ) = x0 + x1 + ... + xn-1
min-reduce( [x0, x1, ..., xn-1] ) = min { x0, x1, ..., xn-1 }
A reduction can be realized in O(log2 n) parallel steps.

71 Parallel Programming Patterns
Example: sum 1 3 -2 4 7 11 -8 2 1 -5 16 4 2 -5 2 1 Parallel Programming Patterns

72 Parallel Programming Patterns
Example: sum 1 3 -2 4 7 11 -8 2 1 -5 16 4 2 -5 2 1 2 -2 14 8 9 6 -6 3 Parallel Programming Patterns

73 Parallel Programming Patterns
Example: sum 1 3 -2 4 7 11 -8 2 1 -5 16 4 2 -5 2 1 2 -2 14 8 9 6 -6 3 11 4 8 11 Parallel Programming Patterns

74 Parallel Programming Patterns
Example: sum 1 3 -2 4 7 11 -8 2 1 -5 16 4 2 -5 2 1 2 -2 14 8 9 6 -6 3 11 4 8 11 19 15 Parallel Programming Patterns

75 Parallel Programming Patterns
Example: sum 1 3 -2 4 7 11 -8 2 1 -5 16 4 2 -5 2 1 2 -2 14 8 9 6 -6 3 11 4 8 11 19 15 34 Parallel Programming Patterns

76 Parallel Programming Patterns
Example: sum (figure: the reduction tree of the previous slides)

    int d, i;
    /* compute the largest power of two < n */
    for (d = 1; 2*d < n; d *= 2)
        ;
    /* do the reduction */
    for ( ; d > 0; d /= 2) {
        for (i = 0; i < d; i++) {       /* the inner loop is parallel */
            if (i + d < n)
                x[i] += x[i+d];
        }
    }
    return x[0];

Question: does this code fragment also work when n is not a power of two? Answer: yes. See reduction.c

77 Parallel Programming Patterns
Work efficiency How many sums are computed by the parallel reduction algorithm?
n/2 sums at the first level
n/4 sums at the second level
n/2^j sums at the j-th level
1 sum at the (log2 n)-th level
Total: the summation for j = 1 to log2 n of n/2^j = n - 1, i.e. O(n) sums.
The tree-structured reduction algorithm is work-efficient, which means that it performs the same amount of "work" as the optimal serial algorithm.

78 Parallel programming patterns: Scan

79 Parallel Programming Patterns
Scan (Prefix Sum) A scan computes all prefixes of an array [x0, x1, ..., xn-1] using a given associative binary operator (e.g., sum, product, min, max, ...):
[y0, y1, ..., yn-1] = inclusive-scan( [x0, x1, ..., xn-1] )
where
y0 = x0
y1 = x0 + x1
y2 = x0 + x1 + x2
...
yn-1 = x0 + x1 + ... + xn-1

80 Parallel Programming Patterns
Scan (Prefix Sum) A scan computes all prefixes of an array [x0, x1, ..., xn-1] using a given associative binary operator (e.g., sum, product, min, max, ...):
[y0, y1, ..., yn-1] = exclusive-scan( [x0, x1, ..., xn-1] )
where
y0 = 0   (the neutral element of the binary operator: zero for sum, 1 for product, ...)
y1 = x0
y2 = x0 + x1
...
yn-1 = x0 + x1 + ... + xn-2

81 Exclusive scan: Up-sweep
Exclusive scan: Up-sweep (figure: at each level d, adjacent partial sums are combined bottom-up into ∑x[k..k+2d-1])

    for (d = 1; d < n/2; d *= 2) {
        for (k = 0; k < n; k += 2*d) {      /* inner loop is parallel */
            x[k+2*d-1] = x[k+d-1] + x[k+2*d-1];
        }
    }

O(n) additions

82 Exclusive scan: Down-sweep
Exclusive scan: Down-sweep (figure: the last element is set to zero, then partial sums are propagated top-down, producing the exclusive prefix sums ∑x[0..i-1])

    x[n-1] = 0;
    for ( ; d > 0; d >>= 1 ) {              /* d keeps its value from the up-sweep */
        for (k = 0; k < n; k += 2*d) {      /* inner loop is parallel */
            float t = x[k+d-1];
            x[k+d-1] = x[k+2*d-1];
            x[k+2*d-1] = t + x[k+2*d-1];
        }
    }

O(n) additions. See prefix-sum.c

83 Parallel Programming Patterns
Example: Line of Sight n peaks of heights h[0], ..., h[n-1]; the distance between consecutive peaks is one. Which peaks are visible from peak 0? (figure: peaks h[0]..h[7], some marked visible, some not visible)

84 Parallel Programming Patterns
Line of sight Source: Guy E. Blelloch, Prefix Sums and Their Applications Parallel Programming Patterns

85 Parallel Programming Patterns
Line of sight (figure: peaks h[0]..h[7])


94 Parallel Programming Patterns
Serial algorithm For each i = 0, 1, ..., n-1, let a[i] be the slope of the line connecting peak 0 to peak i:
a[0] ← -∞
a[i] ← arctan( ( h[i] - h[0] ) / i ), if i > 0
amax[0] ← -∞
amax[i] ← max { a[0], a[1], ..., a[i-1] }, if i > 0
If a[i] ≥ amax[i] then peak i is visible; otherwise peak i is not visible.

95 Parallel Programming Patterns
Serial algorithm

    bool[0..n-1] Line-of-sight( double h[0..n-1] )
        bool v[0..n-1]
        double a[0..n-1], amax[0..n-1]
        a[0] ← -∞
        for i ← 1 to n-1 do
            a[i] ← arctan( ( h[i] - h[0] ) / i )
        endfor
        amax[0] ← -∞
        for i ← 1 to n-1 do
            amax[i] ← max{ a[i-1], amax[i-1] }
        endfor
        for i ← 0 to n-1 do
            v[i] ← ( a[i] ≥ amax[i] )
        endfor
        return v

96 Parallel Programming Patterns
Serial algorithm

    bool[0..n-1] Line-of-sight( double h[0..n-1] )
        bool v[0..n-1]
        double a[0..n-1], amax[0..n-1]
        a[0] ← -∞
        for i ← 1 to n-1 do                      // embarrassingly parallel
            a[i] ← arctan( ( h[i] - h[0] ) / i )
        endfor
        amax[0] ← -∞
        for i ← 1 to n-1 do
            amax[i] ← max{ a[i-1], amax[i-1] }
        endfor
        for i ← 0 to n-1 do                      // embarrassingly parallel
            v[i] ← ( a[i] ≥ amax[i] )
        endfor
        return v

97 Parallel Programming Patterns
Parallel algorithm

    bool[0..n-1] Parallel-line-of-sight( double h[0..n-1] )
        bool v[0..n-1]
        double a[0..n-1], amax[0..n-1]
        a[0] ← -∞
        for i ← 1 to n-1 do in parallel
            a[i] ← arctan( ( h[i] - h[0] ) / i )
        endfor
        amax ← max-exclusive-scan( a )
        for i ← 0 to n-1 do in parallel
            v[i] ← ( a[i] ≥ amax[i] )
        endfor
        return v

98 Parallel Programming Patterns
Conclusions A parallel programming pattern defines: a partitioning of the input data; a communication structure among parallel tasks. Parallel programming patterns can help to define efficient algorithms. Many problems can be solved using one or more known patterns.

