Slides for Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., by B. Wilkinson & M. Allen, © 2004 Pearson Education Inc. All rights reserved. Chapter 6: Synchronous Computations.


6.1 Chapter 6 — Synchronous Computations
Outline: Barrier and its implementations; Fully synchronous applications; Partially synchronous applications; Reducing synchronization.

6.2 Synchronous Computations
We consider problems solved by a group of separate computations that must at times wait for each other before proceeding. A very important class of such problems is fully synchronous applications, in which:
- all processes are synchronized at regular points;
- the same computation or operation is applied to a set of data points;
- operations start at the same time in a lock-step manner, analogous to SIMD computations.

6.3 Synchronous Computations: Barrier
A barrier is a basic mechanism for synchronizing processes. It is inserted at the point in each process where the process must wait. All processes can continue from the barrier point after they have all reached it (or, in some implementations, when a stated number of processes have reached the barrier). Barriers apply to both shared-memory and message-passing systems; in message-passing systems they are provided as library routines. Implementations covered here: counter, tree, and butterfly.

6.4 Timed Arrival to Barrier (figure)

6.5 MPI_Barrier() Library Routine
MPI_Barrier() takes a single parameter: a named communicator that defines the group of processes to synchronize. It is called by each process in the group, blocking until all members of the group have reached the barrier call and only returning then.
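A minimal usage sketch in C with MPI; the communicator MPI_COMM_WORLD and the printed messages are just for illustration:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* each process does some work of varying length here */
    printf("Process %d reached the barrier\n", rank);

    /* blocks until every process in MPI_COMM_WORLD has called it */
    MPI_Barrier(MPI_COMM_WORLD);

    printf("Process %d passed the barrier\n", rank);
    MPI_Finalize();
    return 0;
}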

6.6 Barrier Implementation
Centralized counter implementation (a linear barrier): a single counter is used to count the number of processes reaching the barrier.

6.7 Barrier Implementation — Example Code
Using locally blocking send()s and recv()s.
Master:
  for (i = 0; i < n; i++)   /* count slaves as they reach the barrier */
    recv(Pany);
  for (i = 0; i < n; i++)   /* release slaves */
    send(Pi);
Slave processes:
  send(Pmaster);
  recv(Pmaster);
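A hedged rendering of the same counter (linear) barrier using real MPI point-to-point calls, with rank 0 playing the master; the tag value and the integer token are illustrative assumptions:

#include <mpi.h>

/* Counter (linear) barrier sketch: rank 0 acts as the master. */
void counter_barrier(MPI_Comm comm)
{
    int rank, size, i, token = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == 0) {
        /* arrival phase: count the other processes as they check in */
        for (i = 1; i < size; i++)
            MPI_Recv(&token, 1, MPI_INT, MPI_ANY_SOURCE, 0, comm, MPI_STATUS_IGNORE);
        /* release phase: let every process continue */
        for (i = 1; i < size; i++)
            MPI_Send(&token, 1, MPI_INT, i, 0, comm);
    } else {
        MPI_Send(&token, 1, MPI_INT, 0, 0, comm);                    /* arrive */
        MPI_Recv(&token, 1, MPI_INT, 0, 0, comm, MPI_STATUS_IGNORE); /* wait for release */
    }
}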

6.8 Barrier Implementation in a Message-Passing System (figure)

6.9 Tree Implementation
Centralized counter barriers have complexity O(p) with p processes; a decentralized tree-based barrier has complexity O(log p).
Suppose 8 processes, P0, P1, P2, P3, P4, P5, P6, P7:
1st stage: P1 sends a message to P0, P3 to P2, P5 to P4, and P7 to P6 (each when it reaches its barrier).
2nd stage: P2 sends to P0 (once P2 and P3 have reached their barriers); P6 sends to P4 (once P6 and P7 have).
3rd stage: P4 sends to P0 (once P4, P5, P6 and P7 have reached their barriers); P0 terminates the arrival phase when it reaches its own barrier and has received the message from P4.
The release phase then typically works in reverse, fanning release messages back down the tree; an MPI sketch of such a tree barrier follows.
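A sketch of a tree barrier with MPI point-to-point calls, assuming the number of processes is a power of two; the tag and the integer token are illustrative assumptions:

#include <mpi.h>

/* Tree barrier sketch: fan in towards process 0, then fan out again. */
void tree_barrier(MPI_Comm comm)
{
    int rank, size, stage, token = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* arrival phase: at stage s, ranks that are odd multiples of s send s places down */
    for (stage = 1; stage < size; stage *= 2) {
        if (rank % (2 * stage) == stage)
            MPI_Send(&token, 1, MPI_INT, rank - stage, 0, comm);
        else if (rank % (2 * stage) == 0 && rank + stage < size)
            MPI_Recv(&token, 1, MPI_INT, rank + stage, 0, comm, MPI_STATUS_IGNORE);
    }
    /* release phase: reverse the arrival pattern, fanning out from process 0 */
    for (stage /= 2; stage >= 1; stage /= 2) {
        if (rank % (2 * stage) == 0 && rank + stage < size)
            MPI_Send(&token, 1, MPI_INT, rank + stage, 0, comm);
        else if (rank % (2 * stage) == stage)
            MPI_Recv(&token, 1, MPI_INT, rank - stage, 0, comm, MPI_STATUS_IGNORE);
    }
}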

6.10 Tree Barrier (figure)

6.11 Butterfly Barrier (figure)

6.12 Butterfly Barrier (cont'd)
Useful if data are exchanged at the barrier; e.g., observe how P2 gathers data from all processes. The butterfly barrier has the same communication complexity, O(log p).
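A hedged sketch of a butterfly barrier in MPI, assuming the number of processes is a power of two; at each of the log p stages every process exchanges a message with the partner whose rank differs in one bit:

#include <mpi.h>

void butterfly_barrier(MPI_Comm comm)
{
    int rank, size, stage, token = 0, dummy;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (stage = 1; stage < size; stage *= 2) {
        int partner = rank ^ stage;   /* rank with one bit flipped */
        /* combined send/receive: both partners block until both have arrived */
        MPI_Sendrecv(&token, 1, MPI_INT, partner, 0,
                     &dummy, 1, MPI_INT, partner, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}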

6.13 Partial Barriers
Local synchronization: suppose a process Pi needs to be synchronized with, and to exchange data with, process Pi-1 and process Pi+1 before continuing. This is not a perfect three-process barrier, because process Pi-1 will only synchronize with Pi and continue as soon as Pi allows; similarly, process Pi+1 only synchronizes with Pi.

6.14 Deadlock
When a pair of processes each send to and receive from the other, deadlock may occur. Deadlock will occur if both processes perform the send first using synchronous routines (or blocking routines without sufficient buffering): neither send will return, because each is waiting for a matching receive that is never reached.
A solution: arrange for one process to receive first and then send, and for the other process to send first and then receive.
Example: linear pipeline deadlock can be avoided by arranging for the even-numbered processes to perform their sends first and the odd-numbered processes to perform their receives first.

6.15 Deadlock (cont'd)
A different solution is to use combined, deadlock-free blocking routines, e.g., sendrecv() routines. In MPI there are two versions: MPI_Sendrecv() and MPI_Sendrecv_replace(). MPI_Sendrecv() uses two buffers and has 12 parameters; MPI_Sendrecv_replace() uses a single buffer.
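A small usage sketch of MPI_Sendrecv() in C: each process passes a value around a ring. The ring topology, the tag and the payload are assumptions for illustration:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, left, right, out, in;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    right = (rank + 1) % size;          /* neighbour to send to      */
    left  = (rank - 1 + size) % size;   /* neighbour to receive from */
    out = rank;

    /* combined send and receive: cannot deadlock the way two
       opposing blocking sends can */
    MPI_Sendrecv(&out, 1, MPI_INT, right, 0,
                 &in,  1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("Process %d received %d from process %d\n", rank, in, left);
    MPI_Finalize();
    return 0;
}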

6.16 Synchronized Computations
These can be classified as fully synchronous or locally synchronous.
Fully synchronous: all processes involved in the computation must be synchronized. Examples: data parallel algorithms (e.g., the prefix sum problem) and solving systems of linear equations by iteration.
Locally synchronous: processes only need to synchronize with a set of logically nearby processes, not with all processes involved in the computation. Example: the heat-distribution problem.

6.17 Fully Synchronized Computation Examples
Data parallel computations: a form of computation with implicit synchronization requirements; the same operation is performed on different data elements simultaneously, i.e., in parallel. Particularly convenient because of:
1. ease of programming (essentially only one program);
2. easy scaling to larger problem sizes;
3. the fact that many numeric and some non-numeric problems can be cast in a data parallel form.
Example: add the same constant to each element of an array:
  for (i = 0; i < n; i++)
    a[i] = a[i] + k;
The statement a[i] = a[i] + k may be executed simultaneously by multiple processors using different indices i.

6.18 Data Parallel Computations
On a SIMD computer the synchronism is built into the hardware: the processes operate in lock-step fashion, and the same instruction, equivalent to a[] = a[] + k, is sent to each processor simultaneously.

6.19 The forall Construct
A special "parallel" construct in parallel programming languages to specify data parallel operations. Example:
  forall (i = 0; i < n; i++) {
    body
  }
This states that n instances of the statements of the body can be executed simultaneously. One value of the loop variable i is valid in each instance of the body: the first instance has i = 0, the next i = 1, and so on. The previous example may be written as
  forall (i = 0; i < n; i++)
    a[i] = a[i] + k;

6.20 Data Parallel Computations (cont'd)
The data parallel technique can be applied to multiprocessors and multicomputers (we do not consider programs for SIMD computers here). A form of barrier is implicit in the forall construct: it completes only when all instances of the body have executed.
Example: the previous example on a message-passing system, with an explicit barrier:
  i = myrank;
  a[i] = a[i] + k;      /* body */
  barrier(mygroup);
Assume that each process has access to the required element of the array.
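A hedged MPI sketch of this message-passing version, with one array element per process; the constant k, the element values and the one-element-per-process layout are assumptions for illustration:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank;
    double a_i, k = 3.0;          /* a_i stands in for a[i] with i == rank */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    a_i = (double)rank;           /* illustrative initial value of a[i] */

    a_i = a_i + k;                /* the body of the forall */
    MPI_Barrier(MPI_COMM_WORLD);  /* explicit barrier playing the role of forall's implicit one */

    printf("Process %d: a[%d] = %.1f\n", rank, rank, a_i);
    MPI_Finalize();
    return 0;
}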

6.21 Example 1: Prefix Sum Problem
Given a list of numbers x0, …, xn-1, compute all the partial summations (i.e., x0+x1; x0+x1+x2; x0+x1+x2+x3; …). The problem can also be defined with associative operations other than addition. It has been widely studied, with practical applications in areas such as processor allocation, data compaction, sorting, and polynomial evaluation.

6.22 Data Parallel Prefix Sum: 8 Numbers (figure)

6.23 Data Parallel Prefix Sum (cont'd)
A sequential code may be:
  sum[0] = x[0];
  for (i = 1; i < n; i++)
    sum[i] = sum[i-1] + x[i];
Parallel code:
  for (j = 0; j < log(n); j++)       /* at step j, add in the value 2^j places back */
    forall (i = 0; i < n; i++)
      if (i >= 2^j)
        x[i] = x[i] + x[i - 2^j];
Time complexity is O(log n) for both communication and computation.
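A hedged MPI sketch of the doubling algorithm above with one number per process; the initial values are illustrative, and the send buffer is copied each stage so the non-blocking send and the update cannot interfere:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, step;
    double x, incoming;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    x = (double)(rank + 1);              /* x[i] held by process i (illustrative values) */

    for (step = 1; step < size; step *= 2) {
        double outgoing = x;             /* value from the previous stage */
        MPI_Request req;
        int sending = (rank + step < size);

        if (sending)                     /* pass current partial sum 'step' places forward */
            MPI_Isend(&outgoing, 1, MPI_DOUBLE, rank + step, 0, MPI_COMM_WORLD, &req);
        if (rank >= step) {              /* take the partial sum from 'step' places back */
            MPI_Recv(&incoming, 1, MPI_DOUBLE, rank - step, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            x += incoming;
        }
        if (sending)
            MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    printf("Process %d: prefix sum = %.1f\n", rank, x);
    MPI_Finalize();
    return 0;
}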

6.24 Synchronous Iteration (Synchronous Parallelism)
Synchronous iteration describes the situation in which a problem is solved by iteration and each iteration step is composed of several processes that start together at the beginning of the step; the next iteration step cannot begin until all processes have finished the current one.
Using the forall construct:
  for (j = 0; j < n; j++)          /* for each synchronous iteration */
    forall (i = 0; i < p; i++) {   /* p processes, each using        */
      body(i);                     /* a specific value of i          */
    }

6.25 Synchronous Iteration (cont'd)
Using a message-passing barrier:
  for (j = 0; j < n; j++) {   /* for each synchronous iteration */
    i = myrank;               /* find the value of i to be used */
    body(i);
    barrier(mygroup);
  }

6.26 Example 2: Synchronous Iteration — Solving a General System of Linear Equations by Iteration
Suppose the equations are of a general form, with n equations and n unknowns x0, x1, x2, …, xn-1:
  ai,0 x0 + ai,1 x1 + … + ai,n-1 xn-1 = bi    (0 <= i < n)

6.27 System of Linear Equations (cont'd)
By rearranging the i-th equation (0 <= i < n):
  xi = (bi − Σj≠i ai,j xj) / ai,i
This equation gives xi in terms of the other unknowns and can be used as an iteration formula for each of the unknowns, to obtain successively better approximations.

6.28 Jacobi Iteration
In Jacobi iteration all values of x are updated together. The Jacobi iteration converges if each diagonal value of a has an absolute value greater than the sum of the absolute values of the other a's on its row (the array of a's is diagonally dominant); that is, if
  |ai,i| > Σj≠i |ai,j|    for all i.
This condition is a sufficient but not a necessary condition.
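As a small aside, a sketch of a routine that checks this sufficient condition for an n x n matrix stored row-major in a flat array; the storage layout and function name are assumptions:

#include <math.h>

/* Returns 1 if |a[i][i]| > sum over j != i of |a[i][j]| for every row i, else 0. */
int diagonally_dominant(const double *a, int n)
{
    int i, j;
    for (i = 0; i < n; i++) {
        double off_diag = 0.0;
        for (j = 0; j < n; j++)
            if (j != i)
                off_diag += fabs(a[i * n + j]);
        if (fabs(a[i * n + i]) <= off_diag)
            return 0;   /* row i violates the condition */
    }
    return 1;
}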

6.29 Illustrating Jacobi Iteration (figure)

6.30 Solution by Iteration
The solution can be found by iterative methods or by direct methods (Chapter 11).
Iterative methods:
- are applicable when direct methods require excessive computation;
- have the advantage of small memory requirements;
- may not always converge/terminate.
An iterative method begins with an initial guess for the unknowns, e.g., xi = bi. Iterations are continued until sufficiently accurate values are obtained for the unknowns.

6.31 Jacobi Iteration: Termination
A simple, common approach is to compare the values computed in one iteration to the values obtained from the previous iteration, and to terminate the computation when all values are within a given tolerance, i.e., when
  |xi^t − xi^(t-1)| < tolerance    for all i
(xi^t denotes the value of xi at the t-th iteration). Other appropriate termination conditions use an aggregate measure of the change instead, e.g., the sum or the Euclidean norm of the differences xi^t − xi^(t-1).
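A sketch of such a convergence test in C; the later parallel code calls an unspecified tolerance() routine, and the per-element difference test and function name used here are assumptions:

#include <math.h>

/* Returns 1 when every unknown has changed by less than 'tol'
   since the previous iteration, 0 otherwise. */
int within_tolerance(const double *x_new, const double *x_old, int n, double tol)
{
    int i;
    for (i = 0; i < n; i++)
        if (fabs(x_new[i] - x_old[i]) >= tol)
            return 0;
    return 1;
}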

6.32 Jacobi Iteration: Termination (cont'd)
Regardless of the termination method, the iterations should also be stopped after a certain maximum number of iterations, since they may not terminate otherwise. There is a trade-off between the cost of a complex termination calculation and the number of iterations saved; it may be a good strategy to check for termination only occasionally, after a number of iterations. A parallel formulation requires each iteration to use all values of the preceding iteration, so the calculations require global synchronization. There are other methods that may terminate faster than Jacobi iteration.

6.33 Sequential Code
  for (i = 0; i < n; i++)
    x[i] = b[i];
  for (iter = 0; iter < limit; iter++) {
    for (i = 0; i < n; i++) {
      sum = 0;
      for (j = 0; j < n; j++)
        if (i != j) sum = sum + a[i][j] * x[j];
      new_x[i] = (b[i] - sum) / a[i][i];
    }
    for (i = 0; i < n; i++)
      x[i] = new_x[i];
  }

6.34 Sequential Code (cont'd)
A more efficient sequential code (avoiding the if statement):
  for (i = 0; i < n; i++)
    x[i] = b[i];
  for (iter = 0; iter < limit; iter++) {
    for (i = 0; i < n; i++) {
      sum = -a[i][i] * x[i];
      for (j = 0; j < n; j++)
        sum = sum + a[i][j] * x[j];
      new_x[i] = (b[i] - sum) / a[i][i];
    }
    for (i = 0; i < n; i++)
      x[i] = new_x[i];
  }

6.35 Parallel Code
Suppose we have a process Pi for each unknown xi; the code for process Pi may be:
  x[i] = b[i];
  for (iter = 0; iter < limit; iter++) {
    sum = -a[i][i] * x[i];
    for (j = 0; j < n; j++)
      sum = sum + a[i][j] * x[j];
    new_x[i] = (b[i] - sum) / a[i][i];
    broadcast_receive(&new_x[i]);
    global_barrier();
  }
broadcast_receive() is used here (1) to send the newly computed value of x[i] from process Pi to every other process and (2) to collect the values broadcast by the other processes to process Pi. A simple alternative is to use individual send()/recv() routines.

6.36 A New Message-Passing Operation: Allgather
Allgather broadcasts and gathers values in one composite construction. Note: MPI_Allgather() also acts as a global barrier, so there is no need to add global_barrier().
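A hedged sketch of one Jacobi sweep built around MPI_Allgather(), assuming one unknown per process whose index i equals the process rank (a real code would block several unknowns per process); a_row is row i of the matrix:

#include <mpi.h>

/* One Jacobi iteration for process i (== rank), gathering the updated vector. */
void jacobi_sweep(const double *a_row, double b_i, double *x, int n, int i, MPI_Comm comm)
{
    int j;
    double sum = -a_row[i] * x[i];
    double new_xi;

    for (j = 0; j < n; j++)
        sum += a_row[j] * x[j];
    new_xi = (b_i - sum) / a_row[i];

    /* every process contributes its new value and receives everyone else's;
       x[] holds the complete updated vector on every process afterwards */
    MPI_Allgather(&new_xi, 1, MPI_DOUBLE, x, 1, MPI_DOUBLE, comm);
}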

6.37 Parallel Code (cont'd)
A version where the termination condition is included:
  x[i] = b[i];
  iter = 0;
  do {
    iter++;
    sum = -a[i][i] * x[i];
    for (j = 0; j < n; j++)
      sum = sum + a[i][j] * x[j];
    new_x[i] = (b[i] - sum) / a[i][i];
    broadcast_receive(&new_x[i]);
  } while (tolerance() && (iter < limit));

6.38 Partitioning
Usually the number of processors is smaller than the number of unknowns, so each process is responsible for computing a group of unknowns. One may use:
- block allocation: allocate groups of consecutive unknowns to processors in increasing order;
- cyclic allocation: processors are allocated one unknown at a time in order, i.e., processor P0 is allocated x0, xp, x2p, …, x((n/p)-1)p, processor P1 is allocated x1, xp+1, x2p+1, …, x((n/p)-1)p+1, and so on.
Cyclic allocation is disadvantageous here because many such complex indices have to be computed; see the index sketch below.
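A small sketch of the two allocations for a process with rank 'rank' out of p processes, assuming n is divisible by p (the helper names are assumptions):

/* Block allocation: process 'rank' handles the consecutive unknowns
   x[rank*(n/p)] .. x[(rank+1)*(n/p) - 1]. */
int block_low(int rank, int p, int n)  { return rank * (n / p); }
int block_high(int rank, int p, int n) { return (rank + 1) * (n / p) - 1; }

/* Cyclic allocation: process 'rank' handles x[rank], x[rank+p], x[rank+2p], ... */
void cyclic_indices(int rank, int p, int n, int *idx /* length n/p */)
{
    int k;
    for (k = 0; k < n / p; k++)
        idx[k] = rank + k * p;
}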

6.39 Analysis
Considering the sequential program, within each iteration there are two loops, one nested inside the other. Assuming τ iterations, the sequential time is on the order of τ·n·2n operations (about 2n operations for each of the n unknowns per iteration), which has complexity O(n²) if there is a constant number of iterations. The parallel execution time is the time taken by one processor, assuming all processors execute the same number of iterations. Each iteration has a computational phase and a broadcast communication phase.

6.40 Analysis (cont'd)
Suppose there are n equations and p processors; each processor then handles n/p unknowns. With τ iterations, the computation and communication times are approximately
  tcomp = (n/p)(2n + 4) τ
  tcomm = p (tstartup + (n/p) tdata) τ = (p tstartup + n tdata) τ
so the overall execution time is approximately
  tp = ((n/p)(2n + 4) + p tstartup + n tdata) τ
With tstartup = 10000 and tdata = 50, plotting this for various p shows a minimum at a moderate number of processors: beyond that point the p·tstartup term grows faster than the computation term shrinks.

6.41 Locally Synchronous Computation: Heat Distribution Problem
Suppose a square metal sheet has known temperatures along each of its edges. The temperature of the interior of the sheet will depend upon the temperatures around it. The temperature distribution can be found by dividing the area into a fine mesh of points hi,j. The temperature at an inside point is taken to be the average of the temperatures of the four neighbouring points (it is convenient to describe the edges by points as well). The temperature of each point is found by iterating the equation
  hi,j = (hi-1,j + hi+1,j + hi,j-1 + hi,j+1) / 4    (0 < i < n, 0 < j < n)
Iterations are repeated for a fixed number of times, or until the difference between iterations is less than some very small amount.

6.42 Heat Distribution Problem (figure)

6.43 Connection with a System of Linear Equations
For convenience, the points are numbered from 1 to m². Each point will have an associated equation:
  xi = (xi-m + xi-1 + xi+1 + xi+m) / 4
which could be written as a linear equation containing the unknowns xi-m, xi-1, xi+1, and xi+m:
  xi-m + xi-1 − 4xi + xi+1 + xi+m = 0
This is the finite difference system associated with the Laplace equation.

6.44 Natural Ordering of the Heat Distribution Problem (figure)

6.45 Sequential Code
Assume the double-index numbering of the points, and suppose that the temperature of each point is held in an array element h[i][j], with the boundary points h[0][x], h[x][0], h[n][x], and h[x][n] (0 <= x <= n) initialized to the edge temperatures. Then the sequential code could be:
  for (iter = 0; iter < limit; iter++) {
    for (i = 1; i < n; i++)
      for (j = 1; j < n; j++)
        g[i][j] = 0.25 * (h[i-1][j] + h[i+1][j] + h[i][j-1] + h[i][j+1]);
    for (i = 1; i < n; i++)
      for (j = 1; j < n; j++)
        h[i][j] = g[i][j];
  }

6.46 Sequential Code with Termination Condition
  do {
    for (i = 1; i < n; i++)
      for (j = 1; j < n; j++)
        g[i][j] = 0.25 * (h[i-1][j] + h[i+1][j] + h[i][j-1] + h[i][j+1]);
    for (i = 1; i < n; i++)
      for (j = 1; j < n; j++)
        h[i][j] = g[i][j];
    cont = FALSE;                /* 'continue' is a C keyword, so a flag variable is used */
    for (i = 1; i < n; i++)
      for (j = 1; j < n; j++)
        if (!converged(i,j)) {
          cont = TRUE;
          break;
        }
  } while (cont == TRUE);

6.47 Parallel Code
A version with a fixed number of iterations and a process Pi,j for each point (except for the boundary points). Important:
- use non-blocking send()s to avoid deadlock;
- use blocking recv()s to ensure local synchronization.

6.48 Message Passing for the Heat Distribution Problem
In each iteration, process Pi,j sends its newly computed value g to its four neighbours and receives their values:
  send(&g, Pi-1,j);
  send(&g, Pi+1,j);
  send(&g, Pi,j-1);
  send(&g, Pi,j+1);
  recv(&w, Pi-1,j);
  recv(&x, Pi+1,j);
  recv(&y, Pi,j-1);
  recv(&z, Pi,j+1);
An MPI sketch of one such interior process is given below.
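A hedged MPI sketch of one interior process for the fixed-iteration version, using non-blocking sends and blocking receives as recommended above; the neighbour ranks (up, down, left, right) are assumed to have been worked out already, e.g. from a Cartesian process grid:

#include <mpi.h>

/* One interior point per process; 'g' is this point's current temperature. */
double relax_point(double g, int up, int down, int left, int right,
                   int limit, MPI_Comm comm)
{
    double w, x, y, z;            /* received neighbour temperatures */
    MPI_Request reqs[4];
    int iter;

    for (iter = 0; iter < limit; iter++) {
        /* non-blocking sends of our current value to the four neighbours */
        MPI_Isend(&g, 1, MPI_DOUBLE, up,    0, comm, &reqs[0]);
        MPI_Isend(&g, 1, MPI_DOUBLE, down,  0, comm, &reqs[1]);
        MPI_Isend(&g, 1, MPI_DOUBLE, left,  0, comm, &reqs[2]);
        MPI_Isend(&g, 1, MPI_DOUBLE, right, 0, comm, &reqs[3]);

        /* blocking receives provide the local synchronization */
        MPI_Recv(&w, 1, MPI_DOUBLE, up,    0, comm, MPI_STATUS_IGNORE);
        MPI_Recv(&x, 1, MPI_DOUBLE, down,  0, comm, MPI_STATUS_IGNORE);
        MPI_Recv(&y, 1, MPI_DOUBLE, left,  0, comm, MPI_STATUS_IGNORE);
        MPI_Recv(&z, 1, MPI_DOUBLE, right, 0, comm, MPI_STATUS_IGNORE);

        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);  /* sends of g have now completed */
        g = 0.25 * (w + x + y + z);                 /* update only after the sends used g */
    }
    return g;
}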

6.49 Message Passing for the Heat Distribution Problem (figure)

6.50 Parallel Code (cont'd)
A version where processes stop when they reach their required precision.

6.51 Parallel Code (cont'd)

6.52 Partitioning
As the number of points is usually large, more than one point has to be allocated to each process. Natural partitions are into square blocks or into strips.

6.53 Block Partition
There are four edges where data points are exchanged, so each process sends and receives across four edges per iteration. The communication time is approximately
  tcomm = 8 (tstartup + (n/√p) tdata)

6.54 Strip Partition
There are two edges where data points are exchanged, each holding a full row (or column) of n points, so the communication time is approximately
  tcomm = 4 (tstartup + n tdata)

6.55 Optimum
In general, the strip partition is best for a large startup time and the block partition is best for a small startup time. With the previous equations, the block partition has a larger communication time than the strip partition if
  tstartup > n (1 − 2/√p) tdata

6.56 Safety and Deadlock
A program in which all processes send their messages first and then receive all of their messages is "unsafe" because it relies upon buffering in the send()s. If insufficient storage is available, a send routine may be delayed from returning until storage becomes available or until the message can be sent without buffering. A locally blocking send() could then behave as a synchronous send(), only returning when the matching recv() is executed. Since a matching recv() would never be executed if all the send()s behave synchronously, deadlock would occur.

6.57 Making the Code Safe
Alternate the order of send()s and recv()s in adjacent processes so that only one of each pair performs its send()s first:
  if ((i % 2) == 0) {                 /* even-numbered processes */
    send(&g[1][1], n, Pi-1);
    recv(&h[0][1], n, Pi-1);
    send(&g[n/p][1], n, Pi+1);
    recv(&h[n/p+1][1], n, Pi+1);
  } else {                            /* odd-numbered processes */
    recv(&h[n/p+1][1], n, Pi+1);
    send(&g[n/p][1], n, Pi+1);
    recv(&h[0][1], n, Pi-1);
    send(&g[1][1], n, Pi-1);
  }
Then even synchronous send()s would not cause deadlock. A good way to test for safety is to replace the message-passing routines in a program with synchronous versions.

6.58 Partially Synchronous Computations
These are computations in which individual processes operate without needing to synchronize with other processes on every iteration. This is a good idea that can lead to better speedup: synchronizing processes is expensive and slows the computation significantly, and global synchronization achieved with barrier routines causes processors to wait, sometimes needlessly. We explore strategies for reducing synchronization in the synchronous iteration examples.

6.59 Reducing Synchronization
Performance can be increased by reducing the amount of waiting due to synchronization. One approach is to allow processes to continue with the next iteration before other processes have completed the current one. Processes moving forward would then use values computed not only from the previous iteration but possibly from earlier iterations; the method becomes an asynchronous iterative method. Two methods for reducing synchronization are therefore:
1. chaotic relaxation - an asynchronous iterative method;
2. delayed convergence.

6.60 Reducing Synchronization (cont'd)
Chaotic relaxation: each process is allowed to perform s iterations before being synchronized, and also to update the array as it goes. The array elements being used at any time may come from an earlier iteration, but only up to s iterations previously. They may be a mixture of values from different iterations, as the array is updated without synchronizing with other processes - a truly chaotic situation.
Delayed convergence: do not check for convergence at each iteration; check for convergence only after several iterations instead.

