Download presentation

Presentation is loading. Please wait.

Published byZoie Brayfield Modified about 1 year ago

1
Lecture # 3: A Primer on How to Design Parallel Algorithms Shantanu Dutt University of Illinois at Chicago

2
Parallel Algorithm Design—A Process/Data View Break up the computation in either a data parallel or functional parallel way depending on the problem at hand Data parallel breakup: Easier to generalize than functional breakup, and more structured Input data parallel (e.g., the max and general associative computation, finite-element computations—the heat equation problem) or Output data parallel (e.g., matrix multiplication) Each data partition defines a computational partition or process that works on that partition 1 st attempt: breakup somehow, determine how to communicate to solve the problem 2 nd attempt: breakup in a way that the reqd. communication complexity (topology indep.) is min. 3 rd attempt: breakup so that commun. complexity on a particular topology is minimized I/P data: Linear or unstructured (e.g., associative functions): Easy to partition into P (N/P)-size pieces 1 2 P I/P data: Dimensional or Structured (e.g., finite-elt. grid): Can be partitioned by same dimensionality or in fewer dimensions. Choice based on resulting communication pattern and the target topology Finite elt. grid: (a) 2D (full dimensional) partition (b) 1D (reduced dimensional) partition Inter-process communication

3
Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur 3 An example parallel algorithm for a finite element computation Simple, structured and sparse communication needed. Example: Heat Equation - The initial temperature is zero on the boundaries and high in the middle The boundary temperature is held at zero. The calculation of an element is dependent upon its neighbor elements data1 data2 …... data N

4
Code from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur 4 1. find out if I am MASTER or WORKER 2. if I am MASTER 3. initialize array 4. send each WORKER starting info and subarray 5. do until all WORKERS converge 6. gather from all WORKERS convergence data 7. broadcast to all WORKERS convergence signal 8. end do 9. receive results from each WORKER 1. else if I am WORKER 2. receive from MASTER starting info and subarray 3. do until solution converged { 4. update time 5. send (non-blocking?) neighbors my border info 6. receive (non-blocking?) neighbors border info 7. update interior of my portion of solution array (see comput. given in the serial code) 8. wait for non-block. commun. (if any) to complete 14. update border of my portion of solution array 15. determine if my solution has converged 16. if so {send MASTER convergence signal 17. recv. from MASTER convergence signal} 18. end do } 19. send MASTER results 20. endif Serial Code – repeat do y=2, N-1 do x=2, M-1 u2(x,y)=u1(x,y)+cx*[u1(x+1,y) + u1(x-1,y)] + cy*[u1(x,y+1)} + u1(x,y-1)] /* cx, cy are const. enddo u1 = u2; until u1 ~ u2. Master (can be one of the workers) Workers Problem Grid

5
Parallel Algorithm Design—A Process/Data View: Dimensional Partitioning Finite elt. grid: (a) 2D (full dimensional) partition (b) 1D (reduced dimensional) partition Inter-process communication A full dimensional partition will have the smallest circumference (amount of data that needs to be communicated to neighboring processes) to area (total data elements per process = N/P) ratio, and thus has the least topology-independent communication cost A 2D partition (Fig. a) of a 2D grid of nxn points, gives us [n/sqrt(P)]x [n/sqrt(P)] partitions, where n=sqrt(N). Thus total data that needs to be communicated (per iteration of a finite element computation) is 2* [n/sqrt(P)] (for corner partitions) to 4* [n/sqrt(P)] (for interior partitions). A 1D partition (Fig. b) of a 2D grid of nxn points, gives us nx(n/P) partitions, requiring a data amount of n to 2n to be communicated which is about sqrt(P)/2 times more data than for a 2D partition. However, if one considers the topology of the target parallel system, a 1D partition is suitable for both a linear and 2D topology, while a 2D partition is suitable for a 2D topology but is expensive for a linear topology in terms of average message hops (and thus also in terms of more contention/conflict)

6
Fundamentals of Parallel Processing, Ashish Agrawal, IIT Kanpur 6 An example parallel algorithm for a finite element computation: Analysis Analysis: Let L be the max dist. of a grid point from the heat source. The # of iterations for convergence will be a function f(L) of L. Assume linear partition of i/p data and near-neighbor communication only (possible for most topologies for linear data partitioning). Per iteration computation time per process = (N/P). Comm. time per iter. per process = (sqrt(N)). All computations in an iteration occur in parallel (there is no data dependency among the computations in each process for the same iteration). Commun. also occurs completely in parallel in an iteration. Also, near-neighbor no contention. So parallel time per iteration is (N/P + sqrt(N)) [is sometimes also written as (max(N/P, sqrt(N))]. Number of parallel iterations? Parallel Time Tp(P) = Seq. time Tseq = Tp(1) = Speedup Sp(P) = Efficiency Ep(P) = data1 data2 …... data N

7
Parallel Algorithm Design—A Process/Data View: Mapping to a Multiprocessor Finite elt. grid: (a) 2D (full dimensional) partition (b) 1D (reduced dimensional) partition Processes mapped to data (& thus comput.) 4-hop messages (c) Communication patterns when mapping a 2D data partition (orange) and a 1D data partition (blue) to a linear array. Assumptions: Total grid points: N in a sqrt(N)xsqrt(N) arrangement Commun. time = t s + k*t h + m*t w, where t s is the startup time at source and target procs, k = # of hops, t h the decision time per switch, m = msg. size in “flits” (# of bits that can be transmitted in parallel), t w is propagation delay per flit. t s does not affect congestion, so for congestion/delay purposes we approx. commun. time here by k*t h + m*t w Analysis: 1D data partition & linear topology: All commun. is near neighbor and hence parallel per iteration and w/o contention: Parallel commun. time per iteration in a finite-element computation = t h + sqrt(N)*t w ~ ( sqrt(N)) for large N Row major order is used to map 2D data partition to a 1D processor array

8
Parallel Algorithm Design—A Process/Data View: Mapping to a Multiprocessor Finite elt. grid: (a) 2D (full dimensional) partition (b) 1D (reduced dimensional) partition Processes mapped to data (& thus comput.) 4-hop messages (c) Communication patterns when mapping a 2D data partition (orange) and a 1D data partition (blue) to a linear array Analysis: 2D data partition & linear topology: Near neighbor commun. part is parallel and w/o contentions. So parallel commun. time per iteration of FE computation= t h + (sqrt(N)/2)*t w = (sqrt(N)/2), for large N. 2 nd dim. communication is for a distance of 4 hops (generally about sqrt(P)-hops): Msg. from 4 8 will take time 4* t h + (sqrt(N)/4)*t w ~ (sqrt(N)/4) for large N. Assuming cut- through/wormhole routing in which an entire path is first reserved and then the data passes through via multiple interconnects as if it is a single interconnect, msgs from 3 7, 2 6, 1 5 are blocked until 4 8 communication is completed. After this is finished, say, 3 7 comm. occurs, while the others are blocked. So commun. is sequentialized for 4-hop messages along an intersecting route. So commun. along dim. 2 of the data is not fully parallel (they are started in parallel but get sequentialized due to contention) and takes a total parallel time of 4(4* t h + (sqrt(N)/4)*t w ) ~ (sqrt(N) for large N, similar in order notation to 1D partiton, but will actually take more time in terms of the t h terms (16* t h vs. t h ). Total parallel commun. time is (1.5 sqrt(N)), worse than for 1D partitioning. What would this be for a general P? Row major order is used to map 2D data partition to a 1D processor array

9
Finite elt. grid: (a) 2D (full dimensional) partition (b) 1D (reduced dimensional) partition Processes mapped to data (& thus comput.) (c) Communication patterns when mapping a 2D data partition (orange) and a 1D data partition (blue), & corresponding processes to a 2D array/mesh. The latter is equivalent to embedding a linear array onto a 2D array to minimize average dilation (the factor by which the length of a near-neighbor commun. or link increases in the host/target topology Analysis: 1D data partition & 2D topology: All commun. is near neighborr and hence parallel per iteration and w/o contention: Parallel commun. time per iteration in a finite-element computation = t h + sqrt(N)*t w ~ (sqrt(N)) for large N Analysis: 2D data partition & 2D topology: All commun. is near neighborr and hence w/o contention. However, each processor has two communication steps that are partly sequentialized: Worst-case: Parallel commun. time per iteration in a FE computation = t h + (sqrt(N)/2)*t w + t h + (sqrt(N)/4)*t w ~ (0.75*sqrt(N)) for large N Best (close to Average) case: The bulk of both commun. which is data transfer takes place in parallel but the startup time (ts) is sequentialized. So parallel commun. time per iteration = max( t h + (sqrt(N)/2)*t w, t h + (sqrt(N)/4)*t w ) ~ (0.5*sqrt(N)) for large N. For general P? Parallel Algorithm Design—A Process/Data View: Mapping to a Multiprocessor

10
X = A B C Output data partitioning: E.g., Matrix multiplication Each process computes a block of data of C (one or more Cij’s) Issues of input data distribution among the processes: If to compute elt. Cij, row Ai and col B j are available at the process then no communication needed to move input data to the right processes If each process computing a block of C has similar blocks of A, B data, then data needs to be moved around the processes so that the one computing Cij has Ai and B j available. This will require partial all-to-all broadcasts; see below. A B Each process broadcasts/multicasts the block of A data it has to processes in its “row” and similarly the block of B data to processes in its column. After the above stage, every process computing each elt. Cij, will have row Ai and col. B j The time complexity of the above phase will depend on the interconn. topology of the multiprocessor Parallel Algorithm Design—A Process/Data View: O/P Data Partitioning

11
Parallel Algorithm Design—Strategy Summary Data partitioning: defines processes; based partly on amount & pattern of communication needed and target topology Determine communication pattern required Granularity of breakup: N/P for i/p data parallel or M/P for o/p data parallel can either be decided based on given P (straightforward) or based on overhead (i.e., P to be determined) to minimize either parallel time Tp or efficiency Ep or work efficiency WEp (see below) Map processes to processors in target topology to: reduce average message hops (try to maximize near-neighbor communication) maximize parallelization of message passing across processes minimize conflicts among message paths (conflict depends partly on type of routing “flow control”—store-and-forward or wormhole/cut-through; we assumed the latter in previous analysis) Overall Goals: minimize parallel time Tp (or speedup Sp) maximize efficiency Ep = Sp/P maximize work efficiency WEp = Sp/WSp, where WSp is total work scale-up (correlated to total power consumed) across the P processors compared to the total sequential work Ws in the computation = Wp/Ws, where Wp is the total work done across P processors (= total time across P procs) More on parallelization metrics later

12
MIMD/SPMD Coding Style Loop Do some computation; send data and/or receive data Do some more computation; send data and/or receive data etc. End Loop

Similar presentations

© 2017 SlidePlayer.com Inc.

All rights reserved.

Ads by Google