
1 A Taste of Parallel Algorithms
A. Broumandnia, Broumandnia@gmail.com
Learn about the nature of parallel algorithms and complexity by implementing 5 building-block parallel computations on 4 simple parallel architectures (20 combinations).
Topics in This Chapter
2.1 Some Simple Computations
2.2 Some Simple Architectures
2.3 Algorithms for a Linear Array
2.4 Algorithms for a Binary Tree
2.5 Algorithms for a 2D Mesh
2.6 Algorithms with Shared Variables

2 2.1. SOME SIMPLE COMPUTATIONS
In this section, we define five fundamental building-block computations:
1. Semigroup (reduction, fan-in) computation
2. Parallel prefix computation
3. Packet routing
4. Broadcasting, and its more general version, multicasting
5. Sorting records in ascending/descending order of their keys

3 Semigroup Computation
Let ⊗ be an associative binary operator; i.e., (x ⊗ y) ⊗ z = x ⊗ (y ⊗ z) for all x, y, z ∈ S. A semigroup is simply a pair (S, ⊗), where S is a set of elements on which ⊗ is defined. Semigroup (also known as reduction or fan-in) computation is defined as: given a list of n values x0, x1, ..., xn–1, compute x0 ⊗ x1 ⊗ ... ⊗ xn–1. Common examples for the operator ⊗ include +, ×, ∧, ∨, ⊕, ∩, ∪, max, min. The operator ⊗ may or may not be commutative, i.e., it may or may not satisfy x ⊗ y = y ⊗ x (all of the above examples are commutative, but the carry computation, for example, is not). This last point is important: while a parallel algorithm can compute chunks of the expression using any partitioning scheme, the chunks must eventually be combined in left-to-right order. Figure 2.1 depicts a semigroup computation on a uniprocessor.

4 Semigroup Computation
Fig. 2.1 Semigroup computation on a uniprocessor.

5 Parallel Semigroup Computation
Semigroup computation viewed as a tree (fan-in) computation:
s = x0 ⊗ x1 ⊗ ... ⊗ xn–1, computed in ⌈log2 n⌉ levels.
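The fan-in view above can be sketched as a short simulation. Below is a minimal Python sketch (not from the slides; the name tree_reduce and the choice of + as the operator are our own) that combines adjacent pairs level by level and reports the ⌈log2 n⌉ depth.

```python
from operator import add

def tree_reduce(values, op=add):
    """Semigroup (reduction) computation viewed as a balanced fan-in tree.

    Each pass combines adjacent pairs, modeling one level of the tree;
    an associative operator needs ceil(log2(n)) such levels.
    """
    level = list(values)
    levels = 0
    while len(level) > 1:
        # Combine neighbors pairwise; a leftover odd element is carried up unchanged.
        level = [op(level[i], level[i + 1]) if i + 1 < len(level) else level[i]
                 for i in range(0, len(level), 2)]
        levels += 1
    return level[0], levels

# Example: s = x0 + x1 + ... + x8 for nine values takes ceil(log2 9) = 4 levels.
print(tree_reduce(range(9)))   # (36, 4)
```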

6 Parallel Prefix Computation
With the same assumptions as in the preceding paragraph, a parallel prefix computation is defined as simultaneously evaluating all prefixes of the expression x0 ⊗ x1 ⊗ ... ⊗ xn–1; i.e., x0, x0 ⊗ x1, x0 ⊗ x1 ⊗ x2, ..., x0 ⊗ x1 ⊗ ... ⊗ xn–1. Note that the ith prefix expression is si = x0 ⊗ x1 ⊗ ... ⊗ xi. The comment about commutativity, or lack thereof, of the binary operator ⊗ applies here as well. The graph representing the prefix computation on a uniprocessor is similar to Fig. 2.1, but with the intermediate values also output.

7 Parallel Prefix Computation
Prefix computation on a uniprocessor: s = x0 ⊗ x1 ⊗ x2 ⊗ ... ⊗ xn–1.
The parallel version is much trickier than that of the semigroup computation and requires a minimum of ⌈log2 n⌉ levels.
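As a rough illustration of a log-depth prefix computation, here is a hedged sketch of a data-parallel scan (a Hillis–Steele-style schedule; one of several possible schemes, not necessarily the one used in the text). The sample inputs are chosen to reproduce the prefix sums quoted later for Fig. 2.7.

```python
from operator import add

def parallel_prefix(x, op=add):
    """Inclusive prefix computation s_i = x0 op x1 op ... op xi.

    Simulates a data-parallel scan: in round k every position i >= 2**k
    combines the value 2**k places to its left, so all n prefixes finish
    in ceil(log2 n) rounds.  Left-to-right combining order is preserved,
    so non-commutative operators are handled correctly.
    """
    s = list(x)
    k = 1
    while k < len(s):
        s = [op(s[i - k], s[i]) if i >= k else s[i] for i in range(len(s))]
        k *= 2
    return s

print(parallel_prefix([5, 2, 8, 6, 3, 7, 9, 1]))
# [5, 7, 15, 21, 24, 31, 40, 41]
```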

8 Routing
A packet of information resides at Processor i and must be sent to Processor j. The problem is to route the packet through intermediate processors, if needed, such that it gets to the destination as quickly as possible. The problem becomes more challenging when multiple packets reside at different processors, each with its own destination. In this case, the packet routes may interfere with one another as they go through common intermediate processors. When each processor has at most one packet to send and one packet to receive, the packet routing problem is called one-to-one communication or 1–1 routing.

9 Broadcasting
Given a value a known at a certain processor i, disseminate it to all p processors as quickly as possible, so that at the end, every processor has access to, or "knows," the value. This is sometimes referred to as one-to-all communication. The more general case of this operation, i.e., one-to-many communication, is known as multicasting. From a programming viewpoint, we make the assignments xj := a for 1 ≤ j ≤ p (broadcasting) or for j ∈ G (multicasting), where G is the multicast group and xj is a local variable in processor j.

10 Sorting
Rather than sorting a set of records, each with a key and data elements, we focus on sorting a set of keys for simplicity. Our sorting problem is thus defined as: given a list of n keys x0, x1, ..., xn–1, and a total order ≤ on key values, rearrange the n keys as xi0, xi1, ..., xin–1, such that xi0 ≤ xi1 ≤ ... ≤ xin–1. We consider only sorting the keys in nondescending order. Any algorithm for sorting values in nondescending order can be converted, in a straightforward manner, to one for sorting the keys in nonascending order or for sorting records.

11 2.2. SOME SIMPLE ARCHITECTURES
In this section, we define four simple parallel architectures:
1. Linear array of processors
2. Binary tree of processors
3. Two-dimensional mesh of processors
4. Multiple processors with shared variables

12 Linear Array

13 Linear Array
Fig. 2.2 A linear array of nine processors and its ring variant.
Max node degree d = 2
Network diameter D = p – 1 (ring: ⌊p/2⌋)
Bisection width B = 1 (ring: 2)

14 Binary Tree

15 Binary Tree
Fig. 2.3 A balanced (but incomplete) binary tree of nine processors.
Complete binary tree: 2^q – 1 nodes, 2^(q–1) leaves
Balanced binary tree: leaf levels differ by at most 1
Max node degree d = 3
Network diameter D = 2⌊log2 p⌋ (± 1)
Bisection width B = 1

16 2D Mesh

17 2D Mesh
Fig. 2.4 2D mesh of 9 processors and its torus variant.
Max node degree d = 4
Network diameter D = 2√p – 2 (torus: √p)
Bisection width B ≈ √p (torus: 2√p)

18 Shared Memory
A shared-memory multiprocessor can be modeled as a complete graph, in which every node is connected to every other node, as shown in Fig. 2.5 for p = 9. In the 2D mesh of Fig. 2.4, Processor 0 can send/receive data directly to/from P1 and P3. However, it has to go through an intermediary to send/receive data to/from P4, say. In a shared-memory multiprocessor, every piece of data is directly accessible to every processor (we assume that each processor can simultaneously send/receive data over all of its p – 1 links). The diameter D = 1 of a complete graph is an indicator of this direct access. The node degree d = p – 1, on the other hand, indicates that such an architecture would be quite costly to implement if no restriction is placed on data accesses.

19 Shared Memory
Fig. 2.5 A shared-variable architecture modeled as a complete graph.
Max node degree d = p – 1
Network diameter D = 1
Bisection width B = ⌊p/2⌋⌈p/2⌉
Costly to implement and not scalable, but conceptually simple and easy to program.
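The topology parameters quoted on the preceding slides can be tabulated for p = 9. The sketch below simply evaluates those formulas; the function name, and the assumption that p is a perfect square for the mesh, are ours.

```python
import math

def topology_parameters(p):
    """Degree d, diameter D, bisection width B for the four architectures,
    using the formulas quoted on the preceding slides (p is assumed to be a
    perfect square for the mesh and torus)."""
    r = int(math.isqrt(p))            # side of the sqrt(p) x sqrt(p) mesh
    return {
        "linear array": (2, p - 1, 1),
        "ring":         (2, p // 2, 2),
        "binary tree":  (3, 2 * int(math.log2(p)), 1),   # diameter within +/- 1
        "2D mesh":      (4, 2 * r - 2, r),
        "torus":        (4, r, 2 * r),                   # wraparound links help
        "shared memory (complete graph)":
                        (p - 1, 1, (p // 2) * ((p + 1) // 2)),
    }

for name, (d, D, B) in topology_parameters(9).items():
    print(f"{name:32s} d={d:2d}  D={D:2d}  B={B:2d}")
```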

20 Architecture/Algorithm Combinations
We now have five building-block computations (semigroup, parallel prefix, packet routing, broadcasting, sorting) and four architectures, giving 20 combinations. We will spend more time on the linear array and the binary tree and less time on the mesh and shared memory (studied later).

21 2.3. ALGORITHMS FOR A LINEAR ARRAY
Semigroup Computation. Let us first consider a special case of semigroup computation, namely, maximum finding. Each of the p processors holds a value initially, and our goal is for every processor to know the largest of these values. A local variable, max-thus-far, can be initialized to the processor's own data value. In each step, a processor sends its max-thus-far value to its two neighbors. Each processor, on receiving values from its left and right neighbors, sets its max-thus-far value to the largest of the three values, i.e., max(left, own, right). Figure 2.6 depicts the execution of this algorithm for p = 9 processors. The dotted lines in Fig. 2.6 show how the maximum value propagates from P6 to all other processors. Had there been two maximum values, say in P2 and P6, the propagation would have been faster. In the worst case, p – 1 communication steps (each involving sending a processor's value to both neighbors), and the same number of three-way comparison steps, are needed. This is the best one can hope for, given that the diameter of a p-processor linear array is D = p – 1 (diameter-based lower bound).

22 Semigroup Computation
Fig. 2.6 Maximum-finding on a linear array of nine processors.
For general semigroup computation:
Phase 1: The partial result is propagated from left to right.
Phase 2: The result obtained by processor p – 1 is broadcast leftward.
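A minimal simulation of the maximum-finding algorithm, assuming synchronous steps; the data values are hypothetical (not those of Fig. 2.6) and the function name is ours.

```python
def linear_array_max(values):
    """Simulate maximum finding on a linear array.

    Every processor keeps a max-thus-far value; in each synchronous step it
    sends that value to both neighbors and then takes max(left, own, right).
    At most p - 1 steps are needed (the array's diameter).
    """
    m = list(values)                 # max-thus-far, initialized to own value
    steps = 0
    while len(set(m)) > 1:           # until every processor knows the maximum
        left = [None] + m[:-1]       # value arriving from the left neighbor
        right = m[1:] + [None]       # value arriving from the right neighbor
        m = [max(v for v in (l, o, r) if v is not None)
             for l, o, r in zip(left, m, right)]
        steps += 1
    return m, steps

values = [2, 8, 5, 1, 9, 3, 7, 4, 6]     # hypothetical data; P4 holds the max
print(linear_array_max(values))          # ([9, 9, 9, 9, 9, 9, 9, 9, 9], 4)
```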

23 Parallel Prefix Computation
Let us assume that we want the ith prefix result to be obtained at the ith processor, 0 ≤ i ≤ p – 1. The general semigroup algorithm described in the preceding paragraph in fact performs a semigroup computation first and then broadcasts the final value to all processors. Thus, we already have an algorithm for parallel prefix computation that takes p – 1 communication/combining steps. A variant of the parallel prefix computation, in which Processor i ends up with the prefix result up to the (i – 1)th value, is sometimes useful. This diminished prefix computation can be performed just as easily if each processor holds onto the value received from the left rather than the one it sends to the right. The diminished prefix sum results for the example of Fig. 2.7 would be 0, 5, 7, 15, 21, 24, 31, 40, 41.

24 Parallel Prefix Computation
Fig. 2.7 Computing prefix sums on a linear array of nine processors.
Diminished parallel prefix computation: the ith processor obtains the result up to element i – 1.
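A sketch of the left-to-right prefix sweep, returning both the ordinary and the diminished prefix sums. The inputs are inferred from the diminished prefix sums quoted above; the last input value (4) is an assumption, since it does not affect the diminished results.

```python
def linear_array_prefix(x):
    """Left-to-right prefix sweep on a linear array.

    Processor 0 starts the wave; processor i receives the running total from
    its left neighbor (its diminished prefix), adds its own value, and passes
    the result to the right.  p - 1 communication/combining steps in all.
    """
    prefix, diminished = [], []
    running = 0
    for xi in x:                     # one iteration = one processor's turn
        diminished.append(running)   # value received from the left
        running = running + xi       # value sent on to the right
        prefix.append(running)
    return prefix, diminished

# Inputs consistent with the diminished prefix sums 0, 5, 7, 15, 21, 24, 31,
# 40, 41 quoted above (the final input, 4, is an assumption).
x = [5, 2, 8, 6, 3, 7, 9, 1, 4]
print(linear_array_prefix(x))
```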

25 Parallel Prefix Computation
Thus far, we have assumed that each processor holds a single data item. Extension of the semigroup and parallel prefix algorithms to the case where each processor initially holds several data items is straightforward. Figure 2.8 shows a parallel prefix sum computation with each processor initially holding two data items. The algorithm consists of each processor doing a prefix computation on its own data set of size n/p (this takes n/p – 1 combining steps), then doing a diminished parallel prefix computation on the linear array as above (p – 1 communication/combining steps), and finally combining the local prefix result from this last computation with the locally computed prefixes (n/p combining steps). In all, 2n/p + p – 2 combining steps and p – 1 communication steps are required.

26 Parallel Prefix Computation
Fig. 2.8 Computing prefix sums on a linear array with two items per processor.
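A sketch of the three-phase algorithm just described (local prefix, diminished prefix over the block totals, final combine); the function name and the assumption that p divides n are ours.

```python
def prefix_sums_blocked(x, p):
    """Prefix sums with n/p items per processor (three-phase algorithm).

    Phase 1: each processor computes prefix sums over its own block.
    Phase 2: a diminished prefix over the block totals gives each processor
             the sum of everything to its left.
    Phase 3: that offset is added to every local prefix.
    """
    n = len(x)
    assert n % p == 0, "assume p divides n for simplicity"
    blocks = [x[i * (n // p):(i + 1) * (n // p)] for i in range(p)]

    # Phase 1: local prefix sums (n/p - 1 combining steps per processor).
    local = []
    for blk in blocks:
        acc, pref = 0, []
        for v in blk:
            acc += v
            pref.append(acc)
        local.append(pref)

    # Phase 2: diminished prefix over block totals (p - 1 steps on the array).
    offsets, running = [], 0
    for pref in local:
        offsets.append(running)
        running += pref[-1]

    # Phase 3: combine (n/p combining steps per processor).
    return [off + v for off, pref in zip(offsets, local) for v in pref]

print(prefix_sums_blocked([5, 2, 8, 6, 3, 7, 9, 1], 4))
# [5, 7, 15, 21, 24, 31, 40, 41]
```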

27 Packet Routing
To send a packet of information from Processor i to Processor j on a linear array, we simply attach a routing tag with the value j – i to it. The sign of a routing tag determines the direction in which it should move (+ = right, – = left), while its magnitude indicates the action to be performed (0 = remove the packet, nonzero = forward the packet). With each forwarding, the magnitude of the routing tag is decremented by 1. Multiple packets originating at different processors can flow rightward and leftward in lockstep, without ever interfering with each other.
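A minimal sketch of the signed routing tag: the sign gives the direction and the magnitude the remaining distance; the function name is hypothetical.

```python
def route_on_linear_array(src, dst):
    """Route a packet from processor src to processor dst using a signed tag.

    The tag j - i gives direction (sign) and remaining distance (magnitude);
    each forwarding step decrements the magnitude by 1, and the packet is
    removed when the tag reaches zero.
    """
    tag = dst - src
    position, hops = src, []
    while tag != 0:
        step = 1 if tag > 0 else -1   # + means move right, - means move left
        position += step
        tag -= step                   # magnitude shrinks by 1 per hop
        hops.append(position)
    return hops

print(route_on_linear_array(2, 6))   # [3, 4, 5, 6]
print(route_on_linear_array(6, 2))   # [5, 4, 3, 2]
```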

28 Broadcasting
If Processor i wants to broadcast a value a to all processors, it sends an rbcast(a) (read r-broadcast) message to its right neighbor and an lbcast(a) message to its left neighbor. Any processor receiving an rbcast(a) message simply copies the value a and forwards the message to its right neighbor (if any). Similarly, receiving an lbcast(a) message causes a to be copied locally and the message forwarded to the left neighbor. The worst-case number of communication steps for broadcasting is p – 1.
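A small sketch of the lbcast/rbcast wavefronts, counting the steps until every processor holds the value (worst case p – 1 when the source is an end processor); names are ours.

```python
def broadcast(p, source, a):
    """Simulate lbcast/rbcast on a p-processor linear array.

    x[j] is processor j's local copy; each step, the rbcast message advances
    one processor to the right and the lbcast message one to the left, so the
    number of steps equals the distance to the farther end (at most p - 1).
    """
    x = [None] * p
    x[source] = a
    steps = 0
    left, right = source, source
    while left > 0 or right < p - 1:
        left, right = max(left - 1, 0), min(right + 1, p - 1)
        x[left], x[right] = a, a     # copy locally and forward to the next neighbor
        steps += 1
    return x, steps

print(broadcast(9, 3, "a"))   # (['a', 'a', ..., 'a'], 5)
print(broadcast(9, 0, "a"))   # worst case: p - 1 = 8 steps
```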

29 Linear Array Routing and Broadcasting
Routing and broadcasting on a linear array of nine processors.
To route from processor i to processor j: compute j – i to determine distance and direction.
To broadcast from processor i: send a left-moving and a right-moving broadcast message.

30 Sorting
We consider two versions of sorting on a linear array: with and without I/O. Figure 2.9 depicts a linear-array sorting algorithm when p keys are input, one at a time, from the left end. Each processor, on receiving a key value from the left, compares the received value with the value stored in its local register (initially, all local registers hold the value +∞). The smaller of the two values is kept in the local register and the larger value is passed on to the right. Once all p inputs have been received, we must allow p – 1 additional communication cycles for the key values that are in transit to settle into their respective positions in the linear array. If the sorted list is to be output from the left, the output phase can start immediately after the last key value has been received. In this case, an array half the size of the input list would be adequate and we effectively have zero-time sorting, i.e., the total sorting time is equal to the I/O time.

31 Linear Array Sorting (Externally Supplied Keys)
Fig. 2.9 Sorting on a linear array with the keys input sequentially from the left.
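A cycle-by-cycle sketch of the keep-the-smaller, pass-the-larger behavior, assuming one external key enters processor 0 per cycle; the key values are hypothetical. The cycle count it reports illustrates the p input cycles plus p – 1 settling cycles mentioned above.

```python
import math

def systolic_sort(keys):
    """Sort on a linear array with the keys fed in from the left, one per cycle.

    Each cycle, a processor compares the value arriving from its left with the
    value in its register (initially +inf), keeps the smaller of the two, and
    emits the larger toward its right neighbor for the next cycle.
    """
    p = len(keys)
    store = [math.inf] * p
    incoming = [None] * p            # incoming[i]: value arriving at processor i
    inputs = list(keys)
    cycles = 0
    while inputs or any(v is not None for v in incoming):
        incoming[0] = inputs.pop(0) if inputs else None
        nxt = [None] * p
        for i, a in enumerate(incoming):
            if a is None:
                continue
            if a < store[i]:
                store[i], a = a, store[i]      # keep the smaller value
            if a != math.inf and i + 1 < p:
                nxt[i + 1] = a                 # pass the larger value rightward
        incoming = nxt
        cycles += 1
    return store, cycles

print(systolic_sort([5, 2, 8, 6, 3, 7, 9, 1, 4]))
# ([1, 2, 3, 4, 5, 6, 7, 8, 9], 17)   i.e., p inputs + (p - 1) settling cycles
```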

32 Sorting
If the key values are already in place, one per processor, then an algorithm known as odd–even transposition can be used for sorting. A total of p steps are required. In an odd-numbered step, odd-numbered processors compare values with their even-numbered right neighbors. The two processors exchange their values if they are out of order. Similarly, in an even-numbered step, even-numbered processors compare–exchange values with their right neighbors (see Fig. 2.10). In the worst case, the largest key value resides in Processor 0 and must move all the way to the other end of the array. This needs p – 1 right moves. One step must be added because no movement occurs in the first step. Of course, one could use even–odd transposition, but this would not affect the worst-case time complexity of the algorithm for our nine-processor linear array.

33 Linear Array Sorting (Internally Stored Keys)
Fig. 2.10 Odd-even transposition sort on a linear array.
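A sketch of odd–even transposition sort as described: odd-numbered steps pair processors (1,2), (3,4), ... and even-numbered steps pair (0,1), (2,3), ...; the sample keys are hypothetical.

```python
def odd_even_transposition_sort(a):
    """Odd-even transposition sort on a linear array (one key per processor).

    Steps 1, 3, 5, ...: processors (1,2), (3,4), ... compare-exchange.
    Steps 2, 4, 6, ...: processors (0,1), (2,3), ... compare-exchange.
    p steps always suffice.
    """
    a = list(a)
    p = len(a)
    for step in range(1, p + 1):
        start = 1 if step % 2 == 1 else 0      # odd step pairs start at processor 1
        for i in range(start, p - 1, 2):
            if a[i] > a[i + 1]:                # exchange if out of order
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([5, 2, 8, 6, 3, 7, 9, 1, 4]))
# [1, 2, 3, 4, 5, 6, 7, 8, 9]
```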

34 Performance Evaluation
Note that the odd–even transposition algorithm uses p processors to sort p keys in p compare–exchange steps. How good is this algorithm? Let us evaluate it with respect to the various measures introduced in Section 1.6. The best sequential sorting algorithms take on the order of p log p compare–exchange steps to sort a list of size p. Let us assume, for simplicity, that they take exactly p log2 p steps. Then we have T(1) = W(1) = p log2 p, T(p) = p, W(p) ≈ p^2/2, S(p) = log2 p (Minsky's conjecture?), E(p) = (log2 p)/p, R(p) = p/(2 log2 p), U(p) ≈ 1/2, and Q(p) ≈ 2(log2 p)^3/p^2.
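As a worked check, the quoted values follow from the Section 1.6 measures as we understand them (speedup, efficiency, redundancy, utilization, quality); a minimal sketch of the derivation:

```latex
% Assumes T(1) = W(1) = p log2 p, T(p) = p, W(p) ~ p^2/2,
% and the usual definitions of S, E, R, U, Q.
\begin{align*}
S(p) &= \frac{T(1)}{T(p)} = \frac{p\log_2 p}{p} = \log_2 p \\
E(p) &= \frac{S(p)}{p} = \frac{\log_2 p}{p} \\
R(p) &= \frac{W(p)}{W(1)} = \frac{p^2/2}{p\log_2 p} = \frac{p}{2\log_2 p} \\
U(p) &= \frac{W(p)}{p\,T(p)} = \frac{p^2/2}{p^2} = \frac{1}{2} \\
Q(p) &= \frac{T^3(1)}{p\,T^2(p)\,W(p)}
      = \frac{(p\log_2 p)^3}{p\cdot p^2\cdot p^2/2}
      = \frac{2(\log_2 p)^3}{p^2}
\end{align*}
```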

35 Odd–Even Transposition Algorithm
In most practical situations, the number n of keys to be sorted (the problem size) is greater than the number p of processors (the machine size). The odd–even transposition sort algorithm with n/p keys per processor is as follows. First, each processor sorts its list of size n/p using any efficient sequential sorting algorithm. Let us say this takes (n/p) log2(n/p) compare–exchange steps. Next, the odd–even transposition sort is performed as before, except that each compare–exchange step is replaced by a merge–split step in which the two communicating processors merge their sublists of size n/p into a single sorted list of size 2n/p and then split the list down the middle, one processor keeping the smaller half and the other the larger half. For example, if P0 is holding (1, 3, 7, 8) and P1 has (2, 4, 5, 9), a merge–split step will turn the lists into (1, 2, 3, 4) and (5, 7, 8, 9), respectively. Because the sublists are sorted, the merge–split step requires n/p compare–exchange steps. Thus, the total time of the algorithm is (n/p) log2(n/p) + n. Note that the first term (local sorting) will be dominant if p < log2 n. For p ≥ log2 n, the time complexity of the algorithm is linear in n; hence, the algorithm is more efficient than the one-key-per-processor version.
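A sketch of the merge–split variant, assuming p divides n; heapq.merge stands in for the linear-time merge of two sorted sublists, and the sample data reproduces the P0/P1 example above.

```python
import heapq

def parallel_sort_blocked(keys, p):
    """Odd-even transposition sort with n/p keys per processor (merge-split).

    Phase 1: each processor sorts its own block of n/p keys locally.
    Phase 2: p merge-split steps; in each, paired neighbors merge their sorted
             blocks and split the result, the left processor keeping the
             smaller half and the right processor the larger half.
    """
    n = len(keys)
    assert n % p == 0, "assume p divides n for simplicity"
    blocks = [sorted(keys[i * (n // p):(i + 1) * (n // p)]) for i in range(p)]

    def merge_split(left, right):
        merged = list(heapq.merge(left, right))    # both inputs already sorted
        return merged[:len(left)], merged[len(left):]

    for step in range(1, p + 1):
        start = 1 if step % 2 == 1 else 0          # odd step: pairs (1,2), (3,4), ...
        for i in range(start, p - 1, 2):
            blocks[i], blocks[i + 1] = merge_split(blocks[i], blocks[i + 1])
    return [k for blk in blocks for k in blk]

# P0 starts with sorted (1, 3, 7, 8) and P1 with (2, 4, 5, 9); one merge-split
# turns them into (1, 2, 3, 4) and (5, 7, 8, 9), as in the example above.
print(parallel_sort_blocked([8, 3, 1, 7, 9, 5, 4, 2], 2))
# [1, 2, 3, 4, 5, 7, 8, 9]
```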

