
1  5 PRAM and Basic Algorithms
Topics in This Chapter: 5.1 PRAM Submodels and Assumptions; 5.2 Data Broadcasting; 5.3 Semigroup or Fan-in Computation; 5.4 Parallel Prefix Computation; 5.5 Ranking the Elements of a Linked List; 5.6 Matrix Multiplication

2  5.1 PRAM Submodels and Assumptions
Fig. 4.6 Conceptual view of a parallel random-access machine (PRAM). Processor i can do the following in the three phases of one cycle:
1. Fetch a value from address s_i in shared memory
2. Perform computations on data held in local registers
3. Store a value into address d_i in shared memory

3  5.1 PRAM Submodels and Assumptions
Because the addresses s_i and d_i are determined by Processor i, independently of all other processors, it is possible that several processors may want to read data from the same memory location or write their values into a common location. Hence, four submodels of the PRAM model have been defined based on whether concurrent reads (writes) from (to) the same location are allowed. The four possible combinations, depicted in Fig. 5.1, are:
- EREW: Exclusive-read, exclusive-write
- ERCW: Exclusive-read, concurrent-write
- CREW: Concurrent-read, exclusive-write
- CRCW: Concurrent-read, concurrent-write

4  5.1 PRAM Submodels and Assumptions
Fig. 5.1 Submodels of the PRAM model.

5  Types of CRCW PRAM
CRCW PRAM is further classified according to how concurrent writes are handled. Here are a few example submodels based on the semantics of concurrent writes in CRCW PRAM:
- Undefined: In case of multiple writes, the value written is undefined (CRCW-U).
- Detecting: A special code representing "detected collision" is written (CRCW-D).
- Common: Multiple writes are allowed only if they all store the same value (CRCW-C). This is sometimes called the consistent-write submodel.
- Random: The value written is randomly chosen from among those offered (CRCW-R). This is sometimes called the arbitrary-write submodel.
- Priority: The processor with the lowest index succeeds in writing its value (CRCW-P).
- Max/Min: The largest/smallest of the multiple values is written (CRCW-M).
- Reduction: The arithmetic sum (CRCW-S), logical AND (CRCW-A), logical XOR (CRCW-X), or some other combination of the multiple values is written.
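For concreteness, the following Python sketch (an illustration added here, not part of the slides) shows how a single shared cell might resolve a set of concurrent writes under some of these submodels; the function name, the policy strings, and the representation of a write as a (processor_id, value) pair are all assumptions of this example.

import random

def resolve_crcw_write(writes, policy):
    """Return the value left in a shared cell after concurrent writes.
    `writes` is a list of (processor_id, value) pairs; `policy` names a submodel."""
    values = [v for _, v in writes]
    if policy == "common":        # CRCW-C: allowed only if all values agree
        assert len(set(values)) == 1, "common-write rule violated"
        return values[0]
    if policy == "arbitrary":     # CRCW-R: one of the offered values, chosen arbitrarily
        return random.choice(values)
    if policy == "priority":      # CRCW-P: lowest-indexed processor wins
        return min(writes)[1]
    if policy == "max":           # CRCW-M: largest value wins
        return max(values)
    if policy == "sum":           # CRCW-S: reduction by arithmetic sum
        return sum(values)
    if policy == "and":           # CRCW-A: reduction by logical AND
        return all(values)
    raise ValueError("unknown policy: " + policy)

# Processors 2, 0 and 5 concurrently write 7, 3 and 7 into the same cell.
writes = [(2, 7), (0, 3), (5, 7)]
print(resolve_crcw_write(writes, "priority"))  # 3 (processor 0 has the lowest index)
print(resolve_crcw_write(writes, "max"))       # 7
print(resolve_crcw_write(writes, "sum"))       # 17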

6  Power of PRAM Submodels
These submodels are all different from each other and from EREW and CREW. One way to order these submodels is by their computational power. The following relationships have been established between some of the PRAM submodels:
EREW < CREW < CRCW-D < CRCW-C < CRCW-R < CRCW-P

7  Some Elementary PRAM Computations
Initializing an n-vector (base address = B) to all 0s:
  for j = 0 to ⌈n/p⌉ – 1 processor i do
    if jp + i < n then M[B + jp + i] := 0
  endfor
Adding two n-vectors and storing the result in a third (base addresses B′, B″, B).
Convolution of two n-vectors: W_k = Σ_(i+j=k) U_i × V_j (base addresses B_W, B_U, B_V).
The n-vector is viewed as ⌈n/p⌉ segments of p elements each, with processor i handling element i of every segment.
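The initialization pseudocode above translates into the following Python sketch (added for illustration); the sequential loop over j stands in for synchronous PRAM cycles, and the inner loop over i models the p processors acting in parallel. Names such as pram_init_zero are invented for this example.

import math

def pram_init_zero(M, B, n, p):
    """Simulate the PRAM vector initialization, one 'cycle' per value of j."""
    for j in range(math.ceil(n / p)):     # sequential stand-in for PRAM steps
        for i in range(p):                # these p writes happen in parallel
            if j * p + i < n:             # guard for the last, partial segment
                M[B + j * p + i] = 0

# Example: a memory of 20 cells, a vector of length n = 10 at base B = 4, p = 4 processors.
M = list(range(20))
pram_init_zero(M, B=4, n=10, p=4)
print(M[4:14])   # [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]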

8  5.2 Data Broadcasting
Simple, or one-to-all, broadcasting is used when one processor needs to send a data value to all other processors. In the CREW or CRCW submodels, broadcasting is trivial, as the sending processor can write the data value into a memory location, with all processors reading that data value in the following machine cycle. Thus, simple broadcasting is done in Θ(1) steps. Multicasting within groups is equally simple if each processor knows its group membership(s) and only members of each group read the multicast data for that group. All-to-all broadcasting, where each of the p processors needs to send a data value to all other processors, can be done through p separate broadcast operations in Θ(p) steps, which is optimal. The above scheme is clearly inapplicable to broadcasting in the EREW model.

9  5.2 Data Broadcasting

10  5.2 Data Broadcasting
Fig. 5.2 Data broadcasting in EREW PRAM via recursive doubling.
Making p copies of B[0] by recursive doubling:
  for k = 0 to ⌈log₂ p⌉ – 1
    Proc j, 0 ≤ j < p, do
      Copy B[j] into B[j + 2^k]
  endfor

11  5.2 Data Broadcasting
EREW PRAM algorithm for broadcasting by Processor i:
  Processor i write the data value into B[0]
  s := 1
  while s < p
    Processor j, 0 ≤ j < min(s, p – s), do
      Copy B[j] into B[j + s]
    s := 2s
  endwhile
  Processor j, 0 ≤ j < p, read the data value in B[j]
Fig. 5.3 EREW PRAM data broadcasting without redundant copying.
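A minimal Python sketch of the Fig. 5.3 algorithm (added here for illustration, with the while loop standing in for synchronized PRAM steps): in each step the copies indexed by j read from cells below s and write to cells at or above s, so no two processors touch the same location and the EREW constraints are respected.

def erew_broadcast(value, p):
    """Simulate Fig. 5.3: make p copies of `value` in broadcast vector B."""
    B = [None] * p
    B[0] = value                      # Processor i writes its value into B[0]
    s = 1
    while s < p:
        # Processors j, 0 <= j < min(s, p - s), copy B[j] into B[j + s];
        # in a real EREW PRAM these copies form one parallel step.
        for j in range(min(s, p - s)):
            B[j + s] = B[j]
        s *= 2
    return B                          # every processor j now reads B[j]

print(erew_broadcast(42, 10))   # [42, 42, 42, 42, 42, 42, 42, 42, 42, 42]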

12  All-to-All Broadcasting on EREW PRAM
To perform all-to-all broadcasting, so that each processor broadcasts a value that it holds to each of the other p – 1 processors, we let Processor j write its value into B[j], rather than into B[0]. Thus, in one memory access step, all of the values to be broadcast are written into the broadcast vector B. Each processor then reads the other p – 1 values in p – 1 memory accesses. To ensure that all reads are exclusive, Processor j begins reading the values starting with B[j + 1], wrapping around to B[0] after reading B[p – 1].

13  All-to-All Broadcasting on EREW PRAM
EREW PRAM algorithm for all-to-all broadcasting:
  Processor j, 0 ≤ j < p, write own data value into B[j]
  for k = 1 to p – 1
    Processor j, 0 ≤ j < p, do
      Read the data value in B[(j + k) mod p]
  endfor
This O(p)-step algorithm is time-optimal.
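The access pattern can be simulated in Python as follows (an added sketch; the `received` matrix used to record what each processor has read is bookkeeping invented for this simulation). In round k, every processor reads a distinct cell of B, so no concurrent reads occur.

def erew_all_to_all(values):
    """Simulate all-to-all broadcasting; values[j] is processor j's own value."""
    p = len(values)
    B = list(values)                        # step 0: processor j writes B[j]
    received = [[None] * p for _ in range(p)]
    for j in range(p):
        received[j][j] = B[j]               # each processor already knows its own value
    for k in range(1, p):                   # p - 1 exclusive-read rounds
        for j in range(p):                  # conceptually parallel
            idx = (j + k) % p               # processor j reads B[(j + k) mod p]
            received[j][idx] = B[idx]
    return received

recv = erew_all_to_all([10, 20, 30, 40])
print(recv[2])   # [10, 20, 30, 40] -- processor 2 now holds every value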

14  All-to-All Broadcasting on EREW PRAM
Naive EREW PRAM sorting algorithm (using all-to-all broadcasting):
  Processor j, 0 ≤ j < p, write 0 into R[j]
  for k = 1 to p – 1
    Processor j, 0 ≤ j < p, do
      l := (j + k) mod p
      if S[l] < S[j] or (S[l] = S[j] and l < j)
        then R[j] := R[j] + 1
      endif
  endfor
  Processor j, 0 ≤ j < p, write S[j] into S[R[j]]
This O(p)-step sorting algorithm is far from optimal; sorting is possible in O(log p) time.
Because each data element must be given a unique rank, ties are broken by using the processor ID. In other words, if Processors i and j (i < j) hold equal data values, the value in Processor i is deemed smaller for ranking purposes. This sorting algorithm is not optimal in that the O(p²) computational work involved in it is significantly greater than the O(p log p) work required for sorting p elements on a single processor.
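The ranking idea translates directly into this Python sketch (added for illustration; the sequential loops stand in for the p – 1 parallel rounds):

def naive_pram_sort(S):
    """Rank-based sort mirroring the slide's algorithm; returns the sorted list."""
    p = len(S)
    R = [0] * p                               # R[j] = final rank of S[j]
    for k in range(1, p):                     # p - 1 all-to-all rounds
        for j in range(p):                    # conceptually parallel
            l = (j + k) % p
            # ties broken by index: the lower processor ID counts as smaller
            if S[l] < S[j] or (S[l] == S[j] and l < j):
                R[j] += 1
    out = [None] * p
    for j in range(p):                        # processor j writes S[j] into slot R[j]
        out[R[j]] = S[j]
    return out

print(naive_pram_sort([5, 1, 4, 1, 9]))   # [1, 1, 4, 5, 9]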

15  5.3 Semigroup or Fan-in Computation
This computation is trivial for a CRCW PRAM of the "reduction" variety if the reduction operator happens to be ⊗. For example, computing the arithmetic sum (logical AND, logical XOR) of p values, one per processor, is trivial for the CRCW-S (CRCW-A, CRCW-X) PRAM; it can be done in a single cycle by each processor writing its corresponding value into a common location that will then hold the arithmetic sum (logical AND, logical XOR) of all of the values.

16  5.3 Semigroup or Fan-in Computation
EREW PRAM semigroup computation algorithm:
  Proc j, 0 ≤ j < p, copy X[j] into S[j]
  s := 1
  while s < p
    Proc j, 0 ≤ j < p – s, do
      S[j + s] := S[j] ⊗ S[j + s]
    s := 2s
  endwhile
  Broadcast S[p – 1] to all processors
Fig. 5.4 Semigroup computation in EREW PRAM.
This algorithm is optimal for PRAM, but its speedup of O(p / log p) is not.
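For concreteness, here is a Python sketch of this recursive-doubling reduction (an added illustration; op stands for the semigroup operator ⊗, and the snapshot copy models the fact that all reads in one PRAM step see the values from before that step):

def erew_semigroup(X, op):
    """Recursive-doubling semigroup computation; the result ends up in S[p - 1]."""
    p = len(X)
    S = list(X)                          # Proc j copies X[j] into S[j]
    s = 1
    while s < p:
        old = list(S)                    # all reads in one PRAM step see pre-step values
        for j in range(p - s):           # Procs j, 0 <= j < p - s, combine in parallel
            S[j + s] = op(old[j], old[j + s])
        s *= 2
    return S[p - 1]                      # this value would then be broadcast to all processors

print(erew_semigroup([3, 1, 4, 1, 5, 9, 2, 6], lambda a, b: a + b))   # 31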

17  5.3 Semigroup or Fan-in Computation
When each of the p processors is in charge of n/p elements, rather than just one element, the semigroup computation is performed as follows:
1. Each processor first combines its n/p elements in n/p steps to get a single value.
2. Then the algorithm just discussed is used, with its first step replaced by copying the result of the above into S[j].
It is instructive to evaluate the speedup and efficiency of the above algorithm for an n-input semigroup computation using p processors, both when n = p and when n > p (see page 97 of the textbook).
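As a rough, back-of-the-envelope version of that evaluation (an added note with constants and minor terms simplified, not the textbook's exact derivation on page 97), taking the sequential reference time to be about n:

T(n, p) \approx \frac{n}{p} + 2 \log_2 p
S(n, p) = \frac{n}{T(n, p)} \approx \frac{n}{\,n/p + 2 \log_2 p\,}
E(n, p) = \frac{S(n, p)}{p} \approx \frac{1}{1 + (2 p \log_2 p)/n}

For n = p this gives a speedup of roughly p / (2 log₂ p), consistent with the O(p / log p) figure on the previous slide; for n well above p log₂ p the efficiency approaches 1.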

18  5.4 Parallel Prefix Computation
We see in Fig. 5.4 that as we find the semigroup computation result in S[p – 1], all partial prefixes are also obtained in the previous elements of S. Figure 5.6 is identical to Fig. 5.4, except that it includes shading to show that the number of correct prefix results doubles in each step.
Fig. 5.6 Parallel prefix computation in EREW PRAM via recursive doubling. Same as the first part of the semigroup computation (no final broadcasting).
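Because Fig. 5.6 is the same loop as Fig. 5.4 with the intermediate cells kept, the earlier semigroup sketch needs only to return the whole array to expose all prefixes (again an added illustration; op stands for ⊗):

def erew_parallel_prefix(X, op):
    """Recursive-doubling prefix computation: after the loop, S[j] holds
    x_0 (op) x_1 (op) ... (op) x_j for every j."""
    p = len(X)
    S = list(X)
    s = 1
    while s < p:
        old = list(S)                    # snapshot models one synchronous PRAM step
        for j in range(p - s):
            S[j + s] = op(old[j], old[j + s])
        s *= 2
    return S                             # all prefixes, not just S[p - 1]

print(erew_parallel_prefix([2, 1, 3, 1, 2], lambda a, b: a + b))   # [2, 3, 6, 7, 9]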

19  5.4 Parallel Prefix Computation
The previous algorithm is quite efficient, but there are other ways of performing parallel prefix computation on the PRAM. In particular, the divide-and-conquer paradigm leads to two other solutions to this problem. In the following, we deal only with the case of a p-input problem, where p (the number of inputs or processors) is a power of 2. As in Section 5.3, the pair of integers u:v represents the combination (e.g., sum) of all input values from x_u to x_v. Figure 5.7 depicts our first divide-and-conquer algorithm. We view the problem as composed of two subproblems: computing the odd-indexed results s_1, s_3, s_5, ... and computing the even-indexed results s_0, s_2, s_4, ....

20  First Divide-and-Conquer Parallel-Prefix Algorithm
In hardware, this scheme is the basis for the Brent-Kung carry-lookahead adder.
T(p) = T(p/2) + 2, so T(p) ≈ 2 log₂ p.
Fig. 5.7 Parallel prefix computation using a divide-and-conquer scheme. Each vertical line represents a location in shared memory.
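A recursive Python sketch of this scheme (added for illustration, assuming p is a power of 2 as stated on the previous slide): the pairing step and the final fix-up step account for the "+ 2" in the recurrence, and op stands for ⊗.

def prefix_dc1(x, op):
    """First divide-and-conquer scheme (Brent-Kung style), p a power of 2.
    Odd-indexed prefixes come from a half-size subproblem; even-indexed
    ones are then fixed up with one more combining step."""
    p = len(x)
    if p == 1:
        return list(x)
    # one parallel step: combine adjacent pairs x[2i] (op) x[2i+1]
    pairs = [op(x[2 * i], x[2 * i + 1]) for i in range(p // 2)]
    # recurse: prefixes of the pairs are exactly the odd-indexed results s1, s3, ...
    odd = prefix_dc1(pairs, op)
    s = [None] * p
    s[0] = x[0]
    for i in range(p // 2):
        s[2 * i + 1] = odd[i]
        if 2 * i + 2 < p:
            # one more parallel step: even-indexed result from the preceding odd one
            s[2 * i + 2] = op(odd[i], x[2 * i + 2])
    return s

print(prefix_dc1([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b))
# [1, 3, 6, 10, 15, 21, 28, 36]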

21  Second Divide-and-Conquer Parallel-Prefix Algorithm
Figure 5.8 depicts a second divide-and-conquer algorithm. We view the input list as composed of two sublists: the even-indexed inputs x_0, x_2, x_4, ... and the odd-indexed inputs x_1, x_3, x_5, .... Parallel prefix computation is performed separately on each sublist, leading to partial results as shown in Fig. 5.8 (a sequence of digits indicates the combination of elements with those indices). The final results are obtained by pairwise combination of adjacent partial results in a single PRAM step. The total computation time is given by the recurrence T(p) = T(p/2) + 1. Even though this latter algorithm is more efficient than the first divide-and-conquer scheme, it is applicable only if the operator ⊗ is commutative (why?).

22  Second Divide-and-Conquer Parallel-Prefix Algorithm
Strictly optimal algorithm, but requires commutativity.
T(p) = T(p/2) + 1, so T(p) = log₂ p.
Each vertical line represents a location in shared memory.
Fig. 5.8 Another divide-and-conquer scheme for parallel prefix computation.
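A recursive Python sketch of this second scheme (added for illustration; it assumes p is a power of 2 and, as noted above, a commutative op, since even- and odd-indexed partial results are combined out of their original input order). In a PRAM the two sub-prefix computations would run concurrently on the two halves of the processors, giving T(p) = T(p/2) + 1.

def prefix_dc2(x, op):
    """Second divide-and-conquer scheme; p a power of 2, op commutative."""
    p = len(x)
    if p == 1:
        return list(x)
    evens = prefix_dc2(x[0::2], op)      # prefixes of x0, x2, x4, ... (in parallel on PRAM)
    odds  = prefix_dc2(x[1::2], op)      # prefixes of x1, x3, x5, ... (in parallel on PRAM)
    s = [None] * p
    s[0] = evens[0]
    for i in range(p // 2):              # one final combining step, conceptually parallel
        s[2 * i + 1] = op(evens[i], odds[i])
        if 2 * i + 2 < p:
            s[2 * i + 2] = op(evens[i + 1], odds[i])
    return s

print(prefix_dc2([1, 2, 3, 4, 5, 6, 7, 8], lambda a, b: a + b))
# [1, 3, 6, 10, 15, 21, 28, 36]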

