Student: Fan Bai Instructor: Dr. Sushil Prasad CSc8530.


1 Student: Fan Bai Instructor: Dr. Sushil Prasad CSc8530

2
  1. Overview
  2. Global Buses
     2.1 Global Bus Model
     2.2 Mesh Maximum
     2.3 Global-Bus Mesh Maximum
  3. Row and Column Buses
     3.1 Row and Column Buses Model
     3.2 Row and Column Buses Mesh Maximum
  4. Paper Review
  5. References

3 Fixed Buses: A bus is simply a communication link to which some or all of the processors of the mesh are attached. A mesh can be enhanced with fixed (electronic) buses in two ways: (1) global buses, or (2) row and column buses.

4 1. An $n^{1/2} \times n^{1/2}$ mesh of processors $P(i,j)$, $0 \le i, j \le n^{1/2} - 1$. 2. All processors are attached to a single global bus.

5 1. At any given time, at most one processor can write a datum onto the bus. 2. The datum is available instantly for all other processors to read simultaneously. 3. If two or more processors attempt to write at the same time, exactly one is selected arbitrarily by the bus to succeed. 4. Communication with neighbors is done through the standard mesh links.

6 Three elementary steps: (1) read the bus, (2) write to the bus, (3) communicate with a neighbor.

7 Definition: Suppose that $n$ input data are stored in an $n^{1/2} \times n^{1/2}$ mesh, one datum per processor. It is required to determine the maximum of these data and place it in processor $P(0,0)$.

8 Without a bus (standard mesh): $O(n^{1/2})$ time
Step 1: $n^{1/2} - 1$ elementary steps
  1.1 Each processor in the rightmost column sends its datum to its left neighbor.
  1.2 A processor $P(i,j)$ receiving a datum compares it to its own datum and forwards the larger one to the left, provided that $j > 0$.
  Eventually, $P(i,0)$, $i = 0, 1, \ldots, n^{1/2} - 1$, holds the largest datum in row $i$.
Step 2: $n^{1/2} - 1$ elementary steps
  2.2 $P(n^{1/2} - 1, 0)$ sends its datum to $P(n^{1/2} - 2, 0)$.
  2.3 $P(n^{1/2} - 2, 0)$ compares the datum with its own and forwards the larger one to $P(n^{1/2} - 3, 0)$.
  2.4 Repeat until $P(0,0)$ determines the overall largest datum.
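The two sweeps above can be simulated sequentially. The following is a minimal sketch, not the presenter's code: the function name `mesh_maximum` and the 3 × 3 input are illustrative assumptions, and each parallel elementary step is modeled as one pass of a loop.

```python
def mesh_maximum(grid):
    """Simulate Mesh Maximum on a square mesh without buses.

    Returns (datum held by P(0,0) at the end, number of elementary steps).
    """
    side = len(grid)
    data = [row[:] for row in grid]  # one datum per processor
    steps = 0

    # Step 1: every row is swept from right to left; each receiver keeps
    # the larger of its own datum and the one it was sent.
    for j in range(side - 1, 0, -1):
        for i in range(side):  # all rows act in the same elementary step
            data[i][j - 1] = max(data[i][j - 1], data[i][j])
        steps += 1

    # Step 2: column 0 is swept from bottom to top in the same way.
    for i in range(side - 1, 0, -1):
        data[i - 1][0] = max(data[i - 1][0], data[i][0])
        steps += 1

    return data[0][0], steps


# Hypothetical 3 x 3 input: 2 * (3 - 1) = 4 elementary steps are needed.
print(mesh_maximum([[9, 1, 8], [6, 4, 2], [3, 7, 5]]))  # (9, 4)
```

The step count is always $2(n^{1/2} - 1)$, which is where the $O(n^{1/2})$ bound comes from.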

9 [Figure: Mesh Maximum on a 3 × 3 mesh with initial rows 9 1 8, 6 4 2, and 3 7 5, showing the data at Start and after Step 1, Step 2, Step 3, and Step 4.]

10 Phase 1: for i = 1 to K do
  (1.1) Each processor sends its datum to its neighbors (at most four) in the mesh.
  (1.2) Each processor replaces its datum with the largest datum it received, if that datum is larger than its own.
Let G be the set of processors whose datum has not been overwritten.
[Figure: example mesh illustrating Phase 1.]

11 Phase 2: while |G| > 0 do
  (2.1) One processor in G broadcasts its datum on the global bus.
  (2.2) Each processor replaces its datum with the broadcast datum if that datum is larger (the broadcasting processor also "replaces" its datum, which removes it from G).
[Figure: example mesh illustrating Phase 2.]
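The two phases can be sketched as a sequential simulation. This is an illustration under stated assumptions rather than a reference implementation: the function name, the row-major choice of the next broadcaster (the model only says the choice is arbitrary), and the random 8 × 8 input are all hypothetical, and data are assumed to be comparable numbers.

```python
import random

NEIGHBORS = ((-1, 0), (1, 0), (0, -1), (0, 1))


def global_bus_mesh_maximum(grid, K):
    """Simulate both phases on a square mesh with one global bus.

    Returns (datum held by P(0,0) at the end, time units charged).
    """
    side = len(grid)
    data = [row[:] for row in grid]
    overwritten = [[False] * side for _ in range(side)]

    # Phase 1: K rounds of exchange with up to four mesh neighbors.
    for _ in range(K):
        new = [row[:] for row in data]
        for i in range(side):
            for j in range(side):
                for di, dj in NEIGHBORS:
                    a, b = i + di, j + dj
                    if 0 <= a < side and 0 <= b < side and data[a][b] > new[i][j]:
                        new[i][j] = data[a][b]
                        overwritten[i][j] = True
        data = new
    time = 4 * K  # four time units per round, as in the slides' analysis

    # G: processors whose datum was never overwritten in Phase 1.
    G = [(i, j) for i in range(side) for j in range(side) if not overwritten[i][j]]

    # Phase 2: members of G broadcast one at a time on the global bus.
    while G:
        bi, bj = G.pop(0)  # assumption: pick the first in row-major order
        value = data[bi][bj]
        time += 1
        for i in range(side):
            for j in range(side):
                if data[i][j] < value:
                    data[i][j] = value
                    overwritten[i][j] = True
        G = [(i, j) for (i, j) in G if not overwritten[i][j]]

    return data[0][0], time


# Hypothetical input: a shuffled 8 x 8 mesh (n = 64), with K = 4.
rng = random.Random(0)
values = list(range(64))
rng.shuffle(values)
mesh = [values[r * 8:(r + 1) * 8] for r in range(8)]
result, elapsed = global_bus_mesh_maximum(mesh, K=4)
print(result, elapsed)
```

The processor holding the true maximum is never overwritten, so it stays in G until it broadcasts; after that broadcast every processor, including P(0,0), holds the maximum.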

12 For Phase 1: executed $K$ times, where $K < n^{1/2}$; requires $4K$ time units. For Phase 2: let $S_i$ be the set of processors whose data were not replaced after the $i$th iteration of Phase 1; requires $|S_K|$ iterations in the worst case. In total, $t(n) = 4K + |S_K|$.

13 Selecting the K that minimizes t(n)
  o After $i$ steps, the distance between the elements of $S_i$ cannot be less than $i + 1$.
  o The number of processors at a distance of $i + 1$ or less from a given processor is $2(i+1)^2 + 2(i+1) + 1$.
  o $2(i+1)^2 + 2(i+1) + 1$ is larger than $i^2/2$, so it follows that $|S_i| \le n/(i^2/2)$.
  o Therefore $t(n) \le 4K + 2n/K^2$.
  o $4K + 2n/K^2$ is minimized when $K = n^{1/3}$.
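The choice of K can be checked with a short calculation. The sketch below treats K as a continuous variable (an assumption made only for the derivation) and uses $f$ as a hypothetical name for the bound on $t(n)$.

```latex
\begin{align*}
f(K) &= 4K + \frac{2n}{K^{2}}, \\
f'(K) &= 4 - \frac{4n}{K^{3}} = 0
  \;\Longrightarrow\; K^{3} = n
  \;\Longrightarrow\; K = n^{1/3}, \\
f''(K) &= \frac{12n}{K^{4}} > 0 \quad\text{(so this is a minimum)}, \\
f\bigl(n^{1/3}\bigr) &= 4n^{1/3} + \frac{2n}{n^{2/3}} = 6\,n^{1/3}.
\end{align*}
```

With this K, the bound gives $t(n) = O(n^{1/3})$.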

14 Conservative Flow Algorithm
  o Defined as an algorithm in which the processors are not allowed to communicate any modified or encoded form of the input data.
  o In algorithm Global-Bus Mesh Maximum, the only values exchanged by the processors are the input data themselves.

15 Goal: derive a lower bound on the running time of any conservative flow algorithm that finds the maximum of $n$ data.
Observations:
  o The global bus reduces the diameter of the mesh to a constant.
  o A meaningful lower bound for the present model therefore cannot be based on distance.

16 A Lower Bound
  o Use the notion of information content.
  o The $n$ processors of the $n^{1/2} \times n^{1/2}$ mesh are numbered from 0 to $n - 1$ in snakelike row-major order.
  o Let $I_i(k)$ be the largest number of data that processor $P_i$ can receive in $k$ time units.
  o At time $k$, $P_i$ therefore holds information about the data of at most $I_i(k)$ processors.

17 A Lower Bound
  o Information accumulated by $P_i$:
      messages received through local communications, denoted by $I'_i(k)$;
      messages received through global communications, denoted by $I''_i(k)$.
  o Initially, at $k = 0$, $I'_i(0) = I''_i(0) = 0$ and $I_i(0) = 1$.
  o For $k > 0$, $I_i(k) \le I'_i(k) + I''_i(k) + 1$. The inequality is justified since $I'_i(k)$ and $I''_i(k)$ may contain common data.
  o Let $k = l + g$, where $l$ is the time taken by local communications and $g$ is the time taken by global communications.
  o Let $P_{i(1)}, P_{i(2)}, \ldots, P_{i(m)}$, where $0 \le i(j) \le n - 1$ and $m \le g$, be the processors that have broadcast up to this time.

18 A Lower Bound
  o The nonredundant information transmitted by $P_{i(j)}$ is at most [expression not preserved].
  o This argument holds for any $P_{i(j)}$, $1 \le j \le m$, so [expression not preserved].
  o After $g$ broadcasts, [expression not preserved].
  o Processor $P_i$ receives the information from all those processors that are at a distance of $k$ or less from it, using standard mesh links, for a total of [expression not preserved].

19 A Lower Bound
  o Suppose it takes $H$ time units to find the maximum in the worst case, and let $P_i$ hold the result. Then [expression not preserved].
  o It follows that the lower bound on the time required by any conservative flow algorithm for this type of mesh is $\Omega(n^{1/3})$.

20 One limitation of the global-bus model is that no more than one processor can broadcast a datum at any given time.

21 An X × Y mesh of processors, with X rows and Y columns, is augmented with one bus for each row and one bus for each column.

22
  1. A processor can either communicate with its four neighbors using standard mesh links or broadcast along its row or column bus.
  2. All processors connected to the same bus can simultaneously read a value being broadcast on it.
  3. Two steps are needed for one processor to broadcast its datum to all other processors.
     o Step 1: The processor broadcasts the datum along its row bus.
     o Step 2: Every processor on that row broadcasts the datum along its column bus.
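The two-step broadcast in item 3 can be sketched as follows. This is a toy model under assumptions: the function name `broadcast_to_all` and the `received` grid are hypothetical, and each bus operation is modeled as a plain loop over the processors attached to that bus.

```python
def broadcast_to_all(rows, cols, src, value):
    """Two-step broadcast from processor src = (r, c) on a rows x cols mesh
    with one bus per row and one bus per column.

    Returns (grid of received values, number of bus steps used).
    """
    received = [[None] * cols for _ in range(rows)]
    r, _c = src

    # Step 1: src writes on the bus of row r; every processor in row r reads it.
    for j in range(cols):
        received[r][j] = value

    # Step 2: each processor in row r writes on its own column bus;
    # every processor in that column reads it. All columns act at once.
    for j in range(cols):
        for i in range(rows):
            received[i][j] = received[r][j]

    return received, 2


grid, steps = broadcast_to_all(4, 3, (2, 1), "x")
print(steps, all(cell == "x" for row in grid for cell in row))
```

Step 2 works because the column buses are independent, so all Y columns can carry the datum during the same time unit.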

23 Problem definition: Let $n$ data be stored one per processor in an X × Y mesh with row and column buses, where X ≥ Y and XY = n. Find the maximum of the $n$ input data.

24 The algorithm partitions the mesh into blocks of size m × m.
  o The values of X, Y, and m are determined in the analysis.
  o Assumption: X and Y are multiples of $m$ and $m^2$, respectively.
  o A row of m × m blocks is referred to as a band.

25 Step 1: (1.1) Use Mesh Maximum to find the maximum datum in each m × m block. The algorithm is illustrated using X = 16, Y = 12, and m = 2. [Figure (a): result of step (1.1), with one band and the local maxima marked.]

26 Step 1: (1.2) Copy the local maximum in each block to all the processors in the first column of the block, using standard mesh links. [Figure (b): result of step (1.2).]

27 Step 2:
  o (2.1) The $Y/m$ partial maxima in each band are divided into $m$ groups, each containing $Y/m^2$ elements. Since there are $m$ row buses in each band, each row bus is assigned one group of $Y/m^2$ elements. [Figure: one group marked.]

28 Step 2:
  o (2.2) Each row bus is used to find the maximum of the $Y/m^2$ data assigned to it, in $Y/m^2$ iterations. During the $i$th iteration, the $i$th of the $Y/m^2$ data is broadcast on the row bus. The leftmost processor in the row keeps track of the maximum seen so far.
  o (2.3) There are now X local maxima, with each band holding $m$ maxima in its leftmost column.
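Step 2 can be sketched for a single band as follows. This is a simplified illustration: the function name is hypothetical, and assigning consecutive blocks to the same row bus is an assumption, since the slides do not say how the groups are formed.

```python
def band_row_bus_maxima(block_maxima, m):
    """Step 2 for one band.

    block_maxima: the Y/m partial maxima of the band, one per m x m block.
    Returns (the m maxima held in the leftmost column, iterations used).
    """
    group_size = len(block_maxima) // m  # Y / m^2 elements per row bus
    if group_size * m != len(block_maxima):
        raise ValueError("Y must be a multiple of m^2")

    # (2.1) One group per row bus; consecutive grouping is an assumption.
    groups = [block_maxima[b * group_size:(b + 1) * group_size] for b in range(m)]

    # (2.2) In iteration t, the t-th datum of each group is broadcast on its
    # row bus, and the leftmost processor keeps the maximum seen so far.
    leftmost = [None] * m
    for t in range(group_size):
        for b in range(m):  # the m row buses operate in parallel
            datum = groups[b][t]
            if leftmost[b] is None or datum > leftmost[b]:
                leftmost[b] = datum

    return leftmost, group_size


# With Y = 12 and m = 2, a band has Y/m = 6 partial maxima and uses 3 iterations.
print(band_row_bus_maxima([4, 9, 2, 7, 1, 8], m=2))  # ([9, 8], 3)
```

Because the m buses of a band work at the same time, the cost is the group size $Y/m^2$ rather than the number of partial maxima $Y/m$.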

29 Step 3:
  o Find the maximum of the elements in the leftmost column of each band, using the second phase of the Mesh Maximum algorithm. [Figure (a): result of step 3.]

30 Step 4:
  o (4.1) Row buses are used to broadcast each of the $X/m$ partial maxima to all processors in its row.
  o (4.2) The $X/m$ partial maxima are divided into $Z$ groups, where $Z = \min(Y, X/m)$ and hence $Z \le Y$, each containing at most $X/(Ym)$ elements and each assigned to a distinct column. [Figure (b): result of step (4.2).]

31 Step 5:
  o For each of the $Z$ columns, use the column bus to find the maximum element among the partial maxima assigned to that column in (4.2). This is done in $X/(Ym)$ iterations. During the $i$th iteration, the $i$th of the (at most) $X/(Ym)$ partial maxima is broadcast on the column bus, and the topmost processor in the column keeps track of the maximum seen so far.

32 Step 6:
  o (6.1) Let $M$ be the submesh determined by the first $Z$ rows and the first $Z$ columns.
  o (6.2) Subdivide $M$ into four $Z/2 \times Z/2$ submeshes $M_1$, $M_2$, $M_3$, $M_4$. Using the column buses, send the elements in the top row of $M_2$ to the top row of $M_4$. Submeshes $M_1$ and $M_4$ are now viewed as independent meshes with separate row and column buses, and (6.2) is applied recursively to find the maximum in each of them. The recursion ends when two 1 × 1 meshes are created. The maximum of $M_4$ is then sent to $M_1$, and the larger of the two is kept.

33 Analysis
  o Steps 1 and 3: use standard mesh links, $O(m)$ time
  o Step 2: uses row buses, $O(Y/m^2)$ time
  o Step 4: uses row buses, $O(1)$ time
  o Step 5: uses column buses, $O(X/(Ym)) = O(n/(Y^2 m))$ time, since $X = n/Y$
  o Step 6: $O(\log Y)$ time
  o Overall: $t(n) = O(m + Y/m^2 + n/(Y^2 m) + \log Y)$

34 Analysis
  o The overall running time reaches its minimum when $m = n^{1/8}$ and $Y = n^{3/8}$.
  o Using a rectangular $n^{5/8} \times n^{3/8}$ mesh with row and column buses, the maximum of $n$ data can be found in $O(n^{1/8})$ time.
A lower bound:
  o Row and Column Buses Mesh Maximum achieves the best possible running time for finding the maximum of $n$ data on a mesh of this kind. (See the proof in the textbook.)
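The stated choice of $m$ and $Y$ can be verified by substituting it into the per-step costs from the analysis ($m$, $Y/m^2$, $n/(Y^2 m)$, and $\log Y$). The following is a check of that substitution, not a proof of optimality.

```latex
\begin{align*}
m &= n^{1/8}, \qquad Y = n^{3/8}, \qquad X = \frac{n}{Y} = n^{5/8}, \\
\frac{Y}{m^{2}} &= \frac{n^{3/8}}{n^{2/8}} = n^{1/8}, \\
\frac{n}{Y^{2} m} &= \frac{n}{n^{6/8}\, n^{1/8}} = n^{1/8}, \\
\log Y &= \tfrac{3}{8} \log n = o\bigl(n^{1/8}\bigr).
\end{align*}
```

The three polynomial terms are balanced at $n^{1/8}$ and the logarithmic term is dominated, which gives the $O(n^{1/8})$ total.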

35 Gossiping Problem: the routing problem of exchanging tokens among all nodes of a parallel computer. It has been studied extensively as a basic communication scheme for sharing information among the nodes.

36 "Fast Gossiping on Mesh-Bus Computers," Satoshi Fujita and Masafumi Yamashita, Member, IEEE. IEEE Transactions on Computers, vol. 45, no. 11, November 1996. The paper proposes three gossiping algorithms: SIMPLE, PARTITION, and CENTRALIZE.

40 S. Fujita and M. Yamashita, "Optimal Gossiping in Mesh-Bus Computers," Parallel Processing Letters, vol. 3, pp. 357-361, 1993.

41 References
  o S. G. Akl, Parallel Computation: Models and Methods, Prentice Hall, 1997.
  o S. Fujita and M. Yamashita, "Fast Gossiping on Mesh-Bus Computers," IEEE Transactions on Computers, vol. 45, no. 11, 1996.
  o S. Fujita and M. Yamashita, "Optimal Gossiping in Mesh-Bus Computers," Parallel Processing Letters, vol. 3, pp. 357-361, 1993.
  o Jonathan Blanton, 10_Sect1_extended.ppt, Spring 2010.
  o Guangming Wang, Fixed Buses, 2011.

