
1 Introduction to Parallel Computing

2 Multiprocessor Architectures
Message-Passing Architectures
–Separate address space for each processor.
–Processors communicate via message passing.
Shared-Memory Architectures
–Single address space shared by all processors.
–Processors communicate by memory reads and writes.
–SMP or NUMA.
–Cache coherence is an important issue.
There is a lot of middle ground and many hybrids, and no clear consensus on terminology.

3 Message-Passing Architecture
[Diagram: a row of nodes, each consisting of a processor with its cache and a private memory, connected by an interconnection network.]

4 Shared-Memory Architecture
[Diagram: processors 1 to N, each with its own cache, connected through an interconnection network to shared memory modules 1 to M.]

5 Shared-Memory Architecture: SMP and NUMA
SMP = Symmetric Multiprocessor
–All memory is equally close to all processors.
–Typical interconnection network is a shared bus.
–Easier to program, but doesn’t scale to many processors.
NUMA = Non-Uniform Memory Access
–Each memory is closer to some processors than to others.
–a.k.a. “Distributed Shared Memory”.
–Typical interconnection is a grid or hypercube.
–Harder to program, but scales to more processors.

6 Shared-Memory Architecture: Cache Coherence
Effective caching reduces memory contention.
Processors must see a single consistent memory.
There are many different consistency models; weak consistency is sufficient.
Snoopy cache coherence for bus-based SMPs.
Distributed directories for NUMA.
Many implementation issues: multiple levels, I-D (instruction/data) separation, cache line size, update policy, etc.
Usually one doesn’t need to know all the details.

7 Example: Quad-Processor Pentium Pro
SMP, bus interconnection.
4 x 200 MHz Intel Pentium Pro processors.
8 + 8 KB L1 cache per processor.
512 KB L2 cache per processor.
Snoopy cache coherence.
Vendors: Compaq, HP, IBM, NetPower.
Operating systems: Windows NT, Solaris, Linux, etc.

8 Diplopodus: a Beowulf-based cluster of Linux/Intel workstations
24 PCs connected by a 100 Mbit switch.
Node configuration:
–2 x 500 MHz Pentium III
–512 MB RAM
–12-16 GB disk

9 The first program
Purpose: illustrate notation.
Given
–Length of vectors M.
–Data x_m, y_m, m=0,1,…,M-1 of real numbers, and two real scalars α and β.
Compute
–z = αx + βy, i.e., z[m] = αx[m] + βy[m] for m=0,1,…,M-1.
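
As a concrete illustration of this computation (a sketch in Python with NumPy, not part of the original notation; the data values are made up):

    import numpy as np

    M = 8                            # length of the vectors
    alpha, beta = 2.0, -1.0          # the two real scalars
    x = np.arange(M, dtype=float)    # example data x_0, ..., x_{M-1}
    y = np.ones(M)                   # example data y_0, ..., y_{M-1}

    # z[m] = alpha*x[m] + beta*y[m] for m = 0, ..., M-1;
    # NumPy applies the operation to all M entries at once.
    z = alpha * x + beta * y
    print(z)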

10 Program Vector_Sum_1

    declare
      m : integer;
      x, y, z : array [0,1,…,M-1] of real;
    initially
      …
    assign
      ⟨ || m : 0 ≤ m < M :: z[m] := α x[m] + β y[m] ⟩
    end

Here || is a concurrency operator: if two operations O_1 and O_2 are separated by ||, i.e. O_1 || O_2, then the two operations can be performed concurrently, independently of each other. In addition, ⟨ || m : 0 ≤ m < M :: O_m ⟩ is short for O_0 || O_1 || … || O_{M-1}, meaning that all M operations can be done concurrently.

11 Sequential assignment:

    initially a=1, b=2
    assign a:=b; b:=a

results in a=b=2. Concurrent assignment:

    initially a=1, b=2
    assign a:=b || b:=a

results in a=2, b=1.
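
For illustration, Python's tuple assignment mirrors the concurrent semantics exactly: every right-hand side is evaluated before any variable is updated. A minimal sketch:

    # Sequential assignment: a := b; b := a
    a, b = 1, 2
    a = b          # a becomes 2
    b = a          # b reads the *new* a, so b stays 2
    print(a, b)    # 2 2

    # Concurrent assignment: a := b || b := a
    a, b = 1, 2
    a, b = b, a    # both right-hand sides evaluated first
    print(a, b)    # 2 1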

12 A model of a parallel computer
P processors (nodes); p=0,1,…,P-1.
All processors are identical.
Each processor computes sequentially.
Every node can communicate with every other node.
Communication is handled by mechanisms for sending and receiving data at each processor.

13 Data distribution
Suppose we want to distribute a vector x with M elements x_0,…,x_{M-1} over a collection of P identical computers. On each computer, define the index set J_p = {0,1,…,I_p-1}, where I_p is the number of indices stored at processor p. Assume I_0 + I_1 + … + I_{P-1} = M, so that

    x = ( x_0,…,x_{I_0-1}, … , x_{M-1} )

is stored with the first I_0 entries on processor 0, the next I_1 entries on processor 1, and so on, the last I_{P-1} entries residing on processor P-1.

14 A proper data distribution defines a one-to-one mapping μ from a global index m to a local index i on a processor p: for a global index m, μ(m) gives a unique index i on a unique processor p. Similarly, an index i on processor p is uniquely mapped back to a global index m = μ^{-1}(p,i).

    Globally: x = x_0,…,x_{M-1}
    Locally:  x_0,…,x_{I_0-1} (proc 0), x_0,…,x_{I_1-1} (proc 1), …, x_0,…,x_{I_{P-1}-1} (proc P-1)
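
As an illustration (not from the slides), a block distribution and its two mappings can be sketched in Python; the names mu, mu_inv, I, and offset are ours:

    M, P = 10, 3   # example sizes: 10 entries over 3 processors

    # I[p] = number of entries stored on processor p (block distribution)
    I = [M // P + (1 if p < M % P else 0) for p in range(P)]
    offset = [sum(I[:p]) for p in range(P)]   # first global index on proc p

    def mu(m):
        """Map a global index m to (p, i): processor p, local index i."""
        for p in range(P):
            if m < offset[p] + I[p]:
                return p, m - offset[p]
        raise IndexError(m)

    def mu_inv(p, i):
        """Map local index i on processor p back to the global index m."""
        return offset[p] + i

    # Round-trip check: the mapping is one-to-one
    assert all(mu_inv(*mu(m)) == m for m in range(M))
    print([mu(m) for m in range(M)])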

15 Purpose:
–Derive a multicomputer version of Vector_Sum_1.
Given
–Length of vectors M.
–Data x_m, y_m, m=0,1,…,M-1 of real numbers, and two real scalars α and β.
–Number of processors P.
–Set of indices J_p = {0,1,…,I_p-1}, where the number of entries I_p on the p-th processor is given.
–A one-to-one mapping between global and local indices.
Compute
–z = αx + βy, i.e., z[m] = αx[m] + βy[m] for m=0,1,…,M-1.

16 Program Vector_Sum_2, replicated once for each p = 0,…,P-1 (||_p):

    declare
      i : integer;
      x, y, z : array [J_p] of real;
    initially
      …
    assign
      ⟨ || i : i ∈ J_p :: z[i] := α x[i] + β y[i] ⟩
    end

Notice that we have one program for each processor, all programs being identical. In each program, the identifier p is known. The mapping between global and local indices is also assumed to be known. The result is stored in a distributed vector z.
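
A sketch of the same SPMD idea in plain Python, simulating the P identical programs with a loop; on a real multicomputer each iteration would run on its own processor (with mpi4py, p would come from comm.Get_rank()). The block distribution from the previous sketch is assumed:

    import numpy as np

    M, P = 10, 3
    alpha, beta = 2.0, -1.0
    x_glob = np.arange(M, dtype=float)   # global data, used to build local parts
    y_glob = np.ones(M)

    I = [M // P + (1 if p < M % P else 0) for p in range(P)]
    offset = [sum(I[:p]) for p in range(P)]

    z_local = []
    for p in range(P):                   # each iteration = one processor's program
        x = x_glob[offset[p]:offset[p] + I[p]]   # local array x[J_p]
        y = y_glob[offset[p]:offset[p] + I[p]]   # local array y[J_p]
        z = alpha * x + beta * y         # z[i] := alpha*x[i] + beta*y[i], i in J_p
        z_local.append(z)                # the result stays distributed

    print(z_local)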

17 Performance analysis
Let P be the number of processors, and let T = T(P) denote the execution time of a program on this multicomputer. Performance analysis is the study of the properties of T(P). In order to analyze concurrent algorithms, we have to assume certain properties of the computer. In fact, these assumptions are rather strict and thus rule out a lot of existing computers. On the other hand, without these assumptions the analysis tends to become extremely complicated.

18 Observation
Let T(1) be the time of the fastest possible scalar computation. Then T(P) ≥ T(1)/P. This relation bounds how fast a computation can be done on a parallel computer compared with a scalar computer.

19 Definitions
Speed-up: the speed-up of a P-node computation with execution time T(P) is given by S(P) = T(1)/T(P).
Efficiency: the efficiency of a P-node computation with speed-up S(P) is given by η(P) = S(P)/P.
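
These definitions translate directly into code; a trivial sketch with made-up timings:

    def speedup(T1, TP):
        """S(P) = T(1) / T(P)."""
        return T1 / TP

    def efficiency(T1, TP, P):
        """eta(P) = S(P) / P."""
        return speedup(T1, TP) / P

    # Example: a run taking 100 s on 1 node and 30 s on 4 nodes
    print(speedup(100.0, 30.0))          # ~3.33
    print(efficiency(100.0, 30.0, 4))    # ~0.83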

20 Discussion
Suppose we are in the optimal situation, i.e., we have T(P) = T(1)/P. Then the speed-up is given by S(P) = T(1)/T(P) = P, and the efficiency is η(P) = S(P)/P = 1.

21 More generally we have T(P) ≥ T(1)/P, which implies that S(P) = T(1)/T(P) ≤ P and η(P) = S(P)/P ≤ 1. In practical computations we are pleased if we come close to these optimal results: a speed-up close to P and an efficiency close to 1 is very good. Practical details often result in weaker performance than expected from the analysis.

22 Efficiency modelling
Goal: estimate how fast a certain algorithm can run on a multicomputer. The models depend on the following parameters:
τ_A = arithmetic time: the time of one single arithmetic operation. (Integer operations are ignored, and all nodes are assumed equal.)
τ_C(L) = message exchange time: the time it takes to send a message of length L (in proper units) from one processor to another. We assume that this time is equal for any pair of processors.
τ_L = latency: the start-up time for a communication, or the time it takes to send a message of length zero.
1/β = bandwidth: the maximum rate (in proper units) at which messages can be exchanged.

23 Efficiency modelling
In our efficiency models, we will assume that there is a linear relation between the message exchange time and the length of the message:

    τ_C(L) = τ_L + βL
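
In code the model is a one-liner; the parameter values below are made up purely for illustration:

    tau_L = 1e-4    # latency: time to send a zero-length message (s), illustrative
    beta = 1e-8     # time per unit of message length (s); 1/beta = bandwidth

    def tau_C(L):
        """Linear model of the message exchange time: tau_C(L) = tau_L + beta*L."""
        return tau_L + beta * L

    print(tau_C(0))       # just the latency
    print(tau_C(10**6))   # latency + transfer time for a long message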

24 Analysis of Vector_Sum_2
Recall Program Vector_Sum_2 above, with J_p = {0,1,…,I_p-1}, and define I = max_p I_p. Then a model of the execution time is given by

    T(P) = 3 (max_p I_p) τ_A = 3 I τ_A.

Notice that there are three arithmetic operations for each entry of the array: two multiplications and one addition.

25 Load balancing
Obviously, we would like to balance the load of the processors; basically, we would like each of them to perform approximately the same number of operations. (Recall that we assume all processors have the same capacity.) In the notation used for the present vector operation, we have load balance if I is as small as possible. In the case that M (the number of array entries) is a multiple of P (the number of processors), we have load balance if I = M/P, meaning that there are equally many vector entries on each processor.

26 Speed-up
For this problem the speed-up is S(P) = T(1)/T(P) = 3Mτ_A / 3Iτ_A = M/I. If the problem is load balanced, we have I = M/P and thus S(P) = P, which is optimal. Notice that we are typically interested in very large values of M, say M = 10^6 - 10^9, while the number of processors P is usually below 1000.

27 The communication cost
In the above example, no communication at all was necessary. In the next example, one real number must be communicated. This changes the analysis a bit!

28 The communication cost
Purpose:
–Derive a multicomputer program for the computation of an inner product.
Given
–Length of vectors M.
–Data x_m, y_m, m=0,1,…,M-1 of real numbers.
–Number of processors P.
–Set of indices J_p = {0,1,…,I_p-1}, where the number of entries I_p on the p-th processor is given.
–A one-to-one mapping between global and local indices.
Compute
–σ = (x,y), i.e., σ = x[0]y[0] + x[1]y[1] + … + x[M-1]y[M-1].

29 Program Inner_Product, replicated once for each p = 0,…,P-1 (||_p):

    declare
      i : integer;
      w : array [0,1,…,P-1] of real;
      x, y : array [J_p] of real;
    initially
      …
    assign
      w[p] := ⟨ + i : i ∈ J_p :: x[i] y[i] ⟩ ;
      send w[p] to all other processors ;
      σ := ⟨ + q : 0 ≤ q < P :: w[q] ⟩
    end
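
For comparison, a minimal sketch of the same computation using the real mpi4py library (assuming it is installed and the script is launched under mpirun -np P); its allreduce call combines the send-to-all and the final summation in one step. The local data here is random and purely illustrative:

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    p = comm.Get_rank()       # this processor's identifier
    P = comm.Get_size()       # number of processors

    # Illustrative local data x[J_p], y[J_p]; in a real application these
    # would come from the global-to-local mapping.
    I_p = 4
    rng = np.random.default_rng(seed=p)
    x = rng.random(I_p)
    y = rng.random(I_p)

    w_p = float(np.dot(x, y))                  # local partial sum w[p]
    sigma = comm.allreduce(w_p, op=MPI.SUM)    # sum of all w[q], known everywhere

    print(f"processor {p}: sigma = {sigma}")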

30 Performance modelling of Inner_Product
Recall J_p = {0,1,…,I_p-1} and I = max_p I_p. A model of the execution time for Inner_Product is given by

    T(P) = (2I - 1) τ_A + (P-1) τ_C(1) + (P-1) τ_A.

Here the first term arises from the sum of x[i]y[i] over the local i values (I_p multiplications and I_p - 1 additions). The second term arises from the cost of sending one real number from one processor to all the others. The third term arises from the computation of the global inner product from the partial values available on each processor.

31 Simplifications
Assume I = M/P, i.e., a load-balanced problem. Assume (as always) P ≪ M, and write τ_C(1) = γτ_A (for practical computers γ is quite large, 50-1000). We then have

    T(P) ≈ 2Iτ_A + Pτ_C(1),  or  T(P) ≈ (2M/P + γP)τ_A.

32 Example I
Choosing M = 10^5 and γ = 50, we get T(P) = (2·10^5/P + 50P)τ_A.

33 Example II
Choosing M = 10^7 and γ = 50, we get T(P) = (2·10^7/P + 50P)τ_A.
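
Plugging both examples into code shows the trade-off between the shrinking arithmetic term 2M/P and the growing communication term γP. Minimizing T(P) over P (setting dT/dP = 0) gives P ≈ sqrt(2M/γ); a sketch, with time measured in units of τ_A:

    import math

    def T(P, M, gamma=50.0):
        """Execution time in units of tau_A: T(P) = 2M/P + gamma*P."""
        return 2.0 * M / P + gamma * P

    for M in (10**5, 10**7):
        P_opt = math.sqrt(2.0 * M / 50.0)   # minimizer of T(P) for gamma = 50
        print(f"M = {M:.0e}: best P ~ {P_opt:.0f}, "
              f"T(1) = {T(1, M):.3g}, T(P_opt) = {T(round(P_opt), M):.3g}")

For M = 10^5 the best choice is only P ≈ 63 processors; for M = 10^7 it is P ≈ 632, illustrating that larger problems can profitably use more processors.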

34 Speed-up
For this problem, the speed-up is

    S(P) = T(1)/T(P) ≈ [(2M + γ)τ_A] / [(2M/P + γP)τ_A] = P [1 + γ/(2M)] / [1 + γP²/(2M)].

Optimal speed-up is characterized by S(P) ≈ P; we must require γP²/(2M) ≪ 1 in order for this to be the case.
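
A quick numeric check of this condition (a sketch with the illustrative value γ = 50): for M = 10^7, the term γP²/(2M) is negligible at P = 10, noticeable at P = 100, and fatal at P = 1000.

    def S(P, M, gamma=50.0):
        """Speed-up model: S(P) = P*(1 + gamma/(2M)) / (1 + gamma*P^2/(2M))."""
        return P * (1 + gamma / (2 * M)) / (1 + gamma * P**2 / (2 * M))

    for P in (10, 100, 1000):
        print(P, round(S(P, 10**7), 1))   # ~10.0, ~97.6, ~285.7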

