# Parallel Algorithms and Computing, Selected Topics: Parallel Architecture


2 References: An Introduction to Parallel Algorithms, Joseph JaJa; Introduction to Parallel Computing, Vipin Kumar, Ananth Grama, Anshul Gupta, George Karypis; Parallel Sorting Algorithms, Selim G. Akl

3 Models Three models: graphs (DAG: Directed Acyclic Graph), the Parallel Random Access Machine (PRAM), and networks

4 Graphs Not studied here

Parallel Architecture Parallel random access machine

6 Parallel Random Access Machine Flynn classifies parallel machines based on: data flow and instruction flow. Each flow can be single or multiple.

7 Parallel Random Access Machine Flynn classification:

| Instruction flow \ Data flow | Single | Multiple |
|---|---|---|
| Single | SISD | SIMD |
| Multiple | MISD | MIMD |

8 Parallel Random Access Machine Extends the traditional RAM (Random Access Machine) model: multiple processors P1, P2, …, Pp connected through an interconnection network to a global (shared) memory.

9 Parallel Random Access Machine Characteristics: processors Pi (0 ≤ i ≤ p-1), each with a local memory, where i is a unique identity for processor Pi; a global shared memory that can be accessed by all processors.

10 Parallel Random Access Machine Types of operation. Synchronous: processors work in lockstep; at each step, a processor is either active or idle; suited for SIMD and MIMD architectures. Asynchronous: processors have local clocks, and the processors must be synchronized explicitly; suited for MIMD architectures.

11 Parallel Random Access Machine Example of synchronous operation. Algorithm for processor i (i = 0 … 3). Input: A, B; i is the processor id. Output: C. Begin If (B == 0) then C = A Else C = A/B End

12 Parallel Random Access Machine Step 1: processors with B = 0 are active and execute C = A; the others are idle.

| | Processor 0 | Processor 1 | Processor 2 | Processor 3 |
|---|---|---|---|---|
| Initial | A=5, B=0, C=0 | A=4, B=2, C=0 | A=2, B=1, C=0 | A=7, B=0, C=0 |
| Step 1 | A=5, B=0, C=5 (active, B=0) | A=4, B=2, C=0 (idle, B≠0) | A=2, B=1, C=0 (idle, B≠0) | A=7, B=0, C=7 (active, B=0) |

13 Parallel Random Access Machine Step 2: processors with B ≠ 0 are active and execute C = A/B; the others are idle.

| | Processor 0 | Processor 1 | Processor 2 | Processor 3 |
|---|---|---|---|---|
| Step 2 | A=5, B=0, C=5 (idle, B=0) | A=4, B=2, C=2 (active, B≠0) | A=2, B=1, C=2 (active, B≠0) | A=7, B=0, C=7 (idle, B=0) |

14 Parallel Random Access Machine Read/write conflicts. EREW (Exclusive Read, Exclusive Write): no concurrent operation (read or write) on a variable. CREW (Concurrent Read, Exclusive Write): concurrent reads of the same variable are allowed; writes remain exclusive.

15 Parallel Random Access Machine ERCW (Exclusive Read, Concurrent Write) and CRCW (Concurrent Read, Concurrent Write).

16 Parallel Random Access Machine Concurrent write on a variable X. Common CRCW: the write succeeds only if all processors write the same value to X. Sum CRCW: the sum of all written values is stored in X. Random CRCW: one processor is chosen at random and its value is written to X. Priority CRCW: the processor with the highest priority writes to X.

17 Parallel Random Access Machine Example: concurrent write on X by processors P1 (50 → X), P2 (60 → X), P3 (70 → X). Common CRCW or ERCW: failure (the values differ). Sum CRCW: X receives the sum (180) of the written values. Random CRCW: the final value of X is one of {50, 60, 70}.
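The write-resolution rules above can be sketched as small functions. This is a sequential simulation; the function names and the -1 failure code are choices of this sketch, not from the slides, and index 0 is assumed to be the highest-priority processor.

```c
#include <stddef.h>

/* vals[i] is the value processor i tries to write to X.
   crcw_common returns the agreed value, or -1 when the writers disagree. */
int crcw_common(const int *vals, size_t p) {
    for (size_t i = 1; i < p; i++)
        if (vals[i] != vals[0]) return -1;   /* conflicting values: failure */
    return vals[0];
}

/* Sum CRCW: X receives the sum of all written values. */
int crcw_sum(const int *vals, size_t p) {
    int s = 0;
    for (size_t i = 0; i < p; i++) s += vals[i];
    return s;
}

/* Priority CRCW: the highest-priority processor (index 0 here) wins. */
int crcw_priority(const int *vals, size_t p) {
    (void)p;
    return vals[0];
}
```

With the slide's values 50, 60, 70, common fails, sum gives 180, and priority gives the first processor's 50.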

18 Parallel Random Access Machine Basic input/output operations. On global memory: global read(X, x), global write(Y, y). On local memory: read(X, x), write(Y, y).

19 Example 1: Matrix-Vector product Matrix-vector product Y = AX: A is an n×n matrix; X = [x1, x2, …, xn] is a vector of n elements; p processors (p ≤ n) and r = n/p. Each processor is assigned a block of r = n/p rows.

20 Example 1: Matrix-Vector product (Figure: Y = AX, with the n×n matrix A and the n-vector X held in global memory and processors P1, P2, …, Pp attached to it.)

21 Example 1: Matrix-Vector product Partition A into p blocks A1, A2, …, Ap of r = n/p consecutive rows each, and compute the p partial products in parallel: processor Pi computes the partial product Yi = Ai · X.

22 Example 1: Matrix-Vector product (Figure: processor Pi computes Yi = Ai · X; P1 produces Y[1:r] from rows 1..r of A, P2 produces Y[r+1:2r] from rows r+1..2r, …, Pp produces Y[(p-1)r+1:pr] from the last r rows, each multiplying its row block by X.)

23 Example 1: Matrix-Vector product The solution requires: p concurrent reads of the vector X; an exclusive read by each processor Pi of the block Ai = A[((i-1)r+1) : ir, 1:n]; an exclusive write by each processor Pi on the block Yi = Y[((i-1)r+1) : ir]. Required architecture: CREW PRAM.

24 Example 1: Matrix-Vector product Algorithm for processor Pi (i = 1, 2, …, p). Input: A, an n×n matrix in global memory; X, a vector in global memory. Output: Y = AX (Y is a vector in global memory). Local variables: i, the processor id; p, the number of processors; n, the dimension of A and X. Begin 1. global read(X, z) 2. global read(A((i-1)r+1 : ir, 1:n), B) 3. compute W = Bz 4. global write(W, Y((i-1)r+1 : ir)) End

25 Example 1: Matrix-Vector product Analysis. Computation cost: line 3 takes O(n²/p) arithmetic operations per processor Pi (r rows × n operations each, with r = n/p). Communication cost: line 1, O(n) numbers transferred from global to local memory by Pi; line 2, O(n²/p) numbers transferred from global to local memory by Pi; line 4, O(n/p) numbers transferred from local to global memory by Pi. Overall: the algorithm runs in O(n²/p) time.
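The row-block algorithm above can be simulated sequentially, with each loop iteration playing the role of one processor. The flat row-major layout and the assumption that p divides n are choices of this sketch.

```c
#include <stddef.h>

/* Simulated row-block CREW PRAM matrix-vector product:
   "processor" i reads its r = n/p rows of A and the whole vector X,
   then writes its r entries of Y exclusively. */
void matvec_row_blocks(const double *A, const double *X, double *Y,
                       size_t n, size_t p) {
    size_t r = n / p;                      /* rows per processor */
    for (size_t i = 0; i < p; i++) {       /* one iteration = one processor */
        for (size_t row = i * r; row < (i + 1) * r; row++) {
            double w = 0.0;
            for (size_t j = 0; j < n; j++)
                w += A[row * n + j] * X[j];   /* concurrent read of X */
            Y[row] = w;                       /* exclusive write on Y block */
        }
    }
}
```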

26 Example 1: Matrix-Vector product Another way to partition the matrix is vertically: A and X are split into blocks A1, A2, …, Ap and X1, X2, …, Xp. Solution in two phases: compute the partial products Z1 = A1X1, …, Zp = ApXp; synchronize the processors; add the partial results to get Y: Y = AX = Z1 + Z2 + … + Zp.

27 Example 1: Matrix-Vector product (Figure: processor P1 multiplies the first r columns of A by X[1:r], …, processor Pp multiplies the last r columns by X[(p-1)r+1 : pr]; a synchronization follows before the partial vectors are added.)

28 Example 1: Matrix-Vector product Algorithm for processor Pi (i = 1, 2, …, p). Input: A, an n×n matrix in global memory; X, a vector in global memory. Output: Y = AX (Y is a vector in global memory). Local variables: i, the processor id; p, the number of processors; n, the dimension of A and X. Begin 1. global read(X((i-1)r+1 : ir), z) 2. global read(A(1:n, (i-1)r+1 : ir), B) 3. compute W = Bz 4. synchronize the processors Pi (i = 1, 2, …, p) 5. global write(W, Y((i-1)r+1 : ir)) End

29 Example 1: Matrix-Vector product Analysis (work out the details). Overall: the algorithm runs in O(n²/p) time.
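The column-block variant can be sketched the same way: each "processor" produces a full-length partial vector Zi from its r columns, and the partial vectors are summed after the synchronization point. The in-place accumulation into one scratch vector is a simplification of this sketch.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Simulated column-block matrix-vector product (row-major A).
   Phase 1: processor i multiplies its r = n/p columns of A by its slice
   of X, contributing a full-length partial vector. Phase 2 (after the
   synchronization point): the partial vectors are combined into Y.
   Returns 0 on success, -1 on allocation failure. */
int matvec_col_blocks(const double *A, const double *X, double *Y,
                      size_t n, size_t p) {
    size_t r = n / p;
    double *Z = calloc(n, sizeof *Z);     /* accumulated partial results */
    if (!Z) return -1;
    for (size_t i = 0; i < p; i++)        /* phase 1: partial products */
        for (size_t row = 0; row < n; row++)
            for (size_t j = i * r; j < (i + 1) * r; j++)
                Z[row] += A[row * n + j] * X[j];
    /* (synchronization happens here in the parallel algorithm) */
    memcpy(Y, Z, n * sizeof *Y);          /* phase 2: combined result */
    free(Z);
    return 0;
}
```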

30 Example 2: Sum on the PRAM model An array A of n = 2^k numbers; a PRAM machine with n processors. Compute S = A(1) + A(2) + … + A(n). Construct a binary tree to compute the sum in log2 n time.

31 Example 2: Sum on the PRAM model (Figure: binary summation tree over P1…P8. At level 1, Pi sets B(i) = A(i); at each level above it, Pi computes B(i) = B(2i-1) + B(2i); the sum ends in S = B(1).)

32 Example 2: Sum on the PRAM model Algorithm for processor Pi (i = 1, …, n). Input: A, an array of n = 2^k elements in global memory. Output: S, where S = A(1) + A(2) + … + A(n). Local variables: n; i, the identity of processor Pi. Begin 1. global read(A(i), a) 2. global write(a, B(i)) 3. for h = 1 to log n do if (i ≤ n/2^h) then begin global read(B(2i-1), x); global read(B(2i), y); z = x + y; global write(z, B(i)) end 4. if i = 1 then global write(z, S) End
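The tree sum can be simulated round by round: in round h, "processors" 1..n/2^h each combine two adjacent partial sums. The fixed buffer size is an assumption of this sketch; n must be a power of two.

```c
#include <stddef.h>

/* Simulated binary-tree sum: B starts as a copy of A; each round halves
   the number of active positions by computing B(i) = B(2i-1) + B(2i)
   (1-based indices mapped onto a 0-based array). The result is left in
   B(1). Assumes n is a power of two and n <= 64. */
long tree_sum(const long *A, size_t n) {
    long B[64];
    for (size_t i = 0; i < n; i++) B[i] = A[i];
    for (size_t half = n / 2; half >= 1; half /= 2) {  /* one round per level */
        for (size_t i = 1; i <= half; i++)             /* active processors */
            B[i - 1] = B[2 * i - 2] + B[2 * i - 1];    /* B(i)=B(2i-1)+B(2i) */
        if (half == 1) break;
    }
    return B[0];
}
```

For n = 8 this takes log2 8 = 3 rounds, matching the slide's log2 n bound.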

Parallel Architecture Network model

34 Network model Characteristics: the communication structure is important. The network can be seen as a graph G = (N, E): node i ∈ N is a processor; edge (i, j) ∈ E represents a two-way communication link between processors i and j. Basic communication operations: send(X, Pi) and receive(X, Pi). There is no global shared memory.

35 Network model (Figure: a linear array of n processors P1 - P2 - … - Pn, and an n-processor ring in which Pn is also connected back to P1.)

36 Network model (Figure: an n×n grid of n² processors P11 … Pnn; a torus is the same grid with wrap-around links, so its columns and rows are n rings.)

37 Network model (Figure: a hypercube of n = 2^k processors; for k = 3, nodes P0 through P7, each connected to the neighbors whose labels differ in exactly one bit.)


39 Example 1: Matrix-Vector product on a linear array A = [aij], an n×n matrix, i, j ∈ [1, n]; X = [xi], i ∈ [1, n]. Compute Y = AX, i.e. yi = Σ_{j=1..n} aij·xj.

40 Example 1: Matrix-Vector product on a linear array (Figure: systolic array algorithm for n = 4. The vector entries x4 x3 x2 x1 stream into P1 from the left, while row i of A (ai4 ai3 ai2 ai1, skewed by one extra step per processor) streams into Pi from the top.)

41 Example 1: Matrix-Vector product on a linear array At step j, xj enters processor P1. At each step, processor Pi receives (when available) a value from its left and a value from the top, and updates its partial sum as follows: Yi = Yi + aij·xj, j = 1, 2, 3, …. The values xj and aij reach processor Pi at the same time, at step (i+j-1): (x1, a11) reach P1 at step 1 = (1+1-1); (x3, a13) reach P1 at step 3 = (1+3-1). In general, Yi is completed at step N+i-1.

42 Example 1: Matrix-Vector product on a linear array The computation is complete when x4 and a44 reach processor P4, at step N + N - 1 = 2N-1. Conclusion: the algorithm requires 2N-1 steps; at each step, every active processor performs an addition and a multiplication. Complexity of the algorithm: O(N).
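The systolic schedule above can be simulated with an explicit global clock: xj and aij meet in Pi exactly at step i+j-1, and the simulation confirms the 2N-1 step count. The row-major 0-based layout is a choice of this sketch.

```c
#include <stddef.h>

/* Simulated systolic matrix-vector product on a linear array.
   At step t, processor P_i (1-based) performs Y_i += a_{ij} * x_j for
   j = t - i + 1 whenever that j is in range. Returns the number of
   steps executed, which is 2n - 1. */
size_t systolic_matvec(const double *A, const double *X, double *Y, size_t n) {
    size_t steps = 2 * n - 1;
    for (size_t i = 0; i < n; i++) Y[i] = 0.0;
    for (size_t t = 1; t <= steps; t++)          /* global clock */
        for (size_t i = 1; i <= n; i++)          /* processor P_i */
            if (t >= i && t - i + 1 <= n) {
                size_t j = t - i + 1;            /* t = i + j - 1 */
                Y[i - 1] += A[(i - 1) * n + (j - 1)] * X[j - 1];
            }
    return steps;
}
```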

43 Example 1: Matrix-Vector product on a linear array (Figure: steps 1-7 of the systolic computation for n = 4. The xj values march rightward through P1…P4, one processor per step, and each Pi accumulates yi = Σ_{j=1..4} aij·xj.)

44 Example 1: Matrix-Vector product on a linear array Systolic array algorithm: time-cost analysis.

| Step | Additions | Multiplications | Active | Idle |
|---|---|---|---|---|
| 1 | 1 | 1 | P1 | P2, P3, P4 |
| 2 | 2 | 2 | P1, P2 | P3, P4 |
| 3 | 3 | 3 | P1, P2, P3 | P4 |
| 4 | 4 | 4 | P1, P2, P3, P4 | none |
| 5 | 3 | 3 | P2, P3, P4 | P1 |
| 6 | 2 | 2 | P3, P4 | P1, P2 |
| 7 | 1 | 1 | P4 | P1, P2, P3 |

45 Example 1: Matrix-Vector product on a linear array Systolic array algorithm: time-cost analysis. (Figure: the same step-by-step activity counts, shown together with the xj values moving through P1…P4.)

46 Example 2: Matrix multiplication on a 2-D n×n mesh Given two n×n matrices A = [aik] and B = [bkj], i, j ∈ [1, n], compute the product C = AB, where C is given by cij = Σ_{k=1..n} aik·bkj.

47 Example 2: Matrix multiplication on a 2-D n×n mesh At step i, row i of A (starting with ai1) is entered from the top into column i (through processor P1i). At step j, column j of B (starting with b1j) is entered from the left into row j (through processor Pj1). The values aik and bkj reach processor Pji at step (i+j+k-2); at the end of this step, aik is sent down and bkj is sent right.

48 Example 2: Matrix multiplication on a 2-D n×n mesh (Figure: systolic mesh algorithm for n = 4 at step 1; the skewed rows of A wait above the mesh columns (1,1)…(4,4), and the skewed columns of B wait to the left of the mesh rows.)

49 Example 2: Matrix multiplication on a 2-D n×n mesh (Figure: the same systolic mesh at step 5, with the A values moving down through the mesh and the B values moving right.)

50 Example 2: Matrix multiplication on a 2-D n×n mesh Analysis. To determine the number of steps needed to complete the multiplication, we must find the step at which the terms ann and bnn reach processor Pnn. Values aik and bkj reach processor Pji at step i+j+k-2; substituting n for i, j and k yields n + n + n - 2 = 3n-2. Complexity of the solution: O(n).
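The timing model above is easy to check mechanically: every multiplication aik·bkj happens in step i+j+k-2, so the latest step over all index triples should be 3n-2. This small sketch enumerates the triples rather than simulating the mesh itself.

```c
#include <stddef.h>

/* Latest step in the mesh schedule: the product a_{ik} * b_{kj} is
   performed in processor Pji at step i + j + k - 2, so the maximum over
   all (i, j, k) in [1, n]^3 gives the completion step. */
size_t mesh_last_step(size_t n) {
    size_t last = 0;
    for (size_t i = 1; i <= n; i++)
        for (size_t j = 1; j <= n; j++)
            for (size_t k = 1; k <= n; k++) {
                size_t step = i + j + k - 2;   /* arrival of a_ik and b_kj */
                if (step > last) last = step;
            }
    return last;
}
```

For n = 4 the maximum is 3·4 - 2 = 10, as on the slide.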

51 Example 3: Matrix-Vector multiplication on a ring (Figure: n = 4; x4 x3 x2 x1 enter the ring at P1, and the rotated rows of A stream into P1…P4 from the top so that each xi meets the right aij.) This algorithm requires N steps for a matrix-vector multiplication.

52 Example 3: Matrix-Vector multiplication on a ring Goal: pipeline data into the processors so that n product terms are computed and added to the partial sums at each step. Distribution of X on the processors: for 1 ≤ j ≤ N, xj is assigned to processor P(N-j+1). This algorithm requires N steps for a matrix-vector multiplication.

53 Example 3: Matrix-Vector multiplication on a ring Another way to distribute the xi over the processors and to input the matrix A: row i of A is shifted (rotated) down i (mod n) times and entered into processor Pi; xi is assigned to processor Pi, and at each step the xi are shifted right.

54 Example 3: Matrix-Vector multiplication on a ring (Figure: n = 4; xi starts on Pi, and each processor Pi receives its rotated column of A from the top, so the diagonal a11, a22, a33, a44 is consumed in the first step.)

55 Example 4: Sum of n = 2^d numbers on a d-hypercube Assignment: xi is stored on processor Pi. (Figure: a 3-hypercube with nodes 0-7 holding X0-X7.) Computation of S = Σ xi.

56 Example 4: Sum of n = 2^d numbers on a d-hypercube Step 1: processors of the sub-cube 1XX send their data to the corresponding processors in sub-cube 0XX. (Figure: after this step the 0XX sub-cube holds X0+X4, X1+X5, X2+X6 and X3+X7.)

57 Example 4: Sum of n = 2^d numbers on a d-hypercube Step 2: processors of the sub-cube 01X send their data to the corresponding processors in sub-cube 00X. (Figure: nodes 0 and 1 now hold X0+X4+X2+X6 and X1+X5+X3+X7; active processors are shown dark, idle processors light.)

58 Example 4: Sum of n = 2^d numbers on a d-hypercube Step 3: processor 001 sends its data to processor 000. The sum of the n numbers, S = (X0+X4+X2+X6+X1+X5+X3+X7), is stored on node P0.

59 Example 4: Sum of n = 2^d numbers on a d-hypercube Algorithm for processor Pi. Input: 1) an array X of n = 2^d numbers, X[i] assigned to processor Pi; 2) the processor identity id. Output: S = X[0] + … + X[n-1], stored on processor P0. Begin My_id = id (My_id ≡ i) S = X[i] for j = 0 to d-1 do begin Partner = My_id XOR 2^j if (My_id AND 2^j) = 0 then begin receive(Si, Partner); S = S + Si end if (My_id AND 2^j) ≠ 0 then begin send(S, Partner); exit end end End
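The XOR-partner reduction can be simulated sequentially on an array of per-processor partial sums: in round j, each processor whose low j+1 bits are zero receives from its partner across dimension j and accumulates. After d rounds the total sits on processor 0. The in-place array stands in for the distributed memories.

```c
#include <stddef.h>

/* Simulated d-dimensional hypercube sum reduction. S[id] is the partial
   sum held by processor id; in round j, each surviving receiver
   (id with bits 0..j equal to 0) adds the value of its partner
   id XOR 2^j. Returns the total, left on S[0]. */
long hypercube_sum(long *S, unsigned d) {
    size_t n = (size_t)1 << d;                    /* n = 2^d processors */
    for (unsigned j = 0; j < d; j++)
        for (size_t id = 0; id < n; id++)
            if ((id & (((size_t)1 << (j + 1)) - 1)) == 0) {
                size_t partner = id ^ ((size_t)1 << j);
                S[id] += S[partner];              /* receive and accumulate */
            }
    return S[0];
}
```

For d = 3 this reproduces the three steps of the slides: 1XX into 0XX, 01X into 00X, then 001 into 000.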

Parallel Architecture Message broadcast on network model (ring, torus, hypercube)

61 Basic communication Message Broadcast One-to-all broadcast – Ring – Mesh (Torus) – Hypercube All-to-all broadcast – Ring – Mesh (Torus) – Hypercube

62 Communication cost Message from Pi to Pj traversing l links. Communication cost = ts + tw·m·l, where ts is the message preparation (startup) time, m the message length, tw the unit (per-word) transfer time, and l the number of links traversed by the message.

63 Communication cost Communication time bounds: ring, ts + tw·m·⌈p/2⌉; mesh, ts + tw·m·⌈√p/2⌉; hypercube, ts + tw·m·log2 p. The bound depends on the maximum number of links traversed by the message.

64 One-to-All broadcast Simple solution: P0 sends the message M0 to processors P1, P2, …, Pp-1 successively (P0 → P1, then P0 → P1 → P2, and so on up to P0 → P1 → … → Pp-1). Communication cost = Σ_{i=1..p-1} (ts + tw·m0)·i = (ts + tw·m0)·p(p-1)/2.

65 One-to-all broadcast A processor sends a message M to all other processors. (Figure: one-to-all broadcast delivers M to nodes 0, 1, …, p-1; its dual operation is accumulation.)

66 All-to-all broadcast All-to-all broadcast: several simultaneous one-to-all broadcasts, one initiated by each processor Pi. (Figure: each node i starts with its own message Xi; after the all-to-all broadcast every node holds X0, X1, …, Xp-1. The dual operation is accumulation to several nodes.)

Parallel Architecture Examples of message broadcasts

68 Example 1: One-to-All broadcast on a ring Each processor forwards the message to its neighbor; initially the message is sent in both directions. (Figure: from node 0, the message reaches nodes 1 and 7 in parallel step 1, nodes 2 and 6 in step 2, and so on.) Communication cost: T = (ts + tw·m)·⌈p/2⌉, where p is the number of processors.

69 Example 2: One-to-All broadcast on a Torus Two phases. Phase 1: one-to-all broadcast on the first row. (Figure: 4×4 torus, nodes 0-15; the message spreads along the first row in steps 1 and 2.)

70 Example 2: One-to-All broadcast on a Torus Phase 2: parallel one-to-all broadcasts in the columns. (Figure: steps 3 and 4 spread the message down every column.)

71 Example 2: One-to-All broadcast on a Torus Communication cost: broadcast on the row, Tcom = (ts + tw·m)·⌈√p/2⌉; broadcast on the columns, Tcom = (ts + tw·m)·⌈√p/2⌉; total T = 2·(ts + tw·m)·⌈√p/2⌉, where p is the number of processors.

72 Example 3: One-to-All broadcast on a Hypercube Requires d steps; each step doubles the number of active processors. (Figure: 3-hypercube, nodes 0-7; the message crosses one new dimension at each of the three steps.) Communication cost: T = (ts + tw·m)·log2 p, where p is the number of processors.

73 Example 3: One-to-All broadcast on a Hypercube Broadcast an element X stored on one processor (say P0) to the other processors of the hypercube. The broadcast can be performed in O(log n) steps as follows. (Figure: initial distribution of data, with X on node 0.)

74 Example 3: One-to-All broadcast on a Hypercube Step 1: processor P0 sends X to processor P1. Step 2: processors P0 and P1 send X to P2 and P3 respectively. Step 3: processors P0, P1, P2 and P3 send X to P4, P5, P6 and P7. (Figure: active and idle processors at each of the three steps.)

75 Example 3: One-to-All broadcast on a Hypercube Algorithm for the broadcast of X on a d-hypercube. Input: 1) X, assigned to processor P0; 2) the processor identity id. Output: every processor Pi contains X. Begin if i = 0 then B = X My_id = id (My_id ≡ i) for j = 0 to d-1 do if My_id < 2^(j+1) then begin Partner = My_id XOR 2^j if My_id > Partner then receive(B, Partner) if My_id < Partner then send(B, Partner) end End
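The doubling pattern above can be simulated with a table recording which nodes hold X: in round j, every holder in the lower half of dimension j sends across that dimension. The fixed table size is an assumption of this sketch.

```c
#include <stddef.h>

/* Simulated recursive-doubling broadcast on a d-hypercube (d <= 6).
   has[i] records whether node i already holds X. Returns the number of
   rounds used, which should be d; returns 0 if any node is missed. */
unsigned hypercube_bcast_rounds(unsigned d) {
    size_t n = (size_t)1 << d;
    int has[64] = {0};
    has[0] = 1;                                    /* X starts on P0 */
    unsigned rounds = 0;
    for (unsigned j = 0; j < d; j++) {
        for (size_t id = 0; id < n; id++)
            if (has[id] && !(id & ((size_t)1 << j)))
                has[id ^ ((size_t)1 << j)] = 1;    /* send across dim j */
        rounds++;
    }
    for (size_t id = 0; id < n; id++)
        if (!has[id]) return 0;                    /* broadcast incomplete */
    return rounds;
}
```

For d = 3 the holders grow 1 → 2 → 4 → 8, exactly the three steps on the previous slides.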

76 All-to-all broadcast on a ring Step 1. (Figure: 8-node ring; each node i forwards its own message (i) to its neighbor, so after the step node i holds messages (i, i-1), indices mod 8.)

77 All-to-all broadcast on a ring Step 2. (Figure: node i, holding (i, i-1), forwards the message (i-1) it received in the previous step; the labels 2(·) mark the messages moving in step 2.)

78 All-to-all broadcast on a ring Step 3. (Figure: node i now holds (i, i-1, i-2) and forwards the message (i-2); all indices mod 8.)

79 All-to-all broadcast on a ring Step 7. (Figure: after the seventh and final step, every node holds all eight messages (0, 1, …, 7).)

80 All-to-all broadcast on a 2-dimensional torus Two phases. Phase 1: all-to-all broadcast on each row; afterwards each processor Pi holds a message of size Mi = √p·m. Phase 2: all-to-all broadcast in the columns.

81 All-to-all broadcast (Figure: 3×3 mesh, nodes 0-8, each holding its own message (i); start of phase 1, all-to-all on the rows.)

82 All-to-all broadcast (Figure: after phase 1, each node in a row holds that row's messages, e.g. (0,1,2), (3,4,5), (6,7,8); start of phase 2, all-to-all on the columns.)

83 All-to-all broadcast Communication cost = cost of phase 1 + cost of phase 2 = (√p - 1)(ts + tw·m) + (√p - 1)(ts + tw·√p·m).

Parallel Algorithms and Computing Selected topics Sorting in Parallel

85 Performance measures Speedup, efficiency, work-time, Amdahl's law.

86 Speedup Speedup S(p), where p is the number of processors in the parallel solution: S(p) = (sequential time) / (parallel time on p processors). S(p) < 1: the parallel solution is worse; 1 ≤ S(p) ≤ p: normal; S(p) > p: hyper (superlinear) speedup.

87 Speedup Is hyper speedup normal? It can come from a poor, non-optimal sequential algorithm, or storage space can be a factor.

88 Efficiency Efficiency E(p) = S(p)/p. 0 < E(p) ≤ 1: normal; E(p) > 1 would imply superlinear speedup.

89 Amdahl's law A program consists of two parts: a sequential part and a parallel part, T = Tseq + Tpar.

90 Amdahl's law Bound on speedup: S(p) = T / (Tseq + Tpar/p), since only the parallel part benefits from the p processors.

91 Amdahl's law Bound on speedup. With sequential fraction fs and parallel fraction fp = 1 - fs, the speedup can be rewritten as S(p) = 1 / (fs + fp/p).

92 Amdahl's law Bound on speedup: as p grows, S(p) is bounded by 1/fs.

93 Amdahl's law Bound on speedup. For example, if fs is equal to 1%, S(p) is less than 100. (Figure: S(p) versus p, rising from 1 and saturating at the asymptote 1/fs.)
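Amdahl's law as rewritten above is a one-line function, and the 1% example can be checked directly: no matter how large p gets, the speedup stays below 1/fs = 100.

```c
/* Amdahl's law: speedup on p processors with sequential fraction fs,
   S(p) = 1 / (fs + (1 - fs)/p), bounded above by 1/fs. */
double amdahl_speedup(double fs, double p) {
    return 1.0 / (fs + (1.0 - fs) / p);
}
```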

94 Amdahl's law The above computation of the speedup bound does not take communication and synchronization overheads into account.

95 Parallel sorting Types of sorting algorithms. Properties: the processor ordering determines the order of the final result; where input and output are stored; the basic compare-exchange operation.

96 Issues in sorting algorithms Internal/external sort:

| | Internal | External |
|---|---|---|
| Data location | fits in processor memory (RAM) | in memory (RAM) and on disk |
| Performance based on | comparisons, basic operations | basic operations, overlap of computing and I/O |
| Complexity | O(n log n) | depends on I/O |

97 Issues in sorting algorithms Comparison-based: executions of comparisons and permutations. Non-comparison-based: ordering based on properties of the keys.

98 Issues in sorting algorithms Internal sort (shared memory: PRAM): the processors share the data; each processor sorts part of the data in memory; memory access conflicts must be minimized.

99 Issues in sorting algorithms Internal sort (distributed memory): each processor is assigned a block of N/P elements and locally sorts its block (using any internal sorting algorithm). Input: distributed among the processors. Output: stored on the processors. Final order: the processor order defines the final ordering of the list (block on P1 < block on P2 < block on P3 < …).

100 Issues in sorting algorithms Internal sort (distributed memory). (Figure: example on a 3-hypercube, where the final order is defined by the Gray-code labelling of the processors.)

101 Issues in sorting algorithms Building block: the compare-exchange operation. Sequential: one CPU holds (ai, aj) in RAM; if they are out of order, exchange them (ai ↔ aj). Parallel: processor P(i) holds ai and P(i+1) holds aj; P(i) performs exchange-compare-min with P(i+1) and keeps ai = min(ai, aj), while P(i+1) performs exchange-compare-max with P(i) and keeps aj = max(ai, aj).

102 Issues in sorting algorithms Compare-exchange with N/p elements per processor: P(i) and P(i+1) exchange their sorted blocks; P(i) performs exchange-compare-min with P(i+1) and keeps the N/p smallest elements, while P(i+1) performs exchange-compare-max with P(i) and keeps the N/p largest.
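The block compare-exchange (compare-split) step can be sketched by merging the two sorted blocks and splitting the result: the low processor keeps the k smallest, the high one the k largest, both still sorted. The merge-then-split simulation is a choice of this sketch; a real implementation would overlap the exchange and the merge.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Compare-split of two sorted k-element blocks. After the call, lo holds
   the k smallest of the 2k elements and hi the k largest, both sorted.
   Returns 0 on success, -1 on allocation failure. */
int compare_split(long *lo, long *hi, size_t k) {
    long *buf = malloc(2 * k * sizeof *buf);
    if (!buf) return -1;
    size_t a = 0, b = 0;
    for (size_t t = 0; t < 2 * k; t++)          /* merge the two blocks */
        if (b >= k || (a < k && lo[a] <= hi[b])) buf[t] = lo[a++];
        else buf[t] = hi[b++];
    memcpy(lo, buf, k * sizeof *lo);            /* k smallest stay low */
    memcpy(hi, buf + k, k * sizeof *hi);        /* k largest go high */
    free(buf);
    return 0;
}
```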

Example: Odd-Even Merge Sort Unsorted list of n elements. Divide the list into two lists of n/2 elements, A0 A1 … A(M-1) and B0 B1 … B(M-1), and sort each sub-list. Divide each sorted list into sub-lists of even and odd index: A0 A2 … A(M-2) and A1 A3 … A(M-1); B0 B2 … B(M-2) and B1 B3 … B(M-1). Merge-sort the odd-even sub-lists into E0 E1 … E(M-1) and O0 O1 … O(M-1). Merge the two lists as E0 O0 E1 O1 … E(M-1) O(M-1) and exchange the out-of-position elements.

Where is the parallelism? (Figure: the merge tree: one merge of N elements at the top, two of N/2 below it, four of N/4 below that; all merges on a level run in parallel.)

105 Example: Odd-Even Merge Sort Key to the merge-sort algorithm: the method used to merge the sorted sub-lists. Consider two sorted lists of m = 2^k elements: A = a0, a1, …, a(m-1) and B = b0, b1, …, b(m-1). Even(A) = a0, a2, …, a(m-2); Odd(A) = a1, a3, …, a(m-1). Even(B) = b0, b2, …, b(m-2); Odd(B) = b1, b3, …, b(m-1).

106 Example: Odd-Even Merge Sort Create two merged lists: merge Even(A) and Odd(B) into E = E0 E1 … E(m-1); merge Even(B) and Odd(A) into O = O0 O1 … O(m-1). Merge E and O as follows to create a list L': L' = E0 O0 E1 O1 … E(m-1) O(m-1). Exchange the out-of-order elements of L' to obtain the sorted list L.

107 Example: Odd-Even Merge Sort A = 2, 3, 4, 8 and B = 1, 5, 6, 7. Even(A) = 2, 4 and Odd(A) = 3, 8; Even(B) = 1, 6 and Odd(B) = 5, 7. E = 2, 4, 5, 7 and O = 1, 3, 6, 8. L' = 2 ↔ 1, 4 ↔ 3, 5, 6, 7, 8. L = 1, 2, 3, 4, 5, 6, 7, 8.
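The merging rule of slides 105-107 can be written directly: build E and O, interleave them, and compare-exchange each (Ei, Oi) pair. The fixed scratch buffers (m ≤ 64) are an assumption of this sketch.

```c
#include <stddef.h>

/* Merge two sorted k-element lists into dst (2k elements). */
static void merge2(const long *x, const long *y, size_t k, long *dst) {
    size_t a = 0, b = 0;
    for (size_t t = 0; t < 2 * k; t++)
        dst[t] = (b >= k || (a < k && x[a] <= y[b])) ? x[a++] : y[b++];
}

/* Odd-even merge as on the slides: E = merge(Even(A), Odd(B)),
   O = merge(Even(B), Odd(A)); interleave E0 O0 E1 O1 ... and
   compare-exchange each (E_i, O_i) pair. Writes the 2m sorted values
   to out. Assumes m is a power of two, m <= 64. */
void odd_even_merge(const long *A, const long *B, size_t m, long *out) {
    long ea[32], oa[32], eb[32], ob[32], E[64], O[64];
    size_t h = m / 2;
    for (size_t i = 0; i < h; i++) {
        ea[i] = A[2 * i];  oa[i] = A[2 * i + 1];   /* Even(A), Odd(A) */
        eb[i] = B[2 * i];  ob[i] = B[2 * i + 1];   /* Even(B), Odd(B) */
    }
    merge2(ea, ob, h, E);                 /* E = merge(Even(A), Odd(B)) */
    merge2(eb, oa, h, O);                 /* O = merge(Even(B), Odd(A)) */
    for (size_t i = 0; i < m; i++) {      /* interleave + compare-exchange */
        long lo = E[i] < O[i] ? E[i] : O[i];
        long hi = E[i] < O[i] ? O[i] : E[i];
        out[2 * i] = lo;
        out[2 * i + 1] = hi;
    }
}
```

On the slide's example (A = 2,3,4,8 and B = 1,5,6,7) this yields 1,2,3,4,5,6,7,8.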

Parallel sorting Quicksort

109 Review: Quicksort Recursively: choose a pivot, divide the list in two using the pivot, then sort the left and right sub-lists. Recall: sequential quicksort performance.

110 Review: Quicksort Sequential quicksort:

    void Quicksort(double *A, int q, int r)
    {
        int s, i;
        double pivot;
        if (q < r) {
            /* divide A using the pivot */
            pivot = A[q];
            s = q;
            for (i = q + 1; i <= r; i++) {
                if (A[i] <= pivot) {
                    s = s + 1;
                    exchange(A, s, i);
                }
            }
            exchange(A, q, s);
            /* recursive calls to sort the new sub-lists */
            Quicksort(A, q, s - 1);
            Quicksort(A, s + 1, r);
        }
    }

111 Review: Quicksort Create a binary tree of processors, one new processor for each recursive call of Quicksort. Easy to implement, but can be inefficient performance-wise.

112 Review: Quicksort Shared-memory implementation (with fork() primitives):

    double A[nmax];
    void quicksort(int q, int r)
    {
        int s, i, n;
        double pivot;
        if (q < r) {
            /* partition */
            pivot = A[q];
            s = q;
            for (i = q + 1; i <= r; i++) {
                if (A[i] <= pivot) {
                    s = s + 1;
                    exchange(A, s, i);
                }
            }
            exchange(A, q, s);
            /* create a new process */
            n = fork();
            if (n == 0)
                quicksort(q, s - 1);
            else
                quicksort(s + 1, r);
        }
    }

113 Quicksort on a d-hypercube d steps; all processors are active in each step. Each processor is assigned N/p elements (p = 2^d). Steps of the solution: initially (step 0), one pivot is chosen and broadcast to all processors; each processor partitions its elements into two sub-lists, one less than the current pivot (inferior) and the other greater than or equal to it (superior); the inferior and superior sub-lists are exchanged along the current dimension, creating two sub-cubes (one for the inferior lists, one for the superior lists); each processor merges its (inferior and superior) lists; repeat within each sub-cube.

114 Quicksort on a d-hypercube Example on a 3-hypercube. Step 0: pivot P0; division along dimension 3. Two blocks of elements are created: one block of elements less than pivot P0 (sub-cube 0XX) and one block of elements greater than or equal to P0 (sub-cube 1XX).

115 Quicksort on a d-hypercube Example on a 3-hypercube. Step 1: pivots P10 and P11; division along dimension 2 divides each sub-cube in two smaller sub-cubes (00X holds elements < P10 and 01X elements > P10; 10X holds elements < P11 and 11X elements > P11).

116 Quicksort on a d-hypercube Example on a 3-hypercube. Step 2: pivots P20, P21, P22 and P23; division along dimension 1. The final order is defined by the label ordering of the processors.

117 Quicksort on a d-hypercube Example on a 3-hypercube. Final step: each processor sorts its final list, using for example a sequential quicksort ({} denotes an empty list).

118 Quicksort on a d-hypercube Data exchange at the initial step, between the sub-cubes 0XX and 1XX: broadcast the pivot P0; each processor splits its list into the parts < P0 and ≥ P0; the inferior and superior sub-lists are then exchanged between partners in 0XX and 1XX. Should the sub-lists be sorted at the end of each step?

119 Quicksort on a d-hypercube Algorithm for processor k (k = 0, …, p-1):

    Hypercube-Quicksort(B, d)
    {
        /* B contains the elements assigned to processor k */
        /* d is the hypercube dimension */
        int i;
        double x, B1[], B2[], T[];
        my-id = k;                     /* processor id */
        for (i = d-1 down to 0) {
            x = pivot(my-id, i);
            partition(B, x, B1, B2);   /* B1 inferior, B2 superior sub-list */
            if ((my-id AND 2^i) == 0) {  /* i-th bit is 0 */
                send(B2, my neighbor in dimension i);
                receive(T, my neighbor in dimension i);
                B = B1 ∪ T;
            } else {
                send(B1, my neighbor in dimension i);
                receive(T, my neighbor in dimension i);
                B = B2 ∪ T;
            }
        }
        Sequential-Quicksort(B);
    }

120 Quicksort on a d-hypercube Choice of pivot: more important for performance than in the sequential case. It has a great impact on the load balancing among the processors and on the performance of the algorithm (a bad pivot degrades performance quickly).

121 Quicksort on a d-hypercube Worst case: at step 0, the largest element of the list is selected as the pivot, pivot0 = x = max{xi}. (Figure: one half of the cube receives every element and is overloaded; the other half is idle.)

122 Quicksort on a d-hypercube Choice of pivot, ideal case (assuming a uniform distribution of the elements of the list). In parallel: sort the initial list assigned to each processor, then choose the median element of one of the processors of the cube; that median approximates the median of the whole list.

123 Quicksort on a d-hypercube Steps of the algorithm (repeated): local sort of the assigned list; selection of the pivot by one processor; broadcast of the pivot in the (d-i)-sub-hypercube; division based on the pivot (binary search); exchange of sub-lists between neighbors; merge of the sorted sub-lists. Time complexity: O(1·d) iterations, with d = log p.

124 Parallel Quicksort on a PRAM Parallel quicksort algorithm. The solution constructs a binary tree of processors which is traversed in in-order to yield the sorted list. Variables shared by all processors: root, the root of the global binary tree; A[n], an array of n elements (1, 2, …, n); Leftchild[i], the root of the left sub-tree of processor i (i = 1, 2, …); Rightchild[i], the root of the right sub-tree of processor i (i = 1, 2, …).

125 Parallel Quicksort on a PRAM Process (do in parallel for each processor i):

    begin
        root := i;        /* all processors compete; one winner becomes the root */
        Parent := root;
        Leftchild[i] := Rightchild[i] := n + 1;
        repeat for each processor i ≠ root:
            if (A[i] < A[Parent]) then
            begin
                Leftchild[Parent] := i;     /* concurrent write, one winner */
                if i = Leftchild[Parent] then exit
                else Parent := Leftchild[Parent];
            end
            else
            begin
                Rightchild[Parent] := i;
                if i = Rightchild[Parent] then exit
                else Parent := Rightchild[Parent];
            end
    end
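The tree the PRAM process builds can be simulated sequentially: each element descends from the root, comparing against pivots, until it claims a free child slot; an in-order traversal then yields the sorted order. Taking element 1 as the root (rather than the arbitrary CRCW winner) and the fixed table sizes (n ≤ 63) are assumptions of this sketch; the sorted result does not depend on which element wins the root.

```c
#include <stddef.h>

/* In-order traversal of the child tables; 0 marks "no child".
   Returns the next free position in out. */
static size_t inorder(const size_t *L, const size_t *R, const double *A,
                      size_t node, double *out, size_t pos) {
    if (node == 0) return pos;
    pos = inorder(L, R, A, L[node], out, pos);
    out[pos++] = A[node];
    return inorder(L, R, A, R[node], out, pos);
}

/* Build the quicksort tree over A[1..n] (1-based, A[0] unused):
   element 1 acts as the root; every other element descends, going left
   on A[i] < A[parent] and right otherwise, until it claims a free slot.
   out receives the n values in sorted order. */
void pram_quicksort_tree(const double *A, size_t n, double *out) {
    size_t L[64] = {0}, R[64] = {0};
    for (size_t i = 2; i <= n; i++) {
        size_t parent = 1;
        for (;;) {
            if (A[i] < A[parent]) {
                if (L[parent] == 0) { L[parent] = i; break; }
                parent = L[parent];
            } else {
                if (R[parent] == 0) { R[parent] = i; break; }
                parent = R[parent];
            }
        }
    }
    inorder(L, R, A, 1, out, 0);
}
```

On the slide's example values 33, 21, 13, 54, 82, 33, 40, 72, the in-order traversal gives 13, 21, 33, 33, 40, 54, 72, 82.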

126 Parallel Quicksort on a PRAM Example. A = (33, 21, 13, 54, 82, 33, 40, 72) on processors 1-8. Step 0: processor 4 (value 54) wins the competition for the root; Leftchild[i] and Rightchild[i] are initialized to 9 (= n+1) for i = 1, …, 8.

127 Parallel Quicksort on a PRAM Example. Step 1: processor 1 (value 33) wins the competition for the left sub-tree of 4, and processor 5 (value 82) wins at the right; processors 2, 3, 6, 7 and 8 move down to compete below them.

128 Parallel Quicksort on a PRAM Example. (Figure: the Leftchild/Rightchild tables after the next step, with processors 2, 3 and 8 attached under node 1 and processors 6 and 7 under node 5.)

129 Parallel Quicksort on a PRAM Example. (Figure: the finished binary tree: root [4]{54}; left child [1]{33} with children [2]{21} and [6]{33}; right child [5]{82} with left child [8]{72}; the remaining processors [3]{13} and [7]{40} hang below [2] and [6]. An in-order traversal yields the sorted list.)