
PARALLEL PROCESSING: From Applications to Systems. Gorana Bosic, Veljko Milutinovic


1 PARALLEL PROCESSING: From Applications to Systems. Gorana Bosic (gogaetf@gmail.com), Veljko Milutinovic (vm@etf.bg.ac.yu)

2 Reference: Dan I. Moldovan, "Parallel Processing from Applications to Systems," Morgan Kaufmann Publishers, 1993, pp. 67-78, 90-92, 250-260

3 PARALLEL NUMERICAL ALGORITHMS: Algorithms Without Loops, Matrix Multiplication, Relaxation

4 Algorithms Without Loops: Parallelism Within a Statement. An expression is a well-formed string of atoms and operators: an atom is a constant or a variable; operators are arithmetic (+, *, -) or logical (OR, AND) operations. Example: a1 + a2 + a3 + a4 + a5 + a6 + a7 + a8 can be evaluated in three parallel steps by adding pairs of operands at each step.
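The three-step pairwise schedule above can be sketched as a small simulation. This is an illustrative sketch, not code from the slides; the helper name `tree_sum` is my own.

```python
def tree_sum(atoms):
    """Sum a list of atoms by pairwise (tree) reduction.
    Every addition within one while-iteration is independent of the
    others, so each iteration corresponds to one parallel step;
    n atoms are summed in ceil(log2(n)) steps."""
    values = list(atoms)
    steps = 0
    while len(values) > 1:
        # Pair up neighbours; a leftover odd element is carried forward.
        values = [sum(values[i:i + 2]) for i in range(0, len(values), 2)]
        steps += 1
    return values[0], steps

total, steps = tree_sum([1, 2, 3, 4, 5, 6, 7, 8])
# eight atoms: total 36 reached in 3 parallel steps
```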

5 Algorithms Without Loops: Tree-Height Reduction. The tree height of a parallel computation is the number of computing steps it requires. The tree height can be reduced by using the associativity law, the commutativity law, and the distributivity law.

6 Algorithms Without Loops: Tree-Height Reduction. Parallelism provided by associativity: (((a + b) * c) * d) takes three steps, while the equivalent (a + b) * (c * d) takes two, since (a + b) and (c * d) can be computed in parallel.

7 Algorithms Without Loops: Tree-Height Reduction. Parallelism provided by commutativity: a * (b + c) * d becomes a * d * (b + c); a * d and (b + c) are computed in parallel, reducing three steps to two.

8 Algorithms Without Loops: Tree-Height Reduction. Tree-height reduction provided by factorization: a * b * c + a * b becomes (a * b) * (c + 1); a * b and (c + 1) are computed in parallel, reducing three steps to two.

9 Algorithms Without Loops: Tree-Height Reduction. Tp[E(e)] ≤ ⌈4 log(e - 1)⌉ - 1, where E(e) is the expression, e is the number of atoms or elements on its right-hand side, and Tp is the parallel processing time of an arithmetic expression when p processors are used.

10 Algorithms Without Loops: Parallelism Between Statements. S1: x = a + bcd; S2: y = ex + f; S3: z = my + x

11 Algorithms Without Loops: Parallelism Between Statements. (Figure: evaluating S1, S2, and S3 directly chains their dependences, x feeding into y and y into z, and requires seven steps.)

12 Algorithms Without Loops: Parallelism Between Statements. (Figure: after substituting S1 and S2 into S3, z = a + mf + bcd + mea + bcdem, all three statements can be evaluated in five steps.)

13 Matrix Multiplication. C = AB, c_ij = Σ_k a_ik b_kj.

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      c_ij^k = c_ij^(k-1) + a_ik * b_kj
    end k
  end j
end i

Rewritten with explicitly propagated variables:

for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      a(i, j, k) = a(i, j-1, k)
      b(i, j, k) = b(i-1, j, k)
      c(i, j, k) = c(i, j, k-1) + a(i, j, k) * b(i, j, k)
    end k
  end j
end i

where a(i, j, k) = a_ik^j, b(i, j, k) = b_kj^i, c(i, j, k) = c_ij^k.
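The first loop nest above translates directly into code; this sketch just makes the accumulation recurrence concrete.

```python
def matmul(A, B):
    """Direct rendering of the triple loop: c_ij^k = c_ij^(k-1) + a_ik*b_kj,
    with c_ij^0 = 0; after the k loop finishes, C[i][j] holds c_ij^n."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
# [[19, 22], [43, 50]]
```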

14 Matrix Multiplication. Dependence matrix (read off the rewritten loop body): d_a = (0, 1, 0)^t, d_b = (1, 0, 0)^t, d_c = (0, 0, 1)^t.

15 Systolic Matrix Multiplication. Processors are arranged in a 2-D grid. Each processor accumulates one element of the product. The elements of the matrices to be multiplied are "pumped through" the array.

16 (Figure: alignment in time for a 3×3 systolic array. Rows of a enter from the left and columns of b from the top, each skewed by one time step per row/column; cell (i, j) accumulates c_ij = a_i0*b_0j + a_i1*b_1j + a_i2*b_2j.)
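The "pumped through" schedule can be sketched as a time-stepped simulation. This is an illustrative model under the standard skewing assumption (row i delayed by i steps, column j by j steps), not code from the slides.

```python
def systolic_matmul(A, B):
    """Time-stepped sketch of the n x n systolic schedule: row i of A
    is fed from the left delayed by i steps, column j of B from the top
    delayed by j steps, so cell (i, j) sees the operand pair
    (a[i][k], b[k][j]) exactly at time t = i + j + k."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for t in range(3 * n - 2):          # the last pair arrives at t = 3n - 3
        for i in range(n):
            for j in range(n):
                k = t - i - j           # which operands reach cell (i, j) now
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
    return C
```

Because at most one multiply-accumulate happens per cell per time step, the total of 3n - 2 steps matches the array's pipelined latency.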

17 Matrix Multiplication. (Figure.)

18 Matrix Multiplication. Case 1: n processors. One processor can be used to compute one column (row) of matrix C, so each horizontal (vertical) layer of the three-dimensional index cube is done in one processor (loop j is performed in parallel). The parallel time complexity is O(n^2).

19 Matrix Multiplication. Case 2: n^2 processors. Each processor may be assigned to compute one element c_ij of matrix C; both loops i and j are performed in parallel. The parallel time complexity is O(n).

20 Matrix Multiplication. Case 3: n^3 processors. Can the time complexity be reduced to a constant? No: the lower bound for a matrix multiplication algorithm is O(log n), since the n products contributing to each c_ij must still be summed by an addition tree of logarithmic depth.
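The step count behind that bound can be made explicit. This is a sketch of the schedule's depth only, with a name of my own choosing, not a full implementation.

```python
import math

def n_cubed_schedule_depth(n):
    """Depth of the O(log n) schedule with n^3 processors: one parallel
    step forms all n^3 products a_ik*b_kj at once, then each c_ij is
    reduced by a binary addition tree of depth ceil(log2 n)."""
    return 1 + math.ceil(math.log2(n))

n_cubed_schedule_depth(8)  # 4 steps for 8x8 matrices
```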

21 Relaxation. Updating a variable at a particular point by finding the average of the values of that variable at neighboring points.

for i = 1 to l
  for j = 1 to m
    for k = 1 to n
      u(j, k) = 1/4 [u(j+1, k) + u(j, k+1) + u(j-1, k) + u(j, k-1)]
    end k
  end j
end i

Writing the iteration index i explicitly: u(i, j, k) = 1/4 [u(i-1, j+1, k) + u(i-1, j, k+1) + u(i, j-1, k) + u(i, j, k-1)].
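The loop nest above can be sketched directly; this is an illustrative in-place version (boundary values held fixed), not code from the slides.

```python
def relax(u, iterations):
    """Relaxation sweep matching the slide's loop nest: the outer loop
    repeats the pass, and each interior point u[j][k] is replaced by
    the average of its four neighbours. Updated values are reused within
    a pass, as in the slide's single-array formulation."""
    m, n = len(u), len(u[0])
    for _ in range(iterations):
        for j in range(1, m - 1):
            for k in range(1, n - 1):
                u[j][k] = 0.25 * (u[j + 1][k] + u[j][k + 1]
                                  + u[j - 1][k] + u[j][k - 1])
    return u
```

With fixed boundary values, repeated sweeps drive the interior toward the boundary average; e.g. an all-ones boundary pulls a zero interior toward 1.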

22 Relaxation. (Figure: index points of the relaxation on the planes i = 7 and i = 8 of the (i, j, k) index space, with 1 ≤ j, k ≤ 5.)

23 Relaxation. The points (7, 2, 5), (7, 3, 4), (7, 4, 3), (7, 5, 2), (8, 1, 4), (8, 2, 3), (8, 3, 2), (8, 4, 1) belong to the plane whose equation is 2i + j + k = 21. The dependence matrix, read off the update statement, has columns d1 = (1, -1, 0)^t, d2 = (1, 0, -1)^t, d3 = (0, 1, 0)^t, d4 = (0, 0, 1)^t.

24 PARALLEL NON-NUMERICAL ALGORITHMS: Transitive Closure. G = (V, E) is a directed graph. Is there a connecting path between any two vertices? A = [a_ij] is the adjacency matrix of G: a_ij = 1 if there is an edge (i, j) ∈ E and a_ij = 0 if not. A* = [a*_ij] is the connectivity matrix of G: a*_ij = 1 if there is a path in G from i to j and a*_ij = 0 if not. A* is the adjacency matrix of the graph G* = (V, E*), in which E* is the transitive closure of the binary relation E. A well-known algorithm for computing A* is Warshall's algorithm.

25 PARALLEL NON-NUMERICAL ALGORITHMS: Transitive Closure Algorithm.

for k = 1 to n
  for i = 1 to n
    for j = 1 to n
      a_ij^k ← a_ij^(k-1) ∪ (a_ik^(k-1) ∩ a_kj^(k-1))
    end j
  end i
end k

The dependencies are:
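Warshall's update rule translates directly into code; this sketch uses 0/1 entries with Python's `or`/`and` standing in for ∪ and ∩.

```python
def transitive_closure(adj):
    """Warshall's algorithm: a_ij^k = a_ij^(k-1) OR (a_ik^(k-1) AND a_kj^(k-1)).
    For a fixed k, every (i, j) update is independent of the others,
    which is exactly the parallelism exploited in the (i, j) planes."""
    n = len(adj)
    a = [row[:] for row in adj]   # work on a copy of the adjacency matrix
    for k in range(n):
        for i in range(n):
            for j in range(n):
                a[i][j] = a[i][j] or (a[i][k] and a[k][j])
    return a

transitive_closure([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
# the path 0 -> 1 -> 2 yields a*[0][2] = 1
```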

26 PARALLEL NON-NUMERICAL ALGORITHMS: Transitive Closure Algorithm. The dependences are between successive k loops, and no dependence lies in the (i, j) planes. All operations in an (i, j) plane can be done in parallel, and the k coordinate becomes the parallel time coordinate. Thus the total time required is O(n).

27 PARALLEL NON-NUMERICAL ALGORITHMS. (Figure: data dependencies for the transitive closure algorithm, n = 4, drawn in the (i, j) plane.)

28 MAPPING OF ALGORITHMS INTO SYSTOLIC ARRAYS: Systolic Array Model, Space Transformations, Design Parameters

29 Systolic Array Model. A systolic array is a tuple (J^(n-1), P), where J^(n-1) ⊂ Z^(n-1) is the index set of the array and P ∈ Z^((n-1)×r) is a matrix of interconnection primitives. The position of each processing cell in the array is described by its Cartesian coordinates. The interconnections between cells are described by the difference vectors between the coordinates of adjacent cells. The matrix of interconnection primitives is P = [p1 p2 ... pr], where p_j is a column vector indicating a unique direction of a communication link.

30 Systolic Array Model. A square array with eight-neighbour connections: J^2 = {(j1, j2) : 0 ≤ j1 ≤ 2, 0 ≤ j2 ≤ 2}. (Figure: 3×3 grid of cells labeled 00 through 22 along axes j1 and j2.)

31 Systolic Array Model. A triangular systolic array: J^2 = {(j1, j2) : 0 ≤ j1 ≤ 3, 0 ≤ j2 ≤ 3}. (Figure: triangular arrangement of cells 00 through 33 with interconnection primitives p1, p2, p3.)

32 Space Transformations. T is a linear algorithm transformation that transforms an algorithm A into an algorithm Â. It comprises Π : J^n → Ĵ^1 and the space transformation S : J^n → Ĵ^(n-1). Algorithm dependences D are transformed into SD = P. For each dependence d_i, the product S d_i is an ((n-1) × 1) column vector. The index point where dependence vector d_i originates is mapped by transformation S into a cell of the systolic array, and the terminal point of the dependence vector is mapped into another processing cell.

33 Space Transformations. Case 1: given an algorithm with dependence matrix D and a transformation S, find P. Case 2: given an algorithm with dependence matrix D and a systolic array with interconnections P, find the transformation S that maps the algorithm into the array.

34 Space Transformations. In the second case, the number of interconnections may not coincide with the number of dependences, and the equation SD = P cannot be applied directly. Introduce the utilization matrix K: SD = PK, where k_ji = 1 if the i-th dependence utilizes (or is mapped into) communication channel j, and k_ji = 0 if it does not. Values larger than 1 are possible (0 ≤ k_ji) and indicate repeated use of an interconnection by a dependence. A dependence may map into several interconnections. The number of time units spent by a dependence along its corresponding connections cannot exceed the time allocated by the transformation to that dependence: 1 ≤ Σ_j k_ji ≤ Π d_i.

35 Space Transformations. Example:

for j0 = 1 to n
  for j1 = 1 to n
    for j2 = 1 to n
      S1: a(j0, j1, j2) = a(j0, j1+1, j2) * b(j0, j1, j2+1)
      S2: b(j0, j1, j2) = b(j0, j1-1, j2+2) + b(j0, j1-3, j2+2)
    end j2
  end j1
end j0

36 Space Transformations. (Figure.)

37 Design Parameters. ĵ = (ĵ0, ĵ1, ĵ2) ∈ Ĵ, with ĵ = Tj. The first coordinate ĵ0 indicates the time at which the computation indexed by the corresponding j is performed; the pair (ĵ1, ĵ2) indicates the processor at which that computation is performed. At what time and in what processor is the computation indexed by (3, 4, 1) performed? The transformed coordinates are (ĵ0, ĵ1, ĵ2)^t = T(3, 4, 1)^t = (2, 8, 3)^t, meaning that the computation time is 2 and the processor cell is (8, 3).

38 Design Parameters. The first row of the transformed dependences is ΠD: each element indicates the number of time units allowed for its respective variable to travel from the processor in which it is generated to the processor in which it is used. Only two interconnection primitives are required: (0, 1)^t and (1, 0)^t.

39 Design Parameters. Systolic array: (figure).

40 Design Parameters. (Figure: cell structure with a multiplier and an adder; the a and b inputs pass through delay registers of one and two time units.)

