
1 Global FFT, Global EP-STREAM Triad, HPL written in MC#. Vadim B. Guzev, Russian People Friendship University, October 2006.

2 Global FFT

3 In this submission we show how to implement Global FFT in the MC# programming language. We concentrate on the process of writing parallel distributed programs rather than on performance or line-count issues. We are quite sure that the future belongs to very high-level programming languages, and that one day the productivity of programmers will become more important than the productivity of platforms. That is why MC# was born.

In object-oriented languages all programs are composed of objects and their interactions. It is natural that when a programmer starts thinking about a problem, he first describes the object model before writing any logic. In the Global FFT program these classes are Complex (a structure) and GlobalFFT (the algorithm). We start writing our program by defining the Complex class:

public class Complex
{
  public Complex( double re, double im ) { Re = re; Im = im; }
  public double Re = 0;
  public double Im = 0;
}

The simple math behind the Global FFT problem is the one-dimensional discrete Fourier transform, which in the 1-based indexing used by the code below reads:

z'_k = \sum_{j=1}^{n} z_j \, e^{-2\pi i \, jk / n}, \qquad k = 1, \dots, n.

4 The natural way to distribute this computation is to split the execution by the index k, and that is exactly what we will do. In MC#, if you want a method to be executed in a different thread/node/cluster, all you need to do is mark it as movable (a distributed analogue of void, or of the async keyword of C# 3.0). Where exactly a movable method will be executed is determined by the Runtime system, and the call of a movable method returns on the caller's side almost immediately (i.e. the caller does not wait until the method execution is completed). In our case this movable method receives as parameters: (a) the array of complex values z, (b) the current processor number and (c) a special channel into which the result will be sent.

movable Calculate( Complex[] z, int processorNumber, Channel( Complex[], int ) channelZ )
{
  int np = CommWorld.Size;
  if ( np > 1 ) np -= 1; // when the program runs in distributed mode the frontend is included in CommWorld.Size
  int partLength = z.Length / np;
  int shift = processorNumber * partLength;
  Complex[] partOfZ = new Complex [partLength];
  for ( int k = 0; k < partLength; k++ ) {
    partOfZ [k] = new Complex( 0, 0 );
    for ( int j = 0; j < z.Length; j++ ) {
      double arg = 2 * Math.PI * (j + 1) * (k + shift + 1) / (double) z.Length;
      double cos = Math.Cos( arg );
      double sin = Math.Sin( arg );
      partOfZ [k].Re += z [j].Re * cos + z [j].Im * sin;
      partOfZ [k].Im += z [j].Im * cos - z [j].Re * sin;
    }
  }
  channelZ.Send( partOfZ, processorNumber );
}

As you can see, in MC# it is possible to use almost any type of parameter for movable methods. When distributed mode is enabled these parameters are automatically serialized and sent to the remote node. The same applies to channels: it is possible to send values of any .NET type (which supports serialization) through a channel. To get results out of a channel you have to connect it with a synchronous method; such a construct is known as a bound in languages like Polyphonic C#, C# 3.0 or MC#. More information about bounds can be found on the MC# site.

void Get( ref Complex[] Z ) & Channel CZ( Complex[] partOfZ, int processorNumber )
{
  int shift = processorNumber * partOfZ.Length;
  for ( int i = 0; i < partOfZ.Length; i++ ) {
    Z [i + shift].Re = partOfZ [i].Re;
    Z [i + shift].Im = partOfZ [i].Im;
  }
}

5 And finally let's write down the Main method which launches the computation:

public static void Main( string[] args )
{
  GlobalFFT fft = new GlobalFFT();
  // Getting m as a parameter of the program (m = 2^n)
  int m = Int32.Parse( args [0] );
  Random r = new Random();
  Complex[] z = new Complex [m];
  // Initializing vector z
  for ( int i = 0; i < m; i++ )
    z [i] = new Complex( r.NextDouble(), r.NextDouble() );
  int np = CommWorld.Size;
  if ( np > 1 ) np -= 1;
  // Launching processing functions
  for ( int k = 0; k < np; k++ )
    fft.Calculate( z, k, fft.CZ );
  // Collecting results - the result is saved to vector z
  for ( int k = 0; k < np; k++ )
    fft.Get( ref z );
}

6 So, the first version of our Global FFT program is the following:

using System;

public class Complex
{
  public Complex( double re, double im ) { Re = re; Im = im; }
  public double Re = 0;
  public double Im = 0;
}

public class GlobalFFT
{
  movable Calculate( Complex[] z, int processorNumber, Channel( Complex[], int ) channelZ )
  {
    int np = CommWorld.Size;
    if ( np > 1 ) np -= 1; // when the program runs in distributed mode the cluster's frontend is included in CommWorld.Size
    int partLength = z.Length / np;
    int shift = processorNumber * partLength;
    Complex[] partOfZ = new Complex [partLength];
    for ( int k = 0; k < partLength; k++ ) {
      partOfZ [k] = new Complex( 0, 0 );
      for ( int j = 0; j < z.Length; j++ ) {
        double arg = 2 * Math.PI * (j + 1) * (k + shift + 1) / (double) z.Length;
        double cos = Math.Cos( arg );
        double sin = Math.Sin( arg );
        partOfZ [k].Re += z [j].Re * cos + z [j].Im * sin;
        partOfZ [k].Im += z [j].Im * cos - z [j].Re * sin;
      }
    }
    channelZ.Send( partOfZ, processorNumber );
  }

  void Get( ref Complex[] Z ) & Channel CZ( Complex[] partOfZ, int processorNumber )
  {
    int shift = processorNumber * partOfZ.Length;
    for ( int i = 0; i < partOfZ.Length; i++ ) {
      Z [i + shift].Re = partOfZ [i].Re;
      Z [i + shift].Im = partOfZ [i].Im;
    }
  }

  public static void Main( string[] args )
  {
    GlobalFFT fft = new GlobalFFT();
    int m = Int32.Parse( args [0] );
    Random r = new Random();
    Complex[] z = new Complex [m];
    for ( int i = 0; i < m; i++ )
      z [i] = new Complex( r.NextDouble(), r.NextDouble() );
    int np = CommWorld.Size;
    if ( np > 1 ) np -= 1;
    for ( int k = 0; k < np; k++ )
      fft.Calculate( z, k, fft.CZ );
    for ( int k = 0; k < np; k++ )
      fft.Get( ref z );
  }
}

7 Parallel programs written in the MC# language can be executed either in local mode (i.e. as simple .exe files; in this mode all movable methods are executed in separate threads) or in distributed mode, in which case all movable calls are distributed across the nodes of the Cluster/MetaCluster/GRID network (depending on the currently used Runtime). This means that a programmer can write and debug his program locally (for example on his Windows machine), then copy it to a Windows-based or Linux-based cluster and run it in distributed mode. A user can even emulate a cluster environment on his home computer! MC# makes cluster computations accessible to every programmer, even to those who currently have no access to clusters.

Let's try to run this program in local mode on a Windows machine:

R:\projects\MCSharp\hpcchallenge>GlobalFFT.exe 1024
________________________________________________
==MC# Statistics================================
Number of movable calls: 1
Number of channel messages: 1
Number of movable calls (across network): 0
Number of channel messages (across network): 0
Total size of movable calls (across network): 0 bytes
Total size of channel messages (across network): 0 bytes
Total time of movable calls serialization: 00:00:
Total time of channel messages serialization: 00:00:00
Total size of transported messages: 0 bytes
Total time of transporting messages: 00:00:00
Session initialization time: 00:00: / sec. / msec.
Total time: 00:00: / sec. / msec.
________________________________________________

Or we can run this program in local mode on a Linux machine:

gfft]$ mono GlobalFFT.exe 1024
MC#.Runtime, v
(the MC# statistics are identical to the Windows run above)

8 OK, it works. Now let's try to run this program on more serious hardware. We'll use a 16-node cluster with the following configuration:

vadim]$ uname -a
Linux skif #1 SMP Thu Apr 14 15:25:11 MSD 2005 i686 athlon i386 GNU/Linux

Let's run our program on 16 processors:

gfft]$ mono GlobalFFT.exe /np 16
MC#.Runtime, v
________________________________________________
==MC# Statistics================================
Number of movable calls: 16
Number of channel messages: 16
Number of movable calls (across network): 16
Number of channel messages (across network): 16
Total size of movable calls (across network): bytes
Total size of channel messages (across network): bytes
Total time of movable calls serialization: 00:00:
Total time of channel messages serialization: 00:00:
Total size of transported messages: bytes
Total time of transporting messages: 00:00:
Session initialization time: 00:00: / sec. / msec.
Total time: 00:00: / sec. / msec.
________________________________________________

[Result graph for GlobalFFT.mcs]

9 Not bad, especially if we take into account that we wrote this program in a modern high-level object-oriented language without thinking about any optimization issues or the physical structure of the computational platform. Now let's try to optimize it a little. The main problem with the first version of our program is that we move thousands of user-defined Complex objects from the frontend to the cluster nodes and back, and serialization/deserialization of such objects takes a lot of resources and time. We can significantly reduce the execution time if we replace the arrays of Complex with arrays of doubles. Here is the modified version of the program (lines 46, NCSL 43, TPtoks 490):

using System;

public class GlobalFFT
{
  movable FFT( double[] zRe, double[] zIm, int processorNumber, Channel( double[], double[], int ) channelZ )
  {
    int np = CommWorld.Size;
    if ( np > 1 ) np -= 1; // when the program runs in distributed mode the cluster's frontend is included in CommWorld.Size
    int partLength = zRe.Length / np;
    int shift = processorNumber * partLength;
    double[] partOfZRe = new double [partLength];
    double[] partOfZIm = new double [partLength];
    double multiplier = 2 * Math.PI / (double) zRe.Length;
    for ( int k = 0; k < partLength; k++ ) {
      for ( int j = 0; j < zRe.Length; j++ ) {
        double arg = multiplier * (j + 1) * (k + shift + 1);
        double cos = Math.Cos( arg );
        double sin = Math.Sin( arg );
        partOfZRe [k] += zRe [j] * cos + zIm [j] * sin;
        partOfZIm [k] += zIm [j] * cos - zRe [j] * sin;
      }
    }
    channelZ.Send( partOfZRe, partOfZIm, processorNumber );
  }

  void Get( ref double[] ZRe, ref double[] ZIm ) & Channel CZ( double[] partOfZRe, double[] partOfZIm, int processorNumber )
  {
    int shift = processorNumber * partOfZRe.Length;
    for ( int i = 0; i < partOfZRe.Length; i++ ) {
      ZRe [i + shift] = partOfZRe [i];
      ZIm [i + shift] = partOfZIm [i];
    }
  }

  public static void Main( string[] args )
  {
    GlobalFFT fft = new GlobalFFT();
    int m = Int32.Parse( args [0] );
    Random r = new Random();
    double[] zRe = new double [m];
    double[] zIm = new double [m];
    for ( int i = 0; i < m; i++ ) {
      zRe [i] = r.NextDouble();
      zIm [i] = r.NextDouble();
    }
    int np = CommWorld.Size;
    if ( np > 1 ) np -= 1;
    for ( int k = 0; k < np; k++ )
      fft.FFT( zRe, zIm, k, fft.CZ ); // the movable method is now called FFT
    for ( int k = 0; k < np; k++ )
      fft.Get( ref zRe, ref zIm );
  }
}

10 Let's run this version of the program (GlobalFFT_arrays.mcs):

gfft]$ mono GlobalFFT_arrays.exe /np 16
MC#.Runtime, v
________________________________________________
==MC# Statistics================================
Number of movable calls: 32
Number of channel messages: 32
Number of movable calls (across network): 32
Number of channel messages (across network): 32
Total size of movable calls (across network): bytes
Total size of channel messages (across network): bytes
Total time of movable calls serialization: 00:00:
Total time of channel messages serialization: 00:00:
Total size of transported messages: bytes
Total time of transporting messages: 00:00:
Session initialization time: 00:00: / sec. / msec.
Total time: 00:00: / sec. / msec.
________________________________________________

Now we can get some performance numbers for the second version of our Global FFT program:

[Performance graph for GlobalFFT_arrays.mcs]

11 Global EP-STREAM-Triad

12 There is one more task in the Class 2 Specification, entitled Global EP-STREAM-Triad. Although C# currently does not support kernel vector operations, we think it is still a good example to demonstrate the syntax of MC#. We'll write a simple program which performs the same calculations on several nodes simultaneously and then prints the average time taken over all nodes. There is only one movable method in this program; it accepts a special Channel through which only objects of class TimeSpan can be sent:

movable fun( Channel( TimeSpan ) result )
{
  int m = 1000000; // the vector length was lost in the transcript; a typical STREAM size is assumed here
  int n = 1000;
  Random r = new Random();
  double[] a = new double[m];
  double[] b = new double[m];
  double[] c = new double[m];
  for ( int i = 0; i < m; i++ ) {
    b [i] = r.NextDouble();
    c [i] = r.NextDouble();
  }
  double alpha = r.NextDouble();
  TimeSpan ts = new TimeSpan(0);
  for ( int i = 0; i < n; i++ ) {
    DateTime from = DateTime.Now;
    for ( int j = 0; j < m; j++ )
      a [j] = b [j] + alpha * c [j];
    ts += DateTime.Now.Subtract( from );
  }
  result.Send( ts );
}

TimeSpan GetResult() & Channel result( TimeSpan ts ) { return ts; }

Movable methods cannot return values; channels must be used instead to pass information between nodes. To read from such "semi-directional" channels, bounds must be used: special syntax constructs which can synchronize multiple threads. In our case we need only one bound. When thread A calls the method GetResult, the Runtime system checks whether any object has been delivered to the result channel and queued in the special channel queue. If no object has been received, thread A is suspended until the result channel receives one; the object is then read from the channel and thread A resumes. Conversely, when an object is sent to the result channel and no caller is waiting in GetResult, the object is put into the channel queue and will be read when GetResult is eventually called.
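As a minimal illustration of these queueing semantics (a sketch using only the EPStream members defined above), the following call sequence is valid even though fun finishes before its result is read:

EPStream e = new EPStream();
e.fun( e.result );            // movable call: returns almost immediately
// ... fun runs elsewhere; the TimeSpan it sends is queued inside e.result ...
TimeSpan t = e.GetResult();   // blocks until a queued message is available, then reads it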

13 Here is the final version of this EP-STREAM Triad program (lines 36, NCSL 33, TPtoks 311):

using System;

public class EPStream
{
  movable fun( Channel( TimeSpan ) result )
  {
    int m = 1000000; // the vector length was lost in the transcript; a typical STREAM size is assumed here
    int n = 1000;
    Random r = new Random();
    double[] a = new double[m];
    double[] b = new double[m];
    double[] c = new double[m];
    for ( int i = 0; i < m; i++ ) {
      b [i] = r.NextDouble();
      c [i] = r.NextDouble();
    }
    double alpha = r.NextDouble();
    TimeSpan ts = new TimeSpan(0);
    for ( int i = 0; i < n; i++ ) {
      DateTime from = DateTime.Now;
      for ( int j = 0; j < m; j++ )
        a [j] = b [j] + alpha * c [j];
      ts += DateTime.Now.Subtract( from );
    }
    result.Send( ts );
  }

  TimeSpan GetResult() & Channel result( TimeSpan ts ) { return ts; }

  public static void Main( string[] args )
  {
    EPStream e = new EPStream();
    for ( int i = 0; i < CommWorld.Size; i++ )
      e.fun( e.result );
    TimeSpan total = new TimeSpan(0);
    for ( int i = 0; i < CommWorld.Size; i++ )
      total += e.GetResult();
    Console.WriteLine( "Average: " + new TimeSpan( total.Ticks / CommWorld.Size ) );
  }
}

Let's run this program:

gfft]$ mono EPStream.exe /np 16
MC#.Runtime, v
Average: 00:01:
________________________________________________
==MC# Statistics================================
Number of movable calls: 17
Number of channel messages: 17
Number of movable calls (across network): 17
Number of channel messages (across network): 17
Total size of movable calls (across network): 6647 bytes
Total size of channel messages (across network): 2958 bytes
Total time of movable calls serialization: 00:00:
Total time of channel messages serialization: 00:00:
Total size of transported messages: bytes
Total time of transporting messages: 00:00:
Session initialization time: 00:00: / sec. / msec.
Total time: 00:01: / sec. / msec.
________________________________________________
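Though the program reports only the average kernel time, the standard STREAM convention converts such a time into bandwidth: each Triad update reads b[j] and c[j] and writes a[j], i.e. 3 x 8 = 24 bytes per element. A worked formula under that convention (not part of the original slides):

\text{bandwidth} = \frac{24 \, m \, n}{t} \ \text{bytes/s},

where m is the vector length, n the number of repetitions, and t the measured total kernel time in seconds.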

14 HPL

15 HPL solves a linear system of equations of order n, Ax = b, by first computing the factorization A = LU and then solving the equations Ly = b and Ux = y one after another. In this scheme L and U are triangular matrices, so solving those systems is not a problem; the real work is the calculation of the matrices L and U themselves. The simple math behind this problem (in the Crout form used by our code, where L carries the diagonal and U has a unit diagonal) is the following:

l_{ij} = a_{ij} - \sum_{k=1}^{j-1} l_{ik} u_{kj} \quad (i \ge j), \qquad
u_{ij} = \frac{1}{l_{ii}} \Big( a_{ij} - \sum_{k=1}^{i-1} l_{ik} u_{kj} \Big) \quad (i < j), \qquad u_{ii} = 1.

[Diagram: the calculation dependencies graph]
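For completeness, the two triangular solves that follow the factorization are plain forward and backward substitution (a standard derivation, matching the y and x loops in the code on the later slides):

y_i = \frac{1}{l_{ii}} \Big( b_i - \sum_{j=1}^{i-1} l_{ij} y_j \Big), \qquad
x_i = \frac{1}{u_{ii}} \Big( y_i - \sum_{j=i+1}^{n} u_{ij} x_j \Big),

where in our Crout factorization u_{ii} = 1, so the final division in the x formula is a no-op.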

16 Actually there are a lot of modifications of the HPL algorithm. One of the most important aspects of HPL is the way calculated panels are broadcast to the other nodes. For example, in the Increasing Ring algorithm process 0 sends two messages while process 1 only receives one message (0 → 1, 0 → 2, 2 → 3, and so on). This algorithm is almost always better than the alternatives, if not the best. Suppose that for a given matrix A the calculation on one node takes approximately 1 minute, and the transfer of matrices between nodes takes about 1/10 of a minute. Using these primitive estimates, the Increasing Ring version takes 11.6 minutes versus 36 minutes for the sequential version. In practice it is even better, because this algorithm also saves bandwidth.

[Diagram: Increasing Ring broadcast among processes 0-3]

17 So, we know that better algorithms exist, but they are quite complex to understand, and the purpose of our submission is not to get the highest performance results but to show the principles of programming in the MC# language. For simplicity we will therefore use the simplest communication structure, where each process communicates directly with the top, left, bottom and right processes in the process grid. In our case each process is connected to its neighbors by bi-directional channels (BDChannel); using these bi-directional channels, processes communicate with each other by sending and receiving messages.
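As a minimal sketch of this message pattern (assuming only the BDChannel operations that appear in the code on the following slides: a Send taking an integer tag plus payload values, and a Receive returning the whole message as an object[]):

// sender side: tag the payload so the receiver knows where it came from
right.Send( 0, panel, height, width );   // "0" = this message travels left-to-right

// receiver side: Receive() blocks until some neighbor sends to our channel
object[] msg = current.Receive();
int tag = (int) msg [0];                 // origin tag
double[,] panel = (double[,]) msg [1];   // payload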

18 The Main method of our program is quite simple. Actually it is written in pure C# (no MC#-specific syntax is used here). First we generate matrix A and vector b; then we instantiate the HPLAlgorithm object and solve the system by calling the Solve method; after that we verify the solution. See the comments in the code for a better understanding:

public static void Main( string[] args )
{
  if ( args.Length < 3 ) {
    Console.WriteLine( "Usage: HPL.exe n p q" );
    Console.WriteLine( "Where n - size of matrix A, p - height of process grid, q - width of process grid" );
    return;
  }
  int n = Int32.Parse( args [0] );
  int p = Int32.Parse( args [1] );
  int q = Int32.Parse( args [2] );
  double[,] a = new double [n, n];
  double[] b = new double [n];
  int maxRandNum = 100;
  // Generate matrix A - expected mean must be equal to zero
  Random rand = new Random();
  for ( int i = 0; i < n; i++ )
    for ( int j = 0; j < n; j++ )
      a [i,j] = rand.NextDouble() * maxRandNum - maxRandNum/2;
  // Generate vector b - expected mean must be equal to zero
  for ( int i = 0; i < n; i++ )
    b [i] = rand.NextDouble() * maxRandNum - maxRandNum/2;
  DateTime dt1, dt2;
  TimeSpan dt;
  dt1 = DateTime.Now;
  HPLAlgorithm hpl = new HPLAlgorithm();   // creating an instance of the HPL algorithm
  double[] x = hpl.Solve( a, b, n, p, q ); // launching the algorithm
  dt2 = DateTime.Now;
  dt = dt2.Subtract( dt1 );
  Console.WriteLine( "\nElapsed time: " + dt.TotalSeconds + " sec.\n" );
  // standard HPL operation count: 2n^3/3 + 3n^2/2 flops
  double performance = ( 2.0 * n * n * n / 3.0 + 3.0 * n * n / 2.0 ) * 1.0e-9 / (double) dt.TotalSeconds;
  Console.WriteLine( "Performance = " + performance + " Gflop/sec" );
  bool correctness = Verify( a, b, x, n );
  if ( correctness == true )
    Console.WriteLine( "Solution is correct" );
  else
    Console.WriteLine( "Solution is incorrect" );
}
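The performance line implements the standard HPL operation count for the LU factorization plus the two triangular solves:

\text{Gflop/s} = \frac{\big( \tfrac{2}{3} n^3 + \tfrac{3}{2} n^2 \big) \cdot 10^{-9}}{t},

with t the elapsed time in seconds.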

19 First of all let's look at the accessory methods: Verify, GetSubMatrix and GetSubVector. The Verify method checks the solution against the criteria given in the HPC Challenge Awards: Class 2 Specification:

public static bool Verify( double[,] A, double[] b, double[] x, int n )
{
  int i, j;
  double tmp;
  double eps = 2.220446049250313e-16; // machine epsilon for double
  double Ax_b_infin = 0.0; // || A x - b ||_infinity
  for ( i = 0; i < n; i++ ) {
    tmp = 0.0;
    for ( j = 0; j < n; j++ )
      tmp += A [i,j] * x [j];
    tmp = Math.Abs( tmp - b [i] );
    if ( tmp > Ax_b_infin ) Ax_b_infin = tmp;
  }
  double A_infin = 0.0; // || A ||_infinity
  for ( i = 0; i < n; i++ ) {
    tmp = 0.0;
    for ( j = 0; j < n; j++ )
      tmp += Math.Abs( A [i,j] );
    if ( tmp > A_infin ) A_infin = tmp;
  }
  double A_1 = 0.0; // || A ||_1
  for ( j = 0; j < n; j++ ) {
    tmp = 0.0;
    for ( i = 0; i < n; i++ )
      tmp += Math.Abs( A [i,j] );
    if ( tmp > A_1 ) A_1 = tmp;
  }
  double x_1 = 0.0; // || x ||_1
  for ( i = 0; i < n; i++ )
    x_1 += Math.Abs( x [i] );
  double x_infin = 0.0; // || x ||_infinity
  for ( i = 0; i < n; i++ )
    if ( Math.Abs( x [i] ) > x_infin ) x_infin = Math.Abs( x [i] );
  double r1 = Ax_b_infin / ( eps * A_1 * n );
  double r2 = Ax_b_infin / ( eps * A_1 * x_1 );
  double r3 = Ax_b_infin / ( eps * A_infin * x_infin * n );
  double r;
  if ( r1 > r2 ) r = r1; else r = r2;
  if ( r3 > r ) r = r3;
  Console.WriteLine( "r1 = " + r1 + "\n" + "r2 = " + r2 + "\n" + "r3 = " + r3 );
  Console.WriteLine( "Max ri = " + r );
  if ( r < 16 ) return true; else return false;
}

GetSubMatrix and GetSubVector simply cut rectangular blocks out of a matrix or a vector:

public static double[,] GetSubMatrix( double[,] a, int fromH, int toH, int fromW, int toW )
{
  int h = toH - fromH;
  int w = toW - fromW;
  double[,] b = new double [h, w];
  for ( int i = 0; i < h; i++ )
    for ( int j = 0; j < w; j++ )
      b [i, j] = a [fromH + i, fromW + j];
  return b;
}

public static double[] GetSubVector( double[] b, int fromH, int toH )
{
  int h = toH - fromH;
  double[] c = new double [h];
  for ( int i = 0; i < h; i++ )
    c [i] = b [fromH + i];
  return c;
}
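Written out in the notation of the Class 2 specification (these are exactly the quantities the code computes, with \varepsilon the machine epsilon):

r_1 = \frac{\|Ax-b\|_\infty}{\varepsilon \, \|A\|_1 \, n}, \qquad
r_2 = \frac{\|Ax-b\|_\infty}{\varepsilon \, \|A\|_1 \, \|x\|_1}, \qquad
r_3 = \frac{\|Ax-b\|_\infty}{\varepsilon \, \|A\|_\infty \, \|x\|_\infty \, n},

and the solution is accepted when max(r_1, r_2, r_3) < 16.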

20 Now let's have a look at the main Solve method. In this method we create a p-by-q grid of bi-directional channels and then launch p * q movable methods, giving each of them the corresponding part of matrix a (and, where necessary, the corresponding part of vector b). Every movable method also receives the bi-directional channels pointing to the process's neighbors and to the process itself, as well as the semi-directional channel for returning the result of the computation. Actually only p processes return values: those located on the diagonal of the p-by-q process grid. The Get & xChannel bound receives parts of the calculated vector x from the running movable methods and merges these parts into the resulting vector. Note how the last block in each row or column absorbs the remainder: for example, with n = 10 and p = 3 the row blocks have heights 3, 3 and 3 + 10 % 3 = 4.

public double[] Solve( double[,] a, double[] b, int n, int p, int q )
{
  BDChannel[,] bdc = new BDChannel [p,q]; // creating p * q bdchannels, one per process
  for ( int i = 0; i < p; i++ )
    for ( int j = 0; j < q; j++ )
      bdc [i, j] = new BDChannel();
  int diffP = n / p;
  int diffQ = n / q;
  for ( int i = 0; i < p; i++ ) {
    int h = diffP;
    if ( i == p - 1 ) h += n % p;
    double[] partOfB = GetSubVector( b, i * diffP, i * diffP + h );
    for ( int j = 0; j < q; j++ ) {
      int w = diffQ;
      if ( j == q - 1 ) w += n % q;
      double[,] partOfA = GetSubMatrix( a, i * diffP, i * diffP + h, j * diffQ, j * diffQ + w );
      BDChannel top = null, left = null, bottom = null, right = null;
      if ( i != 0 ) top = bdc [i - 1, j];
      if ( j != 0 ) left = bdc [i, j - 1];
      if ( i < p - 1 ) bottom = bdc [i + 1, j];
      if ( j < q - 1 ) right = bdc [i, j + 1];
      if ( j != 0 ) partOfB = null; // vector b is needed only in the first column of processes
      hplDistributed( partOfA, partOfB, n, p, q, i * diffP, i * diffP + h, j * diffQ, j * diffQ + w,
                      top, left, bdc [i,j], bottom, right, xChannel );
    }
  }
  double[] x = new double [n];
  for ( int i = 0; i < p; i++ )
    Get( ref x );
  return x;
}

void Get( ref double[] x ) & Channel xChannel( double[] partOfX, int wFrom )
{
  for ( int i = 0; i < partOfX.Length; i++ )
    x [wFrom + i] = partOfX [i];
}

21 And finally here is our movable method hplDistributed:

movable hplDistributed( double[,] a, double[] b, int n, int p, int q,
                        int yStart, int yEnd, int xStart, int xEnd,
                        BDChannel top, BDChannel left, BDChannel current, BDChannel bottom, BDChannel right,
                        Channel( double[], int ) xChannel )
{
  double[,] l = new double [yEnd - yStart, xEnd];
  double[,] u = new double [yEnd, xEnd - xStart];
  double[] y = new double [yEnd - yStart];
  double[] ySum = new double [yEnd - yStart];
  double[] x = new double [yEnd - yStart];
  int i = 0, j = 0, k = 0; // counters
  if ( b != null ) // if this is the first column in the row, copy the part of vector b to ySum
    for ( i = 0; i < b.Length; i++ )
      ySum [i] = b [i];

  // Phase 1: calculate vector y
  int nTimes = 0; // how many arrays can we receive from neighbor processes?
  if ( left != null && top != null ) nTimes = 2;
  else if ( left != null || top != null ) nTimes = 1;
  for ( k = 0; k < nTimes; k++ ) {
    // Receiving a part of matrix L or U from the left or top process
    object[] o = current.Receive();
    int t = (int) o [0];
    int h = (int) o [2];
    int w = (int) o [3];
    double[,] prev = (double[,]) o [1];
    if ( t == 0 ) {
      // Received a part of matrix L from the left process.
      // The needed part of L has already been calculated -
      // pass it on to the right process in the row together with the partial sums
      ySum = (double[]) o [4];
      if ( xStart > yStart && xEnd > yEnd && right != null )
        right.Send( 0, prev, h, w, ySum ); // "0" means the value was sent from the left
      // copying prev to l
      for ( i = 0; i < h; i++ )
        for ( j = 0; j < w; j++ )
          l [i,j] = prev [i,j];
    }
    else if ( t == 1 ) {
      // Received a part of matrix U from the top process
      y = (double[]) o [4];
      // The needed part of U has already been calculated - pass it on to the bottom process in the column
      if ( yStart > xStart && yEnd > xEnd && bottom != null )
        bottom.Send( 1, prev, h, w, y ); // "1" means the value was sent from the top process
      // copying prev to u
      for ( i = 0; i < h; i++ )
        for ( j = 0; j < w; j++ )
          u [i,j] = prev [i,j];
    }
  }
  ... // continued on the next slide

22   // Calculate parts of matrices L and U
  for ( i = 0; i < yEnd - yStart; i++ ) {
    int max = xEnd;
    if ( max > n ) max = n;
    for ( j = 0; j < max - xStart; j++ ) {
      int iPos = i + yStart;
      int jPos = j + xStart;
      if ( iPos == jPos ) { // on the diagonal
        u [iPos,j] = 1;
        l [i,jPos] = a [i,j];
        for ( k = 0; k < jPos; k++ )
          l [i,jPos] -= l [i,k] * u [k,j];
      }
      else if ( iPos < jPos ) { // above the diagonal
        l [i,jPos] = 0;
        u [iPos,j] = a [i,j];
        for ( k = 0; k < iPos; k++ )
          u [iPos,j] = u [iPos,j] - l [i,k] * u [k,j];
        u [iPos,j] = u [iPos,j] / l [i,iPos];
      }
      else { // below the diagonal
        u [iPos,j] = 0;
        l [i,jPos] = a [i,j];
        for ( k = 0; k < jPos; k++ )
          l [i,jPos] -= l [i,k] * u [k,j];
      }
      if ( xStart < yStart )
        ySum [i] -= l [i, jPos] * y [j];
    }
  }

  // Calculating y[i]
  if ( xStart == yStart )
    for ( i = 0; i < yEnd - yStart; i++ ) {
      y [i] = ySum [i];
      for ( j = 0; j < i; j++ )
        y [i] -= y [j] * l [i, j + xStart];
      y [i] = y [i] / l [i, i + xStart];
    }

  // Sending the L part to the right channel if it hasn't been sent before, together with the partial sums ySum
  if ( right != null && ( xStart <= yStart || xEnd <= yEnd ) )
    right.Send( 0, l, yEnd - yStart, xEnd, ySum ); // "0" means the value was sent from the left process

  // Sending the U part to the bottom channel if it hasn't been sent before, together with the part of vector y
  if ( bottom != null && yStart <= xStart )
    bottom.Send( 1, u, yEnd, xEnd - xStart, y ); // "1" means the value was sent from the top process

23   // STEP 2: backward substitution - calculation of vector x
  nTimes = 0;
  if ( xStart == yStart && bottom != null && right != null ) nTimes = 1;
  else if ( xStart > yStart && ( bottom != null && right != null ) ) nTimes = 2;
  else if ( xStart > yStart && ( bottom != null || right != null ) ) nTimes = 1;
  for ( i = 0; i < yEnd - yStart; i++ )
    ySum [i] = 0;
  for ( i = 0; i < nTimes; i++ ) {
    object[] o = current.Receive();
    int t = (int) o [0];
    if ( t == 2 ) { // received a part of vector x from the bottom process
      x = (double[]) o [1];
      if ( top != null ) top.Send( 2, x );
    }
    else if ( t == 3 ) // received a partial sum of vector x from the right process
      ySum = (double[]) o [1];
  }
  // Calculate the x part and pass it to the top channel if necessary
  if ( xStart == yStart ) {
    for ( i = yEnd - yStart - 1; i >= 0; i-- ) {
      x [i] = y [i] - ySum [i];
      for ( j = xEnd - xStart - 1; j > i; j-- )
        x [i] -= x [j] * u [i + yStart, j];
      x [i] = x [i] / u [i + yStart, i];
    }
    if ( top != null ) top.Send( 2, x ); // "2" means the value was sent from the bottom process
  }
  else if ( xStart > yStart ) {
    if ( left != null ) {
      for ( i = yEnd - yStart - 1; i >= 0; i-- )
        for ( j = xEnd - xStart - 1; j >= 0; j-- )
          ySum [i] += u [i,j] * x [j];
      left.Send( 3, ySum ); // "3" means the value was sent from the right process
    }
  }
  if ( xStart == yStart )
    xChannel.Send( x, yStart );
}

The communication scheme is described in more detail on the next slides.

24 Step 1: calculating vector y. [Diagram: calculate y[0]]

25 [Diagram: y[0], U00; L00]

26 [Diagram: y[0], U00; L00]

27 Calculate y[1]. [Diagram: L00; y[0], U00; U01; L10, ySum]

28 [Diagram: L00; y[0], U00; L20, ySum; U02; L10+11, U01+11]

29 Calculate y[2]. [Diagram]

30 [Diagram]

31 Calculate y[3]. [Diagram]

32 [Diagram]

33 Calculate y[4]. [Diagram]

34 [Diagram]

35 Calculate y[5]. [Diagram]

36 The final distribution of the L and U matrix fragments over the process grid. [Diagram]

37 Step 2: calculating vector x. [Diagram: calculate x[5] and pass x[5] to the main method]

38 Calculate x [4]

39 Pass x [4] to the main method

40 Calculate x[3]

41 Pass x[3] to the main method

42 Calculate x[2]

43 Pass x[2] to the main method

44 Calculate x[1]

45 Pass x[1] to the main method

46 Calculate x[0] and pass it to the main method

47 Calculate x[0] and pass it to the main method. Vector x can now be merged from its fragments on the main node!

48 This implementation has the drawback that all communications go through the cluster's frontend, because all bi-directional channels were initially created on the frontend machine. It is possible to reduce the execution time by a mutual exchange of bi-directional channels between neighbor processes (see the next slides to understand how this can be done). Here are the measurements for the algorithm described in the previous slides:

hpl_notparallel.mcs: non-parallel version
SKIF 16x2 nodes cluster
gfft]$ uname -a
Linux skif #1 SMP Thu Apr 14 15:25:11 MSD 2005 i686 athlon i386 GNU/Linux
hpl_7.mcs: P=10 Q=10 NP=32
mono hpl_7.exe N /np 32

[Performance graph]

Note: this version includes the time needed for generating matrix A and vector b.

49 Step 0: exchange of bi-directional channels. Each process creates a bi-directional channel locally and passes it to the right process.

50 Each process passes its local bi-directional channel to the left process.

51 Each process passes its local bi-directional channel to the bottom process.

52 Each process passes its local bi-directional channel to the top process.

53 This is how the previous four slides can be written in the MC# language (the code should be inserted at the beginning of the hplDistributed method):

// STEP 0: speed up the data exchange by a mutual exchange of the bdchannels -
// otherwise all network traffic would go through the cluster frontend...
BDChannel currentNew = new BDChannel();
int nTimes = 0; // how many neighbors does this process have?
if ( right != null )  { right.Send( 1, currentNew );  nTimes++; }
if ( left != null )   { left.Send( 2, currentNew );   nTimes++; }
if ( bottom != null ) { bottom.Send( 3, currentNew ); nTimes++; }
if ( top != null )    { top.Send( 4, currentNew );    nTimes++; }
for ( i = 0; i < nTimes; i++ ) {
  object[] o = current.Receive();
  int t = (int) o [0];
  if ( t == 1 ) left = (BDChannel) o [1];        // the BDChannel came from the left process
  else if ( t == 2 ) right = (BDChannel) o [1];  // the BDChannel came from the right process
  else if ( t == 3 ) top = (BDChannel) o [1];    // the BDChannel came from the top process
  else if ( t == 4 ) bottom = (BDChannel) o [1]; // the BDChannel came from the bottom process
}
current = currentNew;

And this is the difference in execution time:

[Performance comparison graph]

54 If we compare these two implementations using the statistics provided by the MC# runtime, we'll see that in hpl_8.mcs the calculation on a 4000x4000 matrix transfers 974'032'636 bytes between nodes:

[skif gfft]$ mono hpl_8.exe /np 32
________________________________________________
==MC# Statistics================================
Number of movable calls: 100
Number of channel messages: 10
Number of movable calls (across network): 100
Number of channel messages (across network): 10
Total size of movable calls (across network): bytes
Total size of channel messages (across network): bytes
Total time of movable calls serialization: 00:00:
Total time of channel messages serialization: 00:00:
Total size of transported messages: bytes
Total time of transporting messages: 00:01:
Session initialization time: 00:00: / sec. / msec.
Total time: 00:20: / sec. / msec.
________________________________________________

while in hpl_7.mcs the calculation on the same 4000x4000 matrix transfers 1'819'549'593 bytes between nodes, almost twice as much:

[skif gfft]$ mono hpl_7.exe /np 32
________________________________________________
==MC# Statistics================================
Number of movable calls: 100
Number of channel messages: 10
Number of movable calls (across network): 100
Number of channel messages (across network): 10
Total size of movable calls (across network): bytes
Total size of channel messages (across network): bytes
Total time of movable calls serialization: 00:00:
Total time of channel messages serialization: 00:00:
Total size of transported messages: bytes
Total time of transporting messages: 00:02:
Session initialization time: 00:00: / sec. / msec.
Total time: 00:21: / sec. / msec.
________________________________________________

55 Explaining the figures / limitations of the implementation:

1) The implemented HPL algorithm was selected by the principle "as simple as possible to understand and to read in the final code". Performance can be improved significantly by using more advanced panel-broadcasting and update algorithms and look-ahead heuristics.

2) The MC# Runtime system has not yet been optimized for really large numbers of processors; it works quite well for clusters with NP <= 16 processors. Note that MC# is still a research project; it is just a matter of time before we get a really efficient runtime system.

3) Currently there are no broadcast operations in the MC# syntax. It looks like we will have to add such a capability to the language in the future.

4) HPL makes intensive use of network bandwidth. A speedup is possible if the MC# runtime uses SCI network adapters (currently in development); in these measurements we used standard Ethernet adapters.

5) MC# uses the standard .NET binary serializer for transferring objects from one node to another, which is quite memory-consuming. Better performance can be achieved by writing custom serializers.

6) The Mono implementation of the .NET platform is not yet as fast as the implementation from Microsoft.

56 Thanks for your time!

MC# Homepage: (the project site may be temporarily unavailable in October/November due to hardware upgrade works)

Special thanks to:
Yury P. Serdyuk - for his great work on the MC# project and help in preparing this document
Program Systems Institute / University of Pereslavl - for hosting the MC# project homepage

