
Slide 1: HIGH PERFORMANCE COMPUTING: MODELS, METHODS, & MEANS — APPLIED PARALLEL ALGORITHMS 1. Prof. Thomas Sterling, Dr. Hartmut Kaiser, Department of Computer Science, Louisiana State University, March 10th, 2011.

Slide 2: Dr. Hartmut Kaiser, Center for Computation & Technology, R315 Johnston, hkaiser@cct.lsu.edu

Slide 3: Puzzle of the Day. What's the difference between the following valid C function declarations?

    void foo();
    void foo(void);
    void foo(…);

Slide 4: Puzzle of the Day. In C:

    void foo();      → any number of parameters
    void foo(void);  → no parameters
    void foo(…);     → any number of parameters

And what's the difference between the following valid C++ function declarations?

    void foo();
    void foo(void);
    void foo(…);

Slide 5: Puzzle of the Day. In C:

    void foo();      → any number of parameters
    void foo(void);  → no parameters
    void foo(…);     → any number of parameters

In C++:

    void foo();      → no parameters
    void foo(void);  → no parameters
    void foo(…);     → any number of parameters
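To see the difference in practice, here is a minimal sketch (the file name, function body, and call are hypothetical; valid under C17 and earlier, where an empty parameter list leaves the parameters unspecified):

    /* puzzle.c -- compiles as C; the same file compiled as C++ is
       rejected, because in C++ `void foo();` takes no parameters */
    #include <stdio.h>

    void foo();           /* C: parameters unspecified; C++: none */

    int main(void)
    {
        foo(42);          /* a C compiler accepts this call; a C++
                             compiler reports a declaration mismatch */
        return 0;
    }

    void foo(int x) { printf("got %d\n", x); }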

Slide 6: Topics: Introduction; Mandelbrot Sets; Monte Carlo: Pi Calculation; Vector Dot-Product; Matrix Multiplication.

Slide 7: Topics: Introduction; Mandelbrot Sets; Monte Carlo: Pi Calculation; Vector Dot-Product; Matrix Multiplication.

Slide 8: Parallel Programming.
Goals:
- Correctness
- Reduction in execution time
- Efficiency
- Scalability
- Increased problem size and richness of models
Objectives:
- Expose parallelism: algorithm design
- Distribute work uniformly: data decomposition and allocation; dynamic load balancing
- Minimize overhead of synchronization and communication: coarse granularity, big messages
- Minimize redundant work: still sometimes better than communication

Slide 9: Basic Parallel (MPI) Program Steps.
- Establish logical bindings
- Initialize application execution environment
- Distribute data and work
- Perform core computations in parallel (across nodes)
- Synchronize and exchange intermediate results (optional for non-embarrassingly-parallel, i.e., cooperative, computations)
- Detect "stop" condition (may be implicit, e.g., with a barrier)
- Aggregate final results (often a reduction operator)
- Output results and error code
- Terminate and return to OS
A minimal skeleton illustrating these steps follows.
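This skeleton is a sketch, not lecture code; the "core computation" is a stand-in value per rank:

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int rank, size, local, global;

        MPI_Init(&argc, &argv);                  /* initialize execution environment */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* establish logical bindings */
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        local = rank + 1;                        /* stand-in for the distributed
                                                    core computation */

        /* aggregate final results -- a reduction operator; the collective
           also acts as the implicit "stop"/synchronization point */
        MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("aggregated result: %d\n", global);   /* output results */

        MPI_Finalize();                          /* terminate, return to OS */
        return 0;
    }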

Slide 10: "Embarrassingly parallel".
A common phrase — poorly defined but widely used. It suggests lots and lots of parallelism with essentially no inter-task communication or coordination: a highly partitionable workload with minimal overhead.
"Almost embarrassingly parallel" — the same, except that a master must launch the many tasks and collect their final results; sometimes still referred to as "embarrassingly parallel".

Slide 11: Topics: Introduction; Mandelbrot Sets; Monte Carlo: Pi Calculation; Vector Dot-Product; Matrix Multiplication.

Slide 12: Mandelbrot Set. (Slides from Wilkinson & Allen, Parallel Programming Techniques & Applications Using Networked Workstations & Parallel Computers, 2nd ed., © 2004 Pearson Education Inc. All rights reserved.)

Slide 13: Mandelbrot Set. The set of points in the complex plane that are quasi-stable (the iterates fluctuate but never exceed some limit) when computed by iterating the function

    z_{k+1} = z_k^2 + c

where z_{k+1} is the (k+1)-th iterate of the complex number z = a + bi, and c is the complex number giving the position of the point in the complex plane. The initial value of z is zero. Iteration continues until the magnitude of z exceeds 2 or the number of iterations reaches an arbitrary limit. The magnitude of z is the length of the vector

    |z| = sqrt(a^2 + b^2)

(Slides © 2004 Pearson Education Inc., Wilkinson & Allen.)

Slide 14: Sequential routine computing the value of one point, returning the number of iterations.

    struct complex {
        float real;
        float imag;
    };

    int cal_pixel(struct complex c)
    {
        int count, max;
        struct complex z;
        float temp, lengthsq;

        max = 256;
        z.real = 0; z.imag = 0;
        count = 0;                 /* number of iterations */
        do {
            temp = z.real * z.real - z.imag * z.imag + c.real;
            z.imag = 2 * z.real * z.imag + c.imag;
            z.real = temp;
            lengthsq = z.real * z.real + z.imag * z.imag;
            count++;
        } while ((lengthsq < 4.0) && (count < max));
        return count;
    }

(Slides © 2004 Pearson Education Inc., Wilkinson & Allen.)

Slide 15: Parallelizing the Mandelbrot Set Computation.
Static task assignment: simply divide the region into a fixed number of parts, each computed by a separate processor. Not very successful, because different regions require different numbers of iterations and hence different amounts of time.
Dynamic task assignment: have processors request new regions after completing their previous ones. (Slides © 2004 Pearson Education Inc., Wilkinson & Allen.)
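A sketch of the master side of such a work pool (the tag values, row-buffer size, and master() function are assumptions for illustration, not the lecture's code; each worker first sends a result message to ask for work):

    #include "mpi.h"

    #define TAG_WORK 1
    #define TAG_DONE 2

    /* hand out one row index at a time; a worker that returns a
       finished row immediately gets the next unassigned one */
    void master(int nprocs, int nrows)
    {
        int row = 0, stopped = 0;
        double rowbuf[640];                 /* hypothetical row buffer */
        MPI_Status status;

        while (stopped < nprocs - 1) {
            MPI_Recv(rowbuf, 640, MPI_DOUBLE, MPI_ANY_SOURCE,
                     MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            if (row < nrows) {              /* more work: send next row */
                MPI_Send(&row, 1, MPI_INT, status.MPI_SOURCE,
                         TAG_WORK, MPI_COMM_WORLD);
                row++;
            } else {                        /* drained: shut worker down */
                MPI_Send(&row, 1, MPI_INT, status.MPI_SOURCE,
                         TAG_DONE, MPI_COMM_WORLD);
                stopped++;
            }
        }
    }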

Slide 16: Dynamic Task Assignment — Work Pool / Processor Farms (figure). (Slides © 2004 Pearson Education Inc., Wilkinson & Allen.)

Slide 17: Flowchart for Mandelbrot Set Generation.
Master: initialize MPI environment → create local workload buffer → isolate work regions → calculate Mandelbrot set values across its work region → write result from task 0 to file → receive results from workers → concatenate results to file → end.
Workers: initialize MPI environment → create local workload buffer → isolate work regions → calculate Mandelbrot set values across the work region → send result to master.

Slide 18: Mandelbrot Sets (source code). cal_pixel() runs on every worker process and calculates the iteration count for every pixel.

    #include <stdio.h>
    #include <stdlib.h>
    #include "mpi.h"

    typedef struct complex {
        double real;
        double imag;
    } Complex;

    int cal_pixel(Complex c)
    {
        int count, max_iter;
        Complex z;
        double temp, lengthsq;

        max_iter = 256;
        z.real = 0; z.imag = 0;
        count = 0;
        do {
            temp = z.real * z.real - z.imag * z.imag + c.real;
            z.imag = 2 * z.real * z.imag + c.imag;
            z.real = temp;
            lengthsq = z.real * z.real + z.imag * z.imag;
            count++;
        } while ((lengthsq < 4.0) && (count < max_iter));
        return count;
    }

Source: http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/

Slide 19: Mandelbrot Sets (source code). Initialize the MPI environment; check that the input arguments (the x, y dimensions of the region to be processed) were passed.

    #define MASTERPE 0

    int main(int argc, char **argv)
    {
        FILE *file;
        int i, j;
        int tmp;
        Complex c;
        double *data_l, *data_l_tmp;
        int nx, ny;
        int mystrt, myend;
        int nrows_l;
        int nprocs, mype;
        MPI_Status status;

        /* initialize MPI environment */
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &mype);

        /* pass in the dimensions (X, Y) of the area to cover */
        if (argc != 3) {
            int err = 0;
            printf("argc %d\n", argc);
            if (mype == MASTERPE) {
                printf("usage: mandelbrot nx ny");
                MPI_Abort(MPI_COMM_WORLD, err);
            }
        }

        /* get command line args */
        nx = atoi(argv[1]);
        ny = atoi(argv[2]);

Source: http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/

Slide 20: Mandelbrot Sets (source code). Determine the dimensions of the work to be performed by each concurrent task; each task computes the coordinates of every pixel in its local region and calls cal_pixel() to obtain the corresponding value.

        /* assume nx divides equally among processes */
        nrows_l = nx / nprocs;
        mystrt = mype * nrows_l;
        myend = mystrt + nrows_l - 1;

        /* create buffer for local work only */
        data_l = (double *) malloc(nrows_l * ny * sizeof(double));
        data_l_tmp = data_l;

        /* calc each proc's coordinates and call the local Mandelbrot
           value generation function */
        for (i = mystrt; i <= myend; ++i) {
            c.real = i / ((double) nx) * 4. - 2.;
            for (j = 0; j < ny; ++j) {
                c.imag = j / ((double) ny) * 4. - 2.;
                tmp = cal_pixel(c);
                *data_l++ = (double) tmp;
            }
        }
        data_l = data_l_tmp;

Source: http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/

Slide 21: Mandelbrot Sets (source code). The master opens a file, stores its own values, then receives and appends the values computed by each worker; workers send the Mandelbrot values of their region to the master.

        if (mype == MASTERPE) {
            file = fopen("mandelbrot.bin_0000", "w");
            printf("nrows_l, ny %d %d\n", nrows_l, ny);
            fwrite(data_l, nrows_l*ny, sizeof(double), file);
            fclose(file);
            for (i = 1; i < nprocs; ++i) {
                MPI_Recv(data_l, nrows_l * ny, MPI_DOUBLE, i, 0,
                         MPI_COMM_WORLD, &status);
                printf("received message from proc %d\n", i);
                file = fopen("mandelbrot.bin_0000", "a");
                fwrite(data_l, nrows_l*ny, sizeof(double), file);
                fclose(file);
            }
        } else {
            MPI_Send(data_l, nrows_l * ny, MPI_DOUBLE, MASTERPE, 0,
                     MPI_COMM_WORLD);
        }

        MPI_Finalize();
    }

Source: http://people.cs.uchicago.edu/~asiegel/courses/cspp51085/lesson2/examples/

Slide 22: Demo: Mandelbrot Sets.

Slide 23: Demo: Mandelbrot Sets (screenshot).

Slide 24: Topics: Introduction; Mandelbrot Sets; Monte Carlo: Pi Calculation; Vector Dot-Product; Matrix Multiplication.

Slide 25: (figure only; no text content.)

Slide 26: Monte Carlo Simulation. Used when it is infeasible or impossible to compute an exact result with a deterministic algorithm. Especially useful for:
- studying systems with a large number of coupled degrees of freedom (fluids, disordered materials, strongly coupled solids, cellular structures);
- modeling phenomena with significant uncertainty in inputs (e.g., the calculation of risk in business).
These methods are also widely used in mathematics, e.g., for the evaluation of definite integrals, particularly multidimensional integrals with complicated boundary conditions.

Slide 27: Monte Carlo Simulation. There is no single approach — a multitude of different methods exists — but it usually follows this pattern:
1. Define a domain of possible inputs.
2. Generate inputs randomly from the domain.
3. Perform a deterministic computation using the inputs.
4. Aggregate the results of the individual computations into the final result.
Example: calculating Pi.

Slide 28: Monte Carlo: Algorithm for Pi. The value of Pi can be calculated in a number of ways. Consider the following method of approximating Pi: inscribe a circle in a square and randomly generate points in the square. Determine the number of points that also fall inside the circle; let r be the number of points inside the circle divided by the total number of points. Then Pi ≈ 4r. Note that the more points generated, the better the approximation.

Algorithm:

    npoints = 10000
    circle_count = 0
    do j = 1, npoints
        generate 2 random numbers between 0 and 1
        xcoordinate = random1; ycoordinate = random2
        if (xcoordinate, ycoordinate) inside circle
            then circle_count = circle_count + 1
    end do
    PI = 4.0 * circle_count / npoints
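A direct serial C rendering of this pseudocode (a sketch; the OpenMP and MPI versions on the following slides parallelize the same loop):

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const int npoints = 10000;
        int j, circle_count = 0;
        double x, y;

        srand(42);                            /* fixed seed, for repeatability */
        for (j = 0; j < npoints; j++) {
            x = (double) rand() / RAND_MAX;   /* random point in the unit square */
            y = (double) rand() / RAND_MAX;
            if (x * x + y * y <= 1.0)         /* inside the inscribed circle? */
                circle_count++;
        }
        printf("pi ~ %f\n", 4.0 * circle_count / npoints);
        return 0;
    }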

Slide 29: (figure only; no text content.)

Slide 30: OpenMP Pi Calculation (flowchart). Master thread: initialize variables → initialize OpenMP parallel environment → spawn N worker threads → reduction ∑ → calculate Pi → print value of Pi. Each worker thread: generate random X, Y → calculate Z = X² + Y² → if the point lies within the circle, count++.

Slide 31: OpenMP: Calculating Pi. SEED is the seed for generating random numbers.

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <omp.h>

    #define SEED 42

    int main(int argc, char *argv[])
    {
        int niter = 0;
        double x, y;
        int i, tid, count = 0;  /* # of points in the 1st quadrant of unit circle */
        double z;
        double pi;
        time_t rawtime;
        struct tm *timeinfo;

        printf("Enter the number of iterations used to estimate pi: ");
        scanf("%d", &niter);
        time(&rawtime);
        timeinfo = localtime(&rawtime);

http://www.umsl.edu/~siegelj/cs4790/openmp/pimonti_omp.c.HTML

Slide 32: OpenMP: Calculating Pi. Initialize the random number generator (srand seeds the sequence returned by rand()); start the OpenMP parallel for with reduction (+) on count; randomly generate x, y points; calculate x² + y² and, if the point lies within the circle, increment count.

        printf("The current date/time is: %s", asctime(timeinfo));

        /* initialize random numbers */
        srand(SEED);

    #pragma omp parallel for private(x,y,z,tid) reduction(+:count)
        for (i = 0; i < niter; i++) {
            x = (double) rand() / RAND_MAX;
            y = (double) rand() / RAND_MAX;
            z = (x*x + y*y);
            if (z <= 1) count++;
            if (i == (niter/6)-1) {
                tid = omp_get_thread_num();
                printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
            }
            if (i == (niter/3)-1) {
                tid = omp_get_thread_num();
                printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
            }
            if (i == (niter/2)-1) {
                tid = omp_get_thread_num();
                printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
            }

http://www.umsl.edu/~siegelj/cs4790/openmp/pimonti_omp.c.HTML

Slide 33: Calculating Pi. Calculate Pi from the aggregate count of the points that lie within the circle.

            if (i == (2*niter/3)-1) {
                tid = omp_get_thread_num();
                printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
            }
            if (i == (5*niter/6)-1) {
                tid = omp_get_thread_num();
                printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
            }
            if (i == niter-1) {
                tid = omp_get_thread_num();
                printf(" thread %i just did iteration %i the count is %i\n", tid, i, count);
            }
        }

        time(&rawtime);
        timeinfo = localtime(&rawtime);
        printf("The current date/time is: %s", asctime(timeinfo));
        printf(" the total count is %i\n", count);
        pi = (double) count / niter * 4;
        printf("# of trials= %d, estimate of pi is %g \n", niter, pi);
        return 0;
    }

http://www.umsl.edu/~siegelj/cs4790/openmp/pimonti_omp.c.HTML

Slide 34: Demo: OpenMP Pi.

    [cdekate@celeritas l13]$ ./omcpi
    Enter the number of iterations used to estimate pi: 100000
    The current date/time is: Tue Mar 4 05:53:52 2008
     thread 0 just did iteration 16665 the count is 13124
     thread 1 just did iteration 33332 the count is 6514
     thread 1 just did iteration 49999 the count is 19609
     thread 2 just did iteration 66665 the count is 13048
     thread 3 just did iteration 83332 the count is 6445
     thread 3 just did iteration 99999 the count is 19489
    The current date/time is: Tue Mar 4 05:53:52 2008
     the total count is 78320
    # of trials= 100000, estimate of pi is 3.1328
    [cdekate@celeritas l13]$

Slide 35: Creating Custom Communicators. Communicators define groups and the access patterns among them. The default communicator is MPI_COMM_WORLD. Some algorithms demand more sophisticated control of communications to take advantage of reduction operators; MPI permits the creation of custom communicators with MPI_Comm_create.
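A minimal sketch of the pattern used in the Pi code on the following slides — build a group that excludes one rank (the server), then create a communicator over it (the function and variable names here are assumptions):

    #include "mpi.h"

    /* carve a "workers" communicator out of MPI_COMM_WORLD by
       excluding one rank; on the excluded rank, *workers is
       returned as MPI_COMM_NULL */
    void make_workers(MPI_Comm *workers, int server_rank)
    {
        MPI_Group world_group, worker_group;
        int ranks[1];

        MPI_Comm_group(MPI_COMM_WORLD, &world_group);
        ranks[0] = server_rank;
        MPI_Group_excl(world_group, 1, ranks, &worker_group);
        MPI_Comm_create(MPI_COMM_WORLD, worker_group, workers);
        MPI_Group_free(&worker_group);
        MPI_Group_free(&world_group);
    }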

Slide 36: MPI Monte Carlo Pi Computation (flowchart).
Server: initialize MPI environment → receive request → compute random array → send array to requestor → repeat until last request → finalize MPI.
Master: initialize MPI environment → broadcast error bound → send request to server → receive random array → perform computations → propagate number of points (Allreduce) → output partial result → when the stop condition is satisfied, print statistics and finalize MPI.
Worker: initialize MPI environment → receive error bound → send request to server → receive random array → perform computations → propagate number of points (Allreduce) → when the stop condition is satisfied, finalize MPI.

Slide 37: Monte Carlo: MPI Pi (source code). Initialize the MPI environment.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include "mpi.h"

    #define CHUNKSIZE 1000
    #define INT_MAX 1000000000
    #define REQUEST 1
    #define REPLY 2

    int main(int argc, char *argv[])
    {
        int iter;
        int in, out, i, iters, max, ix, iy, ranks[1], done, temp;
        double x, y, Pi, error, epsilon;
        int numprocs, myid, server, totalin, totalout, workerid;
        int rands[CHUNKSIZE], request;
        MPI_Comm world, workers;
        MPI_Group world_group, worker_group;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        world = MPI_COMM_WORLD;
        MPI_Comm_size(world, &numprocs);
        MPI_Comm_rank(world, &myid);

Slide 38: Monte Carlo: MPI Pi (source code). Broadcast the error bound epsilon; create a custom communicator for the workers. The server process (1) receives a request to generate random numbers, (2) computes the random number array, and (3) sends the array to the requestor. A worker process begins by asking the server for a random number array.

        server = numprocs - 1;              /* last proc is server */
        if (myid == 0)
            sscanf(argv[1], "%lf", &epsilon);
        MPI_Bcast(&epsilon, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Comm_group(world, &world_group);
        ranks[0] = server;
        MPI_Group_excl(world_group, 1, ranks, &worker_group);
        MPI_Comm_create(world, worker_group, &workers);
        MPI_Group_free(&worker_group);

        if (myid == server) {
            do {
                MPI_Recv(&request, 1, MPI_INT, MPI_ANY_SOURCE, REQUEST,
                         world, &status);
                if (request) {
                    for (i = 0; i < CHUNKSIZE; ) {
                        rands[i] = random();
                        if (rands[i] <= INT_MAX) i++;
                    }
                    /* send random number array */
                    MPI_Send(rands, CHUNKSIZE, MPI_INT, status.MPI_SOURCE,
                             REPLY, world);
                }
            } while (request > 0);
        }
        else {                              /* begin worker block */
            request = 1;
            done = in = out = 0;
            max = INT_MAX;                  /* max int, for normalization */
            MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
            MPI_Comm_rank(workers, &workerid);
            iter = 0;

Slide 39: Monte Carlo: MPI Pi (source code). Each worker receives a random array from the server and, for each pair of values, calculates the (x, y) coordinates and determines whether the point lies inside or outside the circle. The workers combine their counts (Allreduce), compute the value of Pi, and check whether the error is within the threshold; the master prints the current value of Pi, and workers request a new array if not finished.

            while (!done) {
                iter++;
                request = 1;
                /* receive random array from server */
                MPI_Recv(rands, CHUNKSIZE, MPI_INT, server, REPLY, world, &status);
                for (i = 0; i < CHUNKSIZE - 1; ) {
                    x = (((double) rands[i++]) / max) * 2 - 1;
                    y = (((double) rands[i++]) / max) * 2 - 1;
                    if (x*x + y*y < 1.0) in++;
                    else out++;
                }
                MPI_Allreduce(&in, &totalin, 1, MPI_INT, MPI_SUM, workers);
                MPI_Allreduce(&out, &totalout, 1, MPI_INT, MPI_SUM, workers);
                Pi = (4.0 * totalin) / (totalin + totalout);
                error = fabs(Pi - 3.141592653589793238462643);
                done = (error < epsilon || (totalin + totalout) > 1000000);
                request = (done) ? 0 : 1;
                if (myid == 0) {    /* if "master": print current value of Pi */
                    printf("\rpi = %23.20f", Pi);
                    MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
                }
                else {              /* if "worker": request new array if not finished */
                    if (request)
                        MPI_Send(&request, 1, MPI_INT, server, REQUEST, world);
                }
            }
            MPI_Comm_free(&workers);
        }

Slide 40: Monte Carlo: MPI Pi (source code). Print the final value of Pi.

        if (myid == 0) {    /* if "master": print results */
            printf("\npoints: %d\nin: %d, out: %d, <ret> to exit\n",
                   totalin + totalout, totalin, totalout);
            getchar();
        }
        MPI_Finalize();
    }

Slide 41: Demo: MPI Monte Carlo Pi.

    > mpirun -np 4 monte 1e-20
    pi = 3.14164517741129456496
    points: 1000500
    in: 785804, out: 214696

Slide 42: Topics: Introduction; Mandelbrot Sets; Monte Carlo: Pi Calculation; Vector Dot-Product; Matrix Multiplication.

Slide 43: Vector Dot Product. Multiplication of two vectors followed by summation.

    A[i]        = (X1, X2, X3, X4, X5, …, Xn)
    B[i]        = (Y1, Y2, Y3, Y4, Y5, …, Yn)
    A[i] * B[i] = (X1*Y1, X2*Y2, X3*Y3, X4*Y4, X5*Y5, …, Xn*Yn)
    Dot product = X1*Y1 + X2*Y2 + … + Xn*Yn

Slide 44: OpenMP Dot Product using Reduction (flowchart). Master thread: initialize variables → initialize OpenMP parallel environment → N worker threads calculate their local computations → reduction: ∑ → print value of dot product. The workload and schedule are determined by OpenMP at runtime.

Slide 45: OpenMP Dot Product. A reduction example with summation, where the result of the reduction operation stores the dot product of the two vectors, ∑ a[i]*b[i].

    #include <stdio.h>
    #include <omp.h>

    int main()
    {
        int i, n, chunk;
        float a[16], b[16], result;

        n = 16;
        chunk = 4;
        result = 0.0;
        for (i = 0; i < n; i++) {
            a[i] = i * 1.0;
            b[i] = i * 2.0;
        }

    #pragma omp parallel for default(shared) private(i) \
        schedule(static,chunk) reduction(+:result)
        for (i = 0; i < n; i++)
            result = result + (a[i] * b[i]);

        printf("Final result= %f\n", result);
    }

SRC: https://computing.llnl.gov/tutorials/openMP/

Slide 46: Demo: Dot Product using Reduction.

    [cdekate@celeritas l12]$ ./reduction
        a[i]       b[i]       a[i]*b[i]
     0.000000   0.000000    0.000000
     1.000000   2.000000    2.000000
     2.000000   4.000000    8.000000
     3.000000   6.000000   18.000000
     4.000000   8.000000   32.000000
     5.000000  10.000000   50.000000
     6.000000  12.000000   72.000000
     7.000000  14.000000   98.000000
     8.000000  16.000000  128.000000
     9.000000  18.000000  162.000000
    10.000000  20.000000  200.000000
    11.000000  22.000000  242.000000
    12.000000  24.000000  288.000000
    13.000000  26.000000  338.000000
    14.000000  28.000000  392.000000
    15.000000  30.000000  450.000000
    Final result= 2480.000000
    [cdekate@celeritas l12]$

Slide 47: MPI Dot Product Computation (flowchart).
Master: initialize variables → initialize MPI environment → broadcast size of vectors → get vector A and distribute partitioned vector A → get vector B and distribute partitioned vector B → calculate dot product for local workload → reduction ∑ → print result.
Worker: initialize variables → initialize MPI environment → receive size of vectors → receive local workload for vector A → receive local workload for vector B → calculate dot product for local workload → reduction ∑.

Slide 48: MPI Dot Product. Initialize the MPI environment; broadcast the order of the vectors to the workers.

    #include <stdio.h>
    #include "mpi.h"

    #define MAX_LOCAL_ORDER 100

    int main(int argc, char* argv[])
    {
        float local_x[MAX_LOCAL_ORDER];
        float local_y[MAX_LOCAL_ORDER];
        int n;
        int n_bar;    /* = n/p */
        float dot;
        int p;
        int my_rank;
        void Read_vector(char* prompt, float local_v[], int n_bar, int p, int my_rank);
        float Parallel_dot(float local_x[], float local_y[], int n_bar);

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &p);
        MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

        if (my_rank == 0) {
            printf("Enter the order of the vectors\n");
            scanf("%d", &n);
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

From Parallel Programming with MPI by Peter Pacheco.

Slide 49: MPI Dot Product. Read and distribute the two vectors; calculate the parallel dot product over the local workloads; the master prints the result.

        n_bar = n / p;
        Read_vector("the first vector", local_x, n_bar, p, my_rank);
        Read_vector("the second vector", local_y, n_bar, p, my_rank);
        dot = Parallel_dot(local_x, local_y, n_bar);
        if (my_rank == 0)
            printf("The dot product is %f\n", dot);
        MPI_Finalize();
    }   /* main */

    void Read_vector(
            char*  prompt     /* in  */,
            float  local_v[]  /* out */,
            int    n_bar      /* in  */,
            int    p          /* in  */,
            int    my_rank    /* in  */)
    {
        int i, q;

From Parallel Programming with MPI by Peter Pacheco.

Slide 50: MPI Dot Product. Master: get the input from the user, buffer each worker's chunk in a temporary array, and send the array to that worker for processing. Worker: receive the local workload to be processed.

        float temp[MAX_LOCAL_ORDER];
        MPI_Status status;

        if (my_rank == 0) {
            printf("Enter %s\n", prompt);
            for (i = 0; i < n_bar; i++)
                scanf("%f", &local_v[i]);
            for (q = 1; q < p; q++) {
                for (i = 0; i < n_bar; i++)
                    scanf("%f", &temp[i]);
                MPI_Send(temp, n_bar, MPI_FLOAT, q, 0, MPI_COMM_WORLD);
            }
        } else {
            MPI_Recv(local_v, n_bar, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status);
        }
    }   /* Read_vector */

    float Serial_dot(
            float  x[]  /* in */,

From Parallel Programming with MPI by Peter Pacheco.

Slide 51: MPI Dot Product. Serial_dot() calculates the dot product of the local arrays; Parallel_dot() calls Serial_dot() on the local workload, then sums the local results with the collective MPI_Reduce call (MPI_SUM).

            float  y[]  /* in */,
            int    n    /* in */)
    {
        int i;
        float sum = 0.0;

        for (i = 0; i < n; i++)
            sum = sum + x[i] * y[i];
        return sum;
    }   /* Serial_dot */

    float Parallel_dot(
            float  local_x[]  /* in */,
            float  local_y[]  /* in */,
            int    n_bar      /* in */)
    {
        float local_dot;
        float dot = 0.0;

        local_dot = Serial_dot(local_x, local_y, n_bar);
        MPI_Reduce(&local_dot, &dot, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
        return dot;
    }   /* Parallel_dot */

From Parallel Programming with MPI by Peter Pacheco.

Slide 52: Demo: MPI Dot Product.

    [cdekate@celeritas l13]$ mpirun … ./mpi_dot
    Enter the order of the vectors
    16
    Enter the first vector
    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Enter the second vector
    0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
    The dot product is 2480.000000
    [cdekate@celeritas l13]$

Slide 53: Topics: Introduction; Mandelbrot Sets; Monte Carlo: Pi Calculation; Vector Dot-Product; Matrix Multiplication.

Slide 54: Matrix-Vector Multiplication (figure). (Slides © 2004 Pearson Education Inc., Wilkinson & Allen.)

Slide 55: Matrix-Vector Multiplication: c = A × b (figure).

Slide 56: Implementing Matrix Multiplication — Sequential Code. Assume throughout that the matrices are square (n × n). The sequential code to compute A × B could simply be:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            c[i][j] = 0;
            for (k = 0; k < n; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
        }

This algorithm requires n³ multiplications and n³ additions, leading to a sequential time complexity of O(n³). Very easy to parallelize. (Slides © 2004 Pearson Education Inc., Wilkinson & Allen.)

Slide 57: Implementing Matrix Multiplication. With n × n matrices we can obtain:
- Time complexity O(n²) with n processors: each instance of the inner loop is independent and can be done by a separate processor.
- Time complexity O(n) with n² processors: one element of A and B assigned to each processor. Cost-optimal, since O(n³) = n × O(n²) = n² × O(n).
- Time complexity O(log n) with n³ processors: by parallelizing the inner loop. Not cost-optimal, since O(n³) < n³ × O(log n).
O(log n) is the lower bound for parallel matrix multiplication.

Slide 58: Block Matrix Multiplication — partitioning into sub-matrices (figure). (Slides © 2004 Pearson Education Inc., Wilkinson & Allen.)
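A serial sketch of the blocked loop structure behind this partitioning (my illustration, not the lecture's code; the block size s is assumed to divide n evenly):

    /* blocked (tiled) matrix multiply: iterate over s x s sub-matrices,
       then run the ordinary triple loop inside each tile (C99 VLAs) */
    void block_matmul(int n, int s, double a[n][n], double b[n][n], double c[n][n])
    {
        int p, q, r, i, j, k;

        for (i = 0; i < n; i++)                    /* clear the result */
            for (j = 0; j < n; j++)
                c[i][j] = 0.0;

        for (p = 0; p < n; p += s)                 /* tile row of C    */
            for (q = 0; q < n; q += s)             /* tile column of C */
                for (r = 0; r < n; r += s)         /* C_pq += A_pr * B_rq */
                    for (i = p; i < p + s; i++)
                        for (j = q; j < q + s; j++)
                            for (k = r; k < r + s; k++)
                                c[i][j] += a[i][k] * b[k][j];
    }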

Slide 59: Matrix Multiplication (figure). (Slides © 2004 Pearson Education Inc., Wilkinson & Allen.)

Slide 60: Performance Improvement. Using tree construction, n numbers can be added in O(log n) steps (using n³ processors; figure). (Slides © 2004 Pearson Education Inc., Wilkinson & Allen.)
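A serial sketch that mimics the pairwise tree summation the figure depicts — each pass corresponds to one level of the tree, so n values are combined in log₂(n) steps (assumes n is a power of two; the function name is mine):

    /* tree summation: each pass halves the number of partial sums */
    double tree_sum(double *x, int n)       /* n assumed a power of two */
    {
        int stride, i;

        for (stride = 1; stride < n; stride *= 2)      /* one tree level per pass */
            for (i = 0; i + stride < n; i += 2 * stride)
                x[i] += x[i + stride];                 /* pairwise combine */
        return x[0];
    }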

Slide 61: OpenMP: Flowchart for Matrix Multiplication. Initialize variables and matrices → initialize OpenMP environment → compute the matrix product for the local workload → print results. The schedule and workload chunk size are determined from user preferences at compile/run time. Since each thread works on its own portion of the arrays and updates a different part of the same result array, no synchronization is needed.

Slide 62: OpenMP Matrix Multiplication. Initialize the two matrices A[][] and B[][] with the sum of their index values.

    #include <stdio.h>
    #include <omp.h>

    /* Main Program */
    int main()
    {
        int NoofRows_A, NoofCols_A, NoofRows_B, NoofCols_B, i, j, k;

        NoofRows_A = NoofCols_A = NoofRows_B = NoofCols_B = 4;
        float Matrix_A[NoofRows_A][NoofCols_A];
        float Matrix_B[NoofRows_B][NoofCols_B];
        float Result[NoofRows_A][NoofCols_B];

        /* Matrix_A elements */
        for (i = 0; i < NoofRows_A; i++) {
            for (j = 0; j < NoofCols_A; j++)
                Matrix_A[i][j] = i + j;
        }
        /* Matrix_B elements */
        for (i = 0; i < NoofRows_B; i++) {
            for (j = 0; j < NoofCols_B; j++)
                Matrix_B[i][j] = i + j;
        }
        printf("The Matrix_A Is \n");

SRC: https://computing.llnl.gov/tutorials/openMP/

Slide 63: OpenMP Matrix Multiplication. Print the matrices for debugging purposes; initialize the result matrix with 0.0; then use the OpenMP parallel for directive to calculate the product of the two matrices. Load balancing is done based on the values of the OpenMP environment variables and the number of threads.

        for (i = 0; i < NoofRows_A; i++) {
            for (j = 0; j < NoofCols_A; j++)
                printf("%f \t", Matrix_A[i][j]);
            printf("\n");
        }
        printf("The Matrix_B Is \n");
        for (i = 0; i < NoofRows_B; i++) {
            for (j = 0; j < NoofCols_B; j++)
                printf("%f \t", Matrix_B[i][j]);
            printf("\n");
        }
        for (i = 0; i < NoofRows_A; i++) {
            for (j = 0; j < NoofCols_B; j++) {
                Result[i][j] = 0.0;
            }
        }

    #pragma omp parallel for private(j,k)
        for (i = 0; i < NoofRows_A; i = i + 1)
            for (j = 0; j < NoofCols_B; j = j + 1)
                for (k = 0; k < NoofCols_A; k = k + 1)
                    Result[i][j] = Result[i][j] + Matrix_A[i][k] * Matrix_B[k][j];

        printf("\nThe Matrix Computation Result Is \n");

SRC: https://computing.llnl.gov/tutorials/openMP/

Slide 64: OpenMP Matrix Multiplication.

        for (i = 0; i < NoofRows_A; i = i + 1) {
            for (j = 0; j < NoofCols_B; j = j + 1)
                printf("%f ", Result[i][j]);
            printf("\n");
        }
    }

SRC: https://computing.llnl.gov/tutorials/openMP/

Slide 65: Demo: OpenMP Matrix Multiplication.

    [cdekate@celeritas l13]$ ./omp_mm
    The Matrix_A Is
    0.000000  1.000000  2.000000  3.000000
    1.000000  2.000000  3.000000  4.000000
    2.000000  3.000000  4.000000  5.000000
    3.000000  4.000000  5.000000  6.000000
    The Matrix_B Is
    0.000000  1.000000  2.000000  3.000000
    1.000000  2.000000  3.000000  4.000000
    2.000000  3.000000  4.000000  5.000000
    3.000000  4.000000  5.000000  6.000000
    The Matrix Computation Result Is
    14.000000  20.000000  26.000000  32.000000
    20.000000  30.000000  40.000000  50.000000
    26.000000  40.000000  54.000000  68.000000
    32.000000  50.000000  68.000000  86.000000
    [cdekate@celeritas l13]$

Slide 66: Flowchart for MPI Matrix Multiplication.
Master: initialize MPI environment → initialize array → partition array into workloads → send workloads to workers → wait for workers to finish → receive results → print results → end.
Workers: initialize MPI environment → receive work → calculate matrix product → send result to master.

Slide 67: Matrix Multiplication (source code). Initialize the MPI environment.

    #include "mpi.h"
    #include <stdio.h>
    #include <stdlib.h>

    #define NRA 4          /* number of rows in matrix A */
    #define NCA 4          /* number of columns in matrix A */
    #define NCB 4          /* number of columns in matrix B */
    #define MASTER 0       /* taskid of first task */
    #define FROM_MASTER 1  /* setting a message type */
    #define FROM_WORKER 2  /* setting a message type */

    int main(int argc, char *argv[])
    {
        int numtasks,      /* number of tasks in partition */
            taskid,        /* a task identifier */
            numworkers,    /* number of worker tasks */
            source,        /* task id of message source */
            dest,          /* task id of message destination */
            mtype,         /* message type */
            rows,          /* rows of matrix A sent to each worker */
            averow, extra, offset,  /* used to determine rows sent to each worker */
            i, j, k, rc;   /* misc */
        double a[NRA][NCA],  /* matrix A to be multiplied */
               b[NCA][NCB],  /* matrix B to be multiplied */
               c[NRA][NCB];  /* result matrix C */
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &taskid);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

Source: http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

Slide 68: Matrix Multiplication (source code). Master: initialize matrices A and B, print them for debugging purposes, then calculate the base number of rows to be processed by each worker and the overflow rows to be processed additionally.

        if (numtasks < 2) {
            printf("Need at least two MPI tasks. Quitting...\n");
            MPI_Abort(MPI_COMM_WORLD, rc);
            exit(1);
        }
        numworkers = numtasks - 1;

        if (taskid == MASTER) {
            for (i = 0; i < NRA; i++)
                for (j = 0; j < NCA; j++) {
                    a[i][j] = i + j + 1;
                    b[i][j] = i + j + 1;
                }
            printf("Matrix A :: \n");
            for (i = 0; i < NRA; i++) {
                printf("\n");
                for (j = 0; j < NCB; j++)
                    printf("%6.2f ", a[i][j]);
            }
            printf("Matrix B :: \n");
            for (i = 0; i < NRA; i++) {
                printf("\n");
                for (j = 0; j < NCB; j++)
                    printf("%6.2f ", b[i][j]);
            }
            averow = NRA / numworkers;
            extra = NRA % numworkers;
            offset = 0;
            mtype = FROM_MASTER;

Source: http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

Slide 69: Matrix Multiplication (source code). Master: send each worker its starting offset, its number of rows, and the sub-arrays to process; then receive the computed chunks back (c[][] accumulates the matrix products calculated for each workload chunk by the corresponding worker) and print the result matrix.

            for (dest = 1; dest <= numworkers; dest++) {
                /* to each worker send: start point, number of rows
                   to process, and sub-arrays to process */
                rows = (dest <= extra) ? averow + 1 : averow;
                printf("Sending %d rows to task %d offset=%d\n", rows, dest, offset);
                MPI_Send(&offset, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
                MPI_Send(&rows, 1, MPI_INT, dest, mtype, MPI_COMM_WORLD);
                MPI_Send(&a[offset][0], rows*NCA, MPI_DOUBLE, dest, mtype,
                         MPI_COMM_WORLD);
                MPI_Send(&b, NCA*NCB, MPI_DOUBLE, dest, mtype, MPI_COMM_WORLD);
                offset = offset + rows;
            }

            /* receive results from worker tasks */
            mtype = FROM_WORKER;   /* message tag for messages sent by workers */
            for (i = 1; i <= numworkers; i++) {
                source = i;
                /* offset stores the (processing) starting point of the work chunk */
                MPI_Recv(&offset, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
                MPI_Recv(&rows, 1, MPI_INT, source, mtype, MPI_COMM_WORLD, &status);
                MPI_Recv(&c[offset][0], rows*NCB, MPI_DOUBLE, source, mtype,
                         MPI_COMM_WORLD, &status);
                printf("Received results from task %d\n", source);
            }

            printf("******************************************************\n");
            printf("Result Matrix:\n");
            for (i = 0; i < NRA; i++) {
                printf("\n");
                for (j = 0; j < NCB; j++)
                    printf("%6.2f ", c[i][j]);
            }
            printf("\n******************************************************\n");
            printf("Done.\n");
        }

Source: http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

Slide 70: Matrix Multiplication (source code). Worker: receive the workload to be processed, calculate the matrix product and store the result in c[][], then send the computed results array back to the master.

        /**************************** worker task ****************************/
        if (taskid > MASTER) {
            mtype = FROM_MASTER;
            MPI_Recv(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
            MPI_Recv(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD, &status);
            MPI_Recv(&a, rows*NCA, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);
            MPI_Recv(&b, NCA*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD, &status);

            for (k = 0; k < NCB; k++)
                for (i = 0; i < rows; i++) {
                    c[i][k] = 0.0;
                    for (j = 0; j < NCA; j++)
                        /* calculate the product and store the result in C */
                        c[i][k] = c[i][k] + a[i][j] * b[j][k];
                }

            mtype = FROM_WORKER;
            MPI_Send(&offset, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
            MPI_Send(&rows, 1, MPI_INT, MASTER, mtype, MPI_COMM_WORLD);
            /* worker sends the resultant array to the master */
            MPI_Send(&c, rows*NCB, MPI_DOUBLE, MASTER, mtype, MPI_COMM_WORLD);
        }
        MPI_Finalize();
    }

Source: http://www.llnl.gov/computing/tutorials/mpi/samples/C/mpi_mm.c

Slide 71: Demo: Matrix Multiplication.

    [cdekate@celeritas matrix_multiplication]$ mpirun -np 4 -machinefile ~/hosts ./mpi_mm
    mpi_mm has started with 4 tasks.
    Initializing arrays...
    Matrix A ::
     1.00  2.00  3.00  4.00
     2.00  3.00  4.00  5.00
     3.00  4.00  5.00  6.00
     4.00  5.00  6.00  7.00
    Matrix B ::
     1.00  2.00  3.00  4.00
     2.00  3.00  4.00  5.00
     3.00  4.00  5.00  6.00
     4.00  5.00  6.00  7.00
    Sending 2 rows to task 1 offset=0
    Sending 1 rows to task 2 offset=2
    Sending 1 rows to task 3 offset=3
    Received results from task 1
    Received results from task 2
    Received results from task 3
    Result Matrix:
     30.00  40.00  50.00  60.00
     40.00  54.00  68.00  82.00
     50.00  68.00  86.00 104.00
     60.00  82.00 104.00 126.00
    [cdekate@celeritas matrix_multiplication]$

Slide 72: (no text content.)

