12.1 Computational Grids ITCS 4010 Grid Computing, 2005, UNC-Charlotte, B. Wilkinson.

12.2 Computational Problems Problems that have lots of computations and usually lots of data.

12.3 Demand for Computational Speed Continual demand for greater computational speed from a computer system than is currently possible. Areas requiring great computational speed include numerical modeling and simulation of scientific and engineering problems. Computations must be completed within a “reasonable” time period.

12.4 Grand Challenge Problems One that cannot be solved in a reasonable amount of time with today’s computers. Obviously, an execution time of 10 years is always unreasonable. Examples: modeling large DNA structures, global weather forecasting, modeling motion of astronomical bodies.

12.5 Weather Forecasting Atmosphere modeled by dividing it into 3-dimensional cells. Calculations of each cell repeated many times to model passage of time.

12.6 Global Weather Forecasting Example Suppose whole global atmosphere divided into cells of size 1 mile × 1 mile × 1 mile to a height of 10 miles (10 cells high) - about 5 × 10^8 cells. Suppose each calculation requires 200 floating point operations. In one time step, about 10^11 floating point operations necessary. To forecast the weather over 7 days using 1-minute intervals, a computer operating at 1 Gflops (10^9 floating point operations/s) takes 10^6 seconds or over 10 days. To perform calculation in 5 minutes requires computer operating at 3.4 Tflops (3.4 × 10^12 floating point operations/sec).
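A minimal C sketch that reproduces the arithmetic above (the cell count, operations per cell, and target times are the assumptions stated in the slide):

#include <stdio.h>

int main(void)
{
    double cells = 5e8;             /* ~1-mile cells, 10 high, over the whole globe */
    double ops_per_cell = 200.0;    /* floating point operations per cell update */
    double steps = 7.0 * 24 * 60;   /* 7 days at 1-minute intervals */

    double ops_per_step = cells * ops_per_cell;   /* about 10^11 */
    double total_ops = ops_per_step * steps;      /* about 10^15 */

    printf("Time at 1 Gflops: %.1f days\n", total_ops / 1e9 / 86400);
    printf("Speed needed for a 5-minute forecast: %.2f Tflops\n", total_ops / 300 / 1e12);
    return 0;
}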

12.7 Modeling Motion of Astronomical Bodies Each body attracted to each other body by gravitational forces. Movement of each body predicted by calculating total force on each body. With N bodies, N - 1 forces to calculate for each body, or approx. N^2 calculations. (N log2 N for an efficient approximate algorithm.) After determining new positions of bodies, calculations repeated.

12.8 A galaxy might have, say, 10^11 stars. Even if each calculation done in 1 ms (extremely optimistic figure), it takes 10^9 years for one iteration using the N^2 algorithm and almost a year for one iteration using an efficient N log2 N approximate algorithm.
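A minimal C sketch of the direct O(N^2) force calculation for one time step (the 3D arrays and the use of the gravitational constant G are illustrative, not code from the course):

#include <math.h>

#define G 6.674e-11                 /* gravitational constant */

/* For each body, sum the gravitational force exerted on it by every other body. */
void compute_forces(int N, const double pos[][3], const double mass[], double force[][3])
{
    for (int i = 0; i < N; i++) {
        force[i][0] = force[i][1] = force[i][2] = 0.0;
        for (int j = 0; j < N; j++) {
            if (j == i) continue;
            double dx = pos[j][0] - pos[i][0];
            double dy = pos[j][1] - pos[i][1];
            double dz = pos[j][2] - pos[i][2];
            double r2 = dx * dx + dy * dy + dz * dz;
            double r = sqrt(r2);
            double f = G * mass[i] * mass[j] / r2;   /* force magnitude */
            force[i][0] += f * dx / r;               /* force components */
            force[i][1] += f * dy / r;
            force[i][2] += f * dz / r;
        }
    }
}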

12.9 Astrophysical N-body simulation by Scott Linssen (undergraduate UNC-Charlotte student).

12.10 High Performance Computing (HPC) Traditionally, achieved by using multiple computers together - parallel computing. Simple idea! -- Using multiple computers (or processors) simultaneously should be able to solve the problem faster than a single computer.

12.11 High Performance Computing Long History: –Multiprocessor systems of various types (1950’s onwards) –Supercomputers (1960s-80’s) –Cluster computing (1990’s) –Grid computing (2000’s) ?? Maybe, but let’s first look at how to achieve HPC.

12.12 Speedup Factor S(p) = (execution time using one processor, best sequential algorithm) / (execution time using a multiprocessor with p processors) = t_s / t_p, where t_s is execution time on a single processor and t_p is execution time on a multiprocessor. S(p) gives increase in speed by using multiprocessor. Use best sequential algorithm with single processor system. Underlying algorithm for parallel implementation might be (and is usually) different.

12.13 Maximum Speedup Maximum speedup is usually p with p processors (linear speedup). Possible to get superlinear speedup (greater than p) but usually there is a specific reason, such as extra memory in the multiprocessor system or a nondeterministic algorithm.

12.14 Maximum Speedup - Amdahl’s law. Figure: with one processor, the execution time t_s divides into a serial section f t_s and parallelizable sections (1 - f)t_s; with p processors, the parallelizable part takes (1 - f)t_s / p, giving t_p = f t_s + (1 - f)t_s / p.

12.15 Speedup factor is given by: S(p) = t_s / (f t_s + (1 - f)t_s / p) = p / (1 + (p - 1)f). This equation is known as Amdahl’s law.

12.16 Figure: speedup factor S(p) plotted against number of processors p, for serial fractions f = 20%, 10%, 5%, and 0%.

12.17 Even with infinite number of processors, maximum speedup limited to 1/f. Example With only 5% of computation being serial, maximum speedup is 20, irrespective of number of processors.
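A small C sketch evaluating Amdahl’s law for an assumed serial fraction (the values of f and p are illustrative, not from the slides):

#include <stdio.h>

/* Amdahl's law: S(p) = p / (1 + (p - 1)f) */
double speedup(int p, double f)
{
    return p / (1.0 + (p - 1) * f);
}

int main(void)
{
    double f = 0.05;                          /* 5% of the computation is serial */
    int procs[] = {1, 4, 16, 64, 256, 1024};
    for (int i = 0; i < 6; i++)
        printf("p = %4d   S(p) = %.2f\n", procs[i], speedup(procs[i], f));
    printf("Limit as p -> infinity: %.1f\n", 1.0 / f);   /* 20.0 */
    return 0;
}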

12.18 Superlinear Speedup Example - Searching. Figure (a): searching each sub-space sequentially, each sub-space taking t_s / p; the solution is found a time Δt into the search of one sub-space, after x complete sub-space searches (x t_s / p), where x is indeterminate.

12.19 Figure (b): searching each sub-space in parallel; the solution is found after time Δt.

12.20 Question What is the speed-up now?

12.21 Speed-up then given by: S(p) = (x (t_s / p) + Δt) / Δt.

12.22 Worst case for sequential search when solution found in last sub-space search. Then parallel version offers greatest benefit, i.e. S(p) = (((p - 1)/p) t_s + Δt) / Δt → ∞ as Δt tends to zero.

12.23 Least advantage for parallel version when solution found in first sub-space search of the sequential search, i.e. S(p) = Δt / Δt = 1. Actual speed-up depends upon which subspace holds solution but could be extremely large.

12.24 Computing Platforms for Parallel Programming

12.25 Types of Parallel Computers Two principal types: 1. Single computer containing multiple processors - main memory is shared, hence called “Shared memory multiprocessor” 2. Interconnected multiple computer systems

12.26 Conventional Computer Consists of a processor executing a program stored in a (main) memory. Each main memory location located by its address. Addresses start at 0 and extend to 2^b - 1 when there are b bits (binary digits) in address. Figure: main memory connected to the processor, with instructions flowing to the processor and data to or from the processor.

12.27 Shared Memory Multiprocessor Extend single processor model - multiple processors connected to a single shared memory with a single address space. Figure: several processors attached to one memory. A real system will have cache memory associated with each processor.

12.28 Examples Dual Pentiums Quad Pentiums

12.29 Quad Pentium Shared Memory Multiprocessor. Figure: four processors, each with an L1 cache, L2 cache, and bus interface, share a processor/memory bus; a memory controller connects the bus to the shared memory, and an I/O interface connects it to the I/O bus.

12.30 Programming Shared Memory Multiprocessors Threads - programmer decomposes program into parallel sequences (threads), each being able to access variables declared outside threads. Example: Pthreads Use sequential programming language with preprocessor compiler directives, constructs, or syntax to declare shared variables and specify parallelism. Examples: OpenMP (an industry standard), UPC (Unified Parallel C) -- needs compilers.
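A minimal Pthreads sketch of the thread approach described above (the work split and variable names are illustrative, not course code); each thread can read and write the shared array declared outside the threads:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 8

int sum[NTHREADS];              /* shared: declared outside the threads */

void *work(void *arg)
{
    int id = *(int *)arg;       /* each thread receives its own id */
    sum[id] = id * id;          /* each thread writes only its own slot */
    return NULL;
}

int main(void)
{
    pthread_t threads[NTHREADS];
    int ids[NTHREADS];

    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, work, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);
    for (int i = 0; i < NTHREADS; i++)
        printf("sum[%d] = %d\n", i, sum[i]);
    return 0;
}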

12.31 Parallel programming language with syntax to express parallelism. Compiler creates executable code -- not now common. Use parallelizing compiler to convert regular sequential language programs into parallel executable code - also not now common.

12.32 Message-Passing Multicomputer Complete computers connected through an interconnection network. Figure: computers, each with a processor and local memory, exchange messages across the interconnection network.

12.33 Dedicated cluster with a master node. Figure: a user reaches the master node over an external network via a second Ethernet interface; the master node's other Ethernet interface connects through a switch to the compute nodes that form the cluster.

12.34 UNC-C’s cluster used for grid course (Department of Computer Science). Figure: four dual-Xeon Pentium nodes (coit-grid01 to coit-grid04), each with two processors and memory, connected by a switch to the external network. Funding for this cluster provided by the University of North Carolina, Office of the President, specifically for the grid computing course.

12.35 Programming Clusters Usually based upon explicit message-passing. Common approach -- a set of user-level libraries for message passing. Examples: –Parallel Virtual Machine (PVM) - late 1980’s. Became very popular in mid 1990’s. –Message-Passing Interface (MPI) - standard defined in 1990’s and now dominant.

12.36 MPI (Message Passing Interface) Message passing library standard developed by group of academics and industrial partners to foster more widespread use and portability. Defines routines, not implementation. Several free implementations exist.

12.37 MPI designed: To address some problems with earlier message-passing systems such as PVM. To provide powerful message-passing mechanism and routines - over 126 routines (although it is said that one can write reasonable MPI programs with just 6 MPI routines).
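A minimal sketch using only those six routines (MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Send, MPI_Recv, MPI_Finalize); the payload value is illustrative:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                  /* illustrative payload */
        for (int i = 1; i < size; i++)
            MPI_Send(&value, 1, MPI_INT, i, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Process %d received %d\n", rank, value);
    }
    MPI_Finalize();
    return 0;
}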

12.38 Message-Passing Programming using User-level Message Passing Libraries Two primary mechanisms needed: 1. A method of creating separate processes for execution on different computers 2. A method of sending and receiving messages

12.39 Multiple program, multiple data (MPMD) model. Figure: separate source files, each compiled to suit its processor, produce a different executable for each of processor 0 through processor p - 1.

12.40 Single Program Multiple Data (SPMD) model - the basic MPI way. Different processes merged into one program; control statements select different parts for each processor to execute. Figure: one source file is compiled to suit each processor, giving the executables run on processor 0 through processor p - 1.

12.41 Multiple Program Multiple Data (MPMD) Model Separate programs for each processor. One processor executes master process. Other processes started from within master process - dynamic process creation. Can be done with MPI version 2. Figure: process 1 calls spawn() and, some time later, process 2 starts execution.

12.42 Communicators Defines scope of a communication operation. Processes have ranks associated with communicator. Initially, all processes enrolled in a “universe” called MPI_COMM_WORLD, and each process is given a unique rank, a number from 0 to p - 1, with p processes. Other communicators can be established for groups of processes.

12.43 Using SPMD Computational Model

main (int argc, char *argv[])
{
    int myrank;
    MPI_Init(&argc, &argv);
    .
    .
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find rank */
    if (myrank == 0)
        master();
    else
        slave();
    .
    .
    MPI_Finalize();
}

where master() and slave() are to be executed by master process and slave process, respectively.

12.44 Basic “point-to-point” Send and Receive Routines Passing a message between processes using send() and recv() library calls (generic syntax; actual formats later). Figure: process 1 executes send(&x, 2) and process 2 executes recv(&y, 1); the data moves from x in process 1 to y in process 2.

12.45 Message Tag Used to differentiate between different types of messages being sent. Message tag is carried within message. If special type matching is not required, a wild card message tag is used, so that the recv() will match with any send().

12.46 Message Tag Example To send a message, x, with message tag 5 from a source process, 1, to a destination process, 2, and assign to y: process 1 executes send(&x, 2, 5) and process 2 executes recv(&y, 1, 5), which waits for a message from process 1 with a tag of 5; the data moves from x to y.

12.47 Synchronous Message Passing Routines that return when message transfer completed. Synchronous send routine: waits until complete message can be accepted by the receiving process before sending the message. Synchronous receive routine: waits until the message it is expecting arrives.

12.48 Synchronous send() and recv() using 3-way protocol. Figure (a), when send() occurs before recv(): process 1 calls send(), issues a request to send, and suspends; when process 2 calls recv(), an acknowledgment is returned, the message is transferred, and both processes continue.

12.49 Figure (b), when recv() occurs before send(): process 2 calls recv() and suspends; when process 1 calls send(), its request to send is acknowledged, the message is transferred, and both processes continue.

12.50 Synchronous routines intrinsically perform two actions: –They transfer data and –They synchronize processes.

12.51 Asynchronous Message Passing Routines that do not wait for actions to complete before returning. Usually require local storage for messages. More than one version depending upon the actual semantics for returning. In general, they do not synchronize processes but allow processes to move forward sooner. Must be used with care.

12.52 MPI Blocking and Non-Blocking Blocking - return after their local actions complete, though the message transfer may not have been completed. Non-blocking - return immediately. Assumes that data storage used for transfer not modified by subsequent statements prior to being used for transfer, and it is left to the programmer to ensure this. These terms may have different interpretations in other systems.

12.53 How message-passing routines return before message transfer completed: a message buffer is needed between source and destination to hold the message. Figure: process 1's send() places the message in the message buffer and the process continues; process 2's recv() later reads the message from the buffer.

12.54 Asynchronous routines changing to synchronous routines Buffers only of finite length and a point could be reached when send routine held up because all available buffer space exhausted. Then, send routine will wait until storage becomes re-available - i.e. the routine then behaves as a synchronous routine.

12.55 Parameters of MPI blocking send: MPI_Send(buf, count, datatype, dest, tag, comm), where buf is the address of the send buffer, count the number of items to send, datatype the datatype of each item, dest the rank of the destination process, tag the message tag, and comm the communicator.

12.56 Parameters of MPI blocking receive: MPI_Recv(buf, count, datatype, source, tag, comm, status), where buf is the address of the receive buffer, count the maximum number of items to receive, datatype the datatype of each item, source the rank of the source process, tag the message tag, comm the communicator, and status the status after the operation.

12.57 Example To send an integer x from process 0 to process 1:

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find rank */
if (myrank == 0) {
    int x;
    MPI_Send(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD);
} else if (myrank == 1) {
    int x;
    MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
}

12.58 MPI Nonblocking Routines Nonblocking send - MPI_Isend() - will return “immediately” even before source location is safe to be altered. Nonblocking receive - MPI_Irecv() - will return even if no message to accept.

12.59 Detecting when a message is received if sent with a non-blocking send routine Completion detected by MPI_Wait() and MPI_Test(). MPI_Wait() waits until operation completed and returns then. MPI_Test() returns with flag set indicating whether operation completed at that time. Need to know which particular send you are waiting for. Identified with request parameter.
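A small fragment sketching the MPI_Test() style of completion checking; the surrounding declarations are assumed, and the comment marks where useful computation would overlap the transfer:

MPI_Request req;
MPI_Status status;
int x = 99, flag = 0, msgtag = 1;

MPI_Isend(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD, &req);
while (!flag) {
    /* do useful computation here while the transfer is in progress */
    MPI_Test(&req, &flag, &status);   /* flag becomes non-zero once the send has completed */
}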

12.60 Example To send an integer x from process 0 to process 1 and allow process 0 to continue:

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);   /* find rank */
if (myrank == 0) {
    int x;
    MPI_Isend(&x, 1, MPI_INT, 1, msgtag, MPI_COMM_WORLD, &req1);
    compute();
    MPI_Wait(&req1, &status);
} else if (myrank == 1) {
    int x;
    MPI_Recv(&x, 1, MPI_INT, 0, msgtag, MPI_COMM_WORLD, &status);
}

12.61 “Group” message passing routines Have routines that send message(s) to a group of processes or receive message(s) from a group of processes Higher efficiency than separate point-to-point routines although not absolutely necessary.

12.62 Broadcast Sending same message to a group of processes. (Sometimes “Multicast” - sending same message to defined group of processes, “Broadcast” - to all processes.) Figure: every process from 0 to p - 1 calls MPI_Bcast(); the contents of buf on the root are delivered into the data buffers of all the other processes.

12.63 MPI Broadcast routine int MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int root, MPI_Comm comm) Actions: broadcasts message from root process to all processes in comm, including itself. Parameters: *buf - message buffer; count - number of entries in buffer; datatype - data type of buffer; root - rank of root.

12.64 Scatter Sending each element of an array in root process to a separate process. Contents of ith location of array sent to ith process. Figure: every process from 0 to p - 1 calls MPI_Scatter(); the ith part of buf on the root arrives in the data buffer of process i.
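A small sketch of MPI_Scatter() distributing one integer to each process (the array size and values are illustrative, and the root is assumed to be process 0):

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, mine;
    int table[64];                    /* root's send buffer; assumes at most 64 processes */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        for (int i = 0; i < size; i++)
            table[i] = i * 10;        /* element i is destined for process i */

    /* each process receives one int: element i of table arrives at process i */
    MPI_Scatter(table, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("Process %d received %d\n", rank, mine);

    MPI_Finalize();
    return 0;
}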

12.65 Gather Having one process collect individual values from set of processes. Figure: every process from 0 to p - 1 calls MPI_Gather(); the data from all processes is collected into buf on the root.

12.66 Reduce Gather operation combined with specified arithmetic/logical operation. Example: values could be gathered and then added together by root. Figure: every process from 0 to p - 1 calls MPI_Reduce(); the data values from all processes are combined (here with +) into buf on the root.

12.67 Collective Communication Involves set of processes, defined by an intra-communicator. Message tags not present. Principal collective operations: MPI_Bcast() - Broadcast from root to all other processes MPI_Gather() - Gather values for group of processes MPI_Scatter() - Scatters buffer in parts to group of processes MPI_Alltoall() - Sends data from all processes to all processes MPI_Reduce() - Combine values on all processes to single value MPI_Reduce_scatter() - Combine values and scatter results MPI_Scan() - Compute prefix reductions of data on processes

12.68 Example To gather items from group of processes into process 0, using dynamically allocated memory in root process:

int data[10];                                  /* data to be gathered from processes */
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);        /* find rank */
if (myrank == 0) {
    MPI_Comm_size(MPI_COMM_WORLD, &grp_size);  /* find group size */
    buf = (int *)malloc(grp_size * 10 * sizeof(int));   /* allocate memory */
}
MPI_Gather(data, 10, MPI_INT, buf, 10, MPI_INT, 0, MPI_COMM_WORLD);

Note that the receive count in MPI_Gather() is the number of items received from each process (here 10), not the total. MPI_Gather() gathers from all processes, including root.

12.69 Sample MPI program

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAXSIZE 1000

int main(int argc, char *argv[])
{
    int myid, numprocs;
    int data[MAXSIZE], i, x, low, high, myresult = 0, result;
    char fn[255];
    FILE *fp;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid == 0) {                    /* Open input file and initialize data */
        strcpy(fn, getenv("HOME"));
        strcat(fn, "/MPI/rand_data.txt");
        if ((fp = fopen(fn, "r")) == NULL) {
            printf("Can't open the input file: %s\n\n", fn);
            exit(1);
        }
        for (i = 0; i < MAXSIZE; i++)
            fscanf(fp, "%d", &data[i]);
    }

    MPI_Bcast(data, MAXSIZE, MPI_INT, 0, MPI_COMM_WORLD);   /* broadcast data */

    x = MAXSIZE / numprocs;             /* Add my portion of data */
    low = myid * x;
    high = low + x;
    for (i = low; i < high; i++)
        myresult += data[i];
    printf("I got %d from %d\n", myresult, myid);

    /* Compute global sum */
    MPI_Reduce(&myresult, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0)
        printf("The sum is %d.\n", result);

    MPI_Finalize();
    return 0;
}

12.70 Debugging/Evaluating Parallel Programs Empirically

12.71 Visualization Tools Programs can be watched as they are executed in a space-time diagram (or process-time diagram). Figure: processes 1 to 3 plotted against time, with intervals of computing, waiting, and message-passing system routines, and messages drawn between the processes. Visualization tools available for MPI. An example - Upshot.

12.72 Evaluating Programs Empirically Measuring Execution Time To measure the execution time between point L1 and point L2 in the code, we might have a construction such as:

t1 = MPI_Wtime();                 /* start (point L1) */
.
.
t2 = MPI_Wtime();                 /* end (point L2) */
elapsed_time = t2 - t1;           /* elapsed time */
printf("Elapsed time = %5.2f seconds", elapsed_time);

MPI provides the routine MPI_Wtime() for returning time (in seconds).

12.73 Executing MPI programs The MPI version 1 standard does not address implementation and does not specify how programs are to be started; each implementation has its own way.

12.74 Compiling/Executing MPI Programs Basics For MPICH, use two commands: mpicc to compile a program, mpirun to execute the program.

12.75 mpicc Example mpicc -o hello hello.c compiles hello.c to create the executable hello. mpicc is (probably) a script calling cc and hence all regular cc flags can be attached.

12.76 mpirun Example mpirun -np 3 hello executes 3 instances of hello on the local machine (when using MPICH).

12.77 Using multiple computers First create a file (say called “machines”) containing list of computers you want to use. Example:
coit-grid01.uncc.edu
coit-grid02.uncc.edu
coit-grid03.uncc.edu
coit-grid04.uncc.edu

12.78 Then specify machines file in mpirun command: Example mpirun -np 3 -machinefile machines hello executes 3 instances of hello using the computers listed in the file. (Scheduling will be round-robin unless otherwise specified.)

12.79 MPI-2 The MPI standard, version 2 does recommend a command for starting MPI programs, namely: mpiexec -n # prog where # is the number of processes and prog is the program.

12.80 Sample MPI Programs

12.81 Hello World Printing out rank of process

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    int myrank, numprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    printf("Hello World from process %d of %d\n", myrank, numprocs);
    MPI_Finalize();
    return 0;
}

12.82 Question Suppose this program is compiled as helloworld and is executed on a single computer with the command: mpirun -np 4 helloworld What would the output be?

12.83 Answer Several possible outputs depending upon order processes are executed. Example:
Hello World from process 2 of 4
Hello World from process 0 of 4
Hello World from process 1 of 4
Hello World from process 3 of 4

12.84 Adding communication to get process 0 to print all messages:

#include "mpi.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    int i, myrank, numprocs;
    char greeting[80];              /* message sent from slaves to master */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    sprintf(greeting, "Hello World from process %d of %d\n", myrank, numprocs);

    if (myrank == 0) {              /* I am going to print out everything */
        printf("%s\n", greeting);                 /* print greeting from proc 0 */
        for (i = 1; i < numprocs; i++) {          /* greetings in order */
            MPI_Recv(greeting, sizeof(greeting), MPI_CHAR, i, 1, MPI_COMM_WORLD, &status);
            printf("%s\n", greeting);
        }
    } else {
        MPI_Send(greeting, strlen(greeting) + 1, MPI_CHAR, 0, 1, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}

12.85 MPI_Get_processor_name() Return name of processor executing code (and length of string). Arguments: MPI_Get_processor_name(char *name, int *resultlen) Example:

int namelen;
char procname[MPI_MAX_PROCESSOR_NAME];
MPI_Get_processor_name(procname, &namelen);   /* name returned in procname */

12.86 Easy then to add name in greeting with: sprintf(greeting, "Hello World from process %d of %d on %s\n", myrank, numprocs, procname);

12.87 Pinging processes and timing Master-slave structure

#include "mpi.h"
#include <stdio.h>

void master(void);
void slave(void);

int main(int argc, char **argv)
{
    int myrank;

    printf("This is my ping program\n");
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        master();
    } else {
        slave();
    }
    MPI_Finalize();
    return 0;
}

12.88 Master routine

void master(void)
{
    int x = 9;
    double starttime, endtime;
    MPI_Status status;

    printf("I am the master - Send me a message when you receive this number %d\n", x);
    starttime = MPI_Wtime();
    MPI_Send(&x, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Recv(&x, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &status);
    endtime = MPI_Wtime();
    printf("I am the master. I got this back %d \n", x);
    printf("That took %f seconds\n", endtime - starttime);
}

12.89 Slave routine

void slave(void)
{
    int x;
    MPI_Status status;

    printf("I am the slave - working\n");
    MPI_Recv(&x, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
    printf("I am the slave. I got this %d \n", x);
    MPI_Send(&x, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);
}

12.90 Example using collective routines MPI_Bcast() MPI_Reduce() Adding numbers in a file.

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAXSIZE 1000

int main(int argc, char *argv[])
{
    int myid, numprocs;
    int data[MAXSIZE], i, x, low, high, myresult = 0, result;
    char fn[255];
    FILE *fp;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid == 0) {                    /* Open input file and initialize data */
        strcpy(fn, getenv("HOME"));
        strcat(fn, "/MPI/rand_data.txt");
        if ((fp = fopen(fn, "r")) == NULL) {
            printf("Can't open the input file: %s\n\n", fn);
            exit(1);
        }
        for (i = 0; i < MAXSIZE; i++)
            fscanf(fp, "%d", &data[i]);
    }

    MPI_Bcast(data, MAXSIZE, MPI_INT, 0, MPI_COMM_WORLD);   /* broadcast data */

    x = MAXSIZE / numprocs;             /* Add my portion of data */
    low = myid * x;
    high = low + x;
    for (i = low; i < high; i++)
        myresult += data[i];
    printf("I got %d from %d\n", myresult, myid);

    /* Compute global sum */
    MPI_Reduce(&myresult, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0)
        printf("The sum is %d.\n", result);

    MPI_Finalize();
    return 0;
}

12.92 C Program Command Line Arguments A normal C program specifies command line arguments to be passed to main with: int main(int argc, char *argv[]) where argc is the argument count and argv[] is an array of character pointers. –First entry is a pointer to program name –Subsequent entries point to subsequent strings on the command line.

12.93 MPI C program command line arguments Implementations of MPI remove from the argv array any command line arguments used by the implementation. Note MPI_Init requires argc and argv (specified as addresses)

12.94 Example Getting Command Line Argument

#include "mpi.h"
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int n;
    /* get and convert character string argument to integer value */
    n = atoi(argv[1]);
    return 0;
}

12.95 Executing MPI program with command line arguments In a command line such as mpirun -np 2 myProg ... the mpirun portion is removed by MPI (probably by MPI_Init()); argv[0] then points to the program name myProg, and argv[1], argv[2] hold pointers to the remaining command line arguments.

12.96 More Information on MPI Books: “Using MPI: Portable Parallel Programming with the Message-Passing Interface, 2nd ed.,” W. Gropp, E. Lusk, and A. Skjellum, The MIT Press, 1999. MPICH: LAM MPI:

12.97 Parallel Programming Home Page Gives step-by-step instructions for compiling and executing programs, and other information.

12.98 Grid-enabled MPI

12.99 Several versions of MPI developed for a grid: MPICH-G, MPICH-G2, PACX-MPI. MPICH-G2 is based on MPICH and uses Globus.

MPI code for the grid No difference in code from regular MPI code. Key aspect is MPI implementation: Communication methods Resource management

Communication Methods Implementation should take into account whether messages are between processors on the same computer or processors on different computers on the network. Pack messages into fewer, larger messages, even if this requires more computation.
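A small fragment sketching the aggregation idea with MPI_Pack(): several separate items are packed into one buffer and sent as a single message rather than one message per item (the variable names and buffer size are illustrative):

#include "mpi.h"

/* Pack an integer count and an array of doubles into one buffer and
   send a single message instead of two. */
void send_packed(int n, double *coords, int dest, MPI_Comm comm)
{
    char buf[4096];
    int pos = 0;
    MPI_Pack(&n, 1, MPI_INT, buf, sizeof(buf), &pos, comm);
    MPI_Pack(coords, n, MPI_DOUBLE, buf, sizeof(buf), &pos, comm);
    MPI_Send(buf, pos, MPI_PACKED, dest, 0, comm);   /* one message on the network */
}

The receiver would use MPI_Recv() with MPI_PACKED and then MPI_Unpack() to recover the items.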

MPICH-G2 Complete implementation of MPI. Can use existing MPI programs on a grid without change. Uses Globus to start tasks, etc. Version 2 is a complete redesign of MPICH-G for Globus 2.2 or later.

Compiling Application Program As with regular MPI programs, compile on each machine you intend to use and make the executables accessible to those computers.

Running an MPICH-G2 Program mpirun submits a Globus RSL script (Resource Specification Language script) to launch the application. The RSL script can be created by mpirun or you can write your own. An RSL script gives a powerful mechanism to specify different executables etc., but is low level.

mpirun (with it constructing the RSL script) Use if you want to launch a single executable on binary-compatible machines with a shared file system. Requires a “machines” file - a list of computers to be used (and job managers).

“Machines” file Computers listed by their Globus job manager service, followed by an optional maximum number of nodes (tasks) on that machine. If the job manager is omitted (i.e., just the name of the computer), it will default to the default Globus job manager.

Location of “machines” file The mpirun command expects the “machines” file either in –the directory specified by the -machinefile flag, –the current directory used to execute the mpirun command, or –in /bin/machines.

Running MPI program Uses the same command line as a regular MPI program: mpirun -np 25 my_prog creates 25 tasks allocated on the machines in the “machines” file in a round-robin fashion.

Example With the machines file containing: “coit-grid01.uncc.edu” 4 “coit-grid02.uncc.edu” 5 and the command: mpirun -np 10 myProg the first 4 processes (jobs) would run on coit-grid01, the next 5 on coit-grid02, and the remaining one on coit-grid01.

mpirun with your own RSL script Necessary if machines not executing same executable. Easiest way to create a script is to modify an existing one. Use mpirun -dumprsl - causes the script to be printed out; the application program is not launched.

Example mpirun -dumprsl -np 2 myprog will generate appropriate printout of an rsl document according to the details of the job from the command line and machine file.

Given rsl file, myRSL.rsl, use: mpirun -globusrsl myRSL.rsl to submit modified script.

MPICH-G2 internals Processes allocated a “machine-local” number and a “grid global” number - translated into where the process actually resides. Non-local operations use grid services; local operations do not. The globusrun command submits simultaneous job requests.

Limitations The “machines” file limits computers to those known - no discovery of resources. If machines are heterogeneous, appropriate executables must be available, and an RSL script is needed. Speed is an issue - the original version, MPICH-G, was slow.

More information on MPICH-G

Parallel Programming Techniques Suitable for a Grid

Message-Passing on a Grid VERY expensive - sending data across the network costs millions of cycles. Bandwidth shared with other users. Links unreliable.

Computational Strategies As a computing platform, a grid favors situations with absolute minimum communication between computers.

Strategies With no/minimum communication: “Embarrassingly Parallel” Computations –those computations which obviously can be divided into parallel independent parts. Parts executed on separate computers. Separate instance of the same problem executing on each system, each using different data

Embarrassingly Parallel Computations A computation that can obviously be divided into a number of completely independent parts, each of which can be executed by a separate process(or). No communication or very little communication between processes. Each process can do its tasks without any interaction with other processes

Monte Carlo Methods An embarrassingly parallel computation. Monte Carlo methods make use of random selections.

Simple Example: To calculate π. Circle formed within a square, with radius of 1. Square has sides 2 × 2.

Ratio of area of circle to square given by: area of circle / area of square = π(1)^2 / (2 × 2) = π/4. Points within square chosen randomly. Score kept of how many points happen to lie within circle. Fraction of points within circle will be π/4, given a sufficient number of randomly selected samples.
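A minimal sequential C sketch of this estimate (rand() is used for simplicity, and the sample count is illustrative):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long N = 1000000;             /* number of random points */
    long inside = 0;

    for (long i = 0; i < N; i++) {
        /* random point in the square [-1,1] x [-1,1] */
        double x = 2.0 * rand() / RAND_MAX - 1.0;
        double y = 2.0 * rand() / RAND_MAX - 1.0;
        if (x * x + y * y <= 1.0)       /* does the point lie inside the circle? */
            inside++;
    }
    printf("pi is approximately %f\n", 4.0 * inside / N);
    return 0;
}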

Method actually computes an integral. One quadrant of the construction can be described by the integral: ∫_0^1 sqrt(1 - x^2) dx = π/4.

So can use method to compute any integral! Monte Carlo method very useful if the function cannot be integrated numerically (maybe having a large number of variables).

Alternative (better) “Monte Carlo” Method Use random values of x to compute f(x) and sum values of f(x): Area = ∫_{x1}^{x2} f(x) dx = lim_{N→∞} (1/N) Σ_{i=1}^{N} f(x_r) · (x2 - x1), where x_r are randomly generated values of x between x1 and x2.

Example Computing the integral ∫_{x1}^{x2} (x^2 - 3x) dx. Sequential Code:

sum = 0;
for (i = 0; i < N; i++) {          /* N random samples */
    xr = rand_v(x1, x2);           /* next random value */
    sum = sum + xr * xr - 3 * xr;  /* compute f(xr) */
}
area = (sum / N) * (x2 - x1);

rand_v(x1, x2) returns a pseudorandom number between x1 and x2.

For parallelizing Monte Carlo code, must address best way to generate random numbers in parallel. Can use SPRNG (Scalable Pseudo-random Number Generator) -- supposed to be a good parallel random number generator.
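A sketch of one way to parallelize the π estimate with MPI; for simplicity each process seeds rand_r() with its rank, whereas a real code would use a parallel generator such as SPRNG:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int rank, size;
    const long N = 1000000;              /* samples per process */
    long inside = 0, total_inside;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    unsigned int seed = rank + 1;        /* simple per-process seed (stand-in for SPRNG) */
    for (long i = 0; i < N; i++) {
        double x = 2.0 * rand_r(&seed) / RAND_MAX - 1.0;
        double y = 2.0 * rand_r(&seed) / RAND_MAX - 1.0;
        if (x * x + y * y <= 1.0)
            inside++;
    }

    /* combine the counts from all processes on the root */
    MPI_Reduce(&inside, &total_inside, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi is approximately %f\n", 4.0 * total_inside / ((double)N * size));

    MPI_Finalize();
    return 0;
}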

Executing separate problem instances In some application areas, the same program is executed repeatedly - ideal if run with different parameters (“parameter sweep”). Nimrod/G -- a grid broker project that targets parameter sweep problems.

Techniques to reduce effects of network communication Latency hiding with communication/computation overlap. Better to have fewer, larger messages than many smaller ones.

Synchronous Algorithms Many traditional parallel algorithms require the parallel processes to synchronize at regular and frequent intervals to exchange data and continue from known points. This is bad for grid computations!! All traditional parallel algorithms books have to be thrown away for grid computing.

Techniques to reduce actual synchronization communications Asynchronous algorithms –algorithms that do not use synchronization at all. Partially synchronous algorithms –those that limit the synchronization, for example only synchronizing every n iterations (see the fragment below) –actually such algorithms have been known for many years but not popularized.
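A fragment sketching the partially synchronous idea for an iterative solver: a global convergence check (an MPI_Allreduce of the local error) is performed only every n iterations rather than on every one; do_one_iteration() and tolerance are placeholders, not course code:

#define N_SYNC 10                            /* synchronize only every N_SYNC iterations */

double local_err, global_err = 1.0;
int iter = 0;

while (global_err > tolerance) {
    local_err = do_one_iteration();          /* local update, no communication */
    if (++iter % N_SYNC == 0)                /* occasional global check */
        MPI_Allreduce(&local_err, &global_err, 1, MPI_DOUBLE,
                      MPI_MAX, MPI_COMM_WORLD);
}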

Big Problems “Grand challenge” problems Most of the high-profile projects on the grid involve problems that are so big - usually in the number of data items - that they cannot be solved otherwise.

Examples High-energy physics Bioinformatics Medical databases Combinatorial chemistry Astrophysics

Workflow Technique Use functional decomposition - dividing the problem into separate functional units which take results from other functional units and pass on results to further functional units - the interconnection pattern depends upon the problem. Workflow - describes the flow of information between the units.

Example Climate Modeling