Basics of Message-passing
Mechanics of message-passing
–A means of creating separate processes on different computers
–A way to send and receive messages
Single program multiple data (SPMD) model
–Logic for multiple processes merged into one program
–Control statements separate the blocks of logic executed by each process
–A compiled program is stored on each processor
–All executables are started together statically
–Example: MPI (Message Passing Interface)
Multiple program multiple data (MPMD) model
–Each processor has a separate master program
–The master program spawns child processes dynamically
–Example: PVM (Parallel Virtual Machine)

PVM (Parallel Virtual Machine)
Multiple process control: the host process controls the environment; any process can spawn others; a daemon controls message passing
PVM System Calls
–Control: pvm_mytid(), pvm_spawn(), pvm_parent(), pvm_exit()
–Get send buffer: pvm_initsend()
–Pack for sending: pvm_pkint(), pvm_pkfloat(), pvm_pkstr()
–Blocking/non-blocking transmission: pvm_send(), pvm_recv(), pvm_nrecv()
–Unpack after receipt: pvm_upkint(), pvm_upkfloat(), pvm_upkstr()
–Group definition: pvm_joingroup()
–Collective communication: pvm_bcast(), pvm_scatter(), pvm_gather(), pvm_reduce(), pvm_mcast()
Freely distributed by Oak Ridge National Laboratory
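
A minimal master/worker sketch using these calls (illustrative only, not from the original slides; the executable name "worker" and the message tags are assumed):

/* Illustrative PVM sketch: the master spawns one worker and sends it an integer;
   the worker doubles it and replies. Executable name "worker" and tags assumed. */
#include <stdio.h>
#include "pvm3.h"

int main(void) {
    int mytid = pvm_mytid();          /* enroll in PVM, get task id */
    int parent = pvm_parent();

    if (parent == PvmNoParent) {      /* master side */
        int child, n = 21, result;
        pvm_spawn("worker", NULL, PvmTaskDefault, "", 1, &child);
        pvm_initsend(PvmDataDefault); /* get a fresh send buffer */
        pvm_pkint(&n, 1, 1);          /* pack one int */
        pvm_send(child, 1);           /* tag 1: work message */
        pvm_recv(child, 2);           /* blocking receive, tag 2 */
        pvm_upkint(&result, 1, 1);
        printf("result = %d\n", result);
    } else {                          /* worker side */
        int n;
        pvm_recv(parent, 1);
        pvm_upkint(&n, 1, 1);
        n *= 2;
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&n, 1, 1);
        pvm_send(parent, 2);
    }
    pvm_exit();
    return 0;
}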

mpij and MpiJava
Overview
–MpiJava is a wrapper sitting on top of MPICH or LAM/MPI
–mpij is a native Java implementation of MPI
Documentation
–MpiJava
–mpij (uses the same API as MpiJava)
Java Grande consortium
–Sponsors conferences & encourages Java for parallel programming
–Maintains Java-based paradigms (mpiJava, HPJava, and mpiJ)
Other Java-based implementations
–JavaMpi is another, less popular, MPI Java wrapper

SPMD Computation (MPI)

int main(int argc, char *argv[]) {
   int myrank;
   MPI_Init(&argc, &argv);
   ...
   MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
   if (myrank == 0)
      master();
   else
      slave();
   ...
   MPI_Finalize();
}

The master process executes master(); the slave processes execute slave()

A Simple MPI Program

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
   int myRank, size;
   const int MAX = 256;   /* message buffer size (assumed) */
   const int TAG = 1;
   char data[MAX];
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);
   if (size != 2) MPI_Abort(MPI_COMM_WORLD, 1);   // Terminate all processes
   if (myRank == 0) {
      sprintf(data, "Sending from %d of %d", myRank, size);
      MPI_Send(data, MAX, MPI_CHAR, 1, TAG, MPI_COMM_WORLD);
   } else {
      MPI_Recv(data, MAX, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      printf("%s\n", data);
   }
   MPI_Finalize();
}

Start and Finish
MPI_Init: Bring up the program on all computers, pass command-line arguments, establish ranks
MPI_Comm_rank: Determine the rank of the current process
MPI_Comm_size: Return the number of processes that are running
MPI_Finalize: Terminate the program normally
MPI_Abort: Terminate with an error code when something goes wrong

Standard Send (MPI_Send)

int MPI_Send(void *buf, int count, MPI_Datatype type, int dest, int tag, MPI_Comm comm)

Input Parameters
–buf: initial address of the send buffer (choice)
–count: number of elements in the send buffer (integer)
–type: type of each send buffer element (e.g., MPI_CHAR, MPI_INT, MPI_DOUBLE, MPI_BYTE, MPI_PACKED, etc.)
–dest: rank of the destination (integer)
–tag: message tag (integer)
–comm: communicator (handle)
Note: MPI_PACKED allows different data types to be sent in a single buffer using the MPI_Pack and MPI_Unpack functions (a sketch follows)
Note: Google MPI_Send, MPI_Recv, etc. for more information
Blocks until the message is received or the data is copied to a buffer
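
A small hedged sketch of MPI_Pack/MPI_Unpack, assuming the myRank variable from the earlier example; the buffer size and tag are chosen arbitrarily:

/* Hedged sketch: pack a double and an int into one buffer, send it as
   MPI_PACKED, and unpack on the receiver. Buffer size and tag are arbitrary. */
char buffer[128];
int position = 0;
double x = 3.14;
int n = 42;

if (myRank == 0) {
    MPI_Pack(&x, 1, MPI_DOUBLE, buffer, sizeof(buffer), &position, MPI_COMM_WORLD);
    MPI_Pack(&n, 1, MPI_INT, buffer, sizeof(buffer), &position, MPI_COMM_WORLD);
    MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);  /* send only the packed bytes */
} else if (myRank == 1) {
    MPI_Recv(buffer, sizeof(buffer), MPI_PACKED, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Unpack(buffer, sizeof(buffer), &position, &x, 1, MPI_DOUBLE, MPI_COMM_WORLD);
    MPI_Unpack(buffer, sizeof(buffer), &position, &n, 1, MPI_INT, MPI_COMM_WORLD);
}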

Matching Message Tags
Tags differentiate between types of messages
The message tag is carried within the message
Wild-card codes allow receipt of any message from any source
–MPI_ANY_TAG: matches any message type
–MPI_ANY_SOURCE: matches messages from any sender
–Sends cannot use wildcards (receiving is a pull operation, not a push)
Example: send message type 5 from buffer x to buffer y in process 2 (see the sketch below)
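
A hedged sketch of that example, assuming process 0 is the sender and that x and y are int arrays of length N (both N and the array contents are illustrative):

/* Hedged sketch of "send message type 5 from buffer x to buffer y in process 2". */
#define N 10
int x[N], y[N];

if (myRank == 0) {
    MPI_Send(x, N, MPI_INT, 2, 5, MPI_COMM_WORLD);                     /* dest = 2, tag = 5 */
} else if (myRank == 2) {
    MPI_Recv(y, N, MPI_INT, 0, 5, MPI_COMM_WORLD, MPI_STATUS_IGNORE);  /* source = 0, tag = 5 */
}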

Status of Sends and Receives

MPI_Status status;
MPI_Recv(&result, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
status.MPI_SOURCE    /* rank of sender */
status.MPI_TAG       /* tag of the message */
status.MPI_ERROR     /* error code */
–MPI_SUCCESS - Successful, MPI_ERR_BUFFER - Invalid buffer pointer
–MPI_ERR_COUNT - Invalid count, MPI_ERR_TYPE - Invalid data type
–MPI_ERR_TAG - Invalid tag, MPI_ERR_COMM - Invalid communicator
–MPI_ERR_RANK - Invalid rank, MPI_ERR_ARG - Invalid argument
–MPI_ERR_UNKNOWN - Unknown error, MPI_ERR_INTERN - Internal error
–MPI_ERR_TRUNCATE - Message truncated on receive
MPI_Get_count(&status, recv_type, &count)    /* number of elements received */

Console Input and Output
Input
–Console input must be initiated at the host process:
   if (rank == 0) {
      printf("Enter some fraction");
      scanf("%lf", &value);
   }
   or use fgets() to read a string
Output
–Any process can initiate output
–MPI uses internal library functions to route the output to the process that initiated the program
–Output initiated through library functions before normal application transmissions can arrive after them, or vice versa

Groups and Communicators
Group: a set of processes ordered by relative rank
Communicator: the context required for sends and receives
Purpose: enable collective communication (to subgroups of processes)
The default communicator is MPI_COMM_WORLD
–A unique rank corresponds to each executing process
–The rank is an integer from 0 to p - 1
–The number of processes executing is p
Applications can create subset communicators
–Each process has a unique rank in each sub-communicator
–The rank is an integer from 0 to g - 1
–The number of processes in the group is g

MPI Group Communicator Functions
Typical Usage
1. Extract the group from a communicator: MPI_Comm_group
2. Form a new group: MPI_Group_incl or MPI_Group_excl
3. Create the new group's communicator: MPI_Comm_create
4. Determine the rank within the group: MPI_Comm_rank
5. Communicate: MPI message-passing functions
6. Destroy created communicators and groups: MPI_Comm_free and MPI_Group_free

Details
MPI_Group_excl:
–New group without certain processes from an existing group
–int MPI_Group_excl(MPI_Group group, int n, int *ranks, MPI_Group *newgroup);
MPI_Group_incl:
–New group with selected processes from an existing group
–int MPI_Group_incl(MPI_Group group, int n, int *ranks, MPI_Group *newgroup);

Creating and using a sub-group

int ranks[4] = {1, 3, 5, 7};
MPI_Group original, subgroup;
MPI_Comm slave;
MPI_Comm_group(MPI_COMM_WORLD, &original);
MPI_Group_incl(original, 4, ranks, &subgroup);      /* subgroup = world ranks 1,3,5,7 */
MPI_Comm_create(MPI_COMM_WORLD, subgroup, &slave);  /* slave is MPI_COMM_NULL on processes outside the group */
if (slave != MPI_COMM_NULL)                         /* only group members may use the new communicator */
   MPI_Send(data, strlen(data)+1, MPI_CHAR, 0, 0, slave);  /* rank 0 in slave is world rank 1 */
MPI_Group_free(&subgroup);
MPI_Group_free(&original);
if (slave != MPI_COMM_NULL) MPI_Comm_free(&slave);

Point-to-point Communication
Pseudo-code constructs (generic syntax; actual formats later)
–Send(data, destination, message tag)
–Receive(data, source, message tag)
–Example: Process 1 executes send(&x, 2); Process 2 executes recv(&y, 1); x is copied into y
Synchronous
–Send completes when the data is safely received
–Receive completes when the data is available
–No copying to/from internal buffers
Asynchronous
–Copy to an internal message buffer
–Send completes when transmission begins
–Local buffers are free for application use
–Receive polls to determine if data is available

Synchronized sends and receives (figure): (b) recv() occurs before send()

Point-to-Point MPI Calls
Buffered send (receiver gets to it when it can)
–Completes after the data is copied to a user-supplied buffer
–Becomes synchronous if no buffers are available
Ready send (guarantees the transmission is successful)
–A matching receive call must precede the send
–Completion occurs when the remote processor receives the data
Standard send (starts transmission if possible)
–If a receive call is posted, completes when transmission starts
–If no receive call is posted, completes when the data is buffered by MPI, but becomes synchronous if no buffers are available
Blocking: return occurs when the call completes
Non-blocking: return occurs immediately
–The application must periodically poll or wait for completion
–Why non-blocking? To allow more parallel processing
The call forms are sketched below
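
A hedged sketch of the corresponding call forms, including the synchronous send from the earlier slide; all of the sends share MPI_Send's argument list, and buf, count, dest, and tag are illustrative:

/* Hedged sketch of the send variants; buf, count, dest, and tag are illustrative. */
MPI_Ssend(buf, count, MPI_INT, dest, tag, MPI_COMM_WORLD);   /* synchronous send */
MPI_Bsend(buf, count, MPI_INT, dest, tag, MPI_COMM_WORLD);   /* buffered send    */
MPI_Rsend(buf, count, MPI_INT, dest, tag, MPI_COMM_WORLD);   /* ready send       */
MPI_Send (buf, count, MPI_INT, dest, tag, MPI_COMM_WORLD);   /* standard send    */

/* Non-blocking forms return immediately and must be completed later. */
MPI_Request req;
MPI_Status  stat;
MPI_Isend(buf, count, MPI_INT, dest, tag, MPI_COMM_WORLD, &req);
/* ... overlap computation here ... */
MPI_Wait(&req, &stat);                                       /* or MPI_Test(&req, &flag, &stat) */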

Buffered Send Example
Applications supply a data buffer area using MPI_Buffer_attach() to hold the data during transmission (see the sketch below)
Note: transmission is between the sender's and receiver's MPI buffers
Note: copying in and out of buffers can be expensive
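
A hedged buffered-send sketch; the buffer size and the message contents are illustrative:

/* Hedged sketch: attach a user buffer, post a buffered send, then detach. */
int n = 42;
int bufsize = 1024 + MPI_BSEND_OVERHEAD;     /* room for the message plus MPI overhead */
char *attached = malloc(bufsize);

MPI_Buffer_attach(attached, bufsize);        /* MPI copies outgoing data here */
MPI_Bsend(&n, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
MPI_Buffer_detach(&attached, &bufsize);      /* blocks until buffered messages are delivered */
free(attached);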

Point-to-point Message Transfer

int x, myrank;
MPI_Status stat;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
   MPI_Send(&x, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);
} else if (myrank == 1) {
   MPI_Recv(&x, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &stat);
}

Non-blocking Point-to-point Transfer

int x, myrank;
MPI_Request req;
MPI_Status stat;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
   MPI_Isend(&x, 1, MPI_INT, 1, 99, MPI_COMM_WORLD, &req);
   doSomeProcessing();          /* overlap computation with communication */
   MPI_Wait(&req, &stat);
} else if (myrank == 1) {
   MPI_Recv(&x, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &stat);
}

MPI_Isend() and MPI_Irecv() return immediately
MPI_Rsend() returns when the data has been received by the remote computer; MPI_Bsend() is the buffered send; MPI_Send() is the standard send
MPI_Wait() returns after the transfer completes; MPI_Test() sets its flag argument to non-zero once the transfer has completed, zero otherwise

Message Passing Order
Note: messages sent from one process to the same destination are received in the order they were sent. Messages from different senders can be received in any order.

Collective Communication
MPI operations on groups of processes (a sketch follows this list)
Broadcast (MPI_Bcast()): broadcast or multicast data to the processes in a group
Scatter (MPI_Scatter()): send parts of an array to separate processes
Gather (MPI_Gather()): collect array elements from separate processes
All-to-all (MPI_Alltoall()): a combination of gather and scatter; all processes send, then sections of the combined data are gathered
MPI_Reduce(): combine values from all processes into a single value using some operation (function)
MPI_Reduce_scatter(): first reduce, then scatter the result
MPI_Scan(): reduce values received from processes of lower rank in the group (a prefix reduction)
MPI_Barrier(): pause until all processes reach the barrier call
Advantages
–MPI can use the processor hierarchy to improve efficiency
–Although collective communication can be implemented using standard send and receive calls, collective operations require less programming and debugging
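
A hedged sketch combining two of these operations: scatter a chunk of an array to each process, sum the local chunk, and reduce the partial sums to the root (CHUNK and the array contents are illustrative):

/* Hedged sketch: scatter chunks of an array, compute local sums, reduce to root. */
#define CHUNK 4
int size, rank, i;
int *full = NULL;                 /* only the root allocates the full array */
int part[CHUNK], localSum = 0, totalSum = 0;

MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    full = malloc(size * CHUNK * sizeof(int));
    for (i = 0; i < size * CHUNK; i++) full[i] = i;
}
MPI_Scatter(full, CHUNK, MPI_INT, part, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);
for (i = 0; i < CHUNK; i++) localSum += part[i];
MPI_Reduce(&localSum, &totalSum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) { printf("sum = %d\n", totalSum); free(full); }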

Reduce, Broadcast, Allreduce (figures): an allreduce implemented as a reduce followed by a broadcast, and as a butterfly allreduce

Predefined Collective Operations
MPI_MAX, MPI_MIN: maximum, minimum
MPI_MAXLOC, MPI_MINLOC: maximum/minimum with location
–If the output buffer is out, then for each index i, out[i].val and out[i].rank contain the max (or min) value and the rank of the process holding it (see the sketch below)
MPI_SUM, MPI_PROD: sum, product
MPI_LAND, MPI_LOR, MPI_LXOR: logical AND, OR, XOR
MPI_BAND, MPI_BOR, MPI_BXOR: bitwise &, |, ^
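
A hedged MPI_MAXLOC sketch using the predefined MPI_DOUBLE_INT pair type; compute_local_value() is a hypothetical helper standing in for whatever produces each process's value:

/* Hedged sketch: find the largest local value across all processes and the rank that owns it. */
struct { double val; int rank; } in, out;
int myRank;

MPI_Comm_rank(MPI_COMM_WORLD, &myRank);
in.val  = compute_local_value();   /* hypothetical helper returning this process's value */
in.rank = myRank;
MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);
if (myRank == 0)
    printf("max = %f on rank %d\n", out.val, out.rank);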

Derived MPI Data Types

/* Goal: send items, each containing a double, an integer, and a string */
int lengths[3] = {1, 1, 100};
MPI_Datatype types[3] = {MPI_DOUBLE, MPI_INT, MPI_CHAR};
MPI_Aint displacements[3] = {0, sizeof(double), sizeof(double) + sizeof(int)};
   /* real code would use offsetof() to account for struct padding */
MPI_Datatype myType;

MPI_Type_create_struct(3, lengths, displacements, types, &myType);  /* Derive a data type */
MPI_Type_commit(&myType);      /* Commit it for use */

/* Broadcast count data items from source to the processes in the communicator */
MPI_Bcast(&data, count, myType, source, comm);
MPI_Type_free(&myType);        /* Don't need it anymore */

Note: broadcasts can be fifty to a hundred times faster than doing individual sends in for loops

Collective Communication Example
Master: allocate memory to hold all of the data, then gather items from the group of processes
Remotes: fill an array with data and send it to the master
Note: all processes execute the MPI_Gather() function

int data[10];    /* data to gather from each process */
int *buf = NULL, grp_size, i;

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Comm_size(MPI_COMM_WORLD, &grp_size);
if (myrank == 0)
   buf = (int *)malloc(grp_size * 10 * sizeof(int));   /* root holds everyone's data */
for (i = 0; i < 10; i++) data[i] = myrank;             /* every process, including the root, contributes */
MPI_Gather(data, 10, MPI_INT, buf, 10 /* per-process count */, MPI_INT,
           0 /* gatherer rank */, MPI_COMM_WORLD);

User Defined Collective Operation

typedef struct { double real, imag; } Complex;

/* User-defined function to add complex numbers (dest = src + dest) */
void compSum(Complex *src, Complex *dest, int *len, MPI_Datatype *ptr) {
   int i;
   for (i = 0; i < *len; ++i) {
      dest->real += src->real;
      dest->imag += src->imag;
      src++; dest++;
   }
}

Complex in[100], out[100];
MPI_Op operation;
MPI_Datatype complexType;
MPI_Type_contiguous(2, MPI_DOUBLE, &complexType);    // Define the type
MPI_Type_commit(&complexType);                       // Record it for use
MPI_Op_create((MPI_User_function *)compSum, 1 /* commutative */, &operation);  // Define the operation
MPI_Reduce(in, out, 100, complexType, operation, root, communicator);

Collective Communication Rules
–All of the processes in the communicator call the same collective function
–The arguments must specify the same root, input data length, data type, operation, and communicator
–The destination (root) process is the only one that needs to supply an output array
–There is no message tag; matching is done by calling order and the communicator
–The input and output buffers must be different and must not overlap

Broadcast
Broadcast: sending the same message to all processes
Multicast: sending the same message to a defined group of processes
(Figure: every process calls bcast(); the root's buf is copied into each process's data)
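
The MPI form, as a hedged sketch (buffer, count, and root are illustrative):

/* Hedged sketch: every process calls MPI_Bcast; the root's buffer is copied to all. */
double data[4];
MPI_Bcast(data, 4, MPI_DOUBLE, 0 /* root */, MPI_COMM_WORLD);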

Scatter
Distributing each element of an array to a separate process
The contents of the i-th location of the array are transmitted to process i
(Figure: every process calls scatter(); element i of the root's buf arrives in process i's data)
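
The MPI form, as a hedged sketch (the send array is sized for 8 processes purely for illustration):

/* Hedged sketch: the root scatters one int to each process; counts are per process. */
int sendbuf[8], recvbuf;   /* sendbuf only meaningful on the root */
MPI_Scatter(sendbuf, 1, MPI_INT, &recvbuf, 1, MPI_INT, 0 /* root */, MPI_COMM_WORLD);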

Gather
One process collects individual values from a set of processes
(Figure: every process calls gather(); process i's data is stored in element i of the root's buf)
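
The MPI form, as a hedged sketch (the receive array is sized for 8 processes purely for illustration):

/* Hedged sketch: each process contributes one int; the root receives one per process. */
int sendval = myRank, recvbuf[8];   /* recvbuf only needed on the root */
MPI_Gather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT, 0 /* root */, MPI_COMM_WORLD);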

Reduce
Perform a distributed calculation
Example: perform an addition over a distributed array
(Figure: every process calls reduce(); the values in each process's data are combined with + into the root's buf)
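
The MPI form, as a hedged sketch (the local value is illustrative):

/* Hedged sketch: sum one double from every process into the root's result. */
double localValue = 1.0, result;
MPI_Reduce(&localValue, &result, 1, MPI_DOUBLE, MPI_SUM, 0 /* root */, MPI_COMM_WORLD);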

Avoiding MPI Deadlocks
MPI_Send doesn't always behave the same way
–It can copy to a buffer and return before the transmission is received
–It can block until the matching MPI_Recv starts
–MPI uses thresholds to switch from buffered to blocking sends
–Some implementations buffer small messages and block on large messages
Deadlock possibilities (MPI_Send followed by MPI_Recv)
–If all of the sends block, none of the receives can start
–Small messages may succeed, while larger messages may lead to deadlock
Possible solutions (a sketch follows)
–Have some processes send before receiving while others receive before sending
–Use MPI_Sendrecv or MPI_Sendrecv_replace so that MPI handles the order of the calls and guarantees no deadlock
An MPI_Recv without a matching send will block forever
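
A hedged sketch of a ring exchange with MPI_Sendrecv, which cannot deadlock regardless of message size; the payload is illustrative and neighbor ranks are computed with modular arithmetic:

/* Hedged sketch: each process sends to its right neighbor and receives from its
   left neighbor in one deadlock-free call. */
int rank, size, sendval, recvval;
MPI_Status status;

MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
sendval = rank;
MPI_Sendrecv(&sendval, 1, MPI_INT, (rank + 1) % size,        0 /* send tag */,
             &recvval, 1, MPI_INT, (rank - 1 + size) % size, 0 /* recv tag */,
             MPI_COMM_WORLD, &status);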

Timing Parallel Programs
What should not be timed?
–Time to type input
–Time to print or display output
What should be timed?
–The actual algorithm's computation
–Communication blocks
How? Use either MPI or C time.h functions

MPI (wall-clock time; MPI_Wtick() gives the timer's resolution):
   double start = MPI_Wtime();
   /* Do stuff */
   double time = MPI_Wtime() - start;   /* MPI_Wtime() already returns seconds */

OR C (CPU time; doesn't include idle time):
   clock_t start = clock();
   /* Do stuff */
   double time = ((double)(clock() - start)) / CLOCKS_PER_SEC;

Maximum Time over Processors

double start, localElapsed, elapsed;

MPI_Barrier(MPI_COMM_WORLD);           // Start all processors together
start = MPI_Wtime();                   // Start time
/* do code here */
localElapsed = MPI_Wtime() - start;    // Get this processor's elapsed time
// Get the maximum elapsed time over all processors
MPI_Reduce(&localElapsed, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
if (rank == 0)                         // Master processor outputs the result
   printf("Elapsed time = %f seconds\n", elapsed);

Note: another approach is to place a second barrier at the end in order to avoid needing the reduce operation