1 Lecture 4: Part 2: MPI Point-to-Point Communication
2 Realizing Message Passing
- Separate the network from the processor
- Separate user memory from system memory
[Diagram: two nodes (node 0, node 1), each with user and system memory, a processing element (PE), and a network interface (NI), connected by the network.]
3 Communication Modes for "Send"
- Blocking/non-blocking: timing of when the user message buffer may be reused
- Ready: timing of the invocation of send relative to receive
- Buffered: user vs. system buffer allocation
4 Communication Modes for "Send"
- Synchronous/asynchronous: timing of the invocation of send and receive, plus the execution of the receive operation
- Local/non-local: whether completion is independent of, or depends on, the execution of another user process
5 Messaging Semantics
[Diagram: sender and receiver paths through user-space and system-space buffers, contrasting blocking/non-blocking and synchronous/asynchronous operation and the ready/not-ready cases.]
6 Blocking/Non-blocking Send
- Blocking send: the call does not return until the message data have been safely stored away, so the sender is free to access and overwrite the send buffer
- The message might be copied directly into the matching receive buffer, or into a temporary system buffer even if no matching receive has been invoked
- Local: completion does not depend on the execution of another user process
7 Blocking Receive -- MPI_Recv
- Returns when the receive is locally complete
- The message buffer can be read after the call returns
8 Nonblocking Send -- MPI_Isend
- Non-blocking, asynchronous
- Does not block waiting for the receive (returns "immediately")
- Check for completion with MPI_Wait() before reusing the buffer
- MPI_Wait() returns when the message has been safely sent, not necessarily when it has been received
9 Non-blocking Receive -- MPI_Irecv
- Returns "immediately"
- The message buffer should not be read after return until local completion has been checked
- MPI_Wait(...): blocks until the communication is complete
- MPI_Waitall(...): blocks until all communication operations in a given list have completed
10 Non-blocking Receive -- MPI_Irecv
- MPI_Irecv(buf, count, datatype, source, tag, comm, &request): the request handle can be used to query the status of the communication
- MPI_Wait(&request, &status): returns only when the request is complete
- MPI_Waitall(count, array_of_requests, statuses): waits for the completion of all requests in the array
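The MPI_Isend/MPI_Irecv/MPI_Waitall sequence described above can be sketched in C. This is a minimal sketch that assumes an MPI environment (compile with mpicc, launch with mpirun); the buffer size and message tag are illustrative choices, not values from the lecture.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    double sendbuf[1024], recvbuf[1024];
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (nprocs >= 2 && rank < 2) {
        int peer = 1 - rank;  /* ranks 0 and 1 exchange with each other */

        /* Post both operations; neither call blocks. */
        MPI_Isend(sendbuf, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(recvbuf, 1024, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ... computation that touches neither buffer goes here ... */

        /* Block until both complete; only now are the buffers safe to reuse/read. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```

Posting the receive before (or together with) the send also avoids the deadlock that two blocking sends facing each other could cause.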
11 Nonblocking Communication
- Improves performance by overlapping communication and computation
- Requires an intelligent communication interface (messaging co-processors, as used in the SP2, Paragon, CS-2, Myrinet, ATM)
[Diagram: timeline of startup/transfer phases, with computation added to overlap the transfers.]
12 Ready Send -- MPI_Rsend()
- The receive must be posted before the message arrives; otherwise the operation is erroneous and its outcome is undefined
- Non-local (completion depends on the starting time of the receiving process)
- Avoids handshake overhead, but synchronization is needed to guarantee the receive is posted first
13 Buffered Send -- MPI_Bsend()
- Explicitly buffers messages on the sending side
- The user allocates the buffer (MPI_Buffer_attach())
- Useful when the programmer wants to control buffer usage, e.g. when writing new communication libraries
14 Buffered Send -- MPI_Bsend()
[Diagram: sending node with user and system memory, PE, and NI; the user-allocated buffer sits in user space.]
15 Synchronous Send -- MPI_Ssend()
- Does not return until the message is actually received
- The send buffer can be reused once the send operation has completed
- Non-local (the receiver must have received the message)
16 Standard Send -- MPI_Send()
- Standard send: behavior depends on the implementation (usually synchronous, blocking, and non-local)
- Safe to reuse the buffer once MPI_Send() returns
- May block until the message is received (implementation dependent)
17 Standard Send -- MPI_Send()
A good implementation:
- Short messages: send immediately, buffering if no receive is posted; should try to minimize latency, and the buffering cost is unimportant
- Large messages: use a rendezvous protocol (request-reply-send: wait for the matching receive, then send the data)
18 How to Exchange Data
Simple (code on node 0):
  sid = MPI_Isend(buf1, node1)
  rid = MPI_Irecv(buf2, node1)
  ... computation ...
  call MPI_Wait(sid)
  call MPI_Wait(rid)
For maximum performance:
  ids(1) = MPI_Isend(buf1, node1)
  ids(2) = MPI_Irecv(buf2, node1)
  ... computation ...
  call MPI_Waitall(2, ids)
19 Model and Measure p2p Communication in MPI
- data transfer time = latency + message size / bandwidth
- latency (T0) is the startup time, independent of message size (but dependent on the communication mode/protocol)
- bandwidth (B) is the number of bytes transferred per second (determined by the memory access rate and the network transmission rate)
20 Latency and Bandwidth
- For short messages, latency dominates the transfer time
- For long messages, the bandwidth term dominates the transfer time
- Critical message size: n_1/2 = latency x bandwidth (obtained by setting latency = message size / bandwidth)
21 Measure p2p Performance
- Round-trip (ping-pong) time: time the send-receive exchange and divide by 2
[Diagram: send/recv/send exchange between two processes.]
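The ping-pong measurement can be sketched as follows. This is a minimal sketch that assumes an MPI environment (mpicc, mpirun -np 2); the message size and repetition count are illustrative.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    enum { REPS = 1000, N = 8 };  /* repetitions and message size (assumed) */
    char buf[N];
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double t0 = MPI_Wtime();
    for (int r = 0; r < REPS; r++) {
        if (rank == 0) {          /* ping: send, then wait for the echo */
            MPI_Send(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, N, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {   /* pong: wait, then echo back */
            MPI_Recv(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, N, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - t0;

    /* One-way time = round-trip time / 2, averaged over REPS trips. */
    if (rank == 0)
        printf("one-way time: %g us\n", elapsed / REPS / 2.0 * 1e6);
    MPI_Finalize();
    return 0;
}
```

Averaging over many repetitions hides timer granularity; with an 8-byte message the result approximates the latency T0 of the model above.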
22 Some MPI Performance Results
23 Protocols
- Rendezvous
- Eager
- Mixed
- Pull (get)
24 Rendezvous
Algorithm:
- Sender sends a request-to-send
- Receiver acknowledges
- Sender sends the data
Features:
- No buffering required
- High latency (three steps)
- High bandwidth (no extra buffer copy)
25 Eager
Algorithm:
- Sender sends the data immediately
- Usually must be buffered, though it may be transferred directly if the receive is already posted
Features:
- Low latency
- Low bandwidth (extra buffer copy)
26 Mixed
Algorithm:
- Eager for short messages
- Rendezvous for long messages
- Switch protocols near n_1/2
27 Mixed
Features:
- Low latency for latency-dominated (short) messages
- High bandwidth for bandwidth-dominated (long) messages
- Reasonable memory management
- Non-ideal performance for some messages near n_1/2
28 Pull (Get) Protocol
- One-sided communication
- Used in shared-memory machines
29 MPICH p2p on SGI
- Default: up to 256 bytes: Short; up to 128 KB: Eager; > 128 KB: Rendezvous
- MPID_PKT_MAX_DATA_SIZE = 256
- Short protocol: the data is filled into the message header
30 [Plot: measured p2p performance with MPID_PKT_MAX_DATA_SIZE = 256, showing the Short, Eager, and Rendezvous regions.]
31 MPI-FM (HPVM: Fast Messages) Performance
[Charts: one-way latency (us, lower is better) and bandwidth (MB/s, higher is better) for HPVM, Power Challenge, SP-2, T3E, Origin 2000, and Beowulf.]
Note: supercomputer measurements taken by NAS, JPL, and HLRS (Germany)
32 MPI Collective Operations
MPI_Alltoall(v)
MPI_Alltoall is an extension of MPI_Allgather to the case where each process sends distinct data to each of the receivers. The j-th block of data sent from process i is received by process j and is placed in the i-th block of the receive buffer of process j.
MPI_Alltoall(v)
Define i_j to be the i-th block of data of process j.
[Diagram: all-to-all data layout across processes.]
MPI_Alltoall(v)
Current implementation: process j sends i_j directly to process i.
[Diagram: send buffer and receive buffer layouts.]