1 Lecture 4: Part 2: MPI Point-to-Point Communication

2 Realizing Message Passing
- Separate the network from the processor.
- Separate user memory from system memory.
[Figure: node 0 and node 1, each with user memory, system memory, a processing element (PE), and a network interface (NI), connected through the network.]

3 Communication Modes for "Send"
- Blocking/Non-blocking: timing regarding the use of the user message buffer.
- Ready: timing regarding the invocation of send and receive.
- Buffered: user/system buffer allocation.

4 Communication Modes for "Send"
- Synchronous/Asynchronous: timing regarding the invocation of send and receive plus the execution of the receive operation.
- Local/Non-local: completion is independent of / depends on the execution of another user process.

5 Messaging Semantics
[Figure: sender and receiver with user-space and system-space buffers, illustrating blocking/non-blocking and synchronous/asynchronous behaviour and whether the receiver is ready or not ready.]

6 Blocking/Non-blocking Send
- Blocking send: the call does not return until the message data have been safely stored away, so the sender is free to access and overwrite the send buffer.
- The message might be copied directly into the matching receive buffer, or it may be copied into a temporary system buffer even if no matching receive has been invoked.
- Local: completion does not depend on the execution of another user process.

7 Blocking Receive - MPI_Recv
- Returns when the receive is locally complete.
- The message buffer can be read after the call returns.
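
A minimal sketch of a blocking receive (assumes <mpi.h> is included and MPI_Init has been called; the source rank, tag, and message size are illustrative):

```c
int buf[100], received;
MPI_Status status;
MPI_Recv(buf, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
/* The call has returned, so buf can safely be read here. */
MPI_Get_count(&status, MPI_INT, &received);   /* how many ints actually arrived */
```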

8 Non-blocking Send - MPI_Isend
- Non-blocking, asynchronous: does not block waiting for the receive (returns "immediately").
- Check for completion with MPI_Wait() before reusing the buffer.
- MPI_Wait() returns when the message has been safely sent, not when it has been received.
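
A minimal sketch of MPI_Isend followed by MPI_Wait (assumes MPI_Init has been called; `dest`, the buffer size, and the tag are illustrative):

```c
double buf[1000];
MPI_Request req;
/* ... fill buf ... */
MPI_Isend(buf, 1000, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);
/* ... computation that does not modify buf ... */
MPI_Wait(&req, MPI_STATUS_IGNORE);   /* only now is it safe to overwrite buf */
```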

9 Non-blocking Receive - MPI_Irecv
- Returns "immediately".
- The message buffer should not be read until the operation is complete.
- Must check for local completion:
  - MPI_Wait(...): blocks until the communication is complete.
  - MPI_Waitall(...): blocks until all communication operations in a given list have completed.

10 Non-blocking Receive - MPI_Irecv
- MPI_Irecv(buf, count, datatype, source, tag, comm, request): the request handle can be used to query the status of the communication.
- MPI_Wait(request, status): returns only when the request is complete.
- MPI_Waitall(count, array_of_requests, ...): waits for the completion of all requests in the array.
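
A minimal sketch of posting two non-blocking receives and waiting for both (assumes MPI_Init has been called; `left` and `right` are illustrative neighbour ranks):

```c
int a[100], b[100];
MPI_Request reqs[2];
MPI_Irecv(a, 100, MPI_INT, left,  0, MPI_COMM_WORLD, &reqs[0]);
MPI_Irecv(b, 100, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);
/* ... computation that does not read a or b ... */
MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);   /* a and b may be read only after this */
```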

11 Non-blocking Communication
- Improves performance by overlapping communication and computation.
- Requires an intelligent communication interface (a messaging co-processor, as used in the SP2, Paragon, CS-2, Myrinet, ATM).
[Figure: timeline of message startup and transfer phases, with computation overlapped between them.]

12 Ready Send - MPI_Rsend()
- The receive must be posted before the message arrives; otherwise the operation is erroneous and its outcome is undefined.
- Non-local (completion depends on the starting time of the receiving process).
- Incurs synchronization overhead to guarantee that the receive has been posted.
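
A minimal sketch of a legal ready send (assumes MPI_Init has been called and `rank` holds the process rank; the tags and message size are illustrative). Rank 1 posts its receive first and then signals readiness, so rank 0 may safely call MPI_Rsend:

```c
int data[10], ready = 0;
MPI_Request rreq;
if (rank == 1) {
    MPI_Irecv(data, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, &rreq); /* receive is now posted */
    MPI_Send(&ready, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);        /* tell the sender it is posted */
    MPI_Wait(&rreq, MPI_STATUS_IGNORE);
} else if (rank == 0) {
    MPI_Recv(&ready, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* ... fill data ... */
    MPI_Rsend(data, 10, MPI_INT, 1, 0, MPI_COMM_WORLD);        /* matching receive is known to be posted */
}
```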

13 Buffered Send - MPI_Bsend()
- Explicitly buffers messages on the sending side.
- The user allocates the buffer (MPI_Buffer_attach()).
- Useful when the programmer wants to control buffer usage, e.g., when writing new communication libraries.

14 Buffered Send - MPI_Bsend()
[Figure: a node with user and system memory, PE, and NI; the outgoing message is staged through a buffer allocated in user space.]
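
A minimal sketch of a buffered send with a user-attached buffer (assumes <mpi.h> and <stdlib.h> are included and MPI_Init has been called; `dest`, the tag, and the sizes are illustrative):

```c
int msg[100];
int bufsize = 100 * sizeof(int) + MPI_BSEND_OVERHEAD;
char *bsendbuf = malloc(bufsize);
MPI_Buffer_attach(bsendbuf, bufsize);
MPI_Bsend(msg, 100, MPI_INT, dest, 0, MPI_COMM_WORLD);   /* copied into bsendbuf, completes locally */
MPI_Buffer_detach(&bsendbuf, &bufsize);                  /* blocks until buffered messages are delivered */
free(bsendbuf);
```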

15 Synchronous Send - MPI_Ssend()
- Does not return until the matching receive has started to receive the message.
- The send buffer can be reused once the send operation has completed.
- Non-local (the receiver must have started receiving the message).

16 Standard Send - MPI_Send()
- Standard send: behaviour depends on the implementation (usually synchronous, blocking, and non-local).
- Safe to reuse the buffer when MPI_Send() returns.
- May block until the message is received (depends on the implementation).

17 Standard Send - MPI_Send()
A good implementation:
- Short messages: send immediately, buffering at the receiver if no receive is posted. The goal is to reduce latency; the buffering cost is unimportant.
- Long messages: use the rendezvous protocol (request-reply-send: wait for the matching receive, then send the data).
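
A minimal sketch of a standard send matched by a blocking receive (assumes MPI_Init has been called, `rank` holds the process rank, and at least two ranks exist; the tag is illustrative):

```c
int value = 42;
if (rank == 0) {
    MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* may buffer or block, implementation dependent */
} else if (rank == 1) {
    MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```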

18 How to Exchange Data
Simple (code on node 0):
  sid = MPI_Isend(buf1, node1)
  rid = MPI_Irecv(buf2, node1)
  ... computation ...
  call MPI_Wait(sid)
  call MPI_Wait(rid)
For maximum performance:
  ids(1) = MPI_Isend(buf1, node1)
  ids(2) = MPI_Irecv(buf2, node1)
  ... computation ...
  call MPI_Waitall(2, ids)
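
In C, the high-performance version of this exchange might look like the sketch below (assumes MPI_Init has been called; `other` is the partner rank, and the buffer size and tag are illustrative):

```c
double sendbuf[1000], recvbuf[1000];
MPI_Request ids[2];
MPI_Isend(sendbuf, 1000, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &ids[0]);
MPI_Irecv(recvbuf, 1000, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &ids[1]);
/* ... computation that touches neither buffer ... */
MPI_Waitall(2, ids, MPI_STATUSES_IGNORE);   /* completes both the send and the receive */
```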

19 Model and Measure p2p Communication in MPI
- data transfer time = latency + message size / bandwidth
- Latency (T0) is the startup time, independent of the message size (but dependent on the communication mode/protocol).
- Bandwidth (B) is the number of bytes transferred per second (memory access rate + network transmission rate).

20 Latency and Bandwidth
- For short messages, latency dominates the transfer time.
- For long messages, the bandwidth term dominates the transfer time.
- Critical message size: n_1/2 = latency x bandwidth (the size at which latency = message size / bandwidth).
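
As a short worked example with illustrative numbers (not figures from the lecture): the model gives t(n) = T0 + n/B. With T0 = 50 µs and B = 100 MB/s, the two terms are equal when n_1/2 = T0 x B = 50e-6 s x 100e6 bytes/s = 5000 bytes, so messages much smaller than about 5 KB are latency-dominated and much larger ones are bandwidth-dominated.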

21 Measure p2p Performance
- Round-trip time (ping-pong): bounce a message between two processes and take half of the round-trip time as the one-way time.
[Figure: timeline of send / recv / send between the two processes; one-way time = round-trip time / 2.]
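
A minimal self-contained ping-pong sketch (the message size and repetition count are illustrative):

```c
#include <mpi.h>
#include <stdio.h>

/* Ping-pong sketch: ranks 0 and 1 bounce a small message back and forth;
   half of the average round-trip time estimates the one-way latency. */
int main(int argc, char **argv)
{
    int rank, i, reps = 1000;
    char buf[8] = {0};                 /* small message: latency-dominated */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("one-way time: %g us\n", (t1 - t0) / (2.0 * reps) * 1e6);

    MPI_Finalize();
    return 0;
}
```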

22 Some MPI Performance Results

23 Protocols
- Rendezvous
- Eager
- Mixed
- Pull (get)

24 Rendezvous
Algorithm:
- The sender sends a request-to-send.
- The receiver acknowledges.
- The sender sends the data.
Features:
- No buffering required.
- High latency (three steps).
- High bandwidth (no extra buffer copy).

25 Eager
Algorithm:
- The sender sends the data immediately.
- The data usually must be buffered at the receiver.
- The data may be transferred directly if the receive is already posted.
Features:
- Low latency.
- Low bandwidth (extra buffer copy).

26 Mixed
Algorithm:
- Eager for short messages.
- Rendezvous for long messages.
- Switch protocols near n_1/2.

27 Mixed
Features:
- Low latency for latency-dominated (short) messages.
- High bandwidth for bandwidth-dominated (long) messages.
- Reasonable memory management.
- Non-ideal performance for some messages near n_1/2.

28 Pull (Get) Protocol
- One-sided communication.
- Used in shared-memory machines.

29 MPICH p2p on SGI
- Default thresholds: 0-1024 bytes: Short; 1024 bytes-128 KB: Eager; > 128 KB: Rendezvous.
- MPID_PKT_MAX_DATA_SIZE = 256.
- Short protocol: the message data are carried in the packet header.

30 Let MPID_PKT_MAX_DATA_SIZE = 256
[Figure: measured performance versus message size, with the Short, Eager, and Rendezvous regions marked.]

31 MPI-FM (HPVM: Fast Messages) Performance
[Figure: charts of one-way latency (µs, lower is better) and bandwidth (MB/s, higher is better) for HPVM, Pwr. Chal., SP-2, T3E, Origin 2K, and Beowulf.]
Note: supercomputer measurements taken by NAS, JPL, and HLRS (Germany).

32 MPI Collective Operations

33 MPI_Alltoall(v)
MPI_Alltoall is an extension of MPI_Allgather to the case where each process sends distinct data to each of the receivers. The j-th block of data sent from process i is received by process j and is placed in the i-th block of the receive buffer of process j.
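
A minimal sketch of MPI_Alltoall with one int per destination (assumes <mpi.h> and <stdlib.h> are included, MPI_Init has been called, and `rank` and `size` hold the rank and communicator size; the values are illustrative):

```c
int *sendbuf = malloc(size * sizeof(int));
int *recvbuf = malloc(size * sizeof(int));
for (int j = 0; j < size; j++)
    sendbuf[j] = rank * 100 + j;            /* block j is destined for process j */
MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);
/* On each rank, recvbuf[i] now holds i*100 + rank: the block that process i sent to this rank. */
free(sendbuf);
free(recvbuf);
```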

34 MPI_Alltoall(v)
Define i_j to be the i-th block of data of process j.
[Figure: the data blocks of 8 processes before and after alltoall; afterwards, process i holds block i_j from every process j.]

35 MPI_Alltoall(v)
Current implementation: process j sends block i_j directly to process i.
[Figure: send and receive buffers of processes 0-7, with each block delivered directly to its destination process.]

36 MPI_Alltoall(v)
Current implementation: process j sends block i_j directly to process i.
[Figure: the same direct exchange shown for all blocks across processes 0-7.]

