1 Minimizing Communication Latency to Maximize Network Communication Throughput over InfiniBand
Design and Implementation of MPICH-2 over InfiniBand with RDMA Support (Liu, Jiang, Wyckoff, Panda, Ashton, Buntinas, Gropp, Toonen)
Host-Assisted Zero-Copy Remote Memory Access Communication on InfiniBand (Tipparaju, Santhanaraman, Nieplocha, Panda)
Presented by Nikola Vouk
Advisor: Dr. Frank Mueller

2 Background: General Buffer Manipulation in Communication Protocols

3 InfiniBand
7.6 microsecond latency
857 MB/s peak bandwidth
Send/receive queue + work completion interface
Asynchronous calls
Remote Direct Memory Access (RDMA)
–Sits between a shared-memory architecture and MPI
–Not exactly NUMA, but close
Provides a channel interface (read/write) for communication
Each side registers the memory that other hosts may access; for security, only registered memory is exposed.
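A minimal sketch of the registration step the slide mentions, using the InfiniBand verbs API; it assumes a protection domain pd has already been created and omits error handling, so it is illustrative rather than the papers' actual code:

```c
#include <infiniband/verbs.h>

/* Register a user buffer so that remote hosts may RDMA-read/write it.
 * Only registered memory is exposed; the rkey in the returned memory
 * region must be sent to the peer before it can access the buffer. */
struct ibv_mr *expose_buffer(struct ibv_pd *pd, void *buf, size_t len)
{
    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}
```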

4 (figure-only slide)

5 Common Problems
1. Link-layer/network protocol inefficiencies (unnecessary messages sent)
2. User-space to system-buffer copy overhead (copy time)
3. Synchronous sending/receiving and computing (the application has to stop in order to handle requests)

6 Problem 1: Message Passing Protocol
The basic InfiniBand protocol requires three matching writes.
RDMA Channel Interface
Put operation:
1. Copy the user buffer to the pre-registered buffer
2. RDMA write the buffer to the receiver
3. Adjust the local head pointer
4. RDMA write the new head pointer to the receiver
5. Return bytes written
Get operation:
1. Copy data from shared memory to the user buffer
2. Adjust the tail pointer
3. RDMA write the new tail pointer to the sender
4. Return bytes read
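A hypothetical sketch of the put path described above; struct ring, rdma_write() and the field names are illustrative stand-ins for the MPICH-2 internals, not the actual code, and wrap-around handling is omitted:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Ring buffer shared with the receiver via RDMA. */
struct ring {
    char    *local_copy;      /* pre-registered staging buffer           */
    size_t   size, head, tail;
    uint64_t peer_buf_addr;   /* remote address of the receiver's buffer */
    uint64_t peer_head_addr;  /* remote address of the receiver's head   */
};

/* Stand-in for a one-sided RDMA write of len bytes to a remote address. */
void rdma_write(uint64_t remote, const void *local, size_t len);

size_t channel_put(struct ring *r, const void *user_buf, size_t len)
{
    size_t free_space = (r->tail + r->size - r->head - 1) % r->size;
    size_t n = len < free_space ? len : free_space;

    memcpy(r->local_copy + r->head, user_buf, n);  /* copy into registered buffer */
    rdma_write(r->peer_buf_addr + r->head,
               r->local_copy + r->head, n);        /* RDMA write the data         */
    r->head = (r->head + n) % r->size;             /* adjust the local head       */
    rdma_write(r->peer_head_addr, &r->head,
               sizeof r->head);                    /* RDMA write the new head     */
    return n;                                      /* bytes written               */
}
```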

7 Solutions: Piggybacking and Pipelining
Piggybacking: send the pointer update with the data packets
Pipelining: chop buffers into packet-sized pieces and send them out as the message comes in
Improvement, but still less than 870 MB/s
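One way to read the piggybacking idea (a hypothetical packet layout, not the paper's wire format): the head-pointer update rides in the header of the same RDMA write as the payload, so the separate pointer-update write from the previous slide disappears.

```c
#include <stdint.h>

/* Hypothetical piggybacked packet: pointer update travels with the data.
 * For pipelining, the payload is one packet-sized chunk of the message. */
struct piggyback_packet {
    uint32_t length;     /* payload bytes in this packet            */
    uint32_t new_head;   /* sender's head pointer after this packet */
    char     payload[];  /* data chunk                              */
};
```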

8 Problem 2: Internal Buffer Copying Overhead
Internal overhead: the user must copy data to the system (into a registered memory slot)
Solution: zero-copy buffers, which let the system read directly from the user's buffer

9 Zero-Copy Protocol at Different Levels of the MPICH Hierarchy
If the packet is large enough:
1. Register the user buffer
2. Notify the end host of the request
3. The end host issues an RDMA read
4. It reads directly from the user buffer space
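A hedged sketch of step 3 on the receiving host using the verbs API: the remote address and rkey of the sender's registered user buffer are assumed to have arrived in the rendezvous request, and the function name is hypothetical. The data is pulled straight from the sender's user buffer, with no intermediate copy on either side.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Issue an RDMA read that lands directly in a local registered buffer. */
int issue_rdma_read(struct ibv_qp *qp, struct ibv_mr *dst_mr,
                    uint64_t remote_addr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)dst_mr->addr,  /* local registered destination */
        .length = len,
        .lkey   = dst_mr->lkey,
    };
    struct ibv_send_wr wr = {
        .opcode     = IBV_WR_RDMA_READ,
        .sg_list    = &sge,
        .num_sge    = 1,
        .send_flags = IBV_SEND_SIGNALED,    /* completion signals arrival */
    };
    wr.wr.rdma.remote_addr = remote_addr;   /* sender's user buffer */
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad = NULL;
    return ibv_post_send(qp, &wr, &bad);
}
```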

10 Comparing Interfaces: CH3 Interface vs. RDMA Channel Interface
Implemented directly off of the CH3 interface
More flexible due to access to the complete ADI-3 interface
Always uses RDMA write

11 (figure-only slide)

12 CH3 Implementation Performance A function of raw underlying performance

13
Pipelining always performed the worst
RDMA Channel within 1% of CH3

14 Problem 3: Too Much Overhead, Not Enough Execution
Unanswered problems:
1. Registration overhead is still there, even in the cached version
2. Data transfer still requires significant cooperation from both sides (taking away from computation)
3. Non-contiguous data is not addressed
Solutions:
1. Provide a custom API that allocates out of large pre-registered memory chunks (see the sketch below)
2. Overlap communication with computation as much as possible
3. Apply zero-copy techniques using scatter/gather RDMA calls
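A minimal sketch of the first solution, assuming a simple bump allocator over one large registered region so the per-message registration cost disappears; the names and structure are hypothetical, not the paper's API.

```c
#include <infiniband/verbs.h>
#include <stdlib.h>

/* One big region registered once; sub-buffers are handed out with no
 * further registration cost. */
struct reg_pool {
    struct ibv_mr *mr;     /* covers the whole region */
    char          *base;
    size_t         size, used;
};

struct reg_pool *pool_create(struct ibv_pd *pd, size_t size)
{
    struct reg_pool *p = malloc(sizeof *p);
    p->base = malloc(size);
    p->size = size;
    p->used = 0;
    p->mr   = ibv_reg_mr(pd, p->base, size,
                         IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ);
    return p;
}

void *pool_alloc(struct reg_pool *p, size_t n)
{
    if (p->used + n > p->size)
        return NULL;              /* out of space; a real pool would recycle */
    void *buf = p->base + p->used;
    p->used += n;                 /* bump-pointer allocation, no free()      */
    return buf;
}
```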

15 Host-Assisted Zero-Copy Protocol
Host sends a request for a gather from the receiver
Receiver posts a descriptor and continues working
Can be implemented as a "helper" thread on the receiving host
Same as the previous zero-copy idea, but extended to non-contiguous data
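The non-contiguous extension can lean on the scatter/gather list that the verbs API already supports: one work request carries several ibv_sge entries. The sketch below is hedged; the function name is hypothetical, it assumes all local pieces lie inside one registered region, and it scatters a contiguous remote region into non-contiguous local pieces rather than reproducing the paper's exact descriptor exchange.

```c
#include <infiniband/verbs.h>
#include <stdint.h>

/* Scatter a contiguous remote region into up to 16 non-contiguous local
 * pieces with a single RDMA-read work request. */
int gather_rdma_read(struct ibv_qp *qp, struct ibv_mr *mr,
                     void *pieces[], uint32_t lens[], int n,
                     uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge[16];                 /* assume n <= 16 for the sketch */
    for (int i = 0; i < n; i++) {
        sge[i].addr   = (uintptr_t)pieces[i];
        sge[i].length = lens[i];
        sge[i].lkey   = mr->lkey;
    }
    struct ibv_send_wr wr = {
        .opcode     = IBV_WR_RDMA_READ,
        .sg_list    = sge,
        .num_sge    = n,
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad = NULL;
    return ibv_post_send(qp, &wr, &bad);
}
```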

16 NAS MG
Again, the pipelined method performs similarly to the zero-copy method

17 SUMMA Matrix Multiplication
Significant benefit from host-assisted zero-copy

18 Conclusions
Minimizing internal memory copying removes the primary memory performance obstacle
InfiniBand allows DMA that offloads work from the CPU; coordinating registered memory minimizes CPU involvement
With proper coding, existing MPI programs can achieve almost wire speed over InfiniBand
Could be implemented on other architectures (Gig-E, Myrinet)

19 Thesis Implications
Buddy MPICH is also a latency-hiding implementation of MPICH
Separation happens at the ADI layer
The buddy thread listens for connections and accepts work from the worker thread via send/receive queues
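A hedged sketch of the kind of hand-off that design implies; the queue, request type, and function names are hypothetical, not the actual Buddy MPICH interfaces. The worker enqueues a communication request and returns to computation while the buddy thread drains the queue.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical work queue between the worker (compute) thread and the
 * buddy (communication) thread. */
struct comm_req { void *buf; size_t len; int peer; struct comm_req *next; };

static struct comm_req *queue_head = NULL;
static pthread_mutex_t  queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t   queue_cv   = PTHREAD_COND_INITIALIZER;

/* Worker side: hand the request off and return to computation immediately. */
void submit(struct comm_req *r)
{
    pthread_mutex_lock(&queue_lock);
    r->next = queue_head;
    queue_head = r;
    pthread_cond_signal(&queue_cv);
    pthread_mutex_unlock(&queue_lock);
}

/* Buddy side: block until work arrives, then perform the transfer. */
void *buddy_thread(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (queue_head == NULL)
            pthread_cond_wait(&queue_cv, &queue_lock);
        struct comm_req *r = queue_head;
        queue_head = r->next;
        pthread_mutex_unlock(&queue_lock);
        /* issue the RDMA/MPI operation for r here */
    }
    return NULL;
}
```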

