A Hybrid MPI Design using SCTP and iWARP Distributed Systems Group Mike Tsai, Brad Penoff, and Alan Wagner Department of Computer Science University of British Columbia Vancouver, Canada April 14, 2008
A Hybrid Message Passing Interface Design using the Stream Control Transmission Protocol and the Internet Wide Area Remote Direct Memory Access Protocol Distributed Systems Group Mike Tsai, Brad Penoff, and Alan Wagner Department of Computer Science University of British Columbia Vancouver, Canada April 14, 2008
Research Background
SCTP – Stream Control Transmission Protocol
–IETF standardized transport protocol for IP
–Can be used anywhere TCP or UDP are used
–Additional features
SCTP and MPI middleware
–LAM (unreleased)
–MPICH2 (1.0.5 and on) ch3:sctp
–Open MPI SCTP BTL (in v1.3 trunk)
State-of-the-Art Networking
Hardware acceleration techniques for IP
–Protocol offload
–OS bypass
–Zero copy
–RDMA
–10 GigE
How would these look for SCTP?
Are there benefits here for using SCTP?
Story/motivation
iWARP - Internet Wide Area RDMA protocol
–IETF standard for RDMA over IP
Use RDMA, point-to-point, or a mix?
“Why Compromise?” (G. HPCWire.com)
–Depending on the application, use whichever is best.
For MPI middleware, who decides what’s best?
The programmer!
Contribution
Hybrid MPI with functional decomposition lets the programmer decide:
–Let RMA use RDMA
–Let other communications use point-to-point
Explore SCTP’s use within iWARP
–Extended the OSC (Ohio Supercomputer Center) userspace software iWARP, making many internal OSC changes
iWARP : DDP & LLP
Direct Data Placement (DDP)
–Fragments messages
–Reassembles segments
–Segments self-contained
–Data delivery and placement separation
–Out-of-order delivery
Requires the lower-layer protocol (LLP) to:
–Keep segment boundaries
–Be reliable
–Provide a strong checksum
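A minimal sketch of why self-contained segments allow placement to be separated from delivery: each segment carries its own buffer offset, so the receiver can copy it into place in any arrival order and report the message as delivered only once every byte is there. The `Segment` fields, `fragment`, and `Receiver` are hypothetical names for illustration, not the DDP wire format, and the sketch assumes the LLP is reliable (no duplicates or losses).

```python
import random
from dataclasses import dataclass


@dataclass
class Segment:
    msg_id: int      # hypothetical message identifier
    offset: int      # byte offset into the destination buffer
    last: bool       # True on the segment that ends the message
    payload: bytes


def fragment(msg_id, data, max_payload):
    """Split one message into self-contained segments."""
    segments = []
    for off in range(0, len(data), max_payload):
        chunk = data[off:off + max_payload]
        is_last = off + len(chunk) == len(data)
        segments.append(Segment(msg_id, off, is_last, chunk))
    return segments


class Receiver:
    """Places each segment on arrival; signals delivery when complete."""

    def __init__(self, size):
        self.buffer = bytearray(size)
        self.placed = 0      # bytes placed so far
        self.total = None    # message length, known once the last segment is seen

    def on_segment(self, seg):
        # Placement: copy straight to the final location, whatever the order.
        self.buffer[seg.offset:seg.offset + len(seg.payload)] = seg.payload
        self.placed += len(seg.payload)
        if seg.last:
            self.total = seg.offset + len(seg.payload)
        # Delivery: True only when every byte of the message has been placed.
        return self.total is not None and self.placed == self.total


message = bytes(range(256)) * 4
segments = fragment(1, message, 100)
random.Random(0).shuffle(segments)          # simulate out-of-order arrival
rx = Receiver(len(message))
delivered = [rx.on_segment(s) for s in segments]
```

Only the final call to `on_segment` reports delivery, even though the last segment of the message may have arrived earlier.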
iWARP : LLP = MPA over TCP
Marker PDU Aligned (MPA) framing
–Message framing: DDP segment vs. TCP stream
–Markers for out-of-order
–For middlebox fragmentation
–Stronger checksum
… is a complex layer (majority of OSC code)!
… can lead to non-compliant TCP stacks.
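To illustrate the "DDP segment vs. TCP stream" problem, here is a simplified framing sketch: a byte stream does not preserve message boundaries, so each segment is wrapped with a length prefix, padding, and a trailing checksum, and the receiver rebuilds segments from arbitrarily sized chunks. This is loosely modeled on the idea of MPA framing and is not the real protocol: it omits markers entirely, and it uses `zlib.crc32` as a stand-in for CRC32c. `frame` and `Deframer` are hypothetical names.

```python
import struct
import zlib


def frame(ulpdu: bytes) -> bytes:
    """Wrap one segment: 2-byte length, payload, pad to 4 bytes, 4-byte CRC."""
    pad = (-(2 + len(ulpdu))) % 4
    body = struct.pack("!H", len(ulpdu)) + ulpdu + b"\x00" * pad
    # Stand-in checksum: zlib's CRC-32, NOT the CRC32c a real LLP would use.
    return body + struct.pack("!I", zlib.crc32(body))


class Deframer:
    """Recovers whole segments from a stream delivered in arbitrary chunks."""

    def __init__(self):
        self.pending = bytearray()

    def feed(self, chunk: bytes):
        self.pending += chunk
        segments = []
        while len(self.pending) >= 2:
            (length,) = struct.unpack_from("!H", self.pending, 0)
            pad = (-(2 + length)) % 4
            total = 2 + length + pad + 4
            if len(self.pending) < total:
                break  # wait for the rest of this frame
            body = bytes(self.pending[:total - 4])
            (crc,) = struct.unpack_from("!I", self.pending, total - 4)
            if zlib.crc32(body) != crc:
                raise ValueError("checksum mismatch")
            segments.append(body[2:2 + length])
            del self.pending[:total]
        return segments


messages = [b"put", b"", b"x" * 1000]
stream = b"".join(frame(m) for m in messages)
deframer = Deframer()
recovered = []
for i in range(0, len(stream), 7):   # the stream arrives in 7-byte pieces
    recovered += deframer.feed(stream[i:i + 7])
```

With a message-based transport none of this bookkeeping is needed, because each send arrives as one unit.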
SCTP is a better LLP
LLP’s needs built-in to SCTP:
–Reliable, message-based
–CRC32c checksum
–Out-of-order support: MSG_UNORDERED
–Multistreaming
–Multihoming
Unmodified stack supports:
–Path failover
–Multirail data striping
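For reference, the CRC32c checksum named above can be sketched in a few lines. SCTP computes it inside the stack, so an application never does this itself; the bit-by-bit version below only shows what the checksum is (the Castagnoli polynomial in reflected form, 0x82F63B78) and is far slower than a table-driven or hardware implementation.

```python
def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            # Shift out one bit; fold in the polynomial if that bit was set.
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF


# Standard check value for the ASCII string "123456789".
check = crc32c(b"123456789")
```

The standard check value for this algorithm is 0xE3069283.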
In the beginning, there was ch3:sctp
OSC iWARP was modified and incorporated as a thread…
RMA done by modified OSC iWARP
OSC iWARP changes to support MPI
–Running in a thread
–Use SCTP
–Making all OSC ops non-blocking
–Locks around shared data
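The combination of a helper thread, non-blocking operations, and locks around shared data can be sketched as follows. This is a generic pattern under assumed names (`RmaWorker`, `post`, `test`), not the actual OSC iWARP or ch3:hybrid code: the caller posts an operation and gets a handle back immediately, the thread does the work, and a lock guards the completion table that both sides touch.

```python
import queue
import threading


class RmaWorker:
    """Hypothetical helper thread that runs posted operations asynchronously."""

    def __init__(self):
        self._ops = queue.Queue()        # thread-safe work queue
        self._lock = threading.Lock()    # guards _completed and _next_id
        self._completed = {}
        self._next_id = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def post(self, fn, *args):
        """Queue an operation and return a handle without waiting for it."""
        with self._lock:
            op_id = self._next_id
            self._next_id += 1
        self._ops.put((op_id, fn, args))
        return op_id

    def test(self, op_id):
        """Non-blocking completion check: returns (done, result)."""
        with self._lock:
            if op_id in self._completed:
                return True, self._completed.pop(op_id)
        return False, None

    def shutdown(self):
        """Drain the queue, then stop the thread."""
        self._ops.put(None)
        self._thread.join()

    def _run(self):
        while True:
            item = self._ops.get()
            if item is None:
                return
            op_id, fn, args = item
            result = fn(*args)
            with self._lock:
                self._completed[op_id] = result


worker = RmaWorker()
handle = worker.post(sum, [1, 2, 3])   # returns immediately
worker.shutdown()                      # wait for queued work to finish
status = worker.test(handle)
```

The caller is free to continue with other communication between `post` and `test`, which is one way an extra thread can help overall run time.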
Connection Management Design
Connection establishment:
Separate one-to-many socket for new QPs
–SCTP “peeloff” feature
New QP sends request from one-to-many socket
Request/ACK received, then QP socket peeled off
For conflicts, MPI rank resolves who sends ACK
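The conflict case arises when two processes request a connection to each other at the same time. The slide says only that MPI rank resolves who sends the ACK; the sketch below assumes, purely for illustration, that the lower rank is the one that ACKs. The point is that both sides can apply the same rule locally and reach the same answer without extra messages. `sends_ack` and `resolve_conflict` are hypothetical names.

```python
def sends_ack(my_rank: int, peer_rank: int) -> bool:
    """Assumed rule (not from the source): the lower rank sends the ACK."""
    if my_rank == peer_rank:
        raise ValueError("a process does not connect to itself")
    return my_rank < peer_rank


def resolve_conflict(rank_a: int, rank_b: int):
    """Return (acker, requester) as each side would compute it on its own."""
    a_acks = sends_ack(rank_a, rank_b)
    b_acks = sends_ack(rank_b, rank_a)
    # Ranks are unique, so exactly one side decides to ACK.
    assert a_acks != b_acks
    return (rank_a, rank_b) if a_acks else (rank_b, rank_a)


outcome = resolve_conflict(7, 3)
```

Any total order on ranks would work equally well; what matters is that the rule is deterministic and symmetric.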
Performance
What we tested…
–Compared our new ch3:hybrid to the original ch3:sctp
–Two 3.2 GHz Intel boxes (GigE + switch)
OSU latency tests (MPI_Put & MPI_Get)
Homemade synthetic benchmark
–Combination of RMA and MPI-1 calls
OSU One-sided Latency Tests
ch3:hybrid adds 2-8% overhead
Synthetic Application
ch3:hybrid was faster than ch3:sctp
–3.8 seconds vs. 4.5 seconds
Extra thread helps in some cases
Conclusions
RDMA versus point-to-point for MPI
–Why choose? Functional decomposition lets programmer decide
SCTP is a good match for iWARP
–Implementation of iWARP using SCTP shown.
–SCTP has its place in the state-of-the-art.
–It’d be more exciting to have SCTP-based devices…
Google “sctp mpi” for more information about our work Thank you!