Towards MPI progression layer elimination with TCP and SCTP (presentation transcript)

1 Towards MPI progression layer elimination with TCP and SCTP
Brad Penoff and Alan Wagner
Distributed Systems Group, Department of Computer Science, University of British Columbia, Vancouver, Canada
HIPS 2006, April 25

2 Will my application run?
Portability: an aspect of parallel processing integration.
The MPI API provides an interface for portable parallel applications, independent of the MPI implementation.
Will my application run?

3 any MPI Implementation
Diagram: User Code → MPI API → any MPI Implementation → Resources

4 Will my application perform well?
Portability: an aspect of parallel processing integration.
The MPI API provides an interface for portable parallel applications, independent of the MPI implementation.
The MPI middleware provides the glue for a variety of underlying components required for a complex parallel runtime environment, independent of the component implementations.
Will my application perform well?

5 any MPI Implementation
Diagram: User Code → MPI Middleware → any MPI Implementation → Resources

6 MPI Middleware
Glues together components.
Diagram: User Code sits atop the MPI Middleware, which comprises a Job Scheduler Component, a Process Manager Component, and a Message Progression / Communication Component, layered over the Transport, Operating System, and Network.

7 Message Progression Communication Component
Maintains necessary state between MPI calls; MPI calls are not simple library functions.
Manages the underlying communication, either through the OS (e.g. TCP) or via direct low-level interaction (e.g. InfiniBand).
Diagram: User Code → MPI Middleware (Message Progression / Communication Component) → Transport → OS → Network.

8 Communication Requirements
Common: portability by having support for all potential interconnects.
In this work: portability by eliminating this component by assuming IP!
Push MPI functionality down onto IP-based transports.
Learn about the necessary MPI implementation design changes.

9 Component Elimination
Diagram: User Code → MPI Middleware/Library (Job Scheduler Component, Process Manager Component, Message Progression / Communication Component) → Operating System / Transport → Network

10 Elimination Motivation
Common approach:
Exploit specific features for all potential interconnects.
Middleware does transport-layer “things”: sequencing & flow control complicate the middleware.
Implemented differently, MPI implementations are incompatible.
Our approach here:
Assume IP.
Leverage mainstream commodity networking advances.
Simplify the middleware.
Increase MPI implementation interoperability (perhaps?).

11 Elimination Approach
View MPI as a protocol, from a networking point of view.
MPI: message matching; expected/unexpected queues; short/long protocol.
Networking: demultiplexing; storage/buffering; flow control.
Design MPI with elimination as a goal.

12 MPI Implementation Designs
TCP SCTP

13 TCP Socket Per TRC
General scheme:
One socket per MPI message stream, i.e. per tag-rank-context (TRC) triple.
Control port: MPI_Send calls connect() (MPI_Recv could wildcard).
The resulting socket is stored in a table attached to the communicator object (a sketch of the connect side follows).
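A minimal sketch of the sender side under this scheme, assuming the receiver advertises a TCP control port; trc_connect() and the TRC table are illustrative names, not from the paper:

    /* Hypothetical sketch: the sender opens one TCP connection per
     * (tag, rank, context) triple via the receiver's control port.
     * The returned fd would be cached in a table attached to the
     * communicator object. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    int trc_connect(const char *host, unsigned short ctrl_port)
    {
        struct sockaddr_in addr;
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(ctrl_port);
        inet_pton(AF_INET, host, &addr.sin_addr);
        if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
            close(fd);
            return -1;
        }
        return fd; /* stored in the communicator's TRC table */
    }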

14 TCP-MPI as a Protocol
Matching: select() fd sets for wildcards (see the sketch after this list).
Queues: unexpected = socket buffer with flow control; expected = more local, attached to handles.
Short/long: no distinction; rely on TCP flow control.
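A sketch of how the wildcard case might poll every per-TRC socket with select(); the socks[] table and wait_any() are illustrative names:

    #include <sys/select.h>

    /* Wait until any per-TRC socket is readable, as an MPI_Recv with
     * MPI_ANY_SOURCE / MPI_ANY_TAG would need to. */
    int wait_any(const int *socks, int nsocks)
    {
        fd_set readfds;
        int i, maxfd = -1;

        FD_ZERO(&readfds);
        for (i = 0; i < nsocks; i++) {
            FD_SET(socks[i], &readfds);
            if (socks[i] > maxfd)
                maxfd = socks[i];
        }
        if (select(maxfd + 1, &readfds, NULL, NULL, NULL) < 0)
            return -1;
        for (i = 0; i < nsocks; i++)
            if (FD_ISSET(socks[i], &readfds))
                return socks[i]; /* first ready TRC socket */
        return -1;
    }

Every wildcard receive rebuilds and scans the whole fd set, which is one reason select() doesn’t scale (slide 15).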

15 TCP per TRC critique
The design achieves elimination, but…
Number of sockets: OS per-user limits.
Expense of system calls (context switches, copying).
select() doesn’t scale.
Flow control.
Mismatch: the transport/OS is event-driven, whereas the MPI application is control-driven.

16 SCTP-based design

17 What is SCTP? Stream Control Transmission Protocol
General-purpose unicast transport protocol for data communications over IP networks.
Recently standardized by the IETF.
Can be used anywhere TCP is used, as the snippet below illustrates.
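A minimal illustration: with the one-to-one socket style, only the protocol argument changes relative to TCP.

    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(void)
    {
        /* Same call shape as TCP; IPPROTO_SCTP selects SCTP. The
         * resulting fd can then be bind()ed, connect()ed, read and
         * written just as the TCP socket would be. */
        int tcp_fd  = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        int sctp_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_SCTP);
        return (tcp_fd < 0 || sctp_fd < 0);
    }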

18 Available SCTP stacks
BSD / Mac OS X
LKSCTP – in the mainline Linux kernel
Solaris 10
HP OpenCall SS7
OpenSS7
Other implementations listed on sctp.org for Windows, AIX, VxWorks, etc.

19 Relevant SCTP features
Multistreaming
One-to-many socket style
Multihoming
Message-based
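A sketch of opening a one-to-many socket and requesting several streams, assuming the lksctp socket API; the stream count is illustrative:

    #include <netinet/in.h>
    #include <netinet/sctp.h>
    #include <string.h>
    #include <sys/socket.h>

    /* One-to-many socket style: SOCK_SEQPACKET, one fd for all peers.
     * The SCTP_INITMSG option asks for multiple streams per association. */
    int open_one_to_many(unsigned short nstreams)
    {
        struct sctp_initmsg init;
        int sd = socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);
        if (sd < 0)
            return -1;
        memset(&init, 0, sizeof(init));
        init.sinit_num_ostreams  = nstreams;
        init.sinit_max_instreams = nstreams;
        if (setsockopt(sd, IPPROTO_SCTP, SCTP_INITMSG,
                       &init, sizeof(init)) < 0)
            return -1;
        return sd;
    }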

20 Logical View of Multiple Streams in an Association
Flow control is per association (not per stream).

21 Using SCTP for MPI
A TRC-to-stream map matches MPI semantics (a sketch follows).
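A hypothetical sketch of such a map, sending each TRC on its own SCTP stream; trc_to_stream(), NSTREAMS, and send_on_trc() are illustrative names, not from the paper:

    #include <netinet/in.h>
    #include <netinet/sctp.h>
    #include <stddef.h>
    #include <sys/socket.h>

    #define NSTREAMS 16 /* illustrative; must match what INIT negotiated */

    /* Map an MPI (tag, rank, context) triple onto an SCTP stream so
     * that unrelated messages never block each other. */
    static unsigned short trc_to_stream(int tag, int rank, int context)
    {
        return (unsigned short)(((unsigned)tag ^ (unsigned)rank
                                 ^ (unsigned)context) % NSTREAMS);
    }

    int send_on_trc(int sd, const void *buf, size_t len,
                    int tag, int rank, int context)
    {
        /* sctp_sendmsg() lets the caller pick the stream number. */
        return sctp_sendmsg(sd, buf, len, NULL, 0, 0 /* ppid */,
                            0 /* flags */,
                            trc_to_stream(tag, rank, context),
                            0 /* ttl */, 0 /* context */);
    }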

22 SCTP-MPI as a protocol Matching – required since cannot receive from a particular stream sctp_recvmsg() = ANY_RANK + ANY_TAG Avoids select() through one-to-many socket Queues – globally required for matching Short/Long – required; flow control not per stream

23 SCTP and elimination SCTP thins the middleware but the component cannot be eliminated Need flow control per stream Need ability to receive from stream Need ability to query which streams have data ready

24 Conclusions
The TCP design eliminates the component but doesn’t scale.
SCTP scales but only thins the component.
The SCTP one-to-many socket style requires additional features for elimination:
Flow control per stream.
Ability to receive from a specific stream.
Ability to query which streams have data ready.

25 Thank you!
More information about our work is at:
Or Google “sctp mpi”

26 Upcoming annual SCTP Interop
July 30 – Aug 4, 2006, to be held at UBC.
Vendors and implementers test their stacks for performance and interoperability.

27 Extra slides

28 MPI Point-to-Point
MPI_Send(msg,cnt,type,dst-rank,tag,context)
MPI_Recv(msg,cnt,type,src-rank,tag,context)
Message matching is done based on Tag, Rank, and Context (TRC).
Combinations such as blocking, non-blocking, synchronous, asynchronous, buffered, unbuffered.
Use of wildcards for receive (illustrated below).
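A minimal, self-contained example of the matching parameters and receive wildcards (two processes; any MPI implementation):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;
            /* dst-rank = 1, tag = 7, context = MPI_COMM_WORLD */
            MPI_Send(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Status status;
            /* Wildcards: match any source rank and any tag. */
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("got %d (src=%d, tag=%d)\n",
                   value, status.MPI_SOURCE, status.MPI_TAG);
        }
        MPI_Finalize();
        return 0;
    }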

29–30 MPI Messages Using Same Context, Two Processes
Out-of-order messages with the same tags violate MPI semantics.

31 Associations and Multihoming
Diagram: Endpoints X and Y joined by a single association across two networks. On network 207.10.x.x, NIC 1 of X has IP 207.10.3.20 and NIC 3 of Y has IP 207.10.40.1; on network 168.1.x.x, NIC 2 of X has IP 168.1.10.30 and NIC 4 of Y has IP 168.1.140.10. A sketch of binding both local addresses follows.
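A sketch of how Endpoint X might bind both of its NIC addresses to one socket with sctp_bindx(); the port number is illustrative:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <netinet/sctp.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Bind both NIC addresses from the diagram to one SCTP socket so
     * the association can fail over between the two networks. */
    int bind_both(int sd)
    {
        struct sockaddr_in addrs[2];

        memset(addrs, 0, sizeof(addrs));
        addrs[0].sin_family = AF_INET;
        addrs[0].sin_port   = htons(5000); /* illustrative port */
        inet_pton(AF_INET, "207.10.3.20", &addrs[0].sin_addr);
        addrs[1].sin_family = AF_INET;
        addrs[1].sin_port   = htons(5000); /* must match addrs[0] */
        inet_pton(AF_INET, "168.1.10.30", &addrs[1].sin_addr);

        return sctp_bindx(sd, (struct sockaddr *)addrs, 2,
                          SCTP_BINDX_ADD_ADDR);
    }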

32 SCTP Key Similarities
Reliable in-order delivery, flow control, full-duplex transfer.
TCP-like congestion control.
Selective ACK is built into the protocol.

33 SCTP Key Differences
Message-oriented.
Added security.
Multihoming, use of associations.
Multiple streams within an association.

34 MPI over SCTP
LAM and MPICH2 are two popular open-source implementations of the MPI library.
We redesigned LAM to use SCTP and take advantage of its additional features.
Future plans include SCTP support within MPICH2.

35 How can SCTP help MPI?
A redesign for SCTP thins the MPI middleware’s communication component.
Use of the one-to-many socket style scales well.
SCTP adds resilience to MPI programs:
Avoids unnecessary head-of-line blocking with streams.
Increased fault tolerance in the presence of multihomed hosts.
Built-in security features.
Improved congestion control.

36–54 Partially Ordered User Messages Sent on Different Streams
(Animation across several slides.)
Can be received in the same order as it was sent (required in TCP).
Delivery constraints: A must be before C, and C must be before D.

55 MPI Middleware Components


