Request ordering for FI_MSG and FI_RDM endpoints


Request ordering for FI_MSG and FI_RDM endpoints
29 April ’14

Something needed so consumers of libfabric stay sane
- A few types of endpoints with simple ordering rules that are reasonably easy to understand
- The ordering rules should allow enough flexibility that different providers can deliver maximum performance while still ensuring program correctness

Background: what is ordering?
Ordering is an end-to-end concept and may include some or all of the following:
- Expectations of the API consumer w.r.t. the order of execution of operations posted to the fabric provider
- Execution of operations as expressed on the wire
- Ordering of information as packets/flits/msgs cross the wire
- Ordering of inbound operations (both inbound requests and inbound RDMA operations)
- Order in which inbound data is placed on the memory bus
- Order in which inbound data is written to memory by the memory controller
- Expectations of the API consumer w.r.t. the order in which operations are completed/notified

IB: The concise ordering rules
- Operations on the SEND queue are transmitted on the wire in order.
- Operations at the RESPONDER side are executed in the order received.
- A SEND or RDMA WRITE may be executed before an RDMA READ!
- Operations on the SEND queue are completed in the order in which they were posted.
[Figure: SEND 1, RDMA RD 2 and SEND 3 going from the send queue to the responder, with ACK 1, READ DATA and ACK 3 coming back]
6/14/2011 www.openfabrics.org

What does man -l man/fi_getinfo.3 have now?
- FI_MSG - Provides reliable, in-order message based communication, with data transfers maintaining message boundaries.
  Hmmm. Okay, so if you bought an adaptive network, you wasted your money.
- FI_RDM - Provides reliable datagram communication without ordering guarantees.
  Hmmm. Okay, does PSM really work this way? MPI can’t use this mode easily.

It’s worse than that
- IB provides ordering only on operations posted to a given QP.
- The QP construct binds together operations of different types in order to provide ordering guarantees between different operations, e.g. between message and RDMA operations.
- How is that accomplished using the fabric interfaces?

What would be nicer…
- FI_MSG - Provides reliable, message based communication, with data transfers maintaining message boundaries. Messages are ordered by default, with relaxed order being optionally supported on a per message basis.
- FI_RDM - Provides reliable datagram communication. By default, ordering is not guaranteed, although for datagrams targeting a given network endpoint, a sequence of datagrams can be specified as an ordered sequence.

Using relaxed order in MPI: rendezvous example
[Figure: WR0, SndCmp0, WR1 and SndCmp1 flowing to the responder, with relationships labeled "ordering required" and "no ordering dependency"]

PCI-e transaction order rules (for given TC, src, target)
Producer/consumer model
Row pass column? (first op across, second op down)
Columns: Posted Request | Non-posted read request | Non-posted AMO req.
Posted Request (RDMA write): a) no b) y/n | yes
Non-posted read request: y/n
Non-posted AMO request:
a) Strict producer/consumer model, i.e. strict order
b) I’m feeling lucky and have set RO bit in second request
Requesting relaxed order doesn’t mean you’ll get it, hence y/n.
If you’re a HW designer, don’t count on relaxed order.

IB transaction ordering rules (RC)
Row pass column? (first op across, second op down)
Columns: Send/RDMA Write | RDMA Read | Non-posted AMO req.
Send/RDMA Write: no | a) yes b) no
Atomic Op.: a) yes b) no
a) No ordering guarantee: don’t count on order if you want correctness
b) I need order and have set the fence bit in second transaction
Strict producer-consumer model
March 30 – April 2, 2014 #OFADevWorkshop

Libfabric FI_MSG now, depending on interpretation of fi_getinfo.3
Row pass column? (first op across, second op down)
Columns: Send/RDMA write | RDMA Read | Non-posted AMO req.
Send/RDMA Write: no
Atomic Op.:

Libfabric FI_MSG with optional relaxed order proposal
Row pass column? (first op across, second op down)
Columns: Send/RDMA write | RDMA Read | Non-posted AMO req.
Send/RDMA Write: a) no b) y/n | a) yes c) no
Atomic Op.: a) yes c) no
a) IB RC like behavior (default)
b) Relaxed order bit set in flag (SendMsg, etc.)
c) Fence bit set in flag
Provider free to ignore b), must observe c)

Ordering bits for FI_MSG
- Add new flag bit for sendmsg/writemsg: FI_RELAXED_ORDER. If this bit is set, the MSG, RMA, or AMO operation may be completed ahead of pending MSG, RMA, or AMO ops in the EP’s send queue. Messages may appear to complete out of order when this bit is set.
- Add new flag bit for sendmsg/writemsg: FI_FENCE_GLOBAL. If this bit is set, the operation will not be initiated until all MSG, RMA, and AMO ops previously posted to the EP have completed globally. fi_ep_sync sounds blocking; this is a potentially non-blocking way to do a fence.
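The two proposed bits can be read as a small decision rule. The sketch below models it in Python for sends only, where the default is strict order. It is not libfabric code: FI_RELAXED_ORDER and FI_FENCE_GLOBAL are the names proposed on this slide, while `Flag`, `may_pass` and `may_run_ahead` are hypothetical names made up for the illustration. It also assumes that the fence wins when both bits are set, following the rule that a provider must observe the fence but is free to ignore relaxed order.

```python
from enum import IntFlag


class Flag(IntFlag):
    """Hypothetical stand-ins for the flag bits proposed on the slide."""
    NONE = 0
    RELAXED_ORDER = 1  # models the proposed FI_RELAXED_ORDER
    FENCE_GLOBAL = 2   # models the proposed FI_FENCE_GLOBAL


def may_pass(second_flags: Flag) -> bool:
    """May a later send complete ahead of an earlier, still-pending send?"""
    if second_flags & Flag.FENCE_GLOBAL:
        # The fence must be observed: wait for everything posted earlier.
        return False
    # Relaxed order is only a permission; a provider may still keep strict order.
    return bool(second_flags & Flag.RELAXED_ORDER)


def may_run_ahead(pending: list) -> list:
    """Indices of pending sends (oldest first) that are allowed to progress now."""
    # The oldest pending send has nothing ahead of it, so it can always progress.
    return [i for i, flags in enumerate(pending) if i == 0 or may_pass(flags)]


queue = [Flag.NONE, Flag.RELAXED_ORDER, Flag.FENCE_GLOBAL,
         Flag.RELAXED_ORDER | Flag.FENCE_GLOBAL, Flag.NONE]
print(may_run_ahead(queue))  # [0, 1]
```

Only the oldest send and the relaxed send behind it may progress; the fenced sends and the default-ordered send at the end have to wait.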

Libfabric FI_RDM now, depending on interpretation of fi_getinfo.3
Row pass column? (first op across, second op down)
Columns: Send/RDMA write | RDMA Read | Non-posted AMO req.
Send/RDMA Write: yes
Atomic Op.:
Not enough order? fi_ep_sync seems kind of heavyweight.

Libfabric FI_RDM suggestion: HyperTransport ordered sequences
Row pass column? (first op across, second op down)
Columns: Send/RDMA write | RDMA Read | Non-posted AMO req.
Send/RDMA Write: a) yes b) no
Atomic Op.: a) yes b) no
a) Default, as in man page. App must use fi_ep_sync for ordering.
b) If second op has the same order sequence. Ops must be back-to-back.

Ordering bits for FI_RDM
- Add new flag bit for sendmsg/writemsg: FI_ORDERED_SEQ. If this bit is set, the message, RMA, or AMO operation is treated as part of an ordered sequence. The sequence number is specified in the flow field of the fi_msg, etc. argument.
- Msg, RMA, and AMO requests within an ordered sequence must be posted sequentially to a given endpoint, without intervening requests that are not part of the ordered sequence. The operations must all target the same target address.
- Some providers may be able to do this efficiently; otherwise the behavior is as if fi_ep_sync were invoked internally between each operation.
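The ordered-sequence rule can be sketched the same way. The Python model below is an illustration only, not libfabric code: `Op`, `must_stay_ordered` and `ordered_pairs` are hypothetical names, and `seq=None` stands in for "FI_ORDERED_SEQ not set". It checks adjacent posts only, because the slides require sequence members to be posted back-to-back and do not say what happens when they are not.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Op:
    """One posted request; seq=None models 'FI_ORDERED_SEQ not set'."""
    target: str
    seq: Optional[int] = None


def must_stay_ordered(first: Op, second: Op) -> bool:
    """True if two back-to-back posts belong to the same ordered sequence."""
    return (first.seq is not None
            and first.seq == second.seq
            and first.target == second.target)


def ordered_pairs(posted: list) -> list:
    """Adjacent index pairs that the provider has to keep in order."""
    return [(i, i + 1) for i in range(len(posted) - 1)
            if must_stay_ordered(posted[i], posted[i + 1])]


posted = [Op("A", 7), Op("A", 7), Op("B"), Op("A", 7), Op("B", 7)]
print(ordered_pairs(posted))  # [(0, 1)]
```

Only the first two posts form an ordered pair. The unordered request to B breaks the run, and the last two posts share a sequence number but target different addresses.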