Open Fabrics Interfaces Architecture Introduction
Sean Hefty, Intel Corporation

Current State of Affairs
OFED software is a widely adopted low-level RDMA API that ships with upstream Linux, but:
–OFED SW was not designed around HPC.
–Hardware and fabric features are changing; divergence is driving competing APIs.
–Interfaces are being extended and new APIs introduced, with long delays in adoption.
–Cluster sizes and single-node core counts are greatly increasing.
–More applications want to take advantage of high-performance fabrics.

Solution
Design software interfaces that are aligned with application requirements:
–Evolve OpenFabrics.
–Target the needs of HPC.
–Support multiple interface semantics.
–Stay fabric and vendor agnostic.
–Be supportable in upstream Linux.

Enabling through OpenFabrics
The Open Fabrics Interfaces Working Group leverages the existing open source community and its broad ecosystem:
–Application developers and vendors
–Active community engagement
It drives high-performance software APIs that:
–Take advantage of available hardware features
–Support multiple product generations

OFIWG Charter
Develop an extensible, open source framework and interfaces aligned with ULP and application needs for high-performance fabric services.
Software leading hardware:
–Enable future hardware features.
–Minimal impact to applications.
Minimize the impedance mismatch between ULPs and network APIs.
Craft optimal APIs:
–Detailed analysis of MPI, SHMEM, and other PGAS languages.
–Focus on other applications: storage, databases, …

Call for Participation
The OFI WG is open to all participants:
–Contact the ofiwg mail list for meeting details.
–Source code is available through github: github.com/ofiwg
–Presentations and meeting minutes are available from the OFA download directory.
Help the OFI WG understand workload requirements and drive software design.

Enable…
Scalability:
–Reduced cache and memory footprint
–Scalable address resolution and storage
–Tight data structures
High performance:
–Optimized software path to hardware
–Independent of hardware interface, version, and features
App-centric:
–Analyze application needs and implement them in a coherent, concise, high-performance manner
Extensible and adaptable:
–More agile development: time-boxed, iterative development
–Application-focused APIs

Verbs Semantic Mismatch
Current RDMA APIs (generic send call, 25-30 lines of C code):
–Allocate a WR and an SGE
–Format the SGE: 3 writes; format the WR: 6 writes
–Loop 1 checks: 9 branches; loop 2 check; loop 3 checks: 3 branches
Evolved fabric interfaces (optimized send call):
–Direct call: 3 writes
–Checks: 2 branches
The goals:
–Reduce setup cost through tighter data.
–Eliminate loops and branches; the remaining branches are predictable.
–Provide selective optimization paths to HW via manual function expansion.

Application-Centric Interfaces
Reducing instruction count requires a better application impedance match.
–Collect application requirements.
–Identify common, fast path usage models; there are too many use cases to optimize them all.
–Build primitives around fabric services, not a device-specific interface.

OFA Software Evolution
Transition from disjoint APIs to a cohesive set of fabric interfaces: today's IB verbs and RDMA CM libraries sit atop a verbs provider and the uVerbs command interface, while libfabric exposes the fabric interfaces through FI providers.

Fabric Interfaces Framework
Focus on longer-lived interfaces: software leading hardware.
–Take growth into consideration.
–Reduce the effort to incorporate new application features: addition of new interfaces, structures, or fields; modification of existing functions.
–Allow time to design new interfaces correctly: support prototyping interfaces prior to integration.

Fabric Interfaces
The framework defines multiple interfaces: a control interface, message queues, RMA, atomics, tag matching, triggered operations, addressing services, and CM services. Vendors provide optimized implementations of each.

Fabric Interfaces
Defines a philosophy for interfaces and extensions:
–Focus interfaces on the semantics and services offered by the hardware, not the hardware implementation.
Exports a minimal API:
–The control interface.
Defines fabric interfaces:
–API sets for specific functionality.
Defines a core object model:
–Object-oriented design, but C interfaces.

Fabric Interfaces Architecture
Based on object-oriented programming concepts. Derived objects define interfaces:
–New interfaces are exposed.
–Derived objects define the behavior of inherited interfaces.
–Implementations can be optimized per object.

Control Interfaces
The FI framework exports a small set of control calls; the application specifies its desired functionality:
–fi_getinfo: discover fabric providers and services; identify resources and addressing.
–fi_fabric: open a set of fabric interfaces and resources.
–fi_register: dynamic providers publish their control interfaces.
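
A minimal sketch of this discover-and-open flow, using the libfabric API as it was later released (names and signatures evolved after this presentation; FI_EP_RDM and FI_MSG below are illustrative choices, not requirements):

    #include <stdio.h>
    #include <stdlib.h>
    #include <rdma/fabric.h>

    int main(void)
    {
        struct fi_info *hints, *info;
        struct fid_fabric *fabric;
        int ret;

        hints = fi_allocinfo();
        if (!hints)
            return EXIT_FAILURE;
        hints->ep_attr->type = FI_EP_RDM;  /* reliable datagram endpoint */
        hints->caps = FI_MSG;              /* request message queue support */

        /* Discover providers and services matching the hints */
        ret = fi_getinfo(FI_VERSION(1, 0), NULL, NULL, 0, hints, &info);
        if (ret) {
            fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
            return EXIT_FAILURE;
        }

        /* Open the fabric reported by the first matching provider */
        ret = fi_fabric(info->fabric_attr, &fabric, NULL);
        if (ret) {
            fprintf(stderr, "fi_fabric: %s\n", fi_strerror(-ret));
            return EXIT_FAILURE;
        }

        fi_close(&fabric->fid);
        fi_freeinfo(info);
        fi_freeinfo(hints);
        return EXIT_SUCCESS;
    }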

Application Semantics
Get / set using the control interfaces:
–Progress: application or hardware driven; data versus control interfaces.
–Ordering: message ordering; data delivery order.
–Multi-threading and locking model: compile- and run-time options.

Fabric Object Model
The fabric object is the boundary of resource sharing, and a provider may abstract multiple NICs behind a single fabric. Within a fabric, each resource domain (typically mapped to a NIC) contains event queues, event counters, active endpoints, and address vectors; passive endpoints bind to an event queue at the fabric level. Software objects are usable across multiple HW providers.
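
A sketch of how an application walks this object hierarchy, using the released libfabric calls (fi_domain, fi_eq_open, fi_endpoint, fi_av_open) as stand-ins for the model described here; 'info' and 'fabric' are assumed to come from the previous sketch:

    #include <rdma/fabric.h>
    #include <rdma/fi_domain.h>
    #include <rdma/fi_endpoint.h>
    #include <rdma/fi_eq.h>

    static int open_resources(struct fi_info *info, struct fid_fabric *fabric)
    {
        struct fid_domain *domain;  /* resource domain: maps to a NIC */
        struct fid_eq *eq;          /* event queue for control events */
        struct fid_ep *ep;          /* active endpoint */
        struct fid_av *av;          /* address vector */
        struct fi_eq_attr eq_attr = { .wait_obj = FI_WAIT_UNSPEC };
        struct fi_av_attr av_attr = { .type = FI_AV_MAP };
        int ret;

        ret = fi_domain(fabric, info, &domain, NULL);
        if (ret)
            return ret;
        ret = fi_eq_open(fabric, &eq_attr, &eq, NULL);
        if (ret)
            return ret;
        ret = fi_endpoint(domain, info, &ep, NULL);
        if (ret)
            return ret;
        return fi_av_open(domain, &av_attr, &av, NULL);
    }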

Endpoint Interfaces
Endpoints are the communication interfaces. Each endpoint has properties (type, protocol, capabilities) and interface sets: CM, message transfers, RMA, tagged messages, atomics, and triggered operations. The software path to hardware is optimized based on the endpoint's properties, and aliasing supports multiple software paths to the same hardware.

Application Configured Interfaces
The app specifies its communication model: the endpoint communication type, capabilities, and data transfer flags. The provider then directs the app to the best API sets; for example, small messages map to inline sends through the message queue ops, while large messages map to write/read through the RMA ops on the NIC.

Event Queues
Asynchronous event reporting through compact, optimized data structures, with the interface optimized around reporting successful operations.
–Format: context only, data, tagged, or generic.
–Wait object: none, fd, or mwait (a user-specified wait object).
–Per-domain event counters support lightweight event reporting.
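
In the released libfabric, the data-transfer side of this design became completion queues; a sketch of selecting the completion format and wait object at open time ('domain' as in the earlier sketch, queue depth hypothetical):

    #include <rdma/fi_domain.h>
    #include <rdma/fi_eq.h>

    static int open_cq(struct fid_domain *domain, struct fid_cq **cq)
    {
        struct fi_cq_attr attr = {
            .size     = 1024,                 /* hypothetical queue depth */
            .format   = FI_CQ_FORMAT_CONTEXT, /* report op context only */
            .wait_obj = FI_WAIT_FD,           /* wait via a poll()-able fd */
        };
        return fi_cq_open(domain, &attr, cq, NULL);
    }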

Event Queues
The app selects the completion structure. Reading a generic verbs-style completion adds per-operation cost (send: +4-6 writes, +2 branches; receive: additional writes, +4 branches), while an application-optimized completion that returns only the op context costs +1 write and +0 branches.

Address Vectors
Address vectors store addresses and host names, hiding fabric-specific addressing requirements (the layout shown is an example only).
–Insert a range of addresses with a single call.
–Share vectors between processes.
–Enable provider optimization techniques that greatly reduce storage requirements.
–Reference entries by handle or index; a handle may be an encoded fabric address.
–Reference a vector for group communication.
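
A sketch using the released libfabric address vector calls; the base hostname "node" and service "5000" are hypothetical, and fi_av_insertsym expands the base name across 'count' indices so a whole range of peers is inserted with one call:

    #include <rdma/fi_domain.h>

    static int setup_av(struct fid_domain *domain, struct fid_av **av,
                        fi_addr_t *peers, size_t count)
    {
        struct fi_av_attr attr = {
            .type  = FI_AV_TABLE,   /* entries referenced by index */
            .count = count,
        };
        int ret;

        ret = fi_av_open(domain, &attr, av, NULL);
        if (ret)
            return ret;
        /* Symmetric insert: one call resolves the full range of peers. */
        ret = fi_av_insertsym(*av, "node", count, "5000", 1, peers, 0, NULL);
        return (ret == (int) count) ? 0 : -1;
    }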

Summary
These concepts are necessary, not revolutionary: communication addressing, optimized data transfers, app-centric interfaces, and a future-looking design. We want a solution where the pieces fit tightly together.

Repeated Call for Participation
Co-chair:
–Meets Tuesdays from 9-10 PST / 12-1 EST.
Links:
–Mailing list subscription
–Document downloads
–libfabric source tree
–libfabric sample programs

Backup

Verbs API Mismatch

    struct ibv_sge {
        uint64_t addr;
        uint32_t length;
        uint32_t lkey;
    };

    struct ibv_send_wr {
        uint64_t            wr_id;
        struct ibv_send_wr *next;
        struct ibv_sge     *sg_list;
        int                 num_sge;
        enum ibv_wr_opcode  opcode;
        int                 send_flags;
        uint32_t            imm_data;
        ...
    };

For an application request that needs 3 x 8 = 24 bytes of data, SGE + WR = 88 bytes are allocated and 28 additional bytes must be initialized:
–Requests may be linked, so next must be set to NULL.
–The WR must link to a separate SGL, and the count must be initialized.
–The app must set the opcode, and the provider must switch on it.
–The flags must be cleared.
Significant SW overhead.
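
For concreteness, a sketch of what a single send looks like through standard libibverbs calls (the qp, mr, and buffer are assumed to already exist):

    #include <stdint.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    /* Post one send of 'len' bytes: two structures, linking, opcode, flags. */
    static int post_one_send(struct ibv_qp *qp, struct ibv_mr *mr,
                             void *buf, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t) buf,
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr, *bad_wr;

        memset(&wr, 0, sizeof(wr));    /* clear flags and unused fields */
        wr.wr_id      = (uintptr_t) buf;
        wr.next       = NULL;          /* single request, but must be set */
        wr.sg_list    = &sge;          /* separate SGL ... */
        wr.num_sge    = 1;             /* ... and its count */
        wr.opcode     = IBV_WR_SEND;   /* provider switches on this */
        wr.send_flags = IBV_SEND_SIGNALED;

        return ibv_post_send(qp, &wr, &bad_wr);
    }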

Verbs Provider Mismatch
For each work request (most often 1, since apps overlap operations), the provider must:
–Check for available queue space.
–Check the SGL size.
–Check for a valid opcode.
–Check flags x 2 (flags may be fixed, or the app may already have taken branches).
–Check the specific opcode.
–Switch on the QP type (usually fixed in source).
–Switch on the opcode.
–Check flags.
For each SGE (often 1 or 2, fixed in source):
–Check the size.
–Loop over the length.
–Check flags.
Plus a check for the last request (an artifact of the API) and other checks. The result is many branches, including loops: 100+ lines of C code on the path to HW.

Verbs Completions Mismatch

    struct ibv_wc {
        uint64_t           wr_id;
        enum ibv_wc_status status;
        enum ibv_wc_opcode opcode;
        uint32_t           vendor_err;
        uint32_t           byte_len;
        uint32_t           imm_data;
        uint32_t           qp_num;
        uint32_t           src_qp;
        int                wc_flags;
        uint16_t           pkey_index;
        uint16_t           slid;
        uint8_t            sl;
        uint8_t            dlid_path_bits;
    };

–The provider must fill out all fields, even those ignored by the app, and must handle all types of completions from any QP.
–The developer must determine which fields apply to their QP.
–The app must check both the return code and the status to determine whether a request completed successfully.
–The single structure is 48 bytes, likely crossing a cacheline boundary, yet the app accesses only a few fields.

RDMA CM Mismatch

    struct rdma_route {
        struct rdma_addr        addr;
        struct ibv_sa_path_rec *path_rec;
        ...
    };

    struct rdma_cm_id { ... };

    rdma_create_id()
    rdma_resolve_addr()
    rdma_resolve_route()
    rdma_connect()

Apps want reliable data transfers with zero copies to thousands of processes. The RDMA interfaces instead expose:
–Src/dst addresses stored per endpoint: 456 bytes per endpoint.
–A path record per endpoint.
–Resolution of a single address and path at a time.
–An all-to-all connected model for best performance.

Progress
Progress is the ability of the underlying implementation to complete processing of an asynchronous request. ALL asynchronous requests must be considered:
–Connections, address resolution, data transfers, event processing, completions, etc.
Implementations are a HW/SW mix, and all(?) current solutions require significant software components.

Progress
Support two progress models, automatic and implicit. Operations are separated into one of two progress domains, data or control, and the progress model is reported for each domain.

    Sample     Implicit    Automatic
    Data       Software    Hardware offload
    Control    Software    Kernel services
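
In the released libfabric the implicit model is named manual progress; a sketch of requesting a progress model per domain through the domain attributes ('hints' as in the first sketch):

    #include <rdma/fabric.h>

    /* Ask for hardware/kernel-driven progress in both domains.
     * FI_PROGRESS_MANUAL corresponds to the implicit model above. */
    static void set_progress(struct fi_info *hints)
    {
        hints->domain_attr->data_progress    = FI_PROGRESS_AUTO;
        hints->domain_attr->control_progress = FI_PROGRESS_AUTO;
    }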

Automatic Progress
Implies a hardware offload model, or standard kernel services / threads for control operations. Once an operation is initiated, it will complete without further user intervention or calls into the API. Automatic progress meets the implicit model by definition.

Implicit Progress
Implies a significant software component. Progress occurs when reading or waiting on EQ(s); an application can use separate EQs for control and data, and progress is limited to the objects associated with the selected EQ(s). The app can still request automatic progress, e.g. when it wants to wait on a native wait object, which implies provider-allocated threading.

Ordering
Ordering applies to a single initiator endpoint performing data transfers to one target endpoint over the same data flow, where a data flow may be a conceptual QoS level or a path through the network. There are separate ordering domains: completions, messages, and data. Fenced ordering may be obtained using the fi_sync operation.

Completion Ordering
The order in which operation completions are reported relative to their submission. Completions may be unordered or ordered; there is no defined requirement for ordered completions. Default: unordered.

Message Ordering
The order in which message (transport) headers are processed, i.e. whether transport messages are received in or out of order. Determined by the selection of ordering bits:
–[Read | Write | Send] After [Read | Write | Send]
–RAR, RAW, RAS, WAR, WAW, WAS, SAR, SAW, SAS
Examples:
–fi_order = 0 // unordered
–fi_order = RAR | RAW | RAS | WAW | WAS | SAW | SAS // IB/iWarp-like ordering
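
In the released libfabric these bits became FI_ORDER_* flags on the transmit attributes; a sketch of requesting the IB/iWarp-like ordering above:

    #include <rdma/fabric.h>

    /* Request IB/iWarp-like message ordering through the tx attributes. */
    static void set_ib_like_order(struct fi_info *hints)
    {
        hints->tx_attr->msg_order = FI_ORDER_RAR | FI_ORDER_RAW |
                                    FI_ORDER_RAS | FI_ORDER_WAW |
                                    FI_ORDER_WAS | FI_ORDER_SAW |
                                    FI_ORDER_SAS;
    }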

Data Ordering
The delivery order of transport data into target memory, defined per byte-addressable location, i.e. access to the same byte in memory. Data ordering is constrained by the message ordering rules: message ordering must at least be in place first.

Data Ordering
Ordering is limited to the message order size, e.g. the MTU: data is delivered in order if the transfer is <= the message order size. (Open question: separate WAW, RAW, and WAR sizes?)
–Message order size = 0: no data ordering.
–Message order size = -1: all data ordered.

Other Ordering Rules
Ordering to different target endpoints is not defined. Per-message ordering semantics are implemented using different data flows: data flows may be less flexible but easier to optimize, and endpoint aliases may be configured to use different data flows.

Multi-threading and Locking
Support both thread-safe and lockless models, with compile-time and run-time support (run-time limited to what was compiled in).
Lockless (based on the MPI model):
–Single: single-threaded app.
–Funneled: only 1 thread calls into the interfaces.
–Serialized: only 1 thread at a time calls into the interfaces.
Thread safe:
–Multiple: multi-threaded app, with no restrictions.
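
The released libfabric expresses this through the domain threading attribute, with somewhat different level names; a sketch (mapping the serialized level to FI_THREAD_DOMAIN is my approximation):

    #include <rdma/fabric.h>

    /* Declare the app's threading model so the provider can drop locks. */
    static void set_threading(struct fi_info *hints, int multithreaded)
    {
        hints->domain_attr->threading =
            multithreaded ? FI_THREAD_SAFE : FI_THREAD_DOMAIN;
    }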

Buffering
Support both application and network buffering:
–Zero-copy for high performance; network buffering for ease of use.
–Buffering may be in local memory or on the NIC; in some cases, buffered transfers may be higher performing (e.g. “inline”).
–Registration is optional for local NIC access, with migration toward fabric-managed registration.
–Registration is required for remote access and specifies permissions.
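
The buffered "inline" path surfaced in the released libfabric as fi_inject, which copies a small payload at call time so the buffer is immediately reusable; a sketch ('peer' as resolved in the address vector example):

    #include <rdma/fi_endpoint.h>

    /* Buffered send: the provider copies 'len' bytes before returning, so
     * no completion is needed to reuse 'buf'. Valid only up to the
     * endpoint's inject size limit. */
    static ssize_t send_small(struct fid_ep *ep, const void *buf, size_t len,
                              fi_addr_t peer)
    {
        return fi_inject(ep, buf, len, peer);
    }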

Scalable Transfer Interfaces
Application-optimized code paths based on the usage model:
–Optimize call(s) for a single work request with a single data buffer, while still supporting more complex WR lists / SGLs.
–Per-endpoint send/receive operations, with separate RMA function calls.
–Pre-configure data transfer flags that are known before the request is posted, selecting the software path through the provider.
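
In the released libfabric this appears as call variants: fi_send for the single-buffer fast path (fi_sendv and fi_sendmsg cover the general cases) and fi_write as a separate RMA call; a sketch:

    #include <rdma/fi_endpoint.h>
    #include <rdma/fi_rma.h>

    /* Single-buffer fast path: one call, no WR or SGL to build. */
    static ssize_t send_one(struct fid_ep *ep, void *buf, size_t len,
                            void *desc, fi_addr_t peer, void *ctx)
    {
        return fi_send(ep, buf, len, desc, peer, ctx);
    }

    /* RMA is a separate call rather than an opcode inside a work request;
     * 'raddr' and 'rkey' are the remote buffer address and protection key. */
    static ssize_t write_one(struct fid_ep *ep, void *buf, size_t len,
                             void *desc, fi_addr_t peer, uint64_t raddr,
                             uint64_t rkey, void *ctx)
    {
        return fi_write(ep, buf, len, desc, peer, raddr, rkey, ctx);
    }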