A Brief Introduction to OpenFabrics Interfaces - libfabric


A Brief Introduction to OpenFabrics Interfaces - libfabric
Sean Hefty

The terms OFI and libfabric are often used interchangeably. Libfabric is the first, but likely not the only, component of what will become OpenFabrics Interfaces. It is focused on user-space applications.

Motivation

OpenFabrics libibverbs:
- Widely adopted low-level RDMA API
- Ships with upstream Linux
- Intended as a unified API for RDMA
but...
- Designed around the InfiniBand architecture
- Targets a specific hardware implementation; a hardware, not network, abstraction
- Too low level for most consumers; not designed around HPC
- Hardware and fabric features are changing
- Divergence is driving alternative APIs - UCX, PSM, MXM, CCI, PAMI, uGNI ...
- More applications require high-performance fabrics: cloud systems, data analytics, virtualization, big data ...

This should not be viewed as an 'attack' on libibverbs. Libibverbs has served a very useful purpose over the years and will continue to exist as part of OpenFabrics. The libibverbs interface was originally designed as a merger of three different IB interfaces. It is based on the IB spec, chapter 11, and IB terminology is used throughout the interface. Non-IB hardware has had to adapt to these interfaces, sometimes with restrictions; e.g., libibverbs does not expose any MTU sizes other than those defined by IB, so Ethernet devices (e.g., iWarp) must report non-standard MTU values. The IB spec never intended verbs to be a software interface; it was an agreement tied to the IB hardware. OFI is OFA adapting to changing hardware implementations, new applications, and new fabric features.

OpenFabrics Interfaces Working Group

Solution: application-centric, open source libfabric
- Optimized SW path to HW: minimize cache and memory footprint, reduce instruction count, minimize memory accesses
- Scalable
- Implementation agnostic
- Software interfaces aligned with application requirements: 168 requirements from MPI, PGAS, SHMEM, DBMS, sockets, NVM, ...
- Leverage the existing open source community; inclusive development effort across app and HW developers
- Good impedance match with multiple fabric hardware: InfiniBand, iWarp, RoCE, raw Ethernet, UDP offload, Omni-Path, GNI, others

OFA created a new working group to address the challenges of the marketplace and ensure its relevance in the industry. OFA was an ideal group for developing a new set of interfaces: it had an active community, multiple vendors, and end users.

OpenFabrics Interfaces Working Group

Charter: Develop an extensible, open source framework and interfaces aligned with ULP and application needs for high-performance fabric services.
ofiwg@lists.openfabrics.org
github.com/ofiwg

Application-centric interfaces will help foster fabric innovation and accelerate their adoption.

One of the goals of OFI was to switch from a bottom-up approach to a top-down one. Verbs is an example of a bottom-up approach: the hardware implementation is exposed directly to the applications, so as hardware evolved, the interfaces were forced to change in order to expose the new features. By focusing on the application's needs instead, implementation details are hidden from the app. As hardware incorporates new features, those features can be used by modifying the providers rather than enabling each application.

Development

- Requirement analysis: ~200 requirements from MPI, PGAS, SHMEM, DBMS, sockets, ...
- Rough conceptual model, with input from a wide variety of devices
- Iterative design and implementation; quarterly release cycle
- Deployment, with collective feedback from OFIWG

The development process is iterative. We spent months analyzing application requirements in order to get at the real application requirement, as opposed to a suggested solution. In many cases, the initial requirement that we received was a proposal for a solution, often based on how things were done with existing interfaces. By spending time understanding the driving need of each requirement, we were able to craft an API well suited to application needs. At the same time, we analyzed the various proposals to understand their impact on hardware implementations: could it be done in hardware? What would the resulting API cost to implement in terms of memory footprint, cache usage, or instruction count? The first release was Q1 of 2015. There have since been releases 1.1 and 1.1.1, and we're targeting a 1.2 release at the end of Q4 2015. Sometime next year, we anticipate adding the first set of extensions to the 1.0 API.

Application Requirements

"Give us a high-level interface!" "Give us a low-level interface!" And this was just the MPI developers! Try talking to the government!

Libfabric tries to walk both lines. We want it to be easy for a casual developer to get their code working; at the same time, there are power users who want very low-level access to the hardware. If libibverbs is conceptually viewed as 'assembly language' for IB devices, libfabric can be viewed as C: it still allows low-level access to generic services.

API Design: Implementation Agnostic

EASY - enable simple, basic usage; move functionality under OFI
GURU - advanced application constructs; expose abstract HW capabilities
A range of application usage models.

It should be noted that the 'Easy' part of the design is there, but its implementation is ongoing. That is, each provider has implemented those pieces of libfabric that it does well. The next step is to expand the implementations so that it is easier for applications to switch from one provider to another. This is simply a matter of having the time to complete the development.

Architecture

Note: the current implementation is focused on enabling applications.

Libfabric-enabled middleware: Intel MPI, MPICH (netmod), Open MPI (MTL/BTL), Open MPI SHMEM, Sandia SHMEM, GASNet, Clang UPC, rsockets, ES-API

libfabric services:
- Control services: discovery, fi_info
- Communication services: connection management, address vectors
- Completion services: event queues, counters
- Data transfer services: message queues, tag matching, RMA, atomics, triggered ops

Providers (supported or in active development, some experimental): sockets (TCP, UDP), Verbs (IB, RoCE, iWarp), Cisco usNIC, Intel Omni-Path, Cray GNI, Mellanox MXM, IBM Blue Gene, A3Cube RONNIEE

The experimental providers do not ship with libfabric. The Blue Gene provider was developed by Intel in order to test the middleware at scale; to date, we've run libfabric over BG hardware at 1 million ranks. The A3Cube provider was developed by a graduate student in Italy. Provider development has been focused on time to market, so that middleware can be enabled over libfabric. The sockets provider is for development purposes and works on both Linux and Mac OS X. The verbs provider is a layered provider that enables existing IB, RoCE, and iWarp hardware. Cisco has a native libfabric provider (native meaning there isn't a lower-level interface sitting under libfabric), and the BG provider is native as well. The Intel providers enable TrueScale and OPA. Cray has a native GNI provider, and Intel's MPI team contributed a provider over MXM, which targets greater scalability over Mellanox fabrics.
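As a rough sketch of the discovery service, an application can enumerate the fabrics and providers that the installed libfabric reports through fi_getinfo (error handling abbreviated; which providers appear depends on the local build):

```c
#include <stdio.h>
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>

/* Print every fabric/provider combination the installed libfabric reports. */
int main(void)
{
    struct fi_info *info, *cur;
    int ret;

    /* NULL hints: ask the providers for everything they can offer. */
    ret = fi_getinfo(FI_VERSION(1, 1), NULL, NULL, 0, NULL, &info);
    if (ret) {
        fprintf(stderr, "fi_getinfo: %s\n", fi_strerror(-ret));
        return 1;
    }

    for (cur = info; cur; cur = cur->next)
        printf("provider %-12s fabric %-20s ep_type %d\n",
               cur->fabric_attr->prov_name, cur->fabric_attr->name,
               cur->ep_attr->type);

    fi_freeinfo(info);
    return 0;
}
```

Build against the libfabric headers and link with -lfabric.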

Fabric Information (EASY)

Select the desired endpoint type and capabilities.

Endpoint types:
- MSG - reliable, connected
- DGRAM - datagram
- RDM - reliable unconnected, datagram messages

Capabilities:
- Message queue - FIFO
- RMA
- Tagged messages
- Atomics

This is an easy example of using an RDM endpoint with tag-matching capabilities. With just this support, many MPIs can easily be ported over libfabric.
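A minimal sketch of this 'easy' selection, assuming a provider that supports RDM endpoints with tag matching (the helper name is illustrative):

```c
#include <rdma/fabric.h>

/* Ask for a reliable-unconnected endpoint with tagged-message support. */
static struct fi_info *get_rdm_tagged_info(void)
{
    struct fi_info *hints, *info = NULL;

    hints = fi_allocinfo();
    if (!hints)
        return NULL;

    hints->ep_attr->type = FI_EP_RDM;   /* reliable, unconnected */
    hints->caps = FI_TAGGED;            /* tag-matching data transfers */

    /* Any provider able to satisfy the hints may be returned. */
    if (fi_getinfo(FI_VERSION(1, 1), NULL, NULL, 0, hints, &info))
        info = NULL;

    fi_freeinfo(hints);
    return info;    /* caller releases with fi_freeinfo() */
}
```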

Fabric Information (EASY)

OFI-enabled applications (App 1, App 2, ... App n) each use an RDM message queue; a common implementation layers the RDM message queue over a DGRAM message queue.

In this example, the HW supports unreliable datagram communication. This is similar to what is supported by Cisco's usNIC or, to a rough degree, Intel's True Scale or OPA gen-1 hardware. Rather than each application needing to code for reliability, by pushing this down into libfabric we can have a single, optimized implementation that all applications can take advantage of.

Fabric Information: Capabilities (GURU)

Application-desired features and permissions:
- Primary - must be requested by the application
- Secondary - offered by the provider (the application can request them)
- Communication types - msg, tagged, rma, atomics, triggered
- Permissions - local R/W, remote R/W, send/recv
- Features - RMA events, directed recv, multi-recv, ...

Applications request specific capabilities through the libfabric control interface.
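A sketch of how a 'guru' application might combine primary capabilities with permissions and secondary features, then rely only on what the provider actually granted (capability bits are from fi_getinfo(3); the helper name is illustrative):

```c
#include <rdma/fabric.h>

/* Request message and RMA communication, remote-write permission, and the
 * optional RMA-event feature; the provider reports what it granted in caps. */
static struct fi_info *get_rma_info(void)
{
    struct fi_info *hints, *info = NULL;

    hints = fi_allocinfo();
    if (!hints)
        return NULL;

    hints->ep_attr->type = FI_EP_RDM;
    hints->caps = FI_MSG | FI_RMA                      /* primary: communication types */
                | FI_SEND | FI_RECV | FI_REMOTE_WRITE  /* permissions */
                | FI_RMA_EVENT;                        /* secondary feature */

    if (fi_getinfo(FI_VERSION(1, 1), NULL, NULL, 0, hints, &info))
        info = NULL;

    fi_freeinfo(hints);
    return info;    /* check info->caps before relying on secondary features */
}
```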

Fabric Information: Attributes (GURU)

Expose the optimal way to use the underlying hardware resources. Attributes define the limits and behavior of the selected interfaces:
- Progress - provider or application driven
- Threading - resource synchronization boundaries
- Resource mgmt - protect against queue overruns
- Ordering - message processing, data transfers

Libfabric attributes differ from those defined by other interfaces. Most interfaces define attributes as hardware maximums or limits; libfabric defines limits based on the optimal values supported by the provider. E.g., a provider may support 64,000 endpoints in terms of addressing capability, but the hardware cache effectively limits that number to 256 endpoints active at any given time. The intent is that a resource manager can use this data to allocate resources among different jobs and processes. Additionally, libfabric defines some attributes around meeting application needs; e.g., the threading attribute allows an application to allocate resources among threads such that it can avoid locking. In many cases the attributes are optimization hints from the application to the provider about its intended use.
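A sketch of expressing these attributes as hints (the specific values chosen here are illustrative; see fi_domain(3) for the full set):

```c
#include <rdma/fabric.h>

/* Describe how the application intends to use the domain so the provider can
 * avoid unnecessary locking and background progress threads. */
static void set_usage_hints(struct fi_info *hints)
{
    /* Each endpoint is driven by a single thread: locks can be skipped. */
    hints->domain_attr->threading = FI_THREAD_ENDPOINT;

    /* The application will make progress by polling its completion queues. */
    hints->domain_attr->control_progress = FI_PROGRESS_MANUAL;
    hints->domain_attr->data_progress    = FI_PROGRESS_MANUAL;

    /* Ask the provider to protect against queue overruns. */
    hints->domain_attr->resource_mgmt = FI_RM_ENABLED;

    /* Only send-after-send ordering is needed by this (hypothetical) protocol. */
    hints->tx_attr->msg_order = FI_ORDER_SAS;
}
```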

Fabric Information: Mode (GURU)

Request that the application take action to improve overall performance. Mode bits are provider hints on how it is best used:
- Local MR - the app must register buffers for local operations
- Context - the app provides 'scratch space' for the provider to track a request
- Buffer prefix - the app provides space for network headers

Although the interfaces are driven by the application, there are cases where performance could be improved if the application took some action on behalf of the provider. These are the mode bits. In most cases, it is cheaper for the application to take these actions than for the provider to do so. E.g., if we implement reliability over an unreliable interface (as shown by layering RDM over DGRAM EPs in the previous slide), then the provider needs to track each request. The FI_CONTEXT mode bit indicates that the provider would like the application to provide the memory used to track the request. Much HPC middleware already allocates memory with each operation, so allocating a few more bytes is easier and cheaper than having the provider also allocate memory.
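A sketch of honoring FI_CONTEXT: the request object the middleware already allocates reserves provider scratch space by embedding struct fi_context (the my_request type is hypothetical):

```c
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_tagged.h>

/* Hypothetical middleware request object: embedding struct fi_context gives
 * the provider its tracking memory when the FI_CONTEXT mode bit is set. */
struct my_request {
    struct fi_context ctx;   /* must remain valid until the completion is read */
    void *user_data;
};

static ssize_t post_tsend(struct fid_ep *ep, const void *buf, size_t len,
                          fi_addr_t dest, uint64_t tag, struct my_request *req)
{
    /* The context argument doubles as the provider's scratch space and as the
     * cookie reported back in the completion entry. */
    return fi_tsend(ep, buf, len, NULL /* desc */, dest, tag, &req->ctx);
}
```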

Endpoints (EASY)

An endpoint is an addressable communication portal, conceptually similar to a socket or QP. It has conceptual (or real) transmit and receive command queues and follows a sequence of request and completion processing.

Conceptually, each endpoint is associated with a hardware transmit and receive command queue. There is no requirement that the provider implement the command queues in hardware, however.
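A minimal bring-up sketch for such an endpoint from an fi_info result (error handling omitted; a connectionless endpoint would also bind an address vector, shown later):

```c
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

/* Open fabric and domain, create the endpoint, attach a completion queue,
 * then enable it. */
static int open_endpoint(struct fi_info *info, struct fid_ep **ep_out)
{
    struct fid_fabric *fabric;
    struct fid_domain *domain;
    struct fid_cq *cq;
    struct fid_ep *ep;
    struct fi_cq_attr cq_attr = { .format = FI_CQ_FORMAT_CONTEXT };

    fi_fabric(info->fabric_attr, &fabric, NULL);
    fi_domain(fabric, info, &domain, NULL);
    fi_cq_open(domain, &cq_attr, &cq, NULL);

    fi_endpoint(domain, info, &ep, NULL);
    fi_ep_bind(ep, &cq->fid, FI_TRANSMIT | FI_RECV); /* one CQ for both directions */
    fi_enable(ep);

    *ep_out = ep;
    return 0;
}
```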

Shared Tx/Rx Contexts (GURU)

Enable the resource manager to direct the use of HW resources when the number of endpoints is greater than the available resources: multiple endpoints map to shared transmit/receive command queues, matching HW limits (e.g., caching).

If there are more endpoints active than the hardware can effectively support, the endpoints may be manually configured to use shared command queues.
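A sketch of configuring several endpoints over shared contexts, assuming the provider supports fi_stx_context/fi_srx_context:

```c
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

/* Let several endpoints share one transmit and one receive command queue. */
static int open_shared_contexts(struct fid_domain *domain, struct fi_info *info,
                                struct fid_ep *eps[], int nep)
{
    struct fid_stx *stx;
    struct fid_ep  *srx;
    int i;

    fi_stx_context(domain, info->tx_attr, &stx, NULL);  /* shared transmit context */
    fi_srx_context(domain, info->rx_attr, &srx, NULL);  /* shared receive context */

    for (i = 0; i < nep; i++) {
        fi_endpoint(domain, info, &eps[i], NULL);
        fi_ep_bind(eps[i], &stx->fid, 0);   /* all endpoints funnel into the */
        fi_ep_bind(eps[i], &srx->fid, 0);   /* same underlying HW queues     */
    }
    return 0;    /* CQ/AV bindings and fi_enable() omitted for brevity */
}
```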

Scalable Endpoints (GURU)

Multiple Tx/Rx contexts per endpoint, addressing multi-threading, ordering, progress, and completions. Incoming requests may be able to target a specific receive context.

Scalable endpoints are the opposite of shared contexts. The intent is that an application can take advantage of all the hardware resources that are available. An anticipated use of scalable endpoints is to allow multiple threads to each have their own transmit command queue, which avoids locking between threads. This has an advantage over using multiple endpoints in that the number of addresses each process must maintain is decreased: here we have one endpoint = one address, but four transmit contexts; the alternative would be four endpoints = four addresses.
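A sketch of the scalable-endpoint pattern, one transmit context per thread behind a single address (per-context completion queue bindings are omitted):

```c
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_endpoint.h>

#define NUM_THREADS 4

/* One scalable endpoint (a single network address) with a transmit context
 * per thread, so threads can post sends without locking against each other. */
static int open_scalable_ep(struct fid_domain *domain, struct fi_info *info,
                            struct fid_av *av, struct fid_ep **sep_out,
                            struct fid_ep *tx_ctx[NUM_THREADS])
{
    struct fid_ep *sep;
    int i;

    info->ep_attr->tx_ctx_cnt = NUM_THREADS;    /* request 4 transmit contexts */

    fi_scalable_ep(domain, info, &sep, NULL);
    fi_scalable_ep_bind(sep, &av->fid, 0);      /* one address vector for all */

    for (i = 0; i < NUM_THREADS; i++)
        fi_tx_context(sep, i, info->tx_attr, &tx_ctx[i], NULL);

    fi_enable(sep);
    *sep_out = sep;
    return 0;
}
```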

API Performance Analysis

These issues apply to many APIs: Verbs, AIO, DAPL, Portals, NetworkDirect, ...

Memory an application must write to post a simple send, and branches forced in the provider:

libibverbs with InfiniBand:
  Structure   Field(s)                        Write size   Branch?
  sge                                         16
  send_wr     next, num_sge, opcode, flags    60           Yes
  Totals                                      76+8 = 84    4+1 = 5

libfabric with InfiniBand:
  Type        Parameter    Write size
  void *      buf          8
  size_t      len          8
  void *      desc         8
  fi_addr_t   dest_addr    8
  void *      context      8
  Total                    40

- Generic entry points result in additional memory reads/writes
- Interface parameters can force branches in the provider code
- Move operation flags into the initialization code path for optimal SW paths

This analyzes the effect that the API has on the underlying implementation. Although this comparison is against libibverbs, the problem appears in many other APIs. It looks at the number of bytes of memory that an application must write in order to invoke the API to send a simple message. It also examines whether there are parameters to the API that essentially force a branch in the underlying code. With libibverbs, we require writing an additional 44 bytes to memory in order to send a message, and the interface adds 5 branches in the underlying implementation. (An analysis of an actual provider shows that at a minimum 19 branches are actually taken, but at least 5 of those are the result of how the interface has been defined.)
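To make the comparison concrete, a trimmed sketch of the two call styles; the verbs structures carry only the fields counted above, and the libfabric call passes its arguments directly:

```c
#include <stdint.h>
#include <infiniband/verbs.h>
#include <rdma/fi_endpoint.h>

/* libibverbs: the caller writes an sge and a work request to memory; the
 * provider re-reads them and branches on next, num_sge, opcode, and flags. */
static int post_send_verbs(struct ibv_qp *qp, void *buf, uint32_t len,
                           uint32_t lkey, void *req)
{
    struct ibv_sge sge = { .addr = (uintptr_t)buf, .length = len, .lkey = lkey };
    struct ibv_send_wr wr = {
        .wr_id = (uintptr_t)req, .sg_list = &sge, .num_sge = 1,
        .opcode = IBV_WR_SEND, .send_flags = IBV_SEND_SIGNALED,
    };
    struct ibv_send_wr *bad_wr;

    return ibv_post_send(qp, &wr, &bad_wr);
}

/* libfabric: the same operation passes its arguments directly (largely in
 * registers); per-operation flags were folded into endpoint initialization. */
static ssize_t post_send_ofi(struct fid_ep *ep, void *buf, size_t len,
                             void *desc, fi_addr_t dest, void *req)
{
    return fi_send(ep, buf, len, desc, dest, req);
}
```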

Memory Footprint

Per-peer addressing data:

libibverbs with InfiniBand:
  Type       Data        Size
  struct *   ibv_ah      8
  uint32     QPN         4
  uint32     QKey        4
             ibv_ah[0]   24
  Total                  36

libfabric with InfiniBand:
  Type     Data        Size
  uint64   fi_addr_t   8

Map address vector: encodes the peer address as a direct mapping to HW command data. IB data: DLID (2 bytes), SL (1 byte), QPN (3 bytes).
Index address vector: minimal footprint; requires a lookup/calculation for the peer address.

Libfabric considered the impact of trying to scale to millions of peers. Its address vector concept is used to convert endpoint addresses into fabric-specific addresses. Address vectors may be shared between processes, which allows a single copy of all addresses to be shared among the ranks on a single system. It also allows a provider to greatly reduce the amount of storage required for all endpoints. In the case of a 'map' address vector, each address seen by the user is 8 bytes. This may be a pointer into provider memory structures; in the best case, the 8 bytes encode the actual address as shown. This allows data transfer operations to place the encoded data directly into a command: no additional memory references are needed to send, which allows for minimal instructions in any transmit operation. The cost is that the app must store 8 bytes of data per remote endpoint. The 'index' address vector allows the app to use a simple index to reference each remote endpoint. In this case, the app does not need to store any addressing data; however, the provider may need to use the index to look up the actual address of the endpoint. In some cases, the actual address may be calculated from the index. This enables a very small memory footprint for accessing millions of peers, at the cost of performing the calculation on each send.
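A sketch of opening and populating the two address-vector flavors, assuming raw peer addresses were exchanged out of band (the helper name is illustrative):

```c
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>

/* Insert 'count' raw peer addresses into an address vector.  With FI_AV_MAP
 * the returned fi_addr_t values encode provider-specific data (8 bytes kept
 * by the app per peer); with FI_AV_TABLE the app stores nothing and simply
 * addresses peers by index 0..count-1. */
static int setup_av(struct fid_domain *domain, enum fi_av_type type,
                    const void *peer_addrs, size_t count,
                    struct fid_av **av_out, fi_addr_t *fi_addrs)
{
    struct fi_av_attr attr = { .type = type, .count = count };
    struct fid_av *av;
    int ret;

    ret = fi_av_open(domain, &attr, &av, NULL);
    if (ret)
        return ret;

    /* For FI_AV_TABLE, fi_addrs may be NULL; peers are then referenced by index. */
    ret = fi_av_insert(av, peer_addrs, count, fi_addrs, 0, NULL);
    if (ret < 0)
        return ret;

    *av_out = av;
    return 0;
}
```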

OFA Community Release Schedule

1.0 - Q1 2015: Initial release - support for PSM, Verbs (IB/iWarp), usNIC, and sockets providers. Quickly enables applications with a mix of native and layered providers.
1.1 - Q2 2015: Bug fixes and provider enhancements.
1.1.1 - Q3 2015: Bug-fix-only release.
1.2 - Q4 2015: New providers - enhanced verbs, Omni-Path (PSM2), MXM, GNI.
2016: Interface extensions.

OFIWG at SC '15

Tutorial - Monday 1:30-5:00: a detailed look at the libfabric interface, basic examples, and middleware implementation references (MPI, OpenSHMEM).
BoF - Tuesday 1:30-3:00: OFIWG, including the Data Storage/Data Access subgroup, and an initial collection of interface extensions.

Legal Notices and Disclaimers

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software, or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance. Intel, the Intel logo, and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © 2015 Intel Corporation.

(Place at the back of the deck.)

INTEL® HPC DEVELOPER CONFERENCE