© 2010 Mercury Computer Systems, Inc.

Performance Migration to Open Solutions: OFED* for Embedded Fabrics
Kenneth Cain

* OFED = OpenFabrics Enterprise Distribution
Outline
- Performance Migration to Open Solutions
  - What's open, and what is the cost?
  - Opportunities for a smooth migration
  - Migration example: ICS/DX to MPI over sRIO
- OFED sRIO Performance Data / Analysis
  - Message passing and RDMA patterns, benchmarking approach
  - Varied approaches within sRIO (MPI, OFED, ICS/DX, etc.)
  - Common approaches across fabrics (sRIO, IB, iWARP)
- Conclusions
Related to Recent HPEC Talks
- HPEC 2008: "Using Layer 2 Ethernet for High-Throughput, Real-Time Applications" -- Robert Blau, Mercury Computer Systems, Inc.
- HPEC 2009: "The 'State' and 'Future' of Middleware for HPEC" -- Anthony Skjellum, RunTime Computing Solutions, LLC and University of Alabama at Birmingham
Middleware has been a significant focus for the HPEC community for a long time.
Performance Migration to Open Solutions: High-Level Perspective
Scalability Reference Model
Distributed applications and domain-specific middlewares/frameworks run over interconnect technologies (chip, backplane, network).

Domain                                    Model                   Distance      Scalability       Latency            Throughput
Intra-Node / Inter-Core (multi-core)      Shared memory           8-32 mm       ~100 cores/node   1-50 ns            ~100 GB/s
Inter-Node / Inter-Board (multi-node)     MPI (RDMA)              32 mm-100 m   16-200 blades     1,000-20,000 ns    1-10 GB/s
Inter-Chassis / Inter-Box (grid/cluster)  RPC, sockets (TCP/UDP)  100 m+        ~50,000 boxes     1 ms-5 s           ~GB/s

- Only a model -- and models have imperfections
- Innovation opportunity in bridging domains
Scalability Reference Model (cont.)
- Today: MPI/OFED
Fabric Software Migrations and Goals

Compute
- Transition: PPC, FPGA -> x86, GPGPU, FPGA
- Goals / potential solutions: establish open baseline on PPC; only recompile if migrating to x86

Interconnects and data rates
- Transition: 10 Gbps sRIO -> 10/20/40/100 Gbps sRIO, Ethernet, IB
- Goals / potential solutions: common RDMA API for all fabrics; SW binary-compatible among fabrics; RDMA+LAN APIs for selected fabric; programmable fabric device

Middleware, use cases
- Transition: MPI, vendor DMA, algorithm library -> integrated MPI/dataflow, parallel algorithm library, pub/sub, components
- Goals / potential solutions: leveraged SW investment and reuse; HW acceleration for RDMA, middleware, protocols

Dev / run / deployment flexibility
- Transition: dev + run on embedded, different air/ground SW -> dev on cluster, run on embedded, common air/ground SW
- Goals / potential solutions: productive porting to embedded; few/no code changes; good performance out of the box, then incremental tuning
Fabric Software Migrations and Goals (cont.)
- Performance always a requirement: optimize relative to SWaP; overcome the "price of portability"
- Open architecture goals not met -> new approach: community open architecture
Industry Scaling Solutions
- Middlewares, application frameworks (Mercury, 3rd party, customer supplied): MPI, DDS, AMQP, CORBA, OpenCV, VSIPL++, VSIPL, DRI
- Inter-core: EIB, QPI, HT; OpenMP, Ct, OpenCL, CUDA, IPP
- Inter-board / inter-box interconnects: RapidIO (sRIO), InfiniBand (IB), Ethernet
Middlewares engage platform services for performance and scalability.
Industry Scaling Solutions: Pivot Points
- Pivot points are concentrations of industry investment (and sometimes performance): OpenCL, OFED, network stack, DRI
What is OFED?
- Open source software by the OpenFabrics Alliance (Mercury is a member)
- Multiple fabrics/vendors: Ethernet/IB, sRIO (this work); "offload" HW available
- Multiple uses: MPI (multiple libraries); RDMA (uDAPL, verbs); sockets; storage (block, file)
- High performance: zero-copy RDMA, channel I/O
OFA -- Significant Membership Roster
See http:// for the list.
- HPC and system integrators, including "Tier 1"
- Microprocessor providers
- Networking providers (NIC/HCA, switch)
- Server/enterprise software + systems providers
- NAS / SAN providers
- Switched fabric industry trade associations
- Financial services / technology providers
- National laboratories
- Universities
- Software consulting organizations
OFED Software Structure
[Diagram from the OpenFabrics Alliance]
OFED Software Structure: MCS Work
[Diagram from the OpenFabrics Alliance]
- An OFED "device provider" software module
  - Now: Linux PPC 8641D/sRIO
  - Future: Linux Intel, PCIe to sRIO / Ethernet device
- Open MPI with OFED transport
- MCS provider SW for PPC/sRIO and Intel + sRIO/Ethernet
Embedded Middleware Migration

Embedded SW migration (preserve SW investment):
- ICS/DX: Mercury middleware, no MPI content
- ICS/DX + MPI: OFED in kernel driver; incremental MPI use; application buffer integration, e.g. MPI_Send(smb_ptr, ...); MPI as MCP extension, e.g. MPI_Barrier(world)
- ICS/DX + MPI tuned: OFED OS bypass; coordinated DMA HW sharing
- MPI + dataflow concept: single library, integrated resources; DMA HW assist for MPI and RDMA

Open solution SW migration:
- MPI (malloc): baseline for porting to embedded; zero-copy RDMA out of the box; fabric via OFED; intra-node via SAL, OFED
- MPI (MPI_Alloc_mem): malloc -> MPI_Alloc_mem; application still uses open APIs; MCS "PMEM" contiguous memory; fabric via OFED + PMEM
Motivating Example: ICS/DX vs. MPI -- Data Rate Performance
Benchmark: ½ Round-Trip Time
- Rank 0 and rank 1 processes; inter-node: ranks placed on distinct nodes
- Rank 0: T0 = timestamp; DMA-write nbytes of data + sync word (SRC buffer -> remote DEST buffer); poll on receive sync word; T1 = timestamp
- Rank 1: poll on receive sync word; DMA-write data + sync word back (return SRC buffer -> return DEST buffer)
- T1 - T0 = round-trip time (RTT); latency = ½ RTT (record min, max, avg, median)
- Data rate calculated from transfer size and average latency
ICS/DX, MPI/OFED: 8641D sRIO
- MPI/malloc: good, zero-copy "out of the box"
- Comparable data rates (OFED, ICS/DX); curves shown for MPI/malloc, contiguous/PMEM, ICS/DX, and OFED RDMA write
- Overhead differences: kernel-level (OFED) vs. user-level DMA queuing; dynamic DMA chain formatting (OFED) vs. pre-plan + modify
- ½ RTT latency, small transfer (4 bytes): OFED (data + flag) 13 usec; OFED (data, poll last) 8.6 usec; ICS/DX (data + flag) 4 usec
Preliminary Conclusions
1. Out-of-the-box application porting: zero-copy RDMA enabled by OFED; better results than MPI over TCP/IP
2. Minor application changes: contiguous memory for application buffers; comparable MPI performance to other systems
3. Improvements underway: OFED user-space optimized DMA queuing; moving performance curves "to the left"
4. Communication model differences: RDMA vs. message passing; limited "gap closing" using traditional means; different solution approaches to be considered; OFED is "middleware-friendly" RDMA
Communication Model Comparison: Message Passing and RDMA
Message Passing vs. RDMA
- Flexibility (MPI) vs. performance (RDMA); issue: memory registration / setup for RDMA
- MPI tradeoffs: copy costs vs. fabric handshake costs
  - Eager protocol extra costs: copy (on both sides) to special RDMA buffers; RDMA transfer to/from the special buffers
  - Rendezvous protocol extra costs: fabric handshakes in each send/recv; then zero-copy RDMA direct to/from application memory
MPI and RDMA Performance: Server/HPC Platform Comparisons
MPI vs. OFED Data Rate: IB, iWARP
- Without sRIO (same software): double data rate InfiniBand (DDR IB) and iWARP (10GbE)
- Similar MPI-vs-OFED performance delta observed as a % of peak rate (link speeds: 10 and 16 Gbps)
MPI Data Rate: sRIO, IB, iWARP
- MPI data rate comparable across fabrics
- Small-transfer copy cost differences: x86_64 (DDR IB, iWARP) faster than PPC32 (sRIO)
Mercury MPI/OFED Intra-Node Performance
Open MPI Intra-Node Communication (multi-core node, process 0 and process 1 memories)
- Shared memory (sm) transport: copy segments send buffer -> temp buffer in shared memory segment -> recv buffer; optimized copy (AltiVec)
- OFED (openib) transport: loopback via CPU or DMA
  - CPU (now): kernel driver; CPU (next): library/AltiVec
  - DMA (now): through sRIO; DMA (next): local copy
MPI Intra-Node Data Rate
- OMPI sm latency better for very small messages: 2-3 usec (OMPI sm) vs. 11 usec (OFED CPU loopback)
- OFED loopback (CPU or DMA; zero-copy, direct A -> B) better for large messages: direct copy uses 50% less memory bandwidth than the OMPI sm indirect copy (A -> TMP -> B)
- Both use MCS optimized vector copy routines
Summary / Conclusions
- OFED perspective: RDMA performance promising; more performance with user-space optimization; opportunity to support more middlewares
- MPI perspective: MPI performance in line with current practice; steps shown to increase performance
- Enhancing-MPI perspective: converged middleware / extensions as the end goal, vs. a pure-MPI end goal; HW assist likely can help in either case
Thank You -- Questions?
Kenneth Cain