© 2010 Mercury Computer Systems, Inc.

Performance Migration to Open Solutions: OFED* for Embedded Fabrics
Kenneth Cain

* OFED = OpenFabrics Enterprise Distribution
Outline
- Performance Migration to Open Solutions
  - What's open, and what is the cost?
  - Opportunities for a smooth migration
  - Migration example: ICS/DX to MPI over sRIO
- OFED sRIO Performance Data / Analysis
  - Message passing and RDMA patterns, benchmarking approach
  - Varied approaches within sRIO (MPI, OFED, ICS/DX, etc.)
  - Common approaches across fabrics (sRIO, IB, iWARP)
- Conclusions
Related to Recent HPEC Talks
- HPEC 2008: "Using Layer 2 Ethernet for High-Throughput, Real-Time Applications" -- Robert Blau, Mercury Computer Systems, Inc.
- HPEC 2009: "The 'State' and 'Future' of Middleware for HPEC" -- Anthony Skjellum, RunTime Computing Solutions, LLC and University of Alabama at Birmingham
Middleware has been a significant focus for the HPEC community for a long time.
Performance Migration to Open Solutions: High-Level Perspective
Scalability Reference Model
Distributed applications and domain-specific middlewares/frameworks run over interconnect technologies (chip, backplane, network).

Domain                                    Model                   Distance      Scalability       Latency            Throughput
Intra-Node / Inter-Core (multi-core)      Shared memory           8-32 mm       ~100 cores/node   1-50 ns            ~100 GB/s
Inter-Node / Inter-Board (multi-node)     MPI (RDMA)              32 mm-100 m   16-200 blades     1,000-20,000 ns    1-10 GB/s
Inter-Chassis / Inter-Box (grid/cluster)  RPC, sockets (TCP/UDP)  100 m+        ~50,000 boxes     1 ms-5 s           ~GB/s

- Only a model -- and models have imperfections
- Innovation opportunity in bridging domains
Scalability Reference Model (cont.)
- Today: MPI/OFED
Fabric Software Migrations and Goals

Compute
- Transition: PPC, FPGA -> x86, GPGPU, FPGA
- Goals / potential solutions: establish open baseline on PPC; only recompile if migrating to x86

Interconnects and data rates
- Transition: 10 Gbps sRIO -> 10/20/40/100 Gbps sRIO, Ethernet, IB
- Goals / potential solutions: common RDMA API for all fabrics; SW binary-compatible among fabrics; RDMA+LAN APIs for selected fabric; programmable fabric device

Middleware, use cases
- Transition: MPI, vendor DMA, algorithm library -> integrated MPI/dataflow, parallel algorithm library, pub/sub, components
- Goals / potential solutions: leveraged SW investment and reuse; HW acceleration for RDMA, middleware, protocols

Dev / run / deployment flexibility
- Transition: dev + run on embedded, different air/ground SW -> dev on cluster, run on embedded, common air/ground SW
- Goals / potential solutions: productive porting to embedded; few/no code changes; good performance out of the box, then incremental tuning
Fabric Software Migrations and Goals (cont.)
- Performance always a requirement: optimize relative to SWaP; overcome the "price of portability"
- Open architecture goals not met -> new approach: community open architecture
Industry Scaling Solutions
- Middlewares, application frameworks (Mercury, 3rd party, customer supplied): MPI, DDS, AMQP, CORBA, OpenCV, VSIPL++, VSIPL, DRI
- Inter-core: EIB, QPI, HT; OpenMP, Ct, OpenCL, CUDA, IPP
- Inter-board / inter-box interconnects: RapidIO (sRIO), InfiniBand (IB), Ethernet
Middlewares engage platform services for performance and scalability.
Industry Scaling Solutions: Pivot Points
- Pivot points are concentrations of industry investment (and sometimes performance): OpenCL, OFED, network stack, DRI
What is OFED?
- Open source software by the OpenFabrics Alliance (Mercury is a member)
- Multiple fabrics/vendors: Ethernet/IB, sRIO (this work); "offload" HW available
- Multiple uses: MPI (multiple libraries); RDMA (uDAPL, verbs); sockets; storage (block, file)
- High performance: zero-copy RDMA, channel I/O
OFA -- Significant Membership Roster
See http:// for the list.
- HPC and system integrators, including "Tier 1"
- Microprocessor providers
- Networking providers (NIC/HCA, switch)
- Server/enterprise software + systems providers
- NAS / SAN providers
- Switched fabric industry trade associations
- Financial services / technology providers
- National laboratories
- Universities
- Software consulting organizations
OFED Software Structure
[Diagram from the OpenFabrics Alliance]
OFED Software Structure: MCS Work
[Diagram from the OpenFabrics Alliance]
- An OFED "device provider" software module
  - Now: Linux PPC 8641D/sRIO
  - Future: Linux Intel, PCIe to sRIO / Ethernet device
- Open MPI with OFED transport
- MCS provider SW for PPC/sRIO and Intel + sRIO/Ethernet
Embedded Middleware Migration

Embedded SW migration (preserve SW investment):
- ICS/DX: Mercury middleware, no MPI content
- ICS/DX + MPI: OFED in kernel driver; incremental MPI use; application buffer integration, e.g. MPI_Send(smb_ptr, ...); MPI as MCP extension, e.g. MPI_Barrier(world)
- ICS/DX + MPI tuned: OFED OS bypass; coordinated DMA HW sharing
- MPI + dataflow concept: single library, integrated resources; DMA HW assist for MPI and RDMA

Open solution SW migration:
- MPI (malloc): baseline for porting to embedded; zero-copy RDMA out of the box; fabric via OFED; intra-node via SAL, OFED
- MPI (MPI_Alloc_mem): malloc -> MPI_Alloc_mem; application still uses open APIs; MCS "PMEM" contiguous memory; fabric via OFED + PMEM
Motivating Example: ICS/DX vs. MPI -- Data Rate Performance
Benchmark: ½ Round-Trip Time
- Rank 0 and rank 1 processes; inter-node: ranks placed on distinct nodes
- Rank 0: T0 = timestamp; DMA-write nbytes of data + sync word (SRC buffer -> remote DEST buffer); poll on receive sync word; T1 = timestamp
- Rank 1: poll on receive sync word; DMA-write data + sync word back (return SRC buffer -> return DEST buffer)
- T1 - T0 = round-trip time (RTT); latency = ½ RTT (record min, max, avg, median)
- Data rate calculated from transfer size and average latency
ICS/DX, MPI/OFED: 8641D sRIO
- MPI/malloc: good, zero-copy "out of the box"
- Comparable data rates (OFED, ICS/DX); curves shown for MPI/malloc, contiguous/PMEM, ICS/DX, and OFED RDMA write
- Overhead differences: kernel-level (OFED) vs. user-level DMA queuing; dynamic DMA chain formatting (OFED) vs. pre-plan + modify
- ½ RTT latency, small transfer (4 bytes): OFED (data + flag) 13 usec; OFED (data, poll last) 8.6 usec; ICS/DX (data + flag) 4 usec
Preliminary Conclusions
1. Out-of-the-box application porting: zero-copy RDMA enabled by OFED; better results than MPI over TCP/IP
2. Minor application changes: contiguous memory for application buffers; comparable MPI performance to other systems
3. Improvements underway: OFED user-space optimized DMA queuing; moving performance curves "to the left"
4. Communication model differences: RDMA vs. message passing; limited "gap closing" using traditional means; different solution approaches to be considered; OFED is "middleware-friendly" RDMA
Communication Model Comparison: Message Passing and RDMA
Message Passing vs. RDMA
- Flexibility (MPI) vs. performance (RDMA); issue: memory registration / setup for RDMA
- MPI tradeoffs: copy costs vs. fabric handshake costs
  - Eager protocol extra costs: copy (on both sides) to special RDMA buffers; RDMA transfer to/from the special buffers
  - Rendezvous protocol extra costs: fabric handshakes in each send/recv; then zero-copy RDMA direct to/from application memory
MPI and RDMA Performance: Server/HPC Platform Comparisons
MPI vs. OFED Data Rate: IB, iWARP
- Without sRIO (same software): double data rate InfiniBand (DDR IB) and iWARP (10GbE)
- Similar MPI-vs-OFED performance delta observed as a % of peak rate (link speeds: 10 and 16 Gbps)
MPI Data Rate: sRIO, IB, iWARP
- MPI data rate comparable across fabrics
- Small-transfer copy cost differences: x86_64 (DDR IB, iWARP) faster than PPC32 (sRIO)
Mercury MPI/OFED Intra-Node Performance
Open MPI Intra-Node Communication (multi-core node, process 0 and process 1 memories)
- Shared memory (sm) transport: copy segments send buffer -> temp buffer in shared memory segment -> recv buffer; optimized copy (AltiVec)
- OFED (openib) transport: loopback via CPU or DMA
  - CPU (now): kernel driver; CPU (next): library/AltiVec
  - DMA (now): through sRIO; DMA (next): local copy
MPI Intra-Node Data Rate
- OMPI sm latency better for very small messages: 2-3 usec (OMPI sm) vs. 11 usec (OFED CPU loopback)
- OFED loopback (CPU or DMA; zero-copy, direct A -> B) better for large messages: direct copy uses 50% less memory bandwidth than the OMPI sm indirect copy (A -> TMP -> B)
- Both use MCS optimized vector copy routines
Summary / Conclusions
- OFED perspective: RDMA performance promising; more performance with user-space optimization; opportunity to support more middlewares
- MPI perspective: MPI performance in line with current practice; steps shown to increase performance
- Enhancing-MPI perspective: converged middleware / extensions as the end goal, vs. a pure-MPI end goal; HW assist likely can help in either case
Thank You -- Questions?
Kenneth Cain