Performance Migration to Open Solutions: OFED* for Embedded Fabrics
Kenneth Cain, Mercury Computer Systems, Inc. (www.mc.com)
© 2010 Mercury Computer Systems, Inc.
* OpenFabrics Enterprise Distribution


1. Title: Performance Migration to Open Solutions: OFED* for Embedded Fabrics
   Kenneth Cain
   * OpenFabrics Enterprise Distribution

2. Outline
   Performance Migration to Open Solutions
   - What's open, and what is the cost?
   - Opportunities for a smooth migration
   - Migration example: ICS/DX to MPI over sRIO
   OFED sRIO Performance Data / Analysis
   - Message passing and RDMA patterns; benchmarking approach
   - Varied approaches within sRIO (MPI, OFED, ICS/DX, etc.)
   - Common approaches across fabrics (sRIO, IB, iWARP)
   Conclusions

3. Related to Recent HPEC Talks
   HPEC 2008: "Using Layer 2 Ethernet for High-Throughput, Real-Time Applications," Robert Blau, Mercury Computer Systems, Inc.
   HPEC 2009: "The 'State' and 'Future' of Middleware for HPEC," Anthony Skjellum, RunTime Computing Solutions, LLC and University of Alabama at Birmingham
   Middleware has been a significant focus for the HPEC community for a long time.

4. Performance Migration to Open Solutions: High-Level Perspective

5. Scalability Reference Model
   Layered view: Distributed Applications over Domain-Specific Middlewares / Frameworks over Interconnect Technologies (chip, backplane, network).
   Programming models by domain: Shared Memory (intra-node); MPI (RDMA) / RPC (inter-node); Sockets (TCP/UDP) (inter-chassis).

   Domain                                     Distance        Scalability       Latency               Throughput
   Intra-Node / Inter-Core (multi-core)       8 - 32 mm       100 cores/node    1 - 50 ns             100 GB/s
   Inter-Node / Inter-Board (multi-node)      32 mm - 100 m   16 - 200 blades   1,000 - 20,000 ns     1 - 10 GB/s
   Inter-Chassis / Inter-Box (grid/cluster)   100 m+          50,000 boxes      1 ms - 5 s            GB/s

   Only a model, and models have imperfections. Innovation opportunity lies in bridging domains.
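The slide's latency/throughput tiers can be turned into a rough first-order transfer-time model. This is a sketch: the constants are the slide's order-of-magnitude figures, not measurements, and the tier names are illustrative.

```python
# First-order model: time = latency + size / throughput.
# Constants are the slide's order-of-magnitude values, not measurements.
TIERS = {
    "intra-node":    {"latency_s": 50e-9, "bw_Bps": 100e9},  # inter-core
    "inter-node":    {"latency_s": 20e-6, "bw_Bps": 10e9},   # inter-board
    "inter-chassis": {"latency_s": 1e-3,  "bw_Bps": 1e9},    # inter-box
}

def transfer_time(tier: str, nbytes: int) -> float:
    """Estimated seconds to move nbytes one way within a scaling domain."""
    t = TIERS[tier]
    return t["latency_s"] + nbytes / t["bw_Bps"]

if __name__ == "__main__":
    for tier in TIERS:
        print(tier, transfer_time(tier, 1 << 20))  # 1 MiB transfer
```

Even this crude model shows the slide's point: crossing a domain boundary costs orders of magnitude in latency, which is where bridging innovation pays off.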

6. Scalability Reference Model (same diagram as slide 5, annotated with today's focus: MPI/OFED)

7. Fabric Software Migrations and Goals

   Transition (Compute): PPC, FPGA -> x86, GPGPU, FPGA
   Goals: establish open baseline on PPC; only recompile if migrating to x86.

   Transition (Interconnects and Data Rates): 10 Gbps sRIO -> 10/20/40/100 Gbps sRIO, Ethernet, IB
   Goals: common RDMA API for all fabrics; SW binary compatibility among fabrics; RDMA+LAN APIs for the selected fabric; programmable fabric device.

   Transition (Middleware, Use Cases): MPI, vendor DMA, algorithm library -> integrated MPI/dataflow, parallel algorithm library, pub/sub, components
   Goals: leveraged SW investment and reuse; HW acceleration (RDMA, middleware, protocols).

   Transition (Dev / Run / Deployment): dev+run on embedded, different air/ground SW -> dev on cluster, run on embedded, common air/ground SW
   Goals: flexibility; productive porting to embedded; few or no code changes; good performance out of the box, then incremental tuning.

11. Fabric Software Migrations and Goals (continued)
    Performance is always a requirement: optimize relative to SWaP, and overcome the "price of portability."
    A new approach is needed: the community's open-architecture goals have not been met.


13. Industry Scaling Solutions
    Scope: inter-core, inter-box, inter-board.
    Middlewares / application frameworks (Mercury, 3rd party, customer supplied): MPI, DDS, AMQP, CORBA, OpenCV, VSIPL++, VSIPL, DRI.
    Platform and interconnect technologies: OpenMP, Ct, OpenCL, CUDA, IPP; EIB, QPI, HT; RapidIO (sRIO), InfiniBand (IB), Ethernet.
    Middlewares engage platform services for performance and scalability.

14. Industry Scaling Solutions: Pivot Points (same diagram as slide 13)
    Pivot points are concentrations of industry investment (and sometimes performance): OpenCL, OFED, the network stack, DRI.

15. What is OFED?
    Open-source software by the OpenFabrics Alliance (Mercury is a member).
    Multiple fabrics/vendors: Ethernet/IB; sRIO (this work); "offload" HW available.
    Multiple uses: MPI (multiple libraries); RDMA (uDAPL, verbs); sockets; storage (block, file).
    High performance: zero-copy RDMA, channel I/O.

16. OFA: Significant Membership Roster
    See http://www.openfabrics.org for the list.
    - HPC and system integrators, including "Tier 1"
    - Microprocessor providers
    - Networking providers (NIC/HCA, switch)
    - Server/enterprise software and systems providers
    - NAS / SAN providers
    - Switched-fabric industry trade associations
    - Financial services / technology providers
    - National laboratories
    - Universities
    - Software consulting organizations

17. OFED Software Structure (diagram from the OpenFabrics Alliance)

18. OFED Software Structure: MCS Work (diagram from the OpenFabrics Alliance)
    - An OFED "device provider" software module: (now) Linux PPC 8641D/sRIO; (future) Linux Intel, PCIe to sRIO / Ethernet device.
    - Open MPI with OFED transport (http://www.open-mpi.org).

19. Embedded Middleware Migration
    Embedded SW migration path:
    - ICS/DX: Mercury middleware; no MPI content; preserves SW investment.
    - ICS/DX + MPI: OFED in-kernel driver; incremental MPI use; application buffer integration (MPI_Send(smb_ptr, ...)); MPI as MCP extension (MPI_Barrier(world)).
    - ICS/DX + MPI, tuned: OFED OS bypass; coordinated DMA HW sharing.
    - MPI + dataflow (concept): single library with integrated resources; DMA HW assist for MPI and RDMA.
    Open-solution SW migration path:
    - MPI (malloc): baseline for porting to embedded; 0-copy RDMA out of the box; fabric via OFED; intra-node via SAL, OFED.
    - MPI (MPI_Alloc_mem): malloc -> MPI_Alloc_mem; application still uses open APIs; MCS "PMEM" contiguous memory; fabric via OFED + PMEM.

20. Motivating Example: ICS/DX vs. MPI, Data Rate Performance

21. Benchmark: ½ Round-Trip Time
    Rank 0 and rank 1 processes; inter-node runs place the ranks on distinct nodes.
    Rank 0: T0 = timestamp; DMA-write data + sync word; poll on received sync word; T1 = timestamp.
    Rank 1: poll on received sync word; DMA-write data + sync word back.
    Latency = ½ RTT (record min, max, avg, median); data rate is calculated from the transfer size and the average latency.
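The slide's latency and data-rate bookkeeping can be sketched as follows. The timing values are synthetic and the function name is illustrative, not from the benchmark code:

```python
import statistics

def half_rtt_stats(rtts_s, nbytes):
    """Reduce round-trip-time samples to the slide's reported metrics.

    Latency is half the round-trip time; the data rate is computed
    from the transfer size and the average latency.
    """
    lat = [rtt / 2.0 for rtt in rtts_s]
    avg = statistics.mean(lat)
    return {
        "min_s": min(lat),
        "max_s": max(lat),
        "avg_s": avg,
        "median_s": statistics.median(lat),
        "data_rate_Bps": nbytes / avg,
    }

if __name__ == "__main__":
    samples = [20e-6, 22e-6, 21e-6, 25e-6]  # synthetic RTT samples, seconds
    print(half_rtt_stats(samples, nbytes=1 << 20))
```

Halving the round trip assumes the two directions cost the same, which is why the benchmark records min/max/median alongside the average.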

22. ICS/DX vs. MPI/OFED: 8641D sRIO
    MPI/malloc: good; 0-copy "out of the box." Comparable data rates for OFED and ICS/DX.
    Overhead differences: kernel-level (OFED) vs. user-level DMA queuing; dynamic DMA chain formatting (OFED) vs. pre-plan + modify.
    RDMA-write ½-RTT latency, small transfer (4 bytes): ICS/DX (data+flag) 4 usec; OFED (data, poll last) 8.6 usec; OFED (data+flag) 13 usec.

23. Preliminary Conclusions
    1. Out-of-the-box application porting: 0-copy RDMA enabled by OFED; better results than MPI over TCP/IP.
    2. Minor application changes: contiguous memory for application buffers; MPI performance comparable to other systems.
    3. Improvements underway: OFED user-space optimized DMA queuing; moving the performance curves "to the left."
    4. Communication model differences: RDMA vs. message passing; limited "gap closing" using traditional means; different solution approaches to be considered; OFED offers "middleware friendly" RDMA.

24. Communication Model Comparison: Message Passing and RDMA

25. Message Passing vs. RDMA
    Flexibility (MPI) vs. performance (RDMA); the core issue is memory registration / setup for RDMA.
    MPI tradeoffs: copy costs vs. fabric handshake costs.
    - Eager protocol extra costs: copies (on both sides) to special RDMA buffers; RDMA transfer to/from those special buffers.
    - Rendezvous protocol extra costs: fabric handshakes in each send/recv; then 0-copy RDMA direct to/from application memory.
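A rough cost model makes the eager/rendezvous tradeoff concrete. This is a sketch with made-up constants; real MPI libraries tune the eager-to-rendezvous switch point empirically per transport:

```python
def eager_cost(nbytes, copy_bw_Bps, rdma_lat_s, rdma_bw_Bps):
    # Two copies (sender into, receiver out of, the bounce buffers)
    # plus the RDMA transfer between the special buffers.
    return 2 * nbytes / copy_bw_Bps + rdma_lat_s + nbytes / rdma_bw_Bps

def rendezvous_cost(nbytes, handshake_s, rdma_lat_s, rdma_bw_Bps):
    # One handshake round trip, then zero-copy RDMA straight
    # from application memory.
    return handshake_s + rdma_lat_s + nbytes / rdma_bw_Bps

if __name__ == "__main__":
    # Illustrative constants only: 2 GB/s memcpy, 10 us handshake,
    # 5 us RDMA latency, 1 GB/s RDMA bandwidth.
    kw = dict(rdma_lat_s=5e-6, rdma_bw_Bps=1e9)
    for n in (256, 4096, 65536, 1 << 20):
        e = eager_cost(n, copy_bw_Bps=2e9, **kw)
        r = rendezvous_cost(n, handshake_s=10e-6, **kw)
        print(n, "eager" if e < r else "rendezvous")
```

With these constants the copy penalty grows with message size while the handshake penalty is fixed, so eager wins for small messages and rendezvous for large ones, which is exactly the tradeoff the slide describes.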

26. MPI and RDMA Performance: Server/HPC Platform Comparisons

27. MPI vs. OFED Data Rate: IB, iWARP (Without sRIO; Same Software)
    Double data rate InfiniBand (DDR IB) and iWARP (10 GbE); a similar MPI-vs-OFED performance delta is observed on both.
    Percent-of-peak rates are computed against link speeds of 10 and 16 Gbps.

28. MPI Data Rate: sRIO, IB, iWARP
    MPI data rate is comparable across fabrics.
    Small-transfer copy-cost differences: x86_64 (DDR IB, iWARP) is faster than PPC32 (sRIO).

29. Mercury MPI/OFED Intra-Node Performance

30. Open MPI Intra-Node Communication
    Multi-core node: process 0 and process 1 each have private memory, with a shared memory segment between them (send buffer -> temp buffer -> recv buffer).
    Shared memory (sm) transport: copies segments via the temp buffer; optimized copy (AltiVec).
    OFED (openib) transport: loopback via CPU or DMA. CPU (now): kernel driver; CPU (next): library/AltiVec. DMA (now): through sRIO; DMA (next): local copy.
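The sm transport's copy-in/copy-out pattern can be sketched with Python's stdlib shared memory. This is a single-process illustration of the data movement only, not Open MPI's implementation:

```python
from multiprocessing import shared_memory

def sm_style_transfer(send_buf: bytes) -> bytes:
    """Move data sender -> shared temp segment -> receiver (two copies)."""
    seg = shared_memory.SharedMemory(create=True, size=len(send_buf))
    try:
        # Sender's copy: private send buffer into the shared temp segment.
        seg.buf[:len(send_buf)] = send_buf
        # Receiver's copy: shared temp segment into its private recv buffer.
        recv_buf = bytes(seg.buf[:len(send_buf)])
    finally:
        seg.close()
        seg.unlink()
    return recv_buf

if __name__ == "__main__":
    msg = b"payload" * 1000
    assert sm_style_transfer(msg) == msg
```

The two explicit copies are why the slide's direct OFED loopback (A -> B) uses roughly half the memory bandwidth of the indirect sm path (A -> TMP -> B) for large messages.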

31. MPI Intra-Node Data Rate
    OMPI sm latency is better for very small messages: 2-3 usec (OMPI sm) vs. 11 usec (OFED CPU loopback).
    OFED loopback is better for large messages: its direct copy (A -> B, 0-copy via CPU or DMA) uses 50% less memory bandwidth than the sm transport's indirect copy (A -> TMP -> B).
    Measurements use MCS-optimized vector copy routines.

32. Summary / Conclusions
    OFED perspective: RDMA performance is promising; more performance is available through user-space optimization; opportunity to support more middlewares.
    MPI perspective: MPI performance is in line with current practice; steps to increase performance were shown.
    Enhancing-MPI perspective: a converged middleware/extensions end goal vs. a pure-MPI end goal; HW assist can likely help in either case.

33. Thank You. Questions? Kenneth Cain

