High Performance Interconnects: Landscape, Assessments & Rankings

Presentation transcript:

High Performance Interconnects: Landscape, Assessments & Rankings. Dan Olds, Partner, OrionX. April 12, 2017.

High Performance Interconnects (HPI)
The very top end of the networking market, for when you absolutely need high bandwidth and low latency.
Without HPI you don't really have a cluster, or at least not one that works very well.
Performance has been rising at roughly 30% annually, and spending on HPI is also rising significantly.

HPI market segment
[Diagram: the HPI market segment sits above a "current HPI line". Above the line: InfiniBand, specialized interconnects, and OPA at 100G link speed, with MPI as the application communication layer, spanning single-rack and multi-rack systems. Below the line: TCP/IP at 1G/10G/40G link speeds, with application communication via JDBC, RMI, IIOP, SOAP, etc. Axes: link speed, network protocol, application communication.]

Three Types of HPI: Ethernet, Proprietary, InfiniBand
Ethernet: sold by a host of providers (Cisco, HPE, Juniper, and many others). Tried and true interconnect, and the easiest to implement. While it matches the others on bandwidth, its latency is far higher (microseconds or more, versus nanoseconds).
Proprietary: primarily sold by Cray, SGI, and IBM, plus a few others. You have to purchase a system in order to get their brand of HPI. Intel is a new entrant in this segment of the market, although without an accompanying system.
InfiniBand: Mellanox has emerged as the de facto leader. Highest performance based on published numbers: 200 Gb/s, 200 million messages/s, 90 ns latency.

Key Differences in HPI: Product Maturity/Position - Ethernet
Ethernet has been around longer than any other HPI, but has been surpassed in performance.
Still many installations, but it has lost much of its share at the high end.
Latency, not bandwidth, is the problem (microseconds or more, versus nanoseconds).

Key Differences in HPI: Product Maturity/Position - Intel Omni-Path Architecture
Intel's Omni-Path is still in its infancy, with very few installations.
A handful of customers (some big names among them), but few, if any, in production.
Intel claims bandwidth, latency, and message rates the same as or better than InfiniBand (covered later).

Key Differences in HPI: Product Maturity/Position - InfiniBand
Has been in the HPI market since the early 2000s.
Thousands of customers, millions of nodes.
Now makes up a large proportion of the Top500 list (187 systems).
Synonymous with Mellanox these days.

Key Differences in HPI Technology: Onload vs. Offload
Onload: the main CPU handles all network processing chores; the adapter and switches just pass messages along. Examples: Intel Omni-Path Architecture, Ethernet. This is the same model as PC servers and old UNIX systems, where the CPU handled every task and took an interrupt on every communication event.
Offload: the HCA and switches handle all network processing tasks, so little or no main CPU time is needed and the CPU can keep running applications. Example: Mellanox InfiniBand. Mainframes took the same approach, using communication-assist processors so the CPU could process applications rather than communications.

Offload Details
The network protocol load includes:
Link layer: packet layout, packet forwarding, flow control, data integrity, QoS.
Network layer: adds headers and routes packets from one subnet to another.
Transport layer: in-order packet delivery; divides data into packets, the receiver reassembles them, and acknowledgements are sent and received.
MPI operations: scatter, gather, broadcast, etc.
With offload, ALL of these operations are handled by the adapter hardware, for example an InfiniBand HCA. A minimal sketch of those MPI collectives follows below.
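To make the "MPI operations" item concrete, here is a minimal sketch in C using MPI of the scatter, gather, and broadcast collectives named above. The buffer sizes and values are arbitrary assumptions chosen for illustration; the point is that, with an offload-capable HCA, the data movement behind these calls can be progressed by the adapter rather than the host CPU.

```c
/* Minimal sketch of the MPI collectives named above (scatter, gather, broadcast).
 * Buffer sizes and values are arbitrary; compile with mpicc, run with mpirun. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Root scatters one int to every rank, each rank works on it,
     * then the root gathers the results back. */
    int *sendbuf = NULL;
    if (rank == 0) {
        sendbuf = malloc(size * sizeof(int));
        for (int i = 0; i < size; i++) sendbuf[i] = i;
    }

    int my_val = 0;
    MPI_Scatter(sendbuf, 1, MPI_INT, &my_val, 1, MPI_INT, 0, MPI_COMM_WORLD);

    my_val *= 2;   /* per-rank "work" */

    int *recvbuf = (rank == 0) ? malloc(size * sizeof(int)) : NULL;
    MPI_Gather(&my_val, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);

    /* Broadcast a parameter from the root to everyone. */
    int param = (rank == 0) ? 42 : 0;
    MPI_Bcast(&param, 1, MPI_INT, 0, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```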

Onload Details
The network protocol load includes:
Link layer: packet layout, packet forwarding, flow control, data integrity, QoS.
Network layer: adds headers and routes packets from one subnet to another.
Transport layer: in-order packet delivery; divides data into packets, the receiver reassembles them, and acknowledgements are sent and received.
MPI operations: scatter, gather, broadcast, etc.
With onload, ALL of these operations are performed by the host processor, using host memory.
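To give a feel for "performed by the host processor, using host memory", here is a toy C sketch of the kind of per-packet work an onload design pushes onto the CPU: fragmenting a message into MTU-sized packets and checksumming each one. The packet structure, MTU value, and checksum are invented for illustration; this is not any vendor's actual stack.

```c
/* Toy model of host-side protocol processing under onload: the CPU
 * fragments a message into MTU-sized packets and checksums each one.
 * Illustrative only; not a real driver or network stack. */
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdio.h>

#define MTU 4096

struct packet {
    uint32_t seq;        /* transport-layer sequence number */
    uint32_t len;        /* payload bytes in this packet    */
    uint32_t checksum;   /* simple additive integrity check */
    uint8_t  payload[MTU];
};

static uint32_t checksum32(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) sum += p[i];
    return sum;
}

/* Every byte of 'msg' passes through the host CPU here; with offload,
 * the equivalent work would be done by the adapter's ASIC instead. */
static size_t send_onload(const uint8_t *msg, size_t len)
{
    struct packet pkt;
    size_t sent = 0;
    size_t packets = 0;

    while (sent < len) {
        size_t chunk = (len - sent > MTU) ? MTU : len - sent;
        pkt.seq = (uint32_t)packets++;
        pkt.len = (uint32_t)chunk;
        memcpy(pkt.payload, msg + sent, chunk);
        pkt.checksum = checksum32(pkt.payload, chunk);
        /* ...hand pkt to the NIC for transmission... */
        sent += chunk;
    }
    return packets;   /* number of packets the CPU had to build */
}

int main(void)
{
    static uint8_t msg[1 << 20];   /* 1 MiB message */
    printf("packets built on host: %zu\n", send_onload(msg, sizeof msg));
    return 0;
}
```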

Onload vs. Offload
Onload vs. offload isn't a big deal when the cluster is small…

Onload vs. Offload
…but it becomes a very large deal as the cluster grows.
It is particularly a problem for scatter/gather-style collective operations, where the head node gets overrun trying to process messages.

Onload vs. Offload
As node count increases, onload performance drops: more nodes means more messaging and more pressure on the head node, and node counts are increasing significantly.
Offload uses dedicated hardware ASICs, which are much faster at this work than general-purpose CPUs.
MPI itself is not highly parallel; with onload, that means speed is limited by the slowest core. This has no bearing on offload speed.
The sketch below shows the communication/computation overlap that offload makes possible.
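One way to see why this matters at scale: non-blocking collectives let the application keep computing while the collective is in flight, but the overlap is only real if something other than the host CPU progresses the communication. A hedged sketch in C with MPI follows; the array sizes and the compute loop are arbitrary stand-ins for application work.

```c
/* Sketch of communication/computation overlap with a non-blocking collective.
 * With an offloading HCA the MPI_Iallreduce can progress in hardware while
 * do_compute() runs; under onload, progress largely happens only when the
 * CPU re-enters the MPI library (e.g., in MPI_Wait). */
#include <mpi.h>

static void do_compute(double *x, int n)
{
    for (int i = 0; i < n; i++) x[i] = x[i] * 1.0001 + 1.0;  /* stand-in work */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    enum { N = 1 << 16 };
    static double local[N], global[N], work[N];

    MPI_Request req;
    MPI_Iallreduce(local, global, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    do_compute(work, N);          /* application work overlapped with the reduce */

    MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```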

Rampant FUD War
From The Next Platform, "Intel Stretches Deep Learning On Scalable System Framework", May 10, 2016.
The cost of HPI is typically ~15-20% of a total cluster budget.
Prices in high tech typically don't increase over time; price points for new products are usually the same as the former high-end products they replace (for example, high-end PCs and low-end servers).

More FUD War…
All images were provided by Intel, all from The Next Platform story "Intel Stretches Deep Learning on Scalable System Framework", May 10, 2016.
What else do these images have in common?

FUD Wars: Behind the Numbers
It's all in the fine print, right? Here is Intel's fine print for the graphs on the previous slide:
"…48 port (B0 silicon). IOU Non-posted Prefetch disabled in BIOS. Snoop hold-off timer = 9. EDR based on internal testing: Intel MPI 5.1.3, shm:dapl fabric, RHEL 7.2, -genv I_MPI_DAPL_EAGER_MESSAGE_AGGREGATION off. Mellanox EDR ConnectX-4 Single Port Rev 3 MCX455A HCA. Mellanox SB7700 36-port EDR InfiniBand switch. MLNX_OFED_LINUX-3.2-2.0.0.0 (OFED-3.2-2.0.0). IOU Non-posted Prefetch enabled in BIOS. 1. osu_latency, 8 B message. 2. osu_bw, 1 MB message. 3. osu_mbw_mr, 8 B… Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors."
"dapl" is the key: it is an Intel MPI mechanism that does not allow for InfiniBand-style offload operations.
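For reference, osu_latency-style numbers come from a simple two-rank ping-pong: time many round trips of a small message and report half the average round trip. The sketch below in C with MPI mirrors the 8-byte message size from the fine print, but the iteration counts are arbitrary and this is not the actual OSU Micro-Benchmarks code.

```c
/* Minimal ping-pong latency sketch in the spirit of osu_latency:
 * two ranks exchange an 8-byte message; reported latency is half of
 * the average round-trip time. Not the actual OSU Micro-Benchmarks code. */
#include <mpi.h>
#include <stdio.h>

#define MSG_SIZE  8
#define WARMUP    100
#define ITERS     10000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    char buf[MSG_SIZE] = {0};
    int peer = 1 - rank;
    double t0 = 0.0;

    for (int i = 0; i < WARMUP + ITERS; i++) {
        if (i == WARMUP) t0 = MPI_Wtime();   /* start timing after warm-up */
        if (rank == 0) {
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_SIZE, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        }
    }

    if (rank == 0) {
        double usec = (MPI_Wtime() - t0) * 1e6 / ITERS / 2.0;  /* one-way latency */
        printf("one-way latency: %.2f us\n", usec);
    }

    MPI_Finalize();
    return 0;
}
```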

Even More FUD
100% CPU core utilization on an offload HCA?! Does anyone believe this? It would mean that about half of the Top500 systems are absolutely useless.
Intel is using a CPU polling mechanism that pegs the CPU on the Mellanox box at 100%, even though that polling has nothing to do with network communications.
Both Intel and Mellanox have benchmarked OPA at ~65% CPU utilization.
A sketch of busy-polling versus blocking completion follows below.
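The 100% figure is largely an artifact of how completion is waited for. A busy-poll loop spins a core at 100% even when the adapter is doing all of the real protocol work, while a blocking wait can let the core idle. The hedged C sketch below shows a busy-poll wait built from MPI_Test next to a receive it could wait on; the message and ranks are arbitrary, and whether MPI_Wait actually sleeps depends on the MPI library's progress/spin settings.

```c
/* Illustrative sketch: busy-polling completion spins a core at ~100%
 * regardless of who does the network processing; a blocking MPI_Wait may
 * let the core idle, depending on the MPI library's settings. */
#include <mpi.h>

/* The core shows 100% utilization in this loop even if an offload-capable
 * HCA is doing all of the real protocol work. */
static void wait_busy_poll(MPI_Request *req)
{
    int done = 0;
    while (!done)
        MPI_Test(req, &done, MPI_STATUS_IGNORE);
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[8] = {0};
    MPI_Request req;

    if (rank == 0) {
        MPI_Irecv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
        wait_busy_poll(&req);          /* alternative: MPI_Wait(&req, MPI_STATUS_IGNORE); */
    } else if (rank == 1) {
        MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```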

FUD aside, here are the numbers (updated for HDR):

                   Intel OPA          Mellanox EDR / HDR InfiniBand
Bandwidth          100 Gb/s           100 Gb/s / 200 Gb/s
Latency (µs)       0.93               0.85 or less / 0.90 or less
Message rate       89 million/s*      150 million/s / 200 million/s

* This number, provided by Intel, has dropped from more than 150 million in 2015.

HPI Roadmaps
The InfiniBand roadmap shows HDR (200 Gb/s) now, with NDR (400 Gb/s? 2020?) down the road.
OPA roadmap: formerly OPA 2 in 2018, now OPA 2 in 2020… ouch.
The Ethernet roadmap shows 200 Gb/s in 2018-19.

Major HPI Choices: OrionX analysis
Dimensions scored per vendor: Market (Presence, Trends), Customer (Readiness, Needs), Product (Capabilities, Roadmap), and Overall.
Scores as given: Mellanox 9, 8, 8.5, 10, 9.5; Ethernet vendors 7, 6, 7.5, 6.5; Intel 7.0.

OrionX Constellation

Vendor      Market   Product   Customer
Ethernet    7        6.5       7.5
Mellanox    9        9.5       8.5
Intel       7.25

OrionX Constellation™ reports. Questions? Comments? Concerns?