Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines P. Balaji ¥ W. Feng α Q. Gao ¥ R. Noronha ¥ W. Yu ¥ D. K. Panda ¥ ¥ Network Based Computing Lab, Ohio State University α Advanced Computing Lab, Los Alamos National Lab

Ethernet Trends
- Ethernet is the most widely used network architecture today
- Traditionally, Ethernet has been notorious for performance issues
  – Near an order-of-magnitude performance gap compared to InfiniBand and Myrinet
  – Cost-conscious architecture
  – Relied on host-based TCP/IP for network and transport layer support
  – Compatibility with existing infrastructure (switch buffering, MTU)
- Used by 42.4% of the Top500 supercomputers
- Key: extremely high performance per unit cost
  – GigE can give about 900 Mbps (performance) / $0 (cost)
- 10-Gigabit Ethernet (10GigE) recently introduced
  – 10-fold (theoretical) increase in performance while retaining existing features
  – Can 10GigE bridge the performance gap between Ethernet and InfiniBand/Myrinet?

InfiniBand, Myrinet and 10GigE: Brief Overview
- InfiniBand (IBA)
  – Industry-standard network architecture
  – Supports 10Gbps and higher network bandwidths
  – Offloaded protocol stack
  – Rich feature set (one-sided, zero-copy communication, multicast, etc.)
- Myrinet
  – Proprietary network by Myricom
  – Supports up to 4Gbps with dual ports (10G adapter announced!)
  – Offloaded protocol stack and rich feature set like IBA
- 10-Gigabit Ethernet (10GigE)
  – The next step for the Ethernet family
  – Supports up to 10Gbps link bandwidth
  – Offloaded protocol stack
  – Promises a richer feature set too with the upcoming iWARP stack

Characterizing the Performance Gap
- Each high-performance interconnect has its own interface
  – Characterizing the performance gap is no longer straightforward
- Portability in application development
  – Portability across various networks is a must
  – Message Passing Interface (MPI): de facto standard for scientific applications
  – Sockets interface: legacy scientific applications, Grid-based or heterogeneous computing applications, file and storage systems, other commercial applications
- Sockets and MPI are the right choices to characterize the gap
  – In this paper we concentrate only on the sockets interface

Presentation Overview
- Introduction and Motivation
- Protocol Offload Engines
- Experimental Evaluation
- Conclusions and Future Work

Interfacing with Protocol Offload Engines
[Two layer diagrams compare the approaches. High Performance Sockets: Application or Library -> Traditional Sockets Interface -> High Performance Sockets -> User-level Protocol -> Device Driver -> High Performance Network Adapter (network features, e.g., offloaded protocol), bypassing the host TCP/IP stack. TCP Stack Override: Application or Library -> Traditional Sockets Interface -> TCP/IP stack overridden by the offload modules (toedev/TOM) -> Device Driver -> High Performance Network Adapter.]

High Performance Sockets
- Pseudo sockets-like interface
  – Smooth transition for existing sockets applications: existing applications do not have to be rewritten or recompiled!
  – Improved performance by using the offloaded protocol stack on the network
- High Performance Sockets implementations exist for many networks
  – Initial implementations on VIA [shah98:canpc, kim00:cluster, balaji02:hpdc]
  – Follow-up implementations on Myrinet and Gigabit Ethernet [balaji02:cluster]
  – Sockets Direct Protocol (SDP) is an industry-standard specification for IBA; implementations by Voltaire, Mellanox and OSU exist [balaji03:ispass, mellanox05:hoti]
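To make the "no rewrite, no recompile" point concrete, here is a minimal sketch of an ordinary POSIX sockets client in C. The server address and port are placeholders, and the assumption is that an SDP-style high-performance sockets library can be interposed underneath (for example by preloading a vendor-supplied library) so that this unmodified code runs over the offloaded protocol stack.

```c
/* Minimal TCP client using only the standard sockets API.  The point of
 * High Performance Sockets / SDP is that this exact code (and binary) can
 * be carried over the offloaded protocol stack, e.g. by preloading a
 * vendor-supplied sockets library, without any source changes.
 * The server address and port below are placeholders. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_in srv;
    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port   = htons(5001);
    inet_pton(AF_INET, "192.168.1.10", &srv.sin_addr);

    int fd = socket(AF_INET, SOCK_STREAM, 0);   /* ordinary stream socket */
    if (fd < 0 || connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("connect");
        return 1;
    }

    const char msg[] = "hello over an offloaded stack";
    send(fd, msg, sizeof(msg), 0);

    char buf[128];
    ssize_t n = recv(fd, buf, sizeof(buf), 0);  /* reply from the server */
    if (n > 0)
        printf("received %zd bytes\n", n);

    close(fd);
    return 0;
}
```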

TCP Stack Override
- Similar to the High Performance Sockets approach, but overrides the TCP layer instead of the sockets layer
- Advantages compared to High Performance Sockets
  – Sockets features do not have to be duplicated (e.g., buffer management, memory registration)
  – Features implemented in the Berkeley sockets implementation can be used
- Disadvantages compared to High Performance Sockets
  – A kernel patch is required
  – Some TCP functionality has to be duplicated
- This approach is used by Chelsio in their 10GigE adapters

Presentation Overview
- Introduction and Motivation
- High Performance Sockets over Protocol Offload Engines
- Experimental Evaluation
- Conclusions and Future Work

Experimental Test-bed and Evaluation
- 4-node cluster: dual Xeon 3.0GHz, SuperMicro SUPER X5DL8-GG nodes, 512 KB L2 cache, 2GB of 266MHz DDR SDRAM memory, PCI-X 64-bit 133MHz
- InfiniBand
  – Mellanox MT23108 dual-port 4x HCAs (10Gbps link bandwidth); MT port switch
  – Voltaire IBHost stack
- Myrinet
  – Myrinet-2000 dual-port adapters (4Gbps link bandwidth)
  – SDP/Myrinet v1.7.9 over GM
- 10GigE
  – Chelsio T110 adapters (10Gbps link bandwidth); Foundry 16-port SuperX switch
  – Driver v1.2.0 for the adapters; firmware v2.2.0 for the switch
- Experimental results:
  – Micro-benchmarks (latency, bandwidth, bidirectional bandwidth, multi-stream, hot-spot, fan tests)
  – Application-level evaluation

Ping-Pong Latency Measurements
- SDP/Myrinet achieves the best small-message latency at 11.3us
- 10GigE and IBA achieve latencies of 17.7us and 24.4us, respectively
- As message size increases, IBA performs the best (Myrinet cards are 4Gbps links right now!)
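The latency numbers come from a standard sockets ping-pong pattern: one side sends a small message, the peer echoes it back, and half of the average round-trip time is reported as the one-way latency. The sketch below (client side only) illustrates that pattern; the message size, iteration count, and server address are assumptions, not the paper's exact benchmark code.

```c
/* Sketch of a sockets ping-pong latency client: send a small message,
 * wait for the echo, and report half the average round-trip time.
 * MSG_SIZE, ITERS and the server address are illustrative choices. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define MSG_SIZE 64
#define ITERS    10000

static double now_us(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e6 + tv.tv_usec;
}

int main(void)
{
    struct sockaddr_in srv = { .sin_family = AF_INET, .sin_port = htons(5001) };
    inet_pton(AF_INET, "192.168.1.10", &srv.sin_addr);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)); /* avoid Nagle delays */
    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("connect");
        return 1;
    }

    char buf[MSG_SIZE];
    memset(buf, 'a', sizeof(buf));

    double start = now_us();
    for (int i = 0; i < ITERS; i++) {
        send(fd, buf, MSG_SIZE, 0);
        ssize_t got = 0;
        while (got < MSG_SIZE) {                 /* the peer echoes the message back */
            ssize_t n = recv(fd, buf + got, MSG_SIZE - got, 0);
            if (n <= 0) return 1;
            got += n;
        }
    }
    double elapsed = now_us() - start;

    /* one-way latency = half the average round-trip time */
    printf("%d-byte latency: %.2f us\n", MSG_SIZE, elapsed / ITERS / 2.0);
    close(fd);
    return 0;
}
```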

Unidirectional Throughput Measurements
- 10GigE achieves the highest throughput at 6.4Gbps
- IBA and Myrinet achieve about 5.3Gbps and 3.8Gbps (Myrinet is only a 4Gbps link)
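The throughput numbers follow the usual streaming pattern: the sender pushes large messages back-to-back and divides the bytes sent by the elapsed time once the receiver signals completion. The sketch below shows the sender side only; the message size, iteration count, address, and one-byte acknowledgement are illustrative assumptions rather than the paper's exact harness.

```c
/* Sketch of a unidirectional throughput (streaming) sender: push many
 * large messages back-to-back, wait for one final acknowledgement, and
 * divide bytes sent by elapsed time.  Sizes and counts are illustrative. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

#define MSG_SIZE (256 * 1024)
#define ITERS    4096

int main(void)
{
    struct sockaddr_in srv = { .sin_family = AF_INET, .sin_port = htons(5001) };
    inet_pton(AF_INET, "192.168.1.10", &srv.sin_addr);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("connect");
        return 1;
    }

    static char buf[MSG_SIZE];
    memset(buf, 'a', sizeof(buf));

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < ITERS; i++) {
        size_t sent = 0;
        while (sent < MSG_SIZE) {                /* send() may be partial */
            ssize_t n = send(fd, buf + sent, MSG_SIZE - sent, 0);
            if (n <= 0) return 1;
            sent += n;
        }
    }
    char ack;
    recv(fd, &ack, 1, 0);                        /* receiver acks once all data arrived */
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    double gbps = (double)MSG_SIZE * ITERS * 8 / secs / 1e9;
    printf("throughput: %.2f Gbps\n", gbps);
    close(fd);
    return 0;
}
```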

Snapshot Results ("Apples-to-Oranges" comparison)
- Platforms: 10GigE on Opteron 2.2GHz, IBA on Xeon 3.6GHz, Myrinet on Xeon 3.0GHz
- These numbers provide a reference point only and are valid only for THIS slide!
- SDP/Myrinet with MX allows polling and achieves about 4.6us latency (event-based is better for SDP/GM, ~11.3us)
- 10GigE achieves the lowest event-based latency of 8.9us on Opteron systems
- IBA achieves 9Gbps throughput with their DDR cards (link speed of 20Gbps)

Bidirectional and Multi-Stream Throughput

Hot-Spot Latency
- 10GigE and IBA demonstrate similar scalability with increasing number of clients
- Myrinet's performance deteriorates faster than the other two
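The hot-spot test has a simple structure: several clients run ping-pongs against a single server node at the same time, and per-client latency is observed as the client count grows. Below is a minimal sketch of the server ("hot spot") side as a select()-based echo loop; the port, client count, and message size are illustrative assumptions, not the paper's actual harness.

```c
/* Sketch of the hot-spot server: N clients each run a ping-pong against
 * this one node simultaneously, so the test measures how latency degrades
 * as the hot spot gets busier.  A select()-based echo loop expresses the
 * pattern; NUM_CLIENTS, MSG_SIZE and the port are assumptions. */
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>

#define NUM_CLIENTS 8
#define MSG_SIZE    64

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port = htons(5001),
                                .sin_addr.s_addr = htonl(INADDR_ANY) };
    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, NUM_CLIENTS);

    int cfd[NUM_CLIENTS];
    for (int i = 0; i < NUM_CLIENTS; i++)            /* wait for every client to join */
        cfd[i] = accept(lfd, NULL, NULL);

    char buf[MSG_SIZE];
    for (;;) {
        fd_set rset;
        FD_ZERO(&rset);
        int maxfd = 0;
        for (int i = 0; i < NUM_CLIENTS; i++) {
            FD_SET(cfd[i], &rset);
            if (cfd[i] > maxfd) maxfd = cfd[i];
        }
        select(maxfd + 1, &rset, NULL, NULL, NULL);  /* whichever clients pinged */
        for (int i = 0; i < NUM_CLIENTS; i++) {
            if (!FD_ISSET(cfd[i], &rset)) continue;
            ssize_t n = recv(cfd[i], buf, MSG_SIZE, 0);
            if (n > 0) send(cfd[i], buf, n, 0);      /* echo back to that client */
        }
    }
    return 0;
}
```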

Fan-in and Fan-out Throughput Test
- 10GigE and IBA achieve similar performance for the fan-in test
- 10GigE performs slightly better for the fan-out test

Data-Cutter Run-time Library (Software Support for Data-Driven Applications)
- Designed by Univ. of Maryland
- Component framework: a user-defined pipeline of components
  – Each component is a filter; the collection of components is a filter group
  – Filters as well as filter groups can be replicated, with the illusion of a single stream within the filter group
- Stream-based communication with flow control between components
  – Unit of Work (UOW) based flow control; each UOW contains about 16 to 64KB of data
- Several applications supported: Virtual Microscope, ISO Surface, Oil Reservoir Simulator
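As an illustration of the UOW-based flow control described on this slide, the sketch below shows a downstream filter that consumes fixed-size units of work from a stream and returns a one-byte credit per UOW, so only a bounded amount of data is ever in flight. This is a conceptual sketch of the pattern only, not the Data-Cutter API; the UOW size and the credit protocol are assumptions.

```c
/* Sketch of Unit-of-Work (UOW) based flow control between two filters:
 * the downstream filter consumes one fixed-size UOW at a time and returns
 * a one-byte credit so the upstream filter never has more than a bounded
 * number of UOWs outstanding.  Illustration only, not the Data-Cutter API. */
#include <sys/socket.h>
#include <sys/types.h>

#define UOW_SIZE (32 * 1024)   /* the slide cites UOWs of roughly 16 to 64KB */

static int read_full(int fd, char *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = recv(fd, buf + got, len - got, 0);
        if (n <= 0) return -1;                   /* end of stream or error */
        got += n;
    }
    return 0;
}

/* Downstream filter loop: fd is a connected stream from the upstream filter. */
void consume_filter(int fd)
{
    static char uow[UOW_SIZE];
    const char credit = 1;

    while (read_full(fd, uow, UOW_SIZE) == 0) {
        /* ... process this unit of work (filter-specific computation) ... */
        send(fd, &credit, 1, 0);                 /* credit: upstream may send the next UOW */
    }
}
```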

Data-Cutter Performance Evaluation
- InfiniBand performs the best for both Data-Cutter applications (especially Virtual Microscope)
- The filter-based approach makes the environment sensitive to the latency of medium-sized messages

Parallel Virtual File System (PVFS)
[Diagram: compute nodes communicate over the network with a meta-data manager and several I/O server nodes]
- Designed by ANL and Clemson University
- Relies on striping of data across different nodes
- Tries to aggregate I/O bandwidth from multiple nodes
- Utilizes the local file system on the I/O server nodes
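To make the striping idea concrete, the sketch below maps a logical file offset to an I/O server and to an offset within that server's locally stored portion of the file under simple round-robin striping. The stripe size and server count are illustrative assumptions; PVFS's actual data distribution is configurable and more general.

```c
/* Sketch of round-robin striping as PVFS uses it conceptually: a logical
 * file offset maps to (I/O server, offset within that server's local file).
 * STRIPE_SIZE and NUM_SERVERS are assumed values for illustration. */
#include <stdio.h>

#define STRIPE_SIZE  65536ULL    /* bytes per stripe unit (assumed) */
#define NUM_SERVERS  4           /* number of I/O server nodes (assumed) */

struct stripe_loc {
    unsigned long long server;    /* which I/O server holds this byte */
    unsigned long long local_off; /* offset inside that server's local file */
};

static struct stripe_loc locate(unsigned long long file_offset)
{
    unsigned long long unit = file_offset / STRIPE_SIZE;   /* which stripe unit */
    struct stripe_loc loc;
    loc.server    = unit % NUM_SERVERS;                    /* round-robin across servers */
    loc.local_off = (unit / NUM_SERVERS) * STRIPE_SIZE     /* full units already on that server */
                  + file_offset % STRIPE_SIZE;             /* plus position inside the unit */
    return loc;
}

int main(void)
{
    unsigned long long offsets[] = { 0, 65536, 100000, 5ULL * 1024 * 1024 };
    for (int i = 0; i < 4; i++) {
        struct stripe_loc loc = locate(offsets[i]);
        printf("file offset %llu -> server %llu, local offset %llu\n",
               offsets[i], loc.server, loc.local_off);
    }
    return 0;
}
```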

PVFS Contiguous I/O
- Performance trends are similar to those of the throughput test
- The experiment is throughput-intensive

MPI-Tile I/O (PVFS Non-contiguous I/O)
- 10GigE and IBA perform about equally
- Myrinet is very close behind in spite of being only a 4Gbps network

Ganglia Cluster Management Infrastructure
[Diagram: the management node connects to each cluster node's Ganglia daemon, requests information, and receives data]
- Developed by UC Berkeley
- For each transaction, one connection and one medium-sized message (~6KB) transfer is required
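The per-transaction pattern described above can be sketched as follows: a new connection per query, a small request, a medium-sized (~6KB) reply, then close. Because the payload is small, connection establishment dominates the cost, which is what the next slide shows. The address, port, and request string below are illustrative assumptions, not Ganglia's actual wire protocol.

```c
/* Sketch of the management node's per-transaction pattern: connect to a
 * node's daemon, send a small request, read a ~6KB reply, and close.
 * The address, port, and request string are placeholders. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int query_node(const char *ip, int port, char *reply, size_t cap)
{
    struct sockaddr_in srv = { .sin_family = AF_INET, .sin_port = htons(port) };
    inet_pton(AF_INET, ip, &srv.sin_addr);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {  /* dominant cost */
        close(fd);
        return -1;
    }

    const char req[] = "GET /cluster-state\n";                    /* placeholder request */
    send(fd, req, sizeof(req) - 1, 0);

    size_t got = 0;                                               /* drain the ~6KB reply */
    for (;;) {
        ssize_t n = recv(fd, reply + got, cap - got, 0);
        if (n <= 0) break;                                        /* daemon closes when done */
        got += n;
    }
    close(fd);                                                    /* one connection per transaction */
    return (int)got;
}

int main(void)
{
    static char reply[8192];
    int n = query_node("192.168.1.20", 8649, reply, sizeof(reply));
    printf("received %d bytes\n", n);
    return 0;
}
```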

Ganglia Performance Evaluation
- Performance is dominated by connection time
- IBA and Myrinet take about a millisecond to establish connections, while 10GigE takes about 60us
- Optimizations for SDP/IBA and SDP/Myrinet, such as connection caching, are possible (being implemented)

Presentation Overview
- Introduction and Motivation
- High Performance Sockets over Protocol Offload Engines
- Experimental Evaluation
- Conclusions and Future Work

Concluding Remarks
- Ethernet has traditionally been notorious for performance reasons
  – Close to an order-of-magnitude performance gap compared to IBA/Myrinet
- 10GigE was recently introduced as a successor in the Ethernet family
  – Can 10GigE help bridge the gap for Ethernet with IBA/Myrinet?
- We showed comparisons between Ethernet, IBA and Myrinet
  – The sockets interface was used for the comparison
  – Micro-benchmark as well as application-level evaluations have been shown
- 10GigE performs quite comparably with IBA and Myrinet
  – Better in some cases and worse in others, but in the same ballpark
  – Ethernet is quite ubiquitous in Grid and WAN environments, and now offers comparable performance in SAN environments

Continuing and Future Work
- Sockets is only one end of the comparison chart
  – Other middleware is quite widely used too (e.g., MPI)
  – IBA/Myrinet might have an advantage due to RDMA/multicast capabilities
- Network interfaces and software stacks change
  – Myrinet is coming out with 10Gig adapters
  – 10GigE might release a PCI-Express based card
  – IBA has a zero-copy sockets interface for improved performance
  – IBM's 12x InfiniBand adapters increase the performance of IBA by 3-fold
  – These results keep changing with time; more snapshots are needed for fairness
- Multi-NIC comparison for Sockets/MPI
  – An awful lot of work at the host; scalability might be bound by the host
- iWARP compatibility and features for 10GigE TOE adapters

Acknowledgements
- Our research is supported by the following organizations
  [Slide shows sponsor logos for funding support and equipment donations]

Web Pointers
Websites: NBCL, LANL

Backup Slides

10GigE: Technology Overview
- Broken into three levels of technologies
- Regular Ethernet adapters
  – Layer-2 adapters
  – Rely on host-based TCP/IP to provide network/transport functionality
  – Can achieve high performance with optimizations
- TCP Offload Engines (TOEs)
  – Layer-4 adapters
  – Have the entire TCP/IP stack offloaded onto hardware
  – Sockets layer retained in the host space
- iWARP-aware adapters
  – Layer-4 adapters
  – Entire TCP/IP stack offloaded onto hardware
  – Support more features than TCP Offload Engines: no sockets, a richer iWARP interface (e.g., out-of-order placement of data, RDMA semantics)
[feng03:hoti, feng03:sc, balaji03:rait] [balaji05:hoti]