Sockets vs. RDMA Interface over 10-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck
Pavan Balaji, Hemal V. Shah, D. K. Panda


Sockets vs. RDMA Interface over 10-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck
Pavan Balaji (Network Based Computing Lab, Computer Science and Engineering, Ohio State University)
Hemal V. Shah (Embedded IA Division, Intel Corporation, Austin, Texas)
D. K. Panda (Network Based Computing Lab, Computer Science and Engineering, Ohio State University)

Introduction and Motivation
- Advent of High Performance Networks
  – Ex: InfiniBand, 10-Gigabit Ethernet, Myrinet, etc.
  – High Performance Protocols: VAPI / IBAL, GM
  – Good for building new applications
  – Not so beneficial for existing applications, which are built around portability: they should run on all platforms
- TCP/IP-based sockets: a popular choice
- Several GENERIC optimizations proposed and implemented for TCP/IP
  – Jacobson Optimization: Integrated Checksum-Copy [Jacob89]
  – Header Prediction for single-stream data transfer [Jacob89]
[Jacob89]: "An Analysis of TCP Processing Overhead", D. Clark, V. Jacobson, J. Romkey and H. Salwen. IEEE Communications
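The integrated checksum-copy mentioned above folds the Internet checksum computation into the copy loop so the data is touched only once instead of twice. Below is a minimal user-space sketch of the idea; the `csum_and_copy` name and the byte-wise loop are illustrative, not the actual kernel routine:

```c
#include <stdint.h>
#include <stddef.h>

/* Copy 'len' bytes from src to dst while accumulating the 16-bit
 * one's-complement Internet checksum over the same data, so the
 * buffer is traversed only once instead of twice. */
static uint16_t csum_and_copy(void *dst, const void *src, size_t len)
{
    const uint8_t *s = (const uint8_t *)src;
    uint8_t *d = (uint8_t *)dst;
    uint32_t sum = 0;

    while (len > 1) {
        sum += (uint32_t)((s[0] << 8) | s[1]);  /* 16-bit big-endian word */
        d[0] = s[0];
        d[1] = s[1];
        s += 2; d += 2; len -= 2;
    }
    if (len) {                                  /* odd trailing byte */
        sum += (uint32_t)(s[0] << 8);
        d[0] = s[0];
    }
    while (sum >> 16)                           /* fold the carries */
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)~sum;
}
```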

Generic Optimizations Insufficient!
[Chart: network bandwidth growth from 1GigE to 10GigE compared against processor speed]
- Processor speed DOES NOT scale with network speeds
- Protocol processing is too expensive for current-day systems

Network Specific Optimizations
- Sockets can utilize some network features
  – Hardware support for protocol processing
  – Interrupt Coalescing (can be considered generic)
  – Checksum Offload (TCP stack has to be modified)
  – Insufficient!
- Network Specific Optimizations
  – High Performance Sockets [shah99, balaji02]
  – TCP Offload Engines (TOE)
[shah99]: "High Performance Sockets and RPC over Virtual Interface (VI) Architecture", H. Shah, C. Pu, R. S. Madukkarumukumana, In CANPC '99
[balaji02]: "Impact of High Performance Sockets on Data Intensive Applications", P. Balaji, J. Wu, T. Kurc, U. Catalyurek, D. K. Panda, J. Saltz, In HPDC '03

Memory Traffic Bottleneck
- Offloaded transport layers provide some performance gains
  – Protocol processing is offloaded; lower host CPU overhead
  – Better network performance for slower hosts
  – Quite effective for 1-2 Gigabit networks
  – Effective for faster (10-Gigabit) networks in some scenarios
- Memory Traffic Constraints
  – Offloaded transport layers still rely on the sockets interface
  – The sockets API forces memory access operations in several scenarios, e.g., transactional protocols such as RPC, File I/O, etc.
  – For 10-Gigabit networks, memory access operations can limit network performance!

10-Gigabit Networks
- 10-Gigabit Ethernet
  – Recently released as a successor in the Ethernet family
  – Some adapters support TCP/IP checksum and segmentation offload
- InfiniBand
  – Open industry standard
  – Interconnect for connecting compute and I/O nodes
  – Provides a high-performance offloaded transport layer; zero-copy data transfer
  – Provides one-sided communication (RDMA, Remote Atomics)
  – Becoming increasingly popular
  – An example RDMA-capable 10-Gigabit network

Objective
- New standards proposed for RDMA over IP
  – Utilize an offloaded TCP/IP stack on the network adapter
  – Support additional logic for zero-copy data transfer to the application
  – Compatible with existing Layer 3 and 4 switches
- What's the impact of an RDMA interface over TCP/IP?
  – Implications on CPU utilization
  – Implications on memory traffic
  – Is it beneficial?
- We analyze these issues using InfiniBand's RDMA capabilities!

Presentation Outline
- Introduction and Motivation
- TCP/IP Control Path and Memory Traffic
- 10-Gigabit network performance for TCP/IP
- 10-Gigabit network performance for RDMA
- Memory Traffic Analysis for 10-Gigabit networks
- Conclusions and Future Work

TCP/IP Control Path (Sender Side)
[Diagram: Application Buffer → (checksum and copy) → Socket Buffer → (DMA) → NIC; on write(): checksum and copy, post TX, kick driver, return to application; the driver posts a descriptor, the packet leaves, and an interrupt is raised on transmit success]
- Checksum, Copy and DMA are the data-touching portions of TCP/IP
- Offloaded protocol stacks avoid the checksum at the host; the copy and the DMA are still present
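As a rough illustration of the control flow above, here is a heavily simplified, hypothetical sketch in C; `sock_buf`, `tx_desc`, `post_tx_descriptor` and `kick_driver` are stand-ins invented for the example, not real kernel or driver interfaces:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins for the socket buffer and the
 * NIC transmit descriptor; not real kernel types. */
struct sock_buf { uint8_t data[65536]; uint32_t used; };
struct tx_desc  { void *buf; uint32_t len; };

/* Stub "driver" calls so the sketch links; a real stack would hand the
 * descriptor to the NIC, which then DMAs the socket buffer. */
static void post_tx_descriptor(struct tx_desc *d) { (void)d; }
static void kick_driver(void)                     { }

/* Sender-side steps from the slide: copy into the socket buffer, post a
 * TX descriptor, kick the driver, return to the application; the DMA and
 * the transmit-success interrupt happen later, asynchronously. */
static size_t pseudo_send(struct sock_buf *sb, const void *app_buf, size_t len)
{
    if (len > sizeof(sb->data))
        len = sizeof(sb->data);

    memcpy(sb->data, app_buf, len);   /* data-touching copy (checksum folded in
                                         here, or offloaded to the NIC)        */
    sb->used = (uint32_t)len;

    struct tx_desc d = { .buf = sb->data, .len = sb->used };
    post_tx_descriptor(&d);           /* NIC will DMA the socket buffer */
    kick_driver();
    return len;                       /* return before the packet actually leaves */
}

int main(void)
{
    static struct sock_buf sb;
    const char msg[] = "hello";
    printf("queued %zu bytes\n", pseudo_send(&sb, msg, sizeof msg));
    return 0;
}
```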

TCP/IP Control Path (Receiver Side)
[Diagram: packet arrives → interrupt on arrival → DMA into Socket Buffer; the driver waits for a read(); on read(), the data is copied from the Socket Buffer into the Application Buffer and the application gets the data]
- Data might need to be buffered on the receiver side
- Pick-and-post techniques force a memory copy on the receiver side
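From the application's side this copy is hidden behind the read()/recv() call: every byte returned by the sketch below has already been DMAed into the kernel socket buffer and is then copied out into the application buffer. A minimal receive loop, assuming `fd` is an already-connected TCP socket:

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <stddef.h>
#include <errno.h>

/* Read exactly 'len' bytes from a connected TCP socket. Each successful
 * recv() implies a kernel-to-user copy from the socket buffer into 'buf',
 * one of the data-touching operations analyzed in these slides. */
static ssize_t recv_full(int fd, void *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = recv(fd, (char *)buf + got, len - got, 0);
        if (n == 0)                    /* peer closed the connection */
            break;
        if (n < 0) {
            if (errno == EINTR)        /* interrupted; retry */
                continue;
            return -1;
        }
        got += (size_t)n;
    }
    return (ssize_t)got;
}
```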

Memory Bus Traffic for TCP
[Diagram: CPU with L2 cache, North Bridge, memory and NIC connected via the FSB, memory bus and I/O bus; the application and socket buffers are fetched into the L2 cache for the data copy, the application buffer is written back to memory, and the NIC DMAs the socket buffer]
- Each network byte requires 4 bytes to be transferred on the memory bus (unidirectional traffic)
- Assuming 70% memory efficiency, TCP can support at most 4-5 Gbps bidirectional on a 10 Gbps network (400 MHz/64-bit FSB)
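The 4-5 Gbps figure follows directly from the numbers on this slide; a worked version of the arithmetic (the symbols below are introduced only for this calculation):

```latex
\[
\begin{aligned}
B_{\text{FSB}}^{\text{peak}} &= 400\ \text{MHz} \times 8\ \text{B}
   = 3.2\ \text{GB/s} = 25.6\ \text{Gbps},\\
B_{\text{FSB}}^{\text{eff}} &\approx 0.7 \times 25.6\ \text{Gbps}
   \approx 17.9\ \text{Gbps},\\
B_{\text{net}}^{\max} &\approx
   \frac{17.9\ \text{Gbps}}{4\ \text{memory bytes per network byte}}
   \approx 4.5\ \text{Gbps}.
\end{aligned}
\]
```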

Network to Memory Traffic Ratio (bytes of memory traffic per network byte)

                          Application buffer       Application buffer
                          fits in cache            doesn't fit in cache
  Transmit (Worst Case)
  Transmit (Best Case)    1                        2-4
  Receive  (Worst Case)   2-4                      4
  Receive  (Best Case)    2                        4

- This table shows the minimum memory traffic associated with network data
- In reality, socket buffer cache misses, control messages and noise traffic may cause these ratios to be higher
- Details of the other cases are present in the paper

Presentation Outline
- Introduction and Motivation
- TCP/IP Control Path and Memory Traffic
- 10-Gigabit network performance for TCP/IP
- 10-Gigabit network performance for RDMA
- Memory Traffic Analysis for 10-Gigabit networks
- Conclusions and Future Work

Experimental Test-bed (10-Gigabit Ethernet)
- Two Dell 2600 Xeon 2.4 GHz 2-way SMP nodes
  – 1 GB main memory (333 MHz, DDR)
  – Intel E7501 chipset
  – 32K L1, 512K L2, 400 MHz/64-bit FSB
  – PCI-X 133 MHz/64-bit I/O bus
  – Intel 10GbE/Pro 10-Gigabit Ethernet adapters
- 8 P4 2.0 GHz nodes (IBM xSeries 305)
  – Intel Pro/1000 MT Server Gigabit Ethernet adapters
  – 256K main memory

10-Gigabit Ethernet: Latency and Bandwidth
- TCP/IP achieves a latency of 37 us on Windows Server 2003 and about 20 us on Linux
- About 50% CPU utilization on both platforms
- Peak throughput of about 2500 Mbps; % CPU utilization
- Application buffer is always in cache!!
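Throughput numbers of this kind are usually obtained with a simple streaming microbenchmark. The sketch below is a generic client-side version, not the authors' benchmark; the address, port, message size and 1 GB transfer size are illustrative, and the 128 KB socket buffer matches the setting quoted later in the slides:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/time.h>

#define MSG_SIZE    (16 * 1024)   /* per-send message size (illustrative) */
#define TOTAL_BYTES (1L << 30)    /* stream 1 GB */

int main(int argc, char **argv)
{
    const char *ip = (argc > 1) ? argv[1] : "192.168.0.1";  /* illustrative address */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int sndbuf = 128 * 1024;      /* 128 KB socket buffer, as in the fan-in/fan-out tests */
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof sndbuf);

    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(5001) };
    inet_pton(AF_INET, ip, &addr.sin_addr);
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
        perror("connect");
        return 1;
    }

    char *buf = malloc(MSG_SIZE);
    memset(buf, 'x', MSG_SIZE);

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    long sent = 0;
    while (sent < TOTAL_BYTES) {
        ssize_t n = send(fd, buf, MSG_SIZE, 0);  /* each send: copy into the socket buffer */
        if (n <= 0)
            break;
        sent += n;
    }
    gettimeofday(&t1, NULL);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("throughput: %.1f Mbps\n", sent * 8.0 / secs / 1e6);
    close(fd);
    free(buf);
    return 0;
}
```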

TCP Stack Pareto Analysis (64 byte)
[Charts: sender-side and receiver-side CPU profiles for 64-byte messages]
- Kernel, kernel libraries and TCP/IP contribute to the offloadable TCP/IP stack

TCP Stack Pareto Analysis (16K byte)
[Charts: sender-side and receiver-side CPU profiles for 16 KB messages]
- TCP and other protocol overhead take up most of the CPU
- Offload is beneficial when buffers fit into cache

Throughput (Fan-in/Fan-out)
- Socket buffer (SB) = 128K; MTU = 9K
- Peak throughput of 3500 Mbps for fan-in and 4200 Mbps for fan-out

Bi-Directional Throughput
- Not the traditional bi-directional bandwidth test
- Fan-in with half the nodes and fan-out with the other half

Presentation Outline
- Introduction and Motivation
- TCP/IP Control Path and Memory Traffic
- 10-Gigabit network performance for TCP/IP
- 10-Gigabit network performance for RDMA
- Memory Traffic Analysis for 10-Gigabit networks
- Conclusions and Future Work

Experimental Test-bed (InfiniBand)
- 8 SuperMicro SUPER P4DL6 nodes
  – Xeon 2.4 GHz 2-way SMP nodes
  – 512 MB main memory (DDR)
  – PCI-X 133 MHz/64-bit I/O bus
- Mellanox InfiniHost MT23108 dual-port 4x HCA
  – InfiniHost SDK version
  – HCA firmware version 1.17
- Mellanox InfiniScale MT port switch (4x)
- Linux kernel version smp

InfiniBand RDMA: Latency and Bandwidth
- Performance improvement due to hardware support and zero-copy data transfer
- Near-zero CPU utilization at the data sink for large messages
- Performance limited by the PCI-X I/O bus
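For reference, posting a one-sided RDMA write looks roughly like the sketch below. It uses the current libibverbs API rather than the VAPI/IBAL interfaces mentioned earlier, and assumes the queue pair, completion queue, memory region, remote address and rkey have already been created and exchanged out of band:

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>

/* Post a one-sided RDMA write of 'len' bytes from a locally registered
 * buffer to a remote buffer, then poll the send CQ for its completion.
 * The NIC moves the data directly between the two application buffers;
 * no intermediate socket-buffer copy is involved on either side. */
static int rdma_write_sync(struct ibv_qp *qp, struct ibv_cq *send_cq,
                           struct ibv_mr *mr, void *buf, uint32_t len,
                           uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id              = 1,
        .sg_list            = &sge,
        .num_sge            = 1,
        .opcode             = IBV_WR_RDMA_WRITE,
        .send_flags         = IBV_SEND_SIGNALED,   /* generate a completion */
        .wr.rdma.remote_addr = remote_addr,
        .wr.rdma.rkey        = rkey,
    };
    struct ibv_send_wr *bad_wr = NULL;

    if (ibv_post_send(qp, &wr, &bad_wr)) {
        fprintf(stderr, "ibv_post_send failed\n");
        return -1;
    }

    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(send_cq, 1, &wc)) == 0)
        ;                                          /* busy-poll for the completion */
    if (n < 0 || wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "RDMA write failed (status %d)\n", wc.status);
        return -1;
    }
    return 0;
}
```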

Presentation Outline
- Introduction and Motivation
- TCP/IP Control Path and Memory Traffic
- 10-Gigabit network performance for TCP/IP
- 10-Gigabit network performance for RDMA
- Memory Traffic Analysis for 10-Gigabit networks
- Conclusions and Future Work

Throughput Test: Memory Traffic
- Sockets can force up to 4 times more memory traffic compared to the network traffic
- RDMA allows a ratio of 1!!

Multi-Stream Tests: Memory Traffic
- Memory traffic is significantly higher than the network traffic
- It comes to within 5% of the practically attainable peak memory bandwidth

Presentation Outline
- Introduction and Motivation
- TCP/IP Control Path and Memory Traffic
- 10-Gigabit network performance for TCP/IP
- 10-Gigabit network performance for RDMA
- Memory Traffic Analysis for 10-Gigabit networks
- Conclusions and Future Work

Conclusions
- TCP/IP performance on high-performance networks
  – High Performance Sockets
  – TCP Offload Engines
- 10-Gigabit networks: a new dimension of complexity – memory traffic
- The sockets API can require significant memory traffic
  – Up to 4 times more than the network traffic
  – Allows saturation at less than 35% of the network bandwidth
  – Shows the potential benefits of providing RDMA over IP: significant benefits in performance, CPU utilization and memory traffic

Future Work
- Memory traffic analysis for 64-bit systems
- Potential of the L3 cache available in some systems
- Evaluation of various applications
  – Transactional (SPECweb)
  – Streaming (multimedia services)

Thank You!
For more information, please visit the Network Based Computing Laboratory home page, The Ohio State University.

Backup Slides