04/25/06Pavan Balaji (The Ohio State University) Asynchronous Zero-copy Communication for Synchronous Sockets in the Sockets Direct Protocol over InfiniBand.

Slides:



Advertisements
Similar presentations
Middleware Support for RDMA-based Data Transfer in Cloud Computing Yufei Ren, Tan Li, Dantong Yu, Shudong Jin, Thomas Robertazzi Department of Electrical.
Advertisements

System Area Network Abhiram Shandilya 12/06/01. Overview Introduction to System Area Networks SAN Design and Examples SAN Applications.
A Hybrid MPI Design using SCTP and iWARP Distributed Systems Group Mike Tsai, Brad Penoff, and Alan Wagner Department of Computer Science University of.
Efficient Collective Operations using Remote Memory Operations on VIA-Based Clusters Rinku Gupta Dell Computers Dhabaleswar Panda.
Head-to-TOE Evaluation of High Performance Sockets over Protocol Offload Engines P. Balaji ¥ W. Feng α Q. Gao ¥ R. Noronha ¥ W. Yu ¥ D. K. Panda ¥ ¥ Network.
Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand P. Balaji, K. Vaidyanathan, S. Narravula,
Evaluation of ConnectX Virtual Protocol Interconnect for Data Centers Ryan E. GrantAhmad Afsahi Pavan Balaji Department of Electrical and Computer Engineering,
Protocols and software for exploiting Myrinet clusters Congduc Pham and the main contributors P. Geoffray, L. Prylli, B. Tourancheau, R. Westrelin.
Performance Characterization of a 10-Gigabit Ethernet TOE W. Feng ¥ P. Balaji α C. Baron £ L. N. Bhuyan £ D. K. Panda α ¥ Advanced Computing Lab, Los Alamos.
Performance Evaluation of RDMA over IP: A Case Study with the Ammasso Gigabit Ethernet NIC H.-W. Jin, S. Narravula, G. Brown, K. Vaidyanathan, P. Balaji,
Institute of Computer Science Foundation for Research and Technology – Hellas Greece Computer Architecture and VLSI Systems Laboratory Exploiting Spatial.
On the Provision of Prioritization and Soft QoS in Dynamically Reconfigurable Shared Data-Centers over InfiniBand P. Balaji, S. Narravula, K. Vaidyanathan,
Analyzing the Impact of Supporting Out-of-order Communication on In-order Performance with iWARP P. Balaji, W. Feng, S. Bhagvat, D. K. Panda, R. Thakur.
VIA and Its Extension To TCP/IP Network Yingping Lu Based on Paper “Queue Pair IP, …” by Philip Buonadonna.
I/O Hardware n Incredible variety of I/O devices n Common concepts: – Port – connection point to the computer – Bus (daisy chain or shared direct access)
Haoyuan Li CS 6410 Fall /15/2009.  U-Net: A User-Level Network Interface for Parallel and Distributed Computing ◦ Thorsten von Eicken, Anindya.
Active Messages: a Mechanism for Integrated Communication and Computation von Eicken et. al. Brian Kazian CS258 Spring 2008.
High Performance Communication using MPJ Express 1 Presented by Jawad Manzoor National University of Sciences and Technology, Pakistan 29 June 2015.
Realizing the Performance Potential of the Virtual Interface Architecture Evan Speight, Hazim Abdel-Shafi, and John K. Bennett Rice University, Dep. Of.
An overview of Infiniband Reykjavik, June 24th 2008 R E Y K J A V I K U N I V E R S I T Y Dept. Computer Science Center for Analysis and Design of Intelligent.
Sockets vs. RDMA Interface over 10-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck Pavan Balaji  Hemal V. Shah ¥ D. K. Panda 
Xen and the Art of Virtualization. Introduction  Challenges to build virtual machines Performance isolation  Scheduling priority  Memory demand  Network.
IWARP Ethernet Key to Driving Ethernet into the Future Brian Hausauer Chief Architect NetEffect, Inc.
Supporting iWARP Compatibility and Features for Regular Network Adapters P. BalajiH. –W. JinK. VaidyanathanD. K. Panda Network Based Computing Laboratory.
04/26/06D. K. Panda (The Ohio State University) Designing Next Generation Data-Centers with Advanced Communication Protocols and Systems Services P. Balaji,
P. Balaji, S. Bhagvat, D. K. Panda, R. Thakur, and W. Gropp
Supporting Strong Cache Coherency for Active Caches in Multi-Tier Data-Centers over InfiniBand S. Narravula, P. Balaji, K. Vaidyanathan, S. Krishnamoorthy,
Designing Efficient Systems Services and Primitives for Next-Generation Data-Centers K. Vaidyanathan, S. Narravula, P. Balaji and D. K. Panda Network Based.
Protocols for Wide-Area Data-intensive Applications: Design and Performance Issues Yufei Ren, Tan Li, Dantong Yu, Shudong Jin, Thomas Robertazzi, Brian.
Roland Dreier Technical Lead – Cisco Systems, Inc. OpenIB Maintainer Sean Hefty Software Engineer – Intel Corporation OpenIB Maintainer Yaron Haviv CTO.
1 March 2010 A Study of Hardware Assisted IP over InfiniBand and its Impact on Enterprise Data Center Performance Ryan E. Grant 1, Pavan Balaji 2, Ahmad.
Towards a Common Communication Infrastructure for Clusters and Grids Darius Buntinas Argonne National Laboratory.
High Performance User-Level Sockets over Gigabit Ethernet Pavan Balaji Ohio State University Piyush Shivam Ohio State University.
HPCS Lab. High Throughput, Low latency and Reliable Remote File Access Hiroki Ohtsuji and Osamu Tatebe University of Tsukuba, Japan / JST CREST.
Boosting Event Building Performance Using Infiniband FDR for CMS Upgrade Andrew Forrest – CERN (PH/CMD) Technology and Instrumentation in Particle Physics.
The MPC Parallel Computer Hardware, Low-level Protocols and Performances University P. & M. Curie (PARIS) LIP6 laboratory Olivier Glück.
The NE010 iWARP Adapter Gary Montry Senior Scientist
Workload-driven Analysis of File Systems in Shared Multi-Tier Data-Centers over InfiniBand K. Vaidyanathan P. Balaji H. –W. Jin D.K. Panda Network-Based.
Electronic visualization laboratory, university of illinois at chicago A Case for UDP Offload Engines in LambdaGrids Venkatram Vishwanath, Jason Leigh.
An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multi-core Environments G. Narayanaswamy, P. Balaji and W. Feng Dept. of Comp. Science Virginia Tech.
2006 Sonoma Workshop February 2006Page 1 Sockets Direct Protocol (SDP) for Windows - Motivation and Plans Gilad Shainer Mellanox Technologies Inc.
Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Block Device Shuang LiangRanjit NoronhaDhabaleswar K. Panda IEEE.
Remote Direct Memory Access (RDMA) over IP PFLDNet 2003, Geneva Stephen Bailey, Sandburst Corp., Allyn Romanow, Cisco Systems,
Impact of High Performance Sockets on Data Intensive Applications Pavan Balaji, Jiesheng Wu, D.K. Panda, CIS Department The Ohio State University Tahsin.
Architecture for Caching Responses with Multiple Dynamic Dependencies in Multi-Tier Data- Centers over InfiniBand S. Narravula, P. Balaji, K. Vaidyanathan,
Srihari Makineni & Ravi Iyer Communications Technology Lab
August 22, 2005Page 1 of (#) Datacenter Fabric Workshop Open MPI Overview and Current Status Tim Woodall - LANL Galen Shipman - LANL/UNM.
Minimizing Communication Latency to Maximize Network Communication Throughput over InfiniBand Design and Implementation of MPICH-2 over InfiniBand with.
Integrating New Capabilities into NetPIPE Dave Turner, Adam Oline, Xuehua Chen, and Troy Benjegerdes Scalable Computing Laboratory of Ames Laboratory This.
1 Public DAFS Storage for High Performance Computing using MPI-I/O: Design and Experience Arkady Kanevsky & Peter Corbett Network Appliance Vijay Velusamy.
Infiniband Bart Taylor. What it is InfiniBand™ Architecture defines a new interconnect technology for servers that changes the way data centers will be.
Non-Data-Communication Overheads in MPI: Analysis on Blue Gene/P P. Balaji, A. Chan, W. Gropp, R. Thakur, E. Lusk Argonne National Laboratory University.
An Architecture and Prototype Implementation for TCP/IP Hardware Support Mirko Benz Dresden University of Technology, Germany TERENA 2001.
Latency Reduction Techniques for Remote Memory Access in ANEMONE Mark Lewandowski Department of Computer Science Florida State University.
Intel Research & Development ETA: Experience with an IA processor as a Packet Processing Engine HP Labs Computer Systems Colloquium August 2003 Greg Regnier.
Sockets Direct Protocol Over InfiniBand in Clusters: Is it Beneficial? P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy, J. Wu and D. K. Panda.
Optimizing Charm++ Messaging for the Grid Gregory A. Koenig Parallel Programming Laboratory Department of Computer.
Mellanox Connectivity Solutions for Scalable HPC Highest Performing, Most Efficient End-to-End Connectivity for Servers and Storage April 2010.
LRPC Firefly RPC, Lightweight RPC, Winsock Direct and VIA.
iSER update 2014 OFA Developer Workshop Eyal Salomon
ProOnE: A General-Purpose Protocol Onload Engine for Multi- and Many- Core Architectures P. Lai, P. Balaji, R. Thakur and D. K. Panda Computer Science.
Implementing Babel RMI with ARMCI Jian Yin Khushbu Agarwal Daniel Chavarría Manoj Krishnan Ian Gorton Vidhya Gurumoorthi Patrick Nichols.
Mr. P. K. GuptaSandeep Gupta Roopak Agarwal
Mellanox Connectivity Solutions for Scalable HPC Highest Performing, Most Efficient End-to-End Connectivity for Servers and Storage September 2010 Brandon.
3/12/2013Computer Engg, IIT(BHU)1 PARALLEL COMPUTERS- 2.
Sockets Direct Protocol for Hybrid Network Stacks: A Case Study with iWARP over 10G Ethernet P. Balaji, S. Bhagvat, R. Thakur and D. K. Panda, Mathematics.
The Evaluation Tool for the LHCb Event Builder Network Upgrade Guoming Liu, Niko Neufeld CERN, Switzerland 18 th Real-Time Conference June 13, 2012.
Advisor: Hung Shi-Hao Presenter: Chen Yu-Jen
Introduction to Operating Systems Concepts
High Performance and Reliable Multicast over Myrinet/GM-2
Presentation transcript:

04/25/06Pavan Balaji (The Ohio State University) Asynchronous Zero-copy Communication for Synchronous Sockets in the Sockets Direct Protocol over InfiniBand P. Balaji, S. Bhagvat, H. –W. Jin and D. K. Panda Network Based Computing Laboratory (NBCL) Computer Science and Engineering Ohio State University

04/25/06Pavan Balaji (The Ohio State University) InfiniBand Overview An emerging industry standard High Performance –Low latency (about 2us) –High Throughput (8Gbps, 16Gbps and higher) Advanced Features –Hardware offloaded protocol stack –Kernel bypass – direct access to network for applications –RDMA operations – direct access to remote memory

04/25/06Pavan Balaji (The Ohio State University) Sockets Direct Protocol (SDP) High-Performance Alternative to TCP/IP sockets for IB, etc. Hijack and redirect socket calls Application transparent –Binary compatibility (most cases) Utilizes IB capabilities –Offloaded Protocol –RDMA operations –Kernel bypass Sockets Direct Protocol App #1App #2 High-speed Network Device Driver IP TCP Traditional Sockets Sockets Interface Offloaded Protocol App #N Advanced Features

04/25/06Pavan Balaji (The Ohio State University) Sockets APIs Supported by SDP Synchronous Sockets Asynchronous Sockets Extended Sockets (OSU Specific)* CommunicationSynchronousAsynchronous Operations Outstanding At most oneMore than one SDP Implementations BSDP, ZSDP Existing Applications MostFewVery few Potential for Performance LimitedHigh (Portions of this table have been borrowed from Mellanox Technologies) * RAIT05: “Supporting iWARP compatibility and features for regular network adapters”. P. Balaji, H. –W. Jin, K. Vaidyanathan and D. K. Panda. RAIT Workshop; in conjunction with Cluster ‘05 BSDP, ZSDP, AZ-SDP

04/25/06Pavan Balaji (The Ohio State University) Presentation Layout §Introduction and Background §Understanding Asynchronous Zero-copy SDP §Design Issues in AZ-SDP §Performance Evaluation §Conclusions and Future Work

04/25/06Pavan Balaji (The Ohio State University) Buffer-copy SDP (BSDP) Several buffer-copy based implementations of SDP exist –OSU, Mellanox, Voltaire HCA offloads transport and network layers Copy overhead still present Data Source App Buffer App Buffer SDP Buffer SDP Buffer SDP Buffer SDP Buffer SDP Buffer SDP Buffer Data Sink SDP Data Message ISPASS04: “Sockets Direct Protocol over InfiniBand in Clusters: Is it Beneficial?”. P. Balaji, S. Narravula, K. Vaidyanathan, S. Krishnamoorthy and D. K. Panda. IEEE International Conference on Performance Analysis of Systems and Software (ISPASS), 2004.

04/25/06Pavan Balaji (The Ohio State University) Zero-copy SDP (ZSDP) Implemented by Mellanox –RDMA Read based design Benefits of zero-copy Limited by the API of Synchronous Sockets –At most one outstanding communication request –Control message latency (50% time for 16K message) Intolerant to Skew Data Source App Buffer Data Sink SRC AVAIL App Buffer send() Send Complete Application Blocks App Buffer SRC AVAIL App Buffer send() Send Complete Application Blocks GET COMPLETE

04/25/06Pavan Balaji (The Ohio State University) Asynchronous Zero-copy SDP (AZ-SDP) Basic zero-copy communication is synchronous –Data communication accompanied by control messages –Communication will be latency bound Asynchronous Zero-copy SDP –Utilize the benefits of asynchronous communication (more than one outstanding communication operation) –Maintain the semantics of synchronous sockets (application can assume that it is using synchronous sockets) –Objectives: Correctness, Transparency and Performance –Key Idea: Memory protect buffers

04/25/06Pavan Balaji (The Ohio State University) Memory Protect AZ-SDP Functionality Send returns as soon as communication is initiated –Application “thinks” communication is synchronous Memory unprotected after communication completes If application touches buffer –Communication complete: Great! –Else PAGE FAULT generated SRC AVAIL send() App Buffer1 SRC AVAIL send() App Buffer2 Memory Unprotect GET COMPLETE App Buffer1 App Buffer2 Data SourceData Sink Get Data

04/25/06Pavan Balaji (The Ohio State University) Presentation Layout §Introduction and Background §Understanding Asynchronous Zero-copy SDP §Design Issues in AZ-SDP §Performance Evaluation §Conclusions and Future Work

04/25/06Pavan Balaji (The Ohio State University) Design Issues in AZ-SDP Handling a Page Fault –Block-on-Write: Wait till the communication has finished –Copy-on-Write: Copy data to internal buffer and carry on communication Handling Buffer Sharing –Buffers shared through mmap() Handling Unaligned Buffers –Memory protection is only in the granularity of a page –Malloc hook overheads

04/25/06Pavan Balaji (The Ohio State University) Handling a Page Fault Memory protection needed to disallow the application from accessing an occupied communication buffer Page fault generated on access –Number of page faults generated are application dependent Two approaches for handling the page-fault –Block on Write –Copy on Write

04/25/06Pavan Balaji (The Ohio State University) Memory Protect Block-on-Write Optimistic approach to avoid blocking for communication –ZSDP blocks during the communication call –AZ-SDP delays blocking Advantage: –Zero-copy communication –SDP specification compliant Disadvantage: –Not skew tolerant SRC AVAIL send() App Buffer1 Memory Unprotect GET COMPLETE App Buffer1 Data SourceData Sink Get Data Application touches buffer PAGE FAULT generated Block

04/25/06Pavan Balaji (The Ohio State University) Memory Protect Copy-on-Write Enhances the functionality of Block-on-Write –Does not blindly block Advantage: –Zero-copy communication when possible –Skew tolerant when receiver is not ready Disadvantage –Not SDP specification compliant SRC AVAIL send() App Buffer1 Memory Unprotect GET COMPLETE App Buffer1 Data SourceData Sink Application touches buffer PAGE FAULT generated Block Atomic Lock Atomic Lock Failed App Buffer1 Atomic Lock Successful Copy to temp. buffer SRC UPDATE

04/25/06Pavan Balaji (The Ohio State University) Presentation Layout §Introduction and Background §Understanding Asynchronous Zero-copy SDP §Design Issues in AZ-SDP §Performance Evaluation §Conclusions and Future Work

04/25/06Pavan Balaji (The Ohio State University) Experimental Test-Bed 4 node cluster –Dual 3.6 GHz Intel Xeon EM64T processors (2 MB L2 cache), 512 MB of 333 MHz DDR SDRAM –Mellanox MT25208 InfiniHost III DDR PCI-Express adapters (capable of a link-rate of 16 Gbps) –Mellanox MTS-2400, 24-port fully non-blocking DDR switch

04/25/06Pavan Balaji (The Ohio State University) Throughput and Comp./Comm. Overlap 30% improvement in the throughput Up to 2X improvement in computation/communication overlap tests

04/25/06Pavan Balaji (The Ohio State University) Impact of Page-faults When application touches the communication buffer very frequently, PAGE FAULT overheads degrade AZ-SDP’s performance

04/25/06Pavan Balaji (The Ohio State University) Presentation Layout §Introduction and Background §Understanding Asynchronous Zero-copy SDP §Design Issues in AZ-SDP §Performance Evaluation §Conclusions and Future Work

04/25/06Pavan Balaji (The Ohio State University) Conclusions and Future Work Current Zero-copy SDP approaches: Very restrictive AZ-SDP brings the benefits of asynchronous sockets to synchronous sockets in a TRANSPARENT manner 30% better throughput and 2X improvement in computation-communication overlap tests Analysis with applications and large-scale clusters Integration with OpenIB/Gen2

21 Acknowledgements Our research is supported by the following organizations Current Funding support by Current Equipment support by

04/25/06Pavan Balaji (The Ohio State University) Web Pointers Website: Group Homepage: NBCL

04/25/06Pavan Balaji (The Ohio State University) Backup Slides

04/25/06Pavan Balaji (The Ohio State University) Sockets Programming Model Several high-speed networks available today –E.g., InfiniBand (IB), Myrinet, 10-Gigabit Ethernet Common programming models –E.g., Sockets, MPI, Shared Memory Models –Network independent parallel and distributed applications Sockets programming model is of particular interest –Scientific apps, file/storage systems, commercial apps –Traditionally built over TCP/IP (and others) –Performance of such implementations is not the best

04/25/06Pavan Balaji (The Ohio State University) Limitations of TCP/IP Sockets for High-speed Networks Network/Transport layers processed by the host –Limited performance –Excessive resource usage (CPU, Memory traffic) Generic optimizations for TCP/IP sockets –Cannot sustain the performance of high-speed networks –Performance on IB (16Gbps) adapters limited to 2Gbps Sockets Direct Protocol (SDP) proposed –Alternative to TCP/IP Sockets

04/25/06Pavan Balaji (The Ohio State University) Zero-Copy Mechanisms in SDP SRC Available RDMA Read Data GET Complete SenderReceiver SINK Available RDMA Write Data PUT Complete SenderReceiver Register Buffer SOURCE-AVAILSINK-AVAIL

04/25/06Pavan Balaji (The Ohio State University) Prior Research Prior Research on High-Performance Sockets spanning various networks (Giganet CLAN, VIA, GbE, Myrinet) SDP over IBA: Buffer-copy based implementation Recent research on Zero-copy SDP [Goldenberg05] Zero-copy schemes to optimize TCP and UDP stacks –Mostly for asynchronous sockets –May require kernel/NIC firmware modifications

04/25/06Pavan Balaji (The Ohio State University) Latency and Throughput

04/25/06Pavan Balaji (The Ohio State University) Computation/Communication Overlap

04/25/06Pavan Balaji (The Ohio State University) Multi-connection Tests

04/25/06Pavan Balaji (The Ohio State University) Hot-spot Latency Test

04/25/06Pavan Balaji (The Ohio State University) Buffer Sharing Memory-protect B1 and disallow all access to it Override the mmap() call (libc) with a new mmap call –New mmap() call contains mapping of all memory-mapped buffers B1 B2 B1 and B2 are memory mapped to each other Send() Write()

04/25/06Pavan Balaji (The Ohio State University) Managing Un-aligned Buffers Two approaches –Malloc Hook –Hybrid approach with Buffered SDP Physical Page VAPI Control Buffer Application Buffer Shared Page

04/25/06Pavan Balaji (The Ohio State University) Malloc Hook Approach overrides the malloc() and free() system calls New Malloc() allocates physical page boundary- aligned N + PAGE_SIZE bytes, when N bytes are requested Advantage : –Simple Approach Disadvantage : –Very small buffer requests may result in buffer wastage –Time to malloc few bytes to Physical Page size is the same

04/25/06Pavan Balaji (The Ohio State University) Hybrid approach with Buffered SDP Hybrid Mechanism between BSDP and AZ-SDP Physical Page VAPI Control Buffer Application Buffer AZ-SDP communication BSDP comm. BSDP comm. A single communication might be carried out in multiple operations (upto three) 5-10% better performance than Malloc-hook based scheme

04/25/06Pavan Balaji (The Ohio State University) Copy-on-Write Control maintained via Locks at the receiver end by the AZ-SDP layer Receiver obtains the lock, if recv() is called first Sender can obtain the lock on generation of a page fault and can perform a copy-on-write operation