1 May 2011 RDMA Capable iWARP over Datagrams Ryan E. Grant 1, Mohammad J. Rashti 1, Pavan Balaji 2, Ahmad Afsahi 1 1 Department of Electrical and Computer.

Presentation transcript:

1 May 2011 RDMA Capable iWARP over Datagrams
Ryan E. Grant (1), Mohammad J. Rashti (1), Pavan Balaji (2), Ahmad Afsahi (1)
(1) Department of Electrical and Computer Engineering, Queen’s University, Kingston, ON, Canada K7L 3N6
(2) Mathematics and Computer Science, Argonne National Laboratory, Argonne, IL, USA

2 Introduction
Motivation
Background Information
Design
Experimental Framework and Results
  – Microbenchmarks
  – Applications
Conclusions
  – Future Work
Questions

3 Motivation
Existing RDMA designs do not support RDMA Write operations over unreliable datagram (UD) transports.
Popular applications use datagrams:
  – video-on-demand streaming
  – high-speed financial trading applications
It is desirable to leverage RDMA technology to improve the performance of these applications.
Improve the performance of inter-node communication for Ethernet clusters.

4 Motivation
Sandvine Inc. report from Monday:
  – Netflix consumes 29.7% of peak-time bandwidth in North America
  – Real-time entertainment (RTE) consumes 49.2%
  – The report predicts that entertainment will consume 55-60% of peak-time bandwidth by the end of 2011
  – RTE and file sharing together consume almost 70% of peak-time bandwidth
Source:

5 Motivation
Why use UD?
  – Scalability: no need for connections
  – Speed: no TCP congestion control
  – Simplicity: UD offloading is less complex to implement than a TCP offload engine (TOE)
Drawbacks of UD?
  – Unreliability
  – Potential packet loss from congestion

6 Outline
Motivation
Background Information
Design
Experimental Framework and Results
  – Microbenchmarks
  – Applications
Conclusions
  – Future Work
Questions

7 Background Information
iWARP
  – Remote Direct Memory Access (RDMA) over Ethernet
  – Standard built on a TCP or SCTP lower layer
  – Queue-pair-based network
  – Untagged and tagged models
    Untagged: sent data is matched with a posted receive for local data placement
    Tagged: the sender is aware of the remote memory window and provides the target memory location

8 Background Information
[Figure: iWARP (UD) stack versus kernel TCP/IP stack]

9 Background Information
Traditional iWARP RDMA Write
1. Verbs request
2. iWARP stack applies tagged header (STag and offset)
3. Data sent to target
4. Data received
5. Data written into memory based on STag and offset
6. Send request posted
7. Send request data sent to target
8. Incoming data matched to Recv request
9. Recv request handled
10. RDMA Write valid after Recv
11. Application can access data
Alternatively, the application can poll a bit in memory to determine when the write is complete:
7. Poll on memory until valid
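
The tagged placement and the follow-up Send in the steps above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: the `Target` class, the `rdma_write` helper, and the 12-byte header layout (32-bit STag, 64-bit offset) are hypothetical and do not reproduce the actual iWARP wire format or any verbs API.

```python
import struct

# Hypothetical tagged header: 32-bit STag followed by a 64-bit offset.
# This layout is an assumption for illustration, not the real iWARP format.
HEADER = struct.Struct("!IQ")


def rdma_write(stag, offset, data):
    """Source side (steps 1-3): prepend the tagged header to the payload."""
    return HEADER.pack(stag, offset) + data


class Target:
    """Toy model of the target side of a traditional RDMA Write."""

    def __init__(self):
        self.regions = {}       # STag -> registered memory (bytearray)
        self.posted_recvs = 0   # number of receives posted by the application
        self.completions = []   # payloads of Sends matched to posted receives

    def register(self, stag, size):
        self.regions[stag] = bytearray(size)

    def post_recv(self):
        self.posted_recvs += 1

    def on_tagged(self, packet):
        """Steps 4-5: place data directly using the STag and offset."""
        stag, offset = HEADER.unpack_from(packet)
        payload = packet[HEADER.size:]
        region = self.regions[stag]
        if offset + len(payload) > len(region):
            raise ValueError("write falls outside the registered region")
        region[offset:offset + len(payload)] = payload

    def on_send(self, payload):
        """Steps 8-10: the Send is matched to a posted Recv; the write is now valid."""
        if self.posted_recvs == 0:
            raise RuntimeError("no posted receive to match the Send")
        self.posted_recvs -= 1
        self.completions.append(payload)


target = Target()
target.register(stag=7, size=16)
target.post_recv()
target.on_tagged(rdma_write(7, 4, b"data"))   # data is placed, but not yet announced
target.on_send(b"write complete")             # only now can the application trust it
```

In this model the data reaches memory at step 5, but the application learns about it only when the separate Send completes.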

10 Background
Traditional iWARP relies on the lower layer (TCP) for reliability.
With a UD lower-layer protocol (LLP):
  – The target buffer may not contain the complete message
  – If the final Send/Recv is lost in transit, the entire iWARP message is lost

12 Design - Challenges with UD Transports
UD transports pose additional challenges compared with TCP:
  – Unreliable!
  – No ordering guarantees
  – No connection information
But they solve some problems as well:
  – No middlebox fragmentation issues
    No need for iWARP markers

13 Challenges with UD
RDMA functions like a local DMA, but remote
  – For UD, RDMA must be treated like an unreliable memory
  – Indicate which areas of memory are “bad” due to message loss
Ideally it should be compatible with socket semantics
  – Done through an intermediate interface or protocol

14 Challenges with UD
Allow for socket semantics compatibility
  – Each incoming message can result in a completion notification
  – Functions like traditional recvmsg but using user buffers
  – Similar to send/recv without posted recvs
Allow for a DMA-like interface
  – Produce a validity map for all valid areas of memory in a defined memory region
  – Essentially an aggregate of many completion notifications, delivered at once
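
One way to picture the validity map described above is as a merged list of valid byte ranges for a memory region. The sketch below is a minimal illustration, assuming a hypothetical `ValidityMap` class; the slides do not specify how the map is actually represented.

```python
class ValidityMap:
    """Tracks which byte ranges of a memory region currently hold valid data.

    Hypothetical structure for illustration; the real representation may differ.
    """

    def __init__(self, region_size):
        self.region_size = region_size
        self.ranges = []  # sorted, non-overlapping (start, end) pairs, end exclusive

    def mark_valid(self, offset, length):
        """Record one completed placement, merging with neighbouring ranges."""
        if length <= 0 or offset < 0 or offset + length > self.region_size:
            raise ValueError("placement outside the memory region")
        start, end = offset, offset + length
        merged = []
        for s, e in self.ranges:
            if e < start or s > end:
                merged.append((s, e))          # disjoint: keep unchanged
            else:
                start, end = min(s, start), max(e, end)  # overlapping or adjacent
        merged.append((start, end))
        merged.sort()
        self.ranges = merged

    def invalid_ranges(self):
        """Return the "bad" areas: bytes never covered by a placement."""
        gaps, cursor = [], 0
        for s, e in self.ranges:
            if s > cursor:
                gaps.append((cursor, s))
            cursor = e
        if cursor < self.region_size:
            gaps.append((cursor, self.region_size))
        return gaps


vmap = ValidityMap(region_size=100)
vmap.mark_valid(0, 10)
vmap.mark_valid(20, 10)
vmap.mark_valid(10, 5)
print(vmap.ranges)            # valid areas, delivered to the application at once
print(vmap.invalid_ranges())  # areas affected by loss or not yet written
```

Each call to `mark_valid` corresponds to one completion notification, so fetching `ranges` once plays the role of the aggregated notification.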

15 Background Information
iWARP RDMA Write-Record
1. Verbs request
2. iWARP stack applies tagged header (STag and offset)
3. Data sent to target
4. Data received
5. Data written into memory based on STag and offset
6. Location of valid data entered into CQ or validity map
7. Poll CQ for valid data
8. Application can access data
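
The Write-Record steps above can be sketched as follows. This is a toy model, assuming a hypothetical `WriteRecordTarget` class with a simple queue standing in for the completion queue (CQ); it is not the authors' implementation or a real verbs interface.

```python
from collections import deque


class WriteRecordTarget:
    """Toy target: every placed datagram is recorded in a completion queue."""

    def __init__(self):
        self.regions = {}   # STag -> registered memory (bytearray)
        self.cq = deque()   # records of where valid data now lives

    def register(self, stag, size):
        self.regions[stag] = bytearray(size)

    def on_datagram(self, stag, offset, payload):
        """Steps 4-6: place the data, then record its location in the CQ."""
        region = self.regions.get(stag)
        if region is None or offset < 0 or offset + len(payload) > len(region):
            return False    # unknown STag or out-of-bounds write: drop it
        region[offset:offset + len(payload)] = payload
        self.cq.append({"stag": stag, "offset": offset, "length": len(payload)})
        return True

    def poll_cq(self):
        """Step 7: return the next record of valid data, or None if empty."""
        return self.cq.popleft() if self.cq else None


target = WriteRecordTarget()
target.register(stag=3, size=12)

# Three datagrams are sent; the middle one is lost and never arrives.
target.on_datagram(3, 0, b"AAAA")
target.on_datagram(3, 8, b"CCCC")

# Step 8: the application reads only the areas reported as valid.
while (record := target.poll_cq()) is not None:
    start = record["offset"]
    print(record, bytes(target.regions[3][start:start + record["length"]]))
```

Unlike the traditional RDMA Write, no separate Send and no posted receive are involved: a lost datagram simply leaves no record, and the datagrams that did arrive remain usable.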

16 Solving the Challenges of UD
Ordering
  – Small messages (< 64K) are typical of UD
  – Direct placement avoids ordering issues for small messages
  – Large messages need a message sequence number counter for each user of a memory region
No connection information
  – Pass the sender’s IP/port back to the application when it fetches validity data
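
A rough sketch of the two ideas above: a large message is split into datagram-sized segments that carry a message sequence number, and the target records the sender's address with each placement. The names `segment` and `Region`, and the policy of rejecting segments from a message older than the newest one seen from the same sender, are assumptions made for illustration; the slides do not describe the exact policy.

```python
def segment(msg_seq, base_offset, data, max_payload):
    """Split one large message into (msg_seq, offset, chunk) segments."""
    return [
        (msg_seq, base_offset + i, data[i:i + max_payload])
        for i in range(0, len(data), max_payload)
    ]


class Region:
    """Toy memory region with one sequence counter per sender (user)."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.latest_seq = {}   # sender (ip, port) -> newest message number seen
        self.records = []      # (sender, msg_seq, offset, length) per placement

    def place(self, sender, msg_seq, offset, chunk):
        # Assumed policy: ignore segments of a message older than the newest
        # one already seen from this sender, so late data cannot overwrite it.
        if msg_seq < self.latest_seq.get(sender, msg_seq):
            return False
        if offset < 0 or offset + len(chunk) > len(self.buf):
            return False
        self.latest_seq[sender] = msg_seq
        # Direct placement: the offset says where the data goes, so the
        # arrival order of segments within a message does not matter.
        self.buf[offset:offset + len(chunk)] = chunk
        # The sender's IP/port is kept so it can be handed to the application.
        self.records.append((sender, msg_seq, offset, len(chunk)))
        return True


sender = ("192.0.2.1", 5000)          # example address, not from the slides
message = bytes(range(256)) * 10      # 2560-byte message
region = Region(4096)

# Segments arrive out of order; placement by offset still reassembles them.
for seg in reversed(segment(1, 0, message, max_payload=1000)):
    region.place(sender, *seg)

print(bytes(region.buf[:len(message)]) == message)
```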

18 Experimental Framework
Test bed (OS, processors, NIC, switch): Fedora Kernel – 2.0 GHz Quad-Core AMD Opteron; NetEffect 10GigE; Fujitsu 10GigE Switch
Network performance data were collected using a custom microbenchmark suite for software iWARP.
Application results were collected using a custom socket interface to software iWARP and the following software:
  – VideoLan’s VLC
  – SIPp
UD Send/Recv was first proposed in: Mohammad J. Rashti, Ryan E. Grant, Pavan Balaji, and Ahmad Afsahi, "iWARP Redefined: Scalable Connectionless Communication over High-Speed Ethernet", 17th International Conference on High Performance Computing (HiPC 2010), Goa, India, December 19-22, 2010.

19 Microbenchmark Results
UD RDMA Write-Record has the lowest small-message latency, similar to UD Send/Recv.

20 Baseline Multi-Stream Performance
RDMA Write-Record also achieves higher bandwidth at larger message sizes, and it leads at medium message sizes as well.

21 Microbenchmark Results
RDMA Write-Record is also more loss tolerant than Send/Recv for large messages, because it delivers partial messages (a message may span multiple 64K UDP messages).
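
A simple model shows why partial delivery helps for large messages. The sketch assumes that each datagram is lost independently with the same probability, which is a simplification and not a measurement from this work; the function names are hypothetical. Under that assumption, a message spanning n datagrams arrives whole with probability (1 - p)^n, while the expected share of bytes that survive individually is (1 - p).

```python
import math


def datagrams_needed(message_bytes, payload_bytes):
    """Number of datagrams a message of the given size spans."""
    return math.ceil(message_bytes / payload_bytes)


def p_whole_message(loss_rate, n_datagrams):
    """Probability that every datagram of a message arrives (independent loss)."""
    return (1.0 - loss_rate) ** n_datagrams


def expected_bytes_all_or_nothing(message_bytes, payload_bytes, loss_rate):
    """Expected usable bytes when only complete messages are delivered."""
    n = datagrams_needed(message_bytes, payload_bytes)
    return message_bytes * p_whole_message(loss_rate, n)


def expected_bytes_partial(message_bytes, loss_rate):
    """Expected usable bytes when each surviving datagram is delivered."""
    return message_bytes * (1.0 - loss_rate)


# Illustrative inputs only: a 1 MB message, 64,000-byte datagrams, 1% loss.
size, payload, loss = 1_000_000, 64_000, 0.01
print(datagrams_needed(size, payload))
print(expected_bytes_all_or_nothing(size, payload, loss))
print(expected_bytes_partial(size, loss))
```

The gap between the two expectations widens as the message spans more datagrams, which matches the direction of the result stated above.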

22 Microbenchmark Summary
RDMA Write-Record provides good performance
  – Beats RC RDMA Write at the most important message sizes for latency and bandwidth
  – Improves upon UD Send/Recv
RDMA Write-Record fits well within existing socket semantics, enabling easy adoption
  – Removes MPA layer complexity as well as TCP bottlenecks to enhance performance and reduce overall stack complexity

23 Application Performance Results

24 Application Performance
Tested with media streaming and SIP phone applications
  – Developed a sockets-to-verbs interface that allows existing applications to use the software iWARP stack (UD/RC iWARP)
  – Lightweight interface to test functionality
    A formally specified socket interface would help facilitate acceptance
Operates in only one iWARP transport mode at a time, RC or UD. Sockets Direct Protocol is available for RC-mode hardware (not compatible with software iWARP).

25 VLC Performance
VLC requires significantly less buffering time with UD iWARP than with RC iWARP, a 74% average improvement.

26 SIP Performance
SIP shows a 43.1% improvement in response times using UD over RC (Send/Recv and RDMA Write-Record are statistically tied in performance for this test).

27 Application Performance Discussion
Performance with UD is better than with RC.
The software solution still uses the TCP/IP and UDP stacks
  – OS-related overhead is similar in both cases
  – Performance benefits come from the simpler UDP transport
Hardware solutions would benefit from requiring no target CPU involvement for data reception (no posted recvs).
The target system can receive information without a local work request.

28 Application Memory Usage
The memory usage of a UD solution for a SIP application can be significantly less than that of an RC solution clients)

29 Application Memory Usage
Memory usage was calculated using whole-application memory usage as well as memory usage from the slab.
Improvement of users contrasts to theoretical improvement of 28.1%
  – The difference lies in the SIP application’s requirement to store information on active UDP clients
Scalability and offloaded networking for iWARP UD hardware are promising for increasing server capacity and throughput.

31 Conclusions
RDMA Write-Record is the first one-sided RDMA operation operable over UD on iWARP.
RDMA Write-Record allows for data transfer that can tolerate packet loss.
The UD solution is more scalable than the connection-based one.
Full specifications for a two-sided Send/Recv and a one-sided RDMA Write-Record over iWARP are now available.
Real applications show performance improvements using UD-based iWARP.

32 Future Work
Extend the work to include a reliable datagram transport, broadening the potential application space.
MPI-RDMA Write-Record interface for HPC applications.
Provide an SDP-like interface for UD iWARP.

33 Thank You
Questions?
This work was supported in part by: Natural Sciences and Engineering Research Council of Canada Grant #RGPIN/ , Canada Foundation for Innovation and Ontario Innovation Trust Grant #7154, Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract DE-AC02-06CH11357, and the National Science Foundation Grant #