Low Latency Messaging Over Gigabit Ethernet
Keith Fenech
CSAW '04, 24 September 2004


Why Cluster Computing?
- Ideal for computationally intensive applications.
- Multi-threaded processes allow jobs to be processed in parallel over multiple CPUs.
- High bandwidth allows interconnected nodes to achieve supercomputer performance.
- Networks of Workstations (NOWs) [1]
  - Easily available (commodity platforms)
  - Relatively cheap
  - Nodes may be used independently or as a cluster
  - Better utilization of idle computing resources

High Performance Networking
- Commodity networks are dominated by IP over Ethernet.
- Performance is directly affected by:
  - Hardware: bus and network bandwidths
  - Latency: delay incurred in communicating a message from source to destination
  - Overhead: length of time that a processor is engaged in transmitting or receiving each message
- Fine-grain threads communicate frequently using small messages.
- A high-performance communication architecture should provide:
  - Transparency to the application layer
  - High throughput for bandwidth-intensive applications
  - Low latencies for frequently communicating threads
  - Minimal protocol processing overhead on the host machine
- Gigabit performance is not achievable at the application layer. Why?

Conventional NICs & Protocols
Receiver node:
- Ethernet controller receives frame
- Check CRC for frame
- Filter MAC destination address
- NIC generates HW interrupt to notify host
- PCI transfer to host memory
- CPU suspends current task & launches interrupt handler to service high-priority interrupt
- Check network layer (IP) header & verify checksum
- Parse routing tables & store valid IP datagrams in IP buffer
- Reassemble fragmented datagrams in host memory
- Call transport layer (TCP/UDP) functions
- Deliver packet to application layer
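One of the per-packet steps on the receive path above is verifying the IP header checksum. The following is a minimal sketch of that step, not code from the presentation: it uses the standard 16-bit one's-complement Internet checksum, the function names are hypothetical, and the sample header is an illustrative 20-byte IPv4 header without options.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit one's-complement checksum of data."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # sum big-endian 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def ipv4_header_is_valid(header: bytes) -> bool:
    """A header whose checksum field is correct sums to 0xFFFF,
    so the complemented result is 0. Options (IHL > 5) are ignored
    in this simplified sketch."""
    return len(header) >= 20 and internet_checksum(header) == 0


# Illustrative header: UDP datagram from 192.168.0.1 to 192.168.0.199,
# checksum field 0xb861 (bytes 10 and 11).
SAMPLE_HEADER = bytes.fromhex(
    "4500 0073 0000 4000 4011 b861 c0a8 0001 c0a8 00c7"
)
```

On a conventional NIC this loop runs on the host CPU for every datagram, which is part of the protocol processing overhead the later slides aim to reduce.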

Problems With Conventional Protocols & Architectures
- NIC generates a CPU interrupt for each frame
- Servicing interrupts involves an expensive vertical switch to kernel space
- Software interrupts are needed to pass IP datagrams to upper layers
- Servicing incoming packets results in high host CPU load
- Risk of receiver livelock scenarios (as in Denial of Service attacks)
- PCI bus startup overheads are incurred for each message
- Layered protocols imply expensive memory-to-memory buffer copies
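To put the per-frame interrupt cost in numbers, the sketch below estimates how many frames per second a saturated Gigabit Ethernet link delivers, and therefore how little time the host has per frame if every frame raises an interrupt. It is a back-of-the-envelope calculation, not a measurement from the presentation. The function names are hypothetical, and the overhead constant uses the standard Ethernet framing sizes.

```python
# Per-frame wire overhead on Ethernet, in bytes:
# preamble + start-of-frame delimiter (8), MAC header (14),
# frame check sequence (4), inter-frame gap (12).
WIRE_OVERHEAD_BYTES = 8 + 14 + 4 + 12


def frames_per_second(payload_bytes: int, link_bps: int = 1_000_000_000) -> float:
    """Frames per second on a fully loaded link for a given payload size."""
    bits_on_wire = (payload_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


def microseconds_per_frame(payload_bytes: int, link_bps: int = 1_000_000_000) -> float:
    """Time budget per frame before the next one arrives."""
    return 1e6 / frames_per_second(payload_bytes, link_bps)


full_size = frames_per_second(1500)  # about 81,274 frames/s
minimum = frames_per_second(46)      # about 1,488,095 frames/s
```

With full-size 1500-byte payloads the host has roughly 12.3 microseconds per frame; with minimum-size frames the budget drops below one microsecond, which is where receiver livelock becomes a real risk.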

Available Techniques
- Bypass kernel for critical data paths
- Buffer & protocol processing moved to user space
- User-level hardware access
- Zero-copy techniques
- Scatter/gather techniques
- Larger MTUs (jumbo frames)
- Larger DMA transfers avoid PCI startup overheads
- Interrupt coalescing
- Message descriptors & polling replace interrupts
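As an illustration of the scatter/gather idea listed above, the sketch below sends a header and a payload held in separate buffers with a single call, and receives them back into separate buffers, without first concatenating them in user space. It uses the POSIX vectored I/O calls exposed by Python as `os.writev` and `os.readv` (Unix only). The helper names are hypothetical, and this shows the technique at the system-call level, not the NIC-level DMA scatter/gather a programmable NIC would perform.

```python
import os


def send_gathered(fd: int, header: bytes, payload: bytes) -> int:
    """Gather two separate buffers into one write, with no intermediate
    user-space copy into a combined buffer. Returns bytes written."""
    return os.writev(fd, [header, payload])


def recv_scattered(fd: int, header_len: int, payload_len: int):
    """Scatter one read across two preallocated buffers.
    Returns (bytes_read, header, payload)."""
    header = bytearray(header_len)
    payload = bytearray(payload_len)
    bytes_read = os.readv(fd, [header, payload])
    return bytes_read, bytes(header), bytes(payload)
```

A usage sketch would open a pipe with `os.pipe()`, call `send_gathered` on the write end and `recv_scattered` on the read end. A robust version would also handle short reads and writes, which this sketch does not.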

Current Solutions
- Enabled by programmable NICs
- Virtual Interface Architecture (VIA) [2]
- U-Net [3] (ATM)
- Myrinet GM [4] and Illinois FM [5] (Myrinet)
- QsNet [6] (Quadrics)
- EMP [7] (Ethernet)

Our Proposal
- NOWs running over Gigabit Ethernet
- Use Tigon2 programmable NIC features (onboard CPU, memory, DMA)
- Design a reliable lightweight communication protocol for GE
  - Reliable network (ordered & lossless packet delivery)
  - Low overhead
  - Low latency
- Offload protocol processing from host CPU onto NIC CPU
- Interrupt-free architecture (message descriptor queues + polling)
- OS bypass: user applications & NIC hardware communicate through pinned-down shared memory
- Zero copy
- Dynamic MTUs & DMA sizes to reduce PCI startup overheads
- Tackle 2 application scenarios
  - Small messages: latency is critical
  - Large bandwidth: throughput is critical
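The interrupt-free design above rests on message descriptor queues that the host polls instead of waiting for interrupts. The following is a minimal single-producer, single-consumer simulation of such a queue, assuming the NIC posts descriptors and the application polls for them. The class and field names are hypothetical and do not come from the presentation or from the Tigon2 firmware. A real implementation would place the ring in pinned shared memory and use memory barriers.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Descriptor:
    """Describes one received message: where it sits in the shared
    buffer area and how long it is (hypothetical layout)."""
    offset: int
    length: int


class DescriptorRing:
    """Fixed-size ring. One slot is always left empty so that
    head == tail means 'empty' and never 'full'."""

    def __init__(self, slots: int) -> None:
        if slots < 2:
            raise ValueError("ring needs at least 2 slots")
        self._slots: List[Optional[Descriptor]] = [None] * slots
        self._head = 0  # next slot the producer (NIC) writes
        self._tail = 0  # next slot the consumer (host) reads

    def post(self, desc: Descriptor) -> bool:
        """Producer side: returns False when the ring is full."""
        next_head = (self._head + 1) % len(self._slots)
        if next_head == self._tail:
            return False
        self._slots[self._head] = desc
        self._head = next_head
        return True

    def poll(self) -> Optional[Descriptor]:
        """Consumer side: returns None when nothing is pending,
        so the caller can keep working and poll again later."""
        if self._tail == self._head:
            return None
        desc = self._slots[self._tail]
        self._tail = (self._tail + 1) % len(self._slots)
        return desc
```

Because each index is written by only one side, producer and consumer never contend for the same variable. That property is what lets this kind of queue replace interrupts on the critical path.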

Conclusion
- Provide a high-performance communication API
- Replace PVM [9] & MPI [8] protocols
- Fine-grained thread communication
- High-bandwidth applications
- Remove the network communication bottleneck in user-level thread messaging
- Interface with the SMASH [10] user-level thread scheduler
- Multi-threaded applications can run seamlessly over a cluster of SMPs
- Achieve higher throughput with minimal usage of host CPU resources

References
1. D. Culler, A. Arpaci-Dusseau, R. Arpaci-Dusseau, B. Chun, S. Lumetta, A. Mainwaring, R. Martin, C. Yoshikawa, and F. Wong. Parallel Computing on the Berkeley NOW. In Ninth Joint Symposium on Parallel Processing.
2. Microsoft, Compaq, Intel. Virtual Interface Architecture Specification, draft revision 1.0 edition, December.
3. T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-Level Network Interface for Parallel and Distributed Computing. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, pages 40–53. ACM Press.
4. Myricom Inc. Myrinet GM – the low-level message-passing system for Myrinet networks.
5. Scott Pakin, Mario Lauria, and Andrew Chien. High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet.
6. Fabrizio Petrini, Wu-chun Feng, Adolfy Hoisie, Salvador Coll, and Eitan Frachtenberg. Quadrics Network (QsNet): High-Performance Clustering Technology. In Hot Interconnects 9, Stanford University, Palo Alto, CA, August.
7. Piyush Shivam, Pete Wyckoff, and Dhabaleswar Panda. EMP: Zero-copy OS-bypass NIC-driven Gigabit Ethernet Message Passing.
8. Message Passing Interface Forum. MPI2: A Message Passing Interface Standard. International Journal of High Performance Computing Applications, 12(1–2):1–299.
9. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, B. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine - A Users Guide and Tutorial for Network Parallel Computing. MIT Press, Cambridge, Mass.
10. Kurt Debattista. High Performance Thread Scheduling on Shared Memory Multiprocessors. Masters thesis, University of Malta, 2001.

Thank you!