Low Latency Messaging Over Gigabit Ethernet
Keith Fenech
CSAW, 24 September 2004

Why Cluster Computing?
- Ideal for computationally intensive applications.
- Multi-threaded processes allow jobs to be processed in parallel over multiple CPUs.
- High bandwidth allows interconnected nodes to achieve supercomputer performance.
- Networks of Workstations (NOWs) [1]
  - Easily available (commodity platforms)
  - Relatively cheap
  - Nodes may be used independently or as a cluster
  - Better utilization of idle computing resources

High Performance Networking
- Commodity networks are dominated by IP over Ethernet.
- Performance is directly affected by:
  - Hardware – bus & network bandwidths
  - Latency – delay incurred in communicating a message from source to destination
  - Overhead – length of time that a processor is engaged in the transmission or reception (tx/rx) of each message
- Fine-grain threads communicate frequently using small messages.
- Features of a high-performance (HP) communication architecture:
  - Transparency to the application layer
  - High throughput for bandwidth-intensive applications
  - Low latencies for frequently communicating threads
  - Minimal protocol processing overhead on the host machine
- Gigabit performance is not achievable at the application layer. Why?
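
The interplay of latency, overhead and bandwidth listed on this slide can be illustrated with a simple cost model. The sketch below assumes that sending a message costs a fixed latency, plus a fixed host overhead, plus a serialization time proportional to its size. The model, the function names and the 50 µs and 10 µs figures are illustrative assumptions, not measurements from the presentation.

```python
# Simplified cost model (an assumption, not the author's model):
#   time = latency + host overhead + size / bandwidth

GIGABIT_BPS = 1_000_000_000  # nominal Gigabit Ethernet link rate


def transfer_time(size_bytes, latency_s, overhead_s, bandwidth_bps=GIGABIT_BPS):
    """Seconds needed to deliver one message under the simplified model."""
    return latency_s + overhead_s + (size_bytes * 8) / bandwidth_bps


def effective_throughput(size_bytes, latency_s, overhead_s, bandwidth_bps=GIGABIT_BPS):
    """Bits per second actually achieved for a single message of this size."""
    seconds = transfer_time(size_bytes, latency_s, overhead_s, bandwidth_bps)
    return (size_bytes * 8) / seconds


if __name__ == "__main__":
    # Hypothetical fixed costs: 50 microseconds latency, 10 microseconds overhead.
    latency, overhead = 50e-6, 10e-6
    for size in (64, 1_500, 1_000_000):
        mbps = effective_throughput(size, latency, overhead) / 1e6
        print(f"{size:>9} bytes: {mbps:8.2f} Mbit/s")
```

Under these assumed figures a 64-byte message is dominated by the fixed costs and uses only a small fraction of the link, while a 1 MB message approaches the link rate. This is why small-message workloads care about latency and overhead rather than raw bandwidth.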

Conventional NICs & Protocols
Receiver node:
- Ethernet controller receives frame
- Check CRC for frame
- Filter MAC destination address
- NIC generates HW interrupt to notify host
- PCI transfer to host memory
- CPU suspends current task & launches interrupt handler to service high priority interrupt
- Check network layer (IP) header & verify checksum
- Parse routing tables & store valid IP datagrams in IP buffer
- Reassemble fragmented datagrams in host memory
- Call transport layer (TCP/UDP) functions
- Deliver packet to application layer
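
Two of the receive-path checks above, MAC destination filtering and IP header checksum verification, can be sketched in a few lines. This is a minimal illustration and not the code of any real NIC or kernel: the function names and the sample header bytes are chosen for the example, and the MAC filter handles only unicast and broadcast addresses (multicast is left out).

```python
import struct

BROADCAST_MAC = b"\xff" * 6


def mac_accepts(frame: bytes, own_mac: bytes) -> bool:
    """Accept a frame addressed to this station or to the broadcast address."""
    dest = frame[:6]  # the destination MAC is the first 6 bytes of the frame
    return dest == own_mac or dest == BROADCAST_MAC


def internet_checksum(data: bytes) -> int:
    """Complemented one's-complement sum of 16-bit words (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def ipv4_header_is_valid(header: bytes) -> bool:
    """Check the version, the header length and the header checksum."""
    if len(header) < 20 or (header[0] >> 4) != 4:
        return False
    ihl_bytes = (header[0] & 0x0F) * 4
    if ihl_bytes < 20 or len(header) < ihl_bytes:
        return False
    # Summing a header that includes a correct checksum field yields zero.
    return internet_checksum(header[:ihl_bytes]) == 0


# Sample 20-byte IPv4 header (UDP, 192.168.0.1 -> 192.168.0.199), used
# here purely as test data.
SAMPLE_HEADER = bytes.fromhex("45000073000040004011b861c0a80001c0a800c7")

if __name__ == "__main__":
    print("header valid:", ipv4_header_is_valid(SAMPLE_HEADER))
```

In a conventional stack every one of these checks runs on the host CPU for every received frame, which is part of the per-packet cost the later slides set out to remove.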

Problems With Conventional Protocols & Architectures
- NIC generates a CPU interrupt for each frame
- Servicing interrupts involves an expensive vertical switch to kernel space
- Software interrupts are needed to pass IP datagrams to upper layers
- Servicing incoming packets results in high host CPU load
- Risk of receiver livelock scenarios (as in Denial of Service attacks)
- PCI bus startup overheads for each message
- Layered protocols imply expensive memory-to-memory buffer copies
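
To see why one interrupt per frame is costly, it helps to work out how many frames per second a saturated Gigabit Ethernet link can deliver. The arithmetic below uses the standard Ethernet framing overheads (8 bytes of preamble and start delimiter, 14 bytes of header, 4 bytes of frame check sequence, 12 bytes of inter-frame gap). The function name and the 9000-byte jumbo payload are assumptions made for the example.

```python
LINK_BPS = 1_000_000_000  # Gigabit Ethernet

# Preamble + SFD (8) + MAC header (14) + FCS (4) + inter-frame gap (12).
PER_FRAME_OVERHEAD_BYTES = 8 + 14 + 4 + 12
MIN_PAYLOAD_BYTES = 46  # shorter payloads are padded to this size


def frames_per_second(payload_bytes, link_bps=LINK_BPS):
    """Maximum frame rate on a saturated link for a given payload size."""
    payload_bytes = max(payload_bytes, MIN_PAYLOAD_BYTES)
    wire_bits = (payload_bytes + PER_FRAME_OVERHEAD_BYTES) * 8
    return link_bps / wire_bits


if __name__ == "__main__":
    # 9000 bytes is a commonly used jumbo payload size, assumed here.
    for payload in (46, 1500, 9000):
        rate = frames_per_second(payload)
        print(f"{payload:>5}-byte payload: {rate:12.0f} frames/s")
```

With one interrupt per frame, the frame rate is also the interrupt rate, so a stream of minimum-size frames can demand well over a million interrupts per second from the host.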

Available Techniques
- Bypass kernel for critical data paths
- Buffer & protocol processing moved to user-space
- User-level hardware access
- Zero-copy techniques
- Scatter/Gather techniques
- Larger MTUs (Jumbo frames)
- Larger DMA transfers avoid PCI startup overheads
- Interrupt coalescing
- Message descriptors & polling replace interrupts
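
Interrupt coalescing, one of the techniques listed above, can be illustrated with a small simulation. The sketch assumes a simple policy: the NIC raises an interrupt once a set number of frames is pending, or once a timer started at the first pending frame expires. The policy, the function name and the parameter values are assumptions for illustration and do not describe any particular NIC.

```python
def coalesced_interrupts(arrivals_us, max_frames, max_delay_us):
    """Count interrupts raised for frames arriving at the given times.

    arrivals_us must be sorted. An interrupt fires when max_frames frames
    are pending, or when max_delay_us has passed since the first pending
    frame arrived.
    """
    interrupts = 0
    pending = 0
    first_pending_us = None
    for t in arrivals_us:
        # The timer for the current batch expired before this arrival.
        if pending and t - first_pending_us >= max_delay_us:
            interrupts += 1
            pending = 0
        if pending == 0:
            first_pending_us = t  # this frame starts a new batch
        pending += 1
        if pending >= max_frames:  # batch is full, interrupt immediately
            interrupts += 1
            pending = 0
    if pending:
        interrupts += 1  # the timer eventually fires for the leftovers
    return interrupts


if __name__ == "__main__":
    # Hypothetical traffic: one frame every 12 microseconds for 1 ms.
    arrivals = list(range(0, 1000, 12))
    print("frames:", len(arrivals))
    print("per-frame interrupts:", coalesced_interrupts(arrivals, 1, 100))
    print("coalesced interrupts:", coalesced_interrupts(arrivals, 8, 100))
```

The trade-off is that a frame may wait up to the timer value before the host is notified, so coalescing lowers CPU load at the cost of added latency for sparse traffic.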

Current Solutions
- Enabled by programmable NICs
- Virtual Interface Architecture (VIA) [2]
- U-Net [3] (ATM)
- Myrinet GM [4] and Illinois FM [5] (Myrinet)
- QsNet [6] (Quadrics)
- EMP [7] (Ethernet)

Our Proposal
- NOWs running over Gigabit Ethernet
- Use Tigon2 programmable NIC features (onboard CPU, memory, DMA)
- Design a reliable lightweight communication protocol for Gigabit Ethernet (GE)
  - Reliable network (ordered & lossless packet delivery)
  - Low-overhead
  - Low-latency
- Offload protocol processing from host CPU onto NIC CPU
- Interrupt-free architecture (message descriptor queues + polling)
- OS bypass: user applications & NIC hardware communicate through pinned-down shared memory
- Zero copy
- Dynamic MTUs & DMA sizes – reduce PCI startup overheads
- Tackle 2 application scenarios
  - Small messages – latency is critical
  - Large bandwidth – throughput is critical
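
The interrupt-free design described above (message descriptor queues, polling, shared memory, zero copy) can be sketched as a ring of descriptors that point into a shared buffer. This is a single-process simulation written for illustration only. The class and function names are hypothetical, a bytearray stands in for the pinned shared memory region, and nothing here reflects the actual Tigon2 firmware or the author's protocol.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Descriptor:
    offset: int  # where the message starts in the shared buffer
    length: int  # message length in bytes


class DescriptorRing:
    """Single-producer, single-consumer ring; one slot is kept empty."""

    def __init__(self, slots: int):
        self._slots: List[Optional[Descriptor]] = [None] * slots
        self._head = 0  # next slot the consumer (host) reads
        self._tail = 0  # next slot the producer (NIC) writes

    def post(self, desc: Descriptor) -> bool:
        """Producer side: publish a descriptor; False if the ring is full."""
        nxt = (self._tail + 1) % len(self._slots)
        if nxt == self._head:
            return False
        self._slots[self._tail] = desc
        self._tail = nxt
        return True

    def poll(self) -> Optional[Descriptor]:
        """Consumer side: return the next descriptor, or None if empty."""
        if self._head == self._tail:
            return None
        desc = self._slots[self._head]
        self._head = (self._head + 1) % len(self._slots)
        return desc


def view_message(shared: bytearray, desc: Descriptor) -> memoryview:
    """Return a view onto the message bytes without copying them."""
    return memoryview(shared)[desc.offset:desc.offset + desc.length]


if __name__ == "__main__":
    shared = bytearray(64)      # stand-in for pinned shared memory
    ring = DescriptorRing(4)
    shared[0:5] = b"hello"      # the "NIC" places a message in the buffer
    ring.post(Descriptor(offset=0, length=5))
    desc = ring.poll()          # the "host" polls instead of taking an interrupt
    print(bytes(view_message(shared, desc)))
```

The consumer learns about new messages by checking the ring rather than by being interrupted, and it reads the payload in place through a view, which mirrors the zero-copy goal at a conceptual level.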

Conclusion
- Provide a high performance communication API
- Replace PVM [8] & MPI [9] protocols
- Fine-grained thread communication
- High bandwidth applications
- Remove network communication bottleneck in user-level thread messaging.
- Interface with SMASH [10] user-level thread scheduler
- Multi-threaded applications can run seamlessly over a cluster of SMPs.
- Achieve higher throughput with minimal usage of host CPU resources.

References
1. D. Culler, A. Arpaci-Dusseau, R. Arpaci-Dusseau, B. Chun, S. Lumetta, A. Mainwaring, R. Martin, C. Yoshikawa, and F. Wong. Parallel Computing on the Berkeley NOW. In Ninth Joint Symposium on Parallel Processing.
2. Microsoft, Compaq, Intel. Virtual Interface Architecture Specification, draft revision 1.0 edition, December.
3. T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: a user-level network interface for parallel and distributed computing. In Proceedings of the fifteenth ACM symposium on Operating systems principles, pages 40–53. ACM Press.
4. Myricom Inc. Myrinet GM – the low-level message-passing system for Myrinet networks.
5. Scott Pakin, Mario Lauria, and Andrew Chien. High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet.
6. Fabrizio Petrini, Wu-chun Feng, Adolfy Hoisie, Salvador Coll, and Eitan Frachtenberg. Quadrics Network (QsNet): High-Performance Clustering Technology. In Hot Interconnects 9, Stanford University, Palo Alto, CA, August.
7. Piyush Shivam, Pete Wyckoff, and Dhabaleswar Panda. EMP: Zero-copy OS-bypass NIC-driven Gigabit Ethernet Message Passing.
8. A. Geist, A. Beguelin, J. Dongarra, W. Jiang, B. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine - A Users Guide and Tutorial for Network Parallel Computing. MIT Press, Cambridge, Mass.
9. Message Passing Interface Forum. MPI2: A Message Passing Interface standard. International Journal of High Performance Computing Applications, 12(1–2):1–299.
10. Kurt Debattista. High Performance Thread Scheduling on Shared Memory Multiprocessors. Master's thesis, University of Malta, 2001.