Integrating New Capabilities into NetPIPE
Dave Turner, Adam Oline, Xuehua Chen, and Troy Benjegerdes
Scalable Computing Laboratory of Ames Laboratory
This work was funded by the MICS office of the US Department of Energy

The NetPIPE utility
NetPIPE does a series of ping-pong tests between two nodes (a minimal sketch of this pattern appears below). Message sizes are chosen at regular intervals, and with slight perturbations, to fully test the communication system for idiosyncrasies. Latencies reported represent half the ping-pong time for messages smaller than 64 Bytes.
Some typical uses
– Measuring the overhead of message-passing protocols.
– Helping to tune the optimization parameters of message-passing libraries.
– Optimizing driver and OS parameters (socket buffer sizes, etc.).
– Identifying dropouts in networking hardware and drivers.
What is not measured
– NetPIPE cannot yet measure the load on the CPU.
– The effects of the different methods for maintaining message progress.
– Scalability with system size.
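The sketch below shows the ping-pong pattern being timed, written with ordinary MPI point-to-point calls for a single message size. It is an illustration of the measurement, not NetPIPE's actual source; NetPIPE repeats this loop over its full, perturbed range of sizes.

/* Minimal ping-pong timing loop in the spirit of NetPIPE (illustration only).
 * Rank 0 sends 'nbytes' to rank 1, which echoes it back; half the round-trip
 * time is the one-way latency, and nbytes divided by that time is the rate. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, other, nbytes = 1024, reps = 1000;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                     /* assumes exactly 2 ranks */

    char *buf = malloc(nbytes);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, other, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, other, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, nbytes, MPI_CHAR, other, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, other, 0, MPI_COMM_WORLD);
        }
    }

    double half_rtt = (MPI_Wtime() - t0) / (2.0 * reps);
    if (rank == 0)
        printf("%d bytes: %.2f us, %.1f Mbps\n", nbytes,
               half_rtt * 1e6, 8.0 * nbytes / half_rtt / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}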

Recent additions to NetPIPE
– Can do an integrity test instead of measuring performance.
– Streaming mode measures performance in one direction only. The sockets must be reset to avoid effects from a collapsing window size.
– A bi-directional ping-pong mode has been added (-2).
– One-sided Get and Put calls can be measured (MPI or SHMEM), with the choice of whether to use an intervening MPI_Fence call to synchronize.
– Messages can be bounced between the same buffers (the default mode), or each transfer can start from a different area of memory (see the sketch below). There are many cache effects in SMP message-passing, and InfiniBand can show similar effects since memory must be registered with the card.
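To make the last point concrete, the fragment below walks each send through a memory pool much larger than the cache, so every transfer starts from cold memory. The names (send_no_cache, POOL_BYTES) are hypothetical and this is only a sketch of the idea, not NetPIPE's implementation; it would be called from a timing loop like the earlier ping-pong sketch.

/* Sketch of a "no-cache" send loop: each iteration starts from a different
 * offset in a pool larger than the last-level cache, so the buffer is cold
 * for every transfer.  Hypothetical helper, not NetPIPE source. */
#include <mpi.h>
#include <stdlib.h>

#define POOL_BYTES (64 * 1024 * 1024)    /* assumed larger than L2/L3 cache */

static void send_no_cache(int dest, int nbytes, int reps)
{
    char *pool = malloc(POOL_BYTES);
    size_t offset = 0;

    for (int i = 0; i < reps; i++) {
        if (offset + (size_t)nbytes > POOL_BYTES)
            offset = 0;                  /* wrap around the pool */
        MPI_Send(pool + offset, nbytes, MPI_CHAR, dest, 0, MPI_COMM_WORLD);
        offset += nbytes;                /* next iteration uses cold memory */
    }
    free(pool);
}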

Current projects
– Overlapping pair-wise ping-pong tests (see the sketch below). Synchronization must be considered if bi-directional communications are not used.
– Investigating other methods for testing the global network.
– Evaluating the full range from simultaneous nearest-neighbor communications to all-to-all.
(Diagram: nodes n0-n3 communicating through an Ethernet switch, contrasting line-speed performance with end-point-limited performance.)
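One way to scale the two-node test up to a whole machine is to run many pair-wise ping-pongs at once and sum the per-pair rates. The program below pairs rank 2i with rank 2i+1 and is only an illustration of that approach under the stated assumptions (even rank count, one fixed message size), not the project's code.

/* Sketch: simultaneous nearest-neighbor ping-pongs to stress a switch.
 * Even ranks pair with the next odd rank; every pair runs the same timed
 * exchange at once, then rank 0 reports the aggregate rate. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, nbytes = 1 << 20, reps = 100;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);         /* assumes an even count */
    int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;

    char *buf = malloc(nbytes);
    MPI_Barrier(MPI_COMM_WORLD);                  /* start all pairs together */
    double t0 = MPI_Wtime();

    for (int i = 0; i < reps; i++) {
        if (rank % 2 == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, partner, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(buf, nbytes, MPI_CHAR, partner, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
        }
    }

    /* Conventional ping-pong rate: bytes moved both ways over total time. */
    double pair_mbps = 8.0 * 2.0 * nbytes * reps / (MPI_Wtime() - t0) / 1e6;
    double sum;
    MPI_Reduce(&pair_mbps, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)        /* each pair reported its rate twice, so halve it */
        printf("aggregate over %d pairs: %.0f Mbps\n", size / 2, sum / 2);

    free(buf);
    MPI_Finalize();
    return 0;
}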

Performance on Mellanox InfiniBand cards
– A new NetPIPE module allows us to measure the raw performance across InfiniBand hardware (RDMA and Send/Recv).
– Burst mode preposts all receives to duplicate the Mellanox test.
– The no-cache performance is much lower when the memory has to be registered with the card (see the sketch below).
– An MP_Lite InfiniBand module will be incorporated into LAM/MPI.
(The accompanying graph includes results for MVAPICH 0.9.1.)
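The registration cost behind the no-cache numbers can be seen in a few lines of verbs code. The sketch below uses the modern libibverbs API as an analogue; the slides used Mellanox's VAPI, whose calls differ in name but not in concept, and only the registration step is shown, not a full transfer.

/* Why "no-cache" InfiniBand performance drops: each new buffer must be
 * registered (pinned and mapped) with the HCA before RDMA can use it. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int n;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs || n == 0) { fprintf(stderr, "no IB devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    size_t len = 1 << 20;
    void *buf = malloc(len);

    /* Registration is the expensive step the no-cache mode exposes:
     * it pins the pages and installs translations in the adapter. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return 1; }

    /* ... post sends/receives or RDMA writes using mr->lkey / mr->rkey ... */

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}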

10 Gigabit Ethernet
Intel 10 Gigabit Ethernet cards on a 133 MHz PCI-X bus with single-mode fiber, using the Intel ixgb driver.
– Can only achieve 2 Gbps now, with a latency of 75 us.
– Streaming mode delivers up to 3 Gbps.
– Much more development work is needed.

Channel-bonding Gigabit Ethernet for better communications between nodes
Channel-bonding uses 2 or more Gigabit Ethernet cards per PC to increase the communication rate between nodes in a cluster. GigE cards cost ~$40 each and 24-port switches cost ~$1400, so an extra card and switch port add roughly $100 per computer. This is much more cost effective for PC clusters than using more expensive networking hardware, and may deliver similar performance.

Performance for channel-bonded Gigabit Ethernet
Channel-bonding multiple GigE cards using MP_Lite and Linux kernel bonding.
– GigE can deliver 900 Mbps, with latencies of us, for PCs with 64-bit / 66 MHz PCI slots.
– Channel-bonding 2 GigE cards per PC using MP_Lite doubles the performance for large messages. Adding a 3rd card does not help much.
– Channel-bonding 2 GigE cards per PC using Linux kernel-level bonding actually results in poorer performance.
– The same tricks that make channel-bonding successful in MP_Lite should make Linux kernel bonding work even better. Any message-passing system could then make use of channel-bonding on Linux systems.

Channel-bonding in MP_Lite
(Diagram: the application on node 0 hands message halves a and b to MP_Lite in user space; each half passes through its own large socket buffer, TCP/IP stack, dev_q_xmit call, device queue, and DMA to its own GigE card.)
Flow control may stop a given stream at several places. With MP_Lite channel-bonding, each stream is independent of the others, as sketched below.
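A rough sketch of that idea is to stripe one message across two already-connected TCP sockets, each routed over its own GigE card. The helper below (send_striped, a hypothetical name) is an illustration under those assumptions, not MP_Lite source; MP_Lite itself also uses SIGIO-driven progress.

/* User-level channel bonding: stripe one message across two TCP sockets.
 * 'sock' holds two connected descriptors, one per GigE card. */
#include <sys/types.h>
#include <sys/socket.h>
#include <errno.h>
#include <stddef.h>

static int send_striped(int sock[2], const char *buf, size_t nbytes)
{
    size_t half = nbytes / 2;
    const char *part[2] = { buf,  buf + half };
    size_t      left[2] = { half, nbytes - half };

    /* Round-robin between the two streams; because each socket is written
     * non-blocking, flow control that stalls one stream (a full device
     * queue, a slow receiver) never blocks the other card's stream. */
    while (left[0] > 0 || left[1] > 0) {
        for (int i = 0; i < 2; i++) {
            if (left[i] == 0)
                continue;
            ssize_t n = send(sock[i], part[i],
                             left[i] < 65536 ? left[i] : 65536,
                             MSG_DONTWAIT);
            if (n > 0) {
                part[i] += n;
                left[i] -= n;
            } else if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK) {
                return -1;               /* real socket error */
            }                            /* EAGAIN: try this stream later */
        }
    }
    return 0;
}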

Linux kernel channel-bonding
(Diagram: the application on node 0 writes into a single large socket buffer in kernel space; one TCP/IP stack and dev_q_xmit call feed bonding.c, which splits the traffic between the two device queues, DMA engines, and GigE cards.)
A full device queue will stop the flow at bonding.c to both device queues. Flow control on the destination node may stop the flow out of the socket buffer. In both of these cases, problems with one stream can affect both streams.

Comparison of high-speed interconnects
– InfiniBand can deliver Mbps at a 7.5 us latency.
– Atoll delivers 1890 Mbps with a 4.7 us latency.
– SCI delivers 1840 Mbps with only a 4.2 us latency.
– Myrinet performance reaches 1820 Mbps with an 8 us latency.
– Channel-bonded GigE offers 1800 Mbps for very large messages.
– Gigabit Ethernet delivers 900 Mbps with a us latency.
– 10 GigE only delivers 2 Gbps with a 75 us latency.

Conclusions
NetPIPE provides a consistent set of analytical tools in the same flexible framework to many message-passing and native communication layers.
New modules have been developed.
– 1-sided MPI and SHMEM
– GM, InfiniBand using the Mellanox VAPI, ARMCI, LAPI
– Internal tests like memcpy
New modes have been incorporated into NetPIPE.
– Streaming and bi-directional modes.
– Testing without cache effects.
– The ability to test integrity instead of performance.

Current projects
Developing new modules.
– ATOLL
– IBM Blue Gene/L
– I/O performance
Need to be able to measure the CPU load during communications.
Expanding NetPIPE to do multiple pair-wise communications.
– Can measure the backplane performance on switches.
– Compare the line speed to end-point-limited performance.
Working toward measuring more of the global properties of a network.
– The network topology will need to be considered.

Contact information Dave Turner -

One-sided Puts between two Linux PCs
– MP_Lite is SIGIO based, so MPI_Put() and MPI_Get() finish without a fence.
– LAM/MPI has no message progress, so a fence is required (see the sketch below).
– ARMCI uses a polling method, and therefore does not require a fence.
– An MPI-2 implementation of MPICH is under development.
– An MPI-2 implementation of MPI/Pro is under development.
Test hardware: Netgear GA620 fiber GigE cards in 32/64-bit, 33/66 MHz PCI slots, using the AceNIC driver.
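For reference, the fence-synchronized pattern being timed looks like this in standard MPI-2. This is a generic sketch, not MP_Lite's or LAM/MPI's internals; the no-fence variant simply omits the surrounding fence calls when the implementation can make progress without them.

/* Generic MPI-2 one-sided Put with fence synchronization (illustration).
 * Rank 0 puts 'nbytes' into a window exposed by rank 1; the two fences mark
 * the access/exposure epoch.  Implementations without asynchronous progress
 * need the closing fence before the data is guaranteed to have landed.
 * Run with at least 2 ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nbytes = 1024;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *winbuf = malloc(nbytes);      /* target memory on every rank */
    char *src    = malloc(nbytes);
    MPI_Win win;
    MPI_Win_create(winbuf, nbytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);              /* open the epoch */
    if (rank == 0)
        MPI_Put(src, nbytes, MPI_CHAR, 1 /* target rank */, 0,
                nbytes, MPI_CHAR, win);
    MPI_Win_fence(0, win);              /* close it: data is now visible */

    if (rank == 1)
        printf("received %d bytes via MPI_Put\n", nbytes);

    MPI_Win_free(&win);
    free(winbuf);
    free(src);
    MPI_Finalize();
    return 0;
}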

The MP_Lite message-passing library
– A light-weight MPI implementation.
– Highly efficient for the architectures supported.
– Designed to be very user-friendly.
– Ideal for performing message-passing research.

A NetPIPE example: Performance on a Cray T3E
– Raw SHMEM delivers 2600 Mbps with a 2-3 us latency (see the sketch below).
– Cray MPI originally delivered 1300 Mbps with a 20 us latency.
– MP_Lite delivers 2600 Mbps with a 9-10 us latency.
– The new Cray MPI delivers 2400 Mbps with a 20 us latency.
The tops of the spikes are where the message size is divisible by 8 Bytes.
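"Raw SHMEM" here means direct shmem put calls into symmetric memory, along the lines of the generic sketch below. It is shown only to make the comparison concrete and is not the T3E test code; buffer names and sizes are placeholders.

/* Minimal SHMEM put: PE 0 writes directly into PE 1's symmetric buffer.
 * The remote write completes without any matching receive call. */
#include <mpp/shmem.h>   /* <shmem.h> on non-Cray / OpenSHMEM systems */
#include <stdio.h>

#define NBYTES 1024

static char target[NBYTES];   /* symmetric: same address on every PE */
static char source[NBYTES];

int main(void)
{
    start_pes(0);                         /* initialize SHMEM */
    int me = _my_pe();

    if (me == 0) {
        shmem_putmem(target, source, NBYTES, 1);   /* write into PE 1 */
        shmem_quiet();                    /* wait for remote completion */
    }
    shmem_barrier_all();                  /* make sure PE 1 can read it */

    if (me == 1)
        printf("PE 1 received %d bytes via shmem_putmem\n", NBYTES);
    return 0;
}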