Protocol-Dependent Message-Passing Performance on Linux Clusters
Dave Turner – Xuehua Chen – Adam Oline
This work is funded by the DOE MICS office.

Inefficiencies in the communication system
[Layer diagram: applications → MPI → native layer → internal buses → driver & NIC → switch fabric, annotated with the sources of inefficiency at each level: poor MPI usage, poor mapping onto the native layer (50% bandwidth, 2-3x latency), OS bypass, TCP tuning, PCI and memory limits, hardware limits, driver tuning, and topological bottlenecks.]

The NetPIPE utility
NetPIPE runs a series of ping-pong tests between two nodes. Message sizes are chosen at regular intervals, with slight perturbations, to fully test the communication system for idiosyncrasies. Reported latencies are half the ping-pong time for messages smaller than 64 Bytes.
Some typical uses:
- Measuring the overhead of message-passing protocols.
- Helping to tune the optimization parameters of message-passing libraries.
- Identifying dropouts in networking hardware.
- Optimizing driver and OS parameters (socket buffer sizes, etc.).
What is not measured:
- CPU load (NetPIPE can measure it using getrusage, but this was not done here).
- The effects of the different methods for maintaining message progress.
- Scalability with system size.
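To make the methodology concrete, the sketch below shows the core ping-pong timing loop that NetPIPE performs for each message size, written against the standard MPI C interface. It is a minimal illustration, not NetPIPE itself: the real utility adds size perturbations, repetition control, cache options, and modules for many interfaces besides MPI, and names such as reps and nbytes are illustrative.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Minimal ping-pong sketch: rank 0 sends a message to rank 1, which echoes
     * it back.  Half the round-trip time approximates the one-way latency for
     * small messages, and bytes / one-way time approximates the bandwidth. */
    int main(int argc, char **argv)
    {
        int rank, i, reps = 1000, nbytes = 8;   /* one data point; illustrative */
        char *buf;
        double t0, t1;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        buf = malloc(nbytes);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, nbytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, nbytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0) {
            double one_way = (t1 - t0) / (2.0 * reps);      /* seconds */
            double mbps    = 8.0 * nbytes / one_way / 1.0e6;
            printf("%d bytes: latency %.2f us, bandwidth %.1f Mbps\n",
                   nbytes, one_way * 1.0e6, mbps);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }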

A NetPIPE example: performance on a Cray T3E
- Raw SHMEM delivers 2600 Mbps with 2-3 us latency.
- Cray MPI originally delivered 1300 Mbps with 20 us latency.
- MP_Lite delivers 2600 Mbps with 9-10 us latency.
- The new Cray MPI delivers 2400 Mbps with 20 us latency.
The tops of the spikes are where the message size is divisible by 8 Bytes.

The network hardware and computer test-beds
Linux PC test-bed:
- Two 1.8 GHz P4 computers
- 768 MB PC133 memory
- 32-bit 33 MHz PCI bus
- RedHat 7.2 Linux
Alpha Linux test-bed:
- Two 500 MHz dual-processor Compaq DS20s
- 1.5 GB memory
- 32/64-bit 33 MHz PCI bus
- RedHat 7.1 Linux
PC SMP test-bed:
- 1.7 GHz dual-processor Xeon
- 1.0 GB memory
- RedHat 7.3 Linux smp
All measurements were done back-to-back except for the Giganet hardware, which went through an 8-port switch.

MPICH
MPICH release.
- Uses the p4 device for TCP.
- P4_SOCKBUFSIZE must be increased to ~256 kBytes.
- The rendezvous threshold can be changed in the source code.
- MPICH-2.0 will be out soon!
Developed by Argonne National Laboratory and Mississippi State University.

LAM/MPI
LAM release from the RedHat 7.2 distribution.
- Must lamboot the daemons.
- -lamd directs messages through the daemons.
- -O avoids data conversion for homogeneous systems.
- No socket buffer size tuning.
- No threshold adjustments.
Currently developed at Indiana University.

MPI/Pro
MPI/Pro release.
- Easy-to-install RPM.
- Requires rsh, not ssh.
- Setting -tcp_long to at least 128 kBytes gets rid of most of the dip at the rendezvous threshold.
- Other parameters didn't help.
Thanks to MPI Software Technology for supplying the MPI/Pro software for testing.

The MP_Lite message-passing library
- A light-weight MPI implementation.
- Highly efficient for the architectures supported.
- Designed to be very user-friendly.
- Ideal for performing message-passing research.

PVM
PVM release from the RedHat 7.2 distribution.
- Uses XDR encoding and the pvmd daemons by default.
- pvm_setopt(PvmRoute, PvmRouteDirect) bypasses the pvmd daemons.
- pvm_initsend(PvmDataInPlace) avoids XDR encoding for homogeneous systems.
Developed at Oak Ridge National Laboratory.
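For illustration, a minimal PVM sender that applies both of the settings above might look like the following sketch. The destination task id, the message tag, and the payload are placeholders, and error checking is omitted.

    #include <pvm3.h>

    /* Sketch of a PVM send that routes directly between tasks and leaves the
     * data in place (no XDR encoding), as described above.  dest_tid, the
     * message tag (99), and the payload are placeholders. */
    void send_block(int dest_tid, double *data, int n)
    {
        /* Bypass the pvmd daemons for subsequent messages from this task. */
        pvm_setopt(PvmRoute, PvmRouteDirect);

        /* PvmDataInPlace skips XDR encoding and leaves the data in user
         * memory until pvm_send(); both ends must be homogeneous. */
        pvm_initsend(PvmDataInPlace);
        pvm_pkdouble(data, n, 1);     /* pack n doubles with stride 1 */
        pvm_send(dest_tid, 99);       /* 99 is an arbitrary message tag */
    }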

Performance on Netgear GA620 fiber Gigabit Ethernet cards between two PCs
- All libraries do reasonably well on this mature card and driver.
- MPICH and PVM suffer from an extra memory copy.
- LAM/MPI, MPI/Pro, and MPICH have dips at the rendezvous threshold due to the large 180 us latency.
- Tunable thresholds would easily eliminate this minor drop in performance.
Netgear GA620 fiber GigE, 32/64-bit 33/66 MHz, AceNIC driver.
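The dips at the rendezvous threshold come from the switch between the eager and rendezvous protocols: above the threshold the sender performs an extra request/clear-to-send handshake, which costs one additional round trip and therefore hurts most when the latency is large. The sketch below is a simplified illustration of that decision; the threshold value and the helper functions (send_eager, send_rts, wait_for_cts, send_data) are hypothetical stand-ins, not any particular library's internals.

    /* Hypothetical transport hooks -- stand-ins for a library's internal TCP
     * routines, stubbed out only so this sketch is self-contained. */
    static void send_eager(int dest, const void *buf, long n) { (void)dest; (void)buf; (void)n; }
    static void send_rts(int dest, long n)                    { (void)dest; (void)n; }
    static void wait_for_cts(int dest)                        { (void)dest; }
    static void send_data(int dest, const void *buf, long n)  { (void)dest; (void)buf; (void)n; }

    #define RENDEZVOUS_THRESHOLD (128 * 1024)   /* illustrative value */

    /* Below the threshold the message is pushed immediately and buffered on
     * the receive side (eager).  Above it, a request-to-send/clear-to-send
     * handshake lets the data land directly in the user buffer, but adds one
     * extra round trip -- the source of the dip when the latency is high. */
    static void protocol_send(const void *buf, long nbytes, int dest)
    {
        if (nbytes < RENDEZVOUS_THRESHOLD) {
            send_eager(dest, buf, nbytes);
        } else {
            send_rts(dest, nbytes);
            wait_for_cts(dest);
            send_data(dest, buf, nbytes);
        }
    }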

Performance on TrendNet and Netgear GA622T Gigabit Ethernet cards between two Linux PCs
- Both cards are very sensitive to the socket buffer sizes.
- MPICH and MP_Lite do well because they adjust the socket buffer sizes.
- Increasing the default socket buffer size in the other libraries, or making it an adjustable parameter, would fix this problem.
- More tuning of the ns83820 driver would also help.
TrendNet TEG-PCITX copper GigE, 32-bit 33/66 MHz, ns83820 driver.
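For libraries that do expose the socket buffer sizes, the fix amounts to a pair of setsockopt() calls on the TCP socket. The sketch below shows the standard calls; the 256 kByte value echoes the P4_SOCKBUFSIZE setting mentioned for MPICH earlier and is illustrative rather than a recommendation.

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <stdio.h>

    /* Enlarge the kernel send/receive buffers on a TCP socket.  Larger socket
     * buffers let the sender stream more data before blocking, which is what
     * MPICH (via P4_SOCKBUFSIZE) and MP_Lite do internally.  The 256 kByte
     * value is illustrative. */
    int tune_socket_buffers(int sock)
    {
        int size = 256 * 1024;
        int one  = 1;

        if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) < 0)
            perror("setsockopt SO_SNDBUF");
        if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) < 0)
            perror("setsockopt SO_RCVBUF");

        /* Disabling Nagle's algorithm is a separate, common latency tweak. */
        if (setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) < 0)
            perror("setsockopt TCP_NODELAY");

        return 0;
    }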

Performance on SysKonnect Gigabit Ethernet cards between Compaq DS20s running Linux
- The SysKonnect cards using a 9000 Byte MTU provide a more challenging environment.
- MP_Lite delivers nearly all of the 900 Mbps performance.
- LAM/MPI again suffers due to the smaller socket buffer sizes.
- MPICH suffers from the extra memory copy.
- PVM suffers from both.
SysKonnect SK-9843-SX fiber GigE, 32/64-bit 33/66 MHz, sk98lin driver.

Performance on Myrinet cards between two Linux PCs
- MPICH-GM and MPI/Pro-GM both pass almost all of the performance of GM through to the application.
- SCore claims to provide better performance, but is not quite ready for prime time yet.
- IP-GM provides little benefit over TCP on Gigabit Ethernet, and at a much greater cost.
Myrinet PCI64A-2 SAN card, 66 MHz RISC with 2 MB memory.

Performance on VIA Giganet hardware and on SysKonnect GigE cards using M-VIA between two Linux PCs
- MPI/Pro, MVICH, and MP_Lite all provide 800 Mbps bandwidth on the Giganet hardware, but MPI/Pro has a longer latency of 42 us compared with 10 us for the others.
- The M-VIA 1.2b2 performance is roughly at the same level that raw TCP provides.
- The M-VIA 1.2b3 release has not been tested, nor has using jumbo frames.
Giganet CL1000 cards through an 8-port CL5000 switch.

SMP message-passing performance on a dual-processor Compaq DS20 running Alpha Linux With the data starting in main memory.

SMP message-passing performance on a dual-processor Compaq DS20 running Alpha Linux With the data starting in cache.

SMP message-passing performance on a dual-processor Xeon running Linux With the data starting in main memory.

SMP message-passing performance on a dual-processor Xeon running Linux With the data starting in cache.

One-sided puts between two Linux PCs
- MP_Lite is SIGIO based, so MPI_Put() and MPI_Get() finish without a fence.
- LAM/MPI has no message progress, so a fence is required.
- ARMCI uses a polling method, and therefore does not require a fence.
- MPI-2 implementations of MPICH and MPI/Pro are under development.
Netgear GA620 fiber GigE, 32/64-bit 33/66 MHz, AceNIC driver.
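For reference, the MPI-2 pattern being measured here is roughly the sketch below: an MPI_Put() into a window exposed by the remote rank, bracketed by MPI_Win_fence() calls. Whether the data moves before the closing fence depends on the progress method noted above (SIGIO-driven, polling, or none). The window size and target rank are illustrative.

    #include <mpi.h>
    #include <stdlib.h>

    /* Sketch of an MPI-2 one-sided put: rank 0 writes n doubles into a window
     * exposed by rank 1.  The fences open and close the access epoch; with a
     * SIGIO- or polling-based implementation the transfer can complete before
     * the closing fence, otherwise the fence is where the data actually moves. */
    void one_sided_put(int rank, int n)
    {
        double *winbuf, *src;
        MPI_Win win;

        MPI_Alloc_mem(n * sizeof(double), MPI_INFO_NULL, &winbuf);
        src = malloc(n * sizeof(double));

        /* Every rank exposes its buffer as a window. */
        MPI_Win_create(winbuf, n * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                 /* open the access epoch */
        if (rank == 0)
            MPI_Put(src, n, MPI_DOUBLE, 1, 0, n, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);                 /* data guaranteed visible here */

        MPI_Win_free(&win);
        MPI_Free_mem(winbuf);
        free(src);
    }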

Conclusions
- Most message-passing libraries do reasonably well if properly tuned.
- All need to make the socket buffer sizes and thresholds user-tunable.
- Optimizing the network drivers would also correct some of the problems.
- There is still much room for improvement for SMP and one-sided communications.
Future work
- All network cards should be tested on a 64-bit 66 MHz PCI bus to put more strain on the message-passing libraries.
- Testing within real applications is vital to verify the NetPIPE results, test the scalability of the implementation methods, investigate CPU loading, and study the effects of the various approaches to maintaining message progress.
- SCore should be compared to GM.
- VIA and InfiniBand modules are needed for NetPIPE.

Protocol-Dependent Message-Passing Performance on Linux Clusters
Dave Turner – Xuehua Chen – Adam Oline