

1 AVERAGES, DISTRIBUTIONS AND SCALABILITY OF MPI COMMUNICATION TIMES FOR ETHERNET AND MYRINET NETWORKS. Nor Asilah Wati Abdul Hamid, Paul Coddington. School of Computer Science, University of Adelaide. PDCN, February 2007.

2 Motivation
- Most large parallel computers are clusters with Myrinet or Ethernet communication networks, using MPI.
- Previous studies of MPI performance focus on average times and do not consider the distributions of communication times, which can have long tails due to contention effects. In Ethernet with TCP, retransmit timeouts (RTOs) can also occur.
- These effects may have a significant impact on performance, particularly for applications requiring frequent synchronization.
- Most previous comparisons of Myrinet and Ethernet performance use relatively small numbers of processors.

3 Aims
- Compare times for MPI routines on Ethernet with TCP and Myrinet with GM on the same cluster.
- Analyze the distributions of MPI communication times using the MPIBench benchmark program.
- Investigate the scalability of the performance, and the effect on the time distributions, as the number of communicating processes is increased to hundreds of CPUs.
- Quantitatively analyze the effect of RTOs for Ethernet with TCP.

4 Machine Used
- IBM eServer 1350 Linux cluster with 128 compute nodes: IBM xSeries 335 servers with dual 2.4 GHz Intel Xeon processors and 2 GBytes of RAM, for a total of 256 CPUs.
- Myrinet 2000 network: 8 nodes connected to each switch, using a fat tree topology.
- 100 Mbit/s Fast Ethernet network: nodes connected in groups of 38 and 14 (nodes 0-38, 39-76, 77-114 and 115-128) to switches, each with a Gigabit Ethernet (full duplex) uplink to a Cisco Gigabit switch.

5 Fast Ethernet Architecture
[Diagram: IBM eServer 1350 Linux cluster; groups of nodes (0-38, 39-76, 77-114, 115-126) each on a 100 Mbit/s Fast Ethernet switch, with 1 Gbit/s uplinks to a central Cisco Gigabit switch.]

6 MPI_Send/MPI_Recv
MPIBench measures the effects of contention when all processors take part in point-to-point communication concurrently, not just the time for a ping-pong communication between two processors: processor p communicates with processor (p + n/2) mod n, where n is the total number of processors.
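Below is a minimal C sketch (not MPIBench itself) of this communication pattern: each process pairs with the process half the machine away and exchanges one fixed-size message, assuming an even number of processes. The message size, tag and single-shot timing are illustrative choices.

/* Minimal sketch (not MPIBench itself) of the point-to-point pattern
   described above. With an even number n of processes, process p
   exchanges a message with process (p + n/2) mod n, so all processes
   communicate concurrently. Message size, tag and the single-shot
   timing are illustrative choices. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int p, n;
    MPI_Comm_rank(MPI_COMM_WORLD, &p);
    MPI_Comm_size(MPI_COMM_WORLD, &n);

    const int msg_size = 64 * 1024;              /* 64 KByte messages     */
    char *sendbuf = calloc(msg_size, 1);
    char *recvbuf = calloc(msg_size, 1);
    int partner = (p + n / 2) % n;               /* opposite-half pairing */

    double t0 = MPI_Wtime();
    if (p < n / 2) {                             /* lower half sends first    */
        MPI_Send(sendbuf, msg_size, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
        MPI_Recv(recvbuf, msg_size, MPI_CHAR, partner, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {                                     /* upper half receives first */
        MPI_Recv(recvbuf, msg_size, MPI_CHAR, partner, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(sendbuf, msg_size, MPI_CHAR, partner, 0, MPI_COMM_WORLD);
    }
    double t1 = MPI_Wtime();

    printf("rank %d <-> rank %d: %.3f ms\n", p, partner, (t1 - t0) * 1e3);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}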

7 MPI_Send/MPI_Recv
[Figure: average time for MPI_Send/MPI_Recv for different networks and numbers of CPUs.]

8 MPI_Send/MPI_Recv
- Fast Ethernet times (around 100 µs) are about 10 times higher than Myrinet (around 10 µs).
- For small message sizes this is due to the higher latency of Ethernet and the software overhead of TCP compared to the lightweight GM protocol on Myrinet.
- For larger message sizes it is due to the difference in bandwidth between the networks: 100 Mbit/s for Fast Ethernet versus 1.2 Gbit/s for Myrinet 2000. With a Gigabit Ethernet network, the results for larger messages would be much closer.
- The Myrinet fat tree architecture shows excellent scalability.
- On Ethernet there is a jump between 64 and 128 CPUs (32 to 64 nodes), because communication is no longer between processors connected by a single switch, so the Gigabit connection between switches becomes a bottleneck.

9 The distribution of point-to-point communication times for 64 KByte messages on 128 CPUs (64 nodes).
- Fast Ethernet: retransmit timeouts appear at approximately 225 ms, 425 ms and 625 ms.
- Myrinet: a wide range of completion times, due to contention effects and possibly other effects such as operating system interrupts. The minimum time is a little under 0.5 ms and the average is around 0.6 ms; although most results are between 0.4 and 0.8 ms, there is a long tail to the distribution and some communications take several times longer than the average value.

10 Retransmit Timeout
- TCP Retransmit Timeout (RTO): RTO = SRTT + 4 * RTTVAR, where SRTT is the Smoothed Round-Trip Time and RTTVAR is the Round-Trip Time Variation.
- The Linux TCP implementation sets a minimum of 200 ms for the 4 * RTTVAR term, so RTO is approximately RTT + 200 ms on Linux.
- The observed 225 ms corresponds to an RTO with SRTT = 25 ms plus the 200 ms minimum value for 4 * RTTVAR set by the Linux kernel.
- The 425 ms and 625 ms peaks are caused by communications that suffer 2 or 3 RTOs.
- The default 200 ms timeout value is over 20 times the average communication time.
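As a worked illustration of the formula above, the following small C snippet applies the 200 ms Linux floor to the 4 * RTTVAR term; the function name and the example RTTVAR value are illustrative only, and the real calculation is performed inside the kernel's TCP stack.

/* Illustrative sketch of the RTO formula quoted above, with the Linux
   minimum of 200 ms applied to the 4 * RTTVAR term. Function name and
   example values are illustrative; the real computation is done by the
   kernel's TCP implementation. */
#include <stdio.h>

static double rto_estimate_ms(double srtt_ms, double rttvar_ms)
{
    double var_term = 4.0 * rttvar_ms;
    if (var_term < 200.0)          /* Linux floors 4 * RTTVAR at 200 ms */
        var_term = 200.0;
    return srtt_ms + var_term;     /* RTO = SRTT + max(4*RTTVAR, 200 ms) */
}

int main(void)
{
    /* SRTT of 25 ms (as measured on the cluster) gives RTO ~ 225 ms,
       matching the first spike in the Fast Ethernet distribution;
       2 or 3 consecutive RTOs explain the 425 ms and 625 ms spikes. */
    printf("RTO estimate: %.0f ms\n", rto_estimate_ms(25.0, 1.0));
    return 0;
}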

11 MPI_Bcast
- Small message sizes (< 12 KByte), or fewer than 8 CPUs: binomial tree algorithm.
- Medium message sizes (12 KByte - 512 KByte): scatter followed by allgather.
- Allgather algorithm:
  - Medium message sizes with a power-of-two number of processes: recursive doubling algorithm.
  - Medium message sizes with a non-power-of-two number of processes, and all long message sizes: ring algorithm.
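A sketch of this selection logic is shown below, using the thresholds from the slide (12 KByte, 512 KByte, 8 processes); the function and enum names are illustrative and do not reproduce MPICH's internal code.

/* Sketch of the broadcast-algorithm selection described above.
   Thresholds (12 KByte, 512 KByte, 8 processes) follow the slide;
   the enum and function names are illustrative, not MPICH internals. */
#include <stdbool.h>
#include <stddef.h>

enum bcast_alg {
    BCAST_BINOMIAL_TREE,            /* small messages or few processes        */
    BCAST_SCATTER_RECDBL_ALLGATHER, /* scatter + recursive-doubling allgather */
    BCAST_SCATTER_RING_ALLGATHER    /* scatter + ring allgather               */
};

static bool is_power_of_two(int n)
{
    return n > 0 && (n & (n - 1)) == 0;
}

static enum bcast_alg choose_bcast_alg(size_t msg_bytes, int nprocs)
{
    if (msg_bytes < 12 * 1024 || nprocs < 8)
        return BCAST_BINOMIAL_TREE;

    /* Medium messages (12 KByte - 512 KByte): scatter then allgather;
       recursive doubling for power-of-two process counts, ring otherwise.
       Long messages (> 512 KByte) also use the ring allgather. */
    if (msg_bytes <= 512 * 1024 && is_power_of_two(nprocs))
        return BCAST_SCATTER_RECDBL_ALLGATHER;

    return BCAST_SCATTER_RING_ALLGATHER;
}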

12 MPI_Bcast
[Figure: average time for MPI_Bcast on Myrinet and Ethernet.]

13 MPI_Bcast
- Average times for both networks increase gradually with message size.
- For 200 CPUs both networks show the same pattern, with a jump at 1 KByte and 16 KByte. A similar jump occurs for 48 and 80 CPUs; the suspected cause is the non-power-of-two number of CPUs.
- Slight jumps at 16 KByte and 512 KByte are due to the change of algorithms.
- For Ethernet, the hump for 128 CPUs at 16 KByte is caused by retransmit timeouts.

14 Percentage of RTO occurrences for broadcast on 128 CPUs using Ethernet.

Msg size | No RTO | 1 x RTO | 2 x RTO | 3 x RTO | Avg. Time | Est. Avg. Time Without RTO
8 KB     | 100    | 0       | 0       | 0       | 7.92      | -
16 KB    | 98.9   | 0.99    | 0.01    | 0       | 90.81     | 49.33
32 KB    | 78.4   | 21.3    | 0.29    | 0       | 243.7     | 69.39
64 KB    | 85.2   | 14.7    | 0.01    | 0       | 289.70    | 94.43
256 KB   | 73.0   | 26.9    | 0.04    | 0       | 412.38    | 202.99

- The percentage of times that suffer RTOs grows to be a substantial fraction of the total.
- The estimated average time that the broadcast would have taken if there were no RTOs can be half or less of the actual measured time.

15 MPI_Alltoall
- Short messages (≤ 256 Bytes) with 8 or more processes: store-and-forward algorithm.
- Medium message sizes (256 Bytes < size < 32768 Bytes), and short messages with fewer than 8 processes: an algorithm that posts all irecvs and isends, then a waitall, scattering sources and destinations among the processes.
- Long messages with a power-of-two number of processes: pairwise exchange algorithm.
- Long messages with a non-power-of-two number of processes: in step i, each process receives from (rank - i) and sends to (rank + i).
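The medium-message case can be sketched in C as below. This is only an illustration of the posted-irecv/isend-then-waitall idea with rank-shifted ordering of sources and destinations, not MPICH's actual implementation; the function name, contiguous char buffers and tag are assumptions.

/* Sketch of the medium-message all-to-all approach described above:
   post all nonblocking receives and sends, then a single waitall.
   Peers are walked in a rank-shifted order so that all processes do
   not target the same peer at once. Illustrative only. */
#include <mpi.h>
#include <stdlib.h>

static int alltoall_irecv_isend(const char *sendbuf, char *recvbuf,
                                int block_bytes, MPI_Comm comm)
{
    int rank, nprocs;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &nprocs);

    MPI_Request *reqs = malloc(2 * (size_t)nprocs * sizeof(MPI_Request));
    int r = 0;

    /* Post all receives first, scattering sources by starting at rank+1. */
    for (int i = 0; i < nprocs; i++) {
        int src = (rank + i + 1) % nprocs;
        MPI_Irecv(recvbuf + (size_t)src * block_bytes, block_bytes, MPI_CHAR,
                  src, 0, comm, &reqs[r++]);
    }
    /* Then post all sends, scattering destinations the same way. */
    for (int i = 0; i < nprocs; i++) {
        int dst = (rank + i + 1) % nprocs;
        MPI_Isend(sendbuf + (size_t)dst * block_bytes, block_bytes, MPI_CHAR,
                  dst, 0, comm, &reqs[r++]);
    }

    /* Wait for every posted receive and send to complete. */
    int err = MPI_Waitall(r, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
    return err;
}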

16 MPI_Alltoall
[Figure: average time for MPI_Alltoall.]
- Average completion time for Myrinet increases gradually with message size and number of processes.
- On Ethernet, retransmit timeouts occur for more than 32 CPUs.
- The humps after 256 Bytes and 32 KBytes are due to algorithm changes.

17 MPI_Alltoall: percentage of RTO occurrences using Ethernet.

CPUs | Msg size | No RTO | 1 x RTO | 2 x RTO | ≥ 3 x RTO | Avg. Time | Est. Avg. Time Without RTO
32   | 2 KB     | 99.9   | 0.06    | 0       | 0         | 15.6      | -
32   | 4 KB     | 100    | 0       | 0       | 0         | 28.4      | -
32   | 8 KB     | 97.2   | 2.63    | 0       | 0         | 117.3     | 84.41
32   | 16 KB    | 81.19  | 18.81   | 0       | 0         | 287.1     | 239.91
32   | 32 KB    | NA     | NA      | NA      | NA        | 420.5     | NA
64   | 2 KB     | 97.6   | 2.4     | 0       | 0         | 115.5     | 112.24
64   | 4 KB     | 55.7   | 33.0    | 6.9     | 4.4       | 718.6     | 205.83
64   | 8 KB     | 3.5    | 78.4    | 13.1    | 5.0       | 528.7     | 282.23
64   | 16 KB    | 0.2    | 28.6    | 61.5    | 9.6       | 742.8     | 381.14
64   | 32 KB    | NA     | NA      | NA      | NA        | 844.3     | NA
128  | 256 B    | 89.28  | 10.72   | 0       | 0         | 217.1     | 75.13
128  | 512 B    | 96.59  | 3.29    | 0.10    | 0         | 660.7     | 407.12
128  | 1 KB     | 95.5   | 4.37    | 0.05    | 0.08      | 773.7     | 425.86
128  | 2 KB     | NA     | NA      | NA      | NA        | 2425      | NA
128  | 4 KB     | NA     | NA      | NA      | NA        | 2572      | NA

18 Conclusions
- The Myrinet network performs significantly better than Fast Ethernet.
- TCP RTOs can have a significant effect on the collective communications performance of MPI on Ethernet networks, but only for large message sizes and large numbers of processors.
- Using older versions of MPICH, Grove et al. found that TCP RTOs can greatly reduce the performance of MPI_Gather and MPI_Alltoall, which failed at large message sizes. The effects are much less serious in our measurements, probably due to improvements in the collective communications routines in recent versions of MPICH.
- The changeover points for the different collective communications algorithms in MPICH are fixed and based on Myrinet-like networks; they are not optimal for Ethernet, and perhaps other networks.

19 Conclusions
- We found MPIBench to be a useful tool for detailed analysis of MPI communications performance, particularly for using distributions of communication times to study contention effects and the occurrence and impact of TCP RTOs.
- It should be useful for researchers working on approaches for improving MPI performance, e.g. analyzing the performance of new communication protocols for Ethernet networks.

20 END

