Presentation transcript: Bringing High-Performance Networking to HEP Users (Richard Hughes-Jones, CHEP2004, Interlaken, September 2004)

Slide 1: Bringing High-Performance Networking to HEP Users
Richard Hughes-Jones, Stephen Dallison, Nicola Pezzi, Yee-Ting Lee
MB-NG

Slide 2: The Bandwidth Challenge at SC2003
• Peak bandwidth 23.21 Gbit/s
• 6.6 TBytes transferred in 48 minutes
• Phoenix - Amsterdam: 4.35 Gbit/s with HighSpeed TCP, rtt 175 ms, window 200 MB

Slide 3: TCP (Reno) – What's the problem?
• TCP has 2 phases: Slow Start & Congestion Avoidance
• AIMD and high-bandwidth, long-distance networks: the poor performance of TCP in high-bandwidth wide-area networks is due in part to the TCP congestion control algorithm.
  For each ACK in an RTT without loss: cwnd -> cwnd + a/cwnd (Additive Increase, a = 1)
  For each window experiencing loss: cwnd -> cwnd - b*cwnd (Multiplicative Decrease, b = 1/2)
• Time to recover from 1 lost packet for a round-trip time of ~100 ms (a worked estimate follows this slide)
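A back-of-envelope sketch of that recovery time, under assumptions of mine that are not stated on the slide (a 1 Gbit/s path and a 1500-byte MSS): after a loss, Reno halves cwnd and then adds roughly one segment per RTT, so it needs about cwnd/2 round trips to return to full rate.

```python
# Hedged sketch (not from the slides): estimate how long standard TCP Reno
# takes to recover full rate after a single loss. After the loss cwnd is
# halved, then grows by ~1 MSS per RTT, so recovery needs ~cwnd/2 round trips.

def reno_recovery_time(link_gbps, rtt_s, mss_bytes=1500):
    # cwnd (in segments) needed to fill the pipe: bandwidth-delay product / MSS
    bdp_bytes = (link_gbps * 1e9 / 8) * rtt_s
    cwnd_segments = bdp_bytes / mss_bytes
    # one MSS of growth per RTT => cwnd/2 RTTs to climb back after halving
    return (cwnd_segments / 2) * rtt_s

# Example (assumed values): 1 Gbit/s path, 100 ms RTT -> roughly 7 minutes
print(f"{reno_recovery_time(1.0, 0.1) / 60:.1f} minutes")
```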

Slide 4: Investigation of New TCP Stacks
• The AIMD algorithm – standard TCP (Reno):
  For each ACK in an RTT without loss: cwnd -> cwnd + a/cwnd (Additive Increase, a = 1)
  For each window experiencing loss: cwnd -> cwnd - b*cwnd (Multiplicative Decrease, b = 1/2)
• HighSpeed TCP: a and b vary with the current cwnd, using a table. a increases more rapidly at larger cwnd, so the flow returns to the 'optimal' cwnd for the network path sooner; b decreases less aggressively, so cwnd, and hence throughput, drops less on loss.
• Scalable TCP: a and b are fixed adjustments for the increase and decrease of cwnd; a = 1/100, so the increase is greater than TCP Reno; b = 1/8, so the decrease on loss is less than TCP Reno. Scalable over any link speed.
• Fast TCP: uses round-trip time as well as packet loss to indicate congestion, with rapid convergence to a fair equilibrium for throughput.
• Also: HSTCP-LP, H-TCP, BiC-TCP
(A sketch of the Reno and Scalable update rules follows this slide.)
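A minimal sketch (mine, not from the talk) of the per-ACK and per-loss cwnd updates for Reno and Scalable TCP, using only the parameters quoted above; HighSpeed TCP is left out because its a(cwnd), b(cwnd) table is not reproduced here.

```python
# cwnd update rules as quoted on this slide (Reno: a=1, b=1/2;
# Scalable: a=1/100, b=1/8). cwnd is measured in segments.

def reno_ack(cwnd):        # additive increase: +a/cwnd per ACK, a = 1
    return cwnd + 1.0 / cwnd

def reno_loss(cwnd):       # multiplicative decrease: -b*cwnd, b = 1/2
    return cwnd * (1 - 0.5)

def scalable_ack(cwnd):    # +a per ACK, a = 1/100 (grows faster than Reno once cwnd > 100)
    return cwnd + 0.01

def scalable_loss(cwnd):   # -b*cwnd, b = 1/8 (backs off much less than Reno)
    return cwnd * (1 - 1.0 / 8)
```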

Slide 5: Packet Loss with New TCP Stacks
• TCP response function: throughput vs loss rate – the further to the right, the faster the recovery
• Packets dropped in the kernel
• MB-NG rtt 6 ms; DataTAG rtt 120 ms
(The standard Reno response function is sketched after this slide.)
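For reference, the classic Reno response function (Mathis et al.), throughput ~ (MSS/RTT) * sqrt(3/2) / sqrt(p), is the baseline that these throughput-vs-loss curves compare the new stacks against. This small sketch (not from the slides) evaluates it at the two RTTs quoted above, assuming a 1500-byte MSS.

```python
from math import sqrt

def reno_throughput_mbps(loss_rate, rtt_s, mss_bytes=1500):
    # Mathis et al. response function for standard Reno, in Mbit/s
    return (mss_bytes * 8 / rtt_s) * sqrt(1.5) / sqrt(loss_rate) / 1e6

for rtt in (0.006, 0.120):   # MB-NG 6 ms, DataTAG 120 ms (as on the slide)
    print(f"rtt {rtt*1e3:.0f} ms: {reno_throughput_mbps(1e-6, rtt):.0f} Mbit/s at loss rate 1e-6")
```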

Slide 6: High Throughput Demonstrations
[Network diagram: Manchester (Geneva) host man03 and London (Chicago) host lon01, dual Xeon 2.2 GHz, connected via Cisco GSR and Cisco 7609 routers over the 2.5 Gbit SDH MB-NG core with 1 GEth access]
• Send data with TCP, drop packets, monitor TCP with Web100

Slide 7: High Performance TCP – MB-NG
• Drop 1 in 25,000; rtt 6.2 ms
• Recover in 1.6 s
[Web100 plots: Standard, HighSpeed, Scalable]

Slide 8: High Performance TCP – DataTAG
• Different TCP stacks tested on the DataTAG network
• rtt 128 ms; drop 1 in 10^6
• HighSpeed: rapid recovery
• Scalable: very fast recovery
• Standard: recovery would take ~20 minutes

Slide 9: End Systems: NICs & Disks

Slide 10: End Hosts & NICs – SuperMicro P4DP6
[Plots: latency, throughput, PCI bus activity]
• Use UDP packets to characterise the host & NIC (a sketch of the idea follows this slide)
• SuperMicro P4DP6 motherboard, dual Xeon 2.2 GHz CPUs, 400 MHz system bus, 66 MHz 64-bit PCI bus
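The measurements here were made with UDPmon (linked on the "More Information" slide); the following is only a rough sketch of the idea, not the tool itself: stream fixed-size UDP packets with a chosen inter-packet wait and report the offered sender rate. The host, port and parameter values are hypothetical placeholders.

```python
import socket, time

def udp_burst(host="192.168.0.2", port=5001, pkt_bytes=1400,
              n_pkts=10000, wait_us=12):
    """Send a burst of UDP packets and return the sender's offered rate in Mbit/s."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = b"\0" * pkt_bytes
    t0 = time.perf_counter()
    for _ in range(n_pkts):
        s.sendto(payload, (host, port))
        # crude spin-wait to approximate the requested inter-packet spacing
        t_next = time.perf_counter() + wait_us * 1e-6
        while time.perf_counter() < t_next:
            pass
    dt = time.perf_counter() - t0
    return n_pkts * pkt_bytes * 8 / dt / 1e6

if __name__ == "__main__":
    print(f"sender rate: {udp_burst():.0f} Mbit/s")
```

A matching receiver would count packets and bytes per interval to give the achieved throughput and loss, which is the quantity plotted on this slide.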

Slide 11: Host, PCI & RAID Controller Performance
• RAID5 (striped with redundancy) and RAID0 (striped)
• Controllers: 3Ware 7506 parallel 66 MHz; 3Ware 7505 parallel 33 MHz; 3Ware 8506 Serial ATA 66 MHz; ICP Serial ATA 33/66 MHz
• Tested on a dual 2.2 GHz Xeon Supermicro P4DP8-G2 motherboard
• Disk: Maxtor 160 GB 7200 rpm, 8 MB cache
• Read-ahead kernel tuning: /proc/sys/vm/max-readahead = 512 (a sketch follows this slide)
• RAID0 (striped): read 1040 Mbit/s, write 800 Mbit/s
[Plots: disk-to-memory read speeds, memory-to-disk write speeds]
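A trivial sketch (assuming the 2.4-series kernels used on these hosts, where /proc/sys/vm/max-readahead exists) of applying the read-ahead value quoted above before running a disk test; it must be run as root.

```python
READAHEAD = "/proc/sys/vm/max-readahead"

def set_readahead(pages=512):
    # write the tuning value used on this slide
    with open(READAHEAD, "w") as f:
        f.write(str(pages))

def get_readahead():
    with open(READAHEAD) as f:
        return int(f.read().strip())

if __name__ == "__main__":
    set_readahead(512)
    print("max-readahead =", get_readahead())
```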

Slide 12: BaBar Case Study: RAID BW & PCI Activity – the performance of the end host / disks
• 3Ware 7500-8 RAID5, parallel EIDE
• The 3Ware controller forces the PCI bus to 33 MHz
• BaBar Tyan to MB-NG SuperMicro: network memory-to-memory 619 Mbit/s
• Disk-to-disk throughput with bbcp: 40-45 MBytes/s (320-360 Mbit/s)
• PCI bus effectively full!
• User throughput ~250 Mbit/s
[Plots: read from RAID5 disks, write to RAID5 disks]

Slide 13: Data Transfer Applications

Slide 14: The Tests (being) Made
Each application (Iperf, bbcp, bbftp, apache, Gridftp) is run with the Standard, HighSpeed and Scalable TCP stacks on three configurations: SuperMicro hosts on MB-NG, SuperMicro hosts on SuperJANET4, and BaBar hosts on SuperJANET4.

Slide 15: Topology of the MB-NG Network
[Network diagram. Key: Gigabit Ethernet, 2.5 Gbit POS access, MPLS, administrative domains. The Manchester domain (man01, man02, man03), UCL domain (lon01, lon02, lon03) and RAL domain (ral01, ral02, HW RAID) each connect through Cisco 7609 edge/boundary routers to the UKERNA Development Network.]

Slide 16: Topology of the Production Network
[Network diagram. Key: Gigabit Ethernet, 2.5 Gbit POS access, 10 Gbit POS. The Manchester domain (man01) and RAL domain (ral01, HW RAID) connect across the production network through routers and switches (3 routers, 2 switches).]

Slide 17: Average Transfer Rates (Mbit/s)

App       TCP Stack   SuperMicro on MB-NG   SuperMicro on SuperJANET4   BaBar on SuperJANET4
Iperf     Standard    940                   350-370                     425
Iperf     HighSpeed   940                   510                         570
Iperf     Scalable    940                   580-650                     605
bbcp      Standard    434                   290-310                     290
bbcp      HighSpeed   435                   385                         360
bbcp      Scalable    432                   400-430                     380
bbftp     Standard    400-410               325                         320
bbftp     HighSpeed   370-390               380                         -
bbftp     Scalable    430                   345-532                     380
apache    Standard    425                   260                         300-360
apache    HighSpeed   430                   370                         315
apache    Scalable    428                   400                         317
Gridftp   Standard    405                   240                         -
Gridftp   HighSpeed   320                   -                           -
Gridftp   Scalable    335                   -                           -

Slide 18: iperf Throughput + Web100
• SuperMicro on the MB-NG network, HighSpeed TCP: line speed 940 Mbit/s; DupACKs? < 10 (expect ~400)
• BaBar on the production network, Standard TCP: 425 Mbit/s; DupACKs 350-400 – re-transmits

Slide 19: bbftp: Host & Network Effects
• 2 GByte file; RAID5 disks: 1200 Mbit/s read, 600 Mbit/s write
• Scalable TCP
• BaBar + SuperJANET: instantaneous 220-625 Mbit/s
• SuperMicro + SuperJANET: instantaneous 400-665 Mbit/s for 6 s, then 0-480 Mbit/s
• SuperMicro + MB-NG: instantaneous 880-950 Mbit/s for 1.3 s, then 215-625 Mbit/s

Slide 20: bbftp: What else is going on?
• Scalable TCP
• BaBar + SuperJANET; SuperMicro + SuperJANET
• Congestion window – dupACK plots
• Variation not TCP-related? Possibly disk speed / bus transfer / the application

Slide 21: Applications: Throughput (Mbit/s)
• HighSpeed TCP; 2 GByte file on RAID5; SuperMicro + SuperJANET
[Throughput plots for bbcp, bbftp, Apache, Gridftp]
• Previous work used RAID0 (not disk-limited)

Slide 22: Summary, Conclusions & Thanks
• Motherboards, NICs, RAID controllers and disks matter:
  - The NICs should be well designed: use 64-bit 133 MHz PCI-X (66 MHz PCI can be OK); NIC/drivers need efficient CSR access, clean buffer management and good interrupt handling
  - Worry about the CPU-memory bandwidth as well as the PCI bandwidth: data crosses the memory bus at least 3 times
  - Separate the data transfers: use motherboards with multiple 64-bit PCI-X buses; 32-bit 33 MHz is too slow for Gigabit rates, and 64-bit 33 MHz is already > 80% used
  - Choose a modern high-throughput RAID controller; consider SW RAID0 over RAID5 HW controllers
• Need plenty of CPU power for sustained 1 Gbit/s transfers
• Work with campus network engineers to eliminate bottlenecks and packet loss: get a high-bandwidth link to your server; look for access-link overloading, old Ethernet equipment and flow-limitation policies
• Use of jumbo frames, interrupt coalescence and tuning of the PCI-X bus helps
• New TCP stacks are stable and run with 10 Gigabit Ethernet NICs
• New stacks give better response & performance, but the TCP buffer sizes still need to be set: system maximums in collaboration with the sysadmin, socket sizes in the application (see the sketch after this slide)
• Application architecture & implementation are also important
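As an illustration of the buffer-size point above, here is a minimal sketch (not from the talk) of requesting large per-socket TCP buffers in the application. The 16 MB value is an arbitrary example; on Linux the request is clipped to the system maximums (e.g. net.core.rmem_max / net.core.wmem_max), which is why the sysadmin has to raise those first.

```python
import socket

def make_tuned_socket(buf_bytes=16 * 1024 * 1024):   # e.g. 16 MB for a long fat pipe
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, buf_bytes)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, buf_bytes)
    # Check what the kernel actually granted (Linux reports twice the request,
    # and silently caps it at the system-wide maximums)
    snd = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    rcv = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    print(f"SO_SNDBUF={snd} SO_RCVBUF={rcv}")
    return s
```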

Slide 23: More Information – Some URLs
• MB-NG project web site: http://www.mb-ng.net/
• DataTAG project web site: http://www.datatag.org/
• UDPmon / TCPmon kit + write-up: http://www.hep.man.ac.uk/~rich/net
• Motherboard and NIC tests: www.hep.man.ac.uk/~rich/net/nic/GigEth_tests_Boston.ppt and http://datatag.web.cern.ch/datatag/pfldnet2003/ ; "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS special issue, 2004
• TCP tuning information: http://www.ncne.nlanr.net/documentation/faq/performance.html and http://www.psc.edu/networking/perf_tune.html
• TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing, 2004

Slide 24: Backup Slides

Slide 25: SuperMicro P4DP6: Throughput – Intel Pro/1000
• Max throughput 950 Mbit/s, no packet loss
• CPU utilisation on the receiving PC was ~25% for packets > 1000 bytes, 30-40% for smaller packets
• Motherboard: SuperMicro P4DP6; chipset: Intel E7500 (Plumas); CPU: dual Xeon Prestonia 2.2 GHz; PCI 64-bit, 66 MHz; RedHat 7.2, kernel 2.4.19

Slide 26: SuperMicro P4DP6: Latency – Intel Pro/1000
• Some steps; slope 0.009 us/byte
• Slope of the flat sections: 0.0146 us/byte; expect 0.0118 us/byte (a back-of-envelope estimate follows this slide)
• No variation with packet size; FWHM 1.5 us – confirms the timing is reliable
• Motherboard: SuperMicro P4DP6; chipset: Intel E7500 (Plumas); CPU: dual Xeon Prestonia 2.2 GHz; PCI 64-bit, 66 MHz; RedHat 7.2, kernel 2.4.19
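My assumption (not stated on the slide) of where the expected 0.0118 us/byte comes from: each byte has to cross the sender's 64-bit 66 MHz PCI bus, the Gigabit Ethernet wire, and the receiver's PCI bus, and the per-byte times simply add.

```python
# Back-of-envelope estimate of the expected latency slope in us/byte
pci_64_66 = 1 / (64 / 8 * 66e6) * 1e6     # 64-bit 66 MHz PCI: ~0.0019 us/byte
gige_wire = 8 / 1e9 * 1e6                 # 1 Gbit/s wire: 0.008 us/byte
expected = 2 * pci_64_66 + gige_wire      # two PCI crossings plus the wire
print(f"expected slope ~ {expected:.4f} us/byte")   # ~0.0118, as quoted
```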

Slide 27: SuperMicro P4DP6: PCI – Intel Pro/1000
• 1400 bytes sent, wait 12 us: ~5.14 us on the send PCI bus; PCI bus ~68% occupancy; ~3 us on the PCI bus for the data receive
• CSR access inserts PCI STOPs; the NIC takes ~1 us per CSR access – the CPU is faster than the NIC!
• Similar effect with the SysKonnect NIC
• Motherboard: SuperMicro P4DP6; chipset: Intel E7500 (Plumas); CPU: dual Xeon Prestonia 2.2 GHz; PCI 64-bit, 66 MHz; RedHat 7.2, kernel 2.4.19

Slide 28: RAID0 Performance (1)
• 3Ware 7500-8 RAID0, parallel EIDE
• Maxtor 3.5 Series DiamondMax Plus 9, 120 GB, ATA/133
• RAID stripe size 64 bytes
• Write: slight increase with number of disks; Read: 3 disks OK
• Write 100 MBytes/s, read 130 MBytes/s

Slide 29: RAID0 Performance (2)
• Maxtor 3.5 Series DiamondMax Plus 9, 120 GB, ATA/133
• No difference for write; a larger stripe size lowers the performance
• Write 100 MBytes/s, read 120 MBytes/s

Slide 30: RAID5 Disk Performance vs readahead_max
• BaBar disk server: Tyan Tiger S2466N motherboard, one 64-bit 66 MHz PCI bus, Athlon MP2000+ CPU, AMD-760 MPX chipset, 3Ware 7500-8 RAID5, 8 x 200 GB Maxtor IDE 7200 rpm disks
• Note the VM parameter readahead_max
• Disk to memory (read): max throughput 1.2 Gbit/s (150 MBytes/s)
• Memory to disk (write): max throughput 400 Mbit/s (50 MBytes/s) [not as fast as RAID0]

Slide 31: Host, PCI & RAID Controller Performance
• RAID0 (striped) & RAID5 (striped with redundancy)
• Controllers: 3Ware 7506 parallel 66 MHz; 3Ware 7505 parallel 33 MHz; 3Ware 8506 Serial ATA 66 MHz; ICP Serial ATA 33/66 MHz
• Tested on a dual 2.2 GHz Xeon Supermicro P4DP8-G2 motherboard
• Disk: Maxtor 160 GB 7200 rpm, 8 MB cache
• Read-ahead kernel tuning: /proc/sys/vm/max-readahead

Slide 32: Serial ATA RAID Controllers – RAID5
• 3Ware, 66 MHz PCI
• ICP, 66 MHz PCI
[Performance plots]

Slide 33: RAID Controller Performance
[Plots: RAID0 and RAID5 read and write speeds]

Slide 34: Gridftp Throughput + Web100
• RAID0 disks: 960 Mbit/s read, 800 Mbit/s write
• Throughput (Mbit/s): alternates between 600/800 Mbit/s and zero
• Data rate: 520 Mbit/s
• Cwnd smooth; no dup ACKs / send stalls / timeouts

Slide 35: HTTP Data Transfers – HighSpeed TCP
• Same hardware, RAID0 disks
• Bulk data moved by web servers: Apache web server out of the box!
• Prototype client using the curl http library
• 1 MByte TCP buffers, 2 GByte file
• Throughput ~720 Mbit/s
• Cwnd shows some variation; no dup ACKs / send stalls / timeouts

Slide 36: bbcp & GridFTP Throughput
• RAID5, 4 disks, Manchester – RAL
• 2 GByte file transferred
• bbcp: mean 710 Mbit/s
• GridFTP: see many zeros (plot means ~710 and ~620)
• DataTAG altAIMD kernel in BaBar & ATLAS

