The Irish National e-Infrastructure. Funds Grid-Ireland Operations Centre. Creating a National Datastore: multiple regional datastores; Ops Centre runs TCD regional datastore. For all disciplines, not just science & technology. Projects with (inter)national dimension. Central allocation process. Grid and non-grid use.
Grid-Ireland @ TCD already had some: Dell PowerEdge 2950 (2x quad-core Xeon); Dell MD1000 (SAS JBOD). After procurement the data store has in total: 8x Dell PE2950 (6x 1TB disks, 10GbE); 30x MD1000, each with 15x 1TB disks, ~11.6 TiB each after RAID6 and XFS format (~350 TiB total); 2x Dell blade chassis with 8x M600 blades each; Dell tape library (24x Ultrium 4 tapes); HP ExDS9100 with 4 capacity blocks of 82x 1TB disks each and 4 blades, ~233 TiB total available for NFS/http export.
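The capacity figures above can be sanity-checked with a little arithmetic. The sketch below assumes each MD1000 is a single RAID6 array over all 15 disks with no hot spare (the slides do not say how the arrays were laid out), and that "1TB" means a decimal terabyte.

```python
TB = 10**12    # decimal terabyte, as disk vendors count
TIB = 2**40    # binary tebibyte, as quoted on the slide

def raid6_usable_tib(disks, disk_tb=1.0):
    """Usable RAID6 capacity in TiB before filesystem overhead.

    RAID6 spends two disks' worth of space on parity.
    Assumption: one array spans all disks, with no hot spare.
    """
    if disks < 4:
        raise ValueError("RAID6 needs at least 4 disks")
    return (disks - 2) * disk_tb * TB / TIB

per_shelf = raid6_usable_tib(15)       # one MD1000 with 15x 1TB disks
exds_raw = 4 * 82 * 1.0 * TB / TIB     # ExDS9100 raw capacity, before any protection

print(f"RAID6 per MD1000, before XFS: {per_shelf:.2f} TiB")
print(f"30 shelves at the quoted 11.6 TiB: {30 * 11.6:.0f} TiB")
print(f"ExDS9100 raw: {exds_raw:.0f} TiB")
```

Under these assumptions RAID6 alone leaves about 11.8 TiB per shelf, so the quoted ~11.6 TiB is consistent with a small XFS overhead, and 30 shelves give roughly the ~350 TiB stated.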
DPM installed on Dell hardware: ~100TB for Ops Centre to allocate; rest for Irish users via allocation process; may also try to combine with iRODS. HP ExDS high-availability store: iRODS primarily; NFS exports; not for conventional grid use. Bridge services on blades for community-specific access patterns.
Room needed upgrade: another cooler; UPS maxed out; new high-current AC circuits added; 2x 3kVA UPS per rack acquired for Dell equipment; ExDS has 4x 16A 3Ø (2 on room UPS, 2 raw). 10GbE to move data!
Benchmarked with netperf (http://www.netperf.org). Initially 1-2 Gb/s… not good. Had machines that produced figures of 4 Gb/s+. What's the difference? Looked at a couple of documents on this (http://www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Mark_Wagner.pdf and http://docs.sun.com/source/819-0938-13/D_linux.html). Tested various of these optimisations: initially little improvement (~100 Mb/s). Then identified the most important changes.
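The slides do not record the exact netperf invocation. As a rough sketch, a TCP stream test could be driven as below; the host name is hypothetical, the 30 second length is an assumption, and the target machine must be running netserver.

```python
def netperf_command(host, seconds=30, test="TCP_STREAM"):
    """Build a netperf command line.

    -H (remote host), -l (test length in seconds) and -t (test name)
    are standard netperf global options.
    """
    if seconds <= 0:
        raise ValueError("test length must be positive")
    return ["netperf", "-H", host, "-l", str(seconds), "-t", test]

# Hypothetical disk server name, for illustration only.
cmd = netperf_command("diskserver01.example.org")
print(" ".join(cmd))
```

The list form can be passed straight to `subprocess.run(cmd, capture_output=True, text=True)` on a machine with netperf installed.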
Cards fitted to wrong PCI-E port: were x4 instead of x8. New kernel version: new kernel supports MSI-X (multiqueue); was saturating one core, now distributes. Increase MTU (from 1500 to 9216): large difference to netperf; smaller difference to real loads. Then compared two switches with direct connection.
Storage was mostly in place. 10GbE was there but being tested; brought into production early in STEP09. Useful exercise for us: see bulk data transfer in conjunction with user access to stored data; the first large 'real' load on the new equipment. Grid-Ireland OpsCentre at TCD involved as Tier-2 site, associated with NL Tier-1.
Data transfers into TCD from NL: peaked at 440 Mbit/s (capped at 500). Recently upgraded firewall box coped well.
Figures: HEAnet view of GEANT link; TCD view of Grid-Ireland link.
Lots of analysis jobs: running on cluster nodes; accessing large datasets directly from storage; caused heavy load on network and disk servers; caused problems for other jobs accessing storage; now known that access patterns were pathological. Also production jobs. ATLAS analysis; ATLAS production; LHCb production.
3x 1 Gbit bonded links set up. Almost all data stored on this server.
Fix to distinguish filesystems with identical names on different servers. Fixed display of long labels. Display space token stats in TB. New code for pool stats.
Pool stats first to use DPM C API; previously everything was done via MySQL. Was able to merge some of these fixes. Time-consuming to contribute patches: single maintainer with no dedicated effort … MonAMI useful but future uncertain. Should UKI contribute effort to plugin development? Or should similar functionality be created for native Ganglia?
Recent procurement gave us a huge increase in capacity. STEP09 was a great test of data paths into and within our new infrastructure. Identified bottlenecks and tuned configuration: back-ported SL5 kernel to support 10GbE on SL4; spread data across disk servers for load-balancing; increased capacity of cluster-storage link; have since upgraded switches. Monitoring crucial to understanding what's going on: Weathermap for quick visual check; Cacti for detailed information on network traffic; LEMON and Ganglia for host load, cluster usage, etc.
Quotas are close to becoming essential for us. 10GbE problems have highlighted that releases on new platforms are needed far more quickly.
Firewall: 1Gb outbound, 10Gb internally. M8024 switch in bridge blade chassis: 24 port (16 to blades) layer 3 switch. Force10 switch: main backbone; 10GbE cards in DPM servers; 10GbE uplink from National Servers 6224 switch. 10GbE copper (CX4): ExDS to M6220 in 2nd blade chassis; link between 2 blade chassis (M6220 - M8024). 4-way LAG: Force10 - M8024.
24 port 10Gb switch. XFP modules: Dell supplied our XFPs so cost per port reduced. 10Gb/s only. Layer 2 switch. Same Fulcrum ASIC as Arista switch tested. Uses a standard reference implementation.
Arista Networks 7124S: 24 port switch. SFP+ modules: low cost per port (switches relatively cheap too). Open software (Linux): even has bash available; potential for customisation (e.g. iptables being ported). Can run 1Gb/s and 10Gb/s simultaneously: just plug in the different SFPs. Layer 2/3: some docs refer to layer 3 as a software upgrade.
Our 10GbE cards are Intel PCI-E 10GBASE-SR. Dell had plugged most into the x4 PCI-E slot; an error was coming up in dmesg. Trivial solution: I moved the cards to x8 slots. Now can get >5Gb/s on some machines.
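Why the slot width matters can be seen from the raw link arithmetic. The sketch below assumes first-generation PCI Express signalling (2.5 GT/s per lane with 8b/10b encoding); the slides do not state the PCI-E generation of these servers, and the figures ignore packet-level protocol overhead, so real throughput is lower still.

```python
def pcie_link_gbps(lanes, gt_per_s=2.5):
    """Raw PCI-E data rate per direction in Gb/s.

    Assumption: 8b/10b encoding (PCI-E 1.x/2.x), so only 8 of every
    10 transferred bits carry data. Protocol overhead is not included.
    """
    if lanes <= 0:
        raise ValueError("lanes must be positive")
    return lanes * gt_per_s * 8 / 10

for lanes in (4, 8):
    rate = pcie_link_gbps(lanes)
    verdict = "below" if rate < 10 else "above"
    print(f"x{lanes}: {rate:.0f} Gb/s raw, {verdict} 10GbE line rate")
```

Under that assumption an x4 slot tops out at 8 Gb/s before overhead, which cannot carry a full 10GbE stream, while an x8 slot has headroom.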
Maximum Transmission Unit (MTU): Ethernet spec says 1500 bytes. Most hardware/software can support jumbo frames; ixgbe driver allowed MTU=9216. Must be set through the whole path; different switches have different max values. Makes a big difference to netperf. Example of SL5 machines, 30s tests: MTU=1500, TCP stream at 5399 Mb/s; MTU=9216, TCP stream at 8009 Mb/s.
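To put the two netperf figures in context, the sketch below compares per-frame wire efficiency and frame rate at each MTU. It assumes IPv4 and TCP headers of 20 bytes each with no TCP options (timestamps would add 12 bytes) and the standard Ethernet per-frame overhead; only the two throughput numbers come from the slide.

```python
ETH_OVERHEAD = 14 + 4 + 8 + 12   # header + FCS + preamble/SFD + inter-frame gap (bytes)
IP_TCP_HEADERS = 20 + 20         # assumption: IPv4 + TCP, no options

def tcp_efficiency(mtu):
    """Fraction of on-the-wire bytes that are TCP payload."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)

def frames_per_second(mtu, goodput_mbps):
    """Frames per second needed to carry the given TCP goodput."""
    return goodput_mbps * 1e6 / ((mtu - IP_TCP_HEADERS) * 8)

measured = {1500: 5399, 9216: 8009}   # Mb/s, from the slide
for mtu, mbps in measured.items():
    print(f"MTU {mtu}: efficiency {tcp_efficiency(mtu):.1%}, "
          f"{frames_per_second(mtu, mbps):,.0f} frames/s at {mbps} Mb/s")

speedup = measured[9216] / measured[1500]
print(f"Measured speedup: {speedup:.2f}x")
```

Wire efficiency improves by only about four percentage points, while the measured gain is close to 1.5x with roughly a quarter of the frames per second. That suggests, though the slides do not confirm it, that per-packet processing cost on the hosts was the main limit rather than header overhead.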
Machines on SL4 kernels had very poor receive performance (50 Mb/s). One core was 0% idle (use mpstat -P ALL); sys/soft used up the whole core. /proc/interrupts showed PCI-MSI used: all RX interrupts to one core. New kernel had MSI-X and multiqueue: interrupts distributed, full RX performance.
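The check described above (are all of a NIC's interrupts landing on one core?) can be scripted. The sketch below sums per-CPU interrupt counts for one device from text in the /proc/interrupts layout. The two samples are invented for illustration: the counts, the IRQ numbers, the device name eth2 and the queue names are all hypothetical, and real queue naming depends on the driver version.

```python
def per_cpu_interrupts(text, device):
    """Sum interrupt counts per CPU for rows whose description mentions `device`."""
    lines = text.strip().splitlines()
    ncpu = len(lines[0].split())          # header row: CPU0 CPU1 ...
    totals = [0] * ncpu
    for line in lines[1:]:
        _, _, rest = line.partition(":")
        fields = rest.split()
        counts = fields[:ncpu]
        if len(counts) < ncpu or not all(c.isdigit() for c in counts):
            continue                      # skip summary rows with a single count
        if not any(device in f for f in fields[ncpu:]):
            continue
        for cpu, count in enumerate(counts):
            totals[cpu] += int(count)
    return totals

def busiest_share(totals):
    """Fraction of all interrupts handled by the busiest CPU."""
    total = sum(totals)
    return max(totals) / total if total else 0.0

# Hypothetical samples: single MSI vector vs. MSI-X with one vector per RX queue.
SAMPLE_MSI = """
           CPU0       CPU1       CPU2       CPU3
 58:   9812345          0          0          0   PCI-MSI    eth2
"""
SAMPLE_MSIX = """
           CPU0       CPU1       CPU2       CPU3
 58:   2400000          0          0          0   PCI-MSI-X  eth2-rx-0
 59:         0    2500000          0          0   PCI-MSI-X  eth2-rx-1
 60:         0          0    2600000          0   PCI-MSI-X  eth2-rx-2
 61:         0          0          0    2500000   PCI-MSI-X  eth2-rx-3
"""

for name, sample in (("MSI", SAMPLE_MSI), ("MSI-X", SAMPLE_MSIX)):
    share = busiest_share(per_cpu_interrupts(sample, "eth2"))
    print(f"{name}: busiest CPU handles {share:.0%} of eth2 interrupts")
```

On a live machine the same function can be fed `open("/proc/interrupts").read()` together with the real interface name.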