1 Tier1 Site Report HEPSysMan @ RAL 19-20 June 2008 Martin Bly

2  Overview
New Building
Site issues
Tier1

3  New Computing Building
New computing building being constructed opposite the new reception building at RAL
November 2007 – looked like a sparse Meccano construction – just girders
Now has walls, a roof, windows, skylight – shell is almost complete
External ‘beautification’ starting
Internal fitting of the machine room level yet to start
Completion due late 2008
Migration planning starting
– Target: move most of the Tier1 hardware Jan-Mar 2009

5  Portable Device Encryption
Big concern in the UK over data loss by ‘government’, like everywhere else
– Mostly careless custodianship rather than ‘enemy action’
– Many stolen/lost laptops, CDs/DVDs going missing in transit…
Government has mandated that all public service organisations must ensure that any portable device taken off site has its data storage encrypted by an approved tool
This means: all laptops and other portable devices (PDAs, phones) which have access to ‘data’ on the RAL network must have encryption before they leave site
– ‘Data’ means any data that can identify or be associated with an individual – thus Outlook caches, email lists, synchronised file caches of ‘corporate’ data of any sort
Many staff have rationalised what they keep on their laptop/PDA
– Why do you need it? If you don’t need it, don’t keep it!
Using Pointsec from Check Point Software Technologies Ltd
– Will do Windows XP, some versions of Linux
– …but not Macs, or dual-boot Windows/Linux systems (yet!)
Painful but necessary
– Don’t put the data at risk…

6  Tier1: Grid Only
Non-Grid access to the Tier1 has ended. Only special cases now have access to:
– UIs
– Direct job submission (qsub)
Until end of May 2008:
– IDs were maintained (disabled)
– Home directories were maintained online
– Mail forwarding was maintained
After end of May 2008:
– IDs will be deleted
– Home directories will be backed up
– Mail spool will be backed up
– Mail forwarding will stop
AFS service continues for Babar (and just in case for LCG)

7  CASTOR
Production version is v2.1.6-12 hot-fix 2
– Recently much more stable and reliable
– Good support from developers at CERN – working well
– Some problems appear at RAL that don’t show in testing at CERN because we use features not exercised at CERN – speedy investigation and fixing
Considerable effort with CMS on tuning disk server and tape migration performance
– Recent work with developers on migration strategies has improved performance considerably
Migration to v2.1.7-7 imminent

8  dCache closure
dCache service closure was announced for the end of May 2008
– Migration of data is proceeding
– Some work to do to provide a generic CASTOR instance for small VOs
– Likely the closure deadline will extend some months

9  Hardware: New Capacity Storage
182 x 9TB 16-bay 3U servers: 1638TB data capacity
Two Lots based on the same Supermicro chassis with different disk OEM (WD, Seagate) and CPU (AMD, Intel)
Dual RAID controllers – data and system disks separate (a health-check sketch follows this slide):
– 3Ware 9650SX-16ML, 14 x 750GB data drives
– 3Ware 9650SX-4, 2 x 250GB or 400GB system drives
Twin CPUs (quad-core Intel, dual-core AMD), 8GB RAM, dual 1Gb NICs
Intel set being deployed
– Used in CCRC08
AMD set: some issues with the forcedeth network driver (SL4.4) under high sustained load
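
The slides do not say how these arrays are kept under watch. As a minimal sketch only – assuming the vendor's tw_cli utility is installed on each disk server and that the data and system arrays appear as controllers /c0 and /c1 (both assumptions, not taken from the slides) – a periodic health check might look like this:

```python
#!/usr/bin/env python
# Minimal sketch of a RAID health check for the 3ware controllers listed above.
# Assumes tw_cli is installed and that the data and system arrays are /c0 and /c1;
# the exact column layout of 'tw_cli /cN show' may differ between firmware versions.
import subprocess
import sys

CONTROLLERS = ["/c0", "/c1"]   # assumed controller IDs
HEALTHY = {"OK", "VERIFYING"}  # unit states we treat as non-alerting

def unit_states(controller):
    """Run 'tw_cli /cN show' and yield (unit, status) pairs from its output."""
    out = subprocess.check_output(["tw_cli", controller, "show"], text=True)
    for line in out.splitlines():
        fields = line.split()
        # Unit lines start with 'u<number>', e.g. 'u0  RAID-5  OK ...'
        if fields and fields[0].startswith("u") and fields[0][1:].isdigit():
            yield fields[0], fields[2]

def main():
    bad = []
    for ctl in CONTROLLERS:
        for unit, status in unit_states(ctl):
            if status not in HEALTHY:
                bad.append(f"{ctl}/{unit}: {status}")
    if bad:
        print("DEGRADED: " + ", ".join(bad))
        sys.exit(2)   # non-zero exit so a monitoring framework can alert on it
    print("OK: all units healthy")

if __name__ == "__main__":
    main()
```

In practice the exit code would feed whatever monitoring framework the site runs (Nagios is mentioned later in this report); the sketch only shows the shape of the check.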

10  Backplane Failures (Supermicro)
3 servers ‘burnt out’ their backplanes
– 2 of which set off VESDA
– 1 called out the fire brigade!
Safety risk assessment: urgent rectification needed
Good response from supplier/manufacturer
– PCB fault in a ‘bad batch’
Replacement complete

11  Hardware: CPU
2007: production capacity ~1500 KSI2K on 600 systems
– Late 2007: upgraded about 50% of capacity to 2GB/core
FY07/08 procurement (~3000 KSI2K – but YMMV)
– Streamline: 57 x 1U servers (114 systems, 3 racks), each system: dual Intel E5410 (2.33GHz) quad-core CPUs, 2GB/core, 1 x 500GB HDD
– Clustervision: 56 x 1U servers (112 systems, 4 racks), each system: dual Intel E5440 (2.83GHz) quad-core CPUs, 2GB/core, 1 x 500GB HDD
– Configuration based on a 15kW per rack maximum, from supplied ‘full-load’ power consumption data; required to meet power supply and cooling restrictions (2 x 32A supplies) – a rough cross-check follows this slide.
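
The 15kW-per-rack figure can be roughly cross-checked from the quoted supplies. The 230V supply voltage and the per-system full-load draw used below are illustrative assumptions, not numbers taken from the slides:

```python
# Back-of-envelope check of the 15 kW per rack figure quoted above.
# SUPPLY_VOLTS and watts_per_system are assumptions for illustration only.

SUPPLY_VOLTS = 230          # assumed UK single-phase supply voltage
FEEDS = 2                   # "2 x 32A supplies" per rack (from the slide)
AMPS_PER_FEED = 32

rack_limit_w = FEEDS * AMPS_PER_FEED * SUPPLY_VOLTS
print(f"Electrical limit per rack: {rack_limit_w / 1000:.1f} kW")   # ~14.7 kW

# Hypothetical full-load draw for one dual quad-core system (assumption).
watts_per_system = 350
systems_per_rack = rack_limit_w // watts_per_system
print(f"At {watts_per_system} W/system: at most {systems_per_rack} systems per rack")
```

At those assumed numbers a rack tops out in the low forties of systems, comfortably above the ~38 systems per rack implied by 114 systems spread over 3 racks.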

12  Hardware: non-Capacity
Servers for Grid services (CEs, WMSs, FTS etc.)
– 11 ‘twin’ systems, same as batch workers but with two disks
Low capacity storage
– 6 x 2U servers, 8GB RAM, dual-chip dual-core AMD CPUs, 2 x 250GB HDD (RAID1 system), 4 x 750GB HDD (RAID5 data), 3Ware controller
– For AFS and home filesystems, installation repositories…
Xen
– 4 ‘monster’ systems for virtualisation
– 2 x dual-core AMD 2222 CPUs, 32GB RAM, 4 x 750GB HDDs on a HW RAID controller
– For PPS service and Tier1 testing
Oracle Databases
– 5 x servers (redundant PSUs, HW RAID disks) and a 7TB data array (HW RAID)
– To provide additional RAC nodes for 3D services, LFC/FTS backend, Atlas TAG etc.

13  FY08/09 Procurements
Capacity procurements for 2008/9 started
– PQQs issued to OJEU, responses due mid July
– Evaluation and issue of technical documents for a limited second stage expected by early August
– Second stage evaluation September/October
– Delivery …
Looking for ~1800TB usable storage and around the same compute capacity as last year
Additional non-capacity hardware required to replace aging re-tasked batch workers

14  Hardware: Network
[Network diagram: RAL site and Tier1 network – Nortel 5510/5530 switch stacks connecting CPUs, disks, ADS caches and Oracle systems; a Force10 C300 8-slot router (64 x 10Gb); OPN router with a 10Gb/s link to CERN; site access router with 10Gb/s to SJ5 and a firewall bypass; 1Gb/s test network link to Lancaster; RAL Tier 2.]

15  Services
New compute capacity enabled for CCRC08
– Exposed weakness under load in the single-CE configuration
– Deployed three extra CEs as previously planned
Moved LFC backend to single-node Oracle
– To move to RAC with FTS backend shortly
Maui issues with buffer sizes caused by a large increase in the number of jobs running
– Monitoring task killing maui at 8-hour intervals
– Rebuild with larger buffer sizes cured the problem

16  Monitoring / On-Call
Cacti – network traffic and power
Ganglia – performance
Nagios – alerts
24x7 callout now operational
– Using Nagios to signal the existing pager system to initiate callouts (a hypothetical wrapper sketch follows this slide)
– Working well, but still learning
Blogging
– UK T2s and the Tier1 have blogs: http://planet.gridpp.ac.uk/
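
The slides do not show how Nagios drives the pager system. The sketch below is a hypothetical wrapper of the kind a Nagios notification command could call: the send_to_pager command, its arguments and the on-call group name are invented for illustration, and only the Nagios macros mentioned in the comments are standard:

```python
#!/usr/bin/env python
# Hypothetical wrapper that a Nagios notification command could call to raise
# a callout on an existing pager system. The '/usr/local/bin/send_to_pager'
# command and the 'tier1-oncall' group are assumptions for illustration only.
import subprocess
import sys

def notify(host, service, state, output):
    """Format a short pager message and hand it to the site pager gateway."""
    message = f"{state}: {service} on {host} - {output}"[:160]  # keep it pager-sized
    # Only page for hard problem states; recoveries could go to email instead.
    if state in ("CRITICAL", "DOWN"):
        subprocess.call(["/usr/local/bin/send_to_pager", "--group", "tier1-oncall", message])

if __name__ == "__main__":
    # Nagios would pass these via macros such as $HOSTNAME$, $SERVICEDESC$,
    # $SERVICESTATE$ and $SERVICEOUTPUT$ in the command definition.
    notify(*sys.argv[1:5])
```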

17  Power Failure: Thursday 7th Feb ~12:15
Work on building power supplies since December
– Down to 1 transformer (of 2) for extended periods (weeks) – increased risk of disaster
– Single transformer running at (close to) maximum operating load
– No problems until the work finished and the casing was being closed: control line crushed and power supply tripped!
First power interruption for over 3 years
Restart (effort > 200 FTE-hours)
– Most Global/National/Tier-1 core systems up by Thursday evening
– Most of the CASTOR/dCache/NFS data services and part of batch up by Friday
– Remaining batch on Saturday/Sunday
– Still problems to iron out in CASTOR on Monday/Tuesday
Lessons
– Communication was prompt and sufficient but ad hoc
– Broadcast unavailable as RAL run the GOCDB (now fixed by caching)
– Careful restart of disk servers slow and labour-intensive (but worked) – will not scale
http://www.gridpp.rl.ac.uk/blog/2008/02/18/review-of-the-recent-power-failure/

18  Power Glitch: Tuesday 6th May ~07:03
County-wide power interruption
At RAL, lost power to ISIS, Atlas, lasers etc
Single phase (B)
– Knocked off some systems, caused reboots of others
– Blew several fuses in the upper machine room
Recovery quite quick
– No opportunity for a controlled restart – most of the systems automatically restarted and had gone through fsck or journal recovery before T1/CASTOR staff arrived

