1 Tier1 Site Report HEPSysMan, RAL May 2007 Martin Bly

2 Overview
RAL / Tier-1
Hardware
Services
Monitoring
Networking

3 RAL / Tier-1
Change in UK science funding structure:
– CCLRC and PPARC have been merged to form a new Research Council: the Science and Technology Facilities Council (STFC)
  Combined remit: looks after large facilities, grants, etc.
– RAL is one of several STFC institutes
– Some internal restructuring and name changes in Business Units
  New corporate styles, etc.
RAL hosts the UK WLCG Tier-1
– Funded via the GridPP2 project by STFC
– Supports WLCG and UK Particle Physics users and collaborators
  atlas, cms, lhcb, alice, dteam, ops, babar, cdf, d0, h1, zeus, bio, cedar, esr, fusion, geant4, ilc, magic, minos, pheno, t2k, mice, sno, ukqcd, harp, theory users …
– Expect no operational change as a result of STFC 'ownership'

4 Finance & Staff
GridPP3 project funding approved
– "From Production to Exploitation"
– Provides for the UK Tier-1, Tier-2s and some software activity
– April 2008 to March 2011
– Tier1:
  Increase in staff: 17 FTE (+3.4 FTE from ESC)
  Hardware resources for WLCG: ~£7.2M
  Tight funding settlement, with contingencies for hardware and power
Additional Tier1 staff now in post
– 2 x systems administrators: James Thorne, Lex Holt
– 1 x hardware technician: James Adams
– 1 x PPS admin: Marian Klein

5 New Computing Building
Funding for a new computer centre building
– Funded by RAL/STFC as part of site infrastructure
– Shared with HPC and other STFC computing facilities
– Design complete: ~300 racks + 3-4 tape silos
– Planning permission granted
– Tender running for construction and fitting out
– Construction starts in July; planned to be ready for occupation by mid-August 2008

6 Tape Silo
Sun SL8500 tape silo
– Expanded from 6,000 to 10,000 slots
– 8 robot trucks
– 18 x T10K, 10 x 9940B drives
  8 x T10K tape drives for CASTOR
– Second silo planned this FY: SL8500, 6,000 slots
  Tape passing between silos may be possible

7 Capacity Hardware FY06/07
CPU
– 64 x 1U twin dual-core Woodcrest 5130 units: ~550 kSI2K
  4GB RAM, 250GB data HDD, dual 1Gb NIC
  Commissioned January 07
– Total capacity ~1550 kSI2K, ~1275 job slots (a back-of-envelope check of the capacity figures follows this slide)
Disk
– 86 x 3U 16-bay servers: 516TB (10^12) data capacity
  3Ware 9550SX, 14 x 500GB data drives, 2 x 250GB system drives
  Twin dual-core Opteron 275 CPUs, 4GB RAM, dual 1Gb NIC
  Commissioned March 07, brought into production service as required
– Total disk storage ~900TB
  ~40TB being phased out at end of life (~5 years)
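
A rough sanity check of these figures, as a sketch only: the per-core rating and the usable-vs-raw disk ratio are derived from the quoted totals, and the assumption that roughly 12 of the 14 data drives' worth of space survives RAID overhead is mine, not stated on the slide.

```python
# Back-of-envelope check of the FY06/07 capacity figures (assumptions noted inline).

# CPU: 64 chassis, each with two dual-core Woodcrest 5130 CPUs.
cpu_units = 64
cores_per_unit = 2 * 2            # twin sockets x dual cores
total_cores = cpu_units * cores_per_unit
tranche_ksi2k = 550               # quoted rating for this tranche
print(f"{total_cores} cores, ~{tranche_ksi2k / total_cores:.1f} kSI2K per core")

# Disk: 86 servers, each with 14 x 500 GB data drives; 516 TB quoted usable.
servers = 86
raw_tb_per_server = 14 * 0.5
usable_tb_per_server = 516 / servers
print(f"raw: {servers * raw_tb_per_server:.0f} TB total; "
      f"per server: {usable_tb_per_server:.1f} TB usable of {raw_tb_per_server:.0f} TB raw")
```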

8 Storage commissioning
Problems with recent storage systems now solved!
– Issue: WD5000YS (500GB) units show 'random throws' (ejections) from RAID arrays
– No host logs of problems; testing the drive offline shows no drive issues
– Common to two completely different hardware configurations
Problem isolated:
– Non-return loop in the drive firmware
– The drive head needs to move occasionally to avoid ploughing a furrow in the platter lubricant; due to timeout issues in some circumstances the drive would sit there stuck, communication with the controller would time out, and the controller would eject the drive
– Pulling the drive resets the electronics, so no problem is evident (or logged)
WD patched the firmware once the problem was isolated
Subsequent reports of the same or similar problem from non-HEP sites

9 Operating systems
Grid services, batch workers, service machines
– Mainly SL3.0.3, SL3.0.5, SL3.0.8, some SL4.2, SL4.4, all ix86
– Planning for x86_64 WNs and SL4 batch services
Disk storage
– New servers using SL4/i386/ext3, some x86_64
  CASTOR, dCache, NFS, Xrootd
– Older servers: SL4 migration in progress
Tape systems
– AIX: ADS tape caches
– Solaris: silo/library controllers
– SL3/4: CASTOR caches, SRMs, tape servers
Oracle systems
– RHEL3/4
Batch system
– Torque/MAUI
  Problems with jobs 'failing to launch'
  Reduced by running Torque with RPP disabled

10 Services
UK National BDII
– Single system was overloaded
  Dropping connections, failing to reply to queries
  Failing SAM tests, detrimental to UK Grid services (and reliability stats!)
– Replaced the single unit with a DNS-'balanced' pair in Feb 07 (a sketch of how such an alias resolves follows this slide)
– Extended to a triplet in March
UIs
– Migration to gLite flavour in May 07
CE
– Overloaded system moved to a twin dual-core (AMD) node with a faster SATA drive
– Plan a second (gLite) CE to split the load
RB
– Second RB added to ease load
PPS
– Service now in production
– Testing gLite-flavour middleware
AFS
– Hardware upgrade postponed, pending review of service needs
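
For illustration, a DNS-'balanced' service is simply an alias that resolves to several A records, so clients querying the alias spread their load across the pair (now triplet) of hosts. A minimal sketch of inspecting such an alias; the hostname below is a placeholder, not the actual RAL BDII alias:

```python
import socket

ALIAS = "lcg-bdii.example.ac.uk"  # hypothetical round-robin alias
BDII_PORT = 2170                  # standard BDII LDAP port

try:
    # getaddrinfo returns one entry per A record behind the alias;
    # a DNS-'balanced' pair or triplet shows up as 2 or 3 addresses.
    records = socket.getaddrinfo(ALIAS, BDII_PORT, socket.AF_INET)
    addresses = sorted({info[4][0] for info in records})
    print(f"{ALIAS} resolves to {len(addresses)} host(s): {', '.join(addresses)}")
except socket.gaierror as err:
    print(f"lookup failed: {err}")
```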

11 Storage Resource Management
dCache
– Performance issues
  LAN performance very good
  WAN performance and tuning problems
– Stability issues
– Now better: increased the number of open file descriptors and the number of logins allowed (a sketch of checking descriptor limits follows this slide); Java 1.4 -> 1.5
ADS
– In-house system, many years old
– Will remain for some legacy services, but not planned for PP
CASTOR
– Replacing both dCache disk and tape SRMs for the major data services
– Replacing T1 access to existing tape services
– Production services for ATLAS, CMS, LHCb
– CSA06 to CASTOR OK
– Support issues
– 'Plan B'
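
As an aside, the open-file-descriptor tuning mentioned above can be sanity-checked from inside a process. A minimal sketch, in Python purely for illustration (the dCache services themselves run in a JVM, and the real limits are set in the service's startup environment):

```python
import resource

# Report the per-process open-file-descriptor limits (soft and hard).
# If the soft limit is still at a low default such as 1024, a busy storage
# door or pool can exhaust it under many concurrent client connections.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open file descriptors: soft limit {soft}, hard limit {hard}")

# A process may raise its own soft limit up to the hard limit without privileges.
if soft < hard:
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    print(f"raised soft limit to {hard}")
```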

12 CASTOR Issues
Lots of issues causing stability problems
Scheduling transfer jobs to servers in the wrong service class
Problems upgrading to the latest version
– T1 running older versions, not in use at CERN
– Struggle to get new versions running on the test instance
– Support patchy
Performance on disk servers with a single file system is poor compared to servers with multiple file systems:
– CASTOR schedules transfers per file system, whereas LSF applies limits per disk server
– A new LSF plug-in should resolve this, but needs the latest LSF and CASTOR
WAN tuning not good for LAN transfers
Problem with 'Reserved Space'
Lots of other niggles and unwarranted assumptions
– Short hostnames

13 Monitoring
Nagios
– Production service implemented
– Replaces SURE for alarm and exception handling (a minimal check-plugin sketch follows this slide)
– 3 servers (1 master + 2 slaves)
– Almost all systems covered (800+)
– Some stability issues with the server (memory use)
– Call-out facilities to be added
Ganglia
– Updating to the latest version (more stable)
CACTI
– Network monitoring
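
For context, Nagios drives its alarm and exception handling from small check plugins that report state via their exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and a one-line summary. A minimal sketch of such a check; the thresholds are made up for illustration and are not the Tier-1's actual settings:

```python
#!/usr/bin/env python
"""Minimal Nagios-style check plugin: warn/alarm on root filesystem usage.

Thresholds are illustrative only, not the Tier-1's configuration.
"""
import os
import sys

WARN_PCT, CRIT_PCT = 80, 90  # hypothetical thresholds

def main():
    st = os.statvfs("/")
    used_pct = 100.0 * (1 - st.f_bavail / float(st.f_blocks))
    if used_pct >= CRIT_PCT:
        print("CRITICAL: / at %.1f%% used" % used_pct)
        sys.exit(2)   # Nagios CRITICAL
    if used_pct >= WARN_PCT:
        print("WARNING: / at %.1f%% used" % used_pct)
        sys.exit(1)   # Nagios WARNING
    print("OK: / at %.1f%% used" % used_pct)
    sys.exit(0)       # Nagios OK

if __name__ == "__main__":
    main()
```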

14 Networking
All systems have 1Gb/s connections
– Except the oldest fraction of the batch farm
10Gb/s interlinks everywhere
– 10Gb/s backbone complete within the T1
  Nortel 5530/5510 stacks
– Reviewing the T1 internal topology: will it meet the intra-farm transfer rates?
– 10Gb/s link to the RAL site backbone
  10Gb/s link to the RAL T2
– 10Gb/s link to the UK academic network, SuperJanet5 (SJ5)
  Direct link to SJ5 rather than via the local MAN
  Active 10 April 2007
  Link to the firewall now at 2Gb/s
– Planned 10Gb/s bypass for T1-T2 data traffic
– 10Gb/s OPN link to CERN
  T1-T1 routing via the OPN being implemented

15 Testing developments
Viglen HX2220i 'Twin' system
– Intel Clovertown quad-core CPUs
– Benchmarking; running in the batch system
Viglen HS216a storage
– 3U 16-bay with 3ware 9650SX-16 controller
– Similar to recent servers, but the controller is PCI-E and supports RAID6
Data Direct Networks storage
– 'RAID'-style controller with disk shelves attached via FC, FC-attached to servers
– Aim to test performance under various load types and SRM clients

16 Comments, Questions?

