SouthGrid Status. Pete Gronbech, 12th March 2008, GridPP 20, Dublin
UK Tier 2 reported CPU – Historical View
UK Tier 2 reported CPU – Feb 2008 View
SouthGrid Sites Accounting as reported by APEL
RAL PPD (600 KSI2K, 158 TB): SL4 CE installed, with some teething problems. 80 TB of storage, plus a further 78 TB which was loaned to the RAL Tier 1. The SRMv2.2 upgrade on the dCache SE proved very tricky; space tokens are not yet defined. A hardware upgrade has been purchased but not yet installed. Some kit is installed in the Atlas Centre, due to power/cooling issues in R1. Two new sys admins started during the last month.
Status at Cambridge (391 KSI2K, 43 TB): 32 Intel Woodcrest servers, giving 128 CPU cores, equivalent to 358 KSI2K. June 2007: storage upgrade of 40 TB, running DPM on SL4 64-bit. Plans to double storage and update CPUs. Condor version is being used. SAM availability high. Lots of work by Graeme and Santanu to get verified for ATLAS production, but there have been recent problems with long jobs failing. Now working with LHCb to solve issues. Problems with accounting: we still don't believe that the work done at Cambridge is reported correctly in the accounting.
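The Woodcrest figures on the Cambridge slide imply a per-server core count and a per-core rating. A minimal sketch of that arithmetic follows; the variable names are illustrative, and the slide does not say where the gap between the 358 KSI2K Woodcrest figure and the 391 KSI2K site headline comes from.

```python
# Figures quoted on the Cambridge slide.
woodcrest_servers = 32
woodcrest_cores = 128
woodcrest_ksi2k = 358.0
site_headline_ksi2k = 391.0

# Cores per server and KSI2K per core implied by those figures.
cores_per_server = woodcrest_cores // woodcrest_servers
ksi2k_per_core = woodcrest_ksi2k / woodcrest_cores

# Part of the site headline not covered by the Woodcrest servers.
# The slide does not attribute this remainder to any hardware.
unattributed_ksi2k = site_headline_ksi2k - woodcrest_ksi2k

print(f"{cores_per_server} cores/server, {ksi2k_per_core:.2f} KSI2K/core")
print(f"{unattributed_ksi2k:.0f} KSI2K not accounted for by the Woodcrest servers")
```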
Birmingham Status - BaBar Cluster (76 KSI2K, ~10 TB-50 TB): Had been unstable, mainly because of failing disks. Very few (<20 out of 120) healthy worker nodes left. Many workers died during two shutdowns (no power to motherboards?). Very time-consuming to maintain. Recently purchased 4 twin Viglen quad-core workers, two of which will go to the grid (2 twin quad-core nodes = 3 racks with 120 nodes!). BaBar cluster withdrawn from the Grid, as effort is better spent getting new resources online.
Birmingham Status - Atlas (grid) Farm: Added 12 local workers to the grid; 20 workers in total -> 40 job slots. Will provide 60 job slots after the local twin boxes are installed. Upgraded to SL4. Installation with kickstart / Cfengine, maintained with Cfengine. VOs: alice atlas babar biomed calice camont cms dteam fusion gridpp hone ilc lhcb ngs.ac.uk ops vo.southgrid.ac.uk zeus. Several broken CPU fans are being replaced.
Birmingham Status - Grid Storage: 1 DPM SL3 head node with 10 TB attached to it. Mainly dedicated to Atlas; no use by Alice, but the latest SL4 DPM provides the xrootd needed by Alice. Have just bought an extra 40 TB. Upgrade strategy: the current DPM head node will be migrated to a new SL4 server, then a DPM pool node will be deployed on the new DPM head node. Performance issues with deleting files on ext3 fs were observed -> should we move to XFS? SRMv2.2 with a 3 TB space token reservation for Atlas published. Latest SRMv2.2 clients (not in gLite yet) installed on the BlueBear UI but not on PP desktops.
Birmingham Status - eScience Cluster: 31 nodes (servers included) with 2 Xeon 3.06 GHz CPUs and 2 GB of RAM, hosted by IS. All on a private network except one NAT node. Torque server on the private network. Connected to the grid via an SL4 CE in Physics; more testing needed. Serves as a model for gLite deployment on the BlueBear cluster -> installation assumes no root access to workers and uses the user tarball method. Aimed to have it passing SAM tests by GridPP20, but may not meet the target, as it was delayed by the security challenge and by helping to set up Atlas on BlueBear. Software area is not large enough to meet the Atlas 100 GB requirement :( ~150 cores will be allocated to the Grid on BlueBear.
Bristol Update: Bristol is pleased to report that, after considerable hard work, LCG on the Bristol University HPC is running well, and the accounting is now showing the promised much higher-spec CPU usage: http://www3.egee.cesga.es/gridsite/accounting/CESGA/tree_egee.php?ExecutingSite=UKI-SOUTHGRID-BRIS-HEP That purple credit goes to Jon Wakelin & Yves Coppens. Work will start soon on testing StoRM on SL4, in preparation for replacing DPM access from HPC with StoRM. DPM will remain in use for the smaller cluster. 50 TB of storage (GPFS) will be ready for PP at least by 1 Sept. The above CE & SE are still on 'proof-of-concept' borrowed hardware. Purchases for new CE/SE/MON & NAT are pending; we would also like to replace the older CE/SE/UI & WN (depends on funding).
Site status and future plans: Oxford (510 KSI2K, 102 TB) –Two sub clusters: (2004) GHz CPUs running SL3, to be upgraded asap; (2007) (Intel quads) running SL4 –New kit installed last Sept. in the local computer room, performing very well so far. –Need to move the 4 grid racks to the new Computer Room at Begbroke Science Park before the end of March
Oxford: Routine updates have brought us to the level required for CCRC08, and our storage had space tokens configured to allow us to take part in CCRC and FDR successfully. We have been maintaining two parallel services, one with SL4 workers and one with SL3, to support VOs that are still migrating. We've been working with Zeus and now have them running on the SL4 system, so the SL3 one is now due for imminent retirement. Overall it has been useful to maintain the two clusters rather than just moving to SL4 in one go. We've been delivering large amounts of work to LHC VOs. In periods where there hasn't been much LHC work available, we've been delivering time to the fusion VO as part of our efforts to bring in resources from non-PP sites such as JET. Oxford is one of the two sites supporting the vo.southgrid.ac.uk regional VO; so far this has only really been testing work, but we have some potentially interested users whom we're hoping to introduce to the grid. On a technical note, Oxford's main CE (t2ce03.physics.ox.ac.uk) and site BDII (t2bdii01.physics.ox.ac.uk) are running on VMware Server virtual machines. This allows good use of hardware and a clean separation of services, and seems to be working very well.
EFDA JET (242 KSI2K, 1.5 TB): Cluster upgraded at the end of November 2007 with 80 Sun Fire X2200 servers with Opteron 2218 CPUs. Worker nodes upgraded to SL4. Have provided a valuable contribution to the ATLAS VO.
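For a rough sense of scale, the headline figures quoted on the per-site slides can be summed. This is only a sketch under stated assumptions: it covers the four sites that quote a single CPU and storage figure, leaves out Bristol (no headline figure given) and the Birmingham BaBar cluster (withdrawn from the Grid, with storage quoted only as a range), and the names `SITES` and `totals` are illustrative rather than taken from the talk.

```python
# Headline (KSI2K, TB) figures as quoted on the per-site slides.
# Bristol and the withdrawn Birmingham BaBar cluster are deliberately left out.
SITES = {
    "RAL PPD": (600.0, 158.0),
    "Cambridge": (391.0, 43.0),
    "Oxford": (510.0, 102.0),
    "EFDA JET": (242.0, 1.5),
}

# RAL PPD storage as broken down on its slide: 80 TB plus 78 TB loaned to the Tier 1.
RAL_PPD_STORAGE_PARTS_TB = (80.0, 78.0)


def totals(sites):
    """Return (total KSI2K, total TB) over the given sites."""
    cpu = sum(ksi2k for ksi2k, _ in sites.values())
    storage = sum(tb for _, tb in sites.values())
    return cpu, storage


cpu_total, storage_total = totals(SITES)
print(f"CPU: {cpu_total:.0f} KSI2K, storage: {storage_total:.1f} TB")

# Consistency check: the RAL PPD parts add up to its 158 TB headline.
print(sum(RAL_PPD_STORAGE_PARTS_TB) == SITES["RAL PPD"][1])
```

These sums describe installed capacity as quoted on the slides, not the CPU usage reported by APEL.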
SouthGrid... Issues? How can SouthGrid become more pro-active with VOs (Atlas)? Alice is very specific with its VOBOX. CMS requires PhEDEx, but RAL PPD may be able to provide the interface for SouthGrid. Zeus and Fusion strongly supported. NGS integration: Oxford has become an affiliate and Birmingham is passing conformance tests. The SouthGrid regional VO will be used to bring local groups to the grid. Considering the importance of accounting, do we need independent cross-checks? Manpower issues supporting APEL? Bham PPS nodes are broken -> PPS service suspended :( What strategy should SouthGrid adopt (PPS needs to do 64-bit testing)?
SouthGrid Summary: Big improvements at Oxford, Bristol and JET. Expansion expected shortly at RAL PPD and Birmingham. Working hard to solve problems with exploiting the resources at Cambridge. It's sometimes an uphill struggle, but the top is getting closer.