NorthGrid Status
Alessandra Forti
GridPP21, Swansea, 4 September 2008
Layout
General status
General news
Site news
VOMS and sysadmin repos
Conclusions
General Status (1)
Site        Middleware  OS   SRM 2.2 / Space tokens  SRM brand      CPU (kSI2K)  Storage (TB)  Used storage (TB)  Average availability
Lancaster   gLite 3.1   SL4  yes                     DPM            476.2        80            39.6               90%
Liverpool   gLite 3.1   SL4  yes                     dCache -> DPM  592          13            2.1                91%
Manchester  gLite 3.1   SL4  yes                     dCache/DPM     2160         162           36.2               93%
Sheffield   gLite 3.1   SL4  yes                     DPM            300          13            2.8                96%
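As a rough illustration of how the per-site figures in the status table combine into region-wide totals, here is a minimal Python sketch. The `Site` class and `summarise` function are hypothetical names introduced for this example. The numbers are read from the table, and where the extracted digits run together (for example Liverpool's storage and CPU) the split used here is an assumption rather than a confirmed value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """One row of the status table (hypothetical helper type)."""
    name: str
    availability_pct: float  # average availability, percent
    used_tb: float           # used storage, TB
    total_tb: float          # installed storage, TB
    cpu_ksi2k: float         # CPU capacity, kSI2K


# Values as read from the table; digit boundaries are an assumption
# where the flattened source ran the numbers together.
SITES = [
    Site("Lancaster", 90, 39.6, 80, 476.2),
    Site("Liverpool", 91, 2.1, 13, 592),
    Site("Manchester", 93, 36.2, 162, 2160),
    Site("Sheffield", 96, 2.8, 13, 300),
]


def summarise(sites):
    """Aggregate storage, CPU and availability across the given sites."""
    total_tb = sum(s.total_tb for s in sites)
    used_tb = sum(s.used_tb for s in sites)
    return {
        "total_tb": total_tb,
        "used_tb": used_tb,
        "used_fraction": used_tb / total_tb,
        "cpu_ksi2k": sum(s.cpu_ksi2k for s in sites),
        # Unweighted mean of the per-site availability figures.
        "mean_availability_pct": sum(s.availability_pct for s in sites) / len(sites),
    }


if __name__ == "__main__":
    for key, value in summarise(SITES).items():
        print(f"{key}: {value:.2f}")
```

The availability mean is deliberately unweighted; weighting it by CPU or storage would be an equally defensible choice and would favour Manchester's figure.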
General news
Manpower changes:
 – Liverpool: the GridPP post will start in the last week of September. The 0.75 FTE for 3 years has been converted to 1 FTE for 2 years.
 – Manchester: the EGEE Deputy coordinator will start on 1 November.
Technical Board Meetings:
 – Frequency increased from one per quarter to one per month.
NorthGrid and ATLAS
 – NorthGrid appears to be the only UK region supplying people for ATLAS shifts.
 – Good level of ATLAS production.
NorthGrid VO used by local groups in Manchester.
Lancaster news
Not much to report (it seems!)
All the data have been moved from dCache to DPM and dCache has been decommissioned.
 – There wasn't much to move.
There have been a few problems with power cuts.
New cluster with 126 job slots and 100 TB of storage is on the way.
 – There have been some delays.
 – Old cluster will remain. Setting up two CEs.
Recently had accounting problems caused by a Tomcat update.
 – It needed to be removed and reinstalled.
Most of the errors reported for Lancaster in the monitoring pages are due to external sources.
 – They should be flagged directly in the monitoring system.
Liverpool news
dCache grievances:
 – Ease of dCache maintenance is a big issue; the initial installation was painful and every single update we've done since has broken something. dCache is just way too complicated for what we need from an SE and we don't have the time or manpower to justify it.
Moving from dCache to DPM
 – A test DPM instance has already been installed; the move will be completed once the new hardware arrives.
 – 54 TB should be added in the near future.
Working to use the University cluster.
Minimum availability 83%, due to the gLite/dCache upgrade, network configuration problems and the university DNS server.
Also had some problems with SAM tests due to the university firewall.
 – It is difficult to remove a service from SAM tests once it has been inserted in the GOCDB. The procedure is convoluted.
Manchester news
dCache upgrade grievances
 – The resilience manager would no longer start.
 – The maximum number of jobs before it started to time out was only 200.
Problems eventually resolved thanks to some serious digging from the developers, who got direct access to the system.
 – It turns out that a static parameter hadn't been changed in the configuration files for the resilience manager.
Resilience is incompatible with space tokens anyway.
DPM instance with 6 TB installed for ATLAS production.
 – Eventually new storage will be added to DPM.
DPM will be dedicated to ATLAS.
 – dCache on WN for all the other VOs that don't have as many requirements.
ATLAS split Manchester into two sites in their configuration.
 – This massively improved the efficiency in production.
Minimum availability 79%, due to the dCache upgrade and collateral problems.
Sheffield news
Problems with university DNS.
Bought new hardware for the services (SE, CE and Mon box).
 – Spent July tuning them and this has affected the availability (still 90%).
Already increased storage space to 13 TB.
 – This is online.
A further 16 TB is on the way.
 – The hardware is there but the fans are missing.
CPU increased from 182 to 300 kSI2K.
Very good productivity for ATLAS.
Availability was never below 90% in the past 6 months.
VOMS and Repos
VOMS
 – The skipcacheck option has been enabled on the GridPP VOMS.
 – This should spare users future problems with CA rollover.
Sergey is also testing the new VOMS version.
A new YUM repository has been enabled on www.sysadmin.hep.ac.uk (other face of www.gridpp.ac.uk): egee-SA1
 – EGEE-SA1 is now distributing system monitoring and management tools (following on from the WLCG monitoring WG work with Nagios). There is a single repository for this (monitoring clients+servers, messaging clients+servers). This will eventually also include user-donated system management tools (e.g. FTSMon, WMSMon) that are approved by the EGEE Operations Automation Team.
Manchester people are also using the UKI-NORTHGRID-MAN-HEP svn repository.
Conclusions
Storage looking good
 – All the sites have SRM 2.2 and space tokens enabled.
 – dCache relegated to a lesser role (or completely eliminated), which should increase stability.
 – All sites are bidding for additional storage or have already bought it.
 – Manchester's numerous problems with dCache and with ATLAS's way of representing it have been solved.
The sites are really active in ATLAS and the level of productivity is high.
Just in time