WLCG Service Report ~~~ WLCG Management Board, 7th July 2009



Introduction
- Quiet week again
- Decreasing participation
- No alarm tickets
- Incidents leading to post-mortem
  - ATLAS post-mortem
  - FZK posted a post-mortem explaining their tape problems during STEP09
- RAL scheduled downtime for move to new Data Centre
- ASGC recovering?

Decreasing participation STEP09

GGUS summary

VO      User  Team  Alarm  Total
ALICE      2     1      0      3
ATLAS
CMS        3     0      0      3
LHCb
Totals

LHCb Team Tickets drifting up?
- Jobs failed or aborted at Tier 2: 8 tickets (5 of these 8 still open, all others closed)
- gLite WMS issues at Tier 1 (temporary): 5
- Data transfers to Tier 1 failing (disk full): 1
- Software area files owned by root: 1
- CE marked down but accepting jobs: 1
- Nothing really unusual
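As a quick cross-check of the breakdown above, the per-category counts can be summed. This is only a minimal sketch: the dictionary keys paraphrase the slide's categories, and the variable names are illustrative rather than anything used by GGUS.

```python
# Ticket counts per category, copied from the LHCb team ticket breakdown.
lhcb_team_tickets = {
    "Jobs failed or aborted at Tier 2": 8,
    "gLite WMS issues at Tier 1 (temporary)": 5,
    "Data transfers to Tier 1 failing (disk full)": 1,
    "Software area files owned by root": 1,
    "CE marked down but accepting jobs": 1,
}

# Total number of tickets in the breakdown.
total = sum(lhcb_team_tickets.values())

# Fraction of the tickets that come from Tier 2 job failures.
share_tier2 = lhcb_team_tickets["Jobs failed or aborted at Tier 2"] / total

print(f"LHCb team tickets in the breakdown: {total}")
print(f"Share from Tier 2 job failures: {share_tier2:.0%}")
```

Under these numbers the categories add up to 16 tickets, half of which are Tier 2 job failures.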


PVSS2COOL incident 27-6 (1/3)
Incident report and affected services: On Sunday afternoon (27-6), Viatcheslav Khomutnikov (Slava) from ATLAS reported to the Physics DB service that the online reconstruction had stopped because an error was returned by the PVSS2COOL application (on the ATLAS offline DB). The error started appearing on Saturday (26-6) evening.

PVSS2COOL incident 27-6 (2/3)
Issue analysis and actions taken: The error stack reported by ATLAS indicated that the error was generated by a 'drop table' operation being blocked by the custom trigger that ATLAS set up to prevent 'unwanted' segment drops. The trigger has been operational for several months. Physics DB services fed this information back to ATLAS on Sunday evening. On Monday morning ATLAS still reported the blocking issue, and on further investigation they could not find which table the application (PVSS2COOL) wanted to drop, since the error appeared in a block of code responsible for inserting data. The Physics DB service, in collaboration with the ATLAS DBAs, then ran 'logmining' of the failed drop operation and found that the application was indeed trying to drop some segments in the recycle bin of the schema owner (ATLAS_COOLOFL_DCS). Further investigation with SQL trace by the DBAs showed that Oracle attempted to drop objects in the recycle bin when PVSS2COOL wanted to bulk insert data. This operation was then blocked by the custom ATLAS trigger that blocks drops in production, hence the error message originally reported. Metalink note " " then further clarified that the issue was a side effect of an expected behaviour of Oracle's space reclamation process.
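The sequence described in this analysis can be illustrated with a toy model. This is a sketch under stated assumptions, not Oracle's actual implementation: the class `ToySchema`, its methods, the exception `DropBlocked`, and all the space figures are hypothetical and exist only to show why an insert can end in a "drop blocked" error.

```python
class DropBlocked(Exception):
    """Raised when the (hypothetical) block-drop trigger vetoes a drop."""


class ToySchema:
    """Toy model of a schema stored in a datafile of fixed size.

    Only mimics the sequence in the incident report; all names and
    numbers are illustrative assumptions.
    """

    def __init__(self, datafile_size, used, recycle_bin, block_drops=True):
        self.datafile_size = datafile_size  # total space units
        self.used = used                    # space held by live data
        self.recycle_bin = recycle_bin      # space held by dropped, unpurged segments
        self.block_drops = block_drops      # stands in for the custom trigger

    def free_space(self):
        return self.datafile_size - self.used - self.recycle_bin

    def _drop_from_recycle_bin(self):
        # The trigger fires on any drop, including this implicit purge.
        if self.block_drops:
            raise DropBlocked("drop blocked by trigger")
        self.recycle_bin = 0

    def bulk_insert(self, size):
        # Space reclamation: purge the recycle bin only when the insert does not fit.
        if size > self.free_space() and self.recycle_bin > 0:
            self._drop_from_recycle_bin()
        if size > self.free_space():
            raise RuntimeError("out of space")
        self.used += size

    def extend_datafile(self, extra):
        self.datafile_size += extra


schema = ToySchema(datafile_size=100, used=80, recycle_bin=15)

try:
    schema.bulk_insert(10)  # only 5 units free, so a purge is attempted
    outcome_before = "ok"
except DropBlocked:
    outcome_before = "blocked"

schema.extend_datafile(50)  # enough free space, so no purge is needed
schema.bulk_insert(10)
outcome_after = "ok"

print(outcome_before, outcome_after)
```

In this model the insert itself never issues a drop; the drop is a side effect of reclaiming space, which is why the error surfaced in code that only inserts data.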

PVSS2COOL incident 27-6 (3/3)
Issue resolution and expected follow-up: On the evening of 29-6, Physics DB support, in collaboration with the ATLAS DBAs, extended the datafile of the PVSS2COOL application to circumvent this space reclamation issue. ATLAS has reported that this fixed the issue. Further discussions on the role of the recycle bin and on possible improvements to the ATLAS 'block drop trigger' are in progress, to avoid further occurrences of this issue.

FZK tape problems during STEP09
Jos posted a post-mortem analysis of the tape problems seen at FZK during STEP09: dKa.pdf
Too long to fit here, but in summary:
- Before STEP09
  - An update to fix a minor problem in the tape library manager resulted in stability problems
  - Possible cause: SAN or library configuration
  - Both were tried and the problem disappeared, but which one was the root cause?
  - Second SAN had reduced connectivity to dCache pools: not enough for CMS and ATLAS at the same time, so CMS was asked not to use tape
- First week of STEP09
  - Many problems: hardware (disk, library, tape drives), software (TSM)
- Second week of STEP09
  - Adding two more dedicated stager hosts resulted in better stability
  - Finally getting stable rates of 100 to 150 MB/s
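To put the quoted stable tape rates in perspective, they can be converted to a daily volume. This is a minimal sketch assuming decimal units (1 TB = 1,000,000 MB) and a rate sustained for a full 24 hours; the helper name `mb_per_s_to_tb_per_day` is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def mb_per_s_to_tb_per_day(rate_mb_s):
    """Convert a sustained rate in MB/s to TB/day (decimal units)."""
    return rate_mb_s * SECONDS_PER_DAY / 1_000_000


low = mb_per_s_to_tb_per_day(100)
high = mb_per_s_to_tb_per_day(150)

print(f"{low:.2f} to {high:.2f} TB/day")
```

Under these assumptions, 100 to 150 MB/s corresponds to roughly 8.6 to 13 TB per day.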

RAL scheduled downtime for DC move
- Friday 3/7: reported still on schedule for restoring CASTOR and Batch on Monday 6/7
- Despite presumably hectic activity with equipment movements, RAL continued to attend the daily conf call
- Planning and detailed progress reported at:

R89 Migration: Friday 3rd July
Posted by Andrew Sansum 12:00
Our last dash towards restoration of the production service is under way. All racks of disk servers have now had a first pass check. Faults list is currently 11 servers, although some of these may well be trivial. We expect to provide a large number of disk servers to the CASTOR team later today.

ASGC instabilities
- ATLAS reported instabilities at the beginning of the week
  - Monday: functional tests worked but still some problems with Tier-1 to Tier-2 transfers
  - Another unscheduled downtime (recabling of CASTOR disk servers)
- CMS allowed ASGC a full week's grace period to recover from all its problems
  - No new tickets, and open tickets put on hold
  - Resume on Monday 6/7
- Both ATLAS-specific and CMS-specific site tests changed from red to green during the week
- Friday 3/7: Gang reports that tape drives and servers are online

Summary
- Daily meeting attendance is degrading (holidays...?)
- No new serious site issues
- RAL long downtime for DC move is progressing to plan (Tuesday report: RAL back apart from CASTORATLAS, some network instability)
- Tape problems at FZK during STEP09 understood
- ASGC is recovering?