WLCG Service Report 5th – 18th July

WLCG Service Report 5th – 18th July Maria Girone, IT-DM ~~~ WLCG Management Board, 21st July 2009

Overview
Quiet two weeks
No alarm tickets
Good participation, including better reporting from FZK and ASGC – many thanks!
STEP’09 post-mortem workshop held at end of first week (Agenda)
RAL move to new machine room successfully completed at beginning of this period (more on Planet WLCG)
NIKHEF cooling problem: 30% capacity off until move to new CC (foreseen 10 – 21 August)
Incidents leading to postmortem:
  ATLAS Central Catalogs degradation: https://twiki.cern.ch/twiki/bin/view/LCG/PostMortem13Jul09
  CNAF LFC problem on 12th-13th July (just received)

GGUS summary (2 weeks)

VO       User   Team   Alarm   Total
ALICE       3      0       0       3
ATLAS      20     64       0      84
CMS        15      0       0      15
LHCb        1     41       0      42
Totals     39    105       0     144
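The totals in the GGUS summary can be cross-checked with a few lines of Python. This is only a sketch: the per-VO numbers are read from the slide, and the cells that are blank in the slide are assumed to be zero, which is consistent with the "No alarm tickets" statement in the overview. The names `tickets` and `summarize` are illustrative and do not come from the report.

```python
# Per-VO ticket counts as (user, team, alarm), read from the slide.
# Assumption: cells left blank on the slide are treated as 0.
tickets = {
    "ALICE": (3, 0, 0),
    "ATLAS": (20, 64, 0),
    "CMS": (15, 0, 0),
    "LHCb": (1, 41, 0),
}


def summarize(counts):
    """Return per-VO rows with a row total, plus the column totals."""
    rows = {vo: (u, t, a, u + t + a) for vo, (u, t, a) in counts.items()}
    # Sum each column (user, team, alarm, total) across all VOs.
    totals = tuple(sum(col) for col in zip(*rows.values()))
    return rows, totals


rows, totals = summarize(tickets)
for vo, row in rows.items():
    print(vo, *row)
print("Totals", *totals)
```

Under these assumptions the column sums match the Totals row shown on the slide.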


Site Availability Summary
ATLAS
  RAL: 7th-8th July – SRM instabilities (recovering from a long downtime)
  FZK: 7th-9th July – SRM instabilities
  CNAF: 12th July – LFC DB problems (post-mortem?). Also some LFC instabilities on the 13th due to network glitches
  SARA: 17th – today – unscheduled downtime on one CE (job submission)
CMS
  PIC: 7th-8th July – job submission failures due to batch system misconfiguration
  ASGC: 7th – today – CASTOR SAM test timeouts (long queues in the CASTOR job scheduler under load)
LHCb
  All T1s: 7th-8th July – SAM tests not running properly due to a misconfiguration in DIRAC

SIR on ATLAS Central Catalog (performance degradation)
DB service interruption, with session kill for several connected sessions, on Sunday 12th at 10:26.
Full service connectivity was restored at 10:28 and again at 10:32.
Full service connectivity restored at 10:36.
Problem understood: caused by an incorrect DBA operation when increasing the recovery area (an alert is sent when it reaches 85%).
https://twiki.cern.ch/twiki/bin/view/LCG/PostMortem13Jul09

SIR on CNAF LFC problem
On Sunday 12 July at 01:13, the ATLAS LFC standby database in Roma became unreachable because of a storage problem.
In addition, on Sunday afternoon at CNAF, a problem that is not yet well understood caused several Oracle clusters, including the ATLAS LFC one, to lose connectivity to the storage area network. Because of this connectivity problem, several clusters were automatically rebooted. After the reboot, the connection between the LFC front-end and the back-end was automatically restored, but the software was not functional.
On Monday the 13th, the database was hung with error ORA-29702 (error in cluster group service operation). We found a large number of connections on the database (of the order of 100), while the usual number is 40.
Investigating this problem is difficult because the LFC front-end logs have a gap between July 12 at 22:41 and July 13 at 10:19, probably because the lfcdaemon was hung.
As the database in Roma was unavailable, the failover did not succeed. The service was restored on the evening of Monday the 13th at both the CNAF and Roma sites.
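A gap in a log like the one described for the LFC front-end can be located mechanically by comparing consecutive timestamps. The sketch below is only an illustration and assumes the timestamps have already been parsed into `datetime` objects, since the actual LFC log format is not given in the report. The function name `find_gaps`, the 30-minute threshold, and the two sample entries at 22:40 and 10:20 are hypothetical; only the 22:41 and 10:19 boundaries come from the report.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, max_silence=timedelta(minutes=30)):
    """Return (last_seen, next_seen) pairs separated by more than max_silence."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_silence]


# Hypothetical parsed log timestamps; 22:41 and 10:19 are the gap
# boundaries quoted in the report, the other two entries are invented.
stamps = [
    datetime(2009, 7, 12, 22, 40),
    datetime(2009, 7, 12, 22, 41),
    datetime(2009, 7, 13, 10, 19),
    datetime(2009, 7, 13, 10, 20),
]

gaps = find_gaps(stamps)
for start, end in gaps:
    print(f"no log entries between {start} and {end} ({end - start})")
```

With these sample entries the only gap reported is the one from 22:41 to 10:19, which lasts 11 hours and 38 minutes.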

Summary
No new serious site issues
Good participation locally and remotely in week 2; week 1 perturbed by F2F meetings and workshop
RAL long downtime for DC move completed
ASGC is recovering…
NIKHEF cooling problem: 30% capacity off until move to new CC (foreseen 10 – 21 August)