GGUS summary (4 weeks)

VO       User   Team   Alarm   Total
ALICE       1      1       0       2
ATLAS       7    117       4     128
CMS        14      5       1      20
LHCb        5     28       0      33
Totals     27    151       5     183
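The totals in the summary table can be cross-checked with a few lines of Python. This is only a minimal sketch: the counts are copied from the table above, and the names `rows`, `totals` and `check_summary` are hypothetical helpers introduced for illustration, not part of any GGUS tooling.

```python
# Ticket counts per VO, copied from the summary table: (User, Team, Alarm, Total)
rows = {
    "ALICE": (1, 1, 0, 2),
    "ATLAS": (7, 117, 4, 128),
    "CMS": (14, 5, 1, 20),
    "LHCb": (5, 28, 0, 33),
}
# The "Totals" row of the same table
totals = (27, 151, 5, 183)


def check_summary(rows, totals):
    """Return a list of inconsistencies; an empty list means the table adds up."""
    problems = []
    # Each row: User + Team + Alarm must equal the row's Total
    for vo, (user, team, alarm, total) in rows.items():
        if user + team + alarm != total:
            problems.append(f"{vo}: row sum {user + team + alarm} != {total}")
    # Each column: the sum over all VOs must equal the Totals row
    column_sums = tuple(sum(col) for col in zip(*rows.values()))
    if column_sums != totals:
        problems.append(f"column sums {column_sums} != {totals}")
    return problems


problems = check_summary(rows, totals)
print(problems or "table is consistent")
```

Checking both rows and columns catches a mistyped cell even when the grand total happens to match.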


WLCG MB Report, WLCG Service Report, 1/9/2016

Support-related events since last MB

There have been 5 real ALARMs since the 2012/11/20 MB. 1 was submitted by CMS and 4 by ATLAS.
3 of these ALARMs were submitted during the weekend.
All concerned the CERN site.
2 GGUS Releases took place since the last MB, on 2012/11/28 & 2012/12/12.
All ALARM tests were successful (operators received notification, reacted within minutes, interfaces worked, experts closed promptly).

ATLAS ALARM->CERN AFS REL. AREA INACCESSIBLE GGUS:88856

What time UTC              What happened
2012/11/25 16:31 SUNDAY    GGUS ALARM ticket opened, automatic notification to AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: File Access.
2012/11/25 16:40           Operator records in the ticket that AFS piquet is working on the issue.
2012/11/25 17:21           Supporter explains that the physical afs server had a hard disk issue, solved by a reboot.
2012/11/25 17:59           Submitter confirms service quality is improving.
2012/11/26 08:14 MONDAY    Another ATLAS supporter reports atlas.web.cern.ch problems in the same ticket, because the site is afs-hosted.
2012/11/26 09:59           Ticket ‘solved’ after exchange of 8 comments, where afs experts insisted on distinguishing the ATLAS afs file access problems from the web ones (hosts involved: afs154.cern.ch vs afs140.cern.ch).

ATLAS ALARM->CERN CASTOR FILE EXPORT PROBLEMS GGUS:89107

What time UTC              What happened
2012/12/02 00:17 SUNDAY    GGUS ALARM ticket opened, automatic notification to AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: File Access.
2012/12/02 00:26           Expert confirms in the ticket that investigation started.
2012/12/02 00:28           Operator records in the ticket that CASTOR piquet was called.
2012/12/02 01:15           Expert puts the ticket to status ‘solved’ after identifying a problem on the node where the files reside and rebooting it (4 comments exchanged).
2012/12/02 05:49 MONDAY    Ticket ‘re-opened’ because 1 of the 2 files was not transferred.
2012/12/02 11:18           Ticket ‘solved’ after exchange of 6 comments and migration of the file-to-transfer to a more stable machine.

CMS ALARM->CERN SRM UNREACHABLE GGUS:89186

What time UTC              What happened
2012/12/04 14:27           GGUS ALARM ticket opened, automatic notification to AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Storage Systems.
2012/12/04 14:27           Expert confirms in the ticket that investigation started.
2012/12/04 14:38           Operator records in the ticket that CASTOR piquet was called.
2012/12/04 14:42           Expert and submitter agree the symptoms lasted for 1.5 hrs and disappeared, also from SLS.
2012/12/04 16:22           Ticket ‘solved’ and cause fully understood. A bug was revealed by a chain of srmAbort & srmReleaseFiles requests. The patch will be applied during an agreed quiet LHC operations’ period.
2012/12/04 17:09           Ticket ‘verified’ by the submitter, although he doubts the problem is really solved (he wrote that transfer errors still persisted).

ATLAS ALARM->CERN SLOW LSF GGUS:89202

What time UTC              What happened
2012/12/04 23:46           GGUS ALARM ticket opened, automatic notification to AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Local Batch Systems.
2012/12/04 23:50           Submitter attaches to the ticket plots showing number of pending jobs & bsub average time.
2012/12/05 00:02           Operator records in the ticket that Local Batch System piquet was called.
2012/12/05 00:06           Time is always UTC! Expert comments that the reason for this slowness is a reconfiguration that started at 0:00 CET and took too long to finish.
2012/12/05 07:59           Ticket ‘solved’ by the expert with a note for pending action: “we need to understand why the system took so long to reconfigure”.
2012/12/05 09:29           Ticket ‘verified’ by the submitter → no way to get further updates on the answer to the above question about the slow reconfiguration.

ATLAS ALARM->CERN WEB SERVER DOWN GGUS:89334

What time UTC              What happened
2012/12/08 09:02 SATURDAY  GGUS ALARM ticket opened, automatic notification to AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Monitoring.
2012/12/08 09:13           Operator records in the ticket that the person responsible for the host concerned, webafs10.cern.ch, is checking.
2012/12/08 09:32           Service mgr asks if the situation has improved and explains that the cause is an intervention on the power infrastructure that took longer than planned.
2012/12/08 09:39           Service mgr asks the submitter twice if the service is restored, then, having no answer, leaves the ticket with web 3rd Line Support.
2012/12/08 09:47           Ticket ‘solved’ by the expert. Reason was “power cut”.
2012/12/09 18:49           Ticket set to ‘solved’ again by the supporter doing these drills, as an acknowledgment by the submitter had caused a re-opening. ‘Verified’ on 2012/12/11.
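
For a quick comparison of the five ALARM drills, the time from ticket opening to the first ‘solved’ status can be computed from the timestamps listed in the timelines. The sketch below assumes that "first ‘solved’" is the interval of interest (GGUS:89107 and GGUS:89334 were later re-opened); the names `tickets`, `FMT` and `minutes_to_solved` are hypothetical and introduced only for this illustration.

```python
from datetime import datetime

# Timestamp layout used in the timelines (all times are UTC)
FMT = "%Y/%m/%d %H:%M"

# (opened, first 'solved') per ticket, copied from the timelines
tickets = {
    "GGUS:88856": ("2012/11/25 16:31", "2012/11/26 09:59"),
    "GGUS:89107": ("2012/12/02 00:17", "2012/12/02 01:15"),
    "GGUS:89186": ("2012/12/04 14:27", "2012/12/04 16:22"),
    "GGUS:89202": ("2012/12/04 23:46", "2012/12/05 07:59"),
    "GGUS:89334": ("2012/12/08 09:02", "2012/12/08 09:47"),
}


def minutes_to_solved(opened, solved):
    """Whole minutes elapsed between two timestamps given in FMT."""
    delta = datetime.strptime(solved, FMT) - datetime.strptime(opened, FMT)
    return int(delta.total_seconds() // 60)


durations = {t: minutes_to_solved(*ts) for t, ts in tickets.items()}
for ticket, minutes in durations.items():
    print(f"{ticket}: {minutes // 60}h{minutes % 60:02d}m")
```

Because every timestamp is already in UTC, no time-zone handling is needed; mixing in local times such as the 0:00 CET mentioned for GGUS:89202 would require converting them first.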