GGUS summary (4 weeks)

VO      User  Team  Alarm  Total
ALICE     10     0      1     11
ATLAS     33   169      7    209
CMS       11     7      2     20
LHCb       3    25      1     29
Totals    57   201     11    269
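The figures in the summary table can be cross-checked with a short script. What follows is a minimal sketch in Python, reading the counts as User, Team, Alarm and Total per VO; the names `SUMMARY`, `TOTALS` and `check_summary` are illustrative and not part of GGUS.

```python
# Ticket counts per VO as (user, team, alarm, total), taken from the summary table.
SUMMARY = {
    "ALICE": (10, 0, 1, 11),
    "ATLAS": (33, 169, 7, 209),
    "CMS": (11, 7, 2, 20),
    "LHCb": (3, 25, 1, 29),
}

# The "Totals" row of the same table.
TOTALS = (57, 201, 11, 269)


def check_summary(summary, totals):
    """Return True if every row and every column of the table adds up."""
    # Each row: user + team + alarm must equal the row total.
    for user, team, alarm, total in summary.values():
        if user + team + alarm != total:
            return False
    # Each column: the sum over all VOs must equal the Totals row.
    column_sums = tuple(sum(column) for column in zip(*summary.values()))
    return column_sums == totals


print("table is consistent:", check_summary(SUMMARY, TOTALS))
```

The same check can be rerun on any later summary by replacing the two constants.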


10/11/2015 WLCG MB Report, WLCG Service Report

Support-related events since last MB

A reminder of the meaning and workflow of TEAM tickets for the Tier0 was presented at the 2011/03/17 T1SCM. Slide available here.

Their only advantage over ‘user’ tickets is the co-ownership of the ticket by all TEAMers. They do not imply a higher ‘importance’.

Direct site notification by email is triggered by GGUS for ‘user’ tickets as well, provided the ‘Notify site’ field is used. Slide available here.

There were 6 real ALARM tickets since the 2011/03/08 MB (4 weeks), 5 submitted by ATLAS and 1 by CMS. The notified sites were IN2P3 (1 ticket) and CERN-PROD (5 tickets).

AFS performance became an issue for all experiments.

The GGUS ALARM test suite was run on 2011/03/30 (release date).

A special GGUS-to-SNOW route entered production, allowing service managers to get direct ticket assignment in SNOW.

Details follow…

ATLAS ALARM->IN2P3 DATA COPY FROM CERN FAILS GGUS:68794

What time UTC      What happened
2011/03/19 19:35   (Saturday) GGUS ALARM ticket, automatic site notification and automatic assignment to NGI_France.
2011/03/19 19:40   Automatic acknowledgement of ALARM registration.
2011/03/19 19:54   Service manager identifies a problem with SRM.
2011/03/19 21:16   Service manager suggests putting the site at risk, as the SRM database problem persists and is not understood.
2011/03/19 22:15   ATLAS stops using the site for the rest of the weekend.
2011/03/20 12:38   Site reports things are better now.
2011/03/21 08:09   Ticket set to ‘solved’.

IN2P3 reported on Monday that a Friday intervention had caused this incident.

ATLAS ALARM->CERN LSF NO JOB ACCEPTED GGUS:68795

What time UTC      What happened
2011/03/19 21:22   (Saturday) GGUS ALARM ticket, automatic site notification and automatic assignment to ROC_CERN.
2011/03/19 21:40   Operator acknowledges and records in the GGUS ticket that [...] were [...].
2011/03/19 23:05   CMS expert comments in the GGUS ticket that a user submitted 180K jobs by mistake.
2011/03/20 05:34   Service manager sets the ticket to ‘solved’ once the number of queued jobs was reduced.
2011/03/20 06:11   Submitter sets the ticket to status ‘verified’.

In the days following the incident, a limit on the number of jobs was set in LSF to avoid such a blockage in the future.

ATLAS ALARM->CERN CASTOR DOWN GGUS:68949

What time UTC      What happened
2011/03/25 11:59   GGUS ALARM ticket, automatic site notification and automatic assignment to ROC_CERN.
2011/03/25 12:05   Operator acknowledges and records in the GGUS ticket that the Castor piquet was contacted.
2011/03/25 12:15   Expert on call records in the ticket that the problem is understood and fixed (it also affected CMS).
2011/03/25 14:09   Service manager sets the ticket to ‘solved’ with the description: ‘incident caused by an incorrect conf. that was loaded at the wrong time. A mistake made as part of the SL5 upgrade.’

CMS ALARM->CERN CASTOR DOWN GGUS:68952

What time UTC      What happened
2011/03/25 12:32   GGUS ALARM ticket, automatic site notification and automatic assignment to ROC_CERN.
2011/03/25 12:37   Operator acknowledges and records in the GGUS ticket that the Castor piquet was contacted.
2011/03/25 12:41   Expert on call records in the ticket that the problem is understood and fixed (as per ATLAS GGUS:68949).
2011/03/25 14:42   Service manager sets the ticket to ‘solved’. Reason was human error. Details in slide [...].
2011/03/25 14:51   Submitter sets the ticket to ‘verified’. He had already lowered the ticket priority at 12:37, as the problem went away quickly.

ATLAS ALARM->CERN AFS NOT RESPONDING GGUS:69121

What time UTC      What happened
2011/03/29 11:13   GGUS ALARM ticket, automatic site notification and automatic assignment to ROC_CERN.
2011/03/29 11:26   Operator acknowledges and records in the GGUS ticket that [...] was sent to the AFS service.
2011/03/29 13:43   Service manager sets the ticket to ‘solved’. Reason was a hardware failure that rendered 3 partitions and 110 ATLAS volumes inaccessible.
2011/03/29 13:49   Submitter sets the ticket to status ‘verified’.

ATLAS ALARM->CERN AFS S/W REL. AREA UNAVAILABLE GGUS:69192

What time UTC      What happened
2011/03/31 07:25   GGUS ALARM ticket, automatic site notification and automatic assignment to ROC_CERN. No entry by the operator in the ticket!! Maybe the operator forgot to record the call.
2011/03/31 08:39   Service manager records in the ticket that investigation has started.
2011/03/31 08:39   Experiment member complains in the ticket about the frequency of AFS problems.
2011/03/31 10:46   AFS expert records ‘problem found on server afs151: device mapper s/w RAID layer was stuck in a loop after a h/w error, blocking all I/O’.
2011/03/31 12:46   Service manager sets the ticket to status ‘solved’.
2011/03/31 15:25   Submitter sets the ticket to status ‘verified’.
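Each timeline gives the time the ALARM ticket was submitted and the time it was set to ‘solved’, so the time to resolution can be worked out directly. What follows is a minimal sketch in Python with the timestamps copied from the slides; the names `TICKETS` and `minutes_to_solved` are illustrative and not part of GGUS.

```python
from datetime import datetime

# Timestamp layout used in the "What time UTC" column.
FMT = "%Y/%m/%d %H:%M"

# (ticket, submitted UTC, set to 'solved' UTC), copied from the timelines.
TICKETS = [
    ("GGUS:68794", "2011/03/19 19:35", "2011/03/21 08:09"),
    ("GGUS:68795", "2011/03/19 21:22", "2011/03/20 05:34"),
    ("GGUS:68949", "2011/03/25 11:59", "2011/03/25 14:09"),
    ("GGUS:68952", "2011/03/25 12:32", "2011/03/25 14:42"),
    ("GGUS:69121", "2011/03/29 11:13", "2011/03/29 13:43"),
    ("GGUS:69192", "2011/03/31 07:25", "2011/03/31 12:46"),
]


def minutes_to_solved(submitted, solved):
    """Whole minutes between submission and the 'solved' status change."""
    delta = datetime.strptime(solved, FMT) - datetime.strptime(submitted, FMT)
    return int(delta.total_seconds() // 60)


for ticket, submitted, solved in TICKETS:
    minutes = minutes_to_solved(submitted, solved)
    print(f"{ticket}: {minutes // 60}h{minutes % 60:02d}m from submission to 'solved'")
```

This measures only the interval recorded in the ticket. It says nothing about when the service actually recovered, which in several cases was earlier than the status change.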