GGUS summary (4 weeks)


1 GGUS summary (4 weeks)

VO       User   Team   Alarm   Total
ALICE      10      0       1      11
ATLAS      33    169       7     209
CMS        11      7       2      20
LHCb        3     25       1      29
Totals     57    201      11     269
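The figures on this slide are internally consistent when read as User, Team, Alarm and Total columns per VO. The short check below is only a sketch of that arithmetic: the dictionary layout and variable names are illustrative assumptions, and the numbers are copied from the slide.

```python
# Per-VO ticket counts as read from the summary slide: (user, team, alarm).
tickets = {
    "ALICE": (10, 0, 1),
    "ATLAS": (33, 169, 7),
    "CMS": (11, 7, 2),
    "LHCb": (3, 25, 1),
}

# Row totals: tickets per VO summed over the three ticket types.
row_totals = {vo: sum(counts) for vo, counts in tickets.items()}

# Column totals: tickets per type summed over all VOs.
user_total, team_total, alarm_total = (sum(col) for col in zip(*tickets.values()))

# Grand total over all VOs and ticket types.
grand_total = sum(row_totals.values())

print(row_totals)
print(user_total, team_total, alarm_total, grand_total)
```

The computed row and column sums can be compared against the Total column and the Totals row of the slide.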

2 Support-related events since last MB
10/11/2015 WLCG MB Report, WLCG Service Report

A reminder of the meaning and workflow of TEAM tickets for the Tier0 was presented at the 2011/03/17 T1SCM. Slide available here.
Their only advantage over ‘user’ tickets is the co-ownership of the ticket by all TEAMers. They do not imply a higher ‘importance’.
GGUS also triggers direct site notification by email for ‘user’ tickets, provided the ‘Notify site’ field is used.
There were 6 real ALARM tickets since the 2011/03/08 MB (4 weeks), all submitted by ATLAS, notified sites IN2P3 (1 ticket) and CERN-PROD (5 tickets).
AFS performance became an issue for all experiments.
The GGUS ALARM test suite was issued on 2011/03/30 (the release date).
A special GGUS-to-SNOW route entered production, allowing service managers to get direct ticket assignment in SNOW.
Details follow…

3 ATLAS ALARM->IN2P3 DATA COPY FROM CERN FAILS GGUS:68794

What time (UTC)              What happened
2011/03/19 19:35 (Saturday)  GGUS ALARM ticket, automatic email notification to lhc-alarms@cc.in2p3.fr AND automatic assignment to NGI_France.
2011/03/19 19:40             Automatic email acknowledgement of ALARM registration.
2011/03/19 19:54             Service manager identifies a problem with SRM.
2011/03/19 21:16             Service manager suggests putting the site at risk, as the SRM database problem persists and is not understood.
2011/03/19 22:15             ATLAS stops using the site for the rest of the weekend.
2011/03/20 12:38             Site reports things are better now.
2011/03/21 08:09             Ticket set to ‘solved’. A Friday intervention was the reason for this incident, as IN2P3 reported on Monday.

4 ATLAS ALARM->CERN LSF NO JOB ACCEPTED GGUS:68795

What time (UTC)              What happened
2011/03/19 21:22 (Saturday)  GGUS ALARM ticket, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN.
2011/03/19 21:40             Operator acknowledges and records in the GGUS ticket that it-dep-pes-ps-sms@cern.ch were contacted.
2011/03/19 23:05             CMS expert comments in the GGUS ticket that a user submitted 180K jobs by mistake.
2011/03/20 05:34             Service manager sets the ticket to ‘solved’ once the number of queued jobs is reduced.
2011/03/20 06:11             Submitter sets the ticket to status ‘verified’. In the days following the incident, a limit on the number of jobs was put in LSF to avoid such blockage in the future.

5 ATLAS ALARM->CERN CASTOR DOWN GGUS:68949

What time (UTC)    What happened
2011/03/25 11:59   GGUS ALARM ticket, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN.
2011/03/25 12:05   Operator acknowledges and records in the GGUS ticket that the Castor piquet was contacted.
2011/03/25 12:15   Expert on call records in the ticket that the problem is understood and fixed (it also affected CMS).
2011/03/25 14:09   Service manager sets the ticket to ‘solved’ with description: ‘incident caused by an incorrect conf. that was loaded at the wrong time. A mistake made as part of the SL5 upgrade.’

6 CMS ALARM->CERN CASTOR DOWN GGUS:68952

What time (UTC)    What happened
2011/03/25 12:32   GGUS ALARM ticket, automatic email notification to cms-operator-alarm@cern.ch AND automatic assignment to ROC_CERN.
2011/03/25 12:37   Operator acknowledges and records in the GGUS ticket that the Castor piquet was contacted.
2011/03/25 12:41   Expert on call records in the ticket that the problem is understood and fixed (as per ATLAS GGUS:68949).
2011/03/25 14:42   Service manager sets the ticket to ‘solved’. Reason was human error. Details in slide 5.
2011/03/25 14:51   Submitter sets the ticket to ‘verified’. He had already dropped the ticket priority at 12:37, as the problem went away quickly.

7 ATLAS ALARM->CERN AFS NOT RESPONDING GGUS:69121

What time (UTC)    What happened
2011/03/29 11:13   GGUS ALARM ticket, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN.
2011/03/29 11:26   Operator acknowledges and records in the GGUS ticket that email was sent to the AFS service.
2011/03/29 13:43   Service manager sets the ticket to ‘solved’. Reason was a hardware failure that rendered 3 partitions and 110 ATLAS volumes inaccessible.
2011/03/29 13:49   Submitter sets the ticket to status ‘verified’.

8 ATLAS ALARM->CERN AFS S/W REL. AREA UNAVAILABLE GGUS:69192

What time (UTC)    What happened
2011/03/31 07:25   GGUS ALARM ticket, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. No entry by the operator in the ticket! The operator may have forgotten to record the call.
2011/03/31 08:39   Service manager records in the ticket that investigation has started.
2011/03/31 08:39   Experiment member complains in the ticket about the frequency of AFS problems.
2011/03/31 10:46   AFS expert records ‘problem found on server afs151: device mapper s/w RAID layer was stuck in a loop after a h/w error, blocking all I/O’.
2011/03/31 12:46   Service manager sets the ticket to status ‘solved’.
2011/03/31 15:25   Submitter sets the ticket to status ‘verified’.
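The timelines on slides 3 to 8 give, for each ALARM, the UTC time the ticket was opened and the time it was set to ‘solved’. As a rough sketch, the elapsed time between those two entries can be computed as below. The timestamps are copied from the slides; the list layout and the helper name `time_to_solved` are illustrative assumptions, not part of GGUS.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M"  # timestamp layout used on the slides (UTC)

# (ticket, ALARM opened, ticket set to 'solved'), copied from slides 3-8.
alarms = [
    ("GGUS:68794", "2011/03/19 19:35", "2011/03/21 08:09"),
    ("GGUS:68795", "2011/03/19 21:22", "2011/03/20 05:34"),
    ("GGUS:68949", "2011/03/25 11:59", "2011/03/25 14:09"),
    ("GGUS:68952", "2011/03/25 12:32", "2011/03/25 14:42"),
    ("GGUS:69121", "2011/03/29 11:13", "2011/03/29 13:43"),
    ("GGUS:69192", "2011/03/31 07:25", "2011/03/31 12:46"),
]


def time_to_solved(opened, solved):
    """Return the elapsed time between the two timestamps as a timedelta."""
    return datetime.strptime(solved, FMT) - datetime.strptime(opened, FMT)


durations = {ticket: time_to_solved(o, s) for ticket, o, s in alarms}

for ticket, delta in durations.items():
    print(ticket, delta)
```

The ‘solved’ status can lag the actual fix: for GGUS:68949 the expert recorded the problem as fixed at 12:15, while the ticket was set to ‘solved’ at 14:09. These durations are therefore upper bounds on the outage as seen in the ticket, not measured downtime.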

