1 GGUS summary (6 weeks)
VO      User  Team  Alarm  Total
ALICE      3     1      0      4
ATLAS     62   235      8    305
CMS       15     6      2     23
LHCb       5     5      6     16
Totals    85   247     16    348
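The summary figures can be cross-checked with a few lines of Python. This is only a sketch: the flattened transcript does not show the ALICE alarm count or the LHCb team and alarm counts, so the values used for those cells are assumptions inferred from the row and column totals that the slide does state. The names `summary`, `column_totals` and `rows_consistent` are illustrative.

```python
# Ticket counts per VO as (user, team, alarm, total).
# ASSUMPTION: ALICE alarm = 0, LHCb team = 5 and LHCb alarm = 6 are inferred
# from the stated totals; they are not legible in the transcript itself.
summary = {
    "ALICE": (3, 1, 0, 4),
    "ATLAS": (62, 235, 8, 305),
    "CMS": (15, 6, 2, 23),
    "LHCb": (5, 5, 6, 16),
}


def column_totals(rows):
    """Sum each column (user, team, alarm, total) over all VOs."""
    return tuple(sum(col) for col in zip(*rows.values()))


def rows_consistent(rows):
    """Check that user + team + alarm equals the stated total in every row."""
    return all(u + t + a == total for u, t, a, total in rows.values())


print(column_totals(summary))
print(rows_consistent(summary))
```

With these inferred cells, the column sums reproduce the stated totals of 85 user tickets, 247 team tickets and 348 tickets overall.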

2 Support-related events since last MB
There were 7 real ALARM tickets since the 2011/11/29 MB (6 weeks): 5 submitted by ATLAS, 1 by CMS, 1 by LHCb.
All ALARM tickets concerned CERN. All of them are in status ‘solved’; most are also ‘verified’.
The most difficult case was an LSF problem that occupied supporters for 3 days (including a long debugging session on a Saturday night), the root cause of which was never understood by CERN or Platform engineers.
The GGUS monthly release took place on 2011/12/ test ALARMs were issued and analysed in Savannah:124732
Details follow…
6/26/2018 WLCG MB Report WLCG Service Report

3 ATLAS ALARM->CERN LFC restart required GGUS:77049
What time UTC     What happened
2011/12/05 12:32  GGUS ALARM ticket, automatic notification to AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem = ToP: File Access.
2011/12/05 12:39  The operator records in the ticket that it-dep-pes-ps-sms was informed.
2011/12/05 12:49  Service expert sets the ticket in progress.
2011/12/05 12:50  Service expert sets the ticket to ‘solved’ when daemon restart was done on the nodes requested by the submitter.
2011/12/06 08:47  Submitter sets the ticket to ‘verified’.
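Response times for a ticket like this one can be derived from the timeline timestamps. The sketch below uses only Python's standard `datetime` module; the helper name `minutes_between` and the constant `FMT` are illustrative and not part of GGUS or SNOW.

```python
from datetime import datetime

# Timestamp layout used in the timelines (UTC), e.g. "2011/12/05 12:32".
FMT = "%Y/%m/%d %H:%M"


def minutes_between(start, end):
    """Return the whole minutes elapsed between two timeline timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Timestamps taken from the GGUS:77049 timeline.
submitted = "2011/12/05 12:32"
operator_ack = "2011/12/05 12:39"
solved = "2011/12/05 12:50"
verified = "2011/12/06 08:47"

print(minutes_between(submitted, operator_ack))  # minutes until operator acknowledgement
print(minutes_between(submitted, solved))        # minutes until 'solved'
print(minutes_between(submitted, verified))      # minutes until 'verified'
```

For GGUS:77049 this gives 7 minutes to the operator's acknowledgement and 18 minutes from submission to ‘solved’.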

4 ATLAS ALARM->CERN LSF slow response GGUS:77065
What time UTC     What happened
2011/12/05 17:42  GGUS ALARM ticket, automatic notification to AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem = ToP: Local Batch System.
2011/12/05 17:46  The operator records in the ticket that it-dep-pes-ps-sms was informed.
2011/12/05 19:25  Submitter records in the ticket that the problem got worse and no response was received so far from the service.
2011/12/05 20:31  Service expert starts investigating. Finds a lot of ‘bstatus’ queries and tries to kill them.
2011/12/05 21:09  The operator asks if the problem is solved (!? Never seen this before…)
2011/12/05 21:44  Service expert sets the ticket to ‘solved’ after having successfully killed all jobs that were making multiple ‘bstatus’ calls. The submitter confirmed the ‘bsub’ response was good.
2011/12/06 06:53  Shifters’ exchanges to confirm ‘bsub’ rate was ok, followed by ticket ‘verify’.

5 ATLAS ALARM->CERN LFC sessions locked for long GGUS:77069
What time UTC     What happened
2011/12/05 21:34  GGUS ALARM ticket, automatic notification to AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem = ToP: File Access.
2011/12/05 21:40  The operator records in the ticket that it-dep-pes-ps-sms was informed.
2011/12/05 22:30  Submitter records in the ticket that the locking sessions no longer last that long.
2011/12/06 07:56  Grid service expert assigns the ticket within SNOW to the DB 2nd Line Support.
2011/12/06 08:11  The submitter asks for an update.
2011/12/06 15:13  Service expert sets the ticket to ‘solved’ after 6 exchanges with the submitter to make sure the problem had gone away since about 23:30 the night before. Various reasons for the extended lock were suggested without conclusion.
2011/12/06 15:18  Submitter sets the ticket to status ‘verified’.

6 CMS ALARM->CERN Oracle sessions time-out GGUS:77142
What time UTC     What happened
2011/12/07 09:51  GGUS ALARM ticket, automatic notification to AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem = ToP: Databases.
2011/12/07 10:01  Submitter pastes the received by the Oracle expert about a global problem with the CMSR db hanging. The operator never recorded ticket reception (maybe because the problem was well-known to the experts already?)
2011/12/08 10:08  Grid service mgr assigned the ticket in SNOW to DB 2nd Line Support. No further updates were entered in the ticket except 2 reminders from WLCG GGUS supporter to record progress in the ticket. At this point Oracle expert set the ticket to ‘solved’ as the CMSR db was available for 24 hrs already. Total time of the hang was 30 mins. No service restart was needed.

7 LHCb ALARM->CERN DIRAC host unreachable GGUS:77246
What time UTC     What happened
2011/12/08 14:21  GGUS ALARM ticket, automatic notification to AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem = ToP: Other.
2011/12/08 14:29  The operator records in the ticket that the sys admin was called.
2011/12/08 16:03  Grid service mgr sets the ticket to ‘solved’. The DIRAC host hung and got rebooted. Experts were investigating the cause of the host hanging.
2011/12/08 16:08  The submitter sets the ticket to status ‘verified’. Thus no further updates are possible, so the experts’ conclusion can’t be recorded in the ticket!

8 ATLAS ALARM->CERN passwd change breaks important T0 web service GGUS:77467
What time UTC     What happened
2011/12/14 16:17  GGUS ALARM ticket, automatic notification to AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem = ToP: Other.
2011/12/14 16:27  The operator records in the ticket that was informed (as it concerned passwd change).
2011/12/14 16:49  The operator pastes in the GGUS ticket updates from the new incident ticket created in SNOW when the was used. The service desk considered this to be a “Request” and not a SNOW incident. Meanwhile another SNOW ticket was also created automatically by GGUS from the beginning.
2011/12/14 17:09  Grid service expert contacted 3rd Level account mgnt and set the ticket to ‘solved’. The other SNOW incidents were closed with no action.
2011/12/14 17:16  The submitter sets the ticket to status ‘verified’.

9 ATLAS ALARM->CERN LSF batch down GGUS:77547
What time UTC              What happened
2011/12/17 19:04 SATURDAY  GGUS ALARM ticket, automatic notification to AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem = ToP: Local Batch System.
2011/12/17 19:35           Experts start investigating. They worked all Sat-to-Sunday night debugging this, finally opened urgent ticket to Platform.
2011/12/18 09:33 SUNDAY    Instance looks more stable. Reason not yet understood. Six comments exchanged. LSF kept crashing and no batch scheduling/running was possible.
2011/12/18 19:49           Operator’s acknowledgment and contact of it-dep-pes-ps-sms is recorded in the GGUS ticket diary with a 24-hour delay. The copy of this entry shows quick ops response. routing issue being followed SNOW:INC089422
2011/12/19 07:35           Five comments exchanged before ‘solved’ & ‘verified’ but root cause was NOT understood.

