
1 GGUS summary (3 weeks)

VO      User  Team  Alarm  Total
ALICE      7     0      2      9
ATLAS     41   126      8    175
CMS       12     2      5     19
LHCb       8    36      1     45
Totals    68   164     16    248
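The figures in the summary table run together in the transcript, so the split into User, Team, Alarm and Total columns is a reconstruction. A minimal Python sketch of a consistency check is given here, assuming the split ALICE 7/0/2, ATLAS 41/126/8, CMS 12/2/5, LHCb 8/36/1. The variable names are illustrative only and are not part of the report.

```python
# Per-VO ticket counts as (User, Team, Alarm).
# Assumption: this column split is reconstructed from the run-together
# digits in the transcript; the sums below are what make it plausible.
ROWS = {
    "ALICE": (7, 0, 2),
    "ATLAS": (41, 126, 8),
    "CMS": (12, 2, 5),
    "LHCb": (8, 36, 1),
}

# Total per VO (last column of the table).
row_totals = {vo: sum(counts) for vo, counts in ROWS.items()}

# Totals per ticket type (last row of the table).
column_totals = tuple(sum(col) for col in zip(*ROWS.values()))

# Grand total of all tickets in the 3-week period.
grand_total = sum(row_totals.values())

if __name__ == "__main__":
    print(row_totals)
    print(column_totals, grand_total)
```

With this split, every row and column sum matches the totals printed on the slide (9, 175, 19, 45 per VO; 68, 164, 16 per type; 248 overall).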

2 Support-related events since last MB (6/13/2016, WLCG MB Report, WLCG Service Report)

There were 9 real ALARM tickets since the 2011/09/20 MB (3 weeks): 4 submitted by ATLAS, 4 by CMS and 1 by ALICE. All are ‘solved’ and all but one are ‘verified’. 7 ALARM tickets concerned CERN, 1 RAL and 1 ASGC.

20 test ALARM tickets were submitted by the GGUS developers on release day, 2011/09/28, as part of the regular procedure.

Following this release, a flag regulating GGUS email notification was wrongly configured. As a result, GGUS intermittently generated duplicate email notifications to the supporters until the morning of 2011/10/07.

On 2011/10/06 (pm), the GGUS interfaces with other ticketing systems that use web services broke due to a KIT DNS problem, caused by an update of the intrusion prevention system (IPS). Because of this update, the KIT DNS could not reach DNS servers outside KIT. After rolling back to the previous version of the IPS, it took some time until DNS communication worked correctly again.

3 ATLAS ALARM->CERN: raw files vanish from Castor scratch space before merge and copy to tape (GGUS:74448)

Time (UTC)        What happened
2011/09/19 11:40  GGUS ALARM ticket, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/09/19 11:49  Service mgr confirms in the ticket that investigation has started.
2011/09/19 11:55  Service mgr puts the ticket to status ‘solved’, explaining that a node was taken out of production for reasons unknown at that time and never recorded in the ticket.
2011/09/19 12:15  The operator records in the ticket that “the sys. admin is working on it”.
2011/09/19 13:08  Submitter sets the ticket to status ‘verified’.

4 CMS ALARM->CERN: LSF not starting T0 jobs (GGUS:74456)

Time (UTC)        What happened
2011/09/19 15:48  GGUS ALARM ticket, automatic email notification to cms-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/09/19 15:57  Grid services’ expert, having seen the email, comments in the ticket that the problem was already known and at hand.
2011/09/19 16:00  Operator records in the ticket that the sys. admin. was contacted.
2011/09/19 16:25  Expert sets the ticket to status ‘solved’. The cmst0 queue priority was set to a higher value so that LSF allows more CMS jobs to run within a given cycle. A more permanent solution was promised but not recorded in this ticket.
2011/09/19 17:04  Submitter observed the queues for 2.5 hrs until the number of jobs returned as failed decreased.
2011/09/25 17:28  (Sunday) Submitter sets the ticket to status ‘verified’.

5 ATLAS ALARM->RAL: T0 to RAL data exports fail (GGUS:74686)

Time (UTC)        What happened
2011/09/27 04:33  GGUS TEAM ticket, automatic email notification to lcg-support@gridpp.rl.ac.uk AND automatic assignment to NIG_UK.
2011/09/27 06:45  TEAM ticket upgraded to ALARM. lcg-alarm@gridpp.rl.ac.uk notified. Automatic ALARM acknowledgement recorded in the ticket, promising an expert’s response within 2 hours.
2011/09/27 07:23  Site admin records in the ticket that investigation is taking place with high priority.
2011/09/27 08:53  Service expert at the site records that a Castor DB inconsistency was found. DB experts at RAL contacted. The ATLAS Castor instance at RAL put in downtime.
2011/09/27 13:57  The expert at the site adds 4 comments correcting the diagnosis and recording in the ticket that the DB table needed to be rebuilt.
2011/09/27 14:55  Service expert sets the ticket to status ‘solved’.
2011/09/27 16:05  Submitter sets the ticket to status ‘verified’.

6 ATLAS ALARM->ASGC: can’t get LFC replicas (GGUS:74758)

Time (UTC)        What happened
2011/09/28 19:22  GGUS TEAM ticket, automatic email notification to ops@lists.grid.sinica.edu.tw AND automatic assignment to ROC_Asia/Pacific. First use of the “Type of Problem (ToP)” field. ToP: Storage Systems.
2011/09/28 20:26  Next shifter records in the ticket that the problem appears in the opposite direction as well.
2011/09/28 20:27  CERN/IT/ES ATLAS supporter upgrades the ticket to an ALARM. Asgc-t1-op@lists.gird.sinica.edu.tw notified.
2011/09/28 21:56  First diagnosis shows a DOS caused by a PanDA user.
2011/09/29 02:26  Site admin. sets the ticket ‘in progress’.
2011/09/29 07:56  The ATLAS supporter from CERN confirms that ~10K concurrent jobs, each fetching 100 MB from storage, were the reason for the DOS, bans the job submitter and sets the ticket to status ‘solved’.

7 ATLAS ALARM->CERN: T0MERGE inaccessible (GGUS:74838)

Time (UTC)        What happened
2011/09/30 13:01  GGUS ALARM ticket, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. ToP: File Access.
2011/09/30 13:06  Operator records in the ticket that the Castor piquet was contacted.
2011/09/30 13:06  Castor expert puts the ticket ‘in progress’.
2011/09/30 13:37  Expert puts the ticket to status ‘solved’, recording that the known Transfer Manager problem was the cause. Stuck transfer requests were cleaned, but available patches should be installed.
2011/09/30 14:36  Expert enters 2 more clarification comments.
2011/10/03 06:41  Submitter sets the ticket to status ‘verified’.

8 CMS ALARM->CERN: CMSR DB down (GGUS:74701)

Time (UTC)        What happened
2011/09/27 12:39  GGUS ALARM ticket, automatic email notification to cms-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. ToP: other (not selected).
2011/09/27 12:53  2nd Line support assigns ticket to DB Instances 3rd Line.
2011/09/27 13:30  The operator records that the ticket is received but calls nobody.
2011/09/27 14:00  Service expert sets the ticket to status ‘solved’, confirming there was a problem with the DB but without explaining its cause.
2011/09/27 14:17  Submitter sets the ticket to status ‘verified’.

9 CMS ALARM->CERN: Problem to open DB file (GGUS:74709)

Time (UTC)        What happened
2011/09/27 17:46  GGUS TEAM ticket, automatic email notification to grid-cern-prod-admins@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. ToP: other (not selected).
2011/09/27 18:59  TEAM ticket upgraded to ALARM. Cms-operator-alarm@cern.ch notified.
2011/09/27 19:19  Operator records in the ticket that phyDB support was contacted.
2011/09/27 19:27  Service expert puts the ticket in status ‘solved’ without explaining how.
2011/09/27 21:15  Submitter sets the ticket to status ‘verified’.

10 ALICE ALARM->CERN: myproxy stopped working (GGUS:75055)

Time (UTC)        What happened
2011/10/06 17:35  GGUS ALARM ticket, automatic email notification to alice-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation does NOT appear in the GGUS ticket diary; this is due to the KIT DNS problem (see slide 2). ToP: middleware.
2011/10/06 18:31  Operator records in the ticket that the IT PES PS piquet was contacted.
2011/10/06 19:46  Service expert comments in the ticket that the problem is fixed. The diagnosis was already given by the submitter, i.e. a change of host cert. led to authorisation failures.
2011/10/06 19:53  Submitter confirms that the problem went away.
2011/10/07 10:31  Late appearance of the SNOW ticket number.
2011/10/07 11:53  Service expert puts the ticket to status ‘solved’. A number of identical comments follow due to the duplicate email notifications explained in slide 2. They stop when the submitter sets the ticket to status ‘verified’.

11 CMS ALARM->CERN: myproxy stopped working (GGUS:75056)

Time (UTC)        What happened
2011/10/06 17:43  GGUS ALARM ticket, automatic email notification to cms-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation does NOT appear in the GGUS ticket diary; this is due to the KIT DNS problem (see slide 2). ToP: File transfer (different from the identical report by ALICE, see previous slide).
2011/10/06 18:05  Service expert comments in the ticket that the problem is known and already fixed.
2011/10/06 18:22  The same expert comments in the ticket that one of the 2 myproxy hosts still gives errors and is temporarily disabled for verification.
2011/10/06 18:31  Operator records in the ticket that the IT PES PS piquet was contacted.
2011/10/06 18:43 - 21:22  3 comments exchanged for debugging, followed by status change to ‘solved’ and ‘verified’.
2011/10/07 10:35  Late appearance of the SNOW ticket number (reasons in the previous slide).
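The timelines in slides 3 to 11 can be compared by how long each ticket took to go from ALARM to ‘solved’. The following is a minimal Python sketch of that arithmetic, not part of the original report. It assumes the start time is the moment the ticket became an ALARM (for tickets opened as TEAM, the upgrade time) and the end time is the ‘solved’ status change. GGUS:75056 is left out because its slide gives only a time range for the status change. The names `ALARMS` and `minutes_to_solved` are illustrative.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M"

# (ticket, time it became an ALARM, time it was set to 'solved'), all UTC,
# copied from the slide timelines. Assumption: TEAM tickets are counted
# from their upgrade to ALARM, not from their creation.
ALARMS = [
    ("GGUS:74448", "2011/09/19 11:40", "2011/09/19 11:55"),
    ("GGUS:74456", "2011/09/19 15:48", "2011/09/19 16:25"),
    ("GGUS:74686", "2011/09/27 06:45", "2011/09/27 14:55"),
    ("GGUS:74758", "2011/09/28 20:27", "2011/09/29 07:56"),
    ("GGUS:74838", "2011/09/30 13:01", "2011/09/30 13:37"),
    ("GGUS:74701", "2011/09/27 12:39", "2011/09/27 14:00"),
    ("GGUS:74709", "2011/09/27 18:59", "2011/09/27 19:27"),
    # The fix was reported at 19:46 on 2011/10/06, but the status
    # change to 'solved' only happened the next day.
    ("GGUS:75055", "2011/10/06 17:35", "2011/10/07 11:53"),
]


def minutes_to_solved(raised: str, solved: str) -> int:
    """Return whole minutes between the ALARM time and the 'solved' time."""
    delta = datetime.strptime(solved, FMT) - datetime.strptime(raised, FMT)
    return int(delta.total_seconds() // 60)


durations = {ticket: minutes_to_solved(r, s) for ticket, r, s in ALARMS}

if __name__ == "__main__":
    # Print tickets from fastest to slowest resolution.
    for ticket, minutes in sorted(durations.items(), key=lambda kv: kv[1]):
        print(f"{ticket}: {minutes} min")
```

Measured this way, the CERN tickets from 2011/09/19 to 2011/09/30 were set to ‘solved’ within 15 to 81 minutes, while the RAL and ASGC tickets took roughly 8 and 11.5 hours respectively.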

