GGUS Slides for the 2012/07/24 MB Drills cover the period of 2012/06/18 (Monday) until 2012/07/12 given my holiday starting the following weekend. Remove.


1 GGUS Slides for the 2012/07/24 MB
Drills cover the period from 2012/06/18 (Monday) until 2012/07/12, given my holiday starting the following weekend. Remove this slide and let me know if drills are missing and should be prepared for a future MB. Thank you! MariaDZ
12/23/2015 WLCG MB Report WLCG Service Report

2 GGUS summary (5 weeks)
Table columns: VO, User, Team, Alarm, Total. Rows: ALICE, ATLAS, CMS, LHCb, Totals (values still to be filled in).
To calculate the totals for this slide and copy/paste the usual graph for the 2012/07/24 MB please:
1. Take the summary from the tables on https://ggus.eu/download/wlcg_metrics/html/20120716_escalationreport_wlcg.html and https://ggus.eu/download/wlcg_metrics/html/20120723_escalationreport_wlcg.html
2. Copy locally the file https://twiki.cern.ch/twiki/pub/LCG/WLCGOperationsMeetings/ggus-tickets.xls
3. Include 2 more lines from the escalation reports above. Add up the last 5 weeks, i.e. starting from the 25-Jun line, and put the totals in this table.
4. Copy/paste here, instead of these instructions, the updated graph from the .xls file of point 2.
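The addition described in step 3 can be sketched in Python as below. This is only an illustration under stated assumptions: the row layout (week start, VO, user, team, alarm) and every ticket count are placeholders invented for the example, not figures taken from the escalation reports or from ggus-tickets.xls.

```python
from datetime import date

# Hypothetical weekly rows: (week start, VO, user, team, alarm).
# All counts are placeholders, NOT real GGUS figures.
WEEKLY_ROWS = [
    (date(2012, 6, 18), "ATLAS", 5, 5, 5),  # before the window, ignored
    (date(2012, 6, 25), "ATLAS", 1, 2, 3),
    (date(2012, 7, 2), "ATLAS", 1, 0, 1),
    (date(2012, 6, 25), "CMS", 0, 1, 0),
    (date(2012, 7, 2), "CMS", 2, 0, 1),
]


def totals_since(rows, first_week):
    """Sum user/team/alarm counts per VO for weeks starting on or after first_week."""
    totals = {}
    for week, vo, user, team, alarm in rows:
        if week < first_week:
            continue
        u, t, a = totals.get(vo, (0, 0, 0))
        totals[vo] = (u + user, t + team, a + alarm)
    # Append the per-VO total as a fourth column.
    return {vo: (u, t, a, u + t + a) for vo, (u, t, a) in totals.items()}


# Per-VO rows of the summary table, starting from the 25-Jun line.
summary = totals_since(WEEKLY_ROWS, date(2012, 6, 25))

# "Totals" row: column-wise sum over all VOs.
grand_total = tuple(sum(column) for column in zip(*summary.values()))
```

With real data, the rows would come from the spreadsheet and the two escalation reports, and `summary` plus `grand_total` would fill the VO rows and the Totals row of the table.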

3 Support-related events since last MB
There have been 12+ real ALARMs since the 2012/06/19 MB. All were submitted by ATLAS, CMS & LHCb. The sites for all these tickets were CERN, IN2P3, FZK, PIC and SARA.
There have been 2 GGUS Releases since the last MB:
On 2012/06/25: specifically on new Reporting Tools.
On 2012/07/09: all other dev. items.

4 ATLAS ALARM->CERN CASTOR PROBLEM GGUS:83360
What time UTC - What happened
2012/06/18 15:42 - GGUS ALARM ticket opened, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Storage Systems.
2012/06/18 15:42 - Expert records work started.
2012/06/18 15:48 - Operator records that expert is working already.
2012/06/18 15:50 - Expert records there was a configuration error. ITSSB is updated and fixing started.
2012/06/18 16:27 - Ticket set to ‘solved’ after configuration change and propagation. 4 more comments were exchanged because the problem persisted for some nodes that appeared to be under maintenance in the CASTOR monitor and had not received the new config. Problem really solved at 18:05 hrs.
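The elapsed times behind a timeline like this one can be worked out with a short Python sketch. The three timestamps are copied from the GGUS:83360 entries above; treating 18:05 as falling on the same day (2012/06/18) is an assumption, and the helper name `minutes_between` is made up for the example.

```python
from datetime import datetime

FMT = "%Y/%m/%d %H:%M"

# Timestamps (UTC) from the GGUS:83360 timeline.
opened = datetime.strptime("2012/06/18 15:42", FMT)
solved = datetime.strptime("2012/06/18 16:27", FMT)
# Assumption: "really solved at 18:05" refers to the same day.
really_solved = datetime.strptime("2012/06/18 18:05", FMT)


def minutes_between(start, end):
    """Whole minutes elapsed between two datetimes."""
    return int((end - start).total_seconds() // 60)


time_to_solved = minutes_between(opened, solved)           # first 'solved' status
time_to_real_fix = minutes_between(opened, really_solved)  # actual recovery
```

Here the ticket reached ‘solved’ 45 minutes after opening, while the actual recovery took 143 minutes.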

5 ATLAS ALARM->CERN LSF SCHEDULING GGUS:83362
What time UTC - What happened
2012/06/18 15:50 - GGUS ALARM ticket opened, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Local Batch Systems.
2012/06/18 16:02 - Operator’s acknowledgment and email to …pes-sms…
2012/06/18 16:19 - Service mgr starts work.
2012/06/18 16:38 - The ticket is ‘solved’ because the LSF problem was a side-effect of the CASTOR problem of the previous slide.

6 ATLAS ALARM->FZK FTS TRANSFER ERRORS GGUS:83367
What time UTC - What happened
2012/06/18 23:16 - GGUS TEAM ticket opened, automatic email notification to lcg-admin@lists.kit.edu AND automatic assignment to NGI_DE. Type of Problem: File Transfer.
2012/06/19 01:18 - Increased to “Top Priority”, followed by ticket conversion to ALARM 10 mins later as the transfer failure rate increases.
2012/06/19 05:46 - A CMS comment! They have the same problem!
2012/06/19 09:02 - The ticket is ‘solved’ after finding a disk issue that needed a log partition cleanup on an FTS host. Both experiments agree the problem is gone.

7 ATLAS ALARM->CERN LSF SLOW RESPONSE GGUS:83375
What time UTC - What happened
2012/06/19 07:43 - GGUS ALARM ticket opened, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Local Batch Systems.
2012/06/19 07:51 - Operator’s acknowledgment and email to …pes-sms…
2012/06/19 07:55 - Service mgr starts work.
2012/06/20 15:42 - The ticket is ‘solved’ because the problem went away. Platform was supposed to get back with a diagnostic, but once the ticket was set to ‘verified’ no further update was possible, hence we never learned what the cause of the problem was.

8 ATLAS ALARM->IN2P3 SW SRC PROBLEM VIA CVMFS GGUS:83517
What time UTC - What happened
2012/06/24 08:30 SUNDAY - GGUS TEAM ticket opened, automatic email notification to grid.admin@cc.in2p3.fr AND automatic assignment to NGI_FRANCE. Type of Problem: Middleware.
2012/06/25 07:37 - Ticket upgraded to ALARM after 2 comments listing all WNs where 100% of the jobs failed. Email sent to lhc-alarm@cc.in2p4.fr. Automatic acknowledgment recorded immediately afterwards.
2012/06/25 08:21 - Sys. admins investigate (cvmfs cache problem).
2012/06/25 11:16 - The ticket is ‘solved’ after changing the logrotate policy to reduce the logs. As the ticket was then set to ‘verified’, no further update was possible, hence we never learned why the sharp increase in connections led to such rapid growth of the log files.

9 ATLAS ALARM->SARA SRM CONTACT PROBLEM GGUS:83523
What time UTC - What happened
2012/06/24 19:57 SUNDAY - GGUS TEAM ticket opened, automatic email notification to eugrid.support@sara.nl AND automatic assignment to NGI_NL. Type of Problem: Storage Systems.
2012/06/24 20:21 - Ticket upgraded to ALARM as the SRM layer appeared broken. Email sent to nlt1-alarms@biggrid.nl. Automatic acknowledgment recorded immediately afterwards.
2012/06/25 05:54 - Service mgr restarted srm.
2012/06/27 14:47 - The ticket is ‘solved’ after exchanging 16 comments to understand the cause, which seemed to be the recent dCache upgrade to v.2.2.1. Moving the srm to new hardware didn’t help but re-indexing the DB did.

10 ATLAS ALARM->CERN VOATLAS SERVERS DOWN GGUS:83705
What time UTC - What happened
2012/06/29 06:33 - GGUS ALARM ticket opened, automatic email notification to atlas-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Other.
2012/06/29 06:34 - Grid services’ expert informs the submitter that there is a power cut in the CC, published on the ITSSB.
2012/06/29 06:40 - Operator also records that there are many problems due to the power cut.
2012/06/29 12:01 - The ticket is set to ‘verified’ after the services got back at 08:26 and the solution was recorded at 11:55.

11 LHCB ALARM->CERN MISSING DATA ON DISK GGUS:83713
What time UTC - What happened
2012/06/29 11:37 - GGUS ALARM ticket opened, automatic email notification to lhcb-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Storage Systems.
2012/06/29 11:38 - Storage expert informs the submitter that, after the power cut in the CC earlier that day, not all servers have yet recovered.
2012/06/29 11:45 - Operator also records that the CASTOR piquet was called.
2012/06/29 16:20 - Ticket set to ‘solved’ when all servers came back to production. SLS was showing that all was fine, although this was only partially true. The reason was that the monitoring process checks the availability of only a subset of nodes, one deemed necessary and sufficient.
2012/07/04 07:45 - The ticket was ‘re-opened’ and eventually re-‘solved’ & ‘verified’ following experiment complaints when files were found missing. The reason was that a machine was still unreachable. It came back after a vendor call.
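The SLS behaviour noted in this timeline, a green status while one machine was still unreachable, can be illustrated with a minimal Python sketch. It is not how SLS is implemented; the function, the node names and their states are all hypothetical and only show why a check over a subset of nodes can miss a failure outside that subset.

```python
def subset_status(node_up, monitored):
    """Return 'ok' if every monitored node is up; nodes not monitored are ignored."""
    return "ok" if all(node_up[name] for name in monitored) else "degraded"


# Hypothetical node names and states, for illustration only.
node_up = {"server01": True, "server02": True, "server03": False}

# Checking only a subset misses the node that is still down.
partial_view = subset_status(node_up, monitored=["server01", "server02"])

# Checking every node exposes the problem.
full_view = subset_status(node_up, monitored=list(node_up))
```

In this toy case `partial_view` is "ok" while `full_view` is "degraded", which mirrors a dashboard showing green while files on one server are unavailable.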

12 CMS ALARM->CERN VOCMS203 WEB SERVICE PROBLEM GGUS:83726
What time UTC - What happened
2012/06/30 07:41 SATURDAY - GGUS TEAM ticket opened, automatic email notification to cms-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Other.
2012/06/30 10:37 - Escalated and soon afterwards upgraded to ALARM.
2012/06/30 10:52 - Operator records that the problem is known and the piquet has already sent mail suggesting copying the data because the disk is scheduled for replacement.
2012/07/02 10:04 - Various CMS ALARMers submitted 6 comments in the ticket trying to get any news on progress of this.
2012/07/03 09:09 - Ticket set to ‘solved’ after fixing the hardware problem.

13 ATLAS ALARM->PIC TRANSFERS FROM CERN FAIL GGUS:83923
What time UTC - What happened
2012/07/06 09:31 - GGUS TEAM ticket opened, automatic email notification to lcg.support@pic.es AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: File Transfer.
2012/07/06 09:36 - Site mgrs record in the ticket a known network problem in the dCache pools. LHCb opened a similar ticket on the matter.
2012/07/06 11:25 - Transfer failure rate keeps increasing. Ticket upgraded to ALARM. Email sent to tier1-alarms@pic.es.
2012/07/06 16:26 - The ticket is set to ‘solved’ after reducing the timeout and increasing the queue size. Supporters and submitters observed the service recovering for 2 days before ‘verify’ing the ticket.

14 ATLAS ALARM->CERN SLOW LSF GGUS:83947
What time UTC - What happened
2012/07/07 07:27 SATURDAY - GGUS ALARM ticket opened, automatic email notification to atlasoperator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. Type of Problem: Local Batch Systems.
2012/07/07 11:22 - The same problem was reported by CMS via ALARM GGUS:83948. No operator acknowledgment was recorded in these 2 tickets, due to the invalid email addresses used (such as cmsoperator-alarm@cern.ch). Submitters provided debug info about jobs appearing to ‘run’ on lost-and-found machines. Service mgr applied recently received hot fixes. 7 comments exchanged.
2012/07/07 20:12 - The ticket is set to ‘solved’ and ‘verified’ the next day. Similar process for the CMS ALARM on this issue.
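A mistyped notification address like the ones in this incident could in principle be caught by comparing recipients against the known alarm lists. The Python sketch below assumes such a check; it is not part of GGUS. The known addresses are the ones that appear in the earlier ticket timelines, and the function name is hypothetical.

```python
# Alarm addresses as they appear in the earlier ticket timelines.
KNOWN_ALARM_ADDRESSES = {
    "atlas-operator-alarm@cern.ch",
    "cms-operator-alarm@cern.ch",
    "lhcb-operator-alarm@cern.ch",
}


def unknown_addresses(recipients):
    """Return, sorted, the recipients that are not known alarm addresses."""
    return sorted(set(recipients) - KNOWN_ALARM_ADDRESSES)


# The two addresses without the hyphen are the ones reported in this ticket.
bad = unknown_addresses([
    "atlasoperator-alarm@cern.ch",
    "cmsoperator-alarm@cern.ch",
    "cms-operator-alarm@cern.ch",
])
```

Here `bad` contains the two hyphen-less addresses, which are the ones that prevented the operator acknowledgment from being recorded.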

