GGUS summary (3 weeks)

VO | User | Team | Alarm | Total
ALICE | 4 | 0 | 0 | 4
ATLAS | 17 | 118 | 5 | 140
CMS | 118120
LHCb | 116118
Totals | 33 | 142 | 7 | 182



2/25/2016 WLCG MB Report WLCG Service Report

Support-related events since last MB

There were 7 real ALARM tickets since the 2011/11/08 MB (3 weeks): 5 submitted by ATLAS, 1 by CMS and 1 by LHCb. 4 ALARM tickets concerned CERN, 2 were for SARA and 1 for CNAF. All of them are in status ‘solved’; most are also ‘verified’. Details follow…
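The per-VO and per-site counts quoted above can be cross-checked against the seven tickets detailed on the following slides. This is a minimal Python sketch, assuming each ticket is written down by hand as a (GGUS id, VO, site) tuple taken from the slide titles. The `tickets` list and the variable names are illustrative only and are not part of any GGUS tooling.

```python
from collections import Counter

# (GGUS ticket id, submitting VO, site concerned), copied from the slide titles.
# Hypothetical hand-built list; nothing here queries GGUS itself.
tickets = [
    (76039, "ATLAS", "CERN"),
    (76045, "CMS", "CERN"),
    (76519, "ATLAS", "CERN"),
    (76629, "LHCb", "SARA"),
    (76628, "ATLAS", "SARA"),
    (76663, "ATLAS", "CNAF"),
    (76770, "ATLAS", "CERN"),
]

# Tally ALARM tickets per submitting VO and per site concerned.
by_vo = Counter(vo for _, vo, _ in tickets)
by_site = Counter(site for _, _, site in tickets)

print("total:", len(tickets))
print("by VO:", dict(by_vo))
print("by site:", dict(by_site))
```

The tallies agree with the summary: 5 ATLAS, 1 CMS and 1 LHCb tickets, of which 4 concern CERN, 2 SARA and 1 CNAF.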

ATLAS ALARM -> CERN slow LSF response GGUS:76039

What time UTC | What happened
2011/11/07 08:01 | GGUS ALARM ticket, automatic notification to atlas- AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation took place 2 hrs 28 mins later! This is the 1st time we see this. Maybe SNOW was in the middle of its weekly release (every Monday) and didn’t accept connections? Type of Problem = ToP: Local Batch System.
2011/11/07 08:02 | The operator records in the ticket that it-dep-pes-ps-sms was informed.
2011/11/07 08:21 | Service expert records a hardware problem with the LSF master node. The service ran on a secondary node for the rest of the day with slow performance. 5 comments exchanged with the submitter along these lines.
2011/11/08 07:18 | Service expert sets the ticket to ‘solved’ as hardware issues were addressed and a reconfiguration solved the slow performance. A ticket was opened to Platform to investigate the root cause of the problem.
2011/11/08 07:27 | Submitter sets the ticket to ‘verified’. Thus the root cause will never be recorded…!!!???
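The delays in this timeline can be worked out from the UTC timestamps on the slide. The sketch below assumes the timestamps are typed in by hand in the slide's `YYYY/MM/DD HH:MM` format. The SNOW creation time is not given in the source and is derived here from the stated 2 hrs 28 mins delay. The helper name `parse` is hypothetical.

```python
from datetime import datetime, timedelta

FMT = "%Y/%m/%d %H:%M"  # timestamp format used in the slide tables (UTC)

def parse(ts: str) -> datetime:
    """Parse a slide timestamp such as '2011/11/07 08:01'."""
    return datetime.strptime(ts, FMT)

opened = parse("2011/11/07 08:01")    # GGUS ALARM ticket submitted
solved = parse("2011/11/08 07:18")    # service expert sets 'solved'
verified = parse("2011/11/08 07:27")  # submitter sets 'verified'

# Derived, not quoted in the source: SNOW ticket created 2 hrs 28 mins after opening.
snow_created = opened + timedelta(hours=2, minutes=28)

time_to_solve = solved - opened
time_to_verify = verified - solved

print("SNOW ticket created:", snow_created.strftime(FMT))
print("open -> solved:", time_to_solve)
print("solved -> verified:", time_to_verify)
```

Under these assumptions the ticket took 23 hours 17 minutes to reach ‘solved’ and was ‘verified’ 9 minutes later.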

CMS ALARM -> CERN LSF problem GGUS:76045

What time UTC | What happened
2011/11/07 08:54 | GGUS TEAM ticket, automatic notification to grid- AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. ToP: Local Batch System.
2011/11/07 10:31 | CERN SNOW 2nd Line Supporter assigns the ticket to LXBATCH 3rd Line support group.
2011/11/07 15:58 | Expert sets the ticket in status ‘in progress’. It is interesting that the same expert had been much more responsive to the similar ticket from ATLAS (previous slide), which was opened as an ALARM and not a TEAM ticket, although, at this point in time, a solution was not yet found.
2011/11/07 16:48 | As the situation didn’t improve, the ticket was upgraded to ALARM. was sent to
2011/11/07 16:59 | Operator records in the ticket that it-dep-pes-ps- received a copy of the notification. The ticket was solved the next day at the same time as the ATLAS one.

ATLAS ALARM -> CERN slow AFS GGUS:76519

What time UTC | What happened
2011/11/16 16:29 | GGUS ALARM ticket, automatic notification to atlas- AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. ToP: File Access.
2011/11/16 16:35 | Operator records in the ticket that AFS support was contacted.
2011/11/16 17:24 | 5 comments exchanged between the submitter and the service managers led to setting the ticket to ‘solved’ because performance gradually improved without needing to move data to another server.
2011/11/16 21:51 | Submitter sets the ticket to ‘verified’.

LHCb ALARM -> SARA SE down GGUS:76629

What time UTC | What happened
2011/11/20 08:34 (SUNDAY) | GGUS TEAM ticket, automatic notification to AND automatic assignment to NGI_NL. ToP: Storage Systems.
2011/11/20 10:56 | Ticket upgrade to ALARM. sent to nlt
2011/11/20 11:03 | Expert on call recorded in the ticket that the problem is known (ATLAS reported the same) and is being investigated.
2011/11/20 11:10 | Submitter bans SARA to prevent jobs from being submitted there while the problem is not yet solved.
2011/11/20 20:36 | SARA service mgr traces the problem down to a full partition of the dCache namespace node. Cleared up and restarted dCache cluster. Ticket set to ‘solved’ at 21:32 hrs. During the daily WLCG ops meeting the next day the site pointed out that weekends are outside SLA hours and that they respond to ALARMs on a best-effort basis.
2011/11/21 09:10 | An acknowledgement of ALARM reception sent only now from a NIKHEF address (the content seems to be a standard text).

ATLAS ALARM -> unable to contact SARA SRM GGUS:76628

What time UTC | What happened
2011/11/20 06:17 (SUNDAY) | GGUS TEAM ticket, automatic notification to AND automatic assignment to NGI_NL. ToP: File Transfer.
2011/11/20 08:03 | Ticket upgrade to ALARM after a reminder that data exports to SARA are failing during data taking. sent to
2011/11/20 08:24 | Expert on call recorded in the ticket the standard text of ALARM receipt acknowledgement.
2011/11/20 11:07 | Submitter takes SARA out of exports while the problem is not yet solved.
2011/11/20 20:32 | SARA service mgr traces the problem down to a full partition of the dCache namespace node. Cleared up and restarted dCache cluster. Ticket set to ‘solved’ at 21:33 hrs. Logging level was planned to be increased for debugging during the week, but as the submitter ‘verified’ the ticket on 2011/11/21 at 00:44 hrs, no more updates are possible.

ATLAS ALARM -> data exports to CNAF fail GGUS:76663

What time UTC | What happened
2011/11/21 13:17 | GGUS TEAM ticket, automatic notification to t1- AND automatic assignment to NGI_IT. ToP: File Transfer.
2011/11/21 13:54 | Ticket upgrade to ALARM. sent to t
2011/11/21 14:09 | Site mgr comments that an unscheduled downtime on 2011/11/18 (Friday) had left these instabilities. Things started getting better.
2011/11/21 15:29 | Site mgr sets the ticket in status ‘solved’ after confirmation from the ALARMer that the errors went down.
2011/11/21 19:04 | Another shifter re-opens the ticket, finding the batch queue closed and wondering if the problem persists or the queue re-opening was simply forgotten.
2011/11/22 12:55 | Site mgr re-sets the ticket to ‘solved’, pasting the last shifter’s question about the batch queue found closed without any comment (?!)

ATLAS ALARM -> CERN LFC down GGUS:76770

What time UTC | What happened
2011/11/24 04:37 | GGUS TEAM ticket, automatic notification to grid- h AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. ToP: Databases.
2011/11/24 04:51 | Operator records in the ticket that it-dep-pes-ps- was contacted. This is strange because the operators are not included in the e-group notified (!?)
2011/11/24 05:52 | Authorised ALARMer upgrades the ticket and offers a possible incident reason related to ADCR db. sent to Maybe the timestamp in the ticket is incorrect…
2011/11/24 07:31 | CERN 2nd Line supporter assigns the ticket to “LFC 2nd Line Support”. This is not the right level given that it is now an ALARM.
2011/11/24 09:37 | Grid services’ expert started working on the problem immediately. After a few exchanges this was ‘solved’ as related to adcr_lfc Oracle DB not available.