GGUS summary (3 weeks)

VO      User  Team  Alarm  Total
ALICE      7     0      2      9
ATLAS     41   126      8    175
CMS       12     2      5     19
LHCb       8    36      1     45
Totals    68   164     16    248
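
The counts in the summary can be cross-checked with a few lines of Python. This is only a sketch: the numbers are copied from the summary above, while the names `rows`, `reported_totals` and `check_summary` are made up for illustration and are not part of GGUS or of the original slides.

```python
# Ticket counts per VO as (User, Team, Alarm, Total), copied from the summary.
rows = {
    "ALICE": (7, 0, 2, 9),
    "ATLAS": (41, 126, 8, 175),
    "CMS": (12, 2, 5, 19),
    "LHCb": (8, 36, 1, 45),
}
# The "Totals" row as reported in the summary.
reported_totals = (68, 164, 16, 248)


def check_summary(rows, reported_totals):
    """Return True if every row total and every column total adds up."""
    rows_ok = all(u + t + a == total for u, t, a, total in rows.values())
    column_sums = tuple(sum(col) for col in zip(*rows.values()))
    return rows_ok and column_sums == reported_totals


print(check_summary(rows, reported_totals))  # prints True
```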

6/13/2016 WLCG MB Report, WLCG Service Report (slide 2)
Support-related events since last MB
There were 9 real ALARM tickets since the 2011/09/20 MB (3 weeks): 4 submitted by ATLAS, 4 by CMS, 1 by ALICE. All are ‘solved’ and all but 1 are ‘verified’.
7 ALARM tickets concerned CERN, 1 RAL and 1 ASGC.
20 test ALARM tickets were submitted by the GGUS developers on Release day 2011/09/28, as part of the regular procedure.
Following this release, a flag regulating GGUS notifications was wrongly configured. As a result, GGUS intermittently generated duplicate notifications to the supporters (until the morning of Oct 7th).
On 2011/10/06 pm the GGUS interfaces with other ticketing systems using web services broke due to a KIT DNS problem, caused by an update of the intrusion prevention system (IPS). Because of this update the KIT DNS could not reach DNS servers outside KIT. After rolling back to the previous version of the IPS it took some time until DNS communication worked correctly again.

ATLAS ALARM->CERN raw files vanish from Castor scratch space before merge and copy to tape GGUS:74448 (slide 3)
What time UTC     What happened
2011/09/19 11:40  GGUS ALARM ticket, automatic notification to atlas- AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/09/19 11:49  Service mgr confirms in the ticket that the investigation has started.
2011/09/19 11:55  Service mgr puts the ticket to status ‘solved’, explaining that a node was taken out of production for reasons unknown at that time and never recorded in the ticket.
2011/09/19 12:15  The operator records in the ticket that “the sys. Admin is working on it”.
2011/09/19 13:08  Submitter sets the ticket to status ‘verified’.

CMS ALARM->CERN LSF not starting T0 jobs GGUS:74456 (slide 4)
What time UTC     What happened
2011/09/19 15:48  GGUS ALARM ticket, automatic notification to cms- AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
2011/09/19 15:57  Grid services’ expert, having seen the …, comments in the ticket that the problem was already known and at hand.
2011/09/19 16:00  Operator records in the ticket that the sys. admin. was contacted.
2011/09/19 16:25  Expert sets the ticket to status ‘solved’. The cmst0 queue priority was set to a higher value so that LSF allows more CMS jobs to run within a given cycle. A more permanent solution was promised but not recorded in this ticket.
2011/09/19 17:04  Submitter observed the queues for 2.5 hrs until the number of jobs returned as failed decreased.
2011/09/25 17:28  (Sunday) Submitter sets the ticket to status ‘verified’.

ATLAS ALARM-> T0 to RAL data exports fail GGUS:74686 (slide 5)
What time UTC     What happened
2011/09/27 04:33  GGUS TEAM ticket, automatic notification to lcg- AND automatic assignment to NIG_UK.
2011/09/27 06:45  TEAM ticket upgraded to ALARM. … notified. Automatic ALARM acknowledgement recorded in the ticket, promising the expert’s response within …
2011/09/27 07:23  Site admin records in the ticket that the investigation is taking place with high priority.
2011/09/27 08:53  Service expert at the site records that a Castor DB inconsistency was found. DB RAL contacted. The ATLAS Castor at RAL was put in downtime.
2011/09/27 13:57  4 comments added by the expert at the site to rectify the diagnosis and to record in the ticket that the DB table needed to be rebuilt.
2011/09/27 14:55  Service expert sets the ticket to status ‘solved’.
2011/09/27 16:05  Submitter sets the ticket to status ‘verified’.

ATLAS ALARM->ASGC can’t get LFC replicas GGUS:74758 (slide 6)
What time UTC     What happened
2011/09/28 19:22  GGUS TEAM ticket, automatic notification to … AND automatic assignment to ROC_Asia/Pacific. “Type of Problem (ToP)” 1st usage!!! ToP: Storage Systems.
2011/09/28 20:26  Next shifter records in the ticket that the problem appears in the opposite direction as well.
2011/09/28 20:27  CERN/IT/ES ATLAS supporter raises the ticket to an ALARM.
2011/09/28 21:56  1st diagnosis shows a DoS caused by a panda user.
2011/09/29 02:26  Site admin. sets the ticket ‘in progress’.
2011/09/29 07:56  The ATLAS supporter from CERN confirms that ~10K concurrent jobs, each fetching 100MB from storage, were the reason for the DoS, bans the job submitter and sets the ticket to status ‘solved’.

ATLAS ALARM->CERN T0MERGE inaccessible GGUS:74838 (slide 7)
What time UTC     What happened
2011/09/30 13:01  GGUS ALARM ticket, automatic notification to atlas- AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. ToP: File Access.
2011/09/30 13:06  Operator records in the ticket that the Castor piquet was contacted.
2011/09/30 13:06  Castor expert puts the ticket ‘in progress’.
2011/09/30 13:37  Expert puts the ticket to status ‘solved’, recording that the known Transfer Manager problem was the cause. Stuck transfer requests were cleaned but available patches should be installed.
2011/09/30 14:36  Expert enters 2 more clarification comments.
2011/10/03 06:41  Submitter sets the ticket to status ‘verified’.

CMS ALARM->CERN CMSR DB down GGUS:74701 (slide 8)
What time UTC     What happened
2011/09/27 12:39  GGUS ALARM ticket, automatic notification to cms- AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. ToP: other (not selected).
2011/09/27 12:53  2nd Line support assigns the ticket to DB Instances 3rd Line.
2011/09/27 13:30  The operator records that the ticket is received but calls nobody.
2011/09/27 14:00  Service expert sets the ticket to status ‘solved’, confirming there was a problem with the DB but without explaining its cause.
2011/09/27 14:17  Submitter sets the ticket to status ‘verified’.

CMS ALARM->CERN Problem to open DB file GGUS:74709 (slide 9)
What time UTC     What happened
2011/09/27 17:46  GGUS TEAM ticket, automatic notification to grid- AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful. ToP: other (not selected).
2011/09/27 18:59  TEAM ticket upgraded to ALARM. Cms-operator- notified.
2011/09/27 19:19  Operator records in the ticket that phyDB support was contacted.
2011/09/27 19:27  Service expert puts the ticket in status ‘solved’ without explaining how.
2011/09/27 21:15  Submitter sets the ticket to status ‘verified’.

ALICE ALARM->CERN myproxy stopped working GGUS:75055 (slide 10)
What time UTC     What happened
2011/10/06 17:35  GGUS ALARM ticket, automatic notification to alice- AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation does NOT appear in the GGUS ticket diary!!! This is due to the KIT DNS problem (see slide 2). ToP: middleware.
2011/10/06 18:31  Operator records in the ticket that the IT PES PS piquet was contacted.
2011/10/06 19:46  Service expert comments in the ticket that the problem is fixed. The diagnosis was already given by the submitter, i.e. a change of host cert. led to authorisation failures.
2011/10/06 19:53  Submitter confirms that the problem went away.
2011/10/07 10:31  Late appearance of the SNOW ticket number.
2011/10/07 11:53  Service expert puts the ticket to status ‘solved’. A number of identical comments follow due to the duplicate notifications explained in slide 2. They stop when the submitter sets the ticket to status ‘verified’.

CMS ALARM->CERN myproxy stopped working GGUS:75056 (slide 11)
What time UTC     What happened
2011/10/06 17:43  GGUS ALARM ticket, automatic notification to cms- AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation does NOT appear in the GGUS ticket diary!!! This is due to the KIT DNS problem (see slide 2). ToP: File transfer (different from the identical report by ALICE, see previous slide).
2011/10/06 18:05  Service expert comments in the ticket that the problem is known and already fixed.
2011/10/06 18:22  The same expert comments in the ticket that one of the 2 myproxy hosts still gives errors and is temporarily disabled for verification.
2011/10/06 18:31  Operator records in the ticket that the IT PES PS piquet was contacted.
2011/10/06 18:43 - 21:22  3 comments exchanged for debugging, followed by status change to ‘solved’ and ‘verified’.
2011/10/07 10:35  Late appearance of the SNOW ticket number (reasons in the previous slide).
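
The timelines give each step as a UTC timestamp, so the time from ALARM submission to status ‘solved’ can be derived directly. The sketch below does this for four tickets that were opened as ALARM tickets from the start, using the timestamps recorded in the slides. The helper `minutes_between` and the `tickets` list are illustrative names only, not part of GGUS.

```python
from datetime import datetime

# Timestamp layout used in the timelines, e.g. "2011/09/19 11:40".
FMT = "%Y/%m/%d %H:%M"


def minutes_between(start, end):
    """Whole minutes elapsed between two timeline timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# (ticket, ALARM submitted, set to 'solved'), copied from the timelines.
tickets = [
    ("GGUS:74448", "2011/09/19 11:40", "2011/09/19 11:55"),
    ("GGUS:74456", "2011/09/19 15:48", "2011/09/19 16:25"),
    ("GGUS:74701", "2011/09/27 12:39", "2011/09/27 14:00"),
    ("GGUS:74838", "2011/09/30 13:01", "2011/09/30 13:37"),
]

time_to_solved = {
    ticket: minutes_between(opened, solved)
    for ticket, opened, solved in tickets
}

for ticket, minutes in time_to_solved.items():
    print(f"{ticket}: {minutes} min from ALARM to 'solved'")
```

Tickets that started as TEAM tickets (for example GGUS:74686) would need the upgrade time rather than the creation time as the starting point, which is why they are left out here.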