
1 WLCG Service Report
Andrea.Valassi@cern.ch
~~~
WLCG Management Board, 9th August 2011

2 Introduction
- 3 busy weeks since the last MB report on July 19th
- Good data taking with LHC record fills (passed the 2 fb⁻¹ mark on August 5!)
- Three Service Incident Reports received:
  - IN2P3 outage of 13 DBs due to disk failures on July 19th-21st (SIR)
    - Affected ATLAS (COOL, LFC, AMI), CMS (FTS), LHCb (COOL, LFC) for >1 week
  - GGUS ALARM submission affected by KIT mail interface, July 22nd-26th (SIR)
  - Loss of 11k ATLAS files at KIT due to dirty GPFS, July 12th-26th (SIR)
- One more Service Incident Report is expected:
  - CERN KDC flood from ATLAS users in May-June (reported at last MB)
- 4 real GGUS ALARMs (3 for ATLAS and 1 for CMS)
  - All about storage: at CERN (CASTOR) and CNAF (StoRM)
- Other notable issues reported at the daily meetings:
  - Major power outage at FNAL due to thunderstorm on July 29
  - StoRM issues at many ATLAS sites after 1.7.0 upgrade, workarounds applied
  - Low CPU efficiency of ALICE jobs finally solved (new hw, xrootd, svc config)
  - ADCR DB performance slow (after move to standby hw, but not correlated?)

3 GGUS summary (3 weeks)

VO       User   Team  Alarm  Total
ALICE       4      0      1      5
ATLAS      15    105      7    127
CMS         4      0      2      6
LHCb        8     22      1     31
Totals     31    127     11    169
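For illustration, the table above is a plain count of tickets per VO and type plus row and column totals. A minimal Python sketch of that aggregation follows; the flat list of (VO, type) records is hypothetical, as the real summary is produced by the GGUS portal itself:

```python
from collections import defaultdict

# Hypothetical flat export of GGUS tickets as (vo, kind) pairs;
# the actual report is generated by the GGUS portal.
tickets = [
    ("ATLAS", "team"), ("ATLAS", "alarm"), ("CMS", "user"),
    ("ALICE", "user"), ("LHCb", "team"), ("ATLAS", "team"),
]

KINDS = ("user", "team", "alarm")
counts = defaultdict(lambda: dict.fromkeys(KINDS, 0))
for vo, kind in tickets:
    counts[vo][kind] += 1

# Print per-VO rows and accumulate column totals.
totals = dict.fromkeys(KINDS, 0)
print(f"{'VO':<8}" + "".join(f"{k.title():>7}" for k in KINDS) + f"{'Total':>7}")
for vo in sorted(counts):
    row = counts[vo]
    print(f"{vo:<8}" + "".join(f"{row[k]:>7}" for k in KINDS)
          + f"{sum(row.values()):>7}")
    for k in KINDS:
        totals[k] += row[k]
print(f"{'Totals':<8}" + "".join(f"{totals[k]:>7}" for k in KINDS)
      + f"{sum(totals.values()):>7}")
```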

4 Support-related events since last MB
- There were 4 real ALARM tickets since the 2011/07/18 MB (3 weeks): 3 submitted by ATLAS, 1 by CMS, all 'solved' and 'verified'; 2 of them for CERN CASTOR, 2 for CNAF StoRM.
- Ongoing GGUS problems in ALARM submission and/or escalation:
  - Problems between June 12-27 were already reported at the last MB, due to the new KIT exim mailer, and supposedly solved during the week of June 27.
  - For the ATLAS ticket of July 24, GGUS did not allow ALARM submission and also failed to notify operators on TEAM-to-ALARM escalation. For the CMS ALARM submitted on July 26, the piquet was not called. These issues were solved last week at KIT (see SIR) and validated with test alarms.
  - This weekend, again, an ALARM submitted by ATLAS to INFN on August 6 did not reach the SMS system of the site. This had already been reported on July 17 (GGUS:72717). CNAF reported this morning that a fix has been applied and validated (tests have confirmed that ALARMs correctly trigger SMS messages).
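The "test alarms" mentioned above amount to pushing a clearly-labelled message through the ALARM mail route and confirming it reaches the site's operator/SMS channel. A minimal sketch of such a probe, assuming placeholder addresses, SMTP host, and subject convention (not the actual GGUS/KIT configuration):

```python
import smtplib
from email.message import EmailMessage

def send_test_alarm(smtp_host: str, sender: str, alarm_addr: str,
                    ticket_id: str) -> None:
    """Send a clearly-labelled TEST alarm through the normal mail route."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = alarm_addr
    msg["Subject"] = f"TEST ALARM (GGUS #{ticket_id}) - routing probe, please ignore"
    msg.set_content("Verifying that ALARM mail reaches the site contact point "
                    "and triggers the operator/SMS notification chain.")
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)

# Hypothetical usage with placeholder addresses; whether the SMS gateway or
# piquet phone actually fired still has to be confirmed on the receiving side.
send_test_alarm("smtp.example.org", "ggus-test@example.org",
                "t1-alarms@example.org", "00000")
```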

5 ATLAS ALARM->CERN CASTOR: ATLAS DOWN (GGUS:72890)
Timeline (times in UTC):
- 2011/07/24 03:16 (Sunday): GGUS TEAM ticket (as GGUS did not allow direct ALARM submission!), automatic email notification to grid-cern-prod-admins@cern.ch AND automatic assignment to ROC_CERN.
- 2011/07/24 03:17: Submitter immediately escalates ticket to ALARM. Email notification recorded as 'Sent to atlas-operator-alarm@cern.ch' (but no email received by operators & service mgrs?). Automatic SNOW ticket creation successful.
- 2011/07/24 06:34: Supporter records that data export from CERN is also affected.
- 2011/07/24 06:43-07:57: Supporter calls 75011. Operator had received no alarm! Supporter emails computer.operations@cern.ch and later also atlas-operator-alarm@cern.ch and castor.operations@cern.ch.
- 2011/07/24 08:03: CASTOR developer confirms TEAM-to-ALARM did not work and observes that no problem can be seen at this time.
- 2011/07/24 08:20-08:44: Supporter confirms the problem was real. ATLAS data export still suffering due to the backlog accumulated while CASTOR was down.
- 2011/07/26 10:16: CASTOR manager puts ticket on hold, discussion ongoing with ATLAS.
- 2011/07/29 16:35-20:56: CASTOR expert sets ticket 'solved', applying workarounds and hotfixes. Submitter sets ticket 'verified'.

6 CMS ALARM->CERN CASTOR: XROOTD REDIRECTOR NOT WORKING (GGUS:72944)
Timeline (times in UTC):
- 2011/07/26 08:56: GGUS ALARM ticket, automatic notification to cms-operator-alarm@cern.ch AND automatic assignment to ROC_CERN. Automatic SNOW ticket creation successful.
- 2011/07/26 09:56: CASTOR admin restarts the redirector and asks if all is ok. "Redirector threads were busy with CASTOR (stuck in synchronous Puts), so new requests were stuck (and would eventually run into Kerberos Clock skew detection). The number of threads can be increased, but this might point to some overload issue. We might also have hit some issue with locking on the Kerberos replay cache; a core dump was taken and is being looked at."
- 2011/07/26 09:58: CASTOR admin adds: "For the record, ALARM seems not to have reached CERN via the usual channels (i.e. no parallel routing to CERN operator or SMS alert list, hence no piquet call)."
- 2011/07/26 10:15-21:12: Submitter replies and CASTOR admin sets ticket 'solved' and later 'verified'.
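The failure mode quoted above (a fixed pool of redirector threads all blocked in synchronous backend calls, so new requests queue up until clients give up) can be reproduced in miniature with a self-contained Python sketch; this only illustrates the mechanism and is not CASTOR or xrootd code:

```python
import concurrent.futures as cf
import time

POOL_SIZE = 4            # fixed worker pool, like the redirector's thread pool
STUCK_BACKEND_SECS = 10  # a synchronous backend call that does not come back
CLIENT_TIMEOUT = 1.0     # clients give up after this long

def handle_request(i: int) -> str:
    time.sleep(STUCK_BACKEND_SECS)  # every worker blocks here ("stuck in synchronous Puts")
    return f"request {i} served"

pool = cf.ThreadPoolExecutor(max_workers=POOL_SIZE)
futures = [pool.submit(handle_request, i) for i in range(8)]

# The first POOL_SIZE requests occupy all workers; the remaining ones never
# even start, and from the clients' point of view everything times out.
for i, fut in enumerate(futures):
    try:
        print(fut.result(timeout=CLIENT_TIMEOUT))
    except cf.TimeoutError:
        print(f"request {i}: timed out waiting on the redirector")

pool.shutdown(wait=False, cancel_futures=True)
```

Raising the pool size only delays the symptom while the backend stays stuck, which matches the admin's remark that more threads "might point to some overload issue" rather than be a cure.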

7 ATLAS ALARM->INFN: SRM DOWN (GGUS:73054)
Timeline (times in UTC):
- 2011/07/29 15:13: GGUS TEAM ticket, automatic email notification to t1-admin@lists.cnaf.infn.it AND automatic assignment to NGI_IT.
- 2011/07/29 16:02: Transfers from T0 are also failing. Supporter escalates ticket to ALARM. Notification sent to address t1-alarms@cnaf.infn.it.
- 2011/07/29 16:02: Automatic reply: "You are not allowed to trigger an SMS alarm for INFN Tier1. Anyway your message has been forwarded to the operations mailing list."
- 2011/07/29 16:45: Site admin restarts the GPFS process in the StoRM BE, asks if ok now.
- 2011/07/29 17:21: Supporter confirms all is ok, ticket can be closed.
- 2011/07/31 03:23: Shifter reopens ticket because SRM is down again.
- 2011/07/31 05:04: Supporter sets ticket as closed and moves the new SRM issue to new TEAM ticket GGUS:73068 (to be escalated if not solved promptly, but the issue is fixed at 05:53).
- 2011/07/31 17:22: Supporter sets ticket as 'verified'.

8 ATLAS ALARM->INFN: PUT GRIDFTP_COPY_WAIT: CONNECTION TIMED OUT (GGUS:73236)
Timeline (times in UTC):
- 2011/08/06 14:30 (Saturday): GGUS TEAM ticket, automatic email notification to t1-admin@lists.cnaf.infn.it AND automatic assignment to NGI_IT.
- 2011/08/06 17:43: SRM seems to be down. Supporter escalates ticket to ALARM. Notification sent to address t1-alarms@cnaf.infn.it.
- 2011/08/06 17:43: Automatic reply: "You are not allowed to trigger an SMS alarm for INFN Tier1. Anyway your message has been forwarded to the operations mailing list."
- 2011/08/06 19:53: Site admin resets the StoRM BE via power cycle, asks if ok now. The problem with SMS will be investigated during the week.
- 2011/08/06 22:30: Supporter confirms all is ok, sets ticket as closed and verified.

9 [Availability plots: week of 18/07/2011, with annotations 2.1 (ATLAS) and 4.1 (LHCb) analysed on the next slide]

10 Analysis of the availability plots: Week of 18/07/2011
ATLAS 2.1: IN2P3-CC - UNSCHEDULED - problem with disk on the Oracle cluster - DB service was unstable.
LHCb 4.1: LCG.IN2P3.fr - UNSCHEDULED - problem with disk on the Oracle cluster - DB service was unstable.

11 [Availability plots: week of 25/07/2011, with annotations 2.1 (ATLAS) and 3.1 (CMS) analysed on the next slide]

12 Analysis of the availability plots: Week of 25/07/2011
ATLAS 2.1: Taiwan-LCG2 - SCHEDULED - Network Maintenance.
CMS 3.1: T1_TW_ASGC - SCHEDULED - Network Maintenance and PhEDEx agent upgrade.

13 [Availability plots: week of 01/08/2011, no annotations]

14 Analysis of the availability plots: Week of 01/08/2011
All sites were operating above the 50% threshold during the entire week. Nothing to report.
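For illustration, the check behind that statement (flagging any site whose availability dips below the 50% threshold) can be sketched in a few lines; the site names and numbers below are invented, as the real figures come from the WLCG availability dashboards:

```python
# Hypothetical per-site daily availability fractions for one week;
# real figures come from the WLCG/SAM availability dashboards.
week = {
    "SITE-A": [0.95, 0.40, 0.35, 0.90, 0.99, 1.00, 0.98],
    "SITE-B": [1.00, 1.00, 0.97, 0.99, 1.00, 0.96, 1.00],
}

THRESHOLD = 0.50

for site, daily in sorted(week.items()):
    weekly = sum(daily) / len(daily)
    bad_days = [d for d, a in enumerate(daily, start=1) if a < THRESHOLD]
    status = ("nothing to report" if not bad_days
              else f"below threshold on day(s) {bad_days}")
    print(f"{site:<8} weekly avg {weekly:.0%} - {status}")
```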

15 Conclusions
- Business as usual: successful record data taking.
- Serious issue with databases at IN2P3 affecting ATLAS, CMS, LHCb.
- Experienced many GGUS problems with ALARM submission and escalation (operators and piquet not always contacted).

