1 WLCG Service Report
Jamie.Shiers@cern.ch ~~~ WLCG Management Board, 10th March 2009

2 Introduction
This report covers the two-week period since the last WLCG MB.
Our run of "no major service incidents" has been broken, with several incidents in the last two weeks.
One of these incidents – the fire in Taipei – will take a long time to recover from fully (up to 2 months!)
Recovery is underway – LFC is back and FTS soon(?); update at tomorrow's GDB…
Another – the CASTOR-related problems due to a network intervention at CERN – also needs further analysis: human mistakes are probably inevitable, but IMHO this outage was COMPLETELY avoidable.
Action: IT-DES experts will be present whenever a further such intervention is performed.

3 Major Service Incidents

Site    When     What             Report?
CNAF    21 Feb   Network outage   Promised…
ASGC    25 Feb   Fire             25/2 & 2/3
NL-T1   3 Mar    Cooling
CERN    4 Mar    Human error      Provided by IT-FIO (Olof) (FIO wiki of service incidents)

Wide disparity in reports – both in level of detail and in the delay in producing them (some others still pending…)
We agreed that reports should be produced by the following MB – even if some issues are still not fully understood.
Would adopting a template – such as that used by IT-FIO or GridPP – help? (Discuss at the pre-CHEP workshop… a purely illustrative sketch of such a template follows below.)
Is the MB content with the current situation?
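As a straw man only (this is not the actual IT-FIO or GridPP format – the field names below are assumptions, loosely mirroring the structure of the FIO post-mortem on the next slide), such a template might capture something like:

```python
# Illustrative sketch only -- NOT the actual IT-FIO or GridPP template.
# Field names are assumptions, loosely following the post-mortem structure
# shown on the next slide (description, impact, time line, follow-up).
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class TimelineEntry:
    when: datetime        # e.g. 2009-03-04 09:43
    what: str             # e.g. "Oracle databases went down"

@dataclass
class ServiceIncidentReport:
    site: str             # e.g. "CERN"
    date: str             # date of the incident
    title: str            # short summary, e.g. "CASTOR DB outage"
    description: str      # what happened and why
    impact: str           # which services/users were affected, and for how long
    timeline: List[TimelineEntry] = field(default_factory=list)
    follow_up: List[str] = field(default_factory=list)  # actions to avoid recurrence
```

The point of a fixed structure is less the exact fields than that every site reports the same minimum information by the following MB.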

4 CASTOR: switch intervention
Announcements: at the IT "CCSR" meeting of 25 Feb, interventions on the private switches of the DBs for the CASTOR+ services were announced:
"Oracle DB services: the firmware of the network switches in the LAN used for accessing the NAS filers, as well as the LAN used to implement the Oracle Cluster interconnect, will be upgraded. This intervention should be transparent for users since these LANs use a redundant switch configuration."
Only the intervention on 2 Mar was put on the IT service status board and AFAIK there was no EGEE broadcast ("at risk" would have been appropriate).
But the intervention was done on 4 Mar anyway!
News regarding the problem and its eventual resolution was poorly handled – no update was made after 11:30 on 4 Mar – despite a "promise": "We will update on the status and cause later today, sorry for inconvenience".
The reports at the 4 Mar CCSR were inconsistent and incomplete: the service as seen by the users was down from around 9:45 for 3–4 hours.
At least some CASTOR daemons / components are not able to reconnect to the DB in case of problems – this is NOT CONSISTENT with WLCG service standards.
Cost of this intervention: IT – minimum several days; users – ?

5 CASTOR p.m. – Olof
https://twiki.cern.ch/twiki/bin/view/FIOgroup/PostMortem20090304
Description: all CASTOR Oracle databases went down at the same time following a 'transparent' intervention on the private network switches between the NAS headnodes and the storage. This caused a general service outage of the stagers and the CASTOR name server (as well as other central services).
Impact: all CASTOR and SRM production instances were down for approximately 3 hours.
Time line of the incident:
09:43: Oracle databases went down
10:01: users started to report problems accessing CASTOR
10:26: CASTOR service manager submitted a first incident message for posting on the service status board
10:40: most of the databases are back; the srm-*-db databases are still down for now
11:30: most databases back except srm-atlas-db and c2cmsdlfdb
11:34: CASTOR name server daemons restarted; this was required in order to re-establish the database sessions
11:36: service status board updated with the information that most databases were back
11:45: castorcms recovered
11:49: all databases back
13:00: castoratlas and castorlhcb recovered
13:21: all SRM servers restarted
13:30: castorpublic recovered
16:48: castorcernt3 recovered
The network bandwidth plots for the various instances (see the bottom of the wiki page) give a good indication of the outage period for the 5 instances.
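As a quick sanity check of the quoted "approximately 3 hours", the outage window can be read straight off the time line above; a trivial sketch of the arithmetic (the times come from the list, nothing else is assumed):

```python
# Rough outage-window arithmetic from the 4 March 2009 time line above.
from datetime import datetime

fmt = "%H:%M"
db_down      = datetime.strptime("09:43", fmt)  # Oracle databases went down
stagers_back = datetime.strptime("13:00", fmt)  # castoratlas and castorlhcb recovered
srm_back     = datetime.strptime("13:21", fmt)  # all SRM servers restarted

print("stager outage:", stagers_back - db_down)  # ~3h17m, i.e. "approximately 3 hours"
print("SRM outage:   ", srm_back - db_down)      # ~3h38m
```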

6 CASTOR – cont.
When the databases started to come back, the CASTOR2 stager daemons automatically reconnected. This was not sufficient to recover the service: the CASTOR name servers were stuck in stale Oracle sessions. That problem was discovered by the CASTOR development team and the servers had to be restarted.
However, even after the name servers had been restarted, several CASTOR instances (castoratlas, castorlhcb, castorpublic and castorcernt3) were still seriously degraded. It is likely that the CASTOR2 stager daemons were stuck in name server client commands. A full restart of all CASTOR2 stager daemons on the affected instances finally recovered the production services by ~13:00. All the SRM daemons were restarted at 13:21 for the same reasons.
The recovery of the less critical instance castorcernt3 was delayed because its deployment architecture is different. It was only understood in the late afternoon that the instance was stuck because it runs an internal CASTOR name server daemon (this will become the standard architecture with the production deployment). After restarting that daemon (and the stager daemons) the instance rapidly recovered.
The existing procedure for recovering CASTOR from scratch (PowercutRecovery) needs to be reviewed. The recovery of some of the CASTOR stager instances took longer than necessary. The likely reason is that, although the database connections had been automatically re-established, most of the threads were stuck in CASTOR name server calls (this cannot be confirmed).
Next time a message should also be posted to the Service Status Board when the service has been fully recovered.
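For illustration only, the kind of reconnect behaviour the report says was missing – detect a dead session, back off, open a fresh one instead of blocking in stale calls – might look roughly like the sketch below. This is generic pseudologic, not CASTOR code; connect() and is_connection_error() stand in for whatever the real database client library provides.

```python
# Generic sketch of reconnect-on-stale-session behaviour; NOT CASTOR code.
import time

def connect():
    """Assumed helper: open a new DB session via the client library in use."""
    raise NotImplementedError

def is_connection_error(exc):
    """Assumed helper: True if the exception means the session is dead or stale."""
    return True

def execute_with_reconnect(query, max_retries=5, backoff_seconds=10):
    """Retry a query, replacing a stale session instead of blocking on it forever."""
    conn = connect()
    for _ in range(max_retries):
        try:
            return conn.execute(query)
        except Exception as exc:
            if not is_connection_error(exc):
                raise                        # a real error, don't mask it
            time.sleep(backoff_seconds)      # give the database time to come back
            conn = connect()                 # drop the stale session, open a fresh one
    raise RuntimeError(f"database still unreachable after {max_retries} attempts")
```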

7 CERN Network – Warning! Postponed to April 1st?
There will be an "Important Disruptive Network Intervention on March 18th":
06:00 – 08:00 Geneva time: this will entail a ~15 min interruption, which will affect access to AFS, NICE, MAIL and all databases hosted in the GPN, among other services. Next, the switches in the General Purpose Network that have not previously been upgraded will be upgraded, resulting in a ~10 min interruption. All services requiring access to services hosted in the Computer Center will see interruptions.
08:00 – 12:00 Geneva time: the routers of the LCG network will be upgraded at 08:00, mainly affecting the batch system and CASTOR services, including Grid-related services. The switches in the LCG network that have not previously been upgraded will be upgraded next.
(Recent network interventions have been aimed at reducing the amount of work done in this major intervention.)
See the FIO preparation page for this intervention + also the joint OPS meetings.

8 ASGC Fire
Report from ASGC: "It has been a disaster. The whole data center area is affected: the damage to the UPS battery brought the entire power system down, and dust and smoke spread into the other computer room, in which all computing and storage facilities reside. A minor water leak was observed while the fire fighters were trying to suppress the fire in the power room. We left the DC an hour ago; right now the situation in the data center is not acceptable for humans to stay in for long."
Full recovery might take up to 1.5 – 2 months

9 GGUS Summaries

Week 9
VO concerned   USER   TEAM   ALARM    TOTAL
ALICE          –      –      –        2
ATLAS          25     24     –        49
CMS            –      –      –        5
LHCb           17     1      –        18
Totals         –      –      –        74

Week 10
VO concerned   USER   TEAM   ALARM    TOTAL
ALICE          –      –      9 (??)   9
ATLAS          32     6      12       50
CMS            3      –      –        3
LHCb           13     2      –        15
Totals         48     8      30       86

Alarm tests were performed successfully against the Tier0 & Tier1s.
Still problems with the mail2SMS gateway at CERN (FITNR) – some VOs sent an "empty" alarm and not a "sample scenario".
Should we re-test soon or wait 3 months for the next scheduled test?

10 Alarm Summary
Most of the alarm tickets are (successful) tests of alarm flows (some small problems still…)
GGUS Ticket-ID: 46821
Description: All transfers to BNL fail
Detailed description: Hello, since 21:00 (CET), all transfers to BNL fail. The main error message is: FTS State [Failed] FTS Retries [1] Reason [SOURCE error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv1]]. Can you provide news? Stephane
Solution: The BNL GUMS service has been fixed. This, in turn, fixed the DDM problem in dCache. The DDM service is working normally at BNL and in the US. Hiro
This solution has been verified by the submitter.
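The FTS failure reason quoted in the ticket follows a recognisable "side / phase / error class / detail" pattern. As a hedged sketch only (the regular expression below is inferred from this single example and is not an official FTS message format), such strings could be triaged automatically when scanning many failed transfers:

```python
# Hedged sketch: extract the phase and error class from FTS failure reasons
# shaped like the one in GGUS #46821. The pattern is inferred from that single
# example and may not cover every FTS message format.
import re

REASON_RE = re.compile(
    r"(?P<side>SOURCE|DESTINATION) error during (?P<phase>\w+) phase:\s*"
    r"\[(?P<error_class>[A-Z_ ]+)\]\s*(?P<detail>.*)"
)

def classify(reason: str) -> dict:
    m = REASON_RE.search(reason)
    if not m:
        return {"side": "UNKNOWN", "phase": "UNKNOWN",
                "error_class": "UNKNOWN", "detail": reason}
    return m.groupdict()

example = ("SOURCE error during TRANSFER_PREPARATION phase: [CONNECTION_ERROR] "
           "failed to contact on remote SRM [httpg://dcsrm.usatlas.bnl.gov:8443/srm/managerv1]")
print(classify(example))
# -> side=SOURCE, phase=TRANSFER_PREPARATION, error_class=CONNECTION_ERROR, detail=...
```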

11 GGUS Alarm Tests cont.
LHCb (Roberto Santinelli) have also performed tests yesterday – interim results are available at
These results are still being analyzed – IMHO it is premature to draw concrete conclusions from them, but it would be interesting to understand why and how the ATLAS & CMS tests were globally successful whereas for LHCb at least some sites – and possibly also the infrastructure – gave some problems.
For this and other reasons I suggest that we prepare carefully for another test, to be executed and analyzed PRIOR to next month's F2F / GDB.

12 Service Summary – 23 Feb – 1 Mar

13 Service Summary – “Last Week”

14 Last 2 Weeks 11 March 2009

15 Summary
Major service incidents at several sites in the last two weeks.
A prolonged outage of the entire ASGC site is to be expected due to the fire.
A poorly announced & executed intervention at CERN affected all CASTOR services for several hours.
Other than ASGC, CNAF and RAL appear to have regular problems with the experiment tests (note that short-term glitches get smeared out in the weekly views).
Otherwise things are OK!

