
1 WLCG Service Report Olof.Barring@cern.ch ~~~ WLCG Management Board, 7th July 2009

2 Introduction
- Quiet week again
- Decreasing participation
- No alarm tickets
- Incidents leading to post-mortems:
  - ATLAS post-mortem
  - FZK posted a post-mortem explaining their tape problems during STEP09
- RAL scheduled downtime for move to new Data Centre
- ASGC recovering?

3 Decreasing participation (STEP09)

4 GGUS summary

VO       User  Team  Alarm  Total
ALICE       2     1      0      3
ATLAS       9    13      0     22
CMS         3     0      0      3
LHCb        1    21      0     22
Totals     15    35      0     50

5 LHCb
Team tickets drifting up?
- Jobs failed or aborted at Tier 2: 8 tickets (5 of these 8 still open, all others closed)
- gLite WMS issues at Tier 1 (temporary): 5
- Data transfers to Tier 1 failing (disk full): 1
- Software area files owned by root: 1
- CE marked down but accepting jobs: 1
Nothing really unusual


7 PVSS2COOL incident 27-6 (1/3)
Incident report and affected services:
On Sunday afternoon 27-6, Viatcheslav Khomutnikov (Slava) from Atlas reported to the Physics DB service that the online reconstruction was stopped because an error was returned by the PVSS2COOL application (on the Atlas offline DB). The error started appearing on Saturday (26-6) evening.

8 PVSS2COOL incident 27-6 (2/3)
Issue analysis and actions taken:
- The error stack reported by Atlas indicated that the error was generated by a 'drop table' operation being blocked by the custom trigger set up by Atlas to prevent 'unwanted' segment drops. The trigger has been operational for several months. This information was fed back by the Physics DB service to Atlas on Sunday evening.
- On Monday morning Atlas still reported the blocking issue, and upon further investigation they were not able to find which table the application (PVSS2COOL) wanted to drop (thereby causing the blocking error), as the issue appeared in a block of code responsible for inserting data.
- The Physics DB service, in collaboration with the Atlas DBAs, then ran 'logmining' of the failed drop operation and found that the application was indeed trying to drop segments in the recycle bin of the schema owner (ATLAS_COOLOFL_DCS).
- Further investigation with SQL trace by the DBAs showed that Oracle attempted to drop objects in the recycle bin when PVSS2COOL wanted to bulk insert data. This operation was then blocked by the custom Atlas trigger that blocks drops in production, hence the error message originally reported.
- Metalink note 265253.1 then further clarified that the issue was a side effect of an expected behaviour of Oracle's space reclamation process.
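To make the failure mode concrete, the following is a minimal diagnostic sketch (not part of the original report) that lists what is sitting in the recycle bin of the ATLAS_COOLOFL_DCS schema, i.e. the dropped segments that Oracle may silently purge to reclaim space during a bulk insert. It assumes Python with the cx_Oracle module and an account allowed to read DBA_RECYCLEBIN; the connection details are placeholders, not the real ATLAS offline DB settings.

import cx_Oracle

SCHEMA = "ATLAS_COOLOFL_DCS"  # schema owner named in the incident report

# Placeholder credentials/DSN -- illustrative only.
conn = cx_Oracle.connect(user="dba_user", password="secret",
                         dsn="db-host:1521/service_name")

QUERY = """
    SELECT original_name, type, droptime, can_purge, space
      FROM dba_recyclebin
     WHERE owner = :owner
     ORDER BY droptime
"""

with conn.cursor() as cur:
    cur.execute(QUERY, owner=SCHEMA)
    for original_name, obj_type, droptime, can_purge, space in cur:
        # Anything listed here is a candidate for Oracle's automatic
        # space-reclamation purge -- the internal DROP that was rejected
        # by the custom block-drop trigger during PVSS2COOL bulk inserts.
        print(f"{original_name:30} {obj_type:10} dropped={droptime} "
              f"purgeable={can_purge} space_blocks={space}")

conn.close()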

9 PVSS2COOL incident 27-6 (3/3)
Issue resolution and expected follow-up:
- On the evening of 29-6, Physics DB support, in collaboration with the Atlas DBAs, extended the datafile of the PVSS2COOL application to circumvent this space reclamation issue. Atlas has reported that this fixed the issue.
- Further discussions on the role of the recycle bin and on possible improvements to the Atlas 'block drop trigger' are currently in progress to avoid further occurrences of this issue.
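Purely as an illustration of the fix and of the follow-up under discussion (the exact commands, file names and trigger code are not given in the report and are assumptions here), the sketch below shows (1) resizing a datafile so Oracle no longer has to purge the recycle bin to find free space, and (2) one possible shape of a drop-blocking DDL trigger that exempts Oracle's internal recycle-bin segments (names starting with 'BIN$'). This is not the actual Atlas trigger.

import cx_Oracle

# Placeholder connection; datafile path, size and trigger name are hypothetical.
conn = cx_Oracle.connect(user="dba_user", password="secret",
                         dsn="db-host:1521/service_name")
cur = conn.cursor()

# (1) The applied fix: extend the datafile so bulk inserts find free space
#     without triggering Oracle's recycle-bin purge.
cur.execute("ALTER DATABASE DATAFILE '/oradata/example/cool_dcs_01.dbf' RESIZE 20G")

# (2) One possible improvement to the custom block-drop trigger: keep
#     rejecting application-level DROPs in production, but let Oracle's
#     own purge of recycle-bin objects (BIN$...) go through.
cur.execute("""
    CREATE OR REPLACE TRIGGER atlas_coolofl_dcs.block_drop_trg
    BEFORE DROP ON atlas_coolofl_dcs.SCHEMA
    BEGIN
        IF ora_dict_obj_name NOT LIKE 'BIN$%' THEN
            RAISE_APPLICATION_ERROR(-20001,
                'DROP of production segments is not allowed');
        END IF;
    END;
""")

conn.close()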

10 FZK tape problems during STEP09
Jos posted a post-mortem analysis of the tape problems seen at FZK during STEP09:
https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_storage_FZK_GridKa.pdf
Too long to fit here, but in summary:
- Before STEP09:
  - An update to fix a minor problem in the tape library manager resulted in stability problems. Possible cause: SAN or library configuration. Both were tried and the problem disappeared, but which one was the root cause?
  - The second SAN had reduced connectivity to the dCache pools: not enough for CMS and ATLAS at the same time, so CMS was asked not to use tape.
- First week of STEP09:
  - Many problems: hardware (disk, library, tape drives) and software (TSM).
- Second week of STEP09:
  - Adding two more dedicated stager hosts resulted in better stability.
  - Finally getting stable rates of 100-150 MB/s.

11 RAL scheduled downtime for DC move
- Friday 3/7: reported still on schedule for restoring CASTOR and Batch on Monday 6/7.
- Despite presumably hectic activity with equipment movements, RAL continued to attend the daily conf call.
- Planning and detailed progress reported at: http://www.gridpp.rl.ac.uk/blog/category/r89-migration
- From the blog, "R89 Migration: Friday 3rd July", posted by Andrew Sansum at 12:00: Our last dash towards restoration of the production service is under way. All racks of disk servers have now had a first-pass check. The faults list is currently 11 servers, although some of these may well be trivial. We expect to provide a large number of disk servers to the CASTOR team later today.

12 ASGC instabilities
- ATLAS reported instabilities at the beginning of the week:
  - Monday: functional tests worked, but still some problems with Tier-1 → Tier-2 transfers.
  - Another unscheduled downtime (recabling of CASTOR disk servers).
- CMS allowed the full week as a grace period for ASGC to recover from all its problems:
  - No new tickets, and open tickets put on hold.
  - Resume on Monday 6/7.
- Both the ATLAS- and CMS-specific site tests changed from red to green during the week.
- Friday 3/7: Gang reports that tape drives and servers are online.

13 Summary
- Daily meeting attendance is degrading – holidays?
- No new serious site issues.
- RAL's long downtime for the DC move is progressing to plan (Tuesday report: RAL back apart from CASTORATLAS; some network instability).
- Tape problems at FZK during STEP09 understood.
- ASGC is recovering?

