WLCG Service Report ~~~ WLCG Management Board, 20th January 2009

Overview

- Summary of GGUS tickets last week
- Summary of outstanding or new “significant” service incidents:
  - Several “DB-related” problems – at least one critical! (IMHO)
  - dCache problems at FZK; similar problems have been seen at other sites: PIC & now Lyon(?)
  - Outstanding problem exchanging (ATLAS) data between FZK and NDGF
  - ATLAS stager DB at CERN blocked on Saturday: Savannah ticket (degradation of ~1 hour)
  - FTS transfers blocked for 16h, also on Saturday: “SIR”
  - Oracle problems with CASTOR/SRM DB: “big ID” syndrome seen now at CERN (PPS) and ASGC(?)

I got the following error on the ATLAS and CMS dashboards:

  DESTINATION error during TRANSFER_PREPARATION phase:
  [STORAGE_INTERNAL_ERROR] No type found for id :

On the Stager DB:
  SQL> select max(id) from id2type;
  MAX(ID) E+24

On the SRM DB:
  SQL> select max(id) from id2type;
  MAX(ID) E+20
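Purely as an illustration, a small monitoring probe along the following lines could run the same check against both databases and flag the “big ID” condition automatically. This is a minimal sketch: the connection strings, the sanity threshold and the use of cx_Oracle are assumptions; only the SELECT on id2type comes from the slide above.

# Hedged sketch: detect the "big ID" syndrome by comparing max(id) in the
# id2type table against a sanity threshold. Connect strings, credentials and
# the threshold are placeholders; only the query itself is taken from the slide.
import cx_Oracle

DATABASES = {
    "stager": "monitor/secret@stager-db",  # placeholder connect string
    "srm":    "monitor/secret@srm-db",     # placeholder connect string
}
ID_SANITY_LIMIT = 1e15  # assumed: values orders of magnitude above this are suspect

def max_id(connect_string):
    """Return max(id) from id2type on the given database."""
    conn = cx_Oracle.connect(connect_string)
    try:
        cur = conn.cursor()
        cur.execute("select max(id) from id2type")
        (value,) = cur.fetchone()
        return value
    finally:
        conn.close()

for name, dsn in DATABASES.items():
    value = max_id(dsn)
    if value is None:
        print(f"{name}: id2type is empty")
    elif value > ID_SANITY_LIMIT:
        print(f"{name}: max(id) = {value:.3e}  <-- possible 'big ID' syndrome")
    else:
        print(f"{name}: max(id) = {value} (ok)")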

GGUS Summary – Has Changed!

VO (submitted by)   User   Team   Alarm   Total
ALICE                  2      0       0       2
ATLAS
CMS                    6      0       0       6
LHCb                   1      3       0       0

After agreeing that the format of the GGUS summary was OK, it was changed without prior warning: it is now split into “submitted by”, “assigned to” and “affecting”, and the layout has also slightly changed. The above table shows tickets submitted by the VOs.

Alarm “Tickets”

There was no GGUS alarm ticket during this period, but ATLAS attempted to use the alarm flow for the FTS problem on Saturday:

- Sat 06:25 – Kors' delegated proxy gets corrupted, and transfers start failing
- Sat 11:16 – GGUS team ticket opened by Stephane Jezequel
- Sat 21:28 – Stephane sends an e-mail to atlas-operator-alarm, and the CERN CC operator calls the FIO Data Services piquet (Jan) at 21:55
- Sat 22:30 – Gavin (called by Jan) fixes the problem and informs ATLAS

Once again, we are hit by ‘Major’ bug 33641 (“corrupted proxies after delegation”), opened Feb 19, 2008. More info here.

Follow-up:
- [DM group] Increase priority of bug 33641, opened Feb 19, 2008
- [Gavin] Reinstate temporary (huh!) workaround
- [Jan] Verify alarm flow:
  - No SMS received by SMoDs – understood: the mail2sms gateway cannot handle signed e-mails. Miguel will follow this up [a new gateway later this month(?) will at least send the subject, which is OK]
  - No e-mail received by SMoDs – understood: missing posting permissions; unclear how/when they disappeared
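The underlying trigger here was a corrupted delegated proxy. Purely as an illustration of the kind of lightweight check a monitoring probe could run, the sketch below verifies that a proxy file still parses as a certificate and has lifetime left; the file path and alert threshold are placeholders, and this does not reproduce the actual FTS delegation internals.

# Hedged sketch: check that a proxy certificate file parses and is not about
# to expire. PROXY_FILE and MIN_HOURS_LEFT are placeholders for illustration.
import subprocess
import sys
from datetime import datetime, timezone

PROXY_FILE = "/path/to/delegated.proxy"   # placeholder path
MIN_HOURS_LEFT = 6                         # assumed alert threshold

def proxy_end_time(path):
    """Return the notAfter time of the certificate, via 'openssl x509 -enddate'."""
    out = subprocess.run(
        ["openssl", "x509", "-in", path, "-noout", "-enddate"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()                       # e.g. "notAfter=Jan 20 12:00:00 2009 GMT"
    value = out.split("=", 1)[1]
    return datetime.strptime(value, "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc)

try:
    end = proxy_end_time(PROXY_FILE)
except subprocess.CalledProcessError:
    sys.exit("proxy file does not parse as a certificate - possibly corrupted")

hours_left = (end - datetime.now(timezone.utc)).total_seconds() / 3600
print(f"proxy valid until {end} ({hours_left:.1f} h left)")
if hours_left < MIN_HOURS_LEFT:
    sys.exit("proxy is expired or about to expire - raise an alarm")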

DB Services

- Online conditions data replication using Streams between the ATLAS offline database and the Tier1 site databases was restored on Thursday. This action had been pending since the problem affecting the ATLAS Streams setup before the CERN Annual Closure (just over 1 month in total).
- There are still problems with CMS streaming from online to offline: the replication usually works fine, but the capture process fails on a regular basis (once a day on average). The problem seems to be related to incompatibilities between Oracle Streams and Oracle Change Notification. We are in contact with Oracle Support on this, but in parallel we are looking for ways to stop using the Change Notification functionality.
- The Tier1 ASGC (Taiwan) has been removed from the ATLAS (conditions) Streams setup. This database has been down since Sunday due to a block corruption in one tablespace and cannot be restored because a good backup is missing. A new h/w setup is being established which will use ASM (as is standard across “3D” installations) and will hopefully be in production relatively soon.
- The missing DBA expertise is still an area of concern and was re-discussed with Simon last Thursday.

DM Services – GridKa (Doris Ressmann)

GridKa: we had massive problems with SRM last weekend, which we solved together with the dCache developers by tuning SRM database parameters; the system worked fine again from Sunday until early Wednesday morning, then we ran into the same problem again.

There are several causes for these problems, the most important being massive SRMls requests, which hammer the SRM. If the performance of the SRM is OK, then the load is in turn too high for the PNFS database, so tuning the SRM only pushes the problem onto PNFS and vice versa. The result: solving one problem, we immediately ran into the next.

Furthermore, we had some trouble because one experiment tried to write into full pools. This was a kind of “denial of service” attack, since many worker nodes kept repeating the request and the SRM SpaceManager had to refuse it. After detecting this problem we worked with the experiment representative to free up some space, which relaxed the situation.

The current situation is that we have reduced the number of concurrent SRM threads, to give the requests already in the system a chance to finish. This seems to have relaxed things; we still have transfers failing because we run into the limit and reject those transfers, but it seems this is the only way to protect ourselves from failing completely. In my opinion we should learn to interpret the SAM tests: in our case a failing SAM test currently does not mean we are down, but fully loaded.

Next week we will try to optimise some of our parameters to see if we can cope with more requests, but currently it seems better to let the system relax and finish the piled-up jobs. We (experiments, developers and administrators) should all work together to find a solution that does not stress the storage system unnecessarily; tuning by itself does not solve this problem in the long run.
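On the client side, one way to avoid hammering the SRM with repeated SRMls requests is to back off between retries and add jitter. A minimal sketch follows; srm_ls() is a hypothetical placeholder, not a real client API, and the timing constants are illustrative only.

# Minimal sketch, not any experiment's actual framework: retry a polling call
# with exponential backoff and jitter instead of hammering the SRM.
import random
import time

def srm_ls(surl):
    """Hypothetical placeholder for the real SRM listing/polling call."""
    raise NotImplementedError("replace with the real client call")

def poll_with_backoff(surl, base=30.0, factor=2.0, max_wait=1800.0, attempts=8):
    """Call srm_ls(surl), waiting base, base*factor, ... seconds between retries."""
    wait = base
    for attempt in range(1, attempts + 1):
        try:
            return srm_ls(surl)
        except Exception as exc:  # e.g. "server busy" / request rejected
            print(f"attempt {attempt} failed ({exc}); retrying in {wait:.0f}s")
            time.sleep(wait + random.uniform(0, wait / 4))  # jitter spreads out clients
            wait = min(wait * factor, max_wait)
    raise RuntimeError(f"giving up on {surl} after {attempts} attempts")

The point is the policy (back off and spread out retries) rather than the specific numbers, which would need to be agreed with the site.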

Summary

- “Service incidents” happen regularly – in most cases the service is returned “to normal” relatively quickly (hours to days) and (presumably) sufficiently rapidly(?) for 2009 pp data taking (but painful for support teams…)
- “Chronic incidents” – ones that are not solved (for whatever reason) for > or >> 1 week – still continue
- What can we do to (preferably) avoid the latter, or decrease the time it takes to get the service back?
- N.B. there are real costs associated with the former!

Proposal

After 1 week (2? More? MB to decide…) of serious degradation or outage of a service ranked as “very critical” or “critical” by a VO using that site, the responsible site should provide a written report to the MB on their action plan and timeline for restoring the service asap.

E.g. the proposal to perform a clean CASTOR & Oracle installation at ASGC was made back in October and has still not been carried out…