WLCG Service Report ~~~ WLCG Management Board, 16th December 2008.



GGUS Summary

VO      Alarm   Team   Total
ALICE   0       0      0
ATLAS   0       17     32
CMS     0       0      2
LHCb    1       0      12

ALARM ticket: rfcp transfers to lhcbraw hanging.
Detailed description: All transfers to lhcbraw using rfcp from the online system of LHCb are failing after the intervention. On the other hand, preliminary tests of the SRM interface seem to be OK. Please have a look.
Solution: It turned out that a port on one of our production machines was not open.
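The alarm ticket above was traced to a closed port on a production machine. As a minimal sketch of a check that could be run after an intervention, assuming plain TCP reachability is a sufficient first test, the snippet below tries to open a connection to a host and port. The function name `port_is_open` is hypothetical, and the real host names and port numbers for the affected service are not given in the report.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration only; it checks TCP reachability and
    says nothing about whether the service behind the port is healthy.
    """
    try:
        # create_connection resolves the host and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

A post-intervention checklist could call this for each (host, port) pair a service depends on and flag any that return False before production transfers are resumed.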

Service Summary

- Many on-going activities: CMS PhEDEx deployment, preparation for Xmas activities (all), deployment of new versions of AliEn & AliRoot, preparation of the ATLAS 10M file test…
- The number of issues discussed at the daily meeting has increased quite significantly since earlier in the year…
  => Cross-experiment / site discussion is healthy and valuable: similar problems seen by others, possible solutions proposed, etc.
- Services are disconcertingly still fragile under load; this doesn’t really seem to change from one week to another
- DM services often collapse, rendering a site effectively unusable
- At least some of these problems are attributed to problems at the DB backend; there are also DB-related issues in their own right
- A high rate of both scheduled and unscheduled interventions continues; a high fraction of these (CERN DB+CASTOR related) overran significantly during this last week
  => Some key examples follow…

Service Issues - ATLAS

- The ATLAS “10M file test” stressed many DM-related aspects of the service
- This caused quite a few short-term problems during the course of last week, plus larger problems over the weekend:
  - ASGC: the SRM is unreachable for both put and get. Jason sent a report this morning. It looks like they again had DB problems, in particular ORA
  - LYON: SRM unreachable over the weekend
  - NDGF: scheduled downtime
- In addition, RAL was also showing SRM instability this morning (and earlier, according to RAL: DB-related issues)
- These last issues are still under investigation…

Service Issues – ATLAS cont.

- ATLAS online (conditions, PVSS) capture processes aborted: the operation was not fully tested before running on the production system
- Oracle bug, but no fix for this problem: other customers have also seen the same problem, but there has been no progress since July (service requests…) [ Details in slide notes ]
- Back to the famous WLCG Oracle reviews proposed several times… Action?
- This situation will take quite some time to recover from; it is unlikely that it can be done prior to Christmas…
- ATONR => ATLR conditions done; PVSS on-going; T1s postponed…
- Other DB-related errors affecting e.g. dCache at SARA (PostgreSQL), replication to SARA (Oracle)

So What Can We Do?

Is the current level of service issues acceptable to:
1. Sites: do they have the effort to follow up and resolve this number of problems?
2. Experiments: can they stand the corresponding loss / degradation of service?

If the answer to either of the above is NO, what can we realistically do?
- DM services need to be made more robust to the (sum of) peak and average loads of the experiments
  => This may well include changing the way experiments use the services: “be kind” to them!
- DB services (all of them) are clearly fundamental and need the appropriate level of personnel and expertise at all relevant sites
- Procedures are there to help us: it is much better to test carefully and avoid messy problems than to face expensive and time-consuming cleanup, which may have big consequences for an experiment’s production

Outlook

- The GGUS weekly summary gives a convenient weekly overview
- Other “key performance indicators” could include a similar summary of scheduled / unscheduled interventions, plus those that run into “overtime”
- A GridMap-style summary, preferably linked to GGUS tickets and Service Incident Reports and with (as now) click-through to detailed test results, could also be a convenient high-level view of the service
- But this can only work if the number of problems is relatively low…
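The intervention summary proposed above could be produced along the following lines. This is only a sketch under stated assumptions: the `Intervention` record, its field names and the `summarize` function are hypothetical, and an intervention is counted as having run into "overtime" when it had a planned duration and the actual duration exceeded it.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Intervention:
    """Hypothetical record of one service intervention (all fields are assumptions)."""
    service: str
    scheduled: bool
    planned_hours: Optional[float]  # None when no duration was announced
    actual_hours: float


def summarize(interventions: Iterable[Intervention]) -> dict:
    """Count scheduled and unscheduled interventions and those that overran."""
    counts = Counter()
    for item in interventions:
        counts["scheduled" if item.scheduled else "unscheduled"] += 1
        # An overrun is only defined when a planned duration exists.
        if item.planned_hours is not None and item.actual_hours > item.planned_hours:
            counts["overran"] += 1
    # Always return all three keys so weekly reports share one layout.
    return {key: counts[key] for key in ("scheduled", "unscheduled", "overran")}
```

Feeding one week of records into `summarize` gives three numbers that could sit next to the GGUS ticket table in each weekly report.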

Summary

- Still much to do in terms of service hardening in 2009…
- … as we “look forward to the LHC experiments collecting their first colliding beam data for physics in 2009.”