WLCG Service Report ~~~ WLCG Management Board, 27th January 2009



GGUS tickets submitted concerning WLCG VOs from :00:00 thru :59:59
Summary of the alarm tickets:
CMS to CNAF: created 20 Jan, addressed the same day and marked as solved 21 Jan.
ATLAS to FZK: created 21 Jan and marked as solved the same day.
ATLAS to RAL: created 22 Jan and marked as verified the same day.
ATLAS to CERN: created 23 Jan and marked as verified the same day.

Experiment Alarm “Tickets”
CMS to CNAF: Tue 20 Jan, D. Bonacorsi reported that PhEDEx exports from CNAF had been failing for a few hours with CASTOR timeouts. He also received an error reply saying he was not allowed to trigger an SMS alarm for the INFN T1 (being followed up). The problem was addressed the same day and found to be due to a single disk server.
ATLAS to FZK: Test of the alarm ticket workflow after the new release. Closed within 1 hour.
ATLAS to RAL: Thu 22 Jan, RAL was failing to accept data from Tier 0, giving an Oracle error on a bulk insert call. Solved within 1 hour by restarting the SRM processes, after which FTS reported no further errors.
ATLAS to CERN: Fri 23 Jan, almost all transfers to the CERN ATLASMCDISK space token were failing with ‘possible disk full’ errors. This was due to a misconfigured disk server, which was then removed. On Saturday imports failed again, and this was reported back as the pool being full: monitoring showed it full while stager and SRM queries did not. Removing the misconfigured disk server had also taken out state information, a known CASTOR problem. The machine was back in the pool on Monday and the state information was resynchronised.

Other outstanding or new “significant” service incidents (1 of 3)
ASGC: Jan 24, FTS job submission failures because the Oracle maximum tablespace size had been reached. 100 MB had to be added manually. The follow-up is to try adding a new plugin that monitors the tablespace size, to avoid the same situation in the future.
FZK: Jan 24, the FTS and LFC services at FZK went down due to a problem with the Oracle backend. The problem was quite complex and Oracle support was involved. Reported as solved late Monday.
BNL: Jan 23, hit by the FTS delegated proxy corruption bug, a repeated source of annoyance. The backport of the fix from FTS 2.2 to 2.1 is now in certification and eagerly awaited.
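The tablespace monitoring follow-up mentioned for ASGC could look roughly like the sketch below. It is only an illustration of a threshold check that returns a Nagios-style status; the class name, the function name, the thresholds and the tablespace name are all assumptions and do not describe the plugin ASGC actually planned. The usage figures would have to come from a database query that is not shown here.

```python
from dataclasses import dataclass


@dataclass
class TablespaceUsage:
    """Usage figures for one tablespace (hypothetical structure)."""
    name: str
    used_mb: float
    max_mb: float


def check_tablespace(usage, warn_pct=80.0, crit_pct=90.0):
    """Return a (status, message) pair in the style of a Nagios check.

    The thresholds are illustrative defaults, not values from the report.
    """
    if usage.max_mb <= 0:
        return "UNKNOWN", f"{usage.name}: maximum size not available"

    pct = 100.0 * usage.used_mb / usage.max_mb
    if pct >= crit_pct:
        status = "CRITICAL"
    elif pct >= warn_pct:
        status = "WARNING"
    else:
        status = "OK"

    message = (
        f"{usage.name}: {pct:.1f}% used "
        f"({usage.used_mb:.0f}/{usage.max_mb:.0f} MB)"
    )
    return status, message


if __name__ == "__main__":
    # "FTS_DATA" is a made-up tablespace name used only for the demo.
    print(check_tablespace(TablespaceUsage("FTS_DATA", 950, 1000)))
```

Raising a warning well before the limit is reached gives operators time to extend the tablespace before job submission starts to fail.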

Other outstanding or new “significant” service incidents (2 of 3)
PIC: Jan 24, Barcelona was hit by a storm with strong winds. A power cut at about midday turned off the air conditioning and forced the site to shut down. Operations resumed Monday morning and the site was fully back around midday. There were some problems bringing back the Oracle databases after their unclean shutdown.
CERN: Jan 22/23, the lcg-cp command (which makes SRM calls) started failing when requested to create 2 levels of new directories. The recent SRM server upgrade is suspected; to be followed up.
PIC: Jan 21, high SRM load due to srmLs commands thought to come from CMS jobs (as previously seen at FZK). There is no easy or feasible control mechanism.

Other outstanding or new “significant” service incidents (3 of 3)
CERN: Jan 20, 25K LHCb jobs were "stuck" in the WMS in waiting status. Further investigation suggested an LB 'bug' that occasionally leaves jobs in a limbo state. The plan is to see whether the latest patch fixes it. In the meantime DIRAC will discard such jobs after 24h.
CERN: Jan 19, the ATLAS eLogger backend daemon hung and had to be killed. The follow-up will be to trap the condition and raise an alarm. The service is heavily used at many levels of ATLAS and users did not know where to report the problem. A new ATLAS-eLogger-support Remedy workflow has therefore been created, with GS group members as the service managers.
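The interim rule of discarding jobs that stay in waiting status for more than 24h can be sketched as a simple filter. This is a minimal illustration under stated assumptions, not DIRAC code: the function name, the tuple layout and the "Waiting" status string are hypothetical.

```python
from datetime import datetime, timedelta


def split_stuck_jobs(jobs, now, max_wait=timedelta(hours=24)):
    """Separate jobs that have waited longer than max_wait from the rest.

    jobs is an iterable of (job_id, status, waiting_since) tuples; this
    layout is an assumption made for the example.
    Returns (keep, discard) as two lists of job ids.
    """
    keep, discard = [], []
    for job_id, status, waiting_since in jobs:
        if status == "Waiting" and now - waiting_since > max_wait:
            discard.append(job_id)
        else:
            keep.append(job_id)
    return keep, discard


if __name__ == "__main__":
    now = datetime(2009, 1, 21, 12, 0)
    jobs = [
        (1, "Waiting", datetime(2009, 1, 20, 9, 0)),   # waiting for 27h
        (2, "Waiting", datetime(2009, 1, 21, 8, 0)),   # waiting for 4h
        (3, "Running", datetime(2009, 1, 19, 8, 0)),   # not in waiting status
    ]
    print(split_stuck_jobs(jobs, now))
```

Only jobs still in waiting status are candidates for discarding, so long-running jobs are left alone.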

Situation at ASGC
Jason was at CERN last week and we (Jamie) discussed, face to face, issues around their DB services (3D and CASTOR+SRM), including migration from the OCFS2 filesystem.
A target date for restoring the “3D” service (ATLAS LFC & conditions, FTS) to production is early February 2009.
The new hardware should be ready before that, allowing for sufficient testing and resynchronisation of ATLAS conditions, then VO testing, before announcing the service to be open.
A tentative target for a “clean” CASTOR+SRM+DB service is mid-February 2009, preferably in time for the CASTOR external operations F2F (Feb at RAL).
It is less clear that this date can be met; regular checkpoints are needed (work will start after Chinese NY on Feb 2, see notes).
ASGC will participate in the bi-weekly CASTOR external operations calls, the 3D con-calls and the WLCG daily OPS meeting.

Other Service Changes
GGUS has now released direct routing to sites for all tickets, not just team and alarm tickets. The submitter has a new field to target the site when it is known.
CERN has moved the two LCG backbone routers that were shut down before Christmas back into production (attended by hardware support engineers).
CNAF has been testing the long-standing problematic access to the LHCb shared software area in GPFS. A debugging session discovered that the GPFS file system, because of intrinsic limitations of its caching mechanism, runs the LHCb setup script in minutes (depending on the load of the WN), while an NFS file system (on top of GPFS storage) runs the same script in the more reasonable time of 3-4 seconds, as at other sites. A migration is planned.
IN2P3, following the ATLAS 10 million file test, has deployed a third load-balanced FTS server addressing CNAF, PIC, RAL and ASGC. Some improvement, though not much, was initially seen.

Summary
Business as usual: the rate of new problems that arise and need follow-up service changes remains high.
The problems remain heterogeneous, with no predictable pattern to indicate where effort would best be invested.
However, the resulting service failures are usually short.