Dirk Duellmann ~~~ WLCG Management Board, 27th July 2010


WLCG Service Report

WLCG Operations Report – Summary

KPI                          Status                  Comment
GGUS tickets                 4 real alarm tickets    PIC dCache, NDGF SRM, 2 Castor CERN
Site Usability               Minor issues
SIRs & Change assessments    2 new SIRs              And 3 closed SIRs received

VO User Team Alarm Total
ALICE 4 1 5
ATLAS 23 92 119
CMS 10 3 2 15
LHCb 20 24
Totals 40 115 8 163

The response to alarms was (well) within targets.

[Availability plots; the numbered markers (0.1, 1.1, 1.2, 3.1, 3.2, 4.1 to 4.7) refer to the analysis that follows.]

Analysis of the availability plots

COMMON FOR ALL THE EXPERIMENTS
0.1 FZK-LCG2: FZK-LCG site offline for production.

ATLAS
1.1 SARA-MATRIX: Temporary test failure with timeout.
1.2 BNL: Old problem with the CE critical test for the OSG CE. The SAM ATLAS Computing Element critical test has been modified to take into account the different configuration of the OSG CE.

ALICE
NTR

CMS
3.1 KIT: Cooling problem since Saturday. Only 20% of WNs online.
3.2 IN2P3: Stage Out Test failed temporarily.

LHCb
4.1 GRIDKA: Power failure issue continuing since Sunday.
4.2 GRIDKA: SAM tests failing, problems staging and accessing files. Problem with dCache: the database was only partially indexed. A small unscheduled downtime was taken.
4.3 IN2P3: Degradation in the IN2P3 shared area. A command on a subdirectory of the shared area took more than 150 seconds, while less than 60 seconds is expected.
4.4 CERN: Temporary test failure. A command on a subdirectory of the shared area took more than 150 seconds, while less than 60 seconds is expected.
4.5 PIC: Temporary test failure.
4.6 CNAF: SRM service UNIT test failed occasionally.
4.7 CERN: CERN SRM UNIT test failed with communication error and I/O error.
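The shared-area items above (4.3 and 4.4) describe a timing check: a command is expected to finish in under 60 seconds and is reported when it takes longer. A minimal Python sketch of that kind of check follows. The function names `timed_run` and `classify` and the placeholder command are illustrative assumptions; this is not the actual SAM test code.

```python
import subprocess
import sys
import time

# Threshold quoted in the report: the command is expected to take < 60 s.
EXPECTED_MAX_SECONDS = 60


def classify(elapsed_seconds, limit=EXPECTED_MAX_SECONDS):
    """Label a measured duration against the expected limit (hypothetical labels)."""
    return "ok" if elapsed_seconds < limit else "degraded"


def timed_run(command):
    """Run a command and return its wall-clock duration in seconds."""
    start = time.monotonic()
    subprocess.run(
        command,
        check=False,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return time.monotonic() - start


if __name__ == "__main__":
    # Placeholder command; a real check would run a command on the shared area.
    elapsed = timed_run([sys.executable, "-c", "pass"])
    print(f"{elapsed:.2f} s -> {classify(elapsed)}")
```

With the durations quoted in the report, `classify(150)` yields "degraded", whereas any run under 60 seconds yields "ok".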

[Availability plots; the numbered markers (0.1, 1.1, 3.1, 3.2, 4.1, 4.2, 4.3) refer to the analysis that follows.]

Analysis of the availability plots

COMMON FOR ALL THE EXPERIMENTS
0.1 PIC: Scheduled downtime on Tuesday, the 20th of July, from 6am to 6pm.

ATLAS
1.1 TAIWAN: Stage-in/out job failures (GGUS:60231). The number of jobs accessing the disk servers was reduced temporarily to decrease the load.

ALICE
NTR

CMS
3.1 KIT: dCache headnode crash. Site in emergency shutdown from 9:00 to 13:00h on the 22nd of July.
3.2 KIT: SAM CE prod & sft test jobs expiring.

LHCb
4.1 GRIDKA: Occasional CE-sft-job test failures.
4.2 CNAF: Production jobs failed: the HOME directory was not set due to an issue concerning LDAP.
4.3 CNAF: LFC_L-check-streams test failures.

GGUS summary (2 weeks)

VO User Team Alarm Total
ALICE 4 1 5
ATLAS 23 92 119
CMS 10 3 2 15
LHCb 20 24
Totals 40 115 8 163
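The ticket counts can be sanity-checked by confirming that User + Team + Alarm equals Total. A minimal Python sketch follows, restricted to the two rows whose four columns all survive in this transcript (CMS and Totals). The dictionary layout and the function name `row_is_consistent` are illustrative assumptions, not part of any GGUS tooling.

```python
# Rows copied from the GGUS summary; only rows with all four columns
# present in the transcript are included.
rows = {
    "CMS": {"user": 10, "team": 3, "alarm": 2, "total": 15},
    "Totals": {"user": 40, "team": 115, "alarm": 8, "total": 163},
}


def row_is_consistent(row):
    """Return True when user + team + alarm equals the reported total."""
    return row["user"] + row["team"] + row["alarm"] == row["total"]


for name, row in rows.items():
    status = "consistent" if row_is_consistent(row) else "inconsistent"
    print(f"{name}: {status}")
```

Both rows pass the check. The ALICE, ATLAS and LHCb rows are left out because the transcript does not show which column each surviving number belongs to.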

ALARM Tickets

NDGF
SRM-dCache outage (SIR below)

Castor CMS
Single user issuing 30k disk-to-disk copies
User notified and per-user limits in place

CASTOR ATLAS T0 merge
Disk server unstable after RAID controller firmware problems

PIC dCache
Wrong pool cost equation affected balancing between old and new pools

Support-related events

Service incident report updates
SIR received for NDGF SRM outage on 2010/07/14.
SIR received for GridKa cooling system failure incident of 2010/07/10.
SIR received for reduced availability caused by data corruption at NL-T1 on 2010/07/05.
SIR being prepared from GGUS/OSG about notification issues.
SIR being prepared for CERN vault cooling issues.

10/5/2019 WLCG MB Report WLCG Service Report

NDGF SRM outage

What time (CEST)    What happened
2010/07/14 13:00    Scheduled downtime starts with dCache upgrade
2010/07/14 13:50    After upgrade & reboot to new firmware, problems restarting the service
2010/07/14 16:00    Scheduled downtime ends and is replaced by an unscheduled downtime
2010/07/15 10:00    Services working fine as far as we can tell, ATLAS SAM tests finally green too

https://wiki.ndgf.org/display/ndgfwiki/20100714+dCache+server+failure
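The length of each phase of the NDGF outage follows directly from the timestamps in the timeline above. A minimal Python sketch of that arithmetic follows. The helper name `hours_between` and the variable names are illustrative assumptions. All timestamps are taken as CEST, as stated on the slide, so no timezone conversion is applied.

```python
from datetime import datetime

# Timestamp layout used in the incident timeline.
FMT = "%Y/%m/%d %H:%M"


def hours_between(start, end):
    """Return the number of hours between two timeline timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds() / 3600


scheduled_start = "2010/07/14 13:00"   # scheduled downtime starts
scheduled_end = "2010/07/14 16:00"     # replaced by unscheduled downtime
service_restored = "2010/07/15 10:00"  # services reported working

print("scheduled window (h):", hours_between(scheduled_start, scheduled_end))
print("unscheduled part (h):", hours_between(scheduled_end, service_restored))
```

The same helper can be applied to the KIT and NL-T1 timelines, provided each entry carries both a date and a time.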

KIT Cooling Failure

What time (UTC)     What happened
2010/07/10 14:30    KIT cooling system going down.
2010/07/10 22:10    FTS and LFC up. LHCb and ATLAS 3D DB up.
2010/07/12 13:00    3 out of 4 chillers working. Powering up compute nodes with best compute power per watt ratio.
2010/07/15          All chillers working. Powering up remaining compute nodes.

https://twiki.cern.ch/twiki/pub/LCG/WLCGServiceIncidents/SIR_cooling_failure_20100710.pdf

NL-T1 Data Corruption Issues

What time (CEST)    What happened
2010/07/05 18:35    ATLAS reported failed jobs due to checksum errors.
2010/07/05 22:23    dCache shut down on pool nodes which could possibly be affected by this issue. These nodes reside in 7 racks.
2010/07/13 15:31    4 racks put back into production after it was established that the nodes in those racks were not affected.
2010/07/15 11:52    The remaining racks are put back into production.

http://sirs.grid.sara.nl/docs/NL-T1_SIR-20100705.pdf

Other Service news

CNAF is doing a rolling upgrade of GPFS on worker nodes
ALICE is working with CNAF to establish the impact on their job rates

LHCb is working with SARA on reducing the impact of their storage issues on their jobs
Masking hot files unavailable due to storage issues
Storage issue was solved by the site today

Summary

Quiet week ending the technical stop
No major issues