Alberto Aimar CERN – LCG1 Reliability Reports – May 2007


PIC: 77%
TRIUMF: 95%
ASGC: 98%
RAL: 87%
INFN/CNAF: 87%
CERN: 90%
GridKa/FZK: 79%
SARA-NIKHEF: 99%
BNL: 98%
FNAL: 75%
IN2P3: 94%
NDGF: n/a

Jan 07 to May 07

Site          Jan 07  Feb 07  Mar 07  Apr 07  May 07
CERN
GridKa/FZK
IN2P3
INFN/CNAF
RAL
SARA-NIKHEF
TRIUMF
ASGC
FNAL
PIC
BNL           90      57*     6*      89      98
NDGF          n/a

- The target of 88% for the best 8 sites was not reached.
- 6 sites > 88% (target); in April there were 7.
- 6+2 sites > 79% (90% of target); in April there were 7+3.

* BNL: LCG/gLite CE probed by SAM but not installed with the SL4 upgrade

Summary of Issues and Solutions Reported

Description: SITE: Problem -> Solution

SRM/MSS/DB
- CERN, INFN: CASTOR overload or instabilities
- INFN: SRM server timeouts
- FZK: SRM instability -> SRM restarted
- PIC: Problems with STK robot and with dCache GridFTP doors
- BNL: dCache problems with load balancing and choice of the pools -> rebalanced dCache cost model
- BNL: HPSS core server crashed -> restarted HPSS; GridFTP powered off -> restarted GridFTP
- RAL: CASTOR overloaded -> dCache as temp SE, then CASTOR could recover
- TRIUMF: SRM blocked, port 8443 not listening -> restarted SRM
- SARA: Problems with the disks of the Oracle server

BDII
- FZK: BDII timeouts with CERN and with DNS -> fixed DNS information
- IN2P3: Local top-level BDII timeouts -> re-indexed the database

CE
- CERN: Problems with LCG-CE stability and scalability -> added 8 new CEs
- FZK: CE locked up for several days, probably overloaded -> investigating
- RAL: CE overloaded -> no solution yet
- TRIUMF: CE Gatekeeper connection problems -> GK restarted
- INFN: CE problems with the LDAP service -> not understood, LDAP restarted

Operational Issues
- FNAL: Configuration errors in the IS -> fixed the configuration
- FZK: Power cut of the administrative rack, plus CE locked for several days with site admins absent
- PIC: Wrong manual CE configuration -> fixed the configuration
- PIC: Wrong patch to the top-level BDII -> introduced correct indexes in top-bdii db
- RAL: /temp full caused SAM to fail
- SARA: Migration of DNS servers caused several reverse lookup failures
- TRIUMF: Grid Canada CRL certificate expired; all users and services with GC credentials were stopped because of this

SAM
- Several SAM unavailabilities on 28-31/05, replica-mgmt tests failing
- ASGC: replica-management test fails because of wrong ACL permissions
- IN2P3: Problems with SAM and scheduled downtime not taken into account
- PIC: replica-management tests failed because of bad CloseSE configuration

Summary – May 2007

- Only 6 sites > 88%; only 6+2 sites > 79%
- Average of the 8 best sites: 94%; average of all sites: 89%
- Main issues are unchanged:
  - CE overloaded at several sites
  - CASTOR overloads
  - dCache GridFTP problems
- SAM instabilities may have affected the results at the end of May
- Continue collecting reliability data at the Operations meeting
- Please pass the message to your representatives at the Operations meeting.
- Every down-time report should always indicate: Day, Reason, Severity, Solution
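
The counts and averages in this summary can be checked against the per-site percentages listed earlier in the deck. The short Python sketch below is only an illustration, not part of the original slides. It assumes those percentages are the May 2007 values, leaves out NDGF (reported as n/a), and reads "90% of target" as 0.9 x 88 = 79.2; the variable names are made up for the example.

```python
# Per-site reliability in percent, copied from the per-site slide.
# Assumption: these are the May 2007 values; NDGF (n/a) is omitted.
reliability = {
    "CERN": 90, "GridKa/FZK": 79, "IN2P3": 94, "INFN/CNAF": 87,
    "RAL": 87, "SARA-NIKHEF": 99, "TRIUMF": 95, "ASGC": 98,
    "FNAL": 75, "PIC": 77, "BNL": 98,
}

TARGET = 88.0               # reliability target for the best 8 sites
NEAR_TARGET = 0.9 * TARGET  # "90% of target", about 79.2

# Sites strictly above the target.
above_target = sorted(s for s, r in reliability.items() if r > TARGET)

# Sites below the target but still above 90% of it (the "+2" in "6+2").
near_target = sorted(
    s for s, r in reliability.items() if NEAR_TARGET < r <= TARGET
)

# Averages over the 8 best sites and over all sites with a value.
best8 = sorted(reliability.values(), reverse=True)[:8]
avg_best8 = sum(best8) / len(best8)
avg_all = sum(reliability.values()) / len(reliability)

print("above target:", len(above_target), above_target)
print("within 90% of target:", len(near_target), near_target)
print("average of 8 best sites:", avg_best8)
print("average of all sites:", avg_all)
```

With these inputs the script finds 6 sites above 88% and 2 more above 79.2%, and the two averages round to the 94% and 89% quoted in the summary.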