LHCb: March/April Operational Report

1 LHCb: March/April Operational Report
Roberto Santinelli, on behalf of LHCb. GDB, 12th May 2010

2 OUTLINE
Recent activities
GGUS review
(Most) worrying issues
T1 sites review
First LHCb T1 Jamboree

3 1 GB/s integrated throughput (export + replication)
Also small reconstructed/stripped files.
[lxplus235] ~ > dirac-dms-storage-usage-summary --Dir=/lhcb/data/2010/RAW
DIRAC SE    Size (TB)    Files
CERN-RAW
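
As an aside, the per-SE summary on this slide comes from the DIRAC data-management CLI. Below is a minimal Python sketch of how such a summary could be scripted; the command and directory are taken from the slide, but the exact output layout (one data row per storage element with "DIRAC SE / Size (TB) / Files" columns) and the presence of the command in the PATH are assumptions.

```python
# Sketch only: assumes a configured DIRAC/LHCb environment where the
# dirac-dms-storage-usage-summary command is available, and that it prints
# one data row per storage element as "<DIRAC SE> <size in TB> <number of files>".
import subprocess

cmd = ["dirac-dms-storage-usage-summary", "--Dir=/lhcb/data/2010/RAW"]
output = subprocess.check_output(cmd).decode()

total_tb, total_files = 0.0, 0
for line in output.splitlines():
    parts = line.split()
    if len(parts) != 3:
        continue                      # skip headers, separators, empty lines
    se_name, size_tb, n_files = parts
    try:
        total_tb += float(size_tb)
        total_files += int(n_files)
    except ValueError:
        continue                      # not a numeric data row
print("RAW data: %.1f TB in %d files" % (total_tb, total_files))
```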

4 Reprocessing of the first data was re-run several times due to several internal problems.
Pedestal load from many (and increasing) different users: need for an LHCb DAST.
No large MC activity and in general no large activity (commissioning new workflows; real data is the focus).

7 GGUS tickets (March/April/first week of May)
91 GGUS tickets in total:
- 8 normal tickets
- 6 ALARM tickets (3 test)
- 77 TEAM tickets
27 GGUS tickets with shared-area problems in total.
29 (real) GGUS tickets opened against T0/T1:
- ALARM (CERN): FTS found not working…
- ALARM (GRIDKA): no space left on the M-DST token
- ALARM (CNAF): GPFS not working, all transfers failing
Per site: NL-T1: 8, CERN: 8, CNAF: 5, GRIDKA: 4, PIC: 3, IN2P3: 1, RAL: 0

8 (Most) worrying issues
FTS issue at CERN underlined the importance of effective communication:
- the migration happened simultaneously with the LHCb Jamboree
- a simple alias change, as originally discussed, was not feasible in this case (service-draining issues)
- explicit sign-off has already been implemented via the WLCG T1SCM, e.g. for the myproxy-fts retirement, and will be used for future migrations
SVN service down and degraded performance.
NIKHEF file-access issue for some users: the file is available and dccp works fine, but ROOT cannot open it. Middleware problem (see later); resolved by moving to dcap everywhere on dCache; also reported by ATLAS.
GridKA shared-area performance issue due to concurrent activity on the ATLAS software area; more servers being added to NFS.
LHCb banned PIC for two weeks for a missing option in the installation script (March).
CNAF and RAL banned for a week for a new connection string following a recent migration of their Oracle databases to new RACs.

9 April: Production Mask
Sites: CERN, CNAF, GridKA, IN2P3, NL-T1, PIC, RAL.
NB: blue marks an unavailability period, not a site issue.

10 Site issues: CERN (March)
March 3rd: offline CondDB schema corrupted. The schema had to be restored to the previous configuration, but the apply process failed against 5 out of 6 T1s (all but CNAF).
March 7th: merging jobs at CERN affected: input data was not available (unavailable over the 3-day weekend).
March 11th: migration of central DIRAC services.
March 11th: LFC replication failing against all T1s according to SAM. A glitch with the AFS shared area prevented writing the lock files spawned by SAM.
March 17th: SVN reported to be extremely slow.
March 25th: started xrootd tests; found the server not properly set up (supporting only Kerberos).
March 29th: CASTOR: the LHCb data written to the lhcbraw service had not been migrated for several days.
March 31st: FTS not working: wrong instance used.

11 Site issues: CERN (April/May)
~April: the issue with LFC-RO at CERN (due to the old CORAL LFC interface) has been fixed. A patch was released and is now part of GAUDI. Good interaction between the experiment and the service providers. The workaround based on a local DB lookup XML file is still needed (some LHCb core applications are still based on the old GAUDI stack).
April 15th: CASTOR data-access problem: the lhcbmdst pool was running out of the maximum number of allowed parallel transfers (mitigated by adding more nodes). Old issue of sizing pools in terms of number of servers (and not just TB provided).
April 29th: LHCb downstream capture for conditions was stuck for several hours.
May 4th: lost 72 files on the lhcb default pool. A disk server was reinstalled and its data (not migrated since the 22nd of March) scrapped. In the end the loss was limited and only 10 files were unretrievable.
May 5th: default pool overloaded/unavailable due to an LXBATCH user putting too much load on it, combined with the pool's small size (5 disk servers, 200 transfers each).

12 Site issues: CNAF
The large T2 resources are reported to be under-used. This is due to the LSF batch system and its fair-share mechanism [..]. This will be fixed by site-fine-tuned agents that submit directly to the sites.
March 8th: problem with the LRMS systematically failing submissions there.
March 10th: StoRM upgraded: started to systematically fail the (critical) unit test; problem fixed with a new release of the unit-test code.
March 18th: CREAM CE direct submission: FQAN issue. A first prototype is now working against CNAF.
March 24th: too low a number of pool accounts defined for the pilot role.
March 25th: StoRM problem with LCMAPS preventing data upload there.
March 30th: Oracle RAC intervention.
April 8th-14th: site banned because the changed connection string to the CondDB prevented access, as APPCONFIG had not been upgraded (the CondDB person was away).
April 26th: CREAM CE failing all pilots: configuration issue.
April 30th: glitch on StoRM.
May 5th: GPFS issue preventing data from being written to StoRM: ALARM raised and problem fixed in ~1 hour.

13 Site issues: GridKA
3rd: SQLite problems due to the usual nfslock mechanism getting stuck. The NFS server was restarted.
April 5th-14th: shared-area performance: site banned during real data taking. Concurrent high load from ATLAS; more hardware added to their NFS servers.
April 26th: MDST space full (ALARM ticket sent due to the missed response to the automatic notification over the weekend).
April 28th: PNFS to be restarted: 1 day off.

Site issues: IN2P3
March: only reported instabilities of the SRM endpoint according to the SAM unit test.
April 24th-25th: LRMS database down.
April 26th-27th: major AFS issue.
SIGUSR1 signal sent instead of SIGTERM before sending SIGKILL.

14 Site issues: NL-T1 (March)
March 1st: issue with the critical test for file access (after the test moved to the right ROOT version, fixing the compatibility issue). Due to a missing library (libgsitunnel) not deployed in the AA.
March 3rd: issue with HOME not set on some WNs (NIKHEF).
March 18th: NIKHEF reported a problem uploading output data from production jobs, no matter which destination.
March 20th: discovered the LFC instance at SARA wrongly configured (Oracle RAC issue).
March 30th: issue with accessing data at SARA-NIKHEF: discovered (many days later) to be due to a library incompatibility between Oracle and gsidcap, never spotted before because no activity had used the CondDB (rather than just a local SQLite DB) and gsidcap concurrently (data were always downloaded first).

15 Site issues: NL-T1 (April)
April 1st-13th: site banned because of the issue of accessing data via gsidcap with concurrent ConditionDB access. Found to be a clash of libraries; now working exclusively with dcap (see the sketch below). The issue had never been seen before because only with real data does LHCb, for the very first time, access the ConditionDB and use a file-access protocol simultaneously (usually data are downloaded first).
April 27th-28th: NIKHEF CREAM CE issue killing the site. Received a patch to submit to $TMPDIR instead of $HOME.
April 29th - May 4th: storage issue due to a variety of reasons (ranging from hardware to network, from some head nodes to an overloaded SRM).
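
For illustration, a minimal sketch of the protocol switch mentioned above: opening a dCache file from ROOT via plain dcap instead of gsidcap, whose security libraries clashed with the Oracle client used for the ConditionDB. The host, ports and file path below are placeholders, not the actual SARA endpoints, and PyROOT with dcap support is assumed to be available in the job environment.

```python
# Sketch only: placeholder host/ports/path, assuming PyROOT and the dCache
# access libraries are set up in the LHCb application environment.
import ROOT

# Previously (gsidcap, whose libraries clashed with the Oracle/CondDB client):
#   f = ROOT.TFile.Open("gsidcap://dcache.example.org:22128/pnfs/example.org/lhcb/SOME.dst")

# Now used everywhere on dCache sites (plain dcap):
f = ROOT.TFile.Open("dcap://dcache.example.org:22125/pnfs/example.org/lhcb/SOME.dst")
if not f or f.IsZombie():
    raise RuntimeError("could not open the input file via dcap")
```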

16 Site issues: PIC
Banned for more than 2 weeks for a problem with our own application:
March 18th: problem with the installation of one of the LHCb application software releases for SL5 64-bit. Resolved by using the --force option in the installation. Site banned for 2 weeks because of this.
March 29th: (announced) downtime causing major perturbation of user activities, since some of the critical DIRAC services are hosted there.
April 7th: issue with lhcbweb being restarted accidentally. Backup host of the web portal at Barcelona University.
April 26th-27th: network intervention.
May 6th and 7th: accounting system hosted at PIC down twice in 24 hours.

17 Site issues: RAL
March 1st: disk server issue.
March 1st: issue with VOMS certificates.
March 9th: Streams replication apply process failure.
April 5th-6th: network issue.
April 8th-14th: as for CNAF, the site was banned because the changed connection string prevented access to the condition DB (upgraded APPCONFIG not available, CondDB responsible away).
April 28th: CASTOR Oracle DB upgrade.

18 Outcome of the T1 Jamboree: highlights
Presentation of the computing model and the resources needed.
First big change about to come: T2-LAC facilities.
Interesting overview of DIRAC.
Plans: reconstruction/reprocessing/simulation. Activity must be flexible depending on the LHC; sites should not have to ask each time for their CPUs to be occupied.
CREAM CE usage (direct submission about to come).
gLExec usage pattern in LHCb.
Most worrying issue at the T1s: file access, and possible solutions:
- use of xrootd taken into consideration; testing it
- file download for production is the current solution
- parameter tuning at dCache sites (WLCG working group for file-access optimisation)
- for production, file download proved to be the best approach (despite some sites claiming it would be better to access data through the LAN)
- a "hammer cloud style" test suite to probe sites: READY
- POSIX file access (LUSTRE and NFS 4.1)

19 Outcome of the T1 Jamboree: highlights
CPU/wallclock: sites reporting some inefficiency, for multiple reasons:
- too many pilots submitted now that we run in filling mode (a pilot commits suicide if no task is available, but only after a few minutes)
- problems with stuck connections (data upload, data streams with dcap/root servers, storage shortages, AFS outages, hanging jobs)
- a very aggressive watchdog is in place that kills jobs that are stalled or no longer consuming CPU (i.e. <5% over a configurable number of minutes); a sketch of the idea is given below
Most worrying issue at T2 sites: the shared area. This is a critical service for LHCb and as such must be taken into account by sites.
Tape-protection discussions.
T1 LRMS fair shares: quick turn-around when there is low activity; the share should never fall to zero.
Site round table on allocated resources and plans for 2010/2011.
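
To illustrate the watchdog logic described above, here is a hypothetical Python sketch, not the actual DIRAC/LHCb watchdog: it samples the payload's accumulated CPU time from /proc and terminates the process if CPU usage stays below the 5% threshold quoted on the slide for a configurable number of minutes (the window and sampling period below are assumptions).

```python
# Hypothetical sketch of a stalled-job watchdog (not the actual DIRAC code).
# It reads the payload's CPU time from /proc/<pid>/stat (Linux only) and kills
# the process if CPU usage stays below CPU_FRACTION_MIN for GRACE_MINUTES.
import os
import signal
import time

CPU_FRACTION_MIN = 0.05   # 5% of wallclock, the threshold quoted on the slide
GRACE_MINUTES = 30        # configurable stall window (value is an assumption)
SAMPLE_SECONDS = 60       # how often to sample the payload

def cpu_seconds(pid):
    """Return user+system CPU seconds of a process from /proc/<pid>/stat."""
    with open("/proc/%d/stat" % pid) as stat_file:
        fields = stat_file.read().split()
    utime, stime = int(fields[13]), int(fields[14])
    return (utime + stime) / float(os.sysconf("SC_CLK_TCK"))

def watch(pid):
    stalled_since = None
    last_cpu, last_wall = cpu_seconds(pid), time.time()
    while True:
        time.sleep(SAMPLE_SECONDS)
        now_cpu, now_wall = cpu_seconds(pid), time.time()
        cpu_fraction = (now_cpu - last_cpu) / (now_wall - last_wall)
        last_cpu, last_wall = now_cpu, now_wall
        if cpu_fraction >= CPU_FRACTION_MIN:
            stalled_since = None            # job is making progress again
            continue
        stalled_since = stalled_since or now_wall
        if now_wall - stalled_since >= GRACE_MINUTES * 60:
            os.kill(pid, signal.SIGTERM)    # let the payload clean up first
            time.sleep(30)
            os.kill(pid, signal.SIGKILL)    # then kill it for good
            return
```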

20 Outcome of the T1 Jamboree: 2010 resources

Site    CPU (HEPSPEC06)   Disk (TB)      Tape (TB)
CERN    15000/23000       720/1290       3000/1800
CNAF    2700/5500         158/450        442/442
FZK     7480/7480         315/560        408/408
IN2P3   8370/9742         470/728        270/531
NL-T1   8992/8992         410/707        261/1012
PIC     1560/2632         256/197(+50)   ~200/189
RAL     8184/8184         311/612        462/446

CERN: full 2010 allocation in June or earlier; full allocation for 2011 and 2012 by April 1st of each year. 12 April: all resources appear to be declared allocated.
CNAF: CPU to be ordered; disk and tape: delivery in March.
FZK: CPU assumed to be fully allocated at the beginning of April; disk and tape entirely allocated in May.
IN2P3: full 2010 allocation: 20/05/2010, plus T2 resources (HEPSPEC-06 / 479 TB disk).
NL-T1: disk and tape fully available by the end of spring (<20th June).
PIC: plan to allocate the 2010 pledge by the end of April. Agreed to host 6% of the MC data (an extra 50 TB).
RAL: full allocation in June; no problem foreseen in meeting it.

