
1 LHCb: March/April Operational Report, NCB, 10th May 2010

2 GGUS tickets (March / April / first week of May)
- 91 GGUS tickets in total:
  - 8 normal tickets
  - 6 ALARM tickets (3 of them tests)
  - 77 TEAM tickets
- 27 GGUS tickets with shared-area problems in total
- 29 (real) GGUS tickets open against T0/T1:
  - ALARM (CERN): FTS found not working…
  - ALARM (GRIDKA): no space left on the M-DST token
  - ALARM (CNAF): GPFS not working, all transfers failing
  - NL-T1: 8
  - CERN: 8
  - CNAF: 5
  - GRIDKA: 4
  - PIC: 3
  - IN2p3: 1
  - RAL: 0

3 (Most) worrying issues
- The FTS issue at CERN revealed some weakness in effective communication:
  - partly because the migration happened simultaneously with the Jamboree,
  - but also because of unclear procedures/announcements on the service side (only an alias change was expected; CMS also complained).
- SVN service down and performance degradation.
- NIKHEF file-access issue for some users: the file was available and dccp worked fine, but ROOT could not open it. Middleware problem (see later); resolved by moving to dcap everywhere on dCache.
- GridKA shared area: performance issue due to concurrent activity on the ATLAS software area; more NFS servers being added.
- LHCb: PIC banned for two weeks because of a missing option in the installation script (March).
- CNAF and RAL: banned for a week because of a new connection string following the recent migration of their Oracle databases to new RACs.

4 April: Production Mask
[Chart: production-mask status over April for CERN, CNAF, GridKA, IN2p3, NL-T1, PIC and RAL]

5 Site round robin
- CERN (March):
  - March 3rd: Offline CondDB schema corrupted. The schema had to be restored to the previous configuration, but the apply process failed against 5 out of 6 T1s (all except CNAF).
  - March 7th: merging jobs at CERN affected: input data was not available (unavailable for the three days of the weekend).
  - March 11th: migration of central DIRAC services.
  - March 11th: LFC replication failing against all T1s according to SAM.
  - A glitch with the AFS shared area prevented the lock files spawned by SAM from being written.
  - March 17th: SVN reported to be extremely slow.
  - March 25th: started xrootd tests; found the server not properly set up (supporting only Kerberos).
  - March 29th: CASTOR: the LHCb data written to the lhcbraw service had not been migrated for several days.
  - March 31st: FTS not working: wrong instance used.

6 Site round robin
- CERN (April/May):
  - In April: recurrent issue with LFC-RO at CERN (due to the long history of the CORAL LFC interface). A patch was finally received and is now part of GAUDI, but the workaround based on a local DB lookup XML file is still needed (the application is still based on the previous version of CORAL).
  - 15th: CASTOR data-access problem: the lhcbmdst pool was hitting the maximum number of allowed parallel transfers (mitigated by adding more nodes).
    - Old issue of sizing pools in terms of number of servers (and not just the TB provided); see the illustration below.
  - 29th: LHCb downstream capture for conditions was stuck for several hours.
  - May 4th: 72 files lost on the lhcb default pool. A disk server was reinstalled and the data (not migrated since the 22nd of March) scrapped.
    - In the end the loss was limited and only 10 files were unrecoverable.
  - May 5th: default pool overloaded/unavailable the previous day, due to an LXBATCH user putting too much load on it combined with its small size (5 disk servers / 200 transfers each).
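To make the pool-sizing point concrete with the numbers quoted for the May 5th incident (a back-of-the-envelope illustration, not taken from the slides): 5 disk servers × 200 transfer slots per server = at most 1000 concurrent transfers, independently of how many TB the pool holds. A pool sized only by capacity can therefore still saturate on transfer slots under heavy LXBATCH load.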

7 Site round robin
- CNAF: reported issues of under-usage of their resources. This is due to the LSF batch system and its fair-share mechanism [..]. This will be fixed by site-fine-tuned agents that submit directly.
  - 8th: problem with the LRMS, systematically failing submission there.
  - 10th: StoRM upgraded; it started to systematically fail the (critical) unit test. Problem fixed with a new release of the unit-test code.
  - 18th: CREAM CE direct submission: FQAN issue. A first prototype is now working against CNAF.
  - 24th: too few pool accounts defined for the pilot role.
  - 25th: StoRM problem with LCMAPS preventing data from being uploaded there.
  - 30th April: Oracle RAC intervention.
  - 8-14 April: site banned because the changed CondDB connection string was preventing access, as APPCONFIG had not been upgraded (CondDB person away).
  - 26th: CREAM CE failing all pilots: configuration issue.
  - 30th: glitch on StoRM.
  - May 5th: GPFS issue preventing data from being written to StoRM: ALARM raised and problem fixed in ~1 hour.

8 Site round robin
- GridKA:
  - 3rd: SQLite problems due to the usual nfslock mechanism getting stuck. The NFS server was restarted.
  - 5-14 April: shared-area performance: site banned during real data taking. Concurrent high load from ATLAS; more hardware added to their NFS servers.
  - 26th April: M-DST space full (ALARM ticket sent because the automatic notification got no response over the weekend).
  - 28th April: PNFS to be restarted: 1 day off.
- IN2p3:
  - In March, only instabilities of the SRM endpoint were reported, according to the SAM unit test.
  - 24-25 April: LRMS database down.
  - 26-27 April: major AFS issue.
  - SIGUSR1 signal sent instead of SIGTERM before sending SIGKILL (see the sketch below).
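The last point concerns how the batch system warns jobs before killing them. A minimal sketch of a job wrapper that copes with either warning signal (the handler body and the use of signal.pause() as a stand-in payload are illustrative assumptions, not the actual DIRAC watchdog code):

    # Treat SIGUSR1 and SIGTERM as the same "about to be killed" warning,
    # so a site sending SIGUSR1 instead of SIGTERM (as IN2p3 did) is still
    # handled gracefully before the final, uncatchable SIGKILL arrives.
    import signal
    import sys

    def on_warning(signum, frame):
        # Flush/close output files, upload logs, etc., then exit cleanly.
        print("Received signal %d, shutting down before SIGKILL" % signum, file=sys.stderr)
        sys.exit(1)

    signal.signal(signal.SIGUSR1, on_warning)
    signal.signal(signal.SIGTERM, on_warning)

    if __name__ == "__main__":
        # Stand-in for the real payload: just wait for a signal.
        signal.pause()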

9 Site round robin
- NL-T1 (March):
  - March 1st: issue with the critical file-access test (after the test was moved to the right version of ROOT, fixing the compatibility issue). Due to a missing library (libgsitunnel) not deployed in the AA.
  - March 3rd: issue with HOME not set on some WNs (NIKHEF).
  - March 18th: NIKHEF reported a problem uploading output data from production jobs, no matter which destination.
  - March 20th: discovered that the LFC instance at SARA was wrongly configured (Oracle RAC issue).
  - March 30th: issue accessing data at SARA-NIKHEF: discovered (10 days later) to be due to a library incompatibility between Oracle and gsidcap, never spotted before because no activity had used CondDB (rather than just a local SQLDB) and gsidcap concurrently (data was usually downloaded first).

10 Site Round table m NL-T1 (April) o 1-13 April: Site banned for the issue of accessing data via gsidcap and concurrent ConditionDB access. Found to be a clash of libraries and now working exclusively with dcap P The issue has never been seen because with real data the very first time LHCb access ConditionDB and use file access protocol simultaneously. (usually a download of data first) o 27-28 April: NIKHEF CREAMCE issue killing the site. Received a patch to submit to $TMPDIR instead of $HOME o 29-4 May: Storage Issue due to a variety of reasons (ranging from h/w to network, from some head nodes to SRM overloaded). Roberto Santinelli10

11 Site round robin
- PIC: banned for more than 2 weeks because of a problem with our own application.
  - March 18th: problem with the installation of one of the LHCb application software packages for SL5 64-bit. Resolved by using the --force option in the installation. Site banned 2 weeks for that.
  - March 29th: (announced) downtime causing major perturbation of user activities, because some of the critical DIRAC services are hosted there.
  - April 7th: issue with lhcbweb being accidentally restarted. The backup host of the web portal is at Barcelona University.
  - April 26-27: network intervention.
  - May 6th and 7th: the accounting system hosted at PIC was down twice in 24 hours.

12 Site round robin
- RAL:
  - March 1st: disk-server issue.
  - March 1st: issue with VOMS certificates.
  - March 9th: Streams replication apply-process failure.
  - 28th April: CASTOR Oracle DB upgrade.
  - 5-6 April: network issue.
  - 8-14 April: as for CNAF, the site was banned because the changed connection string prevented access to the conditions DB (upgraded APPCONFIG not available, CondDB responsible away).

13 Outcome of the T1 Jamboree: highlights
- Presentation of the computing model and of the resources needed.
  - First big change about to come: T2-LAC facilities.
- Interesting overview of DIRAC.
- Plans: reconstruction / reprocessing / simulation.
  - Activity must be flexible depending on the LHC; sites should not have to ask each time for their CPUs to be occupied.
  - CREAM CE usage (direct submission about to come).
  - gLexec usage pattern in LHCb.
- Most worrying issue at T1: file access, and possible solutions:
  - usage of xroot taken into consideration; testing it.
  - file download for production is the current solution (see the sketch after this list).
  - parameter tuning at dCache sites (WLCG working group on file-access optimisation).
  - for production, file download has proved to be the best approach (despite some sites claiming it would be better to access data through the LAN).
  - test suite "hammer cloud style" to probe sites: READY.
  - POSIX file access (LUSTRE and NFS 4.1).
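A minimal sketch of the "download first, then open locally" production pattern referred to above. The SURL, the local path and the use of the lcg-cp client on the worker node are illustrative assumptions; DIRAC's actual data-access machinery is more involved:

    # Copy the input file to local worker-node disk first, then let ROOT
    # open it as a plain local file (no remote protocol involved).
    import subprocess
    import ROOT

    # Hypothetical SURL of an input file; a real job gets this from the catalogue.
    surl = "srm://storage.example.org/lhcb/data/file.dst"
    local = "/tmp/file.dst"

    # lcg-cp copies <source> to <destination>; here the destination is a local file.
    subprocess.check_call(["lcg-cp", surl, "file://" + local])

    f = ROOT.TFile.Open(local)
    if not f or f.IsZombie():
        raise RuntimeError("could not open downloaded file")
    f.ls()
    f.Close()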

14 Outcome of the T1 Jamboree: highlights
- CPU/wallclock: sites reporting some inefficiency, for multiple reasons:
  - too many pilots submitted now that we are running in filling mode (a pilot commits suicide if no task is available, but only after a few minutes);
  - also problems with stuck connections (data upload, data streams with dcap/root servers, storage shortages, AFS outages, jobs hanging);
  - a very aggressive watchdog is in place that kills jobs that are stalled or no longer consuming CPU (i.e. <5% over a configurable number of minutes); a sketch follows below.
- Most worrying issue at T2 sites: shared area.
  - This is a critical service for LHCb and as such must be taken into account by sites.
- Tape-protection discussions.
- T1 LRMS fair shares:
  - quick turnaround when there is low activity;
  - never fall to zero.
- Site round table on allocated resources and plans for 2010/2011.
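A minimal sketch of such a CPU-efficiency watchdog. The polling interval, window length and the use of /proc are illustrative assumptions; this is not the actual DIRAC watchdog:

    # Kill a job whose CPU/wallclock ratio stays below a threshold (cf. the
    # <5% rule above) over a full observation window. Linux-only: reads
    # utime+stime (fields 14 and 15) from /proc/<pid>/stat.
    import os
    import signal
    import time

    CLK_TCK = os.sysconf("SC_CLK_TCK")

    def cpu_seconds(pid):
        # User + system CPU time consumed so far by pid, in seconds.
        with open("/proc/%d/stat" % pid) as f:
            fields = f.read().split()
        return (int(fields[13]) + int(fields[14])) / float(CLK_TCK)

    def watchdog(pid, min_efficiency=0.05, window_minutes=30, poll_seconds=60):
        start_wall = time.time()
        start_cpu = cpu_seconds(pid)
        while True:
            time.sleep(poll_seconds)
            wall = time.time() - start_wall
            if wall < window_minutes * 60:
                continue
            cpu = cpu_seconds(pid) - start_cpu
            if cpu / wall < min_efficiency:
                os.kill(pid, signal.SIGTERM)  # declare the job stalled and kill it
                return
            # Efficiency acceptable: start a new observation window.
            start_wall = time.time()
            start_cpu = cpu_seconds(pid)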

15 Outcome of the T1 Jamboree: 2010 resources

Site    CPU (HEPSPEC06)   Disk (TB)       Tape (TB)
CERN    15000/23000       720/1290        3000/1800
CNAF    2700/5500         158/450         442/442
FZK     7480/7480         315/560         408/408
IN2p3   8370/9742         470/728         270/531
NL-T1   8992/8992         410/707         261/1012
PIC     1560/2632         256/197(+50)    ~200/189
RAL     8184/8184         311/612         462/446

- CERN: full 2010 allocation by June or earlier. Full 2011 and 2012 allocation by April 1st each year. According to http://lcg.web.cern.ch/lcg/Resources/WLCGResources-2009-2010_12APR10.pdf (12 April), all resources seem to be declared as allocated.
- CNAF: CPU still to be ordered. Disk and tape: delivery in March.
- FZK: it is assumed the CPU was fully allocated at the beginning of April. Disk and tape entirely allocated in May.
- IN2p3: full 2010 allocation by 20/05/2010, plus T2 resources: 18.112 HEPSPEC06 / 479 TB of disk.
- NL-T1: disk and tape fully available by the end of spring (<20th June).
- PIC: plans to allocate the 2010 pledge by the end of April. Agreed to host 6% of the MC data (an extra 50 TB).
- RAL: full allocation in June; they do not foresee any problem in meeting it.

