
1 T1 status Input for LHCb-NCB, 9th November 2009

2 WLCG availability: second half of October
Missing LHCb application SAM test results because of the obsolete SAM client shipped
The red period for all sites (apart from CNAF, which did not fail any other test at any point in time) was due to the Dashboard logic (used by the LCG MB) to monitor sites
All sites became green again as soon as these tests disappeared from the central SAMDB
Test now reintegrated in the SAM suite

3 Storage: pledged, allocated, consumed
This table (to be integrated in the DIRAC portal and SSB) offers an overview of:
– each ST's share of the requested space / what is allocated (as per SRM report)
– what is consumed / what is known to be consumed by LHCb (LFC)
SLS-based ALARMING and WARNING mechanism in place (so far: efficient and prompt reaction from sites)
RAL and IN2p3 fulfill the requests
Temporarily available at http://santinel.home.cern.ch/santinel/cgi-bin/space_tokens
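
To illustrate the alarming logic mentioned above, a minimal Python sketch of a threshold check over allocated vs. consumed space per space token. The token names, the 85%/95% thresholds and the function name are assumptions for illustration, not the actual SLS/DIRAC implementation.

    # Illustrative sketch only; thresholds and names are assumed,
    # not taken from the real SLS sensor.
    WARNING_FRACTION = 0.85  # assumed: warn at 85% of allocated space
    ALARM_FRACTION = 0.95    # assumed: alarm at 95%

    def check_space_token(token, allocated_tb, consumed_tb):
        """Return an SLS-style status string for one space token."""
        if allocated_tb <= 0:
            return "%s: ALARM (no space allocated)" % token
        used = consumed_tb / allocated_tb
        if used >= ALARM_FRACTION:
            return "%s: ALARM (%.0f%% used)" % (token, 100 * used)
        if used >= WARNING_FRACTION:
            return "%s: WARNING (%.0f%% used)" % (token, 100 * used)
        return "%s: OK (%.0f%% used)" % (token, 100 * used)

    # Example with made-up numbers for one site:
    for token, alloc, used in [("LHCb_MC-M-DST", 50.0, 48.5),
                               ("LHCb_RDST", 120.0, 60.0)]:
        print(check_space_token(token, alloc, used))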

4 Simulation, Reco/Stripping/Merging, User
GridKA (~15% of resources) absorbed ~45% of the CPU
Pic (the smallest) ran comparably with the rest; 2nd for reco/stripping (with an MC-M-DST shortage as a consequence)
CNAF & CERN: underused for MC simulation
IN2p3: did not run reco proportionally to its share (the share is 20%, while it ran seemingly less than 10%)
CNAF largely used for Reco/Stripping


6 General (main) problems
dCache sites in general: watchdog killing jobs with connections hanging while opening files (affecting stripping and analysis jobs from the 1st till the 15th of October)
– Ron's recipe (after the October GDB): release dcap mover connections after a shorter timeout than the default (2 hours instead of days)
– migration to "golden release" 1.9.5, fixing a lot of outstanding problems
GIP (Lyon and GridKA) erroneously advertising "0" waiting jobs in the BDII caused an anomalous number of jobs to queue during heavy-activity periods (up to 25K waiting and 6K running)
– Improved the ranking expression (but still depending on the BDII)
Large sites with resources dedicated through fair share (e.g. CNAF) are better used
– CREAM direct submission will be the final solution
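
To illustrate the ranking problem above, a Python sketch of a pilot-ranking heuristic that treats an implausible "0 waiting jobs" publication as unknown rather than as an empty site. The GLUE attribute names in the comment are real BDII schema attributes, but the formula and function are illustrative assumptions, not DIRAC's actual rank expression.

    # Inputs correspond to GlueCEStateRunningJobs, GlueCEStateWaitingJobs
    # and GlueCEStateFreeJobSlots as published in the BDII.
    def rank_ce(running, waiting, free_slots):
        """Higher rank = more attractive CE for pilot submission."""
        # "0 waiting" with no free slots is implausible during heavy
        # activity; deprioritise the CE instead of trusting the number,
        # which is what attracted the 25K queued jobs described above.
        if waiting == 0 and free_slots == 0:
            return -1.0
        # Otherwise prefer spare capacity and a short queue per running job.
        return free_slots - float(waiting) / (running + 1)

    # Hypothetical published states during the incident:
    published = {
        "ce.lyon":   (6000, 0, 0),    # bogus "0 waiting"
        "ce.gridka": (900, 150, 50),
    }
    for name, (running, waiting, free_slots) in published.items():
        print(name, rank_ce(running, waiting, free_slots))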

7 Site by Site: CERN
Suffered from the fact that only 2 CEs were originally pointing to the SL5 batch farm (and both often running in funny states) (1st Oct.)
– Subsequently added 4 more CEs (ce[130-133])
Some (user) file access slowness observed (1st Oct.)
– Due to some very I/O-intensive activity registered
Slowness deleting files (reported prior to October):
– Race condition with multiple stagers (patch rolled into production)
UI issue: the new UI had an inconsistency in the PYTHONPATH causing the LFC modules not to load properly (7th Oct.)
CASTOR not available on the 14th (ALARM ticket)
One problematic disk server on lhcbdata (another ALARM ticket) (15th Oct.)
File access issue: just a transient problem (22nd Oct.)
Major CASTORLHCB intervention on the 27th (SL5 migration + patches + h/w on SRM)
Hammered the LSF master node with too many bjobs/bqueues queries from DIRAC
Issue with CASTORLHCB messed up the LSF view of which servers belong to which pool (7th Nov.)
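
The "hammering" item above suggests the standard mitigation of caching batch-system queries instead of shelling out on every DIRAC request. A sketch, with an assumed 60-second TTL and a hypothetical helper name; not DIRAC's actual fix.

    import subprocess
    import time

    _cache = {}       # command string -> (timestamp, output)
    CACHE_TTL = 60.0  # assumed: at most one identical LSF query per minute

    def cached_batch_query(cmd, ttl=CACHE_TTL):
        """Run an LSF query (e.g. 'bjobs -u lhcbprod'), reusing a recent result."""
        now = time.time()
        hit = _cache.get(cmd)
        if hit and now - hit[0] < ttl:
            return hit[1]  # serve from cache, sparing the LSF master
        output = subprocess.check_output(cmd, shell=True)
        _cache[cmd] = (now, output)
        return output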

8 Site by Site: CNAF
ACL issue on StoRM affecting stripping jobs (5th Oct.)
CASTOR intervention (6th Oct.)
(again) ACL issue on StoRM preventing data access from local pilot accounts (7th Oct.)
Reported disk space issue on the MC-*-DST STs (12th Oct.)
Problems deleting a directory in StoRM (13th Oct.)
StoRM not available (15th Oct.)
CASTOR issue with one certificate preventing a few RDSTs from the FEST week from getting to CNAF; 60 files left in an inconsistent status (15th Oct.)
– Problem reported continuously until the 27th, when it was understood as a configuration issue in the mapping between space tokens and disk pools
CNAF-T1 and CNAF-T2 underused because of a suboptimal rank expression (22nd Oct.)
Glitch on StoRM on the 27th of October
Problem listing directories on StoRM on the 5th of Nov.

9 Site by Site: GridKA
Watchdog issue killing jobs stuck in connection (1st Oct.)
– Cleaned some dcap movers
– Later adopted the solution from Ron
Failure to list directories (14th Oct.)
Lack of disk space on the MC space tokens (17th Oct.)
– Allocated more space
Reported 6K running + 25K waiting jobs (20th Oct.)
– Misleading publication in the BDII, due (probably) to the very old version of the LCG-CE installed

10 Site by Site: IN2p3
Issue with jobs killed for exceeding memory (beginning of Oct.):
– Coupling queue length and memory is absurd (fixed at the beginning of Nov.)
– Found to be due to DIRAC over-estimating the time left and then pulling long jobs
dCache issue of hanging connections (7th Oct., fixed ten days later)
– pinManager outage (8th Oct.)
– Firewall configuration issue on a file server preventing jobs from receiving the callback (10th Oct.)
– Cleaned the dcap movers and increased their number (up to 1500) (12th Oct.)
– Finally adopted Ron's recipe
Issue with one CE wrongly publishing running jobs and then attracting more and more pilots (20th Oct.)
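
On the time-left point above, a minimal sketch of the matching rule that was presumably violated: only pull a job whose expected CPU demand fits in the slot's remaining wall time. The safety margin and all names are illustrative assumptions; DIRAC's real time-left machinery is more involved.

    SAFETY_MARGIN = 0.8  # assumed: use at most 80% of the remaining time

    def job_fits(job_cpu_seconds, slot_seconds_left, cpu_power=1.0):
        """True if a job of the given CPU demand fits in the batch slot.

        cpu_power rescales wall time for faster or slower worker nodes.
        Over-estimating slot_seconds_left is exactly what lets long jobs
        be pulled onto queues that then kill them on their limits.
        """
        return job_cpu_seconds <= SAFETY_MARGIN * slot_seconds_left * cpu_power

    print(job_fits(36000, 50000))  # True: a 10h job in ~14h remaining
    print(job_fits(36000, 20000))  # False: would exceed the slot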

11 Site by Site: NL-T1
Watchdog issue killing jobs stuck in connection (1st Oct.)
– Changed the number of slots per disk server on the T1D1 pools, which was ridiculously low
– Cleaned some dcap movers
– Found the solution by tuning timeouts
Impossible to remove some data remotely (14th Oct.)
Authorization issue accessing data due to the migration to a new LDAP-based system for mapping users (16th Oct.)
tURL resolution issue: new tape protection mechanism preventing BoL operations from running (5th Nov.)
Gauss 134 error on a WN at SARA (6th Nov.)
DaVinci 139 error: problem with ROOT not finding the HOME directory (following a kernel upgrade affecting the 'nscd' service on the WNs) (8th Nov.)

12 Site by Site: pic
Suffered from lack of disk space on the MC space tokens (since 16th Oct.)
– Used even more than requested
– Reconstruction activity larger than the share
– Pledge not yet allocated
User reported problems opening files (1st Oct.)
– Exhausted number of dcap slots; the problem did not reappear

13 Site by Site: RAL
Major outage of the h/w beneath the Oracle 3D RAC serving CASTOR and other grid services (4th to 13th Oct.) (post-mortem available)
– Further recovery of data failed: 200 user-visible files (merged DSTs) have been definitively lost
Tuned the number of rootd slot connections per disk server to scale the system adequately (now 300) (1st Oct.)
SAM failing to access the shared area (Quattor issue) (16th Oct.)
Multi-VO user issue with CASTOR (due to CASTOR not being VOMS-aware) (26th Oct.)
– Fixed by mapping to lhcb-only users in the LHCb dedicated CASTOR instance
OPN issue affecting incoming transfers from other T1s (pic in particular) (29th Oct.)

