Slide 1: LHCb Activity (GDB meeting, July ’06)
- Problems in production
- DC06 plans & resource requirements
- Preparation for DC06
- LCG communications

Slide 2: Monitoring LHCb production activity
http://lhcb01.pic.es/DIRAC/Monitoring/Production/

Slide 3: Main problematic sites
http://santinel.home.cern.ch/santinel/cgi-bin/logging_info

Slide 4: Problem highlights
GridKa (GGUS #8383):
- Cause: "Unspecified Gridmanager Error"
- Problem reported: beginning of May
- Jobs failed (rough estimate): 5-15k
- Diagnosis:
  1. Home directory not properly cleaned up (number of i-nodes exhausted)
  2. Error 15007: authorization problem in the local LRMS
- Notes: close collaboration between VO and site admins during the investigation
- Status: being resolved; the Gridmanager error is fixed, but jobs still fail at the site and investigations are ongoing
- Task force: LHCb and GridKa personnel investigating together; an extremely positive exercise

Slide 5: Problems that took ~1 month to solve
Scotgrid (GGUS #8545):
- Cause: working directory not writable
- Problem reported: May 12th
- Jobs failed (rough estimate): 3-5k
- Status: solved (May 29th)
csTCDie (GGUS #8577):
- Cause: unspecified Gridmanager error
- Problem reported: May 13th
- Jobs failed (rough estimate): 3-6k
- Status: solved (June 14th)
UKI-NORTHGRID-LIV-HEP (GGUS #8576):
- Cause: cannot read JobWrapper output, either from Condor or from Maradona
- Problem reported: May 13th
- Jobs failed (rough estimate): 3-5k
- Status: solved (June 14th)

Slide 6: Other problems over the past month
- Intermittent publication of site information, particularly at CERN and IN2P3
  - High load on the CE head node: split the CE-specific grid services from the local batch master node and from the site BDII
- LB software on rb107.cern.ch broken by a Quattor feature
- LFC problems: host certificate expired
- Overloaded LHCb SE for log files; many lcg-cp commands seen hanging (bug in the underlying Globus software)
- Site queues are very often too short to hold LHCb jobs; sites usually react quickly to a GGUS ticket
- Sites in "downtime" advertising that they are available for production

Slide 7: CERN issues over the past month
- CASTOR-2 for grid users:
  - The default configuration on the Grid is CASTOR-1
  - No VO-specific "login" environment
  - Shipping site-specific scripts is not a solution (we run at ~100 sites!)
- The software installation mechanism should be exactly the same as at other sites:
  - Currently not possible through Grid mechanisms
  - Should be solved by transparently integrating the gssklog mechanism into the LSF job wrapper at CERN

Slide 8: Service monitoring page
http://santinel.home.cern.ch/santinel/cgi-bin/lhcb
To do:
- Update to the latest endpoints
- Add the T0-T1 endpoint

Slide 9: LHCb DC’06
Challenge (using the LCG production services):
a) Distribution of RAW data from CERN to the Tier-1s
b) Reconstruction/stripping at the Tier-1s, including CERN
c) DST distribution to CERN and the other Tier-1s
Start is imminent!
LHCb Tier-1s: CNAF, GridKa, IN2P3, NIKHEF, PIC, RAL

Slide 10: LHCb DC’06 (updated numbers)
Distribution of RAW data from CERN:
- ~40 TB of storage, replicated over the Tier-1 sites (each file on only one Tier-1)
- CERN MSS SRM → local MSS SRM endpoint
Reconstruction/stripping at the Tier-1s, including CERN:
- ~300 kSI2k.months needed to reconstruct and strip the events
- Output:
  - rDST (from reconstruction): ~10 TB, accumulated on MSS at the Tier-1 sites where produced
  - DST (from stripping): ~1.2 TB in total, on disk SE
DST distribution to CERN and all other Tier-1s:
- 1.2 TB of DST will be replicated to a disk-based SE at each Tier-1 and at CERN
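A back-of-envelope check of the storage figures above. This is only a sketch: the even split of RAW across the six Tier-1s is an assumption (the slide only says each file lands on exactly one Tier-1), while the DST copy count follows directly from "a disk-based SE at each Tier-1 & CERN".

```python
# Rough check of the DC06 storage figures quoted on the slide.
tier1s = ["CNAF", "GridKa", "IN2P3", "NIKHEF", "PIC", "RAL"]

raw_total_tb = 40.0
# Each RAW file goes to exactly one Tier-1; assuming an even split
# (an assumption, not stated on the slide):
raw_per_site_tb = raw_total_tb / len(tier1s)        # ~6.7 TB per Tier-1

dst_total_tb = 1.2
# The DST is replicated to a disk SE at every Tier-1 *and* at CERN,
# i.e. 7 full copies of the 1.2 TB sample:
dst_disk_copies = len(tier1s) + 1
dst_disk_total_tb = dst_total_tb * dst_disk_copies  # ~8.4 TB of disk overall
```

So each Tier-1 holds only a modest RAW share, but the full-replication model for DST multiplies the disk requirement sevenfold.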

Slide 11: Reconstruction (CERN & Tier-1s) (updated numbers)
- About ~0.4 MSI2k.months needed for this phase (July-August)
- 500 events per file
- Assumed job profile:
  - 10k events per job (20-40 input files)
  - ~6 hours per job
  - 2.0 GB input, 0.5 GB output per job
  - Output stored at the Tier-1 where the job runs
- 20k jobs in total
Comparison with the computing model:
- 2-3 GB (single) input files, ~100k events
- 2-3 GB output data
- Job duration ~36 hours
- ~30k reconstruction jobs per month during data taking
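The job parameters above hang together. A minimal sketch checking them, assuming the total sample is the one implied by 20k jobs of 10k events each (a derived figure, not stated explicitly on the slide):

```python
# Consistency check of the reconstruction job parameters on the slide.
events_per_file = 500
events_per_job = 10_000
input_gb_per_job = 2.0
output_gb_per_job = 0.5
hours_per_job = 6
n_jobs = 20_000

files_per_job = events_per_job // events_per_file    # 20 files, low end of 20-40
total_events = n_jobs * events_per_job               # 200M events processed
total_input_tb = n_jobs * input_gb_per_job / 1024    # ~39 TB, matching ~40 TB RAW
total_output_tb = n_jobs * output_gb_per_job / 1024  # ~10 TB of rDST, as on slide 10
cpu_hours = n_jobs * hours_per_job                   # 120k CPU-hours over July-August
```

Note how the per-job figures reproduce the slide 10 totals: 20k jobs at 2 GB input each is the ~40 TB of RAW, and at 0.5 GB output each is the ~10 TB of rDST.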

Slide 12: Stripping (CERN & Tier-1s) (updated numbers)
- Less than 0.1 MSI2k.months needed for this phase (run concurrently with reconstruction)
- Assumed job profile:
  - 40k events per job (4 input files)
  - ~2 hours per job
  - 10 GB input, ~1.5 GB output per job
  - Output distributed to ALL Tier-1s and CERN
- 5k jobs in total
Comparison with the computing model:
- Job duration ~36 hours
- 10 input rDST files + 10 input RAW files, ~1M events
- 50 GB input, ~7 GB output per job
- ~3k stripping jobs per month during data taking
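Assuming each reconstruction job writes a single rDST file (an assumption; the slides do not say so explicitly), the stripping job count follows directly from the reconstruction parameters on the previous slide:

```python
# How the stripping workload follows from the reconstruction output,
# assuming one rDST file per reconstruction job (not stated on the slides).
recons_jobs = 20_000
events_per_rdst_file = 10_000      # one recons job processes 10k events
rdst_files_per_strip_job = 4
hours_per_strip_job = 2

strip_jobs = recons_jobs // rdst_files_per_strip_job            # 5k jobs, as quoted
events_per_strip_job = rdst_files_per_strip_job * events_per_rdst_file  # 40k events
strip_cpu_hours = strip_jobs * hours_per_strip_job              # ~10k CPU-hours
```

At ~10k CPU-hours against the reconstruction's ~120k, stripping is indeed the small concurrent add-on the slide describes (<0.1 vs ~0.4 MSI2k.months).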

Slide 13: Reconstruction & stripping (CERN & Tier-1s)
- Need to be able to prioritise reconstruction and stripping jobs
- LHCb prioritisation (MC production vs reconstruction/stripping) is dealt with by LHCb
- Important jobs must not be unduly delayed by other VO activities

Slide 14: SRM-based SEs
- Important: consistent implementation across back-ends
- SRM 1.1 enabled
- Should have separate disk and MSS SEs at all Tier-1s, as agreed following the Mumbai workshop
- Not all there yet:
  - CERN disk SE endpoint has been available for testing since last week
  - CNAF: tied to the migration to CASTOR-2
  - Low capacity at IN2P3 and PIC

Slide 15: Slide from the past
- IMPORTANT: data will be accessed directly from the [local] SE through protocols supported by ROOT/POOL, and NOT only by GridFTP/srmcp
- This is NOT a trivial statement!
- First shown at the Jan ’06 GDB, when it became apparent that some T1 centres had not appreciated this fact
- Status, July ’06: this is still NOT a trivial statement

Slide 16: Data Challenge in Oct/Nov ’06 (DC06-2)
- Reconstruction using conditions data (COOL-based)
- Essential that ALL LHCb Tier-1 centres run a database service supporting COOL and 3D
- The timescale presented at the GDB allows months of pre-testing of the services before they are used in anger
- Preliminary tests are already ongoing

Slide 17: LCG communications
- Plethora of meetings and workshops
- Not convinced of their usefulness; concern that LHCb requirements are not effectively conveyed through these forums
- Would like to build on the successful collaboration with GridKa and organise LHCb-T1 meetings
  - Vital to ensure LHCb requirements are understood and that site-related problems can be addressed together
  - Essential two-way communication
  - Important for the minority of T1 centres with no LHCb personnel on site

Slide 18: Summary
- Production running relatively smoothly, but not without hiccups
- Close and fruitful collaboration ongoing to understand LHCb problems at GridKa
- The DC06 environment is not completely in place for LHCb yet
- Major concern about the lack of direct application access to the "local" SE
- Concern about the LCG communication channels used to propagate LHCb experiment requirements
- Propose to commence LHCb-T1 meetings

