1 LHCb Status report June 08

2 LHCb Computing Report: Activities since February
- Applications and Core Software
  - Preparation of applications for real data
  - Simulation with real geometry (from survey)
  - Certification of GEANT4 9.1
  - Alignment and calibration procedures in place
- Production activities (aka DC06)
  - New simulations on demand
  - Continue stripping and re-processing
  - Dominated by data access problems (and how to deal with them)
- Core Computing
  - Building on CCRC08 phase 1
  - Improved WMS and DMS: introduce error recovery, failover mechanisms, etc.
  - Commission DIRAC3 for simulation and analysis (only DM and reconstruction were exercised in February)

3 LHCb Computing Report: Sites configuration
- Databases
  - ConditionsDB and LFC replicated (3D) to all Tier1s
  - Read-only LFC mirror available at all Tier1s
- Sites SE migration
  - RAL migration from dCache to Castor2
    - Very painful exercise (bad pool configurations)
    - Took over 8 months to complete, but... it's over!
  - PIC migration from Castor1 to dCache
    - Fully operational, Castor decommissioned since March
  - CNAF migration of T0D1 and T1D1 to StoRM
    - Went very smoothly for migrating the existing files from Castor2
- SRM v2 spaces
  - All spaces needed for CCRC were in place
  - SRM v2 still not used for DC06 production (see later)

4 LHCb Computing Report: DC06 production issues
- Still using DIRAC2, i.e. SRM v1
  - No plan to backport SRM v2 to DIRAC2
  - We could update the LFC entries to srm-lhcb.cern.ch (v2) for reading, but...
    - Still need srm-durable-lhcb.cern.ch for T0D1 upload
    - Need an srm-get-metadata that works for SRM v2
  - When the SRM v2 and v1 end-points are the same, no problem
  - DIRAC2 checked against StoRM: no problem
- File access problems
  - As for CCRC, they dominate the re-processing problems
  - Castor sites: OK, but we would like to have rootd at RAL (problems known for 4 years with the rfio plugin in ROOT, alleviated with rootd)
  - Many (>7,000) files lost at CERN-Castor (mostly from autumn 2006), to be marked as such in the LFC (some irrecoverable)
  - dCache sites: no problem using the dcap protocol (PIC, GridKa), many problems with gsidcap (IN2P3, NL-T1)
  - How to deal with erratic errors (files randomly inaccessible); see the retry sketch below
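The "erratic" access errors mentioned in the last bullet are the kind of transient failure that a simple retry-with-backoff can absorb before a job gives up on a replica. A minimal sketch, assuming a caller-supplied open function; the callable, the delays and the jitter are illustrative, not the DIRAC2 code:

```python
import random
import time

def with_retries(open_file, lfn, attempts=3, base_delay=30):
    """Retry an erratic file open a few times with jittered backoff
    before giving up and (for example) falling back to a local download.
    open_file is whatever access routine the job uses (placeholder)."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return open_file(lfn)
        except IOError as exc:          # treat transient access errors as retryable
            last_exc = exc
            delay = base_delay * (2 ** attempt) + random.uniform(0, 5)
            time.sleep(delay)
    raise last_exc                      # permanent failure: report the file as inaccessible
```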

5 LHCb Computing Report: CCRC'08 (summary of last week's report, courtesy of Nick Brook)

6 CCRC'08 post mortem - June'08: Planned tasks and May activities
- Maintain the equivalent of 1 month of data taking, assuming a 50% machine cycle efficiency
- Run fake analysis activity in parallel to production-type activities
- Analysis-type jobs were used for debugging throughout the period
- GANGA testing ran at a low level for the last weeks

7 CCRC'08 post mortem - June'08: Activities across the sites
Planned breakdown of processing activities (CPU needs) prior to CCRC08 (a per-site job-target sketch follows the table):

  Site          Fraction (%)
  CERN          14
  FZK           11
  IN2P3         25
  CNAF           9
  NIKHEF/SARA   26
  PIC            4
  RAL           11
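To make the planned shares concrete, here is a small sketch that turns the percentages above into per-site job targets for a batch of jobs. The numbers come from the table; the rounding scheme and the helper are assumptions, not the LHCb/DIRAC share mechanism:

```python
# Planned CPU shares from the table above (percent).
PLANNED_SHARE = {
    "CERN": 14, "FZK": 11, "IN2P3": 25, "CNAF": 9,
    "NIKHEF/SARA": 26, "PIC": 4, "RAL": 11,
}

def job_targets(total_jobs):
    """Split a batch of jobs across sites in proportion to the planned shares."""
    total_share = sum(PLANNED_SHARE.values())
    return {site: round(total_jobs * share / total_share)
            for site, share in PLANNED_SHARE.items()}

print(job_targets(41200))   # roughly the 41.2k reconstruction jobs submitted during CCRC08
```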

8 CCRC'08 post mortem - June'08: Pit -> Tier 0
- Use of rfcp to copy data from the pit to CASTOR (rfcp is the approach recommended by IT); a copy-loop sketch follows below
- A file sent every ~30 sec
- Data remains on online disk until CASTOR migration
- Rate to CASTOR: ~70 MB/s
- In general ran smoothly:
  - Stability problems with the online storage area, solved with a firmware update during CCRC
  - Internal issues with sending book-keeping info
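A minimal sketch of the pit-to-CASTOR copy loop described above: one file roughly every 30 seconds via rfcp, with the file kept on online disk until the copy succeeds. The paths, retry policy and pacing are illustrative assumptions, not the online system's actual code:

```python
import subprocess
import time

def ship_file(local_path, castor_path, retries=2):
    """Copy one file to CASTOR with rfcp, retrying a couple of times."""
    for attempt in range(retries + 1):
        rc = subprocess.call(["rfcp", local_path, castor_path])
        if rc == 0:
            return True          # deletion from online disk is left to the migration bookkeeping
        time.sleep(10)
    return False

def copy_loop(pending_files, castor_dir):
    """Send pending files at roughly one file every ~30 s, as quoted on the slide."""
    for local_path in pending_files:
        castor_path = castor_dir + "/" + local_path.rsplit("/", 1)[-1]
        ship_file(local_path, castor_path)
        time.sleep(30)
```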

9 CCRC'08 post mortem - June'08: Tier 0 -> Tier 1
- FTS from CERN to the Tier-1 centres
- Transfer of RAW only occurs once the data has migrated to tape and the checksum is verified (a gating sketch follows below)
- Rate out of CERN: ~35 MB/s averaged over the period
- Peak rate far in excess of the requirement
- In smooth running, sites matched the LHCb requirements
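A sketch of the "replicate RAW only after tape migration and checksum verification" gate described above. The migration and checksum checks are placeholders; the glite-transfer-submit invocation follows the FTS2-era CLI, but the endpoint and options shown are assumptions:

```python
import subprocess

FTS_ENDPOINT = "https://fts.example.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer"  # placeholder

def migrated_to_tape(surl):
    # Placeholder: in reality query the CASTOR stager / SRM for the tape-migration status.
    return True

def checksum_ok(surl, expected):
    # Placeholder: in reality compare the catalogued checksum with the stored copy.
    return True

def replicate_raw(source_surl, dest_surl, expected_checksum):
    """Submit an FTS transfer only once the RAW file is safely on tape and verified."""
    if not migrated_to_tape(source_surl):
        return None                      # not migrated yet: try again on the next pass
    if not checksum_ok(source_surl, expected_checksum):
        raise RuntimeError("checksum mismatch, do not replicate %s" % source_surl)
    out = subprocess.check_output(
        ["glite-transfer-submit", "-s", FTS_ENDPOINT, source_surl, dest_surl])
    return out.strip()                   # FTS job identifier
```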

10 CCRC'08 post mortem - June'08 (plot only; no transcript text)

11 CCRC'08 post mortem - June'08: Tier 0 -> Tier 1
- To first order, all transfers eventually succeeded
- Plot shows the efficiency on the 1st attempt; annotated incidents:
  - Issue with UK certificates
  - CERN outage
  - CERN SRM endpoint problems
  - Restart of the IN2P3 SRM endpoint

12 CCRC'08 post mortem - June'08: Reconstruction
- Used SRM 2.2 SEs; the LHCb space tokens are:
  - LHCb_RAW (T1D0)
  - LHCb_RDST (T1D0)
- Data shares need to be preserved (important for resource planning)
- Input: 1 RAW file; output: 1 rDST file (1.6 GB)
- Reduced the number of events per reconstruction job from 50k to 25k (job of ~12 hours on a 2.8 kSI2k machine) in order to fit within the available queues (see the scaling sketch below)
  - Need queues at all sites that match our processing time
  - Alternative: reduce the file size!
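A back-of-the-envelope check of the job sizing above: if 25k events take ~12 hours on a 2.8 kSI2k machine, a 50k-event job scales to roughly a day, which is why it no longer fits typical queues. The per-event cost is derived from the slide; the 1.5 kSI2k figure is just an example of a slower worker node:

```python
REF_EVENTS = 25_000
REF_HOURS = 12.0
REF_POWER_KSI2K = 2.8

def job_hours(events, power_ksi2k):
    """Estimate job duration by scaling the reference point linearly in events and CPU power."""
    per_event_ksi2k_hours = REF_HOURS * REF_POWER_KSI2K / REF_EVENTS
    return events * per_event_ksi2k_hours / power_ksi2k

print(job_hours(50_000, 2.8))   # ~24 h: the old 50k-event job
print(job_hours(25_000, 1.5))   # ~22 h even for 25k events on a slower machine
```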

13 CCRC'08 post mortem - June'08: Reconstruction
- After the data transfer the file should be online, as the job is submitted immediately, but...
- LHCb pre-stages files and then checks the status of the file before submitting the pilot job, using gfal_ls (a polling sketch follows below)
  - Pre-staging should ensure access availability from cache
  - Only issue at NL-T1 with reporting of the file status: discussed last week during the Storage session (dCache version)
  - (A problem developed at IN2P3 right at the end of CCRC08, on 31st May)
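A sketch of the pre-stage-then-check logic described above: poll the file's SRM locality and only submit the pilot once the replica is reported online. The gfal_ls command name comes from the slide; its options and output format here (matching an ONLINE locality string) are assumptions:

```python
import subprocess
import time

def wait_until_online(surl, poll_seconds=60, max_polls=30):
    """Poll the storage element until the file is reported online, or give up."""
    for _ in range(max_polls):
        out = subprocess.run(["gfal_ls", "-l", surl],
                             capture_output=True, text=True).stdout
        if "ONLINE" in out:          # SRM v2.2 localities: ONLINE / NEARLINE / ONLINE_AND_NEARLINE
            return True
        time.sleep(poll_seconds)
    return False

def maybe_submit_pilot(surl, submit):
    if wait_until_online(surl):
        submit(surl)                 # submit() stands in for the DIRAC pilot submission
    else:
        print("file never reported online:", surl)
```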

14 CCRC'08 post mortem - June'08: Reconstruction
- 41.2k reconstruction jobs submitted
- 27.6k jobs proceeded to the done state
- Done/created ~67%

  Site     Sub jobs      Done jobs     Ratio
  CERN     6.1k (14%)    5.3k (13%)    86%
  CNAF     3.9k (9%)     2.8k (7%)     72%
  GridKa   4.1k (11%)    3.1k (7%)     76%
  IN2P3    10.3k (25%)   6.1k (14%)    56%
  NIKHEF   10.3k (26%)   2.3k (6%)     23%
  PIC      1.8k (4%)     1.6k (4%)     89%
  RAL      4.7k (11%)    3.5k (8%)     74%

15 CCRC'08 post mortem - June'08: Reconstruction
- 27.6k reconstruction jobs in the done state
- 21.2k jobs processed 25k events; done/25k events ~77%
- 3.0k jobs failed to upload the rDST to the local SE (only 1 attempt before trying failover); failover/25k events ~13%

  Site     25k events    Fail upload   Success/Created
  CERN     5.2k (100%)   0.7k (14%)    76%
  CNAF     2.6k (95%)    0.0k (1%)     67%
  GridKa   3.0k (99%)    0.7k (22%)    58%
  IN2P3    5.1k (90%)    0.7k (14%)    43%
  NIKHEF   1.2k (53%)    0.9k (70%)    4%
  PIC      1.6k (99%)    0.0k (0%)     89%
  RAL      3.1k (89%)    0.0k (1%)     68%

16 LHCb Computing Report: Summary of reconstruction issues
- File access problems
  - Random or permanent failures to open files using gsidcap; no problem observed with dcap
  - Request IN2P3 and NL-T1 to allow the dcap protocol for local read access
  - (Temporary?) solution: download the file to the WN; had to get it back from CERN in some cases (a fallback sketch follows below)
  - Wrong file status returned by dCache SRM after a put: discussed and understood last week; the problem was that bringOnline was not doing anything
  - Questionable however whether all reads should happen in a shared cache pool with a disk-to-disk copy (even for TxD1 files)
- Software area access problems
  - Site banned for a while until the problem is fixed; creates inefficiency and unavailability, but "easy" to recover
- Application crashes
  - Fixed within a day or so (new SW release and deployment)
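A minimal sketch of the fallback described above: try remote protocol access first and, if that fails, download the input file to the worker node and read it locally. The access_remote() call is a placeholder for the dcap/gsidcap open, and the lcg-cp invocation details are assumptions rather than the actual job-wrapper code:

```python
import os
import subprocess

def access_remote(turl):
    # Placeholder for opening the file via dcap/gsidcap; raising here forces the fallback path.
    raise IOError("remote protocol access not available in this sketch")

def get_input(turl, surl, local_dir="."):
    """Return something the job can read: the remote TURL if it opens, else a local copy."""
    try:
        return access_remote(turl)
    except IOError:
        local_path = os.path.join(local_dir, surl.rsplit("/", 1)[-1])
        subprocess.check_call(["lcg-cp", surl, "file://" + os.path.abspath(local_path)])
        return local_path            # the job then reads the local copy on the WN
```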

17 CCRC'08 post mortem - June'08: Reconstruction
- Low efficiency at CNAF due to:
  - s/w area access
  - more jobs than cores on a WN
  - ...
- Low efficiency at RAL & IN2P3 due to data download; resolved by tuning the timeout
- CPU efficiency based on the ratio of CPU time to wall-clock time for running jobs

18 CCRC'08 post mortem - June'08: dCache observations
- Official LCG recommendation: 1.8.0-15p3
- LHCb ran smoothly at half of the T1 dCache sites
  - PIC OK - version 1.8.0-12p6 (unsecure)
  - GridKa OK - version 1.8.0-15p2 (unsecure)
- IN2P3 problematic - version 1.8.0-12p6 (secure)
  - Seg faults; needed to ship a version of GFAL to run. Could this explain the CGSI-gSOAP problem?
- NL-T1 problematic (secure)
  - Many versions deployed during CCRC to solve a number of issues: 1.8.0-14 -> 1.8.0-15p3 -> 1.8.0-15p4
  - "Failure to put data - empty file" -> "missing space token" problem -> incorrect metadata returned, NEARLINE issue

19 CCRC'08 post mortem - June'08: Stripping
- Stripping runs on rDST files: 1 rDST file & the associated RAW file
  - Space tokens: LHCb_RAW & LHCb_RDST
- DST files & ETC produced during the process are stored locally on T1D1 (additional storage class)
  - Space token: LHCb_M-DST
- DST & ETC files are then distributed to all other computing centres on T0D1 (except CERN, T1D1); see the placement sketch below
  - Space tokens: LHCb_DST (LHCb_M-DST at CERN)
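One way to picture the placement rules above is as a small table mapping each output type to its space token and replication pattern. The structure and helper below are illustrative assumptions, not the DIRAC3 configuration format:

```python
TIER1S = ["CERN", "CNAF", "GridKa", "IN2P3", "NIKHEF/SARA", "PIC", "RAL"]

PLACEMENT = {
    "RAW":  {"space_token": "LHCb_RAW",  "copies": "local"},
    "rDST": {"space_token": "LHCb_RDST", "copies": "local"},
    # DST & ETC: master copy on T1D1 at the production site, T0D1 copies elsewhere.
    "DST":  {"master_token": "LHCb_M-DST", "replica_token": "LHCb_DST"},
    "ETC":  {"master_token": "LHCb_M-DST", "replica_token": "LHCb_DST"},
}

def destinations(output_type, production_site):
    """List (site, space_token) pairs where a given output should end up."""
    rule = PLACEMENT[output_type]
    if "master_token" not in rule:
        return [(production_site, rule["space_token"])]
    others = [s for s in TIER1S if s != production_site]
    return [(production_site, rule["master_token"])] + \
           [(site, rule["replica_token"]) for site in others]

print(destinations("DST", "GridKa"))
```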

20 CCRC'08 post mortem - June'08: Stripping
- 31.8k stripping jobs were submitted
- 9.3k jobs ran to "Done"
- Major issues with the LHCb book-keeping

  Site                        Submitted   Done
  CERN                        2.4k        2.3k
  CNAF                        2.3k        2.0k
  GridKa                      2.0k
  IN2P3                       4.5k        0.2k
  NIKHEF                      0.3k        <0.1k
  PIC                         1.1k
  RAL                         2.2k        1.6k
  Failed to resolve datasets  17.0k       -

21 CCRC'08 post mortem - June'08: Stripping, T1-T1 transfers
- Stripping limited to 4 T1 centres: CNAF, PIC, GridKa, RAL
- Stripping reduction factor too small

22 CCRC'08 post mortem - June'08: Lessons learnt for DIRAC3
- Improved error reporting in workflow & pilot logs
  - Careful checking of log files was required for detailed analysis
- Full failover mechanism is in place but not yet deployed
  - Only CERN was used for CCRC08
- Alternative forms of data access
  - Minor tuning of the timeout for downloading input data was required
  - 2 timeouts needed: total copy time & activity timeout (see the sketch below)
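A sketch of the two-timeout idea above: abort a download if the total copy time is exceeded, or if the destination file stops growing for too long (the activity timeout). The lcg-cp command and the chosen limits are illustrative; DIRAC3's actual implementation differs:

```python
import os
import subprocess
import time

def download(surl, local_path, copy_timeout=3600, activity_timeout=300):
    """Copy a file to the WN, enforcing both a total-time and a no-progress timeout."""
    proc = subprocess.Popen(["lcg-cp", surl, "file://" + os.path.abspath(local_path)])
    start = last_change = time.time()
    last_size = -1
    while proc.poll() is None:
        time.sleep(10)
        size = os.path.getsize(local_path) if os.path.exists(local_path) else 0
        now = time.time()
        if size != last_size:
            last_size, last_change = size, now      # transfer is still making progress
        if now - start > copy_timeout or now - last_change > activity_timeout:
            proc.kill()
            raise IOError("download of %s timed out" % surl)
    if proc.returncode != 0:
        raise IOError("lcg-cp failed for %s" % surl)
    return local_path
```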

23 CCRC'08 post mortem - June'08: Summary
- Data transfer for CCRC08 using FTS was successful
- Still plagued by many issues associated with data access
  - Issues improved since the February CCRC08, but...
  - 2 sites were problematic for large chunks of CCRC08 - 50% of LHCb resources!
  - Problems mainly associated with access via dCache; commencing tests with xrootd
- DIRAC3 tools improved significantly since February
  - Still need improved reporting of problems
- LHCb book-keeping remains a major concern
  - New version due prior to data taking
- LHCb needs to implement better interrogation of log files

24 LHCb Computing Report: Outlook
- Continue CCRC-like exercises for testing new releases of DIRAC3
  - One or two 6-hour runs at a time
  - Features under test: full failover for file upload, LFC registration, BK registration, job status reporting (using VOBoxes)
- Commission DIRAC3 fully for simulation
  - Easier than the processing workflows
- Adapt ganga for DIRAC3 submission
  - Delayed due to an accident of the developer...
- We would like to test a "generic" pilot agent mode of running, even in the absence of glexec
  - Certify the "time left" utility on all sites (see the sketch below)
  - Assess the mode of running (full test of proxy handling)
  - We can limit it to running LHCb applications (no user scripts): no security risk higher than for production jobs
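A rough sketch of what a "time left" check in a generic pilot could look like: compare the elapsed wall-clock time with the queue limit and only fetch another payload if enough time remains. Reading the limit from an environment variable and the margins used are assumptions; real batch systems expose this information in system-specific ways:

```python
import os
import time

PILOT_START = time.time()

def seconds_left(default_limit=86_400):
    """Remaining wall-clock time in the batch slot, from a hypothetical environment variable."""
    limit = int(os.environ.get("QUEUE_WALLCLOCK_LIMIT", default_limit))
    return limit - (time.time() - PILOT_START)

def can_take_payload(expected_payload_seconds, safety_margin=600):
    """Only accept another job if it fits in the remaining slot with some margin."""
    return seconds_left() > expected_payload_seconds + safety_margin

if can_take_payload(12 * 3600):        # e.g. a ~12 h reconstruction job
    print("enough time left: request another job from the central queue")
else:
    print("not enough time left: finalize and exit the pilot")
```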

