
Slide 1: Successful Distributed Analysis ~ a well-kept secret
K. Harrison, LHCb Software Week, CERN, 27 April 2006

Slide 2: Shocking News!
- LHCb Distributed Analysis system is up and running
- DIRAC and Ganga working well together
- People with little or no knowledge of Grid technicalities are using the system for physics analysis
- More than 30 million events processed in the past two months
- Fraction of jobs completing successfully averaging about 88%
- Extended periods with success rate >94%
How can this be happening? Did he say 30 million? Who's doing this?

Slide 3: Beginnings of a success story
- 2nd LHCb-UK Software Course held at Cambridge, 10-12 January 2006
- Half a day dedicated to Distributed Computing: presentations and 2 hours of practical sessions
  – U.Egede: Distributed Computing & Ganga
  – R.Nandakumar: UK Tier-1 Centre
  – S.Paterson: DIRAC
  – K.Harrison: Grid submission made simple
- Made a number of things clear to participants:
  – Tier-1 centres have a lot of resources
  – It is easy to submit jobs to the Grid using Ganga
  – DIRAC ensures a high success rate
⇒ Distributed analysis not just possible in theory, but possible in practice
(Photographs by P.Koppenburg)

Slide 4: Cambridge HEP Group
- People:
  – Theory: 14, including 4 PhD students
  – Experiment: 36 (10 LHCb), including 10 PhD students (4 LHCb)
  – Also have project students (5 LHCb, about to finish)
- Computing resources:
  – Condor pool of 37 Linux machines, all but 2 with a single CPU; these are people's desktop machines, also used interactively
  – 8-10 TB of disk space
- Local resources are usually fine for LHCb analyses of 50k-100k events; for larger-scale analyses we rely on access to other resources
  – The Grid is an attractive option: develop locally and run remotely, without needing to copy files around manually

Slide 5: Setting up at Cambridge for distributed analysis
- LHCb software installed locally and updated for new releases
- DIRAC installed together with the LCG client tools, and regularly updated (currently using v2r10)
  – Have a script to take care of post-install configuration
- EDG job-submission tools installed
  – Has allowed testing of direct submission to LCG
- Using Ganga 4.1.0-beta3 (released December 2005) with additions and patches
  – No built-in job splitting and no graphical interface
  – Ganga 4.1.0 public release installed, but needs bug fixes

Slide 6: User group
- C.Lazzeroni: B+ → D0(KS0 π+ π-)K+
- J.Storey: Flavour tagging with protons
- Project students:
  – M.Dobrowolski: B+ → D0(KS0 K+ K-)K+
  – S.Kelly: B0 → D+ D- and Bs0 → Ds+ Ds-
  – B.Lum: B0 → D0(KS0 π+ π-)K*0
  – R.Dixon del Tufo: Bs0 → φφ
  – A.Willans: B0 → K*0 μ+ μ-
- R.Dixon del Tufo had previous experience of the Grid, Ganga and HEP software; the others encountered these for the first time at the LHCb-UK software course
- Cristina decided she preferred Cobra to Python
(Photograph by A.Buckley, CHEP06, Mumbai)

Slide 7: Work model (1)
- Usual strategy has been to develop/test/tune algorithms using signal samples and small background samples on local disks, then process (many times) larger samples (>700k events) on the Grid
- Job submission performed using a GPI (Python) script that implements simple-minded job splitting (a sketch follows this list)
  – Users need only look at the few lines specifying the DaVinci version, master package, job options and splitting requirements
  – Splitting parameters are files per job and maximum total number of files (very useful for testing on a few files)
  – Script-based approach popular with both new users (very little to remember) and experienced users (similar to what they usually do to submit to a batch system)
  – Jobs submitted to both DIRAC and Condor
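The submission script itself is not reproduced in the slides. As a rough illustration, here is a minimal sketch of such a GPI splitting script, meant to be run inside a Ganga session (where Job, DaVinci, LHCbDataset and Dirac are predefined GPI names). The version string, master package, options file, the lfns.txt input list and the attribute spellings are all assumptions for illustration; the exact names varied between Ganga 4 releases.

    # Minimal sketch of a simple-minded splitting script (illustrative).
    # Run inside a Ganga session, where Job, DaVinci, LHCbDataset and
    # Dirac are predefined GPI names; attribute spellings are assumed.

    DAVINCI_VERSION = 'v12r15'       # placeholder version string
    MASTER_PACKAGE  = 'PhysSel'      # placeholder master package
    OPTS_FILE       = 'myDaVinci.opts'
    FILES_PER_JOB   = 25             # splitting parameter 1
    MAX_FILES       = 100            # splitting parameter 2 (handy for tests)

    # Read the logical file names to process, truncating at MAX_FILES
    lfns = [line.strip() for line in open('lfns.txt') if line.strip()]
    lfns = lfns[:MAX_FILES]

    # Submit one job per chunk of FILES_PER_JOB files
    for start in range(0, len(lfns), FILES_PER_JOB):
        j = Job()
        j.application = DaVinci(version=DAVINCI_VERSION,
                                masterpackage=MASTER_PACKAGE,
                                optsfile=OPTS_FILE)
        j.inputdata = LHCbDataset(files=lfns[start:start + FILES_PER_JOB])
        j.backend = Dirac()          # or Condor() for the local pool
        j.submit()

With this layout a user only edits the five assignments at the top, which matches the "few lines to look at" point above.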

Slide 8: Work model (2)
- Interactive Ganga session started to get status updates and output retrieval
- DIRAC monitoring page also used for checking job progress
- Jobs usually split so that output files were small enough to be returned in the sandbox (i.e. retrieved automatically by Ganga)
- Large outputs placed on the CERN storage element (CASTOR) by DIRAC
  – Outputs retrieved manually using the LCG transfer command (lcg-cp) and the logical file name given by DIRAC (see the sketch after this list)
- Hbook files merged in the Ganga framework using a GPI script:
  – ganga merge 16,27,32-101 myAnalysis.hbook
- ROOT files merged using a standalone ROOT script (from C.Jones)
- Excellent support from S.Paterson and A.Tsaregorodtsev for DIRAC problems/queries, and from M.Bargiotti for LCG catalogue problems
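To make the manual retrieval step concrete, here is a small Python sketch that wraps the lcg-cp command named above; the LFN and the local destination are invented placeholders.

    # Sketch of fetching a large output file that DIRAC has placed on
    # CASTOR, by wrapping the lcg-cp command. The LFN and destination
    # path below are invented placeholders.
    import subprocess

    def fetch_output(lfn, local_path, vo='lhcb'):
        """Copy a Grid file to local disk, given its logical file name."""
        cmd = ['lcg-cp', '--vo', vo,
               'lfn:' + lfn,             # logical file name given by DIRAC
               'file:' + local_path]     # local destination
        rc = subprocess.call(cmd)
        if rc != 0:
            raise RuntimeError('lcg-cp failed with exit code %d' % rc)

    fetch_output('/grid/lhcb/user/c/cristina/ntuples/myAnalysis.root',
                 '/data/lhcb/myAnalysis.root')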

Slide 9: Example plots from jobs run on the distributed-analysis system
- J.Storey: Flavour tagging with protons; analysis run on 100k Bs → J/ψ φ tagHLT events
- C.Lazzeroni: Evaluation of background for B+ → D0(K0 π+ π-)K+; analysis run on 400k B+ → D0(K0 π+ π-)K*0 events
⇒ Results presented at the CP Measurements WG meeting, 16 March 2006

Slide 10: Job statistics (1)

    DIRAC job state    output ready   stalled   failed   other   all
    Number of jobs     2546           95        209      42      2892

- Statistics taken from the DIRAC monitoring page for analysis jobs submitted from Cambridge (user ids: cristina, deltufo, kelly, lum, martad, storey, willans) between 20 February 2006 (the week after CHEP06) and 26 April 2006
- Estimated success rate: output ready/all = 2546/2892 = 88%
- An individual job typically processes 20-30 files of 500 events each
  – Estimated number of events successfully processed: 25 × 500 × 2546 = 3.18 × 10^7 (checked in the sketch below)
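The two estimates are easy to reproduce, taking 25 files/job and 500 events/file as representative figures:

    # Reproduce the estimates above: 2546 successful jobs out of 2892,
    # each processing ~25 files of 500 events.
    jobs_done, jobs_all = 2546, 2892
    print('success rate: %.0f%%' % (100.0 * jobs_done / jobs_all))  # 88%
    print('events processed: %.2e' % (25 * 500 * jobs_done))        # 3.18e+07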

Slide 11: Job statistics (2)
- Stalled jobs: 95/2892 = 3.3%
  – Proxy expires before the job completes; problem essentially eliminated by having Ganga create a proxy with a long lifetime
  – Problems staging data?
- Failed jobs: 209/2892 = 7.2%
  – 73 failures where the input data was listed in the bookkeeping database (and physically at CERN), but not in the LCG file catalogue; once the files were registered by M.Bargiotti, the jobs ran successfully
  – 115 failures between 7 and 20 April because of a transient problem with the DIRAC installation of software (associated with the upgrade to v2r10)
⇒ Excluding the above failures, the job success rate is 2546/(2892 - 73 - 115) = 2546/2704 = 94%

Slide 12: Areas for improvement (1)
- More helpful messages when something goes wrong
  – Ganga error messages are fairly unintelligible to users, and rarely explain how to fix a known problem (e.g. a spurious LOCK file)
  – Difficult for a user to tell whether they have done something wrong, or whether there is a problem with the system
- Robustness of the Workload Management System
  – A single server being down or unreachable can halt the entire system (luckily this doesn't happen often!)
  – Add redundancies (e.g. job-server mirrors)?
- Control over what happens to output
  – Need a way of automatically doing what the user wants with larger files: they shouldn't just end up on CASTOR

Slide 13: Areas for improvement (2)
- Configuration possibilities
  – Some things that a site manager (and some users) may want to change are only possible by hacking the code, e.g.:
    – Load only the backend plugins relevant to the site
    – Customise the information displayed when jobs are listed
- User support structure
  – Users submitting to the Grid don't make a strong distinction between Ganga and DIRAC (we have a seamless system!), so it is better to have a single point of entry for problems/queries
- Obtaining a Grid certificate and registering with the Virtual Organisation
  – Current procedure very convoluted and drawn out; any improvements much appreciated

Slide 14: Conclusions
- LHCb distributed-analysis system is being successfully used for physics studies
- Ganga makes the system easy to use
- DIRAC ensures the system has high efficiency
- Extended periods with job success rate >94%
- More than 30 million events processed in the past two months
- This isn't the finished product, but it is already a useful tool
  – No need to keep it a secret!
He did say 30 million!

