ATLAS DC2 Pile-up Jobs on LCG Atlas DC Meeting February 2005.


1 chudoba@fzu.cz ATLAS DC2 Pile-up Jobs on LCG Atlas DC Meeting February 2005

2 Pile-up tasks
Jobs defined in 3 tasks:
– 210 dc2.003002.lumi10.A2_z_mumu.task
– 307 dc2.003026.lumi10.A0_top.task
– 308 dc2.003004.lumi10.A3_z_tautau.task
Input files with min. bias were distributed to selected sites using DQ, 700 GB
Each job used 8 input files with min. bias (~250 MB each), downloaded from the close SE, and 1 input file with signal
1 GB RAM per job required

3 Five sites involved
Number of jobs per site (figure):
– golias25.farm.particle.cz:2119/jobmanager-lcgpbs-lcgatlasprod
– lcg00125.grid.sinica.edu.tw:2119/jobmanager-lcgpbs-infinite
– lcgce01.triumf.ca:2119/jobmanager-lcgpbs-atlas
– lcgce02.ifae.es:2119/jobmanager-lcgpbs-atlas
– t2-ce-01.roma1.infn.it:2119/jobmanager-lcgpbs-infinite


7 Status
JOBSTATUS   NJOBS
failed       3702
finished     5703
pending       323
running        64
TASK    DONE   %DONE     ALL
210     2003      20   10050
307     1793      22    8000
308     1886      19   10000
Total   5682
21 jobs have JOBSTATUS finished and CURRENTSTATE ABORTED, probably initial tests (ENDTIME = 23-SEP-04, 30-SEP-04 and 07-OCT-04)
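The per-task percentages and the 21-job discrepancy on this slide can be cross-checked with a few lines of arithmetic. This is only a sketch using the numbers quoted on the slide; the variable names are illustrative and are not proddb column names.

```python
# Per-task counts quoted on the slide: task id -> (DONE, ALL).
tasks = {210: (2003, 10050), 307: (1793, 8000), 308: (1886, 10000)}

# Percentage done per task, rounded to a whole percent as on the slide.
pct_done = {task: round(100 * done / total) for task, (done, total) in tasks.items()}

# Sum of the DONE column.
total_done = sum(done for done, _ in tasks.values())

# Jobs with JOBSTATUS "finished" according to the status table.
finished_in_proddb = 5703

# Finished jobs that are not counted as done in any task.
finished_but_not_done = finished_in_proddb - total_done

print(pct_done, total_done, finished_but_not_done)
```

The difference between the finished count and the summed DONE column matches the 21 jobs reported as finished but ABORTED.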

8 Why such big differences in efficiency?
PRAGUE: 48%   TW: 70%
Jobs per attempt, Prague:
ATTEMPT   NJOBS
1          2442
2           466
3           244
4           291
5           130
6            71
7            66
8            52
9            48
10           26
11            7
Jobs per attempt, TW:
ATTEMPT   NJOBS
1          2662
2           361
3           184
Jobs with Attempt = 1:
          All   Good   Failed   Eff %
TW       2662   2074      588      78
Prague   2442   1368     1074      56
Other differences:
– RB on TW
– lexor running on UI on TW
– many signal files stored on SE on TW
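The first-attempt efficiencies on this slide follow directly from the All and Good counts. The sketch below recomputes them and, as an extra illustration, the share of all submissions that were retries. It assumes that the attempt table starting with 2442 belongs to Prague and the one starting with 2662 to TW, which is inferred from the matching Attempt = 1 counts and is not stated on the slide.

```python
# First-attempt counts quoted on the slide: site -> (All, Good).
first_attempt = {"TW": (2662, 2074), "Prague": (2442, 1368)}


def efficiency(all_jobs, good):
    """Return (failed jobs, efficiency in whole percent)."""
    return all_jobs - good, round(100 * good / all_jobs)


summary = {site: efficiency(*counts) for site, counts in first_attempt.items()}

# ASSUMPTION: site attribution of the two ATTEMPT/NJOBS tables is inferred
# from the Attempt = 1 counts (2442 for Prague, 2662 for TW).
attempts = {
    "Prague": [2442, 466, 244, 291, 130, 71, 66, 52, 48, 26, 7],
    "TW": [2662, 361, 184],
}

# Share of submissions that were retries (attempt number > 1), in percent.
retry_share = {site: round(100 * sum(n[1:]) / sum(n)) for site, n in attempts.items()}

print(summary, retry_share)
```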

9 Failures
Not easy to get cause of failure from proddb
– VALIDATIONDIAGNOSTIC quite difficult to parse by script:
  t2-wn-36.roma1.infn.it 1 0m2.360s STAGE-IN failed:
  WARNING: No FILE or RFIO access for existing replicas
  WARNING: Replication of sfn://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._01477.pool.root.1 to close SE failed: Error in replicating PFN sfn://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._01477.pool.root.1 to t2-se-01.roma1.infn.it: lcg_aa: File exists
  lcg_aa: File exists
  Giving up after attempting replication TWICE.
  WARNING: Could not stage input file sfn://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._01477.pool.root.1: Gridftp copy failed from gsiftp://castorftp.cnaf.infn.it/castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._01477.pool.root.1 to file:/home/atlassgm/globus-tmp.t2-wn-36.17931.0/WMS_t2-wn-36_018404_https_3a_2f_2flcg00124.grid.sinica.edu.tw_3a9000_2fKv9HpVIUkMLTBBe-Ia3xLA/dc2.003002.simul.A2_z_mumu._01477.pool.root: the server sent an error response: 550 550 /castor/cnaf.infn.it/grid/lcg/atlas/datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._01477.pool.root.1: Invalid argument.
  EDGFileCatalog: level[Always] Disconnected
  No log for stageout phase
– mw failures: Job RetryCount (0) hit
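Since the free-text diagnostics are hard to parse in full, one pragmatic option is to sort them into coarse categories by looking for a few marker strings. The following is a minimal sketch under that assumption: the marker strings are taken from the messages quoted in this presentation, while the category labels and the `classify` function are hypothetical and not part of proddb or lexor.

```python
import re

# (label, pattern) pairs, checked in order. Labels are hypothetical;
# patterns come from messages quoted in the presentation.
# "athena-crash" is listed before "transformation" because crash reports
# also contain the words "Transformation error".
PATTERNS = [
    ("stage-in", re.compile(r"STAGE-IN failed")),
    ("athena-crash", re.compile(r"\[SOFTWARE\]\s*AthenaCrash")),
    ("transformation", re.compile(r"Transformation error")),
    ("middleware", re.compile(r"Job RetryCount \(\d+\) hit")),
]


def classify(diagnostic: str) -> str:
    """Return the first matching category label, or "unknown"."""
    for label, pattern in PATTERNS:
        if pattern.search(diagnostic):
            return label
    return "unknown"


print(classify("t2-wn-36.roma1.infn.it 1 0m2.360s STAGE-IN failed: WARNING: ..."))
```

A first-match rule keeps the sketch simple; counting the labels per site or per task would then give a rough failure breakdown without parsing the full message.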

10 Some jobs with many attempts
JOBDEFINITIONID=459795
– Attempt 1: 09-NOV-04 t2-wn-42.roma1.infn.it 1 0m43.250s Transformation error:
  -------- Problem report -------
  [Unknown Problem] AthenaPoolConve... ERROR (PersistencySvc) pool::PersistencySvc::UserDatabase::connectForRead: FID is not existing in the catalog
  ================================
  -------- Problem report -------
  [Unknown Problem] PileUpEventLoopMgr WARNING Original event selector has no events
  ================================
  No log for stageout phase
– ...
– Attempt 11: 15-DEC-04 goliasx76.farm.particle.cz 1 0m41.460s Transformation error:
  -------- Problem report -------
  [Unknown Problem] AthenaPoolConve... ERROR (PersistencySvc) pool::PersistencySvc::UserDatabase::connectForRead: FID is not existing in the catalog
  ================================
  -------- Problem report -------
  [Unknown Problem] PileUpEventLoopMgr WARNING Original event selector has no events
  ================================
  No log for stageout phase

11 JOBDEFINITIONID=456843
– Attempt 1: t2-wn-37.roma1.infn.it 1 0m2.830s STAGE-IN failed:
  WARNING: No FILE or RFIO access for existing replicas
  WARNING: Replication of srm://lcgads01.gridpp.rl.ac.uk//datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._02629.pool.root.6 to close SE failed: Error in replicating PFN srm://lcgads01.gridpp.rl.ac.uk//datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._02629.pool.root.6 to t2-se-01.roma1.infn.it: lcg_aa: File exists
  lcg_aa: File exists
  Giving up after attempting replication TWICE.
  WARNING: Could not stage input file srm://lcgads01.gridpp.rl.ac.uk//datafiles/dc2/simul/dc2.003002.simul.A2_z_mumu/dc2.003002.simul.A2_z_mumu._02629.pool.root.6: Get TURL failed: lcg_gt: Communication error on send
  EDGFileCatalog: level[Always] Disconnected
  No log for stageout phase
– Attempt 2: lcg00172.grid.sinica.edu.tw 2 0m23.660s Transformation error:
  -------- Problem report -------
  [SOFTWARE] AthenaCrash
  ================================
  No log for stageout phase
– ...
– Attempt 9: goliasx44.farm.particle.cz 2 0m23.340s Transformation error:
  -------- Problem report -------
  [SOFTWARE] AthenaCrash
  ================================
  No log for stageout phase

12 JOBDEFINITIONID=504139
– Attempt 1: t2-wn-48.roma1.infn.it 2 66m58.650s Transformation error:
  -------- Problem report -------
  [SOFTWARE] AthenaCrash
  ================================
  No log for stageout phase
– Attempt 2: lcg00144.grid.sinica.edu.tw 2 66m56.800s Transformation error:
  -------- Problem report -------
  [SOFTWARE] AthenaCrash
  ================================
  No log for stageout phase
– the same up to attempt 5
– Attempt 6: mw failure
– Attempt 7: goliasx60.farm.particle.cz 0 152m53.780s ???
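Jobs that fail with the same error at different sites are unlikely to be helped by further retries. A possible way to flag them is sketched below; the record layout `(job id, attempt, site, signature)` and the short signature strings are assumptions made for illustration, with the sample rows taken from the attempts quoted for jobs 459795 and 456843.

```python
from collections import defaultdict


def repeated_failures(attempts, min_repeats=2):
    """Flag jobs whose failure signature repeats at more than one site.

    attempts: iterable of (job_id, attempt_no, site, signature) tuples.
    Returns {job_id: (signature, number of occurrences, sorted sites)}.
    """
    grouped = defaultdict(list)
    for job_id, attempt_no, site, signature in attempts:
        grouped[(job_id, signature)].append((attempt_no, site))

    flagged = {}
    for (job_id, signature), rows in grouped.items():
        sites = sorted({site for _, site in rows})
        if len(rows) >= min_repeats and len(sites) > 1:
            flagged[job_id] = (signature, len(rows), sites)
    return flagged


# Sample rows based on the attempts quoted in the presentation;
# the signature strings are shorthand chosen for this sketch.
sample = [
    (459795, 1, "roma1.infn.it", "FID not in catalog"),
    (459795, 11, "farm.particle.cz", "FID not in catalog"),
    (456843, 1, "roma1.infn.it", "STAGE-IN failed"),
    (456843, 2, "grid.sinica.edu.tw", "AthenaCrash"),
    (456843, 9, "farm.particle.cz", "AthenaCrash"),
]

flagged = repeated_failures(sample)
print(flagged)
```

Requiring more than one site is a deliberate choice in this sketch: a failure that repeats only at one site may still be a site problem rather than a problem with the job definition or its input.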

13 Jobs properties
No exact relation between a job in the Oracle DB and an entry in the PBS log file
STARTTIME and ENDTIME are just hints
Some jobs on golias:
– 1232 finished jobs in December registered in proddb
– 1299 selected jobs from PBS logs in December, cuts on CPU time and virtual memory values
Nodes: 3.06 GHz Xeon, 2 GB RAM
Histos based on information from PBS log files
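A selection of jobs from PBS logs with cuts on CPU time and virtual memory could look roughly like the sketch below. It assumes the usual PBS/Torque accounting record layout (`date;type;jobid;key=value ...` with `resources_used.*` fields on `E` records); the sample line and the cut values are invented for illustration, since the presentation does not give the actual cuts.

```python
def hms_to_seconds(text):
    """Convert an HH:MM:SS string to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return 3600 * hours + 60 * minutes + seconds


def parse_end_record(line):
    """Parse one accounting line; return a dict for 'E' records, else None."""
    parts = line.strip().split(";", 3)
    if len(parts) != 4 or parts[1] != "E":
        return None
    fields = dict(item.split("=", 1) for item in parts[3].split() if "=" in item)
    try:
        vmem = fields["resources_used.vmem"]
        if not vmem.endswith("kb"):
            return None
        return {
            "jobid": parts[2],
            "cput_s": hms_to_seconds(fields["resources_used.cput"]),
            "wall_s": hms_to_seconds(fields["resources_used.walltime"]),
            "vmem_kb": int(vmem[:-2]),
        }
    except (KeyError, ValueError):
        return None


def select_jobs(lines, min_cput_s=3600, min_vmem_kb=500_000):
    """Keep end records passing both cuts (cut values are assumptions)."""
    records = (parse_end_record(line) for line in lines)
    return [r for r in records
            if r and r["cput_s"] >= min_cput_s and r["vmem_kb"] >= min_vmem_kb]


# Invented sample lines in the assumed accounting format.
SAMPLE = [
    "12/01/2004 10:15:32;E;12345.golias;user=atlassgm queue=lcgatlasprod "
    "resources_used.cput=18:02:11 resources_used.mem=812345kb "
    "resources_used.vmem=1012345kb resources_used.walltime=19:58:40",
    "12/01/2004 10:00:00;S;12346.golias;user=atlassgm queue=lcgatlasprod",
]

selected = select_jobs(SAMPLE)
print(selected)
```

Because the records carry no job name, such a selection can only approximate the set of pile-up jobs, which is consistent with the 1299 versus 1232 mismatch noted on the slide.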

14 Some jobs (6) ran successfully on a machine with only 1 GB RAM, but the wall time was 20 h – probably a lot of swapping


17 WN -> SE -> NFS server
WN has the same NFS mount – could it be used directly?


19 Conclusions
– no job name in the local batch system – difficult to identify
– version of the lexor executor should be in the proddb
– proddb: very slow response; these queries were done on atlassg (has snapshot of proddb from Feb 8)
– a study of log files should be done before increasing MAXATTEMPT
– proddb should be cleaned

