Published by Brendan Mills. Modified over 8 years ago.
1
First test of the PoC
2
Caveats
I am not a developer ;)
I was also a beta tester of Crab3+WMA in 2011; I restarted testing it ~2 weeks ago to have a one-to-one comparison
The first 2 weeks of the PoC test were mainly
– Finding a problem
– Communicating with the developers
– Getting a new version
– Trying again
I simply skip this part, which is ok; I speak about the results after all the fixes
3
What I tested (with both)
A complicated workflow: the official (V)H->bb analysis step1 (see https://twiki.cern.ch/twiki/bin/view/CMS/VHbbAnalysisNewCode#NtupleV42_CMSSW_5_3_3_patch2 ), which takes ~2 hours just to compile
– Indeed ISB (input sandbox) ~ 45 MB, with 56 user compiled libraries
Running on dataset /DoubleElectron/Run2012B-PromptReco-v1/AOD
– 40 LS (lumisections)/job -> ~ 1200 jobs, a couple of hours each
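As an illustration of what "40 LS/job" means: ~1200 jobs of 40 lumisections each corresponds to roughly 48,000 lumisections selected by the lumi mask. The sketch below shows the basic idea of lumi-based splitting on a toy mask. The run numbers and ranges are made up, and the real CRAB3 splitter may apply further constraints (for example file or run boundaries) that are not modelled here.

```python
from typing import Dict, List, Tuple

# A lumi mask in the usual JSON layout: run number -> list of [first, last] ranges.
LumiMask = Dict[str, List[List[int]]]


def expand_lumis(mask: LumiMask) -> List[Tuple[int, int]]:
    """Flatten a lumi mask into an ordered list of (run, lumi) pairs."""
    pairs = []
    for run in sorted(mask, key=int):
        for first, last in mask[run]:
            pairs.extend((int(run), lumi) for lumi in range(first, last + 1))
    return pairs


def split_by_lumis(mask: LumiMask, units_per_job: int) -> List[List[Tuple[int, int]]]:
    """Group the selected lumisections into jobs of at most units_per_job each."""
    if units_per_job <= 0:
        raise ValueError("units_per_job must be positive")
    pairs = expand_lumis(mask)
    return [pairs[i:i + units_per_job] for i in range(0, len(pairs), units_per_job)]


# Hypothetical toy mask (not the Lumi.json used in the test).
toy_mask = {"194050": [[1, 100]], "194051": [[1, 30], [40, 59]]}
jobs = split_by_lumis(toy_mask, units_per_job=40)
print(len(jobs), [len(job) for job in jobs])
```

With 150 lumisections in the toy mask and 40 per job, the last job is simply shorter than the others.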
4
Where I tested
CRAB3/Panda: the test is restricted to a few sites (FNAL, Pisa, DESY, …)
– The sample is indeed only at FNAL and Pisa among the PoC sites
CRAB3/WMA: 8 T2s available, some of poor quality (T2_RU_*)
Always used Pisa as storage site
5
Moreover
The PoC is not expected to provide full Crab3 functionality, just (as in the email I got)
– Submit
– Resubmit
– Kill
– Status
– Getoutput
– Getlog
So I stick to these also for Crab3/WMA (i.e. I do not do DBS publication)
6
Configs

Panda:
from WMCore.Configuration import Configuration
import os
from datetime import datetime
config = Configuration()
config.section_("General")
config.General.serverUrl = 'poc3test.cern.ch'
config.General.ufccacheUrl = 'cmsweb-testbed.cern.ch'
config.section_("JobType")
config.JobType.pluginName = 'Analysis'
config.JobType.psetName = 'patData.py'
config.section_("Data")
config.Data.inputDataset = '/DoubleElectron/Run2012B-PromptReco-v1/AOD'
config.Data.publishDataName = os.path.basename(os.path.abspath('.')) + "_tom"
config.Data.lumiMask = 'Lumi.json'
config.Data.publishDbsUrl = "https://cmsdbsprod.cern.ch:8443/cms_dbs_ph_analysis_02_writer/servlet/DBSServlet"
config.Data.splitting = 'LumiBased'
config.Data.unitsPerJob = 40
config.section_("User")
config.User.email = ''
config.section_("Site")
config.Site.storageSite = 'T2_IT_Pisa'

WMA:
from WMCore.Configuration import Configuration
import os
config = Configuration()
config.section_("General")
config.General.requestName = 'request_name2'
config.General.serverUrl = 'crab3-test.cern.ch'
config.General.ufccacheUrl = 'cmsweb.cern.ch'
config.section_("JobType")
config.JobType.pluginName = 'Analysis'
config.JobType.psetName = 'patData.py'
config.section_("Data")
config.Data.inputDataset = '/DoubleElectron/Run2012B-PromptReco-v1/AOD'
config.Data.splitting = 'LumiBased'
config.Data.unitsPerJob = 40
config.Data.lumiMask = 'Lumi.json'
config.section_("User")
config.User.email = ''
config.section_("Site")
config.Site.storageSite = 'T2_IT_Pisa'
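To see at a glance where the two configurations differ, the settings on this slide can be copied into plain dictionaries and compared. This is only a sketch for reading the slide: it does not use WMCore, the helper name `diff_configs` is hypothetical, and the computed `publishDataName` is shown as a placeholder string because it depends on the working directory.

```python
# Settings transcribed from the slide as flat "Section.parameter" keys.
common = {
    "JobType.pluginName": "Analysis",
    "JobType.psetName": "patData.py",
    "Data.inputDataset": "/DoubleElectron/Run2012B-PromptReco-v1/AOD",
    "Data.lumiMask": "Lumi.json",
    "Data.splitting": "LumiBased",
    "Data.unitsPerJob": 40,
    "User.email": "",
    "Site.storageSite": "T2_IT_Pisa",
}
panda = dict(common, **{
    "General.serverUrl": "poc3test.cern.ch",
    "General.ufccacheUrl": "cmsweb-testbed.cern.ch",
    "Data.publishDataName": "<working-dir-name>_tom",  # placeholder, computed at run time
    "Data.publishDbsUrl": "https://cmsdbsprod.cern.ch:8443/cms_dbs_ph_analysis_02_writer/servlet/DBSServlet",
})
wma = dict(common, **{
    "General.requestName": "request_name2",
    "General.serverUrl": "crab3-test.cern.ch",
    "General.ufccacheUrl": "cmsweb.cern.ch",
})


def diff_configs(a, b):
    """Return {key: (value_in_a, value_in_b)} for keys whose values differ; None means absent."""
    return {k: (a.get(k), b.get(k)) for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}


differences = diff_configs(panda, wma)
for key, (panda_value, wma_value) in differences.items():
    print(f"{key}: Panda={panda_value!r} WMA={wma_value!r}")
```

The physics-related settings (dataset, lumi mask, splitting, storage site) are identical; only the server endpoints, the request name and the publication settings differ.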
7
Soon after submit

Panda:
bash-3.2$ crab status -t crab_20121127_113729 -i
Registering user credentials
Task name: tboccali_crab_20121127_113729_121127_103859
Panda url: http://panda.cern.ch/server/pandamon/query?job=*&jobsetID=19&user=Tommaso%20Boccali
Details:
running 0.78 % (10/1279)
activated 99.22 % (1269/1279)
Information per site are not available.
Log file is /afs/cern.ch/work/b/boccalio/PoC/CMSSW_5_3_3_patch2/src/VHbbAnalysis/HbbAnalyzer/test/PoCTests/crab_20121127_113729/crab.log
-> No information per site, link to monitoring present

WMA:
bash-3.2$ crab status -t crab_request_name2 -i
Registering user credentials
Task Status: running
Using 7 site(s):
Jobs Details: submitted 100.00 % ( running 44.31 % pending 55.69 % )
T2_US_Florida: submitted 14.58 %
T2_FR_GRIF_IRFU: submitted 14.58 %
T2_RU_JINR: submitted 14.58 %
T2_UK_London_IC: submitted 12.54 %
T2_FR_GRIF_LLR: submitted 14.58 %
T2_IT_Pisa: submitted 14.58 %
T2_ES_IFCA: submitted 14.58 %
Log file is /afs/cern.ch/work/b/boccalio/PoC/CMSSW_5_3_3_patch2/src/VHbbAnalysis/HbbAnalyzer/test/Crab3Tests/crab_request_name2/crab.log
-> No link to the dashboard (?): one has to find it by hand
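The Panda status line reports both a percentage and raw job counts, so the two can be cross-checked. The sketch below parses the "Details" text as it appears on this slide; the regular expression is an assumption about that layout only and is not part of any CRAB tool.

```python
import re

# Matches entries such as "running 0.78 % (10/1279)" as printed on the slide.
STATUS_ENTRY = re.compile(
    r"(?P<state>\w+)\s+(?P<pct>\d+(?:\.\d+)?)\s*%\s*\((?P<n>\d+)/(?P<total>\d+)\)"
)

panda_details = "running 0.78 % (10/1279) activated 99.22 % (1269/1279)"


def parse_details(text):
    """Return {state: (jobs_in_state, total_jobs, reported_percent)}."""
    return {
        m["state"]: (int(m["n"]), int(m["total"]), float(m["pct"]))
        for m in STATUS_ENTRY.finditer(text)
    }


def is_consistent(parsed, tolerance=0.01):
    """Check that each reported percentage agrees with its job counts."""
    return all(abs(100.0 * n / total - pct) <= tolerance for n, total, pct in parsed.values())


parsed = parse_details(panda_details)
print(parsed, is_consistent(parsed))
```

Here 10 running plus 1269 activated jobs account for all 1279 jobs of the task.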
8
A Few Considerations
Let’s start from the obvious: with both systems I reached 100% done, with some “resubmit” (site problems)
Feature: with Panda a resubmit is a second task (with a second web page)…
Not a critical issue: you just need to get used to it
9
ASO
It worked flawlessly in both cases
Nothing more to say I guess … (I did not even need to look into the ASO monitoring)
You can get the files before ASO has operated (I guess lcg-cp is used, …)
10
Issues with Panda
Kill did not work for me; I understood it was simply a timeout that needs to be set to a different threshold, and I did not check further
11
Is resubmit working fine?
In both cases, it was for me
Caveat: the PoC enabled sites are generally good/very good. No chance to test a massive failure scenario
12
Let’s go straight to the point
Up to here, the executive summary could be: “Limiting the scenario to what the PoC is supposed to allow me to do, PANDA performs at least as well as WMA”
(again, this _after_ the two weeks of initial testing)
13
What is different
Panda Monitoring seems by far better than what we are used to
14
Dashboard/WMA… (as usual)
15
…Plus WMStats
Some debugging info added, but not that much (where is the WN name? where is the LSF id?)
16
Features we usually do not have
All the logs (pilots + stderr + stdout) are on the web
– All: not only snippets for failed jobs
– I guess ph support would love it, instead of asking users to upload logs
– Support can get all the info from the web, no need to ask the (maybe not too skilled) user
– Snippets are not ok in general: a failure can depend on a bad environment variable, which cannot be seen from the snippet alone
There is a link from the pilot to the LSF id! I considered this lost since we left gLite, and it is a MAJOR help to debug strange problems (like WNs acting as black holes)
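To make the "black hole" case concrete: once the worker node (WN) name is available for every job, a node that fails nearly every job very quickly stands out. The sketch below is a toy illustration of that idea; the record layout, node names and thresholds are all hypothetical, and it does not read any real Panda or LSF data.

```python
from collections import defaultdict


def find_black_holes(jobs, min_jobs=5, min_failure_rate=0.9, max_mean_runtime_s=120.0):
    """Return WN names that ran many jobs, failed almost all of them, and failed fast.

    jobs is an iterable of (wn_name, succeeded, runtime_seconds) tuples.
    """
    stats = defaultdict(lambda: [0, 0, 0.0])  # wn -> [total jobs, failed jobs, summed runtime]
    for wn, succeeded, runtime_s in jobs:
        entry = stats[wn]
        entry[0] += 1
        entry[1] += 0 if succeeded else 1
        entry[2] += runtime_s
    suspects = []
    for wn, (total, failed, runtime_sum) in stats.items():
        if (total >= min_jobs
                and failed / total >= min_failure_rate
                and runtime_sum / total <= max_mean_runtime_s):
            suspects.append(wn)
    return sorted(suspects)


# Made-up job records: wn01 fails everything within seconds, wn02 is healthy,
# wn03 has failures but too few jobs to draw a conclusion.
toy_jobs = (
    [("wn01", False, 30.0)] * 6
    + [("wn02", True, 7000.0)] * 6
    + [("wn03", False, 45.0)] * 2
)
print(find_black_holes(toy_jobs))
```

Without the WN name or batch id in the monitoring, the same failures would look like a generic site problem.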
17
Pilot log WN LSF id
18
Logs (full logs present, not just snippets guessed as interesting by the system)
Full logs uploaded to SE
19
Other features I liked
Panda seems user friendly when scheduling jobs: if you submit a task, even if your priority is very low, a few jobs are executed almost immediately, allowing you to spot broken workflows early
It seems I can resubmit at any time (no need to wait for the task to be in cooloff …)
– Is it because ACDC is not in the game? Is there a price we pay for this (side effects I am not aware of)?
20
Conclusions?
As said, functionally both did what was asked
– PANDA does not look behind at all
I cannot speak about what is NOT supposed to be in the PoC (which is not a small subset)
The major differences to me are
– Monitoring: way better in the PoC, with full disclosure of all the info
– The early prioritization of some jobs is a lot of help (it goes far beyond a simple python sanity check)
– You seem to be able to resubmit at any time, with no cool off needed; this potentially cuts the time to process tails