CMS Data Challenge
M. Paganoni, Univ. Milano-Bicocca and INFN
Contributors
- CNAF: William Bacchi, Daniele Bonacorsi, Paolo Capiluppi, Giuseppe Codispoti, Alessandra Fanfani
- T2 Roma: Luciano Barone, Pietro Govoni, Martina Malberti, Paolo Meridiani, Giovanni Organtini, Shahram Rahatlou, Francesco Safai Tehrani
- T2 LNL: Massimo Biasotto, Federica Fanzago, Ugo Gasparini, Martino Margoni, Gaetano Maron, Ezio Torassa
- Pisa: Filippo Ambroglini, Giuseppe Bagliesi, Paolo Bartalini, Tommaso Boccali, Federico Calzolari, Livio Fanò
- Bari: Nicola De Filippis, Giacinto Donvito, Giorgio Maggi
Computing Software Analysis Challenge 2006 (CSA06)
- A 50 million event exercise to test the workflow and dataflow as defined in the CMS computing model
- A test at 25% of the capacity needed in 2008
Main components:
- Preparation of large MC simulated datasets (some with HLT tags)
- Prompt reconstruction at Tier-0:
  - Reconstruction at 40 Hz (over 150 Hz) using CMSSW
  - Application of calibration constants from offline DB
  - Generation of Reco, AOD, and AlCaReco datasets
  - Splitting of an HLT-tagged sample into 10 streams
- Distribution of all AOD and some FEVT to all participating Tier-1s
- Calibration jobs on AlCaReco datasets at some Tier-1s and CAF
- Re-reconstruction performed at Tier-1s
- Skim jobs at some Tier-1s with data propagated to Tier-2s
- Physics jobs at Tier-2s and Tier-1s on AOD and Reco
Italian contribution
Official Timeline
- June 1: computing systems ready for Service Challenge SC4
- June 15: physics simulation validation complete
- July 1: start MC production
- Aug. 15: calibration, alignment, HLT, reconstruction, and analysis tools ready
- Aug. 30: 50 Mevt produced, 5M with HLT pre-processing
- Sep. 1: computing systems ready for CSA
- Sep. 15: start CSA06
- Oct. 1: start smooth operation for CSA06
- Oct. 30: end smooth operation for CSA06
- Nov. 15: finish CSA06
Success metrics
Item                                      Goal          Threshold                       Result
# Tier1                                   7             5                               -
# Tier2                                   20            15                              24
Weeks of sustained rate                   4             2                               -
Tier0 efficiency                          80 %          30 %                            100 %
Running jobs per day (2h) Tier1+Tier2     50k           30k                             -
Grid job efficiency                       90 %          75 %                            95 %
Data serving (storage to CPU)             1 MB/s/slot   300 MB/s (T1), 100 MB/s (T2)    OK
Data transfer Tier0 - all Tier1 (tape)    150 MB/s      75 MB/s                         550 MB/s
Data transfer Tier1 - Tier2               20 MB/s       5 MB/s                          -
("-" = no result given on the slide)
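The goal/threshold/result comparison in the table can be expressed as a small check. The sketch below is illustrative only: the function name `classify` and the dictionary layout are assumptions, and only the rows that carry a numeric result on the slide are included.

```python
def classify(result: float, goal: float, threshold: float) -> str:
    """Classify a higher-is-better metric against its goal and threshold."""
    if result >= goal:
        return "goal met"
    if result >= threshold:
        return "threshold met"
    return "below threshold"


# (goal, threshold, result) copied from the slide; rows without a numeric
# result are left out on purpose.
METRICS = {
    "Tier-2 sites": (20, 15, 24),
    "Tier-0 efficiency (%)": (80, 30, 100),
    "Grid job efficiency (%)": (90, 75, 95),
    "Tier-0 to Tier-1 transfer (MB/s)": (150, 75, 550),
}

STATUS = {
    name: classify(result, goal, threshold)
    for name, (goal, threshold, result) in METRICS.items()
}

for name, status in STATUS.items():
    print(f"{name}: {status}")
```

With the numbers quoted on the slide, every listed metric lands in the "goal met" class.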
Computing resources
- Tier-0 (CERN): 1.4M SI2K (~1400 CPUs at CERN), 240 TB
- Tier-1 (7 sites): 2500 CPUs in total; 70 TB disk + tape as minimum to participate
- Tier-2 (25 sites): 2400 CPUs in total; average 10 TB disk at participating Tier-2
MC production software and tools
- ProdAgent tool used to automate the production; it consists of many agents running in parallel: JobCreator, JobSubmitter, JobTracking, MergeSensor...
- Output files are registered in the Data Bookkeeping Service (DBS); blocks of files are registered in the Data Location System (DLS); a local catalogue is used to map LFNs to local PFNs
- Files are merged for optimum size before transfer to CERN
- CMSSW installed via grid tools or directly by site admins at remote sites
- Storage management deployed: CASTOR, dCache, DPM
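The LFN to PFN mapping done by the local catalogue can be pictured as a prefix rewrite. This is a minimal sketch under stated assumptions: the rule table, the `example.site` paths and the function `lfn_to_pfn` are hypothetical and do not reproduce any real site catalogue or CMS tool.

```python
# Hypothetical site rules: (LFN prefix, local PFN prefix).
# The first matching rule wins, so more specific prefixes come first.
RULES = [
    ("/store/mc/", "/pnfs/example.site/cms/mc/"),
    ("/store/", "/pnfs/example.site/cms/"),
]


def lfn_to_pfn(lfn: str, rules=RULES) -> str:
    """Map a logical file name to a site-local physical file name."""
    for lfn_prefix, pfn_prefix in rules:
        if lfn.startswith(lfn_prefix):
            return pfn_prefix + lfn[len(lfn_prefix):]
    raise KeyError(f"no catalogue rule matches {lfn!r}")


print(lfn_to_pfn("/store/mc/minbias/file1.root"))
```

Keeping the rules local to each site is what lets the same LFN be used everywhere while the physical path differs per storage system.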
MC pre-production
4 production teams active:
- 1 for OSG:
  - Ajit Mohapatra, Wisconsin (taking care of 7 OSG CMS Tier-2s)
- 3 for LCG:
  - Jose Hernandez, Madrid (Spain, France, Belgium, CERN)
  - Carsten Hof, Aachen (Germany, Estonia, Taiwan, Russia, Switzerland, FNAL)
  - Nicola De Filippis, Bari (Italy, UK, Hungary)
Large participation of CMS T1s and T2s
Monitoring of production via web interface
The first monitoring prototype was developed by the Bari team
Monitoring of MinBias (1) Maximum rate per day: 1.15 M
Monitoring of MinBias (2)
[Plots per site: CNAF, LNL, Roma, Bari, Pisa]
Most of the failures at CNAF were related to stage-out and stage-in problems with CASTOR2
Dataset statistics
Total: ~66 M events, no pileup; total FEVT: O(150) TB
1. Minimum bias (40M)
2. Z→μμ (2M)
3. W→eν (4M)
4. t-tbar (6M) [all decays]
5. Electroweak soup (5M): W→lν + Drell-Yan (m > 15 GeV) + WW + H→WW
6. HLT soup (5M): 10 effective MC HLT triggers; W (leptons) + Drell-Yan (leptons) + t-tbar (all modes) + dijets
7. Jet calibration soup (1M): dijet + Z+jet
8. Soft muon soup (2M): inclusive muons in minbias + J/ψ production
9. Exotics soup (1M): LM1 SUSY, Z' (700 GeV), and excited quark (2000 GeV) [all decays] for calibration
12 M events produced by the LCG(3) team
Efficiency and problems
- Overall efficiency: 88% (probability for a job to end successfully once it is submitted)
- Grid efficiency: 95% (aborted jobs: jobs not submitted because requirements were not met (merge jobs), or jobs that failed after submission for Grid infrastructure reasons)
Problems:
- Stage-out was the main cause of job failures. More robust checks were implemented, with more attempts to stage, a fallback strategy, etc.
- Merge jobs typically overloaded the storage system because of the high rate of read access; CASTOR2 at CNAF was tuned to cope with the needs of the production (D. Bonacorsi and CNAF admins)
- Site validation: storage, software tag, software mount points, matching of CE
- Consistency between fileblocks/files in DBS/DLS and the reality at sites
Support from the Italian Tier-1 and Tier-2s was very effective, also in August
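The "more attempts to stage, a fallback strategy" idea can be sketched as a retry loop over an ordered list of storage targets. This is not the ProdAgent implementation: `stage_out`, the callables and the attempt count are hypothetical names chosen for illustration.

```python
from typing import Callable, Sequence


def stage_out(copy_fns: Sequence[Callable[[str], bool]], path: str,
              attempts: int = 3) -> int:
    """Try each storage target in order, up to `attempts` times each.

    Returns the index of the target that accepted the file and raises
    RuntimeError if every target fails on every attempt.
    """
    for index, copy in enumerate(copy_fns):
        for _ in range(attempts):
            try:
                if copy(path):
                    return index
            except OSError:
                pass  # an I/O error counts as one failed attempt
    raise RuntimeError(f"stage-out of {path!r} failed on all targets")


# Toy targets: the primary storage always fails, the fallback succeeds.
calls = {"primary": 0, "fallback": 0}


def primary(path: str) -> bool:
    calls["primary"] += 1
    return False


def fallback(path: str) -> bool:
    calls["fallback"] += 1
    return True


used = stage_out([primary, fallback], "output.root")
print(used, calls)
```

In this toy run the primary target is tried three times before the fallback takes the file on its first attempt.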
Tier-0 tasks in CSA06
Reconstruction with CMSSW_1_0_x (x ≤ 6)
- All main reconstruction components included
  - Detector-specific local reconstruction and clustering
  - Tracking (only 1 algorithm used), vertexing, standalone μ, jets
  - Global μ (with tracker), electrons, photons, b and tau tagging
- Reconstruction time small: 4.5 s/ev for minimum bias, 20 s/ev for ttbar
  - Computing model assumes 25 s/ev
Calibration/Alignment
- Ability to pull in constants from Offline DB included for ECAL, Tracker, and Muon reconstruction
- Direct access to Oracle or via Frontier cache
Tier-0 operations
4 weeks uptime (goal), 207M events processed
- 2 Oct.: operations at Tier-0 started
- First week mostly minbias (with some EWK) using CMSSW102 while bugs were fixed to improve robustness on signal samples
- Second week processing included signal samples at rates generally matched to T1 bandwidth metrics and using CMSSW103
- After having run for about 23 days, 120M events at 100% uptime, decided to increase scale for the last days
- Reprocessed all signal samples in ~5 days using CMSSW106 and maximum CPU usage
Performance:
- 160 Hz processing rate, peaking at 300 Hz
- 1250 CPUs for prompt reconstruction
- 150 CPUs for AOD and AlCaReco production (separate step)
- All constants pulled from Frontier, i.e. full complexity of the CSA exercise
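The relation between processing rate, per-event reconstruction time and CPU count is simple arithmetic. The sketch below uses the numbers quoted on the Tier-0 slides and assumes one event per CPU at a time with all CPUs fully busy, which is an idealisation; the function names are illustrative.

```python
import math


def cpus_needed(rate_hz: float, seconds_per_event: float) -> int:
    """CPUs required to sustain `rate_hz`, one event per CPU at a time."""
    return math.ceil(rate_hz * seconds_per_event)


def seconds_per_event(cpus: int, rate_hz: float) -> float:
    """Average CPU time per event implied by a CPU count and a rate."""
    return cpus / rate_hz


# Computing-model assumption (25 s/ev) at the 40 Hz reconstruction rate.
print(cpus_needed(40, 25))

# 160 Hz for the two quoted reconstruction times (minimum bias, ttbar).
print(cpus_needed(160, 4.5), cpus_needed(160, 20))

# Average time per event implied by 1250 CPUs running at 160 Hz.
print(seconds_per_event(1250, 160))
```

Under these assumptions, 1250 CPUs at 160 Hz imply an average time per event between the minimum bias and ttbar figures, as expected for a mixed sample.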
Calibration/Alignment exercise at Tier-0 CAF
Calibration/alignment tasks:
- Specialized tasks to align/calibrate subsystems using start-up miscalibrated samples, e.g.
  - Align a portion of the Tracker with the HIP algorithm by using the Z→μμ sample on the central analysis facility (CAF) for prompt calibration/alignment
  - Intercalibrate ECAL crystals by phi symmetry in minbias events, π0/η, or by isolated electrons from W/Z
- Specialized reduced RECO data format (AlCaReco) to be used for the calibration/alignment stream from Tier-0
- Mechanism to write constants back into the offline DB to be used
- Re-reconstruction at Tier-1 required to test new constants
- Propose that miscalibration is applied at RECO
Tracker Alignment exercise
CSA06 misalignment scenario: TIB dets and TOB rods misaligned by applying:
- random shifts, drawn from a flat distribution of width ±100 μm, in (x,y,z) for the double-sided modules and in x (sensitive coordinate) for the single-sided ones
- random rotations, drawn from a flat distribution of width ±10 mrad, in (alpha,beta,gamma) for all the modules
[Figure: TIB double-sided det positions]
Alignment exercise:
- read the object in the DB and apply the initial misalignment
- run the iterative HIP algorithm and determine alignment constants; 1M events used and 10 iterations
- jobs running in parallel on 20 CPUs on a dedicated queue at Tier-0
- new constants inserted into the DB
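The misalignment scenario above can be sketched as drawing uniform random offsets per module. Assumptions: "flat distribution of width ±w" is read as uniform on [-w, +w], the shift is kept in the slide's length unit without conversion, and the function `misalign` and its dictionary layout are hypothetical rather than the CMSSW alignment interface.

```python
import random

SHIFT_HALF_WIDTH = 100.0        # in the length unit quoted on the slide
ROT_HALF_WIDTH_MRAD = 10.0      # milliradians


def misalign(double_sided: bool, rng: random.Random) -> dict:
    """Draw one set of random shifts and rotations for a module."""
    s, r = SHIFT_HALF_WIDTH, ROT_HALF_WIDTH_MRAD
    dx = rng.uniform(-s, s)  # x is the sensitive coordinate
    if double_sided:
        dy, dz = rng.uniform(-s, s), rng.uniform(-s, s)
    else:
        dy, dz = 0.0, 0.0    # single-sided modules are shifted in x only
    return {
        "dx": dx, "dy": dy, "dz": dz,
        "alpha": rng.uniform(-r, r),
        "beta": rng.uniform(-r, r),
        "gamma": rng.uniform(-r, r),
    }


rng = random.Random(2006)  # fixed seed so the toy scenario is reproducible
modules = [misalign(double_sided=(i % 2 == 0), rng=rng) for i in range(1000)]
print(modules[0])
```

A fixed seed makes the same scenario reproducible, which matters when the misaligned geometry has to be compared with the constants recovered by the alignment algorithm.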
Transfer Tier-0/Tier-1s
- All 7 Tier-1s participated in the challenge, performing very well
- Some storage element software or hardware problems at individual sites
- The longest downtime at any site was about 18 hours
- Files are injected into the CMS data transfer system PhEDEx and transferred using FTS
- Highest rate from CERN was 550 MB/s
First 3 weeks, average rate per site:
Site      Rate
ASGC      14.3 MB/s
CNAF      18.0 MB/s
FNAL      47.8 MB/s
GridKa    21.7 MB/s
IN2P3     14.6 MB/s
PIC       14.4 MB/s
RAL       16.4 MB/s
Total     147 MB/s
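The per-site averages can be cross-checked against the quoted total, and converted into an approximate transferred volume. The sketch assumes decimal units (1 TB = 10^6 MB) and a full 21 days at the average rate; `volume_tb` is an illustrative helper, not part of PhEDEx.

```python
# Average rates over the first 3 weeks, in MB/s, as listed on the slide.
SITE_RATES_MB_S = {
    "ASGC": 14.3, "CNAF": 18.0, "FNAL": 47.8, "GridKa": 21.7,
    "IN2P3": 14.6, "PIC": 14.4, "RAL": 16.4,
}

total_rate = sum(SITE_RATES_MB_S.values())


def volume_tb(rate_mb_s: float, days: float) -> float:
    """Volume moved at a constant rate, in decimal TB (1 TB = 1e6 MB)."""
    return rate_mb_s * 86400 * days / 1e6


print(f"total rate: {total_rate:.1f} MB/s")
print(f"volume in 21 days: {volume_tb(total_rate, 21):.0f} TB")
```

The sum of the site rates reproduces the 147 MB/s total on the slide, just under the 150 MB/s goal and well above the 75 MB/s threshold.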
Transfer Tier-0/Tier-1s
Skimming data at Tier-1s
- To fit the data at T2s and to reduce primary datasets to manageable sizes, skim jobs had to be run at T1s to select events according to the analyses
- Skim configuration files prepared according to the RECO and AOD formats (also including some "MC truth" information)
- Organized skim jobs ran with ProdAgent
- Different skim procedures prepared by the users for running on the same dataset were unified in a single skim job producing different streams
- 10 filters prepared by the Italian groups for the planned analyses
- 4 teams for running skim jobs at Tier-1s:
  - N. De Filippis: Electroweak soup (RAL, CNAF, ASGC, IN2P3)
  - D. Mason: Jets (FNAL)
  - C. Hof: TTbar (FZK and FNAL)
  - J. Hernandez: Zmumu (PIC and CNAF)
- Skim job output files shipped to Tier-2s for end-user analyses
9 Oct.: T1 skim jobs started
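Unifying several skim procedures into one job that produces different streams amounts to reading each event once and testing it against every filter. The sketch below illustrates that pattern only; the event fields, filter names and `run_skim` are hypothetical and unrelated to the actual CMSSW skim configuration.

```python
from typing import Callable, Dict, Iterable, List

Event = dict  # toy event: a plain dictionary of quantities


def run_skim(events: Iterable[Event],
             filters: Dict[str, Callable[[Event], bool]]) -> Dict[str, List[Event]]:
    """Single pass over the input; each filter feeds its own output stream."""
    streams: Dict[str, List[Event]] = {name: [] for name in filters}
    for event in events:  # every event is read exactly once
        for name, accept in filters.items():
            if accept(event):
                streams[name].append(event)
    return streams


# Hypothetical selections, for illustration only.
FILTERS = {
    "dimuon": lambda e: e["n_muons"] >= 2,
    "single_electron": lambda e: e["n_electrons"] >= 1,
    "any_lepton": lambda e: e["n_muons"] + e["n_electrons"] >= 1,
}

EVENTS = [
    {"id": 1, "n_muons": 2, "n_electrons": 0},
    {"id": 2, "n_muons": 0, "n_electrons": 1},
    {"id": 3, "n_muons": 1, "n_electrons": 1},
    {"id": 4, "n_muons": 0, "n_electrons": 0},
]

STREAMS = run_skim(EVENTS, FILTERS)
for name, selected in STREAMS.items():
    print(name, [e["id"] for e in selected])
```

An event can land in more than one stream; the gain is that the input dataset is read from storage once instead of once per skim.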
Monitoring of skim jobs at Tier-1s
Transfer of skim outputs from Tier-1s to Tier-2s
Problems related to:
- wrong configuration of Tier-2 sites
- wrong setup of download agents with FTS
- CNAF-related problems (FTS server, CASTOR)
Total transfer Tier-0 to Tier-1s and Tier-2s Exceeded 1PB in 1 month!
Analyses at Tier-2s (1)
Analyses at Tier-2s (2)
All INFN Tier-2s took part in the last step of CSA06: the physics analyses starting from the output of the skim procedures
- LNL: W→μν selection
- Pisa: tau validation; minimum bias / underlying event
- Rome: electron reco
- Bari: tracker misalignment
Analysis at Rome
Three analyses with goals:
- to study the electron reconstruction in Z→ee events (Meridiani)
- to measure the W mass in W→eν events (Tabarelli, Malberti)
- to run a simple calibration with W→eν events (Govoni)
Electron and Z mass reconstruction using the hybrid supercluster
[Plots: efficiency vs pT, efficiency vs η, mZ]
Analysis at Pisa (1)
The general idea is to simulate an "early data taking" activity of the τ group:
- study the tau tag efficiency from Z→ττ events
- study the misidentification with the recoiling jet with Z+jet, Z→μμ events
In addition: run the τ validation package on skimmed events, pure di-tau sample and ttbar sample (S. Gennai, G. Bagliesi)
[Plot: isolation efficiency vs isolation cone; pT of the jet]
Analysis at Pisa (2)
Study of minimum bias/underlying event (Fanò, Ambroglini, Bartalini)
- Monte Carlo tuning for LHC
- Pileup understanding
- UE contribution measurements in MB events
[Plots: UE, MinBias]
Analysis at LNL
Goal: to study the W→μν preselection with different Monte Carlo data samples
Two data samples were considered (Torassa, Margoni, Gasparini):
(1) the electroweak soup (3.4 M evts, 50% W→μν and 50% DY)
(2) the soft muons (1.8 M evts, 50% minimum bias and 50% J/ψ, pT(μ) > 4 GeV)
[Plot: EWK soup, GlobalMuon reconstructor]
Analysis at Bari
Goals: to study the effect of tracker misalignment on track reconstruction performance (De Filippis):
- with the perfect tracker geometry
- in the short-term and long-term misalignment scenarios, by reading misalignment positions and errors via frontier/squid from the offline database ORCAOFF
- by using the tracker module positions and errors obtained as output of the alignment process that will be run at the CERN T0
Data samples used: Z→μμ and ttbar (the second for computing the fake rate)
Analysis jobs at Bari
- CRAB_1_4_0 used to submit 1.8k jobs
- grid efficiency = 99 %, application efficiency = 94 %
- Bunches of 150 jobs run in different time slots
- max 45 jobs run in parallel
- the squid configuration was tuned to ensure that the alignment data were read from the local squid cache via the frontier client rather than from CERN (blue histogram)
frontier/squid works as expected at the Bari Tier-2 when accessing alignment data
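The benefit of reading alignment data through a local cache rather than from CERN can be illustrated with a generic read-through cache. This is a conceptual sketch only: `ReadThroughCache`, the payload key and the fetch function are hypothetical and do not represent the Frontier client or squid configuration.

```python
from typing import Callable, Dict


class ReadThroughCache:
    """Serve repeated requests locally; go to the remote source on a miss."""

    def __init__(self, fetch_remote: Callable[[str], bytes]) -> None:
        self._fetch_remote = fetch_remote
        self._store: Dict[str, bytes] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> bytes:
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = self._fetch_remote(key)  # only misses reach the remote side
        self._store[key] = value
        return value


remote_calls = []


def fetch_from_central_db(key: str) -> bytes:
    """Stand-in for a query to the central database (hypothetical)."""
    remote_calls.append(key)
    return b"alignment-payload"


cache = ReadThroughCache(fetch_from_central_db)

# 45 jobs, the maximum run in parallel, all ask for the same payload.
for _ in range(45):
    cache.get("tracker_alignment")

print(cache.hits, cache.misses, len(remote_calls))
```

Only the first request travels to the central database; the remaining 44 are answered locally, which is the behaviour the tuning of the site cache was meant to ensure.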
Re-reconstruction at Tier-1s
The last step of CSA06: re-reconstruction at Tier-1s
Goals:
- to demonstrate re-reconstruction from some RAW data at Tier-1s as part of the calibration exercise
Status:
- access to the Offline database via frontier working
- re-reconstruction demonstrated at ASGC, FNAL, IN2P3, PIC and CNAF
- running at RAL and further tests at CNAF
[Plot: PIC]
Tier2 - LNL
- Import: 200 TB transferred, rate 20-50 MB/s
- Export: 60 TB transferred, rate 5-20 MB/s
[Plot: SC4 challenge (Jun-Sep 06); fake analysis jobs, MC production, user analysis; Jun-Sep 2006 (~50K jobs)]
Tier2 - LNL (2) Total number of jobs running (last 6 months) CMS CSA06 CMS MC production
Tier2 - Roma
New CPUs installed 12/10/06
Pisa
[Plot: jobs per day; fake analysis jobs, MC production]
Bari
Conclusions
- CSA06 was meant to be a challenge to commission the computing/software/analysis system, but in some cases it also required development and deployment of the tools
- CSA06 was successful at INFN (all the steps were executed), but only thanks to the 100% effort of a few experts and to the coordinated effort of many people at Tier-1 and Tier-2 sites
- The CSA06 analysis exercises need to be the ramp-up for the physics program/organization in Italy
- CSA07 should cope with both simulated and real data and focus on start-up operations (calibration and alignment) and analysis preparation
Re-profiling of the resource plan
Requests 09/2006
- 1 TB = 1.5 kEuro
- 1 box = 8 kSI2K = 3 kEuro
The total requests (including SJ) would bring the total over the 4 computing centres to a factor 2 below the initial plan, to account for the re-profiling of the LHC schedule (248 TB versus 530 TB and 1014 kSI2K versus 1600 kSI2K)