CMS Data Distribution
Tibor Kurča, Institut de Physique Nucléaire de Lyon
Journées “Grilles France”, October 16, 2009, IPN de Lyon

Outline:
- General Thoughts
- Data Volume
- Computing Model
- Data Processing Guidelines
- The Plan
- Primary & Secondary Datasets
- Skims
General Thoughts

Imperative: data must be accessible to all of CMS quickly
- the initial data distribution model should be simple
- it will later be adapted to reality

How to use the data most efficiently?
- large data volumes, mostly background
- computing resources distributed at an unprecedented level
- a very broad physics program with diverse needs

Reducing the amount of data to be processed reduces the strain on computing resources. A core set of data has to be small enough and very representative for a given analysis; most analyses don't need access to all data. Splitting the data into datasets:
- makes day-to-day jobs easily manageable
- enables full tests of analysis components before running on the full statistics
- allows prioritisation of data analysis
CMS Data Volume

Estimated rates and data volumes for 2009/10: 70 days of running at a 300 Hz physics-stream rate gives 2.3×10^9 events, assuming a 26% mean overlap between datasets.
- 3.3 PB RAW data (1.5 MB/evt): detector data plus L1 and HLT results after online formatting
- 1.1 PB RECO (0.5 MB/evt): reconstructed objects with their associated hits
- 220 TB AOD (0.1 MB/evt): the main analysis format: clusters, tracks, particle ID
Multiple re-reconstruction passes are foreseen.

Data placement:
- RAW/RECO: one copy spread across all T1s, disk1tape1
- simulated RAW/RECO: one copy spread across all T1s, on tape with a 10% disk cache
- AOD: one copy at each T1, disk1tape1
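These volumes follow from simple arithmetic. As a rough sanity check, a minimal Python calculation using only the numbers quoted on this slide (the script is illustrative, not CMS software):

```python
# Back-of-the-envelope check of the 2009/10 volume estimates above.
# All inputs (rate, live days, per-event sizes, 26% overlap) come from the
# slide; any duty-cycle assumption is implicit in the quoted 2.3e9 events.

SECONDS_PER_DAY = 86_400

rate_hz = 300       # physics-stream rate
live_days = 70      # assumed running period
overlap = 0.26      # mean overlap between datasets

events = rate_hz * live_days * SECONDS_PER_DAY * (1 + overlap)
print(f"events: {events:.2e}")          # ~2.3e9, matching the slide

sizes_mb = {"RAW": 1.5, "RECO": 0.5, "AOD": 0.1}   # MB per event
for fmt, mb in sizes_mb.items():
    petabytes = events * mb / 1e9       # 1 PB = 1e9 MB
    print(f"{fmt}: {petabytes:.2f} PB")
# RAW ~3.4 PB, RECO ~1.1 PB, AOD ~0.23 PB: consistent with the slide's
# 3.3 PB / 1.1 PB / 220 TB, up to rounding.
```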
Tier-0-1-2

- T0: prompt reconstruction (24/24), FEVT storage, data distribution
- T1: data storage; processing (re-reco, skimming, AOD extraction); raw data access; Tier-2 support; data serving to T2s
- T2: analysis, MC production, specialised support tasks, local + group use
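The division of labour can be restated as a small lookup table; a minimal sketch whose role strings merely restate the slide (the structure itself is illustrative, not CMS software):

```python
# Illustrative summary of the tiered responsibilities described above.
TIER_ROLES = {
    "T0": ["prompt reconstruction (24/24)", "FEVT storage", "data distribution"],
    "T1": ["data storage", "re-reco", "skimming", "AOD extraction",
           "raw data access", "data serving to associated T2s"],
    "T2": ["analysis", "MC production", "specialised support tasks",
           "local and group use"],
}

for tier, roles in TIER_ROLES.items():
    print(f"{tier}: {', '.join(roles)}")
```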
T1/T2 Associations

- Associated Tier-1: hosts MC production and is the reference for AOD serving
- full AOD sample at each Tier-1 (after T1-to-T1 transfers of re-reconstructed AODs)
- stream "allocation" is roughly proportional to the available disk storage at the centre (a sketch follows)
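A minimal sketch of the allocation rule, assuming a greedy proportional assignment; the actual CMS procedure is not specified on the slide, and the centre names, disk pledges, and stream rates below are made-up placeholders:

```python
# Sketch: assign streams to Tier-1 centres roughly in proportion to disk.
def allocate(streams_hz: dict, disk_tb: dict) -> dict:
    """Greedily place each stream at the centre furthest below its target."""
    total_disk = sum(disk_tb.values())
    total_rate = sum(streams_hz.values())
    # Target share of the total rate each centre should host.
    target = {c: d / total_disk * total_rate for c, d in disk_tb.items()}
    assigned = {c: 0.0 for c in disk_tb}
    placement = {}
    for stream, rate in sorted(streams_hz.items(), key=lambda kv: -kv[1]):
        centre = max(disk_tb, key=lambda c: target[c] - assigned[c])
        placement[stream] = centre
        assigned[centre] += rate
    return placement

streams = {"Mu": 30, "EG": 40, "JetMET": 60, "MinBias": 20}           # Hz, illustrative
disk = {"T1_FR_CCIN2P3": 900, "T1_US_FNAL": 2000, "T1_DE_KIT": 700}   # TB, illustrative
print(allocate(streams, disk))
```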
T2 – PAG, POG, DPG Associations
(Tier-2 associations with the Physics Analysis, Physics Object, and Detector Performance Groups)
Data Processing Guidelines

- We aim for prompt reconstruction and analysis, to reduce backlogs.
- We need the possibility of prioritisation: coping with backlogs without delaying critical data, and prompt calibration using low-latency data.
- We use data streaming based on trigger bits; this requires understanding the trigger and event selection. Early event classification allows later prioritisation.

1. Express stream: 'hot' physics events, calibration events, Data Quality Monitoring (DQM) events
2. Physics stream: propose O(7) 'primary datasets', immutable but possibly overlapping
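A minimal sketch of this trigger-bit-based streaming; the trigger names and the express selection are illustrative placeholders, not the actual CMS trigger table:

```python
# Sketch: classify an event into streams from its fired trigger bits.
EXPRESS_TRIGGERS = {"HLT_HotPhysics", "HLT_AlCa", "HLT_DQM"}   # placeholders

def route(fired: set) -> list:
    """Return the streams an event belongs to, given its trigger bits."""
    streams = []
    if fired & EXPRESS_TRIGGERS:
        streams.append("Express")   # low latency: hot physics, calibration, DQM
    if fired - EXPRESS_TRIGGERS:
        streams.append("Physics")   # later split into the O(7) primary datasets
    return streams

print(route({"HLT_AlCa", "HLT_Mu9"}))   # ['Express', 'Physics']
```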
T1s Commissioning: Express Streams
Trigger – Dataset Connection

The goal: create datasets based on triggers for specific physics objects; distribute those datasets to central storage at T2s; run skims on those datasets (or a skim of a skim, ...).

Benefits of this kind of dataset:
- groups similar events together, facilitating data analyses
- data separation (streaming) is based on trigger bits, so no additional processing is needed
- triggers are persistent, so the datasets are resilient
- recognized drawback: events are not selected with optimal reconstruction & AlCa

Obvious consequences (sketched below):
- every trigger has to be assigned to at least one dataset
- increasing dataset overlap increases storage requirements
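A sketch of the two consequences, with trigger and dataset names as illustrative placeholders rather than the real trigger table:

```python
# (1) Every trigger must be assigned to at least one dataset.
# (2) Duplicating triggers across datasets inflates storage.

TRIGGER_TABLE = {"HLT_Mu9", "HLT_Ele15", "HLT_Jet50", "HLT_MET35"}

DATASETS = {
    "Mu":     {"HLT_Mu9"},
    "EG":     {"HLT_Ele15"},
    "JetMET": {"HLT_Jet50", "HLT_MET35"},
}

assigned = set().union(*DATASETS.values())
orphans = TRIGGER_TABLE - assigned
assert not orphans, f"triggers with no dataset: {orphans}"

# Storage inflation: trigger slots summed over datasets vs unique triggers.
slots = sum(len(bits) for bits in DATASETS.values())
print(f"overlap factor: {slots / len(assigned):.2f}")   # 1.00 here; >1 with duplication
```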
The Plan

1. Primary Datasets (PD): immutable, based on triggers, split at T0; RAW/RECO/AOD PDs distributed (in limited numbers) to central storage at T1s.
2. Secondary Datasets (SD): produced from PDs at T1s by dataOps and distributed to central storage at T2s; RECO or AOD format, trigger-based as well.
3. Central skims: produced at T1s by dataOps; initially very few, for key applications that cannot be served otherwise.
4. Group skims: run on datasets stored at T2s; flexibility in the choice of event content, but provenance must be maintained. Approved by group conveners; a tool allowing them to be registered in the global DBS was tested in the October exercise. Subscribable to group space.
5. User analysis skims: a dataset that is no more than one skim away from known provenance.

The resulting chain is sketched below.
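A minimal sketch of the PD → SD → skim chain, with the provenance requirement made explicit: every derived dataset keeps a link to its parent. Class and field names are illustrative, not the DBS/dataOps schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Dataset:
    name: str
    data_format: str                      # RAW, RECO, AOD, USER, ...
    produced_at: str                      # T0, T1, T2
    parent: Optional["Dataset"] = None    # provenance must be maintained

    def lineage(self) -> list:
        """Walk back to the primary dataset, returning the chain of names."""
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return chain

pd = Dataset("Mu_PD", "RECO", "T0")                          # 1. primary dataset
sd = Dataset("Mu_LowPt_SD", "RECO", "T1", parent=pd)         # 2. secondary dataset
cs = Dataset("Mu_Central_Skim", "RECO", "T1", parent=pd)     # 3. central skim
gs = Dataset("Mu_Zmumu_GroupSkim", "USER", "T2", parent=cs)  # 4. group skim
us = Dataset("Mu_MyAnalysis", "USER", "T2", parent=gs)       # 5. user skim
print(us.lineage())   # chain back to the primary dataset
```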
From PD to SD & Skims

The goal: provide easy access to interesting data. A PD can be quite large, so reduce the amount of data to easily manageable sizes.

- Secondary Datasets (SD): each SD is a centrally produced subset of one PD selected using trigger information, not more than ~30% of its events; RECO format initially, later AOD.
- Central skims: produced centrally at T1s, 1-2 per PD; keep the 10% most interesting events or a uniform subset (prescale 10).
- Group skims: designed by groups and run on any data at a T2; could also be run on a PD if it is at the T2; ideally take central skims as input; at ~1% of a PD they are manageable day-to-day.
- User skims: final skims made by individual users for their own needs.

These procedures were tested during the October exercise. A reduction sketch follows.
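The two reduction modes above (a uniform prescale-10 subset vs. a "most interesting" selection) in a minimal sketch; the selection functions and event stand-ins are placeholders:

```python
def prescaled(events, prescale):
    """Uniform subset: keep every Nth event (prescale 10 -> 10% of events)."""
    return [e for i, e in enumerate(events) if i % prescale == 0]

def skim(events, is_interesting):
    """'Most interesting' subset: keep events passing a physics selection."""
    return [e for e in events if is_interesting(e)]

pd_events = list(range(1000))                    # stand-in for a PD
central = prescaled(pd_events, 10)               # uniform 10% subset
hot = skim(pd_events, lambda e: e % 50 == 0)     # placeholder 'interesting' cut
print(len(central) / len(pd_events), len(hot))   # 0.1 20
```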
Physics Objects

(Example only.) The physics-object datasets are not well balanced in size: combine or split them for balance, while keeping overlaps low.
Object DS → Primary DS

The example here is for L >> 8E29; for L = 8E29 only 8 PDs are needed (J. Incandela).

For rate equalisation: some object datasets are too big (split them), some are too small (merge them). From object datasets to PDs:
- splitting is based only on trigger bits
- merge correlated triggers; keep unprescaled triggers together
- allow duplication of triggers if meaningful from the physics point of view
A sketch of this equalisation rule follows.
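A minimal sketch of the split-big / merge-small rule; the rates, names, target, and tolerance are illustrative, not the actual 8E29 trigger menu:

```python
# Sketch: equalise dataset rates around a target by splitting and merging.
def equalise(rates_hz: dict, target: float, tolerance: float = 0.5) -> dict:
    merged, pending = {}, []
    for name, rate in sorted(rates_hz.items(), key=lambda kv: kv[1]):
        if rate > target * (1 + tolerance):
            # Too big: split into roughly equal parts (by trigger bits in reality).
            parts = round(rate / target)
            for i in range(parts):
                merged[f"{name}_{i}"] = rate / parts
        elif rate < target * (1 - tolerance):
            pending.append((name, rate))        # too small: queue for merging
        else:
            merged[name] = rate
    while pending:
        name, rate = pending.pop()
        if pending and rate + pending[-1][1] <= target * (1 + tolerance):
            other, r2 = pending.pop()
            merged[f"{name}+{other}"] = rate + r2   # e.g. Bjet merged into MultiJet
        else:
            merged[name] = rate
    return merged

print(equalise({"MultiJet": 12, "Bjet": 0.5, "LepX": 0.8, "EG": 25}, target=10))
```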
Object DS → Primary DS (2)

The Bjet and Lepton+X datasets have very small rates:
- Bjet: merged with the MultiJet dataset
- LepX: these are combined-object triggers, split and absorbed into the 2 relevant lepton datasets, with the same trigger appearing in 2 datasets
1st Iteration of Secondary Datasets (8E29)

For each of the 8E29 Primary Datasets (PD), the Secondary Datasets (SD) are specified by:
- dataset/skim name
- data format (RECO, AOD, reduced AOD, etc.)
- prescale with respect to the PD data
- rate in Hz
- fraction of events with respect to the total parent PD
- fraction of disk with respect to the PD size (RECO assumed)
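The SD specification fields listed above, written out as a record type; the field names mirror the slide, while the example values are placeholders rather than a real 8E29 secondary dataset:

```python
from dataclasses import dataclass

@dataclass
class SecondaryDataset:
    name: str               # dataset/skim name
    data_format: str        # RECO, AOD, reduced AOD, ...
    prescale: int           # prescale with respect to the parent PD
    rate_hz: float          # event rate in Hz
    event_fraction: float   # fraction of events wrt the total parent PD
    disk_fraction: float    # fraction of disk wrt the PD size (RECO assumed)

sd = SecondaryDataset("JetMET_LowPt", "RECO", prescale=10,
                      rate_hz=2.5, event_fraction=0.10, disk_fraction=0.10)
print(sd)
```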
Secondary Datasets at L = 8E29

Jet SDs:
- low-pT jet triggers, already prescaled at the HLT, are further prescaled at the SD level
- the full DiJetAve15 statistics, needed for JEC, are stored in a reduced format
- keep 50 GeV as the lowest unprescaled single-jet threshold
- the full statistics for DiJetAve30 are again needed for JEC, with a reduced event size
- keep 35 GeV as the lowest unprescaled MET trigger threshold
- also keep events from the Btag and HSCP triggers
Summary

The goal of the CMS data distribution model is to make data access easier, more reliable, and more efficient. Many components are already in place:
- the tier structure
- T2 associations and data transfer tools
- trigger tables
- Primary Datasets and a 1st iteration of trigger-based SDs

PDs and SDs in standard formats will be distributed to T2s (AOD, and RECO if possible). Central and group skims, run on the PDs, are accessible at T2s and are more manageable. There are no restrictions on group and user skims, even if special data formats are required. A post-mortem analysis of the October exercise, where all of this was tested before real data taking, is in progress.