
Slide 1: LHCb Roadmap 2009-10

Slide 2: 2008 — DIRAC3 put in production
- Production activities
  - Started in July
  - Simulation, reconstruction, stripping
    - Includes file distribution strategy, failover mechanism
    - File access using local access protocols (rootd, rfio, (gsi)dcap, xrootd)
    - Commissioned alternative method: copy to local disk
      - Drawbacks: non-guaranteed space, lower CPU efficiency, additional network traffic (possibly copied from a remote site)
  - Failover using VOBOXes
    - File transfers (delegated to FTS)
    - LFC registration
    - Internal DIRAC operations (bookkeeping, job monitoring…)
- Analysis
  - Started in September
  - Ganga available for DIRAC3 in November
  - DIRAC2 decommissioned on January 12th ("Call me DIRAC now…")
wLCG Workshop, March 2009, Prague
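The two data-access strategies on this slide — open the file in place via a local access protocol, or fall back to copying it to local disk — can be sketched as follows. This is a hedged illustration, not DIRAC code: the protocol list comes from the slide, while `open_remote`, `access_file`, and the scratch-file name are invented stubs standing in for the real grid clients.

```python
# Sketch: try local-access protocols first; fall back to a local copy.
# open_remote()/access_file() are illustrative stubs, not DIRAC APIs.
import os

PROTOCOLS = ["root", "rfio", "gsidcap", "dcap"]  # local access protocols

def open_remote(turl):
    """Stub: pretend to open a file via its transport URL (tURL)."""
    proto = turl.split("://", 1)[0]
    if proto not in PROTOCOLS:
        raise IOError("protocol %s not supported at this site" % proto)
    return turl  # a real client would return a file handle

def access_file(turls, scratch_dir):
    """Try each tURL with a local protocol; fall back to a local copy."""
    for turl in turls:
        try:
            return open_remote(turl), "protocol"
        except IOError:
            continue
    # Fallback: copy to local disk (non-guaranteed space, extra network
    # traffic); a real implementation would invoke a grid copy tool here.
    local = os.path.join(scratch_dir, "input.dst")
    return local, "local-copy"
```

The fallback order mirrors the drawback noted above: the local copy costs scratch space and network traffic, so it is tried only after every protocol fails.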

Slide 3: 2009 DIRAC concurrent jobs
[Plot: concurrent DIRAC jobs across 111 sites]

Slide 4: DIRAC jobs per day
[Plot: DIRAC jobs per day]

Slide 5: LHCb Computing Operations
- Production manager
  - Schedules production work, sets up and checks workflows, reports to LHCb operations
- Computing shifters
  - Computing Operations shifter (pool of ~12 shifters)
    - Covers 14h/day, 7 days/week
  - Data Quality shifter
    - Covers 8h/day, 7 days/week
  - Both are in the LHCb Computing Control room (2-R-014)
- Daily DQ and Operations meetings
  - Week days (twice a week during shutdowns)
- Grid Expert on-call
  - On duty for a week
  - Runs the operations meetings
- Grid Team (~6 FTEs needed, ~2 missing)
  - Shared responsibilities (WMS, DMS, SAM, Bookkeeping…)

Slide 6: Activities in 2008
- Completion of the MC simulation called DC06
  - Additional channels
  - Re-reconstruction (at Tier1s)
    - Involves a lot of pre-staging (2-year-old files)
  - Stripping (at Tier1s)
- User analysis of DC06
  - At Tier1s, using Ganga and DIRAC (2, then 3)
    - Access to D1 data (some files are 2 years old)
- Commissioning for 2008 data taking
  - CCRC08 (February, May)
    - Managed to distribute data at nominal rate
    - Automatic job submission to Tier1s
    - Re-processing of data still on disk
  - Very little cosmics data (only saved at Tier0, analysed online)
  - First beam data
    - Very few events (rate: 1 event / 48 seconds…)

Slide 7: Plans for 2009
- Simulation… and its analysis in 2009
  - Tuning stripping and HLT for 2010 (DC09)
    - 4/5 TeV, 50 ns (no spillover), … cm⁻²s⁻¹
    - Benchmark channels for first physics studies (100 Mevts): B→µµ, Γs, B→Dh, Bs→J/ψφ, B→K*µµ, …
    - Large minimum-bias samples (~1 min of LHC running, 10⁹ events)
    - Stripping performance required: ~50 Hz for benchmark channels
    - Tune HLT: efficiency vs retention, optimisation
  - Replacing DC06 datasets (DC09-2)
    - Signal and background samples (~500 Mevts)
    - Minimum bias for L0, HLT and stripping commissioning (~100 Mevts)
    - Used for CP-violation performance studies
    - Nominal LHC settings (7 TeV, 25 ns, … cm⁻²s⁻¹)
  - Preparation for very first physics (MC-2009)
    - 2 TeV, low luminosity
    - Large minimum-bias sample (10⁹ events, part used for FEST09)
- Commissioning for data taking (FEST09)
  - See next slides

Slide 8: FEST09
- Aim
  - Replace the non-existent 2008 beam data with MC
  - Learn how to deal with real data
    - HLT strategy: from 1 MHz to 2 kHz
      - First data (loose trigger)
      - Higher lumi/energy data (b-physics trigger)
    - Online detector monitoring
      - Based on event selection from HLT, e.g. J/ψ events
      - Automatic detection of detector problems
    - Online data streaming
      - Physics stream (all triggers) and calibration stream (subset of triggers, typically 5 Hz)
    - Alignment and calibration loop
      - Trigger re-alignment
      - Run alignment processes
      - Validate new alignment (based on calibration stream)
    - Feedback of calibration to reconstruction
    - Stripping, streaming, data merging and distribution
    - Physics analysis (group analysis, end-user…)

Slide 9: FEST09 preparation (2)
- Online developments
  - Event injector
    - Reads MC files with emulated L0 trigger
    - Creates Multi-Event Packets (MEPs, as the front-end does)
    - Sends MEPs to an HLT farm node
  - Event injector control system
    - Emulation of the standard Run Control
    - Simulates a regular run, but using the event injector as source
  - Multiple online streams
    - Using HLT classification as criterion
      - Was not needed for the 2008 run, hence was delayed
  - Status
    - Tests in December, operational in January
    - First FEST week: 26 January
      - Mainly online commissioning, limited data transfers
    - Second FEST week: 2 March
      - Data Quality commissioning, feedback to reconstruction

Slide 10: Resources (preliminary)
- Consider … as a whole (new LHC schedule)
  - Real data
    - Split the year in two parts:
      - … s at low lumi – LHC phase 1
      - 3 to … s at higher lumi (…) – LHC phase 2
    - Trigger rate independent of lumi and energy: 2 kHz
  - Simulation: … events (nominal year) in 2010
- New assumptions for (re-)processing and analysis
  - More re-processings during LHC phase 1
  - Add calibration checks (done at CERN)
  - Envision more analysis at CERN with first data
    - Increase from 25% (TDR) to 50% (phase 1) and 35% (phase 2)
    - Include SW development and testing (LXBATCH)
  - Adjust event sizes and CPU needs to current estimates
    - Important effort to reduce data size (packed format for rDST, DST, µDST…)
    - Use new HEP-SPEC06 benchmarking
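Moving the resource requests to HEP-SPEC06, as the last bullet says, amounts to a unit conversion from the older kSI2k figures. A minimal sketch, assuming the commonly quoted WLCG conversion of roughly 4 HS06 per kSI2k (the factor and the example numbers are illustrative, not LHCb's actual requests):

```python
# Illustrative kSI2k <-> HEP-SPEC06 conversion; the factor ~4 is the
# commonly quoted WLCG figure, used here as an assumption.
HS06_PER_KSI2K = 4.0

def ksi2k_to_hs06(ksi2k):
    """Convert an old-style kSI2k CPU figure to HEP-SPEC06."""
    return ksi2k * HS06_PER_KSI2K

def hs06_to_ksi2k(hs06):
    """Convert a HEP-SPEC06 figure back to kSI2k."""
    return hs06 / HS06_PER_KSI2K
```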

Slide 11: Resources (cont'd)
- CERN usage
  - Tier0:
    - Real data recording, export to Tier1s
    - First-pass reconstruction of ~85% of raw data
    - Reprocessing (in future, foresee to use also the Online HLT farm)
  - CAF (Calibration and Alignment Facility)
    - Dedicated LXBATCH resources
    - Detector studies, alignment and calibration
  - CAF (CERN Analysis Facility)
    - Part of Grid distributed analysis facilities (estimate 40% in …)
    - Histograms and interactive analysis (lxplus, desktops/laptops)
- Tier1 usage
  - Re[-re]construction
    - First pass during data taking, reprocessing
  - Analysis facilities
    - Grid distributed analysis
    - Local storage for user data (LHCb_USER SRM space)
  - Simulation in 2009 (background activity)

Slide 12: Resource requirement trends
- Numbers being finalised for the MB meeting and C-RRB
- Trends are:
  - Shift in tape requirements due to the LHC schedule
  - Increase in CERN CPU requirements
    - Change in assumptions in the computing model
  - Tier1s:
    - CPU requirements lower in 2009 but similar in 2010
      - More real-data re-processings in 2010
    - Decrease in disk requirements
  - Tier2s:
    - CPU decrease due to fewer MC simulation requests in 2009
- In any case:
  - All this is full of unknowns!
    - LHC running time
    - Machine background
    - Number of re-processings (how fast can we calibrate?)
  - More than anything, it is hard to predict needed power and space as a function of time! Only integrated CPU and final storage estimates

Slide 13: What are the remaining issues?
- Storage
- Stability

Slide 14: Storage and data access
- 3 years after Mumbai:
  - Definition of storage classes
  - Roadmap to SRM v2.2
- Where are we?
  - Many scalability issues
    - We do use, and only use, SRM
    - Data access from storage (no local copy)
  - Instabilities of storage-ware (and DM tools)
    - Delay in coping with changes (inconsistent tools)
  - Data disappearance…
    - Tapes damaged
    - Disk servers offline
  - Still no unified RFIO library between Castor and DPM…
- What can be done?
  - Regular meetings between experiments' DM experts, sites and storage-ware developers
    - Pre-GDB resurrected?
    - Should be technical, not political

Slide 15: Storage and data access (2)
- Reliability of data access?
  - We (the experiments) cannot design sites' storage
  - If more hardware is needed, it should be evaluated by the sites
    - Flexible to changes
    - Number of tape drives, size of disk caches, cache configuration…
    - Examples:
      - Write pools different from read pools: is it a good choice? How large should the pools be?
      - Scale the number of tape drives to the disk cache and staging policy
- Consistency of storage with catalogs
  - Inaccessible data (tape or disk)
  - Job matching based on the catalog
    - For T1D0 data, we use pre-staging: ensures availability of data
      - Spots lost files
    - For D1 data, we assume it is available
      - We can query SRM, but it will collapse
      - Will SRM reply the truth, i.e. UNAVAILABLE?
      - We can often get a tURL, but opening the file just hangs…
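The catalog/storage consistency check described above can be sketched as a comparison of what the file catalog claims against what the storage system reports: T1D0 files sitting on tape only should be pre-staged before jobs match, and files the storage no longer knows about are flagged as lost. The dictionaries, storage classes, and status strings here are illustrative stand-ins; a real check would go through LFC and SRM.

```python
# Sketch: reconcile a file catalog's view with storage reality.
# Status strings (ONLINE/NEARLINE/LOST) are illustrative, not SRM API.
def check_consistency(catalog, storage):
    """catalog: {lfn: storage_class}; storage: {lfn: status}."""
    to_prestage, lost = [], []
    for lfn, sclass in catalog.items():
        status = storage.get(lfn, "LOST")
        if status == "LOST":
            lost.append(lfn)         # in the catalog, gone from storage
        elif sclass == "T1D0" and status == "NEARLINE":
            to_prestage.append(lfn)  # on tape only: stage before matching
        # D1 classes are assumed ONLINE; if not, jobs hang on the tURL
    return sorted(to_prestage), sorted(lost)
```

Pre-staging T1D0 data before job matching is exactly what makes lost files visible early, instead of surfacing as jobs hanging on an open() call.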

Slide 16: Software repository and deployment
- Very important service:
  - Can make a site unusable!
  - Should scale with the number of WNs
  - Use proper technology
    - Example: at CERN, LHCb has 1 write AFS server and 4 read-only AFS servers
  - Of course, proper permissions should be set…
    - Write for lcg-admin (a.k.a. sgm accounts)
    - Read-only for all others
    - Make your choice: pool accounts and separate groups, or single accounts
  - Intermittent outages can kill all jobs on a site!
- Middleware client
  - We do need support for multiple platforms
    - Libraries linked to applications (LFC, gfal, Castor, dCache…)
  - Therefore we must distribute it
    - The LCG-AA distribution is essential

Slide 17: Workload Management
- Stability and reliability of the gLite WMS
  - The mega-patch was not a great experience…
  - In most cases we don't need brokering
    - Next step is direct CE submission (CREAM)
      - Need reliable CEMon information
- Job matching to WNs: shopping list
  - MaxCPUTime matching: which units?
    - Is it guaranteed?
  - Memory usage
    - We are very modest memory consumers, but…
    - Jobs are often killed by batch systems due to excessive (virtual) memory
    - There is no queue parameter allowing a JDL requirement
      - Only an indication of WN memory
    - Some sites have linked memory to CPU!!!
      - Seems strange… short jobs all fail…
    - Limits should be increased
      - Can bias physics results (e.g. large number of particles in Geant4)
    - CPUs with (really) many cores are almost here…
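The matching concerns in the shopping list above can be sketched as a pilot comparing a job's CPU-time and memory requirements against a queue's advertised limits. The unit ambiguity ("which units?") is modelled by normalising a queue CPU limit published in minutes to seconds, scaled by a CPU power factor; all field names and the normalisation are illustrative assumptions, not the GLUE schema or DIRAC's matcher.

```python
# Sketch: match a job's requirements to a queue's advertised limits.
# Field names ('max_cpu_min', 'mem_mb') and the minute->second
# normalisation are illustrative assumptions.
def job_fits(job, queue, cpu_power=1.0):
    """job: {'cpu_sec', 'mem_mb'}; queue: {'max_cpu_min', 'max_mem_mb'}."""
    # Queue CPU limit assumed published in minutes; scale by CPU power
    queue_cpu_sec = queue["max_cpu_min"] * 60 * cpu_power
    if job["cpu_sec"] > queue_cpu_sec:
        return False
    # Virtual-memory limit: exceeding it gets the job killed by the
    # batch system, the failure mode described above
    if job["mem_mb"] > queue["max_mem_mb"]:
        return False
    return True
```

The point of the slide is that the memory half of this check is not actually expressible as a JDL requirement, so jobs only discover the limit by being killed.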

Slide 18: SAM jobs and reports
- Need to report on usability by the experiments
  - Tests reproduce standard use cases
  - Should run as normal jobs, i.e. not in a special clean environment
- Reserve lcg-admin for software installation
  - Needs dedicated mapping for permissions to the repository
- Use normal accounts for running tests
  - Run as ultimate-priority DIRAC jobs
  - Matched by the first pilot job that starts
    - Scans the WN domain
      - Often see WN-dependent problems (bad configuration)
    - Regular environment
  - Should allow for longer periods without a report
    - Queues may be full (which is actually a good sign), but then no new job can start!
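The "scans the WN domain" idea above — the first pilot on a worker node running a few environment sanity checks so that WN-dependent misconfigurations show up per host — might look like the following. The specific checks and the threshold are invented for illustration; they are not the actual SAM tests.

```python
# Sketch: per-worker-node sanity checks run by the first pilot.
# The checks and the free-space threshold are illustrative.
import os
import socket

def scan_worker_node(min_free_gb=2, tmp_dir="/tmp"):
    checks = {}
    checks["hostname"] = socket.gethostname()  # report results per host
    # Enough scratch space for job sandboxes?
    st = os.statvfs(tmp_dir)
    free_gb = st.f_bavail * st.f_frsize / 1e9
    checks["scratch_ok"] = free_gb >= min_free_gb
    # Is the scratch area writable at all?
    checks["tmp_writable"] = os.access(tmp_dir, os.W_OK)
    return checks
```

Running such checks inside a normal pilot, rather than a special clean environment, is what lets them catch the badly configured nodes that real jobs would land on.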

Slide 19: Conclusions
- 2008
  - CCRC very useful for LHCb (although being simultaneous was irrelevant due to our low throughput)
  - DIRAC3 fully commissioned
    - Production in July
    - Analysis in November
    - As of now, called DIRAC
  - Last processing of DC06
    - Analysis will continue in 2009
  - Commissioned simulation and reconstruction for real data
- 2009
  - Large simulation requests for replacing DC06, preparing …
  - FEST09: ~1 week a month and 1 day a week
  - Resource requirements being prepared for the C-RRB in April
- Services are not stable enough yet!

