DØ Grid Computing
Gavin Davies, Frédéric Villeneuve-Séguier
Imperial College London
On behalf of the DØ Collaboration and the SAMGrid team
The 2007 Europhysics Conference on High Energy Physics (EPS-HEP), Manchester, England, July 2007
Outline
– Introduction
– DØ Computing Model
– SAMGrid Components
– Interoperability Activities
– Monte Carlo Generation
– Data Processing
– Conclusion: Next Steps / Issues, Summary
Introduction
Tevatron
– Running experiments (less data than the LHC, but still PBs per experiment)
– Growing, with great physics and better still to come: >3 fb⁻¹ of data recorded, and up to 5 fb⁻¹ more expected by the end of 2009
Computing model: a data grid (SAM) for all data handling, originally with distributed computing, evolving towards automated use of common tools/solutions on the grid (SAMGrid) for all tasks
– Started with production tasks, e.g. MC generation and data processing: the greatest need and the easiest to 'gridify', putting a running experiment ahead of the wave
– Based on SAMGrid, with a programme of interoperability from very early on: initially LCG and then OSG
– Increased automation; user analysis considered last, since SAM already gives remote data analysis
Computing Model
[Schematic: Data Handling Services link Central Storage, Central Farms, Remote Farms, Central Analysis Systems, Remote Analysis Systems and User Desktops; raw data, RECO data, RECO MC and user data all flow through the data handling services.]
Components - Terminology
SAM (Sequential Access via Metadata)
– Well-developed metadata and distributed data replication system
– Originally developed by DØ and FNAL-CD, now also used by CDF and MINOS
JIM (Job Information and Monitoring)
– Handles job submission and monitoring (everything but data handling)
– SAM + JIM → SAMGrid, a computational grid
Runjob
– Handles job workflow management
Automation
– d0repro tools, automc
(UK role: project leadership, key technology and operations)
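The core SAM idea above is that a dataset is defined by a metadata query rather than a fixed file list. The toy sketch below illustrates that idea only; the class, field names and methods are hypothetical and are not the real SAM API.

```python
# Illustrative sketch of a metadata-driven catalogue in the spirit of SAM.
# All names here are hypothetical, not the production SAM interfaces.

class MetadataCatalogue:
    def __init__(self):
        self.files = []  # each entry: a dict of metadata fields for one file

    def declare(self, name, **meta):
        """Register a file together with its metadata."""
        self.files.append(dict(meta, name=name))

    def resolve(self, **query):
        """Resolve a dataset definition (metadata constraints) to file names."""
        return [f["name"] for f in self.files
                if all(f.get(k) == v for k, v in query.items())]

cat = MetadataCatalogue()
cat.declare("raw_001.dat", tier="raw", run=1001)
cat.declare("reco_001.dat", tier="reconstructed", run=1001, version="p17")
cat.declare("reco_002.dat", tier="reconstructed", run=1002, version="p17")

# A "dataset" is just a query: here, all p17 reconstructed files.
print(cat.resolve(tier="reconstructed", version="p17"))
```

Because the dataset is a query, the same definition automatically picks up newly declared files, which is what makes this model suit a continuously running experiment.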
SAMGrid Interoperability
Long programme of interoperability: LCG first and then OSG
Step 1: Co-existence – use shared resources with a SAM(Grid) head node
– Widely done for both MC and data reprocessing
Step 2: SAMGrid interface
– SAM does the data handling and JIM the job submission
– Basically a forwarding mechanism
SAMGrid-LCG
– First used early 2006 for data fixing
– MC and p20 data reprocessing since
SAMGrid-OSG
– Learnt from SAMGrid-LCG
– p20 data reprocessing (spring 2007)
Replicated as needed
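The forwarding mechanism in Step 2 can be sketched as follows: SAMGrid keeps its own data handling (SAM) and submission (JIM) interfaces, while a forwarding node translates each job into a submission on the target grid. This is a minimal sketch of the routing idea only; the function names and job format are assumptions, not the production SAMGrid code.

```python
# Hypothetical sketch of a forwarding node: one SAMGrid-facing entry point
# dispatching jobs to per-grid backends (LCG or OSG).

def submit_lcg(job):
    # Stand-in for a real LCG submission; returns a fake job handle.
    return f"lcg:{job['name']}"

def submit_osg(job):
    # Stand-in for a real OSG submission; returns a fake job handle.
    return f"osg:{job['name']}"

BACKENDS = {"lcg": submit_lcg, "osg": submit_osg}

def forward(job):
    """Forwarding node: pick the backend grid for the job and submit there."""
    backend = BACKENDS[job["grid"]]
    return backend(job)

jobs = [
    {"name": "mc_gen_42", "grid": "lcg"},
    {"name": "p20_reproc_7", "grid": "osg"},
]
print([forward(j) for j in jobs])
```

The design point is that only the forwarding node needs to know each grid's submission details, so new grids can be added without changing the experiment-facing interface.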
SAM plots
Over 10 PB (250 billion events) consumed in the last year
Up to 1.6 PB moved per month (a factor-of-five increase over two years ago)
[Plot: data consumed per month, peaking above 1 PB/month]
SAM TV monitors SAM and the SAM stations
Continued success relies on SAM shifters, who often work remotely
SAMGrid plots - I
JIM: more than 10 active execution sites (samgrid.fnal.gov)
[Map of execution sites; with the move to forwarding nodes, new sites no longer appear as individual red dots]
SAMGrid plots - II
[Activity plots: "native" SAMGrid (Europe); SAMGrid-LCG forwarding mechanism (Europe); SAMGrid-OSG forwarding mechanism (US); "native" SAMGrid (China!)]
Monte Carlo
Massive increase with the spread of SAMGrid use and LCG (OSG later)
– p17/p20: 550M events since 09/05
– Up to 12M events/week
– Downtimes due to software transitions, p20 reprocessing and site availability
– 80% produced in Europe, 30% in France
UKRAC
– Full details on the web: /d0_uk_rac/d0_uk_rac.html
Grid-wide submission on LCG reached a scaling problem
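The totals above imply a much lower sustained rate than the 12M events/week peak, consistent with the quoted downtimes. A quick cross-check, assuming roughly 22 months between 09/2005 and the time of the talk (the month count is approximate, not from the slide):

```python
# Cross-check of the MC production numbers: 550M events over ~22 months
# versus the quoted peak of 12M events/week.

months = 22            # 09/2005 -> 07/2007, approximately
total_events = 550e6   # p17/p20 MC events produced
weeks = months * 52 / 12

average_rate = total_events / weeks  # events per week on average
print(f"average: {average_rate/1e6:.1f}M events/week vs 12M/week peak")
```

So the average was a bit under half the peak rate, i.e. the farms were far from saturated week to week.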
Data – reprocessing & fixing - I
p14 Reprocessing: Winter 2003/04
– 100M events remotely, 25M in the UK
– Distributed computing rather than grid
p17 Reprocessing: Spring – Autumn 2005
– x10 larger, i.e. 1B events, 250 TB, from raw
– SAMGrid as default; site certification
p17 Fixing: Spring 2006
– All of Run IIa: 1.4B events in 6 weeks
– SAMGrid-LCG 'burnt in'
Increasing functionality
– Primary processing tested, will become the default
Data – reprocessing & fixing - II
p20 (Run IIb) reprocessing
– Spring 2007
– Improved reconstruction and detector calibration for Run IIb data (2006 and early 2007)
– ~500M events (75 TB)
– Reprocessing using native SAMGrid and SAMGrid-OSG (and SAMGrid-LCG); first large-scale use of SAMGrid-OSG
– Up to 10M events produced/merged remotely per day (initial goal was 3M/day)
– Successful reprocessing
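The difference between the 3M/day goal and the 10M/day peak matters for the campaign length. A rough idealised check (ignoring downtime, which the real campaign of course had):

```python
# Throughput arithmetic for the p20 reprocessing figures quoted above.

total_events = 500e6  # ~500M events to reprocess

days_at_goal = total_events / 3e6    # at the initial 3M events/day goal
days_at_peak = total_events / 10e6   # at the achieved 10M events/day peak

print(f"{days_at_goal:.0f} days at 3M/day vs {days_at_peak:.0f} days at 10M/day")
```

Even allowing for downtime, the achieved peak rate is what made a spring-2007 campaign over ~500M events feasible.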
Integration of a "grid" (OSG)
p20 reprocessing
Such exercises 'debug' a grid
– Revealed some teething troubles
– Solved quickly thanks to GOC, OSG and LCG partners
SAMGrid-LCG experience
– Up to 3M events/day at full speed
[Site status plots: "LCG" mostly green; OSG initially a lot of red]
Next steps / issues
Complete endgame development
– Additional functionality/usage: skimming, primary processing on the grid as default (and at multiple sites?)
– Additional resources: completing the forwarding nodes
Full data / MC functionality for both LCG and OSG
Scaling issues in accessing the full LCG and OSG worlds
– Data analysis: how gridified do we go? An open issue
Need to be 'interoperable' (Fermigrid, LCG sites, OSG, ...)
Will need development, deployment and operations effort
"Steady" state – goal to reach by end of CY 2007 (≥2 yrs running remain)
– Maintenance of existing functionality
– Continued experimental requests
– Continued evolution as grid standards evolve
Manpower
– Development, integration and operation handled by a dedicated few
Summary / plans
Tevatron and DØ performing very well
– A lot of data and physics, with more to come
SAM and SAMGrid are critical to DØ
– The grid computing model is as important as any sub-detector
– It would not have worked without our LCG and OSG partners
– Largest grid 'data challenges' in HEP (I believe)
– Learnt a lot about the technology, and especially how it scales
– Learnt a lot about the organisation / operation of such projects
– Some of these lessons can be abstracted to the benefit of others
– Accounting model evolved in parallel (~$4M/yr)
Baseline: ensure (scaling for) the production tasks
– Further improvements to operational robustness / efficiency underway
In parallel, the open question of data analysis – will need to go part way
Back-ups
SAMGrid Architecture
Interoperability architecture
[Diagram: job flow crosses network boundaries from the VO service (SAM) through a forwarding node to an LCG/OSG cluster; arrows denote "job flow" and "offers service" relations]