
Slide 1: ATLAS Experience on Large Scale Productions on the Grid
CHEP-2006, Mumbai, 13 February 2006
Gilbert Poulard (CERN PH-ATC), on behalf of the ATLAS Data Challenges, Grid and Operations teams

Slide 2: Overview
- Introduction
  - The ATLAS experiment
  - The ATLAS Data Challenges program
- The ATLAS production system
- Data Challenge 2 and the "Rome" production
  - The 3 Grid flavors (LCG, Grid3 and NorduGrid)
- Observations, experience, achievements
- Conclusions
- Future plans

Slide 3: Introduction: LHC/CERN
[Aerial view of the CERN/LHC site near Geneva, with Mont Blanc (4810 m) in the background]

Slide 4: Introduction: ATLAS (A Toroidal LHC ApparatuS)
- A detector for the study of high-energy proton-proton collisions.
- The offline computing will have to deal with an output event rate of 200 Hz, i.e. 2x10^9 events per year with an average event size of 1.6 MB.
- Researchers are spread all over the world: ~2000 collaborators, ~150 institutes, 34 countries.
- Detector dimensions: diameter 25 m, barrel toroid length 26 m, end-cap end-wall chamber span 46 m, overall weight 7000 tonnes.

Slide 5: ATLAS resources needed in 2008

                            CPU (MSI2k)   Disk (PB)   Tape (PB)
  Tier-0                        4.1          0.39        5.7
  CERN Analysis Facility        2.7          1.9         0.5
  Sum of Tier-1s               24.0         14.4         9.0
  Sum of Tier-2s               19.9          8.7         0.0
  Total                        50.6         25.4        16.9

- Data taking will be done at the CERN LHC
  - Assume 2 x 10^9 events per year (1.6 MB per event)
- First-pass reconstruction will run at the CERN Tier-0
- Re-processing will be done at the Tier-1s (Regional Computing Centers, ~10)
- Monte Carlo simulation will be done at the Tier-2s (e.g. physics institutes, 40-50)
  - Full simulation of ~20% of the data rate
(A back-of-the-envelope check of these numbers follows below.)
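To make the data-volume figures concrete, here is a back-of-the-envelope check in Python using only the numbers quoted on this slide (2 x 10^9 events per year, 1.6 MB per event, ~20% fully simulated). It is purely illustrative; the constants are taken from the slide, not from any official ATLAS document.

```python
# Rough consistency check of the numbers quoted on slide 5.
# All constants are taken from the slide itself; this is not official ATLAS code.

EVENTS_PER_YEAR = 2e9        # assumed number of recorded events per year
EVENT_SIZE_MB = 1.6          # average raw event size quoted on the slide
SIMULATED_FRACTION = 0.20    # full simulation of ~20% of the data rate

raw_volume_pb = EVENTS_PER_YEAR * EVENT_SIZE_MB / 1e9   # 1 PB = 10^9 MB (decimal units)
simulated_events = EVENTS_PER_YEAR * SIMULATED_FRACTION

print(f"Raw data per year:      ~{raw_volume_pb:.1f} PB")            # ~3.2 PB
print(f"Fully simulated events: ~{simulated_events:.1e} per year")   # ~4e8
```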

Slide 6: Introduction: Data Challenges
- LHC Computing Review (2001): "Experiments should carry out Data Challenges of increasing size and complexity to validate their Computing Model, their complete software suite and their data model, and to ensure the correctness of the technical choices to be made."

Slide 7: Introduction: Data Challenges
- DC1 (2002-2003)
  - First ATLAS exercise on a world-wide scale: O(1000) CPUs at peak
  - Put in place the full software chain: simulation of the data, digitization, pile-up, reconstruction
  - Production system tools: bookkeeping of data and jobs (~AMI), monitoring, code distribution
  - "Preliminary" Grid usage:
    - NorduGrid: all production performed on the Grid
    - US: Grid used at the end of the exercise
    - LCG-EDG: some testing during the Data Challenge, but no "real" production
  - At least one person per contributing site: many people involved
  - Lessons learned:
    - Management of failures is a key concern
    - Automate to cope with the large number of jobs
  - "Built" the ATLAS DC community
- Physics: Monte Carlo data needed for the ATLAS High Level Trigger Technical Design Report

Slide 8: Introduction: Data Challenges
- DC2 (2004)
  - A similar exercise to DC1 (scale, physics processes), BUT
  - Introduced the new ATLAS Production System (ProdSys):
    - Unsupervised production across many sites spread over three different Grids (US Grid3, ARC/NorduGrid, LCG-2)
    - Based on the DC1 experience with AtCom and GRAT: a core engine with plug-ins
    - 4 major components: production supervisor, executor, common data management system, common production database
    - Uses middleware components as much as possible: avoid inventing ATLAS's own version of the Grid; use the middleware broker, catalogs, information system, ...
- "Rome" production (2005)
  - Production of simulated data for an ATLAS physics workshop in Rome in June 2005, using the DC2 infrastructure.

Slide 9: ATLAS Production System
- ATLAS uses 3 Grids:
  - LCG (= EGEE)
  - ARC/NorduGrid (evolved from EDG)
  - OSG/Grid3 (US)
- Plus the possibility of local batch submission (4 interfaces in total)
- Input and output must be accessible from all Grids
- The system makes use of the native Grid middleware as much as possible (e.g. Grid catalogs); it does not "re-invent" its own solution.

Slide 10: ATLAS Production system (architecture diagram)
[Diagram: the supervisor (Windmill) takes job definitions from the production database (prodDB, with AMI bookkeeping) and communicates over Jabber/SOAP with one executor per facility: Lexor (LCG), Dulcinea (NorduGrid), Capone (Grid3) and an LSF executor for local batch. The data management system (dms, Don Quijote) sits on top of the Grid replica catalogs (RLS).]

Slide 11: ATLAS Production System
- Production database: prodDB
  - Holds records for jobs and their executions
  - An Oracle database hosted at CERN
  - The design does not require a single central database; it is just easier to manage
- Supervisor
  - Consumes jobs from the production database
  - Submits them to one of the executors it is connected to
  - Follows up on each job
  - Will retry "n" times if necessary
  - Implementation: "Windmill"

Slide 12: ATLAS Production System
- Executor
  - One for each facility flavor: LCG (lexor, lexor-CG), NorduGrid (dulcinea), Grid3 (Capone), PBS, LSF, BQS, ...
  - Translates the facility-neutral job definition into the facility-specific language: XRSL, JDL, wrapper scripts, ...
- Data Management system
  - Allows global cataloging of files (opted to interface to the existing replica-catalog flavors)
  - Allows global file movement: a job can get/put a file anywhere
  - Presents a uniform interface on top of all facility-native data management tools
  - Implementation: "DonQuijote"
(An illustrative sketch of the supervisor/executor pattern follows below.)
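To make the supervisor/executor split of slides 11-12 more concrete, here is a minimal sketch of the plug-in pattern in Python. It is not the actual Windmill/Lexor/Dulcinea/Capone code: the class names, method names and the JDL fragment are invented for this example, and submission and polling are stubbed out.

```python
# Illustrative sketch of the supervisor/executor plug-in pattern (slides 11-12).
# Names and interfaces are hypothetical; the real ProdSys components differ.
from abc import ABC, abstractmethod


class Executor(ABC):
    """One executor per facility flavor (LCG, NorduGrid, Grid3, LSF, ...)."""

    @abstractmethod
    def translate(self, job: dict) -> str:
        """Turn a facility-neutral job definition into a facility-specific one
        (JDL for LCG, XRSL for NorduGrid, wrapper scripts for batch systems)."""

    @abstractmethod
    def submit(self, description: str) -> str:
        """Submit the translated job; return a facility-level job handle."""

    @abstractmethod
    def status(self, handle: str) -> str:
        """Return 'done', 'failed' or 'running' for a previously submitted job."""


class LCGExecutor(Executor):
    def translate(self, job):
        # Hypothetical JDL fragment built from the neutral job definition
        return f'Executable = "{job["transformation"]}"; Arguments = "{job["args"]}";'

    def submit(self, description):
        # The real executor would hand the JDL to the LCG resource broker here
        return "lcg-job-0001"

    def status(self, handle):
        return "done"   # stubbed out for the example


class Supervisor:
    """Consumes jobs from the production DB, hands each one to its executor,
    follows it up and retries up to max_retries times (slide 11)."""

    def __init__(self, executor: Executor, max_retries: int = 3):
        self.executor = executor
        self.max_retries = max_retries

    def run(self, job: dict) -> bool:
        for _attempt in range(self.max_retries):
            handle = self.executor.submit(self.executor.translate(job))
            if self.executor.status(handle) == "done":
                return True
        return False


# Usage: one supervisor instance connected to one executor flavor
supervisor = Supervisor(LCGExecutor())
supervisor.run({"transformation": "atlas.g4sim", "args": "dataset=rome.12345"})
```

In the real system the supervisor polled the production database for pending jobs rather than being handed a single job; the sketch is only meant to show the retry logic and the per-facility translation step.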

Slide 13: The 3 Grid flavors: LCG-2
[Map of LCG-2 sites, ATLAS DC2, autumn 2004]
- The number of sites and the available resources are evolving quickly

Slide 14: The 3 Grid flavors: NorduGrid
[Map of NorduGrid sites, ATLAS DC2, autumn 2004: more than 10 countries, 40+ sites, ~4000 CPUs, ~30 TB of storage]
- NorduGrid is a research collaboration established mainly across the Nordic countries, but it includes sites from other countries.
- It contributed a significant part of DC1 (using the Grid in 2002).
- It supports production on several operating systems (non-RedHat 7.3 platforms).

Slide 15: The 3 Grid flavors: Grid3
[Map of Grid3 sites, September 2004: 30 sites, multi-VO shared resources, ~3000 CPUs (shared); ATLAS DC2, autumn 2004]
- The deployed infrastructure has been in operation since November 2003
- It is currently running 3 HEP and 2 biology applications
- Over 100 users are authorized to run on Grid3

Slide 16: Production phases
[Data-flow diagram of the production chain; persistency: Athena-POOL]
- Event generation (Pythia): physics events in HepMC format
- Detector simulation (Geant4): hits + MC truth
- Digitization, including pile-up with minimum-bias events: digits (RDO) + MC truth
- Event mixing and byte-stream conversion: byte-stream raw digits
- Reconstruction: ESD and AOD
- Data volumes for 10^7 events: ~5 TB, 20 TB, 30 TB, 20 TB and 5 TB across the successive phases
(The chain is written out as a simple data structure in the sketch below.)
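For readability, the production chain recovered from the diagram above can also be written down as an ordered structure. The sketch below is illustrative only: the stage names and output formats come from the slide, while the Python structure and its layout are invented here.

```python
# The DC2/"Rome" production chain from slide 16, as a simple ordered list.
# Stage names and outputs are taken from the slide; the structure is invented.
PRODUCTION_CHAIN = [
    ("event generation (Pythia)",    "physics events (HepMC)"),
    ("detector simulation (Geant4)", "hits + MC truth"),
    ("digitization, incl. pile-up",  "digits (RDO) + MC truth"),
    ("event mixing / byte-stream",   "byte-stream raw digits"),
    ("reconstruction",               "ESD + AOD"),
]

for stage, output in PRODUCTION_CHAIN:
    print(f"{stage:32s} -> {output}")
```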

Slide 17: DC2 and "Rome" productions
- DC2
  - Few datasets
  - Different types of jobs:
    - Physics event generation: very short
    - Geant simulation (Geant3 in DC1; Geant4 in DC2 and "Rome"): long, more than 10 hours
    - Digitization: medium, ~5 hours
    - Reconstruction: short
  - All types of jobs run sequentially, each phase one after the other
- "Rome"
  - Many different (>170) datasets, covering different physics channels
  - Same types of jobs (event generation, simulation, etc.)
  - All types of jobs run in parallel
- The different running modes have a large impact on the production rate

Slide 18: ATLAS DC2 and "Rome" productions: countries (sites)
- Australia (1) (0), Austria (1), Canada (4) (3), CERN (1), Czech Republic (2), Denmark (4) (3), France (1) (4), Germany (1+2), Greece (0) (1), Hungary (0) (1), Italy (7) (17), Japan (1) (0), Netherlands (1) (2), Norway (3) (2), Poland (1), Portugal (0) (1), Russia (0) (2), Slovakia (0) (1), Slovenia (1), Spain (3), Sweden (7) (5), Switzerland (1) (1+1), Taiwan (1), UK (7) (8), USA (19)
  (first number: DC2 sites; second number, where given: "Rome" sites)
- Overall: DC2 20 countries, 69 sites; "Rome" 22 countries, 84 sites
- LCG: DC2 13 countries, 31 sites; "Rome" 17 countries, 51 sites
- NorduGrid: DC2 7 countries, 19 sites; "Rome" 7 countries, 14 sites

Slide 19: ATLAS DC2: jobs, as of 30 November 2004
[Chart of the job distribution across sites]
- Total: 20 countries, 69 sites, ~260,000 jobs, ~2 MSI2k.months

Slide 20: Rome production: number of jobs, as of 17 June 2005
[Chart of the job distribution across sites; the largest contributing sites each account for about 4-6% of the jobs]

Slide 21: Rome production statistics
- 173 datasets
- 6.1 M events simulated and reconstructed (without pile-up)
- Total simulated data: 8.5 M events
- Pile-up done for 1.3 M events
  - 50 k of these reconstructed

Slide 22: ATLAS Production (July 2004 - May 2005)

Slide 23: Monitoring
- Production overview: via prodDB (ATLAS-specific)
- Grid monitors:
  - GOC monitor: http://goc.grid-support.ac.uk/gridsite/monitoring/
  - Site Functional Tests
  - BDII monitors (several):
    - http://hpv.farm.particle.cz/chudoba/atlas/lcg/bdii/html/latest.html
    - http://www.nordugrid.org/applications/prodsys/lcg2-atlas.php
    - http://www.mi.infn.it/~gnegri/rome_bdii.htm

Slide 24: Observations
- The production rate was extremely irregular
- The switch from long jobs (~20 h) to short jobs (~2 h) was hardly noticeable
  - The bottleneck is not CPU
- Clear correlation with weekends (LCG) and/or holidays (Grid3)
  - People!
- The introduction of the Condor-G executor (March 2005) almost doubled the LCG throughput
  - It bypasses the LCG resource broker

Slide 25: Experience
- We had a high dependency on:
  - Middleware
    - The broker in LCG, RLS in Grid3/NG, ...
    - Many bugs found and corrected
  - Sites
    - Mis-configuration and/or "wrong" information published
- DC2 started before development was finished
  - Many bugs found and corrected
- The software was not prepared for unreliable middleware
  - Error handling, logging, etc. were too minimal
- The Data Management System component lacked important functionality to manage the huge number of files involved
  - Lack of manpower
    - A lot of people were needed to compensate

Slide 26: Experience
- Critical services were underestimated:
  - RB, BDII, RLS, SE, UI, DQ, MyProxy, DB servers
  - On LCG, at the beginning of DC2, there was just one combined machine running the RB, UI, BDII and DQ; this quickly evolved into a complex system of many services running on many machines
- And also:
  - Other databases required by the ATLAS software, such as the geometry DB and the conditions DB
    - MySQL servers: the hard limit on the number of connections was hit during the Rome production
    - Replica servers were introduced, and the code was changed to select between them (see the sketch below)
  - SE: problems if the input data are on an SE that is down
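As an illustration of the replica-selection change mentioned above, here is a minimal sketch of picking a working database replica from a list, with fallback when a server is unreachable or refuses connections. This is not the actual ATLAS code: the host names are invented, and connect_fn stands in for whatever MySQL client call the real software used.

```python
# Minimal sketch of replica-server selection with fallback (slide 26).
# Host names are hypothetical; connect_fn is a stand-in for the real DB client call.
import random

GEOMETRY_DB_REPLICAS = [
    "geomdb-replica1.example.org",
    "geomdb-replica2.example.org",
    "geomdb-replica3.example.org",
]


def open_db_connection(connect_fn, servers=GEOMETRY_DB_REPLICAS):
    """Try the replicas in random order to spread the load; return the first
    connection that succeeds, or raise if every replica is unavailable."""
    candidates = list(servers)
    random.shuffle(candidates)          # simple load spreading across replicas
    last_error = None
    for host in candidates:
        try:
            return connect_fn(host)     # e.g. a MySQL connect call for this host
        except Exception as err:        # refused, timed out, too many connections, ...
            last_error = err
    raise RuntimeError(f"no database replica reachable: {last_error}")
```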

Slide 27: Main achievements
- DC2
  - 10 M events fully processed
  - 260 k jobs, ~200 kSI2k.years, >50 TB of data
  - 20 countries, 69 sites
- "Rome" production
  - 8.5 M events simulated and 6.1 M events reconstructed
  - 173 datasets
  - 570 k jobs, ~300 kSI2k.years, >50 TB of data
  - 22 countries and 84 sites
- Both were large-scale productions run on the Grid ONLY, using 3 distinct Grid flavors in a coherent way

Slide 28: Conclusions
- The 3 Grids (LCG-EGEE, ARC/NorduGrid and OSG/Grid3) have proven to be usable in a coherent way for real large-scale productions
  - Possible, but not easy
  - Requires a non-scalable amount of manpower
- All the components involved need substantial improvements:
  - New developments in the 3 Grids (e.g. gLite for LCG-EGEE)
  - A new version of ProdSys
    - Same architecture
    - But a new supervisor implementation (Eowyn)
    - And new executors (Panda for OSG/Grid3)
  - And a new Data Management System (DQ2)
- These new components are being tested
  - See the dedicated presentations at this conference

Slide 29: Future plans
- We have now moved to a "continuous" production mode
  - The goal is to fully process ~2 M events per week
  - To provide data to physicists
- Several exercises are scheduled this year to validate the Computing Model:
  - A Tier-0 exercise simulating the full processing chain
    - Moving data from the pit to the Tier-0
    - Processing the data at the Tier-0, producing ESD and AOD
    - Distributing data from the Tier-0 to the Tier-1s (RAW and ESD) and to the Tier-2s (AOD)
  - Distributed productions
    - Re-processing at the Tier-1s, with distribution of the data to the other Tier-1s (ESD and AOD) and to the Tier-2s (AOD); this will also include testing of the calibration procedure
    - Full Monte Carlo processing at the Tier-2s, with replication of the data at the Tier-1s
  - Distributed Analysis
  - These exercises will be done in the context of SC4
- And also the processing of real "cosmic" data

Slide 30: Thank you

