ATLAS Experience on Large Scale Productions on the Grid
CHEP-2006, Mumbai, 13th February 2006
Gilbert Poulard (CERN PH-ATC), on behalf of the ATLAS Data Challenges, Grid and Operations teams

Slide 2: Overview
- Introduction
  - ATLAS experiment
  - ATLAS Data Challenges program
- ATLAS production system
- Data Challenge 2 & "Rome" production
  - The 3 Grid flavors (LCG, Grid3 and NorduGrid)
- Observations, experience, achievements
- Conclusions
- Future plans

Slide 3: Introduction: LHC/CERN
[Aerial view of the LHC ring at CERN, with Geneva and Mont Blanc (4810 m) marked]

Slide 4: Introduction: ATLAS (A Toroidal LHC ApparatuS)
- Detector for the study of high-energy proton-proton collisions.
- The offline computing will have to deal with an output event rate of 200 Hz, i.e. 2x10^9 events per year with an average event size of 1.6 MB (see the quick check below).
- Researchers are spread all over the world: ~2000 collaborators, ~150 institutes, 34 countries.
- Detector dimensions: diameter 25 m; barrel toroid length 26 m; end-cap end-wall chamber span 46 m; overall weight 7000 tons.
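A quick back-of-the-envelope check of these numbers: 200 Hz x 1.6 MB/event is roughly 320 MB/s coming out of the online system, and 2x10^9 events/year x 1.6 MB/event is roughly 3.2 PB of raw data per year, before any derived data formats are added.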

Slide 5: ATLAS resources needed in 2008
[Table of CPU (MSi2k), Disk (PB) and Tape (PB) requirements for the Tier-0, the CERN Analysis Facility, the sum of Tier-1s, the sum of Tier-2s, and the total]
- Data taking will be done at CERN-LHC
  - Assume 2x10^9 events per year (1.6 MB per event)
- First-pass reconstruction will run at the CERN Tier-0
- Re-processing will be done at the (~10) Tier-1s (Regional Computing Centers)
- Monte Carlo simulation will be done at the (40-50) Tier-2s (e.g. physics institutes)
  - Full simulation of ~20% of the data rate

Slide 6: Introduction: Data Challenges
- LHC Computing Review (2001): "Experiments should carry out Data Challenges of increasing size and complexity to validate their Computing Model, their complete software suite and their Data Model, and to ensure the correctness of the technical choices to be made."

Slide 7: Introduction: Data Challenges
- DC1 (2002-2003)
  - First ATLAS exercise on a world-wide scale: O(1000) CPUs at peak
  - Put in place the full software chain: simulation of the data, digitization, pile-up, reconstruction
  - Production system tools: bookkeeping of data and jobs (~AMI), monitoring, code distribution
  - "Preliminary" Grid usage:
    - NorduGrid: all production performed on the Grid
    - US: Grid used at the end of the exercise
    - LCG-EDG: some testing during the Data Challenge, but no "real" production
  - At least one person per contributing site; many people involved
  - Lessons learned: management of failures is a key concern; automate to cope with the large number of jobs
  - "Build" the ATLAS DC community
- Physics: Monte Carlo data needed for the ATLAS High Level Trigger Technical Design Report

Slide 8: Introduction: Data Challenges
- DC2 (2004)
  - Similar exercise to DC1 (scale, physics processes), BUT
  - Introduced the new ATLAS Production System (ProdSys)
    - Unsupervised production across many sites spread over three different Grids (US Grid3, ARC/NorduGrid, LCG-2)
    - Based on DC1 experience with AtCom and GRAT: core engine with plug-ins
    - 4 major components: production supervisor, executor, common data management system, common production database
    - Use middleware components as much as possible; avoid inventing ATLAS's own version of the Grid (use the middleware broker, catalogs, information system, ...)
- "Rome" production (2005)
  - Production of simulated data for an ATLAS physics workshop in Rome in June 2005, using the DC2 infrastructure.

Slide 9: ATLAS Production System
- ATLAS uses 3 Grids
  - LCG (= EGEE)
  - ARC/NorduGrid (evolved from EDG)
  - OSG/Grid3 (US)
- Plus the possibility of local batch submission (4 interfaces in total)
- Input and output must be accessible from all Grids
- The system makes use of the native Grid middleware as much as possible (e.g. Grid catalogs), not "re-inventing" its own solution.

Slide 10: ATLAS Production System (architecture diagram)
[Diagram: the supervisor (Windmill) takes jobs from the production database (prodDB, with AMI) and talks via Jabber/SOAP to one executor per facility - Lexor for LCG, Dulcinea for NorduGrid, Capone for Grid3, plus an LSF executor - while the data management system (Don Quijote) interfaces to the Grid catalogs (RLS).]

Slide 11: ATLAS Production System
- Production database: ProdDB
  - Holds records for jobs and executions
  - Oracle database hosted at CERN
  - The design does not require a single central database; it is just easier to manage
- Supervisor
  - Consumes jobs from the production database
  - Submits them to one of the executors it is connected to
  - Follows up on the jobs
  - Will retry "n" times if necessary (see the sketch below)
  - Implementation: "Windmill"
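The supervisor/executor split described on this slide can be pictured as a minimal polling loop. The sketch below is illustrative only: the class and method names (the ProdDB wrapper, the executor interface, the retry limit) are assumptions made for the example, not the real Windmill code, which communicated with its executors over Jabber/SOAP.

```python
# Minimal sketch of the supervisor pattern described above: poll the
# production DB, dispatch jobs to an executor, follow up, retry on failure.
# Class and method names are illustrative, not the real Windmill API.
import time

MAX_ATTEMPTS = 3  # the "retry n times" from the slide; the value is an assumption

class Supervisor:
    def __init__(self, proddb, executor):
        self.proddb = proddb        # wrapper around the Oracle production database
        self.executor = executor    # e.g. Lexor (LCG), Dulcinea (NG), Capone (Grid3)

    def run_once(self):
        # 1. consume a batch of job definitions from the production database
        for job in self.proddb.fetch_pending_jobs(limit=100):
            handle = self.executor.submit(job)          # 2. submit to the facility
            self.proddb.mark_submitted(job.id, handle)

        # 3. follow up on previously submitted jobs
        for job in self.proddb.fetch_running_jobs():
            status = self.executor.poll(job.handle)
            if status == "done":
                self.proddb.mark_done(job.id)
            elif status == "failed":
                if job.attempt < MAX_ATTEMPTS:
                    self.proddb.requeue(job.id)         # retry "n" times
                else:
                    self.proddb.mark_failed(job.id)

    def run_forever(self, interval=300):
        while True:
            self.run_once()
            time.sleep(interval)
```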

Slide 12: ATLAS Production System
- Executor
  - One for each facility flavor: LCG (Lexor, Lexor-CondorG), NorduGrid (Dulcinea), Grid3 (Capone), PBS, LSF, BQS, ...
  - Translates the facility-neutral job definition into a facility-specific language: XRSL, JDL, wrapper scripts, ... (a sketch follows below)
- Data Management system
  - Allows global cataloging of files; opted to interface to the existing replica catalog flavors
  - Allows global file movement: a job can get/put a file anywhere
  - Presents a uniform interface on top of all facility-native data management tools
  - Implementation: "Don Quijote"
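To make the "facility-neutral to facility-specific" translation concrete, here is a hedged sketch of what an executor does with a job definition. The job fields, the transformation name and the exact attribute sets are assumptions for the example; the real Lexor and Dulcinea executors generate considerably richer JDL/xRSL plus wrapper scripts.

```python
# Illustrative translation of a facility-neutral job definition into
# JDL (LCG) or xRSL (NorduGrid) text. Field names and attributes are
# assumptions; the actual executors are more elaborate.

def to_jdl(job):
    return "\n".join([
        "[",
        f'Executable = "{job["transformation"]}";',
        f'Arguments  = "{job["arguments"]}";',
        f'InputSandbox  = {{"{job["wrapper"]}"}};',
        f'OutputSandbox = {{"stdout.log", "stderr.log"}};',
        f'Requirements = other.GlueCEPolicyMaxWallClockTime > {job["walltime_min"]};',
        "]",
    ])

def to_xrsl(job):
    return (
        f'&(executable="{job["transformation"]}")'
        f'(arguments="{job["arguments"]}")'
        f'(cpuTime="{job["walltime_min"]} minutes")'
        f'(stdout="stdout.log")(stderr="stderr.log")'
    )

job = {
    "transformation": "csc_simul.sh",   # hypothetical transformation name
    "arguments": "dataset=rome.004100 nevents=50",
    "wrapper": "wrapper.sh",
    "walltime_min": 1200,
}
print(to_jdl(job))
print(to_xrsl(job))
```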

Slide 13: The 3 Grid flavors: LCG-2
[Map of LCG-2 sites used for ATLAS DC2, autumn 2004]
- The number of sites and the resources are evolving quickly

Slide 14: The 3 Grid flavors: NorduGrid
[Map of NorduGrid sites used for ATLAS DC2, autumn 2004]
- > 10 countries, 40+ sites, ~4000 CPUs, ~30 TB of storage
- NorduGrid is a research collaboration established mainly across the Nordic countries, but it includes sites from other countries.
- It contributed a significant part of DC1 (using the Grid in 2002).
- It supports production on several operating systems (non-RedHat 7.3 platforms).

Slide 15: The 3 Grid flavors: Grid3
[Map of Grid3 sites used for ATLAS DC2, autumn 2004]
- The deployed infrastructure has been in operation since November 2003
- At this moment running 3 HEP and 2 biological applications
- Over 100 users authorized to run in Grid3
- As of September 2004: multi-VO shared resources, ~3000 CPUs (shared)

Slide 16: Production phases
[Data-flow diagram of the production phases: event generation (Pythia, HepMC events) -> detector simulation (Geant4, hits + MC truth) -> digitization with pile-up of minimum-bias events (digits/RDO) -> event mixing into byte-stream raw digits -> reconstruction (ESD, AOD). Quoted data volumes for 10^7 events: ~5 TB, 20 TB, 30 TB, 20 TB and 5 TB across the chain. Persistency: Athena-POOL.]

Slide 17: DC2 and "Rome" productions
- DC2
  - Few datasets
  - Different types of jobs:
    - Physics event generation: very short
    - Geant simulation (Geant3 in DC1; Geant4 in DC2 & "Rome"): long, more than 10 hours
    - Digitization: medium, ~5 hours
    - Reconstruction: short
  - All types of jobs run sequentially (each phase one after the other)
- "Rome"
  - Many different (>170) datasets, for different physics channels
  - Same types of jobs (event generation, simulation, etc.)
  - All types of jobs run in parallel
- The different type of running has a large impact on the production rate

Slide 18: ATLAS DC2 & "Rome" productions: countries (sites)
Site counts per country, given as (DC2) (Rome) where both were shown:
- Australia (1) (0), Austria (1), Canada (4) (3), CERN (1), Czech Republic (2), Denmark (4) (3), France (1) (4), Germany (1+2), Greece (0) (1), Hungary (0) (1), Italy (7) (17), Japan (1) (0), Netherlands (1) (2), Norway (3) (2), Poland (1), Portugal (0) (1), Russia (0) (2), Slovakia (0) (1), Slovenia (1), Spain (3), Sweden (7) (5), Switzerland (1) (1+1), Taiwan (1), UK (7) (8), USA (19)
- Totals: DC2: 20 countries, 69 sites; "Rome": 22 countries, 84 sites
- Sub-totals for individual Grid flavors (as on the slide): DC2: 13 countries, 31 sites / "Rome": 17 countries, 51 sites; and DC2: 7 countries, 19 sites / "Rome": 7 countries, 14 sites

Slide 19: ATLAS DC2: Jobs
[Chart of the DC2 job distribution, as of 30 November 2004]
- Total: 20 countries, 69 sites, ~260K jobs, ~2 MSi2k.months

Slide 20: Rome production: number of jobs
[Chart of the Rome-production job distribution per site, as of 17 June 2005; the individual site shares shown are around 4-6%]

Slide 21: Rome production statistics
- 173 datasets
- 6.1 M events simulated and reconstructed (without pile-up)
- Total simulated data: 8.5 M events
- Pile-up done for 1.3 M events
  - 50 K of these reconstructed

Slide 22: ATLAS Production (July 2004 - May 2005)
[Plot of ATLAS production over the period July 2004 to May 2005]

Slide 23: Monitoring
- Production overview:
  - via ProdDB (ATLAS-specific; a sketch of such a summary query follows below)
- Grid monitors:
  - GOC monitor
  - Site Functional Tests
  - BDII monitors (several)
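As a concrete, purely hypothetical illustration of the ProdDB-based production overview: such a monitoring page essentially boils down to a grouped count of jobs per Grid and per state. The schema below (a jobs table with grid and status columns) is invented for the example, and sqlite3 stands in for the Oracle client used in production.

```python
# Hypothetical production-overview query; table and column names are
# assumptions, not the real ProdDB schema, and sqlite3 stands in for Oracle.
import sqlite3

def job_summary(conn):
    """Print the number of jobs per Grid flavor and per job state."""
    cur = conn.execute(
        "SELECT grid, status, COUNT(*) AS n FROM jobs GROUP BY grid, status"
    )
    for grid, status, n in cur:
        print(f"{grid:12s} {status:12s} {n:8d}")

if __name__ == "__main__":
    job_summary(sqlite3.connect("proddb.sqlite"))
```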

Slide 24: Observations
- Production rate extremely irregular
- Switch from long jobs (~20 h) to short jobs (~2 h) hardly noticeable
  - The bottleneck is not CPU
- Clear correlation with weekends (LCG) and/or holidays (Grid3)
  - People!
- Introduction of the Condor-G executor (March 2005) almost doubled LCG throughput
  - It bypasses the LCG resource broker

Slide 25: Experience
- We had a high dependency on:
  - Middleware: the broker in LCG, RLS in Grid3/NG, ...; many bugs found and corrected
  - Sites: mis-configuration and/or "wrong" information published
- DC2 started before development was finished
  - Many bugs found and corrected
- The software was not prepared for unreliable middleware
  - Error handling, logging, etc. were too minimal
- The Data Management System component lacked important functionality for managing the huge number of files involved
  - Lack of manpower: a lot of people were needed to compensate

Slide 26: Experience
- Critical services were underestimated
  - RB, BDII, RLS, SE, UI, DQ, MyProxy, DB servers
  - On LCG, just one combo machine with RB, UI, BDII and DQ at the beginning of DC2; this quickly evolved into a complex system of many services running on many machines
- And also:
  - Other databases required by the ATLAS software, such as the geometryDB and ConditionsDB (MySQL servers)
    - The hard limit on the number of connections was hit during the Rome production
    - Replica servers were introduced and the code changed to select between them (see the sketch below)
  - SE: problems if the input data are on an SE which is down
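The "replica servers plus code to select between them" fix can be illustrated with a small client-side failover helper. The host names, credentials and the use of the MySQLdb bindings are assumptions made for this sketch; the actual change in the ATLAS database clients was specific to the geometryDB/ConditionsDB access code.

```python
# Sketch of client-side selection among replica database servers with
# fallback, as described above. Host names are placeholders and the real
# ATLAS client logic differs in detail.
import random
import MySQLdb   # assumes the MySQL-python bindings are installed

REPLICAS = ["geomdb-replica1.example.org",
            "geomdb-replica2.example.org",
            "geomdb-replica3.example.org"]

def connect_any(db="geometryDB", user="reader", passwd=""):
    hosts = random.sample(REPLICAS, len(REPLICAS))   # spread load across replicas
    last_error = None
    for host in hosts:
        try:
            return MySQLdb.connect(host=host, db=db, user=user, passwd=passwd)
        except MySQLdb.OperationalError as err:      # e.g. "too many connections"
            last_error = err
    raise RuntimeError(f"no replica reachable: {last_error}")
```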

Slide 27: Main achievements
- DC2
  - 10 M events fully processed
  - 260K jobs; ~200 kSI2k.years; >50 TB of data
  - 20 countries; 69 sites
- "Rome" production
  - 8.5 M events simulated and 6.1 M events reconstructed
  - 173 datasets
  - 570K jobs; ~300 kSI2k.years; >50 TB of data
  - 22 countries; 84 sites
- Both were large-scale productions on the Grid only, using 3 distinct Grid flavors in a coherent way

Slide 28: Conclusions
- The 3 Grids (LCG-EGEE, ARC/NorduGrid and OSG/Grid3) have proven to be usable in a coherent way for real large-scale productions
  - Possible, but not easy
  - Requiring a non-scalable amount of manpower
- All the components involved need substantial improvements
  - New developments in the 3 Grids (e.g. gLite for LCG-EGEE)
  - New version of ProdSys: same architecture, but a new supervisor implementation (Eowyn) and new executors (Panda for OSG/Grid3)
  - New version of the Data Management System (DQ2)
- These new components are being tested
  - See the dedicated presentations at this conference

Slide 29: Future plans
- We have now moved to a "continuous" production mode
  - The goal is to fully process ~2 M events per week
  - To provide data to physicists
- Several exercises are scheduled this year to validate the Computing Model
  - Tier-0 exercise simulating the full processing chain
    - Moving data from the pit to Tier-0
    - Processing the data at Tier-0, producing ESD and AOD
    - Distributing data from Tier-0 to Tier-1s (RAW and ESD) and to Tier-2s (AOD)
  - Distributed productions
    - Re-processing at Tier-1s with distribution of the data to the other Tier-1s (ESD and AOD) and to Tier-2s (AOD); this will also include testing of the calibration procedure
    - Full Monte Carlo processing at Tier-2s with replication of the data at Tier-1s
  - Distributed Analysis
  - These exercises will be done in the context of SC4
- But also processing of real "cosmic" data.

Slide 30: Thank you