Performance of The NorduGrid ARC And The Dulcinea Executor in ATLAS Data Challenge 2 Oxana Smirnova (Lund University/CERN) for the NorduGrid collaboration.

1 Performance of The NorduGrid ARC And The Dulcinea Executor in ATLAS Data Challenge 2
Oxana Smirnova (Lund University/CERN) for the NorduGrid collaboration
CHEP 2004, Interlaken, 2004-09-29

2 NorduGrid
2004-09-29, www.nordugrid.org
• NorduGrid is a research collaboration established by universities in Denmark, Estonia, Finland, Norway and Sweden
  – Focuses on providing production-quality Grid middleware for academic researchers
  – Triggered by the needs of the LHC experiments
• Close cooperation with other Grid projects:
  – EU DataGrid (2001-2003)
  – SWEGRID
  – NDGF
  – LCG
  – EGEE
• Assistance in Grid deployment outside the Nordic area

3 Advanced Resource Connector (ARC)
• ARC is the Grid middleware developed by NorduGrid
  – Based on Globus libraries and API
  – Original architectural solutions, services and implementations
  – Supports one of the largest functional Grid-like systems:
    10 countries, 40+ sites, ~4000 CPUs, ~30 TB storage

4 ATLAS Data Challenge 2
• Mass simulation of future data taking
  – Event generation, detector simulation
  – Test of Tier0 operation: "raw data" processing, distribution of output to regional centers
  – Distributed analysis
• Duration: Summer-Fall 2004
• New ATLAS software
• An automated production system
• Resources come via available Grid systems
  – Grid3 (USA), LCG, NorduGrid + other ARC-enabled sites
• Test of the ATLAS Computing Model

5 ATLAS Production System
• Thin application-specific layer on top of the Grid and legacy systems
  – "Don Quijote" is a data management system, interfacing to Grid data indexing services (RLS)
  – The Production Database (ProdDB) holds job definitions and status records
  – "Windmill", the supervisor, mediates between the ProdDB and the executors
  – Executors use Grid-specific APIs to schedule and manipulate the jobs:
    "Capone": Grid3
    "Dulcinea": ARC
    "Lexor": LCG2
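The supervisor/executor split above can be sketched in Python, the language into which Dulcinea is imported. This is a minimal illustration under stated assumptions: the class and method names (`JobDefinition`, `Executor.submit`, `Supervisor.dispatch`, `FakeARCExecutor`) and the status strings are hypothetical, and the ProdDB is modelled as an in-memory list. The real Windmill exchanged XML messages with its executors rather than making direct Python calls.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class JobDefinition:
    """Hypothetical ProdDB record: a job definition plus its status."""
    job_id: int
    transformation: str
    status: str = "defined"


class Executor(ABC):
    """Grid-specific back end (the role played by Capone, Dulcinea, Lexor)."""

    @abstractmethod
    def submit(self, job: JobDefinition) -> str:
        """Submit one job to the Grid and return a Grid-level job handle."""


class FakeARCExecutor(Executor):
    """Hypothetical stand-in for an ARC executor: it only records submissions."""

    def __init__(self) -> None:
        self.submitted: list[int] = []

    def submit(self, job: JobDefinition) -> str:
        self.submitted.append(job.job_id)
        return f"arc-job-{job.job_id}"


@dataclass
class Supervisor:
    """Windmill-like supervisor: moves jobs from the ProdDB to one executor."""
    executor: Executor
    prod_db: list[JobDefinition] = field(default_factory=list)
    handles: dict[int, str] = field(default_factory=dict)

    def dispatch(self, max_jobs: int) -> int:
        """Submit up to max_jobs jobs still marked 'defined'; return how many were sent."""
        sent = 0
        for job in self.prod_db:
            if sent == max_jobs:
                break
            if job.status != "defined":
                continue  # already handed to an executor earlier
            self.handles[job.job_id] = self.executor.submit(job)
            job.status = "submitted"
            sent += 1
        return sent
```

Because the supervisor only talks to the abstract `Executor` interface, the same dispatch loop could drive a Grid3, ARC or LCG back end. This mirrors the "thin layer" idea of the production system.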

6 Dulcinea implementation
• Implemented in C++ and compiled as a shared library
  – The shared library is imported into Python
• Wraps ATLAS jobs into a tailored script that:
  – Creates a POOL file catalog for the input files
  – Untars the ATLAS transformations tarball
  – Calls the transformation requested by Windmill
  – Creates an XML file with metadata (Don Quijote attributes) for the output results
• Calls the ARC User Interface API and the Globus RLS API
  – File transfer is handled entirely by the ARC gatekeeper
  – No internal tracking of jobs; relies on the ARC Information System
  – Can avoid problematic sites using a "blacklist"
• Fetches the XML file for each job and adds the attributes to the RLS catalogue
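The job-wrapping and bookkeeping steps listed above might look roughly like the following Python sketch. Dulcinea itself is written in C++. Everything here is an assumption made for illustration:
- The helper commands `make_pool_catalog` and `write_output_metadata` are hypothetical.
- The gzip-compressed tarball is an assumption.
- The XML layout of the metadata file is hypothetical.
- The function names `make_wrapper`, `parse_metadata` and `usable_sites` are hypothetical.

```python
import shlex
import xml.etree.ElementTree as ET


def make_wrapper(tarball: str, transformation: str,
                 args: list[str], inputs: list[str]) -> str:
    """Build a shell wrapper that follows the four steps on the slide.
    The helper commands are hypothetical placeholders."""
    lines = [
        "#!/bin/sh",
        "set -e",  # abort the job on the first failing step
        # Step 1: register input files in a POOL file catalog (hypothetical helper).
        "make_pool_catalog " + " ".join(shlex.quote(f) for f in inputs),
        # Step 2: unpack the transformations tarball (gzip assumed).
        f"tar -xzf {shlex.quote(tarball)}",
        # Step 3: run the transformation requested by the supervisor.
        " ".join(shlex.quote(a) for a in [f"./{transformation}", *args]),
        # Step 4: write Don Quijote metadata for the outputs (hypothetical helper).
        "write_output_metadata > metadata.xml",
    ]
    return "\n".join(lines) + "\n"


def parse_metadata(xml_text: str) -> dict[str, dict[str, str]]:
    """Map each output file name to its attributes, ready for RLS registration.
    The <file>/<attribute> layout is an assumed format, not the DC2 one."""
    root = ET.fromstring(xml_text)
    result: dict[str, dict[str, str]] = {}
    for f in root.findall("file"):
        attrs = {a.get("name", ""): a.get("value", "") for a in f.findall("attribute")}
        result[f.get("lfn", "")] = attrs
    return result


def usable_sites(sites: list[str], blacklist: set[str]) -> list[str]:
    """Drop blacklisted sites before brokering a job."""
    return [s for s in sites if s not in blacklist]
```

The sketch keeps no job state of its own, in the spirit of Dulcinea relying on the ARC Information System. Whatever it needs later is either written into the job's output or re-read from outside.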

7 Dulcinea performance
• At most 2 Dulcinea executor instances ran at any one time
• Up to 5000 jobs handled by each executor without major problems
  – Can run unattended for several days
• A few serious problems:
  – Very long startup times of the supervisor (recovering accumulated jobs)
  – Transfer of large XML messages between the supervisor and the executor can render the system unresponsive for long periods of time
[Figure: Dulcinea executor+supervisor instances in ATLAS DC2 production]

8 ARC-connected resources for DC2

#   Site                          Country       ~# CPUs   ~% Dedicated
1   atlas.hpc.unimelb.edu.au      Australia          28            30%
2   genghis.hpc.unimelb.edu.au    Australia          90            20%
3   charm.hpc.unimelb.edu.au      Australia          20           100%
4   lheppc10.unibe.ch             Switzerland        12           100%
5   lxsrv9.lrz-muenchen.de        Germany           234             5%
6   atlas.fzk.de                  Germany           884             5%
7   morpheus.dcgc.dk              Denmark            18           100%
8   lscf.nbi.dk                   Denmark            32            50%
9   benedict.aau.dk               Denmark            46            90%
10  fe10.dcsc.sdu.dk              Denmark           644             1%
11  grid.uio.no                   Norway             40           100%
12  fire.ii.uib.no                Norway             58            50%
13  grid.fi.uib.no                Norway              4           100%
14  hypatia.uio.no                Norway            100            60%
15  sigrid.lunarc.lu.se           Sweden            100            30%
16  sg-access.pdc.kth.se          Sweden            100            30%
17  hagrid.it.uu.se               Sweden            100            30%
18  bluesmoke.nsc.liu.se          Sweden            100            30%
19  ingrid.hpc2n.umu.se           Sweden            100            30%
20  farm.hep.lu.se                Sweden             60            60%
21  hive.unicc.chalmers.se        Sweden            100            30%
22  brenta.ijs.si                 Slovenia           50           100%

Totals at peak:
• 7 countries
• 22 sites
• ~3000 CPUs
  – dedicated ~700
• 7 Storage Services (in RLS)
  – a few more storage facilities
  – ~12 TB
• ~1 FTE (1-3 persons) in charge of production
  – At most 2 executor instances simultaneously
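As a quick cross-check of the peak totals, the per-site CPU counts can be summed, using each host's top-level domain as a stand-in for its country. The counts are transcribed from the flattened slide. For rows 5, 6 and 10 the digit split (234, 884 and 644 CPUs) is an interpretation, chosen because it is the one consistent with the quoted ~3000 CPUs.

```python
# (host, approximate CPU count) for the ARC-connected DC2 sites.
SITES = [
    ("atlas.hpc.unimelb.edu.au", 28),
    ("genghis.hpc.unimelb.edu.au", 90),
    ("charm.hpc.unimelb.edu.au", 20),
    ("lheppc10.unibe.ch", 12),
    ("lxsrv9.lrz-muenchen.de", 234),  # digit split assumed
    ("atlas.fzk.de", 884),            # digit split assumed
    ("morpheus.dcgc.dk", 18),
    ("lscf.nbi.dk", 32),
    ("benedict.aau.dk", 46),
    ("fe10.dcsc.sdu.dk", 644),        # digit split assumed
    ("grid.uio.no", 40),
    ("fire.ii.uib.no", 58),
    ("grid.fi.uib.no", 4),
    ("hypatia.uio.no", 100),
    ("sigrid.lunarc.lu.se", 100),
    ("sg-access.pdc.kth.se", 100),
    ("hagrid.it.uu.se", 100),
    ("bluesmoke.nsc.liu.se", 100),
    ("ingrid.hpc2n.umu.se", 100),
    ("farm.hep.lu.se", 60),
    ("hive.unicc.chalmers.se", 100),
    ("brenta.ijs.si", 50),
]


def top_level_domain(host: str) -> str:
    """Last label of a host name, used here as a proxy for the country."""
    return host.rsplit(".", 1)[-1]


total_cpus = sum(cpus for _, cpus in SITES)
countries = {top_level_domain(host) for host, _ in SITES}
print(f"{len(SITES)} sites, {len(countries)} countries, {total_cpus} CPUs")
```

With these values it prints `22 sites, 7 countries, 2920 CPUs`, in line with the slide's 22 sites, 7 countries and ~3000 CPUs.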

9 ARC performance in ATLAS DC2
• Total # of successful jobs: 42202 (as of September 25, 2004)
• Failure rate before ATLAS ProdSys manipulations: 20%
  – ~1/3 of failed jobs did not waste resources
• Failure rate after: 35%
• Possible reasons:
  – Dulcinea failing to add DQ attributes in RLS
  – DQ renaming
  – Windmill re-submitting good jobs
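The failure rates are easier to interpret as job counts. The sketch below assumes that a failure rate means failed jobs divided by all attempted jobs, so that attempted = successful / (1 - rate). The slide does not define the rate, so the resulting counts are indicative only. Exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

SUCCESSFUL_JOBS = 42202  # as of September 25, 2004


def implied_counts(successful: int, failure_rate: Fraction) -> tuple[Fraction, Fraction]:
    """Return (attempted, failed), assuming failure_rate = failed / attempted.
    This definition is an assumption; the slide does not state it."""
    attempted = successful / (1 - failure_rate)
    return attempted, attempted - successful


for label, rate in (("before", Fraction(20, 100)), ("after", Fraction(35, 100))):
    attempted, failed = implied_counts(SUCCESSFUL_JOBS, rate)
    print(f"{label}: ~{round(attempted)} attempted, ~{round(failed)} failed")

# About 1/3 of the failures before ProdSys manipulations did not waste resources,
# so roughly 2/3 of them did.
_, failed_before = implied_counts(SUCCESSFUL_JOBS, Fraction(20, 100))
print(f"failures that wasted resources: ~{round(failed_before * Fraction(2, 3))}")
```

Under this reading, the jump from 20% to 35% corresponds to roughly twice as many failed jobs for the same number of successes.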

10 Failure analysis
• Dominant problem: hardware accidents

11 Summary
• ARC middleware and the Dulcinea executor provided stable services for ATLAS DC2
  – 20+ sites from Norway to Australia operate as a single resource
  – These sites contributed ~30% of the total ATLAS DC2 production
    Despite offering the fewest ATLAS-dedicated resources
    Originally committed to provide only 20%
• Performed extremely well compared to other Grid systems
  – Negligible amount of middleware-related problems
    Save for the initial instability of the Globus RLS, a common problem
  – Needed an order of magnitude less human effort than both Grid3 and LCG
  – Produced the same amount of data with far fewer resources, owing to higher resource-usage efficiency
• Dulcinea & ARC helped to prove the validity of the ATLAS Production System concept
• Problems still to solve:
  – Safeguarding against site-specific hardware failures
  – Improvement of the ATLAS Production System

