1 ATLAS Grid Activities
Mark Sosebee
University of Texas at Arlington
DOSAR Workshop V, Louisiana Tech University
September 27, 2007

2 Overview - Introduction
- ATLAS grid activities comprise a diverse range and scope of work, both in the U.S. and worldwide
- Two of the primary activities we're involved with in the U.S. are:
  - Managed Monte Carlo production
  - User analysis
- In the U.S. we utilize the resources at the Tier 1 center (BNL), five Tier 2's, plus a mix of Tier 3 and non-ATLAS-dedicated OSG sites
- A central component of U.S. grid operations is PanDA – more about this later

3 ATLAS Grids – Big Picture
- ATLAS is divided into three production grids (makes life interesting...): EGEE, NDGF and OSG (10 Tier 1's, ~40 Tier 2's, ...)
- Resources (disk space is quoted for the Tier 1's):

    Grid            CPUs      Disk
    NDGF (Nordic)   ~500      ~60 TB
    OSG (U.S.)      ~2500     ~500 TB
    EGEE            ~3500     ~400 TB

- Common features: task/job definition, job dispatch (supervisor), metadata (AMI); access to data through the ATLAS DDM system
- Differences: production software (executors), grid middleware, an independent operations team for each grid, service architecture, storage systems...

4 ATLAS Production: Computer System Commissioning (CSC)
- Software integration and operations exercise
- Started more than 1.5 years ago (at the end of 2005)
- Distributed (MC) production goals:
  - Validation of Athena software using 'standard' samples
  - Continuous production of high-statistics physics samples
  - Exercise the widely distributed computing infrastructure
- Distributed Data Management (DDM) goals:
  - Worldwide data distribution and replication
  - Controlled flow of data: T0 => T1 => T2 => T1
  - Bring sites and services to steady operations mode

5 Details of Production Operations
- First, some nomenclature (a minimal sketch of these objects follows below):
  - Job – atomic unit of execution ('n' events processed by Athena using a single CPU, real or virtual)
  - Task – collection of jobs through a processing step (evgen, recon, pileup, digit...) for a physics topic (top, Higgs, SUSY...)
  - Dataset – collection of similar files from a single task
- MC production steps:
  - Physics groups define tasks (assigned to a specific grid)
  - Tasks are converted to jobs and stored centrally
  - Shift teams manage successful execution of jobs
  - The DDM operations team monitors and manages data flow
  - Physicists analyze the data on the grid
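How the three terms fit together, as a minimal Python sketch – this illustrates the nomenclature only, not PanDA's actual schema; all class and field names here are made up:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Job:
        """Atomic unit of execution: 'n' events processed by Athena on one CPU."""
        job_id: int
        n_events: int
        status: str = "defined"

    @dataclass
    class Task:
        """Collection of jobs through one processing step for a physics topic."""
        task_id: int
        step: str                 # evgen, recon, pileup, digit, ...
        topic: str                # top, Higgs, SUSY, ...
        grid: str                 # EGEE, NDGF, or OSG
        jobs: List[Job] = field(default_factory=list)

    @dataclass
    class Dataset:
        """Collection of similar files produced by a single task."""
        name: str
        task_id: int
        files: List[str] = field(default_factory=list)

    # A physics group defines a task; the production system expands it into jobs.
    t = Task(task_id=1234, step="evgen", topic="SUSY", grid="OSG")
    t.jobs.append(Job(job_id=1, n_events=50))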

6 CSC Production Statistics
- Many hundreds of physics processes have been simulated
- Tens of thousands of tasks spanning two major releases
- Dozens of sub-releases (about every three weeks) have been tested and validated
- Thousands of 'bug reports' fed back to software and physics
- 50M+ events done from CSC12
- >300 TB of MC data on disk
- An impressive team effort...

7 Resource Usage
- CPU and disk resources available to ATLAS are rising steadily
- Production system efficiencies are increasing
- But much more will be required for data taking
- Additional resources are steadily coming online

8 Resource Allocation (U.S.)
- All U.S. ATLAS Tier 1/2 sites provide resources:
  - Reliable storage of distributed data
  - CPUs for managed production (ATLAS-wide groups)
  - CPUs for regional/local production of large samples through PanDA
  - CPUs for user analysis through pathena
  - CPUs for interactive Athena testing / software development (unlikely to be available at all Tier 2's – will be available at BNL)
- ROOT analysis of AANtuples is expected to be done on personal workstations and Tier 3 sites
- The U.S. Resource Allocation Committee (chaired by Jim Shank) oversees fair-share usage of resources (a toy illustration follows below):
  - Sets allocations between ATLAS-wide and U.S. usage
  - Sets allocations between different groups
  - Sets quotas for individual users
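A toy illustration of splitting a CPU pool by fractional shares – the group names and fractions below are hypothetical, not the RAC's actual allocations:

    def split_allocation(total_cpus: int, shares: dict) -> dict:
        """Divide a CPU pool among groups according to fractional shares."""
        return {group: int(total_cpus * frac) for group, frac in shares.items()}

    # Hypothetical shares for a 2500-CPU pool:
    shares = {"atlas-wide": 0.75, "us-regional": 0.20, "user-analysis": 0.05}
    print(split_allocation(2500, shares))
    # {'atlas-wide': 1875, 'us-regional': 500, 'user-analysis': 125}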

9 Data Production / Processing (U.S.)
- ATLAS managed production (MC, reprocessing):
  - Historically, the U.S. has contributed ~25% of MC production
  - The Tier 1 and Tier 2's provide dedicated queues and storage for this
  - Physics groups directly manage task requests (we will have quotas/allocations per group, arbitrated by the RAC)
  - Detector, calibration, particle ID, test beam, commissioning groups... will also have allocations
- Regional U.S. production:
  - Same as ATLAS managed production – physics groups define tasks needed by U.S. physicists with a special group name (e.g. ushiggs)
  - PanDA manages the quota (currently 20-25% for U.S. production)
  - So far, U.S. physicists have been slow in taking advantage of this (less than 25% of the quota allocated by the RAC is being used)

10 PanDA
- PanDA is the distributed computing service infrastructure deployed in the U.S. and Canada (joined over the summer)
  - Includes all U.S. and Canadian Tier 1's and Tier 2's, plus some opportunistic/shared sites
  - Works with both OSG (U.S.) and EGEE (Canada) middleware
- PanDA uses pilot jobs and provides a single task queue (a minimal sketch follows below)
  - Pilot jobs allow instant activation of the highest-priority tasks
  - The task queue brings the 'batch system' concept to distributed grids
- Athena jobs transparently become 'pathena' jobs running in PanDA
- PanDA provides an integrated software and computing system for U.S. / Canadian ATLAS sites, managing all production and user analysis activities
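A minimal sketch of the pilot-job pattern described above, assuming a hypothetical in-memory queue – real PanDA pilots contact a central server over the network and launch Athena payloads, none of which is modeled here:

    import heapq

    class TaskQueue:
        """Central queue: jobs wait here until a pilot on some site pulls one."""
        def __init__(self):
            self._heap = []

        def submit(self, priority: int, job: str) -> None:
            # Lower number = higher priority, so the most urgent job pops first.
            heapq.heappush(self._heap, (priority, job))

        def pull_highest_priority(self):
            return heapq.heappop(self._heap)[1] if self._heap else None

    def pilot(site: str, queue: TaskQueue) -> None:
        """A pilot lands on a worker node, then asks the queue for real work.

        Because the pilot is already running when it asks, the highest-priority
        job starts instantly instead of waiting in the site's local batch system.
        """
        job = queue.pull_highest_priority()
        if job is None:
            print(f"[{site}] no work available, exiting")
        else:
            print(f"[{site}] running job: {job}")

    queue = TaskQueue()
    queue.submit(priority=2, job="simul+digit task 1234, jobs 1-50")
    queue.submit(priority=1, job="urgent validation sample")
    pilot("BNL_ATLAS", queue)   # -> runs the urgent validation sample first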

11 More on PanDA
- The PanDA Distributed Production and Distributed Analysis models tightly adhere to the ATLAS Computing Model
- Goals of the PanDA model:
  - Support ATLAS-wide multi-tiered computing (no longer U.S. only)
  - Support the analysis needs of ATLAS physicists
  - Data availability is critically important!
  - Sufficient computing cycles for a variety of use cases
- PanDA and pathena are the only distributed processing tools currently supported by U.S. ATLAS computing
- Interactive Athena-based computing is provided primarily at BNL (and the CERN CAF – competing with the rest of ATLAS)

12 PanDA (U.S. & Canada) Production Statistics
- PanDA has completed ~30M fully simulated physics events (simul+digit step), >30% of total central production
- Also successfully completed >15M single-particle events
- Since November, all available CPUs have been occupied (ran out of jobs for only a few days, plus a few days of service outages)
- About 400 TB of original data stored at the BNL Tier 1 (includes data generated on other grids)
- An additional ~100 TB of replicas kept at U.S. ATLAS Tier 2 sites
- Canadian sites are ramping up rapidly – large contributions

13 Number of R12 PanDA Jobs DONE

    Group        Jobs done
    physics         437435
    prodsys         153460
    egamma          131114
    susy             24783
    higgs            14942
    panda            13917
    ushiggs          12263
    validation       11144
    bphysics          9771
    standard          8123
    top               4790
    exotics           3996
    usbphysics        3505
    btagging          2148
    admin             1323

14 PanDA Central Production
[Charts: cumulative PanDA central production, since 1/1/06 and since 3/1/07]

15 DDM – Distributed Data Management
- A service to distribute data among the Tiers of ATLAS according to the Computing Model (the current implementation is called DQ2)
- DDM is critical for ATLAS: it catalogs and manages the flow of data
- In some respects it has not performed at the level needed
- Main issues:
  - Robustness of site services (transfer agents)
  - Quality of service (completion of data movement tasks)
  - Too much manual intervention required
- What can be done? (a schematic sketch of the subscription model follows below)
  - Ask ATLAS management for more effort
  - Develop a backup plan for data movement (the DQ2 catalogs are fine)
  - Giving up on the ATLAS computing model is not an option...
- These topics are actively under discussion
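A schematic sketch of the subscription model behind the site services – the classes and stubs here are illustrative only, not the real DQ2 API. A subscription says "dataset X should exist at site Y"; a transfer agent polls subscriptions and copies whatever is still missing at the destination:

    from dataclasses import dataclass

    @dataclass
    class Subscription:
        dataset: str        # e.g. a CSC MC dataset name (made up below)
        destination: str    # e.g. a Tier 1 or Tier 2 site name

    def missing_files(dataset: str, site: str) -> list:
        """Stub: would consult the dataset catalog and the site's storage."""
        return [f"{dataset}._0001.pool.root"]  # pretend one file is missing

    def transfer(files: list, destination: str) -> None:
        """Stub: would hand the copy off to the grid transfer machinery."""
        for f in files:
            print(f"copying {f} -> {destination}")

    def transfer_agent(subscriptions: list) -> None:
        """One pass of the agent: fulfill every subscription it can."""
        for sub in subscriptions:
            transfer(missing_files(sub.dataset, sub.destination), sub.destination)

    # Hypothetical dataset and site names:
    transfer_agent([Subscription("csc12.005200.ttbar.AOD", "UTA_SWT2")])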

16 Data Locations in the U.S.
- Tier 1 – main repository of data (MC & primary)
  - Stores a complete set of ESD, AOD, AANtuple & TAG data on disk
  - A fraction of RAW, and all U.S.-generated RDO data
- Tier 2 – repository of analysis data
  - Stores a complete set of AOD, AANtuple & TAG data on disk
  - A complete set of ESD data, divided among the five Tier 2's
  - Data distribution to the Tier 1 & Tier 2's will be managed
- Tier 3 – unmanaged data, matched to local interests
  - Data arrives through locally initiated subscriptions (see the sketch below)
  - Mostly AANtuples, some AODs
  - Tier 3's will be associated with Tier 2 sites?
- The Tier 3 model is still not fully developed – everyone's input is needed
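What a "locally initiated subscription" from a Tier 3 might amount to, as a self-contained toy – the dataset and site names are made up, and the real workflow used the DQ2 client tools, whose commands are not reproduced here:

    def subscribe(dataset: str, site: str) -> dict:
        """Record a request that `dataset` be replicated to `site`."""
        request = {"dataset": dataset, "destination": site, "state": "queued"}
        print(f"subscription queued: {dataset} -> {site}")
        return request

    # A Tier 3 user pulls the ntuples their group cares about:
    subscribe("user.csc12.someAANtuple.v1", "MyTier3")   # hypothetical names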

17 Computing Service Model
- PanDA production (including pathena) is managed at Tier 1 and Tier 2 resources by the U.S. Production Team, working with local site support teams
- We have started integrating Canadian sites (evolving)
- The DDM Operations team (coordinated by Alexei Klimentov) manages data distribution and data access issues
- The production team operates shifts to provide QoS
- HyperNews, the RT user support system, and the Savannah bug reporting system are available to users
  - Various categories are available – PanDA, production, site issues, data management...

18 Projections for U.S. Tier 2's
- Totals include capacity committed to international ATLAS and capacity retained under U.S. control for U.S. physicists
- Most Tier 2's are currently at approximately half the values shown for 2007
[Table of projected U.S. Tier 2 capacities]

19 Where Is This All Headed?
- Goal: a steady ramp-up of operations through the end of 2007
- MC production:
  - Achieve 10 million events/week by the end of 2007
  - All targets set in 2005 have been met – so far
- DDM operations:
  - Steady operations
  - Automation and monitoring improvements
  - Computing Model data flow implemented by the end of 2007

20 Conclusion
- Computing systems are generally in good shape, thanks to the long-running commissioning (CSC) exercise
  - 2006 – successful transition into operations mode
  - 2007 – high-volume production and DDM operations
  - Successful distributed MC production for the CSC notes
- But many more challenges are to come
- We need more participation by American physicists – as users, developers, shifters... and as a source of feedback on what works and what doesn't
- Lots of resources are available for physicists
- Overall, we are progressing well towards full readiness for LHC data

21 Links O'Interest
- The PanDA Twiki: https://twiki.cern.ch/twiki/bin/view/Atlas/PanDA
- BNL Physics, Software and Computing page:
- U.S. ATLAS grid computing page – tons of info – maintained by Horst:
- Obtaining a grid certificate:

