
1 Experience with ATLAS Data Challenge Production on the U.S. Grid Testbed
Kaushik De, University of Texas at Arlington
CHEP03, March 27, 2003

2 The ATLAS Experiment
- Multi-purpose experiment at the Large Hadron Collider, CERN
- 14 TeV c.m. pp collisions starting in 2007
- Physics: Higgs, SUSY, new searches...
- Petabytes/year of data analyzed by >2000 physicists worldwide - need the GRID

3 U.S. ATLAS Grid Testbed
- BNL - U.S. Tier 1, 2000 nodes, 5% ATLAS, 10 TB
- LBNL - pdsf cluster, 400 nodes, 5% ATLAS, 1 TB
- Boston U. - prototype Tier 2, 64 nodes
- Indiana U. - prototype Tier 2, 32 nodes
- UT Arlington - 20 nodes
- Oklahoma U. - 12 nodes
- U. Michigan - 10 nodes
- ANL - test nodes
- SMU - 6 nodes
- UNM - new site

4 U.S. Testbed Goals
- Deployment
    - Set up grid infrastructure and ATLAS software
    - Test installation procedures (PACMAN)
- Development & Testing
    - Grid applications - GRAT, Grappa, Magda...
    - Other software - monitoring, packaging...
- Run Production
    - For U.S. physics data analysis and tests
    - Main focus - ATLAS Data Challenges
    - Simulation, pileup
    - Reconstruction
- Connection to GRID projects
    - GriPhyN - Globus, Condor, Chimera… use & test
    - iVDGL - VDT, glue schema testbed, WorldGrid testbed, demos… use & test
    - EDG, LCG… testing & deployment

5 ATLAS Data Challenges
DC's - Generate and analyse simulated data (see talk by Gilbert Poulard on Tuesday)
- Original Goals (Nov 15, 2001)
    - Test the computing model, its software and data model, and ensure the correctness of the technical choices to be made
    - Data Challenges should be executed at the prototype Tier centres
    - Data Challenges will be used as input for a Computing Technical Design Report due by the end of 2003 (?) and for preparing a MoU
- Current Status
    - Goals are evolving as we gain experience
    - Sequence of increasing scale & complexity
    - DC0 (completed), DC1 (underway)
    - DC2, DC3, and DC4 planned
    - Grid deployment and testing major part of DC's

6 GRAT Software
- GRid Applications Toolkit
- Used for U.S. Data Challenge production
- Based on Globus, Magda & MySQL
- Shell & Python scripts, modular design (see the sketch below)
- Rapid development platform
    - Quickly develop packages as needed by DC
    - Single particle production
    - Higgs & SUSY production
    - Pileup production & data management
    - Reconstruction
- Test grid middleware, test grid performance
- Modules can be easily enhanced or replaced by Condor-G, EDG resource broker, Chimera, replica catalogue, OGSA… (in progress)
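
The modular script design mentioned above suggests a simple driver pattern. The following sketch is not GRAT itself; it only illustrates, with invented module and class names, how independent stages (resource discovery, submission, cataloguing) can be chained so that any one of them could later be replaced by Condor-G, a resource broker, or another service.

```python
# Hypothetical sketch of a modular, GRAT-style driver (not the real GRAT code).
# Each stage is a plain callable, so a Globus-based submitter could later be
# replaced by a Condor-G or resource-broker implementation without touching
# the rest of the pipeline.

from typing import Callable, Dict, List


class ProductionJob:
    def __init__(self, partition: int, params: Dict[str, str]):
        self.partition = partition          # which DC1 partition to simulate
        self.params = params                # physics/job parameters
        self.output_files: List[str] = []   # filled in by the post-stage step


def run_job(job: ProductionJob, stages: List[Callable[[ProductionJob], None]]) -> None:
    """Run every stage in order; any stage can be swapped independently."""
    for stage in stages:
        stage(job)


# Example stage implementations (placeholders for real Globus/Magda calls).
def discover_resources(job: ProductionJob) -> None:
    print("querying site information service for free CPUs (placeholder)")

def submit_batch(job: ProductionJob) -> None:
    print(f"submitting partition {job.partition} to a remote gatekeeper (placeholder)")

def register_outputs(job: ProductionJob) -> None:
    print("registering output files in the file catalogue (placeholder)")


if __name__ == "__main__":
    job = ProductionJob(partition=42, params={"generator": "pythia"})
    run_job(job, [discover_resources, submit_batch, register_outputs])
```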

7 GRAT Execution Model
1. Resource Discovery
2. Partition Selection
3. Job Creation
4. Pre-stage
5. Batch Submission
6. Job Parameterization
7. Simulation
8. Post-stage
9. Cataloging
10. Monitoring
[Slide diagram: steps 1-10 mapped onto DC1 Prod. (UTA), Remote Gatekeeper, Replica (local), MAGDA (BNL), Param (CERN), Batch Execution and scratch space]
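
As a rough illustration of the execution model, the sketch below walks through the ten steps in order. The association of each step with a host or service is partly inferred from the slide's diagram and is not authoritative; the names and example partition number are invented.

```python
# Hypothetical walk-through of the ten GRAT execution steps (illustration only).
# The step-to-host mapping is inferred from the slide's diagram and may not
# match the real deployment.

STEPS = [
    (1,  "Resource Discovery",   "submit host (UTA)"),
    (2,  "Partition Selection",  "production database"),
    (3,  "Job Creation",         "submit host (UTA)"),
    (4,  "Pre-stage",            "remote site scratch"),
    (5,  "Batch Submission",     "remote gatekeeper"),
    (6,  "Job Parameterization", "batch node"),
    (7,  "Simulation",           "batch node"),
    (8,  "Post-stage",           "remote site scratch -> MAGDA (BNL)"),
    (9,  "Cataloging",           "MAGDA (BNL)"),
    (10, "Monitoring",           "submit host (UTA)"),
]

def run_partition(partition: int) -> None:
    """Log each step of one DC1 partition in order (placeholder actions only)."""
    for number, name, where in STEPS:
        print(f"partition {partition:5d} step {number:2d}: {name:20s} [{where}]")

if __name__ == "__main__":
    run_partition(1042)
```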

8 Middleware Evolution of U.S. Applications
[Slide table: middleware components grouped into four categories]
- Used in current production software (GRAT & Grappa)
- Tested successfully (not yet used for large scale production)
- Under development and testing
- Tested for simulation (will be used for large scale reconstruction)

9 Databases used in GRAT
- MySQL databases central to GRAT
- Production database (a minimal schema sketch follows below)
    - define logical job parameters & filenames
    - track job status, updated periodically by scripts
- Data management (Magda)
    - file registration/catalogue
    - grid based file transfers
- Virtual Data Catalogue
    - simulation job definition
    - job parameters, random numbers
- Metadata catalogue (AMI)
    - post-production summary information
    - data provenance
- Similar scheme being considered ATLAS-wide by the Grid Technical Board
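
For the production database described above, a minimal job-tracking table and the periodic status update might look like the following. This is a hypothetical sketch: the schema and column names are invented, and Python's built-in sqlite3 is used only as a stand-in for the MySQL server the slide mentions.

```python
# Hypothetical job-tracking table for a GRAT-style production database.
# sqlite3 stands in for MySQL here; the schema and column names are invented.
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    partition   INTEGER PRIMARY KEY,     -- logical partition number
    output_lfn  TEXT NOT NULL,           -- logical filename of the output
    site        TEXT,                    -- grid site running the job
    status      TEXT DEFAULT 'defined',  -- defined/submitted/running/done/failed
    updated     TEXT                     -- last time a script touched this row
);
"""

def update_status(conn: sqlite3.Connection, partition: int, status: str) -> None:
    """Periodic status update of the kind performed by the production scripts."""
    conn.execute(
        "UPDATE jobs SET status = ?, updated = datetime('now') WHERE partition = ?",
        (status, partition),
    )
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO jobs (partition, output_lfn) VALUES (?, ?)",
                 (1042, "dc1.001042.simul.root"))   # invented logical filename
    update_status(conn, 1042, "submitted")
    print(conn.execute("SELECT * FROM jobs").fetchone())
```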

10 DC1 Production on U.S. Grid
- August/September 2002
    - 3 week DC1 production run using GRAT
    - Generated 200,000 events, using ~1,300 CPU days, 2000 files, 100 GB storage at 4 sites
- December 2002
    - Generated 75k SUSY and Higgs events for DC1
    - Total DC1 files generated and stored >500 GB, total CPU used >1000 CPU days in 4 weeks
- January 2003
    - More SUSY samples
    - Started pile-up production on the grid, both high and low luminosity, for 1-2 months at all sites
- February/March 2003
    - Discovered bug in software (non-grid part)
    - Regenerating all SUSY, Higgs & pile-up samples
    - ~15 TB data, 15k files, 2M events, 10k CPU days

11 DC1 Production Examples
Each production run requires development & deployment of new software at selected sites

12 DC1 Production Experience
- Grid paradigm works, using Globus
    - Opportunistic use of existing resources, run anywhere, from anywhere, by anyone...
- Successfully exercised grid middleware with increasingly complex tasks
    - Simulation: create physics data from pre-defined parameters and input files, CPU intensive
    - Pile-up: mix ~2500 min-bias data files into physics simulation files, data intensive
    - Reconstruction: data intensive, multiple passes
    - Data tracking: multiple steps, one -> many -> many more mappings (sketched below)
- Tested grid applications developed by U.S.
    - For example, PACMAN (Saul Youssef - BU)
    - Magda (see talk by Wensheng Deng)
    - Virtual Data Catalogue (see poster by P. Nevski)
    - GRAT (this talk), GRAPPA (see talk by D. Engh)
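
The one -> many -> many more mappings mentioned under data tracking can be pictured with the small sketch below. It is purely illustrative: the filenames, fan-out counts, and function names are invented and do not reflect the real DC1 bookkeeping.

```python
# Hypothetical illustration of the one -> many -> many-more file mappings that
# DC1 data tracking has to follow (generator input -> simulation partitions ->
# pile-up outputs). All names and counts are invented.

def simulation_outputs(input_file: str, n_partitions: int) -> list[str]:
    # one generator input fans out into many simulation partitions
    return [f"{input_file}.simul.{i:04d}.root" for i in range(n_partitions)]

def pileup_outputs(simul_file: str, n_minbias_groups: int) -> list[str]:
    # each simulation file is mixed with groups of min-bias files,
    # producing yet more output files to track
    return [f"{simul_file}.pileup.{j:02d}" for j in range(n_minbias_groups)]

if __name__ == "__main__":
    provenance = {}
    for simul in simulation_outputs("dc1.002000.evgen", n_partitions=3):
        provenance[simul] = pileup_outputs(simul, n_minbias_groups=2)
    for parent, children in provenance.items():
        print(parent, "->", children)
```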

13 Grid Quality of Service
- Anything that can go wrong, WILL go wrong
    - During 18 days of grid production (in August), every system died at least once
    - Local experts were not always accessible
    - Examples: scheduling machines died 5 times (thrice power failure, twice system hung), network outages multiple times, gatekeeper died at every site at least 2-3 times
    - Three databases used - production, Magda and virtual data. Each died at least once!
    - Scheduled maintenance - HPSS, Magda server, LBNL hardware, LBNL RAID array…
    - Poor cleanup, lack of fault tolerance in Globus
- These outages should be expected on the grid - software design must be robust (a simple retry sketch follows below)
- We managed >100 files/day (~80% efficiency) in spite of these problems!
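
The lesson that software design must be robust against routine outages is often addressed with a retry-with-backoff pattern. The sketch below shows a generic version of that pattern; it is not the fault handling actually used in GRAT, and all names are illustrative.

```python
# Minimal retry-with-backoff wrapper of the kind the production experience
# argues for; this is a generic pattern, not GRAT's actual fault handling.
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_retries(action: Callable[[], T], attempts: int = 3, delay: float = 30.0) -> T:
    """Retry a flaky grid operation (file transfer, job submission) a few times."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:      # e.g. gatekeeper, network, or database outage
            if attempt == attempts:
                raise                 # give up and let monitoring flag the failure
            print(f"attempt {attempt} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)

if __name__ == "__main__":
    counter = {"n": 0}

    def flaky_transfer() -> str:
        counter["n"] += 1
        if counter["n"] < 2:
            raise RuntimeError("simulated network outage")
        return "transfer ok"

    print(with_retries(flaky_transfer, attempts=3, delay=0.1))
```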

14 Conclusion
- The largest (>10 TB) grid based production in ATLAS was done by the U.S. testbed
- Grid production is possible, but not easy right now - need to harden middleware, need higher level services
    - Many tools are missing - monitoring, operations center, data management
    - Requires iterative learning process, with rapid evolution of software design
- Pile-up was a major data management challenge on the grid - moving >0.5 TB/day
- Successful so far
    - Continuously learning and improving
    - Many more DC's coming up!

