Data reprocessing for DZero on the SAM-Grid
Gabriele Garzoglio for the SAM-Grid Team
Fermilab, Computing Division
Mar 15, 2005

Overview
- The DZero experiment at Fermilab
- Data reprocessing
  - Motivation
  - The computing challenges
  - Current deployment
- The SAM-Grid system
  - Condor-G and global job management
  - Local job management
- Getting more resources: submitting to LCG

Fermilab and DZero
[Slide shows images of the Fermilab site and the DZero detector.]

Data size for the D0 Experiment
Detector data:
- 1,000,000 channels
- Event size: 250 KB
- Event rate: ~50 Hz
- 0.75 PB of data since 1998
- Past year: 0.5 PB overall
- Expect 10-20 PB overall
This means:
- Move 10s of TB / day
- Process PBs / year
- 25%-50% remote computing
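The detector numbers above make the transfer rates plausible; a quick back-of-the-envelope check in Python (all figures taken from the slide):

```python
# Back-of-the-envelope check of the D0 data rates quoted above.
event_size_bytes = 250 * 1024   # 250 KB per event
event_rate_hz = 50              # ~50 Hz

bytes_per_day = event_size_bytes * event_rate_hz * 86400
print(f"Raw detector stream: {bytes_per_day / 1024**4:.2f} TB/day")
# ~1 TB/day from the detector alone; adding derived formats and
# distribution to remote sites, moving 10s of TB/day is realistic.
```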

Overview
- The DZero experiment at Fermilab
► Data reprocessing
  - Motivation
  - The computing challenges
  - Current deployment
- The SAM-Grid system
  - Condor-G and global job management
  - Local job management
- Getting more resources: submitting to LCG

Motivation for the Reprocessing
- Processing: changing the data format from something close to the detector to something close to the physics.
- As the understanding of the detector improves, the processing algorithms change.
- Sometimes it is worthwhile to "reprocess" all the data to get "better" analysis results.
- Our understanding of the DZero calorimeter calibration is now based on reality rather than design/plans: we want to reprocess.

The computing task
- Events: 1 billion
- Input: 250 TB (250 KB/event), a stack of CDs as high as the Eiffel tower
- Output: 70 TB (70 KB/event)
- Time: 50 s/event, i.e. ~20,000 CPU-months
- Ideally 3,400 CPUs (1 GHz PIII) for 6 months (~2 days/file)
- Remote processing: 100%
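The headline estimate is easy to verify; a minimal sanity check in Python:

```python
# Sanity check of the CPU estimate quoted above.
events = 1_000_000_000        # 1 billion events
sec_per_event = 50            # 50 s/event on a 1 GHz PIII

cpu_months = events * sec_per_event / (30 * 86400)      # 30-day months
print(f"{cpu_months:,.0f} CPU-months")                  # ~19,300, i.e. ~20,000
print(f"{cpu_months / 3400:.1f} months on 3,400 CPUs")  # ~5.7, i.e. ~6 months
```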

Data processing model
- The input datasets of n files (n ~ 100: the files produced in one day) are assigned to sites: Site 1, Site 2, ..., Site m.
- At each site, n batch processes run: Job 1, Job 2, ..., Job n.
- Each job produces one output file (Out 1, Out 2, ..., Out n), stored locally at the site.
- The outputs are merged and written to permanent storage (at any site).
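A minimal sketch of this split/process/merge bookkeeping, in Python. This is illustrative only: the real system drives it through SAM datasets and the bookkeeping service, and the file, site, and function names here are hypothetical.

```python
# Illustrative sketch of the data processing model described above:
# split a dataset into ~1-day units, fan them out to sites, merge outputs.
DATASET = [f"raw_{i:05d}.evt" for i in range(300)]  # e.g. 3 days of files
FILES_PER_UNIT = 100                                # ~1 day of data

def make_units(files, chunk):
    """Split the input dataset into bookkeeping units of ~chunk files."""
    return [files[i:i + chunk] for i in range(0, len(files), chunk)]

units = make_units(DATASET, FILES_PER_UNIT)
sites = ["FNAL", "Westgrid", "Lyon"]

for unit, site in zip(units, sites):
    # One batch process per file at the site; outputs stored locally,
    # then merged and shipped to permanent storage.
    print(f"{site}: {len(unit)} batch processes -> merge -> permanent storage")
```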

Challenges: Overall scale
- A dozen computing clusters in the US and EU:
  - common meta-computing framework: SAM-Grid
  - administrative independence
- Need to submit 1,700 batch jobs / day to meet the deadline (3,400 CPUs at ~2 days per job), without counting failures
- Each site needs to be filled up at all times: locally scale up to 1,000 batch nodes
- Time to completion of the unit of bookkeeping (~100 files): if too long (days), things are bound to fail
- Handle 250+ TB of data

Challenges: Error Handling / Recovery
- Design for random failures: unrecoverable application errors, network outages, file delivery failures, batch system crashes and hangups, worker-node crashes, filesystem corruption...
- Bookkeeping of succeeded jobs/files: needed to assure completion without duplicated events
- Bookkeeping of failed jobs/files: needed for recovery AND to trace problems, in order to fix bugs and assure efficiency
- Simple error recovery, to foster smooth operations (see the sketch below)
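At its core, the recovery bookkeeping is a set difference: the files of the input dataset with no certified output yet form the recovery dataset. A minimal sketch (in SAM-Grid this state lives in the SAM database; the names here are hypothetical):

```python
# Sketch of recovery bookkeeping: resubmit exactly the files that have
# not been successfully processed, so no event is processed twice.
input_dataset = {f"raw_{i:05d}.evt" for i in range(100)}
succeeded     = {f"raw_{i:05d}.evt" for i in range(100) if i % 7 != 0}

recovery_dataset = input_dataset - succeeded  # failed or never-ran files
assert not (succeeded & recovery_dataset)     # guards against duplicates

print(f"{len(recovery_dataset)} files to resubmit")
```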

Available Resources (CPUs in 1 GHz PIII equivalents)

  SITE        #CPU (1 GHz eq.)  STATUS
  FNAL Farm   1000              used for data-taking
  Westgrid    600               ready
  Lyon        400               ready
  SAR (UTA)   230               ready
  Wisconsin   30                ready
  GridKa      500               ready
  Prague      200               ready
  CMS/OSG     100               under test
  UK          750               4 sites being deployed

Overview
- The DZero experiment at Fermilab
- Data reprocessing
  - Motivation
  - The computing challenges
  - Current deployment
► The SAM-Grid system
  - Condor-G and global job management
  - Local job management
- Getting more resources: submitting to LCG

The SAM-Grid
- SAM-Grid is an integrated job, data, and information management system.
- Grid-level job management is based on Condor-G and Globus (a submission sketch follows below).
- Data handling and bookkeeping are based on SAM (Sequential Access via Metadata): transparent data transport, processing history, and bookkeeping.
- ...and a lot of work to achieve scalability at the execution cluster.
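For concreteness, here is a minimal sketch of what a Condor-G submission to a Globus gatekeeper looks like, driven from Python. The gatekeeper address, script name, and dataset argument are placeholders; the actual SAM-Grid submission layer wraps this with SAM data handling, sandboxing, and JIM monitoring.

```python
# Minimal Condor-G submission sketch (placeholder host/executable names);
# not the SAM-Grid code, just the underlying mechanism it builds on.
import subprocess
import tempfile

submit_description = """\
universe      = grid
grid_resource = gt2 gateway.example-site.edu/jobmanager-pbs
executable    = d0_reprocess.sh
arguments     = --dataset reprocessing-unit-001
output        = job.out
error         = job.err
log           = job.log
queue
"""

with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as f:
    f.write(submit_description)
    submit_file = f.name

# Requires a Condor-G installation providing condor_submit.
subprocess.run(["condor_submit", submit_file], check=True)
```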

SAM-Grid Diagram (flow of: job, data, meta-data)
- Grid client: User Interface, Submission.
- Grid-level job management: Global Job Queue, Match Making, Site Resource Selector, Info Collector, Info Gatherer, Resource Optimizer.
- Global data-handling services: SAM Naming Server, SAM Log Server, SAM DB Server, RC MetaData Catalog, Bookkeeping Service.
- Site services: Grid Gateway, Grid/Fabric Interface, JIM Advertise, local job handling (cluster, AAA, distributed FS), Info Manager, XML DB server (site configuration, global/local job-ID map, ...), Info Providers, MDS, Site Web Server, Grid Monitoring, User Tools.
- Site data handling: SAM Station (+ other services), SAM Stager(s), MSS, Cache, Worker Nodes.

Job Management Diagram
- Grid client: User Interface and Submission Service; the JOB enters through them.
- Resource Selection: a Match Making Service (pluggable external algorithms) fed by the Information Collector.
- Execution sites (Exec Site #1 ... Execution Site #n): each exposes a Computing Element behind the Grid/Fabric Interface, plus Grid Sensors and generic services.
- The Submission Service delivers the job to the Computing Element of the site chosen by matchmaking.

Fabric Level Job Management
Execution site components: Grid/Fabric Interface, SAM Station, Sandbox Facility, XML Monitoring Database, Batch System Adapter, Batch System, Worker Nodes.
Job life cycle at the site (a worker-node sketch follows below):
1. Job enters the site.
2. Local sandbox created for the job (user input, configuration, SAM client, GridFTP client, user credentials).
3. Local services notified of the job.
4. Batch job submission details requested.
5. Job submitted.
6. Job starts on a worker node; push of monitoring info starts.
7. Job fetches the sandbox, then gets dependent products and input data.
8. Framework passes control to the application (the Grid monitors the job status; the user can request it).
9. Job stores the output from the application; stdout, stderr, and logs are handed to the Grid.
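A worker-node bootstrap following this life cycle might look like the sketch below. Every command, URL, and helper script name is a placeholder (the real steps are performed by the SAM-Grid sandbox framework); only globus-url-copy is a standard Globus tool.

```python
# Hypothetical worker-node bootstrap mirroring the job life cycle above.
# Commands, URLs, and helper script names are placeholders.
import subprocess
import sys

def step(cmd):
    """Run one bootstrap step; print it so monitoring can pick it up."""
    print(f"[bootstrap] {' '.join(cmd)}", flush=True)
    subprocess.run(cmd, check=True)

try:
    # Fetch the sandbox (user input, config, SAM/GridFTP clients, credentials).
    step(["globus-url-copy", "gsiftp://gateway.example.org/sandbox.tar",
          "file:///tmp/sandbox.tar"])
    step(["tar", "xf", "/tmp/sandbox.tar"])
    # Get dependent products and input data (placeholder helper).
    step(["./get_products_and_input.sh"])
    # Framework passes control to the application (placeholder).
    step(["./d0_reprocess_app.sh"])
    # Store output and hand stdout/stderr/logs back to the Grid (placeholder).
    step(["./store_output_and_logs.sh"])
except subprocess.CalledProcessError as exc:
    sys.exit(f"bootstrap step failed: {exc}")
```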

How do we get more resources?
- We are working on forwarding jobs to the LCG Grid.
- A "forwarding node" is the advertised gateway to LCG.
- LCG becomes yet another batch system... well, not quite a batch system:
- we need to get rid of the assumptions on the locality of the network.
[Diagram: SAM-Grid -> Fwd-node (with a VO service, VO-srv) -> LCG.]
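Conceptually, the forwarding node implements the same adapter interface as a local batch system, but translates submit/status operations into LCG commands. A hedged sketch; the edg-job-* command-line details are era-appropriate assumptions, not the actual forwarding code:

```python
# Conceptual sketch of "LCG as a batch system" on the forwarding node.
# The edg-job-* CLI usage is an assumption based on LCG-2 era tools;
# the real SAM-Grid forwarding implementation differs in detail.
import subprocess

class LCGBatchAdapter:
    """Present LCG to SAM-Grid through a batch-system-like interface."""

    def submit(self, jdl_file: str) -> str:
        # Forward the job to an LCG resource broker instead of a local queue.
        out = subprocess.run(["edg-job-submit", jdl_file],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()  # contains the LCG job identifier

    def status(self, job_id: str) -> str:
        out = subprocess.run(["edg-job-status", job_id],
                             capture_output=True, text=True, check=True)
        return out.stdout

# Unlike a real batch system, the worker nodes are not on the gateway's
# LAN: sandboxes and data delivery cannot assume a local network.
```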

Conclusions
- DZero needs to reprocess 250 TB of data in the next 6-8 months.
- It will produce 70 TB of output, processing data at a dozen computing centers on ~3,000 CPUs.
- The SAM-Grid system will provide the meta-computing infrastructure to handle data, jobs, and information.

More info at: d0.fnal.gov/computing/reprocessing/