Meta-Computing at DØ
Igor Terekhov, for the DØ Experiment
Fermilab, Computing Division, PPDG
ACAT 2002, Moscow, Russia, June 28, 2002

2 Overview
- Overview of the D0 experiment
- Introduction to computing and the paradigm of distributed computing
- SAM – the advanced data handling system
- Global Job and Information Management (JIM) – the current Grid project
- Collaborative Grid work

3 The DØ Experiment
- Proton-antiproton collider experiment at 2 TeV
- Detector (real) data
  - 1,000,000 channels (793k from the Silicon Microstrip Tracker), 5-15% read out at a time
  - Event size 250 KB (25% increase in Run IIb)
  - Recorded event rate 25 Hz in Run IIa, 50 Hz (projected) in Run IIb
  - On-line data rate 0.5 TB/day, total 1 TB/day
  - Estimated 3-year totals (incl. processing and analysis): over 10^9 events, 1-2 PB
- Monte Carlo data
  - 6 remote processing centers
  - Estimated ~300 TB in the next 2 years
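As a sanity check on the quoted rates, here is a minimal sketch relating the event size and trigger rate on this slide to the daily data volume; the 86,400 s/day factor is the only number not taken from the slide.

    # Back-of-the-envelope check of the quoted on-line data rate,
    # using only numbers from the slide above.
    event_size_bytes = 250e3   # 250 KB per event (Run IIa)
    rate_hz = 25               # recorded event rate in Run IIa
    seconds_per_day = 86400

    bytes_per_day = event_size_bytes * rate_hz * seconds_per_day
    print(f"on-line rate ~ {bytes_per_day / 1e12:.2f} TB/day")
    # ~0.54 TB/day, consistent with the quoted 0.5 TB/day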

4 The Collaboration
- 600+ people
- 78 institutions
- 18 countries
- A large Virtual Organization whose members share resources to solve common problems

5

6 Analysis Assumptions

    Job type   Num of jobs   % of dataset   Duration   CPU/evt (500 MHz)
    Long             6           30%        12 weeks        5 sec
    Medium          50           10%         4 weeks        1 sec
    Short          150            1%         1 week         0.1 sec
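To give a feel for what these assumptions imply, the rough sketch below converts them into the number of 500 MHz CPUs each job class would occupy; the ~10^9-event dataset size is taken from slide 3, and perfect parallel efficiency is assumed.

    # Rough CPU-demand estimate from the analysis assumptions above.
    # Assumes a ~1e9-event dataset (slide 3) and perfect parallel efficiency.
    total_events = 1e9
    week_seconds = 7 * 86400

    # name: (concurrent jobs, fraction of dataset, duration in weeks, CPU sec/event at 500 MHz)
    job_classes = {
        "long":   (6,   0.30, 12, 5.0),
        "medium": (50,  0.10,  4, 1.0),
        "short":  (150, 0.01,  1, 0.1),
    }

    for name, (n_jobs, frac, weeks, cpu_per_evt) in job_classes.items():
        cpu_seconds = total_events * frac * cpu_per_evt       # work per job
        cpus_per_job = cpu_seconds / (weeks * week_seconds)   # CPUs to finish in the allotted time
        print(f"{name:6s}: ~{cpus_per_job:6.1f} CPUs per job x {n_jobs} jobs")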

7 Data Storage
- The Enstore Mass Storage System, http://www-isd.fnal.gov/enstore/index.html
- All data, including derived datasets, is stored on tape in the Automated Tape Library (ATL) robot
- Enstore is attached to the network and accessible via a cp-like command (see the sketch below)
- Other, remote MSSs may be used (the distributed-ownership paradigm – grid computing)
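A minimal sketch of what "accessible via a cp-like command" looks like from a user script; the command name encp appears on slide 16, but the file and directory names here are hypothetical.

    # Illustrative wrapper around Enstore's cp-like command (encp <source> <dest>).
    # The paths below are hypothetical.
    import subprocess

    def fetch_from_enstore(tape_backed_path, local_dir):
        """Copy one file from the tape-backed namespace to local disk."""
        subprocess.run(["encp", tape_backed_path, local_dir], check=True)

    # Example (hypothetical file name):
    # fetch_from_enstore("/pnfs/sam/example_raw_file.raw", "/scratch/cache/")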

8 Data Handling – SAM
- Responds to the above challenges in:
  - the amount of data
  - the rate of access (processing)
  - the degree to which the user base is distributed
- Major goals and requirements:
  - Reliably store (real and MC) produced data
  - Distribute the data globally to remote analysis centers
  - Catalogue the data – contents, status, locations, processing history, user datasets, etc.
  - Manage resources
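To make the cataloguing requirement concrete, here is an illustrative sketch of the kind of per-file record such a catalog holds; the field names are invented for illustration and are not SAM's actual schema.

    # Illustrative per-file metadata record (field names invented, not SAM's schema).
    file_record = {
        "file_name":   "example_raw_000123.raw",   # hypothetical
        "size_bytes":  250_000_000,
        "data_tier":   "raw",
        "event_count": 1_000,
        "locations":   ["tape: ATL robot", "cache: central-analysis station"],
        "parents":     [],          # processing history: files this one was derived from
        "status":      "on tape",
    }

    # A user dataset is then a declarative selection over such records,
    # e.g. "all raw files from a given run and trigger stream".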

9 SAM Highlights
- SAM is Sequential data Access via Meta-data
- A joint project between D0 and the Computing Division, started in 1997 to meet the Run II data handling needs
- Employs a centrally managed RDBMS (Oracle) for the meta-data catalog
- Processing takes place at stations
- The actual data are managed by a fully distributed set of collaborating servers (see the architecture later)

10 SAM Advanced Features
- Uniform interfaces for data access modes
  - The online system, reconstruction farm, Monte Carlo farm and analysis servers are all subclasses of the station (see the sketch below)
- Uniform capabilities for processing at FNAL and at remote centers
- On-demand data caching and forwarding (intra-cluster and global)
- Resource management:
  - Co-allocation of compute and data resources (interfaces with the batch system abstraction)
  - Fair-share allocation and scheduling
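A sketch of the "everything is a station" idea from this slide: the online system, reconstruction farm, Monte Carlo farm and analysis servers all present the same data-handling interface. The class names are illustrative, not actual SAM code.

    # Illustrative class hierarchy for the uniform station interface (not SAM code).
    class Station:
        def deliver(self, dataset):
            """Stage the dataset's files into the local cache, fetching
            from the MSS or peer stations on demand."""
            raise NotImplementedError

    class OnlineSystem(Station): ...
    class ReconstructionFarm(Station): ...
    class MonteCarloFarm(Station): ...
    class AnalysisServer(Station): ...

    # Each concrete station reuses the same delivery, caching and forwarding
    # machinery; only the local batch system and cache configuration differ.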

11 Components of a SAM Station
[Diagram: a station comprises a Station & Cache Manager, a File Storage Server, File Stager(s), Project Masters with their Consumer worker processes, and File Storage Clients; producers and consumers use the cache disk and temp disk, with data and control flowing between the station and the MSS or other stations.]

12 SAM as a Distributed System
[Diagram: globally shared services – Database Server(s) with the central database, Name Server, Global Resource Manager(s), Log server; Mass Storage System(s) shared locally; Station 1…n servers local to each site. Arrows indicate control and data flow.]

13 SAM as a Distributed System
[Diagram: data sites connected over the WAN to FNAL; the Database Server, optimizer and Logger are shared globally (standard), with an optional locally shared optimizer and Logger at each site.]

14 Data Flow
[Diagram: data flowing among data sites over the WAN; Routing + Caching = Replication.]

15 SAM as a Data Grid
- Provides high-level collective services of reliable data storage and replication
- Embraces multiple MSSs (Enstore, HPSS, etc.), local resource management systems (LSF, FBS, PBS, Condor) and several different file transfer protocols (bbftp, kerberos rcp, GridFTP, …) – see the sketch below
- Optionally uses Grid technologies and tools:
  - Condor as a batch system (in use)
  - Globus FTP for data transfers (ready for deployment)
- From de facto to de jure…
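A sketch of the abstraction this slide describes: a single transfer call dispatches to whichever protocol (and, analogously, MSS or batch system) a site provides. The registry is illustrative, not SAM's actual code.

    # Illustrative protocol dispatch behind a uniform transfer interface (not SAM code).
    TRANSFER_METHODS = {
        "bbftp":   lambda src, dst: print(f"bbftp   {src} -> {dst}"),
        "rcp":     lambda src, dst: print(f"krb rcp {src} -> {dst}"),
        "gridftp": lambda src, dst: print(f"GridFTP {src} -> {dst}"),
    }

    def transfer(src, dst, protocol="bbftp"):
        # The caller is insulated from which protocol a given site supports.
        TRANSFER_METHODS[protocol](src, dst)

    transfer("fnal:/cache/fileA", "remote-site:/cache/fileA", protocol="gridftp")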

16 [Layer diagram of SAM in Grid terms:
Client Applications: Request Formulator and Planner; web, Python codes, Java codes, command line, D0 Framework C++ codes.
Collective Services: catalog protocols, Significant Event Logger, Naming Service, Database Manager, Catalog Manager, SAM Resource Management, batch systems (LSF, FBS, PBS, Condor), Data Mover, Job Services, Storage Manager, Job Manager, Cache Manager, Request Manager – with SAM-given component names in quotes: "Dataset Editor", "File Storage Server", "Project Master", "Station Master", "Stager", "Optimiser".
Connectivity and Resource: CORBA, UDP; file transfer protocols – ftp, bbftp, rcp, GridFTP; mass storage system protocols, e.g. encp, hpss.
Authentication and Security: GSI; SAM-specific user, group, node and station registration; bbftp 'cookie'.
Fabric: Tape Storage Elements, Disk Storage Elements, Compute Elements, LANs and WANs, Resource and Services Catalog, Replica Catalog, Meta-data Catalog, Code Repository.
Highlighting indicates components that will be replaced, added or enhanced using PPDG and Grid tools; names in quotes are SAM-given software component names.]

17 Dzero SAM Deployment Map
[World map of SAM deployment; legend distinguishes processing centers from analysis sites.]

18 SAM usage statistics for DZero
- 497 registered SAM users in production
- 360 of them have at some time run at least one SAM project
- 132 of them have run more than 100 SAM projects
- 323 of them have run a SAM project at some time in the past year
- 195 of them have run a SAM project in the past 2 months
- 48 registered stations, 340 registered nodes
- 115 TB of data on tape
- 63,235 currently cached files (over 1 million cache entries in total)
- 702,089 physical and virtual data files known to SAM
- 535,048 physical files (90K raw, 300K MC related)
- 71,246 "analysis" projects run to date

19 SAM + JIM → Grid
- So we can reliably replicate a TB of data – what's next?
- It is the handling of jobs, not data, that constitutes the top of the services pyramid
- Need services for job submission, brokering and reliable execution
- Need resource discovery and opportunistic computing (shared vs. dedicated resources)
- Need monitoring of the global system and of jobs
- Job and Information Management (JIM) emerged

20 JIM and SAM-Grid
- (NB: please see Gabriele Garzoglio's talk)
- Project started in 2001 as part of the PPDG collaboration to handle D0's expanded needs
- Recently joined by CDF
- These are real Grid problems, and we are incorporating (adopting) or developing Grid solutions
- PPDG, GridPP, iVDGL, DataTAG and other Grid projects

21 SAMGrid Principal Components
- (NB: please see Gabriele's talk)
- Job Definition and Management: the preliminary job management architecture is aggressively based on the Condor technology provided through our collaboration with the University of Wisconsin CS group (a sketch follows below)
- Monitoring and Information Services: we assign a critical role to this part of the system and widen the boundaries of this component to include all services that provide, or receive, information relevant for job and data management
- Data Handling: the existing SAM data handling system, when properly abstracted, plays a principal role in the overall architecture and has direct effects on the job management services
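As a hedged illustration of "based on the Condor technology": Condor-G submit descriptions of that era routed a job to a site gatekeeper via the globus universe. The sketch below only writes such a description; the host, executable and file names are hypothetical, and in SAM-Grid the broker, not the user, chooses the target site.

    # Hedged sketch: generate a Condor-G ("globus" universe) submit description.
    # Host, executable and file names are hypothetical.
    submit_description = """\
    universe        = globus
    globusscheduler = gatekeeper.example-site.edu/jobmanager-pbs
    executable      = run_d0_analysis.sh
    output          = job.out
    error           = job.err
    log             = job.log
    queue
    """

    with open("d0_analysis.submit", "w") as handle:
        handle.write(submit_description)

    # The description would then be handed to condor_submit.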

22 SAM-Grid Architecture
[Diagram of the three principal components and their services: Job Handling (JH client, Request Broker, Job Scheduler, Condor-G, Condor MMS), Monitoring and Information (Info Processor and Converter, Resource Info, Logging and Bookkeeping, job status updates, MDS-2, Condor ClassAds, Grid sensors), and Data Handling (Replica Catalog, DH Resource Management, Data Delivery and Caching, SAM, Grid replica catalog). A Site Gatekeeper fronts each Compute Element resource and its batch system; GRAM, GSI and AAA provide access and security. Legend distinguishes principal components, service implementations or libraries, and information.]

23 SAMGrid: a Collaboration of Collaborations
- HEP experiments are traditionally collaborative
- Computing solutions in the Grid era require new types of collaboration:
  - Sharing solutions within an experiment – the UTA MCFarm software, etc.
  - Collaboration between experiments – D0 and CDF joining forces is an important event for SAM and FNAL
  - Collaboration among the grid players: physicists, computer scientists (the Condor and Globus teams), and physics-oriented computing professionals (such as myself)

24 Conclusions
- The Dzero experiment is one of the largest currently running experiments and presents significant computing challenges
- The advanced data handling system, SAM, has matured: it is fully distributed, its model has proven sound, and we expect it to scale to meet Run II needs for both D0 and CDF
- Expanded needs are in the area of job and information management
- The recent challenges are typical of Grid computing, and D0 engages actively, in collaboration with computer scientists and other Grid participants
- More in Gabriele Garzoglio's talk

25 The Milestone Dependencies
[Dependency chart of project milestones from "Now" through ~6 months to 9-19 months. Inputs: Use Cases document, Architecture document, Job Definition document, JDL studies, Technical Review document; Condor, Condor-G, GMA/MDS, GSI; GSI in SAM, Condor in SAM, basic SAM Resource Info Service. Intermediate milestones: a toy Grid with JSS and basic monitoring; an MDS/Condor-G testbed; execution of unstructured SAM analysis jobs; execution of user-routed MC jobs; a prototype Grid with RB, JSS and GMA-based MIS; status monitoring of unstructured jobs; basic system monitoring. Later milestones: execution of unstructured MC and SAM analysis jobs with basic brokering; scheduling criteria for data-intensive jobs and JH-DH interaction design; monitoring of structured jobs; DH monitoring; fully distributed JH and MIS; reliable execution of structured, locally distributed MC and SAM analysis jobs with basic brokering; SAM Grid-ready.]