How (and why) HEP uses the Grid
Stuart Wakefield, Imperial College London

Presentation transcript:

Slide 1: How (and why) HEP uses the Grid.

Slide 2: Overview
Major challenges
Scope of talk
MC production
Data transfer
Data analysis
Conclusions

Slide 3: HEP in a nutshell
Workflows include:
– Monte Carlo production
– Data calibration
– Reconstruction of RAW data
– Skimming of RECO data
– Analysis of RAW/RECO/TAG data
physicists per experiment
So far the main activities are MC production and user analysis.

Slide 4: Computing Challenges I
Large amounts of data:
– ~100 million electronics channels (per experiment)
– ~1 MB per event
– 40 million events per second
– Record ~100 events per second
– ~a billion events per year
– ~15 PB per year
Trivially parallelizable workflows.
Many users, O(1000), performing unstructured analysis.
Each analysis requires non-negligible data access (<1 TB).
Each analysis requires similar amounts of simulated (Monte Carlo) data.
(Scale illustration: a CD stack holding one year of LHC data would be ~20 km tall, next to Mt. Blanc at 4.8 km, Concorde's altitude at 15 km and a balloon at 30 km.)
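To make the scale concrete, here is a back-of-envelope check of the numbers above as a small Python calculation. The inputs are the slide's order-of-magnitude values plus an assumed ~1e7 seconds of LHC live time per year; none of this is an official experiment figure.

```python
# Order-of-magnitude check of the numbers quoted above (illustrative only).
event_size_mb = 1.0            # ~1 MB per recorded event
recorded_rate_hz = 100         # ~100 events recorded per second
live_seconds_per_year = 1e7    # assumed live time per year (order of magnitude)

events_per_year = recorded_rate_hz * live_seconds_per_year   # ~1e9 events
raw_pb_per_year = events_per_year * event_size_mb / 1e9      # MB -> PB

print(f"~{events_per_year:.0e} events/year, ~{raw_pb_per_year:.0f} PB/year of raw data")
# Consistent with the slide's ~billion events per year; the larger ~15 PB/year
# figure presumably also counts the other experiments and the reconstructed
# and simulated copies of the data, not just one raw stream.
```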

Slide 5: Computing Challenges II
HEP requirements:
– Scalable workload management system with 10,000s of jobs, 1000s of users and 100s of sites worldwide.
– Usable by non-computing experts.
– High levels of data integrity / availability.
– PBs of data storage.
– Automatic/reliable data transfers between 100s of sites, managed at a high level.
Speaking of a 120 TB data transfer, Mr DiBona, open source program manager at Google, said: "The networks aren't basically big enough and you don't want to ship the data in this manner, you want to ship it fast."
We have no choice.

Slide 6: Scope of talk
I know most about the LHC experiments, especially CMS.
There are many Grid projects/organisations/acronyms; the focus here is EGEE/gLite == (mainly) Europe.
NGS is not included, though there are plans for interoperability.
Illustrate the different approaches taken by the LHC experiments.
Attempt to give an idea of what works and what doesn't.

Slide 7: HEP approaches to the grid
As many ways to use distributed computing as there are experiments.
Differences are due to:
– Computational requirements
– Available resources (hardware/manpower)
LCG systems are used in a mix 'n' match fashion by each experiment:
– Workload management: jobs are submitted to a Resource Broker (RB), which decides where to send each job, monitors it and resubmits on failure (see the sketch below).
– Data management: similar syntax, with jobs submitted to copy files between sites; includes the concepts of transfer channels, fair shares and multiple retries. A file catalogue maps files to locations (there can be multiple instances for different domains).
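As an illustration of the workload-management pattern just described, here is a minimal Python sketch of the submit, monitor, resubmit-on-failure loop. The functions `rb_submit` and `rb_status` are hypothetical stand-ins for whatever Resource Broker client the middleware provides; this is not the experiments' actual code.

```python
import random
import time

def rb_submit(jdl_path):
    """Hypothetical stand-in: submit a JDL job description to the RB, get a job id."""
    return f"https://rb.example.org/{random.randint(0, 99999)}"

def rb_status(job_id):
    """Hypothetical stand-in: ask the RB for the current job state."""
    return random.choice(["Scheduled", "Running", "Done", "Aborted"])

def run_with_resubmission(jdl_path, max_attempts=3, poll_s=60):
    """Submit a job; if the grid aborts or loses it, resubmit up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        job_id = rb_submit(jdl_path)
        while (state := rb_status(job_id)) in ("Scheduled", "Running"):
            time.sleep(poll_s)                      # poll the broker periodically
        if state == "Done":
            return job_id
        print(f"attempt {attempt}: {job_id} ended in state {state}, resubmitting")
    raise RuntimeError(f"{jdl_path} failed after {max_attempts} attempts")
```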

Slide 8: Computing model, ATLAS (also ALICE / CMS)
(Diagram of the tiered computing model. The Event Builder and Event Filter (~7.5 MSI2k) feed the Tier-0 (~5 MSI2k, ~5 PB/year, no simulation), which fans out to ~10 Tier-1 regional centres, e.g. RAL for the UK plus US, French and Dutch centres (~2 MSI2k and ~2 PB/year each, ~75 MB/s of raw data per Tier-1 for ATLAS); the Tier-1s reprocess data and house simulation. Below them sit Tier-2 centres (~200 kSI2k and ~200 TB/year each), such as the London Tier-2 built from Imperial, RHUL, UCL and QMUL (~0.25 TIPS); each of the ~30 Tier-2s supports ~20 physicists working on one or more channels, holds the full AOD, TAG and relevant physics-group summary data, and does the bulk of the simulation and group analysis. Quoted link speeds range from 622 Mb/s Tier-2 links up to ~100 Gb/s out of the event builder; calibration and monitoring data flow out to the institutes and calibrations flow back; desktop CPUs average ~1-1.5 kSpecInt2k.)

Slide 9: MC generation
Over the last few years the experiments have conducted extensive analysis of simulated data; this required a massive effort from many people.
Only recently has large-scale, automated production with the grid been reached. It has taken a lot of work and is still not perfect.
Each experiment has its own system, and each uses LCG components in different ways:
– CMS adopts a "traditional" LCG approach, i.e. jobs go to the RB, which dispatches them to sites.
– ATLAS bypasses the RB and sends jobs directly to known "good" sites.
– LHCb implements its own system, using the RB but managing its own load balancing.

Slide 10: MC generation (CMS)
LCG submission uses the RB; there are multiple instances, which can also be multi-threaded.
Adopts an "if fail, try, try again" approach to failures.
Does not use the LCG file catalogues, due to performance/scalability concerns. Instead uses a custom system with one entry per dataset, O( GB).
(Diagram of the production system components: a User Interface through which user requests reach ProdRequest and the ProdMgr; ProdAgents that get work, run jobs on grid resources and report progress; and an Accountant. A sketch of the get-work cycle follows below.)
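A minimal sketch of the get-work / report-progress cycle suggested by the diagram above. The class and method names (`ProdMgrClient`, `get_work`, `report_progress`) are illustrative stand-ins, not the real ProdMgr/ProdAgent interfaces.

```python
# Illustrative only: not the real ProdMgr/ProdAgent code or API.
class ProdMgrClient:
    """Stand-in for the central ProdMgr holding outstanding production requests."""

    def __init__(self, request, total_events, events_per_job=1000):
        self.request = request
        self.remaining = total_events
        self.events_per_job = events_per_job

    def get_work(self, n_jobs):
        """Hand an agent up to n_jobs slices of the outstanding request."""
        jobs = []
        while self.remaining > 0 and len(jobs) < n_jobs:
            n = min(self.events_per_job, self.remaining)
            jobs.append({"request": self.request, "events": n})
            self.remaining -= n
        return jobs

    def report_progress(self, job, succeeded):
        if not succeeded:
            self.remaining += job["events"]   # put the events back to be redone
        print(f"{job['request']}: {job['events']} events "
              f"{'done' if succeeded else 'returned to the queue'}")

def agent_cycle(mgr, capacity=5):
    """One agent pass: pull work, run it on grid resources, report back."""
    for job in mgr.get_work(capacity):
        mgr.report_progress(job, succeeded=True)   # actual grid submission omitted

agent_cycle(ProdMgrClient("minbias-test", total_events=3000))
```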

Slide 11: MC generation (CMS II)

Slide 12: MC generation (CMS III)
Large-scale production round started 22 March.

Slide 13: MC generation (LHCb)
Completely custom workload management framework:
– "Pilot" jobs
– Late binding
– Pull mechanism
– Dynamic job priorities
– Single point of failure
Uses the standard LCG file tools.
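To make the pilot-job / late-binding / pull idea concrete, here is a minimal Python sketch under simplified assumptions: the central task queue is just an in-memory list, and the names are invented. This does not reflect the real API of LHCb's framework.

```python
# Simplified sketch: real pilots pull work from a remote service, not a local list.
TASK_QUEUE = [
    {"priority": 10, "payload": "mc-production-run-A"},
    {"priority": 50, "payload": "urgent-user-analysis"},  # priorities can change late
    {"priority": 10, "payload": "mc-production-run-B"},
]

def pull_highest_priority_task(queue):
    """Late binding: the payload is chosen only once a pilot is already running."""
    if not queue:
        return None
    queue.sort(key=lambda t: t["priority"], reverse=True)
    return queue.pop(0)

def pilot_main():
    """What a pilot job does after the grid has started it on a worker node."""
    while (task := pull_highest_priority_task(TASK_QUEUE)) is not None:
        print(f"pilot running payload: {task['payload']}")
    print("queue empty, pilot exits")

pilot_main()
```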

Slide 14: MC generation (LHCb II)
(Plot of LHCb MC production by site: CERN, CNAF, GRIDKA, IN2P3, NIKHEF, PIC, RAL and all sites combined.)

Slide 15: MC generation overview
A couple of different approaches.
LCG can cope with the workload requirements, but there are concerns over reliability, speed and scalability:
– Multiple RBs with multiple (multi-threaded) submitters
– Automatic retry
– ATLAS bypasses the RB and submits directly to known sites (x10 faster)
– LHCb implements its own late binding
File handling:
– Again, scalability and performance concerns over central file catalogues.
– The new LCG architecture allows multiple catalogues, but some still have concerns.
– Instead of tracking individual files, entire datasets are tracked.

Slide 16: Data analysis
Generally less developed than the MC production systems.
So far fewer jobs, but the systems need to be ready for experiment start-up.
Experiments use methodologies similar to their production systems:
– LHCb adopts a late-binding approach with pilot jobs.
– CMS submits via the Resource Broker.
Generally jobs are sent to the data.
Additional requirements beyond MC production:
– Local storage throughput of 1-5 MB/s per job
– Ease of use
– Gentle learning curve
– Pretty interface, etc.
– Sensible defaults, etc.

Slide 17: Data analysis (ATLAS/LHCb)
See the talk by Ulrik Egede.

Slide 18: Data analysis (CMS)
Standard grid model, again using the RB.
Requires a large (~4 GB) software installation at the site:
– Site provides an NFS area visible to all worker nodes
– Software installed with apt/rpm (over NFS)
– Trivial to use tar etc.
The user provides an application + configuration; CRAB creates, submits and tracks the jobs (a sketch of the job-splitting step follows below).
Output is returned to the user or stored at a site.
There are plans for a server architecture to handle retries.
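As an illustration of what "creates the jobs" involves, here is a minimal sketch of splitting a dataset into per-job event ranges. The function and parameter names are illustrative; this is not CRAB's actual implementation.

```python
# Illustrative dataset splitting, not CRAB's real code.
def split_dataset(total_events, events_per_job):
    """Return (first_event, n_events) pairs, one per grid job."""
    jobs, first = [], 0
    while first < total_events:
        n = min(events_per_job, total_events - first)
        jobs.append((first, n))
        first += n
    return jobs

# e.g. a 100k-event sample analysed 25k events per job -> 4 grid jobs
print(split_dataset(total_events=100_000, events_per_job=25_000))
```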

Slide 19: Data analysis (CMS II)
arda-dashboard.cern.ch/cms

Slide 20: Data analysis summary
More requirements on sites, which makes analysis harder for smaller sites to support.
Non-expert users cause a large user-support workload.

Slide 21: Data Transfer
Requires reliable, prioritisable, autonomous, large-scale file transfers.
LCG file transfer functionality is relatively new and still under development.
You can submit a job to a file management system that will attempt the file transfers for you.
All(?) experiments have created their own systems to provide high-level management and to overcome failures (a sketch of such a retry layer follows below).
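A minimal sketch of the kind of retry layer the experiments add on top of the low-level transfer tools. `grid_copy` is a hypothetical stand-in for the middleware copy command, not a real API, and the retry policy is invented for illustration.

```python
import time

def grid_copy(source, destination):
    """Hypothetical stand-in for a low-level grid file copy; True on success."""
    return True

def reliable_transfer(files, destination, max_rounds=3, backoff_s=300):
    """Keep retrying failed copies; return whatever still failed at the end."""
    pending = list(files)
    for round_no in range(1, max_rounds + 1):
        pending = [f for f in pending if not grid_copy(f, destination)]
        if not pending:
            return []
        print(f"round {round_no}: {len(pending)} files still failing, will retry")
        time.sleep(backoff_s)              # wait before retrying the failures
    return pending                         # left for an operator or a higher layer
```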

Slide 22: Data Transfer (CMS)
PhEDEx:
– Agents at each site connect to a central DB and receive work (transfers and deletions).
– Web-based management of the whole system. With the web interface:
– Subscribe data
– Delete data
– All from any site in the system
– Authentication with X509 certificates.
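A minimal sketch of the agent pattern just described: each site runs an agent that polls the central database for work assigned to it and reports back. The table layout, item names and function names here are invented for illustration; PhEDEx's real schema and agents look quite different.

```python
# Invented schema for illustration only.
CENTRAL_DB = [
    {"site": "T2_UK_London_IC", "action": "transfer", "item": "/SomeDataset/Block#1", "done": False},
    {"site": "T2_UK_London_IC", "action": "delete",   "item": "/OldDataset/Block#7",  "done": False},
    {"site": "T1_UK_RAL",       "action": "transfer", "item": "/SomeDataset/Block#2", "done": False},
]

def site_agent_pass(site):
    """One polling cycle of a site agent: fetch my work, do it, report completion."""
    my_work = [w for w in CENTRAL_DB if w["site"] == site and not w["done"]]
    for item in my_work:
        print(f"{site}: performing {item['action']} of {item['item']}")
        item["done"] = True        # in reality: update the central DB over the network
    return len(my_work)

site_agent_pass("T2_UK_London_IC")   # a real agent loops, sleeping between passes
```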

Slide 23: Data transfer (CMS II)

Slide 24: Data transfer (CMS III)

Slide 25: Data transfer overview
LCG provides tools for low-level (file) access and transfer.
For higher-level management (i.e. multi-TB) you need to write your own system.

Slide 26: Conclusions
The (LCG) grid is a vast computational resource ready for exploitation.
Still far from perfect:
– More failures than local resources
– Lower performance than local resources
– But probably much larger!
The lower your requirements, the more successful you will be.

Slide 27: Backup

Slide 28: Computing model II (LHCb)
Similar, but places lower resource requirements on smaller sites.
Allows uncontrolled user access to vital Tier-1 resources: a possibility for conflict.

Slide 29: MC generation (ATLAS)
Submission via the Resource Broker is slow: >5-10 seconds per job, with a limit of 10,000 jobs per day per submitter.
LCG submission therefore bypasses the RB and goes directly to the sites, with load balancing handled by experiment software.
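A minimal sketch of what "load balancing handled by experiment software" could look like when the RB is bypassed: pick a known good site directly, weighted by its free capacity. Site names, slot counts and the weighting scheme are made up for illustration and are not the ATLAS production system.

```python
import random

# Made-up list of known "good" sites and their free job slots.
GOOD_SITES = {"SITE_A": 400, "SITE_B": 250, "SITE_C": 350}

def choose_site(free_slots):
    """Pick a site at random, with probability proportional to its free slots."""
    sites = list(free_slots)
    weights = [free_slots[s] for s in sites]
    return random.choices(sites, weights=weights, k=1)[0]

print(choose_site(GOOD_SITES))   # submission then goes straight to that site
```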

Slide 30: Data Transfer (ATLAS)
Similar approach to CMS.
(Plots: throughput (MB/s) and total errors.)