LHCb on the Grid A Tale of many Migrations

LHCb on the Grid A Tale of many Migrations Raja Nandakumar

LHCb computing model
- CERN (Tier-0) is the hub of all activity
  - Full copy at CERN of all raw data and DSTs
  - All T1s have a full copy of the DSTs
- Simulation at all possible sites (CERN, T1, T2)
  - LHCb has used about 120 sites on 5 continents so far
- Reconstruction, stripping and analysis at T0 / T1 sites only
  - Some analysis may be possible at “large” T2 sites in the future
- Almost all the computing (except for development / tests) will be run on the grid
  - Large productions: production team
  - Ganga (DIRAC) grid user interface

LHCb storage on Tier-1
- LHCb storage primarily on the Tier-1s and CERN
- RAL dCache scheduled to stop in May 2008
- Scale of storage usage in Oct 2007
  - Disk: ~75 TB
  - Tape: ~45 TB, 104 tapes in total
- Required storage on CASTOR provided on demand by Tier-1
- CASTOR stable by Sept 2007
  - v2.1.4 (in Sept 2007)
  - Great work done by Tier-1 in getting CASTOR operational
- A few delays from LHCb side in working on migration
  - CHEP ‘07, DIRAC3 development issues, etc.
- LHCb data migration began in Nov 2007

LHCb disk migration
- Disk migration began ~20 Nov 2007
  - Completed ~20 Dec 2007
  - A few minor issues / configuration problems, quickly solved
  - All disk servers finally released in Feb 2008
  - Files with transfer errors were re-transferred
- Files transferred using FTS between dCache and CASTOR
  - DIRAC interface for LHCb, to automatically register files after transferring
  - Bulk LFC cleanup after groups of files are transferred
- Peak rates of about 100 MB/s from dCache to CASTOR
  - Also depended on number of different servers on each end
- Mostly smooth operations
  - Problems: minor, human
- LHCb running off CASTOR since mid Jan 2008
  - A few worries, but mostly fine
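The transfer, register and clean-up cycle described on this slide can be sketched roughly as follows. This is a minimal illustration, not the DIRAC or FTS code: the function `migrate_in_groups` and its `transfer`, `register` and `cleanup` callables are hypothetical stand-ins for the FTS submission, the catalogue registration and the bulk LFC clean-up, and the group size is an arbitrary assumption.

```python
from typing import Callable, Iterable, List


def migrate_in_groups(
    lfns: Iterable[str],
    transfer: Callable[[str], bool],        # hypothetical: True if the copy succeeded
    register: Callable[[str], None],        # hypothetical: record the new replica
    cleanup: Callable[[List[str]], None],   # hypothetical: bulk catalogue clean-up
    group_size: int = 100,                  # assumed value, not from the slides
) -> List[str]:
    """Transfer files one by one, register each success, clean up per group.

    Returns the files whose transfer failed so they can be re-transferred.
    """
    failed: List[str] = []
    group: List[str] = []
    for lfn in lfns:
        if transfer(lfn):
            register(lfn)
            group.append(lfn)
        else:
            failed.append(lfn)
        # Clean up the old catalogue entries once a full group has moved.
        if len(group) >= group_size:
            cleanup(group)
            group = []
    if group:
        cleanup(group)
    return failed


# Toy run: file "c" fails and is returned for a later retry.
registered: List[str] = []
cleaned: List[List[str]] = []
to_retry = migrate_in_groups(
    ["a", "b", "c", "d", "e"],
    transfer=lambda lfn: lfn != "c",
    register=registered.append,
    cleanup=lambda files: cleaned.append(list(files)),
    group_size=2,
)
```

Returning the failed files rather than raising mirrors the slide's note that files with transfer errors were simply re-transferred.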

LHCb tape migration
- Tape migration first tried ~7 Dec 2007
  - In full scale in Jan 2008
- Many iterations to get current procedure
  - Tape staging by hand
    - FTS does not automatically stage files on tape
    - Automatic staging of files not practical
  - Problems with coordination of tape staging (RAL) and FTS job submission (CERN)
    - Files getting wiped off dCache before transfer
  - Now staging two tapes at a time
    - Wait for FTS jobs to complete before staging next tapes
- ~50 tapes to go still …
- Fine when it is running
  - ~60 MB/s transfer rates
  - ~4 hours to stage a tape
  - ~2 hours to transfer it to CASTOR
  - ~2 tapes a day
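The quoted tape figures can be cross-checked with a few lines of arithmetic. The sketch below assumes decimal units (1 TB = 10^12 bytes), assumes the ~45 TB on 104 tapes quoted for Oct 2007 is representative of the tapes being migrated, and assumes the ~2 tapes a day rate holds for the remaining ~50 tapes.

```python
MB = 1e6
GB = 1e9
TB = 1e12

# Average content of one tape, from ~45 TB spread over 104 tapes.
avg_per_tape = 45 * TB / 104

# Data moved while transferring one tape: ~60 MB/s for ~2 hours.
moved_per_tape = 60 * MB * 2 * 3600

# Time to finish at ~2 tapes a day with ~50 tapes to go.
days_left = 50 / 2

print(f"average tape content : {avg_per_tape / GB:.0f} GB")
print(f"moved in 2 h at 60 MB/s: {moved_per_tape / GB:.0f} GB")
print(f"days to finish       : {days_left:.0f}")
```

Under these assumptions the two per-tape estimates agree to within about 1 GB, so the quoted rate and transfer time are mutually consistent, and the remaining tapes would take roughly 25 days.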

LHCb storage on Tier-1
- LHCb running off CASTOR now
  - Ignore data only in dCache tape
  - Transfers into CASTOR running fine
- Currently use srm-v1 for official production
  - srm-v2 used in CCRC08
- Critical service for LHCb
  - File replication using gLite FTS
  - TURLs retrieved via gfal for access via available site protocols: root, rfio, dcap, gsidcap
  - RDST output file upload to local Tier-1 SE via lcg-utils / gfal
  - File removal using gfal
- Tier-0,1 Storage Elements providing SRM2 spaces:
  - LHCb_RAW (T1D0), LHCb_RDST (T1D0)
  - LHCb_M-DST (T1D1), LHCb_DST (T0D1)
  - LHCb_FAILOVER (T0D1): used for temporary upload in case of destination unavailability
- Testing of srm-v2 is a key part of CCRC’08 for LHCb
  - To be used for all production jobs as soon as DIRAC3 is in full production mode
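The TxDy labels attached to the SRM spaces follow the WLCG storage-class shorthand, where x is the number of tape copies and y the number of disk copies. A minimal Python sketch that decodes the labels listed on this slide; the helper `parse_storage_class` and the `LHCB_SPACES` dictionary are illustrative names, not part of DIRAC or any SRM client.

```python
import re
from typing import NamedTuple


class StorageClass(NamedTuple):
    tape_copies: int
    disk_copies: int


def parse_storage_class(label: str) -> StorageClass:
    """Decode a 'TxDy' label into its tape and disk copy counts."""
    match = re.fullmatch(r"T(\d)D(\d)", label)
    if match is None:
        raise ValueError(f"not a TxDy storage class: {label!r}")
    return StorageClass(int(match.group(1)), int(match.group(2)))


# Space tokens and classes as listed on the slide.
LHCB_SPACES = {
    "LHCb_RAW": "T1D0",
    "LHCb_RDST": "T1D0",
    "LHCb_M-DST": "T1D1",
    "LHCb_DST": "T0D1",
    "LHCb_FAILOVER": "T0D1",
}

# Spaces with no tape copy, i.e. disk-only storage.
disk_only = [
    name
    for name, label in LHCB_SPACES.items()
    if parse_storage_class(label).tape_copies == 0
]
print(disk_only)
```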

LHCb on RAL-CASTOR
- A few issues running off CASTOR
  - Hang when too many jobs run off data on a single server
    - A single server can currently support ~200 LSF job slots
    - A single job can have 1-3 files open on the server
    - Can easily have 200 jobs running
    - Currently kill all the file requests to restore the server (Bonny / Shaun / Chris)
  - Need for more CASTOR monitoring tools
  - Will be good to have rootd / xrootd also
  - Pausing of jobs during downtime (works at CERN)
    - Jobs should pick up from where they were, when they are restarted
- A few non-CASTOR problems too
  - Backplanes replacement / fire hazard
  - Power down due to transformer shutdown
- RAL first to publish CASTOR storage to the IS
  - Very very useful
- Overall service: stable

LHCb on the Grid
- DIRAC is LHCb’s interface to the grid
  - Written mostly in Python
  - Pilot agent paradigm
  - Fine grained visibility of grid to the jobs and DIRAC3 servers
- DIRAC3
  - Re-writing of DIRAC using 4 years of experience
  - Main ideas / framework retained
  - Many changes in algorithms and implementations
    - Security: authentication & logging for all operations
    - Separate out generic and LHCb-specific modules
    - Better designed to support more options: srm v2, gLite WMS, generic pilots, …
    - Job throttling, job prioritisation, generic pilots, …

[Diagram: DIRAC workload management. Regions: DIRAC services; LCG; workload on WN. Components shown: Job Receiver, Data Optimizer, Task Queue, Sandbox, JobDB, Agent Director, Agent Monitor, Matcher, WMS Admin, Job Monitor, LFC, RB / WMS, CE, SE, VO-box, Pilot Job, Pilot Agent, Job Wrapper, User Application. Labelled calls: checkData, checkJob, getReplicas, checkPilot, getProxy, getSandbox, uploadData, putRequest, execute (glexec), fork.]
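The pilot agent paradigm is a pull model: a pilot that has started on a worker node describes what it can offer and asks a central queue for a suitable job. The sketch below is a deliberately simplified illustration of that idea, not DIRAC code. The classes `Job`, `Pilot` and `TaskQueue`, and the matching criteria (site and remaining CPU time), are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    job_id: int
    priority: int          # higher value runs first
    site: Optional[str]    # None means the job may run at any site
    cpu_seconds: int       # CPU time the job is expected to need


@dataclass
class Pilot:
    site: str
    cpu_seconds_left: int  # CPU time still available in the batch slot


class TaskQueue:
    """Holds waiting jobs and hands the best match to a requesting pilot."""

    def __init__(self) -> None:
        self._jobs: List[Job] = []

    def add(self, job: Job) -> None:
        self._jobs.append(job)

    def match(self, pilot: Pilot) -> Optional[Job]:
        # Keep only jobs this pilot can run where it is, in the time it has.
        candidates = [
            job
            for job in self._jobs
            if job.site in (None, pilot.site)
            and job.cpu_seconds <= pilot.cpu_seconds_left
        ]
        if not candidates:
            return None
        # Highest priority wins; ties go to the job queued first.
        best = max(candidates, key=lambda job: job.priority)
        self._jobs.remove(best)
        return best


queue = TaskQueue()
queue.add(Job(job_id=1, priority=1, site=None, cpu_seconds=100))
queue.add(Job(job_id=2, priority=5, site="CERN", cpu_seconds=100))
queue.add(Job(job_id=3, priority=3, site="RAL", cpu_seconds=10_000))

ral_pilot = Pilot(site="RAL", cpu_seconds_left=500)
first = queue.match(ral_pilot)    # job 1: job 2 is CERN-only, job 3 needs too much CPU
second = queue.match(ral_pilot)   # nothing else fits this pilot
```

Because the job is chosen only after the pilot is already running, the decision is made with up-to-date knowledge of the slot, which is the "fine grained visibility" the slides refer to.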

LHCb and DIRAC3
- DIRAC3 status: still under development
  - Major parts already running
  - Used in CCRC08
  - So far, successful testing of simulation and reconstruction workflows
  - User analysis to be integrated with DIRAC3 once it is stable in production
- Bookkeeping in DIRAC3
  - Bookkeeping not to be stand alone
  - Will be a set of fully integrated services
  - New: effort from Ireland in the web / user interface of bookkeeping
  - Critical for Monte Carlo analysis and, in future, for data analysis

DIRAC3 monitoring
- Note the authentication at top right
  - Not needed for browsing the jobs
  - Needed to perform actions

LHCb and CCRC08
Planned tasks: test the LHCb computing model
1. Raw data distribution from pit to T0 centre
   - Use of rfcp into CASTOR from pit - T1D0
2. Raw data distribution from T0 to T1 centres
   - Use of FTS - T1D0
3. Reconstruction of raw data at CERN & T1 centres
   - Production of rDST data - T1D0
   - Use of SRM 2.2
4. Stripping of data at CERN & T1 centres
   - Input data: RAW & rDST - T1D0
   - Output data: DST - T1D1
   - Use of SRM 2.2
5. Distribution of DST data to all other centres
   - Use of FTS - T0D1 (except CERN T1D1)
First three tasks successfully accomplished
Tests of stripping workflow within DIRAC3 ongoing

Data distribution

Site      CPU share (%)   RAW share (%)
CERN      14              100
GridKa    11              12
IN2P3     25              29
CNAF      9
NIKHEF    26              30
PIC       4               5
RAL       13

RAW data transfers
- Data transfers mimicked LHC data taking
  - 6 hours on, 6 hours off
  - A few pauses for software upgrades
- Peak of 125 MB/s (Feb 12th)
  - Nominal rate: 70 MB/s
- Some Tier-1 problems seen
- 0 checksum mismatches
- No problems seen at RAL!
[Plot: Tier-1 transfers. Source: S. Paterson, CCRC F2F meeting]
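As a rough worked example of what the duty cycle implies, the sketch below converts the quoted rates into a daily volume. It assumes the nominal 70 MB/s applies only during the "on" periods, that "6 hours on, 6 hours off" means 12 hours of transfer per day, and that units are decimal (1 TB = 10^12 bytes).

```python
MB = 1e6
TB = 1e12
SECONDS_PER_HOUR = 3600

nominal_rate = 70 * MB      # bytes per second while transferring
peak_rate = 125 * MB        # highest rate seen (Feb 12th)
hours_on_per_day = 12       # 6 hours on, 6 hours off

daily_volume = nominal_rate * hours_on_per_day * SECONDS_PER_HOUR
average_rate = daily_volume / (24 * SECONDS_PER_HOUR)

print(f"daily volume at nominal rate: {daily_volume / TB:.2f} TB")
print(f"rate averaged over 24 h     : {average_rate / MB:.0f} MB/s")
print(f"peak / nominal              : {peak_rate / nominal_rate:.2f}")
```

Under these assumptions the exercise moves about 3 TB of RAW data per day, equivalent to 35 MB/s sustained around the clock.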

FTS performance
- Histograms of time between a file being Assigned and Transferred to the LHCb Tier-1s (minutes)
  - FTS submit / monitor / done cycle
- Most sites show stable behaviour
Source: S. Paterson, CCRC F2F meeting

CCRC’08 LHCb issues
- Automatic job submission successfully demonstrated
- Problems setting up job workflows
  - Had planned to run 23K jobs over 2 weeks
- CPU time underestimated for reconstruction
  - Many jobs overshot wall time and were killed
  - Systematic over all Tier-1s
- DIRAC3 servers downtime
  - Lost connection to running jobs
  - Not possible to recover many jobs
- Configuration service unstable
  - Needed to be restarted regularly
- Need for backup / failover systems

CCRC’08 site issues
- Problems on dCache sites
  - Configuration of timeouts on gsidcap ports
  - dCache not releasing reserved space even if files are deleted
  - Data transfer and access problems due to load on pnfs server (IN2P3)
  - Problem with Solaris servers at SARA
- CERN: AFS instabilities
- CNAF: low CASTOR LSF slots per server
- RAL: no problems
Source: S. Paterson, CCRC F2F meeting

Upcoming schedule
- March
  - Introduce stripping workflow
  - Consider usage of xrootd protocol
    - Helps with flickering data access
- April
  - Migration of GridPP2+ to GridPP3
- May
  - 4 weeks of running at nominal rate
  - If possible, include analysis
    - Possibly using generic pilots
    - Pending approval and deployment of glexec
    - Ganga (migrating to v5!) for analysis job submission
- Later
  - Data taking …