Published by Augusta Marshall. Modified over 9 years ago.
Author: Andrew C. Smith

Abstract: LHCb's participation in LCG's Service Challenge 3 (SC3) involves testing the bulk data transfer infrastructure developed to allow high-bandwidth distribution of data across the grid in accordance with the computing model. To enable reliable bulk replication of data, LHCb's DIRAC system has been integrated with gLite's File Transfer Service (FTS) middleware component to make use of dedicated network links between LHCb computing centres. DIRAC's Data Management tools previously allowed the replication, registration and deletion of files on the grid. For SC3, supplementary functionality has been added to allow bulk replication of data (using FTS) and efficient mass registration to the LFC replica catalog. Provisional performance results have shown that the system developed can meet the expected data replication rate required by the computing model in 2007. This paper details the experience and results of integration and utilisation of DIRAC with the SC3 transfer machinery.

Introduction to DIRAC Data Management Architecture

The DIRAC architecture is split into three main component types:
- Services: independent functionalities deployed and administered centrally on machines accessible by all other DIRAC components
- Resources: GRID compute and storage resources at remote sites
- Agents: lightweight software components that request jobs from the central Services for a specific purpose

The DIRAC Data Management System is made up of an assortment of these components.

[Figure: DIRAC Data Management Components. Labels: Data Management Clients; UserInterface; WMS; TransferAgent; ReplicaManager; FileCatalogA; FileCatalogB; FileCatalogC; StorageElement; SE Service; SRMStorage; GridFTPStorage; HTTPStorage; Physical storage.]

Main components of the DIRAC Data Management System:

Storage Element
- abstraction of GRID storage resources
- actual access by specific plug-ins: srm, gridftp, bbftp, sftp and http are supported
- namespace management, file upload/download, deletion etc.
Replica Manager
- provides an API for the available data management operations
- point of contact for users of data management systems
- removes direct operation with Storage Element and File Catalogs
- uploading/downloading files to/from a GRID SE, replication of files, file registration, file removal

File Catalog
- standard API exposed for a variety of available catalogs
- allows redundancy across several catalogs

LHCb Transfer Aims During SC3

The extended Service Phase of SC3 was intended to allow the experiments to test their specific software and validate their computing models using the machinery provided. LHCb's data replication goals during SC3 can be summarised as:
- Replication of ~1 TB of stripped DST data from CERN to all Tier-1s.
- Replication of 8 TB of digitised data from CERN/Tier-0 to the participating LHCb Tier-1 centres in parallel.
- Removal of 50k replicas (via LFN) from all Tier-1 centres.
- Moving 4 TB of data from Tier-1 centres to Tier-0 and to other participating Tier-1 centres.

Integration of DIRAC with FTS

The SC3 replication machinery utilised gLite's File Transfer Service (FTS):
- lowest-level data movement service defined in the gLite architecture
- offers reliable point-to-point bulk transfers of physical files (SURLs) between SRM-managed SEs
- accepts source-destination SURL pairs
- assigns file transfers to a dedicated transfer channel
- takes advantage of the networking between CERN and the Tier-1s
- routing of transfers is not provided

A higher-level service is required to resolve SURLs and hence decide on routing. The DIRAC Data Management System was employed to do these tasks.
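Because FTS only accepts ready-made source-destination SURL pairs, the resolution step has to happen before submission. The following is a minimal sketch of that step, not the DIRAC implementation: the function name, the shape of the replica dictionary, the example hosts and the rule that a target SURL is the target endpoint followed by the LFN are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def resolve_surl_pairs(
    replicas: Dict[str, Dict[str, str]],
    source_se: str,
    target_endpoint: str,
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Build (source SURL, target SURL) pairs for a bulk transfer.

    replicas maps each LFN to {SE name: SURL} (hypothetical layout).
    target_endpoint is the base SURL of the target SE (hypothetical).
    Returns the resolved pairs and the LFNs with no replica at source_se.
    """
    pairs: List[Tuple[str, str]] = []
    unresolved: List[str] = []
    for lfn, surls_by_se in replicas.items():
        source_surl = surls_by_se.get(source_se)
        if source_surl is None:
            # No replica at the requested source SE: cannot build a pair.
            unresolved.append(lfn)
            continue
        # Assumption: the target SURL is the endpoint followed by the LFN.
        target_surl = target_endpoint.rstrip("/") + "/" + lfn.lstrip("/")
        pairs.append((source_surl, target_surl))
    return pairs, unresolved


if __name__ == "__main__":
    catalog = {
        "/lhcb/a.dst": {"SRC": "srm://src.example.org/data/lhcb/a.dst"},
        "/lhcb/b.dst": {"OTHER": "srm://other.example.org/data/lhcb/b.dst"},
    }
    print(resolve_surl_pairs(catalog, "SRC", "srm://dst.example.org/store/"))
```

Returning the unresolved LFNs separately keeps the bulk submission from failing as a whole when a single file has no replica at the chosen source.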
Integration requirements:
- new methods developed in the Replica Manager (previous Data Management operations were single-file and blocking)
- bulk operation functionality added to the Transfer Agent/Request
- monitoring of asynchronous FTS jobs required
- information for monitoring stored within the Request DB entry

[Figure: integration of the LHCb DIRAC DMS with the LCG SC3 machinery. Labels: LCG SC3 Machinery; Transfer network; LHCb DIRAC DMS; Request DB; File Catalog Interface; Transfer Manager Interface; Replica Manager; Transfer Agent; LCG File Catalog; File Transfer Service; Tier0 SE; Tier1 SE A; Tier1 SE B; Tier1 SE C.]
Once the Transfer Agent obtains the Request XML file:
- replica information for the LFNs is obtained
- replicas are matched against the source SE and target SE
- SURL pairs are resolved using endpoint information
- SURL pairs are then submitted via the FTS Client
- the FTS GUID and other information on the job are stored in the XML file

[Figure: job submission flow. Steps: Obtain Job Param; Resolve PFNs; Resolve SURL Pairs; Submit FTS Job; Update DB with Job Info; Update Monitoring with Job Info. Components: Request DB; Transfer Agent; FTS Client; LCG File Catalog; DIRAC Config Svc; DIRAC Monitoring; Replica Manager.]

[Figure: job monitoring flow. Steps: Obtain Job Info; Get Job Status; Resolve Failed or Succeeded; Update Request and Monitoring; If Job Terminal: Register Completed Files, Send Accounting Data, Resubmit Failed Files to Request. Components: Request DB; Transfer Agent; FTS Client; LCG File Catalog; DIRAC Accounting; DIRAC Monitoring; Replica Manager.]

The Transfer Agent is executed periodically using ‘runit’ daemon scripts:
- replication request information is retrieved from the Request DB
- the status of the FTS job is obtained via the FTS Client
- the status (active, done, failed) of individual files is obtained
- the Request XML file is updated
- monitoring information is sent to allow web-based tracking

If the FTS job has reached a terminal state:
- completed files are registered in the file catalogs
- failed files are constructed into a new replication request
- accounting information is sent to allow bandwidth measurements

Performance Obtained During T0-T1 Replication

[Figure: transfer rate (MB/s, axis 0 to 60) against date (9/10/05 to 6/11/05) for the channels CERN_Castor -> RAL_dCache-SC3, CERN_Castor -> PIC_Castor-SC3, CERN_Castor -> SARA_dCache-SC3, CERN_Castor -> IN2P3_HPSS-SC3, CERN_Castor -> GRIDKA_dCache-SC3 and CERN_Castor -> CNAF_Castor-SC3. Annotations: Required Rate; Many Castor 2 Problems; Service Intervention; SARA Problems.]

A combined rate of 40 MB/s from CERN to the 6 LHCb Tier-1s was required to meet the SC3 goals:
- the aggregated daily rate was obtained
- the overall SC3 machinery was not completely stable
- the target rate was not sustained over the required period
- peak rates of 100 MB/s were observed over several hours

A rerun of the exercise is planned to demonstrate the required rates.

Tier1–Tier1 Replication Activity On-Going

During T0-T1 replication, FTS was found to be most efficient when replicating files pre-staged on disk.
- dedicated disk pools were set up at T1 sites for seed files
- 1.5 TB of seed files were transferred to dedicated disk
- FTS servers were installed by T1 sites
- channels were set up directly between sites

Replication activity is ongoing with this exercise. The current status of this setup is shown below.

Site     FTS server   Channel management
PIC      No           Channels managed by source SE
IN2P3    Yes          Manages incoming channels
CNAF     Yes          Manages incoming channels
FZK      Yes          Manages incoming channels
RAL      Yes          Manages incoming channels
SARA     Yes          Manages incoming channels

Bulk File Removal Operations

Bulk removal of files was performed on completion of the T0-T1 replication.
- the bulk operation of ‘srm-advisory-delete’ was used
- it takes a list of SURLs and ‘removes’ the physical files
- functionality was added to the Replica Manager and Storage Element
- additions were required for the SRM Storage Element plug-in
- Replica Manager SURL resolution tools were reused

Different interpretations of the SRM standard have led to different underlying behaviour between SRM solutions.
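Since the bulk removal operation takes a list of SURLs, the replicas to delete have to be gathered per storage element before any call is made. The sketch below shows one way to do that grouping and to cut each list into fixed-size batches. It is an illustration under stated assumptions: the function names, the (SE name, SURL) input layout, the example hosts and the idea of batching are hypothetical and are not taken from DIRAC or from the SRM client tools.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def group_surls_by_se(replicas: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (SE name, SURL) records into one SURL list per SE."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for se_name, surl in replicas:
        grouped[se_name].append(surl)
    return dict(grouped)


def batches(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    to_remove = [
        ("SE_A", "srm://a.example.org/lhcb/f1"),
        ("SE_B", "srm://b.example.org/lhcb/f1"),
        ("SE_A", "srm://a.example.org/lhcb/f2"),
    ]
    for se_name, surls in group_surls_by_se(to_remove).items():
        for batch in batches(surls, 2):
            # A real agent would pass each batch to the SE's bulk delete here.
            print(se_name, batch)
```

Grouping by SE first also gives a natural unit of work if the removal is later split across several agents.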
Initially, bulk removal operations were executed by a single central agent:
- the SC3 goal of 50K replicas in 24 hours was shown to be unattainable

Several parallel agents were then instantiated, each performing physical and catalog removal for a specific SE:
- 10K replicas were removed from 5 sites in 28 hours
- a performance loss was observed in replica deletion on the LCG FC (see below)
- unnecessary SSL authentications are CPU intensive
- remedied by ‘sessions’ when performing multiple catalog operations

[Figure: time for 100 replica removals (s, axis 0 to 450) against removal phase. Phase 1: RAL. Phase 2: GRIDKA, IN2P3. Phase 3: GRIDKA, IN2P3, CNAF, PIC. Phase 4: GRIDKA, IN2P3, CNAF, PIC, RAL.]

Operation of DIRAC Bulk Transfer Mechanics

The DIRAC integration with FTS was deployed on a centrally managed machine at CERN, which serviced all data replication jobs for SC3.

Lifetime of a bulk replication job:
- bulk replication requests are submitted to the DIRAC WMS as a JDL file with an input sandbox containing an XML file
- the XML contains important parameters, i.e. LFNs and source/target SE
- the DIRAC WMS populates the Request DB of the central machine with the XML
- the Transfer Agent polls the Request DB periodically for ‘waiting’ requests
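The Transfer Agent's handling of a polled request, once its FTS job has finished, can be sketched as follows. This is a simplified illustration and not the DIRAC code: the function name, the dictionary layout of the request and the boolean `job_is_terminal` flag are assumptions, and only the per-file states named on the poster (active, done, failed) are used.

```python
from typing import Dict, List, Optional


def handle_polled_job(
    job_is_terminal: bool,
    file_states: Dict[str, str],
    source_se: str,
    target_se: str,
) -> Optional[dict]:
    """Decide what to do with one monitored FTS job.

    file_states maps each LFN to 'active', 'done' or 'failed'.
    Returns None while the job is still running. Otherwise returns the
    LFNs to register in the file catalogs and, if any file failed, a new
    replication request (hypothetical layout) holding only those files.
    """
    if not job_is_terminal:
        return None  # keep monitoring on the next agent cycle

    completed: List[str] = sorted(
        lfn for lfn, state in file_states.items() if state == "done"
    )
    failed: List[str] = sorted(
        lfn for lfn, state in file_states.items() if state == "failed"
    )

    new_request = None
    if failed:
        # Failed files are rebuilt into a fresh request for resubmission.
        new_request = {
            "source_se": source_se,
            "target_se": target_se,
            "lfns": failed,
            "status": "waiting",
        }
    return {"register": completed, "resubmit": new_request}


if __name__ == "__main__":
    states = {"/lhcb/a.dst": "done", "/lhcb/b.dst": "failed"}
    print(handle_polled_job(True, states, "SE_SRC", "SE_DST"))
```

Marking the rebuilt request as ‘waiting’ means the same polling loop picks it up again, so retries need no separate code path.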