
1 GRID interoperability and operation challenges under real load for the ALICE experiment
F. Carminati, L. Betev, P. Saiz, F. Furano, P. Méndez Lorenzo, A. Grigoras, C. Grigoras, S. Bagnasco, A. Peters, D. Saiz, O. Datskova, S. Schreiner, L. Sehoon, J. Zhu
CHEP 2010, Taipei, October 18, 2010

2 Overview
THE Challenge:
– The software infrastructure created by ALICE over the past ten years had to respond to the 1st LHC data taking
– AliEn, the WLCG services, and the support and operation across more than 60 sites all over the world would finally be stressed
– These items are analyzed in the next 15 minutes

3 1st ACTOR: AliEn
AliEn version during the 1st data taking: v2.18
– Implementation of a large number of features intended to scale up the number of concurrent jobs for the fundamental ALICE activities: Pass 1 & 2 reconstruction, calibration, MC production and user analysis
– Among the newly implemented features, two are particularly important for the Grid sites and the end users:
  Implementation of Job and File Quotas
  – Limits on the available resources per user
  – Jobs: number of jobs, cpuCost, running time
  – Files: number of files, total size (including replicas)
  Automatic storage element discovery (see the sketch below)
  – Finding the closest working SEs of a given QoS for optimal, configuration-free writing from jobs; the discovery is based on MonALISA monitoring information (topology, status, etc.)
  – Sorting replicas so that reads go to the closest available one
  – Simplifying the selection of SEs and adding more options for special needs
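A minimal sketch of the storage-element discovery and replica-sorting logic described above, assuming a simplified view of the MonALISA monitoring data; the data model, field names, SE names and distance metric are illustrative and not the actual AliEn/MonALISA interfaces:

# Illustrative sketch only: the data model and field names below are assumptions,
# not the actual AliEn / MonALISA interfaces.
from dataclasses import dataclass

@dataclass
class StorageElement:
    name: str
    qos: str                 # e.g. "disk" or "tape"
    working: bool            # status as reported by monitoring
    network_distance: float  # smaller = topologically closer to the job

def discover_ses(monitored_ses, required_qos, count=2):
    """Return the closest working SEs of the requested QoS."""
    candidates = [se for se in monitored_ses
                  if se.working and se.qos == required_qos]
    candidates.sort(key=lambda se: se.network_distance)
    return candidates[:count]

def sort_replicas(replicas, distance_of):
    """Order replica locations so the closest available one is read first."""
    return sorted(replicas, key=distance_of)

if __name__ == "__main__":
    ses = [
        StorageElement("ALICE::CERN::SE", "disk", True, 0.1),
        StorageElement("ALICE::FZK::SE",  "disk", True, 0.7),
        StorageElement("ALICE::CNAF::TAPE", "tape", True, 0.8),
        StorageElement("ALICE::NDGF::SE", "disk", False, 0.3),
    ]
    print([se.name for se in discover_ses(ses, "disk")])

The point of the sketch is that job configuration never names an SE explicitly: the selection is derived entirely from monitoring data, which is what makes the writing configuration-free.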

4 WLCG services in AliEn v2.18
ALICE approach for the past two years:
– Decrease the number of services needed at the sites for workload management
  Deprecation of the gLite-WMS service
  Requirement of CREAM-CE deployment at all sites (since 2008)
– Reinforcement (failover strategies) of the required WLCG services at each site
  Applicable to gLite 3.2 VOBOXes and CREAM-CE
– Reinforcement of the ALICE solutions for data transfers
  Deprecation of the FTS service for transfers
  Xrootd is the I/O and data-movement solution chosen by ALICE, applicable to T0-T1 transfers since January 2010
– The approach can be summarized as:
  Simplification of the local infrastructures
  Homogeneous solutions for all sites (T1 and T2 sites differ only in QoS terms)
  Flexible relations between all sites
– It has demonstrated good performance considering that:
  The number of sites is still increasing
  Grid activity is growing
  Limited manpower is available to follow all services

5 WLCG Services: CREAM-CE
ALICE included CREAM-CE in the experiment production environment in summer 2008
– 2009: Dual approach: parallel submission to LCG-CE and CREAM-CE at each site
– 2010: Submission to CREAM-CE only
  Direct submission mode: CLI implemented in AliEn through a specific CREAM module
– Redundant approach at all sites
  If several CREAM-CEs are available at a site, a random submission approach has been included in AliEn to ensure balanced submission among all CREAM-CEs (see the sketch below)
– Issues
  Serious instabilities found this summer in both the CREAM-CE DB and the resource BDII
  – All queries to the CREAM-CE DB are removed in the next AliEn v2.19 version
  – Sites have to ensure reliable BDII publication
– General evaluation of the service
  Very good results in terms of performance and scalability
  Very positive support provided by the CREAM-CE developer team
gLite-WMS has been deprecated at all sites since 1st January 2010
– Only CERN still uses it (around 20 LCG-CEs vs. 3 CREAM-CEs)
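A minimal sketch of the balanced random submission among several CREAM-CEs described above; the CE endpoints and the liveness check are illustrative assumptions, not the actual AliEn CREAM module:

# Illustrative sketch only: the CE list and the health check are assumptions,
# not the actual AliEn CREAM submission module.
import random

def pick_cream_ce(cream_ces, is_responding):
    """Choose one CREAM-CE at random among the responding ones, so that
    repeated submissions spread evenly across all CEs of a site."""
    alive = [ce for ce in cream_ces if is_responding(ce)]
    if not alive:
        raise RuntimeError("no CREAM-CE currently available at this site")
    return random.choice(alive)

if __name__ == "__main__":
    ces = ["cream-ce01.example.org:8443/cream-pbs-alice",
           "cream-ce02.example.org:8443/cream-pbs-alice"]
    # Pretend both endpoints respond; in reality this would be a status query.
    print(pick_cream_ce(ces, lambda ce: True))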

6 2010: CREAM-CE vs. LCG-CE
[Plots: ALICE production through CREAM-CEs only, and ALICE production through LCG-CE only (at CERN)]
– CREAM-CE production: average 9878 jobs, peak 26194 jobs
– LCG-CE production (CERN only): average 454 jobs, peak 2484 jobs

7 WLCG Services: gLite-VOBOX
The dual submission approach (CREAM-CE vs. LCG-CE) implemented by ALICE in 2009 required the deployment of a 2nd VOBOX at each site
The 2010 approach foresees a single submission backend
– The 2nd VOBOX is not needed anymore
– ALICE rescue approach: setup of the 2nd VOBOX in failover mode (a sketch of the idea follows below)
  Redundant local AliEn services running on both VOBOXes
  Available at many sites, not only T1s
General evaluation of the service
– The gLite-VOBOX is by far the most stable WLCG service for ALICE
– ALICE support members participate together with the IT-GT team in the deployment of new versions
– Occasional issues at the sites are identified and solved within a few minutes
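A minimal sketch of the failover idea described above, assuming a simple reachability probe to decide which VOBOX to use; the probe, host names and port are illustrative assumptions, not the actual AliEn/VOBOX service layout:

# Illustrative sketch only: the probe, port and host names are assumptions,
# not the actual AliEn VOBOX service implementation.
import socket

def vobox_alive(host, port=8084, timeout=5.0):
    """Rough liveness probe: can we open a TCP connection to the VOBOX service port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def select_vobox(primary, secondary):
    """Use the primary VOBOX; fall back to the secondary one if the primary is unreachable."""
    if vobox_alive(primary):
        return primary
    if vobox_alive(secondary):
        return secondary
    raise RuntimeError("neither VOBOX is reachable")

if __name__ == "__main__":
    try:
        print(select_vobox("voalice-primary.example.org", "voalice-backup.example.org"))
    except RuntimeError as exc:
        print("failover failed:", exc)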

8 Raw Data transfers: Procedure
Tx-Ty data transfers performed via xrootd
– This includes the T0-T1 raw data transfers
– 3rd party copy (xrd3cp) enabled
[Diagram: DAQ, CASTOR, electronic logbook, AliEn catalogue, MonALISA repository (MySQL DB), AliEn transfer queue / FTD, and xrd3cp transfers to the T1s (FZK, CNAF, NDGF, …); a limited number of files is transferred at a time; the DB keeps, for each raw data file: good runs, transfer completed, run conditions, automatic pass 1 reconstruction, transfers to T1, storage status, quotas per site]

9 T0-T1 raw data transfers
The number of "channels" (concurrent transfers) opened by FTD is centrally controlled by AliEn and limited to 200 concurrent transfers in TOTAL
– The number of files transferred concurrently to each T1 site is defined by the resources that the T1 provides to ALICE
  These numbers were presented by the experiment during the SC3 exercise and have not been changed since
– The ALICE infrastructure prevents possible abusive usage (see the sketch below)
  Before submitting more transfers, the monitoring information is checked
  – Status of previous transfers, SE usage and availability, bandwidth usage per SE cluster, etc.
– General evaluation:
  Homogeneous infrastructure put in place across all the sites
  No network abuses or issues have been reported by any site since the start-up of LHC data taking
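A minimal sketch of the channel accounting described above, assuming a global cap of 200 concurrent transfers, hypothetical per-T1 shares and a monitoring check before each new submission; this is an illustration, not the actual FTD implementation:

# Illustrative sketch only: the per-site shares and the monitoring check are
# assumptions, not the actual AliEn/FTD transfer scheduler.
GLOBAL_LIMIT = 200  # total concurrent transfers allowed by AliEn

# Hypothetical per-T1 channel shares (illustrative numbers only).
SITE_SHARE = {"FZK": 60, "CNAF": 60, "NDGF": 40, "CCIN2P3": 40}

def can_start_transfer(active_per_site, site, se_is_healthy):
    """Decide whether one more transfer to `site` may be started."""
    total_active = sum(active_per_site.values())
    if total_active >= GLOBAL_LIMIT:
        return False                      # global cap reached
    if active_per_site.get(site, 0) >= SITE_SHARE.get(site, 0):
        return False                      # this T1's share is used up
    return se_is_healthy(site)            # monitoring check before submitting

if __name__ == "__main__":
    active = {"FZK": 60, "CNAF": 12, "NDGF": 5}
    print(can_start_transfer(active, "FZK",  lambda s: True))   # False: share used up
    print(can_start_transfer(active, "CNAF", lambda s: True))   # True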

10 T0-T1 data transfers profile
[Plot: ALICE transfer rates as published in Gridview; the displayed values underestimate the real transfers]
– ALICE raw data transfers performed via xrootd are monitored together with those of the other 3 LHC experiments in Gridview
– Some issues still prevent the complete publication of the ALICE transfers in Gridview; ALICE is working together with the developers to complete this publication

11 ALICE Operations
Pass 1 reconstruction
– Quasi-online, follows the registration of RAW data in CASTOR@CERN
– Raw data fully reconstructed within 24h
  Average of 5h per job; 95% of the runs processed within 10h and 99% within 15h
– Reconstruction efficiency around 98%
Pass 2 reconstruction
– ~1 month after data taking, at the T1s
– Updated software, updated conditions
  Improved detector calibration from the Pass 1 reconstruction ESDs (calibration trains)
Analysis
– Chaotic analysis performed on the Grid with high stability and performance
  Internal ALICE prioritization applied for end-user analysis
  365 different users during this period
[Plot: raw data recorded in CASTOR@CERN]

12 Site news
T0 site
– Increase of the CASTOR capacity for ALICE (both disk and tape) up to 2.3 PB, ready before the HI data taking (5 Nov – 6 Dec)
– Good support in terms of the WMS services
  Achieved good CREAM-CE behavior
– CAF facility
  After some instabilities during the setup of the system, a very good collaboration with the system managers was established; steady operation
– Software area: AFS
  Split of readable/writable volumes to improve access to AFS
  ALICE is planning to avoid the usage of the software area (everywhere)
T1 sites
– Steady operation
T2 sites
– Minor issues immediately solved together with the ALICE contact persons at the sites
– Bandwidth issues with new incoming sites
  Creation of a Bandwidth Task Force together with the CERN network experts

13 Summary and Conclusions
Simplification of the Grid infrastructures at the sites
– No differences foreseen between T1 and T2 sites
– Homogeneous solutions for all sites (independently of the middleware stacks)
Grid operation is now fairly routine
– Good collaboration established with the service developers (CREAM-CE) and managers (network, nodes, etc.)
Grid issues
– Better control of service upgrades
– Control and follow-up of the local services (services are still manpower intensive)
– Network studies for new incoming sites
New AliEn v2.19 foreseen before the end of 2010
– Reinforcement of the CREAM-CE submission modules

