GRID interoperability and operation challenges under real load for the ALICE experiment

Presentation transcript:

GRID interoperability and operation challenges under real load for the ALICE experiment
F. Carminati, L. Betev, P. Saiz, F. Furano, P. Méndez Lorenzo, A. Grigoras, C. Grigoras, S. Bagnasco, A. Peters, D. Saiz, O. Datskova, S. Schreiner, L. Sehoon, J. Zhu
CHEP 2010, Taipei, October 18, 2010

Overview
The Challenge:
– The software infrastructure created by ALICE in the past ten years had to respond to the 1st LHC data taking
– AliEn, WLCG services, support and operation across more than 60 sites all over the world would finally be stressed
– These items will be analyzed in the next 15 minutes

1st ACTOR: AliEn
AliEn version during the 1st data taking: v2.18
– Implementation of a large number of features intended to scale up the number of concurrent jobs for the fundamental ALICE activities: Pass 1 & 2 reconstruction, calibration, MC production and user analysis
– Among all the newly implemented features, two are particularly important for the Grid sites and the end users (a sketch of the SE discovery idea follows after this slide):
  – Job and File Quotas
    – Limit on the available resources per user
    – Jobs: number of jobs, cpuCost, running time
    – Files: number of files, total size (including replicas)
  – Automatic storage element discovery
    – Finding the closest working SEs of a given QoS for optimal, configuration-free writing from jobs; the discovery is based on MonALISA monitoring information (topology, status, etc.)
    – Sorting replicas so that reading is done from the closest available one
    – Simplifying the selection of SEs and adding more options in case of special needs
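A minimal sketch of the storage element discovery idea described above, assuming a simplified view of the MonALISA monitoring data. It is not the actual AliEn implementation (which is written in Perl); the StorageElement fields, the SE names and the discover_ses() helper are illustrative only.

```python
# Illustrative sketch of SE discovery: filter the SEs by requested QoS and
# health status, then sort by "distance" to the client as reported by
# monitoring. All names and values below are hypothetical.
from dataclasses import dataclass

@dataclass
class StorageElement:
    name: str
    qos: str          # e.g. "disk" or "tape"
    is_working: bool  # health flag derived from MonALISA-style monitoring
    distance: float   # topological/network distance to the client site

def discover_ses(all_ses, wanted_qos, n_copies):
    """Return the n closest working SEs of the requested QoS."""
    candidates = [se for se in all_ses
                  if se.qos == wanted_qos and se.is_working]
    candidates.sort(key=lambda se: se.distance)
    return candidates[:n_copies]

# Example: pick two disk SEs for writing the outputs of a job.
ses = [
    StorageElement("ALICE::CERN::SE", "disk", True,  0.1),
    StorageElement("ALICE::FZK::SE",  "disk", True,  0.4),
    StorageElement("ALICE::CNAF::SE", "disk", False, 0.3),  # skipped: not working
    StorageElement("ALICE::NDGF::SE", "disk", True,  0.6),
]
print([se.name for se in discover_ses(ses, "disk", 2)])
# -> ['ALICE::CERN::SE', 'ALICE::FZK::SE']
```

The same ordering, applied to the replicas of an existing file, gives the "read from the closest available replica" behaviour mentioned above.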

WLCG services in AliEn v2.18
ALICE approach for the past two years:
– Decrease the number of services needed at the sites in terms of workload management
  – Deprecation of the gLite-WMS service
  – Requirement of CREAM-CE deployment at all sites (since 2008)
– Reinforcement (failover strategies) of the required WLCG services at each site
  – Applicable to gLite 3.2 VOBOXes and CREAM-CE
– Reinforcement of the ALICE solutions for data transfers
  – Deprecation of the FTS service for transfers
  – xrootd is the I/O and data movement solution chosen by ALICE
    – Applied to T0-T1 transfers since January 2010
The approach can be summarized as:
– Simplification of the local infrastructures
– Homogeneous solutions for all sites (differences between T1 and T2 sites only in QoS terms)
– Flexible relations between all sites
It has demonstrated good performance considering:
– The number of sites is still increasing
– Grid activity is growing
– Limited manpower to follow all services

WLCG Services: CREAM-CE
ALICE included the CREAM-CE in the experiment production environment in summer 2008
– 2009: dual approach: parallel submission to LCG-CE and CREAM-CE at each site
– 2010: submission to CREAM-CE only
  – Direct submission mode; CLI implemented in AliEn through a specific CREAM module
  – Redundant approach at all sites: if several CREAM-CEs are available at a site, a random submission approach has been included in AliEn to ensure balanced submission among all CREAM-CEs (a sketch of the idea follows after this slide)
– Issues
  – Serious instabilities found this summer in both the CREAM-CE DB and the resource BDII
    – All queries to the CREAM-CE DB have been removed in the upcoming AliEn v2.19 version
    – Sites have to ensure reliable BDII publication
– General evaluation of the service
  – Very good results in terms of performance and scalability
  – Very positive support provided by the CREAM-CE developer team
gLite-WMS has been deprecated at all sites since 1 January 2010
– Only CERN still uses it (around 20 LCG-CEs vs. 3 CREAM-CEs)
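A minimal sketch of the balanced submission mentioned above, assuming a site that declares several CREAM-CE endpoints. The real logic lives inside the AliEn CREAM submission module (Perl); the endpoint names and the pick_ce() helper here are hypothetical.

```python
# Uniform random choice among the CREAM-CEs of a site: over many job-agent
# submissions the load spreads out evenly without any extra bookkeeping.
import random
from collections import Counter

site_creams = [  # hypothetical endpoints
    "cream-ce-1.example-site.org:8443/cream-pbs-alice",
    "cream-ce-2.example-site.org:8443/cream-pbs-alice",
    "cream-ce-3.example-site.org:8443/cream-pbs-alice",
]

def pick_ce(ces):
    """Pick the CREAM-CE to submit the next job agent to."""
    return random.choice(ces)

# Sanity check: counts converge to roughly one third per CE.
print(Counter(pick_ce(site_creams) for _ in range(3000)))
```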

CREAM-CE vs. LCG-CE
[Plots: ALICE production through CREAM-CEs only, and ALICE production through the LCG-CE only (at CERN)]
– CREAM-CE production: average 9878 jobs, peak … jobs
– LCG-CE production (CERN only): average 454 jobs, peak 2484 jobs

WLCG Services: gLite-VOBOX
The dual submission approach (CREAM-CE and LCG-CE) implemented by ALICE in 2009 required the deployment of a 2nd VOBOX at each site
The 2010 approach foresees a single submission backend
– The 2nd VOBOX is not needed anymore
– ALICE rescue approach: set up the 2nd VOBOX in failover mode, with redundant local AliEn services running on both VOBOXes (a sketch of the idea follows after this slide)
  – Available at many sites, not only T1s
General evaluation of the service
– The gLite-VOBOX is by far the most stable WLCG service for ALICE
– ALICE support members participate together with the IT-GT team in the deployment of new versions
– Occasional issues at the sites are identified and solved within a few minutes
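A hedged illustration of the failover idea for the local AliEn services: the same services run on two VOBOXes and work is directed to the first one that responds. The liveness probe (a plain TCP connect), the host names and the port are assumptions; the real mechanism is internal to AliEn.

```python
import socket

VOBOXES = ["voalice-1.example-site.org", "voalice-2.example-site.org"]  # hypothetical
SERVICE_PORT = 8084  # hypothetical port of a local AliEn service

def is_alive(host, port, timeout=3.0):
    """Crude liveness probe: can a TCP connection to the service be opened?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def active_vobox():
    """Return the first responsive VOBOX, preferring the primary."""
    for host in VOBOXES:
        if is_alive(host, SERVICE_PORT):
            return host
    raise RuntimeError("no VOBOX reachable: site effectively offline")
```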

Raw Data transfers: Procedure
Tx-Ty data transfers are performed via xrootd
– This includes the T0-T1 raw data transfers
– 3rd-party copy (xrd3cp) enabled (see the sketch after this slide)
[Diagram: raw data flows from the DAQ into CASTOR at the T0; the electronic logbook and a MySQL DB keep information about each raw data file (good runs, run conditions, transfer completed, automatic Pass 1 reconstruction, transfers to T1, storage status, quotas per site); FTD drains the AliEn transfer queue, copying a limited number of files at a time with xrd3cp towards the T1 sites (FZK, CNAF, NDGF, …); the AliEn Catalogue and the MonALISA repository are also shown]
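A minimal sketch of what a single FTD transfer amounts to: a third-party copy that moves the file directly from the T0 buffer to a T1 storage element, without routing the data through the scheduling machine. The exact xrd3cp command line and the URLs are assumptions for illustration; the tool must be available in PATH for this to run.

```python
import subprocess

def third_party_copy(src_url: str, dst_url: str) -> bool:
    """Run one third-party copy and report whether it exited successfully."""
    # Assumed invocation; check the xrootd/AliEn documentation for the real syntax.
    result = subprocess.run(["xrd3cp", src_url, dst_url])
    return result.returncode == 0

if __name__ == "__main__":
    ok = third_party_copy(
        "root://castor.cern.ch//alice/raw/run000123/raw_0001.root",    # hypothetical
        "root://alice-se.gridka.de//alice/raw/run000123/raw_0001.root",
    )
    print("transfer", "completed" if ok else "failed")
```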

T0-T1 raw data transfers
The number of "channels" (concurrent transfers) opened by FTD is centrally controlled by AliEn and limited to 200 concurrent transfers in total
– The number of files transferred concurrently to each T1 site is defined by the resources that T1 provides to ALICE
  – These numbers were presented by the experiment during the SC3 exercise and have not changed since
– The ALICE infrastructure prevents possible abusive usage: before submitting more transfers, the monitoring information is checked (see the sketch below)
  – Status of previous transfers, SE usage and availability, bandwidth usage per SE cluster, etc.
– General evaluation:
  – A homogeneous infrastructure is in place across all sites
  – No network abuse or issues have been reported by any site since the start of LHC data taking
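A hedged sketch of the central throttling described above: a global cap of 200 concurrent transfers, per-T1 shares, and a monitoring check before a new transfer is opened. The per-site share values and the can_submit() interface are invented for illustration; the real bookkeeping is done centrally by AliEn using MonALISA information.

```python
GLOBAL_LIMIT = 200  # total concurrent transfers allowed by FTD/AliEn

# Hypothetical per-T1 channel shares, proportional to the resources each T1
# provides to ALICE (the real numbers date back to the SC3 exercise).
T1_CHANNELS = {"FZK": 60, "CNAF": 50, "NDGF": 40}
assert sum(T1_CHANNELS.values()) <= GLOBAL_LIMIT

def can_submit(site, active_per_site, se_available, bandwidth_ok):
    """Decide whether one more transfer may be opened towards `site`."""
    total_active = sum(active_per_site.values())
    return (total_active < GLOBAL_LIMIT
            and active_per_site.get(site, 0) < T1_CHANNELS[site]
            and se_available          # destination SE reported healthy
            and bandwidth_ok)         # link not saturated according to monitoring

# Example check before queueing one more file towards FZK.
active = {"FZK": 58, "CNAF": 45, "NDGF": 12}
print(can_submit("FZK", active, se_available=True, bandwidth_ok=True))  # True
```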

T0-T1 data transfers profile
[Plot: ALICE raw data transfer rates as published in Gridview; the displayed information is underestimated]
– ALICE raw data transfers performed via xrootd are monitored together with those of the other 3 LHC experiments in Gridview
– Some issues still prevent the complete publication of the ALICE transfers in Gridview; ALICE is working together with the developers to complete this publication

ALICE Operations
Pass 1 reconstruction
– Quasi-online, follows the registration of the RAW data in …
– Raw data fully reconstructed after 24h
  – Average 5h per job; 95% of the runs processed after 10h and 99% after 15h
– Reconstruction efficiency around 98%
Pass 2 reconstruction
– ~1 month after data taking
– Updated software, updated conditions
  – Improved detector calibration from the Pass 1 reconstruction ESDs (calibration trains)
Analysis
– Chaotic (end-user) analysis performed on the Grid with high stability and performance
– Internal ALICE prioritization applied for end-user analysis
[Plots: raw data recorded during the period; 365 different users during this period]

Sites news
T0 site
– Increase of the CASTOR capacity for ALICE (both disk and tape) up to 2.3 PB, ready before the heavy-ion data taking (5 Nov – 6 Dec)
– Good support in terms of the WMS services; good CREAM-CE behaviour achieved
– CAF facility: after some instabilities during the setup of the system, a very good collaboration with the system managers was established; steady operation
– Software area (AFS): split of readable/writable volumes to improve access to AFS; ALICE is planning to avoid the use of the software area everywhere
T1 sites
– Steady operation
T2 sites
– Minor issues immediately solved together with the ALICE contact persons at the sites
– Bandwidth issues with new incoming sites: creation of a Bandwidth Task Force together with the CERN network experts

Summary and Conclusions
Simplification of the Grid infrastructures at the sites
– No differences foreseen between T1 and T2 sites
– Homogeneous solutions for all sites (independent of the middleware stacks)
Grid operation is now fairly routine
– Good collaboration established with the service developers (CREAM-CE) and managers (network, nodes, etc.)
Grid issues
– Better control of service upgrades
– Control and follow-up of the local services (services are still manpower intensive)
– Network studies for new incoming sites
New AliEn v2.19 foreseen before the end of 2010
– Reinforcement of the CREAM-CE submission modules