Experiment Operations: ALICE Report WLCG GDB Meeting, CERN 14th October 2009 Patricia Méndez Lorenzo, IT/GS-EIS.


Outlook
- Summary of the last GDB report in one slide
- Status of the CREAM-CE deployment
- CREAM-CE: site admins and developers feedback
- Status of the gLite3.2 VOBOX: deployment status and issues
- Status of the SL(C)5 deployment
- Summary
14/10/09 ALICE Report: Experiment Operations

Summary of the last GDB report in one slide
- During the last GDB (09/09/09), ALICE presented the following plans:
  - Deployment of SL5 on all WNs and VOBOXES (T0, T1, T2 sites). DEADLINE: mid-September 2009
    → Self-imposed, following the MB recommendation to push T1/T2 sites to migrate to SL5
  - Deployment of a CREAM-CE at all sites (T0, T1, T2 sites). DEADLINE: mid-November 2009
    → Following the milestones defined for the CREAM-CE deployment
- This talk updates the status of these plans

Status of the CREAM-CE Deployment
- Sites providing a CREAM-CE to ALICE production:
  - CERN, KISTI, INFN-Torino, CNAF, RAL, FZK, Kolkata, IHEP, SARA, Subatech, Legnaro, SPbSU, Prague
  - Same situation as presented one month ago
  - At these sites ALICE is running in stable production mode
  - New sites providing a CREAM-CE will be put in production immediately
- Situation of CC-IN2P3:
  - CREAM integration with BQS (local batch system) done
  - ALICE local support is testing the system
  - Next step: deployment of the system in production for ALICE testing

CREAM-CE: Deployment status
- CREAM1.5 was deployed in production on 6 October (patch #3259 for SLC4/i386)
  - Patch details:
  - This version includes important bug fixes for the ALICE production (next slide)
  - It also contains fixes for two vulnerability reports:
    - Associated with security issues for the sites
- Migration of sites to CREAM1.5 is highly encouraged, and ALICE fully supports it at all sites providing this service to the experiment
- ALICE would like to acknowledge the excellent support and assistance provided by the CREAM-CE developer team, in particular by Massimo Sgaravatto

CREAM-CE: site admins and developers feedback (I)
- Purge issues:
  - ALICE REPORT: Wrong reporting of job status. CREAM's view of running jobs becomes desynchronized
  - ALICE REQUIREMENT: A method to purge jobs in a non-terminal status
  - CREAM STATUS: The CREAM job status can be wrongly reported because of misconfigurations or because of these two bugs in the BLAH BLParser (Solution Status: integration), candidates for CREAM1.6:
    - #55078: « Possible final state not considered in BLParserPBS and BUpdaterPBS »
    - #54949: « Some job can remain in running state when BLParser is restarted for both lsf and pbs »
  - Finally, a specific bug covers the ALICE requirement:
    - #55420: « Allow admin to purge CREAM jobs in a non terminal status » (Solution Status: in progress)
  - IMPORTANT NOTE: The script provided to the ALICE site admins is in fact the bug fix. As soon as the relevant patch is released, the script will be part of the CREAM rpm
  - CURRENT RISK FOR ALICE: Low now that the developers have provided site admins with the corresponding purge script (very high before)
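
The desynchronization described above can be pictured as a set difference between what the CE believes is running and what the batch system actually knows about. The following is only a minimal sketch under that assumption: the function name and the job identifiers are hypothetical, and it does not reproduce the purge script provided by the CREAM developers or call any CREAM interface.

```python
def find_desynchronized_jobs(ce_running, batch_running):
    """Return job IDs the CE reports as running that the batch system does not know."""
    return sorted(set(ce_running) - set(batch_running))


# Hypothetical job identifiers, for illustration only.
ce_view = ["job-001", "job-002", "job-003", "job-004"]
batch_view = ["job-002", "job-004"]

# Jobs in this list are candidates for an administrative purge.
stale = find_desynchronized_jobs(ce_view, batch_view)
print(stale)
```

How the two lists are obtained is site specific and deliberately left out of the sketch.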

CREAM-CE: site admins and developers feedback (II)
- DISK SPACE issues (reported by Subatech): areas to monitor and purge or clean
  - ALICE REPORT: The local MySQL DB has grown to 2.5 GB
    - CREAM STATUS: Issue associated with the MySQL engine. When entries are deleted from the DB, the corresponding disk space is not released (so the CREAM DB does not shrink), but the space is reused when new data is added to the DB
    - CURRENT RISK FOR ALICE: low
  - ALICE REPORT: Purge of the input sandboxes in /opt/glite/var/cream_sandbox
    - CREAM STATUS: Solved in CREAM1.5
      - #48144: « Problems with purge in CREAM when the mapped group name is different than the VO name »
    - RISK FOR ALICE: none once sites upgrade to CREAM1.5
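
Since these areas have to be monitored, a site admin might want a small check of how much disk each one uses. The sketch below is one possible way to do that, not a tool shipped with CREAM. The sandbox path comes from the slide; the MySQL data directory `/var/lib/mysql` and the 2 GB limit are assumptions to adapt locally.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of all files below path (symbolic links are not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                # The file vanished or is unreadable; skip it.
                pass
    return total


def over_threshold(paths, limit_bytes):
    """Return {path: size} for existing directories larger than limit_bytes."""
    report = {}
    for path in paths:
        if os.path.isdir(path):
            size = directory_size_bytes(path)
            if size > limit_bytes:
                report[path] = size
    return report


if __name__ == "__main__":
    # First path is from the slide; the second is an assumed MySQL data directory.
    watched = ["/opt/glite/var/cream_sandbox", "/var/lib/mysql"]
    limit = 2 * 1024 ** 3  # hypothetical 2 GB alert level
    for path, size in over_threshold(watched, limit).items():
        print(f"{path}: {size / 1024 ** 3:.2f} GB")
```

Directories that do not exist on the node are silently ignored, so the same list can be used on different hosts.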

CREAM-CE: site admins and developers feedback (III)
- DISK SPACE issues (cont.)
  - ALICE REPORT: Issues regarding /opt/glite/var/log and /var/log
  - ALICE REQUIREMENT: A cleaning policy is required for these files; otherwise they can grow without limit
  - CREAM STATUS: Policies exist for all these files and can be customized file by file:
    - Only the BLAH accounting log files are outside the CREAM developers' control (files cannot be deleted before they have been processed by the accounting system)
    - For /opt/glite/var/log/glite-ce-cream.log and /opt/glite/var/log/glite-ce-monitor.log, the policy is defined in /var/lib/tomcat5/webapps/ce-cream/WEB-INF/classes/log4j.properties and the default values can be changed
      - Relevant info under:
    - For /opt/glite/var/log/glite-xxxparser.log, the policy is available under /opt/logrotate.d/glite-xxxparser
    - /etc/logrotate.d/globus-gridftp manages the gridftp log files under /var/log
  - RISK FOR ALICE: low, since the size is manageable by site admins
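
To see which rotation policy is currently in force for the log4j-managed files, the properties file can be inspected for the standard log4j 1.x `RollingFileAppender` options `MaxFileSize` and `MaxBackupIndex`. The sketch below assumes that such an appender is used; the appender name `fileout` and the values in the sample are hypothetical and are not taken from the actual CREAM configuration.

```python
def rolling_policy(properties_text):
    """Collect File / MaxFileSize / MaxBackupIndex per appender from log4j 1.x properties."""
    policy = {}
    for raw in properties_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        parts = key.split(".")
        # Keys of interest have the form: log4j.appender.<name>.<option>
        if len(parts) == 4 and parts[:2] == ["log4j", "appender"]:
            name, option = parts[2], parts[3]
            if option in ("File", "MaxFileSize", "MaxBackupIndex"):
                policy.setdefault(name, {})[option] = value
    return policy


# Hypothetical excerpt: the appender name and the values are illustrative only.
SAMPLE = """\
log4j.appender.fileout=org.apache.log4j.RollingFileAppender
log4j.appender.fileout.File=/opt/glite/var/log/glite-ce-cream.log
log4j.appender.fileout.MaxFileSize=1000KB
log4j.appender.fileout.MaxBackupIndex=20
"""

print(rolling_policy(SAMPLE))
```

With a rolling appender, the worst-case disk use of one log is roughly `MaxFileSize` times (`MaxBackupIndex` + 1), which gives site admins a bound to plan against.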

CREAM-CE: site admins and developers feedback (IV)
- DISK SPACE issues (cont.)
  - ALICE REPORT: Issues regarding /opt/glite/var/cream/user_proxy
  - CREAM STATUS: Bug reported and accepted; fix not available in CREAM1.5
    - #49497: « User proxies on CREAM do not get cleaned up »
    - CREAM developers could increase the priority of this bug if needed
  - RISK FOR ALICE: waiting for feedback from site admins
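
Until the fix is released, site admins who want to quantify the problem could list proxy files that have not been touched for a long time. This is a report-only sketch that deletes nothing; the directory is the one named on the slide, while the 30-day age limit is an arbitrary assumption, and whether an old file is actually safe to remove has to be judged locally.

```python
import os
import time


def files_older_than(directory, max_age_days, now=None):
    """Return files under directory whose modification time is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    stale.append(path)
            except OSError:
                # The file disappeared between listing and stat; ignore it.
                continue
    return sorted(stale)


if __name__ == "__main__":
    proxy_dir = "/opt/glite/var/cream/user_proxy"  # path from the slide
    if os.path.isdir(proxy_dir):
        for path in files_older_than(proxy_dir, max_age_days=30):  # assumed limit
            print(path)
```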

CREAM-CE: site admins and developers feedback (V)
- LOAD issues (reported by Subatech):
  - ALICE REPORT: UNIX load going up to 5 (during startup or at a high submission rate)
    - CREAM STATUS: Problem reported by GRNET; its origin was a missing index in the CREAM DB
      - #52876: « The extra_attribute table in the CREAM DB has no key/indexes defined », solved in CREAM1.5
    - RISK FOR ALICE: low once the CREAM version is upgraded
  - ALICE REPORT: When tomcat is restarted, the system can take up to 15 min before submitting new jobs
    - CREAM STATUS: The slow start of CREAM is also due to the problems caused by jobs reported in a wrong status
      - #51978: « CREAM can be slow to start », bug in progress, not included in CREAM1.5 but to be released in CREAM1.6
    - RISK FOR ALICE: Purge actions should speed up this startup and therefore decrease the risk for the experiment
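
Why a missing index drives the load up can be illustrated with a small, self-contained experiment. The sketch uses an in-memory SQLite database as a stand-in for the CREAM MySQL DB; only the table name `extra_attribute` comes from the bug title, while the column names, the data and the index name are hypothetical.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the textual query plan SQLite would use for the statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the human-readable plan step.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
# Hypothetical columns; the real CREAM schema is not shown in the slides.
conn.execute("CREATE TABLE extra_attribute (jobId TEXT, name TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO extra_attribute VALUES (?, ?, ?)",
    [(f"job-{i}", "attr", str(i)) for i in range(1000)],
)

lookup = "SELECT value FROM extra_attribute WHERE jobId = ?"

# Without an index every lookup scans the whole table.
before = query_plan(conn, lookup, ("job-500",))

# With an index the same lookup becomes an index search.
conn.execute("CREATE INDEX idx_extra_attribute_job ON extra_attribute (jobId)")
after = query_plan(conn, lookup, ("job-500",))

print("before:", before)
print("after: ", after)
```

The full scan costs time proportional to the table size on every lookup, which is consistent with the load appearing at startup and under high submission rates, when many lookups happen at once.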

CREAM-CE: Summary of issues
- All reported issues are:
  → Solved in CREAM1.5, OR
  → Known by the developers, who are working on solutions to include in CREAM1.6, OR
  → Covered by workarounds that the developers have provided to ALICE

gLite3.2 VOBOX: Deployment
- « Pre-PPS » version announced at the beginning of October
  - Patch #3205 (SL5 VOBOX) and #3040 (WMS UI fixes) installed at CERN for testing purposes
  - Put in production for ALICE to gain familiarity with the system and to submit to CREAM-CE
  - Experience used to help other sites while installing the system
- Instructions provided by GD and distributed to sites through the ALICE TF, specifying:
  - It is a testing patch, meant to gain familiarity with the system and its installation
- New rpm issues stopped the deployment of the patch for almost 4 weeks
  - ALICE sites have been kept informed of every advance and issue
  - Two more sites had already migrated the VOBOX: KISTI and ITEP
    - No actions were taken and we tested the new system
- The patch has been in production since yesterday
  - ALL SITES ARE NOW ENCOURAGED TO UPGRADE ALL VOBOXES

gLite3.2 VOBOX: Issues
- Thanks to the available gLite3.2 VOBOXES, the experiment was able to find a compatibility problem associated with the libs
  - The libs needed by the new VOBOX for its normal operations (proxy renewal, etc.) conflicted with those provided by AliEn
  - Status: SOLVED
    - Changes in the environment setup before and after any VOBOX-specific operation
- Further issues: NONE
  - The current gLite3.2 VOBOX should not show any further problems for the experiment or the site admins
  - Very easy installation and configuration
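
The general technique of changing the environment before an operation and restoring it afterwards can be sketched with a context manager. This is an illustration of the idea only, not the change actually made in AliEn; the library path in the usage example is hypothetical.

```python
import os
from contextlib import contextmanager


@contextmanager
def temporary_env(**overrides):
    """Set environment variables for the duration of a block, then restore them.

    Passing None as a value removes the variable inside the block.
    """
    saved = {name: os.environ.get(name) for name in overrides}
    try:
        for name, value in overrides.items():
            if value is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = value
        yield
    finally:
        # Put every touched variable back exactly as it was.
        for name, old in saved.items():
            if old is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = old


if __name__ == "__main__":
    # Hypothetical library path, for illustration only.
    with temporary_env(LD_LIBRARY_PATH="/hypothetical/vobox/lib"):
        # A VOBOX-specific command launched here as a child process would
        # inherit the modified LD_LIBRARY_PATH.
        print(os.environ["LD_LIBRARY_PATH"])
```

Note that changing `LD_LIBRARY_PATH` this way affects child processes started inside the block, not libraries already loaded by the running Python process.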

Status of the SL(C)5 Deployment (I)
- Status of the ALICE sites with regard to the SL5 deployment

Status of the SL(C)5 Deployment (II)
- Hybrid (worse) situation: WNs in SL4/32b and SL5/64b
  - This is the situation ALICE wanted to avoid in September
- The experiment can manage this situation if and only if:
  - 2 VOBOXES are provided (each one running an independent PackMan service), and this is independent of CREAM!
  - 2 different software areas (per VOBOX, per cluster) are provided
    - The sw version changes with the architecture
  - The support infrastructure has to be doubled
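
The cost of a hybrid setup can be made explicit with a small counting rule. The sketch below assumes one VOBOX, one PackMan service and one software area per distinct WN flavour (OS and architecture); this generalizes the two-flavour case on the slide and is an interpretation, not an official ALICE sizing rule. The function name is hypothetical.

```python
def hybrid_site_requirements(wn_flavours):
    """Count the services needed, assuming one of each per distinct WN flavour."""
    flavours = sorted(set(wn_flavours))
    count = len(flavours)
    return {
        "flavours": flavours,
        "voboxes": count,
        "packman_services": count,
        "software_areas": count,
    }


# The hybrid case from the slide: two flavours, so everything is doubled.
print(hybrid_site_requirements([("SL4", "32b"), ("SL5", "64b")]))

# A homogeneous site needs a single set of services.
print(hybrid_site_requirements([("SL5", "64b"), ("SL5", "64b")]))
```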

Useful info for the sites
- [Table: columns SL4, SL5, 32b, 64b, with X marks per site configuration and the labels "→ 2 VOBOXES" and "→ 1 VOBOX"; the cell layout is not recoverable]
- VOBOX multiplicity is defined by the architecture
- A 32b VOBOX can work with 64b WNs, although this is not optimal (compatibility issues)
- Therefore a site with SL4/64b and SL5/64b will ideally require 2 VOBOXES

Summary
- All sites providing a CREAM-CE to ALICE are encouraged to migrate to CREAM1.5 asap
- The gLite3.2 VOBOX is now in production and sites should also migrate their nodes immediately
- ALICE maintains the requirement that WNs migrate to SL5
- Independently of CREAM, sites are encouraged to provide homogeneous setups for ALICE in terms of WNs and VOBOXES
  - This will make life easier for site admins and the experiment
- And now is the time!