CERN Using the SAM framework for the CMS specific tests Andrea Sciabà System Analysis WG Meeting 15 November, 2007.

Outline
- SAM for CMS
- CMS SAM tests
- Using SAM with OSG sites
- SAM and VOMS
- SAM and OSG
- Issues
- Plans

SAM for CMS
- Why SAM?
  - SAM is explicitly developed to run periodic sanity checks on Grid (and experiment) services
- How can it be used?
  - Relying on ops test results
    - The easiest option, done for years
  - Running some standard tests under the CMS VO
    - e.g. to spot problems occurring only with VOs other than ops
  - Running custom CMS tests
    - The most effective option

Using ops tests as critical tests
- CMS has long used some ops tests as critical tests
  - Job submission
  - CA certs version
  - csh test
- The failure of any of these tests is definitely a serious problem!

Using CMS custom tests in SAM
- A CMS instance of the SAM client is installed at CERN
- Tests are submitted every two hours to "real" CMS sites
- The SAM framework makes it easy to plug in new tests for existing sensors
  - Added to the "testjob" sensor -> run on the worker node

Test name: What it does
- basic: Checks that the CMS software area is defined and exists, and that the CMS site local configuration file is correct
- swinst: Checks that the required versions of the CMS software are correctly installed
- Monte Carlo: Checks that the stage-out of a file from the WN to the local SE is working correctly
- Squid: Discovers the name of the Squid server from the local site configuration file and makes a simple query through it
- FroNtier: Reads calibration data using CMSSW via the local Squid server
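As a rough illustration of what a worker-node check in the spirit of "basic" could look like, here is a minimal Python sketch. It is not the actual CMS test: the environment variable name `VO_CMS_SW_DIR`, the relative path of the site local configuration file, and the "ok"/"error" status strings are all assumptions made for the example. It only checks that the file is present, not that its content is correct.

```python
import os


def basic_check(env, sw_dir_var="VO_CMS_SW_DIR",
                config_relpath="SITECONF/local/JobConfig/site-local-config.xml"):
    """Return (status, message) for a 'basic'-style check.

    env is a mapping such as os.environ. The variable name and the
    relative path of the configuration file are assumptions.
    """
    sw_dir = env.get(sw_dir_var)
    if not sw_dir:
        return "error", f"{sw_dir_var} is not defined"
    if not os.path.isdir(sw_dir):
        return "error", f"software area {sw_dir} does not exist"
    config = os.path.join(sw_dir, config_relpath)
    if not os.path.isfile(config):
        return "error", f"site local configuration file {config} not found"
    return "ok", "software area and site configuration file found"
```

On a worker node this would be called as `basic_check(os.environ)`.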

SAM and VOMS roles
- Different tests may need different VOMS roles
- The /cms/Role=lcgadmin role is preferred because
  - It allows writing in the experiment software area
  - It has a higher priority at sites
- However, the /cms/Role=production role is needed for the "Monte Carlo" test
  - To take advantage of any write access privileges granted only to that role
- Solution
  - It is necessary to submit two jobs for every CE instead of one
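The reason two jobs per CE are needed can be sketched as a grouping of tests by the role they must run under. The Python below is only an illustration: the function and constant names are hypothetical, and the SAM client does not necessarily organise its submissions this way.

```python
from collections import defaultdict

# Role used unless a test needs a different one (from the slide).
DEFAULT_ROLE = "/cms/Role=lcgadmin"
# Tests that must run under another role (from the slide).
ROLE_OVERRIDES = {"Monte Carlo": "/cms/Role=production"}


def jobs_per_ce(tests):
    """Group test names by VOMS role; each role needs its own job."""
    jobs = defaultdict(list)
    for test in tests:
        jobs[ROLE_OVERRIDES.get(test, DEFAULT_ROLE)].append(test)
    return dict(jobs)


plan = jobs_per_ce(["basic", "swinst", "Monte Carlo", "Squid", "FroNtier"])
```

With the five custom tests listed earlier, `plan` has two entries, one per role, hence two jobs for every CE.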

EGEE and OSG
- Job submission is done using the LCG Resource Broker for both EGEE and OSG
  - For EGEE sites it must work by definition
  - For OSG sites it requires some effort
    - The site must be in the central EGEE BDII to be in the SAM database: OK
    - The CA certs and CRLs must be kept up to date: OK
    - The lcgadmin and production roles must be supported: OK
    - The middleware installed on the OSG WNs must be "friendly" to the LCG job wrapper: OK
- The SAM tests run nicely on OSG!
  - After an initial phase in which lots of problems were found and fixed, job submission problems are now rather infrequent

Description of the SRM tests (I)
- SRM-v1-get-pfn-from-tfc
  - Given the SE name, looks in the TMDB for the corresponding lfn-to-pfn rule for the test LFN /store/unmerged/SAM/testSRMv1_070628_
  - Returns a warning if, in the TMDB, transfers go to an SRM different from the input node
  - Returns an error if it could not map to a PFN
- SRM-v1-put
  - Copies a test file with srmcp from the UI to the PFN
  - Retries are handled by the script, not by srmcp
  - Returns an error if srmcp fails
  - Returns a warning if the pfn-from-tfc test could not map to a PFN
    - It is not an SRM failure if the catalog lacks the right information!
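The SRM-v1-put logic (retries done by the script, error on copy failure, warning when no PFN is available) can be sketched in Python as follows. This is a hedged illustration, not the actual test script: the function names, the number of attempts, the delay and the status strings are assumptions.

```python
import subprocess
import time


def run_with_retries(command, attempts=3, delay=0.0, runner=subprocess.run):
    """Run a command up to `attempts` times; True on the first zero exit code."""
    for attempt in range(1, attempts + 1):
        result = runner(command)
        if result.returncode == 0:
            return True
        if attempt < attempts:
            time.sleep(delay)  # wait before the next attempt
    return False


def put_test_status(pfn, copy):
    """Map the outcome of a put-style test to a status string.

    pfn is None when the lfn-to-pfn mapping failed; copy is a callable
    that takes the PFN and returns True if the copy succeeded.
    """
    if pfn is None:
        return "warning"  # catalog problem, not an SRM failure
    return "ok" if copy(pfn) else "error"
```

A copy callable could wrap the real client, for example `lambda pfn: run_with_retries(["srmcp", source_url, pfn])`, where `source_url` is a hypothetical variable holding the URL of the local test file.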

Description of the SRM tests (II)
- SRM-v1-get-metadata
  - Uses srm-get-metadata on the PFN to retrieve size and checksum
  - Gives an error if srm-get-metadata fails, or if the size or the checksum differs from that of the original file
- SRM-v1-get
  - Copies the PFN with srmcp to the UI
  - Gives an error if srmcp fails or the copied file differs from the original file
- SRM-v1-advisory-delete
  - Uses srm-advisory-delete to delete the PFN
  - Gives an error if srm-advisory-delete fails
  - NOTE: for CASTOR the method is a dummy, so test files will grow in number; manual cleaning is required once in a while
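The comparison between the copied file and the original, as done in the get test, can be sketched by comparing size and checksum of two local files. The slides do not say which checksum algorithm is used; Adler-32 (via Python's `zlib.adler32`) is an assumption here, as are the function names and status strings.

```python
import zlib


def file_fingerprint(path, chunk_size=1 << 20):
    """Return (size in bytes, Adler-32 checksum) of a file, read in chunks."""
    checksum = 1  # Adler-32 starting value
    size = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            checksum = zlib.adler32(chunk, checksum)
            size += len(chunk)
    return size, checksum & 0xFFFFFFFF


def get_test_status(original, copied):
    """'ok' if the copied file matches the original in size and checksum."""
    if file_fingerprint(original) == file_fingerprint(copied):
        return "ok"
    return "error"
```

The same fingerprint could in principle be compared with the size and checksum reported by srm-get-metadata, provided the SRM uses the same checksum algorithm.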

Critical tests for SE/SRM
- The distinction between the SE and the SRM sensors is not clear
  - A legacy of the past…
- CMS runs SRM tests and no SE test
  - It used to run the lcg-cr test
  - Since last Monday, there are no critical tests for the SE
- Planned to make the SRM-v1-put test critical for CMS
  - With care, as sites tend to be sensitive about critical tests!
  - They don't want to look bad in the availability calculation…

Availability calculation in GridView
- Bug found in the algorithm
  - The service instance status is UNKNOWN if no test is critical -> it should be UP
  - The service status stops being computed from the moment when no test is critical any more
- Will be fixed very soon
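The intended behaviour (UP when no test is critical) can be sketched as a small status function. This is not GridView's algorithm, which the slides do not describe: the DOWN state, the treatment of missing results as UNKNOWN and the "ok"/"error" result values are assumptions made for the example.

```python
def service_instance_status(results, critical_tests):
    """Status of a service instance from its critical test results.

    results maps test name -> "ok" or "error"; a missing entry means
    no result is available. With no critical tests the status is UP,
    which is the behaviour the slide asks for.
    """
    if not critical_tests:
        return "UP"
    statuses = [results.get(test) for test in critical_tests]
    if any(status == "error" for status in statuses):
        return "DOWN"
    if any(status is None for status in statuses):
        return "UNKNOWN"
    return "UP"
```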

FNAL availability
- The FNAL availability in GridView is flawed because the site (in the GLUE sense) USCMS-FNAL-WC1 used by GridView does not contain the CEs
  - The CEs are at other sites: one at uscms-fnal-wc1-ce and one at uscms-fnal-wc1-ce2
- FNAL will always look better than it is, if the CEs are ignored!!
- GridView should be able to aggregate several "GLUE" sites into one "GridView" site
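The requested aggregation amounts to a mapping from one "GridView" site to several "GLUE" sites, whose services are then merged. A minimal Python sketch, assuming a hypothetical mapping table and hypothetical service names (GridView offered no such feature at the time):

```python
# Hypothetical mapping: GridView site -> GLUE sites that belong to it.
GRIDVIEW_SITES = {
    "USCMS-FNAL-WC1": [
        "USCMS-FNAL-WC1",
        "uscms-fnal-wc1-ce",
        "uscms-fnal-wc1-ce2",
    ],
}


def services_for(gridview_site, services_by_glue_site, mapping=GRIDVIEW_SITES):
    """Merge the services of every GLUE site mapped to a GridView site."""
    merged = []
    for glue_site in mapping.get(gridview_site, [gridview_site]):
        merged.extend(services_by_glue_site.get(glue_site, []))
    return merged


# Service names below are placeholders, not real hosts.
example = services_for("USCMS-FNAL-WC1", {
    "USCMS-FNAL-WC1": ["SRM:se.example"],
    "uscms-fnal-wc1-ce": ["CE:ce1.example"],
    "uscms-fnal-wc1-ce2": ["CE:ce2.example"],
})
```

With such a mapping the CEs would enter the availability calculation of the aggregated site instead of being ignored.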

Other issues
- None of relevance in the framework
- Tests change from time to time, generally improving to avoid "false alarms"
- After increasing the timeouts from 10 to 20 minutes, the time needed to submit all tests increased to ~1.5 hours, dangerously close to the 2-hour period of the cron job
  - Most of the time is taken by the SRM tests
- Will upgrade ASAP to the latest version of the SAM client, which prevents a sensor from being run if another instance is still running
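One common way to keep a cron job from starting a sensor while a previous instance is still running is a lock file created atomically. The Python sketch below illustrates the idea only; it is not how the SAM client implements it, and the function names and lock path are hypothetical. A lock left behind by a crashed run is not handled.

```python
import os


def try_acquire_lock(lock_path):
    """Create the lock file atomically; False if it already exists."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another instance holds the lock
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owner for debugging
    return True


def release_lock(lock_path):
    """Remove the lock file once the sensor run is finished."""
    os.remove(lock_path)
```

A sensor wrapper would call `try_acquire_lock` first, skip the run if it returns False, and call `release_lock` in a `finally` clause after the run.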

Plans
- Add a test for the "analysis"
  - Try to read a small dataset like the JobRobot
- Closely monitor the SRM-v1-put before making it critical