Marian Babik, Luca Magnoni: SAM Test Framework

Outline
- SAM Test Framework
- Update on Job Submission Timeouts
- Impact of Condor and direct CREAM tests
- SAM Test Framework Evolution

SAM test framework
- Aims at monitoring the quality of services delivered by tiers and sites in WLCG
- Responsible for actively checking service status
- Currently the main data source for SAM3 A/R (availability/reliability) computations and WLCG monthly reports
- Generic: schedules checks via dedicated plug-ins and handles the results
  - Nagios as check scheduler
  - Checks are Nagios plugins: ~30 provided by experiments, SAM and PTs
  - Configuration system and Nagios-to-messaging bridge
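Because checks are Nagios plugins, each probe ultimately reports one of the four standard Nagios plugin states (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) together with a one-line summary. The sketch below illustrates that contract only; `run_check` and `example_check` are hypothetical names and not part of the SAM code base.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def run_check(check):
    """Run a check callable and return (exit_code, summary_line).

    `check` is a hypothetical callable returning (code, message).
    Unexpected exceptions and invalid codes are mapped to UNKNOWN so
    that the plugin always reports a well-defined state.
    """
    try:
        code, message = check()
    except Exception as exc:
        code, message = UNKNOWN, "check raised %s: %s" % (type(exc).__name__, exc)
    if code not in STATUS_NAMES:
        code, message = UNKNOWN, "invalid status returned by check"
    return code, "%s - %s" % (STATUS_NAMES[code], message)


def example_check():
    # Placeholder for a real probe (storage, job submission, WN check).
    return OK, "service responded"
```

A real plugin would print the summary line to standard output and pass the code to `sys.exit()`, which is how the Nagios scheduler picks up the result.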

SAM test functionality
- 3 categories of tests
  - Public Grid Services
    - Check service functionality via custom probes, e.g. storage probes (SRM)
  - Job Submission
    - Send a job to a specific CE and validate that it executed correctly within the expected constraints
  - WNs
    - Execute checks on a remote WN machine
    - Check execution environment, security, core services

Recent changes
- Direct CREAM plugins in production since June
- Condor plugins in production since October
- VOMS/Proxy
  - Campaign for migration to the new VOMS servers run in pre-production in August
  - Entered production in September
  - Submissions with RFC proxies early next year
- HT-Condor backend support
  - Since November used to test several CMS OSG sites in production
- WebDAV plugin developed and tested
- Migration to UMD3
  - SAM Update-23 released by EGI; we'll merge the changes as both systems are still compatible

Probes/Plugins Status

Probes                  Used by               Maintained by
Job Submission
  WMS                   LHCb, ALICE           SAM
  CREAMCE               LHCb, ALICE           SAM
  CONDOR                ATLAS, CMS            SAM
Storage
  LFC                   LHCb                  LHCb (DPM)
  SRM [1]               ATLAS, CMS, LHCb
Worker Nodes
  ATLAS WN (4) [1]      ATLAS                 ATLAS
  CMS WN (11) [1]       CMS                   CMS
  LHCb WN (5) [1]       LHCb                  LHCb
  GLEXEC                ATLAS, LHCb, ALICE    GLEXEC PT + SAM
  CMS-GLEXEC            CMS

[1] Some parts/plugins are common, others are experiment-specific.

Job Submission Timeouts
- Follow-up on the status presented at February's GDB
- SAM can test job submission with different credentials
  - WN tests may need a specific group/role to be effective (e.g. glexec)
  - WN tests inherit their credentials from the job
  - Configurable timeouts for the different states of a particular backend
- Sites cannot prioritize SAM tests over normal jobs
- The problem: SAM tests may time out because the VO is out of share
  - In February, certain timeouts were reported as critical and accounted for 38% (ATLAS) and 33% (CMS) of all critical states
  - Investigation has shown that this was caused by a too-short timeout on the WMS waiting state

ATLAS Oct/Nov Job Submission

CMS Oct/Nov Job Submission

Job Submission Timeouts Summary
- Analysis of timeouts in February has shown that most timeouts were due to WMS
  - The 45-minute timeout was too short, causing job aborts by WMS
  - Few timeouts on the sites' side (0% CMS, 1% ATLAS)
- It was proposed to focus on Condor/direct CREAM
  - Migration to Condor (ATLAS, CMS) was done on 1st of October
  - Currently all timeouts have warning status (not affecting a site's A/R)
  - Small fraction of timeouts, all after 24 hours
  - In addition, ATLAS moved to lcgadmin-only tests for WLCG monthly reports
- Fundamental limitations didn't change
  - Remote monitoring requires a job to land on a site
  - However, with SAM3 in production, alternative sources can be considered for A/R (accounting, experiments' sources, etc.)
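The combination of per-backend, per-state timeouts and a WARNING (rather than CRITICAL) result on expiry can be sketched as follows. This is an illustration under stated assumptions: the state names, the timeout values in `TIMEOUTS`, and the function `evaluate_job` are hypothetical and do not reproduce the actual SAM configuration.

```python
from datetime import timedelta

# Nagios plugin states.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical timeouts per backend and job state; values are illustrative only.
TIMEOUTS = {
    "condor": {"idle": timedelta(hours=24), "running": timedelta(hours=1)},
    "cream": {"idle": timedelta(hours=24), "running": timedelta(hours=1)},
}


def evaluate_job(backend, state, elapsed, timeouts=TIMEOUTS):
    """Return (status, message) for a job that has spent `elapsed` in `state`.

    An exceeded timeout yields WARNING, so a job that never got scheduled
    (e.g. because the VO is out of share) does not count against the site.
    """
    limits = timeouts.get(backend)
    if limits is None or state not in limits:
        return UNKNOWN, "no timeout configured for %s/%s" % (backend, state)
    if elapsed > limits[state]:
        return WARNING, "job timed out in state %s after %s" % (state, elapsed)
    return OK, "job in state %s within timeout" % state
```

Keeping the limits in a plain mapping means a too-short value, such as the 45-minute WMS waiting timeout, can be corrected in configuration without touching the probe logic.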

Queue Selection
- Currently a major source of UNKNOWNs
- Queues discovered via BDII
  - For both Condor and CREAM probes
  - Correct setup of the site BDII needed
- Algorithm
  - If an FQAN is given, look for a resource with a compatible access rule
    - (GlueCEAccessControlBaseRule=VOMS:/atlas/Role=pilot)
  - If there is no match, query for queues enabled for the VO
    - (|(GlueCEAccessControlBaseRule=VOMS:/atlas)(GlueCEAccessControlBaseRule=VO:atlas))
  - If multiple resources are found, use a ranking attribute (configurable)
    - GlueCEStateEstimatedResponseTime
- Both the LDAP filter and the ranking attribute are reported in the detailed output
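The selection steps above can be sketched in Python. To stay self-contained, the sketch works on an in-memory list of queue records instead of querying a BDII over LDAP. The record layout (`rules` plus the ranking attribute), the function names, and the choice that the lowest ranking value wins are assumptions made for illustration.

```python
def fqan_filter(fqan):
    """LDAP filter matching queues that list the FQAN in their access rules."""
    return "(GlueCEAccessControlBaseRule=VOMS:%s)" % fqan


def vo_filter(vo):
    """LDAP filter matching queues enabled for the VO as a whole."""
    return ("(|(GlueCEAccessControlBaseRule=VOMS:/%s)"
            "(GlueCEAccessControlBaseRule=VO:%s))" % (vo, vo))


def select_queue(queues, vo, fqan=None,
                 rank_attr="GlueCEStateEstimatedResponseTime"):
    """Return (queue, ldap_filter); queue is None when nothing matches.

    `queues` is a list of dicts with a "rules" list and the ranking
    attribute, standing in for entries returned by a BDII query.
    """
    candidates, used_filter = [], None
    if fqan:
        # Step 1: look for a resource with a compatible access rule.
        used_filter = fqan_filter(fqan)
        candidates = [q for q in queues if "VOMS:%s" % fqan in q["rules"]]
    if not candidates:
        # Step 2: fall back to queues enabled for the VO.
        used_filter = vo_filter(vo)
        wanted = {"VOMS:/%s" % vo, "VO:%s" % vo}
        candidates = [q for q in queues if wanted & set(q["rules"])]
    if not candidates:
        return None, used_filter
    # Step 3: rank multiple matches (assumption: lowest value is best).
    best = min(candidates, key=lambda q: int(q[rank_attr]))
    return best, used_filter
```

Returning the filter alongside the chosen queue mirrors the point that both the LDAP filter and the ranking attribute appear in the detailed output, which helps when diagnosing an UNKNOWN caused by an incomplete site BDII.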

SAM Test Framework Evolution
- Probes/Plugins
- Test Submission Framework

Overview
- A generic test framework remains fundamental for WLCG monitoring
  - No major differences from the current architecture
- Evolution
  - Overall simplification and reduction of complexity
  - Plugins: need to reflect changing technologies in both storage and job submission
  - Test framework: keep up with changes in the technologies
    - Open Monitoring Distribution (OMD)
    - Broad range of new Nagios-compatible systems (checkmk, Shinken, Icinga)
    - New features: scalability, high availability, higher-frequency testing
    - Adopt new messaging libraries and clients
- Detailed technical proposal presented and discussed at WLCG monitoring consolidation

Probes/Plugins Evolution
- Storage probes (based on GFAL2)*
  - GFAL2-based Python APIs used for basic Put-Get-Del SE tests
  - Storage tests on endpoints are transparent with respect to protocol (SRM, HTTPS, xRootD) via suitable GFAL2 plug-ins
  - Developed by/for ATLAS, but implemented in a VO-independent fashion
  - The tested endpoint paths are the only VO-related component and are fed in externally
  - Currently tested in ATLAS pre-production
- Direct ARC probe**
  - Support for SAM payload and multiple FQANs
  - Currently in testing

* Credits: Salvatore Tupputi (ATLAS)
** Credits: Pavlo Svirin (ALICE)
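The Put-Get-Del sequence can be illustrated with a protocol-agnostic sketch. It deliberately does not use the GFAL2 API: `InMemoryStorage` is a hypothetical stand-in for a real storage client, and the `put`/`get`/`delete` interface and the `put_get_del` function are assumptions made so the example runs on its own.

```python
import uuid


class InMemoryStorage:
    """Hypothetical stand-in for a storage client; NOT the GFAL2 API."""

    def __init__(self):
        self.files = {}

    def put(self, url, data):
        self.files[url] = data

    def get(self, url):
        return self.files[url]

    def delete(self, url):
        del self.files[url]


def put_get_del(client, base_url, payload=b"sam-test"):
    """Write a test file, read it back, delete it; return per-step results.

    `base_url` is the externally supplied, VO-specific endpoint path.
    """
    url = "%s/sam-test-%s" % (base_url.rstrip("/"), uuid.uuid4().hex)
    results = []
    try:
        client.put(url, payload)
        results.append(("put", True))
    except Exception:
        # Nothing was written, so the remaining steps are skipped.
        results.append(("put", False))
        return results
    try:
        results.append(("get", client.get(url) == payload))
    except Exception:
        results.append(("get", False))
    try:
        # Always attempt cleanup, even when the read-back failed.
        client.delete(url)
        results.append(("del", True))
    except Exception:
        results.append(("del", False))
    return results
```

Since only the URL scheme in `base_url` would change between SRM, HTTPS and xRootD endpoints, the same sequence serves every protocol, which is the transparency the GFAL2-based probes aim for.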

Probes/Plugins Evolution
- Job submission plugins: provide an open platform with support for various backends
  - Condor (HT-Condor, CREAM, ARC, GRAM), CREAM, ARC, WMS
  - Queue selection directly from experiment sources
  - Based on the new Nagios plugin library
- Storage plugins
  - Update to support other protocols (FTS, xRootD, HTTP)
  - Improve integration with experiment sources
- Worker node plugins/framework
  - Refactor the current bootstrapping script: streamline configuration of the environment and propagation of command-line arguments
  - Introduce the possibility to have multiple Nagios binaries
  - Migrate to new messaging clients (stompclt) and add new authentication methods

Test Framework Evolution
- Nagios-compatible system and messaging will be kept
  - Site and service manager notifications via Nagios will be kept
- New SAM test framework configuration will be introduced
  - Auto-generation of the Nagios configuration
  - Separation between input sources (topology, metrics) and Nagios core configuration concepts
- Messaging changes
  - Refactoring of the current Nagios-Messaging-Nagios bridge
  - Migration to new messaging libraries
- New authorization service for Nagios access
- Migration to Open Monitoring Distribution
- Moving SAM/Nagios-specific packages to EPEL
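One way to picture the separation between input sources and Nagios configuration is a small generator that turns topology and metric definitions into Nagios `define service` objects. The input layout, the metric names, and the `sam-generic-service` template are hypothetical; only the `define service` syntax and its directives are standard Nagios.

```python
def render_service(host, description, check_command, template="sam-generic-service"):
    """Render one Nagios service object definition as text."""
    return (
        "define service {\n"
        "    use                  %s\n"
        "    host_name            %s\n"
        "    service_description  %s\n"
        "    check_command        %s\n"
        "}\n" % (template, host, description, check_command)
    )


def generate_config(topology, metrics):
    """Combine topology and metrics into Nagios service definitions.

    topology: list of {"host": ..., "flavour": ...} service endpoints
    metrics:  mapping of flavour -> list of {"name": ..., "command": ...}
    """
    blocks = []
    for endpoint in topology:
        for metric in metrics.get(endpoint["flavour"], []):
            blocks.append(
                render_service(endpoint["host"], metric["name"], metric["command"])
            )
    return "\n".join(blocks)
```

With this split, adding a site only touches the topology input and adding a probe only touches the metrics input; the Nagios configuration is regenerated rather than edited by hand.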

Summary
- SAM test framework
  - Many operational and development changes done this year
  - Introduced support for job submission via Condor, CREAM and ARC*
  - Significantly improved the situation with scheduling timeouts
- Future of the framework
  - Similar architecture, new features
  - More open and reusable, smooth transition

Backup slides

SAM Test Framework

CREAM Timeouts LHCb/ALICE

Condor Timeouts ATLAS/CMS

CRITICAL states