EVOLUTION OF THE EXPERIMENT PROBE SUBMISSION FRAMEWORK (SAM/NAGIOS) Marian Babik.



Motivation
- Functional testing of the WLCG site services
- Functional testing of the WLCG central services
- Execution of the functional tests on the worker nodes
- On-demand functional testing of sites (e.g. middleware readiness, IPv6)
- Site-level service testing* (supporting only existing functional tests)

SAM/Nagios
- Probe Submission Framework (Nagios)
  - Schedule tests
  - Resolve test dependencies, detect flapping, sandboxing
  - Notifications (site, internal)
  - Transport (ActiveMQ)
  - Interfaces (scheduler APIs, event processors, web)
- Probes/Plugins (~30 probes, 113 metrics)
  - Standardized, re-usable tests (Nagios plugins)
  - Remote API tests or worker node tests
- Worker Node Test Framework
  - Probe submission framework on the worker nodes
  - Scheduler, Transport, Interfaces
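The "standardized, re-usable tests" above rely on the Nagios plugin convention: a plugin prints a one-line status and reports its result through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The following is a minimal sketch of that convention only. The `check_threshold` function, its thresholds and the command-line layout are hypothetical and are not taken from any SAM probe.

```python
# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_threshold(value, warn, crit):
    """Map a measured value to a Nagios state (hypothetical example metric)."""
    if value is None:
        return UNKNOWN, "no measurement available"
    if value >= crit:
        return CRITICAL, "value %s >= critical threshold %s" % (value, crit)
    if value >= warn:
        return WARNING, "value %s >= warning threshold %s" % (value, warn)
    return OK, "value %s within limits" % value


def format_output(state, message, perfdata=None):
    """Build the single status line; optional performance data follows a '|'."""
    line = "%s - %s" % (STATE_NAMES[state], message)
    if perfdata:
        line += " | " + " ".join("%s=%s" % (k, v) for k, v in sorted(perfdata.items()))
    return line


def main(argv):
    """Hypothetical usage: plugin.py <value> <warn> <crit>. Returns the exit code."""
    try:
        value, warn, crit = (float(a) for a in argv[1:4])
    except ValueError:
        # Missing or non-numeric arguments cannot be judged, so report UNKNOWN.
        print(format_output(UNKNOWN, "usage: plugin.py <value> <warn> <crit>"))
        return UNKNOWN
    state, message = check_threshold(value, warn, crit)
    print(format_output(state, message, {"value": value}))
    return state
```

A real plugin would end with `sys.exit(main(sys.argv))` so that the scheduler sees the state as the process exit code.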

Technology Evolution
- Nagios-compatible systems became very popular
  - Nagios, Naemon, Sensu, Shinken, Icinga (see Related Work)
  - Nagios core has a sustained record of releases every 2-3 months for several years
  - 4.0 introduces many interesting performance improvements
  - Nagios plugins became the de facto standard (nagios-plugins.org, monitoring-plugins.org)
  - Interesting work in APIs (Livestatus, Livecheck), auto-configuration (check_mk config engine, WATO), new distributions (Open Monitoring Distribution) and visualizations (Graphite instead of pnp4nagios)
- Messaging (ActiveMQ) became common
  - Fits well with geographically distributed monitoring
  - New versions, protocols and libraries became available

Requirements (short-list)
- Consolidation
  - Complexity reduction
  - Currently 14 components (+ deps), many no longer actively developed, only maintained
  - Decouple configuration, plugins, submission technology
- Pending maintenance (environment evolution)
  - UMD3, SLC6/7, Messaging libraries, yaim/puppet
- Plugin requests
  - HTCondor, RFC proxies, GFAL2 plugins
- New use cases
  - IPv6, Middleware readiness
- Scalability and HA
  - Reaching limits on the number of tests for a single Nagios
  - Single point of failure (no backup instance solution)

PROBE SUBMISSION FRAMEWORK

Proposed changes
- Nagios-compatible system and Messaging will be kept
- Site and service manager notifications via Nagios will be kept
- New Nagios configuration component will be introduced
  - Auto-generation of the Nagios configuration
  - Separates input sources (topology, metrics) from Nagios core configuration concepts
- Messaging changes
  - Refactoring the current Nagios-Messaging-Nagios bridge
  - Migration to stompclt to process the message queue and publish/consume messages
- New authorization service (Nagios access)
- Migration to Open Monitoring Distribution
- Moving SAM/Nagios-specific packages to EPEL
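The separation between input sources and Nagios configuration concepts can be illustrated with a small generator that takes plain topology and metric data and renders standard Nagios `define command` and `define service` objects. This is a sketch under stated assumptions, not the proposed component: the shape of `topology` and `metrics`, the function names and the `generic-service` template name are all hypothetical.

```python
def render_object(kind, directives):
    """Render one Nagios object definition from (name, value) pairs."""
    lines = ["define %s {" % kind]
    for name, value in directives:
        lines.append("    %-24s%s" % (name, value))
    lines.append("}")
    return "\n".join(lines)


def generate_config(topology, metrics, template="generic-service"):
    """Build command and service definitions from input data only.

    topology: {host name: [metric name, ...]}            (hypothetical shape)
    metrics:  {metric name: {"command": str, "interval": int}}
    template: service template assumed to be defined elsewhere; it would
              supply the remaining directives a full definition needs.
    """
    blocks = []
    # One command object per metric, carrying the full command line.
    for name in sorted(metrics):
        blocks.append(render_object("command", [
            ("command_name", name),
            ("command_line", metrics[name]["command"]),
        ]))
    # One service object per (host, metric) pair found in the topology.
    for host in sorted(topology):
        for name in topology[host]:
            blocks.append(render_object("service", [
                ("use", template),
                ("host_name", host),
                ("service_description", name),
                ("check_command", name),
                ("check_interval", metrics[name]["interval"]),
            ]))
    return "\n\n".join(blocks) + "\n"


example = generate_config(
    {"host1": ["emi.cream.CREAMCE-JobSubmit-/Role/pilot"]},
    {"emi.cream.CREAMCE-JobSubmit-/Role/pilot": {
        "command": "/path/binary --arg1 val1", "interval": 60}},
)
```

Because the generator only sees dictionaries, the same code could be fed from a topology service, a snapshot file or a message, which is the point of decoupling the sources from the Nagios-specific output.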

Service Configuration
- Binds metrics with services
- Intermediate serialization (JSON): enables snapshotting, manual configs and getting the config via messaging
- Generated using plugins, which enables different sources
- Interesting for the site-level and on-demand use cases

Example (schematic):

    {
      "host-tags": { "CEs": "host1 host2 host3 host4" },
      "checks": [
        ["emi.cream.glexec.CREAMCE-DirectJobState-/Role/pilot", ["host1", "host2", "host3"]]
      ]
    }

  but also via host tags:

    ["emi.cream.glexec.CREAMCE-DirectJobState-/Role/pilot", "CEs"]

  with the additional possibility to add params:

    ["glexec_js", "CEs", {"params": {"interval": 60, "args": {"--x": "/proxy_path"}}}]
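To show how such a serialized configuration could be consumed, the sketch below parses a JSON rendering of the slide's example and expands every check into (metric, host, params) bindings, resolving host tags on the way. Treating `checks` as a JSON list and quoting the `args` key are assumptions made here so that the text parses; `expand_checks` is a hypothetical helper, not part of SAM.

```python
import json

# JSON rendering of the slide's example (structure partly assumed, see above).
CONFIG_TEXT = """
{
  "host-tags": {"CEs": "host1 host2 host3 host4"},
  "checks": [
    ["emi.cream.glexec.CREAMCE-DirectJobState-/Role/pilot", ["host1", "host2", "host3"]],
    ["glexec_js", "CEs", {"params": {"interval": 60, "args": {"--x": "/proxy_path"}}}]
  ]
}
"""


def expand_checks(config):
    """Yield (metric, host, params) for every check in the configuration."""
    # Host tags map a tag name to a space-separated list of host names.
    tags = {tag: hosts.split() for tag, hosts in config.get("host-tags", {}).items()}
    for entry in config["checks"]:
        metric, target = entry[0], entry[1]
        options = entry[2] if len(entry) > 2 else {}
        params = options.get("params", {})
        # A list is taken as explicit host names, a string as a host tag.
        hosts = target if isinstance(target, list) else tags[target]
        for host in hosts:
            yield metric, host, params


bindings = list(expand_checks(json.loads(CONFIG_TEXT)))
for metric, host, params in bindings:
    print(host, metric, params)
```

Keeping the intermediate form as plain JSON means a snapshot can be edited by hand or shipped over messaging and still be expanded by the same code.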

Metric Configuration Changes
- Metric identifiers will be the same as the Nagios service description
  - They will include FQANs, so we will have emi.cream.CREAMCE-JobSubmit-/Role/atlas=lcgadmin instead of emi.cream.CREAMCE-JobSubmit
- Join all existing command attributes (path, binary, arguments, attributes, etc.) into a single field called command
  - command: /path/binary --arg1 val1 --arg2 val2 …
- Include worker node metric definitions
  - Auto-configured with all other metrics
- Introduce the possibility to define multiple metric dependencies
  - Easily discover all needed metrics (submit <- state, monitor <- proxy-valid <- proxy-get)
- Discontinue all other customization files (e.g. /etc/ncg/ncg-localdb.d/)
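Discovering all needed metrics from declared dependencies amounts to a depth-first walk over the dependency graph. The sketch below assumes that "a <- b" on the slide means "a needs b"; that reading, the short metric names and the `required_metrics` helper are illustrative assumptions rather than the framework's actual data model.

```python
def required_metrics(metric, depends_on):
    """Return `metric` plus everything it needs, prerequisites first."""
    ordered, visiting = [], set()

    def visit(name):
        if name in ordered:
            return  # already resolved through another path
        if name in visiting:
            raise ValueError("dependency cycle at %r" % name)
        visiting.add(name)
        for dep in depends_on.get(name, ()):
            visit(dep)
        visiting.discard(name)
        ordered.append(name)

    visit(metric)
    return ordered


# Dependencies from the slide, read as "key needs each listed value".
DEPENDS_ON = {
    "submit": ["state"],
    "monitor": ["proxy-valid"],
    "proxy-valid": ["proxy-get"],
}

print(required_metrics("monitor", DEPENDS_ON))
print(required_metrics("submit", DEPENDS_ON))
```

Because each metric may list several prerequisites, the same walk covers the "multiple metric dependencies" case, and the cycle check guards against a misconfigured dependency loop.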

PROBES/PLUGINS

Probes              Used by              Maintained by

Job Submission
  WMS [2]           LHCb                 SAM
  CREAMCE [2]       LHCb, ALICE          SAM
  CONDOR [2]        ATLAS, CMS           SAM
  ARC               ALICE (?)            ARC PT
Storage
  LFC [2]           LHCb                 LHCb (DPM)
  SRM [1,2]         ATLAS, CMS, LHCb
Worker Nodes
  ATLAS WN (4) [1]  ATLAS                ATLAS
  CMS WN (11) [1]   CMS                  CMS
  LHCb WN (5) [1]   LHCb                 LHCb
  GLEXEC            ATLAS, LHCb, ALICE   GLEXEC PT + SAM
  CMS-GLEXEC        CMS

[1] some parts/plugins are common, others are experiment-specific
[2] based on python-gridmon (Nagios plugins library)

Proposed changes
- python-gridmon (Nagios plugin library)
  - Remove old, obsolete and no longer used code; migrate to a newer version of Python
  - Keep backwards compatibility with existing plugins
  - Promote code reuse for experiment-specific plugins
  - Make it easier to transfer data from external sources into a plugin's command line arguments (e.g. space tokens or service endpoint URIs in storage plugins, but also optional forwarding of the arguments to the worker node plugins)
- Job Submission plugins: refactor, focusing on removing duplicate, obsolete and no longer used code
- Upgrade all plugins to support SLC6 and UMD3
- Clean up environment configuration and dependencies
  - Move package deps to probes
  - Central vs per-plugin environment configuration

WORKER NODE TEST FRAMEWORK

Worker Node Test Framework
- Execution of Nagios-compliant plugins
  - via the Nagios scheduler (including all its options)
- Creates a sandbox for checks
- Establishes the environment for Nagios plugins
  - SAM-specific environment (brokers, more recently also the name of the CE)
- Provides a bootstrapping script that
  - sets up the worker node environment
  - generates the Nagios config (nagios.cfg) and sets up the nagios-messaging bridge
  - runs Nagios and the message publisher, then cleans up
- Provides a messaging publisher that reads from the message queue and publishes to the broker
  - specifically designed to work in the worker node environment (lightweight with minimal requirements, Python 2.3+ compliant)
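The sandboxing and environment steps can be pictured as running each plugin in a throwaway working directory with a prepared environment and a timeout, then mapping the exit code to a Nagios state. The sketch below is an illustration only: `run_check`, the `EXAMPLE_CE` variable and the choice to report timeouts as UNKNOWN are assumptions, and it uses modern Python 3 (`subprocess.run`) rather than the Python 2.3-compatible code the slide describes.

```python
import os
import shutil
import subprocess
import sys
import tempfile

STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def run_check(command, extra_env=None, timeout=30):
    """Run one plugin in a throwaway directory; return (state, first output line)."""
    sandbox = tempfile.mkdtemp(prefix="wn-check-")
    env = dict(os.environ)
    env.update(extra_env or {})  # e.g. broker or CE name (hypothetical variable names)
    try:
        proc = subprocess.run(command, cwd=sandbox, env=env, timeout=timeout,
                              stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                              universal_newlines=True)
        code, output = proc.returncode, proc.stdout
    except subprocess.TimeoutExpired:
        code, output = 3, "check timed out after %s s" % timeout
    except OSError as exc:
        code, output = 3, "cannot execute plugin: %s" % exc
    finally:
        # The sandbox is removed whether the check succeeded or not.
        shutil.rmtree(sandbox, ignore_errors=True)
    # Exit codes outside 0-3 are treated as UNKNOWN.
    state = STATES.get(code, "UNKNOWN")
    first_line = output.strip().splitlines()[0] if output.strip() else ""
    return state, first_line


# Stand-in "plugin": a one-line Python script that reads the prepared environment.
demo_plugin = [sys.executable, "-c",
               "import os, sys; print('CRITICAL - ' + os.environ['EXAMPLE_CE']); sys.exit(2)"]
print(run_check(demo_plugin, {"EXAMPLE_CE": "ce.example.org"}))
```

In the framework described above the Nagios scheduler performs the execution itself; the sketch only isolates the sandbox, environment and clean-up steps so they can be seen in one place.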

Proposed changes
- Changing the current messaging client
  - Testing stompclt on the worker nodes or refactoring the current one
  - Add additional authentication methods to connect to the broker (ideally based on the job's proxies)
- Refactoring the current bootstrapping script
  - Streamlining configuration of the environment and propagating command line arguments from SAM/Nagios
  - Introduce the possibility to have multiple Nagios binaries and/or messaging clients
- Upgrading the worker node Nagios binary to a newer version
  - currently 3.2.1; the most recent stable is 4.0.8

SUMMARY

Summary
- Areas of work
  - Experiment probe submission framework
    - New configuration and environment setup
    - New messaging and Nagios/Messaging bridge
    - Establishing a more open system while taking advantage of recent developments in open source monitoring
  - Plugins
    - Refactoring python-gridmon (Nagios plugin library)
    - Refactoring job submission probes
    - Plugin environment changes
  - Worker node test framework
    - Refactoring messaging and bootstrapping
- Changes are critical to keep SAM/Nagios running
  - They will introduce a more flexible and open system (enabling new use cases), decrease operational effort and preserve investments already made in plugins

REF Comments welcome: DiqJVXzYFeAwC2ZreGJM9-INe- npkK8cwUCk9JY/edit?usp=sharing

BACKUP

Probes                 Used by                        Maintained by

CREAMCE (WMS)          OPS, ATLAS, ALICE, CMS, LHCb   CREAM PT
DirectCREAMCE          OPS, LHCb, ALICE               CREAM PT
ARC                    OPS                            ARC PT
GLEXEC                 ATLAS, LHCb, ALICE, OPS        GLEXEC PT + SAM
CMS-GLEXEC             CMS
BDII                   OPS                            BDII PT
SRM                    OPS                            dCache PT
LFC                    ATLAS, LHCb, OPS               DPM PT
ATLAS SRM              ATLAS                          ATLAS
CMS SRM                CMS                            CMS
LHCb SRM               LHCb                           LHCb
ATLAS WN               ATLAS                          ATLAS
CMS WN                 CMS                            CMS
LHCb WN                LHCb                           LHCb
gen. WN                OPS, ALICE                     OPS
WN replication tests   OPS                            DPM PT

July 2013 Babik, M.: SAM Probes Framework