Site notifications with SAM and Dashboards
Marian Babik, SDC/MI Team (IT/SDC/MI)
12th June 2013, GDB

Presentation transcript:


Introduction
- Recent WLCG reporting proposal
  - Introduces joint T1/T2 reports
  - Removes OPS tests from reports
- This has an impact on sites that support multiple VOs and are therefore interested in consolidated monitoring and notifications:
  - Reporting and VO-specific tests
  - Notification of sites in case of failures

Today
- WLCG reporting
  - Overview and recent changes
  - VO-specific tests (site vs experiment view)
- Overview of the existing approaches in SAM and Dashboards
  - Notifications/alarming of site failures
- The aim is to start a discussion and get feedback from sites and experiments

WLCG Reporting
- Weekly reports - T0/T1
  - Presented at the WLCG MB
  - Contain site availability and a list of GGUS tickets
- Monthly reports - T0/T1/T2
  - Broadcast by the WLCG office to MB, CB, GDB
  - Current structure
    - T0/T1 summary and detailed reports (OPS and VO specific)
    - T2 detailed report (only OPS)
- Quarterly reports
  - Compiled manually from the monthly reports
  - Broadcast to OB
- All reports are based on the list of critical tests agreed by the MB
  - This set evolves as technology and the experiments' needs change

WLCG Reporting Changes
- In January, changes to the current monthly reports were proposed at the WLCG MB
  - Introducing joint T1/T2 reports (VO specific)
  - Effectively removing OPS reports
- Update on the new reports presented at the May WLCG MB
  - Contains a review of T2 performance
- T2 reports are based on the list of critical tests defined by each experiment
  - The same tests have already been used for some time in the T0/T1 reports
  - T2 performance based on those tests, although not reported, has already been followed up by the experiments for some time

Critical Tests Overview
- Reports are generated from a list of critical tests covering the following service types:
  - CE, SRM, LFC and WN
- The list contains a mixture of operational and experiment-specific tests
  - All executed with experiment credentials
  - Operational tests supported and developed by the PTs
  - The list contains mostly operational tests and only a few experiment-specific ones (evolving, differing per experiment)
    - Those are easy for sites to understand
- Many tests are shared between experiments
  - CE job submission, SRM
  - Very positive recent development of the CVMFS probe
- List of critical tests used in the new WLCG reports

Experiments vs Sites
- Experiment vs site level tests
  - Important to understand whether site failures are due to experiment or site level issues
  - Not a new topic: discussed as part of OPS TEG WG1 Monitoring and Metrics
    - Outcome was a recommendation to distinguish site vs experiment-specific tests
- For ATLAS, CMS and LHCb this is already the case
  - A major revision of tests was performed in SAM
  - Different metric sets (profiles) are now used to follow up site and experiment-specific issues
- ALICE currently uses only site level tests, but plans to also include experiment-specific ones
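The idea of separate metric sets (profiles) for site-level and experiment-specific issues can be illustrated with a minimal sketch. The profile names, metric names and the aggregation rule below are all hypothetical and chosen only for illustration; they are not the actual SAM profiles.

```python
from typing import Dict, Set

# Hypothetical metric sets (profiles); names are illustrative only.
PROFILES: Dict[str, Set[str]] = {
    "site_critical": {"ce_job_submit", "srm_put", "wn_cvmfs"},
    "experiment_specific": {"wn_sw_release", "srm_space_token"},
}

def profile_status(profile: str, results: Dict[str, str]) -> str:
    """Return OK, CRITICAL or MISSING for one profile.

    Assumed rule: any CRITICAL metric makes the profile CRITICAL;
    otherwise a metric without a result makes it MISSING;
    every other combination counts as OK.
    """
    statuses = [results.get(metric) for metric in sorted(PROFILES[profile])]
    if "CRITICAL" in statuses:
        return "CRITICAL"
    if None in statuses:
        return "MISSING"
    return "OK"
```

With such a split, a failing experiment-specific metric leaves the site-level profile untouched, which is what lets a site tell its own problems apart from experiment-side ones.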

Site perspective
- Moving from OPS to VO-specific reporting has an impact on sites
  - Understand the overall situation across different VOs
  - Get notified in case of site failures affecting the site's reported availability/reliability
- SDC/MI team was requested to provide an overview of existing approaches in SAM and Dashboards
  - Notification infrastructure
- The aim is to start a discussion and understand what works for sites

Testing infrastructure

Notifications

Experiment Computing Shifts
Organized to oversee the experiments' grid operations
- ATLAS
  - Computing Shifts (24/7) and ATLAS Cloud Squad
  - Actively monitoring T0/T1/T2/T3 sites using SSB (SAM, DDM, etc.)
  - Sites contacted through GGUS
  - Experiment issues tracked through Savannah
- CMS
  - Computing Shifts (24/7)
  - Actively monitoring T0/T1/T2 sites using SSB
  - Problems reported using Savannah (Savannah-GGUS bridge used for T1s, T2s on request)
- LHCb
  - Computing Shifts (8/7)
  - Monitoring T1s using the SAM interface and cross-checking with DIRAC monitoring
  - If there is a problem, the shifter contacts GEOC to follow up with sites via GGUS
  - T2 monitoring only through DIRAC (changes are planned)
- ALICE
  - AliEn Operations Team
  - Active monitoring of sites usually not necessary
  - If there are issues, T1/T2 sites follow up through GGUS, mailing lists
  - MonALISA notification system in place, including a status portal

ROD/COD
- Regional Operator on Duty
  - Notifications received from the regional NAGIOS and processed in the EGI operations portal
  - Follows up and contacts sites through GGUS
- Central Operator on Duty
  - Oversees ROD
  - Notified if ROD doesn't handle alarms within 72 hours and for tickets open for more than one month
  - Monthly follow-up of a site's A/R if below 70%/75%, requesting justification
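The COD escalation thresholds can be written down as a small sketch. The function names are hypothetical, "one month" is approximated as 30 days, and reading "70%/75%" as an availability limit of 70% and a reliability limit of 75% (with either one triggering the follow-up) is an assumption.

```python
from datetime import datetime, timedelta

ALARM_LIMIT = timedelta(hours=72)   # ROD must handle alarms within 72 hours
TICKET_LIMIT = timedelta(days=30)   # "one month", approximated as 30 days
AVAILABILITY_MIN = 70.0             # assumed: first figure is availability (%)
RELIABILITY_MIN = 75.0              # assumed: second figure is reliability (%)

def alarm_escalates(raised_at: datetime, now: datetime, handled: bool) -> bool:
    """True if an unhandled alarm is older than the 72 hour limit."""
    return (not handled) and (now - raised_at) > ALARM_LIMIT

def ticket_escalates(opened_at: datetime, now: datetime) -> bool:
    """True if a ticket has been open for more than one month."""
    return (now - opened_at) > TICKET_LIMIT

def needs_justification(availability: float, reliability: float) -> bool:
    """True if the monthly A/R figures fall below either threshold."""
    return availability < AVAILABILITY_MIN or reliability < RELIABILITY_MIN
```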

Site NAGIOS
- Standalone solution for integrated monitoring and notification
  - Designed as an extension of an already existing NAGIOS run by the site admins
  - Offered as part of SAM (supported on a best-effort basis)
- Capabilities
  - Importing metric results from regional or experiment NAGIOSes
  - Re-using the existing NAGIOS web interface to show a consolidated view
  - Supports native NAGIOS notifications
    - Emails sent to sysadmins with basic soft/hard state filtering
- Very few sites have shown actual interest (PIC, NIKHEF)
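Soft/hard state filtering is what keeps a single failed check from producing an email. The sketch below is a simplified model of that behaviour, not Nagios code: the class and its return values are hypothetical, while `max_check_attempts` mirrors the Nagios directive of the same name. A problem becomes "hard" (and is notified) only after that many consecutive non-OK results; a recovery is notified only if the problem had become hard.

```python
from typing import Optional

class ServiceState:
    """Simplified soft/hard state model for one service check."""

    def __init__(self, max_check_attempts: int = 3) -> None:
        self.max_check_attempts = max_check_attempts
        self.hard_status = "OK"   # last confirmed (hard) status
        self.attempt = 0          # consecutive non-OK results seen so far

    def update(self, status: str) -> Optional[str]:
        """Feed one check result; return "PROBLEM", "RECOVERY" or None."""
        if status == "OK":
            self.attempt = 0
            if self.hard_status != "OK":
                self.hard_status = "OK"
                return "RECOVERY"      # hard recovery is notified
            return None                # still OK, or a silent soft recovery
        if self.hard_status != "OK":
            return None                # already notified; no repeats in this sketch
        self.attempt += 1
        if self.attempt >= self.max_check_attempts:
            self.hard_status = status  # soft state is promoted to hard
            return "PROBLEM"
        return None                    # soft state: retry without notifying
```

This sketch deliberately ignores re-notification intervals and transitions between different problem states (for example WARNING to CRITICAL).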

Site Nagios

Experiment NAGIOS
- Monitoring and notification at the experiment level
  - Offered from the four experiment NAGIOSes
  - Offers functionality equivalent to the Regional NAGIOS
- Capabilities
  - Native Nagios notifications can be enabled to send emails to site contacts (from GOCDB/OIM or a custom list)
    - Includes soft/hard state filtering before a notification is sent
  - Can be fine-tuned per site, service type, service
  - The same information can also be sent via the message bus
    - Re-using the way the Regional NAGIOS informs the EGI Operations Portal
- Notifications already enabled for some CMS sites

Experiment Nagios
- Filtering is quite complex; usually many notifications are sent
  - Currently nothing is sent if tests are not executed
    - The freshness mechanism can be enabled to resolve this, but it caused flapping in the past
  - Filtering notifications only for critical metrics is currently not implemented
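The two gaps named here, missing results and a critical-metrics filter, can be sketched as two small checks. This is an illustration in the spirit of a freshness check, not the SAM or Nagios implementation; the function names, metric names and the threshold value are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set

def stale_metrics(last_seen: Dict[str, datetime], now: datetime,
                  threshold: timedelta) -> List[str]:
    """Return metrics whose latest result is older than the threshold.

    A metric that stopped being executed shows up here instead of
    silently producing no notification at all.
    """
    return sorted(m for m, ts in last_seen.items() if now - ts > threshold)

def should_notify(metric: str, critical_metrics: Set[str]) -> bool:
    """Keep notifications only for metrics on the critical list."""
    return metric in critical_metrics
```

A threshold set too close to the test period would flag results that are merely late, which is one plausible way to get the flapping mentioned above.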

Site Status Board Notifications
- Monitoring and notification at the experiment level
  - Offered by the experiment SSBs deployed at CERN
- Capabilities
  - Multiple profiles, defined by experiment experts
  - Includes SAM tests, Job Information, DDM, …
  - Site admins can subscribe to the desired information
    - Notifications sent when a site changes status
  - Monitoring part heavily used by ATLAS & CMS
  - ~20 CMS site admins already registered
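The subscribe-and-notify-on-change behaviour can be sketched as follows. The `StatusBoard` class and its methods are hypothetical and only model the idea that subscribers hear about a site when its status changes, not on every update; this is not the SSB implementation.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Callback = Callable[[str, str, str], None]  # (site, old_status, new_status)

class StatusBoard:
    """Minimal model of status-change notifications with subscriptions."""

    def __init__(self) -> None:
        self._status: Dict[str, str] = {}
        self._subscribers: Dict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, site: str, callback: Callback) -> None:
        """Register a callback for status changes of one site."""
        self._subscribers[site].append(callback)

    def publish(self, site: str, status: str) -> None:
        """Record a new status and notify subscribers only if it changed."""
        previous = self._status.get(site)
        self._status[site] = status
        if previous is not None and previous != status:
            for callback in self._subscribers[site]:
                callback(site, previous, status)
```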

Site Status Board Notifications

Notifications Summary

Notifications     | Pros/Cons
------------------+---------------------------------------------------------------
Site Nagios       | Hosted locally at sites
                  | Native NAGIOS notifications and filtering
                  | Possibility to add custom site-only metrics
                  | Doesn't contain detection of missing results or critical metrics filter (con)
------------------+---------------------------------------------------------------
Experiment Nagios | Hosted centrally
                  | Native NAGIOS notifications and filtering
                  | Notifications accessible via message bus
                  | Doesn't contain detection of missing results or critical metrics filter (con)
------------------+---------------------------------------------------------------
Site Status Board | Hosted centrally
                  | Provides both dashboard and notifications
                  | Metrics defined by experiment expert (con)
                  | Contains detection of missing results and can support filtering of notifications only for critical metrics

Plans
- Many existing monitoring and notification endpoints
  - Actual use varies depending on the endpoint
- Feedback from sites is needed to understand their preference
- Common WLCG monitoring project aiming at a critical analysis of what is monitored, technologies used, deployment and support models
  - Reduce complexity of the system
  - Ensure simplified but effective operations and support
  - Unify deployment strategy, dev. process and components implementation (where possible)

References and Credits
- Credits to WLCG Operations Coordination and Commissioning Team
- WLCG Reports
  - New T1/T2 monthly reports:
- SAM References
  - SAM Documentation:
  - SAM Reporting Web:
- Dashboard Links
  - ALICE SSB:
  - ATLAS SSB:
  - CMS SSB:
  - LHCb SSB:

Backup

SAM/Dashboard APIs
- Possibility to access raw results, service/site status and service/site availability
- Capabilities:
  - Web API (JSON, XML)
  - Includes logic to detect missing results
  - Metric sets filtering is possible
- Not a subscription service (notification system), but rather a point of access
  - Requires post-processing (development) on the client
  - Quite popular according to Web API statistics
- Hosted centrally
  - SAM Web API specification
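The client-side post-processing could look roughly like the sketch below. The JSON layout, field names and hostnames are invented for illustration and do not reproduce the actual SAM Web API response format; no real endpoint is queried.

```python
import json
from typing import List, Tuple

# Hypothetical payload; the real Web API schema may differ.
SAMPLE = """
{"site": "EXAMPLE-SITE",
 "services": [
   {"hostname": "ce01.example.org", "flavour": "CE", "status": "OK"},
   {"hostname": "se01.example.org", "flavour": "SRM", "status": "CRITICAL"},
   {"hostname": "ce02.example.org", "flavour": "CE", "status": "MISSING"}
 ]}
"""

def failing_services(payload: str) -> List[Tuple[str, str]]:
    """Return (hostname, status) for every service that is not OK."""
    doc = json.loads(payload)
    return [(svc["hostname"], svc["status"])
            for svc in doc["services"]
            if svc["status"] != "OK"]
```

Because the API is a point of access and not a subscription service, a site would have to run such a check periodically and decide itself when to send a notification.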