James Casey, CERN IT-GD WLCG Workshop 1st September, 2007

Presentation transcript:

Monitoring BOF
James Casey, CERN IT-GD
WLCG Workshop, 1st September, 2007

Welcome
- Scope of session
  - Grid service monitoring from the viewpoint of a site and a VO
  - Service availability calculation
- Focus of session
  - Progress since last workshop
  - Multi-infrastructure issues
- Out of scope (at least not prepared :)
  - Accounting, information system, auditing

Progress since last workshop
- Focus in January was on newly created WLCG Monitoring WGs
- Highlights
  - System Management group: http://www.sysadmin.hep.ac.uk/
  - System Analysis group
    - New tools for experiments (dashboards)
    - Working on getting usage patterns and requirements from the experiments
  - Grid Service Monitoring
    - Worked on system architecture
    - Created probe and exchange specifications
    - Created first prototype of site monitoring

Grid Service Monitoring WG
- Very active participation from many groups
- SRCE (Emir Imamagic) contributed Nagios-based prototype
  - Based on work done for EGEE CE ROC
- OSG provided significant input on probe specifications (Arvind Gopu, Rob Quick)
- EDS Openlab collaboration (Max Böhm) has worked on architecture and analysis
- GridICE, Gridview, SAM, R-GMA teams were regular contributors at phone conferences

“The Nagios-based Prototype”
- Simple monitoring of grid services based on
  - Currently available remote data (SAM, Network)
  - Existing probes from EGEE CE region
  - New probes written according to component developer provided specifications
- Initially implement using one fabric monitoring system - Nagios
  - … but architecture checked with LEMON developers
- OSG actively involved in design process
  - Parallel work done using Gratia for data collection
  - Same probes can be used in both systems
- Some simple plotting using Ganglia
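As an illustration of what a probe for a fabric monitoring system such as Nagios can look like, here is a minimal sketch in Python. It is not one of the WG probes and does not follow the WG probe specification; the function names, thresholds, and the choice of a plain TCP connect test are assumptions made for the example. The only convention taken from Nagios itself is the plugin contract: exit code 0/1/2/3 for OK/WARNING/CRITICAL/UNKNOWN, and one line of output with optional performance data after a `|`.

```python
import socket
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(latency_s, warn_s=2.0, crit_s=5.0):
    """Map a measured connect time (None means the connect failed) to a status.

    The thresholds are illustrative defaults, not values from any specification.
    """
    if latency_s is None or latency_s >= crit_s:
        return CRITICAL
    if latency_s >= warn_s:
        return WARNING
    return OK


def measure_connect(host, port, timeout_s=10.0):
    """Return the seconds needed to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return time.monotonic() - start
    except OSError:
        return None


def format_output(service, status, latency_s):
    """Build the single output line: status text, then perfdata after '|'."""
    name = STATUS_NAMES[status]
    if latency_s is None:
        return f"{service} {name} - connection failed"
    return f"{service} {name} - connect time {latency_s:.3f}s | time={latency_s:.3f}s"


def run_probe(service, host, port):
    """Run the check once; a wrapper script would print the line and exit with the code."""
    latency = measure_connect(host, port)
    status = classify(latency)
    return status, format_output(service, status, latency)
```

A wrapper script would call `run_probe(...)`, print the returned line, and pass the status to `sys.exit()`. Keeping the measurement, the classification, and the output formatting separate is one way to let the same probe logic be reused by a different collector, in the spirit of "same probes can be used in both systems".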

Nagios Display

Ganglia display

Prototype delivery timescale
- Stage I – ‘gather_sam’: Operations Workshop, mid-June 2007 (DONE)
- Stage II – ‘check_wlcg’: End mid-July 2007 (DONE)
- Stage III – Local probes: CHEP, September 2007 (End September ?)
- Expect to rapidly iterate, so perhaps only a few “early adopter” sites in June/July
  - Will ask for volunteers at Operations meeting
  - CERN PPS, PIC, NIKHEF, SRCE, … (?)

Futures
- Prototype deployed more widely
  - Probably as part of gLite release in ~1/2 months
- Added sensors running on the actual service nodes
  - Checking logs, daemon status, …
- Integrate OSG, EGEE, (NDGF) data in a single SAM/Gridview display
- Also some new visualisation tools aimed at giving a better “view of the grid”
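A sensor running on the service node itself could, for example, scan recent log lines for error markers. The sketch below covers only that log-checking part; the regular expression, the thresholds, and the function name are hypothetical and not taken from the prototype.

```python
import re

# Status codes follow the Nagios plugin convention.
OK, WARNING, CRITICAL = 0, 1, 2


def check_log_lines(lines, pattern=r"\b(ERROR|FATAL)\b", warn=1, crit=5):
    """Count lines matching `pattern` and map the count to a status.

    Returns (status, match_count, last_matching_line). The pattern and the
    warn/crit thresholds are illustrative assumptions.
    """
    regex = re.compile(pattern)
    matches = [line.rstrip("\n") for line in lines if regex.search(line)]
    count = len(matches)
    if count >= crit:
        status = CRITICAL
    elif count >= warn:
        status = WARNING
    else:
        status = OK
    last = matches[-1] if matches else ""
    return status, count, last
```

In practice the caller would pass only the lines written since the previous run (for instance by remembering a file offset), so that old errors do not keep the service in a failed state.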

Discussion
- Progress of WLCG Monitoring WG since last workshop
- Demonstration of Nagios-based Prototype
- SAM Availability calculation, including equivalence of components across multiple grid infrastructures
- Site Local vs. Central tests - what is a good balance?
- Various job submission methods and job monitoring, monitoring of jobs submitted via condor_g
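To make the availability discussion point concrete, here is a rough sketch of one possible calculation. It assumes a rule in which a site counts as "up" in a sampling interval when every required service type has at least one working instance, and availability is the fraction of intervals in which the site is up. That rule, the equivalence table, and the service type names `OSG-CE` and `OSG-SRM` are assumptions for illustration, not the agreed SAM algorithm.

```python
from collections import defaultdict

# Hypothetical equivalence table: maps a service type name used in one
# infrastructure onto the canonical type used in the calculation.
EQUIVALENT_TYPE = {"OSG-CE": "CE", "OSG-SRM": "SRM"}


def site_up(instance_status, required_types):
    """Return True if every required type has at least one working instance.

    instance_status is a list of (service_type, is_ok) pairs for one interval.
    """
    by_type = defaultdict(list)
    for service_type, is_ok in instance_status:
        canonical = EQUIVALENT_TYPE.get(service_type, service_type)
        by_type[canonical].append(is_ok)
    # A required type with no instances at all counts as down.
    return all(any(by_type.get(t, [])) for t in required_types)


def availability(samples, required_types):
    """Fraction of sampled intervals in which the site was up."""
    if not samples:
        raise ValueError("no samples to compute availability from")
    up = sum(1 for sample in samples if site_up(sample, required_types))
    return up / len(samples)


samples = [
    [("CE", True), ("CE", False), ("SRM", True)],      # up: one CE works
    [("OSG-CE", True), ("OSG-SRM", False)],            # down: storage failed
    [("OSG-CE", False), ("CE", True), ("SRM", True)],  # up after mapping
    [("CE", True)],                                    # down: no SRM reported
]
print(availability(samples, ["CE", "SRM"]))  # 0.5
```

Mapping type names before grouping is what lets results from different infrastructures feed the same formula; which components are really equivalent is exactly the open question listed above.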

WLCG Monitoring WG

Monitoring
You can’t manage what you don’t measure...
- accuracy and credibility
- appropriate metrics - directly relevant to user experience, clearly defined and understood
- measurement instrumentation - active, passive, collection intervals, alarms
- data collection points - system element → service
- real-time → historical
[Diagram: Grid Monitoring: Sensors/Agents → Transport → Repositories → Views / Presentation; automated decision making and manual decision making → Control]
Slide by Max Böhm, EDS

WLCG Grid Monitoring Landscape
- Domain: Grid Applications
  - Monitoring: Application monitoring
  - Tools in use: Experiment Dashboards, ...
- Domain: Grid Middleware (central services, site services)
  - Monitoring: Grid Services monitoring
  - Tools in use: GStat, SAM/GridView, GridICE, GridPP Real Time Monitor, ...
- Domain: local resources (site)
  - Monitoring: Local monitoring
  - Tools in use: Lemon/SLS, Nagios, Ganglia, ...
3 WLCG Monitoring Working Groups
Slide by Max Böhm, EDS

Aims of Grid Services WG
- Create set of ‘standard’ WLCG Probes
  - And how to calculate availability based on the metrics produced
- Improve quality by providing technical guidance
  - Documenting best practices
  - Providing example components

Direction
- Focus on the interaction points between the different systems
- “Specifications, not Standards”
  - Timescales mean we can’t get involved in long and heavyweight standards activities
  - Take best practices from existing systems, and document them
- Get something out to the stakeholders
  - Close feedback loop is the key to adoption
- Plan for a “standards based” solution in the future

High-level Model See https://twiki.cern.ch/twiki/pub/LCG/GridServiceMonitoringInfo/0702-WLCG_Monitoring_for_Managers.pdf for details

Example Site Component View