Monitoring the Grid at local, national, and Global levels Pete Gronbech GridPP Project Manager ACAT - Brunel Sept 2011.


Outline
- Introduction to GridPP
- Local Site Monitoring
- UK Regional Monitoring
- Global Monitoring
- Combined Dashboards

Hierarchy of the Grid
Tier 0 (CERN) → Tier 1 (National Centres) → Tier 2 → Tier 3 (Local Sites)
GridPP provides the UK Particle Physics Grid: 17 university sites plus the Rutherford Appleton Laboratory Tier 1 centre, part of the Worldwide LHC Computing Grid (WLCG). GridPP provides ~28,000 CPU cores; WLCG provides ~ CPU cores.

Site Monitoring - Ganglia
Sites consist of various front-end servers, a batch system providing compute, and storage servers. These are most commonly monitored using Ganglia, a simple-to-install tool for monitoring the status of nodes.
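Ganglia's gmond daemon publishes cluster state as XML (on TCP port 8649 by default). A minimal sketch of consuming it, with a hypothetical inline sample standing in for a real cluster dump:

```python
import xml.etree.ElementTree as ET

def load_averages(gmond_xml):
    """Extract the one-minute load average reported by each host
    in a Ganglia gmond XML dump."""
    root = ET.fromstring(gmond_xml)
    loads = {}
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == "load_one":
                loads[host.get("NAME")] = float(metric.get("VAL"))
    return loads

# In production the XML would be read from gmond's TCP port;
# the hostnames below are illustrative only.
SAMPLE = """<GANGLIA_XML VERSION="3.1">
 <CLUSTER NAME="t2-cluster">
  <HOST NAME="wn01.example.org">
   <METRIC NAME="load_one" VAL="1.25"/>
  </HOST>
  <HOST NAME="wn02.example.org">
   <METRIC NAME="load_one" VAL="0.40"/>
  </HOST>
 </CLUSTER>
</GANGLIA_XML>"""

print(load_averages(SAMPLE))
```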

PBSWEBMON
In addition, tools specific to the batch system may be used. Torque (formerly known as PBS) with the Maui scheduler is the predominant batch system at the UK sites; pbswebmon can be used to monitor it.
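The kind of per-state job summary pbswebmon displays can be derived directly from plain `qstat` output. A small sketch (the job IDs and queues in the sample are made up):

```python
from collections import Counter

def job_states(qstat_output):
    """Tally Torque job states (R = running, Q = queued, ...) from
    plain `qstat` output: state is the fifth column of each job row."""
    counts = Counter()
    for line in qstat_output.strip().splitlines()[2:]:  # skip the two header rows
        fields = line.split()
        if len(fields) >= 6:
            counts[fields[4]] += 1
    return dict(counts)

SAMPLE = """\
Job id             Name    User    Time Use S Queue
------------------ ------- ------- -------- - -----
101.ce.example     prod-1  atlas   12:34:56 R long
102.ce.example     prod-2  atlas   00:00:00 Q long
103.ce.example     ana-1   lhcb    01:02:03 R short
"""

print(job_states(SAMPLE))
```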

Network
Actual traffic rates are monitored at many sites using Cacti. Cluster network traffic, for example between worker nodes and storage, can be seen on the Ganglia plots. GridPP developed GRIDMON to measure network capacity: each site had an identical node which could run a matrix of tests between sites to monitor bandwidth capacity and quality. A database and web front end provided historical plots, which aid problem diagnosis at sites.

Gridmon: Test Topology
"Full mesh" testing does not scale: as you add hosts it becomes more and more difficult to avoid contention between tests. In this particular case the LHC aids us by using a topology of a central star and several mini-meshes for its data flows. Each site only tests to/from the Tier-1 and the other sites within its Tier-2. A combination of ping, iperf, udpmon and traceroute is used.
(Diagram: the Tier-1 "rl" at the centre, with mini-meshes over the sites lancs, dl, liv, man, shef and dur, ed, gla.)

Gridmon in use

Fabric Monitoring
- Is a system up?
- Has it run out of disk space?
- Has a particular process stopped?
- Security logging, and patch status (Pakiti)
A central syslog server can help with scanning logs, and this can be automated with Swatch. Nagios provides a framework to schedule tests against nodes and inform you if there is a problem, which is far better than having to trawl logs trying to spot that 'it's not OK'. So although there is a web interface, it is most useful to configure Nagios to send email or SMS alerts when problems occur.
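Nagios checks are just small programs that report a state via their exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN) plus a one-line status message. A minimal sketch of the disk-space check above, with made-up thresholds:

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def check_disk(free_fraction, warn=0.20, crit=0.05):
    """Classify a filesystem's free-space fraction the way a Nagios
    disk check would, returning (exit_code, status line)."""
    pct = 100 * free_fraction
    if free_fraction < crit:
        return CRITICAL, "DISK CRITICAL - %.0f%% free" % pct
    if free_fraction < warn:
        return WARNING, "DISK WARNING - %.0f%% free" % pct
    return OK, "DISK OK - %.0f%% free" % pct

code, message = check_disk(0.03)
print(message)  # a real plugin would print this and then sys.exit(code)
```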

UK Wide Testing
Steve Lloyd's tests: a collection of global and local tests for the UK sites.

Grid Service Monitoring
Regional Service Availability Monitoring:
- Each region (e.g. the UK) has a system that tests the various grid components at the sites. This is also based on Nagios; the system queries the GOCDB to build up a list of services provided by the sites, and then tests them.
- The results are displayed on the web interface and the MyEGI portal but, more importantly, are sent via ActiveMQ to a message bus, where the Regional Dashboard picks them up.
- Critical failures generate alarms, which a team of operators (the Regional Operator on Duty, or ROD) use to assign tickets to the site. Sites are duty bound by the EGI/WLCG MoUs to respond to these tickets within certain timescales, dependent on Tier status.
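The "query the GOCDB for a site's services" step amounts to parsing the XML the GOCDB programmatic interface returns. A sketch with a hypothetical two-endpoint sample (the element names follow the GOCDB interface, but treat the sample itself as illustrative):

```python
import xml.etree.ElementTree as ET

def endpoints_by_type(gocdb_xml):
    """Group service endpoints by service type from a GOCDB-style
    XML listing, yielding the hosts each Nagios test should target."""
    services = {}
    for ep in ET.fromstring(gocdb_xml).iter("SERVICE_ENDPOINT"):
        stype = ep.findtext("SERVICE_TYPE")
        services.setdefault(stype, []).append(ep.findtext("HOSTNAME"))
    return services

SAMPLE = """<results>
 <SERVICE_ENDPOINT>
  <HOSTNAME>ce01.example.ac.uk</HOSTNAME>
  <SERVICE_TYPE>CREAM-CE</SERVICE_TYPE>
 </SERVICE_ENDPOINT>
 <SERVICE_ENDPOINT>
  <HOSTNAME>se01.example.ac.uk</HOSTNAME>
  <SERVICE_TYPE>SRM</SERVICE_TYPE>
 </SERVICE_ENDPOINT>
</results>"""

print(endpoints_by_type(SAMPLE))
```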

GridPPnagios Views
The UK regional Nagios service is run by Oxford University.

Operations Portal

GSTAT – Information Publishing
Information is published via LDAP from the site BDIIs.
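What a BDII returns is LDIF text, e.g. from `ldapsearch -x -H ldap://site-bdii.example.org:2170 -b o=grid` (the hostname is hypothetical). A simplified parser sketch, tested on a made-up two-entry dump:

```python
def parse_ldif(ldif_text):
    """Parse a simplified LDIF dump into a list of attribute
    dictionaries, one per entry; entries are separated by blank lines
    and attributes may repeat."""
    entries, current = [], {}
    for line in ldif_text.splitlines():
        if not line.strip():
            if current:
                entries.append(current)
                current = {}
            continue
        key, _, value = line.partition(": ")
        current.setdefault(key, []).append(value)
    if current:
        entries.append(current)
    return entries

SAMPLE = """\
dn: GlueServiceUniqueID=srm://se01.example.ac.uk,mds-vo-name=resource,o=grid
GlueServiceType: SRM
GlueServiceEndpoint: httpg://se01.example.ac.uk:8446/srm/managerv2

dn: GlueServiceUniqueID=cream://ce01.example.ac.uk,mds-vo-name=resource,o=grid
GlueServiceType: org.glite.ce.CREAM
"""

for entry in parse_ldif(SAMPLE):
    print(entry["GlueServiceType"][0])
```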

Experiment Dashboards
Large VOs such as ATLAS, CMS and LHCb have their own extensive monitoring systems, which monitor the jobs and their success or failure at sites.

ATLAS Dashboards

More ATLAS Views

LHCb Dashboard

Global Accounting

Site Dashboards
These attempt to bring together the most relevant information from several web pages and display it on one page. This is sometimes done by screen scraping; others use a programmatic interface to select specific information.
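At its core a combined site dashboard reduces many test results to one cell per site, typically worst-state-wins. A minimal sketch (the test names and site data are hypothetical):

```python
# Severity ranking for the usual Nagios-style states.
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

def site_status(test_results):
    """Reduce a site's many monitoring results to the single worst
    state, suitable for one colour-coded dashboard cell."""
    return max(test_results.values(), key=lambda state: SEVERITY[state])

oxford = {"ganglia-load": "OK", "sam-ce": "WARNING", "sam-srm": "OK"}
print(site_status(oxford))  # WARNING
```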

Site Dashboards: RAL Tier 1

Oxford / Glasgow Site Dashboards
Thanks to Glasgow for the idea / code.

Oxford's ATLAS Dashboard

Conclusions
There is probably too much information to ever fit on one dashboard. Systems administrators will continue to need multiple screens to keep track of many web pages. They will have to try to consolidate these with customised dashboards. Or perhaps...

References
- GridPP
- WLCG
- Ganglia
- pbswebmon
- Cacti
- Pakiti
- Nagios
- Swatch
- Gridmon
- Steve Lloyd's tests
- GridPPnagios (WLCG Nagios SAM-equivalent tests), reporting to the Central Operational Dashboard (portal.egi.eu/dashboard) and MyEGI (https://gridppnagios.physics.ox.ac.uk/myegi)
- EGI Levels
- GOCDB
- APEL
- Experiment SAM / Dashboards (e.g. the ATLAS dashboard)
- Experiment-based Nagios
- GSTAT / WLCG REBUS