CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it Michel Jouvin (GRIF/LAL) on behalf of James Casey (CERN) (All materials from J. Casey)

Presentation transcript:

The Architecture of the WLCG Monitoring System
Michel Jouvin (GRIF/LAL) on behalf of James Casey (CERN) (All materials from J. Casey)
EGEE France, Lyon, April 10, 2008
CERN IT Department, CH-1211 Geneva 23, Switzerland

Outline
WLCG Monitoring Working Group
  – Mandate, background and key principles
Technology investigation
  – Messaging system
  – Reporting tools
Site Monitoring Prototype
Example
  – OSG RSV publication
  – Job Reliability Monitoring
  – WLCG/CCRC08 VO-oriented views
Summary

WLCG Monitoring Working Group
The WLCG Monitoring working group was set up in Nov
  “… help improve the reliability of the grid infrastructure …”
  “… provide stakeholders with views of the infrastructure allowing them to understand the current and historical status of the service …”
  “… stakeholders are site administrators, grid service managers and operations, VOs, Grid Project management”
Now acting as a project rather than a WG
  – Provides and maintains deliverables
  – Part of normal operations

Rely on Sites
“Site administrators are closest to the problems, and need to know about them first”
  – On the front line to reduce time to respond
Initial focus has been on site monitoring
Implications
  – Improved understanding of how to monitor services
      “Service Cards” developed by EGEE SA3
  – Need to deploy components to sites
      Sometimes an entire monitoring system
      Needs active participation of site admins

Tell others what you know
“If you’re monitoring a site remotely, it’s only polite to give the data to the site” (Chris Brew, RAL)
  – Remote systems should feed back information to sites
Implications
  – Common publication mechanisms
  – Integration into fabric monitoring
  – Discovery of data
  – Site trust of data: is it a “backdoor” communications mechanism?

Authority for data…
Currently repositories have direct DB connections to (all) other repositories
  – E.g. SAM, Gridview, Gstat, GOCDB, CIC
And they cache, merge and process the data
Implications
  – We have an “interlinked distributed schema”
  – Tools should take responsibility for the contents of parts of it

Visualization for each community
“User-targeted” visualization
  – All should use the same underlying data
      Extract information processing out of visualization tools
      Provide same processed info to all visualizations
  – Interface with community-specific information, e.g. names
Implications
  – Many “similar” dashboards
  – Everyone sees the same data
  – Common frameworks/widgets would help

Process
Review existing monitoring systems
  – “Improving reliability is our goal!”
Identify gaps
Design integrated architecture for monitoring
  – Prototype some solutions
  – Reduce to a minimum the specific components to develop and maintain
Must be usable by whole WLCG
  – EGEE, OSG, NDGF

The pieces to work with…
The starting point was what we have now:
  – Availability testing framework: SAM/RSV
  – Job and data reliability monitoring: Gridview
  – Grid topology: GOCDB/Registration DB
  – Dynamic view of the grid: BDII/CeMon
  – Accounting: APEL/Gratia
  – Experiment views: Dashboards
  – Fabric monitoring: Nagios, LEMON, …
  – Grid operations tools: CIC Portal
They work together right now
  – To a certain extent!

We’ve got an integration problem!

No monolithic systems
Different systems should specialize in their areas of expertise
  – And not have to also invent all the common infrastructure
Implications
  – Less overlap and duplication of work
  – Someone needs to manage some common infrastructure
  – We need to agree on the common infrastructure

Don’t have central bottlenecks
“Local problems detected locally shouldn’t require remote services to work out what the problem is”
  – Still a role for central detection of problems
      Just that they are reported locally too
Lots of central processing done now in SAM/Gridview
Implications
  – Do as much processing locally (or regionally)
  – Helps scaling, improves robustness
  – Enables automation, reduces manpower
  – Harder to deploy

Re-use, don’t re-invent
What do we do?
  – Collect some information, move it around
  – Store it, view it, report on it
This is pretty common
  – We should look at existing systems
Already happening for site fabric…
  – Nagios, LEMON, …
Implications
  – Less code to develop and maintain
  – Integration nightmare?

Don’t impose systems on sites
We can’t dictate a monitoring system
  – Many (big?) sites already have a deployed system
  – We have to be pluggable into them
Implications
  – Modular approach
  – Specifications to define interfaces between existing systems and new components

Broker at the centre
Reliability and persistence of messaging are built into the broker network
Mitigates the single points of failure we’ve had with previous solutions
Message delivery is guaranteed

Plug’n’Play Components
Still can end up with spaghetti
  – A component must take care only of its own job and ignore details of others (e.g. data schema)
Tight specification of interaction of components is required
  – Message format specifications
  – Standard metadata schema
  – Message queue naming schemes
  – Protocols
Standard “Patterns” can act as a basis
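
As an illustration of what a tight specification between components could look like, here is a minimal Python sketch of a validated message envelope and a fixed destination naming scheme. The slides do not give the actual WLCG message format, so every field name in `REQUIRED_FIELDS`, the `grid.` prefix and the dot-separated naming rule are assumptions made up for this example.

```python
import json

# Hypothetical metadata every message must carry; the real schema is not in the slides.
REQUIRED_FIELDS = ("producer", "site", "service", "metric", "status", "timestamp")


def destination_name(infrastructure, kind, site):
    """Build a destination name from an assumed dot-separated naming scheme."""
    parts = (infrastructure, kind, site)
    for part in parts:
        if not part or "." in part:
            raise ValueError(f"invalid name component: {part!r}")
    return ".".join(("grid",) + tuple(p.lower() for p in parts))


def _check(fields):
    """Reject messages that do not carry the agreed metadata."""
    missing = [name for name in REQUIRED_FIELDS if name not in fields]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))


def encode_message(fields):
    """Serialize a message; producers only need to know this format."""
    _check(fields)
    return json.dumps(fields, sort_keys=True)


def decode_message(raw):
    """Parse and validate a message; consumers only need to know this format."""
    fields = json.loads(raw)
    _check(fields)
    return fields
```

With such a contract, a producer and a consumer share only the envelope and the destination name, not each other's internal data schema.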

Messaging Systems for Integration
We need:
  – Loose coupling of systems
  – Distributed components
  – Reliable delivery of messages
  – Standard methods of communication
  – Flexibility to add new producers and consumers of the information without having to reconfigure everything
Message Oriented Middleware provides this
  – And is widely used in similar scenarios

Messaging Systems
Flexible architecture:
  – Deliver messages, either in point-to-point mode (queues)…
  – … or multicast mode (topics)
  – Support synchronous or asynchronous communication
Reliable delivery of messages:
  – Provide reliability to the senders if required
  – Configurable persistency / Master-Slave
Highly scalable:
  – Network of brokers
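
The difference between the two delivery modes can be sketched with a toy in-memory broker. This is not ActiveMQ's API: the `ToyBroker` class and its method names are invented for illustration, and it ignores persistence, networking and acknowledgements.

```python
from collections import defaultdict, deque


class ToyBroker:
    """In-memory stand-in for a message broker (illustration only)."""

    def __init__(self):
        self._queues = defaultdict(deque)      # queue name -> pending messages
        self._subscribers = defaultdict(list)  # topic name -> subscriber callbacks

    def send(self, queue, message):
        """Point-to-point: store the message until one consumer takes it."""
        self._queues[queue].append(message)

    def receive(self, queue):
        """Hand the oldest pending message to exactly one caller, or None."""
        pending = self._queues[queue]
        return pending.popleft() if pending else None

    def subscribe(self, topic, callback):
        """Register a callback that gets a copy of every message on the topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Multicast: deliver the message to every current subscriber."""
        subscribers = list(self._subscribers[topic])
        for callback in subscribers:
            callback(message)
        return len(subscribers)
```

A message sent to a queue is consumed once, whereas a message published to a topic reaches every subscriber, which is what lets new consumers be added without reconfiguring the producers.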

ActiveMQ
Mature open-source implementation of these ideas
  – Top-level Apache project
  – Commercial support available from IONA
Easy to integrate
  – Multiple language + transport protocol support
Good performance characteristics
  – See later …
Work done to integrate into our environment
  – RPMs, Quattor components + templates, LEMON alarms

ActiveMQ Architecture

ActiveMQ Throughput
> Consumers > Throughput ??
Consumer bottleneck! With a larger number of producers, even more messages per second saturate the consumer.
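
The consumer bottleneck can be made concrete with simple arithmetic: once producers send faster than consumers can drain, the backlog grows linearly with time. The sketch below assumes constant rates and an initially empty queue. The function name and all numbers used with it are illustrative, not measurements from the slides.

```python
def backlog_after(seconds, producers, rate_per_producer, consumers, rate_per_consumer):
    """Messages left unconsumed after `seconds`, assuming constant rates (msg/s)."""
    incoming = producers * rate_per_producer
    capacity = consumers * rate_per_consumer
    # A queue cannot hold a negative number of messages.
    return max(0, (incoming - capacity) * seconds)
```

For example, with four hypothetical producers at 500 msg/s each and one consumer able to handle 1500 msg/s, the surplus is 500 msg/s. Adding a second consumer of the same capacity removes the backlog under these assumptions.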

Reporting for WLCG
Currently a post-processing of results and graphs in Excel
  – Much manual work needed!
Try to implement it directly on the GridView DB
Using a mature open-source reporting toolkit: JasperReports
  – UI report builder: iReports
  – Web-based report server: OpenReports

JasperReports

Site Monitoring & Nagios
More details in next talk:
  – “Central Europe ROC Nagios Experience”
Nagios has shown itself to be a very useful component for building many parts of our monitoring solutions
  – Local site monitoring
  – Replacing the SAM execution framework
      Too hard to maintain, too centralized to scale
  – gStat: BDII monitoring
      Probes within Nagios
Publish site results upwards to be part of availability/reliability computation
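
To show what a probe run by Nagios looks like, here is a minimal Python sketch of a TCP connectivity check. It follows the standard Nagios plugin convention of return codes 0 (OK), 1 (WARNING), 2 (CRITICAL) and 3 (UNKNOWN) with a one-line status text. The function names, thresholds and the choice of a TCP check are assumptions for illustration, not one of the actual WLCG probes.

```python
import socket
import time

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(elapsed, warn_s, crit_s):
    """Map a connection time in seconds (None means failure) to a status."""
    if elapsed is None:
        return CRITICAL, "connection failed"
    if elapsed >= crit_s:
        return CRITICAL, f"connect time {elapsed:.2f}s"
    if elapsed >= warn_s:
        return WARNING, f"connect time {elapsed:.2f}s"
    return OK, f"connect time {elapsed:.2f}s"


def connect_time(host, port, timeout_s):
    """Return the seconds needed to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return time.monotonic() - start
    except OSError:
        return None


def run_probe(host, port, warn_s=1.0, crit_s=5.0):
    """Run the check and return (exit code, status line) as Nagios expects."""
    code, detail = classify(connect_time(host, port, crit_s), warn_s, crit_s)
    return code, f"{LABELS[code]} - {host}:{port} {detail}"
```

A wrapper script would print the status line and exit with the returned code. The same result could also be published upwards as a message, so that the site and the central availability computation see the same data.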

Messaging based archiving and reporting

In Production - OSG RSV to SAM
RSV: Resource and Service Validation
  – Uses Gratia as native transport within OSG
  – And OSG GOC runs a bridge to SAM for WLCG

Job Reliability Monitoring
Requires gathering job state transitions from all jobs submitted to WLCG resources
  – EGEE (RB/WMS + Condor_G) + OSG + NDGF
Only gather this information once
  – Propagate to interested parties
Using existing systems and expertise where possible
  – Don’t try to deploy components on every WMS/RB/L&B/CE/…
  – Get ‘cooked’ data from the systems
Hook up with Pilot Jobs
  – Linkage between pilot and experiment jobs as a ‘state change’

Current situation
Currently mines L&B log files, and sends them via R-GMA
  – Requires a specific component on every L&B
      Loses many records
      GridView hacks to ‘finish’ unfinished jobs after 24h
  – Inaccurate results
Jobs reported via experiment frameworks
  – Gathers from many sources: Imperial College XML files, job submission tools, MonALISA reporting from jobs, R-GMA
But some missing information for Condor_G jobs
  – Info between submission and user job starting on WN
  – Job aborted

Proposal
Use WLCG Monitoring infrastructure (MSG) for collecting and transporting the data
  – Messaging system
  – Standard message formats
Work with expert groups to instrument the job submission systems
Visualization by Gridview + Dashboards

EGEE
L&B notifications mean we don’t have to run components mining L&B log files
  – Consumer of notifications can be remote
L&B is stated to scale for our needs
  – Tested at >1 million records/day
  – Testing of integration with notifications underway by GridView team
Message formats already defined
  – Old log mining approach will be moved to the messaging system to free GridView from its R-GMA dependency

Condor_G
Condor_G submitter instrumented to create L&B messages
  – Done by a separate listener process that is started by Condor_G
  – Limited subset of Condor_G state changes will be sent
Listener/reporter can use different transport for reporting
  – Currently MonALISA as a transport layer
  – Will migrate to WLCG messaging system

Pilot Jobs
L&B client resides on every worker node
Can be used to submit additional messages to L&B for a job
  – Timestamps + environment for Job Wrapper start/end
  – Timestamp of handover to user job
  – Linkage of pilot job to experiment job ID
  – …
Benefit is that it’s all in one coherent data structure for a given job
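
The idea of one coherent data structure per job, with the pilot-to-experiment linkage recorded as just another state change, can be sketched as follows. This is not the L&B data model: the class names, the state names such as `PAYLOAD_LINKED` and the `experiment_job_id` key are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field


@dataclass
class JobEvent:
    timestamp: float                       # seconds since an arbitrary epoch
    state: str                             # hypothetical state name
    details: dict = field(default_factory=dict)


@dataclass
class JobRecord:
    job_id: str
    events: list = field(default_factory=list)

    def add(self, timestamp, state, **details):
        """Record a state change; events may arrive out of order."""
        self.events.append(JobEvent(timestamp, state, details))
        self.events.sort(key=lambda event: event.timestamp)

    def current_state(self):
        """State of the most recent event, or None if nothing was reported."""
        return self.events[-1].state if self.events else None

    def experiment_job_ids(self):
        """Experiment job IDs linked to this pilot job, in time order."""
        return [
            event.details["experiment_job_id"]
            for event in self.events
            if event.state == "PAYLOAD_LINKED"
        ]
```

Because the wrapper timestamps, the handover and the linkage all live in the same record, a consumer can reconstruct the full history of a job from a single source.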

EGEE Architecture

CMS SAM Portal

ServiceMap
What’s a ServiceMap?
  – It’s a gridmap with many different maps, showing different aspects of the WLCG infrastructure
  – Gridmap: “treemap”-based view of the grid
What’s the CCRC’08 ServiceMap?
  – Service ‘readiness’
  – Service availability
  – Experiment metrics
A single place to see both the VO and the infrastructure view of the grid

CCRC’08 ServiceMap
…Demo…

WLCG Experiment metrics
Show the VO view of the infrastructure
Two extra ‘maps’ planned
  – Reliability (e.g. successful data transfers, jobs, …)
  – Metrics (MB/s, events/s, …)
Need interaction with experiments to create these two views

Summary
CCRC’08 is a good opportunity to try some new operational tools
  – And evaluate them in a ‘real-world’ mode
The CCRC’08 ServiceMap seems to give a useful view of the grid
  – Need to iterate on what is useful to show
  – And fill in the white spaces…
Next steps
  – MoU calculation and reporting to sites
Feedback on all the tools welcome!

Links to CCRC08 tools
CCRC’08 ServiceMap
CCRC’08 Observations logbook
  RSS feed: https://prod-grid-logger.cern.ch/elog/CCRC'08+Observations/elog.rdf
Response tracking logbook
  RSS feed: https://prod-grid-logger.cern.ch/elog/CCRC'08+Logbook/elog.rdf

Strategy Summary
Converge to standards, but without a big bang
Leverage the underlying infrastructures rather than layer lots of systems on top
Reduce maintenance/development costs by using commodity components whenever possible
Modular and loosely-coupled to adapt to changes in infrastructure and funding models

Architecture Summary
Our design for a new architecture leverages commodity software components
  – Probe execution (Nagios), messaging (ActiveMQ), reporting (JasperReports)
It is essentially an integration exercise
  – Make existing tools work together better
In order to improve reliability
  – This is what we will verify over the next 12 months

More Information…
GDB Reports on Monitoring by James
  – Almost every month at GDB or pre-GDB
  – http://indico.cern.ch/categoryDisplay.py?categId=3l181
Improving Job Reliability
  – http://indico.cern.ch/conferenceDisplay.py?confId=20228
James…
  – Look at CERN directory…