Monitoring in CMS
Daniele Bonacorsi, Andrea Sciabà
19/5/2011

Outline
  Introduction
  Service monitoring
  Site monitoring
  Dashboard monitoring
  Transfer monitoring
  Data popularity
  Other monitoring
  Conclusions
5/4/2011

Introduction
  CMS uses a large variety of monitoring sources and tools
  Main providers are
    WLCG (SAM/Nagios, Gridview, etc.)
    CERN IT (Lemon, SLS, Hammercloud, Dashboard, Data popularity, etc.)
    Caltech (MonALISA)
    CMS (PhEDEx monitoring, Overview, etc.)
    KIT+DESY+… (HappyFaces)
  This is not meant to be an exhaustive review of every monitoring system!
  Mainly focused on computing rather than data quality, software release validation, etc.

Service monitoring
  Most CMS services (and IT services used by CMS) are monitored using Lemon and SLS
    Lemon: node-centric, standard + custom metrics, provides alarms and actuators
    SLS: service-centric, produces one estimator (“availability”) plus arbitrary metrics, provides alarms
    Both widely used in CERN IT and LHC experiments
  The Critical Service map (developed by the Dashboard team) gives an overview of the status of CMS services

SLS

Critical service map

Site monitoring
  SAM/Nagios framework
    Functional tests run on remote services (computing and storage elements)
    Used in WLCG and EGEE/EGI for several years
    CMS-specific tests are run with a CMS certificate
  CMS Job Robot
    “Fake” analysis jobs automatically sent to all sites
    Read a dataset replicated everywhere
    Job success rate measured
  Transfer link quality
    Count how many “good” links the site has, looking at the rate of transfer failures
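
The Job Robot success rate and the link-quality count described above come down to simple ratios and counts. The following is a minimal Python sketch, not the CMS implementation: the 50% failure-rate threshold for a “good” link, the function names and the site names are all assumptions made for illustration, since the slides do not give the actual values.

```python
def success_rate(succeeded, total):
    """Fraction of Job Robot jobs that succeeded; None if no jobs ran."""
    if total == 0:
        return None
    return succeeded / total


def count_good_links(link_stats, max_failure_rate=0.5):
    """Count links whose transfer failure rate is at most max_failure_rate.

    link_stats maps a link name to (failed_transfers, attempted_transfers).
    Links with no attempts are not counted as good.
    The 0.5 default is an illustrative assumption, not the CMS value.
    """
    good = 0
    for failed, attempted in link_stats.values():
        if attempted > 0 and failed / attempted <= max_failure_rate:
            good += 1
    return good


# Hypothetical numbers for one site
links = {
    "T1_A -> T2_X": (2, 100),   # 2% failures: good
    "T1_B -> T2_X": (80, 100),  # 80% failures: not good
    "T2_X -> T1_A": (0, 0),     # no attempts: not counted
}
print(success_rate(45, 50))
print(count_good_links(links))
```

With these made-up inputs, one of the three links qualifies as good.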

Site monitoring plots
  SAM
  JR summaries
  JR errors
  Transfer quality

Gridview?
  The portal is not actively used in CMS
  Site availability calculated by the Dashboard using more critical tests than those considered by Gridview
  LCG-CE and CREAM-CE already “ORed” in the Dashboard
    Soon possible also in Gridview
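
One way to read the “ORed” CE flavours is sketched below in Python: a site counts as available when at least one CE flavour passes its critical tests and every other service type passes too. The exact Dashboard algorithm is not given in the slides, so the combination rule, the function names and the test names are assumptions.

```python
def service_type_ok(test_results):
    """A service type is OK only if all of its critical tests passed."""
    return all(test_results.values())


def site_available(ce_flavours, other_services):
    """Available if any CE flavour is OK ("OR") and all other services are OK ("AND")."""
    ce_ok = any(service_type_ok(tests) for tests in ce_flavours.values())
    others_ok = all(service_type_ok(tests) for tests in other_services.values())
    return ce_ok and others_ok


# Hypothetical test results: the LCG-CE fails, the CREAM-CE works
ce = {
    "LCG-CE": {"job-submit": False},
    "CREAM-CE": {"job-submit": True},
}
storage = {"SRM": {"put": True, "get": True}}
print(site_available(ce, storage))
```

Without the OR, the failing LCG-CE alone would mark this site as unavailable.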

CMS Site Readiness
  An aggregator of site monitoring information to express if a site is “working” or not
    READY / NOT-READY / WARNING / SCHEDULED DOWNTIME
  Use the recent history of the tests rather than a simple “AND” combination of the latest results (e.g. READY if all metrics OK for ≥ 5/7 days)
  Combines SAM/Nagios, JR and link quality to answer questions like
    Do jobs run?
    Is CMS software properly installed?
    Can read local data?
    Can copy output to local storage?
    Can data be remotely read and written?
    Can transfer to/from other sites?
  Uses GOCDB to find downtimes
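
The history-based rule (“READY if all metrics OK for ≥ 5/7 days”) can be sketched as follows. Only the 5-out-of-7 example and the four status names come from the slide. The rule used here for WARNING, the metric names and the function names are assumptions made to keep the example complete.

```python
def daily_ok(day_metrics):
    """A day is good only if every metric (SAM/Nagios, JR, links) is OK."""
    return all(day_metrics.values())


def site_readiness(history, in_scheduled_downtime=False, window=7, min_good=5):
    """Classify a site from its recent daily history (oldest day first).

    Assumed rules (illustrative only):
      - SCHEDULED DOWNTIME if a downtime is declared (e.g. found in GOCDB)
      - READY     if >= min_good of the last `window` days were good
                  and the latest day was good
      - WARNING   if >= min_good days were good but the latest was not
      - NOT-READY otherwise
    """
    if in_scheduled_downtime:
        return "SCHEDULED DOWNTIME"
    recent = history[-window:]
    good_days = sum(1 for day in recent if daily_ok(day))
    if good_days < min_good:
        return "NOT-READY"
    return "READY" if daily_ok(recent[-1]) else "WARNING"


ok = {"sam": True, "job_robot": True, "links": True}
bad = {"sam": True, "job_robot": False, "links": True}

print(site_readiness([ok, ok, bad, ok, ok, bad, ok]))    # 5 good days, latest good
print(site_readiness([ok, ok, ok, ok, ok, ok, bad]))     # 6 good days, latest bad
print(site_readiness([bad, bad, bad, ok, ok, ok, ok]))   # only 4 good days
```

Using a window of days rather than the latest result keeps a single transient failure from flipping a site to NOT-READY.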

Example

SR metrics in the Site Status Board
  Using the Dashboard Site Status Board to display arbitrary site information

Historical trends
  ~6 out of 7 T1 sites good
  ~40 out of 50 T2 sites good

Hammercloud
  A distributed analysis testing system used in ATLAS, CMS and LHCb, serving two use cases:
    Robot-like functional testing: frequent “ping” jobs to all sites to perform basic site validation
    Stress testing: on-demand large-scale stress tests using real analysis jobs to test one or many sites, in order to
      Help commission new sites
      Evaluate changes to site infrastructure
      Evaluate SW changes
      Compare site performances

Hammercloud statistics
  Italian Tier-2 sites

The Dashboard framework
  Initially developed in IT for CMS, later extended to the other LHC experiments
  Covers job monitoring and site/service status monitoring
  Provides user/VO monitoring views
  Information sent by job submission tools and by running jobs to a MonALISA server as UDP messages
    Planned to start using the WLCG MSG system
  Information stored in Oracle database
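
The “fire and forget” nature of reporting job status over UDP can be illustrated with the Python standard library alone. This sketch does not use the MonALISA client library or its wire format: the JSON payload, the field names and the local stand-in collector are assumptions chosen only to show the pattern.

```python
import json
import socket


def send_job_status(sock, address, job_id, status, site):
    """Send one job-status report as a single UDP datagram (no acknowledgement)."""
    payload = json.dumps({"job_id": job_id, "status": status, "site": site})
    sock.sendto(payload.encode("utf-8"), address)


# Local stand-in for the collector, bound to a free port on localhost
collector = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
collector.bind(("127.0.0.1", 0))
collector.settimeout(2.0)

# The "job" side: send a report and carry on without waiting for a reply
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_job_status(sender, collector.getsockname(), "job-001", "RUNNING", "T2_Example")

data, _ = collector.recvfrom(4096)
report = json.loads(data.decode("utf-8"))
print(report["status"])

sender.close()
collector.close()
```

Because UDP gives no delivery guarantee, a running job is never blocked by the monitoring, at the price of occasionally lost reports.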

Dashboard architecture

CMS job monitoring (and more)
  Interactive view
    To see the status of current and recent jobs, how they are distributed, how they failed, etc.
  Historical view
    To see, as a function of time, the number of running jobs, their success rate, their CPU efficiency, etc.
  Task monitoring
    See status of user analysis tasks
  Other monitoring
    Visualization of SAM/Nagios tests, analysis weekly reports, critical service map, etc.

Interactive view

Historical view

Task monitoring

Nagios portal in Dashboard

Transfer monitoring
  Extensive current and historical information available from the PhEDEx monitoring

FTS monitoring
  In WLCG all site-to-site transfers proceed via the File Transfer Service (or via xrootd)
  Troubleshooting transfer problems requires looking directly at the FTS monitoring
    Channel configuration
    Details on failed transfer attempts

FTS monitor parser
  Being developed in CMS but usable by anybody
  Full statistics about successful transfers from FTS monitors worldwide
    Average transfer rates per file/stream and their historical evolution
  Useful to
    Find general issues with endpoints and links
    Check network performance (e.g. for LHCONE)
    Optimize FTS channels
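
The “average transfer rates per file/stream” statistic can be sketched as below. This is not the parser itself: the record layout (`link`, `bytes`, `seconds`, `streams`) is a hypothetical simplification of what an FTS monitor might expose, and the per-stream rate is taken here to be the file rate divided by the number of streams.

```python
from collections import defaultdict


def average_rates(transfers):
    """Return {link: (avg MB/s per file, avg MB/s per stream)}.

    Each transfer is a dict with hypothetical keys:
    'link', 'bytes', 'seconds', 'streams'.
    Transfers with non-positive duration or stream count are skipped.
    """
    per_link = defaultdict(list)
    for t in transfers:
        if t["seconds"] <= 0 or t["streams"] <= 0:
            continue
        file_rate = t["bytes"] / t["seconds"] / 1e6  # MB/s for the whole file
        per_link[t["link"]].append((file_rate, file_rate / t["streams"]))
    result = {}
    for link, rates in per_link.items():
        n = len(rates)
        result[link] = (sum(r[0] for r in rates) / n, sum(r[1] for r in rates) / n)
    return result


# Two made-up successful transfers on the same link
transfers = [
    {"link": "T1_A -> T2_X", "bytes": 2_000_000_000, "seconds": 100, "streams": 5},
    {"link": "T1_A -> T2_X", "bytes": 1_000_000_000, "seconds": 100, "streams": 5},
]
print(average_rates(transfers))
```

Comparing the per-stream figure across links is one way to spot endpoints where adding streams no longer helps, which is the kind of input needed to tune FTS channels.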

Example

WLCG transfer monitoring
  A VO-agnostic project to provide global transfer monitoring
  Concept
    FTS instances (and other transfer systems) publish transfer events and queue status to the WLCG MSG (ActiveMQ)
    A global transfer dashboard stores recent data (~3 months) and produces plots and statistics
    Raw event data can be consumed via MSG by any application using an API
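
The publish/consume concept above can be sketched in a few lines of Python. An in-memory `queue.Queue` stands in for the message broker, so no ActiveMQ client API is shown or implied. The event fields, the 90-day reading of “~3 months” and the class name are assumptions.

```python
import queue
from datetime import datetime, timedelta

RETENTION = timedelta(days=90)  # "~3 months"; the exact value is assumed


class TransferDashboard:
    """Consumes transfer events from a bus and keeps only recent ones."""

    def __init__(self):
        self.events = []

    def consume(self, bus, now):
        # Drain everything currently waiting on the bus
        while True:
            try:
                self.events.append(bus.get_nowait())
            except queue.Empty:
                break
        # Drop events older than the retention window
        cutoff = now - RETENTION
        self.events = [e for e in self.events if e["time"] >= cutoff]

    def failure_fraction(self):
        """Fraction of stored events that failed; None if nothing is stored."""
        if not self.events:
            return None
        failed = sum(1 for e in self.events if e["state"] == "FAILED")
        return failed / len(self.events)


bus = queue.Queue()  # stand-in for the message broker
now = datetime(2011, 5, 19)
# Events as a hypothetical FTS publisher might emit them
bus.put({"time": now - timedelta(days=1), "src": "T1_A", "dst": "T2_X", "state": "DONE"})
bus.put({"time": now - timedelta(days=2), "src": "T1_A", "dst": "T2_X", "state": "FAILED"})
bus.put({"time": now - timedelta(days=200), "src": "T1_B", "dst": "T2_X", "state": "FAILED"})

dashboard = TransferDashboard()
dashboard.consume(bus, now)
print(len(dashboard.events), dashboard.failure_fraction())
```

The publisher only knows about the bus, not about the dashboard, which is what lets other applications consume the same raw events.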

Architecture

Advantages and plans
  Advantages
    Decouple from local FTS monitoring
    Cross-technology interface (FTS, xrootd, etc.)
    More details on transfers
    Correlations among VOs
  Plans
    Defined message format
    Implemented prototype of FTS publisher (IT-GT)
    Web interface development starts this summer

Data popularity
  A framework developed by CERN IT-ES to provide
    Usage statistics vs time for CMS files and datasets by analysis jobs: file access success/failure, CPU time, users, …
    A data service for future applications
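
The kind of usage statistics listed above can be produced by a simple aggregation over file-access records. The sketch below is illustrative only: the record fields (`day`, `dataset`, `user`, `cpu_s`, `opened_ok`) and the dataset name are hypothetical, and the real framework's data model is not described in the slides.

```python
from collections import defaultdict


def popularity(accesses):
    """Aggregate file-access records into per-day, per-dataset statistics."""
    stats = defaultdict(
        lambda: {"accesses": 0, "failures": 0, "cpu_s": 0.0, "users": set()}
    )
    for a in accesses:
        s = stats[(a["day"], a["dataset"])]
        s["accesses"] += 1
        if not a["opened_ok"]:
            s["failures"] += 1  # file could not be opened
        s["cpu_s"] += a["cpu_s"]
        s["users"].add(a["user"])
    return dict(stats)


# Made-up access records reported by analysis jobs
accesses = [
    {"day": "2011-05-01", "dataset": "/Example/Dataset/RECO", "user": "alice", "cpu_s": 120.0, "opened_ok": True},
    {"day": "2011-05-01", "dataset": "/Example/Dataset/RECO", "user": "bob", "cpu_s": 0.0, "opened_ok": False},
    {"day": "2011-05-02", "dataset": "/Example/Dataset/RECO", "user": "alice", "cpu_s": 60.0, "opened_ok": True},
]
stats = popularity(accesses)
day1 = stats[("2011-05-01", "/Example/Dataset/RECO")]
print(day1["accesses"], day1["failures"], len(day1["users"]))
```

Keying on the day as well as the dataset gives the “vs time” dimension directly.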

Data popularity: first results
  Accesses by dataset
  Accesses by site of most popular dataset
  # accesses
  Fraction of open failures by site
  Fraction of open failures by file

MonALISA
  Used “behind the curtains”:
    Dashboard
    CRAB server monitoring
    Xrootd monitoring
  Very stable and reliable

Other monitoring
  Storage accounting
    Using the Site Status Board to publish amount of used and free space on sites
    Needs work, as BDII information not reliable
  xrootd monitoring
    Developed for the CMS xrootd global redirector project
  Data operations monitoring
    T0 operations: detailed info on T0 workflows
  Local Tier-1 batch monitoring via HappyFaces
    Information sent as standard XML files

HappyFaces

Main goals for the future
  Move towards a coherent framework for alarms and notifications
  Reorganize views to make them more convenient and converge on fewer aggregator technologies
  Provide more powerful monitoring for operators of the workflow management tools, both for production and analysis
  “Clean up” and further improve performance of available tools
