CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it
Integrating Lemon Monitoring and Alarming System with the new CERN Agile Infrastructure

1 Integrating Lemon Monitoring and Alarming System with the new CERN Agile Infrastructure
P. Andrade, L. Cons, I. Fedorko, B. Fiorini, A. Iribarren, V. Lefebure, G. Mccance, O. Pera, M. Paladin, I. Reguero, M. Dos Santos, S. Traylen
CERN
HEPiX Fall 2012 Workshop

2 CERN Agile Infrastructure Monitoring HEPiX Spring 2012
– High Level Architecture
– View of shared architecture
Lemon – LHC Era Monitoring System
– Is Lemon only about “performance monitoring”?
– Why architecture evolution rather than replacement by existing monitoring tool(s)?
Agile Infrastructure for Monitoring
– Shared Infrastructure
– Use cases: Data store, Visualization
– Event processing and management
– Status of the components

3 Lemon: LHC Era Monitoring System
In-house developed, multi-component, client/server-based monitoring system
Node Monitoring: Sensor, Monitoring Agent, Local Cache
Measurement Repository: Oracle Database, Repository Backend, Application Server
User Interfaces: Lemon CLI (command line tool to access data), Lemon-host-check (command line tool for node exceptions), Web Browser, RRD tool / Python, Apache / PHP
Protocols: SQL, TCP/UDP, HTTP
Individually configurable nodes with autonomous recovery actions
Chain of tools based on DB backend

4 Lemon: Performance, application and facility monitoring
– Node monitoring, e.g. CPU load
– Time-series processing
– Hierarchy clustering: Cluster, Sub-cluster, Node
– On-behalf monitoring: Smart Power Distribution Units
– Historical data export

5 Lemon: Service availability and alarming
Node monitoring
– Disk occupancy
– Number of processes
– Log file parse matched
Corrective action on the node
– Run script locally to clean var dir
– After 3rd attempt, var occupancy > 90%
Monitoring repository export with guaranteed reliability and data processing, e.g. Service Level Status
var_ful/ alarm: System administrator, Support ticket
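
The corrective-action flow on this slide can be sketched as follows. This is a minimal illustration, not Lemon's actual agent code: the class name `OccupancyAlarm`, the `check` method and the returned status strings are hypothetical, and it assumes the alarm is raised on the first sample that is still above 90% after three local cleanup attempts.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class OccupancyAlarm:
    """Hypothetical helper: try a local cleanup, alarm if it keeps failing."""
    threshold: float = 90.0   # percent occupancy, value taken from the slide
    max_attempts: int = 3     # cleanup attempts before alarming, from the slide
    attempts: int = 0         # cleanup attempts made since the last healthy sample

    def check(self, occupancy: float, cleanup: Callable[[], None]) -> str:
        """Evaluate one occupancy sample and return 'ok', 'corrected' or 'alarm'."""
        if occupancy <= self.threshold:
            self.attempts = 0          # healthy again, reset the attempt counter
            return "ok"
        if self.attempts < self.max_attempts:
            self.attempts += 1
            cleanup()                  # e.g. run a local script to clean the var dir
            return "corrected"
        return "alarm"                 # still too full after all attempts


alarm = OccupancyAlarm()
samples = [95.0, 94.0, 93.0, 92.0]
states = [alarm.check(s, cleanup=lambda: None) for s in samples]
```

In this sketch the first three over-threshold samples each trigger a cleanup, and only the fourth is escalated, which is where a system administrator notification or support ticket would be created.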

6 Lemon: Monitoring @ Large scale
Experience
– No single solution replacement
Requirements
– Tools chain, e.g. data mining interface different from time series trending
– Flexible migration, e.g. compatible with lemon node client
– Large scale ready
Current system: ~11k monitored entities, ~150 metrics/entity
Expected scale: ~300k entities
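
To make the scale figures concrete, the short calculation below multiplies them out. The entity and metric counts come from the slide; the assumption that the per-entity metric count stays at about 150 at the expected scale is ours, and the variable names are illustrative only.

```python
# Figures quoted on the slide (approximate).
current_entities = 11_000
metrics_per_entity = 150
expected_entities = 300_000

# Distinct metric streams today.
current_streams = current_entities * metrics_per_entity

# Assumption: metrics per entity stays at ~150 at the expected scale.
expected_streams = expected_entities * metrics_per_entity

# Growth factor in monitored entities.
growth = expected_entities / current_entities

print(f"current: {current_streams:,} streams")
print(f"expected: {expected_streams:,} streams ({growth:.1f}x entities)")
```

Under that assumption the system would go from roughly 1.65 million to roughly 45 million metric streams, a factor of about 27.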

7 Agile Infrastructure with performance monitoring
Components: Lemon agent, Lemon to messaging, Message Bus, Custom script, Monitoring XYZ, Visualization and correlation, Data store, Cluster processing (e.g. high load for >50% of cluster), Ticketing, SMS gateway, Dashboard
Diagram labels: Operations, Planned, Components, Views
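
The cluster processing rule named on this slide, "high load for >50% of cluster", can be sketched as a simple aggregation over per-node values. The function name `cluster_high_load`, the per-node load threshold and the sample data are hypothetical; the slide only gives the 50% fraction.

```python
from typing import Mapping


def cluster_high_load(
    loads: Mapping[str, float],
    load_threshold: float,
    fraction: float = 0.5,
) -> bool:
    """Return True when more than `fraction` of the nodes exceed `load_threshold`."""
    if not loads:
        return False  # an empty cluster cannot be overloaded
    high = sum(1 for value in loads.values() if value > load_threshold)
    return high / len(loads) > fraction


# Hypothetical load averages for a three-node cluster.
sample = {"node1": 9.0, "node2": 8.0, "node3": 1.0}
overloaded = cluster_high_load(sample, load_threshold=5.0)
```

Here two of three nodes are above the threshold, so the cluster-level condition holds even though no single node alarm would describe it.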

8 Storing and visualization
Message Bus
Possible options (R&D on-going):
– Oracle: Lemon web
– NoSQL: Visualization, Data mining (batch processing)
– Splunk: Data mining, Visualization, Correlation
– RRD visualization

9 NoSQL-based data store for monitoring
Example from Data Storage Service
– Log parsing and processing based on the NoSQL DB
– Prototyped by CERN IT/DSS
– Shared infrastructure

10 Splunk for data mining/visualization
– High precision data mining in the current system solved by dedicated exports
– ~1.5 years of Lemon raw data (~4.5 TB in Oracle) → ~2.5 TB Splunk data with metadata information (~43 billion entries)
– One year period of basic metrics on node → on-the-fly browsing capability with high time granularity
Under testing
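
As a rough sanity check on these volumes, the figures can be turned into a size ratio and an average size per entry. This is only back-of-the-envelope arithmetic on the approximate numbers quoted above; it assumes decimal terabytes (10^12 bytes), and the variable names are illustrative.

```python
TB = 10**12  # assumption: decimal terabytes

oracle_bytes = 4.5 * TB   # ~1.5 years of Lemon raw data in Oracle
splunk_bytes = 2.5 * TB   # the same data in Splunk, with metadata
entries = 43e9            # ~43 billion entries

# Fraction of the Oracle volume that the Splunk copy occupies.
size_ratio = splunk_bytes / oracle_bytes

# Average stored bytes per entry in Splunk.
bytes_per_entry = splunk_bytes / entries

print(f"Splunk/Oracle size ratio: {size_ratio:.2f}")
print(f"Average bytes per entry:  {bytes_per_entry:.0f}")
```

With these inputs the Splunk copy is a little over half the Oracle volume, at roughly 58 bytes per entry on average.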

11 Example of Splunk Dashboard
Lemon data with entity cluster hierarchy
– Metric - Time - Match entity name
– Sum of running jobs over time split by entities
Under testing

12 Event processing and management concept
– Monitoring infrastructure: Node monitoring, Metrics, Metric correlation
– Event processing: e.g. Heartbeat checking, Load over cluster
– Ticketing system: Event process, Event record, Incident process, Incident ticket
Service Now prototype
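
Heartbeat checking, one of the event processing examples named on this slide, can be sketched as a comparison of each node's last report time against a timeout. The function name `missing_heartbeats`, the timeout value and the sample data are hypothetical; the slide does not describe how the check is implemented.

```python
from typing import Dict, List


def missing_heartbeats(
    last_seen: Dict[str, float],
    now: float,
    timeout: float,
) -> List[str]:
    """Return the nodes whose last report is older than `timeout` seconds."""
    return sorted(
        node for node, seen_at in last_seen.items() if now - seen_at > timeout
    )


# Hypothetical last-report times, in seconds on a common clock.
last_seen = {"node1": 1000.0, "node2": 400.0, "node3": 990.0}
silent = missing_heartbeats(last_seen, now=1010.0, timeout=300.0)
```

Each silent node would then become an event record, which the incident process could turn into a ticket.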

13 Possible use of Splunk for event processing
Alarming → on-the-fly information processing in time windows
– 5 min time window: if counter > 3 → event
– Diagram labels: Splunk, Automate Monitoring, Notification, Aggregated Notification
In production for backup TSM service @CERN
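
The windowed rule on this slide (an event when the counter exceeds 3 within a 5 minute window) can be sketched in plain Python as a sliding-window counter. This is not Splunk search syntax and not the production configuration; the class name `WindowCounter` is hypothetical, and it assumes occurrences arrive with non-decreasing timestamps in seconds.

```python
from collections import deque
from typing import Deque


class WindowCounter:
    """Hypothetical sliding-window counter: fire when count > limit in the window."""

    def __init__(self, window_seconds: float = 300.0, limit: int = 3) -> None:
        self.window_seconds = window_seconds  # 5 min window, from the slide
        self.limit = limit                    # "counter > 3", from the slide
        self._times: Deque[float] = deque()

    def observe(self, timestamp: float) -> bool:
        """Record one occurrence; return True if an event should be emitted."""
        self._times.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop occurrences that have fallen out of the window.
        while self._times and self._times[0] <= cutoff:
            self._times.popleft()
        return len(self._times) > self.limit


counter = WindowCounter()
fired = [counter.observe(t) for t in (0.0, 60.0, 120.0, 180.0, 600.0)]
```

In this run the fourth occurrence within five minutes triggers the event, while the isolated occurrence at 600 seconds does not, because the earlier ones have expired from the window.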

14 Configuration status and transition period
– Lemon application server (one/data centre)
– Lemon metric management
– Quattor managed node: Quattor configuration
– Puppet managed node: Puppet configuration
– AI monitoring Metric Management prototype

15 Component status
– Lemon agent
– Lemon to messaging
– Apollo
– Custom script
– Monitoring XYZ
– Cluster processing (high load for >50% of cluster)
– Visualization and correlation: Splunk
– Data store: Hadoop
– Operations: Ticketing, SMS gateway, Dashboard
Legend: prototyping/testing/using; planned/R&D on-going

16 Summary
No single solution replacement of the current Lemon system
Shared Agile Infrastructure → Modular concept
– covering all the CERN Computer Centre monitoring domains
– continuous development and deployment
Transition plan in place
Steady progress in implementation

