ABACUS & COSMOS Monitoring review

Frank Locci on behalf of the ABACUS and COSMOS team, BE-ICS, BE-CO - Technical Meeting:

Agenda
- ABACUS review outcomes
  - Findings on existing solutions
  - Mandate and investigations
  - Review proposal
- COSMOS project overview and plan
  - Project mandate
  - Approach and scope
  - Some technical aspects
- Conclusions

ABACUS Proposal
- Plug our systems into the CollectD and ITMON services
  - CO sysadmin needs to be aligned with the IT solution
- Set up an Icinga2 monitoring solution for all other use cases
  - Icinga2 covers our major requirements, supports collectD, and has a CERN user community (ATLAS + CMS)
- New paradigm: each user/expert responsible for some monitoring adds the specific part on top of the core monitoring infrastructure (collection, visualization)

Let me start with the ABACUS conclusions. It proposes to ... ... benefit from developments and central services. Finally, it proposes a new paradigm that delegates responsibility for the specific part to the user (...)

Our findings on the existing solutions 1/2
- None of the tools satisfies all the requirements
  - Lots of feature duplication across the different solutions
- The current situation is not adequate for our infrastructure today
  - Too many tools with no integration
  - Not easy to configure for new systems
  - Scaling up and support are bottlenecks
- Some missing features
  - Auto-discovery, downtime notification, self-monitoring, etc.

Tools shown on the slide: Diamon, Moon, Lemon, Xymon, Kibana/ES, Meter (IT), Spectrum, MIB, RT metrics, rsyslog/atop, custom viewers & tools, proprietary tools (HP), FESA/RDA

So what did we find out from …. …. To display, organize, filter, compute the data ... ... even no longer tenable in terms of maintenance

Our findings on the existing solutions 2/2
- Stacking of home-made solutions leads to complexity and specialization -> difficult to operate and maintain for a single, centralized team
- Need more modular and interconnected solutions -> to decentralize services
- Open-source market extremely rich in the domain of monitoring

Layers shown on the slide:
- SOURCES (user): data collection, formatting, ingestion, transport
- BIG-DATA CORE (monitoring teams): data processing, aggregation, reduction, history logs, storage
- VISUALISATION (user): alerting, searching, tables, tree, graphs, dashboards

... … allows the responsibilities to be redistributed depending on the domain and expertise level. In most cases the users themselves can take responsibility for implementing the collection and visualization of the data on top of the big-data core services maintained by the monitoring team.

ABACUS Mandate (Jun 2016)
- Gather and document the user requirements
- Identify and document the existing solutions
- Organize the monitoring review meeting with external reviewers
- Perform a market survey to understand the available solutions
- Deliver a sound organizational proposal to set up the realization project
- This proposal then has to be approved and staffed by the SLM

So, here is the ABACUS mandate quickly: ... considering the major components of the current monitoring system, BE-ICS in particular, and potential candidates

People
Core members
- Felix Nikolaus Ehm
- Luigi Gallerani
- Joel Lauener
- Frank Locci
Reviewers
- Marc Buttner
- Brice Copy
- Stephane Deghaye
- Chris Roderick
- Marc Vanden Eynden
Collaborators
- Marine Gourber-Pace
- Fernando Varela Rodriguez

Here is the list of people involved in the review:
- A core team of CO people
- Reviewers and collaborators from both the CO and ICS groups

Phases of the review
- Requirements collection and analysis - July-Aug
  - High priority for users, common and specific requirements, survey/interviews with a lot of experts
- Existing solutions comparison, feature matrix - Sept
  - Diamon, Moon, Lemon, Xymon, log tracing, specific (e.g. timing, FESA)
  - Matrix -> available, missing or duplicated features
- Unified Monitoring Infrastructure proposal - Jan
  - Layered architecture (data sources, core, visualization) as a new paradigm, with delegation of metric responsibility
- Evaluation of Icinga2, Zabbix, CERN ITMON, Prometheus - Feb
  - Setup and configuration for basic monitoring, alerts and dashboards
  - No time to test advanced features, performance and scalability in real conditions
  - Golden use cases like White Rabbit switch monitoring successfully implemented

The review was conducted in 4 steps:
1. The first step was to analyse and ...
2. The second step was to compare the major solutions deployed at CERN (especially in BE and IT). We had technical meetings with the different experts, who presented their products, their showstoppers and potential improvements.
3. In the third step, we prepared a first proposal based on the IT UMI solution, but it was debated because of limited time and resources, and because the proposal was probably oversized. So we agreed to evaluate other alternatives and to narrow the scope of the product to a more integrated solution.
4. This led us to the fourth step: evaluating some candidate solutions … we selected 4 well-established products on the market ... for a few hundred hosts.

People contacted - Interviews
- Stephen Page and Alastair Bland for the sysadmin part
- Jakub Wozniak as logging expert
- Jean-Claude Bau, Andrzej Dworak and Luca Molinari as timing experts
- Julien Palluel as White Rabbit and WFIP expert
- Lukasz Burdzanowski as database expert
- Kajetan Fuchsberger as reference from the OP group
- Alberto Aimar and Pedro Andrade from the IT Monitoring team

A dedicated interdepartmental CNIC meeting has been proposed by Luigi and Stephan Lueders:
- Timo Hakulinen for Access and Safety monitoring
- Franco Brasolin for ATLAS TDAQ sysadmin monitoring with Icinga2
- Lavnia Darlea and Marc Dobson for CMS monitoring, Service
- Hirsto Mohamed for LHCb online monitoring
- Véronique Lefébure for IT-ICS network monitoring
- Sebastian Bukowiec for the IT Windows monitoring

Thanks to Javier Serrano, who actively helped with ICALEPCS contacts, we have also received input from:
- Prados Boda Cesar at GSI
- Jean-Michel Chaize from ESRF
- Alain Buteau from Synchrotron Soleil
- Enzo Carrone from SLAC
- Artur Barczyk and Vincent Hardion from MAX IV

Key common requirements
- Reduce solutions multiplicity and services overlap
- Easy: plug & play or auto-discovery of new systems
- Self-definition / tuning of metrics, alerts, parameters
- Multitenant, multi-view, custom visualization, integrated/coherent GUIs
- Multi-platform access: web, mobile devices, control room
- System relations and dependencies, hierarchy view
- Reliable notifications, acknowledge/maintenance
- History (long one)
- Fast history, fast search, easy filters, easy query

No real surprise with the outcome of that requirements study:

Investigation of open solutions
We have investigated the 4 major candidates: ITMON/CollectD, Icinga2, Zabbix, Prometheus

Comparison table (criteria, with the ratings shown on the slide):
- Full stack OSS solution: -, ++, +
- Learning curve
- Aligned with CO sysadmin: + (Ansible, file-based)
- Aligned with IT solutions: + (Collectd, DBOD)
- Setup, configuration, ease of use
- Extent of customization: ++ (checks, config files, advanced templating, ...), + (browser-based GUI)
- Data processing (filtering, aggreg./reduct., analysis, …): + (built-in procedures), ++ (powerful language)
- Security (comm., authentication, multi-user, …): ++ (SSL/TLS)
- Scalability and network distribution
- High availability (load balancing, redundancy, ...): ++ (fully indep. zones)

The results of these comparative tests are summarized in this table, which contains only the significant topics. Clearly, Icinga2 and Zabbix can do 80% of the job, but we selected Icinga2 for several reasons:
1. We consider it better aligned with CO sysadmin, who have a long history of and preference for file-based systems
2. It supports collectD and can work with the IT DB on-demand service for the configuration and metrics databases
3. The check-process concept fits the new delegation-of-responsibility paradigm very well
4. It is also very well designed for distributed layouts and high availability (secure load balancing and redundancy features)

ABACUS Proposal (reminder)
- Plug our systems into the CollectD and ITMON services
  - CO sysadmin needs to be aligned with the IT solution
- Set up an Icinga2 monitoring solution for all other use cases
  - Icinga2 covers our major requirements, supports collectD, and has a CERN user community (ATLAS + CMS)
- New paradigm: each user/expert responsible for some monitoring adds the specific part on top of the core monitoring infrastructure (collection, visualization)

So, to conclude on the ABACUS review, let me show you again the outcomes of the review that led us to the COSMOS project, approved in March this year.

COSMOS Mandate (March 2017)
COSMOS: Controls Open Source MOnitoring System
- Study and implement a new CO monitoring system and concentrate efforts on our core business -> use de-facto standard technologies and open-source solutions
- Start of LS2 (end 2018): CO infrastructure monitoring focusing on our first priorities
  - All CO hosts + hw-modules and services base checks (ping, system metrics, process/service state)
  - Golden use cases: QuadServer, JBOD/RAID, crates and field-buses (WR-Switch, Pulse Rep., WFIP)
- End of LS2: system integration and extension
  - Architecture consolidation (scalability, HA, performance)
  - Configuration management tools
  - CO and partner services integration (alarms, logging, GUIs, custom monitoring, ...)
  - Advanced features (actions/feedback, dependency tree, …)
  - Gradual phase-out of DIAMON
- 6 people from different CO sections and with different expertise have been partially allocated to COSMOS (~1.5 FTE so far): Frank, Julien, Luigi, Laura, Felix, Sergey

In this project, we propose to: study and implement … The project is split into 2 parts. First, set up the monitoring of the whole CO infrastructure before the start of LS2, including: … In the second step, planned for the end of LS2 (and possibly before), we will focus on the integration and extension of the system.

COSMOS monitoring definition
It does
- Constantly track the system resources
- Detect fluctuations, slow or failing components
- Provide health reports for each host and service
- Collect metrics
- Notify (by …, SMS)

It is not
- A database
- A control command system
- A (smart) diagnostic tool
- A logging tool

Picture: data complexity grows from left to right
- COSMOS Monitoring (static configuration: we know what we are looking for!): host state, process state, service state, service metrics introspection, acc. equipment check (ping, up/down, JMS, RDA, SQL, JMX, CMX, FESA, …)
- Diagnostic (dynamic configuration): logging, profiling, search, processing, …; offline analysis, AFT, DocFIP, CANDI, …

So we must first define what we mean by COSMOS monitoring. It can be a diagnostic tool to some extent, but not an artificial intelligence system … but don't worry, it still has many other things to do!

And now, the main question is: where to put the cursor? And we have an answer for that: monitoring is not tracing or profiling, and it requires static configuration … we should know what we are looking for!

In this picture you can see the data complexity growing from left to right, and where the limit of the COSMOS monitoring service lies. It includes:
- counting and checking host and service states (up/down)
- … checking not only that the service is running but that it is running properly (we can perform a get/set to RDA, for instance)
- usually it requires introspection to provide internal metrics and expert data of SW and HW components

COSMOS monitoring should not cover: logging, … or even more sophisticated applications like: analysis, …
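The idea of a static configuration ("we know what we are looking for") combined with a health report per host can be sketched in Python. This is only an illustration under stated assumptions: the host names, check names, and the severity ordering used to pick the worst state are hypothetical and do not come from the COSMOS configuration.

```python
# Assumed severity ordering used to pick the worst state of a host.
SEVERITY = {"OK": 0, "WARNING": 1, "UNKNOWN": 2, "CRITICAL": 3}

# Statically declared checks: every host and check is known in advance.
# Host and check names are hypothetical.
STATIC_CONFIG = {
    "demo-host-1": ["ping", "process:demo-daemon", "service:demo-rda"],
    "demo-host-2": ["ping"],
}


def health_report(config, results):
    """Build a per-host report from check results keyed by (host, check).

    A declared check without a result is reported as UNKNOWN, because the
    static configuration says it should have been measured.
    """
    report = {}
    for host, checks in config.items():
        states = {check: results.get((host, check), "UNKNOWN") for check in checks}
        worst = max(states.values(), key=lambda state: SEVERITY[state])
        report[host] = {"overall": worst, "checks": states}
    return report


# Hypothetical check results, as a monitoring core might collect them.
example_results = {
    ("demo-host-1", "ping"): "OK",
    ("demo-host-1", "process:demo-daemon"): "WARNING",
    ("demo-host-1", "service:demo-rda"): "OK",
    ("demo-host-2", "ping"): "CRITICAL",
}
example_report = health_report(STATIC_CONFIG, example_results)
```

Results for hosts or checks that are not declared in the configuration are ignored, which mirrors the boundary drawn on the slide: anything that is not known in advance belongs to diagnostics rather than monitoring.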

COSMOS core scope
COSMOS features and support included
- Linux and Windows OS/network metrics (cpu, mem, disk, netw, …) -> collectD
- Single and common interface to collect health reports and perf. data -> Nagios CHECK
  - Host, process and service state, apps. status/metrics ingestion
- Metrics and performance data storage, query support -> influxDB (time-series DB)
- Multiview and multitenant support for visualization -> Grafana
- Core services: config. management, alerts, interactive troubleshooting
Out of the COSMOS support
- Custom checks, data analysis, tracing/searching, op. alarms, diagnostic
- Icinga2 checks and API can be used to plug in external services (Elastic, Logstash, Prometheus, etc.)
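To make the storage step more concrete, here is a minimal Python sketch of how one performance datum could be rendered in the InfluxDB line protocol (measurement, tags, fields, timestamp). The measurement, tag and field names are hypothetical, and the slides do not say how COSMOS actually names or writes its series, so this only illustrates the record format.

```python
def escape_key(text):
    """Escape characters that are special in tag keys, tag values and field keys."""
    for ch in (",", "=", " "):
        text = text.replace(ch, "\\" + ch)
    return text


def format_field(value):
    """Render a field value: booleans, integers (suffix i), floats, quoted strings."""
    if isinstance(value, bool):  # bool must be tested before int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line: measurement,tag=value field=value timestamp."""
    name = measurement.replace(",", "\\,").replace(" ", "\\ ")
    tag_part = "".join(
        f",{escape_key(key)}={escape_key(value)}" for key, value in sorted(tags.items())
    )
    field_part = ",".join(
        f"{escape_key(key)}={format_field(value)}" for key, value in fields.items()
    )
    return f"{name}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical datum: round-trip time reported by a host alive check.
example_line = to_line_protocol(
    "ping",
    {"host": "demo-host-1", "service": "host alive"},
    {"rta": 1.5, "sent": 4},
    1510000000000000000,
)
```

Keeping the host and service as tags and the measured values as fields is one common layout for time-series dashboards, since tags can then be used for filtering and grouping in the visualization layer.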

Solution: Icinga2 check
Active check
- Also called plugin (Nagios): standalone extension to the Icinga2 core
- A script or binary: >check_http -H cwe-513-vol510 -w 10 -c 20 -f follow
- Returning a status code
- Returning an output message: HTTP OK: HTTP/ OK bytes in second response time
- Returning some perf. data (optional): |time= s; ; ; size=85997B;;;0
- >3000 plugins developed by the community
- Flow: the Icinga2 process initiates the plugin on a periodic basis; the plugin returns status + metrics

Status codes
- Service status: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN
- Host status: UP, UNSTABLE, DOWN/UNREACHABLE

Passive check
- The plugin sends status + metrics to the Icinga2 process asynchronously
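The plugin contract described on this slide (a status code, one output message, optional perf. data after a `|`) can be sketched in Python. This is a minimal sketch, not a COSMOS or Icinga2 component: the `DEMO` label, the function names and the thresholds are hypothetical, and the measured value is passed in rather than obtained from a real request. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import sys

# Conventional Nagios/Icinga plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value to a status code (a higher value is worse)."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_output(label, status, message, perfdata=None):
    """Build the single output line: 'LABEL STATUS: message|perfdata'."""
    line = f"{label} {STATUS_NAMES[status]}: {message}"
    if perfdata:
        line += "|" + " ".join(perfdata)
    return line


def run_check(response_time, warn=10.0, crit=20.0):
    """Return (exit code, output line) for a response time given in seconds."""
    status = evaluate(response_time, warn, crit)
    if status == UNKNOWN:
        return status, format_output("DEMO", status, "no measurement available")
    # Perf. data layout: label=value[unit];warn;crit;min
    perf = [f"time={response_time:.3f}s;{warn:g};{crit:g};0"]
    message = f"{response_time:.3f} second response time"
    return status, format_output("DEMO", status, message, perf)


def main(response_time):
    """Entry point for use as a script: print the line, exit with the status code."""
    code, text = run_check(response_time)
    print(text)
    sys.exit(code)
```

For example, `run_check(0.25)` returns the code `0` and the line `DEMO OK: 0.250 second response time|time=0.250s;10;20;0`. A real plugin would parse `-w` and `-c` from the command line and call something like `main()` with an actual measurement; the monitoring core only relies on the exit code and the printed line.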

Solution: COSMOS layout
[Diagram: COSMOS layout across the GPN and the TN, separated by IT firewall rules and CO routing. Labelled elements: IT Grafana service, IT Collectd server (sys. metrics), IT time series, IT Icinga2web server, Icinga2 server, Icinga2 satellite, agents. Labelled flows: check, p-check, events, perf. data.]

Current status: November 2017
- Icinga2: master and web servers (x2: Pro, Dev) available for development and tests
- IT services: large effort made to integrate the whole chain
- Databases: COSMOS relies on the DB on-demand service with TN/GPN routing (IDO: MySQL, TS: influxDB)
- Visualisation: COSMOS relies on the IT OpenShift service for Grafana
- Security: LDAP/SSO support for the Icinga2 servers, Grafana set up as a multi-tenant service
- Collectd: provides metrics for the CO Linux infrastructure, with data accessible from Grafana!
  - In progress: monitoring of diskless platforms (FEC)
- CO checks:
  - Host alive check: ping
  - SNMP services: WR-Switch, ELMA & Wiener crates, Pulse repeater
  - In the pipeline: IPMI check for Quad servers and enclosures (JBOD/RAID systems)
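As an illustration of the host alive check, the Python sketch below turns the summary printed by the Linux `ping` command into a plugin-style status code. It is a sketch under assumptions rather than the check COSMOS deploys: the thresholds are hypothetical, the sample text imitates iputils `ping` output, and the command itself is not executed here.

```python
import re

# Patterns for the iputils ping summary, e.g.
# "4 packets transmitted, 4 received, 0% packet loss, time 3005ms"
# "rtt min/avg/max/mdev = 0.031/0.040/0.050/0.007 ms"
LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")
RTT_RE = re.compile(r"= [\d.]+/([\d.]+)/")


def parse_ping_summary(text):
    """Return (packet loss in %, average RTT in ms or None), or None if unparsable."""
    loss = LOSS_RE.search(text)
    if loss is None:
        return None
    rtt = RTT_RE.search(text)
    return float(loss.group(1)), (float(rtt.group(1)) if rtt else None)


def host_alive_status(text, warn=(100.0, 20.0), crit=(500.0, 60.0)):
    """Map ping output to 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.

    warn and crit are hypothetical (average RTT in ms, packet loss in %) pairs.
    """
    parsed = parse_ping_summary(text)
    if parsed is None:
        return 3
    loss, rta = parsed
    if loss >= 100.0 or rta is None:
        return 2  # no reply at all: host considered down
    if rta >= crit[0] or loss >= crit[1]:
        return 2
    if rta >= warn[0] or loss >= warn[1]:
        return 1
    return 0


# Sample summaries written by hand for illustration.
SAMPLE_UP = (
    "4 packets transmitted, 4 received, 0% packet loss, time 3005ms\n"
    "rtt min/avg/max/mdev = 0.031/0.040/0.050/0.007 ms"
)
SAMPLE_DOWN = "4 packets transmitted, 0 received, 100% packet loss, time 3070ms"
```

Separating the parsing from the threshold logic keeps the decision testable without network access; in production the text would come from running `ping` against the monitored host.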

Conclusions
- COSMOS is progressing well with respect to the review recommendations
- Initial developments have already addressed many technical queries
- Full-stack monitoring test successful: Icinga2 + Collectd -> IT DBs (status/time-series) -> IT Grafana
- Still some points to investigate to fully reassure us before LS2:
  - Define the optimal network distribution over GPN/TN (scalability/availability)
  - Evaluate the system performance in operational conditions
  - Cohabitation of COSMOS and DIAMON until the end of LS2

Questions?
