CERN IT Department, CH-1211 Genève 23, Switzerland (www.cern.ch/it)
Integrating Lemon Monitoring and Alarming System with the new CERN Agile Infrastructure

Presentation transcript:

Integrating Lemon Monitoring and Alarming System with the new CERN Agile Infrastructure
P. Andrade, L. Cons, I. Fedorko, B. Fiorini, A. Iribarren, V. Lefebure, G. Mccance, O. Pera, M. Paladin, I. Reguero, M. Dos Santos, S. Traylen
CERN
HEPiX Fall 2012 Workshop

CERN Agile Infrastructure Monitoring, HEPiX Spring 2012
– High Level Architecture
– View of shared architecture
Lemon – LHC Era Monitoring System
– Is Lemon only about “performance monitoring”?
– Why architecture evolution rather than replacement by existing monitoring tool(s)?
Agile Infrastructure for Monitoring
– Shared Infrastructure
– Use cases: Data store, Visualization
– Event processing and management
– Status of the components

Lemon: LHC Era Monitoring System
In-house developed, multi-component, client/server-based monitoring system
[Architecture diagram]
– Node: Sensor, Monitoring Agent, Local Cache
– Measurement Repository: Repository Backend, Application Server, Oracle Database
– User Interfaces: Lemon CLI (command line tool to access data), Lemon-host-check (command line tool for node exceptions), Web Browser (RRD tool / Python, Apache / PHP)
– Protocols shown: SQL, TCP/UDP, HTTP
Monitoring: individually configurable nodes with autonomous recovery actions
Chain of tools based on DB backend

Lemon: performance, application and facility monitoring
– Node monitoring, e.g. CPU load
– Time-series processing
– Hierarchy clustering: Cluster, Sub-cluster, Node
– On-behalf monitoring: smart Power Distribution Units
– Historical data export
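
The hierarchy clustering above (Cluster, Sub-cluster, Node) can be illustrated with a minimal Python sketch. It is not Lemon code: the sample data, the `aggregate` helper and the choice of a plain average are assumptions made only to show how node-level values roll up the hierarchy.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample: (cluster, sub_cluster, node) -> latest CPU load reading.
SAMPLES = {
    ("batch", "batch_a", "node01"): 0.50,
    ("batch", "batch_a", "node02"): 1.50,
    ("batch", "batch_b", "node03"): 4.00,
    ("storage", "disk_a", "node04"): 0.20,
}


def aggregate(samples, depth):
    """Average node values grouped by the first `depth` levels of the hierarchy."""
    groups = defaultdict(list)
    for path, value in samples.items():
        groups[path[:depth]].append(value)
    return {key: mean(values) for key, values in groups.items()}


by_sub_cluster = aggregate(SAMPLES, depth=2)  # keys like ("batch", "batch_a")
by_cluster = aggregate(SAMPLES, depth=1)      # keys like ("batch",)
```

Keying every sample by its full path keeps the roll-up to a single pass per level; a real system would aggregate over time windows as well.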

Lemon: service availability and alarming
– Node monitoring: disk occupancy, number of processes, log file parse matches
– Corrective action on the node: run a script locally to clean the var directory
– After the 3rd attempt, if var occupancy is still > 90%: var_full alarm to the system administrator / support ticket
– Monitoring repository export with guaranteed reliability, and data processing, e.g. Service Level Status
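
The recovery-then-alarm flow on this slide can be sketched in Python as follows. This is an illustration only, not the Lemon agent's implementation: `read_occupancy`, `clean_var` and `raise_alarm` are hypothetical callables standing in for the sensor, the local clean-up script and the alarm channel, and the alarm name is used as a plain label.

```python
def handle_var_occupancy(read_occupancy, clean_var, raise_alarm,
                         threshold=90.0, max_attempts=3):
    """Try a local clean-up up to max_attempts times, then escalate.

    Returns the number of clean-up attempts that were made.
    """
    attempts = 0
    while read_occupancy() > threshold:
        if attempts == max_attempts:
            # Recovery did not help: hand over to a human via the alarm.
            raise_alarm("var_full")
            break
        clean_var()
        attempts += 1
    return attempts


# Simulated node: each clean-up frees 2% of the var partition (made-up numbers).
state = {"occupancy": 97.0}
alarms = []
attempts = handle_var_occupancy(
    read_occupancy=lambda: state["occupancy"],
    clean_var=lambda: state.update(occupancy=state["occupancy"] - 2.0),
    raise_alarm=alarms.append,
)
```

In the simulated run the occupancy is still above the threshold after three clean-ups, so the alarm is raised; a node that recovers earlier never reaches the alarm branch.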

Lemon: large scale
Experience
– No single-solution replacement
Requirements
– Tool chain, e.g. data mining interface different from time-series trending
– Flexible migration, e.g. compatible with the Lemon node client
– Large-scale ready
Current system: ~11k monitored entities, ~150 metrics/entity
Expected scale: ~300k entities
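
To make the scale figures concrete, the short Python calculation below multiplies them out. The entity and metric counts come from the slide; applying the same ~150 metrics per entity at the expected scale is an assumption, since the slide does not state it.

```python
METRICS_PER_ENTITY = 150       # from the slide (~150 metrics/entity)
CURRENT_ENTITIES = 11_000      # ~11k monitored entities
EXPECTED_ENTITIES = 300_000    # ~300k entities

current_series = CURRENT_ENTITIES * METRICS_PER_ENTITY
# Assumption: the per-entity metric count stays at ~150 at the expected scale.
expected_series = EXPECTED_ENTITIES * METRICS_PER_ENTITY
growth = EXPECTED_ENTITIES / CURRENT_ENTITIES

print(f"current:  ~{current_series:,} metric series")
print(f"expected: ~{expected_series:,} metric series")
print(f"growth:   ~{growth:.0f}x")
```

Under that assumption the number of metric series grows from roughly 1.65 million to roughly 45 million, a factor of about 27.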

Agile Infrastructure with performance monitoring
[Architecture diagram]
– Lemon agent, Lemon to messaging
– Custom script, Monitoring XYZ
– Message Bus
– Data store
– Visualization and correlation
– Cluster processing, e.g. high load for >50% of cluster
– Ticketing, SMS gateway, Dashboard, Operations
Diagram labels: Planned, Components, Views
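
The cluster-processing example on this slide ("high load for >50% of cluster") amounts to a simple fraction test. The Python sketch below is a hypothetical illustration: the function name, the per-node load threshold and the sample values are assumptions, not part of the presented system.

```python
def cluster_high_load(node_loads, load_threshold, fraction=0.5):
    """Return True when more than `fraction` of the nodes exceed load_threshold."""
    if not node_loads:
        return False
    high = sum(1 for load in node_loads.values() if load > load_threshold)
    return high / len(node_loads) > fraction


# Made-up readings: three of four nodes are above the threshold of 8.
loads = {"node01": 12.0, "node02": 9.0, "node03": 1.0, "node04": 11.0}
cluster_alarm = cluster_high_load(loads, load_threshold=8.0)
```

The comparison is strict, so a cluster with exactly half of its nodes loaded does not trigger.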

Storing and visualization
Possible options, fed from the Message Bus:
– Oracle: Lemon web, RRD visualization
– NoSQL: visualization, data mining (batch processing)
– Splunk: data mining, visualization, correlation
R&D on-going

NoSQL-based data store for monitoring
Example from Data Storage Service
– Log parsing and processing based on the NoSQL DB
– Prototyped by CERN IT/DSS
– Shared infrastructure

Splunk for data mining/visualization
– High-precision data mining in the current system is solved by dedicated exports
– ~1.5 years of Lemon raw data (~4.5 TB in Oracle) → ~2.5 TB of Splunk data with metadata information (~43 billion entries)
– One-year period of basic metrics on a node → on-the-fly browsing capability with high time granularity
Under testing

Example of Splunk Dashboard
Lemon data with entity cluster hierarchy
– Metric - Time - Match entity name
– Sum of running jobs over time, split by entities
Under testing

Event processing and management concept
[Diagram]
– Metrics: Node monitoring, Monitoring infrastructure
– Metric correlation
– Event processing: e.g. heartbeat checking, e.g. load over cluster
– Event process, Event record
– Incident process, Incident ticket
– Ticketing system
Service Now prototype

Possible use of Splunk for event processing
– Alarming → on-the-fly information processing in time windows
– If counter > 3 → event (5 min time window)
– Diagram labels: Splunk Automate Monitoring, time, Notification, Splunk Aggregated Notification
In production for backup TSM
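
The time-window rule on this slide (counter > 3 within a 5 minute window produces an event) can be sketched in plain Python. This is not a Splunk search and not the production setup: the `WindowCounter` class is hypothetical, and it assumes timestamps arrive in non-decreasing order, in seconds.

```python
from collections import deque


class WindowCounter:
    """Count occurrences inside a sliding time window."""

    def __init__(self, window_seconds=300, threshold=3):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._times = deque()

    def observe(self, timestamp):
        """Record one occurrence; return True if the window count exceeds the threshold."""
        self._times.append(timestamp)
        # Drop occurrences that have fallen out of the window.
        while self._times and self._times[0] <= timestamp - self.window_seconds:
            self._times.popleft()
        return len(self._times) > self.threshold


counter = WindowCounter()
# Four occurrences within five minutes: only the fourth one crosses "counter > 3".
results = [counter.observe(t) for t in (0, 60, 120, 180)]
```

A deque keeps the eviction of old occurrences cheap, and a single aggregated notification can then be sent when `observe` first returns True.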

Configuration status and transition period
– Lemon application server (one per data centre)
– Lemon metric management
– Quattor managed node: Quattor configuration
– Puppet managed nodes: Puppet configuration
– AI monitoring
Metric Management prototype

Component status
– Lemon agent, Lemon to messaging
– Messaging: Apollo
– Custom script, Monitoring XYZ
– Cluster processing, e.g. high load for >50% of cluster
– Visualization and correlation: Splunk
– Data store: Hadoop
– Ticketing, SMS gateway, Dashboard, Operations
Diagram legend: prototyping/testing/using; planned/R&D on-going

Summary
No single-solution replacement of the current Lemon system
Shared Agile Infrastructure → Modular concept
– covering all the CERN Computer Centre monitoring domains
– continuous development and deployment
Transition plan in place
Steady progress in implementation