Computing Facilities (CF), CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/it. CC Monitoring. I. Fedorko on behalf of CF/ASI, 18/02/2011.

Presentation transcript:

CC Monitoring
I. Fedorko on behalf of CF/ASI, 18/02/2011

Overview
– Lemon
– LAS
– SLS
– Recent and coming challenges

CC Monitoring
We have monitoring tools:
– Lemon
– LAS
– SLS
New requirements are coming progressively.
The environment is changing.

Lemon
Purpose:
– CC monitoring (Linux boxes)
Customers:
– CF-SAO (sys-admins, operators), IT&PH (SM and VOC)
What is monitored:
– Performance monitoring (e.g. CPU load)
– Application monitoring (e.g. log parsing)
– Facilities monitoring (e.g. power, temperature), recently added
Main characteristics:
– Agent-based system
– Corrective action performed on the node
Overlap:
– Data collection

Lemon
Needs to be addressed:
– Visualization, currently under development (see Dec 2010 post-C5)
    Designed for a simple cluster-host hierarchy → limit for facilities monitoring
    Scale limits
    Advanced visualization out of design scope → addressed by CLUMAN
– Scaling (virtualization)
– Service monitoring
– Remote data transfer
– RFE
    Integration with other monitoring (Windows, Nagios)
    Data aggregation
      – CPU load over the whole cluster
      – Service state over all nodes and applications
    Remote test/probing

LAS
Purpose:
– Alarming based on Lemon information
Customers:
– CF-SAO (sys-admins, operators), IT&PH (SM and VOC)
What is alarmed:
– Exceptions coming from Lemon agents
Main characteristics:
– Based on an Oracle backend (logic implemented in PL/SQL)
– Web interface for the operator
– Interacts with ITCM (Remedy)
Needs to be addressed:
– Integration of information from other monitoring systems (e.g. Windows)
– Alarms from distributed Lemon instances
– Migration of ITCM to Service-Now
Overlap:
– Can we have a basic alarming system infrastructure shared with other monitoring systems?

SLS
Purpose:
– Display service status information
Customers:
– IT&PH (SM and VOC), management
What is displayed:
– Service information (availability, KPI, numerical values) provided by the user
Main characteristics:
– Service information is calculated/estimated and provided by the user
– Service definition is in SDB
Needs to be addressed:
– Visualization of dependency between services and hosts
– Software consolidation: what can we share with Lemon?
– Migration of (only?) SDB to Service-Now?
– New service monitoring integrated with Lemon?
Overlap:
– Can we share a dashboard/status board?

Recent challenges
Scaling
RFE beyond current design
Service monitoring
– with recovery action, alarming, etc.
Remote monitoring and management
Software consolidation
Integration with others
Any common strategy?

Manpower
Lemon & LAS
– 50% of staff FTE
– 1 fellow FTE till the end of June 2011
SLS
– <1 fellow FTE till the end of June 2011

How to address
In-house development
– Expensive
Replacement
– No silver bullet (CHEP, EE → experiment experience)
– Tool review needed
– Nagios, Ganglia, Zenoss …
Combination
– Replace monitoring component to minimize development
– Data collection and visualization infrastructure independent from tool(s)
Any solution to build together?

Backup
Backup slides from here on.

Visualization in Lemon
[Screenshots; labels: 3 subclusters of cluster lxbatch; node]

Services in Lemon
[Plot: number of registered metrics during last year]
[Diagram; recoverable labels:]
Host 1: Application A, HW scan
– Metrics: CPU load, partitions occupancy, is app running, log parsing X, log parsing Y, SMART, IPMI
Host 2: Application B, HW scan
– Metrics: CPU load, partitions occupancy, is app running, log parsing, SMART, IPMI
Exceptions: Except. 1, Except. 2, Except. 3, Except. 4, Except. 5
Current LAS view: service alarms, sys-admin alarms

Lemon
[Architecture diagram; recoverable labels:]
Node Monitoring: Sensor, Monitoring Agent, Local Cache
Measurement Repository: Repository Backend, Oracle Database, Application Server
User Interfaces:
– Lemon CLI (command line tool to access data)
– Lemon-host-check (command line tool for node exceptions)
– Web Browser
– RRD tool / Python
– Apache / PHP
Connections: SQL, TCP/UDP, HTTP

LAS
[Diagram; recoverable labels: Exception, Metrics, ITCM, Lemon-web, LAS GUI, Lemon Oracle DB, LAS Business Logic (PL/SQL), Operator, Administrator]
High level objects
– CPU load over the whole cluster
LAS → include Windows monitoring
ITCM will be migrated to Service-Now

SLS
[Diagram; recoverable labels: SLS-web, USER, Scripts, XML, SDB, RRD, test/probes, LemonDB, SLS XML]
SDB → Service Catalog?
Infrastructure consolidation
– with Lemon
– with something new?
Does SDB-SLS cover monitoring of all services?

Discussed Monitoring Task I.
Current activity
– Lemon-web 2.0 development ongoing
Lemon enhancements under consideration
New Lemon DB schema
– Increase of monitoring data impacts the size and performance of the DB repository
– Impact on many Lemon components
Lemon repository data export
– Reduce the amount of historical data stored in the DB by exporting it to data files

Discussed Monitoring Task II.
Lemon-sensors review/development
– Pending enhancements on practically all core sensors
– New sensors (e.g. for SafeHost)
– Python API
High level objects
– Trigger an alarm if > 40% of cluster nodes are on high load
– Data aggregation on data collection
Integration with Windows monitoring (one LAS)
Support for virtualization (new instances + federated web)
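The "high level object" rule above (alarm when more than 40% of cluster nodes are on high load) can be sketched as follows. This is a minimal illustration, not Lemon or LAS code: the names `NodeSample`, `high_load_fraction` and `cluster_alarm`, the per-node load threshold of 1.0, and the sample values are all assumptions made for the example. Only the 40% fraction comes from the slide.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class NodeSample:
    """One load reading per node (hypothetical structure, not a Lemon type)."""
    node: str
    load: float  # assumed to be a load value comparable across nodes


def high_load_fraction(samples: Iterable[NodeSample], load_threshold: float) -> float:
    """Return the fraction of nodes whose load exceeds load_threshold."""
    samples = list(samples)
    if not samples:
        return 0.0  # an empty cluster never counts as overloaded
    high = sum(1 for s in samples if s.load > load_threshold)
    return high / len(samples)


def cluster_alarm(samples: Iterable[NodeSample],
                  load_threshold: float = 1.0,       # assumed definition of "high load"
                  fraction_threshold: float = 0.40   # the 40% from the slide
                  ) -> bool:
    """Raise a cluster-level alarm if strictly more than 40% of nodes are on high load."""
    return high_load_fraction(samples, load_threshold) > fraction_threshold


# Made-up readings for a five-node cluster: three nodes are above the threshold.
samples = [
    NodeSample("node01", 0.2),
    NodeSample("node02", 1.5),
    NodeSample("node03", 2.1),
    NodeSample("node04", 0.4),
    NodeSample("node05", 1.8),
]
alarm = cluster_alarm(samples)
```

The point of the sketch is that the alarm is evaluated on aggregated data for the whole cluster, rather than per node as with the agent-side exceptions.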

Discussed Monitoring Task III.
SLS
– No pending RFE
– Propagate deletion of a service from SDB to SLS
– Graphical representation of dependency between services and hosts (and alarms)
– Ongoing DB backend consolidation
– Possibly migration of SDB to Service-Now

Lemon Monitoring
~11k monitored entities (~8k nodes)
– Performance monitoring: CPU load, partitions information
– Application monitoring: file, log parsing
– Power, temperature
– Remote (ping, http, snmp)
5 core sensors covering ~60% of performance and application monitoring
~30 misc sensors
– hw_scan, snmp, castor
>5000 nodes with metrics

Lemon instances
~1.1k metrics, 473 exceptions, 254 classes
~1.7M monitored metrics across Lemon
~300 GB of data / month produced
Covered by 2 servers running in parallel
– recent data → for LAS performance
– historical data → no problem with powerful data

Lemon instances

Monitoring DB
Lemonops (latest data only)
Size  Used  Avail  Use%  Mounted on
32G   29G   3.8G   89%   /ORA/dbs03/LEMONOP
Lemonrac (historical data)
Size  Used  Avail  Use%  Mounted on
1.6T  1.5T  76G    96%   /ORA/dbs03/LEMONRAC
Data income: ~300 GB/month
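To put these figures in perspective, the remaining headroom on the historical volume can be estimated with simple arithmetic. The sketch below rests on assumptions the slide does not state: that the full ~300 GB/month lands on the Lemonrac volume, that growth is linear, that nothing is exported or deleted in the meantime, and that a month is 30 days. The function name `days_until_full` is hypothetical.

```python
def days_until_full(avail_gb: float, income_gb_per_month: float,
                    days_per_month: float = 30.0) -> float:
    """Estimate days until a volume fills up, assuming constant linear growth."""
    if income_gb_per_month <= 0:
        raise ValueError("income must be positive")
    return avail_gb / income_gb_per_month * days_per_month


# Figures from the slide: 76G available on Lemonrac, ~300 GB/month of new data.
lemonrac_days = days_until_full(avail_gb=76, income_gb_per_month=300)
print(f"Lemonrac headroom: about {lemonrac_days:.1f} days")
```

Under these assumptions the headroom is on the order of a week. This is consistent with the data export and new DB schema being listed as discussed tasks.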