
CERN IT Department, CH-1211 Genève 23, Switzerland (www.cern.ch/it)
The Agile Infrastructure Project: Monitoring
Markus Schulz, Pedro Andrade

Outline
– Monitoring WG and AI
– Today's Monitoring in IT
– Architecture Vision
– Implementation Plan
– Conclusions

Monitoring WG and AI (Markus Schulz)

Introduction
Motivation
– Several independent monitoring activities in IT: similar overall approach, different tool-chains, similar limitations
– High-level services are interdependent: combining data from different groups is necessary, but difficult
– Understanding performance has become more important: this requires more combined data and complex analysis
– The move to a virtualized, dynamic infrastructure comes with complex new requirements on monitoring
Challenges
– Find a shared architecture and tool-chain components while preserving our investment in monitoring
→ IT Monitoring Working Group

Timeline
– Q: Creation of Monitoring WG and mandate definition; presentations of monitoring status per IT group
– Q: Presentations of monitoring plans per IT group; initial discussion on a shared monitoring architecture
– Q: Definition of common tools and core user stories; agreement on a shared monitoring architecture
– Q: Preparation of MWG summary report; definition of implementation plans in the context of AI
– Q: Setup of infrastructure and prototype work; import data from several sources into the Analysis Facility; exercise messaging at expected rates and feed the storage system

Today's Monitoring in IT (Pedro Andrade)

Monitoring Applications
Group: Applications
CF: Lemon, LAS, SLS
CIS: CDS, Indico
CS: Spectrum CA, Events, Polling Value, Alarm History, Performance Analysis, Sflow/Nflow, Syslog, Wireless Monitoring
DB: Database monitoring, Web applications monitoring, Infrastructure monitoring
DI: Central Security Logging All, Central Security Logging Logins, IP connections log, Deep Packet Inspection, DNS Logs
DSS: TSM, AFS, CASTOR Tape, CASTOR Stager
ES: Job Monitoring, Site Status Board, DDM Monitoring, Data Popularity, Hammer Cloud, Frontier, Coral
GT: SAM-Nagios
OIS: SCOM
PES: Job Accounting, Fairshare, Job Monitoring, Real-time Job Status, Process Accounting

Monitoring Applications (figure slide)

Monitoring Data
– Producers: 40,538
– Input volume: 283 GB per day
– Input rate: 697 M entries per min (2.4 M entries per min without PES/process accounting)
– Query rate: 52 M queries per day (3.3 M queries per day without PES/process accounting)

Analysis
– Monitoring in IT covers a wide range of resources: hardware, OS, applications, files, jobs, etc.
– Many application-specific monitoring solutions: some are commercial products, and they are based on different technologies
– Limited sharing of monitoring data: sometimes no sharing at all, simply duplication of monitoring data
– All monitoring applications have similar needs: publish metric results, aggregate results, raise alarms, etc.

Architecture Vision (Pedro Andrade)

Constraints (Data)
– Large data store aggregating all monitoring data for storage and combined analysis tasks
– Make monitoring data easy to access by everyone, without forgetting possible security constraints
– Select a simple and well-supported data format; the monitoring payload is to be schema-free
– Rely on centralized metadata service(s) to discover computer centre resource information:
  Which physical node is running virtual machine A?
  Which virtual machine is running service B?
  Which network link is used by node C?
  … this is becoming more dynamic in the AI

Constraints (Technology)
– Focus on providing well-established solutions for each layer of the monitoring architecture: transport, storage, analysis
– Flexible architecture in which a particular technology can easily be replaced by a better one
– Adopt existing tools whenever possible and avoid home-grown solutions
– Follow a tool-chain approach
– Allow a phased transition in which existing applications are gradually integrated

User Stories
User stories were collected from all IT groups and commonalities between them were identified. To guarantee that different types of user stories were provided, three categories were established:
– Fast and Furious (FF): get metric values for hardware and selected services; raise alarms according to appropriate thresholds
– Digging Deep (DD): curation of hardware and network historical data; analysis and statistics on batch job and network data
– Correlate and Combine (CC): correlation between usage, hardware, and services; correlation between job status and grid status

Architecture Overview (diagram)
Components shown: sensors and publishers, custom feeds, application-specific aggregation, storage and analysis feeds, analysis storage, alarm feed, alarms, portal, and reports. Candidate technologies labelled on the diagram: Apollo, Lemon, Hadoop, Oracle, Splunk.

Architecture Overview
– All components can be changed easily, including the messaging system (standard protocol)
– Messaging and storage are the central components; tools connect either to the messaging layer or to the storage
– Publishers should be kept as simple as possible: data is produced either directly on the sensor or after a first level of aggregation
– Scalability can be addressed either by scaling horizontally or by adding additional layers (pre-aggregation, pre-processing): a "fractal approach"

Data Format
– The selected message format is JSON
– A simple common schema must be defined to guarantee cross-referencing between the data: timestamp, hardware and node, service and applications, payload
– These base elements (tags) require the availability of the metadata service(s) mentioned before; this is still under discussion
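To make the common schema concrete, here is a minimal sketch in Python. The field names are illustrative only; as the slide notes, the actual tag set was still under discussion at the time.

```python
import json
import socket
import time

def build_message(service, metric, value):
    # Base tags from the slide: timestamp, hardware/node, service, payload.
    # Field names are assumptions, not the agreed schema.
    return json.dumps({
        "timestamp": int(time.time()),   # epoch seconds
        "node": socket.getfqdn(),        # hardware/node tag
        "service": service,              # service/application tag
        "payload": {metric: value},      # schema-free monitoring payload
    })

print(build_message("lemon-agent", "LoadAvg", 0.42))
```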

Messaging Broker
– Two technologies have been identified as the best candidates: Apollo and RabbitMQ
  Apollo is the successor of ActiveMQ
  Prior positive experience in IT and the experiments
– Only realistic testing environments can produce reliable performance numbers; the use case of each application must be clearly defined: total number of producers and consumers, size of the monitoring messages, rate of the monitoring messages
– The trailblazer applications already have very demanding use cases
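Both candidates speak STOMP (Apollo natively, RabbitMQ through its STOMP plugin), so a broker-agnostic publisher can be sketched. The broker host, credentials, and destination below are placeholders, not the values of the actual AI deployment.

```python
import json
import socket
import time

import stomp  # pip install stomp.py

# Placeholder endpoint: the real brokers are described in the v1 slides below.
conn = stomp.Connection([("broker.example.cern.ch", 61613)])
conn.connect("monitoring", "secret", wait=True)  # placeholder credentials

message = json.dumps({
    "timestamp": int(time.time()),
    "node": socket.getfqdn(),
    "service": "lemon-agent",
    "payload": {"LoadAvg": 0.42},
})
conn.send(destination="/topic/monitoring", body=message)
conn.disconnect()
```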

Central Storage and Analysis
– All data is stored in a common location: this makes sharing of monitoring data easy, promotes sharing of analysis tools, and allows feeding already-processed data into the system
– NoSQL technologies are the most suitable solutions: focus on column/tabular and document-based solutions, with Hadoop (from the Cloudera distribution) as a first step

Central Storage and Analysis
– Hadoop is a good candidate to start with: prior positive experience in IT and the experiments; the map-reduce paradigm is a good match for the use cases; it has been used successfully at scale; many different NoSQL solutions use Hadoop as a backend; many tools provide export and import interfaces; several related modules are available (Hive, HBase)
– Document-based stores are also considered: CouchDB/MongoDB are good candidates
– For some use cases a parallel relational database solution (based on Oracle) could be considered
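As an illustration of why map-reduce fits these use cases, the pair of Hadoop Streaming scripts below averages one metric per node over JSON records like those sketched earlier. Record layout and field names are assumptions, not the production schema.

```python
#!/usr/bin/env python
# mapper.py -- emit (node, value) pairs from JSON monitoring records on stdin
import json
import sys

for line in sys.stdin:
    try:
        record = json.loads(line)
        print("%s\t%s" % (record["node"], record["payload"]["LoadAvg"]))
    except (ValueError, KeyError):
        continue  # skip malformed or unrelated records
```

```python
#!/usr/bin/env python
# reducer.py -- average the values per node (input arrives sorted by key)
import sys

node, total, count = None, 0.0, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != node:
        if node is not None:
            print("%s\t%f" % (node, total / count))
        node, total, count = key, 0.0, 0
    total += float(value)
    count += 1
if node is not None:
    print("%s\t%f" % (node, total / count))
```

A run would look like this (paths illustrative): hadoop jar hadoop-streaming.jar -input /monitoring/raw -output /monitoring/avg -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py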

Integrating Closed Solutions
– External (commercial) monitoring: Windows SCOM, Oracle EM Grid Control, Spectrum CA
– These data sources must be integrated, either by injecting final results into the messaging layer or by exporting relevant data at an intermediate stage
Diagram: an integrated product (sensor, transport, storage, analysis, visualization/reports) feeding the messaging layer through an export interface.

Implementation Plan (Pedro Andrade)

Transition Plan
Moving the existing production monitoring services to a new base architecture is a complex task, as these services must be continuously running. A transition plan was therefore defined; it foresees a staged approach in which the existing applications gradually incorporate elements of the new architecture.

Transition Plan (diagram)
Components shown, with OLD and NEW paths side by side: publisher, aggregation, storage feed, analysis storage, alarm feed, alarm, portal, report.

Milestones
– Monitoring.v1 (Q): AI nodes monitored with Lemon (dependency on Quattor); deployment of messaging broker and Hadoop cluster; testing of other technologies (Splunk)
– Monitoring.v2 (Q): AI nodes monitored with Lemon (no dependency on Quattor); Lemon data starts to be published via messaging
– Monitoring.v3 (Q): several clients exploiting the messaging infrastructure; messaging consumers for real-time alarms and notifications; initial data store/analysis for selected use cases
– Monitoring.v4 (Q): monitoring data published to the messaging infrastructure; large-scale data store/analysis on the Hadoop cluster

Monitoring v1
– Several meetings organized: https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure/AgileInfraDocsMinutes
– Short-term tasks identified and tickets created
– Work ongoing in four main areas: messaging broker deployment, Hadoop cluster deployment, testing of Splunk with Lemon data, Lemon agents running on Puppet

Monitoring v1
– Deployment of the messaging broker, based on Apollo and RabbitMQ
– Three SL6 nodes have been provided: two nodes for production, one node for development; each node will run Apollo and RabbitMQ
– Three applications have been identified to start using/testing the messaging infrastructure: OpenStack, MCollective, Lemon
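The v3 milestone above foresees messaging consumers for real-time alarms and notifications. A minimal consumer sketch follows, under the same illustrative broker, credential, and schema assumptions as the publisher example; the threshold is likewise invented for the example.

```python
import json
import time

import stomp  # pip install stomp.py

LOAD_THRESHOLD = 5.0  # illustrative alarm threshold


class AlarmListener(stomp.ConnectionListener):
    # Listener signature for stomp.py 4.x; newer versions pass a single frame.
    def on_message(self, headers, body):
        record = json.loads(body)
        load = record.get("payload", {}).get("LoadAvg")
        if load is not None and load > LOAD_THRESHOLD:
            print("ALARM: %s load average %.2f" % (record["node"], load))


conn = stomp.Connection([("broker.example.cern.ch", 61613)])  # placeholder
conn.set_listener("alarms", AlarmListener())
conn.connect("monitoring", "secret", wait=True)
conn.subscribe(destination="/topic/monitoring", id=1, ack="auto")
while True:
    time.sleep(60)  # keep the process alive while messages arrive
```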

Monitoring v1
– Testing Splunk with Lemon data: Lemon data to be exported from the DB (one day, one metric); data exported into a JSON file and stored on AFS; this data will be imported into Splunk; Splunk functionality and scalability will be tested
– Started the deployment of a Hadoop cluster: taking the Cloudera distribution; other tools may also be deployed (HBase, Hive, etc.); Hadoop testing using Lemon data (as above) is planned
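A possible shape for that export, assuming the Lemon measurements sit in an Oracle schema (the table, column, and metric names below are hypothetical): one JSON object per line, a layout Splunk ingests directly.

```python
import json

import cx_Oracle  # assuming the Lemon repository is an Oracle database

# Hypothetical table, columns, and metric id: one day of one metric,
# as described on the slide.
QUERY = """SELECT node, ts, value
             FROM lemon_measurements
            WHERE metric_id = :metric
              AND ts >= TRUNC(SYSDATE) - 1"""

conn = cx_Oracle.connect("lemon_reader", "secret", "lemondb.example.cern.ch")
cursor = conn.cursor()
cursor.execute(QUERY, metric=20002)  # illustrative metric id

with open("lemon-export.json", "w") as out:  # file to be copied to AFS
    for node, ts, value in cursor:
        out.write(json.dumps({
            "node": node,
            "timestamp": ts.isoformat(),  # cx_Oracle returns datetime
            "value": value,
        }) + "\n")
```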

Monitoring v1/v2
AI nodes monitored with existing Lemon metrics:
– First step: current Lemon sensors/metrics are used for AI nodes; Lemon metadata is still taken from Quattor; a solution is defined to get CDB-equivalent data
– Second step: current Lemon sensors/metrics are used for AI nodes; Lemon metadata is no longer taken from Quattor; Lemon agents start using the messaging infrastructure

Conclusions (Pedro Andrade)

Conclusions
– A monitoring architecture has been defined: it promotes sharing of monitoring data between applications, is based on a few core components (transport, storage, etc.), and builds on several existing external technologies
– A concrete implementation plan has been identified: it assures a smooth transition for today's applications, enables the new AI nodes to be monitored quickly, and allows moving towards a common system

Links
– Monitoring WG Twiki (new location!)
– Monitoring WG Report (ongoing): https://twiki.cern.ch/twiki/bin/view/MonitoringWG/MonitoringReport
– Agile Infrastructure TWiki
– Agile Infrastructure JIRA

QUESTIONS? Thanks!