Monitoring and Fault Tolerance

Monitoring and Fault Tolerance
Helge Meinhard / CERN-IT
OpenLab workshop, 08 July 2003

Monitoring and Fault Tolerance: Context
[Context diagram: the node at the centre, surrounded by the Configuration System, Installation, Monitoring, and Fault Mgmt subsystems]

History (1)
- In the 1990s, "massive" deployments of Unix boxes required automated monitoring of system state
- Answer: SURE
  - pure exception/alarm system
  - no archiving of values, hence not useful for performance monitoring
  - not scalable to O(1000) nodes

History (2)
- The PEM project at CERN (1999/2000) took a fresh look at fabric management, in particular monitoring
- PEM tool survey: commercial tools found not flexible enough and too expensive; free solutions not appropriate
- Architecture, design and implementation done from scratch

History (3)
- 2001-2003: European DataGrid (EDG) project with a work package (WP4) on Fabric Management
- Subtasks: configuration, installation, monitoring, fault tolerance, resource management, gridification
- Profited from PEM work, developed the ideas further

History (4)
- In 2001, some doubts about the 'do-it-all-ourselves' approach of EDG WP4
- In parallel to EDG WP4, a project was launched to investigate whether a commercial SCADA system could be used
- Architecture deliberately kept similar to WP4

Monitoring and FT architecture (1)
- Monitoring: non-intrusively captures the actual state of a system (supposed not to change its state)
- Fault Tolerance: reads and correlates data from the monitoring system, triggers corrective actions (state-changing)

Monitoring and FT architecture (2)
[Architecture diagram: several sensors feed the Monitoring Sensor Agent (MSA) on each node; the MSA keeps a local cache serving local consumers through an API and forwards data to the MR (Monitoring Repository), which serves consumers through an API backed by a DB. WP4: MR code with the lower layer as a flat-file archive, or using Oracle. CCS: PVSS system.]

Monitoring and FT architecture (3)
- The MSA controls communication with the Monitoring Repository, configures sensors, requests samples, and listens to sensors
- Sensors send metrics to the MSA on request or spontaneously
- MSA-MR communication: UDP or TCP based
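
As a rough illustration of the sensor-to-MSA flow just described, here is a minimal sketch of a sensor spontaneously pushing a sample over UDP. The endpoint, message format and metric name are assumptions made for illustration, not the actual EDG WP4 protocol:

```python
# Minimal sketch of the sensor -> MSA flow described above. The endpoint,
# message format and metric name are illustrative, not the real protocol.
import json
import socket
import time

MSA_ADDR = ("127.0.0.1", 12409)  # hypothetical MSA UDP endpoint


def sample_loadavg() -> float:
    """Read the 1-minute load average (Linux), a typical performance metric."""
    with open("/proc/loadavg") as f:
        return float(f.read().split()[0])


def send_metric(sock: socket.socket, name: str, value: float) -> None:
    """Push one metric sample spontaneously to the MSA."""
    msg = json.dumps({"metric": name, "value": value, "ts": int(time.time())})
    sock.sendto(msg.encode(), MSA_ADDR)


if __name__ == "__main__":
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        send_metric(sock, "loadavg1", sample_loadavg())
        time.sleep(60)  # 1/min sampling, at the low end of the rates quoted later
```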

Monitoring and FT architecture (4)
- The FT system subscribes to metrics from the monitoring subsystem
- A rule-based correlation engine takes the decisions on firing actuators
- Actuators are controlled by an Actuator Agent; all actions are logged by the monitoring system
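
To make the rule/actuator split concrete, here is a toy correlation engine: it receives metric updates, evaluates rules, fires the matching actuator, and logs the action. The rule (high load average) and the actuator are invented examples, not rules from the production system:

```python
# Toy rule-based correlation engine in the spirit of the FT subsystem
# described above; rule and actuator names are invented for illustration.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Rule:
    metric: str                          # metric the rule subscribes to
    predicate: Callable[[float], bool]   # condition on the latest value
    actuator: Callable[[], None]         # corrective action to fire


def restart_daemon() -> None:
    print("actuator: restarting daemon")  # stand-in for a real action


RULES = [Rule("loadavg1", lambda v: v > 20.0, restart_daemon)]
ACTION_LOG: List[Tuple[str, float, str]] = []


def on_metric(name: str, value: float) -> None:
    """Called for every metric update the FT system receives."""
    for rule in RULES:
        if rule.metric == name and rule.predicate(value):
            rule.actuator()
            # every action is itself reported back to the monitoring system
            ACTION_LOG.append((name, value, rule.actuator.__name__))


on_metric("loadavg1", 25.3)  # fires restart_daemon and logs the action
```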

Deployment (1)
- End 2001: put early versions of the MSA and sensors on the big clusters (~800 Linux machines), sending data (~100 metrics per machine, sampled 1/min … 1/day) to a PVSS-based repository
- At the same time, ~300 machines started sending performance metrics into the flat-file WP4 repository

Deployment (2)
- Sensors refined over time (metrics added according to operational needs)
- Both exception- and performance-oriented sensors now deployed in parallel (some 150 metrics per node)
- More special machines added; currently ~1500 machines being monitored
- Test in May 2003: some 500 metric changes per second into the repository (~150 changes/s after "smoothing")
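
The quoted "smoothing" is not specified on the slide; one plausible reading is change suppression with a deadband, sketched below. The 10% relative threshold is an assumption, not the actual production setting:

```python
# One plausible reading of "smoothing": suppress a metric update unless it
# differs from the last forwarded value by more than a relative deadband.
LAST_SENT: dict = {}


def should_forward(name: str, value: float, deadband: float = 0.10) -> bool:
    """Return True if this sample should go to the repository."""
    prev = LAST_SENT.get(name)
    if prev is not None and prev != 0 and abs(value - prev) / abs(prev) <= deadband:
        return False              # within the deadband: drop the update
    LAST_SENT[name] = value       # first sample or a real change: forward it
    return True


assert should_forward("loadavg1", 10.0) is True    # first sample: forward
assert should_forward("loadavg1", 10.5) is False   # within 10%: dropped
assert should_forward("loadavg1", 12.0) is True    # >10% change: forward
```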

Deployment (3)
- Repository requirements:
  - Repository API implementation
  - Oracle based
  - fully functional alarm display for operators
- Currently using both an Oracle-MR based repository and a PVSS based one
- Operators use the PVSS based alarm screen as an alternative to the SURE display

Deployment (4)
- Interfaces: C API available; simple command-line interface by end of July; prototype Web access to the time series of a metric available
- Fault tolerance: just starting to look at the WP4 prototype
- Configuration of monitoring: ad hoc, to be migrated to CDB
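
Purely as a hypothetical sketch of what the prototype Web access to a metric's time series could look like from the client side; the URL scheme and response format below are invented, only the idea (fetch recent samples of one metric for one node over HTTP) comes from the slide:

```python
# Hypothetical client for the prototype "Web access to time series";
# host, URL scheme and JSON response shape are assumptions.
import json
import urllib.request


def fetch_series(node: str, metric: str) -> list:
    url = f"http://mon-repository.example.org/timeseries?node={node}&metric={metric}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)  # assumed shape: [[timestamp, value], ...]


# series = fetch_series("lxbatch001", "loadavg1")
```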

Outlook
- Near term: production services for LCG-1
  - add more machines (e.g. network), metrics
  - software and service monitoring
- Medium term (end 2003): monitoring for Solaris and Windows, …
- 2004 or 2005: review of the chosen solution for monitoring and FT
  - some of the 1999 arguments no longer valid
  - will look at commercial and freeware solutions

Machine control
- High level: interplay of the State Management System, Configuration Management, Monitoring, Fault Tolerance, …
- Low level:
  - Past: CPU boxes didn't have anything (5 rolling tables with monitors and keyboards per 500…1000 machines); disk and tape servers with analog KVM switches
  - Future: have investigated various options with a benefit/cost analysis; will go to serial consoles on all machines, with 1 head node per 50…100 machines using serial multiplexers