PEM status report. Large-Scale Cluster Computing Workshop, FNAL, May 24 2001. Olof Bärring, CERN.

Slide 1: PEM status report. Large-Scale Cluster Computing Workshop, FNAL, May 24 2001. Olof Bärring, CERN.

Slide 2: Outline
– History
– Design
– First prototype
– DataGrid fabric mgmt monitoring task
– Conclusions

Slide 3: History
The Performance and Exception Monitoring (PEM) project has been a CERN IT project since 1999.
– Leader: Tim Smith; Bernd Panzer-Steindel from 2001
– Goal (and innovation): monitor and alarm on the service rather than the server
A long requirements phase gathered input from many IT groups; the design was settled in mid-2000.

Slide 4: Design (architecture diagram)
The diagram shows the PEM components and their data and control flows: a User Interface with access control, a Configuration Repository (CR), Agents (each hosting 1..n sensors and actuators), Brokers, a Measurement Repository (MR), and a Correlation Engine (CE).

Slide 5: Design: agent
The agent forwards data from monitoring sensors to the broker.
– Data is buffered for transfer efficiency and fault tolerance
– The configuration of local sensors and actuators is received via the broker
– Firing of actuators is triggered by the MR (or CE) via the broker to the agent
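
A minimal sketch of that sample-buffer-drain pattern, assuming Java (the prototype broker is Java/JDBC, so Java is a reasonable guess); the names here (Sensor, Measurement, drainForBroker) are illustrative, not the actual PEM API:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of a PEM-style agent: sample sensors, buffer, hand off to the broker. */
public class AgentSketch {
    interface Sensor { String name(); double read(); }
    record Measurement(String metric, double value, long timestampMillis) {}

    private final List<Sensor> sensors;                 // configured via the broker in PEM
    private final List<Measurement> buffer = new ArrayList<>();

    AgentSketch(List<Sensor> sensors) { this.sensors = sensors; }

    /** One sampling cycle (every 30 s in the prototype): read each sensor into the buffer. */
    synchronized void sample() {
        long now = System.currentTimeMillis();
        for (Sensor s : sensors) {
            buffer.add(new Measurement(s.name(), s.read(), now));
        }
    }

    /** Called when the broker polls: hand over the buffered data and clear the buffer. */
    synchronized List<Measurement> drainForBroker() {
        List<Measurement> batch = new ArrayList<>(buffer);
        buffer.clear();  // a real agent would clear only after the broker acknowledges receipt
        return batch;
    }
}
```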

Slide 6: Design: broker
The broker was introduced for scalability. It acts as a:
– Transmitter of configuration information to the agents and the measurement repository
– Transmitter of measurement data from the agents to the measurement repository
– Transmitter of requests for firing actuators from the MR to the agents
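
Those three roles can be summarized as an interface; a purely hypothetical Java rendering (none of these types come from PEM):

```java
import java.util.List;
import java.util.Map;

/** Illustrative stand-ins for the data the broker relays (not PEM's real types). */
record Measurement(String metric, double value, long timestampMillis) {}
record AgentConfig(Map<String, Integer> metricIntervalsSeconds, List<String> actuators) {}

/** Hypothetical summary of the broker's three transmit roles. */
interface Broker {
    void sendConfiguration(String agentId, AgentConfig config);          // CR -> agent
    void forwardMeasurements(String agentId, List<Measurement> batch);   // agent -> MR
    void fireActuator(String agentId, String actuatorName);              // MR/CE -> agent
}
```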

Slide 7: Design: measurement repository
The MR is the central archive of all monitoring measurements.
– Data compression (e.g. averaging old data)
– Not just a passive database: it actively notifies subscribed event listeners when a measurement falls outside its configured limits, and can trigger recovery actions
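
A minimal sketch of that active-notification idea, assuming a simple per-metric limit and listener registry (both hypothetical; the slides do not describe PEM's actual mechanism):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of an MR that notifies listeners when a value leaves its configured limits. */
public class MeasurementRepositorySketch {
    record Limits(double min, double max) {}
    interface EventListener { void onOutOfLimits(String metric, double value); }

    private final Map<String, Limits> limits = new HashMap<>();     // comes from the CR in PEM
    private final List<EventListener> listeners = new ArrayList<>();

    void configureLimit(String metric, double min, double max) {
        limits.put(metric, new Limits(min, max));
    }

    void subscribe(EventListener l) { listeners.add(l); }

    /** Store a measurement (archival omitted) and fire notifications if it is out of range. */
    void store(String metric, double value) {
        Limits lim = limits.get(metric);
        if (lim != null && (value < lim.min() || value > lim.max())) {
            for (EventListener l : listeners) l.onOutOfLimits(metric, value);
        }
    }
}
```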

Slide 8: Design: configuration repository
The configuration repository contains the configuration for all other components and their relationships, e.g.:
– Agent: metrics and measurement frequency; actuators
– Broker: which agents to control
– Measurement repository: metric limits; subscribed event listeners
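
As a concrete illustration, a snapshot of CR contents for those three component types might look like the following (entirely hypothetical names, values, and structure; the slides do not specify a format):

```java
import java.util.List;
import java.util.Map;

/** Hypothetical snapshot of CR contents for the three component types above. */
public class ConfigExample {
    // Agent: which metrics to sample, how often (seconds), and which actuators exist.
    static final Map<String, Integer> AGENT_METRIC_INTERVALS =
            Map.of("cpu.load", 30, "mem.used", 30, "daemon.httpd.alive", 30);
    static final List<String> AGENT_ACTUATORS = List.of("restart-daemon", "clean-tmp");

    // Broker: the agents (nodes) it is responsible for.
    static final List<String> BROKER_AGENTS = List.of("node001", "node002", "node003");

    // Measurement repository: alarm limits per metric, plus subscribed event listeners.
    static final Map<String, double[]> MR_METRIC_LIMITS =
            Map.of("cpu.load", new double[]{0.0, 10.0}, "mem.used", new double[]{0.0, 0.95});
    static final List<String> MR_EVENT_LISTENERS = List.of("operator-console", "recovery-service");
}
```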

Slide 9: First prototype
The first PEM prototype was ready for deployment in early 2001.
Agent status:
– Each agent measures 30 parameters (from system CPU to running daemons)
– Frequency: one measurement every 30 seconds
– Deployed on 400 nodes for about 7 weeks; soon to grow to about 1000 nodes
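
A back-of-envelope check of the aggregate rate these numbers imply (my arithmetic, not from the slides):

```latex
% Aggregate measurement rate implied by slide 9 (assuming uniform sampling):
\[
  400~\text{nodes} \times \frac{30~\text{metrics}}{30~\text{s}}
  = 400~\text{measurements/s}
  \approx 3.5 \times 10^{7}~\text{measurements/day}.
\]
% At roughly 30 bytes per stored value plus timestamp, this is about 1 GB/day,
% consistent with the data rate quoted on slide 10.
```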

Slide 10: First prototype
Broker status:
– The multithreaded broker contacts its assigned agents once per minute and retrieves the configuration (which metrics have been monitored) and the measurements since the last request
– The broker uses JDBC to write the measurements into an Oracle database: each measurement value and its timestamp
– Current configuration: ~50 agents per broker
– Data rate: ~1 GB/day
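
A minimal sketch of that JDBC write path, using a batched PreparedStatement; the table and column names, connection URL, and credentials are my placeholders, only the JDBC calls themselves are standard:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.List;

/** Sketch of the broker's JDBC write path: batch-insert measurements into Oracle. */
public class BrokerDbWriter {
    record Measurement(String node, String metric, double value, long timestampMillis) {}

    /** Insert one polled batch; the table and columns are hypothetical. */
    static void writeBatch(Connection conn, List<Measurement> batch) throws SQLException {
        String sql = "INSERT INTO measurements (node, metric, value, ts) VALUES (?, ?, ?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (Measurement m : batch) {
                ps.setString(1, m.node());
                ps.setString(2, m.metric());
                ps.setDouble(3, m.value());
                ps.setTimestamp(4, new Timestamp(m.timestampMillis()));
                ps.addBatch();   // batching keeps round trips low at ~1 GB/day
            }
            ps.executeBatch();
        }
    }

    public static void main(String[] args) throws SQLException {
        // Connection details are placeholders, not the PEM production setup.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:oracle:thin:@dbhost:1521:pem", "pem", "secret")) {
            writeBatch(conn, List.of(new Measurement(
                    "node001", "cpu.load", 0.42, System.currentTimeMillis())));
        }
    }
}
```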

Slide 11: First prototype
Measurement repository status:
– Some scalability problems at the beginning, related to threading in JDBC
– The Oracle installation is currently being upgraded to cope with the increasing load (concurrent reads and writes): dual-CPU PIII 800 MHz, 512 MB memory, Gigabit Ethernet, 750 GB mirrored EIDE disk server
– Plan: a cluster of database nodes to cope with the anticipated future load

Slide 12: DataGrid fabric mgmt (WP4)
The PEM prototype will most likely be adopted. WP4 promotes:
– High node autonomy: monitoring hierarchies whose lowest level can be entirely confined to a node (tight sensor-actuator loops)
– Replacing the PEM configuration repository with the WP4 configuration management system
– Possibly using the transport layer of the WP3 framework, based on the GMA (Grid Monitoring Architecture) producer-consumer model, as sketched below
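
In the GMA model, producers advertise data in a directory service, and consumers discover them and subscribe. A hypothetical Java rendering of those roles (the GMA specification defines the roles, not these interfaces):

```java
import java.util.List;

/** Hypothetical rendering of the GMA producer-consumer roles (illustrative names). */
interface Directory {
    void register(String producerId, List<String> metrics);  // producers advertise their metrics
    List<String> lookup(String metric);                      // consumers discover matching producers
}

interface Producer {
    /** Accept a subscription; deliver matching events to the consumer as they occur. */
    void subscribe(String metric, Consumer consumer);
}

interface Consumer {
    void onEvent(String metric, double value, long timestampMillis);
}
```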

Slide 13: Monitoring hierarchy (diagram)
The diagram shows three views. Node view: each node hosts sensors feeding an agent, with an MR cache, a config cache, a local Correlation Engine (CE) and actuators. Cluster view: a central MR and Correlation Engine. Fabric view: a GUI, the Configuration Management system, and the connection to the Grid.

Slide 14: Service view
How do we translate measured simple metrics into a service view? Some preliminary plans in WP4:
– Create probes that act like user programs (configurable for different CPU, memory, I/O, … characteristics)
– Run the probes on idle systems -> benchmarking
– Run the probes under different load conditions while concurrently measuring a set of simple metrics (e.g. CPU load, memory usage, I/O rates, bandwidth to the home directory, …) -> a matrix for mapping measured metrics to expected performance (see the sketch below)
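
A hypothetical sketch of that mapping step: record calibration rows of (simple metrics observed during a probe run, probe score), then estimate expected performance for the current metrics from the nearest row. This is my illustration of the plan, not WP4 code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Sketch of the metrics-to-expected-performance matrix the slide proposes. */
public class ServiceViewSketch {
    /** One calibration row: metrics observed while a probe ran, and the probe's score. */
    record Row(Map<String, Double> metrics, double probeScore) {}

    private final List<Row> matrix = new ArrayList<>();

    /** Calibration: run a probe under some load, record the metrics and its measured score. */
    void record(Map<String, Double> metricsDuringRun, double probeScore) {
        matrix.add(new Row(metricsDuringRun, probeScore));
    }

    /** Service view: estimate expected performance from the nearest calibration row. */
    double expectedPerformance(Map<String, Double> currentMetrics) {
        Row best = null;
        double bestDist = Double.MAX_VALUE;
        for (Row r : matrix) {
            double d = 0;
            for (var e : currentMetrics.entrySet()) {
                double ref = r.metrics().getOrDefault(e.getKey(), 0.0);
                d += (e.getValue() - ref) * (e.getValue() - ref);
            }
            if (d < bestDist) { bestDist = d; best = r; }
        }
        return best == null ? Double.NaN : best.probeScore();
    }
}
```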

Slide 15: Conclusions
After a long requirements and design phase, PEM has now reached a working prototype. PEM will be adopted by WP4 with slight modifications:
– Node autonomy and a monitoring hierarchy
– Use of the central fabric configuration management system
– Possible implementation of the GMA interfaces provided by WP3 for the monitoring transport and publication mechanisms