PEM status report. Large-Scale Cluster Computing Workshop, FNAL, May 24 2001. Olof Bärring, CERN.

1 PEM status report
Large-Scale Cluster Computing Workshop (LCCWS), FNAL, May 24 2001
Olof Bärring, CERN

2 Outline
History
Design
First prototype
DataGrid fabric management monitoring task
Conclusions

3 History
The Performance and Exception Monitoring (PEM) project has been a CERN IT project since 1999
–Leader: Tim Smith; Bernd Panzer-Steindel from 2001
–Goal (and innovation): monitor and alarm on the service rather than the server
Long requirements phase with input from many IT groups
Design settled in mid-2000

4 Design
[Architecture diagram. Components: User Interface, Access control, Configuration Repository (CR), Agent, Broker, Measurement repository (MR), Correlation Engine (CE), sensors, actuators. Connections are labelled as Data or Control and carry the multiplicity labels 1..n and 1.]

5 Design: agent
The agent forwards data from monitoring sensors to the broker
–Data is buffered for transfer efficiency and fault tolerance
–The configuration of local sensors and actuators is received via the broker
–The MR (or CE) triggers the firing of actuators; the request reaches the agent via the broker
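The buffering role of the agent can be illustrated with a minimal Python sketch. The class `AgentBuffer`, its methods and the capacity are hypothetical and not taken from PEM. The only assumption is that samples stay on the node until the broker has confirmed that it received them, so that one transfer can carry many samples and a failed transfer loses nothing.

```python
from collections import deque
import time


class AgentBuffer:
    """Hypothetical agent-side buffer: holds samples until the broker has them."""

    def __init__(self, capacity=1000):
        # Bounded buffer: once full, the oldest samples are dropped first.
        self._samples = deque(maxlen=capacity)

    def record(self, metric, value, timestamp=None):
        """Store one sensor reading locally."""
        if timestamp is None:
            timestamp = time.time()
        self._samples.append((metric, timestamp, value))

    def peek_batch(self):
        """Return everything buffered so far without discarding it."""
        return list(self._samples)

    def acknowledge(self, count):
        """Discard the `count` oldest samples after a confirmed transfer."""
        for _ in range(min(count, len(self._samples))):
            self._samples.popleft()


buf = AgentBuffer(capacity=3)
buf.record("cpu_load", 0.42, timestamp=0)
buf.record("cpu_load", 0.40, timestamp=30)
batch = buf.peek_batch()       # one transfer carries several samples
buf.acknowledge(len(batch))    # samples are dropped only after the transfer
```

The sketch is single-threaded; a real agent that records and transfers concurrently would need locking around the buffer.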

6 Design: broker
The broker was introduced for scalability
–Transmitter of configuration information to the agents and to the measurement repository
–Transmitter of measurement data from the agents to the measurement repository
–Transmitter of requests for firing actuators from the MR to the agents
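The three relay roles of the broker can be sketched as follows. This is an illustration only: `Broker`, `StubAgent`, `ListRepository` and all method names are hypothetical stand-ins, and the real components would communicate over the network rather than through direct method calls.

```python
class StubAgent:
    """Hypothetical stand-in for a remote agent."""

    def __init__(self, samples):
        self._samples = list(samples)
        self.config = None
        self.fired = []

    def configure(self, config):
        self.config = config

    def fetch_measurements(self):
        # Hand over all pending samples and start with an empty list again.
        pending, self._samples = self._samples, []
        return pending

    def fire(self, actuator):
        self.fired.append(actuator)


class ListRepository:
    """Hypothetical stand-in for the measurement repository."""

    def __init__(self):
        self.rows = []

    def store(self, agent_name, sample):
        self.rows.append((agent_name, sample))


class Broker:
    def __init__(self, agents, repository):
        self.agents = agents            # agent name -> agent object
        self.repository = repository

    def push_config(self, configs):
        """Role 1: pass configuration down to the agents."""
        for name, config in configs.items():
            self.agents[name].configure(config)

    def collect(self):
        """Role 2: pass measurements up to the repository."""
        for name, agent in self.agents.items():
            for sample in agent.fetch_measurements():
                self.repository.store(name, sample)

    def fire(self, agent_name, actuator):
        """Role 3: pass an actuator request down to one agent."""
        self.agents[agent_name].fire(actuator)


agent = StubAgent([("cpu_load", 0, 0.4)])
repo = ListRepository()
broker = Broker({"node001": agent}, repo)
broker.push_config({"node001": {"cpu_load": 30}})
broker.collect()
broker.fire("node001", "restart_sshd")
```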

7 Design: measurement repository
The MR is the central archive of all monitoring measurements
–Data compression (e.g. averaging old data)
–Not just a passive database: it actively notifies subscribed event listeners if a measurement is outside its configured limits, and it triggers recovery actions
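A minimal in-memory sketch of the two MR behaviours named on this slide: notifying listeners when a value leaves its configured limits, and compressing old data by averaging. The class, its method names and the bucket-averaging scheme are assumptions made for illustration; the slides do not describe how the prototype implements either feature.

```python
class MeasurementRepository:
    """Hypothetical in-memory stand-in for the MR."""

    def __init__(self, limits):
        self.limits = limits      # metric -> (low, high)
        self.listeners = {}       # metric -> list of callbacks
        self.history = {}         # metric -> list of (timestamp, value)

    def subscribe(self, metric, callback):
        self.listeners.setdefault(metric, []).append(callback)

    def store(self, metric, timestamp, value):
        """Archive a sample and notify listeners if it is outside its limits."""
        self.history.setdefault(metric, []).append((timestamp, value))
        low, high = self.limits.get(metric, (float("-inf"), float("inf")))
        if not (low <= value <= high):
            for callback in self.listeners.get(metric, []):
                callback(metric, timestamp, value)

    def compress(self, metric, older_than, bucket):
        """Replace samples older than `older_than` by one average per bucket."""
        old, recent = {}, []
        for ts, value in self.history.get(metric, []):
            if ts < older_than:
                old.setdefault(ts // bucket * bucket, []).append(value)
            else:
                recent.append((ts, value))
        averaged = [(start, sum(v) / len(v)) for start, v in sorted(old.items())]
        self.history[metric] = averaged + recent


mr = MeasurementRepository({"cpu_load": (0.0, 0.9)})
alarms = []
mr.subscribe("cpu_load", lambda m, ts, v: alarms.append((m, ts, v)))
for ts, v in [(0, 0.2), (30, 0.4), (60, 0.95), (90, 0.5)]:
    mr.store("cpu_load", ts, v)           # only the 0.95 sample raises an alarm
mr.compress("cpu_load", older_than=60, bucket=60)
```

After `compress`, the two samples from the first minute are replaced by a single averaged entry, while the newer samples are kept at full resolution.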

8 Design: configuration repository
The configuration repository contains the configuration for all other components and their relationships, e.g.
–Agent: metrics and measurement frequency; actuators
–Broker: which agents to control
–Measurement repository: metric limits; subscribed event listeners
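One way to picture the contents listed above is as a small set of typed records. The layout below is purely hypothetical: the class names, field names, node names and values are invented for illustration and do not reflect the actual PEM configuration schema. The helper shows the kind of relationship check such a repository makes possible.

```python
from dataclasses import dataclass, field


@dataclass
class AgentConfig:
    metrics: dict                       # metric name -> interval in seconds
    actuators: list = field(default_factory=list)


@dataclass
class BrokerConfig:
    agents: list                        # names of the agents to control


@dataclass
class RepositoryConfig:
    limits: dict                        # metric name -> (low, high)
    listeners: dict                     # metric name -> subscribed listeners


# Illustrative values only.
config = {
    "agents": {
        "node001": AgentConfig(
            metrics={"cpu_load": 30, "sshd_running": 30},
            actuators=["restart_sshd"],
        ),
    },
    "brokers": {"broker01": BrokerConfig(agents=["node001"])},
    "repository": RepositoryConfig(
        limits={"cpu_load": (0.0, 0.9)},
        listeners={"cpu_load": ["operator_console"]},
    ),
}


def unknown_agents(cfg):
    """List agents that a broker should control but that have no agent entry."""
    known = set(cfg["agents"])
    return sorted(
        name
        for broker in cfg["brokers"].values()
        for name in broker.agents
        if name not in known
    )
```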

9 First prototype
The first PEM prototype was ready for deployment in early 2001
Agent status:
–Each agent measures 30 parameters (from system CPU to running daemons)
–Frequency: one measurement every 30 seconds
–Deployed on 400 nodes for ~7 weeks; soon to be extended to about 1000 nodes

10 First prototype
Broker status:
–The multithreaded broker contacts its assigned agents once per minute and retrieves the configuration (which metrics have been monitored) and the measurements since the last request
–The broker uses JDBC to write the measurements into an ORACLE database: each measurement value and its timestamp
–Current configuration: ~50 agents per broker
–Data rate: ~1GB/day
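The figures quoted for the agents and the broker (400 nodes, 30 parameters, one measurement every 30 seconds, ~50 agents per broker, ~1GB/day) can be combined in a short back-of-the-envelope calculation. Two assumptions are made here that the slides do not state: that the ~1GB/day figure refers to the current 400-node deployment, and that 1GB means 10^9 bytes.

```python
import math

NODES = 400
METRICS_PER_NODE = 30
INTERVAL_S = 30
AGENTS_PER_BROKER = 50
BYTES_PER_DAY = 1e9          # "~1GB/day", taken as 10^9 bytes (assumption)
SECONDS_PER_DAY = 86_400

samples_per_second = NODES * METRICS_PER_NODE / INTERVAL_S   # 400.0
samples_per_day = samples_per_second * SECONDS_PER_DAY       # 34,560,000.0
brokers_needed = math.ceil(NODES / AGENTS_PER_BROKER)        # 8
bytes_per_sample = BYTES_PER_DAY / samples_per_day           # about 29

# The same arithmetic for the planned extension to about 1000 nodes.
planned_samples_per_second = 1000 * METRICS_PER_NODE / INTERVAL_S   # 1000.0
planned_brokers = math.ceil(1000 / AGENTS_PER_BROKER)               # 20
```

Under these assumptions each stored measurement costs roughly 29 bytes, and the move to about 1000 nodes would raise the insert rate from 400 to 1000 measurements per second.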

11 First prototype
Measurement repository status:
–Some scalability problems in the beginning, related to threading in JDBC
–The ORACLE installation is currently being upgraded to cope with the increasing load (concurrent read and write): dual-CPU PIII 800MHz, 512MB memory, Gigabit Ethernet, 750GB mirrored EIDE disk server
–Plan to have a cluster of database nodes to cope with the anticipated future load

12 DataGrid fabric mgmt (WP4)
PEM prototype will most likely be adopted
WP4 promotes:
–High node autonomy. Monitoring hierarchies where the lowest level can be entirely confined to a node (tight sensor-actuator loops)
–PEM configuration repository will be replaced by the WP4 configuration management system
–Possibly use the transport layer from the WP3 framework, based on the GMA (Grid Monitoring Architecture) producer-consumer model

13 Monitoring hierarchy
[Diagram. Labels: Node, sensor, Agent, MR cache, Config cache, CE, Actuator, MR, Correlation Engine (CE), GRID, Node view, Cluster view, Fabric view, GUI, Configuration Management system.]

14 Service view
How to translate measured simple metrics into a service view?
Some preliminary plans in WP4:
–Create probes that act as user programs (configurable for different CPU, memory, I/O, … characteristics)
–Run probes on idle systems -> benchmarking
–Run probes under different load conditions and concurrently measure a set of simple metrics (e.g. CPU load, memory usage, I/O rates, bandwidth to home directory, …) -> matrix for mapping of expected performance
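The idea of a matrix that maps simple metrics to expected performance can be made concrete with a small sketch. Everything in it is an assumption: the slide does not say how the matrix would be built or queried, the calibration numbers are invented placeholders rather than measurements, and a nearest-neighbour lookup is just one simple choice of mapping.

```python
import math

# Each row: (simple metrics observed while the probe ran, probe runtime in s).
# Placeholder numbers for illustration only, not measured data.
calibration = [
    ({"cpu_load": 0.0, "io_rate": 0.0}, 10.0),   # idle system: the benchmark
    ({"cpu_load": 0.5, "io_rate": 0.2}, 14.0),
    ({"cpu_load": 0.9, "io_rate": 0.8}, 31.0),
]


def expected_runtime(metrics):
    """Return the probe runtime recorded under the most similar load condition."""
    def distance(row):
        observed, _ = row
        return math.sqrt(sum((observed[k] - metrics[k]) ** 2 for k in observed))

    _, runtime = min(calibration, key=distance)
    return runtime


def slowdown(metrics):
    """Expected runtime relative to the idle-system benchmark."""
    idle_runtime = calibration[0][1]
    return expected_runtime(metrics) / idle_runtime


current = {"cpu_load": 0.6, "io_rate": 0.3}
factor = slowdown(current)   # closest calibration row is the second one
```

A service-level alarm could then be raised when the expected slowdown exceeds a threshold, instead of alarming on each simple metric separately.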

15 Conclusions
After a long requirements and design phase, PEM has now reached a working prototype
PEM will be adopted by WP4 with slight modifications
–Node autonomy, monitoring hierarchy
–Use central fabric configuration mgmt
–May implement GMA interfaces provided by WP3 for monitoring transport and publication mechanisms

