GridPP7 – June 30 – July 2, 2003 – Fabric monitoring for LCG-1 in the CERN Computer Center. Jan van Eldik, CERN-IT/FIO/SM. 7th GridPP.


1 Fabric monitoring for LCG-1 in the CERN Computer Center
Jan van Eldik, CERN-IT/FIO/SM
7th GridPP Collaboration meeting (GridPP7, June 30 – July 2, 2003), July 1, 2003

2 Outline
– Fabric monitoring developments at CERN
– Architectural overview
– Deployment: status & plans for LCG-1
– Outlook

3 Fabric Monitoring at CERN
– Improved fabric management is a key part of the LCG programme
– EDG WP4 develops tools for automated installation, configuration, fabric monitoring, and fault tolerance
– IT/FIO Supervision & Monitoring section: develop and deploy a monitoring solution for the LHC era
– A lot of expertise: EDG WP4 monitoring developments, PVSS SCADA studies, SNMP studies, operator alarm displays, …
– Architecture based on functional requirements gathered by the PEM project
– Important objective: fabric monitoring for LCG-1 at CERN

4 Requirements and architecture
[Diagram labels: Monitored nodes, Sensor, Monitoring Sensor Agent, Cache, Local Consumer, Measurement Repository, Database, Global Consumer]
– Both for performance and exception monitoring
– Local and global consumers
– Scalable, extensible, robust

5 EDG WP4 implementation
[Diagram labels: Monitored nodes, Sensor, Monitoring Sensor Agent (MSA), Cache, Local Consumer, Measurement Repository (MR), Database, Global Consumer]
Monitoring Sensor Agent
– Calls plug-in sensors to sample configured metrics
– Stores all collected data in a local disk buffer
– Sends the collected data to the global repository
Plug-in sensors
– Programs/scripts that implement a simple sensor-agent ASCII text protocol
– A C++ interface class is provided on top of the text protocol to facilitate implementation of new sensors
The local cache
– Ensures data is still collected when the node cannot connect to the network
– Allows for node autonomy for local repairs
Transport
– Transport is pluggable. Two protocols, over UDP and over TCP, are currently supported; only the latter can guarantee delivery
Measurement Repository
– The data is stored in a database
– A memory cache guarantees fast access to the most recent data, which is normally what is used for fault tolerance correlations
Repository API
– SOAP RPC
– Query history data
– Subscription to new data
Database
– Proprietary flat-file database
– Oracle
– Open source interface to be developed
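The sample, buffer, and send cycle described for the Monitoring Sensor Agent can be illustrated with a minimal Python sketch. This is not the WP4 code: the class name `SensorAgent`, the record layout, and the convention that a transport returns `True` on confirmed delivery are all assumptions made for illustration, and an in-memory list stands in for the local disk buffer.

```python
import json
import time
from typing import Callable, Dict, List, Optional

# All names here are hypothetical; the real MSA talks to its sensors over
# an ASCII text protocol that this sketch does not reproduce.
Sensor = Callable[[], float]
Transport = Callable[[str], bool]  # assumed: True means delivery confirmed


class SensorAgent:
    def __init__(self, node: str, sensors: Dict[str, Sensor],
                 transport: Transport) -> None:
        self.node = node
        self.sensors = sensors      # plug-in sensors, keyed by metric name
        self.transport = transport  # pluggable: swap in any callable
        self.buffer: List[dict] = []  # stands in for the local disk buffer

    def sample(self, now: Optional[float] = None) -> None:
        """Call every configured sensor and buffer the result locally."""
        ts = time.time() if now is None else now
        for metric, sensor in self.sensors.items():
            self.buffer.append(
                {"node": self.node, "metric": metric,
                 "ts": ts, "value": sensor()}
            )

    def flush(self) -> int:
        """Send buffered records oldest first; keep whatever is not delivered."""
        sent = 0
        while self.buffer:
            if not self.transport(json.dumps(self.buffer[0])):
                break  # network unavailable: records stay in the buffer
            self.buffer.pop(0)
            sent += 1
        return sent
```

Because records leave the buffer only after the transport reports success, sampling can continue while the node is disconnected and the backlog is sent on the next successful `flush()`. A fire-and-forget transport in the style of UDP would always return `True` here, which is the sense in which only a confirming transport can guarantee delivery.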

6 Deployment status in CERN CC
– MSA with sensors for performance and exception monitoring, measuring 100-150 quantities per box
– Deployed on ~1500 RedHat Linux nodes
– 30 clusters, with specific configuration files
  Batch: 1000 nodes
  Interactive: 70 nodes
  Disk server: 200 nodes
  Tape server: 80 nodes
  WWW, DB, MISC: 200 nodes
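As a quick consistency check on the deployment figures, the per-category node counts can be summed and multiplied by the quoted 100-150 quantities per box. The numbers are taken from the slide; the variable names are illustrative, and the resulting totals are a back-of-the-envelope estimate rather than a figure stated in the talk.

```python
# Node counts per cluster category, as listed on the slide.
nodes_per_category = {
    "Batch": 1000,
    "Interactive": 70,
    "Disk server": 200,
    "Tape server": 80,
    "WWW, DB, MISC": 200,
}

total_nodes = sum(nodes_per_category.values())  # 1550, in line with "~1500"

# The slide quotes 100-150 measured quantities per box.
quantities_low = total_nodes * 100   # 155000
quantities_high = total_nodes * 150  # 232500

print(f"{total_nodes} nodes, {quantities_low}-{quantities_high} monitored quantities")
```

Under these figures the repository has to track on the order of 150,000 to 230,000 distinct quantities across the computer center.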

7 Status of exception monitoring
– ~50 possible alarms per monitored node: HighLoad, DaemonDead, FileSysFull, install / config problems
– Operator alarm displays
  – PVSS-based, developed as part of PVSS tests
  – WP4 alarm display under active development
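To make the alarm names above concrete, here is a minimal sketch of how one node sample might be turned into a list of raised alarms. Only the alarm names HighLoad, DaemonDead, and FileSysFull come from the slide; the sample layout, the function name `evaluate_alarms`, and the thresholds are assumptions chosen for illustration.

```python
from typing import Dict, List


def evaluate_alarms(sample: Dict, max_load: float = 10.0,
                    max_fs_used: float = 0.95) -> List[str]:
    """Return the alarms raised by one node sample.

    Thresholds and the sample layout are hypothetical, not WP4 defaults.
    """
    alarms: List[str] = []
    if sample["load_avg"] > max_load:
        alarms.append("HighLoad")
    for daemon, running in sample["daemons"].items():
        if not running:
            alarms.append(f"DaemonDead:{daemon}")
    for mount, used_fraction in sample["fs_used"].items():
        if used_fraction >= max_fs_used:
            alarms.append(f"FileSysFull:{mount}")
    return alarms


example = {
    "load_avg": 12.0,
    "daemons": {"sshd": True, "crond": False},
    "fs_used": {"/": 0.50, "/var": 0.97},
}
print(evaluate_alarms(example))
# → ['HighLoad', 'DaemonDead:crond', 'FileSysFull:/var']
```

An operator display would then only need to render the returned list per node, which keeps the alarm logic independent of whichever display is used.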

8 PVSS operator alarm display

9 WP4 operator alarm display

10 Performance monitoring
– WP4 Measurement Repository with Oracle backend is currently being deployed in the CERN CC for LCG-1
– Data access
  – C API to the repository is available; Perl and Java implementations to be done
  – Simple CLI is being delivered
  – GUI is being delivered
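The kinds of data access a consumer needs from the repository (most recent value, history query, subscription to new data) can be sketched in a few lines of Python. This is an in-memory illustration only: it is not the C API mentioned above, the class and method names are hypothetical, and a dictionary of lists stands in for the database backend.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

Key = Tuple[str, str]            # (node, metric)
Point = Tuple[float, float]      # (timestamp, value)
Callback = Callable[[str, str, float, float], None]


class MeasurementRepository:
    """Hypothetical stand-in for a repository with a latest-value cache."""

    def __init__(self) -> None:
        self._history: Dict[Key, List[Point]] = defaultdict(list)  # "database"
        self._latest: Dict[Key, Point] = {}                        # memory cache
        self._subscribers: Dict[Key, List[Callback]] = defaultdict(list)

    def store(self, node: str, metric: str, ts: float, value: float) -> None:
        key = (node, metric)
        self._history[key].append((ts, value))
        # Keep the cache pointing at the newest sample even if data arrives late.
        if key not in self._latest or ts >= self._latest[key][0]:
            self._latest[key] = (ts, value)
        for callback in self._subscribers.get(key, []):
            callback(node, metric, ts, value)

    def latest(self, node: str, metric: str) -> Optional[Point]:
        """Most recent sample, served from the memory cache."""
        return self._latest.get((node, metric))

    def history(self, node: str, metric: str,
                start: float, end: float) -> List[Point]:
        """All stored samples with start <= timestamp <= end."""
        points = self._history.get((node, metric), [])
        return [(t, v) for t, v in points if start <= t <= end]

    def subscribe(self, node: str, metric: str, callback: Callback) -> None:
        """Register a callback invoked for every new sample of this metric."""
        self._subscribers[(node, metric)].append(callback)
```

A consumer doing fault-tolerance correlations would mostly call `latest()`, while a performance display would use `history()` over a time window and `subscribe()` to stay current.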

11 Anamon

12 Open issues
– Current solution is still very node-centric
– Not much experience with consumers
– No correlation engines, no corrective actions yet…
– Integration with configuration system to be done

13 Summary and Outlook
– Fabric monitoring infrastructure for LCG-1 at CERN is being deployed
– Monitoring Sensor Agent has been operating very well
– Measurement Repository will now be challenged
– Consumers can start consuming…
– An interesting six-month period awaits us!

