NGOP Prototype Status Report. T. Levshina, Next Generation Operation Group.


1 NGOP Prototype Status Report T.Levshina

2 4/27/2001 ngop@fnal.gov
Next Generation Operation Group
Integrated Systems Development Department: Krzysztof Genser, Terry Jones, Tanya Levshina, Igor Mandrichenko, Don Petravick
Operating Systems Support Department: Troy Dawson, Jim Fromm, Lisa Giacchetti, Marc Mengel, Ken Schumacher, Steven Timm
Computing Services Department: Rick Thies, Rich Thompson

3 Presentation Highlights
– NGOP project phases
– Status of the Framework
– Status of the prototype deployment
– Near-future milestones

4 NGOP Project Phases (since last HEPIX)
– December 2000: First prototype implementation was released.
– January 2001: Prototype installed on farms. Classes held for farm administrators.
– February 2001: NGOP server node installed in the operator console area. Monitoring by operators started.
– March 2001: New release (“Swatch” and “PlugIns” Agents). NGOP was evaluated by system administrators, operators and others. A strategy meeting was held.
– April 2001: “Xfalive” service (low-level ping) provided for all nodes monitored by the Computing Services Department.

5 NGOP Architecture
[Architecture diagram. Recoverable labels:]
– Monitored Objects: Host, Element, Cluster, System
– NGOP Components: Sensor Agent, Monitoring Agent (MA), Server, Monitoring Data Storage, Clients
– Services shown: Central Server, Configuration File Management Service (Persistent Config. Data), Archive Service (Archive), Performance Storage Service (Performance Data), Router
– Other boxes shown: Administrator, Monitor, Report Generator, Data Analyzer, Action Client
– Clusters shown: Cluster A, Cluster B, Cluster B1, Cluster B2
– Legend entries: TCP connection; UDP connection; connection between Monitored Element and MA; not implemented in prototype yet

6 Monitor Data Flow and NGOP Components Interaction
[Data-flow diagram: several MAs, each with its Monitored Elements, and the Central Server; other boxes shown are Monitor, Action Client (Action Request), Archiver, Configuration Service and CVS.]
Example events shown in the diagram:
– ID=swap.nodeA State=Up Value=98 SevLevel=Error Dscrb=“swap > 95 %”
– ID=syslogd.nodeB State=Down Dscrb=“syslogd is down”
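The two example events on this slide can be modelled as simple records. What follows is a minimal Python sketch, not the NGOP wire format: the field names mirror the slide (ID, State, Value, SevLevel, Dscrb), while the `Event` class and the `parse_event` helper are hypothetical names introduced only for illustration. Straight quotes replace the typographic quotes of the slide.

```python
import shlex
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """One monitoring event, with the fields shown on the slide."""
    id: str
    state: str
    value: Optional[str] = None
    sev_level: Optional[str] = None
    dscrb: Optional[str] = None


def parse_event(text: str) -> Event:
    """Parse space-separated Key=Value pairs; quoted values may contain spaces."""
    fields = {}
    for token in shlex.split(text):
        key, _, val = token.partition("=")
        fields[key] = val
    return Event(
        id=fields["ID"],
        state=fields["State"],
        value=fields.get("Value"),
        sev_level=fields.get("SevLevel"),
        dscrb=fields.get("Dscrb"),
    )


swap_event = parse_event(
    'ID=swap.nodeA State=Up Value=98 SevLevel=Error Dscrb="swap > 95 %"'
)
daemon_event = parse_event('ID=syslogd.nodeB State=Down Dscrb="syslogd is down"')
```

Optional fields default to `None`, which matches the second example, where no Value or SevLevel is given.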

7 Status of Framework (Implemented Components)
Monitoring Agent:
– MA API (only Python binding)
– PlugIns Agent (XML configuration is required)
– Several types of MAs are provided in the NGOP Prototype:
  Linux Node "health":
  – System daemons presence
  – Critical file systems presence and size
  – CPU load
  – Memory utilization
  – Swap utilization
  – Number of users
  – Number of users’ processes
  – Number of processors
  – Baseboard temperature
  – Fan speed
  “Xfalive”:
  – Node availability (low-level ping)
  – Node reset
  FBS:
  – FBS daemons presence
  – Resources (“cpu” and scratch disk availability)
  “Swatch”:
  – Watches a log file (e.g. syslog or console log) for lines matching a regular expression
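The “Swatch” idea of matching log lines against regular expressions can be sketched in a few lines of Python. This is an illustration only, not the NGOP Swatch agent: the pattern names, the regular expressions and the sample log lines are all invented for the example.

```python
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical patterns, loosely inspired by the NFS and drive timeouts
# mentioned elsewhere in the slides. They are not NGOP configuration.
PATTERNS = {
    "nfs_timeout": re.compile(r"nfs.*timed out", re.IGNORECASE),
    "drive_timeout": re.compile(r"\bhd[a-z]: .*timeout", re.IGNORECASE),
}


def match_lines(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (pattern name, line) for each line matching a known pattern."""
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                yield name, line.rstrip("\n")
                break  # report each line at most once


# Made-up syslog lines, used only to exercise the matcher.
sample = [
    "Apr 27 10:01:02 nodeA kernel: nfs: server serverX not responding, timed out\n",
    "Apr 27 10:01:05 nodeA sshd[123]: session opened for user test\n",
    "Apr 27 10:02:00 nodeA kernel: hda: irq timeout: status=0xd0\n",
]
hits = list(match_lines(sample))
```

Because `match_lines` accepts any iterable of strings, the same function works on an open file object as well as on the in-memory sample.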

8 Status of Framework (Implemented Components)
NGOP Central Server (NCS):
– Gathers events from MAs
– Scalable (so far ~512 nodes)
– Provides users with requested information
– Handles multiple users
– Primitive locking mechanism to prevent simultaneous actions
– Action broadcasting
– Stores information locally and forwards it to Archive Storage
NGOP Configuration File Management Service:
– Provides a central repository for system configuration and monitoring rules
– Performs configuration sanity checks
– Provides clients with component subscription lists
– Allows dynamic reconfiguration
– Notifies clients about new configuration
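The slide does not say how the NCS locking mechanism works. One simple way to prevent simultaneous actions is a table of per-target locks, sketched below under that assumption. The `ActionLocks` class, its method names and the user names are hypothetical and do not come from NGOP.

```python
import threading
from typing import Dict, Optional


class ActionLocks:
    """Allow at most one user at a time to act on a given target."""

    def __init__(self) -> None:
        self._guard = threading.Lock()    # protects the owner table itself
        self._owners: Dict[str, str] = {}  # target -> user holding the lock

    def acquire(self, target: str, user: str) -> bool:
        """Return True if the lock was taken, False if someone holds it."""
        with self._guard:
            if target in self._owners:
                return False
            self._owners[target] = user
            return True

    def release(self, target: str, user: str) -> bool:
        """Release the lock; only the current owner may do so."""
        with self._guard:
            if self._owners.get(target) != user:
                return False
            del self._owners[target]
            return True

    def owner(self, target: str) -> Optional[str]:
        """Return the user holding the lock on target, if any."""
        with self._guard:
            return self._owners.get(target)


locks = ActionLocks()
first = locks.acquire("nodeA", "alice")   # lock is free, so this succeeds
second = locks.acquire("nodeA", "bob")    # refused while alice holds it
```

A refused `acquire` returns immediately instead of blocking, so a client can report that another action is already in progress.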

9 Status of Framework (Implemented Components)
Archive Server:
– Handles archive storage (Oracle)
– Provides a means to read and query the data (FNAL web interface: MISWEB)
– Performs data roll-out
– Performs clean-up procedure
Action Client:
– Performs centralized actions
– Verifies user authorization to perform the action
– Notifies NCS about action exit status
Monitoring Client:
– Allows configuration of custom-built system views
– Defines rules that determine the status of systems and their components
– Requests and receives information about monitored objects
– Determines the status of a system based on the rules and the obtained information
– Initiates requests to perform actions
– All configuration files are written in XML
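The slides do not spell out the Monitoring Client rules. A common and very simple rule is that a system is as bad as its worst component, and the sketch below shows that rule only as an assumed example. The severity scale is hypothetical (only “Error” appears in the slides), as are the function names.

```python
from typing import Dict, List

# Hypothetical severity scale, ordered from best to worst.
SEVERITY_ORDER: List[str] = ["Good", "Warning", "Error", "Bad"]


def worst_severity(levels: List[str]) -> str:
    """Return the worst level in the list, or the best level if it is empty."""
    if not levels:
        return SEVERITY_ORDER[0]
    return max(levels, key=SEVERITY_ORDER.index)


def cluster_status(hosts: Dict[str, List[str]]) -> str:
    """Derive a cluster status from the element severities of each host."""
    host_levels = [worst_severity(levels) for levels in hosts.values()]
    return worst_severity(host_levels)


status = cluster_status({
    "nodeA": ["Good", "Error"],  # e.g. daemons fine, swap usage in error
    "nodeB": ["Warning"],
})
```

The same `worst_severity` function is applied at each level of the hierarchy (element to host, host to cluster), so deeper hierarchies need no new logic.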

10 Status of Framework (Not Yet Implemented Components)
– Sensor Agent: Agent that collects performance data and generates events at a higher rate than a monitoring agent.
– Performance Data Storage Service: Service that provides persistent storage of performance data, as well as a means to read and query the data. Performance data will need to be consolidated.
– Looping Monitoring Agent: Agent that is capable of receiving information from the NCS, analyzing it, deriving new events and sending them back to the NCS.

11 CFMS Admin

12 NGOP Monitor (Configuration)

13 NGOP Monitor (Display)

14 NGOP Monitor (Display)

15 Prototype Statistics
Some implementation details:
– Written primarily in Python (some modules in C): ~10,000 lines of Python code and ~1,000 lines of C code
– XML (and partially MathML) is used for all configuration files: ~600 configuration files
Some deployment details:
– Monitoring 512 nodes, checking for nodes being down and node resets
– Monitoring four farms (CDF, D0, two fixed-target experiment farms): 270 nodes out of 512
– Number of Monitoring Agents ~557 (270 local MAs monitor operating system and sensor data on the farms, 270 local MAs monitor syslog on the farms, 4 MAs monitor FBS on the corresponding farms, 13 MAs perform the “xfalive” service)
– Number of Monitored Objects ~6,500
– About 5 instances of “ngop monitor” (GUI) run simultaneously
– A local event log has been kept since January 12. The rate is ~13 events per hour.
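The deployment figures can be cross-checked with a little arithmetic. The sketch below uses only numbers from this slide. It assumes that “January 12” means January 12, 2001, that the slide date (April 27, 2001) is the end of the period, and that the event rate stayed constant, so the event total is a rough estimate rather than a reported figure.

```python
from datetime import date

# Monitoring Agent counts as listed on the slide.
agents = {"OS health": 270, "syslog": 270, "FBS": 4, "xfalive": 13}
total_agents = sum(agents.values())

# Average number of monitored objects per agent (both figures approximate).
monitored_objects = 6500
objects_per_agent = monitored_objects / total_agents

# Rough size of the local event log, assuming a constant 13 events per hour.
log_start = date(2001, 1, 12)  # assumed year
talk_date = date(2001, 4, 27)
days = (talk_date - log_start).days
estimated_events = days * 24 * 13

print(total_agents, round(objects_per_agent, 1), days, estimated_events)
```

The agent counts add up to the 557 quoted on the slide, which works out to roughly 12 monitored objects per agent.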

16 Current Configuration
[Deployment diagram. Recoverable labels:]
– Central components: NGOP Central Server, Config File Management Server, Archive Service, NGOP Action Client (FNCDUH also appears as a label)
– Clients: three NGOP Monitor user nodes
– Old FixTarget Farm (fnpc 1 - 37, fnsfh): MA (OSHealth), MA (OFT_FBS), Swatch
– CDF Farm (fncdf 1 - 90, cdffarm1): MA (OSHealth), MA (CDF_FBS), Swatch
– FixTarget Farm (fnpc 201 - 250, fnsfo): MA (OSHealth), MA (FT_FBS), Swatch
– D0 Farm (fnd0 1 - 100, d0bbin): MA (OSHealth), MA (D0_FBS), Swatch
– MAs (Ping): WWW, Mail Servers, License Servers, Enstore, CMS, SDSS, Division Servers, MISCOMP, Kerberos, D0, CDF, BTEV, License Servers, FNALU, KTEV, MINOS, ODS, HPPC, PPD

17 Summary of Occurred Events
Detected Problems:
– Node reset
– Node is down
– One CPU is missing after reboot
– File system not mounted
– System daemon is dead
– FBS Batch Manager is down
Raised Alarms:
– Memory usage is high
– Swap usage is high
– CPU load is high
– File system is full
– Baseboard temperature is high
– Specific messages found in syslog: NFS timeouts, drive timeouts …

18 Report Generator (MISCOMP Web Query Interface)
Monitoring Agent id | Monitored Object id | Event type | Event name | Event value | Description
fnpc242_health | OSHealth fnpc242 cpuLoad fnpc242 | sysUsage | cpuLoad | 5.88 | Average load on the node is less or equal to 8 and greater than 5
fnpc208._health | OSHealth fnpc208 Memory fnpc208 | sysUsage | memory | 86 | Memory usage is greater or equal to 80% and less 95%
fnpc204_health | Hardware fnpc204 baseTemp_1 Fnpc204 | Hardware | baseTemp_1 | 45.0 | Temperature is between 45C and 50C
fnpc108_health | OSHealth fnpc108 rstatd fnpc108 | Daemon | rstatd | 0 | rstatd is not running
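The descriptions in the table read like range rules (for example, a load greater than 5 and less than or equal to 8). The sketch below shows one way such ranges could be expressed and checked against the reported values. The `Band` class is a hypothetical helper, not the NGOP rule format (which, per the slides, is XML), and treating both ends of the temperature range as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Band:
    """A value range with explicitly open or closed ends."""
    low: float
    high: float
    low_inclusive: bool
    high_inclusive: bool
    description: str

    def contains(self, value: float) -> bool:
        above = value >= self.low if self.low_inclusive else value > self.low
        below = value <= self.high if self.high_inclusive else value < self.high
        return above and below


# Bounds are read off the Description column of the table. The inclusive
# ends of the temperature band are an assumption ("between 45C and 50C").
CPU_LOAD = Band(5, 8, False, True,
                "Average load on the node is less or equal to 8 and greater than 5")
MEMORY = Band(80, 95, True, False,
              "Memory usage is greater or equal to 80% and less 95%")
BASE_TEMP = Band(45, 50, True, True, "Temperature is between 45C and 50C")

# Values reported in the table rows.
checks = [
    CPU_LOAD.contains(5.88),
    MEMORY.contains(86),
    BASE_TEMP.contains(45.0),
]
```

Making the open and closed ends explicit matters at the boundaries: a load of exactly 5 falls outside `CPU_LOAD`, while a load of exactly 8 falls inside it.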

19 Next Milestone: From Prototype to Production System (for ~600 nodes)
– Goal 1: Gradually give the System Managers a Framework to develop and evolve tools to locally monitor their systems, and enable them to send filtered information to the CSD operators.
– Goal 2: Make sure all production systems can be supported by NGOP (excluding Windows 2000 in the first phase).

20 Wish List: Improve the Production System
– Provide Monitoring Client API
– Implement Looping Agents
– Implement historical rules and escalating alarms
– Implement “snapshot” (“give me the updated system status now”) feature
– Provide Monitoring Agent API bindings other than Python
– Fully Kerberize
– Provide standard Win2000 Monitoring Agents
– Design and provide dynamic handling of configuration changes for the Monitoring Client
– Allow for easier handling of multiple configurations
– Improve Admin (Configuration Client) GUI
– Provide Configuration GUI (hoping for a good free XML editor though)
– Provide Performance Data Framework
– Redesign/rewrite GUI (for scalability and friendliness)
– Provide GUI for non-Linux platforms if really needed
– Work on scalability up to 10,000 hosts

