1 NGOP Overview
J. Fromm, K. Genser, T. Levshina, M. Mengel

2 6/24/2001 Large Scale Cluster Computing Workshop at Fermilab
Next Generation Operation Group
Integrated Systems Development Department: Krzysztof Genser, Terry Jones, Tanya Levshina, Igor Mandrichenko, Don Petravick
Operating Systems Support Department: Troy Dawson, Jim Fromm, Lisa Giacchetti, Marc Mengel, Ken Schumacher, Steven Timm
Computing Services Department: Rick Thies, Rich Thompson

3 Current way of monitoring
Various monitoring tools, thus no comprehensive picture of the status of services:
–Xfalive
–Patrol
–NOC (network)
–Fermi software (Enstore, FBS ….)
When actions are initiated by a user's problem report:
–Sometimes misleading information
–Postmortem investigation

4 Fermi Computing Environment
Heterogeneous clusters:
–Various OSs
–Different services (batch, interactive, farms)
Various sets of applications (lsf, fbs, enstore, sam)
Mixed management:
–System administrators
–Software administrators
Computer Services Department (CSD) provides a single point of contact for reporting problems

5 NGOP Goals
–Active monitoring
–Problem diagnostics
–Early error detection and problem prevention
–Centralized data collection
–Status of service evaluation
–Execution of corrective and notification actions
–Performance analysis

6 NGOP Project Phases
8/1999 – 3/2000: Creation of the NGOP group. Gathering requirements for a Distributed Monitoring System. Evaluation of available commercial and freeware products.
3/2000 – 12/2000: Design and development of the NGOP prototype.
1/2001 – present: Prototype deployment on the farms. Farm monitoring by system administrators and operators. Prototype evaluation. Extending the “xfalive” service to all nodes monitored by CSD.

7 Prototype Statistics
Some implementation details:
–Written primarily in Python (some modules in C)
–Uses XML (and partially MathML) for all configuration files
Some deployment details:
–Monitoring a total of 512 nodes
   Checking for node being down and node reset
   On four farms (CDF, D0, two Fixed Target experiment farms; 270 nodes), also checking:
      –System daemons presence
      –Critical file systems presence and size
      –CPU load, memory and swap utilization
      –Number of users and users’ processes
      –Number of processors off-line
      –Baseboard temperature and fan speed
      –NFS timeouts
      –Disk errors
–Number of Monitored Objects: ~6,500
–About 5 instances of “ngop monitor” (GUI) are running simultaneously
–Events are stored in an Oracle database
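The slides say that all configuration files are XML but do not show the format. As a rough illustration only, the sketch below reads warning and alarm thresholds from a made-up XML fragment using Python's standard library. The element names, attribute names, and threshold values are assumptions chosen for this example and are not NGOP's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration fragment; the real NGOP schema is not shown in the slides.
SAMPLE_CONFIG = """
<MonitoringAgent name="example_health">
  <MonitoredObject name="cpuLoad" warn="5" alarm="8"/>
  <MonitoredObject name="memory" warn="80" alarm="95"/>
</MonitoringAgent>
"""

def load_thresholds(xml_text):
    """Return {object name: (warn, alarm)} parsed from the illustrative XML."""
    root = ET.fromstring(xml_text)
    thresholds = {}
    for obj in root.findall("MonitoredObject"):
        warn = float(obj.get("warn"))
        alarm = float(obj.get("alarm"))
        thresholds[obj.get("name")] = (warn, alarm)
    return thresholds

thresholds = load_thresholds(SAMPLE_CONFIG)
print(thresholds)
```

Keeping thresholds in a configuration file rather than in agent code is one way a single agent implementation could serve farms with different limits.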

8 Current Configuration
[Diagram; only the labels survive in this transcript]
Central components: NGOP Central Server, Config File Management Server, Archive Service, NGOP Action Client, MAs (Ping), FNCDUH, three NGOP Monitor user nodes
Farms:
–Old FixTarget Farm: MA (OSHealth), MA (OFT_FBS), Swatch; fnpc 1 - 37, fnsfh
–CDF Farm: MA (OSHealth), MA (CDF_FBS), Swatch; fncdf 1 - 90, cdffarm1
–FixTarget Farm: MA (OSHealth), MA (FT_FBS), Swatch; fnpc 201 - 250, fnsfo
–D0 Farm: MA (OSHealth), MA (D0_FBS), Swatch; fnd0 1 - 100, d0bbin
Other labels: WWW, Mail Servers, License Servers, Enstore, CMS, SDSS
Division Servers: MISCOMP, Kerberos, D0, CDF, BTEV, License Servers, FNALU, KTEV, MINOS, ODS, HPSS, PPD

9 Summary Of Occurred Events
Detected Problems:
–Node reset
–Node is down
–One CPU is missing after reboot
–File system not mounted
–System daemon is dead
–FBS Batch Manager is down
Raised Alarms:
–Memory usage is high
–Swap usage is high
–CPU load is high
–File system is full
–Baseboard temperature is high
–Specific messages found in syslog: nfs timeouts, drive timeouts …
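The last alarm type, specific messages found in syslog, amounts to pattern matching on log lines. The sketch below shows the general idea with Python's `re` module. The regular expressions and the sample log lines are invented for illustration; the slides do not give the actual patterns or say how the prototype implements the check.

```python
import re

# Hypothetical patterns; the real message formats depend on the OS and drivers.
PATTERNS = {
    "nfs timeout": re.compile(r"nfs.*timed out", re.IGNORECASE),
    "drive timeout": re.compile(r"(hd|sd)[a-z].*timeout", re.IGNORECASE),
}

def scan(lines):
    """Return (pattern name, line) pairs for every line matching a pattern."""
    hits = []
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((name, line))
    return hits

# Made-up syslog lines for demonstration only.
sample = [
    "Jun 24 10:01:02 node1 kernel: nfs: server srv1 not responding, timed out",
    "Jun 24 10:01:05 node1 sshd[123]: session opened for user alice",
    "Jun 24 10:02:00 node2 kernel: hda: irq timeout: status=0xd0",
]

hits = scan(sample)
for name, line in hits:
    print(name, "->", line)
```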

10 GUI Monitor Snapshots

11 Report Generator (MISCOMP Web Query Interface)
Monitoring Agent id | Monitored Object id | Event type | Event value | Description
fnpc242_health | OSHealth.fnpc242.cpuLoad.fnpc242 | sysUsage | 5.88 | Average load on the node is less than or equal to 8 and greater than 5
fnpc208._health | OSHealth.fnpc208.memory.fnpc208 | sysUsage | 86 | Memory usage is greater than or equal to 80% and less than 95%
fnpc204_health | Hardware.fnpc204.baseTemp.fnpc204 | Hardware | 45.0 | Temperature is between 45C and 50C
fnpc108_health | OSHealth.fnpc108.rstatd.fnpc108 | Daemon | 0 | rstatd is not running
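The descriptions in this report read like range rules applied to a measured value. A minimal sketch of such rules follows, using the two ranges whose boundaries the report states explicitly (load greater than 5 and at most 8; memory at least 80% and below 95%). The class and function names are hypothetical and this is not NGOP's rule engine.

```python
from dataclasses import dataclass

@dataclass
class RangeRule:
    """A value range with explicit inclusive/exclusive boundaries (illustrative only)."""
    description: str
    low: float
    high: float
    include_low: bool
    include_high: bool

    def matches(self, value):
        above = value >= self.low if self.include_low else value > self.low
        below = value <= self.high if self.include_high else value < self.high
        return above and below

# Boundaries taken from the report descriptions; the dictionary keys are assumptions.
RULES = {
    "cpuLoad": RangeRule("Average load is greater than 5 and at most 8", 5, 8, False, True),
    "memory": RangeRule("Memory usage is at least 80% and less than 95%", 80, 95, True, False),
}

def describe(object_name, value):
    """Return the rule description if the value falls in the range, else None."""
    rule = RULES[object_name]
    return rule.description if rule.matches(value) else None

print(describe("cpuLoad", 5.88))
print(describe("memory", 86))
```

Making boundary inclusion explicit avoids a value such as exactly 5 or exactly 95 falling into two ranges or into none.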

12 What’s next?
NGOP Production (end of summer 2001)
Wish List:
–Provide a Monitoring Client API
–Implement Correlation (aka Looping) Agents
–Implement historical rules and escalating alarms
–Implement a “snapshot” (“give me the updated system status now”) feature
–Provide a Monitoring Agent API for languages other than Python
–Fully Kerberize
–Provide standard Win2000 Monitoring Agents
–Design and provide dynamic handling of configuration changes for the Monitoring Client
–Allow for easier handling of multiple configurations
–Improve the Admin (Configuration Client) GUI
–Provide a Configuration GUI (hoping for a good free XML editor though)
–Provide a Performance Data Framework
–Redesign/rewrite the GUI (for scalability and friendliness)
–Provide a GUI for non-Linux platforms if really needed
–Work on scalability up to 10000 hosts
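The wish list mentions escalating alarms without defining them. One common reading is that an alarm's severity rises the longer a bad condition persists. The sketch below illustrates that reading only; the severity names, the class, and the escalation interval are assumptions and do not describe a planned NGOP design.

```python
class EscalatingAlarm:
    """Raise severity as a bad condition persists across consecutive checks."""

    # Hypothetical severity ladder, lowest to highest.
    LEVELS = ["ok", "warning", "error", "critical"]

    def __init__(self, escalate_after=3):
        self.escalate_after = escalate_after  # checks per escalation step
        self.bad_streak = 0                   # consecutive bad checks so far

    def update(self, condition_bad):
        """Record one check result and return the current severity."""
        if not condition_bad:
            self.bad_streak = 0
            return "ok"
        self.bad_streak += 1
        level = 1 + (self.bad_streak - 1) // self.escalate_after
        return self.LEVELS[min(level, len(self.LEVELS) - 1)]

alarm = EscalatingAlarm(escalate_after=2)
history = [alarm.update(bad) for bad in [True, True, True, True, True, False]]
print(history)
```

Resetting the streak on the first good check keeps the example short; a real system might instead require several good checks before clearing.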

13 More Info
URL: http://www-isd.fnal.gov/ngop/
E-mail: ngop@fnal.gov

14 Why not commercial products?
–Limited off-the-shelf functionality
–Considerable difficulties with integrating new packages into the framework
–High initial and support cost
–Substantial effort and human resources required during installation and customization
–Additional third-party products required in order to gain better scalability and more off-the-shelf functionality
Some quotes from Data Communication Journal 9/21/99 (M. Jander, “Framework Fraud”, pp. 33-42), Data Comm’s survey of net management frameworks that includes evaluation of Tivoli, HP OpenView, Unicenter and other products by 1,100 network architects:
–“Deficient technology and broken promises”
–“Two years to get a system up and running”
–“We need a Ph.D. in physics [to] get it [the product] working”
–“for each dollar of framework purchases, a customer pays $3 in after-sales services”

15 NGOP Architecture
[Diagram; only the labels survive in this transcript]
Services: Central Server, Configuration File Management Service (Persistent Config. Data), Archive Service (Archive), Performance Storage Service (Performance Data), Data Analyzer, Router
Clusters: Cluster A, Cluster B, Cluster B1, Cluster B2, each shown with Monitoring Agents (MA) and sensors (s)
Clients: Administrator, Monitor, Report Generator, Action Client
Legend:
–Monitored Objects: Host, Element, Cluster, System
–NGOP Components: Sensor, Agent, Server, Monitoring Agent, Monitoring Data Storage, Clients
–Connections: TCP, UDP, connection between Monitored Element and MA, not implemented in prototype yet

