NGOP Overview
J. Fromm, K. Genser, T. Levshina, M. Mengel



6/24/2001, Large Scale Cluster Computing Workshop at Fermilab

Next Generation Operation Group
Integrated Systems Development Department: Krzysztof Genser, Terry Jones, Tanya Levshina, Igor Mandrichenko, Don Petravick
Operating Systems Support Department: Troy Dawson, Jim Fromm, Lisa Giacchetti, Marc Mengel, Ken Schumacher, Steven Timm
Computing Services Department: Rick Thies, Rich Thompson

Current way of monitoring
Various monitoring tools, thus no comprehensive picture of status of services
  – Xfalive
  – Patrol
  – NOC (network)
  – Fermi software (Enstore, FBS ….)
Actions are initiated only when a user reports a problem
  – Sometimes misleading information
  – Postmortem investigation

Fermi Computing Environment
Heterogeneous clusters
  – Various OSs
  – Different services (batch, interactive, farms)
Various sets of applications (lsf, fbs, enstore, sam)
Mixed management
  – system administrators
  – software administrators
Computer Services Department (CSD) provides a single point of contact for reporting problems

NGOP Goals
Active monitoring
Problem diagnostics
Early error detection and problem prevention
Centralized data collection
Status of service evaluation
Execution of corrective and notification actions
Performance analysis

NGOP Project Phases
8/1999 – 3/2000: Creation of NGOP group. Gathering requirements for Distributed Monitoring System. Evaluation of available commercial and freeware products.
3/2000 – 12/2000: Design and development of NGOP prototype.
1/ present: Prototype deployment on the farms. Farms monitoring by system administrators and operators. Prototype evaluation. Extending “xfalive” service to all nodes monitored by CSD.

Prototype Statistics
Some implementation details:
  – Written primarily in Python (some modules in C)
  – Use XML (and partially MATHML) for all configuration files
Some deployment details:
  – Monitoring a total of 512 nodes
      Checking for node being down and node reset
      On four farms (CDF, D0, two Fix Target experiment farms) - (270 nodes)
  – System daemons presence
  – Critical file systems presence and size
  – CPU load, memory and swap utilization
  – Number of users and users’ processes
  – Number of processors off-line
  – Baseboard temperature and fan speed
  – NFS timeouts
  – Disk errors
  – Number of Monitored Objects ~ 6,500
  – About 5 instances of “ngop monitor” (GUI) are running simultaneously.
  – Events are stored in Oracle Database
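
The slide says that all configuration files are XML, but it does not show the NGOP schema. As a rough illustration of what reading such a file from Python could look like, here is a minimal sketch. Every element name, attribute name, host name and threshold in it is a hypothetical stand-in, not the actual NGOP format.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration. Element and attribute names are illustrative
# assumptions only; the real NGOP XML schema is not shown in the slides.
SAMPLE_CONFIG = """
<cluster name="ExampleFarm">
  <host name="node001">
    <element name="cpuLoad" type="sysUsage" warn="5" alarm="8"/>
    <element name="rstatd" type="Daemon"/>
  </host>
  <host name="node002">
    <element name="memory" type="sysUsage" warn="80" alarm="95"/>
  </host>
</cluster>
"""


def load_monitored_elements(xml_text):
    """Return a list of (host, element, type, thresholds) tuples."""
    root = ET.fromstring(xml_text.strip())
    result = []
    for host in root.findall("host"):
        for elem in host.findall("element"):
            # Collect only the threshold attributes that are present.
            thresholds = {
                key: float(elem.get(key))
                for key in ("warn", "alarm")
                if elem.get(key) is not None
            }
            result.append(
                (host.get("name"), elem.get("name"), elem.get("type"), thresholds)
            )
    return result


elements = load_monitored_elements(SAMPLE_CONFIG)
for host, name, kind, thresholds in elements:
    print(host, name, kind, thresholds)
```

Keeping thresholds in the configuration rather than in agent code is one plausible reason to choose XML here, though the slides do not state the motivation.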

Current Configuration
[Diagram: current NGOP configuration. Recoverable labels:]
Central components: NGOP Central Server, Config File Management Server, Archive Service, NGOP Action Client, MAs (Ping), FNCDUH
Clients: NGOP Monitor on three User Nodes
Old FixTarget Farm (fnpc, fnsfh): MA (OSHealth), MA (OFT_FBS), Swatch
CDF Farm (fncdf, cdffarm1): MA (OSHealth), MA (CDF_FBS), Swatch
FixTarget Farm (Fnpc, fnsfo): MA (OSHealth), MA (FT_FBS), Swatch
D0 Farm (fnd, d0bbin): MA (OSHealth), MA (D0_FBS), Swatch
Other monitored groups: WWW, Mail Servers, License Servers, Enstore, CMS, SDSS, Division Servers, MISCOMP, Kerberos, D0, CDF, BTEV, FNALU, KTEV, MINOS, ODS, HPSS, PPD

Summary Of Occurred Events
Detected Problems:
  – Node reset
  – Node is down
  – One CPU is missing after reboot
  – File system not mounted
  – System daemon is dead
  – FBS Batch Manager is down
Raised Alarms:
  – Memory usage is high
  – Swap usage is high
  – CPU Load is high
  – File System is full
  – Baseboard temperature is high
  – Specific messages found in syslog: nfs timeouts, drive timeouts …

GUI Monitor Snapshots

Report Generator (MISCOMP Web Query Interface)

Monitoring Agent id | Monitored Object id               | Event type | Event value | Description
--------------------+-----------------------------------+------------+-------------+------------------------------------------------------------
fnpc242_health      | OSHealth.fnpc242.cpuLoad.fnpc242  | sysUsage   | 5.88        | Average load on the node is less than or equal to 8 and greater than 5
fnpc208_health      | OSHealth.fnpc208.memory.fnpc208   | sysUsage   | 86          | Memory usage is greater than or equal to 80% and less than 95%
fnpc204_health      | Hardware.fnpc204.baseTemp.fnpc204 | Hardware   | 45.0        | Temperature is between 45C and 50C
fnpc108_health      | OSHealth.fnpc108.rstatd.fnpc108   | Daemon     | 0           | rstatd is not running
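
The report descriptions imply range rules: load above 5 and up to 8, memory from 80% up to but not including 95%, temperature between 45C and 50C. A minimal sketch of how such bands could be evaluated follows. The `Band` class, the `RULES` table and `describe` are hypothetical names, not NGOP code, and treating the upper temperature bound as inclusive is an assumption, since the report does not say.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Band:
    """One value range with explicit open/closed bounds (hypothetical type)."""
    low: float
    high: float
    low_inclusive: bool
    high_inclusive: bool
    description: str

    def contains(self, value: float) -> bool:
        above = value >= self.low if self.low_inclusive else value > self.low
        below = value <= self.high if self.high_inclusive else value < self.high
        return above and below


# Bounds taken from the report descriptions; the rule names are illustrative.
RULES = {
    "cpuLoad": Band(5, 8, False, True, "5 < load <= 8"),
    "memory": Band(80, 95, True, False, "80% <= memory usage < 95%"),
    # Upper bound inclusivity is an assumption; the report only says "between".
    "baseTemp": Band(45, 50, True, True, "45C <= temperature <= 50C"),
}


def describe(element: str, value: float) -> Optional[str]:
    """Return the band description if the value falls inside the band."""
    band = RULES[element]
    return band.description if band.contains(value) else None


print(describe("cpuLoad", 5.88))
print(describe("memory", 86))
print(describe("baseTemp", 45.0))
```

Making the open and closed ends explicit matters at the boundaries: with these rules a load of exactly 5 does not match, while a memory usage of exactly 80 does.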

What’s next?
NGOP Production (end of summer 2001)
Wish List:
  – Provide Monitoring Client API
  – Implement Correlation (aka Looping) Agents
  – Implement historical rules and escalating alarms
  – Implement “snapshot” (“give me the updated system status now”) feature
  – Provide other than Python Monitoring Agent API
  – Fully Kerberize
  – Provide Standard Win2000 Monitoring Agents
  – Design and provide dynamic handling of configuration changes for the Monitoring Client
  – Allow for easier handling of multiple configurations
  – Improve Admin (Configuration Client) Client GUI
  – Provide Configuration GUI (hoping for a good free XML Editor though)
  – Provide Performance Data Framework
  – Redesign/Rewrite GUI (for scalability and friendliness)
  – Provide GUI for non-Linux platforms if really needed
  – Work on scalability up to hosts

More Info
url:

Why not commercial products?
  – Limited off-the-shelf functionality
  – Considerable difficulties with integrating new packages into the framework
  – High initial and support cost
  – Substantial effort and human resources required during installation and customization
  – Additional third-party products required to gain better scalability and more off-the-shelf functionality
Some quotes from Data Communication Journal 9/21/99 (by M.Jander “Framework Fraud” pp ), Data Comm’s survey of net management frameworks that includes evaluation of Tivoli, HP Openview, Unicenter and other products by 1,100 network architects:
  “Deficient technology and broken promises”
  “Two years to get a system up and running”
  “We need a Ph.D. in physics get it [the product] working”
  “for each dollar of framework purchases, a customer pays $3 in after-sales services”

NGOP Architecture
[Architecture diagram. Recoverable labels:]
Services: Central Server, Configuration File Management Service, Archive Service, Performance Storage Service, Router
Data stores: Persistent Config. Data, Archive, Performance Data
Clients: Administrator, Monitor, Report Generator, Action Client, Data Analyzer
Monitored side: Cluster A, Cluster B, Cluster B1, Cluster B2, sensors (s), Monitoring Agents (MA)
Legend, Monitored Objects: Host, Element, Cluster, System
Legend, NGOP Components: Sensor, Agent, Server, Monitoring Agent, Monitoring Data Storage, Clients
Legend, Connections: TCP connection, UDP (between Monitored Element and MA), not implemented in prototype yet
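
The diagram legend lists four kinds of monitored objects (Host, Element, Cluster, System), and the NGOP goals include evaluating the status of a service. The slides do not say how a parent's status is derived from its children, so the sketch below simply assumes a "worst child wins" rule. The class, the severity names and the example hierarchy are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical severity scale, ordered from best to worst.
SEVERITY_ORDER = ["ok", "warning", "alarm", "down"]


@dataclass
class MonitoredObject:
    """A node in an assumed System > Cluster > Host > Element hierarchy."""
    name: str
    kind: str
    own_status: str = "ok"
    children: List["MonitoredObject"] = field(default_factory=list)

    def status(self) -> str:
        # Assumed rule: report the worst status among self and descendants.
        statuses = [self.own_status] + [child.status() for child in self.children]
        return max(statuses, key=SEVERITY_ORDER.index)


# Illustrative hierarchy; names are made up for the example.
cpu = MonitoredObject("cpuLoad", "Element", "warning")
daemon = MonitoredObject("rstatd", "Element", "alarm")
host_a = MonitoredObject("node001", "Host", children=[cpu])
host_b = MonitoredObject("node002", "Host", children=[daemon])
cluster = MonitoredObject("ExampleFarm", "Cluster", children=[host_a, host_b])
system = MonitoredObject("ExampleSystem", "System", children=[cluster])

print(host_a.status())
print(system.status())
```

A worst-child rule is simple but coarse: one failing element marks the whole system. Other aggregation rules, such as thresholds on the fraction of unhealthy hosts, would fit the same structure.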