LAS for System Administrators LAS overview Miroslav Siket, Dennis Waldron CERN-IT/FIO-FD.

LAS building blocks
Lemon Tutorial, 09/10/2006

Oracle DB server
  – Runs the LAS logic (PL/SQL) and stores LAS data
OraMon (application server)
  – Inserts exceptions into the Oracle DB
Web server
  – Provides the LAS GUI with access to LAS data in the Oracle DB (business logic)
Remote monitoring (ping, http)
SURE gateways for UIMON/AFS

LAS hardware

Two independent instances
  – Primary
      Oracle DB and OraMon: lemondb1
      Web server: lemonweb02
  – Secondary
      Oracle DB and OraMon: lemondb2
      Web server: lemonweb01
Remote monitoring machines
  – lxfsrk4104 (aliased as lemonmr and lemonr01)
  – lxservb01 (aliased as lemonr02)

Oracle DB server check

Log in to the machine (lemondb1 or lemondb2):
  > source ~oracle/.oraprofile.LEMON*
  > tnsping LEMON_A     (use LEMON_C on lemondb2)
Check the output of the previous command.
Example of a healthy result: OK (0 ms)
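
The manual check above could be scripted roughly as follows. This is a minimal sketch, not part of the LAS tooling: the function names `tnsping_ok` and `check_tns_alias` are hypothetical, the script assumes the Oracle environment has already been sourced so that `tnsping` is on the PATH, and accepting "msec" as well as "ms" in the result line is an assumption about the output format.

```python
import re
import subprocess

# Matches the success marker shown on the slide, e.g. "OK (0 ms)".
# Accepting "msec" as well is an assumption, not something the slide states.
_OK_LINE = re.compile(r"\bOK \(\d+\s*ms(ec)?\)")


def tnsping_ok(output: str) -> bool:
    """Return True if the tnsping output contains an 'OK (<n> ms)' marker."""
    return _OK_LINE.search(output) is not None


def check_tns_alias(alias: str) -> bool:
    """Run tnsping for one alias (e.g. LEMON_A) and report whether it answered.

    Hypothetical helper: requires the Oracle profile to be sourced beforehand.
    """
    result = subprocess.run(
        ["tnsping", alias], capture_output=True, text=True, check=False
    )
    return tnsping_ok(result.stdout)
```

Only `tnsping_ok` is pure; `check_tns_alias` needs a machine with the Oracle client installed, such as lemondb1 or lemondb2.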

OraMon check

Already checked by the LAS GUI
lemon-host-check
ORAMON_WRONG procedure
Log file: /var/log/OraMon.log

Apache web server check

Already checked by the LAS GUI
lemon-host-check
HTTPD_WRONG procedure
Log file: /var/log/httpd/error_log

Remote monitoring check

Runs as a sensor (remote) on the remote monitoring machines
lemon-host-check
Agent log file: /var/log/edg-fmon-agent.log

SURE gateways for UIMON/SURE

Runs as a sensor (suregateway) on the remote monitoring machines
Agent process and log file
ISSUE: AFS machines
  – Uses the lemon-sure-multiplexer process as a gateway
  – lxfsrk4104 only
  – Check that the daemon exists; log file: /var/log/lemon-sure-multiplexer.log

lemon-cli

Command-line tool for extracting raw (uninterpreted) data from Lemon.
Information can be extracted from the local cache (/var/spool/edg-fmon-agent) or from a remote server over SOAP (aliased as lemonmr; physical machine: lxfsrk4104).
Limitations
  – The local cache is limited to seven days' worth of history (purged every day by the agent)
  – Remote server queries are limited to 20,000 returned results
      This limitation will be removed when the new Lemon API is deployed (end Q4, begin Q1 2007)
The local cache contains much more information than is recorded at the server
  – Why? Smoothing!
      Smoothing is a mechanism which allows the agent to be selective about the information it sends to the central servers
If the information you want is less than 7 days old, use the local cache!
Full documentation at:
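
To illustrate why the local cache can hold more samples than the server, here is one possible form of smoothing. The slides do not describe the actual Lemon algorithm, so everything below is an assumption made for illustration: the function name `smooth`, the `tolerance` and `max_gap` parameters, and the rule itself (forward a sample only when the value has moved noticeably or too much time has passed).

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, metric value)


def smooth(samples: List[Sample], tolerance: float, max_gap: int) -> List[Sample]:
    """Return the samples a hypothetical agent would forward to the server.

    A sample is forwarded if it is the first one, if its value differs from
    the last forwarded value by more than `tolerance`, or if at least
    `max_gap` seconds have passed since the last forwarded sample.
    This rule is an illustrative assumption, not Lemon's documented behaviour.
    """
    sent: List[Sample] = []
    for ts, value in samples:
        if not sent:
            sent.append((ts, value))
            continue
        last_ts, last_value = sent[-1]
        if abs(value - last_value) > tolerance or ts - last_ts >= max_gap:
            sent.append((ts, value))
    return sent


# Every sample stays in the local cache; only the smoothed subset is sent on.
local_cache = [(0, 1.0), (60, 1.0), (120, 1.05), (180, 2.0), (240, 2.0), (600, 2.0)]
forwarded = smooth(local_cache, tolerance=0.1, max_gap=300)
```

In this toy run the cache keeps six samples while only three are forwarded, which is the effect the slide warns about when querying the server for recent data.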

lemon-cli (II) - Examples

Resolving a metric id to a name
  – lemon-cli –m syslog
  – Displays all the metrics whose name contains 'syslog'
Referencing time periods (--end, --start), e.g.
  – 1h = 1 hour
  – 2d3h36m44s = 2 days, 3 hours, 36 minutes and 44 seconds
  – Also supports log file timestamps, e.g. Thu 02 Nov :45:00 (no guarantees!)
If querying remotely, –n accepts the same node name expansion criteria as wassh!
  – e.g. lemon-cli –m –n lxb[ ] --server
All alarms can be seen on the machine using
  – lemon-cli –class "alarm.exception" –1 005, and are alarms
  – lemon-host-check interprets all the codes for you!!
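
The compact period notation shown above (1h, 2d3h36m44s) can be converted to seconds as sketched below. This is an independent illustration, not lemon-cli's own parser: the name `parse_duration` is hypothetical, and restricting the units to d, h, m and s is an assumption based on the two examples on the slide.

```python
import re

# Seconds per unit; only the units that appear on the slide are supported.
_UNITS = {"d": 86400, "h": 3600, "m": 60, "s": 1}
_TOKEN = re.compile(r"(\d+)([dhms])")


def parse_duration(text: str) -> int:
    """Convert a string such as '2d3h36m44s' into a number of seconds.

    Raises ValueError if any part of the string is not a <number><unit> pair.
    """
    total = 0
    pos = 0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            raise ValueError(f"unexpected text in duration: {text!r}")
        total += int(match.group(1)) * _UNITS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"invalid duration: {text!r}")
    return total
```

For example, `parse_duration("2d3h36m44s")` adds up 2 days, 3 hours, 36 minutes and 44 seconds, and a malformed string such as "1x" raises `ValueError` instead of being silently truncated.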

lemon-host-check (I)

Aim: to provide a command-line tool for viewing the status of all active alarms on a given machine by querying the edg-fmon-agent.
Uses the information recorded in the agent's local cache (requires /var/ to be writable!)
Makes sure that the information reported to you is up to date (fresh!!)
Checks that all sensors are running, and that one and only one agent process is running.
Must be logged in as root!
Full documentation at:

lemon-host-check (II) - Examples

Check for active alarms on the machine
  – lemon-host-check
Disable the alarms "syslogd and klogd"
  – lemon-host-check –disable "30023,30032"
Show alarms even if they are disabled
  – lemon-host-check –force
Disable all alarms for the next 1 hour, 30 minutes and 23 seconds
  – lemon-host-check –disable-all –duration 1h30m23s "demo intervention"
View a list of all disabled alarms
  – lemon-host-check –list
Enable all alarms
  – lemon-host-check –enable-all
  – Some alarms are "hard" disabled! Making them visible again requires a CDB reconfiguration and an ncm-ncd –co fmonagent run.

lemon-host-check (III)

Pre-alarms
  – Recent concept added to Lemon.
  – Aimed at dealing with transient alarms.
  – Real use case: high_load (30008) has pre-alarm capabilities!
      When high load is detected on the machine, a pre-alarm is raised (not visible in LAS).
      If the alarm persists for more than 10 minutes, it becomes a proper alarm.
      This allows high-load spikes on machines/clusters such as lxplus to be ignored.
  – Not visible by default in lemon-host-check
Caution:
  – If you have a high_load alarm and restart the agent, the alarm will disappear!! If the root problem hasn't been corrected, the alarm will resurface 10 minutes later (as a new ITCM ticket).
  – Don't restart the agent unless you absolutely need to (reconfiguration, errors in edg-fmon-agent.log, …)
  – If you have to restart, use 'lemon-host-check –show-all' afterwards
      Note: make sure to check the status of each alarm! You need to ignore the disabled ones, if any.
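
The pre-alarm behaviour, including the restart caveat, can be modelled with a small state tracker. This is a sketch under stated assumptions rather than the agent's real code: the class name `PreAlarm` and its `update` method are hypothetical, and the only facts taken from the slide are the 10-minute promotion delay and the loss of state on restart.

```python
from dataclasses import dataclass
from typing import Optional

PROMOTION_DELAY = 10 * 60  # seconds a condition must persist (from the slide)


@dataclass
class PreAlarm:
    """In-memory tracker for one exception; a fresh instance models a restart."""

    first_seen: Optional[float] = None  # when the condition was first observed

    def update(self, condition_active: bool, now: float) -> str:
        """Return 'ok', 'pre-alarm' or 'alarm' for the sample taken at `now`."""
        if not condition_active:
            self.first_seen = None  # a spike that ends is simply forgotten
            return "ok"
        if self.first_seen is None:
            self.first_seen = now
        if now - self.first_seen > PROMOTION_DELAY:
            return "alarm"
        return "pre-alarm"


tracker = PreAlarm()
states = [tracker.update(True, t) for t in (0, 300, 601)]

# Restarting the agent discards first_seen, so the 10-minute clock starts again.
restarted = PreAlarm()
after_restart = [restarted.update(True, t) for t in (700, 1301)]
```

Because the timer lives only in memory in this model, a restart at t=700 turns an existing alarm back into a pre-alarm until another 10 minutes have passed, which matches the caution above about alarms disappearing and resurfacing.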

lemon-host-check (IV)

Common errors:
No monitoring agent process running / Too many monitoring agent processes running
  – service edg-fmon-agent restart
  – If that fails
Possible false exception
  – lemon-host-check has given up (after 60 seconds) trying to get information from the agent on the machine. If it could not determine whether an alarm was present for a particular exception, it assumes the worst case: that an alarm does exist but may not be real (possibly false).
  – Why?
      The agent may be too busy to answer lemon-host-check
      Some sensors may have failed to retrieve the necessary information
  – Solution
      Re-run lemon-host-check
      If it still fails, check /var/log/edg-fmon-agent.log for any errors about sensors or missing metrics. If there are any:
        run spma_wrapper.sh on the machine to get the latest sensor code, if any
        run ncm-ncd –co fmonagent to reconfigure the agent
        try again
      If it is still failing, contact service manager and CC

FAQ

Are monitored machines running only Linux (e.g. SLC3/4, RHEL 3/4)?
  – Linux (lemon agent, ping, http check)
  – Solaris (lemon agent, UIMON)
  – Windows (ping, http)
Is there any limitation that we should be aware of on the other OSes / platforms?
  – AFS machines have their own monitoring tools; no information available
  – UIMON-monitored machines run the UIMON process and a multiplexer to send alarms to the suregateway sensor on the remote monitoring machines
We knew the nodes' polling on SURE; what is implemented in Lemon?
  – Remote sensor on the remote monitoring machines
Is there any load balancing (DNS) and/or redundancy? Front-/backend part of the failover?
  – No, just two independent instances running in parallel.
  – In future (with RAC) there will be failover for OraMon and only one Oracle DB

FAQ (II)

What should we do in case of a piquet call about a failure on these server(s)?
  – Operators' LAS procedures do not have any piquet actions defined. All other failures are covered by the standard OS/hardware procedures that they already have. There is nothing LAS-specific for them.
How to interpret the correlation rules? Could you explain the syntax found in the Remedy ticket?
  – Full documentation with examples at
  – Example: (lxs5013:9104:1[/tmp] eq /tmp) && (lxs5013:9104:5[90] > 80)
LAS reduction rules and multi-host tickets: a direct mapping?
  – Several use cases, e.g. 12 x spma_wrong on 12 nodes of cluster YYY
  – One LAS item if the number of machines reaches 51% of the active nodes in the cluster
  – Several LAS items if they appear in bursts and the alarm has already been reduced
  – Individual machine LAS items if below 51%
  – If new machines appear, there will be a new reduced LAS item for each set of them
A means to detect when a node started to be "alarmed" and when this stopped.
  – /var/log/ncm/component-setodesiredstate.log* log file on the machine in question
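
The 51% reduction rule can be sketched as a small decision function. This is an illustration only, not LAS code: the name `reduce_alarms` and the returned tuple layout are hypothetical, and the sketch covers a single snapshot, so the burst behaviour and the "new set of machines" case are deliberately left out.

```python
from typing import List, Sequence, Tuple

LasItem = Tuple[str, List[str]]  # ("cluster" or "node", affected node names)


def reduce_alarms(
    alarmed_nodes: Sequence[str], active_node_count: int, threshold_pct: int = 51
) -> List[LasItem]:
    """Group one alarm type on one cluster into LAS items.

    If the alarmed nodes reach `threshold_pct` percent of the active nodes,
    a single reduced item is produced; otherwise one item per node.
    Integer arithmetic is used so the 51% boundary is exact.
    """
    if active_node_count <= 0:
        raise ValueError("active_node_count must be positive")
    nodes = sorted(set(alarmed_nodes))
    if not nodes:
        return []
    if len(nodes) * 100 >= threshold_pct * active_node_count:
        return [("cluster", nodes)]
    return [("node", [name]) for name in nodes]
```

With 12 active nodes, 7 alarmed machines (about 58%) collapse into one reduced item, while 6 alarmed machines (exactly 50%) stay as six individual items.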

FAQ (III)

What to expect from them if no alarm can be displayed anymore at 3:00 AM and they get called by the Operator?
  – No piquet service is defined for LAS. If LAS does not work, operators have procedures for finding out the state of LAS; check

QUESTIONS?