LAS for System Administrators LAS overview Miroslav Siket, Dennis Waldron http://cern.ch/lemon CERN-IT/FIO-FD
09/10/2006 Lemon Tutorial
LAS building blocks
Oracle DB server
– Runs the LAS logic (PL/SQL) and stores the LAS data
OraMon (application server)
– Inserts exceptions into the Oracle DB
Web server
– Provides the LAS GUI with access to the LAS data in the Oracle DB (business logic)
Remote monitoring (ping, http)
SURE gateways for UIMON/AFS
LAS hardware
Two independent instances
– Primary: Oracle DB and OraMon on lemondb1, web server on lemonweb02
– Secondary: Oracle DB and OraMon on lemondb2, web server on lemonweb01
Remote monitoring machines
– lxfsrk4104 (aliased as lemonmr and lemonr01)
– lxservb01 (aliased as lemonr02)
Oracle DB server check
Log in to the machine (lemondb1 or lemondb2):
> source ~oracle/.oraprofile.LEMON*
> tnsping LEMON_A
(use LEMON_C instead of LEMON_A on lemondb2)
Check the output of the tnsping command. Example: OK (0 ms)
OraMon check
Already checked by the LAS GUI
lemon-host-check
ORAMON_WRONG procedure
Log file: /var/log/OraMon.log
Apache web server check
Already checked by the LAS GUI
lemon-host-check
HTTPD_WRONG procedure
Log file: /var/log/httpd/error_log
Remote monitoring check
Runs as a sensor (remote) on the remote monitoring machines
lemon-host-check
Agent log file: /var/log/edg-fmon-agent.log
SURE gateways for UIMON/SURE
Runs as a sensor (suregateway) on the remote monitoring machines
Agent process and log file
ISSUE: AFS machines
– Use the lemon-sure-multiplexer process as a gateway
– lxfsrk4104 only
– Check that the daemon exists; log file: /var/log/lemon-sure-multiplexer.log
lemon-cli
Command line tool for extracting raw (uninterpreted) data from Lemon.
Information can be extracted from the local cache (/var/spool/edg-fmon-agent) or from a remote server over SOAP (aliased as lemonmr, physical machine: lxfsrk4104).
Limitations
– The local cache is limited to seven days' worth of history (purged every day by the agent)
– Remote server queries are limited to 20,000 returned results; this limitation will be removed when the new Lemon API is deployed (end of Q4, beginning of Q1 2007)
The local cache contains much more information than is recorded at the server
– Why? Smoothing! Smoothing is a mechanism that allows the agent to be selective about the information it sends to the central servers
If the information you want is less than 7 days old, use the local cache!
Full documentation at: http://cern.ch/lemon/doc/components/lemon-cli.shtml
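The slide does not say how smoothing decides what to send, so the following is only a minimal sketch of one plausible rule: forward a sample when its value has moved by more than a threshold, or when too much time has passed since the last one sent. The class name and the parameters `max_diff` and `max_time` are assumptions for illustration and are not taken from the edg-fmon-agent code. The sketch shows why the local cache ends up holding more samples than the server.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, metric value)


@dataclass
class Smoother:
    """Hypothetical smoothing filter; NOT the real agent logic."""
    max_diff: float  # assumed: smallest change worth reporting
    max_time: float  # assumed: longest allowed silence, in seconds
    _last_sent: Optional[Sample] = None

    def feed(self, sample: Sample) -> bool:
        """Return True if this sample would be sent to the server."""
        if self._last_sent is None:
            send = True  # always send the first sample
        else:
            last_ts, last_value = self._last_sent
            ts, value = sample
            changed = abs(value - last_value) > self.max_diff
            stale = ts - last_ts >= self.max_time
            send = changed or stale
        if send:
            self._last_sent = sample
        return send


def split_samples(samples: List[Sample], smoother: Smoother):
    """Every sample stays in the local cache; only some reach the server."""
    local_cache = list(samples)
    sent_to_server = [s for s in samples if smoother.feed(s)]
    return local_cache, sent_to_server


samples = [(0, 1.0), (60, 1.0), (120, 1.05), (180, 2.0), (240, 2.0), (600, 2.0)]
cache, sent = split_samples(samples, Smoother(max_diff=0.1, max_time=300))
print(len(cache), "samples cached locally,", len(sent), "sent to the server")
```

Under these assumed settings, the unchanged readings at 60 s, 120 s and 240 s stay in the local cache only, which is the effect the slide describes.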
lemon-cli (II) - Examples
Resolving a metric id to a name
– lemon-cli -m syslog
– Displays all the metrics whose name contains 'syslog'
Referencing time periods (--end, --start), e.g.
– 1h = 1 hour
– 2d3h36m44s = 2 days, 3 hours, 36 minutes and 44 seconds
– Also supports log file timestamps, e.g. Thu 02 Nov 2006 10:45:00 (no guarantees!)
If querying remotely, -n accepts the same node name expansion criteria as wassh, e.g.
– lemon-cli -m 10005 -n lxb[0001-1000] --server
All alarms can be seen on the machine using
– lemon-cli --class "alarm.exception"
– 1 005, 1 135 and 1 000 are alarms
– lemon-host-check interprets all the codes for you!
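The two notations on this slide, durations such as 2d3h36m44s and node ranges such as lxb[0001-1000], can be illustrated with a short sketch. This is not the lemon-cli or wassh implementation: the function names are hypothetical, and the range expansion assumes a single zero-padded numeric range per pattern, which may be narrower than what wassh really accepts.

```python
import re
from typing import List

_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}
_DURATION_RE = re.compile(r"(?:\d+[dhms])+")
_RANGE_RE = re.compile(r"^(.*)\[(\d+)-(\d+)\](.*)$")


def parse_duration(text: str) -> int:
    """Convert a string like '2d3h36m44s' into a number of seconds."""
    if not _DURATION_RE.fullmatch(text):
        raise ValueError(f"not a duration: {text!r}")
    parts = re.findall(r"(\d+)([dhms])", text)
    return sum(int(count) * _UNIT_SECONDS[unit] for count, unit in parts)


def expand_nodes(pattern: str) -> List[str]:
    """Expand 'lxb[0001-1000]' into lxb0001 ... lxb1000 (assumed syntax)."""
    match = _RANGE_RE.match(pattern)
    if match is None:
        return [pattern]  # plain node name, nothing to expand
    prefix, first, last, suffix = match.groups()
    width = len(first)  # keep the zero padding of the lower bound
    return [
        f"{prefix}{number:0{width}d}{suffix}"
        for number in range(int(first), int(last) + 1)
    ]


print(parse_duration("2d3h36m44s"))
nodes = expand_nodes("lxb[0001-1000]")
print(len(nodes), nodes[0], nodes[-1])
```

The same duration notation is used by the 1h30m23s value in the lemon-host-check examples.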
lemon-host-check (I)
Aim: to provide a command line tool for viewing the status of all active alarms on a given machine by querying the edg-fmon-agent.
Uses the information recorded in the agent's local cache (requires /var/ to be writable!).
Makes sure that the information reported to you is up to date (fresh!).
Checks that all sensors are running, and that one and only one agent process is running.
Must be logged in as root!
Full documentation at: http://cern.ch/lemon/doc/components/lemon-host-check.shtml
lemon-host-check (II) - Examples
Check for active alarms on the machine
– lemon-host-check
Disable the alarms "syslogd and klogd"
– lemon-host-check --disable "30023,30032"
Show alarms even if they are disabled
– lemon-host-check --force
Disable all alarms for the next 1 hour, 30 minutes and 23 seconds
– lemon-host-check --disable-all --duration 1h30m23s "demo intervention"
View a list of all disabled alarms
– lemon-host-check --list
Enable all alarms
– lemon-host-check --enable-all
– Some alarms are "hard" disabled! Making them visible again requires a CDB reconfiguration and an ncm-ncd --co fmonagent run.
lemon-host-check (III)
Pre-alarms
– Recent concept added to Lemon.
– Aimed at dealing with transient alarms.
– Real use case: high_load (30008) has pre-alarm capabilities! When high load is detected on the machine, a pre-alarm is raised (not visible on LAS). If the alarm persists for more than 10 minutes, it becomes a proper alarm. This allows high load spikes on machines/clusters such as lxplus to be ignored.
– Not visible by default in lemon-host-check
Caution:
– If you have a high_load alarm and restart the agent, the alarm will disappear! If the root problem hasn't been corrected, the alarm will resurface 10 minutes later (a new ITCM ticket).
– Don't restart the agent unless you absolutely need to (reconfiguration, errors in edg-fmon-agent.log, ...)
– If you have to restart, use 'lemon-host-check --show-all' afterwards
Note: make sure to check the status of the alarm! You need to ignore the disabled ones, if any.
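The pre-alarm behaviour described on this slide can be pictured as a small state machine. The sketch below is illustrative only: the class and method names are hypothetical and the real agent logic is not shown in the slides. It models the 10-minute hold and shows why restarting the agent makes an alarm disappear and then resurface.

```python
from dataclasses import dataclass
from typing import Optional

PRE_ALARM_HOLD_SECONDS = 10 * 60  # the 10 minutes quoted on the slide


@dataclass
class PreAlarm:
    """Hypothetical model of a pre-alarm; not the agent's real code."""
    first_seen: Optional[float] = None  # when the condition was first detected

    def update(self, now: float, condition_active: bool) -> str:
        """Return 'ok', 'pre-alarm' or 'alarm' for the current sample."""
        if not condition_active:
            self.first_seen = None  # a transient spike is forgotten
            return "ok"
        if self.first_seen is None:
            self.first_seen = now
        if now - self.first_seen > PRE_ALARM_HOLD_SECONDS:
            return "alarm"  # persisted long enough to be visible on LAS
        return "pre-alarm"

    def restart_agent(self) -> None:
        """Assumed effect of an agent restart: the history is lost."""
        self.first_seen = None


high_load = PreAlarm()
print(high_load.update(0, True))      # load first detected
print(high_load.update(601, True))    # still high after more than 10 minutes
high_load.restart_agent()
print(high_load.update(700, True))    # the clock starts again after a restart
print(high_load.update(1301, True))   # the alarm resurfaces 10 minutes later
```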
lemon-host-check (IV)
Common errors:
No monitoring agent process running / Too many monitoring agent processes running
– service edg-fmon-agent restart
– If that fails email@example.com
Possible false exception
– lemon-host-check has given up (after 60 seconds) trying to get information from the agent on the machine. If it could not find out whether an alarm was present for a particular exception, it assumes the worst case: that an alarm does exist but may not be real (possibly false)
– Why? The agent may be too busy to answer lemon-host-check, or some sensors may have failed to retrieve the necessary information
– Solution: re-run lemon-host-check. If it still fails, check /var/log/edg-fmon-agent.log for any errors about sensors or missing metrics. If there are any, run spma_wrapper.sh on the machine to get the latest sensor code, if any, then ncm-ncd --co fmonagent to reconfigure the agent, and try again. If it is still failing, contact the service manager and CC firstname.lastname@example.org
FAQ
Are the monitored machines running only Linux (e.g. SLC3/4, RHEL 3/4)?
– Linux (lemon agent, ping, http check)
– Solaris (lemon agent, UIMON)
– Windows (ping, http)
Are there any limitations that we should be aware of on the other OSes / platforms?
– AFS machines have their own monitoring tools; no information available
– UIMON monitored machines run the UIMON process and a multiplexer to send alarms to the suregateway sensor on the remote monitoring machines
We knew node polling from SURE; what is implemented in Lemon?
– Remote sensor on the remote monitoring machines
Is there any load balancing (DNS) and/or redundancy? Is the front-end/back-end part of the failover?
– No, just two independent instances running in parallel.
– In future (with RAC) there will be failover for OraMon and only one Oracle DB
FAQ (II)
What should we do in case of a piquet call about a failure on these server(s)?
– The operators' LAS procedures do not define any piquet actions. All other failures are covered by the standard OS/hardware procedures that they already have. There is nothing LAS-specific for them.
How to interpret the correlation rules? Could you explain the syntax found in the Remedy ticket?
– Full documentation with examples at http://lemon.web.cern.ch/lemon/doc/sensors/exception.shtml
– Example: lxs5013:9104:1[/tmp] eq /tmp) && (lxs5013:9104:5 > 80
LAS reduction rules and multi-host tickets: a direct mapping?
– Several use cases, e.g. 12 x spma_wrong on 12 nodes of cluster YYY
– One LAS item if the number of machines reaches 51% of the active nodes in the cluster
– Several LAS items if they appear in bursts and the alarm has already been reduced
– Individual machine LAS items if below 51%
– If new machines appear, there will be a new reduced LAS item for each set of them
A means to detect when a node started to be "alarmed" and when this stopped
– /var/log/ncm/component-setodesiredstate.log* log file on the machine in question
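The 51% reduction rule above can be written out as a small decision function. This is a sketch of the rule as stated on the slide, not the PL/SQL that LAS actually runs. The function name and the item text format are invented for illustration, and the burst and new-machine cases are left out.

```python
from typing import List

REDUCTION_THRESHOLD_PERCENT = 51  # threshold quoted on the slide


def las_items(alarm: str, cluster: str, alarmed_nodes: List[str],
              active_node_count: int) -> List[str]:
    """Return the LAS items produced for one alarm on one cluster.

    Hypothetical helper: one reduced item when the alarmed machines reach
    51% of the active nodes, otherwise one item per machine.
    """
    if active_node_count <= 0:
        raise ValueError("cluster must have at least one active node")
    alarmed = len(alarmed_nodes)
    # Integer arithmetic avoids floating point surprises at the threshold.
    if alarmed * 100 >= REDUCTION_THRESHOLD_PERCENT * active_node_count:
        return [f"{alarm} on cluster {cluster} "
                f"({alarmed}/{active_node_count} nodes)"]
    return [f"{alarm} on {node}" for node in sorted(alarmed_nodes)]


nodes = [f"node{i:02d}" for i in range(1, 13)]  # 12 alarmed machines
print(las_items("spma_wrong", "YYY", nodes, 12))       # one reduced item
print(len(las_items("spma_wrong", "YYY", nodes, 40)))  # 12 individual items
```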
FAQ (III)
What can be expected from them if no alarms can be displayed any more at 3:00 AM and they get called by the operator?
– No piquet service is defined for LAS. If LAS does not work, operators have procedures for finding out the state of the LAS: check http://lemon.web.cern.ch/lemon/cern/las_procedures.shtml
QUESTIONS?