Computing Facilities CERN IT Department CH-1211 Geneva 23 Switzerland www.cern.ch/it CF Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)


1 Computing Facilities, CERN IT Department, CH-1211 Geneva 23, Switzerland, www.cern.ch/it
Lemon monitoring and Lemon Alarm System (sensors, exception, alarm)
Ivan Fedorko, 22/11/2010

2 Overview
– Lemon overview
– Lemon agent and sensors
– How to write a new sensor
– Exception sensor
– LAS

3 Lemon components
[Architecture diagram; connections labelled SQL, TCP/UDP and HTTP]
Node Monitoring
– Sensor, Monitoring Agent, Local Cache
Measurement Repository
– Repository Backend (application server), Oracle Database
Lemon-web
– RRD tool / Python, Apache / PHP
User Interfaces
– Lemon CLI (command line tool to access data)
– Lemon-host-check (command line tool for node exceptions)
– Web Browser

4 Lemon agent and sensors
[Diagram: Monitoring Agent connected to Sensors; each Sensor holds Metric Classes, each Metric Class holds Instances]
Sensor: A process or script which is connected to the lemon-agent via a bi-directional pipe and collects information on behalf of the agent.
Sensors implement:
– Metric Class: the equivalent of a class in OOP (Object-Oriented Programming).
– Metric Instance: an instance (an object) of a metric class, which has its own configuration data.
– Metric ID: a unique identifier associated with a particular metric instance of a particular metric class.
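The sensor / metric class / metric instance / metric ID relationship described above can be pictured with a small Python model. This is only an illustrative sketch, not Lemon code: the class names, the `configure` method and the `period` setting are hypothetical, and pairing ID 9104 (system.partitionInfo, quoted later in the talk) with a sensor called "linux" is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class MetricClass:
    """A kind of measurement a sensor knows how to take (hypothetical model)."""
    name: str


@dataclass
class MetricInstance:
    """One configured use of a metric class, identified by a unique metric ID."""
    metric_id: int
    metric_class: MetricClass
    config: dict = field(default_factory=dict)


@dataclass
class Sensor:
    """A sensor groups the metric classes it implements and their instances."""
    name: str
    classes: dict = field(default_factory=dict)
    instances: dict = field(default_factory=dict)

    def register_class(self, name):
        self.classes[name] = MetricClass(name)
        return self.classes[name]

    def configure(self, metric_id, class_name, **config):
        # Metric IDs must be unique: reject a second instance with the same ID.
        if metric_id in self.instances:
            raise ValueError(f"metric ID {metric_id} already in use")
        inst = MetricInstance(metric_id, self.classes[class_name], config)
        self.instances[metric_id] = inst
        return inst


# Assumed pairing of sensor name, class name and ID, for illustration only.
linux = Sensor("linux")
linux.register_class("system.partitionInfo")
tmp_usage = linux.configure(9104, "system.partitionInfo", period=300)
```

The point of the model is that configuration data lives on the instance, not on the class, so one metric class can be sampled several times with different settings under different metric IDs.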

5 Lemon agent and sensors
MSA (Monitoring Sensor Agent):
– forks sensors and communicates with them using a custom protocol over bi-directional pipes
– configures metric instances of the metric classes of a sensor and pulls metrics from them
   – to configure: ncm-ncd --configure fmonagent
   – configuration: /etc/lemon/agent/
   – log: /var/log/lemon-agent.log
– checks on the status of sensors
– caches data locally (e.g. /var/spool/lemon-agent/)
Supported Lemon sensors: http://lemon.web.cern.ch/lemon/doc/sensors.shtml

6 Lemon agent and sensors
Sensor configuration file: /etc/lemon/agent/sensors/linux_CDB.conf

7 Lemon agent and sensors
What we can store:
– StoreSample01: agent: user:
– StoreSample02: agent: user:
– StoreSample03: agent: user:
Panel labels: Example of linux sensor; Lemon API; Reporting on behalf

8 Create new sensor
Check the instructions and example code
– http://lemon.web.cern.ch/lemon/doc/howto/sensor_tutorial.shtml
Prepare your code and test it
– Testing can be done locally
Prepare templates
– For the sensor → will be transformed into a DB table definition
– For metrics → create table/metric, metadata check on server
Ask for a metric ID at lemon.support
Commit the templates with the ID to CDB and inform lemon.support
If a new exception is introduced, should it raise an alarm?

9 Sensor template example
[Template listings; panel labels: "Somewhere in your node template"; "Sensor template", with parts "For configuration" and "For db backend"]

10 Exception sensor
Objective:
– To run a corrective action when the occupancy of the /tmp partition is greater than 80%.
Involved metrics:
– Metric ID 9104 (system.partitionInfo)
– Field 1 = mountname, field 5 = percentage occupancy
Correlation:
– ((9104:1 eq '/tmp') && (9104:5 > 80))
Actuator:
– /usr/local/sbin/clean-tmp-partition -o 75
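To make the correlation above concrete, here is a hand-coded Python equivalent of ((9104:1 eq '/tmp') && (9104:5 > 80)). It is a sketch under assumptions, not how the exception sensor is implemented: the sample layout (a list of rows per metric ID), the helper names and the sample values are all hypothetical, and it assumes both conditions must hold on the same row of the multi-valued metric.

```python
def field(sample, position):
    """Return a field from a multi-valued sample; positions are 1-based."""
    return sample[position - 1]


def tmp_full(samples):
    """Hand-coded equivalent of ((9104:1 eq '/tmp') && (9104:5 > 80))."""
    return any(
        field(row, 1) == "/tmp" and float(field(row, 5)) > 80
        for row in samples.get(9104, [])
    )


# Hypothetical samples: only fields 1 and 5 carry meaning here,
# fields 2-4 are placeholders.
healthy = {9104: [["/", "-", "-", "-", "91"], ["/tmp", "-", "-", "-", "42"]]}
full = {9104: [["/tmp", "-", "-", "-", "87"]]}

print(tmp_full(healthy), tmp_full(full))
```

In the `healthy` sample the root partition is above 80% but /tmp is not, so the exception does not fire; this is why the mount name has to be tested together with the occupancy.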

11 Exception sensor
The exception sensor:
– is an officially supported Lemon sensor coded in C++
– was developed in collaboration between CERN and BARC
– implements the Lemon alarm protocol
– has a correlation engine which allows it to evaluate one or more metrics to determine whether a problem exists on a machine
– supports reporting on behalf of other monitored entities
– allows corrective actions (actuators) to run up to n times or within a given time window
– is the primary interface for inserting alarms into the Lemon framework
– provides one and only one metric class, "alarm.exception"
Full documentation at:
– http://lemon.web.cern.ch/lemon/doc/sensors/exception.shtml

12 Exception sensor
[Figure: alarm evolution]

13 Exception metric
"/system/monitoring/exception/_30010" = nlist(
    "name", "tmp_full",
    "descr", "tmp utilization exceeds limit",
    "active", true,
    "latestonly", false,
    "importance", 2,
    "correlation", "((9104:1 eq '/tmp') && (9104:5 > 80))",
    "actuator", nlist("execve", "/usr/local/sbin/clean-tmp-partition -o 75",
                      "maxruns", 3,
                      "timeout", 300,
                      "window", 900,
                      "active", true)
);
Field notes:
– name: name of the exception, to be used on the web and later for the operator's GUI
– descr: short description, also passed to the GUI
– active: if false, the ncm component will not include the exception in the config
– latestonly: if false, values are stored in the Lemon DB archive
– importance: what is the alarm's importance? Not used now!
   – 0 = informative: to be handled at convenience
   – 1 = low: 9/5 support, to be handled within working hours; e-mail outside working hours
   – 2 = high: 24/24 support, requires immediate action; PK or expert call

14 Correlation
Basic format of a correlation is:
   [entity_name:]<metric_id>:<field_position> <operator> <reference_value> ...
Where:
– entity_name: an optional parameter, used for reporting on behalf of other entities. The name of the entity (wildcards '*' are supported).
– metric_id: the ID of the metric to check.
– field_position: the field to use within the metric. Allows the correlation to extract a single value from a multi-valued metric.
– operator: e.g. ==, !=, >, <, eq, ne, regex, !regex …
– reference_value: a string or number against which metric_id:field_position is compared.
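As an illustration of the format above, the following Python sketch splits a single correlation term into its parts. It is not the exception sensor's parser: the regular expression, the `Term` tuple and `parse_term` are hypothetical, only the operators listed on the slide are recognised, and parentheses or && / || combinations are out of scope.

```python
import re
from typing import NamedTuple, Optional

# One term: [entity_name:]metric_id:field_position operator reference_value
TERM = re.compile(
    r"""^\s*
    (?:(?P<entity>[^\s:]+):)?      # optional entity name, e.g. '*'
    (?P<metric_id>\d+):
    (?P<field_position>\d+)\s+
    (?P<op>==|!=|>|<|eq|ne|!regex|regex)\s+
    (?P<ref>.+?)\s*$""",
    re.VERBOSE,
)


class Term(NamedTuple):
    entity: Optional[str]
    metric_id: int
    field_position: int
    op: str
    ref: str


def parse_term(text):
    """Parse one correlation term; raise ValueError if it does not fit."""
    m = TERM.match(text)
    if m is None:
        raise ValueError(f"not a correlation term: {text!r}")
    ref = m["ref"]
    if len(ref) >= 2 and ref[0] == ref[-1] == "'":
        ref = ref[1:-1]  # drop the surrounding single quotes
    return Term(m["entity"], int(m["metric_id"]),
                int(m["field_position"]), m["op"], ref)


print(parse_term("9104:1 eq '/tmp'"))
print(parse_term("*:9501:5 != 200"))
```

The entity name is optional, so a term without it parses with `entity=None`, while `*:9501:5 != 200` yields the wildcard entity used for reporting on behalf of other entities.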

15 Correlation
1. Use the basic object for a correlation.
2. Combine exceptions:
   10004:1 > 600 && (10004:7 > 10 || (10004:8 > 150000 && 4109:3 eq 'i386') || (10004:8 > 600000 && 4109:3 regex '64') || 10007:2 > 50 || 10007:3 > 10 || 10007:4 > 0)
3. You can collect information on behalf of other entities, and you can define an exception on behalf of them, using [entity_name:]<metric_id>:<field_position>, e.g.
   (*:9501:5 != 200) && (*:9501:5 != 301)
4. Join the metrics:
   (9200:1 == 9208:1)
5. "correlation", "4109:2 ne 'symlink('/system/kernel/version')'"

16 Actuator
Information:
– Actuators run as forked processes.
– They are connected to the sensor via a pipe.
– All information written to stdout or stderr by the actuator is caught and recorded in the agent's log file.
– All actuator attempts are logged centrally and recorded locally in the agent's log file.
Running shell-style actuators:
– The system call used to run an actuator doesn't provide shell-style conveniences.
– To use shell-style syntax such as *, &&, | etc. you must define your actuator like this:
   Actuator /bin/sh -c \\"/bin/echo 'This is a demo message from $HOSTNAME' \\"
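The difference between an exec-style call and a shell can be reproduced outside Lemon. The Python sketch below is only an analogy: `run_exec_style` is a hypothetical helper, `shlex.split` stands in for whatever tokenisation the sensor really performs, and it assumes /bin/echo and /bin/sh exist on the machine.

```python
import shlex
import subprocess


def run_exec_style(command):
    """Run a command without a shell, roughly as an exec-family call would."""
    argv = shlex.split(command)  # assumption: simple whitespace/quote splitting
    done = subprocess.run(argv, capture_output=True, text=True, check=False)
    return done.stdout


# Without a shell, '&&' is just another argument handed to /bin/echo.
plain = run_exec_style("/bin/echo one && /bin/echo two")

# Wrapped in /bin/sh -c, the shell interprets '&&' and runs both commands.
wrapped = run_exec_style('/bin/sh -c "/bin/echo one && /bin/echo two"')

print(repr(plain))
print(repr(wrapped))
```

The first call echoes the operator literally, while the second runs two separate commands, which is the behaviour the /bin/sh -c wrapper buys for an actuator.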

17 Actuator
"/system/monitoring/exception/_30010" = nlist(
    "name", "tmp_full",
    "descr", "tmp utilization exceeds limit",
    "active", true,
    "latestonly", false,
    "importance", 2,
    "correlation", "((9104:1 eq '/tmp') && (9104:5 > 80))",
    "actuator", nlist("execve", "/usr/local/sbin/clean-tmp-partition -o 75",
                      "maxruns", 3,
                      "timeout", 300,
                      "window", 900,
                      "active", true)
);
Actuator fields:
– maxruns: the maximum number of times an actuator can run consecutively before a final alarm is generated
– timeout: the maximum number of seconds that an actuator is allowed to run before being terminated by the sensor
– window: the time window in which all maxruns of the actuator are executed
The actual values of the correlation objects are accessible to the actuator:
   Actuator /bin/sh -c \\"/bin/echo '$act_value_01 $act_value_02' \\"
   Actuator /bin/sh -c \\"/bin/echo 'Died lemonmrd daemon $act_value_01 ' | /bin/mail -s 'Lemon RRD Daemon problem' 0041xyz@mail2sms.cern.ch\\"
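One possible reading of how maxruns and window interact is a budget of attempts inside a sliding time window. The Python sketch below illustrates that reading only; the `ActuatorBudget` class is hypothetical, the sliding-window interpretation is an assumption rather than documented sensor behaviour, and the timeout (killing a run that exceeds 300 seconds) is left out.

```python
from collections import deque


class ActuatorBudget:
    """Track actuator attempts against maxruns within a sliding window."""

    def __init__(self, maxruns, window):
        self.maxruns = maxruns
        self.window = window          # seconds
        self.attempts = deque()       # timestamps of recent attempts

    def may_run(self, now):
        """Record an attempt at time `now` if the budget allows it."""
        # Forget attempts that have fallen out of the window.
        while self.attempts and now - self.attempts[0] >= self.window:
            self.attempts.popleft()
        if len(self.attempts) >= self.maxruns:
            return False              # budget spent: time for the final alarm
        self.attempts.append(now)
        return True


# Values taken from the example above: maxruns 3, window 900 seconds.
budget = ActuatorBudget(maxruns=3, window=900)
decisions = [budget.may_run(t) for t in (0, 60, 120, 180, 1000)]
print(decisions)
```

With these numbers the fourth attempt at 180 s is refused, which is the point at which a final alarm would be generated; by 1000 s the earliest attempts have aged out and the actuator may run again.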

18 Exception config
"minoccurs", 5
– Specifies how many times the exception should occur before the exception is raised.
"local", yes
– Technically this value has no effect within the sensor, but it is an instruction to the lemon-agent not to transmit data for this exception to the remote application servers. As remote transmission does not occur, the outcome of the exception can never appear on LAS (Lemon Alarm System) and is only visible locally on the machine using lemon-host-check.
"silent", yes
– An exception which is considered silent effectively sets the exception state to the value 2 for all transitions. The exception is disabled, no actuators will run and no alarms will be displayed on the LAS console.
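The effect of minoccurs can be sketched as a counter that must reach the threshold before the exception is raised. This Python fragment is illustrative only: `OccurrenceGate` is a hypothetical name, and it assumes the occurrences must be consecutive, which the slide does not state.

```python
class OccurrenceGate:
    """Raise only after the correlation held `minoccurs` times in a row."""

    def __init__(self, minoccurs):
        self.minoccurs = minoccurs
        self.streak = 0

    def update(self, correlation_true):
        # Assumption: a single false evaluation resets the count.
        self.streak = self.streak + 1 if correlation_true else 0
        return self.streak >= self.minoccurs


gate = OccurrenceGate(minoccurs=5)
observed = [True, True, False, True, True, True, True, True, True]
raised = [gate.update(v) for v in observed]
print(raised)
```

A threshold like this filters out short spikes: in the sequence above the exception is raised only at the eighth evaluation, after five uninterrupted true results.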

19 LAS
[Architecture diagram; labels: Exception Metrics, Event Management System, Lemon-web, LAS GUI, Lemon Oracle DB, LAS Business Logic (PL/SQL), Operator, Administrator, CDB, SMS]

20 LAS
Raw alarm
– Represents a processed exception on one of the monitored entities (reported by Lemon).
– Not all exceptions become an alarm (only those with code 000, 005 or 135), and only those from hosts not in maintenance.
L-alarm
– Can represent one or more alarms. It is an item visible on the operator's screen in the LAS GUI.
– Every L-alarm must be acknowledged with a ticket created in ITCM.
– States: active, inactive, acknowledged, inhibited.
– Alarms may be grouped into an L-alarm (by entities, exceptions, cluster rules…).
What is a no_contact alarm?
– The exception is not evaluated on the host!
– The alarm means that data are not arriving from the host to the DB.
– One of the heartbeat metrics (6335, 6336, 9500, 10005) is reported (usually) every 5 min.
– A procedure in the DB checks that the last entry is not older than (usually) 10 min.
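The no_contact check described above amounts to comparing the newest heartbeat timestamp with a maximum age. The real check is a procedure in the database; the Python sketch below only mirrors its logic, and the function name, the `last_seen` mapping and the treatment of a host with no data at all are assumptions.

```python
from datetime import datetime, timedelta

HEARTBEAT_METRICS = (6335, 6336, 9500, 10005)   # IDs quoted on the slide
MAX_AGE = timedelta(minutes=10)                 # "usually" 10 min


def no_contact(last_seen, now, max_age=MAX_AGE):
    """Return True if no heartbeat metric arrived within max_age.

    last_seen maps a metric ID to the timestamp of its latest entry
    (a hypothetical stand-in for a query against the Lemon DB).
    """
    stamps = [last_seen[m] for m in HEARTBEAT_METRICS if m in last_seen]
    if not stamps:
        return True   # assumption: a host that never reported counts as lost
    return now - max(stamps) > max_age


now = datetime(2010, 11, 22, 12, 0)
fresh = {9500: now - timedelta(minutes=4)}
stale = {9500: now - timedelta(minutes=11)}
print(no_contact(fresh, now), no_contact(stale, now))
```

Because the heartbeat is reported about every 5 minutes and the limit is about 10, a single missed sample does not yet trigger the alarm, but two in a row do.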

21 LAS

22 LAS
How to avoid an alarm:
– disable the exception (e.g. by lemon-host-check)
– make the exception local
– make the exception silent
– set the host to maintenance → the actuator will run
/etc/lemon/exceptions/state.conf

23 Backup
From now on: backup slides

24 Sensor parameters

25 Smoothing

