Improved Scripting of IDS Alarms and Events Thomas Horner Senior DBA/S1 Corporation Informix User Forum 2005 Moving Forward With Informix Atlanta, Georgia.


1 Improved Scripting of IDS Alarms and Events Thomas Horner Senior DBA/S1 Corporation Informix User Forum 2005 Moving Forward With Informix Atlanta, Georgia December 8-9, 2005

2 Overall Objectives
Enhancements to the supplied scripts
Help prevent unnecessary late night page or cell phone call
Be proactive in monitoring of dbspaces
Same shells can be used for 7.x, 9.x, and 10.x IDS engines

3 Presentation Overview
What does IBM/Informix supply?
Purpose of these custom shells
Overall design of the shells
Details of the alarm shell
Changes made to evidence shell
Details of the “LookatSpace” shell
Other shells I use for administration
Limitations of these shells

4 IBM Supplied Scripts
alarmprogram.sh, log_full.sh, no_log.sh, and evidence.sh supplied by IBM/Informix
IDS 9.4+ and 10.x alarm program is improved over the older versions
  – it gathers additional data for certain alarms
  – it sends email to and/or pages DBA
  – it recognizes the automatic log alarms
First two functions are in my alarm shell, but not the last one

5 IBM onconfig Parameters
ALARMPROGRAM onconfig parameter
  – set to appropriate value (full path name)
ALRM_ALL_EVENTS onconfig parameter
  – set to 1
SYSALARMPROGRAM onconfig parameter
  – set to appropriate value (full path name)
DYNAMIC_LOGS onconfig parameter
  – this needs to be 1 or 0 for my alarm shell
  – all available space in log dbspace allocated up front
  – this is a design decision
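
A minimal sketch of how these four settings could be checked before installing the alarm shell. The function names are hypothetical and not part of the presented shells; the only assumption about the onconfig file is that each active line has the form `PARAMETER value` and that commented lines start with `#`.

```shell
#!/bin/sh
# Hypothetical helpers, not taken from the presented shells.

# Print the value of one onconfig parameter (first uncommented match).
# $1 = path to the onconfig file, $2 = parameter name
onconfig_value() {
    awk -v name="$2" '$1 == name { print $2; exit }' "$1"
}

# Check the four settings named on the slide.
# Returns 0 if all of them match, 1 otherwise.
check_alarm_config() {
    cfg=$1
    rc=0
    for p in ALARMPROGRAM SYSALARMPROGRAM; do
        v=$(onconfig_value "$cfg" "$p")
        case $v in
            /*) ;;                                        # full path name
            *)  echo "$p is not a full path name: '$v'"; rc=1 ;;
        esac
    done
    if [ "$(onconfig_value "$cfg" ALRM_ALL_EVENTS)" != "1" ]; then
        echo "ALRM_ALL_EVENTS should be 1"; rc=1
    fi
    case $(onconfig_value "$cfg" DYNAMIC_LOGS) in
        0|1) ;;                                           # what the alarm shell expects
        *)   echo "DYNAMIC_LOGS should be 0 or 1"; rc=1 ;;
    esac
    return $rc
}
```

It could be called as `check_alarm_config "$INFORMIXDIR/etc/$ONCONFIG"`, assuming those two environment variables are set for the instance.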

6 Purpose of these Shells
Alarm Shell
  – combines functions of the “default” programs and adds features
Evidence Shell
  – match design of this program with the alarm program changes
LookatSpace Shell
  – gives DBA an “advance” notice of possible space issues

7 Purpose of these Shells
Other Shells used to monitor and administer the databases:
  – check database shell: quick check of engine status
  – onchecks shell: perform oncheck commands weekly
  – update statistics shell: perform scheduled update statistics
  – prune log shell: prune online log and other logs

8 Overall Design of Shells
Alarm and Evidence Shells
  – add functionality to supplied default programs
  – do not change how the shells are used by the Informix engine
LookatSpace Shell
  – run on a scheduled basis to check for low space that may not be obvious from simple onstat -d output
Other Shells
  – run on a daily or weekly schedule to perform other administrative functions

9 Overall Design of Shells
All Shells
  – can be used for multi-instance installations and multiple production databases in one instance
  – can be used across 7.x, 9.x, and 10.x engines

10 Installation
These are currently installed on four production servers and several test servers on the following versions:
  – IDS Version 7.24 on HPUX 10.20
  – IDS Version 9.21 on HPUX 11.00
Other installations are successfully using them (based on emails I have received)
Requires notification means to DBA team and to the Data Center

11 Alarm Program – Overview
Five parameters passed from instance:
  – Severity (severity) ranges from 1 through 5
  – Class_ID (class_id) contains the message ID that caused the alarm
  – Message (class_msg) contains the actual text of the alarm
  – Additional Text (specific_msg)
  – Event File (see_also)
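
A minimal sketch of how an alarm shell might pick up these five positional parameters. The variable names follow the slide; the function names are hypothetical, and the line layout of the log entry is an assumption loosely modeled on the alarm.log output shown later in the talk.

```shell
#!/bin/sh
# Hypothetical helpers, not taken from the presented alarm shell.

# Copy the five arguments the engine passes into named variables.
# Returns 1 if the severity is outside the documented 1-5 range.
read_alarm_args() {
    severity=$1
    class_id=$2
    class_msg=$3
    specific_msg=$4
    see_also=$5
    case $severity in
        [1-5]) return 0 ;;
        *)     return 1 ;;
    esac
}

# Print one log entry for the alarm currently held in the variables.
# The exact line breaks of the real alarm.log are not known; this is a guess.
format_alarm_entry() {
    printf '%s alarm.sh got event %s\n' "$(date)" "$class_id"
    printf 'severity : %s\n' "$severity"
    printf 'message : %s\n' "$class_msg"
    printf 'additional text: %s\n' "$specific_msg"
    printf 'reference file : %s\n' "$see_also"
}
```

At the top of the script this would be used as `read_alarm_args "$@" && format_alarm_entry >> "$ALARMLOG"`, where `ALARMLOG` is an assumed variable holding the alarm log path.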

12 Alarm Program – Functions Added
Set the proper level of notification based on alarm severity
Prevent overload of machine resources and email caused by duplicate or multiple alarms for the same issue
Reduce “false” alarms by using mutex files
Perform logical log backups using ontape
Option for “no notification”
Alarm log file used to record alarms and actions

13 Alarm Changes – Proper Notification Level
Severity 1 or 2
  – no notification as recommended by IBM/Informix
Severity 3
  – not critical: email is sent to the DBA team
  – no email if class 6, 15, 21, or 23 (more on why later)
Severity 4 or 5
  – critical: data center is notified for action and an email is sent to the DBA team for our records
  – no notification if class 6, 15, or 21 (more on why later)
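
The severity rules above can be sketched as a small decision function. The function name and the three result words (`none`, `email`, `critical`) are hypothetical. A result of `none` for classes 6, 15, 21, and 23 only means that this general path stays silent, because those classes are handled by their own sections of code.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the presented alarm shell.

# Decide the notification level for the general alarm path.
# $1 = severity (1-5), $2 = class id
# Prints: none | email | critical
notification_for() {
    case $1 in
        1|2)
            echo none ;;                     # as recommended by IBM/Informix
        3)
            case $2 in
                6|15|21|23) echo none ;;     # handled by separate code
                *)          echo email ;;    # not critical: DBA team only
            esac ;;
        4|5)
            case $2 in
                6|15|21) echo none ;;        # handled by separate code
                *)       echo critical ;;    # data center plus DBA team email
            esac ;;
        *)
            echo none
            return 1 ;;                      # severity outside 1-5
    esac
}
```

For example, `notification_for 3 21` prints `none`, while `notification_for 4 23` prints `critical`, since class 23 is only excluded at severity 3.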

14 Stop Duplicate Alarms
Biggest design change I made from the default alarm programs
Classes 6, 15, and 21 can cause multiple alarms
  – class 6 is “non fatal” Internal Subsystem Failure
  – class 15 is Data Replication Failure
  – class 21 is Online Resource Overflow
Idea for this change came with my first encounter with multiple class 21 alarms
  – caused by process exceeding available number of locks (version 7.x engine)
  – hundreds of emails received within a minute. OOPS!

15 Stop Duplicate Alarms (cont’d)
Separate section of code to handle classes 6, 15, and 21
Class 23 (logical log backup needed) also has specific section of code to perform log backups
Shell uses distinctly named files in /tmp for these three classes of alarms:
  – /tmp/event${ENV}${FILENO}.`date +%H`
Alarm is considered new if this file in /tmp does not exist or if that file is more than one hour old
One hour threshold was a design decision

16 Stop Duplicate Alarms (cont’d)
Steps used to handle classes 6, 15, and 21:
  – if the alarm severity is less than 3, ignore the alarm
  – if the file in /tmp exists and is less than one hour old:
      consider this a duplicate alarm of this class
      simply log it
  – if the file in /tmp does not exist, or the file is more than one hour old, this is the first alarm of this class:
      follow notification protocol
      create (or update) the /tmp file for this alarm
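
The steps above can be sketched as follows. This is not the code from the presented shell: the function names and result words are hypothetical, and the age test uses `find -mmin`, which GNU and BSD `find` provide but POSIX does not require (older HP-UX releases may lack it). The "Multiple alarms" log wording is taken from the alarm.log output shown in the talk.

```shell
#!/bin/sh
# Hypothetical helpers, not taken from the presented alarm shell.

# Return 0 (true) if the alarm is new: the marker file is missing
# or was last modified more than 60 minutes ago.
# Assumes a find that supports -mmin (GNU/BSD extension).
is_new_alarm() {
    [ -f "$1" ] || return 0
    [ -n "$(find "$1" -mmin +60 2>/dev/null)" ]
}

# Handle one alarm of class 6, 15, or 21.
# $1 = severity, $2 = class id, $3 = marker file in /tmp, $4 = alarm log
# Prints: ignore | notify | duplicate
handle_repeating_class() {
    severity=$1 class_id=$2 marker=$3 log=$4
    if [ "$severity" -lt 3 ]; then
        echo ignore
    elif is_new_alarm "$marker"; then
        touch "$marker"                  # create (or update) the marker
        echo notify                      # caller follows notification protocol
    else
        echo "$(date) Multiple alarms - class $class_id, severity $severity." >> "$log"
        echo duplicate                   # logged only, no notification
    fi
}
```

A caller would pass the marker path built from the pattern on the previous slide and send notifications only when the function prints `notify`.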

17 Alarm – Real alarm.log output
Fri Jul 19 09:40:24 EDT 2002 alarm.sh got event 21
severity : 3
message : OnLine resource overflow: 'Locks'.
additional text: Lock table overflow - user id 106, session id 1133666
reference file :
Fri Jul 19 09:40:30 EDT 2002 alarm.sh got event 23
severity : 2
message : Logical Log 15362 Complete.
additional text: Logical Log 15362 Complete.
reference file :
Fri Jul 19 09:40:39 EDT 2002 alarm.sh got event 18
severity : 2
message : Log Backup completed: 15362.
additional text: Logical Log 15362 - Backup Completed
reference file :

18 Alarm – Real alarm.log output (cont’d)
Fri Jul 19 09:40:39 EDT 2002 Multiple alarms - class 21, severity 3.
Fri Jul 19 09:40:40 EDT 2002 Multiple alarms - class 21, severity 3.
Fri Jul 19 09:41:02 EDT 2002 Existing class 21 issue - no notification needed.
Fri Jul 19 09:41:03 EDT 2002 Multiple alarms - class 21, severity 3.
Fri Jul 19 09:41:03 EDT 2002 Multiple alarms - class 21, severity 3.
Fri Jul 19 09:41:05 EDT 2002 Multiple alarms - class 21, severity 3.
Fri Jul 19 09:41:05 EDT 2002 Existing class 21 issue - no notification needed.
Fri Jul 19 09:41:17 EDT 2002 alarm.sh got event 23
severity : 2
message : Logical Log 15363 Complete.
additional text: Logical Log 15363 Complete.
reference file :

19 Alarm – Perform Logical Log Backups
Make sure no other log backup is running:
  – check for /tmp/ontape.L${ENV}, a mutex file
  – if it does exist, do not start another log backup and notify DBA team via email
  – not considered critical because this can occur normally when logs turn over quickly
  – if it does not exist, create the /tmp/ontape.L${ENV} mutex file and continue
If onconfig file has /dev/null for the LTAPEDEV parameter, run ontape -a to free the log, then exit
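
A minimal sketch of the mutex-file check. The slide describes a check followed by a create; this sketch swaps in the shell's noclobber option (`set -C`) so that the test and the creation happen in one step. That is a substitution, not necessarily what the presented shell does. The function names and result words are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers, not taken from the presented alarm shell.

# Try to create the mutex file.
# Returns 0 if this call created it, non-zero if it already existed.
# With noclobber set, the redirection fails when the file exists.
acquire_mutex() {
    ( set -C; : > "$1" ) 2>/dev/null
}

# Remove the mutex file so the next log backup can run.
release_mutex() {
    rm -f "$1"
}

# Decide whether a log backup may start.
# $1 = mutex file, for example /tmp/ontape.L${ENV}
# Prints: start | busy
begin_log_backup() {
    if acquire_mutex "$1"; then
        echo start
    else
        echo busy      # caller emails the DBA team; not treated as critical
    fi
}
```

On `busy` the caller would send the non-critical email and exit; on `start` it would continue with the engine and device checks.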

20 Alarm – Perform Logical Log Backups (cont’d)
Make sure engine is up using “onstat -” command
  – if not follow notification protocol (severity is critical)
Make sure log backup device is ready
  – if not follow notification protocol (severity is critical)
Determine number of first and last log that will be in this backup file using “onstat -l” command piped to a grep

21 Alarm – Perform Logical Log Backups (cont’d)
Note any “missing” log numbers in log file
Perform the actual log backup using “ontape -a”
If ontape command fails, follow notification protocol (severity is critical)
Move, rename, and compress the log backup file using gzip
Remove the mutex file so that the next log backup can run
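
A rough sketch of the last two steps, under several assumptions: the backup command is passed in by the caller (for example `ontape -a` with whatever input handling the site uses), the archive name `logs_FIRST_LAST` is invented for illustration, and the mutex is removed even when the backup fails, which the slides do not confirm. The function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers, not taken from the presented alarm shell.

# Run the given backup command, then remove the mutex file.
# $1 = mutex file, remaining arguments = command to run
# Returns the exit status of the command.
# Assumption: the mutex is released on failure as well as on success.
run_log_backup() {
    mutex=$1
    shift
    rc=0
    "$@" || rc=$?
    rm -f "$mutex"
    return $rc
}

# Move, rename, and compress a finished log backup file.
# $1 = file written by the backup, $2 = archive directory,
# $3 = first log number, $4 = last log number
# Prints the path of the compressed file.
archive_log_backup() {
    src=$1 dest_dir=$2 first=$3 last=$4
    dest="$dest_dir/logs_${first}_${last}"    # naming scheme is an assumption
    mv "$src" "$dest" && gzip "$dest" && echo "$dest.gz"
}
```

A non-zero return from `run_log_backup` is the point where the caller would follow the critical notification protocol.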

22 Alarm – No Notification Option
At beginning of alarm program, it looks for file named alarm.nomail in /usr/informix
MAILFLAG shell variable is set to “on” or “off”
Before every statement where notification is to be sent, the MAILFLAG variable is looked at
If MAILFLAG is “off”, do not send email or notify Data Center
If MAILFLAG is “on”, send email and (if critical) notify Data Center
You can simply remove the alarm.nomail file to start having notifications sent
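
A minimal sketch of the MAILFLAG switch. `MAILFLAG` and `alarm.nomail` come from the slide; the function names are hypothetical, and `send_email` and `page_data_center` are placeholder stubs standing in for whatever notification means a site actually uses.

```shell
#!/bin/sh
# Hypothetical helpers, not taken from the presented alarm shell.

# Placeholder notification commands; replace with real ones.
send_email()       { echo "EMAIL: $*"; }
page_data_center() { echo "PAGE: $*"; }

# Set MAILFLAG to "off" if alarm.nomail exists in the given directory,
# otherwise to "on".
# $1 = directory to look in (the slide uses /usr/informix)
set_mailflag() {
    if [ -f "$1/alarm.nomail" ]; then
        MAILFLAG=off
    else
        MAILFLAG=on
    fi
}

# Send a notification unless MAILFLAG is "off".
# $1 = level (email | critical), remaining arguments = message text
notify() {
    level=$1
    shift
    if [ "$MAILFLAG" != "on" ]; then
        return 0
    fi
    send_email "$*"
    if [ "$level" = "critical" ]; then
        page_data_center "$*"
    fi
}
```

Creating the file with `touch /usr/informix/alarm.nomail` silences every `notify` call on the next run, and removing it restores notifications.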

23 Evidence Program – Overview
Default (supplied) program is called evidence.sh
Normally called by engine when an assert failure occurs to “gather evidence” for use by IBM/Informix support
Not supplied with 7.2x engines
SYSALARMPROGRAM configuration parameter
Twelve parameters are passed to program
IBM/Informix recommends not changing the functions of this more complex shell

24 Evidence Program – Issues Addressed
I did change the notification techniques to match those used in the alarm program
Added the use of MAILFLAG to stop notification
Added notification for warnings (email to DBA team) in addition to failures
Put in appropriate values for the environment variables at the beginning of the program
I do not email the assert failure file (which the default program does) because of its large size
Named the program evidence.${ENV} for use in multiple instances

25 LookatSpace Program – Purpose
You may think that you have plenty of free space in a particular dbspace
  – one table that requests a large next extent can use up all the remaining free dbspace
  – another table in the same dbspace that also needs additional space can be “out of luck” and a SQL error will be returned to the user
This shell looks for this type of situation and emails any issues found to the DBA team
DBA team then has time to add a chunk to the dbspace before it becomes critical
We run this once a week on a scheduled basis

26 LookatSpace – Program Design
Get the name of the production database by using sysmaster SQL to find the database with the largest table in the instance (assumes only one production database)
Obtain dbspace usage using sysmaster SQL
  – separate out those that contain blobs for use later
Obtain which non-fragmented tables are in what dbspace using SQL
Obtain which fragmented tables are in what dbspace using SQL

27 LookatSpace – Program Design (cont’d)
Two lists of dbspaces are created
  – we do not put non-fragmented and fragmented tables in the same dbspace
If dbspace contains no tables or blobs, and has less than 3% free space:
  – assume that this dbspace contains only indexes
  – send email to DBA team because it is low on space
If dbspace has non-fragmented tables:
  – obtain table space usage and future needs
  – uses sysmaster SQL

28 LookatSpace – Program Design (cont’d)
If dbspace has fragmented tables:
  – obtain table space usage and future needs
  – uses sysmaster SQL
If space is more than 80% used, and next extent is greater than free space remaining in the dbspace:
  – send an email to the DBA team
If space is more than 95% used, and next extent is greater than available dbspace:
  – add a warning message to that DBA team email
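
The two thresholds can be sketched as a pure calculation, leaving out the sysmaster SQL that supplies the numbers. The function names and result words are hypothetical. "Percent used" is taken here to mean the table's allocated pages that are in use, which is an interpretation based on the figures in the sample email shown in the talk.

```shell
#!/bin/sh
# Hypothetical helpers, not taken from the presented LookatSpace shell.

# Percentage of a table's allocated pages that are in use.
# $1 = pages allocated, $2 = pages free
pct_used() {
    awk -v alloc="$1" -v free="$2" \
        'BEGIN { printf "%.2f\n", (alloc - free) * 100 / alloc }'
}

# Classify one table against the 80% and 95% thresholds.
# $1 = percent used, $2 = next extent size in pages,
# $3 = free pages left in the dbspace
# Prints: ok | email | warning
space_status() {
    awk -v used="$1" -v next_ext="$2" -v free="$3" 'BEGIN {
        used += 0; next_ext += 0; free += 0      # force numeric comparison
        if (next_ext > free && used > 95)        print "warning"
        else if (next_ext > free && used > 80)   print "email"
        else                                     print "ok"
    }'
}
```

With the sample email's figures, `pct_used 1499947 231611` gives 84.56 and `space_status 84.56 250000 99997` gives `email`, because the 250000-page next extent exceeds the 99997 free pages.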

29 LookatSpace – Program Design (cont’d)
If dbspace contains blobs, check free space in dbspace and the number of blobs remaining
If space available is less than 3% and number of blobs remaining is less than 20000, send an email with warning to the DBA team
While the program goes through all these steps, a basic text report (space report) is created
If there are no issues to report, no email is sent, but the space report is always available for review

30 LookatSpace – Program Design (cont’d)
The report is appended to each week, so a history of space utilization is available for analysis
A future enhancement could include looking at the index dbspaces
  – we have had these unexpectedly fill up when there is more than one large index in the same dbspace
Another enhancement can be to write code to analyze the space utilization reports and obtain trending information

31 LookatSpace – Sample Email
Space is low in DBSpace dbs1 with tables on Tue Sep 27 05:31:00 EDT 2005 for host sf8pdb1, instance sfarm_shm.
Table vfmtrnaudactvty next extent of 250000 pages will use all free 99997 pages in dbs1.
Table has 1499947 pages allocated, 231611 pages free, and 84.56 percent used.
Details are located in the /usr/informix/logs/checkspc.out file.

32 Other Shells I Use
Check Database Shell
  – checks to see if engine is up and active on a scheduled basis
  – performs log move if requested (uses onmode commands)
  – log move is run from another shell (to prevent issue in case of hung checkpoint)
  – log move option is used in our shop for disaster recovery purposes
Onchecks Shell
  – performs basic oncheck commands on a weekly basis

33 Other Shells I Use (cont’d)
Update Statistics Shell
  – can choose how update statistics is run via input parameters
  – temporarily changes certain Informix environment variables to improve performance while running update statistics
Prune Log Shell
  – archives various log files monthly
  – also archives the online.log

34 Limitations of these Shells
The shells (except alarm or evidence) are run on a scheduled basis, not on a demand basis
The LookatSpace shell requires that fragmented and non-fragmented tables not be in the same dbspace
The LookatSpace shell does not “predict” when index dbspaces will fill up
Certain thresholds are “hard-coded” in the shells and may need to be changed for your installation
Certain names of files and directories are coded in the shells and may need to be changed for your installation
Latest enhancements of data gathering features of 9.4+ supplied alarm program are not in the alarm shell

35 Review
Alarm program
  – took the IBM/Informix “template” and ideas of others and myself to make it more robust
  – handles multiple alarms and performs log backups
Evidence program
  – took the IBM/Informix “template” and made notification consistent with the alarm program
LookatSpace program
  – helps the DBA team identify space issues before they impact end user or become an “emergency”
Other shells we use to monitor the engines

36 Questions and Comments?
To get a copy of these shells, email me at thorner@s1.com. I can package the files and send them to you via email.
Objective here was to prevent the unnecessary page or phone call that may result in fixing something that is actually not broken. Proactive monitoring of dbspaces using LookatSpace is better than that 3 am page requiring you to add a chunk.
Thank you all for your attention. I hope that these shells enable you to keep better informed about the status of your production systems.

37 Improved Scripting of IDS Alarms and Events Thomas Horner thorner@s1.com Informix User Forum 2005 Moving Forward With Informix Atlanta, Georgia December 8-9, 2005

