Presentation on theme: "Jeff Sly Principal IT Architect "— Presentation transcript:
1 Case Study: Nagios @ Nu Skin. Jeff Sly, Principal IT Architect. Isn't it great to be here at the 1st Nagios World Conference? Ethan has done a great job with Nagios!
2 Who is in the Audience? How many of you are: suppliers of Nagios or some value add-on for Nagios? Customers using Nagios? Just implementing Nagios or expanding an implementation? Using NagiosXI? I would like to introduce some of our Nu Skin folks here who work on Nagios: Nate Broderick, Nagios Systems Engineer; Scott McWhorter, Production Support.
3 Who is Nu Skin? Direct selling: lotions and potions (or supplements); also Nourish the Children.
4 Our Technology Footprint. Ecommerce: home grown. Applications: Java, EJB, ABAP, .NET. Databases: Oracle, MySQL, MSSQL. OS: HP-UX, Red Hat, Windows, VMware. ERP: SAP Supply Chain, CRM, FI. Datacenters: 6 locations in 6 countries. Offices: 50 countries.
5 Monitoring Goals: Monitoring presents operations with a completely integrated global view. Good monitoring is proactive; it helps teams prevent problems from becoming outages. Good monitoring helps minimize outage downtime, quickly identify the root cause, and contact the correct people.
7 Our Monitoring History: We tried for 10 years…
8 "Do It All in One Tool" Projects. One monitoring tool to rule them all: Mercury SiteScope; Remedy Help Desk; HP OpenView; Quest Foglight; home grown (several). One monitoring person. He decided to quit!
9 Could Never Get Everything. All failed; we always gave up! Why? Servers and agents that were proprietary; huge footprint and inefficient performance; steep learning curve; very expensive; updates costly and very time consuming; system administrators like their own scripts, where they can see what they are doing.
10 Resulting Monitoring Issues: We tried to make Operations the clearing house for all warnings and alerts from 10+ tools; Operations was overwhelmed; it took 4 process steps and lots of software to notify of critical failures; most administrators set up their own private monitoring to receive warnings; many false notifications; late notifications.
11 As Is (start of project): Our business customers were unhappy.
12 Old Monitoring Work Flow: Four steps to notify the system administrator.
14 Step 2: Operations Opens Email. Remember there are lots of emails, with critical emails buried in with the rest. (Diagram labels: HelpDesk, Error, Network, HP NNM, System, Scripts, Nagios, Database, Foglight, SiteScope 8, BAC, SiteScope 6.)
15 Step 3: Operations Checks Source. Operations decides whether they think this is a critical problem; often it is a junior person trying to decide. (Diagram labels: HelpDesk, Error, Network, Foglight, System, Scripts, BAC, HP NNM, SiteScope 8, SiteScope 6, Nagios, Database.)
17 Inventory of Existing Checks: regular expression found on web page monitoring; HTTP check (up or down); ping host (up or down); port monitoring; FTP checking; SMTP checking; SNMP monitoring (no trap catching yet); RADIUS; DNS monitoring; disk space monitoring; CPU and load average monitoring; memory monitoring.
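Several of the checks in this inventory (disk space, CPU, memory) come down to comparing a measurement against warning and critical thresholds. A minimal Python sketch of such a check follows, using the standard Nagios plugin convention of return codes 0 to 3 and an optional performance-data section after a `|`. The function names and the 80/90 percent thresholds are illustrative assumptions, not the plugins used at Nu Skin.

```python
import shutil

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def disk_percent_used(path="/"):
    """Return the percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def evaluate_usage(percent_used, warn=80.0, crit=90.0):
    """Map a usage percentage to a Nagios state (thresholds are assumptions)."""
    if percent_used is None or not 0.0 <= percent_used <= 100.0:
        return UNKNOWN
    if percent_used >= crit:
        return CRITICAL
    if percent_used >= warn:
        return WARNING
    return OK


def plugin_line(label, percent_used, warn=80.0, crit=90.0):
    """Build (state, output) where output is 'text | perfdata'."""
    state = evaluate_usage(percent_used, warn, crit)
    if state == UNKNOWN:
        return state, f"DISK UNKNOWN - no valid reading for {label}"
    text = f"DISK {STATE_NAMES[state]} - {label} {percent_used:.1f}% used"
    # Perfdata format: label=value[UOM];warn;crit;min;max
    perfdata = f"{label}={percent_used:.1f}%;{warn:.0f};{crit:.0f};0;100"
    return state, f"{text} | {perfdata}"
```

A wrapper script would print the returned output line and exit with the returned state, for example by calling `plugin_line("/", disk_percent_used("/"))` and passing the state to `sys.exit`.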
18 Inventory of Existing Checks: service monitoring; transaction monitoring (page load times, performance graph); website click through (Webinject not working); log file monitor (parse for errors); Java heap, thread, and threadlock monitoring; Apache thread and worker count monitors; ecommerce shop monitors; can send and receive; SQL query ODBC (catalog ODBC had bugs). Later I show which of these we did in Nagios and how.
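The "log file monitor, parse for errors" item can be sketched as a small function that scans log lines and returns a Nagios state plus a one-line summary. This is only an illustration: the pattern, the `crit_count` threshold, and the function name are assumptions, and a real check would also need to remember how far it has already read.

```python
import re

# Hypothetical pattern; each application needs its own error signatures.
ERROR_PATTERN = re.compile(r"ERROR|FATAL|Exception")


def scan_log_lines(lines, pattern=ERROR_PATTERN, crit_count=5):
    """Return (state, output): 0 = OK, 1 = WARNING, 2 = CRITICAL."""
    matches = [line.rstrip("\n") for line in lines if pattern.search(line)]
    if not matches:
        return 0, "LOG OK - no matching lines"
    state = 2 if len(matches) >= crit_count else 1
    name = "CRITICAL" if state == 2 else "WARNING"
    # Nagios treats text after '|' as performance data, so strip it from log text.
    last = matches[-1].replace("|", " ")
    return state, f"LOG {name} - {len(matches)} matching line(s), last: {last}"
```

For example, `scan_log_lines(open(path))` could be called on a hypothetical application log at `path`, with the result printed and used as the plugin's exit code.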
21 Idea 1: MoM. Our first "breakthrough" was the idea that even though we needed a centralized view for all monitoring, that did not mean all monitoring had to be done by one monitoring tool. We had to pick a "Manager of the Monitors" (MoM) to bring together the best-of-breed monitoring.
23 Idea 2: Tool Requirements. Open: not proprietary and closed. Mainstream: wanted good native support and a strong community. Interface: to 3rd-party monitoring. Flexible: adapt to many types of monitoring. Efficient: minimal footprint on production servers, not chatty on the network. Notification: granular control. Reliable: good clean architecture. Usability: GUI interface, reporting.
24 Idea 3: Shared Ownership. Core team: operation of the monitoring environment (backups, upgrades, and custom plug-ins); monitoring experts; training. Monitoring leads in Development and Admin teams: set up own monitors; keep own monitors current; adjust monitors. If something is not monitored, it is not the core team's fault.
30 Idea 4: Lowest Level. Handle alerts at the lowest possible level in the organization. Forward alerts only if they are not handled at lower levels, before they become critical.
31 Handle events at lowest level. (Diagram labels: Operations, Network, System, Scripts, SAP, Asia, Europe, Web, Database.)
32 Only forward unhandled alerts. (Diagram labels: Network, System, Scripts, SAP, Asia, Europe, Web, Database.)
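The "only forward unhandled alerts" rule can be sketched as a simple filter: an alert stays with the owning team until a grace period passes without acknowledgement, and only then goes up to Operations. The `Alert` class, its field names, and the 15-minute grace period are assumptions made for illustration; they do not describe how Nagios or Nu Skin implements escalation.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    team: str                   # owning team, e.g. "Network" or "Database"
    service: str                # what is failing
    raised_at: float            # epoch seconds when the alert was raised
    acknowledged: bool = False  # True once the owning team has picked it up


def alerts_to_forward(alerts, now, grace_seconds=900):
    """Return alerts the owning team has not handled within the grace period."""
    return [
        alert
        for alert in alerts
        if not alert.acknowledged and now - alert.raised_at >= grace_seconds
    ]
```

With this rule, an acknowledged alert never reaches Operations, and a fresh alert is left with the owning team until the grace period expires.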
33 Idea 5: Nagios Monitor Method. Choose the Nagios monitoring method: active check from the Nagios server (normal); active check performed by a remote client (NRPE, NSClient); passive check, listening to 3rd-party monitors (NSCA).
34 Active Local Check. (Diagram labels: Web, HTTP or Ping, Nagios, Unix, Win, DB, DB, Monitor.)
35 Active Remote Check - UX. (Diagram labels: Web, Nagios, CPU, RAM (NRPE), Unix, Win, DB, DB, Monitor.)
36 Active Remote Check - Win. (Diagram labels: Web, Nagios, CPU, RAM (NSClient), Unix, Win, DB, DB, Monitor.)
37 Passive 3rd Party Alert. (Diagram labels: Web, Nagios, 3rd Party Alert, NSCA, Unix, DB, Win, DB, Monitor, 3rd Party Check, DB.)
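For the passive path, a third-party monitor hands Nagios a finished result instead of being polled. The sketch below formats such a result in the two text forms commonly used: the tab-separated line read by the `send_nsca` client, and the `PROCESS_SERVICE_CHECK_RESULT` external command that Nagios accepts in its command file. The host and service names in the usage note are made up, and actually delivering the line (running `send_nsca` or writing to the command file) is left out.

```python
def send_nsca_line(host, service, state, output):
    """Format one service result as send_nsca input: tab-separated, newline-terminated."""
    return f"{host}\t{service}\t{state}\t{output}\n"


def external_command_line(host, service, state, output, timestamp):
    """Format a passive service result as a Nagios external command."""
    return (
        f"[{int(timestamp)}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{state};{output}"
    )
```

For example, `send_nsca_line("db01", "Tablespace", 2, "USERS tablespace 97% full")` produces a line reporting a CRITICAL (state 2) result for a hypothetical host `db01`. The host and service must match definitions that already exist on the Nagios server.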
38 Bonus Idea: Tune. Tune the database; add a RAM drive.
39 Tune the Database: Modify the contents of the [mysqld] section of /etc/my.cnf.
    tmp_table_size=
    max_heap_table_size=
    table_cache=768
    set-variable=max_connections=100
    wait_timeout=7800
    query_cache_size =
    query_cache_limit=80000
    thread_cache_size = 4
    join_buffer_size = 128K
Info on: MySQL Tuning, Nagios Tuning
40 RAM Drive: Create a RAM disk for Nagios temporary files. I created a ramdisk by adding the following entry to the /etc/fstab file:
    none /mnt/ram tmpfs size=500M 0 0
Mount the disk using the following commands:
    # mkdir -p /mnt/ram; mount /mnt/ram
Verify the disk was mounted and created:
    # df -k
Modify the /usr/local/nagios/etc/nagios.cfg file with the following tuned parameters:
    temp_file=/mnt/ram/nagios.tmp
    temp_path=/mnt/ram
    status_file=/mnt/ram/status.dat
    precached_object_file=/mnt/ram/objects.precache
    object_cache_file=/mnt/ram/objects.cache
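After an edit like this, it is easy to miss one of the five settings. As a rough sanity check, the following Python sketch parses the text of a nagios.cfg and lists any of those settings that are missing or do not point under the RAM disk mount. The helper names are hypothetical, and the parsing is deliberately simple (it assumes plain `key=value` lines and ignores comments).

```python
from pathlib import PurePosixPath

# The five nagios.cfg settings moved to the RAM disk on the slide.
RAM_KEYS = (
    "temp_file",
    "temp_path",
    "status_file",
    "precached_object_file",
    "object_cache_file",
)


def parse_cfg(text):
    """Parse simple key=value lines, skipping blanks and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def keys_off_ramdisk(text, mount="/mnt/ram"):
    """Return the RAM_KEYS that are missing or not located under `mount`."""
    settings = parse_cfg(text)
    mount_path = PurePosixPath(mount)
    off = []
    for key in RAM_KEYS:
        value = settings.get(key)
        if value is None:
            off.append(key)
            continue
        path = PurePosixPath(value)
        if path != mount_path and mount_path not in path.parents:
            off.append(key)
    return off
```

Feeding it the contents of /usr/local/nagios/etc/nagios.cfg should return an empty list once all five parameters are set as shown on the slide.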
41 Implementation Methodology: site survey; inventory existing monitors; proof of concept; build new environment; migrate monitors from each platform to Nagios, one at a time; integrate OEM, and to send monitors to Nagios.
42 Three Project Phases: Deliver something useful in each phase; build a level at a time.
43 Phase I: Set up a pilot of Nagios XI using a trial license; set up Foglight monitoring of the JVM (Java Virtual Machine); purchase NagiosXI and consulting support; bring in a consultant for two weeks to help set up the architecture and help us work with the system; documentation web site for Nagios learnings and "how to" guides; define a set of standards and guidelines to aid an effective monitoring process; backups running on the production Nagios server; set up services which aren't being caught right now and move a few of the important services over to the new Nagios XI monitoring system; test Nagios plugins and server performance.
44 Phase II: Migrate off of SiteScope 6 and shut it down; decommission Foglight; clean up the old monitoring server; migrate the network team from old Nagios to the core NagiosXI system; set up a standby NagiosXI system, with cron to replicate weekly; research missing alerts and add them to the new NagiosXI system.
45 Phase III: Implement Global Monitoring. Add monitors for existing international systems; add monitors using JMX to monitor Java servers; Nagios Remote Plugin Executor (NRPE) to monitor remotely; remote monitoring for Windows servers (NSClient++); implement notification and escalation of alerts; add monitors for critical business functions.
46 Phase III continued… Corporate Enhancements: request recurring downtime enhancement from Ethan Galstad; automate refresh of NagiosXI standby system; build network map; retire Windows SiteScope; add monitors for phone systems; add monitors to data center (UPS, temperature, humidity); integrate with the SAP Tidal monitoring tool.
47 Phase III continued… Business: business to review and approve SLA (using business terms); monitor both the business functions and the individual point devices that provide each business function; follow the sun with eyes on glass; training (how to set up alerts, how to receive alerts, how to report on performance graphs); create a new dashboard for HelpDesk and international IT staff.
55 IT Operations. Goal: quick notification and recovery from outage. Type of monitor: notification of outages with details on which system is down, so we know who to contact. Solution: migrate from SiteScope and OpenView to NagiosXI.
56 IT Team Managers. Goal: prevention of outage. Type of monitor: warnings about conditions before outages occur, allowing corrective actions that will prevent likely outages. Solution: migrate from SiteScope and OpenView to NagiosXI; integrate OEM, SAP, and scripts with Nagios.
57 Summary. MoM ~ Manager of Managers: allow specialized tools. Tool requirements: enough but not all. Ownership for implementation: shared. Handle alerts: lowest level in organization. Choose Nagios monitoring method.
58 Tips, Tricks & Demos: Nagios XI Large Implementation, Day 3, 2:00, Track 3 (Nate Broderick). 3 demos; performance challenges and solutions; integrating monitoring solutions: Oracle; migrating from BAC & Foglight; customization; graphing, and more.