Nagios Implementation Case: Eastman Kodak Company Eric Loyd Founder & CEO Bitnetix Incorporated 877.BITNETIX.


1 Nagios Implementation Case: Eastman Kodak Company Eric Loyd Founder & CEO Bitnetix Incorporated eric@bitnetix.com www.bitnetix.com 877.BITNETIX

2 © 2012 Bitnetix Incorporated 2 About Eric Loyd and Bitnetix Founder and CEO of Bitnetix Incorporated VOIP services and IT/network consulting 25 Years in IT at places like Eastman Kodak Frontier Communications Global Crossing Bitnetix started its seventh year in July, 2012 2012 Digital Rochester GREAT Award Finalist in Communications Technology Using Nagios to monitor our client equipment, VOIP platform, and still using it at Kodak since 2004

3 2011 Nagios World Conference 3 A History of Eastman Kodak’s kodak.com Web Server Infrastructure (non-confidential)

4 © 2012 Bitnetix Incorporated 4 History of kodak.com Pre-2004 Machines located in Rochester, NY Public Apache servers Reverse proxy Apache servers Application servers (ATG/Dynamo, Tomcat, etc) Database boxes, Production Support, etc. 2004 – Moved ~80 machines from ROC -> ??? ROC ??? Firewalls Bandwidth requirements Minimal user impact Flipped the switch, went live

5 © 2012 Bitnetix Incorporated 5 History of kodak.com Some of the things kodak.com did at the time Consumer store and product information B2B portal and wholesaler purchasing “Picture Of The Day” (www.kodak.com/go/potd) Warranty registration Photo lab calibration strips “Phone home” reports for printers, docks, cameras, etc Software/firmware updates Corporate press releases, bios, and regulatory information Reverse proxy for internal information through secure channels Dozens of sitelets for products and campaigns

6 2011 Nagios World Conference 6 Why Kodak Chose Nagios to Monitor kodak.com

7 © 2012 Bitnetix Incorporated 7 Why Nagios? No centralized corporate monitoring software Nothing to compete with internally Nothing to build on, either Cost No additional cost beyond existing human resources Framework Nagios worked with firewalls without needing agents Leverage SSH, HTTP and other remote protocols Custom checks and notifications (very important)

8 2011 Nagios World Conference 8 Initial Hurdles in the New Complex Server Environment

9 kodak.com Network © 2012 Bitnetix Incorporated

10 10 Initial hurdles Firewalls Public load balancers on external Internet IPs Public Apaches in Zone 1, Kodak network Reverse proxy, app servers in Zone 2, semi-secure Nagios machine in internal Zone 3, most secure Complex “top” and “bottom” checks for web site Is the site working from the user’s perspective (top)? From the application side (bottom)? How to separate apparent from actual failure
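One way to read the "top" and "bottom" idea is as a small truth table over two check results. The sketch below is an illustration only: the function name classify_failure and the four labels are hypothetical and do not come from the slides. It assumes each check returns a standard Nagios plugin exit code, where 0 means OK.

```shell
#!/bin/sh
# classify_failure TOP_STATUS BOTTOM_STATUS
# TOP    = exit code of the check made from the user's side of the firewalls
# BOTTOM = exit code of the check made against the application itself
# The labels are hypothetical; they are not taken from the presentation.
classify_failure() {
    top="$1"
    bottom="$2"
    if [ "$top" -eq 0 ] && [ "$bottom" -eq 0 ]; then
        echo "healthy"
    elif [ "$top" -ne 0 ] && [ "$bottom" -eq 0 ]; then
        # Users see a failure, but the application answers: look at the
        # load balancers, public Apaches, or reverse proxies in between.
        echo "apparent"
    elif [ "$top" -eq 0 ] && [ "$bottom" -ne 0 ]; then
        # The site still answers, but the application check is failing.
        echo "backend"
    else
        echo "actual"
    fi
}

classify_failure 2 0
```

Keeping the two results separate is what lets an operator tell a broken path in front of the application apart from a broken application.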

11 © 2012 Bitnetix Incorporated 11 Initial hurdles No Internal Nagios Knowledge It was a contractor who set up Nagios (me) Contractors typically have a finite lifespan at Kodak Contractor made custom checks, event handlers, and all Nagios configurations. Uh-oh… Escalation and Paging Screw it – let’s email everyone, every time and let Thunderbird sort it all out Paging done via texting gateway email address Which means email gateway failure = notification failure Twitter API as backup / current primary notification

12 2011 Nagios World Conference 12 SSH to Remote Servers

13 © 2012 Bitnetix Incorporated 13 SSH to the rescue One user, one key, infinite access Software apps run as second user, with SSH auth Additional robot accounts can be added at any time Wrap existing checks in an SSH shell Provides additional control, error handling, reporting Allows all checks to submit results to SQL database SQL Database Side Note – all custom scripts executed CLI Perl code that locked a file, logged to it, and unlocked it. A Perl cron job woke up every 5 minutes, locked the file, read it, pushed things to Oracle, unlocked, and deleted log file. A second cron pruned Oracle daily to 400 days of data and collapsed checks older than 30 days so that successive checks with the same status were removed.
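A minimal sketch of wrapping a check in an SSH call and logging each result under a lock. The slides describe Perl scripts; this shell version is a stand-in, and the names wrap_check, log_result, CHECK_LOG and REMOTE_SH are hypothetical. The mkdir-based lock is an assumption chosen because mkdir is atomic; the original locking method is not shown in the slides.

```shell
#!/bin/sh
# Hypothetical settings: neither name appears in the presentation.
CHECK_LOG="${CHECK_LOG:-/tmp/nagios_checks.log}"
REMOTE_SH="${REMOTE_SH:-ssh}"

# log_result LINE: append one line to the shared log while holding a lock.
# mkdir either creates the lock directory or fails, so two writers
# cannot both believe they hold the lock.
log_result() {
    lock="$CHECK_LOG.lock"
    waited=0
    until mkdir "$lock" 2>/dev/null; do
        waited=$((waited + 1))
        [ "$waited" -ge 50 ] && return 1   # give up instead of hanging
        sleep 1
    done
    printf '%s\n' "$1" >> "$CHECK_LOG"
    rmdir "$lock"
}

# wrap_check HOST SERVICE COMMAND...
# Runs COMMAND on HOST, logs the result, prints the plugin output, and
# returns a Nagios status (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).
wrap_check() {
    host="$1"
    service="$2"
    shift 2
    status=0
    output=`$REMOTE_SH "$host" "$@" 2>&1` || status=$?
    case "$status" in
        0|1|2|3) ;;
        *) status=3 ;;   # e.g. ssh's own exit code 255 becomes UNKNOWN
    esac
    log_result "`date +%s`|$host|$service|$status|$output"
    echo "$output"
    return "$status"
}
```

A periodic job, like the five-minute cron described on the slide, could take the same lock before reading and clearing the log, so writers and the reader never overlap.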

14 2011 Nagios World Conference 14 Managing Nagios Configuration Files

15 © 2012 Bitnetix Incorporated 15 Configuration Management SCCS Solaris’s “poor man’s CVS” Pre-installed, no additional cost, existing expertise Current configuration is managed through SVN Rsync – the workhorse to move config files Configuration Repository and Push (CRaP) directory Cfengine Local versus remote execution Post-install, ignore pid files, deploy/restart, etc. Makefile – the “CLI” to the entire process
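A sketch of what one "push" target behind such a Makefile might run. Everything here is an assumption for illustration: the function name push_config, the paths, and the step that verifies the configuration before copying are not described in the slides. The only external behaviors relied on are `nagios -v`, which verifies a configuration file, and rsync's `-a`, `-z` and `--exclude` options.

```shell
#!/bin/sh
# Hypothetical defaults; override them to point at real binaries.
NAGIOS_BIN="${NAGIOS_BIN:-nagios}"
RSYNC="${RSYNC:-rsync}"

# push_config SRC_DIR DEST
# Verifies the configuration in SRC_DIR and copies it to DEST only if the
# verification passes. pid files are excluded, as the slide suggests.
push_config() {
    src="$1"
    dest="$2"
    if ! $NAGIOS_BIN -v "$src/nagios.cfg" >/dev/null 2>&1; then
        echo "config verification failed; nothing pushed" >&2
        return 1
    fi
    $RSYNC -az --exclude='*.pid' "$src/" "$dest/"
}
```

Verifying before copying means a broken edit stops at the repository instead of reaching the monitoring host.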

16 2011 Nagios World Conference 16 Common Event Handler

17 © 2012 Bitnetix Incorporated 17 Common Event Handler EKrestart – That Which Does [Flowchart, flattened in extraction; box labels in slide order. EKrestart: Setup; Arguments; Conversions; do_soft/hard?; do_something?; do_restart; Lock, logs, SQL; send_nagios; SSH to remote. Remote EKrestart: Process args; do_; send_nagios; Unlock, log, SQL; Terminate. do_: Locks (level 2); Instance mapping; Port mapping; App restart; Email & log; Exit.]

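18 © 2012 Bitnetix Incorporated 18 A Closer Look at EKrestart

    #!/bin/sh
    PATH=...
    # client_code and the do_* functions are shown on the following slides
    # "-r" as the first argument means we are running on the remote side
    [ "$1" = "-r" ] && client_code
    host="$1"
    service="$2"
    # Service names look like "base:suffix"; keep the part before the colon
    baseService=`echo "$service" | awk -F: '{print $1}'`
    state="$3"
    type="$4"
    tries="$5"
    perfdata="$6"
    class=" "
    number=" "
    case "$state" in
        OK)       do_fixit;;
        WARNING)  do_nothing;;
        UNKNOWN)  do_nothing;;
        CRITICAL) do_something;;
        *)        do_nothing;;
    esac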

19 © 2012 Bitnetix Incorporated 19 A Closer Look at EKrestart

    do_fixit() {
        case "$baseService" in
            Workers) do_restart;;
            *)       do_nothing;;
        esac
    }
    do_nothing() {
        $debug && echo "$service is in $state state ($type) for $tries tries."
    }
    do_something() {
        case "$type" in
            SOFT) do_soft;;     # Take action before it's too late?
            HARD) do_restart;;  # Hard CRITICAL - Our last chance to take action
            *)    do_nothing;;
        esac
    }
    do_soft() {
        case "$tries" in
            3|4|5) do_restart;;  # Okay, let's restart it before it goes hard
            *)     do_nothing;;  # Don't restart yet
        esac
    }

20 © 2012 Bitnetix Incorporated 20 A Closer Look at EKrestart

    do_restart() {
        # ssh $machine -r do_$service
        # exit
        :   # no-op so the function body is not empty
    }
    # On the client side, we use the same EKrestart script, but start at client_code()
    client_code() {
        host=`hostname`
        function="$2"
        service="$3"
        # (etc)
        eval "$function"
        exit
    }
    # Example function
    do_Dynamo() {
        # lock file processing
        # turn off new sessions, wean existing ones
        # /etc/init.d/restart_dynamo_$instance
        # tear down
        return
    }
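The decision logic spread across the EKrestart slides can be collapsed into one function that only reports what it would do, which makes the rules easy to test without restarting anything. This is a sketch, not the original script: decide_action is a hypothetical name, and "Dynamo:inst1" is a made-up service name in the "base:suffix" shape the slides split on. In a case statement, alternatives are written with `|`, so `3|4|5` matches any one of those attempt numbers.

```shell
#!/bin/sh
# decide_action SERVICE STATE TYPE TRIES
# Prints "restart" or "nothing" following the rules on the slides:
#   OK       -> restart only for the Workers service
#   CRITICAL -> restart on a HARD state, or on SOFT attempts 3, 4 or 5
#   others   -> nothing
decide_action() {
    base=`echo "$1" | awk -F: '{print $1}'`
    case "$2" in
        OK)
            case "$base" in
                Workers) echo restart ;;
                *)       echo nothing ;;
            esac ;;
        CRITICAL)
            case "$3" in
                HARD) echo restart ;;
                SOFT)
                    case "$4" in
                        3|4|5) echo restart ;;
                        *)     echo nothing ;;
                    esac ;;
                *) echo nothing ;;
            esac ;;
        *) echo nothing ;;
    esac
}

decide_action "Dynamo:inst1" CRITICAL SOFT 2   # prints: nothing
decide_action "Dynamo:inst1" CRITICAL SOFT 3   # prints: restart
```

Separating the decision from the action also gives a dry-run mode for free: the real handler can call the restart only when the function prints "restart".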

21 2011 Nagios World Conference 21 Integrating Nagios into Operational Procedures

22 © 2012 Bitnetix Incorporated 22 Integration with Operations Homebrew API nchart, send_nagios, nlog – all portable to other installations of Nagios on other machines Integrate with start/stop scripts Lock files. Lots of lock files! TOO MANY lock files!! The “Rippler” Leverage EKrestart, cron, and send_nagios Pager / Twitter and lots of private twitter feeds Inter-group notifications Predominately with procmail
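The slides do not show how send_nagios works. One common way for a script to report a result back to Nagios is to append a PROCESS_SERVICE_CHECK_RESULT line to the external command file, and the sketch below assumes that approach. The function name submit_passive is hypothetical, and the default command file path is an assumption that varies by installation.

```shell
#!/bin/sh
# Assumed location of the Nagios external command file; adjust per install.
NAGIOS_CMD_FILE="${NAGIOS_CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"

# submit_passive HOST SERVICE RETURN_CODE OUTPUT
# Appends one passive service check result in the external command format:
#   [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;code;output
submit_passive() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "`date +%s`" "$1" "$2" "$3" "$4" >> "$NAGIOS_CMD_FILE"
}
```

A start/stop script could call something like `submit_passive web01 Dynamo 0 "restarted by operator"` after a planned restart so the monitoring view reflects the change without waiting for the next active check.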

23 2011 Nagios World Conference 23 Predictive Failure Recovery and a Good Night’s Sleep

24 © 2012 Bitnetix Incorporated 24 Predictive Failure Recovery On ATG/Dynamo (and other) services do_soft triggers do_restart on third failure do_hard always triggers restart Notifications on fourth failure Escalation to pager only on fifth notification Nagios has time to restart things that are bad, or are going bad, prior to sending out notifications Service check dependencies allow us to know whether it’s a bad application, server, or user experience Twitter – follow private tweets with smartphone, use apps to acknowledge problems, and get an even better night’s sleep!!

25 2011 Nagios World Conference 25 Questions Eric Loyd Founder & CEO Bitnetix Incorporated eric@bitnetix.com www.bitnetix.com 877.BITNETIX

26 © 2012 Bitnetix Incorporated 26 Overview of Presentation A history of Eastman Kodak’s kodak.com web server infrastructure :00-:03 Why Kodak chose Nagios to monitor kodak.com :03-:05 What the initial hurdles were in this complex server environment :05-:10 How we leveraged SSH to solve remote server issues :10-:15 How we manage Nagios configuration files :15-:25 Using a common event handler :25-:35 Integrating Nagios into Operational Procedures :35-:40 Questions :40-:50

