Distributed Monitoring with Nagios: Past, Present, Future Mike Guthrie

Distributed Monitoring with Nagios: Past, Present, Future Mike Guthrie mguthrie@nagios.com

Distributed Monitoring Introduction
- Basic definition: splitting up your monitoring server over multiple machines
- Why use distributed monitoring?
  - Multiple sites with firewall restrictions
  - Large installations that exceed the CPU and memory resources that a single machine can offer

Understanding CPU Limitations
- The primary task of the Nagios Core engine is to schedule checks
- Example monitoring server: 1000 hosts, 4 services per host, 5-minute check interval
  - Check load = (5000 checks / 5 minutes) / 60 seconds
  - About 16.7 checks per second
- In 1 second: about 16 scripts or binary processes are launched, and about 16 sets of results come in to be processed by Nagios and written to disk
- When the check schedule exceeds CPU limitations, you get “check latency”
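The check-load arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the function name and parameters are illustrative, and it assumes the slide's figure of 5000 checks counts one host check per host in addition to the 4 service checks per host.

```python
def checks_per_second(hosts: int, services_per_host: int, interval_minutes: float) -> float:
    """Average number of checks that must be launched each second."""
    # Assumption: every host contributes one host check plus its service checks,
    # and each check runs once per interval.
    total_checks = hosts + hosts * services_per_host
    return total_checks / (interval_minutes * 60)


rate = checks_per_second(hosts=1000, services_per_host=4, interval_minutes=5)
print(f"{rate:.1f} checks per second")  # 16.7 checks per second
```

Halving the interval or doubling the host count doubles the rate, which is why the check schedule tends to be the first thing to outgrow a single machine.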

Picking the Right Distributed Model
- Pick the right model for your environment
- Think logistics: PLAN before implementation
  - Every hour spent planning logistics will save tens or even hundreds of man-hours later on
  - A 30-minute task on 1 server = 5 hours on 10 servers
- Consider how to effectively view information across multiple machines
  - As data quantity increases, discerning useful information from it becomes more important
  - Viewing 10,000 hosts and 50,000 services on a page is too much raw data to be effective information

The Classic Distributed Model
- Distributed servers run active checks and forward the results to a central server
- Diagram labels: Central Server (Passive Only); Active Checks (on each distributed server); Forward Results After Every Check
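As an illustration of what "forwarding a result" can look like, here is a minimal sketch that builds one service check result in the tab-delimited form that the `send_nsca` client reads on standard input. The slide does not name the transport, so the use of NSCA is an assumption, and the host name, service name, and function name are hypothetical.

```python
def nsca_service_line(host: str, service: str, return_code: int, output: str) -> str:
    """Format one service check result as a send_nsca input line."""
    # Nagios plugin return codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
    if return_code not in (0, 1, 2, 3):
        raise ValueError(f"invalid return code: {return_code}")
    # Tabs separate fields and a newline ends the record, so neither may
    # appear inside the plugin output itself.
    clean_output = output.replace("\t", " ").replace("\n", " ")
    return f"{host}\t{service}\t{return_code}\t{clean_output}\n"


# Hypothetical host and service names, for illustration only.
line = nsca_service_line("web01", "HTTP", 0, "HTTP OK - 0.012s response time")
print(repr(line))
```

On a distributed server, a line like this would be piped to the forwarding client after each check, and the central server would accept it as a passive result.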

The Classic Distributed Model

The Classic Distributed Model
- Central monitoring vs. central viewing?
- OCSP vs. event handlers
  - OCSP runs after every check
  - Event handlers run only on state changes
- Freshness checking ensures current data
- Child servers can also do local monitoring without forwarding results
- Distributed servers can also receive passive checks and forward them along, creating a multi-level tree structure
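The difference between the two forwarding triggers, and the role of freshness checking, can be sketched with a small simulation. This is a simplified model rather than Nagios's actual logic: it ignores soft and hard state types and retries, and all function names are illustrative.

```python
def ocsp_runs(states):
    """An OCSP command runs once after every check result."""
    return len(states)


def event_handler_runs(states, initial_state=0):
    """Simplified: count only the results whose state differs from the previous one."""
    runs = 0
    previous = initial_state
    for state in states:
        if state != previous:
            runs += 1
        previous = state
    return runs


def is_stale(result_age_seconds, freshness_threshold_seconds):
    """The central server treats a passive result as stale once it is older than the threshold."""
    return result_age_seconds > freshness_threshold_seconds


# Seven consecutive results for one service: 0 = OK, 2 = CRITICAL.
states = [0, 0, 2, 2, 2, 0, 0]
print(ocsp_runs(states))           # 7
print(event_handler_runs(states))  # 2
print(is_stale(900, 600))          # True
```

Forwarding on every check keeps the central server continuously updated at the cost of more traffic. Forwarding only on state changes is cheaper, but the central server then needs freshness checking to notice when a distributed server has gone quiet.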

The Classic Distributed Model
- Strengths:
  - Well tested, well documented, proven solution
  - All built into the Nagios Core package
  - Extremely flexible for checks, performance graphing, notifications, etc.
  - Can be combined with other distributed models
- Challenges:
  - Maintaining configs on multiple machines
  - Which server issued the check?
  - Where to process/view performance data?

The Classic Distributed Model
- Workarounds:
  - Use SVN, rsync, or cron to automatically maintain host and service configs on both distributed and central servers
  - Use templating as much as possible
    - Read the Core docs on “Object Inheritance”
    - Keep template definitions separate
  - Use naming conventions to keep configs organized
- Nagios XI distributed tools:
  - Inbound and Outbound Checks
  - Unconfigured Objects
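One way to combine templating with a naming convention is to generate the per-host definitions from a small inventory, so the same output can be synced to both the distributed and central servers. The sketch below is only an illustration: the template name `generic-host`, the site-prefixed host names, and the addresses are assumptions, not taken from the talk.

```python
def render_host(host_name: str, address: str, template: str = "generic-host") -> str:
    """Render a Nagios host definition that inherits its settings from a template."""
    lines = [
        "define host{",
        f"    use        {template}",
        f"    host_name  {host_name}",
        f"    address    {address}",
        "    }",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical inventory: the site prefix in each name records which
# distributed server owns the host.
inventory = [("nyc-web01", "192.0.2.10"), ("lon-web01", "192.0.2.20")]
config = "\n".join(render_host(name, addr) for name, addr in inventory)
print(config)
```

Because each host definition carries only a name, an address, and a `use` line, changes to check intervals or notification settings are made once in the template file.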

The Cluster Model – Nagios Load Balancing
- Nagios checks are managed by a sub-process and distributed evenly across multiple servers
- Works like a load balancer
- Two popular examples:
  - DNX: Distributed Nagios eXecutor
  - Mod Gearman
- Check results and configs are all managed at the central server

The Cluster Model – DNX

The Cluster Model – DNX
- DNX: how it works
  - When a check is scheduled to execute, the job is passed to a worker node
  - The worker node executes the check and sends the results directly to the results queue
  - Checks are not associated with any particular worker node
  - Bypasses the nagios.cmd pipe to eliminate a potential bottleneck
  - If a worker goes down, all checks continue
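The property that checks are not tied to a particular worker can be shown with a toy dispatcher. This is not DNX's actual protocol: plain round-robin assignment stands in for its job distribution, and the check and worker names are hypothetical.

```python
def dispatch(checks, workers, down=frozenset()):
    """Spread checks evenly over whichever workers are currently up."""
    available = [w for w in workers if w not in down]
    if not available:
        raise RuntimeError("no worker nodes available")
    assignments = {w: [] for w in available}
    # Round-robin stand-in: any check may land on any available worker.
    for index, check in enumerate(checks):
        assignments[available[index % len(available)]].append(check)
    return assignments


checks = ["c1", "c2", "c3", "c4", "c5", "c6"]
workers = ["w1", "w2", "w3"]

print(dispatch(checks, workers))
# With w2 down, the same six checks still run, spread over w1 and w3.
print(dispatch(checks, workers, down={"w2"}))
```

Losing a worker only raises the load on the remaining ones. Losing the dispatcher itself stops everything, which corresponds to the "master goes down" case in this model.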

The Cluster Model – DNX
- DNX strengths:
  - Central configuration management
  - Checks redistributed if a worker is down
  - Worker nodes can be added at any time
- Challenges:
  - Performance data is still handled at the central server
  - If the master goes down, all checks cease

The Cluster Model – Mod Gearman

The Cluster Model – Mod Gearman
- Strengths:
  - Central configuration management
  - Checks can be split by hostgroups or servicegroups, which is useful when groups are located in different network segments
- Challenges:
  - Performance data is still handled at the central server
  - If the master goes down, all checks cease
  - Effectively viewing more than 10k services on a single machine
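Splitting checks by hostgroup amounts to a routing decision: a check for a host in a routed group goes to that group's queue, where only workers inside the matching network segment listen. The sketch below assumes queue names of the form `hostgroup_<name>` and a default `service` queue. Treat those names, the function, and the group names as illustrative rather than as Mod Gearman's exact behavior.

```python
def queue_for(host_groups, routed_groups, default_queue="service"):
    """Pick the job queue for a service check based on its host's hostgroups."""
    for group in host_groups:
        if group in routed_groups:
            # Assumed naming scheme for per-hostgroup queues.
            return f"hostgroup_{group}"
    return default_queue


# Hypothetical setup: only the "dmz" hostgroup has dedicated workers.
routed = {"dmz"}
print(queue_for(["linux", "dmz"], routed))  # hostgroup_dmz
print(queue_for(["linux"], routed))         # service
```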

The Central Dashboard Model
- Checks are executed and managed on multiple distributed servers
- Central viewer unifies all servers
- Central viewer polls data from each server and displays tactical data in the UI
- Examples:
  - Nagios Fusion
  - MNTOS
  - check_MK Multisite
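The central viewer's job can be sketched as a merge of per-server status counts into one tactical overview. In this minimal example the polling step is replaced by hard-coded dictionaries, and the server names and state keys are hypothetical. A real viewer would fetch these counts from each server's status interface.

```python
from collections import Counter


def tactical_overview(per_server):
    """Sum service-state counts across all polled servers."""
    total = Counter()
    for counts in per_server.values():
        total.update(counts)  # adds the counts key by key
    return dict(total)


# Stand-in for data polled from two distributed servers.
polled = {
    "east": {"ok": 120, "warning": 3, "critical": 1},
    "west": {"ok": 80, "critical": 2},
}
print(tactical_overview(polled))
# {'ok': 200, 'warning': 3, 'critical': 3}
```

Only the summary crosses the network, so the viewer stays light while each server keeps running its own checks.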

The Central Dashboard Model

The Central Dashboard Model: Nagios Fusion
- Displays a tactical overview for each server
- Monitoring and object configurations are compartmentalized to each server
- Good for geographically distributed servers where local management is required
- Unified login for all XI servers (basic auth still required for Core machines)

The Central Dashboard Model: Nagios Fusion
- Strengths:
  - Easy to add new servers
  - User-level control of server views
  - High-level overview
  - Very little CPU usage
  - Commercial solution with support
- Challenges:
  - Not a monitoring solution by itself
  - Free 60-day trial, then requires a license

The Central Dashboard Model: Nagios Fusion

The Central Dashboard Model: MNTOS

The Central Dashboard Model: Multisite

Single Server – Distributed Parts
- Not all environments require check distribution
- Offload NDOUtils (DB backend) to a different machine
- Offload performance data processing to a different machine
- Mount disk I/O-intensive files on a RAM disk
- A Nagios Core install can run between 10k and 20k checks, depending on what is being checked and how it is configured

Where To Go From Here?
- Future of distributed monitoring?
  - Improved information viewing instead of just raw data
  - Aggregated reporting and statistics
  - Business process views and monitoring
- What do you, as admins, need to see in this area of software development?

Conclusion
- Pick the right setup for your environment
- Any of these models can be mixed and combined
- PLAN before implementation: plan for efficient maintenance
  - An environment with 250k services overseen by a single server took almost an entire year of planning and implementation to do right
  - Environments can scale even larger with the right logistics planning in place

Conference Resources
- Daniel Wittenberg: “Scaling Nagios At A Giant Insurance Company”, 2pm Thursday
  - 35,000 hosts and 1.4 million services
- Mike Weber: “Reducing Server Load with Mod Gearman”, 10:30am Friday
- Dave Williams: author of DNX
