1 © Bull, 2014 October 14th 2014 Dave Williams Technical Architect Multi-Tenant Nagios Monitoring.


2 © Bull, 2014 Agenda
– Background
– Multi-Tenant Monitoring
– Why Multi-Tenant
– Multi-Tenant Design
– Service Catalogue
– Futures & ‘Blue Sky thinking’
– Questions

3 © Bull, 2014 Background
– UK based
  – Mainframe (IBM & Honeywell)
  – Unix (HP-UX, AIX, Solaris)
  – Linux (Red Hat, SLES, Debian)
  – Network (CASE, 3Com, Cisco)
– Working for Bull
  – French computer manufacturer
  – Mainframes, Unix, HPC, Security, Managed Services, Advisory Services

4 © Bull, 2014 Background
– System Monitoring
  – OpenView
  – NetView
  – Open Master
– Open Source Monitoring
  – NetSaint on AIX
  – Nagios

5 © Bull, 2014 Why Multi-Tenant?
– Outsourcing Support & Monitoring
– Multiple Customers
  – Different levels of security
  – Different hardware / software platforms
– One Support Team
  – Only need to know about real problems
  – Can be driven by support ticket, not Nagios
– Required 365 x 24
  – Infrastructure must survive all outages without loss of service

6 © Bull, 2014 Multi-Tenant Design
– Each customer may have
  – 2-3000 hosts
  – 10-100 services per host
– Real time monitoring
– Customer profile
– SLA Reporting
– Batch Event completion
– Different SLAs for each Business Process per customer
– Different alerting & escalation methods per customer
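
To put the upper bound on this slide into perspective (3000 hosts, 100 services per host), here is a rough sizing sketch. The 5-minute check interval is an assumption for illustration only; the slides do not state one.

```python
# Rough check-rate estimate for a single tenant at the upper bound quoted
# on the slide. ASSUMED_INTERVAL_S is NOT from the presentation.

def checks_per_second(hosts: int, services_per_host: int, interval_s: int) -> float:
    """Average number of service checks that must complete each second."""
    return hosts * services_per_host / interval_s

MAX_HOSTS = 3000               # upper bound from the slide
MAX_SERVICES_PER_HOST = 100    # upper bound from the slide
ASSUMED_INTERVAL_S = 300       # assumption: every service checked every 5 minutes

total = MAX_HOSTS * MAX_SERVICES_PER_HOST
rate = checks_per_second(MAX_HOSTS, MAX_SERVICES_PER_HOST, ASSUMED_INTERVAL_S)
print(f"{total} service checks, about {rate:.0f} checks per second")
```

Under that assumed interval, a single large tenant already needs a sustained four-figure check rate, which is one reason to spread checks across per-customer Nagios instances.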

7 © Bull, 2014 Multi-Tenant Design
Hardware Platform – Central Support
– Virtualised Platform (Intel based)
  – XenServer Hypervisor
    – Allows clustering with shared storage
    – Inexpensive licensing
– Shared Storage
  – NAS
    – Using QNAP appliances with underlying RAID-5 & hot spare protection
    – Network connection using dual interfaces bound across multiple switches
    – Could have used FreeNAS
– LAN Infrastructure
  – Dual connections to all hardware
  – SNMP managed switches

8 © Bull, 2014 Hardware Platform – Basic Schematic

9 © Bull, 2014 Multi-Tenant Design
Hardware Platform – Resilience
– Virtualised Platform (Intel based)
  – XenServer Hypervisor
    – Allows clustering with shared storage
    – If the primary node fails, the cluster will ‘spin up’ the image on the 2nd node
    – Same data / logs (shared storage)
– LAN Infrastructure
  – Dual connections to all hardware
    – Bonded interfaces for NAS access: no data loss / access loss with failure
    – SNMP managed switches

10 © Bull, 2014 Hardware Setup

11 © Bull, 2014 Multi-Tenant Design
Hardware Platform – Recovery
– Virtualised Platform (Intel based)
  – XenServer Hypervisor
    – Allows clustering with shared storage
    – If the primary site fails, the image will be spun up
    – Internet access fails over using BGP
– Shared Storage, replicated from the primary site
  – NAS
    – Using QNAP appliances with underlying RAID-5 & hot spare protection
    – Using RTRR (Real Time Remote Replication) between sites
    – Network connection using dual interfaces bound across multiple switches
– LAN Infrastructure
  – Dual connections to all hardware
    – Bonded interfaces for NAS access: no data loss / access loss with failure
    – SNMP managed switches

12 © Bull, 2014 Hardware Platform - Resilience

13 © Bull, 2014 Hardware Platform – Customer Site
– Using generic netbooks
  – Minimum requirement: 1 GB memory, Atom processor, Ethernet port
  – Running CentOS 6.4 64-bit operating system
– Can use Raspberry Pi for small customers
  – 512 MB memory, ARM processor, Ethernet port
  – Running Raspbian operating system

14 © Bull, 2014 Software Platform – Central Site
– Nagios Core
  – Running latest 4.0.8
  – Using MK Livestatus for interfacing
  – Using Thruk for visualisation
– Graylog2 / Elasticsearch
  – Store all logs & syslog in ‘Big Data’ repository using MongoDB
– Asterisk PBX
  – Allows all alerting to use standard dial-up with speech synthesis + IVR
– SMS-Client
  – Still using TAPI to SMS text contacts

15 © Bull, 2014 Software Platform – Central Site (contd)
– NRPE
  – Running 2.1.5
– NSCA & NSCA-ng
  – Using NSCA for external communication
  – Using NSCA-ng for issuing remote commands
– Postfix / Procmail
  – Used to generate emails but also to handle responses
  – Routes unsolicited alerting emails (HP Insight, Pingdom)
– OTRS
  – Record alerts, track issues

16 © Bull, 2014 Software Platform – Remote Site
– Nagios Core
  – Running latest 4.0.8
– NRPE
  – Running 2.14
– NSCA
  – Using NSCA for external communication
– OpenVPN
  – Communication via IPSec VPN

17 © Bull, 2014 Customer Multi-Tenant

18 © Bull, 2014 Multi Tenant Schematic

19 © Bull, 2014 Service Catalogue
– ITIL flavour
– Really just services & their characteristics

20 © Bull, 2014 Service Catalogue
– Agreed list of servers / services
  – With importance levels
  – With alerting paths
  – With escalation paths
  – Recovery options
– Feeds into Service Level Agreements and Operational Level Agreements
– Basis of agreed reporting structures

21 © Bull, 2014 Examples
– Basic spreadsheet plus shell script
  – Usually easy to create
  – The shell script is different for each customer, based on an initial standard script
– Chef or Puppet
  – Use Exported Resources
  – Nagios Cookbook: Nagios Conference 2012 presentation
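
The spreadsheet-driven approach can be sketched as follows. This is a minimal illustration in Python rather than the shell script the slide mentions, and it is not the author's script. The CSV columns (`customer`, `hostname`, `address`), the `generic-host` template name and the sample addresses are all assumptions; only the `define host { ... }` object syntax is standard Nagios configuration.

```python
import csv
import io

# Standard Nagios host object syntax; the field values are filled per CSV row.
HOST_TEMPLATE = """define host {{
    use        {template}
    host_name  {nagios_name}
    alias      {hostname}
    address    {address}
}}
"""

def render_hosts(csv_text: str, template: str = "generic-host") -> str:
    """Turn a service-catalogue export (CSV text) into Nagios host definitions.

    Assumed columns: customer, hostname, address (hypothetical layout).
    The Nagios host name is prefixed with the customer key so that two
    customers can both own a 'server01' without clashing.
    """
    blocks = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        blocks.append(HOST_TEMPLATE.format(
            template=template,
            nagios_name=f"{row['customer']}-{row['hostname']}",
            hostname=row["hostname"],
            address=row["address"],
        ))
    return "\n".join(blocks)

# Hypothetical two-customer export; addresses are documentation ranges.
SAMPLE = (
    "customer,hostname,address\n"
    "Custloc1,server01,192.0.2.10\n"
    "Custloc2,server01,198.51.100.10\n"
)

config = render_hosts(SAMPLE)
print(config)
```

Keeping the per-customer differences in the data rather than in the script is one way to avoid maintaining a separate script for each customer.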

22 © Bull, 2014 Multi Tenant Issues
– Naming conventions
  – Every customer has a server01
  – Customers' naming conventions are obscure
  – Customers have multiple physical locations or levels of security
    – This gives rise to Nagios names that differ from the actual names:
    – Custloc1-swfeltsw01
    – Custloc2-nwfeltsw01
  – Not so smart when a non-Nagios originated alert is received
    – ‘swfeltsw01 – RAID battery backup failure’ from HP Insight, for example
    – The external alert processor has to perform table lookups before building the appropriate NSCA command
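
The table lookup described above might look like the following sketch. The lookup key (sender address plus bare hostname), the sender addresses, the service description `External Alert` and the choice of CRITICAL for every alert are assumptions for illustration. The output line follows the tab-separated format that `send_nsca` reads on standard input by default (host, service description, return code, plugin output).

```python
from typing import Optional

# Hypothetical lookup table: (alert sender, hostname as reported by the
# external tool) -> host name as configured in Nagios.
HOST_LOOKUP = {
    ("insight@custloc1.example", "swfeltsw01"): "Custloc1-swfeltsw01",
    ("insight@custloc2.example", "nwfeltsw01"): "Custloc2-nwfeltsw01",
}

CRITICAL = 2  # Nagios return code for a CRITICAL state

def build_nsca_line(sender: str, alert: str,
                    service: str = "External Alert",
                    state: int = CRITICAL) -> Optional[str]:
    """Translate 'hostname – message' into one send_nsca input line.

    Returns None when the host cannot be resolved, so the caller can route
    the alert to a human instead of guessing the tenant.
    """
    hostname, _, message = alert.partition(" – ")
    nagios_name = HOST_LOOKUP.get((sender, hostname.strip()))
    if nagios_name is None:
        return None
    return f"{nagios_name}\t{service}\t{state}\t{message.strip()}\n"

line = build_nsca_line("insight@custloc1.example",
                       "swfeltsw01 – RAID battery backup failure")
print(repr(line))
```

Keying the table on the sender as well as the hostname matters because the bare hostname alone is not unique across customers.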

23 © Bull, 2014 Futures & Blue Sky thinking
– The Nagios visualisation is resource heavy
  – All customers want their own dashboard
  – All customers want a different screen layout
– Why not move the visualisation into the cloud?
  – Use an Amazon EC2 image to access central Livestatus via https
  – Allow end user to authenticate
  – Customer portal allows ‘spin up’ & ‘spin down’ of images
    – Move billing to the customer
    – Scale horizontally for visualisation
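
A per-customer dashboard would need to restrict what it reads from the central Livestatus instance. The sketch below only builds the Livestatus query (LQL) text for one tenant; how that query is carried over https and how the user is authenticated are left out, as the slides do not describe them. It assumes, hypothetically, that every Nagios host name starts with the customer key followed by a hyphen, as in the naming examples earlier in the deck.

```python
import re

def tenant_query(customer: str) -> str:
    """Build an LQL query returning only one customer's services.

    The customer key is validated strictly because it is embedded in the
    query; a newline in it could otherwise inject extra LQL headers.
    """
    if not re.fullmatch(r"[A-Za-z0-9]+", customer):
        raise ValueError(f"invalid customer key: {customer!r}")
    return (
        "GET services\n"
        "Columns: host_name description state plugin_output\n"
        f"Filter: host_name ~ ^{customer}-\n"   # '~' is a regex match in LQL
        "OutputFormat: json\n"
        "\n"                                     # blank line terminates the query
    )

query = tenant_query("Custloc1")
print(query)
```

Filtering on the server side keeps one tenant's data from ever reaching another tenant's dashboard image, rather than relying on the dashboard to hide it.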

24 © Bull, 2014 Load Sharing
– Plugins like check_wmi_plus put a strain on the monitoring system: they issue a large number of queries that take wall clock time to complete and parse.
– Better to have ‘worker nodes’ via Merlin, Mod Gearman or similar to perform these functions, on a Raspberry Pi for example.
– No great expense to add 2 or 3 Pis to customer site configurations, and easy fallback if they fail: no unique locally stored data.
