Slide 3: Agenda
- Introduction
  - General Background
  - System Monitoring Background
- Example Implementations of Nagios
  - UK Customer Examples
- Datacentre Monitoring with Nagios
  - What is a Datacentre?
  - Software & Hardware combinations
  - Vision
- Conclusions

Slide 4: Background
- UK based
  - Mainframe (IBM & Honeywell)
  - Unix (HP-UX, AIX, Solaris)
  - Network (CASE, 3COM, CISCO)
- Working for Bull
  - French Computer Manufacturer
  - Mainframes, Unix, HPC, Security, Managed Services

Slide 5: Background
- System Monitoring
  - OpenView
  - Netview
  - Open Master
- Open Source Monitoring
  - NetSaint on AIX
  - Nagios

Slide 7: Crown Office Procurator Fiscal Service
- Responsible for the prosecution of crime in Scotland
- Investigation of suspicious deaths
- Complaints against the Police
- IT Locations in Glasgow & Edinburgh
  - Windows at every Court of Justice in Scotland
  - AIX / Oracle DB at Glasgow & Edinburgh

Slide 8: Crown Office Procurator Fiscal Service
- Already used Solarwinds for some network monitoring
- Strategy demanded AIX based monitoring & reporting
- Nagios was selected in a competitive tender
  - Main success points were simplicity and ease of customisation
  - Fitted within the AIX based distance data replication already in use

Slide 9: Crown Office Procurator Fiscal Service
- 60+ Windows systems monitored for CPU, Disk Space etc.
- 2 AIX servers monitored for CPU, Disk Space etc.
- Two Oracle Instances monitored for performance and DBspace usage
- All alerts shown on monitor screen and, if necessary, sent as SMS Text alerts
- Installed 2005, still working
- Provides ‘backstop’ to Solarwinds for capacity monitoring on the WAN & LAN

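The CPU and disk space checks listed above all follow the usual Nagios plugin convention of comparing a measurement against warning and critical thresholds and reporting an exit code (0 OK, 1 WARNING, 2 CRITICAL). The sketch below illustrates that convention only; the function names, the 80/90 thresholds, and the output wording are assumptions and are not taken from the COPFS installation.

```python
def classify_usage(used_percent, warn=80.0, crit=90.0):
    """Map a utilisation percentage to a Nagios-style (exit_code, label) pair."""
    if used_percent >= crit:
        return 2, "CRITICAL"
    if used_percent >= warn:
        return 1, "WARNING"
    return 0, "OK"


def disk_check_line(path, used_bytes, total_bytes, warn=80.0, crit=90.0):
    """Build the exit code and one-line status text for a disk space check.

    The text after '|' is performance data in the common
    label=value[UOM];warn;crit layout.
    """
    used_percent = 100.0 * used_bytes / total_bytes
    code, label = classify_usage(used_percent, warn, crit)
    line = (
        f"DISK {label} - {path} {used_percent:.1f}% used"
        f"|used_pct={used_percent:.1f}%;{warn:.0f};{crit:.0f}"
    )
    return code, line
```

On a live host the byte counts could come from `shutil.disk_usage(path)`, with the returned code passed to `sys.exit()` so that Nagios sees it as the plugin result.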
Slide 10: Rother District Council
- “Working with the community to improve the overall well-being of the District”
- Responsible for Waste Collection, Housing, Planning & Building Control
- The District covers some 200 square miles and serves a population of around 90,000 inhabitants

Slide 11: Rother District Council
- Monitoring 20+ Windows Servers for CPU, Disk Utilisation etc.
- Monitoring numerous disparate Applications
- Reporting on Availability
- Monitoring Printer status
- Unexpected benefits

Slide 12: North Yorkshire County Council
- Internet Access system for 30,000 pupils
- Monitoring internet access, IDS, AV, Webservers
- Reporting on Availability
- Monitoring Service Level Indicators
- Mix of application providers (Scalix, Plesk)
- Mix of appliance systems – Cisco, Panda, Radware, NetEnforcer, MyFilter

Slide 14: North Yorkshire County Council
- Uses NRPE to perform active checks on hosts
  - Multi O/S support
  - Debian
  - RedHat
- Uses NSCA to accept check results from Windows
  - Via NagiosEventLog

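With NRPE the Nagios server asks the host to run a check; with NSCA the host pushes a result that Nagios accepts as a passive check. The sketch below shows the tab-separated line that the `send_nsca` client reads on standard input for a service result (host, service description, return code, plugin output). The function name and the sample host and service names are hypothetical and do not describe the NYCC configuration.

```python
def format_passive_result(host, service, return_code, output):
    """Format one passive service check result for send_nsca.

    send_nsca expects: host<TAB>service<TAB>return_code<TAB>output<NEWLINE>
    Return codes follow the plugin convention: 0 OK, 1 WARNING,
    2 CRITICAL, 3 UNKNOWN.
    """
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0, 1, 2 or 3")
    # Tabs and newlines are field and record separators, so strip them
    # from the free-text output before building the line.
    clean = output.replace("\t", " ").replace("\n", " ")
    return f"{host}\t{service}\t{return_code}\t{clean}\n"


# Hypothetical example: a Windows event forwarded as a CRITICAL result.
example = format_passive_result("winsrv01", "EventLog", 2, "Disk error logged")
```

The resulting line would typically be piped to `send_nsca`, whose host and configuration options depend on the local installation.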
Slide 15: North Yorkshire County Council
- Scalix running on Redhat Cluster
  - Checking all processes, cluster state etc.
- PLESK Web server
  - Checking availability of web sites via test installation
  - Monitoring disk utilisation and processor utilisation
- AV systems
  - Monitoring availability
  - Checking on AV database
- Myfilter
  - Monitoring filters running
  - Checking that sufficient filters are available

Slide 16: North Yorkshire County Council
- Nagios server runs external loopback test every 20 minutes to confirm external reachability
- PLESK Web server
  - Straightforward implementation of check_http
- NetBackup
  - Monitoring that backups have run
  - Checking that enough backup tapes are available
- Business Availability
  - Define which services constitute a business line
  - 07:00 check – tell support before the customers come on line

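The "Business Availability" idea of defining which services constitute a business line can be sketched as a roll-up of individual service states into one state. The version below is only an illustration: the worst-state rule, the severity ranking of UNKNOWN, and the service names are assumptions, not the rules used at NYCC.

```python
# Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed severity ranking: CRITICAL is worst, then WARNING, then UNKNOWN.
_SEVERITY = {OK: 0, UNKNOWN: 1, WARNING: 2, CRITICAL: 3}


def business_line_state(service_states):
    """Return the worst state among the services that make up a business line.

    service_states maps a service name to its Nagios return code.
    An empty mapping yields UNKNOWN, since nothing has been checked.
    """
    if not service_states:
        return UNKNOWN
    return max(service_states.values(), key=lambda state: _SEVERITY[state])


# Hypothetical business line built from three services.
pupil_internet = {"web proxy": OK, "content filter": WARNING, "mail": OK}
state = business_line_state(pupil_internet)
```

A scheduled early-morning run of such a roll-up is one way the 07:00 check could tell support about a degraded business line before customers come on line.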
Slide 17: NYCC - Nagiosgraph
- Nagiosgraph
  - Uses process_performance_data
  - Example of Unix load average

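When `process_performance_data` is enabled, Nagios passes on whatever a plugin prints after the `|` character, and a grapher such as nagiosgraph turns it into time series. The sketch below parses that performance data into numbers. It is a simplified illustration, not nagiosgraph's own parser: it assumes labels contain no spaces, and the sample line is modelled on typical `check_load` output rather than taken from the NYCC system.

```python
import re

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_perfdata(plugin_output):
    """Extract {label: value} from the part of plugin output after '|'.

    Each item looks like label=value[UOM];warn;crit;min;max.
    Only the value is kept; thresholds and units are ignored.
    """
    _, separator, perfdata = plugin_output.partition("|")
    metrics = {}
    if not separator:
        return metrics  # the plugin printed no performance data
    for token in perfdata.split():
        label, equals, rest = token.partition("=")
        if not equals:
            continue
        match = _NUMBER.match(rest.split(";")[0])
        if match:
            metrics[label.strip("'")] = float(match.group())
    return metrics


sample = (
    "OK - load average: 0.42, 0.35, 0.30"
    "|load1=0.420;5.000;10.000;0; load5=0.350;4.000;6.000;0; "
    "load15=0.300;3.000;4.000;0;"
)
load = parse_perfdata(sample)
```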
Slide 19: NYCC
- Alerts sent via email to customers as well as support
  - Backup notifications via SMS Text
- Use Nagios Looking Glass for Customer View
- nagiosgraph used to catch all service performance data
  - Debian & Redhat performance metrics
  - Network throughput from LAN switches
  - LDAP response time

Slide 21: What is a DataCentre?
A data center (or datacentre) is a facility used to house computer systems and associated components, such as telecommunications and storage systems. It generally includes redundant or backup power supplies, redundant data communications connections, environmental controls and security devices. (Wikipedia)

Slide 22: How good is your DataCentre?
The TIA-942 Data Center Standards Overview describes the requirements for the data centre infrastructure. The simplest is a Tier 1 data centre, which is basically a server room, following basic guidelines for the installation of computer systems. The most stringent level is a Tier 4 data centre, which is designed to host mission critical computer systems, with fully redundant subsystems and compartmentalized security zones controlled by biometric access control methods. (Wikipedia)

Slide 24: What is a Green DataCentre?
The most commonly used metric to determine the energy efficiency of a data centre is power usage effectiveness, or PUE. This simple ratio is the total power entering the data centre divided by the power used by the IT equipment.

PUE = Total Facility Power / IT Equipment Power

Power used by support equipment, often referred to as overhead load, mainly consists of cooling systems, power delivery, and other facility infrastructure like lighting. The average data centre in the US has a PUE of 2.0, meaning that the facility uses one Watt of overhead power for every Watt delivered to IT equipment. State-of-the-art data centre energy efficiency is estimated to be roughly 1.2.

DCiE (Data Centre infrastructure Efficiency) = IT Equipment Power / Total Facility Power

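The two ratios above can be worked through in a few lines. The sketch below uses illustrative power figures chosen to reproduce the PUE values quoted on the slide (2.0 and 1.2); the function names and the kW inputs are assumptions, not measurements from any real facility.

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power usage effectiveness: total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


def dcie(total_facility_kw, it_equipment_kw):
    """Data Centre infrastructure Efficiency: the reciprocal of PUE."""
    if total_facility_kw <= 0:
        raise ValueError("total facility power must be positive")
    return it_equipment_kw / total_facility_kw


def overhead_per_it_watt(pue_value):
    """Watts of overhead (cooling, power delivery, lighting) per IT Watt."""
    return pue_value - 1.0


# Illustrative figures: 1000 kW of IT load in a 2000 kW facility.
average_pue = pue(2000.0, 1000.0)    # 2.0, the US average quoted above
average_dcie = dcie(2000.0, 1000.0)  # 0.5, i.e. 50%
```

At a PUE of 2.0 the overhead is one Watt per IT Watt, as the slide states; at 1.2 it falls to roughly 0.2 Watts per IT Watt.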
Slide 25: Bull Datacentre BC1 ?
- New datacentre build on an already existing site
- Design criteria
  - PUE 1.6
  - Easily expanded on demand
  - Tier 3

Slide 26: Bull UK Datacentre BC1
What do you get for £1.2m?

Slide 27: Bull UK Datacentre BC1
- New Mains Incomer
  - Took feed from 11 kV ring
  - Had to build own substation
- 1.2 MW Generator
  - Required 8000 litre fuel tank
  - Switchgear to automatically start generator if mains incomer fails (10-45 seconds)
- 3 x Ambient CRAC Units
  - Cooling via external temperature differential
  - N+1 configuration
  - Hot Aisle Containment
- In-Line UPS
  - UPS only required to keep IT equipment running until generator fires up
  - Uses space in Cab rows, easily scalable according to load
- BC1 Latitude 53 degrees – good for ambient cooling!
- Fuel for 48 hours running time

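The tank size and running time quoted above imply an average fuel burn rate, which is the kind of figure a fuel-level check could be compared against. The arithmetic below assumes that the 48 hours refers to the 8000 litre tank starting full, which the slide does not state explicitly; the function names and the 200 litres per hour example are hypothetical.

```python
def max_average_burn_rate(tank_litres, runtime_hours):
    """Average litres per hour that would empty the tank in runtime_hours."""
    return tank_litres / runtime_hours


def runtime_hours(tank_litres, burn_litres_per_hour):
    """Hours of running time a full tank gives at a steady burn rate."""
    return tank_litres / burn_litres_per_hour


# 8000 litres over 48 hours is roughly 167 litres per hour on average.
implied_rate = max_average_burn_rate(8000.0, 48.0)

# Hypothetical heavier load: at 200 litres per hour the tank lasts 40 hours.
shorter_runtime = runtime_hours(8000.0, 200.0)
```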
Slide 28: Bull UK Datacentre BC1 - Monitoring
- Physical Environment
  - APC Netbotz Devices
    - Translate inputs from sensors
    - Humidity, Temperature, Dew Point
  - SEAL I/O Dry Contact
    - Voltage indicators
    - For CRAC, FM200, Generator, UPS
- Electrical Efficiency
  - PowerLogic ION software reads from power meters
  - Power meter on every Distribution Board
  - Real-time calculation of PUE
- Power Distribution
  - Every PDU strip (2 per Cab) monitored for power consumption & problems
  - A number of PDU strips also have remote control down to socket level
- Management Network
  - LAN infrastructure required to support the Datacentre
  - Servers required to support the Datacentre
  - External alert mechanisms

Slide 29: Bull UK Datacentre BC1
What does Netbotz look like?

Slide 30: Bull UK Datacentre BC1
What does SeaLevel look like?

Slide 39: Nagios Products in use
- Nagios Core
- NRPE
- NSCA
- Nagios Looking Glass
- Nagvis
- EventDB
- SNMPTT
- Nagmap
- NDO

Slide 40: Other Open Source Products in use
- Nedi
- Arpwatch
- PSAD
- SMS-Client
- Bacula
- Confluence (Wiki)
- i-doit (ITIL CMDB)
- MRTG
- Routers2cgi

Slide 41: BC1 Datacentre Monitoring Elements
- Nagios Core
  - Normal install with direct polling of devices
  - Only looking at Datacentre
- Nagios Display System
  - Central reporting Nagios
  - Absorbs updates from other Nagios instances
- Information Display
  - Normal system with 5 heads
- Nagios Customer System
  - Running on an appliance connected to Customer network
  - Sends data via encrypted secured link to Display System
- Backup System
  - Use tape library
  - Hosts CMDB & Wiki

Slide 45: BC1 Datacentre Customer System
- Hardware Platform – Motion Tablet
  - O/S Ubuntu LTS
  - Pentium M 1.5 GHz, 0.5 GB memory, 30 GB disk
  - Touch Screen tablet system
- Nagios 3.2.3
  - Built from tarball
  - Nagios Plugins
- Nagios NSCA
  - Sends status (encrypted) to central reporting system

Slide 46: BC1 Datacentre Backup System
- Hardware Platform – Intel
  - O/S CentOS 5
  - Xeon 3.06 GHz, 2.0 GB memory, 108 GB disk
- Uses Bacula 5.0.3
  - Controls SDLT 20 slot tape library
- Backs up all Datacentre Infrastructure
  - Windows
  - CentOS
  - Ubuntu

Slide 48: Conclusions
- Strategic Overall Design
  - Know what you need to monitor
  - Know who needs to be told
- Expect to throw the first version away
  - Only when you have fully engineered the solution will you understand all of the issues
  - Keep a record of design decisions
- You will have to make it pretty for management
  - Accept that an attractive display will be required
  - Reporting will become key
- It must be reliable
  - Make backups
  - Consider clustering & recovery options

Slide 50: Hints & Experience
- Separate Display systems from Monitoring systems
  - If you are tracking tens of thousands of services you don’t want processor heavy graphics as well
- Escalation & Alerting take time
  - Firstly to get right with your organisation
  - Secondly to actually physically do!
- Suppliers go out of their way to make it difficult
  - Don’t give in – there is always a way to get Nagios involved
  - Screen scrape, telnet, RS232 are all possible
- SNMP is your friend
  - When in doubt use SNMP to help you out
  - SNMP V3 with AES cypher is suitably secure for most implementations

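The SNMP V3 advice above can be illustrated with the Net-SNMP `snmpget` command, which takes the security level, authentication protocol and privacy (encryption) protocol as options. The sketch below only builds the argument list and does not run it. The choice of SHA for authentication, the function name, and the host, user and passphrases are assumptions for illustration; only the AES cypher comes from the slide.

```python
def snmpv3_get_command(host, user, auth_passphrase, priv_passphrase, oid):
    """Build a Net-SNMP snmpget argument list for an authPriv SNMPv3 query."""
    return [
        "snmpget",
        "-v", "3",               # SNMP version 3
        "-l", "authPriv",        # authenticate and encrypt
        "-u", user,              # security name
        "-a", "SHA",             # authentication protocol (assumed)
        "-A", auth_passphrase,
        "-x", "AES",             # privacy protocol, as recommended above
        "-X", priv_passphrase,
        host,
        oid,
    ]


# Hypothetical device and credentials; the OID is sysUpTime.0.
command = snmpv3_get_command(
    "pdu-a1.example.net", "nagios", "auth-secret", "priv-secret",
    "1.3.6.1.2.1.1.3.0",
)
```

The list could be passed to `subprocess.run(command, capture_output=True, text=True)` from a wrapper plugin, which keeps the passphrases out of a shell command string.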