HEPSYSMAN Conference 29 April 2003 – Ian Stokes-Rees Grid Monitoring using Nagios and RRDtool Ian Stokes-Rees Oxford Particle Physics.

1 HEPSYSMAN Conference, 29 April 2003 – Ian Stokes-Rees
Grid Monitoring using Nagios and RRDtool
Ian Stokes-Rees, Oxford Particle Physics

2 In a perfect world …
Individual node status
  o Is it up?
  o What is its load?
  o What is the memory and swap usage?
  o NFS and network load?
  o Are the partitions full?
  o Are applications and services running properly?
Amalgamated node status
  o Same info, but across groups of nodes

3 In a perfect world …
Historical information
  o Trends
Notification of service states
  o e.g. Storage down to 100 megs free = Warning
  o Storage down to 10 megs free = Critical
  o sshd no longer running = Failure
  o notify by email, pager, mobile
Easy access to monitoring information
  o web, email, digest, mobile
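
The storage thresholds above can be sketched as a small classifier. This is an illustrative Python sketch, not something from the slides: the function name `classify_free_space` is hypothetical, the default thresholds are the slide's own examples, and the numeric codes follow the usual Nagios plugin exit-code convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
# Hypothetical helper; thresholds default to the slide's examples (100 MB, 10 MB).
OK, WARNING, CRITICAL = 0, 1, 2


def classify_free_space(free_mb, warn_mb=100, crit_mb=10):
    """Return a Nagios-style status code for the given free space in MB."""
    if free_mb <= crit_mb:
        return CRITICAL
    if free_mb <= warn_mb:
        return WARNING
    return OK
```

Checking the critical threshold first matters: a value of 5 MB is below both limits and should be reported at the more severe level.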

4 In a perfect world …
Avoidance of "too many red flashing lights"
  o "Just the facts, ma'am": only root-cause failures should be reported, not a cascade of every downstream failure.
  o also includes avoiding unnecessary checks
  o e.g. HTTP responding, therefore no need to ping
  o e.g. power outage, doesn't ping, so don't bother trying anything else
Other wish list requirements?
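
One way to read the two examples above is as an ordering rule for checks. The Python sketch below is only an illustration under that reading; `run_checks` and its arguments are hypothetical names, and the checks are plain callables that return True when the target responds.

```python
def run_checks(ping, service_checks):
    """Run service checks, pinging only when needed.

    ping: zero-argument callable returning True if the host answers.
    service_checks: dict mapping a service name to a zero-argument callable.
    Returns (problems, executed): the problems to report and the checks run.
    """
    executed, problems = [], []
    host_known_up = False
    for name, check in service_checks.items():
        executed.append(name)
        if check():
            # A responding service proves the host is up: no ping needed.
            host_known_up = True
            continue
        if not host_known_up:
            executed.append("ping")
            if not ping():
                # Root cause found: report it alone and skip the rest.
                return ["host down"], executed
            host_known_up = True
        problems.append(name + " failed")
    return problems, executed
```

With a dead host, only the first service check and one ping are executed, and the single reported problem is "host down" rather than one alarm per service.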

5 Aspects of Current Grid Monitoring
1. LDAP (Lightweight Directory Access Protocol) is the current foundation for MDS. Designed for frequent reads and infrequent writes.
2. MDS (Monitoring and Discovery Service) uses LDAP to maintain static and dynamic system details.
3. R-GMA (Relational Grid Monitoring Architecture) is meant to address shortcomings of the LDAP-based MDS system by using a hierarchy of relational databases. Now being deployed.
4. GRIS (Grid Resource Information Service) stores details about the state of the grid (at least from the local node).
5. GIIS (Grid Index Information Service) ties together several GRISes.
6. HBM (Heart Beat Monitor) monitors Globus services; seems to have died a quiet death.

6 Existing Grid Monitoring Lacks…
Historical information for trends
Simple interface for accessing information
Automated response to changes in system state
Here is where RRDtool and Nagios can contribute

7 RRDtool
Round Robin Database for time series data storage
Command line based
From the author of MRTG
Made to be faster and more flexible than MRTG
Includes CGI and graphing tools, plus APIs
Solves the Historical Trends and Simple Interface problems

8 Define Data Sources (Inputs)
    DS:speed:COUNTER:600:U:U
    DS:fuel:GAUGE:600:U:U
  o DS = Data Source
  o speed, fuel = variable names
  o COUNTER, GAUGE = variable type
  o 600 = heartbeat: UNKNOWN is recorded for the interval if nothing is received within this many seconds
  o U:U = limits on minimum and maximum variable values (U means unknown, so any value is permitted)
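
The difference between the two variable types can be sketched in a few lines. A GAUGE value is stored as reported, while a COUNTER is turned into a per-second rate from two successive readings. The Python below is a simplified illustration, not RRDtool's implementation: `counter_rate` is a hypothetical name, the timestamps in the usage note are made up, and counter wrap-around, min/max limits, and alignment to step boundaries are all ignored.

```python
def counter_rate(prev, curr, heartbeat=600):
    """Per-second rate between two (timestamp_seconds, counter_value) readings.

    Returns None (standing in for UNKNOWN) when the gap between the
    readings is longer than the heartbeat.
    """
    (t0, v0), (t1, v1) = prev, curr
    dt = t1 - t0
    if dt <= 0:
        raise ValueError("timestamps must increase")
    if dt > heartbeat:
        return None
    return (v1 - v0) / dt
```

For example, an odometer moving from 12345 to 12357 over an assumed 300 seconds gives `counter_rate((0, 12345), (300, 12357))`, which is 12 / 300 km per second.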

9 Define Archives (Outputs)
    RRA:AVERAGE:0.5:1:24
    RRA:AVERAGE:0.5:6:10
  o RRA = Round Robin Archive
  o AVERAGE = consolidation function
  o 0.5 = up to 50% of consolidated points may be UNKNOWN
  o 1:24 = this RRA keeps each sample (an average over one 5-minute primary sample), 24 times (2 hours' worth)
  o 6:10 = this RRA keeps an average over every six 5-minute primary samples (30 minutes), 10 times (5 hours' worth)
Clear as mud!
  o all depends on the original step size, which defaults to 5 minutes
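
The AVERAGE function and the 0.5 factor can be illustrated together. The Python sketch below is a simplified model of the behaviour described above, not RRDtool's code: `consolidate_average` is a hypothetical name, and None stands in for UNKNOWN.

```python
def consolidate_average(points, xff=0.5):
    """Average primary data points, tolerating a fraction of unknowns.

    points: list of floats, with None marking an UNKNOWN sample.
    xff: largest fraction of UNKNOWN samples for which the consolidated
         value is still considered known.
    """
    if not points:
        raise ValueError("need at least one primary data point")
    known = [p for p in points if p is not None]
    unknown_fraction = (len(points) - len(known)) / len(points)
    if unknown_fraction > xff or not known:
        return None
    return sum(known) / len(known)
```

For a 6:10 archive, six primary samples feed each stored value: with three of the six unknown the result is still the average of the other three, and with four or more unknown the stored value is itself UNKNOWN.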

10 RRDtool Database Format
Recent data stored once every 5 minutes for the past 2 hours (1:24)
Medium length data averaged to one entry per half hour for the last 5 hours (6:10)
Old data averaged to one entry per day for the last 365 days (288:365)
--step 300 (5 minute input step size)
[Diagram: an RRD file containing RRA 1:24, RRA 6:10, and RRA 288:365]
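
The arithmetic behind these three archives is the step size multiplied by the two RRA numbers. A minimal Python sketch, with `rra_coverage` as a hypothetical helper name:

```python
def rra_coverage(step, steps_per_row, rows):
    """Return (resolution, span) in seconds for one round robin archive.

    step: primary step size in seconds (the --step value).
    steps_per_row: primary samples consolidated into each stored row.
    rows: number of rows the archive keeps before overwriting.
    """
    resolution = step * steps_per_row
    return resolution, resolution * rows


for spec in [(1, 24), (6, 10), (288, 365)]:
    resolution, span = rra_coverage(300, *spec)
    print(spec, "resolution", resolution, "s, span", span / 3600, "h")
```

With a 300 second step, 1:24 covers 7200 seconds (2 hours), 6:10 covers 18000 seconds (5 hours), and 288:365 stores one row per 86400 seconds (one day) for 365 days.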

11 RRDtool Example
Monitoring a car: fuel in the tank plus odometer
    12:…  12345 KM  7.0 L
    12:…  12357 KM  5.8 L
    12:…  12363 KM  5.2 L  STOP
    12:…  12363 KM  5.2 L
    12:…  12363 KM  5.2 L  RESTART
    12:…  12373 KM  4.2 L
    12:…  12383 KM  3.2 L
    12:…  12393 KM  2.2 L
    12:…  12399 KM  1.6 L
    12:…  12405 KM  9.0 L  REFUEL
    12:…  12411 KM  8.4 L
    13:…  12415 KM  8.0 L
    13:…  12420 KM  7.5 L
    13:…  12422 KM  7.3 L
    13:…  12423 KM  7.2 L

12 RRDtool Example
Create an RRD to store distance and fuel
    rrdtool create car.rrd --start … \
        DS:speed:COUNTER:600:U:U \
        DS:fuel:GAUGE:600:U:U \
        RRA:AVERAGE:0.5:1:24 \
        RRA:AVERAGE:0.5:6:10
--start defines the earliest time the RRD accepts

13 RRDtool Example
Input data:
    rrdtool update car.rrd …:12345:7.0 …:12357:5.8
    rrdtool update car.rrd …:12363:5.2 …:12363:5.2
    rrdtool update car.rrd …:12363:5.2 …:12373:4.2
    rrdtool update car.rrd …:12383:3.2 …:12393:2.2
    rrdtool update car.rrd …:12399:1.6 …:12405:9.0
    rrdtool update car.rrd …:12411:8.4 …:12415:8.0
    rrdtool update car.rrd …:12420:7.5 …:12422:7.3
    rrdtool update car.rrd …:12423:7.2

14 RRDtool Graphing
Now with data in the RRD, RRDtool can generate graphs:
    rrdtool graph speed.gif \
        --start … --end … \
        --vertical-label m/s \
        DEF:myspeed=car.rrd:speed:AVERAGE \
        DEF:myfuel=car.rrd:fuel:AVERAGE \
        CDEF:realspeed=myspeed,1000,* \
        LINE2:realspeed#FF0000 \
        LINE2:myfuel#00FF00
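
The CDEF expression `myspeed,1000,*` is written in reverse Polish notation: operands are pushed first and the operator comes last, so it multiplies the stored km/s rate by 1000 to give m/s. The Python sketch below evaluates such an expression for a single value. It is only an illustration with a hypothetical `eval_rpn` function, it supports just the four basic arithmetic operators, and the input value 0.04 in the usage note is an assumed sample.

```python
import operator

# Only basic arithmetic; RRDtool's own RPN language has many more operators.
_OPS = {"+": operator.add, "-": operator.sub,
        "*": operator.mul, "/": operator.truediv}


def eval_rpn(expression, variables):
    """Evaluate a comma-separated RPN expression against named values."""
    stack = []
    for token in expression.split(","):
        if token in _OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(_OPS[token](left, right))
        elif token in variables:
            stack.append(variables[token])
        else:
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("malformed RPN expression")
    return stack[0]
```

Calling `eval_rpn("myspeed,1000,*", {"myspeed": 0.04})` converts 0.04 km/s into roughly 40 m/s, which is what the red `realspeed` line would plot for that sample.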

15 RRDtool Graphing Output
Much more interesting graphs possible
Multiple RRDs may be used as sources for variables
Auto-interpolation of points
Functions and calculations can be applied to variables
Legends, labels, and text can be inserted

16 RRDtool Graphing Output

17 Nagios
Instantaneous service level monitoring
Web based interface
Somewhat complicated set of configuration files to manually edit
Automated notification of change in service level (email, phone, etc.)
Defines WARNING, CRITICAL, FAILED levels

18 What Do We Want to Monitor?
    Static            Dynamic               Services
    CPU (SPECint)     Load                  Live
    RAM (swap)        Mem/swap usage        Accessible
    HD capacity       Storage available     Globus
    Network b/w       Network utilisation   SSH
    OS                Users                 Etc.
    Applications      Processes
    Location, Admin   Queues (PBS)

19 Nagios Host Definitions
Define details about each node and their hierarchy in the network:
    define host{
        host_name               tbce01
        alias                   Testbed CE
        address                 …
        parents                 edg-testbed
        notifications_enabled   1
        process_perf_data       1
        check_command           check-host-alive
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
    }
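
The `parents` entry is what lets Nagios tell a host that is DOWN from one that is merely UNREACHABLE behind a failed parent, which is the distinction behind the `d,u,r` options (down, unreachable, recovery). The Python sketch below is a simplified model of that idea, not Nagios's actual logic: `host_state` is a hypothetical function and the check results are supplied as a plain dictionary.

```python
def host_state(host, parents, check_ok):
    """Classify a host as UP, DOWN, or UNREACHABLE.

    parents: dict mapping a host name to a list of its parent host names.
    check_ok: dict mapping a host name to True if its host check succeeded.
    """
    if check_ok[host]:
        return "UP"
    host_parents = parents.get(host, [])
    if not host_parents:
        # Nothing upstream to blame, so the host itself is the problem.
        return "DOWN"
    if any(host_state(p, parents, check_ok) == "UP" for p in host_parents):
        # A route to the host exists, so the failure is the host's own.
        return "DOWN"
    return "UNREACHABLE"
```

If `edg-testbed` fails its check, `tbce01` is classified as UNREACHABLE rather than DOWN, so only the parent needs to be reported as the root cause.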

20 Nagios Service Definitions
Define details about each service:
    define service{
        name                    ping
        check_command           check_ping!100.0,20%!500.0,60%
        contact_groups          linux-admins
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   5
        notification_interval   120
        notification_period     24x7
        notification_options    c,r
    }
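
The two arguments after `check_ping!` are threshold pairs: a round-trip time in milliseconds and a packet-loss percentage. The Python sketch below shows how such pairs could be applied to a measurement. It assumes the `check_ping` command definition passes the first argument as the warning threshold and the second as the critical one, which depends on the local configuration; the function names and the treatment of values exactly on a threshold are also assumptions.

```python
def parse_threshold(text):
    """Split a '<rta_ms>,<loss>%' string into two floats."""
    rta, loss = text.split(",")
    return float(rta), float(loss.rstrip("%"))


def classify_ping(rta_ms, loss_pct, warn="100.0,20%", crit="500.0,60%"):
    """Classify one ping measurement against warning/critical thresholds."""
    warn_rta, warn_loss = parse_threshold(warn)
    crit_rta, crit_loss = parse_threshold(crit)
    if rta_ms >= crit_rta or loss_pct >= crit_loss:
        return "CRITICAL"
    if rta_ms >= warn_rta or loss_pct >= warn_loss:
        return "WARNING"
    return "OK"
```

With `notification_options c,r`, only the CRITICAL result and the later recovery would trigger a notification; a WARNING would show in the web interface without notifying anyone.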

21 Nagios Service and Host Polling
Pull model, where the Nagios server executes a command to fetch host or service status
Requires remote hosts and services to cooperate
  o NRPE installed on clients allows the server to execute plugins to poll for information
  o Alternatively use existing client reporting mechanisms (ping, wget, http)
Server responsible for configuration of polling intervals and details to be polled

22 Nagios Service and Host Reporting
Push model, where services and hosts decide when to report status to the Nagios server
  o push data when available/relevant
  o generally full access to node-local data
  o requires configuring every node independently
  o authentication of nodes at server
  o nodes need to know who to send data to
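
The slides do not say which mechanism carries pushed results. One common route is a Nagios passive check, submitted as a `PROCESS_SERVICE_CHECK_RESULT` external command. The Python sketch below only formats such a line; `passive_result_line` is a hypothetical helper, the host and service names are reused from the earlier slides for illustration, and how the line reaches the server (command file, NSCA, or something else) is left open.

```python
def passive_result_line(timestamp, host, service, code, output):
    """Format one passive service check result as an external command line.

    code uses the plugin convention: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
    """
    if code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0, 1, 2 or 3")
    return "[{}] PROCESS_SERVICE_CHECK_RESULT;{};{};{};{}".format(
        int(timestamp), host, service, code, output)
```

A node would build a line such as `passive_result_line(1051617600, "tbce01", "ping", 0, "PING OK")` whenever it has something relevant to report, instead of waiting to be polled.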

23 Host and Service Status

24 Host and Service Status

25 Host and Service Status

26 Finally, some other monitors
NWS (Network Weather Service): attempts to predict network utilisation from historical information
Ganglia: cluster monitoring system, provides aggregate graphs of cluster performance; Globus/EDG tie-ins underway
Map Center: EDG project to monitor Grid status and services
ActiveMap, GridPortal, and InfoPortal* appear to be inactive projects

