Monitoring Temperature and Fan Speed Using Ganglia and Winbond Chips Caitie McCaffrey, Yemi Adesanya August 2006.


1 Monitoring Temperature and Fan Speed Using Ganglia and Winbond Chips Caitie McCaffrey, Yemi Adesanya August 2006

2 “The SLAC Computing Services Group is dedicated to providing leadership and support in computing and communications to the laboratory as a whole, and to physics research, in particular”
Major Concerns
- Power consumption
- Cooling
- Monitoring

3 What Is My Computer Doing???
Metrics: I/O rate, CPU usage, memory usage, temperature, fan speed, load
Monitoring Software
- low overhead
- scalable
- low impact on individual machines

4 “Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids”
- Scalable: overhead increases with the number of clusters, not the number of nodes
- Works on multiple operating systems
- Uses a Round Robin Database (RRD)
- Measures metrics like CPU usage, load, I/O rate, and memory usage
- Components: gmond, gmetad, gmetric

5 Ganglia Architecture
[Diagram labels: A, B, C; 1, 2, 3, 4]
- Cluster One: all machines know the state of the entire cluster
- Cluster Two: machines 1 and 3 know the state of the entire cluster
- Updates RRD, polls clusters periodically
http://www.slac.stanford.edu/comp/unix/ganglia/index.html

6

7 GMETRIC
- Allows users to monitor additional metrics beyond the core set monitored by the daemon gmond
- Each metric has a name, value, type, and units:
  gmetric --conf=/var/ganglia/gmond.conf -nCPUTemp1 -v75 -tuint8 -uCelsius
- Good because it allows us to be more machine specific: we can monitor temperature and fan speed
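
As a rough sketch of how a script might assemble the call shown on this slide, the Python below builds the gmetric argument list without running it. The helper name `gmetric_args` and its default config path are hypothetical (the talk's own scripts were written in Perl); only the `--conf`, `-n`, `-v`, `-t`, and `-u` options from the slide are used.

```python
def gmetric_args(name, value, mtype, units, conf="/var/ganglia/gmond.conf"):
    """Return the argv list for one gmetric call (hypothetical helper).

    name, value, mtype, and units correspond to the Name, Value, Type,
    and Units fields listed on the slide.
    """
    return [
        "gmetric",
        f"--conf={conf}",
        "-n", name,
        "-v", str(value),
        "-t", mtype,
        "-u", units,
    ]


args = gmetric_args("CPUTemp1", 75, "uint8", "Celsius")
print(" ".join(args))
```

On a host where gmetric is installed, the list could be handed to `subprocess.run(args, check=True)`; keeping it as a list avoids shell quoting issues.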

8 A little bit on hardware
Noma batch machines
- Tyan Thunder LE-T motherboard
- Winbond w83782d (lm_sensors compatible)
- 2 Pentium III processors
Why is temperature important?
- Chip specifications give a temperature range
- Behavior is unpredictable outside that temperature range
- Temperature gives clues to weird machine behavior
- Pentiums have a max temp of 77°C to 82°C
[Image: Tyan Thunder LE-T]

9 What’s a Noma?
- Horse from Noma County, Japan
- Smallest native Japanese pony
- 10.1 to 10.3 hands
- Super rare: 27 pure blood Nomas left (1988)
Some more machines: COB, DON, TORI, MORAB, ORLOV, NOMA

10 caitiem@noma0449 $ sensors
w83782d-i2c-0-29
Adapter: SMBus PIIX4 adapter at 0580
Algorithm: Non-I2C SMBus adapter
VCore 1:   +1.48 V  (min =  +4.08 V, max =  +4.08 V)
VCore 2:   +1.26 V  (min =  +4.08 V, max =  +4.08 V)
+3.3V:     +3.37 V  (min =  +2.97 V, max =  +3.63 V)
+5V:       +4.97 V  (min =  +4.50 V, max =  +5.48 V)
+12V:     +12.08 V  (min = +10.79 V, max = +13.11 V)
-12V:      -1.03 V  (min = -13.21 V, max = -10.90 V)
-5V:       +2.84 V  (min =  -5.51 V, max =  -4.51 V)
V5SB:      +5.12 V  (min =  +4.50 V, max =  +5.48 V)
VBat:      +3.34 V  (min =  +2.70 V, max =  +3.29 V)
fan1:     8231 RPM  (min = 3000 RPM, div = 2)
fan2:     8333 RPM  (min = 3000 RPM, div = 2)
fan3:        0 RPM  (min = 3000 RPM, div = 2)
temp1:       +77°C  (limit = +60°C)  sensor = thermistor  ALARM
temp2:     +65.0°C  (limit = +60°C, hysteresis = +50°C)  sensor = thermistor  ALARM
temp3:     +65.0°C  (limit = +60°C, hysteresis = +50°C)  sensor = thermistor  ALARM
vid:      +1.450 V
alarms:   Chassis intrusion detection  ALARM
beep_enable: Sound alarm disabled

11 Perl
Fills the gap between low-level languages like C and C++ and high-level languages like shell.
- mostly fast
- basically unlimited
- good for working with text
- portable
Regular Expressions
/^temp([0-9]):\s+\+([0-9]+\.*[0-9]*)/
matches
temp1: +77°C (limit = +60°C) sensor = thermistor
temp2: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor
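
To illustrate what the slide's pattern captures, here is a minimal Python sketch (the original script was Perl; the pattern itself is copied unchanged). The helper name `parse_temps` and the sample text are assumptions for illustration. Note that the pattern requires a leading `+`, so it would not match a negative reading.

```python
import re

# Same pattern as on the slide; MULTILINE makes ^ match at the start of each line.
TEMP_RE = re.compile(r"^temp([0-9]):\s+\+([0-9]+\.*[0-9]*)", re.MULTILINE)


def parse_temps(sensors_output):
    """Map sensor index to temperature in degrees C (hypothetical helper)."""
    return {int(idx): float(val) for idx, val in TEMP_RE.findall(sensors_output)}


# Sample lines taken from the sensors output on slide 10.
sample = (
    "fan1: 8231 RPM (min = 3000 RPM, div = 2)\n"
    "temp1: +77°C (limit = +60°C) sensor = thermistor\n"
    "temp2: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor\n"
)
print(parse_temps(sample))
```

The first capture group is the sensor number and the second is the reading; the limit and hysteresis values are skipped because they do not sit at the start of a line.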

12 Sample Time - Decreasing
Want sample time to decrease faster when temperatures are changing faster.
Parameters
- Trigger = 0.5 degrees
- Decrement = 0.9
- MaxTime = 15 minutes
- MinTime = 1 minute
Time interval = 12.15 minutes
Fri Aug 11 03:04:05 PDT 2006
FanSpeed1 8035
FanSpeed2 7941
Temp 1: 77 Change: 0
Temp 2: 64.0 Change: 0
Temp 3: 64.0 Change: 1
Time interval = 9.8415 minutes
Fri Aug 11 03:16:15 PDT 2006
NewTime = OldTime * Decrement ^ (Change / Trigger)
* If NewTime < MinTime then NewTime = MinTime
NewTime = 12.15 * 0.9 ^ (1 / 0.5) = 9.8415
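
The decrease rule on this slide can be sketched in a few lines of Python. The function name `decreased_interval` is hypothetical, and the defaults are the parameter values listed on the slide; the original implementation was a Perl script that is not shown in the talk.

```python
def decreased_interval(old_minutes, change, trigger=0.5, decrement=0.9, min_time=1.0):
    """Shrink the sample interval: NewTime = OldTime * Decrement ^ (Change / Trigger).

    The result is clamped so it never drops below min_time.
    """
    new_minutes = old_minutes * decrement ** (change / trigger)
    return max(new_minutes, min_time)


# Worked example from the slide: a 1 degree change at a 12.15 minute interval.
print(round(decreased_interval(12.15, 1), 4))

# Larger changes shrink the interval faster.
for change in (0.5, 1, 2):
    print(change, round(decreased_interval(12.15, change), 4))
```

Because the exponent is Change / Trigger, every additional 0.5 degrees of change multiplies the interval by another factor of 0.9.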

13 Sample Time - Increasing
Want sample time to increase when temperature is changing slowly or not at all.
* If we increase by large amounts we could miss valuable data
Parameters
- Trigger = 0.5 degrees
- Decrement = 0.9
- MaxTime = 15 minutes
- MinTime = 1 minute
Time interval = 12.15 minutes
Fri Aug 11 08:25:18 PDT 2006
Found FanSpeed1 8035
Found FanSpeed2 7941
Temp 1: 77 Change: 0
Temp 2: 64.0 Change: 0
Temp 3: 64.0 Change: 0
Time interval = 13.5 minutes
Fri Aug 11 08:37:28 PDT 2006
NewTime = OldTime / Decrement
NewTime = 12.15 / 0.9 = 13.5
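
Putting the increase and decrease rules together, one possible update step looks like the sketch below. Several details are assumptions not stated on the slides: that the largest absolute change across the sensors drives the decision, that the interval shrinks only when that change reaches Trigger, and that the result is clamped to MaxTime as well as MinTime. The name `next_interval` is hypothetical.

```python
def next_interval(old_minutes, changes, trigger=0.5, decrement=0.9,
                  min_time=1.0, max_time=15.0):
    """Return the next sample interval in minutes (illustrative sketch)."""
    # Assumption: the sensor with the largest change decides the outcome.
    change = max(abs(c) for c in changes)
    if change >= trigger:
        # Assumption: shrink only once the change reaches the trigger.
        new_minutes = old_minutes * decrement ** (change / trigger)
    else:
        # Slow or no change: grow gently, NewTime = OldTime / Decrement.
        new_minutes = old_minutes / decrement
    # Assumption: clamp to both bounds; the slides state only the MinTime clamp.
    return min(max(new_minutes, min_time), max_time)


print(round(next_interval(12.15, [0, 0, 0]), 4))  # increasing case from this slide
print(round(next_interval(12.15, [0, 0, 1]), 4))  # decreasing case from slide 12
```

Growing by a single factor of 1/0.9 per step keeps increases small, which matches the slide's concern about missing valuable data.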

14 noma0450 noma0449

15 Up and running on two Nomas currently
- Noma0449
- Noma0450
Will be installed on all Nomas.
Can be used on any Ganglia-monitored machine with a compatible Winbond chip.
Many thanks to the DOE, the SCCS systems group, and especially Yemi Adesanya, John Goebel, & Karl Amrhein for all their help throughout the summer.

16 Smartmontools for SCSI devices
Command: smartctl -l error /dev/sda
Error counter log:
          Errors Corrected    Total      Total   Correction     Gigabytes    Total
              delay:       [rereads/    errors   algorithm      processed    uncorrected
            minor | major  rewrites]  corrected  invocations   [10^9 bytes]  errors
read:     234237        0         0    234237     234237        605.516           0
write:         0        0         0         0          0       1457.589           0
Non-medium error count:        0
http://smartmontools.sourceforge.net/smartmontools_scsi.html
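
A monitoring script has to turn the `read:` and `write:` rows of this log into named counters. The Python sketch below shows one way to do that; the field names in `FIELDS` and the function `parse_error_counters` are hypothetical labels chosen to mirror the seven columns on the slide, not part of smartmontools.

```python
# Hypothetical names for the seven columns, in the order shown on the slide.
FIELDS = (
    "minor_delay", "major_delay", "rereads_rewrites", "total_corrected",
    "algorithm_invocations", "gigabytes_processed", "total_uncorrected",
)


def parse_error_counters(text):
    """Return {"read": {...}, "write": {...}} from an error counter log."""
    rows = {}
    for line in text.splitlines():
        label, sep, rest = line.strip().partition(":")
        values = rest.split()
        # Keep only the read/write rows that carry all seven columns.
        if sep and label in ("read", "write") and len(values) == len(FIELDS):
            rows[label] = {
                name: float(v) if name == "gigabytes_processed" else int(v)
                for name, v in zip(FIELDS, values)
            }
    return rows


log = """Error counter log:
read:     234237        0         0    234237     234237        605.516           0
write:         0        0         0         0          0       1457.589           0
Non-medium error count:        0
"""
print(parse_error_counters(log)["read"]["minor_delay"])
```

Header lines and the non-medium error count are skipped because their labels are not `read` or `write`.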

17 Corrected Errors
- Minor / Fast: correction algorithm works successfully, with no delay to reading later sectors. These are ok.
- Major / Slow: correction algorithm works successfully, but with a delay in reading later sectors. Not so good.
Uncorrected Errors
- Correction algorithm fails. Very bad.
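
The three severity levels on this slide map naturally onto a small classification function. This is only a sketch: the function name `classify_errors` and the returned strings are assumptions, and the talk does not say what thresholds the real script used, so any non-zero count is treated as significant here.

```python
def classify_errors(minor, major, uncorrected):
    """Return the worst severity implied by the three counters (sketch)."""
    if uncorrected > 0:
        return "very bad"      # correction algorithm failed
    if major > 0:
        return "not so good"   # corrected, but delayed later reads
    return "ok"                # only fast corrections, or no errors at all


# Read counters from the smartctl sample on slide 16.
print(classify_errors(minor=234237, major=0, uncorrected=0))
```

Checking the counters from worst to least severe means a disk with both delayed and uncorrected errors is reported at the more serious level.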

18 Other Information
- Total [rereads/rewrites]: errors corrected by applying retries
- Total errors corrected: number of all correctable errors
- Correction algorithm invocations: number of times the algorithm is used
- Gigabytes processed: amount of data successfully and unsuccessfully read or written, in units of 10^9 bytes

19 [Annotated error counter output]
- This indicates there might be a problem
- This should be a flag as well
- This is ok: it's correcting the errors and not losing any time doing so

20 Monitors
- Read Uncorrected Errors
- Read Delayed Errors
- Read No Delay Errors
- Write Uncorrected Errors
- Write Delayed Errors
- Write No Delay Errors
- Total Uncorrected Errors
- Total Delayed Errors
Collects Data Once a Day
errorsWatch
- Noma
- Don
- Tori
- Cob
- Morab
- Orlov

