Monitoring Temperature and Fan Speed Using Ganglia and Winbond Chips Caitie McCaffrey, Yemi Adesanya August 2006.

“The SLAC Computing Services Group is dedicated to providing leadership and support in computing and communications to the laboratory as a whole, and to physics research, in particular”
Major Concerns
- Power consumption
- Cooling
- Monitoring

What Is My Computer Doing?
Things to watch:
- I/O rate
- CPU usage
- Memory usage
- Temperature
- Fan speed
- Load
Monitoring Software
- low overhead
- scalable
- low impact on individual machines

“Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids”
- Scalable: overhead increases with the number of clusters, not the number of nodes
- Works on multiple operating systems
- Uses a Round Robin Database (RRD)
- Measures metrics like CPU usage, load, I/O rate, and memory usage
- Components: GMOND, GMETAD, GMETRIC

Ganglia Architecture
[Diagram with nodes labeled A, B, and C]
- Cluster One: all machines know the state of the entire cluster
- Cluster Two: machines 1 and 3 know the state of the entire cluster
- Updates RRD, polls clusters periodically

GMETRIC
- Allows users to monitor additional metrics beyond the core set monitored by the gmond daemon
- Each metric has a Name, Value, Type, and Units
- Example: gmetric --conf=/var/ganglia/gmond.conf -n CPUTemp1 -v 75 -t uint8 -u Celsius
- Good because it allows us to be more machine specific: we can monitor temperature and fan speed
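As a rough illustration of the four fields above (name, value, type, units), the sketch below assembles a gmetric command line in Python. The helper name `build_gmetric_command` is hypothetical and the config path is the one shown on the slide. The function only builds the argument list; running it (for example with `subprocess.run`) assumes gmetric is installed on the host.

```python
from typing import List, Union

# Value types accepted by gmetric's --type option.
GMETRIC_TYPES = {"string", "int8", "uint8", "int16", "uint16",
                 "int32", "uint32", "float", "double"}


def build_gmetric_command(name: str, value: Union[int, float, str],
                          metric_type: str, units: str,
                          conf: str = "/var/ganglia/gmond.conf") -> List[str]:
    """Return the argv list for publishing one custom metric via gmetric.

    Hypothetical helper: it does not run anything, it only formats arguments.
    """
    if metric_type not in GMETRIC_TYPES:
        raise ValueError(f"unsupported gmetric type: {metric_type}")
    return ["gmetric", f"--conf={conf}",
            "--name", name, "--value", str(value),
            "--type", metric_type, "--units", units]


cmd = build_gmetric_command("CPUTemp1", 75, "uint8", "Celsius")
print(" ".join(cmd))
```

Passing the command as a list rather than a single shell string avoids quoting problems if a metric name or unit ever contains spaces.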

A little bit on hardware
Noma batch machines
- Tyan Thunder LE-T motherboard
- Winbond W83782D (lm_sensors compatible)
- 2 Pentium III processors
Why is temperature important?
- Chip specifications give a temperature range
- Behavior is unpredictable outside that temperature range
- Clues to weird machine behavior
- Pentiums have a max temp of 77°-82°C
[Image: Tyan Thunder LE-T]

What’s a Noma?
- Horse from Noma County, Japan
- Smallest native Japanese pony ([...] hands)
- Super rare: 27 pure blood Nomas left (1988)
Some more machines
- COB
- DON
- TORI
- MORAB
- ORLOV
- NOMA

$ sensors
w83782d-i2c-0-29
Adapter: SMBus PIIX4 adapter at 0580
Algorithm: Non-I2C SMBus adapter
VCore 1: V (min = V, max = V)
VCore 2: V (min = V, max = V)
+3.3V: V (min = V, max = V)
+5V: V (min = V, max = V)
+12V: V (min = V, max = V)
-12V: V (min = V, max = V)
-5V: V (min = V, max = V)
V5SB: V (min = V, max = V)
VBat: V (min = V, max = V)
fan1: 8231 RPM (min = 3000 RPM, div = 2)
fan2: 8333 RPM (min = 3000 RPM, div = 2)
fan3: 0 RPM (min = 3000 RPM, div = 2)
temp1: +77°C (limit = +60°C) sensor = thermistor ALARM
temp2: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor ALARM
temp3: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor ALARM
vid: V
alarms: Chassis intrusion detection ALARM
beep_enable: Sound alarm disabled

Perl
Fills the gap between low-level languages like C and C++ and high-level languages like shell.
- mostly fast
- basically unlimited
- good for working with text
- portable
Regular Expressions
/^temp([0-9]):\s+\+([0-9]+\.*[0-9]*)/
matches
temp1: +77°C (limit = +60°C) sensor = thermistor
temp2: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor
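To see what the regular expression above captures, here is a minimal sketch that applies the same pattern to a few lines of `sensors` output. It is written in Python rather than the Perl used in the talk, and the function name `parse_temperatures` is made up for this illustration. The first capture group is the sensor number and the second is the reading.

```python
import re
from typing import Dict

# Same pattern as on the slide. re.MULTILINE lets ^ match at the start of
# every line, so the whole sensors output can be scanned in one pass.
TEMP_RE = re.compile(r"^temp([0-9]):\s+\+([0-9]+\.*[0-9]*)", re.MULTILINE)


def parse_temperatures(sensors_output: str) -> Dict[int, float]:
    """Map sensor number to temperature in degrees Celsius."""
    return {int(idx): float(val) for idx, val in TEMP_RE.findall(sensors_output)}


sample = (
    "fan1: 8231 RPM (min = 3000 RPM, div = 2)\n"
    "temp1: +77°C (limit = +60°C) sensor = thermistor ALARM\n"
    "temp2: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor ALARM\n"
    "temp3: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor ALARM\n"
)
temps = parse_temperatures(sample)
print(temps)  # {1: 77.0, 2: 65.0, 3: 65.0}
```

Note that the pattern requires a leading `+`, so a negative reading would not match; that is unlikely to matter for CPU temperatures.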

Sample Time - Decreasing
Want sample time to decrease faster when temperatures are changing faster.
Parameters
- Trigger = 0.5 degrees
- Decrement = 0.9
- MaxTime = 15 minutes
- MinTime = 1 minute
New time = old time * Decrement ^ (Change / Trigger) *
* If new time < MinTime, then new time = MinTime
New time = ... * 0.9 ^ (1 / 0.5) = ...
Sample output:
Time interval = ... minutes
Fri Aug 11 03:04:05 PDT 2006
FanSpeed
FanSpeed
Temp 1: 77 Change: 0
Temp 2: 64.0 Change: 0
Temp 3: 64.0 Change: 1
Time interval = ... minutes
Fri Aug 11 03:16:15 PDT 2006
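The decrease rule can be sketched as a small function. This is an illustration under stated assumptions, not the original Perl script: the function name is hypothetical, the starting interval of 15 minutes (MaxTime) in the example is assumed because the slide's own numbers did not survive, and taking the absolute value of the change is an assumption so that falling temperatures also shorten the interval.

```python
TRIGGER = 0.5     # degrees of change that count as one "step"
DECREMENT = 0.9   # shrink factor applied once per step
MIN_TIME = 1.0    # minutes, lower bound on the sample interval


def decrease_interval(old_time: float, change: float) -> float:
    """Shrink the sample interval; larger temperature changes shrink it more."""
    steps = abs(change) / TRIGGER          # assumption: magnitude of the change
    new_time = old_time * DECREMENT ** steps
    return max(new_time, MIN_TIME)         # never sample more often than MinTime


# Assumed example: a 15-minute interval and a 1-degree change
# give 15 * 0.9 ** 2, about 12.15 minutes.
print(round(decrease_interval(15.0, 1.0), 2))
```

Because the change appears in the exponent, a 2-degree change shrinks the interval by 0.9 ** 4 rather than 0.9 ** 2, which is what makes the sample time drop faster when temperatures move faster.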

Sample Time - Increasing
Want sample time to increase when temperature is changing slowly or not at all. *
* If we increase by large amounts we could miss valuable data.
Parameters
- Trigger = 0.5 degrees
- Decrement = 0.9
- MaxTime = 15 minutes
- MinTime = 1 minute
NewTime = OldTime / Decrement
NewTime = ... / 0.9 = 13.5
Sample output:
Time interval = ... minutes
Fri Aug 11 08:25:18 PDT 2006
Found FanSpeed
Found FanSpeed
Temp 1: 77 Change: 0
Temp 2: 64.0 Change: 0
Temp 3: 64.0 Change: 0
Time interval = 13.5 minutes
Fri Aug 11 08:37:28 PDT 2006
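Putting the increase and decrease rules together, one possible update step looks like the sketch below. Several details are assumptions, since the slides do not spell them out: the function name `next_interval`, using the largest change across all sensors, treating a change at or above Trigger as "changing fast", and clamping the result to MaxTime. The old interval of 12.15 minutes in the example is also assumed; it is simply the value that yields the 13.5 minutes shown above.

```python
from typing import Iterable


def next_interval(old_time: float, changes: Iterable[float],
                  trigger: float = 0.5, decrement: float = 0.9,
                  min_time: float = 1.0, max_time: float = 15.0) -> float:
    """Return the next sample interval in minutes."""
    largest = max((abs(c) for c in changes), default=0.0)
    if largest >= trigger:
        # Temperatures are moving: shrink, faster for bigger changes.
        new_time = old_time * decrement ** (largest / trigger)
    else:
        # Little or no change: grow gently, one decrement step at a time.
        new_time = old_time / decrement
    return min(max(new_time, min_time), max_time)


print(round(next_interval(12.15, [0, 0, 0]), 2))  # 13.5
```

Growing by a single division per sample keeps the increase small, in line with the slide's warning that large increases could miss valuable data.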

[Figure: graphs for noma0450 and noma0449]

Up and running on two Nomas currently
- Noma0449
- Noma0450
Will be installed on all Nomas.
Can be used on any Ganglia-monitored machine with a compatible Winbond chip.
Many thanks to the DOE, the SCCS systems group, and especially Yemi Adesanya, John Goebel, and Karl Amrhein for all their help throughout the summer.

Smartmontools for SCSI devices
Command: smartctl -l error /dev/sda
Error counter log:
          Errors Corrected    Total      Total   Correction     Gigabytes    Total
              delay:       [rereads/    errors   algorithm      processed    uncorrected
            minor | major  rewrites]  corrected  invocations   [10^9 bytes]  errors
read:
write:
Non-medium error count: 0

Corrected Errors
Minor / Fast
- Correction algorithm works successfully
- No delay to reading later sectors
- These are ok
Major / Slow
- Correction algorithm works successfully
- Delay in reading later sectors
- Not so good
Uncorrected Errors
- Correction algorithm fails
- Very bad

Other Information
- Total [rereads/rewrites]: errors corrected by applying retries
- Total errors corrected: number of all correctable errors
- Correction algorithm invocations: number of times the algorithm is used
- Gigabytes processed: number of bytes successfully and unsuccessfully read or written (in units of 10^9 bytes)
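As a sketch of how the seven columns described above could be read and graded, the Python below parses the `read:` and `write:` rows and applies the ok / not so good / very bad rule from the Corrected Errors slide. The class and function names are hypothetical, and the sample numbers are invented for illustration because the values on the original slide did not survive.

```python
from typing import Dict, NamedTuple


class ErrorCounters(NamedTuple):
    minor_delay: int            # corrected with no delay
    major_delay: int            # corrected, but later sectors were delayed
    rereads_rewrites: int       # corrected by retrying
    total_corrected: int
    algorithm_invocations: int
    gigabytes_processed: float  # in units of 10^9 bytes
    uncorrected: int


def parse_error_log(text: str) -> Dict[str, ErrorCounters]:
    """Extract the 'read' and 'write' rows of an error counter log."""
    rows = {}
    for line in text.splitlines():
        label, _, rest = line.strip().partition(":")
        fields = rest.split()
        if label in ("read", "write") and len(fields) == 7:
            ints = [int(f) for f in fields[:5]]
            rows[label] = ErrorCounters(*ints, float(fields[5]), int(fields[6]))
    return rows


def severity(c: ErrorCounters) -> str:
    """Grade one row using the categories from the slides."""
    if c.uncorrected > 0:
        return "very bad"
    if c.major_delay > 0:
        return "not so good"
    return "ok"


# Hypothetical numbers, not taken from the slides.
sample = """\
Error counter log:
read:        120        3         0       123        123         45.678           0
write:         0        0         0         0          0         12.345           1
Non-medium error count:        0
"""
for op, counters in parse_error_log(sample).items():
    print(op, severity(counters))
```

Rows that do not have exactly seven fields are skipped, so header lines and the "Non-medium error count" line are ignored rather than misread.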

- This indicates there might be a problem
- This should be a flag as well
- This is ok: it's correcting the errors and not losing any time doing so

errorsWatch
Monitors
- Read Uncorrected Errors
- Read Delayed Errors
- Read No Delay Errors
- Write Uncorrected Errors
- Write Delayed Errors
- Write No Delay Errors
- Total Uncorrected Errors
- Total Delayed Errors
Collects data once a day
Machines
- Noma
- Don
- Tori
- Cob
- Morab
- Orlov