Monitoring with InfluxDB & Grafana


Monitoring with InfluxDB & Grafana
Andrew Lahiff
HEP SYSMAN, Manchester, 15th January 2016

Overview
- Introduction
- InfluxDB
- InfluxDB at RAL
- Example dashboards & usage of Grafana
- Future work

Monitoring at RAL
- Ganglia used at RAL
  - have ~89,000 individual metrics
- Lots of problems
  - plots don't look good
  - difficult & time-consuming to make "nice" custom plots
    - we use Perl scripts; many are big, messy, complex, hard to maintain, and generate hundreds of errors in httpd logs whenever someone looks at a plot
    - the UI for custom plots is limited & makes bad plots anyway
  - gmond sometimes uses lots of memory & kills other things
  - doesn't handle dynamic resources well
  - not suitable for Ceph

A possible alternative
- InfluxDB + Grafana
  - InfluxDB is a time-series database
  - Grafana is a metrics dashboard
- Benefits
  - both are very easy to install: install the rpm, then start the service
  - easy to put data into InfluxDB
  - easy to make nice plots in Grafana

Monitoring at RAL Go from Ganglia to Grafana

InfluxDB
- Time series database
- Written in Go - no external dependencies
- SQL-like query language - InfluxQL
- Distributed (or not)
  - can be run as a single node
  - can be run as a cluster for redundancy & performance (will come back to this later)
- Data can be written into InfluxDB in many ways
  - REST API (e.g. from Python)
  - Graphite, collectd

InfluxDB
- Data is organized by time series, grouped together into databases
- A time series can have zero to many points
- Each point consists of
  - a time
  - a measurement, e.g. cpu_load
  - at least one key-value field, e.g. value=5
  - zero to many tags containing metadata, e.g. host=lcg1423.gridpp.rl.ac.uk

InfluxDB
- Points are written into InfluxDB using the line protocol format:
  <measurement>[,<tag-key>=<tag-value>...] <field-key>=<field-value>[,<field2-key>=<field2-value>...] [timestamp]
- Example for an FTS3 server:
  active_transfers,host=lcgfts01,vo=atlas value=21
- Can write multiple points in batches to get better performance (this is recommended; see the sketch below)
  - example with 0.9.6.1-1 for 2000 points: sequentially 129.7 s, in a batch 0.16 s
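As a rough illustration of a batched write, the sketch below builds several line-protocol points and sends them in a single POST to the /write endpoint with python-requests; the hostname, database name, credentials and point values are placeholders, not the real RAL setup.

# Minimal sketch of a batched line-protocol write (placeholder host/db/credentials).
import requests

points = [
    ("lcgfts01", "atlas", 21),
    ("lcgfts01", "cms", 102),
]

# Build one line-protocol line per point and join them with newlines
body = "\n".join(
    "active_transfers,host=%s,vo=%s value=%d" % (host, vo, value)
    for host, vo, value in points
)

# A single request carrying many points is far faster than one request per point
resp = requests.post(
    "http://influxdb.example.com:8086/write",
    params={"db": "fts3"},
    auth=("user", "passwd"),
    data=body,
)
resp.raise_for_status()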

Retention policies
- A retention policy describes
  - duration: how long data is kept
  - replication factor: how many copies of the data are kept (only relevant for clusters)
- Can have multiple retention policies per database (see the sketch below)
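As a rough illustration (not taken from the talk), a pair of retention policies like the ones used in the continuous-query example on the next slide could be created through the /query endpoint; the database name "metrics", the policy names and the host/credentials are all assumptions.

# Sketch: create a short-lived default policy and a keep-forever policy
# (database name, policy names, host and credentials are placeholders).
import requests

statements = [
    'CREATE RETENTION POLICY "short_term" ON "metrics" DURATION 2h REPLICATION 1 DEFAULT',
    'CREATE RETENTION POLICY "forever" ON "metrics" DURATION INF REPLICATION 1',
]
for q in statements:
    r = requests.get(
        "http://influxdb.example.com:8086/query",
        params={"q": q},
        auth=("user", "passwd"),
    )
    r.raise_for_status()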

Continuous queries
- An InfluxQL query that runs automatically & periodically within a database
- Mainly useful for downsampling data
  - read data from one retention policy
  - write downsampled data into another
- Example: database with 2 retention policies
  - one with a 2 hour duration, one that keeps data forever
  - data with 1 second time resolution is kept for 2 hours; data with 30 min time resolution is kept forever
  - use a continuous query to aggregate the high time resolution data to 30 min resolution (sketched below)
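A sketch of such a continuous query, sent through the /query endpoint with python-requests; the database ("metrics"), policy names ("short_term", "forever"), measurement ("cpu_load"), host and credentials are placeholders, and the exact InfluxQL syntax may differ slightly between InfluxDB versions.

# Sketch: downsample 1-second data in "short_term" into 30-minute means in "forever"
# (all names are illustrative, not the real RAL configuration).
import requests

cq = (
    'CREATE CONTINUOUS QUERY "cpu_load_30m" ON "metrics" BEGIN '
    'SELECT mean(value) AS value '
    'INTO "metrics"."forever"."cpu_load" '
    'FROM "metrics"."short_term"."cpu_load" '
    'GROUP BY time(30m) END'
)
r = requests.get(
    "http://influxdb.example.com:8086/query",
    params={"q": cq},
    auth=("user", "passwd"),
)
r.raise_for_status()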

Example queries
> use arc
Using database arc
> show measurements
name: measurements
------------------
name
arex_heartbeat_lastseen
jobs

Example queries
> show tag keys from jobs
name: jobs
----------
tagKey
host
state

Example queries
> show tag values from jobs with key=host
name: hostTagValues
-------------------
host
arc-ce01
arc-ce02
arc-ce03
arc-ce04

Example queries
> select value,vo from active_transfers where host='lcgfts01' and time > now() - 3m
name: active_transfers
----------------------
time                             value  vo
2016-01-14T21:25:02.143556502Z   100    cms
2016-01-14T21:25:02.143556502Z   7      cms/becms
2016-01-14T21:26:01.256006762Z   102    cms
2016-01-14T21:26:01.256006762Z   8      cms/becms
2016-01-14T21:27:01.455021342Z   97     cms
2016-01-14T21:27:01.455021342Z   7      cms/becms
2016-01-14T21:27:01.455021342Z   1      cms/dcms
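The same query can also be run through the HTTP API rather than the CLI; a minimal sketch with python-requests, where the hostname, database name and credentials are placeholders:

# Sketch: run the query over the HTTP API and print the returned points
# (host, database and credentials are placeholders).
import requests

q = ("select value,vo from active_transfers "
     "where host='lcgfts01' and time > now() - 3m")
r = requests.get(
    "http://influxdb.example.com:8086/query",
    params={"db": "fts3", "q": q},
    auth=("user", "passwd"),
)
r.raise_for_status()
# The response is JSON: results -> series -> columns/values
for series in r.json()["results"][0].get("series", []):
    for row in series["values"]:
        print(row)  # e.g. ['2016-01-14T21:25:02.143556502Z', 100, 'cms']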

InfluxDB at RAL
- Single node instance
  - VM with 8 GB RAM, 4 cores
  - latest stable release of InfluxDB (0.9.6.1-1)
  - almost treated as a 'production' service
- What data is being sent to it?
  - mainly application-specific metrics
  - metrics from FTS3, HTCondor, ARC CEs, HAProxy, MariaDB, Mesos, OpenNebula, Windows Hypervisors, ...
- Cluster instance
  - currently just for testing
  - 6 bare-metal machines (ex worker nodes)
  - recent nightly build of InfluxDB

InfluxDB at RAL
- InfluxDB resource usage over the past month
- currently using 1 month retention policies (1 min time resolution)
- CPU usage negligible so far

Sending metrics to InfluxDB
- Python scripts, using python-requests
  - read InfluxDB host(s) from a config file, for future cluster use
  - pick one at random & try to write to it; if that fails, pick another ... (see the sketch below)
- Alternatively, can just use curl:
  curl -s -X POST "http://<hostname>:8086/write?db=test" -u user:passwd --data-binary "data,host=srv1 value=5"
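A minimal sketch of that write-with-failover logic; the host list, database name, credentials and example point are placeholders, not the real RAL script.

# Sketch: pick a random InfluxDB host and fall back to the others if the write fails
# (host list, database and point are illustrative).
import random
import requests

INFLUXDB_HOSTS = ["influxdb01.example.com", "influxdb02.example.com"]

def write_line(line, db="test"):
    hosts = list(INFLUXDB_HOSTS)
    random.shuffle(hosts)          # pick a host at random
    for host in hosts:             # ... and try the others if it fails
        try:
            r = requests.post(
                "http://%s:8086/write" % host,
                params={"db": db},
                auth=("user", "passwd"),
                data=line,
                timeout=5,
            )
            r.raise_for_status()
            return True
        except requests.RequestException:
            continue
    return False

write_line("data,host=srv1 value=5")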

Telegraf
- Collects metrics & sends them to InfluxDB
- Plugins for:
  - system (memory, load, CPU, network, disk, ...)
  - Apache, Elasticsearch, HAProxy, MySQL, Nginx, + many others
- Example system metrics in Grafana

Grafana – data sources

Grafana – adding a database Setup databases

Grafana – making a plot

Grafana – making a plot

Templating

Templating can select between different hosts, or all hosts

Templating

Example dashboards

HTCondor

Mesos

FTS3

Databases

Ceph

Load testing InfluxDB
- Can a single InfluxDB node handle large numbers of Telegraf instances sending data to it?
- Telegraf configured to measure load, CPU, memory, swap, disk
- Testing done the night before my HEPiX Fall 2015 talk
  - 189 instances sending data each minute to InfluxDB 0.9.4
  - had problems
- Testing yesterday
  - 412 instances sending data each minute to InfluxDB 0.9.6.1-1
  - no problems
  - couldn't try more – ran out of resources & couldn't create any more Telegraf containers

Current limitations
- (Grafana) long duration plots can be quite slow
  - e.g. a 1 month plot using 1-min resolution data
  - possible fix: people have requested that Grafana should be able to automatically select different retention policies depending on the time interval
- (InfluxDB) no way to automatically downsample all measurements in a database
  - need to have a continuous query per measurement
  - possible fix: people have requested that it should be possible to use regular expressions in continuous queries

Upcoming features
- Grafana – gauges & pie charts (in progress)

Future work
- Re-test clustering once it becomes stable/fully-functional
  - expected to be available in 0.10 at the end of January
  - also a new storage engine, query engine, ...
- Investigate Kapacitor
  - time-series data processing engine, real-time or batch
  - trigger events/alerts, or send processed data back to InfluxDB
  - anomaly detection from service metrics