Experiments Monitoring Plans and Progress


2 Experiments Monitoring Plans and Progress
Summary by Alberto AIMAR (CERN). Input from ALICE, ATLAS, CMS, LHCb.

3 Outline
- ALICE
- ATLAS
- CMS
- LHCb
- Summary

4 ALICE/CERN monitoring and QA
(slides from Jacek Otwinowski and Costin Grigoras) ALICE - 1

5 Data processing monitoring
Current situation:
- Grid infrastructure monitored by MonALISA: jobs (resources, status), data servers, distributed services status, etc.
- Same solution used for DAQ, with a custom real-time display
- Online (High Level Trigger) cluster monitored by Zabbix
Online/Offline (O2) in Run 3/4:
- Particular constraints on the performance of the future system: 100k sources (processes), 600 kHz data points, <500 ms latency
- Considered solutions: Zabbix, MonALISA, a monitoring stack (collectd/Flume/InfluxDB/Grafana/Kafka/Spark/…); a sketch of the Kafka data path follows below
- Final solution to be chosen after benchmarking the above
ALICE - 2
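As an illustration of the collectd/Kafka-style stack listed above, here is a minimal sketch of how one O2 process could publish a metric point to Kafka from Python. The broker address, topic and field names are assumptions for illustration, not part of the actual ALICE design.

```python
import json
import time
from kafka import KafkaProducer  # kafka-python client

# Hypothetical broker and topic; the real O2 monitoring endpoints are not listed here.
producer = KafkaProducer(
    bootstrap_servers="monit-kafka.example.cern.ch:9092",
    value_serializer=lambda doc: json.dumps(doc).encode("utf-8"),
)

# One data point from one of the ~100k sources (field names are illustrative only).
point = {
    "source": "o2-flp-042",
    "metric": "readout.buffer_usage",
    "value": 0.73,
    "timestamp_ms": int(time.time() * 1000),
}

producer.send("o2.monitoring", value=point)
producer.flush()  # block until the point has actually been handed to the broker
```

At the quoted 600 kHz aggregate rate the points would of course be batched and shipped by a collector such as collectd or Flume rather than sent one call at a time; the snippet only shows the data path.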

6 Quality Assurance in Run2
Data Quality Monitoring / Online QA:
- Detector performance monitored by the Automatic MOnitoRing Environment (AMORE)
- Agents are monitoring processes (they receive and analyze raw data samples)
- Visualization with a dedicated AMORE GUI
- Detector and tracking performance on the High Level Trigger
- Data producers and mergers communicate via ZeroMQ (see the sketch below)
- OVERWATCH system for detector performance monitoring via a web interface
Offline QA:
- Processing on the grid with AliEn
- Post-processing on the QA cluster at CERN
- QA output (ROOT histograms and trees) stored on the AFS/EOS file systems
- Custom visualization tools
ALICE - 3
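The producer/merger hand-off over ZeroMQ can be sketched as a simple PUSH/PULL pair. The endpoint, payload fields and the fact that both sides run in one process here are purely for illustration.

```python
import zmq

ctx = zmq.Context()

# Merger side: collects QA payloads pushed by many producer agents.
merger = ctx.socket(zmq.PULL)
merger.bind("tcp://127.0.0.1:5555")

# Producer side: in the real system this runs in a separate HLT process.
producer = ctx.socket(zmq.PUSH)
producer.connect("tcp://127.0.0.1:5555")
producer.send_json({"detector": "TPC", "object": "cluster_charge_hist", "entries": 12345})

# The merger would loop over recv_json() and merge histograms; here we read one message.
print(merger.recv_json())
```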

7 Quality Assurance in Run3/4
Online/Offline (O2) in Run 3/4:
- Parallel synchronous and asynchronous data processing (ALFA framework with FairMQ)
- QA repository/database
  - NoSQL database (Elasticsearch, Cassandra or others) for metadata information
  - EOS file system for storing QA objects (ROOT histograms and trees)
- Data aggregation and enrichment
  - Custom solution based on ROOT (friend trees)
  - Apache Kafka is being considered
- Interactive QA analysis
  - SWAN/Jupyter notebooks
  - Custom database queries using Python and C++ APIs
  - Custom visualization tools based on JSROOT
ALICE - 4

8 Example QA in Run 3/4
Conversion of ROOT objects to JSON; Elasticsearch at CERN; Python API + Jupyter/SWAN (see the sketch below)
ALICE - 5
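A minimal sketch of what the Elasticsearch/Python side of such a QA example might look like from a SWAN/Jupyter notebook, using the standard elasticsearch-py client. The endpoint, index name and document fields are assumptions for illustration, not the actual ALICE schema.

```python
from elasticsearch import Elasticsearch

# Hypothetical cluster endpoint and index name.
es = Elasticsearch(["https://es-alice.example.cern.ch:9200"])

# Index one QA metadata record; the ROOT objects themselves stay on EOS.
es.index(index="alice-qa-run3", body={
    "run": 123456,
    "detector": "ITS",
    "pass": "apass1",
    "qa_object": "root://eos.example.cern.ch//alice/qa/123456/its_clusters.root",
    "flag": "BAD",
})

# Query it back from the notebook, e.g. all runs flagged BAD for one detector.
result = es.search(index="alice-qa-run3", body={
    "query": {"bool": {"must": [
        {"match": {"detector": "ITS"}},
        {"match": {"flag": "BAD"}},
    ]}}
})
for hit in result["hits"]["hits"]:
    print(hit["_source"]["run"], hit["_source"]["qa_object"])
```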

9 ATLAS Monitoring
(input from M. Lassnig, A. Di Girolamo, A. Filipcic, I. Vukotic)

10 ATLAS Monitoring and Analytics
Understand distributed systems:
- Usage characterisation: what do our users do? What do our systems do?
- Performance characterisation: how long does it take to retrieve? And to insert? …
Monitor resource usage:
- CPU consumption, efficiency, usage by activities, site efficiency
- Storage usage, reliability and throughput
- Data migration, transfers, archival
Key capabilities:
- Correlate: data from multiple systems
- Model: using raw and aggregated data, data mining and machine learning toolkits
- Ad hoc: analytics for user-requested questions
- Support: documentation and expert help
ATLAS - 1

11 Baseline Infrastructure for new monitoring
ElasticSearch:
- Dedicated instance for ATLAS hosted by CERN IT (IT ES)
- Instance hosted by the University of Chicago
- As part of MONIT, with curated and enriched data
Notebooks:
- Zeppelin instance, part of MONIT and dedicated to ADC
Hadoop:
- HDFS to store raw and aggregated data
- Preparation of data for ElasticSearch ingestion
- Machine learning with Spark (see the sketch below)
ATLAS - 2
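To make the "machine learning with Spark" item concrete, here is a minimal PySpark sketch that clusters aggregated job records read from HDFS. The HDFS path, column names and the choice of k-means are assumptions, not the actual ATLAS analytics code.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("adc-job-clustering").getOrCreate()

# Hypothetical aggregated job metrics previously written to HDFS as JSON.
jobs = spark.read.json("hdfs:///atlas/analytics/jobs_aggregated")
jobs = jobs.na.drop(subset=["walltime", "cpu_eff", "input_gb"])

# Assemble a feature vector and cluster the jobs into a few groups.
assembler = VectorAssembler(inputCols=["walltime", "cpu_eff", "input_gb"], outputCol="features")
features = assembler.transform(jobs)
model = KMeans(k=4, seed=42, featuresCol="features").fit(features)

# Cluster sizes give a first characterisation of the workload mix.
model.transform(features).groupBy("prediction").count().show()

spark.stop()
```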

12 ATLAS and MONIT
ATLAS ADC Analytics and Monitoring launched working groups to check, improve and implement the ATLAS MONIT dashboards and reports:
- DDM and DDM Accounting
  - Initial dashboards from MONIT ongoing
  - Need checking and improvements by experts
- Running jobs
  - Initial dashboards by MONIT
  - First beta job monitoring / job accounting dashboards
  - Running job slots were needed to start with the dashboards
- Sites monitoring
  - SAM test results available in MONIT
  - A few features being developed (site grouping, availability profiles)
  - ASAP metrics (HammerCloud) modified to send results as JSON
ATLAS - 2

13 ATLAS and ElasticSearch
UChicago hardware-based cluster:
- 8 nodes total (each 8 cores, 64 GB RAM); 5 data nodes (3x1 TB SSD each); 3 master nodes (3 masters, 1 indexer, 2 Kibana); 10 Gbps NICs
- Contents: 15,000,000,000 docs
- Clients: Kibana, notebooks, embedded visualisations, cron jobs
Indices in production across IT ES (8) and UChicago (8) = 82 TB/year
Few options available:
- Rolling buffers: cannot look further back than 30 days, which is not enough for our reports
- Throw away data: painful to know upfront what could be important, especially for operations
- Redo all dashboards from Kibana/Grafana in Spark notebooks: not enough human capacity for this
ATLAS - 3

14 ATLAS and Hadoop @ CERN
Exclusively use the analytics cluster at CERN: 40 nodes, 2 TB RAM total, 2 PB storage total
Rucio (Data Management):
- Dumps from Oracle
- Flume from DDM servers and daemons (REST calls, logs, …)
- Custom servlet to serve HTTP streams directly from HDFS
- Reporting jobs, with results shown via notebooks
- Kerberos authentication via the (cumbersome) acron method
ATLAS - 5

15 ES and Hadoop Feedback
Feedback:
- Users are much more comfortable now than half a year ago
  - However, inexperienced users can hit the infrastructure hard
  - Documentation is difficult and the systems change fast; it is easier to ask colleagues
- ElasticSearch and Hadoop are both critical and used in production
  - Most of the tools/capabilities are available
  - The interplay between tools is neither ideal nor efficient; lots of custom-made duct tape
- Hadoop is working well; ES is sufficient for now (until late 2017)
  - Limiting for the applications (sliding windows, throwing away data, …)
  - Upgrades necessary with respect to storage space and I/O rates, at least on the IT ES cluster
Run 3/4 considerations:
- 10x increase of rates and volume, in line with the WFMS and DDM upgrades
- Event-level processing workflows will require unprecedented instrumentation
ATLAS - 6

16 Wishlist
Monitoring data volume:
- The data volume is huge; throwing away old data limits the usability
- Deeper studies require data in ES for at least a year, preferably longer
Limited functionality:
- Grafana and Kibana are not fully featured and crucial elements are missing (e.g. log scale); how can they be extended?
Data reliability:
- Data/figure/plot validation (e.g. for the RRB) is mandatory; it needs a lot of experience and help from dedicated experts
Presentation quality:
- The quality of presentation material needs to be at a professional level
ATLAS - 7

17 CMS Grid Monitoring and Migration to MONIT
(slides from Carl Vuosalo for CMS Monitoring Forum)

18 CMS Grid Monitoring and Migration to MONIT
CMS monitors its grid resources with a variety of tools and is in the process of migrating to CERN's standard MONIT framework. CMS monitoring covers many kinds of resources:
- CMSWeb: front-end tool for displaying monitoring results
- Data Popularity: dataset usage
- SpaceMon: monitoring disk usage across the CMS grid
- Site Status Board: shows the status of CMS grid sites
- CRAB/ASO: monitors files transferred for user grid jobs
- HTCondor job monitoring
- WMAgent: production job submission and reporting
- WMArchive: grid job reports
CMS - 1

19 CMSWeb and Data Popularity
CMSWeb:
- CMSWeb hosts and displays other CMS services
- Already uses ElasticSearch and Kibana; planning to migrate
Data Popularity:
- Data Popularity tracks the usage of CMS datasets by CMS jobs
- It uses CMS tools to aggregate and display results
- Planning to migrate to the CERN MONIT infrastructure this summer
- Store and aggregate results using the Hadoop file system and Spark (see the sketch below)
- Display results with Grafana and Kibana
CMS - 2
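A sketch of the kind of Spark aggregation that could feed such displays: count dataset accesses per day from access records stored on HDFS. The input/output paths and field names are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cms-data-popularity").getOrCreate()

# Hypothetical raw access records (one JSON document per dataset access).
accesses = spark.read.json("hdfs:///cms/monitoring/popularity/raw/2017/*")

daily = (accesses
         .withColumn("day", F.to_date("access_time"))
         .groupBy("dataset", "day")
         .agg(F.count("*").alias("n_accesses"),
              F.countDistinct("user").alias("n_users")))

# Aggregated output that a Grafana/Kibana data source could then pick up.
daily.write.mode("overwrite").json("hdfs:///cms/monitoring/popularity/daily")

spark.stop()
```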

20 SpaceMon and Site Status Board
CMS SpaceMon:
- SpaceMon tracks disk usage at CMS sites
- A MONIT prototype that uses ElasticSearch and Kibana has been developed
- The data source will move to a message broker (ActiveMQ) instead of Flume
SSB:
- The Site Status Board displays the status of CMS computing sites
- Migrating to MONIT
- Sending data as JSON instead of XML (see the sketch below)
- Will implement MONIT features (dashboards, reports and extraction via APIs)
CMS - 3
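The "JSON instead of XML" step could look like the following stomp.py sketch that publishes one SSB-style document to an ActiveMQ broker. The broker host, credentials, topic and field names are made up for illustration.

```python
import json
import stomp  # stomp.py client

# Hypothetical MONIT broker endpoint and credentials.
conn = stomp.Connection(host_and_ports=[("monit-broker.example.cern.ch", 61613)])
conn.connect("cms-ssb", "not-a-real-password", wait=True)

document = {
    "producer": "cms-ssb",
    "type": "site-status",
    "site": "T2_CH_CERN",
    "status": "enabled",
    "timestamp": 1496736000,
}

# One JSON document per metric update, instead of the old XML feed.
conn.send(destination="/topic/cms.ssb.metrics",
          body=json.dumps(document),
          headers={"content-type": "application/json"})
conn.disconnect()
```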

21 CMS SpaceMon

22 SSB - GlideinWMS Metrics

23 CRAB/ASO Ops and CMS Job Monitoring
CRAB and ASO Monitoring:
- CRAB is the CMS tool for user job submission
- ASO is used by CRAB for file transfer
- ASO has migrated all dashboards to MONIT
- Grafana dashboards developed, backed by Spark jobs
CMS Job Monitoring (data via HTCondor):
- HTCondor job monitoring prototype (see the sketch below)
- Send raw data from jobs to MONIT via the AMQ message broker
- Perform aggregation in MONIT
- Use Grafana for dashboards
CMS - 4
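For the HTCondor-based prototype, the raw job data could be pulled from a schedd with the HTCondor Python bindings along these lines, before being shipped to AMQ as in the previous sketch. The attribute list and the local-schedd assumption are illustrative only.

```python
import htcondor  # HTCondor Python bindings

# Query the local schedd for a few standard job ClassAd attributes.
schedd = htcondor.Schedd()
ads = schedd.query("true", ["ClusterId", "ProcId", "JobStatus", "RemoteWallClockTime"])

# ClassAds behave like mappings, so they convert to JSON-friendly dictionaries.
records = [{key: str(value) for key, value in ad.items()} for ad in ads]
print(records[:5])
```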

24 CRAB and ASO Monitoring

25 CMS Job Monitoring (data via HTCondor)

26 WMAgent and WMArchive
WMAgent:
- WMAgent is used for production job submission and reporting
- All dashboards are now in MONIT
- Using Grafana; some fine-tuning still needed
WMArchive:
- WMArchive stores Framework Job Reports from production jobs
- Migrating aggregated results into the CERN MONIT system
- WMArchive and WMCore added support for the ActiveMQ broker (STOMP protocol)
- Now streaming data to the MONIT broker
- Developing a periodic Spark job for sending data
- Creating Kibana displays and Grafana plots
CMS - 5

27 WMAgent

28 WMArchive (aggregated job reports)

29 LHCb Monitoring
(input from F. Stagni and A. McNab) LHCb - 1

30 Monitoring system
Based on the DIRAC framework and designed for:
- real-time monitoring (WMS jobs, DIRAC components, etc.)
- managing semi-structured data (JSON)
- efficient data storage, data analysis and retrieval
- providing high-quality reports
Technologies used:
- Elasticsearch, a distributed search and analytics engine
- DIRAC Graph library for creating high-quality plots
- DIRAC web framework (front end) for visualizing the plots
- Messaging queue systems: AMQ (via STOMP) and RabbitMQ (see the sketch below)
- Kibana, a flexible analytics and visualization framework (optional: for component monitoring)
LHCb - 2
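A minimal sketch of pushing one monitoring record into RabbitMQ with pika, as one of the messaging options listed above. The host, queue name and record fields are assumptions, not the actual DIRAC configuration.

```python
import json
import pika

# Hypothetical broker and queue.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq.example.cern.ch"))
channel = connection.channel()
channel.queue_declare(queue="lhcb.wms.monitoring", durable=True)

record = {
    "component": "WorkloadManagement/JobManager",
    "status": "Done",
    "site": "LCG.CERN.cern",
    "timestamp": "2017-06-06T12:00:00Z",
}

# Consumers on the other side would read these records and index them in Elasticsearch.
channel.basic_publish(exchange="", routing_key="lhcb.wms.monitoring", body=json.dumps(record))
connection.close()
```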

31 Overview of the system LHCb - 3

32 Monitoring web server LHCb - 4

33 Experience and Plans
Experience:
- LHCb has been using the Monitoring System since October 2016
- Elasticsearch clusters are provided by CERN (one for production, one for certification)
- Size: 108 GB, 1200 indexes, 1280 million documents
- 8-node cluster: 3 master, 2 search and 3 data nodes
Plans:
- More monitoring data
- Centralized logging of DIRAC components, based directly on Elasticsearch (currently in a pre-production version)
- Job analytics: worker node, memory, CPU, execution time, etc.
LHCb - 5

34 Pilot Monitoring (via NagiosProbes)
The WLCG SAM tests (ETF) are plugins for Nagios that check what is working:
- either on the SAM/ETF central machines
- or inside jobs submitted by SAM/ETF
NagiosProbes is an extension to the next-generation Pilot which can run these tests:
- It uses the Nagios plugin API, so the same scripts can be reused (see the sketch below)
- Its test results are currently logged to an HTTPS server
- This will be replaced by logging to the pilot logging service
LHCb - 6
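Since the probes follow the Nagios plugin convention (a one-line status message plus exit code 0/1/2/3), a pilot-side test can be as small as the sketch below. The CVMFS check is a hypothetical example, not one of the actual ETF probes.

```python
#!/usr/bin/env python
"""A minimal Nagios-style probe of the kind NagiosProbes could run inside a pilot."""
import os
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def main():
    path = "/cvmfs/lhcb.cern.ch"  # hypothetical check target
    if not os.path.isdir(path):
        print("CRITICAL - %s is not mounted" % path)
        return CRITICAL
    print("OK - %s is available" % path)
    return OK

if __name__ == "__main__":
    sys.exit(main())
```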

35 Pilot NagiosProbes for SAM/ETF
- The output of the probes is retrieved by a script we supply to SAM/ETF
  - Itself a Nagios plugin, it returns the original probe's outputs and return code
  - We can hide the HTTPS-to-Pilot-Logging transition since it is all inside our script
- A prototype of this is now working
  - Running on etf-lhcb-preprod
  - Once in production, VM-based resources will immediately appear in the WLCG SAM dashboards / reports
LHCb - 7

36 Summary
- All experiments are looking at external technologies instead of fully in-house solutions
  - ElasticSearch, Kibana, Kafka, messaging, Spark, HDFS
  - Less custom-made; less control, but huge communities and free improvements
- ATLAS and CMS
  - have their own infrastructure for deeper analytics studies
  - rely on MONIT for monitoring and as the migration path from the WLCG dashboards
  - use MONIT curated data with aggregation, enrichment, processing, buffering, dashboards, etc.
- ALICE and LHCb
  - run their own infrastructure based on central IT services (ES, HDFS)
  - could share (some) data in MONIT, if needed

