Julia Andreeva: Monitoring of the job processing (analysis, production; real-time and historical views) for users, operation teams and sites; data management monitoring.


1 Julia Andreeva

2 Dashboard application areas and their audiences:
- Monitoring of the job processing (analysis, production; real-time and historical views): users, operation teams, sites
- Data management monitoring (data transfer, data access): operation teams, sites
- Site Status Board, site usability, SiteView: sites, operation teams
- Infrastructure monitoring: sites
- Publicity & dissemination (WLCG Google Earth Dashboard): general public

3 Applications and the experiments that use them:
Application                              ATLAS  CMS  LHCb  ALICE
Job monitoring (multiple applications)     ✓     ✓
Site Status Board                          ✓     ✓     ✓     ✓
SUM                                        ✓     ✓     ✓     ✓
DDM Monitoring                             ✓
SiteView & GoogleEarth                     ✓     ✓     ✓     ✓

4 (Same table as slide 3.) A global WLCG transfer monitor based on the ATLAS DDM Dashboard is coming soon.

5 (Same table as slide 3.) All applications are shared by 2, 3 or 4 experiments. All applications are developed in a common framework, which includes common building blocks, a build and test environment, a common module structure, agent management and a common repository.

6

7 The Dashboard framework exposes its data in a machine-readable format (JSON). The user interface is client-side, based on AJAX and JavaScript, and external applications can consume the same data. The UI is completely agnostic of the information source, which gives better flexibility: adding a new information source or replacing an existing one is a straightforward task, and the development tasks are clearly decoupled.
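Because the data is served as JSON, an external application can read the same endpoints as the UI. A minimal sketch, assuming a hypothetical endpoint URL and field names (not the actual Dashboard API):

```python
import json
import urllib.request

# Hypothetical Dashboard-style endpoint returning job summaries as JSON;
# the URL and the field names used below are illustrative assumptions.
URL = "https://dashboard.example.cern.ch/api/jobsummary?site=T2_XX&format=json"

def fetch_job_summary(url):
    """Fetch a JSON document and return it as Python objects."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))

if __name__ == "__main__":
    summary = fetch_job_summary(URL)
    # Any client (the AJAX UI or an external application) can work with the
    # same structure without knowing which information source produced it.
    for row in summary.get("jobs", []):
        print(row.get("site"), row.get("status"), row.get("count"))
```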

8 Over the last months the Dashboard UIs have been redesigned: a client-side Model-View-Controller architecture using jQuery and AJAX, with full bookmarking support. A lot of effort went into evaluating the design of large-scale JavaScript web applications and jQuery libraries. The experience is well documented, and recommendations for developers have been set up: https://twiki.cern.ch/twiki/bin/view/ArdaGrid/Libs and http://code.google.com/p/hbrowse/w/list. A dedicated presentation may be of interest to the members of the monitoring group.

9 Job monitoring:
- Provides information about data processing in the scope of a given VO.
- Mainly based on the instrumentation of the job submission frameworks, and therefore works transparently across various middleware platforms (OSG, ARC, gLite), various submission methods (pilots, etc.) and various execution backends (Grid, local).
- Merges information about a given job from multiple information sources; a unique job identifier is a requirement (see the sketch after this list).
- Job monitoring applications are shared by ATLAS and CMS: the DB schema and user interfaces are shared, basically the same implementation adapted for a particular experiment. The information sources and transport mechanisms are different, so the collectors have correspondingly different implementations.
- Keeps track of all processing details at the level of a single job.
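A minimal sketch of the merging step: records about the same job arriving from different sources are combined using the unique job identifier as the key. The record fields and source names are illustrative assumptions, not the actual collector code:

```python
from collections import defaultdict

# Illustrative status updates as they might arrive from different sources
# (submission framework, pilot, site); the field names are assumptions.
updates = [
    {"job_id": "42", "source": "submission", "status": "SUBMITTED", "site": "T2_XX"},
    {"job_id": "42", "source": "pilot", "status": "RUNNING", "cpu_time": 120},
    {"job_id": "42", "source": "pilot", "status": "FINISHED", "exit_code": 0},
]

def merge_by_job_id(records):
    """Merge per-job records from multiple sources; the unique job id is the key."""
    jobs = defaultdict(dict)
    for rec in records:
        job = jobs[rec["job_id"]]
        # Later updates overwrite earlier values of the same field,
        # so the merged record reflects the latest known state.
        job.update({k: v for k, v in rec.items() if k != "job_id"})
    return dict(jobs)

print(merge_by_job_id(updates))
```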

10 Architecture diagram: jobs running at the WNs and the job submission clients or servers send reports to a message server (MonALISA or MSG); Dashboard consumers read from it and fill the Dashboard data repository (one Oracle database per experiment; for ATLAS also fed from the PanDA DB and the ATLAS production DB). Data is retrieved via APIs by the Dashboard web server, which serves the user web interfaces.

11 Currently implemented in Oracle; the schema is normalized. The CMS schema is partitioned twice per month, the ATLAS schema weekly. Some interfaces use pre-cooked aggregated data, while others use raw data. Although a lot of tuning was done recently to improve the performance of the applications that use raw data, there is still room for improvement; we foresee trying NoSQL solutions as a cache for the UI. The main issue is occasional performance degradation due to instabilities of the execution plan; hopefully the situation will improve with the migration to Oracle 11g.

12 ATLAS submits up to 800K jobs per day and CMS up to 300K jobs per day, so about 1 million jobs per day have to be followed.
- Regular updates of job status changes are received per job.
- The per-job DB contains the time stamps of job status changes, meta information about the jobs, job status, error codes and error reasons, job processing metrics (CPU, wallclock, memory consumption, etc.) and the list of accessed files (kept for a short time only).
- In addition, aggregated information is kept in summary tables with hourly and daily granularity (sketched below).
- The ATLAS job monitoring DB is 380GB for 1.5 years of data; the daily growth over the last months is 1-5GB/day.
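The aggregation into hourly summary tables can be sketched as follows; a hypothetical in-memory version with assumed field names, not the actual Oracle implementation:

```python
from collections import Counter
from datetime import datetime

# Illustrative per-job status-change records (timestamps and fields assumed).
records = [
    {"time": "2011-05-10 09:12:00", "site": "T2_XX", "status": "FINISHED"},
    {"time": "2011-05-10 09:48:00", "site": "T2_XX", "status": "FAILED"},
    {"time": "2011-05-10 10:05:00", "site": "T2_YY", "status": "FINISHED"},
]

def hourly_summary(rows):
    """Count job status changes per (hour, site, status) bin."""
    counts = Counter()
    for row in rows:
        ts = datetime.strptime(row["time"], "%Y-%m-%d %H:%M:%S")
        hour_bin = ts.replace(minute=0, second=0)
        counts[(hour_bin, row["site"], row["status"])] += 1
    return counts

for key, n in sorted(hourly_summary(records).items()):
    print(key, n)
```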

13 For historical reasons, in the absence of a messaging system provided as a middleware component, CMS uses MonALISA as the messaging system, and it works well. Currently CMS uses 3 ML servers; in order to scale, more servers can be added. ML can accept up to 5K messages per second; the bottleneck is rather at the level of data recording to the DB, which is constantly monitored. Below are plots for one of the servers, the one used the most; one bar corresponds to 5 minutes (1 collector loop). ~20K status update records are inserted every 5 minutes from a single server, that is 50-100 Hz per server. In case of any delay in the information update an alarm is sent; the alarm is triggered by an Oracle scheduled job.
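The delay alarm amounts to a freshness check: if the newest status update is older than a threshold, an alarm is sent. In production this is an Oracle scheduled job; the sketch below is a hypothetical Python equivalent with assumed thresholds and addresses:

```python
import smtplib
from datetime import datetime, timedelta
from email.message import EmailMessage

MAX_DELAY = timedelta(minutes=15)              # assumed alarm threshold
ALARM_TO = "dashboard-admins@example.cern.ch"  # hypothetical address

def update_is_stale(last_update_time, now=None):
    """Return True if the latest recorded update is older than the threshold."""
    now = now or datetime.utcnow()
    return now - last_update_time > MAX_DELAY

def send_alarm(delay):
    msg = EmailMessage()
    msg["Subject"] = "Dashboard collector delay: %s" % delay
    msg["From"] = "dashboard-monitor@example.cern.ch"
    msg["To"] = ALARM_TO
    msg.set_content("No job status updates recorded for %s." % delay)
    with smtplib.SMTP("localhost") as smtp:    # assumes a local mail relay
        smtp.send_message(msg)

if __name__ == "__main__":
    last_update = datetime.utcnow() - timedelta(minutes=20)  # e.g. read from the DB
    if update_is_stale(last_update):
        send_alarm(datetime.utcnow() - last_update)
```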

14 For ATLAS the main data flow comes from PanDA (100-150 Hz); a single server deals well with the load, and the collector loop runs every 2 minutes. There were performance issues with the first collector implementation; a collector redesign solved the problem, mainly by replacing triggers with stored procedures called from the collector's main thread. Thanks to the CERN DBAs for their support and suggestions. As for CMS, performance is constantly monitored and alarms are sent in case of any delay.
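A minimal sketch of the redesigned collector loop: instead of relying on DB triggers, the collector bulk-inserts a batch of updates and then explicitly calls a stored procedure from its main thread. The connection string, table and procedure names are assumptions (shown with the cx_Oracle driver):

```python
import cx_Oracle  # Oracle driver; the DSN below is a placeholder

def collector_cycle(rows):
    """One collector loop: bulk-insert status updates, then call a stored
    procedure to refresh the summary tables (replacing the old triggers)."""
    conn = cx_Oracle.connect("user/password@dashboard_db")   # placeholder DSN
    try:
        cur = conn.cursor()
        # Bulk insert of the raw status updates (table and columns assumed).
        cur.executemany(
            "INSERT INTO job_status_updates (job_id, status, ts) "
            "VALUES (:1, :2, :3)",
            rows,
        )
        # Summary update done explicitly here rather than via triggers.
        cur.callproc("update_job_summaries")                  # assumed procedure name
        conn.commit()
    finally:
        conn.close()
```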

15 All UI applications run in parallel on two web servers (behind the same alias), but there is no real load balancing; we would like to try to reuse what is used for the ActiveMQ message brokers. Access is monitored and is steadily growing in terms of the number of users, the frequency of access and the volume of accessed data. AWStats for a single CMS server (metrics should be multiplied by two), monthly access patterns: ~3-4k unique visitors (IP addresses), ~2-3M pages, ~300-400GB bandwidth.

16 Monitoring of ATLAS DDM data registrations and transfers. Components: Web UI & API, Server & Agents, Database, Consumers.

17 DDM Dashboard data flow:
- 2 consumers (Apache) receive callback events from 11 DDM Site Service VO boxes (~50 Hz).
- Callback events are stored in monthly partitioned database (Oracle) tables and kept for at least 3 months.
- Statistics generation agents (Dashboard Agent) run every 10 minutes, generating statistics into 10-minute bins by source/destination/activity (~50k records per day; see the sketch after this list).
- Statistics aggregation agents (Dashboard Agent) run every 10 minutes, aggregating statistics into 24-hour bins by source/destination/activity (~4k records per day).
- Statistics are stored in monthly partitioned database (Oracle) tables and kept indefinitely.
- The size of the DB is 1625GB, the biggest of all Dashboard DBs; daily growth over the last months is 1-5GB/day.
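The statistics generation step can be sketched as binning the raw callback events into 10-minute intervals keyed by source, destination and activity; a hypothetical in-memory version with assumed field names:

```python
from collections import defaultdict
from datetime import datetime

BIN_SECONDS = 600  # 10-minute bins

# Illustrative transfer callback events; the field names are assumptions.
events = [
    {"time": "2011-05-10 09:03:20", "src": "CERN", "dst": "BNL",
     "activity": "Data Consolidation", "bytes": 2 * 10**9, "ok": True},
    {"time": "2011-05-10 09:07:45", "src": "CERN", "dst": "BNL",
     "activity": "Data Consolidation", "bytes": 1 * 10**9, "ok": False},
]

def ten_minute_bins(rows):
    """Aggregate events into (bin start, src, dst, activity) statistics."""
    stats = defaultdict(lambda: {"transfers": 0, "failures": 0, "bytes": 0})
    for row in rows:
        ts = datetime.strptime(row["time"], "%Y-%m-%d %H:%M:%S")
        bin_start = ts.timestamp() // BIN_SECONDS * BIN_SECONDS
        key = (datetime.fromtimestamp(bin_start), row["src"], row["dst"], row["activity"])
        stats[key]["transfers"] += 1
        stats[key]["failures"] += 0 if row["ok"] else 1
        stats[key]["bytes"] += row["bytes"]
    return stats

for key, value in ten_minute_bins(events).items():
    print(key, value)
```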

18 Statistics and event details are available via a web UI and a web API.
- The web API (Dashboard Web) provides CSV/XML/JSON formats.
- The web UI (Dashboard Web + AJAX/jQuery) provides highly flexible filtering for the statistics matrix and plots.
- Monthly access patterns: ~1k unique visitors, ~20M page hits, ~400GB bandwidth; >90% of the traffic goes to the web API (50% of it from a single user).

19 Problems encountered and their solutions:
- Deadlocks in the DB due to too many connections from the consumers. Solution: restrict the number of connection pools (Apache thread model) and the connection pool size (Dashboard).
- Some publishers monopolise a consumer due to diverse latency. Solution: an additional consumer for high-latency publishers.
- Statistics generation procedures run too slowly. Solution: split the procedures to run in parallel and use bulk SQL (Oracle).
- Web UI and API queries for extended time periods are too slow. Solution: aggregate statistics into 24-hour bins in separate DB tables.
- Web server memory usage is too high. Solution: generate plots on the client (HighCharts).

20 Remaining issues and plans:
- Occasional execution plan instabilities. Plan: investigate Oracle 11g SQL plan management to improve stability; many thanks to the DBAs for their support in fixing the instabilities when they occur.
- High load on the web API from a few clients. Plan: work with the users to develop a more efficient API that meets their requirements.
- Consumers are approaching their load limit. Plan: investigate message brokering (ActiveMQ) as a buffer to simplify bulk inserts.

21 Site Status Board:
- Deployed for the 4 experiments; gathers metrics for all entities.
- Metrics are defined by the experiment and can be created dynamically. Originally entity = site; now an entity can also be a 'channel'.
- Measurement = start/end time, value, colour, site, URL (see the sketch after this list).
- More than 370 metrics across all experiments; metrics are gathered by collectors, with refresh rates between 10 minutes and 7 days.
- Presents the latest state and historical information, in different views (a view is a set of metrics); more than 40 views.
- Different Oracle databases per experiment: CMS has 87M entries and 20 collectors; ATLAS has 50M entries and 3 collectors.
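A minimal sketch of the measurement structure listed above and of a collector producing one measurement per site. The metric name, colour thresholds and URL are illustrative assumptions:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Measurement:
    """One Site Status Board measurement: start/end time, value, colour,
    site and a URL pointing to the details."""
    site: str
    metric: str
    start: datetime
    end: datetime
    value: float
    color: str
    url: str

def collect_job_efficiency(site_values):
    """Hypothetical collector: turn raw per-site values into measurements."""
    now = datetime.utcnow()
    measurements = []
    for site, efficiency in site_values.items():
        color = "green" if efficiency >= 0.9 else "yellow" if efficiency >= 0.7 else "red"
        measurements.append(Measurement(
            site=site, metric="JobEfficiency",
            start=now, end=now + timedelta(minutes=10),   # refresh interval assumed
            value=efficiency, color=color,
            url="https://dashboard.example.cern.ch/ssb#site=%s" % site,  # placeholder
        ))
    return measurements

for m in collect_job_efficiency({"T2_XX": 0.95, "T2_YY": 0.62}):
    print(m)
```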

22 Architecture diagram: collectors (Savannah, free text, BDII, job efficiency, ..., topology) feed the latest results and the historical data.

23 Problems encountered and their solutions:
- Too many different writers (locking)! Solution: use temporary files and a single writer (sketched below).
- Huge table for the historical values. Solution: partition by hash of metric and time.
- Insertion rate too slow (1 second per entry). Solution: avoid triggers and materialized views and do as much processing as possible before insertion; monitor the insertion rate. It is now 20 ms per entry.
Thanks to the CERN DBAs for their support and suggestions.
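The "temporary files plus a single writer" pattern can be sketched as follows: each collector appends its results to its own spool file and never touches the DB; a single loader process picks the files up and performs the inserts, so only one writer holds DB locks. The spool path and record format are assumptions:

```python
import glob
import os
import tempfile

SPOOL_DIR = "/var/spool/ssb"  # assumed spool directory shared by the collectors

def write_results(collector_name, rows):
    """Collector side: dump results to a temporary file, never touch the DB."""
    fd, path = tempfile.mkstemp(prefix=collector_name + "_", suffix=".csv", dir=SPOOL_DIR)
    with os.fdopen(fd, "w") as out:
        for site, metric, value in rows:
            out.write("%s,%s,%s\n" % (site, metric, value))
    return path

def load_all():
    """Single-writer side: read every spooled file and insert its rows."""
    for path in glob.glob(os.path.join(SPOOL_DIR, "*.csv")):
        with open(path) as spooled:
            for line in spooled:
                site, metric, value = line.rstrip("\n").split(",")
                # A real loader would bulk-insert into the historical table here.
                print("insert", site, metric, value)
        os.remove(path)  # consume the file once loaded
```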

24 Updated architecture diagram: the collectors (Savannah, free text, BDII, job efficiency, ..., topology) write to temporary files, and a single "load data" step fills the latest results and the historical values.

25 UI improvements: better graphics; filtering, sorting and pagination; exporting data; client-side plotting.

26 The amount of data is steadily growing. Options under consideration: aggregate? decrease the granularity of older values? NoSQL?

27 The exit code generated when a job fails does not always allow the cause of the problem to be identified. A data mining technique called association rule mining was applied to the collected job monitoring data in order to identify the cause of job failures. Within the Dashboard framework, the Quick Analysis Of Error Sources (QAOES) application was developed by a PhD student. Logically there are two steps: identifying a problem, and then providing previously collected human expertise about possible solutions to the detected problem; the information is merged and exposed through the UI. The application ran for about a year for CMS. It needed active evaluation and contributions from the experiment in order to become really useful; unfortunately, this did not happen.
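A minimal sketch of the association-rule idea behind such an analysis: count how often attribute combinations (for example a site together with a dataset) occur among failed jobs and keep the combinations with high support. The thresholds and attribute names are illustrative assumptions, not the actual QAOES algorithm:

```python
from collections import Counter
from itertools import combinations

# Illustrative failed-job records; the attribute names are assumptions.
failed_jobs = [
    {"site": "T2_XX", "dataset": "/A/B/C", "exit_code": "8020"},
    {"site": "T2_XX", "dataset": "/A/B/C", "exit_code": "8020"},
    {"site": "T2_XX", "dataset": "/D/E/F", "exit_code": "8020"},
    {"site": "T2_YY", "dataset": "/A/B/C", "exit_code": "60307"},
]

def frequent_failure_rules(jobs, min_support=2):
    """Count single attributes and attribute pairs among failures and keep
    the combinations that occur at least `min_support` times."""
    counts = Counter()
    for job in jobs:
        items = sorted("%s=%s" % kv for kv in job.items())
        for item in items:
            counts[(item,)] += 1
        for pair in combinations(items, 2):
            counts[pair] += 1
    return {rule: n for rule, n in counts.items() if n >= min_support}

for rule, support in sorted(frequent_failure_rules(failed_jobs).items(), key=lambda x: -x[1]):
    print(support, " AND ".join(rule))
```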

28 Infrastructure: 50 machines in total.
- 16 physical, Quattor, PES control (4 of them SLC4); 28 virtual, Quattor, PES control; 2 temporary virtual, PES control; 4 virtual, only for internal tests.
- Used by the IT-ES group (not only Dashboard).
- Quattor templates for common components (iptables, certificates, web servers, yum); every machine has some manual configuration.
- MonALISA monitoring: host monitor, web server, collectors, RPMs, alarms...; automatic actions are still to be configured.
- Wiki describing typical actions: https://twiki.cern.ch/twiki/bin/view/ArdaGrid/Dashboard#Dashboard_Machines_Overview

29

30 The CMS Popularity project is a monitoring service for the data access patterns.
Technology: CRAB/Dashboard is used to collect file-based information from the CMS user jobs running on the grid; a Python daemon harvests this information and populates an Oracle database backend; Oracle materialized views perform the daily data aggregation (see the sketch below); a web UI, developed using the Django web framework and jQuery, exposes the popularity metrics (historical views, aggregated views) as tables, plots and a JSON API.
Scale: information is collected for more than 300k files/day; the harvesting time needed by the daemon is ~40 minutes/day; the refresh time of the materialized views is ~1 minute/day.
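The daily aggregation can be sketched as rolling per-job file-access records up into (day, dataset) popularity counters. A hypothetical in-memory version with assumed field names, not the actual Oracle materialized view:

```python
from collections import defaultdict
from datetime import datetime

# Illustrative file-access records harvested from user jobs (fields assumed).
accesses = [
    {"time": "2011-05-10 09:12:00", "dataset": "/A/B/C", "user": "alice", "reads": 3},
    {"time": "2011-05-10 17:40:00", "dataset": "/A/B/C", "user": "bob",   "reads": 1},
    {"time": "2011-05-11 08:05:00", "dataset": "/D/E/F", "user": "alice", "reads": 2},
]

def daily_popularity(rows):
    """Aggregate accesses into per-day, per-dataset popularity metrics."""
    stats = defaultdict(lambda: {"accesses": 0, "users": set()})
    for row in rows:
        day = datetime.strptime(row["time"], "%Y-%m-%d %H:%M:%S").date()
        entry = stats[(day, row["dataset"])]
        entry["accesses"] += row["reads"]
        entry["users"].add(row["user"])
    return {key: {"accesses": v["accesses"], "unique_users": len(v["users"])}
            for key, v in stats.items()}

for key, value in sorted(daily_popularity(accesses).items()):
    print(key, value)
```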

31 CMS Site Cleaning Agent: implements the strategies to free up space at the T2s.
Technology: a Python-based application; CMS Popularity and PhEDEx information is accessed via HTTP JSON APIs; the results are exposed via the CMS Popularity web UI (a simplified over-quota check is sketched below).
Scale: runs once a day with a processing time of ~2h; monitors the disk space of O(50) CMS T2 sites and O(20) physics groups, looking for sites/groups over quota; O(200k) data blocks are checked per run.
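A minimal sketch of the over-quota check: compare the space a site uses against its quota and, if it is exceeded, propose the least popular blocks for cleaning. The quota numbers, field names and ranking criterion are illustrative assumptions, not the agent's actual strategy:

```python
def blocks_to_clean(used_tb, quota_tb, blocks):
    """Return the least popular blocks whose removal brings the site back
    under quota; `blocks` is a list of dicts with size and access counts."""
    excess = used_tb - quota_tb
    if excess <= 0:
        return []
    selected = []
    # Propose the least accessed blocks first (ranking criterion assumed).
    for block in sorted(blocks, key=lambda b: b["accesses"]):
        if excess <= 0:
            break
        selected.append(block["name"])
        excess -= block["size_tb"]
    return selected

site_blocks = [
    {"name": "/A/B/C#1", "size_tb": 2.0, "accesses": 0},
    {"name": "/D/E/F#7", "size_tb": 1.5, "accesses": 12},
    {"name": "/G/H/I#3", "size_tb": 3.0, "accesses": 150},
]
print(blocks_to_clean(used_tb=55.0, quota_tb=52.0, blocks=site_blocks))
```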

32 HC (HammerCloud).
Technology: HC is a "ganga application", a Python service which uses Ganga to submit grid jobs to gLite WMS, PanDA, CRAB and DIRAC backends. State is recorded in a MySQL database; there is a plan to develop an Oracle backend. HC provides a Django-based web frontend, developed with JSON/jQuery UI elements.
Scale: HC runs three instances, for ATLAS, CMS and LHCb, with ~60 user accounts (mostly grid site administrators). In total ~10-20,000 jobs per day; each job is very short, just testing the basic grid analysis workflows. History is kept for all test jobs; currently the DBs contain ~30 million job records.

33 In ADC, 10 critical services are monitored; each service has between 1 and 10 service instances.
- Metrics to calculate availability are gathered using Lemon, Webalizer and service-specific reports.
- In addition, ADC has an SLS-based T1 storage space monitoring (around 40 space tokens); the storage space information is retrieved using lcg-utils. LHCb has a very similar implementation.
- The information is watched by ADC shifters, who are instructed to report immediately to the ATLAS Manager on Duty in case a service is degraded.

34 Monitoring of the DB connections, queries and transactions performed by a CORAL client application.
- The feature is being fixed and enhanced (it has existed in CORAL for a long time but was never really used); CMS wants to use it with the Oracle and Frontier plugins. The CORAL code is internally instrumented to keep track of DB operations and dump them when the client job ends.
- The feature is also being integrated in the CORAL server; ATLAS wants to use it to monitor DB operations in the HLT: keep track of the DB operations executed via the CORAL server and make them available in real time while the server is up.
- Eventually it would also be desirable to monitor the packet traffic through the hierarchy of CORAL server proxies caching HLT data.

35 This activity is performed within ATLAS, with help from the Frontier experts in CMS and in contact with the CORAL team in IT-ES. The aim is to provide service availability monitoring for the ATLAS distributed Frontier/Squid deployment:
- probing the Squids via MRTG (shown per individual node on frontier.cern.ch), based on the BDII and being moved to AGIS;
- probing Frontier via ping (shown for the service only, on SLS);
- grepping the Frontier server logs (AWStats), operational at some sites like CERN and BNL and being deployed elsewhere.

