1 Job Monitoring for the LHC experiments
Irina Sidorova (CERN, JINR) on behalf of the Dashboard team
CERN IT Department, CH-1211 Genève 23, Switzerland, www.cern.ch/it, Internet Services

2 Table of contents
- Importance of Job Monitoring
- Overview of the Dashboard Job Monitoring applications
- Monitoring of user analysis
- Conclusions

3 Importance of Job Monitoring
- Data distribution and data processing are the two main computing activities of the VOs running on the WLCG infrastructure.
- The quality of job processing gives an estimate of the quality of the infrastructure in general and determines the overall success of the VOs' computing activities.
- In turn, detailed and reliable job monitoring helps to improve the computing models of the LHC VOs.

4 Dashboard main goals
- The goal of the Experiment Dashboard is to monitor the activities of the LHC experiments on the distributed infrastructure, providing monitoring data from the virtual organization (VO) and user perspectives.
- The LHC experiments use various Grid infrastructures (LCG/EGEE, OSG, NDGF), each with its own middleware flavor and job submission methods.
- The main task is to provide a uniform and complete view of activities such as job processing, data movement and publishing, and access to distributed databases, regardless of the underlying Grid flavor.

5 Overview of the Dashboard job monitoring applications
- ATLAS ProdSys Monitoring
- Central repository for CMS ProdAgent monitoring data
- Generic Job Monitoring
- Monitoring of user analysis jobs

6 Data flow for Dashboard Job Monitoring
- LB, CEMon, Condor-G, jobs instrumented to report their progress, job submission tools of the experiments
- MonALISA; a switch to the Messaging System for the Grid (MSG) is planned
- Data is available in various formats and can be presented to different categories of users: VO managers, computing shifters, MC production teams, site commissioning teams, and LHC physicists running their jobs on the Grid

7 Dashboard job monitoring is designed as a common solution for any virtual organization
- To provide a complete view of job processing from both the Grid and the application point of view, the VO job submission tools should be instrumented to report job status information.
- Dashboard job monitoring for CMS is the most advanced, since all CMS submission tools are well instrumented for Dashboard reporting.

8 Dashboard Job Monitoring functionality
Interactive view: shows what is going on now.
- Distribution of jobs by site, CE, user, submission tool, application version, dataset, etc.
- Distribution of jobs by status
- Success rate, CPU and wall clock time, number of processed events
Historical Interface: job statistics distributed over time.
Dashboard Task Monitoring: provides complete information about analysis job processing. Serves the needs of the analysis community and of the analysis support team.
Quick Analysis of Error Sources: automatically detects failing Grid components and offers solutions to the problems.
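
The "distribution of jobs by site, user, ..." and "success rate" views described above amount to grouping job records by one attribute and counting outcomes. A minimal sketch in Python, assuming a hypothetical record layout (the field names `site`, `user`, `status` and the site `T2_XX_Example` are illustrative, not the Dashboard schema):

```python
from collections import defaultdict

# Hypothetical job records; field names are illustrative only.
jobs = [
    {"site": "T2_FR_IPHC", "user": "alice", "status": "failed"},
    {"site": "T2_FR_IPHC", "user": "bob", "status": "succeeded"},
    {"site": "T2_XX_Example", "user": "alice", "status": "succeeded"},
    {"site": "T2_XX_Example", "user": "carol", "status": "succeeded"},
]


def success_rate_by(records, key):
    """Group records by `key`; return {value: (job_count, success_rate)}."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for job in records:
        totals[job[key]] += 1
        if job["status"] == "succeeded":
            successes[job[key]] += 1
    return {value: (totals[value], successes[value] / totals[value])
            for value in totals}


by_site = success_rate_by(jobs, "site")
by_user = success_rate_by(jobs, "user")
```

Changing the `key` argument switches the view from sites to users, datasets or any other attribute carried by the records.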

9 Dashboard Task Monitoring
- Distributed user analysis on the WLCG infrastructure is currently the main challenge of LHC computing.
- With data taking approaching, the number of analysis users will increase dramatically.
- User-friendly, complete and reliable monitoring of analysis task processing is an important factor in the successful organization of distributed analysis.
- The Task Monitoring application is developed on top of the common job monitoring repository.
- The main users of the application are LHC physicists, distributed analysis support teams and site administrators.

10 CMS Task monitoring for analysis users
- Provides transparent monitoring regardless of submission method or middleware platform
- Detailed view of user tasks, including failure diagnostics, processing efficiency and resubmission history
- Low latency: updates come from the worker node where the job is running
- User-driven development
- Progress of processing in terms of processed events
- Distribution of jobs by their current status
- Very detailed per-job information
- Failure diagnostics for Grid and application failures
- Distribution of efficiency by site

11 CMS Dashboard usage by application
- The application is currently in production for the CMS VO.
- It has become very popular in the CMS physics community.
- It has received very positive feedback from users.
- Up to 150 physicists use the application every day.

12 Trying to understand the failure reasons
- VO jobs are instrumented to report the application exit codes. Unfortunately, the exit codes do not always point to a particular failure reason; in most cases they are rather obscure.
- A job can fail for many different reasons:
  - Error in the user code
  - Misconfiguration of the site: misconfigured worker nodes, corrupted distribution of the experiment software, problems accessing the shared area from the worker node, etc.
  - Problem accessing input data
  - Problem saving output files to remote storage
- To address the problem, it is necessary to understand the underlying reason. Possible ways to achieve this are:
  - better diagnostics published from the user jobs in case of failure
  - analysis of the failure statistics
- The Dashboard team works in both directions, in the first case in collaboration with the developers of the workload management systems of the experiments.
- Ideally, the user does not need to open the log file to understand what went wrong with the job, but can get all the necessary information from the user interface.

13 Quick Analysis of Error Sources (QAOES)
- System of problem detection based on the Association Rule Mining algorithm
- Expert system
- The aim is to decrease the time needed for fault detection and to improve Grid reliability
- The QAOES prototype is in production for the CMS analysis job monitoring data
- The tool is being evaluated by the CMS distributed analysis support team
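
To illustrate the general idea of association rule mining applied to job failures, the sketch below scores single-attribute rules of the form "attribute = value implies failure" by support and confidence. This is a simplified illustration and not the QAOES implementation: the record fields, the thresholds and the restriction to one-attribute antecedents are all assumptions.

```python
from collections import Counter

# Hypothetical job records; "failed" is a boolean outcome flag.
jobs = [
    {"site": "A", "user": "u1", "failed": True},
    {"site": "A", "user": "u2", "failed": True},
    {"site": "A", "user": "u3", "failed": True},
    {"site": "B", "user": "u1", "failed": False},
    {"site": "B", "user": "u2", "failed": False},
]


def mine_failure_rules(records, attributes, min_support=0.3, min_confidence=0.8):
    """Return rules ((attr, value), support, confidence) for 'attr=value => failed'.

    support    = fraction of all jobs that have attr=value AND failed
    confidence = fraction of jobs with attr=value that failed
    """
    n = len(records)
    antecedent = Counter()  # jobs carrying (attr, value)
    both = Counter()        # jobs carrying (attr, value) that also failed
    for job in records:
        for attr in attributes:
            item = (attr, job[attr])
            antecedent[item] += 1
            if job["failed"]:
                both[item] += 1
    rules = []
    for item, count in both.items():
        support = count / n
        confidence = count / antecedent[item]
        if support >= min_support and confidence >= min_confidence:
            rules.append((item, support, confidence))
    # Most confident rules first, then the best supported ones.
    return sorted(rules, key=lambda rule: (-rule[2], -rule[1]))


rules = mine_failure_rules(jobs, ["site", "user"])
```

With this toy data only the rule for site "A" passes both thresholds, which points at the site rather than at any single user.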

14 QAOES use case 1

15 QAOES use case 1
Jobs overview on the T2_FR_IPHC site, sorted by activity, for the last 6 hours
- User analysis jobs have a very low success rate. Let’s see whether these jobs belong to one particular user.

16 Jobs overview on the T2_FR_IPHC site, sorted by user, for the last 6 hours
- Almost all users have jobs that failed from the application point of view. Let’s check why two users do not have this problem.

17 QAOES use case 1
Jobs overview on the T2_FR_IPHC site, sorted by dataset, for the last 6 hours
- Jobs with the “unknown” dataset do not use any input data. The other jobs failed with error code 8020, which indicates a data access problem.
- The automatically generated rule correctly identified the faulty component: the site, and more precisely a data access problem at the site.
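
The last step of this drill-down, tabulating exit codes per dataset, can be sketched as follows. The records and dataset names are made up for illustration, and treating exit code 0 as success is an assumption; only the code 8020 and the “unknown” dataset label come from the slide.

```python
from collections import Counter, defaultdict

# Hypothetical jobs at one site; dataset paths are placeholders.
site_jobs = [
    {"dataset": "unknown", "exit_code": 0},
    {"dataset": "unknown", "exit_code": 0},
    {"dataset": "/hypothetical/dataset/A", "exit_code": 8020},
    {"dataset": "/hypothetical/dataset/A", "exit_code": 8020},
    {"dataset": "/hypothetical/dataset/B", "exit_code": 8020},
]


def exit_codes_by_dataset(records):
    """Return {dataset: Counter({exit_code: job_count})}."""
    table = defaultdict(Counter)
    for job in records:
        table[job["dataset"]][job["exit_code"]] += 1
    return table


table = exit_codes_by_dataset(site_jobs)

# Datasets whose jobs all ended with the data access error code 8020.
data_access_failures = sorted(
    dataset for dataset, codes in table.items() if set(codes) == {8020}
)
```

In this toy data, every job that reads input data ends with 8020 while the jobs without input data succeed, which matches the pattern described on the slide.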

18 QAOES use case 2

19 QAOES use case 2
Distribution of the user jobs per site for the last 6 hours
- A large number of jobs belonging to the user failed, were cancelled or were aborted on different sites. Let’s check whether this happened with one particular task of the user by sorting the jobs by task.

20 Distribution of the user jobs per task
- The user jobs fail on different sites and in different tasks. It could be an input data problem. Let’s check.

21 Distribution of the user jobs per dataset
- The user’s jobs are failing at various sites, in different tasks, and while reading different datasets. This is a clear indication of an error in the user code, which is consistent with the automatically generated rule.
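
The manual reasoning in the two use cases can be condensed into a small heuristic: failures from many users concentrated on one site suggest a site problem, while failures of one user spread over several sites, tasks and datasets suggest an error in the user code. The sketch below is only an illustration of that reasoning with hypothetical field names and placeholder values; it is not how QAOES derives its rules.

```python
def distinct(records, key):
    """Number of distinct values of `key` among the given job records."""
    return len({job[key] for job in records})


def guess_faulty_component(failed_jobs):
    """Guess 'site', 'user code' or 'undetermined' from failed job records."""
    users = distinct(failed_jobs, "user")
    sites = distinct(failed_jobs, "site")
    tasks = distinct(failed_jobs, "task")
    datasets = distinct(failed_jobs, "dataset")
    if sites == 1 and users > 1:
        # Use case 1: many users fail at the same site.
        return "site"
    if users == 1 and sites > 1 and tasks > 1 and datasets > 1:
        # Use case 2: one user fails everywhere, whatever the task or dataset.
        return "user code"
    return "undetermined"


# Hypothetical failed jobs mirroring the two use cases.
use_case_1 = [
    {"user": "u1", "site": "T2_FR_IPHC", "task": "t1", "dataset": "d1"},
    {"user": "u2", "site": "T2_FR_IPHC", "task": "t2", "dataset": "d2"},
]
use_case_2 = [
    {"user": "u1", "site": "S1", "task": "t1", "dataset": "d1"},
    {"user": "u1", "site": "S2", "task": "t2", "dataset": "d2"},
    {"user": "u1", "site": "S3", "task": "t3", "dataset": "d3"},
]

verdict_1 = guess_faulty_component(use_case_1)
verdict_2 = guess_faulty_component(use_case_2)
```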

22 Add a solution

23 Conclusions
- Monitoring of job processing is one of the main indicators of the overall quality of the Grid infrastructure.
- User-friendly, reliable and complete monitoring is vital for the effective organization of distributed data analysis.
- Developed in close collaboration with the user community, the Dashboard job monitoring applications provide the functionality required for LHC offline computing activity.
- Future development and improvements are driven by the feedback and suggestions of the LHC users.

24 Thank you for your attention!
http://dashboard.cern.ch

