Enabling Grids for E-sciencE: System Analysis Working Group and Experiment Dashboard. Julia Andreeva, CERN. Grid Operations Workshop, 13-15 June 2007, Stockholm.

Slide 2: Goal of the System Analysis Working Group
- As stated in the mandate, the goal is to gain an understanding of application failures in the grid environment and to provide an application view of the state of the infrastructure.
- Summarize the experience gained by the LHC experiments in this area and provide input to the Grid Service Monitoring Working Group and to WLCG management.
- Propagate information on the availability of the Grid infrastructure and services, as seen and measured by the LHC VOs, to the ROC managers and local fabric monitoring systems, so that problems are fixed by the people who can take action.

Slide 3: In practical terms
- We are not trying to introduce a new monitoring system.
- Use what is already available: the middleware itself, existing Grid monitoring tools, the experiments' workload and data management systems, monitoring tools developed by the experiments, and the Experiment Dashboard.
- Analyze the information flow: identify information holes (if any) and understand how the situation can be improved.
- Organize the application monitoring chain in a way that ensures improvements in the quality of the infrastructure and in the efficiency of its use by its customers.

Slide 4: What are the criteria for estimating the quality of the Grid infrastructure?
There can be many, but the main judgment comes from the users of the infrastructure:
- Can I run jobs?
- Can I transfer files?
- How much effort does it require?
- How quickly do I get the results?

Slide 5: Analysis of application failures. User error or service failure?
- Currently, when we talk about the Grid success rate, we only PARTIALLY take into account the Grid services involved in job processing (RB, LB, CE, BDII, local batch system).
- Failures of the SEs and catalogues, of access to the shared areas with software distributions, and of reading the distributed databases are hidden inside the application failures.
- How can these failures be decoupled from errors in the user code?

Slide 6: Application failures of CMS analysis jobs over the last month
(Chart of failure causes: 21% of failures were caused by problems with data access; 16% by the required application version not being found at the site.)

Slide 7: Application failures of CMS production jobs since the beginning of the year
(Chart of failure causes: 6.5% of failures were related to accessing/sourcing the SW distribution at the site; 22% were stage-out failures.)

Slide 8: One more example
- All jobs in London are failing: user or site problem?
- All jobs in Beijing are failing because the requested SW version is not available at the site. Evidently not a user problem.

Slide 9: Sorted by user
- All users except one are failing.
- The error code points to a data access problem.

Slide 10: Sorted by dataset
- All jobs requiring access to any dataset are failing; only jobs not using input data succeeded.
- More likely a site problem than a problem in the user code (a sketch of this grouping heuristic follows below).
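The reasoning in slides 8-10 amounts to a simple grouping heuristic: aggregate failure counts by site, by user, and by dataset, and treat a dimension where almost everything fails as the likely culprit. Below is a minimal, purely illustrative Python sketch of that idea; the record fields, thresholds, and sample data are assumptions, not the actual Dashboard schema.

from collections import defaultdict

# Each record is one finished job, e.g. taken from a job monitoring store.
# The field names ("site", "user", "dataset", "failed") are illustrative only.
jobs = [
    {"site": "London", "user": "alice", "dataset": "ds1", "failed": True},
    {"site": "Beijing", "user": "bob", "dataset": None, "failed": True},
    {"site": "CERN", "user": "alice", "dataset": "ds2", "failed": False},
]

def failure_rates(jobs, key):
    """Return {value: [failed, total]} for the given grouping key."""
    counts = defaultdict(lambda: [0, 0])
    for job in jobs:
        value = job.get(key)
        counts[value][1] += 1
        if job["failed"]:
            counts[value][0] += 1
    return counts

def suspects(jobs, key, threshold=0.9, min_jobs=10):
    """Values of `key` where (almost) every job fails.

    If all users and datasets fail at one site, it is probably a site problem;
    if a single user fails everywhere, the problem is probably in that user's code.
    """
    return [value
            for value, (failed, total) in failure_rates(jobs, key).items()
            if total >= min_jobs and failed / total >= threshold]

for dimension in ("site", "user", "dataset"):
    print(dimension, suspects(jobs, dimension, min_jobs=1))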

Slide 11: Main reasons for application failures
- Data access problems.
- Problems with saving output on the SE and registering it in the catalogue.
- The experiments are not yet widely using distributed databases, but access to the DB from every individual job can cause problems in the future.
- A rough estimate from the CMS application failure statistics shows that at least 30-40% of application failures are due to failures or misconfiguration of the grid services or local sites.
- Still, application failures that do not result in jobs being aborted by the Grid are very often regarded as a VO problem.

Slide 12: How are the experiments currently trying to solve such problems?
- Very often the problem is first seen by the users or production teams, then reported to the experiment support lists, and then either a GGUS ticket is submitted and/or the responsible people at the sites are contacted.
- The experiments are trying to put mechanisms in place to detect problems as early as possible (SAM tests, the Job Robot in CMS, an alarm system based on MonALISA in ALICE).
- Still, the best indication that something is going wrong comes from the massive job submissions of the user community.

Slide 13: Application monitoring chain
Collecting monitoring information -> analysis of the failure/success statistics -> detecting problems -> identifying the reason (user or Grid?) -> helping users fix problems in their code / notifying service providers and local site admins.
- Collection is done via the Experiment Dashboard, production monitoring, and the monitoring of the experiments' workload management systems (DIRAC, AliEn).
- Understanding whether a problem is caused by an error in the user code or by a Grid failure is not an easy task; the System Analysis Working Group is trying to address it.
- A communication channel for reporting problems with local services at the sites, as seen from the experiment applications, has to be defined. It might follow the model of publishing SAM results back into the local fabric monitoring system, prototyped by the Grid Services Monitoring WG.

Slide 14: Application failure diagnostics
- It is not easy to decouple application failures caused by user errors in the code, or by bad packaging of the user code, from failures caused by problems of the Grid infrastructure itself.
- Main problem: bad or inconsistent diagnostics from the applications and job wrappers. Application developers do not pay much attention to making error codes consistent and comprehensive, and every job submission tool can have its own set of exit codes from the job wrapper, which within the same experiment can easily overlap and have different meanings.
- Solutions (see the sketch below): within the SAWG, work with the experiments to improve the exit codes and the reporting in general from the physics applications and job wrappers; analyze the large statistics accumulated in the Dashboard DB; use SAM tests.
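To make the exit-code problem concrete, here is a hedged sketch of a per-tool classification table that maps wrapper exit codes to failure categories (user side versus grid/site side). The tool names, numeric codes, and categories are invented for the example and do not reproduce any experiment's real convention.

# Hypothetical per-tool exit-code tables; real tools define their own, overlapping codes.
EXIT_CODE_TABLES = {
    "tool_a": {0: "success", 8001: "user_code", 60302: "data_access", 60307: "stage_out"},
    "tool_b": {0: "success", 1: "user_code", 2: "data_access"},
}

GRID_SIDE = {"data_access", "stage_out"}   # service or site problem
USER_SIDE = {"user_code"}                  # problem in the user's own code

def classify_failure(tool, exit_code):
    """Map a (submission tool, wrapper exit code) pair to a failure category.

    Unknown codes are reported as such instead of being silently blamed on the
    user, which is exactly the ambiguity the working group wants to remove by
    making exit codes consistent across tools and experiments.
    """
    table = EXIT_CODE_TABLES.get(tool, {})
    return table.get(exit_code, "unknown")

print(classify_failure("tool_a", 60302))   # data_access -> grid/site side
print(classify_failure("tool_b", 60302))   # unknown: same number, different tool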

Slide 15: Experiment monitoring systems
- For LHCb and ALICE there is one central submission queue for both production and analysis, so all monitoring information can easily be collected in the database of the workload management system.
- For ATLAS and CMS there is no central queue and there are multiple submission UIs, so collecting the monitoring data at the experiment scope is much more complicated. This was one of the motivations for starting the Experiment Dashboard project.

Slide 16: Experiment Dashboard concept
- Information sources: generic Grid services, experiment-specific services, the experiments' workload and data management systems, jobs instrumented to report monitoring information, and monitoring systems (RGMA, GridIce, SAM, ICRTMDB, MonALISA, BDII, APEL, ...).
- The Dashboard collects data of VO interest coming from the various sources, stores it in a single location, provides a UI following VO requirements, analyzes the collected statistics, defines alarm conditions, and propagates the VO view of the infrastructure to the service providers (an illustrative sketch of this data flow follows below).
- Consumers: VO users with various roles; potentially other clients such as PANDA and ATLAS production.
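As an illustration of the collect/store/serve idea on this slide, the sketch below normalizes records from two imaginary feeds into one schema and keeps them in a single store that the UI and the statistics queries can share. The source names, field names, and the in-memory SQLite store are placeholders, not the actual Dashboard implementation.

import sqlite3
from datetime import datetime, timezone

# A single local store stands in for the central Dashboard database.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE job_status (
                  source TEXT, job_id TEXT, site TEXT,
                  status TEXT, collected_at TEXT)""")

def normalize(source, raw):
    """Map a source-specific record to the common schema (fields are illustrative)."""
    return (source,
            str(raw.get("id") or raw.get("job_id")),
            raw.get("site", "unknown"),
            raw.get("status", "unknown"),
            datetime.now(timezone.utc).isoformat())

def collect(source, records):
    """Insert normalized records from one information source into the shared store."""
    db.executemany("INSERT INTO job_status VALUES (?, ?, ?, ?, ?)",
                   (normalize(source, r) for r in records))

# Two imaginary feeds, e.g. a monitoring-system stream and an experiment WMS dump.
collect("monitoring_feed", [{"id": "1", "site": "CERN", "status": "Done"}])
collect("wms_dump", [{"job_id": "42", "site": "London", "status": "Failed"}])

# The UI / statistics layer then queries one place, whatever the original source was.
for row in db.execute("SELECT site, status, COUNT(*) FROM job_status GROUP BY site, status"):
    print(row)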

Slide 17: Dashboard status
- In production for all four LHC experiments.
- Main applications:
  - Job monitoring (all experiments)
  - Data management monitoring (ATLAS)
  - Data transfer monitoring (ALICE)
  - Site reliability (all experiments)
  - Monitoring of the distributed DBs, implemented by the 3D project (ATLAS and LHCb)
- Recently the job monitoring and site reliability applications were installed by the VLEMED VO, outside the LHC community.

Slide 18: Current development, focused on the detection and reporting of application failures
- Importing the results of the SAM tests into the Dashboard, to apply them when analyzing user job failures (a sketch follows below).
- A service availability application based on the sanity-check reports sent from the VO jobs (LHCb).
- Enabling the generation of application efficiency reports in a format consistent with the Grid Monitoring Data Exchange Format developed by the Grid Service Monitoring Working Group.
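A hedged sketch of the first item above, under assumed data shapes: imported SAM results are looked up when a job failure is analyzed, so that a failure at a site whose SAM tests were critical around that time is attributed to the site rather than to the user. The test names, validity window, and record layout are assumptions made for illustration only.

from datetime import datetime, timedelta

# Imported SAM results keyed by (site, test); values are (status, measurement time).
# The test name below is a placeholder, not a real SAM test identifier.
sam_results = {
    ("London", "sw-area"): ("CRITICAL", datetime(2007, 6, 13, 12, 0)),
    ("CERN", "sw-area"): ("OK", datetime(2007, 6, 13, 12, 0)),
}

def site_was_degraded(site, when, validity=timedelta(hours=4)):
    """True if any SAM test for the site was CRITICAL within the validity window of `when`."""
    for (s, _test), (status, measured) in sam_results.items():
        if s == site and status == "CRITICAL" and abs(when - measured) <= validity:
            return True
    return False

def attribute_failure(job):
    """Very rough attribution: blame the site if its SAM tests were red, else leave open."""
    if site_was_degraded(job["site"], job["finished"]):
        return "likely site/service problem"
    return "needs further analysis (possibly user code)"

job = {"site": "London", "finished": datetime(2007, 6, 13, 13, 30)}
print(attribute_failure(job))  # likely site/service problem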

Slide 19: Conclusions
- Currently, certain failures and inefficiencies of the Grid services cause user applications to fail, even though such jobs look like jobs properly handled by the Grid.
- These problems have to be promptly detected, correctly identified, and reported to the service providers and site administrators.
- The work in the System Analysis WG and on the Experiment Dashboard, in close collaboration with the LHC experiments and the other monitoring working groups, aims to address this issue.