Enabling Grids for E-sciencE Overview of System Analysis Working Group Julia Andreeva CERN, WLCG Collaboration Workshop, Monitoring BOF session 23 January 2007

Presentation transcript:

Enabling Grids for E-sciencE Overview of System Analysis Working Group Julia Andreeva CERN, WLCG Collaboration Workshop, Monitoring BOF session 23 January 2007

WLCG Collaboration Workshop, Julia Andreeva, CERN

Goal

As stated in the mandate, the goal is to gain an understanding of application failures in the Grid environment and to provide an application view of the state of the infrastructure.
- "Application view" in this context means a comprehensive picture of the experiment activities on the Grid, i.e. combining Grid-related and application-specific information so that problems of various kinds can be detected and addressed.
- This work continues what was started in the Experiment Dashboard project.
- A view of the experiments' activities on LCG can be achieved by taking into account the progress made in the monitoring area by all parties involved:
  - developers and providers of the Grid services
  - developers and providers of the existing monitoring tools
  - the experiments themselves, in particular developers of workload management tools and data management systems

Summarize the experience gained by the LHC experiments in achieving this goal and provide input to Grid service monitoring and management.

Example of combined (Grid-application) monitoring

Click on the bar and sort by dataset

Example of combined (Grid-application) monitoring

Click on the bar and sort by site

Example of combined (Grid-application) monitoring

- All jobs in Madison and Purdue are failing (application failure).
- All jobs in Florida are aborted (Grid failure).

An LHC user does not care whether the failure of his jobs is explained by a misconfiguration of the site from the Grid point of view or from the application point of view. For him the site is not working. Both problems have the same impact on the user.

Although the problems are of a different nature (different responsibility), we should find a way to monitor both and set up a standard procedure for addressing both.
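The distinction drawn on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (site, Grid status, application exit code), the sample values, and the function names are hypothetical and are not taken from the Experiment Dashboard.

```python
from collections import Counter, defaultdict

# Hypothetical job records: (site, grid_status, application_exit_code).
JOBS = [
    ("Madison", "done", 1),
    ("Madison", "done", 1),
    ("Purdue", "done", 1),
    ("Florida", "aborted", None),
    ("Florida", "aborted", None),
    ("CERN", "done", 0),
]


def classify(grid_status, exit_code):
    """Label one job as ok, a Grid failure, or an application failure."""
    if grid_status == "aborted":
        return "grid_failure"
    if exit_code != 0:
        return "application_failure"
    return "ok"


def site_summary(jobs):
    """Count job outcomes per site."""
    summary = defaultdict(Counter)
    for site, grid_status, exit_code in jobs:
        summary[site][classify(grid_status, exit_code)] += 1
    return summary


def failing_sites(jobs):
    """Return sites with no successful job and their dominant failure kind."""
    result = {}
    for site, counts in site_summary(jobs).items():
        if counts["ok"] == 0:
            result[site] = counts.most_common(1)[0][0]
    return result


print(failing_sites(JOBS))
```

From the user's side all three sites are simply "not working"; the failure kind only matters for deciding who should follow the problem up.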

In practical terms

- We are not planning to introduce a new monitoring system.
- Development work aimed at providing the application view of the infrastructure was started by the Experiment Dashboard project in close collaboration with the experiments. Dashboard development will continue, following the output of the Working Group.
- But the scope is wider: review what the experiments have achieved in the area of application monitoring, define common patterns, and identify common problems in order to address them in collaboration with the two other monitoring groups.

How to achieve

- Identify current experiment use cases related to the main areas of activity: job processing, data management, DB replication.
- For every use case, analyze the present experience and requirements of the four LHC experiments.
- For every use case, analyze the existing sources of monitoring data and understand whether they are sufficient, whether pieces of information coming from various sources can complement each other, and how they can be correlated. Identify information holes (if any).
- Where possible, identify a common approach and implementation for application monitoring (Experiment Dashboard, SAM tests with VO-specific content).
- Provide input to the Grid Service Monitoring Working Group regarding identified problems or suggestions for improving Grid service monitoring.
- Based on information collected in the Experiment Dashboard, identify VO- and site-related failures of real user jobs, coupled with SAM tests with VO-specific content. This is the area where the System Analysis WG has to collaborate with the two other monitoring groups. Propose a procedure for following up these problems.

Follow job failures at the site for a given VO

Analysis of the information flow of the monitoring data

Information sources:
- Generic Grid services
- Experiment-specific services
- Experiment workload management and data management systems
- Jobs instrumented to report monitoring information

Depending on the use case, identify which monitoring data is available. Is it enough to provide a complete and comprehensive picture for the VO? If not, then, again depending on the use case, address the issue in collaboration with either the Grid Service Monitoring WG or the experiments (application-specific data).

Monitoring systems (R-GMA, GridICE, SAM, ICRTMDB, MonALISA, BDII…)

For all main areas of the experiments' activities on the Grid.
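The check described above (which data is available per use case, and where the holes are) can be sketched as a simple set comparison. Everything named here is an assumption made for illustration: the source names, the attribute names, and the two use cases are hypothetical and do not describe the actual data published by any monitoring system.

```python
# Hypothetical mapping of information sources to the attributes they report.
SOURCES = {
    "grid_services": {"job_id", "site", "grid_status"},
    "workload_management": {"job_id", "task", "dataset"},
    "instrumented_jobs": {"job_id", "application_exit_code"},
}

# Hypothetical attributes each use case needs for a complete picture.
USE_CASES = {
    "job_processing": {
        "job_id", "site", "grid_status", "application_exit_code", "dataset",
    },
    "data_transfer": {
        "dataset", "source_site", "destination_site", "transfer_status",
    },
}


def information_holes(use_cases, sources):
    """For each use case, list the needed attributes that no source provides."""
    available = set().union(*sources.values())
    return {
        name: sorted(needed - available)
        for name, needed in use_cases.items()
    }


print(information_holes(USE_CASES, SOURCES))
```

An empty list means the existing sources cover the use case; a non-empty list is a candidate topic to raise with the Grid Service Monitoring WG or with the experiment concerned.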

Analysis of the current VOs' experience and requirements

Experiments have different requirements for high-level monitoring, depending on how their workload management and data management systems are organized:
- different levels of centralization (example: job submission)
- different progress already made by the experiments on a given activity (example: PhEDEx, which has a very advanced monitoring system)
- variety of platforms used by the experiments
- different technologies for a similar task (example: DB replication)

There are still many common issues regarding job processing, data transfer, data access, and usage of the distributed DBs.

Nothing is to be enforced; rather, identify where possible the ways to implement monitoring in a common way:
- via the Experiment Dashboard
- using existing monitoring frameworks (SAM)
- by defining common problems, requirements, and suggestions for the middleware developers and the providers of the monitoring tools, via the Grid Service Monitoring Working Group (example: improving error reporting for Grid-related failures)

Share experience and ideas related to application monitoring between the LHC experiments.

Different activities on the Grid depend heavily on each other (example: job processing efficiency depends on data distribution, data publishing, and data access). Analyze these dependencies in order to understand how VO monitoring data can be better presented and navigated.

Core group membership

Chaired by Julia Andreeva.

The group is focused on the needs of the main WLCG customers, the LHC experiments. LHC experiments should be the key players in the group. There is one representative per experiment, and VO experiment experts will be invited to take part in meetings on specific subjects:
- Dietrich Liko (ATLAS)
- Latchezar Betev (ALICE)
- Stefano Belforte (CMS)
- To be confirmed (LHCb)

The Experiment Dashboard made a good start. The output of the System Analysis WG will define directions for further dashboard development:
- Benjamin Gaidioz
- Pablo Saiz
- Ricardo Brito Da Rocha

The MonALISA monitoring system is widely used by several LHC experiments for application-level monitoring:
- Iosif Legrand

The ROC at CERN started work to follow site problems for LHC VOs:
- Diana Bosio

LCG Experiment Integration Support:
- Roberto Santinelli

Developers of monitoring tools will be invited to take part in meetings related to a specific topic.

Expected outcome of the work

- Further development and improvement of the Experiment Dashboard, following the output of the working group.
- Make sure that, via the Experiment Dashboard, SAM, and experiment-specific monitoring systems, the LHC experiments are provided with a monitoring framework in which an LHC VO user can find the necessary monitoring data, depending on his role and use case.
- Suggest a procedure for addressing VO-related problems at the sites, which should improve the overall level of site reliability from the point of view of the LHC experiments.
- Provide input to the Grid Service Monitoring Working Group on issues and requirements related to Grid service monitoring, and collaborate with it in order to find solutions.

Backup slides. Dashboard Architecture

[Architecture diagram. Labels: Dashboard DB; RB; WNs; R-GMA; MonALISA; RB; Web Service Interface; R-GMA Client API; R-GMA Collector; MonALISA Collector; Oracle; IC RTMRB Collector; IC RTMRB; XML files; HTTP; Historical plots and statistics; Dashboard Interactive Web UI; RRD; ATLAS DDM Site Services; SAM sensors (XML); Potentially other clients: PANDA, ATLAS production]

INPUT
- Multiple sources of information
- Increasing the reliability
- Providing both global and very detailed views
- Can satisfy users with various roles:
  - Generic user running his jobs on the Grid
  - Site administrator
  - VO manager, production or analysis group coordinator, data transfer coordinator…

OUTPUT
- Providing output in various formats (Web pages, XML, CSV, image formats)
- Can be used by various clients, both users and applications
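As an illustration of the OUTPUT side (one set of collected data served in several formats for both users and applications), here is a minimal Python sketch using only the standard library. The row layout, the sample numbers, and the function names are assumptions made for the example; the slide does not describe the dashboard's real schema or interfaces.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical per-site summary rows, as they might come out of a dashboard query.
ROWS = [
    {"site": "Florida", "jobs": 12, "failed": 12},
    {"site": "CERN", "jobs": 40, "failed": 2},
]


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["site", "jobs", "failed"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_xml(rows):
    """Render the rows as an XML document with one <site> element per row."""
    root = ET.Element("sites")
    for row in rows:
        ET.SubElement(root, "site", {key: str(value) for key, value in row.items()})
    return ET.tostring(root, encoding="unicode")


print(to_csv(ROWS))
print(to_xml(ROWS))
```

Keeping the query result separate from its rendering is one way to let a Web page, a script, and another monitoring tool consume the same data.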