Enabling Grids for E-sciencE CMS/ARDA activity within the CMS distributed system Julia Andreeva, CERN On behalf of ARDA group CHEP06.


Enabling Grids for E-sciencE CMS/ARDA activity within the CMS distributed system Julia Andreeva, CERN On behalf of ARDA group CHEP06 Mumbai, February 2006

Slide 2: ARDA project

ARDA is an LCG project:
- Its main objective is to enable LHC analysis on the Grid
- Contribute to the distributed analysis activity of each LHC experiment
- Build on top of the new EGEE middleware (gLite):
  - Use the grid software as it matures (exposed to all pre-releases)
  - Verify the components in an analysis environment
  - Provide early and continuous feedback

Two main directions of ARDA-CMS development:
- Task Manager and Task Monitor, supporting user analysis on the Grid
- Dashboard, a monitoring system providing a complete view of the Grid-related and experiment-specific information collected from the CMS distributed system

Slide 3: ASAP

ASAP stands for ARDA Support for Analysis Processing.
- Development of an end-to-end prototype enabling CMS analysis on the gLite middleware platform started in May 2004
- This was an early phase of gLite middleware development: a very unstable and evolving environment
- Feedback from real users was needed. How to attract them? By creating a system that makes running analysis on the Grid as simple as possible for the user, requiring as little time and attention as possible.

Slide 4: ASAP work and information flow

[Workflow diagram: the ASAP UI reads dataset information from RefDB/PubDB and delegates user credentials to the ASAP Task Manager service using MyProxy. The user specifies the application, application version, executable, data sample, working directory, Castor directory for saving output, number of events to be processed, number of events per job, and the data cards for the ORCA application. The Task Manager submits JDL jobs to gLite/LCG, checks job status, resubmits in case of failure, fetches results, and stores them to Castor, recording the output file locations. Jobs running on the worker nodes report to MonALISA, and job status is published on the web via the job monitoring directory.]
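The submit/check/resubmit cycle handled by the Task Manager could be sketched as below. This is an illustrative assumption, not the real ASAP implementation: the function names, states, and the callables standing in for the actual Grid commands are all hypothetical.

```python
def manage_task(submit, poll, fetch, max_retries=3):
    """Run one task to completion: submit, poll status, resubmit on
    Grid or application failure, fetch results when done.

    submit/poll/fetch are callables standing in for the real Grid
    operations (job submission, status query, output retrieval);
    the status strings are assumed for illustration."""
    for attempt in range(max_retries + 1):
        job_id = submit()                      # submit (or resubmit) the job
        while True:
            status = poll(job_id)
            if status == "DONE":
                return fetch(job_id)           # store output (e.g. to Castor)
            if status in ("ABORTED", "FAILED"):
                break                          # failure: go resubmit
    raise RuntimeError("task failed after %d attempts" % (max_retries + 1))
```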

Slide 5: Really less effort?

The user still needs to:
- create a configuration file for a given task
- run the job generation command

The rest can be done by the Task Manager:
- job submission
- resubmission in case of Grid or application failure
- generation of the web pages with the task monitoring information
- saving of the output to its final location

Can the user take back control of a task and run in a simple mode without using the service? Yes, at any given moment.
Do users prefer to hand their work over to the Task Manager rather than following their tasks themselves? Yes, they do prefer to use the Task Manager service.
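The task configuration file mentioned above could look roughly like this. The format and every key name are hypothetical, modelled only on the parameters listed on the workflow slide; the values are placeholders, not real dataset or path names:

```ini
; Hypothetical ASAP task configuration (all field names and values illustrative)
[task]
application         = ORCA
application_version = ORCA_8_7_1          ; assumed version string
executable          = myAnalysis          ; placeholder executable name
data_sample         = my_data_sample      ; placeholder dataset name
working_directory   = /home/user/asap/task1
castor_directory    = /castor/cern.ch/user/u/user/task1
events_total        = 100000
events_per_job      = 500
orca_data_cards     = myDataCards.orcarc
```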

Slide 6: Example of monitoring a user task with ASAP

[Screenshot: the job was resubmitted by the Task Manager multiple times due to Grid failures; it is currently running and has processed 715 events.]

Slide 7: Use of ASAP

- The first pilot users started running their analysis jobs with ASAP on the gLite prototype testbed at the beginning of February
- Enabled backends are gLite and LCG
- Currently about 30 CMS physicists use ASAP for their everyday work: ~70K jobs per month, ~30 million events processed per month
- Feedback from the users is very positive
- Task Manager and Task Monitor should be streamlined with the development of the final CMS system

Slide 8: Goal of the Dashboard project

The CMS Dashboard project aims to provide a single entry point to the monitoring data collected from the CMS distributed system.
- Dashboard development is part of the CMS integration project, a joint effort of the ARDA and MonALISA teams
- There is a plan to continue Dashboard development as a joint project for CMS and ATLAS
- Currently, the main activities concentrate on the job monitoring part
- The objective is to provide a complete view of how the jobs of the CMS VO are doing on the Grid (on both the LCG/gLite and OSG middleware platforms)

Slide 9: Use cases

- Quantities (how many jobs per site, per user, per submission tool, per data collection...) and their distribution over time
- Usage of resources (CPU, memory, IO rates) aggregated at different levels
- Sharing of resources between production and analysis, different analysis groups, and individual users
- How the Grid is behaving: success rate, failure reasons...
- How the CMS application is behaving: success rate, failure reasons...
- How CMS data is distributed over the sites
- Indication of problems related to:
  - pure Grid troubles (lack of stability, scalability, performance)
  - site configuration
  - data publishing
  - data access
  - software distribution
  - ...
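As an illustration of the first use case, counting jobs per site and computing a success rate from a list of job records might look like the sketch below. The record fields and status strings are assumptions for illustration, not the actual Dashboard schema.

```python
from collections import Counter

def per_site_counts(jobs):
    """Count jobs per site from a list of job records (field names assumed)."""
    return Counter(j["site"] for j in jobs)

def success_rate(jobs):
    """Fraction of jobs that finished successfully ('DONE' status assumed)."""
    done = sum(1 for j in jobs if j["status"] == "DONE")
    return done / len(jobs) if jobs else 0.0
```

The same grouping generalizes to the other keys listed on the slide (user, submission tool, data collection) by changing the field the `Counter` is built over.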

Slide 10: Sources of information

Currently two main sources of information are used:
- R-GMA for Grid-related information (LCG/gLite logging and bookkeeping)
  - Investigating the possibility of using GridIce, the LCG2 Real Time Monitor, and Resource Broker XML files to complement the data obtained via R-GMA
  - There is a plan to use MonALISA to get batch-system-related information for the OSG sites
- MonALISA for CMS-related data and system information
  - CMS has very positive experience using MonALISA for system monitoring; in ARDA we started to use MonALISA for job monitoring

R-GMA and MonALISA are complementary regarding the type of information they provide. At the same time, having two sources makes it possible to cross-check between them; our experience shows that this increases the reliability of the system.
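The cross-check between the two sources could be sketched as follows. The input layout ({job_id: status} dictionaries) and the conflict-handling policy are assumptions for illustration, not the real Dashboard logic.

```python
def cross_check(rgma, monalisa):
    """Merge job statuses reported by two independent sources and flag
    jobs on which the sources disagree.

    Inputs are {job_id: status} dicts; on disagreement the first
    source's value is kept and the job is flagged for inspection."""
    merged = dict(rgma)
    conflicts = []
    for job_id, status in monalisa.items():
        if job_id in merged and merged[job_id] != status:
            conflicts.append(job_id)   # sources disagree: needs a closer look
        else:
            merged[job_id] = status    # fill gaps one source missed
    return merged, conflicts
```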

Slide 11: Architecture

[Architecture diagram: an R-GMA collector (using the R-GMA client API) and a MonALISA collector constantly retrieve job information originating from the Resource Brokers, submission tools, and worker nodes, and store it in the Dashboard DB (PostgreSQL, SQLite, or Oracle). The Dashboard web UI (PHP) and a web service interface serve snapshots, statistics, job information, and plots to users and to other clients such as MyFriend services.]
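The collector-to-database step of the architecture could be sketched as below. SQLite serves here only as a stand-in for the backends named on the slide, and the table layout is an assumption for illustration, not the real Dashboard schema.

```python
import sqlite3

def store_jobs(conn, records):
    """Insert or update job records in a Dashboard-style relational
    backend, as a collector would after each retrieval cycle.

    The table layout and field names are illustrative assumptions."""
    conn.execute("CREATE TABLE IF NOT EXISTS jobs "
                 "(job_id TEXT PRIMARY KEY, site TEXT, status TEXT)")
    conn.executemany(
        "INSERT OR REPLACE INTO jobs VALUES (?, ?, ?)",
        [(r["job_id"], r["site"], r["status"]) for r in records])
    conn.commit()
```

Upserting on the job ID lets repeated collector passes refresh the status of a job already in the database instead of duplicating it.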

Slide 12: Main principles

- Provide the necessary level of interactivity: not just pre-cooked views, but the possibility to dig in in case of trouble and to ask detailed questions
- Foresee in the future the possibility not only to collect and expose information, but also to analyze it and generate alarms in case of evident problems
- Serve users with different levels of requirements:
  - managers (computing project managers, site managers, production managers): CMS global views, site views
  - users running their tasks on the Grid: task views
- Follow the input and feedback of the CMS community:
  - what kind of information to collect
  - at what level to aggregate information
  - how to present the collected information
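The "generate alarms in case of evident problems" principle could be sketched as a simple threshold check over per-site statistics. The thresholds, the statistics layout, and the function name are all assumptions for illustration:

```python
def failure_alarms(site_stats, threshold=0.5, min_jobs=10):
    """Return the sites whose job failure rate exceeds a threshold.

    site_stats maps site -> (failed, total); sites with too few jobs
    are skipped to avoid alarms on statistically meaningless samples.
    Thresholds and layout are illustrative assumptions."""
    return [site
            for site, (failed, total) in sorted(site_stats.items())
            if total >= min_jobs and failed / total > threshold]
```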

Slide 13: Use of the Dashboard during SC3

[Screenshot: Dashboard view indicating publishing troubles.]

Slide 14: Use of the Dashboard during SC3

[Screenshot: another Dashboard view from SC3.]

Slide 15: IO rates monitoring

[Screenshot: IO rates monitoring view.]

Slide 16: Conclusions

- The ARDA analysis system is used by CMS physicists for running their analysis tasks on the Grid. The Task Manager service aims to minimize the effort required from the user and to increase the amount of successfully processed data. It received positive feedback from the users and will be integrated into the new CMS analysis system.
- The first prototype of the Dashboard and its use during the CMS service challenge demonstrated the experiment's need for a common monitoring tool combining Grid and experiment-specific information. The LHC experiments run on several middleware platforms, so a single entry point for monitoring information has additional value.
- A lot of work still has to be done to pass from the prototype to a production-quality system providing the needed level of reliability, scalability, and performance.

Slide 17: How to instrument jobs for MonALISA reporting (backup slide)

- At submission time, send:
  - meta information related to the task (application version, name of the executable, names of the input data collections...)
  - job-related information (event range or other identifier of the job within the task scope, Grid job ID)
- From the worker node, the job wrapper should send:
  - results of the sanity checks
  - start and end flags of the job processing steps (staging in, running the executable, staging out)
  - exit code of each job processing step
- When the job status is requested or the job output is retrieved via the job submission tool, send the status and status reason string to MonALISA
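The wrapper-side reporting described above could be sketched as follows. This is only a stand-in for the real MonALISA reporting library (ApMon): the host, port, wire format, and parameter names here are illustrative assumptions, not MonALISA's actual protocol.

```python
import socket

def report(params, host="localhost", port=8884):
    """Send one monitoring message as sorted key=value pairs in a UDP
    datagram, as a job wrapper might at each processing step.

    A stand-in for the real MonALISA client library; host, port and
    message layout are assumptions. Returns the message for inspection."""
    msg = " ".join("%s=%s" % item for item in sorted(params.items()))
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(msg.encode(), (host, port))   # fire-and-forget, like UDP monitoring
    sock.close()
    return msg

# e.g. from the job wrapper at the start of the running step (names assumed):
# report({"taskid": "task42", "jobid": "3", "step": "running", "flag": "start"})
```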