ATLAS Production System Monitoring John Kennedy LMU München CHEP 07 Victoria BC 06/09/2007.

Presentation transcript:


Overview
- Production System: quick intro
- Monitoring goals
- Design
- Functionality: examples
- Usage
- Conclusion and outlook

Prod-sys
- Central database with
  - Tasks: groups of jobs
  - Jobs: job definitions
  - Attempts: attempt info for jobs
- Supervisor (Eowyn)
  - Retrieves jobs from the DB
  - Passes them to the Grid through the executor
  - Manages job attempt info
- Executor (one per grid flavour)
  - Creates wrappers etc. for submission
  - Submits the job
  - Manages the job and interprets its output (failure reasons etc.)
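
The supervisor/executor split described on this slide can be pictured with a small sketch. This is only an illustration under stated assumptions, not the Eowyn code: the names `Job`, `ToyExecutor`, `supervise` and the `max_attempts` retry limit are all hypothetical, and the toy executor simply fails the first attempt at each job so that the attempt bookkeeping is visible.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    job_id: int
    name: str
    attempts: list = field(default_factory=list)  # outcome of each attempt, in order


class ToyExecutor:
    """Stand-in for a per-grid-flavour executor (hypothetical)."""

    def __init__(self):
        self._seen = set()

    def submit(self, job):
        # The first attempt at each job fails, later attempts succeed.
        if job.job_id in self._seen:
            return "finished"
        self._seen.add(job.job_id)
        return "failed"


def supervise(jobs, executor, max_attempts=3):
    """Pass each job to the executor and record the outcome of every attempt."""
    for job in jobs:
        while len(job.attempts) < max_attempts:
            outcome = executor.submit(job)
            job.attempts.append(outcome)
            if outcome == "finished":
                break


jobs = [Job(1, "job.0001"), Job(2, "job.0002")]
supervise(jobs, ToyExecutor())
for job in jobs:
    print(job.name, job.attempts)
```

Keeping the attempt history on the supervisor side mirrors the slide's point that the supervisor manages attempt info while the executor only deals with submission and output interpretation.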

Prod-sys
[Diagram. Labels: Supervisor (Eowyn), Exe, Prod DB, DDM, LCG; Clouds (one labelled GridKa), each with LFC, T1, VOBOX and T2; flows labelled Jobs and Data.]
- O(50K) jobs per day

Production Database
- Hierarchy of tables
  - Task table (11,018 entries): contains task definition data
  - JobDef table (6,619,954 entries): contains the definition of each job within a task
  - JobExe table (9,457,938 entries): contains a record of each attempt at a specific job
- Multiple jobs per task, multiple attempts per job
[Diagram: Task (Taskid, Tier, Requirements, Grid) -> JobDef (Jobid, jobname, partition, attempt, priority) -> JobExe (JobExeid, jobname, executor, site, metadata, errorinfo)]
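
As a rough illustration of the three-level hierarchy, the sketch below builds toy versions of the tables in an in-memory SQLite database (the real system uses Oracle). The column names follow the slide; the linking columns `jobdef.taskid` and `jobexe.jobid`, the rename of `partition` to `partition_no`, and all row values are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE task (
    taskid INTEGER PRIMARY KEY,
    tier TEXT, requirements TEXT, grid TEXT
);
CREATE TABLE jobdef (
    jobid INTEGER PRIMARY KEY,
    taskid INTEGER REFERENCES task(taskid),   -- assumed link: many jobs per task
    jobname TEXT, partition_no INTEGER, attempt INTEGER, priority INTEGER
);
CREATE TABLE jobexe (
    jobexeid INTEGER PRIMARY KEY,
    jobid INTEGER REFERENCES jobdef(jobid),   -- assumed link: many attempts per job
    jobname TEXT, executor TEXT, site TEXT, metadata TEXT, errorinfo TEXT
);
""")

# Made-up rows: one task, two jobs, three attempts in total.
conn.execute("INSERT INTO task VALUES (1, 'T1', 'none', 'LCG')")
conn.executemany(
    "INSERT INTO jobdef VALUES (?, 1, ?, ?, ?, ?)",
    [(10, "job.a", 1, 2, 100), (11, "job.b", 2, 1, 100)],
)
conn.executemany(
    "INSERT INTO jobexe VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (100, 10, "job.a", "exe_lcg", "SITE_A", "", "stage-in failed"),
        (101, 10, "job.a", "exe_lcg", "SITE_B", "", ""),
        (102, 11, "job.b", "exe_lcg", "SITE_A", "", ""),
    ],
)

# Walk the hierarchy: jobs and attempts per task.
rows = conn.execute("""
    SELECT t.taskid, COUNT(DISTINCT d.jobid), COUNT(e.jobexeid)
    FROM task t
    JOIN jobdef d ON d.taskid = t.taskid
    JOIN jobexe e ON e.jobid = d.jobid
    GROUP BY t.taskid
""").fetchall()
print(rows)
```

With millions of JobExe rows, joins like this one are the kind of query the monitoring design tries to keep away from the central database.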

Monitoring Goals
- Provide easy access to monitoring and accounting info
- Avoid load on the central database
- Provide views for
  - Recent jobs, to allow problem spotting
  - Configurable time windows for number of jobs, walltime and errors
  - Higher-level views to give an understanding of job states and available jobs
- Provide high- and low-level views!

Design
- Monitoring tables defined in a specific monitoring account
- Oracle stored procedures used to fill the monitoring tables from the Prodsys tables
- Monitoring tables used in combination with smaller proddb tables to provide web pages
- PHP pages used to interface to the tables
- Cron jobs to give periodic views

Design
[Diagram: PRODUCTION: Production Database holding all job execution records (10M).]

Design
[Diagram: PRODUCTION: Production Database holding all job execution records (10M). MONITORING: Monitoring Database with a snapshot filled with the last 24 hrs of execution records (100K) and a summary table filled from the snapshot.]

Design
[Diagram: PRODUCTION: Production Database holding all job execution records (10M). MONITORING: Monitoring Database with a snapshot filled with the last 24 hrs of execution records (100K) and a summary table filled from the snapshot. The monitoring web pages read from the Monitoring Database.]
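
The snapshot/summary idea in the diagram can be sketched as follows. The slides say this is done with Oracle stored procedures run from cron; the example below swaps in Python and an in-memory SQLite database purely for illustration. The table names `jobexe`, `mon_snapshot` and `mon_summary`, the columns, and the `refresh` function are hypothetical, and the summary here is simply rebuilt on each run, which may well be simpler than the real procedure.

```python
import sqlite3
from datetime import datetime, timedelta

now = datetime(2007, 9, 6, 12, 0, 0)  # fixed "now" so the example is reproducible

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE jobexe (jobexeid INTEGER PRIMARY KEY, site TEXT, status TEXT, endtime TEXT);
CREATE TABLE mon_snapshot (jobexeid INTEGER PRIMARY KEY, site TEXT, status TEXT, endtime TEXT);
CREATE TABLE mon_summary (site TEXT, status TEXT, njobs INTEGER);
""")
conn.executemany("INSERT INTO jobexe VALUES (?, ?, ?, ?)", [
    (1, "SITE_A", "finished", (now - timedelta(hours=1)).isoformat()),
    (2, "SITE_A", "failed", (now - timedelta(hours=5)).isoformat()),
    (3, "SITE_B", "finished", (now - timedelta(hours=30)).isoformat()),  # too old
    (4, "SITE_A", "finished", (now - timedelta(hours=2)).isoformat()),
])


def refresh(conn, now):
    """Rebuild the 24 hr snapshot from the big table, then the summary from the snapshot."""
    cutoff = (now - timedelta(hours=24)).isoformat()
    with conn:  # one transaction, so readers never see a half-filled table
        conn.execute("DELETE FROM mon_snapshot")
        conn.execute(
            "INSERT INTO mon_snapshot SELECT * FROM jobexe WHERE endtime >= ?",
            (cutoff,),
        )
        conn.execute("DELETE FROM mon_summary")
        conn.execute(
            "INSERT INTO mon_summary "
            "SELECT site, status, COUNT(*) FROM mon_snapshot GROUP BY site, status"
        )


refresh(conn, now)
snapshot_size = conn.execute("SELECT COUNT(*) FROM mon_snapshot").fetchone()[0]
summary = conn.execute(
    "SELECT site, status, njobs FROM mon_summary ORDER BY site, status"
).fetchall()
print(snapshot_size, summary)
```

Only `refresh` touches the large production table; the web pages would query the two small monitoring tables, which is where the reduction from roughly 10M to 100K rows pays off.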

Last 24hr View
- The snapshot table in the monitoring database contains a 24hr sliding window of the job execution records.
- Pages are generated from this table to give views of jobs/errors grouped by task, site and executor.
- This 24hr view is aimed at people on shift and at spotting recent problems.

Last 24hr View
- Views group info by tasks, sites and executors
- Aids problem solving

Last 24hr View
- Views group info by tasks, sites and executors
- Aids problem solving
- Click to get a breakdown
- Details for sites give pending/running/finished/failed jobs and a colour-coded efficiency: spot sinners!
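
A per-site breakdown with a colour-coded efficiency might be computed along these lines. The slides do not define the efficiency or the colour thresholds, so both are assumptions here: efficiency is taken as finished / (finished + failed), and the 0.9 and 0.5 cut-offs are invented for the example, as are the site names and records.

```python
from collections import Counter, defaultdict

# Made-up (site, state) records, as they might come from the 24 hr snapshot.
records = (
    [("SITE_A", "finished")] * 9 + [("SITE_A", "failed")]
    + [("SITE_B", "finished")] + [("SITE_B", "failed")] * 3
    + [("SITE_C", "pending")] * 2
)


def site_breakdown(records):
    """Count jobs per state for each site."""
    counts = defaultdict(Counter)
    for site, state in records:
        counts[site][state] += 1
    return counts


def efficiency(counts):
    """Assumed definition: finished / (finished + failed); None if nothing has ended."""
    ended = counts["finished"] + counts["failed"]
    return counts["finished"] / ended if ended else None


def colour(eff):
    """Hypothetical thresholds for the colour coding."""
    if eff is None:
        return "grey"
    if eff >= 0.9:
        return "green"
    if eff >= 0.5:
        return "yellow"
    return "red"


breakdown = site_breakdown(records)
report = {site: colour(efficiency(c)) for site, c in breakdown.items()}
print(report)
```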

Last 24hr View

Last 24hr View
- Click on efficiency to get a breakdown of errors

Last 24hr View
- Click on efficiency to get a breakdown of errors
- Click on an error to get detailed job execution info
- Click # to get logfiles (EGEE)

Tasks Status
- Select task by grid, state and name

Tasks Status
- Select task by grid, state and name
- Click to get details

OverViews
- Allow views to be generated over defined time windows
- Uses the summary table
  - Ensures quick response
  - Low load on database
- Provide views of
  - Finished/failed jobs and walltime
  - Error reports, pie charts
  - Allocated tasks/jobs: ensure grids have jobs to run
- Allow split by grid flavour and sites
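
A time-window overview served from a pre-aggregated summary table could look like the sketch below. It assumes, for illustration only, that the summary is binned per grid and per hour; the table layout, the `overview` function, the grid name `GRID_B` and all numbers are hypothetical, and SQLite stands in for Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE summary (
        grid TEXT, hour TEXT, finished INTEGER, failed INTEGER, walltime REAL
    )
""")
conn.executemany("INSERT INTO summary VALUES (?, ?, ?, ?, ?)", [
    ("LCG", "2006-10-01T00", 10, 2, 3600.0),
    ("LCG", "2006-10-01T01", 5, 1, 1800.0),
    ("GRID_B", "2006-10-01T00", 4, 0, 900.0),
    ("LCG", "2006-11-01T00", 7, 7, 100.0),  # outside the window queried below
])


def overview(conn, start, end):
    """Finished/failed jobs and walltime per grid for start <= hour < end."""
    return conn.execute(
        """
        SELECT grid, SUM(finished), SUM(failed), SUM(walltime)
        FROM summary
        WHERE hour >= ? AND hour < ?
        GROUP BY grid
        ORDER BY grid
        """,
        (start, end),
    ).fetchall()


result = overview(conn, "2006-10-01T00", "2006-10-02T00")
print(result)
```

Because the query only ever sums a handful of small rows per bin, the window can be made arbitrarily wide without touching the job execution records themselves.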

OverViews
[Plots: jobs and walltime, Oct-Dec 06, with the Release Validation and Production Operation phases marked.]
- Successful jobs, Oct-Dec 06: all grids perform better after the validation phase
- High failure rates during the validation phase; Savannah bug reports etc. from operation teams

OverViews
- Looking at jobs to highlight problems and reduce load
- Looking at walltime to find where/why we waste resources
[Chart labels: Stagein (lcg-cp), gLite (WMS), Proxy on WN, GetOut, StageOut (lcg-cr), ATHENA]

OverViews
- Views of available jobs split by SW version, grid and state: allows us to ensure we have jobs that can run
- Definitions of error codes/acronyms with their meaning: helps shifters to interpret problems

EGEE Shift System
- During the last 12 months an EGEE shift system has been put in place
- People responsible for
  - Assigning tasks to tiers
  - Monitoring jobs
  - Monitoring data movement
  - Ensuring tasks are progressing
  - Reporting problems to sites/physicists
- Strong users of the monitoring, with lots of feedback
- Pages developed to help shifters

EGEE Shift Pages
- Identify sites with ATLAS SW installation problems
- Monitor running tasks
- Identify old tasks which need action
- Identify nearly completed tasks

Monitoring Usage
- Monitor the monitoring
- Monitor usage of the main web pages
  - Total number of users
  - Busy pages
- Log SQL query times of cron jobs
  - Spot increasing load and problems when crons run
  - Monitoring account which can be tracked
- Currently O(40-50) individual users per day
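
Logging the query times of periodic jobs can be as simple as wrapping each query in a timer. The sketch below is an assumption about how this might be done, not the system's actual mechanism: `timed_query`, `query_log` and the toy `snapshot` table are hypothetical names, and SQLite is used in place of Oracle.

```python
import sqlite3
import time

query_log = []  # one (label, seconds, rows returned) entry per query


def timed_query(conn, label, sql, params=()):
    """Run a query and record how long it took, so that growing load shows up over time."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    query_log.append((label, time.perf_counter() - start, len(rows)))
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE snapshot (site TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO snapshot VALUES (?, ?)",
    [("SITE_A", "finished"), ("SITE_A", "failed")],
)

rows = timed_query(
    conn, "jobs_by_status", "SELECT status, COUNT(*) FROM snapshot GROUP BY status"
)
for label, seconds, nrows in query_log:
    print(f"{label}: {seconds:.6f} s, {nrows} rows")
```

Writing these entries to a table or log file on every cron run would give the trend needed to spot queries that slow down as the production tables grow.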

Conclusion
- The model of using cooked-down monitoring tables has proved successful
  - Quicker queries
  - Reduced load on database
- Both high- and low-level views have been implemented, allowing for
  - Problem identification
  - Accounting
- EGEE shift pages are well developed thanks to good communication with users (I am part of the shift team!)
- Much more functionality provided than originally expected
- From a personal point of view: it's been fun and it's useful

Outlook
- Much has been learned during the project
  - By developers
    - Snapshot/summary table use
    - What users want
  - By users
    - What's useful
    - What's not
    - What would be nice
- Prod-sys monitoring is being integrated into the ARDA dashboard
  - Functionality based on experience gained here
  - Database snapshot tables etc. will remain
  - PHP replaced by a new framework
  - Migration has started; expected to be complete by the start of 2008

THANKS! Thanks for your time and attention!