Overview of ALICE monitoring Catalin Cirstoiu, Pablo Saiz, Latchezar Betev 23/03/2007 System Analysis Working Group.

Contents
  AliEn overview
  Jobs workflow
  Direct job monitoring
  Monitoring requirements
  MonALISA overview
  What we monitor
  Monitoring architecture in AliEn
  Actions framework
  Feature examples

AliEn overview
  Worldwide distributed system for running ALICE jobs
  Set of central services
    TaskQueue, File Catalogue, Job Optimizer, Transfer Optimizer, User authentication, Logger Configuration…
  Set of site services, running on the VoBOX
    Site proxy, Computing element, Storage Adapter, Package Manager
  Pilot jobs (Job Agents) submitted to the WNs
  It provides a single high-level interface to the Grid, shielding the user from
    Grid flavors
    Complexity of the Grid itself
    Platforms, operating system flavors, software environment

AliEn Jobs workflow
  User submits a job to AliEn
    Job is registered in the AliEn TaskQueue
    User receives back a job ID to track the job
  Various optimizers run on the queued jobs
    An optimizer splits the job (if necessary) into several sub-jobs, which usually run close to the location of the data
    Job priority and quotas
  Jobs are matched to the resources needed for their execution
  Job Agents
    Pick a job, prepare input files, run the job, save the output
    Report on the status of the job throughout its lifetime
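
The splitting step in the workflow above can be pictured with a minimal Python sketch. It is not AliEn's optimizer: the function name, the file names, and the site names are all hypothetical, and the only idea it illustrates is grouping input files by the site that holds them, so that each sub-job can run close to its data.

```python
from collections import defaultdict


def split_by_location(input_files, file_locations):
    """Group input files into sub-jobs, one per storage site.

    input_files: list of file names (hypothetical).
    file_locations: dict mapping a file name to the site holding a replica.
    """
    groups = defaultdict(list)
    for name in input_files:
        # Files without a known replica are grouped under "ANY" (an assumption).
        site = file_locations.get(name, "ANY")
        groups[site].append(name)
    # One sub-job per site, returned in a stable (sorted) order.
    return [{"site": site, "inputs": files} for site, files in sorted(groups.items())]


subjobs = split_by_location(
    ["run1.root", "run2.root", "run3.root"],
    {"run1.root": "CERN", "run2.root": "GSI", "run3.root": "CERN"},
)
for subjob in subjobs:
    print(subjob["site"], subjob["inputs"])
```

A real optimizer would also have to weigh job priority and quotas, which the slide lists separately and which this sketch ignores.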

Job status flow chart

Direct job monitoring
  From the AliEn shell (the second part of the talk covers MonALISA)
  Show all my jobs
  Show details for job (split job with 943 sub-jobs)
  Show details for (another split job)

Direct job monitoring (2) Show all sub-jobs from master job in a specific error condition

Direct job monitoring (3) Show full tracelog of one of the sub-jobs in ERROR_V

Direct job monitoring (4) Show full tracelog of one of the sub-jobs in ERROR_IB

Direct job monitoring (5)
  All error conditions can be traced through this method
    Highly detailed log of the job progress
    Allows further debugging of what went wrong
    Example: ERROR_IB – SE not working, etc.
  In addition to already finished jobs, currently running jobs can be monitored

Direct job monitoring (6) Spy on any file in the sandbox as the job is running

Direct job monitoring (7)
  Additional information is available in the logfiles after the job has finished or ended in an error condition
  Once the error has been identified, the sub-jobs with errors can be resubmitted for execution in the same format as in the beginning (even the job ID is preserved)
    Simple command ‘resubmit
  Services status can also be monitored through the AliEn interface
  For a statistical overview of the job and services status and error conditions, ALICE uses MonALISA
    Second part of the presentation

Monitoring requirements
  Global view of the entire distributed system
  Non-intrusive
  Accurate
  Providing
    Near real-time information
    Long-term history of aggregated data
  On key parameters like
    System status
    Resource usage
  Helping with
    Correlating events
    System debugging
    Generating reports
    Taking automated actions based on the monitored data

MonALISA overview
  MonALISA is a dynamic, distributed service architecture capable of collecting any type of information from different systems, analyzing it in near real time, and providing support for automated control decisions and global optimization of workflows in complex grid systems.
  [Diagram labels: Data Store; Data Cache Service & DB; Configuration Control (SSL); Predicates & Agents; Data (via ML Proxy); Applications; Java Client (other service); Agents; Filters; Data Modules; WS Client (other service); Web Service (WSDL, SOAP); Lookup Service (Registration, Discovery); Postgres; MySQL]

ML Discovery System & Services
  The framework is based on a hierarchical structure of loosely coupled agents acting as distributed services. These are independent, autonomous entities able to discover one another and to cooperate using a dynamic set of proxies or self-describing protocols.
  [Diagram labels: Network of JINI-LUSs (Secure & Public); MonALISA services; Proxies; GUI Clients, Repositories, HL services; Global Services or Clients; Dynamic load balancing; Scalability & Replication; Security AAA for Clients; Distributed System for Gathering and Analyzing Information; Distributed Dynamic Discovery, based on a lease mechanism and REN; Agents]

ApMon – Application Monitoring
  Lightweight library of APIs (C, C++, Java, Perl, Python) that can be used to send any information to MonALISA services
  High communication performance
  Flexible
  Accounting
  System monitoring
  No lost packages
  Monitoring data is sent over UDP/XDR from the application (ApMon) to the MonALISA services
  ApMon configuration is generated automatically by a servlet / CGI script, with dynamic reloading
  Example values shown in the diagram
    Application monitoring: Mbps_out: 0.52, Status: reading, MB_inout:
    System monitoring: load1: 0.24, processes: 97, pages_in: 83
  [Other diagram labels: APPLICATION; MonALISA Service; MonALISA hosts; Config Servlet; ApMon Config (parameter1: value, parameter2: value); App. Monitoring... Time;IP;procID]
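
To make the "UDP/XDR" label concrete, here is a minimal Python sketch of XDR-encoding a set of named values into a datagram payload. It does not use the ApMon library and does not reproduce ApMon's actual wire format: the packet layout (cluster, node, count, then name/value pairs), the function names, and the cluster and node names are assumptions made for illustration. Only the XDR rules are standard (big-endian integers and doubles, length-prefixed strings padded to 4 bytes), and the sample values `load1: 0.24` and `processes: 97` come from the slide.

```python
import socket
import struct


def xdr_string(text):
    """XDR string: 4-byte big-endian length, bytes, zero padding to a multiple of 4."""
    data = text.encode("utf-8")
    padding = (4 - len(data) % 4) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * padding


def xdr_double(value):
    """XDR double: 8-byte big-endian IEEE 754."""
    return struct.pack(">d", value)


def encode_sample(cluster, node, params):
    """Hypothetical layout: cluster, node, parameter count, then (name, double) pairs."""
    payload = xdr_string(cluster) + xdr_string(node) + struct.pack(">i", len(params))
    for name, value in params.items():
        payload += xdr_string(name) + xdr_double(float(value))
    return payload


def send_sample(payload, host, port):
    """Send one datagram; UDP is fire-and-forget, so no reply is awaited."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))


packet = encode_sample("MyCluster", "node01", {"load1": 0.24, "processes": 97})
print(len(packet), "bytes ready to send")
```

Calling `send_sample(packet, host, port)` with the address of a listening service would transmit the payload. The example stops short of sending because the destination host and port depend on the deployment.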

What we monitor
  AliEn components
    Central Services
      Task Queue, Information Service, Optimizers, API etc.
    Site Services
      Cluster Monitor, Computing & Storage Elements
    Job Agents
      Jobs status & resource usage
  Other Services
    CastorGrid staging & migration, Xrootd, MySQL
  Nodes
    Central, site, worker nodes
  Network traffic – inter & intra site
    Via Xrootd
    Via FTD

Monitoring architecture in AliEn
  [Diagram: AliEn components instrumented with ApMon, MonALISA services, and the MonaLisaRepository]
  Site components with ApMon: AliEn Job Agents, AliEn CE, AliEn SE, Cluster Monitor, LCG Tools (LCG Site)
  Central components with ApMon: AliEn TQ, AliEn IS, AliEn Optimizers, AliEn Brokers, MySQL Servers, CastorGrid Scripts, API Services
  MonaLisaRepository: Aggregated Data, Long History DB, Alerts, Actions
  Monitored parameters shown: rss, vsz, cpu time, run time, job slots, free space, nr. of files, open files, Queued JobAgents, cpu ksi2k, job status, disk used, processes, load, net In/out, jobs status, sockets, migrated mbytes, active sessions, MyProxy status

Job status monitoring
  Global summaries
    For each/all conditions
    For each/all sites
    For each/all users
    Running & cumulative
  Error status
    From job agents
    From central services
  Real-time map view
  Integrated pie charts
  History plots

Job status & traffic - real-time map

Job status – integrated pie charts

Job status – history plots

Job resource usage monitoring
  Cumulative parameters
    CPU Time & CPU KSI2K
    Wall time & Wall KSI2K
    Read & written files
    Input & output traffic (xrootd)
  Running parameters
    Resident memory
    Virtual memory
    Open files
    Workdir size
    Disk usage
    CPU usage
  Aggregated per site

Job network traffic monitoring
  Based on the xrootd transfer from every job
  Aggregated statistics for
    Sites (incoming, outgoing, site to site, internal)
    Storage Elements (incoming, outgoing)
  Of what
    Read and written files
    Transferred MB/s
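
The per-site aggregation named on this slide (incoming, outgoing, site to site, internal) can be sketched in a few lines of Python. This is an illustration only, not the MonALISA implementation: the record format `(source_site, destination_site, megabytes)`, the function name, and the sample figures are assumptions.

```python
from collections import defaultdict


def aggregate_traffic(transfers):
    """Aggregate transfer records into per-site and site-to-site totals.

    transfers: iterable of (source_site, destination_site, megabytes) tuples
    (a hypothetical record format).
    """
    per_site = defaultdict(lambda: {"incoming": 0.0, "outgoing": 0.0, "internal": 0.0})
    site_to_site = defaultdict(float)
    for src, dst, megabytes in transfers:
        if src == dst:
            # Traffic that never leaves the site counts as internal.
            per_site[src]["internal"] += megabytes
        else:
            per_site[src]["outgoing"] += megabytes
            per_site[dst]["incoming"] += megabytes
            site_to_site[(src, dst)] += megabytes
    return dict(per_site), dict(site_to_site)


per_site, pairs = aggregate_traffic([
    ("CERN", "GSI", 120.0),
    ("CERN", "CERN", 40.0),
    ("GSI", "CERN", 10.0),
])
print(per_site["CERN"])
print(pairs)
```

Dividing each total by the length of the reporting interval would give the transferred MB/s figure the slide mentions.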

Individual job tracking
  Based on AliEn shell commands
    top, ps, spy, jobinfo, masterjob
  Interaction with file catalogue
  Using the GUI ML Client
    Status, resource usage, per job

Job agents monitoring
  From Job Agent itself
    Requesting job
    Installing packages
    Running job
    Done
    Error statuses
  From Computing Element
    Available job slots
    Queued Job Agents
    Running Job Agents

AliEn & LCG Services monitoring
  AliEn services
    Periodically checked
    PID check + SOAP call
    Simple functional tests
    SE space usage
    Efficiency
  LCG environment and tools
    Proxy, gsiscp, LCG CE/SE, Job submission, BDII, Local catalog, sw. area etc.
    Error messages in case of failure
    Efficiency
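
A periodic "PID check plus functional test" of the kind listed above might look like the following Python sketch. It is not AliEn's checker. The function names and status strings are hypothetical, the functional test is passed in as a plain callable that stands in for the SOAP call, and the PID check relies on POSIX signal semantics.

```python
import os


def pid_is_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        # Signal 0 performs the existence and permission checks without sending a signal.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def check_service(pid, functional_test):
    """Combine the PID check with a functional test supplied by the caller.

    functional_test: a callable that raises an exception on failure
    (a stand-in for the SOAP call; its shape is an assumption).
    """
    if not pid_is_alive(pid):
        return "DOWN: process not running"
    try:
        functional_test()
    except Exception as exc:
        return f"FAILED: {exc}"
    return "OK"


# The current process is certainly alive, so it serves as a safe demo target.
print(check_service(os.getpid(), lambda: None))
```

Running both checks matters because a process can be alive yet unresponsive, which the PID check alone would miss.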

CastorGrid scripts monitoring
  Migration
    Amount, speed, errors
  Staging
    Amount, speed, errors
  Nodes
    Host parameters
  Xrootd resource usage
  File cache status
    Used space, no. of files

API Services monitoring
  API Service sessions
    Established, active
  API Service users
    Active, total
  Statistics
    Executed commands

VOBox monitoring
  Machine parameters, real-time & history
    Load, memory & swap usage, processes, sockets

FTD Monitoring
  Status of the transfers
  Transfer rates
  Success/failures

Annotations

Actions framework
  Based on monitoring information, actions can be taken in
    ML Service
    ML Repository
  Actions can be triggered by
    Values above/below given thresholds
    Absence/presence of values
    Correlation between multiple values
  Possible action types
    Alerts
    Instant messaging
    RSS feeds
    External commands
    Event logging
  [Diagram labels: MLRepository (actions based on global information, global decisions); ML Service (actions based on local information, local decisions); Traffic, Jobs, Hosts, Apps; Sensors: Temperature, Humidity, A/C Power, …]

Alerts and actions
  MySQL daemon is automatically restarted when it runs out of memory
    Trigger: threshold on VSZ memory usage
  ALICE production jobs queue is automatically kept full by automatic resubmission
    Trigger: threshold on the number of aliprod waiting jobs
  Administrators are kept up to date on the services’ status
    Trigger: presence/absence of monitored information
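
The two trigger kinds used in these examples, thresholds and presence/absence of values, can be sketched as small Python predicates. This is a sketch under stated assumptions rather than the MonALISA actions framework: the function names, the parameter names (`mysql_vsz_mb`, `aliprod_waiting_jobs`, `site_heartbeat`), the limits (4096, 100, 300 seconds), and the action strings are all hypothetical.

```python
import time


def threshold_trigger(limit, above=True):
    """Fire when the latest value is above (or below) the given limit."""
    def check(value, last_seen, now):
        if value is None:
            return False
        return value > limit if above else value < limit
    return check


def absence_trigger(max_age_seconds):
    """Fire when no value was ever seen or the last one is too old."""
    def check(value, last_seen, now):
        return last_seen is None or now - last_seen > max_age_seconds
    return check


def evaluate(rules, samples, now=None):
    """Return the actions whose triggers fire.

    rules: list of (parameter, trigger, action) tuples.
    samples: dict mapping parameter -> (value, timestamp).
    """
    now = time.time() if now is None else now
    fired = []
    for parameter, trigger, action in rules:
        value, last_seen = samples.get(parameter, (None, None))
        if trigger(value, last_seen, now):
            fired.append(action)
    return fired


rules = [
    ("mysql_vsz_mb", threshold_trigger(4096), "restart mysqld"),
    ("aliprod_waiting_jobs", threshold_trigger(100, above=False), "resubmit production jobs"),
    ("site_heartbeat", absence_trigger(300), "alert administrators"),
]
samples = {
    "mysql_vsz_mb": (5000, 1000.0),
    "aliprod_waiting_jobs": (250, 1000.0),
}
actions = evaluate(rules, samples, now=1010.0)
print(actions)
```

With these sample values the memory rule fires, the queue rule stays quiet because 250 jobs are still waiting, and the heartbeat rule fires because no heartbeat sample exists at all.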

Summary
  All aspects of the system are monitored
    Job execution, job data transfers
    Central and site services
    Machines
  LCG services monitoring is done through custom scripts
    This should be improved
  MonALISA is a very flexible tool
    Provides a top-down monitoring solution for large, widely distributed systems
    It allows the gathered data to be used for intelligent control and decision making

Questions? Thank you!

ML Service deployment MonALISA is packaged and prepared for installation by the AliEn Build System (BITS)

ML Service configuration
  From the site administrator's point of view, it is just like any other AliEn service
    Start it with `alien StartMonaLisa`
    Stop it with `alien StopMonaLisa`
    Check its status with `alien StatusMonaLisa`
  Configuration files for ML are generated automatically from AliEn LDAP

PROOF CAF Monitoring
  Each host reports
    CPU, memory, swap, network
  Each slave reports
    Summaries per query type
    CPU, memory
    Event rate
    File rate
    I/O vs. network rate

Summary
  Monitoring is vital in a large distributed system
  Distributed systems should have built-in monitoring capabilities
  We have to deal with a lot of data
    We need decentralization
    And aggregation
  MonALISA is very flexible and powerful
  Monitoring is just the beginning
    Gathered data can be used for intelligent control and decision making