Dave Kant Monitoring ROC Workshop Milan 10-11/5/04.



Grid Operations Centre
Within the scope of LCG we are responsible for:
- Monitoring how the grid is running: who is up, who is down, and why
- Identifying problems, contacting the right people, and suggesting actions
- Providing scalable solutions to allow other people to monitor resources
- Managing site information (the definitive source of information)
- Accounting: aggregate job throughput (per site, per VO)
Established at CLRC (RAL). Status of LCG2 Grid here:

Monitoring Overview
- Why we monitor
  - Keep systems up and running
  - Notice failures; grid-wide services (MDS)
  - Know what services a site should be running
    - No point raising an alert if the site isn't meant to run it!
    - Definition of services and which sites run them (SLA)
- What tools do we use
  - Job submission; GridIce; Nagios
  - How: database
- Developments planned: Nagios
- 3-stage plan over the next 12 months
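The point about only alerting on services a site is meant to run can be sketched as a small filter. This is a minimal illustration, not the GOC's code: the `EXPECTED` mapping, the site names, and the `alerts_for` function are all hypothetical.

```python
# Hypothetical service definitions: which services each site is meant to run.
EXPECTED = {
    "SiteA": {"CE", "SE", "BDII"},
    "SiteB": {"CE"},
}


def alerts_for(site, observed_up):
    """Return the expected services at `site` that are not currently up.

    Services the site is not meant to run never raise an alert.
    """
    expected = EXPECTED.get(site, set())
    return sorted(expected - set(observed_up))


if __name__ == "__main__":
    # SiteB is not meant to run an SE, so a missing SE is not a failure there.
    print(alerts_for("SiteA", {"CE"}))
    print(alerts_for("SiteB", {"CE"}))
```

Keeping the service definitions in one place means the alerting logic never needs to know why a service is absent, only whether it was expected.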

EGEE Stage 1
- RAL runs the monitoring.
- All RCs are added to the database through their ROC, i.e. the ROC takes responsibility for adding and checking information and data consistency in the database.
- Provide tailored maps (example: GridPP).
- Each ROC will monitor its sites and regional services through the GOC monitoring at RAL.
- Timescale: ~3-6 months

EGEE Stage 2
- Distribute the GOC software so that ROCs can run the monitoring tools themselves.
- The centralised database stays at RAL, but ROCs configure their monitoring from it.
- Further monitoring development is required before this stage is complete. [Nagios not finished; other outstanding items, e.g. packaging and documentation; CVS: do we continue to use the LCG CVS repository?]
- Timescale: ~6-12 months

EGEE Stage 3
- Distribute the database amongst the ROCs.
- A large distributed database instead of a single database.
- Distributed database hops to monitor core services.
- Timescale: ~12 months and beyond

GOC Site Database
- Develop and maintain a database to hold site information: contact lists, nodes, IP addresses, URLs, scheduled maintenance.
- Each site has its own administration page, with access controlled through X.509 certificates (GridSite).
- Monitoring scripts read the information in the database and run a set of customised tools to monitor the infrastructure.
- To be included in the monitoring, a site must register its resources (CE, SE, RB, RC, RLS, MDS, RGMA, BDII, ...).

[Diagram: GOC GridSite/MySQL server. Resource Centres (RC) manage their resources and site information (people, contact information, resources, scheduled maintenance) for EDG, LCG-1, LCG-2, ... through secure database management via HTTPS / X.509. Monitoring reads the registered nodes (ce, se, bdii, rb) from the database via SQL.]

People: who do we notify when there are problems? (Example: RAL site)

Node information: type, hostname, IP address, group. (Example: RAL site)

Monitoring Services
- There are many frameworks which can be used to monitor distributed environments: MapCenter, GPPMon, GridIce, Nagios, MonALISA.
- Example: MapCenter with 30 sites needs ~500 lines in its config file (static version).
- Example: Nagios with 30 sites needs 12 individual config files with dependencies.
- We developed tools to configure these services (Nagios, MapCenter and GPPMon) to make the job easier.
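As an illustration of generating monitoring configuration from database content, the sketch below renders Nagios host definitions from a list of node rows. It is not the GOC tool: the `NODES` rows, the addresses, and the use of a `generic-host` template are assumptions made for the example.

```python
# Hypothetical node rows, as they might come back from the site database.
NODES = [
    {"hostname": "ce.example.org", "ip": "192.0.2.10", "type": "CE", "site": "SiteA"},
    {"hostname": "se.example.org", "ip": "192.0.2.11", "type": "SE", "site": "SiteA"},
]

# Nagios object definition for one host; "generic-host" is an assumed template.
HOST_TEMPLATE = """define host{{
    use        generic-host
    host_name  {hostname}
    alias      {site} {type}
    address    {ip}
}}
"""


def nagios_hosts(nodes):
    """Render one Nagios host definition per node row."""
    return "\n".join(HOST_TEMPLATE.format(**node) for node in nodes)


if __name__ == "__main__":
    print(nagios_hosts(NODES))
```

Regenerating the file whenever the database changes avoids maintaining hundreds of configuration lines by hand.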

GOC Features: GPPMon
- Status of the grid, based on the success of job submission to resources, displayed as a world map with sites represented by coloured dots.
- SQL query of the database -> list of resources (CE, RB).
- Job submission to each site in two ways:
  - Direct to CE: globus-job-run
  - Indirect to CE via Resource Brokers: edg-job-submit
- Responses are collected and translated into a site status colour index: success via RB = green, Globus only = orange, fail = red.
- Geographical view presented against a world map.
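The colour index described above reduces to a two-input decision. A minimal sketch, with a hypothetical function name and boolean inputs standing in for the collected job responses:

```python
def site_colour(rb_ok, globus_ok):
    """Map the two job-submission results to the site status colour."""
    if rb_ok:
        return "green"   # job succeeded via the Resource Broker
    if globus_ok:
        return "orange"  # only the direct globus-job-run succeeded
    return "red"         # both submission routes failed


if __name__ == "__main__":
    for rb_ok, globus_ok in [(True, True), (False, True), (False, False)]:
        print(rb_ok, globus_ok, site_colour(rb_ok, globus_ok))
```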

[Figure: LCG2 core sites. Status: 23 March]

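GOC Job Submission Flow Diagram (via Resource Broker)
[Diagram: the GOC (UI) runs an SQL query against the site DB to build the list of CE and RB resources, creates a job script for each RB.CE pair (Other.GlueCEUniqueID), and submits it with edg-job-submit. Stages shown: RB, CE, WN. Status labels: create, sent, received, acknowledge; acknowledgement via wget.]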

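GOC Job Submission Flow Diagram (direct to CE)
[Diagram: the GOC (UI) runs an SQL query against the site DB to build the list of CE and RB resources, creates a job script for each GLOBUS.CE target, and submits it with globus-job-run. Stage shown: CE. Status labels: create, sent, received, acknowledge; acknowledgement via wget.]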

[Figure: LCG2 core sites. Status: 8th May, ~30 sites]

[Figure: LCG1 CERT. Status: 27 Feb 2004]

GOC Features: Nagios Monitoring
- Nagios is a powerful monitoring service that supports notifications and the execution of remote agents to correct problems when faults are discovered.
- Advantage: proactive monitoring of the grid (NRPE daemon).
- Automatic configuration of Nagios based on the database.
- We developed a set of plugins which focus on service behaviour and data consistency:
  - Do RBs find resources?
  - Do site GIISes publish the correct hostname?
  - Is the site running the latest stable software release?
  - Does the gatekeeper authentication service work?
  - Are the host certificates valid, e.g. issued by a trusted CA?
  - Are essential services running, e.g. GridFTP?
- Further plugins are being developed (e.g. certification).
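A Nagios plugin reports its result through its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and a one-line message. The sketch below shows that convention for a certificate lifetime check. It is not one of the GOC plugins: the thresholds, the message wording, and the hard-coded expiry date are assumptions, and reading the expiry date from a real host certificate is left out.

```python
import sys
from datetime import datetime, timezone

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_cert_lifetime(not_after, now, warn_days=30, crit_days=7):
    """Return (exit_code, message) for a certificate expiring at `not_after`.

    The warn/crit thresholds are illustrative defaults, not GOC policy.
    """
    if not_after <= now:
        return CRITICAL, "CERT CRITICAL - certificate has expired"
    days_left = (not_after - now).days
    if days_left < crit_days:
        return CRITICAL, f"CERT CRITICAL - {days_left} days left"
    if days_left < warn_days:
        return WARNING, f"CERT WARNING - {days_left} days left"
    return OK, f"CERT OK - {days_left} days left"


if __name__ == "__main__":
    # Hypothetical expiry date; a real plugin would read it from the certificate.
    expiry = datetime(2030, 1, 1, tzinfo=timezone.utc)
    code, message = check_cert_lifetime(expiry, datetime.now(timezone.utc))
    print(message)
    sys.exit(code)
```

Separating the decision from the I/O keeps the threshold logic easy to test without a certificate on disk.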

Nagios Screen Shot
[Screenshot: service summary for nodes. Checks shown: certificate lifetime check, GridFTP, GRAM authentication, site attributes via GIIS (siteName, Tag, ...). Columns: HOST, PLUGIN, STATUS, STATUS INFORMATION.]

Distributing GOC Software
- Packaging monitoring tools: provide ROCs with a standard set of tools to proactively monitor resources.
- 2nd prototype GOC established in Taipei (GMT+8 hours).
- Remote query to collect a list of resources; local query if the service is not available.
- Monitor resources via job submission.
- [Diagram: GOC centres at CLRC and TW; GOC GridSite/MySQL; tools; site config.]

LCG Accounting Overview
- We have an accounting solution. The accounting is provided by RGMA.
- At each site, log-file data is processed from different sources and published into a local database.
- [Diagram of sources and destination:]
  - CE: PBS/LSF jobmanager log
  - GateKeeper: listens on port 2119; GRAM authentication
  - GIIS: LDAP information server
  - MON: RGMA database

LCG Accounting: How it Works
- The GOC provides an interface to produce accounting plots "on demand":
  - Total number of jobs per VO per site (ok)
  - Total number of jobs per VO aggregated over all sites (to be done)
- Tailor plots according to the requirements of the user community.
- [Plot: Taipei statistics, Feb/Mar: ~1000 Alice jobs]

LCG Accounting
- [Plot: CNAF statistics, March: ~10,000 Alice jobs]
- [Plot: RAL statistics, March: ~6,300 Alice jobs]

Monitoring Developments
- Provide ROCs with a package to monitor the resources in the region.
  - Tailored monitoring: ROCs can upload their own maps.
  - GUI to automate site locations on the map.
- Hierarchical view of resources.
  - Example: GridPP is made up of virtual T2 centres.
  - [Hierarchy diagram, labels: EGEE; France, UK/I, S.E.E; GridPP; LondonT2 (IMPERIAL, QMUL); ScotGrid (Edinburgh)]

Monitoring Developments
- Proactively monitor resources.
  - Make use of Nagios (NRPE) features.
  - Future direction for GridIce, which already monitors critical processes.
- Document to define monitoring procedures.
  - Checklist to provide a roadmap of what tests to perform and what actions to undertake when problems are discovered.
- Federated Ganglia? Provide a cookbook to do this.
- Can GOC tools be used for CIC?

Accounting Developments
- CE and GK not the same machine! Which machine holds the relevant messages log file for processing?

LCG Accounting: How it Works
- Data source: PBS event logs on the CE (/var/spool/pbs/server_priv/accounting).
- A PBS filter extracts data from the event log records.
- The RGMA API publishes the data to the PbsRecords database table on the MON box and records the names of the processed logs for book-keeping (LcgProcessed table).
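To make the filtering step concrete, here is a minimal sketch of parsing one PBS accounting line into a few of the PbsRecords fields. It is not the actual filter: the `SAMPLE` line is made up (only the job ID format comes from these slides), and mapping the PBS job ID to `JobName` is an assumption. PBS accounting lines have the form `date time;type;job id;key=value ...`, and "E" marks a job-end record.

```python
# Made-up sample line in the PBS accounting log format.
SAMPLE = (
    "05/10/2004 12:00:00;E;1390.lcgce02.gridpp.rl.ac.uk;"
    "user=alice001 group=alice queue=short Exit_status=0 "
    "resources_used.cput=00:50:00 resources_used.walltime=01:00:00"
)


def hms_to_seconds(hms):
    """Convert an HH:MM:SS duration string to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def parse_end_record(line):
    """Return a dict for an "E" (job end) record, or None for anything else."""
    parts = line.rstrip("\n").split(";", 3)
    if len(parts) != 4 or parts[1] != "E":
        return None
    job_id, message = parts[2], parts[3]
    # The message is a space-separated list of key=value pairs.
    fields = dict(item.split("=", 1) for item in message.split() if "=" in item)
    return {
        "JobName": job_id,  # assumption: the PBS job ID is stored as JobName
        "LocalUserID": fields.get("user"),
        "LocalUserGroup": fields.get("group"),
        "WallDurationSeconds": hms_to_seconds(fields["resources_used.walltime"]),
        "CpuDurationSeconds": hms_to_seconds(fields["resources_used.cput"]),
    }


if __name__ == "__main__":
    print(parse_end_record(SAMPLE))
```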

LCG Accounting: How it Works
"END" event records contain the following information (SQL PbsRecords table on the MON box):
+---------------------+--------------+
| Field               | Type         |
+---------------------+--------------+
| RecordIdentityP     | varchar(255) |
| SiteName            | varchar(50)  |
| JobName             | varchar(100) |
| LocalUserID         | varchar(20)  |
| LocalUserGroup      | varchar(20)  |
| WallDuration        | varchar(30)  |
| CpuDuration         | varchar(30)  |
| WallDurationSeconds | int(11)      |
| CpuDurationSeconds  | int(11)      |
| StartTime           | varchar(30)  |
| StopTime            | varchar(30)  |
| SubmitHost          | varchar(50)  |
+---------------------+--------------+
The actual table schema contains more information than is shown here.

LCG Accounting: How it Works
- Data sources: Globus gatekeeper logs and system messages logs.
- Extract data from the globus-gatekeeper and system messages logs.
- Record a list of files processed to reduce network traffic/load.
- Log files under /var/log: globus-gatekeeper.log gz messages.2.gz messages.3.gz
- Tables on the MON box: GKRecords, JobNames, LcgProcessed.
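The book-keeping of processed files can be sketched as a set of file names that is consulted before each run. This is only an illustration of the idea: the function names are hypothetical, and an in-memory set stands in for the LcgProcessed table.

```python
def unprocessed(log_files, processed):
    """Return the log files not yet recorded as processed, in sorted order."""
    return sorted(set(log_files) - set(processed))


def process_logs(log_files, processed, handle):
    """Run `handle` on each new log file, then record its name as processed."""
    for name in unprocessed(log_files, processed):
        handle(name)
        processed.add(name)
    return processed


if __name__ == "__main__":
    done = {"messages.1.gz"}  # stand-in for rows already in LcgProcessed
    process_logs(["messages.1.gz", "messages.2.gz"], done, print)
```

Recording the name only after the handler returns means a file whose processing fails is retried on the next run.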

LCG Accounting: How it Works
SQL GKRecords table on the MON box:
+-----------------+--------------+
| Field           | Type         |
+-----------------+--------------+
| RecordIdentityG | varchar(255) |
| GramScriptJobID | varchar(100) |
| LocalJobID      | varchar(50)  |
| GlobalUserName  | varchar(255) |
| SubmitHost      | varchar(50)  |
| SiteName        | varchar(50)  |
| ValidFrom       | date         |
| ValidUntil      | date         |
+-----------------+--------------+
The actual table schema contains more information than is shown here.

LCG Accounting: How it Works
- To match the authenticated users' DNs to the corresponding jobs we need to process the system messages logs, because the record IDs in the gatekeeper log and the PBS event log differ: Record ID [GK] =/= Record ID [PBS].
- [Diagram labels: Gatekeeper log; PBS event log, PBSJobNameID 1390.lcgce02.gridpp.rl.ac.uk; Messages log, GramScriptJobID :lcgpbs:internal_ : : : : : : :139 4]

LCG Accounting: How it Works
- Data source: LDAP GIIS server.
- A GIIS filter collects CPU performance benchmarks for the worker nodes from the subclusters attached to the CE.
- The RGMA API publishes the data to the SpecRecords database table on the MON box.

LCG Accounting: How it Works
SQL SpecRecords table on the MON box: CPU performance benchmarks for the worker nodes in the subclusters attached to the CE.
+----------------+--------------+
| Field          | Type         |
+----------------+--------------+
| RecordIdentity | varchar(255) |
| SiteName       | varchar(50)  |
| ClusterID      | varchar(50)  |
| SubClusterID   | varchar(50)  |
| SpecInt2000    | int(11)      |
| SpecFloat2000  | int(11)      |
+----------------+--------------+
The actual table schema contains more information than is shown here.

LCG Accounting: How it Works
- A 3-way join on the MON box matches records from the GKRecords, PbsRecords and JobNames tables and writes them to the LcgRecords table.
- LcgRecords records are unique.
- The site now has a copy of its own accounting data.
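The 3-way join can be sketched with an in-memory SQLite database. This is an illustration under stated assumptions, not the production schema or SQL: the tables are cut down to a few columns, the join keys (`JobName` and `GramScriptJobID`) and the layout of JobNames are guesses, the sample rows are made up, and `INSERT OR IGNORE` is SQLite syntax used here to keep LcgRecords unique.

```python
import sqlite3

# Hypothetical, cut-down schemas; the real tables hold more columns and the
# join keys used below are assumptions.
SCHEMA = """
CREATE TABLE PbsRecords (JobName TEXT, LocalUserGroup TEXT, CpuDurationSeconds INTEGER);
CREATE TABLE GKRecords  (GramScriptJobID TEXT, GlobalUserName TEXT);
CREATE TABLE JobNames   (GramScriptJobID TEXT, JobName TEXT);
CREATE TABLE LcgRecords (JobName TEXT PRIMARY KEY, GlobalUserName TEXT,
                         LocalUserGroup TEXT, CpuDurationSeconds INTEGER);
"""

JOIN = """
INSERT OR IGNORE INTO LcgRecords
SELECT p.JobName, g.GlobalUserName, p.LocalUserGroup, p.CpuDurationSeconds
FROM PbsRecords AS p
JOIN JobNames  AS j ON j.JobName = p.JobName
JOIN GKRecords AS g ON g.GramScriptJobID = j.GramScriptJobID
"""


def make_sample_db():
    """Create an in-memory database holding made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO PbsRecords VALUES (?, ?, ?)", [
        ("1390.lcgce02.gridpp.rl.ac.uk", "alice", 3000),
        ("1391.lcgce02.gridpp.rl.ac.uk", "local", 120),  # no gatekeeper record
    ])
    conn.execute("INSERT INTO GKRecords VALUES (?, ?)",
                 ("gram-0001", "/C=XX/O=Example/CN=Example User"))
    conn.execute("INSERT INTO JobNames VALUES (?, ?)",
                 ("gram-0001", "1390.lcgce02.gridpp.rl.ac.uk"))
    return conn


def build_lcg_records(conn):
    """Run the 3-way join and return the contents of LcgRecords."""
    conn.execute(JOIN)
    return conn.execute("SELECT * FROM LcgRecords ORDER BY JobName").fetchall()


if __name__ == "__main__":
    for row in build_lcg_records(make_sample_db()):
        print(row)
```

Because the inner joins drop any batch record without a matching gatekeeper entry, only jobs that arrived through the grid end up in LcgRecords in this sketch, and re-running the join does not create duplicates.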

LCG Accounting: How it Works
- Data processed at each site (LcgRecords on the MON box of Site 1 ... Site n) is streamed to the GOC server.
- The GOC then has aggregated information for all sites.
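Once the records from all sites sit in one table, the job counts behind the accounting plots are simple group-by counts. A minimal sketch, assuming each record carries a `SiteName` and a `VO` field (the `VO` field and the sample records are hypothetical; the logs themselves only provide the local group):

```python
from collections import Counter


def jobs_per_vo_per_site(records):
    """Count jobs for each (site, VO) pair."""
    return Counter((rec["SiteName"], rec["VO"]) for rec in records)


def jobs_per_vo(records):
    """Count jobs for each VO, aggregated over all sites."""
    return Counter(rec["VO"] for rec in records)


if __name__ == "__main__":
    # Made-up records standing in for the aggregated LcgRecords table.
    sample = [
        {"SiteName": "SiteA", "VO": "alice"},
        {"SiteName": "SiteA", "VO": "alice"},
        {"SiteName": "SiteB", "VO": "alice"},
        {"SiteName": "SiteB", "VO": "atlas"},
    ]
    print(jobs_per_vo_per_site(sample))
    print(jobs_per_vo(sample))
```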

LHC Accounting Summary
1. The PBS log is processed daily on the site CE to extract the required data; the filter acts as an R-GMA DBProducer -> PbsRecords table.
2. The gatekeeper log is processed daily on the site CE to extract the required data; the filter acts as an R-GMA DBProducer -> GkRecords table.
3. The site GIIS is interrogated daily on the site CE to obtain SpecInt and SpecFloat values for the CE; the filter acts as a DBProducer -> SpecRecords table, one dated record per day.
4. These three tables are joined daily on the MON box to produce the LcgRecords table. As each record is produced, the program acts as a StreamProducer to send the entries to the LcgRecords table at the GOC site.
5. The site now has a table containing its own accounting data; the GOC has an aggregated table over the whole of LCG.
6. Interactive and regular reports are produced by the site or at the GOC site as required.

Accounting Issues
1. There is no R-GMA infrastructure LCG-wide, so most sites are not able to install and run the accounting suite at present. It is expected that R-GMA and the MON boxes will be rolled out in LCG2 soon after the storage problems are resolved. Until this happens the complete batch and gatekeeper logs will have to be copied to the GOC site for processing.
2. The VO associated with a user's DN is not available in the batch or gatekeeper logs. It will be assumed that the group ID used to execute user jobs, which is available, is the same as the VO name. This needs to be acknowledged as an LCG requirement.
3. The global jobID assigned by the Resource Broker is not available in the batch or gatekeeper logs, so it cannot appear in the accounting reports. The RB Events Database contains it, but that database is not accessible, nor is it designed to be easily processed.
4. At present the logs provide no means of distinguishing sub-clusters of a CE which have nodes of differing processing power. Changes to the information logged by the batch system will be required before such heterogeneous sites can be accounted properly. At present it is believed all sites are homogeneous.