GridPP Monitoring & Accounting Dave Kant CCLRC, e-Science Centre.



EGEE03, April
Monitoring Overview
1. Overview
2. How Many Jobs on the Grid?
3. LCG/EGEE Monitoring System
4. Putting it all together for GridPP
5. Future Plans

How Many Jobs on the Grid?
- This question is a way to introduce the various tools that are in development in the LCG/EGEE Grid.
- There are different sources for estimating the number of jobs:
  o Information System
  o Accounting System
  o Resource Brokers

How Many Jobs on the Grid?
- One source of information is the monitoring system based on R-GMA.
- Tools which gather information and use the R-GMA backbone for data collection:
  o GIIS Monitor
  o APEL
  o Site Functional Tests
- Tools which create reports:
  o RB Logging & Bookkeeping data mining
  o Accounting

GIIS Monitor
- Developed by GOC Taipei (Min Tsai)
- Tool to display and check information published by the site GIIS
- Sanity checks and fault detection of the information system every 5 minutes
- Provides an instantaneous snapshot of the number of jobs

How Many Jobs on the Grid?
- Another source of information is the accounting system which, like many sources, is not complete but covers most of the resources. This is not the case for GridPP resources.
- Accounting information is based on resource usage published by batch servers.

How Many Jobs on the Grid?
- The latest source is a data mining tool which can be used to examine RB Logging and Bookkeeping information (via R-GMA) at the user level.

How Many Jobs on the Grid?
- A further source is based on the work of the EGEE QA Team.
- They monitor several, but not all, resource brokers on LCG and create reports of their usage.
- JRA2/index.html
- Statistics based on aggregated information:
  o Job success and job throughput per VO and per RB
  o Grid efficiency (execution time vs waiting time)

How Many Jobs on the Grid?

How Many Jobs on the Grid?
- Job duration, showing a dominance of Dteam and LHCb jobs, which are relatively short-lived.

Site Functional Tests
- Installation and configuration of a site is quite a complicated procedure.
  o When there is a new release, sites don't upgrade at the same time.
  o Some upgrades don't go smoothly.
  o Unexpected things happen (who turned off the power?).
  o Day-to-day problems; robustness of service under load?
- The SFT framework consists of a number of tests which probe a site to determine its operational status.
- This covers all certified sites in the EGEE/LCG infrastructure, and also uncertified sites (for the internal certification process performed by ROCs), sites that are part of the gLite Pre-Production Service, and all other sites that are using LCG or gLite middleware.

SFT
- Site summaries and histories
- SFT used by ROCs for certification
- Grid-Ireland SFT
- SFT runs every 3 hours and writes test results to a database using R-GMA

GridPP Monitoring Map
- Links hourly job submission test results to SFT, GSTAT, RSS feeds and accounting data
- GPPMon is a lightweight test which sends a simple job to GridPP resources every hour.

Future Plans for GPPMon
- GPPMon, the GridPP monitor, is to be switched off.
- SFT2 runs every 3 hours, and sites/ROCs can run these tests independently, so there is no real need for the GPPMon jobs.
- The proposal is to link the GridPP monitoring map to the monitoring data in R-GMA and to make use of changes to the grid middleware, e.g. support for longitude and latitude in the Glue Schema (LCG 2.6).
- Google Map

Google Map

Accounting Overview
This is a summary of the status of Accounting & Reporting following its deployment in LCG2_6.
1. Overview
2. APEL Design
3. What's New?
4. LCG Accounting (OSG, NorduGrid, EGEE)
5. Issues

Requirement Capture
- Originally a requirement of the LHC Computing Grid project.
- Requirements were originally captured through presentations to LCG's Grid Deployment Board Deployment Team.
- LHC experiments and the Tier1 centres are represented on the GDB.

Requirements
- A historical record of grid usage, to identify the use of individual sites by VOs as a function of time
- To demonstrate the total delivery of resources by that site to the Grid
- Aggregated views of the collected data grouped by:
  o Virtual Organisation
  o Country (a requirement of LCG, which has a country-based structure)
  o EGEE Region (for use by the EGEE Regional Operations Centres, ROCs)
- A presentation front-end to the data, to allow on-demand selection of the views described above for different VOs and periods of time
- To present the data as:
  o A graphical view for interpretation
  o A tabular view for precision
- To support sites that already had their own methods of data collection, by allowing arbitrary data collection techniques and insertion of the data, in the standard schema, into the central database

Requirements
- It was not an explicit requirement that user information be captured, but we included this in the design as we were sure it would become a secondary requirement.
- This is a reporting system, not a charging mechanism. The information is under the control of the site, so it does not meet the requirement of a charging system to be digitally signed and irrefutable.
- Information is gathered centrally, not under the control of the VO.

Design
- Information is collected at each site from batch logs, gatekeeper logs etc.
- Information is joined at site level to select grid jobs, and stored in a database on the R-GMA MON box at the site.
- Information is published through R-GMA and collected centrally in an R-GMA archive at the GOC.
- A web site presents various views of this data.
- The structure of the Grid is taken from GOC DB, the grid configuration database.
- Only normalised CPU time is collected.

Accounting Flow Diagram

How APEL Works
- The PBS/LSF log is processed daily on the site CE to extract the required data; the filter acts as an R-GMA DBProducer -> PbsRecords table.
- The gatekeeper log is processed daily on the site CE to extract the required data; the filter acts as an R-GMA DBProducer -> GkRecords table.
- The message log is processed daily on the site CE to extract the required data; the filter acts as an R-GMA DBProducer -> MessageRecords table.
- The site GIIS is interrogated daily on the site CE to obtain SpecInt and SpecFloat values for the CE; this acts as a DBProducer -> SpecRecords table, one dated record per day.
- These three tables are joined daily on the MON box to produce the LcgRecords table. As each record is produced, the program acts as a StreamProducer to send the entries to the LcgRecords table at the GOC site.
- The site now has a table containing its own accounting data; the GOC has an aggregated table over the whole of LCG.
- Interactive and regular reports are produced by the site or at the GOC site as required.

Join Processor
- Mapping grid users to the resource usage on local farms
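The join can be pictured with a minimal Python sketch. It assumes the three sources have already been parsed into lists of dictionaries; all field names and sample values are hypothetical, since the slides do not give the real APEL table columns. Batch jobs with no gatekeeper mapping are treated as local (non-grid) jobs and dropped.

```python
# Hypothetical, simplified records; the real APEL columns are not given in the slides.
pbs_records = [
    {"local_job_id": "101.ce", "cpu_seconds": 3600, "wall_seconds": 4000},
    {"local_job_id": "102.ce", "cpu_seconds": 120, "wall_seconds": 300},  # local job
]
message_records = [{"gram_script_job_id": "gram-1", "local_job_id": "101.ce"}]
gk_records = [{"gram_script_job_id": "gram-1", "user_dn": "/C=UK/CN=example user"}]


def join_records(pbs, messages, gatekeeper):
    """Join batch, message and gatekeeper records; keep grid jobs only."""
    gram_by_local = {m["local_job_id"]: m["gram_script_job_id"] for m in messages}
    dn_by_gram = {g["gram_script_job_id"]: g["user_dn"] for g in gatekeeper}
    joined = []
    for job in pbs:
        gram_id = gram_by_local.get(job["local_job_id"])
        if gram_id is None or gram_id not in dn_by_gram:
            continue  # no grid mapping: a local user job, not accounted here
        joined.append({**job, "gram_script_job_id": gram_id,
                       "user_dn": dn_by_gram[gram_id]})
    return joined


lcg_records = join_records(pbs_records, message_records, gk_records)
print(lcg_records)
```

Only the first batch job survives the join, because it is the only one that can be traced back to a grid user.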

GOC Reporting Web Pages
- Job records arrive via R-GMA (R-GMA MON): 1 record per grid job (millions of records expected)
- SQL query to the accounting server: 1 query per hour
- Summary data refreshed every hour (max records about 100K per year)
- On-demand accounting pages based on SQL queries to the summary data
- Home page, user queries, graphs
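A rough sketch of the hourly summary refresh, using Python's built-in sqlite3 as a stand-in for the real accounting database. The column names, the SummaryRecords table name and the sample rows are assumptions made for illustration; only the LcgRecords table name comes from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical columns: one row per grid job.
conn.execute(
    "CREATE TABLE LcgRecords (site TEXT, vo TEXT, month TEXT, cpu_seconds INTEGER)"
)
conn.executemany(
    "INSERT INTO LcgRecords VALUES (?, ?, ?, ?)",
    [
        ("SiteA", "atlas", "2005-08", 3000),
        ("SiteA", "atlas", "2005-08", 2000),
        ("SiteA", "lhcb", "2005-08", 100),
    ],
)


def refresh_summary(connection):
    """Rebuild the small summary table that the web pages would query."""
    connection.execute("DROP TABLE IF EXISTS SummaryRecords")
    connection.execute(
        """
        CREATE TABLE SummaryRecords AS
        SELECT site, vo, month, COUNT(*) AS njobs, SUM(cpu_seconds) AS cpu_seconds
        FROM LcgRecords
        GROUP BY site, vo, month
        """
    )


refresh_summary(conn)
summary = conn.execute(
    "SELECT site, vo, month, njobs, cpu_seconds FROM SummaryRecords "
    "ORDER BY site, vo, month"
).fetchall()
print(summary)
```

Querying the summary rather than the per-job table keeps on-demand pages fast, since the summary grows with the number of site/VO/month combinations and not with the number of jobs.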

Accounting Home Page
- 107 sites publishing data (Sep )
- Over 3.3 million job records
- ~100K records per week (period June 1st to mid Aug 2005)

What's New?
- Added a GridPP View to the reporting interface
- Requirements driven by GridPP:
  o Global view of the entire organisation
  o Tier-2 summaries
  o Detailed view at site level
  o CSV download of information
  o Toggle between normalised / un-normalised datasets

GridPP Input
- GridPP Metrics and Deployment Document (J. Coles)
  o Metric 10: Number of sites publishing accounting data at the end of the last quarter
  o Metric 11: KSI2K hours of CPU processing delivered (per VO) over the last quarter
- We are looking for meaningful plots that allow important conclusions to be drawn without misleading people.
- Is Job Efficiency meaningful? Sites treat their data in different ways:
  o At the Tier-1, WCT is scaled because of the scheduler
  o At other sites, only system time is scaled
  o What about hyper-threading?
- Perhaps we need to provide descriptive text against each plot to warn of such problems?
- Spot potential problems in resource allocation
- Identify trends

GridPP View Screen Shots

- KSI2K delivered per Tier1/Tier2 per VO: ATLAS and LHCb dominating
- ATLAS dominates in Tier1
- Job Efficiency = CPUT/WCT
- Why is ATLAS efficiency at 60%?
- Why is DZERO efficiency for MANHEP > 1?
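The efficiency definition above (CPU time divided by wall clock time) can be sketched in a few lines of Python. The site names and numbers below are invented purely for illustration and are not taken from the accounting data; flagging values above 1 is one possible way to spot entries that need investigation.

```python
def job_efficiency(cpu_time, wall_clock_time):
    """Return CPUT/WCT, or None when the wall clock time is not positive."""
    if wall_clock_time <= 0:
        return None
    return cpu_time / wall_clock_time


def flag_suspicious(summaries):
    """List (site, vo, efficiency) entries whose efficiency exceeds 1."""
    flagged = []
    for entry in summaries:
        eff = job_efficiency(entry["cpu_time"], entry["wall_clock_time"])
        if eff is not None and eff > 1.0:
            flagged.append((entry["site"], entry["vo"], eff))
    return flagged


# Illustrative values only.
summaries = [
    {"site": "SiteA", "vo": "vo1", "cpu_time": 60.0, "wall_clock_time": 100.0},
    {"site": "SiteB", "vo": "vo2", "cpu_time": 150.0, "wall_clock_time": 100.0},
]
print(flag_suspicious(summaries))
```

An efficiency above 1 usually points to inconsistent scaling of CPU time and wall clock time at a site, which is the kind of problem the slide asks about.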

Tier2 View (NorthGrid)

Site View (Lancaster)
- Breakdown of data per VO per month showing Njobs, CPUT, WCT and record history
- Total CPU usage per VO
- Gantt chart
- NB: Gaps across all VOs are consistent with scheduled downtimes in GOCDB

APEL in LCG 2.6
- New version with better documentation
- APEL supports PBS and LSF
- Consists of a number of components:
  o The core module contains functionality common to all components
  o Plugin components provide log parsing functionality for the PBS and LSF job managers

Accounting Dissemination
1. CERN Courier
2. LCG Computing Newsletter (slightly more technical)
3. AHM 2005 (more technical still)

Monitoring APEL
- SFT checks the accounting service; APEL is not considered a critical service.
- Two-step process:
  1. Is the archiver listening?
     o The GOC flexible archiver service listens for accounting producers.
     o rgma -c "select count(*) from LcgRecords"
     o If the test fails: "Archiver Down" warning
  2. Has the site published accounting data recently?
     o How many records were published by the site in the last day?
     o rgma -c "select count(*) from LcgRecords where ExecutingCE=' ' and MeasurementDate>=' '"
     o If the test fails: "Site Apel Down" warning
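The decision logic of the two-step check might look like the following Python sketch. It does not call R-GMA itself: run_query is a hypothetical callable, supplied by the caller, that executes a query and returns a count or raises on failure. The CE name and date are placeholder parameters, and treating a zero count as a failed test is an assumption.

```python
def check_apel(run_query, executing_ce, since_date):
    """Return a list of warnings for one site (empty when all is well)."""
    # Step 1: is the archiver answering queries at all?
    try:
        run_query("select count(*) from LcgRecords")
    except Exception:
        return ["Archiver Down"]
    # Step 2: has this site published any records since the given date?
    site_query = (
        "select count(*) from LcgRecords "
        f"where ExecutingCE='{executing_ce}' and MeasurementDate>='{since_date}'"
    )
    if run_query(site_query) == 0:
        return ["Site Apel Down"]
    return []


# Stand-in query runners, for illustration only.
def archiver_down(query):
    raise ConnectionError("archiver not reachable")


def no_recent_records(query):
    return 0 if "ExecutingCE" in query else 42


def healthy(query):
    return 42


print(check_apel(archiver_down, "ce.example.org", "2005-09-01"))
print(check_apel(no_recent_records, "ce.example.org", "2005-09-01"))
print(check_apel(healthy, "ce.example.org", "2005-09-01"))
```

Running the archiver check first avoids blaming a site for missing records when the central service is the real problem.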

Issues
4. Which Log Files Should Site Administrators Back Up?
- To build accounting records, we need to process data from THREE log file sources. This is mandatory in order to reconstruct what has been done during the 2004 period.
  o /var/log/globus-gatekeeper*: match between grid-user DN and GramScriptJobId
  o /var/spool/pbs/server_priv/accounting/*: local job ID and details of resources consumed; no distinction between grid jobs and non-grid jobs
  o /var/log/messages*: maps GramScriptJobID to local job ID. This is how we separate grid jobs from local user jobs which run on the local fabric.
- If the site has deleted its messages files, we may be able to work around this by matching local unix groups in the batch logs. Accounting records formed in this way will not contain the DN of the grid user.

Issues
5. siteName Changes
- Recent problem with presenting data from the French ROC, where CCIN2P3 was renamed to IN2P3-CC via the GOCDB portal.
- All records associated with the site are updated so that SQL queries match the new siteName.
6. Namespace Convention?
- A naming scheme is needed to identify data belonging to large sites which provide services for different communities etc.
  o NIKHEF: lcgprod.nikhef.nl, lcg2prod.nikhef.nl, edgapptb.nikhef.nl
- *SiteName* is a bad choice because we get multiple hits.
  o *IC-LCG2* gives multiple matches: PIC-LCG2 and IFIC-LCG2
- Request that sites stick to the convention *.SiteName
  o h1.desy.de, zeus.desy.de
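The wildcard problem can be reproduced with Python's standard fnmatch module, used here as a stand-in for whatever pattern matching the accounting queries actually perform. The name lists are illustrative.

```python
from fnmatch import fnmatchcase

site_names = ["IC-LCG2", "PIC-LCG2", "IFIC-LCG2"]

# A pattern with a wildcard on both sides also matches the longer names.
loose = [name for name in site_names if fnmatchcase(name, "*IC-LCG2*")]
print(loose)

# With the *.SiteName convention only a prefix wildcard is needed,
# and the dot anchors the match to the site's own namespace.
hosts = ["h1.desy.de", "zeus.desy.de", "lcgprod.nikhef.nl"]
desy = [host for host in hosts if fnmatchcase(host, "*.desy.de")]
print(desy)
```

The first pattern returns all three site names, while the dotted suffix pattern selects only the two DESY entries.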

Issues
7. Normalisation
- We want a reasonably sensible first-order estimate to account for differences in worker node performance.
- Homogeneous vs heterogeneous
- PBS job records don't have any information about the worker node benchmarks, so we must insert one manually.
- PBS farms are set up in different ways; this can lead to an error in the normalisation calculation (Blindman vs internal normalisation).
- Histories: what SpecInts do we use in order to process archived job records?
- LSF job records have a CPU_FACTOR (1 - 4) in the job record.
  o What does a value of 1 correspond to?
  o Different calibration value at each site
  o Conversion table?
  o Can the site publish a weighted SpecInt2000 for the farm?
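As a sketch of what a first-order normalisation could look like, the Python below scales CPU hours by the farm's SpecInt2000 rating relative to a 1000 SI2K reference, giving KSI2K hours. The weighting by CPU count for a heterogeneous farm is an assumption, one possible answer to the question of a weighted SpecInt2000, and the farm figures are invented.

```python
REFERENCE_SI2K = 1000.0  # 1 KSI2K


def weighted_specint(node_groups):
    """Average SpecInt2000 over (cpu_count, specint2000) groups, weighted by CPUs."""
    total_cpus = sum(count for count, _ in node_groups)
    if total_cpus == 0:
        raise ValueError("farm has no CPUs")
    return sum(count * specint for count, specint in node_groups) / total_cpus


def ksi2k_hours(cpu_hours, specint2000):
    """Convert raw CPU hours to KSI2K hours for a given SpecInt2000 rating."""
    return cpu_hours * specint2000 / REFERENCE_SI2K


# Illustrative heterogeneous farm: 100 CPUs at 1000 SI2K, 50 CPUs at 1600 SI2K.
farm = [(100, 1000.0), (50, 1600.0)]
farm_specint = weighted_specint(farm)
print(farm_specint, ksi2k_hours(10.0, farm_specint))
```

A single weighted figure hides which worker node a job actually ran on, so it is only as accurate as the assumption that jobs are spread evenly across the farm.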

Issues
8. Service Reliability & Hardening
- If the flexible archiver is down, sites are unable to publish data to the GOC.
  o Update: /43 APEL core checks if the flexible archiver service is available before attempting to publish data.
  o The GOC publishes a test record every 5 minutes to check the service is alive; an automatic service recovery mechanism is now in place.
- Investigate running multiple flexible archiver services: 1 per GOC or 1 per ROC?
  o At the moment, the archiver service listens for all producers rather than producers belonging to a ROC.
- Single point of failure if the registry is down?
  o Multiple registry replicas supported in the RC1 (gLite) release?
  o Update: Multiple registries supported in LCG2_4_0?

Future Plans
1. Interoperate
2. CERN Courier / LCG News
3. Wiki Pages

APEL and gLite
- Is APEL integrated in gLite?
  o Work currently in progress. We have ported the APEL code into the gLite CVS repository but need to understand functional differences, e.g. WMS and use of Condor.
- What about its development plan?
  o Future unclear given the presence of DGAS in gLite.
- Areas of possible development:
  o Condor (easy or complicated?)
  o Reporting tool (GridICE will most likely provide this)

LCG Accounting
- Project involves combining results from all three infrastructures and presenting an aggregated view.
- Peer infrastructures in LCG:
  o Open Science Grid (Ruth Pordes, Philippe Canal, Matteo Melani)
  o NorduGrid (Per Oster)
  o EGEE
- Currently, LHCView filters LHC VO data from EGEE accounting data.

Requirements
- Combine results from all three infrastructures …
- Ideally: distributed queries to multiple databases
  o Each peer manages an accounting database
  o LHC VO filtering provided through a web services interface
- Initial implementation: centralised collection
  o Peers publish data into a global database
  o Web services or direct MySQL inserts
- Common problem: different Grid infrastructures may use different schemas.
  o GGF defines a schema, but it is quite flexible.
  o Translators may be needed to convert from one schema to another (these already exist).
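A schema translator of the kind mentioned above could be as simple as a field-name mapping. The Python sketch below is illustrative only: the source and target field names are hypothetical and do not reproduce the GGF schema or any peer infrastructure's real schema.

```python
# Hypothetical mapping from a peer's field names to a common schema.
FIELD_MAP = {
    "JobOwner": "user_dn",
    "Site": "site_name",
    "CpuSeconds": "cpu_seconds",
}


def translate(record, field_map):
    """Rename the mapped fields of one record; fail loudly if one is missing."""
    missing = [source for source in field_map if source not in record]
    if missing:
        raise KeyError(f"missing source fields: {missing}")
    return {target: record[source] for source, target in field_map.items()}


peer_record = {"JobOwner": "/CN=example user", "Site": "SiteA", "CpuSeconds": 3600}
common_record = translate(peer_record, FIELD_MAP)
print(common_record)
```

Rejecting records with missing fields, rather than inserting partial rows, keeps the global database consistent when peers publish into it directly.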