Monitoring and Accounting in EGEE/LCG Jeremy Coles (for Dave Kant) ARM-6 Barcelona Based on GridPP15 talk.

ARM-6 Barcelona, Jan
Overview
Monitoring
- Service Availability Monitoring
  - Service Availability Monitoring Environment
  - The Sensors
  - Schema
Accounting
- Status of Batch Support in APEL
  - Condor and SGE
- LCG-RUS

Service Availability Monitoring
- Grid Operations activity (CERN lead), with contributions from anyone who wants to participate
- Work started at the 4th EGEE conference, Pisa (October)
- Implementation of sensors, metrics and alarms for services in the EGEE/LCG infrastructure to ensure smooth grid operations
  - Good sensors
  - Meaningful metrics
  - Controllable alarms
- How to contribute

Contributions to Sensors
- Substantial Metrics document in circulation, which defines 50+ metrics
- Home page not yet available
- Section 6 concerns the services:
Service     Region               Contacts
BDII        AsiaPacific          Min Tsai
Catalogue   CERN                 James Casey
CE          CERN                 Piotr Nyczyk
FTS         CERN                 Gavin McCance
MyProxy     SEE-Greece / CERN    ? / Maarten Litmaath
RGMA        UKI and CERN         Laurence Field, Antony Wilson
RB          UKI and Italy        Dave Kant, Sergio Andreozzi
SRM         UKI                  Dave Kant, Jens Jensen, Greg Cowan
VOMS        Italy                Valerio Venturi

Architecture
- All sensors publish into RGMA using a common schema
- Publish frequency depends on the sensor: SFT every 2 hours; RB every 30 minutes; SRM once a day
- Alarms are generated according to thresholds, e.g. an RB alarm if match-making time exceeds 90 seconds
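The publish intervals and the RB alarm threshold on this slide can be sketched as a small configuration plus two checks. This is only an illustration: the record fields, the metric name `matchmake_time_s` and the function names are assumptions, not the actual RGMA schema or sensor code.

```python
from dataclasses import dataclass

# Publish intervals quoted on the slide, in seconds.
PUBLISH_INTERVAL_S = {
    "SFT": 2 * 60 * 60,   # every 2 hours
    "RB": 30 * 60,        # every 30 minutes
    "SRM": 24 * 60 * 60,  # once a day
}

# Threshold quoted on the slide; the metric name is hypothetical.
ALARM_THRESHOLDS = {("RB", "matchmake_time_s"): 90.0}


@dataclass
class SensorRecord:
    """One row in a common schema (field names are illustrative only)."""
    service: str
    host: str
    metric: str
    value: float
    timestamp: int  # seconds since the epoch


def is_due(service: str, last_publish: int, now: int) -> bool:
    """True when the sensor for `service` should publish again."""
    return now - last_publish >= PUBLISH_INTERVAL_S[service]


def raises_alarm(record: SensorRecord) -> bool:
    """True when the record's value exceeds its configured threshold."""
    threshold = ALARM_THRESHOLDS.get((record.service, record.metric))
    return threshold is not None and record.value > threshold
```

Keeping intervals and thresholds in plain tables means a new sensor only needs a new entry, which fits the idea of one schema shared by all sensors.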

Timeline
Preliminary releases expected:
- Sensors: Feb 2006
- Summary Generator
  - Indian team?
- Metric Generator
  - Re-use Lemon components?
- Displays: Feb 2006
  - Based on SFT
- Alarm System: March 2006
  - Sure/Lemon (Piotr)
  - RSS (Dave)
  - Integration with CIC portal (Lyon?)
Work in progress; community working together

Sensors for Service Monitoring
RB Active Monitoring
- Track a test job through the Grid, from UI to Worker Node
- Functional test: can RBs match jobs to the resources requested?
- Frequent job submission: sample functionality every 30 minutes
- Tools on the UI (edg-get-job-info)
- RGMA publishers on RB and WN
- Screen shots: Job Summaries, RB Summaries, Metrics

Example: RB Service Monitoring
Our experience: maps not practical for day-to-day operational activities?

Example: RB Service Monitoring
- Shows results of the latest round of jobs sent to RBs
- View details of individual tests

Track a test job through the Grid, from UI to Worker Node
[Diagram labels: UI, edg-job-output, RB, L&B Info, RB Publisher, WN Publisher]

Recent History for an RB
Derive metric data
- Capture the time to matchmake the job
- Capture availability in a 24-hour period: number of jobs to reach DONE divided by the total number of jobs submitted
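The two metrics on this slide amount to simple arithmetic. A minimal sketch, assuming availability is the fraction of submitted test jobs that reach DONE and that timestamps are in seconds (the function names are illustrative):

```python
def availability(jobs_done: int, jobs_submitted: int) -> float:
    """Fraction of test jobs submitted in the period that reached DONE."""
    if jobs_submitted <= 0:
        raise ValueError("no jobs were submitted in the period")
    return jobs_done / jobs_submitted


def matchmake_time(submitted_at: float, matched_at: float) -> float:
    """Seconds between job submission and a successful match."""
    return matched_at - submitted_at
```

With one test job every 30 minutes, a 24-hour period holds 48 jobs, so `availability(45, 48)` gives 0.9375.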

Passive Monitoring
Passive monitoring (Italy: Sergio Andreozzi): processing of log files
- Workload Manager component
  - WaitingRequests
  - InputFileListSize
- Job Controller
  - WaitingRequests
  - InputFileListSize
- Network Server
  - submissionRate (requests/600s)
- WM Proxy
  - ServerPoolSize
- Whole system
  - InJobs in last 10 mins
  - OutJobs in last 10 mins
- Hosting environment
  - Load (1, 5, 15), memory (used, free, total, real, virtual)
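A rate such as submissionRate (requests/600s) can be derived from request timestamps taken out of a log. The sketch below assumes the timestamps have already been parsed into a sorted list of seconds; it does not reflect any real log format, and the function name is hypothetical.

```python
from bisect import bisect_right


def submission_rate(timestamps: list[float], now: float,
                    window_s: float = 600.0) -> int:
    """Count requests with now - window_s < t <= now.

    `timestamps` must be sorted in ascending order.
    """
    lo = bisect_right(timestamps, now - window_s)
    hi = bisect_right(timestamps, now)
    return hi - lo
```

The same window of 600 seconds would also cover the "in last 10 mins" counters listed for the whole system.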

General Issues
Will R-GMA/MySQL be able to cope with the volume of data?
- GSTAT (GIIS monitor) alone generates 5 GB of data per month
- CERN is considering moving to Oracle (RGMA supported, or migrating data from the MySQL archiver)
- What plans are there for Oracle support in R-GMA?

Types of Accounting
Job accounting AFTER the event (APEL domain)
- Concept of a “job” as a unit of resource consumption
- Determination of value after job execution
- Job usage record as a complete description of resource consumption
- Suitable for post-paid services
Real-time accounting (DGAS, SGAS domain)
- Incremental determination of resource value while the job is being executed
- Incremental decrement of account balance
- Can enforce user quotas
- Suitable for pre-paid services
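The difference between the two models can be sketched in a few lines. This is a toy illustration under assumed names and units, not the behaviour of APEL, DGAS or SGAS: the pre-paid path charges the account while the job runs and stops it when the balance runs out, while the post-paid path only writes one usage record at the end.

```python
class PrepaidAccount:
    """Hypothetical account whose balance is decremented as a job runs."""

    def __init__(self, balance: float) -> None:
        self.balance = balance

    def charge(self, amount: float) -> bool:
        """Deduct `amount` if it is covered; otherwise refuse."""
        if amount > self.balance:
            return False
        self.balance -= amount
        return True


def run_prepaid(account: PrepaidAccount, cost_per_tick: float,
                ticks: int) -> int:
    """Real-time model: charge every tick, stop when the quota is hit.

    Returns the number of ticks that were actually executed.
    """
    executed = 0
    for _ in range(ticks):
        if not account.charge(cost_per_tick):
            break
        executed += 1
    return executed


def postpaid_record(cost_per_tick: float, ticks: int) -> dict:
    """After-the-event model: one complete usage record once the job ends."""
    return {"ticks": ticks, "cost": cost_per_tick * ticks}
```

Only the pre-paid loop can enforce a quota, because it is the only one that looks at the balance before the resource is consumed.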

APEL, Job Accounting Flow Diagram
[1] Build job accounting records at site
[2] Send job records to a central repository
[3] Data aggregation

Accounting for Grid Jobs
Build job records at site: APEL maps grid users to the resource usage on local farms
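Mapping grid users to local resource usage is, in essence, a join between what the grid side knows about a job and what the batch system logged for it. The sketch below assumes both sides share a local job identifier; every field name is hypothetical and none is taken from the APEL schema.

```python
def build_job_records(grid_entries: list[dict],
                      batch_entries: list[dict]) -> list[dict]:
    """Join grid identity onto batch usage by local job id."""
    usage_by_job = {b["local_job_id"]: b for b in batch_entries}
    records = []
    for g in grid_entries:
        usage = usage_by_job.get(g["local_job_id"])
        if usage is None:
            # No batch usage yet (job still running, or log not parsed).
            continue
        records.append({
            "grid_user": g["grid_user"],
            "vo": g["vo"],
            "local_job_id": g["local_job_id"],
            "cpu_s": usage["cpu_s"],
            "wall_s": usage["wall_s"],
        })
    return records
```

Jobs that appear on only one side are skipped rather than guessed at, so an incomplete log never produces a partial record.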

GOC Consolidation of Data
- Job records in via RGMA (RGMA MON): 1 record per grid job (millions of records expected)
- SQL query to accounting server: 1 query per hour
- Summary data refreshed every hour (max records about 100K per year)
- On-demand accounting pages based on SQL queries to summary data
- Home page: user queries, graphs
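The step from one record per job to a much smaller summary table is a grouped SQL query. A minimal sketch using an in-memory SQLite database; the table layout, column names and sample rows are invented for illustration and are not the accounting server's schema.

```python
import sqlite3


def summarise(rows: list[tuple]) -> list[tuple]:
    """Collapse per-job rows into one row per (site, vo, month).

    Each input row is (site, vo, month, cpu_s, wall_s); each output row
    is (site, vo, month, njobs, total_cpu_s, total_wall_s).
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE job_records "
        "(site TEXT, vo TEXT, month TEXT, cpu_s INTEGER, wall_s INTEGER)"
    )
    con.executemany("INSERT INTO job_records VALUES (?, ?, ?, ?, ?)", rows)
    cur = con.execute(
        "SELECT site, vo, month, COUNT(*), SUM(cpu_s), SUM(wall_s) "
        "FROM job_records GROUP BY site, vo, month "
        "ORDER BY site, vo, month"
    )
    result = cur.fetchall()
    con.close()
    return result
```

Because the summary has one row per site, VO and month rather than one per job, the on-demand pages can query it cheaply however many job records arrive.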

APEL Status
- APEL has been in production for 1 year
- 156 sites, 5.4 million job records
- 100K job records per week
  - Linear rise (cf. exponential) continues despite growth in CEs
  - More sites doesn’t mean more jobs or more users

Demos of Accounting Aggregation
Global views of resource consumption
- LHC View
  - Data aggregation across countries
- EGEE View
  - Data aggregation across EGEE ROCs
  - Based on LHC View and data mining
  - Displays official EGEE VOs (12) and regional VOs
  - Tables to show which GOCDB sites haven’t published recently, and which ones publish but are not listed in GOCDB
- GridPP View
  - Specific view for GridPP accounting summaries for Tier-2s
  - Comments from GridPP users -> prototype -> EGEE view changes

Aggregation of Data for GridPP

Aggregation of Data for Tier2

Data Aggregation at Site Level
- Breakdown of data per VO per month showing Njobs, CPUt, WCT, record history
- Total CPU usage per VO
- Gantt chart
NB: Gaps across all VOs are consistent with scheduled downtimes in GOCDB

Batch Support in APEL
Currently available in LCG 2.6
- OpenPBS, Torque, PBSPro and vanilla PBS
  - ~80% of sites in LCG/EGEE
- Load Share Facility (versions 5 and 6)
  - CERN, Italy
In development
- Condor ( )
  - Requested by Canada, UK
  - Was due for release in Nov/December but delayed
  - Must deal with multiple parses of large batch files: Condor does not self-manage its logs, so they grow to > 2 GB in size, and multiple parses via APEL are inefficient
- Sun Grid Engine ( )
  - Requested by UK (Imperial College)
  - Format of log records unclear to us: missing information in message logs
  - LCG-SGE job manager format is not LCG compliant (PBS, LSF and Condor all are!). Substantial changes to APEL are required unless this is addressed more carefully.
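One generic way to avoid re-parsing a multi-gigabyte log on every run is to remember how far the previous run got and read only what was appended since. The sketch below shows that technique in isolation; it is an assumption about how the problem could be approached, not a description of what APEL does, and the function name is hypothetical.

```python
import os


def read_new_lines(log_path: str, offset: int) -> tuple[list[str], int]:
    """Return the complete lines appended since `offset`, plus the new offset.

    The caller stores the returned offset between runs (for example in a
    small state file) and passes it back on the next run.
    """
    if os.path.getsize(log_path) < offset:
        offset = 0  # the file was truncated or rotated, so start again
    with open(log_path, "rb") as fh:
        fh.seek(offset)
        data = fh.read()
    # Consume only whole lines; a partial last line waits for the next run.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end
```

Each run then costs time proportional to the new log data instead of the full file size.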

APEL/RGMA Issues
Publishing missing records
- Options available to users are limited
- “Republish” means republish everything (all records): this exceeds internal memory limits in Java, causing APEL to crash
RGMA archiver is growing in size
- It takes longer to traverse the database
- About 2 minutes to run the summary generator
- Benefit to move to Oracle
Batch support is still limited (!)
- Condor and SGE should be seen by the community as important extensions to the application
APEL and gLite (!)
- Will APEL work in this environment?
- NB: the web summary views are independent of APEL
Data privacy and security
- Sites don’t publish user DN … it’s private data
- Restrict access to private data via RGMA client
- Data needs to be shifted from producers to consumers in a secure way
Restricted to fixed schema in RGMA?
- Cannot easily add new fields to the database
- Unable to capture information about jobs in batch logs, e.g. exit status, time in queue, etc. (STEVE FISHER COMMENTS: NEW FIELDS CAN BE ADDED)

What Lies Ahead?
- Challenges ahead
- World Wide Accounting Service for LCG

Wider Issues
How important is accounting?
- Compute resource viewed as a grid currency
- Need a guarantee that the data has not been tampered with in an unfair way
- How does normalisation fit into this? The concept of a raw usage record has no meaning if internal scaling is applied to heterogeneous farms.
Recognise that accounting isn’t just about “job usage”; it’s about resource usage, which encompasses many things:
- CPU usage
- Also storage and network usage
- Treated differently? CPU is consumed; storage is occupied and can be recycled
Getting data from all participants
- It hasn’t been easy to get all sites in EGEE to send data to us
- Many reasons: some technical, some political
- How do we account for usage in wider communities which span grid projects, e.g. LHC?
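The normalisation question can be made concrete with a one-line formula: raw CPU time is scaled by the ratio of the node's benchmark rating to a reference rating. The sketch below is an assumption about the general idea only; the slide does not say which benchmark or reference value is used, and the parameter names are hypothetical.

```python
def normalise_cpu(raw_cpu_s: float, node_power: float,
                  reference_power: float) -> float:
    """Scale raw CPU seconds to reference-machine CPU seconds."""
    if reference_power <= 0:
        raise ValueError("reference_power must be positive")
    return raw_cpu_s * node_power / reference_power
```

For example, `normalise_cpu(3600, 1500, 1000)` gives 5400.0: an hour on a node rated 1.5 times the reference counts as 1.5 reference hours. If a farm has already applied such scaling internally, the published figure is no longer a raw record, which is the concern raised on the slide.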

Challenges Ahead
Data collection
- Many implementations for collecting accounting data in the LCG world:
  - APEL/DGAS in EGEE
  - SGAS in SweGrid
  - Sites that implement their own systems (Fermilab: multiple grid job managers from different grids feed a single Condor pool)
  - Also OSG, who are interested in deploying APEL with their own transport mechanism
- Switching one for another doesn’t resolve the problem of data sharing across the project. No mechanism is in place to share this data in a consistent way.
- GGF is working on a Resource Usage Service
- What would the model for data sharing look like? Low level or high level?
  - Low level: sensors publishing data via a web service?
  - High level: data collected within the infrastructure, aggregated in a meaningful way, then reviewed and approved before it can be passed on (Fermilab)
- Some Tier-1 centres have concerns about data association: “LCG not EGEE”, “Will the service be separate?”

Challenges Ahead
Usage reporting at what level?
- Anonymous level: how much resource has been provided to each VO
- Aggregation across: VOs, countries, regions, grids, organisations
- Granularity: summed over units of hours, days, weeks, months?
User-level reporting?
- If 10,000 CPU hours were consumed by the ATLAS VO, who are the users that submitted the work?
- Data privacy laws
- A grid “DN” is personal information which could be used to target an individual
- Who has access to this data and how do you get it?
- Can CA policies change to support anonymous DNs and reverse DN mappings?
- What are the consequences? Are there any lawyers in the audience?

World Wide Accounting Service for LCG
The project involves combining results from all three peer infrastructures and presenting an aggregated view of resource usage for LHC VOs to the RRB
Peer infrastructures in LCG:
- Open Science Grid + others (Ruth Pordes, Philippe Canal, Matteo Melani)
- NorduGrid (Per Oster, Thomas Sandholm)
- LCG/EGEE (Kors Bos, Dave Kant)

Resource Usage Service
Based on emerging GGF standards and Web Services
- GGF UR, OGSI
An implementation exists in “Market for Computational Science”, a UK e-Science project. What does DGAS provide?
Use case might be:
- A user invokes the query service through a web browser, using SSL for client authentication, to ensure that usage information at user level belongs to the user. A servlet sends the query to the RUS web service and gets the user data.
[Diagram labels: Service Interface, RUS WS Application, ACL, DB, Web Service Container]
Work started with Akram Khan and Xiaoyu Chen at Brunel
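The access rule in the use case, where a user sees only usage that belongs to them, can be sketched as a filter keyed on the authenticated identity. This is a toy model under assumed names: it is not the RUS interface, the `admins` set merely stands in for the ACL box in the diagram, and the DN is assumed to come from an already verified SSL client certificate.

```python
def query_usage(records: list[dict], authenticated_dn: str,
                admins: frozenset = frozenset()) -> list[dict]:
    """Return only the usage records the caller is allowed to see.

    Ordinary users get their own records; identities listed in `admins`
    (a hypothetical ACL) get everything.
    """
    if authenticated_dn in admins:
        return list(records)
    return [r for r in records if r["user_dn"] == authenticated_dn]
```

Filtering on the server side, after authentication, means the caller cannot widen the query by changing a request parameter.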

Conclusions
- Very busy year ahead
- SGE and Condor support need to be completed
- Improve some features of APEL that cause difficulties
- Investigate LCG-RUS
- Service Metrics activity: very important, beginning to consume effort