Grid Monitoring and Accounting. Dave Kant, CCLRC e-Science Centre, UK. HEPiX at Brookhaven, 18th – 22nd Oct 2004.



2 Monitoring the Grid is a Challenge
The number of participating sites is growing every day: August 2003 => 12 sites; October 2004 => 83 sites, 8000 CPUs, 96 TB disk.
Grid Operations Centre: monitor the operational status of sites; fault detection.
Problem Management: identify problems; escalate; track.

3 Distributed GOC
LCG sites are distributed all over the globe. There has been a coordinated effort to develop and integrate a variety of monitoring tools from CERN, CCLRC (UK), GridPP, INFN (Italy) and Taiwan.

4 Monitoring Challenges
- We have only fragmentary information about the services that sites are running.
- We don’t know what RBs/SEs/Sites the VOs are using for data challenges.
- We don’t know what the core services are and who is running them.
- We don’t have a toolkit to test specific core services. We have to concentrate on the functional behaviour of services, e.g. if an RB sends your job to a CE, then we must assume the RB is working fine. Is this the only test of an RB?
- Not all the tests that we perform are effective at finding problems, so we must take tests written by the experts and integrate them into GOC monitoring.
- We must develop tests which simulate the life cycle of real applications in a Grid environment.
- There are lots of monitoring tools available, so we need to bring them together. Do we spend time investigating new tools, or make the ones which we already have better?
- …and probably lots more!

5 Monitoring Services
There are many tools which can be used to monitor sites in a distributed environment. Many were developed by other projects, e.g. EDG, DataTAG and GridPP, as well as by the open source community.
Tools: MAPCENTER, GPPMON, GRIDICE, NAGIOS, GIIS Monitor, Ganglia.
By no means a complete list!

6 GOC Configuration Database
[Diagram labels: RC, https, SERVER, SQL, GOC, GridSite, MySQL]
- Resource Centre resources and site information: EDG, LCG-1, LCG-2, …; node types ce, se, bdii, rb
- Secure database management via HTTPS / X.509: people, contact information, resources, scheduled maintenance
- Secure services: site news, self certification
- Organisation structures
- Monitoring services: operations maps; configure other tools
- Mapcenter: 30 sites = 500 lines in config file
- Nagios: 12 configuration scripts with dependencies
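The point of configuring other tools from the database is that a few hundred lines of hand-written config can be generated from the registered site records instead. A minimal sketch of that idea follows, rendering Nagios host objects from node records. The `Node` fields, the `generic-host` template name and the host names are assumptions made for illustration; the slides do not show the GOC database schema or the actual generator scripts.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One registered node (hypothetical record layout, not the GOC DB schema)."""
    site: str       # resource centre name
    hostname: str   # fully qualified host name
    role: str       # node type, e.g. "ce", "se", "bdii", "rb"


def nagios_host_block(node: Node) -> str:
    """Render one Nagios host object definition for a registered node."""
    lines = [
        "define host {",
        "    use         generic-host",  # assumed template name
        f"    host_name   {node.hostname}",
        f"    alias       {node.site} {node.role.upper()}",
        f"    address     {node.hostname}",
        f"    hostgroups  {node.role}",
        "}",
    ]
    return "\n".join(lines)


def render_config(nodes) -> str:
    """Render all nodes in a stable order so regenerated files diff cleanly."""
    ordered = sorted(nodes, key=lambda n: (n.site, n.role, n.hostname))
    return "\n\n".join(nagios_host_block(n) for n in ordered) + "\n"
```

Calling `render_config([Node("RAL", "ce.example.org", "ce")])` returns one `define host { ... }` block; in a real deployment the node list would come from a database query rather than a literal.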

7 Operations Map – Job Submission Tests
GPPMON displays the results of tests against sites.
Test: Job Submission. The job is a simple test of the grid middleware components, e.g. the Gatekeeper service, the RB service, and the Information System via JDL requirements.
Site resources can be taken from different sources:
- BDII: list of sites belonging to a grid, e.g. Production, Development, LCG, GILDA …
- GOC DB: tailor the information for different customers, e.g. regional monitoring for EGEE
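To make "via JDL requirements" concrete, here is a rough sketch of how a test job could be pinned to one site: the job description carries a `Requirements` expression that only the target CE satisfies, so the RB must consult the Information System to match it. The attribute names follow the EDG/LCG JDL and Glue schema conventions as commonly documented; the CE identifier, the executable and the helper name `make_test_jdl` are assumptions, since the slides do not show the actual GPPMON test job.

```python
def make_test_jdl(ce_id: str) -> str:
    """Build a JDL description that pins a trivial job to a single CE."""
    attributes = [
        ("Executable", '"/bin/hostname"'),          # assumed trivial payload
        ("StdOutput", '"std.out"'),
        ("StdError", '"std.err"'),
        ("OutputSandbox", '{"std.out", "std.err"}'),
        # Only the CE with this unique ID matches, so a successful run
        # exercises the RB, the Information System and the Gatekeeper.
        ("Requirements", f'other.GlueCEUniqueID == "{ce_id}"'),
    ]
    return "\n".join(f"{name} = {value};" for name, value in attributes) + "\n"


if __name__ == "__main__":
    # Hypothetical CE identifier, for illustration only.
    print(make_test_jdl("ce.example.org:2119/jobmanager-pbs-short"))
```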

8 Operations Map – Certificate Lifetime
GPPMON displays the results of tests against sites.
Test: Certificate Lifetime. Many grid services require a valid certificate for security. The test can be used to provide advance warning to sites.
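The certificate lifetime check reduces to comparing a certificate's expiry date with the current time and flagging sites before the certificate lapses. A minimal sketch under stated assumptions: the expiry is supplied as the `notAfter` string in the format Python's `ssl` module uses, and the 14-day warning window is a made-up threshold, not one given in the slides.

```python
import ssl
from datetime import datetime, timezone

WARN_DAYS = 14  # hypothetical warning window; the slides give no threshold


def days_remaining(not_after: str, now: datetime) -> float:
    """Days until expiry, given a notAfter string such as
    'Oct 22 12:00:00 2004 GMT' and a timezone-aware current time."""
    expiry_epoch = ssl.cert_time_to_seconds(not_after)
    return (expiry_epoch - now.timestamp()) / 86400.0


def classify(not_after: str, now: datetime) -> str:
    """Map the remaining lifetime onto a status for a monitoring display."""
    days = days_remaining(not_after, now)
    if days < 0:
        return "expired"
    if days < WARN_DAYS:
        return "warning"
    return "ok"


if __name__ == "__main__":
    now = datetime(2004, 10, 18, 12, 0, 0, tzinfo=timezone.utc)
    print(classify("Oct 22 12:00:00 2004 GMT", now))
```

How the `notAfter` value is fetched from each service (for example from a host certificate presented during a TLS handshake) is left out here, since the slides do not say how GPPMON obtains it.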

9 GRIDICE – Architecture Developed by the INFN-GRID Team

10 GRIDICE – Global View
Display shows the processes belonging to the Broker service. Problems are flagged.
[Screenshot labels: List of Sites; Resource Usage: CPU#, Load, Storage, Job Info]

11 GRIDICE – Expert View
Display shows the processes belonging to the Broker service. Problems are flagged.
[Screenshot labels: Node, Processes]

12 Ganglia Monitoring
Ganglia can be used to monitor a cluster.
Example: RAL Tier-1 Centre. LCG PBS Server displays job status for each VO.

13 Federating Cluster Information
Ganglia can also be used to monitor clusters of clusters.

14 GIIS Monitor
Developed by Min Tsai (GOC Taipei). Tool to display and check information published by the site GIIS.

15 Regional Monitoring
EGEE is made up of regions. Each region contains many computing centres. Regional Operations Centres (ROCs) are a focus for operations.

16 Regional Monitoring Maps
- Provide ROCs with a package to monitor the resources in the region
- Tailored monitoring
- GUI to automate site locations on the map
- Hierarchical view of resources
Example: UK Particle Physics (GridPP)
EGEE (1)
  France (1.1)
  UK/I (1.2)
    GridPP (1.2.1)
      LondonT2: IMPERIAL, QMUL
      ScotGrid: Edinburgh
  S.E.E (1.3)
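The hierarchical view can be modelled as a tree whose dotted labels are derived from each node's position. The sketch below assumes LondonT2 and ScotGrid sit under GridPP, which is how the flattened slide reads; the labels deeper than 1.2.1 are produced by the same numbering rule and do not appear on the slide. `TREE`, `number_tree` and `leaf_sites` are illustrative names.

```python
# Regions -> organisations -> Tier-2s -> sites, as nested dicts.
# The nesting below GridPP is an assumption about the slide's layout.
TREE = {
    "France": {},
    "UK/I": {
        "GridPP": {
            "LondonT2": {"IMPERIAL": {}, "QMUL": {}},
            "ScotGrid": {"Edinburgh": {}},
        },
    },
    "S.E.E": {},
}


def number_tree(name, children, prefix="1"):
    """Yield (label, name) pairs in depth-first order, e.g. ('1.2', 'UK/I')."""
    yield prefix, name
    for index, (child, grandchildren) in enumerate(children.items(), start=1):
        yield from number_tree(child, grandchildren, f"{prefix}.{index}")


def leaf_sites(children):
    """Yield the names of all nodes without children below a subtree."""
    for name, sub in children.items():
        if sub:
            yield from leaf_sites(sub)
        else:
            yield name


if __name__ == "__main__":
    for label, name in number_tree("EGEE", TREE):
        print(label, name)
    print(list(leaf_sites(TREE["UK/I"])))
```

A regional map for one ROC would then only need the subtree for that region, for example `TREE["UK/I"]`.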

17 Site Certification Service
In terms of middleware, the installation and configuration of a site is quite a complicated procedure.
- When there is a new release, sites don’t upgrade at the same time
- Some upgrades don’t go smoothly
- Unexpected things happen (who turned off the power?)
- Day-to-day problems; robustness of service under load?
It’s necessary to actively hunt for problems.
Site certification testing is carried out by the CERN deployment team on a daily basis.
The first step toward providing this service involves running a series of replica manager tests which register files onto the grid, move them around and delete them, and which make 3rd-party copies from a remote SE.
Unlike the simple job submission tests described earlier, these are more heavyweight and attempt to simulate the life cycle of real applications.

18 Certification Test Results Individual Test Results

19 Syndication of Monitoring Information
GOC generates RSS feeds which clients can pull using an RSS aggregator. Aggregators are available for Linux, Windows and MacOS.
The aggregator shown (RSSReader, a Windows client) displays test results for the RAL CE. These results are archived and pop up on the desktop when the feed is updated.
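Publishing test results as a feed amounts to writing one RSS item per result under a channel for the site. A minimal sketch using only the Python standard library follows; the feed title wording, the link URL and the `(test, status, time)` tuple layout are assumptions, since the slides do not show the GOC feed format.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def build_feed(site, results, link="http://goc.example.org/"):
    """Return an RSS 2.0 document with one item per test result.

    `results` is an iterable of (test_name, status, when) tuples, where
    `when` is a timezone-aware datetime. The link is a placeholder URL.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"Test results for {site}"
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = f"Monitoring test results for {site}"
    for test_name, status, when in results:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = f"{test_name}: {status}"
        # RSS 2.0 expects RFC 822 style dates in pubDate.
        ET.SubElement(item, "pubDate").text = format_datetime(when)
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    when = datetime(2004, 10, 18, 12, 0, tzinfo=timezone.utc)
    print(build_feed("RAL CE", [("job-submission", "OK", when)]))
```

Because clients pull the feed, the server side stays a static file that is rewritten after each test run.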

20 Real Time Grid Monitor A Visualisation tool to track jobs currently running on the grid. Applet queries the logging and bookkeeping service to get information about grid jobs. Why are jobs failing? Why are jobs queued at sites while others are empty?

21 Monitoring Paradigm
- GOC Services collect information and publish it into an archiver.
- ROC/CIC Services provide a means for the community to interact with this information on demand.
- GOC provides services tailored to the requirements of the community.
- Uses the infrastructure developed for accounting (R-GMA, database and web portal).
[Diagram labels: Information Repository (R-GMA); GOC Services: Accounting, Monitoring, GSTAT, Testing; ROC Services: Self Certification; CIC Services; Communities: VOs, ROCs, EGEE, Sites, Organisations]

22 GOC Use Case: Accounting
An accounting package for LCG has been developed by the GOC at RAL. There are two main parts:
- the accounting data-gathering infrastructure, based on R-GMA, which brings the data to a central point
- a web portal to allow on-demand reports for a variety of players

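23 Accounting Flow Diagram
[Diagram: at each LCG site, the data sources on the CE (batch log, GK log, messages, site GIIS) pass through a filter to the site MON box; R-GMA carries the raw accounting data to the MON box at the GOC site; the data is aggregated per VO and per ROC; the accounting service provides on-demand reports.]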

24 GOC Accounting Services
On-demand services to the EGEE community.
- BaseCpuSeconds aggregated across EGEE
- Each site, per VO, per month
- Each region, per VO, per month
- Simple interface to customise views of data: VO, time frame and region (default = EGEE)
- Other distributions: normalised CPU, number of jobs
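The per-site and per-region views are the same aggregation with a different grouping key. A minimal sketch, assuming each raw record describes one job and carries `site`, `region`, `vo`, `month` and `cpu_seconds` fields; these field names are illustrative and are not the schema used by the GOC accounting service, and no normalisation of CPU time is attempted.

```python
from collections import defaultdict


def aggregate(records, key_fields):
    """Sum CPU seconds and count jobs, grouped by the given record fields.

    `records` is an iterable of dicts, one per job (hypothetical layout).
    `key_fields` selects the view, e.g. ("site", "vo", "month").
    """
    totals = defaultdict(lambda: {"cpu_seconds": 0, "jobs": 0})
    for rec in records:
        key = tuple(rec[field] for field in key_fields)
        totals[key]["cpu_seconds"] += rec["cpu_seconds"]
        totals[key]["jobs"] += 1
    return dict(totals)


if __name__ == "__main__":
    # Made-up records for illustration only.
    jobs = [
        {"site": "RAL", "region": "UK/I", "vo": "atlas", "month": "2004-09", "cpu_seconds": 100},
        {"site": "QMUL", "region": "UK/I", "vo": "atlas", "month": "2004-09", "cpu_seconds": 25},
    ]
    print(aggregate(jobs, ("site", "vo", "month")))
    print(aggregate(jobs, ("region", "vo", "month")))
```

Passing `("vo", "month")` as the key gives the EGEE-wide view, which matches the slide's default region.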

25 Future Plans
Extend the ideas developed in the accounting use case to all the tools that have been described. We want to move toward a Service Orientated Architecture model and provide the community with a direct interface into the monitoring.

26 Summary: Accounting
- An information-gathering infrastructure has been developed.
- It has been through the C&T cycle and should be deployed in the next release.
- A web portal for display of this information has been developed (work in progress).
- This is an EGEE deliverable (DSA1.3).
- The display infrastructure can be deployed for other information (e.g. monitoring).

27 Summary
- Since August 2003, the LCG GOC has been working to understand the problems of running a large-scale distributed grid.
- Set up a distributed GOC and deployed tools to help understand the issues.
- Development towards on-demand services to provide the community with up-to-date information, aggregated at different levels.
- Development of visualisation tools to enhance our understanding of the grid.