John Gordon CCLRC eScience centre Grid Support and Operations John Gordon CCLRC GridPP9 - Edinburgh.

What is support?
Not well defined, or rather defined differently in many places.
End users, sysadmins, deployers and developers all need support.
Some examples follow.

Grid Support Centre
14 named staff at Rutherford, Daresbury, Manchester and Edinburgh.
Operates the UK e-Science Certification Authority.
Provides a helpdesk as the first point of call for queries.
Website for advertising the services provided.
Provides technical training and evaluations of middleware.
Supports the Level-2 Grid project.
– National Information Server for the Core programme.
– Publishing of site monitoring information in XML.
Core support for the OGSA-DAI project.

European Grid Support Centre
Collaboration between CCLRC, CERN and KTH Sweden, each providing 1 FTE.
Point of trusted reliability between major projects and middleware producers.
Communicates directly with staff from the Globus Alliance to ensure that European issues are addressed, having assisted with a release.
Website up and running, though currently a skeleton of the final content.
Attended the EDG meeting in Barcelona to publicise the centre, and GGF-8 to guide the User Services R.G. work.

Global Grid User Support (GGUS): The Model
Started 1st of October at GridKa, Forschungszentrum Karlsruhe (Germany).
Already supports 41 user groups of GridKa.
Website

GGUS: Interaction and Service Request
First line of support: experiment-specific problems will be solved by ESUS (with Savannah) or sent to GGUS using an agreed interface. Grid-related problems will be solved by GGUS or sent to GOC using the GGUS system.
GGUS: Global Grid User Support
ESUS: Experiment Specific User Support
GOC: Grid Operations Centre
[Diagram: data flow and information flow between Grid User, ESUS, GGUS, GOC and local operations.]
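The escalation path described on this slide can be sketched in a few lines of Python. This is only an illustration: the `problem_kind` labels and the `route` function are hypothetical names, not part of GGUS, ESUS or Savannah.

```python
from enum import Enum
from typing import List


class Desk(Enum):
    ESUS = "Experiment Specific User Support"
    GGUS = "Global Grid User Support"
    GOC = "Grid Operations Centre"


def route(problem_kind: str) -> List[Desk]:
    """Return the chain of desks a request passes through.

    problem_kind is a hypothetical label: 'experiment', 'grid' or 'operations'.
    """
    chain = [Desk.ESUS]  # first line of support is always the experiment
    if problem_kind == "experiment":
        return chain  # solved by ESUS
    chain.append(Desk.GGUS)  # otherwise sent on to GGUS
    if problem_kind == "grid":
        return chain  # solved by GGUS
    if problem_kind == "operations":
        chain.append(Desk.GOC)  # GGUS passes it to the GOC
        return chain
    raise ValueError(f"unknown problem kind: {problem_kind!r}")
```

For example, `route("operations")` yields the full ESUS, GGUS, GOC chain, while an experiment-specific problem never leaves ESUS.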

GridPP TB-Support
1. Support Team
– Built from sysadmins: 4 funded by GridPP to work on EDG WP6; the rest are the usual site sysadmins.
2. Methods
– List, phone meetings, personal visits, job submission monitoring.
– RB, VO, RC for UK use to support non-EDG use.
– Planned to verify EDG releases, but they have been too infrequent to test procedures.
3. Rollout
– Experience from RAL in EDG dev testbeds, and from IC and Bristol in CMS testbeds.
– More than 10 sites have been part of the EDG app testbed at one time.
– 3 in LCG1.

Savannah

EGEE Operations
Resource Centres: all sites
Regional Operations Centres (ROC)
– At least one per region!
– RAL in UK/Ireland
Core Infrastructure Centres (CIC)
– CERN, RAL, CNAF, CC-IN2P3

Others
Tier1 Support
– Role to support UK Tier2s in LCG
– Deployment role in GridPP2
Tier2 Specialist Posts
– Support for various middleware areas
Middleware Developers

Where do you go for support?
Users go to experiment support.
Experiment support diagnoses and forwards as necessary to Grid user support, middleware, operations or applications.
Resource Centres look to their Regional Operations Centre (Tier2s to their Tier1).
ROCs will also push problems to their RCs.
But we know that users will go to their local sysadmin, or direct to their Tier2 or Tier1, too.
– And some sysadmins will go to their favourite experiment expert.
– And Tier1s will go direct to middleware experts.
In short, chaos.
Strategy for now is to have a UK plan that is self-contained and can deliver support in the UK when and where required.
– Interface this to the various outside bodies.
– Don't duplicate for the sake of it, but be ready to.
Or be prepared to roll our work into wider provision when it is proven.

John Gordon CCLRC eScience centre Grid Operations Centre

What is Operations?
RAL leading development of the LCG GOC
The Vision: GOC Processes and Activities
– Coordinating Grid Operations
– Defining Service Level Parameters
– Monitoring Service Performance Levels
– First-Level Fault Analysis
– Interacting with Local Support Groups
– Coordinating Security Activities
– Operations Development
Recent developments:

GOC - Monitoring
Who is involved?
– 3.0 FTE (Trevor Daniels, Dave Kant, Matt Thorpe, Jason Leake)
What are we doing?
– Monitor Grid services, manage site information, accounting
– Developed tools to configure and integrate monitoring to make the job easier: GPPMon, Nagios, Mapcentre
Example: Mapcentre
– 30 sites ~ 500 lines in config file
Example: Nagios
– 30 sites, 12 individual config files with dependencies
Both are tedious to configure, and not practical by hand with large numbers of nodes.
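To illustrate why generating configuration from a site list beats writing it by hand, here is a minimal Python sketch that renders Nagios `define host` objects from a list of records. The site records, the hostnames and the `generic-host` template name are assumptions made for the example; the GOC's actual scripts are not shown in the slides.

```python
# Hypothetical site records, as they might be read from a site database.
SITES = [
    {"host_name": "ce.site-a.example.org", "alias": "Site A CE", "address": "192.0.2.10"},
    {"host_name": "se.site-b.example.org", "alias": "Site B SE", "address": "192.0.2.20"},
]


def nagios_host_definition(site, template="generic-host"):
    """Render one Nagios 'define host' object for a site record.

    The template name is an assumption; it must exist in the local Nagios setup.
    """
    lines = [
        "define host {",
        f"    use        {template}",
        f"    host_name  {site['host_name']}",
        f"    alias      {site['alias']}",
        f"    address    {site['address']}",
        "}",
    ]
    return "\n".join(lines) + "\n"


def render_config(sites):
    """Concatenate host definitions for every site in the list."""
    return "\n".join(nagios_host_definition(site) for site in sites)


config_text = render_config(SITES)
```

Adding a thirty-first site then means adding one record rather than editing several interdependent files.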

GOC - Database
Develop and maintain a database to hold site information (contact lists, resources, site information, URLs).
Secure access through GridSite (X509 certificates) via a PHP web interface.
RC managers should maintain their own pages as part of the site certification process.
Monitoring scripts read information in the database and run a set of customised tools to monitor the infrastructure.
To be included in the monitoring, a site must register its resources (CE, SE, RB, RC, RLS, MDS, RGMA, BDII, ...).
The BDII can be queried to check that the GOC database is up-to-date.
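The last point, using the BDII to check that the GOC database is current, amounts to comparing two sets of hostnames. The sketch below assumes both lists have already been fetched; the function name, the report keys and the hostnames are hypothetical.

```python
def compare_registrations(goc_db_hosts, bdii_hosts):
    """Compare hosts registered in the GOC database with hosts published by a BDII.

    Both arguments are plain iterables of hostnames; how they are obtained
    (database query, LDAP search) is outside this sketch.
    """
    registered = {host.lower() for host in goc_db_hosts}
    published = {host.lower() for host in bdii_hosts}
    return {
        # Visible on the Grid but nobody registered them: database is out of date.
        "published_but_unregistered": sorted(published - registered),
        # Registered but not visible: the service may be down or retired.
        "registered_but_unpublished": sorted(registered - published),
    }


report = compare_registrations(
    ["ce.site-a.example.org", "se.site-a.example.org"],
    ["ce.site-a.example.org", "rb.site-a.example.org"],
)
```

An empty report in both directions would suggest the site's database entry matches what it publishes.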

GOC Monitoring Today
Queries the database to build a list of resources.
Submits monitoring jobs to those resources.
Publishes results on the WWW.
[Diagram labels: GOC DB (GridSite, MySQL); EDG UI, LCG1 UI, LCG-2 UI, Remote UI; EDG resources, LCG-1 resources, LCG-2 resources.]

New GPPMon Features Download Host Certificates daily and monitor Life Times for CEs and SEs for LCG and EDG
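Monitoring certificate lifetimes comes down to comparing each certificate's expiry date with the current time. The following Python sketch assumes the expiry date has already been extracted from the downloaded certificate; the 30-day warning threshold and the status labels are illustrative choices, not GPPMon's.

```python
from datetime import datetime, timedelta, timezone


def days_remaining(not_after: datetime, now: datetime) -> int:
    """Whole days until the certificate expires (negative once expired)."""
    return (not_after - now).days


def classify(not_after: datetime, now: datetime, warn_days: int = 30) -> str:
    """Classify a host certificate by remaining lifetime.

    warn_days is an assumed threshold chosen for this example.
    """
    remaining = days_remaining(not_after, now)
    if remaining < 0:
        return "EXPIRED"
    if remaining < warn_days:
        return "WARNING"
    return "OK"


# Example with a fixed reference time so the result is reproducible.
reference = datetime(2004, 2, 3, tzinfo=timezone.utc)
status = classify(reference + timedelta(days=10), reference)
```

Passing the reference time explicitly, rather than calling `datetime.now()` inside the function, keeps the check easy to test.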

New GPPMon Features Reliability of service provided using RRDTool to show Globus and RB stats

New GPPMon Features
Moving toward LCG-1, LCG-2 and EDG monitoring.
[Screenshot: gridkap01.fzk.de, Tuesday 3/2/04 14:10.]
Only RAL and FZK have updated their LCG-2 information in the GOC database.

Nagios
Customised plugins for monitoring
Focus on service behaviour and data consistency:
– Do RBs find resources?
– Do site GIISs publish the correct hostname?
– Is the site running the latest stable software release?
– Does the Gatekeeper authenticate?
– Are the host certificates valid?
– Are essential services running?
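A Nagios plugin reports its result through an exit status (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) plus a one-line message. As a rough sketch of the "correct hostname" check, the Python function below compares an expected hostname with the one a GIIS publishes. How the published value is obtained is left out, and the function name is hypothetical rather than one of the GOC plugins.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_published_hostname(expected, published):
    """Return (status_code, message) for a hostname consistency check.

    'published' is the hostname read from the site GIIS, or None if the
    query returned nothing; fetching it is outside this sketch.
    """
    if published is None:
        return UNKNOWN, "UNKNOWN - GIIS returned no hostname"
    if published.strip().lower() == expected.strip().lower():
        return OK, f"OK - GIIS publishes {published}"
    return CRITICAL, f"CRITICAL - GIIS publishes {published}, expected {expected}"
```

A real plugin would print the message and pass the status code to `sys.exit()` so that Nagios can pick both up.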

Nagios Screen Shots LCG-1

Service Summary for Gatekeeper Nodes

Nagios Screen Shots LCG-1 Host and Service Summary tables for BDII nodes

GOC Configuration
Example: Manage a Grid-wide database
– provides access to site information via trusted certificate
– scripts to automatically configure Nagios from the GOC database
– provides plugins to monitor services for Nagios
– creates configuration files for Mapcentre

[Diagram: Resource Centres (RC) maintain their resources and site information (EDG, LCG-1, LCG-2, …; ce, se, bdii, rb) in the GOC database (GridSite, MySQL) through secure database management via HTTPS / X.509; monitoring uses the database.]

GOC Server

What's in the Database? People: who do we notify when there are problems?

What's in the Database? Node information (hostname, IP address, group)

What's in the Database? Scheduled downtimes: advance warning of site maintenance resulting in reduced service availability

LCG Accounting Overview
1. The PBS log is processed daily on the site CE to extract the required data; the filter acts as an R-GMA DBProducer -> PbsRecords table.
2. The Gatekeeper log is processed daily on the site CE to extract the required data; the filter acts as an R-GMA DBProducer -> GkRecords table.
3. The site GIIS is interrogated daily on the site CE to obtain SpecInt and SpecFloat values for the CE; this acts as a DBProducer -> SpecRecords table, one dated record per day.
4. These three tables are joined daily on the MON to produce the LcgRecords table. As each record is produced, the program acts as a StreamProducer to send the entries to the LcgRecords table on the GOC site.
5. The site now has a table containing its own accounting data; the GOC has an aggregated table over the whole of LCG.
6. Interactive and regular reports are produced by the site or at the GOC site as required.
Note: This is an improved design over that presented at the Jan GDB. The SOAP transport has been replaced by R-GMA.
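The daily join of the three tables in step 4 can be illustrated with an in-memory SQLite database in Python. This is a sketch only: the real system uses R-GMA rather than SQLite, and every column name, join key and sample value below is an assumption, since the slides name the tables but not their schemas.

```python
import sqlite3


def build_lcg_records(conn):
    """Join PbsRecords, GkRecords and SpecRecords into LcgRecords.

    Assumed join keys: the local job ID links batch and gatekeeper records;
    CE name plus date links a job to that day's SpecInt/SpecFloat record.
    """
    conn.executescript("""
        DROP TABLE IF EXISTS LcgRecords;
        CREATE TABLE LcgRecords AS
        SELECT p.local_job_id, g.user_dn, p.unix_group, p.cpu_seconds,
               s.spec_int, s.spec_float, p.end_date
        FROM PbsRecords AS p
        JOIN GkRecords AS g
          ON g.ce = p.ce AND g.local_job_id = p.local_job_id
        JOIN SpecRecords AS s
          ON s.ce = p.ce AND s.record_date = p.end_date;
    """)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PbsRecords (ce TEXT, local_job_id TEXT, unix_group TEXT,
                             cpu_seconds INTEGER, end_date TEXT);
    CREATE TABLE GkRecords (ce TEXT, local_job_id TEXT, user_dn TEXT);
    CREATE TABLE SpecRecords (ce TEXT, record_date TEXT,
                              spec_int INTEGER, spec_float INTEGER);
    INSERT INTO PbsRecords VALUES
        ('ce.example.org', '1001.ce', 'atlas', 3600, '2004-02-03');
    INSERT INTO GkRecords VALUES
        ('ce.example.org', '1001.ce', '/C=UK/O=eScience/CN=example user');
    INSERT INTO SpecRecords VALUES
        ('ce.example.org', '2004-02-03', 400, 380);
""")
build_lcg_records(conn)
rows = conn.execute(
    "SELECT local_job_id, user_dn, unix_group, cpu_seconds, spec_int FROM LcgRecords"
).fetchall()
```

Each joined row carries the user identity from the gatekeeper log, the resource usage from the batch log and the CPU power rating from the GIIS, which is what a per-site or aggregated report needs.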

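[Diagram: LCG Accounting Flow. At each LCG site, a filter on the CE reads the PBS log, the gatekeeper (gk) log and the site GIIS; records pass through the site MON into the LCG site accounting DB and on to the GOC site, where GOC reports are produced.]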

Progress
Status on 3 Feb 2004:
– The code which will run on the CE to parse and process the PBS and Gatekeeper logs is written. The PbsRecords and GkRecords tables are created and are being populated.
– The code to join these two tables and publish the new joined table (LcgRecords) is also written and working.
– Work is in progress to write the archiver at the GOC to receive the aggregated LcgRecords table (2 days' work).
To do:
– Write the code to interrogate the site GIIS to extract the CPU power values and populate these fields in the tables (2 days' work)
– Integration testing and debugging (5 days)
– Packaging for deployment (3 days)
– Write the report generators (30 days; an estimate, as they are not yet designed)

Accounting Issues
1. There is no R-GMA infrastructure LCG-wide, so most sites are not able to install and run the accounting suite at present. It is expected that R-GMA and the MON boxes will be rolled out in LCG2 soon after the storage problems are resolved. Until this happens, the complete batch and gatekeeper logs will have to be copied to the GOC site for processing.
2. The VO associated with a user's DN is not available in the batch or gatekeeper logs. It will be assumed that the group ID used to execute user jobs, which is available, is the same as the VO name. This needs to be acknowledged as an LCG requirement.
3. The global jobID assigned by the Resource Broker is not available in the batch or gatekeeper logs. This global jobID cannot therefore appear in the accounting reports. The RB Events Database contains this, but that is not accessible, nor is it designed to be easily processed.
4. At present the logs provide no means of distinguishing sub-clusters of a CE which have nodes of differing processing power. Changes to the information logged by the batch system will be required before such heterogeneous sites can be accounted properly. At present it is believed all sites are homogeneous.
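Issue 2 has a direct consequence for reporting: per-VO totals are really per-group totals. The Python sketch below makes that assumption explicit. The record fields and sample values are hypothetical.

```python
from collections import defaultdict


def cpu_seconds_by_vo(records):
    """Sum CPU time per VO, assuming the Unix group ID equals the VO name.

    This mirrors the working assumption stated in the slides; if a site maps
    a VO to a differently named group, the totals will be misattributed.
    """
    totals = defaultdict(int)
    for record in records:
        vo = record["unix_group"]  # assumption: group ID == VO name
        totals[vo] += record["cpu_seconds"]
    return dict(totals)


usage = cpu_seconds_by_vo([
    {"unix_group": "atlas", "cpu_seconds": 3600},
    {"unix_group": "lhcb", "cpu_seconds": 1200},
    {"unix_group": "atlas", "cpu_seconds": 400},
])
```

Keeping the mapping in one place like this would make it straightforward to swap in a real DN-to-VO lookup if one became available.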

Future Direction: Towards EGEE
Distribute tools to help the ROCs monitor their RCs (database + monitoring packages).
Distribute tools to help CICs monitor core services: Grid-wide monitoring.
Ideas on how this would work:
– CIC monitoring tools query ROC databases
– Select core services
– Run a standard set of checks on those services
– Display information / notifications …
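The four-step idea (query ROC databases, select core services, run standard checks, display results) maps onto a small loop. The Python sketch below is one possible reading under stated assumptions: the ROC database contents, the choice of which service types count as core, and the single example check are all invented for illustration.

```python
def run_cic_checks(roc_databases, core_types, checks):
    """Run every check against every core service found in the ROC databases.

    roc_databases: mapping of ROC name -> list of service records (dicts).
    core_types:    set of service types treated as core (an assumption here).
    checks:        mapping of check name -> callable taking a service record.
    """
    results = []
    for roc_name, services in roc_databases.items():
        for service in services:
            if service["type"] not in core_types:
                continue  # only core services are monitored Grid-wide
            for check_name, check in checks.items():
                results.append({
                    "roc": roc_name,
                    "host": service["host"],
                    "check": check_name,
                    "passed": bool(check(service)),
                })
    return results


# Hypothetical data standing in for one ROC's database.
ROC_DATABASES = {
    "UK/Ireland": [
        {"host": "bdii.example.org", "type": "bdii", "registered": True},
        {"host": "ce.example.org", "type": "ce", "registered": True},
    ],
}
CHECKS = {"is_registered": lambda service: service.get("registered", False)}

results = run_cic_checks(ROC_DATABASES, {"bdii", "rb"}, CHECKS)
```

The result list is deliberately plain so that a display or notification layer can consume it without knowing how the checks were run.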

John Gordon CCLRC eScience centre UK Deployment, Support and Operations

Deployment Team
UK proposal for a UK-wide team to provide and run a UK-wide Grid. This is the GridPP view; there are alternative views for other stakeholders.
[Organisation chart, box labels as extracted:]
Production Manager
Core Grid Coordinator (1 FTE)
Grid Support Centre (5 FTE)
Middleware Specialist Support (6 FTE): Security Officer (RAL); Data and Storage Management (Glasgow, Bristol, Edinburgh); VO Management and Services (North, 0.5 FTE); Workload Management Services (London); Network Management (London, 0.5 FTE)
Deployment Team (8 FTE): 2 Tier1 Deployment; 4 Tier2 UK Coordinators (LondonGrid, NorthGrid, ScotGrid, SouthGrid); 1 Tier2 Coordinator (Ireland); Applications Expert
Grid Operations Centre
Other labels: Manager, Operations (2), Technical Writer, Network Support, Helpdesk, Network Monitoring
Funding sources shown: EGEE, GridPP, JISC, Core

Resource Centres
Tier1: Rutherford Appleton Laboratory
Tier-2 centres are distributed over many sites:
– London Grid: IC, QMUL, RHUL, UCL, Brunel
– North Grid: Daresbury, Lancaster, Liverpool, Manchester, Sheffield
– Scot Grid: Durham, Edinburgh, Glasgow
– South Grid: Birmingham, Bristol, Cambridge, Oxford, RAL-PPD
Sites which have signed up to LCG and deployed software (RAL, IC, Cambridge) expect to join EGEE (PM1).

Tier-2 Centre Resources (Projected 2004)
Projected resources available in September 2004 to be applied to large-scale production Grid deployment.
[Table: columns Tier 2, Number of CPUs, Total CPU [KSI2000], Total Disk [TB], Total Tape [TB]; rows London, North Grid, South Grid, Scot Grid, Total.]
[Map: the total CPU at each institute is proportional to the size of the green circles; the disk storage at each site is proportional to the height of the grey vertical bars.]

Roles (1)
Production Manager: overall manager to oversee operations and report to other groups (ROC Coordinator, OMC, …).
Core Grid Coordinator: bring UK non-Particle Physics projects (applications and resources) into EGEE.

Roles (2)
Deployment Team: consists of about 7 people to spearhead the rollout and certification of Grid software to the Resource Centres (Tier1 & Tier2).
Grid Operations Centre: similar role to the proposed CIC in EGEE.
– Monitor health of services and provide toolkits
– Operate Core Grid Services
– Database of RCs managed by RC site administrators

Roles (3)
Middleware Specialist Support: body of experts to provide specialist support to Resource Centres in key areas: security, data management, network, VO management and workflow management.
Grid Support Centre:
– Helpdesk facility, CA
– Broker requests to middleware specialists

Team UK
A large team in the UK (GridPP, EU, and other).
The GridPP Production Manager should orchestrate this team to deliver a production grid for GridPP.
– But interwork with as many other UK grids and projects as possible.
Meet our EGEE ROC and CIC deliverables for support and operations.
A big challenge.