Grid Operations Centre LHCC Comprehensive Review Trevor Daniels, John Gordon 25 Nov 2003.

GOC Group  The June GDB agreed that a task force should be created to define the requirements and agree on a prototype for a Grid Operations Service  The members of this GOC Steering Group are  Trevor Daniels (RAL)RAL, Convenor  Markus Shultz (CERN)CERN  John Gordon (RAL)RAL  Rolf Rumler (IN2P3)IN2P3  Cristina Vistoli (INFN)INFN  Claude WangTaipei (observer)  Eric YenTaipei  Ian FiskFNAL, US-CMS  Bruce GibbardBNL, US-Atlas

GOC Group
The views of the group have been sought on several topics:
- Revised proposal for the GOC, resulting in a submission to the July GDB
- Prototype website: general layout, restrictions on certain pages, monitoring pages
- Approaches to monitoring SLAs: possible tests for CE and RB services
- Security proposals, as presented to the September GDB
- SLA Guide, a supplement to the Security Policy

Overview
The GOC Proposal envisaged three phases:
- Phase 1: Jul 03 – Oct 03
- Phase 2: Nov 03 – May 04
- Phase 3: Jun 04 – Jun 05
This talk covers:
- The GOC vision
- What was planned in Phase 1 and its current status
- What is planned for Phase 2
- User support: a brief summary on behalf of GGUS at FZK

The Vision
GOC processes and activities:
- Coordinating Grid operations
- Defining service level parameters
- Monitoring service performance levels
- First-level fault analysis
- Interacting with local support groups
- Coordinating security activities
- Operations development

Phase 1 (Jun 03 – Oct 03), taken from the Proposal of Jun 2003
An evolutionary approach: Phase 1 explores possible approaches to an operations centre.
- a) Set up an initial monitoring centre - Done
- b) Draft Security Policy and Procedures - Done
- c) Define Service Level Parameters - Partly Done
- d) Establish a Monitoring Regime - Done (but further development is ongoing)
- e) Select tools for use and evaluation in Phase 2 - Done

Phase 1  In addition to the work envisaged in the Proposal for Phase 1 RAL is acting as an operational GOC by monitoring LCG sites from the moment they install the LCG software.  All CE s are tested every 10 mins with an authentication test  All RB s are tested every 10 mins with a job-list-match test  Network connectivity is tested every 10 mins from RAL to every host  Port accessibility is tested to every externally accessible service every 10 mins  A trivial job is submitted to every CE every hour via Globus and via the CERN RB  Logs are examined and analysed several times a week  Significant failings or problems are reported to the LCG-Rollout list  Several problems have been uncovered in both the monitors and in various sites

Approach to Service SLAs
A formal contract with the GOC? No, because:
- The GOC is not (likely to be) a legal body
- The GOC will not (be likely to) have any formal powers over service providers
- The GOC will not (be likely to) pay for any services
- It is therefore difficult for the GOC to enforce a traditional SLA
Instead, we prefer a virtual contract between the service provider and the LCG Grid community:
- Any centre wishing to provide a service must publish its design levels for the specified service level parameters of that service
- The GOC will then monitor the actual levels achieved and publish them so they may be compared with the design levels
- Service providers (centres) will then compete on quality, or possibly quality/cost, either to attract work or to enhance their reputation

Form of SLA
- One for each instance of an LCG service
- To be published on the GOC website in a standard format, exactly as provided by the service administrator
- The format is still to be agreed, but is likely to contain as a minimum:
  - Identification of the service (type, release, etc.)
  - Statement of compliance with the Security and Availability Policy (standard wording)
  - Limitations on use (if any)
  - Designed availability
  - Designed reliability
  - Designed performance (service-specific; to be defined for each type of service)
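Since the format is still to be agreed, the following is only a hedged sketch of how an SLA record containing the minimum fields listed above might be represented; the field names and example values are assumptions, not the agreed format.

```python
# Illustrative sketch only: a possible representation of an SLA record with the
# minimum fields listed above. Field names and values are assumptions.
from dataclasses import dataclass, asdict
import json

@dataclass
class ServiceSLA:
    service_type: str                  # e.g. "CE" or "RB"
    release: str                       # software release the service runs
    security_policy_compliant: bool    # standard-wording compliance statement
    limitations_on_use: str            # free text, empty if none
    designed_availability: float       # fraction of scheduled up-time, e.g. 0.99
    designed_reliability_hours: float  # designed mean time to failure, in hours
    designed_performance: dict         # service-specific parameters

example = ServiceSLA(
    service_type="CE",
    release="LCG-1",
    security_policy_compliant=True,
    limitations_on_use="",
    designed_availability=0.99,
    designed_reliability_hours=720.0,
    designed_performance={"max_job_submission_latency_s": 60},
)

# The GOC website could publish such a record in a machine-readable form.
print(json.dumps(asdict(example), indent=2))
```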

Monitoring
The monitoring pages bring together the several LCG monitoring tools which are readily available, together with a touch-sensitive map which links to pertinent information about each LCG site, including a link to each site's published status. The monitors currently running and on display are:
- GridICE monitoring of LCG-1 (at CERN)
- MapCenter monitoring of LCG-1 (at RAL)
- LCG-1 status measured with GridPP (at RAL), which also shows the LCG-1 overall rollout status page (at CERN)
Each of these provides multiple views of status information.

MapCenter
Performs low-level tests and aggregates them up through several levels to country level, showing the best and worst status at each level. This is the view of Europe showing individual sites; other views show world-wide sites, the USA, etc.

MapCenter
Part of the MapCenter full list view, showing aggregation up to country level. Tests include ICMP response, port tests to the gatekeeper (gk), MDS, GSIFTP, etc., and specific SLA-based tests such as ce-auth.
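The best/worst roll-up that MapCenter performs can be illustrated with a small sketch; this is an assumption about the general idea, not MapCenter's actual implementation, and the site data are invented.

```python
# Sketch: roll individual host statuses up to site and country level,
# keeping the best and worst status seen at each level.
from collections import defaultdict

# Statuses ordered from best to worst, so min()/max() over indices is meaningful.
ORDER = ["OK", "WARNING", "CRITICAL", "UNREACHABLE"]

# (country, site, host) -> status; invented placeholder data.
host_status = {
    ("UK", "RAL", "ce01"): "OK",
    ("UK", "RAL", "rb01"): "WARNING",
    ("IT", "CNAF", "ce01"): "CRITICAL",
}

def aggregate(level_key):
    """Return {key: (best, worst)} for the chosen grouping function."""
    groups = defaultdict(list)
    for key, status in host_status.items():
        groups[level_key(key)].append(ORDER.index(status))
    return {k: (ORDER[min(v)], ORDER[max(v)]) for k, v in groups.items()}

by_site = aggregate(lambda k: (k[0], k[1]))
by_country = aggregate(lambda k: k[0])
print(by_country)   # {'UK': ('OK', 'WARNING'), 'IT': ('CRITICAL', 'CRITICAL')}
```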

GridPP Monitor
Submits jobs via globus-job-run, via the CERN, RAL and other RBs, and displays a coloured dot to indicate recent results, both on a map and in list form. Gives a user-level view of status.

GridICE VO view
Partial view of the DTEAM VO showing infn, fzk and sinica. Shows information on CPU loading, jobs and storage, by cluster.

Global Grid User Support – GGUS: The Model
- Started on the 1st of October at GridKa, Forschungszentrum Karlsruhe (Germany)
- Already supports 41 user groups of GridKa
- Website

GGUS Interaction
[Diagram: the Grid user, ESUS, GGUS, the GOC and local operations, linked by service-request, data-flow and information-flow arrows.]
First line of support: experiment-specific problems will be solved by ESUS (with Savannah) or sent to GGUS using an agreed interface; grid-related problems will be solved by GGUS or sent to the GOC using the GGUS system.
GGUS: Global Grid User Support. ESUS: Experiment Specific User Support. GOC: Grid Operations Centre.
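The routing rule on this slide can be expressed as a small sketch; it is an illustration of the described flow only, not an actual GGUS, ESUS or GOC interface, and the field and function names are invented.

```python
# Sketch of the first-line-of-support and escalation rule described above.
from dataclasses import dataclass

@dataclass
class Ticket:
    summary: str
    experiment_specific: bool
    grid_related: bool

def first_line_support(ticket: Ticket) -> str:
    """Return the support unit that receives the ticket first."""
    if ticket.experiment_specific:
        return "ESUS"          # handled with Savannah, or forwarded to GGUS
    return "GGUS"

def escalate(ticket: Ticket, current: str) -> str:
    """Escalation path if the current unit cannot solve the problem."""
    if current == "ESUS":
        return "GGUS"          # via the agreed ESUS-GGUS interface
    if current == "GGUS" and ticket.grid_related:
        return "GOC"           # via the GGUS system
    return current

t = Ticket("Job submission fails at site X", experiment_specific=False, grid_related=True)
unit = first_line_support(t)
print(unit, "->", escalate(t, unit))   # GGUS -> GOC
```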

Present GGUS Services (November 2003)
- Support for GridKa users via a web portal
- Entry point for grid-related questions and problems
- Problem tracking with a problem management tool: the Remedy Action Request System
- Rapid development of processes, workflow elements, escalations and automation
- Entry of problem tickets via web browser, e-mail or phone
- Ticket information for the user via web browser
- A web interface to the problem database offered to other support teams
- Working hours 8 am to 4 pm, Monday – Friday (to be extended)
- FAQs
- Status and diagnostic information on jobs and job queues; documentation about grid resources
- News, temporary problems, bottlenecks and down times

System Overview (jobs and queues)

Tracking a Service Request

Roadmap for Q1 & Q2 2004
- Further development of the GGUS portal
- Further work on Remedy: problem database, knowledge database and workflows, escalations
- Define requirements for a centralised LDAP-based user authentication and authorisation
- Implement centralised LDAP authentication and authorisation, if possible
- Implement a GOC-GUS interface to share common tools
- Implement an ESUS-GUS interface for problem interchange
- Integrate another Tier-1 centre for first- and second-level support (including Remedy integration)

End of Presentation  Questions?

Supplementary Slides
Some more detailed slides follow which will not be used in the LHCC presentation.

Phase 1 (Jun 03 – Oct 03), taken from the Proposal of Jun 2003
a) Set up an initial monitoring centre - Done
- Steering Group established
- LCG-Rollout list installed: 99 members; July – 74 messages, Aug – 121 messages, Sep – 311 messages, Oct – 200 messages, Nov – 140 messages (to the 11th)
- GOC website set up, including certificate protection for sensitive pages
- MapCenter configured and installed
- GppMon configured and installed
- GridICE (link)
- SLA tests developed and running for CE and RB

Phase 1  b) Draft Security Policy and Procedures - Done  Drafted with the LCG Security Group Approved by GDB in October Will be submitted to SC2 for Adoption  Three GOC-related supporting Annexes in preparation Service Level Agreement Guide - drafted Procedures for Resource Admins - partly drafted Procedure for site self-audit - in outline

Phase 1
c) Define Service Level Parameters - Partly Done
- Schedule, Availability and Reliability are all clear and defined:
  - Schedule – the published periods of downtime for upgrading etc.
  - Availability – the proportion of actual up-time to scheduled up-time
  - Reliability – the mean time to failure
- Performance is service-specific; ideas are under discussion, but experience with real users is needed before deciding what is important
- Service Level Agreement: the publication by the site, in a prescribed format, of the targeted (designed) service level parameters for an LCG service will comprise the SLA for that service. The GOC will monitor the actual achieved values of the same parameters and publish them alongside.
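A small worked example may help fix the two defined parameters; it assumes the monitor records scheduled up-time, actual up-time and a failure count, and the figures are invented for illustration.

```python
# Worked sketch of the Availability and Reliability definitions above.
from dataclasses import dataclass

@dataclass
class ServicePeriod:
    scheduled_uptime_h: float   # hours the service was scheduled to be up
    actual_uptime_h: float      # hours it was actually up within that schedule
    failures: int               # number of distinct failures in the period

def availability(p: ServicePeriod) -> float:
    """Proportion of actual up-time to scheduled up-time."""
    return p.actual_uptime_h / p.scheduled_uptime_h

def reliability_mttf_h(p: ServicePeriod) -> float:
    """Mean time to failure, estimated here as up-time divided by failure count."""
    return p.actual_uptime_h / p.failures if p.failures else float("inf")

month = ServicePeriod(scheduled_uptime_h=700.0, actual_uptime_h=686.0, failures=2)
print(f"availability = {availability(month):.3f}")        # 0.980
print(f"MTTF = {reliability_mttf_h(month):.0f} hours")    # 343 hours
```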

Phase 1  d) Establish a Monitoring Regime – Done  (but further development is ongoing)  SLA Monitoring CE and RB availability and reliability are being crudely monitored now Reports of significant failures sent to Rollout List  Use and Development of MapCenter Good low and high level tool Needs no local sensors Easily extended and has history will be used for SLA monitoring CE and RB SLA monitoring tests installed in test  Use and Development of GppMon tests job submission via Globus and CERN RB  GridICE at CERN – we simply link to it

Phase 1  e) Select tools for use and evaluation in Phase 2 - Done  As Phase 1 GppMon (extended to add history) MapCenter (extended to accommodate SLA tests) GridICE (as implemented at CERN)  plus MonALISA needs local sensing agents  plus network monitoring tools from EDG WP7 needs local agents needs R-GMA

Plans for Phase 2 (Nov 03 – May 04)
a) Set up a second monitoring centre
- Eventually there should be two more, one in the East and one in the West, to provide 24-hour cover and regional coordination of operational issues such as alerts and SLAs
- A second site would immediately deploy the Phase 2 monitoring tools
- A second site would enable the details of inter-site coordination to be established and permit a parallel monitoring system to be set up
- The October GDB thought finding a second site was premature, so this may be delayed

Plans for Phase 2  b) Establish Grid operations and security coordination regime in consultation with  LCG Security Group  Local Security Officers  Local Support Groups  LCG User Support Centre (GGUS)  to  promote the Security Policy and associated documents  agree and establish common operational practices, principally the way in which SLAs and monitoring will work  agree a fault analysis and alerting mechanism  agree an incident response mechanism

Plans for Phase 2  c) Establish a simple change control regime  question whether or to what degree 'control' is appropriate  as a minimum ensure information about recent and prospective changes is published to the community  establish whatever mechanism is agreed in coordination with local support groups  the minimum in outline would include: the schedule of service down time (part of SLA) the schedule and nature of proposed changes site would publish information via GOC web site

Plans for Phase 2  d) Monitoring service levels  Investigate using EDG WP7 network monitoring tools uses R-GMA  Install tools to monitor and detect deviations from SLA  Deploy remote agents - include in software distributions?  Automatic alert mechanisms for operations staff  Set up mechanisms to notify local support of problems

Plans for Phase 2  e) Establish accounting mechanism  define accounting schema,  develop filters to transform required data from sites to the schema for one or two batch systems,  develop mechanisms for collecting data from sites and transmitting it to the GOC,  develop mechanism for matching up data from batch and CE,  develop and install suitable DB to hold accounting data,  develop suitable web-based static and interactive reports

Open Issues
- Find partners for two other time zones (USA, Asia)
- Further development of processes with ESUS:
  - Definition of working groups
  - Contact person
  - Synchronisation with experiment-specific support
- Access to a centralised (distributed) LDAP directory of Grid users for authentication and authorisation:
  - Additional attributes necessary
  - Standardised format of VOs