Deployment Summary GridPP12 Jeremy Coles 1st February 2005.

Contents
  – LCG operations workshop
  – EGEE structures
  – Operations model
  – Current status
  – Support
  – Planning
  – Metrics
  – Some of the recurring issues at GridPP12
  – Future activities

Some operational issues
Slow response from sites (central perception)
  – Upgrades, response to problems, etc.
  – Problems reported daily; some problems last for weeks
Lack of staff available to fix problems
  – All on vacation, …
Misconfigurations (units, gridmap-file builds, user profiles, pools, …)
Lack of configuration management: problems that are fixed reappear
Lack of fabric management
  – Is it GDA responsibility to provide solutions to these problems?
Lack of understanding (training?)
  – Admins reformat disks of SE …
Firewall issues
  – Coordination between grid admins and firewall maintainers
PBS problems
  – Are we seeing the scaling limits of PBS?
People not reading documentation …
Background to workshop

LCG Workshop Nov 2004
Operational Security
  – Incident Handling Process
  – Variance in site support availability
  – Reporting Channels
  – Service Challenges
Operational Support
  – Workflow for operations & security actions
  – What tools are needed to implement the model
  – “24X7” global support
      sharing operational load (CIC-on-duty)
  – Communications (news)
  – Problem Tracking System
  – Defining Responsibilities
      problem follow-up
      deployment of new releases
  – Interface to User Support
LCG (EGEE) discussion on superset of topics discussed at GridPP11

LCG Workshop Nov 2004
Fabric Management
  – System installations (tools, interfacing tools with each other)
  – Batch/scheduling Systems (openPBS/Torque, MAUI, fair-share)
  – Fabric monitoring
  – Software installation
  – Representation of site status (load) in the Information System
Software Management
  – Operations on and for VOs (add/remove/service discovery)
  – Fault tolerance, operations on running services (stop, upgrades, restarts)
  – Link to developers
  – What level of intrusion can be tolerated on the WNs (farm nodes)
      application (experiment) software installation
  – Removing/(re-adding) sites with (fixed) troubles
  – Multiple views in the information system (maintenance)

GDB
LCG Grid Deployment Board
  – One representative from each country (with a Regional Centre) involved in the LCG and one representative from each experiment
  – Chairman changes annually
  – Meet in person once per month
What it does!
  – Explores issues of global concern to the LCG community
  – Makes decisions on deployment, operations and planning for LCG
  – Provides mechanisms for resource forecasting
How?
  – By calling upon experts to present latest information on specific topics
  – By creating and overseeing working groups to tackle important areas
  – Currently three groups: the Security, Networking and Quattor groups
Who is involved in UKI
  – UK representative: John Gordon
  – Security group coordinator: Dave Kelsey
  – GDB secretary: Jeremy Coles

Proposed escalation procedure
Because unstable and badly configured sites cause a big problem:
  – Unstable sites that have frequent problems
      Will appear on a list of bad sites
  – Sites that do not respond to problem reports
      Including not upgrading middleware versions
  – Will be removed from the information systems and maps
  – Will have to be re-certified to get back in
  – Will be reported to the GDB (LCG) or PMB (EGEE) representative as non-responsive

ROCs
Regional Operations Centres (ROCs)
  – Part of the EGEE SA1 activity
  – The regions are CERN, France, Italy, UK & Ireland, Germany & Switzerland, Northern Europe, South West Europe, South East Europe, Central Europe and Russia.
What they do
  – Coordinate regional efforts in all activities (support, operations representation, security)
  – Take up operations and deployment issues at cross-project meetings
  – Provide forum for agreeing work needed (pre-production service)
How?
  – Set up ROC structures within the region
  – Create common groups to work on areas like pre-production services, helpdesk interfaces
  – Meet fortnightly via telephone (http://agenda.cern.ch/displayLevel.php?fid=339) to discuss regional issues and problems
Who is involved for UK?
  – General: John Gordon
  – Support: Andy Richards
  – Security: Romain Wartel
EGEE Background

CICs
Core Infrastructure Centre (CIC)
  – The CICs cover more than one region and deal with operations issues.
  – There are currently 4 CICs: France, Italy, UK & Ireland and CERN
  – Coordinated by the Operations Management Centre team at CERN.
  – Meet weekly via telephone
  – Each CIC is “on-duty” for 1 week in 4.
What they do!
  – Operational and performance monitoring
  – Troubleshooting and following up identified problems
  – Operate general grid services (e.g. VO related services)
  – Provide information via the CIC portal
How?
  – Review monitoring data such as gstat, daily test results
  – Enter problems identified into Savannah (moving to GGUS portal soon)
  – Follow up problems using e-mail and telephone contacts
  – Troubleshoot using experts, Wiki etc.
Who is involved in UKI
  – Steve Traylen & Philippa Strange
EGEE Background
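The "on-duty for 1 week in 4" rotation described above can be sketched as a simple round-robin. This is a minimal illustration only: the ordering of the CICs, the start date and the function name `on_duty_schedule` are assumptions made for the example and are not taken from the operations manual.

```python
from datetime import date, timedelta

# The four CICs named on the slide; the rotation order here is an assumption.
CICS = ["France", "Italy", "UK & Ireland", "CERN"]


def on_duty_schedule(start: date, weeks: int, cics=CICS):
    """Return (week_start, cic) pairs, one CIC per week in round-robin order."""
    schedule = []
    for week in range(weeks):
        week_start = start + timedelta(weeks=week)
        schedule.append((week_start, cics[week % len(cics)]))
    return schedule


if __name__ == "__main__":
    # Hypothetical start date (a Monday), chosen only for the example.
    for week_start, cic in on_duty_schedule(date(2005, 2, 7), 8):
        print(week_start.isoformat(), cic)
```

With four CICs, each one comes back on duty every fourth week, which is the load-sharing property the slide describes.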

CIC portal

LCG-2/EGEE Operations
Regional Operations Centres (9)
  – Act as front-line support for user and operations issues
  – Provide local knowledge and adaptations
User Support Centre (GGUS)
  – In FZK
  – Provide single point of contact (service desk)
Core Infrastructure Centres (4)
  – CICs build on the LCG GOC at RAL
  – Also run essential infrastructure services
  – Provide support for other (non-LHC) applications
  – Provide 2nd-level support to ROCs
Coordination:
  – At CERN (Operations Management Centre) and CIC for HEP
Taipei provide operations centre, and 2nd instance of GGUS
  – Start to build round-the-clock coverage
Discussions with Grid3/OSG on how to collaborate on ops support
  – Share coverage?

(New) Operations Model
Operations Center role rotates through the CICs
  – CIC on duty for one week
  – Procedures and tasks are currently defined
      first operations manual is available (living document)
      tools, frequency of checks, escalation procedures, hand over procedures
      CIC on duty website
  – Problems are tracked with a tracking tool
      now central in Savannah
      migration to GGUS (remedy) with link to ROCs PT tools
      problems can be added at GGUS or ROC level
  – CICs monitor service, spot and track problems
      interact with sites on short term problems (service restart etc.)
      interact with ROCs on longer, non trivial problems
      all communication with a site is visible for the ROC
      build FAQs
  – ROCs support
      installation, first certification
      resolving complex problems

Operations Model
[Diagram; labels: OMC CIC ROC RC Other Grid RC]

How does support map onto this?
[Diagram; labels: OMC CIC ROC helpdesk RC Other Grid RC Savannah GGUS]

How does user support map onto this?
[Diagram; labels: OMC CIC ROC helpdesk RC Other Grid RC Savannah GGUS VO1 VO2 VO3]
We need to work out a better model for this in the UK

Site updates Most sites have stated an intention to move to SL3 and LCG 2.3 over the next few weeks

Monitoring progress
Produced: Certification tests GPPMon Maps RSS feeds
Can we:
  – Have a single view
  – Integrate network info

Today’s functional test results
The tests show similar patterns across EGEE as a whole
How can tests be made more useable by those who can react?

Accounting progress
Well done: Imperial College, Manchester, Oxford, RAL Tier-1, RAL PPD, Edinburgh, Glasgow, UCL – CCC, Durham
What next?
  – More sites!!
  – Provide older data
  – Analyse & use
ALL sites need to keep their log files. Details in the accounting page FAQ.

Ganglia
Well done: Manchester, Edinburgh, Lancaster, QMUL, Sheffield, Bristol, Oxford, Liverpool
What next?
  – We need all sites
  – Review against MoUs
  – Use data for warnings?

Status of planning

We have developed a plan for deployment at a high level. The deliverables form part of the GridPP2 project map.
Each area has consequences for Tiers 1, 2 and 3 in, for example: service challenges, data challenges, networking, security, resource provision, core services, MoU commitments, functionality, accounting, scheduling of use, support, …
It is still evolving and there is a lot of work here!

What metrics and why?
  – Number of sites in production: simple count based on GOCDB information?
  – Number of registered users: count of certificates issued?
  – Number of active users
  – Number of supported VOs
  – Percentage of available resources utilised
  – Peak number of concurrent jobs: measured by Gstat for grid jobs
  – Average number of concurrent jobs: measured by Gstat for grid jobs
  – Number of jobs not terminated by themselves or the batch system
  – Accumulated site downtime per week (scheduled and un-scheduled)
  – Total CPUs deployed
  – CPUs available
  – Storage available and used
  – CPU hours per VO
  – UK relative contribution to experiments
The list shared before…
Subject of DTEAM discussion 16:00-18:00 today
What is actually useful now?
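Several of the metrics listed above reduce to simple arithmetic once the raw data is available. As one hedged example, accumulated site downtime per week could be computed from a list of outage records. The record layout, the sample outages and the function name `weekly_downtime_hours` are all hypothetical; the slide does not say how downtime is recorded.

```python
from collections import defaultdict
from datetime import datetime


def weekly_downtime_hours(outages):
    """Sum downtime in hours per (ISO year, ISO week, kind).

    `outages` is an iterable of (start, end, kind) tuples, where kind is
    "scheduled" or "unscheduled". For simplicity an outage is counted
    entirely in the week in which it starts.
    """
    totals = defaultdict(float)
    for start, end, kind in outages:
        if end < start:
            raise ValueError("outage ends before it starts")
        iso_year, iso_week, _ = start.isocalendar()
        totals[(iso_year, iso_week, kind)] += (end - start).total_seconds() / 3600
    return dict(totals)


# Hypothetical sample data for one site.
sample = [
    (datetime(2005, 1, 31, 9, 0), datetime(2005, 1, 31, 13, 0), "scheduled"),
    (datetime(2005, 2, 2, 22, 0), datetime(2005, 2, 3, 4, 30), "unscheduled"),
    (datetime(2005, 2, 8, 10, 0), datetime(2005, 2, 8, 11, 0), "unscheduled"),
]

for key, hours in sorted(weekly_downtime_hours(sample).items()):
    print(key, f"{hours:.1f} h")
```

Keeping scheduled and unscheduled downtime in separate buckets matches the distinction the metric makes; a fuller version would split outages that span a week boundary.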

LHCb DC feedback
LCG Job Submission Summary Table
LCG Efficiency: 61%
… but note Tony Cass’s comments earlier of improving performance

D0 MC performance
CE                               Success  Failed  Efficiency
bohr0001.tier2.hep.man.ac.uk         237       3       98.8%
cclcgceli01.in2p3.fr                   -      14
grid-ce.physik.uni-wuppertal.de        -       -
gridkap01.fzk.de
golias25.farm.particle.cz            198      15
heplnx131.pp.rl.ac.uk                246       4       98.4%
lcgce02.gridpp.rl.ac.uk              293      10       96.7%
mu6.matrix.sara.nl                   397       7
tbn18.nikhef.nl                      154       2
Total
Efficiency 98%
Is this “much less than production quality”?
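As a check on how the per-site percentages quoted above relate to the job counts, efficiency can be computed as successes divided by total jobs. The sketch below assumes the extracted figures for the three UK CEs read as 237/3, 246/4 and 293/10 (success/failed). That reading is an assumption, although it does reproduce the 98.8%, 98.4% and 96.7% shown on the slide.

```python
def efficiency(success: int, failed: int) -> float:
    """Percentage of jobs that succeeded, out of all jobs run."""
    total = success + failed
    if total == 0:
        raise ValueError("no jobs recorded")
    return 100.0 * success / total


# (success, failed) per CE; the split of the fused digits is an assumption.
uk_sites = {
    "bohr0001.tier2.hep.man.ac.uk": (237, 3),
    "heplnx131.pp.rl.ac.uk": (246, 4),
    "lcgce02.gridpp.rl.ac.uk": (293, 10),
}

for ce, (ok, bad) in uk_sites.items():
    print(f"{ce}: {efficiency(ok, bad):.1f}%")
```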

D0 MC performance
LCG Efficiency 99%
We need to be careful with what we mean!
Error      Count  Comment
Aborted       35  LCG error, e.g. file not found
Cancelled     21  Done by us for various reasons
Cleared        5  Done by us, enough events
Running       10  D0 software error: infinite loop
Scheduled      3  Can be OK, CZ disk crash
Total         74  Really 35 LCG errors
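The point that "efficiency" depends on what is counted as an LCG failure can be made concrete by tallying the figures above. This is a small sketch using the counts from the slide; the per-status attribution flag (`lcg_fault`) is an illustrative simplification of the slide's comments, not an official classification.

```python
# Non-successful D0 jobs by final status: (count, attributable to LCG?).
# Counts are from the slide; the True/False attribution is a simplification.
jobs = {
    "Aborted": (35, True),     # LCG error, e.g. file not found
    "Cancelled": (21, False),  # cancelled by the experiment
    "Cleared": (5, False),     # cleared by the experiment, enough events
    "Running": (10, False),    # D0 software error: infinite loop
    "Scheduled": (3, False),   # may be OK; CZ disk crash
}

total = sum(count for count, _ in jobs.values())
lcg_errors = sum(count for count, lcg_fault in jobs.values() if lcg_fault)

print(f"non-successful jobs: {total}")
print(f"attributable to LCG: {lcg_errors} ({100 * lcg_errors / total:.0f}%)")
```

Counting all 74 jobs against the grid would give a noticeably worse figure than counting only the 35 that the slide attributes to LCG, which is why the definition needs to be agreed before efficiencies are compared.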

GridPP12 Deployment issues
  – Ability to plan (service challenges, networking, resources)
  – Responsiveness of sites
  – Security
  – gLite, gLite, gLite
This is a “production” service
Concept behind the “pre-production” service:
  – New middleware (gLite, …) can be demonstrated and validated before being deployed in production
  – Understand the migration strategy to 2nd generation middleware
  – Use the existing production service as the baseline comparison

GridPP12 Deployment issues
  – Ability to plan (service challenges, networking, resources)
  – Responsiveness of sites
  – Security
  – gLite, gLite, gLite
  – Tier-2s operating as real Tier-2s
  – Use of Tier-2s (experiment models)
  – Metrics (“get fit” plan)
  – Use of Tier-2 SEs
  – SRM = Storage Really Matters!
  – Engagement with experiments
  – On-demand tests and other tools
  – Support
  – Communications
This is a “production” service

Deployment web-pages WORK IN PROGRESS

Summary
  – LCG workshop was useful. Some progress but not enough answers. Roadmaps proposed.
  – EGEE has a deployment structure and GridPP deployment works within the UKI ROC/CIC
  – We need to unravel the support problems and introduce something that works well for UK
  – Sites are responding to requests but sometimes slowly. Better communications are needed.
  – We still have significant planning challenges to overcome (LCG SC1 failed and there is no clear gLite migration strategy. gLite could require a step back in deployment terms! Implications of experiment computing models.)
  – By the next GridPP meeting we must be reporting on carefully defined metrics
THANK YOU to everyone involved. Please remember - we need your feedback to improve the deployment mechanisms and GridPP service.