EGEE-II INFSO-RI-031688 Enabling Grids for E-sciencE www.eu-egee.org EGEE and gLite are registered trademarks EGEE Grid Infrastructure and Operations Maite.


EGEE Grid Infrastructure and Operations
Maite Barroso Lopez, CERN, SA1 Deputy Activity Leader
1st EELA conference, Santiago de Chile, 4th September 2006
EGEE-II INFSO-RI. Enabling Grids for E-sciencE. EGEE and gLite are registered trademarks.

Outline
EGEE – SA1/SA3
EGEE infrastructure – status
Grid Operations
User Support
Security & Policy
Summary

SA: 54% of total
–SA1 (operations): 86%
–SA2 (network): 3%
–SA3 (certification): 11%
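The two levels of percentages on this slide compound. A minimal sketch of that arithmetic follows, assuming (the slide does not say so explicitly) that the SA1/SA2/SA3 figures are shares of the service-activity (SA) portion, which in turn is 54% of the project total.

```python
# Assumption: SA activities are 54% of the project total, and the
# SA1/SA2/SA3 percentages are shares of that SA portion.
SA_SHARE_OF_TOTAL = 0.54
SA_BREAKDOWN = {
    "SA1 (operations)": 0.86,
    "SA2 (network)": 0.03,
    "SA3 (certification)": 0.11,
}


def share_of_total(sa_share, breakdown):
    """Return each activity's share of the whole project, as a fraction."""
    return {name: sa_share * part for name, part in breakdown.items()}


shares = share_of_total(SA_SHARE_OF_TOTAL, SA_BREAKDOWN)
for name, value in shares.items():
    print(f"{name}: {value:.1%} of total")
```

Under this reading, SA1 alone comes to roughly 46% of the total.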

A global, federated e-Infrastructure
Map labels: EUIndiaGrid, EUMedGrid, SEE-GRID, EELA, BalticGrid, EUChinaGrid, OSG, NAREGI
EGEE:
–> 192 sites, 40 countries, 11 ROCs
–> 28,000 processors
–~ 2500 TB storage
–> concurrent jobs per day

Infrastructure status

Some statistics
~6000 CPU-months/month

Grid Operations

EGEE Operations Structure
Operations Coordination Centre (OCC)
Regional Operations Centres (ROC)
–Front-line support for user and operations issues
–Provide local knowledge and adaptations
–One in each region – many distributed (inc. A-P)
–Manage daily grid operations – oversight, troubleshooting
  –"Operator on Duty"
–Run infrastructure services
User Support Centre (GGUS)
–In FZK: provide single point of contact (service desk) + portal

EGEE Operations Process
Grid operator on duty
Grid monitoring tools
Geographically distributed responsibility for operations:
–There is no "central" operation
–Tools are developed/hosted at different sites:
  –GOC DB (RAL), SFT (CERN), GStat (Taipei), CIC Portal (Lyon)
Procedures described in Operations Manual
–Linked from the CIC portal

Grid Operator on Duty
Role:
–Watch the problems detected by the grid monitoring tools
–Diagnose problems
–Report these problems (GGUS tickets)
–Follow and escalate them if needed (well-defined procedure)
–Provide help, propose solutions
–Build and maintain a central knowledge database (wiki)
Who does it?
–9 ROC teams working in pairs (one lead and one backup) on a weekly rotation
–CERN, France, Italy, UK, Russia, Asia-Pacific, Southeastern Europe, Central Europe, Germany-Switzerland
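The weekly lead/backup rotation can be sketched as a simple round-robin over the nine teams. The pairing rule used here (this week's backup leads the following week) is an assumption made for illustration; the slide only states that teams work in pairs on a weekly rotation.

```python
# The nine ROC teams named on the slide.
ROC_TEAMS = [
    "CERN", "France", "Italy", "UK", "Russia", "Asia-Pacific",
    "Southeastern Europe", "Central Europe", "Germany-Switzerland",
]


def weekly_rotation(teams, weeks):
    """Return a list of (week_number, lead, backup) tuples.

    Assumption: the backup of one week becomes the lead of the next.
    """
    schedule = []
    for week in range(weeks):
        lead = teams[week % len(teams)]
        backup = teams[(week + 1) % len(teams)]
        schedule.append((week + 1, lead, backup))
    return schedule


schedule = weekly_rotation(ROC_TEAMS, 10)
for week, lead, backup in schedule:
    print(f"week {week}: lead={lead}, backup={backup}")
```

With nine teams the pattern repeats every nine weeks, so each team leads once and backs up once per cycle.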

Grid monitoring tools
Tools used by the Grid Operator on Duty team to detect problems
Distributed responsibility
CIC portal
–Single entry point
–Integrated view of monitoring tools
Site Functional Tests (SFT) -> Service Availability Monitoring (SAM)
GIIS monitor (GStat)
GOC certificate lifetime
GOC job monitor

Site Functional Tests
Site Functional Tests (SFT)
–Framework to test (sample) services at all sites
–Shows results matrix
–Detailed test log available for troubleshooting and debugging
–History of individual tests is kept
–Can include VO-specific tests (e.g. sw environment)
–Normally >80% of sites pass SFTs
  –NB: of 190 sites, some are not well managed
Very important in stabilising sites:
–Apps use only good sites
–Bad sites are automatically excluded
–Sites work hard to fix problems
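As an illustration of how a results matrix leads to "apps use only good sites", the sketch below filters sites by test outcome and computes the pass rate. The site names, test names and the rule that every test counts as critical are hypothetical; the real SFT framework is not reproduced here.

```python
# Hypothetical results matrix: site -> {test name -> passed?}.
# Site and test names are invented for illustration.
results = {
    "site-a": {"job-submit": True, "replica-mgmt": True, "sw-env": True},
    "site-b": {"job-submit": True, "replica-mgmt": False, "sw-env": True},
    "site-c": {"job-submit": True, "replica-mgmt": True, "sw-env": True},
    "site-d": {"job-submit": False, "replica-mgmt": False, "sw-env": True},
    "site-e": {"job-submit": True, "replica-mgmt": True, "sw-env": True},
}


def good_sites(matrix):
    """Sites that pass every test (assumption: all tests are critical)."""
    return sorted(site for site, tests in matrix.items() if all(tests.values()))


def pass_rate(matrix):
    """Fraction of sites that pass all of their tests."""
    return len(good_sites(matrix)) / len(matrix)


usable = good_sites(results)
rate = pass_rate(results)
print(usable, f"{rate:.0%}")
```

A VO-specific view would use the same filter with that VO's own list of tests.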

Service Availability Monitoring
Service Availability Monitoring (SAM)
–Will cover all grid core services
–Will measure availability by service, site, VO
–Each service has an associated service class defining required availability (critical, highly available, etc.)
–Will be used to:
  –generate alarms
  –generate trouble tickets
  –call out support staff
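A rough sketch of the idea: aggregate test outcomes into availability per (service, site, VO) and raise an alarm when a value falls below the target for the service's class. The numeric targets, service names and sample data are assumptions for illustration; the slide names the classes but gives no thresholds, and this is not the SAM implementation.

```python
from collections import defaultdict

# Assumed availability targets per service class (not given on the slide).
REQUIRED = {"critical": 0.99, "highly_available": 0.95}


def availability(samples):
    """samples: iterable of (service, site, vo, ok) tuples, one per test run."""
    total = defaultdict(int)
    passed = defaultdict(int)
    for service, site, vo, ok in samples:
        key = (service, site, vo)
        total[key] += 1
        if ok:
            passed[key] += 1
    return {key: passed[key] / total[key] for key in total}


def alarms(avail, service_class):
    """Keys whose availability is below the target of their service class."""
    return sorted(
        key for key, value in avail.items()
        if value < REQUIRED[service_class[key[0]]]
    )


# Hypothetical test runs and class assignments.
samples = [
    ("CE", "site-a", "vo1", True),
    ("CE", "site-a", "vo1", True),
    ("CE", "site-a", "vo1", True),
    ("CE", "site-a", "vo1", False),
    ("SE", "site-a", "vo1", True),
    ("SE", "site-a", "vo1", True),
]
service_class = {"CE": "critical", "SE": "highly_available"}

avail = availability(samples)
raised = alarms(avail, service_class)
print(avail, raised)
```

The alarm list is the natural input for the ticketing and call-out steps mentioned on the slide.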

Operations support workflows
Diagram labels: Regional Operations Centre, Resource Centre, OSCT, Grid Operator on-duty, 1st Level support, 2nd Level support
–Monitoring shows a problem
–Operator submits a GGUS ticket against the ROC and cc's the site
–The ticket is followed until it is solved
–ROC and site work to resolve the problem

Escalation procedures
Action taken, with delay by priority (low, normal, high):
–1st mail to site admin and ROC: 3 days, 1 day
–2nd mail to ROC: 3 days, 1 day
–Phone call to ROC: 3 days, 1 day
–Final mail to ROC: immediate
–Weekly operations meeting call: asap
–Mail to OCC for validation: asap
–Site suspension: asap
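The escalation ladder can be sketched as an ordered list with a helper that returns the next action. Only the order of the steps is taken from the slide; the per-priority delays are deliberately not modelled, because the flattened table does not map them to priority columns reliably.

```python
# Escalation steps in the order listed on the slide.
# Delays per priority are intentionally omitted (see the note above).
ESCALATION_STEPS = [
    "1st mail to site admin and ROC",
    "2nd mail to ROC",
    "phone call to ROC",
    "final mail to ROC",
    "weekly operations meeting call",
    "mail to OCC for validation",
    "site suspension",
]


def next_step(current=None):
    """Return the step after `current`, or None once the last step is reached."""
    if current is None:
        return ESCALATION_STEPS[0]
    index = ESCALATION_STEPS.index(current)
    if index + 1 < len(ESCALATION_STEPS):
        return ESCALATION_STEPS[index + 1]
    return None


step = next_step()
while step is not None:
    print(step)
    step = next_step(step)
```

In practice the ladder stops as soon as the ticket is solved; the loop above simply walks the full sequence.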

Site related procedures
Introducing a new site
–The ROC is the contact point
–ROC registers the site and sets the initial site status to uncertified
–After SFTs run OK for a week -> certified
Site downtime scheduling
–EGEE resources need to be switched off properly in order not to disturb operations
–Set downtime period in GOCDB and tick off "monitoring" for the affected nodes
–Announce the downtime through the EGEE broadcast tool
–"GlueCEStateStatus: Closed"
Required site contacts
–Stored in GOCDB
Suspending a site
–The site is then removed from the top-level BDII and monitoring is turned off
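The certification rule can be sketched as a small status function. It reads "OK for a week" as seven consecutive days of passing SFTs ending today, which is an assumption; the function name and its inputs are hypothetical and do not correspond to any GOCDB interface.

```python
from datetime import date, timedelta


def site_status(sft_ok_days, today, suspended=False):
    """Return 'suspended', 'certified' or 'uncertified'.

    sft_ok_days: dates on which the site's SFTs ran OK (hypothetical input).
    Assumption: 'a week' means the seven consecutive days ending today.
    """
    if suspended:
        return "suspended"
    last_week = {today - timedelta(days=n) for n in range(7)}
    return "certified" if last_week <= set(sft_ok_days) else "uncertified"


today = date(2006, 9, 4)
full_week = {today - timedelta(days=n) for n in range(7)}
one_day_missing = full_week - {today - timedelta(days=3)}

print(site_status(full_week, today))
print(site_status(one_day_missing, today))
print(site_status(full_week, today, suspended=True))
```

A new site starts with an empty set of OK days and is therefore reported as uncertified until a full week has accumulated.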

Operations coordination
ROC managers meeting
–Biweekly
–Minutes:
–Discuss inter-ROC issues, general coordination, interfaces with other activities
Operations meeting
–Weekly, Mondays at 16:00 (Swiss time)
–Agendas, minutes:
–WLCG/OSG/EGEE
–Pre-reports from sites, ROCs and VOs through CIC portal
–Discuss, track and solve operation related issues from the previous week
Operation Workshops
–Twice per year
–Next one: Spring 2007
–Agenda of last one:

Checklist for a new service
User support procedures (GGUS)
–Troubleshooting guides + FAQs
–User guides
Operations Team Training
–Site admins
–CIC personnel
–GGUS personnel
Monitoring
–Service status reporting
–Performance data
Accounting
–Usage data
Service Parameters
–Scope: Global/Local/Regional
–SLAs
–Impact of service outage
–Security implications
Contact Info
–Developers
–Support Contact
–Escalation procedure to developers
Interoperation
–Documented issues
First level support procedures
–How to start/stop/restart service
–How to check it's up
–Which logs are useful to send to CIC/Developers
  –and where they are
SFT Tests
–Client validation
–Server validation
–Procedure to analyse these
  –error messages and likely causes
Tools for ROC to spot problems
–GIIS monitor validation rules (e.g. only one "global" component)
–Definition of normal behaviour
  –Metrics
ROC Dashboard
–Alarms
Deployment Info
–RPM list
–Configuration details
–Security audit
This is what it takes to make a reliable production service from a middleware component.
Not much middleware is delivered with all this … yet.

User Support

User support in EGEE
–Global Grid User Support (GGUS) is the EGEE support infrastructure for Grid users and for deployment and operation problems
–It offers a large variety of services to satisfy user needs at all levels
–It does not replace existing infrastructures; it integrates them and coordinates support efforts

The Support Model
"Regional Support with Central Coordination"
The ROCs, VOs and other project-wide groups such as the middleware groups (JRA), network groups (NA) and service groups (SA) are connected via a central integration platform provided by GGUS.
Diagram labels: Central Application (GGUS); Interface; Webportal; TPM; ROC 1, ROC …, ROC 10; VO Support; Deployment Support; Middleware Support; Network Support; Operations Support; Regional Support units; User Support units; Technical Support units; Other grids (e.g. OSG)

The GGUS System

GGUS Portal: user services
–Browseable tickets
–Search through solved tickets
–Useful links (Wiki FAQ)
–Broadcast tools
–Latest News
–GGUS Search Engine
–Updated documentation (Wiki FAQ)

EGEE and EELA: Operations
Cooperation between EGEE and EELA, in all areas, is very important
This conference is an opportunity to explore some points where we can work better together
Starting discussions to apply standard EGEE operations procedures and tools in EELA:
–Creation of its own ROC
  –To support EELA sites
  –Initial support from CERN ROC
–Site monitoring, SAM server being deployed by Alexandre Duarte

Summary
EGEE operates the world's largest multi-disciplinary grid infrastructure for scientific research
–In constant and significant production use
–EELA as part of this production infrastructure
Operations procedures and tools under constant evolution
–Much is being learned, but there remains much to be done to achieve long-term sustainability
–EELA is starting to use some of these tools/procedures; feedback plus additions are welcome!
We have gained significant experience in what it takes to deploy, operate and manage a large distributed infrastructure
–Next steps: Service Availability Monitoring, Service Level Agreements
Importance of interoperability/interoperations with related projects, EELA