Operations Status Report

Operations Status Report Ian Bird CERN GDB Meeting 8th February 2005

Introduction
- Current Release and Deployment Procedures
- Experience
- Additional Input
- New Procedures
- gLite & LCG preproduction service
- Lessons Learned
- Operations Roles in EGEE
- Operations Procedures
- Implementation
- Examples
- Status and Summary
Ian.Bird@cern.ch

Current Procedure

Abbreviations used in the diagram:
- C&T: Certification & Testing
- EIS: Experiment/Application Integration Support
- GIS: Grid Infrastructure Support
- GDB: Grid Deployment Board
- Other diagram labels: OMC, Applications, RCs (sites), CICs/ROCs

Monthly process (sequential):
- Gathering of new material
- Prioritization
- Integration of items on list
- Deployment on testbeds
- First tests
  - feedback
- Release to EIS testbed for experiment validation
- Full testing (functional and stress)
  - feedback to patch/component providers
  - final list of new components
- Internal release (LCFGng)

On demand (parallel):
- Preparation/update of release notes for LCFGng
- Preparation/update of generic install documentation
- Test installations on GIS testbeds
- Update of user documentation
- Announcement on the LCG-Rollout list

Release Preparation [diagram; steps as numbered on the slide]
Inputs: Bugs/Patches/Tasks (Savannah, e-mail); diagram labels: EIS, GIS, GDB, Applications, RCs, CIC, Developers
1. Wish list for next release
2. Prioritization & selection (C&T, EIS, GIS, Head of Deployment): list for next release (can be empty)
3. Integration & first tests (C&T)
4. Internal Release
5. Deployment on EIS testbed (EIS)
6. Full deployment on test clusters, functional/stress tests, ~1 week (C&T)
7. Final Internal Release: LCFGng & change record

Deployment [diagram; steps as numbered on the slide]
Input: Final Internal Release (LCFGng & change record)
8. Update User Guides (EIS); Release Notes, finalize LCFGng configuration, prepare Manual Guide (GIS)
9. Release Notes, Installation Guides, User Guides
10. LCFGng install test, manual install test (GIS)
11. Release
12. Announce release on the LCG-Rollout list (GIS)
13. Upgrade/install at RCs; sites upgrade at own pace
14. Re-certify (GIS)
Other diagram labels: Synchronize, Applications, ROCs
Certification is run daily

Experience
- Process was decisive in improving the middleware
- The process is time consuming (5 releases in 2004)
  - Many sequential steps
  - Many different site layouts have to be tested
  - Format of internal and external releases differs
  - Multiple packaging formats (tool based, generic)
- All components are treated equally
  - same level of testing for non-vital and core components
  - new tools and tools in use by other projects are tested to the same level
- Process to include new components is not transparent
- Timing for releases is difficult
  - users: now
  - sites: scheduled
- Upgrades need a long time to cover all sites
  - some sites had problems becoming functional after an upgrade

Additional Input
- Data Challenges
  - client libs need fast and frequent updates
  - core services need fast patches (functional/fixes)
  - applications need a transparent release preparation
  - many problems only become visible during full scale production
- Installation tool is not available for new OS versions
- Configuration is a major problem on smaller sites
- Operations Workshop
  - smaller sites can handle major upgrades only every 3 months
  - sites need to give input in the selection of new packages
    - resolve conflicts with local policies
- gLite releases need to be deployed
  - software already partially tested by JRA1
    - unit and functional tests
    - certification will need fewer iterations
  - preproduction service replaces part of the certification process
  - LCG2 and gLite have to run side by side (coexist on same fabric)

Changes I
- Simple Installation/Configuration Scripts
  - YAIM (Yet Another Installation Method)
    - semi-automatic simple configuration management
    - based on scripts (easy to integrate into other frameworks)
    - all configuration for a site is kept in one file
  - APT (Advanced Package Tool) based installation of middleware RPMs
    - simple dependency management
    - updates (automatic on demand)
    - no OS installation
  - Client libs packaged in addition as user space tar-ball
    - can be installed like application software
- Process (in development)
  - new process to gather and prioritize new packages
    - formal tracking tool, priorities are assigned to the packages
    - cost to completion assigned (time of a specific individual)
    - at cut-off day: selection process with participation of applications, sites and deployment
    - work will continue based on priority list between releases (rolling)
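
The prioritization process described on this slide (priorities and cost to completion recorded in a tracking tool, selection at a cut-off day, remaining work continuing on a rolling list) could be sketched roughly as follows. This is only an illustration, not the actual tracking tool: the `Item` fields, the item names, and the convention that priority 1 is the highest are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str          # hypothetical package/patch name
    priority: int      # assumed convention: 1 = highest priority
    cost_days: float   # estimated cost to completion, in person-days
    ready: bool        # True if the work is finished at the cut-off day


def _ranked(items):
    # Order by priority first, then by remaining cost as a tie-breaker.
    return sorted(items, key=lambda i: (i.priority, i.cost_days))


def select_for_release(items):
    """Names of items that are ready at the cut-off day, highest priority first."""
    return [i.name for i in _ranked(items) if i.ready]


def carry_over(items):
    """Names of items that stay on the rolling priority list for later releases."""
    return [i.name for i in _ranked(items) if not i.ready]


# Hypothetical tracker content at the cut-off day.
tracker = [
    Item("info-system-patch", priority=2, cost_days=0, ready=True),
    Item("client-lib-fix", priority=1, cost_days=0, ready=True),
    Item("new-storage-service", priority=1, cost_days=10, ready=False),
    Item("monitoring-tool", priority=3, cost_days=4, ready=False),
]

release_list = select_for_release(tracker)
rolling_list = carry_over(tracker)
```

In this sketch the release list can legitimately be empty, which matches the "can be empty" note on the process diagram.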

Changes II
- different release frequency for
  - client libs (UI, WN)
  - services (CE, SE)
  - core services (RB, BDII, ..)
  - major releases (configuration changes, RPMs, new services)
  - updates (bug fixes) added any time to specific releases
  - non-critical components will be made available with reduced testing
- Fixed release dates for major releases (allows planning)
  - every 3 months, sites have to upgrade within 3 weeks
- Minor releases every month
  - based on ranked components available at a specific date in the month
  - not mandatory for smaller RCs to follow
  - client libs will be installed as application level software
- early access to pre-releases of new software for applications
  - client libs will be made available on selected sites
  - services with functional changes are installed on EIS-Applications testbed
  - early feedback from applications
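
As an illustration of the release rhythm on this slide (major releases every 3 months with a 3-week upgrade window, minor releases every month), a release calendar might be computed as below. The start date, the choice of the first day of the month as release day, and the idea that every third monthly slot is the major release are assumptions made for the sketch; the slide does not state them.

```python
from datetime import date, timedelta

UPGRADE_WINDOW = timedelta(weeks=3)  # from the slide: sites upgrade within 3 weeks


def add_months(d, months):
    """Shift a date by whole months, keeping the day of month (day must be <= 28)."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, d.day)


def release_calendar(first_major, n_months):
    """Return (release_date, kind, upgrade_deadline) for n_months monthly slots.

    Assumption: every third slot is a major release with a mandatory upgrade
    deadline; the other slots are minor releases with no deadline.
    """
    calendar = []
    for k in range(n_months):
        day = add_months(first_major, k)
        if k % 3 == 0:
            calendar.append((day, "major", day + UPGRADE_WINDOW))
        else:
            calendar.append((day, "minor", None))
    return calendar


# Hypothetical start date, chosen only for the example.
cal = release_calendar(date(2005, 3, 1), 6)
for release_date, kind, deadline in cal:
    print(release_date, kind, deadline)
```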

New Process (simplified) [diagram; recoverable labels]
- Bugs/Patches/Tasks tracked in Savannah; cost assigned and updated
- Prioritization & selection, producing the list for next release (can be empty)
- Integration & first tests
- Internal Releases; Internal Client Release
- User-level install of client tools
- Full deployment on test clusters, functional/stress tests, ~1 week
- Release types: Client Release, Service, Updates, Core Service
- Components ready at cutoff
- Parties shown: Developers, Applications, RC, CICs, C&T, EIS, GIS, GDB, Head of Deployment

New Deployment [diagram; recoverable labels]
- Release(s) accompanied by Release Notes, Installation Guides and User Guides (updated by EIS and GIS); YAIM
- Release cadence labels: "Every Month", "Every Month", "Every 3 months on fixed dates!"
- Deploy Client Releases (User Space)
- Deploy Service Releases (Optional): CICs, RCs at own pace
- Deploy Major Releases (Mandatory): ROCs, RCs
- Re-Certify (CIC)
- Certification is run daily

Lessons Learned
- Certification of the middleware was the essential tool to improve its quality
- Early access to new releases was crucial for applications
- Process has to undergo evolutionary changes
  - software matures
    - certification becomes more complex (shift to applications)
  - scale (110 sites)
    - releases with radical changes become very hard to deploy
  - usage (production)
    - some uniformity and fast spread of fixes is expected by applications

Operations: Roles
- SA1: EGEE European Grid Support, Operation and Management activity
- OMC: Operation Management Centre
- CIC: Core Infrastructure Centre
- ROC: Regional Operation Centre
- RC: Resource Centre
- GGUS: Global Grid User Support (FZK)

Procedures
- Driven by experience during 2004 Data Challenges
- Reflecting the outcome of the November Operations Workshop
- Operations Procedures
  - roles of CICs, ROCs, RCs
  - weekly rotation of operations centre duties (CIC-on-duty)
  - daily tasks of the operations shift
    - monitoring (tools, frequency)
    - problem reporting
      - problem tracking system
    - communication with ROCs & RCs
    - escalation of unresolved problems
    - handing over the service to the next CIC
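
The weekly rotation of CIC-on-duty described above amounts to a round-robin schedule. A minimal sketch, assuming four participating CICs with placeholder names and shifts that start on a Monday (neither detail is given on the slide):

```python
from datetime import date, timedelta
from itertools import cycle, islice


def rotation(cics, first_monday, weeks):
    """Assign one CIC per week, round-robin, starting at first_monday."""
    order = islice(cycle(cics), weeks)
    return [(first_monday + timedelta(weeks=k), cic) for k, cic in enumerate(order)]


def on_duty(schedule, day):
    """Return the CIC whose shift week contains `day`, or None if not scheduled."""
    for start, cic in schedule:
        if start <= day < start + timedelta(weeks=1):
            return cic
    return None


# Hypothetical CIC names; the real list of centres is not given here.
CICS = ["CIC-A", "CIC-B", "CIC-C", "CIC-D"]
schedule = rotation(CICS, date(2005, 2, 7), 6)

for start, cic in schedule:
    print(start, cic)
```

Handing over the service to the next CIC then corresponds to moving from one entry of `schedule` to the next.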

Implementation
- Evolutionary Development
- Procedures
  - documented (constantly adapted)
  - available at the CIC portal http://cic.in2p3.fr/
  - in use by the shift crews
- Portal http://cic.in2p3.fr
  - access to tools and process documentation
  - repository for logs and FAQs
  - provides means of efficient communication
  - provides condensed monitoring information
- Problem tracking system
  - currently based on Savannah at CERN
  - is moving to the GGUS at FZK
  - exports/imports tickets to local systems used by the ROCs
- Weekly Phone Conferences and Quarterly Meetings

A day in an operator's life
All-in-one CIC-on-duty dashboard: https://cic.in2p3.fr/pages/cic/framedashboard.html

A day in an operator's life goes on
[screenshot; recoverable labels: TZR, Ticket status, GIIS, GOC Wiki, Blacklist, phone, PMB, 1st mail, 2nd]

A day in an operator's life goes on and on
By watching the EGEE monitoring tools; here is a selection:
- GIIS Monitor graphs
- GIIS Monitor
- Sites Functional Tests and History
- GOC Data Base
- Scheduled Downtimes
- Live Job Monitor
- GridIce: VO view
- GridIce: fabric view
- Certificate Lifetime Monitor

Summary
- Initial set of operations procedures is available and implemented
  - based on experience in 2004 and the Operations Workshop
- No long term experience exists
  - have to adapt tools, roles and procedures as we learn and grow the system
- Rotation between CICs
  - spreads the load (~50 tickets are handled per week)
  - distributes knowledge quickly
  - first step towards 24/7 operation
    - introducing CICs in other time zones (Taipei, Vancouver)
- Monitoring tools need to be linked
  - to give access to all information
  - automate creation of alarms
  - better diagnosis of problems
  - first steps taken, several monitoring tools export data into EGEE R-GMA
- Certification and Operation are closely linked
  - same entities involved
  - same knowledge needed (FAQs)

Ongoing
- Produce and publish metrics for
  - Service and site reliability and stability
    - Information available: extract, plot, and publish
  - Application efficiency: from logging and bookkeeping, also good to have application instrumentation
    - Build realistic jobs, instrumented, run 2-3 times per day
    - But: need reasonable resources and priority at sites to run these
- Application verification of site
  - For many applications now
  - Select stable, well configured sites: efficiency >85-90%
  - D0, CMS, Geant4, …
- Improve and demonstrate a reliable and trusted user support service
  - See Flavia’s talk
- Pre-production service and gLite …
  - Priorities vs LCG-2
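
The site-selection idea mentioned above (pick stable sites whose efficiency exceeds roughly 85-90%) can be illustrated with a small calculation. This is a sketch under assumptions: efficiency is taken here to mean the fraction of instrumented test jobs that succeed, the lower bound of 85% is used as the threshold, and the site names and job results are invented for the example.

```python
def site_efficiency(results):
    """Map site -> fraction of successful jobs; `results` is a list of (site, ok) pairs."""
    totals, passed = {}, {}
    for site, ok in results:
        totals[site] = totals.get(site, 0) + 1
        passed[site] = passed.get(site, 0) + (1 if ok else 0)
    return {site: passed[site] / totals[site] for site in totals}


def stable_sites(results, threshold=0.85):
    """Sites whose efficiency is strictly above the threshold, sorted by name."""
    efficiency = site_efficiency(results)
    return sorted(site for site, eff in efficiency.items() if eff > threshold)


# Invented test-job outcomes: site-a succeeds 9 of 10 times, site-b 7 of 10.
results = (
    [("site-a", True)] * 9 + [("site-a", False)]
    + [("site-b", True)] * 7 + [("site-b", False)] * 3
)

efficiency = site_efficiency(results)
selected = stable_sites(results)
```

With jobs run 2-3 times per day, as the slide suggests, the same calculation could be repeated over a sliding window to track stability over time.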