CERN - IT Department CH-1211 Genève 23 Switzerland www.cern.ch/it Operations procedures CERN Site Report Grid operations workshop Stockholm 13 June 2007.

Presentation transcript:

Operations procedures: CERN Site Report
Grid operations workshop, Stockholm, 13 June 2007
CERN - IT Department, CH-1211 Genève 23, Switzerland

Many questions to answer in 15’!
  – Tools used in daily operations
  – Features missing
  – Examples of the most frequent scheduled interventions at CERN
  – Examples of the most frequent unscheduled interventions at CERN
  – Points to improve in communication with ROC, other sites, VOs, rest of the world...
  – Plan for deployment of updates/new versions in continuous operation
  – Communication with users
  – Correlation of cross-site issues
  – Percentage of real site problems detected and reported by the COD before we know about them
  – Usefulness of operations bodies/meetings

Overview
CERN fabric by numbers
  – CE (DNS load-balanced clusters)
      – 17 LCG CE
      – 5 gLite CE
  – WNs: ~2800 WNs, 7000 CPUs
  – SE
      – SRM: 20 nodes (v1.1, v2.2)
Storage system: CASTOR version 2
  – ~1.5 PB disk cache (LHC VOs only)
  – 3 PB tape used, ~140 tape drives (T10K, 3592B, 9940B)
Services not covered in this presentation
  – FTS
  – gLite WMS
  – VOMS
  – Database services

Tools used in daily operations
ELFms toolsuite
  – Monitoring, exceptions & alarms
  – System installation and configuration
  – Hardware and state management
  – Service display
Operational strategy
  – Automate where possible
      – Lemon actuators
      – Automated state changes upon declared exceptions
  – If not, document the procedure, if possible for non service-expert use (sysadmin, operators)
Operation and support workflows
  – Remedy
  – GGUS tickets are fed into local Remedy
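The "automate where possible" strategy above can be illustrated with a minimal sketch. It does not use the real Lemon or ELFms interfaces: the `Node` class, the `ACTUATORS` table, the exception names and the state names are all hypothetical, chosen only to show an actuator being tried first and an automated state change following when no automatic recovery is possible.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    """Hypothetical stand-in for a managed machine."""
    name: str
    state: str = "production"
    log: List[str] = field(default_factory=list)


def restart_daemon(node: Node) -> bool:
    """Hypothetical actuator: pretend to restart a daemon and report success."""
    node.log.append("restart_daemon")
    return True


# Assumed mapping from a declared exception to its automatic recovery action.
ACTUATORS: Dict[str, Callable[[Node], bool]] = {
    "daemon_dead": restart_daemon,
}


def handle_exception(node: Node, exception: str) -> str:
    """Try the actuator; otherwise change state and leave it to a documented procedure."""
    actuator = ACTUATORS.get(exception)
    if actuator is not None and actuator(node):
        return "recovered"
    # No automatic fix: take the node out of production and record the hand-off.
    node.state = "maintenance"
    node.log.append(f"escalate:{exception}")
    return "escalated"
```

In this sketch a known exception is handled without human action, while an unknown one moves the node to an assumed "maintenance" state, which is where a documented procedure for operators or sysadmins would take over.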

Grid services support flow
Inputs
  – Lemon alarms
  – User/experiment problems
CC operators
  – 24 x 7 coverage
  – 1st level alarm handling
  – Driven by procedures
Sysadmin team
  – 2nd level alarms handling
  – System installations, manage hardware repairs
  – 24 x 7 coverage through Piquet service
Service managers
  – Service responsible
  – Applies s/w upgrades, configuration changes, provides procedures
  – Entry point for support lines
  – SMod, GMod – person on rota
Service experts
  – Complex cases
  – May end up in bug to developers
Piquet Working Group looking into the possibility to establish piquet services for critical services
Problem reports come to the Service Managers via many different flows, using many different tools, directly and indirectly. This still needs some tuning.
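As a rough illustration of the support flow, the sketch below models the four groups as an ordered escalation chain. The ordering, the source labels `"lemon_alarm"` and `"user_problem"`, and both function names are assumptions made for this example; the slide only states that operators do first-level alarm handling and that service managers are the entry point for support lines.

```python
from typing import Optional

# Assumed escalation order, following the order in which the groups are named.
SUPPORT_LEVELS = [
    "CC operators",
    "Sysadmin team",
    "Service managers",
    "Service experts",
]


def entry_point(source: str) -> str:
    """Return the group that first sees a problem from the given source."""
    if source == "lemon_alarm":
        return "CC operators"      # 1st level alarm handling
    if source == "user_problem":
        return "Service managers"  # entry point for support lines
    raise ValueError(f"unknown source: {source}")


def escalate(level: str) -> Optional[str]:
    """Return the next group in the chain, or None at the end of the chain."""
    index = SUPPORT_LEVELS.index(level)
    if index + 1 < len(SUPPORT_LEVELS):
        return SUPPORT_LEVELS[index + 1]
    return None
```

A problem that service experts cannot resolve leaves this chain entirely, since it "may end up in bug to developers".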

Features missing
Support for scheduled upgrades (ongoing)
  – Needed for partitioning of clusters into “prod” / “new” / “test” areas
  – Moving from 1-2 Linux upgrades/week to ‘scheduled upgrades’, where we only apply software updates between LHC runs
  – Aim to validate the complete software stack (OS + middleware + applications). Need experiment help!
  – Planned for ~end of summer
Lemon GUI service views (ongoing)
  – Allows tailoring the Lemon GUI to show service-specific metrics (e.g. CASTOR request rate)
  – Planned for July
Alarm on SAM failures for CERN-PROD

Examples of the most frequent scheduled interventions at CERN
Upgrades to storage system
  – Intrusive Oracle intervention on the CASTOR name server database (e.g. moving to new hardware, quarterly patches)
      – Affects all VOs
      – Batch normally paused during the intervention
  – New CASTOR stager version, intrusive Oracle interventions
      – Usually only one VO at a time
  – Tape library interventions
      – Degradation (long wait time for tape recalls) for all VOs
Router/switch upgrades or changes
Changes to the CE cluster
  – Draining of nodes for h/w intervention
  – Only degradation, no service interruption

Examples of the most frequent unscheduled interventions at CERN
Power cuts / cooling failures
CASTOR stager meltdowns
  – Usually only one VO
  – If left for a longer period (e.g. over a weekend), it can result in global degradation of the SRM service affecting all VOs
CE overload

Points to improve in communication with ROC, other sites, VOs, rest of the world...
Weekly site reports
  – Need to streamline local reporting better to fit the grid
      – Content
          – Consistency
          – Avoid duplication where possible
          – Flag relevance (grid/non-grid)
      – Synchronization in time
      – Avoid duplication
  – Site availability
      – Test categories can be confusing (CE – SE correlations)
VO communication: when to use broadcast?
  – Currently using a mixture of mailing lists, hn, broadcasts
  – Trying to move to using broadcasts more consistently, but today this is not sufficient
Weekly GGUS ticket review with ROC is very useful!

Plan for deployment of updates/new versions in continuous operation
  – We plan to only apply software updates between LHC runs
  – This means that we will prepare updates during LHC runs
  – Aim to validate the complete software stack (OS + middleware + applications). Need experiment help!
  – To be done in 2007: make a test cluster available to validate upcoming releases
  – Hope to move to this mode in ~September
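The "prepare during runs, apply between runs" policy can be expressed as a small gating check. This is only a sketch: the run windows are hypothetical dates invented for the example (the presentation gives no run schedule), and `in_run` and `update_action` are illustrative names rather than part of any CERN tool.

```python
from datetime import date
from typing import List, Tuple

RunWindow = Tuple[date, date]  # (first day, last day), both inclusive


def in_run(day: date, runs: List[RunWindow]) -> bool:
    """True if the given day falls inside any LHC run window."""
    return any(start <= day <= end for start, end in runs)


def update_action(day: date, runs: List[RunWindow]) -> str:
    """During a run updates are only prepared and validated; between runs they are applied."""
    return "prepare" if in_run(day, runs) else "apply"


# Hypothetical run schedule, for illustration only.
EXAMPLE_RUNS: List[RunWindow] = [
    (date(2008, 6, 1), date(2008, 6, 30)),
    (date(2008, 8, 1), date(2008, 8, 31)),
]
```

With this schedule, a request made in mid-June would be queued for preparation on the test cluster, and the update would be applied in the July gap between the two assumed runs.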

Communication with users
Channels for service announcements agreed individually with each experiment's VO managers
  – List of named people
  – Broadcasts
  – + combinations of above
Support: power users know the shortcuts, but we are trying to insist on use of the support flows

Correlation of cross-site issues
Particularly relevant for data transfer problems
  – Who owns the problem from the start?
  – Who determines which site/component is at fault?
  – Successful problem determination usually depends on a coordinated investigation between the involved sites
  – Access to log files and other relevant information
  – Login access to remote hosts for ‘bare’ transfer tests

Percentage of real site problems detected and reported by the COD before we know about them
Have not gathered any statistics, but most probably <50%
What is a ‘real site problem’?
  – Degradation
      – Failure of one CE node out of a load-balanced cluster of 10
  – Partial outage
      – SE for one VO is unavailable due to CASTOR problems
  – Full outage
Even if the problem is known, the COD reports are in general useful for us
  – Probing can detect configuration problems that we can’t see ourselves
      – Recent case with groups.conf file on the CEs
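The three categories of "real site problem" can be made concrete with a small classifier. The data model is an assumption made for this sketch: each VO maps to a pair (healthy instances, total instances) of the service that serves it. The function name and the "ok" label are likewise illustrative.

```python
from typing import Dict, Tuple


def classify(service_by_vo: Dict[str, Tuple[int, int]]) -> str:
    """Classify a service from per-VO (healthy, total) instance counts."""
    if not service_by_vo:
        raise ValueError("no VOs given")
    healthy_counts = [healthy for healthy, _ in service_by_vo.values()]
    if all(healthy == 0 for healthy in healthy_counts):
        return "full outage"        # nothing works for anyone
    if any(healthy == 0 for healthy in healthy_counts):
        return "partial outage"     # at least one VO is completely cut off
    if any(healthy < total for healthy, total in service_by_vo.values()):
        return "degradation"        # reduced capacity, but every VO is served
    return "ok"
```

Under this model, one failed CE node in a shared cluster of 10 would be passed as `{"atlas": (9, 10), "cms": (9, 10)}` and classified as a degradation, while an SE that is down for a single VO, e.g. `{"atlas": (0, 4), "cms": (4, 4)}`, would be a partial outage.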

Usefulness of operations bodies/meetings
Weekly grid operations meeting is of little / no use for CERN-PROD service managers
  – Prefer expanding our weekly meeting with ROC managers (ticket review) to include
      – Review site availability
      – Weekly service reports
      – Coordination of scheduled interventions
      – Review GGUS tickets
Operations bodies
  – User support
      – Good filtering but sometimes slow escalation
  – Operation
      – COD reports are in general useful
