Handling ALARMs for Critical Services
Maria Girone, IT-ES; Maite Barroso, IT-PES; Maria Dimou, IT-ES
WLCG MB, 19 February 2013



Recap

- List of critical services proposed to the experiments in February 2012
  - Derived from the WLCG MoU "high level services"
- Criticality defined in terms of urgency and impact
- Values assigned by the experiments and presented in February 2012
  - Significant differences understood and presented at the MB, November 2012
- The list ALSO contains services which did NOT have the GGUS escalation for alarms
  - Action: Maria Girone and Maite Barroso to propose a process for handling ALARMs for services which are currently not covered by GGUS ALARM tickets

CERN Specific Services (2012)

- Px → Computer Centre network
- WLCG network (LHCOPN, GPN)
- CERN Oracle online
- CERN Oracle Tier-0 (including streaming)
- Frontier front-end and Squid
- CASTOR tape
- CASTOR disk
- EOS
- Batch service
- CE
- LFC
- FTS
- VOM(R)S
- BDII
- Myproxy
- gLite WMS
- CVMFS Stratum0
- CVMFS Stratum1
- Dashboard
- SAM
- VOBOXes
- AFS
- CAF
- CVS/SVN
- Twiki
- Mail and Web services
- Hypernews
- Indico
- Savannah/JIRA/TRAC

Services for which GGUS Alarm escalation was already in place in 2012: NO FURTHER ACTION NEEDED!

Service          Urgency  Impact
SSO              7        10
DNS              7        10
NICE AD servers  6        10

Added by ALICE, Feb

IT Services needing GGUS Alarm Workflow

- Px → Computer Centre network
- WLCG network (LHCOPN, GPN)
- CVS/SVN
- Twiki
- Mail and web services
- Indico
- JIRA/TRAC
- SSO
- DNS
- NICE AD servers
- Dashboard
- SAM

IT/PES contacted the relevant service managers and discussed the ALARM workflow with them.
The workflow has now been modified to include ALL the relevant services provided by the Tier0.

WLCG Services needing GGUS Alarm Workflow

Proposal sent to and approved by the Computing Coordinators:

- Frontier front-end and Squid
  - Critical service for CMS and ATLAS workflows, but it relies on the DB and VObox services, both of which are alarmed already
  - The Frontier service with a functional DB can be restored with a VObox installation
  - More complicated failures require developer intervention and will not be handled by alarm
  - No new alarms needed
- CAF
  - Complex set of workflows for data validation and calibration, but it relies on EOS and LSF, both of which are alarmed already
  - No new alarms needed
- Savannah: no new alarm needed (alternatives exist to report problems)
- Hypernews: no new alarm needed (alternatives exist to communicate)
- e-groups: no new alarm needed (failure will be seen elsewhere in IT first)
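The reasoning for Frontier and CAF above (no new alarm because every underlying service is already alarmed) can be sketched as a simple set check. This is only an illustration: the service names are taken from the slide, while the dictionary layout and the function `needs_new_alarm` are hypothetical and not part of any GGUS or CERN tool.

```python
# Illustrative sketch only. Service names come from the slide; the data
# structures and the function name are assumptions made for this example.

# Services that the slide states are "alarmed already".
ALARMED = {"DB", "VOBOXes", "EOS", "LSF"}

# Dependencies as described on the slide.
DEPENDS_ON = {
    "Frontier front-end and Squid": {"DB", "VOBOXes"},
    "CAF": {"EOS", "LSF"},
}


def needs_new_alarm(service, depends_on=DEPENDS_ON, alarmed=ALARMED):
    """Return True if at least one dependency of `service` is not alarmed."""
    deps = depends_on[service]
    # Set comparison: deps <= alarmed is True when every dependency is covered.
    return not deps <= alarmed


if __name__ == "__main__":
    for name in DEPENDS_ON:
        print(name, "-> new alarm needed:", needs_new_alarm(name))
```

Savannah, Hypernews and e-groups are deliberately left out of the sketch, since the slide justifies them differently (alternatives exist, or the failure is seen elsewhere first).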

Conclusions

- The list of critical services maps the MoU "high level services" to "specific" services
  - Needs yearly updates and is maintained by the Operations Coordination Team at
- The flow for GGUS ALARMs has been modified to include ALL services provided by the Tier0
- The remaining services (CAF, Frontier front-end and Squid, Savannah, Hypernews and e-groups) have been re-discussed with the Computing Coordinators and agreement was reached: no new ALARMs needed
- Alarms are analyzed and discussed at the MB
  - No misuse. Should continue this way
  - Response from services (mostly on best effort) has always been timely and well handled

Backup slides
Maria Dimou - CERN / WLCG - TrackTools coordinator

GGUS ALARMs' notifications

An authorised ALARMer submits a GGUS ALARM ticket via the relevant dedicated web form. The "Notify Site" field is mandatory for ALARMs. The value is CERN-PROD for the Tier0:

1. As a result of this site selection, an email notification is sent from GGUS to recipients that include:
   - The CERN computer operators, who call the service/piquet 24/7.
   - Selected CERN service managers.
   - Computing experts selected by the experiment.
2. As a result of this site selection, the GGUS ticket is automatically assigned to the GGUS Support Unit (SU) ROC_CERN. An email notification is sent from GGUS to the Tier0 service managers.
3. As a result of this automatic assignment to SU ROC_CERN, a SNOW ticket is created automatically against Assignment Group: "CERN GRID 2nd Line Support 3rd Line Support". An email notification is sent from SNOW to the relevant experts of all critical services. Now at: "grid-cern-prod-ALARMS"
4. All SNOW updates are reflected in GGUS and vice versa.
5. All GGUS ALARMs are drilled in detail for the WLCG MB.

Documentation:
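The notification chain above can be summarised as a small routing function. This is a minimal sketch under stated assumptions: the class `AlarmTicket`, the function `route_alarm` and the log strings are hypothetical and do not reflect the real GGUS or SNOW interfaces. Only the site name CERN-PROD, the Support Unit ROC_CERN and the order of the steps come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class AlarmTicket:
    """Hypothetical stand-in for a GGUS ALARM ticket."""
    vo: str
    notify_site: str  # the slide says this field is mandatory for ALARMs
    log: list = field(default_factory=list)


def route_alarm(ticket):
    """Record the steps the slide describes for a Tier0 ALARM ticket."""
    if not ticket.notify_site:
        raise ValueError('"Notify Site" is mandatory for ALARM tickets')
    if ticket.notify_site == "CERN-PROD":
        # Step 1: GGUS notifies operators, service managers, experiment experts.
        ticket.log.append("GGUS: notify operators, service managers, experts")
        # Step 2: automatic assignment to the Support Unit ROC_CERN.
        ticket.log.append("GGUS: assign to SU ROC_CERN")
        # Step 3: SNOW ticket created and experts of critical services notified.
        ticket.log.append("SNOW: create ticket and notify experts")
    # Routing for other sites is not described on the slide, so nothing is
    # recorded for them here.
    return ticket.log


if __name__ == "__main__":
    for step in route_alarm(AlarmTicket(vo="ATLAS", notify_site="CERN-PROD")):
        print(step)
```

Steps 4 and 5 (SNOW/GGUS synchronisation and the drill for the WLCG MB) happen after routing and are left out of the sketch.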

CERN Functional Services (20 March 2012)

Operations related services
- High bandwidth connectivity from detector area to computer centre
- Recording and permanent storage in a MSS of raw and reconstructed data
- Disk storage of reconstructed data
- Distribution of raw and reconstructed data to Tier-1 sites in time with data acquisition
- Prompt reconstruction, calibration and alignment
- Storage and distribution of conditions data
- Data analysis facility
- Databases
- VO management services

Tools and support services
- Tools and services for application development (CVS, SVN, etc.)
- Desktop services (email, web, Twiki, Indico, Vidyo, etc.)

Definition of Services

- "Functional" service
  - A high level service corresponding to a particular function of the computing system
  - Example: data export from Tier-0 to Tier-1s
  - Defined in the WLCG MoU, Annex 3
    - Directly part of LHC computing operations
    - Also included tools, desktop services and services for application development
- "Specific" service
  - A service contributing to one or more functional services
  - Example: FTS
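The relation between the two kinds of service (one specific service contributes to one or more functional services) can be pictured as a mapping that is inverted to list the specific services behind each function. The sketch below uses only the FTS example given on the slide; the dictionary `CONTRIBUTES_TO` and the function `specific_services_by_function` are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# specific service -> functional services it contributes to.
# Only the example pair from the slide is filled in.
CONTRIBUTES_TO = {
    "FTS": ["Data export from Tier-0 to Tier-1s"],
}


def specific_services_by_function(contributes_to):
    """Invert the mapping: functional service -> list of specific services."""
    by_function = defaultdict(list)
    for specific, functions in contributes_to.items():
        for function in functions:
            by_function[function].append(specific)
    return dict(by_function)


if __name__ == "__main__":
    print(specific_services_by_function(CONTRIBUTES_TO))
```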