Partial downtimes management. Pierre Girard, WLCG T1 Service Coordination Meeting, 2011/11/03.


Content
  Use case
  Problem description
  Partial downtime options
  Conclusion

Use case
On October 25th, a scheduled downtime (SD) was put on the dCache SE for 2 hours.
  –ATLAS, CMS and LHCb were all impacted
  –ALICE was not impacted because it uses xrootd
Our VO-dedicated contacts informed the VOs
  –We assumed that this SD would be enough for the impacted VOs to ban CCIN2P3 early enough
  –A WARNING downtime was nevertheless put on our CEs
But during the SD we had to kill many jobs
  –A VO complained, see GGUS #75703
  –Apologies for that loss of CPU time.

Problem description
How should downtimes of services that are not used by all the VOs be handled?
How can a site make the possible collateral damage clear enough to the VOs?
Who is supposed to anticipate the collateral damage (by draining or stopping job submissions early enough)?

Resources / VOs   ALICE   ATLAS   CMS   LHCb
dCache SE                 √       √     √
Xrootd SE         √
CVMFS             √√ (two VOs; columns not recoverable from the transcript)
LFC-LHCb-RO                             √
LFC-Atlas                 √
Oracle cluster    √√ (two VOs; columns not recoverable from the transcript)
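The resource/VO matrix on this slide can be read as a lookup from a service to the VOs that depend on it. The following is a minimal Python sketch of that idea, not a description of any tool used at the site. The names `SERVICE_USERS`, `impacted_vos` and `unaffected_vos` are hypothetical, and only the rows whose VO assignment follows from the slides (dCache SE, Xrootd SE, and the two LFC instances) are included. The CVMFS and Oracle cluster rows are left out because their columns are not recoverable.

```python
# Hypothetical service-to-VO map built from the unambiguous rows of the slide.
# CVMFS and the Oracle cluster are omitted: their VO columns are unknown here.
SERVICE_USERS = {
    "dCache SE": {"ATLAS", "CMS", "LHCb"},
    "Xrootd SE": {"ALICE"},
    "LFC-LHCb-RO": {"LHCb"},
    "LFC-Atlas": {"ATLAS"},
}
ALL_VOS = {"ALICE", "ATLAS", "CMS", "LHCb"}


def impacted_vos(services_down):
    """Return the set of VOs that use at least one of the services in downtime."""
    impacted = set()
    for service in services_down:
        # Indexing raises KeyError for an unknown service, so a typo is not
        # silently treated as "nobody is impacted".
        impacted |= SERVICE_USERS[service]
    return impacted


def unaffected_vos(services_down):
    """Return the VOs that should not be penalised by the downtime."""
    return ALL_VOS - impacted_vos(services_down)


if __name__ == "__main__":
    print(sorted(impacted_vos(["dCache SE"])))
    print(sorted(unaffected_vos(["dCache SE"])))
```

For the October 25th use case, this reproduces the situation described earlier: a dCache SE downtime impacts ATLAS, CMS and LHCb, and leaves ALICE unaffected.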

Partial downtime options
1. By also closing the CEs (Site)
  –This is what we did some time ago
  –Unfair to the VOs that are not impacted
2. By draining job submissions of the impacted VOs (Site)
  –Do we have to block submission, or hold new jobs in the queue?
  –How long before? Ex.: by taking the duration of the longest job from the last x days
  –How to announce it to the impacted VOs and remind them? Ex.: WARNING downtime on the CEs from the draining start time.
3. By leaving it to the VOs to ban the site (VO)
  –Does any VO correctly manage the dependency between its jobs and the services?
  –Is any VO able to conclude what to do from a service downtime?
  –Does the VO need additional information?
4. By requesting from EGI/WLCG some improvements to the downtime declaration
  –By making it possible to declare a downtime per VO. Ex.: CE in downtime for ATLAS, CMS and LHCb but not for ALICE
  –But probably not a short-term solution.
  –Option to address to the WLCG TEG “Operations” (?)
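The "how long before" question in option 2 comes down to simple date arithmetic: start holding jobs early enough that the longest job seen in the last x days would still finish before the SD begins. Below is a minimal Python sketch of that rule. The function name `drain_start`, the sample durations and the 10:00 start time are hypothetical; the slides give only the date of the SD, not its time of day.

```python
from datetime import datetime, timedelta


def drain_start(sd_start, recent_job_durations):
    """Return when to start holding jobs: SD start minus the longest recent job.

    recent_job_durations is the list of wall-clock durations (timedelta) of
    the jobs observed over the last x days for the impacted VOs.
    """
    if not recent_job_durations:
        raise ValueError("need at least one recent job duration")
    return sd_start - max(recent_job_durations)


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    sd_start = datetime(2011, 10, 25, 10, 0)
    durations = [timedelta(hours=6), timedelta(hours=30), timedelta(hours=12)]
    print(drain_start(sd_start, durations))
```

With these sample values the longest job ran for 30 hours, so holding would start at 04:00 on October 24th, which is also the moment a WARNING downtime on the CEs would begin under option 2.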

Conclusion
Each VO uses a site in its own way
  –A site is not a common infrastructure for all the VOs
A service hosted by a site may not be shared by all the VOs
  –A service outage may not impact all the VOs
Unfortunately, a site cannot declare a downtime per VO
To avoid misunderstandings between VOs and sites
  –We must find a common way to deal with partial downtimes
  –Suggestions are welcome
Until there is a clear statement, we’ll adopt option 2
  –By holding jobs in the queue x hours before the SD
  –By putting a WARNING downtime on the CEs from the draining start time