24 x 7 support in Amsterdam
Jeff Templon, NIKHEF
GDB, BNL, 5 September 2006


Main Principle: avoid needing it
- Basic infrastructure
  - Power: on-site emergency generators
  - Network: SURFnet console staffed 24 x 7
  - Guard informs all relevant people in case of 'calamities'
  - Real people watch all services (and the support/ticket systems) closely during working hours
- Critical servers: redundant failover (a probe sketch follows this list)
  - DNS server for the farm networks
  - Databases (FTS, LFC, 3D, etc.)
  - PNFS server for dCache
  - the dCache server itself
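
The slides do not say how failover is detected or triggered; as a minimal sketch of the detection side, a watchdog could probe the primaries' service ports and alert when one stops answering, so the redundant standby can take over. Host names and ports below are hypothetical placeholders, not the actual NIKHEF/SARA configuration.

```python
# Minimal failover watchdog sketch. Hosts and ports are hypothetical
# placeholders; the real detection/failover mechanism is not described
# in the slides.
import socket
import sys

# (host, port) pairs for the critical servers named on the slide.
SERVICES = {
    "farm DNS":    ("dns1.example.nl", 53),
    "FTS/LFC DB":  ("db1.example.nl", 3306),
    "dCache PNFS": ("pnfs1.example.nl", 22125),
}

def is_up(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

down = [name for name, (host, port) in SERVICES.items()
        if not is_up(host, port)]
if down:
    # In production this would page the on-duty person (e.g. via the
    # SMS service mentioned later) or trigger promotion of the standby.
    print("DOWN:", ", ".join(down))
    sys.exit(1)
print("all primaries answering")
```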

More avoidance
- Computing services: NIKHEF and SARA share the computing load, so a complete interruption of service means either the network into Amsterdam is down, or something beyond our control has happened
- Tape robot: dimension the incoming disk cache to hold several days of data, so we can survive a weekend without tape if need be (a worked sizing example follows)
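
The slide gives no rates or volumes; as a hedged back-of-envelope example only (the 150 MB/s ingest rate is an assumed placeholder, not the NL-T1 MoU figure), the cache size is simply the sustained ingest rate multiplied by the desired autonomy:

```python
# Disk-cache sizing sketch: the buffer must absorb the full ingest
# stream for as long as the tape back-end is unavailable.
RATE_MB_S = 150        # assumed sustained ingest rate in MB/s (placeholder)
AUTONOMY_DAYS = 3      # "several days", e.g. a long weekend
SECONDS_PER_DAY = 86_400

buffer_tb = RATE_MB_S * SECONDS_PER_DAY * AUTONOMY_DAYS / 1e6  # MB -> TB
print(f"{buffer_tb:.1f} TB of disk cache needed")  # -> 38.9 TB
```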

Monitoring
- Create a dashboard (pieces already exist)
- Pool of people who agree to watch things and alert the relevant person in case of problems; checks at least once every 12 hours
- Look into a system à la IN2P3 that gives this team restart privileges via a special account and scripts (a sketch follows this list)
- SMS alerting is already in place for some critical components
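
The IN2P3-style restart mechanism is only named, not described; one common pattern is a wrapper script, run through a shared operator account (for instance via sudo), that restarts only a fixed whitelist of services and logs every attempt. The sketch below is a hypothetical illustration: service names and init-script paths are assumptions, and a modern host would use systemctl instead of /etc/init.d.

```python
#!/usr/bin/env python3
# Hypothetical restricted-restart wrapper for the on-duty pool:
# refuses anything outside a fixed whitelist and logs every attempt.
import subprocess
import sys
import syslog

ALLOWED = {"dcache", "pnfs", "named", "fts"}  # assumed service names

def main():
    if len(sys.argv) != 2 or sys.argv[1] not in ALLOWED:
        print(f"usage: {sys.argv[0]} <{'|'.join(sorted(ALLOWED))}>")
        return 2
    service = sys.argv[1]
    syslog.syslog(f"operator-pool restart of {service} requested")
    # 2006-era SysV init scripts; use systemctl on a modern system.
    return subprocess.call(["/etc/init.d/" + service, "restart"])

if __name__ == "__main__":
    sys.exit(main())
```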

Plan
- Put this into place in early 2007
- No formal 24 x 7 or on-call system
- See how it goes
  - If we reach targets and don't miss response deadlines: OK
  - If we miss targets and deadlines: start the hard discussions
  - Note that formal 24 x 7 would depend on other pieces (like the NIKHEF mail server) which themselves don't have 24 x 7 support!

Open Questions
- What about dynamic redistribution from the source in case of problems? Naively this increases each remaining site's load by 1/(N(N-1)) of the total (see the derivation below)
- How big is CERN's data buffer?
- What to do with externally identified problems? GGUS will not get our on-call support number
- Cost choices: what is the cheapest road? We expect that paying staff for 24 x 7 coverage is not it; the Grid is about distribution and redundancy, and we should exploit that
- Are we making the best choices? (push vs. pull?)
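
For the first question, the 1/(N(N-1)) figure follows from even redistribution among the survivors; a short derivation, assuming N Tier-1 sites each nominally taking an equal 1/N share of the export stream:

```latex
% Each of N sites carries a fraction 1/N of the load. If one site
% fails and its share is spread evenly over the remaining N-1 sites,
% each survivor picks up
\[
  \Delta L = \frac{1}{N}\cdot\frac{1}{N-1} = \frac{1}{N(N-1)},
  \qquad
  L_{\text{new}} = \frac{1}{N} + \frac{1}{N(N-1)} = \frac{1}{N-1}.
\]
% Example: with N = 11 Tier-1s, each survivor's share rises from 1/11
% to 1/10 of the stream, an increase of 1/110 (about 0.9%) of the total.
```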