GridPP22 – Service Resilience and Disaster Planning
David Britton, 1/Apr/09


2 Resilience and Disaster Planning (Plan A)

The Grid must be made resilient to failures and disasters across a wide range of scales, from simple disk failures up to major incidents such as the prolonged loss of a whole site. One of the intrinsic characteristics of the Grid is its use of inherently unreliable, distributed hardware within a fault-tolerant infrastructure. Service resilience is about making this fault tolerance a reality.
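To make "fault tolerance over unreliable components" concrete, here is a minimal client-side sketch, assuming a set of hypothetical replicated endpoints (the hostnames and retry policy are illustrative, not real GridPP services): a client that can fail over between duplicated instances sees a reliable service even when individual machines are not.

    import urllib.request

    # Hypothetical replicated endpoints for the same logical service;
    # the names are illustrative, not real GridPP hosts.
    REPLICAS = [
        "http://se1.example.ac.uk/ping",
        "http://se2.example.ac.uk/ping",
        "http://se3.example.ac.uk/ping",
    ]

    def call_with_failover(urls, timeout=5):
        """Try each replica in turn; tolerate individual failures."""
        errors = []
        for url in urls:
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    return resp.read()          # first healthy replica wins
            except OSError as exc:              # network/HTTP failure: try the next one
                errors.append((url, exc))
        raise RuntimeError(f"all replicas failed: {errors}")

    if __name__ == "__main__":
        print(call_with_failover(REPLICAS))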

3 Towards Plan-B

Fortifying the Service:
- Increasing the hardware's capacity to handle faults.
- Duplicating services or machines.
- Automatic restarts (a minimal sketch follows after this list).
- Fast intervention.
- In-depth investigation of the reason for failure.

Disaster Planning:
- Taking control early enough.
- (Pre-)establishing possible options.
- Understanding user priorities.
- Timely action.
- Effective communication.

See talks by Jeremy (today) and Andrew (tomorrow).
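The "automatic restarts" bullet can be illustrated with a small watchdog sketch, assuming a hypothetical systemd-managed service probed via a TCP health check (the unit name and port are invented for illustration; real sites would more often rely on existing tooling such as systemd Restart= directives or Nagios event handlers than a bespoke script):

    import socket
    import subprocess
    import time

    SERVICE = "example-gridftp.service"   # hypothetical unit name
    HOST, PORT = "localhost", 2811        # illustrative health-check endpoint
    CHECK_EVERY = 60                      # seconds between probes

    def healthy(host, port, timeout=5):
        """Treat the service as healthy if its port accepts a TCP connection."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def watchdog():
        while True:
            if not healthy(HOST, PORT):
                # Automatic restart; a real deployment would also alarm and log centrally.
                subprocess.run(["systemctl", "restart", SERVICE], check=False)
            time.sleep(CHECK_EVERY)

    if __name__ == "__main__":
        watchdog()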

4 Disasters: Not if, but when + where

[Figure: extract from the wLCG weekly operations report, Feb-09]

5 Disasters: Not if, but how big

[Figure: a typical campus incident]

6 Purpose of GridPP22

- To understand the experiment priorities and plans (insofar as they are defined) in the case of various disaster scenarios.
- To extract commonalities across our user-base, to inform our priorities and planning in such an event.
- To examine (and help crystallise) the current state of site (and experiment) resilience and disaster planning.
- To raise collective awareness and encourage collaboration and dissemination of best practice.

"An ounce of prevention is worth a pound of cure."

7 ...and talking of quotes

"When anyone asks me how I can best describe my experience in nearly forty years at sea, I merely say, uneventful. Of course there have been winter gales, and storms and fog and the like. But in all my experience, I have never been in any accident... of any sort worth speaking about. I have seen but one vessel in distress in all my years at sea. I never saw a wreck and never have been wrecked nor was I ever in any predicament that threatened to end in disaster of any sort."
– E. J. Smith, 1907, Captain, RMS Titanic

("Who ordered the ICE?" – E. J. Smith, 1912)

8 Status Update

(Swansea, Sep 08)

9 WLCG Growth

[Figures: September 2008 vs. March 2009]

10 A Magic Moment

11 Tier-1 Reliability

OPS VO: over the last 6 months (Sep-Feb), RAL reliability = 98%. The target reliability for the best 8 sites was 98% (from Jan), and RAL was in the top 5. But this was measured with the OPS VO...

ATLAS VO: over the last 6 months (Sep-Feb), RAL reliability = 90% against the same 98% target for the best 8 sites; RAL was 8th out of 11 sites. However, RAL was one of the best sites for both CMS and LHCb.
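As a rough sanity check on what these percentages mean in practice, a small illustrative calculation (my own arithmetic, not from the slide; it weights all hours equally) converts a reliability fraction over a six-month window into equivalent downtime:

    # Illustrative only: convert a reliability fraction over a period into downtime.
    HOURS_PER_DAY = 24
    PERIOD_DAYS = 182            # roughly six months (Sep-Feb)

    def downtime_hours(reliability, period_days=PERIOD_DAYS):
        """Hours of unavailability implied by a reliability fraction."""
        return (1.0 - reliability) * period_days * HOURS_PER_DAY

    print(f"98% reliable over 6 months ~ {downtime_hours(0.98):.0f} h down (~3.6 days)")
    print(f"90% reliable over 6 months ~ {downtime_hours(0.90):.0f} h down (~18 days)")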

12 UK CPU Contribution

[Figure]

13 UK Site Contributions

Shares by region over the two periods shown on the slide (1-Year and 6-Month):
- NorthGrid: 34 (22)%
- London: 28 (25)%
- ScotGrid: 18 (17)%
- Tier-1: 13 (15)%
- SouthGrid: 7 (16)%
- GridIreland: 6.1% (~)

14 CPU Efficiencies

[Figures: 1-Year and 6-Month periods]

15 Storage (doh!)

16 UK Tier-2 Storage

Integrals (08Q4):
- Pledged: 1500 TB
- Provided: 2700 TB
- Used: 420 TB
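For context, a quick illustrative calculation (mine, not from the slide) expresses these 08Q4 integrals as ratios:

    # Illustrative ratios derived from the 08Q4 integrals quoted above (TB).
    pledged, provided, used = 1500, 2700, 420

    print(f"Provided / pledged: {provided / pledged:.0%}")   # -> 180%
    print(f"Used / provided:    {used / provided:.0%}")      # -> ~16%
    print(f"Used / pledged:     {used / pledged:.0%}")       # -> 28%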

17 Data Transfers

18 STEP09 (i.e. CCRC09)

- Currently, it seems likely that this will be in June.
- There may be conflicts with the (much delayed) move to R89.
- It raises issues to do with upgrades such as CASTOR.

19 Current Issues: CASTOR

- The current version (CASTOR 2.1.7) appears to function at an acceptable level, though there are a number of serious bugs that we are learning to work around (notably the BigID and CrossTalk problems). These problems have also been observed elsewhere, which adds pressure for them to be addressed.
- The next CASTOR version is under test at CERN and shortly at RAL. The consensus is that we need to be very cautious in moving to this version, even though it may address some of the bugs and offer additional features (e.g. Xrootd functionality).
- Ultimately, this decision must be driven by the experiments (is a consensus possible?). We strongly prefer not to be the first non-CERN site to upgrade.
- There is a possible conflict with the STEP09 exercise (can we upgrade early enough not to risk participation? Does it make any sense to upgrade after?).
- Is there a bigger risk in not upgrading (degrading support for ...?)

20 Current Issues: R89

Hand-over delayed from 22nd Dec 2008 by a number of issues:
- Cleanliness (addressed)
- Inaudible fire alarms (addressed)
- Cooling system (outstanding)

Plan-A: R89 must be accepted by STFC by 1st May to allow a 2-week migration towards the end of June.
Plan-B (if there is a small delay): a 1-week migration of critical components only.
Plan-C (if there is a longer delay): remain completely in the ATLAS building.

We must balance establishing a stable service for LHC data against the advantages of moving to a better environment. Other factors are STEP09, the CASTOR upgrade, costs, and convenience.

21 Tier-1 Hardware

The FY2008 hardware procurement is currently waiting to be delivered, pending resolution of the R89 situation:
- CPU: ~2500 KSI2K to be added to the existing 4590 KSI2K.
- Disk: ~1500 TB to be added to the existing 2222 TB.
- Tape: up to 2000 TB can be added to the existing 2195 TB.

The FY09 hardware procurement will start as soon as the experiments have determined revised requirements based on the new LHC schedule (i.e. soon).
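For a rough sense of scale, the post-delivery totals implied by these figures (my own arithmetic; tape is an upper bound and all values are approximate) would be:

    # Rough post-delivery totals implied by the FY2008 procurement figures above.
    # Tape is an upper bound ("up to"); all numbers are approximate.
    cpu_total  = 4590 + 2500    # KSI2K -> ~7090
    disk_total = 2222 + 1500    # TB    -> ~3722
    tape_total = 2195 + 2000    # TB    -> up to ~4195

    print(f"CPU:  ~{cpu_total} KSI2K")
    print(f"Disk: ~{disk_total} TB")
    print(f"Tape: up to ~{tape_total} TB")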

22 Current Issues: EGI/NGI

What follows on from EGEE-III in April 2010? The current idea is an EGI body (European Grid Infrastructure) coordinating a set of national NGIs, together with a middleware consortium and a set of Specialist Service Centres (e.g. one for HEP). The EGI-DS is underway. Timescales and transition are problematic.

What is the UK NGI? Some evolution of the NGS with components of GridPP? Timescales and transition are problematic. Funding is complicated.

Initial step: a joint GridPP/NGS working group to try and identify common services. See talks by John and Robin on the last afternoon.

23 Summary

- This meeting is about making the most of the window of opportunity before LHC data, to ensure that our Grid services are resilient and our disaster planning is in place, not just at RAL but also at the Tier-2s.
- Meanwhile, the UK continues to perform well and to make incremental improvements in our delivery to the experiments and the wLCG.
- There are, and will continue to be, vexing issues and future uncertainties. We must all keep our eye on the ball.
- Remember: it's not if a disaster strikes, but when and where.

(LHC data: TBD!)