Tier-1 Overview. Andrew Sansum, 21 November 2007.


1 Tier-1 Overview Andrew Sansum 21 November 2007

2 Overview of Presentations
Morning Presentations
–Overview (Me)
  Not really an overview; at request of Tony, mainly MoU commitments
–CASTOR (Bonny)
  Storing the data and getting it to tape
–Grid Infrastructure (Derek Ross)
  Grid Services
  dCache future
  Grid Only Access
–Fabric Talk (Martin Bly)
  Procurements
  Hardware infrastructure (inc Local Network)
  Operation
Afternoon Presentations
–Neil (RAL benefits)
–Site Networking (Robin Tasker)
–Machine Rooms (Graham Robinson)

3 What I’ll Cover
Mainly going to cover MoU commitments
–Response Times
–Reliability
–On-Call
–Disaster planning
Also cover staffing

4 GRIDPP2 Team Organisation
–Grid Services (Grid/exp Support): Ross, Condurache, Hodges, Klein (PPS), Vacancy
–Fabric (H/W and OS): Bly, Wheeler, Vacancy, Thorne, White (OS support), Adams (HW support)
–CASTOR SW/Robot: Corney (GL), Strong (Service Manager), Folkes (HW Manager), deWitt, Jensen, Kruk, Ketley, Jackson (CASE), Prosser (Contractor) (Nominally 5.5 FTE)
–Machine Room operations (1.5 FTE)
–Networking Support (0.5 FTE)
–Database Support (0.5 FTE) (Brown)
–Project Management (Sansum/Gordon/(Kelsey)) (1.5 FTE)

5 Staff Evolution to GRIDPP3
Level
–GRIDPP2 (13.5 GRIDPP + 3.0 e-Science)
–GRIDPP3 (17.0 GRIDPP + 3.4 e-Science)
Main changes
–Hardware repair effort 1->2 FTE
–New incident response team (2 FTE)
–Extra CASTOR effort (0.5 FTE), but this is effort that has already been working on CASTOR unreported.
–Small changes elsewhere
Main problem
–We have injected 2 FTE of effort temporarily into CASTOR. The long-term GRIDPP3 plan funds less effort than current experience suggests we need.

6 WLCG/GRIDPP MoU Expectations
For each service, the first set of values is the maximum delay in responding to operational problems (service interruption / degradation of the capacity of the service by more than 50% / degradation by more than 20%). The second set is the average availability measured on an annual basis (during accelerator operation / at all other times).
–Acceptance of data from the Tier-0: 12 hours / 12 hours / 24 hours; 99% / n/a
–Networking service to the Tier-0 during accelerator operation: 12 hours / 24 hours / 48 hours; 98% / n/a
–Data-intensive analysis services, including networking to Tier-0, Tier-1 centres: 24 hours / 48 hours / 48 hours; n/a / 98%
–All other services, prime service hours [1]: 2 hours / 2 hours / 4 hours; 98% / 98%
–All other services, other times: 24 hours / 48 hours / 48 hours; 97% / 97%
[1] Prime service hours are 08:00-18:00 during the working week of the centre, except public holidays.

7 Response Time
Time to acknowledge fault ticket
–12-48 hour response time outside prime shift. On-call system should easily cover this, provided it is possible to automatically classify problem tickets by level of service required.
–Cover during prime shift is more challenging (2-4 hours) but is already a routine task for Admin on Duty
–To hit availability target must be much faster (2 hours or less)
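As an illustration of classifying tickets by level of service required, here is a minimal Python sketch that works out when a ticket must be acknowledged. It uses only the "All other services" delays quoted on the slides (2-4 hours in prime shift, 24-48 hours at other times) and the prime service hours of 08:00-18:00 on working days. The severity labels, the table name `MAX_DELAY_HOURS`, and the function names are hypothetical, and public holidays are not handled.

```python
from datetime import datetime, timedelta

# Maximum acknowledgement delay in hours, keyed by (in prime hours?, severity).
# Values are the "All other services" figures from the slides; the severity
# labels are hypothetical names chosen for this sketch.
MAX_DELAY_HOURS = {
    (True, "interruption"): 2,
    (True, "degraded"): 4,
    (False, "interruption"): 24,
    (False, "degraded"): 48,
}


def in_prime_hours(when: datetime) -> bool:
    """Prime service hours: 08:00-18:00, Monday to Friday.

    Public holidays are ignored in this sketch.
    """
    return when.weekday() < 5 and 8 <= when.hour < 18


def acknowledge_by(opened: datetime, severity: str) -> datetime:
    """Return the latest time by which the ticket must be acknowledged."""
    delay = MAX_DELAY_HOURS[(in_prime_hours(opened), severity)]
    return opened + timedelta(hours=delay)


if __name__ == "__main__":
    # A service interruption reported on a Wednesday morning.
    print(acknowledge_by(datetime(2007, 11, 21, 10, 0), "interruption"))
    # A degradation reported in the early hours of a Saturday.
    print(acknowledge_by(datetime(2007, 11, 24, 3, 0), "degraded"))
```

A real classifier would also need to map each ticket to the service row of the MoU table; that mapping is site-specific and is left out here.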

8 Reliability
Have made good progress in last 12 months
–Prioritised issues affecting SAM test failures.
–Introduced “issue tracking” and weekly reviews of outstanding issues.
–Introduced resilience into trouble spots (but more still to do)
–Moved services to appropriate capacity hardware, separated services, etc.
–Introduced new team role: “Admin on Duty”. Monitoring farm operation, ticket progression, EGEE broadcast info.
Best Tier-1 averaged over last 3 months (other than CERN).

9 RAL-LCG2 Availability

10 MoU Commitments (Availability)
Really reliability (availability while scheduled up)
Still tough: 97-99% service availability will be hard (1% is just 87 hours per year).
–OPN reliability predicted to be 98% without resilience; site SJ5 connection is much better (Robin will discuss).
–Most faults (75%) will fall outside normal working hours
–Software components still changing (e.g. CASTOR upgrades, WMS, etc.)
–Many faults in 2008 will be “new”, only emerging as WLCG ramps up to full load.
–Emergent faults can take a long time (days) to diagnose and fix
To improve on current availability will need to:
–Improve automation
–Speed up manual recovery process
–Improve monitoring further
–Provide on-call
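The availability arithmetic on this slide can be checked with a short Python sketch. It converts an availability target into an annual downtime budget, and estimates the share of the week that lies outside prime service hours (08:00-18:00 on five working days). The function names are hypothetical, and the year is taken as 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years
HOURS_PER_WEEK = 7 * 24    # 168 hours


def downtime_budget_hours(availability_percent: float) -> float:
    """Hours of downtime per year permitted by an availability target."""
    return HOURS_PER_YEAR * (100.0 - availability_percent) / 100.0


def out_of_hours_fraction(prime_hours_per_day: int = 10,
                          working_days_per_week: int = 5) -> float:
    """Fraction of the week that falls outside prime service hours."""
    prime = prime_hours_per_day * working_days_per_week
    return 1.0 - prime / HOURS_PER_WEEK


if __name__ == "__main__":
    for target in (97, 98, 99):
        print(f"{target}% availability allows "
              f"{downtime_budget_hours(target):.1f} hours of downtime per year")
    print(f"Share of the week outside prime hours: {out_of_hours_fraction():.0%}")
```

A 99% target leaves 87.6 hours per year, matching the figure of about 87 hours on the slide. About 70% of the week is outside prime hours before public holidays are counted, which is in the same range as the 75% of faults expected out of hours if faults arrive evenly over time (an assumption of this sketch).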

11 On-Call
–On-Call will be essential in order to meet response and availability targets.
–On-Call project now running (Matt Hodges); target is to have on-call operational by March 2008.
–Automation/recovery/monitoring are all important parts of on-call system. Avoid callouts by avoiding problems.
–May be possible to have some weekend on-call cover before March for some components.
–On-call will continue to evolve after March as we learn from experience.

12 Disaster Planning (I)
Extreme end of availability problem. Risk analysis exists, but aging and not fully developed.
Highest Impact risks:
–Extended environment problem in machine room
  Fire
  Flood
  Power failure
  Cooling failure
–Extended network failure
–Major data loss through loss of CASTOR metadata
–Major security incident (site or Tier-1)

13 Disaster Planning (II)
Some disaster plan components exist
–Disaster plan for machine room. Assuming equipment is undamaged, relocate and endeavour to sustain functions but at much reduced capacity.
–Datastore (ADS) disaster recovery plan developed and tested
–Network plan exists
–Individual Tier-1 systems have documented recovery processes and fire-safe backups, or can be instanced from kickstart server. Not all of these are simple, nor are all fully tested.
Key Missing Components
–National/Global services (RGMA/FTS/BDII/LFC/…). Address by distributing elsewhere. Probably feasible and is necessary: 6 months.
–CASTOR: all our data holdings depend on integrity of catalogue. Recovery from first principles not tested. Is flagged as a priority area, but must be balanced against need to make CASTOR work.
–Second, independent Tier-1 build infrastructure to allow us to rebuild Tier-1 at new physical location. Would allow us to address major issues such as fire. Major project: priority?

14 Conclusions
Made a lot of progress in many areas this year. Availability improving, hardware reliable, CASTOR working quite well and upgrades on track.
Main challenges for 2008 (data taking)
–Large hardware installations and almost immediate next procurement
–CASTOR at full load
–On-call and general MoU processes

