Storage Review, David Britton, 21/Nov/08
2 31/03/2014 One Year Ago
[Time line, Oct-07 to Apr-09: OC; Data?]
Oversight Committee – Oct
Data was expected in early summer.
CASTOR was broken (2.1.2 and 2.1.3) and a serious concern.
Alternatives to CASTOR (dCache and HPSS/Enstore) had been considered and rejected.
3 OC Feedback
[Time line, Oct-07 to Apr-09: OC; Data?]
NOTES FROM THE OCTOBER 2007 OC on CASTOR:
The main concern was progress towards fixing CASTOR at the Tier-1. It was understood that various actions were ongoing, but that it was necessary to manage this (and the associated expectations on each side). We were asked to make all deadlines as clear as possible to all those involved in the project (since delays in this area inevitably have a large impact across the project). We need to agree, where necessary, sets of milestones and deadlines from CERN, the Tier-1, ATLAS, CMS and LHCb for end-December, February (prior to CCRC-1) and May (prior to CCRC-2) in anticipation of the next OC meeting in mid-May.
4 Tier-1 Review
[Time line, Oct-07 to Apr-09: OC; Tier-1 Review; Data?]
NOTES ON CASTOR FROM THE NOVEMBER 2007 Tier-1 Review:
Concerns:
"2.1 CASTOR: The effort required over the next 12 months on CASTOR may be larger than planned."
This was about 5 FTE (half funded by GridPP) compared to a plan of 1.5 FTE.
Recommendations:
3.1 The CASTOR level of effort is appropriate for steady-state operation, but given the current status, it needs to be monitored. Based on current input, we do not believe that a long-term redistribution of manpower in this area would lead to an optimum overall plan. In the short term, it is recognised that dedicated effort is required for testing. This should be regarded as transitionary. (Point-2.1)
5 [Time line, Oct-07 to Apr-09: OC; Tier-1 Review; CCRC08; CASTOR S.I.R.s; Data?]
6 – The Present
[Time line, Oct-07 to Apr-09: OC; Tier-1 Review; CCRC08; CASTOR S.I.R.s; Storage Review; OC; Data?; then "?????"]
7 Where do we go from here?
At the review last year the feedback noted:
"We were also pleased to see signs of improvement w.r.t. CASTOR, following dedicated efforts from several individuals from a potentially disastrous situation."
A year later, it is clear that:
The CASTOR and Database teams have put in an enormous amount of work and achieved many successes. They have significantly improved the infrastructure, monitoring, and management processes.
BUT … we have not yet established a stable, reliable, load-tolerant mass storage service that is adequate for data.
At this point we need to take a step back and look at the big picture to ensure that over the next 6 months we can address this.
8 (Sample) Questions
– Can we benefit by making our CASTOR setup mimic CERN's more closely?
  Cost issues?
  Knowledge issues?
  Manpower issues?
  Other non-CERN CASTOR sites?
9 (Sample) Questions
– Is the main problem actually the database, and is the RAC set-up a large part of most problems?
  Licences and hardware costs?
  Oracle expertise?
  CERN / Oracle support?
  Other non-CERN CASTOR sites?
10 (Sample) Questions
– What effort is needed on CASTOR/databases over the next 6 months and the next 2 years, and can we provide it?
  Backdrop: 2 FTE funded by GridPP in this area.
  11 FTE total effort reported by Tier-1 against 17 FTE funded.
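The backdrop figures on this slide can be turned into a simple gap calculation. The following is a minimal sketch, assuming the 11 FTE reported and the 17 FTE funded refer to the same period and scope; the function and variable names are illustrative and do not come from the slides.

```python
# Figures quoted on the slide (FTE = full-time equivalent).
FUNDED_FTE = 17    # Tier-1 effort funded
REPORTED_FTE = 11  # Tier-1 effort reported


def effort_gap(funded, reported):
    """Return (shortfall in FTE, fraction of funded effort reported).

    Assumes both numbers cover the same period and scope.
    """
    shortfall = funded - reported
    fraction_reported = reported / funded
    return shortfall, fraction_reported


if __name__ == "__main__":
    shortfall, fraction = effort_gap(FUNDED_FTE, REPORTED_FTE)
    print(f"Shortfall: {shortfall} FTE")
    print(f"Reported effort as a share of funded effort: {fraction:.1%}")
```

Under these assumptions the reported effort falls 6 FTE short of the funded level, which is roughly two thirds of the funded effort being reported.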
11 (Sample) Questions
– Have we optimised the management, operation and internal and external interfaces of the Database and CASTOR teams?
  Do we have the right skill mixture?
  Is there enough agility?
  How do we interface to CERN? To the Experiments?
12 (Sample) Questions
– Is our hardware resilient (enough) and is our architecture optimal?
  Disk failures (correlations; replacement process)?
  Load levels?
  RAC?
13 (Sample) Questions
– How do we approach future CASTOR upgrades?
  Is our test-bed sufficient (RAC?)?
  Can we/do we generate representative loads?
  Do we have enough (the right sort of) manpower?
  How do we make the decision to deploy?
14 (Sample) Questions
– Does or will the changing (relative) costs of disk and tape (infrastructure) change the usage model?
  5 FTE = £350k/p.a.
  Tape infrastructure FY08 £694k (+ £75k media).
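The cost figures on this slide can be put on a common footing with a little arithmetic. The following is a minimal sketch, assuming the £350k covers exactly the 5 FTE quoted and that the media cost is simply added to the infrastructure cost; the names used are illustrative and do not come from the slides.

```python
# Figures quoted on the slide, in thousands of pounds (k GBP).
STAFF_FTE = 5
STAFF_COST_K = 350   # 5 FTE = 350k per annum
TAPE_INFRA_K = 694   # tape infrastructure, FY08
TAPE_MEDIA_K = 75    # tape media, FY08


def cost_summary():
    """Derive per-FTE cost and express the tape spend in FTE-years.

    Assumes staff cost scales linearly with FTE and that media and
    infrastructure costs can be summed directly.
    """
    cost_per_fte_k = STAFF_COST_K / STAFF_FTE
    tape_total_k = TAPE_INFRA_K + TAPE_MEDIA_K
    tape_in_fte_years = tape_total_k / cost_per_fte_k
    return cost_per_fte_k, tape_total_k, tape_in_fte_years


if __name__ == "__main__":
    per_fte, tape_total, tape_fte = cost_summary()
    print(f"Cost per FTE: {per_fte:.0f}k per annum")
    print(f"Tape infrastructure plus media, FY08: {tape_total}k")
    print(f"Tape spend expressed as FTE-years: {tape_fte:.1f}")
```

On these assumptions one FTE costs £70k per annum and the FY08 tape spend of £769k is equivalent to roughly 11 FTE-years, which gives a rough scale for comparing staff effort against tape infrastructure.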
15 (Sample) Questions
– Is there light at the end of the CASTOR tunnel on the timescale of data?
  Fundamentally, are we in a different position this year?
  What are the key indicators that show this?
  How do we monitor/measure/present this?
16 (Sample) Questions
– Are there alternatives to CASTOR that we should start to look at more seriously?
  Options (from AS):
  Keep running CASTOR;
  Switch to dCache, either with DMF or some other HSM;
  Switch to dCache with Enstore;
  Write our own tapestore interface for dCache;
  Buy a commercial HSM and rewrite either DPM or the CASTOR SRM to interface to it, or write our own SRM interface;
  Run BeStMan or JASMINE;
  Stop providing tape storage and switch to a disk-only Tier-1.
17 (Sample) Questions
– Do we (deployers and users) still believe CASTOR is the right mid- and long-term solution?
  Are the experiment mid/long-term plans evolving?
  Archival storage on spin-on-demand disks or other technologies?
  Is CASTOR appropriate for disk (only) storage (at any level)?
  Can we/should we reduce our exposure to/dependence on CASTOR?
18 (Sample) Questions
– Can we benefit by making our CASTOR setup mimic CERN's more closely?
– Is the main problem actually the database, and is the RAC set-up a large part of most problems?
– What effort is needed on CASTOR/databases over the next 6 months and the next 2 years, and can we provide it?
– Have we optimised the management, operation and internal and external interfaces of the Database and CASTOR teams?
– Is our hardware resilient (enough) and is our architecture optimal?
– How do we approach future CASTOR upgrades?
– Does or will the changing (relative) costs of disk and tape (infrastructure) change the usage model?
– Is there light at the end of the CASTOR tunnel on the timescale of data?
– Are there alternatives to CASTOR that we should start to look at more seriously?
– Do we (deployers and users) still believe CASTOR is the right mid- and long-term solution?
– Are ATLAS's problems due to lack of embedded ATLAS effort at RAL and/or their file sizes?
– We've seen lots of load-related problems – need for a CCRC09?
– Database load – can we reduce it for a modest cost?
– 0.5% data loss: What would (spin-on-demand) disk give us?
– CNAF model of CASTOR only for tape (and StoRM for disk)?
– Is there a training issue for DB experts on CASTOR architecture/operation?
– Oracle/RAC architecture optimisation and what are reasonable/expected loads (by VO)?