Project Status David Britton, 15/Dec/08.
2 Outline
Programmatic Review Outcome
CCRC08
LHC Schedule Changes
Service Resilience
CASTOR
Current Status
Project Management
Feedback from the last Oversight Committee
Forward Look
31/03/2014
3 Programmatic Review
The programmatic review recommended a 5% cut to GridPP. Although ALICE and LHCb were ultimately rescued, the cut was still imposed. However, there was a silver lining:
Bottom Line: GridPP3 reduced by £1.24m on top of the £1.20m removed from GridPP2 noted at the last OC.
4 Funding Cut
Savings of £1.24m achieved by:
–Planned and unplanned late starts to a number of GridPP3 posts.
–Reduction in Tier-1 hardware to reflect changes imposed by the programmatic review (LHCb and BaBar).
–Re-costing of hardware based on the 2007 procurement.
–A reduction in the budget line for the second tranche of Tier-2 hardware, consistent with the reduction in Tier-1 hardware.
–Reduction in travel and miscellaneous spending.
New plan presented to STFC in July 08; updated in GridPP-PMB-133-Resources.doc
5 CCRC08
The Common Computing Readiness Challenge took place in two phases, in February and May.
Largely successful for all experiments.
6 LHC Schedule
Current indications are:
- Machine cold in June.
- First beams in July.
- Collisions at some point later.
- Plans may change!
Consequences for GridPP:
- Capacity and services need to be ready in June.
- Meanwhile, many exercises (MC productions, Cosmics re-processings, Analysis challenges) to keep things busy and stress the system.
- Prudent to maintain procurement schedule for April 2009 (little downside to this and helps reduce risks).
- Opportunity to build on the service quality and resilience.
7 Service Resilience
Emphasis over the last year on making the Grid resilient.
–Much work on monitoring and alarms.
–24 x 7 service initiated.
–Extensive work on making the component services more resilient at many levels (see document).
Future work on Resilience
–Create project-manager overview to keep this active at the PMB level.
–Provision a back-up link for the OPN (significant cost).
–Link to the (evolving) experiment disaster planning (UCL meeting).
8 CASTOR
CASTOR proved unreliable in early 2007 but performed well with the upgrade to for CCRC08.
In time for first collisions, an upgrade from to was required in order to maintain a version supported by CERN. This coincided with a move to a resilient RAC Oracle system; the combination of upgrades led to instability in August and September.
System is now stabilising and the problems have led to improved communications and management processes.
–High load-testing identified as a critical missing step for new releases.
–Oracle problems raised to a higher level of awareness in wLCG.
–Storage Review at RAL in November.
Other Tier-1s have had similar or worse problems with mass storage: a difficult area where effort is underestimated.
9 Status: Resources 2008 (2007)
CPU [kSI2k]: Tier-1 4590 (1500); Tier-2 (8588)
Disk [TB]: Tier-1 2222 (750); Tier-2 1365 (743)
Tape [TB]: Tier-1 2195 (~800)
MOU commitments for 2008 met. Combined effort from all Institutions.
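The year-on-year growth implied by the resource figures above can be checked with a few lines of arithmetic. This is a minimal sketch, not part of the original slides: the dictionary layout and the helper name `growth_factor` are illustrative assumptions. Only rows where both the 2008 and the 2007 value appear are included, and the approximate tape figure (~800) is left out.

```python
# Resource figures quoted on the slide as (2008 value, 2007 value).
# The dictionary layout is an illustrative assumption, not from the slides.
resources = {
    ("Tier-1", "CPU [kSI2k]"): (4590, 1500),
    ("Tier-1", "Disk [TB]"): (2222, 750),
    ("Tier-2", "Disk [TB]"): (1365, 743),
}


def growth_factor(value_2008, value_2007):
    """Ratio of the 2008 capacity to the 2007 capacity."""
    return value_2008 / value_2007


# Round to two decimals for display.
factors = {
    key: round(growth_factor(*values), 2) for key, values in resources.items()
}

for (tier, resource), factor in factors.items():
    print(f"{tier} {resource}: x{factor}")
```

On these numbers, Tier-1 CPU and disk roughly tripled between 2007 and 2008, while Tier-2 disk somewhat less than doubled.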
10 Global Resource
Status in Oct 2007: 245 sites, 40,518 CPUs, 24,135 TB storage
Status in Dec 2008: 263 sites, 81,953 CPUs, xx,xxx TB storage
11 Current Performance
Figure panels: Tier-1; Tier-2s
Good and improving reliability at the Tier-1 and Tier-2s (but need to move to experiment-specific SAM tests).
MOU resources at Tier-1 and Tier-2s delivered in full.
Following CCRC08 successes, other exercises continue: e.g. CMS Cosmic Reprocessing at the end of November, which inadvertently ran (successfully) at 10x the I/O rate (Tier-1 LAN and CASTOR service) for 3.5 days!
Despite some problems, RAL is ~the best Tier-1 for LHCb globally. CMS needs are also ~met.
12 Current Performance
Disk failure rate ~1/working day or ~6% failure rate (twice our assumption).
ATLAS hit by two multiple disk failures within a RAID array, resulting in data loss.
CASTOR and the Oracle RAC upgrade caused considerable instability, and ATLAS lost 2 weeks of UK simulated production when the Tier-1 became unavailable to receive data.
Database loads are running several times higher than at CERN; this is partly a cost issue and partly caused by the higher than average number of transactions triggered by some ATLAS jobs.
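The two disk-failure figures quoted above (~1 failure per working day and a ~6% failure rate) can be related with simple arithmetic. The sketch below is illustrative only and rests on assumptions that are not in the slides: about 250 working days per year, and the 6% read as an annual rate. The function names are hypothetical.

```python
# Assumption (not in the slides): roughly 250 working days per year.
WORKING_DAYS_PER_YEAR = 250

# Figures quoted on the slide.
failures_per_working_day = 1.0  # "~1/working day"
observed_failure_rate = 0.06    # "~6% failure rate", assumed to be annual


def implied_disk_count(failures_per_day, rate,
                       working_days=WORKING_DAYS_PER_YEAR):
    """Disk population for which the two quoted figures are consistent."""
    failures_per_year = failures_per_day * working_days
    return failures_per_year / rate


def planning_rate(observed_rate, ratio_to_assumption=2.0):
    """Failure rate originally assumed, given 'twice our assumption'."""
    return observed_rate / ratio_to_assumption


disks = implied_disk_count(failures_per_working_day, observed_failure_rate)
print(f"Implied disk population: about {round(disks)}")
print(f"Planning assumption: {planning_rate(observed_failure_rate):.0%} per year")
```

Under these assumptions the quoted rates correspond to a population of roughly four thousand disks and an original planning figure of about 3% per year. Both are back-of-envelope consistency checks, not numbers stated in the presentation.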
13 Project Map
14 Project Plan
15 Feedback from last Oversight Committee
8.1 (Disaster recovery) – GridPP-PMB-135-Resilience.doc
8.2 (CASTOR) – GridPP-PMB-136-CASTOR.doc
8.3 (Documentation)
(Certificates)
(24x7 Cover) – Now fully operational.
8.6 (Experiment Support Posts) – Despite all the cuts we have managed to fund 1 FTE for each of ATLAS, CMS, and LHCb.
16 Forward Look
Move to the new building at RAL.
Concentrate on further improving service resilience and engage ATLAS, CMS, LHCb in developing coherent disaster management strategies.
Investigate (even more) rigorous certification of CASTOR releases.
Recognise global conclusion that mass data storage requires more effort than anticipated.
Preparations for GridPP3 took ~20 months: need to start considering now what happens after GridPP3.
17 Backup Slides
18 Job Success Rate
ATLAS data analysis site tests – Nov