GridPP Status David Britton, 3/Sep/08.
2 31/03/2014 Switching on the LHC The LHC was fully cold by mid-August. This is being followed by continued powering tests, consolidation and machine checkout in preparation for beam. Two short injection tests took place on 9/10 August and 23/24 August. During the 2nd of these tests, low-intensity pilot bunches were injected at point 8, through LHCb and sector 78, to the collimators at point 7. It all went very well! Remarkable performance from a huge number of systems. The third test is planned for 5th - 8th September, during which beam will be taken from point 8 to the beam dump at point 6 and, at the same time, a dry run of the totality of the LHC will be performed in preparation for the start of beam commissioning proper on the 10th Sep.
3 Switching on the Experiments
4 Switching on the Grid CCRC08 in March and May
5 CCRC08 Conclusions (Jamie Shiers)
6 WLCG Growth March 2008 / September 2008
7 GridMap Most sites appear to be "ready" for September, with storage tokens in place and reasonable SAM test availability. However, GridMap continues to show many sites as degraded: RAL-LCG2 (Tier-1) due to CASTOR; QMUL has a clock problem (and is also bringing a StoRM instance online); ECDF unknown; Brunel's cluster is undergoing maintenance; IC-HEP failed over the weekend, possibly due to a network outage (manpower shortage); Bristol (CMS filled its storage); and BHAM (network-related problems). Other sites with recent problems include RHUL, where a disk pool failed; Oxford, after a torque server directory filled up; and Lancaster, after certificate problems. Glasgow saw problems with its CE overnight, which required a globus-gatekeeper restart.
12 Availability and Reliability Last 6 months: RAL reliability = 97%. Target reliability for the best 8 sites was raised from 95% to 97% in June. July-08: RAL reliability = 99% (target 97%). July-08: RAL availability = 98% (target 97%).
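The slide quotes availability and reliability separately without defining them. A minimal sketch of how the two figures can differ, assuming the commonly used WLCG-style definitions (availability counts all time, while reliability excludes scheduled downtime from the denominator). The function names and the example hours are hypothetical and are not RAL's actual figures; only the 97% target comes from the slide.

```python
TARGET = 0.97  # target quoted on the slide


def availability(up_hours, total_hours):
    """Fraction of the whole period during which the site was usable."""
    return up_hours / total_hours


def reliability(up_hours, total_hours, scheduled_down_hours):
    """As availability, but scheduled downtime is not counted against the site.

    Assumed definition: up time / (total time - scheduled downtime).
    """
    return up_hours / (total_hours - scheduled_down_hours)


def meets_target(value, target=TARGET):
    """True if a measured fraction reaches the target."""
    return value >= target


# Hypothetical inputs chosen only for illustration (not RAL's real numbers).
example = {"up_hours": 700, "total_hours": 720, "scheduled_down_hours": 10}
a = availability(example["up_hours"], example["total_hours"])
r = reliability(**example)

if __name__ == "__main__":
    print(f"availability = {a:.1%}, meets target: {meets_target(a)}")
    print(f"reliability  = {r:.1%}, meets target: {meets_target(r)}")
```

Under these assumed definitions, reliability can never be lower than availability for the same period, which is consistent with the July figures quoted above.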
13 UK Reliability (Steve's Tests)
14 Outcome of Funding Crisis As anticipated at the time of GridPP20, the programmatic review recommended a 5% cut to GridPP. Although ALICE and LHCb were ultimately rescued, the cut was still imposed. However, there was a silver lining:
15 Funding Cut The bottom line was that GridPP had to return £1.2m, which was achieved by: –Planned and unplanned late starts to a number of GridPP3 posts. –Reduction in Tier-1 hardware to reflect changes imposed by the programmatic review (LHCb and BaBar). –Recosting of hardware based on the 2007 procurement. –A reduction in the budget line for the second tranche of Tier-2 hardware, consistent with the reduction in Tier-1 hardware. –Reduction in travel and miscellaneous spending.
16 Tier-1 Hardware The FY2007 hardware procurement was brought into production at the end of April: –182 disk servers added 1439 TB of disk to the existing 922 TB. –113 CPU systems added 3000 KSI2K to the existing 1450 KSI2K. –3030 tapes (0.45 TB each) were added to the existing ~4000. –12 T10K tape drives were added to the existing 3 (T10K) + 6 (9940). This was a major increase in hardware in preparation for data. The FY2008 hardware procurement is now well underway (£2m).
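The scale of the increase is easier to see when the quoted figures are totalled. A minimal sketch using only the numbers on the slide; the variable and function names are illustrative and do not come from any GridPP tool.

```python
# Figures quoted on the Tier-1 Hardware slide (FY2007 procurement).
# All names here are illustrative, not from any GridPP software.
DISK_TB = {"existing": 922, "added": 1439}
CPU_KSI2K = {"existing": 1450, "added": 3000}
TAPE_DRIVES = {"existing": 3 + 6, "added": 12}  # 3 + 6 existing drives
NEW_TAPES = 3030
TB_PER_TAPE = 0.45


def total(resource):
    """Capacity after the procurement: existing plus added."""
    return resource["existing"] + resource["added"]


def growth_factor(resource):
    """Ratio of the new total to the pre-procurement capacity."""
    return total(resource) / resource["existing"]


# Raw capacity of the newly added tapes only (existing tapes are quoted
# on the slide as "~4000" without a per-tape size, so they are left out).
new_tape_capacity_tb = NEW_TAPES * TB_PER_TAPE

if __name__ == "__main__":
    print(f"Disk:  {total(DISK_TB)} TB ({growth_factor(DISK_TB):.1f}x)")
    print(f"CPU:   {total(CPU_KSI2K)} KSI2K ({growth_factor(CPU_KSI2K):.1f}x)")
    print(f"Tape drives: {total(TAPE_DRIVES)}")
    print(f"New tape capacity: {new_tape_capacity_tb:.1f} TB")
```

On these figures, disk grows to 2361 TB and CPU to 4450 KSI2K, that is, by factors of roughly 2.6 and 3.1 respectively, and the new tapes add about 1360 TB of raw capacity.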
17 Current Issues: CASTOR At the last Oversight Committee meeting in Oct 2007, CASTOR was flagged as a key concern. Subsequently, various deployments of CASTOR, culminating in version 2.1.6 deployed as separate instances for each experiment, established a successful service for CCRC08. Unfortunately, various issues and missing functionality required one further upgrade to […] Although this was tested, once deployed it fails under heavy load. Unfortunately, it is a different sub-version from the one at CERN (due to site differences). Unfortunately, it is not possible to roll back to […] Fortunately, the problem is only a corruption of requests and not of files. The CASTOR team has worked extremely hard and this is an undeserved bookend to mark 12 months of work!
18 Current Issues: CA Another current issue is the Certificate Authority. Spun off from GridPP to NGS a few years ago, this is a service we have come to rely on. A Series of Unfortunate Events beset the CA, including the loss of a copy of the root certificate private key and then the Debian security problem, which required a new root key. The NGS has responded decisively and has implemented management changes designed to try to head off future unfortunate events, or handle them with greater agility. This is a timely move to try to ensure the highest level of service, which GridPP welcomes and for which it thanks the NGS. GridPP should take note: we need to ensure all our services aspire to the highest levels.
19 The Three Rs Reduce, Re-use, and Recycle (motto for the 21st Century?) Robust, Resilient and Reliable (motto for Grid Services?) Robust: Strongly built or constructed. Resilient: Able to deal readily with unexpected difficulties. Reliable: To be certain of it working as expected. ...well, if I were going there, I wouldn't start from here...
20 The Elephant in the Room Our tidy little world of expert users running well-honed production jobs on tidy data sets is about to be trampled underfoot. We must anticipate working with a large number of less expert users, who are likely to want to run a more diverse range of applications in a whole variety of ways.
21 Red Button Day September 10th is the official LHC start. All-day Radio 4 coverage. Events in Westminster and the Scottish Parliament. Anticipate local interest (and ensure your site is visible!)
22 The End: The Beginning After 7 years of preparation, this is the moment we've been working towards. There are challenges ahead. Some things will go wrong. Communication is vital.