Slide David Britton, University of Glasgow IET, Oct 09 1 Prof. David Britton, GridPP Project Leader, University of Glasgow. GridPP Collaboration Board, 2nd Mar 2015. GridPP5 Proposal
Slide Context A GridPP5 proposal was prepared and submitted to a strategic review committee a year ago. We were asked to present Flat Cash and 50% of Flat Cash scenarios; in fact, we presented two additional intermediate scenarios as well. In the end, the review effectively concluded that GridPP could not deliver what was required in any scenario that could be afforded, so a 1-year extension was given for GridPP4. A new GridPP5 proposal was invited, addressing Flat Cash, 90% and 70% of Flat Cash, and the remit was moved from the PPRP to the PPGP. A special committee has been set up to review the GridPP5 bid and to make recommendations to the core PPGP committee. Submission date is March 10th. David Britton, University of Glasgow GridPP CB 2
Slide Strategy David Britton, University of Glasgow GridPP CB 3 The Strategic Objectives of GridPP5 are: To meet STFC's MoU commitment to CERN and the WLCG by ensuring that GridPP is able to handle the challenge of higher data rates and volumes of LHC Run-2. To prepare GridPP for the 2020 start of LHC Run-3 by exploiting developments in IT technology (such as cloud and big-data) and by influencing and contributing to WLCG's future technical direction and development. To reduce the cost to STFC of operation of GridPP by developing cost-sharing collaborations and by evolving the infrastructure to reduce the operational cost.
Slide Objective-1 David Britton, University of Glasgow GridPP CB 4 To meet STFC's MoU commitment to CERN and the WLCG by ensuring that GridPP is able to handle the challenge of higher data rates and volumes of LHC Run-2. By providing our expected share of the hardware at the Tier-1 and Tier-2s By providing the expected Grid services at the expected service levels. By contributing at an appropriate level to the development and evolution of the WLCG infrastructure.
Slide Objective-1 David Britton, University of Glasgow GridPP CB 5 To meet STFC's MoU commitment to CERN and the WLCG by ensuring that GridPP is able to handle the challenge of higher data rates and volumes of LHC Run-2. By providing our expected share of the hardware at the Tier-1 and Tier-2s By providing the expected Grid services at the expected service levels. By contributing at an appropriate level to the WLCG collaboration.
Slide Hardware In a practical sense (based on our actual purchases at RAL), Moore’s Law appears to be slowing down. Hardware is also becoming hard to use at full efficiency (multi-cores; disk i/o speeds). David Britton, University of Glasgow GridPP CB 6
Slide Hardware Estimate David Britton, University of Glasgow GridPP CB 7 (These numbers are not final yet. They are about £2m cheaper than the original GridPP5 proposal because of Moore’s Law and because we have had a capital injection in 2015.)
Slide Objective-1 David Britton, University of Glasgow GridPP CB 8 To meet STFC's MoU commitment to CERN and the WLCG by ensuring that GridPP is able to handle the challenge of higher data rates and volumes of LHC Run-2. By providing our expected share of the hardware at the Tier-1 and Tier-2s By providing the expected Grid services at the expected service levels. By contributing at an appropriate level to the WLCG collaboration.
Slide Service Decomposition: 1-EGI David Britton, University of Glasgow GridPP CB 9 EGI Services: The UK provides about 14% of the effort required to run all 17 EGI services but this is partly funded from EGI. The net financial contribution from the UK is funding for 8.6% of these global services (1.65 FTE), mostly from GridPP. The UK leads two of these, co-leads a third, and makes contributions to two more. NOTE: These numbers are estimates
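The percentages and FTE figure on the EGI slide imply a rough size for the total effort behind the 17 EGI services. The sketch below is only a back-of-envelope reading: it assumes that the 8.6% and the 14% are fractions of the same total effort, which the slide does not state explicitly, and the slide itself flags all of these numbers as estimates. The variable names are illustrative.

```python
# Figures quoted on the slide (all flagged there as estimates).
uk_net_fte = 1.65          # UK-funded effort, in FTE
uk_net_fraction = 0.086    # 8.6% of the global services funded by the UK
uk_gross_fraction = 0.14   # about 14% of the effort provided by the UK

# Assumption: both fractions refer to the same total effort.
implied_total_fte = uk_net_fte / uk_net_fraction
implied_uk_gross_fte = uk_gross_fraction * implied_total_fte

print(round(implied_total_fte, 1))     # 19.2
print(round(implied_uk_gross_fte, 1))  # 2.7
```

Under that assumption, roughly 1 FTE of the effort the UK provides would be covered by EGI rather than by UK sources.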
Slide Service Decomposition: 2-WLCG David Britton, University of Glasgow GridPP CB 10 WLCG Services: The UK contributes to 13 of 14 services and co-leads two. Most WLCG partners contribute to these shared international responsibilities. GridPP provides about 10.3% of the total effort, of which 8.5% is funded from GridPP (corresponding to 38.5 FTE) and 1.5% from the Tier-2 institutes in the form of non-GridPP funded effort. NOTE: These numbers are estimates
Slide Service Decomposition: 3-UK David Britton, University of Glasgow GridPP CB 11 UK Services: These are services that every country needs to perform for the benefit of its own infrastructure. These 14 tasks are core business for GridPP. The effort corresponds to 9.5 FTE funded from GridPP, complemented by 0.51 FTE funded from other UK sources. NOTE: These numbers are estimates
Slide Service View: Key Message This service-oriented view provides an important way to understand the full responsibilities of the GridPP5 project: Almost all these tasks and services are required even in de-scoped scenarios. To first order, reducing the number of GridPP sites does not linearly scale down the manpower required to deliver the infrastructure. We must be aware that WLCG as a whole is reducing effort so it is unlikely that we can transfer responsibilities to others. Indeed, the advantages of a distributed infrastructure are increased leverage from institutes and increased local engagement, support, and impact opportunities. We will put forward a proposal to meet Objective-1 in the Flat Cash and 90% of Flat Cash scenarios (with some increase in risk). Unfortunately, we don’t believe we can meet this key objective in the 70% scenario. David Britton, University of Glasgow GridPP CB 12
Slide Objective-2 David Britton, University of Glasgow GridPP CB 13 To prepare GridPP for the 2020 start of LHC Run-3 by exploiting developments in IT technology (such as cloud and big-data) and by influencing and contributing to WLCG's future technical direction and development. By actively evolving the GridPP infrastructure. - Requires innovation, development, prototyping, and scale testing. By contributing to (including leading) WLCG technical groups. - We must do more than simply run the infrastructure. By working closely with the LHC experiments to ensure they can capitalise on the potential of new technology. - We need dedicated embedded effort in the experiments. This objective primarily requires manpower. We believe the Flat Cash and 90% scenarios will allow us to achieve this, but not the 70% scenario.
Slide Objective-3 David Britton, University of Glasgow GridPP CB 14 To reduce the cost to STFC of operation of GridPP by developing cost-sharing collaborations and by evolving the infrastructure to reduce the operational cost. By reducing the staff effort required to deliver both the Tier-1 and the Tier-2 service. By sharing the cost of hardware and infrastructure with a wider community. By making use of new technology that increases efficiency and reduces effort.
Slide Achieving Objective-3 David Britton, University of Glasgow GridPP CB 15 To reduce the cost to STFC of GridPP we need to reduce the manpower required at the Tier-1 and Tier-2s. We believe this should be a strategic evolution, rather than a managed decline. Over the last decade, the Tier-1 staffing level has fallen from 26 FTE to 19.5 in GridPP4, whilst providing vastly more resources and an enormously more robust level of service. In the Flat Cash and 90% GridPP scenarios, we plan to continue this trend, with staffing levels in the first two years of 17.5 FTE dropping to 14.5 FTE in the latter two years. Similarly, at the Tier-2s, we propose for both the Flat Cash and 90% scenarios, that by the end of GridPP5 we should have an infrastructure that can be run by 15 FTE, compared to FTE at present (more details in a moment).
Slide Flat Cash David Britton, University of Glasgow GridPP CB 16 So far I have not distinguished between the Flat Cash and 90% scenarios. To fully meet Objective-3, and meet STFC strategic goals in broadening the usage and linkage of e-infrastructures (e.g. EU-T0), the Flat Cash scenario includes: ~£340k of hardware in addition to the full amount needed to meet the existing requirements, to provide seed-resources to new groups. We request specific effort (~1.5 FTE) to work with other groups (building on existing work with LSST, LIGO, LOFAR, DiRAC, and discussions with SKA) with the ultimate goal of ensuring that, where appropriate, we optimise things that we can do together (e.g. Tape Store at RAL). We would protect existing GridPP technical expertise as far as possible, because this is the most valuable resource we have to offer.
Slide Scenario Summary The Flat Cash scenario ramps the Tier-1 and Tier-2 effort down to 14.5 and 15.0 FTE, respectively, over the project, reducing the cost of providing the LHC computing. In addition, we propose to actively work on the STFC agenda, badged as EU-T0 or UK-T0, to broaden and harmonise the use of e-infrastructure in the UK and Europe. All three objectives are fully met in this scenario. In the 90% scenario, the manpower at the Tier-1 and Tier-2s ramps down in the same way as above, but we give up the EU-T0 work and only meet Objective-3 in the narrower sense of reducing the GridPP operating costs. The other two objectives are fully met. In the 70% scenario, we only meet the third objective in the sense that the cost of operating GridPP is reduced. We are not able to meet the first two objectives. David Britton, University of Glasgow GridPP CB 17
Slide Manpower Summary David Britton, University of Glasgow GridPP CB 18 (These numbers are not final yet) These are averages of a tapered profile over the four project-years.
Slide Tier-1 Plan We can currently run the Tier-1 with 17.5 FTE but have another 2 FTE who are working on developing services that will ultimately reduce the manpower required and provide services that can be better used by others. By the end of GridPP4+ we will have benefited from this work and be able to run the Tier-1 with reduced effort. However, we propose to fund 17.5 FTE for the first two years to allow this virtuous cycle to continue. By 2018 we hope to have broadened the user base to reduce overall costs as well as having benefited from service developments. We plan to reduce Tier-1 effort to 14.5 FTE from FY18. This gives an average staffing level of 16 FTE for GridPP5 but a clear trajectory of savings. David Britton, University of Glasgow GridPP CB 19
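The 16 FTE average quoted for the Tier-1 follows from the stepped profile on the slide. A minimal sketch of that arithmetic, assuming a four-year GridPP5 with two years at each staffing level (the helper name `average_fte` is illustrative, not from the proposal):

```python
def average_fte(profile):
    """Year-weighted mean FTE for a list of (years, fte) pairs."""
    total_years = sum(years for years, _ in profile)
    return sum(years * fte for years, fte in profile) / total_years

# Tier-1 plan: 17.5 FTE for the first two years, 14.5 FTE for the last two.
tier1_profile = [(2, 17.5), (2, 14.5)]
tier1_average = average_fte(tier1_profile)

print(tier1_average)  # 16.0
```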
Slide Tier-2 Plan David Britton, University of Glasgow GridPP CB 20 The Tier-2 infrastructure needs to balance the advantages of leverage, local support and engagement, redundancy, and the use of distributed bandwidth, with economies of scale that may arise from less fragmentation of resources. We have pointed out that much of GridPP effort delivers a service, rather than runs hardware, so the required effort does not scale linearly with the number of sites. We propose a sustainable Tier-2 infrastructure is achieved by the end of GridPP5 and a smooth transition made during the project. This means ramping down Tier-2 effort over GridPP5 according to a strategic plan. We believe a ramp-down is necessary so technical changes made for Run-2 can be bedded down before technical expertise is reduced.
Slide Sustainable Tier-2 We propose that the final infrastructure should consist of: –Four large Tier-2 centres providing the full ATLAS capability – 2 FTE each. –Four smaller Tier-2 centres for ATLAS providing supporting capability – 0.5 FTE each. –One large Tier-2 for CMS providing the full CMS capability – 2 FTE. –Two additional Tier-2 centres for CMS providing supporting capability – 0.5 FTE each. –Three or four T2Ds for LHCb, that are symbiotic with large ATLAS or CMS sites – 0.5 FTE each. –TOTAL = 15 FTE at end of GridPP5. The ratio of ATLAS/CMS/LHCb assigned effort approximately reflects the volume of hardware we anticipate hosting for each experiment in GridPP5. David Britton, University of Glasgow GridPP CB 21
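The 15 FTE total can be checked directly from the site list above. The sketch below simply multiplies site counts by FTE per site; because the slide says "three or four" T2Ds for LHCb, the number of T2Ds is left as a parameter (the function name `tier2_total` is illustrative).

```python
# (description, number of sites, FTE per site), as listed on the slide.
tier2_sites = [
    ("ATLAS large", 4, 2.0),
    ("ATLAS smaller", 4, 0.5),
    ("CMS large", 1, 2.0),
    ("CMS supporting", 2, 0.5),
]
LHCB_T2D_FTE = 0.5  # FTE per LHCb T2D, from the slide

def tier2_total(n_lhcb_t2d):
    """Total Tier-2 FTE for a given number of LHCb T2D sites."""
    base = sum(n * fte for _, n, fte in tier2_sites)
    return base + n_lhcb_t2d * LHCB_T2D_FTE

print(tier2_total(4))  # 15.0
print(tier2_total(3))  # 14.5
```

The quoted total of 15 FTE therefore corresponds to four T2Ds; with three, the same list gives 14.5 FTE.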
Slide Transition Issues We don’t believe we are yet able to run the smaller sites (which will be quite substantial) with 0.5 FTE. However, we do currently have substantial sites that are run with 1.0 FTE. Therefore, we believe 0.5 FTE will be possible with increased use of new technology and an appropriate support structure. But we need to make an orderly transition, and reduce the effort gradually during GridPP5 at sites that have a 0.5 FTE endpoint. For sites losing all support, an orderly transition will help kit purchased by GridPP to continue to deliver until end of life. David Britton, University of Glasgow GridPP CB 22
Slide ATLAS - Hard Choices David Britton, University of Glasgow GridPP CB 23 GridPP has monitored site delivery according to experiment-defined metrics for several years. These were somewhat distorted in 2014 by delayed hardware funding to some of the large sites. [Figure: sites grouped as 4 large and 4 smaller.] The final choice here was made in conjunction with GridPP in order to maintain compatibility with choices made by the other experiments. The ATLAS metrics take into account the provision, use, and reliability of both CPU and disk at the sites.
Slide CMS and LHCb CMS has the more obvious choice of picking IC as the large site, and Brunel and PPD as the supporting sites. LHCb’s computing model has evolved over GridPP4 to include Tier-2D centres. The UK is a large fraction of LHCb (21%) and the aspiration is to host four T2Ds co-located with large ATLAS or CMS sites, requiring an additional 0.5 FTE each. The potential manpower actually scales to 1.75 FTE for LHCb. LHCb’s choice of T2D sites, reflecting past delivery and local groups, is Manchester, Liverpool, Glasgow and IC. David Britton, University of Glasgow GridPP CB 24
Slide LHCb Metric David Britton, University of Glasgow GridPP CB 25
Slide EGI CPU Data David Britton, University of Glasgow GridPP CB 26 Funded effort is proposed to continue at these sites at some level. It is proposed to gradually phase out any GridPP funded effort at these. Fraction delivered by Site
Slide Storage Capacity David Britton, University of Glasgow GridPP CB 27 [Figure: TB of storage by site, with sites labelled by experiment: CMS + LHCb, CMS, ATLAS + LHCb, ATLAS, ALICE?]
Slide ALICE ALICE falls under the remit of the NPGP, rather than the PPGP from whose budget GridPP5 will be funded. Therefore, GridPP will cost support for ALICE (~300k hardware and 0.5 FTE) but will present it as a separate line item. That is, we calculate what we can afford in each of the budget scenarios, and then add ALICE support on top. It would then be very appropriate if this were supported by the NPGP; the 0.5 FTE would naturally go to Birmingham. David Britton, University of Glasgow GridPP CB 28
Slide PPD PPD has the unique advantage of being co-located with the Tier-1 (though it is not in R89 but in the ATLAS centre which is used as a back-up location as part of the Tier-1 disaster management plan). PPD has also been fairly unique in contributing to all three of the major experiments, though with more focus on CMS and then LHCb. However, the close proximity with the Tier-1 gives us the opportunity to work on alternative cheaper service provision and test our ability to run a robust service with less dedicated manpower. For these reasons, PPD/CMS will lead the way in prototyping the operation of a substantial site using more cloud technology. As we have more flexibility to adjust manpower here than at any other Tier-2 site, we propose to carefully and appropriately ramp down the effort to 0.5 FTE as we develop the new mode of operation. David Britton, University of Glasgow GridPP CB 29
Slide Tier-2 Effort Evolution David Britton, University of Glasgow GridPP CB 30 End point is 15 FTE to run the Tier-2 infrastructure; Average is about 17 FTE. Flat Cash and 90% Scenario
Slide 70% Scenario In the 70% scenario it is not possible to meet the first two strategic objectives (fully deliver resources to WLCG and/or prepare for Run-3). We can only meet the 3rd objective of reducing the cost of GridPP, but this is only because we don’t deliver our obligations. Painful decisions must be made in all areas. It is not possible to meet international obligations. Detailed discussions with experiments a year ago (with respect to the 50% Flat Cash scenario requested at that time) led to a decision to prioritise analysis capability for UK physics over international obligations for reconstruction/reprocessing. David Britton, University of Glasgow GridPP CB 31
Slide Hardware in 70% scenario It should be noted that the LHC experiment requirements have been designed to match a “flat cash” budget. That is, the Funding Agencies via the CERN RRB and CRSG provided this as a planning guideline. The experiments have worked hard to meet this limit (6x increase in resources was reduced to 2.5x). Therefore, if STFC is now unable to provide Flat Cash funding, it is unrealistic to expect that the UK will deliver the required level of hardware. And therefore, in the 70% scenario, we request funding for 70% of the required level of hardware. We believe it would be detrimental to ask for more hardware at the expense of staff to provide the co-required level of service. David Britton, University of Glasgow GridPP CB 32
Slide Tier-1 in the 70% plan We would need to cut a major component to reduce hardware and manpower costs. The choice was between the CPU compute service or the Tape service, because it makes no sense to preserve CPU service without Disk service. The decision was to prioritise Tape and cut compute: Saves ~£1m in hardware and some reduction in manpower. To descope sufficiently, availability management and QA would also be severely cut; the on-call service would cease to be viable; and the Tier-1 would fail to meet its MoU commitments for availability. This leads to a partial Tier-1 requiring 11.5 FTE. The Tier-1 disk and tape hardware will be reduced to 70% of the MoU requirement (can deliver ~80% if we don’t support non-LHC). David Britton, University of Glasgow GridPP CB 33
Slide An Aside: Non LHC VOs David Britton, University of Glasgow GridPP CB 34 9% of Tier-2 CPU and 4% of Tier-1 CPU was delivered to non-LHC VOs between Jan 2012 and Dec 2014
Slide Tier-2 in the 70% plan The funding available means the average Tier-2 staffing level in GridPP5 needs to be about 14.5 FTE as we transition to a lower level over the project. To accommodate this, the final level needs to be ~13.5 FTE (c.f. 15 in the other scenarios). –LHCb will contract to 3 T2Ds with the loss of funded effort at Liverpool. –ATLAS will then shrink to 3 smaller Tier-2s with the loss of Liverpool. –We will squeeze the effort at PPD even harder and try to run in the final state with 0.25 FTE. This may not be feasible. The Tier-2 hardware budget will also contract to 70% of the MoU commitment (can deliver 75-80% if we drop support for non-LHC). David Britton, University of Glasgow GridPP CB 35
Slide 70% Tier-2 Transition Plan David Britton, University of Glasgow GridPP CB 36
Slide Staffing Summary David Britton, University of Glasgow GridPP CB 37 In the 70% scenario we also lose Experiment Support for non-LHC experiments (1 FTE); lose a Data Management post (1 FTE); reduce security (0.5 FTE, either operational security or international leadership of policy); stop operation of the GOCDB (0.3 FTE; we then also lose the EGI funds to develop this service, 0.5 FTE); lose 0.2 FTE to support operation of APEL; lose 0.5 FTE of management; and lose the Impact Officer (0.5 FTE). DRAFT
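As a quick tally of the losses named in the text of this slide, the sketch below sums the individual items. It covers only the posts listed in the text, not whatever totals the (draft) staffing table on the slide may contain, and it assumes the 0.5 FTE of EGI-funded GOCDB development is counted separately from the GridPP-funded effort. The dictionary keys are shorthand labels.

```python
# GridPP-funded effort lost in the 70% scenario, as itemised in the slide text (FTE).
losses_fte = {
    "Experiment Support for non-LHC experiments": 1.0,
    "Data Management post": 1.0,
    "Security": 0.5,
    "GOCDB operation": 0.3,
    "APEL operation support": 0.2,
    "Management": 0.5,
    "Impact Officer": 0.5,
}

# Assumption: the EGI-funded GOCDB development effort is tracked separately.
egi_funded_loss_fte = 0.5

gridpp_total = round(sum(losses_fte.values()), 2)
combined_total = round(gridpp_total + egi_funded_loss_fte, 2)

print(gridpp_total)    # 4.0
print(combined_total)  # 4.5
```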
Slide Conclusion With Flat Cash we can meet our strategic objectives, continuing to deliver appropriate resources and excellent service to the LHC experiments and the broader community. In addition we can play an active role in developing the UK/EU-T0 agenda with the ultimate goal of creating a more general and sustainable infrastructure. With 90% of Flat Cash, we can mainly meet our strategic goals but we will not be able to actively contribute to the development of the UK e-infrastructure. With 70% of Flat Cash, we will not meet our strategic goals, except in the sense that we reduce the cost of GridPP. We will not deliver the resources or service level expected by the WLCG MoU. We will reduce the capability of the RAL Tier-1 to respond to requirements from other collaborations. David Britton, University of Glasgow GridPP CB 38
Slide Back-up Slides David Britton, University of Glasgow GridPP CB 39
Slide Hardware Statement GridPP anticipates that some sites with no GridPP-funded effort would still be interested in running GridPP-funded hardware during GridPP5. In principle, we feel that this offers potential benefits in terms of leverage of additional resources, engagement with local groups, and opportunities to develop additional synergies with other computing infrastructures. However, these potential advantages must be balanced against the potential effort required by both GridPP and, in particular, by the global experiment computing infrastructure (such as the ATLAS ADC), to keep these sites functioning at the required level of service. Particularly with new technologies, GridPP believes that this will be feasible and sees the provision of hardware at non-staffed sites as a good opportunity to access additional hardware when it is available; to engage with the wider community; and to establish new affordable modes of operation. However, GridPP will need to discuss with the experiments the appropriate level of capacity hardware that might be located outside of the dedicated centres and, in order to protect the experiment infrastructures from additional work and maintain the UK reputation, we would want to establish our own service-level requirements so that we can act on issues before they become externally visible. David Britton, University of Glasgow GridPP CB 40
Slide Leverage David Britton, University of Glasgow GridPP CB 41 Capital costs primarily reflect contributions in the form of dedicated machine rooms. Hardware is estimated by looking at Tier-2 resources reported in the Grid Accounting system, compared to the resources GridPP has funded. Electricity has been estimated from the total resources reported. Manpower is that reported in the GridPP quarterly reports, which is not funded by GridPP.