D. Britton8/Nov/2006GridPP3 Last 30 Days of CSA06 RAL
D. Britton Response to the PPRP David Britton 8/Nov/06
D. Britton8/Nov/2006GridPP3 Introduction GridPP has addressed the 10 questions received from the PPRP in the document This presentation will go through those responses. In addition, GridPP has addressed 120 comments and questions from 7 referees in the documents: The PPRP has also been presented with a number of related documents detailing the input to the PPRP questions from the three large LHC Experiments and various background documents addressing individual areas. All this information is collected on the web page:
D. Britton8/Nov/2006GridPP3 RTM Display RTM DISPLAY – Narrative This is a real live visualisation of the wLCG using the GridPP RTM that is internationally used as the public face of the LCG/EGEE Grid. For the LHC, computing is as much part of the requirements as the detector and the machine. It's not an optional extra; this is a new regime – OPAL was 1TB etc. Five years ago a decision was made in the UK to develop a Grid approach: This paradigm offered the hope of transparent access to large quantities of resources and the opportunity to avoid unnecessary duplication by developing commonalities across the experiments (particularly in deployment, management, and support of resources and their use). Over the past 5 years we have worked in a complex and international environment at the bleeding edge of Grid development. We have succeeded in deploying the largest scientific Grid in the UK as part of the world-wide LCG. (bunch of stats – 200 sites world-wide; 10% in UK; 30%? of CPU in the last quarter? 10-20K jobs simultaneously.) However, impressive though it presently is, to achieve the goals it has to grow by a factor of 10 in CPU and closer to 100 in storage. CMS data challenge ~25% of 2008 rate exercise: 1PB transferred this year. It's within this perspective that we answer the first of the PPRP questions…
D. Britton8/Nov/2006GridPP3 PPRP Question-1 1. The Panel would like to further understand the advantages of the proposed overarching GridPP model for operations (as opposed to development) as against each experiment making its own arrangements. A: The GridPP identity – enables a unified and coordinated voice for the UK community that raises our profile, strengthens our negotiating power, increases our influence and enables better communication. B: Cross-experiment support – the middleware stack is presently divided into lower-level middleware that is part of the gLite release and higher-level middleware that is provided by the experiments. The common goal is to continue to move middleware from the experiment-specific to the generic level. Thus, future support for the middleware stack must follow this transition and is part of the overarching GridPP model. C: The Tier centre structure – has been set up by and through the GridPP project. The Tier-2 MOUs between GridPP and the Institutes establish a uniform responsibility, and the critical relationship between the Tier-1 and the Tier-2s is carefully supervised through the deployment team. An overarching project is more likely to succeed in nurturing these structures to optimise the UK Grid for Particle Physics.
D. Britton8/Nov/2006GridPP3 PPRP Question-1 1. The Panel would like to further understand the advantages of the proposed overarching GridPP model for operations (as opposed to development) as against each experiment making its own arrangements. D: The GridPP Deployment Team – The deployment of LCG releases will be better implemented by a coordinated deployment team managed by a common project. E: Without an overarching project, there is a risk that the UK Particle Physics Grid would fragment into a set of experiment-specific resource clusters, which would completely undermine the advantages that predicated the decision to take the Grid approach that has been the basis for investment over the last 5 years. In addition, statements have been received (and presented in full to the PPRP) from the three large LHC experiments which: - All strongly support the concept. - Propose no alternative.
D. Britton8/Nov/2006GridPP3 PPRP Question-2 2. The Panel would like to explore the priorities and potential options for descope. - If funding were only available to support 30%, 50% or 70% of the total request, what would be the priority areas for investment in terms of obtaining the best UK science return? - What would be the political and experimental impacts of funding at a much lower level? - How would you prioritise the work packages?
D. Britton8/Nov/2006GridPP3 PPRP Question-2: PREAMBLE 0) Computing is an integral part of the LHC project. 1) GridPP3 is the continuation of a project with a previously defined scope. This is not a new initiative where the scope and scale are more elastic. WE HAVE A WELL DEFINED (FIXED) TASK TO PERFORM. 2) The scope of the project was described in the PPARC call: The original proposal required a careful evaluation of the minimum requirements consistent with meeting the PPARC call. In particular it did not provide the UK with additional capacity that might give a competitive edge. 3) Hardware is the biggest item. As you de-scope and reduce hardware you keep all the data and service tasks but throw away the ability to do any physics. 4) We are embedded in an international context and have been for ~5 years. We cannot sensibly move away from the LCG model of middleware, operations and support. The levels of service expected are agreed in the MOU signed by PPARC. 5) As in any international collaboration (e.g. the detectors), there are elements of service work that need to be contributed in a broadly pro-rata manner. BOTTOM LINE: Enormously difficult to de-scope a project that is well underway with well defined responsibilities. We basically start to fail as we de-scope.
D. Britton8/Nov/2006GridPP3 Input to Scenario Planning - Resources Changes in the LHC schedule have prompted another round of resource planning. New global resource requirements were presented to the CRRB (Oct 24th), from which new UK resource requirements have been derived and incorporated in the scenario planning. Hardware prices have been re-examined following the recent Tier-1 purchase (CPU was cheaper than expected). We have lowered our best empirical estimate of future prices but have also declared a contingency on hardware spend of 25% (up from 15%) over the lifetime of the project. The combination of the above results in a 9% saving on the project cost.
D. Britton8/Nov/2006GridPP3 Input to Scenario Planning - ATLAS The priority of the ATLAS-UK collaboration to ensure the best science return is the hardware and its operation. Within this, ATLAS notes that UK Tier-2 resources contribute directly to the UK output, whereas shortages in Tier-1 resources affect all ATLAS physicists globally. For Tier-1 resources, ATLAS regards the 15% hardware reduction proposed in the 70% scenario as barely manageable; the 50% scenario would do serious damage to the analysis capacity for the large UK physics community and it would also threaten the calibration and commissioning of the SCT. To reduce the Tier-2 hardware, cuts would have to be made in simulation, calibration, and then analysis capability, but even the first of these will degrade physics output. Tier-2 cannot be cut below the 70% scenario. ATLAS has derived the UK fraction of the global requirements by noting that UK authorship is 12.5% (now 13.9%) of the Global ATLAS Tier-1 authorship and that 4 out of 30 (13.3%) ATLAS Tier-2s are in the UK.
D. Britton8/Nov/2006GridPP3 Input to Scenario Planning - CMS The priority of the CMS-UK collaboration is access to Tier-2 resources in the UK and access to Tier-1 resources, preferably in the UK. CMS argues that the 70% scenario, achieved with a 15% reduction in the requested hardware, would be at the threshold for CMS to host a UK Tier-1. In the 50% scenario, the priority for CMS would be to protect their Tier-2 resources, which would have to be hosted by a Tier-1 external to the UK. The revised CMS UK hardware request is based on a more detailed algorithm than a simple fraction of the global requirements. The scale is set by dual requirements of (a) a minimum size for a CMS Tier-1 of 50% of the average CMS Tier-1 (~7% of global requirements) and (b) the UK fraction of Tier-1 authors (same basis as ATLAS) of ~8%. The details are calculated from the dual requirements to accept 4 out of CMS's 50 data-streams (8%) and the need for the Tier-1 to serve an entire AOD dataset.
D. Britton8/Nov/2006GridPP3 Input to Scenario Planning - LHCb The LHCb collaboration has a somewhat different computing model from ATLAS and CMS with most analysis performed at the Tier-1 and the Tier-2 used predominantly for Monte Carlo simulation. LHCb prioritizes Tier-1 hardware and its operation, followed by Tier-2 hardware and its operation and finally support etc. The revised hardware requests from UK LHCb are based on the new global requirements, calculated from the UK authorship fraction of 18.6% (revised upwards from 16.6% at the time of the GridPP3 submission). The Tier-2 resource request also includes 18.6% of the global LHCb Tier-2 resource shortfall of 30% to give a total of about 24% of the global Tier-2 requirements. It is noted that any fall below the global authorship fraction of 18.6% at either the Tier-1 or Tier-2 would have to be negotiated in a global context.
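The "about 24%" figure on the LHCb slide can be reproduced with a short calculation. The sketch below assumes one reading of the slide: the UK takes its authorship fraction of the global Tier-2 requirement plus the same fraction of the 30% global shortfall. The function name `uk_tier2_share` is illustrative and does not come from the GridPP documents.

```python
def uk_tier2_share(authorship_fraction: float, global_shortfall: float) -> float:
    """Pro-rata share of the global requirement plus the same
    fraction of the global shortfall (assumed interpretation)."""
    return authorship_fraction * (1.0 + global_shortfall)


# Figures quoted on the slide: 18.6% authorship, 30% global Tier-2 shortfall.
share = uk_tier2_share(0.186, 0.30)
print(f"UK LHCb Tier-2 request: {share:.1%} of global requirements")
```

Under this reading the result is 18.6% + (18.6% of 30%), which rounds to the "about 24%" stated on the slide.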
D. Britton8/Nov/2006GridPP3 70% Scenario An example 70% scenario based on Experiment Inputs and a bottom-up examination of all posts. 73.5%
D. Britton8/Nov/2006GridPP3 What has been lost in the 70% scenario? - 15% of Hardware - Hardware at the Tier-1 and Tier-2 is reduced by 15%. - Contributes to a global shortfall of Tier-1 resources for all three LHC experiments. If cuts applied uniformly: - Takes CMS to the threshold level for a UK Tier-1. - Takes ATLAS to the threshold for holding the entire AOD in the UK. - Reduces the LHCb UK Tier-1 resources below the UK authorship fraction. (Un-quantified cost/consequences). The reduction of hardware directly (and disproportionately) impacts the ability of UK groups to produce physics output and will be a competitive disadvantage.
D. Britton8/Nov/2006GridPP3 What has been lost in the 70% scenario? - 7% of Tier-1 Staff Effort - Staffing effort at Tier-1 in the proposal is barely adequate to meet MOU quality of service and was identified as a significant risk. - Staffing effort does not scale linearly with hardware. - Cuts achieved by removing 3-FTE ramp-up of Tier-1 staff in the GridPP2+ period (designed to match the ramp-up of hardware) and 1-FTE during the GridPP3 period (probably from incident response team). - The working allowance, previously included to address risk of failing to meet MOU service levels, has also been removed. - Net result is a significant increase in the risk that the Tier-1 service levels will not be met in full.
D. Britton8/Nov/2006GridPP3 What has been lost in the 70% scenario? - 11% of Tier-2 Staff Effort -The Tier-2 staff would be reduced by 1.75 FTE out of This is likely to contribute to either or both of: (a) a reduction of Tier-2 resources levered from the institutes; (b) a reduction in the service level achieved at the Tier-2s. - The working allowance, previously included to address risk of failing to meet MOU service levels, has also been removed. - Net result is an increase in the risk that the Tier-2 resource and service levels will not be met.
D. Britton8/Nov/2006GridPP3 What has been lost in the 70% scenario? - 31% of Support Staff Effort The Data Management post (1 FTE) for Replica Optimisation is not funded. This work was judged as a good investment to optimise the use of limited storage resources. Removing funding for this post removes the likelihood of much greater savings on the purchase of storage resource in the future. A reduction in data storage support (0.5 SY) reduces the flexibility to support multiple storage technologies in the UK. (GridPP does not wish to support multiple storage technologies but recognises the likely need). Continuing support (0.5 FTE) for the GridPP Real Time Monitor would not be funded. The RTM is the face of the LCG/EGEE grid and is a highly visible and acclaimed demonstration showpiece that has repeatedly illustrated the UK's position as a major international player in this field. A 1-FTE reduction in the support for the R-GMA information and monitoring system. This major UK contribution is deeply embedded in the EGEE/LCG stack. Any reduction in effort must be carefully planned in conjunction with our partners to try to minimise disruption globally.
D. Britton8/Nov/2006GridPP3 What has been lost in the 70% scenario? - 31% of Support Staff Effort The Security Vulnerability work (0.5 FTE) would be dropped. During GridPP2 the UK has pro-actively taken a leading international role developing security vulnerability policies and procedures. Support for GridSite would be reduced by 0.5 FTE. The GridSite security toolkit, developed by GridPP, is embedded in the EGEE/LCG middleware and used as the basis for the GridPP and other websites together with the GridSiteWiki. A Networking post in the GridPP3 proposal designed to help network provision and network monitoring would be reduced to 50%. This reduces the network support at a time when the network will be coming under intense stress and production standards are required. - The loss of over 30% of the support staff means that the UK Grid will operate less effectively (Data Management; Storage; Networking) and International roles and responsibilities will be reduced or lost (RTM; R-GMA; Vulnerabilities; GridSite).
D. Britton8/Nov/2006GridPP3 What has been lost in the 70% scenario? 12% of Operations; 10% of Management; 25% of Outreach. In the reduced scenarios the task of managing the project is likely to be at least as difficult as for the full proposal. Nevertheless, management effort would be reduced primarily by not buying out 25% FTE for the User Board Chair as currently proposed. There is a risk that the User Board would not be as pro-active at collecting or presenting the Users' requirements and concerns as desired. The 0.5 FTE requested for Industrial Liaison would be dropped. This means that we are unlikely to establish much industrial outreach. Support for the UK Grid Operations Centre in GridPP3 would be reduced from 3 to 2 FTE. The current manpower is 5.5 funded by EGEE. This increases the risk that the Grid Operations Centre, on which GridPP relies to provide Grid monitoring, ticketing and accounting, would not function effectively.
D. Britton8/Nov/2006GridPP3 50% Scenario An example 50% scenario based on Experiment Inputs and a bottom-up examination of all posts.
D. Britton8/Nov/2006GridPP3 What has been lost in the 50% scenario? - 40% of Tier-1 Hardware 40% of the Tier-1 HW will be lost. All three LHC Experiments will need to negotiate the consequences of providing significantly less Tier-1 resources than their UK Author fraction. (Un-quantified cost). The UK could no longer host a CMS Tier-1 centre and special arrangements would need to be made to provide UK CMS Tier-2s with access to resources and support at a non-UK Tier-1. (Un-quantified cost). For ATLAS and LHCb, this level of Tier-1 resource would do serious damage to the analysis capacity for the large UK physics communities and for ATLAS it would also threaten the calibration and commissioning of the SCT.
D. Britton8/Nov/2006GridPP3 What has been lost in the 50% scenario? - 30% of Tier-2 Hardware 30% of the Tier-2 HW will be lost. The physics output for all three experiments would be reduced. Competitive advantage would be completely lost. ATLAS would apply reductions to simulation, calibration, and then analysis capability, but even the first of these will degrade physics output. LHCb would reduce Monte Carlo simulation, similarly compromising physics output. As CMS's sole UK resource, the reduction would directly scale the CMS physics output.
D. Britton8/Nov/2006GridPP3 What has been lost in the 50% scenario? 22% of Tier-1 Staff; 23% of Tier-2 Staff Tier-1 staff would be further reduced from 17 to 14 FTE. Comparing this with the current level of 13.5 FTE it is quite apparent that the Tier-1 (which would have much more hardware by that point) could not reach the level of service defined in the MOU signed by PPARC. There would need to be international negotiations as to whether the Tier-1 could function as such, for either of the remaining two experiments. Tier-2 staff would be further reduced from 13 to 11 FTE. This is likely to contribute to either or both of (a) a reduction of Tier-2 resources levered from the institutes; (b) a reduction in the service level achieved at the Tier-2s.
D. Britton8/Nov/2006GridPP3 What has been lost in the 50% scenario? - 66% of the Support Staff lost. The support post for generic metadata issues would be lost and all support would have to be via the experiments. Support for grid storage technologies would be reduced from 7 SY to 2 SY over the project. This would (probably) be limited to Castor support at CCLRC. Institutes would need to look elsewhere for support on the technologies likely to be deployed therein. The portal work would be stopped, leaving the smaller or future experiments with a higher hurdle to getting on the Grid. The testing and performance monitoring work associated with the Work Load Management system would stop. This is an area where there is strong European pressure to continue and is of potentially direct benefit to UK physics by providing knowledge about the current condition of the Grid on a site-by-site basis.
D. Britton8/Nov/2006GridPP3 What has been lost in the 50% scenario? - 66% of the Support Staff lost. Support for information and monitoring systems would be reduced to 1 FTE. (R-GMA could not be supported and negotiations with our international partners would have to determine how best to use this post to help the transition to whatever new system evolved internationally). Security support would be reduced to 1 FTE. This would be split as deemed appropriate at the time between VOMS support and Operational Security. The support for GridSite (an international obligation) would be dropped. Again, this would have to involve discussion with international partners since the LCG/EGEE middleware stack would be at risk. The networking support post for monitoring and provision would be lost. This would be in a regime where the need for Network support has become more critical, with at least one of the major experiments attempting to use a non-UK Tier-1.
D. Britton8/Nov/2006GridPP3 What has been lost in the 50% scenario? - 30% of the Operations Staff; 25% of Management; all dissemination (except GridPP2+ period). Support for the Grid Operations Centre would be further reduced from 2 to 1.5 FTE, further increasing the risk that the GOC, on which GridPP relies to provide Grid monitoring, ticketing and accounting, would not function effectively. One of the four Tier-2 coordinators would be lost. This increases the risk of failure of part of the Tier-2 organisation; reduces the deployment team; and increases the likelihood that delays to upgrades at some sites will reduce the available resources, with a direct impact on physics output. Management would be further reduced (this would have to be optimised). There is a risk that the management becomes less engaged and therefore less effective. All dissemination and outreach activities would be stopped after the GridPP2+ phase is complete.
D. Britton8/Nov/2006GridPP3 30% Scenario GridPP has examined the original PPARC call and has determined that it is unable to form a proposal that meets any of the criteria listed with funding at the 30% level: 2. a) Underpin the particle physics programme by delivering the functional Tier 1 centre for the LHC experiments and for the other experiments where UK groups will require computing GRID access and facilities. The 50% scenario presented above already fails to meet this criterion because the Tier-1 would be sub-threshold for at least one of the LHC experiments. At the 30% funding level there could only be a Tier-1 for (probably) one LHC experiment. Most likely, in a 30% scenario there would be no Tier-1 and the resources would be used as a Tier-2 (though it is not clear what to do about LHCb). Etc….. (see document)
D. Britton8/Nov/2006GridPP3 Question-2 Summary GridPP has taken input from the 3 large LHC experiments as guidance in an attempt to design a GridPP3 project in 70% and 50% funding scenarios. The outcome is a 74% funding scenario that preserves 85% of the hardware (~the threshold for a UK CMS Tier-1) but is likely to result in a failure to meet service levels; inadequate support across the UK in many areas; and the elimination of much of the UK obligation to the international effort that directly and indirectly benefits UK physicists. A 55% scenario is provided that doesn't work: It does not respect the criteria of the call and there are large political and financial unknowns associated with delivering less than a pro-rata share of LHC hardware, so the real cost cannot be provided. The UK Grid would not function at the required level of service and support for UK users would be completely inadequate. We do not regard the fine details of these scenarios as fixed but they are offered as examples of our approach to, and the consequences of, funding below ~90% of the original proposal.
D. Britton8/Nov/2006GridPP3 Risks GridPP believes that the risks introduced by the 74% scenario are very large and urges the PPRP to consider an outcome closer to 90%. 1) In the 90% scenario, all the Risks defined in the GridPP3 proposal still apply except that there is an increased risk that hardware is more costly than planned. 2) In the 74% scenario, there is an additional risk that the level of hardware provision for all 3 experiments will compromise the physics output. For LHCb there are unknown consequences of providing hardware below the authorship fraction level. 3) In the 74% scenario, service levels signed up to by PPARC at the Tier-1 and Tier-2 are severely at risk. 4) In the 74% scenario, support for middleware in the UK will be inadequate. There is a risk that this will seriously undermine physics output. 5) In the 74% scenario, most middleware contributions by the UK to the international effort will be dropped. This puts the whole Grid at risk and damages the UK reputation and influence.
D. Britton8/Nov/2006GridPP3 PPRP Question-3 3. The UK would like to play a key role in this important project but the current financial constraints necessitate focusing on the crucial areas and what needs to be done. The Panel would like to identify these areas, giving consideration to the current LHC timescale, and to understand the implications of delaying parts of the project, especially with regard to hardware (e.g. same CPU performance with fewer, fast processors). Identifying crucial areas is covered by the Scenario Planning presented in response to Question-2 and by each of the responses to GridPP from the three large LHC experiments. The new LHC timescale has been included in the new resource requirements prepared by the LHC experiments and presented to the CRRB on October 24th. These new global requirements have been used to derive new UK requirements, as described in the response to Question-2 and in the experiment documents. The resource requirements are effectively shifted which, combined with reduced hardware cost estimates used by GridPP, has resulted in about a 9% saving on the project cost. This is embedded in the 70% and 50% scenario plans.
D. Britton8/Nov/2006GridPP3 PPRP Question-4 PART-1 The Panel wishes to understand better the apparent disparity between the estimated Tier-1 needs of CMS and ATLAS. It seems that ATLAS requires roughly twice the CPU and disk resource, but less tape than CMS. Given the similar computing models between the two experiments, relatively small differences in the parameters chosen seem to have significant implications on the assessment of need and hence cost. PART-2 - How has GridPP interacted with the experiments to ensure that the most cost effective solution has been arrived at? PART-3 - The Panel wishes to understand the levels of requests for Tier-1 facilities by the different experiments relative to the UK contribution to each experiment. Part-2: GridPP relies on the careful scrutiny and rigorous peer review of the computing models and global resource levels by the LHCC and the CRRB to ensure that the most cost effective solution has been achieved.
D. Britton8/Nov/2006GridPP3 PPRP Question-4 PART-3: ATLAS has derived the UK fraction of the global Tier-1 requirements by noting that UK authorship is 12.5% (now 13.9%) of the Global ATLAS Tier-1 authorship. CMS has derived their UK Tier-1 hardware request based on a more detailed algorithm than a simple fraction of the global requirements. The scale is set by dual requirements of (a) a minimum size for a CMS Tier-1 of 50% of the average CMS Tier-1 (~7% of global requirements) and (b) the UK fraction of Tier-1 authors (same basis as ATLAS) of ~8%. The details are calculated from the dual requirements to accept 4 out of CMS's 50 data-streams (8%) and the need for the Tier-1 to serve an entire AOD dataset. This latter requirement results in a slightly larger fractional requirement at the Tier-1 in early years which then reduces to ~8% in the steady state. LHCb has derived the UK fraction of Tier-1 resources from the UK authorship fraction of 18.6% (revised from 16.6% at the time of the GridPP3 submission).
D. Britton8/Nov/2006GridPP3 PPRP Question-4 PART-1: The latest round of resource review has led to convergence of the models. In particular, CMS has increased its trigger rate (now similar to ATLAS) during early years to acquire more calibration and standard-model physics data. Event sizes, data rates, processing times, and replication strategy have evolved to become significantly closer. The remaining difference is the strategy for data storage and replication: ATLAS: copies of the ESD data are distributed over all Tier-1 centres, plus a cumulative AOD sample spanning multiple years, all on disk. CMS: a copy of RECO (ESD) is stored over all Tier-1 centres, in addition to CERN, and only a single year's AOD is stored on disk at Tier-1s (previous years are accessible from tape). This leads to a smaller Tier-1 disk requirement from CMS, but higher requirements on tape infrastructure, bandwidth and storage. These are different optimisations that will probably converge as experience is gained.
D. Britton8/Nov/2006GridPP3 PPRP Question-5 5. The Panel would like the applicants to justify the rationale behind the proposed regional Tier-2 structure in GridPP3 and to set out the pros and cons of other possible structures, for example, experiment based or rationalised structure with fewer Tier-2 sites, or fewer institutes. The Panel would like the applicants to consider possible cost savings and improvements in efficiency and service delivery that different structures might produce. Need to discuss The Past, The Present, and The Future. The underlying message is that the proposed system is the logical development of the current structure, which works well and, in turn, was developed for good reasons. We see much bigger risks to performance in breaking the current structure than in keeping it.
D. Britton8/Nov/2006GridPP3 PPRP Question-5 History of the Tier-2 Structure The current Tier-2s were formed naturally in response to local and regional funding opportunities and other geo-political considerations. Many assumed (used as leverage) a continuing relationship with the Particle Physics community. It is natural that all Particle Physics groups wished to be associated to a T2, but this was not a GridPP requirement. However, clearly it was uniformly perceived as beneficial for the local physicists and the institute. In GridPP1 there was no PPARC funding for Tier-2s and in GridPP2 there was PPARC funding for some manpower at Tier-2s (plus some specialised servers) but not for the bulk of the computing resources. Nevertheless large amounts of resources were made available. GridPP has interacted with four Tier-2 centres through their management boards. The overhead of having more than one site within the Tier-2 is, to first order, an internal choice (the JeS submission requirement for the GridPP3 proposal broke this model).
D. Britton8/Nov/2006GridPP3 PPRP Question-5 Current Status of Tier-2 Structure: There are currently 17 Institutes organised into 4 Distributed Tier-2s. Of the 17 Institutes, 4 have no GridPP manpower, 8 have less than one FTE and 5 have one or more FTEs of GridPP manpower. The total of 9 FTE funded by GridPP for hardware support (plus 5.5 FTE specialist posts) is clearly a very cost-effective situation given the 3703 KSI2K of CPU and 263 TB of disk available (06Q1 numbers). For comparison, the Tier-1 had 13.5 GridPP-funded FTE and made available 830 KSI2K and 180 TB in the same period. Performance measures are being developed (within GridPP and wLCG). The UK is probably ahead of the game here. There are more details in the written response but the UK Tier-2 performance is: - good relative to other countries; - improving even though the hurdles are getting higher; - on track to meet the MOU requirements.
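The cost-effectiveness comparison on this slide can be made concrete by dividing the quoted resources by the quoted GridPP-funded effort. The sketch below is illustrative only: it assumes the 5.5 FTE of specialist posts are counted alongside the 9 FTE of hardware support at the Tier-2s, and it ignores effort levered from the institutes, which the slide does not quantify. The dictionary and function names are not from the source.

```python
# Figures quoted on the slide (06Q1 numbers). FTE counts are GridPP-funded only.
TIER2 = {"fte": 9.0 + 5.5, "cpu_ksi2k": 3703.0, "disk_tb": 263.0}
TIER1 = {"fte": 13.5, "cpu_ksi2k": 830.0, "disk_tb": 180.0}


def per_fte(site: dict) -> dict:
    """Return CPU (KSI2K) and disk (TB) made available per GridPP-funded FTE."""
    return {
        "cpu_ksi2k_per_fte": site["cpu_ksi2k"] / site["fte"],
        "disk_tb_per_fte": site["disk_tb"] / site["fte"],
    }


tier2 = per_fte(TIER2)
tier1 = per_fte(TIER1)
cpu_ratio = tier2["cpu_ksi2k_per_fte"] / tier1["cpu_ksi2k_per_fte"]

print(f"Tier-2: {tier2['cpu_ksi2k_per_fte']:.0f} KSI2K/FTE, "
      f"{tier2['disk_tb_per_fte']:.1f} TB/FTE")
print(f"Tier-1: {tier1['cpu_ksi2k_per_fte']:.0f} KSI2K/FTE, "
      f"{tier1['disk_tb_per_fte']:.1f} TB/FTE")
print(f"Tier-2 CPU per funded FTE is about {cpu_ratio:.1f}x the Tier-1 figure")
```

The ratio says nothing about service levels, which differ between the Tier-1 and Tier-2 MOUs, so it should be read as a rough indicator of leverage rather than of efficiency.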
D. Britton8/Nov/2006GridPP3 PPRP Question-5 Future of Tier-2 Structure: GridPP proposes to continue to develop 4 Regional Tier-2 centres. GridPP would like to remain neutral on the number of sites and institutions within each Tier-2, and simply offer a package of hardware money and effort to each Tier-2 in return for the delivery of a specified quantity of resource and a specified service level. We believe this approach: 1) Allows a market-driven optimisation of resources according to constraints which are outside the control and knowledge of GridPP (e.g. other sources of funding; institutional priorities and strategies; prior commitments and aspirations). 2) Builds upon a system that is both viewed and measured as successful. 3) Is in the best interests of Physicists at all Institutes; allowing some small measure of local control whilst enabling Grid access to vast resources; and providing on-site expertise in as many places as possible.
D. Britton8/Nov/2006GridPP3 PPRP Question-5 Future of Tier-2 Structure: Alternate structures have been considered: 1)Fewer Tier-2s – foresee no advantage in having the same number of institutes associated with fewer Tier-2s. Clear disadvantages. 2)Fewer Institutes – Hardware and manpower costs remain the same; running and infrastructure costs likely to become more visible. Some gains in the efficiency of staff effort by concentration of resources (though this means less levered effort, not less GridPP effort; service level may be easier to achieve). May alienate some institutes; will result in less leverage of resources; will leave some institutes without local expertise. Conclude: It will cost more; deliver less resources; service level might be better but physicists less supported. Not the optimisation we chose. 3)Experiment-based Tier-2s – runs against the grain and would leave the UK at odds with the rest of the wLCG; not a sensible Grid structure and would limit peak resources available to individual Experiments. Would most likely lead to a divergence from standards and a fragmented UK Grid.
D. Britton8/Nov/2006GridPP3 PPRP Question-6 6. The Panel would like to explore the impact to the UK of leadership roles within LCG. What are the benefits and costs to the UK of this, particularly with regard to middleware? The Big Picture: Roles (e.g. Leadership) and duties (e.g. Middleware support) for the LCG project must be shared between the members. This allows the common project to benefit from all the available skills and expertise; it provides a contribution in kind that should broadly reflect the size of the contributing group; it demonstrates the engagement of all partners; and in return, it enables strategic influence and other tangible benefits. Performing duties gives us the credibility to take on leadership roles. Appendix-D of the proposal listed 86 external roles of members of GridPP within related projects, 17 of which are specifically LCG related, 22 are within EGEE, and a further 8 associated with computing within the LHC Experiment collaborations.
D. Britton8/Nov/2006GridPP3 PPRP Question-6 Specific Examples: a) David Kelsey: Coordinator of LCG Grid Security, Chair of Joint (LCG/EGEE/OSG) Security Policy Group and Deputy Director of EGEE Security. b) Jeremy Coles: Secretary of LCG Grid Deployment Board. c) John Gordon – UK Representative on LCG Management Board and a Deputy Chair. d) Neil Geddes – UK member of the LCG Oversight Board (OB) and LCG Collaboration Board Chair. e) EGEE: Project Executive Board: Frank Harris; Dave Kelsey, and previously Pete Clarke. Project Management Board Chair: Robin Middleton (to summer 06). Project Collaboration Board: Dave Colling; John Gordon; Jeff Tseng; Tony Doyle; and Roger Barlow. EGEE JRA1 (Middleware re-engineering) Cluster Leader (UK): Steve Fisher.
D. Britton8/Nov/2006GridPP3 PPRP Question-6 Related Examples : i) Nick Brook (formerly GridPP UB Chair and PMB member) is the LHCb computing coordinator. ii) Roger Jones (currently GridPP Applications Coordinator and PMB member) is the chair of the ATLAS International Computing Board. iii) Dave Newbold (formerly GridPP UB chair and PMB member) is the chair of the CMS Computing Committee. Conclude: That as a consequence of investment and hard work over the last five years, the current overall influence of the UK in the LHC Experiments is very high. This ultimately benefits UK physicists and has been a good investment.
D. Britton8/Nov/2006GridPP3 PPRP Question-7 7. Before making a recommendation to the office about the extension to GridPP2 the Panel would like more information about each of the posts and to know whether they are core activities. What are the implications of not funding these posts and what evidence is there that a delay in resolving this will lead to a loss of staff who might be expected to continue into GridPP3? Detailed information on the areas covered by the GridPP extension was provided in the GridPP3 proposal. Specific information on each individual post was provided on the Institutional JeS forms submitted to PPARC. All these posts are considered core to the current programme during the 7-month period of GridPP2+ when it will be necessary in the build-up of the Production Grid prior to LHC data-taking. It should be noted that funding for the applications posts was not requested but that many of these have not been funded on the RG leaving a serious shortage of effort. If not funded: We will lose our entire pool of highly skilled staff; the UK will not be ready for LHC data; much of the current work will be abandoned and large amounts of resources will have been wasted. Evidence: 25% turnover of staff since proposal submission, c.f. ~10% p.a. previously.
D. Britton8/Nov/2006GridPP3 PPRP Question-8 8. The Panel would like to see a full justification for each of the posts requested in GridPP3 and to see the cost to PPARC (including estates and indirect costs) of each post. A separate document has been provided for PPARC staff including full details extracted from the Institutional JeS submissions. This incorporates a compilation of the Institute submissions organised by work package, giving the justification and costs for each post that should be read in conjunction with the proposal and relevant appendices.
D. Britton8/Nov/2006GridPP3 PPRP Question-9 9. The Panel would like to explore the issues of quality assurance in both Tier-1 and Tier-2 activities. How will the applicants ensure that GridPP3 provides an adequate and cost-effective service to its users? The service levels at the Tier-1 and Tier-2 are defined by the International Memorandum of Understanding. The Tier-1/A Management Board, including PPARC representation, advises all stakeholders on whether the Tier-1/A Service at RAL is delivering its objectives on time and making appropriate use of its available resources. The main instrument for assuring quality and levels of service at the Tier-2s will be a new Memorandum of Understanding between GridPP and the institutes as described in the Tier-2 Appendix to the GridPP3 Proposal. This would set out the required levels of services in order for the UK to meet its WLCG MoU commitments and provide the necessary service to UK physicists. (continued…)
D. Britton8/Nov/2006GridPP3 PPRP Question-9 Quality Assurance is performed by monitoring the performance of the Tier-1 and Tier-2 compared to MoU commitments, and the performance compared to international partners. As previously described, monitoring is already advanced and being developed further. We currently monitor: - CPU and storage usage; - Site functional test; - Configuration tests; - Ticket response times; - Upgrade timescales; - Scheduled downtime; - VO support; - Transfer tests.
D. Britton8/Nov/2006GridPP3 PPRP Question-10 10. The Panel would like information on where the Tier-1 centre will be housed at RAL. - Is any construction or refurbishment of an appropriate building on the critical path for the GridPP project? - Will the centre have sufficient space available to meet GridPP's requirements? - What are the risks associated with this? - How will this be funded? The Atlas Centre at RAL has sufficient capacity to house the full GridPP3 requirements for 2008 LHC running as given in the proposal. CCLRC has approved construction of a new computer building at RAL, budgeted at approx £17M, which will be funded by the CCLRC Capital Investment Plan. Completion is due in summer of 2008 in time for the autumn delivery which will meet the 2009 data-taking requirements. This has sufficient space for capacity to grow to 2012 when the number of racks is expected to have reached a steady state.
D. Britton8/Nov/2006GridPP3 PPRP Question-10 The main risks are: a) Late completion. There is some slack in the schedules to meet the data taking requirements for April 2009 which mitigates this risk. b) Power and cooling required to deliver the required resources may exceed the estimates. This is mitigated by inclusion of chilled water mains in the new building to allow direct water cooling of the hottest racks if power densities exceed current estimates. c) Electricity charges for power and cooling, which are currently met by CCLRC overhead charges. It is possible that at some future time these may be attributed directly to GridPP. This is explicitly listed as a potential call on contingency in the GridPP3 proposal.
D. Britton8/Nov/2006GridPP3 SUMMARY 1. GridPP and the experiments have described the advantages of the proposed overarching GridPP model for operations. 2. The potential options for descoping the GridPP project are extremely limited, but we have provided input on the 3 scenarios. 3. The crucial GridPP areas have been described, taking the LHC experiment requirements fully into account. 4. The ATLAS and CMS planned trigger rates have converged and the computing models are similar. Residual differences have been identified. 5. The proposed Tier-2 structure and the pros and cons of other possible structures have been explored.
D. Britton8/Nov/2006GridPP3 SUMMARY 6. The impact to the UK of leadership and benefit of middleware roles within LCG and EGEE has been provided. 7. GridPP has collated the required financial information about each of the posts in the GridPP2+ and GridPP3 period, according to WP. 8. GridPP has collated the required post descriptions for each of the posts in the GridPP2+ and GridPP3 period, according to work package. 9. The mechanisms that have ensured quality assurance at the Tier-1 and Tier-2 have been described and the associated costs recognised in order to deliver a performant Grid to end-users. 10. GridPP has indicated that the computer building at RAL will be funded via CCLRC, is no longer on the critical path, and will provide sufficient space. Residual risks have been identified.
D. Britton8/Nov/2006GridPP3 BACKUP SLIDES
D. Britton8/Nov/2006GridPP3 Support Staff
D. Britton8/Nov/2006GridPP3 R-GMA Used by other middleware components: 1) CMS dashboard used by CMS (+ATLAS) production managers to monitor production. R-GMA used to pick up job information. 2) GridView monitoring of gridFTP transfers during service challenges. 3) Used to re-publish information from the Logging and Bookkeeping database (WLMS) for high-level monitoring tools (ATLAS, BaBar, CMS). 4) Aggregates Worker Node data on the state of running jobs for use by higher level applications such as VO monitoring tools (eg. CMS Dashboard). 5) NPM network monitoring framework uses R-GMA published information and makes information available via a web service. 6) Grid Ireland uses R-GMA as part of intrusion detection work. 7) Service Discovery is an API to various information system backends, one of which is R-GMA. 8) APEL accounting package uses R-GMA to aggregate usage records. 9) The Real Time Monitor can be configured to use R-GMA to avoid direct access to the L&B database. 10) Externally used in (at least) 4 projects associated with Trinity College.
D. Britton8/Nov/2006GridPP3 GridSite A security toolkit developed within GridPP to enable secure access to resources. GACL (Grid Access Control Language) used by gLite Logging and Bookkeeping service and by the WMProxy service through which all jobs enter EGEE/LCG sites. Mod_gridsite is an Apache extension used as the basis of WMProxy service to provide verification of the user credentials. Delegation code allows jobs to authenticate to remote file servers and data catalogues. WLMS uses it to copy files from the CE. That is, GridSite is deeply embedded in the gLite middleware. ATLAS and CMS also use various components. GridSite is used for page management and access control to various websites (GridPP, GOG, Grid Ireland, NGS) and Grid-Wikis (GridPP, NGS).
D. Britton8/Nov/2006GridPP3 Storage No single solution to SRM interface requirements: - CASTOR – required for tapestore at RAL. No other solution scales; - dCache – needed to meet needs of larger sites but requires dedicated effort; - DPM – ideal for smaller sites; - Remaining gap for distributed file systems? Storage group coordinates across GridPP to ensure that at least the different institutes are running the same versions of the middleware (reduces conflicts) and provides support (removes the need for local expertise at all sites). Provides international leadership (largest national deployment in the wLCG) and currently developing SRM-2 interface to CASTOR (buy-in; will need to be maintained).
D. Britton8/Nov/2006GridPP3 WLMS WLMS testing and performance monitoring – extracts data on individual jobs to compile statistics on (1) WLMS performance, (2) job throughput, and (3) site efficiency. (1) Is valuable to WLMS developers; (2) helps measure/monitor the performance of the Grid as a whole; and (3) is of particular benefit to users trying to optimise usage. Information is used by the Experiment Dashboards; Information extraction scripts need on-going support. Potentially rich source of information for users currently underdeveloped. Area becomes more critical as Grid increases in size. Currently recognised as international leaders; strong encouragement to continue.
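The kind of per-job statistics described on the WLMS slide can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the record layout, the status strings ("Done", "Aborted"), the site names, and the definition of site efficiency as the fraction of jobs at a site that completed successfully. It is not the actual GridPP extraction script.

```python
from collections import defaultdict

# Hypothetical job records; field names and status values are illustrative only.
jobs = [
    {"site": "site-A", "status": "Done"},
    {"site": "site-A", "status": "Done"},
    {"site": "site-A", "status": "Aborted"},
    {"site": "site-B", "status": "Done"},
]


def site_efficiency(records):
    """Fraction of jobs per site that finished with status 'Done' (assumed definition)."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for job in records:
        totals[job["site"]] += 1
        if job["status"] == "Done":
            successes[job["site"]] += 1
    return {site: successes[site] / totals[site] for site in totals}


def throughput_per_hour(records, window_hours):
    """Successfully completed jobs per hour over an observation window."""
    done = sum(1 for job in records if job["status"] == "Done")
    return done / window_hours


efficiency = site_efficiency(jobs)
throughput = throughput_per_hour(jobs, window_hours=2)
print(efficiency, throughput)
```

Keeping the per-site counts separate from the Grid-wide throughput mirrors the distinction the slide draws between site efficiency and the performance of the Grid as a whole.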
D. Britton8/Nov/2006GridPP3 Operational Security Duties include (see Middleware appendix for more extensive list): Provide expert support and advice to the GridPP Deployment team and the Tier1/Tier2 centres in all aspects of Grid security with the aim of achieving secure and reliable operation of the GridPP production Grid. Monitor operating system, network and Grid middleware vulnerabilities, in collaboration with Grid Operation Centres and Operational Security teams, and advise GridPP system administrators, middleware developers and application developers on appropriate action. Lead and/or coordinate the investigation and response to security incidents in GridPP and act as the interface to other Grids involved. Investigate, develop and deploy appropriate security tools, procedures and new technologies in collaboration with other Grid projects.
D. Britton8/Nov/2006GridPP3 VOMS Virtual Organisation Management Service (VOMS) manages group and role information for VO members and provides a mechanism to embed this information securely in a user's proxy. VOMS server run by GridPP in association with NGS. Needs to be operated in a highly secure manner.
D. Britton8/Nov/2006GridPP3 Security International Coordination Interoperation of Grids depends on an agreed security infrastructure (policy and procedure in addition to technical requirements). Essential to wLCG. Trust between Grids, Sites and VOs depends on agreeing registration procedures; mutually acceptable security policies; and procedures. Led by Dave Kelsey from GridPP (Coordinator of LCG Grid Security, Chair of Joint (LCG/EGEE/OSG) Security Policy Group and Deputy Director of EGEE Security). Clear benefit to the UK users by having a leading voice in the shaping of these policies and procedures.
D. Britton8/Nov/2006GridPP3 RealTime Monitor
D. Britton8/Nov/2006GridPP3 Power Costs New inputs since proposal was submitted: - 7p/kWh in 2008 (Gov. Energy Review) - 260W/system for CPU until 2008 then inflating at 5% - 700W/system for disk in 2007 and then inflating at 5% (inflators are industry standard from ASHRAE) Estimate ~£3m to £3.4m for full Tier-1 hardware depending on installation date and details of the model. 85% of hardware in the 70% scenario implies about £2.5m - £2.9m. Tier-2 running costs assumed to be covered (though there is explicit contingency on the Tier-2 hardware, over and above the price uncertainty, to address the risk that this model breaks down).
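The arithmetic behind the power-cost inputs can be sketched in a few lines of Python. This is a rough illustration rather than the GridPP cost model: it assumes each system draws its nominal power continuously for a full year, ignores cooling overhead, treats the 5% inflator as annual compounding from the stated base year, and uses hypothetical function names. The number of systems is not given on the slide, so only per-system figures and the 85% scaling are computed.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def annual_cost_per_system_gbp(watts, pence_per_kwh=7.0):
    """Electricity cost in pounds for one system running flat out for a year (assumption)."""
    kwh_per_year = watts / 1000 * HOURS_PER_YEAR
    return kwh_per_year * pence_per_kwh / 100


def watts_in_year(base_watts, base_year, year, inflator=0.05):
    """Per-system power for a given year, assuming annual compounding after base_year."""
    years_elapsed = max(0, year - base_year)
    return base_watts * (1 + inflator) ** years_elapsed


def scaled_range(low, high, fraction):
    """Scale a cost range by the fraction of hardware actually installed."""
    return low * fraction, high * fraction


cpu_cost = annual_cost_per_system_gbp(260)   # CPU system at 260 W
disk_cost = annual_cost_per_system_gbp(700)  # disk system at 700 W
low, high = scaled_range(3.0, 3.4, 0.85)     # 85% of the £3m to £3.4m estimate, in £m

print(f"CPU system: £{cpu_cost:.2f}/year, disk system: £{disk_cost:.2f}/year")
print(f"85% of hardware: £{low:.2f}m to £{high:.2f}m")
```

Scaling the £3m to £3.4m estimate by 85% gives a range consistent with the "about £2.5m - £2.9m" quoted on the slide.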
D. Britton8/Nov/2006GridPP3 B. Interoperability GridPP/NGS meeting - Nottingham EMCC, September 2006 Present: Tony Doyle, David Britton, Paul Jeffreys, David Wallom, Robin Middleton, Andy Richards, Stephen Pickles, Steven Young, Dave Colling, Peter Clarke, Neil Geddes Agenda: 1. Ultimate goals and the model for achieving them and any constraints 2. Timetables 3. Required software (in both directions)
D. Britton8/Nov/2006GridPP3 B. Interoperability The current "minimal software stack" approach of NGS is being reviewed as a greater variety of partner resources are considered (data centres and research facilities). Different "stacks" will be relevant to different sorts of partners, i.e. there is likely to be a range of "NGS Profiles". For the foreseeable future, NGS is likely to exist in a world with multiple parallel software stacks and it will not be possible to merge them. Installing parallel stacks or profiles is not a problem if they are easy to install and do not interfere. One possibility is that the different NGS profiles would reflect different stacks such as GT4 or gLite. Operations: can we present accounting information consistently?
D. Britton8/Nov/2006GridPP3 B. Interoperability Next steps/timetable GridPP3 MoUs - No action required. Can wait until next year and should be informed by lessons learned over the next 6-12 months. GridPP sites currently meet the minimal requirements for NGS through the standard GridPP installations. If Sites enable the NGS VO then this effectively gives NGS affiliation if they wish. Formal Affiliation would, however, require that the interface be monitored by NGS. Agreed that the next step should be to understand in detail what is actually required for NGS partnership.
D. Britton8/Nov/2006GridPP3 B. Interoperability Next steps/timetable Agreed to focus on two sites, Glasgow and LeSC. Aim to be ready to achieve NGS partnership by Christmas. The decision as to whether or not to actually apply for formal partnership can be left to later in the year. The principal goal is to understand the steps and requirements etc. It was agreed that NGS should provide a gLite CE for core NGS nodes which would allow the nodes to be a part of the EGEE/LCG SAM infrastructure. Accounting and monitoring are areas which are still developing and where it is not clear what the best solution is (for NGS). Meet once more before Christmas.