GridPP project management Sarah Pearce 15 September 2009 GridPP Oversight Committee.

Project Map GridPP3 Q2 09

Project map - statistics

Metrics / Milestones              Q208  Q308  Q408  Q109  Q209
Metric OK                           99   142   155   172   184
Metric close to target              24    47    39    32    22
Metric not OK                       41    32          21    27
Not able to be measured             27    22    11    10     3
Milestone achieved                  11    22    32    42    57
Milestone overdue                    2     7    13    17     4
Milestone not due / metric n/a     101    80    69    60    58
Suspended                            0     6     6     9    12
Awaiting input                      34     5    12    10     3
Total                              339   363   369   373   370
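The Total row above can be cross-checked by summing each quarter's column. The sketch below does this in Python; the names `ROWS`, `TOTALS` and `check_columns` are illustrative and not part of any GridPP tooling. The source gives only four values for "Metric not OK", and the Q408 cell is treated here as the missing one (an assumption), so it is entered as `None`. The gap reported for that quarter is only what the stated total would imply, not a reported figure.

```python
# Values transcribed from the project map statistics slide.
# None marks the one cell assumed to be missing from the source.
QUARTERS = ["Q208", "Q308", "Q408", "Q109", "Q209"]

ROWS = {
    "Metric OK": [99, 142, 155, 172, 184],
    "Metric close to target": [24, 47, 39, 32, 22],
    "Metric not OK": [41, 32, None, 21, 27],
    "Not able to be measured": [27, 22, 11, 10, 3],
    "Milestone achieved": [11, 22, 32, 42, 57],
    "Milestone overdue": [2, 7, 13, 17, 4],
    "Milestone not due / metric n/a": [101, 80, 69, 60, 58],
    "Suspended": [0, 6, 6, 9, 12],
    "Awaiting input": [34, 5, 12, 10, 3],
}
TOTALS = [339, 363, 369, 373, 370]


def check_columns(rows, totals):
    """Return, per quarter, the stated total minus the sum of the known cells."""
    gaps = []
    for i, total in enumerate(totals):
        known = sum(values[i] for values in rows.values() if values[i] is not None)
        gaps.append(total - known)
    return gaps


gaps = check_columns(ROWS, TOTALS)
for quarter, gap in zip(QUARTERS, gaps):
    # A gap of 0 means the column adds up to the stated total.
    print(quarter, gap)
```

A non-zero gap in a column with no missing cell would point to a transcription error. In the column with the missing cell, the gap is the value that would make the total consistent.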

Red metrics

LHCb
  1.2.2 - MC production (generation) efficiency
  1.2.3 - T1 MC production (reconstruction, stripping) efficiency
  1.2.4 - T1 MC/Event user analysis - UK efficiency
  1.2.11 - LHCb SAM tests uptime T1
  1.2.23 - Keep LHCb GANGA user training material updated

Operations
  2.1.3 - Proportion of available jobslots used
  2.1.6 - Job success rates
  2.1.10 - GridPP deployment web-pages up-to-date - review underway

Tier-1
  3.1.8 - Availability of CE
  3.2.11 - Farm Occupancy
  3.2.13 - Quarterly report not available
  3.4.4 - % met of UB Allocation for Disk
  3.4.8 - CASTOR SAM tests: LHC VOs

Tier-2s
  SAM availability and reliability tests: LondonGrid (4.1.3, 4.1.4)
  Metric 5 - SLL ATLAS test performance: LondonGrid and SouthGrid
  4.2.6 - ScotGrid average SLL SE test performance
  Metrics 7 & 8 - CPU utilisation (wall clock time & CPU time): LondonGrid, SouthGrid
  Metric 9 - % of disk used: ScotGrid, SouthGrid
  4.4.11 - Number of management meetings: NorthGrid
  4.1.14 - Middleware upgrading: LondonGrid

Project execution
  5.2.9 - CB meetings

Overdue milestones

Front end systems
  3.1.22 - LHC Monitoring infrastructure operational at RAL - waiting on work by Dante

Resource delivery
  3.2.16 - Disaster and Business Continuity Plan Available
  3.2.18 - Disaster Plan fully implemented
  New disaster management system is operational and working well, but some contingency plans remain to be completed.

Storage systems
  General ADS Service Ends. Not been a priority but closure process has started.

Key milestones / deliverables

Main requirement will be to deliver for the experiments once LHC data starts - measured by a combination of metrics along the top of the ProjectMap.

Milestone no.  Description                                                                  Owner            Deadline
3.3.3          Tier-1 able to meet 2009 WLCG MoU resource commitment                        Andrew Sansum    31/08/2009
6.2.3          EGI Transition Planning for GridPP                                           Robin Middleton  01/10/2009
5.1.9          Allocations calculated for round 2 of Tier-2 hardware grants                 Steve Lloyd      31/10/2009
6.2.5          Agreement with NGS/NGI on partition of services between GridPP and NGS/NGI   Robin Middleton  01/11/2009
5.1.8          Post-GridPP planning initiated                                               Dave Britton     01/01/2010
3.3.10         2010 Disk Tender Started                                                     Martin Bly       02/01/2010
3.3.21         2010 CPU Tender Started                                                      Martin Bly       02/01/2010
5.1.10         Grants for 2009 Tier-2 hardware issued                                       Sarah Pearce     31/03/2010
5.1.11         Grants for 2010 Tier-2 hardware issued                                       Sarah Pearce     30/04/2010

Risk register

Top 5 risks (I)

R1: Recruitment/retention difficulties
Likelihood (Li) 3, Impact (Im) 3, Risk 9. Owner: SP
Current process for managing risk:
  1. Universities encouraged to advertise early (though some won't allow this until grants are received)
  2. Tier-1 staff are on long-term contracts so retention is better.
  3. Tier-2 coordinators can to some extent cover for missing Tier-2 support posts
  4. Ensure staff remain motivated
  5. Building likelihood of turnover and unfilled posts into staffing model, especially at the Tier-1
Future mitigation option:
  1. Use contract staff for well defined tasks and short periods
  2. Try to establish future funding early, to aid with retention
  3. Escalate to PMB and/or Director e-Science at RAL
Costs: Possible extra costs for employing contract staff - but could be offset by underspend in other posts.

R5: Service insufficiently resilient wrt storage
Likelihood (Li) 2, Impact (Im) 4, Risk 8. Owner: JC
Current process for managing risk:
  1. Tier-1 storage review analysed issues with CASTOR
  2. Extra staff member has been appointed in D/B CASTOR area.
  3. Monitoring of available disk space.
  4. Procurements take account of experiment requests.
Future mitigation option:
  1. Attention to be concentrated on the Oracle database system to ensure that it operates at an appropriate load and to engage Oracle better.
  2. Have T1 excess disk capacity available at short notice (i.e. procure more or make data center arrangements).
  3. Experiment experts embedded with CASTOR team
Costs: 1 extra staff member employed for FY09 and FY10. Estimated cost of 140k met by reprofiling RAL staff costs.

R10: Hardware resources inadequate/insufficient
Likelihood (Li) 2, Impact (Im) 4, Risk 8. Owner: GP
Current process for managing risk:
  Quarterly review of resources and priorities at UB meetings.
  Weekly review of storage resources at Castor meetings.
  Ability to redefine intra-experiment CPU fairshares at short notice.
Future mitigation option:
  Purchase more hardware and/or improve profiling and procurement.
  Reduce non-LHC experiment resources.
  Agree programme priorities through PMB and STFC.
Costs: If necessary, we would aim to reprofile Tier-1 hardware funds to meet requirements.

Top 5 risks (II)

R12: Machine room problems compromise Tier-1
Likelihood (Li) 4, Impact (Im) 3, Risk 12. Owner: AS
Current process for managing risk:
  1. Two separate issues appearing in the week of Aug 10th (an aircon failure and a small water leak due to condensation) are currently being managed by the Tier-1 Disaster Management protocol.
  2. Ensure involvement at all levels of project. e-Science department is project sponsor.
Future mitigation option:
  1. Re-certification of the machine room environment by an independent consultant.
  2. Return Tier-1 to the ATLAS building.
  3. If problem occurs after migration has been completed (for example air conditioning unreliable), seek remedy from builder. Rent air conditioning units if necessary. Run critical services from UPS generator or move small volumes back to old building.

R14: Network/OPN breakage
Likelihood (Li) 2, Impact (Im) 4, Risk 8. Owner: PC
Current process for managing risk:
  1. Existing practice for network outages, i.e. "talk with your neighbour". For the OPN, the link layer neighbour is JANET(UK), the routing layer neighbour is CERN.
  2. LHCOPN Operational Handbook describes and defines responsibilities perationalModel
  3. Plan for N Gbit/s backup provision.
Future mitigation option:
  1. Fully funded second N Gbit/s backup provision across a diverse route provides fallback routing on link failure
  2. Disaster recovery plans exist for the RAL network components that are used on the path from the OPN to the Tier 1.
  3. Service Continuity Plan exists to manage a "wider crisis" (separate document)
Costs: GridPP proposes to spend £52k of our existing hardware budget to install a backup link, supported by a recurrent cost of between £40k and £60k per annum, depending on negotiations about the end-point costs.
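In all five entries above, the Risk score equals the product of the two preceding scores (3 x 3 = 9, 2 x 4 = 8, 4 x 3 = 12). The sketch below reproduces that arithmetic and ranks the risks by score. It assumes that the "Li" and "Im" columns stand for likelihood and impact and that the score is their product; the `Risk` class and its field names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    risk_id: str
    name: str
    likelihood: int  # "Li" column (assumed to mean likelihood)
    impact: int      # "Im" column (assumed to mean impact)

    @property
    def score(self) -> int:
        # Assumption: risk score = likelihood x impact, which matches
        # the values given for all five risks in the register.
        return self.likelihood * self.impact


RISKS = [
    Risk("R1", "Recruitment/retention difficulties", 3, 3),
    Risk("R5", "Service insufficiently resilient wrt storage", 2, 4),
    Risk("R10", "Hardware resources inadequate/insufficient", 2, 4),
    Risk("R12", "Machine room problems compromise Tier-1", 4, 3),
    Risk("R14", "Network/OPN breakage", 2, 4),
]

# sorted() is stable, so risks with equal scores keep their register order.
ranked = sorted(RISKS, key=lambda r: r.score, reverse=True)

for risk in ranked:
    print(f"{risk.risk_id:4} {risk.score:3}  {risk.name}")
```

Under this assumption R12 ranks highest with a score of 12, followed by R1 with 9. The three risks scoring 8 keep their register order.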

Finances

£335k of Tier-1 hardware rolled over from FY08 to FY09, as a result of planning for LHC reschedule and R89 delay.
£1m of Tier-1 hardware delayed until FY10, at request of STFC.
At request of STFC, most Tier-2 hardware grants should be in early FY10; a small number of sites require them in FY09.

Staffing

Some areas have not finished recruiting, so funded effort is below that expected.
But in all cases this is more than compensated by unfunded effort.

