
1 GridPP Status Report David Britton, 15/Sep/09

2 Introduction
Since the last Oversight:
– The UK has continued to be a major contributor to wLCG.
– A focus on resilience and disaster management (GridPP22).
– The UK infrastructure has been validated by STEP09.
– Moved the Tier-1 to R89.
– Procured significant new hardware.
– Adapted to developments in the LHC schedule, the EGI+ proposals, and the UK funding constraints.
Issues from the last Oversight: "Other Experiments"; EGI/NGI/NGS etc.; CASTOR; the OPN network.
To be covered by the Project Manager: Project Milestones/Deliverables; Project Risks; Project Finances.

3 WLCG: the largest scientific Grid in the world
September 2009: >315,000 KSI2K worldwide; 288 sites in 55 countries; ~190,000 CPUs.
In the UKI: 22 sites and about 19,000 CPUs.

4 UK CPU Contribution
The picture is the same if non-LHC VOs are included.

5 UK Site Contributions (as of 8/Sep/09)
Share of CPU delivered, 2007 / 2008 / 2009:
– NorthGrid: 34% / 22% / 15%
– London: 28% / 25% / 32%
– ScotGrid: 18% / 17% / 22%
– Tier-1: 13% / 15% / 13%
– SouthGrid: 7% / 16% / 13%
– GridIreland: 0% / 6% / 5%
All areas of the UK make valuable contributions. "Other VOs" used 16% of the CPU time this year.

6 UK Site Contributions: Non-LHC VOs
All regions supported the "Other VOs". The top-12 "Other VOs" span many disciplines.

7 Tier-2 Resources (as of 1/Apr/09)
The Tier-2s have delivered (Brunel is currently installing 600TB of disk). Accounting error: 230TB delivered.

8 Tier-2 Performance (resource-weighted averages, as of 8/Sep/09)
The Tier-2s have improved and are performing well.
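The averages are resource-weighted, i.e. each site's metric counts in proportion to the resource it provides. A minimal sketch of that calculation follows; the site names and figures are hypothetical, not the actual GridPP numbers:

```python
# Minimal sketch: resource-weighted average of a site metric (e.g. availability).
# Each site's metric is weighted by the resource it provides (e.g. installed KSI2K),
# so large sites influence the Tier-2 average more than small ones.
# Site names and numbers below are hypothetical, for illustration only.

sites = {
    # name: (resource in KSI2K, availability as a fraction)
    "SiteA": (1200, 0.97),
    "SiteB": (800, 0.92),
    "SiteC": (300, 0.85),
}

total_resource = sum(res for res, _ in sites.values())
weighted_avg = sum(res * avail for res, avail in sites.values()) / total_resource

print(f"Resource-weighted availability: {weighted_avg:.2%}")
```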

9 Service Resilience (GridPP23 Agenda)
A sustained push was made on improving service resilience at all levels. Many improvements were made at many sites and, ultimately, STEP09 demonstrated that the UK Grid was ready for data (see later slide). Disaster management processes were developed and are regularly engaged (see later slide).

10 STEP09 UK Highlights
– RAL was the best ATLAS Tier-1 after the BNL ATLAS-only Tier-1.
– Glasgow ran more jobs than any of the 50-60 ATLAS Tier-2 sites throughout the world.
– Tier-2 sites made good contributions and were tuning (not fire-fighting) during STEP09 and subsequent testing.
– Quote: "The responsiveness of RAL to CMS during STEP09 was in stark contrast to many other Tier-1s."
– CMS noted the tape performance at RAL was very good, as was the CPU efficiency (CASTOR 2.1.7 worked well).
– Many (if not all) of the metrics for the experiments were met, and in some cases significantly exceeded, at RAL during STEP09.

11 STEP09: RAL Operations Overview
Generally very smooth operation:
– Most service systems were relatively unloaded, with plenty of spare capacity.
– Calm atmosphere. The daytime "production team" monitored the service. Only one callout; most of the team even took two days off site for a department meeting!
– Very good liaison with the VOs and a good idea of what was going on; in regular informal contact with UK representatives.
– Some problems with CASTOR tape migration (3 days) on the ATLAS instance, but all were handled satisfactorily and fixed, and did not visibly impact the experiments. The robot broke down for several hours (a stuck handbot led to all drives being de-configured in CASTOR); the backlog was caught up quickly.
A very useful exercise: we learned a lot, and it was very reassuring.
– More at: http://www.gridpp.rl.ac.uk/blog/category/step09/

12 STEP09: RAL Batch Service
The farm typically ran >2000 jobs. By 9th June it was at equilibrium (ATLAS 42%, CMS 18%, ALICE 3%, LHCb 20%).
– Problem 1: ATLAS job submission exceeded 32K files on the CE. See the hole on the 9th; we thought ATLAS had paused, so it took time to spot.
– Problem 2: Fair shares were not honoured because aggressive ALICE submission beat ATLAS to job starts. We needed more ATLAS jobs in the queue faster, so ALICE was manually capped. Fixed by 9th June; see the decrease in (red) ALICE work.
– Problem 3: Occupancy was initially poor (90%). The farm was short on memory (2GB/core, but ATLAS jobs needed 3GB vmem). We gradually increased the MAUI over-commit on memory to 50%, and occupancy rose to 98% (see the sketch below).
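To illustrate why the over-commit recovered the occupancy, here is a minimal sketch of the memory accounting; the worker-node core count is a hypothetical assumption and this is not the actual MAUI configuration, only the constraint it effectively relaxed:

```python
# Minimal sketch of the memory over-commit arithmetic, not the actual MAUI config.
# With 2 GB of physical memory per core, a scheduler that only counts physical
# memory cannot place a 3 GB-vmem ATLAS job on every core, so cores sit idle.
# Letting the scheduler over-commit memory by 50% makes 3 GB per core
# schedulable, so occupancy recovers.

CORES_PER_NODE = 8            # hypothetical worker-node size
PHYS_MEM_PER_CORE_GB = 2.0    # from the slide
JOB_VMEM_GB = 3.0             # ATLAS job requirement from the slide

def schedulable_jobs(overcommit_fraction: float) -> int:
    """Jobs per node if the scheduler sees physical memory * (1 + over-commit)."""
    usable_mem = CORES_PER_NODE * PHYS_MEM_PER_CORE_GB * (1 + overcommit_fraction)
    return min(CORES_PER_NODE, int(usable_mem // JOB_VMEM_GB))

for oc in (0.0, 0.5):
    jobs = schedulable_jobs(oc)
    print(f"over-commit {oc:.0%}: {jobs}/{CORES_PER_NODE} cores occupied "
          f"({jobs / CORES_PER_NODE:.0%})")
# over-commit 0%:  5/8 cores occupied (62%)
# over-commit 50%: 8/8 cores occupied (100%)
```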

13 Data Transfers
RAL achieved the highest average input and output data rates of any Tier-1.

14 OPN Resilience

15 Current Issues: R89
In the end, the hand-over to STFC was delayed from Dec to Apr 09. Hardware was delayed, but we were (almost) rescued by the LHC schedule change. Minor (?) issues remain with R89 (aircon trips; water-proof membrane?) (GridPP22).

16 Tier-1 Hardware
– The FY2008 hardware procurement had to await the acceptance of R89.
– The CPU is tested, accepted, and being deployed (14,000 HEPSPEC06 to add to the current 19,000).
– The disk procurement (2 PB to add to the existing 1.9 PB) was split into two halves (different disks and controllers, to mitigate against acceptance problems). This has proved sensible, as one batch has demonstrated ejection issues. One half of the disk is being deployed; progress is being made on the other half, and the best guess is deployment by the end of November.
– A second SL8500 tape robot is available.
– The FY09 hardware procurement is underway.

17 Disaster Management
A four-stage disaster management process was established at the Tier-1 earlier this year as part of our focus on resilience and disaster management. It is designed to be used regularly so that the process stays familiar, which means a low threshold for triggering Stage-1 "disasters". At Stage-3, the process formally involves stakeholders outside the Tier-1, including GridPP management. This has now happened several times, including:
– R89 aircon trip
– R89 water leak
– Disk procurement problem
– Swine flu planning.
The process is still being honed, but I believe it is very useful.
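As an illustration of the kind of staged escalation described here, a minimal sketch follows. The slide only states that there are four stages, that Stage-1 has a deliberately low trigger threshold, and that Stage-3 formally involves stakeholders outside the Tier-1 (such as GridPP management), so the stage semantics and notification lists below are hypothetical:

```python
# Hypothetical sketch of a four-stage escalation model; not the actual Tier-1 procedure.
from enum import IntEnum

class Stage(IntEnum):
    STAGE_1 = 1  # low threshold: declared routinely so the process stays familiar
    STAGE_2 = 2
    STAGE_3 = 3  # formally involves stakeholders outside the Tier-1
    STAGE_4 = 4

# Illustrative notification lists only.
NOTIFY = {
    Stage.STAGE_1: ["tier1-oncall"],
    Stage.STAGE_2: ["tier1-oncall", "tier1-management"],
    Stage.STAGE_3: ["tier1-oncall", "tier1-management", "gridpp-management"],
    Stage.STAGE_4: ["tier1-oncall", "tier1-management", "gridpp-management", "stfc"],
}

def escalate(incident: str, stage: Stage) -> None:
    """Record an incident at the given stage and list who is notified."""
    print(f"{incident}: Stage-{stage.value} declared, notify {', '.join(NOTIFY[stage])}")

escalate("R89 aircon trip", Stage.STAGE_3)
escalate("Disk procurement problem", Stage.STAGE_3)
```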

18 EGI/NGI
[Diagram: EGI (coordinating body in Amsterdam); NGIs (national initiatives in member countries); UK-NGI; GridPP; NGS.]
The UK-NGI involves STFC, EPSRC and JISC (at least). EGI is vital to GridPP, but it is not GridPP's core business to run an e-science infrastructure for the whole of the UK: we seek a middle ground.

19 EU Landscape
[Diagram: EGI; EMI (Unicore, ARC, gLite); Heavy Users SSC; SSCs (Roscoe).]
– UK involvement with Ganga?
– UK involvement via the UK NGI with global tasks such as GOCDB, security, dissemination, training...
– UK involvement with APEL, GridSite? ...
– UK involvement: FTS/LFC support post at RAL?

20 User Support
Help pages. GridPP23 talks. User survey at RAL.

21 Actions
– OPN: Detailed document provided. The cost is covered by existing GridPP hardware funds. We propose to proceed immediately to provision.
– Other Experiments: Usage is shown on Slide 6. The Allocation Policy is on the User Board web pages: http://www.gridpp.ac.uk/eb/allocpolicy.html
– EGI/NGI/NGS: Paper provided. GridPP/UK has established potential links with all the structural units and is engaged in the developments.
– CASTOR: Paper provided. Version 2.1.7, used during STEP09, worked well beyond the levels needed. The 2.1.8 upgrade is becoming an issue.

22 Current Issues
Operational:
– Timing of the CASTOR 2.1.8 upgrade.
– Shake-down issues with R89.
– Problem with 50% of the current disk purchase.
High level:
– Hardware planning: lack of clarity on approved global resources.
– Hardware pledges: financial constraints and the 2010 pledges.
– GridPP4: lack of information on scope, process or timing against a backdrop of severe financial problems within STFC.

23 Key issue in the next six months
To receive a sustained flow of data from CERN and to meet all the experiment expectations associated with custodial storage, data reprocessing, data distribution, and analysis. This requires:
– A resilient OPN network
– Stable operation of CASTOR storage
– Tier-1 hardware and services
– Tier-1 to Tier-2 networking
– Tier-2 hardware and services
– Help, support, deployment and operations.
That is, the UK Particle Physics Grid. The milestones necessary to meet these requirements have been met (with the possible exception of the first) and the entire system has been validated with STEP09. We believe the UK is ready. We know that problems will arise and have focused on resilience to reduce the incidence of these, and on disaster management to handle those that do occur.

24 The End

25 Schedule
It is foreseen that the LHC will be ready for beam by mid-November. Before that:
– All sectors powered separately to operating energy ++
– Dry runs of many accelerator systems (from Spring): injection, extraction, RF, collimators, controls
– Full machine checkout before taking beam
– Beam tests: TI8 (June), TI2 (July), TI2 and TI8 interleaved (September)
– Injection tests (late October)

26 1/Apr/09

