Tony Doyle: Overview of the UK Development and Deployment Programme, LCG PEB Meeting, CERN, 16 September 2003
Tony Doyle - University of Glasgow
Outline
– Management
– The Project Map
– GridPP Status
– UK Grid Users
– Deployment – LCG and UK perspective
– Current Resources
– EDG 2.0/LCG 1.0 Deployment Status
– Accounting
– Today's Operations
– Future Operations Planning
– Middleware Status
– Middleware Evolution
– GridPP2 Planning Status
GridPP in Context
[Diagram (not to scale): GridPP shown in the context of the Institutes, the Core e-Science Programme, the Grid Support Centre, CERN LCG and EGEE, with GridPP's Tier-1/A, Tier-2, Middleware, Experiments, Apps Dev and Apps Int components]
GridPP Management
– CB (Collaboration Board, 20 members) meets half-yearly to provide an overview from the institutes.
– PMB (Project Management Board, 12 members) meets weekly by videoconference to manage the project.
– TB (Technical Board, 10 members) meets as required in response to technical needs, and regularly by phone.
– EB (Experiments Board, 14 members) meets quarterly to provide input from the experiments.
GridPP Project Overview
Financial Breakdown
Five components:
– Tier-1/A = Hardware + 10 CLRC e-Science staff
– DataGrid = 25 DataGrid posts, including CLRC PPD staff
– Applications = 17 experiment posts (to interface middleware)
– Operations = Travel (~100 people) + Management + Early Investment
– CERN = 25 LCG posts + Tier-0 + LTA
Quarterly Reporting
– Quarterly reporting allows delivered effort to be compared with expected effort.
– It provides a feedback loop as issues arise.
Funded Effort Breakdown (Snapshot 2003Q3)
– LCG effort is the largest single area of GridPP.
– Future project priorities are focussed on LCG and EGEE.
GridPP Status: The Project Map
GridPP Status: Summary
– GridPP1 has now completed 2 of its 3 years.
– All metrics are currently satisfied.
– 103 of 182 tasks are complete.
– 70 tasks are not yet complete but are not overdue.
– 9 tasks are overdue:
  – 6 are associated with LCG: 2 of these are trivial (definition of future milestones) and 4 are related to the delay in LCG-1.
  – 2 are associated with applications (CMS and D0).
  – 1 is associated with the UK infrastructure (test of a heterogeneous testbed).
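The three status categories on this slide should partition the full task list. A quick consistency check in Python, using only the counts quoted above (the variable names are our own):

```python
# Task counts from the GridPP1 status summary (2003Q3).
TOTAL_TASKS = 182
complete = 103
in_progress = 70  # not yet complete, but not overdue
overdue = {"LCG": 6, "Applications (CMS, D0)": 2, "UK infrastructure": 1}

# Complete + in progress + overdue should account for every task.
n_overdue = sum(overdue.values())
assert complete + in_progress + n_overdue == TOTAL_TASKS

# The 6 LCG items split into 2 trivial ones and 4 caused by the LCG-1 delay.
lcg_breakdown = {"trivial (future milestones)": 2, "LCG-1 delay": 4}
assert sum(lcg_breakdown.values()) == overdue["LCG"]

print(f"{complete}/{TOTAL_TASKS} complete ({complete / TOTAL_TASKS:.0%}), {n_overdue} overdue")
```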
Risk Register (Status April 2003)
– Scaling up to a production system (LCG-1 deployment)
– System management effort at UK Tier-2 sites (being addressed as part of GridPP2)
UK Certificates and VO Membership
1. The UK e-Science CA is now used in the production EDG testbed.
2. PP users are engaged from many institutes.
3. The UK is participating in 6 of the 9 EDG VOs.
UK Deployment Overview
– Significant UK resources are within EDG and are currently being upgraded to EDG 2.
– Integrating EDG on farms has been done many times, but it remains difficult.
– Sites are currently keen to take part in EDG 2, with LCG 1 deployment to follow.
– By the end of the year, many HEP farms plan to be contributing to LCG 1 resources.
Basis of the UK deployment input to the LCG plan: input from the Tier-1 (~50%) initially, and from four distributed Tier-2s (50%) on a ~Q timescale.
[Table: LCG resource plan by country (CERN, Czech Republic, France, Germany, Holland, Italy, Japan, Poland, Russia, Taiwan, Spain, Sweden, Switzerland, UK, USA, Total), with columns CPU (kSI2K), Disk (TB), Support (FTE) and Tape (TB)]
RAL
[Diagram: RAL testbeds and services, each shown with its nodes (CE, SE, WN, MON) and middleware release: LCG Testbed (LCG 1.0/EDG 2.0, 5 WNs); Tier-1/A (LCG 1.0/EDG, WNs); WP3 Testbed (EDG 2.0, MON); EDG Dev Testbed (EDG 2.0, 1 WN, MON); ADS (SE); LCG0 Testbed (CE, SE, 1 WN)]
UI within CSF. NM for EDG 2. Top-level MDS for EDG. Various WP3 and WP5 development nodes. VOMS for the Dev testbed.
London Grid: Imperial College
[Diagram: EDG Testbed (CE, SE, EDG 2.0, WNs); BaBar Farm (CE, EDG 2.0, WNs); CMS-LCG0 (CE, SE, WN); WP3 Testbed (CE, SE, EDG 2.0, 1 WN, MON)]
RB for EDG 2.0. Plans to be in LCG 1 and other testbeds.
London Grid: Queen Mary and UCL
Queen Mary
[Diagram: EDG Testbed (CE, SE, EDG 1.4, 1 WN) and a 32-WN farm]
The CE also feeds EDG jobs to the 32-node e-Science farm. Plan to have LCG 1/EDG 2 running by the end of the year. Expansion with SRIF grants (64 WNs + 2 TB in Jan 2004; 100 WNs + 8 TB in Dec 2004).
UCL
[Diagram: EDG Testbed (CE, SE, EDG 1.4, 1 WN)]
Network monitors for WP7 development. SRIF bid in place for 200 CPUs by the end of the year, to join LCG 1.
Southern Grid: Bristol
[Diagram: EDG Testbed (CE, SE, EDG 2.0, 1 WN); WP3 Testbed (CE, SE, EDG 2.0, 1 WN, MON); CMS/LHCb Farm (CE, SE, CMS-LCG0, 24 WNs); BaBar Farm (CE, SE, EDG 1.4, 78 WNs)]
GridPP RC. Plans to join LCG 1.
Southern Grid: Cambridge and Oxford
Cambridge
[Diagram: EDG Testbed (CE, SE, EDG, WNs)]
The Cambridge farm is shared with local NA-48 and GANGA users. Some Red Hat 7.3 WNs for the ongoing ATLAS challenge. 3 TB GridFTP SE. Plan to join LCG 1/EDG 2 later in the year with an extra 50 CPUs. EDG jobs will be fed into the local e-Science farm.
Oxford
[Diagram: EDG Testbed (CE, SE, EDG 1.4, 2 WNs)]
Plan to join EDG 2/LCG 1. Nagios monitoring has been set up (RAL is also evaluating Nagios). Planning to send EDG jobs into the 10-WN CDF farm. A 128-node cluster is being ordered now.
Southern Grid: RAL PPD and Birmingham
RAL PPD
[Diagram: EDG Testbed (CE, SE, EDG 2.0, 9 WNs); WP3 Testbed (CE, SE, EDG 2.0, 1 WN, MON); PPD User Interface]
Part of the Southern Tier-2 Centre within LCG 1. 50 CPUs and 5 TB of disk expected by the end of the year.
Birmingham
[Diagram: EDG Testbed (CE, SE, EDG 1.4, 1 WN)]
Expansion to 60 CPUs and 4 TB. Expects to participate in LCG 1/EDG 2.
NorthGrid: Manchester and Liverpool
Manchester
[Diagram: BaBar Farm and DZero Farm (each with CE, SE and EDG WNs; SEs of 1.5 TB and 5 TB); EDG Testbed (CE, SE, EDG 1.4, 9 WNs)]
GridPP and BaBar VO servers. User Interface. The DZero farm is planned to join LCG. SRIF bid in place for significant HEP resources.
Liverpool
[Diagram: EDG Testbed (CE, SE, EDG 1.4, 1 WN)]
Liverpool plans to follow EDG 2, possibly integrating the newly installed Dell cluster (funded by the NW Development Agency) and the BaBar farm. Largest single Tier-2 resource.
ScotGrid: Glasgow, Edinburgh and Durham
Glasgow
[Diagram: ScotGrid (CE, SE, EDG 1.4, 59 WNs); WP3 Testbed (CE, SE, EDG 2.0, MON); resources shared between CDF, LHC and Bioinformatics]
WNs on a private network with outbound NAT in place. Various WP2 development boxes. 34 dual blade servers have just arrived; a 5 TB FAStT500 is expected soon. Shared resources (CDF and Bioinformatics).
Edinburgh: a 24 TB FAStT700 and an 8-way server have just arrived.
Durham: existing farm available. Plan to be part of LCG.
EDG 2.0 Deployment Status (12/9/03)
– RAL (Tier-1/A): up and running, with UI gppui04 available (as part of CSF); offers access to an LCFGng node to help people compare it with their own LCFGng setup.
– IC: the existing WP3 testbed site is at …; a standard 2.0 RB is available.
– UCL: trying to go to 2.0; SE up so far.
– QMUL: 2.0 installation ongoing.
– RAL (PPD): site up and running.
– Oxford: waiting until October for 2.0.
– Birmingham: working on getting a 2.0 site up next week.
– Bristol: WP3 testbed site at …; also doing a new 2.0 site install, with UI and MON up and CE, SE and WN still in progress. Cambridge to follow.
– Manchester: trying to get set up.
– Glasgow: concentrating on commissioning new hardware during the next month, and waiting until then before going to 2.0. Edinburgh to follow.
Meeting Current LHC Requirements: Experiment Accounting
An experiment-driven project: priorities are determined by the Experiments Board.
Tier-1/A Accounting
[Charts: Tier-1/A usage by LHCb, ATLAS and CMS]
– Monthly accounting: online Ganglia-based monitoring, see: …
– Last month: CMS (and BaBar) jobs.
– Annual accounting: ATLAS, CMS and LHCb jobs; usage generally dominated by BaBar since January.
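Monthly and annual accounting both reduce to summing per-job usage by experiment over a time window. A minimal Python sketch, assuming hypothetical job records of (experiment, month, CPU hours); the record layout and the figures are illustrative only and are not taken from the Tier-1/A Ganglia data.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class JobRecord(NamedTuple):
    """Hypothetical per-job accounting record (not the Ganglia data format)."""

    experiment: str
    month: str  # "YYYY-MM"
    cpu_hours: float


def usage_by_experiment(records: Iterable[JobRecord], months: set[str]) -> dict[str, float]:
    """Sum CPU hours per experiment over the given set of months."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        if r.month in months:
            totals[r.experiment] += r.cpu_hours
    return dict(totals)


# Illustrative records only.
records = [
    JobRecord("BaBar", "2003-08", 900.0),
    JobRecord("CMS", "2003-08", 120.0),
    JobRecord("ATLAS", "2003-03", 40.0),
    JobRecord("LHCb", "2003-05", 75.0),
    JobRecord("BaBar", "2003-02", 800.0),
]

monthly = usage_by_experiment(records, {"2003-08"})
annual = usage_by_experiment(records, {f"2003-{m:02d}" for m in range(1, 13)})
print(monthly)
print(max(annual, key=annual.get))
```

The same function serves both views: the monthly report passes a single month, the annual report passes all twelve.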
Today's Operations
1. Support team built from sysadmins: 4 are funded by GridPP to work on EDG WP6; the rest are site sysadmins.
2. Methods: mailing list, phone meetings, personal visits, job-submission monitoring. RB, VO and RC for UK use, to support non-EDG use.
3. Rollout: experience from RAL in the EDG dev testbeds, and from IC and Bristol in the CMS testbeds. 10 sites have been part of the EDG application testbed at one time.
GridPP2 Operations
– To move from testbed to production, GridPP plans a bigger team with a full-time Operations Manager.
– Manpower will come from the Tier-1 and Tier-2 Centres, which will contribute to the Production Team.
– The team will run a UK Grid that belongs to various grids (EDG, LCG, etc.) and also supports other experiments.
– RAL is also leading the LCG Security Group:
  – has written 4 documents setting out procedures and user rules
  – is working with the GOC task force on security policy
  – risk analysis and further planning for LCG in 2004
LCG Operations
RAL has led the project to develop an Operations Centre for LCG 1:
– applied GridPP and MapCenter monitoring to LCG 1
– a dashboard combining several types of monitoring
– set up a web site with contact information
– developing a security plan
– accounting (the current priority, building on resource-centre and experiment accounting)
EGEE
[Diagram: UK production team structure: Tier-1 (16.5 FTE), UK Team (8 FTE), UK GSC (2 FTE), a further 2 FTE, EGEE ROC (5 FTE), EGEE CIC (4.5 FTE)]
– The UK Production Team will be expanded with EGEE ROC and CIC posts to meet EGEE requirements.
– The aim is to deliver an EGEE grid infrastructure that must also serve other communities and projects.
– This could be done within PP alone (matching funding is available), but we also want to engage fully with the UK Core Programme.
Tier-1/A Services [FTE]
– High-quality data services
– National and international role
– UK focus for international Grid development
– Highest single priority within GridPP2
Regained Programme [FTE]
CPU              2.0
Disk             1.5
AFS              0.0
Tape             2.5
Core Services    2.0
Operations       2.5
Networking       0.5
Security         0.0
Deployment       2.0
Experiments      2.0
Management       1.5
Total           16.5
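The Tier-1/A activity figures can be checked against the stated total of 16.5 FTE. A small Python sketch using only the numbers on this slide (the dictionary keys are the activity names as listed):

```python
import math

# Tier-1/A effort by activity [FTE], as listed on the slide.
tier1_fte = {
    "CPU": 2.0, "Disk": 1.5, "AFS": 0.0, "Tape": 2.5, "Core Services": 2.0,
    "Operations": 2.5, "Networking": 0.5, "Security": 0.0, "Deployment": 2.0,
    "Experiments": 2.0, "Management": 1.5,
}

total = sum(tier1_fte.values())
assert math.isclose(total, 16.5)  # agrees with the Total row

# Activities with a non-zero allocation (AFS and Security are at 0.0).
staffed = [activity for activity, fte in tier1_fte.items() if fte > 0]
print(f"{total:.1f} FTE across {len(staffed)} staffed activities")
```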
Tier-2 Services [FTE]
Four Regional Tier-2 Centres:
– London: Brunel, Imperial College, QMUL, RHUL, UCL
– SouthGrid: Birmingham, Bristol, Cambridge, Oxford, RAL PPD
– NorthGrid: CLRC Daresbury, Lancaster, Liverpool, Manchester, Sheffield
– ScotGrid: Durham, Edinburgh, Glasgow
Hardware is provided by the institutes; GridPP provides added manpower.
Current Planning (Y1, Y2, Y3)
Hardware Support
Core Services            4.0
User Support
Specialist Services
  Security               1.0
  Resource Broker        1.0
  Network                0.5
  Data Management        2.0
  VO Management
Existing Staff          -4.0
GridPP Total [SY]       40.0
Operational Roles
Core Infrastructure Services (CIC), still to be defined fully in EGEE:
– Grid information services
– Monitoring services
– Resource brokering
– Allocation and scheduling services
– Replica data catalogues
– Authorisation services
– Accounting services
Core Operational Tasks (ROC):
– Monitoring infrastructure, components and services
– Troubleshooting
– Verification of new sites joining the Grid
– Acceptance tests of new middleware releases
– Verifying that suppliers are meeting SLAs
– Performance tuning and optimisation
– Publishing usage figures and accounts
SRB for CMS
– UK e-Science has been interested in SRB for several years.
– CCLRC has gained expertise for other projects and is collaborating with SDSC.
– Now hosting MCAT for worldwide CMS pre-DC04, interfaced to the RAL Datastore:
  – service started 1 July
  – …,000 files registered
  – 10 TB of data stored in the system
  – used across 13 sites worldwide, including CERN and Fermilab
  – 30 storage resources managed across the sites
[Diagram: SRB client, SRB A and SRB B servers, MCAT server and MCAT database, with request steps labelled a to g]
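In SRB, a client contacts an SRB server, which consults the MCAT metadata catalogue to map a logical file name to the storage resource that actually holds the data, brokering the request to another SRB server when needed. A toy Python sketch of that lookup, assuming an in-memory catalogue; the class, function, server and path names are hypothetical and this is not the SRB or MCAT API.

```python
from dataclasses import dataclass, field


@dataclass
class Catalogue:
    """Toy stand-in for an MCAT-style metadata catalogue (hypothetical, not the SRB API)."""

    replicas: dict[str, list[tuple[str, str]]] = field(default_factory=dict)

    def register(self, logical_name: str, server: str, physical_path: str) -> None:
        # Record that `server` holds a copy of `logical_name` at `physical_path`.
        self.replicas.setdefault(logical_name, []).append((server, physical_path))

    def locate(self, logical_name: str) -> list[tuple[str, str]]:
        return list(self.replicas.get(logical_name, []))


def resolve(catalogue: Catalogue, logical_name: str, contacted: str) -> tuple[str, str]:
    """Pick a replica, preferring the server the client contacted; otherwise broker to another."""
    locations = catalogue.locate(logical_name)
    if not locations:
        raise FileNotFoundError(logical_name)
    for server, path in locations:
        if server == contacted:
            return server, path  # served locally by the contacted server
    return locations[0]  # request forwarded to a remote server holding the data


cat = Catalogue()
cat.register("/cms/pre-dc04/file001", "srb-b", "/datastore/cms/file001")
print(resolve(cat, "/cms/pre-dc04/file001", contacted="srb-a"))
```

Keeping all location metadata in one catalogue is what lets a client reach data at any of the sites through whichever server it happens to contact.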
EDG StorageElement
– Not initially adopted by LCG 1.
– Since then, limited SRM functionality has been added to support GFAL; this is available for testing by LCG.
– Full SRMv1 functionality has been developed and is currently being integrated on the internal testbed.
– GACLs are being integrated.
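GACL policies are XML documents whose entries allow or deny permissions (such as read, list and write) to credentials such as certificate DNs. A simplified Python sketch of an access check against such a policy; the element names are modelled on GridSite's GACL convention as we understand it, the DNs are hypothetical, and the deny-overrides-allow rule is an assumption of this sketch rather than the StorageElement's actual implementation.

```python
import xml.etree.ElementTree as ET

# Simplified GACL-style policy with hypothetical DNs.
POLICY = """<gacl>
  <entry>
    <person><dn>/C=UK/O=eScience/OU=Glasgow/CN=alice</dn></person>
    <allow><read/><list/></allow>
  </entry>
  <entry>
    <person><dn>/C=UK/O=eScience/OU=Glasgow/CN=alice</dn></person>
    <deny><list/></deny>
  </entry>
  <entry>
    <person><dn>/C=UK/O=eScience/OU=Bristol/CN=bob</dn></person>
    <allow><read/><write/></allow>
  </entry>
</gacl>"""


def permitted(policy_xml: str, dn: str, perm: str) -> bool:
    """True if some entry for `dn` allows `perm` and no entry for `dn` denies it."""
    root = ET.fromstring(policy_xml)
    allowed = denied = False
    for entry in root.findall("entry"):
        if dn not in {d.text for d in entry.findall("person/dn")}:
            continue  # this entry does not match the user
        allow, deny = entry.find("allow"), entry.find("deny")
        if allow is not None and allow.find(perm) is not None:
            allowed = True
        if deny is not None and deny.find(perm) is not None:
            denied = True
    return allowed and not denied  # in this sketch an explicit deny overrides any allow


print(permitted(POLICY, "/C=UK/O=eScience/OU=Glasgow/CN=alice", "read"))
```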
R-GMA: Status
– Running on the WP3, EDG-development and EDG-application testbeds.
– Application deployment: 29 CEs, 11 SEs, 10 sites in 6 countries; R-GMA browser access in < 1 s.
– Monitoring scripts are being run on the testbeds, with results linked from the WP3 web page: http://hepunx.rl.ac.uk/edg/wp3/
– Registry replication is being tested on the WP3 testbed; better performance and higher reliability are required.
– Authentication has been successfully tested on the WP3 testbed.
– Two known bugs remain:
  – Excessive threads, requiring a restart of the GOUT machine. New code has been developed with extensive unit tests and is now being tested on the WP3 testbed; it will support at least 90 sites.
  – The Latest Producer (LP) selection algorithm fails to reject bad LPs, which shows up as an intermittent absence of information. A revised algorithm needs coding (a localised change).
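The second bug concerns which Latest Producers a query is routed to. A Python sketch of a selection step that rejects bad LPs before use, assuming two hypothetical rejection criteria: an expired registration (no refresh within its termination interval) and a failed liveness probe. The field names, URLs and the "ServiceStatus" table name are illustrative; this is not the actual R-GMA algorithm or its revised version.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ProducerEntry:
    """Hypothetical registry entry for a Latest Producer (LP); field names are illustrative."""

    url: str
    table: str
    last_refresh: float          # time (s) the LP last refreshed its registration
    termination_interval: float  # validity (s) of a registration without refresh


def choose_latest_producers(
    entries: list[ProducerEntry],
    table: str,
    now: float,
    is_alive: Callable[[str], bool],
) -> list[ProducerEntry]:
    """Return usable LPs for `table`, freshest first, rejecting stale or dead ones."""
    usable = [
        e
        for e in entries
        if e.table == table
        and now - e.last_refresh <= e.termination_interval  # registration not expired
        and is_alive(e.url)  # producer still answers a liveness probe
    ]
    return sorted(usable, key=lambda e: e.last_refresh, reverse=True)


entries = [
    ProducerEntry("http://lp1.example/", "ServiceStatus", 100.0, 60.0),
    ProducerEntry("http://lp2.example/", "ServiceStatus", 10.0, 60.0),  # stale
    ProducerEntry("http://lp3.example/", "ServiceStatus", 90.0, 60.0),  # unreachable
]
dead = {"http://lp3.example/"}
chosen = choose_latest_producers(entries, "ServiceStatus", now=120.0, is_alive=lambda u: u not in dead)
print([e.url for e in chosen])
```

Filtering before ranking means a stale or dead producer can never be chosen over a live one, which is the failure mode that shows up as missing information.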
R-GMA: Users
Users and interfaces to other systems:
– Resource Broker
– CMS (BOSS)
– Service and Service Status for all EDG services
– Network Monitoring & Network Cost Function
– MapCenter
– Logging & Bookkeeping
– UK e-Science, CrossGrid and BaBar evaluating
– Replica Manager
– MDS (GIN/GOUT)
– Nagios
– Ganglia (Ranglia)
Future: direct use of R-GMA by the RB (no GOUT), for better performance and reliability.
Middleware, Security and Network Service Evolution
– Information Services [5+5 FTE] and Networking [… FTE]: strategic roles within EGEE
– Security expands to meet requirements
– Data and Workload Management continue
– No further configuration management development
Programme defined by:
– mission criticality (driven by experiment requirements)
– international/UK-wide lead
– leverage of EGEE, UK core and LCG developments
Activity          Current Planning
Security           3.5
Info-Mon.          4.0
Data & Storage     4.0
Workload           1.5
Networking         3.0
TOTAL             16.0
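The middleware planning figures can likewise be checked against their total and expressed as shares of the programme. A short Python sketch using only the numbers in the table (the share calculation is our own illustration):

```python
# Middleware, security and networking effort plan, from the slide.
plan = {"Security": 3.5, "Info-Mon": 4.0, "Data & Storage": 4.0, "Workload": 1.5, "Networking": 3.0}

total = sum(plan.values())
assert total == 16.0  # agrees with the TOTAL row

# Share of the total per activity, largest first (ties keep the slide's order).
shares = {a: fte / total for a, fte in sorted(plan.items(), key=lambda kv: kv[1], reverse=True)}
for activity, share in shares.items():
    print(f"{activity:<15}{plan[activity]:>5.1f} {share:>7.1%}")
```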
GridPP2 Proposal
~30-page proposal + figures/tables + 11 planning documents:
15. Tier-0
16. Tier-1
17. Tier-2
18. The Network Sector
19. Middleware
20. Applications
21. Hardware Requirements
22. Management
23. Travel
24. Dissemination
25. From Testbed to Production
Current planning is based upon a £19.6m funding scenario.
PPARC review timeline:
– Projects Peer Review Panel (14-15/7/03)
– Grid Steering Committee (28-29/7/03)
– Science Committee (October 2003)
Timeline
GridPP2 Project Map
Need to build this in, to identify progress…
Experiment Requirements: UK only
Total Requirement:
Meeting the Experiments' Hardware Requirements
Significant… Production Grid including Tier-2 resources needed…
Projected Hardware Resources
Total Resources (note ×2 scale change)
Application Interfaces: Service Evolution
– Applications: 18 FTEs; the ongoing programme of work can continue.
– It is difficult to involve experiment activity not already engaged within GridPP.
– The project would need to build on cross-experiment collaboration, where GridPP1 already has experience:
  – GANGA: ATLAS & LHCb
  – SAM: CDF & D0
  – Persistency: CMS & BaBar
– Encourage new joint developments across experiments.
Conclusions
– Management is under control via the Project Map and Project Plan.
– GridPP status is defined in terms of high-level tasks and metrics: under control.
– The major component is LCG: we contribute significantly to LCG, and our success depends critically on LCG.
– Deployment: high- and low-level perspectives merge via accounting.
– Resource-centre and experiment accounting are both important.
– Comprehensive accounting is a priority, built up from existing systems.
– Today's operations in the UK are built around a small team.
– Future operations planning expands this team significantly: a Production Manager is being appointed.
– Middleware deployment is focussing on Information Service performance issues; the existing IS team will be reinforced in the UK within EGEE.
– Security (deployment and policy) is emphasised.
– GridPP2 planning status: formal feedback in November.