
1 CERN LCG-1 Deployment Plan
Ian Bird, LCG Project Deployment Area Manager, IT Division, CERN
GridPP 7th Collaboration Meeting, Oxford, 1 July 2003

2 Overview
Milestones and goals for 2003
LCG-1 roll-out plan
– Where, how, when
Infrastructure status
– Middleware functionality & status
– Security & operational issues
Plans for the rest of 2003
– Additional resources
– Additional functionality
– Operational improvements

3 LCG - Goals
The goal of the LCG project is to prototype and deploy the computing environment for the LHC experiments.
Two phases:
– Phase 1: 2002 – 2005
  Build a service prototype, based on existing grid middleware
  Gain experience in running a production grid service
  Produce the TDR for the final system
– Phase 2: 2006 – 2008
  Build and commission the initial LHC computing environment
LCG is not a development project – it relies on other grid projects for grid middleware development and support.

4 LCG - Timescale
Why such a rush – the LHC won't start until 2007?
The TDR must be written in mid-2005:
– Approval of the TDR
– One year is needed to procure, build, test, deploy and commission the computing fabrics and infrastructure – to be in place by end 2006
In order to write the TDR, it is essential to have at least one year of experience
– In running a production service
– At a scale that is representative of the final system (50% of one experiment)
– Running data challenges – including analysis, not just simulation
It can easily take 6 months to prepare such a service.
We must start now – the goal is to have a service in place in July.

5 LCG - Milestones
The agreed Level 1 project milestones for Phase 1 (deployment milestones are shown in red):
M1.1  – July 03      First Global Grid Service (LCG-1) available
M1.2  – June 03      Hybrid Event Store (Persistency Framework) available for general users
M1.3a – November 03  LCG-1 reliability and performance targets achieved
M1.3b – November 03  Distributed batch production using grid services
M1.4  – May 04       Distributed end-user interactive analysis from Tier 3 centre
M1.5  – December 04  50% prototype (LCG-3) available
M1.6  – March 05     Full Persistency Framework
M1.7  – June 05      LHC Global Grid TDR

6 LCG Regional Centres
Centres taking part in the LCG prototype service, 2003 – 2005 (confirmed resources):
Tier 0: CERN
Tier 1 centres: Brookhaven National Lab, CNAF Bologna, Fermilab, FZK Karlsruhe, IN2P3 Lyon, Rutherford Appleton Lab (UK), University of Tokyo, CERN
Other centres: Academia Sinica (Taipei), Barcelona, Caltech, GSI Darmstadt, Italian Tier 2s (Torino, Milano, Legnaro), Manno (Switzerland), Moscow State University, NIKHEF Amsterdam, Ohio Supercomputing Centre, Sweden (NorduGrid), Tata Institute (India), TRIUMF (Canada), UCSD, UK Tier 2s, University of Florida – Gainesville, University of Prague, …

7 Elements of a Production LCG Service
Middleware:
– Testing and certification
– Packaging, configuration, distribution and site validation
– Support – problem determination and resolution; feedback to middleware developers
Operations:
– Grid infrastructure services
– Site fabrics run as production services
– Operations centres – trouble and performance monitoring, problem resolution – 24x7 globally
Support:
– Experiment integration – ensure optimal use of the system
– User support – call centres/helpdesk – global coverage; documentation; training

8 2003 Milestones
Project Level 1 deployment milestones for 2003:
– July: Introduce the initial publicly available LCG-1 global grid service
  With 10 Tier 1 centres on 3 continents
– November: Expanded LCG-1 service with resources and functionality sufficient for the 2004 Computing Data Challenges
  Additional Tier 1 centres, several Tier 2 centres – more countries
  Expanded resources at Tier 1s (e.g. at CERN, make the LXBatch service grid-accessible)
  Agreed performance and reliability targets

9 LCG Resource Commitments – 1Q04
[Table of committed resources – CPU (kSI2K), disk (TB), support (FTE), tape (TB) – by country: CERN, Czech Republic, France, Germany, Holland, Italy, Japan, Poland, Russia, Taiwan, Spain, Sweden, Switzerland, UK, USA, plus a total line. The numerical values did not survive in this transcript.]

10 Deployment Goals for LCG-1
Production service for the Data Challenges in 2H03 and 2004
– Initially focused on batch production work
– But the 2004 data challenges include (as yet undefined) interactive analysis
Experience in close collaboration between the Regional Centres
– Must have wide enough participation to understand the issues
Learn how to maintain and operate a global grid
Focus on a production-quality service
– Robustness, fault-tolerance, predictability and supportability take precedence; additional functionality gets prioritised accordingly
LCG should be integrated into the sites' physics computing services – it should not be something apart
– This requires coordination between participating sites in:
  Policies and collaborative agreements
  Resource planning and scheduling
  Operations and support

11 Middleware Deployment
LCG-0 was deployed and installed at 10 Tier 1 sites
– The installation procedure was straightforward and repeatable
– Many local integration issues were addressed
LCG-1 will be deployed to these 10 sites to meet the July milestone
– Time is short – integrating the middleware components took much longer than anticipated
– Planning is under way to do the deployment in a short time once the middleware is packaged
– The LCG team will work directly with these sites during the deployment
– Initially, testing activities to stabilise the service will take priority
– The experiments are expected to start testing the service by mid-August

12 LCG-0 Deployment Status
These sites deployed the LCG-0 pilot system and will be the first sites to deploy LCG-1.
    Site             Scheduled    Status
Tier 1:
 0  CERN             15/2/03      Done
 1  CNAF             28/2/03      Done
 2  RAL              28/2/03      Done
 3  FNAL             30/3/03      Done
 4  Taipei           15/4/03      Done
 5  FZK              30/4/03      Done
 6  IN2P3            7/5/03       In prep.
 7  BNL              15/5/03      Done
 8  Russia (Moscow)  21/5/03      In prep.
 9  Tokyo            21/5/03      Done
Tier 2:
 10 Legnaro (INFN)   After CNAF   Done

13 LCG-1 Distribution
Packaging & configuration:
– Service machines – fully automated installation
  LCFGng – either the full or the light version
– Worker nodes – the aim is to allow sites to use their existing tools as required
  LCFGng – provides automated installation
  Installation scripts provided by us – manual installation (a minimal sketch follows this slide)
  Instructions allowing system managers to use their existing tools
– User interface
  LCFGng
  Installed on a cluster (e.g. Lxplus at CERN)
  Pacman?
Distribution:
– The distribution web site is being set up now (updated from LCG-0)
  Sets of rpms etc. organised by service and machine type
  User guide, installation guides, release notes, etc. are being written now
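The following is a minimal sketch of what a manual (non-LCFGng) worker-node install script could look like: fetch the rpm list published for a machine type on the distribution web site and hand it to rpm in one transaction. The URL and the list format are assumptions for illustration, not the actual LCG-1 layout.

```python
# Hedged sketch: install the worker-node rpm set from a published package list.
# The URL and list format are hypothetical.
import subprocess
import urllib.request

RPM_LIST_URL = "http://grid-deployment.web.cern.ch/lcg1/WN/rpmlist.txt"  # hypothetical

def install_worker_node(list_url=RPM_LIST_URL):
    # Read one rpm URL per line, skipping comments and blank lines.
    with urllib.request.urlopen(list_url) as resp:
        rpm_urls = [line.strip() for line in resp.read().decode().splitlines()
                    if line.strip() and not line.startswith("#")]
    # Pass all packages at once so rpm resolves the transaction as a whole.
    subprocess.run(["rpm", "-Uvh"] + rpm_urls, check=True)

if __name__ == "__main__":
    install_worker_node()
```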

14 Middleware Status
Integration of EDG 2.0 has taken longer than hoped
– EDG has not quite released version 2.0 – imminent
LCG has a working system – able to run jobs:
– Resource Broker: many changes since the previous version; needs significant testing to determine scalability and limitations
– RLS: the initial deployment will be a single instance (per VO) of the LRC/RMC
  The distributed service with many LRCs and indexes is not yet debugged
  Initially the LRCs for all VOs will run at CERN with an Oracle service backend
– Information system: R-GMA is not yet stable
  We will initially use MDS, with work to improve stability (bug fixes) and redundancy – based on experience with the EDG testbeds and the NIKHEF and NorduGrid work (a query sketch follows this slide)
  We intend to make a direct comparison between MDS and R-GMA on the certification testbed
  Waiting for bug fixes to several components
Still to do before release:
– A reasonable level of testing
– Packaging and preparation for deployment
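As a minimal sketch of what "using MDS" means for a client (or for the Resource Broker), the snippet below queries a GIIS over LDAP. The host name, port and GLUE filter are illustrative assumptions rather than the actual LCG-1 configuration.

```python
# Hedged sketch: read resource information from an MDS GIIS over LDAP.
import ldap  # python-ldap

GIIS_URI = "ldap://lxn1101.cern.ch:2135"   # hypothetical regional GIIS endpoint
BASE_DN = "mds-vo-name=local,o=grid"       # conventional MDS suffix

def list_computing_elements(uri=GIIS_URI, base=BASE_DN):
    """Return the DNs of the CE entries published by the GIIS."""
    conn = ldap.initialize(uri)
    conn.simple_bind_s()                   # MDS queries are anonymous
    results = conn.search_s(base, ldap.SCOPE_SUBTREE,
                            "(objectClass=GlueCE)")  # GLUE-schema filter (assumed)
    return [dn for dn, _attrs in results]

if __name__ == "__main__":
    for dn in list_computing_elements():
        print(dn)
```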

15 Certification & Testing
This is the primary tool to stabilise and debug the system
– The process and testbed have been set up
– This is intended to run in parallel with the production service
Certification testbed:
– A set of 4 clusters at CERN – simulates a grid on a LAN
– External sites that will be part of the certification testbed:
  U. Wisconsin, FNAL – currently
  Moscow, Italy – soon
This testbed is being used to test the release candidate
– It will be used to reproduce and resolve problems found in the production system, and to do regression testing of updated middleware components before deployment (a smoke-test sketch follows this slide)
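Below is a minimal sketch of the kind of smoke test such a testbed might run: write a trivial JDL, submit it through the EDG workload management user-interface commands, and query its status. The JDL contents and the way the job id is parsed from the command output are simplifying assumptions.

```python
# Hedged sketch: submit and track a trivial test job with the EDG UI commands.
import subprocess
import tempfile

JDL = """\
Executable    = "/bin/hostname";
StdOutput     = "std.out";
StdError      = "std.err";
OutputSandbox = {"std.out", "std.err"};
"""

def submit_test_job():
    # Write the JDL to a temporary file and submit it.
    with tempfile.NamedTemporaryFile("w", suffix=".jdl", delete=False) as f:
        f.write(JDL)
        jdl_path = f.name
    out = subprocess.run(["edg-job-submit", jdl_path],
                         capture_output=True, text=True, check=True).stdout
    # The job id is the https://... URL printed by the command (assumed format).
    return next(line.strip() for line in out.splitlines()
                if line.strip().startswith("https://"))

def job_status(job_id):
    return subprocess.run(["edg-job-status", job_id],
                          capture_output=True, text=True, check=True).stdout

if __name__ == "__main__":
    jid = submit_test_job()
    print(job_status(jid))
```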

16 Infrastructure for the initial service – 2
Security issues
– Agreement on the set of CAs that all LCG sites will accept
  The EDG list of traditional CAs
  The FNAL on-line KCA
– Agreement on a basic registration procedure for users
  An LCG VO where users sign the Acceptable Usage Rules for LCG
  4 experiment VOs – will use the existing EDG services run by NIKHEF (a grid-mapfile sketch follows this slide)
  Agreement on the basic set of information to be collected
– All initial registrations will expire in 6 months – we know the procedures will change
– Experiment VO managers will verify the bona fides of users
– Acceptable Use Rules – adapted from the EDG policy for now
– Audit trails – a basic set of tools and log correlations to provide the essential functions
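As a hedged illustration of how a site could turn such a VO membership list into local authorization, in the spirit of the EDG grid-mapfile mechanism, the sketch below queries a VO LDAP server for member certificate subjects and maps each to a pool account. The server address, DN layout, attribute holding the subject, and account name are all assumptions.

```python
# Hedged sketch: build a grid-mapfile from a VO membership LDAP server.
import ldap  # python-ldap

VO_LDAP = "ldap://grid-vo.nikhef.nl:389"              # hypothetical VO server
VO_BASE = "ou=People,o=atlas,dc=eu-datagrid,dc=org"   # hypothetical DN layout

def write_gridmap(path="grid-mapfile", uri=VO_LDAP, base=VO_BASE, account=".atlas"):
    conn = ldap.initialize(uri)
    conn.simple_bind_s()                               # anonymous read
    entries = conn.search_s(base, ldap.SCOPE_SUBTREE, "(objectClass=*)",
                            ["description"])           # member subject assumed here
    with open(path, "w") as f:
        for _dn, attrs in entries:
            for subject in attrs.get("description", []):
                # One line per member: "certificate subject" mapped account
                f.write('"%s" %s\n' % (subject.decode(), account))

if __name__ == "__main__":
    write_gridmap()
```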

17 Infrastructure – 3
Operations service:
– RAL is leading the sub-project on developing operations services
– Initial prototype for July:
  Basic monitoring tools
  Mail lists and rapid communications/coordination for problem resolution
– Monitoring:
  GridICE (a development of the DataTag Nagios-based tools) being integrated with the release candidate
  Existing Nagios-based tools
  GridPP job submission monitoring
  Together these give reasonable coverage of basic operational issues (a probe sketch follows this slide)
User support:
– FZK is leading the sub-project to develop user support services
– Initial prototype for July: a web portal for problem reporting
– The expectation is that initially the experiments will triage problems and experts will submit LCG problems to the support service
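The snippet below is a minimal sketch of the kind of Nagios-style probe such monitoring is built from: check that a site's gatekeeper port answers and exit with the conventional Nagios codes (0 OK, 2 CRITICAL). The host name is illustrative.

```python
# Hedged sketch: Nagios-style availability probe for a Globus gatekeeper.
import socket
import sys

HOST = "ce01.example-tier1.org"   # hypothetical gatekeeper host
PORT = 2119                       # standard Globus gatekeeper port

def check_gatekeeper(host=HOST, port=PORT, timeout=10.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            print("OK: gatekeeper %s:%d is reachable" % (host, port))
            return 0
    except OSError as exc:
        print("CRITICAL: gatekeeper %s:%d unreachable (%s)" % (host, port, exc))
        return 2

if __name__ == "__main__":
    sys.exit(check_gatekeeper())
```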

18 Initial Deployment Services
[Diagram of the initial deployment services at CERN and at other sites: per-VO RLS instances (RMC & LRC) for ALICE, ATLAS, CMS, LHCb and the LCG-Team VO; an LCG proxy; Resource Brokers (RB-1, RB-2); disk SEs; user interfaces (AFS users, UI-b on Lxplus, UIs at other sites); CEs (CE-1 to CE-4) with worker nodes under PBS and LSF at CERN and PBS or other batch systems elsewhere; VO servers (CMS, LHCb and ATLAS at NIKHEF, plus the LCG-Team VO); the LCG registration server; and the LCG CVS server.]

19 LCG-1 First Launch Information System Overview
[Diagram: site GIISes (SiteA–SiteD, each aggregating CE and SE GRISes) register with regional GIISes (RegionA1/A2, RegionB1/B2); primary and secondary BDIIs (LDAP) query the regional GIISes; the Resource Broker queries the BDII; the BDII alternates between /dataCurrent and /dataNew directories with a swap-and-restart.]
While serving the data from one directory, the BDII queries the regional GIISes to fill the other directory structure. When this has finished, the BDII is stopped, the directories are swapped and the BDII is restarted; the restart takes less than 0.5 s. To improve availability during this window it was suggested (David) that the TCP port be switched off so that the TCP protocol takes care of the retry – this has to be tested. Another idea worth testing is to remove the site GIIS and configure the GRISes to register directly with the regional GIISes. Using multiple BDIIs requires RB changes. (A sketch of the swap-and-restart cycle follows this slide.)
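The following is a minimal sketch, under stated assumptions, of the two-directory cycle described above: while one database directory is being served, the other is refilled from the regional GIISes; then the server is stopped, the directories are swapped, and the server is restarted. The paths, the service name and the GIIS endpoints are hypothetical.

```python
# Hedged sketch of the BDII fill / stop / swap / restart cycle.
import os
import shutil
import subprocess

DATA_CURRENT = "/opt/bdii/dataCurrent"
DATA_NEW = "/opt/bdii/dataNew"
REGION_GIISES = ["ldap://west1.example.org:2135",   # hypothetical regional GIISes
                 "ldap://east1.example.org:2135"]

def dump_giis(uri, out_dir):
    """Query one regional GIIS and append its entries (as LDIF) to out_dir."""
    ldif = subprocess.run(
        ["ldapsearch", "-x", "-LLL", "-H", uri, "-b", "mds-vo-name=local,o=grid"],
        capture_output=True, text=True, check=False).stdout
    with open(os.path.join(out_dir, "regions.ldif"), "a") as f:
        f.write(ldif)

def refresh_cycle():
    # 1. Fill the spare directory while the current one is still being served.
    os.makedirs(DATA_NEW, exist_ok=True)
    for uri in REGION_GIISES:
        dump_giis(uri, DATA_NEW)
    # 2. Stop the BDII's LDAP server, swap the directories, restart (< 0.5 s outage).
    subprocess.run(["service", "bdii-slapd", "stop"], check=False)   # hypothetical service name
    old = DATA_CURRENT + ".old"
    if os.path.exists(old):
        shutil.rmtree(old)                 # clear leftovers from the previous cycle
    os.rename(DATA_CURRENT, old)
    os.rename(DATA_NEW, DATA_CURRENT)
    subprocess.run(["service", "bdii-slapd", "start"], check=False)
```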

20 LCG-1 First Launch Information System – Sites and Regions
A region should not contain too many sites, since we have observed problems with MDS when a large number of sites are involved. To allow for future expansion without making the system too complex, the suggestion is to start with two regions – West of 0 degrees longitude and East – and, if needed, split later into smaller regions. The idea is to have one large region and one small one and see how they work. To begin with, 2 regional GIISes will be set up for the West and 3 for the East.
[Diagram: RAL, FNAL and BNL register with the WEST1 and WEST2 regional GIISes; CERN, CNAF, Lyon, Moscow, FZK, Tokyo and Taiwan register with the EAST1, EAST2 and EAST3 regional GIISes.]

21 Plans for the remainder of 2003
Once the service has been deployed, the priorities are:
– Problem resolution and bug fixing – to address problems of reliability and scalability in the existing middleware
– Incrementally adding additional functionality
– Adding additional sites
– Expanding the site resources accessible to the grid service
– Addressing integration issues
  Worker node WAN connectivity, etc.
– Developing distributed prototypes of:
  Operations centres
  User support services
  to provide a reasonable level of global coverage
– Improving the security model
– Developing tools to facilitate operating the service

22 Plans for 2003 – 2: Middleware functionality
The top priority is problem resolution and issues of stability/scalability.
RLS developments
– Distributed service – multiple LRCs, and the RLI
– Later: develop a service to replace the client command-line tools
VOMS service
– To permit user- and role-based authorization
Validation of R-GMA
– And then deployment of multiple registries – the initial implementation has a singleton
Grid File Access Library
– LCG development: a POSIX-like I/O layer to provide local file access (an illustrative sketch follows this slide)
Development of SRM/SE interfaces to other MSS
– Work that must happen at each site with an MSS
Basic upgrades
– Compiler support
– Move to Globus 2.4 (the release supported through 2004)
The cut-off for functionality improvements is October – in order to have a stable system for 2004.
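Purely as an illustration of what a "POSIX-like I/O layer" means for user code: the application opens a logical file name and reads it as if it were local, while the layer resolves a replica and chooses an access path behind the scenes. Every name in this sketch (the catalogue stub, GridFile) is invented for the illustration and is not the GFAL API.

```python
# Hedged, illustrative sketch of a POSIX-like grid file access layer.
class FakeReplicaCatalogue:
    """Stand-in for an RLS lookup: maps logical file names to replica URLs."""
    def __init__(self, replicas):
        self._replicas = replicas

    def lookup_replica(self, lfn):
        return self._replicas[lfn]          # e.g. the closest SE's copy

class GridFile:
    """File-like object that hides where the physical replica lives."""
    def __init__(self, lfn, catalogue, mode="rb"):
        surl = catalogue.lookup_replica(lfn)
        # A real layer would choose rfio/dcap/local access; this sketch only
        # handles plain local files.
        self._fh = open(surl.replace("file://", ""), mode)

    def read(self, size=-1):
        return self._fh.read(size)

    def close(self):
        self._fh.close()

# Usage: cat = FakeReplicaCatalogue({"lfn:higgs.root": "file:///data/higgs.root"})
#        f = GridFile("lfn:higgs.root", cat); data = f.read(); f.close()
```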

23 Incremental Deployment
[Timeline of LCG middleware development from the July starting point ("as much as feasible"): the basic RB, RLS and VDT Globus, followed by a VDT upgrade, VOMS, distributed RLS, R-GMA, RH 8.x and gcc 3.2; continuous bug fixing and re-release throughout; EDG integration ends in September; the 1 October cut-off defines the functionality for 2004.]

24 Expansion of LCG resources
Adding new sites
– Will be a continuous process as sites are ready to join the service
– Expect a minimum of 15 sites (15 countries have committed resources for LCG in 1Q04); reasonable to expect more sites by the end of 2003
– The LCG team will work directly with each Tier 1 (or the primary site in a region)
– Tier 1s will provide first-level support for bringing Tier 2 sites into the service
  Once the Tier 1s are stable this can go on in parallel in many regions
  The LCG team will provide 2nd-level support for Tier 2s
Increase the grid resources available at many sites
– Requires LCG to demonstrate the utility of the service – the experiments, in agreement with site managers, add resources to the LCG service

25 Operational plans for 2003
Security
– Develop the full security policy
– Develop longer-term user registration procedures and the tools to support them
– Develop the Acceptable Use policy for the longer term – requires legal review
Operations
– Develop distributed prototype operations centres/services
  Monitoring developments driven by experience
– Provide at least 16 hr/day global coverage for problem response
– Basic level of resource use accounting – by VO and user (a sketch follows this slide)
– Minimal level of security incident response and coordination
User support
– The development direction depends strongly on experience with the deployed system
– Operations and user support must address the issue of interchanging problem reports – with each other and with sites, network operations, etc.
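As a minimal sketch of "accounting by VO and user", the snippet below aggregates per-job CPU time from a site's job records into totals keyed by (VO, user). The CSV record format is hypothetical; real sites would feed this from their batch system logs.

```python
# Hedged sketch: aggregate CPU usage by VO and user from hypothetical job records.
import csv
from collections import defaultdict

def aggregate_usage(records_path):
    totals = defaultdict(float)                 # (vo, user) -> CPU seconds
    with open(records_path, newline="") as f:
        for row in csv.DictReader(f):           # assumed columns: vo,user,cpu_seconds
            totals[(row["vo"], row["user"])] += float(row["cpu_seconds"])
    return dict(totals)

if __name__ == "__main__":
    for (vo, user), cpu in sorted(aggregate_usage("jobs.csv").items()):
        print("%-8s %-16s %10.0f s" % (vo, user, cpu))
```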

26 Middleware roadmap
Short term (2003)
– Use what exists – try to stabilise, debug, fix problems, etc.
– Exceptions may be needed – WN connectivity, client tools rather than services, user registration, …
Medium term ( ?)
– The same middleware, but develop the missing services and remove the exceptions
– Separate services from WNs – aim for more generic clusters
– Initial tests of re-engineered middleware (service based, with defined interfaces and protocols)
Longer term (2005? – )
– An LCG service based on service definitions, interfaces and protocols – the aim is to be able to have interoperating, different implementations of a service

27 Inter-operability
Since LCG will be VDT plus higher-level EDG components:
– Sites running the same VDT version should be able to be part of LCG, or continue to work as now
– LCG (as far as possible) has the goal of appearing as a layer of services in front of a cluster, storage system, etc.
  The current state of the art implies compromises …

28 Integration Issues
LCG will try to be non-intrusive:
– It will assume the base OS is already installed
– It provides an installation & configuration tool for service nodes
– It provides recipes for the installation of WNs – assuming sites will use their existing tools to manage their clusters
No imposition of a particular batch system
– As long as your batch system talks to Globus (OK for LSF, PBS, Condor, BQS, FBSng) – see the sketch after this slide
There is no longer a requirement for a shared filesystem between the gatekeeper and the WNs
– This was a problem for AFS, and NFS does not scale to large clusters
Information publishing
– Define what information a site should provide (accounting, status, etc.), rather than imposing tools
But … maybe some compromises in the short term (2003)
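The sketch below illustrates the "talks to Globus" requirement: a broker or user reaches the site's batch system only through the Globus gatekeeper and its jobmanager (here PBS). The contact string is a hypothetical example.

```python
# Hedged sketch: submit a trivial job through a Globus (GT2) gatekeeper.
import subprocess

CONTACT = "ce01.example-tier1.org:2119/jobmanager-pbs"   # hypothetical CE contact

def run_test_job(contact=CONTACT):
    """Run /bin/hostname through the gatekeeper and return its output."""
    result = subprocess.run(
        ["globus-job-run", contact, "/bin/hostname"],
        capture_output=True, text=True, check=True)
    return result.stdout.strip()

if __name__ == "__main__":
    print("Job ran on worker node:", run_test_job())
```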

29 Worker Node connectivity
In general (and eventually) it cannot be assumed that the cluster nodes will have connectivity to remote sites
– Many clusters are on non-routed networks (for many reasons)
– Security issues
– In any case this assumption will not scale
BUT … to achieve this several things are necessary:
– Some tools (e.g. replica management) must become services
– Databases (e.g. the conditions DB) must either be replicated to each site (or equivalent), or accessed through a proxy service, or …
– Analysis models must take this into account
– Again, short-term exceptions are possible (up to a point)
The current additions to LXBatch at CERN have this limitation.

30 Timeline for the LCG services
[Timeline: agree the LCG-1 specification; the LCG-1 service opens and is used for event simulation productions, then stabilised, expanded and developed; LCG-2, with upgraded middleware and management, provides the service for the data challenges (including CMS DC04), batch analysis and simulation, and the validation of the computing models; evaluation of 2nd-generation middleware and the computing model TDRs follow; LCG-3 is the full multi-tier prototype batch+interactive service and feeds into the TDR for Phase 2; finally, acquisition, installation and testing of the Phase 2 service, and the Phase 2 service in production.]

31 Grid Deployment Organisation
[Organisation chart: the Grid Deployment Board (GDB) sets policies, strategy, scheduling, standards and recommendations; the Grid Deployment Manager leads the CERN-based teams – the LCG security group, grid infrastructure team, core infrastructure, LCG operations team, experiment support team, and LCG toolkit integration & certification (with a joint Trillium/EDG/LCG testing team); a Grid Resource Coordinator handles resource requests (compute & storage) with ALICE, ATLAS, CMS and LHCb; regional centre operations teams and anticipated teams at other institutes (security tools, operations call centre, grid monitoring) connect to the grid infrastructure services.]

32 Conclusions
Essential to start operating a service as soon as possible – we need 6 months to be able to develop it into a reasonably stable service.
The middleware components are late – but we will still deploy a service of reasonable functionality and scale.
– Much work will be necessary on testing and improving the basic service
Several functional and operational improvements are expected during 3Q03.
The expansion of sites and resources foreseen during 2003 should provide adequate resources for the 2004 data challenges.
There are many issues to resolve and a lot of work to do – but this must be done incrementally on the running service.

33 Conclusions
From the point of view of the LCG plan, we are late in having testable middleware with the functionality that we had hoped for.
We will keep to the July deployment schedule:
– We expect to have the major components – the user's view of the middleware (i.e. via the RB) should not change
– We expect to be able to do less testing and commissioning than planned
– But hopefully, with a suitable process, we will incrementally improve and add functionality as it becomes available and is tested
