Presentation on theme: "10 th GridPP Meeting – 4 June 2004 - 1 LCG Deployment Ian Bird IT Department, CERN 10 th GridPP Meeting CERN 4 th June 2004."— Presentation transcript:
10 th GridPP Meeting – 4 June 2004 - 1 LCG Deployment Ian Bird IT Department, CERN 10 th GridPP Meeting CERN 4 th June 2004
10 th GridPP Meeting – 4 June 2004 - 2 Overview Deployment area organisation Some history where we are now Data challenges – experiences Evolution service challenges Transition to EGEE Interoperability Summary
10 th GridPP Meeting – 4 June 2004 - 3 Deployment Area Manager Deployment Area Manager Grid Deployment Board Grid Deployment Board Certification Team Certification Team Deployment Team Deployment Team Experiment Integration Team Experiment Integration Team Testing group Security group Security group Storage group Storage group GDB task forces JTB HEPiX GGF Grid Projects: EDG, Trillium, Grid3/OSG, etc Regional Centres LHC Experiments LCG Deployment Area LCG Deployment Organisation and Collaborations Operations Centres - RAL Operations Centres - RAL Call Centres - FZK Call Centres - FZK Advises, informs, Sets policy Set requirements Collaborative activities participate
10 th GridPP Meeting – 4 June 2004 - 4 Communication Weekly GDA meetings (Monday 14:00, VRVS, phone) Mail-list – email@example.com@cern.ch Open to all – need experiments, regional centres, etc. Technical discussions, understand what priorities are Policy issues referred back to PEB or GDB Experience so far: Experiments join, regional centres dont NEED participation of system managers and admins – we need a rounded view of the issues Weekly core site phone conference Address specific issues with deployment Also at CERN: Weekly DC coordination meetings with each experiment GDB meetings monthly Make sure your GDB rep keeps you informed Open to ways to improve communication!
10 th GridPP Meeting – 4 June 2004 - 5 Some history – 2003/2004 Recall goals: LCG-0 : pilot service was deployed in Feb/March Was used by CMS in Italy very successfully for productions LCG-1 : based on VDT & EDG 2.0 was deployed in September Not heavily used by experiments – but was successfully used by CMS for production over Christmas and (US-)Atlas demonstrated interoperability with Grid2003 Lacked a real (managed) SE and integration with MSS LCG-2 : based on VDT & EDG 2.1 was ready by end 2003 Data management tools integrated with SRM, intended to package dCache as managed disk-SE. Deployed in Jan/Feb 2004 – many updates – used by experiments in 2004 data challenges July: Introduce the initial publicly available LCG-1 global grid service November: Expanded LCG-1 service with resources and functionality sufficient for the 2004 Computing Data Challenges
10 th GridPP Meeting – 4 June 2004 - 6 Sites in LCG-2/EGEE-0 : June 4 2004 AustriaU-Innsbruck CanadaTriumf Alberta Carleton Montreal Toronto Czech Republic Prague-FZU Prague-CESNET FranceCC-IN2P3 Clermont-Ferrand GermanyFZK Aachen DESY Wuppertal GreeceHellasGrid HungaryBudapest IndiaTIFR IsraelTel-Aviv Weizmann ItalyCNAF Frascati Legnaro Milano Napoli Roma Torino JapanTokyo NetherlandsNIKHEF PakistanNCP PolandKrakow PortugalLIP RussiaSINP-Moscow JINR-Dubna SpainPIC UAM USC UB-Barcelona IFCA CIEMAT IFIC SwitzerlandCERN CSCS TaiwanASCC NCU UKRAL Birmingham Cavendish Glasgow Imperial Lancaster Manchester QMUL RAL-PP Sheffield UCL USBNL FNAL HPPuerto-Rico 22 Countries 58 Sites (45 Europe, 2 US, 5 Canada, 5 Asia, 1 HP) Coming: New Zealand, China, other HP (Brazil, Singapore) 3800 cpu
10 th GridPP Meeting – 4 June 2004 - 7 Experience: Data challenges Alice has been running since March CMS DC04 LHCb now starting seriously Atlas starting now See talks from June 2
10 th GridPP Meeting – 4 June 2004 - 8 Data challenges – so far Resources CPU available – Alice could not fully utilise – storage limitations Disk available – mostly very small amounts Need: –Plan space vs cpu at a site –Ensure that commitments are provided To some extent not requested – delay in dcache SE – asked not to commit all to classic SEs as expected/worried about migration Alice and CMS – Number and size (small) of files: –limitations of existing Castor system, also problems in Enstore/dCache CPU is mostly in core sites At the moment (most of) the other sites have relatively few cpu assigned
10 th GridPP Meeting – 4 June 2004 - 9 Data challenges – 2 Services: LCG-2 services (RB, BDII, CE, SE etc) have been extremely reliable and stable Even RLS was stable (other issues) BDII has been extremely reliable Provided to experiments – allowed them to define a view of the system Software deployment system works Needs some improvement – esp for sites with no shared filesystem Information system Schema does not match batch system functionality Information published (job slots, ETT, etc.) does not reflect batch system Solve with CE per VO, need to improve/adapt schema (?)
10 th GridPP Meeting – 4 June 2004 - 10 RLS issues RLS performance was biggest problem Many fixes made during challenge: CLI tools based on C++ API in place on Java tools Added support for non-SE entries Additional tools (register with existing guid) Case sensitivity Performance analysis – usage of metadata queries Lack of bulk operations No support for transactions Still unresolved service performance issue (see degradation) – seems to be server related No data loss or extended service downtime Replication tests with CNAF Not really tested by CMS
10 th GridPP Meeting – 4 June 2004 - 11 RLS – cont. Many of above issues addressed in version currently being tested Preparing a note describing proposed improvements for discussion: e.g. Combine RMC and LRC into single db to allow db to optimise and join Resolve issues found in data challenges Model for replicated/distributed catalogs? Is the model of metadata appropriate? Experiment vs POOL vs RLS With DB group continue to investigate Oracle replication
10 th GridPP Meeting – 4 June 2004 - 12 Evolution: missing features A full storage element dCache has had many problems Packaged - to be deployed Is dCache sufficient/the only solution? Demonstrated integration of Tier 1 MSSs Full data management tools Nice features of SRM gave users a lot of convenience: - auto directory creation; We were able to continue improving our setup during DC04: - The biggest performance gain was: Michael and his team in DESY developed a new module that reduces the delegated proxy's modulus size in SRM and speeds up the interaction between SRM client and server 3.5 times; (From CMS FNAL team, based on work done by deployment group)
10 th GridPP Meeting – 4 June 2004 - 13 Evolution: missing features Functionally: Port to other RH-derived linux This is now becoming urgent – new hardware, security patches, … VOMS At least the basic part R-GMA For monitoring Replace OpenPBS as default batch system Operationally: Assumption of real operational management by GOCs A lot of work on basics has been done – but need problem management User call centre Lack of take-up Propose FZK/GOC team come to CERN for 1-2 days to really sort this out Accounting: Critical – we have no information about what has been used during the DCs – important for us and for the experiments Monitoring: Grid: lack of consistency in what is presented for each site Experiments: we must put R-GMA in place (at least)
10 th GridPP Meeting – 4 June 2004 - 14 Evolution: Service Challenges Purpose Understand what it takes to operate a real grid service – run for days/weeks at a time (outside of experiment Data Challenges) Trigger/encourage the Tier1 planning – move towards real resource planning for phase 2 – based on realistic usage patterns How does a Tier 1 decide what capacity to provide? What planning is needed to achieve that? Where are we in this process? Get the essential grid services ramped up to needed levels – and demonstrate that they work Set out milestones needed to achieve goals during the service challenges NB: This is focussed on Tier 0 – Tier 1/large Tier 2 Data management, batch production and analysis By end 2004 – have in place a robust and reliable data management service and support infrastructure and robust batch job submission
10 th GridPP Meeting – 4 June 2004 - 15 Service challenges – examples Data Management Networking, file transfer, data management Storage management and interoperability Fully functional storage element (SE) Continuous job probes Understand limits Operations centres Accounting, assume levels of service responsibility, etc Hand-off of responsibility (RAL-Taipei-US/Canada) "Security incident" Detection, incident response, dissemination and resolution IP connectivity Milestones to remove (implementation) need outbound connection from WN User support Assumption of responsibility, demonstrate staff in place, etc VO management Robust and flexible registration, management interfaces, etc Etc.
10 th GridPP Meeting – 4 June 2004 - 16 Data Management – example Data management builds on a stack of underlying services: Network Robust file transfer Storage interfaces and functionality Replica location service Data management tools
10 th GridPP Meeting – 4 June 2004 - 17 Data management – 2 Network layer: Proposed set of network milestones already in draft Network and fabric groups at CERN – collaborate with (initially) official Tier 1s Dedicated private networks for Tier 0 Tier 1 online raw data transfers File transfer service layer: Move a file from A to B, with good perfomance and reliability This service would normally only be visible via the data movement service Only app that can access/schedule/control this network E.g. of this is gridftp, bbftp, etc. Reliability – the service must detect failure, retry, etc. Interfaces to storage systems (SRM) The US-CMS/CERN Edge Computing project might be an instance of this layer (network + file transfer)
10 th GridPP Meeting – 4 June 2004 - 18 Data management – 3 Data movement service layer: Builds on top of file transfer and network layers To provide an absolutely reliable and dependable service with good performance Implement queuing, priorities, etc. Initiates file transfers using file transfer service Acts on applications behalf – a file handed to the service will be guaranteed to arrive Replica Location Service: Makes use of data movement Should be distributed: Distributed/replicated databases (Oracle) with export/import to XML/other dbs? RLI model?
10 th GridPP Meeting – 4 June 2004 - 19 Job probes – example Continuous flood of jobs Fill all resources Use as probes – test if they can use the resources Data access, cpu, etc Understand limitations, bottlenecks of the system Baseline measurement, find limits, build and improve This might be a function of the GOC Overseen by RAL-Taipei-+ collaboration ? A challenge might run for a week Outside of experiment data challenges In parallel (or part of) data management or other challenges
10 th GridPP Meeting – 4 June 2004 - 20 Clarify: LCG project LCG applications LCG middleware release LCG infrastructure EGEE (i.e. LCG) infrastructure The LCG-2 infrastructure IS the EGEE infrastructure Can be used now by other applications Expect to run LCG-2 based infrastructure for 1 year New middleware has to be better than this becomes EGEE-developed middleware runs on pre-production Moves to production when more functional/stable/reliable/… middleware and infrastructure Transition to EGEE
10 th GridPP Meeting – 4 June 2004 - 21 Some remarks Existing LCG-2 sites already support many VOs Not only LCG Front-line support for all VOs is via the ROCs Process to introduce a new VO Well defined Some tools needed to make the mechanics simpler Evaluation of new middleware by applications, and preparation for deployment in EGEE-1 This is what the pre-production service is for Resource allocation/negotiation OMC/ROC managers/NA4 – negotiate with RCs and apps
10 th GridPP Meeting – 4 June 2004 - 22 Joining EGEE – Overview of process Application nominates VO manager Find (CIC) to operate VO server VO is added to registration procedure Determine access policy: Propose discussion (body) NA4 + ROC manager group Which sites will accept to run app (funding, political constraints) Need for a test VO? Modify site configs to allow the VO access Negotiate CICs to run VO-specific services: VO server (see above) RLS service if required Resource Brokers (can be some general at CIC and others owned by apps), UIs – general at CIC/ROC – or on apps machines etc Potentially (if needed) BDII to define apps view of resources Application software installation Understand application environment, and how installed at sites Many of these issues can be negotiated by NA4/SA1 in a short discussion with the new apps community
10 th GridPP Meeting – 4 June 2004 - 23 Resource Negotiation Policy The EGEE infrastructure is intended to support and provide resources to many virtual organisations Initially HEP (4 LHC experiments) + Biomedical Each RC supports many VOs and several application domains – situation now for centres in LCG Initially must balance resources contributed by the application domains and those that they consume Resource centres may have specific allocation policies E.g. due to funding agency attribution by science or by project Expect a level of peer review within application domains to inform the allocation process New VOs and Resource centres should satisfy minimum requirements Commit to bring a level of additional resources consistent with their requirements Requirement on JRA1 to provide mechanisms to implement/enforce quotas, etc Selection of new VO/RC via NA4
10 th GridPP Meeting – 4 June 2004 - 24 New Resource Centres Procedure for new sites to join LCG2/EGEE is well defined and documented Sites can join now Coordination for this is via the ROCs Who will support the installations, set-up, and operation
10 th GridPP Meeting – 4 June 2004 - 25 Certification, Testing and Release Cycle Certification, Testing and Release Cycle CERTIFICATION TESTING SERVICES Integrate Basic Functionality Tests Run tests C&T suites Site suites Run Certification Matrix Release candidate tag PRE-PRODUCTION PRODUCTION APP INTEGR Certified release tag DEVELOPMENT & INTEGRATION UNIT & FUNCTIONAL TESTING Dev Tag JRA1 HEP EXPTS BIO-MED OTHER TBD APPS SW Installation DEPLOYMENT PREPARATION Deployment release tag DEPLOY SA1 Production tag
10 th GridPP Meeting – 4 June 2004 - 26 Interoperability Several grid infrastructures for LHC experiments: LCG-2/EGEE, Grid2003/OSG, Nordugrid, other national grids LCG/EGEE explicit goals to interoperate One of LCG service challenges Joint projects on storage elements, file catalogs, VO management, etc. Most are VDT (or at least Globus-based) Grid3 & LCG use GLUE schema Issues are: File catalogs, information schema, etc at technical level Policy and semantic issues
10 th GridPP Meeting – 4 June 2004 - 27 Deployment – GridPP support GridPP contributions to deployment have been crucial: 5 of CERN deployment team funded by PPARC Essential to bringing the current release to such stability and reliability – and its not that hard to install – 58 sites so far Grid Operations Centre at RAL Security team – very active
10 th GridPP Meeting – 4 June 2004 - 28 Summary Huge amount of work done in the last year to produce a robust set of middleware These lessons must be applied to new developments LCG is being successfully used in experiment data challenges Many problems found and addressed (tools, bugs, etc) Other fundamental problems subject of development Services are now very reliable Plans for service challenges to help move forward Must ensure that only single developments – coordinate EGEE/LCG/OSG/etc. Push for interoperability at all levels – experiments have big role to play in insisting on single solutions Emphasis now on strengthening the operational infrastructure EGEE investment helps here PPARC/GridPP support has been essential
Your consent to our cookies if you continue to use this website.