
Slide 1: Experience Deploying the Large Hadron Collider Computing Grid
Markus Schulz, CERN IT GD-GIS, markus.schulz@cern.ch
12 July 2004

Slide 2: Overview
- LHC Computing Grid
- CERN
- Challenges for LCG
- The project
- Deployment and status
- LCG-EGEE
- Summary
(image: the Large Hadron Collider)

Slide 3: CERN
CERN, the European Organization for Nuclear Research
- located on the Swiss-French border close to Geneva
- funded by 20 European member states; several observer and non-member states also participate in the experimental programme
- the world's largest centre for particle physics research
- provides infrastructure and tools (accelerators etc.)
- ~3000 employees and several thousand visiting scientists
- 50 years of history (several Nobel Prizes)
- the place where the WWW was born (1990)
- next milestone: the LHC (Large Hadron Collider)
http://public.web.cern.ch/public/about/aboutCERN.html

Slide 4: Challenges for the LHC Computing Grid
http://lcg.web.cern.ch/lcg
- the LHC (Large Hadron Collider), with 27 km of magnets, is the largest superconducting installation
- proton beams collide at an energy of 14 TeV
- 40 million events per second from each of the 4 experiments; after triggers and filters, 100-1000 MByte/second remain
- every year ~15 PetaByte of data will be recorded
- this data has to be reconstructed and analyzed by the users
- in addition, a large computational effort is needed to produce Monte Carlo data

Slide 5: The Four Experiments at the LHC
(figure: the four experiments ALICE, ATLAS, CMS and LHCb; credit: Federico Carminati, EU review presentation)

Slide 6: 15 PetaByte/year
15 PetaByte/year have to be recorded, catalogued, managed, distributed and processed.
(figure: scale comparison; 50 CD-ROMs hold 35 GB and stack 6 cm high; a CD stack with one year of LHC data would be ~20 km tall, next to a Concorde at 15 km, a balloon at 30 km and Mt. Blanc at 4.8 km; a rough check of the stack height follows below)
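As a back-of-envelope check of the stack comparison, a minimal sketch using only the two figures quoted on the slide (50 CD-ROMs = 35 GB, 6 cm of stack); everything else is plain arithmetic:

```python
# Rough check of the "CD stack with 1 year of LHC data" comparison.
# Figures from the slide: 50 CD-ROMs hold 35 GB and stack 6 cm high.
gb_per_cd = 35 / 50          # ~0.7 GB per CD
cm_per_cd = 6 / 50           # ~0.12 cm per CD

data_per_year_gb = 15e6      # 15 PB/year expressed in GB
cds_needed = data_per_year_gb / gb_per_cd
stack_km = cds_needed * cm_per_cd / 1e5   # cm -> km

print(f"{cds_needed:.2e} CDs, stack ~{stack_km:.0f} km high")
# roughly 2e7 CDs and a stack of a few tens of km, consistent in
# order of magnitude with the ~20 km quoted on the slide
```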

Slide 7: Core Tasks
Reconstruction: transform signals from the detector into physical properties
- energy, charge, tracks, momentum, particle ID
- computationally intensive, with modest I/O requirements
- structured activity (production manager)
Simulation: start from the theory and compute the response of the detector
- very computationally intensive
- structured activity, but a larger number of parallel activities
Analysis: complex algorithms search for similar structures to extract physics
- very I/O intensive, with a large number of files involved
- access to data cannot be effectively coordinated
- iterative, parallel activities of hundreds of physicists

Slide 8: Computing Needs
- some 100 million SPECInt2000 are needed
- a 3 GHz Pentium IV delivers ~1k SPECInt2000
- so O(100k) CPUs are needed (worked out below)
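The CPU count follows directly from the two numbers on the slide:

```latex
N_{\text{CPU}} \;\approx\; \frac{10^{8}\ \text{SPECInt2000 needed}}{10^{3}\ \text{SPECInt2000 per CPU}} \;=\; 10^{5}\ \text{CPUs}
```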

Slide 9: Large and Distributed User Community
- CERN collaborators: > 6000 users from 450 institutes
- Europe: 267 institutes, 4603 users; elsewhere: 208 institutes, 1632 users
- no single institute has all the required computing, but all have access to some computing
- solution: connect all the resources into a computing grid

Slide 10: The LCG Project (and what it isn't)
Mission: to prepare, deploy and operate the computing environment for the experiments to analyze the data from the LHC detectors.
Two phases:
- Phase 1 (2002-2005): build a prototype based on existing grid middleware, deploy and run a production service, and produce the Technical Design Report for the final system
- Phase 2 (2006-2008): build and commission the initial LHC computing environment
LCG is NOT a development project for middleware, but problem fixing is permitted (even if writing code is required).

Slide 11: LCG Time Line (2003-2007)
(flattened timeline figure; recoverable milestones:)
- LCG-1 open 15 Sept 2003; testing, with simulated event productions
- LCG-2: upgraded middleware, management and operations tools; principal service for the LHC data challenges (2004)
- computing models; TDR* for the Phase 2 grid
- LCG-3: second-generation middleware; second-generation middleware prototyping and development; validation of computing models
- Phase 2 service acquisition, installation, commissioning; experiment setup and preparation
- Phase 2 service in production; physics computing service open for first data (2007)
(* TDR = Technical Design Report)

Slide 12: LCG Scale and Computing Model
Sites shown: RAL, IN2P3, BNL, FZK, CNAF, PIC, ICEPP, FNAL, USC, NIKHEF, Krakow, CIEMAT, Rome, Taipei, TRIUMF, CSCS, Legnaro, UB, IFCA, IC, MSU, Prague, Budapest, Cambridge, plus small centres, desktops and portables.
- Tier-0: reconstruction (ESD); record raw data and ESD; distribute data to the Tier-1s
- Tier-1: data-heavy analysis; permanent, managed, grid-enabled storage (raw, analysis, ESD) and MSS; reprocessing; regional support
- Tier-2: managed disk storage; simulation; end-user analysis; parallel interactive analysis
Data distribution: ~70 Gbit/s.

Slide 13: LCG - a Collaboration
Building and operating the LHC Grid is a collaboration between:
- the physicists and computing specialists from the experiments
- the projects in Europe and the US that have been developing grid middleware: the European DataGrid (EDG) and the US Virtual Data Toolkit (Globus, Condor, PPDG, iVDGL, GriPhyN)
- the regional and national computing centres that provide resources for the LHC, with some contribution from HP (Tier-2 centres)
- the research networks
Researchers, software engineers and service providers.

Slide 14: LCG-2 Software
LCG-2_1_0 core packages:
- VDT (Globus 2)
- EDG WP1 (Resource Broker)
- EDG WP2 (replica management tools): one central RMC and LRC per VO, located at CERN, with an Oracle backend
- several bits from other WPs (configuration objects, information providers, packaging...)
- GLUE 1.1 information schema, plus a few essential LCG extensions
- MDS-based information system with LCG enhancements
Almost all components have gone through some re-engineering for robustness, scalability, efficiency and adaptation to local fabrics.

Slide 15: LCG-2 Software - Authentication/Authorization and Data Management
Authentication and authorization:
- Globus GSI, based on X.509 certificates
- LCG established trust relationships between the CAs in the project
- Virtual Organization (VO) registration is hosted at different sites
Data management tools:
- catalogues keep track of replicas (Replica Metadata Catalog and Local Replica Catalog); a toy model follows below
- SRM interface for several mass storage systems and disk pools
- wide-area transport via GridFTP
(diagram: human-readable logical file names such as lfn:Toto.ps map via the RMC to a GUID, which is not meant for humans; the LRC maps the GUID to one or more storage URLs such as srm://host.domain/path/file; an RLI indexes the LRCs)
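A minimal sketch of the two-level mapping the diagram describes, using hypothetical in-memory stand-ins for the RMC and LRC rather than the real EDG catalog APIs:

```python
# Illustrative model of the RMC/LRC replica catalog structure (not the EDG API).
import uuid

rmc = {}  # Replica Metadata Catalog: logical file name (LFN) -> GUID
lrc = {}  # Local Replica Catalog:    GUID -> list of storage URLs (SURLs)

def register_file(lfn, surl):
    """Register a new file: create a GUID, map the LFN to it, record one replica."""
    guid = str(uuid.uuid4())
    rmc[lfn] = guid
    lrc[guid] = [surl]
    return guid

def add_replica(lfn, surl):
    """Record an additional physical copy of an already-registered file."""
    lrc[rmc[lfn]].append(surl)

def list_replicas(lfn):
    """Resolve a human-readable LFN to all known physical locations."""
    return lrc[rmc[lfn]]

register_file("lfn:Toto.ps", "srm://host.domain/path/file")
add_replica("lfn:Toto.ps", "srm://other.site/path/file")
print(list_replicas("lfn:Toto.ps"))
```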

Slide 16: LCG-2 Software - Information System and Workload Management
Information system:
- Globus MDS based, for information gathering on a site
- LDAP plus a lightweight DB-based system for collecting data from the sites
- the LCG-BDII solved the scalability problem of the Globus 2 MDS (tested with > 200 sites)
- contains information on the capacity, capability, utilization and state of services (computing, storage, catalogues...)
Workload management tools:
- match user requests with the resources available to a VO; requirements are formulated in JDL (classads), with user-tunable ranking of resources (a toy matchmaking sketch follows below)
- use the RLS and the information system
- keep the state of jobs and manage extension of credentials, input/output sandboxes, proxy renewal...
- interface to local batch systems via a gateway node
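A toy sketch of the matchmaking idea: filter resources with a requirements predicate, then order the survivors by a user-supplied rank. The attribute names and values are made up for illustration; this is not the JDL syntax or the broker code itself.

```python
# Toy matchmaker over a few imaginary computing elements (CEs).
compute_elements = [
    {"id": "ce01.site-a.org", "free_cpus": 12,  "max_wallclock_min": 2880, "vo": ["atlas", "cms"]},
    {"id": "ce02.site-b.org", "free_cpus": 150, "max_wallclock_min": 720,  "vo": ["cms"]},
    {"id": "ce03.site-c.org", "free_cpus": 4,   "max_wallclock_min": 4320, "vo": ["atlas"]},
]

# "Requirements" and "Rank" as a user would express them in JDL, here as callables.
requirements = lambda ce: "atlas" in ce["vo"] and ce["max_wallclock_min"] >= 1440
rank = lambda ce: ce["free_cpus"]   # prefer the CE with the most free CPUs

def match(ces, requirements, rank):
    """Return the CEs that satisfy the requirements, best-ranked first."""
    candidates = [ce for ce in ces if requirements(ce)]
    return sorted(candidates, key=rank, reverse=True)

for ce in match(compute_elements, requirements, rank):
    print(ce["id"], ce["free_cpus"])
# -> among the ATLAS-capable CEs with enough wallclock time, ce01 ranks first
```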

Slide 17: LCG-2 Software - Job Flow
Job states: submitted, waiting, ready, scheduled, running, done, cleared.
(diagram: the UI submits the job to the Network Server on the RB node (arrived on RB); the Workload Manager asks the Match-Maker/Broker, which uses the Replica Catalog, the Information Service and the CE/SE characteristics and status to do the matching; the Job Adapter prepares the job and Job Control (CondorG) sends it to a CE; the Log Monitor and the Logging & Bookkeeping service follow it on the CE until it is processed and the output goes back to the user; sandboxes are held in RB storage)
- the Input Sandbox is what you take with you to the node; the Output Sandbox is what you get back
- failed jobs are resubmitted (see the sketch below)
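A compact sketch of the job life cycle as a state machine with resubmission of failed jobs; the states come from the slide, while the failure probability and retry policy are assumptions added purely for illustration.

```python
import random

# Job life cycle from the slide; "aborted" is added here only to model a failure path.
TRANSITIONS = {
    "submitted": "waiting",   # accepted by the Network Server on the RB
    "waiting":   "ready",     # matched to a CE by the broker
    "ready":     "scheduled", # handed to the local batch system
    "scheduled": "running",
    "running":   "done",
    "done":      "cleared",   # output sandbox retrieved by the user
}

def run_job(max_retries=3):
    attempts = 0
    while attempts <= max_retries:
        state = "submitted"
        while state != "cleared":
            # pretend the job occasionally fails while running
            if state == "running" and random.random() < 0.2:
                state = "aborted"
                break
            state = TRANSITIONS[state]
        if state == "cleared":
            return f"cleared after {attempts + 1} attempt(s)"
        attempts += 1          # failed jobs are resubmitted
    return "failed permanently"

print(run_job())
```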

Slide 18: LCG Grid Deployment Area
Goal: deploy and operate a prototype LHC computing environment.
Scope:
- integrate a set of middleware and coordinate and support its deployment to the regional centres
- provide operational services to enable running as a production-quality service
- provide assistance to the experiments in integrating their software and deploying it in LCG; provide direct user support
Deployment goals for LCG-2:
- production service for the Data Challenges in 2004, initially focused on batch production work
- gain experience in close collaboration between the regional centres; learn how to maintain and operate a global grid
- focus on building a production-quality service: robustness, fault tolerance, predictability and supportability
- understand how LCG can be integrated into the sites' physics computing services

Slide 19: LCG Deployment Organisation and Collaborations
(organisation diagram: the Deployment Area Manager leads the certification, deployment and experiment integration teams together with the testing, security and storage groups; the Grid Deployment Board and its task forces advise, inform and set policy; the LHC experiments set requirements; the regional centres participate; operations centres are run by RAL and call centres by FZK; collaborative activities run through the JTB, HEPiX, GGF and grid projects such as EDG, Trillium and Grid3)

Slide 20: Implementation
- a core team at CERN: the Grid Deployment group (~30 people)
- collaboration of the regional centres through the Grid Deployment Board (GDB)
- partners take responsibility for specific tasks (e.g. GOCs, GUS)
- focussed task forces as needed
- collaborative joint projects via the JTB, grid projects, etc.
CERN deployment group: core preparation, certification, deployment and support activities; integration, packaging, debugging, development of missing tools; deployment coordination and support; security and VO management; experiment integration and support.
GDB: country representatives for the regional centres; addresses policy and operational issues that require general agreement; brokered agreements on the initial shape of LCG-1 (via 5 working groups), on security, and on what is deployed.

Slide 21: Operations Services
Operations service: RAL (UK) leads the sub-project on developing operations services.
- initial prototype: http://www.grid-support.ac.uk/GOC/
- basic monitoring tools; mail lists for problem resolution
- GDB database containing contact and site information
- working on defining policies for operation and responsibilities (draft document), and on grid-wide accounting
Monitoring:
- GridICE (a development of the DataTag Nagios-based tools)
- GridPP job submission monitoring
- many more, e.g. http://goc.grid.sinica.edu.tw/gstat/
Deployment and operations support: hierarchical model; CERN acts as first-level support for the Tier-1 centres, and the Tier-1 centres provide first-level support for their associated Tier-2s.

Slide 22: User Support
- central model for user support; the VOs provide first-level triage
- FZK (Germany) leads the sub-project to develop user support services, with a web portal for problem reporting: http://gus.fzk.de/
- experiment contacts send problems through the FZK portal; during the data challenges the experiments used a direct channel via the GD teams
- experiment integration support by a CERN-based group, in close collaboration during the data challenges
Documentation:
- installation guides (manual and management-tool based), see http://grid-deployment.web.cern.ch/grid-deployment/cgi-bin/index.cgi?var=releases
- rather comprehensive user guides

Slide 23: Security
LCG Security Group:
- LCG-1 usage rules (still used by LCG-2)
- registration procedures and VO management; agreement to collect only a minimal amount of personal data
- initial audit requirements are defined
- initial incident response procedures; site security contacts etc. are defined
- set of trusted CAs (including the Fermilab online KCA)
- draft security policy (to be finished by the end of the year)
Web site: http://proj-lcg-security.web.cern.ch/proj-lcg-security/

Slide 24: Certification and Release Cycles
(process diagram: developers produce a dev tag from development, integration and unit/functional testing; the certification and testing services integrate it, run the basic functionality tests, the C&T and site test suites and the certification matrix, yielding a release-candidate tag and then a certified release tag; application integration (HEP experiments, bio-med, other applications TBD) and deployment preparation with software installation produce the deployment release tag, which is deployed to pre-production and then to production with a production tag)

Slide 25: 2003-2004 Milestones
Project Level 1 deployment milestones for 2003:
- July: introduce the initial publicly available LCG-1 global grid service, with 10 Tier-1 centres on 3 continents
- November: expanded LCG-1 service with resources and functionality sufficient for the 2004 Computing Data Challenges; additional Tier-1 centres and several Tier-2 centres in more countries; expanded resources at the Tier-1s; agreed performance and reliability targets
2004: the "LHC Data Challenges"
- large-scale tests of the experiments' computing models, processing chains, grid technology readiness and operating infrastructure
- ALICE and CMS data challenges started at the beginning of March; LHCb and ATLAS started in May/June
- the big challenge for this year is data: file catalogues (millions of files), replica management, database access, and integrating all available mass storage systems (several hundred TByte)

Slide 26: History
- Jan 2003: GDB agreed to take VDT and EDG components
- March 2003: LCG-0, existing middleware, waiting for the EDG 2 release
- September 2003: LCG-1, 3 months late and therefore with reduced functionality; extensive certification process; improved stability (RB, information system); 32 sites and ~300 CPUs integrated; operated until early January; first use for production
- December 2003: LCG-2, the full set of functionality for the Data Challenges, but only the "classic SE"; first MSS integration; deployed in January; the Data Challenges started in February, so testing happened in production; large sites integrate their resources (MSS and farms) into LCG
- May 2004 onwards: incremental releases with improved services and SRM-enabled storage for disk and MSS systems

Slide 27: LCG-1 Experience (2003)
Goal: integrate sites and operate a grid. Problems:
- only 60% of the planned personnel, and the software was late
- the hierarchical support model (primary and secondary sites) worked well for some regions, less well for others
- installation and configuration were an issue: there was only time to package the software for the LCFGng tool (problematic), documentation was not sufficient (partially compensated by travel), and a manual installation procedure was documented only when new staff arrived
- communication and cooperation between sites needed to be established
- MDS + EDG-BDII were deployed in a robust way; redundant regional GIISes vastly improved the scalability and robustness of the information system
- upgrades, especially non-backward-compatible ones, took very long; not all sites showed the same dedication
- still some problems with the reliability of some of the core services
Overall, a big step forward.

Slide 28: LCG-2 Production - Operating a Large-Scale Production Service
- started with 8 "core" sites, each bringing significant resources and sufficient experience to react quickly
- weekly core-site phone conference; weekly meeting with each of the experiments; weekly joint meeting of the sites and the experiments (GDA)
- introduced a testZone for new sites: LCG-1 showed that ill-configured sites can affect all sites, so sites stay in the testZone until they have been stable for some time
- further improved (simplified) the information system: addresses manageability, improves robustness and scalability, and allows partitioning of the grid into independent views
- introduced a local testbed for experiment integration: rapid feedback on functionality from the experiments triggered several changes to the RLS system

Slide 29: LCG-2 - Integrating Local Resources
Focus on integrating local resources:
- batch systems at CNAF, CERN and NIKHEF already integrated
- MSS systems: CASTOR at several sites, Enstore at FNAL
Experiment software distribution:
- mechanism based on a shared file system with access for privileged users
- a tool publishes the installed software in the information system (a sketch of the idea follows below)
- needs to be as transparent as possible (some work still to be done)
Improved documentation and installation:
- sites can choose to use LCFGng or follow a manual installation guide
- a generic configuration description eases integration with local tools
- documentation includes simple tests, so sites join in a better state
- improved readability by moving to HTML and PDF; release page
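A minimal sketch of the publishing idea: privileged VO users install software in a shared area and drop tags there, and an information provider turns the tags into attributes for the site's information system. The file layout and attribute name below are assumptions for illustration, not the actual LCG tool or GLUE attribute.

```python
# Hypothetical info provider: read software tags installed by privileged VO users
# from a shared area and print them as attributes for the site's information system.
import os

SHARED_SW_AREA = "/opt/exp_software"   # assumed mount point of the shared file system

def collect_tags(vo):
    """Return the list of software tags a VO has published for this site."""
    tag_file = os.path.join(SHARED_SW_AREA, vo, ".tags")
    if not os.path.exists(tag_file):
        return []
    with open(tag_file) as f:
        return [line.strip() for line in f if line.strip()]

def publish(vos):
    """Emit one attribute line per installed software tag (illustrative format)."""
    for vo in vos:
        for tag in collect_tags(vo):
            print(f"SoftwareRunTimeEnvironment: VO-{vo}-{tag}")

publish(["atlas", "cms", "lhcb", "alice"])
```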

Slide 30: (image-only slide, no transcript text)

Slide 31: Sites in LCG-2, 8 July 2004
- 22 countries, 63 sites: 49 in Europe, 2 in the US, 5 in Canada, 6 in Asia, 1 HP
- coming: New Zealand, China, Korea, and further HP sites (Brazil, Singapore)
- 6100 CPUs

Slide 32: Usage
Hard to measure:
- VOs "pick" services and add their own components for job submission, file catalogues, replication...
- we have no central control of the resources
- accounting has to be improved
File catalogues (only used by 2 VOs): ~2.5 million entries.

Slide 33: Integrating Site Resources
The plan:
- provide defined grid interfaces to a grid site: storage, compute clusters, etc.
- integration with local systems is the site's responsibility
- middleware is layered over existing system installations
But (real life):
- interfaces are not well defined (SRM is maybe a first?)
- lots of small sites require a packaged solution, including fabric management (disk pool managers, batch systems), that installs magically out of the box
- we strive for the first view while providing the latter, but "some assembly is required" and it costs effort
Constraints:
- packaging and installation are integrated with some of the middleware
- complex dependencies between middleware packages
- the current software requires that many holes are punched into the sites' firewalls

Slide 34: Integrating Site Resources - Adding Sites
- the site contacts the deployment team or a Tier-1 centre
- the deployment team sends the form for the contact DB and points the site to the release page
- the site decides, after consultation, on the scope and method of installation
- the site installs; problems are resolved via the mailing list and Tier-1 intervention
- the site runs the initial certification tests (provided with the installation guides)
- the site is added to the testZone information system
- the deployment team runs certification jobs and helps the site fix problems
- the tests are repeated and the status is published (GIIS and status pages); an internal web-based tool follows the history
- VOs add stable sites to their RBs; the sites are then added to the productionZone
Most frequent problems: missing or wrong localization of the configuration; firewalls not configured correctly.

Slide 35: 2004 Data Challenges
The LHC experiments use multiple grids and additional services.
Integration and interoperability:
- the experiments expect some central resource negotiation concerning queue length, memory, scratch space, storage etc.
Service provision:
- planned to provide shared services (RBs, BDIIs, UIs etc.), but the experiments need to augment the services on the UI and to define their own super/subsets of the grid, so individual RBs/BDIIs/UIs per VO (optionally on one node)
- scalable services for data transport are needed: DNS-switched access to GridFTP (see the sketch after this slide)
- performance issues with several tools (RLS, information system, RBs); most are understood, and workarounds and fixes are implemented and part of 2_1_0
- local resource managers (batch systems) are too smart for GLUE: the GlueSchema cannot express the richness of batch systems (LSF etc.), so the load distribution is not understandable for users (it looks wrong); the problem is understood and a workaround is in preparation
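A minimal sketch of what "DNS-switched access to GridFTP" means from the client side: a single alias resolves to several servers, and the client simply uses one of the addresses returned. The host name below is made up; the real aliases depend on the site.

```python
# Resolve a load-balanced alias and pick one of the GridFTP servers behind it.
import random
import socket

def pick_gridftp_server(alias="gridftp.example-tier1.org"):
    # gethostbyname_ex returns (canonical_name, aliases, ip_addresses);
    # with round-robin DNS the address list contains all servers behind the alias.
    _, _, addresses = socket.gethostbyname_ex(alias)
    return random.choice(addresses)

# A client would then open its gsiftp:// transfer against the chosen address,
# spreading transfers across the pool with no change on the client side.
if __name__ == "__main__":
    try:
        print(pick_gridftp_server())
    except socket.gaierror:
        print("alias does not resolve here (the host name above is only an example)")
```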

Slide 36: Interoperability
Several grid infrastructures serve the LHC experiments: LCG-2/EGEE, Grid2003/OSG, NorduGrid and other national grids.
- LCG/EGEE has explicit goals to interoperate; this is one of the LCG service challenges
- joint projects on storage elements, file catalogues, VO management, etc.
- most infrastructures are VDT (or at least Globus) based; Grid2003 and LCG use the GLUE schema
- the issues are file catalogues, information schemas, etc. at the technical level, plus policy and semantic issues

Slide 37: Developments in 2004
General: LCG-2 will be the service run in 2004; the aim is to evolve it incrementally and run a stable service, with service challenges (data transport at 500 MB/s for one week, jobs, interoperability).
Some functional improvements:
- extend access to MSS (tape systems) and managed disk pools
- distributed vs. replicated replica catalogues, with Oracle back-ends, to avoid reliance on single service instances
Operational improvements:
- monitoring systems: move towards proactive problem finding and the ability to take sites on/offline; experiment monitoring (R-GMA); accounting; a control system
- a "cookbook" to cover planning, installation and operation
- activate the regional centres to provide and support services; this has improved over time, but in general there is still too little sharing of tasks
Address integration issues: with large clusters (on non-routed networks), with storage systems, with different OSs, and integrating with other experiments and applications.

Slide 38: Changing Landscape
The view of grid environments has changed in the past year:
- from a view where all LHC sites would run a consistent and identical set of middleware
- to a view where large sites must support many experiments, each of which has its own grid requirements
National grid infrastructures are coming, catering to many applications and not necessarily driven by HEP requirements.
We have to focus on interoperating between potentially diverse infrastructures ("grid federations"):
- at the moment these share the same underlying middleware, but the modes of use and policies are different (information system, file catalogues, ...)
- we need agreed services, interfaces and protocols
The situation is now more complex than anticipated.

Slide 39: LCG - EGEE
- LCG-2 will be the production service during 2004; it will also form the basis of the initial EGEE production service, will be maintained as a stable service, and will continue to be developed
- a development service is expected in parallel from Q2 2004, based on the EGEE middleware prototypes and run as a service on a subset of EGEE/LCG production sites
- the core infrastructure of the LCG and EGEE grids will be operated as a single service: LCG includes the US and Asia, EGEE includes other sciences
- the ROCs support resource centres and applications, similar to the LCG primary sites; some ROCs and LCG primary sites will be merged
- the LCG Deployment Manager will be the EGEE Operations Manager, and a member of the PEB of both projects
(diagram: overlap between EGEE and LCG in geographical coverage and applications)

Slide 40: Summary - Deployment of Grid Services

Slide 41: Summary
- 2003: first MC event production
- in 2004 we must show that we can handle the data, supporting simple computing models; this is the key goal of the 2004 Data Challenges
Targets for the end of this year:
- basic model demonstrated using current grid middleware
- all Tier-1s and ~25% of the Tier-2s operating a reliable service
- security model validated, storage model understood
- a clear idea of the performance, scaling, operations and management issues
(timeline, repeated on the following slides: 2004 demonstrate core data handling and batch analysis; 2005 decisions on final core middleware; 2006 installation and commissioning, initial service in operation; 2007 first data)

Slide 42: Summary II
- getting the grid services going in time for the LHC will be even harder than we think today
- the service for the LHC must be in permanent operation by September 2006, linking CERN, the Tier-1s and the major Tier-2s, so we will spend the first part of 2006 on installation and commissioning
- the technology we use must therefore be working (really working) by the end of 2005: a year from now we will have to decide which middleware we are going to use
- from now until the end of 2006 we have to turn prototype services into pilot services

Slide 43: Conclusion (the last "last" slide)
- there are still many questions about grids and data handling
- EGEE provides LCG with opportunities: to develop an operational grid in an international, multi-science context, and to influence the evolution of a generic middleware package
- but the LHC clock is ticking: deadlines will dictate simplicity and pragmatism
- LCG has long-term requirements, while at present EGEE is a two-year project
- LCG must encompass non-European resources and grids (based on different technologies)
- no shortage of challenges and opportunities

