1  LCG Deployment Status
25 May 2004
Oliver Keeble, CERN IT GD-GIS
oliver.keeble@cern.ch

2  Overview
- The LHC (Large Hadron Collider) Computing Grid
- The LHC and its computing requirements
- Challenges for LCG
- The project
- Deployment and status
- LCG-EGEE
- Summary

3  The Four Experiments at the LHC
(Figure: the four LHC experiments, including LHCb; credit: Federico Carminati, EU review presentation)

4  Challenges for the LHC Computing Grid
http://lcg.web.cern.ch/lcg
- The LHC (Large Hadron Collider), with 27 km of magnets, is the largest superconducting installation
- 40 million events per second from each of the 4 experiments
- After triggers and filters, 100-1000 MBytes/second remain
- Every year ~15 PetaBytes of data will be stored (a rough consistency check follows below)
- This data has to be reconstructed and analyzed by the users
- In addition, a large computational effort is needed to produce Monte Carlo data
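The ~15 PB/year figure follows from the quoted post-filter rates. A minimal back-of-the-envelope check is sketched below; the assumed average rate per experiment and the 10^7-second running year are illustrative assumptions, not numbers from the talk.

```python
# Rough consistency check of the data-volume estimate on this slide.
# The assumed average rate and the accelerator-year length are illustrative.

MB = 1e6                      # bytes
PB = 1e15                     # bytes

sustained_rate = 300 * MB     # assumed average post-filter rate per experiment (B/s)
experiments = 4
seconds_per_year = 1e7        # a commonly used "accelerator year" of running time

total_bytes = sustained_rate * experiments * seconds_per_year
print(f"~{total_bytes / PB:.0f} PB per year")   # ~12 PB, consistent with the ~15 PB quoted
```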

5  Challenges for the LHC Computing Grid
http://lcg.web.cern.ch/lcg
CERN and its collaborators - a global effort:
- >6000 users from 450 institutes
- No single institute has the required computing capacity, but all have access to some computing
- Europe: 267 institutes, 4603 users
- Elsewhere: 208 institutes, 1632 users

6  The LCG Project (and what it isn't)
Mission: to prepare, deploy and operate the computing environment for the experiments to analyze the data from the LHC detectors
Two phases:
- Phase 1 (2002-2005)
  - Build a prototype, based on existing grid middleware (LCG-2)
  - Deploy and run a production service
  - Produce the Technical Design Report (TDR) for the final system
- Phase 2 (2006-2008)
  - Build and commission the initial LHC computing environment
LCG is NOT a development project for middleware, but problem fixing is permitted (even if writing code is required)

7  LCG Time Line
(Timeline figure spanning 2003-2007, up to first data and the opening of the physics computing service. Recoverable milestones:)
- LCG-1 open (achieved) - 15 Sept 2003
- LCG-2 - upgraded middleware, management and operations tools; principal service for the LHC data challenges; testing with simulated event productions
- Computing models; TDR* for the Phase 2 grid
- Second-generation middleware prototyping and development; LCG-3 - second-generation middleware; validation of computing models
- Phase 2 service: acquisition, installation, commissioning; experiment setup & preparation
- Phase 2 service in production; first data
* TDR - Technical Design Report

8  LCG-1 Experience (2003)
- Jan 2003: GDB agreed to take VDT and EDG components
- March 2003: LCG-0 - existing middleware, while waiting for the EDG-2 release
- September 2003: LCG-1 - first production release; integrate sites and operate a grid
  - Integrated 32 sites, ~300 CPUs
  - 3 months late -> reduced functionality
  - Extensive middleware certification process
  - Operated until early January; first use for production
- Introduced a hierarchical support model (primary and secondary sites)
  - Worked well for some regions, less well for others
  - Communication/cooperation between sites needed to be established
- Installation and configuration was an issue
  - Only time to package software for the LCFGng tool (problematic)
  - Not sufficient documentation (partially compensated by travel)
  - Manual installation procedure documented when new staff arrived

9  LCG-1 -> LCG-2
- Deployed MDS + EDG-BDII in a robust way, with redundant regional GIISes - a big step forward
  - Vastly improved the scalability and robustness of the information system
- Upgrades, especially non-backward-compatible ones, took very long
  - Not all sites showed the same dedication
- Still some problems with the reliability of some of the core services
Project Level 1 deployment milestones for 2003:
- July: introduce the initial publicly available LCG-1 global grid service, with 10 Tier 1 centres on 3 continents
- November: expanded LCG-1 service with resources and functionality sufficient for the 2004 Computing Data Challenges
  - Additional Tier 1 centres, several Tier 2 centres - more countries
  - Expanded resources at Tier 1s
  - Agreed performance and reliability targets
  - Around 30 sites

10  Data Challenges
- 2004: the "LHC Data Challenges" - large-scale tests of the experiments' computing models, processing chains, operating infrastructure and grid readiness
  - ALICE and CMS data challenges started at the beginning of March
  - LHCb and ATLAS started in May
- The big challenge for this year is data: integrating mass storage (SRM)
- December 2003: LCG-2
  - Full set of functionality for the DCs, but only the "classic SE"; first MSS integration
  - Deployed in January; data challenges started in February -> testing in production
  - Large sites integrate resources into LCG (MSS and farms)
- May 2004: improved services
  - SRM-enabled storage for disk and MSS systems
  - Significant broadening of participation

11  LCG - a Collaboration
Building and operating the LHC Grid - an international collaboration between:
- The physicists and computing specialists from the experiments
- The projects in Europe and the US that have been developing grid middleware
  - European DataGrid (EDG)
  - US Virtual Data Toolkit (Globus, Condor, PPDG, iVDGL, GriPhyN)
- The regional and national computing centres that provide resources for LHC
  - Some contribution from HP (Tier 2 centre)
- The research networks
(Roles: researchers, software engineers, service providers)

12  LCG Scale and Computing Model
(Figure: map of participating centres - RAL, IN2P3, BNL, FZK, CNAF, PIC, ICEPP, FNAL, USC, NIKHEF, Krakow, CIEMAT, Rome, Taipei, TRIUMF, CSCS, Legnaro, UB, IFCA, IC, MSU, Prague, Budapest, Cambridge - Tier-1 sites, Tier-2 and small centres, desktops and portables)
Sites classified by resources (see the sketch below):
- Tier-0: reconstruct Experiment Summary Data (ESD); record raw data and ESD; distribute data to Tier-1s
- Tier-1: data-heavy analysis; permanent, managed, grid-enabled storage (raw, analysis, ESD) with MSS; reprocessing; regional support
- Tier-2: managed disk storage; CPU-intensive tasks, e.g. simulation, end-user analysis, parallel interactive analysis
- Data distribution: ~70 Gbits/sec
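Purely as an illustration of the classification above (not a tool from the talk), a small sketch of how the tier roles might be encoded and queried:

```python
# Illustrative encoding of the tier roles described on this slide.
# The structure and helper are hypothetical, not an LCG component.

TIER_ROLES = {
    "Tier-0": {"record raw data", "reconstruct ESD", "distribute data to Tier-1"},
    "Tier-1": {"data-heavy analysis", "managed grid-enabled MSS storage",
               "reprocessing", "regional support"},
    "Tier-2": {"managed disk storage", "simulation", "end-user analysis",
               "parallel interactive analysis"},
}

def tiers_for(task: str) -> list[str]:
    """Return the tiers whose declared roles include the given task."""
    return [tier for tier, roles in TIER_ROLES.items() if task in roles]

if __name__ == "__main__":
    print(tiers_for("simulation"))   # ['Tier-2']
```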

13  LCG-2 Production
Operate a large-scale production service
- Started with 8 "core" sites, each bringing significant resources
  - Sufficient experience to react quickly
  - Weekly meetings: core-site phone conference, each of the experiments, joint meeting of the sites and the experiments (GDA)
- Introduced a testZone for new sites
  - LCG-1 showed that ill-configured sites can affect all sites
  - Sites stay in the testZone until they have been stable for some time
- Further improved (simplified) information system, the "LCG BDII"
  - Addresses manageability; improves robustness and scalability
  - Allows partitioning of the grid into independent views (see the query sketch below)
- Introduced a local testbed for experiment integration
  - Runs TAG N+1, giving rapid feedback on functionality from the experiments
  - Triggered several changes to the RLS system
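The LCG BDII is an LDAP-based information system, so a partition ("view") of the grid can be inspected with a plain LDAP query. A minimal sketch using the ldap3 Python library follows; the host name is a placeholder, and the port (2170) and Glue 1.x base DN are conventional assumptions rather than details from the talk.

```python
# Minimal sketch: list the computing elements published by a BDII.
# Host is a placeholder; port 2170 and the Glue 1.x base DN are assumed.
from ldap3 import Server, Connection, ALL

server = Server("bdii.example.org", port=2170, get_info=ALL)
conn = Connection(server, auto_bind=True)        # anonymous bind, as BDIIs allow

conn.search(
    search_base="mds-vo-name=local,o=grid",      # conventional Glue 1.x base DN
    search_filter="(objectClass=GlueCE)",        # computing-element entries
    attributes=["GlueCEUniqueID", "GlueCEStateStatus"],
)

for entry in conn.entries:
    print(entry.GlueCEUniqueID, entry.GlueCEStateStatus)
```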

14  LCG-2
Focus on integrating local resources:
- Batch systems at CNAF, CERN and NIKHEF already integrated
- MSS systems: CASTOR at several sites, Enstore at FNAL, RAL very soon
- Experiment software distribution
  - Mechanism based on a shared file system, with access for privileged users
  - Tool to publish the installed software (or compatibility) in the information system (see the sketch below)
Improved documentation and installation:
- Sites have the choice to use LCFGng or to follow a manual installation guide
  - LCFGng has a large overhead and is only appropriate for >10 nodes
  - Full set of manual installation documentation
- Documentation includes simple tests, so sites join in a better state
- Install notes: http://markusw.home.cern.ch/markusw/LCG2InstallNotes.html
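A hedged sketch of the publication idea described above: software installed in the shared area is advertised as tags, and work is matched to sites that publish the required tag. The tag format, paths and helpers below are hypothetical illustrations, not the actual LCG tool.

```python
# Hypothetical illustration of the software-tag mechanism: a site derives tags
# from what is installed in its shared area, and a job is matched against the
# sites that advertise a required tag.

from pathlib import Path

SHARED_AREA = Path("/opt/exp_soft")          # placeholder shared file system path

def published_tags(shared_area: Path) -> set[str]:
    """Derive tags from the per-VO version directories found in the shared area."""
    return {f"VO-{vo_dir.name}-{version.name}"            # hypothetical tag form
            for vo_dir in shared_area.iterdir() if vo_dir.is_dir()
            for version in vo_dir.iterdir() if version.is_dir()}

def sites_with_tag(required: str, site_tags: dict[str, set[str]]) -> list[str]:
    """Return the sites whose published tags include the required one."""
    return [site for site, tags in site_tags.items() if required in tags]
```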

15  Operations Services
Operations service:
- RAL is leading the sub-project on developing operations services
- Initial prototype: http://goc.grid-support.ac.uk/
  - Basic monitoring tools
  - Mail lists for problem resolution
  - GDB database containing contact and site information
- GOC ultimately to be a distributed, 24-hour service
Monitoring:
- GridICE (a development of the DataTag Nagios-based tools)
- GridPP job submission monitoring
User support service:
- FZK is leading the sub-project to develop user support services
- Draft user support policy
- Web portal for problem reporting: http://gus.fzk.de/

16  Release Process
Priorities for future releases are agreed in Grid Deployment Area meetings, based on:
- The experiments' experience, problems and needs
- Operational experience
Monthly coordinated releases:
- In the past, everything not perfect was labelled a "showstopper" and releases took very long
- Now we are gradually reaching a more stable situation
- All components have to pass the C&T testing
- Not all releases will be deployed
- Releases go first to the core sites

17  (Growth charts)
- Last 3 weeks: from 28 to 50 sites
- Last 2 weeks: from 2200 to 3340 CPUs

18  LCG-2 Data Challenges
Most services stable, especially the information system.
Lessons learned:
- The LHC experiments use multiple grids and additional services -> integration and interoperability
- Service provision
  - Planned to provide shared services (RBs, BDIIs, UIs etc.), but...
  - Experiments need to augment the services on the UI and need to define their own super/subsets of the grid
  - Individual RB/BDII/UIs for each VO (optionally on one node)
- Resource usage
  - Expected uniform utilisation (100% from the start); it turned out to have some granularity
  - Steep build-up on an almost empty service, followed by a plateau and then tapering off
- Application-level software distribution: not ideal, but improved over LCG-1

19  LCG-2 Data Challenges
Lessons learned (continued):
- Scalable services for data transport are needed
  - SRM needed for more than Castor and Enstore
- Performance issues with the RLS tools
  - Bulk file registration with RLS was slow; understood, with a work-around and a fix (see the batching sketch below)
- Local Resource Managers (batch systems) are too smart for Glue
  - The GlueSchema cannot express the richness of batch systems (LSF etc.)
  - Users cannot reliably anticipate loading
  - New, flatter IS architecture
- First scaling problems encountered around 40 sites (RB)
  - The RB slows down; the solution will make it into the May release
- DCs need resources: disk storage was not sufficient
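The generic work-around for slow one-at-a-time registration is to batch the catalogue calls. The sketch below only illustrates that idea; register_batch is a hypothetical stand-in, not the RLS client API or the actual fix that was deployed.

```python
# Generic batching work-around for slow per-file catalogue registration.
# `register_batch` is a hypothetical callable standing in for whatever bulk
# interface the catalogue (e.g. an RLS-like service) exposes.

from collections.abc import Callable, Iterable

def register_in_batches(
    entries: Iterable[tuple[str, str]],          # (logical_name, physical_replica) pairs
    register_batch: Callable[[list[tuple[str, str]]], None],
    batch_size: int = 100,
) -> int:
    """Register entries in fixed-size batches; returns the number registered."""
    batch: list[tuple[str, str]] = []
    total = 0
    for entry in entries:
        batch.append(entry)
        if len(batch) == batch_size:
            register_batch(batch)
            total += len(batch)
            batch.clear()
    if batch:                                    # flush the final partial batch
        register_batch(batch)
        total += len(batch)
    return total
```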

20  Expected Developments in 2004
General:
- LCG-2 will be an incrementally evolving, stable service
Some functional improvements:
- Extend access to MSS - tape systems and managed disk pools
- GFAL - POSIX I/O to heterogeneous MSS (a large range of systems is used across the labs); see the sketch below
Operational improvements:
- Monitoring systems - move towards proactive problem finding, with the ability to take sites on/offline; experiment monitoring (R-GMA); accounting
- A "cookbook" to cover planning, installation and operation
- Activate the regional centres more to provide and support services
  - This has improved over time, but in general there is too little sharing of tasks
Address integration issues:
- With large clusters, with storage systems, with different OSs
  - Sites will not run consistent and identical middleware
- Better integration of farms with non-routed networks for the WNs
- Regional centres already supporting other experiments
- CERN integrating projects for the accelerator group
- National grid infrastructures coming
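To show what "POSIX I/O to heterogeneous MSS" means in practice: an application opens, reads and closes a grid file through GFAL much as it would a local file. The sketch below loads the library via ctypes; the library name, the gfal_open/gfal_read/gfal_close prototypes and the SRM URL are assumptions for illustration only.

```python
# Illustrative only: POSIX-style access to a grid file through a GFAL-like
# C library, loaded via ctypes. Library name, function prototypes and the
# SRM URL below are assumptions, not details taken from the talk.
import ctypes
import os

gfal = ctypes.CDLL("libgfal.so")                       # assumed library name

gfal.gfal_open.argtypes = [ctypes.c_char_p, ctypes.c_int, ctypes.c_int]
gfal.gfal_open.restype = ctypes.c_int
gfal.gfal_read.argtypes = [ctypes.c_int, ctypes.c_void_p, ctypes.c_size_t]
gfal.gfal_read.restype = ctypes.c_ssize_t
gfal.gfal_close.argtypes = [ctypes.c_int]

url = b"srm://se.example.org/dpm/example.org/home/vo/file.dat"  # placeholder SURL
fd = gfal.gfal_open(url, os.O_RDONLY, 0)
if fd >= 0:
    buf = ctypes.create_string_buffer(4096)
    nread = gfal.gfal_read(fd, buf, len(buf))          # read like a local file
    gfal.gfal_close(fd)
    print(f"read {nread} bytes")
```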

21  Grid Guide Doc
A tool to consolidate all site configuration parameters in one place
- Web interface
  - Wizard-style interface upon initial registration
  - Management of configuration
  - Advice
- Full site configuration exported as an XML file (see the parsing sketch below), used for:
  - Providing the input to installation scripts
  - Creating customised documentation
  - Building other tools
  - Debugging
- Aims to:
  - Reduce the barriers to participation in the grid
  - Enable sites to join in a more stable state
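The talk does not show the exported XML itself, so the layout below is invented purely to illustrate how an installation script might consume such an export; all element and attribute names are assumptions.

```python
# Hypothetical example of consuming a site-configuration XML export.
# The element and attribute names are invented for illustration.
import xml.etree.ElementTree as ET

SAMPLE = """\
<site name="Example-LCG2">
  <node hostname="ce01.example.org" role="CE"/>
  <node hostname="se01.example.org" role="SE"/>
  <param key="CE_BATCH_SYS" value="pbs"/>
</site>
"""

root = ET.fromstring(SAMPLE)
print("Site:", root.get("name"))
for node in root.findall("node"):
    print(f"  {node.get('role'):3s} -> {node.get('hostname')}")
for param in root.findall("param"):
    print(f"  {param.get('key')} = {param.get('value')}")
```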

22  LCG - EGEE
- EGEE: an EU project to build an e-science grid for Europe
- LCG-2 will be the production service during 2004
  - Will also form the basis of the initial EGEE production service
  - Will be maintained as a stable service
  - Will continue to be developed
- Expect in parallel a development service - Q2 2004
  - Based on the EGEE middleware prototypes
  - Run as a service on a subset of EGEE/LCG production sites
- The core infrastructure of the LCG and EGEE grids will be operated as a single service
  - LCG includes the US and Asia; EGEE includes other sciences
  - The ROCs support Resource Centres and applications (similar to LCG primary sites)
  - Some ROCs and LCG primary sites will be merged
- The LCG Deployment Manager is the EGEE Operations Manager and a member of the PEB of both projects
(Figure: overlap of EGEE and LCG in geography and applications)

23  Summary
- LCG-2 is running as a production service
- Anticipate further improvements in infrastructure
- Broadening of participation and increase in available resources
- In 2004 we must show that we can handle the data: meeting the Data Challenges is the key goal of 2004
(Timeline figure, 2004-2007: initial service in operation; decisions on final core middleware; demonstrate core data handling and batch analysis; installation and commissioning; first data)

