
1 LCG Status and Plans. GridPP13, Durham, UK, 4th July 2005. Ian Bird, IT/GD, CERN

2 Overview
- Introduction
- Project goals and overview
- Status
- Applications area
- Fabric
- Deployment and Operations
- Baseline Services
- Service Challenges
- Summary

3 LCG Goals
- The goal of the LCG project is to prototype and deploy the computing environment for the LHC experiments
- Two phases:
  - Phase 1 (2002-2005): build a service prototype based on existing grid middleware; gain experience in running a production grid service; produce the TDR for the final system
  - Phase 2 (2006-2008): build and commission the initial LHC computing environment
- We are here: the LCG and Experiment Computing TDRs were completed and presented to the LHCC last week

4 Project Areas & Management (joint with EGEE)
- Applications Area (Pere Mato): development environment; joint projects; data management; distributed analysis
- Middleware Area (Frédéric Hemmer): provision of a base set of grid middleware (acquisition, development, integration); testing, maintenance, support
- CERN Fabric Area (Bernd Panzer): large cluster management; data recording; cluster technology; networking; computing service at CERN
- Grid Deployment Area (Ian Bird): establishing and managing the Grid Service: middleware, certification, security, operations, registration, authorisation, accounting
- Distributed Analysis, ARDA (Massimo Lamanna): prototyping of distributed end-user analysis using grid technology
- Project management: Project Leader Les Robertson; Resource Manager Chris Eck; Planning Officer Jürgen Knobloch; Administration Fabienne Baud-Lavigne

5 Applications Area

6 Application Area Focus
- Deliver the common physics applications software
- Organized to ensure focus on real experiment needs
- Experiment-driven requirements and monitoring
- Architects in management and execution
- Open information flow and decision making
- Participation of experiment developers
- Frequent releases enabling iterative feedback
- Success defined by experiment validation: integration, evaluation, successful deployment

7 Validation Highlights
- POOL successfully used in large-scale production in the ATLAS, CMS and LHCb data challenges in 2004; ~400 TB of POOL data produced
- The objective of a quickly-developed persistency hybrid leveraging ROOT I/O and RDBMSs has been fulfilled
- Geant4 firmly established as the baseline simulation in successful ATLAS, CMS and LHCb production; EM & hadronic physics validated
- Highly stable: 1 G4-related crash per O(1M) events
- SEAL components underpin POOL's success, in particular the dictionary system; now entering a second generation with Reflex
- SPI's Savannah project portal and external software service are accepted standards inside and outside the project

8 Current AA Projects
- SPI, Software Process Infrastructure (A. Aimar): software and development services: external libraries, Savannah, software distribution, support for build, test, QA, etc.
- ROOT, Core Libraries and Services (R. Brun): foundation class libraries, math libraries, framework services, dictionaries, scripting, GUI, graphics, etc.
- POOL, Persistency Framework (D. Duellmann): storage manager, file catalogs, event collections, relational access layer, conditions database, etc.
- SIMU, Simulation project (G. Cosmo): simulation framework, physics validation studies, MC event generators, participation in Geant4 and Fluka

9 SEAL and ROOT Merge
- The major change in the AA has been the merge of the SEAL project with the ROOT project
- Details of the merge are being discussed following a process defined by the AF: breakdown into a number of topics, proposals discussed with the experiments, public presentations, final decisions by the AF
- Current status: dictionary plans approved; MathCore and Vector library proposals approved; first development release of ROOT including these new libraries

10 Ongoing work
- SPI:
  - Porting LCG-AA software to amd64 (gcc 3.4.4)
  - Finalizing software distribution based on Pacman
  - QA tools: test coverage and Savannah reports
- ROOT:
  - Development version v5.02 released last week, including the new libraries mathcore, reflex, cintex and roofit
- POOL:
  - Version 2.1 released, including new file catalog implementations: LFCCatalog (LFC), GliteCatalog (gLite Fireman), GTCatalog (Globus Toolkit)
  - New version 1.2 of the Conditions DB (COOL)
  - Adapting POOL to the new dictionaries (Reflex)
- SIMU:
  - New Geant4 public minor release 7.1 is being prepared
  - Public release of Fluka expected by end of July
  - Intense activity in the combined calorimeter physics validation with ATLAS; report in September
  - New MC generators (CASCADE, CHARYBDIS, etc.) being added to the already long list of generators provided
  - Prototyping persistency of Geant4 geometry with ROOT

11 Fabric Area

12 CERN Fabric
[Diagram: Extremely Large Fabric management system: configuration, installation and management of nodes; lemon (LHC Era Monitoring): system & service monitoring; LHC Era Automated Fabric: hardware / state management. Includes technology developed by the European DataGrid.]
- Fabric automation has seen very good progress
- The new systems for managing large farms are in production at CERN

13 CERN Fabric
- Fabric automation has seen very good progress; the new systems for managing large farms are in production at CERN
- New CASTOR Mass Storage System: deployed first on the high-throughput cluster for the recent ALICE data recording computing challenge
- Agreement on collaboration with Fermilab on the Linux distribution: Scientific Linux, based on Red Hat Enterprise 3, improves uniformity between the HEP sites serving the LHC and Run 2 experiments
- CERN computer centre preparations: power upgrade to 2.5 MW; computer centre refurbishment well under way; acquisition process started

14 GridPP 13; Durham, 4 th July, 2005 14 Preparing for 7,000 boxes in 2008

15 High Throughput Prototype (openlab/LCG)
- Experience with likely ingredients in LCG: 64-bit programming, next-generation I/O (10 Gb Ethernet, Infiniband, etc.)
- High-performance cluster used for evaluations and for data challenges with experiments
- Flexible configuration: components moved in and out of the production environment
- Co-funded by industry and CERN

16 ALICE Data Recording Challenge
- Target: one week sustained at 450 MB/sec (rough volume estimate below)
- Used the new version of the CASTOR mass storage system
- Note the smooth degradation and recovery after equipment failure
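
For scale, a quick back-of-the-envelope calculation (not from the slide itself) of the data volume implied by the target rate, assuming decimal units:

```python
# Rough volume implied by the ALICE data recording target:
# one week sustained at 450 MB/s (decimal megabytes assumed).
rate_mb_s = 450
seconds_per_week = 7 * 24 * 3600               # 604,800 s
total_tb = rate_mb_s * seconds_per_week / 1e6  # MB -> TB
print(f"~{total_tb:.0f} TB written in one week")  # ~272 TB
```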

17 Deployment and Operations

18 Computing Resources: June 2005
[Map legend: country providing resources / country anticipating joining]
- In LCG-2: 139 sites in 32 countries, ~14,000 CPUs, ~5 PB storage
- Includes non-EGEE sites: 18 sites in 9 countries
- The number of sites is already at the scale expected for LHC, demonstrating the full complexity of operations

19 Operations Structure
- Operations Management Centre (OMC): at CERN; coordination etc.
- Core Infrastructure Centres (CIC):
  - Manage daily grid operations: oversight, troubleshooting
  - Run essential infrastructure services
  - Provide 2nd-level support to ROCs
  - UK/I, France, Italy, CERN, plus Russia (M12); hope to get non-European centres
- Regional Operations Centres (ROC):
  - Act as front-line support for user and operations issues
  - Provide local knowledge and adaptations
  - One in each region; many are distributed
- User Support Centre (GGUS): at FZK; support portal; provides a single point of contact (service desk)

20 Grid Operations
- The grid is flat, but there is a hierarchy of responsibility; this is essential to scale the operation
- The CICs act as a single Operations Centre: operational oversight (grid operator) responsibility rotates weekly between the CICs
- Problems are reported to the ROC/RC; the ROC is responsible for ensuring the problem is resolved
- The ROC oversees the regional RCs and is responsible for organising the operations in a region, e.g. coordinating deployment of middleware
- CERN coordinates sites not associated with a ROC (see the sketch after this list)
[Diagram: OMC, CICs, ROCs and RCs; RC = Resource Centre]
It is in setting up this operational infrastructure where we have really benefited from EGEE funding.
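
A minimal sketch of the escalation rule described above, purely illustrative (not an LCG tool); the region names and the mapping are hypothetical examples:

```python
# Hypothetical mapping from a site's region to its Regional Operations Centre.
ROC_FOR_REGION = {"UK/I": "ROC-UKI", "France": "ROC-FR", "Italy": "ROC-IT"}

def responsible_centre(site_region: str) -> str:
    """Return who follows up a problem at a site: the ROC for the region,
    or CERN for sites not associated with any ROC."""
    return ROC_FOR_REGION.get(site_region, "CERN (OMC)")

print(responsible_centre("UK/I"))    # ROC-UKI
print(responsible_centre("Taiwan"))  # CERN (OMC)
```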

21 Grid monitoring
A selection of monitoring tools:
- GIIS Monitor + Monitor Graphs
- Site Functional Tests
- GOC Database
- Scheduled Downtimes
- Live Job Monitor
- GridIce: VO + fabric view
- Certificate Lifetime Monitor (an illustrative sketch of this kind of check follows)
- Operation of Production Service: real-time display of grid operations
- Accounting information
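
As an illustration of the kind of check a certificate lifetime monitor performs, here is a small sketch (not the actual LCG tool) using the standard `openssl x509 -checkend` option; the certificate path and threshold are placeholders:

```python
# Sketch of a certificate-lifetime check: warn if a certificate expires soon.
import subprocess

def expires_within(cert_path: str, seconds: int) -> bool:
    """True if the certificate at cert_path expires within `seconds`.
    openssl's -checkend exits non-zero in that case."""
    result = subprocess.run(
        ["openssl", "x509", "-checkend", str(seconds), "-noout", "-in", cert_path],
        capture_output=True)
    return result.returncode != 0

THIRTY_DAYS = 30 * 24 * 3600
if expires_within("/etc/grid-security/hostcert.pem", THIRTY_DAYS):
    print("WARNING: host certificate expires within 30 days")
```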

22 Operations focus
- Main focus of activities now:
  - Improving the operational reliability and application efficiency: automating monitoring -> alarms; ensuring a 24x7 service; removing sites that fail functional tests; operations interoperability with OSG and others
  - Improving user support: demonstrate to users a reliable and trusted support infrastructure
  - Deployment of gLite components: testing, certification -> pre-production service; migration planning and deployment, while maintaining/growing interoperability
- Further developments now have to be driven by experience in real use
[Timeline: LCG-2 (=EGEE-0), prototyping -> product, 2004-2005; LCG-3 (=EGEE-x?), product]

23 Recent ATLAS work
[Plot: ATLAS jobs per day in EGEE/LCG-2 in 2005]
- In the latest period, up to 8K jobs/day
- ~10,000 concurrent jobs in the system
- Several times the current capacity for ATLAS at CERN alone, which shows the reality of the grid solution

24 Baseline Services & Service Challenges

25 Baseline Services: Goals
- Experiments and regional centres agree on baseline services
- These support the computing models for the initial period of LHC, and thus must be in operation by September 2006
- Expose experiment plans and ideas
- Timescales: for the TDR, now; for SC3, testing and verification of some (not all) components; for SC4, the complete set must be available
- Define services with targets for functionality & scalability/performance metrics (not done yet)
- Very much driven by the experiments' needs, but try to understand site and other constraints

26 Baseline services
- Storage management services, based on SRM as the interface
- Basic transfer services: gridFTP, srmCopy
- Reliable file transfer service (see the sketch after this list)
- Grid catalogue services
- Catalogue and data management tools
- Database services, required at Tier-1 and Tier-2
- Compute resource services
- Workload management
- VO management services: clear need for VOMS (roles, groups, subgroups)
- POSIX-like I/O service: local files, including links to catalogues
- Grid monitoring tools and services, focussed on job monitoring
- VO agent framework
- Applications software installation service
- Reliable messaging service
- Information system
Nothing really surprising here, but a lot was clarified in terms of requirements, implementations, deployment, security, etc.
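
To illustrate the relationship between the basic transfer tools and a reliable transfer layer, here is a minimal sketch (not the actual LCG/gLite file transfer service) that retries a gridFTP copy via `globus-url-copy`; the URLs and retry policy are illustrative assumptions:

```python
# Sketch: simple retry wrapper over a basic gridFTP transfer, illustrating
# "reliable transfer" layered on a basic transfer tool. Not a real service.
import subprocess
import time

def transfer_with_retry(src_url: str, dst_url: str,
                        attempts: int = 3, backoff_s: int = 60) -> bool:
    """Attempt a gridFTP copy up to `attempts` times, waiting between tries."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(["globus-url-copy", src_url, dst_url])
        if result.returncode == 0:
            return True
        print(f"attempt {attempt} failed (rc={result.returncode}); "
              f"retrying in {backoff_s}s")
        time.sleep(backoff_s)
    return False

# Placeholder URLs for illustration only.
transfer_with_retry("gsiftp://source.example.org/data/run123.root",
                    "gsiftp://tier1.example.org/lhc/run123.root")
```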

27 Preliminary: Priorities

Service                             | ALICE | ATLAS | CMS | LHCb
Storage Element                     |   A   |   A   |  A  |  A
Basic transfer tools                |   A   |   A   |  A  |  A
Reliable file transfer service      |   A   |   A   | A/B |  A
Catalogue services                  |   B   |   B   |  B  |  B
Catalogue and data management tools |   C   |   C   |  C  |  C
Compute Element                     |   A   |   A   |  A  |  A
Workload Management                 |   B   |   A   |  A  |  C
VO agents                           |   A   |   A   |  A  |  A
VOMS                                |   A   |   A   |  A  |  A
Database services                   |   A   |   A   |  A  |  A
Posix-I/O                           |   C   |   C   |  C  |  C
Application software installation   |   C   |   C   |  C  |  C
Job monitoring tools                |   C   |   C   |  C  |  C
Reliable messaging service          |   C   |   C   |  C  |  C
Information system                  |   A   |   A   |  A  |  A

A: High priority, mandatory service
B: Standard solutions required, experiments could select different implementations
C: Common solutions desirable, but not essential

28 Service Challenges: ramp up to the LHC start-up service
[Timeline 2005-2008: SC2, SC3, SC4, LHC Service Operation; cosmics, first beams, first physics, full physics run]
- Jun 2005: Technical Design Report
- Sep 2005: SC3 service phase
- May 2006: SC4 service phase
- Sep 2006: initial LHC service in stable operation
- Apr 2007: LHC service commissioned
- SC2: reliable data transfer (disk-network-disk), 5 Tier-1s, aggregate 500 MB/sec sustained at CERN
- SC3: reliable base service, most Tier-1s, some Tier-2s; basic experiment software chain; grid data throughput 500 MB/sec including mass storage (~25% of the nominal final throughput for the proton period; rough arithmetic below)
- SC4: all Tier-1s and major Tier-2s, capable of supporting the full experiment software chain including analysis; sustain the nominal final grid data throughput
- LHC Service in Operation, September 2006: ramp up to full operational capacity by April 2007; capable of handling twice the nominal data throughput
See Jamie's talk for more details.
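
A rough reading of the throughput targets above (a sketch; the nominal proton-period rate is inferred from the "~25%" statement, not given explicitly on the slide):

```python
# Rate arithmetic implied by the Service Challenge targets.
sc3_rate_mb_s = 500                    # SC3 target, including mass storage
nominal_mb_s = sc3_rate_mb_s / 0.25    # ~25% of nominal -> ~2000 MB/s
lhc_service_mb_s = 2 * nominal_mb_s    # service must handle twice nominal
print(nominal_mb_s, lhc_service_mb_s)  # 2000.0 4000.0
```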

29 Baseline Services, Service Challenges, Production Service, Pre-production Service, gLite deployment... confused?

30 Services...
- Baseline services: the set of essential services that the experiments need to be in production by September 2006; components are verified in SC3 and SC4
- Service challenges: the ramp-up of the LHC computing environment, building up the production service based on the results and lessons of the service challenges
- Production service: the evolving service putting in place new components prototyped in SC3 and SC4; no big-bang changes, but many releases!
- gLite deployment: as new components are certified, they will be added to the production service releases, either in parallel with or replacing existing services
- Pre-production service: should be literally a preview of the production service, but is a demonstration of gLite services at the moment; this has been forced on us by many other constraints (urgency to "deploy" gLite, need for reasonable-scale testing, ...)

31 Releases and Distributions
- We intend to maintain a single line of production middleware distributions:
  - Middleware releases from JRA1, VDT, LCG, ...
  - Middleware distributions for deployment from GDA/SA1
  - Remember: the announcement of a release is months away from a deployable distribution (based on the last 2 years' experience)
- Distributions are still labelled "LCG-2.x.x"; we would like to change to something less specific to avoid LCG/EGEE confusion
- Frequent updates for Service Challenge sites, but only needed for SC sites
- Frequent updates as gLite is deployed:
  - Not clear if all sites will deploy all gLite components immediately
  - This is unavoidable
- A strong request from the LHC experiment spokesmen to the LCG POB: "early, gradual and frequent releases of the [baseline] services is essential rather than waiting for a complete set"
- Throughout all this, we must maintain a reliable production service, which gradually improves in reliability and performance

32 Summary
- We are at the end of LCG Phase 1: a good time to step back and look at achievements and issues
- LCG Phase 2 has really started:
  - Consolidation of AA projects
  - Baseline services
  - Service challenges and experiment data challenges
  - Acquisition process starting
- No new developments: make what we have work absolutely reliably, and be scalable and performant
- The timescale is extremely tight: we must ensure that we have appropriate levels of effort committed

