
LCG Status Report LHCC Open Session CERN 28th June 2006.


1 LCG Status Report LHCC Open Session CERN 28th June 2006

2 Outline
- Project status: organisation for Phase II
- Applications Area
- CERN Tier 0: Castor-2, Tier 0 infrastructure, LHC networking
- Grid infrastructure status: Service Challenges – results and plans; regional centres; middleware status
- Physics Support & Analysis
- Summary

3 The Worldwide LHC Computing Grid
Purpose:
- Develop, build and maintain a distributed computing environment for the storage and analysis of data from the four LHC experiments
- Ensure the computing service … and common application libraries and tools
Phase I – Development & planning
Phase II – Deployment & commissioning of the initial services

4 WLCG Collaboration
The Collaboration:
- ~130 computing centres
- 12 large centres (Tier-0, Tier-1)
- 40-50 federations of smaller "Tier-2" centres
- 29 countries
Memorandum of Understanding:
- Agreed in October 2005, now being signed
- Purpose: focuses on the needs of the 4 LHC experiments
- Commits resources: each October for the coming year, with a 5-year forward look
- Agrees on standards and procedures

5 Collaboration Board – chair Neil Geddes (RAL)
- Sets the main technical directions
- One person from the Tier-0 and from each Tier-1 and Tier-2 (or Tier-2 federation)
- Experiment spokespersons
Overview Board – chair Jos Engelen (CERN CSO)
- Committee of the Collaboration Board to oversee the project and resolve conflicts
- One person from the Tier-0 and the Tier-1s
- Experiment spokespersons
Management Board – chair: Project Leader
- Experiment Computing Coordinators
- One person from the Tier-0 and each Tier-1 site
- GDB chair
- Project Leader, Area Managers
- EGEE Technical Director
Grid Deployment Board – chair Kors Bos (NIKHEF)
- With a vote: one person from a major site in each country; one person from each experiment
- Without a vote: Experiment Computing Coordinators; site service management representatives; Project Leader, Area Managers
Architects Forum – chair Pere Mato (CERN)
- Experiment software architects
- Applications Area Manager
- Applications Area project managers
- Physics Support

6 More information on the collaboration
- Boards and Committees: all boards except the OB have open access to agendas, minutes, documents
- Planning data: MoU documents and resource data
- Technical Design Reports
- Phase 2 plans
- Status and progress reports
- Phase 2 resources and costs at CERN

7 LCG Applications Area

8 Merge of SEAL and ROOT projects
- Single team working together successfully for more than one year
- ~50% of SEAL functionality has been migrated to ROOT
- In use by the experiments (will be in production for this year's data challenges)
- What is left is easily maintainable (no new development)
- Started to plan the migration of the second 50%: collected information from experiments; detailed plan in preparation
- In general no urgency from the experiments; will need to persuade them to migrate when the software is ready

9 AA Project Status (1)
Software Process Infrastructure project (SPI):
- Stable running and improvement of services: Savannah, HyperNews, software installations and distributions
- Direct support to experiments to provide complete software configurations
- Support for new platforms: SLC4, Mac OS X
Core Libraries and Services project (ROOT):
- Many developments for the integration of Reflex and CINT; plan to release the new system this fall
- Consolidation of the new math libraries; new packages: multivariate analysis, Fast Fourier Transforms
- Performance improvements in many areas (e.g. I/O and Trees)
- Many new developments in PROOF: asynchronous queries, connect/disconnect mode, package manager, monitoring, etc.
- Improvements and new functionality in the GUI and Graphics packages

10 AA Project Status (2)
Persistency Framework project (POOL & COOL):
- CORAL, a reliable generic RDBMS interface for Oracle, MySQL, SQLite and FroNTier (LCG 3D project)
- Provides database lookup, failover, connection pooling, authentication, monitoring
- COOL and POOL can access all back-ends via CORAL; CORAL is also used as a separate package by ATLAS/CMS online
- Improved COOL versioning functionality (user tags and hierarchical tags)
Simulation project:
- Improved tools for geometry model interchange (GDML)
- Extended framework for interfacing test-beam simulations with Geant4 and Fluka; physics analysis expected soon
- Considerable effort on the study of hadronic shower shapes to resolve discrepancies with test-beam data; improved regression suite to investigate and compare hadronic shower shapes
- New C++ Monte Carlo generators (Pythia8, ThePEG/Herwig++) added to the generator library (GENSER)
- New precise elastic process for protons and neutrons in Geant4
- New efficient method to detect overlaps in geometries at construction time, and support for parallel geometries
- New Python interface module covering key Geant4 classes
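The failover and connection-pooling behaviour attributed to CORAL above follows a common pattern: try an ordered list of replica back-ends, and recycle idle connections rather than reopening them. A minimal Python sketch of that pattern (names and the `toy_connect` backend are hypothetical; CORAL's real interface is a C++ API, not this):

```python
class ConnectionPool:
    """Failover + pooling: hand out connections from an ordered
    list of replicas, reusing released connections when possible."""

    def __init__(self, replicas, connect):
        self.replicas = list(replicas)  # ordered by preference
        self.connect = connect          # callable: url -> connection, or raises
        self.idle = []                  # released connections, reusable

    def acquire(self):
        if self.idle:                   # pooling: reuse before reconnecting
            return self.idle.pop()
        errors = []
        for url in self.replicas:       # failover: try replicas in order
            try:
                return self.connect(url)
            except ConnectionError as exc:
                errors.append((url, exc))
        raise ConnectionError(f"all replicas failed: {errors}")

    def release(self, conn):
        self.idle.append(conn)          # return to the pool


# toy backend: the primary is unreachable, the cache replica works
def toy_connect(url):
    if "down" in url:
        raise ConnectionError(url)
    return f"conn:{url}"

pool = ConnectionPool(["oracle://down-primary", "frontier://replica"], toy_connect)
c = pool.acquire()            # fails over to the second replica
print(c)                      # conn:frontier://replica
pool.release(c)
assert pool.acquire() is c    # the pooled connection is reused, not reopened
```

The same shape covers the "db lookup" feature too: the replica list itself would come from a lookup service rather than being hard-coded.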

11 CERN Tier 0 and LHC Networking status

12 CERN CASTOR storage system
- A CASTOR2 review took place at CERN on June 6th-9th
- Members: John Harvey (CERN, chair), Miguel Branco (ATLAS), Don Petravick (FNAL), Shaun de Witt (RAL)
- Details and the final report are available

13 CASTOR2 highlights in 2006
- ATLAS Tier 0 test in January at nominal rates (320 MB/s, no Tier 1 export)
- Various large-scale data challenges: SC4 data export from a CASTOR2 disk pool at ~1.6 GB/s; CASTOR2 disk pool stress tests at 4.3 GB/s, cf. the expected aggregate load of 4.5 GB/s for all 4 experiments during pp running
- Successful integration of 2 new tape storage systems from IBM and STK, with tested peak rates of 1.6 GB/s to tape
- Successful transition of all 4 experiments from CASTOR1 to CASTOR2
- Today ~1 PB of disk space in CASTOR2 disk pools, with ~2.5 million files on disk
- CASTOR2 disk pool for CMS served analysis data successfully to ~1000 simultaneous clients at 1 GB/s aggregate
- Second ATLAS Tier 0 test just started at nominal rates

14 Tier 0 ramp-up
12,000 spinning disks; 2 PB of disk space

Disk space [TB] / servers:
                May 2006      Sep 2006      Feb 2007
  Alice          78 /  20     231 / ~60      500
  Atlas         123 /  25     176 / ~45      370
  CMS           138 /  27
  LHCb          121 /  26     188
  total LHC     460 /  98     771 / ~180    1610 / ~480
  SC4           187 /  40
  ITDC          169 /  42     170 / ~40
  public                                    ~200 / ~100
  total         816 / 180     940 / 220    ~2000 / ~600

Batch system (boxes / kSI2K):
  Today: 2300 / 4300
  2007:  ~2500 boxes; +5700 kSI2K, for 10000 kSI2K in total
  2008:  25000 kSI2K

15 Computer Centre electrical infrastructure …
- The new substation is operational
- Two power cuts were caused by the new equipment in January (6th, 24th); the causes were understood rapidly and fixed
- Critical services were maintained as designed during problems on May 16th; full services back within 3 hours of power being restored
- 1st new UPS module being installed; will be commissioned by mid-July (no capacity increase; replaces the current UPS only)
- Additional UPS capacity only at end-2006: an extremely tight schedule requires removal/relocation of existing equipment from July 15th to August 15th, and a two-month period for the 2nd phase of foundation reinforcement

16 … and cooling infrastructure
- Work (much) delayed with respect to the initial plan: weather delays more than expected; many safety concerns
- Three major cooling problems since end-March (plus other minor problems); the focus has been on maintaining critical lab services (network, admin services, …), so physics services were shut down to reduce heat load
- Production chillers being commissioned now: 1st unit in production June 19th, 2nd on June 23rd, 3rd by June 30th; final configuration in by mid-July
- Future work: installation of sensors (2-3 per equipment row); completion of ducts on the right-hand (barn) side; [4th chiller, yet to be funded]

17 The new European network backbone
- LCG working group with Tier-1s and national/regional research network organisations
- New GÉANT 2 research network backbone
- Strong correlation with major European LHC centres
- Swiss PoP at CERN


19 Grid Infrastructure

20 LCG Service Hierarchy
Tier-0 – the accelerator centre:
- Data acquisition & initial processing
- Long-term data curation
- Distribution of data to the Tier-1 centres
Tier-1 – "online" to the data acquisition process, so high availability:
- Managed mass storage – grid-enabled data service
- Data-heavy analysis
- National, regional support
- Centres: Canada – TRIUMF (Vancouver); France – IN2P3 (Lyon); Germany – Forschungszentrum Karlsruhe; Italy – CNAF (Bologna); Netherlands Tier-1 (Amsterdam); Nordic countries – distributed Tier-1; Spain – PIC (Barcelona); Taiwan – Academia Sinica (Taipei); UK – CLRC (Oxford); US – FermiLab (Illinois) and Brookhaven (NY)
Tier-2 – ~120 centres (40-50 federations) in ~29 countries:
- Simulation
- End-user analysis – batch and interactive

21 LCG depends on 2 major science grid infrastructures
The LCG service runs on, and relies on, the grid infrastructure provided by:
- EGEE – Enabling Grids for E-sciencE
- OSG – the US Open Science Grid

22 EGEE Grid sites: Q1 2006
- Steady growth in sites and CPU over the lifetime of the project
- > 180 sites in 40 countries
- > 24,000 processors, ~5 PB storage

23 A global, federated e-Infrastructure
- Related infrastructures: BalticGrid, NAREGI, SEE-GRID, OSG, EUChinaGrid, EUMedGrid, EUIndiaGrid, EELA
- At the Feb review: 100 sites, 10K CPUs; 1st gLite release foreseen for March '05; 6 domains
- EGEE infrastructure: ~200 sites in 39 countries; ~ CPUs; > 5 PB storage; > concurrent jobs per day; > 60 Virtual Organisations

24 Use of the infrastructure
- More than 35K jobs/day on the EGEE Grid; the LHC VOs account for ~30K jobs/day
- Sustained & regular workloads of >35K jobs/day spread across the full infrastructure
- Doubling/tripling in the last 6 months – with no effect on operations
- Several applications now depend on EGEE as their primary computing resource

25 EGEE operations process
Grid operator on duty:
- 6 teams working in weekly rotation: CERN, IN2P3, INFN, UK/I, Russia, Taipei
- Crucial in improving site stability and management
- Expanding to all ROCs in EGEE-II
Operations coordination:
- Weekly operations meetings; regular ROC managers' meetings
- Series of EGEE operations workshops: Nov 04, May 05, Sep 05, June 06
Geographically distributed responsibility for operations – there is no "central" operation:
- Tools are developed/hosted at different sites: GOC DB (RAL), SFT (CERN), GStat (Taipei), CIC Portal (Lyon)
Procedures described in the Operations Manual:
- Introducing new sites, site downtime scheduling, suspending a site, escalation procedures, etc.
Highlights:
- Distributed operation
- Evolving and maturing procedures
- Procedures being introduced into, and shared with, the related infrastructure projects

26 Site testing
Measuring response times and availability: Site Availability Monitor (SAM), based on the Site Functional Test suite
- Monitors services by running regular tests
- Basic services: SRM, LFC, FTS, CE, RB, top-level BDII, site BDII, MyProxy, VOMS, R-GMA, …
- VO environment: tests supplied by the experiments
- Results stored in a database; displays & alarms for sites, grid operations, experiments
- High-level metrics for management
- Integrated with the EGEE operations portal – the main tool for daily operations
- Mechanism and tests shared with OSG
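The availability metric such monitoring derives can be sketched as a pass fraction over the stored test results. This is an illustrative reduction, not the actual SAM schema or algorithm; the `(site, service, passed)` record layout is an assumption:

```python
from collections import defaultdict

def availability(results):
    """results: iterable of (site, service, passed) test outcomes.
    Returns each site's availability as the fraction of its recorded
    tests that passed over the reporting window."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for site, service, ok in results:
        total[site] += 1
        passed[site] += bool(ok)
    return {site: passed[site] / total[site] for site in total}

results = [
    ("CERN", "SRM", True), ("CERN", "CE", True),
    ("RAL",  "SRM", True), ("RAL",  "CE", False),
]
print(availability(results))  # {'CERN': 1.0, 'RAL': 0.5}
```

The real system additionally weights tests by criticality (a site can be marked down if any one critical service fails); the same aggregation then runs per sample rather than per test.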

27 Sustainability: beyond EGEE-II
- Need to prepare for a permanent Grid infrastructure
- Maintain Europe's leading position in global science Grids
- Ensure reliable and adaptive support for all sciences, independent of short project funding cycles
- Modelled on the success of GÉANT: infrastructure managed in collaboration with national grid initiatives
- Expand the idea and problems of the JRU

28 Structure
- Federated model bringing together National Grid Initiatives (NGIs) to build a European organisation; EGEE federations would evolve into NGIs
- Each NGI is a national body: recognised at the national level; mobilises national funding and resources; contributes and adheres to international standards and policies; operates the national e-Infrastructure
- Application independent, open to new user communities and resource providers

29 OSG & WLCG
- The OSG infrastructure is a core piece of the WLCG
- OSG delivers accountable resources and cycles for LHC experiment production and analysis
- OSG federates with other infrastructures; experiments see a seamless global computing facility

30 Ramp up of OSG use last 6 months

31 Data Transfer by VOs e.g. CMS

32 Operations
- Grid Operations Center
- Facility, Service and VO Support Centers
- Manual or automated flow of tickets within OSG, bridged to other Grids
- Ownership of problems at end-points and by the GOC
- Guided by the Operations Model, standard procedures, and Support Center agreements

33 WLCG interoperability
Cross-grid job submission:
- Most advanced with OSG: cross-grid job submission is in place for WLCG, used in production by US-CMS for several months
- EGEE Generic Info Provider installed on OSG sites (now in VDT), allowing all sites to be seen in the information system
- Monitoring (GStat and SFT) can run on OSG sites
- EGEE clients installed on OSG-LCG sites; inversely, EGEE sites can run OSG jobs
- All use SRM SEs; file catalogues are an application choice – LFC widely used
Support and operations:
- Workflows and processes being put in place and tested
- Operations workshop last week tried to finalise some of the open issues

34 LCG service planning
- 2006: pilot services – stable service from 1 June 06; cosmics; LHC service in operation from 1 Oct 06, ramping up to full operational capacity & performance over the following six months
- 2007: LHC service commissioned – 1 Apr 07; first physics
- 2008: full physics run

35 Service Challenges
Purpose:
- Understand what it takes to operate a real grid service – run for weeks/months at a time (not just limited to experiment Data Challenges)
- Trigger and verify Tier-1 & large Tier-2 planning and deployment, tested with realistic usage patterns
- Get the essential grid services ramped up to target levels of reliability, availability, scalability and end-to-end performance
Four progressive steps from October 2004 through September 2006:
- End 2004 – SC1 – data transfer to a subset of Tier-1s
- Spring 2005 – SC2 – include mass storage, all Tier-1s, some Tier-2s
- 2nd half 2005 – SC3 – Tier-1s, >20 Tier-2s – first set of baseline services
- Jun-Sep 2006 – SC4 – pilot service
- Autumn 2006 – LHC service in continuous operation, ready for data taking in 2007

36 SC4 – the pilot LHC service from June 2006
A stable service on which experiments can make a full demonstration of the experiment offline chain:
- DAQ → Tier-0 → Tier-1: data recording, calibration, reconstruction
- Offline analysis – Tier-1 ↔ Tier-2 data exchange: simulation, batch and end-user analysis
And on which sites can test their operational readiness:
- Service metrics → MoU service levels
- Grid services
- Mass storage services, including magnetic tape
- Extension to most Tier-2 sites
An evolution of SC3 rather than lots of new functionality. In parallel:
- Development and deployment of distributed database services (3D project)
- Testing and deployment of new mass storage services (SRM 2.2)

37 Sustained data distribution rates: CERN → Tier-1s
Experiments served per centre (ALICE/ATLAS/CMS/LHCb) and sustained rate into each T1:

  Centre                      Rate into T1, MB/s (pp run)
  ASGC, Taipei                100
  CNAF, Italy                 200
  PIC, Spain
  IN2P3, Lyon
  GridKA, Germany
  RAL, UK                     150
  BNL, USA
  FNAL, USA
  TRIUMF, Canada              50
  NIKHEF/SARA, NL
  Nordic Data Grid Facility
  Total                       1,600

Design target is twice these rates to enable catch-up after problems
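The rationale for the 2× design target can be made concrete with a little arithmetic: during an outage a backlog accumulates at the nominal rate, and it can only drain at the spare bandwidth above nominal. An illustrative sketch (the function and its numbers are ours, not from the report):

```python
def catchup_hours(nominal, capacity, downtime):
    """Hours to clear the backlog from an outage of `downtime` hours,
    when the link sustains `capacity` while new data keeps arriving at
    `nominal` (both in MB/s). Only capacity - nominal drains backlog."""
    if capacity <= nominal:
        raise ValueError("no spare bandwidth: the backlog never clears")
    return nominal * downtime / (capacity - nominal)

# At twice nominal (the design target), recovery time equals the outage:
assert catchup_hours(1600, 3200, 24) == 24.0
# With only 25% headroom, the same 24 h outage takes 4 days to recover:
assert catchup_hours(1600, 2000, 24) == 96.0
```

So doubling the nominal 1.6 GB/s aggregate keeps catch-up times bounded by the outage length itself, which is what makes the target operationally meaningful.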

38 SC4 T0-T1 results
- Target: sustained disk-to-disk transfers at 1.6 GB/s out of CERN, at full nominal rates, for ~10 days
- Result: the target rate was only just reached, on Easter Sunday (1 day out of 10)

39 ATLAS SC4 tests
- From last week: initial ATLAS SC4 work
- Rates to ATLAS Tier-1 sites close to target rates
(chart legend: ATLAS transfers vs. background transfers)

40 Service readiness
An internal LCG review of services was held 8th-9th June.
- Mandate: assess the service readiness and preparations of the Tier-1 sites
- Scope: all aspects of LCG except the Applications Area
- First day: review of each Tier-1 – status, planning, issues
- Second day: middleware plans and priorities (EGEE and OSG); interoperability; experiment views of the status of middleware; status of the storage interface (SRM)
- Difficult to assess the overall status of sites – each Tier-1 is unique in its management, environment and issues
- All are now taking the timescale seriously
- Final report from the review expected in July

41 Middleware: baseline services
- In June 2005 the set of baseline services was agreed: the basic set of middleware required from the grid infrastructures, agreed by all experiments with minor variations of priority
- The baseline services group, and later workshops, documented missing features
- LCG priorities for development were agreed at the Mumbai workshop in Feb, and are now reflected in the EGEE & OSG middleware development plans
- gLite 3.0 (released in May for SC4) contains all of the baseline services; SRM v2.2 for storage interfaces has a longer timescale (Nov)
- Reliability, performance and management issues still to be addressed
- gLite 3.0 is an evolution of the previous LCG-2.7 and gLite 1.x middleware, deployed in production without disturbing the production environment; it forms the basis for evolving the services to add missing features and improve performance and reliability
- Several services (FTS, LFC, VOMS, BDII) are used everywhere (not just at EGEE sites)

42 Physics Support and Analysis

43 Supporting the experiments in grid activities
- Original activity on the Grid focused on large productions: an essential activity, still requiring effort (middleware and experiment software are evolving)
- There is now a genuine need for user analysis – a big step forward compared to production; preparation is still going on, tools are maturing, all components are being finalised; concrete signs of analysis activity
Per-experiment support:
- ALICE: production and analysis support; integration and support
- ATLAS: distributed analysis coordination and analysis (Ganga); experiment dashboard; job reliability
- CMS: experiment dashboard; integration and support; job reliability
- LHCb: analysis support (Ganga)

44 Analysis efforts (CMS)
- ~6k analysis jobs/day – negligible less than a year ago
- A factor of two increase since late 2005
- Jobs used to finalise the Physics TDR

45 Analysis efforts (cont.)
ATLAS and LHCb:
- Use a common tool (Ganga) to expose users to the grid
- Several demos and tutorials; CHEP06 presentation (U. Egede)
- (chart: number of users over the last two months, for the 2 services connecting users to the grid)
ALICE:
- 3 tutorials for users held since Jan 06 – more than 50 attendees
- Typically active users

46 Experiment dashboard
- Originally proposed by CMS; now in production
- ATLAS dashboard: similar concept, re-using experience and software; preview available
- Aggregates monitoring information from all sources
- Allows following the history of activity, correlating information (e.g. data sets and sites), and tracking down problems

47 Service reliability
- Bring together monitoring of experiment-specific services and applications with that of the middleware components, to study and improve the LCG service
- Targets middleware weaknesses, and infrastructure mis-configuration and instabilities
- Feedback into LCG/EGEE deployment & middleware development
Example (20th June) – top "good" sites by "grid" efficiency:
  MIT = 99.6%, DESY = 100%, Bari = 100%, Pisa = 100%, FNAL = 100%, ULB-VUB = 96.8%, KBFI = 100%, CNAF = 99.6%, ITEP = 100%
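The "grid" efficiency quoted for these sites is a per-site job-success ratio with application-level failures excluded. A sketch of the aggregation, with a hypothetical `(site, succeeded)` record layout (the dashboards compute this from richer job records):

```python
def grid_efficiency(jobs):
    """jobs: iterable of (site, succeeded) records, where failures
    caused by the application itself are assumed already filtered out,
    leaving only grid/infrastructure outcomes.
    Returns per-site efficiency as a percentage, rounded to 0.1."""
    stats = {}
    for site, ok in jobs:
        done, total = stats.get(site, (0, 0))
        stats[site] = (done + bool(ok), total + 1)
    return {site: round(100.0 * done / total, 1)
            for site, (done, total) in stats.items()}

# 250 jobs per site; one grid-side failure at the second site
jobs = [("FNAL", True)] * 250 + [("MIT", True)] * 249 + [("MIT", False)]
print(grid_efficiency(jobs))  # {'FNAL': 100.0, 'MIT': 99.6}
```

Which failures count as "grid" rather than "application" is the subtle part in practice, and is exactly where the experiment dashboards feed back into middleware development.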

48 Summary
- International science grid infrastructures are really operational, and relied upon for daily production use at large scale: more than 200 sites in EGEE and OSG; real grid operations in place for over a year
- LCG depends upon 2 major science grid infrastructures, EGEE and OSG: ~130 computer centres in 49 countries
- Excellent global networking
- Good understanding now of: experiment computing models and requirements; agreement on the baseline grid services; experience of the problems and issues
But:
- Reliability must be improved
- The full computing models will be tested this year
- Big ramp-up needed in capacity, number of jobs, and Tier-2 sites participating
- Will there be a scaling problem? This must be tested in the next 12 months
- Data will arrive next year
- No new developments: make what we have work absolutely reliably, and be scalable and performant

