The WLCG Service Starts Here… SC4 Production == WLCG Pilot --- Jamie Shiers IT-GD Group Meeting, July 7th 2006
My (CERN) Background Started at CERN in CO group back in 1984 Started at CERN as student in 1978… Since then, we’ve had the following major accelerator startups: pp collider at CERN; LEP; FNAL collider runs I & II; SLC at SLAC; (others too…) Enjoy the calm, relaxing environment you currently enjoy.. (The quiet before the storm…)
The Worldwide LHC Computing Grid Purpose Develop, build and maintain a distributed computing environment for the storage and analysis of data from the four LHC experiments Ensure the computing service … and common application libraries and tools Phase I – 2002-05 - Development & planning Phase II – 2006-2008 – Deployment & commissioning of the initial services The solution!
July 2006 WLCG Service Challenges: Overview and Outlook Overview SC4 Phases: –Throughput Phase (April) May was reserved for gLite 3.0 upgrades –Service Phase (June – September inclusive) –Experiment production activities / requirements WLCG Production Service –In principle October on… –ATLAS CSC / CMS CSA06 start early / mid September Some comments on Tier2 workshop –Much more complete review at Wednesday’s GDB
The Evolution of Databases in HEP CHEP 92 – the Birth of OO in HEP? Wide-ranging discussions on the future of s/w development in HEP A number of proposals presented leading to (DRDC/LCRB/LCB): RD41 – MOOSE [ Kors Bos ] The applicability of OO to offline particle physics code RD44 – GEANT4 [ Simone Giani ] Produce a global object-oriented analysis and design of an improved GEANT simulation toolkit for HEP RD45 – A Persistent Object Manager for HEP [ JDS ] (and later also LHC++ (subsequently ANAPHE)) [ JDS ] ROOT [ René ] Started working on LHC Computing full-time!
LCG Service Deadlines [Timeline figure, 2006 to 2008. Labels: cosmics, first physics, full physics run] Pilot Service – stable service from 1 June 06, i.e. we have already taken off! LCG Service in operation – 1 Oct 06; over following six months ramp up to full operational capacity & performance LCG service commissioned – 1 Apr 07, ~6 months prior to first collisions Updated LHC schedule coming…
The LHC Machine Some clear indications regarding LHC startup schedule and operation are now available –Press release issued two weeks ago Comparing our (SC) actual status with ‘the plan’, we are arguably one year late! –Some sites cheerfully claim two… –We were supposed to test all offline Use Cases of experiments during SC3 production phase (Sep 2005) We still have an awful lot of work to do Not the time to relax!
Press Release - Extract CERN confirms LHC start-up for 2007 Geneva, 23 June 2006. First collisions in the … LHC … in November 2007 said … Lyn Evans at the 137th meeting of the CERN Council... A two month run in 2007, with beams colliding at an energy of 0.9 TeV will allow the LHC accelerator and detector teams to run-in their equipment ready for a full 14 TeV energy run to start in Spring 2008 –Service Challenge ’07? The schedule announced today ensures the fastest route to a high-energy physics run with substantial quantities of data in 2008, while optimising the commissioning schedules for both the accelerator and the detectors that will study its particle collisions. It foresees closing the LHC’s 27 km ring in August 2007 for equipment commissioning. Two months of running, starting in November 2007, will allow the accelerator and detector teams to test their equipment with low-energy beams. After a winter shutdown in which commissioning will continue without beam, the high-energy run will begin. Data collection will continue until a pre-determined amount of data has been accumulated, allowing the experimental collaborations to announce their first results.
L.R. Evans – Important Milestones Last magnet delivered: October 2006; Last magnet tested: December 2006; Last magnet installed: March 2007; Machine closed: August 2007; First collisions: November 2007
L.R. Evans – Machine Commissioning Sectors 7-8 and 8-1 will be fully commissioned up to 7 TeV in 2006-2007 The other sectors will be commissioned up to the field needed for de-Gaussing (1.2 TeV) Initial operation will be at 900 GeV (CM) with a static machine (no ramp, no squeeze) to debug machine and detectors and to give a significant sample of W and Z Full commissioning up to 7 TeV will be done in the winter 2008 shutdown
L.R. Evans – Conclusions All key objectives have been reached for the end of 2005 and installation is now proceeding smoothly. Three quarters of the machine has been liberated for magnet installation and interconnect work is proceeding in 2 octants in parallel. Magnet installation is now steady at 25/wk. Installation will finish end March 2007. The machine will be closed in August 2007. Every effort is being made to establish colliding beams before the end of 2007 at reduced energy. The full commissioning up to 7 TeV will be done during the 2008 winter shutdown ready for a Physics run at full energy in 2008.
LHC Commissioning Expect to be characterised by: Poorly understood detectors, calibration, software, triggers etc. Lower than design luminosity & energy (~injection energy) Most likely no AOD or TAG from first pass – but ESD will be larger? Possible large impact on Tier2s – RAW and ESD samples to Tier2s? The pressure will be on to produce some results as soon as possible! There will not be sufficient resources at CERN to handle the load We need a fully functional distributed system - ENTER THE GRID There are many Use Cases we did not yet clearly identify Nor indeed test --- this remains to be done in the coming months!
R. Bailey, Chamonix XV, January 2006 Breakdown of a normal year 7-8 ~ 140-160 days for physics per year Not forgetting ion and TOTEM operation Leaves ~ 100-120 days for proton luminosity running? Efficiency for physics 50%? ~ 50 days ~ 1200 h ~ 4 × 10^6 s of proton luminosity running / year - From Chamonix XIV - Service upgrade slots?
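The estimate on this slide can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function name is hypothetical, and it takes the slide's own figures (100-120 scheduled days, 50% efficiency) as inputs.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24


def luminosity_seconds(scheduled_days: float, efficiency: float) -> float:
    """Effective seconds of luminosity running for a number of scheduled days."""
    effective_days = scheduled_days * efficiency
    return effective_days * HOURS_PER_DAY * SECONDS_PER_HOUR


# Lower and upper ends of the 100-120 day range quoted on the slide.
low = luminosity_seconds(100, 0.5)
high = luminosity_seconds(120, 0.5)

print(f"effective hours (low end): {low / SECONDS_PER_HOUR:.0f}")
print(f"effective seconds: {low:.2e} to {high:.2e}")
```

The low end gives 50 effective days, i.e. 1200 h or about 4.3 × 10^6 s, which matches the rounded figure quoted above.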
P. Sphicas, LHC experiments’ software Startup physics (ALICE) Can publish two papers 1-2 weeks after LHC startup Multiplicity paper: Introduction; Detector system - Pixel (& TPC); Analysis method; Presentation of data - dN/dη and mult. distribution (√s dependence); Theoretical interpretation - ln²(s) scaling?, saturation, multi-parton inter…; Summary p_T paper outline: Introduction; Detector system - TPC, ITS; Analysis method; Presentation of data - p_T spectra and p_T-multiplicity correlation; Theoretical interpretation - soft vs hard, mini-jet production…; Summary
What are the requirements for the WLCG? Over the past 18 – 24 months, we have seen: The LHC Computing Model documents and Technical Design Reports; The associated LCG Technical Design Report; The finalisation of the LCG Memorandum of Understanding (MoU) Together, these define not only the functionality required (Use Cases), but also the requirements in terms of Computing, Storage (disk & tape) and Network But not necessarily in a site-accessible format… We also have close-to-agreement on the Services that must be run at each participating site Tier0, Tier1, Tier2, VO-variations (few) and specific requirements We also have close-to-agreement on the roll-out of Service upgrades to address critical missing functionality We have an on-going programme to ensure that the service delivered meets the requirements, including the essential validation by the experiments themselves
More information on the Experiments’ Computing Models LCG Planning Page GDB Workshops Mumbai Workshop - see GDB Meetings page Experiment presentations, documents Tier-2 workshop and tutorials CERN - 12-16 June Technical Design Reports LCG TDR - Review by the LHCC ALICE TDR supplement: Tier-1 dataflow diagrams ATLAS TDR supplement: Tier-1 dataflow CMS TDR supplement Tier 1 Computing Model LHCb TDR supplement: Additional site dataflow diagrams
Data Handling and Computation for Physics Analysis [Data-flow diagram. Stages: simulation, reconstruction, analysis. Labels: detector; event filter (selection & reconstruction); raw data; event reprocessing; event simulation; event summary data; processed data; batch physics analysis; analysis objects (extracted by physics topic); interactive physics analysis]
LCG Service Model Tier0 – the accelerator centre (that’s us) Data acquisition & initial processing Long-term data curation Distribution of data Tier1s This is where FTS comes in… Canada – Triumf (Vancouver) France – IN2P3 (Lyon) Germany – Forschungszentrum Karlsruhe Italy – CNAF (Bologna) Netherlands – NIKHEF (Amsterdam) Nordic countries – distributed Tier-1 Spain – PIC (Barcelona) Taiwan – Academia Sinica (Taipei) UK – CLRC (Didcot) US – FermiLab (Illinois) – Brookhaven (NY) Tier1 – “online” to the data acquisition process high availability Managed Mass Storage – grid-enabled data service Data intensive analysis National, regional support Continual reprocessing activity (or is that continuous?) Tier2 – ~100 centres in ~40 countries Simulation End-user analysis – batch and interactive Les Robertson
July-August 2006 Disk-Tape Rates (MB/s)
Centre   ATLAS full  ATLAS tape  CMS  LHCb  ALICE  Combined tape rates  Nominal pp (all to tape)
ASGC     60.0        24          10   -     -      35                   100
CNAF     59.0        24          25   23    60     135                  200
PIC      48.6        20          30   23    -      75                   100
IN2P3    90.2        36          15   23    60     100                  200
GridKA   74.6        30          15   23    60     130                  200
RAL      59.0        24          10   23    30     90                   150
BNL      196.8       80          -    -     -                           200
TRIUMF   47.6        20          -    -     -                           50
SARA     87.6        36          -    23    30     90                   150
NDGF     48.6        20          -    -     -                           50
FNAL     -           -           50   -     -                           200
Totals                                             ~825                 1600
Start: ATLAS Now; CMS Now*; LHCb Now; ALICE 24 July; Combined August+
*As soon as stable transfers have been achieved
+Assumes ~1 week ramp-up for ALICE
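The totals row of the table above can be cross-checked with a short script. This is a sketch under stated assumptions: the values are my reading of the flattened table, the unit is taken to be MB/s, and for the four centres with no combined figure listed (BNL, TRIUMF, NDGF, FNAL) the single experiment's tape rate is used as the combined rate.

```python
# centre -> (combined tape rate, nominal pp rate), MB/s assumed.
# Combined values for BNL, TRIUMF, NDGF and FNAL are an assumption:
# the table leaves them blank, so the one listed experiment rate is used.
rates = {
    "ASGC": (35, 100),
    "CNAF": (135, 200),
    "PIC": (75, 100),
    "IN2P3": (100, 200),
    "GridKA": (130, 200),
    "RAL": (90, 150),
    "BNL": (80, 200),
    "TRIUMF": (20, 50),
    "SARA": (90, 150),
    "NDGF": (20, 50),
    "FNAL": (50, 200),
}

combined_total = sum(combined for combined, _ in rates.values())
nominal_total = sum(nominal for _, nominal in rates.values())

print(f"combined: {combined_total} MB/s, nominal: {nominal_total} MB/s")
```

Under that reading the sums come to 825 and 1600, in line with the totals row.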
SC4 T0-T1: Results Target: sustained disk – disk transfers at 1.6GB/s out of CERN at full nominal rates for ~10 days Result: just managed this rate on Good Sunday (1/10) Easter w/e Target 10 day period
Easter Sunday: > 1.6GB/s including DESY GridView reports 1614.5MB/s as daily average
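To put these rates in terms of data volume, a sustained rate can be converted to terabytes moved over a period. The helper below is a hypothetical illustration and assumes decimal units (1 TB = 10^6 MB).

```python
SECONDS_PER_DAY = 86_400
MB_PER_TB = 1_000_000  # decimal units assumed


def volume_tb(rate_mb_per_s: float, days: float) -> float:
    """Data volume in TB moved at a constant rate over a number of days."""
    return rate_mb_per_s * SECONDS_PER_DAY * days / MB_PER_TB


# Daily average reported by GridView for Easter Sunday.
easter_sunday_tb = volume_tb(1614.5, 1)

# The target: 1.6 GB/s (1600 MB/s) sustained for 10 days.
ten_day_target_tb = volume_tb(1600, 10)

print(f"Easter Sunday: {easter_sunday_tb:.1f} TB")
print(f"10-day target: {ten_day_target_tb:.1f} TB")
```

On these assumptions the single good day moved roughly 139 TB, while the full 10-day target corresponds to about 1.4 PB.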
Service Challenges - Reminder Purpose Understand what it takes to operate a real grid service – run for weeks/months at a time (not just limited to experiment Data Challenges) Trigger and verify Tier-1 & large Tier-2 planning and deployment – tested with realistic usage patterns Get the essential grid services ramped up to target levels of reliability, availability, scalability, end-to-end performance Four progressive steps from October 2004 thru September 2006 End 2004 – SC1 – data transfer to subset of Tier-1s Spring 2005 – SC2 – include mass storage, all Tier-1s, some Tier-2s 2nd half 2005 – SC3 – Tier-1s, >20 Tier-2s – first set of baseline services Jun-Sep 2006 – SC4 – pilot service Autumn 2006 – LHC service in continuous operation – ready for data taking in 2007
We have shown that we can drive transfers at full nominal rates to: Most sites simultaneously; All sites in groups (modulo network constraints – PIC); At the target nominal rate of 1.6GB/s expected in pp running In addition, several sites exceeded the disk – tape transfer targets There is no reason to believe that we cannot drive all sites at or above nominal rates for sustained periods. But There are still major operational issues to resolve – and most importantly – a full end-to-end demo under realistic conditions SC4 – Executive Summary
Experiment Plans for SC4 All 4 LHC experiments will run major production exercises during WLCG pilot / SC4 Service Phase These will test all aspects of the respective Computing Models plus stress Site Readiness to run (collectively) full production services These plans have been assembled from the material presented at the Mumbai workshop, with follow-up by Harry Renshall with each experiment, together with input from Bernd Panzer (T0) and the Pre-production team, and summarised on the SC4 planning page. We have also held a number of meetings with representatives from all experiments to confirm that we have all the necessary input (all activities: PPS, SC, Tier0, …) and to spot possible clashes in schedules and / or resource requirements. (See “LCG Resource Scheduling Meetings” under LCG Service Coordination Meetings). The conclusions of these meetings have been presented to the weekly operations meetings and the WLCG Management Board in written form (documents, presentations) –See the SC4 Combined Action List for more information…
Summary of Experiment Plans All experiments will carry out major validations of both their offline software and the service infrastructure during the next 6 months There are significant concerns about the state-of-readiness (of everything…) – not to mention manpower at ~all sites + in experiments I personally am considerably worried – seemingly simple issues, such as setting up LFC/FTS services, publishing SRM end-points etc., have taken O(1 year) to be resolved (across all sites), and [still] don’t even mention basic operational procedures (Some big improvements here recently…) And all this despite heroic efforts across the board But – oh dear – your planet has just been blown up by the Vogons [ So long and thanks for all the fish ] [Figure labels: Mini Computer, Microcomputer, Cluster, mainframe]
ATLAS SC plans/requirements Running now till 7 July to demonstrate the complete ATLAS DAQ and first pass processing with distribution of raw and processed data to Tier 1 sites at the full nominal rates. Will also include data flow to some Tier2 sites and full usage of the ATLAS Distributed Data Management system, DQ2. Raw data to go to tape, processed to disk only. Sites to delete from disk and tape After summer investigate scenarios of recovery from failing Tier 1 sites and deploy cleanup of pools at Tier 0. Later, test distributed production, analysis and reprocessing. DQ2 has a central role with respect to ATLAS Grid tools –ATLAS will install local DQ2 catalogues and services at Tier 1 centres –ATLAS define a region of a Tier 1 and well network connected sites that will depend on the Tier 1 DQ2 catalogue. –Expect such (volunteer) Tier 2 to join SC when T0/T1 runs stably –ATLAS will delete DQ2 catalogue entries Require VO box per Tier 0 and Tier 1 – done Require LFC server per Tier 1 – done, must be monitored Require FTS server and validated channels per Tier 0 and Tier 1 – close Require ‘durable’ MSS disk area at Tier 1 – few sites have it. To be followed up by ATLAS and SC team. ATLAS would like their T1 sites to attend (VRVS) their weekly (Wed at 14.00) SC review meeting during this running phase. No commitments were made.
ALICE SC Plans Validation of the LCG/gLite workload management services: ongoing –Stability of the services is fundamental for the entire duration of the exercise Validation of the data transfer and storage services –2nd phase: end July/August T0 to T1 (recyclable tape) at 300 MB/sec –The stability and support of the services have to be assured during and beyond these throughput tests Validation of the ALICE distributed reconstruction and calibration model: August/September reconstruction at Tier 1 Integration of all Grid resources within one single – interfaces to different Grids (LCG, OSG, NDGF) will be done by ALICE End-user data analysis: September/October
ALICE Requirements/Issues ALICE have deployed a VO box at all their T0-T1-T2 sites –Installation and maintenance by ALICE –Site related problems to be handled by site administrators FTS services required as plugin to AliEn File Transfer Daemon LFC required at all ALICE sites –Used as a local catalogue for the site SE –ALICE will take care of the LFC updates Require FTS endpoints at the T0 and T1 with SRM enabled storage to tape (tapes must be written) and automatic data deletion (by the sites) for the 300 MB/sec throughput test (24 to 30 July). Will the SC team set up and test this before handing it over to ALICE? Answer: SC team to follow up the enabled storage but up to ALICE to test. Require site support during the whole tests and beyond: –What are the site contacts for the central and distributed support teams, or does everything go through GGUS? Answer: All problems to be reported to GGUS. SC team to check possibility of out of hours action.
CMS SC Plans/Requirements In September/October run CSA06, a 50 million event exercise to test the workflow and dataflow associated with the data handling and data access model of CMS Now till end June –Continue to try to improve file transfer efficiency. Low rates and many errors now. –Attempt to hit 25k batch jobs per day and increase the number and reliability of sites aiming to obtain 90% efficiency for job completion In July –Demonstrate CMS analysis submitter in bulk mode with the gLite RB In July and August –25M events per month with the production systems –Second half of July participate in multi-experiment FTS Tier-0 to Tier-1 transfers at 150 MB/sec out of CERN –Continue through August with transfers Requirements: Improve Tier-1 to Tier-2 transfers and the reliability of the FTS channels. CMS are exercising the channels available to them, but there are still issues with site preparation and reliability –the majority of sites are responsive, but there is a lot of work for this summer Require to deploy the LCG-3D infrastructure –From late June deploy Frontier for SQUID caches All participating sites should be able to complete the CMS workflow and metrics (as defined in the CSA06 documentation)
LHCb SC Plans/Requirements Will start DC06 challenge at beginning of July using LCG production services and run till end August: –Distribution of raw data from CERN to Tier 1s at 23 MB/sec –Reconstruction/stripping at Tier 0 and Tier 1 –DST distribution to CERN and Tier 1s –Job prioritisation will be dealt with by LHCb but it is important jobs are not delayed by other VO activities Preproduction for this is ongoing with 125 TB of MC data at CERN Production will go on throughout the year for an LHCb physics book due in 2007 Require SRM 1.1 based SE’s separated for disk and MSS at all Tier 1 as agreed in Mumbai and FTS channels for all CERN-T1’s –Data access directly from SE to ROOT/POOL (not just GridFTP/srmcp). For NIKHEF/SARA (firewall issue) this could perhaps be done via GFAL. Require VO boxes at Tier 1 – so far at CERN, IN2P3, PIC and RAL. Need CNAF, NIKHEF and GridKa Require central LFC catalogue at CERN and read-only copy at certain T1 (currently setting up at CNAF) DC06-2 in Oct/Nov requires T1’s to run COOL and 3D database services
Summary of Open Experiment Tier 1 Site Requirements Verify ‘durable’ disk storage end points for ATLAS The ATLAS LFC instances at the Tier 1’s must be monitored ATLAS request T1 sites to attend weekly SC review during current run Verify tape storage end points for ALICE Verify LFC available for ALICE at all their sites Good site support emphasised by ALICE. Can any best-efforts out of hours support be offered? CMS need improved performance and reliability of file transfers CMS require LCG-3D infrastructure from July CMS require sites to complete their CSA06 metrics LHCb require separate disk and MSS (tape) storage classes LHCb require direct ROOT/POOL data access to SE’s LHCb require VO boxes at CNAF, NIKHEF and GridKa LHCb will require COOL and 3D database services (October) We have tentatively scheduled another technical meeting for 15 September with reviews of the June to August challenges and experiment plans for the rest of 2006 and for the 2007 production services.
Experiment Summary All experiments will be ramping up their activity between now and first collisions The period of ‘one experiment having priority’ – as was done in SC3 and for ATLAS until this weekend – is over It is full, concurrent production from now on!
Workshop Feedback >160 people registered and (a few more) participated! –This is very large for a workshop – about same as Mumbai Some comments related directly to this (~40 replies received so far) Requests for more: –Tutorials, particularly “hands-on” –Direct Tier2 involvement –Feedback sessions, planning concrete actions etc. Active help from Tier2s in preparing / defining future events would be much appreciated –Please not just the usual suspects… See also Duncan Rand’s talk to GridPP16 –Some slides included below
Tutorial Rating – 10=best
Workshop Rating
Workshop Comments Many positive comments on all sessions of the workshop and tutorials Possibility to discuss with other sites and the developers also much appreciated Sessions which some liked least others liked most! I hope that the people who didn’t reply also feel the same! “Very very inspiring” “Hope to do it again soon” “Tutorials were very useful” “The organisation was excellent” “Discussions were very enlightening” “Information collected together in one place”
Workshop Summary Workshops have been well attended and received –Feedback will help guide future events Need to improve on Tier1+Tier2 involvement –Preparing agenda / chairing sessions / giving talks etc. Strong demand for more tutorials –Hands-on where possible / appropriate Thanks to everyone for their contribution to both workshop and tutorials!
HEPiX Rome 05apr06 LCG The Service Challenge programme this year must show that we can run reliable services Grid reliability is the product of many components – middleware, grid operations, computer centres, … Target for September: 90% site availability, 90% user job success Requires a major effort by everyone to monitor, measure, debug First data will arrive next year NOT an option to get things going later Too modest? Too ambitious?
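If "product" is read literally, i.e. a job succeeds only when every component in the chain works and failures are independent, the targets above imply quite demanding per-component figures. The sketch below is illustrative only: the independence model, the function names and the 97% example value are assumptions, not part of the slide.

```python
import math


def combined_reliability(component_reliabilities: list[float]) -> float:
    """Overall reliability when every independent component must work."""
    return math.prod(component_reliabilities)


def required_per_component(target: float, n_components: int) -> float:
    """Reliability each of n equal, independent components needs to hit a target."""
    return target ** (1.0 / n_components)


# Hypothetical example: middleware, grid operations and computer centre,
# each 97% reliable.
overall = combined_reliability([0.97, 0.97, 0.97])

# What each of the three would need to reach a 90% overall target.
needed = required_per_component(0.90, 3)

print(f"overall: {overall:.3f}, needed per component: {needed:.3f}")
```

Under these assumptions three components at 97% each give only about 91% overall, so a 90% end-to-end target leaves little margin for any single layer.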