
1 The Worldwide LHC Computing Grid Service: Experiment Plans for SC4
Jamie Shiers, February 2006
Assembled from SC4 Workshop presentations + Les' plenary talk at CHEP

2 Introduction (LCG, les.robertson@cern.ch)
- Global goals and timelines for SC4
- Experiment plans for pre-SC4, SC4 and post-SC4 production
- Medium-term outline for WLCG services
- The focus of Service Challenge 4 is to demonstrate a basic but reliable service that can be scaled up, by April 2007, to the capacity and performance needed for the first beams.
- Development of new functionality and services must continue, but we must be careful that this does not interfere with the main priority for this year: reliable operation of the baseline services.

3 LCG Service Deadlines (LCG, les.robertson@cern.ch)
[Timeline figure, flattened in the transcript: Service Challenge 4 leads into the pilot services during 2006; cosmics, first physics and the full physics run follow in 2007-2008.]
- Pilot services: stable service from 1 June 2006
- LHC service in operation: 1 October 2006; over the following six months, ramp up to full operational capacity and performance
- LHC service commissioned: 1 April 2007

4 SC4: the Pilot LHC Service from June 2006 (LCG, les.robertson@cern.ch)
- Full demonstration of experiment production
  - DAQ -> Tier-0 -> Tier-1: data recording, calibration, reconstruction
  - Full offline chain, Tier-1 <-> Tier-2 data exchange: simulation, batch and end-user analysis
- Service metrics -> MoU service levels
- Extension to most Tier-2 sites
- Functionality: modest evolution from current services
- Focus on reliability, performance

5 ALICE Data Challenges 2006
- Last chance to show that things are working together (i.e. to test our computing model)
- Whatever does not work here is likely not to work when real data arrive
  - So we had better plan it well and do it well

6 ALICE Data Challenges 2006
Three main objectives:
- Computing Data Challenge
  - Final version of rootifier / recorder
  - Online data monitoring
- Physics Data Challenge
  - Simulation of signal events: 10^6 Pb-Pb, 10^8 p-p
  - Final version of reconstruction
  - Data analysis
- PROOF Data Challenge
  - Preparation of the fast reconstruction / analysis framework

7 Main Points
- Data flow
- Realistic system stress test
- Network stress test
- SC4 schedule
- Analysis activity

8 Data Flow
Not very fancy: always the same.
- Distributed simulation production
  - Here we stress-test the system with the number of jobs in parallel
- Data back to CERN
- First reconstruction at CERN
  - RAW/ESD
- Scheduled "push-out": here we do the network test
- Distributed reconstruction
  - Here we stress-test the I/O subsystem
- Distributed (batch) analysis
  - "And here comes the proof of the pudding" - FCA

9 SC3 -> SC4 Schedule
- February 2006
  - Rerun of SC3 disk-disk transfers (max 150 MB/s for 7 days)
  - Transfers with FTD, either triggered via AliEn jobs or scheduled
  - T0 -> T1 (CCIN2P3, CNAF, GridKa, RAL)
- March 2006
  - T0-T1 "loop-back" tests at 2 x nominal rate (CERN)
  - Run bulk production at T1/T2 (simulation + reconstruction jobs) and send data back to CERN
  - (We get ready with PROOF@CAF)
- April 2006
  - T0-T1 disk-disk (nominal rates), disk-tape (50-75 MB/s)
  - First push-out (T0 -> T1) of simulated data, reconstruction at T1
  - (First tests with PROOF@CAF)
- July 2006
  - T0-T1 disk-tape (nominal rates)
  - T1-T1, T1-T2, T2-T1 and other rates TBD according to the CTDRs
  - Second chance to push out the data
  - Reconstruction at CERN and remote centres
- September 2006
  - Scheduled analysis challenge
  - Unscheduled challenge (target T2s?)
(A quick volume check for the February rerun target is sketched below.)
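As a sanity check on the February rerun target quoted above, a week of sustained transfers at 150 MB/s corresponds to roughly 90 TB moved; a minimal arithmetic sketch (decimal units assumed):

```python
# Quick volume check for the February SC3 rerun target (150 MB/s for 7 days).
rate_mb_s = 150
seconds_per_day = 86400
days = 7

total_tb = rate_mb_s * seconds_per_day * days / 1e6   # MB -> TB (decimal units)
print(f"~{total_tb:.0f} TB transferred over the week at the quoted rate")
```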

10 SC4 Rates - Scheduled Analysis
- Users
  - Order of 10 at the beginning of SC4
- Input
  - 1.2M Pb-Pb events, 100M p-p events, ESD stored at T1s
- Job rate
  - Can be tuned according to the availability of resources
- Queries to the metadata catalogue
  - Time per query to be evaluated (does not involve LCG services)
- Job splitting
  - Can be done by AliEn according to the query result (destination set for each job)
  - CPU availability is an issue (sub-jobs should not wait too long for delayed executions)
  - Result merging can be done by a separate job
- Network
  - Not an issue
(A schematic sketch of this split/merge pattern follows below.)
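The split/merge pattern described on this slide can be illustrated with a short sketch. This is a hypothetical illustration, not the actual AliEn API: the catalogue query result, the grouping key and the `partial_*.root` output names are all invented for the example.

```python
# Hypothetical sketch (not the actual AliEn API): split the file list returned
# by a metadata-catalogue query into sub-jobs grouped by the site holding the
# ESDs, with a final merge job collecting the partial outputs.
from collections import defaultdict
from itertools import islice

def split_analysis_job(query_result, files_per_subjob=50):
    """query_result: list of (lfn, site) pairs from the catalogue query."""
    by_site = defaultdict(list)
    for lfn, site in query_result:
        by_site[site].append(lfn)          # keep each sub-job local to its data

    subjobs = []
    for site, lfns in by_site.items():
        it = iter(lfns)
        while chunk := list(islice(it, files_per_subjob)):
            subjobs.append({"destination": site, "inputs": chunk})

    merge_job = {"type": "merge",
                 "inputs": [f"partial_{i}.root" for i in range(len(subjobs))]}
    return subjobs, merge_job

# Example: two Tier-1s holding the ESDs selected by the query
subjobs, merge = split_analysis_job([("lfn:/alice/esd/001.root", "CNAF"),
                                     ("lfn:/alice/esd/002.root", "CCIN2P3")])
print(len(subjobs), "sub-jobs +", merge["type"], "job")
```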

11 SC4 Rates - Scheduled Analysis
- Some (preliminary) numbers
  - Based on 20-minute jobs

12 SC4 Rates - Unscheduled Analysis
- To be defined

13 ATLAS SC4 Tests (Dario Barberis, ATLAS SC4 Plans, WLCG SC4 Workshop, Mumbai, 12 February 2006)
- Complete Tier-0 test
  - Internal data transfer from "Event Filter" farm to Castor disk pool, Castor tape, CPU farm
  - Calibration loop and handling of conditions data
    - Including distribution of conditions data to Tier-1s (and Tier-2s)
  - Transfer of RAW, ESD, AOD and TAG data to Tier-1s
  - Transfer of AOD and TAG data to Tier-2s
  - Data and dataset registration in DB (add metadata information to the metadata DB)
- Distributed production
  - Full simulation chain run at Tier-2s (and Tier-1s)
    - Data distribution to Tier-1s, other Tier-2s and CAF
  - Reprocessing of raw data at Tier-1s
    - Data distribution to other Tier-1s, Tier-2s and CAF
- Distributed analysis
  - "Random" job submission accessing data at Tier-1s (some) and Tier-2s (mostly)
  - Tests of performance of job submission, distribution and output retrieval

14 ATLAS SC4 Plans (1)
- Tier-0 data flow tests:
  - Phase 0: 3-4 weeks in March-April for internal Tier-0 tests
    - Explore limitations of the current setup
    - Run real algorithmic code
    - Establish infrastructure for the calibration/alignment loop and conditions DB access
    - Study models for event streaming and file merging
    - Get input from the SFO simulator placed at Point 1 (ATLAS pit)
    - Implement system monitoring infrastructure
  - Phase 1: last 3 weeks of June, with data distribution to Tier-1s
    - Run integrated data flow tests using the SC4 infrastructure for data distribution
    - Send AODs to (at least) a few Tier-2s
    - Automatic operation for O(1 week)
    - First version of shifter's interface tools
    - Treatment of error conditions
  - Phase 2: 3-4 weeks in September-October
    - Extend data distribution to all (or most) Tier-2s
    - Use 3D tools to distribute calibration data
- The ATLAS TDAQ Large Scale Test in October-November prevents further Tier-0 tests in 2006...
  - ... but is not incompatible with other distributed operations

15 ATLAS SC4 Plans (2)
- ATLAS CSC includes continuous distributed simulation productions:
  - We will continue running distributed simulation productions all the time
    - Using all Grid computing resources we have available for ATLAS
    - The aim is to produce ~2M fully simulated (and reconstructed) events/week from April onwards, both for physics users and to build the datasets for later tests
    - We can currently manage ~1M events/week; ramping up gradually
- SC4: distributed reprocessing tests:
  - Test of the computing model using the SC4 data management infrastructure
    - Needs file transfer capabilities between Tier-1s and back to the CERN CAF
    - Also distribution of conditions data to Tier-1s (3D)
    - Storage management is also an issue
  - Could use 3 weeks in July and 3 weeks in October
- SC4: distributed simulation intensive tests:
  - Once the reprocessing tests are OK, we can use the same infrastructure to implement our computing model for simulation productions
    - As they would use the same setup, both from our ProdSys and the SC4 side
  - First separately, then concurrently

16 ATLAS SC4 Plans (3)
- Distributed analysis tests:
  - "Random" job submission accessing data at Tier-1s (some) and Tier-2s (mostly)
    - Generate groups of jobs and simulate analysis job submission by users at home sites
    - Direct jobs needing only AODs as input to Tier-2s
    - Direct jobs needing ESDs or RAW as input to Tier-1s
    - Make preferential use of ESD and RAW samples available on disk at Tier-2s
    - Tests of performance of job submission, distribution and output retrieval
    - Test job priority and site policy schemes for many user groups and roles
    - Distributed data and dataset discovery and access through metadata, tags, data catalogues
  - Need the same SC4 infrastructure as needed by distributed productions
    - Storage of job outputs for private or group-level analysis may be an issue
  - Tests can be run during Q3-Q4 2006
    - First a couple of weeks in July-August (after the distributed production tests)
    - Then another, longer period of 3-4 weeks in November
(A toy job-routing sketch based on these rules follows below.)
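The routing rules on this slide (AOD-only jobs to Tier-2s, ESD/RAW jobs to Tier-1s, with a preference for Tier-2s that hold the sample on disk) can be captured in a toy sketch. This is illustrative only, not ATLAS ProdSys/DDM code; the site names are made up.

```python
# Illustrative sketch only (not ATLAS ProdSys/DDM code): route an analysis job
# according to the rules on this slide - jobs needing only AOD/TAG go to Tier-2s,
# jobs needing ESD or RAW go to Tier-1s, preferring Tier-2s with the data on disk.
import random

TIER1_SITES = ["T1_A", "T1_B"]          # hypothetical Tier-1 names
TIER2_SITES = ["T2_X", "T2_Y", "T2_Z"]  # hypothetical Tier-2 names

def route_job(input_formats, disk_replicas=None):
    """input_formats: set of formats the job reads, e.g. {"AOD"} or {"ESD"}."""
    disk_replicas = disk_replicas or {}
    if input_formats <= {"AOD", "TAG"}:
        return random.choice(TIER2_SITES)          # AOD/TAG-only -> Tier-2
    # Prefer a Tier-2 that already holds the ESD/RAW sample on disk
    for fmt in ("ESD", "RAW"):
        if fmt in input_formats and disk_replicas.get(fmt) in TIER2_SITES:
            return disk_replicas[fmt]
    return random.choice(TIER1_SITES)              # otherwise -> Tier-1

print(route_job({"AOD"}))                          # lands at a Tier-2
print(route_job({"ESD"}, {"ESD": "T2_Y"}))         # T2_Y (ESD sample on disk there)
print(route_job({"RAW"}))                          # lands at a Tier-1
```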

17 Overview of Requirements for SC4
- SRM ("baseline version") on all storages
- VO Box per Tier-1 and at Tier-0
- LFC server per Tier-1 and at Tier-0
- FTS server per Tier-1 and at Tier-0
- Disk-only area on all tape systems
  - Preferably separate SRM entry points for "disk" and "tape" SEs; otherwise a directory set as permanent ("durable"?) on disk (non-migratable)
  - Disk space is managed by DQ2
  - Counts as online ("disk") data in the ATLAS Computing Model
- Ability to install FTS ATLAS VO agents on the Tier-1 and Tier-0 VO Boxes (see next slides)
- Single entry point for FTS with multiple channels/servers
- Ability to deploy DQ2 services on the VO Box as during SC3
- No new requirements on the Tier-2s besides an SRM SE

18 Movement Use Cases for SC4
- EF -> Tier-0 migratable area
- Tier-0 migratable area -> Tier-1 disk
- Tier-0 migratable area -> Tier-0 tape
- Tier-1 disk -> same Tier-1 tape
- Tier-1 disk -> any other Tier-1 disk
- Tier-1 disk -> related Tier-2 disk (next slides for details)
- Tier-2 disk -> related Tier-1 disk (next slides for details)
- Not done:
  - Processing directly from tape (not in the ATLAS Computing Model)
  - Automated multi-hop (no 'complex' data routing)
  - Built-in support for end-user analysis: the goal is to exercise the current middleware and understand its limitations (metrics)
(A small sketch enumerating these allowed channels follows below.)
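The allowed channels above form a small single-hop whitelist; everything else, including automated multi-hop routing, is out of scope. A minimal sketch of that idea, with assumed endpoint labels that are not part of any real ATLAS configuration:

```python
# Illustrative sketch (assumed structure, not ATLAS DDM/DQ2 code): represent the
# allowed single-hop movement channels from this slide as a whitelist and reject
# anything else, including multi-hop routes.
ALLOWED_CHANNELS = {
    ("EF", "T0_migratable"),
    ("T0_migratable", "T1_disk"),
    ("T0_migratable", "T0_tape"),
    ("T1_disk", "T1_tape_same_site"),
    ("T1_disk", "T1_disk_other_site"),
    ("T1_disk", "T2_disk_related"),
    ("T2_disk", "T1_disk_related"),
}

def validate_transfer(route):
    """route: list of storage endpoints; only single hops on the whitelist pass."""
    if len(route) != 2:
        return False                     # no automated multi-hop routing
    return tuple(route) in ALLOWED_CHANNELS

print(validate_transfer(["T0_migratable", "T1_disk"]))        # True
print(validate_transfer(["T1_disk", "T0_migratable"]))        # False (not a use case)
print(validate_transfer(["EF", "T0_migratable", "T1_disk"]))  # False (multi-hop)
```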

19 ATLAS SC4 Requirement (new!)
- A small testbed with (part of) CERN, a few Tier-1s and a few Tier-2s to test our distributed systems (ProdSys, DDM, DA) prior to deployment
  - It would allow testing new middleware features without disturbing other operations
  - We could also properly tune the operations on our side
  - The aim is to arrive at the agreed scheduled time slots with an already-tested system and really use the available time for relevant scaling tests
  - This setup would not interfere with concurrent large-scale tests or data transfers run by other experiments
- A first instance of such a system would be useful already now!
  - April-May looks like a realistic request

20 Summary of Requests
- March-April (pre-SC4): 3-4 weeks for internal Tier-0 tests (Phase 0)
- April-May (pre-SC4): tests of distributed operations on a "small" testbed
- Last 3 weeks of June: Tier-0 test (Phase 1) with data distribution to Tier-1s
- 3 weeks in July: distributed processing tests (Part 1)
- 2 weeks in July-August: distributed analysis tests (Part 1)
- 3-4 weeks in September-October: Tier-0 test (Phase 2) with data to Tier-2s
- 3 weeks in October: distributed processing tests (Part 2)
- 3-4 weeks in November: distributed analysis tests (Part 2)

21-29 (LCG, les.robertson@cern.ch): image-only slides; no text transcribed.

30 LHCb DC06: "Test of the LHCb Computing Model using LCG Production Services"
- Distribution of RAW data
- Reconstruction + stripping
- DST redistribution
- User analysis
- MC production
- Use of conditions DB (alignment + calibration)

31 SC4 Aims for LHCb
- Test the data-processing part of the Computing Model
- Use 200M MC RAW events:
  - Distribute
  - Reconstruct
  - Strip and redistribute
- Simultaneous activities:
  - MC production
  - User analysis

32 Preparation for SC4
- Event generation, detector simulation and digitisation
- 100M B-physics + 100M minimum-bias events:
  - 3.7 MSI2k·month required (~2-3 months)
  - 125 TB on MSS at Tier-0 (keep MC truth)
- Timing:
  - Start productions mid-March
  - Full capacity by end of March
(A quick capacity check on these numbers is sketched below.)
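A rough back-of-the-envelope check on the figures above, assuming the production is spread evenly over ~2.5 months and the 125 TB covers all 200M events:

```python
# Back-of-the-envelope check of the numbers above (assumptions: production spread
# evenly over ~2.5 months; the 125 TB covers all 200M events).
cpu_work_msi2k_month = 3.7      # total CPU work quoted on the slide
duration_months = 2.5           # "~2-3 months"
events_total = 200e6            # 100M B-physics + 100M minimum bias
storage_tb = 125.0              # MSS at Tier-0

avg_capacity = cpu_work_msi2k_month / duration_months       # ~1.5 MSI2k sustained
event_size_mb = storage_tb * 1e6 / events_total             # ~0.6 MB/event average

print(f"average farm capacity needed: ~{avg_capacity:.1f} MSI2k")
print(f"average stored size per event: ~{event_size_mb:.2f} MB")
```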

33 LHCb SC4 (I)
- Timing:
  - Start in June
  - Duration: 2 months
- Distribution of RAW data
  - Tier-0 MSS SRM -> Tier-1 MSS SRMs
  - 2 TB/day out of CERN
  - 125 TB on MSS at the Tier-1s

34 LHCb SC4 (II)
- Reconstruction/stripping
  - 270 kSI2k·month
  - 60 TB on MSS at the Tier-1s (full DST)
  - 1k jobs/day (following the data)
  - Job duration: 2 hours
  - 90% of jobs (reconstruction): input 3.6 GB, output 2 GB
  - 10% of jobs (stripping): input 20 GB, output 0.5 GB
- DST distribution
  - 2.2 TB on disk per Tier-1 + CAF (selected DST + RAW)
(A rough throughput estimate from these figures is sketched below.)
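The figures on the two LHCb SC4 slides above translate into the following rough throughput estimate, assuming the 1k jobs/day are spread evenly over 24 hours:

```python
# Rough throughput estimate from the figures on the two LHCb SC4 slides above
# (assumption: the 1k jobs/day are spread evenly over 24 hours).
jobs_per_day = 1000
job_hours = 2.0
reco_frac, strip_frac = 0.9, 0.1
reco_in, reco_out = 3.6, 2.0        # GB per reconstruction job
strip_in, strip_out = 20.0, 0.5     # GB per stripping job
raw_export_tb_day = 2.0             # T0 -> T1 RAW distribution

concurrent_slots = jobs_per_day * job_hours / 24                      # ~83 slots
daily_in_gb = jobs_per_day * (reco_frac * reco_in + strip_frac * strip_in)
daily_out_gb = jobs_per_day * (reco_frac * reco_out + strip_frac * strip_out)
export_mb_s = raw_export_tb_day * 1e6 / 86400                         # ~23 MB/s

print(f"~{concurrent_slots:.0f} concurrent job slots on average")
print(f"~{daily_in_gb/1e3:.1f} TB/day read, ~{daily_out_gb/1e3:.1f} TB/day written at the Tier-1s")
print(f"RAW export out of CERN: ~{export_mb_s:.0f} MB/s sustained")
```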

35 DIRAC Tools & LCG
- DIRAC Transfer Agent at Tier-0 + Tier-1s
  - FTS + SRM
- DIRAC production tools
  - Production Manager console
  - Transformation Agents
- DIRAC WMS
  - LFC + RB + CE
- Applications:
  - GFAL: POSIX I/O via LFN
(A schematic sketch of the LFN-based POSIX I/O idea follows below.)
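"POSIX I/O via LFN" means an application can open and read a grid file by its logical name, with the catalogue lookup and replica access hidden behind POSIX-like calls. The sketch below is a toy illustration of that idea only; it is not the real GFAL API, and `resolve_lfn` and `grid_open` are invented placeholder names.

```python
# Schematic illustration only - not the real GFAL API. GFAL exposes POSIX-like
# calls on grid files; this toy shows the idea: an LFN is resolved through the
# file catalogue to a replica, which is then opened and read like a local file.
def resolve_lfn(lfn, catalogue):
    """Look up an LFN in a (toy) catalogue and return one replica location."""
    replicas = catalogue.get(lfn, [])
    if not replicas:
        raise FileNotFoundError(f"no replica registered for {lfn}")
    return replicas[0]

def grid_open(lfn, catalogue):
    """POSIX-style open on an LFN: resolve first, then open the replica."""
    replica = resolve_lfn(lfn, catalogue)
    # A real implementation would negotiate a transfer URL (e.g. via SRM) and
    # open it with the appropriate protocol; here we just open a local path.
    return open(replica, "rb")

toy_catalogue = {"lfn:/lhcb/dst/00001.dst": ["/tmp/00001.dst"]}
# Usage (requires the replica path to exist):
# with grid_open("lfn:/lhcb/dst/00001.dst", toy_catalogue) as f:
#     header = f.read(64)
```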

36 To Be Tested after SC4
- Data management:
  - SRM v2
  - GridFTP 2
  - FPS
- Workload management:
  - gLite RB?
  - gLite CE?
- VOMS
  - Integration with MW
- Applications
  - xrootd

37 Monthly Summary (I) (LCG, les.robertson@cern.ch)
- February
  - ALICE: data transfers T0 -> T1 (CCIN2P3, CNAF, GridKa, RAL)
  - ATLAS:
  - CMS:
  - LHCb:
- March
  - ALICE: bulk production at T1/T2; data back to T0
  - ATLAS: 3-4 weeks (Mar/Apr) of T0 tests
  - CMS: PhEDEx integration with FTS
  - LHCb: start generation of 100M B-physics + 100M min-bias events (2-3 months; 125 TB on MSS at Tier-0)
- April
  - ALICE: first push-out of simulated data; reconstruction at T1s
  - ATLAS: see above
  - CMS: 10 TB to tape at T1s at 150 MB/s
  - LHCb: see above
  - dTeam: T0-T1 at nominal rates (disk); 50-75 MB/s (tape)
- Extensive testing on the PPS by all VOs

38 Monthly Summary (II) (LCG, les.robertson@cern.ch)
- May
  - ALICE:
  - ATLAS:
  - CMS:
  - LHCb:
- June
  - ALICE:
  - ATLAS: Tier-0 test (Phase 1) with data distribution to Tier-1s (3 weeks)
  - CMS: 2-week re-run of SC3 goals (beginning of month)
  - LHCb: reconstruction/stripping; 2 TB/day out of CERN; 125 TB on MSS at the Tier-1s
- July
  - ALICE: reconstruction at CERN and remote centres
  - ATLAS:
  - CMS: bulk simulation (2 months)
  - LHCb: see above
  - dTeam: T0-T1 at full nominal rates (to tape)
- Deployment of gLite 3.0 at major sites for SC4 production

39 Monthly Summary (III) (LCG, les.robertson@cern.ch)
- August
  - ALICE:
  - ATLAS:
  - CMS: bulk simulation continues
  - LHCb: analysis of data from June/July... until spring 2007 or so
- September
  - ALICE: scheduled + unscheduled (T2s?) analysis challenges
  - ATLAS:
  - CMS:
  - LHCb: see above

40 WLCG: Medium-Term Evolution (LCG, les.robertson@cern.ch)
[Diagram, flattened in the transcript; recoverable points:]
- 3D distributed database services: development and test during SC4
- SRM 2: test and deployment plan being elaborated (October?)
- Additional planned functionality to be agreed and completed in the next few months, then tested and deployed, subject to progress and experience
- New functionality: evaluation and development cycles; possible components for later years

41 So What Happens at the End of SC4? (LCG, les.robertson@cern.ch)
- Well before October we need to have all structures and procedures in place...
- ... to run, and evolve, a production service for the long term
- This includes all aspects: monitoring, automatic problem detection, resolution, reporting, escalation, {site, user} support, accounting, review, planning for new productions, service upgrades, ...
- Precisely because things will evolve, we should avoid over-specification

42 Summary (LCG, les.robertson@cern.ch)
- Two grid infrastructures are now in operation, on which we are able to build computing services for LHC
- Reliability and performance have improved significantly over the past year
- The focus of Service Challenge 4 is to demonstrate a basic but reliable service that can be scaled up, by April 2007, to the capacity and performance needed for the first beams
- Development of new functionality and services must continue, but we must be careful that this does not interfere with the main priority for this year: reliable operation of the baseline services

