
1 ATLAS, eScience and the Grid
Birmingham, 9th June 2004
RWL Jones, Lancaster University

2 RWL Jones, Lancaster University Overview
What is eScience, what are Grids?
Why does ATLAS need them?
What deployments exist?
What is the ATLAS Computing Model?
How will ATLAS test this?
Conclusions

3 RWL Jones, Lancaster University What is eScience?
Electronic Science? For particle physics, eScience mainly means Grids…
Science on 'E'? (Maybe!)
Enhanced Science (John Taylor)
Actually, anything involving HPC and/or high-speed networking, but really anything that can only be done with modern computing!
Cynical view: it has been a useful way to get funding from Governments etc.!
GridPP had £17.5M for LCG, hardware (£3.5M), middleware and applications; GridPP2 has £14M for more hardware, deployment and applications.

4 RWL Jones, Lancaster University The Grid
Note: truly HPC, but it requires more. It is not designed for tightly coupled problems, but there are many spin-offs.

5 RWL Jones, Lancaster University Grids – 3 Different Kinds
Computational Grid: lots of fast processors spread over a large physical area, interlinked by fast networks. Effectively a huge multiprocessor computer; shared memory is more difficult but do-able.
Data Grid: lots of databases linked by fast networks. Need effective access to mass stores, and database query tools that span different sites and different database systems. Examples: Sloan Sky Survey, social sciences.
Sensor or Control Grid: wide-area sensor networks or remote control, connected by fast networks. Examples: flood-plain monitoring, accelerator control rooms.
ATLAS needs a hybrid of the first two.

6 RWL Jones, Lancaster University [Figure: the technology hype cycle, plotting hype against time from the Technology Trigger through the Peak of Inflated Expectations and the Trough of Disillusionment to the Slope of Enlightenment and the Plateau of Productivity.]

7 RWL Jones, Lancaster University The ATLAS Data
ATLAS is not one experiment: it is a facility for many different measurements and physics topics.
Event selection: 1 GHz pp collision rate, 40 MHz bunch-crossing rate, 200 Hz event rate to mass storage.
Real-time selection on leptons and jets.

8 RWL Jones, Lancaster University The ATLAS Computing Challenge
Running conditions at startup:
CPU: ~14.5M SpecInt2k, including analysis.
0.8×10^9 event sample, 1.3 PB/year before data processing.
Reconstructed events plus Monte Carlo data: ~10 PB/year (~3 PB on disk).
CERN alone can handle only a fraction of these resources.
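As a rough sanity check, the implied raw event size can be backed out of the two figures above; this is a sketch only, using decimal petabytes, not an official ATLAS number.

```python
# Back-of-the-envelope check of the numbers above (a sketch, not official ATLAS figures):
# ~0.8e9 events/year filling ~1.3 PB/year implies the raw event size below.

events_per_year = 0.8e9
raw_pb_per_year = 1.3

raw_event_size_mb = raw_pb_per_year * 1e15 / events_per_year / 1e6
print(f"implied raw event size ~ {raw_event_size_mb:.1f} MB")   # ~1.6 MB/event

# With reconstructed and Monte Carlo data the total grows to ~10 PB/year,
# of which ~3 PB is expected to live on disk (figures from the slide above).
```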

9 RWL Jones, Lancaster University The System
[Figure: the tiered computing system; link rates shown include a physics data cache of ~PB/s, >10 GB/s into the Event Filter, 450 Mb/s, ~300 MB/s per Tier-1 per experiment out of Tier 0, and 622 Mb/s to each Tier 2.]
Event Builder and Event Filter (~7 MSI2k); Tier 0 at CERN (~5 MSI2k); Regional Centres (Tier 1) in the UK (RAL), US, France, Asia, etc.
Tier 1 (~9 PB/year, no simulation): N Tier-1s each store 1/N of the raw data, reprocess it and archive the ESD, and hold 2/N of the current ESD for scheduled analysis plus all AOD+TAG.
Tier 2 centres (~200 kSI2k each, e.g. a Northern Tier of Sheffield, Manchester, Liverpool and Lancaster): each has ~25 physicists working on one or more channels, should hold the full AOD, TAG and relevant Physics Group summary data, and does the bulk of simulation.
Some data for calibration and monitoring goes to the institutes (workstations, ~0.25 TIPS) and calibrations flow back.
PC (2004) = ~1 kSpecInt2k.

10 RWL Jones, Lancaster University Complexity of the Problem
ATLAS is a worldwide collaboration, and so we span most Grid projects.
We benefit from all developments, but we have problems maintaining coherence.
It is almost certain we will ultimately be working with several Grids (with defined interfaces).
This may not be what funders like the EU want to hear!

11 RWL Jones, Lancaster University The ATLAS Components
Grid Projects: develop the middleware, provide hardware resources and some manpower, but also drain resources from our core activities.
Computing Model: a dedicated group develops the computing model; a revised resources and planning paper is evolving for Sep 2004; it now examines everything from DAQ to end-user, must include University/local resources, and devises various scenarios with different distributions of data.
Data Challenges: test the computing model and serve other needs in ATLAS (but this must be secondary in DC2).

12 RWL Jones, Lancaster University Grid Projects etc.
EGEE and the other Grid projects: until these groups provide interoperability, the experiments must provide it themselves.

13 RWL Jones, Lancaster University Deployments
Whichever deployment you have, you need:
Hardware to run things on.
Middleware to glue it together: a scheduler, a database of known files, an information system for available resources, authentication and authorisation, file replication, and (maybe) a resource broker.
Front ends to hide complexity from the users.
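To make the list above concrete, here is a toy sketch of a job passing through such components. Every class, function and file name is hypothetical and heavily simplified; this is not the API of any real middleware.

```python
# Illustrative sketch only: hypothetical, simplified stand-ins for the components
# listed above (information system, file catalogue, resource broker, scheduler).

from dataclasses import dataclass

@dataclass
class Site:
    name: str
    free_cpus: int
    close_files: set   # logical file names already replicated at this site

def broker(sites, job_inputs):
    """Resource broker: rank candidate sites by locality of input data, then free CPUs."""
    candidates = [s for s in sites if s.free_cpus > 0]
    return max(candidates, key=lambda s: (len(job_inputs & s.close_files), s.free_cpus))

def submit(job_inputs, sites, user_authorised=True):
    if not user_authorised:                     # authentication / authorisation
        raise PermissionError("no valid Grid credential")
    site = broker(sites, job_inputs)            # information system + brokerage
    missing = job_inputs - site.close_files
    for lfn in missing:                         # file replication to the chosen site
        site.close_files.add(lfn)
    print(f"scheduled at {site.name}, replicated {len(missing)} file(s)")

sites = [Site("RAL", 120, {"lfn:dc2.aod.001"}), Site("Lancaster", 40, set())]
submit({"lfn:dc2.aod.001", "lfn:dc2.aod.002"}, sites)
```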

14 RWL Jones, Lancaster University Current Grid3 Status (3/1/04) (http://www.ivdgl.org/grid2003)
28 sites, multi-VO shared resources; ~2000 CPUs; dynamic, with sites rolling in and out.
Main LCG middleware: the Virtual Data Toolkit, which captures the recipe to remake data, and Chimera, which captures workflows in jobs.

15 RWL Jones, Lancaster University LCG-2 today (May 14)
Inherited European Data Grid software, moving from development to deployment: resource brokerage, the Replica Location Service, metadata services, security, the R-GMA information system, and ARDA middleware.
166 FTE, about 20 in the UK.
Also provides experiment support: POOL object persistency, SEAL core libraries and services, Software Process and Infrastructure, and simulation (G4 and GenSen).

16 RWL Jones, Lancaster University NorduGrid Resources: details
NorduGrid middleware is deployed in: Sweden (15 sites), Denmark (10 sites), Norway (3 sites), Finland (3 sites), Slovakia (1 site) and Estonia (1 site).
Sites to join before/during DC2 (preliminary): Norway (1-2 sites), Russia (1-2 sites), Estonia (1-2 sites), Sweden (1-2 sites), Finland (1 site) and Germany (1 site).
Lightweight deployment based on Globus; many prototypes.
An important contribution to ATLAS, especially installations.

17 RWL Jones, Lancaster University GridPP & GridPP2
Deployment Area: hardware (Tier-1/A and front-ends for Tier-2s), hardware support for Tier-2s, and the Grid Operations Centre for EGEE.
Middleware: security and the Virtual Organisation Management Service, R-GMA deployment, networking services, and MSS.
Applications: complete the Grid integration of the first wave of experiments, support new experiments, and a generic Grid portal.

18 RWL Jones, Lancaster University GridPP1 Components
LHC Computing Grid Project (LCG): applications, fabrics, technology and deployment.
European DataGrid (EDG): middleware development.
UK Tier-1/A Regional Centre: hardware and manpower.
Grid Application Development: LHC and US experiments + Lattice QCD.
Management, travel etc.

19 RWL Jones, Lancaster University GridPP2 Components
A. Management, Travel, Operations
B. Middleware, Security, Network Development
C. Grid Application Development: LHC and US Experiments + Lattice QCD + Phenomenology
D. Tier-2 Deployment: 4 Regional Centres - M/S/N support and System Management
E. Tier-1/A Deployment: Hardware, System Management, Experiment Support
F. LHC Computing Grid Project (LCG Phase 2) [review]

20 RWL Jones, Lancaster University GridPP Summary: From Prototype to Production
Prototype Grids (separate experiments, resources, multiple accounts): BaBarGrid, SAMGrid (D0, CDF), EDG and GANGA, with the UK Prototype Tier-1/A Centre, UK Prototype Tier-2 Centres and the CERN Prototype Tier-0 Centre, spanning 19 UK Institutes, the RAL Computer Centre and the CERN Computer Centre.
Moving to 'one' Production Grid: LCG, EGEE and ARDA, with the UK Tier-1/A Centre, 4 UK Tier-2 Centres and the CERN Tier-0 Centre, serving BaBar, D0, CDF, ATLAS, CMS, LHCb and ALICE.

21 RWL Jones, Lancaster University EDG and LCG Strategy
Try to write examples of the main components.
Try to get a small working Grid for production jobs: well-defined datasets, well-defined (pre-installed) code, coherent job submission.
Test scalability, then redesign.
Set up an analysis environment.
Develop user interfaces in parallel.
Develop experiment-specific tools - this requires clean interfaces/component design.
You can develop end-to-end prototypes faster (e.g. NorduGrid), but this aims for something robust, generic and reusable.

22 RWL Jones, Lancaster University Rough Architecture
[Figure: the user works through a User Interface to the Grid plus the experiment framework; this talks to the middleware (resource broker, Grid information system), the data catalogue and the job configuration/VDC/metadata, and dispatches to compute + storage sites, on which the software and environment have been installed.]

23 RWL Jones, Lancaster University ATLAS Computing Model
Areas being addressed:
1. Computing resources
2. Networks from DAQ to primary storage
3. Databases
4. Grid interfaces
5. Computing farms
6. Distributed analysis
7. Distributed production
8. Alignment & calibration procedures
9. Tests of the computing model
10. Minimum permissible service
11. Simulation of the model
Report due end 2004, ready for the Computing Technical Design Report.

24 RWL Jones, Lancaster University A More Grid-like Model
[Figure: the LHC Computing Facility as a hierarchy: CERN Tier 0; Tier 1s in Germany, the USA (FermiLab and Brookhaven), the UK, France, Italy, NL, etc.; Tier 2s (labs and universities, e.g. Lancaster, within the UK's NorthGrid, SouthGrid, LondonGrid and ScotGrid); and physics department desktops.]

25 RWL Jones, Lancaster University Features of the Model
All T1 facilities have 1/6 of the raw data: this allows reprocessing.
All T1 facilities have 1/3 of the full reconstructed data: this allows more on disk/fast-access space and saves tape.
All regional facilities have all of the analysis data (AOD).
Centres become facilities (even at T2 level). Facilities are regional and NOT national: physicists from other regions should also have access to the computing resources, and cost sharing is an issue.
Implications for the Grid middleware on accounting and priorities: between experiments, between regions and between analysis groups (Virtual Organization Management System).
Also, different activities will require different priorities.

26 RWL Jones, Lancaster University Operation of Tier-0
The Tier-0 facility at CERN will have to:
hold a copy of all raw data to tape;
copy all raw data in real time to Tier-1s (the second copy is also useful for later reprocessing);
keep calibration data on disk;
run first-pass reconstruction;
distribute ESDs to external Tier-1s (2/N to each one of N Tier-1s).
Currently under discussion: shelf vs automatic tapes, archiving of simulated data, and sharing of facilities between HLT and Tier-0.
Tier-0 will have to be a dedicated facility, where the CPU power and network bandwidth match the real-time event rate.

27 RWL Jones, Lancaster University The Global View
Distribution to ~6 T1s, each holding 1/3 of the reconstructed data.
The ability to do research therefore requires a sophisticated software infrastructure for complete and convenient data access for the whole collaboration, and sufficient network bandwidth (2.5 Gb/s) to keep up the data transfer from T0 to the T1s.
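A rough consistency check of these bandwidth figures, as a sketch only: the 1.6 MB event size is the value implied by the volumes quoted earlier in the talk, and raw data is only part of the Tier-0 to Tier-1 traffic; the remaining headroom presumably covers ESD/AOD distribution and later reprocessing passes.

```python
# Sketch: how the 2.5 Gb/s per-link figure compares with the raw-data flow alone.
# The 1.6 MB/event size is inferred from the earlier slides, not an official number.

event_rate_hz = 200          # event rate to mass storage (slide 7)
raw_event_mb = 1.6           # implied raw event size (slide 8)
n_tier1 = 6                  # ~6 Tier-1s (this slide)

total_raw_mb_s = event_rate_hz * raw_event_mb        # ~320 MB/s of raw data out of Tier-0
per_t1_raw_mb_s = total_raw_mb_s / n_tier1           # ~53 MB/s raw share per Tier-1
link_mb_s = 2.5e9 / 8 / 1e6                          # a 2.5 Gb/s link ~ 312 MB/s

print(f"raw out of Tier-0: ~{total_raw_mb_s:.0f} MB/s; "
      f"raw share per Tier-1: ~{per_t1_raw_mb_s:.0f} MB/s; "
      f"2.5 Gb/s link: ~{link_mb_s:.0f} MB/s")
```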

28 RWL Jones, Lancaster University Operation of Tier-1s and Tier-2s
We envisage at least 6 Tier-1s for ATLAS. Each one will:
keep on disk 2/N of the ESDs and a full copy of AODs and TAGs;
keep on tape 1/N of the raw data;
keep on disk 2/N of the currently simulated ESDs and on tape 1/N of previous versions;
provide facilities (CPU and disk space) for Physics Group analysis of ESDs;
run simulation, calibration and/or reprocessing of real data.
We estimate ~4 Tier-2s for each Tier-1. Each one will:
keep on disk a full copy of AODs and TAGs, and possibly a selected sample of ESDs;
provide facilities (CPU and disk space) for user analysis (~25 users/Tier-2);
run simulation and/or calibration procedures.

29 RWL Jones, Lancaster University Analysis on Tier-2s and Tier-3s
This area is under the most active change: we are trying to forecast resource usage and usage patterns from the Physics Working Groups.
Assume about ~10 selected large AOD datasets, one for each physics analysis group.
Assume that each large local centre will have the full TAG to allow simple selections; using these, jobs are submitted to the T1 cloud to select on the full ESD, and a new collection or ntuple-equivalent is returned to the local resource.
Distributed analysis systems are under development: metadata integration, event navigation and database designs are all at top priority.
ARDA may help, but will be late in the day for DC2 (risk of interference with DC2 developments).
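The pattern just described (a cheap selection on the local TAG copy, then ESD-level jobs in the Tier-1 cloud, with an ntuple-equivalent shipped back) can be sketched as follows; the helpers and the toy TAG records are entirely hypothetical.

```python
# Sketch of the analysis pattern above, with hypothetical helpers:
# select events from the local TAG copy, then send ESD-level jobs to the
# Tier-1 cloud and pull back an ntuple-equivalent.

def select_from_tag(tag_db, cut):
    """Cheap local selection on the full TAG (event-level summary) copy."""
    return [evt for evt in tag_db if cut(evt)]

def submit_to_tier1_cloud(event_list, algorithm):
    """Stand-in for submission to the Tier-1s holding the full ESD."""
    # in reality: partition event_list by which Tier-1 holds the relevant ESD,
    # run `algorithm` there, and merge the returned collections locally
    return [algorithm(evt) for evt in event_list]

tag_db = [{"id": i, "met": 10 * i} for i in range(100)]           # toy TAG records
selected = select_from_tag(tag_db, lambda e: e["met"] > 500)      # e.g. high missing-ET
ntuple = submit_to_tier1_cloud(selected, lambda e: {"id": e["id"], "met": e["met"]})
print(len(ntuple), "events returned to the local resource")
```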

30 RWL Jones, Lancaster University Resource Summary
(Table of required resources, broken down by CERN, all T1s, all T2s and the total; rows: automated tape (PB), shelf tape (PB), disk (PB), CPU (MSI2k).)

31 RWL Jones, Lancaster University New ATLAS Production System
[Figure: a common supervisor (Windmill) takes jobs from the production database (ProdDB, with metadata in AMI) and talks via jabber and SOAP to executors for each flavour: Lexor for LCG, Dulcinea for NorduGrid, Capone for Grid3, plus an LSF executor; the Data Management System (Don Quijote) sits above the RLS catalogues.]
Much of the problem is data management: this must cope with >= 3 Grid catalogues, and the demands will be greater for analysis.
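A toy sketch of the supervisor/executor pattern the figure describes; the classes here are stand-ins (the real Windmill supervisor and the Lexor/Dulcinea/Capone executors exchange messages over jabber/SOAP rather than calling each other directly, and ProdDB and Don Quijote are real services, not Python lists).

```python
# Hypothetical, much-simplified supervisor/executor loop.

import random

class Executor:
    def __init__(self, flavour):
        self.flavour = flavour          # "LCG", "NG", "Grid3" or "LSF"
    def run(self, job):
        # a real executor translates the job definition for its Grid and submits it
        return {"job": job["id"], "flavour": self.flavour, "status": "done"}

class Supervisor:
    def __init__(self, prod_db, executors, data_management):
        self.prod_db, self.executors, self.dms = prod_db, executors, data_management
    def cycle(self):
        job = self.prod_db.pop()                      # take a pending job from ProdDB
        result = random.choice(self.executors).run(job)
        self.dms.append(result)                       # register outputs (cf. Don Quijote/RLS)
        return result

prod_db = [{"id": 1, "transformation": "simul"}, {"id": 2, "transformation": "recon"}]
sup = Supervisor(prod_db, [Executor("LCG"), Executor("NG"), Executor("Grid3")], [])
print(sup.cycle())
```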

32 RWL Jones, Lancaster University GANGA: Interfacing Athena/Gaudi to the GRID
[Figure: the GANGA/Grappa GUI sits between the Athena/GAUDI application (jobOptions/virtual data, algorithms) and the GRID services, which return histograms, monitoring and results.]
For LHCb an end-to-end solution; for ATLAS a front end; for BaBar a working option!
Major contribution from Alvin Tan (Job Options Editor, design).
Highly rated in the GridPP review. This is a substantial UK contribution.

33 RWL Jones, Lancaster University GANGA Design
The user has access to the functionality of GANGA components through a GUI and a CLI, layered one over the other above a Python software bus.
The components used by GANGA to define a job are Python classes. They fall into 3 categories: GANGA components of general applicability (to the right in the diagram), GANGA components providing specialised functionality (to the left), and external components (at the bottom).
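A minimal sketch of the kind of job definition such a design allows, with the application, backend and job all plain Python objects that the GUI and CLI can share; the class names and attributes here are illustrative only and do not reproduce the actual GANGA interfaces of the time.

```python
# Illustrative only: a Python-software-bus style job definition.

class Application:                      # specialised component, e.g. an Athena job
    def __init__(self, job_options):
        self.job_options = job_options

class Backend:                          # general component: where the job runs
    def __init__(self, name):
        self.name = name

class Job:
    def __init__(self, application, backend):
        self.application, self.backend = application, backend
    def submit(self):
        # the same Job object can be driven from the GUI or from the CLI,
        # because both sit on top of this common Python layer
        print(f"submitting {self.application.job_options} to {self.backend.name}")

j = Job(Application("MyAnalysis_jobOptions.py"), Backend("LCG"))   # hypothetical names
j.submit()
```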

34 RWL Jones, Lancaster University Analysis: Next Component
Next step: the Grid for distributed analysis, within the ATLAS-specific environment.
Run analysis jobs from a home computer: jobs are partitioned and sent to the centres where the data resides, and/or the relevant data is extracted from the remote centres and transferred to the local installation.
ARDA will eventually provide the lower middleware: a first prototype is in test, but too late for the computing model tests this year.
Personal view on the importance of ARDA: the central role of clients (deployment over development).

35 RWL Jones, Lancaster University ATLAS Distributed Analysis & GANGA
The ADA (ATLAS Distributed Analysis) project started in late 2003 to bring together in a coherent way all efforts already present in the ATLAS Collaboration to develop a DA infrastructure: GANGA (GridPP in the UK) for the front-end and splitting, and DIAL (PPDG in the USA) for the job model.
It is based on a client/server model with an abstract interface between services: a thin client on the user's computer, and an analysis service consisting itself of a collection of services on the server.
The vast majority of GANGA modules fit easily into this scheme (or are being integrated right now): GUI, CLI, JobOptions editor, job splitter, output merger, ...
Job submission will go through (a clone of) the production system, using the existing infrastructure to access resources on the 3 Grids and the legacy systems.
The forthcoming release of ADA (with GANGA 2.0) will have the first basic functionality to allow DC2 Phase III to proceed.

36 RWL Jones, Lancaster University Analysis Framework
ATLAS Data Analysis = GANGA + DIAL + AtCom + CMT/Pacman.
[Figure: 1. locate and 4. select a Dataset; 2. select an Application (e.g. athena); 3. create or select a Task (exe, pkgs, scripts, code); 5. submit(app, task, ds) to the Analysis Service; 6. split (Dataset 1, Dataset 2); 7. create Jobs 1 and 2; 9. each job creates a Result (e.g. ROOT); 10. gather the Results.]
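The numbered flow in the figure can be condensed into a few lines of Python; everything here is a placeholder for the AJDL-style services it stands for, not their real interfaces.

```python
# Condensed sketch of the locate / submit / split / gather flow in the figure above.

def analysis_service(application, task, dataset, n_jobs=2):
    # 6. split the dataset and create one job per sub-dataset
    chunks = [dataset[i::n_jobs] for i in range(n_jobs)]
    # 7./9. each job runs the application+task on its chunk and creates a result
    results = [{"events": len(chunk), "app": application} for chunk in chunks]
    # 10. gather the partial results into one
    return {"events": sum(r["events"] for r in results), "app": application}

dataset = list(range(1000))                              # 1./4. locate and select a dataset
print(analysis_service("athena", "my-task", dataset))    # 5. submit(app, task, dataset)
```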

37 RWL Jones, Lancaster University Analysis System
A first prototype exists; the next step is to integrate it with the ARDA back-end.
Much work is needed on metadata for analysis (the LCG and GridPP metadata projects).
N.B. GANGA allows non-production MC job submission and data reconstruction end-to-end in LCG.
[Figure: client tools (GANGA GUI, GANGA and ROOT command-line clients, graphical job builder) sit on high-level service interfaces (AJDL) to the Analysis Service and high-level services (GANGA task and job management, dataset splitter, dataset merger, catalogue services), which in turn use the middleware service interfaces (CE, WMS, file catalogue, etc.).]

38 RWL Jones, Lancaster University Installation Tools
To use the Grid, deployable software must be deployed on the Grid fabrics, and the deployable run-time environment established.
Installable code and run-time environment/configuration: no explicit absolute paths (now OK), no licensed software (now OK).
Deployable package (e.g. a set of RPMs): both ATLAS and LHCb use CMT for software management and environment configuration. CMT knows the package interdependencies and external dependencies, so it is the obvious tool to prepare the deployable code (rpms, tar) and to expose the dependencies to the deployment tool.
Grid-aware tool to deploy the above: PACMAN is a candidate which seems fairly easy to interface with CMT (see the following talk).
This is a substantial UK contribution.
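As an illustration of "exposing the dependencies to the deployment tool", here is a sketch of turning a package-interdependency map (the kind of information CMT holds) into an ordered install list that a tool like Pacman could consume; the package names and the representation are invented and are not CMT or Pacman syntax.

```python
# Illustrative only: resolve package interdependencies into an install order.

deps = {                       # package -> packages it uses (made-up names)
    "AtlasReconstruction": ["AtlasCore", "AtlasConditions"],
    "AtlasConditions": ["AtlasCore"],
    "AtlasCore": [],
}

def install_order(pkg, deps, seen=None, order=None):
    """Depth-first walk: dependencies first, then the package itself."""
    seen = seen if seen is not None else set()
    order = order if order is not None else []
    if pkg in seen:
        return order
    seen.add(pkg)
    for d in deps[pkg]:
        install_order(d, deps, seen, order)
    order.append(pkg)
    return order

print(install_order("AtlasReconstruction", deps))
# ['AtlasCore', 'AtlasConditions', 'AtlasReconstruction'] -> one RPM/tarball each
```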

39 RWL Jones, Lancaster University POOL/SEAL release (done) ATLAS release 7 (with POOL persistency) (done) LCG-1 deployment (in progress...) ATLAS complete Geant4 validation (done) ATLAS release 8 (done) DC2 Phase 1: simulation production DC2 Phase 2: intensive reconstruction (the real challenge!) Combined test beams (barrel wedge) Computing Model paper Computing Memorandum of Understanding ATLAS Computing TDR and LCG TDR DC3: produce data for PRR and test LCG-n Physics Readiness Report Start commissioning run GO! NOW LCG and GEANT 4 Integration Testing the Computing Model DC2 Testing the Physics Readiness DC3 Data-ready versions Confront with data Packages shake-down in DC3 (or earlier) ready for physics in 2007 ATLAS Computing Timeline

40 RWL Jones, Lancaster University Data Challenges
Test bench: the Data Challenges.
ATLAS DC1, Jul 2002 - May 2003: showed the many resources available (hardware, willing people); made clear the need for an integrated system; very manpower intensive; some tests of Grid software; mainly driven by HLT and Physics Workshop needs. One external driver is sustainable, two is not!

41 RWL Jones, Lancaster University DC2: May – Sept 2004
The goals include:
use the GRID middleware and tools widely;
large-scale physics analysis;
computing model studies (document end 2004);
a slice test of the computing activities in 2007;
run the production as much as possible on LCG-2.
Simultaneous with the test beam: simulation of full ATLAS and the 2004 combined test beam, and testing the calibration and alignment procedures using the same tools.

42 RWL Jones, Lancaster University
Preparation phase, a worldwide exercise (May-June 04): event generation, simulation, pile-up and digitization; all byte-stream data sent to CERN.
Reconstruction at Tier-0: ~400 processors (short term, sets the scale); several streams, including express lines and calibration and alignment lines; different output streams; ESD and AOD replicated to the Tier-1 sites.
Out of Tier-0: re-calibration (new calibrations and alignment parameters), re-processing, and analysis.

43 RWL Jones, Lancaster University Monitoring & Accounting
We need to monitor the operation to validate the model. The production database gives a historical, integrated view: relevant data on the running of DC-2 and event production are published on the web in real time; SQL queries are submitted to the ProdDB hosted at CERN and the result is HTML-formatted and web-published. A first basic tool is already available as a prototype.
We also need snapshots to find bottlenecks, which requires Grid monitoring tools.
MonALISA is deployed for Grid3 and NG monitoring.
On LCG there is an effort to verify the status of the Grid, with two main tasks (site monitoring and job monitoring), based on R-GMA and GridICE.
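The "SQL query, HTML-formatted and web-published" loop can be sketched like this; sqlite3 is used purely as a stand-in for the production database at CERN, and the table and column names are invented.

```python
# Sketch of the monitoring idea above: query a production database, publish as HTML.

import sqlite3

conn = sqlite3.connect(":memory:")                    # stand-in for the ProdDB at CERN
conn.execute("CREATE TABLE jobs (site TEXT, status TEXT)")
conn.executemany("INSERT INTO jobs VALUES (?, ?)",
                 [("RAL", "done"), ("RAL", "failed"), ("Lancaster", "done")])

rows = conn.execute(
    "SELECT site, status, COUNT(*) FROM jobs GROUP BY site, status").fetchall()

# format the query result as a minimal HTML table for web publication
html = "<table>" + "".join(
    f"<tr><td>{site}</td><td>{status}</td><td>{n}</td></tr>" for site, status, n in rows
) + "</table>"
print(html)
```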

44 RWL Jones, Lancaster University DC3
From the end of September, pre-production begins for DC3, which will be more than an order of magnitude bigger than DC2.
The Physics TDR will be a major driver, and we will have many real users.
This is the last chance to validate the software and computing before the real data.

45 RWL Jones, Lancaster University Conclusions
The Grid is the only practical way to function as a world-wide collaboration.
DC1 showed we have many resources, especially people.
Grid projects are starting to deliver, though slower than desirable, with tensions over manpower and problems of coherence.
Real tests of the computing model are due this year: serious and prompt input is needed from the community, and the revised costs are encouraging.
Real sharing of resources is required: the rich must shoulder a large part of the burden, but poorer members must also contribute, and this technology allows them to do so more effectively.

46 RWL Jones, Lancaster University Data Management Architecture
AMI (ATLAS Metadata Interface): query a logical file name (LFN) and its associated attributes and values.
Don Quijote (replaces MAGDA): manages replication and physical location.
VDC (Virtual Data Catalog): derives and transforms LFNs.

