

Presentation on theme: "Data Centric Issues: Particle Physics and Grid Data Management, Tony Doyle, University of Glasgow"— Presentation transcript:

1 Data Centric Issues: Particle Physics and Grid Data Management. Tony Doyle, University of Glasgow

2 Outline: Data to Metadata to Data
Introduction
Yesterday.. all my troubles seemed so far away: (non-Grid) Database Access; Data Hierarchy
Today.. is the greatest day I've ever known: Grids and Metadata Management; File Replication; Replica Optimisation
Tomorrow.. never knows: Event Replication; Query Optimisation

3 Grid Services: Context
Applications: Chemistry, Biology, Cosmology, High Energy Physics, Environment
Application Toolkits: distributed computing toolkit; data-intensive applications toolkit; collaborative applications toolkit; remote visualisation applications toolkit; problem solving applications toolkit; remote instrumentation applications toolkit
Grid Services (Middleware): resource-independent and application-independent services, e.g. authentication, authorisation, resource location, resource allocation, events, accounting, remote data access, information, policy, fault detection
Grid Fabric (Resources): resource-specific implementations of basic services, e.g. transport protocols, name servers, differentiated services, CPU schedulers, public key infrastructure, site accounting, directory service, OS bypass

4 Online Data Rate vs Size
[Plot: Level-1 trigger rate (Hz) vs event size (bytes) for LEP, UA1, H1, ZEUS, NA49, KLOE, HERA-B, CDF, CDF II, ALICE, ATLAS, CMS, LHCb]
High Level-1 trigger (1 MHz); high no. of channels; high bandwidth (500 Gbit/s); high data archive (PetaByte)
How can this data reach the end user? It doesn't: a factor O(1000) online data reduction via trigger selection.

5 Offline Data Hierarchy: RAW, ESD, AOD, TAG
RAW: recorded by DAQ; triggered events; detector digitisation; ~1 MB/event
ESD: reconstructed information; pseudo-physical information: clusters, track candidates (electrons, muons), etc.; ~100 kB/event
AOD: selected information; physical information: transverse momentum, association of particles, jets, (best) id of particles, physical info for relevant objects; ~10 kB/event
TAG: analysis information; relevant information for fast event selection; ~1 kB/event
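The per-event sizes in this hierarchy fix the annual storage budget per tier. A minimal sketch, assuming an illustrative 10^9 triggered events per year (the event count is an assumption for illustration, not a figure from the slide):

```python
# Hedged sketch: annual storage per data tier from the per-event sizes on
# the slide above. The event count (10**9 events/yr) is an assumption.
SIZES_PER_EVENT = {  # bytes/event, from the slide
    "RAW": 1_000_000,   # ~1 MB
    "ESD": 100_000,     # ~100 kB
    "AOD": 10_000,      # ~10 kB
    "TAG": 1_000,       # ~1 kB
}

def annual_volume(events_per_year: int) -> dict:
    """Return tier -> terabytes for one year of data-taking."""
    return {tier: size * events_per_year / 1e12
            for tier, size in SIZES_PER_EVENT.items()}

volumes = annual_volume(10**9)
# RAW dominates; TAG is small enough to hold in a local store.
```

With these assumed numbers RAW comes out at ~1 PB/yr while TAG is ~1 TB/yr, which is why TAG data can be kept locally for interactive selection while RAW stays at the higher tiers.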

6 Physics Analysis
Flow: Raw Data -> Event Summary Data (ESD: data or Monte Carlo) -> Event Tags -> Event Selection -> Analysis Object Data (AOD) + Calibration Data -> Analysis, Skims -> Physics Objects -> Physics Analysis
Tier 0, 1: collaboration-wide; Tier 2: analysis groups; Tier 3, 4: physicists
[Diagram label: INCREASING DATA FLOW]

7 Data Structure
REAL and SIMULATED data required; central and distributed production.
Real data: Trigger System -> Data Acquisition (Level 3 trigger, Trigger Tags) -> Raw Data -> Reconstruction (with Calibration Data, Run Conditions) -> Event Summary Data (ESD) -> Event Tags
Simulated data: Physics Models -> Monte Carlo Truth Data -> Detector Simulation -> MC Raw Data -> Reconstruction -> MC Event Summary Data -> MC Event Tags

8 A running (non-Grid) experiment
Three steps to select an event today:
1. Remote access to O(100) TBytes of ESD data
2. Via remote access to 100 GBytes of TAG data
3. Using offline selection, e.g. ZeusIO variable (Ee>20.0) and (Ntrks>4)
Access to the remote store via batch job; ~1% database event-finding overhead; O(1M) lines of reconstruction code; no middleware; 20k lines of C++ glue from the Objectivity (TAG) to the ADAMO (ESD) database.
A million events selected from 5 years' data, via TAG selection on 250 variables/event.

9 A future (Grid) experiment
Three steps to (analysis) heaven:
1. 10 (1) PByte of RAW (ESD) data/yr
2. 1 TByte of TAG data (local access)/yr
3. Offline selection, e.g. ATLASIO variable (Mee>100.0) and (Njets>4)
Interactive access to the local TAG store; automated batch jobs to distributed Tier-0, -1, -2 centres; O(1M) lines of reconstruction code; O(1M) lines of middleware… NEW… O(20k) lines of Java/C++ glue from the TAG to the ESD database.
All working? Efficiently?
A million events from 1 year's data-taking, via TAG selection on 250 variables.
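The offline selections quoted above, such as (Mee>100.0) and (Njets>4), are cuts evaluated against the small TAG record of each event; only passing events trigger retrieval of the much larger ESD records. A minimal sketch (the variable names and event records are hypothetical; a real TAG store holds ~250 variables per event):

```python
# Minimal sketch of TAG-level event selection, in the spirit of the
# "(Mee>100.0) and (Njets>4)" cut on the slide above.
def select(events, predicate):
    """Yield IDs of events whose TAG record passes the cut."""
    for event in events:
        if predicate(event):
            yield event["id"]

# Hypothetical TAG records: a few variables out of ~250 per event.
tags = [
    {"id": 1, "Mee": 120.5, "Njets": 6},
    {"id": 2, "Mee": 91.2,  "Njets": 2},
    {"id": 3, "Mee": 250.0, "Njets": 5},
]
selected = list(select(tags, lambda e: e["Mee"] > 100.0 and e["Njets"] > 4))
# Only the selected IDs are then used to fetch the full ESD records.
```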

10 Grid Data Management: Requirements
1. Robust: software development infrastructure
2. Secure: via Grid certificates
3. Scalable: non-centralised
4. Efficient: optimised replication
Examples: GDMP, Spitfire, Reptor, Optor

11 1. Robust? Development Infrastructure
CVS Repository: management of DataGrid source code; all code available (some mirrored)
Bugzilla
Package Repository: public access to packaged DataGrid code
Development of Management Tools: statistics concerning DataGrid code; auto-building of DataGrid RPMs; publishing of generated API documentation
Latest build = Release 1.2 (August 2002)
Lines of code in 10 languages (Release 1.0)

12 1. Robust? Software Evaluation
Status legend: ETT = Extensively Tested in Testbed; UT = Unit Testing; IT = Integrated Testing; NI = Not Installed; NFF = Some Non-Functioning Features; MB = Some Minor Bugs; SD = Successfully Deployed
[Table: each component is marked against the status categories above; the per-column marks did not survive extraction]
Components: Resource Broker, Job Desc. Lang., Info. Index, User Interface, Log. & Book. Svc., Job Sub. Svc., Broker Info. API, SpitFire, GDMP, Rep. Cat. API, Globus Rep. Cat.
Components: Schema, FTree, R-GMA, Archiver Module, GRM/PROVE, LCFG, CCM, Image Install., PBS Info. Prov., LSF Info. Prov.
Components: SE Info. Prov., File Elem. Script, Info. Prov. Config., RFIO, MSS Staging, Mkgridmap & daemon, CRL update & daemon, Security RPMs, EDG Globus Config.
Components: PingER, UDPMon, IPerf, Globus2 Toolkit

13 1. Robust? Middleware Testbed(s)
Validation/Maintenance => Testbed(s)
EU-wide development

14 1. Robust? Code Development Issues
Reverse engineering (C++ code analysis and restructuring; coding standards) => abstraction of existing code to UML architecture diagrams
Language choice (currently 10 used in DataGrid): Java = C minus features (global variables, pointer manipulation, goto statements, etc.); constraints (performance, libraries, legacy code)
Testing (automation, object-oriented testing)
Industrial strength? OGSA-compliant? O(20 year) future proof??

15 Data Management on the Grid
Data in particle physics is centred on events stored in a database…
Groups of events are collected in (typically GByte) files…
In order to utilise additional resources and minimise data analysis time, Grid replication mechanisms are currently being used at the file level.
Access to a database via Grid certificates (Spitfire/OGSA-DAI)
Replication of files on the Grid (GDMP/Giggle)
Replication and Optimisation Simulation (Reptor/Optor)
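File-level replication rests on one idea: a logical file name (LFN) maps to several physical replicas (PFNs) at different sites. A minimal sketch of that mapping, with illustrative names and sites (this is not the GDMP or Giggle API):

```python
# Sketch of the file-level replication model described above: one logical
# file name (LFN) resolves to physical replicas (PFNs) at several sites.
class ReplicaCatalogue:
    def __init__(self):
        self._replicas = {}  # lfn -> list of pfns

    def register(self, lfn, pfn):
        """Record one more physical copy of a logical file."""
        self._replicas.setdefault(lfn, []).append(pfn)

    def lookup(self, lfn):
        """Return all known physical replicas of a logical file."""
        return list(self._replicas.get(lfn, []))

rc = ReplicaCatalogue()
rc.register("lfn:run1234.events", "pfn://cern.ch/data/run1234.events")
rc.register("lfn:run1234.events", "pfn://gla.ac.uk/data/run1234.events")
# A job can now be scheduled near whichever replica is cheapest to read.
```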

16 2. Spitfire: Secure? At the level required in Particle Physics
Request flow: the client sends an HTTP + SSL request with its certificate to the servlet container (SSLServletSocketFactory, TrustManager). The security servlet checks: is the certificate signed by a trusted CA? Has the certificate been revoked (against the revoked-certificates repository)? If the user specifies a role, it is checked against the role repository; otherwise a default is found. The authorization module maps the role to a connection ID (connection mappings), and the translator servlet requests a connection from the connection pool to the RDBMS.
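The authorisation chain above can be sketched as a sequence of checks ending in a role-to-connection mapping. Everything here is illustrative (the CA names, certificate IDs and mappings are hypothetical, and Spitfire itself implements this in a Java servlet container, not Python):

```python
# Hedged sketch of the slide's authorisation flow: trusted-CA check,
# revocation check, role lookup with a default, role -> connection mapping.
TRUSTED_CAS = {"CA:UKeScience"}                       # hypothetical CA
REVOKED = {"cert:mallory"}                            # revoked certs repo
ROLE_CONNECTIONS = {"reader": "conn-ro", "admin": "conn-rw"}
DEFAULT_ROLE = "reader"

def authorise(cert_id, issuer, requested_role=None):
    """Return a database connection ID, or None if rejected."""
    if issuer not in TRUSTED_CAS:          # signed by a trusted CA?
        return None
    if cert_id in REVOKED:                 # has it been revoked?
        return None
    role = requested_role or DEFAULT_ROLE  # find default if none specified
    return ROLE_CONNECTIONS.get(role)      # role ok? map to connection id

# authorise("cert:alice", "CA:UKeScience") grants the read-only connection.
```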

17 2. Database client API
A database client API has been defined
Implement as a grid service using standard web service technologies
Ongoing development with OGSA-DAI
Talk: Project Spitfire - Towards Grid Web Service Databases

18 3. GDMP and the Replica Catalogue
GDMP 3.0 = file mirroring/replication tool. Originally for replicating CMS Objectivity files for High Level Trigger studies; now used widely in HEP.
Storage Elements register with a centralised, LDAP-based Globus 2.0 Replica Catalogue: the Replica Catalogue TODAY.

19 3. Giggle: Hierarchical P2P
RLI = Replica Location Index; LRC = Local Replica Catalog
Hierarchical indexing: the higher-level RLI contains pointers to lower-level RLIs or LRCs; LRCs map files on the Storage Elements.
Scalable? Trade-off: consistency versus efficiency.
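The two-level lookup can be sketched as follows: the RLI only knows which LRCs claim to hold a file, and the LRCs hold the actual physical names. The classes and site names below are illustrative, not the Giggle implementation:

```python
# Sketch of Giggle's two-level scheme: a Replica Location Index (RLI)
# points at Local Replica Catalogs (LRCs); only the LRC knows the PFNs.
class LRC:
    def __init__(self, site, mappings):
        self.site = site
        self.mappings = mappings  # lfn -> pfn at this site

class RLI:
    def __init__(self, lrcs):
        # Index: lfn -> list of LRCs that claim to hold a replica.
        self.index = {}
        for lrc in lrcs:
            for lfn in lrc.mappings:
                self.index.setdefault(lfn, []).append(lrc)

    def locate(self, lfn):
        """Two-step lookup: RLI finds candidate sites, LRCs give PFNs."""
        return [(lrc.site, lrc.mappings[lfn])
                for lrc in self.index.get(lfn, [])]

cern = LRC("cern", {"lfn:f1": "pfn://cern/f1"})
gla = LRC("glasgow", {"lfn:f1": "pfn://gla/f1", "lfn:f2": "pfn://gla/f2"})
rli = RLI([cern, gla])
# rli.locate("lfn:f1") finds replicas at both sites.
```

The consistency-versus-efficiency trade-off on the slide shows up here directly: the RLI index is built from LRC state and can go stale, so a real system must refresh it (e.g. by periodic soft-state updates) rather than keep it exactly consistent.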

20 4. Reptor/Optor: File Replication/Simulation
Reptor: replica architecture (Replica Location Index, Replica Metadata Catalogue; per site: Replica Manager, Local Replica Catalogue, Storage Element, Computing Element, Optimiser, pre-/post-processing; Core, Optimisation and Processing APIs; Resource Broker, User Interface)
Optor: tests file replication strategies, e.g. an economic model
Demo and Poster: Studying Dynamic Grid Optimisation Algorithms for File Replication
Efficient? Requires simulation studies…
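An "economic model" for replication treats each candidate replica as a purchase: replicate locally only when the file's predicted future value exceeds the price of obtaining it. A toy sketch of such a decision rule (the predictor and cost terms here are assumptions for illustration, not Optor's actual model):

```python
# Hedged sketch of an economic-model replication decision of the kind
# Optor simulates. The value predictor and cost terms are toy assumptions.
def should_replicate(recent_accesses, transfer_cost, eviction_value):
    """Buy (replicate) the file if expected future value beats the price."""
    predicted_value = recent_accesses       # naive predictor: past = future
    price = transfer_cost + eviction_value  # transfer plus value evicted
    return predicted_value > price

# A file read 8 times recently, cheap to fetch, displacing low-value data:
decision = should_replicate(recent_accesses=8, transfer_cost=2,
                            eviction_value=3)
```

The point of simulating such strategies is that the right threshold depends on access patterns and network costs, which is exactly what the slide means by "Efficient? Requires simulation studies".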

21 Application Requirements
The current EMBL production database is 150 GB, which takes over four hours to download at full bandwidth capability at the EBI. The EBI's data repositories receive 100,000 to 250,000 hits per day, with 20% from UK sites; of 563 unique UK domains, 27 sites have more than 50 hits per day.
MyGrid proposal suggests: less emphasis on efficient data access and data hierarchy aspects (application specific); large gains in biological applications from efficient file replication; larger gains from application-specific replication?
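The "over four hours" figure implies an effective download rate in the tens of megabits per second, which is a quick sanity check on the numbers above:

```python
# Quick check of the figure above: 150 GB in four hours implies an
# effective rate of roughly 83 Mbit/s (taking 1 GB = 10**9 bytes).
size_bits = 150e9 * 8      # 150 GB expressed in bits
seconds = 4 * 3600         # four hours
rate_mbit = size_bits / seconds / 1e6
# rate_mbit is about 83 Mbit/s
```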

22 Events.. to Files.. to Events
Interesting events are listed via TAG, but each tier (Tier-0 international, Tier-1 national, Tier-2 regional, Tier-3 local) stores RAW, ESD, AOD and TAG data grouped into data files.
Event 1, Event 2, Event 3… Not all pre-filtered events are interesting… non-pre-filtered events may be… => file replication overhead.
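The file replication overhead can be made concrete: when only a few events in a file are interesting, file-level replication moves the whole file, whereas event-level replication would move only the wanted events. A sketch with illustrative numbers (the file and event sizes are assumptions):

```python
# Sketch of the overhead the slide points at: bytes moved by file-level
# replication versus bytes actually wanted. Numbers are illustrative.
def replication_overhead(events_per_file, interesting, event_size):
    """Ratio of bytes moved per file to bytes actually wanted (>1 = waste)."""
    moved = events_per_file * event_size   # whole file is replicated
    wanted = interesting * event_size      # only these events were needed
    return moved / wanted

# 1 interesting event in a 1000-event file: a factor-1000 overhead.
factor = replication_overhead(events_per_file=1000, interesting=1,
                              event_size=100_000)
```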

23 Events.. to Events: Event Replication and Query Optimisation
RAW, ESD, AOD and TAG data at each tier (Tier-0 international, Tier-1 national, Tier-2 regional, Tier-3 local) held as a distributed (replicated) database, with the interesting events (Event 1, Event 2, Event 3) selected directly.
[Diagram labels: Knowledge; Stars in Stripes]

24 Data Grid for the Scientist
@#%&*! … Grid Middleware … E = mc² …in order to get back to the real (or simulated) data.
An incremental process… At what level is the metadata? file?… event?… sub-event?…

25 Summary
Yesterday's data access issues are still here; they just got bigger (by a factor 100). A data hierarchy is required to access more data more efficiently… but is insufficient.
Today's Grid tools are developing rapidly: they enable replicated file access across the grid; file replication is standard (lfn://, pfn://); standards for Grid data access are emerging..
Tomorrow.. never knows: replicated events on the Grid?.. distributed databases?.. or did that diagram look a little too monolithic?

