POOL: Component Overview and use of the File Catalog


1 POOL: Component Overview and use of the File Catalog
Maria Girone, PPARC-LCG, CERN
GridPP7 meeting
- What is POOL?
- POOL architecture
- POOL Work Package breakdown
- Interaction between the main POOL components
- POOL and the Grid
- The POOL File Catalog
- Some typical use cases for LCG-1

2 What is POOL? Pool Of persistent Objects for LHC
- The POOL (Pool Of persistent Objects for LHC) project develops a common Persistency Framework for physics applications at the LHC
- Started in April 2002 in the LHC Computing Grid (LCG) Application Area context
- Common effort between the LHC experiments and the CERN IT-DB group to define its scope and architecture and to develop its components
- Ramping up over the last year from 1.5 FTE to ~10 FTE
- First production use expected for summer 2003
GridPP7 meeting, Maria Girone, CERN

3 POOL project purpose To allow the multi-PB of experiment data and associated meta data to be stored in a distributed and Grid enabled fashion various types of data of different volumes (event data, physics and detector simulation, detector data and bookkeeping data) Hybrid technology approach, combining C++ object streaming technology, such as Root I/O, for the bulk data transactional safe Relational Database (RDBMS) services, such as MySQL, for catalogs, collections and meta data In particular, it provides Persistency for C++ transient objects Transparent navigation from one object to another even if not in the same file Integrated with a external File Catalog to keep track of the file physical location, allowing files to be moved or replicated GridPP7 meeting, Maria Girone, CERN

4 POOL architecture POOL is a component based system
- Follows the LCG Architecture Blueprint
- Provides a technology-neutral API
  - abstract component C++ interfaces
  - insulates the experiment framework and user code from the concrete implementation details and technologies used today
- POOL user code is not dependent on implementation libraries
  - no link-time dependency on implementation packages (e.g. MySQL, Root, Xerces-C)
  - backend component implementations are loaded at runtime via the SEAL plug-in infrastructure (see the sketch below)
- Three major domains, weakly coupled, interacting via abstract interfaces
GridPP7 meeting, Maria Girone, CERN
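A rough sketch of the runtime-selection pattern only, not the SEAL plug-in API: the registry, class names and contact string below are invented for illustration. User code holds just the abstract interface, and a concrete backend is chosen by name at run time.

// Hypothetical illustration of "abstract interface + backend chosen at run time".
// SEAL loads real plug-ins from shared libraries; here a simple factory registry
// stands in for that mechanism.
#include <functional>
#include <iostream>
#include <map>
#include <memory>
#include <string>

struct IFileCatalog {                              // technology-neutral interface
    virtual ~IFileCatalog() = default;
    virtual void registerPFN(const std::string& pfn) = 0;
};

using Factory = std::function<std::unique_ptr<IFileCatalog>()>;
std::map<std::string, Factory>& registry() {       // backend name -> factory
    static std::map<std::string, Factory> r;
    return r;
}

struct XMLCatalog : IFileCatalog {                 // one concrete backend
    void registerPFN(const std::string& pfn) override {
        std::cout << "XML catalog: add " << pfn << "\n";
    }
};
struct MySQLCatalog : IFileCatalog {               // another concrete backend
    void registerPFN(const std::string& pfn) override {
        std::cout << "MySQL catalog: add " << pfn << "\n";
    }
};

int main() {
    registry()["xmlcatalog"]   = [] { return std::make_unique<XMLCatalog>(); };
    registry()["mysqlcatalog"] = [] { return std::make_unique<MySQLCatalog>(); };

    std::string contact = "xmlcatalog";            // e.g. taken from a job option at run time
    std::unique_ptr<IFileCatalog> cat = registry().at(contact)();
    cat->registerPFN("rfio:/castor/cern.ch/data/f1.root");  // user code sees only the interface
}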

5 POOL Work Package breakdown
- Storage Manager
  - streams transient C++ objects into/from storage
  - resolves a logical object reference into a physical object
  - uses Root I/O; a proof-of-concept RDBMS storage manager prototype is underway
- File Catalog
  - maintains consistent lists of accessible files (physical and logical names) together with their unique identifiers (FileID), which appear in the object representation in the persistent space
  - resolves a logical file reference (FileID) into a physical file
- Collections
  - provides the tools to manage potentially large ensembles of objects stored via POOL persistence services (see the sketch below)
  - explicit: server-side selection of objects from queryable collections
  - implicit: defined by physical containment of the objects
GridPP7 meeting, Maria Girone, CERN
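To make the explicit/implicit distinction concrete, here is a toy sketch with invented types, not the POOL Collections API: an explicit collection is a separately stored, queryable list of object references plus attributes, while an implicit collection is simply "every object physically contained in a given container".

// Toy contrast between explicit and implicit collections (invented types).
#include <iostream>
#include <string>
#include <vector>

struct Token { std::string fileId; int objectKey; };   // persistent object address

// Explicit collection: its own list of (token, attributes); it can be queried
// without touching the data files ("server-side" selection in a real backend).
struct ExplicitCollection {
    struct Row { Token ref; double pt; };
    std::vector<Row> rows;
    std::vector<Token> select(double minPt) const {
        std::vector<Token> out;
        for (const auto& r : rows)
            if (r.pt > minPt) out.push_back(r.ref);
        return out;
    }
};

// Implicit collection: nothing stored beyond the data itself; it is just
// "all objects physically contained in one container/file".
std::vector<Token> implicitCollection(const std::string& fileId, int nObjects) {
    std::vector<Token> out;
    for (int i = 0; i < nObjects; ++i) out.push_back({fileId, i});
    return out;
}

int main() {
    ExplicitCollection coll;
    coll.rows = {{{"FILE-A", 0}, 12.5}, {{"FILE-A", 1}, 45.0}, {{"FILE-B", 7}, 30.1}};
    std::cout << coll.select(25.0).size() << " objects pass the query\n";           // 2
    std::cout << implicitCollection("FILE-A", 100).size() << " objects in the container\n";
}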

6 Interaction between POOL components
GridPP7 meeting, Maria Girone, CERN

7 POOL and the Grid POOL will be mainly used experiment frameworks, as client library loaded by user applications POOL applications are Grid aware via the File Catalog component based on the EDG Replica Location Service (RLS) File resolution and meta data queries are forwarded to Grid middleware requests The POOL storage manager ensures the remote file access via Root I/O (such as RFIO/dCache), possibly later replaced by the Grid File Access Library (GFAL), once it will be available POOL client on a CPU Node User Application Experiment Framework POOL Grid (File) Services Replica Location File Description Remote File I/O? remote access via ROOT I/O GridPP7 meeting, Maria Girone, CERN

8 POOL File Catalog Files are referred to inside POOL via a unique and immutable file identifier, (FileID) generated at creation time POOL added the system generated FileID to the standard Grid m-n mapping Stable inter-file reference Global Unique Identifier (GUID) implementation for FileID allows the production of a consistent sets of files with internal references without requiring a central ID allocation service catalog fragments created independently can later be merged without modification to corresponding data file FileID-LFN mapping supported but not used internally FileID-PFN mapping is sufficient for object lookup Logical Naming Object Lookup LFN2 LFNn PFN2, technology PFNn, technology File metadata (jobid, owner, …) GridPP7 meeting, Maria Girone, CERN

9 Concrete implementations
- XML Catalog
  - typically used as a local file by a single user/process at a time; no need for a network
  - supports R/O operations via http
  - tested up to 50K entries
- Native MySQL Catalog
  - handles multiple users and jobs (multi-threaded)
  - tested up to 1M entries
- EDG-RLS Catalog
  - for Grid-aware applications
  - Oracle iAS or Tomcat + Oracle / MySQL backend
  - pre-production service based on Oracle (from IT/DB), RLSTEST, already in use for POOL V1.0
GridPP7 meeting, Maria Girone, CERN

10 File Catalog functionality
- Connection and transaction control functions
- Catalog insertion and update functions on logical and physical file names
- Catalog lookup functions (by file name, FileID or query)
- Clean-up after an unsuccessful job
- Catalog entries iterator
- File meta data operations (e.g. define or insert file meta data)
- Cross-catalog operations (e.g. extract an XML fragment and append it to the MySQL catalog)
- Python-based graphical user interface for catalog browsing
(a sketch of how this functionality might group into an interface follows below)
GridPP7 meeting, Maria Girone, CERN
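Purely as an illustration of how the operations listed above could group into a C++ interface; the names and signatures are invented and are not the actual POOL File Catalog API.

// Hypothetical grouping of the listed functionality into one abstract interface.
#include <string>
#include <vector>

struct FileEntry { std::string fileId, pfn, lfn; };

struct IToyFileCatalog {
    virtual ~IToyFileCatalog() = default;

    // connection and transaction control
    virtual void connect(const std::string& contactString) = 0;
    virtual void start() = 0;
    virtual void commit() = 0;
    virtual void rollback() = 0;                    // clean-up after an unsuccessful job

    // insertion / update on logical and physical file names
    virtual std::string registerPFN(const std::string& pfn) = 0;   // returns the new FileID
    virtual void addLFN(const std::string& fileId, const std::string& lfn) = 0;

    // lookup (by name, FileID or query) and iteration
    virtual FileEntry lookupByPFN(const std::string& pfn) = 0;
    virtual std::vector<FileEntry> query(const std::string& whereClause) = 0;

    // file meta data operations
    virtual void setMetaData(const std::string& fileId,
                             const std::string& key, const std::string& value) = 0;

    // cross-catalog operations, e.g. extract a fragment and append it elsewhere
    virtual std::vector<FileEntry> extract(const std::string& whereClause) = 0;
    virtual void append(const std::vector<FileEntry>& fragment) = 0;
};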

11 File Catalog Browser Prototype
GridPP7 meeting, Maria Girone, CERN

12 Use case: isolated system
(diagram: import from the EDG catalog into a local XML catalog with no network connection; jobs look up input files and register output files locally; the results are later published back)
- The user extracts a set of interesting files, and a catalog fragment describing them, from a (central) Grid-based catalog into a local XML catalog
  - selection is performed based on file or collection descriptions
- After disconnecting from the Grid, the user executes some standard jobs navigating through the extracted data
  - new output files are registered into the local XML catalog
- Once the new data is ready for publishing and the user is connected again, the new catalog fragment is submitted to the Grid-based catalog (see the sketch below)
GridPP7 meeting, Maria Girone, CERN
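A compact sketch of this round trip under heavy simplification: each catalog is reduced to a FileID→PFN map and all names are invented; the real flow goes through the catalog implementations shown earlier.

// Toy end-to-end flow of the "isolated system" use case: extract a fragment,
// work disconnected, register outputs locally, publish the fragment back.
#include <iostream>
#include <map>
#include <string>

using Catalog = std::map<std::string, std::string>;   // FileID -> PFN

// 1. While connected: extract the entries of interest from the Grid catalog.
Catalog extractFragment(const Catalog& gridCatalog, const std::string& pfnPrefix) {
    Catalog local;
    for (const auto& [id, pfn] : gridCatalog)
        if (pfn.rfind(pfnPrefix, 0) == 0) local[id] = pfn;
    return local;
}

int main() {
    Catalog grid = {
        {"GUID-0001", "rfio:/castor/cern.ch/prod/higgs_001.root"},
        {"GUID-0002", "rfio:/castor/cern.ch/prod/minbias_001.root"},
    };

    // Import: build the local XML-style catalog, then disconnect from the Grid.
    Catalog localXml = extractFragment(grid, "rfio:/castor/cern.ch/prod/higgs");

    // 2. Offline: jobs navigate the extracted data and register new output files.
    localXml["GUID-9abc"] = "file:/home/user/analysis/ntuple_001.root";

    // 3. Reconnect and publish: merge the new entries back into the Grid catalog.
    for (const auto& [id, pfn] : localXml)
        grid.emplace(id, pfn);                        // existing entries are untouched

    std::cout << "grid catalog now has " << grid.size() << " entries\n";  // 3
}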

13 Use case: farm production
(diagram: on each farm node, e.g. lx1.cern.ch and pc3.cern.ch, jobs look up and register files in a local XML catalog; after a quality check the fragments are published to the site MySQL catalog and finally to the EDG catalog)
- A production job runs and creates files and their catalog entries in a local XML file
- During the production the catalog can be used to clean up files
- Once the data quality checks have been passed, the production manager decides to publish the production XML catalog fragment to the site database and eventually to the Grid-based catalog
GridPP7 meeting, Maria Girone, CERN

14 Summary and perspectives
- The LCG POOL project provides a hybrid store combining object streaming (e.g. Root I/O) for bulk data with RDBMS technology (e.g. MySQL) for the File Catalog, Collections and meta data
- Strong emphasis on component decoupling
- Integration with Grid technology (via EDG-RLS), while preserving networked and Grid-decoupled working modes
- POOL releases and planned functionality have been on schedule so far; release V1.1 has a complete LCG-1 feature set
- POOL started to be integrated into the ATLAS and CMS software frameworks in early June 2003 (from V1.0), with positive feedback from the experiments
- Experiment-specific production services for the EDG catalog are provided in conjunction with POOL V1.1
GridPP7 meeting, Maria Girone, CERN

15 File Catalog performance tests
Preliminary tests done on POOL V0.5
- XML: tested up to 50K entries
  - start time: ~10 ms for a new catalog, ~6 s for a catalog with 20K entries
  - registerPFN: <0.3 ms/entry
- MySQL: tested up to 1M entries
  - up to 300 concurrent clients, commit every 100 entries or less frequently
  - registerPFN: <1.5 ms/entry
- EDG-RLS based catalog
  - registerPFN: ~30 ms/entry (autocommit)
Test setup: Pentium III 1.2 GHz, 220 MB free memory; PFN of 200 characters, FileID of 36 characters
GridPP7 meeting, Maria Girone, CERN

16 File Catalog Performance Requirements
A very preliminary model of the frequency of access to the File Catalog, for
- the initial LCG-1 service
- analysis activities at the LHC start-up
Strongly based on the experiments' inputs (mainly CMS), and subject to modifications
It is based on the following assumptions:
- LCG-1 CPU power: 1 GHz Pentium III (400 SPECInt2000)
- LHC start-up (2008): total CPU capacity of 20M SPECInt2000 (2M SPECInt2000 at each of 5 T1 centres, with an equal amount shared over all T2s)
GridPP7 meeting, Maria Girone, CERN

17 LCG-1 access figures
Take CMS PCP as an example: total number of events to produce (kine, simul, digi and half reco): 50M [1]
For July and November respectively:
- fraction of expected LCG-1 production [2] (and thus requiring central cataloguing of the files created): …% / …%
Based on the length of the different job types, it is expected that:
- file lookup frequency: … Hz / … Hz
- file registration frequency: … Hz / … Hz
- total interaction rate: … Hz / … Hz (1 interaction every … s / … s)
Current tests show performance well in excess of these requirements!
GridPP7 meeting, Maria Girone, CERN

18 Estimated performance requirements in 2008
It is assumed that the file catalog accesses will be dominated by analysis jobs. From different perspectives and inputs, the expected rates are:
- Analysis of 100 kB events is assumed to require 100 SPECInt2000·s, with analysis jobs accessing at most 10% of the data stored in a 2 GB file (i.e. 0.2 GB per file opened)
- Maximum number of file openings in CMS worldwide: 20 GB/s / 0.2 GB gives 100 different files accessed per second
- From another perspective, the aggregate bandwidth of jobs running at the CERN Tier0/Tier1 facility is expected to be 50 GB/s, which means 250 different files accessed per second
- Summing up the requests from the 4 experiments to have 23M SPECInt2000 at CERN, one gets about 120 different files accessed per second (see the cross-check below)
GridPP7 meeting, Maria Girone, CERN
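A rough cross-check of these figures, using only the assumptions stated above (20M SPECInt2000 worldwide in 2008, 100 SPECInt2000·s per 100 kB event, at most 0.2 GB read per file opened):

  20M SPECInt2000 / 100 SPECInt2000·s per event  ≈ 200 000 events/s
  200 000 events/s × 100 kB/event                ≈ 20 GB/s of event data
  20 GB/s / 0.2 GB per file open                 ≈ 100 file opens/s worldwide
  50 GB/s at CERN Tier0/Tier1 / 0.2 GB           ≈ 250 file opens/s
  23M SPECInt2000 at CERN → ~23 GB/s / 0.2 GB    ≈ 115, i.e. ~120 file opens/s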

19 Requirements from the experiments
POOL will be integrated into the experiments' data challenges:
- ALICE: not currently planned
- ATLAS: second quarter of 2004 for the next larger production activity, DC2 (integration into ATHENA already started)
- CMS: POOL proposed as baseline for PCP (starting this summer) and later for DC04
- LHCb: spring 2004
Expected number of entries in the file catalog:
- ATLAS/DC2 – O(100) minimum bias input files/job, O(10^6) total output files
- CMS/PCP – O(10k) input files, O(10^6) total output files
- LHCb/spring ’04 – O(100) input files/job, O(10^5) total output files
GridPP7 meeting, Maria Girone, CERN

