The POOL Persistency Framework


1 The POOL Persistency Framework
Dirk Düllmann, CERN, Geneva
DESY Computing Seminar, Hamburg, 12 January 2004

2 Outline
- Project goals and organisation
- POOL components and their high-level interaction
- Main technical problems and solutions chosen
- First production experience and user feedback

3 What is POOL?
- POOL: Pool Of persistent Objects for LHC
- Project goal: develop a common persistency framework for physics applications at the LHC
  - a second project, ConditionsDB, is just starting
- Part of the LHC Computing Grid (LCG)
  - LCG is the computing infrastructure for the LHC (areas: Physics Applications, Grid Technology, Grid Deployment, Fabric Management)
  - POOL is one of the first Application Area projects
- A common effort between the LHC experiments and the CERN IT-DB group, both for defining its scope and architecture and for developing its components

4 POOL Timeline and Statistics
- POOL project started April 2002, ramping up from 1.6 to ~10 FTE
- Persistency Workshop in June 2002
- First internal release, POOL V0.1, in October 2002
- In about one year of active development since then:
  - 12 public releases: POOL V1.4.0 last year, V1.5 expected this month
  - some 60 internal releases, often picked up by the experiments to confirm fixes/new functionality; very useful to ensure that releases meet experiment expectations beforehand
  - some 165 bug reports handled; the Savannah web portal has proven helpful
- POOL followed from the beginning a rather aggressive schedule to meet the first production needs of the experiments

5 POOL Objectives
- To allow the multiple petabytes of experiment data and associated meta data to be stored in a distributed and grid-enabled fashion
  - various types of data of different volumes (event data, physics and detector simulation, detector data and bookkeeping data)
- Hybrid technology approach, combining
  - C++ object streaming technology, Root I/O, for the bulk data
  - transactionally safe Relational Database (RDBMS) services, MySQL, for catalogs, collections and meta data
- In particular POOL provides
  - persistency for C++ transient objects
  - transparent navigation among objects across file and technology boundaries
  - integration with an external File Catalog to keep track of the physical location of files, allowing files to be moved or replicated

6 Component Architecture
- POOL is a component-based system
  - closely follows the LCG Architecture Blueprint
- Provides a technology-neutral API
  - abstract component C++ interfaces
  - insulates the experiment framework user code from implementation details of the technologies used today
- POOL user code is not dependent on implementation libraries
  - no link-time dependency on implementation packages (e.g. MySQL, Root, Xerces-C, ...)
  - backend component implementations are loaded at runtime via the SEAL plug-in infrastructure (see the sketch below)
- Three major domains, weakly coupled, interacting via abstract interfaces
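As an illustration of the runtime-loading idea, here is a generic C++ sketch using dlopen. This is a hypothetical stand-in, not the SEAL plug-in manager's actual API: user code sees only an abstract interface, and the concrete technology library is chosen at runtime.

// Generic plug-in loading sketch (hypothetical, not the SEAL API):
// the client depends only on the abstract interface below, while the
// implementation library (e.g. a MySQL or XML backend) is picked at runtime.
#include <dlfcn.h>
#include <memory>

struct IFileCatalog {              // abstract, technology-neutral interface
    virtual ~IFileCatalog() = default;
};

std::unique_ptr<IFileCatalog> loadCatalog(const char* pluginLib) {
    void* lib = dlopen(pluginLib, RTLD_NOW);   // no link-time dependency
    if (!lib) return nullptr;
    using Factory = IFileCatalog* (*)();       // assumed factory entry point
    auto create = reinterpret_cast<Factory>(dlsym(lib, "createCatalog"));
    return std::unique_ptr<IFileCatalog>(create ? create() : nullptr);
}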

7 POOL Component Breakdown
(component breakdown diagram)

8 Work Package Breakdown
- Storage Manager
  - streams transient C++ objects into/from disk storage
  - resolves a logical object reference into a physical object
  - uses Root I/O; a proof-of-concept RDBMS storage manager prototype is underway
- File Catalog
  - maintains consistent lists of accessible files (physical and logical names) together with their unique identifiers (FileID), which appear in the object representation in the persistent space
  - resolves a logical file reference (FileID) into a physical file
- Collections
  - provides the tools to manage potentially large ensembles of objects stored via POOL persistence services
  - explicit: server-side selection of objects from queryable collections
  - implicit: defined by physical containment of the objects

9 The LHC Computing Centre
(diagram: the tiered LHC computing model - CERN at the centre as Tier 0/1; national Tier 1 centres in Germany, the USA, the UK, France, Italy, ...; Tier 2 regional groups of labs and universities; down to Tier 3 physics departments and desktops)

10 POOL on the Grid
(diagram: a user application on top of the experiment framework uses LCG POOL; POOL's file catalog, collections and meta data components talk to grid middleware - replica location service, service catalog, grid dataset registry, replica manager - while Root I/O accesses grid resources)

11 POOL off the Grid
(diagram: the same experiment framework and user application on a disconnected laptop - the file catalog uses XML or MySQL, collections use MySQL or Root I/O, and data access stays local via Root I/O)

12 HEP Data Models
- HEP data models are complex!
  - typically hundreds of structure types (classes)
  - many relations between them
  - different access patterns
- LHC experiments rely on OO technology
  - OO applications deal with networks of C++ objects
  - pointers (or references) are used to describe relations
(diagram: Event → Tracker/Calor. → TrackList → Track → HitList → Hit)
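As a concrete illustration, such a network could be declared as below - a hypothetical event model matching the diagram, not actual experiment code:

#include <vector>

// Hypothetical classes matching the diagram: an Event points to detector
// objects, which hold lists of reconstructed objects, forming a network.
struct Hit   { float charge = 0.f; };
struct Track { std::vector<Hit*> hits; };            // its HitList
struct Tracker { std::vector<Track*> trackList; };   // its TrackList
struct Calorimeter { /* calorimeter data */ };
struct Event {
    Tracker*     tracker = nullptr;
    Calorimeter* calor   = nullptr;
};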

13 POOL Storage Hierarchy
- A simple and generic model exposed to the application
- An application may access databases (e.g. ROOT files) from a set of catalogs
- Each database has containers of one specific technology (e.g. ROOT trees)
- Smart pointers are used to
  - transparently load objects into a client-side cache
  - define object associations across file or technology boundaries

14 Navigational Access
- Internally: a unique Object Identifier (OID) per object, composed of DatabaseID, ContainerID and ObjectID
  - identity == physical location
- Direct access to any object in the distributed store
- A natural extension of the pointer concept
- OIDs make it possible to implement networks of persistent objects ("associations")
  - association cardinality: 1:1, 1:n, n:m
- OIDs are used in C++ via so-called "smart pointers"
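A minimal sketch of the identifier triple (field names and widths are illustrative assumptions, not POOL's actual token layout):

#include <cstdint>

// Sketch: the object's identity doubles as its physical location.
struct OID {
    std::uint32_t databaseID;   // which database/file (resolved via the file catalog)
    std::uint32_t containerID;  // which container inside that database
    std::uint32_t objectID;     // which entry inside that container
};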

15 How do "smart pointers" work?
- Ref<Track> is a smart pointer to a Track
- Refs are the most visible part of the POOL C++ API
- The storage system automatically locates objects in the store as they are accessed, and reads them
  - transparent to the user/application programmer
- The user does not need to know about physical locations: no host or file names in the code
- Allows decoupling of the logical and physical model
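A minimal sketch of the load-on-first-dereference idea, under simplified assumptions (the stub readObject and the OID layout are hypothetical; the real pool::Ref interface is richer):

#include <cstdint>

struct OID { std::uint32_t databaseID, containerID, objectID; };

// Stub standing in for the storage service: resolves an OID, streams the
// object in from its file if needed, and returns a transient pointer.
void* readObject(const OID& oid);

// Sketch of a lazily loading smart pointer: the object is located and
// read on first dereference, so no file or host names appear in user code.
template <class T>
class Ref {
    OID        m_oid{};
    mutable T* m_ptr = nullptr;   // transient pointer, filled on demand
public:
    Ref() = default;
    explicit Ref(const OID& oid) : m_oid(oid) {}
    T* operator->() const {
        if (!m_ptr)               // not yet in the client-side cache?
            m_ptr = static_cast<T*>(readObject(m_oid));
        return m_ptr;
    }
};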

16 A Code Example
(diagram as on slide 12: Event → Tracker → TrackList → Track → HitList → Hit)

Collection<Event> events;          // an event collection
Collection<Event>::iterator evt;   // a collection iterator
// loop over all events in the input collection
for (evt = events.begin(); evt != events.end(); ++evt) {
    // access the first track in the track list
    Ref<Track> aTrack = evt->tracker->trackList[0];
    // print the charge of all its hits
    for (int i = 0; i < aTrack->hits.size(); i++)
        cout << aTrack->hits[i]->charge << endl;
}

17 Physical and Logical Data Model
- It is important to clearly separate the two models
- Avoids polluting user code with unnecessary detail about file and host names
- Leaves the physical model free to optimise for performance or changing access patterns without affecting existing applications
(diagram: the file catalog connects the logical data model to the physical one)

18 POOL and its Environment
- POOL is mainly used from experiment frameworks, mostly as a client library loaded from the user application
  - production manager: creates and maintains shared file catalogs and (event) collections
  - end user: uses shared or private collections
- POOL applications are grid-aware via the Replica Location Service (RLS)
  - file location and meta data queries are forwarded to Grid services
- The POOL storage manager provides access to local or remote files via Root I/O (e.g. RFIO/dCache), possibly later replaced by the Grid File Access Library (GFAL)
(diagram: a POOL client on a CPU node - the experiment framework and user application use POOL against experiment RDBMS services for collection description and access, book-keeping and production workflow; grid file services for replica location and file description; and remote file I/O for data access)

19 Storage Components
(diagram: the storage manager mediates between the client's data cache, holding object pointers, and disk storage; a persistent address consists of technology, database name, container name and item ID; on disk, a database holds containers, which hold objects)

20 Mapping to Technologies
- Identify commonalities and differences between technologies
- The model adapts to any storage technology with direct access
  - the record identifier should be known before flushing to disk (see the sketch below)
- Use of the RDBMS is rather conventional
  - no special object support required
  - the primary key must be uniquely determined before writing
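The reason the identifier must be known before flushing: when an object holding a reference is streamed, the target's identifier is written into its persistent form, so targets need identifiers before anything referencing them goes to disk. A small sketch of a container honouring that constraint (all names hypothetical):

#include <cstdint>
#include <map>
#include <vector>

// Sketch: identifiers are reserved before any write, so other objects
// can embed a reference to this entry while it is still being built.
using Blob = std::vector<std::uint8_t>;   // a serialised object

class Container {
    std::uint32_t m_next = 0;
    std::map<std::uint32_t, Blob> m_pending;
public:
    std::uint32_t reserveId() { return m_next++; }   // id known up front
    void write(std::uint32_t id, Blob b) { m_pending[id] = std::move(b); }
    void flush() { /* stream m_pending to disk */ m_pending.clear(); }
};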

21 Technology-Independent Type Dictionary
- To store an in-memory object on disk, one needs a precise description of the class layout
- Logical type info
  - name and type of all data attributes
  - inheritance hierarchy
  - logical == independent of platform and compiler
- Physical type info
  - offset, endianness and size of all data attributes
  - base classes are not much more than embedded data members of that type
  - depends on CPU, compiler, compiler flags (e.g. padding)
- POOL needs only the data attributes and the dynamic type to persist an object
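A sketch of what such a dictionary entry might hold, separating the logical from the physical information (illustrative structures, not the LCG dictionary's actual classes):

#include <cstddef>
#include <string>
#include <vector>

// Illustrative dictionary entry: logical info is portable, physical
// info is specific to one platform/compiler combination.
struct AttributeInfo {
    std::string name;       // logical: attribute name
    std::string typeName;   // logical: attribute type
    std::size_t offset;     // physical: byte offset inside the object
    std::size_t size;       // physical: size with this compiler's padding
};

struct ClassInfo {
    std::string className;                  // the dynamic type
    std::vector<std::string> baseClasses;   // inheritance hierarchy
    std::vector<AttributeInfo> attributes;  // all POOL needs to persist
};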

22 Dictionary: Population/Conversion
(diagram: header files are fed both to ROOTCINT, producing CINT dictionary code, and via GCC-XML and a code generator to LCG dictionary code; this technology-dependent dictionary generation populates the CINT dictionary, used for data I/O, and the LCG dictionary, used for reflection by other clients, with a gateway between the two)

23 Client Access via Object Caches
- Clients access data through references (Ref<T>)
- A data service manages each object cache
- Separate data services with separate object caches serve different contexts: event data, detector data, other
(diagram: clients hold Ref<T>s into object caches, each managed by its own data service)

24 POOL Cache Access
(diagram: a Ref<T> holds a key into the cache manager, which maps keys to pointers for loaded objects; behind it, the persistency manager resolves a token - storage technology, object type, persistent location - to fetch objects that are not yet cached)

25 Reducing Storage Overhead - The Link Table
(diagram: each file carries a local lookup table mapping a small link ID to the link info - database name, container name, ...; a persistent reference stored with an object then only needs a link ID and an entry ID instead of a full OID)
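A sketch of the idea (names are illustrative): the long, repetitive database and container names are factored out into one per-file table, so each stored reference shrinks to two small numbers.

#include <cstdint>
#include <map>
#include <string>

// Illustrative link-table layout: names factored out of every reference.
struct LinkInfo {
    std::string dbName;     // target database, e.g. the file's FileID
    std::string contName;   // target container name
};

using LinkTable = std::map<std::uint32_t, LinkInfo>;  // one table per file

// What is actually stored with each object in that file:
struct StoredRef {
    std::uint32_t linkID;   // index into the file-local link table
    std::uint32_t entryID;  // entry within the target container
};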

26 POOL File Catalog
- Files are referred to inside POOL via a unique and immutable file identifier (FileID), generated at creation time
- POOL added the system-generated FileID to the standard Grid m-n mapping between logical file names (LFNs) and physical file names (PFNs)
  - a stable inter-file reference
- Optionally, file metadata attributes (jobid, owner, ...) are associated with the FileID, used for fast query-based catalog manipulation; they are not used internally by POOL
- The FileID-LFN mapping is supported but not used internally
- The FileID-PFN mapping is sufficient for object lookup
(diagram: logical naming maps LFNs to the FileID; object lookup maps the FileID to PFNs together with their technology)
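In interface terms, the catalog boils down to a few mappings keyed by the FileID. A hedged sketch (an illustrative interface, not POOL's actual catalog API):

#include <string>
#include <vector>

// Illustrative catalog interface: everything is keyed by the FileID.
class FileCatalog {
public:
    virtual ~FileCatalog() = default;
    // FileID -> physical file name: all POOL needs for object lookup.
    virtual std::string lookupPFN(const std::string& fileID) = 0;
    // Optional logical names attached to the same FileID.
    virtual std::vector<std::string> lookupLFNs(const std::string& fileID) = 0;
    // Register a replica together with its storage technology.
    virtual void registerPFN(const std::string& fileID,
                             const std::string& pfn,
                             const std::string& technology) = 0;
};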

27 GUIDs as File Identifiers
- A Globally Unique Identifier (GUID) implementation for the FileID allows the production of consistent sets of files with internal references, without requiring a central ID allocation service
  - jobs can execute in complete isolation (as in e.g. CMS "winter mode")
- Catalog fragments can be used asynchronously as a means of propagating catalog or meta data changes through the system
  - without losing the internal object references
  - without risking a clash with any already existing objects
  - without the need to rewrite any existing data files
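What makes this work is that a 128-bit identifier drawn from a good entropy source can be generated on any node with negligible collision probability. A simplified sketch (random bytes only, without the version/variant bits a standards-compliant GUID would set):

#include <cstdint>
#include <cstdio>
#include <random>
#include <string>

// Sketch: generate a 128-bit identifier locally, with no central service.
std::string newGuid() {
    std::random_device rd;                       // entropy source
    std::uint64_t hi = (std::uint64_t(rd()) << 32) | rd();
    std::uint64_t lo = (std::uint64_t(rd()) << 32) | rd();
    char buf[37];                                // 36 chars + NUL
    std::snprintf(buf, sizeof buf, "%08X-%04X-%04X-%04X-%012llX",
                  unsigned(hi >> 32), unsigned(hi >> 16) & 0xFFFFu,
                  unsigned(hi) & 0xFFFFu, unsigned(lo >> 48) & 0xFFFFu,
                  (unsigned long long)(lo & 0xFFFFFFFFFFFFull));
    return buf;
}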

28 Catalog Implementations
- XML catalog
  - typically used as a local file by a single user/process at a time; no network needed
  - supports read-only operations via HTTP
  - tested up to 50K entries
- Native MySQL catalog
  - shared catalog, e.g. in a production LAN
  - handles multiple users and jobs (multi-threaded)
  - tested up to 1M entries
- EDG-RLS catalog
  - for grid-aware applications
  - production service based on Oracle (from CERN IT/DB), already in use for POOL V1.0 (July '03)
  - Oracle iAS or Tomcat + Oracle/MySQL backend

29 Catalog Queries
- Application-level meta data can be associated with catalogued files
- Meta data are normally used to query large file catalogs, e.g. to extract catalog subsets based on production parameters:
  - "jobid='cmssim'", "owner like '%wild%'"
  - "pfname like 'path' and owner = 'atlasprod'"
- Catalog fragments including meta data can be shipped; cross-catalog operations between different catalog implementations are supported
- Queries are evaluated on the client side for XML and EDG, and on the server side for MySQL
- EDG RLS enhancement requested: server-side evaluation of numerical query expressions
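For illustration, issuing such a query from code might look as follows; queryCatalog is a hypothetical helper standing in for the catalog's query facility, not an actual POOL call:

#include <iostream>
#include <string>
#include <vector>

// Hypothetical helper: evaluate a metadata predicate against the catalog
// and return the matching physical file names.
std::vector<std::string> queryCatalog(const std::string& whereClause);

int main() {
    // Extract a catalog subset based on production parameters.
    for (const std::string& pfn :
         queryCatalog("jobid='cmssim' and owner like '%wild%'"))
        std::cout << pfn << '\n';
}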

30 Use case: isolated system
- The user extracts a set of interesting files, and a catalog fragment describing them, from a (central) Grid-based catalog into a local XML catalog
  - the selection is performed based on file or collection descriptions
- After disconnecting from the Grid, the user executes some standard jobs navigating through the extracted data
  - new output files are registered in the local XML catalog
- Once the new data is ready for publishing and the user is connected again, the new catalog fragment is submitted to the Grid-based catalog
(diagram: import from EDG-RLS into a local XML catalog; look up input files and register output files with no network; then publish back)

31 Use case: farm production
- A production job runs and creates files and their catalog entries in a local XML file
- During the production the catalog can be used to clean up files
- Once the data quality checks have been passed, the production manager decides to publish the production XML catalog fragment to the site database and eventually to the Grid-based catalog
(diagram: worker nodes such as lx1.cern.ch and pc3.cern.ch run jobs against local XML catalogs - lookup, register, quality check - then publish to the site RDBMS and on to EDG)

32 CMS & POOL
- Fruitful collaboration with the POOL team since inception
  - 2.6 FTE direct participation
- Efficient communication
  - Savannah portal
  - direct mail and phone exchange among developers
  - in-person meetings when required
- Continuous and prompt feedback
  - CMS typically provides feedback on any new pre-release within a few hours
  - POOL typically responds to bug reports within 24-48 hours; only a few took more than a week to be fixed in a new pre-release
(Review response slides from the experiments.)

33 CMS Current Status
- First public release of CMS products (COBRA etc.) based on POOL 1.3.x available on 30 Sep 2003
- Used in production, deployed to physicists
  - e.g. 2 million events produced with OSCAR (G4 simulation) in a weekend
- New tutorials (just started) based on software released against POOL 1.3.3
- Essentially the same functionality as the Objectivity-based code, but:
  - no concurrent update of databases
  - no direct connection to a central database while running
  - remote access limited to RFIO or dCache
  - no schema evolution
- Still a few bugs, missing features and performance problems
  - affect more complex use cases
  - impact the deployment to a large developer/user community

34 LHCb Strategy
- Adiabatic adaptation of Gaudi to SEAL/POOL
  - slow integration according to available manpower (1 year for full migration)
  - take advantage of the face-lifting of "bad" interfaces and implementations
  - minimal changes to interfaces visible to physicists
- Integration of SEAL in steps
  - dictionary integration and plug-in manager first
  - use of SEAL services later
- Integration of POOL earlier than SEAL
  - working prototype by the end of the year
  - necessity to read "old" ROOT data
  - keep the LHCb event model unchanged
- Full LHCb data model in POOL: complete write/read of an LHCb event from GAUDI achieved in December

35 ATLAS SEAL/POOL Status
- Required functionality is essentially in place
  - the main outstanding problem is the CLHEP matrix class
  - ATLAS has not yet tested the EDG RLS-based file catalog
- POOL explicit and implicit collections are available via Athena, but fuller integration will require extensions from POOL (and refactoring on the ATLAS side)
- We're moving from an expert to a general developer environment
- The event data model is being filled out
  - goal is to have it essentially complete by the end of the year, although parts of the transient model are still being redesigned
- Python scripting support is being incorporated now
  - PyROOT, PyLCGDict, GaudiPython, etc.

36 Summary
- The LCG POOL project provides a hybrid store integrating object streaming (e.g. Root I/O) with RDBMS technology (e.g. MySQL) for consistent meta data handling
- Strong emphasis on component decoupling and well-defined communication/dependencies
- Transparent cross-file and cross-technology object navigation via C++ smart pointers
- Integration with Grid-wide file catalogs (e.g. EDG-RLS), while preserving networked and grid-decoupled working modes
- Decoupling of logical and physical data model: no file or host names in user code
- POOL has been integrated into the LHC experiments' software frameworks and is used for production activities in CMS
  - persistency mechanism for the CMS, ATLAS and LHCb data challenges

37 How to find out more about POOL?
- POOL home page
- POOL Savannah web portal (bug reports, CVS)
- POOL binary distribution (provided by LCG-SPI)

38 The end. Thank you!

