Marianne Bargiotti, BK Workshop – CERN, 6/12/2007
1 Bookkeeping Meta Data Catalogue: present status
Marianne Bargiotti, CERN
2 Outline
BK overview
Logical data model and DB schema
BK services and User Interface
Conclusions
Appendix A, B
3 LHCb Bookkeeping Meta Data Catalogue
The Bookkeeping (BK) is the AMGA*-based system that manages the file metadata of LHCb data files. It contains information about jobs, files and their relations:
Job: application name, application version, application parameters, which files it generated, etc.
File: size, events, filename, which job generated it, etc.
The Bookkeeping DB is the main gateway through which users select the available data and datasets. Three main services are available:
Booking service: writes data to the bookkeeping.
Servlets service: web browsing of the BK and selection of data files.
AMGA server: for remote application use.
*: AMGA is the ARDA implementation of the ARDA/gLite Metadata Catalog Interface (see http://amga.web.cern.ch).
4 Logical data model
The AMGA schema shows how the information is logically grouped:
The logical model is built around two main entities, Jobs and Files, related by input/output: a job can take one or more files as input and usually produces more than one file (a data file plus a couple of log files).
Around these two entities there is a set of satellite entities (Fileparams, Jobparams, etc.) that hold extra information.
One or more attributes are associated with each entity; for instance, LFN is an attribute of Files and Program Name is an attribute of Jobparams.
The entities in the AMGA logical model are directories (see Appendix A).
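The Jobs/Files relation described above can be sketched with plain Python classes. The class and field names below are illustrative, not the actual AMGA schema:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class File:
    lfn: str                               # Logical File Name (attribute of Files)
    produced_by: Optional["Job"] = None    # the job that generated this file

@dataclass
class Job:
    name: str                              # e.g. the program name (a Jobparams attribute)
    inputs: List[File] = field(default_factory=list)
    outputs: List[File] = field(default_factory=list)

    def produce(self, lfn: str) -> File:
        # a job can produce more than one file (a data file plus log files)
        f = File(lfn=lfn, produced_by=self)
        self.outputs.append(f)
        return f

# one input file, several output files
raw = File(lfn="/lhcb/data/raw_001.raw")
job = Job(name="Brunel", inputs=[raw])
dst = job.produce("/lhcb/data/rec_001.dst")
log = job.produce("/lhcb/data/rec_001.log")
```

The input/output relation is carried in both directions: a file knows the job that produced it, and a job lists the files it read and wrote.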
5 DB schema
The database tables are logically grouped in two (plus one) sets based on their functionality:
The Warehouse tables: each AMGA directory/entity has an associated table in the database (see Appendix A).
The Views: the views summarise the information stored in the Warehouse database to best suit the physicists' queries, providing good performance. Most important are the roottree and JobFileInfo views: each row in roottree summarises the attributes associated with the data files stored in one JobFileInfoXXX table (there are as many JobFileInfoXXX tables as there are entries in roottree).
The Auxiliary tables: the process that elaborates the Warehouse data to create or update the Views makes use of auxiliary tables (see Appendix B).
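The warehouse-to-view step can be pictured as a simple aggregation: the view pre-computes, per configuration, the summary that a physicist's query would otherwise have to derive by scanning warehouse tables. The column names below are illustrative, not the real schema:

```python
from collections import defaultdict

def build_view(warehouse_rows):
    """Summarise warehouse file rows into one view entry per configuration.

    Each warehouse row here is a dict with 'config', 'eventtype' and
    'size' keys; the real schema is much richer, this is only a sketch.
    """
    view = defaultdict(lambda: {"files": 0, "bytes": 0, "eventtypes": set()})
    for row in warehouse_rows:
        entry = view[row["config"]]
        entry["files"] += 1
        entry["bytes"] += row["size"]
        entry["eventtypes"].add(row["eventtype"])
    return dict(view)

rows = [
    {"config": "DC06", "eventtype": 10000000, "size": 100},
    {"config": "DC06", "eventtype": 10000000, "size": 200},
    {"config": "DC04", "eventtype": 30000000, "size": 50},
]
view = build_view(rows)
```

A query against the view then reads one pre-aggregated row instead of joining and scanning the warehouse, which is where the performance gain comes from.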
6 BK service
The bookkeeping service is made up of one application, BkkManager, and two sub-services, BkkReceiver and Tomcat, plus two satellite services tightly related to the bookkeeping: FileCatalog and BkkMonitor.
All these services are currently deployed on volhcb01.cern.ch.
7 Booking of data
The booking of data is how the information about jobs and files reaches the bookkeeping and how it is registered in the database.
The information about jobs and files is sent in XML format and stored in files.
Two central services are involved: BkkReceiver and BkkManager.
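An XML job report of the kind described above might be built as follows. The tag and attribute names are illustrative stand-ins; the real ones are defined by the Book.xml DTD:

```python
import xml.etree.ElementTree as ET

# build a minimal job record: one job, one parameter, one output file
job = ET.Element("Job", {"ConfigName": "DC06", "ConfigVersion": "v1"})
ET.SubElement(job, "TypedParameter", {"Name": "ProgramName", "Value": "Gauss"})
out = ET.SubElement(job, "OutputFile", {"Name": "/lhcb/mc/sim_001.sim"})
ET.SubElement(out, "Parameter", {"Name": "FileSize", "Value": "12345"})

xml_text = ET.tostring(job, encoding="unicode")
```

This serialised text is what a job would send to BkkReceiver, which stores it as a file for later processing.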
8 BkkReceiver
BkkReceiver is responsible for receiving the XML files and storing them in a directory. The directory works as a queue where files are processed in FIFO order.
The BkkReceiver service listens on port 8092 (on the deployment machine volhcb01), where jobs send the XML-formatted information about the files they have generated.
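A directory used as a FIFO queue can be sketched like this; ordering by modification time (with the filename as tie-break) is an assumption, since the slide does not specify how BkkReceiver orders the queue:

```python
import os
import tempfile

def enqueue(queue_dir, name, content):
    # each incoming XML report becomes one file in the queue directory
    with open(os.path.join(queue_dir, name), "w") as fh:
        fh.write(content)

def pop_oldest(queue_dir):
    # FIFO: take the oldest file first; filename breaks mtime ties
    paths = [os.path.join(queue_dir, f) for f in os.listdir(queue_dir)]
    if not paths:
        return None
    oldest = min(paths, key=lambda p: (os.path.getmtime(p), p))
    with open(oldest) as fh:
        content = fh.read()
    os.remove(oldest)
    return os.path.basename(oldest), content

queue = tempfile.mkdtemp()
enqueue(queue, "job1.xml", "<Job/>")
enqueue(queue, "job2.xml", "<Job/>")
first = pop_oldest(queue)
```

Using the filesystem as the queue means a crash loses nothing: unprocessed reports simply stay in the directory until the next pass.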
9 BkkManager
BkkManager is responsible for reading the XML files, checking the correctness of their format and information, and uploading the new data into the database.
Two DTD definition files are used: one defines the job and file tags (Book.xml), the other the information on replicas (Replica.dtd).
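A minimal version of the format check, using only the standard library, verifies that a report is at least well-formed XML; full validation against the DTDs (Book.xml, Replica.dtd) would need a validating parser such as lxml, which is not shown here:

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise.

    This is only the first of the checks described above; conformity
    to the DTD and consistency of the content come afterwards.
    """
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = is_well_formed("<Job><OutputFile Name='a.dst'/></Job>")
bad = is_well_formed("<Job><OutputFile></Job>")   # mismatched tags
```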
10 NewConfirm servlet
To accomplish its task, BkkManager relies on the NewConfirm servlet.
NewConfirm first checks the conformity of each XML file to its DTD, then checks that the information provided is correct. The information is inserted into the database only if all the checks pass; if any check fails, an error message is saved in a file and no information is uploaded.
11 Nightly updates
The BkkManager application selects the XML files from the queue and asks NewConfirm to book them.
Every night it backs up all the XML files that have been successfully booked and runs the view-update script.
Before extracting new XML files from the queue, it checks the errors generated while processing the previous XML files, to see whether any files need to be reprocessed.
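The cycle described above can be sketched as a pure scheduling function: previously failed reports are retried before new queue entries are extracted. The retry-first ordering is an assumption based on the wording of the slide, not a documented detail:

```python
def processing_order(failed, queue):
    """Return the order in which XML reports are handled in one cycle.

    failed: names of reports whose earlier processing produced an error
    queue:  names of reports waiting in the FIFO queue, oldest first
    """
    # retry earlier failures first, then drain the queue in FIFO order,
    # without handling the same report twice
    retries = list(failed)
    fresh = [name for name in queue if name not in failed]
    return retries + fresh

order = processing_order(["job7.xml"], ["job7.xml", "job8.xml", "job9.xml"])
```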
12 Tomcat & BkkMonitor
Tomcat is the servlet container used by the bookkeeping, listening on port 8080 on the deployment machine.
BkkMonitor is a monitoring service that watches the FileCatalog and BkkReceiver servers. It actively pings these two services at one-minute intervals. In case of problems (a service not responding):
a warning email is sent to the BK operation manager in charge;
the server is restarted.
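One cycle of such a monitor can be sketched as follows; the ping, notify and restart callables are injected stand-ins, since the slide does not describe BkkMonitor's actual mechanics:

```python
def monitor_cycle(services, ping, notify, restart):
    """Check each service once; on failure, notify the operator and restart it.

    ping/notify/restart are passed in so the policy can be tested in
    isolation; in production they would be a real network check, an
    email to the BK operation manager, and a process restart.
    """
    failed = []
    for name in services:
        if not ping(name):
            notify(name)
            restart(name)
            failed.append(name)
    return failed   # the services that had to be restarted

# a FileCatalog that does not respond triggers both reactions
restarted = monitor_cycle(
    ["FileCatalog", "BkkReceiver"],
    ping=lambda s: s != "FileCatalog",
    notify=lambda s: None,
    restart=lambda s: None,
)
```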
13 User Interface: Bookkeeping web page
The web page allows users to browse the bookkeeping contents and get information about files and their provenance. It is also used to generate the Gaudi card, the list of files to be processed by a job.
The left frame links to many browsing options: file look-up, job look-up, production look-up, BK summary.
Dataset search: retrieves a list of files based on their provenance history.
14 FileCatalog
FileCatalog is the service used by the genCatalog script and by the bookkeeping to get the Physical File Name of a file and of its ancestors. It is a frontend to the LFC and the bookkeeping database. No security is required on this service, since it provides a read-only API.
It is accessible through the web page by selecting the 'Dataset Replicated at' section: the system first looks for LFNs in the bookkeeping database and then tries to get the physical location of each of them from the LFC.
The search is expensive, so it is always done on a limited number of files (bunches of 200).
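The bunching of the LFC lookups amounts to plain list chunking; 200 is the bunch size quoted above:

```python
def bunches(lfns, size=200):
    """Split a list of LFNs into bunches so each LFC query stays cheap."""
    return [lfns[i:i + size] for i in range(0, len(lfns), size)]

# hypothetical LFNs, just to show the shape of the result
groups = bunches([f"/lhcb/data/file_{i}.dst" for i in range(450)])
```

Each bunch then becomes one LFC request, so the cost per query is bounded regardless of how many files the dataset search returned.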
15 Conclusion
Several issues have been raised.
By users:
web interface: lack of functionality in the dataset search;
Java code to be replaced with Python;
the need for a defined structure embedded in DIRAC.
With the forthcoming changes in the DB schema at data taking, a new, versatile tool able to match different requests is needed.
16 Appendix A
Description of each entity/directory:
Jobs: each job has a Configuration Name and Version plus the date of its execution; these three attributes are always present. Extra information is kept in the Jobparams and Inputfiles entities.
Jobparams: provides extra information about a job, such as the program name and version, the location where it was executed, etc. Some attributes may not be compulsory.
Inputfiles: contains the list of input files used by each job. No entries are present for jobs that did not take any input file.
Files: each file always has an associated Logical File Name, the job that generated it and the type of the file.
Fileparams: similar to Jobparams; provides extra information about files, such as the file size, file GUID, etc. Some attributes may not be compulsory.
Qualityparams: provides information on the quality of the files; it says for which group of physicists a file may be of interest.
Eventtypes: keeps information about the event types, such as their description.
Typeparams: extra information about the file type: name, description and version. The description may not be present.
17 Appendix B
Auxiliary tables:
FileSummary: contains an entry for each file with all the related information.
JobSummary: contains an entry for each job with the related information.
JobHistory: contains, for each job, the information on its immediate ancestor, if any.
JobHistory2Level: contains, for each job, the information on its second-degree ancestor, if any.
Summary: contains an entry for each possible n-tuple (eventtype, config, filetype, dbversion, program0, inputfile1, program1, inputfile2, program2).
Jobs_FileSummary: the join of FileSummary and JobSummary on the job_id column.
18 AMGA server
There is no direct access to the DB: all access goes through the AMGA server, which takes care of contacting the DB and serving the information to the client.
The AMGA server comes with client APIs for C++, Python, Java and Perl.