
1 Central Reconstruction System on the RHIC Linux Farm in Brookhaven Laboratory HEPIX - BNL October 19, 2004 Tomasz Wlodek - BNL

2 Background Brookhaven National Laboratory (BNL) is a multi-disciplinary research laboratory funded by the US government. BNL is the site of the Relativistic Heavy Ion Collider (RHIC) and four of its experiments: Brahms, Phenix, Phobos and Star. The RHIC Computing Facility (RCF) was formed in the mid-1990s to address the computing needs of the RHIC experiments.

3 Background (cont.) The data collected by the RHIC experiments is written to tape in the HPSS mass storage facility, to be reprocessed at a later time. To automate the process of data reconstruction, a home-grown batch system ("Old CRS") was developed. "Old CRS" manages data staging from and to the HPSS system, schedules jobs and monitors their execution. Due to the growth of the farm, the "old CRS" system no longer scales well and needs to be replaced.

4 RCF facility Server racks HPSS storage

5 The RCF Farm Batch Systems - present Data Analysis farm (LSF batch system) Reconstruction farm (CRS system) Berlin wall

6 New CRS - requirements Stage input files from HPSS, Unix (and, in the future, Grid) storage. Stage output files to HPSS, Unix (and, in the future, Grid) storage. Capable of running mass production with a high degree of automation. Error diagnostics. Bookkeeping.

7 Condor as the scheduler of the new CRS system Condor comes with DAGMan, a meta-scheduler for building "graphs" of interdependent batch jobs. DAGMan lets us construct jobs consisting of several subjobs that perform the data staging operations and the data reconstruction separately, which in turn allows us to optimize the staging of data tapes to minimize the number of tape mounts.
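The tape-mount optimization alluded to above can be sketched as grouping pending stage requests by the tape each file lives on, so every tape is mounted once per batch. A minimal sketch, assuming a hypothetical file-to-tape mapping in place of a real HPSS metadata query:

```python
from itertools import groupby

def plan_staging(files):
    """Group stage requests by tape so each tape is mounted only once.

    `files` is a list of (filename, tape_id) pairs; the tape_id is a
    hypothetical stand-in for an HPSS metadata lookup.
    """
    ordered = sorted(files, key=lambda f: f[1])
    return [(tape, [name for name, _ in group])
            for tape, group in groupby(ordered, key=lambda f: f[1])]

requests = [("run1.raw", "T07"), ("run2.raw", "T03"), ("run3.raw", "T07")]
# Two mounts (T03, T07) instead of up to three.
print(plan_staging(requests))
```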

8 New CRS batch system CRS job HPSS interface MySQL Logbook server HPSS

9 Anatomy of a CRS job Parent jobs (1 per input file) Main job Each CRS job consists of several subjobs. Parent jobs (one per input file) are responsible for locating input data and, if necessary, staging it from tape to the HPSS disk cache. The main job is responsible for the actual data reconstruction and is executed if and only if all parent jobs completed successfully.
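The per-file structure above maps directly onto a DAGMan description: one staging parent per input file, all feeding a single reconstruction node. A sketch of how such a DAG might be generated; the job names and submit-file names are illustrative, not the actual CRS conventions:

```python
def build_dag(input_files):
    """Emit a Condor DAGMan description with one staging parent per
    input file and a single reconstruction child (names illustrative)."""
    lines, parents = [], []
    for i, _ in enumerate(input_files):
        name = f"stage{i}"
        parents.append(name)
        lines.append(f"JOB {name} stage_{i}.submit")
    lines.append("JOB reco reco.submit")
    # The main job runs only after every parent has completed successfully.
    lines.append(f"PARENT {' '.join(parents)} CHILD reco")
    return "\n".join(lines)

print(build_dag(["run1.raw", "run2.raw"]))
```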

10 Lifecycle of the Main Job Did all parent jobs complete successfully? If yes: import input files from NFS or HPSS to local disk, run the user's executable, check the exit code and that all required output files are present, then export the output files. If not: perform error diagnostics and recovery, and update the job databases. At all stages of execution, keep track of the production status and update the job/file databases.
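The main-job lifecycle above can be sketched as a single guarded pipeline. A minimal sketch, assuming illustrative paths and status strings (the real CRS bookkeeping calls are omitted):

```python
import os
import shutil
import subprocess

def run_main_job(parents_ok, inputs, workdir, executable, outputs, export_dir):
    """Sketch of the main-job lifecycle: guard on parent jobs, import
    inputs, run the user's executable, verify outputs, export them."""
    if not parents_ok:                               # diagnostics/recovery path
        return "error: parent jobs failed"
    for f in inputs:                                 # import from NFS/HPSS cache
        shutil.copy(f, workdir)
    rc = subprocess.call([executable], cwd=workdir)  # run reconstruction
    if rc != 0:
        return f"error: exit code {rc}"
    missing = [f for f in outputs
               if not os.path.exists(os.path.join(workdir, f))]
    if missing:                                      # required files present?
        return f"error: missing outputs {missing}"
    for f in outputs:                                # export results
        shutil.copy(os.path.join(workdir, f), export_dir)
    return "done"
```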

11 Lifecycle of the HPSS interface subjob Check if HPSS is available, then submit a "stage file" request. If staging succeeds: notify the CRS system and update the databases. If not: notify the system about the error and perform error diagnostics (for example: can resubmitting the request help?).
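The "can resubmitting help?" branch above is essentially a retry loop around the stage request. A sketch under the assumption that transient HPSS errors surface as exceptions; `submit_request` is a hypothetical stand-in for the real HPSS client call:

```python
import time

def stage_file(path, submit_request, max_retries=3, delay=0.0):
    """Sketch of the HPSS-interface subjob: submit a stage request and,
    on a transient failure, resubmit a bounded number of times."""
    for attempt in range(1, max_retries + 1):
        try:
            submit_request(path)
            return ("staged", attempt)   # notify CRS, update databases
        except IOError:                  # transient error: pause, resubmit
            time.sleep(delay)
    return ("failed", max_retries)       # notify the system about the error
```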

12 Logbook manager Each job should provide its own logfile for monitoring and debugging purposes. Keeping a logfile for an individual job running on a single machine is easy, but CRS jobs consist of several subjobs which run independently, on different machines and at different times. To synchronize the record keeping, a dedicated logbook manager is needed, responsible for compiling the reports from individual subjobs into one human-readable logfile.
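The compilation step the slide describes amounts to merging per-subjob records into one chronologically ordered file. A minimal sketch, assuming each subjob report is a list of (timestamp, subjob, message) tuples (the record format is illustrative):

```python
def compile_logbook(reports):
    """Merge log records from independently running subjobs into one
    chronologically ordered, human-readable logfile."""
    merged = sorted((rec for report in reports for rec in report),
                    key=lambda rec: rec[0])
    return "\n".join(f"[{t}] {subjob}: {msg}" for t, subjob, msg in merged)

parent_log = [(2, "parent0", "input file staged")]
main_log = [(1, "main", "waiting for parents"),
            (3, "main", "reconstruction done")]
print(compile_logbook([parent_log, main_log]))
```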

13 CRS Database To keep track of the production, CRS is interfaced to a MySQL database. The database stores the status of each job and subjob, the status of data files, HPSS staging requests, open data transfer links, and statistics of completed jobs.
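The tracking tables above might look like the following sketch, shown with SQLite in place of MySQL for self-containment; the table and column names are illustrative, not the actual CRS schema:

```python
import sqlite3

# Illustrative job-tracking schema (SQLite stand-in for the MySQL database).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE jobs    (id INTEGER PRIMARY KEY, status TEXT);
    CREATE TABLE subjobs (id INTEGER PRIMARY KEY, job_id INTEGER,
                          kind TEXT, status TEXT);
    CREATE TABLE files   (path TEXT PRIMARY KEY, state TEXT);
""")
db.execute("INSERT INTO jobs VALUES (1, 'running')")
db.execute("INSERT INTO subjobs VALUES (1, 1, 'parent', 'staging')")
db.execute("INSERT INTO files VALUES ('run1.raw', 'on-tape')")

# A monitoring query: how many subjobs of each kind are in each state?
for row in db.execute("SELECT kind, status, COUNT(*) FROM subjobs "
                      "GROUP BY kind, status"):
    print(row)
```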

14 CRS control panel

15 The RCF Farm – future. Reconstruction farm (Condor system) Analysis farm (Condor system)

16 General Remarks

17 Condor as batch system for HEP Condor has several features (for example, DAGs) which make it very useful as a batch system. However, submitting a very large number of Condor jobs (O(10k)) from a single node can put a very high load on the submit machine, leading to a potential Condor meltdown. This is not a problem when many users submit jobs from many machines, but for centralized services (data reconstruction, MC production, ...) it can become a very difficult issue.

18 Status of Condor – cntd. People from the Condor team were extremely helpful with solving Condor problems when they were on site at BNL. Remote support (by e-mail) is slow.

19 Management barrier So far, HEP experiments have been relatively small (approx. 500 people) by business standards: everybody knows everybody. This is going to change – the next generation of experiments will have thousands of physicists.

20 Management barrier – cntd. In such big communities it will become increasingly hard to introduce new software products. Users have an (understandable) tendency to be conservative and do not want changes in the environment in which they work. Convincing users to switch to a new product and aiding them in the transition will become a much more complex task than inventing the product itself!

